Printer-styled range in Python:
a case study
Recently, I had to convert a printer-styled range string into the corresponding list of selected elements. This made for a nice case study on Python class properties, iterability, length, and attribute/method privacy.
A feature request
So, there’s this script of mine that some colleagues use to convert microscopy images
from the microscope vendor’s proprietary format (we have a Nikon one in the lab, so some
nd2
files) to an open source one (.tif
in this case).
Recently, I have been asked to add a feature to this script. They would like to be able to convert only some fields of view. To do this, I decided to use a range string format that is ubiquitous, OS-independent, and widely used everyday: the printer-styled range string.
This is as simple as having the pages comma-separated. Let’s say, if I wanted to print
pages number 1, 5, and 6, I would write 1,5,6
. Also, page ranges are allowed by using
the dash (-
): to print page from 3 to 6 (both included), I would write 3-6
. And the
two things can be combined into 1,3-6
.
A Python3.6+ class
So, I needed something to validate, parse, and convert a printer-styled range string into
an actual series of numbers. The desired behavior would be to convert 1,3-6
into
[0,2,3,4,5]
(remember, Python is a 0-indexed language).
To do this, I implemented a Python3.6+ class, MultiRange
, that I then published as a
Gist on GitHub (see below). I will break it down and explain it bit by bit below.
Dependencies
The class has only two dependencies from the Python standard libraries:
re
provides regular expression related features. We will use it to validate the printer-style range string.typing
is used to provide type hints and help developers.
The class attributes
- The
__string_range
attribute will contain the printer-styled range string that we want to convert into a list of numbers. For example:1,3-6
. __extremes_list
will contain a list of tuples. Each tuple will have two elements: the start and stop of a range/slice. The idea is to convert__string_range
into something like[(1,1),(3,6)]
.__reg
contains the validation regular expression. During class instantiation, we will try to match the input string to this. If no match is found, the class raises an error. Basically, we allow only strings including digits, commas, dashes, and spaces.- I will talk about
__length
and__ready
in the length section below.
As you might have noticed, all attributes start with two underscores __
. In Python
(according to PEP8), this makes them ‘private’, i.e., inaccessible from outside the
class through name mangling.
The __init__
method
The __init__
method creates an instance of the class and takes the printer-style range
string as input (s
). The first thing we want to do is validate the input by checking it
against __reg
, and stop otherwise by raising an AssertError
.
After that, we store the input string in __string_range
and split it into elements by
using the commas as delimiters and removing any terminal white spaces.
Each element is now a string containing either a single page ("1"
) or a page range
("3-6"
). We identify these two cases by splitting each element by the dashes, and
counting the number of generated elements. In the case of a single page, we convert it to
a number (int
) and store it into __extremes_list
as (1,1)
. In the case of a page
range, we save it as (3,6)
.
Then, we sort the list of tuples we created by their first elements, and clean them up. Basically, this cleaning operation is meant to avoid overlaps between tuples (see the next section for more details).
Finally, we verify that the first tuple does not start with a 0 (page numbers start
from 1) and tell the class that it is __ready
to be used.
Cleaning the list of extremes
After converting the input string into a list of number pairs (as tuples), each
representing a range of pages from the first number to the second one, both included.
You can imagine any of these pairs as a (start,end)
element.
Before proceeding, we want to be sure that every page number will be provided only
once. In other words, all these ranges should not overlap. We achieve this by iterating
through the the list of (start,end)
elements and comparing each pair with the next pair
and see if they overlap. For this to work, it is crucial to have the list of pairs sorted
by their start element (which we did in the __init__
function).
Now, when are two pairs overlapping? Given that the first pair (A
) has a start (A[0]
)
that is always lower than or equal to the start of the second one B[0]
, this can happen
in two ways:
- The second pair is fully included in the first. In other words, the first pair ends
after or precisely when the second one ends:
A[1] >= B[1]
. - The two pairs partially overlap. The first pair ends after or precisely when the second
one starts (
A[1] >= B[0]
), but the first pair ends before the second one ends (A[1] < B[1]
).
It is important to distinguish these two scenarios because they require us to act differently:
- If the second pair is fully included in the first one, we can simply remove (pop) it from our list of ranges.
- If the overlap is only partial, we need to merge the two ranges. We can achieve this
by removing (popping) the second one, and replacing the end position of the first one
with the end of the second. Basically, we remove both pairs and place a new one:
(A[0], B[1])
.
Every time we found overlapping pairs, we resolve the conflict, and restart from the beginning of the list. Only when we reach the end of the list, we can say that it is indeed clean and ready to be used.
The MultiRange length
We want to be able to know the length of our MultiRange instance. But given that this
requires some computation, we want to (1) calculate it only if the MultiRange is
__ready
to be used, (2) calculate it once, (3) store it somewhere (self.__length
),
(4) read the stored value at any future call of the len
function.
To provide a hook for the len
function and make the len(MultiRange("1,2-6"))
code
work, we need to define a __len__
method. The method returns an integer number: the one
we will store in __length
.
We do not want anyone to be able to mess with the stored value, that’s why we store it in
the private attribute __length
(notice those __
at the beginning!). But we want a
user to be able to access it without having to call the len
function. To do this, we
define a class property with the @property
decorator.
Basically, every time the user calls MultiRange("1,2-6").length
it will trigger the
property. This will check if the class is ready (self.__ready
) and if no length value
has been previously stored (self.__length is None
). Only and only in this case it will
calculate the length value and store it in __length
. Otherwise, it will return the
default length value: None
.
To calculate the MultiRange length, we calculate the difference between all range extreme
pairs (a,b)
and sum 1 to each (as they should both be included): b-a+1
. Then, we sum
them all and store it in __length
!
How to iterate over the MultiRange elements?
What is an ‘iterator’?
First things first: what is an iterator? Well, in Python we can have iterable (1) and/or iterator (2) methods/classes. An iterable contains elements that can be iterated over, while an iterator is used to iterate over an iterable. According to PEP234:
- An object can be iterated over with for if it implements
__iter__()
or__getitem__()
.- An object can function as an iterator if it implements
next()
.
Just to re-iterate the concept (pun intended): if a class includes an __iter__
method
it can be iterated over, i.e., the class is iterable. If a class includes a
__next__
method, it is an iterator.
Notice that, when a class is an iterator it is, per definition, also iterable. The vice-versa is not true: not all iterables are iterators. While the wording might be confusing, the concept is quite simple. Since an iterator iterates over the elements of something, one could iterate over its own elements, which makes it an iterable too. On the other hand, one could potentially be able to iterate over the elements of an iterable object only if an iterator is available.
Making your own Python iterator
So, now that we know what an iterator is, how can we make our MultiRange
class into
one? It’s fairly simple: we implemented two methods:
__next__
is our iterator method, which will return one element of MultiRange at a time, and in order. Once the elements are over, it will raise aStopIteration
error.__iter__
is our iterable method. When called, it allows a user to iterate over the elements of the MultiRange, starting from its first. This method needs only to reset the iterable location to the first element, and return the class itself. Python will take care of the magic behind it all, and link__iter__
and__next__
together.
In __next__
we want to (1) know which element have iterated to and (2) iterate to the
next one if possible, otherwise trigger a StopIteration
. First, we store the iterator
location in __current_item
as a tuple containing two numbers: the index of the current
range and the index of the current element in that range (starting from (0,0)
).
When we reach the last element of a range, we move to the first element of the next range. If we have reached the last element of the last range, it is time to stop.