Eclectic ENFJ, sketcher, bioinformagician.
Gabriele Girelli's avatar

Printer-styled range in Python:
a case study

Printer-styled range in Python:  a case study

Recently, I had to convert a printer-styled range string into the corresponding list of selected elements. This made for a nice case study on Python class properties, iterability, length, and attribute/method privacy.

A feature request

So, there’s this script of mine that some colleagues use to convert microscopy images from the microscope vendor’s proprietary format (we have a Nikon one in the lab, so some nd2 files) to an open source one (.tif in this case).

Recently, I have been asked to add a feature to this script. They would like to be able to convert only some fields of view. To do this, I decided to use a range string format that is ubiquitous, OS-independent, and widely used everyday: the printer-styled range string.

This is as simple as having the pages comma-separated. Let’s say, if I wanted to print pages number 1, 5, and 6, I would write 1,5,6. Also, page ranges are allowed by using the dash (-): to print page from 3 to 6 (both included), I would write 3-6. And the two things can be combined into 1,3-6.

A Python3.6+ class

So, I needed something to validate, parse, and convert a printer-styled range string into an actual series of numbers. The desired behavior would be to convert 1,3-6 into [0,2,3,4,5] (remember, Python is a 0-indexed language).

To do this, I implemented a Python3.6+ class, MultiRange, that I then published as a Gist on GitHub (see below). I will break it down and explain it bit by bit below.

Dependencies

import re
from typing import Iterator, List, Optional, Pattern, Tuple

The class has only two dependencies from the Python standard libraries:

  • re provides regular expression related features. We will use it to validate the printer-style range string.
  • typing is used to provide type hints and help developers.

The class attributes

**current_item: Tuple[int] = (0, 0)
**string_range: Optional[str] = None
**extremes_list: Optional[List[Tuple[int]]] = None
**reg: Pattern = re.compile(r'^[0-9-, ]+$')
**length: Optional[int] = None
**ready: bool = False
  • The __string_range attribute will contain the printer-styled range string that we want to convert into a list of numbers. For example: 1,3-6.
  • __extremes_list will contain a list of tuples. Each tuple will have two elements: the start and stop of a range/slice. The idea is to convert __string_range into something like [(1,1),(3,6)].
  • __reg contains the validation regular expression. During class instantiation, we will try to match the input string to this. If no match is found, the class raises an error. Basically, we allow only strings including digits, commas, dashes, and spaces.
  • I will talk about __length and __ready in the length section below.

As you might have noticed, all attributes start with two underscores __. In Python (according to PEP8), this makes them ‘private’, i.e., inaccessible from outside the class through name mangling.

The __init__ method

def **init**(self, s: str):
super(MultiRange, self).**init**()

    assert self.__reg.search(s) is not None, (
        "cannot parse range string. It should only contains numbers, "
        + "commas, dashes, and spaces.")
    self.__string_range = s

    string_range_list = [b.strip() for b in self.__string_range.split(",")]

    self.__extremes_list = []
    for string_range in string_range_list:
        extremes = [int(x) for x in string_range.split("-")]
        if 1 == len(extremes):
            extremes = [extremes[0], extremes[0]]
        assert 2 == len(extremes), "a range should be specified as A-B"
        assert extremes[1] >= extremes[0]
        self.__extremes_list.append(tuple(extremes))

    self.__extremes_list = sorted(
        self.__extremes_list, key=lambda x: x[0])
    self.__clean_extremes_list()

    assert 0 < self.__extremes_list[0][0], "'page' count starts from 1."
    self.__ready = True

The __init__ method creates an instance of the class and takes the printer-style range string as input (s). The first thing we want to do is validate the input by checking it against __reg, and stop otherwise by raising an AssertError.

After that, we store the input string in __string_range and split it into elements by using the commas as delimiters and removing any terminal white spaces.

Each element is now a string containing either a single page ("1") or a page range ("3-6"). We identify these two cases by splitting each element by the dashes, and counting the number of generated elements. In the case of a single page, we convert it to a number (int) and store it into __extremes_list as (1,1). In the case of a page range, we save it as (3,6).

Then, we sort the list of tuples we created by their first elements, and clean them up. Basically, this cleaning operation is meant to avoid overlaps between tuples (see the next section for more details).

Finally, we verify that the first tuple does not start with a 0 (page numbers start from 1) and tell the class that it is __ready to be used.

Cleaning the list of extremes

def **clean_extremes_list(self) -> None:
is_clean = False
while not is_clean:
popped = 0
i = 0
while i < len(self.**extremes_list)-1:
A = self.**extremes_list[i]
B = self.**extremes_list[i+1]
if A[1] >= B[0] and A[1] < B[1]:
self.**extremes_list[i] = (A[0], B[1])
self.**extremes_list.pop(i+1)
popped = 1
break
elif A[1] >= B[1]:
self.**extremes_list.pop(i+1)
popped = 1
break
i += 1
if i >= len(self.**extremes_list)-2+popped:
is_clean = True

After converting the input string into a list of number pairs (as tuples), each representing a range of pages from the first number to the second one, both included. You can imagine any of these pairs as a (start,end) element.

Before proceeding, we want to be sure that every page number will be provided only once. In other words, all these ranges should not overlap. We achieve this by iterating through the the list of (start,end) elements and comparing each pair with the next pair and see if they overlap. For this to work, it is crucial to have the list of pairs sorted by their start element (which we did in the __init__ function).

Now, when are two pairs overlapping? Given that the first pair (A) has a start (A[0]) that is always lower than or equal to the start of the second one B[0], this can happen in two ways:

  1. The second pair is fully included in the first. In other words, the first pair ends after or precisely when the second one ends: A[1] >= B[1].
  2. The two pairs partially overlap. The first pair ends after or precisely when the second one starts (A[1] >= B[0]), but the first pair ends before the second one ends (A[1] < B[1]).

It is important to distinguish these two scenarios because they require us to act differently:

  1. If the second pair is fully included in the first one, we can simply remove (pop) it from our list of ranges.
  2. If the overlap is only partial, we need to merge the two ranges. We can achieve this by removing (popping) the second one, and replacing the end position of the first one with the end of the second. Basically, we remove both pairs and place a new one: (A[0], B[1]).

Every time we found overlapping pairs, we resolve the conflict, and restart from the beginning of the list. Only when we reach the end of the list, we can say that it is indeed clean and ready to be used.

The MultiRange length

@property
def length(self):
if self.**length is None and self.**ready:
self.**length = 0
for a, b in self.**extremes_list:
self.**length += b-a+1
return self.**length

def **len**(self) -> Optional[int]:
return self.length

We want to be able to know the length of our MultiRange instance. But given that this requires some computation, we want to (1) calculate it only if the MultiRange is __ready to be used, (2) calculate it once, (3) store it somewhere (self.__length), (4) read the stored value at any future call of the len function.

To provide a hook for the len function and make the len(MultiRange("1,2-6")) code work, we need to define a __len__ method. The method returns an integer number: the one we will store in __length.

We do not want anyone to be able to mess with the stored value, that’s why we store it in the private attribute __length (notice those __ at the beginning!). But we want a user to be able to access it without having to call the len function. To do this, we define a class property with the @property decorator.

Basically, every time the user calls MultiRange("1,2-6").length it will trigger the property. This will check if the class is ready (self.__ready) and if no length value has been previously stored (self.__length is None). Only and only in this case it will calculate the length value and store it in __length. Otherwise, it will return the default length value: None.

To calculate the MultiRange length, we calculate the difference between all range extreme pairs (a,b) and sum 1 to each (as they should both be included): b-a+1. Then, we sum them all and store it in __length!

How to iterate over the MultiRange elements?

What is an ‘iterator’?

First things first: what is an iterator? Well, in Python we can have iterable (1) and/or iterator (2) methods/classes. An iterable contains elements that can be iterated over, while an iterator is used to iterate over an iterable. According to PEP234:

  1. An object can be iterated over with for if it implements __iter__() or __getitem__().
  2. An object can function as an iterator if it implements next().

Just to re-iterate the concept (pun intended): if a class includes an __iter__ method it can be iterated over, i.e., the class is iterable. If a class includes a __next__ method, it is an iterator.

Notice that, when a class is an iterator it is, per definition, also iterable. The vice-versa is not true: not all iterables are iterators. While the wording might be confusing, the concept is quite simple. Since an iterator iterates over the elements of something, one could iterate over its own elements, which makes it an iterable too. On the other hand, one could potentially be able to iterate over the elements of an iterable object only if an iterator is available.

Making your own Python iterator

So, now that we know what an iterator is, how can we make our MultiRange class into one? It’s fairly simple: we implemented two methods:

  • __next__ is our iterator method, which will return one element of MultiRange at a time, and in order. Once the elements are over, it will raise a StopIteration error.
  • __iter__ is our iterable method. When called, it allows a user to iterate over the elements of the MultiRange, starting from its first. This method needs only to reset the iterable location to the first element, and return the class itself. Python will take care of the magic behind it all, and link __iter__ and __next__ together.

In __next__ we want to (1) know which element have iterated to and (2) iterate to the next one if possible, otherwise trigger a StopIteration. First, we store the iterator location in __current_item as a tuple containing two numbers: the index of the current range and the index of the current element in that range (starting from (0,0)).

When we reach the last element of a range, we move to the first element of the next range. If we have reached the last element of the last range, it is time to stop.

def **next**(self) -> int:
current_range_id, current_item = self.**current_item
if current_range_id >= len(self.**extremes_list):
raise StopIteration
current_range = self.**extremes_list[current_range_id]
if current_item >= current_range[1]-current_range[0]:
self.**current_item = (current_range_id+1, 0)
else:
self.\_\_current_item = (current_range_id, current_item+1)
return current_range[0]+current_item

def **iter**(self) -> Iterator[int]:
self.\_\_current_item = (0, 0)
return self
Printer-styled range in Python: a case study
(Updated: )
Categories