Printer-styled range in Python - a case study

Recently, I had to convert a printer-styled range string into the corresponding list of selected elements. An excellent case study on Python class properties, iterability, length, and attribute/method privacy.

A feature request

So, there’s this script of mine that some colleagues use to convert microscopy images from the microscope vendor’s proprietary format (we have a Nikon one in the lab, so some nd2 files) to an open-source one (.tif in this case).

Recently, a colleague asked me to add a feature to this script. They would like to be able to convert only some fields of view. To do this, I decided to use a range string format that is ubiquitous, OS-independent, and widely used every day: the printer-styled range string.

This way of selecting items is as simple as having the pages comma-separated. Let’s say, if I wanted to print the pages numbered 1, 5, and 6, then I would write 1,5,6. Also, page ranges are allowed by using the dash (-): to print page from 3 to 6 (both included), I would write 3-6. And the two things can be combined into 1,3-6.

A Python3.6+ class

So, I needed something to validate, parse, and convert a printer-styled range string into an actual series of numbers. The desired behavior would be to convert 1,3-6 into [0,2,3,4,5] (remember, Python is a 0-indexed language).

To do this, I implemented a Python3.6+ class, MultiRange, that I then published as a Gist on GitHub (see below). Here, I break it down and explain it bit by bit.

Dependencies

import re
from typing import Iterator, List, Optional, Pattern, Tuple

The class has only two dependencies from the Python standard libraries:

  • re provides regular expression related features. We use it to validate the printer-style range string.
  • typing is used to provide type hints and help developers.

The class attributes

__current_item: Tuple[int] = (0, 0)
__string_range: Optional[str] = None
__extremes_list: Optional[List[Tuple[int]]] = None
__reg: Pattern = re.compile(r'^[0-9-, ]+$')
__length: Optional[int] = None
__ready: bool = False
  • The __string_range attribute contains the printer-styled range string that we want to convert into a list of numbers. For example, 1,3-6.
  • __extremes_list contains a list of tuples. Each tuple has two elements: the start and stop of a range/slice. The idea is to convert __string_range into something like [(1,1),(3,6)].
  • __reg contains the regular expression used to validate the input string. During class instantiation, we match the input string to this. If we do not find any match, the class raises an error. In other words, we allow only strings that include digits, commas, dashes, and spaces.
  • More details about __length and __ready are available in the length section below.
  • __current_item is necessary to iterate over the indexes stored in the MultiRange, which we discuss in the last section of this article.

As you might have noticed, all attributes start with two underscores __. In Python (according to PEP8), this makes them private, i.e., inaccessible from outside the class through name mangling.

The __init__ method

def __init__(self, s: str):
    super(MultiRange, self).__init__()

    assert self.__reg.search(s) is not None, (
        "cannot parse range string. It should only contains numbers, "
        + "commas, dashes, and spaces.")
    self.__string_range = s

    string_range_list = [b.strip() for b in self.__string_range.split(",")]

    self.__extremes_list = []
    for string_range in string_range_list:
        extremes = [int(x) for x in string_range.split("-")]
        if 1 == len(extremes):
            extremes = [extremes[0], extremes[0]]
        assert 2 == len(extremes), "a range should be specified as A-B"
        assert extremes[1] >= extremes[0]
        self.__extremes_list.append(tuple(extremes))

    self.__extremes_list = sorted(
        self.__extremes_list, key=lambda x: x[0])
    self.__clean_extremes_list()

    assert 0 < self.__extremes_list[0][0], "'page' count starts from 1."
    self.__ready = True

The __init__ method creates an instance of the class and takes the printer-style range string as input (s). The first thing we want to do is validate the input by checking it against __reg, and stop otherwise by raising an AssertError.

After that, we store the input string in __string_range and split it into elements by using the commas as delimiters and removing any leading or trailing blank space.

Each element is now a string containing either a single page ("1") or a page range ("3-6"). We identify these two cases by splitting each element by the dashes and counting the number of generated elements. In the case of a single page, we convert it to a number (int) and store it into __extremes_list as (1,1). In the case of a page range, we save it as (3,6).

Then, we sort the list of tuples we created by their first elements, and clean them up. This cleaning operation avoids overlaps between tuples (see the next section for more details).

Finally, we verify that the first tuple does not start with a 0 (page numbers start from 1) and tell the class that it is __ready for use.

Cleaning the list of extremes

def __clean_extremes_list(self) -> None:
    is_clean = False
    while not is_clean:
        popped = 0
        i = 0
        while i < len(self.__extremes_list)-1:
            A = self.__extremes_list[i]
            B = self.__extremes_list[i+1]
            if A[1] >= B[0] and A[1] < B[1]:
                self.__extremes_list[i] = (A[0], B[1])
                self.__extremes_list.pop(i+1)
                popped = 1
                break
            elif A[1] >= B[1]:
                self.__extremes_list.pop(i+1)
                popped = 1
                break
            i += 1
        if i >= len(self.__extremes_list)-2+popped:
            is_clean = True

After converting the input string into a list of number pairs (as tuples), each representing a range of pages from the first number to the second one, both included. You can imagine any of these pairs as a (start,end) element.

Before proceeding, we want to be sure that every page number appearson only once in the MultiRange. In other words, all these ranges should not overlap. We achieve this by iterating through the list of (start,end) elements and comparing each pair with the next pair and see if they overlap. For this to work, it is crucial to have the list of pairs sorted by their start element (which we did in the __init__ function).

Now, when are two pairs overlapping? Given that the first pair (A) has a start (A[0]) that is always lower than or equal to the start of the second one B[0], this can happen in two ways:

  1. The first pair fully includes the second one. In other words, the first pair ends after or precisely when the second one ends: A[1] >= B[1].
  2. The two pairs partially overlap. The first pair ends after or precisely when the second one starts (A[1] >= B[0]), but the first pair ends before the second one ends (A[1] < B[1]).

It is important to distinguish these two scenarios because they require us to act differently:

  1. If the first pair fully includes the second one, we can simply remove (pop) it from our list of ranges.
  2. If the overlap is only partial, we need to merge the two ranges. We can achieve this by removing (popping) the second one, and replacing the end position of the first one with the end of the second. Simply put, we remove both pairs and place a new one: (A[0], B[1]).

Every time we found overlapping pairs, we resolve the conflict and restart from the beginning of the list. Only when we reach the end of the list, we can say that it is indeed clean and ready to be used.

The MultiRange length

@property
def length(self):
    if self.__length is None and self.__ready:
        self.__length = 0
        for a, b in self.__extremes_list:
            self.__length += b-a+1
    return self.__length

def __len__(self) -> Optional[int]:
    return self.length

We want to be able to know the length of our MultiRange instance. But given that this requires some computation, we want to (1) calculate it only if the MultiRange is __ready to be used, (2) calculate it once, (3) store it somewhere (self.__length), (4) read the stored value at any future call of the len function.

To provide a hook for the len function and make the len(MultiRange("1,2-6")) code work, we need to define a __len__ method. The method returns an integer number: the one we plan to store in __length.

We do not want anyone to be able to mess with the stored value, that’s why we store it in the private attribute __length (notice those __ at the beginning!). But we want a user to be able to access it without having to call the len function. To do this, we define a class property with the @property decorator.

Every time the user calls MultiRange(" 1,2-6").length, it triggers the property. This checks if the class is ready (self.__ready) and if no length value has been previously stored (self.__length is None). Only and only in this case it calculates the length value and stores it in __length. Otherwise, it returns the default length value: None.

To calculate the MultiRange length, we calculate the difference between all range extreme pairs (a,b) and sum 1 to each (as they should both be included): b-a+1. Then, we sum them all and store it in __length!

How to iterate over the MultiRange elements?

What is an ‘iterator’?

First things first: what is and iterator? Well, in Python we can have iterable (1) and/or iterator (2) methods/classes. An iterable contains elements that can be iterated over, while an iterator is used to iterate over an iterable. According to PEP234:

  1. An object can be iterated over with for if it implements __iter__() or __getitem__().
  2. An object can function as an iterator if it implements next().

Just to re-iterate the concept (pun unintended): if a class includes an __iter__ method it can be iterated over, i.e., the class is iterable. If a class includes a __next__ method, it is an iterator.

Notice that, when a class is an iterator it is, per definition, also iterable. The vice-versa is not true: not all iterables are iterators. While the wording might be confusing, the concept is quite simple. Since an iterator iterates over the elements of something, one could iterate over its elements, which makes it an iterable too. On the other hand, one could potentially be able to iterate over the elements of an iterable only if an iterator is available.

Making your own Python iterator

So, now that we know what an iterator is, how can we make our MultiRange class into one? It’s fairly simple: we implemented two methods:

  • __next__ is our iterator method, which will return one element of MultiRange at a time, and in order. Once the elements are over, it will raise a StopIteration error.
  • __iter__ is our iterable method. When called, it allows a user to iterate over the elements of the MultiRange, starting from its first. This method needs only to reset the iterable location to the first element, and return the class itself. Python will tkae care of the magic behind it all, and link __iter__ and __next__ together.

In __next__ we want to (1) know which element have iterated to and (2) iterate to the next one if possible, otherwise trigger a StopIteration. First, we store the iterator location in __current_item as a tuple containing two numbers: the index of the current range and the index of the current element in that range (starting from (0,0)).

When we reach the last element of a range, we move to the first element of the next range. If we have reached the last element of the last range, it is time to stop.

def __next__(self) -> int:
    current_range_id, current_item = self.__current_item
    if current_range_id >= len(self.__extremes_list):
        raise StopIteration
    current_range = self.__extremes_list[current_range_id]
    if current_item >= current_range[1]-current_range[0]:
        self.__current_item = (current_range_id+1, 0)
    else:
        self.__current_item = (current_range_id, current_item+1)
    return current_range[0]+current_item

def __iter__(self) -> Iterator[int]:
    self.__current_item = (0, 0)
    return self