Dropping into a debugger with the built-in breakpoint() function

One of the nice little things Python 3.7 brought us is the new built-in function breakpoint(), which was proposed in PEP 553. As its name suggests, it allows us to create a breakpoint in Python source code.

Now, creating breakpoints in the source is nothing new, but it was always a bit tedious:

name = input("Enter name: ")
import pdb; pdb.set_trace()
user = find_user(name=name)

Quite a lot of typing, and just like the author of the linked PEP, I also mistype it quite often. When deep into a bug hunt, such typos and script re-runs are moderately annoying and unnecessarily add to the overall cognitive load. Typing breakpoint() is slightly easier.

Additionally, if the project is set up to run auto code formatting tools such as black on every build, these tools are happy to reformat this debugging snippet to follow the style guide, adding insult to injury:

name = input("Enter name: ")
import pdb

pdb.set_trace()  # argh!!
user = find_user(name=name)

On the other hand, the following looks considerably cleaner and does not suffer from the same problem:

name = input("Enter name: ")
breakpoint()
user = find_user(name=name)

When called, the breakpoint() function calls the sys.breakpointhook() hook. The latter drops you into the built-in debugger pdb by default, but you can override the hook to do something else, such as invoking a completely different debugger or trolling:

import sys

def trolling():
    raise RuntimeError("Debugging not allowed!")

sys.breakpointhook = trolling  # for the lulz

...

breakpoint()  # RuntimeError: Debugging not allowed!

The default implementation of the hook also allows customizing the breakpoint() behavior through the PYTHONBREAKPOINT environment variable (provided that the hook was not overridden as above):

  • If PYTHONBREAKPOINT is not set or set to the empty string, pdb.set_trace() is called.
  • If set to "0", breakpoint() returns immediately and does not do anything. Useful for quickly disabling all breakpoints without modifying the source.
  • If set to anything else, e.g. "ipdb.set_trace", the value is treated as the name of the function to import and run. If importing fails, a warning is issued and the breakpoint is a no-op.
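
For instance (using a hypothetical demo.py; ipdb is a third-party debugger that must be installed separately), the same script can be driven in different ways purely from the environment:

# demo.py
name = input("Enter name: ")
breakpoint()
user = find_user(name=name)

# Run with the default pdb behavior:
#   $ python demo.py
# Drop into ipdb instead (if installed):
#   $ PYTHONBREAKPOINT=ipdb.set_trace python demo.py
# Disable all breakpoints without touching the source:
#   $ PYTHONBREAKPOINT=0 python demo.py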

Useful? Tell me what you think!

Enums to replace hardcoded string constants

TL;DR – If you just want to see how to make the following work with string enums:

FooEnum.BAR == "Bar"
# True, without having to say `FooEnum.BAR.value`

… scroll down to the Trick™ section. Otherwise keep reading.


You might have already seen an application that used string literals for common constant values. For example, in an app with business objects that can be in different states, code such as the following can be found:

if obj.state == "Active":
    # do something

...

if all_done():
  obj.state = "Completed"

Object states are represented by strings such as "Open", "Active", and "Completed", and there are many of these scattered around the code. Needless to say this implementation is not the best – it is susceptible to typos, and renaming a state requires a find & replace operation that can never go wrong (right?). A better approach is thus to store state names into constants (“constants” by convention at least), so that any future renamings can be done in a single place:

STATE_NEW = "Open"
STATE_ACTIVE = "Active"
STATE_DONE = "Completed"
...

if obj.state == STATE_ACTIVE:
    # do something

...

if all_done():
    obj.state = STATE_DONE

If there are more than a few such constants defined in the application, it makes sense to group the related ones into namespaces. The most straightforward way is to define them as class members:

class State:
    NEW = "Open"
    ACTIVE = "Active"
    DONE = "Completed"


if obj.state == State.ACTIVE:
    # etc.

Neat.

The State class has several drawbacks, however. Its members can be modified. Its members can be deleted. It is not iterable, so compiling a list of all possible states is not elegant (one needs to peek into the class __dict__).

>>> State.NEW = "Completed"
>>> del State.ACTIVE
>>> list(
        (key, val) for key, val in State.__dict__.items()
        if not key.startswith('__')
    )
[('NEW', 'Completed'), ('DONE', 'Completed')]
>>> list(State)
Traceback (most recent call last):
  ...
TypeError: 'type' object is not iterable

Canonical solution: Enums

Starting with Python 3.4, the standard library provides the enum module that addresses these shortcomings (there is also a backport for older Python versions).

from enum import Enum

class State(Enum):
    NEW = "Open"
    ACTIVE = "Active"
    DONE = "Completed"

The only change is that the State class now inherits from Enum, suddenly making it more robust:

>>> State.NEW = "Completed"
# AttributeError: Cannot reassign members.
>>> del State.NEW
# AttributeError: State: cannot delete Enum member.
>>> list(State)
[<State.NEW: 'Open'>, <State.ACTIVE: 'Active'>, <State.DONE: 'Completed'>]

Each enum member is an object that has a name and a value:

>>> type(State.NEW)
<enum 'State'>
>>> State.NEW.name
'NEW'
>>> State.NEW.value
'Open'
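
Incidentally, enum members can also be looked up by value (by calling the class) or by name (by subscripting it) – standard Enum behavior:

>>> State("Open")
<State.NEW: 'Open'>
>>> State["NEW"]
<State.NEW: 'Open'>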

There is a caveat, however – enum members can no longer be directly compared to string values:

>>> state = fetch_object_state(obj)  # assume "Open"
>>> State.NEW == state
False  # !!!

For the comparison to work as expected, the enum member's value must be compared:

>>> State.NEW.value == state
True

This is unfortunate, because the extra .value part makes the expression more verbose, and people might (rightfully) start complaining about readability. Not to mention that it represents a trap: it is all too easy to forget about the .value suffix.

The standard library provides IntEnum, which makes the following work:

from enum import IntEnum

class Color(IntEnum):
    WHITE = 5
    BLACK = 10

>>> Color.WHITE == 5
True

Sadly, there is no “StringEnum” class, and it seems that you are on your own if you have string members. This reason alone can make some developers consider ditching enums altogether in favor of a plain class (first-hand experience).

The trick™

And now for the primary motivation for this post. Thank you for reading it to here. 🙂

It is possible to use an enum while still preserving the convenience of a plain class when comparing the enum members to plain values. The trick is to subclass the type of enum members!

class State(str, Enum):  # <-- look, here
    NEW = "Open"
    ACTIVE = "Active"
    DONE = "Completed"

>>> State.NEW == "Open"
True
>>> State.NEW.value == "Open"
True

Even though this is described in the enum docs, one has to scroll quite a lot, towards the last quarter of the page, to find it, so you cannot blame yourself if you missed it the first time when you were just looking for a quick recipe.
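
A pleasant side effect: because the members of such an enum are real str instances, they can be handed to APIs that expect plain strings. A quick sketch with the standard json module:

>>> import json
>>> json.dumps({"state": State.NEW})
'{"state": "Open"}'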

With some creativity it is even possible to construct enums with types other than just the typical boring integers or strings. In a chess program, one could find the following enum useful to represent the corners of the board:

class Corner(tuple, Enum):
    TOP_LEFT = ('A', 8)
    TOP_RIGHT = ('H', 8)
    BOTTOM_LEFT = ('A', 1)
    BOTTOM_RIGHT = ('H', 1)

>>> rook_position = ('H', 8)
>>> is_top_corner = rook_position in (Corner.TOP_LEFT, Corner.TOP_RIGHT)
>>> is_top_corner
True

If you learned something new and found this trick useful, feel free to drop me a note. Thank you for reading!

Python attribute lookup explained in detail

A few months ago I gave a talk at the local Python meetup on how attribute lookup on an object works in Python. It is not as straightforward as it looks on the surface, and I thought it might be an interesting topic to present.

I received highly positive feedback from the listeners, confirming that they learned something new and potentially valuable. I used a Jupyter Notebook for the presentation, and if you prefer running the code examples to reading, you can jump straight into it – just download the notebook and play around with it. It contains quite a few comments, thus the examples should hopefully be self-explanatory.

Storing attributes on an object

Say we have the following instance:

class Foo(object):  # a new-style class
    x = 'x of Foo'

foo_inst = Foo()

We can inspect its attributes by peeking into the instance's __dict__, which is currently empty, because the x from above belongs to the instance's class:

>>> foo_inst.__dict__
{}

Nevertheless, an attempt to retrieve x from the instance succeeds, because Python finds it in the instance’s class. The lookup is dynamic and a change to a class attribute is also reflected on the instance:

>>> foo_inst.x
'x of Foo'
>>> Foo.x = 'new x of Foo'
>>> foo_inst.x
'new x of Foo'

But what happens when both the instance and its class contain an attribute with the same name? Which one takes precedence? Let’s inject x into the instance and observe the result:

>>> foo_inst.__dict__['x'] = 'x of foo_inst'
>>> foo_inst.__dict__
{'x': 'x of foo_inst'}
>>> foo_inst.x
'x of foo_inst'

No surprises here, the x is looked up on the instance first and found there. If we now remove the x, it will be picked from the class again:

>>> del foo_inst.__dict__['x']
>>> foo_inst.__dict__
{}
>>> foo_inst.x
'new x of Foo'

As demonstrated, instance attributes take precedence over class attributes – with a caveat. Contrary to what quite a lot of people think, this is not always the case, and sometimes class attributes shadow instance attributes. Enter descriptors.

Descriptors

Descriptors are special objects that can alter the interaction with attributes. For an object to be a descriptor, it needs to define at least one of the following special methods: __get__(), __set__(), or __delete__().

class DescriptorX(object):

    def __get__(self, obj, obj_type=None):
        if obj is None:
            print('__get__(): Accessing x from the class', obj_type)
            return self

        print('__get__(): Accessing x from the object', obj)
        return 'X from the descriptor'

    def __set__(self, obj, value):
        print('__set__(): Setting x on the object', obj)
        obj.__dict__['x'] = '{0}|{0}'.format(value)

The class DescriptorX conforms to the given definition, and we can instantiate it to turn the attribute x into a descriptor:

>>> Foo.x = DescriptorX()
>>> Foo.__dict__['x']
<__main__.DescriptorX at 0x7fa0b2ff3790>

Accessing a descriptor does not simply return it, as is the case with non-descriptor attributes; instead, its __get__() method is invoked and its result returned.

>>> Foo.x
# prints: __get__(): Accessing x from the class <class '__main__.Foo'>
<__main__.DescriptorX at 0x7fa0b2ff3790>

Even though the result is actually the descriptor itself, the extra line printed to output tells us that its __get__() method was indeed invoked, returning the descriptor.

The __get__() method receives two arguments – the instance on which an attribute was looked up (can be None if accessing an attribute on a class), and the “owner” class, i.e. the class containing the descriptor instance.

Let’s see what happens if we access a descriptor on an instance of a class, and that instance also contains an attribute with the same name:

>>> foo_inst.__dict__['x'] = 'x of foo_inst is back'
>>> foo_inst.__dict__
{'x': 'x of foo_inst is back'}
>>> foo_inst.x
# prints: __get__(): Accessing x from the object <__main__.Foo object at 0x7fe2bc613350>
'X from the descriptor'

The result might surprise you – the descriptor (defined on the class) took precedence over the instance attribute!

Overriding and non-overriding descriptors

The story does not end here, however – sometimes a descriptor does not take precedence:

>>> del DescriptorX.__set__
>>> foo_inst.x
'x of foo_inst is back'

It turns out there are actually two kinds of descriptors:

  • Data descriptors (overriding) – they define the __set__() and/or the __delete__() method (and usually __get__() as well) and take precedence over instance attributes.
  • Non-data descriptors (non-overriding) – they define only the __get__() method and are shadowed by an instance attribute of the same name.

If descriptor behavior seems similar to a property to you, it is because properties are actually implemented as descriptors behind the scenes. The same goes for class methods, ORM attributes on data models, and several other constructs.

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class SomeClass(Base):
    __tablename__ = 'some_table'
    id = Column(Integer, primary_key=True)  # a descriptor
    name = Column(String(50))  # a descriptor

    @property  # creates a descriptor under the name "foo"
    def foo(self):  
        return 'foo'

    @classmethod  # creates a descriptor, too
    def make_instance(cls, **kwargs):
        return cls(**kwargs)

Traversing the inheritance hierarchy

Sometimes an attribute does not exist on a class/instance, but Python does not give up just yet. It continues searching the parent classes, as the attribute might be found there.

Consider the following hierarchy:

class A(object): pass

class B(object):
    x = 'x from B'

class C(A, B): pass

class D(B):
    x = 'x from D'

class E(C, D): pass

Or in a picture, because a picture is worth a thousand words:

[Figure: class hierarchy diagram – E inherits from C and D, C inherits from A and B, and D inherits from B]

The attribute x is defined on both class D and class B. If we access it on an instance of E that does not have it, the lookup still succeeds:

>>> e_inst = E()
>>> e_inst.x
'x from D'

The thing to observe here is that the lookup algorithm is apparently not depth-first search, otherwise x would first be found on B. The algorithm used is not breadth-first search either, otherwise x would still be picked from D in the following example; instead, it is retrieved from A:

>>> A.x = 'x from A'
>>> e_inst.x
'x from A'

The actual lookup order can be seen by inspecting the method resolution order (MRO) of a class:

>>> E.__mro__
(__main__.E, __main__.C, __main__.A, __main__.D, __main__.B, object)

Python uses the C3 linearization algorithm to construct it and to decide how to traverse the class hierarchy.
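
C3 also refuses to build a class when no consistent ordering of the bases exists; a minimal sketch of such a failure (class names made up):

class X(object): pass
class Y(X): pass

# class Z(X, Y): pass  # would raise:
# TypeError: Cannot create a consistent method resolution
# order (MRO) for bases X, Y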

What about metaclasses?


Interlude – what is a metaclass?
Putting it simply, a metaclass is a “thing” that can create a new class in the same way a class can be used to create new objects, i.e. instances of itself:

  • metaclass() —> a new Class (an instance of metaclass)
  • Class() —> a new object (an instance of Class)

Or by example:

# "AgeMetaclass" inherits from the metaclass "type",
# thus "AgeMetaclass" is a metaclass, too
class AgeMetaclass(type):
    age = 18

# create an instance of a metaclass to produce a class
Person = AgeMetaclass('Person', (object,), {'age': 5})  # name, base classes, class attributes

# the above is the same as using the standard (Python 2) class definition syntax:
class Person(object):
    __metaclass__ = AgeMetaclass
    age = 5

# NOTE: in Python 3 the metaclass would be specified differently:
class Person(metaclass=AgeMetaclass):
    age = 5

If an attribute is found on a class, its metaclass does not interfere, nor does it interfere when looking up an attribute on an instance:

>>> Person.age
5
>>> john_doe = Person()
>>> john_doe.age
5

On the other hand, if an attribute is not found on the class, it is looked up on its metaclass:

>>> del Person.age
>>> Person.age
18

There is a caveat, however – a metaclass is not considered when accessing an attribute on a class instance:

>>> john_doe.age
# AttributeError: 'Person' object has no attribute 'age'

The lookup only goes one layer up. It inspects the class of an instance, or a metaclass of a class, but not an “indirect metaclass”1 of a class instance.

What happens if an attribute is not found?

Python does not give up just yet. If implemented, it uses the __getattr__() hook on the class as a fallback.

class Product(object):
    def __init__(self, label):
        self.label = label

    def __getattr__(self, name):
        print('attribute "{}" not found, but giving you a foobar tuple!'.format(name))
        return ('foo', 'bar')

Let’s access an attribute that exists, and then an attribute that does not:

>>> chair = Product('dining chair DC-745')
>>> chair.label
'dining chair DC-745'
>>> chair.manufacturer
# prints: attribute "manufacturer" not found, but giving you a foobar tuple!
('foo', 'bar')

Because of the fallback, the AttributeError was not raised. Just keep in mind that defining __getattr__() on an instance instead of on a class will not work:

>>> del Product.__getattr__
>>> chair.__getattr__ = lambda self, name: 'instance __getattr__'
>>> chair.unknown_attr
# AttributeError: 'Product' object has no attribute 'unknown_attr'

NOTE: __getattr__() or __getattribute__()?

__getattr__() should not be confused with __getattribute__(). The former is a fallback for missing attributes as demonstrated above, while the latter is the method that gets invoked on attribute access, i.e. when using the “dot” operator. It implements the lookup algorithm explained in this post, but can be overridden and customized. The default implementation is in the C function _PyObject_GenericGetAttrWithDict().

Most of the time, however, it is probably the __getattr__() method that you want to override.
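
To make the difference tangible, here is a small sketch (class and attribute names made up) showing that __getattribute__() fires on every access, while __getattr__() only fires when the normal lookup fails:

class Demo(object):
    present = 42

    def __getattribute__(self, name):
        print('__getattribute__ called for', name)
        # delegate to the default lookup algorithm
        return object.__getattribute__(self, name)

    def __getattr__(self, name):
        print('__getattr__ fallback for', name)
        return 'made up'

>>> Demo().present
# prints: __getattribute__ called for present
42
>>> Demo().missing
# prints: __getattribute__ called for missing
# prints: __getattr__ fallback for missing
'made up'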

Summary

Accessing an attribute on a (new-style) class instance invokes the __getattribute__() method that performs the following:

  • Check the class hierarchy using the MRO (but do not examine metaclasses):
    • If a data (overriding) descriptor is found in the class hierarchy, call its __get__() method;
  • Otherwise check the instance __dict__ (assuming no __slots__ for the sake of the example). If the attribute is there, return it;
  • If the attribute is not in the instance __dict__ but is found in the class hierarchy:
    • If it is a (non-data) descriptor, call its __get__() method;
    • If it is not a descriptor, return the attribute itself;
  • If still not found, invoke __getattr__(), if implemented on the class;
  • Finally, give up and raise AttributeError.

  1. I totally made this term up, do not use it in a conversation when trying to sound smart. :) 

How Python computes 2 + 5 under the hood (part 2)

(see also Part 1 of this post)

Representing Python objects at C level

In CPython, Python objects are represented as C structs. While struct members can vary depending on the object type, all PyObject instances contain at least the following two members, i.e. the so-called PyObject_HEAD:

  • ob_refcnt – the number of references to the object. Used for garbage
    collection purposes, since objects that are no longer referenced by anything
    should be cleaned up to avoid memory leaks.
  • ob_type – a pointer to a type object, which is a special object describing
    the referencing object’s type.

The segment of the interpreter code for the BINARY_ADD instruction that was omitted for brevity in Part 1 is the following:

if (PyUnicode_CheckExact(left) &&
         PyUnicode_CheckExact(right)) {
    sum = unicode_concatenate(left, right, f, next_instr);
    /* unicode_concatenate consumed the ref to left */
}
else {
    sum = PyNumber_Add(left, right);
    Py_DECREF(left);
}
Py_DECREF(right);

Here Python checks if the left and right operands are both Unicode instances, i.e. strings. It does that by inspecting their type objects. If both operands are indeed strings, it performs string concatenation on them, but for anything else the PyNumber_Add() function gets called. Since the operands 2 and 5 in our case are integers, this is exactly what happens. There is also some reference count management (the Py_DECREF() macro), but we will not dive into that.

PyNumber_Add() first tries to perform the add operation on the given operands v and w (two pointers to PyObject) by invoking binary_op1(v, w, NB_SLOT(nb_add)). If the result of that call is Py_NotImplemented, it further tries to concatenate the operands as sequences. This is not the case with integers, however, so let's have a look at the binary_op1() function located in the Objects/abstract.c file:

static PyObject *
binary_op1(PyObject *v, PyObject *w, const int op_slot)
{
    PyObject *x;
    binaryfunc slotv = NULL;
    binaryfunc slotw = NULL;

    if (v->ob_type->tp_as_number != NULL)
        slotv = NB_BINOP(v->ob_type->tp_as_number, op_slot);
    if (w->ob_type != v->ob_type &&
        w->ob_type->tp_as_number != NULL) {
        slotw = NB_BINOP(w->ob_type->tp_as_number, op_slot);
        if (slotw == slotv)
            slotw = NULL;
    }
    if (slotv) {
        if (slotw && PyType_IsSubtype(w->ob_type, v->ob_type)) {
            x = slotw(v, w);
            if (x != Py_NotImplemented)
                return x;
            Py_DECREF(x); /* can't do it */
            slotw = NULL;
        }
        x = slotv(v, w);
        if (x != Py_NotImplemented)
            return x;
        Py_DECREF(x); /* can't do it */
    }
    if (slotw) {
        x = slotw(v, w);
        if (x != Py_NotImplemented)
            return x;
        Py_DECREF(x); /* can't do it */
    }
    Py_RETURN_NOTIMPLEMENTED;
}

Delegating the work to the right function

The binary_op1() function expects references to two Python objects and the binary operation that should be performed on them. The actual function that will perform this operation is obtained with the following:

NB_BINOP(v->ob_type->tp_as_number, op_slot)

Remember how each PyObject contains a reference to another object describing the former’s type, i.e. the ob_type struct member? For integers this is the PyLong_Type located in Objects/longobject.c.

PyLong_Type has the tp_as_number member, a reference to a structure holding pointers to all “number” methods available on Python int objects (integers in Python 3 are what is known as the long type in Python 2):

static PyNumberMethods long_as_number = {
    (binaryfunc)long_add,       /*nb_add*/
    (binaryfunc)long_sub,       /*nb_subtract*/
    (binaryfunc)long_mul,       /*nb_multiply*/
    long_mod,                   /*nb_remainder*/
    ...
};

Finally there is the NB_BINOP(nb_methods, slot) macro that picks a particular method from this list. Since in our case binary_op1() is invoked with NB_SLOT(nb_add) as the third argument, the function for adding two integers is returned.

Now, with two operands in the expression left + right, a decision needs to be made about which operand the addition function should be picked from to compute the result. As explained in a helpful comment above the binary_op1() function, the order is as follows:

  • If right is a strict subclass of left, right.__add__(left, right) is tried.
  • left.__add__(left, right) is tried.
  • right.__add__(left, right) is tried (unless it has already been tried in the first step).

Python tries to do its best to obtain a meaningful result, i.e. something other than NotImplemented, and if one of the operands does not support the operation, the other one is tried, too.
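
The same ordering can be observed from pure Python through the reflected method __radd__(); a small sketch with made-up classes:

class Base(object):
    def __add__(self, other):
        return 'Base.__add__'
    def __radd__(self, other):
        return 'Base.__radd__'

class Derived(Base):
    def __radd__(self, other):
        return 'Derived.__radd__'

>>> Base() + Derived()
'Derived.__radd__'  # the right operand is a strict subclass, so it gets the first shot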

Nailing it

So which function is the one that actually computes the sum of 2 and 5 in the end?

It’s the long_add() function implemented in Objects/longobject.c. It is perhaps a bit more complex than expected, because it needs to support the addition of integers of arbitrary length, and still performing fast for integers small enough to fit into a CPU register.

Whoa! After all the digging down the rabbit hole, we finally found the right function. Quite a lot of extra work for such a simple operation as addition, but that's the price we have to pay for Python's dynamic nature. Remember that the same add(x, y) function we wrote in Part 1 of this post works out of the box with different operand types, and I hope the mechanisms behind the scenes that allow for this are now clearer.

>>> add(2, 5)
7
>>> add('2', '5')
'25'
>>> add([2], [5])
[2, 5]

As always, comments, suggestions, praise, and (constructive) criticism are all welcome. Thanks for reading!

How Python computes 2 + 5 under the hood (part 1)

Suppose we have a very simple Python function that accepts two arguments and returns their sum, and let’s name this function with an (un)imaginative name add:

def add(x, y):
    return x + y

>>> add(2, 5)
7

As a bonus, since Python is a dynamic language, the function also works with (some) other argument types out of the box. If given, say, two sequences, it returns their concatenation:

>>> add([1, 2], [3, 4])
[1, 2, 3, 4]
>>> add('foo', 'bar')
'foobar'

How does this work, you might ask? What happens behind the scenes when we invoke add()? We will see this in a minute.

Python 3.7 will be used in the examples (currently in beta 1 at the time of writing).

The dis module

For a start, we will inspect the add() function using the handy built-in dis module.

>>> import dis
>>> bytecode = dis.Bytecode(add)
>>> print(bytecode.info())
Name:              add
Filename:          <stdin>
Argument count:    2
Kw-only arguments: 0
Number of locals:  2
Stack size:        2
Flags:             OPTIMIZED, NEWLOCALS, NOFREE
Constants:
   0: None
Variable names:
   0: x
   1: y

Besides peeking into the function's metadata, we can also disassemble it:

>>> dis.dis(add)
  2           0 LOAD_FAST                0 (x)
              2 LOAD_FAST                1 (y)
              4 BINARY_ADD
              6 RETURN_VALUE

Disassembling shows that the function consists of four primitive bytecode instructions, which are understood and interpreted by the Python virtual machine.

Python is a stack-based machine

In the CPython implementation, the interpreter is a stack-based machine, meaning that it does not have registers, but instead uses a stack to perform computations.

The first bytecode instruction, LOAD_FAST, pushes a reference to a particular local variable onto the stack, and the single argument to the instruction specifies which variable that is. LOAD_FAST 0 thus picks a reference to x, because x is the first local variable, i.e. at index 0, which can be also seen from the function’s metadata presented just above.

Similarly, LOAD_FAST 1 pushes a reference to y onto the stack, resulting in the following state after the first two bytecode instructions have been executed:

    +---+  <-- TOS (Top Of Stack)
    | y |
    +---+
    | x |
 ---+---+---

The next instruction, BINARY_ADD, takes no arguments. It simply takes the top two elements from the stack, performs an addition on them, and pushes the result of the operation back onto the stack.

At the end, RETURN_VALUE takes whatever the remaining element on the stack is, and returns that element to the caller of the add() function.
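
To make the stack mechanics concrete, here is a toy emulation of these four instructions with a plain Python list serving as the stack (purely illustrative, not how CPython actually runs them):

def emulate_add(x, y):
    stack = []
    stack.append(x)              # LOAD_FAST 0 (x)
    stack.append(y)              # LOAD_FAST 1 (y)
    right = stack.pop()          # BINARY_ADD pops the TOS...
    left = stack.pop()           # ...and the element below it,
    stack.append(left + right)   # ...then pushes the sum back
    return stack.pop()           # RETURN_VALUE returns the remaining element

>>> emulate_add(2, 5)
7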

Going even deeper (enter the C level)

The bytecode instructions themselves are also just an abstraction, and something needs to make sense of them. That “something” is the Python interpreter. In CPython, its reference implementation, this is a program written in C that loops through the bytecode given to it, and interprets the instructions in it one by one.

The heart of this machinery is implemented in the Python/ceval.c file. It runs an infinite loop that contains a giant switch statement, with each case (target) handling one of the possible bytecode operations.

This is how the code for the BINARY_ADD instruction looks:

TARGET(BINARY_ADD) {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;

    /* the computation of the "sum" omitted */

    SET_TOP(sum);
    if (sum == NULL)
        goto error;
    DISPATCH();
}

POP(), TOP(), and SET_TOP() are convenience C macros that perform primitive interpreter stack operations such as popping the top value from the stack, or replacing the current TOS (Top Of Stack) value with a different one.

The code above is actually pretty straightforward. It pops the right-hand operand from the top of the stack, which is a reference to y in our case, and stores it under the name right. It then also stores a reference to the left-hand operand, i.e. x, which became the new TOS (TOP() only peeks at it without popping it).

After performing the calculation, it sets sum, i.e. a reference to the result, as the new TOS, performs a quick error check, and dispatches control to the next bytecode instruction in line.

Part 2 explains how Python objects are represented at the C level, and how adding two such objects is done.

Giving Python slices a name

Python makes it easy for a developer to work with sequence types such as lists, strings, tuples, and others. This is especially true when extracting sub-sequences from a given sequence.

>>> vowels = ['A', 'E', 'I', 'O', 'U']
>>> vowels[1:3]
['E', 'I']
>>> vowels[3:5]
['O', 'U']
>>> vowels[-4:-2]
['E', 'I']

Out of bound indexes are gracefully handled:

>>> vowels[2:99]
['I', 'O', 'U']
>>> vowels[-5:2]
['A', 'E']

Omitted start/end indexes default to the beginning/end of the sequence, respectively:

>>> vowels[:2]
['A', 'E']
>>> vowels[-2:]
['O', 'U']

If given a step n, only every n-th item in the specified range is included in the result:

>>> vowels[::2]
['A', 'I', 'U']

Step can also be a negative number:

>>> vowels[4:1:-1]
['U', 'O', 'I']

Slice objects

When using the “extended indexing” syntax (I made up that name) from above, what actually happens behind the scenes is that a slice() object is created and passed to the sequence object being sliced. The following two expressions are thus equivalent:

>>> vowels[4:2:-1]
['U', 'O']
>>> vowels[slice(4, 2, -1)]
['U', 'O']

This is great, because it allows us to assign descriptive names to slices, and possibly reuse them if the same sub-slice is used in more than a single place:

>>> FIRST_THREE = slice(0, 3)
>>> ODD_ITEMS = slice(1, None, 2)
>>> vowels[FIRST_THREE]
['A', 'E', 'I']
>>> 'abcdef'[FIRST_THREE]
'abc'
>>> vowels[ODD_ITEMS]
['E', 'O']
>>> 'abcdef'[ODD_ITEMS]
'bdf'

Adding support for slicing to custom objects

It’s worth noting that object slicing is not something that is automatically given to us, Python merely allows us to implement support for it ourselves, if we want so.

When the square brackets notation ([]) is used, Python tries to invoke the __getitem__() magic method on the object, passing the given key to it as an argument. That method can be overridden to define custom indexing behavior.

As an example, let's try to create a class whose instances can be queried for balance. Even if an instance itself does not contain anything, it will somehow calculate the required amount out of thin air and return that made-up number to us. We will call that class a Bank.

class Bank(object):
    """Can create money out of thin air."""

    def __getitem__(self, key):
        if not isinstance(key, (int, slice)):
            raise TypeError('Slice or integer index expected')

        if isinstance(key, int):
            return key

        # key is a slice() instance
        start = key.start if isinstance(key.start, int) else 0
        stop = key.stop if isinstance(key.stop, int) else 0
        step = key.step if isinstance(key.step, int) else 1
        return sum(range(start, stop, step))

If we query a Bank instance (by indexing it) with a single integer, it will simply return the amount equal to the given index. If queried by a range (slice), however, it will return the sum of all indices contained in it:1

>>> b = Bank()
>>> b[7]
7
>>> b[-3:0]
-6
>>> b[0:7:2]
12
>>> b[::]
0
>>> b['':5:{}]
10

As the last example demonstrates, slices can contain just about any value, not just integers and None, thus the Bank class must check for these cases and use defaults if needed.
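
Incidentally, when the length of the underlying sequence is known, the standard slice.indices() method can do this normalization for us – it fills in defaults and clamps the components to the given length (though it expects integer components, so it would not help with the exotic values above):

>>> slice(None, 99, None).indices(5)
(0, 5, 1)
>>> slice(-3, None, None).indices(5)
(2, 5, 1)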

Just a word of caution – if you sub-class built-in types in Python 2 and want to implement custom slicing behavior, you need to override the deprecated __getslice__() method (documentation).


  1. Not saying that this is actually the best way to run a bank in real life, nor that (ab)using slices like this will make you popular with people using your sliceable class… 

Handling application settings hierarchy more easily with Python’s ChainMap

Say you have an application that, like almost every more complex application, exposes some settings that affect how it behaves. These settings can be configured in several ways, such as specifying them in a configuration file, or by using the system's environment variables. They can also be provided as command line arguments when starting the application.

If the same setting is provided at more than a single configuration layer, the value from the layer with the highest precedence is taken. This could mean that, for example, configuration file settings override those from environment variables, but command line arguments have precedence over both. If a particular setting is not specified in any of these three layers, its default value is used.

Ordered by priority (highest first), our hierarchy of settings layers looks as follows:

  • command line arguments
  • configuration file
  • environment variables
  • application defaults

A natural way of representing settings is by using a dictionary, one for each layer. An application might thus contain the following:

defaults = {
    'items_per_page': 100,
    'log_level': 'WARNING',
    'max_retries': 5,
}

env_vars = {
    'max_retries': 10
}

config_file = {
    'log_level': 'INFO'
}

cmd_args = {
    'max_retries': 2
}

Putting them all together while taking the precedence levels into account, here's how the application settings would look:

>>> settings  # all settings put together
{
    'items_per_page': 100,  # a default value
    'log_level': 'INFO',  # from the config file
    'max_retries': 2,  # from a command line argument
}

There is a point in the code where a decision needs to be made on which setting value to use. One way of determining that is by examining the dictionaries one by one until the setting is found:1

setting_name = 'log_level'

for settings in (cmd_args, config_file, env_vars, defaults):
    if setting_name in settings:
        value = settings[setting_name]
        break
else:
    raise ValueError('Setting not found: {}'.format(setting_name))

# do something with value...

A somewhat verbose approach, but at least better than a series of if-elif statements.

An alternative approach is to merge all settings into a single dictionary before using any of their values:

settings = defaults.copy()
for d in (env_vars, config_file, cmd_args):
    settings.update(d)

Mind that here the order of applying the settings layers must be reversed, i.e. the highest-priority layers get applied last, so that lower-priority layers cannot override them. This works quite well, with possibly the only downside being that if any of the underlying settings dictionaries gets updated, the “main” settings dictionary must be rebuilt, because the changes are not reflected in it automatically:

>>> config_file['items_per_page'] = 25
>>> settings['items_per_page']
100  # change not propagated

Now, I probably wouldn’t be writing about all this, if there didn’t already exist an elegant solution in the standard library. Python 3.32 brought us collections.ChainMap, a handy class that can transparently group multiple dicts together. We just need to pass it all our settings layers (higher-priority ones first), and ChainMap takes care of the rest:

>>> from collections import ChainMap
>>> settings = ChainMap(cmd_args, config_file, env_vars, defaults)
>>> settings['items_per_page']
100
>>> env_vars['items_per_page'] = 25
>>> settings['items_per_page']
25  # yup, automatically propagated
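
As a bonus, new higher-priority layers can be pushed at runtime with new_child() (passing a mapping to it requires Python 3.4+), without disturbing the underlying dictionaries:

>>> overrides = {'log_level': 'DEBUG'}
>>> debug_settings = settings.new_child(overrides)  # becomes the highest-priority layer
>>> debug_settings['log_level']
'DEBUG'
>>> settings['log_level']
'INFO'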

Pretty handy, isn’t it?


  1. In case you spot a mis-indented else block – the indentation is actually correct. It’s a for-else statement, and the else block only gets executed if the loop finishes normally, i.e. without break-ing out of it. 
  2. There is also a polyfill for (some) older Python versions.