# Optimizing Python code for computations of pair-wise distances - Part III

**Article Series**

Part I: Python implementation only using built-in libraries

Part II: Numpy implementation

Part III: Numba and Cython implementation (you are here)

This is Part III of a series of three posts. In Part I and II, I discussed pure python and numpy implementations of performing pair-wise distances under a periodic condition, respectively. In this post, I show how to use Numba and Cython to further speed up the python codes.

At some point, the optimized python codes are not *strictly* python codes anymore. For instance, in this post, using Cython, we can make our codes very efﬁcient. However, strictly speaking, Cython is *not* Python. It is a superset of Python, which means that any Python code can be compiled as Cython code but not vice versa. To see the performance boost, one needs to write Cython codes. So what is stopping you to just write C++/C codes instead and be done with it? I believe there is always some balance between the performance of the codes and the effort you put into writing the codes. As I will show here, using Numba or writing Cython codes is straightforward if you are familiar with Python. Hence, I always prefer to optimize the Python codes rather than rewrite it in C/C++ because it is more cost-effective for me.

## Background

Just to reiterate, the computation is to calculate pair-wise distances between every pair of $N$ particles under periodic boundary condition. The positions of particles are stored in an array/list with form `[[x1,y1,z1],[x2,y2,z2],...,[xN,yN,zN]]`

. The distance between two particles, $i$ and $j$ is calculated as the following,

$\Delta_{ij} = \sigma_{ij} - \left[ \sigma_{ij}/L \right] \cdot L$

where $\sigma_{ij}=x_i-x_j$ and $L$ is the length of the simulation box edge. $x_i$ and $x_j$ is the positions. For more information, please read Part I.

## Using Numba

Numba is an open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code.

Numba has existed for a few years. I remembered trying it a few years ago but didn’t have a good experience with it. Now it is much more matured and very easy to use as I will show in this post.

### Serial Numba Implementation

On their website, it is stated that Numba can make Python codes as fast as C or Fortran. Numba also provides a way to parallelize the `for`

loop. First, let’s try to implement a serial version. Numba’s ofﬁcial documentation recommends using Numpy with Numba. Following the suggestion, using the Numpy code demonstrated in Part II, I have the Numba version,

```
import numba
from numba import jit
@jit(nopython=True, fastmath=True)
def pdist_numba_serial(positions, l):
"""
Compute the pair-wise distances between every possible pair of particles.
positions: a numpy array with form np.array([[x1,y1,z1],[x2,y2,z2],...,[xN,yN,zN]])
l: the length of edge of box (cubic/square box)
return: a condensed 1D list
"""
# determine the number of particles
n = positions.shape[0]
# create an empty array storing the computed distances
pdistances = np.empty(int(n*(n-1)/2.0))
for i in range(n-1):
D = positions[i] - positions[i+1:]
out = np.empty_like(D)
D = D - np.round(D / l, 0, out) * l
distance = np.sqrt(np.sum(np.power(D, 2.0), axis=1))
idx_s = int((2 * n - i - 1) * i / 2.0)
idx_e = int((2 * n - i - 2) * (i + 1) / 2.0)
pdistances[idx_s:idx_e] = distance
return pdistances
```

Using Numba is *almost* (see blue box below) as simple as adding the decorator `@jit(nopython=True, fastmath=True)`

to our function.

Inside the function `pdist_numba_serial`

, we basically copied the codes except the line `D = D - np.round(D / l) * l`

in the original code. Instead we need to use `np.round(D / l, 0, out)`

which is pointed out in this github issue

### Parallel Numba Implementation

`pdist_numba_serial`

is a serial implementation. The nature of pair-wise distance computation allows us to parallelize the process by simplifying distributing pairs to multiple cores/threads. Fortunately, Numba does provide a very simple way to do that. The for loop in `pdist_numba_serial`

can be parallelized using Numba by replacing `range`

with `prange`

and adding `parallel=True`

to the decorator,

```
from numba import prange
# add parallel=True to the decorator
@jit(nopython=True, fastmath=True, parallel=True)
def pdist_numba_parallel(positions, l):
# determine the number of particles
n = positions.shape[0]
# create an empty array storing the computed distances
pdistances = np.empty(int(n*(n-1)/2.0))
# use prange here instead of range
for i in prange(n-1):
D = positions[i] - positions[i+1:]
out = np.empty_like(D)
D = D - np.round(D / l, 0, out) * l
distance = np.sqrt(np.sum(np.power(D, 2.0), axis=1))
idx_s = int((2 * n - i - 1) * i / 2.0)
idx_e = int((2 * n - i - 2) * (i + 1) / 2.0)
pdistances[idx_s:idx_e] = distance
return pdistances
```

There are some caveats when using `prange`

when race condition would occur. However for our case, there is no race condition since the distances calculations for pairs are independent with each other, i.e. there is no communication between cores/threads. For more information on parallelizing using Numba, refer to their documentation.

### Benchmark

Now let’s benchmark the two versions of Numba implementations. The result is shown below,

Compared to the fastest Numpy implementation shown in Part II, the serial Numba implementation provides more than three times of speedup. As one can see, the parallel version is about twice as fast as the serial version on my 2-cores laptop. I didn’t test on the machines with more cores but I expect the speed up should scale linearly with the number of cores.

I am sure there are some more advanced techniques to make the Numba version even faster (using CUDA for instance). I would argue that the implementations above are the most cost-effective.

## Using Cython

As demonstrated above, Numba provides a very simple way to speed up the python codes with minimal effort. However, if we want to go further, it is probably better to use Cython.

Cython is basically a superset of Python. It allows Cython/Python codes to be compiled to C/C++ and then compiled to machine codes using C/C++ compiler. In the end, you have a C module you can import directly in Python.

Similar to the Numba versions, I show both serial and parallel versions of Cython implementations

### Serial Cython implementation

```
%load_ext Cython # load Cython in Jupyter Notebook
%%cython --force --annotate
import cython
import numpy as np
from libc.math cimport sqrt
from libc.math cimport nearbyint
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True) # Do not check division, may leads to 20% performance speedup
def pdist_cython_serial(double [:,:] positions not None, double l):
cdef Py_ssize_t n = positions.shape[0]
cdef Py_ssize_t ndim = positions.shape[1]
pdistances = np.zeros(n * (n-1) // 2, dtype = np.float64)
cdef double [:] pdistances_view = pdistances
cdef double d, dd
cdef Py_ssize_t i, j, k
for i in range(n-1):
for j in range(i+1, n):
dd = 0.0
for k in range(ndim):
d = positions[i,k] - positions[j,k]
d = d - nearbyint(d / l) * l
dd += d * d
pdistances_view[j - 1 + (2 * n - 3 - i) * i // 2] = sqrt(dd)
return pdistances
```

#### Some Remarks

Declare static types for variables using

`cdef`

. For instance,`cdef double d`

declare that the variable`d`

has a double/ﬂoat type.Import

`sqrt`

and`nearbyint`

from C library instead of using Python function. The general rule is that always try to use C functions directly whenever possible.`positions`

is a Numpy array and declared using Typed Memoryviews.Similar to

`positions`

,`pdistances_view`

access the memory buffer of the numpy array`pdistances`

. Value assignments of`pdistances`

are achieved through`pdistances_view`

.It is useful to use

`%%cython --annotate`

to display the analysis of Cython codes. In such a way, you can inspect the potential slowdown of the code. The analysis will highlight lines where Python interaction occurs. In this particular example, it is very important to keep the core part — nested loop — from Python interaction. For instance, if we don’t use`sqrt`

and`nearbyint`

from`libc.math`

but instead just use python’s built-in`sqrt`

and`round`

, then you won’t see much speedup since there is a lot of overhead in calling these python functions inside the loop.

### Parallel Cython Implementation

Similar to Numba, Cython also allows parallelization. The parallelization is achieved using OpenMP. First, to use OpenMP with Cython, we need to import needed modules,

```
from cython.parallel import prange, parallel
```

Then, replace the `for i in range(n-1)`

in the serial version with

```
with nogil, parallel():
for i in prange(n-1, schedule='dynamic'):
```

Everything else remains the same. Here I follow the example on Cython’s ofﬁcial documentation.

`schedule='dynamic'`

allows the iterations in the loop are distributed through threads as request. Other options include `static`

, `guided`

, etc. See full documentation.

I had some trouble compiling the parallel version directly in the Jupyter Notebook. Instead, it is compiled as a standalone module. The `.pyx`

ﬁle and `setup.py`

ﬁle can be found in this gist.

### Benchmark

The result of benchmarking `pdist_cython_serial`

and `pdist_cython_parallel`

is shown in the ﬁgure below,

As expected, the serial version is about half the speed of the parallel version on my 2-cores laptop. The serial version is more than two times faster than its counterpart using Numba.

## Summing up

In this serial of posts, using computations of pair-wise distance under periodic boundary condition as an example, I showed various ways to optimize the Python codes using built-in Python functions (Part I), NumPy (Part II), Numba and Cython (this post). The benchmark results from all of the functions tested are summarized in the table below,

Function | Averaged Speed (ms) | Speedup |
---|---|---|

`pdist` | 1270 | 1 |

`pdist_v2` | 906 | 1.4 |

`pdist_np_broadcasting` | 160 | 7.9 |

`pdist_np_naive` | 110 | 11.5 |

`pdist_numba_serial` | 20.7 | 61 |

`pdist_numba_parallel` | 12.6 | 101 |

`pdist_cython_serial` | 5.84 | 217 |

`pdist_cython_parallel` | 3.19 | 398 |

The time is measured when $N=1000$. The parallel versions are tested on a 2-cores machine.