Optimizing python code for com­pu­ta­tions of pair-wise dis­tances - Part III

Oct 17, 2019

Open In Colab

Article Series

This is Part III of a se­ries of three posts. In Part I and II, I dis­cussed pure python and numpy im­ple­men­ta­tions of per­form­ing pair-wise dis­tances un­der a pe­ri­odic con­di­tion, re­spec­tively. In this post, I show how to use Numba and Cython to fur­ther speed up the python codes.

At some point, the op­ti­mized python codes are not strictly python codes any­more. For in­stance, in this post, us­ing Cython, we can make our codes very ef­fi­cient. However, strictly speak­ing, Cython is not Python. It is a su­per­set of Python, which means that any Python code can be com­piled as Cython code but not vice versa. To see the per­for­mance boost, one needs to write Cython codes. So what is stop­ping you to just write C++/C codes in­stead and be done with it? I be­lieve there is al­ways some bal­ance be­tween the per­for­mance of the codes and the ef­fort you put into writ­ing the codes. As I will show here, us­ing Numba or writ­ing Cython codes is straight­for­ward if you are fa­mil­iar with Python. Hence, I al­ways pre­fer to op­ti­mize the Python codes rather than rewrite it in C/C++ be­cause it is more cost-ef­fec­tive for me.

Background

Just to re­it­er­ate, the com­pu­ta­tion is to cal­cu­late pair-wise dis­tances be­tween every pair of NN par­ti­cles un­der pe­ri­odic bound­ary con­di­tion. The po­si­tions of par­ti­cles are stored in an ar­ray/​list with form [[x1,y1,z1],[x2,y2,z2],...,[xN,yN,zN]]. The dis­tance be­tween two par­ti­cles, ii and jj is cal­cu­lated as the fol­low­ing,

Δij=σij[σij/L]L\Delta_{ij} = \sigma_{ij} - \left[ \sigma_{ij}/L \right] \cdot L

where σij=xixj\sigma_{ij}=x_i-x_j and LL is the length of the sim­u­la­tion box edge. xix_i and xjx_j is the po­si­tions. For more in­for­ma­tion, please read Part I.

Using Numba

Numba is an open-source JIT com­piler that trans­lates a sub­set of Python and NumPy code into fast ma­chine code.

Numba has ex­isted for a few years. I re­mem­bered try­ing it a few years ago but did­n’t have a good ex­pe­ri­ence with it. Now it is much more ma­tured and very easy to use as I will show in this post.

Serial Numba Implementation

On their web­site, it is stated that Numba can make Python codes as fast as C or Fortran. Numba also pro­vides a way to par­al­lelize the for loop. First, let’s try to im­ple­ment a se­r­ial ver­sion. Numba’s of­fi­cial doc­u­men­ta­tion rec­om­mends us­ing Numpy with Numba. Following the sug­ges­tion, us­ing the Numpy code demon­strated in Part II, I have the Numba ver­sion,

import numba
from numba import jit

@jit(nopython=True, fastmath=True)
def pdist_numba_serial(positions, l):
    """
    Compute the pair-wise distances between every possible pair of particles.

    positions: a numpy array with form np.array([[x1,y1,z1],[x2,y2,z2],...,[xN,yN,zN]])
    l: the length of edge of box (cubic/square box)
    return: a condensed 1D list
    """
    # determine the number of particles
    n = positions.shape[0]
    # create an empty array storing the computed distances
    pdistances = np.empty(int(n*(n-1)/2.0))
    for i in range(n-1):
        D = positions[i] - positions[i+1:]
        out = np.empty_like(D)
        D = D - np.round(D / l, 0, out) * l
        distance = np.sqrt(np.sum(np.power(D, 2.0), axis=1))
        idx_s = int((2 * n - i - 1) * i / 2.0)
        idx_e = int((2 * n - i - 2) * (i + 1) / 2.0)
        pdistances[idx_s:idx_e] = distance
    return pdistances

Using Numba is al­most (see blue box be­low) as sim­ple as adding the dec­o­ra­tor @jit(nopython=True, fastmath=True) to our func­tion.

Inside the func­tion pdist_numba_serial, we ba­si­cally copied the codes ex­cept the line D = D - np.round(D / l) * l in the orig­i­nal code. Instead we need to use np.round(D / l, 0, out) which is pointed out here

Parallel Numba Implementation

pdist_numba_serial is a se­r­ial im­ple­men­ta­tion. The na­ture of pair-wise dis­tance com­pu­ta­tion al­lows us to par­al­lelize the process by sim­pli­fy­ing dis­trib­ut­ing pairs to mul­ti­ple cores/​threads. Fortunately, Numba does pro­vide a very sim­ple way to do that. The for loop in pdist_numba_serial can be par­al­lelized us­ing Numba by re­plac­ing range with prange and adding parallel=True to the dec­o­ra­tor,

from numba import prange

# add parallel=True to the decorator
@jit(nopython=True, fastmath=True, parallel=True)
def pdist_numba_parallel(positions, l):
    # determine the number of particles
    n = positions.shape[0]
    # create an empty array storing the computed distances
    pdistances = np.empty(int(n*(n-1)/2.0))
    # use prange here instead of range
    for i in prange(n-1):
        D = positions[i] - positions[i+1:]
        out = np.empty_like(D)
        D = D - np.round(D / l, 0, out) * l
        distance = np.sqrt(np.sum(np.power(D, 2.0), axis=1))
        idx_s = int((2 * n - i - 1) * i / 2.0)
        idx_e = int((2 * n - i - 2) * (i + 1) / 2.0)
        pdistances[idx_s:idx_e] = distance
    return pdistances

There are some caveats when us­ing prange when race con­di­tion would oc­cur. However for our case, there is no race con­di­tion since the dis­tances cal­cu­la­tions for pairs are in­de­pen­dent with each other, i.e. there is no com­mu­ni­ca­tion be­tween cores/​threads. For more in­for­ma­tion on par­al­leliz­ing us­ing Numba, re­fer to their doc­u­men­ta­tion.

Benchmark

Now let’s bench­mark the two ver­sions of Numba im­ple­men­ta­tions. The re­sult is shown be­low,

Speed Benchmark: comparison between pdist_numba_serial and pdist_numba_parallel
Speed Benchmark: com­par­i­son be­tween pdis­t_num­ba_se­r­ial and pdis­t_num­ba_­par­al­lel

Compared to the fastest Numpy im­ple­men­ta­tion shown in Part II, the se­r­ial Numba im­ple­men­ta­tion pro­vides more than three times of speedup. As one can see, the par­al­lel ver­sion is about twice as fast as the se­r­ial ver­sion on my 2-cores lap­top. I did­n’t test on the ma­chines with more cores but I ex­pect the speed up should scale lin­early with the num­ber of cores.

I am sure there are some more ad­vanced tech­niques to make the Numba ver­sion even faster (using CUDA for in­stance). I would ar­gue that the im­ple­men­ta­tions above are the most cost-ef­fec­tive.

Using Cython

As demon­strated above, Numba pro­vides a very sim­ple way to speed up the python codes with min­i­mal ef­fort. However, if we want to go fur­ther, it is prob­a­bly bet­ter to use Cython.

Cython is ba­si­cally a su­per­set of Python. It al­lows Cython/Python codes to be com­piled to C/C++ and then com­piled to ma­chine codes us­ing C/C++ com­piler. In the end, you have a C mod­ule you can im­port di­rectly in Python.

Similar to the Numba ver­sions, I show both se­r­ial and par­al­lel ver­sions of Cython im­ple­men­ta­tions

Serial Cython im­ple­men­ta­tion

%load_ext Cython # load Cython in Jupyter Notebook
%%cython --force --annotate

import cython
import numpy as np

from libc.math cimport sqrt
from libc.math cimport nearbyint

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True) # Do not check division, may leads to 20% performance speedup
def pdist_cython_serial(double [:,:] positions not None, double l):
    cdef Py_ssize_t n = positions.shape[0]
    cdef Py_ssize_t ndim = positions.shape[1]
    
    pdistances = np.zeros(n * (n-1) // 2, dtype = np.float64)
    cdef double [:] pdistances_view = pdistances
    
    cdef double d, dd
    cdef Py_ssize_t i, j, k
    for i in range(n-1):
        for j in range(i+1, n):
            dd = 0.0
            for k in range(ndim):
                d = positions[i,k] - positions[j,k]
                d = d - nearbyint(d / l) * l
                dd += d * d
            pdistances_view[j - 1 + (2 * n - 3 - i) * i // 2] = sqrt(dd)
    
    return pdistances

Some Remarks

  • Declare sta­tic types for vari­ables us­ing cdef. For in­stance, cdef double d de­clare that the vari­able d has a dou­ble/​float type.

  • Import sqrt and nearbyint from C li­brary in­stead of us­ing Python func­tion. The gen­eral rule is that al­ways try to use C func­tions di­rectly when­ever pos­si­ble.

  • positions is a Numpy ar­ray and de­clared us­ing Typed Memoryviews.

  • Similar to positions, pdistances_view ac­cess the mem­ory buffer of the numpy ar­ray pdistances. Value as­sign­ments of pdistances are achieved through pdistances_view.

  • It is use­ful to use %%cython --annotate to dis­play the analy­sis of Cython codes. In such a way, you can in­spect the po­ten­tial slow­down of the code. The analy­sis will high­light lines where Python in­ter­ac­tion oc­curs. In this par­tic­u­lar ex­am­ple, it is very im­por­tant to keep the core part — nested loop — from Python in­ter­ac­tion. For in­stance, if we don’t use sqrt and nearbyint from libc.math but in­stead just use python’s built-in sqrt and round, then you won’t see much speedup since there is a lot of over­head in call­ing these python func­tions in­side the loop.

Parallel Cython Implementation

Similar to Numba, Cython also al­lows par­al­leliza­tion. The par­al­leliza­tion is achieved us­ing OpenMP. First, to use OpenMP with Cython, we need to im­port needed mod­ules,

from cython.parallel import prange, parallel

Then, re­place the for i in range(n-1) in the se­r­ial ver­sion with

with nogil, parallel():
    for i in prange(n-1, schedule='dynamic'):

Everything else re­mains the same. Here I fol­low the ex­am­ple on Cython’s of­fi­cial doc­u­men­ta­tion.

schedule='dynamic' al­lows the it­er­a­tions in the loop are dis­trib­uted through threads as re­quest. Other op­tions in­clude static, guided, etc. See here for full doc­u­men­ta­tion.

I had some trou­ble com­pil­ing the par­al­lel ver­sion di­rectly in the Jupyter Notebook. Instead, it is com­piled as a stand­alone mod­ule. The .pyx file and setup.py file can be found here.

Benchmark

The re­sult of bench­mark­ing pdist_cython_serial and pdist_cython_parallel is shown in the fig­ure be­low,

Speed Benchmark: comparison between pdist_cython_serial and pdist_cython_parallel
Speed Benchmark: com­par­i­son be­tween pdis­t_­cython_se­r­ial and pdis­t_­cython_­par­al­lel

As ex­pected, the se­r­ial ver­sion is about half the speed of the par­al­lel ver­sion on my 2-cores lap­top. The se­r­ial ver­sion is more than two times faster than its coun­ter­part us­ing Numba.


Summing up

In this se­r­ial of posts, us­ing com­pu­ta­tions of pair-wise dis­tance un­der pe­ri­odic bound­ary con­di­tion as an ex­am­ple, I showed var­i­ous ways to op­ti­mize the Python codes us­ing built-in Python func­tions (Part I), NumPy (Part II), Numba and Cython (this post). The bench­mark re­sults from all of the func­tions tested are sum­ma­rized in the table be­low,

FunctionAveraged Speed (ms)Speedup
pdist12701
pdist_v29061.4
pdist_np_broadcasting1607.9
pdist_np_naive11011.5
pdist_numba_serial20.761
pdist_numba_parallel12.6101
pdist_cython_serial5.84217
pdist_cython_parallel3.19398

The time is mea­sured when N=1000N=1000. The par­al­lel ver­sions are tested on a 2-cores ma­chine.


Further Reading