# Optimizing Python code for computations of pair-wise distances - Part II

Oct 08, 2019 Article Series
• Part II: Numpy im­ple­men­ta­tion (you are here)

This is part II of se­ries of three posts on op­ti­miz­ing python code. Using an ex­am­ple of com­put­ing pair-wise dis­tances un­der pe­ri­odic bound­ary con­di­tions, I will ex­plore sev­eral ways to op­ti­mize the python codes, in­clud­ing pure python im­ple­men­ta­tion with­out any third-party li­braries, Numpy im­ple­men­ta­tion, and im­ple­men­ta­tion us­ing Numba or Cython.

In this post, I show how to use Numpy to do the com­pu­ta­tion. I will demon­strate two dif­fer­ent im­ple­men­ta­tions.

## Background #

Just to re­it­er­ate, the com­pu­ta­tion is to cal­cu­late pair-wise dis­tances be­tween every pair of $N$ par­ti­cles un­der pe­ri­odic bound­ary con­di­tion. The po­si­tions of par­ti­cles are stored in an ar­ray/​list with form [[x1,y1,z1],[x2,y2,z2],...,[xN,yN,zN]]. The dis­tance be­tween two par­ti­cles, $i$ and $j$ is cal­cu­lated as the fol­low­ing,

$\Delta_{ij} = \sigma_{ij} - \left[ \sigma_{ij}/L \right] \cdot L$

where $\sigma_{ij}=x_i-x_j$ and $L$ is the length of the sim­u­la­tion box edge. $x_i$ and $x_j$ is the po­si­tions. For more in­for­ma­tion, you read up in Part I.

## Naive Numpy Implementation #

By naive, what I meant is that we sim­ply treat numpy ar­ray like a nor­mal python list and uti­lize some ba­sic numpy func­tions to com­pute quan­tity such as sum­ma­tion, mean, power, etc. To get to the point, the codes are the fol­low­ing,

def pdist_np_naive(positions, l):
"""
Compute the pair-wise distances between every possible pair of particles.

positions: a numpy array with form np.array([[x1,y1,z1],[x2,y2,z2],...,[xN,yN,zN]])
l: the length of edge of box (cubic/square box)
return: a condensed 1D list
"""
# determine the number of particles
n = positions.shape
# create an empty array storing the computed distances
pdistances = np.empty(int(n*(n-1)/2.0))
for i in range(n-1):
D = positions[i] - positions[i+1:]
D = D - np.round(D / l)  * l
distance = np.sqrt(np.sum(np.power(D, 2.0), axis=1))
idx_s = int((2 * n - i - 1) * i / 2.0)
idx_e = int((2 * n - i - 2) * (i + 1) / 2.0)
pdistances[idx_s:idx_e] = distance
return pdistances


### Benchmark #

n = 100
positions = np.random.rand(n,2)
%timeit pdist_np_naive(positions,1.0)
2.7 ms ± 376 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


The per­for­mance is not bad. This is roughly 4 times speedup com­pared to the pure python im­ple­men­ta­tion shown in Part I (might not be as fast as what one would ex­pect since the python code shown in the pre­vi­ous post is al­ready well-op­ti­mized). Is there any way we can speed up the cal­cu­la­tion? We know that for loops can be very slow in python. Hence, elim­i­nat­ing the for loop in the ex­am­ple above might be the cor­rect di­rec­tion. It turns out that we can achieve this by fully uti­liz­ing the broad­cast­ing fea­ture of numpy.

## Numpy im­ple­men­ta­tion us­ing broad­cast­ing #

To get rid of the loops in the codes above, we need to ﬁnd some numpy na­tive way to do the same thing. One typ­i­cal method is to use the broad­cast­ing. Consider the fol­low­ing ex­am­ple,

a = np.array([1,2,3])
b = 4
a + b
>>> array([5,6,7])


This is a sim­pler ex­am­ple of broad­cast­ing. The un­der­ly­ing op­er­a­tion, in this case, is a loop over the el­e­ment of a and add value of b to it. Instead of writ­ing the loop our­selves, you can sim­ply do a+b and numpy will do the rest. The term broadcasting” is in the sense that b is stretched to be the same di­men­sion of a and then el­e­ment-by-el­e­ment arith­metic op­er­a­tions are taken. Because the broad­cast­ing is im­ple­mented in C un­der the hood, it is much faster than writ­ing for loop ex­plic­itly.

The na­ture of pair-wise dis­tance com­pu­ta­tion re­quires dou­ble nested loops which it­er­ate over every pos­si­ble pair of par­ti­cles. It turns out that such a task can also be done us­ing broad­cast­ing. Again, I rec­om­mend read­ing their of­ﬁ­cial doc­u­men­ta­tion on broad­cast­ing. The ex­am­ple 4 on that page is a nested loop. Look at the ex­am­ple, shown be­low

import numpy as np
a = np.array([0.0,10.0,20.0,30.0])
b = np.array([1.0,2.0,3.0])
a[:, np.newaxis] + b
>>> array([[  1.,  2.,  3.],
[ 11., 12., 13.],
[ 21., 22., 23.],
[ 31., 32., 33.]])


Notice that the + op­er­a­tion is ap­plied on every pos­si­ble pair of el­e­ments from a and b. It is equvan­lently to the codes be­low,

a = np.array([0.0,10.0,20.0,30.0])
b = np.array([1.0,2.0,3.0])
c = np.empty((len(a), len(b)))
for i in range(len(a)):
for j in range(len(b)):
c[i,j] = a[i] + b[j]


The broad­cast­ing is much sim­pler re­gard­ing the syn­tax and faster in many cases (but not all) com­pared to ex­plicit loops. Let’s look at an­other ex­am­ple shown be­low,

a = np.array([[1,2,3],[-2,-3,-4],[3,4,5],[5,6,7],[7,6,5]])
diff = a[:, np.newaxis] - a
print('shape of array [a]:', a.shape)
print('Shape of array [diff]:', diff.shape)
>>> shape of array [a]: (5,3)
>>> shape of array [diff]: (5,5,3)


Array a, with shape (5,3), rep­re­sents 5 par­ti­cles with co­or­di­nates on three di­men­sions. If we want to com­pute the dif­fer­ences be­tween each par­ti­cle on each di­men­sion, a[:, np.newaxis] - a does the job. Quantity a[:, np.newaxis] - a has a shape (5,5,3) whose ﬁrst and sec­ond di­men­sion is the par­ti­cle in­dices and the third di­men­sion is spa­tial.

Following this path, we reach the ﬁ­nal code to com­pute the pair-wise dis­tances un­der pe­ri­odic bound­ary con­di­tion,

def pdist_np_broadcasting(positions, l):
"""
Compute the pair-wise distances between every possible pair of particles.

postions: numpy array storing the positions of each particle. Shape: (nxdim)
l: edge size of simulation box
return: nxn distance matrix
"""
D = positions[:, np.newaxis] - positions # D is a nxnxdim matrix/array
D = D - np.around(D / l) * l
# unlike the pdist_np_naive above, pdistances here is a distance matrix with shape nxn
pdistances = np.sqrt(np.sum(np.power(D, 2.0), axis=2))
return pdistances


### Benchmark #

n = 100
positions = np.random.rand(n,2)
>>> 1.43 ms ± 649 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


This is about twice as fast as the naive numpy im­ple­men­ta­tion.

pdist_np_broadcasting re­turns an ar­ray with shape (n,n) which can be con­sid­ered as a dis­tance ma­trix whose el­e­ment [i,j] is the dis­tances be­tween par­ti­cle i and j. As you can see, this ma­trix is sym­met­ric and hence con­tains du­pli­cated in­for­ma­tion. There are prob­a­bly bet­ter ways than what shown here to only com­pute the up­per tri­an­gle of the ma­trix in­stead of a full one.

Now let’s make a ﬁ­nal sys­tem­atic com­par­i­son be­tween pdsit_np_naive and pdist_np_broadcasting. I bench­mark the per­for­mance for dif­fer­ent val­ues of n and plot the speed as the func­tion of n. The re­sult is shown in the ﬁg­ure be­low,

The re­sult is sur­pris­ing. The broad­cast­ing ver­sion is faster only when the data size is smaller than 200. For large data set, the naive im­ple­men­ta­tion turns out to be faster. What is go­ing on? After googling a lit­tle bit, I found these StackOverﬂow ques­tions 1, 2, 3. It turns out that the prob­lem may lie in mem­ory us­age and ac­cess. Using the mem­ory-pro­ﬁler, I can com­pare the mem­ory us­age from the two ver­sions as a func­tion of n (see the ﬁg­ure be­low). The re­sult shows that pdist_np_broadcasting uses much more mem­ory than pdist_np_naive, which could ex­plain the dif­fer­ences in speed.

The ori­gin of the dif­fer­ence in mem­ory us­age is that for the pdist_np_naive ver­sion, the com­pu­ta­tion is split­ted into in­di­vid­ual it­er­ac­tions of the for loop. Whereas the pdist_np_broadcasting per­forms the com­pu­ta­tion in one sin­gle batch. pdist_np_naive ex­e­cutes D = positions[i] - positions[i+1:] in­side the loop and every sin­gle it­er­a­tion only cre­ates an ar­ray of D of size smaller than n. On the other hand, D = positions[:, np.newaxis] - positions and D = D - np.around(D / l) * l in pdist_np_broadcasting cre­ate sev­eral tem­po­rary ar­ray of size n*n.

## Summing up #

First, both of numpy im­ple­men­ta­tions shown here lead to sev­eral times of speed up com­par­ing to the pure python im­ple­men­ta­tion. When work­ing with nu­mer­i­cal com­pu­ta­tion, use Numpy usu­ally will give bet­ter per­for­mance. One of the coun­terex­am­ples would be ap­pend­ing to a list/​ar­ray where python’s append is much faster than numpy’s append.

Many on­line tu­to­ri­als and posts rec­om­mend us­ing the numpy’s broad­cast­ing fea­ture when­ever pos­si­ble. This is a largely cor­rect state­ment. However, the ex­am­ple given here shows that the de­tails of the im­ple­men­ta­tion of broad­cast­ing mat­ters. On numpy’s of­ﬁ­cial doc­u­men­ta­tion, it states

There are also cases where broad­cast­ing is a bad idea be­cause it leads to in­ef­ﬁ­cient use of mem­ory that slows com­pu­ta­tion

pdist_np_broadcasting is one of the ex­am­ples where broad­cast­ing might hurt per­for­mance. I guess the take-home mes­sage is that do not ne­glect space com­plex­ity (memory re­quire­ment) if you are try­ing to op­ti­mize the codes and numpy’s broad­cast­ing is not al­ways a good idea.

In the next post, I will show how to use Numba and Cython to boost the com­pu­ta­tion speed even more.