Effective Numerical Computation in NumPy and SciPy

Kimikazu Kato

PyCon JP 2014

September 13, 2014

About Myself

Kimikazu Kato
Chief Scientist at Silver Egg Technology Co., Ltd.

Ph.D. in Computer Science

Background in mathematics, numerical computation, algorithms, etc.

Less than 2 years of experience in Python
More than 10 years of experience in numerical computation

Now designing algorithms for recommendation systems and doing research on machine learning and data analysis.

This talk...

is about effective usage of NumPy/SciPy
is NOT an exhaustive introduction to their capabilities, but shows some case studies based on my experience and interests

Table of Contents

Introduction
Basics about NumPy
  Broadcasting
  Indexing
Sparse matrix
  Usage of scipy.sparse
  Internal structure
Case studies
Conclusion

Numerical Computation

Differential equations
Simulations
Signal processing
Machine learning
etc...

Why Numerical Computation in Python?

Productivity
  Easy to write
  Easy to debug
Connectivity with visualization tools
  Matplotlib
  IPython
Connectivity with web systems
  Many frameworks (Django, Pyramid, Flask, Bottle, etc.)

But Python is Very Slow!

Code in C

    #include <stdio.h>

    int main() {
        int i;
        double s=0;
        for (i=1; i<=100000000; i++) s+=i;
        printf("%.0f\n",s);
    }

Code in Python

    s=0.
    for i in xrange(1,100000001):
        s+=i
    print s

Both programs compute the sum of the integers from 1 to 100,000,000.

Benchmark result in a certain environment:
C: 0.109 sec (compiled with the -O3 option)
Python: 8.657 sec (about 80 times slower!!)

Better code

    import numpy as np
    a=np.arange(1,100000001)
    print a.sum()

Now it takes 0.188 sec. (Measured by the "time" command in Linux, loading time included)

Still slower than C, but sufficiently fast for a scripting language.
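As a small sanity check on the comparison above, the following sketch (written for Python 3, with a much smaller and purely hypothetical n so that it runs quickly) computes the same sum with an explicit loop, with NumPy, and with the closed form n(n+1)/2. The helper names loop_sum and numpy_sum are illustrative and do not come from the talk.

```python
import numpy as np

def loop_sum(n):
    """Sum 1..n with an explicit Python loop (slow for large n)."""
    s = 0
    for i in range(1, n + 1):
        s += i
    return s

def numpy_sum(n):
    """Sum 1..n by letting NumPy run the loop in compiled code."""
    # dtype=np.int64 avoids overflow where the default integer is 32-bit
    return int(np.arange(1, n + 1, dtype=np.int64).sum())

n = 1000000                   # hypothetical size, smaller than in the benchmark
expected = n * (n + 1) // 2   # closed form for 1 + 2 + ... + n
print(loop_sum(n) == expected)
print(numpy_sum(n) == expected)
```

Both versions should agree with the closed form; only the time they take differs.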

Lessons

Python is very slow when written badly.
Translating C (or Java, C#, etc.) code directly into Python is often a bad idea.
Python-friendly rewriting sometimes results in a drastic performance improvement.

Basic rules for better performance

Avoid for loops as far as possible
Utilize libraries' capabilities instead
Forget about the cost of copying memory
  A typical C programmer might care about it, but ...

Basic techniques for NumPy

Broadcasting
Indexing

Broadcasting

    >>> import numpy as np
    >>> a=np.array([0,1,2])
    >>> a*3
    array([0, 3, 6])

    >>> b=np.array([1,4,9])
    >>> np.sqrt(b)
    array([ 1., 2., 3.])

A function which is applied to each element when applied to an array is called a universal function.
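To make the idea of a universal function concrete, here is a minimal sketch comparing an element-by-element Python loop with the single call np.sqrt(b). The variable names looped and vectorized are illustrative only.

```python
import math
import numpy as np

b = np.array([1.0, 4.0, 9.0])

# Element-by-element version: the loop runs in the Python interpreter
looped = np.array([math.sqrt(x) for x in b])

# Universal function: the same loop runs inside NumPy's compiled code
vectorized = np.sqrt(b)

print(np.allclose(looped, vectorized))
print(isinstance(np.sqrt, np.ufunc))
```

The two results are the same; the universal function simply moves the loop out of Python.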

Broadcasting (2D)

    >>> import numpy as np
    >>> a=np.arange(9).reshape((3,3))
    >>> b=np.array([1,2,3])
    >>> a
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
    >>> b
    array([1, 2, 3])
    >>> a*b
    array([[ 0, 2, 6],
           [ 3, 8, 15],
           [ 6, 14, 24]])
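In the 2D example, the one-dimensional b is matched against the last axis of a, so each column of a is scaled. The sketch below contrasts that with scaling each row instead, which needs b reshaped to a column; the names by_column and by_row are illustrative only.

```python
import numpy as np

a = np.arange(9).reshape((3, 3))
b = np.array([1, 2, 3])

# b has shape (3,), so it is treated as a row and repeated down the rows:
# column j of a is multiplied by b[j]
by_column = a * b

# b[:, np.newaxis] has shape (3, 1), so it is repeated across the columns:
# row i of a is multiplied by b[i]
by_row = a * b[:, np.newaxis]

print(by_column)
print(by_row)
```

The second form is the one that reappears when a diagonal matrix is multiplied from the left.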

Indexing

    >>> import numpy as np
    >>> a=np.arange(10)
    >>> a
    array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>> indices=np.arange(0,10,2)
    >>> indices
    array([0, 2, 4, 6, 8])
    >>> a[indices]=0
    >>> a
    array([0, 1, 0, 3, 0, 5, 0, 7, 0, 9])
    >>> b=np.arange(100,600,100)
    >>> b
    array([100, 200, 300, 400, 500])
    >>> a[indices]=b
    >>> a
    array([100, 1, 200, 3, 300, 5, 400, 7, 500, 9])
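The indexing example replaces what would otherwise be a for loop over positions. A minimal sketch of that equivalence follows; the boolean-mask variant is an addition that is not shown on the slide, and all variable names are illustrative.

```python
import numpy as np

a = np.arange(10)

# Loop version: set every element at an even position to zero
looped = a.copy()
for i in range(0, 10, 2):
    looped[i] = 0

# Integer-array indexing performs the same assignment in one step
indexed = a.copy()
indexed[np.arange(0, 10, 2)] = 0

# A boolean mask selects elements by condition instead of by position
masked = a.copy()
masked[masked % 2 == 0] = 0

print(np.array_equal(looped, indexed))
print(np.array_equal(looped, masked))
```

Here the mask gives the same result only because the even values of a happen to sit at the even positions.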

References

Gabriele Lanaro, "Python High Performance Programming," Packt Publishing, 2013.
Stéfan van der Walt, "Numpy Medkit," http://mentat.za.net/numpy/numpy_advanced_slides/

Sparse matrix

Defined as a matrix in which most elements are zero
A compressed data structure is used to represent it, so that it will be...
  Space efficient
  Time efficient

scipy.sparse

The module scipy.sparse provides three main types for representing a sparse matrix. (There are other types, but they are not covered here.)

lil_matrix : convenient for setting data; setting a[i,j] is fast
csr_matrix : convenient for computation; fast to retrieve a row
csc_matrix : convenient for computation; fast to retrieve a column

Usually, set the data into a lil_matrix, and then convert it to csc_matrix or csr_matrix.

For csr_matrix and csc_matrix, calculation between matrices of the same type is fast, but you should avoid calculation between different types.
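The workflow described above can be sketched as follows: fill a lil_matrix, convert once, then use the converted matrix for row access, column access, and products. The matrix entries are arbitrary illustrative values.

```python
from scipy.sparse import lil_matrix

# Build incrementally in LIL format, where setting a[i, j] is cheap
a = lil_matrix((3, 3))
a[0, 0] = 1.0
a[0, 2] = 2.0
a[2, 1] = 3.0

# Convert once, then compute with the converted matrices
csr = a.tocsr()
csc = a.tocsc()

row0 = csr[0, :].toarray()   # row retrieval suits CSR
col2 = csc[:, 2].toarray()   # column retrieval suits CSC

# Keep both operands in the same format for arithmetic
product = csr.dot(csr)
print(csr.format, csc.format, product.format)
```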

Use case

    >>> from scipy.sparse import lil_matrix, csr_matrix
    >>> a=lil_matrix((3,3))
    >>> a[0,0]=1.; a[0,2]=2.
    >>> a=a.tocsr()
    >>> print a
      (0, 0) 1.0
      (0, 2) 2.0
    >>> a.todense()
    matrix([[ 1., 0., 2.],
            [ 0., 0., 0.],
            [ 0., 0., 0.]])
    >>> b=lil_matrix((3,3))
    >>> b[1,1]=3.; b[2,0]=4.; b[2,2]=5.
    >>> b=b.tocsr()
    >>> b.todense()
    matrix([[ 0., 0., 0.],
            [ 0., 3., 0.],
            [ 4., 0., 5.]])
    >>> c=a.dot(b)
    >>> c.todense()
    matrix([[ 8., 0., 10.],
            [ 0., 0., 0.],
            [ 0., 0., 0.]])
    >>> d=a+b
    >>> d.todense()
    matrix([[ 1., 0., 2.],
            [ 0., 3., 0.],
            [ 4., 0., 5.]])

Internal structure: csr_matrix

    >>> from scipy.sparse import lil_matrix, csr_matrix
    >>> a=lil_matrix((3,3))
    >>> a[0,1]=1.; a[0,2]=2.; a[1,2]=3.; a[2,0]=4.; a[2,1]=5.
    >>> b=a.tocsr()
    >>> b.todense()
    matrix([[ 0., 1., 2.],
            [ 0., 0., 3.],
            [ 4., 5., 0.]])
    >>> b.indices
    array([1, 2, 2, 0, 1], dtype=int32)
    >>> b.data
    array([ 1., 2., 3., 4., 5.])
    >>> b.indptr
    array([0, 2, 3, 5], dtype=int32)

Internal structure: csc_matrix

    >>> from scipy.sparse import lil_matrix, csr_matrix
    >>> a=lil_matrix((3,3))
    >>> a[0,1]=1.; a[0,2]=2.; a[1,2]=3.; a[2,0]=4.; a[2,1]=5.
    >>> b=a.tocsc()
    >>> b.todense()
    matrix([[ 0., 1., 2.],
            [ 0., 0., 3.],
            [ 4., 5., 0.]])
    >>> b.indices
    array([2, 0, 2, 0, 1], dtype=int32)
    >>> b.data
    array([ 4., 1., 5., 2., 3.])
    >>> b.indptr
    array([0, 1, 3, 5], dtype=int32)
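One way to read the three arrays is to rebuild the dense matrix from them by hand. The sketch below does this for the CSR arrays of the 3x3 example; csr_to_dense is a hypothetical helper written for illustration, not a SciPy function. For CSC the same logic applies with the roles of rows and columns swapped.

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_to_dense(data, indices, indptr, shape):
    """Rebuild a dense array from the three CSR arrays (illustrative helper)."""
    dense = np.zeros(shape)
    for i in range(shape[0]):
        # Row i owns the slice indptr[i]:indptr[i+1] of data and indices
        for k in range(indptr[i], indptr[i + 1]):
            dense[i, indices[k]] = data[k]
    return dense

# The CSR arrays of the 3x3 example matrix
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
indices = np.array([1, 2, 2, 0, 1])
indptr = np.array([0, 2, 3, 5])

dense = csr_to_dense(data, indices, indptr, (3, 3))
reference = csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()
print(np.array_equal(dense, reference))
```

In words: indptr says where each row starts in data, and indices gives the column of each stored value.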

Merit of knowing the internal structure

Setting csr_matrix or csc_matrix with its internal structure is much faster than setting lil_matrix with indices.

See the benchmark of setting

\[
\begin{pmatrix}
2 & 1 & & & \\
& 2 & 1 & & \\
& & \ddots & \ddots & \\
& & & \ddots & 1 \\
& & & & 2
\end{pmatrix}
\]

    from scipy.sparse import lil_matrix, csr_matrix
    import numpy as np
    from timeit import timeit

    def set_lil(n):
        a=lil_matrix((n,n))
        for i in xrange(n):
            a[i,i]=2.
            if i+1<n:
                a[i,i+1]=1.
        return a

    def set_csr(n):
        data=np.empty(2*n-1)
        indices=np.empty(2*n-1,dtype=np.int32)
        indptr=np.empty(n+1,dtype=np.int32)
        # to be fair, a for loop is intentionally used
        # (using the indexing technique is faster)
        for i in xrange(n):
            indices[2*i]=i
            data[2*i]=2.
            if i<n-1:
                indices[2*i+1]=i+1
                data[2*i+1]=1.
            indptr[i]=2*i
        indptr[n]=2*n-1
        a=csr_matrix((data,indices,indptr),shape=(n,n))
        return a

    print "lil:",timeit("set_lil(10000)",
        number=10,setup="from __main__ import set_lil")
    print "csr:",timeit("set_csr(10000)",
        number=10,setup="from __main__ import set_csr")

Result:

    lil: 11.6730761528
    csr: 0.0562081336975

Remark

When you deal with already sorted data, setting csr_matrix or csc_matrix with data, indices, indptr is much faster than setting lil_matrix.
But the code tends to be more complicated if you use the internal structure of csr_matrix or csc_matrix.
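The benchmark code notes that the indexing technique would be faster than its for loop. A possible loop-free version of the same construction is sketched below; set_csr_vectorized is a hypothetical name, and this variant was not part of the benchmark above.

```python
import numpy as np
from scipy.sparse import csr_matrix

def set_csr_vectorized(n):
    """Build the n x n matrix with 2 on the diagonal and 1 just above it."""
    data = np.empty(2 * n - 1)
    indices = np.empty(2 * n - 1, dtype=np.int32)
    data[0::2] = 2.0                  # diagonal entries sit at even positions
    data[1::2] = 1.0                  # superdiagonal entries sit at odd positions
    indices[0::2] = np.arange(n)      # column index of a[i, i]
    indices[1::2] = np.arange(1, n)   # column index of a[i, i+1]
    indptr = np.empty(n + 1, dtype=np.int32)
    indptr[:n] = 2 * np.arange(n)     # row i starts at position 2*i
    indptr[n] = 2 * n - 1             # total number of stored elements
    return csr_matrix((data, indices, indptr), shape=(n, n))

# Compare against a dense construction of the same matrix
small = set_csr_vectorized(5).toarray()
print(np.array_equal(small, 2 * np.eye(5) + np.eye(5, k=1)))
```

Strided slices such as data[0::2] play the role of the loop index, which is the same indexing idea introduced earlier.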

Case Studies

Case 1: Norms

\[ \|v\|^2 = \sum_i v_i^2 \]

If $v$ is dense:

    norm=np.dot(v,v)

Expressed as a product of matrices. (dot means matrix product, but you don't have to take the transpose explicitly.)

When $v$ is sparse, suppose that $v$ is expressed as a $1 \times n$ matrix:

    norm=v.multiply(v).sum()

(multiply() is the element-wise product)

multiply() is used instead of a product with the transpose because taking the transpose of a sparse matrix changes its type (for example, the transpose of a csr_matrix is a csc_matrix).

Squared Frobenius norm:

\[ \|A\|_{\mathrm{Fro}}^2 = \sum_{ij} a_{ij}^2 \]

    norm=a.multiply(a).sum()
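A minimal sketch of both norm computations on small, arbitrary inputs follows. Note that each expression yields the squared norm, that is, the sum of the squared elements. The variable names are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Vector case: the same values as a dense array and as a 1 x n sparse matrix
v_dense = np.array([0.0, 3.0, 0.0, 4.0])
v_sparse = csr_matrix(v_dense.reshape(1, -1))

sq_norm_dense = np.dot(v_dense, v_dense)
sq_norm_sparse = v_sparse.multiply(v_sparse).sum()

# Matrix case: sum of squares of all elements
a_dense = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
a_sparse = csr_matrix(a_dense)
sq_fro = a_sparse.multiply(a_sparse).sum()

print(sq_norm_dense, sq_norm_sparse, sq_fro)
```

Taking np.sqrt of any of these values gives the norm itself.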

Case 2: Applying a function to all of the elements of a sparse matrix

A universal function can be applied to a dense matrix:

    >>> import numpy as np
    >>> a=np.arange(9).reshape((3,3))
    >>> a
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
    >>> np.tanh(a)
    array([[ 0. , 0.76159416, 0.96402758],
           [ 0.99505475, 0.9993293 , 0.9999092 ],
           [ 0.99998771, 0.99999834, 0.99999977]])

This is convenient and fast.

However, we cannot do the same thing for a sparse matrix.

    >>> import numpy as np
    >>> from scipy.sparse import lil_matrix
    >>> a=lil_matrix((3,3))
    >>> a[0,0]=1.
    >>> a[1,0]=2.
    >>> b=a.tocsr()
    >>> np.tanh(b)
    <3x3 sparse matrix of type '<type 'numpy.float64'>'
            with 2 stored elements in Compressed Sparse Row format>

This is because, for an arbitrary function, the result of applying it to a sparse matrix is not necessarily sparse.

However, if a universal function $f$ satisfies $f(0) = 0$, the sparsity is preserved.

Then, how can we compute it?

Use the internal structure!!

The positions of the non-zero elements are not changed after application of the function.

Keep indices and indptr, and just change data.

Solution:

b = csr_matrix((np.tanh(a.data), a.indices, a.indptr), shape=a.shape)
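The one-line solution can be wrapped in a small helper and checked against the dense result. apply_to_nonzeros is a hypothetical name chosen for this sketch, and it assumes its input is already a csr_matrix and that the function maps 0 to 0.

```python
import numpy as np
from scipy.sparse import csr_matrix

def apply_to_nonzeros(func, m):
    """Apply func to the stored elements of a csr_matrix m (assumes func(0) == 0)."""
    # indices and indptr are reused unchanged; only data is transformed
    return csr_matrix((func(m.data), m.indices, m.indptr), shape=m.shape)

a = csr_matrix(np.array([[1.0, 0.0, 0.0],
                         [2.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0]]))
b = apply_to_nonzeros(np.tanh, a)

print(np.allclose(b.toarray(), np.tanh(a.toarray())))
print(b.nnz == a.nnz)
```

A function such as np.cos, for which cos(0) = 1, would give a wrong result with this helper, because the implicit zeros are never visited.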

Case 3: Formula which appears in a paper

In the algorithm for a recommendation system [1], the following formula appears:

\[ A^T \cdot D \cdot A \]

where $A$ is a dense $n \times f$ matrix, and $D$ is a diagonal matrix defined from a given array $(d_i)$ as:

\[
D =
\begin{pmatrix}
d_1 & & & \\
& d_2 & & \\
& & \ddots & \\
& & & d_n
\end{pmatrix}
\]

Here, $n$ (which corresponds to the number of users or items) is big and $f$ (which means the number of latent factors) is small.

[1] Hu et al. Collaborative Filtering for Implicit Feedback Datasets, ICDM, 2008.

Solution 1:

There is a special class dia_matrix to deal with a diagonal sparse matrix.

    import scipy.sparse as sparse
    import numpy as np

    def f(a,d):
        """a: 2d array of shape (n,f), d: 1d array of length n"""
        dd=sparse.diags([d],[0])
        return np.dot(a.T,dd.dot(a))

Solution 2:

Pack csr_matrix with data, indices, indptr

    data=d
    indices=[0,1,...,n-1]
    indptr=[0,1,...,n]

    def g(a,d):
        n,f=a.shape
        data=d
        indices=np.arange(n)
        indptr=np.arange(n+1)
        dd=sparse.csr_matrix((data,indices,indptr),shape=(n,n))
        return np.dot(a.T,dd.dot(a))

Solution 3:

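This is equivalent to broadcasting!

    def h(a,d):
        return np.dot(a.T*d,a)

\[
(A^T D) A =
\begin{pmatrix}
a_{11} & a_{21} & \cdots & a_{n1} \\
a_{12} & a_{22} & \cdots & a_{n2} \\
\vdots & \vdots & & \vdots \\
a_{1f} & a_{2f} & \cdots & a_{nf}
\end{pmatrix}
\times
\begin{pmatrix}
d_1 & & & \\
& d_2 & & \\
& & \ddots & \\
& & & d_n
\end{pmatrix}
\times A
\]

\[
=
\begin{pmatrix}
a_{11} d_1 & a_{21} d_2 & \cdots & a_{n1} d_n \\
a_{12} d_1 & a_{22} d_2 & \cdots & a_{n2} d_n \\
\vdots & \vdots & & \vdots \\
a_{1f} d_1 & a_{2f} d_2 & \cdots & a_{nf} d_n
\end{pmatrix}
\times A
\]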
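The identity behind the broadcasting solution can be checked numerically on small random inputs. The sketch below compares it with an explicit dense diagonal matrix and with the element-wise sum over i of a_ij d_i a_ik. The sizes n=6 and f=3 and the use of np.einsum are choices made for this illustration, not part of the talk.

```python
import numpy as np

def h(a, d):
    """Compute A^T D A by broadcasting; a has shape (n, f), d has length n."""
    # a.T has shape (f, n); multiplying by d (shape (n,)) scales column i by d[i]
    return np.dot(a.T * d, a)

rng = np.random.RandomState(0)
a = rng.random_sample((6, 3))   # small hypothetical sizes: n=6, f=3
d = rng.random_sample(6)

explicit = a.T.dot(np.diag(d)).dot(a)        # builds the dense n x n matrix D
summed = np.einsum('ij,i,ik->jk', a, d, a)   # sum over i of a_ij * d_i * a_ik

print(np.allclose(h(a, d), explicit))
print(np.allclose(h(a, d), summed))
```

The broadcasting version never forms the n x n matrix D, which matters when n is large.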

Benchmark

    def datagen(n,f):
        np.random.seed(0)
        a=np.random.random((n,f))
        d=np.random.random(n)
        return a,d

    from timeit import timeit
    print "dia_matrix :",timeit("f(a,d)",number=10,
        setup="from __main__ import f,datagen; a,d=datagen(1000000,10)")
    print "csr_matrix :",timeit("g(a,d)",number=10,
        setup="from __main__ import g,datagen; a,d=datagen(1000000,10)")
    print "broadcasting :",timeit("h(a,d)",number=10,
        setup="from __main__ import h,datagen; a,d=datagen(1000000,10)")

Result:

    dia_matrix : 1.60458707809
    csr_matrix : 1.32580018044
    broadcasting : 1.30032682419

Conclusion

Try not to use for loops; use libraries' capabilities instead.
Knowledge about the internal structure of the sparse matrix is useful to extract further performance.
Mathematical derivation is important. The key is to find a mathematically equivalent and Python-friendly formula.
Computational speed does not necessarily matter. Finding better code in a short time is valuable. Otherwise, you shouldn't pursue it too much.

Acknowledgment

I would like to thank (@shima__shima), who gave me useful advice on Twitter.
