Fast N-Body Algorithms for Massive Datasets

Alexander Gray, Georgia Institute of Technology

Is science in 2007 different from science in 1907?

Instruments

[Science, Szalay & J. Gray, 2001]

Is science in 2007 different from science in 1907?

Data: CMB Maps
1990 COBE 1,000
2000 Boomerang 10,000
2002 CBI 50,000
2003 WMAP 1 Million
2008 Planck 10 Million

Data: Local Redshift Surveys
1986 CfA 3,500
1996 LCRS 23,000
2003 2dF 250,000
2005 SDSS 800,000

Data: Angular Surveys
1970 Lick 1M
1990 APM 2M
2005 SDSS 200M
2008 LSST 2B

Instruments

[Figure: instrument data volumes growing exponentially, from ~1.0E+02 to ~1.0E+07, over 1985–2010.]

[Science, Szalay & J. Gray, 2001]

Sloan Digital Sky Survey (SDSS)

Size matters! Now possible:
• low noise: subtle patterns
• global properties and patterns
• rare objects and patterns
• more info: 3d, deeper/earlier, bands
• in parallel: more accurate simulations
• 2008: LSST – time-varying phenomena

1 billion objects, 144 dimensions

(~250M galaxies in 5 colors, ~1M 2000-D spectra)

Happening everywhere!
• Molecular biology: microarray chips
• Earth sciences: satellite topography
• Neuroscience: functional MRI
• Drug discovery: nuclear magnetic resonance
• Physical simulation: microprocessors
• Internet: fiber optics

Astrophysicist:
1. How did galaxies evolve?
2. What was the early universe like?
3. Does dark energy exist?
4. Is our model (GR+inflation) right?

Machine learning/statistics guy:
• Kernel density estimator
• n-point spatial statistics
• Nonparametric Bayes classifier
• Support vector machine
• Nearest-neighbor statistics
• Gaussian process regression
• Bayesian inference

Naive costs range from O(N^2) up to O(N^n) and O(N^3); with fast algorithms: O(c_D T(N)).

R. Nichol, Inst. Cosmol. Gravitation
A. Connolly, U. Pitt Physics
C. Miller, NOAO
R. Brunner, NCSA
G. Kulkarni, Inst. Cosmol. Gravitation
D. Wake, Inst. Cosmol. Gravitation
R. Scranton, U. Pitt Physics
M. Balogh, U. Waterloo Physics
I. Szapudi, U. Hawaii Inst. Astronomy
G. Richards, Princeton Physics
A. Szalay, Johns Hopkins Physics


Astrophysicist:
1. How did galaxies evolve?
2. What was the early universe like?
3. Does dark energy exist?
4. Is our model (GR+inflation) right?
…But I have 1 million points.

Machine learning/statistics guy: the methods above, at naive O(N^n), O(N^2), O(N^3) cost.

Data: The Stack

Apps (User, Science)
Perception: Computer Vision, NLP, Machine Translation, Bibleome, Autonomous vehicles
ML / Opt: Machine Learning / Optimization / Linear Algebra / Privacy
Data Abstractions: DBMS, MapReduce, VOTables, Data structures
Clustering / Threading: Programming with 1000s of powerful compute nodes
O/S / Network / Motherboards / Datacenter / ICs

Making fast algorithms

• There are many large datasets. There are many questions we want to ask of them.
  – Why we must not get obsessed with one specific dataset.
  – Why we must not get obsessed with one specific question.
• The activity I'll describe is about accelerating computations which occur commonly across many ML methods.

Scope
• Nearest neighbor
• K-means
• Hierarchical clustering
• N-point correlation functions
• Kernel density estimation
• Locally-weighted regression
• Mean shift tracking
• Mixtures of Gaussians
• Gaussian process regression
• Manifold learning
• Support vector machines
• Affinity propagation
• PCA
• …

Scope

• ML methods with distances underneath
  – Distances only
  – Continuous kernel functions
• ML methods with counting underneath

Scope

• Computational ideas in this tutorial:
  – Data structures
  – Monte Carlo
  – Series expansions
  – Problem/solution abstractions
• Challenges
  – Don't introduce error, if possible
  – Don't introduce tweak parameters, if possible

Two canonical problems

• Nearest-neighbor search:
$$NN(x_q) = \arg\min_r \|x_q - x_r\|$$

• Kernel density estimation:
$$\hat{f}(x_q) = \frac{1}{N} \sum_r K_h(\|x_q - x_r\|)$$

Ideas

1. Data structures and how to use them
2. Monte Carlo
3. Series expansions
4. Problem/solution abstractions

(Slides by Jeremy Kubica)

Nearest Neighbor - Naïve Approach

• Given a query point X.
• Scan through each point Y:
  – Calculate the distance d(X,Y).
  – If d(X,Y) < best_seen, then Y is the new nearest neighbor.
• Takes O(N) time for each query!

[Figure: 33 distance computations.]
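A minimal Python sketch of the naïve scan (NumPy assumed; names are illustrative):

```python
import numpy as np

def nn_naive(x, points):
    """Brute-force nearest neighbor: one distance computation per point, O(N)."""
    best, best_d = None, np.inf
    for y in points:
        d = float(np.linalg.norm(x - y))
        if d < best_d:                 # y is the new nearest neighbor
            best, best_d = y, d
    return best, best_d
```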


Speeding Up Nearest Neighbor

• We can speed up the search for the nearest neighbor:
  – Examine nearby points first.
  – Ignore any points that are farther than the nearest point found so far.
• Do this using a KD-tree:
  – Tree-based data structure
  – Recursively partitions points into axis-aligned boxes.


KD-Tree Construction

Pt   X     Y
1    0.00  0.00
2    1.00  4.31
3    0.13  2.85
…    …     …

We start with a list of n-dimensional points.


KD-Tree Construction

We can split the points into 2 groups by choosing a dimension X and value V and separating the points into X > V and X <= V.

Split: X > .5?
NO: Pt 1 (0.00, 0.00), Pt 3 (0.13, 2.85), …
YES: Pt 2 (1.00, 4.31), …


KD-Tree Construction

We can then consider each group separately and possibly split again (along the same or a different dimension).

Split: X > .5?
NO: Pt 1 (0.00, 0.00), Pt 3 (0.13, 2.85), …
YES: Pt 2 (1.00, 4.31), …


KD-Tree Construction

We can then consider each group separately and possibly split again (along the same or a different dimension).

Split: X > .5?
YES: Pt 2 (1.00, 4.31), …
NO: split again on Y > .1?
  NO: Pt 1 (0.00, 0.00), …
  YES: Pt 3 (0.13, 2.85), …


KD-Tree Construction

We can keep splitting the points in each set to create a tree structure. Each node with no children (leaf node) contains a list of points.


KD-Tree Construction

We will keep around one additional piece of information at each node: the (tight) bounds of the points at or below this node.


KD-Tree Construction

Use heuristics to make splitting decisions:

• Which dimension do we split along? The widest one.
• Which value do we split at? The median of that dimension's values over the points.
• When do we stop? When there are fewer than m points left, OR the box has hit some minimum width.
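A minimal Python sketch of this construction under the heuristics above (widest dimension, median value, leaf size m); the Node layout is illustrative:

```python
import numpy as np

class Node:
    def __init__(self, points, lo, hi, left=None, right=None):
        self.points = points           # point array, kept only at leaves
        self.lo, self.hi = lo, hi      # tight bounding box of points below
        self.left, self.right = left, right

def build_kdtree(points, m=10):
    lo, hi = points.min(axis=0), points.max(axis=0)    # tight bounds
    if len(points) <= m:                               # stop: fewer than m points
        return Node(points, lo, hi)
    dim = int(np.argmax(hi - lo))                      # split the widest dimension
    val = np.median(points[:, dim])                    # at the median value
    mask = points[:, dim] <= val
    if mask.all() or not mask.any():                   # degenerate split: make a leaf
        return Node(points, lo, hi)
    return Node(None, lo, hi,
                build_kdtree(points[mask], m),
                build_kdtree(points[~mask], m))
```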


Exclusion and inclusion, using point-node kd-tree bounds.

O(D) bounds on distance minima/maxima, for a point x and a node with bounding box [l, u]:

$$\min_i \|x - x_i\|^2 \ge \sum_{d=1}^{D} \big(\max(l_d - x_d,\, 0) + \max(x_d - u_d,\, 0)\big)^2$$

$$\max_i \|x - x_i\|^2 \le \sum_{d=1}^{D} \max\big((x_d - l_d)^2,\, (x_d - u_d)^2\big)$$


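A direct transcription of these two bounds into Python (NumPy assumed; lo and hi are the node's tight box from the construction sketch above):

```python
import numpy as np

def min_sqdist(x, lo, hi):
    # Lower bound: per-dimension distance to the box, zero inside it
    d = np.maximum(lo - x, 0) + np.maximum(x - hi, 0)
    return float(np.sum(d * d))

def max_sqdist(x, lo, hi):
    # Upper bound: per-dimension distance to the farther box face
    return float(np.sum(np.maximum((x - lo) ** 2, (x - hi) ** 2)))
```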


Nearest Neighbor with KD Trees

We traverse the tree looking for the nearest neighbor of the query point.


Nearest Neighbor with KD Trees

Examine nearby points first: Explore the branch of the tree that is closest to the query point first.




Nearest Neighbor with KD Trees

When we reach a leaf node: compute the distance to each point in the node.




Nearest Neighbor with KD Trees

Then we can backtrack and try the other branch at each node visited.


Nearest Neighbor with KD Trees

Each time a new closest point is found, we can update the distance bound.


Nearest Neighbor with KD Trees

Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.






Simple recursive algorithm (k=1 case)

NN(xq, R, dlo, xsofar, dsofar) {
  if dlo > dsofar, return.
  if leaf(R), [xsofar, dsofar] = NNBase(xq, R, dsofar).
  else,
    [R1, d1, R2, d2] = orderByDist(xq, R.l, R.r).
    NN(xq, R1, d1, xsofar, dsofar).
    NN(xq, R2, d2, xsofar, dsofar).
}
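A runnable Python version of the same recursion (a sketch reusing Node and min_sqdist from the earlier sketches; distances are kept squared):

```python
import numpy as np

def nn_search(xq, node, best=(None, np.inf)):
    # Prune: the box cannot contain anything closer than the best so far
    if min_sqdist(xq, node.lo, node.hi) > best[1]:
        return best
    if node.left is None:                        # leaf: base-case scan
        for y in node.points:
            d = float(np.sum((xq - y) ** 2))
            if d < best[1]:
                best = (y, d)
        return best
    # Recurse into the closer child first (orderByDist)
    for child in sorted((node.left, node.right),
                        key=lambda c: min_sqdist(xq, c.lo, c.hi)):
        best = nn_search(xq, child, best)
    return best
```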


Nearest Neighbor with KD Trees

Instead, some animations showing real data…
1. kd-tree with cached sufficient statistics
2. nearest-neighbor with kd-trees
3. range-count with kd-trees

For animations, see:http://www.cs.cmu.edu/~awm/animations/kdtree

Range-count example

[Animation frames: a range-count query descending the kd-tree; some nodes are Pruned! (inclusion) and others Pruned! (exclusion).]

Some questions
• Asymptotic runtime analysis?
  – In a rough sense, O(logN)
  – But only under some regularity conditions
• How high in dimension can we go?
  – Roughly exponential in intrinsic dimension
  – In practice, still big speedups in fewer than 100 dimensions

Another kind of tree
• Ball-trees, metric trees
  – Use balls instead of hyperrectangles
  – Can often be more efficient in high dimension (though not always)
  – Can work with a non-Euclidean metric (you only need to respect the triangle inequality)
  – Many non-metric similarity measures can be bounded by metric quantities.

[Figures: a set of points in a metric space; the ball-tree root node; successive levels of a ball tree.]

• J. Uhlmann, 1991
• S. Omohundro, NIPS 1991

Ball-trees: properties

Let Q be any query point and let x be a point inside ball B. By the triangle inequality:

|x − Q| ≥ |Q − B.center| − B.radius
|x − Q| ≤ |Q − B.center| + B.radius
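In code, the two bounds are a one-liner each (a minimal sketch, NumPy assumed):

```python
import numpy as np

def ball_bounds(Q, center, radius):
    # Triangle-inequality bounds on |x - Q| for any x inside the ball
    d = float(np.linalg.norm(Q - center))
    return max(d - radius, 0.0), d + radius   # (lower bound, upper bound)
```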

How to build a metric tree, exactly?

• Must balance quality vs. build-time
• 'Anchors hierarchy' (farthest-points heuristic, 2-approx used in OR)
• Omohundro: 'Five ways to build a ball-tree'
• Which is the best? A research topic…

Some other trees

• Cover-tree
  – Provable worst-case O(logN) under an assumption (bounded expansion constant)
  – Like a non-binary ball-tree
• Learning trees
  – In this conference

‘All’-type problems

• Nearest-neighbor search:
$$NN(x_q) = \arg\min_r \|x_q - x_r\|$$
All-nearest-neighbor (bichromatic):
$$\forall x_q:\ NN(x_q) = \arg\min_r \|x_q - x_r\|$$

• Kernel density estimation:
$$\hat{f}(x_q) = \frac{1}{N} \sum_r K_h(\|x_q - x_r\|)$$
'All' version (bichromatic):
$$\forall x_q:\ \hat{f}(x_q) = \frac{1}{N} \sum_r K_h(\|x_q - x_r\|)$$

Almost always 'all'-type problems
• Kernel density estimation
• Nadaraya-Watson & locally-weighted regression
• Gaussian process prediction
• Radial basis function networks

Always 'all'-type problems
• Monochromatic all-nearest neighbor (e.g. LLE)
• n-point correlation (n-tuples)

Dual-tree idea

If all the queries are available simultaneously, then it is faster to:

1. Build a tree on the queries as well.
2. Effectively process the queries in chunks rather than individually: work is shared between similar query points.

[Figures: single-tree pruning vs. dual-tree (symmetric) pruning.]

Exclusion and inclusion, using point-node kd-tree bounds.

O(D) bounds on distance minima/maxima:

$$\min_i \|x - x_i\|^2 \ge \sum_{d=1}^{D} \big(\max(l_d - x_d,\, 0) + \max(x_d - u_d,\, 0)\big)^2$$

$$\max_i \|x - x_i\|^2 \le \sum_{d=1}^{D} \max\big((x_d - l_d)^2,\, (x_d - u_d)^2\big)$$

Exclusion and inclusion, using kd-tree node-node bounds.

O(D) bounds on distance minima/maxima (analogous to the point-node bounds). Also needed: nodewise bounds.

Single-tree: simple recursive algorithm(k=1 case)

NN(xq, R, dlo, xsofar, dsofar) {
  if dlo > dsofar, return.
  if leaf(R), [xsofar, dsofar] = NNBase(xq, R, dsofar).
  else,
    [R1, d1, R2, d2] = orderByDist(xq, R.l, R.r).
    NN(xq, R1, d1, xsofar, dsofar).
    NN(xq, R2, d2, xsofar, dsofar).
}

Single-tree → Dual-tree
• xq → Q
• dlo(xq, R) → dlo(Q, R)
• xsofar → xsofar, dsofar → dsofar
• store Q.dsofar = max over q in Q of dsofar
• 2-way recursion → 4-way recursion
• N × O(logN) → O(N)

Dual-tree: simple recursive algorithm (k=1)

AllNN(Q, R, dlo, xsofar, dsofar) {
  if dlo > Q.dsofar, return.
  if leaf(Q) & leaf(R),
    [xsofar, dsofar] = AllNNBase(Q, R, dsofar). Q.dsofar = max over Q of dsofar.
  else if !leaf(Q) & leaf(R), …
  else if leaf(Q) & !leaf(R), …
  else if !leaf(Q) & !leaf(R),
    [R1, d1, R2, d2] = orderByDist(Q.l, R.l, R.r).
    AllNN(Q.l, R1, d1, xsofar, dsofar).
    AllNN(Q.l, R2, d2, xsofar, dsofar).
    [R1, d1, R2, d2] = orderByDist(Q.r, R.l, R.r).
    AllNN(Q.r, R1, d1, xsofar, dsofar).
    AllNN(Q.r, R2, d2, xsofar, dsofar).
    Q.dsofar = max(Q.l.dsofar, Q.r.dsofar).
}
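A Python sketch of this dual-tree recursion, under stated assumptions: it reuses the kd-tree Node layout from the construction sketch earlier, extended with a hypothetical .idx array of row indices at leaves and a .dsofar field initialized to infinity; nn and dist are per-query output arrays (dist starts at infinity).

```python
import numpy as np

def node_min_sqdist(A, B):
    # O(D) lower bound on the squared distance between boxes A and B
    d = np.maximum(A.lo - B.hi, 0) + np.maximum(B.lo - A.hi, 0)
    return float(np.sum(d * d))

def allnn(Q, R, nn, dist):
    # Prune: if even the closest possible (q, r) pair is worse than what
    # every query under Q has already found, R cannot help.
    if node_min_sqdist(Q, R) > Q.dsofar:
        return
    if Q.left is None and R.left is None:          # leaf-leaf base case
        for i, xq in zip(Q.idx, Q.points):
            for j, xr in zip(R.idx, R.points):
                d = float(np.sum((xq - xr) ** 2))
                if d < dist[i]:
                    dist[i], nn[i] = d, j
        Q.dsofar = float(dist[Q.idx].max())
    elif Q.left is None:                           # only R can be split
        for Rc in sorted([R.left, R.right],
                         key=lambda c: node_min_sqdist(Q, c)):
            allnn(Q, Rc, nn, dist)
    else:                                          # 4-way recursion, closer R child first
        Rkids = [R] if R.left is None else [R.left, R.right]
        for Qc in (Q.left, Q.right):
            for Rc in sorted(Rkids, key=lambda c: node_min_sqdist(Qc, c)):
                allnn(Qc, Rc, nn, dist)
        Q.dsofar = max(Q.left.dsofar, Q.right.dsofar)
```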

Dual-tree traversal (depth-first)

[Animation frames: a depth-first dual-tree traversal over the query points and reference points, descending both trees in tandem.]

Meta-idea: Higher-order divide-and-conquer

Break each set into pieces. Solving the sub-parts of the problem and combining these sub-solutions appropriately might be easier than doing this over only one set.

Generalizes divide-and-conquer of a single set to divide-and-conquer of multiple sets.

Ideas

1. Data structures and how to use them
2. Monte Carlo
3. Series expansions
4. Problem/solution abstractions

2-point correlation

"How many pairs have distance < r?"

$$\sum_{i}^{N} \sum_{j>i}^{N} I(\|x_i - x_j\| < r)$$

Characterization of an entire distribution? The 2-point correlation function.

The n-point correlation functions
• Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, …
• Foundation for theory of point processes [Daley, Vere-Jones 1972], unifies spatial statistics [Ripley 1976]
• Used heavily in biostatistics, cosmology, particle physics, statistical physics

2pcf definition:
$$dP = \lambda^2\, dV_1\, dV_2\, [1 + \xi(r)]$$

3pcf definition:
$$dP = \lambda^3\, dV_1\, dV_2\, dV_3\, [1 + \xi(r_{12}) + \xi(r_{13}) + \xi(r_{23}) + \zeta(r_{12}, r_{13}, r_{23})]$$

3-point correlation

"How many triples have pairwise distances < r?" (radii r₁, r₂, r₃)

$$\sum_{i}^{N} \sum_{j>i}^{N} \sum_{k>j}^{N} I(\|x_i - x_j\| < r_1)\, I(\|x_j - x_k\| < r_2)\, I(\|x_i - x_k\| < r_3)$$

Standard model: n > 0 terms should be zero!

How can we count n-tuples efficiently?

Use n trees! [Gray & Moore, NIPS 2000]

"How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?"

Recurse by splitting one node:
count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}

Exclusion: if some pair of nodes is entirely farther apart than r, then
count{A,B,C} = 0!

Inclusion: if every pair of nodes entirely satisfies the distance constraint, then
count{A,B,C} = |A| × |B| × |C|


Key idea (combinatorial proximity problems): for n-tuples, an n-tree recursion.
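A sketch of the n = 2 case in Python, showing both pruning rules; it assumes box-bounded nodes as in the kd-tree sketches above (node_min_sqdist from earlier), plus a hypothetical .count field (number of points at or below the node), and counts bichromatic pairs with squared distance below r2:

```python
import numpy as np

def node_max_sqdist(A, B):
    # O(D) upper bound on the squared distance between boxes A and B
    return float(np.sum(np.maximum((A.hi - B.lo) ** 2, (B.hi - A.lo) ** 2)))

def count_pairs(A, B, r2):
    if node_min_sqdist(A, B) >= r2:        # Exclusion: no pair can match
        return 0
    if node_max_sqdist(A, B) < r2:         # Inclusion: every pair matches
        return A.count * B.count
    if A.left is None and B.left is None:  # leaf-leaf: count directly
        d = ((A.points[:, None, :] - B.points[None, :, :]) ** 2).sum(-1)
        return int((d < r2).sum())
    # Otherwise split the larger (splittable) node and recurse.
    if A.left is not None and (B.left is None or A.count >= B.count):
        return count_pairs(A.left, B, r2) + count_pairs(A.right, B, r2)
    return count_pairs(A, B.left, r2) + count_pairs(A, B.right, r2)
```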

3-point runtime (biggest previous: 20K)

VIRGO simulation data, N = 75,000,000
naïve: 5×10⁹ sec. (~150 years)
multi-tree: 55 sec. (exact)

n=2: O(N)   n=3: O(N^{log 3})   n=4: O(N^2)

But…

The runtime depends on r^{D−1}: slow for large radii.

VIRGO simulation data, N = 75,000,000
naïve: ~150 years
multi-tree: large h: 24 hrs

Let's develop a method for large radii.

c = p · T

T, the total number of n-tuples, is known; p, the matching fraction, is hard. EASIER?

Sampling S tuples gives the confidence interval
$$\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/S}$$
with no dependence on N! But it does depend on p.

[Figure: regions of the tuple space marked "This is junk: don't bother" vs. "This is promising".]

Basic idea:

1. Remove some junk (run the exact algorithm for a while): this makes p larger.
2. Sample from the rest.

Get disjoint sets from the recursion tree.

[Figure: the recursion tree over all $\binom{N}{n}$ possible n-tuples; pruned branches are removed, leaving disjoint sets of tuples.]

Now do stratified sampling over the disjoint sets:

$$T = T_1 + T_2 + T_3$$

$$\hat{p} = \frac{T_1}{T}\hat{p}_1 + \frac{T_2}{T}\hat{p}_2 + \frac{T_3}{T}\hat{p}_3$$

$$\hat{\sigma}^2 = \left(\frac{T_1}{T}\right)^2 \hat{\sigma}_1^2 + \left(\frac{T_2}{T}\right)^2 \hat{\sigma}_2^2 + \left(\frac{T_3}{T}\right)^2 \hat{\sigma}_3^2$$
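A sketch of this estimator in Python (the interface is illustrative: each stratum i supplies its known tuple count T_i and a sampler that tests s randomly drawn tuples against the distance constraints):

```python
import numpy as np

def stratified_count(strata, s):
    """strata: list of (T_i, sampler) pairs; sampler(s) returns an array of
    s Bernoulli outcomes (1 if a sampled tuple satisfies the constraints).
    Returns the estimated count c = p_hat * T and a 99% CI half-width."""
    T = sum(Ti for Ti, _ in strata)
    p_hat, var = 0.0, 0.0
    for Ti, sampler in strata:
        pi = sampler(s).mean()                    # per-stratum estimate p_i
        p_hat += (Ti / T) * pi
        var += (Ti / T) ** 2 * pi * (1.0 - pi) / s
    z = 2.576                                     # z_{alpha/2} for 99% confidence
    return p_hat * T, z * np.sqrt(var) * T
```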

Speedup Results

VIRGO simulation data, N = 75,000,000
naïve: ~150 years
multi-tree: large h: 24 hrs
multi-tree monte carlo: 99% confidence: 96 sec

Ideas

1. Data structures and how to use them
2. Monte Carlo
3. Multipole methods
4. Problem/solution abstractions

Kernel density estimation

$$\forall x_q:\ \hat{f}(x_q) = \frac{1}{N} \sum_r K_h(\|x_q - x_r\|)$$

How to use a tree…
1. How to approximate?
2. When to approximate?

[Barnes and Hut, Science, 1987]

Approximate the contribution of a distant node R by its N_R points at once:
$$\sum_{i \in R} K(q, x_i) \approx N_R\, K(q, R), \quad \text{if } \frac{s}{r} < \theta$$
(s = size of node R, r = distance from q to R).

Maintain running bounds using the minimum and maximum possible query-node distances δ^{lo}_{qR} and δ^{hi}_{qR}:
$$\Phi^{lo}(q) \leftarrow \Phi^{lo}(q) + N_R\, K(\delta^{hi}_{qR}), \qquad \Phi^{hi}(q) \leftarrow \Phi^{hi}(q) + N_R\, K(\delta^{lo}_{qR})$$

How to use a tree…
3. How to know potential error?

Let's maintain bounds on the true kernel sum
$$\Phi(q) = \sum_i K(q, x_i)$$
At the beginning:
$$\Phi^{lo}(q) = N\, K^{lo}, \qquad \Phi^{hi}(q) = N\, K^{hi}$$

[Figures: single-tree and dual-tree (symmetric) bound maintenance. Gray & Moore 2000]

How to use a tree…
4. How to do the 'all' problem?

$$\forall x_q:\ \hat{f}(x_q) = \frac{1}{N} \sum_r K_h(\|x_q - x_r\|)$$

Approximate a whole pair of nodes at once:
$$\forall q \in Q:\ \sum_{i \in R} K(q, x_i) \approx N_R\, K(q, R), \quad \text{if } s \ge \max(r_Q, r_R)$$
(s = separation of nodes Q and R; r_Q, r_R their radii).

Generalizes Barnes-Hut to the dual-tree case.

Case 1 – alg. gives no error bounds
Case 2 – alg. gives error bounds, but must be rerun
Case 3 – alg. automatically achieves error tolerance

BUT: we have a tweak parameter. So far we have case 2; let's try for case 3: an automatic stopping rule.

Finite-difference function approximation.

Taylor expansion:
$$f(x) \approx f(a) + f'(a)(x - a)$$

Gregory-Newton finite form:
$$f(x) \approx f(x_i) + \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i}\,(x - x_i)$$

Applied to the kernel over a distance interval [δ^{lo}, δ^{hi}]:
$$K(\delta) \approx K(\delta^{lo}) + \frac{1}{2}\big(K(\delta^{hi}) - K(\delta^{lo})\big)$$

Finite-difference function approximation, applied to a node pair {Q,R} (assumes a monotonic decreasing kernel):

$$K \approx \frac{1}{2}\big(K(\delta^{lo}_{QR}) + K(\delta^{hi}_{QR})\big)$$

$$err \le N_R\,\frac{K(\delta^{lo}_{QR}) - K(\delta^{hi}_{QR})}{2}$$

Approximate {Q,R} if
$$K(\delta^{lo}_{QR}) - K(\delta^{hi}_{QR}) \le \frac{2\epsilon}{N}\,\Phi^{lo}(Q)$$

For each q in Q, add the approximated mass to f̂(x_q) and charge err(x_q) against the error budget.
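A sketch of the resulting approximation rule in Python (assumptions: the node-node bound helpers from the sketches above, a monotonically decreasing kernel K, R.count = N_R, the total N, the tolerance eps, and a running lower bound f_lo on the kernel sum):

```python
import numpy as np

def try_approximate(Q, R, K, eps, N, f_lo):
    """Return (approximated?, per-query contribution, per-query error bound)."""
    d_lo = np.sqrt(node_min_sqdist(Q, R))     # closest possible q-r distance
    d_hi = np.sqrt(node_max_sqdist(Q, R))     # farthest possible q-r distance
    K_hi, K_lo = K(d_lo), K(d_hi)             # decreasing kernel flips the bounds
    if K_hi - K_lo <= 2.0 * eps * f_lo / N:   # stopping rule from the slide
        contribution = R.count * 0.5 * (K_lo + K_hi)   # finite-difference midpoint
        err = R.count * 0.5 * (K_hi - K_lo)
        return True, contribution, err
    return False, 0.0, 0.0                    # must recurse on children instead
```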

Speedup Results (KDE)

One order-of-magnitude speedup over single-tree at ~2M points.

N       naïve      dual-tree
12.5K   7          .12
25K     31         .31
50K     123        .46
100K    494        1.0
200K    1976*      2
400K    7904*      5
800K    31616*     10
1.6M    35 hrs     23

(times in seconds unless noted)   5500x

Ideas

1. Data structures and how to use them
2. Monte Carlo
3. Multipole methods
4. Problem/solution abstractions

These are all examples of…

Generalized N-body problems

General theory and toolkit for designing algorithms for such problems

All-NN: {arg min, ‖·‖}
2-point: {Σ, I(·)}
3-point: {Σ, Σ, I(·)}
KDE: {Σ, K_h(·)}
SPH: {Σ, w(·), K_h(·), t}

Each problem is written as a choice of operators (arg min, min, Σ) and kernels (I, K_h, w) over tuples of points.

For more…

In this conference:
• Learning trees
• Monte Carlo for statistical summations
• Large-scale learning workshop
• EMST
• GNP's and MapReduce-like parallelization
• Monte Carlo SVD

agray@cc.gatech.edu
