TRANSCRIPT

EDGAR (Lawrence Berkeley National Laboratory)
• Energy-efficient Data and Graph Algorithms Research
• Funded by Applied Math, ASCR
• Early Career Research Program (start: 2013)
• PI: Aydın Buluç (Berkeley Lab)
• Postdoctoral Fellow:
  – Ariful Azad (100%)
• Students:
  – Veronika Strnadova-Neeley (since Oct 2015, UCSB)
  – Adam Sealfon (CSGF Fellow, Summer 2015, MIT)
  – Chaitanya Aluru (undergraduate, UC Berkeley)
EDGAR (FY15)
Awards:
• Aydın Buluç, IEEE TCSC Award for Excellence for Early Career Research by the IEEE Committee on Scalable Computing, 2015
• HipMer team (next slide), HPCwire's Readers' Choice Award for the Best Use of HPC Application in Life Sciences
Artifacts:
• Aydın Buluç, Guest Editor: Parallel Computing, special issue on "Graph Analysis for Scientific Discovery"
• Six peer-reviewed publications, one invited article
• Six invited talks (one at a conference, five at universities/labs)
HipMer: An Extreme-Scale De Novo Genome Assembler
[Figure: Meraculous assembly pipeline: reads → k-mer analysis → contigs → scaffolding using scalable alignment]
• New k-mer analysis filters errors using a probabilistic Bloom filter
• Graph algorithm (connected components) scales to 15K cores on NERSC's Edison
• New fast & parallel I/O
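The Bloom-filter trick in the k-mer analysis step can be sketched as follows. This is a minimal serial sketch with invented names, not HipMer's distributed implementation: a k-mer enters the exact, memory-hungry count table only after it has been seen at least twice, so single-occurrence k-mers, which are overwhelmingly sequencing errors, never enter the table and cost only a few bits each.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: num_hashes hash positions over a bit array."""
    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            h = hashlib.blake2b(item.encode(), salt=seed.to_bytes(16, "little"))
            yield int.from_bytes(h.digest()[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

def count_kmers(reads, k):
    """Count k-mers, admitting a k-mer to the exact hash table only on
    its second sighting; singletons (likely errors) stay in the filter."""
    seen_once = BloomFilter()
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
            elif kmer in seen_once:   # second sighting: start counting at 2
                counts[kmer] = 2
            else:
                seen_once.add(kmer)   # first sighting: remember cheaply
    return counts
```

Because a Bloom filter can report false positives but never false negatives, this filter can only let an occasional error k-mer through; it never drops a genuine repeated k-mer.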
Meraculous Assembly Pipeline
The Meraculous assembler is used in production at the Joint Genome Institute.
• Wheat assembly is a "grand challenge"
• Hardest part is contig generation (a large in-memory hash table that represents the graph)
• HipMer is an efficient parallelization of Meraculous
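The contig-generation step amounts to finding connected components in the k-mer graph held in that hash table. A serial sketch of the underlying idea using union-find (illustrative only; HipMer traverses a distributed hash table in parallel):

```python
def connected_components(edges):
    """Union-find with path halving: returns {node: component_root}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv                 # merge the two components

    return {node: find(node) for node in parent}

# Each edge links two k-mers that overlap by k-1 bases; every
# connected component then corresponds to one contig candidate.
```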
[Figure: HipMer strong scaling on Edison: time in seconds (log scale, 32–16,384) vs. number of cores (960–15,360) for overall time, k-mer analysis, contig generation, and scaffolding, against ideal overall time]
Performance improvement from days to minutes.
E. Georganas, A. Buluç, J. Chapman, S. Hofmeyr, C. Aluru, R. Egan, L. Oliker, D. Rokhsar, K. Yelick. HipMer: An extreme-scale de novo genome assembler. SC'15.
Communication-Avoiding Sparse Matrix-Matrix Multiply
[Figure: SpGEMM time (sec, log scale) vs. number of cores (64–16,384) for nlpkkt160 × nlpkkt160 on Edison; variants 2D (t=1, 3, 6) and 3D (c=4, t=1, 3; c=8, t=1, 6; c=16, t=6), with #threads increasing across the 2D variants and #layers & #threads increasing across the 3D variants]
3D-threaded (red) beats the previous state of the art (blue) by 8x at large concurrencies.
[Figure: 3D SpGEMM: A is split into layers A::1, A::2, A::3 along the summation dimension; each layer computes a local product A × B = C_intermediate, and AlltoAll exchanges merge the intermediates into C_final]
C_int(i,j,k) = Σ_{l=1}^{p/c} A(i,l,k) B(l,j,k)
Applications:
– Algebraic multigrid (AMG) restriction
– Graph computations
– Quantum chemistry
– Similarity computation (data mining)
– Interior-point optimization
A. Azad, G. Ballard, A. Buluç, J. Demmel, L. Grigori, O. Schwartz, S. Toledo, S. Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. arXiv preprint arXiv:1510.00844.
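The summation above can be sketched in miniature. The toy `spgemm_3d` below splits the inner index l into c layers, multiplies each layer locally, and then reduces the partial products, mimicking (without MPI) the role of the AlltoAll step. Function names and the dict-of-dicts format are illustrative, not the paper's implementation.

```python
def spgemm(A, B):
    """Sparse matrix-matrix multiply; matrices are dicts: row -> {col: val}."""
    C = {}
    for i, Arow in A.items():
        Crow = {}
        for l, a in Arow.items():               # nonzeros A[i][l]
            for j, b in B.get(l, {}).items():   # nonzeros B[l][j]
                Crow[j] = Crow.get(j, 0) + a * b
        if Crow:
            C[i] = Crow
    return C

def spgemm_3d(A, B, c, n):
    """Toy 3D decomposition: split the summation index l (0..n-1) into
    c layers, form one C_intermediate per layer, then reduce them --
    the step written as C_int(i,j,k) = sum_l A(i,l,k) B(l,j,k)."""
    partials = []
    for k in range(c):
        layer = range(k * n // c, (k + 1) * n // c)
        A_k = {i: {l: v for l, v in row.items() if l in layer}
               for i, row in A.items()}
        partials.append(spgemm(A_k, B))
    C = {}                                      # reduction across layers
    for P in partials:
        for i, row in P.items():
            for j, v in row.items():
                C.setdefault(i, {})[j] = C.get(i, {}).get(j, 0) + v
    return C
```

Because addition is associative, the layered sum reproduces the 2D result exactly; the payoff of the real 3D algorithm is communication, not arithmetic.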
[Figure: Relative performance of Push-Relabel, Pothen-Fan, and MS-BFS-Graft on kkt_power, hugetrace, delaunay, rgg_n24_s0, coPapersDBLP, amazon0312, cit-Patents, RMAT, road_usa, wb-edu, web-Google, and wikipedia; off-scale bars labeled 18x, 35x, 29x, and 42x]
Performance: on a 40-core Intel machine, on average 7x faster than the current best algorithm, and up to 42x faster.
[Figure: Speedup (1–32) vs. number of threads (1–64) on scientific and scale-free networks, with the effects of a change of socket and of hyperthreading marked]
Scaling: one node of Edison (24-core Intel Ivy Bridge); on average 17x speedup relative to the serial algorithm.
Parallel Maximum Cardinality Matching in Bipartite Graphs Using "Tree Grafting"
A. Azad, A. Pothen, A. Buluç. Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting. Under revision in IEEE Transactions on Parallel and Distributed Systems (TPDS).
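For reference, a minimal serial statement of the problem being parallelized: maximum cardinality matching by augmenting-path search (Kuhn's algorithm). This is only the textbook baseline; the paper's contribution, multi-source BFS with tree grafting, searches from many unmatched vertices at once and reuses (grafts) partially built alternating trees instead of discarding them.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum cardinality matching in a bipartite graph.
    adj[u] lists the right-side neighbors of left vertex u."""
    match_right = [-1] * n_right      # match_right[v] = matched left vertex

    def try_augment(u, visited):
        """DFS for an augmenting path starting at unmatched left vertex u."""
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be rematched elsewhere
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    matched = 0
    for u in range(len(adj)):
        if try_augment(u, set()):
            matched += 1
    return matched, match_right
```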
Counting and Enumerating Triangles using Matrix Algebra
[Figure: Example graph on six vertices with its adjacency matrix A, the triangular splits L and U, and the wedge matrices B and C]
A = L + U (hi→lo + lo→hi)
L × U = B (wedge, low hinge)
A ∧ B = C (closed wedge)
sum(C)/2 = 4 triangles
A. Azad, A. Buluç, and J. R. Gilbert. Parallel triangle counting and enumeration using matrix algebra. Graph Algorithm Building Blocks (GABB), IPDPS Workshops, 2015.
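The formulation above can be written out directly: B = L × U counts wedges whose hinge is the lowest-numbered of the three vertices, masking B by A keeps only wedges that an edge closes, and each triangle then appears twice in C. A dense serial sketch (the paper's version is a parallel sparse implementation):

```python
def count_triangles(A):
    """Count triangles via A = L + U, B = L x U, C = A ∧ B, sum(C)/2.
    A is a dense symmetric 0/1 adjacency matrix as a list of lists."""
    n = len(A)
    total = 0
    for i in range(n):
        for j in range(n):
            if A[i][j]:                 # mask: the wedge must be closed
                # B[i][j] = sum over hinges l < min(i, j) of L[i][l]*U[l][j]
                b = sum(A[i][l] * A[l][j] for l in range(min(i, j)))
                total += b              # C[i][j] = A[i][j] * B[i][j]
    return total // 2                   # each triangle counted twice
```

Restricting the hinge to l < min(i, j) is what makes every triangle {a < b < c} show up exactly once per direction (at C[b][c] and C[c][b]), hence the division by 2.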
Maximal Cardinality Matching using Matrix Algebra
A. Azad, A. Buluç. Distributed-memory algorithms for maximal cardinality matching using matrix algebra. In IEEE International Conference on Cluster Computing (CLUSTER), 2015 (full paper).
[Figure: Strong scaling of maximal cardinality matching: time (sec, log scale) vs. number of processors (256–16,384) on RMAT-26, RMAT-28, and RMAT-30 for (a) Greedy, (b) Karp-Sipser, and (c) Dynamic Mindegree]
[Figure: A sparse matrix A and the equivalent bipartite graph G(R, C, E) with rows r1–r5 and columns c1–c5; selecting the unmatched columns (frontier f_c) discovers the visited rows (f_r)]
Matrix-based primitives enable efficient and scalable distributed-memory implementations of various maximal cardinality matching algorithms solely through minimal modifications to the underlying semiring operator.
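As a toy illustration of that claim (the names and API here are invented, not the project's library interface), a semiring-generic sparse matrix-vector product lets one traversal kernel compute different things depending on which (add, multiply) pair is plugged in:

```python
def spmv(A, x, add, mul, zero):
    """Sparse matrix-vector multiply over a user-supplied semiring:
    y[i] = add-reduction over j of mul(A[i][j], x[j]).
    A: {row: {col: val}}, x: {col: val}; entries reducing to `zero`
    are treated as absent."""
    y = {}
    for i, row in A.items():
        acc = zero
        for j, v in row.items():
            if j in x:
                acc = add(acc, mul(v, x[j]))
        if acc != zero:
            y[i] = acc
    return y
```

With ordinary (+, ×) this is numeric SpMV; with a (max, select-second) pair, multiplying the frontier of unmatched columns with A makes each visited row record one frontier column as its parent, which is the flavor of operation the matching algorithms vary.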