TRANSCRIPT

EDGAR (Lawrence Berkeley National Laboratory)
• Energy-efficient Data and Graph Algorithms Research
• Funded by Applied Math, ASCR
• Early Career Research Program (start: 2013)
• PI: Aydın Buluç (Berkeley Lab)
• Postdoctoral Fellow:
  – Ariful Azad (100%)
• Students:
  – Veronika Strnadova-Neeley (since Oct 2015, UCSB)
  – Adam Sealfon (CSGF Fellow, Summer 2015, MIT)
  – Chaitanya Aluru (undergraduate, UC Berkeley)
EDGAR (FY15)
Awards:
• Aydın Buluç, IEEE TCSC Award for Excellence for Early Career Research by the IEEE Committee on Scalable Computing, 2015
• HipMer team (next slide), HPCwire's Readers' Choice Award for the Best Use of HPC Application in Life Sciences
Artifacts:
• Aydın Buluç, Guest Editor: Parallel Computing, special issue on "Graph Analysis for Scientific Discovery"
• Six peer-reviewed publications, one invited article
• Six invited talks (one at a conference, five at universities/labs)
HipMer: An Extreme-Scale De Novo Genome Assembler
[Figure: Meraculous assembly pipeline: reads → k-mer analysis → contigs → scaffolding using scalable alignment]
• New k-mer analysis filters errors using a probabilistic Bloom filter
• Graph algorithm (connected components) scales to 15K cores on NERSC's Edison
• New fast & parallel I/O
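The Bloom-filter trick in the k-mer analysis step can be sketched as follows. This is a minimal serial sketch with invented names, not HipMer's distributed implementation: a k-mer enters the exact, memory-hungry count table only after it has been seen at least twice, so single-occurrence k-mers, which are overwhelmingly sequencing errors, never enter the table and cost only a few bits each.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: num_hashes hash positions over a bit array."""
    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            h = hashlib.blake2b(item.encode(), salt=seed.to_bytes(16, "little"))
            yield int.from_bytes(h.digest()[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

def count_kmers(reads, k):
    """Count k-mers, admitting a k-mer to the exact hash table only on
    its second sighting; singletons (likely errors) stay in the filter."""
    seen_once = BloomFilter()
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
            elif kmer in seen_once:   # second sighting: start counting at 2
                counts[kmer] = 2
            else:
                seen_once.add(kmer)   # first sighting: remember cheaply
    return counts
```

Because a Bloom filter can report false positives but never false negatives, this filter can only let an occasional error k-mer through; it never drops a genuine repeated k-mer.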
Meraculous Assembly Pipeline
The Meraculous assembler is used in production at the Joint Genome Institute.
• Wheat assembly is a "grand challenge"
• Hardest part is contig generation (a large in-memory hash table that represents the graph)
• HipMer is an efficient parallelization of Meraculous
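The contig-generation step amounts to finding connected components in the k-mer graph held in that hash table. A serial sketch of the underlying idea using union-find (illustrative only; HipMer traverses a distributed hash table in parallel):

```python
def connected_components(edges):
    """Union-find with path halving: returns {node: component_root}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv                 # merge the two components

    return {node: find(node) for node in parent}

# Each edge links two k-mers that overlap by k-1 bases; every
# connected component then corresponds to one contig candidate.
```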
[Figure: HipMer strong scaling on Edison: time in seconds (log scale, 32–16,384) vs. number of cores (960–15,360) for overall time, k-mer analysis, contig generation, and scaffolding, against ideal overall time]
Performance improvement from days to minutes.
E. Georganas, A. Buluç, J. Chapman, S. Hofmeyr, C. Aluru, R. Egan, L. Oliker, D. Rokhsar, K. Yelick. HipMer: An extreme-scale de novo genome assembler. SC'15.
Communication-Avoiding Sparse Matrix-Matrix Multiply
[Figure: SpGEMM time (sec, log scale) vs. number of cores (64–16,384) for nlpkkt160 × nlpkkt160 on Edison; variants 2D (t=1, 3, 6) and 3D (c=4, t=1, 3; c=8, t=1, 6; c=16, t=6), with #threads increasing across the 2D variants and #layers & #threads increasing across the 3D variants]
3D-threaded (red) beats the previous state of the art (blue) by 8x at large concurrencies.
[Figure: 3D SpGEMM: A is split into layers A::1, A::2, A::3 along the summation dimension; each layer computes a local product A × B = C_intermediate, and AlltoAll exchanges merge the intermediates into C_final]
C_int(i,j,k) = Σ_{l=1}^{p/c} A(i,l,k) B(l,j,k)
Applications:
– Algebraic multigrid (AMG) restriction
– Graph computations
– Quantum chemistry
– Similarity computation (data mining)
– Interior-point optimization
A. Azad, G. Ballard, A. Buluç, J. Demmel, L. Grigori, O. Schwartz, S. Toledo, S. Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. arXiv preprint arXiv:1510.00844.
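The summation above can be sketched in miniature. The toy `spgemm_3d` below splits the inner index l into c layers, multiplies each layer locally, and then reduces the partial products, mimicking (without MPI) the role of the AlltoAll step. Function names and the dict-of-dicts format are illustrative, not the paper's implementation.

```python
def spgemm(A, B):
    """Sparse matrix-matrix multiply; matrices are dicts: row -> {col: val}."""
    C = {}
    for i, Arow in A.items():
        Crow = {}
        for l, a in Arow.items():               # nonzeros A[i][l]
            for j, b in B.get(l, {}).items():   # nonzeros B[l][j]
                Crow[j] = Crow.get(j, 0) + a * b
        if Crow:
            C[i] = Crow
    return C

def spgemm_3d(A, B, c, n):
    """Toy 3D decomposition: split the summation index l (0..n-1) into
    c layers, form one C_intermediate per layer, then reduce them --
    the step written as C_int(i,j,k) = sum_l A(i,l,k) B(l,j,k)."""
    partials = []
    for k in range(c):
        layer = range(k * n // c, (k + 1) * n // c)
        A_k = {i: {l: v for l, v in row.items() if l in layer}
               for i, row in A.items()}
        partials.append(spgemm(A_k, B))
    C = {}                                      # reduction across layers
    for P in partials:
        for i, row in P.items():
            for j, v in row.items():
                C.setdefault(i, {})[j] = C.get(i, {}).get(j, 0) + v
    return C
```

Because addition is associative, the layered sum reproduces the 2D result exactly; the payoff of the real 3D algorithm is communication, not arithmetic.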
[Figure: Relative performance of Push-Relabel, Pothen-Fan, and MS-BFS-Graft on kkt_power, hugetrace, delaunay, rgg_n24_s0, coPapersDBLP, amazon0312, cit-Patents, RMAT, road_usa, wb-edu, web-Google, and wikipedia; off-scale bars labeled 18x, 35x, 29x, and 42x]
Performance: on a 40-core Intel machine, on average 7x faster than the current best algorithm, and up to 42x faster.
[Figure: Speedup (1–32) vs. number of threads (1–64) on scientific and scale-free networks, with the effects of a change of socket and of hyperthreading marked]
Scaling: one node of Edison (24-core Intel Ivy Bridge); on average 17x speedup relative to the serial algorithm.
Parallel Maximum Cardinality Matching in Bipartite Graphs Using "Tree Grafting"
A. Azad, A. Pothen, A. Buluç. Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting. Under revision in IEEE Transactions on Parallel and Distributed Systems (TPDS).
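For reference, a minimal serial statement of the problem being parallelized: maximum cardinality matching by augmenting-path search (Kuhn's algorithm). This is only the textbook baseline; the paper's contribution, multi-source BFS with tree grafting, searches from many unmatched vertices at once and reuses (grafts) partially built alternating trees instead of discarding them.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum cardinality matching in a bipartite graph.
    adj[u] lists the right-side neighbors of left vertex u."""
    match_right = [-1] * n_right      # match_right[v] = matched left vertex

    def try_augment(u, visited):
        """DFS for an augmenting path starting at unmatched left vertex u."""
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be rematched elsewhere
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    matched = 0
    for u in range(len(adj)):
        if try_augment(u, set()):
            matched += 1
    return matched, match_right
```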
Counting and Enumerating Triangles using Matrix Algebra
[Figure: Example graph on six vertices with its adjacency matrix A, the triangular splits L and U, and the wedge matrices B and C]
A = L + U (hi→lo + lo→hi)
L × U = B (wedge, low hinge)
A ∧ B = C (closed wedge)
sum(C)/2 = 4 triangles
A. Azad, A. Buluç, and J. R. Gilbert. Parallel triangle counting and enumeration using matrix algebra. Graph Algorithm Building Blocks (GABB), IPDPS Workshops, 2015.
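The formulation above can be written out directly: B = L × U counts wedges whose hinge is the lowest-numbered of the three vertices, masking B by A keeps only wedges that an edge closes, and each triangle then appears twice in C. A dense serial sketch (the paper's version is a parallel sparse implementation):

```python
def count_triangles(A):
    """Count triangles via A = L + U, B = L x U, C = A ∧ B, sum(C)/2.
    A is a dense symmetric 0/1 adjacency matrix as a list of lists."""
    n = len(A)
    total = 0
    for i in range(n):
        for j in range(n):
            if A[i][j]:                 # mask: the wedge must be closed
                # B[i][j] = sum over hinges l < min(i, j) of L[i][l]*U[l][j]
                b = sum(A[i][l] * A[l][j] for l in range(min(i, j)))
                total += b              # C[i][j] = A[i][j] * B[i][j]
    return total // 2                   # each triangle counted twice
```

Restricting the hinge to l < min(i, j) is what makes every triangle {a < b < c} show up exactly once per direction (at C[b][c] and C[c][b]), hence the division by 2.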
Maximal Cardinality Matching using Matrix Algebra
A. Azad, A. Buluç. Distributed-memory algorithms for maximal cardinality matching using matrix algebra. In IEEE International Conference on Cluster Computing (CLUSTER), 2015 (full paper).
[Figure: Strong scaling of maximal cardinality matching: time (sec, log scale) vs. number of processors (256–16,384) on RMAT-26, RMAT-28, and RMAT-30 for (a) Greedy, (b) Karp-Sipser, and (c) Dynamic Mindegree]
[Figure: A sparse matrix A and the equivalent bipartite graph G(R, C, E) with rows r1–r5 and columns c1–c5; selecting the unmatched columns (frontier f_c) discovers the visited rows (f_r)]
Matrix-based primitives enable efficient and scalable distributed-memory implementations of various maximal cardinality matching algorithms solely through minimal modifications to the underlying semiring operator.
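As a toy illustration of that claim (the names and API here are invented, not the project's library interface), a semiring-generic sparse matrix-vector product lets one traversal kernel compute different things depending on which (add, multiply) pair is plugged in:

```python
def spmv(A, x, add, mul, zero):
    """Sparse matrix-vector multiply over a user-supplied semiring:
    y[i] = add-reduction over j of mul(A[i][j], x[j]).
    A: {row: {col: val}}, x: {col: val}; entries reducing to `zero`
    are treated as absent."""
    y = {}
    for i, row in A.items():
        acc = zero
        for j, v in row.items():
            if j in x:
                acc = add(acc, mul(v, x[j]))
        if acc != zero:
            y[i] = acc
    return y
```

With ordinary (+, ×) this is numeric SpMV; with a (max, select-second) pair, multiplying the frontier of unmatched columns with A makes each visited row record one frontier column as its parent, which is the flavor of operation the matching algorithms vary.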