TRANSCRIPT
Parallel Programming for Graph AnalysisDavid Ediger, Jason Riedy, Rob McColl, David A. Bader
Tutorial Schedule
• 08:30am – 08:45am: Welcome & Introduction
• 08:45am – 10:00am: Part 1: Network Analysis
• 10:00am – 10:30am: Coffee Break
• 10:30am – 12:00pm: Part 2: Static Parallel Algorithms
• 12:00pm – 01:30pm: Lunch
• 01:30pm – 03:00pm: Part 3: Dynamic Parallel Algorithms
• 03:00pm – 03:30pm: Coffee Break
• 03:30pm – 05:00pm: Part 4: STINGER
Parallel Programming for Graph Analysis 2
Tutorial Outline
• Introduction
  – Speakers
• 1: Network Analysis
  – Introduction to Graph Theory
  – Opportunities & Challenges
  – Parallel, Multicore & Multithreaded Architectures
  – Q&A
• 2: Static Parallel Algorithms
  – Graph Traversal / Graph500
  – Social Networking Algorithms
  – Graph Software Libraries
  – Q&A
• 3: Dynamic Parallel Algorithms
  – Data Structures for Streaming Data
  – Clustering Coefficients
  – Connected Components
  – Anomaly Detection
  – Q&A
• 4: GraphCT & STINGER
– Using GraphCT, app & lib.
– Using STINGER
  • Creation, Insertion, & Deletion
• Graph Traversal
• Connected Components
• Included Utilities
– Q&A
Parallel Programming for Graph Analysis 3
E. Jason Riedy
• Research Scientist II, Georgia Institute of Technology
  – School of Computational Science & Engineering
• Ph.D. 2010, University of California, Berkeley
  – Still likes Fortran...
• Contact Info:
  – College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
  – [email protected]
  – http://www.cc.gatech.edu/~jriedy
Parallel Programming for Graph Analysis 4
David Ediger
• Ph.D. Student, Georgia Tech
  – Advisor: Prof. David A. Bader
• B.S. 2008, The George Washington University
• M.S. 2010, Georgia Tech
Parallel Programming for Graph Analysis 5
Currently working on the development of parallel graph algorithms and software: GraphCT and STINGER.
Outline
• Introduction
• Network Analysis
  – Introduction to Graph Applications
  – Opportunities & Challenges
  – Parallel, Multicore & Multithreaded Architectures
• Static Parallel Algorithms
• Dynamic Parallel Algorithms
• GraphCT & STINGER
Parallel Programming for Graph Analysis 6
MOTIVATION
Parallel Programming for Graph Analysis 7
Exascale Analytics: Real-world challenges
• Health care – disease spread, detection and prevention of epidemics/pandemics (e.g., SARS, avian flu, H1N1 “swine” flu, …)
• Massive social networks – energy conservation requires social change, modeling pandemic spread, transportation and evacuation
• Intelligence – business analytics, anomaly detection, security, knowledge discovery from massive data sets
• Systems Biology – understanding complex life systems, drug design, microbial research, unraveling the mysteries of HIV; understanding life, disease, and evolution
• Electric Power Grid – communication, transportation, energy, water, food supply
• Modeling and Simulation – performing full-scale economic-social-political simulations
Requires dynamic Spatio-Temporal Interaction Networks and Graphs (STING)
Parallel Programming for Graph Analysis 8
Driving Forces in Social Network Analysis
• Facebook has more than 850 million active users
• Note the graph is changing as well as growing.
• What are this graph's properties? How do they change?
• Traditional graph partitioning often fails:
  – Topology: the interaction graph is low-diameter and has no good separators
  – Irregularity: communities are not uniform in size
  – Overlap: individuals are members of one or more communities
• Sample queries:
  – Allegiance switching: identify entities that switch communities
  – Community structure: identify the genesis and dissipation of communities
  – Phase change: identify significant changes in the network structure
Parallel Programming for Graph Analysis 9
Friendship Networks
• This image is a visualization of LiveJournal's friendship network. The network consists of 4.8 million people (vertices) connected by 69 million relationships (edges). Credit: Yifan Hu (AT&T)
Parallel Programming for Graph Analysis 10
Center for Adaptive Supercomputing Software (CASS-MT)
• CASS-MT, launched July 2008
• Pacific Northwest National Laboratory
  – Georgia Tech, Sandia, WA State, Delaware
• The newest breed of supercomputers have hardware set up not just for speed, but also to better tackle large networks of seemingly random data. And now, a multi-institutional group of researchers has been awarded over $14 million to develop software for these supercomputers. Applications include anywhere complex webs of information can be found: from internet security and power grid stability to complex biological networks.
David A. Bader 11
CASS-MT TASK 7: Analysis of Massive Social Networks
David A. Bader 12
Objective: To design software for the analysis of massive-scale spatio-temporal interaction networks using multithreaded architectures such as the Cray XMT. The Center launched in July 2008 and is led by Pacific Northwest National Laboratory.
Description: We are designing and implementing advanced, scalable algorithms for static and dynamic graph analysis, including generalized k-betweenness centrality and dynamic clustering coefficients.
Highlights: On a 64-processor Cray XMT, k-betweenness centrality scales nearly linearly (58.4x) on a graph with 16M vertices and 134M edges. Initial streaming clustering coefficients handle around 200k updates/sec on a similarly sized graph.
Our research is focusing on temporal analysis, answering questions about changes in global properties (e.g. diameter) as well as local structures (communities, paths).
Image Courtesy of Cray, Inc.
David A. Bader (CASS-MT Task 7 Lead), David Ediger, Jason Riedy, Henning Meyerhenke
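The highlights above mention dynamic (streaming) clustering coefficients; as background, here is a minimal sketch of the static local clustering coefficient that the streaming version maintains incrementally. The toy adjacency sets are illustrative assumptions, not data from the project.

```python
# Local clustering coefficient: fraction of a vertex's neighbor pairs that
# are themselves connected (i.e., triangles through the vertex).
# The streaming variant updates these triangle counts per edge change.

def local_clustering(adj, v):
    """adj: dict vertex -> set of neighbors (undirected). Returns a value in [0, 1]."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0
    # Count edges among neighbors; each is found twice (once from each endpoint).
    links = sum(1 for u in nbrs for w in adj[u] if w in nbrs) // 2
    return 2.0 * links / (d * (d - 1))

# Toy graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c0 = local_clustering(adj, 0)  # neighbors 1 and 2 are connected
```

On a streaming graph, inserting or deleting an edge only affects the coefficients of the two endpoints and their common neighbors, which is what makes the per-update cost small.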
Ubiquitous High Performance Computing (UHPC)
Goal: develop highly parallel, security-enabled, power-efficient processing systems, supporting ease of programming, with resilient execution through all failure modes and intrusion attacks
Program Objectives:
One PFLOPS, single cabinet including self-contained cooling
50 GFLOPS/W (equivalent to 20 pJ/FLOP)
Total cabinet power budget 57 kW, including processing resources, storage and cooling
Security embedded at all system levels
Parallel, efficient execution models
Highly programmable parallel systems
Scalable systems – from terascale to petascale
Architectural Drivers:
Energy Efficiency
Security and Dependability
Programmability
Echelon: Extreme-scale Compute Hierarchies with Efficient Locality-Optimized Nodes
“NVIDIA-Led Team Receives $25 Million Contract From DARPA to Develop High-Performance GPU Computing Systems” -MarketWatch
David A. Bader (CSE), Echelon Leadership Team
13
Parallel Programming for Graph Analysis
And more...
• Spatio-Temporal Interaction Networks and Graphs Software for Intel platforms
  – Intel Project on Parallel Algorithms in Non-Numeric Computing, multi-year project
  – Collaborators: Mattson (Intel), Gilbert (UCSD), Blelloch (CMU), Lumsdaine (Indiana)
• Graph500: http://graph500.org
  – Shooting for reproducible benchmarks for massive graph-structured data
14
Parallel Programming for Graph Analysis
Example: Mining Twitter for Social Good
15
ICPP 2010
Image credit: bioethicsinstitute.org
• CDC / Nation-scale surveillance of public health
• Cancer genomics and drug design
  – Computed betweenness centrality of the human proteome
Massive Data Analytics: Protecting our Nation
US High Voltage Transmission Grid (>150,000 miles of line)
Public Health
Parallel Programming for Graph Analysis 16
ENSG00000145332: Kelch-like protein implicated in breast cancer
Routing in transportation networks
Road networks, point-to-point shortest paths: from 15 seconds (naïve) down to 10 microseconds
H. Bast et al., “Fast Routing in Road Networks with Transit Nodes”, Science 27, 2007.
17Parallel Programming for Graph Analysis
Internet and the WWW
• The world-wide web can be represented as a directed graph
  – Web search and crawl: traversal
  – Link analysis, ranking: PageRank and HITS
  – Document classification and clustering
• Internet topologies (router networks) are naturally modeled as graphs
18Parallel Programming for Graph Analysis
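Link-analysis ranking, as mentioned above, can be sketched in a few lines of Python. The damping factor, iteration count, and toy web graph below are illustrative assumptions, not details from the slides.

```python
# Minimal PageRank power iteration on a small directed link graph.

def pagerank(edges, n, d=0.85, iters=50):
    """edges: list of (src, dst) directed links; n: number of pages."""
    out_deg = [0] * n
    for s, _ in edges:
        out_deg[s] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n
        for s, t in edges:
            new[t] += d * rank[s] / out_deg[s]
        # Dangling pages (no out-links) spread their rank uniformly.
        dangling = sum(rank[v] for v in range(n) if out_deg[v] == 0)
        rank = [r + d * dangling / n for r in new]
    return rank

# Page 0 links to 1 and 2; both link back to 0, so 0 accumulates the most rank.
r = pagerank([(0, 1), (0, 2), (1, 0), (2, 0)], 3)
```

Real web-scale ranking uses the same iteration over a sparse matrix-vector product, which is one reason graph analysis and sparse linear algebra keep converging.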
Scientific Computing
• Reorderings for sparse solvers
  – Fill-reducing orderings: partitioning, traversals, eigenvectors
  – Heavy diagonal to reduce pivoting (matching)
• Data structures for efficient exploitation of sparsity
• Derivative computations for optimization
  – Matroids, graph colorings, spanning trees
• Preconditioning
  – Incomplete factorizations
  – Partitioning for domain decomposition
  – Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  – Support theory: spanning trees & graph embedding techniques
B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”, http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf
19Parallel Programming for Graph Analysis
Graphs are pervasive in large-scale data analysis
• Sources of massive data: petascale simulations, experimental devices, the Internet, scientific applications.
• New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.
Astrophysics – Problem: outlier detection. Challenges: massive datasets, temporal variations. Graph problems: clustering, matching.
Bioinformatics – Problem: identifying drug target proteins. Challenges: data heterogeneity, quality. Graph problems: centrality, clustering.
Social Informatics – Problem: discover emergent communities, model spread of information. Challenges: new analytics routines, uncertainty in data. Graph problems: clustering, shortest paths, flows.
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com
20Parallel Programming for Graph Analysis
Data Analysis and Graph Algorithms in Systems Biology
• Study of the interactions between various components in a biological system
• Graph-theoretic formulations are pervasive:
  – Predicting new interactions: modeling
  – Functional annotation of novel proteins: matching, clustering
  – Identifying metabolic pathways: paths, clustering
  – Identifying new protein complexes: clustering, centrality
Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722–1736, 2003.
21Parallel Programming for Graph Analysis
Image Source: Nexus (Facebook application)
Graph-theoretic problems in social networks
– Community identification: clustering
– Targeted advertising: centrality
– Information spreading: modeling
22Parallel Programming for Graph Analysis
Network Analysis for Intelligence and Surveillance
• [Krebs ’04] Post 9/11 Terrorist Network Analysis from public domain information
• Plot masterminds correctly identified from interaction patterns: centrality
• A global view of entities is often more insightful
• Detect anomalous activities by exact/approximate graph matching / motifs
Image Source: http://www.orgnet.com/hijackers.html
Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47
23Parallel Programming for Graph Analysis
Characterizing Graph-theoretic computations
Input: graph abstraction.
Problem: find paths, clusters, partitions, matchings, patterns, orderings.

Factors that influence the choice of algorithm:
• graph sparsity (m/n ratio)
• static/dynamic nature
• weighted/unweighted, weight distribution
• vertex degree distribution
• directed/undirected
• simple/multi/hyper graph
• problem size
• granularity of computation at nodes/edges
• domain-specific characteristics

Graph algorithms:
• traversal
• shortest path algorithms
• flow algorithms
• spanning tree algorithms
• topological sort, …

Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.
24Parallel Programming for Graph Analysis
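Several of the factors listed above (sparsity, degree distribution) can be computed directly from an edge list. A minimal Python sketch on an assumed toy graph:

```python
# Characterize an undirected graph given as an edge list: m/n sparsity ratio,
# maximum degree, and the degree histogram (degree -> # of vertices).
from collections import Counter

def characterize(edges, n):
    """edges: list of (u, v) undirected edges; n: vertex count."""
    m = len(edges)
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    degrees = [deg[v] for v in range(n)]
    return {
        "sparsity_m_over_n": m / n,
        "max_degree": max(degrees),
        "degree_histogram": Counter(degrees),
    }

# Toy graph: a star on vertex 0 plus one extra edge.
stats = characterize([(0, 1), (0, 2), (0, 3), (1, 2)], 4)
```

For real informatics networks these same statistics (especially the degree histogram's heavy tail) are what drive the choice of data structure and load-balancing strategy.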
Massive data analytics in informatics networks
• Graphs arising in Informatics are very different from topologies in scientific computing.
• We need new data representations and parallel algorithms that exploit topological characteristics of informatics networks.
Emerging applications: dynamic, high-dimensional data
Static networks, Euclidean topologies
25Parallel Programming for Graph Analysis
What we’d like to infer from information networks
• What are the degree distributions, clustering coefficients, diameters, etc.?
– Heavy-tailed, small-world, expander, geometry+rewiring, local-global decompositions, ...
• How do networks grow, evolve, respond to perturbations, etc.?– Preferential attachment, copying, HOT, shrinking diameters, ..
• Are there natural clusters, communities, partitions, etc.?– Concept-based clusters, link-based clusters, density-based clusters, ...
• How do dynamic processes – search, diffusion, etc. – behave on networks?
– Decentralized search, undirected diffusion, cascading epidemics, ...
• How best to do learning, e.g., classification, regression, ranking, etc.?– Information retrieval, machine learning, ...
Slide credit: Michael Mahoney, Stanford
26Parallel Programming for Graph Analysis
CHALLENGES
Parallel Programming for Graph Analysis 27
Hierarchy of Interesting Graph Analytics
Extend single-shot graph queries to include time.
Are there s-t paths between time T1 and T2?
What are the important vertices at time T?
Use persistent queries to monitor properties. Does the path between s and t shorten drastically?
Is some vertex suddenly very central?
Extend persistent queries to fully dynamic properties.
Does a small community stay independent rather than merge with larger groups?
When does a vertex jump between communities?
New types of queries, new challenges...
28Parallel Programming for Graph Analysis
Graph Analytics for Social Networks
• Are there new graph techniques? Do they parallelize? Can the computational systems (algorithms, machines) handle massive networks with millions to billions of individuals? Can the techniques tolerate noisy data, massive data, streaming data, etc.?
• Communities may overlap, exhibit different properties and sizes, and be driven by different models
  – Detect communities (static or emerging)
  – Identify important individuals
  – Detect anomalous behavior
  – Given a community, find a representative member of the community
  – Given a set of individuals, find the best community that includes them
Parallel Programming for Graph Analysis 29
Open Questions for Massive Analytic Applications
• How do we diagnose the health of streaming systems?
• Are there new analytics for massive spatio-temporal interaction networks and graphs (STING)?
• Do current methods scale up from thousands to millions and billions?
• How do I model massive, streaming data?
• Are algorithms resilient to noisy data?
• How do I visualize a STING with O(1M) entities? O(1B)? O(100B)? With a scale-free power-law distribution of vertex degrees and diameter = 6?
• Can accelerators aid in processing streaming graph data?
• How do we leverage the benefits of multiple architectures (e.g., map-reduce clouds and massively multithreaded architectures) in a single platform?
Parallel Programming for Graph Analysis 30
Computational research challenges
• How do we solve massive graph problems (statistical analysis, dynamics, community detection, learning) with current algorithms/systems/languages/programming models/libraries?
• How do we get analytics routines to scale for massive datasets?
  – We need new algorithms that exploit graph topology & modern computer architectures
• What are the right abstractions for designing portable, high-performance graph algorithms?
– PRAM-like, BSP-like, MapReduce, matrices/tensors?
• Application-to-architecture mapping for current systems?
  – Multicore servers, commodity clusters, supercomputers, clouds
• What can we do with accelerators?– GPUs, Cell processor, Netezza and Azul data warehousing appliances
• What are the languages & programming models we should be using for high productivity/performance?
– C/C++ & threading, PGAS languages, MPI, Pig/Hive/Sawzall, …
31Parallel Programming for Graph Analysis
ARCHITECTURE
Parallel Programming for Graph Analysis 32
Flywheel has driven HPC into a corner
Parallel Programming for Graph Analysis 33
For decades, HPC has been on a vicious cycle of enabling applications that run well on HPC systems.
Real-World Applications(with natural parallelism)
Graphs have natural concurrency, yet HPC architectures are designed to achieve high floating-point rates by exploiting spatial and temporal locality.
• For the first time in decades, manycore and multithreaded designs let us rethink architecture.
These are today’s MPI applications!
34
Looking Ahead: Grand Challenge in Graph Analytics
• Driving real-world applications are not just traditional HPC:
  – health care, transportation, energy, proteomics, security, data sciences, informatics, …
• FLOPS are “free”; data is the challenge!
• Fundamental abstractions are more irregular and complex than sparse/dense matrices
  – Very sparse, very high-dimensional data
  – For example, there are few (if any) efficient distributed-memory parallel implementations of even the simplest algorithm for sparse, arbitrary graphs!
• We must integrate across algorithms, programming models, and architectures, to address the growing requirements of grand challenge applications
Parallel Programming for Graph Analysis
• Ideally, we want to do 1 (or more) memory op/clock cycle/processor.
• However, we have to deal with:
  – Caches
    • Reduce latency by storing data close to the processor
  – Vectors
    • Amortize latency by fetching N words at a time
  – Parallelism
    • Hide latency by switching tasks
    • Little’s law: concurrency = bandwidth * latency
Hiding memory latency
35Parallel Programming for Graph Analysis
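Little's law above can be made concrete with a worked instance. The bandwidth and latency numbers below are illustrative assumptions, not measurements from any machine in this tutorial:

```python
# Little's law applied to the memory system:
#   concurrency (outstanding requests) = bandwidth * latency.
# If we want to sustain the full bandwidth, we must keep this many
# memory operations in flight at once.

bandwidth_words_per_ns = 10.0   # assumed sustained bandwidth, 64-bit words/ns
latency_ns = 100.0              # assumed round-trip memory latency

concurrency = bandwidth_words_per_ns * latency_ns  # words in flight needed
```

A thousand in-flight references is far beyond what a handful of cache-miss buffers can supply, which is exactly why latency-tolerant designs rely on many hardware threads instead.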
• Tolerates latency by extreme multithreading
  – Each processor supports 128 hardware threads
  – Context switch in a single clock tick
  – “No” cache or local memory
  – Context switch on memory request
  – Multiple outstanding loads (configurable, often 180)
• Remote memory requests do not stall processors
  – Other streams work while the request gets fulfilled
• Light-weight, word-level synchronization– Minimizes access conflicts
• Hashed global shared memory– 64-byte granularity, minimizes hotspots
• High-productivity graph analysis!
Cray XMT Operation
36Parallel Programming for Graph Analysis
Image: Swiss NationalSupercomputer Centre
XMT ThreadStorm Processor (logical view)
[Diagram: programs running in parallel (serial code, subproblem A, subproblem B) decompose into concurrent threads of computation (i = 1 … n); these map onto the processor's 128 hardware streams, feeding the instruction-ready pool and the pipeline of executing instructions, with unused streams idle.]
Slide Credit: Cray, Inc.
37Parallel Programming for Graph Analysis
XMT ThreadStorm System (logical view)
[Diagram: the same programs and concurrent threads of computation, now multithreaded across multiple processors of the system.]
Slide Credit: Cray, Inc.
38Parallel Programming for Graph Analysis
Scale to larger problems by adding more nodes. Thousands possible... 0.5 TiB per 16 processors (XMT2).
What is less important on XMT:
• Placing data near computation
• Modifying shared data
• Accessing data in order
• Using indirection or linked data structures
• Partitioning the program into independent, balanced computations
• Using adaptive or dynamic computations
• Minimizing synchronization operations
Many costs are amortized with memory access!
39Parallel Programming for Graph Analysis
• Architecture supports up to 2 TiB, commercially available.
• Tolerates some latency by hyperthreading.
– Hardware: 2 threads / core, 10 cores / socket, four sockets.
– Fast cores (2.4 GHz), fast memory (1,066 MHz).
– Not so many outstanding memory requests (60/socket), but large caches (30 MiB L3 per socket).
• Good system support
– Transparent hugepages reduces TLB costs.
– Fast, user-level locking. (HLE may be better...)
– Widely available tuned primitives
• Also high-productivity graph analysis!
Intel E7-8870-based platforms
40Parallel Programming for Graph Analysis
Image: Intel(tm) press kit
[Plot: performance rate (million traversed edges/s), 0–60 on the y-axis, vs. problem size (log2 # of vertices), 10–26 on the x-axis.]
Serial Performance of “approximate betweenness centrality” on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8MB L3 cache)
Input: Synthetic R-MAT graphs (# of edges m = 8n)
The problems: #1. The locality challenge: “Large memory footprint, low spatial and temporal locality impede performance”
~5X drop in performance between the “no LLC misses” and “O(m) LLC misses” regimes
41Parallel Programming for Graph Analysis
LLC: Last-Level Cache
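The O(m) LLC misses come from data-dependent neighbor accesses: in the standard compressed sparse row (CSR) layout, the offset array is read contiguously but the targets send the traversal to arbitrary vertex-indexed locations. A minimal CSR construction in Python, on an assumed toy edge list:

```python
# Build a CSR adjacency (offsets, targets) from a directed edge list.
# Neighbors of v are targets[offsets[v]:offsets[v+1]]; reading them is
# sequential, but following them to per-vertex state is the random,
# data-dependent access pattern that defeats caches on large graphs.

def to_csr(edges, n):
    """edges: list of (src, dst); n: vertex count. Returns (offsets, targets)."""
    counts = [0] * (n + 1)
    for s, _ in edges:
        counts[s + 1] += 1
    for i in range(n):            # prefix sums -> row offsets
        counts[i + 1] += counts[i]
    offsets = list(counts)
    targets = [0] * len(edges)
    fill = list(offsets)          # next free slot per source vertex
    for s, t in edges:
        targets[fill[s]] = t
        fill[s] += 1
    return offsets, targets

offsets, targets = to_csr([(0, 1), (0, 2), (1, 2)], 3)
```

Once the graph exceeds the last-level cache, nearly every one of those target lookups is a miss, which matches the roughly 5X performance drop described above.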
• Graph topology assumptions in classical algorithms do not match real-world datasets
• Parallelization strategies at loggerheads with techniques for enhancing memory locality
• Classical “work-efficient” graph algorithms may not fully exploit new architectural features
  – Increasing complexity of the memory hierarchy (x86), DMA support (Cell), wide SIMD, floating-point-centric cores (GPUs)
• Tuning implementations to minimize parallel overhead is non-trivial
  – Shared memory: minimizing overhead of locks, barriers
  – Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication with computation
The problems: #2. The parallel scaling challenge “Classical parallel graph algorithms perform poorly on current parallel systems”
42Parallel Programming for Graph Analysis
Parallel Programming for Graph Analysis 43
The Big Picture
[Diagram: an analyst makes queries; a “window” is extracted from massive databases via a high-latency query; fast graph queries then run against the graph, which resides in the memory of a supercomputer.]
Graph Extraction
• Filtering massive (petabytes to exabytes) raw data:
  – HTML documents → a link graph
  – E-mail archives → a collaboration network
  – Research papers → a citation network
  – A large network → a subnetwork
<li><a href="http://www.ieee.org/fellow/">IEEE Fellow</a>
<li> <a href="http://www.nsf.gov/">National Science Foundation</a> CAREER Award. <li>Georgia Tech College of Computing 2007 Dean's Award.<li>2008 IBM Shared University Research Award <li>2006 IBM Faculty Fellowship Award<li>2008 NVIDIA Professor Partnership Award <li>2006-7 Microsoft Research award <li>UNM Regents' Lecturer. <li><a href="http://www.computer.org/">IEEE Computer Society</a>
<a href="http://www.computer.org/chapter/DVP/">Distinguished Speaker</a>
…
44Parallel Programming for Graph Analysis
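Turning raw HTML like the fragment above into link-graph edges can be sketched in a few lines. The regex-based parse and the source URL below are illustrative simplifications (a real extractor would use a proper HTML parser and URL normalization):

```python
# Extract (source_page, target) edges from each href in an HTML document.
import re

def extract_links(page_url, html):
    """Return directed link-graph edges from page_url to each href target."""
    hrefs = re.findall(r'href="([^"]+)"', html)
    return [(page_url, h) for h in hrefs]

html = '<li><a href="http://www.ieee.org/fellow/">IEEE Fellow</a>'
edges = extract_links("http://example.org/bio", html)
```

Run over a crawl, the union of these edge lists is exactly the link graph that the traversal and ranking algorithms from Part 2 then consume.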
Graph Analysis
• Irregular traversal of large graph data (multi-billion vertices, up to a trillion edges).
45Parallel Programming for Graph Analysis
Parallel Programming for Graph Analysis 46
Informatics Graphs are Even Tougher
• Very different from graphs in scientific computing!
  – Graphs can be enormous
  – Power-law distribution of the number of neighbors
  – Small-world property: no long paths
  – Very limited locality, not partitionable
  – Highly unstructured
  – Edges and vertices have types
• Experience in scientific computing applications provides only limited insight.
Six degrees of Kevin Bacon
Source: Seokhee Hong
46
David A. Bader 47
Architectural Requirements for Efficient Graph Analysis (Challenges)
• Runtime is dominated by latency– Random accesses to global address space– Perhaps many at once
• Essentially no computation to hide memory costs
• Access pattern is data dependent– Prefetching unlikely to help– Usually only want small part of cache line
• Potentially abysmal locality at all levels of memory hierarchy
47
David A. Bader 48
Architectural Requirements for Efficient Graph Analysis (Desired Features)
• A large memory capacity for random access
  – For example, 1 trillion edges can fit into 8 TB (instead of 8 PB) of main memory
• Low latency / high bandwidth
  – For small messages!
• Latency tolerant
• Light-weight synchronization mechanisms
• Global address space
  – No graph partitioning required
  – Avoid memory-consuming profusion of ghost nodes
  – No local/global numbering conversions
48
Parallel Programming for Graph Analysis
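The 8 TB figure above is a simple back-of-the-envelope computation; the 8-bytes-per-edge cost assumed here (one 64-bit word per stored edge entry) is an assumption chosen to be consistent with that figure:

```python
# A trillion edges at one 64-bit word each fits in 8 TB of RAM.

edges = 10**12
bytes_per_edge = 8  # assumed: one 64-bit word per edge entry

ram_needed_tb = edges * bytes_per_edge / 10**12  # terabytes
```
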
Breadth-first search comparison: Graph500
• Graph500 benchmark: BFS on an artificial graph
• 2^Size vertices, 16 * 2^Size edges
• TEPS: traversed edges per second
49
Rank | Machine | Owner | Size | TEPS
1 | NNSA/SC Blue Gene/Q Prototype II (4096 nodes / 65,536 cores) | NNSA and IBM Research, T.J. Watson | 32 | 254.3 B
2 | Hopper XE6 (1800 nodes / 43,200 cores) | LBNL | 37 | 113.3 B
3 | Lomonosov (4096 nodes / 32,768 cores) | Moscow State University | 37 | 103.2 B
17 | mirasol (quad-socket Intel E7-8870, 2.4 GHz, 10 cores / 20 threads per socket) | Intel/Georgia Tech (UCB implementation) | 28 | 5.1 B
20 | Dingus (Convey HC1ex, 4 FPGA) | SNL | 28 | 1.7 B
35 | Matterhorn (64-processor XMT2) | CSCS | 31 | 884.6 M
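A minimal sketch of the benchmark's kernel: a breadth-first search from a root, followed by the TEPS computation (traversed edges / elapsed time). The toy graph is an assumption for illustration; the real benchmark runs tuned parallel BFS over 2^Size-vertex synthetic graphs.

```python
# BFS over an adjacency-list graph, counting every edge examination,
# then TEPS = edges traversed per second of kernel time.
import time
from collections import deque

def bfs(adj, root):
    """adj: dict vertex -> list of neighbors. Returns (parent map, edges touched)."""
    parent = {root: root}
    touched = 0
    q = deque([root])
    while q:
        v = q.popleft()
        for w in adj[v]:
            touched += 1
            if w not in parent:
                parent[w] = v
                q.append(w)
    return parent, touched

# Toy undirected graph stored as directed adjacency in both directions.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
t0 = time.perf_counter()
parent, touched = bfs(adj, 0)
teps = touched / (time.perf_counter() - t0)
```

Measuring traversed edges rather than FLOPS is the point of the benchmark: it rewards exactly the memory-system behavior the previous slides argue is the bottleneck.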
Graph Analysis Performance: Multithreaded (Cray XMT) vs. Cache-based Multicore
• SSCA#2 network, SCALE 24 (16.77 million vertices and 134.21 million edges)
David A. Bader 50
Q&A
Parallel Programming for Graph Analysis 51