2013 06-03 berlin buzzwords

26
Scaling Apache Giraph Nitay Joffe, Data Infrastructure Engineer [email protected] June 3, 2013

Upload: nitay-joffe

Post on 01-Nov-2014

11 views

Category:

Documents


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: 2013 06-03 berlin buzzwords

Scaling Apache Giraph

Nitay Joffe, Data Infrastructure Engineer

[email protected]

June 3, 2013

Page 2: 2013 06-03 berlin buzzwords

Agenda

1 Background

2 Scaling

3 Results

4 Questions

Page 3: 2013 06-03 berlin buzzwords

Background

Page 4: 2013 06-03 berlin buzzwords

What is Giraph?• Apache open source graph computation engine based on Google’s Pregel.• Support for Hadoop, Hive, HBase, and Accumulo.• BSP model with simple think like a vertex API.• Combiners, Aggregators, Mutability, and more.• Configurable Graph<I,V,E,M>:

– I: Vertex ID

– V: Vertex Value

– E: Edge Value

– M: Message data

What is Giraph NOT?• A Graph database. See Neo4J.• A completely asynchronous generic MPI system.• A slow tool.

implementsWritable

Page 5: 2013 06-03 berlin buzzwords

Why not Hive?

Inputformat

Outputformat

Map tasks

Intermediatefiles

Reducetasks

Output 0

Output 1

Input 0

Input 1

Iterate!

• Too much disk. Limited in-memory caching.• Each iteration becomes a MapReduce job!

Page 6: 2013 06-03 berlin buzzwords

Giraph components

Master – Application coordinator

• Synchronizes supersteps

• Assigns partitions to workers before superstep begins

Workers – Computation & messaging

• Handle I/O – reading and writing the graph

• Computation/messaging of assigned partitions

ZooKeeper

• Maintains global application state

Page 7: 2013 06-03 berlin buzzwords

Giraph Dataflow

Split 0

Split 1

Split 2

Split 3

Work

er

1

Mast

er

Work

er

0Input

formatLoad / SendGraph

Load / SendGraph

Loading the graph

1

Part 0

Part 1

Part 2

Part 3

Compute / Send

Messages

Work

er

1

Compute / Send

Messages

Mast

er

Work

er

0

In-memory graph

Send stats / iterate!

Compute/Iterate

2

Work

er

1W

ork

er

0 Part 0

Part 1

Part 2

Part 3

Output format

Part 0

Part 1

Part 2

Part 3

Storing the graph

3

Split 4

Split

Page 8: 2013 06-03 berlin buzzwords

Giraph Job Lifetime

Output

Active Inactive

Vote to Halt

Received Message

Vertex Lifecycle

All Vertices Halted?

InputCompute Superstep

No

Master halted?

No

Yes

Yes

Page 9: 2013 06-03 berlin buzzwords

Simple Example – Compute the maximum value

5

15

2

5

5

25

5

5

5

5

1

2

Processor 1

Processor 2

Time

Connected Componentse.g. Finding Communities

Page 10: 2013 06-03 berlin buzzwords

PageRank – ranking websites

Mahout (Hadoop)854 lines

Giraph< 30 lines

• Send neighbors an equal fraction of your page rank • New page rank = 0.15 / (# of vertices) + 0.85 *

(messages sum)

Page 11: 2013 06-03 berlin buzzwords

Scaling

Page 12: 2013 06-03 berlin buzzwords

Problem: Worker Crash.

Superstep i(no

checkpoint)

Superstep i+1

(checkpoint)

Superstep i+2(no

checkpoint)Worker failure!

Superstep i+1

(checkpoint)

Superstep i+2(no

checkpoint)

Superstep i+3

(checkpoint)Worker failure after

checkpoint complete!

Superstep i+3(no

checkpoint)

ApplicationComplete…

Solution: Checkpointing.

Page 13: 2013 06-03 berlin buzzwords

“Spare”Master 2

ActiveMaster State“Spare”

Master 1

“Active”Master 0

Before failure of active master 0

“Spare”Master 2

ActiveMaster State“Active”

Master 1

“Active”Master 0

After failure of active master 0

ZooKeeper ZooKeeper

Problem: Master Crash.

Solution: ZooKeeper Master Queue.

Page 14: 2013 06-03 berlin buzzwords

Problem: Primitive Collections.• Graphs often parameterized with {Null,Int,Long,Float,Double}• Boxing/unboxing. Objects have internal overhead.

3

Solution: Use fastutil, e.g. Long2DoubleOpenHashMap.

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues with a small memory footprint and fast access and insertion

1

24

5

1.2

0.50.8

0.4

1.7

0.7

Single Source Shortest Path

s

t

1.2

0.50.8

0.4

0.2

0.7

Network Flow

3

1

24

5

Count In-Degree

Page 15: 2013 06-03 berlin buzzwords

Problem: Too many objects.Lots of time spent in GC.

Graph: 1B Vertices, 200B Edges, 200 Workers.

• 1B Edges per Worker. 1 object per edge value.• List<Edge<I, E>> ~ 10B objects

• 5M Vertices per Worker. 10 objects per vertex value.• Map<I, Vertex<I, V, E> ~ 50M objects

• 1 Message per Edge. 10 objects per message data.• Map<I, List<M>> ~ 10B objects

• Objects used ~= O(E*e + V*v + M*m) => O(E*e)

Label Propagatione.g. Who’s sleeping?

3

1

24

5

Boring

Amazing

Q: What did he think?

0.5

0.2

0.8 0.36

0.17

0.41

Confusing

Page 16: 2013 06-03 berlin buzzwords

Problem: Too many objects.Lots of time spent in GC.

Solution: byte[]• Serialize messages, edges, and vertices.• Iterable interface with representative object.

Input Input Input

next()next()

next()Objects per worker ~= O(V)

Label Propagatione.g. Who’s sleeping?

3

1

24

5

Boring

Amazing

Q: What did he think?

0.5

0.2

0.8 0.36

0.17

0.41

Confusing

Page 17: 2013 06-03 berlin buzzwords

Problem: Serialization of byte[]• DataInput? Kyro? Custom?

Solution: Unsafe• Dangerous. No formal API. Volatile. Non-portable (oracle JVM only).

• AWESOME. As fast as it gets.• True native. Essentially C: *(long*)(data+offset);

Page 18: 2013 06-03 berlin buzzwords

Problem: Large Aggregations.

Worker

Worker

Worker Worke

r

Worker

Master

Workers own aggregators

Worker

Worker

Worker Worke

r

Worker

Master

Aggregator owners communicatewith Master

Worker

Worker

Worker Worke

r

Worker

Master

Aggregator owners distribute values

Solution: Sharded Aggregators.

Worker

Worker

Worker Worke

r

Worker

Master

K-Means Clusteringe.g. Similar Emails

Page 19: 2013 06-03 berlin buzzwords

Problem: Network Wait.• RPC doesn’t fit model.• Synchronous calls no good.

Solution: NettyTune queue sizes & threads

BarrierBarrier

Begin superstep

computenetwork

End compute

End superstep

wait

BarrierBarrier

Begin superstep

compute

network

wait

Time to first message

End compute

End superstep

Page 20: 2013 06-03 berlin buzzwords

Results

Page 21: 2013 06-03 berlin buzzwords

50 100 150 200 250 3000

50

100

150

200

250

300

350

400

450

2B Vertices, 200B Edges, 20 Compute Threads

Workers

Itera

tion

Tim

e (

sec)

Increasing Workers

Increasing Data Size

1000000000 1010000000000

50

100

150

200

250

300

350

400

450

50 Workers, 20 Compute Threads

EdgesIt

era

tion

Tim

e (

sec)

Scalability Graphs

Page 22: 2013 06-03 berlin buzzwords

Lessons Learned

• Coordinating is a zoo. Be resilient with ZooKeeper.• Efficient networking is hard. Let Netty help.• Primitive collections, primitive performance. Use fastutil.• byte[] is simple yet powerful.• Being Unsafe can be a good thing.

• Have a graph? Use Giraph.

Page 23: 2013 06-03 berlin buzzwords

What’s the final result?

Comparison with Hive:• 20x CPU speedup• 100x Elapsed time speedup. 15 hours => 9 minutes.

Computations on entire Facebook graph no longer “weekend jobs”.Now they’re coffee breaks.

Page 24: 2013 06-03 berlin buzzwords

Questions?

Page 25: 2013 06-03 berlin buzzwords

Problem: Measurements.

• Need tools to gain visibility into the system.• Problems with connecting to Hadoop sub-processes.

Solution: Do it all.• YourKit – see YourKitProfiler• jmap – see JMapHistoDumper• VisualVM –with jstatd & ssh socks proxy• Yammer Metrics• Hadoop Counters• Logging & GC prints

Page 26: 2013 06-03 berlin buzzwords

Problem: Mutations• Synchronization.• Load balancing.

Solution: Reshuffle resources• Mutations handled at barrier between supersteps.• Master rebalances vertex assignments to optimize

distribution.• Handle mutations in batches.• Avoid if using byte[].• Favor algorithms which don’t mutate graph.