exploiting bounded staleness to speed up big data …exploiting bounded staleness to speed up big...
TRANSCRIPT
Exploiting Bounded Staleness to Speed up Big Data Analytics
Henggang Cui James Cipar, Qirong Ho, Jin Kyu Kim,
Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Greg Ganger, Phil Gibbons (Intel), Garth Gibson, and Eric Xing
Carnegie Mellon University
Iterative program fits model
Big Data Analytics Overview
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 2
Huge input data Model parameters (solution)
One
iter
atio
n
Big Data Analytics Overview
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 3
Partitioned input data
Parallel iterative program
Model parameters (solution)
Big Data Analytics Overview
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 4
Partitioned input data
Parallel iterative program
Model parameters (solution)
Parameter server
Goal: Less sync overhead
Outline • Two novel synchronization approaches
• Arbitrarily-sized Bulk Synchronous Parallel (A-BSP) • Stale Synchronous Parallel (SSP)
• LazyTable architecture overview • Taste of experimental results
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 5
Bulk Synchronous Parallel
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 6
• A barrier every clock (a.k.a. epoch) • In ML apps, often one iteration over input data
Clock 0 1 2 3 4
Thread 1
Thread 2
Thread 3 Updates not necessarily visible
Thread 1 blocked by barrier
Iteration 0 1 2 3 4
Iterations complete, updates visible
Thread progress illustration:
Data Staleness • In BSP, threads can see "out-of-date" values
• May not see others' updates right away • Convergent apps usually tolerate that
• Allowing more staleness for speed • Less synchronizing among threads • More using cached values • More delaying and batching of updates
• But, too much staleness hurts convergence • Important to have staleness bound • Staleness should be tunable
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 7
0 1 2 3
0 1 2 3
Arbitrarily-sized BSP (A-BSP) • Work in each clock can be more than one iteration
– Less synchronization overhead
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 8
Clock 0 1 2
Thread 1
Thread 2
Thread 3
two iterations per clock Iteration 0 1 2 3 4
Thread 1 blocked by barrier
Problem of (A-)BSP: Stragglers • A-BSP still has the straggler problem
• A slow thread will slow down all • Stragglers are common
in large systems
• Many reasons for stragglers • Hardware: lost packets, SSD cleaning, disk resets • Software: garbage collection, virtualization • Algorithmic: calculating objectives and stopping
conditions
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 9
0 1 2
0 1 2 3 4
Stale Synchronous Parallel (SSP)
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 10
• Threads are allowed to be slack clocks ahead of the slowest thread
[HotOS’13, NIPS’13]
Thread 1
Thread 2
Thread 3
Slack of 1 clock
Thread 1 at clock 3
Iteration 0 1 2 3 4
Clock 0 1 2 3 4
Thread 2 at clock 2
Two Dimensional Config. Space • Iters-per-clock and slack are both tunable
• A-BSP is SSP with a slack of zero • Every SSP config. has an A-BSP counterpart
with the same data staleness bound
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 11
A-BSP (iters-per-clock=2, slack=0): SSP (iters-per-clock=1, slack=1):
Iteration 0 1 2 3 4
Clock 0 1 2 3 4
Clock 0 1 2
Iteration
0 1 2 3 4
LazyTable Architecture
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 12
Partitioned input data
Parallel iterative program on LazyTable
Model parameters (sharded)
Client-lib
Client-lib
Client-lib
Server
Server
Server
Server
LazyTable Architecture
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 13
Partitioned input data
Parallel iterative program on LazyTable
Model parameters (sharded)
Client-lib
Client-lib
Client-lib
Server
Server
Server
Server
See the paper for more details
Primary Experimental Setup • Hardware information
• 8 machines, each with 64 cores & 128GB RAM
• Basic configuration • One client & tablet server per machine • One computation thread per core
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 14
Application Benchmark #1 • Topic Modeling
• Algorithm: Gibbs Sampling on LDA • Input: NY Times dataset
– 300k docs, 100m words, 100k vocabulary • Solution quality criterion: Loglikelihood
– How likely the model generates observed data – Becomes higher as the algorithm converges – A larger value indicates better quality
More apps described and used in paper
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 15
Controlling Data Staleness • SSP
• Larger slack -> more staleness
• A-BSP • Larger iterations-per-clock -> more staleness
• The tradeoffs with increased staleness
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 16
Staleness Increases Iters/sec
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 17
0 100 200 3000
20
40
60
Iterations per s ec
Iteratio
ns don
e
Giters-per-clock is 1
0 100 200 3000
20
40
60
Iterations per s ec
Iteratio
ns don
e
T ime (s ec )
s la ck=0 (B S P )
Staleness Increases Iters/sec
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 18
iters-per-clock is 1
0 100 200 3000
20
40
60
Iterations per s ec
Iteratio
ns don
e
T ime (s ec )
s la ck=0 (B S P ) s la ck=1 (S S P )
Staleness Increases Iters/sec
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 19
larger iters per sec with more staleness
iters-per-clock is 1
0 100 200 3000
20
40
60
Iterations per s ec
Iteratio
ns don
e
T ime (s ec )
s la ck=0 (B S P ) s la ck=1 (S S P ) s la ck=3 (S S P )
Staleness Increases Iters/sec
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 20
iters-per-clock is 1
Staleness Reduces Converge/iter
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 21
0 20 40 60
Con
vergen
ce(highe
r is better)
Ite ra tions done
C onverg enc e per iter
iters-per-clock is 1
Staleness Reduces Converge/iter
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 22
0 20 40 60
Con
vergen
ce(highe
r is better)
Ite ra tions done
s la ck=0 (B S P )
C onverg enc e per iter
iters-per-clock is 1
0 20 40 60
Con
vergen
ce(highe
r is better)
Ite ra tions done
s la ck=0 (B S P ) s la ck=1 (S S P )
C onverg enc e per iter
Staleness Reduces Converge/iter
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 23
more iters to converge with more staleness
iters-per-clock is 1
0 20 40 60
Con
vergen
ce(highe
r is better)
Ite ra tions done
s la ck=0 (B S P ) s la ck=1 (S S P ) s la ck=3 (S S P )
C onverg enc e per iter
Staleness Reduces Converge/iter
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 24
iters-per-clock is 1
0 20 40 60
Con
vergen
ce(highe
r is better)
Ite ra tions done
s la ck=0 (B S P ) s la ck=1 (S S P ) s la ck=3 (S S P )
C onverg enc e per iter
0 100 200 300
C onverg enc e per s ec
Con
vergen
ce(highe
r is better)
T ime (s ec )
s la ck=0 (B S P ) s la ck=1 (S S P ) s la ck=3 (S S P )
Sweet Spot Balances the Two
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 25
0 100 200 3000
20
40
60
Iterations per s ec
Iteratio
ns don
e
T ime (s ec )
s la ck=0 (B S P ) s la ck=1 (S S P ) s la ck=3 (S S P )
Speed up with a good slack
Key Takeaway Insight #1
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 26
Fresher data Staler data
Iterations per second
Convergence per iteration
Convergence per second
The sweet spot
SSP vs A-BSP • Similar performance
• In the absence of stragglers
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 27
What about environment with stragglers?
Straggler Experiment #1 • Stragglers caused by background disruption
• Fairly common in large, shared clusters
• Experiment setup • One disrupter process per machine
– Uses 50% of CPU cycles • Work (disrupt) or sleep randomly for t seconds
– 10% work, 90% sleep More straggler experiments in the paper
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 28
Straggler Results #1
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 29
1 2 4 80
20
40
60
Disruption duration (sec)
Itera
tion
time
incr
ease
(%)
w/o disrupt, each iter takes 4.2 sec
1 2 4 80
20
40
60
Disruption duration (sec)
Itera
tion
time
incr
ease
(%) ideal
Straggler Results #1
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 30
Ideally 5%, because 50% slow down with 10% probability
w/o disrupt, each iter takes 4.2 sec
1 2 4 80
20
40
60
Disruption duration (sec)
Itera
tion
time
incr
ease
(%) ideal
wpc=2, slack=0 (A-BSP)
Straggler Results #1
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 31
w/o disrupt, each iter takes 4.2 sec
Ideally 5%, because 50% slow down with 10% probability
1 2 4 80
20
40
60
Disruption duration (sec)
Itera
tion
time
incr
ease
(%) ideal
wpc=2, slack=0 (A-BSP)wpc=1, slack=1 (SSP)
Straggler Results #1
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 32
SSP tolerates transient stragglers
w/o disrupt, each iter takes 4.2 sec
Ideally 5%, because 50% slow down with 10% probability
Conclusion • Staleness should be tuned
• By iters-per-clock and/or slack
• LazyTable implements SSP and A-BSP • See paper for details
• Key results from experiments • Both SSP and A-BSP are able to exploit the
staleness sweet-spot for faster convergence • SSP is tolerant of small transient stragglers • But SSP incurs more communication traffic
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 33
References • J. Cipar, G. Ganger, K. Keeton, C. B. Morrey, III, C. A. Soules, and A.
Veitch. LazyBase: trading freshness for performance in a scalable database. Eurosys’12.
• J. Cipar, Q. Ho, J. K. Kim, S. Lee, G. R. Ganger, G. Gibson, K. Keeton, and E. Xing. Solving the straggler problem with bounded staleness. HotOS’13.
• NYTimes: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words • Q. Ho, J. Cipar, H. Cui, S. Lee, J. Kim, P. Gibbons, G. Gibson, G.
Ganger, and E. Xing. More effective distributed ML via a stale synchronous parallel parameter server. NIPS’13.
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 34
BACK-UP
Example: Topic Modeling
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 36
Corpus of documents
word-topic table
Topic modeler
doc-topic table
... ... ... ... Doc i Topic 1: 0.8 Topic 2: 0.1 ... ... ... ... ...
• (i, j) represents iteration i, work uint j
BSP Progress and Staleness
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 37
(2,b)
(2,d)
(2,f)
Thre
ad
1
2
3
2
(3,a) (3,b) (4,a) (4,b)
(3,c) (3,d) (4,c) (4,d)
(3,e) (3,f) (4,e) (4,f) 3 Clock 4
barrier barrier barrier
... (2,a)
... (2,c)
... (2,e)
barrier
• A-BSP, wpc = 2 iterations
A-BSP Progress and Staleness
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 38
... (2,b)
(2,d)
(2,f)
Thre
ad
1
2
3
1
(3,a) (3,b) (4,a) (4,b)
(3,c) (3,d) (4,c) (4,d)
(3,e) (3,f) (4,e) (4,f) 2 Clock barrier barrier
(2,a)
... (2,c)
... (2,e)
SSP Progress and Staleness • SSP, wpc = 1 iteration, slack = 1 clock
• Same staleness bound as the A-BSP one – But more flexible
• Data staleness for SSP with wpc and slack: – wpc x (slack + 1) - 1
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 39
(2,b)
(2,d)
(2,f)
Thre
ad
1
2
3
2
(3,a) (3,b) (4,a) (4,b)
(3,c) (3,d) (4,c) (4,d)
(3,e) (3,f) (4,e) (4,f) 3 Clock 4
Slack of 1 clock
... (2,a)
... (2,c)
... (2,e)
SSP VS A-BSP • A-BSP is SSP with a slack of zero • Data staleness bound
• SSP {wpc, slack} == A-BSP {wpc x (slack + 1), 0}
• SSP is a “pipelined” version of A-BSP • Tolerates transient stragglers
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 40
LazyTable Architecture
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 41
Tablet server process-0
Client process-0
App. thread
Client library
Thread cache/oplog
Process cache/oplog
App. thread
Thread cache/oplog
Tablet server process-1
Client process-1
App. thread
Client library
Thread cache/oplog
Process cache/oplog
App. thread
Thread cache/oplog
Stragglers: Delay • Delaying some threads
• Artificially introduce stragglers to the system • Have some threads sleep() for a time
• Experiment setup • Threads sleep “d” seconds in turn
– Threads of machine “i” sleep at iteration “i” • Compare influence of different “d”
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 42
0 2 4 6 8 100
2
4
6
8
Delay d (sec)
Tim
e pe
r ite
ratio
n (s
ec)
idealwpc=2, slack=0 (A-BSP)wpc=1, slack=1 (SSP)
Stragglers: Delay (Results)
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 43
Ideal slow down: d/8 per iter on 8 machines
SSP tolerates transient stragglers
A-BSP slow down: d/2 per iter
The Cost of Increased Flexibility • Comparing {wpc=X, ...} with {wpc=2X, ...}
• Bytes sent doubled (send update twice as often) • Bytes received almost doubled
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 44
Send Receive0
50
100
150
Meg
abyt
es p
er it
erat
ion
(MB/
iter)
wpc=4, slack=0wpc=2, slack=1wpc=1, slack=3
smaller wpc, larger slack
Key Takeaway Insight #3 • SSP incurs more traffic
• Finer grained division of clocks • Avoiding barriers is still a win,
if communication is not the bottleneck
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 45
Prefetching • Conservative prefetch
• refresh row only when it’s too stale
• Aggressive prefetch
• refresh row every clock
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 46
Prefetching (results)
Henggang Cui © July 14!http://www.pdl.cmu.edu/ 47
0 20 40 60
-‐1.08x109
-‐1.04x109
-‐1.00x109
-‐9.60x108
Loglikelihoo
d
Work done (itera tions )
cons erva tive pf ag g res s ive pf
• wpc = 1 iter, slack = 7 clocks