P-GAS: Parallelizing a Many-Core Processor Simulator
Using PDES
Huiwei Lv, Yuan Cheng, Lu Bai,
Mingyu Chen, Dongrui Fan, Ninghui Sun
Institute of Computing Technology
PADS 2010, May 18, 2010
Motivation
• Multi-core platforms are common now
Courtesy: Sun® UltraSPARC T2
Courtesy: AMD® Opteron 6000
Courtesy: Intel® Nehalem
• System simulators are still sequential
Multi-core is wasted
Simulation speed is limited by single-core performance
Poor Scalability of Single-threaded Simulator
• Slowdown grows exponentially
• Not able to simulate future many-core systems (1000+ cores)
Too slow to simulate future many-cores
Goal: fast and accurate computer system simulation
[Figure: accuracy (functional to cycle-accurate) versus speed (slowdown). Labels: COTSon; (HPCA’10); (SIGOPS Oper. Syst. Rev.’09); (MICRO’06); (J. Comput.’09)]
Speed up 10x without loss of accuracy
Outline
• Motivation
• Implementation
  Background
  From DES to PDES
  Optimization
• Evaluation
• Conclusion
Godson-T Architecture Simulator
• Discrete Event Simulation (DES)
  one global event queue
  events are assigned to sinkers
  new events are inserted back into the event queue
• Fine-grained
[Figure: EVENT A, EVENT B]
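The sequential loop on this slide can be sketched as follows. This is a minimal illustration in Python, not code from the Godson-T simulator: the names `Event` and `run_des`, the handler signature, and the two toy sinkers are all hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(order=True)
class Event:
    time: int                              # simulated cycle at which the event fires
    seq: int                               # tie-breaker that keeps insertion order stable
    sinker: str = field(compare=False)     # component that consumes the event
    payload: str = field(compare=False)


# A handler consumes one event and returns (delay, sinker, payload) triples.
Handler = Callable[[Event], List[Tuple[int, str, str]]]


def run_des(initial: List[Tuple[int, str, str]],
            handlers: Dict[str, Handler],
            end_time: int) -> List[Tuple[int, str, str]]:
    """Run a single-threaded DES with one global event queue."""
    counter = itertools.count()
    queue = [Event(t, next(counter), s, p) for t, s, p in initial]
    heapq.heapify(queue)
    trace = []
    while queue and queue[0].time <= end_time:
        ev = heapq.heappop(queue)          # earliest event first
        trace.append((ev.time, ev.sinker, ev.payload))
        # New events produced by the sinker go back into the same queue.
        for delay, sinker, payload in handlers[ev.sinker](ev):
            heapq.heappush(queue, Event(ev.time + delay, next(counter), sinker, payload))
    return trace


# Toy sinkers (assumed for illustration only).
def core(ev: Event):
    return [(1, "cache", "load")] if ev.payload == "start" else []


def cache(ev: Event):
    return [(2, "core", "data")]


trace = run_des([(0, "core", "start")], {"core": core, "cache": cache}, end_time=10)
```

Every event passes through the one queue, which is why the loop cannot use more than one host core.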
SimK: PDES Framework
• Open source
• Conservative PDES
• Highly optimized
  pthreads
  lock-free user-level thread scheduling
• Modularized
  use the SimK API to implement an LP
[Figure: LPs sit on top of the SimK API; SimK provides schedule, exec, commu, sync, buffer, deploy; SimK runs on the cores of the host]
From DES to PDES
• Separate the global queue
• Group sinkers into logical processes (LPs), 1 queue per LP
• Events that cross LPs are wrapped with PDES time
[Figure: two LPs, each holding a router, a core, and a cache, connected through the PDES time wrapper]
E.g. Router Event
Router 0 sends an event to router 1
• before: router 0, core 0, cache 0, router 1, core 1, cache 1 share one event queue
• after: LP 0 holds router 0, core 0, cache 0; LP 1 holds router 1, core 1, cache 1; the event goes through the PDES time wrapper
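The router example can be sketched as one queue per LP plus a routing step that wraps cross-LP events with a timestamp. This is an illustrative sketch only: the class `LP`, the function `send`, and the dictionary used as the wrapper are hypothetical and are not the SimK API.

```python
import heapq


class LP:
    """A logical process: a group of sinkers with its own event queue."""

    def __init__(self, name, sinkers):
        self.name = name
        self.sinkers = set(sinkers)
        self.queue = []      # local queue of (time, seq, sinker, payload)
        self.seq = 0

    def insert(self, time, sinker, payload):
        heapq.heappush(self.queue, (time, self.seq, sinker, payload))
        self.seq += 1


def send(lps, src_lp, time, sinker, payload):
    """Route an event; return True if it crossed an LP boundary."""
    if sinker in src_lp.sinkers:
        # Same LP: behaves like the original DES insert.
        src_lp.insert(time, sinker, payload)
        return False
    # Different LP: wrap the event with its timestamp so the receiver
    # can order it against its own local events (hypothetical wrapper).
    dest = next(lp for lp in lps if sinker in lp.sinkers)
    wrapped = {"timestamp": time, "sinker": sinker, "payload": payload}
    dest.insert(wrapped["timestamp"], wrapped["sinker"], wrapped["payload"])
    return True


lp0 = LP("LP 0", ["router 0", "core 0", "cache 0"])
lp1 = LP("LP 1", ["router 1", "core 1", "cache 1"])

local = send([lp0, lp1], lp0, 5, "cache 0", "fill")        # stays inside LP 0
crossed = send([lp0, lp1], lp0, 6, "router 1", "packet")   # router 0 to router 1
```

In a multi-threaded build the insert into another LP's queue would also need a thread-safe hand-off, which this single-threaded sketch leaves out.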
Events from DES to PDES
• Single thread → multiple threads
• Conservative PDES
[Figure: events of Threads 1 to 4 on the simulation-time axis, marked in 1-cycle steps, with dependences between events]
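The conservative rule behind this slide can be sketched as a safe-time check: an LP only processes events that no neighbor could still precede. The functions `safe_bound` and `process_safe` and the sample clocks below are assumptions made for illustration, not part of P-GAS or SimK.

```python
import heapq


def safe_bound(neighbor_clocks, lookahead):
    """Earliest timestamp any neighbor could still send.

    A neighbor at clock t promises not to send events earlier
    than t + lookahead, so the minimum over all neighbors is a
    lower bound on future incoming timestamps.
    """
    return min(neighbor_clocks) + lookahead


def process_safe(queue, bound):
    """Pop and return every event with a timestamp strictly below the bound."""
    done = []
    while queue and queue[0][0] < bound:
        done.append(heapq.heappop(queue))
    return done


# Local events of one LP as (timestamp, label); values are made up.
queue = [(3, "a"), (4, "b"), (7, "c")]
heapq.heapify(queue)

# Two neighbors at cycles 4 and 6, lookahead of 1 cycle.
bound = safe_bound(neighbor_clocks=[4, 6], lookahead=1)
processed = process_safe(queue, bound)
```

With a 1-cycle lookahead the bound advances by only one cycle per synchronization, so LPs must synchronize very often.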
Grouping Into Big LPs
• Problem
  Avg. speedup is 1.8x with 16 threads (prototype with 16 1-core LPs)
• Cause of problem
  too many LPs + extremely small lookahead → high sync cost
• Solution
  group adjacent LPs into one big LP
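One way to see why grouping helps is to count the links that cross LP boundaries, since only those need PDES synchronization. The sketch below assumes a 4x4 2D mesh of cores and rectangular grouping; the topology, the sizes, and the function `cross_lp_links` are assumptions for illustration and are not taken from the slides.

```python
def cross_lp_links(width, height, block_w, block_h):
    """Count mesh links whose two endpoints fall in different LPs.

    Cores sit on a width x height grid; each LP owns one
    block_w x block_h rectangle of adjacent cores.
    """
    def lp_of(x, y):
        return (x // block_w, y // block_h)

    count = 0
    for y in range(height):
        for x in range(width):
            # Link to the right-hand neighbor.
            if x + 1 < width and lp_of(x, y) != lp_of(x + 1, y):
                count += 1
            # Link to the neighbor below.
            if y + 1 < height and lp_of(x, y) != lp_of(x, y + 1):
                count += 1
    return count


one_core_lps = cross_lp_links(4, 4, 1, 1)   # 16 LPs, one core each
big_lps = cross_lp_links(4, 4, 2, 2)        # 4 big LPs, 2x2 cores each
```

Under these assumptions the number of synchronized links drops from 24 to 8, and traffic between cores inside a big LP becomes ordinary local queue inserts.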
Final Parallelized version
• Parallel Discrete Event Simulation
  sinkers grouped into big LPs
  LPs bound to threads using the SimK API
  time sync between LPs using PDES
  scheduling and execution under the SimK framework
[Figure: SimK API layer (schedule, exec, commu, sync, buffer, deploy) on top of the host cores]
Evaluation Setup
• GAS vs. P-GAS
• 4 Quad-Core AMD Opteron 8347 SMP
  16 cores total, 64 GB memory
• Benchmark: SPLASH-2 kernels
  benchmark computing time is counted in wall-clock time
P-GAS Speedup
• 16 threads, SPLASH-2 kernels: avg. speedup is 9.8x
• Best speedup is 13.6x (LU, 16 threads)
• 5.3x super-linear speedup with 4 threads
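The reported numbers can be read as parallel efficiency, that is, speedup divided by thread count; a value above 1 is what "super-linear" means. The helper `efficiency` is a hypothetical name used only for this arithmetic.

```python
def efficiency(speedup, threads):
    """Parallel efficiency: speedup per thread (1.0 means linear scaling)."""
    return speedup / threads


# Speedups reported on the slide, paired with their thread counts.
results = {
    "average, 16 threads": (9.8, 16),
    "LU, 16 threads": (13.6, 16),
    "4 threads": (5.3, 4),
}

for label, (speedup, threads) in results.items():
    e = efficiency(speedup, threads)
    note = " (super-linear)" if e > 1 else ""
    print(f"{label}: efficiency {e:.2f}{note}")
```

Only the 4-thread result exceeds 1.0; the 16-thread results are sub-linear.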
Why super-linear speedup?
• More cores, more caches to use
• The insert-to-queue time is shorter
(5.3x super-linear speedup with 4 threads)
Conclusion
• P-GAS uses PDES to speed up a cycle-accurate many-core processor simulator
  speedup of 9.8x on a 16-core SMP
  cycle error < 0.04%
• Highly optimized conservative PDES could be used in fast and accurate system simulation
  multi-core/many-core processor simulation
  SMP cluster, many-core cluster ...
P-GAS: Parallelizing a Many-Core Processor Simulator Using PDES
Please email me your questions:
[email protected]
Open source release of our PDES framework:
http://simk.sf.net