P-GAS: Parallelizing a Many-Core Processor Simulator Using PDES

Huiwei Lv, Yuan Cheng, Lu Bai, Mingyu Chen, Dongrui Fan, Ninghui Sun

Institute of Computing Technology

[email protected]

PADS 2010, May 18, 2010



Motivation

• Multi-core platforms are common now

[Figure: image credits: Sun® UltraSPARC T2, AMD® Opteron 6000, AMD® Phenom, Intel® Nehalem]

• System simulators are still sequential

Multi-core is wasted

Simulation speed is limited by single-core performance

Poor Scalability of Single-threaded Simulator

• Slowdown grows exponentially

• Not able to simulate future many-core systems
  - 1000+ cores

Too slow to simulate future many-cores

Goal: fast and accurate computer system simulation

[Figure: simulators placed by accuracy (functional to cycle-accurate) against speed (slowdown). Labels: COTSon; citations: HPCA’10, SIGOPS Oper. Syst. Rev.’09, MICRO’06, J. Comput.’09]

Speed up 10x without losing accuracy

Outline

• Motivation

• Implementation
  - Background
  - From DES to PDES
  - Optimization

• Evaluation

• Conclusion

Godson-T Architecture Simulator

• Discrete Event Simulation (DES)
  - one global event queue
  - events are assigned to sinkers
  - new events are inserted back into the event queue

• Fine-grained

[Figure: EVENT A, EVENT B]
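The sequential DES loop described on this slide can be sketched as follows. This is a minimal illustration, not the GAS source: the `Simulator` class, the `core` and `cache` handlers, and their one- and two-cycle latencies are all hypothetical. It shows one global queue, events dispatched to sinkers, and new events inserted back into the same queue.

```python
import heapq
import itertools


class Simulator:
    """Sequential DES: one global event queue shared by all sinkers."""

    def __init__(self):
        self.queue = []               # min-heap ordered by (time, seq)
        self.seq = itertools.count()  # tie-breaker: FIFO order for equal times
        self.sinkers = {}             # sinker name -> handler(sim, time, payload)
        self.now = 0
        self.log = []                 # executed events, in execution order

    def register(self, name, handler):
        self.sinkers[name] = handler

    def schedule(self, time, sinker, payload=None):
        heapq.heappush(self.queue, (time, next(self.seq), sinker, payload))

    def run(self):
        while self.queue:
            time, _, sinker, payload = heapq.heappop(self.queue)
            self.now = time
            self.log.append((time, sinker, payload))
            # The handler may insert new events back into the global queue.
            self.sinkers[sinker](self, time, payload)


def core(sim, time, payload):
    # Hypothetical core: a load reaches the cache one cycle later.
    if payload == "load":
        sim.schedule(time + 1, "cache", "lookup")


def cache(sim, time, payload):
    # Hypothetical cache: the data returns to the core two cycles later.
    if payload == "lookup":
        sim.schedule(time + 2, "core", "data")


sim = Simulator()
sim.register("core", core)
sim.register("cache", cache)
sim.schedule(0, "core", "load")
sim.run()
print(sim.log)
```

Because every event passes through the single heap, only one thread can make progress at a time, which is the limitation the following slides address.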

SimK: PDES Framework

• Open source

• Conservative PDES

• Highly optimized
  - pthreads
  - lock-free user-level thread scheduling

• Modularized
  - use the SimK API to implement an LP

[Figure: SimK on a four-core host. Layers: schedule, exec; commu, sync, buffer, deploy; API; with LPs on top]

From DES to PDES

• Separate the global queue

• Group sinkers into logical processes (LPs), 1 queue per LP

• An event that crosses LPs is wrapped with a PDES time

[Figure: two LPs, each holding a router, a core, and a cache; the PDES time wrapper sits between them]
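A rough sketch of the partitioning step, assuming each LP owns a private queue plus an inbox for events from other LPs. The `LP` and `Partition` classes and the dictionary used as the "time wrapper" are illustrative names only; they are not the SimK API.

```python
import heapq
import itertools
from collections import deque


class LP:
    """A logical process: a group of sinkers with its own event queue."""

    def __init__(self, name, sinkers):
        self.name = name
        self.sinkers = set(sinkers)
        self.queue = []         # private min-heap, replaces the global queue
        self.inbox = deque()    # timestamped events arriving from other LPs
        self.seq = itertools.count()

    def push(self, time, sinker, payload):
        heapq.heappush(self.queue, (time, next(self.seq), sinker, payload))


class Partition:
    def __init__(self, lps):
        self.lps = lps
        self.owner = {s: lp for lp in lps for s in lp.sinkers}

    def send(self, src, time, dst, payload):
        src_lp, dst_lp = self.owner[src], self.owner[dst]
        if src_lp is dst_lp:
            # Local event: goes straight into the LP's own queue.
            src_lp.push(time, dst, payload)
        else:
            # Cross-LP event: wrapped with its timestamp (hypothetical
            # wrapper format) so the receiver can order it safely later.
            dst_lp.inbox.append(
                {"time": time, "sinker": dst, "payload": payload,
                 "from": src_lp.name}
            )


lp0 = LP("LP 0", ["router 0", "core 0", "cache 0"])
lp1 = LP("LP 1", ["router 1", "core 1", "cache 1"])
part = Partition([lp0, lp1])
part.send("core 0", 5, "cache 0", "load")       # stays inside LP 0
part.send("router 0", 7, "router 1", "packet")  # crosses to LP 1
```

Only the second call needs synchronization between threads; the first touches nothing outside LP 0.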

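E.g. Router Event

Router 0 sends an event to router 1

• before

[Figure: router 0, core 0, cache 0, router 1, core 1, and cache 1 share one event queue]

• after

[Figure: LP 0 holds router 0, core 0, and cache 0; LP 1 holds router 1, core 1, and cache 1; the event passes through the PDES time wrapper]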

Events from DES to PDES

• Single thread → multiple threads

• Conservative PDES

[Figure: events and their dependences on Thread 1 to Thread 4 along the simulation time axis, marked in 1-cycle steps]
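One common way to realize conservative PDES is a barrier per time window, sketched below. The slides do not say how SimK synchronizes, so the barrier, the one-cycle `LOOKAHEAD`, and the token-passing workload are all assumptions made for illustration.

```python
import heapq
import threading

LOOKAHEAD = 1  # assumed: a cross-LP event arrives at least one cycle later


class LP:
    def __init__(self, lp_id):
        self.id = lp_id
        self.queue = []               # local min-heap of (time, payload)
        self.inbox = []               # filled by other LPs' threads
        self.lock = threading.Lock()  # protects inbox only
        self.executed = []

    def post(self, time, payload):
        with self.lock:
            self.inbox.append((time, payload))


def run_lp(lp, peers, barrier, end_time):
    for now in range(end_time):
        # Absorb events that other LPs have sent so far.
        with lp.lock:
            arrivals, lp.inbox = lp.inbox, []
        for event in arrivals:
            heapq.heappush(lp.queue, event)
        # Safe to execute: with LOOKAHEAD >= 1 and a barrier per cycle,
        # no event stamped <= now can still arrive from another LP.
        while lp.queue and lp.queue[0][0] <= now:
            time, payload = heapq.heappop(lp.queue)
            lp.executed.append((time, payload))
            if payload > 0:
                # Hypothetical workload: pass a token to the next LP.
                target = peers[(lp.id + 1) % len(peers)]
                target.post(time + LOOKAHEAD, payload - 1)
        barrier.wait()  # nobody enters cycle now + 1 early


lps = [LP(i) for i in range(4)]
heapq.heappush(lps[0].queue, (0, 3))  # initial event on LP 0
barrier = threading.Barrier(len(lps))
threads = [threading.Thread(target=run_lp, args=(lp, lps, barrier, 5))
           for lp in lps]
for t in threads:
    t.start()
for t in threads:
    t.join()
print([lp.executed for lp in lps])
```

The barrier is paid once per simulated cycle by every thread, which is why a very small lookahead combined with many LPs makes synchronization expensive.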

Grouping Into Big LPs

• Problem: average speedup is only 1.8x with 16 threads (prototype with 16 one-core LPs)

• Cause: too many LPs and an extremely small lookahead lead to high synchronization cost

• Solution: group adjacent LPs into one big LP
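To see why grouping adjacent LPs helps, the sketch below counts how many neighbour links cross an LP boundary. It assumes a hypothetical 4x4 mesh of cores and square LP blocks; neither the topology nor the block shape is given on the slide.

```python
def lp_of(x, y, block):
    """Map mesh coordinate (x, y) to the id of the square LP that owns it."""
    return (x // block, y // block)


def cross_lp_links(width, height, block):
    """Count mesh links whose two endpoints belong to different LPs."""
    crossing = 0
    for x in range(width):
        for y in range(height):
            for nx, ny in ((x + 1, y), (x, y + 1)):  # right and down links
                if nx < width and ny < height:
                    if lp_of(x, y, block) != lp_of(nx, ny, block):
                        crossing += 1
    return crossing


# Hypothetical 4x4 mesh: sixteen 1-core LPs versus four 2x2 LPs.
print(cross_lp_links(4, 4, block=1))
print(cross_lp_links(4, 4, block=2))
```

Under these assumptions every link crosses an LP boundary when each core is its own LP, while 2x2 blocks keep most neighbour traffic inside one LP, so fewer events need the PDES time wrapper and fewer LPs take part in each synchronization.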

Final Parallelized Version

• Parallel Discrete Event Simulation
  - sinkers grouped into big LPs
  - LPs bound to threads using the SimK API
  - time sync between LPs using PDES
  - scheduling and execution under the SimK framework

[Figure: SimK on a four-core host. Layers: schedule, exec; commu, sync, buffer, deploy; API]

Outline

• Motivation

• Implementation

• Evaluation
  - Accuracy
  - Speedup

• Conclusion

Evaluation Setup

• GAS vs. P-GAS

• 4 Quad-Core AMD Opteron 8347 SMP
  - 16 cores total, 64 GB memory

• Benchmark: SPLASH-2 kernels
  - benchmark computing time is counted in wall-clock time

Cycle Count Error

• Avg. cycle count error: 0.04%

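The slide does not define the error metric. A plausible reading, assumed here, is the relative difference between the cycle count reported by P-GAS and the one reported by sequential GAS for the same benchmark. The cycle counts below are made-up inputs chosen only to show the arithmetic; they are not measurements.

```python
def cycle_count_error(reference_cycles, parallel_cycles):
    """Relative cycle count error of the parallel run against the reference."""
    return abs(parallel_cycles - reference_cycles) / reference_cycles


# Hypothetical cycle counts, for illustration only.
gas_cycles = 1_000_000
pgas_cycles = 1_000_400
error = cycle_count_error(gas_cycles, pgas_cycles)
print(f"{error:.2%}")
```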

P-GAS Speedup

• 16 threads, SPLASH-2 kernels: average speedup is 9.8x

• Best speedup: 13.6x (LU, 16 threads)

• 5.3x super-linear speedup with 4 threads
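Dividing each reported speedup by its thread count gives the parallel efficiency, which makes the "super-linear" label concrete: an efficiency above 1 means the run was faster than ideal linear scaling. The speedups are the ones on this slide; the helper function is only illustrative.

```python
def parallel_efficiency(speedup, threads):
    """Speedup per thread; 1.0 corresponds to ideal linear scaling."""
    return speedup / threads


# (speedup, threads) pairs taken from the slide.
results = {
    "average, 16 threads": (9.8, 16),
    "best (LU), 16 threads": (13.6, 16),
    "4 threads": (5.3, 4),
}

efficiency = {label: parallel_efficiency(s, t)
              for label, (s, t) in results.items()}

for label, value in efficiency.items():
    kind = "super-linear" if value > 1 else "sub-linear"
    print(f"{label}: efficiency {value:.2f} ({kind})")
```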

Why super-linear speedup?

• More cores, more caches to use

• The insert-to-queue time is shorter

Conclusion

• P-GAS uses PDES to speed up a cycle-accurate many-core processor simulator
  - speedup of 9.8x on a 16-core SMP
  - cycle error < 0.04%

• Highly optimized conservative PDES could be used in fast and accurate system simulation
  - multi-core/many-core processor simulation
  - SMP cluster, many-core cluster ...

P-GAS: Parallelizing a Many-Core Processor Simulator Using PDES

Please email me your questions:

[email protected]

Source release of our PDES framework:

http://simk.sf.net