TRANSCRIPT
Trace-driven simulation of multithreaded applications in gem5
ARM gem5 Research Summit 2017 Workshop
Gilles SASSATELLI [email protected]
Alejandro NOCUA, Florent BRUGUIER, Anastasiia BUTKO
Cambridge, UK – September 11, 2017
◼ Mont-Blanc 1, 2 & 3 projects (FP7, H2020)
● Getting ARM technology ready for HPC: HW, SW & Apps
● Advances in energy efficiency towards Exascale
◼ Initial effort: using gem5 for performance prediction (2011)
● STE Nova A9500 SoC (dual-core Cortex-A9)
● Fed publicly available parameters into a gem5 FS model
● 1.5% - 18% error, due to rough DRAM model and interconnect
Background motivations
Experimental Setup
◼ Simulator - gem5: event-driven, full system (FS)
● multiple CPU models: simple, in-order, out-of-order, etc.
● different memory systems: bus-based, Ruby NoC, caches, etc.
◼ Real hardware - Snowball Sky board
● STE Nova A9500 SoC (ARM dual-core)
◼ Benchmarks: SPLASH-2, ALPBench, Stream
● Pthread programming
Comparison & Analysis
◼ Results
● Execution time error: between 1.4% and 18%
◼ Sources of error
● Specification
● Memory & interconnect
Butko, A., Garibotti, R., Ost, L., and Sassatelli, G., "Accuracy Evaluation of GEM5 Simulator System," in 2012 7th International Workshop on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), York, UK, July 2012, pp. 1-7.
Parameters
◼ Specification information:
● CPU - number of cores, core types, etc.
● Memory - caches, coherency, main memory, etc.
● Interconnect - bus parameters, …
◼ Component implementation:
● Execution behavior
● Communication protocols
[Diagram: two cores (Core0, Core1) with private L1 I/D caches, a shared L2 and main memory]
◼ Calibration against real hardware
● Using gem5 for performance prediction
● And McPAT for power estimation
◼ Then on to exploring fancier architecture configurations
● Heterogeneous single-ISA multicores à la big.LITTLE
● Asymmetric, 3 levels of heterogeneity, etc.
Background motivations cont’d
[Diagram: example configurations - asymmetric, big.LITTLE and big.medium.LITTLE clusters of CPUs, each cluster with its own L2, sharing RAM]
Accuracy assessments
◼ Exynos 5 Octa SoC model in gem5/McPAT
● 8 heterogeneous cores
● Independent clock per cluster
● 2 L2 caches, crossbar bus
◼ Results
● Execution time error ~20%
● Power error ~12%
● EtoS ~20%
[Diagram: 'LITTLE' cluster (4× A7 + L2) and 'big' cluster (4× A15 + L2) connected through an interconnect to memory]
Butko, A., Gamatié, A., Sassatelli, G., Torres, L., and Robert, M., "Design Exploration for Next Generation High-Performance Manycore On-chip Systems: Application to big.LITTLE Architectures," in 2015 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), July 2015.
[Figure residue, Fig. 4: performance and energy trade-offs - panels (a) bfs, (b) backprop, (c) heartwall, (d) hotspot, (e) kmeans, (f) lud, (g) nw, (h) nn, (i) srad v1, (j) CPI and execution time comparison]
◼ Not ready for manycores - too slow!
● 1K-1M (simulated) instructions per second
● Scales poorly with system size
● Already much better than RTL, though
◼ Trading off speed vs. accuracy? Any sweet spot?
Background motivations cont’d
Towards energy-efficiency
◼ Mobile/embedded technology
● power efficiency, low cost… performance?
● Samsung Exynos 5 Octa: 9 GFLOPS
◼ Supercomputer node
● Intel Xeon E5: 300 GFLOPS
◼ Idea: Multicore → Manycore
● Design space exploration through simulation
State-of-the-art
[Chart: accuracy (low → high) vs. speed (low → high) of simulation approaches - RTL and cycle-accurate sit at high accuracy / low speed; cycle-approximate, JIT, distributed and trace-driven trade accuracy for speed; the ideal point is high accuracy and high speed]
State-of-the-art
[Chart, annotated: on the same accuracy-vs-speed plane, trace-driven and cycle-approximate approaches sit closest to the ideal corner; a 1-core FS simulation reaches ~10s of MIPS]
Simulation speed
◼ Event-driven simulation (cycle-approximate)
● Abstracting events into traces
[Diagram: CPU/cache/memory event timelines; the CPU's activity is abstracted into a sequence of Trace records over time]
◼ ~70% of simulation effort goes into the CPU
● Abstracting away CPU cores sounds like a good idea
● Between 2 consecutive L1 cache misses, (in-order) CPU cores perform "consistently"
Trace-driven simulation principle
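The principle above can be sketched in a few lines: the CPU is collapsed into a trace of (compute cycles until the next L1 miss, miss address) pairs, and replay alternates constant compute phases with memory accesses. This is an illustrative model, not gem5 code; `TraceEntry` and `replayLength` are made-up names.

```cpp
#include <vector>

// Illustrative sketch of the trace-driven principle: an in-order CPU is
// assumed to behave consistently between two consecutive L1 misses, so it
// can be replaced by a trace of (compute phase, miss address) pairs.
struct TraceEntry {
    unsigned computeCycles;  // cycles of CPU work until the next L1 miss
    unsigned long addr;      // address of the missing access
};

// Total replayed cycles under a given (here: fixed) miss service latency.
unsigned long replayLength(const std::vector<TraceEntry>& trace,
                           unsigned missLatency) {
    unsigned long cycles = 0;
    for (const TraceEntry& e : trace) {
        cycles += e.computeCycles;  // compute phase: constant by assumption
        cycles += missLatency;      // memory system serves the miss
    }
    return cycles;
}
```

Changing `missLatency` models a different memory subsystem without re-simulating the CPU, which is the whole point of the approach.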
Simulation speed
◼ Event-driven simulation (cycle-approximate)
[Diagram: FS simulation vs. trace-driven timelines - the compute phases Tc1, Tc1, Tc2 between misses are preserved while the memory events are replayed]
◼ SimMATE: 2-stage process
● Trace collection: tracing only L1-miss-related transactions
● Trace replay: using trace injectors that initiate transactions as previously recorded
SimMATE
Trace-Driven Approach
◼ Main concept: trace collection, then trace simulation
[Diagram: (1) trace collection - cores 0…n with private caches see hits & misses, and misses are recorded into memory traces; (2) trace simulation - trace injectors TI 0 … TI n replay the misses through the interconnect, L2 and memory]
◼ Trace collection = freezing
● CPU parameters alongside application SW
● Private cache sizes, speeds, etc.
◼ Trace replay allows exploring the rest
● L2 size, policy, etc.
● Interconnect type & speed
● Main memory speed
SimMATE for faster DSE
Trace-Driven Approach
◼ Flexibility
[Diagram: trace collection is performed 1 time; trace simulation can then be run multiple times, with the injectors TI 0 … TI n replaying misses against different interconnect, L2 and memory configurations]
◼ Trace replay performs event (re-)scheduling
● Simple "time shifting" approach
● Maintaining constant compute phases
SimMATE for faster DSE cont'd
Trace-Driven Approach
◼ Trace Injector: event scheduling
[Diagram sequence: the injector (TI) issues recorded request TR1 to the memory & interconnect; if the acknowledgement arrives at TA1' instead of the recorded TA1, the resulting delay shifts all subsequent events, so the next request fires at TR2' - computation phases keep their recorded duration while communication phases stretch or shrink with the new configuration]
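The time-shifting rule in the diagrams (request TR1, recorded acknowledge TA1 vs. replayed TA1', delay pushing TR2 to TR2') can be written down as follows. `Rec` and `timeShift` are illustrative names; the only assumed semantics are the ones stated above: compute phases keep their recorded duration, communication phases take the new latency.

```cpp
#include <vector>

// One recorded transaction: request time and acknowledge time.
struct Rec { double req, ack; };

// Replay the trace under a new memory latency: the compute phase between an
// acknowledge and the next request keeps its recorded duration, while each
// communication phase takes the new latency; every later event shifts.
std::vector<Rec> timeShift(const std::vector<Rec>& trace, double newLat) {
    std::vector<Rec> out;
    double prevAck = 0.0;     // acknowledge time in the replay
    double prevAckRec = 0.0;  // acknowledge time in the recording
    for (const Rec& r : trace) {
        double compute = r.req - prevAckRec;  // recorded compute phase
        Rec s{prevAck + compute, 0.0};        // shifted request (TRi')
        s.ack = s.req + newLat;               // shifted acknowledge (TAi')
        out.push_back(s);
        prevAck = s.ack;
        prevAckRec = r.ack;
    }
    return out;
}
```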
◼ Tuning DRAM latency
● Collection performed with 30 ns
● TD simulation from 5 ns to 55 ns
● FS used as reference
SimMATE Benchmarking
Validation
◼ External memory latency
[Diagram: traces collected at 30 ns are replayed at 5, 15, 45 and 55 ns; each TD simulation is compared against the matching FS simulation to obtain an error percentage]
Validation
◼ External memory latency
● Collection - 30 ns; simulation - 5 ns, 15 ns, 30 ns, 45 ns, 55 ns
◼ Execution time error [%] (30 ns is the original collection point):

          5 ns   15 ns   30 ns   45 ns   55 ns
  MJPEG   0.6%   0.5%    0.1%    0.4%    0.4%
  FFT     2.8%   4.5%    5.8%    5.1%    6%
◼ Tuning L2 size
● Collection performed without L2
● TD simulation from 0 to 16 MB L2
● Errors originate from cold-start bias / cache warm-up
SimMATE Benchmarking
Validation
◼ L2 cache
● Short-duration application → cold-start bias → cache warm-up matters
[Chart: execution time error [%] vs. L2 cache size (original, 256 kB, 1 MB, 8 MB, 16 MB) for traces covering ET, ET+init and ET+OS boot; reported errors range from 0.1% to 18%]
◼ Having these traces collected makes it easy to:
● Perform "trace replication", i.e. emulate more CPU cores for scalability studies
● This corresponds to weak-scaling experiments, i.e. the per-core workload remains the same
Multithreaded applications
Manycore simulation
◼ Trace Replication
[Diagram: traces collected from cores 0…n (threads 0…n, with their cache-miss counts) are replicated onto a larger set of trace injectors TI 0 … TI m, all sharing the interconnect and memory]
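A minimal sketch of the replication step: with no address offsetting, M injectors simply replay copies of the N collected per-core traces. The round-robin assignment and the names are illustrative.

```cpp
#include <vector>

// Weak-scaling "trace replication" sketch: injector i replays a copy of
// collected trace (i mod N). Purely illustrative; real SimMATE wires trace
// injector objects to trace files.
std::vector<int> assignTraces(int nTraces, int mInjectors) {
    std::vector<int> a(mInjectors);
    for (int i = 0; i < mInjectors; ++i)
        a[i] = i % nTraces;  // replicate the N traces across M injectors
    return a;
}
```

Because every injector carries the same per-core workload, total work grows with M, which is exactly the weak-scaling setup described above.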
◼ Yet synchronizations must be accounted for!
● Using whatever API: POSIX threads, OpenMP 3.0, …
● Approach: embed synchronizations into traces
● Have an arbiter that takes care of locking (when a barrier is reached) and unlocking TIs
Multithreaded applications
Trace Arbiter
Manycore simulation
◼ Trace Synchronization
[Diagram: during collection, threads 1 and 2 plus the OS produce both memory traces and synchronization traces (barrier, join); during simulation, traces 1 and 2 are replayed under the arbiter against the interconnect and memory]
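The arbiter's locking/unlocking role can be sketched as follows, assuming an event-driven design in which each injector stalled at a barrier leaves a wake-up callback; the last arriving injector releases everyone. `TraceArbiter` and its members are hypothetical names, not the actual ESM classes.

```cpp
#include <functional>
#include <vector>

// Sketch (assumed design) of the trace arbiter's barrier handling: trace
// injectors that reach a barrier are locked (their resume callback stored);
// when the expected number has arrived, all of them are unlocked.
struct TraceArbiter {
    int expected;                                // number of participating TIs
    std::vector<std::function<void()>> stalled;  // resume callbacks of locked TIs
    explicit TraceArbiter(int n) : expected(n) {}

    void reachBarrier(std::function<void()> resume) {
        stalled.push_back(std::move(resume));
        if (static_cast<int>(stalled.size()) == expected) {
            for (auto& r : stalled) r();  // last one in: unlock everybody
            stalled.clear();              // ready for the next barrier
        }
    }
};
```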
◼ And most ARM application processors are OoO (out-of-order)
● Meaning multiple outstanding memory transactions
● The assumption of constant time between 2 misses does not hold
◼ big.LITTLE & other heterogeneous friends everywhere
● And there, microarchitecture details cannot be overlooked
Limitation: in-order only!
Simulation speed
◼ Event-driven simulation (cycle-approximate)
[Diagram: CPU/cache/memory timeline with compute phases Tc1, Tc1, Tc2]
◼ Modeling microarchitecture timing & dependencies
● Tracing with the O3 model + probes, without L2 cache
● Replay done in a smart "elastic" fashion
◼ Smart TraceCPU
● Updating a dependency graph, pushing ready instructions into a queue for issue
Elastic Traces: trace-driven simulation for OoO
http://gem5.org/TraceCPU
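The "dependency graph + ready queue" mechanism amounts to a dependency-driven traversal: an instruction whose pending-dependency count drops to zero becomes ready for issue. A simplified sketch (not the actual TraceCPU code; `Node` and `issueOrder` are illustrative):

```cpp
#include <queue>
#include <vector>

// Each traced instruction waits on a number of dependencies (register or
// memory order); successors are the instructions that depend on it.
struct Node {
    std::vector<int> succs;  // dependent instructions
    int pending = 0;         // unmet dependencies
};

// Issue instructions as their dependencies resolve, FIFO among ready ones.
std::vector<int> issueOrder(std::vector<Node>& g) {
    std::queue<int> ready;
    for (int i = 0; i < static_cast<int>(g.size()); ++i)
        if (g[i].pending == 0) ready.push(i);  // initially independent
    std::vector<int> order;
    while (!ready.empty()) {
        int n = ready.front(); ready.pop();
        order.push_back(n);                    // "issue" instruction n
        for (int s : g[n].succs)
            if (--g[s].pending == 0) ready.push(s);  // now ready
    }
    return order;
}
```

This is what lets replay stay "elastic": timing changes reorder issue naturally instead of replaying fixed timestamps.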
(a) Trace replay: the traces collected on N cores are used to replay M cores
(b) Replay overview including the modified TraceCPU and arbiter
Figure 4: Trace replay approach
collection. Three types of events are considered during parallel sections: instructions executed, dependencies (load/store), and synchronization events. Instruction and data dependency traces are captured thanks to an augmented TraceCPU model. The following information is captured into the synchronization trace for each CPU and each event:
• Tick: the tick count of a CPU on entering the parallel region.
• Program Counter: the program counter at the beginning of a parallel region; it will be used during the replay phase for identifying parallel sections.
• Thread ID: the thread ID assigned by the scheduler.
• Event Type: an enumerated type that encodes events corresponding to parallel for, critical and barrier.
• Number of instructions: the number of instructions executed by a CPU during the recorded event.
• Number of data accesses: the number of data accesses performed during the recorded event.
It has to be noted that for each thread under analysis the Tick and PC information will be the same, since all threads are created at the same time; this means the information in the synchronization traces is the same across threads. In the case of the dependency trace, the load and store information is only collected between the CPU and the L1 caches. At the end of the trace collection phase, three Google Protobuf files are obtained per simulated CPU core with the required data for the replay phase:
• Instruction Executed Trace File.
• Dependency Trace File (LOAD/STORE).
• Synchronization Event Trace File.
4) Trace replay: As illustrated in Figure 4(a), collected traces can be replayed on target architecture configurations in different ways. Letting N be the number of cores used in trace collection and M in trace replay, two main purposes are considered:
• Parameter exploration: we perform an architectural exploration in which we replay the exact number of simulated cores, i.e., N = M. The objective is to analyze the influence of architectural parameters such as cache sizes, interconnect bandwidth or memory latency.
• Replication: we perform a scalability analysis by targeting a higher core count than that of the initial system from which the traces are captured, i.e., M > N. This is achieved by simulating more trace injectors. Note that the replication mechanism allows us to perform weak-scaling analysis, as the problem size is increased by the ratio M/N. In addition, the current implementation has no address-offsetting mechanism, which means that most of the resources are shared among cores.
5) Implementation: Figure 4(b) shows the interplay of the principal objects involved during the replay phase in ElasticSimMATE. A number of TraceCPU objects operate and check whether LOAD/STORE dependencies are met on the basis of the traces they access, as per the Elastic Traces base model. These further read the synchronization trace and keep track of the parallel regions. The actual behaviour when entering a parallel region is as follows:
• Init: whenever such a region is detected on a TraceCPU, a notification is sent to the arbiter so as to
◼ SimMATE + Elastic Traces = ElasticSimMATE
● Enabling both OoO + multithreaded applications
● Key: embed synchronization information at tracing time
ElasticSimMATE (ESM)
[Diagram: elastic traces plus barriers etc. feed a global stop/go arbiter]
◼ Proper tracing of synchronizations
● API-dependent: OpenMP 3.x
● Tracing whenever entering or leaving a parallel region, barrier, etc.
ElasticSimMATE (ESM)

#pragma omp parallel
for (i = 0; i < n; i++) {
    /* do some work */
}

/* as executed by the OMP runtime + scheduler (SW/HW), per thread: */
for (i = 0; i < n; i++) {
    OMP_runtime_call();
    /* do some work */
}
Sample synchronization trace record:
  Tick    PC      th_id  type  IC       DC
  389178  177216  0      1     1081157  165738
  …
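The record fields listed above can be mirrored in a plain struct; below is a hypothetical parser for a whitespace-separated form of the record (the real ESM traces are Protobuf messages, so this is only for illustration).

```cpp
#include <sstream>
#include <string>

// Fields of one synchronization-trace event, named after the slide's header
// row (Tick, PC, th_id, type, IC, DC). Names are assumptions for this sketch.
struct SyncEvent {
    unsigned long tick;          // tick count on entering the parallel region
    unsigned long pc;            // program counter of the region
    int threadId;                // thread ID assigned by the scheduler
    int type;                    // parallel-for / critical / barrier code
    unsigned long instrCount;    // instructions executed during the event
    unsigned long dataAccesses;  // data accesses during the event
};

SyncEvent parseSyncEvent(const std::string& line) {
    SyncEvent e{};
    std::istringstream in(line);
    in >> e.tick >> e.pc >> e.threadId >> e.type
       >> e.instrCount >> e.dataAccesses;
    return e;
}
```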
◼ ESM flow wrap-up
● Using the BSC Mercurium compiler / Nanos++ runtime
● Tweaked runtime so that custom m5 pseudo-instructions produce trace records
ESM wrap-up

Unmodified sources:
#pragma omp parallel
for (i = 0; i < n; i++) { /* do something */ }

As lowered onto the Nanos++ runtime (thread creation & such calls m5_trace(TYPE, th_id)):
{
    nanos_create_team();
    for (i = 0; i < n; i++) {
        /* do something */
    }
    nanos_end_team();
}

[Diagram: unmodified sources are compiled into app binaries, run through gem5 trace collection on a reference architecture to generate instruction, dependency and synchronization traces, which gem5 trace replay then uses on target architectures]
◼ Two main use cases:
● Fast parameter exploration
● Scalability study: "trace replication"
◼ Speedup & accuracy?
● Experiments on Rodinia application kernels
Benchmarking - 1 core, HOTSPOT
[Charts: execution time [s] (FS, ET, ESM) and error percentage [%] (FS vs ET, FS vs ESM) across L2 cache sizes 0 MB, 16 kB, 128 kB, 512 kB, 1 MB, 2 MB]
Benchmarking - 1 core, KMEANS
[Charts: execution time [ms] (FS, ET, ESM) and error percentage [%] (FS vs ET, FS vs ESM) across L2 cache sizes 0 MB - 2 MB]
Benchmarking - 1 core, CANNEAL
[Charts: execution time [s] (FS, ET, ESM) and error percentage [%] (FS vs ET, FS vs ESM) across L2 cache sizes 0 MB - 2 MB]
Benchmarking - L2, KMEANS
[Charts: L2 cache miss rate [%] and overall L2 miss latency [cycles] (FS vs ESM) across L2 cache sizes 16 kB - 2 MB]
Benchmarking - up to 128 cores, Blackscholes & Hotspot
[Charts: execution time [s] and simulation time [min] for 1, 2, 4, …, 128 cores under trace replication]
◼ Scalability analysis has limits
● Requires additional features such as address offsetting
● Weak scaling only (replicated per-core workloads)
◼ Programming models are moving from loops to tasks
● OpenMP 4.0, OmpSs
● Still pragma-based
● More parallelism available at run-time… more opportunities for smart job scheduling
Perspectives - Manycore simulation
[Diagram: BSC OmpSs Cholesky decomposition - a ready-task queue feeds a scheduler that dispatches tasks to hardware]
◼ Unbinding traces from cores
● One trace per task, not per core!
● Assign traces to cores by emulating runtime behaviour in trace replay
● This is real strong scaling
Perspectives
[Diagram: during trace collection, per-task traces (task 1 … task N) plus task dependencies are recorded; during trace simulation, an emulated scheduler/runtime dispatches them from a ready-task queue to the trace injectors]
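Emulating the runtime in replay boils down to list scheduling: ready task traces are dispatched to whichever core frees up first, so the same task set finishes faster on more cores (strong scaling). A minimal sketch on identical cores; task durations stand in for replayed task traces, and all names are illustrative.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <vector>

// Greedy list scheduling of ready task traces onto identical cores: each
// task goes to the earliest-free core; returns the resulting makespan.
double scheduleTasks(const std::vector<double>& taskDurations, int cores) {
    // Min-heap of per-core "free at" times, all cores free at t = 0.
    std::priority_queue<double, std::vector<double>,
                        std::greater<double>> freeAt;
    for (int c = 0; c < cores; ++c) freeAt.push(0.0);
    double makespan = 0.0;
    for (double d : taskDurations) {
        double start = freeAt.top();  // earliest-available core
        freeAt.pop();
        freeAt.push(start + d);       // core busy until task completes
        makespan = std::max(makespan, start + d);
    }
    return makespan;
}
```

With task dependencies added (as in the OmpSs diagram), only tasks whose predecessors finished would enter the ready queue; the dispatch rule stays the same.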
◼ Current ESM prototype
● ~5×-10× speedup for low core counts; probably more for tens/thousands
● Nice solution for fast DSE
● Remaining accuracy issues for some applications
  - Common to ESM & Elastic Traces
  - Under investigation with ARM
◼ Use cases
● Exploration of the memory subsystem
● Some microarchitecture parameters (Elastic Traces)
● …
◼ Future directions
● Ruby compatibility
● Could be combined with other initiatives (dist-gem5)
● Can be extended to other programming models / APIs (tasking, MPI…)
Conclusion
[Chart: Canneal - L2 cache miss latency [cycles] vs. L2 cache size (16 kB - 2 MB), FS vs ESM, 1 core]
http://montblanc-project.eu