Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)
Unique Chips and Systems (UCAS-4)
Dimitris Kaseridis & Lizy K. John, The University of Texas at Austin


Page 1

Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)

Unique Chips and Systems (UCAS-4)

Dimitris Kaseridis & Lizy K. John

The University of Texas at Austin

Laboratory for Computer Architecture

http://lca.ece.utexas.edu

Page 2

Outline

04/18/23 D. Kaseridis - Laboratory for Computer Architecture

Brief Description of UltraSPARC T1 Architecture

Analysis Objectives / Methodology

Analysis of Results
Interference on Shared Resources
Scaling of Multiprogrammed Workloads
Scaling of Multithreaded Workloads

Page 3

UltraSPARC T1 (Niagara)


A multi-threaded processor that combines CMP & SMT in CMT

8 cores, each handling 4 hardware context threads, for 32 active hardware context threads in total

Simple in-order pipeline per core, with no branch prediction unit

Optimized for multithreaded performance (throughput)

High throughput: hides memory and pipeline stalls/latencies by scheduling other available threads, with a zero-cycle thread-switch penalty

Page 4

UltraSPARC T1 Core Pipeline


A thread group shares the L1 cache, TLBs, execution units, pipeline registers, and the data path

In the pipeline diagram, blue areas are replicated per hardware context thread

Page 5

Objectives


Purpose
Analysis of the interference of multiple executing threads on the shared resources of Niagara
Scaling abilities of CMT architectures for both multiprogrammed and multithreaded workloads

Methodology
Interference on Shared Resources (SPEC CPU2000)
Scaling of a Multiprogrammed Workload (SPEC CPU2000)
Scaling of a Multithreaded Workload (SPECjbb2005)

Page 6

Analysis Objectives / Methodology


Page 7

Methodology (1/2)


On-chip performance counters for real/accurate results

Solaris 10 tools on Niagara: cpustat, cputrack, and psrset to bind processes to H/W threads

2 counters per hardware thread, with one dedicated to the instruction count
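Post-processing counter samples into IPC could look like the following minimal sketch. The sample tuples are illustrative placeholders, not real cputrack output (the actual tools report raw event counts per hardware context thread):

```python
# Sketch: aggregating per-thread (cycles, instructions) counter samples
# into IPC. Thread names and all numbers are made-up placeholders.
samples = [
    ("thr0", 1_000_000, 720_000),  # (hw thread, cycles, instruction count)
    ("thr0", 1_000_000, 690_000),
    ("thr1", 1_000_000, 310_000),
]

def ipc_per_thread(samples):
    """Sum cycles and instructions per thread, then divide."""
    totals = {}
    for thread, cycles, instrs in samples:
        c, i = totals.get(thread, (0, 0))
        totals[thread] = (c + cycles, i + instrs)
    return {t: i / c for t, (c, i) in totals.items()}

print(ipc_per_thread(samples))  # thr0 averages 0.705 IPC over its samples
```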

Page 8

Methodology (2/2)


Niagara has only one FP unit, so only the integer benchmarks were considered

Performance Counter Unit works at the granularity of a single H/W context thread
No way to break down the effects of multiple threads per H/W thread
Software profiling tools are too invasive

Only pairs of benchmarks were considered, to allow correlation of benchmarks with events

Many iterations, using the average behavior

Page 9

Analysis of Results

Interference on shared resources
Scaling of a multiprogrammed workload
Scaling of a multithreaded workload


Page 10

Interference on Shared Resources


Two modes considered:

“Same core” mode executes both benchmarks of a pair on the same core
Sharing of pipeline, TLBs, and L1 bandwidth
More like an SMT

“Two cores” mode executes each member of the pair on a different core
Sharing of L2 capacity/bandwidth and main memory
More like a CMP

Page 11

Interference “same core” (1/2)


On average, a 12% drop in IPC when running in a pair

Crafty followed by twolf showed the worst performance

Eon behaved best, keeping IPC close to the single-thread case
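The per-pair slowdown amounts to comparing paired IPC against the solo run. A minimal sketch of that comparison (all IPC values below are illustrative placeholders, not the measured results):

```python
# Interference expressed as relative IPC loss: 1 - IPC(paired) / IPC(solo).
# Benchmark names come from SPEC CPU2000; the numbers are made up.
solo_ipc = {"crafty": 0.70, "twolf": 0.45, "eon": 0.65}
paired_ipc = {("crafty", "twolf"): 0.55, ("eon", "twolf"): 0.64}

def ipc_drop(bench, partner):
    """Fractional IPC loss for `bench` when co-scheduled with `partner`."""
    return 1.0 - paired_ipc[(bench, partner)] / solo_ipc[bench]

print(f"{ipc_drop('crafty', 'twolf'):.1%}")  # a large drop for crafty
print(f"{ipc_drop('eon', 'twolf'):.1%}")     # eon stays near its solo IPC
```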

Page 12


Interference “same core” (2/2)

DC misses increased 20% on average (15% excluding crafty)

Worst DC misses were in vortex and perlbmk

The pairs demonstrating the highest L2 miss ratios are not the ones featuring an important decrease in IPC
mcf and eon pairs with more than 70% L2 misses

Overall, a small performance penalty even when sharing the pipeline and L1/L2 bandwidth; the latency-hiding technique is promising

Page 13


Interference “two cores”

Only stresses the L2 and the shared communication buses

On average, the L2 misses are almost the same as in the “same core” case: the available resources are underutilized

Multiprogrammed workload with no data sharing

Page 14

Scaling of Multiprogrammed Workload


Reduced benchmark pair set

Scaling 4 → 8 → 16 threads across the configurations

Page 15

Scaling of Multiprogrammed Workload


“Same core”

“Mixed mode”

Page 16


Scaling of Multiprogrammed “same core”

4 → 8 case
IPC / data cache misses not affected
L2 data misses increased, but IPC is not affected: enough resources, cores run fully occupied, and memory latency is hidden

8 → 16 case
More cores running the same benchmark
Increased footprint / requests to L2 / main memory
L2 requirements and shared interconnect traffic decreased performance

[Charts: IPC ratio, DC misses ratio, L2 misses ratio]

Page 17

Scaling of Multiprogrammed “mixed mode”


Mixed mode case
Significant decrease in IPC when moving both from 4 → 8 and from 8 → 16 threads
Same behavior as the “same core” case for DC and L2 misses, with an average difference of 1%–2%

Overall, for both modes, Niagara demonstrated that moving from 4 to 16 threads can be done with less than a 40% performance drop on average
Both modes showed that significantly increased L1 and L2 misses can be handled, favoring throughput

[Chart: IPC ratio]

Page 18

Scaling of Multithreaded Workload


Scaled from 1 up to 64 threads
1–8 threads: 1 thread mapped per core
8–16 threads: at most 2 threads mapped per core
16–32 threads: up to 4 threads per core
32–64 threads: more threads than hardware contexts per core, so swapping is necessary

Configuration used for SPECjbb2005
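The mapping above amounts to round-robin placement over the 8 cores. A sketch of that placement (the core and context counts are the T1's; the helper function itself is ours):

```python
# Round-robin placement over Niagara's 8 cores: the per-core thread count
# grows by one only after every core already holds the previous count.
# Beyond CORES * CONTEXTS = 32 threads, hardware contexts run out and the
# OS must start swapping threads.
CORES, CONTEXTS = 8, 4

def placement(n_threads):
    """Return the number of threads landed on each core."""
    per_core = [0] * CORES
    for i in range(n_threads):
        per_core[i % CORES] += 1
    return per_core

print(max(placement(8)))   # 1 thread per core
print(max(placement(16)))  # 2 threads per core
print(max(placement(32)))  # 4 threads per core: every context in use
```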

Page 19

Scaling of Multithreaded Workload


[Chart: SPECjbb2005 score per warehouse; the GC effect is visible]

Page 20

Scaling of Multithreaded Workload


Ratio over the 8-thread case (1 thread per core)

Instruction fetch and the DTLB are stressed the most

L1 data and L2 caches managed to scale even for more than 32 threads

[Chart annotation: GC effect]

Page 21

Scaling of Multithreaded Workload


Scaling of Performance

Almost linear scaling of 0.66 per thread up to 32 threads: a 20x speedup at 32 threads

SMT (2 threads/core) gives on average a 1.8x speedup over the CMP configuration (region 1)

SMT (up to 4 threads/core) gives a 1.3x and a 2.3x speedup over the 2-way SMT per core and the single-threaded CMP, respectively
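The reported numbers can be sanity-checked with a simple linear throughput model (our arithmetic, not the paper's):

```python
# Linear throughput model: each extra thread adds ~0.66x of one thread's
# throughput, as reported for up to 32 threads (normalized to 1 thread).
def projected_speedup(threads, per_thread=0.66):
    return 1.0 + per_thread * (threads - 1)

print(round(projected_speedup(32), 1))  # ~21.5, in line with the ~20x measured
```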

Page 22

Conclusions


Demonstration of interference on a real CMT system

The long-latency-hiding technique is effective for L1 and L2 misses and could therefore be a promising alternative to aggressive speculation

Promising scaling of up to 20x for multithreaded workloads, with an average of 0.66x per thread

The instruction fetch subsystem and the DTLBs are the most contended resources, followed by L2 cache misses

Page 23

Q/A


Thank you…

Questions?

The Laboratory for Computer Architecture

web-site: http://lca.ece.utexas.edu

Email: [email protected]