Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)
Unique Chips and Systems (UCAS-4)
Dimitris Kaseridis & Lizy K. John, The University of Texas at Austin


Page 1

Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)

Unique Chips and Systems (UCAS-4)

Dimitris Kaseridis & Lizy K. John

The University of Texas at Austin

Laboratory for Computer Architecture

http://lca.ece.utexas.edu

Page 2

Outline

04/18/23 D. Kaseridis - Laboratory for Computer Architecture

Brief Description of UltraSPARC T1 Architecture

Analysis Objectives / Methodology

Analysis of Results
Interference on Shared Resources
Scaling of Multiprogrammed Workloads
Scaling of Multithreaded Workloads

Page 3

UltraSPARC T1 (Niagara)


A multi-threaded processor that combines CMP & SMT in CMT

8 cores, each handling 4 hardware context threads, for 32 active hardware context threads in total

Simple in-order pipeline per core, with no branch prediction unit

Optimized for multithreaded performance (throughput)

High throughput: hides memory and pipeline stalls/latencies by scheduling other available threads, with a zero-cycle thread-switch penalty

Page 4

UltraSPARC T1 Core Pipeline


A thread group shares the L1 cache, TLBs, execution units, pipeline registers, and the data path

In the pipeline diagram, blue areas are replicated per hardware context thread

Page 5

Objectives


Purpose
Analysis of the interference of multiple executing threads on the shared resources of Niagara
Scaling abilities of CMT architectures for both multiprogrammed and multithreaded workloads

Methodology
Interference on Shared Resources (SPEC CPU2000)
Scaling of a Multiprogrammed Workload (SPEC CPU2000)
Scaling of a Multithreaded Workload (SPECjbb2005)

Page 6

Analysis Objectives / Methodology


Page 7

Methodology (1/2)


On-chip performance counters for real/accurate results

Solaris 10 tools on Niagara: cpustat, cputrack, and psrset to bind processes to H/W threads

2 counters per hardware thread, with one dedicated to the instruction count
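Post-processing counter samples into IPC could look like the following minimal sketch. The sample tuples are illustrative placeholders, not real cputrack output (the actual tools report raw event counts per hardware context thread):

```python
# Sketch: aggregating per-thread (cycles, instructions) counter samples
# into IPC. Thread names and all numbers are made-up placeholders.
samples = [
    ("thr0", 1_000_000, 720_000),  # (hw thread, cycles, instruction count)
    ("thr0", 1_000_000, 690_000),
    ("thr1", 1_000_000, 310_000),
]

def ipc_per_thread(samples):
    """Sum cycles and instructions per thread, then divide."""
    totals = {}
    for thread, cycles, instrs in samples:
        c, i = totals.get(thread, (0, 0))
        totals[thread] = (c + cycles, i + instrs)
    return {t: i / c for t, (c, i) in totals.items()}

print(ipc_per_thread(samples))  # thr0 averages 0.705 IPC over its samples
```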

Page 8

Methodology (2/2)


Niagara has only one FP unit, so only the integer benchmarks were considered

Performance Counter Unit works at the granularity of a single H/W context thread
No way to break down the effects of multiple threads per H/W thread
Software profiling tools are too invasive

Only pairs of benchmarks were considered, to allow correlation of benchmarks with events

Many iterations, using the average behavior

Page 9

Analysis of Results

Interference on shared resources
Scaling of a multiprogrammed workload
Scaling of a multithreaded workload


Page 10

Interference on Shared Resources


Two modes considered:

“Same core” mode executes both benchmarks of a pair on the same core
Sharing of pipeline, TLBs, and L1 bandwidth
More like an SMT

“Two cores” mode executes each member of the pair on a different core
Sharing of L2 capacity/bandwidth and main memory
More like a CMP

Page 11

Interference “same core” (1/2)


On average, a 12% drop in IPC when running in a pair

Crafty followed by twolf showed the worst performance

Eon behaved best, keeping IPC close to the single-thread case
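The per-pair slowdown amounts to comparing paired IPC against the solo run. A minimal sketch of that comparison (all IPC values below are illustrative placeholders, not the measured results):

```python
# Interference expressed as relative IPC loss: 1 - IPC(paired) / IPC(solo).
# Benchmark names come from SPEC CPU2000; the numbers are made up.
solo_ipc = {"crafty": 0.70, "twolf": 0.45, "eon": 0.65}
paired_ipc = {("crafty", "twolf"): 0.55, ("eon", "twolf"): 0.64}

def ipc_drop(bench, partner):
    """Fractional IPC loss for `bench` when co-scheduled with `partner`."""
    return 1.0 - paired_ipc[(bench, partner)] / solo_ipc[bench]

print(f"{ipc_drop('crafty', 'twolf'):.1%}")  # a large drop for crafty
print(f"{ipc_drop('eon', 'twolf'):.1%}")     # eon stays near its solo IPC
```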

Page 12


Interference “same core” (2/2)

DC misses increased 20% on average (15% excluding crafty)

Worst DC misses were in vortex and perlbmk

The pairs demonstrating the highest L2 miss ratios are not the ones featuring an important decrease in IPC
mcf and eon pairs with more than 70% L2 misses

Overall, a small performance penalty even when sharing the pipeline and L1/L2 bandwidth; the latency-hiding technique is promising

Page 13


Interference “two cores”

Only stresses the L2 and the shared communication buses

On average, the L2 misses are almost the same as in the “same core” case: the available resources are underutilized

Multiprogrammed workload with no data sharing

Page 14

Scaling of Multiprogrammed Workload


Reduced benchmark pair set

Scaling 4 → 8 → 16 threads across the configurations

Page 15

Scaling of Multiprogrammed Workload


“Same core”

“Mixed mode”

Page 16


Scaling of Multiprogrammed “same core”

4 → 8 case
IPC / data cache misses not affected
L2 data misses increased, but IPC is not affected: enough resources, cores run fully occupied, and memory latency is hidden

8 → 16 case
More cores running the same benchmark
Increased footprint / requests to L2 / main memory
L2 requirements and shared interconnect traffic decreased performance

[Charts: IPC ratio, DC misses ratio, L2 misses ratio]

Page 17

Scaling of Multiprogrammed “mixed mode”


Mixed mode case
Significant decrease in IPC when moving both from 4 → 8 and from 8 → 16 threads
Same behavior as the “same core” case for DC and L2 misses, with an average difference of 1%–2%

Overall, for both modes, Niagara demonstrated that moving from 4 to 16 threads can be done with less than a 40% performance drop on average
Both modes showed that significantly increased L1 and L2 misses can be handled, favoring throughput

[Chart: IPC ratio]

Page 18

Scaling of Multithreaded Workload


Scaled from 1 up to 64 threads
1–8 threads: 1 thread mapped per core
8–16 threads: at most 2 threads mapped per core
16–32 threads: up to 4 threads per core
32–64 threads: more threads than hardware contexts per core, so swapping is necessary

Configuration used for SPECjbb2005
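The mapping above amounts to round-robin placement over the 8 cores. A sketch of that placement (the core and context counts are the T1's; the helper function itself is ours):

```python
# Round-robin placement over Niagara's 8 cores: the per-core thread count
# grows by one only after every core already holds the previous count.
# Beyond CORES * CONTEXTS = 32 threads, hardware contexts run out and the
# OS must start swapping threads.
CORES, CONTEXTS = 8, 4

def placement(n_threads):
    """Return the number of threads landed on each core."""
    per_core = [0] * CORES
    for i in range(n_threads):
        per_core[i % CORES] += 1
    return per_core

print(max(placement(8)))   # 1 thread per core
print(max(placement(16)))  # 2 threads per core
print(max(placement(32)))  # 4 threads per core: every context in use
```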

Page 19

Scaling of Multithreaded Workload


[Chart: SPECjbb2005 score per warehouse; the GC effect is visible]

Page 20

Scaling of Multithreaded Workload


Ratio over the 8-thread case (1 thread per core)

Instruction fetch and the DTLB are stressed the most

L1 data and L2 caches managed to scale even for more than 32 threads

[Chart annotation: GC effect]

Page 21

Scaling of Multithreaded Workload


Scaling of Performance

Almost linear scaling of 0.66 per thread up to 32 threads: a 20x speedup at 32 threads

SMT (2 threads/core) gives on average a 1.8x speedup over the CMP configuration (region 1)

SMT (up to 4 threads/core) gives a 1.3x and a 2.3x speedup over the 2-way SMT per core and the single-threaded CMP, respectively
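The reported numbers can be sanity-checked with a simple linear throughput model (our arithmetic, not the paper's):

```python
# Linear throughput model: each extra thread adds ~0.66x of one thread's
# throughput, as reported for up to 32 threads (normalized to 1 thread).
def projected_speedup(threads, per_thread=0.66):
    return 1.0 + per_thread * (threads - 1)

print(round(projected_speedup(32), 1))  # ~21.5, in line with the ~20x measured
```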

Page 22

Conclusions


Demonstration of interference on a real CMT system

The long-latency-hiding technique is effective for L1 and L2 misses and could therefore be a promising alternative to aggressive speculation

Promising scaling of up to 20x for multithreaded workloads, with an average of 0.66x per thread

The instruction fetch subsystem and the DTLBs are the most contended resources, followed by L2 cache misses

Page 23

Q/A


Thank you…

Questions?

The Laboratory for Computer Architecture

web-site: http://lca.ece.utexas.edu

Email: [email protected]