
School of Electrical Engineering and Computer Science

University of Central Florida

CDA 5106 Advanced Computer Architecture I

Thread-Level Parallelism

University of Central Florida 3

Resource Utilization (small instruction window)

[Figure: execution timeline with a small instruction window; resources sit idle from the cache miss until the miss is repaired]

University of Central Florida 4

Resource Utilization (large instruction window)

[Figure: execution timeline with a large instruction window, showing the same cache miss and its repair]

University of Central Florida 5

Resource Utilization (cont.)

• Limited instruction-level parallelism from single thread

• How about multiple threads? => multithreading

University of Central Florida 6

Multithreading

• Run multiple independent programs at the same time

• Hardware support for multithreading

– Replicate some or all resources

• Fetch multiple programs

– Replicate state

• Each program needs its own context

• Context = register state + memory state

– Mechanism to switch context (?)
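To make "context" concrete, the sketch below is a minimal, illustrative C model of the per-thread state a multithreaded core must hold, assuming a MIPS-like ISA with 32 integer and 32 FP registers; all type and field names are hypothetical, not from these slides.

/* Illustrative per-thread context: register state + memory state. */
#include <stdint.h>

#define NUM_INT_REGS 32
#define NUM_FP_REGS  32

typedef struct {
    uint64_t pc;                      /* per-thread program counter   */
    uint64_t int_regs[NUM_INT_REGS];  /* architectural integer state  */
    double   fp_regs[NUM_FP_REGS];    /* architectural FP state       */
    uint64_t page_table_base;         /* memory state: address space  */
    uint16_t asid;                    /* address-space id, so TLBs and
                                         caches can be shared safely  */
} thread_context;

/* A context switch saves one thread's state and restores another's.
   Hardware multithreading keeps several contexts resident, so the
   switch costs zero or a few cycles instead of an OS trap. */
void context_switch(thread_context *save, const thread_context *restore,
                    thread_context *live) {
    *save = *live;
    *live = *restore;
}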

University of Central Florida 7

Multithreading on a Chip

• Two major alternatives

– Simultaneous multithreading (SMT)

– Chip multiprocessor (CMP)

University of Central Florida 8

SMT Motivation

• Wide superscalar is the way to go

– Must continue to improve single-program performance

• Multiple programs share fetch & issue bandwidth

– Thread-level parallelism (TLP) improves utilization of wide superscalar

– Can still exploit high ILP in a single program

University of Central Florida 9

CMP Motivation

• Wide superscalar not the way to go

– Trades fast clock for minor IPC gain

– Design and verification complexity

• Multiple simple processors (e.g., 2-way or 4-way superscalar)

– Exploit TLP

– Moderate ILP plus fast clock

– Use multiple cores to exploit single-thread ILP

University of Central Florida 10

[Figure: issue-slot utilization over time for three threads, comparing single-threaded execution, SMT (threads share slots within a cycle), and CMP (threads run on separate cores)]

University of Central Florida 11

More multithreading architectures

CMP: IBM Power4

Fine-grain multi-threading (FGMT): CDC-6600

Coarse-grain multi-threading (CGMT): Northstar/Pulsar PowerPC

SMT: Intel Pentium 4

University of Central Florida 12

ISCA-23 SMT Paper

• D. Tullsen, S. Eggers, J. Emer, et al., “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” ISCA-23, 1996

• Contributions

– Implementable SMT microarchitecture

• Leverage existing superscalar mechanisms

• Single-program performance almost unimpacted

– Exploit TLP better

• Basic fetch/issue policies exploit TLP poorly

– Utilization tops out at 50%

– Increasing number of threads beyond 5 doesn’t help

• Insight into fetch and issue bottlenecks

• Novel fetch/issue policies

University of Central Florida 13

Superscalar architecture (single thread)

[Figure: PC and fetch unit read from the instruction cache; instructions pass through decode and register renaming into the integer and FP issue queues, then to the integer + load/store units and FP units, backed by the integer and FP register files and the data cache]

University of Central Florida 14

SMT version of the same pipeline

[Figure: the superscalar datapath annotated with the SMT changes: multiple PCs with thread selection at the fetch unit; a replicated RAS and thread ids in the BTB; multiple rename map tables and multiple ROBs; selective squash in the integer and FP issue queues; replicated architectural register state in the integer and FP register files; per-thread memory disambiguation at the load/store units]

University of Central Florida 15

Types of Changes

• Types of changes required for SMT support

– REP: Replicate hardware

– SIZE: Resize hardware

– CTL: Additional control

– ID: Thread identifiers needed

University of Central Florida 16

Instruction Fetch

• Multiple program counters (REP)

• Thread selection (CTL)

• Per-thread return address stacks (REP)

• Thread ids in BTB (ID)

University of Central Florida 17

Register File Management

• Per-thread rename map tables (REP)

• Per-thread ROB (REP)

– Simplifies retirement & squashing

University of Central Florida 18

Issue Queues

• Selective squash in queues (ID, CTL)

– Already implement selective squash due to arbitrary order in queue

– Augment selective squash with thread id
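A minimal sketch of the thread-id augmentation, assuming (as superscalars already do) that each queue entry carries an age tag for ordinary branch-misprediction squash; the field and function names are illustrative, not from the paper.

#include <stdbool.h>
#include <stdint.h>

#define IQ_SIZE 32

typedef struct {
    bool     valid;
    uint8_t  tid;    /* owning thread: the SMT addition            */
    uint64_t seq;    /* age tag already used for selective squash  */
    /* ... opcode, operand tags, etc. ... */
} iq_entry;

/* Squash only instructions from thread `tid` that are younger than
   the mispredicted branch; entries of other threads are untouched. */
void selective_squash(iq_entry iq[IQ_SIZE], uint8_t tid, uint64_t branch_seq) {
    for (int i = 0; i < IQ_SIZE; i++) {
        if (iq[i].valid && iq[i].tid == tid && iq[i].seq > branch_seq)
            iq[i].valid = false;
    }
}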

University of Central Florida 19

Load-Store Queue

• Options

– Per-thread load-store queues (REP)

-OR-

– Disambiguate using physical addresses instead of virtual addresses

University of Central Florida 20

Potential Problem Areas

• Predictor and cache pressure

• Claims from paper

– For chosen workload, predictor and cache conflicts not a problem

– Any extra mispredictions and misses are cushioned by abundant TLP

• Large benchmarks in practice

– Cache conflicts may cause slowdown w.r.t. running programs serially since serial execution exploits cache locality

– Slowdown due to contention on other shared resources, e.g., ROB.

University of Central Florida 21

Register File Impact

• Why is register file larger with SMT?

– 1 thread: Minimum of 1*32 integer registers

– 8 threads: Minimum of 8*32 integer registers

– I.e., need to store architectural state of all threads

– Note that amount of speculative state is independent of number of threads (depends only on total number of active instructions)

• Implication

– Don’t want to increase cycle time

– Expand register read stage into two stages; same with register write stage

– Performance impact?
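A back-of-envelope version of the sizing argument above: architectural state scales with the thread count, while speculative (rename) state does not. The 100-instruction renaming window below is an illustrative assumption.

#include <stdio.h>

int main(void) {
    const int arch_regs_per_thread = 32;  /* integer registers per thread  */
    const int rename_window = 100;        /* in-flight instructions: fixed */

    for (int threads = 1; threads <= 8; threads *= 2) {
        int phys_regs = threads * arch_regs_per_thread + rename_window;
        printf("%d thread(s): >= %d physical integer registers\n",
               threads, phys_regs);
    }
    return 0;  /* prints 132, 164, 228, 356 */
}

With 8 threads the minimum roughly triples (356 vs. 132 registers), which is why the register read and write stages each split in two on the next slide.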

University of Central Florida 22

Base pipeline: Fetch → Decode → Rename → Queue → Reg Read → Exec → Reg Write → Commit

– Misfetch penalty: 2 cycles

– Misprediction penalty: 6 cycles minimum

– Register usage: 4 cycles minimum (may overlap)

– Single bypass level

SMT pipeline: Fetch → Decode → Rename → Queue → Reg Read → Reg Read → Exec → Reg Write → Reg Write → Commit

– Misfetch penalty: 2 cycles

– Misprediction penalty: 7 cycles minimum

– Register usage: 6 cycles minimum

– Double bypass levels

University of Central Florida 23

Methodology

8-way issue superscalar, 8 threads

University of Central Florida 24

Performance of Base SMT

• Positive results

– Single-thread performance degrades only 2% due to the additional pipe stages

– SMT throughput is 84% higher than the superscalar

• Negative results

– Processor utilization still low at 50% (IPC = 4)

– Throughput peaks at 5 or 6 threads (not 8)

University of Central Florida 25

SMT Bottlenecks

1. Fetch throughput

– Sustaining only 4.2 useful instructions per cycle!

– Base thread selection: Round-Robin, 1 thread at a time

– “Horizontal waste” due to single-threaded fetch stage

– Sources of waste include misalignment and taken branches

University of Central Florida 26

SMT Bottlenecks (cont.)

2. Lack of parallelism

– 8 independent threads should provide plenty of parallelism

– Perhaps have the wrong instructions in the issue queues!

University of Central Florida 27

Fetch Unit: in search of useful instructions

• Exploit choice

– SMT offers unique ability to improve fetch throughput by fetching from multiple threads at same time

– Just like we do for issue

– Misalignment and taken branches have less impact

University of Central Florida 28

Fetch Unit (cont.)

• Fetch models

– Notation: alg.num1.num2

– alg => thread selection method (which thread(s) to fetch)

– num1 => # of threads that can fetch in 1 cycle

– num2 => max # of instructions fetched per thread per cycle

• There are 8 instruction cache banks and conflicts are modeled

University of Central Florida 29

Fetch Unit: Partitioning

• Keep thread selection (alg) fixed

– Round Robin (RR)

• Models

– RR.1.8

• Base scheme: 1 thread at a time, with the full fetch bandwidth of 8

– RR.2.4 and RR.4.2

• Total # of fetched instructions remains 8

• If num1 is too high, a thread-shortage problem arises: too few fetchable threads to achieve 8 instr./cycle

– RR.2.8

• Eliminates thread shortage problem (each thread can fetch max b/w) while reducing horizontal waste
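The alg.num1.num2 models are easy to state in code. The sketch below simulates one fetch cycle with round-robin selection; instruction-cache bank conflicts (8 banks in the paper) are not modeled, and all names are illustrative.

#include <stdio.h>

#define NUM_THREADS 8
#define FETCH_BW    8   /* total fetch bandwidth per cycle */

static int rr_next = 0; /* round-robin priority pointer */

/* avail[t]: instructions thread t can supply this cycle (limited by
   line boundaries, misalignment, taken branches). Returns the total
   fetched; per-thread counts are written to fetched[]. */
int fetch_cycle(int num1, int num2,
                const int avail[NUM_THREADS], int fetched[NUM_THREADS]) {
    int total = 0;
    for (int t = 0; t < NUM_THREADS; t++) fetched[t] = 0;
    for (int k = 0; k < num1 && total < FETCH_BW; k++) {
        int t = (rr_next + k) % NUM_THREADS;
        int n = avail[t];
        if (n > num2) n = num2;                         /* per-thread cap */
        if (n > FETCH_BW - total) n = FETCH_BW - total; /* total cap      */
        fetched[t] = n;
        total += n;
    }
    rr_next = (rr_next + 1) % NUM_THREADS;
    return total;
}

Under RR.1.8 (num1 = 1), a taken branch in the lone thread wastes the rest of the 8 slots; under RR.2.8, the second thread can fill what the first could not.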

University of Central Florida 30

Throughput

University of Central Florida 31

Fetch Unit: Thread Choice

• Replace Round-Robin (RR) with more intelligent thread selection policies

– BRCOUNT

– MISSCOUNT

– ICOUNT

University of Central Florida 32

BRCOUNT

• Give high priority to those threads with fewest unresolved branches

• Attacks wrong-path fetching

• Also reduces pressure on shadow maps

University of Central Florida 33

MISSCOUNT

• Give priority to threads with fewest outstanding data cache misses

• Attacks IQ clog

University of Central Florida 34

ICOUNT

• Give priority to threads with fewest instructions in decode, rename, and issue queues

– Threads that make good progress => high priority

– Threads that make slow progress => low priority

• Attacks IQ clog generally
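All three policies reduce to "fetch first from the thread with the smallest value of one counter." A hedged sketch, with illustrative counter and function names:

#include <stdint.h>

#define NUM_THREADS 8

typedef enum { BRCOUNT, MISSCOUNT, ICOUNT } policy_t;

typedef struct {
    uint32_t unresolved_branches;  /* BRCOUNT: wrong-path risk        */
    uint32_t outstanding_dmisses;  /* MISSCOUNT: likely IQ cloggers   */
    uint32_t frontend_insts;       /* ICOUNT: insts in decode/rename/
                                      issue queues (progress measure) */
} thread_stats;

static uint32_t counter(policy_t p, const thread_stats *s) {
    switch (p) {
    case BRCOUNT:   return s->unresolved_branches;
    case MISSCOUNT: return s->outstanding_dmisses;
    default:        return s->frontend_insts;     /* ICOUNT */
    }
}

/* Return the id of the thread with the smallest counter value,
   i.e., the highest fetch priority under policy p. */
int select_thread(policy_t p, const thread_stats s[NUM_THREADS]) {
    int best = 0;
    for (int t = 1; t < NUM_THREADS; t++)
        if (counter(p, &s[t]) < counter(p, &s[best]))
            best = t;
    return best;
}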

University of Central Florida 35

Throughput

University of Central Florida 36

Resource allocation

• STALL: builds on top of ICOUNT (MICRO-34, 2001)

– Prevents a thread from fetching more instructions when a pending L2 miss is detected

– Problem: the L2 miss is detected too late

• FLUSH: flush all instructions from a thread with a pending L2 miss (MICRO-34, 2001)

• Data Gating: stall a thread on each L1 D-cache miss (HPCA 2003)

– Problem: not every L1 miss leads to an L2 miss

• Static partition (PACT 2003)

– Each thread gets an equal share of each resource

• Dynamically redistribute resources (MICRO-37, 2004)

– Thread classification as either high-ILP or memory-intensive

• Insight: threads without L2 misses require fewer resources

– Resource-usage classification: active vs. inactive, depending on whether a given type of resource will be used during the following Y cycles

– Sharing model based on pre-calculated resource-allocation values and the run-time thread & resource classification

University of Central Florida 37

Resource allocation (cont.)

• Resource impact for high ILP threads

University of Central Florida 38

Resource Sharing Model

• E = R / T

– E: number of entries of a shared structure that each thread gets

– R: total number of entries in the shared structure

– T: number of threads

• Since slow threads require more resources than fast threads, we steal resources from the fast threads:

• Eslow = (1 + C * F) * R / T

– Eslow: number of entries in a shared structure that each slow thread gets

– C: sharing factor, determining how much resource each fast thread offers; empirically, C = 1 / (T + 4) gives good results

– F: number of fast threads

• Taking into account that not all threads will require a particular type of resource:

• Eslow = (1 + C * FA) * R / TA

– TA = FA + SA (fast-active plus slow-active threads), thereby eliminating competition from inactive threads (worked example below)
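A worked instance of the model with illustrative numbers, say R = 32 issue-queue entries and T = 8 thread contexts, of which FA = 2 are fast-active and SA = 2 are slow-active:

#include <stdio.h>

int main(void) {
    const double R = 32.0, T = 8.0;       /* entries, thread contexts    */
    const double FA = 2.0, SA = 2.0;      /* fast-active, slow-active    */
    const double TA = FA + SA;            /* only active threads compete */
    const double C  = 1.0 / (T + 4.0);    /* empirical sharing factor    */

    double e_even = R / T;                   /* naive even split: 4.0 */
    double e_slow = (1.0 + C * FA) * R / TA; /* slow thread: ~9.33    */

    printf("even split:          %.2f entries/thread\n", e_even);
    printf("slow-active thread:  %.2f entries\n", e_slow);
    return 0;
}

Restricting the split to the TA = 4 active threads already doubles each share to 8 entries, and the C * FA term shifts a bit more from the two fast threads to the slow ones.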

University of Central Florida 39

Resource allocation (cont.)

• Pre-calculated resource sharing model

University of Central Florida 40

Resource allocation (cont.)

• Implementation

University of Central Florida 41

Methodology

University of Central Florida 42

Methodology (cont.)

University of Central Florida 43

Results

University of Central Florida 44

Results (cont.)

University of Central Florida 45

Results (cont.)

University of Central Florida 46

Sharing Caches among Multi-core Processors

• Last-Level Cache usually shared among multiple cores

• Issues for Shared Cache Design

– Layout

– Management

[Figure: two cores, C1 and C2, sharing one last-level cache]

University of Central Florida 47

Chip Layout for Multi-Core/Many-Core Processors

• Cache-in-middle surrounded by processor cores

• Non-Uniform Cache Access Time

– Line migration

• Shared lines present a problem

– Line replication

• Reducing the effective cache size

• Write-invalidation is required

University of Central Florida 48

Nahalal: Cache Organization for CMPs (CAL 2007)

• Inspired by a 19th century town design

• A central bank for the shared cache

– Key insight: the amount of shared data is relatively small

University of Central Florida 49

Cache Management [PACT’06]

• Capitalist: LRU

• Communist: Fairness (absolute vs. relative)

• Utilitarian: Overall system performance

• Aspects of cache partitioning policy

– Performance target

• Minimizing memory bandwidth

• Maximizing IPC

• Maximizing weighted IPC

– Evaluation metric

– Policy

– Policy metric
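A sketch of the kind of off-line optimization step this implies: exhaustively score every way-partition of a shared cache against a Utilitarian target (total IPC) and a Communist target (worst-case weighted IPC). The IPC-vs-ways profiles are made-up numbers for two cores, not data from the paper.

#include <stdio.h>

#define WAYS 16

/* ipc[t][w]: IPC of thread t when granted w ways (w = 1..15). */
static const double ipc[2][WAYS] = {
    {0, .40, .55, .65, .72, .78, .82, .85,
        .87, .89, .90, .91, .92, .92, .93, .93},
    {0, .90, 1.10, 1.20, 1.26, 1.30, 1.32, 1.34,
        1.35, 1.36, 1.36, 1.37, 1.37, 1.37, 1.38, 1.38},
};

int main(void) {
    /* Baseline for weighting: IPC with (nearly) the whole cache. */
    const double base0 = ipc[0][WAYS - 1], base1 = ipc[1][WAYS - 1];
    int best_u = 1, best_c = 1;
    double util = 0.0, fair = 0.0;

    for (int w = 1; w < WAYS; w++) {   /* thread 0 gets w ways */
        double sum = ipc[0][w] + ipc[1][WAYS - w];
        double w0 = ipc[0][w] / base0, w1 = ipc[1][WAYS - w] / base1;
        double worst = (w0 < w1) ? w0 : w1;  /* Communist: lift the worst */
        if (sum > util)   { util = sum;   best_u = w; }
        if (worst > fair) { fair = worst; best_c = w; }
    }
    printf("Utilitarian: %2d/%2d ways, total IPC %.2f\n",
           best_u, WAYS - best_u, util);
    printf("Communist:   %2d/%2d ways, min weighted IPC %.2f\n",
           best_c, WAYS - best_c, fair);
    return 0;
}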

University of Central Florida 50

Methodology

1. Select performance targets

2. Obtain (static) optimal cache allocations through off-line optimization analysis

3. Analyze the data

University of Central Florida 51

Performance Targets

• Misses per access (MPA)

– Easy to measure online

– Does not reflect performance

• Misses per cycle (MPC)

– Bandwidth measure

• Instruction per cycle (IPC)

– May favor high IPC workloads at the cost of low IPC ones

– Solution: Weighted IPC = IPC / Baseline IPC

– Which baseline?

• Utilitarian or Communist

University of Central Florida 52

Performance Targets

University of Central Florida 53

Workload characteristics

University of Central Florida 54

Results

• Two types of applications (SPECweb and TPC-C) sharing the cache

University of Central Florida 55

Results (cont.)

• Four types of applications sharing the cache

• General observation: optimal partitioning behavior varies widely, and the impact of changing performance targets is not easy to predict (i.e., no general trends)

University of Central Florida 56

Communist vs. Utilitarian

Observation:

1. In most cases, the Communist and Utilitarian metrics are comparable (i.e., optimizing for one metric usually yields near-optimal results for the other)

2. However, there are tails in both distributions; in other words, there are exceptions where optimizing for one metric results in very poor performance on the other

University of Central Florida 57

The impact of the Baseline on Weighted Performance

Observation:

1. Across baselines, the Utilitarian weighted-IPC metrics are close to one another

2. More variation is observed among the Communist metrics

University of Central Florida 58

Policy metrics: which on-line metric is useful?

Observation:

1. MPA correlates poorly with all performance targets

2. MPC and IPC are relatively good for Utilitarian targets

3. MPC and IPC are not good for Communist targets

University of Central Florida 59

Policy Evaluation

• Compare the optimal partition with LRU and even partitioning

Observation:

1. LRU is not close to optimal for most performance targets, except raw IPC

2. Even partitioning is not good for either Communist or Utilitarian targets