Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor
Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm
Presented by Kim Ki Young @ DCSLab
Introduction
Simultaneous Multithreading (SMT): a technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units.
Two major impediments to processor utilization:
- long latencies
- limited per-thread parallelism
1. Demonstrate that the throughput gains of SMT are possible without extensive changes to a conventional, wide-issue superscalar processor
2. Show that SMT need not compromise single-thread performance
3. Use a detailed architecture model to analyze and relieve bottlenecks that did not exist in the more idealized model
4. Show how simultaneous multithreading creates an advantage previously unexploitable in other architectures
A projection of current superscalar design trends 3-5 years into the future.
Changes necessary to support simultaneous multithreading:
- multiple program counters
- a separate return stack for each thread
- per-thread instruction retirement, instruction queue flush, and trap mechanisms
- a thread ID with each branch target buffer entry
- a larger register file
MIPSI: a MIPS-based simulator that executes unmodified Alpha object code.
Workload: the SPEC92 benchmark suite (five floating-point programs, two integer programs, and TeX), compiled with the Multiflow trace scheduling compiler.
With only a single thread, throughput is less than 2% below that of a superscalar without SMT support.
Peak throughput is 84% higher than the superscalar's.
Three problems:
- IQ size
- fetch throughput
- lack of parallelism
Improve fetch throughput without increasing the fetch bandwidth.
Scheme notation alg.num1.num2:
- alg: fetch selection method
- num1: number of threads that can fetch in one cycle
- num2: maximum number of instructions fetched per thread in one cycle
Partitioning the fetch unit: RR.1.8, RR.2.4, RR.4.2
With some hardware addition: RR.2.8 (additional logic is required)
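As a rough illustration of the alg.num1.num2 schemes above, here is a toy round-robin fetch model (a hypothetical simulation sketch, not the paper's hardware; the function name, queue representation, and 8-instruction bandwidth cap are assumptions for illustration):

```python
def rr_fetch(thread_queues, rr_start, num1, num2, bandwidth=8):
    """Toy model of RR.num1.num2 fetch partitioning for one cycle.

    thread_queues: pending instruction count per thread.
    rr_start: round-robin priority pointer for this cycle.
    Returns {thread_id: instructions fetched}.
    """
    n = len(thread_queues)
    fetched = {}
    remaining = bandwidth
    for i in range(n):
        tid = (rr_start + i) % n          # round-robin priority order
        if len(fetched) == num1 or remaining == 0:
            break                          # at most num1 threads per cycle
        take = min(thread_queues[tid], num2, remaining)
        if take > 0:
            fetched[tid] = take            # each thread fetches <= num2
            remaining -= take
    return fetched

# RR.2.4: two threads may fetch up to four instructions each.
queues = [6, 3, 8, 1]                      # pending instructions per thread
print(rr_fetch(queues, rr_start=0, num1=2, num2=4))  # {0: 4, 1: 3}
```

The point of RR.2.4 and RR.4.2 over RR.1.8 is visible here: when one thread cannot supply a full fetch block, a second thread can fill part of the bandwidth.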
Fetch Policies
- BRCOUNT: favor threads that are least likely to be on a wrong path (fewest outstanding branches)
- MISSCOUNT: favor threads that have the fewest outstanding D-cache misses
- ICOUNT: favor threads with the fewest instructions in decode, rename, and the instruction queues
- IQPOSN: favor threads whose instructions are farthest from the head of the IQ
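A minimal sketch of how an ICOUNT-style policy picks which threads fetch next: rank threads by in-flight instruction count and take the lowest (the thread names and counts are made up for illustration):

```python
# Hypothetical per-thread state: number of instructions each thread
# currently has in decode, rename, and the instruction queues.
icounts = {"T0": 14, "T1": 3, "T2": 9, "T3": 5}

def icount_select(counts, num_threads):
    """ICOUNT sketch: give fetch priority to the threads with the
    fewest in-flight instructions, since a low count means the thread
    is moving instructions through the machine quickly."""
    ranked = sorted(counts, key=counts.get)  # ascending by count
    return ranked[:num_threads]

print(icount_select(icounts, 2))  # ['T1', 'T3']
```

The same skeleton covers BRCOUNT, MISSCOUNT, and IQPOSN by swapping in a different per-thread metric as the sort key.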
Unblocking the Fetch Unit
- BIGQ: increase the IQ's size without increasing the search space (double the size, but search only the first 32 entries)
- ITAG: do the I-cache tag lookup a cycle early
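The BIGQ idea, a bigger queue behind a fixed search window, can be sketched as follows (the 32-entry window is from the slide; the queue representation, entry fields, and 8-wide issue are assumptions for illustration):

```python
from collections import deque

SEARCH_WINDOW = 32  # issue logic examines only the first 32 IQ entries

def issue_ready(iq, issue_width=8):
    """BIGQ sketch: the queue may hold more entries (e.g. 64), but
    wakeup/select searches only the first SEARCH_WINDOW slots, so the
    search cost stays fixed while the queue absorbs more instructions."""
    issued = [e for e in list(iq)[:SEARCH_WINDOW] if e["ready"]][:issue_width]
    for e in issued:
        iq.remove(e)
    return issued

# A 64-entry queue where every tenth instruction is ready: only the
# ready entries inside the 32-entry window get issued this cycle.
iq = deque({"id": i, "ready": i % 10 == 0} for i in range(64))
print([e["id"] for e in issue_ready(iq)])  # [0, 10, 20, 30]
```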
Two sources of issue slot waste:
- wrong-path instructions, which result from mispredicted branches
- optimistically issued instructions, which result from cache misses or bank conflicts
Issue algorithms: OPT_LAST, SPEC_LAST, BRANCH_FIRST
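The priority idea behind OPT_LAST and SPEC_LAST can be sketched as a sort over ready instructions (the instruction records and flags here are illustrative, not the paper's structures):

```python
# Each ready instruction is tagged with whether it was issued
# optimistically (may be squashed by a cache miss or bank conflict)
# or is speculative (behind an unresolved branch). These flags are
# assumptions made for illustration.
ready = [
    {"id": 1, "optimistic": True,  "speculative": False},
    {"id": 2, "optimistic": False, "speculative": True},
    {"id": 3, "optimistic": False, "speculative": False},
]

def opt_last(instrs):
    """OPT_LAST sketch: issue optimistic instructions only after all
    other ready instructions (oldest-first within each group)."""
    return sorted(instrs, key=lambda i: (i["optimistic"], i["id"]))

def spec_last(instrs):
    """SPEC_LAST sketch: issue speculative instructions last."""
    return sorted(instrs, key=lambda i: (i["speculative"], i["id"]))

print([i["id"] for i in opt_last(ready)])   # [2, 3, 1]
print([i["id"] for i in spec_last(ready)])  # [1, 3, 2]
```

Both deprioritize the instructions most likely to waste an issue slot, which is exactly the waste the two bullets above identify.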
The Issue Bandwidth: not a bottleneck.
Instruction Queue Size: not a bottleneck; an experiment with larger queues increased throughput by less than 1%.
Fetch Bandwidth: a prime candidate for bottleneck status; increasing the IQ and the excess registers increased performance by another 7%.
Branch Prediction: SMT is less sensitive to branch prediction accuracy.
Speculative Execution: not a bottleneck; eliminating speculation would raise issues of its own.
Memory Throughput: infinite-bandwidth caches would increase throughput by only 3%.
Register File Size: no sharp drop-off point.
Fetch throughput is still a bottleneck.
Borrows heavily from conventional superscalar design, requiring little additional hardware support.
Minimizes the impact on single-thread performance, running only 2% slower in that scenario.
Achieves significant throughput improvements over the superscalar when many threads are running.
Intel Pentium 4, 2002: Hyper-Threading Technology (HTT), a 30% speed improvement
MIPS MT
IBM POWER5, 2004: two-thread SMT engine
Sun UltraSPARC T1, 2005: CMT (SMT + CMP, chip-level multiprocessing)