Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor
Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm
Presented by Kim Ki Young @ DCSLab
Introduction
Simultaneous Multithreading (SMT): a technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units.
Two major impediments to processor utilization:
- long latencies
- limited per-thread parallelism
1. Demonstrate that the throughput gains of SMT are possible without extensive changes to a conventional, wide-issue superscalar processor
2. Show that SMT need not compromise single-thread performance
3. Use a detailed architecture model to analyze and relieve bottlenecks that did not exist in the more idealized model
4. Show how simultaneous multithreading creates an advantage previously unexploitable in other architectures
A projection of current superscalar design trends 3-5 years into the future.
Changes necessary to support simultaneous multithreading:
- multiple program counters
- a separate return stack for each thread
- per-thread instruction retirement, instruction queue flush, and trap mechanisms
- a thread ID with each branch target buffer entry
- a larger register file
MIPSI: a MIPS-based simulator that executes unmodified Alpha object code.
Workload: the SPEC92 benchmark suite (five floating-point programs, two integer programs, and TeX), compiled with the Multiflow trace scheduling compiler.
With only a single thread, throughput is less than 2% below that of a superscalar without SMT support.
Peak throughput is 84% higher than the superscalar's.
Three problems:
- IQ size
- fetch throughput
- lack of parallelism
Improve fetch throughput without increasing the fetch bandwidth.
Scheme notation alg.num1.num2:
- alg: fetch selection method
- num1: number of threads that can fetch in one cycle
- num2: maximum number of instructions fetched per thread in one cycle
Partitioning the fetch unit: RR.1.8, RR.2.4, RR.4.2
With some hardware addition: RR.2.8 (additional logic is required)
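As a rough illustration of the alg.num1.num2 schemes above, here is a toy round-robin fetch model (a hypothetical simulation sketch, not the paper's hardware; the function name, queue representation, and 8-instruction bandwidth cap are assumptions for illustration):

```python
def rr_fetch(thread_queues, rr_start, num1, num2, bandwidth=8):
    """Toy model of RR.num1.num2 fetch partitioning for one cycle.

    thread_queues: pending instruction count per thread.
    rr_start: round-robin priority pointer for this cycle.
    Returns {thread_id: instructions fetched}.
    """
    n = len(thread_queues)
    fetched = {}
    remaining = bandwidth
    for i in range(n):
        tid = (rr_start + i) % n          # round-robin priority order
        if len(fetched) == num1 or remaining == 0:
            break                          # at most num1 threads per cycle
        take = min(thread_queues[tid], num2, remaining)
        if take > 0:
            fetched[tid] = take            # each thread fetches <= num2
            remaining -= take
    return fetched

# RR.2.4: two threads may fetch up to four instructions each.
queues = [6, 3, 8, 1]                      # pending instructions per thread
print(rr_fetch(queues, rr_start=0, num1=2, num2=4))  # {0: 4, 1: 3}
```

The point of RR.2.4 and RR.4.2 over RR.1.8 is visible here: when one thread cannot supply a full fetch block, a second thread can fill part of the bandwidth.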
Fetch Policies
- BRCOUNT: favor threads that are least likely to be on a wrong path (fewest outstanding branches)
- MISSCOUNT: favor threads that have the fewest outstanding D-cache misses
- ICOUNT: favor threads with the fewest instructions in decode, rename, and the instruction queues
- IQPOSN: favor threads whose instructions are farthest from the head of the IQ
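A minimal sketch of how an ICOUNT-style policy picks which threads fetch next: rank threads by in-flight instruction count and take the lowest (the thread names and counts are made up for illustration):

```python
# Hypothetical per-thread state: number of instructions each thread
# currently has in decode, rename, and the instruction queues.
icounts = {"T0": 14, "T1": 3, "T2": 9, "T3": 5}

def icount_select(counts, num_threads):
    """ICOUNT sketch: give fetch priority to the threads with the
    fewest in-flight instructions, since a low count means the thread
    is moving instructions through the machine quickly."""
    ranked = sorted(counts, key=counts.get)  # ascending by count
    return ranked[:num_threads]

print(icount_select(icounts, 2))  # ['T1', 'T3']
```

The same skeleton covers BRCOUNT, MISSCOUNT, and IQPOSN by swapping in a different per-thread metric as the sort key.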
Unblocking the Fetch Unit
- BIGQ: increase the IQ's size without increasing the search space (double the size, but search only the first 32 entries)
- ITAG: do the I-cache tag lookup a cycle early
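The BIGQ idea, a bigger queue behind a fixed search window, can be sketched as follows (the 32-entry window is from the slide; the queue representation, entry fields, and 8-wide issue are assumptions for illustration):

```python
from collections import deque

SEARCH_WINDOW = 32  # issue logic examines only the first 32 IQ entries

def issue_ready(iq, issue_width=8):
    """BIGQ sketch: the queue may hold more entries (e.g. 64), but
    wakeup/select searches only the first SEARCH_WINDOW slots, so the
    search cost stays fixed while the queue absorbs more instructions."""
    issued = [e for e in list(iq)[:SEARCH_WINDOW] if e["ready"]][:issue_width]
    for e in issued:
        iq.remove(e)
    return issued

# A 64-entry queue where every tenth instruction is ready: only the
# ready entries inside the 32-entry window get issued this cycle.
iq = deque({"id": i, "ready": i % 10 == 0} for i in range(64))
print([e["id"] for e in issue_ready(iq)])  # [0, 10, 20, 30]
```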
Two sources of issue slot waste:
- wrong-path instructions, which result from mispredicted branches
- optimistically issued instructions, which result from cache misses or bank conflicts
Issue algorithms: OPT_LAST, SPEC_LAST, BRANCH_FIRST
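The priority idea behind OPT_LAST and SPEC_LAST can be sketched as a sort over ready instructions (the instruction records and flags here are illustrative, not the paper's structures):

```python
# Each ready instruction is tagged with whether it was issued
# optimistically (may be squashed by a cache miss or bank conflict)
# or is speculative (behind an unresolved branch). These flags are
# assumptions made for illustration.
ready = [
    {"id": 1, "optimistic": True,  "speculative": False},
    {"id": 2, "optimistic": False, "speculative": True},
    {"id": 3, "optimistic": False, "speculative": False},
]

def opt_last(instrs):
    """OPT_LAST sketch: issue optimistic instructions only after all
    other ready instructions (oldest-first within each group)."""
    return sorted(instrs, key=lambda i: (i["optimistic"], i["id"]))

def spec_last(instrs):
    """SPEC_LAST sketch: issue speculative instructions last."""
    return sorted(instrs, key=lambda i: (i["speculative"], i["id"]))

print([i["id"] for i in opt_last(ready)])   # [2, 3, 1]
print([i["id"] for i in spec_last(ready)])  # [1, 3, 2]
```

Both deprioritize the instructions most likely to waste an issue slot, which is exactly the waste the two bullets above identify.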
The Issue Bandwidth: not a bottleneck.
Instruction Queue Size: not a bottleneck; an experiment with larger queues increased throughput by less than 1%.
Fetch Bandwidth: a prime candidate for bottleneck status; increasing the IQ and the excess registers increased performance by another 7%.
Branch Prediction: SMT is less sensitive to branch prediction accuracy.
Speculative Execution: not a bottleneck; eliminating speculation would raise issues of its own.
Memory Throughput: infinite-bandwidth caches would increase throughput by only 3%.
Register File Size: no sharp drop-off point.
Fetch throughput is still a bottleneck.
Borrows heavily from conventional superscalar design, requiring little additional hardware support.
Minimizes the impact on single-thread performance, running only 2% slower in that scenario.
Achieves significant throughput improvements over the superscalar when many threads are running.
Intel Pentium 4, 2002: Hyper-Threading Technology (HTT), a 30% speed improvement
MIPS MT
IBM POWER5, 2004: two-thread SMT engine
Sun UltraSPARC T1, 2005: CMT (SMT + CMP, chip-level multiprocessing)