Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor
Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm
Presented by Kim Ki Young @ DCSLab
Introduction
- Simultaneous Multithreading (SMT): a technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units
- Two major impediments to processor utilization
  - long latencies
  - limited per-thread parallelism
1. Demonstrate that the throughput gains of SMT are possible without extensive changes to a conventional, wide-issue superscalar processor
2. Show that SMT need not compromise single-thread performance
3. Use a detailed architecture model to analyze and relieve bottlenecks that did not exist in the more idealized model
4. Show how simultaneous multithreading creates an advantage previously unexploitable in other architectures
- A projection of current superscalar design trends 3-5 years into the future
- Changes necessary to support simultaneous multithreading
  - Multiple program counters
  - Separate return stack for each thread
  - Per-thread instruction retirement, instruction queue flush, and trap mechanisms
  - A thread id with each branch target buffer entry
  - A larger register file
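The per-thread state listed above can be pictured with a small Python sketch. It is only an illustration under assumptions: the names `ThreadContext` and `TaggedBTB`, the one-entry-per-branch-address table, and the example addresses are hypothetical and do not come from the slides. It shows why tagging each branch target buffer entry with a thread id matters: a lookup from a different thread misses instead of returning another thread's target.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    """Hypothetical per-thread front-end state: own PC and own return stack."""
    thread_id: int
    pc: int = 0
    return_stack: list = field(default_factory=list)


class TaggedBTB:
    """Toy branch target buffer: one entry per branch address, tagged by thread."""

    def __init__(self):
        self._entries = {}  # branch_pc -> (thread_id, target)

    def insert(self, thread_id, branch_pc, target):
        # A later insert for the same address replaces the earlier entry.
        self._entries[branch_pc] = (thread_id, target)

    def lookup(self, thread_id, branch_pc):
        entry = self._entries.get(branch_pc)
        # Hit only when the stored thread id matches the requesting thread.
        if entry is None or entry[0] != thread_id:
            return None
        return entry[1]


contexts = [ThreadContext(thread_id=t) for t in range(2)]
btb = TaggedBTB()
btb.insert(thread_id=0, branch_pc=0x400, target=0x480)
hit = btb.lookup(thread_id=0, branch_pc=0x400)
miss = btb.lookup(thread_id=1, branch_pc=0x400)
```

Here `hit` is the stored target and `miss` is `None`, because thread 1 never installed an entry for that address.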
- Simulator: derived from MIPSI, a MIPS-based simulator
  - executes unmodified Alpha object code
- Workload: SPEC92 benchmark suite
  - five floating point programs, two integer programs, TeX
- Compiler: Multiflow trace scheduling compiler
- With only a single thread, throughput is less than 2% below a superscalar w/o SMT support
- Peak throughput is 84% higher than the superscalar
- Three problems
  - IQ size
  - Fetch throughput
  - Lack of parallelism
Improve fetch throughput w/o increasing the fetch bandwidth
- Scheme notation: alg.num1.num2
  - alg: fetch selection method (RR = round-robin)
  - num1: # of threads that can fetch in 1 cycle
  - num2: max # of instructions fetched per thread in 1 cycle
- Partitioning the fetch unit
  - RR.1.8
  - RR.2.4, RR.4.2
- Some hardware addition
  - RR.2.8
  - Additional logic is required
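The alg.num1.num2 notation can be made concrete with a short Python sketch. It is a simplified model under stated assumptions, not the simulator used in the study: `FetchScheme`, `parse_scheme`, `round_robin_fetch`, the `available` map (how many instructions each thread could supply this cycle), and the total fetch width of 8 are all names and values chosen here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchScheme:
    alg: str             # fetch selection method, e.g. "RR"
    num_threads: int     # num1: threads that can fetch in one cycle
    max_per_thread: int  # num2: max instructions fetched per thread per cycle


def parse_scheme(name):
    """Split a name such as 'RR.2.4' into its three fields."""
    alg, num1, num2 = name.split(".")
    return FetchScheme(alg, int(num1), int(num2))


def round_robin_fetch(scheme, ready_threads, start, available, total_width=8):
    """Return [(thread_id, instructions_fetched)] for one cycle.

    ready_threads: thread ids able to fetch this cycle
    start: rotation offset into ready_threads (advanced by the caller)
    available: thread_id -> instructions that thread could supply (assumed input)
    total_width: total fetch bandwidth, assumed to be 8 instructions
    """
    n = len(ready_threads)
    order = [ready_threads[(start + i) % n] for i in range(n)]
    slots, remaining = [], total_width
    for tid in order[:scheme.num_threads]:
        take = min(scheme.max_per_thread, remaining, available[tid])
        if take > 0:
            slots.append((tid, take))
            remaining -= take
    return slots


available = {0: 5, 1: 8, 2: 8}
rr_1_8 = round_robin_fetch(parse_scheme("RR.1.8"), [0, 1, 2], 0, available)
rr_2_4 = round_robin_fetch(parse_scheme("RR.2.4"), [0, 1, 2], 0, available)
rr_2_8 = round_robin_fetch(parse_scheme("RR.2.8"), [0, 1, 2], 0, available)
```

In this toy run thread 0 can supply only 5 instructions. RR.1.8 leaves the other 3 slots empty, RR.2.4 fills all 8 slots from two threads, and RR.2.8 lets the second thread take whatever the first one left over.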
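Fetch Policies (which threads get fetch priority)
- BRCOUNT: threads that are least likely to be on a wrong path
- MISSCOUNT: threads that have the fewest outstanding D-cache misses
- ICOUNT: threads with the fewest instructions in decode, rename, and the instruction queues
- IQPOSN: threads whose instructions are farthest from the head of the IQ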
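Three of these policies share one shape: keep a per-thread counter and fetch from the threads with the lowest values. The Python sketch below illustrates that shape only. The function name `pick_threads` and the sample counter values are made up here, and treating BRCOUNT as a count of unresolved branches is an assumption about how "least likely to be on a wrong path" is estimated.

```python
def pick_threads(counters, num_threads):
    """Return the ids of the num_threads threads with the smallest counters.

    counters: thread_id -> counter value for the chosen policy, e.g.
      ICOUNT:    instructions in decode, rename, and the instruction queues
      MISSCOUNT: outstanding D-cache misses
      BRCOUNT:   unresolved branches (assumed proxy for wrong-path likelihood)
    Ties are broken by thread id so the result is deterministic.
    """
    ranked = sorted(counters, key=lambda tid: (counters[tid], tid))
    return ranked[:num_threads]


# Hypothetical ICOUNT snapshot: thread 0 is clogging the front end.
icount = {0: 12, 1: 3, 2: 7, 3: 3}
chosen = pick_threads(icount, num_threads=2)
```

With this snapshot, `chosen` holds threads 1 and 3, the two with the fewest in-flight front-end instructions. IQPOSN does not fit this helper directly, since it ranks threads by queue position rather than by a count.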
Unblocking the Fetch Unit
- BIGQ: increase the IQ's size without increasing the search space
  - double the size, but search only the first 32 entries
- ITAG: do the I-cache tag lookup a cycle early
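The BIGQ idea, a larger queue whose issue logic still scans a fixed window, can be sketched in Python as follows. This is an illustration under assumptions: the capacity of 64 is inferred from "double the size" with a 32-entry search window, and `select_issuable`, `is_ready`, and the sample queue contents are hypothetical.

```python
SEARCH_WINDOW = 32   # entries examined by the issue logic each cycle
QUEUE_CAPACITY = 64  # assumed doubled size; extra entries only buffer


def select_issuable(queue, is_ready, issue_width):
    """Pick up to issue_width ready instructions from the searchable window.

    queue: instructions ordered oldest first (index 0 is the queue head)
    is_ready: predicate telling whether an instruction's operands are ready
    """
    selected = []
    for instr in queue[:SEARCH_WINDOW]:
        if len(selected) >= issue_width:
            break
        if is_ready(instr):
            selected.append(instr)
    return selected


# Toy queue of 64 instruction ids; every id ending in 9 counts as ready.
queue = list(range(QUEUE_CAPACITY))
picked = select_issuable(queue, lambda i: i % 10 == 9, issue_width=6)
```

Only ids 9, 19, and 29 are picked: ids 39 and above are buffered but sit outside the 32-entry window, so the search cost stays the same while fetch has more room before the queue fills.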
Two sources of issue slot waste
- Wrong-path instructions: result from mispredicted branches
- Optimistically issued instructions: result from a cache miss or bank conflict

Issue Algorithms
- OPT_LAST
- SPEC_LAST
- BRANCH_FIRST
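One way to read the three algorithm names is as sort keys over the ready instructions. The Python sketch below takes that view under stated assumptions: OPT_LAST issues optimistic instructions after all others, SPEC_LAST does the same for speculative instructions, BRANCH_FIRST issues branches as early as possible, and an OLDEST_FIRST baseline (not listed on the slide) is included for comparison. The `Instr` fields and the function name `issue_order` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    age: int                      # lower value means older instruction
    is_branch: bool = False
    is_optimistic: bool = False   # issued assuming a load will hit
    is_speculative: bool = False  # behind an unresolved branch


def issue_order(ready, policy):
    """Return the ready instructions in the order the policy would issue them."""
    keys = {
        "OLDEST_FIRST": lambda i: i.age,  # assumed baseline
        "OPT_LAST": lambda i: (i.is_optimistic, i.age),
        "SPEC_LAST": lambda i: (i.is_speculative, i.age),
        "BRANCH_FIRST": lambda i: (not i.is_branch, i.age),
    }
    if policy not in keys:
        raise ValueError(f"unknown issue policy: {policy}")
    # False sorts before True, so flagged instructions move to the back
    # (or, for BRANCH_FIRST, branches move to the front).
    return sorted(ready, key=keys[policy])


ready = [
    Instr(age=0, is_optimistic=True),
    Instr(age=1, is_branch=True, is_speculative=True),
    Instr(age=2),
]
ages = {p: [i.age for i in issue_order(ready, p)]
        for p in ("OLDEST_FIRST", "OPT_LAST", "SPEC_LAST", "BRANCH_FIRST")}
```

Within each priority class the oldest instruction still goes first, so every policy falls back to age order when no instruction carries the relevant flag.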
- Issue bandwidth: not a bottleneck
- Instruction queue size: not a bottleneck
  - experiments with larger queues increased throughput by less than 1%
- Fetch bandwidth: prime candidate for bottleneck status
  - increasing the IQ size and excess registers increased performance another 7%
- Branch prediction: SMT is less sensitive to it
- Speculative execution: not a bottleneck, but eliminating it would be an issue
- Memory throughput: infinite-bandwidth caches would increase throughput by only 3%
- Register file size: no sharp drop-off point
- Fetch throughput is still a bottleneck
- Borrows heavily from conventional superscalar design, requiring little additional hardware support
- Minimizes the impact on single-thread performance, running only 2% slower in that scenario
- Achieves significant throughput improvements over the superscalar when many threads are running
- Intel Pentium 4, 2002
  - Hyper-Threading Technology (HTT)
  - 30% speed improvement
- MIPS MT
- IBM POWER5, 2004
  - two-thread SMT engine
- Sun UltraSPARC T1, 2005
  - CMT: SMT + CMP (chip-level multiprocessing)