Ioana Burcea
Initial Observations of the Simultaneous Multithreading
Pentium 4 Processor
Nathan Tuck and Dean M. Tullsen
Agenda
SMT – proposed in research
Intel Hyper-threading
Methodology
- Benchmarks and experiments
Experimental Results
Questions?
SMT in Research
Up to 8 contexts – 8-way SMT
ICOUNT.2.8 fetching policy
Intel: Hyper-threading
SMT in real silicon – Intel Pentium 4
- Single vs. multithreaded mode
Methodology
Pentium 4, 2.5 GHz
512 MB DRAM
RedHat 7.3, Linux 2.4.28smp
- Linux treats the system as a dual-processor
- It has a separate run queue for each virtual processor
Benchmarks
- SPEC CPU2000
- NAS parallel benchmarks
- SPLASH2 (modified input)
Speedup for Heterogeneous Workloads
T_SMT = total execution time / number of runs
Speedup = T_seq / T_SMT
Speedup per combination = S_bench_1 + S_bench_2
• At least 12 total jobs
• At least 3 runs for each job
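The formulas on this slide can be worked through with a short sketch. The function names and all timing values below are hypothetical, chosen only to illustrate the arithmetic; they are not measurements from the paper.

```python
def mean_smt_time(total_execution_time, number_of_runs):
    """Average time per run while co-scheduled with the other benchmark."""
    return total_execution_time / number_of_runs


def speedup(t_seq, t_smt):
    """Speedup of one benchmark: sequential time over its SMT time."""
    return t_seq / t_smt


def combination_speedup(t_seq_1, t_smt_1, t_seq_2, t_smt_2):
    """Speedup of a pair: the sum of the two per-benchmark speedups."""
    return speedup(t_seq_1, t_smt_1) + speedup(t_seq_2, t_smt_2)


# Hypothetical timings in seconds (assumed values, not from the slides).
t_smt_1 = mean_smt_time(total_execution_time=360.0, number_of_runs=3)  # 120.0
t_smt_2 = mean_smt_time(total_execution_time=300.0, number_of_runs=3)  # 100.0
pair = combination_speedup(t_seq_1=90.0, t_smt_1=t_smt_1,
                           t_seq_2=50.0, t_smt_2=t_smt_2)
print(pair)  # 0.75 + 0.5
```

With this definition, each benchmark on its own runs slower under SMT (a per-benchmark speedup below 1), while a combined value above 1 means the pair finishes more work per unit time than running the two jobs back to back.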
Static Partitioning of Resources
• SPECINT 83% on average
• SPECFP 85% on average
• eon 71%
• wupwise 72%
• mcf 93%
• art 97%
• swim 98%
Independent Threads
Parallel Multithreaded Speedup
SPLASH NAS
Synchronization and Communication Speed
Reading a value protected by a lock
- 37 million times per second
- 68 cycles = lock & read
Updating a value protected by a lock
- 14.6 million times per second
- 171 cycles = lock & update
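The cycle counts on this slide are consistent with dividing the clock rate by the operation rate. A minimal check, assuming the 2.5 GHz clock from the Methodology slide (the function name is illustrative):

```python
CLOCK_HZ = 2.5e9  # 2.5 GHz Pentium 4, as stated on the Methodology slide


def cycles_per_operation(ops_per_second, clock_hz=CLOCK_HZ):
    """Convert an operation rate into an average cost in clock cycles."""
    return clock_hz / ops_per_second


lock_and_read = cycles_per_operation(37e6)      # roughly 67.6 cycles
lock_and_update = cycles_per_operation(14.6e6)  # roughly 171.2 cycles
print(round(lock_and_read), round(lock_and_update))
```

Rounded to whole cycles, these match the 68 and 171 cycles quoted above.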
Synchronization and Communication Speed (cont’d)
Loop
- result = independent computation
- computation that uses result – flow dependence
Independent computation: a loop that contains
- a load
- a float multiply
- a float add
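The loop structure described on this slide might look roughly like the sketch below. It is single-threaded and shows only the dependence pattern; the slide does not say how the work is divided between the two hardware threads, and Python is not suitable for cycle-level timing. All names, the input data, and the particular dependent computation are assumptions made for illustration.

```python
def independent_computation(values, scale, n_iters):
    """Inner loop: one load, one float multiply, one float add per iteration."""
    acc = 0.0
    for i in range(n_iters):
        x = values[i % len(values)]  # a load
        acc = acc + x * scale        # a float multiply and a float add
    return acc


def flow_dependent_loop(values, scale, n_iters, n_rounds):
    """Outer loop: each round produces a result that the next step consumes."""
    total = 0.0
    for _ in range(n_rounds):
        result = independent_computation(values, scale, n_iters)
        # Computation that uses result (flow dependence); this particular
        # update is an arbitrary stand-in, not taken from the slides.
        total = total * 0.5 + result
    return total


print(flow_dependent_loop([1.0, 2.0], 2.0, n_iters=4, n_rounds=2))
```

Raising `n_iters` increases the amount of independent work per round relative to the dependent step, which is the kind of knob such a microbenchmark would vary.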
Synchronization and Communication Speed (cont’d)
Heterogeneous vs. Homogeneous Workloads
Two self copies of SPEC
- Average speedup 1.11 < 1.20
Integer vs. integer 1.17
Float vs. float 1.20
Integer vs. float 1.21
Compiler Interaction
Baseline?
Questions?
Is resource partitioning a good approach?
IBM’s Power5 implementation?
Other implementations?