Ioana Burcea Initial Observations of the Simultaneous Multithreading Pentium 4 Processor Nathan Tuck and Dean M. Tullsen


Post on 27-Dec-2015



Ioana Burcea

Initial Observations of the Simultaneous Multithreading

Pentium 4 Processor

Nathan Tuck and Dean M. Tullsen


Agenda

SMT – proposed in research

Intel Hyper-threading

Methodology

- Benchmarks and experiments

Experimental Results

Questions?


SMT in Research

Up to 8 contexts – 8-way SMT

ICOUNT 2.8 fetching policy


Intel: Hyper-threading

SMT in real silicon – Intel Pentium 4

- Single vs. multithreaded mode


Methodology

Pentium 4, 2.5 GHz

512 MB DRAM

RedHat 7.3, Linux 2.4.28smp

- Linux treats the system as a dual-processor machine

- It has a separate run queue for each virtual processor

Benchmarks

- SPEC CPU2000

- NAS parallel benchmarks

- SPLASH2 (modified input)


Speedup for Heterogeneous Workloads

T_SMT = total_execution_time / number_of_runs

Speedup = T_seq / T_SMT

Speedup per combination = S_bench_1 + S_bench_2

• At least 12 total jobs

• At least 3 runs for each job
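The per-benchmark and per-combination speedup definitions above can be sketched in Python as follows. This is a minimal illustration, not the authors' measurement harness: the function names and all sample times are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
def avg_smt_time(total_execution_time, number_of_runs):
    """Average time per run while co-scheduled: total time / number of runs."""
    if number_of_runs <= 0:
        raise ValueError("number_of_runs must be positive")
    return total_execution_time / number_of_runs


def speedup(t_seq, t_smt):
    """Per-benchmark speedup: sequential time divided by average SMT time."""
    return t_seq / t_smt


def combined_speedup(t_seq_1, t_smt_1, t_seq_2, t_smt_2):
    """Speedup per combination: the sum of the two per-benchmark speedups."""
    return speedup(t_seq_1, t_smt_1) + speedup(t_seq_2, t_smt_2)


# Hypothetical times in seconds (not taken from the slides).
t_smt_a = avg_smt_time(total_execution_time=600.0, number_of_runs=3)  # 200.0
t_smt_b = avg_smt_time(total_execution_time=480.0, number_of_runs=3)  # 160.0

# Benchmark A alone: 120 s; benchmark B alone: 100 s.
pair = combined_speedup(120.0, t_smt_a, 100.0, t_smt_b)  # 0.6 + 0.625
print(pair)
```

Under this definition each benchmark contributes the fraction of its single-threaded rate that it retains while sharing the core, so a combined value above 1.0 suggests the pair gets more work done together than it would by running one after the other.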


Static Partitioning of Resources

• SPECINT 83% on average

• SPECFP 85% on average

• eon 71%

• wupwise 72%

• mcf 93%

• art 97%

• swim 98%


Independent Threads


Parallel Multithreaded Speedup

SPLASH NAS


Synchronization and Communication Speed

Reading a value protected by a lock

- 37 million times per second

- 68 cycles = lock & read

Updating a value protected by a lock

- 14.6 million times per second

- 171 cycles = lock & update
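The cycle counts on this slide agree with dividing the clock rate by the measured operation rate. The sketch below checks that arithmetic; it assumes the 2.5 GHz clock from the Methodology slide, and the helper name is hypothetical. The slides do not say how the cycle counts were obtained, so this is only a consistency check.

```python
CLOCK_HZ = 2.5e9  # assumption: the 2.5 GHz Pentium 4 from the Methodology slide


def cycles_per_operation(ops_per_second, clock_hz=CLOCK_HZ):
    """Clock cycles consumed by one operation at the given throughput."""
    return clock_hz / ops_per_second


read_cycles = cycles_per_operation(37e6)      # lock & read
update_cycles = cycles_per_operation(14.6e6)  # lock & update

print(round(read_cycles), round(update_cycles))  # 68 171
```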


Synchronization and Communication Speed (cont’d)

Loop

- result = independent computation

- computation that uses result – flow dependence

Independent computation: a loop that contains

- a load

- a float multiply

- a float add


Synchronization and Communication Speed (cont’d)


Heterogeneous vs. Homogeneous Workloads

Two copies of the same SPEC benchmark

- Average speedup 1.11 < 1.20

Integer vs. integer 1.17

Float vs. float 1.20

Integer vs. float 1.21


Compiler Interaction

Baseline?


Questions?

Is resource partitioning a good approach?

IBM’s Power5 implementation?

Other implementations?