Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.

Upload: leo-bryan

Post on 17-Dec-2015


Page 1: Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al

Helper Threads via Virtual Multithreading on an Experimental Itanium 2 Processor Platform

Perry H. Wang et al.


Outline

Helper threads

VMT ideas

Implementation details

• Hardware

• Firmware

• Compiler

Results

Conclusion


Helper threads

Used in multithreaded architectures to prefetch hard-to-predict delinquent data or to precompute hard-to-predict branches.

Threads share resources such as fetch bandwidth and functional units.


Hyper Threading (Intel - P4)

Each hardware thread context is exposed to the OS as a logical processor.

The OS finds threads for execution and binds them to a logical processor.

The user has to use the OS-visible thread API to create and manage threads.


Helper Threads - Issues

Resource contention among multiple helper threads

Adaptable invocation for different program phases.

• Threads have to be self-throttling.

OS-based thread synchronization is unpredictable and has long latency (on the order of microseconds).


Virtual Multithreading

Single processor supports multiple thread contexts.

Monitors long-latency micro-architectural events.

Switches to a different instruction in the same program in 100 cycles.

OS transparent.

Uses firmware support in the Itanium 2 processor to reduce context switch time.


Context switch requires


Advantages with VMT on Itanium

Ability to track micro-architectural events without involvement of the OS, e.g. last-level cache misses.

Large register set partitioned by the compiler for helper threads.

• Register communication is easier.

• Value synchronization: no memory communication.

OS context switches allow threads to be resumed on any processor.


New Instructions

Yield

• Synchronous transfer to the VMT thread, similar to a branch misprediction.

Yield conditional

• Transfer only when the pipeline stalls at some later instruction.

• Execution proceeds and instructions retire.

• If no pipeline stall occurs, the instruction behaves as a nop.
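The difference between the two instructions can be illustrated with a toy decision function. This is only a sketch of the semantics listed on the slide, not the hardware or firmware behavior in detail; the names `Op`, `next_thread`, and the `yield.cond` mnemonic are hypothetical and chosen for illustration.

```python
from enum import Enum


class Op(Enum):
    YIELD = "yield"            # unconditional, synchronous transfer
    YIELD_COND = "yield.cond"  # hypothetical mnemonic for "yield conditional"


def next_thread(op: Op, later_stall: bool) -> str:
    """Return which thread runs next in this toy model.

    `later_stall` says whether the pipeline stalls at some instruction
    after the yield conditional has been executed.
    """
    if op is Op.YIELD:
        # Yield always transfers control to the helper (VMT) thread.
        return "helper"
    # Yield conditional only transfers if a later stall occurs;
    # otherwise it acts as a nop and the main thread keeps running.
    return "helper" if later_stall else "main"
```

For example, `next_thread(Op.YIELD_COND, later_stall=False)` returns `"main"`, which corresponds to the nop case.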


Key Characteristics

Self-throttling: main and helper threads keep counters to track progress (iteration counter).

• Helper thread falls behind -> reload value.

• Helper thread runs too far ahead -> relinquish control.

Main thread begins execution at instruction that triggered helper thread invocation.

VMT preserves thread continuation of helper threads -> helper thread can restart where it stopped.
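The self-throttling rule can be sketched as a comparison of the two iteration counters. This is a minimal illustration under assumptions: the function name `throttle`, the `max_lead` threshold, and the choice to resynchronize the helper to the main thread's counter are not taken from the slides.

```python
from typing import Tuple


def throttle(main_iter: int, helper_iter: int, max_lead: int = 8) -> Tuple[str, int]:
    """Decide the helper thread's next action from the iteration counters.

    Returns (action, helper_iter), where action is "reload",
    "relinquish", or "run". `max_lead` is an assumed tuning knob.
    """
    if helper_iter < main_iter:
        # Helper fell behind: its work is no longer useful, so it
        # reloads the main thread's current position and catches up.
        return "reload", main_iter
    if helper_iter - main_iter > max_lead:
        # Helper ran too far ahead: give control back to the main thread.
        return "relinquish", helper_iter
    return "run", helper_iter
```

With the default `max_lead`, `throttle(10, 5)` asks for a reload, `throttle(10, 30)` relinquishes control, and `throttle(10, 12)` lets the helper keep running.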


Key Characteristics

VMT has to maintain

• Initial instruction address

• Continuation instruction address

Compiler reserves 2 registers for this purpose.

Support for multiple helper threads can be done by reprogramming these registers.
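The two reserved registers can be modeled as a small record. The sketch below is an assumption-laden illustration of the bookkeeping described on this slide; the class `VmtRegisters` and its method names are hypothetical, and the addresses are made-up values.

```python
from dataclasses import dataclass


@dataclass
class VmtRegisters:
    initial_ip: int       # where a fresh helper-thread invocation starts
    continuation_ip: int  # where a suspended helper thread resumes

    def entry_point(self, resume: bool) -> int:
        """Address to jump to on the next switch into the helper thread."""
        return self.continuation_ip if resume else self.initial_ip

    def suspend(self, current_ip: int) -> None:
        """Record where the helper stopped so it can restart there."""
        self.continuation_ip = current_ip

    def reprogram(self, new_initial_ip: int) -> None:
        """Point both registers at a different helper thread."""
        self.initial_ip = new_initial_ip
        self.continuation_ip = new_initial_ip
```

Reprogramming simply overwrites both values, which is how a single register pair could serve several helper threads one at a time.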


Itanium Firmware

Programmable debugging hardware support for PAL

• To enable silicon debugging & validation.

PAL can program the PMU to monitor and count events of interest: opcode monitoring, instruction address, data address.

Debugging hardware can trigger a PAL handler when the monitored event occurs.


Firmware

VMT mechanism emulated by firmware infrastructure.

Opcode monitoring to simulate yield and yield conditional.

PAL programs the PMU to track

• Last-level cache misses

• Pipeline stalls

• Instructions with special opcodes

Thread switch latency = pipeline flush + overhead for manipulating registers (~140 cycles, leaving ~60 cycles of computation time when a memory miss takes ~200 cycles).
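The cycle budget quoted above is a simple subtraction, shown here as a small helper. The function name `helper_compute_budget` is hypothetical; the 200-cycle and 140-cycle figures are the approximate values from the slide.

```python
def helper_compute_budget(miss_latency: int, switch_overhead: int) -> int:
    """Cycles left for helper-thread work while a miss is outstanding.

    If the switch overhead exceeds the miss latency, nothing is left,
    so the budget is clamped at zero.
    """
    return max(0, miss_latency - switch_overhead)


print(helper_compute_budget(200, 140))  # 60
```

The clamp makes the trade-off visible: with the same ~140-cycle overhead, any miss shorter than 140 cycles leaves no time for useful helper work.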


Experimental machine

4-way 1.5 GHz Itanium 2 processor-based MP system with 16 GB of RAM.

Separate 16 KB 4-way set-associative L1 I- and D-caches.

Shared 256 KB 8-way set-associative L2 cache.

6 MB 24-way L3 cache that can be configured as a 1 MB 4-way set-associative cache.


Workloads

MCF – combinatorial optimization.

VPR – FPGA Circuit Placement and Routing

DOT – graph layout optimization tool

DSS system running on a 100 GB IBM DB2 database

• 6 queries with long run times that span large portions of the database.

• 95% CPU utilization, 40 concurrent threads.


Compiler and optimizations

Electron compiler: -O3, IPA, profile-guided optimization, Itanium 2-specific optimizations.

Recompiled to obtain threads and linked with original binaries.

• Register partitioning to minimize VMT context switch overhead.

• Aggressive software prefetching with profile feedback.

• ld.s, chk, predication, branch prediction hints.


Speedup

A few helper threads give good speedup.

A significant fraction of L3 misses is removed from the main thread (avg. 48%).

Capacity misses in L3 are due to pointer chasing.

Helper thread size is small.

Helper threads can also contain control flow dependencies.

Throughput is improved by reducing the latency of individual threads.


Conclusion

Fly-weight context switching.

5.8 – 38.5% increase for SPEC2000 INT.

5 – 12% speedup on DSS workload.

VMT threads are to be invoked based on program behavior, depending on the number of cache misses.


My view on limitations

Requires large register files and firmware support.

Too Itanium-specific (not adaptable to other architectures).

Scalability of helper threads (managing the number of helper threads running at one time becomes too complex).