![Page 1: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/1.jpg)
Tarantula: A Vector Extension to the Alpha Architecture
Roger Espasa, Federico Ardanaz, Joel Emer, Stephen Felix, Julio Gago, Roger Gramunt, Isaac Hernandez, Toni Juan, Geoff Lowney, Matthew Mattina, André Seznec
Universitat Politècnica Catalunya, Barcelona, Spain
Compaq Computer Corporation, Shrewsbury, MA
![Page 2: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/2.jpg)
State of the World
• CMOS technology progresses
  – More transistors, more functional units, more control overhead
• VLIW and wide superscalar
  – More individually controlled units
  – Amount of real estate for control logic grows non-linearly
• Vector ISA
  – Localization of parallelism, aggregation of control
  – Regular structures, simple control
![Page 3: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/3.jpg)
Tarantula
• EV8 core + tightly integrated Vector Unit
  – Out of Order execution, Register Renaming
  – Integrated in VM and cache coherence system
  – SMT support
• Targeted at scientific computing applications
• Requires compiler support and recompilation
![Page 4: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/4.jpg)
Vector ISA
• New Architectural State
  – 32 vector registers (v0-v31)
    • v31 wired to 0. Used for prefetch
  – Vector length (vl), Vector stride (vs), Vector Mask (vm)
• 45 New Instructions
  – 5 Groups
    • Vector-Vector, Vector-Scalar, Strided Memory Access, Random Memory Access, Vector Control
![Page 5: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/5.jpg)
Vector Mask
• Allows conditional execution without EV8 scalar registers
• VM can be renamed

A(i).ne.0.and.B(i).gt.2

vloadq A(i) --> v0
vloadq B(i) --> v1
vcmpne v0, #0 --> v6
vcmpgt v1, #2 --> v7
vand v6, v7 --> v8
setvm v8 --> vm
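The instruction sequence above can be modelled in a few lines of Python. This is only a behavioural sketch: the function names mirror the mnemonics on the slide, but the list-based vectors, the sample data, and the `masked_add` helper (a stand-in for any instruction executed under the mask) are assumptions made for illustration, not part of the Tarantula ISA definition.

```python
# Behavioural model of the mask-building sequence; vectors are plain lists.

def vcmpne(v, imm):
    """Element-wise 'not equal to immediate' compare."""
    return [x != imm for x in v]

def vcmpgt(v, imm):
    """Element-wise 'greater than immediate' compare."""
    return [x > imm for x in v]

def vand(a, b):
    """Element-wise logical AND of two boolean vectors."""
    return [x and y for x, y in zip(a, b)]

def masked_add(dst, src, imm, vm):
    """Hypothetical masked operation: add imm where vm is set, keep dst elsewhere."""
    return [s + imm if m else d for d, s, m in zip(dst, src, vm)]

# Illustrative data (not from the slides).
A = [0, 5, 3, 0, 7]
B = [9, 1, 4, 8, 3]

v0, v1 = A, B              # vloadq A(i) --> v0 ; vloadq B(i) --> v1
v6 = vcmpne(v0, 0)         # vcmpne v0, #0 --> v6
v7 = vcmpgt(v1, 2)         # vcmpgt v1, #2 --> v7
v8 = vand(v6, v7)          # vand v6, v7 --> v8
vm = v8                    # setvm v8 --> vm

result = masked_add(A, A, 100, vm)
```

Only the elements where both conditions hold are modified; no scalar register is involved in evaluating the condition.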
![Page 6: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/6.jpg)
Tarantula Block Diagram
![Page 7: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/7.jpg)
Vector Execution Unit
• 16 independent lanes
  – No communication, except for gather/scatter
• Each lane has
  – 2 functional units
  – Slice of Register File and Mask
    • Allows high bandwidth
  – Address generator and private TLB
• 32 functional units appear as only 2 issue ports
  – Simple scheduling
![Page 8: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/8.jpg)
Vector Unit – Core Interface
• Vector Unit physically separate from core
  – Little modification to core
• Large bus prevented by routing space
  – Core to VBox
    • 3-instruction bus
    • 2 data buses for scalars from EV8 register file
    • 3-instruction kill signal bus for misspeculation
  – VBox to Core
    • 3-instruction completion bus
![Page 9: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/9.jpg)
Power Consumption
![Page 10: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/10.jpg)
Vector Memory System
• Bound to EV8 VM and Cache Coherence architecture
• High Load/Store Bandwidth required
  – Goal: one 64-bit datum per flop
  – Memory bus too slow
  – L1 cache too small for vector data
  – Direct connection to L2 cache
• Non-unit stride is the central problem
  – 20% of all accesses
  – Accesses don't match cache lines
![Page 11: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/11.jpg)
Non-Unit Strides
• EV8 4MByte L2 Cache in 128 banks
  – 8 ways, 16 banks per way
  – Read 8 ways, select correct one
• Non-unit stride accesses
  – Read 16 independent cache lines
  – Select one qword per line
• Requires
  – Conflict free addresses
  – Conflict free writes to 16 lanes
• One qword per lane per cycle
![Page 12: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/12.jpg)
Conflict Free Addresses
• Possible for any 128 consecutive elements
  – For stride S = σ × 2^s with σ odd and s ≤ 4
  – Order stored in ROM table
• Elements accessed out of order
  – Even for length < 128, address generation takes the full eight cycles
• Slice
  – Group of 16 conflict free addresses
![Page 13: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/13.jpg)
PUMP
• Stride 1 accesses
  – 80% of all accesses
  – 128 qwords in 16 (aligned) or 17 (misaligned) cache lines
• Full cache lines read into PUMP latches
  – Two qwords per cycle sent to VBox
• Similar for writes
• Allows double bandwidth
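The 16-versus-17 line count follows from simple address arithmetic. The sketch below assumes 8-byte qwords and 64-byte cache lines; the line size is not stated on the slide but is implied by 128 qwords (1024 bytes) filling exactly 16 lines. The function name and constants are illustrative.

```python
QWORD_BYTES = 8        # one quadword = 64 bits
LINE_BYTES = 64        # assumption: implied by 128 qwords == 16 cache lines
VECTOR_ELEMENTS = 128  # maximum vector length

def cache_lines_touched(base_addr, n=VECTOR_ELEMENTS):
    """Count the cache lines covered by n consecutive qwords starting at base_addr."""
    if base_addr % QWORD_BYTES != 0:
        raise ValueError("base address must be qword aligned")
    first_line = base_addr // LINE_BYTES
    last_line = (base_addr + n * QWORD_BYTES - 1) // LINE_BYTES
    return last_line - first_line + 1

aligned = cache_lines_touched(0x1000)          # starts on a line boundary
misaligned = cache_lines_touched(0x1000 + 56)  # starts on the last qword of a line
```

Any qword-aligned base that is not also line-aligned spills into one extra line, which is why the PUMP has to handle both cases.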
![Page 14: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/14.jpg)
Gathers and Scatters
• Arbitrary address for every vector element
  – Reordering algorithm doesn't work
• Conflict Resolution Box (CR)
  – Find biggest subset of non-conflicting addresses, pack into slice
  – Add new addresses to remaining ones and repeat
• Worst case 128 slices generated
• Same algorithm used for self-conflicting strides
  – Stride S = σ × 2^s with s > 4
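The select-pack-refill loop of the CR box can be sketched as follows. This is a simplified software model, not the hardware algorithm: it considers only L2 bank conflicts (lane write conflicts are ignored), and the bank mapping `bank_of` (cache-line index modulo 16, with 64-byte lines) is an assumption chosen for illustration. Under this bank-only model, taking one pending address per distinct bank yields a largest possible non-conflicting subset.

```python
from collections import deque

SLICE_SIZE = 16   # addresses per slice (one per lane)
NUM_BANKS = 16    # banks that can be read in parallel
LINE_BYTES = 64   # assumed cache-line size

def bank_of(addr):
    """ASSUMPTION: bank = cache-line index modulo the number of banks."""
    return (addr // LINE_BYTES) % NUM_BANKS

def resolve_conflicts(addresses, bank_fn=bank_of, window=SLICE_SIZE):
    """Pack addresses into slices whose members all map to different banks."""
    incoming = deque(addresses)
    pending = []   # addresses currently held in the (modelled) CR box
    slices = []
    while incoming or pending:
        # Refill the window with new addresses next to the leftovers.
        while incoming and len(pending) < window:
            pending.append(incoming.popleft())
        used_banks = set()
        chosen, leftover = [], []
        for addr in pending:
            bank = bank_fn(addr)
            if bank in used_banks:
                leftover.append(addr)   # conflicts with an address already chosen
            else:
                used_banks.add(bank)
                chosen.append(addr)
        slices.append(chosen)           # always holds at least one address
        pending = leftover
    return slices

# Worst case: 128 addresses that all fall into the same bank.
same_bank = [i * LINE_BYTES * NUM_BANKS for i in range(128)]
worst = resolve_conflicts(same_bank)

# Best case: consecutive cache lines, so every window of 16 is conflict free.
spread = [i * LINE_BYTES for i in range(128)]
best = resolve_conflicts(spread)
```

In this model the worst case degenerates to one address per slice, matching the 128-slice bound on the slide, while a conflict-free pattern needs only 128 / 16 = 8 slices.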
![Page 15: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/15.jpg)
Vector Misses
• To handle L2 misses, slices are treated as atomic
• On miss, slice moved to Miss Address File (MAF)
  – Wait for missing data
  – Go to retry queue
• Too many retries cause Panic Mode
  – MAF nacks all other L2 requests that might prevent progress
![Page 16: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/16.jpg)
Scalar-Vector Coherency
• VBox bypasses L1 cache
  – Presence bit P indicates L2 cache line loaded by VCore
  – If P set, VBox invalidates L1
• Scalar write followed by vector read is not covered
  – Barrier command required
  – DrainM purges the write buffer and causes a replay trap
![Page 17: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/17.jpg)
Evaluation
• No compiler support available
  – Hand coded assembler cores
• Scientific Benchmarks
• ASIM Simulator
  – Cycle Accurate EV8 simulator
• Tarantula compared to
  – EV8
  – EV8 + Tarantula's memory system
  – Tarantula4: 1:4 ratio to RAMBUS frequency
![Page 18: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/18.jpg)
Operations per Cycle
![Page 19: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/19.jpg)
Speed Up over EV8
![Page 20: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/20.jpg)
Conclusions
• Vector Processor most efficient solution for many applications
• Vector Unit can be added to standard microprocessor core
• Big Bandwidth requirement can only be satisfied by L2 cache
• Potentially big performance gains
  – 2 to 20 over EV8
• Performance depends on good code
  – Tiling + aggressive prefetching
• Very good power/performance ratio
![Page 21: Tarantula A Vector Extension to the Alpha Architecture](https://reader035.vdocuments.us/reader035/viewer/2022062816/56814d71550346895dbaca5b/html5/thumbnails/21.jpg)
Questions
• Can only scientific applications exploit vector processors?
  – Radix sort worked
  – Powerful memory access instructions
  – Masks allow logic execution
• Does anyone know more about PRAM algorithms?
• EV8/VBox coherency seems quirky. Does anyone see a better solution?