Frank Casilio
Computer Engineering
May 15, 1997
Multithreaded Processors
1997 Frank Casilio, Computer Engineering
Problems with MultiProcessors
• Memory Latency
• Context Switching Time
• Communication/Synchronization Latency
• Cache Coherence
  • Writes To Memory
• Poor Programming Model
Motivation
• Reduce/Tolerate Memory Latency
• General Purpose Machine
• Scalability
• Shared Memory
• Simpler Programming Model
Typical Ways To Reduce Latency
• On-Chip Cache
• Shortens Round Trip To Memory
• Fast Buses & Networks
• Hardware Synchronization
• Prefetching
Multi-Threading: The Concept
• Support For Multiple Concurrent Hardware Contexts
• Tolerates Latency Instead of Reducing It
• Swap Contexts During Latencies
• Experimental Systems Have Existed Since The 1950s
• Only 2 Commercial Systems Ever Produced
  • HEP
  • Tera MTA
Parameters That Affect Efficiency
• Number Of Contexts Supported
• Switching Overhead
• Run Length (Granularity)
• Average Latency To Be Hidden
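The four parameters above can be combined in a simple deterministic model that is common in the multithreading literature; it is not taken from these slides. The sketch below assumes every run of `run_length` busy cycles ends in a stall of `latency` cycles and that every switch costs `switch_cost` cycles. All function and parameter names are illustrative.

```python
def efficiency(n_contexts, run_length, latency, switch_cost=0):
    """Fraction of cycles spent on useful work under a deterministic model.

    Assumptions (not from the slides): every context runs for exactly
    `run_length` cycles, then stalls for exactly `latency` cycles, and each
    context switch costs `switch_cost` cycles.
    """
    # Too few contexts: the processor idles for part of every stall.
    linear = n_contexts * run_length / (run_length + switch_cost + latency)
    # Enough contexts: only the switching overhead is lost.
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(efficiency(n, run_length=10, latency=90), 3))
```

In this model efficiency grows linearly with the number of contexts until the latency is fully covered, after which only the switching overhead limits it.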
Switching Theory
• Determines How Often Contexts Switch
• Two Different Types
  • Fine Grained
  • Coarse Grained
• Directly Related to Cost
Fine Grained Switching
• Switches Contexts Every Cycle
• Many Long-Latency Operations Tolerated
• Requires More Contexts
  • Workload Requirements
• Can Simplify Overall Processor Complexity
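A toy cycle-level simulation can illustrate why fine-grained switching needs many contexts. This is a sketch under a deliberately pessimistic assumption that every instruction is followed by a fixed stall of `latency` cycles; the function name and the round-robin selection policy are illustrative, not a description of any real processor.

```python
def simulate_fine_grained(n_threads, latency, cycles):
    """Return the fraction of cycles in which some thread issued.

    Each cycle the processor issues one instruction from the next ready
    thread in round-robin order. Assumption: after issuing, a thread is
    not ready again for `latency` further cycles.
    """
    ready_at = [0] * n_threads  # first cycle at which each thread may issue
    next_thread = 0
    issued = 0
    for cycle in range(cycles):
        for k in range(n_threads):
            t = (next_thread + k) % n_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + latency
                next_thread = (t + 1) % n_threads
                issued += 1
                break  # at most one issue per cycle
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, simulate_fine_grained(n, latency=7, cycles=800))
```

With a 7-cycle stall, utilization in this toy model rises in proportion to the thread count and reaches 100% only once 8 threads are available.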
Coarse Grained Switching
• Switches Contexts After A Couple Of Cycles
  • Has Problems With Sporadic Latencies
• Requires Fewer Contexts
• Requires More Complex Processors
The TERA MTA
• First Commercial Multithreaded Machine Since 1978
• Uniform Shared Memory
• Scalable
• Direct Relationship Between PEs & Throughput
• Fine Grained Architecture
The Tera MTA Cont’d
• Toroidal Interconnection
• 12 Million Dollar Base System
• 16-256 Processor Versions
Processor Characteristics
• Support For 128 Threads
• 16 Protection Domains
• 333 MHz Nominal Speed
• 0 Context Switching Overhead!!!
• 1 GFLOP Peak Performance
Processor Characteristics Cont’d
• Load-Store Architecture
  • 3 Addressing Modes
• 31 64-bit GPRs
• 3 Operations Per Instruction
  • 1 Memory Reference
  • 1 Arithmetic Operation
  • 1 Control (i.e., Branch)
• 6 kW Of Power Dissipation Per Processor
Interconnection Network
• 3-D Torus Contains 3p/2 nodes
• Packet Switching
• 3 Cycles of Latency Per Node
• Messages Are Assigned Random Priorities
• 164-Bit Packets
  • 64 Bits Are Data
• 2.67 GB/s Bandwidth In Each Direction
• 2 HIPPI Channels / Processor For Net Connection
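The network figures above can be cross-checked with a little arithmetic. The sketch below assumes that one 64-bit data word crosses a link per clock and that the 333 MHz nominal clock from the processor slide is exactly 1/3 GHz; neither assumption is stated in the slides, and all names are illustrative.

```python
CLOCK_HZ = 1e9 / 3      # assumption: "333 MHz nominal" taken as 1/3 GHz
PACKET_BITS = 164       # from the slide
DATA_BITS = 64          # from the slide
CYCLES_PER_NODE = 3     # from the slide


def link_data_bandwidth(clock_hz=CLOCK_HZ, data_bits=DATA_BITS):
    """Bytes per second if one data word crosses a link each cycle (assumption)."""
    return clock_hz * data_bits / 8


def payload_fraction(data_bits=DATA_BITS, packet_bits=PACKET_BITS):
    """Share of each packet that carries data rather than other fields."""
    return data_bits / packet_bits


def network_latency_cycles(nodes_traversed, cycles_per_node=CYCLES_PER_NODE):
    """Cycles spent in the network for a message crossing the given nodes."""
    return nodes_traversed * cycles_per_node


if __name__ == "__main__":
    print(round(link_data_bandwidth() / 1e9, 2), "GB/s")
    print(round(payload_fraction(), 2), "of each packet is data")
    print(network_latency_cycles(10), "cycles for 10 nodes")
```

Under these assumptions the 2.67 GB/s figure is consistent with 8 data bytes per cycle, and about 39% of each packet is payload.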
Memory
• 8, 16, 32 and 64 Bit Addressable
• 4 Bits per Word Of Access State For Synchronization
• Memory Units Equipped With Error Correcting Code
• Memory Usage Is Randomized Across All Banks
• Either 2p or 4p Units, Interleaved 64 Ways
• 16 MB DRAM Chips
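Why spread references across banks instead of using plain interleaving? The sketch below compares a plain modulo mapping with a hypothetical XOR-fold scramble. The scramble is only a stand-in chosen because it is easy to reason about; it is not the Tera MTA's actual address mapping, and all names are illustrative.

```python
BANKS = 64  # "Interleaved 64 Ways"


def bank_plain(word_index, banks=BANKS):
    """Plain interleaving: consecutive words go to consecutive banks."""
    return word_index % banks


def bank_scrambled(word_index, banks=BANKS):
    """Hypothetical scramble: XOR the low bits with the next-higher bits."""
    return (word_index ^ (word_index // banks)) % banks


def banks_touched(mapping, stride, accesses=64):
    """Number of distinct banks hit by `accesses` references at a fixed stride."""
    return len({mapping(i * stride) for i in range(accesses)})


if __name__ == "__main__":
    for stride in (1, 64):
        print(stride,
              banks_touched(bank_plain, stride),
              banks_touched(bank_scrambled, stride))
```

A stride of 64 words sends every reference to one bank under plain interleaving but to all 64 banks under the fold. A fixed fold like this still has bad strides of its own (65 in this sketch), which is one reason to prefer a mapping that looks random.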
Input / Output
• Maximum Strategy Gen5 XL RAID
• Sustained Bandwidth of 130 MB/s
• At Least p/16 Disk Arrays Are Required
• System Capacity of 300p GB
• 20p MB/s In Each Direction
Operating System
• Distributed Parallel Version Of Unix
  • Highly Concurrent Version Of Berkeley
• Allows Systems To Run p Tasks Truly In Parallel
• Streams Are Dynamically Created w/o OS Intervention
• Processes Are Broken Up Into Tasks By OS
• Two Tier Scheduler Provides Better Resource Allocation
  • PL Scheduler
  • PB Scheduler
Software / Languages
• Implicit And Explicit Parallelism Is Allowed
• Automatic Parallelization Of:
  • C, C++ & Fortran By The Compiler
• High Degree of Cray Compatibility
• Easy To Program b/c Of Architecture
System Performance
• 3.84-12.8 Times Performance Of Cray T90/32
• 1K x 1K Matrix Multiply in 50 ms
• Integer Sort of 100M Keys in 36 ms
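The timings above can be turned into rates with a short calculation. The sketch assumes "1K" means 1024 and that the matrix multiply uses the classic algorithm with 2n³ floating-point operations; the slides state neither, and the function names are illustrative.

```python
def matmul_gflops(n, seconds):
    """Sustained GFLOPS for an n x n multiply, assuming 2*n**3 operations."""
    flops = 2 * n ** 3  # n**3 multiplies plus n**3 additions
    return flops / seconds / 1e9


def sort_rate(keys, seconds):
    """Keys sorted per second."""
    return keys / seconds


if __name__ == "__main__":
    print(round(matmul_gflops(1024, 0.050), 1), "GFLOPS")
    print(round(sort_rate(100_000_000, 0.036) / 1e9, 2), "billion keys/s")
```

Under these assumptions the matrix multiply corresponds to roughly 43 GFLOPS. At the quoted 1 GFLOP peak per processor, that would need at least 43 processors; the slides do not say which configuration the figures refer to.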
Conclusion
• Proven Effectiveness
• Logical Step For Multiprocessor Computers
• Still Very Pricey
• Allow General Purpose Workload
• Scalable
• Shared Memory
Questions?
Instruction Pipeline
Breakdown Of A Task
[Figure: one Task divided into 4 Teams, which are divided into 8 VPs]
Deciding The Number Of Contexts
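One common way to estimate the number of contexts is to ask how many other contexts must run to cover a single stall. The sketch below uses a deterministic model that is not taken from these slides: each context runs `run_length` cycles, each switch costs `switch_cost` cycles, and each stall lasts `latency` cycles. The names are illustrative.

```python
def contexts_to_saturate(latency, run_length, switch_cost=0):
    """Smallest number of contexts that fully hides `latency` cycles.

    Assumption: while one context is stalled, each of the others keeps the
    processor occupied for run_length + switch_cost cycles.
    """
    per_context = run_length + switch_cost
    others = -(-latency // per_context)  # ceiling division on integers
    return 1 + others


if __name__ == "__main__":
    print(contexts_to_saturate(latency=90, run_length=10))
    print(contexts_to_saturate(latency=127, run_length=1))
```

Under this model, switching every cycle with zero overhead means 128 contexts would cover a stall of up to 127 cycles.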