High-Performance Processors’ Design Choices
Ramon Canal, PD, Fall 2013

Page 1: High-Performance Processors’ Design Choices

High-Performance Processors’ Design Choices

Ramon Canal

PD, Fall 2013

Page 2: High-Performance Processors’ Design Choices

High-Performance Processors’ Design Choices

1 Motivation
2 Multiprocessors
3 Multithreading
4 VLIW

Page 3: High-Performance Processors’ Design Choices

Outline
• Motivation
• Multiprocessors
– SISD, SIMD, MIMD, and MISD
– Memory organization
– Communication mechanisms
• Multithreading
• VLIW

Page 4: High-Performance Processors’ Design Choices

Motivation
Instruction-Level Parallelism (ILP): everything we have covered so far:
– simple pipelining
– dynamic scheduling: scoreboarding and Tomasulo’s algorithm
– dynamic branch prediction
– multiple-issue architectures: superscalar, VLIW
– compiler techniques and software approaches

Bottom line: there just aren’t enough instructions that can actually be executed in parallel!
– instruction issue: limit on maximum issue count
– branch prediction: imperfect
– # registers: finite
– functional units: limited in number
– data dependencies: hard to detect dependencies via memory

Page 5: High-Performance Processors’ Design Choices

So, what do we do?
Key idea: increase the number of running processes
– multiple processes at a given “point” in time
• i.e., at the granularity of one (or a few) clock cycles
• it is not sufficient to have multiple processes at the OS level!

Two approaches:
– multiple CPUs: each executing a distinct process
• “multiprocessors” or “parallel architectures”
– a single CPU executing multiple processes (“threads”)
• “multithreading” or “thread-level parallelism”

Page 6: High-Performance Processors’ Design Choices

Taxonomy of Parallel Architectures

Flynn’s classification:
– SISD: single instruction stream, single data stream
• uniprocessor
– SIMD: single instruction stream, multiple data streams
• same instruction executed by multiple processors
• each has its own data memory
• e.g., multimedia processors, vector architectures (see the sketch after this list)
– MISD: multiple instruction streams, single data stream
• successive functional units operate on the same stream of data
• rarely found in general-purpose commercial designs
• special-purpose stream processors (digital filters, etc.)
– MIMD: multiple instruction streams, multiple data streams
• each processor has its own instruction and data streams
• most popular form of parallel processing
– single-user: high performance for one application
– multiprogrammed: running many tasks simultaneously (e.g., servers)
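To make the SIMD category concrete, here is a minimal sketch (my example, not from the slides) using x86 SSE2 intrinsics: a single instruction (`_mm_add_epi32`) operates on four data elements at once.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    int a[4] = {1, 2, 3, 4};
    int b[4] = {10, 20, 30, 40};
    int c[4];

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi32(va, vb);   /* one instruction, four adds */
    _mm_storeu_si128((__m128i *)c, vc);

    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```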

Page 7: High-Performance Processors’ Design Choices

Multiprocessor: Memory Organization

Centralized, shared-memory multiprocessor:
– usually few processors
– share a single memory and bus
– use large caches

Page 8: High-Performance Processors’ Design Choices

Multiprocessor: Memory Organization

Distributed-memory multiprocessor:
– can support large processor counts
• cost-effective way to scale memory bandwidth
• works well if most accesses are to the local memory node
– requires an interconnection network
• communication between processors becomes more complicated and slower

Page 9: High-Performance Processors’ Design Choices

Communication Mechanisms
• Shared-Memory Communication (contrast both styles in the sketch after this list)
– around for a long time, so well understood and standardized
• memory-mapped
– ease of programming when communication patterns are complex or dynamically varying
– better use of bandwidth when items are small
– problem: cache coherence becomes harder
• use “snoopy” and other coherence protocols

• Message-Passing Communication (e.g., Intel’s Knights family)
– simpler hardware, because keeping caches coherent is easier
– communication is explicit, simpler to understand
• focuses programmer attention on communication
– synchronization: naturally associated with communication
• fewer errors due to incorrect synchronization
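As a minimal illustration (my sketch, not from the slides), the program below contrasts the two styles on one machine: threads communicate implicitly through a shared variable guarded by a mutex, and explicitly through a POSIX pipe standing in for an interconnect message channel. Note how the blocking read gives synchronization "for free" in the message-passing style.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int shared_value;                 /* shared-memory communication */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int channel[2];                   /* message-passing channel (pipe) */

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);           /* implicit comm.: just a store */
    shared_value = 42;
    pthread_mutex_unlock(&lock);

    int msg = 42;                        /* explicit comm.: a send */
    write(channel[1], &msg, sizeof msg);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    int msg;
    read(channel[0], &msg, sizeof msg);  /* blocks until the message arrives:
                                            synchronization comes for free */
    pthread_mutex_lock(&lock);
    printf("shared: %d, message: %d\n", shared_value, msg);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pipe(channel);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```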

Page 10: High-Performance Processors’ Design Choices

Multiprocessor: Hybrid Organization

• Use a distributed-memory organization at the top level
• Each node itself may be a shared-memory multiprocessor (2-8 processors)

Page 11: High-Performance Processors’ Design Choices

Multiprocessor: Hybrid Organization

• Use a distributed-memory organization at the top level
• Each node itself may be a shared-memory multiprocessor (2-8 processors)
• What about Big Data? Is it a “game changer”?
– the next slides are based on the following works:
• M. Ferdman et al., “Clearing the Clouds,” ASPLOS 2012
• P. Lotfi-Kamran et al., “Scale-Out Processors,” ISCA 2012
• B. Grot et al., “Optimizing Datacenter TCO with Scale-Out Processors,” IEEE Micro 2012
– the next couple of slides © Prof. Babak Falsafi (EPFL)

Page 12: High-Performance Processors’ Design Choices

Multiprocessors and Big Data

(Pages 12-16 are figure-only slides.)

Page 17: High-Performance Processors’ Design Choices

Scale-Out Processors
• small LLC, just large enough to capture the instructions
• more cores for higher throughput
• “pods” to keep the distance to memory small

Page 18: High-Performance Processors’ Design Choices

Performance
• iso server power (20 MW)

Page 19: High-Performance Processors’ Design Choices

Summary: Multiprocessors
• Need to tailor chip design to applications
– Big Data applications are too big for data caches; the best solution is to eliminate them
– Big Data applications need coarse-grain parallelism (i.e., at the request level)
– single-thread performance is still important for other applications (i.e., computation-intensive ones)

Page 20: High-Performance Processors’ Design Choices

Multithreading
Threads: multiple processes that share code and data (and much of their address space)
• recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code

Multithreading: exploit thread-level parallelism within a processor
– fine-grain multithreading
• switch between threads on each instruction!
– coarse-grain multithreading
• switch to a different thread only if the current thread has a costly stall
– e.g., switch only on a level-2 cache miss

Page 21: High-Performance Processors’ Design Choices

Multithreading
• How can we guarantee no dependencies between instructions in a pipeline?
– One way is to interleave the execution of instructions from different program threads on the same pipeline (see the sketch below)

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)
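The following sketch (my illustration, not from the slides) works out why round-robin interleaving of 4 threads removes the hazards: a thread's next instruction reads its registers one stage after fetch, and with three other threads' instructions in between, the producer has already written back.

```c
#include <stdio.h>

enum { THREADS = 4, STAGES = 5 };

int main(void) {
    for (int t = 0; t < THREADS; t++) {
        int fetch1 = t;                    /* thread t's 1st instruction */
        int wb1    = fetch1 + STAGES - 1;  /* its write-back cycle       */
        int fetch2 = t + THREADS;          /* thread t's 2nd instruction */
        int read2  = fetch2 + 1;           /* register read in decode    */
        printf("T%d: insn1 writes back in cycle %d; insn2 reads regs in cycle %d -> %s\n",
               t + 1, wb1, read2, read2 > wb1 ? "no hazard" : "hazard");
    }
    return 0;
}
```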

Page 22: High-Performance Processors’ Design Choices

Simple Multithreaded Pipeline

• Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage

Page 23: High-Performance Processors’ Design Choices

Multithreading
Fine-grain multithreading
– switch between threads on each instruction!
– multiple threads executed in an interleaved manner
– interleaving is usually round-robin
– the CPU must be capable of switching threads every cycle!
• fast, frequent switches
– main disadvantage:
• slows down the execution of individual threads
• that is, latency is traded off for better throughput

Page 24: High-Performance Processors’ Design Choices

CDC 6600 Peripheral Processors (Cray, 1965)

• First multithreaded hardware
• 10 “virtual” I/O processors
• fixed interleave on a simple pipeline
• pipeline has a 100 ns cycle time
• each processor executes one instruction every 1000 ns
• accumulator-based instruction set to reduce processor state

Page 25: High-Performance Processors’ Design Choices

Denelcor HEP (Burton Smith, 1982)

• First commercial machine to use hardware threading in the main CPU
– 120 threads per processor
– 10 MHz clock rate
– up to 8 processors
– precursor to the Tera MTA (Multithreaded Architecture)

Page 26: High-Performance Processors’ Design Choices

Tera MTA (Cray, 1997)
• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate a sparse 3D torus interconnection fabric
• Flat, shared main memory
– no data cache
– sustains one main-memory access per cycle per processor
• 50 W/processor @ 260 MHz

Page 27: High-Performance Processors’ Design Choices

Tera MTA (Cray)
• Each processor supports 128 active hardware threads
– 128 SSWs, 1024 target registers, 4096 general-purpose registers
• Every cycle, one instruction from one active thread is launched into the pipeline
• The instruction pipeline is 21 cycles long
• At best, a single thread can issue one instruction every 21 cycles
– the clock rate is 260 MHz, so the effective single-thread issue rate is 260/21 = 12.4 MHz

Page 28: High-Performance Processors’ Design Choices

Multithreading
Coarse-grain multithreading
– switch only if the current thread has a costly stall
• e.g., a level-2 cache miss
– can accommodate slightly costlier switches
– less likely to slow down an individual thread
• a thread is switched “off” only when it has a costly stall
– main disadvantage:
• limited ability to overcome throughput losses
– shorter stalls are ignored, and there may be plenty of those
• issues instructions from a single thread
– every switch involves emptying and restarting the instruction pipeline

Page 29: High-Performance Processors’ Design Choices

IBM PowerPC RS64-III (Pulsar)

• Commercial coarse-grain multithreaded CPU
• Based on PowerPC, with a quad-issue, in-order, five-stage pipeline
• Each physical CPU supports two virtual CPUs
• On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
– the short pipeline minimizes the flush penalty (4 cycles), small compared to the memory access latency
– flushing the pipeline also simplifies exception handling

Page 30: High-Performance Processors’ Design Choices

Simultaneous Multithreading (SMT)

Key idea: exploit ILP across multiple threads!
– share the CPU among multiple threads
– i.e., convert thread-level parallelism into more ILP
– exploit the following features of modern processors (see the sketch below):
• multiple functional units
– modern processors typically have more functional units available than a single thread can utilize
• register renaming and dynamic scheduling
– multiple instructions from independent threads can co-exist and co-execute!
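A toy model of the essence of SMT at the issue stage (my sketch; the per-thread ready-operation counts are made up): each cycle, up to WIDTH operations are issued, drawn greedily from the ready operations of all threads, so slots one thread cannot fill are used by another.

```c
#include <stdio.h>

enum { THREADS = 4, WIDTH = 4, CYCLES = 3 };

int main(void) {
    for (int cycle = 0; cycle < CYCLES; cycle++) {
        int ready[THREADS];
        for (int t = 0; t < THREADS; t++)
            ready[t] = (t + cycle) % 3;    /* fake per-thread ILP this cycle */

        int issued = 0;
        printf("cycle %d issues:", cycle);
        /* Greedily fill the issue slots from all threads' ready ops. */
        for (int t = 0; t < THREADS && issued < WIDTH; t++)
            while (ready[t] > 0 && issued < WIDTH) {
                printf(" T%d", t + 1);
                ready[t]--;
                issued++;
            }
        printf("  (%d/%d slots used)\n", issued, WIDTH);
    }
    return 0;
}
```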

Page 31: High-Performance Processors’ Design Choices

Multithreading: Illustration

(a) A superscalar processor with no multithreading
(b) A superscalar processor with coarse-grain multithreading
(c) A superscalar processor with fine-grain multithreading
(d) A superscalar processor with simultaneous multithreading (SMT)

Page 32: High-Performance Processors’ Design Choices

From Superscalar to SMT

• SMT is an out-of-order superscalar extended with hardware to support multiple executing threads

Page 33: High-Performance Processors’ Design Choices

Simultaneous Multithreaded Processor

Page 34: High-Performance Processors’ Design Choices

Simultaneous Multithreaded Processor

• Add multiple contexts and fetch engines to a wide out-of-order superscalar processor
– [Tullsen, Eggers, Levy, University of Washington, 1995]
• The OoO instruction window already has most of the circuitry required to schedule from multiple threads
• Any single thread can utilize the whole machine
• First examples:
– Alpha 21464 (DEC/Compaq)
– Pentium 4 (Intel)
– POWER5 (IBM)
– UltraSPARC IV (Sun)

Page 35: High-Performance Processors’ Design Choices

SMT: Design Challenges
• Dealing with a large register file
– needed to hold multiple contexts
• Maintaining low overhead on the clock cycle
– fast instruction issue: choosing what to issue
– instruction commit: choosing what to commit
– keeping cache conflicts within acceptable bounds
• Power hungry!

Page 36: High-Performance Processors’ Design Choices

Intel Pentium 4 Processor
• Hyper-Threading = SMT
• Dual physical processors, each 2-way SMT
• Logical processors share nearly all resources of the physical processor
– caches, execution units, branch predictors
• Die-area overhead of hyperthreading: ~5%
• When one logical processor is stalled, the other can make progress
– no logical processor can use all entries in the queues when two threads are active
• A processor running only one active software thread runs at the same speed with or without hyperthreading

Page 37: High-Performance Processors’ Design Choices

Pentium 4 Micro-architecture

(diagram: 400 MHz system bus, Rapid Execution Engine, Execution Trace Cache, Hyper-Pipelined Technology, Advanced Transfer Cache, Advanced Dynamic Execution, Streaming SIMD Extensions 2, Enhanced Floating-Point/Multimedia)

Page 38: High-Performance Processors’ Design Choices

Pentium 4 Micro-architecture

(diagram detail: Hyper-Pipelined Technology, Advanced Dynamic Execution)

What hardware complexity do OoO and SMT incur?

Page 39: High-Performance Processors’ Design Choices

Sun/Oracle UltraSPARC T5 (2013)

• 16 cores, 3.6 GHz
• 8 threads/core (128 threads/chip)
• per core: 2-way OoO, 16 KB I$, 16 KB D$, 128 KB L2; 8 MB L3
• 28 nm process

Page 40: High-Performance Processors’ Design Choices

IBM POWER7

Page 41: High-Performance Processors’ Design Choices

VLIW
• Very Long Instruction Word:
– the compiler packs a fixed number of operations into a single VLIW “instruction”
– the operations within a VLIW instruction are issued and executed in parallel
– examples:
• high-end signal processors (TMS320C6201)
• Intel’s Itanium
• Transmeta Crusoe, Efficeon

Page 42: High-Performance Processors’ Design Choices

VLIW
• VLIW (very long instruction word) processors use a long instruction word that contains a usually fixed number of operations that are fetched, decoded, issued, and executed synchronously.
• All operations specified within a VLIW instruction must be independent of one another.
• Some key issues of a (V)LIW processor (see the sketch below):
– (very) long instruction word (up to 1024 bits per instruction)
– each instruction consists of multiple independent parallel operations
– each operation requires a statically known number of cycles to complete
– a central controller issues a long instruction word every cycle
– multiple FUs connected through a global shared register file

Page 43: High-Performance Processors’ Design Choices

VLIW and Superscalar
• sequential stream of long instruction words
• instructions scheduled statically by the compiler
• the number of simultaneously issued instructions is fixed at compile time
• instruction issue is less complicated than in a superscalar processor
• Disadvantage: VLIW processors cannot react to dynamic events, e.g., cache misses, with the same flexibility as superscalars.
• The number of instructions in a VLIW instruction word is usually fixed.
• Padding VLIW instructions with no-ops is needed when the full issue bandwidth cannot be met. This increases code size. More recent VLIW architectures use a denser code format that allows the no-ops to be removed.
• VLIW is an architectural technique, whereas superscalar is a microarchitectural technique.
• VLIW processors take advantage of spatial parallelism.

Page 44: High-Performance Processors’ Design Choices

VLIW and Superscalar
• Superscalar RISC solution

– Based on sequential execution semantics

– Compiler’s role is limited by the instruction set architecture

– Superscalar hardware identifies and exploits parallelism

• VLIW solution

– Based on parallel execution semantics

– VLIW ISA enhancements support static parallelization

– Compiler takes greater responsibility for exploiting parallelism

– Compiler / hardware collaboration often resembles superscalar

Page 45: High-Performance Processors’ Design Choices

VLIW and Superscalar
• Advantages of pursuing VLIW architectures

– Make wide issue & deep latency less expensive in hardware

– Allow processor parallelism to scale with additional VLSI density

• Architect the processor to do well with in-order execution

– Enhance the ISA to allow static parallelization

– Use compiler technology to parallelize program

• Loop Unrolling, Software Pipelining, ...

– However, a purely static VLIW is not appropriate for general-purpose use

Page 46: High-Performance Processors’ Design Choices

Examples
• Intel Itanium

• Transmeta Crusoe

• Almost all DSPs

– Texas Instruments

– ST Microelectronics

Page 47: High-Performance Processors’ Design Choices

Intel Itanium, Itanium 2

Page 48: High-Performance Processors’ Design Choices

IA-64 Encoding

Source: Intel/HP IA-64 Application ISA Guide 1.0

Page 49: High-Performance Processors’ Design Choices

IA-64 Templates

Source: Intel/HP IA-64 Application ISA Guide 1.0

Page 50: High-Performance Processors’ Design Choices

Intel’s IA-64 ISA
• Intel 64-bit Architecture (IA-64) register model (see the sketch below):
– 128 64-bit general-purpose registers GR0-GR127 to hold values for integer and multimedia computations
• each register has one additional NaT (Not a Thing) bit to indicate whether the value stored is valid
– 128 82-bit floating-point registers FR0-FR127
• registers f0 and f1 are read-only with values +0.0 and +1.0
– 64 1-bit predicate registers PR0-PR63
• the first register, p0, is read-only and always reads 1 (true)
– 8 64-bit branch registers BR0-BR7 to specify the target addresses of indirect branches
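To make the counts and widths concrete, here is a plain-C model of the architectural register file described above (the struct layout is mine; real hardware does not store registers this way, and the 82-bit FP format is approximated with bit-fields).

```c
#include <stdint.h>

typedef struct {
    uint64_t value;
    unsigned nat : 1;        /* NaT: "Not a Thing" validity bit */
} GeneralReg;

typedef struct {             /* 82-bit FP register: 64-bit significand, */
    uint64_t significand;    /* 17-bit exponent, 1 sign bit             */
    unsigned exponent : 17;
    unsigned sign : 1;
} FloatReg;

typedef struct {
    GeneralReg gr[128];      /* GR0-GR127: integer/multimedia values    */
    FloatReg   fr[128];      /* FR0-FR127; f0 = +0.0, f1 = +1.0 (read-only) */
    uint64_t   pr;           /* PR0-PR63 packed as bits; bit 0 (p0) reads 1 */
    uint64_t   br[8];        /* BR0-BR7: indirect-branch target addresses  */
} IA64RegFile;

int main(void) { IA64RegFile rf = {0}; (void)rf; return 0; }
```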

Page 51: High-Performance Processors’ Design Choices

Transmeta Crusoe and Efficeon

Page 52: High-Performance Processors’ Design Choices

Overview
• HW/SW system for executing x86 code
– VLIW processor
– Code Morphing Software (CMS)
• Underlying ISA and details invisible
– convenient level of indirection
– upgrades, fixes, freedom for changes
• as long as a new CMS is implemented
– anything else?

Page 53: High-Performance Processors’ Design Choices

VLIW CPU
• Simple
– in-order, very few interlocks
– TM5400: 7 million transistors, 7-stage pipeline
– low power, easier (and cheaper) to design
• TM5800
– up to 1 GHz, 64 KB L1, 512 KB L2
– 0.5-15 W @ 300-1000 MHz, 0.8-1.3 V running a typical multimedia app

Page 54: High-Performance Processors’ Design Choices

Crusoe vs. PIII mobile (temperature)

Page 55: High-Performance Processors’ Design Choices

VLIW CPU
• RISC-like ISA
– a molecule (long instruction) holds 2 or 4 atoms (RISC-like instructions)
• slot distribution?
• 64 GPRs and 32 FPRs
– dedicated registers for the x86 architectural registers

(diagram: a 128-bit molecule, e.g. FADD | ADD | LD | BRCC, dispatched to the floating-point unit, integer units 1 and 2, the load/store unit, and the branch unit; see the sketch below)
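A bit-level view of the same idea (my invented encoding, not the real Crusoe format): a 128-bit molecule as four 32-bit atoms, mirroring the FADD | ADD | LD | BRCC example in the diagram above.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t atom[4];        /* 4 x 32 bits = one 128-bit molecule */
} Molecule;

/* Invented helper: pack a toy atom as opcode|dst|src1|src2 bytes. */
static uint32_t make_atom(uint8_t op, uint8_t dst, uint8_t s1, uint8_t s2) {
    return (uint32_t)op << 24 | (uint32_t)dst << 16 | (uint32_t)s1 << 8 | s2;
}

int main(void) {
    enum { FADD = 1, ADD = 2, LD = 3, BRCC = 4 };   /* toy opcodes */
    Molecule m = {{
        make_atom(FADD, 10, 11, 12),   /* -> floating-point unit */
        make_atom(ADD,   1,  2,  3),   /* -> integer unit        */
        make_atom(LD,    4,  5,  0),   /* -> load/store unit     */
        make_atom(BRCC,  0,  7,  0),   /* -> branch unit         */
    }};
    for (int s = 0; s < 4; s++)
        printf("slot %d: 0x%08x\n", s, m.atom[s]);
    return 0;
}
```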

Page 56: High-Performance Processors’ Design Choices

Conclusions
• VLIW
– reduces hardware complexity at the cost of increased compiler complexity
– good for DSPs
– not so good for GPPs (so far?)

Page 57: High-Performance Processors’ Design Choices

Conclusions
• Multiprocessors
– conventional superscalars are reaching ILP’s limits → exploit TLP or PLP
– already-known technology
• Multithreading
– good for extensive use of superscalar cores
– more efficient than MP, but more complex too