High-Performance Processors’ Design Choices
Ramon Canal, PD, Fall 2013

Page 1: High-Performance Processors’ Design Choices

High-Performance Processors’ Design Choices

Ramon Canal

PD, Fall 2013

Page 2: High-Performance Processors’ Design Choices

High-Performance Processors’ Design Choices

1 Motivation
2 Multiprocessors
3 Multithreading
4 VLIW

Page 3: High-Performance Processors’ Design Choices

Outline
• Motivation
• Multiprocessors
– SISD, SIMD, MIMD, and MISD
– Memory organization
– Communication mechanisms
• Multithreading
• VLIW

Page 4: High-Performance Processors’ Design Choices

Motivation
Instruction-Level Parallelism (ILP): everything we have covered so far:
– simple pipelining
– dynamic scheduling: scoreboarding and Tomasulo’s algorithm
– dynamic branch prediction
– multiple-issue architectures: superscalar, VLIW
– compiler techniques and software approaches

Bottom line: there just aren’t enough instructions that can actually be executed in parallel!
– instruction issue: limit on maximum issue count
– branch prediction: imperfect
– # registers: finite
– functional units: limited in number
– data dependencies: hard to detect dependencies via memory

Page 5: High-Performance Processors’ Design Choices

So, what do we do?
Key idea: increase the number of running processes
– multiple processes at a given “point” in time
• i.e., at the granularity of one (or a few) clock cycles
• it is not sufficient to have multiple processes at the OS level!

Two approaches:
– multiple CPUs: each executing a distinct process
• “multiprocessors” or “parallel architectures”
– a single CPU executing multiple processes (“threads”)
• “multithreading” or “thread-level parallelism”

Page 6: High-Performance Processors’ Design Choices

Taxonomy of Parallel Architectures

Flynn’s classification:
– SISD: single instruction stream, single data stream
• uniprocessor
– SIMD: single instruction stream, multiple data streams
• same instruction executed by multiple processors
• each has its own data memory
• e.g., multimedia processors, vector architectures (see the sketch after this list)
– MISD: multiple instruction streams, single data stream
• successive functional units operate on the same stream of data
• rarely found in general-purpose commercial designs
• special-purpose stream processors (digital filters, etc.)
– MIMD: multiple instruction streams, multiple data streams
• each processor has its own instruction and data streams
• most popular form of parallel processing
– single-user: high performance for one application
– multiprogrammed: running many tasks simultaneously (e.g., servers)
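To make the SIMD category concrete, here is a minimal sketch (my example, not from the slides) using x86 SSE2 intrinsics: a single instruction (`_mm_add_epi32`) operates on four data elements at once.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    int a[4] = {1, 2, 3, 4};
    int b[4] = {10, 20, 30, 40};
    int c[4];

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi32(va, vb);   /* one instruction, four adds */
    _mm_storeu_si128((__m128i *)c, vc);

    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```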

Page 7: High-Performance Processors’ Design Choices

Multiprocessor: Memory Organization

Centralized, shared-memory multiprocessor:
– usually few processors
– share a single memory and bus
– use large caches

Page 8: High-Performance Processors’ Design Choices

Multiprocessor: Memory Organization

Distributed-memory multiprocessor:
– can support large processor counts
• cost-effective way to scale memory bandwidth
• works well if most accesses are to the local memory node
– requires an interconnection network
• communication between processors becomes more complicated and slower

Page 9: High-Performance Processors’ Design Choices

Communication Mechanisms
• Shared-Memory Communication (contrast both styles in the sketch after this list)
– around for a long time, so well understood and standardized
• memory-mapped
– ease of programming when communication patterns are complex or dynamically varying
– better use of bandwidth when items are small
– problem: cache coherence becomes harder
• use “snoopy” and other coherence protocols

• Message-Passing Communication (e.g., Intel’s Knights family)
– simpler hardware, because keeping caches coherent is easier
– communication is explicit, simpler to understand
• focuses programmer attention on communication
– synchronization: naturally associated with communication
• fewer errors due to incorrect synchronization
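As a minimal illustration (my sketch, not from the slides), the program below contrasts the two styles on one machine: threads communicate implicitly through a shared variable guarded by a mutex, and explicitly through a POSIX pipe standing in for an interconnect message channel. Note how the blocking read gives synchronization "for free" in the message-passing style.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int shared_value;                 /* shared-memory communication */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int channel[2];                   /* message-passing channel (pipe) */

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);           /* implicit comm.: just a store */
    shared_value = 42;
    pthread_mutex_unlock(&lock);

    int msg = 42;                        /* explicit comm.: a send */
    write(channel[1], &msg, sizeof msg);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    int msg;
    read(channel[0], &msg, sizeof msg);  /* blocks until the message arrives:
                                            synchronization comes for free */
    pthread_mutex_lock(&lock);
    printf("shared: %d, message: %d\n", shared_value, msg);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pipe(channel);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```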

Page 10: High-Performance Processors’ Design Choices

Multiprocessor: Hybrid Organization

• Use a distributed-memory organization at the top level
• Each node itself may be a shared-memory multiprocessor (2-8 processors)

Page 11: High-Performance Processors’ Design Choices

Multiprocessor: Hybrid Organization

• Use a distributed-memory organization at the top level
• Each node itself may be a shared-memory multiprocessor (2-8 processors)
• What about Big Data? Is it a “game changer”?
– the next slides are based on the following works:
• M. Ferdman et al., “Clearing the Clouds,” ASPLOS 2012
• P. Lotfi-Kamran et al., “Scale-Out Processors,” ISCA 2012
• B. Grot et al., “Optimizing Datacenter TCO with Scale-Out Processors,” IEEE Micro 2012
– the next couple of slides © Prof. Babak Falsafi (EPFL)

Page 12: High-Performance Processors’ Design Choices

Multiprocessors and Big Data

(Pages 12-16 are figure-only slides.)

Page 17: High-Performance Processors’ Design Choices

Scale-Out Processors
• small LLC, just large enough to capture the instructions
• more cores for higher throughput
• “pods” to keep the distance to memory small

Page 18: High-Performance Processors’ Design Choices

Performance
• iso server power (20 MW)

Page 19: High-Performance Processors’ Design Choices

Summary: Multiprocessors
• Need to tailor chip design to applications
– Big Data applications are too big for data caches; the best solution is to eliminate them
– Big Data applications need coarse-grain parallelism (i.e., at the request level)
– single-thread performance is still important for other applications (i.e., computation-intensive ones)

Page 20: High-Performance Processors’ Design Choices

Multithreading
Threads: multiple processes that share code and data (and much of their address space)
• recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code

Multithreading: exploit thread-level parallelism within a processor
– fine-grain multithreading
• switch between threads on each instruction!
– coarse-grain multithreading
• switch to a different thread only if the current thread has a costly stall
– e.g., switch only on a level-2 cache miss

Page 21: High-Performance Processors’ Design Choices

Multithreading
• How can we guarantee no dependencies between instructions in a pipeline?
– One way is to interleave the execution of instructions from different program threads on the same pipeline (see the sketch below)

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)
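The following sketch (my illustration, not from the slides) works out why round-robin interleaving of 4 threads removes the hazards: a thread's next instruction reads its registers one stage after fetch, and with three other threads' instructions in between, the producer has already written back.

```c
#include <stdio.h>

enum { THREADS = 4, STAGES = 5 };

int main(void) {
    for (int t = 0; t < THREADS; t++) {
        int fetch1 = t;                    /* thread t's 1st instruction */
        int wb1    = fetch1 + STAGES - 1;  /* its write-back cycle       */
        int fetch2 = t + THREADS;          /* thread t's 2nd instruction */
        int read2  = fetch2 + 1;           /* register read in decode    */
        printf("T%d: insn1 writes back in cycle %d; insn2 reads regs in cycle %d -> %s\n",
               t + 1, wb1, read2, read2 > wb1 ? "no hazard" : "hazard");
    }
    return 0;
}
```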

Page 22: High-Performance Processors’ Design Choices

Simple Multithreaded Pipeline

• Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage

Page 23: High-Performance Processors’ Design Choices

Multithreading
Fine-grain multithreading
– switch between threads on each instruction!
– multiple threads executed in an interleaved manner
– interleaving is usually round-robin
– the CPU must be capable of switching threads every cycle!
• fast, frequent switches
– main disadvantage:
• slows down the execution of individual threads
• that is, latency is traded off for better throughput

Page 24: High-Performance Processors’ Design Choices

CDC 6600 Peripheral Processors (Cray, 1965)

• First multithreaded hardware
• 10 “virtual” I/O processors
• fixed interleave on a simple pipeline
• pipeline has a 100 ns cycle time
• each processor executes one instruction every 1000 ns
• accumulator-based instruction set to reduce processor state

Page 25: High-Performance Processors’ Design Choices

Denelcor HEP (Burton Smith, 1982)

• First commercial machine to use hardware threading in the main CPU
– 120 threads per processor
– 10 MHz clock rate
– up to 8 processors
– precursor to the Tera MTA (Multithreaded Architecture)

Page 26: High-Performance Processors’ Design Choices

Tera MTA (Cray, 1997)
• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate a sparse 3D torus interconnection fabric
• Flat, shared main memory
– no data cache
– sustains one main-memory access per cycle per processor
• 50 W/processor @ 260 MHz

Page 27: High-Performance Processors’ Design Choices

Tera MTA (Cray)
• Each processor supports 128 active hardware threads
– 128 SSWs, 1024 target registers, 4096 general-purpose registers
• Every cycle, one instruction from one active thread is launched into the pipeline
• The instruction pipeline is 21 cycles long
• At best, a single thread can issue one instruction every 21 cycles
– the clock rate is 260 MHz, so the effective single-thread issue rate is 260/21 = 12.4 MHz

Page 28: High-Performance Processors’ Design Choices

Multithreading
Coarse-grain multithreading
– switch only if the current thread has a costly stall
• e.g., a level-2 cache miss
– can accommodate slightly costlier switches
– less likely to slow down an individual thread
• a thread is switched “off” only when it has a costly stall
– main disadvantage:
• limited ability to overcome throughput losses
– shorter stalls are ignored, and there may be plenty of those
• issues instructions from a single thread
– every switch involves emptying and restarting the instruction pipeline

Page 29: High-Performance Processors’ Design Choices

IBM PowerPC RS64-III (Pulsar)

• Commercial coarse-grain multithreaded CPU
• Based on PowerPC, with a quad-issue, in-order, five-stage pipeline
• Each physical CPU supports two virtual CPUs
• On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
– the short pipeline minimizes the flush penalty (4 cycles), small compared to the memory access latency
– flushing the pipeline also simplifies exception handling

Page 30: High-Performance Processors’ Design Choices

Simultaneous Multithreading (SMT)

Key idea: exploit ILP across multiple threads!
– share the CPU among multiple threads
– i.e., convert thread-level parallelism into more ILP
– exploit the following features of modern processors (see the sketch below):
• multiple functional units
– modern processors typically have more functional units available than a single thread can utilize
• register renaming and dynamic scheduling
– multiple instructions from independent threads can co-exist and co-execute!
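A toy model of the essence of SMT at the issue stage (my sketch; the per-thread ready-operation counts are made up): each cycle, up to WIDTH operations are issued, drawn greedily from the ready operations of all threads, so slots one thread cannot fill are used by another.

```c
#include <stdio.h>

enum { THREADS = 4, WIDTH = 4, CYCLES = 3 };

int main(void) {
    for (int cycle = 0; cycle < CYCLES; cycle++) {
        int ready[THREADS];
        for (int t = 0; t < THREADS; t++)
            ready[t] = (t + cycle) % 3;    /* fake per-thread ILP this cycle */

        int issued = 0;
        printf("cycle %d issues:", cycle);
        /* Greedily fill the issue slots from all threads' ready ops. */
        for (int t = 0; t < THREADS && issued < WIDTH; t++)
            while (ready[t] > 0 && issued < WIDTH) {
                printf(" T%d", t + 1);
                ready[t]--;
                issued++;
            }
        printf("  (%d/%d slots used)\n", issued, WIDTH);
    }
    return 0;
}
```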

Page 31: High-Performance Processors’ Design Choices

Multithreading: Illustration

(a) A superscalar processor with no multithreading
(b) A superscalar processor with coarse-grain multithreading
(c) A superscalar processor with fine-grain multithreading
(d) A superscalar processor with simultaneous multithreading (SMT)

Page 32: High-Performance Processors’ Design Choices

From Superscalar to SMT

• SMT is an out-of-order superscalar extended with hardware to support multiple executing threads

Page 33: High-Performance Processors’ Design Choices

Simultaneous Multithreaded Processor

Page 34: High-Performance Processors’ Design Choices

Simultaneous Multithreaded Processor

• Add multiple contexts and fetch engines to a wide out-of-order superscalar processor
– [Tullsen, Eggers, Levy, University of Washington, 1995]
• The OoO instruction window already has most of the circuitry required to schedule from multiple threads
• Any single thread can utilize the whole machine
• First examples:
– Alpha 21464 (DEC/Compaq)
– Pentium 4 (Intel)
– POWER5 (IBM)
– UltraSPARC IV (Sun)

Page 35: High-Performance Processors’ Design Choices

SMT: Design Challenges
• Dealing with a large register file
– needed to hold multiple contexts
• Maintaining low overhead on the clock cycle
– fast instruction issue: choosing what to issue
– instruction commit: choosing what to commit
– keeping cache conflicts within acceptable bounds
• Power hungry!

Page 36: High-Performance Processors’ Design Choices

Intel Pentium 4 Processor
• Hyper-Threading = SMT
• Dual physical processors, each 2-way SMT
• Logical processors share nearly all resources of the physical processor
– caches, execution units, branch predictors
• Die-area overhead of hyperthreading: ~5%
• When one logical processor is stalled, the other can make progress
– no logical processor can use all entries in the queues when two threads are active
• A processor running only one active software thread runs at the same speed with or without hyperthreading

Page 37: High-Performance Processors’ Design Choices

Pentium 4 Micro-architecture

(diagram: 400 MHz system bus, Rapid Execution Engine, Execution Trace Cache, Hyper-Pipelined Technology, Advanced Transfer Cache, Advanced Dynamic Execution, Streaming SIMD Extensions 2, Enhanced Floating-Point/Multimedia)

Page 38: High-Performance Processors’ Design Choices

Pentium 4 Micro-architecture

(diagram detail: Hyper-Pipelined Technology, Advanced Dynamic Execution)

What hardware complexity do OoO and SMT incur?

Page 39: High-Performance Processors’ Design Choices

Sun/Oracle UltraSPARC T5 (2013)

• 16 cores, 3.6 GHz
• 8 threads/core (128 threads/chip)
• per core: 2-way OoO, 16 KB I$, 16 KB D$, 128 KB L2; 8 MB L3
• 28 nm process

Page 40: High-Performance Processors’ Design Choices

IBM POWER7

Page 41: High-Performance Processors’ Design Choices

VLIW
• Very Long Instruction Word:
– the compiler packs a fixed number of operations into a single VLIW “instruction”
– the operations within a VLIW instruction are issued and executed in parallel
– examples:
• high-end signal processors (TMS320C6201)
• Intel’s Itanium
• Transmeta Crusoe, Efficeon

Page 42: High-Performance Processors’ Design Choices

VLIW
• VLIW (very long instruction word) processors use a long instruction word that contains a usually fixed number of operations that are fetched, decoded, issued, and executed synchronously.
• All operations specified within a VLIW instruction must be independent of one another.
• Some key issues of a (V)LIW processor (see the sketch below):
– (very) long instruction word (up to 1024 bits per instruction)
– each instruction consists of multiple independent parallel operations
– each operation requires a statically known number of cycles to complete
– a central controller issues a long instruction word every cycle
– multiple FUs connected through a global shared register file

Page 43: High-Performance Processors’ Design Choices

VLIW and Superscalar
• sequential stream of long instruction words
• instructions scheduled statically by the compiler
• the number of simultaneously issued instructions is fixed at compile time
• instruction issue is less complicated than in a superscalar processor
• Disadvantage: VLIW processors cannot react to dynamic events, e.g., cache misses, with the same flexibility as superscalars.
• The number of instructions in a VLIW instruction word is usually fixed.
• Padding VLIW instructions with no-ops is needed when the full issue bandwidth cannot be met. This increases code size. More recent VLIW architectures use a denser code format that allows the no-ops to be removed.
• VLIW is an architectural technique, whereas superscalar is a microarchitectural technique.
• VLIW processors take advantage of spatial parallelism.

Page 44: High-Performance Processors’ Design Choices

VLIW and Superscalar
• Superscalar RISC solution

– Based on sequential execution semantics

– Compiler’s role is limited by the instruction set architecture

– Superscalar hardware identifies and exploits parallelism

• VLIW solution

– Based on parallel execution semantics

– VLIW ISA enhancements support static parallelization

– Compiler takes greater responsibility for exploiting parallelism

– Compiler / hardware collaboration often resembles superscalar

Page 45: High-Performance Processors’ Design Choices

VLIW and Superscalar
• Advantages of pursuing VLIW architectures

– Make wide issue & deep latency less expensive in hardware

– Allow processor parallelism to scale with additional VLSI density

• Architect the processor to do well with in-order execution

– Enhance the ISA to allow static parallelization

– Use compiler technology to parallelize program

• Loop Unrolling, Software Pipelining, ...

– However, a purely static VLIW is not appropriate for general-purpose use

Page 46: High-Performance Processors’ Design Choices

Examples
• Intel Itanium

• Transmeta Crusoe

• Almost all DSPs

– Texas Instruments

– ST Microelectronics

Page 47: High-Performance Processors’ Design Choices

Intel Itanium, Itanium 2

Page 48: High-Performance Processors’ Design Choices

IA-64 Encoding

Source: Intel/HP IA-64 Application ISA Guide 1.0

Page 49: High-Performance Processors’ Design Choices

IA-64 Templates

Source: Intel/HP IA-64 Application ISA Guide 1.0

Page 50: High-Performance Processors’ Design Choices

Intel’s IA-64 ISA
• Intel 64-bit Architecture (IA-64) register model (see the sketch below):
– 128 64-bit general-purpose registers GR0-GR127 to hold values for integer and multimedia computations
• each register has one additional NaT (Not a Thing) bit to indicate whether the value stored is valid
– 128 82-bit floating-point registers FR0-FR127
• registers f0 and f1 are read-only with values +0.0 and +1.0
– 64 1-bit predicate registers PR0-PR63
• the first register, p0, is read-only and always reads 1 (true)
– 8 64-bit branch registers BR0-BR7 to specify the target addresses of indirect branches
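To make the counts and widths concrete, here is a plain-C model of the architectural register file described above (the struct layout is mine; real hardware does not store registers this way, and the 82-bit FP format is approximated with bit-fields).

```c
#include <stdint.h>

typedef struct {
    uint64_t value;
    unsigned nat : 1;        /* NaT: "Not a Thing" validity bit */
} GeneralReg;

typedef struct {             /* 82-bit FP register: 64-bit significand, */
    uint64_t significand;    /* 17-bit exponent, 1 sign bit             */
    unsigned exponent : 17;
    unsigned sign : 1;
} FloatReg;

typedef struct {
    GeneralReg gr[128];      /* GR0-GR127: integer/multimedia values    */
    FloatReg   fr[128];      /* FR0-FR127; f0 = +0.0, f1 = +1.0 (read-only) */
    uint64_t   pr;           /* PR0-PR63 packed as bits; bit 0 (p0) reads 1 */
    uint64_t   br[8];        /* BR0-BR7: indirect-branch target addresses  */
} IA64RegFile;

int main(void) { IA64RegFile rf = {0}; (void)rf; return 0; }
```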

Page 51: High-Performance Processors’ Design Choices

Transmeta Crusoe and Efficeon

Page 52: High-Performance Processors’ Design Choices

Overview
• HW/SW system for executing x86 code
– VLIW processor
– Code Morphing Software (CMS)
• Underlying ISA and details invisible
– convenient level of indirection
– upgrades, fixes, freedom for changes
• as long as a new CMS is implemented
– anything else?

Page 53: High-Performance Processors’ Design Choices

VLIW CPU
• Simple
– in-order, very few interlocks
– TM5400: 7 million transistors, 7-stage pipeline
– low power, easier (and cheaper) to design
• TM5800
– up to 1 GHz, 64 KB L1, 512 KB L2
– 0.5-15 W @ 300-1000 MHz, 0.8-1.3 V running a typical multimedia app

Page 54: High-Performance Processors’ Design Choices

Crusoe vs. PIII mobile (temperature)

Page 55: High-Performance Processors’ Design Choices

VLIW CPU
• RISC-like ISA
– a molecule (long instruction) holds 2 or 4 atoms (RISC-like instructions)
• slot distribution?
• 64 GPRs and 32 FPRs
– dedicated registers for the x86 architectural registers

(diagram: a 128-bit molecule, e.g. FADD | ADD | LD | BRCC, dispatched to the floating-point unit, integer units 1 and 2, the load/store unit, and the branch unit; see the sketch below)
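A bit-level view of the same idea (my invented encoding, not the real Crusoe format): a 128-bit molecule as four 32-bit atoms, mirroring the FADD | ADD | LD | BRCC example in the diagram above.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t atom[4];        /* 4 x 32 bits = one 128-bit molecule */
} Molecule;

/* Invented helper: pack a toy atom as opcode|dst|src1|src2 bytes. */
static uint32_t make_atom(uint8_t op, uint8_t dst, uint8_t s1, uint8_t s2) {
    return (uint32_t)op << 24 | (uint32_t)dst << 16 | (uint32_t)s1 << 8 | s2;
}

int main(void) {
    enum { FADD = 1, ADD = 2, LD = 3, BRCC = 4 };   /* toy opcodes */
    Molecule m = {{
        make_atom(FADD, 10, 11, 12),   /* -> floating-point unit */
        make_atom(ADD,   1,  2,  3),   /* -> integer unit        */
        make_atom(LD,    4,  5,  0),   /* -> load/store unit     */
        make_atom(BRCC,  0,  7,  0),   /* -> branch unit         */
    }};
    for (int s = 0; s < 4; s++)
        printf("slot %d: 0x%08x\n", s, m.atom[s]);
    return 0;
}
```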

Page 56: High-Performance Processors’ Design Choices

Conclusions
• VLIW
– reduces hardware complexity at the cost of increased compiler complexity
– good for DSPs
– not so good for GPPs (so far?)

Page 57: High-Performance Processors’ Design Choices

Conclusions
• Multiprocessors
– conventional superscalars are reaching ILP’s limits → exploit TLP or PLP
– already-known technology
• Multithreading
– good for extensive use of superscalar cores
– more efficient than MP, but more complex too