UIUC - CS 433 IBM POWER7 Adam Kunk Anil John Pete Bohman

Post on 25-Feb-2016


Page 1: UIUC - CS 433 IBM  POWER7

UIUC - CS 433: IBM POWER7

Adam Kunk, Anil John, Pete Bohman

Page 2: UIUC - CS 433 IBM  POWER7

Quick Facts

Released by IBM in 2010 (~ February)
Successor of the POWER6
Implements Power ISA v2.06 (RISC)
Clock Rate: 2.4 GHz - 4.25 GHz
Feature size: 45 nm
Cores: 4, 6, 8
Cache: L1, L2, L3 – On Chip

References: [1], [5]

Page 3: UIUC - CS 433 IBM  POWER7

Why the POWER7?

PERCS – Productive, Easy-to-use, Reliable Computing System
DARPA-funded contract that IBM won in order to develop the POWER7 ($244 million contract, 2006)
  ▪ Contract was to develop a petascale supercomputer architecture before 2011 in the HPCS (High Productivity Computing Systems) project.
IBM, Cray, and Sun Microsystems received HPCS grants for Phase II.
IBM was chosen for Phase III in 2006.

References: [1], [2]

Page 4: UIUC - CS 433 IBM  POWER7

Blue Waters

Side note: The Blue Waters system was meant to be the first supercomputer using PERCS technology.

However, the contract was cancelled (cost and complexity).

Page 5: UIUC - CS 433 IBM  POWER7

History of Power

2001: POWER4/4+ ("First Dual Core in Industry")
  ▪ Dual Core, Chip Multi Processing, Distributed Switch, Shared L2, Dynamic LPARs (32), 180nm

2004: POWER5/5+ ("Hardware Virtualization for Unix & Linux")
  ▪ Dual Core & Quad Core Md, Enhanced Scaling, 2 Thread SMT, Distributed Switch +, Core Parallelism +, FP Performance +, Memory bandwidth +, 130nm, 90nm

2007: POWER6/6+ ("Fastest Processor in Industry")
  ▪ Dual Core, High Frequencies, Virtualization +, Memory Subsystem +, Altivec, Instruction Retry, Dyn Energy Mgmt, 2 Thread SMT +, Protection Keys, 65nm

2010: POWER7/7+ ("Most POWERful & Scalable Processor in Industry")
  ▪ 4, 6, 8 Core, 32MB On-Chip eDRAM, Power Optimized Cores, Mem Subsystem ++, 4 Thread SMT++, Reliability +, VSM & VSX, Protection Keys+, 45nm, 32nm

Future: POWER8

References: [3]

Page 7: UIUC - CS 433 IBM  POWER7

POWER7 Layout

Cores:
  ▪ 8 Intelligent Cores / chip (socket)
  ▪ 4 and 6 Intelligent Cores available on some models
  ▪ 12 execution units per core
  ▪ Out of order execution
  ▪ 4 Way SMT per core
  ▪ 32 threads per chip
  ▪ L1 – 32 KB I Cache / 32 KB D Cache per core
  ▪ L2 – 256 KB per core
Chip:
  ▪ 32MB Intelligent L3 Cache on chip

[Figure: POWER7 chip layout. Eight cores, each with its own L2, around the eDRAM L3 cache; other labeled blocks: Memory Interface, Memory++, GX, SMP Fabric, POWER Bus.]

References: [3]

Page 8: UIUC - CS 433 IBM  POWER7

POWER7 Options (8, 6, 4 cores)

References: [3]

Page 9: UIUC - CS 433 IBM  POWER7

POWER7 Core

Each core implements "aggressive" out-of-order (OoO) instruction execution

The processor has an Instruction Sequencing Unit (ISU) capable of dispatching up to six instructions per cycle to a set of queues

Up to eight instructions per cycle can be issued to the instruction execution units

References: [4]

Page 10: UIUC - CS 433 IBM  POWER7

Pipeline

Page 11: UIUC - CS 433 IBM  POWER7

Instruction Fetch

8 instructions fetched from L2 to L1 I-cache or fetch buffer
Balanced instruction rates across active threads
Instruction grouping
  ▪ Instructions belonging to a group are issued together
  ▪ Groups contain independent instructions

Page 12: UIUC - CS 433 IBM  POWER7

Instruction Fetch

Branch Prediction

Page 13: UIUC - CS 433 IBM  POWER7

Execution Units

Each POWER7 core has 12 execution units:
  ▪ 2 fixed point units
  ▪ 2 load store units
  ▪ 4 double precision floating point units (2x POWER6)
  ▪ 1 vector unit
  ▪ 1 branch unit
  ▪ 1 condition register unit
  ▪ 1 decimal floating point unit

References: [4]
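As a quick consistency check on the execution-unit list above, the per-type counts can be tallied in a few lines. This is only a sketch: the dictionary keys and the helper name `total_units` are illustrative, and the POWER6 figure is inferred solely from the slide's "2x POWER6" remark.

```python
# Execution-unit counts per POWER7 core, copied from the slide's list.
EXECUTION_UNITS = {
    "fixed point": 2,
    "load store": 2,
    "double precision floating point": 4,
    "vector": 1,
    "branch": 1,
    "condition register": 1,
    "decimal floating point": 1,
}


def total_units(units):
    """Sum the unit counts across all unit types."""
    return sum(units.values())


# The slide says the floating point count is twice POWER6's, so halve it.
power6_fp_units = EXECUTION_UNITS["double precision floating point"] // 2

print(total_units(EXECUTION_UNITS))  # 12
print(power6_fp_units)               # 2
```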

Page 14: UIUC - CS 433 IBM  POWER7

ILP

Page 15: UIUC - CS 433 IBM  POWER7

Exceptions

Page 16: UIUC - CS 433 IBM  POWER7

SMT

Simultaneous Multithreading
  ▪ SMT1: Single instruction execution thread per core
  ▪ SMT2: Two instruction execution threads per core
  ▪ SMT4: Four instruction execution threads per core

This means that an 8-core POWER7 can execute 32 threads simultaneously
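The 32-thread figure is simply cores multiplied by threads per core. A minimal sketch of that arithmetic for each core option and SMT mode named in these slides follows; the names `SMT_MODES`, `CORE_OPTIONS`, and `hardware_threads` are illustrative and not part of any IBM tool.

```python
# Threads per core for each SMT mode listed on the slide.
SMT_MODES = {"SMT1": 1, "SMT2": 2, "SMT4": 4}

# Core counts offered for POWER7 chips, per the Quick Facts slide.
CORE_OPTIONS = (4, 6, 8)


def hardware_threads(cores, smt_mode):
    """Hardware threads on one chip: cores times threads per core."""
    return cores * SMT_MODES[smt_mode]


for cores in CORE_OPTIONS:
    row = ", ".join(
        f"{mode}: {hardware_threads(cores, mode)}" for mode in SMT_MODES
    )
    print(f"{cores} cores -> {row}")
```

With 8 cores in SMT4 mode, `hardware_threads(8, "SMT4")` gives the 32 threads per chip quoted above.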

Page 17: UIUC - CS 433 IBM  POWER7

Multithreading History

[Figure: execution-unit occupancy (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) over time for four designs: Single thread Out of Order, S80 HW Multi-thread, POWER5 2 Way SMT, POWER7 4 Way SMT. Legend: Thread 0, Thread 1, Thread 2, Thread 3 Executing; No Thread Executing.]

References: [3]

Page 18: UIUC - CS 433 IBM  POWER7

TLP

Page 19: UIUC - CS 433 IBM  POWER7

Memory

Page 20: UIUC - CS 433 IBM  POWER7

Memory Access

(Look at section 2.1.4 in http://www.redbooks.ibm.com/redpapers/pdfs/redp4639.pdf)

Page 21: UIUC - CS 433 IBM  POWER7

Caches

Parameter      L1                             L2          L3 (Local)      L3 (Global)
Size           64 KB (32 I, 32 D)             256 KB      4 MB            32 MB
Access Time    0.5 ns                         2 ns        6 ns            30 ns
Associativity  4-way I-cache, 8-way D-cache   8-way       8-way           8-way
Write Policy   Write Through                  Write Back  Partial Victim  Adaptive
Line size      128 B                          128 B       128 B           128 B
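The size, associativity, and line-size rows determine how many sets each cache has. The sketch below works that out under the assumption of conventional set-associative indexing with power-of-two geometry; the actual POWER7 indexing scheme is not described in these slides, and the names `CACHES`, `num_sets`, and `address_bits` are illustrative. The global L3 is left out because it is an aggregate of the per-core local regions.

```python
KB = 1024
MB = 1024 * KB
LINE_BYTES = 128  # line size shared by every level in the table

# (capacity in bytes, associativity) taken from the table's rows.
CACHES = {
    "L1 I-cache": (32 * KB, 4),
    "L1 D-cache": (32 * KB, 8),
    "L2": (256 * KB, 8),
    "L3 (local)": (4 * MB, 8),
}


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets = total lines divided by lines per set (ways)."""
    return (size_bytes // line_bytes) // ways


def address_bits(size_bytes, ways, line_bytes=LINE_BYTES):
    """Return (offset_bits, index_bits), assuming power-of-two geometry."""
    offset_bits = line_bytes.bit_length() - 1
    index_bits = num_sets(size_bytes, ways, line_bytes).bit_length() - 1
    return offset_bits, index_bits


for name, (size, ways) in CACHES.items():
    offset, index = address_bits(size, ways)
    print(f"{name}: {num_sets(size, ways)} sets, "
          f"{offset} offset bits, {index} index bits")
```

Under these assumptions the 8-way L1 D-cache has 32 sets, while the 4-way L1 I-cache of the same capacity has 64.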

Page 22: UIUC - CS 433 IBM  POWER7

L1 Cache

2 read ports, 1 write port
Writes have priority over reads
Write-Through
  ▪ No L1 cast-outs required
Binary-tree LRU replacement
Way prediction bits reduce hit latency
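The tree-based LRU replacement listed above is commonly implemented as binary-tree pseudo-LRU, where each set keeps ways - 1 bits instead of a full recency ordering. The sketch below models one 8-way set under that assumption. It illustrates the general technique and is not IBM's hardware design; the class name `TreePLRU` and its methods are hypothetical.

```python
class TreePLRU:
    """Binary-tree pseudo-LRU state for a single cache set."""

    def __init__(self, ways=8):
        if ways < 2 or ways & (ways - 1):
            raise ValueError("ways must be a power of two and at least 2")
        self.ways = ways
        # bits[1..ways-1] form an implicit binary tree (index 0 is unused).
        # A bit of 0 means "the next victim is in the left half", 1 means right.
        self.bits = [0] * ways

    def touch(self, way):
        """Record an access: point every node on the path away from `way`."""
        node, lo, span = 1, 0, self.ways
        while span > 1:
            half = span // 2
            go_right = way >= lo + half
            self.bits[node] = 0 if go_right else 1  # point at the other half
            node = 2 * node + (1 if go_right else 0)
            if go_right:
                lo += half
            span = half

    def victim(self):
        """Follow the bits from the root to pick the way to replace."""
        node, lo, span = 1, 0, self.ways
        while span > 1:
            half = span // 2
            go_right = self.bits[node] == 1
            node = 2 * node + (1 if go_right else 0)
            if go_right:
                lo += half
            span = half
        return lo


plru = TreePLRU(ways=8)
for way in range(8):
    plru.touch(way)
print(plru.victim())  # 0, the same choice true LRU would make here

plru.touch(0)
print(plru.victim())  # 4, whereas true LRU would pick way 1
```

The second result shows the trade-off: with only seven bits per set the policy approximates LRU rather than tracking it exactly, but it never evicts the most recently touched way.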

Page 23: UIUC - CS 433 IBM  POWER7

L2 Cache

Inclusive of L1
L3 partial victim relationship

Page 24: UIUC - CS 433 IBM  POWER7

L3 Cache

Details of the L3 Cache …. (leads up to eDRAM)

Page 25: UIUC - CS 433 IBM  POWER7

eDRAM

eDRAM – Embedded dynamic random-access memory
This means the L3 cache (shared 32 MB) is on-chip
  ▪ Faster access because of the shorter distance
  ▪ Less area, less power
  ▪ On-chip interconnects provide each core with 32-byte buses to and from the L3 cache

Side note: eDRAM is also used in many game consoles (PS2, GameCube, Wii, etc.)

References: [5], [6]

Page 26: UIUC - CS 433 IBM  POWER7

eDRAM (cont.)

eDRAM in the POWER7 provides 1/6 the latency and twice the bandwidth (compared with off-chip eDRAM), and 1/5 standby power in 1/3 the required area (compared with SRAM)

References: [5]

Page 27: UIUC - CS 433 IBM  POWER7

Performance

Page 28: UIUC - CS 433 IBM  POWER7

Closing/Wrap-up

Page 29: UIUC - CS 433 IBM  POWER7

References

1. http://en.wikipedia.org/wiki/POWER7

2. http://en.wikipedia.org/wiki/PERCS

3. Central PA PUG POWER7 review.ppt, http://www.ibm.com/developerworks/wikis/download/attachments/135430247/Central+PA+PUG+POWER7+review.ppt

Page 30: UIUC - CS 433 IBM  POWER7

References (cont.)

4. http://www.redbooks.ibm.com/redpapers/pdfs/redp4639.pdf

5. http://www.serc.iisc.ernet.in/~govind/243/Power7.pdf

6. http://en.wikipedia.org/wiki/EDRAM