hc19.20.330.trips- a distributed explicit data graph ... · trips: a distributed explicit data...

TRIPS: A Distributed Explicit Data

Graph Execution (EDGE)

Microprocessor

Madhu Saravana Sibi Govindan,

Doug Burger, Steve Keckler,

and the TRIPS Team

Hot Chips 19, August 2007

Computer Architecture and Technology Laboratory

Department of Computer Sciences

The University of Texas at Austin

www.cs.utexas.edu/users/trips

Hot Chips 19, August 2007 2The University of

Texas at Austin

The University of

Texas at Austin

Recent Trends

Scaling challenges for conventional superscalarprocessors

Power and pipeline limits impede clock rate growth

Wire delays cause overheads for concurrency

Complexity of large monolithic architectures

Industry shift to multicore architectures

Work well for certain types of workloads

Big challenges: how to program/Amdahl’s law

Single-thread performance is still important


Texas at Austin

The University of

Texas at Austin

TRIPS – A Technology Scalable Architecture

Goals

High single-thread performance through ILP

Exploit concurrency at multiple granularities

Scalable with technology trends

Key technologies

Explicit data graph execution (EDGE) ISA

Distributed processor microarchitecture

Distributed non-uniform (NUCA) L2 cache

Tiled and networked design

Hot Chips 2005 - Chip in RTL design phase

Hot Chips 2007 - Chip/system complete

Manufacturing/bring-up complete

Performance tuning in progress

This talk focuses on chip and systemimplementation


Texas at Austin

The University of

Texas at Austin

Outline

Explicit Data Graph Execution (EDGE) ISAs

ISA support for distributed execution

TRIPS networked microarchitecture

Processor microarchitecture

Non-uniform cache memory system

TRIPS chip and system implementation

Custom ASIC chip and system boards

Preliminary performance results


Texas at Austin

The University of

Texas at Austin

Explicit Data Graph Execution (EDGE)

Two key features

Program broken into sequence of instruction blocks

Execution model: fetch, execute, and commit blocks atomically

Amortize overheads over many instructions

Within a block, instructions explicitly encode producer-consumercommunication

Producer instructions explicitly targets consumers and sendresults directly to them (without using a register file)

Instructions “fire” when all operands arrive

Any instruction may be predicated (conditionally executed)

TRIPS blocks - up to 128 instructions

Compile-time techniques to create large blocks

Average block sizes typically greater than 45 instructions

Long-term goal: hide hard-to-predict branches inside blocks


Texas at Austin

The University of

Texas at Austin

TRIPS Execution Model

TR

IPS

blo

ck

TR

IPS

blo

ck

Program

Control Flow

Graph (CFG)

Basic

Block

CFG Loop Example

if (p==0)

z = a * 2 + 3;

else

z = b * 3 + 4;

teqz

muli

addi

st(1)

muli

addi

a b

p

TRIPS block for

if-then-else

if-then-else Example

if-then-else

example


Texas at Austin

The University of

Texas at Austin

TRIPS Execution Model

COMMITFETCH EXECUTE

5 10 30 400

Block 0 Bi

(variable execution time)

Time (cycles)

Block 2

Block 3

Block 4

Block 5

Block 6

Block 7

Bi+2

Bi+3

Bi+4

Bi+5

Bi+6

Bi+7

Fetch/Execute/Commit overlapped across multiple blocks

Can execute up to 8 blocks at a time via speculation

Exposes concurrency with a very large instruction window

Single-threaded mode with up to 1024 instructions

Simultaneous multi-threaded (SMT) mode with up to 4 threads and

256 instruction window per thread

Block 8 Bi+8

Block 1 Bi+1

In-flight

blocks/instructions

Structural

dependence


Texas at Austin

The University of

Texas at Austin

TRIPS Prototype Chip

2 TRIPS Processors

NUCA L2 Cache

1 MB, 16 banks

On-Chip Network (OCN)

2D mesh network

Replaces on-chip bus

Controllers

2 DDR SDRAM controllers

2 DMA controllers

External Bus Controller (EBC)

Interfaces with PowerPC 440GP(control processor)

Chip-to-Chip (C2C) network controller

Clocking

2 PLLs

4 Clock domains

1x and 2x SDRAM

Main and C2C

Clock tree

Main domain has 4 quadrants tolimit local skew

PROC 0

EBC

PROC 1

OCN

SDCDMA

C2CSDCDMA

TEST PLLS

108

DDR

SDRAM

108 8x39

C2C

Links

44

EBI

(External Bus

Interface)

IRQ

(Interrupt

Request)

GPIO

16

CLK

DDR

SDRAM

JTAG

NUCA

L2

Cache


Texas at Austin

The University of

Texas at Austin

TRIPS Microarchitecture Principles

Distributed and tiled architecture

Small and simple tiles (register file, data cache bank, etc.)

Short local wires

Tiles are small: 2-5 mm2 per tile is typical

No centralized resources

Networks connect the tiles

Networks implement distributed protocols (I-fetch, bypass, etc.)

Includes well-defined control and data networks

Networks connect only nearest neighbors

No global wires

Design modularity and scalability

Design productivity by replicating tiles (design reuse)

Networks extensible, even late in design cycle


Texas at Austin

The University of

Texas at Austin

TRIPS Tile-level Microarchitecture

TRIPS Tiles

G:Processor control - TLB w/ variable size pages,

dispatch, next block predict, commit

R: Register file - 32 registers x 4 threads, register

forwarding

I: Instruction cache - 16KB storage per tile

D: Data cache - 8KB per tile, 256-entry load/store queue,

TLB

E: Execution unit - Int/FP ALUs, 64 reservation stations

M: Memory - 64KB, configurable as L2 cache or scratchpad

N: OCN network interface - router, translation tables

DMA: Direct memory access controller

SDC: DDR SDRAM controller

EBC: External bus controller - interface to external

PowerPC

C2C: Chip-to-chip network controller - 4 links to XY

neighbors


Texas at Austin

The University of

Texas at Austin

TRIPS Processor

EDGE ISA blocks mapped to array of tiles

Compile-time scheduler decides where (not when) instructionsexecute

TRIPS: aggressive processor capabilities

Up to 16 instructions per cycle

Up to 4 loads/stores per cycle

Up to 64 outstanding L1 data cache misses

Up to 1024 dynamically scheduled instructions

Up to 4 simultaneous multithreading (SMT) threads

Memory system

4 simultaneous L1 cache fills per processor

Up to 16 simultaneous L2 cache accesses

Up to 16 outstanding L2 cache misses


Texas at Austin

The University of

Texas at Austin

PR

OC

0P

RO

C 1

Bank Bank

BankBank

Bank Bank

BankBank

Bank Bank

BankBank

Bank Bank

BankBank

Request

Reply

Non-Uniform (NUCA) L2 Cache

1MB L2 cache

Sixteen tiled 64KB banks

On-chip network

4x10 2D mesh topology

128-bit links, 366MHz (4.7GB/sec)

4 virtual channels prevent deadlocks

Requests and replies are wormhole-

routed across the network

Up to 10 memory requests per cycle

Up to 128 bytes per cycle returned to

the processors

Individual banks reconfigurable as

scratchpad


Texas at Austin

The University of

Texas at Austin

Technology trend analysis

“What are the problems to solve?”

Prototype implementation

“Is it feasible?”

Prototype testing.

“Does it work; how well?”

Unveiling

1999 2000 2001 20032002 2004 2005 2006 2007

New architecture concepts, invention, publishing

“What are the solutions?”

Prototype design

“Solve detailed challenges of architecture.”

TRIPS Project Timeline

UT-Austin team

12 graduate students + 1

engineer

RTL, verification, timing

IBM ASIC team

Physical design


Texas at Austin

The University of

Texas at Austin

170 millionTransistor count (est.)

36W at 366MHz

(chip has no power mgt.)Power (measured)

1.06 kmTotal wire length

2.7ns (actual)

4.5ns (worse case sim)Clock period

6.5 million# of routed nets

6.1 million# of placed cells

626 signals, 352 Vdd, 348

GNDPin Count

47mm x 47mm BGAPackage

18.3mm x 18.37mm

(336 mm2)Die Size

130nm ASIC with 7 metal

layersProcess Technology

TRIPS Chip Implementation


Texas at Austin

The University of

Texas at Austin

Die Photos

With C4 Array Without C4 Array


Texas at Austin

The University of

Texas at Austin

Benefits of Tiled Design

Design modularity

11 different tiles, instantiated a total of 106 times

Clean interfaces at tile boundaries

Verification - no hardware bugs

Tiles verified extensively before stitching together

Place and Route - hierarchical in nature

Wiring only between nearest neighbors

But - each physical instance was a little different

Timing - trivial at top level, communication planned

No global wires or timing paths


Texas at Austin

The University of

Texas at Austin

TRIPS Daughtercard

12V to daughtercard, steppeddown to 1.5V, 2.5V, and 3.3V

12 GFlops peak

45 W total (worst case)

Voltage Regulator

Heatsink/Fan

(rated to 90 W)

2 x 1GB DIMMs


Texas at Austin

The University of

Texas at Austin

TRIPS Motherboard

1 motherboard includes:

4 daughter-boards

4 TRIPS chips

8 GBytes DRAM

PowerPC 440GP

control processor

I/O: ethernet, serial,

C2C links

FPGA I/O interface

Peak performance

48 GFlops at 366 MHz

180 Watts


Texas at Austin

The University of

Texas at Austin

TRIPS Multi-board System

Extensible to 8 boards (64 processors)

Micro-coax cables extend C2C network across boards

Peak performance

380 GFlops, 1.4KW

Parallel message-passing software


Texas at Austin

The University of

Texas at Austin

TRIPS Prototype System Layers

0

P 1 3

2

HOST PC

x86 Linux

PowerPC

440GP

EBC

TRIPS Resource Manager (TRM)

File system

Runtime services

Login/debug/etc.

Local Resoure Manager (LRM) listens to HostPC

Runs embedded Linux

PowerPC EBI device driver to control TRIPS chips

PowerPC EBI TRIPS EBC

Board 0

Board 1

Board 2

Runs TRIPS apps

Interrupts PowerPC if necessary

System calls, exceptions

Ethernet

SwitchEBC EBC

EBC

IRQ

EBI


Texas at Austin

The University of

Texas at Austin

250 nm100 MHz

SDRAM450 MHzPentium 3

90 nm533 MHz

DDR23.6 GHzPentium 4

65 nm800 MHz

DDR2

1.6 GHz

(underclocked)Core 2

130 nm200 MHz

DDR366 MHzTRIPS

Process

Technology

Memory

SpeedClock SpeedProcessor

Preliminary Performance (HW)Challenges

Different technology and ISAs

Different processor-to-memoryclock ratio

TRIPS compiler fine-tuning inprogress

Cycle-to-cycle comparison on multipleHW platforms

TRIPS Performance counters

PAPI - Performance API on Linuxsystems for others

Applications

Compiled + hand-optimized

Mix of kernels and fullalgorithms

Compiled only

The EmbeddedMicroprocessor BenchmarkConsortium (EEMBC)

Versabench (MIT)

SPEC benchmarks in progress


Texas at Austin

The University of

Texas at Austin

TRIPS vs. Conventional Processors: Kernels

BIT INT FLOAT STREAM

0.00

1.00

2.00

3.00

4.00

5.00

6.00

7.00

8.00

802.11a 8b10b a2time rspeed vadd conv matrix autocor ct fmradio Geometric

Mean

Sp

ee

du

p R

ela

tiv

e t

o C

ore

2 (

cy

cle

s)

TRIPS-Handoptimized

TRIPS-Compiled

Pentium 3

Pentium 4

Core 2


Texas at Austin

The University of

Texas at Austin

TRIPS vs. Conventional Processors

EEMBC and signal processing (compiled)

0

0.5

1

1.5

2

2.5

aifftr

aiifft

bas

efp

cac

heb

iirfl

t m

atrix

pnt

rch

puw

mod

rsp

eed

osp

f b

ezie

r

text

aut

ocor

fft

conv ct

Geo

met

ric M

ean

Graph shows representative subset of 33 benchmarks and

geometric mean for all 33 benchmarks

Sp

ee

du

p R

ela

tiv

e t

o C

ore

2 (

cy

cle

s)

TRIPS-Compiled

Pentium 3

Pentium 4

Core 2


Texas at Austin

The University of

Texas at Austin

Summary

TRIPS prototype demonstrates feasibility of:

Explicit Data Graph Execution (EDGE) ISAs

Distributed processor and memory microarchitectures

Scaled and tiled uniprocessors

Non-uniform cache architectures (NUCA)

Recompilation required, but no change to source code

Performance is promising

First generation TRIPS is 3.1x faster for hand-optimized code

Compiled-code reasonable and improving

Identified several opportunities for improvements

Enhanced operand networks

Dynamic aggregation of processor cores

TRIPS-like architectures are a potential “alternative” tomulticore


Texas at Austin

The University of

Texas at Austin

Acknowledgements

TRIPS Hardware Team

Raj Desikan, Saurabh Drolia, Madhu Sibi Govindan, DivyaGulati, Paul Gratz, Heather Hanson, Changkyu Kim, Haiming Liu,Robert McDonald, Ramdas Nagarajan, Nitya Ranganathan, KaruSankaralingam, Simha Sethumadhavan, PremkishoreShivakumar

TRIPS Software Team

Kathryn McKinley, Jim Burrill, Xia Chen, Sundeep Kushwaha,Bert Maher, Nick Nethercote, Suriya Narayanan, Sadia Sharif,Aaron Smith, Bill Yoder

IBM Microelectronics Austin ASIC Group

TRIPS Sponsors

DARPA Polymorphous Computing Architectures

Air Force Research Laboratories

National Science Foundation

IBM, Intel, Sun Microsystems

hc19.20.330.trips- a distributed explicit data graph ... · trips: a distributed explicit data...

Documents