hc19.20.330.trips- a distributed explicit data graph ... · trips: a distributed explicit data...
TRANSCRIPT
TRIPS: A Distributed Explicit Data
Graph Execution (EDGE)
Microprocessor
Madhu Saravana Sibi Govindan,
Doug Burger, Steve Keckler,
and the TRIPS Team
Hot Chips 19, August 2007
Computer Architecture and Technology Laboratory
Department of Computer Sciences
The University of Texas at Austin
www.cs.utexas.edu/users/trips
Hot Chips 19, August 2007 2The University of
Texas at Austin
The University of
Texas at Austin
Recent Trends
Scaling challenges for conventional superscalarprocessors
Power and pipeline limits impede clock rate growth
Wire delays cause overheads for concurrency
Complexity of large monolithic architectures
Industry shift to multicore architectures
Work well for certain types of workloads
Big challenges: how to program/Amdahl’s law
Single-thread performance is still important
Hot Chips 19, August 2007 3The University of
Texas at Austin
The University of
Texas at Austin
TRIPS – A Technology Scalable Architecture
Goals
High single-thread performance through ILP
Exploit concurrency at multiple granularities
Scalable with technology trends
Key technologies
Explicit data graph execution (EDGE) ISA
Distributed processor microarchitecture
Distributed non-uniform (NUCA) L2 cache
Tiled and networked design
Hot Chips 2005 - Chip in RTL design phase
Hot Chips 2007 - Chip/system complete
Manufacturing/bring-up complete
Performance tuning in progress
This talk focuses on chip and systemimplementation
Hot Chips 19, August 2007 4The University of
Texas at Austin
The University of
Texas at Austin
Outline
Explicit Data Graph Execution (EDGE) ISAs
ISA support for distributed execution
TRIPS networked microarchitecture
Processor microarchitecture
Non-uniform cache memory system
TRIPS chip and system implementation
Custom ASIC chip and system boards
Preliminary performance results
Hot Chips 19, August 2007 5The University of
Texas at Austin
The University of
Texas at Austin
Explicit Data Graph Execution (EDGE)
Two key features
Program broken into sequence of instruction blocks
Execution model: fetch, execute, and commit blocks atomically
Amortize overheads over many instructions
Within a block, instructions explicitly encode producer-consumercommunication
Producer instructions explicitly targets consumers and sendresults directly to them (without using a register file)
Instructions “fire” when all operands arrive
Any instruction may be predicated (conditionally executed)
TRIPS blocks - up to 128 instructions
Compile-time techniques to create large blocks
Average block sizes typically greater than 45 instructions
Long-term goal: hide hard-to-predict branches inside blocks
Hot Chips 19, August 2007 6The University of
Texas at Austin
The University of
Texas at Austin
TRIPS Execution Model
TR
IPS
blo
ck
TR
IPS
blo
ck
Program
Control Flow
Graph (CFG)
Basic
Block
CFG Loop Example
if (p==0)
z = a * 2 + 3;
else
z = b * 3 + 4;
teqz
muli
addi
st(1)
muli
addi
a b
p
TRIPS block for
if-then-else
if-then-else Example
if-then-else
example
Hot Chips 19, August 2007 7The University of
Texas at Austin
The University of
Texas at Austin
TRIPS Execution Model
COMMITFETCH EXECUTE
5 10 30 400
Block 0 Bi
(variable execution time)
Time (cycles)
Block 2
Block 3
Block 4
Block 5
Block 6
Block 7
Bi+2
Bi+3
Bi+4
Bi+5
Bi+6
Bi+7
Fetch/Execute/Commit overlapped across multiple blocks
Can execute up to 8 blocks at a time via speculation
Exposes concurrency with a very large instruction window
Single-threaded mode with up to 1024 instructions
Simultaneous multi-threaded (SMT) mode with up to 4 threads and
256 instruction window per thread
Block 8 Bi+8
Block 1 Bi+1
In-flight
blocks/instructions
Structural
dependence
Hot Chips 19, August 2007 8The University of
Texas at Austin
The University of
Texas at Austin
TRIPS Prototype Chip
2 TRIPS Processors
NUCA L2 Cache
1 MB, 16 banks
On-Chip Network (OCN)
2D mesh network
Replaces on-chip bus
Controllers
2 DDR SDRAM controllers
2 DMA controllers
External Bus Controller (EBC)
Interfaces with PowerPC 440GP(control processor)
Chip-to-Chip (C2C) network controller
Clocking
2 PLLs
4 Clock domains
1x and 2x SDRAM
Main and C2C
Clock tree
Main domain has 4 quadrants tolimit local skew
PROC 0
EBC
PROC 1
OCN
SDCDMA
C2CSDCDMA
TEST PLLS
108
DDR
SDRAM
108 8x39
C2C
Links
44
EBI
(External Bus
Interface)
IRQ
(Interrupt
Request)
GPIO
16
CLK
DDR
SDRAM
JTAG
NUCA
L2
Cache
Hot Chips 19, August 2007 9The University of
Texas at Austin
The University of
Texas at Austin
TRIPS Microarchitecture Principles
Distributed and tiled architecture
Small and simple tiles (register file, data cache bank, etc.)
Short local wires
Tiles are small: 2-5 mm2 per tile is typical
No centralized resources
Networks connect the tiles
Networks implement distributed protocols (I-fetch, bypass, etc.)
Includes well-defined control and data networks
Networks connect only nearest neighbors
No global wires
Design modularity and scalability
Design productivity by replicating tiles (design reuse)
Networks extensible, even late in design cycle
Hot Chips 19, August 2007 10The University of
Texas at Austin
The University of
Texas at Austin
TRIPS Tile-level Microarchitecture
TRIPS Tiles
G:Processor control - TLB w/ variable size pages,
dispatch, next block predict, commit
R: Register file - 32 registers x 4 threads, register
forwarding
I: Instruction cache - 16KB storage per tile
D: Data cache - 8KB per tile, 256-entry load/store queue,
TLB
E: Execution unit - Int/FP ALUs, 64 reservation stations
M: Memory - 64KB, configurable as L2 cache or scratchpad
N: OCN network interface - router, translation tables
DMA: Direct memory access controller
SDC: DDR SDRAM controller
EBC: External bus controller - interface to external
PowerPC
C2C: Chip-to-chip network controller - 4 links to XY
neighbors
Hot Chips 19, August 2007 11The University of
Texas at Austin
The University of
Texas at Austin
TRIPS Processor
EDGE ISA blocks mapped to array of tiles
Compile-time scheduler decides where (not when) instructionsexecute
TRIPS: aggressive processor capabilities
Up to 16 instructions per cycle
Up to 4 loads/stores per cycle
Up to 64 outstanding L1 data cache misses
Up to 1024 dynamically scheduled instructions
Up to 4 simultaneous multithreading (SMT) threads
Memory system
4 simultaneous L1 cache fills per processor
Up to 16 simultaneous L2 cache accesses
Up to 16 outstanding L2 cache misses
Hot Chips 19, August 2007 12The University of
Texas at Austin
The University of
Texas at Austin
PR
OC
0P
RO
C 1
Bank Bank
BankBank
Bank Bank
BankBank
Bank Bank
BankBank
Bank Bank
BankBank
Request
Reply
Non-Uniform (NUCA) L2 Cache
1MB L2 cache
Sixteen tiled 64KB banks
On-chip network
4x10 2D mesh topology
128-bit links, 366MHz (4.7GB/sec)
4 virtual channels prevent deadlocks
Requests and replies are wormhole-
routed across the network
Up to 10 memory requests per cycle
Up to 128 bytes per cycle returned to
the processors
Individual banks reconfigurable as
scratchpad
Hot Chips 19, August 2007 13The University of
Texas at Austin
The University of
Texas at Austin
Technology trend analysis
“What are the problems to solve?”
Prototype implementation
“Is it feasible?”
Prototype testing.
“Does it work; how well?”
Unveiling
1999 2000 2001 20032002 2004 2005 2006 2007
New architecture concepts, invention, publishing
“What are the solutions?”
Prototype design
“Solve detailed challenges of architecture.”
TRIPS Project Timeline
UT-Austin team
12 graduate students + 1
engineer
RTL, verification, timing
IBM ASIC team
Physical design
Hot Chips 19, August 2007 14The University of
Texas at Austin
The University of
Texas at Austin
170 millionTransistor count (est.)
36W at 366MHz
(chip has no power mgt.)Power (measured)
1.06 kmTotal wire length
2.7ns (actual)
4.5ns (worse case sim)Clock period
6.5 million# of routed nets
6.1 million# of placed cells
626 signals, 352 Vdd, 348
GNDPin Count
47mm x 47mm BGAPackage
18.3mm x 18.37mm
(336 mm2)Die Size
130nm ASIC with 7 metal
layersProcess Technology
TRIPS Chip Implementation
Hot Chips 19, August 2007 15The University of
Texas at Austin
The University of
Texas at Austin
Die Photos
With C4 Array Without C4 Array
Hot Chips 19, August 2007 16The University of
Texas at Austin
The University of
Texas at Austin
Benefits of Tiled Design
Design modularity
11 different tiles, instantiated a total of 106 times
Clean interfaces at tile boundaries
Verification - no hardware bugs
Tiles verified extensively before stitching together
Place and Route - hierarchical in nature
Wiring only between nearest neighbors
But - each physical instance was a little different
Timing - trivial at top level, communication planned
No global wires or timing paths
Hot Chips 19, August 2007 17The University of
Texas at Austin
The University of
Texas at Austin
TRIPS Daughtercard
12V to daughtercard, steppeddown to 1.5V, 2.5V, and 3.3V
12 GFlops peak
45 W total (worst case)
Voltage Regulator
Heatsink/Fan
(rated to 90 W)
2 x 1GB DIMMs
Hot Chips 19, August 2007 18The University of
Texas at Austin
The University of
Texas at Austin
TRIPS Motherboard
1 motherboard includes:
4 daughter-boards
4 TRIPS chips
8 GBytes DRAM
PowerPC 440GP
control processor
I/O: ethernet, serial,
C2C links
FPGA I/O interface
Peak performance
48 GFlops at 366 MHz
180 Watts
Hot Chips 19, August 2007 19The University of
Texas at Austin
The University of
Texas at Austin
TRIPS Multi-board System
Extensible to 8 boards (64 processors)
Micro-coax cables extend C2C network across boards
Peak performance
380 GFlops, 1.4KW
Parallel message-passing software
Hot Chips 19, August 2007 20The University of
Texas at Austin
The University of
Texas at Austin
TRIPS Prototype System Layers
0
P 1 3
2
HOST PC
x86 Linux
PowerPC
440GP
EBC
TRIPS Resource Manager (TRM)
File system
Runtime services
Login/debug/etc.
Local Resoure Manager (LRM) listens to HostPC
Runs embedded Linux
PowerPC EBI device driver to control TRIPS chips
PowerPC EBI TRIPS EBC
Board 0
Board 1
Board 2
Runs TRIPS apps
Interrupts PowerPC if necessary
System calls, exceptions
Ethernet
SwitchEBC EBC
EBC
IRQ
EBI
Hot Chips 19, August 2007 21The University of
Texas at Austin
The University of
Texas at Austin
250 nm100 MHz
SDRAM450 MHzPentium 3
90 nm533 MHz
DDR23.6 GHzPentium 4
65 nm800 MHz
DDR2
1.6 GHz
(underclocked)Core 2
130 nm200 MHz
DDR366 MHzTRIPS
Process
Technology
Memory
SpeedClock SpeedProcessor
Preliminary Performance (HW)Challenges
Different technology and ISAs
Different processor-to-memoryclock ratio
TRIPS compiler fine-tuning inprogress
Cycle-to-cycle comparison on multipleHW platforms
TRIPS Performance counters
PAPI - Performance API on Linuxsystems for others
Applications
Compiled + hand-optimized
Mix of kernels and fullalgorithms
Compiled only
The EmbeddedMicroprocessor BenchmarkConsortium (EEMBC)
Versabench (MIT)
SPEC benchmarks in progress
Hot Chips 19, August 2007 22The University of
Texas at Austin
The University of
Texas at Austin
TRIPS vs. Conventional Processors: Kernels
BIT INT FLOAT STREAM
0.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
802.11a 8b10b a2time rspeed vadd conv matrix autocor ct fmradio Geometric
Mean
Sp
ee
du
p R
ela
tiv
e t
o C
ore
2 (
cy
cle
s)
TRIPS-Handoptimized
TRIPS-Compiled
Pentium 3
Pentium 4
Core 2
Hot Chips 19, August 2007 23The University of
Texas at Austin
The University of
Texas at Austin
TRIPS vs. Conventional Processors
EEMBC and signal processing (compiled)
0
0.5
1
1.5
2
2.5
aifftr
aiifft
bas
efp
cac
heb
iirfl
t m
atrix
pnt
rch
puw
mod
rsp
eed
osp
f b
ezie
r
text
aut
ocor
fft
conv ct
Geo
met
ric M
ean
Graph shows representative subset of 33 benchmarks and
geometric mean for all 33 benchmarks
Sp
ee
du
p R
ela
tiv
e t
o C
ore
2 (
cy
cle
s)
TRIPS-Compiled
Pentium 3
Pentium 4
Core 2
Hot Chips 19, August 2007 24The University of
Texas at Austin
The University of
Texas at Austin
Summary
TRIPS prototype demonstrates feasibility of:
Explicit Data Graph Execution (EDGE) ISAs
Distributed processor and memory microarchitectures
Scaled and tiled uniprocessors
Non-uniform cache architectures (NUCA)
Recompilation required, but no change to source code
Performance is promising
First generation TRIPS is 3.1x faster for hand-optimized code
Compiled-code reasonable and improving
Identified several opportunities for improvements
Enhanced operand networks
Dynamic aggregation of processor cores
TRIPS-like architectures are a potential “alternative” tomulticore
Hot Chips 19, August 2007 25The University of
Texas at Austin
The University of
Texas at Austin
Acknowledgements
TRIPS Hardware Team
Raj Desikan, Saurabh Drolia, Madhu Sibi Govindan, DivyaGulati, Paul Gratz, Heather Hanson, Changkyu Kim, Haiming Liu,Robert McDonald, Ramdas Nagarajan, Nitya Ranganathan, KaruSankaralingam, Simha Sethumadhavan, PremkishoreShivakumar
TRIPS Software Team
Kathryn McKinley, Jim Burrill, Xia Chen, Sundeep Kushwaha,Bert Maher, Nick Nethercote, Suriya Narayanan, Sadia Sharif,Aaron Smith, Bill Yoder
IBM Microelectronics Austin ASIC Group
TRIPS Sponsors
DARPA Polymorphous Computing Architectures
Air Force Research Laboratories
National Science Foundation
IBM, Intel, Sun Microsystems