RICE UNIVERSITY
‘Stream’-based wireless computing
Sridhar Rajagopal
Research group meeting December 17, 2002
The figures used in the slides are borrowed from papers at VT and Stanford.
Motivation
‘Stream’-based computing: what does it mean?
- Not a well-defined term
  - ‘computation’ that uses a flow of self-guided information
  - a ‘sequence of data’
- Related to the flow of data through the architecture
- Application to implementing wireless algorithms
Outline
Stallion: reconfigurable computing at Virginia Tech
- ‘stream’-based computing #1
- Custom Configurable Machines (CCM)
Imagine: media processing at Stanford
- ‘stream’-based computing #2
- programmable architectures
Stallion at VT
Wormhole Run-Time Reconfiguration (RTR)
- coarse-grained structure
- reconfiguration using ‘streams’
‘Stream’ packets
A stream packet
Stream flow through architecture
Functional description of PE
Stream module description
4 states:
- IDLE – reconfiguration in progress
- BUSY – doing work
- PROGRAM – load reconfiguration data
- PASS – meant for next module
Need to output a packet every cycle
- VALID – maintains synchronization
- set INVALID instead of wait states
- strip information off stack
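The four states and the packet-per-cycle rule can be read as a small state machine. The sketch below is one possible interpretation, not the Stallion design: the `Packet` fields, the routing stack of module ids, the two-cycle reconfiguration time (`kReconfigCycles`), and the multiply-by-a-constant "configured function" are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical packet layout; the slides give no field names or widths.
struct Packet {
    bool valid;              // INVALID packets fill cycles with no real output
    bool is_config;          // true: reconfiguration data, false: payload data
    std::vector<int> route;  // stack of module ids; back() is the next target
    std::uint16_t word;      // data or configuration word
};

enum class State { IDLE, BUSY, PROGRAM, PASS };

constexpr int kReconfigCycles = 2;  // assumed reconfiguration latency

class StreamModule {
public:
    explicit StreamModule(int id) : id_(id) {}

    // One clock cycle: accept one packet and always emit exactly one packet.
    Packet step(const Packet& in) {
        if (in.valid) pending_.push_back(in);  // buffer input, never stall the sender
        if (reconfig_cycles_left_ > 0) {       // IDLE: reconfiguration in progress
            --reconfig_cycles_left_;
            state_ = State::IDLE;
            return invalid();
        }
        if (pending_.empty()) return invalid();  // nothing to do: emit INVALID
        Packet p = pending_.front();
        pending_.pop_front();
        if (p.route.empty() || p.route.back() != id_) {
            state_ = State::PASS;              // meant for a later module
            return p;
        }
        p.route.pop_back();                    // strip our entry off the stack
        if (p.is_config) {
            state_ = State::PROGRAM;           // load reconfiguration data
            gain_ = p.word;
            reconfig_cycles_left_ = kReconfigCycles;
            return invalid();                  // INVALID instead of a wait state
        }
        state_ = State::BUSY;                  // do the configured work
        p.word = static_cast<std::uint16_t>(static_cast<std::uint32_t>(p.word) * gain_);
        return p;
    }

    State state() const { return state_; }

private:
    static Packet invalid() { return Packet{false, false, {}, 0}; }

    int id_;
    State state_ = State::IDLE;
    int reconfig_cycles_left_ = 0;
    std::uint16_t gain_ = 1;        // stand-in for the configured function
    std::deque<Packet> pending_;    // data buffered during reconfiguration
};
```

Calling `step` once per cycle with an INVALID packet when no input is available keeps downstream modules in lock-step, which is the point of emitting INVALID rather than stalling.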
Processing layer
Static section
- configures the reconfigurable section
- buffers data during reconfiguration & sends ‘IDLE’ packets
Reconfigurable section
- processing of the data is done here
Higher layers convert the algorithm to data and configuration patterns
Cart before the horse: Colt before the Stallion
Colt architecture (also at VT)
IFU Mesh – Mesh of interconnected func. units
Stallion chip
[Figure: Stallion chip block diagram; buses labelled 16-bit data, 4 control]
IFU mesh in Stallion
Dashed lines – skip buses
Can send operands over one or more IFUs
IFU details
Only left input can do barrel shifting
ALU based on LUT
Control register – stores control information for reconfiguration
Optional Delay Register - provides latency to synchronize path lengths of different pipeline streams
Cond. unit
Output control unit
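To make the IFU datapath concrete, here is a minimal sketch of the three pieces the slide names: a barrel shift on the left input only, a LUT-based ALU, and the optional delay register. The slide does not say how the LUT is organized, so the bitwise 4-entry truth table, the use of a rotate for the barrel shift, and every name below are assumptions.

```cpp
#include <cstdint>

// Rotate a 16-bit word left by n bits (one common form of barrel shift).
inline std::uint16_t rotl16(std::uint16_t x, unsigned n) {
    n &= 15u;
    return static_cast<std::uint16_t>((x << n) | (x >> ((16u - n) & 15u)));
}

// Assumed LUT ALU: a 4-entry truth table applied to each bit position.
// Entry index = (bit of a << 1) | bit of b, so 0x8 = AND, 0xE = OR, 0x6 = XOR.
inline std::uint16_t lut_alu(std::uint16_t a, std::uint16_t b, std::uint8_t truth_table) {
    std::uint16_t out = 0;
    for (unsigned bit = 0; bit < 16; ++bit) {
        unsigned idx = (((a >> bit) & 1u) << 1) | ((b >> bit) & 1u);
        out = static_cast<std::uint16_t>(out | (((truth_table >> idx) & 1u) << bit));
    }
    return out;
}

// Hypothetical IFU model: the fields play the role of the control register.
struct Ifu {
    unsigned shift = 0;               // barrel shift applied to the left input only
    std::uint8_t truth_table = 0x6;   // ALU function, XOR by default
    bool use_delay = false;           // enable the optional delay register
    std::uint16_t delay_reg = 0;

    // One cycle: returns this cycle's result, or last cycle's if delayed.
    std::uint16_t step(std::uint16_t left, std::uint16_t right) {
        std::uint16_t result = lut_alu(rotl16(left, shift), right, truth_table);
        if (!use_delay) return result;
        std::uint16_t delayed = delay_reg;  // extra cycle of latency
        delay_reg = result;
        return delayed;
    }
};
```

Enabling `use_delay` on the shorter of two converging pipeline paths is how the extra register would equalize their latencies.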
Radio testbed at VT
Stallion
Worm-hole routing
stream = worm, architecture = holes
multiple, independent streams can wind their way through the chip simultaneously
parts of system can be processing, parts could be reconfiguring
GOAL: Layered Software Radio Architecture
‘Stream’ processing at Stanford
Speeding up media applications
Need lots of computations per memory reference
Lots of data and sub-word parallelism
Current GPP architectures do not have enough ALUs
‘Stream’ processors to the rescue
Special-purpose processors
Fed by dedicated wires/memories
Lots (100s) of ALUs
Care and feeding of ALUs
‘Feeding’ structure dwarfs ALU
[Figure: an ALU and its feeding structure; labels: data bandwidth, instruction bandwidth, registers, instruction cache, IR, IP]
Architecture implications
Tremendous opportunities
- media problems have lots of parallelism and locality
- VLSI technology enables 100s of ALUs/chip (1000s soon)
  - in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder
Challenging problems
- locality: global structures won’t work
- explicit parallelism: ILP won’t keep 100 ALUs busy
- memory: streaming applications don’t cache well
It’s time to try some new approaches
Register file organization
Register file functions:
- short-term storage for intermediate results
- communication between multiple function units
Global register files don’t scale with #ALUs
- need more registers to hold more results (grows with #ALUs)
- need more ports to connect all of the units (grows with #ALUs²)
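The scaling argument can be put into numbers with a rough model. The assumption here is that the register count grows linearly with the number of ALUs N while the area each register spends on ports grows with N², so a central register file grows roughly with N³ against N for the ALUs themselves. The function names are illustrative and the results are relative to N = 1, not real areas.

```cpp
#include <cstdint>

// Relative area of one central register file serving n ALUs (n = 1 gives 1).
// Assumed model: n registers' worth of storage, each costing n * n for ports.
constexpr std::uint64_t central_rf_relative_area(std::uint64_t n) {
    return n * (n * n);
}

// How much faster the register file grows than the ALUs it serves.
// ALU area is assumed to grow linearly with n.
constexpr std::uint64_t rf_to_alu_area_ratio(std::uint64_t n) {
    return central_rf_relative_area(n) / n;
}
```

Under this model, going from 1 to 32 ALUs multiplies the ALU area by 32 but the central register file area by 32,768, which is the motivation for splitting the register file up.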
Register files dwarf ALUs
[Figure: size of the register file needed to support N arithmetic units (1, 4, 16 and 32 ALUs) compared with the size of one ALU, drawn against a 1 cm scale]
Distributed register files
Distributed register files mean:
- not all functional units can access all data
- each functional unit input/output no longer has a dedicated route from/to all register files
[Figure: functional units ADD0, L/S and ADD1 on shared buses; annotations: can write to either or both buses, can read from either bus]
Stream processing
[Figure: stereo depth extraction as kernels connected by streams. Input data: Image 0 and Image 1, each passing through two convolve kernels; a SAD kernel produces the output data, a depth map.]
- Little data reuse (pixels never revisited)
- Highly data parallel (output pixels not dependent on other output pixels)
- Compute intensive (60 operations per memory reference)
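For readers unfamiliar with SAD, the sketch below shows a sum of absolute differences over two pixel windows and how a stereo matcher might use it to pick a disparity along one image row. It is a plain C++ illustration, not Imagine kernel code; the function names, the 8-bit pixel type, and the search direction are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

using Row = std::vector<std::uint8_t>;

// Sum of absolute differences over the common length of two pixel blocks.
inline std::uint32_t sad(const Row& a, const Row& b) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        total += static_cast<std::uint32_t>(std::abs(int(a[i]) - int(b[i])));
    }
    return total;
}

// Disparity d in [0, max_disparity] minimising the SAD between the window
// left[x .. x+window) and the shifted window right[x-d .. x-d+window).
inline int best_disparity(const Row& left, const Row& right,
                          std::size_t x, std::size_t window, int max_disparity) {
    int best = 0;
    std::uint32_t best_cost = std::numeric_limits<std::uint32_t>::max();
    if (x + window > left.size() || x + window > right.size()) return best;
    for (int d = 0; d <= max_disparity; ++d) {
        if (static_cast<std::size_t>(d) > x) break;  // shifted window would leave the row
        std::uint32_t cost = 0;
        for (std::size_t i = 0; i < window; ++i) {
            cost += static_cast<std::uint32_t>(
                std::abs(int(left[x + i]) - int(right[x - d + i])));
        }
        if (cost < best_cost) { best_cost = cost; best = d; }
    }
    return best;
}
```

Each output depends only on a small window of input pixels, which is why the workload is data parallel and reuses little data.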
Stream programming
Streams: communication

    void main() {
        Stream<int> a(256);
        Stream<int> b(256);
        Stream<int> c(256);
        Stream<int> d(1024);
        ...
        example1(a, b, c);
        example2(c, d);
        ...
    }
Kernels: computation

    KERNEL example1(istream<int> a, istream<int> b, ostream<int> c) {
        loop_stream(a) {
            int ai, bi, ci;
            a >> ai;
            b >> bi;
            ci = ai * 2 + bi * 3;
            c << ci;
        }
    }
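For comparison, the same computation as `example1` can be written in standard C++ with `std::vector` standing in for the streams. This is only an emulation of the kernel's arithmetic, assuming the loop stops when either input runs out; it says nothing about how the stream compiler schedules it.

```cpp
#include <cstddef>
#include <vector>

// Standard C++ stand-in for the kernel example1: c[i] = a[i] * 2 + b[i] * 3.
std::vector<int> example1(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> c;
    c.reserve(a.size());
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        const int ai = a[i];            // a >> ai
        const int bi = b[i];            // b >> bi
        c.push_back(ai * 2 + bi * 3);   // c << ci
    }
    return c;
}
```

Inputs `{1, 2, 3}` and `{10, 20, 30}` give `{32, 64, 96}`.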
Stream Processor
Instructions are Load, Store, and Operate
- operands are streams
Operate performs a compound stream operation
- read elements from input streams
- perform a local computation
- append elements to output streams
- repeat until input stream is consumed
- (e.g., triangle transform)
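The Load / Operate / Store model can be mimicked in a few lines of ordinary C++. In this sketch a stream is just a `std::vector<int>`, memory is a flat vector, and the kernel is a callback that reads a fixed-size group of input elements and appends any number of outputs. All names and the group-based interface are assumptions, not the Imagine instruction set.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Stream = std::vector<int>;
// A kernel reads `group` consecutive inputs and appends results to `out`.
using Kernel = std::function<void(const int* in, Stream& out)>;

// Load: copy up to `length` words starting at `base` from memory into a stream.
Stream load(const std::vector<int>& memory, std::size_t base, std::size_t length) {
    Stream s;
    for (std::size_t i = base; i < memory.size() && i < base + length; ++i) {
        s.push_back(memory[i]);
    }
    return s;
}

// Store: write a stream back to memory starting at `base`, growing memory if needed.
void store(std::vector<int>& memory, std::size_t base, const Stream& s) {
    if (memory.size() < base + s.size()) memory.resize(base + s.size());
    for (std::size_t i = 0; i < s.size(); ++i) memory[base + i] = s[i];
}

// Operate: apply the kernel to successive groups until the input is consumed.
Stream operate(const Stream& in, std::size_t group, const Kernel& kernel) {
    Stream out;
    if (group == 0) return out;
    for (std::size_t i = 0; i + group <= in.size(); i += group) {
        kernel(&in[i], out);   // local computation on one group of elements
    }
    return out;
}
```

A kernel that sums pairs, for example, turns the loaded stream `{1, 2, 3, 4}` into `{3, 7}`; only the Load and Store steps touch memory.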
Imagine
[Figure: Imagine stream processor block diagram. Labels: host processor, stream controller, network interface (to network), stream register file, microcontroller, ALU clusters 0 to 7, streaming memory system with four SDRAMs.]
Arithmetic clusters
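[Figure: arithmetic cluster. Three adders (+), two multipliers (*), a divider (/) and a CU, with local register files and a cross point; connections from SRF, to SRF, and to the intercluster network.]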
Bandwidth hierarchy
VLIW clusters with shared control
41.2 32-bit operations per word of memory bandwidth
[Figure: bandwidth hierarchy from four SDRAMs through the stream register file to the ALU clusters; bandwidths labelled 2 GB/s, 32 GB/s and 544 GB/s]
Conclusions
‘Streams’ shown to be promising for reconfigurable computing
- wireless may need reconfigurability
‘Streams’ shown to be promising for media processing
- wireless may have similar workloads
Important to understand pros and cons of different methodologies for good wireless architectures
Important to have the right tools