ECE8833 Polymorphous and Many-Core Computer Architecture
Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering
Lecture 4: Billion-Transistor Architecture '97 (Part II)
Practitioners' Groups
Everyone has an acronym!
• IRAM
  – Implementation at Berkeley
• CMP
  – Led to Sun Niagara and the multicore (r)evolution
• SMT
  – Intel HyperThreading (arguably Intel first envisioned the idea), IBM POWER5, Alpha 21464
  – Many credit this technology to UCSB's multistreaming work in the early 1990s
• RAW
  – Led to the Tilera TILE64
IRAM: "Scalable Processors in the Billion-Transistor Era: IRAM" (IEEE Computer, Sept. 1997)
C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, K. Yelick
Mission Statement
Future Roadblocks That Inspired IRAM
• Latency issues
  – Continually widening performance gap between processor and memory
  – DRAM optimized for density, not speed
• Bandwidth issues
  – Off-chip bus is slow and narrow, with high capacitance and high energy
  – Hurts scientific codes, databases, etc. the most
IRAM Approach
• Move DRAM closer to the processor
  – Enlarges on-chip bandwidth
• Fewer I/O pins
  – Smaller package
  – Serial interface
Anything look familiar?
IRAM Chip Design Research
• How much larger and slower is a processor designed in a straight DRAM process vs. a standard logic process?
  – Microprocessor fabs offer fast transistors for fast logic and many metal layers for accelerating communication and simplifying power distribution
  – DRAM fabs offer many poly layers to give small DRAM cells and low leakage for a low refresh rate
• Speed of the page buffer vs. registers and cache
• New DRAM interface based on fast serial links (2.5 Gbit/s, or about 300 MB/s per pin; see the conversion below)
• Quantify the bandwidth vs. area/power tradeoff
• Area overhead for IRAM vs. a DRAM
• Extra power dissipation for IRAM vs. a DRAM
• Performance of IRAM with the same area and power as a DRAM ("processor for free")
Source: David Patterson's slide in his IRAM Overview talk
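As a sanity check on the per-pin rate quoted above (just a unit conversion, nothing beyond the slide's numbers):

\( 2.5\ \text{Gbit/s} \div 8\ \text{bits/byte} = 312.5\ \text{MB/s} \approx 300\ \text{MB/s per pin} \)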
IRAM Architecture Research
• How much slower can a processor with a high-bandwidth memory be and yet be as fast as a conventional computer? (a very interesting point)
• Compare memory management schemes (e.g., vector registers, scratchpad, wide TLB/cache)
• Compare schemes for running large programs, i.e., spanning multiple IRAMs
• Quantify the value of compact programs and data (e.g., compact code, on-the-fly compression)
• Quantify the pros and cons of a standard instruction set vs. a custom IRAM instruction set
Source: David Patterson’s slide in his IRAM Overview talk
IRAM Compiler Research
• Explicit SW control of memory management vs. conventional implicit HW designs
  – Protection (software fault isolation)
  – Paging (dynamic relocation, overlapped I/O accesses)
  – "Cache" control (vector registers, scratchpad)
  – I/O interrupt/polling
• Evaluate benchmark performance in conjunction with the architectural research
  – Number crunching (vector vs. superscalar)
  – Memory intensive (database, operating system)
  – Real-time benchmarks (stability and performance)
  – Pointer intensive (GCC compiler)
• Impact of language on IRAM (Fortran 77 vs. HPF, C/C++ vs. Java)
Source: David Patterson’s slide in his IRAM Overview talk
Potential IRAM Architecture
• "New Model": VSIW = Very Short Instruction Word!
  – Compact: describe N operations with 1 short (vector) instruction
  – Predictable (real-time) performance, vs. the statistical performance of a cache
  – Multimedia ready: choose Nx64b, 2Nx32b, or 4Nx16b
  – Easy to get high performance; the N operations:
    • Are independent
    • Use the same functional unit
    • Access disjoint registers
    • Access registers in the same order as previous instructions
    • Access contiguous memory words or a known pattern
    • Hide memory latency (and any other latency)
  – Compiler technology already developed (a minimal sketch follows)
Source: David Patterson's slide in his IRAM talk
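To make the "N operations with 1 short instruction" point concrete, here is a minimal generic-C sketch (not VIRAM code): every iteration of the loop below is independent, uses the same functional units, and touches contiguous memory, so a vector ISA can encode the whole loop as a few vector instructions (load, multiply-add, store), each describing N element operations, while a scalar ISA re-fetches the same instructions N times.

    #include <stddef.h>

    /* Each iteration is independent, uses the same functional unit,
       and accesses contiguous memory: exactly the properties the
       slide lists, so one vector instruction can describe N of these
       element operations at once. */
    void saxpy(size_t n, float a, const float *x, float *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }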
Berkeley Vector-Intelligent RAM
Why vector processing?
• Scalable design
• Higher code density
• Runs at a higher clock rate
• Better energy efficiency, due to easier clock gating for the vector/scalar units
• Lower die temperature to keep a good data-retention rate
• On-chip DRAM is sufficient for embedded applications
• Use external off-chip DRAM as secondary memory
  – Pages swapped between on-chip and off-chip DRAMs
VIRAM-1 Floorplan
• 180nm CMOS, 6-layer copper
• 125 million transistors, 325 mm2
• 2 watts @ 200MHz
• 13MB of IBM embedded DRAM (macros of 13Mbit each) and 4 vector units (8KB of vector registers in total)
• VRF = 32x64b or 64x32b or 128x16b per register (arithmetic below)
[Gebis et al. DAC student contest 04]
Floorplan labels: IBM embedded DRAM macros, 1/4 of the 8KB VRF (custom layout), 64-bit MIPS M5Kc scalar core.
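Checking the vector-register arithmetic (assuming the usual 32 architectural vector registers, which the slide does not state explicitly): each register holds the same 2048 bits however the elements are sliced, and 32 such registers give the quoted 8KB:

\( 32 \times 64\,\text{b} = 64 \times 32\,\text{b} = 128 \times 16\,\text{b} = 2048\,\text{b} = 256\,\text{B per register}; \quad 32 \times 256\,\text{B} = 8\,\text{KB} \)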
SMT: "Simultaneous Multithreading: A Platform for Next-Generation Processors" (IEEE Micro, Sept./Oct. 1997)
S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, D. M. Tullsen
SMT Concept vs. Other Alternatives
The figure compares how five organizations fill four functional units (FU1-FU4) over execution time, using five threads (Thread 1-5, with unused issue slots shown empty):
• Conventional superscalar (single-threaded): one thread leaves many slots unused in both dimensions
• Coarse-grained multithreading (block interleaving): switches threads on long-latency events
• Fine-grained multithreading (cycle-by-cycle interleaving): a different thread each cycle, but still only one thread per cycle
• Chip multiprocessor (CMP): functional units statically partitioned among cores/threads
• Simultaneous multithreading (or Intel's HT): issues from multiple threads in the same cycle, filling both horizontal and vertical waste
• The early SMT idea was developed at UCSB (Mario Nemirovsky's group, HICSS '94)
• The name SMT was christened by the group at the University of Washington (ISCA '95)
Exploiting Choice: SMT Instruction Fetch Policies
• FIFO, Round Robin: simple, but may be too naive
• RR.X.Y (a minimal sketch of the selection loop follows)
  – Fetch from X threads, up to Y instructions each, per cycle
  – RR.1.8
  – RR.2.4 or RR.4.2
  – RR.2.8
• What are the main design and/or performance issues when X > 1?
[Tullsen et al. ISCA96]
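A minimal sketch of the RR.X.Y selection loop, assuming a hypothetical fetch_from() hook into the fetch unit and 8 hardware threads (an illustration, not the paper's simulator code):

    #define NTHREADS 8

    void fetch_from(int thread, int max_insts);   /* hypothetical fetch-unit hook */

    /* RR.X.Y: each cycle, pick X threads round-robin starting from a
       rotating pointer, and fetch up to Y instructions from each. */
    void rr_fetch_cycle(int x, int y, int *rr_ptr)
    {
        for (int k = 0; k < x; k++) {
            int t = (*rr_ptr + k) % NTHREADS;
            fetch_from(t, y);                  /* up to y instructions from thread t */
        }
        *rr_ptr = (*rr_ptr + x) % NTHREADS;    /* rotate priority for fairness */
    }

With X > 1, one fetch unit's bandwidth must be split across multiple program counters in the same cycle (e.g., extra I-cache ports or banking), which is the design issue the last bullet points at.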
Exploiting Choice: SMT Instruction Fetch Policies
• Adaptive fetch policies
  – BRCOUNT (reduce wrong-path issuing)
    • Count the # of branch instructions in the decode/rename/IQ stages
    • Give top priority to the thread with the lowest BRCOUNT
  – MISSCOUNT (reduce IQ clog)
    • Count the # of outstanding D-cache misses
    • Give top priority to the thread with the lowest MISSCOUNT
  – ICOUNT (reduce IQ clog; sketched below)
    • Count the # of instructions in the decode/rename/IQ stages
    • Give top priority to the thread with the lowest ICOUNT
  – IQPOSN (reduce IQ clog)
    • Give lowest priority to the threads with instructions closest to the head of the INT or FP instruction queues
      – Threads with the oldest instructions are the most prone to IQ clog
    • No counter needed
[Tullsen et al. ISCA96]
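A minimal sketch of ICOUNT-style priority, assuming a per-thread counter of in-flight pre-issue instructions (again an illustration, not the ISCA '96 simulator):

    #define NTHREADS 8

    /* icount[t] tracks thread t's instructions currently in the
       decode/rename/IQ stages (incremented at fetch, decremented at
       issue).  Each cycle, fetch goes to the active thread with the
       fewest such instructions: the thread least likely to clog the
       issue queue. */
    int icount_pick(const int icount[NTHREADS], const int active[NTHREADS])
    {
        int best = -1;
        for (int t = 0; t < NTHREADS; t++) {
            if (!active[t])
                continue;                /* skip stalled/idle threads */
            if (best < 0 || icount[t] < icount[best])
                best = t;
        }
        return best;                     /* -1 if no thread can fetch */
    }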
Exploiting Choice: SMT Instruction Fetch Policies
[Tullsen et al. ISCA96]
Alpha 21464 (EV8)
• Leading-edge process technology
  – 1.2 to 2.0 GHz
  – 0.125 µm CMOS
  – SOI-compatible
  – Cu interconnect, 7 metal layers
  – Low-k dielectrics
• Chip characteristics
  – 1.2V Vdd, 250W (EV6: 72W; EV7: 125W)
  – 250 million transistors, 350 mm2
  – 1100 signal pins in flip-chip packaging
Slide Source: Dr. Joel Emer
EV8 Architecture Overview
• Enhanced OoO execution
• 8-wide issue superscalar processor
• Large on-die L2 (1.75MB)
• 8 DRDRAM channels
• On-chip router for the system interconnect
• Directory-based ccNUMA for up to 512-way SMP
• 4-way SMT
Slide Source: Dr. Joel Emer
SMT Pipeline
The EV8 pipeline stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; the Icache feeds fetch (via the PC and register map) and the Dcache backs the memory stage.
• Replicated resources
  – PCs
  – Register maps
• Shared resources
  – Register file
  – Instruction queue
  – First- and second-level caches
  – Translation buffers
  – Branch predictor
Slide Source: Dr. Joel Emer
Intel HyperThreading
• Intel Xeon Processor, Xeon MP Processor, and ATOM
• Enables Simultaneous Multi-Threading (SMT)
  – Exploits TLP (Thread-Level Parallelism) to keep the ILP resources busy
  – Issues and executes multiple threads at the same snapshot
• Appears as 2 logical processors
• Shares the same execution resources
• Duplicates architectural state and certain microarchitectural state
  – IPs, iTLB, streaming buffer
  – Architectural register file
  – Return stack buffer
  – Branch history buffer
  – Register Alias Table
Sharing Resources in Intel HT
• P4's Trace Cache (or microcode ROM) is accessed in alternate cycles by each logical processor, unless one is stalled due to a TC miss (a minimal sketch of this alternation follows)
• TLB: shared, with entries tagged by a logical-processor ID, but partitioned
  – x86 does not employ ASIDs
  – Hard-partitioning appears to be the only option to allow HT
• Hard-partitioned structures (half per logical processor)
  – µop queue (into 1/2) after fetch from the TC
  – ROB (126/2 in P4)
  – Load buffer (48/2 in P4)
  – Store buffer (24/2 or 32/2 in P4)
  – General µop queue and memory µop queue (1/2 each)
• Retirement: alternates between the 2 logical processors
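A minimal sketch of the ping-pong arbitration described in the first bullet, with the stall override; the flags are hypothetical and purely illustrative:

    /* Alternate trace-cache access between two logical processors,
       cycle by cycle; if the scheduled one is stalled (e.g., on a TC
       miss), the other gets the slot instead. */
    int tc_arbitrate(int cycle, const int stalled[2])
    {
        int lp = cycle & 1;             /* default: ping-pong by cycle parity */
        if (stalled[lp] && !stalled[lp ^ 1])
            lp ^= 1;                    /* scheduled LP stalled: give cycle to peer */
        return stalled[lp] ? -1 : lp;   /* -1: neither can use the port */
    }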
HT in Intel ATOM
• First in-order Intel processor with HT
• HT claimed to enlarge the silicon area by 8%
• Claimed 30% performance increase at a 15% power increase
• Shared cache space is competed for between the threads
• No dedicated multiplier (uses the SIMD multiplier)
• No dedicated integer divider (uses the FP divider)
Die labels: 25mm2 @ 45nm; 32KB L1-I, 24KB L1-D, 512KB L2.
Source: Microprocessor Report and Intel
CMP: "A Single-Chip Multiprocessor" (IEEE Computer, Sept. 1997)
L. Hammond, B. A. Nayfeh, K. Olukotun
Main Argument
• A single thread of control has limited parallelism (ILP is dead)
• The cost of chasing it is prohibitive due to complexity
• Achieve parallelization with SW, not HW
  – Inherently parallel multimedia applications
  – Widespread multi-tasking OSes
  – Emerging parallel compilers (ref. SUIF), mainly for loop-level parallelism
• Why not SMT?
  – Interconnect delay issue
  – Partitioning is less localized than in a CMP
• Use relatively simple single-thread processors
  – Exploit only a "modest" amount of ILP per core
  – Execute multiple threads in parallel
• Bottom line
Architectural Comparison
Single Chip Multiprocessor
Commercial CMP (AMD Phenom II Quad-Core)
• AMD K10 (Barcelona) microarchitecture
• Code name "Deneb"
• 45nm process
• 4 cores, private 512KB L2 per core
• Shared 6MB L3 (2MB in Phenom)
• Integrated Northbridge
  – Up to 4 DIMMs
• Sideband Stack Optimizer (SSO)
  – Parallelizes chains of POPs and PUSHes (which were dependent on each other through the stack pointer)
    • Converts them into pure load/store instructions
  – No µops occupy the FUs for stack-pointer adjustment
Intel Core i7 (Nehalem)
• 4 cores
• HT support in each core
• 8MB shared L3
• 3 DDR3 channels
• 25.6GB/s memory BW
• Turbo Boost Technology
  – New P-state (Performance)
  – DVFS when workloads operate under the max power envelope
  – Same frequency for all cores
UltraSPARC T1
• Up to eight cores, each 4-way threaded
• Fine-grained multithreading (a minimal sketch of the thread-select loop follows)
  – Thread-selection logic takes out threads that encounter long-latency events
  – Round-robin, cycle by cycle
  – 4 threads in a group share a processing pipeline (SPARC pipe)
• 1.2 GHz (90nm)
• In-order, 8 instructions per cycle (single issue from each core)
• 1 shared FPU
• Caches
  – 16KB 4-way L1-I (32B lines)
  – 8KB 4-way L1-D (16B lines)
  – Blocking caches (a reason for MT)
  – 4-banked 12-way 3MB L2 + 4 memory controllers (shared by all cores)
  – Data moves between the L2 and the cores through an integrated crossbar switch for high throughput (200GB/s)
UltraSPARC T1
• Thread-select logic marks a thread inactive based on
  – Instruction type
    • A predecode bit in the I-cache indicates a long-latency instruction
  – Misses
  – Traps
  – Resource conflicts
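A minimal sketch of T1-style cycle-by-cycle selection, assuming a per-thread ready bit cleared by the events above (an illustration, not Sun's logic):

    #define TGROUP 4   /* 4 threads share one SPARC pipe */

    /* Round-robin over the thread group each cycle, skipping threads
       marked inactive by the thread-select logic (long-latency
       instruction, miss, trap, or resource conflict). */
    int t1_select(int last, const int ready[TGROUP])
    {
        for (int k = 1; k <= TGROUP; k++) {
            int t = (last + k) % TGROUP;
            if (ready[t])
                return t;   /* issue from this thread this cycle */
        }
        return -1;          /* all threads stalled: pipeline bubble */
    }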
UltraSPARC T2
• A fatter version of the T1
• 1.4GHz (65nm)
• 8 threads per core, 8 cores on-die
• 1 FPU per core (vs. 1 FPU per die in T1), 16 integer EUs (8 in T1)
• L2 increased to an 8-banked, 16-way, 4MB shared cache
• 8-stage integer pipeline (as opposed to 6 for T1)
• 16 instructions per cycle
• One PCI Express port (x8, 1.0)
• Two 10 Gigabit Ethernet ports with packet classification and filtering
• Eight encryption engines
• Four dual-channel FBDIMM memory controllers
• 711 signal I/Os, 1831 pins total
• The subsequent T2 Plus supports 2 sockets: 16 cores / 128 threads
Sun ROCK Processor
• 16 cores, two threads per core
• Hardware scout threading (runahead)
  – Invisible to SW
  – A long-latency instruction automatically starts a HW scout
    • L1 D$ miss
    • Micro-DTLB miss
    • Divide
  – Warms up the branch predictor
  – Prefetches memory
• Execute Ahead (EXE)
  – Retires independent instructions while scouting
• Simultaneous Speculative Threading (SST) [ISCA'09]
  – Two hardware threads for one program
  – Runahead speculatively executes under a cache miss
  – OoO retirement
• HTM support
Many-Core Processors: Intel Teraflops (Polaris)
• Per tile: 2KB data memory, 3KB instruction memory
• No coherence support
• 2 FMACs per tile
• Next-gen will have 3D-integrated memory
  – SRAM first
  – DRAM in the future
RAW: "Baring It All to Software: Raw Machines" (IEEE Computer, Sept. 1997)
E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, A. Agarwal
MIT RAW Design Tenets
• Long wires across the chip will be the constraint
• Expose the architecture to software (parallelizing compilers)
  – Explicit parallelization
  – Pins
  – Communication
• Use a tile-based architecture
  – Similar designs sponsored by the DARPA PCA program: UT TRIPS, Stanford Smart Memories
• Simple point-to-point static routing network
  – One cycle across each tile
  – More scalable (than a bus)
  – Harnessed by the compiler with a precise count of wire hops
  – A dynamic router supports memory accesses that cannot be analyzed statically
Application Mapping on RAW
[Taylor IEEE MICRO'02]
The figure shows one RAW chip running several workloads at once on different tile regions: a four-way parallelized scalar code; a two-way threaded Java program; an httpd server (with idle tiles in sleep mode for power saving, "Zzzz.."); and a custom data-path pipeline (built by the compiler) streaming video data into a frame buffer and screen. Fast inter-tile ALU forwarding takes 3 cycles.
Scalar Operand Network Design
[Taylor et al. HPCA'03]
The figure progresses through three designs: a non-pipelined scalar operand network; a pipelined one with a bypass link; and a pipelined one with a bypass link and multiple ALUs, where lots of live values end up in the SON.
Communication Scalability Issue
• RB (# of result buses) * WS (window size) comparisons are made per cycle (a worked instance follows)
• Long, dense wires elongate the cycle time
  – Pipeline the wire
• The cost of processing incoming information is high
• Similar problem in bus-based snoopy cache protocols
Figure labels: routing area, large MUX, complex compare logic.
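A worked instance with hypothetical but plausible numbers (not from the slide): with \( RB = 8 \) result buses and a \( WS = 128 \)-entry window, \( 8 \times 128 = 1024 \) tag comparisons are needed every cycle, and the comparator count grows with this product as the machine widens.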
Scalar Operand Network
[Taylor et al. HPCA'03]
The figure contrasts two distributed-ILP organizations: a multiscalar-style operand network, where a result bus snakes through a row of register files; and a scalar operand network on a 2-D point-to-point interconnect (e.g., Raw or TRIPS), where each register-file/ALU tile connects to its neighbors through a switch.
Mapping Operations to a Tile-Based Architecture
• Done at
  – Compile time (RAW)
  – Or runtime
• "Point-to-point" 2D mesh
• Tradeoffs
  – Computation vs. communication
  – Compute affinity (data should flow through fewer hops)
• How to maintain control flow
Example dataflow mapped onto the mesh in the figure (the loads ld a / ld b, the stores st b, and the >>, *, + operations are each placed on a tile):
i = a[j];
q = b[i];
r = q + j;
s = q >> 3;
t = r * s;
b[j] = i;
b[t] = t;
RAW Core-to-Core Communication
• Static Router
  – Wires are place-and-routed by software
  – P2P scalar transport
  – Compilers (or assembly writers) handle predictable communication
• Dynamic Router
  – Transports dynamic, unpredictable operations
    • Interrupts
    • Cache misses
  – Handles communication that is unpredictable at compile time
Architectural Comparison: RAW vs. Superscalar vs. Multiprocessor
• Raw replaces the bus of a superscalar with a switched network
• The switched network is tightly integrated into the processor's pipeline to support single-cycle message injection and receive operations
• Raw software (the compiler) has to implement functions such as instruction scheduling and dependency checking
• Raw yields complexity to software so that more hardware can be used for ALUs and memory
RAW's Four On-Chip Mesh Networks
[Slide Source: Michael B. Taylor]
The figure shows each tile's compute pipeline connected to its neighbors by 8 32-bit channels; links are registered at the input, so the longest wire is the length of one tile.
Raw Architecture
[Slide Source: Volker Strumpen]
Raw Compute Processor Pipeline
[Taylor IEEE MICRO'02]
Figure notes: fast ALU-to-network injection (4 cycles); registers R24-R27 map to the 4 on-chip physical networks; the local bypass path costs 0 cycles.
RAW Processor Tile
Each tile contains:
• Tile processor
  – 32-bit MIPS, 8-stage in-order, single issue
  – 32KB instruction memory
  – 32KB data cache (not coherent, user managed)
• Switch processor
  – 8K-instruction memory
  – Executes basic move and branch instructions
  – Transfers data between the local switch and neighbor switches
• Dynamic Router
  – Hardware controlled (not directly under the programmer's control)
Raw Programming
• Compute the sum c = a + b across four tiles (the slide's code figure did not survive transcription; a hedged sketch follows):
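Below is a hedged C sketch of the idea only, where snet_send()/snet_recv() are hypothetical stand-ins for moving words over the static network (the $csto/$csti registers of the later assembly slide), with the switch processors assumed to be pre-programmed with matching routes:

    void snet_send(int to, int value);   /* hypothetical static-network send */
    int  snet_recv(int from);            /* hypothetical static-network receive */

    /* Each of the four tiles adds its own slice of a and b; tile 0
       then gathers the partial results over the static network to
       form the final sum c. */
    void tile_worker(int my_tile, int a_slice, int b_slice)
    {
        int partial = a_slice + b_slice;
        if (my_tile != 0) {
            snet_send(0, partial);            /* forward partial to tile 0 */
        } else {
            int c = partial;
            for (int t = 1; t < 4; t++)
                c += snet_recv(t);            /* accumulate c = a + b */
            /* c now holds the complete sum */
        }
    }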
Data Path: Zoom 1
• Stateful hardware: local data memory (a, c), a register (b), and both static networks (snet1 and snet2)
Zoom 2: Processor Datapaths
Zoom 2: Switch Datapaths (+ tile processor)
Raw Assembly
RAW On-Chip Network
• 2D Mesh
  – The longest wire is no greater than one side of a tile
  – Worst case: 6 hops (or cycles) for 16 tiles
• 2 Static Routers, "point-to-point," each has
  – A 64KB SW-managed instruction cache
  – A pair of routing crossbars
  – Example (Tile 0 sends a word to Tile 1 through two switches):

    Tile 0 (sender):
      or $csto, $0, $5
    SWITCH0:
      nop route $csto->$cEo2
    SWITCH1:
      nop route $cWi2->$csti2
    Tile 1 (receiver):
      and $5, $5, $csti2

• 2 Dynamic Routers
  – Dimension-ordered routing by hardware
  – Example (Tile 0 sends a 2-word message to Tile 15):

    Tile 0 (sender):
      lui $3, $0, 15
      ihdr $cgno, $3, 0x0200   # header, msg len = 2
      or $cgno, $0, $9         # send word1
      ld $cgno, $0, $csti      # send word2
    Tile 15 (receiver):
      or $2, $cgni, $0         # word1
      or $3, $cgni, $0         # word2
Control Orchestration Optimization
• Orchestrated by the Raw compiler
• Control localization
  – Hide a control-flow sequence within a "macro-instruction" assigned to a tile (the figure collapses a branchy region into one macroins node: one instruction from the mapper's point of view)
[Lee et al. ASPLOS'98]
Example of RAW Compiler Transformation
[Lee et al. ASPLOS'98]
Initial code:
  y = a+b;
  z = a*a;
  a = y*a*5;
  y = y*b*6;
After the initial code transformation (renamed into single-assignment temporaries):
  read(a)
  read(b)
  y_1 = a+b
  z_1 = a*a
  tmp_1 = y_1*a
  a_1 = tmp_1*5
  tmp_2 = y_1*b
  y_2 = tmp_2*6
  write(z)
  write(a)
  write(y)
The figure redraws these instructions as a dataflow graph and pushes them through the compiler phases: Instruction Partitioner, Global Data Partitioner, Data & Inst Placer, Communication Code Generator, and Event Scheduler (next slides).
Example of RAW Compiler Transformation
[Lee et al. ASPLOS'98]
The Instruction Partitioner splits the transformed code into two instruction streams:
  Stream 1: read(a); z_1 = a*a; write(z); tmp_1 = y_1*a; a_1 = tmp_1*5; write(a)
  Stream 2: read(b); y_1 = a+b; tmp_2 = y_1*b; y_2 = tmp_2*6; write(y)
The Global Data Partitioner groups the data as {a,z} and {b,y}; the Data & Inst Placer then assigns {a,z} and its stream to tile P0, and {b,y} and its stream to tile P1.
Example of RAW Compiler Transformation
[Lee et al. ASPLOS'98]
The Communication Code Generator inserts the sends, receives, and switch routes: P0 executes send(a) and later y_1 = rcv(); P1 executes a = rcv() and later send(y_1); switch S0 executes route(P0,S1) and route(S1,P0); switch S1 executes route(S0,P1) and route(P1,S0).
Example of RAW Compiler Transformation
[Lee et al. ASPLOS'98]
The Event Scheduler produces the per-unit schedules (one dependence-consistent ordering):
  P0: read(a); send(a); z_1 = a*a; write(z); y_1 = rcv(); tmp_1 = y_1*a; a_1 = tmp_1*5; write(a)
  P1: read(b); a = rcv(); y_1 = a+b; send(y_1); tmp_2 = y_1*b; y_2 = tmp_2*6; write(y)
  S0: route(P0,S1); route(S1,P0)
  S1: route(S0,P1); route(P1,S0)
Raw Compiler Example
Source code:
  tmp3 = (seed*6+2)/3
  v2 = (tmp1 - tmp3)*5
  v1 = (tmp1 + tmp2)*3
  v0 = tmp0 - v1
  ...
Renamed dataflow operations (the slide shows the same graph twice, before and after tile assignment):
  pval5=seed.0*6.0
  pval4=pval5+2.0
  tmp3.6=pval4/3.0
  tmp3=tmp3.6
  v3.10=tmp3.6-v2.7
  v3=v3.10
  v2.4=v2
  pval3=seed.0*v2.4
  tmp2.5=pval3+2.0
  tmp2=tmp2.5
  pval6=tmp1.3-tmp2.5
  v2.7=pval6*5.0
  v2=v2.7
  seed.0=seed
  pval1=seed.0*3.0
  pval0=pval1+2.0
  tmp0.1=pval0/2.0
  tmp0=tmp0.1
  v1.2=v1
  pval2=seed.0*v1.2
  tmp1.3=pval2+2.0
  tmp1=tmp1.3
  pval7=tmp1.3+tmp2.5
  v1.8=pval7*3.0
  v1=v1.8
  v0.9=tmp0.1-v1.8
  v0=v0.9
Assign instructions to the tiles, maximizing locality. Generate the static router instructions to transfer operands & streams between tiles.
[Slide Source: Michael B. Taylor]
Scalability
Just stamp out more tiles! With a 1-cycle inter-tile hop: 180 nm fits 16 tiles; 90 nm fits 64 tiles.
Longest wire, frequency, and design and verification complexity are all independent of issue width.
The architecture is backwards compatible.
[Slide Source: Michael B. Taylor]