Hierarchical Coarse-grained Stream Compilation for Software Defined Radio
Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge
Advanced Computer Architecture Laboratory
University of Michigan at Ann Arbor
Software Defined Radio
Use software routines instead of ASICs for the physical-layer operations of wireless communication systems
Advantages:
  Multi-mode operation
  Lower costs
    Faster time to market
    Prototyping and bug fixes
    Chip volumes
    Longevity of platforms
  Enables future wireless communication innovations
    Complexity favors software-based solutions
[Figure: SDR as a common platform for multiple wireless protocols: UWB, EDGE, 802.16a, Bluetooth, 802.11b, WCDMA, 802.11n]
Case Study: W-CDMA
Key software characteristics
  Multiple kernels connected together as a system
  Streaming computation
  Vector-based inter-kernel communications
  Mostly static computation patterns
[Figure: System: 2 Mbps W-CDMA protocol diagram, between the analog frontend and the upper layers. Transmitter: Scrambler, Spreader, Interleaver, Turbo Encoder, LPF-Tx. Receiver: LPF-Rx, Searcher, Descrambler and Despreader pairs, Combiner, Deinterleaver, Turbo Decoder. Stage groups: filtering, modulation, channel estimation, error correction.]
SODA: A SDR DSP Architecture (ISCA 06)
Control-data decoupled multi-core architecture
  1 ARM general-purpose control processor
    Scalar algorithms and protocol controls
  4 data processing elements (PEs)
    SIMD + scalar units
    Used for high-throughput DSP algorithms
[Figure: system architecture with an ARM control processor, global memory, and four PEs, each containing an execution unit and local memory]
SODA Execution Model
Software-managed scratchpad memories
  Each PE can only access its local memory
DMA operations
  Access global memory
  Inter-PE communications
Algorithms statically mapped onto PEs
  RPCs from the ARM control processor
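The execution model above can be sketched in C as follows. This is an illustration only, not SODA code: `pe_scratchpad`, `pe_invoke`, `scale_by_two`, and the `LOCAL_WORDS` capacity are hypothetical names and values, and `memcpy` stands in for the DMA transfers. The point it shows is that the kernel only ever touches local buffers, while all traffic to and from global memory goes through explicit copies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_WORDS 256  /* hypothetical scratchpad capacity, in ints */

/* A kernel sees only PE-local buffers. */
typedef void (*pe_kernel)(const int *in, int *out, size_t n);

/* Hypothetical model of one PE's software-managed local memory. */
typedef struct {
    int in_buf[LOCAL_WORDS];
    int out_buf[LOCAL_WORDS];
} pe_scratchpad;

/* Run one kernel invocation on a PE.
 * Returns 0 on success, -1 if the data does not fit in local memory. */
int pe_invoke(pe_scratchpad *pe, pe_kernel kernel,
              const int *global_in, int *global_out, size_t n)
{
    assert(pe != NULL && kernel != NULL);
    if (n > LOCAL_WORDS) {
        return -1;
    }
    /* "DMA in": global memory -> local scratchpad (memcpy as a stand-in). */
    memcpy(pe->in_buf, global_in, n * sizeof(int));
    /* The kernel computes on local memory only. */
    kernel(pe->in_buf, pe->out_buf, n);
    /* "DMA out": local scratchpad -> global memory. */
    memcpy(global_out, pe->out_buf, n * sizeof(int));
    return 0;
}

/* Example kernel, used only for illustration. */
void scale_by_two(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = 2 * in[i];
    }
}
```

In this sketch the caller plays the role of the control processor: it decides which kernel runs on which PE and issues the invocation, which loosely mirrors the RPC-style dispatch from the ARM.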
Compilation Challenges for SDR
Compilation support for SDR is essential
  Flexibility
  Lower development cost
  More complex protocols
Compilation support for SDR is challenging
  Heterogeneous multiprocessor hardware
    ARM + DSPs
    Two-level scratchpad memories
  Multiple software constraints
    Throughput + code & data size + real-time execution + others
2-Tier Compilation Process
[Figure: the 2 Mbps W-CDMA protocol diagram mapped onto the system architecture (ARM, global memory, four PEs), alongside the datapath of one SODA PE: 1. SIMD pipeline (512-bit SIMD register file, 512-bit SIMD ALU + multiplier, SIMD shuffle network (SSN)); 2. scalar pipeline (scalar RF, scalar ALU); 3. local memory (local SIMD memory, local scalar memory); 4. AGU pipeline (AGU RF, AGU ALU); 5. DMA to the system bus; plus predicate registers and the STV and VTS (SIMD-to-scalar) units.]
Multiprocessor system compilation
DSP kernel compilation
This study is focused on system compilation
  Kernel compilation is treated as a black box
    Existing libraries
    SIMD compilers
  Objective
    Kernel-to-PE assignments
    Memory allocations
  Subject to
    Throughput constraints
    Memory constraints
void Turbo_decoder(int* in, int* out) {
    ...
    for (iter = 0; iter < niter; iter++) {
        descramble(L_a, L_e, alpha);
        component_decoder(L_all, g, L_a, 1);
        for (i = 0; i < FRAME_SIZE; i++) {
            L_e[i] = L_all[i] * 7 / 10;
        }
    }
    ...
}
System Compilation Outline
SPIR: function-level IR
  Traditional IR is not adequate
  Complex inter-function interactions
Backend compilation
  Scheduling functions instead of instructions
  Function-level modulo scheduling
[Figure: compilation flow. Source written in C++ with SPEX or in Matlab with Simulink passes through the SPEX frontend or the Matlab frontend into SPIR; the SPIR backend emits C code for the control processor and C code for each PE. The SPIR example is the rake receiver graph: LPF-Rx, searcher, four descrambler/despreader pairs, and combiner, annotated with per-edge stream rates.]
SPIR Overview
Dataflow programming model
  Graph consists of nodes and edges
Two types of nodes
  Kernel (yellow) nodes for modeling functions
  Memory (blue) nodes for modeling vector buffers
    Buffer stream description + vector stream description
Dataflow edges
  Synchronous dataflow (in the scope of this paper)
[Figure: SPIR graph of the W-CDMA receiver: LPF-Rx, delay buffer, rake receiver (searcher, four descrambler/despreader pairs, combiner), interleaver, and Turbo decoder, annotated with per-edge stream rates ranging from 1 at LPF-Rx to 9600 in / 3200 out at the Turbo decoder]
SPIR Overview
[Figure: flat SPIR graph of the W-CDMA receiver with each kernel's native stream rates (LPF-Rx, searcher, four descrambler/despreader pairs, combiner, interleaver, Turbo decoder)]
Problems with flat dataflow graph representations
  Every kernel's rate must be matched to the highest rate
  SDR kernels have very different stream rates
    Turbo decoder: input rate = 9600; output rate = 3200
    LPF: input rate = 1; output rate = 1
SPIR Overview
[Figure: the same flat graph after rate matching: LPF-Rx and rake receiver edges at 38.4K, interleaver edges at 9600, Turbo decoder output at 3200]
Problems with flat dataflow graph representations
  All kernels must match the Turbo decoder's input rate of 9600
    Minimum LPF rate: input = 38.4K, output = 38.4K
  Stream rates translate to memory buffers
    Unnecessarily large memory buffers
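The rate blow-up described above can be reproduced with a small synchronous dataflow calculation. The sketch below handles only a linear chain of kernels and solves the balance equation (producer firings times tokens produced equals consumer firings times tokens consumed) on each edge. The type `sdf_kernel`, the function `sdf_chain_repetitions`, and the three-kernel chain in the usage note are assumptions for illustration; in particular, the 4:1 despreader ratio is a guess made so that the numbers line up with the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Tokens consumed and produced per firing of one kernel (hypothetical type). */
typedef struct {
    unsigned long consume;
    unsigned long produce;
} sdf_kernel;

static unsigned long gcd_ul(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* For a chain k[0] -> k[1] -> ... -> k[n-1], compute the smallest integer
 * firing counts reps[] such that, on every edge,
 *   reps[i] * k[i].produce == reps[i+1] * k[i+1].consume. */
void sdf_chain_repetitions(const sdf_kernel *k, size_t n, unsigned long *reps)
{
    if (n == 0) {
        return;
    }
    reps[0] = 1;
    for (size_t i = 0; i + 1 < n; i++) {
        assert(k[i].produce > 0 && k[i + 1].consume > 0);
        unsigned long tokens = reps[i] * k[i].produce;
        unsigned long need = k[i + 1].consume;
        unsigned long g = gcd_ul(tokens, need);
        unsigned long scale = need / g;
        /* Scale all upstream kernels so the edge carries a whole number
         * of consumer firings. */
        for (size_t j = 0; j <= i; j++) {
            reps[j] *= scale;
        }
        reps[i + 1] = tokens / g;
    }
}
```

For a chain of LPF (1 in, 1 out), despreader (4 in, 1 out, assumed), and Turbo decoder (9600 in, 3200 out), this gives 38400 LPF firings and 9600 despreader firings per Turbo decoder firing. The buffer on each edge must hold `reps[i] * k[i].produce` tokens, which is where the 38.4K figure for the LPF comes from.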
SPIR Overview
Hierarchical dataflow graphs
  Different hierarchy levels with different streaming rates
  Streaming vectors are modeled as hierarchical communications
    Top level: buffer queue descriptions
    Bottom level: vector streaming descriptions
[Figure: hierarchical graph of the W-CDMA receiver. The Turbo decoder becomes a node with its own child graph (annotations: TurboDecoder 300 / 100; node1 38400 / 9600; node2 9600 / 3200). At the enclosing level the LPF-Rx and rake receiver edges carry 2.56K and the interleaver edges carry 640.]
SPIR Overview
W-CDMA modeled with a 3-level hierarchy in SPIR
Memory nodes are inserted between nodes with child graphs
4x decrease in memory buffer usage
[Figure: the 3-level SPIR hierarchy for W-CDMA. Top level: LPF-Rx (2560 / 2560), Rake (2560 / 640), interleaver (640 / 640), and Turbo decoder (300 / 100), separated by memory nodes. Child graphs: the Turbo decoder (node1 38400 / 9600, node2 9600 / 3200) and the rake receiver (searcher, four descrambler/despreader pairs, combiner, with edge rates of 128 and 32).]
Coarse-grained System Compilation
Three major tasks
  Resource allocation (processor, memory and DMA)
  Kernel execution ordering
  Kernel execution timing
Static or dynamic?
  Static: compiler
    Less flexible, more efficient
  Dynamic: run-time scheduler or OS
    More flexible, less efficient
For SDR applications
  Resource allocation: static
  Kernel execution ordering: static
  Kernel execution timing: dynamic
Software Pipelining Streaming Kernels
Problem with coarse-grained compilation
  Requires kernel-level parallelism to utilize the PEs
  SDR protocols do not have many data-independent kernels
Compiler optimization: coarse-grained software pipelining
  Stream computation: pipeline parallelism
  Modulo scheduling
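[Figure: FIR, Rake, and Turbo kernels processing the stream in[0..N] on PE1, PE2, and PE3. Without pipelining the three kernels run one after another; in the software-pipelined schedule FIR, Rake, and Turbo overlap across the successive inputs in[i], in[i+1], and in[i+2].]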
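The overlap that coarse-grained software pipelining produces can be written down as a small index calculation. The sketch below assumes one kernel per PE, equal-length time steps, and stages numbered from 0 (for example FIR = 0, Rake = 1, Turbo = 2); the function names are hypothetical. At time step `t`, stage `s` works on input `t - s`, so in steady state all stages are busy on different inputs.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Index of the input that `stage` processes at time step `step`,
 * or -1 if that stage is idle (pipeline filling or draining). */
long pipeline_input_index(size_t step, size_t stage, size_t num_inputs)
{
    assert(num_inputs <= (size_t)LONG_MAX);
    if (step < stage) {
        return -1;  /* prologue: this stage has not received data yet */
    }
    size_t idx = step - stage;
    if (idx >= num_inputs) {
        return -1;  /* epilogue: this stage has already finished */
    }
    return (long)idx;
}

/* Number of time steps needed to push all inputs through all stages. */
size_t pipeline_total_steps(size_t num_inputs, size_t num_stages)
{
    if (num_inputs == 0 || num_stages == 0) {
        return 0;
    }
    return num_inputs + num_stages - 1;
}
```

With three stages and ten inputs, step 2 has stage 0 on input 2, stage 1 on input 1, and stage 2 on input 0, and the whole stream completes in 12 steps rather than the 30 that running the stages back to back for every input would take.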
Coarse-grained System Compilation
Input: hierarchical graph
Step 1: Dataflow rate matching
Step 2: Stream size selection
Step 3: Modulo scheduling
Step 4: Hierarchical compilation
[Figure: the four steps applied to the rake receiver subgraph (two descrambler/despreader pairs and a combiner). Dataflow rate matching changes the edge rates from 32, 4, and 1 to 32 and 8; stream size selection scales them to 128 and 32; modulo compilation schedules the kernels on PE1 and PE2 with DMA transfers between GMEM and the PEs, repeating every II1; hierarchical scheduling places the collapsed nodes, FIR1, FIR2, searcher, and combiner on PE1 to PE4 with DMA1 to DMA3, repeating every II2.]
Coarse-grained System Compilation
Step 1: Dataflow rate matching
Producer and consumer pair must have the same rates
  Edges are memory buffers
Well studied with many existing algorithms
  Single appearance schedule
[Figure: dataflow rate matching on the rake receiver subgraph (two descrambler/despreader pairs and a combiner). Before: edge rates of 32, 4, and 1. After: edge rates of 32 and 8.]
Coarse-grained System Compilation
Step 2: Stream size selection
Pick optimal input/output buffer size
  Multiple of the base rate
Binary search algorithm
  Modulo schedule each candidate buffer size
[Figure: stream size selection on the rake receiver subgraph. Before: edge rates of 32 and 8. After: edge rates of 128 and 32.]
Rate = 1, streaming N elements
  Case 1: N iterations (loop N: DMA_in 1, kernel(1), DMA_out 1)
    Too much DMA overhead
  Case 2: 1 iteration (DMA_in N, kernel(N), DMA_out N)
    Cannot software pipeline
  Case 3: N/M iterations (loop N/M: DMA_in M, kernel(M), DMA_out M)
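The search over stream sizes can be sketched as a binary search over multiples of the base rate. This is a simplified stand-in rather than the compiler's actual procedure: the slides modulo schedule each candidate, whereas here a caller-supplied predicate `fits_fn` decides whether a candidate is acceptable. The sketch also assumes that feasibility is monotone (if a size fits, every smaller size fits) and that the largest feasible size is the one wanted. All names, including `select_stream_size`, `scratchpad_ctx`, and `fits_scratchpad`, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if a stream of `stream_size` elements is acceptable.
 * Stand-in for "modulo schedule this candidate and check the result". */
typedef int (*fits_fn)(size_t stream_size, void *ctx);

/* Largest multiple of base_rate, up to max_multiple * base_rate, that
 * satisfies `fits`. Returns 0 if no multiple is feasible. */
size_t select_stream_size(size_t base_rate, size_t max_multiple,
                          fits_fn fits, void *ctx)
{
    assert(fits != NULL && base_rate > 0);
    size_t lo = 1, hi = max_multiple, best = 0;
    while (lo <= hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (fits(mid * base_rate, ctx)) {
            best = mid;    /* feasible: try a larger stream */
            lo = mid + 1;
        } else {
            hi = mid - 1;  /* infeasible: try a smaller stream (mid >= 1) */
        }
    }
    return best * base_rate;
}

/* Example predicate: one input and one output buffer of stream_size
 * elements each must fit in the PE's local memory. */
typedef struct {
    size_t capacity_elems;
} scratchpad_ctx;

int fits_scratchpad(size_t stream_size, void *ctx)
{
    const scratchpad_ctx *c = ctx;
    return 2 * stream_size <= c->capacity_elems;
}
```

With a base rate of 32 and a made-up capacity of 300 elements, the search settles on 128: large enough to amortize DMA setup over many elements (avoiding Case 1), while still leaving several iterations to pipeline when N is much larger than 128 (avoiding Case 2).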
Coarse-grained System Compilation
Step 3: Function-level modulo scheduling
II selection (initiation interval)
  Interval between the start of successive iterations
  MinII = Max(ResMII, RecMII)
    ResMII: total latency of all nodes divided by # of PEs
    RecMII: maximum latency of feedback paths
Constraint-based modulo scheduling
  SMT-based algorithm
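The lower bound on the initiation interval can be computed directly from the definitions above. In this sketch the function names `res_mii` and `min_ii` are hypothetical, latencies are in arbitrary time units, the division is rounded up (an assumption, since a fractional interval cannot be scheduled), and the recurrence bound is passed in rather than derived from the graph.

```c
#include <assert.h>
#include <stddef.h>

/* Resource-constrained bound: total kernel latency spread over the PEs,
 * rounded up. */
unsigned long res_mii(const unsigned long *latency, size_t n,
                      unsigned long num_pes)
{
    assert(num_pes > 0);
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += latency[i];
    }
    return (total + num_pes - 1) / num_pes;
}

/* MinII = max(ResMII, RecMII). */
unsigned long min_ii(unsigned long res, unsigned long rec)
{
    return res > rec ? res : rec;
}
```

For example, four kernels with latencies 100, 200, 300, and 400 on two PEs give a ResMII of 500; a feedback path of latency 600 would raise MinII to 600. The scheduler would start its search at MinII and increase II only if no valid schedule exists at that value.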
[Figure: modulo compilation of the rake receiver subgraph (edge rates 128 and 32). Descramblers and despreaders run on PE1 and PE2, followed by the combiner; DMA1 carries the transfers GMEM to PE1, GMEM to PE2, PE2 to PE1, and PE1 to GMEM; the schedule repeats every II1.]
SMT-based Modulo Scheduling
Using the Satisfiability Modulo Theories (SMT) solver Yices
  Input: a set of constraints expressed as equations
  Output: a set of conditions under which the constraints evaluate to true
Constraints
  Throughput constraints
    i.e. total execution time must be less than or equal to II
  Memory constraints
    i.e. buffer size less than PE’s scratchpad memories
  Communication constraints
    i.e. DMA added for communicating kernels on different PEs
Symbols in the constraint equations
  status of kernel v_i assigned to processor j (1 or 0)
  number of kernels
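To make the throughput and memory constraints concrete, the sketch below checks one candidate kernel-to-PE assignment against them. It is a plain feasibility checker, not an SMT encoding and not a call into Yices: a solver would search over assignments, while this code only validates a given one. Communication (DMA) constraints are left out, the memory check uses less-than-or-equal, and `kernel_info`, `assignment_is_feasible`, and `MAX_PES` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PES 16  /* hypothetical upper bound on the number of PEs */

typedef struct {
    unsigned long latency;      /* execution time of the kernel */
    unsigned long buffer_size;  /* local memory needed for its buffers */
} kernel_info;

/* pe_of[i] is the PE that kernel i is assigned to.
 * Returns 1 if, on every PE, the summed latency is at most `ii` and the
 * summed buffer size is at most `pe_mem`; otherwise returns 0. */
int assignment_is_feasible(const kernel_info *k, const size_t *pe_of,
                           size_t n, size_t num_pes,
                           unsigned long ii, unsigned long pe_mem)
{
    unsigned long busy[MAX_PES] = {0};
    unsigned long mem[MAX_PES] = {0};

    assert(k != NULL && pe_of != NULL);
    if (num_pes == 0 || num_pes > MAX_PES) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        size_t j = pe_of[i];
        if (j >= num_pes) {
            return 0;  /* assignment refers to a PE that does not exist */
        }
        busy[j] += k[i].latency;
        mem[j] += k[i].buffer_size;
    }
    for (size_t j = 0; j < num_pes; j++) {
        if (busy[j] > ii || mem[j] > pe_mem) {
            return 0;  /* throughput or memory constraint violated */
        }
    }
    return 1;
}
```

The `pe_of` array is a compact form of the 0/1 assignment variables described above: kernel i is assigned to processor j exactly when `pe_of[i] == j`.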
Coarse-grained System Compilation
[Figure: hierarchical scheduling. The rake receiver child graph (descramblers, despreaders, combiner; edge rates 128 and 32) is modulo scheduled on PE1 and PE2 with initiation interval II1. At the level above, Rake (2560 / 640) and LPF-Rx (2560 / 2560) appear as single nodes, and the kernels (2 descramblers, 2 despreaders, FIR1, FIR2, searcher, combiner) are placed on PE1 to PE4 with transfers on DMA1 to DMA3 and initiation interval II2.]
Step 4: Hierarchical scheduling
Bottom up scheduling
Treat each child graph as a single node
Memory nodes assigned to global memory
Conclusion
Compilation support for SDR is essential
2-tiered compilation process
  System compilation
  DSP compilation
System compilation is function-level scheduling
  Hierarchical dataflow IR
    ~4x saving in memory buffer allocation
  SMT-based modulo scheduling
    Linear speedup up to 8 PEs
    Resulting in ~23% faster schedules than greedy
Questions