
Hierarchical Coarse-grained Stream Compilation for Software Defined Radio

Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge

Advanced Computer Architecture Laboratory

University of Michigan at Ann Arbor


Software Defined Radio

Use software routines instead of ASICs for the physical-layer operations of wireless communication systems

Advantages:

Multi-mode operation

Lower costs:

Faster time to market

Prototyping and bug fixes

Chip volumes

Longevity of platforms

Enables future wireless communication innovations:

Complexity favors software-based solutions

[Figure: wireless protocols served by SDR: UWB, EDGE, 802.16a, Bluetooth, 802.11b, WCDMA, 802.11n]



Case Study: W-CDMA

Key software characteristics:

Multiple kernels connected together as a system

Streaming computation

Vector-based inter-kernel communications

Mostly static computation patterns

[Figure: System: 2Mbps W-CDMA protocol diagram, between the analog frontend and the upper layers. Transmitter kernels: Scrambler, Spreader, Interleaver, Turbo Encoder, LPF-Tx. Receiver kernels: LPF-Rx, Searcher, Descrambler and Despreader (two paths), Combiner, Deinterleaver, Turbo Decoder. Stage groups: filtering, modulation, channel estimation, error correction.]

[Figure: system architecture: an ARM processor and global memory connected to four PEs, each with local memory and an execution unit]

SODA: A SDR DSP Architecture (ISCA 06)

Control-data decoupled multi-core architecture

1 ARM general purpose control processor:

Scalar algorithms and protocol controls

4 data processing elements:

SIMD+Scalar units

Used for high-throughput DSP algorithms

[Figure: four PEs, each with local memory and an execution unit]


SODA Execution Model

Software managed scratchpad memories:

Each PE can only access its local memory

DMA operations:

Access global memory

Inter-PE communications

Algorithms statically mapped onto PEs:

RPCs from the ARM control processor



Compilation Challenges for SDR

Compilation support for SDR is essential:

Flexibility

Lower development cost

More complex protocols

Compilation support for SDR is challenging:

Heterogeneous multiprocessor hardware

ARM + DSPs

Two level scratchpad memories

Multiple software constraints

Throughput + code & data size + real-time execution + others



2-Tier Compilation Process

[Figure: the two compilation tiers. The 2Mbps W-CDMA protocol diagram is mapped onto the four PEs (each with local memory and an execution unit), and each kernel is compiled for a single SODA PE. SODA PE components: 1. SIMD pipeline (512-bit SIMD register file, 512-bit SIMD ALU+Mult, SIMD shuffle network (SSN)); 2. Scalar pipeline (scalar RF, scalar ALU); 3. Local memory (local SIMD memory, local scalar memory); 4. AGU pipeline (AGU RF, AGU ALU); 5. DMA to the system bus. Also shown: STV and VTS units, the SIMD-to-scalar (VtoS) ALU, and predicate registers.]

Multiprocessor system compilation

DSP kernel compilation

This study is focused on system compilation

Kernel compilation is treated as a black box:

Existing libraries

SIMD compilers

Objective:

Kernel-to-PE assignments

Memory allocations

Subject to:

Throughput constraints

Memory constraints

void Turbo_decoder(int* in, int* out) {
    ...
    for (iter = 0; iter < niter; iter++) {
        descramble(L_a, L_e, alpha);
        component_decoder(L_all, g, L_a, 1);

        /* scale each element by 7/10 using integer arithmetic */
        for (i = 0; i < FRAME_SIZE; i++) {
            L_e[i] = L_all[i] * 7 / 10;
        }
    }
    ...
}



System Compilation Outline

SPIR – Function level IR:

Traditional IR is not adequate

Complex inter-function interactions

Backend compilation:

Scheduling functions instead of instructions

Function-level modulo scheduling

[Figure: compilation flow. Inputs: C++ with SPEX (through the SPEX frontend) and Matlab with Simulink (through the Matlab frontend). Both frontends produce SPIR; the example SPIR graph is the Rake receiver with LPF-Rx, searcher, descrambler, despreader, and combiner nodes. The SPIR backend emits C code for the control processor and C code for each PE.]



SPIR Overview

Dataflow programming model:

Graph consists of nodes and edges

Two types of nodes:

Kernel (yellow) nodes for modeling functions

Memory (blue) nodes for modeling vector buffers

Buffer stream description + vector stream description

Dataflow edges:

Synchronous dataflow (in the scope of this paper)

[Figure: flat SPIR graph of the W-CDMA receiver: LPF-Rx, delay buffer, Rake receiver (searcher, descrambler, despreader, combiner), interleaver, and Turbo decoder, with per-edge stream rates from 1 up to 9600 (Turbo decoder: 9600 in, 3200 out)]



SPIR Overview

[Figure: the flat W-CDMA dataflow graph again, annotated with its unmatched per-edge stream rates (1, 4, 32, 320, 640, 9600, 3200)]

Problems with flat dataflow graph representations:

Matched to the highest rate

SDR kernels have very different stream rates

Turbo decoder: input rate = 9600; output rate = 3200

LPF: input rate = 1; output rate = 1



SPIR Overview

[Figure: flat W-CDMA dataflow graph after rate matching: LPF-Rx and Rake receiver edges at 38.4K, interleaver and Turbo decoder input edges at 9600, Turbo decoder output at 3200]

Problems with flat dataflow graph representations:

All kernels must match the Turbo decoder's rate of 9600

Minimum LPF rate: input = 38.4K, output = 38.4K

Stream rates translate to memory buffers

Unnecessarily large memory buffers



SPIR Overview

Hierarchical dataflow graphs:

Different hierarchy levels have different streaming rates

Streaming vectors are modeled as hierarchical communications

Top level: buffer queue descriptions

Bottom level: vector streaming descriptions

[Figure: hierarchical W-CDMA graph. The Turbo decoder node (labeled 300 and 100) contains a child graph with node1 and node2 (labeled 38400, 9600, 3200); the LPF-Rx and Rake receiver edges are at 2.56K and the interleaver edges at 640]



SPIR Overview

W-CDMA Modeled with 3-level hierarchy in SPIR

Memory nodes are inserted between nodes with child graphs

4x decrease in memory buffer usage

[Figure: 3-level SPIR graph of W-CDMA. Top level: LPF-Rx (2560), Rake (2560 in, 640 out), interleaver (640), Turbo decoder (labeled 300 and 100). Child graphs: the Rake receiver (LPF-Rx, searcher, descrambler, despreader, combiner, with edge rates of 256, 128, 32, and 320) and the Turbo decoder (node1, node2)]



Coarse-grained System Compilation

Three major tasks:

Resource allocation (processor, memory and DMA)

Kernel execution ordering

Kernel execution timing

Static or dynamic?

Static – compiler

Less flexible, more efficient

Dynamic – run-time scheduler or OS

More flexible, less efficient

For SDR applications:

Resource allocation: static

Kernel execution ordering: static

Kernel execution timing: dynamic



Software Pipelining Streaming Kernels

Problem with coarse-grained compilation:

Requires kernel-level parallelism to utilize the PEs

SDR protocols do not have many data-independent kernels

Compiler optimization: coarse-grained software pipelining

Stream computation: pipeline parallelism

Modulo scheduling

[Figure: FIR, Rake, and Turbo kernels processing in[0..N], shown first unpipelined and then software-pipelined across PE1, PE2, and PE3 so that the three kernels overlap on inputs in[i], in[i+1], and in[i+2]]
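The pipeline parallelism described above can be illustrated with a small timing model. This is a sketch only, assuming one kernel (stage) per PE and that every stage takes one equal-length step; the function names `pipelined_input`, `pipelined_steps`, and `sequential_steps` are hypothetical and do not come from the slides.

```c
#include <assert.h>

/* With one stage per PE, stage s works on input (step - s) during a given
 * step. Returns that input index, or -1 when the stage is idle because the
 * pipeline is still filling or already draining. */
long pipelined_input(unsigned step, unsigned stage, unsigned n_inputs) {
    if (step < stage) {
        return -1;                      /* pipeline still filling */
    }
    unsigned idx = step - stage;
    if (idx >= n_inputs) {
        return -1;                      /* pipeline draining */
    }
    return (long)idx;
}

/* Steps needed to push n_inputs through n_stages when pipelined. */
unsigned pipelined_steps(unsigned n_stages, unsigned n_inputs) {
    assert(n_stages > 0);
    return n_inputs == 0 ? 0 : n_inputs + n_stages - 1;
}

/* Steps needed when each input finishes every stage before the next starts. */
unsigned sequential_steps(unsigned n_stages, unsigned n_inputs) {
    return n_stages * n_inputs;
}
```

Under these assumptions, three stages (FIR, Rake, Turbo) and 10 inputs take 12 steps pipelined versus 30 steps sequentially, and in step 2 the stages work on inputs 2, 1, and 0 respectively.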




Input: Hierarchical graph

Step 1: Dataflow rate matching

Step 2: Stream size selection

Step 3: Modulo scheduling

Step 4: Hierarchical compilation

[Figure: the four steps applied to the Rake receiver child graph (descrambler, despreader, combiner). Dataflow rate matching and stream size selection rescale the edge rates (for example 32 and 8 become 128 and 32); modulo compilation maps the kernels onto PE1 and PE2 with DMA1 transfers (GMEM to PE1, GMEM to PE2, PE2 to PE1, PE1 to GMEM) at interval II1; hierarchical scheduling maps the parent graph (FIR1, FIR2, searcher, combiner, 2 descramblers, 2 despreaders) onto PE1 through PE4 with DMA1 through DMA3 at interval II2]




Step 1: Dataflow rate matching

Each producer and consumer pair must have the same rates:

Edges are memory buffers

Well studied with many existing algorithms:

Single appearance schedule
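The balance condition behind rate matching can be sketched for a single producer-consumer edge of a synchronous dataflow graph. This is a minimal illustration, not the SPIR implementation; `match_rates` and `matched_buffer_size` are hypothetical names, and a full graph would need the same condition solved across all edges at once.

```c
#include <assert.h>

/* Greatest common divisor (Euclid's algorithm). */
static unsigned gcd_u(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Smallest repetition counts for one edge.
 * produce: tokens written per producer firing
 * consume: tokens read per consumer firing
 * On return, *prod_reps * produce == *cons_reps * consume. */
void match_rates(unsigned produce, unsigned consume,
                 unsigned *prod_reps, unsigned *cons_reps) {
    assert(produce > 0 && consume > 0);
    unsigned g = gcd_u(produce, consume);
    *prod_reps = consume / g;
    *cons_reps = produce / g;
}

/* Tokens crossing the edge per matched iteration, i.e. the buffer the edge
 * needs if the producer finishes all its firings before the consumer starts. */
unsigned matched_buffer_size(unsigned produce, unsigned consume) {
    unsigned p, c;
    match_rates(produce, consume, &p, &c);
    return p * produce;
}
```

For a producer that writes 32 tokens and a consumer that reads 4, `match_rates` gives 1 producer firing and 8 consumer firings, and the edge buffer holds 32 tokens.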

[Figure: dataflow rate matching on the combiner subgraph: edge rates of 32, 4, and 1 before matching become 32 and 8 after matching]




Step 2: Stream size selection

Pick optimal input/output buffer size:

Multiple of the base rate

Binary search algorithm:

Modulo schedule each candidate buffer size

[Figure: stream size selection on the combiner subgraph: edge rates of 32 and 8 are scaled to 128 and 32]

Rate = 1, streaming N elements:

Case 1: N iterations (loop N times: DMA in 1, kernel(1), DMA out 1)

Too much DMA overhead

Case 2: 1 iteration (DMA in N, kernel(N), DMA out N)

Cannot software pipeline

Case 3: N/M iterations (loop N/M times: DMA in M, kernel(M), DMA out M)
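The trade-off between the three cases can be sketched as a binary search over multiples of the base rate. The slides evaluate each candidate by modulo scheduling it; the sketch below substitutes a made-up DMA cost model (a fixed setup cost per transfer plus a per-element cost), so `chunk_is_feasible`, `select_stream_multiplier`, and all parameters are assumptions for illustration.

```c
#include <assert.h>

/* Hypothetical cost model: returns 1 if streaming m elements per DMA
 * transfer keeps the amortized cost within `budget` cycles per element,
 * i.e. (setup + m * per_elem) / m <= budget, written without division. */
static int chunk_is_feasible(unsigned m, unsigned setup,
                             unsigned per_elem, unsigned budget) {
    unsigned long long cost = (unsigned long long)setup
                            + (unsigned long long)m * per_elem;
    return cost <= (unsigned long long)m * budget;
}

/* Binary search for the smallest multiplier k in [1, max_k] such that a
 * stream size of k * base_rate is feasible; returns 0 if none is.
 * Relies on feasibility being monotone in k, which holds for this model. */
unsigned select_stream_multiplier(unsigned base_rate, unsigned max_k,
                                  unsigned setup, unsigned per_elem,
                                  unsigned budget) {
    assert(base_rate > 0 && max_k > 0);
    unsigned lo = 1, hi = max_k, best = 0;
    while (lo <= hi) {
        unsigned mid = lo + (hi - lo) / 2;
        if (chunk_is_feasible(mid * base_rate, setup, per_elem, budget)) {
            best = mid;        /* feasible: try a smaller buffer */
            hi = mid - 1;
        } else {
            lo = mid + 1;      /* infeasible: amortize over a larger buffer */
        }
    }
    return best;
}
```

With a base rate of 32, a setup cost of 100 cycles, 1 cycle per element, and a budget of 2 cycles per element, this model selects a multiplier of 4 (a stream size of 128). Picking the smallest feasible size keeps the scratchpad buffer small while still amortizing the DMA overhead.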




Step 3: Function-level modulo scheduling

II selection (Initiation Interval):

Interval between the start of successive iterations

MinII = Max(ResMII, RecMII)

ResMII: total latency of all nodes divided by # of PEs

RecMII: maximum latency of feedback paths

Constraint-based modulo scheduling:

SMT-based algorithm
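The MinII bound stated above can be written out directly. This is a sketch under two assumptions that the slide does not spell out: ResMII is rounded up to a whole number of cycles, and RecMII is 0 when the graph has no feedback paths. The function names are hypothetical.

```c
#include <assert.h>

/* ResMII: total latency of all nodes divided by the number of PEs,
 * rounded up (the rounding is an assumption). */
unsigned res_mii(const unsigned *latency, unsigned n_nodes, unsigned n_pes) {
    assert(n_pes > 0);
    unsigned long long total = 0;
    for (unsigned i = 0; i < n_nodes; i++) {
        total += latency[i];
    }
    return (unsigned)((total + n_pes - 1) / n_pes);
}

/* RecMII: maximum latency over the feedback paths; 0 if there are none. */
unsigned rec_mii(const unsigned *path_latency, unsigned n_paths) {
    unsigned worst = 0;
    for (unsigned i = 0; i < n_paths; i++) {
        if (path_latency[i] > worst) {
            worst = path_latency[i];
        }
    }
    return worst;
}

/* MinII = Max(ResMII, RecMII): the lower bound where the II search starts. */
unsigned min_ii(const unsigned *latency, unsigned n_nodes, unsigned n_pes,
                const unsigned *path_latency, unsigned n_paths) {
    unsigned res = res_mii(latency, n_nodes, n_pes);
    unsigned rec = rec_mii(path_latency, n_paths);
    return res > rec ? res : rec;
}
```

For three kernels with latencies 100, 200, and 300 cycles on 2 PEs, ResMII is 300; a single feedback path of 350 cycles raises MinII to 350.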

[Figure: modulo compilation of the combiner subgraph onto PE1 and PE2, with DMA1 transfers GMEM to PE1, GMEM to PE2, PE2 to PE1, and PE1 to GMEM]



SMT-based Modulo Scheduling

Using the Satisfiability Modulo Theories (SMT) solver Yices:

Input: a set of constraints expressed as equations

Output: a set of conditions where the constraints evaluate to true

Constraints:

Throughput constraints

i.e. total execution time must be less than or equal to II

Memory constraints

i.e. buffer size must be less than the PE's scratchpad memory

Communication constraints

i.e. DMA transfers are added for communicating kernels on different PEs

[Equation not recovered. Symbol legend: status of kernel v_i assigned to processor j (1 or 0); number of kernels]
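The throughput and memory constraints can be made concrete with a checker that tests one candidate kernel-to-PE assignment. This is not a solver and does not use the Yices API; it only shows what a satisfying assignment has to respect. The `Kernel` struct, the per-PE reading of the throughput constraint, and treating a buffer total equal to the scratchpad size as fitting are all assumptions, and the communication (DMA) constraints are left out.

```c
#include <assert.h>

/* Hypothetical kernel description. */
typedef struct {
    unsigned latency;       /* execution time in cycles */
    unsigned buffer_bytes;  /* scratchpad space its buffers need */
} Kernel;

/* assign[i] is the PE index of kernel i, which is equivalent to a 0/1
 * status variable per (kernel, PE) pair that is 1 only for PE assign[i].
 * Returns 1 if every PE meets both constraints, otherwise 0. */
int assignment_is_feasible(const Kernel *k, const unsigned *assign,
                           unsigned n_kernels, unsigned n_pes,
                           unsigned ii, unsigned scratchpad_bytes) {
    for (unsigned i = 0; i < n_kernels; i++) {
        assert(assign[i] < n_pes);
    }
    for (unsigned j = 0; j < n_pes; j++) {
        unsigned long long busy = 0, mem = 0;
        for (unsigned i = 0; i < n_kernels; i++) {
            if (assign[i] == j) {
                busy += k[i].latency;
                mem += k[i].buffer_bytes;
            }
        }
        if (busy > ii) {
            return 0;   /* throughput: work on this PE exceeds II */
        }
        if (mem > scratchpad_bytes) {
            return 0;   /* memory: buffers do not fit in the scratchpad */
        }
    }
    return 1;
}
```

An SMT formulation expresses the same conditions symbolically over the assignment variables and lets the solver search for values, rather than testing one assignment at a time.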

Step 4: Hierarchical scheduling

Bottom up scheduling

Treat each child graph as a single node

Memory nodes assigned to global memory

[Figure: hierarchical scheduling. The child-graph schedule (combiner subgraph on PE1 and PE2 with DMA1 transfers) is treated as one node in the parent graph (LPF-Rx 2560, Rake 2560 in and 640 out), which is scheduled onto PE1 through PE4 (FIR1, FIR2, searcher, combiner, 2 descramblers, 2 despreaders) with DMA1 through DMA3 at interval II2]
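The bottom-up traversal can be sketched as a recursion that reduces each child graph to a single latency before its parent is scheduled. The slides do not give the scheduler's internals, so the per-graph scheduler here is a stand-in (a resource-bound estimate, the sum of child latencies divided by the PE count and rounded up); the `Node` struct and `effective_latency` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hierarchical graph node. */
typedef struct Node {
    unsigned latency;             /* used only when the node is a leaf kernel */
    const struct Node *children;  /* child graph, or NULL for a leaf */
    size_t n_children;
} Node;

/* Bottom-up reduction: every child graph is collapsed to one latency
 * before the parent level sees it as a single node. The division by n_pes
 * is a placeholder for a real per-graph modulo scheduler. */
unsigned effective_latency(const Node *n, unsigned n_pes) {
    assert(n != NULL && n_pes > 0);
    if (n->n_children == 0) {
        return n->latency;        /* leaf kernel: latency is given */
    }
    unsigned long long total = 0;
    for (size_t i = 0; i < n->n_children; i++) {
        total += effective_latency(&n->children[i], n_pes);
    }
    return (unsigned)((total + n_pes - 1) / n_pes);
}
```

The recursion visits the deepest child graphs first, so a node such as the Rake receiver is already a single number by the time the top-level graph is processed.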



Conclusion

Compilation support for SDR is essential

2-tiered compilation process:

System compilation

DSP compilation

System compilation is function-level scheduling:

Hierarchical dataflow IR

~4x saving in memory buffer allocation

SMT-based modulo scheduling

Linear speedup up to 8 PEs

Resulting in ~23% faster schedules than greedy



Questions
