a system solution for high- performance, low power sdr yuan lin 1, hyunseok lee 1, yoav harel 1,...
Post on 20-Dec-2015
217 Views
Preview:
TRANSCRIPT
A System Solution for High- Performance, Low Power SDR
Yuan Lin1, Hyunseok Lee1, Yoav Harel1, Mark Woh1,
Scott Mahlke1, Trevor Mudge1 and Krisztian Flautner2
1Advanced Computer Architecture Laboratory
University of Michigan2ARM, Ltd.
Advanced Computer Architecture LaboratoryUniversity of Michigan
2
SDR Design Challenges:
Hardware design challenges High computational throughput (~40 Gops) Low power consumption (~200mW) Meet real-time requirements
DSP programming support System-level development
Inter-algorithm communication Algorithm-level development
Efficient DSP representations
SDR Benchmark Design & Analysis
Advanced Computer Architecture LaboratoryUniversity of Michigan
4
W-CDMA Protocol: 2Mbps
Frontend
LPF-Tx scrambler spreader InterleaverChannelencoder
LPF-Rx
searcher
descrambler despreader combiner
descrambler despreader
...
modulator
demodulator
deinteleaverChanneldecoder
(turbo/viterbi)
Upper layersTransmitter
Receiver
Advanced Computer Architecture LaboratoryUniversity of Michigan
5
W-CDMA Characteristics
Plenty of vector parallelism 8 & 16-bit DSP algorithms Multiplication is not dominant No floating-point operation Small instruction/data memory Has periodic real-time tasks
Advanced Computer Architecture LaboratoryUniversity of Michigan
6
802.11a Protocol: 24Mbps
Frontend
LPF-TxQAM
MappingInterleaver
Channelencoder
LPF-Rx
FFT
Frequency, TimeSynchronization
modulator
demodulator
deinteleaverChanneldecoder(viterbi)
Upper layersTransmitter
Receiver
IFFTPreambleInsertion
ChannelEstimation
FrequencyEqualization
QAMdemapping
Advanced Computer Architecture LaboratoryUniversity of Michigan
7
802.11a Characteristics
Similar to W-CDMA Plenty of vector parallelism No floating-point operation Small instruction/data memory
Different from W-CDMA Mostly 16-bit DSP algorithms Multiplication is more dominant No periodic real-time tasks
SDR Processor Architecture Design
Advanced Computer Architecture LaboratoryUniversity of Michigan
9
System Architecture Design Tradeoffs
Fine grainprocessors
Inter-processorNetwork
PE
Coarse grainprocessors
Uniprocessoralu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
PEalu alu alu alu
PEalu alu alu alu
PEalu alu alu alu
PEalu alu alu alu
Inter-processorNetwork
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
Amortized fetch
Intra-processorCommunication
Inter-processor Communication
Advanced Computer Architecture LaboratoryUniversity of Michigan
10
System Architecture Design Tradeoffs
128x1
64x2
32x4
16x8
8x16
4x32 2x64 1x1281x256
(50% Utilization)
0
50
100
150
200
250
300
0 50 100 150 200 250 300
SIMD width (arithmetic units)
MO
PS
/mW
Intra-processor shuffle networkcritical delay path
ALU dominated critical delay path
critical delay path
Number of Processing Elements x SIMD width For W-CDMA 2Mbps (51.2GOP/sec) 90nm 1V @400MHz
Fine grain processors
Inter-processorNetwork
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PEalu
PE
Uniprocessor
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
alu
Inter-processorNetwork
Coarse grain processorsPE
alu alu alu alu
PEalu alu alu alu
PEalu alu alu alu
PEalu alu alu alu
Advanced Computer Architecture LaboratoryUniversity of Michigan
12
System Architecture Design
Interconnect
GlobalMemory
SIMDMemory
SIMD ALU
SIMD RF
ScalarMemory
ScalarALU
ScalarRF
SoCInterface
ScalarPipeline
SIMDPipeline
SoCInterface
GlobalMemory
SoCInterface
MEM MEM
Memory
ALU
SoCInterface
Controller
PE
SIMDMemory
SIMD ALU
SIMD RF
ScalarMemory
ScalarALU
ScalarRF
SoCInterface
ScalarPipeline
SIMDPipeline
PE
SIMDMemory
SIMD ALU
SIMD RF
ScalarMemory
ScalarALU
ScalarRF
SoCInterface
ScalarPipeline
SIMDPipeline
PE
Advanced Computer Architecture LaboratoryUniversity of Michigan
13
PE Design (Area < 1mm2 Power<50mW)
SIMDMemory
4 KB
ScalarMemory
4KB
16x8bitRegFile
16x8bitRegFile
16x8bitRegFile
16x8bitRegFile
DataShuffle
Network
8bit ALU
8bit ALU
8bit ALU
8bit ALU
16x16bitRegFile
16bit ALU
DMA
SIMD Unit
Scalar Unit
PE
8bit
8bit
8bit
8bit
8bit
8bit
8bit
8bit
8bit
8bit
8bit
8bit
32x8bit
16bit 16bit 16bit
16bit
Advanced Computer Architecture LaboratoryUniversity of Michigan
14
Mapping DSP Algorithms: Filters
Z-1 Z-1 Z-1
b0b1b2b3
In
Out
Out
Inb
z-1
z-1spread Vin, Sin
shift z, z, upmac z, Vin, Sin
Advanced Computer Architecture LaboratoryUniversity of Michigan
15
SIMDMemory
4 KB
ScalarMemory
4KB
16x8bitRegFile
16x8bitRegFile
16x8bitRegFile
16x8bitRegFile
DataShuffle
Network
16x16bitRegFile
DMA
SIMD Unit
Scalar Unit
PE
8bit
8bit
8bit
8bit
8bit
8bit
8bit
8bit
8bit
8bit
8bit
8bit
32x8bit
16bit 16bit 16bit
16bit
8bit ALU
8bit ALU
8bit ALU
8bit ALU
16bit ALU
Mapping DSP Algorithm: Filter
spread Vin, Sin
shift z, z, upmac z, Vin, Sin
Advanced Computer Architecture LaboratoryUniversity of Michigan
16
Efficient Design
Wide SIMD width Small register file with minimum ports Small memories Narrow system BUS Data-path optimized for 8bits Vector shuffle reduce memory ports
Advanced Computer Architecture LaboratoryUniversity of Michigan
18
2 LPF-Rx182 MopsMisc. Control
1 Mops
Searcher200 Mops
Deinterleaver16 Mops
PowerControl15 Kops
PN CodeTX/RX
23 Mops
Turbo Decoder324 Mops
Buffer(1360 Bytes)
Buffer(1280 Bytes) Buffer
(2560 Bytes)FIFO Queue(12.5 KBytes)
Buffer(10 Bytes)
Buffer(20 KBytes)
Buffer(20 KBytes)
Buffer(1024 Bytes)
ARM PE PE PE PEGlobal
Memory
Buffer(1024 Bytes)
WCDMA Receiver WCDMA Transmitter
4 LPF-Rx307 Mops
Scrambler8.6 Mops
Spreader4.7 Mops
Turbo Encoder2 Mops
Interleaver2 Mops
Descrambler22.5 Mops
Despreader11.3 Mops
Combiner3 Mops
Frontend
LPF-Tx scrambler spreader InterleaverChannelencoder
LPF-Rx
searcher
descrambler despreader combiner
descrambler despreader
...
modulator
demodulator
deinteleaverChanneldecoder
(turbo/viterbi)
Upper layersTransmitter
Receiver
Channeldecoder
(turbo/viterbi)
4 LPF-Rx307 Mops
Scrambler8.6 Mops
Spreader4.7 Mops
Turbo Encoder2 Mops
Interleaver2 Mops
Turbo Decoder324 Mops
Searcher200 Mops
2 LPF-Rx182 Mops
Descrambler22.5 Mops
Despreader11.3 Mops
Combiner3 Mops
PowerControl15 Kops
Deinterleaver16 Mops
PN CodeTX/RX
23 Mops
Misc. Control1 Mops
descrambler despreader combiner
descrambler despreader
...LPF-Rx deinteleaver
searcher
Channelencoder
InterleaverspreaderscramblerLPF-Tx
Buffer(1280 Bytes)
Buffer(1360 Bytes)
Buffer(2560 Bytes)
Buffer(1024 Bytes) FIFO Queue
(12.5 KBytes)
Buffer(20 KBytes)
Buffer(20 KBytes)Buffer
(10 Bytes)
Buffer(1024 Bytes)
Advanced Computer Architecture LaboratoryUniversity of Michigan
19
2 LPF-Rx182 MopsMisc. Control
1 Mops
Searcher200 Mops
Deinterleaver16 Mops
PowerControl15 Kops
PN CodeTX/RX
23 Mops
Turbo Decoder324 Mops
Buffer(1360 Bytes)
Buffer(1280 Bytes) Buffer
(2560 Bytes)FIFO Queue(12.5 KBytes)
Buffer(10 Bytes)
Buffer(20 KBytes)
Buffer(20 KBytes)
Buffer(1024 Bytes)
ARM PE PE PE PEGlobal
Memory
Buffer(1024 Bytes)
WCDMA Receiver WCDMA Transmitter
4 LPF-Rx307 Mops
Scrambler8.6 Mops
Spreader4.7 Mops
Turbo Encoder2 Mops
Interleaver2 Mops
Descrambler22.5 Mops
Despreader11.3 Mops
Combiner3 Mops
Frontend
LPF-Tx scrambler spreader InterleaverChannelencoder
LPF-Rx
searcher
descrambler despreader combiner
descrambler despreader
...
modulator
demodulator
deinteleaverChanneldecoder
(turbo/viterbi)
Upper layersTransmitter
Receiver
Channelencoder
InterleaverspreaderscramblerLPF-Tx
Channeldecoder
(turbo/viterbi)deinteleaver
searcher
descrambler despreader combiner
descrambler despreader
...LPF-Rx
4 LPF-Rx307 Mops
Scrambler8.6 Mops
Spreader4.7 Mops
Turbo Encoder2 Mops
Interleaver2 Mops
Turbo Decoder324 Mops
Searcher200 Mops
2 LPF-Rx182 Mops
Descrambler22.5 Mops
Despreader11.3 Mops
Combiner3 Mops
PowerControl15 Kops
Deinterleaver16 Mops
PN CodeTX/RX
23 Mops
Misc. Control1 Mops
MACFrontend A/D
MAC
Frontend D/A
Buffer(1280 Bytes)
Buffer(1360 Bytes)
Buffer(2560 Bytes)
Buffer(1024 Bytes) FIFO Queue
(12.5 KBytes)
Buffer(20 KBytes)
Buffer(20 KBytes)Buffer
(10 Bytes)
Buffer(1024 Bytes)
Advanced Computer Architecture LaboratoryUniversity of Michigan
20
802.11a PE Mapping
FIR (Rx)320 Mops
Misc. Control80 Mops
FFT120 Mops
Deinterleaver60 Mops
Buffer(2048 Bytes)
Buffer(2048 Bytes) Buffer
(30 KBytes)
Buffer(30 KBytes)
ARM PE PE PE PEGlobal
Memory
Interleaver60 Mops
Freq. Eq.120 Mops
QAM Demod2 Mops
FIR (Tx)320 Mops
Buffer(2048 Bytes)
Channel Enc.20 Mops
Buffer(22048 Bytes)
IFFT120 Mops
Buffer(1024 Bytes)
QAM Mod2 Mops
Viterbi Dec.398 Mops
Buffer(2048 Bytes)
Preamble Ins.50 Mops
Buffer(30 KBytes)
Buffer(30 KBytes)
Interplator250 Mops
Buffer(2048 Bytes)
Sync.20 Mops
Buffer(1024 Bytes)
802.11a Receiver
802.11a Transmitter
Advanced Computer Architecture LaboratoryUniversity of Michigan
21
Power Results
0
50
100
150
200
250
300
350
400
PE Mem
ory
PE Reg
ister
File
PE ALU
PE Oth
ersARM
Glob
al M
emor
y
Inte
r-pro
cess
or Bus
Oth
ersTota
l
Po
wer
(m
W)
WCDMA
802.11a
Configuration 4 PEs, 1 ARM (Cortex M3) controller Global scratchpad memory (64Kb) 90nm (1V @ 400 MHZ), Synthesized conservatively
Advanced Computer Architecture LaboratoryUniversity of Michigan
22
Area Results
0
0.5
1
1.5
2
2.5
3
3.5
mm
^2 (9
0nm
)
SDR Programming Language Support
Advanced Computer Architecture LaboratoryUniversity of Michigan
24
Software Development Flow
Matlab-Simulink/CFloating Point
Algorithm Prototyping
TimingRequirements
C/SPEXFixed Point
Algorithm KernelImplementations
IO/ControlCode
C/SPEXSystem
Description
Timing/MachineIndependent
asm code
TimingRequirements
MemoryAccessPatterns
SystemDescriptionasm code
Controllerasmcode
PEasmcode
c
ProgrammerDefined
CompilerGenerated
Algorithm-levelDesign Flow
System-LevelDesign Flow
Advanced Computer Architecture LaboratoryUniversity of Michigan
25
SPEX (Signal Processing EXtension)
Implemented as a library extension to C System-level development
Support concurrent DSP kernel function definitions Channel variables for inter-kernel communications
Algorithm-level development Native vector & matrix variables Explicit DSP variable attribute definition Native vector & matrix operations
Advanced Computer Architecture LaboratoryUniversity of Michigan
26
SPEX Overview
SPEX
DSP variableextension
DSP kernelextension
Scalarobject
SIMDobject
DSP variableobjects
DSPvariable
attributes
DSP variable
operations
ModeBit-
widthType
Arith. &logicop.
Predi-cation
Permu-tation
Concurrentvariableobject
Concurrentvariable
operations
Kernelobject
Channelobject
Concurrentkernel
mangement
Channelcommunication
operations
Advanced Computer Architecture LaboratoryUniversity of Michigan
27
SPEX Example Code: Viterbi ACSConcurrent DSP kernel definitions
void* acs(void*) { /* variable declaration */ saturated char<64> metrics1, metrics2; saturated char<64> states; saturated char<64> t1, t2;
while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive();
/* add */ metrics1 += states; metrics2 += states;
/* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2;
/* sending data to TB */ acs_to_tb.send(states); }}
Advanced Computer Architecture LaboratoryUniversity of Michigan
28
SPEX Example Code: Viterbi ACS
Native SIMD variable definition with explicit attributes
SPEX variable supports1. saturated/overflow2. various variable bit-width3. vector & matrices
void* acs(void*) { /* variable declaration */ saturated char<64> metrics1, metrics2; saturated char<64> states; saturated char<64> t1, t2;
while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive();
/* add */ metrics1 += states; metrics2 += states;
/* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2;
/* sending data to TB */ acs_to_tb.send(states); }}
Advanced Computer Architecture LaboratoryUniversity of Michigan
29
SPEX Example Code: Viterbi ACS
Inter-kernel communication through channel operations
Channel types:1. FIFO queue2. Broadcast queue3. Sync/control channel4. Random-read FIFO queue
void* acs(void*) { /* variable declaration */ saturated char<64> metrics1, metrics2; saturated char<64> states; saturated char<64> t1, t2;
while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive();
/* add */ metrics1 += states; metrics2 += states;
/* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2;
/* sending data to TB */ acs_to_tb.send(states); }}
Advanced Computer Architecture LaboratoryUniversity of Michigan
30
SPEX Example Code: Viterbi ACS
SPEX vector operations Supports(Matlab-like C code)
1. SIMD arithmetic operations
2. SIMD permutation 3. SIMD predication
void* acs(void*) { /* variable declaration */ saturated char<64> metrics1, metrics2; saturated char<64> states; saturated char<64> t1, t2;
while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive();
/* add */ metrics1 += states; metrics2 += states;
/* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2;
/* sending data to TB */ acs_to_tb.send(states); }}
Advanced Computer Architecture LaboratoryUniversity of Michigan
31
Summary
- Hardware & software solutions for SDR- Hardware
- 4 dual-issue asymmetric SIMD processing elements- Consumes 200~300mW for 90nm- Meets the performance requirements for WCDMA &
802.11a
- Software- SPEX provides efficient DSP algorithm and system
implementation
Advanced Computer Architecture LaboratoryUniversity of Michigan
32
Questions?
top related