Overview of Implementation Issues for Multitier Networks on DSPs
Joseph R. Cavallaro
Electrical & Computer Engineering Dept., Rice University
August 17, 1999
Outline
Overview of Multitier Networks
DSP Rapid Prototyping Tools
Channel Estimation and Multistage Detection
DSP implementation and Real-time Issues
ASIC Implementation of Algorithm Modules
Conclusions and Future Directions
Multitier Overlay Networks
Home Area Wireless LAN
High Speed Office Wireless LAN
Outdoor CDMA Cellular Network
Time Scales in Multitier Networks
Medium Access: msec
Horizontal Handoff: secs
Vertical Handoff: 10 secs
Session Lifetime: mins
Multiple Radio Interfaces
Reconfigurability and Commonality of Modules
Multitier Network Interface Card (mNIC)
[Figure: a mobile platform and a server connected through base stations (BS) and the Internet. Block labels: Application, Proxy Awareness, Network Protocols, Proxy File System, Transcoders, File System, mNIC, NIC.]
Current Group
Suman Das - Universal Baseline Software System
Vishwas Sundaramurthy - System Design Issues
Sridhar Rajagopal - Channel Estimation Algorithms
Oscar Pan – Real Time Workshop Implementation
Recent Graduates:
– Chaitali Sengupta - ML Synchronization
– Gang Xu - Differencing Multistage Detector
W-CDMA Simulation Testbed Overview
Development of an integrated software testbed
Unified framework to evaluate new algorithms for coding, synchronization, detection, etc.
Construction of a faster, more efficient, and possibly hardware-accelerated simulation testbed
– TI TMS320C6201 / TMS320C6701 based system: Base Station
– TI TMS320C54 and FPGA / ASIC: Mobile
Software Rapid Prototyping Methodology
[Figure: tool flow. Block labels: MATLAB code, MATLAB compiler, C mex code, wrapper (C code or Simulink), C code, host DSP code generation tools, DSP code, DSP hardware.]
Communication and signal processing algorithms in MATLAB and “C”
Faster execution of “C” code
Acceleration on DSP boards
Multiple DSP boards
Simulink
Simulink
– Good system for algorithm evaluation in communication systems and signal processing
– Ties in well with MATLAB environment and functions
– More intuitive than code-based (C/MATLAB) evaluation
Used in software version of wireless testbed
RTW
Real-Time Workshop
– Generates ANSI C code for Simulink block diagrams
– Tool for DSP rapid prototyping
– Quick but inefficient/non-optimized C-code
RTW support for C67x generation boards
– Hardware (DSP)-in-the-loop simulations
CDMA Wireless System Testbed: Simulink Version
[Figure: Simulink block diagram. Block labels: User Data, Wireless Channel (AWGN Channel), Chip Matched Filter, Channel Estimation (Max. Likelihood Channel Est.), Multiuser Detection (Decorrelating Detector, Multiuser Detector), Error Rate Calculation (Error Counter), Statistics (Show Stats), Parameters (Update Parameters).]
Hardware Platform Issues
Current System
– TI TMS320C6201 and TMS320C6701 EVM boards
Multiple DSP processor configuration issues and task decomposition
Planned Upgrade to BlueWave, Spectrum
DSPs in Simulink based Wireless testbed
Use of C67 based boards for simulations
– Useful for study of individual algorithms on C67 generation processors
Multiprocessing issues
– Need block diagram partitioning and code generation support from Simulink/RTW
– Need cleaner external communication mechanisms in the C67x DSP
– Need support for controlling multiple DSPs
Architectural Issues
Memory
– More internal memory for large temporary matrices
Prefetch buffers
– Matrices stored as arrays in memory
ASIC / FPGA glue support
– To explore HW acceleration of critical parts of the code
Specialized instructions: square roots, reciprocals, rotations?
Compiler Support
Compilers for VLIW
– Scheduling and tracking the functional units is difficult in hand-written assembly
– Challenge to generate code that keeps all units busy
Small operating system support
Architectural improvements require coordinated advances in compiler support.
W-CDMA Software Testbed Experiments
Third generation wireless communication systems
Multimedia capabilities
Multirate services
Quality of service
Higher Data Rates: 2 Mbps, 384 Kbps, 144 Kbps.
The Wireless Channel : Multiuser, Multipath
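The signal faces attenuation, delays and Doppler effects: unknown channel parameters
[Figure: the desired user's signal reaches the antenna over a direct path and reflected paths, together with noise and multiple access interference (MAI).]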
W-CDMA Base-Station Receiver
[Figure: receiver block diagram. Block labels: Antenna, Demodulator, Demux (Pilot, Data), Channel Estimator, Multiuser Detector, Decoder. Signal label: Estimated Amplitudes & Delays.]
CDMA Uplink System
[Figure: uplink block diagram. Users 1 to K send bits d1, d2, ..., dK through channel encoders and spreading; the signals are summed with AWGN to give R(t). At the receiver, matched filters produce y1, y2, ..., yK for the channel estimator and multiuser detector, followed by the channel decoder and demux, giving the detected bits d1', d2', ..., dK'.]
Maximum Likelihood - Channel Estimation
Send a time-multiplexed Preamble (Pilot).
Channel properties extracted from received signal.
Compare received signal with known pilot and
estimate channel parameters.
Keep estimate for remaining data bits (static).
Repeat preamble every frame, if no tracking.
The Maximum Likelihood Algorithm
Compute the correlation matrices $R_{rr}$, $R_{bb}$ and $R_{br}$.
Compute the channel estimate $Y = R_{br} R_{bb}^{-1}$.
Calculate the noise covariance matrix K.
Calculate the channel impulse response vector z.
Extract the amplitudes and delays from the channel impulse response vector using a least squares fit.
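The first two steps above can be sketched in a few lines. This is a minimal, real-valued and noiseless illustration, not the presented implementation: the averaging definitions of the correlation matrices, the helper names (`outer_mean`, `invert`, `ml_channel_estimate`) and the small channel matrix `H` are all assumptions made for the example.

```python
# Sketch: channel estimate Y = R_br * inv(R_bb) from a known preamble.
# Assumption: R_bb and R_br are averages of outer products over the preamble.

def outer_mean(xs, ys):
    """Average of the outer products x_i * y_i^T over all preamble symbols."""
    n = len(xs)
    rows, cols = len(xs[0]), len(ys[0])
    return [[sum(x[i] * y[j] for x, y in zip(xs, ys)) / n
             for j in range(cols)] for i in range(rows)]

def invert(m):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def ml_channel_estimate(preamble_bits, received):
    r_bb = outer_mean(preamble_bits, preamble_bits)   # K x K
    r_br = outer_mean(received, preamble_bits)        # N x K
    return matmul(r_br, invert(r_bb))                 # N x K estimate Y

# Hypothetical 3-chip, 2-user channel and a 4-bit preamble.
H = [[0.9, 0.0], [0.3, 0.7], [0.0, 0.2]]
bits = [[1, 1], [1, 1], [1, -1], [-1, -1]]
rx = [[sum(H[n][k] * b[k] for k in range(2)) for n in range(3)] for b in bits]
Y = ml_channel_estimate(bits, rx)
```

With no noise the estimate reproduces `H`, which makes the role of the preamble visible: the known bits let the receiver separate each user's contribution to the received chips.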
The ML Algorithm Complexity
Complex-Real Dot Product.
Complex-Real Matrix Product.
Complex-Real Product.
Real Square roots.
– Solving quadratic equation for least squares fit.
Critical code : Matrix-vector multiplications / Dot Product
[Equations: $R_{br}$ (scaled by $1/L$), $Y = R_{br} R_{bb}^{-1}$, and $z_k$ in terms of $U_k^L$, $U_k^R$, $y_k^L$ and $y_k^R$, assuming unity noise covariance. Annotation: Offline.]
Differencing Multistage - Multiuser Detection
Based on the principle of Parallel Interference Cancellation (PIC)
Cross-correlation information used to remove interference of other users from desired user
Repeated iterations for convergence
Differencing techniques applied for improving the performance of the algorithm
The Differencing Multistage Detector
Split the cross-correlation matrix into its lower, upper and diagonal parts:
$$R = S + D + S^T$$
Calculate the channel impulse response iteratively using
$$z^{(l+1)} = z^{(l)} - (S + S^T) A\, \hat{x}^{(l)}$$
where
$$\hat{x}^{(l)} = \hat{d}^{(l)} - \hat{d}^{(l-1)}, \qquad \hat{x}_k \in \{0, +2, -2\}$$
$\hat{x}$ is called the differencing vector.
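A short sketch can show why updating with the differencing vector gives the same decisions as recomputing every stage from the matched filter outputs. It assumes hard decisions taken with a sign function, a precomputed matrix `B` standing for $(S + S^T)A$, and a hypothetical three-user example; the function names are illustrative only.

```python
# Sketch: conventional parallel interference cancellation versus the
# differencing update. B stands for (S + S^T) A; all values are hypothetical.

def sign(v):
    return 1 if v >= 0 else -1

def conventional_stages(y, B, stages):
    """Each stage recomputes z = y - B d from the previous hard decisions."""
    K = len(y)
    d = [sign(v) for v in y]                       # stage 0: matched filter decisions
    for _ in range(stages):
        z = [y[k] - sum(B[k][j] * d[j] for j in range(K)) for k in range(K)]
        d = [sign(v) for v in z]
    return d

def differencing_stages(y, B, stages):
    """Stage 1 is computed in full; later stages only apply x = d - d_prev.
    Requires stages >= 1."""
    K = len(y)
    d_prev = [sign(v) for v in y]
    z = [y[k] - sum(B[k][j] * d_prev[j] for j in range(K)) for k in range(K)]
    d = [sign(v) for v in z]
    for _ in range(stages - 1):
        x = [d[j] - d_prev[j] for j in range(K)]   # entries are 0, +2 or -2
        z = [z[k] - sum(B[k][j] * x[j] for j in range(K) if x[j] != 0)
             for k in range(K)]
        d_prev, d = d, [sign(v) for v in z]
    return d

# Three users; the first matched filter output has the wrong sign due to noise.
B = [[0.0, 1.0, 0.25], [0.5, 0.0, 0.25], [0.25, 0.5, 0.0]]
y = [-0.25, -1.25, 0.75]
conventional = conventional_stages(y, B, 3)
differencing = differencing_stages(y, B, 3)
```

Once the decisions stop changing, the differencing vector is all zeros and a stage costs almost nothing, which is the source of the savings.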
Multistage Detector Complexity
Matrix Multiplication: $B = (S + S^T) A$
– Computed only once for one frame
Dot Product: $z_k^{(l+1)} = z_k^{(l)} - \sum_j B_{kj}\, \hat{x}_j^{(l)}$
– Computed iteratively
Critical code: Dot Product
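The split between per-frame and per-stage work can be illustrated as follows. This is a sketch under stated assumptions: `S` holds the strictly lower triangular cross-correlations, `A` is given as a list of amplitudes (the diagonal of the amplitude matrix), and the multiply count is a simple illustration of the saving, not a measured figure.

```python
# Sketch: B is built once per frame; each stage is then one dot product per
# user, and zero entries of the differencing vector x are skipped.

def build_b(S, A):
    """B = (S + S^T) A, with A given as the diagonal list of amplitudes."""
    K = len(S)
    return [[(S[k][j] + S[j][k]) * A[j] for j in range(K)] for k in range(K)]

def stage_update(z, B, x):
    """Return z - B x and the number of multiplies actually performed."""
    active = [j for j, v in enumerate(x) if v != 0]
    new_z = [z[k] - sum(B[k][j] * x[j] for j in active) for k in range(len(z))]
    return new_z, len(z) * len(active)

# Hypothetical three-user example.
S = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.25, 0.25, 0.0]]
A = [1.0, 2.0, 1.0]
B = build_b(S, A)                      # once per frame
z_next, multiplies = stage_update([0.5, -1.0, 1.5], B, [2, 0, 0])
```

Here only one user changed its decision, so the stage needs 3 multiplies instead of the 9 a full matrix-vector product would use.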
TI Tools Used
Evaluation Modules (EVM) for the C6201 (fixed point) and C6701 (floating point) DSPs
– 64 KB each internal program & data memory
– 256 KB SBSRAM, 8 MB SDRAM (external)
C Compiler ver 3.0 from Code Generation Tools
Code Composer ver 4.02 for profiling the code
DSP Implementation: Channel Estimation
Floating point implementation found more feasible due to matrix inversions and square-roots.
Code optimized for the DSP
Use of specialized approximate instructions
– Approximate reciprocal square roots
– Approximate reciprocals
Use of assembly code for critical part
– TI's C67 floating point benchmarks for Matrix-Vector Multiplication & Dot Product
Data Memory requirements for Channel Estimation
Use of Approximate Instructions
L = 150, P = 3, N = 31, SNR = 5 dB, SINR = -10 dB
TMS320C67x DSP operation                      Cycles
Approx. FP reciprocal instruction             1
FP reciprocal function                        28
Approx. FP reciprocal sq. root instruction    1
FP reciprocal sq. root function               34
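An approximate instruction returns only a coarse result, so a common companion technique is to refine it with a few Newton-Raphson steps. The sketch below emulates a coarse seed by keeping 8 mantissa bits; that precision, the helper names and the number of iterations are assumptions for illustration, not figures taken from the slides.

```python
import math

# Sketch: refine a low-precision reciprocal or reciprocal square root seed
# with Newton-Raphson. The 8-bit seed emulation is an assumption.

def coarse(value, bits=8):
    """Emulate a low-precision seed by truncating the mantissa (positive inputs)."""
    m, e = math.frexp(value)
    return math.ldexp(math.floor(m * (1 << bits)) / (1 << bits), e)

def refined_reciprocal(a, iterations=2):
    x = coarse(1.0 / a)
    for _ in range(iterations):
        x = x * (2.0 - a * x)              # Newton-Raphson step for 1/a
    return x

def refined_rsqrt(a, iterations=2):
    x = coarse(1.0 / math.sqrt(a))
    for _ in range(iterations):
        x = x * (1.5 - 0.5 * a * x * x)    # Newton-Raphson step for 1/sqrt(a)
    return x

recip = refined_reciprocal(3.0)
rsqrt = refined_rsqrt(2.0)
```

Each step roughly squares the relative error, so two steps take an 8-bit seed close to single precision using only multiplies and subtractions.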
[Figure: Use of specialized instructions and assembly code on C6701 DSP. Execution time (in milliseconds, axis 0 to 140) versus number of users (0 to 15) for three versions: C6701 original, C6701 with intrinsics, C6701 with assembly. Annotations: 10% improvement, 100% improvement.]
Optimization Effects for Channel Estimation
[Figure: Effect of optimizations for Channel Estimation on C6701. Normalized execution time (axis 0 to 100) for three builds: Base (-o3 -pm), Approx. (-o3 -pm with intrinsics), Assembly opt. (-o3 -pm with asm). Annotations: 2.34X improvement, 1.08X improvement.]
Data Memory Requirements
Data to be placed in External memory
1306
DSP Implementation: Multistage Detection
16-bit Fixed Point C Code
Code optimized for the DSP
Use of Assembly Code for critical part
– TI's C62 fixed point assembly benchmarks for Dot Product
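The critical dot product in 16-bit fixed point can be modelled in a few lines. This sketch assumes the Q15 format (values in [-1, 1) scaled by 32768) and a wide accumulator; the slides state only that 16-bit fixed point was used, so the format and helper names are assumptions.

```python
# Sketch: Q15 dot product with a wide accumulator (Q15 format is an assumption).

def to_q15(v):
    """Quantize a value in [-1, 1) to a 16-bit Q15 integer with saturation."""
    return max(-32768, min(32767, int(round(v * 32768))))

def dot_q15(a, b):
    """16x16-bit multiplies summed in a wide accumulator, rescaled to Q15."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y          # each product is in Q30
    return acc >> 15          # truncate back to Q15

a = [to_q15(0.5), to_q15(-0.25)]
b = [to_q15(0.5), to_q15(0.5)]
result = dot_q15(a, b)        # represents 0.5*0.5 + (-0.25)*0.5 = 0.125
```

Keeping the accumulator wider than 16 bits and rescaling once at the end avoids losing precision on every product.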
Data memory requirements for Multistage Detection
Optimization Effects for Multistage Detector
[Figure: Effect of optimizations for Multistage Detection on C6201. Normalized execution time (axis 0 to 100) for three builds: Global opt. (-o3 -pm -mu), Software Pipelining (-o3 -pm), Assembly opt. (-o3 -pm with asm). Annotations: 5.22X improvement, 7.47X improvement.]
Data Memory Requirements
Data can be placed completely in Internal memory
Flops Count
[Figure: number of flops (×10^4, axis 0 to 14) versus total number of iterations (1 to 8) for the conventional and differencing methods. Users: K = 15, SNR = 6 dB. Annotation: 2X speedup for a three-stage detector.]
Real-Time Requirements
[Figure: Real-time capability of the C6201 DSP. Maximum bit rate per user (kb/s, axis 50 to 350) versus number of users (8 to 14) for the conventional and differencing methods. SNR = 10 dB, window size = 12. Annotation: 12 users, 150 kb/s.]
Trends in Recent DSPs
More internal memory and higher clock speeds
– C6203 : 512 KB data, 384 KB program, 250 MHz
– useful for uplink channel estimation algorithms.
Specialized Blocks in the DSP Core.
– Viterbi decoding in C54.
Lower Voltage operation
– 1.2 V in C5402, useful for reducing power consumption in the mobile.
ASIC Implementation
Differencing Multistage Detector Block
MOSIS Tiny-Chip (40-pin DIP)
– 8 synchronous users
– 12-bit fixed point implementation
– 6000 transistors
– 1.2 µm CMOS technology
– 190kb/s for each user (@12.5MHz)
– 3-stage cascade delay < 15 µs
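The throughput and clock figures above imply a per-bit cycle budget, worked out below. The calculation uses only the listed numbers; reading the cascade delay as 15 microseconds is an assumption about the unit.

```python
# Sketch: cycle budget implied by the listed clock rate and per-user bit rate.

CLOCK_HZ = 12.5e6        # chip clock
BIT_RATE = 190e3         # bits per second for each user
CASCADE_DELAY_US = 15.0  # assumed to be in microseconds

cycles_per_bit = CLOCK_HZ / BIT_RATE      # about 65.8 clock cycles per bit
bit_period_us = 1e6 / BIT_RATE            # about 5.26 microseconds per bit
delay_in_bit_periods = CASCADE_DELAY_US / bit_period_us   # about 2.85
```

A delay under 15 µs is therefore a little less than three bit periods for the three-stage cascade.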
Chip (Single Stage) Architecture
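[Figure: single-stage datapath. Block labels: SHIFT, ALU, RECODER, REG, (L+L')A, Control Logic. Internal and external signals: $d^{(l-1)}$, $d^{(l)}$, $z^{(l)}$, $z^{(l+1)}$.]
$$z^{(l+1)} = z^{(l)} - (L + L^T) A\, \hat{x}^{(l)} \quad \text{where} \quad \hat{x}^{(l)} = \hat{d}^{(l)} - \hat{d}^{(l-1)}$$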
ASIC Architecture Features
Chip Layout
[Figure: chip layout with a 2.0 mm scale marking. Labelled regions: 12-bit ALU, Soft Decisions, Cross-Correlation, Recoding logic.]
3-stage Cascade Mode
[Figure: three identical stages connected in cascade. Each stage has inputs Sin, Hin, Fin, Load and CLK and outputs Sout, Hout and Fout, with a 1/2 block per stage. External signals: Matched Filter Output, Detector Output, Handshaking, Load R, Clock, Output Valid.]
Current Work – GPP vs. DSP
• Joint work with Prof. Sarita Adve, Praful Kaul, and Parthasarathy Ranganathan
• Performance of general-purpose systems
• Comparing GPP and DSP performance
• Complete 3G benchmark suite with all components
• Identification of key performance bottlenecks
Preliminary Results (1 of 4)
(4 algorithms: channel estimation, multi-stage detection, FIR filter, dot product)
Performance of general-purpose processors
– Instruction-level parallelism features help (3.4X to 4.4X)
– Media ISA extensions help (1.2X to 5.4X)
– New extensions for packing/multiplication useful
Comparing GPP and DSP performance
– GPPs outperform DSPs
– UltraSPARC-II + VIS 2-4X better than TI TMS320C6701
– Caveat: compiler issues with DSP
Preliminary Results (2 of 4)
Important to study complete system including all components
– Need for complete benchmark suite
[Figure: system block diagram for K users. Transmitter (mobile user): user's bits, source coding, channel coding, spreading, modulation. Receiver (base station): demodulation, channel estimation, detector, decoder, giving the detected bits of all K users.]
Preliminary Results (3 of 4)
Complete 3G benchmark suite with all components
• Source coding
• Channel coding
• Spreading
• Modulation/De-modulation
• Multi-stage detection
• Channel estimation
• Channel decoding
• Source decoding
Used either public-domain or in-house “C” code
Optimized with ISA extensions
Preliminary Results (4 of 4)
Choice of source coding standard makes big difference
– G728 system: source coding/decoding dominant
– GSM system: channel estimation/detection dominant
Breakdown by component:
Component                G728    GSM
Speech Coder             29%     11%
Speech Decoder           24%     6%
Channel Encoder          3%      3%
Channel Decoder          17%     20%
Channel Estimation       9%      20%
Multi-stage Detection    18%     40%
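The two breakdowns can be grouped to check the "dominant" claims. The percentages below are the ones listed for each system; the grouping into source coding and estimation/detection, and the names used, are choices made for this illustration.

```python
# Sketch: group the listed component percentages for the two systems.

G728 = {"Speech Coder": 29, "Speech Decoder": 24, "Channel Encoder": 3,
        "Channel Decoder": 17, "Channel Estimation": 9,
        "Multi-stage Detection": 18}
GSM = {"Speech Coder": 11, "Speech Decoder": 6, "Channel Encoder": 3,
       "Channel Decoder": 20, "Channel Estimation": 20,
       "Multi-stage Detection": 40}

SOURCE = ("Speech Coder", "Speech Decoder")
EST_DET = ("Channel Estimation", "Multi-stage Detection")

def share(profile, names):
    """Combined percentage of the named components."""
    return sum(profile[n] for n in names)

g728_source, g728_est_det = share(G728, SOURCE), share(G728, EST_DET)
gsm_source, gsm_est_det = share(GSM, SOURCE), share(GSM, EST_DET)
```

Source coding and decoding account for 53% in the G728 system against 17% in the GSM system, while estimation and detection account for 27% and 60% respectively.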
Conclusions
Implementation issues : Estimation & Detection Algorithms
Channel Estimation - Floating Point / External Memory
Multistage Detection - Fixed Point / Internal Memory
Specialized instructions : square root/reciprocals.
Additional support for complex arithmetic useful.
Recent trends in GPPs / DSPs highly encouraging for next generation wireless communication applications.
Future Work
FPGA / ASIC Implementation via VHDL models and SPW
Program & DSP implementations for W-CDMA uplink and downlink
– Blind Algorithms
– Adaptive Algorithms
Architectural bottlenecks and compiler issues in DSPs to enhance suitability for next generation W-CDMA systems
Multiple DSPs – mixed DSP / FPGA for mNIC