TRANSCRIPT
Memory Intensive Architectures for DSP and Data Communication
Pronita Mehrotra, Paul Franzon
Department of Electrical and Computer Engineering, North Carolina State University
Outline
- Objectives
- Approach
- Signal Integrity and Routability
- Algorithms and DRAM Architecture
  - Memory Mapping Scheme
  - Twiddle Factor Generation Scheme
- Analysis of FFT Architecture and Performance
- Forwarding Schemes for Router
  - Routing Scheme based on Compaction
  - Binary Search Based Routing Scheme

• SHOCC high density packaging technology increases the potential performance of a 1 GB DSP system by a factor of about 20
• Lower memory requirements for the router
Motivation & Approach
- Radar processor for future UAVs
  - Large problem size (1 GB, 1 M point FFTs)
  - High performance / low volume
- Leverage high density packaging
  - Utilize SHOCC (Seamless High Off-Chip Connectivity)
  - Allows 128 parallel 16-bit wide memory channels
  - Number of channels limited by signal integrity and routability
- Designed a 2,048-bit, 250 MHz memory bus
  - Determine the architecture that maximizes the potential of this memory bandwidth
Physical Design
High density substrate (8 cm x 8 cm)
- Edge-mounted commercial DDR DRAMs
  - approx. 2 mm pitch
  - 100 µm solder bump pitch (today) => ~120 pins => up to x36 memory
  - 2 sets of 64 x 64 Mbit, organized as multiple independent banks
  - 266 Mbps per pin; better availability than RAMBUS
  - Bare die, with more certain SI issues
- SHOCC-mounted identical accelerator ICs (64 multiplier-accumulators, approx. 1 sq. cm.), interconnected by a 2 GHz, 128-bit bus
Substrate Stack-up
[Figure: substrate cross-section, built up on the Si substrate in BCB dielectric layers (5 µm and 10 µm): signal layer S1 (2 µm), Gnd/Pow planes (2 µm), signal layer S2 (2 µm, acting as local ground), and signal layer S3 (2 µm)]
- S2 acting as the local ground reduces the coupling between S1 and S3
- The Maxwell Q-3D (Ansoft) parameter extractor was used to determine R, L, and C
Routing Approach
2-stage breakout routing approach, with pitch decided by crosstalk limitations:
- 13 µm breakout pitch (2 layers)
- 26 µm intermediate pitch (1 layer)
- 36 µm final routing
[Figure: layer assignments (S1, Gnd, S2) in the X-Y routing and parallel routing regions]
SI Issues for High Density Wiring
- 0.25 µm CMOS technology: DC noise margin = 1.04 V; our design uses an upper limit of 0.7 V
- Noise sources:
  - Crosstalk, especially in the breakout region
  - SSN
  - Reflection noise: a potential issue for long, wide memory wiring
Equivalent Circuit (SHOCC Line)
[Figure: driver and receiver connected through the SHOCC line model, with on-chip parasitics (Roc, Loc, Coc) and solder-bump parasitics (Rbump, Lbump, Cbump) on each side]
- Input signal: 2 ns pulse with a rise time of 80 ps
- Driver: 5-stage driver with a stage ratio of 3
SHOCC Line Model (Crosstalk)
[Figure: each line modeled as an n-segment lumped ladder (series R/n and L/n per segment, shunt C/2n at each node), with mutual capacitances (Cmtt between top-layer signals, Cmtb between top- and bottom-layer signals) and mutual inductances (Lmtt, Lmtb) coupling the top (S1) and bottom (S3) signal lines]
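The lumped ladder above can be generated programmatically for circuit simulation. The following sketch emits a SPICE subcircuit for a single uncoupled line; the subcircuit name and the exact C/2n end-split are assumptions read off the figure, and the mutual L/C coupling elements between lines are omitted for brevity:

```python
def rlc_ladder_netlist(n, R, L, C, name="shoccline"):
    """Emit a SPICE subcircuit for an n-segment lumped RLC ladder.

    Each segment carries R/n and L/n in series, with C/2n to ground at
    both of its end nodes, matching the per-segment split in the figure.
    """
    lines = [f".SUBCKT {name} in out"]
    for i in range(1, n + 1):
        a = "in" if i == 1 else f"n{i-1}"   # segment input node
        b = "out" if i == n else f"n{i}"    # segment output node
        mid = f"m{i}"                       # node between R and L
        lines.append(f"R{i} {a} {mid} {R/n:.6g}")
        lines.append(f"L{i} {mid} {b} {L/n:.6g}")
        lines.append(f"C{i}a {a} 0 {C/(2*n):.6g}")
        lines.append(f"C{i}b {b} 0 {C/(2*n):.6g}")
    lines.append(".ENDS")
    return "\n".join(lines)

# 5-segment model of a line with 10 ohm, 1 nH, 2 pF total parasitics
print(rlc_ladder_netlist(5, 10.0, 1e-9, 2e-12))
```

Increasing n pushes the lumped model's valid bandwidth higher; the extracted R, L, C totals would come from the Q-3D run mentioned earlier.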
Crosstalk Noise in Different Regions
[Plots: crosstalk noise (mV) vs. length (cm) for the bottom (S3) and top (S1) layers, in (a) the 13 µm initial breakout pitch region, (b) the 26 µm XY routing region, and (c) the 36 µm routing region (under the DRAMs)]
Trace width in all cases = 10 µm
Reflection Noise
[Plot: reflection noise (mV) vs. length (cm) for the 36 µm routing region, bottom (S3) and top (S1) layers]
Reflection noise constitutes a fairly small percentage of the total noise
Delays in Different SHOCC Regions
[Plots: worst-case delay (ns) vs. length (cm) for the bottom (S3) and top (S1) layers, in (a) the 13 µm initial breakout pitch region, (b) the 26 µm XY routing region, and (c) the 36 µm routing region (under the DRAMs)]
Trace width in all cases = 10 µm
Noise and Timing Analysis
- For a 1 cm x 1 cm chip (with ≈2500 I/O pins), the escape lengths in the two regions are 1 cm and 0.8 cm
- Across the routing regions, crosstalk noise is ≈0.19 V + 0.2 V + 0.17 V ≈ 0.56 V
- The reflection noise is approximately 0.03 V
- The total RSS noise is ≈0.6 V (with an SSN of 0.2 V). This is within the 0.7 V noise margin for a 0.25 µm technology
- The worst-case off-chip skew on an 8 cm x 8 cm substrate is around 0.2 ns. After adding factors for on-chip skew and jitter, we can have a cycle time of at least 2 ns
- This gives an I/O bandwidth > 100 GByte/s
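The root-sum-of-squares total above is easy to check directly, treating the three noise sources as independent (values taken from the slide):

```python
import math

# Noise contributions (volts) from the analysis above
crosstalk = 0.19 + 0.20 + 0.17   # breakout + XY + under-DRAM regions
reflection = 0.03
ssn = 0.20

# Root-sum-of-squares of the independent noise sources
total = math.sqrt(crosstalk**2 + ssn**2 + reflection**2)
print(round(total, 2))   # ≈ 0.6 V, inside the 0.7 V noise margin
```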
DRAM Timing Issues
DRAM is organized in banks and rows:
- A random access takes 60 ns
- A new bank can be accessed every 15 ns
- A different entry within the most recently accessed row can be read or written in 4 ns
[Figure: row address selects a row into the sense amps; column address and bank address then select the data word]
However, in the FFT described next we can sustain 98% of peak bandwidth: SRAM performance at DRAM prices.
FFT Architectural Issues
- A conventional FFT implementation would spend most of its time in only one memory channel
  - Developed a staggered-channel algorithm
- Need to maximize page-mode access in DRAMs
  - Developed a novel memory map scheme for the data
- A conventional FFT stores twiddle factors in main memory
  - Instead, we regenerate them on-the-fly in the datapath during otherwise dead cycles
Micro-accelerator IC
For a 0.25 µm technology, a 32-bit multiplier and adder would take up an area < 1 mm². A 1 cm² chip area can hold enough hardware to make a fully parallel 16-point FFT.
[Figure: 1 cm x 1 cm chip with 32-bit FP arithmetic units (multipliers and adders), MEM interface blocks, and SRAM]
- Micro-accelerator: control reconfigures the IC and manages the MEM interface
- 64 multipliers and adders per chip
- 64 16-bit memory interface units
- ≈0.5 KB SRAM to store twiddle factors
- Four chips work together to give a radix-64 FFT engine
Performance of the FFT Engine
A 32-bit multiply-accumulate unit, in 0.25 µm technology, takes < 2 ns to execute. A 64-point FFT (including the twiddling) can be done in < 32 ns. By pipelining the FFT into two stages, a result can be obtained every 20 ns.
[Figure: pipeline of read, eight passes through four 8-point units, and write, at 20 ns per step, plus a penalty for each new-page access]
Address-Mapping Algorithm
- Maximize page-mode accesses at each stage
- The result set of each 64-point FFT is written to different DRAMs according to the relation
  DRAM# = (FFT# + Index) % 64, where FFT# = Index/64
- Resulting performance:
  - Most of the new-page penalty is hidden by bank operations
  - 1.31 ms for a 4-stage million-point FFT
  - Within 1.6% of perfect SRAM performance
- Key to success when using DRAM
...Addressing Scheme
Example: the indices and memory layout of the data after the end of the first stage, shown for one row in each of the DRAMs (the other stages are the same):
- DRAM 0: 0, 64 .. 4032 and 0, 127 .. 4033
- DRAM 1: 1, 65 .. 4033 and 1, 64 .. 4034
- ...
- DRAM 63: 63, 127 .. 4095 and 63, 126 .. 4032
The inputs for the next stage are now arranged in different DRAMs, allowing full exploitation of the memory bandwidth. This shuffling of data after reading, for the next stage, is easily implemented using shift registers.
...Addressing Scheme
[Figure: bank layout. DRAM 0 holds 0, 127 .. 4033 (B0), 4096 .. (B1), 8192 .. (B2), 12288 .. (B3); DRAM 1 holds 1, 64 .. 4034 (B0), 4097 .. (B1), 8193 .. (B2), 12289 .. (B3); DRAM 63 holds 63, 126 .. 4032 (B0), 4159 .. (B1), 8255 .. (B2), 12351 .. (B3)]
- Reads: Row# = FFT#/4, Bank# = FFT# % 4
- Writes: Row# = FFT#/256, Bank# = (FFT#/64) % 4
where FFT# = Index/64
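The read/write row and bank formulas can be sketched directly and checked against the bank labels in the figure (integer division assumed throughout):

```python
def read_loc(index):
    """Read placement: Row# = FFT#/4, Bank# = FFT# % 4."""
    fft_no = index // 64
    return fft_no // 4, fft_no % 4            # (row, bank)

def write_loc(index):
    """Write placement: Row# = FFT#/256, Bank# = (FFT#/64) % 4."""
    fft_no = index // 64
    return fft_no // 256, (fft_no // 64) % 4  # (row, bank)

# The figure's starting indices 0, 4096, 8192, 12288 sit in write banks B0..B3:
print([write_loc(i)[1] for i in (0, 4096, 8192, 12288)])   # [0, 1, 2, 3]
```

Reads and writes use different bank mappings so that successive 64-point FFTs rotate through banks, which is what hides most of the new-page penalty.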
Twiddle Factor Generation
A one-dimensional input array (length N) can be manipulated as a two-dimensional array (L x M). For a radix-64 FFT, L = 64. The results of each 64-point FFT need to be multiplied with the twiddle factors W_N^{ms}, where

X(r,s) = \sum_{m=0}^{M-1} W_M^{mr} W_N^{ms} \sum_{l=0}^{L-1} x(l,m) W_L^{ls}

with W_N^{ms} = e^{-j 2\pi m s / N}
…Twiddle Factor Generation
For all stages, s varies from 0 to 63. By storing an initial set of twiddle factors (m = 1), subsequent twiddle factors in the same stage can be generated by multiplying the current factors by the initial factors:

W_N^{ms} = W_N^{(m-1)s} \cdot W_N^{s}

Whenever m reaches 64, an initial twiddle factor set can be generated for the next stage:

W_{N/64}^{s} = W_N^{64s}
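Both recurrences can be verified numerically; a minimal sketch comparing iterated multiplication against the direct exponential (a small N is used here for illustration; the design itself uses N = 2^20):

```python
import cmath

N = 4096
s = 17

def W(num, exp):
    """Twiddle factor W_num^exp = e^(-j*2*pi*exp/num)."""
    return cmath.exp(-2j * cmath.pi * exp / num)

# In-stage recurrence: each step multiplies by the stored initial factor W_N^s
cur = W(N, s)                       # m = 1
for m in range(2, 64):
    cur *= W(N, s)                  # W_N^{ms} = W_N^{(m-1)s} * W_N^s
    assert abs(cur - W(N, m * s)) < 1e-9

# Next-stage seed: when m reaches 64, W_{N/64}^s = W_N^{64s}
assert abs(W(N // 64, s) - W(N, 64 * s)) < 1e-12
print("recurrences verified")
```

In hardware the multiplications reuse the FFT's own multipliers during otherwise dead cycles, as the scheduling slide shows next; the small accumulated rounding from repeated multiplication is why only 64 steps are taken before reseeding.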
Scheduling FFT Operations
The first 2 stages of an 8-point FFT do not involve any multiplications. The free multipliers can be used for generating the twiddle factors needed later.
[Timeline (2 ns steps, 2-12 ns): the first two stages of the 8-point FFT run concurrently with generation of twiddle factors, followed by the 3rd stage and twiddling of the final results]
FFT Performance Discussion
- 1,048,576-point FFT in 1.31 ms
- 892 FFT/s
- 1.44 x 10^11 FLOPS
- 127 GBps sustained memory performance

Commercial comparisons:
- BOPS Inc. system: ≈80 sq. cm. of PCB, 4 32-bit memory channels, 4 PEs with 5 FP units per PE; 21.5 ms to perform one million-point FFT
- Motorola's AltiVec: 128-bit vector execution unit with 4 parallel executions, simultaneous load of 4 IEEE floats; 511 ms for a million-point FFT
Optical Burst Switching (OBS)
- Need to decouple transmission/switching from forwarding/routing
  - One control channel goes through O/E/O conversion
  - Data cuts through nodes without any conversion
- Just-in-Time signaling protocol for burst transmission
  - Transmit the packet after some delay without waiting for confirmation
[Figure: JIT signaling between calling host, calling switch, called switch, and called host: SETUP propagates hop by hop, incurring a processing delay at each switch; CALLPROC is returned by the calling switch; the cross-connect is configured; the optical burst is sent; CONNECT messages propagate back]
OBS Node Architecture
[Figure: node with input fibers #1..#N and output fibers #1..#N. Optical blocks: demuxes, switching fabric, FDLs, and muxes on the data path. Electrical blocks: input modules (IDC #1..#N), router, buffer and scheduler, and output modules (ODC #1..#N), with control channels ICC #1..#N]
Message Engine
[Figure: hard path: message parsing and header verification, route lookup (SRAM/DRAM), scheduler (SRAM/DRAM), TTL and CRC update, and message generator, tied to the data bus and switch control; soft path: exception handler, to software]
Forwarding Engine
- Speed: reduce the number of lookups, especially in main memory
  - Number of memory accesses: 2-9 (IPv4)
  - Partition data to ease hardware pipelining
  - Existing schemes take 400-500 ns (average) for an address lookup
- Scalability: reduce the amount of memory required to store the data
  - Direct/indirect lookup schemes use memory inefficiently
  - Tree-based schemes are better
The bottleneck of the forwarding engine is the route lookup.
Trie vs. Tree
[Figure: a binary trie (branches labeled 0/1) alongside a binary tree (branches labeled < and >)*]
Memory accesses:
- Binary trie: number of address bits (32 for IPv4)
- Binary tree: log2(N) (∼16 for 64K entries)
*Nick McKeown, Balaji Prabhakar, "High Performance Switches and Routers: Theory and Practice", Hot Interconnects Tutorial Slides (http://tiny-tera.stanford.edu/nickm/talks/index.html), 1999
Trie Based Schemes: Direct Lookup
- An entry for each address
- Inefficient use of memory; very poor scalability
- Trie of depth = 1 and degree = 2^B
[Figure: a B-bit address indexes directly into a 2^B-entry table]
Lookup time = 1 cycle (60 ns)
[Plot: required memory size vs. address bits (log scale, up to ~10^14); a full 32-bit table would need on the order of 1,000 DRAM chips]
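The scalability problem is easy to quantify: a one-entry-per-address table has 2^B entries. A sketch (the 2-byte next-hop entry size is a hypothetical choice for illustration):

```python
def direct_lookup_bytes(address_bits, entry_bytes=2):
    """Memory needed for a one-entry-per-address table: 2^B entries."""
    return (2 ** address_bits) * entry_bytes

for b in (16, 24, 32):
    print(b, direct_lookup_bytes(b))
# 16 bits -> 128 KB is trivial; 32 bits -> 8 GB (hence the ~1,000 DRAM
# chips in the plot); 2^128 entries for IPv6 is out of the question.
```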
Trie Based Schemes: Indirect Lookup
- Address split in 2 or more parts*
- Somewhat better use of memory; poor scalability
[Figure: the first B1 bits index a first-level table whose entries point to second-level tables indexed by the remaining B2 bits]
- Lookup time = N cycles (N = number of segments in the address)
- Memory requirement depends on the routing table; memory usage can be reduced by using a variable offset length
*P. Gupta, S. Lin, N. McKeown, "Routing Lookups in Hardware at Memory Access Speeds", in Proc. IEEE Infocom '98, Session 10B-1, San Francisco, CA, pp. 1382-1391
Trie Based Schemes: Trie Optimizations
Memory usage optimal. Lookup time = H cycles (H = depth of tree)
Prefix table: 1: 00, 2: 0110, 3: 10, 4: 1100, 5: 1101, 6: 1110, 7: 1111
[Figure: the corresponding binary tree with leaves 1-7; the path-compressed (Patricia) tree, with skip = 2; and the level-compressed (LC) tree*]
*S. Nilsson, G. Karlsson, "IP-Address Lookup Using LC-Tries", IEEE Journal on Selected Areas in Communications, Vol. 17, No. 6, June 1999, pp. 1083-1092
Trie or Tree?
Issues with trie-based schemes:
- Extra nodes with no data add to the depth of the tree, so more memory accesses are needed
- Search time is proportional to the size of the address
  - A binary trie for IPv4 can take up to 32 cycles
  - For IPv6 the worst case could be 128 cycles
Issues with tree-based schemes:
- Binary search works for exact matching: backtracking or wrong paths; unbalanced approaches
- Pre-processing overhead is higher
Tree Based Schemes: Binary Search
- Encode prefixes as ranges*
- Multiway search to reduce search time from log2(N) to log(k+1)(N)
- Pre-computed table of best matching prefixes for the first Y bits
[Figure: example 6-bit range endpoints 100000, 101000, 101010, 101011, 101111, 111111]
Worst-case lookup time = 490 ns (>32,000 entries):

Scheme:                          Patricia | Binary | 16 bit + binary | 16 bit + 6-way
Worst-case search (ns):          2585     | 1310   | 730             | 490
Worst case relative to Patricia: 1        | 2      | 3.5             | 5

*B. Lampson, V. Srinivasan, G. Varghese, "IP Lookups using Multiway and Multicolumn Search", Infocom '98, Vol. 3, 1998, pp. 1248-56
Lookups Using Hash Tables
- Hash tables organized by prefix lengths* (hash collisions are a concern)
[Figure: one hash table per prefix length (e.g. lengths 5, 7, 9), each holding the prefixes of that length]
- Lookup time = log2(address bits)
- Performance is improved by binary search over the hash tables, using markers in the tables for shorter lengths to point to prefixes of greater lengths
*M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable High Speed IP Routing Lookups", ACM Comput. Commun. Rev., Vol. 27, Oct. 1997, pp. 25-36
Proposed Scheme Using Compaction
- Store path information in a smaller (≈250x smaller than the forwarding table), faster, wide (≈1000 bits) on-chip SRAM
- A few SRAM lookups and one DRAM lookup
[Figure: the trie stored as per-level bitmaps, e.g. 1; 1 1; 1010 0101 1001]
- Store a table containing the number of 1's in each level
- Additionally, for each row of SRAM, the first few bits store the number of 1's up to the previous row in that level
- The first 16 bits or so can be direct mapped
- A lookup can be done every 60-65 ns (14-15 million lookups per second)
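The compaction idea can be illustrated with a small popcount-based trie. This is an illustrative encoding of the bitmap/counting idea (one bitmap per level, two bits per present node, a child's position recovered by counting 1's so no pointers are stored), not necessarily the exact on-chip layout:

```python
def build(prefixes):
    """prefixes: dict mapping bit-string prefix -> next hop.
    Returns per-level bitmaps plus next hops keyed by compacted node index."""
    levels, hops = [], []
    nodes = [""]                                  # nodes present at current depth
    for _ in range(max(len(p) for p in prefixes)):
        bitmap, hop, nxt = [], [], []
        for node in nodes:
            for b in "01":
                child = node + b
                if any(p.startswith(child) for p in prefixes):
                    bitmap.append(1)              # child exists on some stored path
                    hop.append(prefixes.get(child))
                    nxt.append(child)
                else:
                    bitmap.append(0)
        levels.append(bitmap)
        hops.append(hop)
        nodes = nxt
    return levels, hops

def lookup(levels, hops, addr):
    """Longest-prefix match by popcount: the child of node i via bit b sits
    at bitmap position 2*i+b, and its compacted index is the number of 1's
    before that position (the per-row running counts in SRAM serve this role)."""
    best, i = None, 0
    for k, bit in enumerate(addr):
        if k == len(levels):
            break
        pos = 2 * i + int(bit)
        if levels[k][pos] == 0:
            break
        i = sum(levels[k][:pos])                  # popcount = compacted child index
        if hops[k][i] is not None:
            best = hops[k][i]                     # remember deepest match so far
    return best

levels, hops = build({"10": 7, "01": 5, "110": 3, "1011": 5})
print(lookup(levels, hops, "10110011"))   # longest match is 1011* -> 5
```

In the real design the bitmaps and running 1-counts live in the wide SRAM, and only the final next-hop fetch touches DRAM, which is how the one-DRAM-access bound on the next slide arises.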
…Proposed Scheme Using Compaction
- On-chip SRAM and off-chip DRAM
  - >1000-bit wide on-chip SRAM
  - For 40,000 prefixes in the routing table, the required SRAM size is less than 5 KB
  - 2 sets of these memories can be used to hide the update operations
- Pipelined SRAM and DRAM operation
  - Only 1 DRAM lookup in all cases
  - One lookup can be done every 60-65 ns (14-15 million lookups per second)
Binary Search Based Proposed Scheme
Sorting prefixes (two prefixes A = a1a2…an and B = b1b2…bm):
- If n = m, compare by numerical value
- If n ≠ m, chop the longer prefix and compare. If the chopped prefixes are equal, the shorter prefix is considered larger

Sample prefix set* (prefix, next hop): 10* 7; 01* 5; 110* 3; 1011* 5; 0001* 0; 01011* 7; 00010* 1; 001100* 2; 1011001* 3; 1011010* 5; 0100110* 6; 01001100* 4; 10110011* 8; 10110001* 10; 01011001* 9

After sorting: 00010*, 0001*, 001100*, 01001100*, 0100110*, 01011001*, 01011*, 01*, 10110001*, 10110011*, 1011001*, 1011010*, 1011*, 10*, 110*

*N. Yazdani, P.S. Min, "Fast and Scalable Schemes for IP Address Lookup Problem", Proc. IEEE Conference on High Performance Switching and Routing, pp. 83-92, 2000
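The sorting rule above can be written as a comparator; applying it to the sample set reproduces the sorted order on the slide:

```python
from functools import cmp_to_key

def cmp_prefix(a, b):
    """Sort rule from the slide: equal lengths compare numerically;
    otherwise chop the longer prefix, and if the chopped prefixes tie,
    the shorter prefix is considered larger (so it sorts after its
    descendants, giving depth-first order)."""
    n = min(len(a), len(b))
    if a[:n] != b[:n]:
        # bit strings of equal length: lexicographic == numerical order
        return -1 if a[:n] < b[:n] else 1
    if len(a) == len(b):
        return 0
    return 1 if len(a) < len(b) else -1   # shorter prefix is "larger"

prefixes = ["10", "01", "110", "1011", "0001", "01011", "00010", "001100",
            "1011001", "1011010", "0100110", "01001100", "10110011",
            "10110001", "01011001"]
print(sorted(prefixes, key=cmp_to_key(cmp_prefix)))
# -> ['00010', '0001', '001100', '01001100', '0100110', '01011001',
#     '01011', '01', '10110001', '10110011', '1011001', '1011010',
#     '1011', '10', '110']  — the slide's sorted order
```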
Binary Search Based Proposed Scheme
Sorting gives a depth-first-search order of the corresponding binary trie. The binary trie is constructed as:
- If A is a prefix of B, then B is the child of A
- If A < B, then A lies on the left of B
[Figure: the trie for the sample set, rooted at *, with children 0001, 001100, 01, 10, 110; 0001 → 00010; 01 → 0100110 (→ 01001100) and 01011 (→ 01011001); 10 → 1011 (→ 10110001, 1011001 (→ 10110011), 1011010). Search examples 001101*, 010001*, 010011000* are marked against 001100* and 01001100*]
Modified Prefix Table

Prefix      Next Hop  Parent Info
00010*      1         00011000
0001*       0         00001000
001100*     2         00100000
01001100*   4         11000010
0100110*    6         01000010
01011001*   9         10010010
01011*      7         00010010
01*         5         00000010
10110001*   10        10001010
10110011*   8         11001010
1011001*    3         01001010
1011010*    5         01001010
1011*       5         00001010
10*         7         00000010
110*        3         00000100

- Store information about all parents in another field
- Pre-processing requires another step. The update process is O(N) (same as Lampson's scheme)
- The memory requirement is ∼2x smaller
- Example: for 010001*, the match with 01001100* extends to 4 bits; the best matching prefix is 01*
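The Parent Info field can be reproduced from the table itself. In the sketch below I assume, inferred from the table entries, that bit k of the 8-bit field (counted from the left) marks an ancestor prefix of length 9 − k, with the entry itself included:

```python
def parent_info(prefix, table, width=8):
    """Bitmap of ancestor lengths: bit k (1-indexed from the left) is set
    iff some table prefix of length width+1-k is a prefix of `prefix`
    (including `prefix` itself)."""
    bits = ["0"] * width
    for p in table:
        if prefix.startswith(p):
            bits[width - len(p)] = "1"   # length L lands at position width-L
    return "".join(bits)

table = ["00010", "0001", "001100", "01001100", "0100110", "01011001",
         "01011", "01", "10110001", "10110011", "1011001", "1011010",
         "1011", "10", "110"]

print(parent_info("01001100", table))   # 11000010  (self, 0100110, 01)
print(parent_info("10110011", table))   # 11001010  (self, 1011001, 1011, 10)
```

With this field, a binary search that lands near a non-matching entry can read off the longest ancestor that still covers the searched address, which is how the 01* result in the example above is recovered without backtracking.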
Conclusions
- A 2,048-bit memory system is buildable in high density packaging technologies
  - The limit is determined by signal integrity issues
  - Modeled and simulated with Ansoft and Hspice
- FFT architecture optimized to maximize the available memory bandwidth
  - Memory map perfectly matched to the DDR DRAM architecture
  - On-the-fly twiddle factor calculation
  - Verified in a Verilog model
- Result: about 20x faster than what is achievable with conventional packaging
…Conclusions
- Trie-based routing scheme using compaction suggested for smaller address sizes
  - SRAM size is almost 250x smaller than the DRAM size
  - Only one DRAM access
- Binary search scheme for larger address sizes
  - Number of memory accesses = log2(N)
  - Memory requirement ∼2x lower than existing schemes
  - Update process is O(N), same as existing schemes
Future Work
- FFT
  - Complete verification (Verilog)
  - Submit journal papers (T-VLSI, CPMT); conference paper published (EPEP)
- Forwarding engine
  - Verify routing schemes (high-level Verilog)
  - Evaluate pre-processing overheads
  - Evaluate performance against standard routing tables
  - Submit journal paper
  - Conduct scaling studies to support OBS