TRANSCRIPT
Memory Intensive Architectures for DSP and Data Communication
Pronita Mehrotra, Paul Franzon
Department of Electrical and Computer Engineering, North Carolina State University
Outline
- Objectives
- Approach
- Signal Integrity and Routability
- Algorithms and DRAM Architecture
  - Memory Mapping Scheme
  - Twiddle Factor Generation Scheme
- Analysis of FFT Architecture and Performance
- Forwarding Schemes for Router
  - Routing Scheme based on Compaction
  - Binary Search Based Routing Scheme

• SHOCC high density packaging technology increases the potential performance of a 1 GB DSP system by a factor of about 20
• Lower memory requirements for the router
Motivation & Approach
- Radar processor for future UAVs
  - Large problem size (1 GB, 1 M point FFTs)
  - High performance / low volume
- Leverage high density packaging
  - Utilize SHOCC (Seamless High Off-Chip Connectivity)
  - Allows 128 parallel 16-bit wide memory channels
  - Number of channels limited by signal integrity and routability
- Designed a 2,048-bit, 250 MHz memory bus
  - Determine the architecture that maximizes the potential of this memory bandwidth
Physical Design
High density substrate (8 cm x 8 cm)
- Edge-mounted commercial DDR DRAMs
  - approx. 2 mm pitch
  - 100 µm solder bump pitch (today) => ~120 pins => up to x36 memory
  - 2 sets of 64 x 64 Mbit, organized as multiple independent banks
  - 266 Mbps per pin; better availability than RAMBUS
  - Bare die, with more certain SI issues
- SHOCC-mounted identical accelerator ICs (64 multiplier-accumulators, approx. 1 sq. cm.), interconnected by a 2 GHz, 128-bit bus
Substrate Stack-up
[Figure: substrate cross-section, built up on the Si substrate in BCB dielectric layers (5 µm and 10 µm): signal layer S1 (2 µm), Gnd/Pow planes (2 µm), signal layer S2 (2 µm, acting as local ground), and signal layer S3 (2 µm)]
- S2 acting as the local ground reduces the coupling between S1 and S3
- The Maxwell Q-3D (Ansoft) parameter extractor was used to determine R, L, and C
Routing Approach
2-stage breakout routing approach, with pitch decided by crosstalk limitations:
- 13 µm breakout pitch (2 layers)
- 26 µm intermediate pitch (1 layer)
- 36 µm final routing
[Figure: layer assignments (S1, Gnd, S2) in the X-Y routing and parallel routing regions]
SI Issues for High Density Wiring
- 0.25 µm CMOS technology: DC noise margin = 1.04 V; our design uses an upper limit of 0.7 V
- Noise sources:
  - Crosstalk, especially in the breakout region
  - SSN
  - Reflection noise: a potential issue for long, wide memory wiring
Equivalent Circuit (SHOCC Line)
[Figure: driver and receiver connected through the SHOCC line model, with on-chip parasitics (Roc, Loc, Coc) and solder-bump parasitics (Rbump, Lbump, Cbump) on each side]
- Input signal: 2 ns pulse with a rise time of 80 ps
- Driver: 5-stage driver with a stage ratio of 3
SHOCC Line Model (Crosstalk)
[Figure: each line modeled as an n-segment lumped ladder (series R/n and L/n per segment, shunt C/2n at each node), with mutual capacitances (Cmtt between top-layer signals, Cmtb between top- and bottom-layer signals) and mutual inductances (Lmtt, Lmtb) coupling the top (S1) and bottom (S3) signal lines]
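The lumped ladder above can be generated programmatically for circuit simulation. The following sketch emits a SPICE subcircuit for a single uncoupled line; the subcircuit name and the exact C/2n end-split are assumptions read off the figure, and the mutual L/C coupling elements between lines are omitted for brevity:

```python
def rlc_ladder_netlist(n, R, L, C, name="shoccline"):
    """Emit a SPICE subcircuit for an n-segment lumped RLC ladder.

    Each segment carries R/n and L/n in series, with C/2n to ground at
    both of its end nodes, matching the per-segment split in the figure.
    """
    lines = [f".SUBCKT {name} in out"]
    for i in range(1, n + 1):
        a = "in" if i == 1 else f"n{i-1}"   # segment input node
        b = "out" if i == n else f"n{i}"    # segment output node
        mid = f"m{i}"                       # node between R and L
        lines.append(f"R{i} {a} {mid} {R/n:.6g}")
        lines.append(f"L{i} {mid} {b} {L/n:.6g}")
        lines.append(f"C{i}a {a} 0 {C/(2*n):.6g}")
        lines.append(f"C{i}b {b} 0 {C/(2*n):.6g}")
    lines.append(".ENDS")
    return "\n".join(lines)

# 5-segment model of a line with 10 ohm, 1 nH, 2 pF total parasitics
print(rlc_ladder_netlist(5, 10.0, 1e-9, 2e-12))
```

Increasing n pushes the lumped model's valid bandwidth higher; the extracted R, L, C totals would come from the Q-3D run mentioned earlier.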
Crosstalk Noise in Different Regions
[Plots: crosstalk noise (mV) vs. length (cm) for the bottom (S3) and top (S1) layers, in (a) the 13 µm initial breakout pitch region, (b) the 26 µm XY routing region, and (c) the 36 µm routing region (under the DRAMs)]
Trace width in all cases = 10 µm
Reflection Noise
[Plot: reflection noise (mV) vs. length (cm) for the 36 µm routing region, bottom (S3) and top (S1) layers]
Reflection noise constitutes a fairly small percentage of the total noise
Delays in Different SHOCC Regions
[Plots: worst-case delay (ns) vs. length (cm) for the bottom (S3) and top (S1) layers, in (a) the 13 µm initial breakout pitch region, (b) the 26 µm XY routing region, and (c) the 36 µm routing region (under the DRAMs)]
Trace width in all cases = 10 µm
Noise and Timing Analysis
- For a 1 cm x 1 cm chip (with ≈2500 I/O pins), the escape lengths in the two regions are 1 cm and 0.8 cm
- Across the routing regions, crosstalk noise is ≈0.19 V + 0.2 V + 0.17 V ≈ 0.56 V
- The reflection noise is approximately 0.03 V
- The total RSS noise is ≈0.6 V (with an SSN of 0.2 V). This is within the 0.7 V noise margin for a 0.25 µm technology
- The worst-case off-chip skew on an 8 cm x 8 cm substrate is around 0.2 ns. After adding factors for on-chip skew and jitter, we can have a cycle time of at least 2 ns
- This gives an I/O bandwidth > 100 GByte/s
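The root-sum-of-squares total above is easy to check directly, treating the three noise sources as independent (values taken from the slide):

```python
import math

# Noise contributions (volts) from the analysis above
crosstalk = 0.19 + 0.20 + 0.17   # breakout + XY + under-DRAM regions
reflection = 0.03
ssn = 0.20

# Root-sum-of-squares of the independent noise sources
total = math.sqrt(crosstalk**2 + ssn**2 + reflection**2)
print(round(total, 2))   # ≈ 0.6 V, inside the 0.7 V noise margin
```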
DRAM Timing Issues
DRAM is organized in banks and rows:
- A random access takes 60 ns
- A new bank can be accessed every 15 ns
- A different entry within the most recently accessed row can be read or written in 4 ns
[Figure: row address selects a row into the sense amps; column address and bank address then select the data word]
However, in the FFT described next we can sustain 98% of peak bandwidth: SRAM performance at DRAM prices.
FFT Architectural Issues
- A conventional FFT implementation would spend most of its time in only one memory channel
  - Developed a staggered-channel algorithm
- Need to maximize page-mode access in DRAMs
  - Developed a novel memory map scheme for the data
- A conventional FFT stores twiddle factors in main memory
  - Instead, we regenerate them on-the-fly in the datapath during otherwise dead cycles
Micro-accelerator IC
For a 0.25 µm technology, a 32-bit multiplier and adder would take up an area < 1 mm². A 1 cm² chip area can hold enough hardware to make a fully parallel 16-point FFT.
[Figure: 1 cm x 1 cm chip with 32-bit FP arithmetic units (multipliers and adders), MEM interface blocks, and SRAM]
- Micro-accelerator: control reconfigures the IC and manages the MEM interface
- 64 multipliers and adders per chip
- 64 16-bit memory interface units
- ≈0.5 KB SRAM to store twiddle factors
- Four chips work together to give a radix-64 FFT engine
Performance of the FFT Engine
A 32-bit multiply-accumulate unit, in 0.25 µm technology, takes < 2 ns to execute. A 64-point FFT (including the twiddling) can be done in < 32 ns. By pipelining the FFT into two stages, a result can be obtained every 20 ns.
[Figure: pipeline of read, eight passes through four 8-point units, and write, at 20 ns per step, plus a penalty for each new-page access]
Address-Mapping Algorithm
- Maximize page-mode accesses at each stage
- The result set of each 64-point FFT is written to different DRAMs according to the relation
  DRAM# = (FFT# + Index) % 64, where FFT# = Index/64
- Resulting performance:
  - Most of the new-page penalty is hidden by bank operations
  - 1.31 ms for a 4-stage million-point FFT
  - Within 1.6% of perfect SRAM performance
- Key to success when using DRAM
...Addressing Scheme
Example: the indices and memory layout of the data after the end of the first stage, shown for one row in each of the DRAMs (the other stages are the same):
- DRAM 0: 0, 64 .. 4032 and 0, 127 .. 4033
- DRAM 1: 1, 65 .. 4033 and 1, 64 .. 4034
- ...
- DRAM 63: 63, 127 .. 4095 and 63, 126 .. 4032
The inputs for the next stage are now arranged in different DRAMs, allowing full exploitation of the memory bandwidth. This shuffling of data after reading, for the next stage, is easily implemented using shift registers.
...Addressing Scheme
[Figure: bank layout. DRAM 0 holds 0, 127 .. 4033 (B0), 4096 .. (B1), 8192 .. (B2), 12288 .. (B3); DRAM 1 holds 1, 64 .. 4034 (B0), 4097 .. (B1), 8193 .. (B2), 12289 .. (B3); DRAM 63 holds 63, 126 .. 4032 (B0), 4159 .. (B1), 8255 .. (B2), 12351 .. (B3)]
- Reads: Row# = FFT#/4, Bank# = FFT# % 4
- Writes: Row# = FFT#/256, Bank# = (FFT#/64) % 4
where FFT# = Index/64
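The read/write row and bank formulas can be sketched directly and checked against the bank labels in the figure (integer division assumed throughout):

```python
def read_loc(index):
    """Read placement: Row# = FFT#/4, Bank# = FFT# % 4."""
    fft_no = index // 64
    return fft_no // 4, fft_no % 4            # (row, bank)

def write_loc(index):
    """Write placement: Row# = FFT#/256, Bank# = (FFT#/64) % 4."""
    fft_no = index // 64
    return fft_no // 256, (fft_no // 64) % 4  # (row, bank)

# The figure's starting indices 0, 4096, 8192, 12288 sit in write banks B0..B3:
print([write_loc(i)[1] for i in (0, 4096, 8192, 12288)])   # [0, 1, 2, 3]
```

Reads and writes use different bank mappings so that successive 64-point FFTs rotate through banks, which is what hides most of the new-page penalty.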
Twiddle Factor Generation
A one-dimensional input array (length N) can be manipulated as a two-dimensional array (L x M). For a radix-64 FFT, L = 64. The results of each 64-point FFT need to be multiplied with the twiddle factors W_N^{ms}, where

X(r,s) = \sum_{m=0}^{M-1} W_M^{mr} W_N^{ms} \sum_{l=0}^{L-1} x(l,m) W_L^{ls}

with W_N^{ms} = e^{-j 2\pi m s / N}
…Twiddle Factor Generation
For all stages, s varies from 0 to 63. By storing an initial set of twiddle factors (m = 1), subsequent twiddle factors in the same stage can be generated by multiplying the current factors by the initial factors:

W_N^{ms} = W_N^{(m-1)s} \cdot W_N^{s}

Whenever m reaches 64, an initial twiddle factor set can be generated for the next stage:

W_{N/64}^{s} = W_N^{64s}
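Both recurrences can be verified numerically; a minimal sketch comparing iterated multiplication against the direct exponential (a small N is used here for illustration; the design itself uses N = 2^20):

```python
import cmath

N = 4096
s = 17

def W(num, exp):
    """Twiddle factor W_num^exp = e^(-j*2*pi*exp/num)."""
    return cmath.exp(-2j * cmath.pi * exp / num)

# In-stage recurrence: each step multiplies by the stored initial factor W_N^s
cur = W(N, s)                       # m = 1
for m in range(2, 64):
    cur *= W(N, s)                  # W_N^{ms} = W_N^{(m-1)s} * W_N^s
    assert abs(cur - W(N, m * s)) < 1e-9

# Next-stage seed: when m reaches 64, W_{N/64}^s = W_N^{64s}
assert abs(W(N // 64, s) - W(N, 64 * s)) < 1e-12
print("recurrences verified")
```

In hardware the multiplications reuse the FFT's own multipliers during otherwise dead cycles, as the scheduling slide shows next; the small accumulated rounding from repeated multiplication is why only 64 steps are taken before reseeding.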
Scheduling FFT Operations
The first 2 stages of an 8-point FFT do not involve any multiplications. The free multipliers can be used for generating the twiddle factors needed later.
[Timeline (2 ns steps, 2-12 ns): the first two stages of the 8-point FFT run concurrently with generation of twiddle factors, followed by the 3rd stage and twiddling of the final results]
FFT Performance Discussion
- 1,048,576-point FFT in 1.31 ms
- 892 FFT/s
- 1.44 x 10^11 FLOPS
- 127 GBps sustained memory performance

Commercial comparisons:
- BOPS Inc. system: ≈80 sq. cm. of PCB, 4 32-bit memory channels, 4 PEs with 5 FP units per PE; 21.5 ms to perform one million-point FFT
- Motorola's AltiVec: 128-bit vector execution unit with 4 parallel executions, simultaneous load of 4 IEEE floats; 511 ms for a million-point FFT
Optical Burst Switching (OBS)
- Need to decouple transmission/switching from forwarding/routing
  - One control channel goes through O/E/O conversion
  - Data cuts through nodes without any conversion
- Just-in-Time signaling protocol for burst transmission
  - Transmit the packet after some delay without waiting for confirmation
[Figure: JIT signaling between calling host, calling switch, called switch, and called host: SETUP propagates hop by hop, incurring a processing delay at each switch; CALLPROC is returned by the calling switch; the cross-connect is configured; the optical burst is sent; CONNECT messages propagate back]
OBS Node Architecture
[Figure: node with input fibers #1..#N and output fibers #1..#N. Optical blocks: demuxes, switching fabric, FDLs, and muxes on the data path. Electrical blocks: input modules (IDC #1..#N), router, buffer and scheduler, and output modules (ODC #1..#N), with control channels ICC #1..#N]
Message Engine
[Figure: hard path: message parsing and header verification, route lookup (SRAM/DRAM), scheduler (SRAM/DRAM), TTL and CRC update, and message generator, tied to the data bus and switch control; soft path: exception handler, to software]
Forwarding Engine
- Speed: reduce the number of lookups, especially in main memory
  - Number of memory accesses: 2-9 (IPv4)
  - Partition data to ease hardware pipelining
  - Existing schemes take 400-500 ns (average) for an address lookup
- Scalability: reduce the amount of memory required to store the data
  - Direct/indirect lookup schemes use memory inefficiently
  - Tree-based schemes are better
The bottleneck of the forwarding engine is the route lookup.
Trie vs. Tree
[Figure: a binary trie (branches labeled 0/1) alongside a binary tree (branches labeled < and >)*]
Memory accesses:
- Binary trie: number of address bits (32 for IPv4)
- Binary tree: log2(N) (∼16 for 64K entries)
*Nick McKeown, Balaji Prabhakar, "High Performance Switches and Routers: Theory and Practice", Hot Interconnects Tutorial Slides (http://tiny-tera.stanford.edu/nickm/talks/index.html), 1999
Trie Based Schemes: Direct Lookup
- An entry for each address
- Inefficient use of memory; very poor scalability
- Trie of depth = 1 and degree = 2^B
[Figure: a B-bit address indexes directly into a 2^B-entry table]
Lookup time = 1 cycle (60 ns)
[Plot: required memory size vs. address bits (log scale, up to ~10^14); a full 32-bit table would need on the order of 1,000 DRAM chips]
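The scalability problem is easy to quantify: a one-entry-per-address table has 2^B entries. A sketch (the 2-byte next-hop entry size is a hypothetical choice for illustration):

```python
def direct_lookup_bytes(address_bits, entry_bytes=2):
    """Memory needed for a one-entry-per-address table: 2^B entries."""
    return (2 ** address_bits) * entry_bytes

for b in (16, 24, 32):
    print(b, direct_lookup_bytes(b))
# 16 bits -> 128 KB is trivial; 32 bits -> 8 GB (hence the ~1,000 DRAM
# chips in the plot); 2^128 entries for IPv6 is out of the question.
```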
Trie Based Schemes: Indirect Lookup
- Address split in 2 or more parts*
- Somewhat better use of memory; poor scalability
[Figure: the first B1 bits index a first-level table whose entries point to second-level tables indexed by the remaining B2 bits]
- Lookup time = N cycles (N = number of segments in the address)
- Memory requirement depends on the routing table; memory usage can be reduced by using a variable offset length
*P. Gupta, S. Lin, N. McKeown, "Routing Lookups in Hardware at Memory Access Speeds", in Proc. IEEE Infocom '98, Session 10B-1, San Francisco, CA, pp. 1382-1391
Trie Based Schemes: Trie Optimizations
Memory usage optimal. Lookup time = H cycles (H = depth of tree)
Prefix table: 1: 00, 2: 0110, 3: 10, 4: 1100, 5: 1101, 6: 1110, 7: 1111
[Figure: the corresponding binary tree with leaves 1-7; the path-compressed (Patricia) tree, with skip = 2; and the level-compressed (LC) tree*]
*S. Nilsson, G. Karlsson, "IP-Address Lookup Using LC-Tries", IEEE Journal on Selected Areas in Communications, Vol. 17, No. 6, June 1999, pp. 1083-1092
Trie or Tree?
Issues with trie-based schemes:
- Extra nodes with no data add to the depth of the tree, so more memory accesses are needed
- Search time is proportional to the size of the address
  - A binary trie for IPv4 can take up to 32 cycles
  - For IPv6 the worst case could be 128 cycles
Issues with tree-based schemes:
- Binary search works for exact matching: backtracking or wrong paths; unbalanced approaches
- Pre-processing overhead is higher
Tree Based Schemes: Binary Search
- Encode prefixes as ranges*
- Multiway search to reduce search time from log2(N) to log(k+1)(N)
- Pre-computed table of best matching prefixes for the first Y bits
[Figure: example 6-bit range endpoints 100000, 101000, 101010, 101011, 101111, 111111]
Worst-case lookup time = 490 ns (>32,000 entries):

Scheme:                          Patricia | Binary | 16 bit + binary | 16 bit + 6-way
Worst-case search (ns):          2585     | 1310   | 730             | 490
Worst case relative to Patricia: 1        | 2      | 3.5             | 5

*B. Lampson, V. Srinivasan, G. Varghese, "IP Lookups using Multiway and Multicolumn Search", Infocom '98, Vol. 3, 1998, pp. 1248-56
Lookups Using Hash Tables
- Hash tables organized by prefix lengths* (hash collisions are a concern)
[Figure: one hash table per prefix length (e.g. lengths 5, 7, 9), each holding the prefixes of that length]
- Lookup time = log2(address bits)
- Performance is improved by binary search over the hash tables, using markers in the tables for shorter lengths to point to prefixes of greater lengths
*M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable High Speed IP Routing Lookups", ACM Comput. Commun. Rev., Vol. 27, Oct. 1997, pp. 25-36
Proposed Scheme Using Compaction
- Store path information in a smaller (≈250x smaller than the forwarding table), faster, wide (≈1000 bits) on-chip SRAM
- A few SRAM lookups and one DRAM lookup
[Figure: the trie stored as per-level bitmaps, e.g. 1; 1 1; 1010 0101 1001]
- Store a table containing the number of 1's in each level
- Additionally, for each row of SRAM, the first few bits store the number of 1's up to the previous row in that level
- The first 16 bits or so can be direct mapped
- A lookup can be done every 60-65 ns (14-15 million lookups per second)
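The compaction idea can be illustrated with a small popcount-based trie. This is an illustrative encoding of the bitmap/counting idea (one bitmap per level, two bits per present node, a child's position recovered by counting 1's so no pointers are stored), not necessarily the exact on-chip layout:

```python
def build(prefixes):
    """prefixes: dict mapping bit-string prefix -> next hop.
    Returns per-level bitmaps plus next hops keyed by compacted node index."""
    levels, hops = [], []
    nodes = [""]                                  # nodes present at current depth
    for _ in range(max(len(p) for p in prefixes)):
        bitmap, hop, nxt = [], [], []
        for node in nodes:
            for b in "01":
                child = node + b
                if any(p.startswith(child) for p in prefixes):
                    bitmap.append(1)              # child exists on some stored path
                    hop.append(prefixes.get(child))
                    nxt.append(child)
                else:
                    bitmap.append(0)
        levels.append(bitmap)
        hops.append(hop)
        nodes = nxt
    return levels, hops

def lookup(levels, hops, addr):
    """Longest-prefix match by popcount: the child of node i via bit b sits
    at bitmap position 2*i+b, and its compacted index is the number of 1's
    before that position (the per-row running counts in SRAM serve this role)."""
    best, i = None, 0
    for k, bit in enumerate(addr):
        if k == len(levels):
            break
        pos = 2 * i + int(bit)
        if levels[k][pos] == 0:
            break
        i = sum(levels[k][:pos])                  # popcount = compacted child index
        if hops[k][i] is not None:
            best = hops[k][i]                     # remember deepest match so far
    return best

levels, hops = build({"10": 7, "01": 5, "110": 3, "1011": 5})
print(lookup(levels, hops, "10110011"))   # longest match is 1011* -> 5
```

In the real design the bitmaps and running 1-counts live in the wide SRAM, and only the final next-hop fetch touches DRAM, which is how the one-DRAM-access bound on the next slide arises.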
…Proposed Scheme Using Compaction
- On-chip SRAM and off-chip DRAM
  - >1000-bit wide on-chip SRAM
  - For 40,000 prefixes in the routing table, the required SRAM size is less than 5 KB
  - 2 sets of these memories can be used to hide the update operations
- Pipelined SRAM and DRAM operation
  - Only 1 DRAM lookup in all cases
  - One lookup can be done every 60-65 ns (14-15 million lookups per second)
Binary Search Based Proposed Scheme
Sorting prefixes (two prefixes A = a1a2…an and B = b1b2…bm):
- If n = m, compare by numerical value
- If n ≠ m, chop the longer prefix and compare. If the chopped prefixes are equal, the shorter prefix is considered larger

Sample prefix set* (prefix, next hop): 10* 7; 01* 5; 110* 3; 1011* 5; 0001* 0; 01011* 7; 00010* 1; 001100* 2; 1011001* 3; 1011010* 5; 0100110* 6; 01001100* 4; 10110011* 8; 10110001* 10; 01011001* 9

After sorting: 00010*, 0001*, 001100*, 01001100*, 0100110*, 01011001*, 01011*, 01*, 10110001*, 10110011*, 1011001*, 1011010*, 1011*, 10*, 110*

*N. Yazdani, P.S. Min, "Fast and Scalable Schemes for IP Address Lookup Problem", Proc. IEEE Conference on High Performance Switching and Routing, pp. 83-92, 2000
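The sorting rule above can be written as a comparator; applying it to the sample set reproduces the sorted order on the slide:

```python
from functools import cmp_to_key

def cmp_prefix(a, b):
    """Sort rule from the slide: equal lengths compare numerically;
    otherwise chop the longer prefix, and if the chopped prefixes tie,
    the shorter prefix is considered larger (so it sorts after its
    descendants, giving depth-first order)."""
    n = min(len(a), len(b))
    if a[:n] != b[:n]:
        # bit strings of equal length: lexicographic == numerical order
        return -1 if a[:n] < b[:n] else 1
    if len(a) == len(b):
        return 0
    return 1 if len(a) < len(b) else -1   # shorter prefix is "larger"

prefixes = ["10", "01", "110", "1011", "0001", "01011", "00010", "001100",
            "1011001", "1011010", "0100110", "01001100", "10110011",
            "10110001", "01011001"]
print(sorted(prefixes, key=cmp_to_key(cmp_prefix)))
# -> ['00010', '0001', '001100', '01001100', '0100110', '01011001',
#     '01011', '01', '10110001', '10110011', '1011001', '1011010',
#     '1011', '10', '110']  — the slide's sorted order
```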
Binary Search Based Proposed Scheme
Sorting gives a depth-first-search order of the corresponding binary trie. The binary trie is constructed as:
- If A is a prefix of B, then B is the child of A
- If A < B, then A lies on the left of B
[Figure: the trie for the sample set, rooted at *, with children 0001, 001100, 01, 10, 110; 0001 → 00010; 01 → 0100110 (→ 01001100) and 01011 (→ 01011001); 10 → 1011 (→ 10110001, 1011001 (→ 10110011), 1011010). Search examples 001101*, 010001*, 010011000* are marked against 001100* and 01001100*]
Modified Prefix Table

Prefix      Next Hop  Parent Info
00010*      1         00011000
0001*       0         00001000
001100*     2         00100000
01001100*   4         11000010
0100110*    6         01000010
01011001*   9         10010010
01011*      7         00010010
01*         5         00000010
10110001*   10        10001010
10110011*   8         11001010
1011001*    3         01001010
1011010*    5         01001010
1011*       5         00001010
10*         7         00000010
110*        3         00000100

- Store information about all parents in another field
- Pre-processing requires another step. The update process is O(N) (same as Lampson's scheme)
- The memory requirement is ∼2x smaller
- Example: for 010001*, the match with 01001100* extends to 4 bits; the best matching prefix is 01*
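The Parent Info field can be reproduced from the table itself. In the sketch below I assume, inferred from the table entries, that bit k of the 8-bit field (counted from the left) marks an ancestor prefix of length 9 − k, with the entry itself included:

```python
def parent_info(prefix, table, width=8):
    """Bitmap of ancestor lengths: bit k (1-indexed from the left) is set
    iff some table prefix of length width+1-k is a prefix of `prefix`
    (including `prefix` itself)."""
    bits = ["0"] * width
    for p in table:
        if prefix.startswith(p):
            bits[width - len(p)] = "1"   # length L lands at position width-L
    return "".join(bits)

table = ["00010", "0001", "001100", "01001100", "0100110", "01011001",
         "01011", "01", "10110001", "10110011", "1011001", "1011010",
         "1011", "10", "110"]

print(parent_info("01001100", table))   # 11000010  (self, 0100110, 01)
print(parent_info("10110011", table))   # 11001010  (self, 1011001, 1011, 10)
```

With this field, a binary search that lands near a non-matching entry can read off the longest ancestor that still covers the searched address, which is how the 01* result in the example above is recovered without backtracking.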
Conclusions
- A 2,048-bit memory system is buildable in high density packaging technologies
  - The limit is determined by signal integrity issues
  - Modeled and simulated with Ansoft and Hspice
- FFT architecture optimized to maximize the available memory bandwidth
  - Memory map perfectly matched to the DDR DRAM architecture
  - On-the-fly twiddle factor calculation
  - Verified in a Verilog model
- Result: about 20x faster than what is achievable with conventional packaging
…Conclusions
- Trie-based routing scheme using compaction suggested for smaller address sizes
  - SRAM size is almost 250x smaller than the DRAM size
  - Only one DRAM access
- Binary search scheme for larger address sizes
  - Number of memory accesses = log2(N)
  - Memory requirement ∼2x lower than existing schemes
  - Update process is O(N), same as existing schemes
Future Work
- FFT
  - Complete verification (Verilog)
  - Submit journal papers (T-VLSI, CPMT); conference paper published (EPEP)
- Forwarding engine
  - Verify routing schemes (high-level Verilog)
  - Evaluate pre-processing overheads
  - Evaluate performance against standard routing tables
  - Submit journal paper
  - Conduct scaling studies to support OBS