Array Allocation Taking into Account SDRAM Characteristics
Hong-Kai Chang, Youn-Long Lin
Department of Computer Science
National Tsing Hua University
HsinChu, Taiwan, R.O.C.
Outline

- Introduction
- Related Work
- Motivation
- Problem Definition
- Proposed Algorithms
- Experimental Results
- Conclusions and Future Work
Introduction

- Performance gap between memory and processor
- Systems without a cache: application-specific systems, embedded DRAM
- Optimize DRAM performance by utilizing its special characteristics
- SDRAM's multi-bank architecture enables new optimizations in scheduling
- We assign arrays to different SDRAM banks to increase the data access rate
Related Work

- Previous research eliminates the memory bottleneck by using local memory (cache) and by prefetching data as fast as possible
- Panda, Dutt, and Nicolau utilize page-mode access of EDO DRAM to improve scheduling
- Research on mapping arrays to physical memories targets lower power, lower cost, and better performance
Motivation

- DRAM operations: row decode, column decode, precharge
- SDRAM characteristics: multiple banks, burst transfer, synchronous interface

[Figure: a traditional DRAM (a single row/column array) vs. a 2-bank SDRAM (Bank 0 and Bank 1, each with its own row/column array)]
Address Mapping Table

- Host address: [a16:a0]; memory address: [BA, A7:A0]
- Page size for the host: 128 words (a6:a0)
- Page size for the DRAM: 256 words (A7:A0)

A 9x8 SDRAM address mapping table (bank interleaving size: 128 words):

         BA   A7   A6   A5   A4   A3   A2   A1   A0
    Row  a7   a16  a15  a14  a13  a12  a11  a10  a9
    Col       a8   a6   a5   a4   a3   a2   a1   a0

If we exchange the mapping of a0 and a7, we obtain:

A 9x8 SDRAM address mapping table (bank interleaving size: 1 word):

         BA   A7   A6   A5   A4   A3   A2   A1   A0
    Row  a0   a16  a15  a14  a13  a12  a11  a10  a9
    Col       a8   a6   a5   a4   a3   a2   a1   a7
Motivational Example

Legend: BA = bank active (row decode); R/W = read/write (column decode); BP = precharge.

[Timing diagram: commands BA1, BP1, BP2, BA2, R1, R2 are issued on the command (address) bus and Data1, Data2 returned on the data bus; the sequence is then repeated for Data3, Data4. With the command sequences fully serialized, the schedule takes 27 cycles.]
Motivational Example (cont.)

[Timing diagram 1: BA1, BP1, BP2, BA2, R1, R2, R3, R4 pipelined on the command bus; Data1 through Data4 are delivered in 10 cycles.]

[Timing diagram 2: the BA/BP command sequence is repeated for the second pair of reads; Data1 through Data4 are delivered in 16 cycles, versus 27 cycles without overlap.]
Assumptions

- Harvard architecture: separate program/data memories
- Paging policy of the DRAM controller:
  - No precharge is performed after a read/write
  - If the next access references a different page, precharge followed by bank active is performed before the read/write
  - As many pages can be open at once as there are banks
- Resource constraints:

    Function Unit  ALU      Multiplier  Divider  SDRAM   SDRAM
    Supported Op   +,-,>,S  *           /        BA,BP   R,W
    Clocks         1        2           4        2       3
    Quantity       1        1           1        2 or 4  2 or 4
Problem Definition

- Input: a data-flow graph, the resource constraints, and the memory configuration
- Perform our bank allocation algorithm
- Schedule the operations with a static list-scheduling algorithm that considers SDRAM timing constraints
- Output: a schedule of operations, a bank allocation table, and the total cycle count
Bank Allocation Algorithm

- Calculate node distances
- Calculate array distances
- Give arrays with shorter distances higher priority
- Allocate arrays to different banks if possible
Example: SOR

    main() {
        float a[N][N], b[N][N], c[N][N], d[N][N], e[N][N], f[N][N];
        float omega, resid, u[N][N];
        int j, l;

        for (j = 2; j < N; j++)
            for (l = 1; l < N; l += 2) {
                resid = a[j][l]*u[j+1][l] + b[j][l]*u[j-1][l] +
                        c[j][l]*u[j][l+1] + d[j][l]*u[j][l-1] +
                        e[j][l]*u[j][l] - f[j][l];
                u[j][l] -= omega*resid/e[j][l];
            }
    }
[Figure: partial DFG of SOR. Four read pairs, (a, u[j+1][l]), (b, u[j-1][l]), (c, u[j][l+1]), and (d, u[j][l-1]), feed four multiplies whose results are summed; each node carries a node-distance vector, from {1,-,-,-,-,-,-,1,-} and {-,1,-,-,-,-,-,-,1} at the multiplies up to {3,3,3,3,-,-,3,3,3} at the final add.]
Node Distance

- The distance between the current node and the nearest node that accesses array a, b, c, ..., shown in { }
- Ex.: {1,-,-,-,-,-,-,1,-} means the distances to the nodes that access array a[j] and u[j-1] are both 1
- '-' means the distance is still unknown
- When propagated downstream, the distances increase

[Figure: sub-DFG of SOR. The two multiplies are annotated {1,-,-,-,-,-,-,1,-} and {-,1,-,-,-,-,-,-,1}; the add combining them is annotated {2,2,-,-,-,-,-,2,2}, with callouts marking the distances to a[j], u[j-1], and b[j].]
Array Distance

- The distance between the nodes that access a pair of arrays
- Calculated from the node distances of the corresponding arrays, taking the minimum value
- Ex.: AD(a[j], u[j-1]) = min(2, 4) = 2

[Figure: the same sub-DFG, with candidate sums of node distances: AD(a[j], u[j-1]) = 1+1 = 2, AD(a[j], b[j]) = 2+2 = 4, and AD(a[j], u[j-1]) = 2+2 = 4; the minimum, 2, is taken.]
Example: SOR

Array distance table of SOR:

            a[j]  b[j]  c[j]  d[j]  e[j]  f[j]  u[j]  u[j+1]  u[j-1]
    a[j]      0     4     6     6     7     6     6      2       4
    b[j]      4     0     6     6     7     6     6      4       2
    c[j]      6     6     0     4     7     6     2      6       6
    d[j]      6     6     4     0     7     6     2      6       6
    e[j]      7     7     7     7     0     3     2      7       7
    f[j]      6     6     6     6     3     0     3      6       6
    u[j]      6     6     2     2     2     3     0      6       6
    u[j+1]    2     4     6     6     7     6     6      0       4
    u[j-1]    4     2     6     6     7     6     6      4       0

Bank allocation:
- Bank 0: c, d, e, f
- Bank 1: a, b, u
Experimental Setup

- We divided our benchmarks into two groups:
  - The first group accesses multiple 1-D arrays; we apply our algorithm to the arrays
  - The second group accesses a single 2-D array; we apply our algorithm to the array rows
- Memory configurations:
  - Multi-bank configuration: 2 banks / 4 banks
  - Multi-chip configuration: 2 chips / 4 chips (multi-chip relieves bus contention relative to multi-bank)
  - With or without page-mode access
Results of the First Group (multiple arrays)

[Bar chart: normalized cycles for benchmarks dhrc, dequant, wiener, dct, mmult, leafcomp, fir, and sor under configurations Coarse, 1Bank, 2Bank, 4Bank, 2Chips, 4Chips, 2Bank+P, 4Bank+P, 2Chips+P, and 4Chips+P.]
Results of the Second Group (single array)

[Bar chart: normalized cycles for benchmarks compress, laplace, sobel, lowpass, compress2, laplace2, sobel2, and lowpass2 under the same ten configurations.]
Results Compared to Panda's

[Bar chart: normalized cycles for dhrc, dequant, mmult, leafcomp, sor, and lowpass under the ten configurations plus Panda's approach.]
Experimental Results

- From the average results, we can see that scheduling SDRAM accesses with our bank allocation algorithm does improve performance
- Utilizing page-mode access relieves the traffic on the address bus, so using multiple chips brings no obvious further improvement

    Configuration   1 Chip/2 Banks   1 Chip/4 Banks   2 Chips/1 Bank   4 Chips/1 Bank
    W/O Page Mode        70.20%           62.28%           64.93%           54.51%
    W/ Page Mode         53.38%           43.36%           52.52%           42.02%

Average schedule length of different configurations
Conclusions

- We presented a bank allocation algorithm, incorporated into our scheduler, that takes advantage of SDRAM's characteristics
- The scheduling results improve greatly over the coarse schedule and beat Panda's work in some cases
- Our work is based on a common paging policy
- Several different memory configurations are explored
- Scheduling results are verified to meet Intel's PC SDRAM specification