
Page 1: Computer Architecture Peripherals


Computer Architecture

Peripherals

By Dan Tsafrir, 6/6/2011
Presentation based on slides by Lihu Rappoport

Page 2: Computer Architecture Peripherals


MEMORY: REMINDER

Page 3: Computer Architecture Peripherals

Not so long ago…

[Chart: CPU vs. DRAM performance, 1980–2000, log scale. CPU performance improved ~60% per year (2x in 1.5 years); DRAM improved ~9% per year (2x in 10 years); the gap grew ~50% per year.]
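A back-of-the-envelope check on the "50% per year" figure (my arithmetic, not from the slides): if CPU performance improves by a factor of 1.60 per year and DRAM by 1.09, then the ratio between them grows by

    1.60 / 1.09 ≈ 1.47

per year, i.e. the CPU-DRAM gap itself compounds at roughly 50% annually.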

Page 4: Computer Architecture Peripherals

Not so long ago…

In 1994, in their paper “Hitting the Memory Wall: Implications of the Obvious”, William Wulf & Sally McKee said:

“We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed – each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs.

The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one.”

Page 5: Computer Architecture Peripherals

More recently (2008)…

[Chart: “The memory wall in the multicore era” – performance in seconds (lower = slower) vs. number of processor cores, for a conventional architecture.]

Page 6: Computer Architecture Peripherals

Memory Trade-Offs

Large (dense) memories are slow
Fast memories are small, expensive, and consume high power
Goal: give the processor the feeling that it has a memory which is large (dense), fast, cheap, and consumes low power
Solution: a hierarchy of memories

CPU → L1 cache → L2 cache → L3 cache → Memory (DRAM)
Speed: fastest → slowest
Size: smallest → biggest
Cost: highest → lowest
Power: highest → lowest

Page 7: Computer Architecture Peripherals

Typical levels in mem hierarchy

Response time   Size           Memory level
≈ 0.5 ns        ≈ 100 bytes    CPU registers
≈ 1 ns          ≈ 64 KB        L1 cache
≈ 15 ns         ≈ 1 – 4 MB     L2 cache
≈ 150 ns        ≈ 1 – 4 GB     Main memory (DRAM)
≈ 15 ms         ≈ 1 – 2 TB     Hard disk (SATA)
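To make these numbers concrete, here is a small sketch (in Python) of the usual average-memory-access-time calculation. The latencies come from the table above; the hit rates are assumptions chosen only for illustration.

# Average memory access time (AMAT) through a two-level cache + DRAM.
# Latencies (ns) are from the table above; hit rates are assumed.
L1_LATENCY, L2_LATENCY, DRAM_LATENCY = 1, 15, 150   # ns
L1_HIT, L2_HIT = 0.95, 0.80                          # illustrative hit rates

def amat(l1_hit=L1_HIT, l2_hit=L2_HIT):
    """AMAT = L1 latency + L1 miss rate * (L2 latency + L2 miss rate * DRAM latency)."""
    return L1_LATENCY + (1 - l1_hit) * (L2_LATENCY + (1 - l2_hit) * DRAM_LATENCY)

print(round(amat(), 2))   # 3.25 ns: the hierarchy "feels" almost as fast as L1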

Page 8: Computer Architecture Peripherals


DRAM & SRAM

Page 9: Computer Architecture Peripherals

DRAM basics

DRAM = Dynamic Random-Access Memory
Random access = access cost is the same for every location (well, not really)
The CPU thinks of DRAM as 1-dimensional – simpler
But DRAM is actually arranged as a 2-D grid
Need row & column addresses to access it
Given the “1-D address”, the DRAM interface splits it into row & column (see the sketch below)
Some time must elapse between the row access and the column access (10s of ns)
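A minimal sketch (in Python) of that row/column split; the 14-bit row / 10-bit column geometry is an assumption for illustration, not a value from the slides.

# Split a linear ("1-D") DRAM address into row and column indices.
# The field widths below are illustrative assumptions.
ROW_BITS, COL_BITS = 14, 10

def split_address(addr):
    """Return (row, column) for a linear DRAM address."""
    col = addr & ((1 << COL_BITS) - 1)                 # low bits select the column
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)   # next bits select the row
    return row, col

print(split_address(0x123456))   # -> (1165, 86)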

Page 10: Computer Architecture Peripherals

DRAM basics

Why 2-D? Why delayed row & column accesses?
Every address bit requires a physical pin
DRAMs are large (GBs nowadays)
=> would need many pins => more expensive
Multiplexing row & column over the same pins roughly halves the pin count (see the arithmetic below)

A DRAM array has
Row decoder
• Extracts the row number from the memory address
Column decoder
• Extracts the column number from the memory address
Sense amplifiers
• Hold the row when it is (1) written to, (2) read from, (3) refreshed (see next slide)
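A quick pin-count example (my numbers, for illustration only): a 1 Gbit array has 2^30 bit locations.

    Non-multiplexed: 30 address pins
    Multiplexed as 2^15 rows × 2^15 columns: 15 shared address pins (row sent first, column sent later)

The saving in pins is what pays for the delay between the row and column accesses.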

Page 11: Computer Architecture Peripherals

DRAM basics

One transistor-capacitor pair per bit
Capacitors leak => need to be refreshed every few ms
DRAM spends ~1% of its time refreshing (a rough check of this figure follows below)
“Opening” a row = fetching it into the sense amplifiers = refreshing it
Is it worth it to make the DRAM array a rectangle (rather than a square)?
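A back-of-the-envelope check on the ~1% figure (all numbers here are assumptions for illustration): if a device has 8192 rows, each row refresh occupies it for about 50 ns, and every row must be refreshed once per 64 ms, then

    8192 × 50 ns ≈ 0.41 ms of refresh work per 64 ms window ≈ 0.6% of the time,

which is the right order of magnitude.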

Page 12: Computer Architecture Peripherals

x1 DRAM

[Diagram: a single x1 DRAM – the row decoder selects one of the rows, the memory array (rows × columns) feeds the sense amplifiers, the column decoder picks one column, and the data in/out buffers deliver one bit.]

Page 13: Computer Architecture Peripherals

DRAM banks

Each DRAM memory array outputs one bit
DRAMs use multiple arrays to output multiple bits at a time
xN indicates a DRAM with N memory arrays
Typical today: x16, x32
Each collection of xN arrays forms a DRAM bank
Can read/write from/to each bank independently

Page 14: Computer Architecture Peripherals

x4 DRAM

[Diagram: four x1 DRAM arrays side by side, each with its own row decoder, memory array, sense amplifiers, column decoder, and data in/out buffers, together outputting four bits per access.]

Page 15: Computer Architecture Peripherals

Ranks & DIMMs

DIMM
(Dual In-line) Memory Module (the unit we connect to the motherboard)

Increase bandwidth by delivering data from multiple banks
Bandwidth from one bank is limited => put multiple banks on the DIMM
The bus has a higher clock frequency than any one DRAM
Control of the bus switches between banks to achieve a high data rate (see the example below)

Increase capacity by utilizing multiple ranks
Each rank is an independent set of banks that can be accessed for the full data bit-width
• 64 bits for non-ECC; 72 for ECC (error-correction code)
Ranks cannot be accessed simultaneously
• As they share the same data path
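For a sense of scale (my example, using assumed DDR3-1333 numbers rather than figures from the slides): a 64-bit (8-byte) DIMM interface running at 1333 mega-transfers per second has a peak bandwidth of

    8 bytes × 1333 × 10^6 transfers/s ≈ 10.7 GB/s,

and keeping that bus busy is exactly what interleaving accesses across banks is for.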

Page 16: Computer Architecture Peripherals

Ranks & DIMMs

[Illustration: a 1 GB 2Rx8 DIMM (= 2 ranks × 8 banks).]

Page 17: Computer Architecture Peripherals

Modern DRAM organization

A system has multiple DIMMs
Each DIMM has multiple DRAM banks, arranged in one or more ranks
Each bank has multiple DRAM arrays
Concurrency among banks increases memory bandwidth

Page 18: Computer Architecture Peripherals

Memory controller

[Diagram: a memory controller driving two ranks – each through an address/command bus and a data bus, with a separate chip-select line per rank (chip select 1, chip select 2).]

Page 19: Computer Architecture Peripherals

Memory controller

Functionality
Executes the processor’s memory requests

In earlier systems
A separate, off-processor chip

In modern systems
Integrated on-chip with the processor

Interconnect with the processor
A bus, but can be point-to-point, or through a crossbar

Page 20: Computer Architecture Peripherals

Lifetime of a memory access

1. Processor orders & queues memory requests
2. Request(s) sent to the memory controller
3. Controller queues & orders the requests
4. For each request in the queue, when the time is right:
   1. Controller waits until the requested DRAM is ready
   2. Controller breaks the address bits into rank, bank, row, and column fields (see the sketch below)
   3. Controller sends the chip-select signal to select the rank
   4. Selected bank is pre-charged to activate the selected row
   5. Row is activated within the selected DRAM bank
      • Uses “RAS” (row-address strobe signal)
   6. The (entire) row is sent to the sense amplifiers
   7. The desired column is selected
      • Uses “CAS” (column-address strobe signal)
   8. Data is sent back
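A minimal sketch (in Python) of the address-field breakdown in step 4.2; the field widths and their order are assumptions for illustration – real controllers choose (and often interleave) the bits based on the actual DIMM geometry.

# Break a physical address into DRAM coordinates: rank, bank, row, column.
# Field widths and ordering are illustrative assumptions.
FIELDS = [("col", 10), ("bank", 3), ("rank", 1), ("row", 14)]  # low bits first

def decode(addr):
    """Return a dict mapping each DRAM coordinate to its value."""
    coords = {}
    for name, width in FIELDS:
        coords[name] = addr & ((1 << width) - 1)   # take the low 'width' bits
        addr >>= width                             # shift on to the next field
    return coords

print(decode(0x123456))   # -> {'col': 86, 'bank': 5, 'rank': 1, 'row': 72}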

Page 21: Computer Architecture Peripherals

Basic DRAM array

· Timing (2 phases)
· Decode row address + RAS assert
· Wait for the “RAS to CAS delay”
· Decode column address + CAS assert
· Transfer DATA

[Diagram: the memory address bus feeds a row latch + row-address decoder (strobed by RAS#) and a column latch + column-address decoder (strobed by CAS#), which together select the Data out of the memory array.]

Page 22: Computer Architecture Peripherals

DRAM timing

CAS latency
Number of clock cycles to access a specific column of data
From the moment the memory controller issues the column address in the current row until the data is read out of the memory

RAS to CAS delay
Number of cycles between the row access and the column access

Row pre-charge time
Number of cycles to close the opened row & open the next row

(A worked latency example follows below.)
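A small sketch (in Python) of how these three parameters combine into an access latency; the 200 MHz clock and the 3-cycle values are assumptions for illustration, not numbers from the slides.

# Rough DRAM access latency from the three timing parameters above.
# Clock frequency and cycle counts are illustrative assumptions.
CLOCK_MHZ = 200
CYCLE_NS = 1000 / CLOCK_MHZ          # 5 ns per cycle

T_RP, T_RCD, CAS_LATENCY = 3, 3, 3   # row pre-charge, RAS-to-CAS, CAS latency (cycles)

def access_ns(row_already_open=False):
    """Latency to read a column from a new row, or from the already-open row."""
    cycles = CAS_LATENCY if row_already_open else T_RP + T_RCD + CAS_LATENCY
    return cycles * CYCLE_NS

print(access_ns())       # 45.0 ns: pre-charge, open the new row, then read the column
print(access_ns(True))   # 15.0 ns: the row is already in the sense amplifiers

This is also why the paged-mode scheme two slides ahead helps.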

Page 23: Computer Architecture Peripherals

Addressing sequence

· Access sequence
· Put the row address on the address bus (A[0:7]) and assert RAS#
· Wait for the RAS# to CAS# delay (tRCD)
· Put the column address on the address bus and assert CAS#
· DATA transfer (after the CAS latency)
· Pre-charge

[Timing diagram: RAS# and CAS# strobes against the A[0:7] address lines (Row i, Col n, Row j) and the Data bus (Data n), marking the RAS/CAS delay, the CAS latency, the access time, and the pre-charge delay.]

Page 24: Computer Architecture Peripherals

Computer Architecture 2011 – peripherals24

· Paged Mode DRAM– Multiple accesses to different columns from same row (special locality)– Saves time it takes to bring a new row (but might be unfair)

· Extended Data Output RAM (EDO RAM)– A data output latch enables to parallelize next column address with

current column data

Improved DRAM Schemes

RAS#

DataA[0:7]CAS#

Data n D n+1

Row X Col n X Col n+1 X Col n+2 X

D n+2

X

RAS#

DataA[0:7]CAS#

Data n Data n+1

Row X Col n X Col n+1 X Col n+2 X

Data n+2

X

Page 25: Computer Architecture Peripherals

Improved DRAM Schemes (cont.)

· Burst DRAM
– Generates the consecutive column addresses by itself

[Timing diagram: a single column address (Col n) with CAS# asserted yields a burst of Data n, Data n+1, Data n+2 on the Data bus.]

Page 26: Computer Architecture Peripherals

Synchronous DRAM (SDRAM)

Asynchrony in DRAM
Due to RAS & CAS arriving at any time

Synchronous DRAM
Uses a clock to deliver requests at regular intervals
More predictable DRAM timing => less skew => faster turnaround

SDRAMs support burst-mode access
Initial performance similar to BEDO (= burst + EDO)
Clock scaling enabled higher transfer rates later
• => DDR SDRAM => DDR2 => DDR3

Page 27: Computer Architecture Peripherals

DRAM vs. SRAM
(Random access = access time the same for all locations)

                 DRAM – Dynamic RAM         SRAM – Static RAM
Refresh          Yes (~1% of time)          No
Address          Multiplexed: row + col     Not multiplexed
Random access    Not really…                Yes
Density          High (1 transistor/bit)    Low (6 transistors/bit)
Power            Low                        High
Speed            Slow                       Fast
Price/bit        Low                        High
Typical usage    Main memory                Cache