new dram contemporary dram architectures and beyondblj/talks/compaq.pdf · 1999. 9. 27. · and...
CONTEMPORARY DRAM ARCHITECTURES AND BEYOND
Bruce Jacob
University of Maryland
Contemporary DRAM Architectures and Beyond
Bruce Jacob
Electrical & Computer Engineering
University of Maryland, College Park
http://www.ece.umd.edu/~blj/
OUTLINE:
• Motivation & Background
• Experiments
• Results
• More Recent Results
Sources
“A Performance Study of Contemporary DRAM Architectures,” Proc. ISCA ’99. V. Cuppu, B. Jacob, B. Davis, and T. Mudge
Recent experiments by Vinodh Cuppu, Ph.D. student at University of Maryland
Recent experiments by Brian Davis, Ph.D. student at University of Michigan
Dilemma: THIS ...
STATUS QUO in MEMORY-SYSTEM RESEARCH:
...
if (memory_instruction(INSTR)) {
    if (L1_cache_miss(data_addr(INSTR))) {
        if (L2_cache_miss(data_addr(INSTR))) {
            /* every L2 miss is charged one fixed DRAM latency */
            cycles += DRAM_LATENCY;
        }
    }
}
...
... or THIS
Motivation
HERE’S WHAT YOU MISS:
[Figure: CPU, bus, memory controller (MC), bus, and several DRAMs]
DRAM LATENCY:
• DATA TRANSFER
• OVERLAP
• COLUMN ACCESS
• ROW ACCESS
• BUS TRANSMISSION
...
Goal
PRELIMINARY DRAM STUDY:
• Bus Transmission
• Row Access
• Column Access
• Data Transfer
• Bus Wait/Synch Time
• Stalls Due to Refresh
• The OVERLAP of These Components (with each other) (with CPU execution)
MODEL EXISTING TECHNOLOGY
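One way to picture this goal is as a per-access sum over the listed components, with the overlapped portion subtracted. The sketch below is illustrative only: the class name `AccessTiming`, its field names, and the sample numbers are assumptions made for this example, not values or code from the study.

```python
from dataclasses import dataclass


@dataclass
class AccessTiming:
    """Time components of one DRAM access, in nanoseconds (hypothetical values)."""
    bus_transmission: float
    row_access: float
    column_access: float
    data_transfer: float
    bus_wait: float = 0.0        # bus wait/synch time
    refresh_stall: float = 0.0   # stalls due to refresh
    overlap: float = 0.0         # time hidden behind other components or CPU work

    def exposed_latency(self) -> float:
        """Sum every component, then subtract the portion that is overlapped."""
        total = (self.bus_transmission + self.row_access + self.column_access
                 + self.data_transfer + self.bus_wait + self.refresh_stall)
        return total - self.overlap


# Made-up numbers: 100 ns of component time with 15 ns hidden by overlap.
example = AccessTiming(bus_transmission=10, row_access=30,
                       column_access=20, data_transfer=40, overlap=15)
```

Under these made-up numbers, `example.exposed_latency()` is 85 ns, whereas a single fixed latency constant would hide which component dominates.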
DRAM Primer
[Figure, repeated on five slides: CPU and MEMORY CONTROLLER connected by a BUS to a DRAM containing a Row Decoder, Memory Array (Bit Lines), Sense Amps, Column Decoder, and Data In/Out Buffers. Each slide is labeled with one step:]
• BUS TRANSMISSION
• ROW ACCESS
• COLUMN ACCESS
• DATA TRANSFER (note: page mode enables overlap with COL)
• BUS TRANSMISSION (note: overlapped component not shown)
DRAM Primer
Read Timing for Conventional DRAM
[Timing diagram. Signals: RAS, CAS, Address, DQ. Two reads are shown, each with a Row Address, a Column Address, and Valid Dataout. Legend: Row Access, Column Access, Transfer Overlap, Data Transfer.]
DRAM Primer
Read Timing for Fast Page Mode DRAM
[Timing diagram. Signals: RAS, CAS, Address, DQ. One Row Address is followed by three Column Addresses, each returning Valid Dataout. Legend: Row Access, Column Access, Transfer Overlap, Data Transfer.]
DRAM Primer
Read Timing for Extended Data Out DRAM
[Timing diagram. Signals: RAS, CAS, Address, DQ. One Row Address is followed by three Column Addresses, each returning Valid Dataout. Legend: Row Access, Column Access, Transfer Overlap, Data Transfer.]
DRAM Primer
Read Timing for Synchronous DRAM
[Timing diagram. Signals: Clock, RAS, CAS, Address, DQ. One Row Address and one Column Address are followed by a burst of Valid Dataout. Legend: Row Access, Column Access, Transfer Overlap, Data Transfer.]
DRAM Primer
Read Timing for Rambus DRAM
[Timing diagram. Signals: Command, Address, DQ. Command: ACTV/READ, Read Strobe, Read Term. Address: Bank/Row, then three Col Addr. DQ: Valid Dataout for each column address. One interval is marked "4 cycles". Legend: Row Access, Column Access, Transfer Overlap, Data Transfer.]
Simulator Overview
CPU: SimpleScalar v3.0a
• 8-way out-of-order
• L1 cache: split 64K/64K, lockup free x32
• L2 cache: unified 1MB, lockup free x1
• L2 blocksize: 128 bytes
Main Memory: 8 64Mb DRAMs
• 100MHz/128-bit memory bus
• Optimistic open-page policy (close-immediately can be calculated)
Represents a “typical” workstation
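The memory-system figures on this slide imply some back-of-the-envelope numbers. The sketch below works them out under one stated assumption: the bus moves one full-width transfer per bus clock, with no row or column access time counted. The function names are hypothetical and are not part of SimpleScalar.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, bus_clock_mhz: float) -> float:
    """Peak bus bandwidth in GB/s, assuming one transfer per bus clock."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * bus_clock_mhz / 1000.0


def block_transfer_ns(block_bytes: int, bus_width_bits: int,
                      bus_clock_mhz: float) -> float:
    """Time to move one cache block across the bus (transfer time only)."""
    transfers = block_bytes * 8 / bus_width_bits
    cycle_ns = 1000.0 / bus_clock_mhz
    return transfers * cycle_ns


def capacity_mbytes(chips: int, mbit_per_chip: int) -> float:
    """Total main-memory capacity in megabytes."""
    return chips * mbit_per_chip / 8
```

With the slide's parameters (128-bit bus at 100MHz, 128-byte L2 blocks, eight 64Mb DRAMs), these give a 1.6 GB/s peak, 80 ns of bus time per L2 block, and 64 MB of main memory.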
DRAM Configurations
FPM, EDO, SDRAM, ESDRAM:
[Figure: CPU and caches, 128-bit 100MHz bus, Memory Controller, and a DIMM of eight x16 DRAMs]
Rambus, Direct Rambus, SLDRAM:
[Figure: CPU and caches, 128-bit 100MHz bus, Memory Controller, and a chain of DRAMs on a Fast, Narrow Channel]
Note: TRANSFER WIDTH of Direct Rambus Channel
• equals that of ganged FPM, EDO, etc.
• is 2x that of Rambus & SLDRAM
DRAM Configurations
Strawman: Rambus, etc.
[Figure: CPU and caches, 128-bit 100MHz bus, Memory Controller, and Multiple Parallel Channels of DRAMs]
Overhead: Memory vs. CPU
Variable: speed of processor & caches
[Bar chart. Y-axis: Clocks Per Instruction (CPI), 0 to 3. X-axis: BENCHMARK (Compress, Go, Ijpeg, Li, Perl, Vortex). Bars: Yesterday’s CPU, Today’s CPU, Tomorrow’s CPU. Legend: Processor Execution (includes caches); Overlap between Execution & Memory; Stalls due to Memory Access Time.]
Definitions (var. on Burger, et al)
• tPROC: processor with perfect memory
• tREAL: realistic configuration
• tBW: CPU with wide memory paths
• tDRAM: time seen by DRAM system
[Figure: stacked bar of total height tREAL, with tDRAM, tPROC, and tBW marked, divided into:]
• Stalls Due to BANDWIDTH: tREAL - tBW
• Stalls Due to LATENCY: tBW - tPROC
• CPU+L1+L2 Execution: tREAL - tDRAM
• CPU-Memory OVERLAP: tPROC - (tREAL - tDRAM)
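The four differences on this slide partition tREAL, which a short sketch makes easy to check. The function name `time_breakdown` and the sample numbers are assumptions for illustration; the formulas are the ones listed on the slide.

```python
def time_breakdown(t_proc, t_real, t_bw, t_dram):
    """Split t_real into the four components defined on the slide.

    t_proc: time with perfect memory
    t_real: time in the realistic configuration
    t_bw:   time with wide memory paths
    t_dram: time seen by the DRAM system
    """
    execution = t_real - t_dram      # CPU+L1+L2 execution
    overlap = t_proc - execution     # CPU-memory overlap
    latency_stalls = t_bw - t_proc   # stalls due to latency
    bandwidth_stalls = t_real - t_bw # stalls due to bandwidth
    return {
        "execution": execution,
        "overlap": overlap,
        "latency_stalls": latency_stalls,
        "bandwidth_stalls": bandwidth_stalls,
    }


# Hypothetical cycle counts, chosen only to exercise the formulas.
parts = time_breakdown(t_proc=100, t_real=200, t_bw=150, t_dram=140)
```

Execution plus overlap always equals tPROC, and all four parts sum to tREAL, so the breakdown accounts for every cycle of the realistic run exactly once.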
Memory & CPU: PERL
[Bar chart. Y-axis: Cycles Per Instruction (CPI), 0 to 5. X-axis: DRAM Configuration (FPM, EDO, SDRAM, ESDRAM, DRDRAM). Bars: Yesterday’s CPU, Today’s CPU, Tomorrow’s CPU. Legend: Processor Execution; Overlap between Execution & Memory; Stalls due to Memory Latency; Stalls due to Memory Bandwidth.]
Average Latency of DRAMs
note: SLDRAM & RDRAM 2x data transfers
[Bar chart. Y-axis: Time per Access (ns), 0 to 500. X-axis: DRAM Configurations (FPM, EDO, SDRAM, ESDRAM, SLDRAM, RDRAM, DRDRAM). Legend: Bus Transmission Time; Row Access Time; Column Access Time; Data Transfer Time Overlap; Data Transfer Time; Refresh Time; Bus Wait Time.]
Cost-Performance
FPM, EDO, SDRAM, ESDRAM:
• Lower Latency => Wide/Fast Bus
• Increase Capacity => Decrease Latency
• Low System Cost
Rambus, Direct Rambus, SLDRAM:
• Lower Latency => Multiple Channels
• Increase Capacity => Increase Capacity
• High System Cost
1 DRDRAM = Multiple SDRAM
Conclusions
100MHz/128-bit Bus is Current Bottleneck
• Solution: Fast Bus/es & MC on CPU (e.g. Compaq Alpha, Sony Emotion, ...)
Current DRAMs Solving Bandwidth Problem (but not Latency Problem)
There is Locality in DRAM Accesses (but how important is this?)
SPECint ’95 Fits in 1MB Cache
Recent (Unfinished) Work
Investigation of Organization-Level Parameters:
• Channel widths & speeds, turnaround
• Independent vs. ganged channels
• Banks per channel, burst widths
Detailed Study of DRDRAM vs. SDRAM in Highly Concurrent Environment
Embedded DRAM+DSP Architectures
Detailed Study of Multiprocessor Buses
Channel/Bank Model
[Figure: block diagrams built from boxes labeled C and D, showing three organizations:]
• One independent channel: Banking degrees of 1, 2, 4, ...
• Two independent channels: Banking degrees of 1, 2, 4, ...
• Four independent channels: Banking degrees of 1, 2, 4, ...
Read/Write Request Model
READ REQUESTS:
[Three timing diagrams starting at t0, each with ADDRESS BUS, DRAM BANK, and DATA BUS rows. Labeled durations include 10ns, 70ns, 90ns, and 100ns.]
WRITE REQUESTS:
[Three timing diagrams starting at t0, each with ADDRESS BUS, DRAM BANK, and DATA BUS rows. Labeled durations include 10ns, 20ns, 40ns, and 90ns.]
Concurrency Model
[Timing diagrams of overlapped request pairs (R: read, W: write). Labeled durations include 10ns, 20ns, 40ns, 70ns, and 90ns.]
• Legal if no turnaround and R/W to different banks
• Legal if turnaround ≤ 10ns and R/W to different banks
• Legal if R/R to different banks
Bandwidth vs. Burst Width
PERL: 1 channel, 4 banks, 2GHz CPU
[Chart. Y-axis: Cycles per Instruction, 0 to 1.25. X-axis: System Bandwidth (GB/s = Channels * Width * Speed), at 0.4, 0.8, 1.6, 3.2, 6.4. Series: 128-Byte, 64-Byte, 32-Byte, 16-Byte, and 8-Byte Burst Width.]
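The axis formula, GB/s = Channels * Width * Speed, can be sketched as below. The units are assumptions, since this slide does not state them: width is taken in bits and speed in MHz, with the 400 MHz figure borrowed from the Exploiting Concurrency chart. The function name is hypothetical.

```python
def system_bandwidth_gb_per_s(channels: int, width_bits: int,
                              speed_mhz: float) -> float:
    """System bandwidth in GB/s = Channels * Width * Speed.

    Assumes width in bits, speed in MHz, and one transfer per clock.
    """
    bytes_per_clock = channels * width_bits / 8
    return bytes_per_clock * speed_mhz / 1000.0


# One 400 MHz channel at widths of 8 to 128 bits (assumed sweep).
sweep = [system_bandwidth_gb_per_s(1, w, 400) for w in (8, 16, 32, 64, 128)]
```

Under these assumptions, the sweep yields the same five values as the axis ticks (0.4, 0.8, 1.6, 3.2, 6.4 GB/s), though the slide does not say which combinations of channels, width, and speed were actually used for each point.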
Exploiting Concurrency
PERL: 2 banks, 16-byte burst, 2GHz CPU
[Chart. Y-axis: Cycles per Instruction, 0 to 1.25. X-axis: Total Datapath Bitwidth (bits = Channels * BusWidth), at 8, 16, 32, 64, 128, 256. Groups: 400 MHz x 1 channel, 400 MHz x 2 channels, 400 MHz x 4 channels. Series: 64-Bit, 32-Bit, 16-Bit, and 8-Bit Data Bus.]
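Because the x-axis is Channels * BusWidth, several organizations share the same total bitwidth. The sketch below enumerates them for the channel counts and bus widths named in the legend; the function name `datapath_configs` is hypothetical.

```python
from itertools import product


def datapath_configs(channels=(1, 2, 4), bus_widths=(8, 16, 32, 64)):
    """Group (channels, bus width) pairs by total datapath bitwidth."""
    table = {}
    for ch, width in product(channels, bus_widths):
        table.setdefault(ch * width, []).append((ch, width))
    # Sort by total bitwidth so the keys read like the chart's x-axis.
    return dict(sorted(table.items()))


configs = datapath_configs()
```

For example, a 32-bit total datapath can be one 32-bit channel, two 16-bit channels, or four 8-bit channels, which is the comparison the chart draws between wider buses and more independent channels.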
Conclusions
None yet ... preliminary data
CONTACT INFO:
Prof. Bruce Jacob
Electrical & Computer Engineering
University of Maryland, College Park
http://www.ece.umd.edu/~blj/
blj@eng.umd.edu