ddr4 designing for power and performance - memcon
Post on 11-Feb-2017
DDR4: Designing for Power and Performance
Agenda
Comparison between DDR3 and DDR4
Designing for power
− DDR4 power savings
Designing for performance
− Creating a data valid window
− Good layout practices for DDR4
− Board debug tools to minimize issues
Looking ahead and conclusion
Comparison Between DDR3 and DDR4
DRAM Technology Comparison
Parameter | DDR3 | DDR4 | GDDR5
Voltage | 1.5 V / 1.35 V | 1.2 V | 1.5 V / 1.35 V
Strobe | Bi-directional differential | Bi-directional differential | Free-running differential WRITE clock
Strobe Configuration | Per byte | Per byte | Per word
READ Data Capture | Strobe based | Strobe based | Clock data recovery
Data Termination | VDDQ/2 | VDDQ | VDDQ
Address/Command Termination | VDDQ/2 | VDDQ/2 | VDDQ
Burst Length | BC4, 8 | BC4, 8 | 8
Bank Grouping | No | 4 | 4
On-Chip Error Detection | No | Command / address parity; CRC for data bus | CRC for data bus
Configuration | x4, x8, x16 | x4, x8, x16 | x16, x32
Package | 78-ball / 96-ball FBGA | 78-ball / 96-ball FBGA | 170-ball FBGA
Data Rate (Mbps/Pin) | 800 – 2,133 | 1,600 – 3,200+ | 4,000 – 7,000
Component Density | 1 GB – 8 GB | 2 GB – 16 GB | 512 MB – 2 GB
Stacking Options | DDP, QDP | Up to 8H (128-GB stack); single load | No
DDR4 Power Savings
DDR4 Power Savings Features
DDR4 voltage is 1.2 V (up to 40% savings)
− Lower voltage than DDR3 (1.5 V)
− On-die VREF
− Pseudo-open drain I/Os
Manages refreshes (up to 20% savings)
− Based on temperature
New DDR4 low-power auto self-refresh (LPASR) capability
− Changes refresh rate based on temperature
− Only refreshes the parts of the array that are in use
  Controller must allow fine-granularity refresh based on memory utilization
Supports data bus inversion
− Limits number of signals transitioning, reducing simultaneous switching output (SSO) and saving power
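The data bus inversion bullet can be made concrete with a small sketch. The slide does not give the encoding rule; the version below assumes the DC-style rule commonly described for pseudo-open-drain buses, where a byte is inverted whenever more than four of its eight bits would be driven low, and an active-low DBI_n flag tells the receiver to undo the inversion. The function names are hypothetical.

```python
def dbi_encode(byte):
    """Encode one 8-bit lane; returns (driven_byte, dbi_n)."""
    byte &= 0xFF
    zeros = 8 - bin(byte).count("1")
    if zeros > 4:                      # assumed threshold: more than four 0s
        return (~byte) & 0xFF, 0       # invert and assert DBI_n (active low)
    return byte, 1                     # send unchanged, DBI_n deasserted


def dbi_decode(driven_byte, dbi_n):
    """Undo the inversion at the receiver."""
    driven_byte &= 0xFF
    return driven_byte if dbi_n else (~driven_byte) & 0xFF


def driven_zeros(byte):
    """Count the bits of a byte that are driven low."""
    return 8 - bin(byte & 0xFF).count("1")
```

Under this assumed rule no encoded byte ever drives more than four data bits low, which is where the current and SSO reduction would come from.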
Creating a Data Valid Window
Timing Margins Are Shrinking
Shrinking Timing Margins in Picoseconds (400 Mbps to 3,200 Mbps)
Generation | Data Valid Window | DRAM Margin | Package/Board Margin | Chip Margin
DDR1 | 2,500 | 900 | 800 | 800
DDR2 | 938 | 425 | 256 | 256
DDR3 | 469 | 188 | 140 | 140
DDR4 | 313 | 125 | 93 | 93
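The window figures on this slide follow from the data rate: one unit interval is the reciprocal of the per-pin rate. The sketch below reproduces the 2,500 ps and roughly 313 ps endpoints from the 400 Mbps and 3,200 Mbps rates quoted on the slide, and checks that the three margin columns add up to the window. The helper names are hypothetical, and the column order (window, DRAM, package/board, chip) is read from the slide labels.

```python
def unit_interval_ps(data_rate_mbps):
    """Width of one bit time in picoseconds for a given per-pin data rate."""
    return 1e6 / data_rate_mbps


# (data valid window, DRAM margin, package/board margin, chip margin) in ps
BUDGET_PS = {
    "DDR1": (2500, 900, 800, 800),
    "DDR2": (938, 425, 256, 256),
    "DDR3": (469, 188, 140, 140),
    "DDR4": (313, 125, 93, 93),
}


def unallocated_ps(generation):
    """Window left over after the three margin allocations (rounding residue)."""
    window, dram, board, chip = BUDGET_PS[generation]
    return window - (dram + board + chip)
```

The leftover is 0 to 2 ps in every generation, so the whole window is already spoken for by the three budgets.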
Shrinking the Window Even More: DDR4 VREF Training (1/2)
DDR4 VREF training
− Training: sweep VREF setting, find maximum passing window
  Lump sum of DCD, RX offset, etc.
  Resolution error is the combination of (VREF, PI, or delay chain)
− Margin loss calculation
  VREF step size: from 0.5% VDDQ to 0.8% VDDQ
  VREF set tolerance: 1.625% or 0.15%
Calibration error: 1 step size
− 0.8% * VDDQ = 0.8% * 1.2 V = 9.6 mV
Margin loss (due to VREF calibration error)
− 9.6 mV * 2 / slew_rate = 4.8 ps (assume slew rate = 4 V/ns)
Calibration error = half step size
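Parameter | Symbol | Min | Typ | Max | Unit | Note
Vref Step Size | Vref step | 0.50% | 0.65% | 0.80% | VDDQ | 2
Vref Set Tolerance | Vref_set_tol | -1.625% | 0.00% | 1.625% | VDDQ | 3, 4, 6
Vref Set Tolerance | Vref_set_tol | -0.15% | 0.00% | 0.15% | VDDQ | 3, 5, 7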
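The margin-loss arithmetic on this slide can be replayed in a few lines. This is only a sketch of the slide's own calculation, assuming a one-step calibration error that costs margin on both edges of the eye; the function name and default arguments are hypothetical, with VDDQ = 1.2 V and the 4 V/ns slew rate taken from the slide.

```python
def vref_margin_loss(step_pct_vddq, vddq_v=1.2, slew_v_per_ns=4.0):
    """Return (voltage error in mV, timing margin loss in ps) for one VREF step."""
    error_mv = step_pct_vddq / 100.0 * vddq_v * 1000.0
    # mV divided by V/ns gives ps; the factor 2 counts both eye edges.
    loss_ps = 2.0 * error_mv / slew_v_per_ns
    return error_mv, loss_ps
```

With the worst-case 0.8% step this gives 9.6 mV and 4.8 ps; the minimum 0.5% step gives 6 mV and 3 ps.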
Shrinking the Window Even More: DDR4 VREF Training (2/2)
Discussion with JEDEC members
− DDR4 specification section 13.4: any DRAM component-level variation must be accounted for within the DRAM RX mask. This means that the VREF calibration error is included in VdIVW_total.
− VREF_DQ internally aligns to VCENT_DQs with training. VCENT_DQs has variation. VREF_DQ training error should increase with this variation, internal voltage noise, etc.
Shrinking the Window Even More: Duty Cycle Error
DDR4 specification is +/-2% tCK = +/-0.04 UI
− IPD current budget +/-3% tCK
Margin loss is 4% tCK
With proper link timing calibration
− 2% tCK margin loss
Assume same for read
[Figure: DQS and DQ waveforms, each annotated with +/-2% duty cycle error]
Timing Parameters by Speed Bin for DDR4-2400 to DDR4-3200
Parameter | Symbol | DDR4-2400 MIN | DDR4-2400 MAX | DDR4-2666 MIN | DDR4-2666 MAX | DDR4-3200 MIN | DDR4-3200 MAX | Units | Note
Clock Timing
Minimum Clock Cycle Time (DLL Off Mode) | tCK (DLL_OFF) | 8 | - | 8 | - | 8 | - | ns | 22
Average Clock Period | tCK (avg) | TBD | | | | | | ps |
Average High Pulse Width | tCH (avg) | 0.48 | 0.52 | 0.48 | 0.52 | 0.48 | 0.52 | tCK (avg) |
Average Low Pulse Width | tCL (avg) | 0.48 | 0.52 | 0.48 | 0.52 | 0.48 | 0.52 | tCK (avg) |
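To put the duty-cycle percentages into picoseconds, the sketch below converts a fraction of tCK into time and into unit intervals. It assumes double data rate signaling, so one clock period spans two unit intervals; the function names are hypothetical and the DDR4-3200 rate is used only as an example.

```python
def tck_ps(data_rate_mbps):
    """Clock period in ps, assuming two data bits per clock cycle (DDR)."""
    return 2e6 / data_rate_mbps


def pct_tck_to_ps(pct_tck, data_rate_mbps):
    """Convert a percentage of tCK into picoseconds."""
    return pct_tck / 100.0 * tck_ps(data_rate_mbps)


def pct_tck_to_ui(pct_tck):
    """Convert a percentage of tCK into unit intervals (1 tCK = 2 UI)."""
    return pct_tck / 100.0 * 2.0
```

At 3,200 Mbps the 4% tCK loss is 25 ps out of a roughly 313 ps window, and calibration brings it down to 12.5 ps (2% tCK, or 0.04 UI).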
Shrinking the Window Even More: Calculating the PLL Jitter
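[Figure: frequency-domain plots of the current profile I(f), PDN impedance Z(f), jitter sensitivity S(f), and PSRR of the PLL P(f) are multiplied to give the jitter spectrum J(f); an iFFT then yields the TIE jitter j(t), from which the p-p jitter is read]
$$I(f) \times Z(f) \times S(f) \times P(f) = J(f) \xrightarrow{\text{iFFT}} j_{\text{TIE}}(t)$$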
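The jitter flow on this slide (multiply the four spectra, inverse-transform, read off peak-to-peak) can be sketched with a plain inverse DFT. This is an illustration only: it assumes all four spectra are sampled on the same frequency grid with conjugate-symmetric bins so the time-domain result is real, and the function names are hypothetical.

```python
import cmath


def inverse_dft(spectrum):
    """Naive inverse DFT; adequate for a short illustrative spectrum."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def tie_jitter(i_f, z_f, s_f, p_f):
    """Return (j(t) samples, peak-to-peak jitter) from per-bin spectra.

    i_f: current profile, z_f: PDN impedance, s_f: jitter sensitivity,
    p_f: PLL PSRR, all given per frequency bin.
    """
    j_f = [i * z * s * p for i, z, s, p in zip(i_f, z_f, s_f, p_f)]
    j_t = [sample.real for sample in inverse_dft(j_f)]
    return j_t, max(j_t) - min(j_t)
```

A single tone in the current profile with flat unity Z, S, and P comes back as a cosine in j(t), whose peak-to-peak value is twice its amplitude.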
DDR4 Bank Group Timing
Different timing within a group and between groups (tCCD, tWTR, tRRD)
− “Long” timing: bank-to-bank within a group
− “Short” timing: access to different bank groups
Maintain array timing requirements within bank group
Maintain speed between different bank groups
[Figure: four bank groups (Bank Group 0 to Bank Group 3), each containing Bank 0 to Bank 3, with arrows labeled Short Timings and Long Timings]
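The long/short distinction can be illustrated with a tiny scheduler helper that picks the column-to-column gap based on whether consecutive accesses stay in the same bank group. The cycle counts used as defaults are placeholders chosen for illustration, not values from the slides, and the function names are hypothetical.

```python
def ccd_cycles(prev_group, next_group, tccd_s=4, tccd_l=6):
    """Gap in clocks between two column commands (placeholder cycle counts)."""
    return tccd_l if prev_group == next_group else tccd_s


def total_gap_cycles(bank_groups, tccd_s=4, tccd_l=6):
    """Sum of the gaps for a sequence of accesses, given each access's bank group."""
    return sum(
        ccd_cycles(a, b, tccd_s, tccd_l)
        for a, b in zip(bank_groups, bank_groups[1:])
    )
```

With these placeholder values, four back-to-back accesses to one group cost 18 clocks of spacing, while rotating across groups costs 12, which is why a controller tries to interleave bank groups.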
Calibration Is Critical to Shrinking Margins
[Figure: bar chart of margin (ns), axis from -0.1 to 0.5, with contributions from FPGA effects, external effects, calibration effects, and calibration uncertainty]
No Margin Without Calibration
What is Calibration?
Resync Calibration
− Benefit: Accurate strobe placement, more resync margin
[Figure: DQ0 to DQ71 across phases 0 to 360, with the valid data window marked]
VT Compensation
− Voltage and temperature tracking: data shifts due to VT variations
− Benefit: Dynamic phase adjustment to match shifting data valid window, robust over VT
Capture Calibration (De-skew)
− Benefit: Reduce skew between data group, more capture margin
[Figure: DQ0 to DQ7 across phases 0 to 180. Before de-skew: small valid capture window. After de-skew: maximized valid capture window.]
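A common way to implement this kind of capture calibration is to sweep a delay setting, record pass/fail at each tap, and center on the widest passing run. The sketch below shows that idea and a simple per-bit de-skew that delays every DQ to line up with the latest one. It is a generic illustration, not the calibration algorithm of any particular controller, and all names are hypothetical.

```python
def center_of_widest_window(passes):
    """Return the middle tap of the longest run of passing taps, or None."""
    best_start, best_len, start = None, 0, None
    for tap, ok in enumerate(list(passes) + [False]):  # sentinel closes last run
        if ok and start is None:
            start = tap
        elif not ok and start is not None:
            if tap - start > best_len:
                best_start, best_len = start, tap - start
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


def deskew_delays(per_bit_passes):
    """Extra delay taps per DQ so every bit's window center lines up."""
    centers = [center_of_widest_window(p) for p in per_bit_passes]
    if any(c is None for c in centers):
        raise ValueError("at least one bit has no passing window")
    target = max(centers)
    return [target - c for c in centers]
```

After the per-bit delays are applied, the shared strobe can be placed at the common center, which is what enlarges the group's capture window.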
High-Level Output Topology
Calibration knobs
− DQ-out1 and DQ-out2 delay: Control the delay applied to outgoing DQ pins
− DQS-out1 and DQS-out2 delay: Control the delay applied to outgoing DQS pins
− Write leveling output: Changes the delay on both DQ and DQS relative to the memory clock, in phase taps
[Figure: output path block diagram. CLK feeds X phase and X+90 phase selections under ptap control; DQS passes through DQS OUT1 and DQS OUT2 delays (DQS out dtap1 and dtap2 control); DQ passes through DQ OUT1 and DQ OUT2 delays (DQ out dtap1 and dtap2 control)]
High-Level Input Topology
Calibration knobs
− DQ-in delay: Controls the delay applied to incoming DQ pins
− DQS-in delay: Controls the delay applied to incoming DQS pins
− LFIFO: Controls number of cycles after read command that data is read out of the LFIFO
− DQS-En phase: Controls the delay on DQS En in phase taps
− DQS-En delay: Controls the delay on DQS En in dtaps
− VFIFO: Adjusts the delay in cycles applied to controller-provided DQS burst signal to generate DQS enable
[Figure: input path block diagram. DQS passes through DQS IN delay and the DQS delay chain (DQS in dtap control); DQ passes through DQ IN delay (DQ in dtap control) into DDIOin; DQS enable is generated through VFIFO (vfifo control), X phase (dqs_en ptap control), and DQS En delay (DQS en dtap control); read data exits through LFIFO (lfifo control)]
Calibration Stages
DQS-enable calibration
− Calibrate DQS enable (delayed read data valid) relative to DQS
Post-amble tracking
− Track DQS-enable across temperature variation
Read data deskew
− Calibrate DQS relative to read command (read leveling)
− Calibrate DQ versus DQS (per-bit deskew) for reads
LFIFO training
− Calibrate LFIFO delay cycles (read latency)
Write leveling
− Calibrate DQS and DM to write command (write leveling)
Write data deskew
− Calibrate DQ versus DQS (per-bit deskew) for writes
Address/command training (leveling and deskew)
− Calibrate CS, CAS, RAS, and ODT versus memory clock
VREF training (FPGA and memory)
− Calibrates receiver voltage threshold (for DDR4 with pseudo open drain DQs)
[Figure: calibration flowchart with the boxes Start; Wait for PLL/DLL locking; Initialize INST/AC ROM for all pins on this memory interface; Initialize the memory (mode registers, etc.); Calibrate the memory interface; decision "All memory interfaces calibrated?" closing the calibration loop; and a user mode loop with the decisions "User command found in DPRIO?" and "User command found in RAM?" leading to Process DPRIO user command and Process RAM user command]
Good Layout Practices for DDR4
DDR4 Output Driver
DDR3 – Push-Pull DDR4 – Pseudo Open Drain
Content Courtesy of Micron
Unadjusted, Non-Terminated Data Eye
[Figure: data eye annotated with jitter, overshoot, and undershoot relative to VDD and VSS]
Content Courtesy of Micron
Terminated Data Eye
[Figure: data eye annotated with VIHac, VILac, Vref, VIHdc, VILdc, hi-ringback, lo-ringback, overshoot, and undershoot]
Content Courtesy of Micron
OCT from the Controller Standpoint
DQ and CA pins are terminated differently in DDR4
Specification | DDR3 | DDR4
Density / Speed | 512 Mb ~ 8 GB, 1.6 ~ 2.1 Gbps | 2 GB ~ 16 GB, 1.6 ~ 3.2 Gbps
Interface
Voltage (VDD / VDDQ / VPP) | 1.5 V / 1.5 V / NA (1.35 V / 1.35 V / NA) | 1.2 V / 1.2 V / 2.5 V
VREF | External VREF (VDD / 2) | Internal VREF (need training)
Data I/Os | CTT (34 ohm) | POD (34 ohm)
CMD/ADDR I/Os | CTT | CTT
Strobe | Bi-directional / differential | Bi-directional / differential
Core Architecture
Number of banks | 8 | 16 (4 GB)
Page size (x4 / x8 / x16) | 1 KB / 1 KB / 2 KB | 512 B / 1 KB / 2 KB
Number of prefetch | 8 bits | 8 bits
Added function | RESET / ZQ / Dynamic ODT | + CRC / DBI / Multi preamble
Physical
Package type / balls (x4, x8 / x16) | 78 / 96 BGA | 78 / 96 BGA
DIMM type | R, LR, U, SoDIMM | + ECC SoDIMM
DIMM pins | 240 (R, LR, U) / 204 (So) | 284 (R, LR, U) / 256 (So)
OCT Calibration Scheme to Support DDR4
OCT can calibrate 2 times with 2 sets of pins (DQ/CA)
DQ and CA pins will have 2 different sets of codes in DDR4
[Figure: OCT calibration schemes for DDR3 and DDR4]
General Layout Concerns
Avoid crossing splits in the power plane
SSO on controller collapsed strobes/clocks
− Separate supplies and/or flip-chip packaging helps
Low-pass VREF filtering on controller helps
Minimize VREF noise
Minimize intersymbol interference (ISI)
Minimize crosstalk
Content Courtesy of Micron
Layout and Termination (1/12)
Signal integrity review
− Importance of transmission line theory
  Today’s clock rates are too fast to ignore
− Matched impedance line is important for good signaling
  Mismatched impedance lines result in reflections
  Termination schemes are used to reduce / eliminate reflections
− Good power bussing is paramount to reducing SSO
  SSO reduces voltage and timing margins
− Decoupling capacitor needs and requirements
Content Courtesy of Micron
Layout and Termination (2/12)
Signal integrity analysis is paramount to developing cost-effective high-speed memory systems
− Develop timing budget for proof of concept
− Use models to simulate
− Board skews are important and should be accounted for
− ISI, crosstalk, VREF noise, path length matching, Cin and RTT mismatch – employ industry practices and assumptions
− Model vias too
− Eliminate return path discontinuities (RPDs)
− Minimize SSO effects
  Difficult to model
Content Courtesy of Micron
Layout and Termination (3/12)
DRAM and controller package parasitics are fixed
− SSO effects already contained in their specified timings
  However, these are to test conditions with specific decoupling
Power delivery network (PDN) for the controller and DRAM needs to be properly designed
Lowering power supply inductance minimizes signaling variations between devices
− Use power and ground planes wherever possible
− Make all power and ground traces as fat as possible
− Couple power and ground as much as possible
  Lowers inductance (mutual effects)
Content Courtesy of Micron
Layout and Termination (4/12)
SSO
− Timing and noise issues generated due to rapid changes in voltage and current caused by multiple circuits switching simultaneously in the same direction
Problems caused by SSO
− False triggers due to power/ground bounce
− Reduced timing margin due to SSO induced skew
− Reduced voltage margin due to power/ground noise
− Slew rate variation
Content Courtesy of Micron
Layout and Termination (5/12)
Good power bussing is paramount to reducing SSO
Reduce L (power delivery effective inductance)
− Use planes for power and ground distribution
− Proper routing of power and ground traces to devices
− Proper use of decoupling capacitance
  Locate as close as possible to the component pins
Reduce dI/dt (switching current slew rate)
− Use the slowest drive edge that will work
− Use reduced drive strength instead of full drive where possible
$$\Delta V = L \cdot \frac{dI}{dt}$$
Content Courtesy of Micron
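The relation ΔV = L · dI/dt on the slide above shows why both knobs matter: halving the effective inductance or doubling the edge time each halve the supply bounce. The sketch below evaluates it for made-up numbers; the inductance, current step, and edge times are illustrative assumptions, not figures from the presentation, and a linear current ramp is assumed.

```python
def supply_bounce_v(inductance_nh, delta_i_a, edge_time_ns):
    """Voltage bounce from L * dI/dt, approximating dI/dt as a linear ramp."""
    inductance_h = inductance_nh * 1e-9
    di_dt = delta_i_a / (edge_time_ns * 1e-9)   # amps per second
    return inductance_h * di_dt
```

For example, 0.1 nH of effective inductance with a 0.2 A step in 0.2 ns gives about 0.1 V of bounce; slowing the edge to 0.4 ns brings that down to about 0.05 V.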
Layout and Termination (6/12)
RPDs induce board noise and are difficult to model
− Splits/holes in reference planes
− Connector discontinuities
− Layer changes
Avoid RPDs if at all possible
− Avoid crossing holes/splits in reference plane
− Route signals so they reference the proper domain
− Add power/ground vias to board
  Especially in dense layer-change areas
− Place decoupling capacitors near connectors
[Figure: solid return path versus split return path]
Content Courtesy of Micron
Layout and Termination (7/12)
VREF noise
− Induces strobe to data skews and reduces voltage margins
− Power/ground plane noise
− Crosstalk
Minimize VREF noise
− Use widest trace practical to route
  From chip to decoupling capacitor
− Use large spacing between VREF and neighboring traces
Content Courtesy of Micron
Layout and Termination (8/12)
ISI
− Occurs when data is random
  Clocks do not have ISI
− Multiple bits on the bus at the same time
  Bus cannot settle from bit #1 before bit #2, etc.
− Signal edges jitter due to previous bit’s energy still on the bus
− Ringing due to impedance mismatches
− Low pass structures can cause ISI
Minimize ISI
− Optimize layout
− Keep board/DIMM impedances matched
  Drive impedance should be same as Zo of transmission line
− Terminate nets
  Termination values should be the same as Zo of transmission line
− Select high-quality connector
  Matched to board/DIMM impedance
  Low mutual coupling
Content Courtesy of Micron
Layout and Termination (9/12)
Crosstalk
− Coupling on board, package, and connector from other signals, including RPDs
  Inductive coupling is typically stronger than capacitive coupling
− When aggressors fire at the same time as victim (e.g. data-to-data coupling)
  Victim edge speeds up or slows down, causing jitter
− When aggressors do not fire at the same time as victim (e.g. data-to-command/address coupling)
  Noise couples onto victim at time of aggressor switching
Content Courtesy of Micron
Layout and Termination (10/12)
Minimize crosstalk
− Keep bits that switch on same “clock” edge routed together
  Route data bits next to other data bits; never next to CMD/ADDR bits
− Isolate sensitive bits (strobes)
  If need be, route next to signals that rarely switch
− Separate traces by at least two to three (preferred) conductor widths (more accurately, one would define by trace pitch and height above reference plane)
  Example: 5-mil trace located 5 mils from a reference plane should have a 15-mil gap to its nearest neighbors to minimize crosstalk
− Choose a high-quality connector
− Run traces as stripline (as opposed to microstrip)
  Not at the cost of additional vias
− Maintain good references for signals and their return paths
− Avoid RPDs
− Keep driver, BD Zo, and ODT selections well matched
Content Courtesy of Micron
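The spacing guideline on the slide above reduces to a one-line rule of thumb, sketched here as a layout check. It only encodes the "two to three conductor widths" rule and ignores height above the reference plane, which the slide notes would be needed for a more accurate definition; the function names are hypothetical.

```python
def min_trace_gap_mils(trace_width_mils, widths=3.0):
    """Minimum edge-to-edge gap using the N-conductor-widths rule of thumb."""
    return widths * trace_width_mils


def gap_is_acceptable(trace_width_mils, gap_mils, widths=3.0):
    """True when the actual gap meets the rule-of-thumb minimum."""
    return gap_mils >= min_trace_gap_mils(trace_width_mils, widths)
```

This reproduces the slide's example: a 5-mil trace calls for a 15-mil gap under the preferred three-width rule, and a 10-mil gap only satisfies the looser two-width rule.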
Layout and Termination (11/12)
Cin mismatch
− Differing input capacitances on receiver pins
− Adds skew to input timings
RTT mismatch
− Termination resistors not at nominal value
− Internal ODT on data pins has smaller variation than on DDR2
  They are calibrated (so is DRAM’s Ron)
− External termination resistor variation must be accounted for
  Consider one-percent resistors
Content Courtesy of Micron
Layout and Termination (12/12)
High-speed signals must maintain a solid reference plane
− Reference plane may be either VDD or ground
− For DDR3 UDIMM systems, the DQ busses are referenced to ground while the ADDR/CMD and clock are referenced to VDD
− All signals may be referenced to ground if the layout allows
Best signaling is obtained when a constant reference plane is maintained
− If this is not possible, try to make the transitions near decoupling capacitors
[Figure: signal routed over a power plane and a ground plane, with a capacitor at the transition]
Content Courtesy of Micron
Board Debug Tools to Minimize Issues
TimeQuest DDR Timing: Read Capture
[Figure: TimeQuest read capture margin report, annotated with the following callouts]
− “Before calibration” is the standard timing analysis
− Calibrating to the FPGA variations (deskew + pessimism removal)
− Calibrating out some of the process variation in the memory
− Errors in the calibration algorithm
− Effects of temperature and voltage changes on the calibration
− Total margin after calibration
EMIF Debug Toolkit Features
Reports results of the last calibration to the user
− Reports interface details, margins observed before calibration, settings made during calibration, and post-calibration margins
− In the case of a calibration failure, toolkit reports the stage at which calibration failed and the group
Provides eye monitor support
Provides loopback support
Allows user interaction with memory interface
− Send commands to the memory interface to recalibrate, mask groups and ranks
− Eye monitor support of data valid window
− Loopback support for bit error rate (BER) testing
[Figure: toolkit screenshot showing the TimeQuest-like GUI interface, commands run shown in console, tasks section, and reports section]
“On-Chip” EMIF Debug Toolkit
Core access to calibration data
− Access same calibration data as the EMIF toolkit, now via FPGA logic
  Via Avalon® Memory-Mapped (Avalon-MM) interface
Looking Ahead and Conclusion
Will There Be a DDR5?
Very unlikely
− SI for a parallel bus of 2 GHz and above would be very difficult
− Timing budget would be consumed in the package
  PDN noise
  Package skew
Transition to stack memory
− Hybrid Memory Cube and serialized memory
− 3D memories integrated into ASICs
Conclusion
DDR4 has many ways to reduce overall system power
− ~50% lower power than DDR3 at 1.5 V
DDR4 is 33% faster than DDR3 2133
But there are challenges...
− Shrinking data valid window
− Increased signal integrity and power integrity concerns
These can be overcome by good controller design
− Innovative calibration
− Good ODT
− Careful board design
− Good board debug tools
Thank You