Dynamic Near Data Processing Framework for SSDs
Gunjae Koo*, Kiran Kumar Matam*, Te I†, H.V. Krishna Giri Narra*, Jing Li‡, Hung-Wei Tseng†, Steven Swanson‡, Murali Annavaram*
*University of Southern California
†North Carolina State University
‡University of California, San Diego
Conventional Storage = Cheap Passive Devices
Conventional storage devices
• Slow, limited bandwidth (SATA 150 ~ 600 MB/s)
• Passive devices (read, write, erase)
* Figures from Intel and Western Digital
Storage in Modern Server Systems
Storage devices for Big Data
• Huge volumes of data: slow, slower, much slower
• Data movement is critical for performance
Intelligent Storage
NVM-based storage devices
• No seek time, higher bandwidth over PCIe
• Potential to be active systems
* Figures from Intel
[Figure: SSD internals: SSD processor (storage processor, SP), DRAM, and NAND flash packages]
Near Data Processing (NDP)
[Figure: host CPU connected through the storage interface to the storage processor (SP); execution timelines without and with NDP]
• W/O NDP: data transfer from storage (internal, then external host – storage), then data computation @ host
• With NDP: data computation @ storage
Near Data Processing (NDP) on SSDs
• The SP also runs garbage collection and wear-leveling alongside data computation @ storage
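The timeline comparison above can be written as a toy cost model. This is a minimal sketch, not the authors' model: the function names, the bandwidth and throughput figures, and the `selectivity` parameter (the fraction of data that still crosses the host interface after in-SSD processing) are all illustrative assumptions, and the stages are treated as strictly sequential with no pipelining.

```python
def time_without_ndp(data_mb, internal_mbps, external_mbps, host_mbps):
    """W/O NDP: all data crosses both links, then the host computes on it."""
    internal = data_mb / internal_mbps   # NAND flash -> SSD DRAM
    external = data_mb / external_mbps   # SSD -> host over the storage interface
    compute = data_mb / host_mbps        # data computation @ host
    return internal + external + compute


def time_with_ndp(data_mb, internal_mbps, external_mbps, sp_mbps, selectivity):
    """With NDP: the SP computes in place and only the results leave the SSD."""
    internal = data_mb / internal_mbps
    compute = data_mb / sp_mbps                        # data computation @ storage
    external = selectivity * data_mb / external_mbps   # results only (assumed)
    return internal + compute + external


# Hypothetical numbers, chosen only to show the trade-off.
baseline = time_without_ndp(1000, 2000, 1000, 4000)
slow_sp = time_with_ndp(1000, 2000, 1000, 500, 0.01)
fast_sp = time_with_ndp(1000, 2000, 1000, 2000, 0.01)
```

With these made-up inputs, NDP on a slow SP takes longer than the baseline while NDP on a faster SP finishes sooner, so the benefit depends on how much computing capacity the SP can spare.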
Obstacles to in-SSD processing
• Less powerful embedded processor
• Dynamic computation resource availability
• Manual workload partitioning is difficult
Summarizer: Dynamic NDP framework for SSD
Summarizer – Detailed Firmware Architecture
[Figure: firmware architecture]
• Host: host CPU; host memory (SQ, CQ); user applications / operating systems; NVMe host driver
• Storage interface (PCIe / NVMe)
• SSD firmware: I/O controller (NVMe command decoder); request queue and response queue; Summarizer (task controller, user functions, TQ); flash translation layer (FTL)
• SSD hardware: SSD embedded processors; SSD SoC interconnection; DRAM controller and SSD DRAM; flash controller and NAND flash
Summarizer – Initialization (Function Offloading)
[Figure: firmware architecture, initialization step]
• Function offloading: the host issues INIT(foo), a new NVMe command
• Function registration: foo() is registered in User Functions as f#1
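The registration step can be pictured as a small function table. This is a minimal sketch under stated assumptions, not the Summarizer firmware: the class `FunctionTable`, its `init` and `lookup` methods, and the use of Python callables in place of offloaded code are all hypothetical. The slide shows only that INIT(foo) leads to foo() being stored as f#1.

```python
class FunctionTable:
    """Hypothetical stand-in for the 'User Functions' area of the firmware."""

    def __init__(self):
        self._funcs = {}
        self._next_id = 1  # the slide labels the first function f#1

    def init(self, func):
        """Handle an INIT command: store the function and return its id."""
        fid = self._next_id
        self._funcs[fid] = func
        self._next_id += 1
        return fid

    def lookup(self, fid):
        """Return the function registered under the given id."""
        return self._funcs[fid]


def foo(page):
    """Example user function: sum the values in one page (illustrative)."""
    return sum(page)


def goo(page):
    """Second example user function: largest value in one page (illustrative)."""
    return max(page)


table = FunctionTable()
foo_id = table.init(foo)
goo_id = table.init(goo)
```

Later commands can then refer to a function by its id rather than resending it.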
Summarizer – Computation (Dynamic mode)
[Figure: firmware architecture, computation steps; User Functions holds foo() as f#1 and goo() as f#2]
• The host issues RD&PROC(LBA, foo), a new NVMe command; after the new NVMe command decode, the FTL turns it into RD&PROC(PPA, foo)
• RD&PROC(PPA, foo) is split into page-level requests RD&P(PPA1, foo), RD&P(PPA2, foo); the flash controller returns the page data
• Page data for RD&P(PPA1, foo) arrives in buf1; the task controller registers (buf1, foo) in TQ (CC/Proc)
• When TQ is full, the page request takes the CC path instead of CC/Proc
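The page-level decision in dynamic mode can be sketched as a bounded task queue. This is a minimal sketch under stated assumptions, not the authors' firmware: the names `TaskController`, `on_page_read`, and `run_one` are hypothetical, and the fallback of handing the unprocessed page back for the host when TQ is full is this sketch's reading of the "TQ is full" step.

```python
from collections import deque


class TaskController:
    """Hypothetical page-level task controller with a bounded TQ."""

    def __init__(self, tq_capacity):
        self.tq = deque()
        self.tq_capacity = tq_capacity

    def on_page_read(self, page, func):
        """Called when the flash controller delivers one page of data.

        Returns ("queued", None) if the task was registered in TQ, or
        ("raw", page) if TQ is full and the page goes out unprocessed.
        """
        if len(self.tq) < self.tq_capacity:
            self.tq.append((page, func))
            return ("queued", None)
        return ("raw", page)

    def run_one(self):
        """Run one queued task on the SP; returns None if TQ is empty."""
        if not self.tq:
            return None
        page, func = self.tq.popleft()
        return func(page)


def count_over_10(page):
    """Example user function: count values greater than 10 in a page."""
    return sum(1 for value in page if value > 10)


tc = TaskController(tq_capacity=2)
pages = [[1, 20, 30], [5, 6, 7], [11, 12, 13]]
outcomes = [tc.on_page_read(p, count_over_10)[0] for p in pages]
in_ssd_results = [tc.run_one(), tc.run_one()]
```

Here the first two pages are queued for in-SSD processing and the third falls back to the raw path because the queue has only two slots. How the partition between host and SSD shifts therefore depends on how quickly the SP drains the queue.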
Summarizer – Finalization
[Figure: firmware architecture, finalization step]
• The host issues FINAL(foo), a new NVMe command
• The results of foo() (f#1) are returned to the host
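One way to picture what the host does with the returned results is a merge step. This is a sketch under stated assumptions: the slide shows only that FINAL(foo) returns "Results", so the helper `finalize`, its parameters, and the idea that the host applies the same function to any pages it received unprocessed are hypothetical.

```python
def finalize(in_ssd_partials, raw_pages, func, combine):
    """Merge partial results from the SSD with host-side processing.

    in_ssd_partials: per-page results returned by the SSD (assumed shape)
    raw_pages:       pages the host received unprocessed (assumed)
    func:            the same per-page function that was offloaded
    combine:         reduces a list of per-page results to one answer
    """
    host_partials = [func(page) for page in raw_pages]
    return combine(in_ssd_partials + host_partials)


def count_over_10(page):
    """Example per-page function: count values greater than 10."""
    return sum(1 for value in page if value > 10)


# Two pages were processed in the SSD, one page came back raw.
total = finalize([2, 0], [[11, 12, 13]], count_over_10, sum)
```

The final answer is the same whichever side processed a given page, provided the per-page results can be combined in any order.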
Evaluation Platform
• LS2085a intelligent SSD development platform
• ARM cores running FTL and Summarizer firmware
• FPGA implementing NAND flash controller
• PCIe Gen. 3 x4 lanes for host communication
[Figure: LS2085a block diagram: CPU cores, each with L1D (32KB) and L1I (48KB), and L2 (1MB); interconnection; DDR4 memory controller and DRAM; PCIe (host – LS2085a); PCIe (LS2085a – FPGA); FPGA (Altera Stratix V); NAND flash DIMMs]
[Figure: photo of the platform, labeled ARM processor, NAND flash DIMMs, Altera Stratix V, PCIe (to host), DRAM]
Evaluation - Performance
[Figure: TPC-H Query6. Execution time normalized to baseline (CPU only), y-axis 0 to 4, split into SSD time and Host time, versus static workload offloading from 0 (CPU only processing, baseline) to 1 (SSD only processing), plus a bar for Summarizer dynamic offloading]
• SSD time: SSD processing + transfer time (internal + external + in-SSD processing)
• Host time: host CPU processing time
• Data labels, CPU only vs. Dynamic: SSD time 0.70 vs. 0.60 (0.62 on a later build of the slide); Host time 0.30 vs. 0.24
• Performance degraded by static NDP
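The data labels on the chart above (0.70, 0.60, 0.30, 0.24) can be turned into a reduction in execution time with simple arithmetic. This is a sketch under stated assumptions: the helper names are hypothetical, and the mapping of labels to bars (CPU only: SSD time 0.70, Host time 0.30; Dynamic: SSD time 0.60, Host time 0.24) is inferred from label order rather than stated on the slide. One build of the slide shows 0.62 in place of 0.60.

```python
def normalized_total(ssd_time, host_time):
    """Total execution time as the sum of the two stacked components."""
    return ssd_time + host_time


def reduction(baseline_total, new_total):
    """Fractional reduction in execution time relative to the baseline."""
    return 1.0 - new_total / baseline_total


cpu_only = normalized_total(0.70, 0.30)
dynamic = normalized_total(0.60, 0.24)
dynamic_alt = normalized_total(0.62, 0.24)  # alternate label reading

print(f"{reduction(cpu_only, dynamic):.0%}")      # prints 16%
print(f"{reduction(cpu_only, dynamic_alt):.0%}")  # prints 14%
```

Under the 0.60 reading the reduction is 16%, which equals one of the percentages annotated on the following slide, although the transcript does not say which workload that annotation belongs to.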
Evaluation - Performance
[Figure: four charts of execution time (normalized to baseline), annotated 16%, 10%, 20%, and 7%]
Design Exploration – Better SSD Processor
[Figure: host CPU, storage interface, and AP]
• Better embedded processor is cost effective
Design Exploration – Higher Internal Bandwidth
[Figure: speedup (0% to 120%) versus embedded processor performance (X1, X2, X4, X8, X16) for TPC-H Query6, TPC-H Query1, TPC-H Query14, String Similarity Join, and Average]
• Summarizer is a cost effective NDP solution with powerful storage processors
Conclusion
✓ Dynamic NDP framework for SSDs
• Opportunistically enables in-SSD processing
• Page-level NDP control
• Automatic workload partitioning
• Optimal performance improvement
✓ Summarizer programming model
• Evaluation on the real development platform
• Explored design space for future SSDs