Dynamic Near Data Processing Framework for SSDs
Gunjae Koo*, Kiran Kumar Matam*, Te I†, H.V. Krishina Giri Nara*, Jing Li‡, Hung-Wei Tseng†, Steven Swanson‡, Murali Annavaram*
*University of Southern California
†North Carolina State University
‡University of California, San Diego
Conventional Storage = Cheap Passive Devices
Conventional storage devices
• Slow, limited bandwidth (SATA: 150 to 600 MB/s)
• Passive devices (read, write, erase)
* Figures from Intel and Western Digital
Storage in Modern Server Systems
Storage devices for Big Data
• Huge volumes of data: slow, slower, much slower
• Data movement is critical for performance
Intelligent Storage
NVM-based storage devices
• No seek time, higher bandwidth over PCIe
• Potential to be active systems
* Figures from Intel
[Figure: an SSD containing a processor, DRAM, and NAND flash packages; the embedded processor is labeled storage processor (SP) and connects to the host]
Near Data Processing (NDP)
[Figure: host CPU connected through the storage interface to the storage processor (SP), with a timeline comparison. Without NDP: data transfer from storage (internal, then external host–storage) followed by data computation at the host. With NDP: data computation at storage.]
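The timeline comparison can be illustrated with a back-of-the-envelope model. This is only a sketch under assumptions that are not in the slides: the stages run one after another, the host does no further work on the reduced results, and every parameter (`internal_bw`, `external_bw`, `host_rate`, `sp_rate`, `result_fraction`) is a hypothetical name chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    data_bytes: float       # bytes read from flash
    result_fraction: float  # fraction of the data left after in-SSD processing (hypothetical)


@dataclass
class System:
    internal_bw: float  # flash -> SSD DRAM bandwidth, bytes/s
    external_bw: float  # SSD -> host bandwidth, bytes/s
    host_rate: float    # host compute throughput, bytes/s
    sp_rate: float      # storage processor compute throughput, bytes/s


def time_without_ndp(w: Workload, s: System) -> float:
    """Internal transfer + external transfer of all data + computation at the host."""
    return (w.data_bytes / s.internal_bw
            + w.data_bytes / s.external_bw
            + w.data_bytes / s.host_rate)


def time_with_ndp(w: Workload, s: System) -> float:
    """Internal transfer + computation at storage + external transfer of the reduced result."""
    reduced = w.data_bytes * w.result_fraction
    return (w.data_bytes / s.internal_bw
            + w.data_bytes / s.sp_rate
            + reduced / s.external_bw)
```

In this model NDP pays off only when the external transfer it avoids outweighs the slower computation on the storage processor.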
Near Data Processing (NDP) on SSDs
[Figure: the same timeline comparison on an SSD. Besides data computation at storage, the SP also runs garbage collection and wear-leveling.]
Obstacles to in-SSD processing
• Less powerful embedded processor
• Dynamic computation resource availability
• Manual workload partitioning is difficult
Summarizer: Dynamic NDP framework for SSD
Summarizer – Basic Concept
[Figure: host CPU connected through the storage interface to the SSD processor (labeled AP in the figure); annotation: monitoring resources]
Summarizer – Detailed Firmware Architecture
[Figure: firmware architecture with the following blocks]
• Host: user applications / operating systems, NVMe host driver, host CPU, host memory holding the submission and completion queues (SQ, CQ)
• Storage interface (PCIe / NVMe)
• SSD front end: I/O controller (NVMe command decoder), request queue, response queue
• SSD firmware: flash translation layer (FTL) and Summarizer, which consists of the task controller, the task queue (TQ), and user functions
• SSD hardware: SSD embedded processors, SSD SoC interconnection, DRAM controller with SSD DRAM, flash controller with NAND flash
Summarizer – Initialization (Function Offloading)
[Figure: firmware architecture diagram annotated with the initialization flow]
• The host issues a new NVMe command, INIT(foo)
• Function offloading: foo() is sent to the SSD
• Function registration: foo() is stored among the user functions as f#1
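The registration step can be sketched as a small function table. This is a minimal illustration, not the firmware's actual data structure: `UserFunctionTable`, `init`, and `lookup` are hypothetical names, and the only behavior taken from the slide is that an offloaded function receives an id such as f#1.

```python
class UserFunctionTable:
    """Hypothetical stand-in for the 'User Functions' store in the SSD firmware."""

    def __init__(self):
        self._functions = {}
        self._next_id = 1

    def init(self, func):
        """Model of INIT(func): register the offloaded function and return its id."""
        func_id = self._next_id
        self._functions[func_id] = func
        self._next_id += 1
        return func_id

    def lookup(self, func_id):
        """Return the function registered under func_id."""
        return self._functions[func_id]


def foo(values):
    """Example user function: count the values greater than 5."""
    return sum(1 for v in values if v > 5)


def goo(values):
    """Example user function: sum all values."""
    return sum(values)
```

Registering `foo` and then `goo` on a fresh table yields ids 1 and 2, mirroring the f#1 and f#2 labels in the figure.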
Summarizer – Computation (Dynamic mode)
[Figure: firmware architecture diagram annotated with the computation flow; user functions foo() (f#1) and goo() (f#2) are registered]
• The host issues a new NVMe command, RD&PROC(LBA, foo)
• The I/O controller decodes the new NVMe command
• After address translation the request becomes RD&PROC(PPA, foo)
Summarizer – Computation (Dynamic mode)
[Figure: firmware architecture diagram; user functions foo() (f#1) and goo() (f#2) are registered]
• RD&PROC(PPA, foo) is split into page-level requests RD&P(PPA1, foo) and RD&P(PPA2, foo)
• The page data for RD&P(PPA1, foo) is read from NAND flash
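The translation and page-level split can be sketched as follows. The sketch assumes a page-granular mapping table held in a plain dictionary and one logical block per flash page; `PageRequest`, `split_rd_proc`, and the mapping contents are hypothetical and only illustrate the RD&PROC to RD&P step shown on the slide.

```python
from typing import Dict, List, NamedTuple


class PageRequest(NamedTuple):
    """One page-level RD&P(PPA, func) request."""
    ppa: int      # physical page address
    func_id: int  # id of the registered user function, e.g. 1 for f#1


def split_rd_proc(lba_start: int, num_pages: int, func_id: int,
                  mapping: Dict[int, int]) -> List[PageRequest]:
    """Translate each logical block address and emit one page-level request per page."""
    return [PageRequest(mapping[lba], func_id)
            for lba in range(lba_start, lba_start + num_pages)]
```

Each resulting request can then be issued to the flash controller independently, which is what makes page-level control possible.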
Summarizer – Computation (Dynamic mode)
[Figure: firmware architecture diagram; user functions foo() (f#1) and goo() (f#2) are registered]
• The page data for RD&P(PPA1, foo) arrives in buf1
• The task (buf1, foo) is registered in the TQ (figure label: CC/Proc)
Summarizer – Computation (Dynamic mode)
[Figure: firmware architecture diagram; user functions foo() (f#1) and goo() (f#2) are registered]
• The page data for RD&P(PPA1, foo) arrives
• TQ is full (figure label: CC)
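The two cases above (register in TQ, or TQ is full) can be sketched as a bounded task queue. The slides do not spell out what happens to a page when the TQ is full; the sketch assumes the page is handed to the host unprocessed, and `TaskController`, `on_page_ready`, `run_one`, and `sent_to_host` are hypothetical names.

```python
from collections import deque


class TaskController:
    """Hypothetical model of the task controller's dynamic-mode decision."""

    def __init__(self, tq_capacity):
        self.tq = deque()          # task queue (TQ) of (buffer, function) pairs
        self.tq_capacity = tq_capacity
        self.sent_to_host = []     # pages returned without in-SSD processing

    def on_page_ready(self, buf, func):
        """Called when the flash controller delivers page data."""
        if len(self.tq) < self.tq_capacity:
            self.tq.append((buf, func))  # "Register in TQ"
            return "in_ssd"
        self.sent_to_host.append(buf)    # "TQ is full": assumed fallback to the host
        return "host"

    def run_one(self):
        """The embedded processor takes one task from the TQ and runs it, if any."""
        if not self.tq:
            return None
        buf, func = self.tq.popleft()
        return func(buf)
```

Because the decision is made per page, the share of work done in the SSD follows how quickly the embedded processor drains the queue.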
Summarizer – Finalization
[Figure: firmware architecture diagram annotated with the finalization flow; user functions foo() (f#1) and goo() (f#2) are registered]
• The host issues a new NVMe command, FINAL(foo)
• The results of foo() are returned to the host
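One way to picture finalization is a per-function partial result that grows as pages are processed and is handed back on FINAL. The slides do not say how results are accumulated, so the merge step here is an assumption, and `InSsdFunction`, `rd_proc`, and `final` are hypothetical names.

```python
class InSsdFunction:
    """Hypothetical per-function state kept in the SSD between INIT and FINAL."""

    def __init__(self, process_page, merge):
        self.process_page = process_page  # user function applied to one page
        self.merge = merge                # combines two partial results
        self.partial = None               # running result, empty until the first page

    def rd_proc(self, page):
        """Process one page in the SSD and fold its output into the partial result."""
        out = self.process_page(page)
        self.partial = out if self.partial is None else self.merge(self.partial, out)

    def final(self):
        """Model of FINAL(func): return the accumulated result and reset the state."""
        result, self.partial = self.partial, None
        return result
```

For example, a filter-and-sum function applied to two pages returns a single number at finalization instead of both pages.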
Evaluation Platform
• LS2085a intelligent SSD development platform
• ARM cores running FTL and Summarizer firmware
• FPGA implementing NAND flash controller
• PCIe Gen. 3 x4 lanes for host communication
[Figure: LS2085a block diagram. CPU cores with L1D (32KB), L1I (48KB), and L2 (1MB) caches connect through an interconnection to the DDR4 memory controller and DRAM, to a PCIe link between the host and the LS2085a, and to a PCIe link between the LS2085a and the FPGA (Altera Stratix V), which attaches the NAND flash DIMMs.]
[Photo: development board with labels for the ARM processor, DRAM, Altera Stratix V FPGA, NAND flash DIMMs, and PCIe (to host)]
Evaluation - Performance
[Figure: TPC-H Query6 execution time, normalized to the CPU-only baseline and split into SSD time and host time. X-axis: static workload offloading ratio from 0 (CPU-only processing, baseline) to 1 (SSD-only processing), plus a bar for Summarizer dynamic offloading. Y-axis: 0 to 4.]
• SSD time: SSD processing + transfer time (internal + external + in-SSD processing)
• Host time: host CPU processing time
Evaluation - Performance
[Figure: TPC-H Query6 chart repeated, with a breakdown of execution time normalized to baseline]
• CPU only: SSD time 0.70, host time 0.30
• Dynamic: SSD time 0.60, host time 0.24
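The breakdown on this slide can be turned into an overall reduction with a few lines of arithmetic. The four numbers are the chart labels from the slide; the names `cpu_only`, `dynamic`, `total`, and `reduction` are introduced here for illustration, and the sketch assumes SSD time and host time simply add up to total execution time.

```python
# Chart labels from the slide, normalized to the CPU-only baseline.
cpu_only = {"ssd_time": 0.70, "host_time": 0.30}
dynamic = {"ssd_time": 0.60, "host_time": 0.24}


def total(breakdown):
    """Total normalized execution time, assuming the two components add up."""
    return sum(breakdown.values())


def reduction(baseline, candidate):
    """Fractional reduction in execution time relative to the baseline."""
    return 1.0 - total(candidate) / total(baseline)
```

With these labels, `reduction(cpu_only, dynamic)` comes to about 0.16, a 16% shorter execution time than CPU-only processing.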
Evaluation - Performance
[Figure: TPC-H Query6 chart repeated with the breakdown below, next to the timeline comparison. Without NDP: data transfer from storage (internal, then external host–storage) followed by data computation at the host. With NDP: data computation at storage.]
• CPU only: SSD time 0.70, host time 0.30
• Dynamic: SSD time 0.62, host time 0.24
Evaluation - Performance
[Figure: TPC-H Query6 chart repeated]
• Performance degraded by static NDP
Evaluation - Performance
[Figure: four charts of execution time (normalized to baseline), annotated 16%, 10%, 20%, and 7%]
Design Exploration – Better SSD Processor
[Figure: host CPU connected through the storage interface to the SSD processor (labeled AP in the figure)]
• A better embedded processor is cost effective
Design Exploration – Higher Internal Bandwidth
[Figure: speedup (0% to 120%) versus embedded processor performance (X1, X2, X4, X8, X16) for TPC-H Query6, TPC-H Query1, TPC-H Query14, String Similarity Join, and the average]
Summarizer is a cost effective NDP solution with powerful storage processors
Conclusion
▪ Dynamic computation offloading framework
• Opportunistic in-SSD computation
• Page-level task control
• Optimal performance improvement
▪ Summarizer programming model
✓ Dynamic NDP framework for SSDs
• Opportunistically enables in-SSD processing
• Page-level NDP control
• Automatic workload partitioning
✓ Summarizer programming model
• Evaluation on the real development platform
• Explored design space for future SSDs
Thank you
(We thank Dell EMC for supporting the SSD development board)
Summarizer: Trading Communication with Computing Near Storage (MICRO '17)