trading communication with computing near storage...summarizer: trading communication with computing...
TRANSCRIPT
Summarizer: Trading Communication with Computing Near Storage
Gunjae Koo*, Kiran Kumar Matam*, Te I†, H.V. Krishina Giri Nara*, Jing Li‡,Hung-Wei Tseng†, Steven Swanson‡, Murali Annavaram*
*University of Southern California†North Carolina State University
‡University of California, San Diego
Host
Motivation – High Data Movement Cost
2
CPU Storage interface
Data computation @ host Data transfer from storage
External (host -- storage) Internal
Limited data bandwidth High access latency
External (host -- storage) Internal
StorageProcessor
(SP)
Host
Near Data Processing (NDP)
3
CPU Storage interface
Data computation @ host Data transfer from storage
InternalExternal (host – storage)
Host
CPU
Near Data Processing (NDP)
4
Storage interface
StorageProcessor
(SP)
Data computation @ host Data transfer from storage
InternalExternal (host – storage)
W/O NDP
With NDPData computation @ storage
Host
Near Data Processing (NDP) on SSDs
5
CPU Storage interface SP
Data computation @ host Data transfer from storage
InternalExternal (host – storage)
W/O NDP
With NDPData computation @ storage
Garbage collection
Wear-leveling
Data computation @ storage
Host
Near Data Processing (NDP) on SSDs
6
CPU Storage interface SP
Data computation @ host Data transfer from storage
InternalExternal (host – storage)
W/O NDP
With NDP
Garbage collection
Wear-leveling
Data computation @ storage
Obstacles to in-SSD processing
• Less powerful embedded processor
• Dynamic computation resource availability
• Manual workload partitioning is difficult Summarizer: Dynamic NDP framework for SSD
Host
CPU
Summarizer – Basic Concept
7
Storage interface AP
Monitoring resources
Host
CPU
Summarizer – Basic Concept
8
Storage interface AP
Monitoring resources
Summarizer – Detailed Firmware Architecture
9
Host Memory
SQ CQ
Host CPU
Sto
rag
e I
nte
rfa
ce (P
CIe
/ N
VM
e)
SSD Firmware
NAND FlashNAND FlashNAND FlashNAND Flash
Flash Controller
SSD DRAM
DRAM Controller
Summarizer
User Functions
TQ
Re
qu
est
qu
eu
e
Re
spo
nse
qu
eu
e
I/O Controller(NVMe command decoder)
SSD SoC Interconnection
Flash Translation Layer (FTL)
NVMe Host Driver
User Applications /Operating Systems
Task Controller
SSD Embedded Processors
Normal Page Read Request
10
Host Memory
SQ CQ
Host CPU
Sto
rag
e I
nte
rfa
ce (P
CIe
/ N
VM
e)
SSD Firmware
NAND FlashNAND FlashNAND FlashNAND Flash
Flash Controller
SSD DRAM
DRAM Controller
Summarizer
User Functions
TQ
Re
qu
est
qu
eu
e
Re
spo
nse
qu
eu
e
I/O Controller(NVMe command decoder)
SSD SoC Interconnection
Flash Translation Layer (FTL)
NVMe Host Driver
User Applications /Operating Systems
Task Controller
RD ( LBA)
(RD) PPA
Normal Page Read Request
11
Host Memory
SQ CQ
Host CPU
Sto
rag
e I
nte
rfa
ce (P
CIe
/ N
VM
e)
SSD Firmware
NAND FlashNAND FlashNAND FlashNAND Flash
Flash Controller
SSD DRAM
DRAM Controller
Summarizer
User Functions
TQ
Re
qu
est
qu
eu
e
Re
spo
nse
qu
eu
e
I/O Controller(NVMe command decoder)
SSD SoC Interconnection
Flash Translation Layer (FTL)
NVMe Host Driver
User Applications /Operating Systems
Task Controller
RD(PPA 1)RD(PPA 2)
Page data
Normal Page Read Request
12
Host Memory
SQ CQ
Host CPU
Sto
rag
e I
nte
rfa
ce (P
CIe
/ N
VM
e)
SSD Firmware
NAND FlashNAND FlashNAND FlashNAND Flash
Flash Controller
SSD DRAM
DRAM Controller
Summarizer
User Functions
TQ
Re
qu
est
qu
eu
e
Re
spo
nse
qu
eu
e
I/O Controller(NVMe command decoder)
SSD SoC Interconnection
Flash Translation Layer (FTL)
NVMe Host Driver
User Applications /Operating Systems
Task Controller
Page data
Summarizer – Initialization (Function Offloading)
13
Host Memory
SQ CQ
Host CPU
Sto
rag
e I
nte
rfa
ce (P
CIe
/ N
VM
e)
SSD Firmware
NAND FlashNAND FlashNAND FlashNAND Flash
Flash Controller
SSD DRAM
DRAM Controller
Summarizer
User Functions
TQ
Re
qu
est
qu
eu
e
Re
spo
nse
qu
eu
e
I/O Controller(NVMe command decoder)
SSD SoC Interconnection
Flash Translation Layer (FTL)
NVMe Host Driver
User Applications /Operating Systems
Task Controller
INIT ( foo)
foo()
foo()f#1Function offloading
Function registration
New NVMe command
Summarizer – Computation (Dynamic mode)
14
Host Memory
SQ CQ
Host CPU
Sto
rag
e I
nte
rfa
ce (P
CIe
/ N
VM
e)
SSD Firmware
NAND FlashNAND FlashNAND FlashNAND Flash
Flash Controller
SSD DRAM
DRAM Controller
Summarizer
User Functions
TQ
Re
qu
est
qu
eu
e
Re
spo
nse
qu
eu
e
I/O Controller(NVMe command decoder)
SSD SoC Interconnection
Flash Translation Layer (FTL)
NVMe Host Driver
User Applications /Operating Systems
Task Controller
foo()f#1
RD&PROC( LBA,foo)
New NVMe command
New NVMe command decode
RD&PROC(PPA,foo)
goo()f#2
Summarizer – Computation (Dynamic mode)
15
Host Memory
SQ CQ
Host CPU
Sto
rag
e I
nte
rfa
ce (P
CIe
/ N
VM
e)
SSD Firmware
NAND FlashNAND FlashNAND FlashNAND Flash
Flash Controller
SSD DRAM
DRAM Controller
Summarizer
User Functions
TQ
Re
qu
est
qu
eu
e
Re
spo
nse
qu
eu
e
I/O Controller(NVMe command decoder)
SSD SoC Interconnection
Flash Translation Layer (FTL)
NVMe Host Driver
User Applications /Operating Systems
Task Controller
foo()f#1
RD&PROC(PPA,foo)
RD&P(PPA1,foo)
RD&P(PPA2,foo)
Page data
RD&P(PPA1,foo)
goo()f#2
Summarizer – Computation (Dynamic mode)
16
Host Memory
SQ CQ
Host CPU
Sto
rag
e I
nte
rfa
ce (P
CIe
/ N
VM
e)
SSD Firmware
NAND FlashNAND FlashNAND FlashNAND Flash
Flash Controller
SSD DRAM
DRAM Controller
Summarizer
User Functions
TQ
Re
qu
est
qu
eu
e
Re
spo
nse
qu
eu
e
I/O Controller(NVMe command decoder)
SSD SoC Interconnection
Flash Translation Layer (FTL)
NVMe Host Driver
User Applications /Operating Systems
Task Controller
foo1()f#1
RD&PROC(PPA,foo)
Page data
RD&P(PPA1,foo)
buf1, foo
CC/Proc
Register in TQ
goo()f#2
Summarizer – Computation (Dynamic mode)
17
Host Memory
SQ CQ
Host CPU
Sto
rag
e I
nte
rfa
ce (P
CIe
/ N
VM
e)
SSD Firmware
NAND FlashNAND FlashNAND FlashNAND Flash
Flash Controller
SSD DRAM
DRAM Controller
Summarizer
User Functions
TQ
Re
qu
est
qu
eu
e
Re
spo
nse
qu
eu
e
I/O Controller(NVMe command decoder)
SSD SoC Interconnection
Flash Translation Layer (FTL)
NVMe Host Driver
User Applications /Operating Systems
Task Controller
foo()f#1
RD&PROC(PPA,foo)
Page data
RD&P(PPA1,foo)
CC
TQ is full
goo()f#2
Summarizer – Finalization
18
Host Memory
SQ CQ
Host CPU
Sto
rag
e I
nte
rfa
ce (P
CIe
/ N
VM
e)
SSD Firmware
NAND FlashNAND FlashNAND FlashNAND Flash
Flash Controller
SSD DRAM
DRAM Controller
Summarizer
User Functions
TQ
Re
qu
est
qu
eu
e
Re
spo
nse
qu
eu
e
I/O Controller(NVMe command decoder)
SSD SoC Interconnection
Flash Translation Layer (FTL)
NVMe Host Driver
User Applications /Operating Systems
Task Controller
FINAL ( foo)
New NVMe command
foo()f#1
Results
goo()f#2
Summarizer API and NVMe commands
19
Initialization
Finalization
Computation
• NVMe command: INIT_TSKn• Transfer a in-SSD procedure to SSD memory• Initialize data structure and temporal variables for in-SSD
computation
• NVMe command: READ_PROC_TSKn• Page read command is issued with the flag indicating the user
procedure embedded in SSD memory• Return the special code if the requested page is processed in SSD• Page data is transferred to the host if the requested page is NOT
computed in SSD
• NVMe command: FINAL_TSKn• Gather final in-SSD computation results and transfer to the host
Evaluation Platform
• LS2085a intelligent SSD development platform
• ARM cores running FTL and Summarizerfirmware
• FPGA implementing NAND flash controller
• PCIe Gen. 3 4x lanes for host communication
20
LS2085a
Interconnection
DDR4 Memory Controller
DRAM DRAM
CPU
L1D(32KB)
L2(1MB)
L1I(48KB)
CPU
L1D(32KB)
L1I(48KB)
PC
Ie(h
ost
–L
S2
08
5a
)
PC
Ie(L
S2
08
5a
-F
PG
A)
FPGA(ALTERA Stratix V)
NAND flash DIMMNAND flash DIMMs
CPU
L1D(32KB)
L2(1MB)
L1I(48KB)
CPU
L1D(32KB)
L1I(48KB)
Evaluation - Performance
21
0
1
2
3
4
0 0.2 0.4 0.6 0.8 1
Static Dynamic
TPC-H Query6
SDD time Host time
Static workload offloading
Evaluation - Performance
22
0
1
2
3
4
0 0.2 0.4 0.6 0.8 1
Static Dynamic
TPC-H Query6
SDD time Host time
CPU only processing (baseline) SSD only processing
Evaluation - Performance
23
0
1
2
3
4
0 0.2 0.4 0.6 0.8 1
Static Dynamic
TPC-H Query6
SDD time Host time
Summarizer Dynamic Offloading
Evaluation - Performance
24
0
1
2
3
4
0 0.2 0.4 0.6 0.8 1
Static Dynamic
TPC-H Query6
SDD time Host time
SSD processing + transfer time(internal + external + In-SSD processing)
Host CPU processing time
Evaluation - Performance
25
0
1
2
3
4
0 0.2 0.4 0.6 0.8 1
Static Dynamic
TPC-H Query6
SDD time Host timeExecution time normalized to baseline (CPU only)
Evaluation - Performance
26
0
1
2
3
4
0 0.2 0.4 0.6 0.8 1
Static Dynamic
TPC-H Query6
SDD time Host time
Ex
ecu
tio
n t
ime
(no
rma
lize
d t
o b
ase
lin
e)
Evaluation - Performance
27
0
1
2
3
4
0 0.2 0.4 0.6 0.8 1
Static Dynamic
TPC-H Query6
SDD time Host time
0.70 0.60
0.30
0.24
0.0
0.2
0.4
0.6
0.8
1.0
1.2
CPU only Dynamic
Chart TitleSDD time Host timeE
xe
cuti
on
tim
e (n
orm
ali
zed
to
ba
seli
ne
)
Evaluation - Performance
28
0
1
2
3
4
0 0.2 0.4 0.6 0.8 1
Static Dynamic
TPC-H Query6
SDD time Host time
0.70 0.62
0.30
0.24
0.0
0.2
0.4
0.6
0.8
1.0
1.2
CPU only Dynamic
Chart TitleSDD time Host time
Performance improved by 14%
Data computation @ host Data transfer from storage
InternalExternal (host – storage)
W/O NDP
With NDPData computation @ storage
Evaluation - Performance
29
0
1
2
3
4
0 0.2 0.4 0.6 0.8 1
Static Dynamic
TPC-H Query6
SDD time Host time
Performance degraded by static NDP
Evaluation - Performance
30
16% 10%
20% 7%
Ex
ecu
tio
n t
ime
(no
rma
lize
d t
o b
ase
lin
e)
Ex
ecu
tio
n t
ime
(no
rma
lize
d t
o b
ase
lin
e)
Ex
ecu
tio
n t
ime
(no
rma
lize
d t
o b
ase
lin
e)
Ex
ecu
tio
n t
ime
(no
rma
lize
d t
o b
ase
lin
e)
Design Exploration – Higher Internal Bandwidth
31
Host
CPU Storage interface
Data transfer bottleneck
Commercial SSD maintains internal bandwidth ≈ external bandwidth
Design Exploration – Higher Internal Bandwidth
32
Host
CPU Storage interface SP
Data transfer bottleneck
Higher internal bandwidth without increasing external bandwidth
0%
20%
40%
60%
80%
100%
1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4
TPC-H Query 6 TPC-H Query 1 TPC-H Query 14 String Similarity Join Average
Sp
ee
du
pDesign Exploration – Higher Internal Bandwidth
33
External : Internal bandwidth ratio
0%
20%
40%
60%
80%
100%
1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4 1:1 1:2 1:3 1:4
TPC-H Query 6 TPC-H Query 1 TPC-H Query 14 String Similarity Join Average
Sp
ee
du
pDesign Exploration – Higher Internal Bandwidth
34
Summarizer is effective if an SSD platform has higher internal bandwidth
Design Exploration – Better SSD Processor
35
Host
CPU Storage interface
Better embedded processor is cost effective
AP
Design Exploration – Higher Internal Bandwidth
36
0%
20%
40%
60%
80%
100%
120%
X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16
TPC-H Query6 TPC-H Query1 TPC-H Query14 String Similarity Join Average
Sp
ee
du
p
Embedded processor performance
Design Exploration – Higher Internal Bandwidth
37
0%
20%
40%
60%
80%
100%
120%
X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16 X1 X2 X4 X8 X16
TPC-H Query6 TPC-H Query1 TPC-H Query14 String Similarity Join Average
Sp
ee
du
p
Summarizer is a cost effective NDP solution with powerful storage processors
Conclusion
38
▪Dynamic computation offloading framework• Opportunistic in-SSD computation
• Page-level task control
• Optimal performance improvement
▪ Summrizer programming model
✓ Dynamic NDP framework for SSDs• Opportunistically enables in-SSD processing• Page-level NDP control• Automatic workload partitioning
✓ Summarizer programming model• Evaluation on the real development platform• Explored design space for future SSDs
Thank you
Summarizer:Trading Communication with Computing Near Storage
Gunjae Koo, Kiran Kumar Matam, Te I, H. V. Krishna Giri Nara, Jing Li,Hung-Wei Tseng, Steven Swanson, Murali Annavaram
(We thank to Dell EMC for supporting the SSD development board)