DeltaFS Indexed Massive Dir
Software-Defined Storage For Fast Query
PDSW-DISCS 2017
Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik, Chuck Cranor, Garth Gibson,
Brad Settlemyer, Gary Grider, Fan Guo
Carnegie Mellon University
Los Alamos National Laboratory (LANL)
Key features
1. Requires no dedicated resources
2. Almost no post-processing is needed
3. Low I/O overhead
PDSW-DISCS 2017, http://www.pdl.cmu.edu/
Target workloads
1. Data-intensive HPC simulations
2. Not designed for indexing checkpoints
3. I/O bandwidth is limited
Agenda
Part 1 – Motivation
Part 2 – In-situ indexing design
Part 3 – API, LANL VPIC integration
Conclusion
Existing HPC workflows build indexes during post-processing
Queries are delayed until post-processing is done (5-20% of simulation time)
[Diagram: App, Write, Lustre, Indexing, Tmp, Queries; steps numbered 1 to 3]
Problem faced:
The increasing time-to-science
Due to the growing gap between compute and I/O
Inefficient support for small data
[Timeline: from simulation start to query finish]
Processing data in-transit while data is written to storage
Need separate resources for sorting and indexing
[Diagram: App, Indexing (MapReduce), Tmp, Lustre, Queries]
In-situ indexing directly on app nodes using app resources
No need for a separate indexing cluster
[Diagram: App + Indexing, Tmp, data + index, Lustre, Queries]
Key idea: Reuse storage write-back buffering and idle CPU cycles for in-situ indexing
Example app: LANL VPIC
VPIC simulation
Each VPIC process simulates millions of particles
Particle: 40 bytes
Particles move across processes during a simulation
Small random writes
After simulation: highly selective queries
One output file per VPIC process (file-per-process)
TBs I/O per trajectory fetch
[Diagram: simulation procs (P P P ... 1M) write data objects (... 1M) that mix particles (AB, EC, DF); querying a single particle trajectory (A, B, C) requires a TBs search]
5,000x faster than baseline with DeltaFS in-situ indexing
[Figure: Time for reading a single particle trajectory (10TB, 48 billion particles). Axis: Query Time (sec), log scale from 0.0625 to 4096. Bars: DeltaFS (w/ 1 CPU core); Baseline (full-system parallel scan w/ 3k CPU cores)]
Part II
System design:
Light-weight in-situ indexing
1. Tiny mem footprint
2. Zero write amplification
3. No read back
Resource-efficient indexing by log-structured I/O
Tiny mem footprint, full storage b/w util.
[Diagram: app proc with app thread, buffer, and indexing thread; data log and index on Lustre]
LSM-trees compact all the time, but we can’t afford it
Must aim for low I/O overhead, around 10%-20%
[Timeline: the total simulation alternates Compute and I/O phases]
Compaction easily causes 1000% I/O overhead by reading and rewriting previously written data
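To make the overhead claim concrete, here is a minimal sketch of a toy cost model. It is an assumption made for illustration, not the compaction policy of any real LSM-tree: every buffer flush is merged into a single sorted run, so the existing run is read back and rewritten each time. The function names are hypothetical.

```python
def append_only_io(flushes: int, flush_bytes: int) -> int:
    """Bytes moved when each buffer flush is written once and never rewritten."""
    return flushes * flush_bytes


def full_merge_io(flushes: int, flush_bytes: int) -> int:
    """Bytes moved when every flush is merged into one sorted run.

    Toy model (assumption): on each flush the existing run is read back,
    merged with the new data, and written out again in full.
    """
    total = 0
    run_bytes = 0
    for _ in range(flushes):
        total += run_bytes          # read back the existing sorted run
        run_bytes += flush_bytes    # merge in the newly flushed data
        total += run_bytes          # write the merged run
    return total


def io_overhead_percent(actual: int, ideal: int) -> float:
    """Extra I/O relative to writing the data exactly once."""
    return 100.0 * (actual - ideal) / ideal


ideal = append_only_io(flushes=11, flush_bytes=1)
merged = full_merge_io(flushes=11, flush_bytes=1)
print(io_overhead_percent(merged, ideal))
```

In this model n flushes cost n*n units of I/O instead of n, so the overhead grows linearly with the number of flushes and reaches 1000% after only 11 flushes.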
In-situ indexing by aggressive data partitioning
[Diagram: records A-F, all-to-all shuffle among app processes #0, #1, #2; timeline of alternating Compute and I/O phases]
Bound the amount of data needed per query per timestep
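The partitioning idea can be sketched as follows. This is a single-process simulation under stated assumptions: the `owner`, `shuffle`, and `query` names are hypothetical, CRC32 stands in for whatever partitioning function is actually used, and the network exchange is replaced by plain list operations.

```python
import zlib
from collections import defaultdict


def owner(key: str, num_procs: int) -> int:
    """Map a key to the process responsible for it (deterministic hash)."""
    return zlib.crc32(key.encode()) % num_procs


def shuffle(outputs_by_proc, num_procs):
    """Simulate an all-to-all shuffle: route every (key, value) to owner(key)."""
    partitions = [defaultdict(list) for _ in range(num_procs)]
    for records in outputs_by_proc:
        for key, value in records:
            partitions[owner(key, num_procs)][key].append(value)
    return partitions


def query(partitions, key):
    """Read only the single partition that can contain the key."""
    return partitions[owner(key, len(partitions))].get(key, [])


# Three processes each emit records for whichever particles they hold.
outputs = [[("A", 1), ("B", 2)], [("C", 3), ("A", 4)], [("B", 5), ("C", 6)]]
parts = shuffle(outputs, num_procs=3)
print(query(parts, "A"))
```

Because each key has exactly one owner, a query touches one partition per timestep no matter how many processes wrote records for that key.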
In-situ indexing as a file system lib component
No dedicated cluster needed
[Diagram: app data, shuffle sender, all-to-all shuffle, shuffle receiver, write buffer; Data Log (data blocks); Index Log (index blocks, filter)]
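A minimal sketch of the receiver-side write path named in the diagram, under stated assumptions: the class and its methods are hypothetical, the logs live in memory rather than on Lustre, and a Python set stands in for a compact filter. Each buffer flush appends to the data log once and adds one filter plus index block to the index log, so nothing is read back or rewritten.

```python
class LogStructuredPartition:
    """Toy write buffer that flushes into an append-only data log and index log."""

    def __init__(self, buffer_limit: int):
        self.buffer_limit = buffer_limit
        self.buffer = []             # in-memory write buffer of (key, value)
        self.data_log = bytearray()  # append-only data log
        self.index_log = []          # one (filter, index block) pair per flush

    def put(self, key: str, value: bytes) -> None:
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        index_block = {}
        # Sort by key only; the stable sort keeps same-key records in write order.
        for key, value in sorted(self.buffer, key=lambda kv: kv[0]):
            index_block.setdefault(key, []).append((len(self.data_log), len(value)))
            self.data_log += value
        key_filter = set(index_block)  # stand-in for a compact filter
        self.index_log.append((key_filter, index_block))
        self.buffer.clear()

    def get(self, key: str) -> list:
        self.flush()  # make buffered writes visible before searching
        found = []
        for key_filter, index_block in self.index_log:
            if key not in key_filter:
                continue  # the filter lets us skip this flush entirely
            for offset, length in index_block[key]:
                found.append(bytes(self.data_log[offset:offset + length]))
        return found


part = LogStructuredPartition(buffer_limit=2)
part.put("B", b"b0")
part.put("A", b"a0")
part.put("A", b"a1")
print(part.get("A"))
```

The trade-off in this sketch is that a read consults one filter per flush instead of a single merged index, which is the price of never compacting.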
Part III
Programming interface:
Indexed Massive Directory (IMD)
In-situ indexing keyed on filenames
mkdir("./particles", DELTAFS_IMD)
How to use Indexed Massive Dir (IMD)
1. Data searched together goes into a single IMD file, e.g. one file for each particle
2. Create as many IMD files as you want, e.g. 1 trillion files for 1 trillion particles
Query your data by "open-read-close"
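The usage pattern above can be sketched with ordinary file calls. This is not the DeltaFS API: a plain temporary directory stands in for an IMD, the `DELTAFS_IMD` flag from the slides is not used, and the record layout (`timestep, x, y, z`) and helper names are hypothetical.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<Ifff")  # hypothetical record: timestep, x, y, z


def dump_particle(imd_dir, particle_id, timestep, x, y, z):
    """Append one record to the file named after the particle."""
    with open(os.path.join(imd_dir, particle_id), "ab") as f:
        f.write(RECORD.pack(timestep, x, y, z))


def read_trajectory(imd_dir, particle_id):
    """Query by open-read-close on the particle's own file."""
    with open(os.path.join(imd_dir, particle_id), "rb") as f:
        data = f.read()
    return [RECORD.unpack_from(data, off) for off in range(0, len(data), RECORD.size)]


def demo():
    with tempfile.TemporaryDirectory() as root:
        imd_dir = os.path.join(root, "particles")
        os.mkdir(imd_dir)  # plain directory standing in for an IMD
        for step in range(3):
            for pid in ("A", "B"):
                dump_particle(imd_dir, pid, step, float(step), 0.0, 1.0)
        return read_trajectory(imd_dir, "A")


print(demo())
```

The point of the pattern is that the filename is the search key, so a trajectory query needs no scan over other particles' data.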
VPIC using DeltaFS IMD
One IMD file per VPIC particle (file-per-particle)
[Diagram: simulation procs (P P P ... 1M) write into a 1T Indexed Massive Directory; records are grouped per particle (AA, DD, BB, EE, CC, FF, ...) in data objects with index objects (... 1M); the search for A, B, C shrinks from TBs to MBs]
LANL Trinity Experiments
1-99 compute nodes (32 cores/node), 496 million – 48 billion particles
No post-processing
[Diagram: VPIC-Baseline (VPIC, buffer) and VPIC-DeltaFS (VPIC, DeltaFS indexing, buffer) on a compute node; storage: burst buffer (SSD) and Lustre (HDD); Queries]
[Figure: Query Time (sec, log scale from 0.015625 to 4096) vs. Simulation Size (million particles); Baseline (full-system parallel scan) vs. DeltaFS (w/ 1 CPU core)]
Bar labels by simulation size (million particles):
496 (1 node): 245x
992 (2 nodes): 665x
1,984 (4 nodes): 532x
3,968 (8 nodes): 625x
7,936 (16 nodes): 992x
16,368 (33 nodes): 2221x
32,736 (66 nodes): 4049x
49,104 (99 nodes): 5112x
[Figure: I/O Time per Dump (sec, 0 to 200) vs. Simulation Size (million particles); Baseline vs. DeltaFS; tiny simulations on the left, bigger simulations on the right]
Bar labels by simulation size (million particles):
496 (1 node): 9.63x
992 (2 nodes): 4.78x
1,984 (4 nodes): 2.42x
3,968 (8 nodes): 1.56x
7,936 (16 nodes): 1.29x
16,368 (33 nodes): 1.13x
32,736 (66 nodes): 1.15x
49,104 (99 nodes): 1.13x
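As a small worked example, the ratio labels can be converted into overhead percentages. This assumes each label is DeltaFS I/O time per dump divided by baseline I/O time per dump, which the transcript does not state explicitly; the variable names are hypothetical.

```python
# Simulation sizes (million particles) and ratio labels as listed on the slide.
sizes = [496, 992, 1984, 3968, 7936, 16368, 32736, 49104]
ratios = [9.63, 4.78, 2.42, 1.56, 1.29, 1.13, 1.15, 1.13]


def overhead_percent(ratio: float) -> float:
    """Extra I/O time relative to the baseline, in percent (assumed meaning)."""
    return round((ratio - 1.0) * 100.0, 1)


overheads = {size: overhead_percent(r) for size, r in zip(sizes, ratios)}
for size, pct in overheads.items():
    print(f"{size} million particles: {pct}% overhead")
```

Under that assumption, the three largest runs land at 13% to 15% overhead, while the smallest runs pay far more.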
Conclusion
• Indexed Massive Dir (~3% app mem, compaction-free, POSIX API)
• Powered by Mercury RPC
• DeltaFS is one of the Mochi micro-services
In-situ indexing for transparent, almost-free query acceleration: no dedicated nodes, no post-processing, ~15% I/O overhead
https://mercury-hpc.github.io/
https://press3.mcs.anl.gov/mochi/
https://github.com/pdlfs/deltafs