TRANSCRIPT
Oceane Bel, Kenneth Chang, Nathan R. Tallent, Dirk Duellmann, Ethan L. Miller, Faisal Nawab, Darrell D. E. Long
{obel, kchang44, elm, fnawab, darrell}@ucsc.edu, [email protected], [email protected]
Storage Systems Research Center, University of California at Santa Cruz; Pacific Northwest National Lab; CERN; Pure Storage
Geomancy: Automated Performance Enhancement through Data Layout Optimization
Background and Motivation
❖ Moving the code to the data
➢ Achieved by spreading computational resources among many computation nodes
■ Hadoop
■ MapReduce
■ Twister
➢ Problem: the workload is not storage bottlenecked
■ Physics workloads require hundreds of terabytes of data
➢ Moving compute to the data requires specialized system architectures
❖ Data layout strategies
➢ Based on the data distribution across storage devices
➢ Heuristics cannot determine if a change in latency is long lasting
■ Limited by their inability to quickly search through enough data layouts
❖ Motivating workloads
➢ Traces generated from a Monte Carlo physics workload provided by Pacific Northwest National Lab (PNNL)
■ Utilizes data from a specialized particle detector
● From the SuperKEKB collider in Japan
➢ Workload traces from CERN
■ A storage cluster with an analysis, archive, and tape pool
■ The analysis pool is low-latency, disk-based caching storage
Overview
❖ Large distributed storage systems such as high performance computing (HPC) systems
➢ Must feature sufficient performance and scale for demanding scientific workloads
➢ Must handle shifting workloads without causing bottlenecks
❖ Data should be relocated preemptively, before performance drops
➢ The size and complexity of large storage systems inhibit rapid reconstruction of the data layout
➢ The large number of features used to track access performance
■ Prevents accurate modeling of performance
■ Increases modeling time overhead
❖ Geomancy
➢ Automatically optimizes the placement of data within a distributed storage system
➢ Leverages a neural network architecture
■ Accurately forecasts the future performance of accesses
❖ Benefits for storage systems include
➢ Avoiding potential bottlenecks
➢ Increasing overall I/O throughput by 11% to 30%
❖ Automatic feature selection
➢ Determines the combination of features that produces the most accurate predictions
➢ Applies the updated predictions to the system
❖ Preemptive data movement
➢ Preempts workload changes by changing locations of data depending on the time of day
➢ Handles concurrent workloads accessing similar data
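The overview above reduces to a simple decision loop: forecast the future access throughput of a file at each candidate location, then move it to the best one. A minimal Python sketch follows; the function names and the stand-in forecast table are hypothetical (the poster does not specify an API), and in Geomancy the predictor is a neural network trained on access traces.

```python
# Hypothetical sketch of Geomancy's placement decision: pick the
# location where the model forecasts the highest access throughput.

def choose_location(predict_throughput, file_id, locations):
    """Return the candidate location with the highest forecast throughput."""
    return max(locations, key=lambda loc: predict_throughput(file_id, loc))

# Stand-in forecasts (MB/s) in place of a trained neural network.
fake_forecast = {"nfs": 80.0, "raid5": 310.0, "lustre": 150.0}
best = choose_location(lambda f, loc: fake_forecast[loc],
                       file_id="trace_01.dat",
                       locations=list(fake_forecast))
print(best)  # raid5
```

In the full system this choice is only a suggestion: the action checker still has to confirm permissions and availability at the target before the move happens.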
Extended Goals:
❖ Concurrent-access data movement
➢ Multiple instances of the PNNL workload running concurrently
➢ Use a neural network to predict when the next file movement is possible
■ A large enough time gap with no access to a file
❖ Block-level data movements (intelligent striping)
The System
Preliminary results
❖ Geomancy uses deep learning
➢ Observes and learns from the workload executing on a target system
➢ Predicts the future access throughput for data at a given location
❖ Geomancy uses file access patterns
➢ To model how the system experiences workloads over time
➢ To train the neural network
❖ File accesses contain
➢ Previous and present locations of data
➢ Performance values when data is accessed at a location
❖ The action checker checks
➢ Permissions/availability of the new location
❖ Hardware for live testing: PNNL's Bluesky node
➢ NFS mounted home directory
■ Connected to a shared storage server
■ Via 10 Gbit Ethernet
■ Can have long access latencies
➢ Two temporary RAID 1 storage points
➢ A RAID 5 storage point
■ Highest I/O throughput performance
➢ A Lustre file system
➢ An externally mounted hard disk drive
■ Lowest I/O throughput performance
Architecture of models with the highest accuracy:
❖ Model 1: 16Z (Dense) ReLU, 8Z (Dense) ReLU, 4Z (Dense) ReLU, 1 (Dense) Linear
❖ Model 2: Z (SimpleRNN) ReLU, 4Z (Dense) ReLU, Z (Dense) ReLU, 1 (Dense) Linear
❖ Model 3: 16Z (Dense) ReLU, 16Z (Dense) ReLU, 16Z (Dense) ReLU, 16Z (Dense) ReLU, 1 (Dense) ReLU
❖ Model 4: 16Z (Dense) ReLU, 16Z (Dense) ReLU, 16Z (Dense) ReLU, 16Z (Dense) ReLU, 16Z (Dense) ReLU, 1 (Dense) ReLU
❖ Model 5: Z (Dense) ReLU, Z (Dense) ReLU, Z (Dense) ReLU, Z (Dense) ReLU, Z (Dense) ReLU, 1 (Dense) ReLU
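Model 1 above (three ReLU dense layers narrowing from 16Z to 8Z to 4Z units, then a single linear output) can be sketched as a plain NumPy forward pass. This is an illustrative reconstruction, not the authors' code: Z stands for the input feature count (its value here is an assumption), and the random weights are placeholders for trained parameters.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def model1_forward(x, params):
    """Model 1: Dense 16Z -> 8Z -> 4Z with ReLU, then Dense 1 with linear output."""
    h = x
    for w, b in params[:-1]:
        h = relu(h @ w + b)          # hidden dense layers with ReLU
    w, b = params[-1]
    return h @ w + b                 # linear output: predicted throughput

Z = 6                                # assumed number of trace features
rng = np.random.default_rng(0)
sizes = [Z, 16 * Z, 8 * Z, 4 * Z, 1]
params = [(rng.normal(size=(m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(1, Z))          # one feature vector from an access trace
y = model1_forward(x, params)
print(y.shape)  # (1, 1): a single throughput forecast
```

Models 2 through 5 vary only in layer widths, depth, and (for Model 2) a SimpleRNN first layer; the same pattern applies.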
Accuracy of the selected model (Model 1) on individual storage points:
Graph Description:
❖ Dynamic: data is moved periodically
❖ Static: data is placed once before the test
❖ Random: random shuffling of the data across nodes
❖ LFU, LRU, MRU: Least Frequently Used, Least Recently Used, Most Recently Used
➢ Organizes the data and spreads it evenly across the mounts in the target system
❖ The simulation uses 24 files
❖ Data access: the time the file is opened
❖ "Geomancy data movement": the timestamp when Geomancy moves data in the target system
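The LFU/LRU/MRU baselines above rank files by access frequency or recency and then spread them evenly across the mounts. A small Python sketch of that idea, under the assumption (stated in the description) that the ranked files are simply round-robined across mounts; the exact tie-breaking and mount order are illustrative choices, not the poster's:

```python
from collections import Counter

def place_evenly(ordered_files, mounts):
    """Round-robin the ranked files across mounts for an even spread."""
    return {f: mounts[i % len(mounts)] for i, f in enumerate(ordered_files)}

def lfu_order(accesses):
    """Least Frequently Used first."""
    counts = Counter(accesses)
    return sorted(counts, key=lambda f: counts[f])

def lru_order(accesses):
    """Least Recently Used first: sort by last access, oldest first."""
    last_seen = {f: i for i, f in enumerate(accesses)}
    return sorted(last_seen, key=lambda f: last_seen[f])

def mru_order(accesses):
    """Most Recently Used first."""
    return list(reversed(lru_order(accesses)))

accesses = ["a", "b", "a", "c", "b", "a"]      # toy access trace
mounts = ["raid1_a", "raid1_b", "raid5"]       # hypothetical mount names
print(place_evenly(lfu_order(accesses), mounts))
```

Unlike these static heuristics, the Dynamic (Geomancy) policy re-evaluates placements periodically from the model's throughput forecasts.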
Description
Best neural network accuracy
Future Work
Acknowledgements: This research was conducted in collaboration with CERN and PNNL, who provided the data used for the experiments. We would like to thank all the collaborators on this project for their guidance and help on this poster. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. We would like to acknowledge the Department of Energy and the National Science Foundation for sponsoring the research described here.
Model Number    Mean of Absolute Error    Standard Deviation of Absolute Error
Model 1         18.88                     16.92
Model 2         18.77                     16.83
Model 3         17.63                     15.95
Model 4         17.72                     16.02
Model 5         18.50                     16.42
Storage point                         Absolute Relative Error (%)
Externally mounted hard disk drive    14.73 ± 12.70
Lustre file system                    16.78 ± 15.14
Temporary RAID 1 storage point #1     15.68 ± 14.61
RAID 5 storage point                  23.62 ± 19.53
Temporary RAID 1 storage point #2     16.03 ± 12.82
NFS mounted home directory            20.76 ± 18.99
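The per-storage-point metric in the table above, absolute relative error in percent, is a one-liner. The poster does not spell out the formula, so this sketch uses the standard definition: the absolute difference between forecast and measured throughput, divided by the measured value.

```python
def abs_relative_error_pct(predicted, actual):
    """Absolute relative error of a throughput forecast, in percent."""
    return abs(predicted - actual) / abs(actual) * 100.0

# e.g., a forecast of 85 MB/s against a measured 100 MB/s
print(round(abs_relative_error_pct(85.0, 100.0), 2))  # 15.0
```

The table reports the mean of this quantity over all accesses at a storage point, plus or minus its standard deviation.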