EMC Presentation April 2005 1
Research @ Northeastern University
• I/O storage modeling and performance – David Kaeli
• Soft error modeling and mitigation – Mehdi B. Tahoori
I/O Storage Research at Northeastern University
David Kaeli
Yijian Wang
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
kaeli@ece.neu.edu
EMC Presentation April 2005 3
Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work
EMC Presentation April 2005 4
Important File-based I/O Workloads
• Many subsurface sensing and imaging workloads involve file-based I/O
– Cellular biology – in-vitro fertilization with NU biologists
– Medical imaging – cancer therapy with MGH
– Underwater mapping – multi-sensor fusion with Woods Hole Oceanographic Institution
– Ground-penetrating radar – toxic waste tracking with Idaho National Labs
EMC Presentation April 2005 5
The Impact of Profile-guided Parallelization on SSI Applications
• Reduced the runtime of a single-body Steepest Descent Fast Multipole Method (SDFMM) application by 74% on a 32-node Beowulf cluster
– Hot-path parallelization
– Data restructuring
• Reduced the runtime of a Monte Carlo scattered light simulation by 98% on a 16-node Silicon Graphics Origin 2000
– Matlab-to-C compilation
– Hot-path parallelization
• Obtained superlinear speedup of the Ellipsoid Algorithm run on a 16-node IBM SP2
– Matlab-to-C compilation
– Hot-path parallelization
[Figure: ground-penetrating radar geometry – air, soil, and buried mine]
[Chart: Scattered Light Simulation Speedup – runtime in seconds (log scale) for the original, Matlab-to-C, and hot-path-parallelized versions]
[Chart: Ellipsoid Algorithm Speedup (versus serial C version) – speedup vs. number of nodes (1-16) for 64-, 256-, and 1024-vector runs, against linear speedup]
EMC Presentation April 2005 6
Limits of Parallelization
• For compute-bound workloads, Beowulf clusters can be used effectively to overcome computational barriers
• Middleware (e.g., MPI and MPI-IO) can significantly reduce the programming effort on parallel systems
• Multiple clusters can be combined, utilizing Grid Middleware (Globus Toolkit)
• For file-based I/O-bound workloads, Beowulf clusters and Grid systems are presently ill-suited to exploit the potential parallelism present on these systems
EMC Presentation April 2005 7
Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work
EMC Presentation April 2005 8
Parallel I/O Acceleration
• The I/O bottleneck
– The growing gap between the speed of processors, networks, and underlying I/O devices
– Many imaging and scientific applications access disks very frequently
• I/O-intensive applications
– Out-of-core applications
– Work on large datasets that cannot fit in main memory
– File-intensive applications
– Access file-based datasets frequently
– Large number of file operations
EMC Presentation April 2005 9
Introduction
• Storage architectures
– Direct Attached Storage (DAS)
– Storage device is directly attached to the computer
– Network Attached Storage (NAS)
– Storage subsystem is attached to a network of servers, and file requests are passed through a parallel filesystem to the centralized storage device
– Storage Area Network (SAN)
– A dedicated network that provides an any-to-any connection between processors and disks
EMC Presentation April 2005 10
I/O Partitioning
[Diagram: an I/O-intensive application evolving from a single process on a single disk, to data partitioning across multiple disks (i.e., RAID), to data striping, to multiple processes (i.e., MPI-IO) each with its own disk]
EMC Presentation April 2005 11
I/O Partitioning
• I/O is parallelized at both the application level (using MPI and MPI-IO) and the disk level (using file partitioning)
• Ideally, every process will only access files on its local disk (though this is typically not possible due to data sharing)
• How do we recognize the access patterns?
• Profile-guided approach
EMC Presentation April 2005 12
Profile Generation
Run the application
Capture I/O execution profiles
Apply our partitioning algorithm
Rerun the tuned application
EMC Presentation April 2005 13
I/O traces and partitioning
• For every process, for every contiguous file access, we capture the following I/O profile information:
– Process ID
– File ID
– Address
– Chunk size
– I/O operation (read/write)
– Timestamp
• Generate a partition for every process
• Optimal partitioning is NP-complete, so we develop a greedy algorithm
• We have found we can use partial profiles to guide partitioning
EMC Presentation April 2005 14
Greedy File Partitioning Algorithm

for each I/O process, create a partition;
for each contiguous data chunk {
    total up the # of read/write accesses on a process-ID basis;
    if the chunk is accessed by only one process
        assign the chunk to the associated partition;
    if the chunk is read (but never written) by multiple processes
        duplicate the chunk in all partitions where read;
    if the chunk is written by one process, but later read by multiple
        assign the chunk to all partitions where read and broadcast the updates on writes;
    else
        assign the chunk to a shared partition;
}
for each partition
    sort chunks based on the earliest timestamp for each chunk;
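A runnable sketch of this greedy pass, assuming the trace has already been coalesced into per-chunk `(chunk_id, pid, op)` tuples (the tuple layout and function name are illustrative, not from the paper):

```python
from collections import defaultdict

def greedy_partition(accesses):
    """accesses: (chunk_id, pid, op) tuples in timestamp order, op in {"read", "write"}."""
    readers, writers = defaultdict(set), defaultdict(set)
    for chunk, pid, op in accesses:
        (readers if op == "read" else writers)[chunk].add(pid)

    partitions = defaultdict(set)  # pid -> chunks placed on that process's local disk
    shared = set()                 # chunks that fall through to the shared partition
    for chunk in set(readers) | set(writers):
        r, w = readers[chunk], writers[chunk]
        users = r | w
        if len(users) == 1:
            partitions[next(iter(users))].add(chunk)   # private chunk
        elif not w:
            for pid in r:                              # read-only: duplicate everywhere read
                partitions[pid].add(chunk)
        elif len(w) == 1 and len(r) > 1:
            for pid in r:                              # one writer, many readers:
                partitions[pid].add(chunk)             # replicate; broadcast updates on writes
        else:
            shared.add(chunk)
    return partitions, shared
```

The final per-partition sort by earliest timestamp (for layout on disk) is omitted here; it is a straightforward `sorted(..., key=first_access_time)`.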
EMC Presentation April 2005 15
Parallel I/O Workloads
• NASA Parallel Benchmark (NPB2.4)/BT
– Computational fluid dynamics
– Generates a file (~1.6 GB) dynamically and then reads it back
– Writes/reads sequentially in chunk sizes of 2040 bytes
• SPEChpc96/seismic
– Seismic processing
– Generates a file (~1.5 GB) dynamically and then reads it back
– Writes sequential chunks of 96 KB and reads sequential chunks of 2 KB
• Tile-IO
– Parallel Benchmarking Consortium
– Tile access to a two-dimensional matrix (~1 GB) with overlap
– Writes/reads sequential chunks of 32 KB, with 2 KB of overlap
• Perf
– Parallel I/O test program within MPICH
– Writes a 1 MB chunk at a location determined by rank, no overlap
• Mandelbrot
– An image processing application that includes visualization
– Chunk size is dependent on the number of processes
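As one concrete example of these access patterns, Perf's layout (each rank writing a 1 MB chunk at a rank-determined location, with no overlap) can be sketched as follows; the helper names are ours:

```python
CHUNK = 1 << 20  # 1 MB, as in the Perf benchmark

def perf_offset(rank, chunk=CHUNK):
    # Each rank writes one chunk starting at rank * chunk, so regions never overlap.
    return rank * chunk

def regions(nprocs, chunk=CHUNK):
    # [start, end) byte range written by each rank.
    return [(perf_offset(r, chunk), perf_offset(r, chunk) + chunk)
            for r in range(nprocs)]
```

Because every byte has exactly one accessor, the greedy partitioner places each chunk privately: this workload is the easy case.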
EMC Presentation April 2005 16
Beowulf Cluster
[Diagram: P2-350 MHz nodes, each with a local PCI-IDE disk, plus RAID nodes, connected by a 10/100 Mb Ethernet switch]
EMC Presentation April 2005 17
Hardware Specifics
• DAS configuration
– Linux box, Western Digital WD800BB (IDE), 80 GB, 7200 RPM
• Beowulf cluster (base configuration)
– Fast Ethernet, 100 Mbits/sec
– Network-attached RAID – Morstor TF200 with six 9 GB Seagate SCSI disks, 7200 RPM, RAID-5
– Locally attached IDE disks – IBM UltraATA-350840, 5400 RPM
• Fibre Channel disks
– Seagate Cheetah X15 ST-336752FC, 15000 RPM
EMC Presentation April 2005 18
Write/Read Bandwidth
[Chart: NPB2.4/BT – bandwidth (MB/sec) for Unix, MPI-IO, and P-IO writes and reads at 4, 9, 16, and 25 processes]
[Chart: SPECHPC/seis – bandwidth (MB/sec) for Unix, MPI-IO, and P-IO writes and reads at 4, 8, 16, and 24 processes]
EMC Presentation April 2005 19
Write/Read Bandwidth
[Charts: MPI-Tile, Perf, and Mandelbrot – bandwidth (MB/sec) for MPI and PIO writes and reads at 4, 8, 16, and 24 processes]
EMC Presentation April 2005 20
Total Execution Time
[Chart: total execution time (seconds), MPI-IO versus PIO]
EMC Presentation April 2005 21
Profile training sensitivity analysis
• We have found that I/O access patterns are independent of file-based data values
• When we increase the problem size or reduce the number of processes, either:
– the number of I/Os increases, but access patterns and chunk size remain the same (SPEChpc96, Mandelbrot), or
– the number of I/Os and I/O access patterns remain the same, but the chunk size increases (NBT, Tile-IO, Perf)
• Re-profiling can be avoided
EMC Presentation April 2005 22
Execution-driven Parallel I/O Modeling
• Growing need to process large, complex datasets in high-performance parallel computing applications
• Efficient implementation of storage architectures can significantly improve system performance
• An accurate simulation environment lets users test and evaluate different storage architectures and applications
EMC Presentation April 2005 23
Execution-driven I/O Modeling
• Target applications: parallel scientific programs (MPI)
• Target machine/host machine: Beowulf clusters
• Use DiskSim as the underlying disk drive simulator
• Direct execution to model CPU and network communication
• We execute the real parallel I/O accesses while calculating the simulated I/O response time
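A toy version of that accounting is sketched below. The linear disk model is a made-up stand-in for DiskSim, and the arrival times stand in for the directly executed compute/communication phases; only the service times are simulated.

```python
def simulate_io(requests, disk_model):
    # requests: (arrival_time, size_bytes) pairs in arrival order.
    # Arrival times come from direct execution (real CPU/network time);
    # only the disk service time is taken from the model (DiskSim's role).
    clock, total_response = 0.0, 0.0
    for arrival, size in requests:
        clock = max(clock, arrival)   # disk may sit idle until the request arrives
        clock += disk_model(size)     # modeled service time for this access
        total_response += clock - arrival
    return total_response

# Stand-in model: 5 ms fixed overhead plus transfer at 100 MB/s.
linear_disk = lambda size: 0.005 + size / 100e6
```

Queuing is captured by the `max(clock, arrival)` step: a request that arrives while the disk is busy waits, so its response time includes queueing delay, not just service time.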
EMC Presentation April 2005 24
Validation – Synthetic I/O Workload on DAS
[Charts: modeled vs. real response time (seconds) for 1000 sequential writes and 1000 sequential reads (access size 1-16 blocks), and for 1000 non-contiguous reads and writes (access size = 1 block, seek distance 1-32 blocks)]
EMC Presentation April 2005 25
Simulation Framework – NAS
[Diagram: clients issue logical file access addresses over the LAN/WAN to a network file system; filesystem metadata translates them into I/O requests served by DiskSim behind a RAID controller, with local I/O traces captured at each client]
EMC Presentation April 2005 26
Execution Time of NPB2.4/BT on NAS – base configuration
[Chart: modeled vs. real execution time (seconds) at 4, 9, 16, and 25 processors]
EMC Presentation April 2005 27
Simulation Framework – SAN direct
[Diagram: filesystems with per-node I/O traces, each backed by its own DiskSim instance, connected over the LAN/WAN]
• A variant of SAN where disks are distributed across the network and each server is directly connected to a single device
• File partitioning
• Utilize I/O profiling and data partitioning heuristics to distribute portions of files to disks close to the processing nodes
EMC Presentation April 2005 28
Execution Time of NPB2.4/BT on SAN-direct – base configuration
[Chart: modeled vs. real execution time (seconds) at 4, 9, 16, and 25 processors]
EMC Presentation April 2005 30
I/O Bandwidth of SPEChpc/seis
[Chart: bandwidth (MB/s) at 4, 8, and 16 processors across storage architectures: NAS-joulian, NAS-ATA, NAS-SCSI, NAS-FC, SAN-joulian, SAN-direct-ATA, SAN-direct-SCSI, SAN-direct-FC]
EMC Presentation April 2005 31
I/O Bandwidth of Mandelbrot
[Chart: bandwidth (MB/s) at 4, 8, and 16 processors across the same storage architectures: NAS-joulian, NAS-ATA, NAS-SCSI, NAS-FC, SAN-joulian, SAN-direct-ATA, SAN-direct-SCSI, SAN-direct-FC]
EMC Presentation April 2005 32
Publications
• “Profile-guided File Partitioning on Beowulf Clusters,” Journal of Cluster Computing, Special Issue on Parallel I/O, to appear 2005.
• “Execution-Driven Simulation of Network Storage Systems,” Proceedings of the 12th ACM/IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), October 2004, pp. 604-611.
• “Profile-Guided I/O Partitioning,” Proceedings of the 17th ACM International Symposium on Supercomputing, June 2003, pp. 252-260.
• “Source Level Transformations to Apply I/O Data Partitioning,” Proceedings of the IEEE Workshop on Storage Network Architecture and Parallel I/O, October 2003, pp. 12-21.
• “Profile-Based Characterization and Tuning for Subsurface Sensing and Imaging Applications,” International Journal of Systems, Science and Technology, September 2002, pp. 40-55.
EMC Presentation April 2005 33
Summary of Cluster-based Work
• Many imaging applications are dominated by file-based I/O
• Parallel systems can only be effectively utilized if I/O is also parallelized
• Developed a profile-guided approach to I/O data partitioning
• Impacting clinical trials at MGH
• Reduced overall execution time by 27-82% over MPI-IO
• Execution-driven I/O model is highly accurate and provides significant modeling flexibility
EMC Presentation April 2005 34
Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work
EMC Presentation April 2005 35
I/O Qualification Laboratory
• Working with the Enterprise Strategy Group
• Develop a state-of-the-art facility to provide independent performance qualification of enterprise storage systems
• Provide a quarterly report to ES customer base on the status of current ES offerings
• Work with leading ES vendors to provide them with custom early performance evaluation of their beta products
EMC Presentation April 2005 36
I/O Qualification Laboratory
• Contacted by IOIntegrity and SANGATE for product qualification
• Developed potential partners that are leaders in the ES field
• Initial proposals already reviewed by IBM, Hitachi and other ES vendors
• Looking for initial endorsement from industry
EMC Presentation April 2005 37
I/O Qualification Laboratory
• Why @ NU
– Track record with industry (EMC, IBM, Sun)
– Experience with benchmarking and I/O characterization
– Interesting set of applications (medical, environmental, etc.)
– Great opportunity to work within the cooperative education model
EMC Presentation April 2005 38
Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work
EMC Presentation April 2005 39
Areas for Future Work
• Designing a Peer-to-Peer storage system on a Grid system by partitioning datasets across geographically distributed storage devices
[Diagram: head nodes of joulian.hpcl.neu.edu (31 sub-nodes) and keys.ece.neu.edu (8 sub-nodes) connected over the Internet via 1 Gbit/s and 100 Mbit/s links, with RAID storage]
EMC Presentation April 2005 40
NPB2.4/BT read performance
[Chart: read bandwidth (MB/s) for single-server, dual-server, and P2P configurations at 4, 9, 16, and 25 processes]
EMC Presentation April 2005 41
Areas for Future Work
• Reduce simulation time by identifying characteristic “phases” in I/O workloads
• Apply machine learning algorithms to identify clusters of representative I/O behavior
• Use K-Means and Multinomial clustering to obtain high fidelity in simulation runs that sample I/O behavior
“A Multinomial Clustering Model for Fast Simulation of Architecture Designs”, submitted to the 2005 ACM KDD Conference.
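The clustering idea can be sketched with a tiny K-Means over per-interval I/O feature vectors (e.g., mean chunk size and read fraction). The feature choice and function are illustrative stand-ins, not the cited paper's method:

```python
import random

def kmeans(points, k, iters=25, seed=1):
    # points: feature vectors (tuples), one per fixed-length I/O interval.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each interval to its nearest center (squared Euclidean distance).
        for n, p in enumerate(points):
            assign[n] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
        # Move each center to the mean of its members.
        for c in range(k):
            members = [points[n] for n in range(len(points)) if assign[n] == c]
            if members:
                centers[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centers, assign
```

Simulating one representative interval per cluster, weighted by cluster size, then approximates the full run at a fraction of the simulation time.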