Server-side Parallel Data Reduction and Analysis
Daniel L. Wang1*, Charles S. Zender2, and Stephen F. Jenks1
University of California, Irvine
1Department of Electrical Engineering and Computer Science, 2Department of Earth System Science
GPC 2007 Paris, France
A Scientist's Perspective
Courtesy of International Business Machines Corporation. Unauthorized use not permitted.
UCI Eco Mote
NASA QuikSCAT satellite
I can give you 50 Gflops.
Fine-grained measurements from everywhere.
500 GB at the price of a few steak dinners.
Curious about temperature at 164W 17N?
280 Tflops? No problem!
Thanks! But there's another problem...
The Data Analysis Problem
It's too hard to work with bulky data!
~1 TB of data to generate this simple picture
Can I use these projects?
● Domain-specific visualization and/or analysis
– GrADS/GDS (Kinter 1993)
– Ferret (Hankin 1996)
– CDAT (Doutriaux 2003)
– IDV (Murray 2006)
– NCO (Zender 2004)
...all well-suited for finer granularities
Grid to the rescue?
● Compute grids help me compute, but...
– My jobs are simple: subsetting, averaging, rms, t-statistic
– Data movement costs >> computation costs
● Data grids help me store, but...
– Discovery, replication, and cataloging do not address download size
...What about a combination of the two?
SWAMP to the rescue: Script Workflow Analysis for MultiProcessing
[Diagram: Traditional vs. SWAMP workflows, each driven by an NCO script (e.g. ncwa in.nc ...)]
Traditional: ponder question → write script → request input data → receive input data → execute script
SWAMP: ponder question → write script → submit script → receive results
● SWAMP sends the analysis to the data
Geoscience Application Domain
● NCO: netCDF Operators
– Provides powerful primitives for data analysis
● subsetting, averaging, rms, add/subtract/multiply/divide
– File-level granularity
● OPeNDAP
– Provides normalized access to scientific data
– Popular data protocol/server in geoscience
– Used in Earth System Grid
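The reduction primitives named above (subsetting, averaging, rms) can be sketched in miniature. The functions below are hypothetical pure-Python stand-ins operating on plain lists, illustrating the semantics of the NCO operators rather than their actual netCDF implementations:

```python
import math

# Illustrative stand-ins for NCO-style reduction primitives.
# Real NCO (ncks, ncwa, ...) operates on netCDF variables, not lists.

def subset(values, start, stop):
    """Hyperslab extraction along one dimension, as ncks does."""
    return values[start:stop]

def average(values):
    """Unweighted mean, the default reduction ncwa performs."""
    return sum(values) / len(values)

def rms(values):
    """Root-mean-square, one of ncwa's normalization options."""
    return math.sqrt(sum(v * v for v in values) / len(values))

temps = [280.0, 282.5, 281.0, 279.5]   # e.g. temperatures in kelvin
print(average(temps))                  # 280.75
print(rms(subset(temps, 0, 2)))
```

Each primitive maps many values to one, which is why chained scripts of such operators are so strongly data-reductive.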
SWAMP Concepts
Goal: daily terascale analysis of remote data from a geoscientist's desktop
● Server-side (integrated with data service)
– “Thou shalt not move the data”
● Data-reductive analysis
– Characterizes and summarizes raw data
– Very common in geoscience (e.g. 8GB->200MB)
● Script-based workflows
– Easy to learn, modify, maintain (~POSIX shell)
– Supports domain-specific primitives
HOWTO: Use SWAMP
● Send the script to the data:
(OPeNDAP protocol request)
swampsendscript.py script.swamp \
  http://server:port/nph-dods
(script.swamp: script of NCO commands; the second argument: OPeNDAP URL)
● Receive tokens for async output download
● Download output when ready
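The submit-then-poll interaction above can be sketched as a small client. The endpoint paths and response strings below are assumptions for illustration, not SWAMP's actual protocol; the transport is abstracted behind a `fetch` callable so the flow can be shown without a live server:

```python
import time

def submit_script(fetch, server_url, script_text):
    """Send the script to the server; it answers with a token
    identifying the asynchronous job (hypothetical endpoint)."""
    return fetch(server_url + "/submit", script_text)

def wait_for_output(fetch, server_url, token, poll_s=0.0):
    """Poll until the workflow finishes, then download the output."""
    while True:
        if fetch(server_url + "/status/" + token, None) == "done":
            return fetch(server_url + "/output/" + token, None)
        time.sleep(poll_s)

# Usage with a stub standing in for the OPeNDAP transport:
state = {"polls": 0}
def fake_fetch(url, body):
    if url.endswith("/submit"):
        return "token-1"
    if "/status/" in url:
        state["polls"] += 1
        return "done" if state["polls"] > 1 else "running"
    return b"netcdf-bytes"

tok = submit_script(fake_fetch, "http://server:port", "ncwa in.nc out.nc")
result = wait_for_output(fake_fetch, "http://server:port", tok)
```

The token makes the exchange asynchronous: the scientist's machine need not stay connected while a long reduction runs server-side.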
Nuts and Bo l t s
Parsing
● Shell-script syntax
● Support domain-specific applications
– NCO: ncrcat, ncwa, ncbo, etc.
● Automatic dependency resolution (filename flagging)
Execution
● Dynamically scheduled for parallelism
● Concurrent peer workers
● Db-backed job control
● Ramdisk optimization
– Keep temporaries and intermediate files in RAM
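The filename-flagging idea above can be sketched briefly. The convention assumed here (each script line is one NCO command whose last non-flag argument is its output and whose other arguments are inputs) is a simplification for illustration, not SWAMP's actual parser:

```python
def build_dag(script_lines):
    """Map each output file to the set of files its command reads."""
    deps = {}
    for line in script_lines:
        args = [a for a in line.split()[1:] if not a.startswith("-")]
        inputs, output = args[:-1], args[-1]
        deps[output] = set(inputs)
    return deps

def schedule_waves(deps):
    """Group commands into waves that may run concurrently:
    a command is ready once all its inputs exist or are produced."""
    produced, waves, pending = set(), [], dict(deps)
    while pending:
        ready = [out for out, ins in pending.items()
                 if all(i in produced or i not in deps for i in ins)]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        waves.append(sorted(ready))
        for out in ready:
            produced.add(out)
            del pending[out]
    return waves

script = [
    "ncwa jan.nc jan_avg.nc",
    "ncwa feb.nc feb_avg.nc",
    "ncbo jan_avg.nc feb_avg.nc diff.nc",
]
print(schedule_waves(build_dag(script)))
# [['feb_avg.nc', 'jan_avg.nc'], ['diff.nc']]
```

The two averages share no files, so they land in the same wave and can run on separate workers; the difference must wait for both.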
Block Diagram
Experimental Setup
● Benchmark: sample 06h00/18h00 (local) from 10 years of global T42 data (t=20m)
– ~14,000 script lines
– ~8 GB input data (120 files)
– ~26 GB intermediate results
– ~230 MB result data (10 files)
● System: dual Opteron 270 (4 cores total), 16 GB memory
● Compare 9 cases: traditional vs. SWAMP (1-, 2-, 4-, 8-wide; I/O opt)
Parse and Build Workflow
~14,000-line script (1 line ≈ 1 NCO command) → dependency-aware workflow
Overall Performance
[Chart: SWAMP performance — time (minutes) split into compute and transfer, for non-SWAMP vs. serial, 2-, 4-, and 8-worker SWAMP]
● Bandwidth savings enormous
● Scientist receives results ~6x faster
(Transfer time calculated using a generous 30 Mbit/s bandwidth, 3x Ethernet)
Performance Scaling
[Chart: SWAMP parallelization — computational speedup (1–4x) vs. # workers (0–8), for "no opt", "tmp in RAM", and ideal]
● tmp in RAM still better without parallelism
● tmp in RAM tracks the ideal case
● no opt suffers from I/O contention
Recent Progress
● SOAP web service interface
● Better shell-script syntax support
– file remapping
– hazard detection
– script filename globbing
– shell/environment variables
● Cluster-distributed workflow parallelism
● Near future:
– Testing at the National Center for Atmospheric Research (NCAR) Community Data Portal
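The file remapping and hazard detection mentioned above can be illustrated with a version-renaming sketch. This scheme (renaming each write to a fresh versioned name so a reused filename cannot clobber a file a pending command still reads, analogous to register renaming) is an assumed approach for illustration, not SWAMP's documented implementation:

```python
def rename_versions(commands):
    """commands: list of (inputs, output) filename tuples, in script order.
    Returns the same commands with every filename given a version suffix,
    so repeated writes to one name no longer conflict."""
    version = {}
    renamed = []
    for inputs, output in commands:
        # Readers bind to the version current at this point in the script.
        ins = tuple(f"{f}.v{version.get(f, 0)}" for f in inputs)
        # Each write creates a fresh version of its output name.
        version[output] = version.get(output, 0) + 1
        renamed.append((ins, f"{output}.v{version[output]}"))
    return renamed

cmds = [(("in.nc",), "tmp.nc"),    # write tmp.nc
        (("tmp.nc",), "out1.nc"),  # read tmp.nc
        (("in.nc",), "tmp.nc"),    # reuse of tmp.nc: write-after-read hazard
        (("tmp.nc",), "out2.nc")]
renamed_cmds = rename_versions(cmds)
print(renamed_cmds)
```

After renaming, the second write targets tmp.nc.v2 while the first reader still sees tmp.nc.v1, so both halves of the script can execute concurrently.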
Conclusions
● Scripts represent a viable domain-specific workflow definition language.
● High performance potential available in geoscience workflows
– Order-of-magnitude improvements possible
● Large benefit through I/O optimization (disk and network)
– I/O costs may dominate otherwise (cf. the "memory wall")