Data analysis in I2U2
I2U2 all-hands meeting
Michael Wilde
Argonne MCS
University of Chicago Computation Institute
12 Dec 2005
www.griphyn.org/vds
Scaling up Social Science: Parallel Citation Network Analysis
[Figure: citation network visualizations in five-year snapshots, 1975 through 2002]
Work of James Evans, University of Chicago, Department of Sociology
Scaling up the analysis
- Database queries of 25+ million citations
- Work started on small workstations
- Queries grew to month-long duration
- With the database distributed across the U of Chicago TeraPort cluster, 50 (faster) CPUs gave a 100X speedup
- Many more methods and hypotheses can be tested!
- Grid enables deeper analysis and wider access
Grids Provide Global Resources To Enable e-Science
Why Grids? eScience is the Initial Motivator…
- New approaches to inquiry based on:
  - Deep analysis of huge quantities of data
  - Interdisciplinary collaboration
  - Large-scale simulation and analysis
  - Smart instrumentation
- Dynamically assemble the resources to tackle a new scale of problem
- Enabled by access to resources & services without regard for location & other barriers
- …but eBusiness is catching up rapidly, and this will benefit both domains
Technology that enables the Grid
- Directory to locate grid sites and services
- Uniform interface to computing sites
- Fast and secure data set mover
- Directory to track where datasets live
- Security to control access
- Toolkits to create application services
- Globus, Condor, VDT, many more
Virtual Data and Workflows
Next challenge is managing and organizing the vast computing and storage capabilities provided by Grids
Workflow expresses computations in a form that can be readily mapped to Grids
Virtual data keeps accurate track of data derivation methods and provenance
Grid tools virtualize location and caching of data, and recovery from failures
Virtual Data Process
Describe data derivation or analysis steps in a high-level workflow language (VDL)
VDL definitions are cataloged in a database for sharing by the community
Workflows for Grid generated automatically from VDL
Provenance of derived results goes back into catalog for assessment or verification
Virtual Data Lifecycle
Describe
- Record the processing and analysis steps applied to the data
- Document the devices and methods used to measure the data
Discover
- I have some subject images – what analyses are available? Which can be applied to this format?
- I’m a new team member – what are the methods and protocols of my colleagues?
Reuse
- I want to apply an image registration program to thousands of objects. If the results already exist, I’ll save weeks of computation.
Validate
- I’ve come across some interesting data, but I need to understand the nature of the preprocessing applied when it was constructed before I can trust it for my purposes.
Virtual Data Workflow Abstracts Grid Details
Workflow – the next programming model?
Virtual Data Example: Galaxy Cluster Search
[Figure: DAG of the cluster-search workflow over Sloan Data, and a log-log plot of the galaxy cluster size distribution — number of clusters (1 to 100,000) vs. number of galaxies (1 to 100)]
Work of Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago
A virtual data glossary
- virtual data: defining data by the logical workflow needed to create it virtualizes it with respect to location, existence, failure, and representation
- VDS – Virtual Data System: the tools to define, store, manipulate and execute virtual data workflows
- VDT – Virtual Data Toolkit: a larger set of tools, based on NMI; VDT provides the Grid environment in which VDL workflows run
- VDL – Virtual Data Language: a language (text and XML) that defines the functions and function calls of a virtual data workflow
- VDC – Virtual Data Catalog: the database and schema that store VDL definitions
What must we “virtualize” to compute on the Grid?
- Location-independent computing: represent all workflow in abstract terms
- Declarations not tied to specific entities: sites, file systems, schedulers
- Failures: automated retry for data server and execution site unavailability
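The automated-retry idea can be sketched in ordinary Python. This is a generic illustration, not VDS code: `run_with_retry`, `submit_job`, and the retry parameters are all hypothetical names.

```python
import time

def run_with_retry(submit_job, max_attempts=3, backoff_s=1.0):
    """Retry a job submission when a site or data server is unavailable.

    submit_job: a zero-argument callable that raises ConnectionError on failure.
    Returns the job's result, or re-raises after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit_job()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # all retries exhausted; surface the failure
            time.sleep(backoff_s * attempt)  # simple linear backoff

# Example: a flaky "site" that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_site():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("execution site unavailable")
    return "done"

result = run_with_retry(flaky_site, backoff_s=0.01)  # -> "done" on the 3rd try
```

Real Grid retry (e.g. in DAGman) also distinguishes transient from permanent failures; this sketch retries unconditionally.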
Expressing Workflow in VDL
TR grep (in a1, out a2) {
argument stdin = ${a1};
argument stdout = ${a2}; }
TR sort (in a1, out a2) {
argument stdin = ${a1};
argument stdout = ${a2}; }
DV grep (a1=@{in:file1}, a2=@{out:file2});
DV sort (a1=@{in:file2}, a2=@{out:file3});
[Diagram: file1 → grep → file2 → sort → file3]
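The dataflow this VDL expresses (file1 → grep → file2 → sort → file3) can be mimicked in plain Python; a hedged stand-in for the actual Grid execution, with in-memory lists playing the role of the files:

```python
def grep(lines, pattern):
    """Stand-in for the grep transformation: keep matching lines."""
    return [l for l in lines if pattern in l]

def sort_lines(lines):
    """Stand-in for the sort transformation."""
    return sorted(lines)

# file1 -> grep -> file2 -> sort -> file3, as in the two DV calls
file1 = ["banana split", "apple pie", "banana bread"]
file2 = grep(file1, "banana")   # DV grep: file1 -> file2
file3 = sort_lines(file2)       # DV sort: file2 -> file3
# file3 == ["banana bread", "banana split"]
```

The point of VDL is that the same two-step composition is declared once and then planned onto whatever site holds the data, rather than run locally as here.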
Expressing Workflow in VDL (annotated)
(same grep/sort VDL and dataflow as above)
- TR defines a “function” wrapper for an application and its “formal arguments”
- DV defines a “call” that invokes the application, providing “actual” argument values for the invocation
- Applications are connected via output-to-input dependencies (file2 carries grep’s output into sort)
Essence of VDL
- Elevates specification of computation to a logical, location-independent level
- Acts as an “interface definition language” at the shell/application level
- Can express composition of functions
- Codable in textual and XML form
- Often machine-generated to provide ease of use and higher-level features
- Preprocessor provides iteration and variables
Using VDL
- Generated directly for low-volume usage
- Generated by scripts for production use
- Generated by application tool builders as wrappers around scripts provided for community use
- Generated transparently in an application-specific portal (e.g. quarknet.fnal.gov/grid)
- Generated by drag-and-drop workflow design tools such as Triana
Basic VDL Toolkit
- Convert between text and XML representation
- Insert, update, remove definitions from a virtual data catalog
- Attach metadata annotations to definitions
- Search for definitions
- Generate an abstract workflow for a data derivation request
- Multiple interface levels provided: Java API, command line, web service
Representing Workflow
- Specifies a set of activities and control flow
- Sequences information transfer between activities
- VDS uses an XML-based notation called the “DAG in XML” (DAX) format
- Represents a wide range of workflow possibilities
- A DAX document represents the steps to create a specific data product
Executing VDL Workflows
[Diagram: a VDL program is expanded by the Virtual Data Workflow Generator, against the Virtual Data Catalog, into an abstract workflow; a local planner produces either a statically partitioned DAG or a dynamically planned DAG, executed by DAGman & Condor-G, with job planner and job cleanup steps. Phases: workflow spec → create execution plan → Grid workflow execution]
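The planner's first job — turning an abstract DAG into a valid execution order — can be sketched as a topological sort. This is illustrative only: real planning by DAGman/Condor-G also handles site selection, data movement, and failure recovery, and the step names here are hypothetical.

```python
from graphlib import TopologicalSorter

# Abstract workflow as a DAG: each step maps to the steps it depends on.
dag = {
    "stage_in":  [],
    "grep":      ["stage_in"],
    "sort":      ["grep"],
    "stage_out": ["sort"],
}

# static_order() yields steps so that every dependency runs first.
order = list(TopologicalSorter(dag).static_order())
# order == ["stage_in", "grep", "sort", "stage_out"]
```

For a linear chain like this the order is unique; for wider DAGs any order respecting the dependencies is valid, which is what lets independent branches run in parallel on different sites.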
OSG: The “target chip” for VDS Workflows
Supported by the National Science Foundation and the Department of Energy.
VDS Applications

Application | Description              | Jobs/workflow           | Levels | Status
ATLAS       | HEP event simulation     | 500K                    | 1      | In Use
LIGO        | Inspiral/Pulsar          | ~700                    | 2-5    | Inspiral In Use
NVO/NASA    | Montage/Morphology       | 1000s                   | 7      | Both In Use
GADU        | Genomics: BLAST, …       | 40K                     | 1      | In Use
fMRI DBIC   | AIRSN image processing   | 100s                    | 12     | In Devel
QuarkNet    | Cosmic-ray science       | <10                     | 3-6    | In Use
SDSS        | Coadd; cluster search    | 40K / 500K              | 2 / 8  | In Devel / CS Research
FOAM        | Ocean/atmosphere model   | 2000 (250 8-CPU jobs)   | 3      | In Use
GTOMO       | Image processing         | 1000s                   | 1      | In Devel
SCEC        | Earthquake simulation    | 1000s                   |        | In Use
A Case Study – Functional MRI
- Problem: “spatial normalization” of images to prepare data from fMRI studies for analysis
- Target community is approximately 60 users at Dartmouth Brain Imaging Center
- Wish to share data and methods across the country with researchers at Berkeley
- Process data from arbitrary user and archival directories in the center’s AFS space; bring data back to the same directories
- Grid needs to be transparent to the users: literally, “Grid as a Workstation”
A Case Study – Functional MRI (2)
- Based workflow on a shell script that performs a 12-stage process on a local workstation
- Adopted a replica naming convention for moving users’ data to Grid sites
- Created a VDL pre-processor to iterate transformations over datasets
- Utilizing resources across two distinct grids – Grid3 and Dartmouth Green Grid
Functional MRI Analysis
[Diagram: fMRI analysis DAG — align_warp (1,3,5,7) and reslice (2,4,6,8) stages applied to volumes 3a–6a against ref and atlas inputs; softmean (9) combines the resliced volumes; slicer (10,12,14) and convert (11,13,15) stages produce atlas_x.jpg, atlas_y.jpg, and atlas_z.jpg]
Workflow courtesy James Dobson, Dartmouth Brain Imaging Center
Spatial normalization of functional run
[Diagram: dataset-level workflow — reorientRun → reslice_warpRun (with random_select) → alignlinearRun → resliceRun → softmean → alignlinear → combinewarp → strictmean → gsmoothRun → binarize — shown beside the expanded (10 volume) workflow with its individual reorient, reslice_warp, alignlinear, reslice, softmean, combinewarp, strictmean, gsmooth, and binarize nodes]
Conclusion: Motivation for the Grid
- Provide flexible, cost-effective supercomputing
  - Federate computing resources
  - Organize storage resources and make them universally available
  - Link them on networks fast enough to achieve federation
- Create usable supercomputing
  - Shield users from heterogeneity
  - Organize and locate widely distributed resources
  - Automate policy mechanisms for resource sharing
  - Provide ubiquitous access while protecting valuable data and resources
Grid Opportunities
- Vastly expanded computing and storage
- Reduced effort as needs scale up
- Improved resource utilization, lower costs
- Facilities and models for collaboration
- Sharing of tools, data, procedures and protocols
- Recording, discovery, review and reuse of complex tasks
- Make high-end computing more readily available
fMRI Dataset processing

FOREACH BOLDSEQ
  DV reorient (    # Process Blood O2 Level Dependent Sequence
    input  = [ @{in: "$BOLDSEQ.img"},
               @{in: "$BOLDSEQ.hdr"} ],
    output = [ @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.img"},
               @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.hdr"} ],
    direction = "y",
  );
END

DV softmean (
  input = [ FOREACH BOLDSEQ
              @{in: "$CWD/FUNCTIONAL/har$BOLDSEQ.img"}
            END ],
  mean  = [ @{out: "$CWD/FUNCTIONAL/mean"} ]
);
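The FOREACH construct is handled by the VDL pre-processor, which expands it into one concrete derivation per BOLD sequence. Its effect can be sketched as string expansion in Python; the sequence names and the simplified file paths here are hypothetical:

```python
# Hypothetical BOLD sequence names standing in for a real dataset listing.
bold_seqs = ["bold1", "bold2", "bold3"]

# Expand "FOREACH BOLDSEQ ... END" into one concrete DV call per sequence.
# Doubled braces {{ }} in the f-string emit literal VDL braces.
derivations = [
    f'DV reorient (input=[@{{in:"{s}.img"}}, @{{in:"{s}.hdr"}}], '
    f'output=[@{{out:"r{s}.img"}}, @{{out:"r{s}.hdr"}}], direction="y");'
    for s in bold_seqs
]
# One derivation per sequence; each is an ordinary DV like those shown earlier.
```

The real pre-processor also resolves variables such as $CWD and feeds the expanded derivations into the virtual data catalog; this sketch only shows the iteration step.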
fMRI Virtual Data Queries

Which transformations can process a “subject image”?
  Q: xsearchvdc -q tr_meta dataType subject_image input
  A: fMRIDC.AIR::align_warp

List anonymized subject-images for young subjects:
  Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
  A: 3472-4_anonymized.img

Show files that were derived from patient image 3472-3:
  Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
  A: 3472-3_anonymized.img
     3472-3_anonymized.sliced.hdr
     3472-3_anonymized.sliced.img
     atlas.hdr
     atlas.img
     …
     atlas_z.jpg
Blasting for Protein Knowledge
BLASTing the complete nr file for sequence similarity and a function characterization knowledge base.
PUMA is an interface that lets researchers find information about a specific protein after it has been analyzed against the complete set of sequenced genomes (the nr file: approximately 2 million sequences).
Analysis on the Grid
The analysis of the protein sequences occurs in the background in the grid environment. Millions of processes are started, since several tools are run to analyze each sequence, such as finding protein similarities (BLAST), protein family domain searches (BLOCKS), and structural characteristics of the protein.
FOAM: Fast Ocean/Atmosphere Model
250-Member Ensemble Run on TeraGrid under VDS
[Diagram: for each ensemble member 1…N — remote directory creation, a FOAM run, then Atmos, Ocean, and Coupl postprocessing; results are transferred to archival storage]
Work of: Rob Jacob (FOAM), Veronica Nefedova (workflow design and execution)
FOAM: TeraGrid/VDS Benefits
[Figure: model output compared between a dedicated climate supercomputer and TeraGrid with NMI and VDS]
Visualization courtesy Pat Behling and Yun Liu, UW Madison
Small Montage Workflow
~1200-node workflow, 7 levels
Mosaic of M42 created on the TeraGrid using Pegasus
LIGO Inspiral Search Application
Describe…
Inspiral workflow application is the work of Duncan Brown, Caltech,
Scott Koranda, UW Milwaukee, and the LSC Inspiral group
US-ATLAS Data Challenge 2
[Chart: CPU-days delivered per day, mid July through Sep 10]
Event generation using Virtual Data
Provenance for DC2

How much compute time was delivered?

+-------+-----+------+
| years | mon | year |
+-------+-----+------+
|  .45  |  6  | 2004 |
|  20   |  7  | 2004 |
|  34   |  8  | 2004 |
|  40   |  9  | 2004 |
|  15   | 10  | 2004 |
|  15   | 11  | 2004 |
|  8.9  | 12  | 2004 |
+-------+-----+------+

Selected statistics for one of these jobs:
  start: 2004-09-30 18:33:56
  duration: 76103.33
  pid: 6123
  exitcode: 0
  args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556 ... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
  utime: 75335.86
  stime: 28.88
  minflt: 862341
  majflt: 96386
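The CPU-years in the table come from summing per-job durations recorded in the provenance catalog. The conversion can be sketched as follows; the only real figure used is the 76103.33 s duration of the job shown above:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s in a (non-leap) year

def cpu_years(durations_s):
    """Convert per-job durations in seconds to total CPU-years delivered."""
    return sum(durations_s) / SECONDS_PER_YEAR

# The single job above ran 76103.33 s — roughly 0.0024 CPU-years,
# so the ~40 CPU-years delivered in Sep 2004 implies many thousands of such jobs.
one_job = cpu_years([76103.33])
```

Aggregating by month over all jobs in the catalog reproduces the table; the monthly grouping itself would be a simple SQL GROUP BY over the job start times.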
Which Linux kernel releases were used ?
How many jobs were run on a Linux 2.4.28 Kernel?