dial distributed interactive analysis of large datasets
DESCRIPTION
DIAL Distributed Interactive Analysis of Large datasets. SC2003 Phoenix. David Adams BNL November 17, 2003. Goals of DIAL What is DIAL? Design JDL Datasets Grid2003 More information. Contents. Goals of DIAL. 1. Demonstrate the feasibility of interactive analysis of large datasets - PowerPoint PPT PresentationTRANSCRIPT
David Adams
ATLAS
DIALDistributed Interactive Analysis of
Large datasets
David AdamsBNL
November 17, 2003
SC2003Phoenix
DIAL SC2003 November 15-18, 2003 2
David Adams
ATLAS
Contents
Goals of DIAL
What is DIAL?
Design
JDL
Datasets
Grid2003
More information
DIAL SC2003 November 15-18, 2003 3
David Adams
ATLAS
Goals of DIAL1. Demonstrate the feasibility of interactive analysis of large datasets
• How much data can we analyze interactively?
2. Set requirements for GRID services• In particular those specific to interactive analysis
– Job definition: application, task, dataset– Gathering and relaying results– Real time monitoring (partial results)– Resource management: discovery, allocation, sharing
3. Provide ATLAS with a useful analysis tool• For current and upcoming data challenges• Like to add another experiment to show generality
DIAL SC2003 November 15-18, 2003 4
David Adams
ATLAS
What is DIAL?DIAL provides a connection between
• Interactive analysis framework– Fitting, presentation graphics, …
– E.g. ROOT, JAS, …
• and Data processing application– Natural to the data of interest
– E.g. athena for ATLAS
DIAL distributes processing• Among sites, farms, nodes
• To provide user with desired response time
Look to other projects to provide most infrastructure
DIAL SC2003 November 15-18, 2003 5
David Adams
ATLAS
In tera ctive a na lysise .g . R O O T , JA S , ...
D IA L
D istribu ted p rocessing ru nn ing da ta -specific a pp lica tion
D a ta s e t S c he d u le r A A AJ o b
DIAL SC2003 November 15-18, 2003 6
David Adams
ATLAS
DesignDIAL has the following major components
• Dataset describing the data of interest
• Application defined by experiment/site
• Task is user extension to the application
• Job uses application and task to process a dataset
• Result is the output of a job
• Scheduler creates and manages jobs
Together these define a high-level JDL • (job definition language)
Figure shows how these components interact →
DIAL SC2003 November 15-18, 2003 7
David Adams
ATLAS
UserAnalysis
Job 1
Job 2
Application Task
Dataset 1
Scheduler
1. Create or locate
2. select 3. Create or select
4. select
8. run(app,tsk,ds1)
5. submit(app,tsk,ds)8. run(app,tsk,ds2)
6. splitDataset
Dataset 2
7. create
e.g. ROOT
e.g. athena
Result9. fill
10. gather
Result 9. fill
Result Code
DIAL SC2003 November 15-18, 2003 8
David Adams
ATLAS
Analysis job ratesAt what rate is a site processing sub-jobs?
• Assume 1000 CPU’s at a “site”• Continuum of requests with the following extremes
Large scale data production with 3–30 hours/job:• 30-300 jobs/hour (1 job/minute)• Fine for batch and grid schedulers
For interactive analysis with 1-10 seconds/job• 100-1000 jobs/sec (10000 jobs/minute)• Challenging for grid and batch schedulers• Handle with hierarchy of schedulers
– Each scheduler hnadles a fraction of the rate– But each level adds latency
DIAL SC2003 November 15-18, 2003 9
David Adams
ATLAS
DIAL scheduler hierarchy
GridSche du le r
D IA L sched ule r hie ra rchy
O g s aC lie n tSc he du le r
S iteSc he du le r
So apC lie n tSc he du le r
O GSASche du le r
M as te rSc he du le r
L o c alSc he du le r
SO A PSche du le r
ls run
L SFO GSA
Sche du le r
G rid p o rta l G rid s e rv ic e F a rm ga te w a y
Us e r
O g s aC lie n tSc he du le r
C o ndo r
fo rk
Us e r
So apC lie n tSc he du le r
S iteb o u nd a ry
DIAL SC2003 November 15-18, 2003 10
David Adams
ATLAS
JDLHigh level job definition language
• Enable users to specify task without reference to executables, data files or sites
• Scheduler decides where and how to process data
• Analysis implies user is easily able to customize task
Common language• Enable different experiments and non-HEP activities to
share schedulers
• PPDG activity to define such a language– Led by Gabriele Carcassi (STAR)
– Similar to DIAL (application, task. dataset, …)
– XML based
DIAL SC2003 November 15-18, 2003 11
David Adams
ATLAS
JDL (DIAL perspective)
P I/S E A L
G A N G A
R O O T
J A S
E xp e rim e ntP ro d u c tio n
H ighle v e lJ D L
P R O O F
G A N G A
C o nd o r-G
C him e ra
S T A R
J D A P
D IA L
C la re ns O G S A
U se r in te r fa ce s S ch e d u le r s
W e b se rv ice in f ra s tru ctu re
W S D L
DIAL SC2003 November 15-18, 2003 12
David Adams
ATLAS
DatasetsWant to provide a high-level data view
Unit of processing is called “dataset”• Many properties beyond data location
• Location is not just a list of files (physical or logical)– Multiple logical file set representations
– Representation might be tables in an RDB
– Or object list in an ODB
– Or …
Properties and categories follow
DIAL SC2003 November 15-18, 2003 13
David Adams
ATLAS
Dataset properties0. Identity
• Dataset must have an unique index and/or name
1. Content• Description of the type of data in the dataset
– Event or non-event data– Simulation, reconstruction,– ESD, AOD, …– Jets, tracks, electrons,…
2. Location• Where to find the data
– Logical files, physical files, site,…
3. Mapping• Which content is at which location?
DIAL SC2003 November 15-18, 2003 14
David Adams
ATLAS
Dataset properties (cont)4. Provenance
• Prescription for creating the data
• E.g. input dataset and transformation
5. History• Details of production beyond provenance
– How production was split into jobs,
– Processing node and time for each job, …
6. Labels• Assigned metadata outside other categories, e.g.
– Integrated luminosity
– Result of quality checks
– Flag indicating ok for use in published analyses
DIAL SC2003 November 15-18, 2003 15
David Adams
ATLAS
Dataset properties (cont)7. Mutability
• May dataset be modified?
• Possible states: locked, unlocked, extensible, …
8. Compositeness• Dataset made up of other datasets.
• Two cases:– Construction: provenance is the list of sub-datasets
> E.g. the summer dataset is defined to be the union of the June, July and August datasets.
– Assignment: factorization into sub-datasets
> Typically to reflect data placement
> E.g. a representation of a global dataset might include sub-datasets in New York, Paris and Moscow.
DIAL SC2003 November 15-18, 2003 16
David Adams
ATLAS
Dataset categoriesCategorize datasets according to the extent of their location information
Virtual dataset (VDS)• no location
Nonvirtual dataset (NVDS)• Logical dataset (LDS)
– Collection of logical files
• Physical dataset (PDS)– Collection of physical files
• Staged dataset– NVDS with mapping of sub-datasets to CPU or process
DIAL SC2003 November 15-18, 2003 17
David Adams
ATLAS
Dataset category associations (example)
VDS 1
LDS 1-1{LF1 LF2}
LDS 1-2{LF3}
PDS 1-1-1{PF1A PF2A}
PDS 1-1-2{PF1B PF2B}
PDS 1-1-3{PF1A PF2B}
PDS 1-2-1{PF3A}
PDS 1-2-2{PF3B}
Virtual
Logical
Physical
DIAL SC2003 November 15-18, 2003 18
David Adams
ATLAS
Present dataset implementation includes• Virtual dataset (VDS)
– Portable representation of dataset without location
• Logical dataset (LDS) and physical dataset (PDS)– Add location expressed in terms of logical files
• Dataset database (DDB)– Repository of (immutable) datasets indexed by ID
• Dataset selection catalog (DSC)– Enables users to select a VDS
• Dataset replica catalog (DRC)– Enables “system” to locate an NVDS representation of a VDS
• Dataset file catalog (DFC)– Maps single-file datasets to LFN
Dataset implementation
C++ classes w/ XML rep
MySQL tables
Files indexed by name
DIAL SC2003 November 15-18, 2003 19
David Adams
ATLAS
Dataset implementationVDS LDS DSC DRC
virtual X X Xlogical X X# events X X Xevent ID's X Xevent content X X Xlogical files Xsite (with data) X
mappingparent Xxform X
historylabels X
mutability X X Xvirtual X X Xlogical X X
provenance
compositeness
Property
identity
content
location
DIAL SC2003 November 15-18, 2003 20
David Adams
ATLAS
Grid2003 datasetsDefine dataset
• Provenance, # events and list of LFN’s
• Assign dataset name
• Create entry in DSC
Produce data with GCE/Pegasus/Chimera• Transfer files to BNL disk directory
Poll destination directory for new files• Register files in Magda
• Create single-file dataset (LDS) for each registered file
• Store each dataset in DDB (dataset database)
• Register LFN-to-dataset association in DFC
DIAL SC2003 November 15-18, 2003 21
David Adams
ATLAS
Grid2003 datasets (cont)At regular intervals
• Merge the current set of single-file datasets to create the latest merged LDS
• Use this LDS to create a VDS
• Register the VDS-LDS association in the DRC
• Update DSC entry with the new VDS
DIAL SC2003 November 15-18, 2003 22
David Adams
ATLAS
Grid2003 analysisUser selects dataset from DSC
• Dataset by name will change as data comes in• Dataset by ID is snapshot at time of selection
User submits job• If by name, use DSC to get current dataset ID• Use ID to extract VDS from DDB• Submit dataset and task to DIAL scheduler
Scheduler• Uses DRC to find LDS corresponding to VDS
– In principle, the “best” choice for the given task
• Splits LDS into single-file sub-datasets• Processes each, gathers and merges results
DIAL SC2003 November 15-18, 2003 23
David Adams
ATLAS
Discovery!The first Grid2003 dataset was analyzed and a mass was calculated from the four leading electrons
48k events
Simulated mass was130 GeV
Electron ET > 5GeV
DIAL SC2003 November 15-18, 2003 24
David Adams
ATLAS
More informationDIAL
• http://www.usatlas.bnl.gov/~dladams/dial
Datasets• http://www.usatlas.bnl.gov/~dladams/dataset
Grid2003 Datasets• http://www.usatlas.bnl.gov/~dladams/dataset/grid3