dial distributed interactive analysis of large datasets

24
David Adams ATLAS DIAL Distributed Interactive Analysis of Large datasets David Adams BNL November 17, 2003 SC2003 Phoenix

Upload: alaire

Post on 12-Jan-2016

44 views

Category:

Documents


0 download

DESCRIPTION

DIAL Distributed Interactive Analysis of Large datasets. SC2003 Phoenix. David Adams BNL November 17, 2003. Goals of DIAL What is DIAL? Design JDL Datasets Grid2003 More information. Contents. Goals of DIAL. 1. Demonstrate the feasibility of interactive analysis of large datasets - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: DIAL Distributed Interactive Analysis of Large datasets

David Adams

ATLAS

DIALDistributed Interactive Analysis of

Large datasets

David AdamsBNL

November 17, 2003

SC2003Phoenix

Page 2: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 2

David Adams

ATLAS

Contents

Goals of DIAL

What is DIAL?

Design

JDL

Datasets

Grid2003

More information

Page 3: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 3

David Adams

ATLAS

Goals of DIAL1. Demonstrate the feasibility of interactive analysis of large datasets

• How much data can we analyze interactively?

2. Set requirements for GRID services• In particular those specific to interactive analysis

– Job definition: application, task, dataset– Gathering and relaying results– Real time monitoring (partial results)– Resource management: discovery, allocation, sharing

3. Provide ATLAS with a useful analysis tool• For current and upcoming data challenges• Like to add another experiment to show generality

Page 4: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 4

David Adams

ATLAS

What is DIAL?DIAL provides a connection between

• Interactive analysis framework– Fitting, presentation graphics, …

– E.g. ROOT, JAS, …

• and Data processing application– Natural to the data of interest

– E.g. athena for ATLAS

DIAL distributes processing• Among sites, farms, nodes

• To provide user with desired response time

Look to other projects to provide most infrastructure

Page 5: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 5

David Adams

ATLAS

In tera ctive a na lysise .g . R O O T , JA S , ...

D IA L

D istribu ted p rocessing ru nn ing da ta -specific a pp lica tion

D a ta s e t S c he d u le r A A AJ o b

Page 6: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 6

David Adams

ATLAS

DesignDIAL has the following major components

• Dataset describing the data of interest

• Application defined by experiment/site

• Task is user extension to the application

• Job uses application and task to process a dataset

• Result is the output of a job

• Scheduler creates and manages jobs

Together these define a high-level JDL • (job definition language)

Figure shows how these components interact →

Page 7: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 7

David Adams

ATLAS

UserAnalysis

Job 1

Job 2

Application Task

Dataset 1

Scheduler

1. Create or locate

2. select 3. Create or select

4. select

8. run(app,tsk,ds1)

5. submit(app,tsk,ds)8. run(app,tsk,ds2)

6. splitDataset

Dataset 2

7. create

e.g. ROOT

e.g. athena

Result9. fill

10. gather

Result 9. fill

Result Code

Page 8: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 8

David Adams

ATLAS

Analysis job ratesAt what rate is a site processing sub-jobs?

• Assume 1000 CPU’s at a “site”• Continuum of requests with the following extremes

Large scale data production with 3–30 hours/job:• 30-300 jobs/hour (1 job/minute)• Fine for batch and grid schedulers

For interactive analysis with 1-10 seconds/job• 100-1000 jobs/sec (10000 jobs/minute)• Challenging for grid and batch schedulers• Handle with hierarchy of schedulers

– Each scheduler hnadles a fraction of the rate– But each level adds latency

Page 9: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 9

David Adams

ATLAS

DIAL scheduler hierarchy

GridSche du le r

D IA L sched ule r hie ra rchy

O g s aC lie n tSc he du le r

S iteSc he du le r

So apC lie n tSc he du le r

O GSASche du le r

M as te rSc he du le r

L o c alSc he du le r

SO A PSche du le r

ls run

L SFO GSA

Sche du le r

G rid p o rta l G rid s e rv ic e F a rm ga te w a y

Us e r

O g s aC lie n tSc he du le r

C o ndo r

fo rk

Us e r

So apC lie n tSc he du le r

S iteb o u nd a ry

Page 10: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 10

David Adams

ATLAS

JDLHigh level job definition language

• Enable users to specify task without reference to executables, data files or sites

• Scheduler decides where and how to process data

• Analysis implies user is easily able to customize task

Common language• Enable different experiments and non-HEP activities to

share schedulers

• PPDG activity to define such a language– Led by Gabriele Carcassi (STAR)

– Similar to DIAL (application, task. dataset, …)

– XML based

Page 11: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 11

David Adams

ATLAS

JDL (DIAL perspective)

P I/S E A L

G A N G A

R O O T

J A S

E xp e rim e ntP ro d u c tio n

H ighle v e lJ D L

P R O O F

G A N G A

C o nd o r-G

C him e ra

S T A R

J D A P

D IA L

C la re ns O G S A

U se r in te r fa ce s S ch e d u le r s

W e b se rv ice in f ra s tru ctu re

W S D L

Page 12: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 12

David Adams

ATLAS

DatasetsWant to provide a high-level data view

Unit of processing is called “dataset”• Many properties beyond data location

• Location is not just a list of files (physical or logical)– Multiple logical file set representations

– Representation might be tables in an RDB

– Or object list in an ODB

– Or …

Properties and categories follow

Page 13: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 13

David Adams

ATLAS

Dataset properties0. Identity

• Dataset must have an unique index and/or name

1. Content• Description of the type of data in the dataset

– Event or non-event data– Simulation, reconstruction,– ESD, AOD, …– Jets, tracks, electrons,…

2. Location• Where to find the data

– Logical files, physical files, site,…

3. Mapping• Which content is at which location?

Page 14: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 14

David Adams

ATLAS

Dataset properties (cont)4. Provenance

• Prescription for creating the data

• E.g. input dataset and transformation

5. History• Details of production beyond provenance

– How production was split into jobs,

– Processing node and time for each job, …

6. Labels• Assigned metadata outside other categories, e.g.

– Integrated luminosity

– Result of quality checks

– Flag indicating ok for use in published analyses

Page 15: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 15

David Adams

ATLAS

Dataset properties (cont)7. Mutability

• May dataset be modified?

• Possible states: locked, unlocked, extensible, …

8. Compositeness• Dataset made up of other datasets.

• Two cases:– Construction: provenance is the list of sub-datasets

> E.g. the summer dataset is defined to be the union of the June, July and August datasets.

– Assignment: factorization into sub-datasets

> Typically to reflect data placement

> E.g. a representation of a global dataset might include sub-datasets in New York, Paris and Moscow.

Page 16: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 16

David Adams

ATLAS

Dataset categoriesCategorize datasets according to the extent of their location information

Virtual dataset (VDS)• no location

Nonvirtual dataset (NVDS)• Logical dataset (LDS)

– Collection of logical files

• Physical dataset (PDS)– Collection of physical files

• Staged dataset– NVDS with mapping of sub-datasets to CPU or process

Page 17: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 17

David Adams

ATLAS

Dataset category associations (example)

VDS 1

LDS 1-1{LF1 LF2}

LDS 1-2{LF3}

PDS 1-1-1{PF1A PF2A}

PDS 1-1-2{PF1B PF2B}

PDS 1-1-3{PF1A PF2B}

PDS 1-2-1{PF3A}

PDS 1-2-2{PF3B}

Virtual

Logical

Physical

Page 18: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 18

David Adams

ATLAS

Present dataset implementation includes• Virtual dataset (VDS)

– Portable representation of dataset without location

• Logical dataset (LDS) and physical dataset (PDS)– Add location expressed in terms of logical files

• Dataset database (DDB)– Repository of (immutable) datasets indexed by ID

• Dataset selection catalog (DSC)– Enables users to select a VDS

• Dataset replica catalog (DRC)– Enables “system” to locate an NVDS representation of a VDS

• Dataset file catalog (DFC)– Maps single-file datasets to LFN

Dataset implementation

C++ classes w/ XML rep

MySQL tables

Files indexed by name

Page 19: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 19

David Adams

ATLAS

Dataset implementationVDS LDS DSC DRC

virtual X X Xlogical X X# events X X Xevent ID's X Xevent content X X Xlogical files Xsite (with data) X

mappingparent Xxform X

historylabels X

mutability X X Xvirtual X X Xlogical X X

provenance

compositeness

Property

identity

content

location

Page 20: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 20

David Adams

ATLAS

Grid2003 datasetsDefine dataset

• Provenance, # events and list of LFN’s

• Assign dataset name

• Create entry in DSC

Produce data with GCE/Pegasus/Chimera• Transfer files to BNL disk directory

Poll destination directory for new files• Register files in Magda

• Create single-file dataset (LDS) for each registered file

• Store each dataset in DDB (dataset database)

• Register LFN-to-dataset association in DFC

Page 21: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 21

David Adams

ATLAS

Grid2003 datasets (cont)At regular intervals

• Merge the current set of single-file datasets to create the latest merged LDS

• Use this LDS to create a VDS

• Register the VDS-LDS association in the DRC

• Update DSC entry with the new VDS

Page 22: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 22

David Adams

ATLAS

Grid2003 analysisUser selects dataset from DSC

• Dataset by name will change as data comes in• Dataset by ID is snapshot at time of selection

User submits job• If by name, use DSC to get current dataset ID• Use ID to extract VDS from DDB• Submit dataset and task to DIAL scheduler

Scheduler• Uses DRC to find LDS corresponding to VDS

– In principle, the “best” choice for the given task

• Splits LDS into single-file sub-datasets• Processes each, gathers and merges results

Page 23: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 23

David Adams

ATLAS

Discovery!The first Grid2003 dataset was analyzed and a mass was calculated from the four leading electrons

48k events

Simulated mass was130 GeV

Electron ET > 5GeV

Page 24: DIAL Distributed Interactive Analysis of Large datasets

DIAL SC2003 November 15-18, 2003 24

David Adams

ATLAS

More informationDIAL

• http://www.usatlas.bnl.gov/~dladams/dial

Datasets• http://www.usatlas.bnl.gov/~dladams/dataset

Grid2003 Datasets• http://www.usatlas.bnl.gov/~dladams/dataset/grid3