
David Adams

ATLAS

DIAL: Distributed Interactive Analysis of Large datasets

David Adams

BNL

August 5, 2002

BNL OMEGA talk

Contents

• Definitions

• Use cases

• Requirements

• Design

• Datasets

• Dataset interface

• Dataset implementation

• Status and conclusions

Definitions

Dataset
• Collection of event data
  – Known event (beam crossing) IDs
  – Same content (raw, reconstructed, summary, …) for each event
  – Known luminosity and selection criteria (including triggers)
• Suitable for extracting physical quantities (cross section, limit, etc.)
  – Or special data for calibration, alignment or monitoring detector performance

Definitions (cont)

Large
• Too big to analyze from a single process
  – Today: 100 GB or more

Analysis
• Loop over events and perform the same action on each
  – Select events
  – Visualize events
  – Fill histograms and tuples
  – Generate new event data?

Definitions (cont)

Interactive
• Rapid response
  – Request processed in seconds, not hours
• Updates if the request is not finished quickly:
  – Partial results
  – Progress meter
    > % completed
    > Time to completion
  – Status visualization: what is being processed where
  – Able to terminate incomplete requests
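
The progress meter above can be illustrated with a minimal Python sketch. It is not DIAL code: the class name `ProgressMeter`, its methods, and the constant-rate estimate of the remaining time are all assumptions made for illustration.

```python
import time


class ProgressMeter:
    """Hypothetical progress meter: percent completed and time to completion."""

    def __init__(self, total_events, clock=time.monotonic):
        if total_events <= 0:
            raise ValueError("total_events must be positive")
        self.total_events = total_events
        self.done = 0
        self._clock = clock          # injectable so the meter can be tested
        self._start = clock()

    def update(self, n_processed):
        """Record that n_processed more events have been handled."""
        self.done = min(self.total_events, self.done + n_processed)

    def percent_completed(self):
        return 100.0 * self.done / self.total_events

    def time_to_completion(self):
        """Seconds remaining, assuming a constant event rate (None if unknown)."""
        if self.done == 0:
            return None
        elapsed = self._clock() - self._start
        return elapsed * (self.total_events - self.done) / self.done
```

A central process could poll such an object on the status-update time scale and show both numbers alongside any partial results.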

Definitions (cont)

Distributed
• Central process presents results to the user
• Processing carried out by multiple jobs
• Jobs on different machines and different sites
• Motivation:
  – Access remote data
  – Parallel processing for faster response

Use cases

Event data specification
• User defines dataset
  – Which events and which data in each event
• Includes version of data for each event
  – E.g. jets from reco version 14.2 instead of 13.1
• Restrict visible content of each event
  – E.g. jets, not tracks
  – Reduces cost of data access
• Dataset used as input for processing
• Dataset can be recorded and recalled later

Use cases (cont)

Event loop processing
• Event selection
  – User provides algorithm to be run on each event
  – Result determines if event is included in output dataset
• Fill histogram
  – User defines histogram and provides algorithm to fill from data for one event
• Fill tuple
  – Collection of named variables
  – User provides algorithm to fill 0-N times/event
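
The three event-loop use cases can be sketched in a few lines of Python. Everything here is hypothetical: events are plain dictionaries, and `run_event_loop`, `select`, `fill_hist` and `fill_tuple` are illustrative names, not part of DIAL, Athena or ROOT.

```python
def run_event_loop(events, select, fill_hist, fill_tuple, bin_edges):
    """Apply user-supplied algorithms to every event (illustrative only).

    select(event)     -> bool, event goes into the output selection
    fill_hist(event)  -> one value to histogram, or None to skip the event
    fill_tuple(event) -> list of rows (0 to N per event) of named variables
    """
    selected_ids = []
    counts = [0] * (len(bin_edges) - 1)
    rows = []
    for event in events:
        if select(event):
            selected_ids.append(event["id"])
        value = fill_hist(event)
        if value is not None:
            # Values outside the binned range are silently dropped.
            for i in range(len(counts)):
                if bin_edges[i] <= value < bin_edges[i + 1]:
                    counts[i] += 1
                    break
        rows.extend(fill_tuple(event))
    return selected_ids, counts, rows


events = [
    {"id": 1, "jets": [25.0, 60.0]},
    {"id": 2, "jets": []},
    {"id": 3, "jets": [110.0]},
]
selected, hist, ntuple = run_event_loop(
    events,
    select=lambda e: len(e["jets"]) > 0,
    fill_hist=lambda e: max(e["jets"]) if e["jets"] else None,
    fill_tuple=lambda e: [{"id": e["id"], "pt": pt} for pt in e["jets"]],
    bin_edges=[0, 50, 100, 150],
)
```

The user supplies only the three per-event algorithms; the loop itself is the same for every analysis, which is what makes it a candidate for distribution.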

Use cases (cont)

Single event processing
• Fetch event
  – Data for selected event returned to user
  – User may request a subset of the event data
• Visualization
  – User defines a “view”
  – User specifies an event and the associated data is used to fill the view

Use cases (cont)

Distributed processing
• Remote processing
  – Analysis program runs on the local node
  – Data is located on a remote node
  – Job processing data runs on the remote node
  – User generates requests on the local node which are run on the remote node with results returned to the local node
• Parallel processing
  – Dataset divided by event and each sub-dataset is processed in a separate process or thread
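
Dividing a dataset by event can be as simple as partitioning its list of event IDs. The following Python sketch shows one way to do it, assuming a dataset is represented only by its event IDs; the function name `split_by_event` is hypothetical.

```python
def split_by_event(event_ids, n_parts):
    """Split event IDs into n_parts contiguous, nearly equal sub-lists."""
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    ids = list(event_ids)
    base, extra = divmod(len(ids), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts take one additional event each.
        size = base + (1 if i < extra else 0)
        parts.append(ids[start:start + size])
        start += size
    return parts


sub_datasets = split_by_event(range(10), 3)
```

Each sub-list could then be handed to a separate process or thread. Every event appears in exactly one part, so merged results count each event once.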

Use cases (cont)

Distributed processing (cont)
• Multi-node processing
  – Previous processes are run on different compute nodes
• Multi-site processing
  – Previous processes are distributed over different sites
• GRID processing
  – Previous processing uses GRID for job specification, submission, authentication and monitoring

Requirements

Use cases
• Satisfy the preceding use cases

Interactivity
• Show status while a request is being processed
• Update status once/minute (adjustable)
• Return partial results on the same time scale
• Provide facility to abort a request

Requirements (cont)

History
• Event selection
  – Identify and record the attributes (including code) for each event selection algorithm
• Dataset
  – Identify and record each dataset
  – Provide mechanism to recover the selection algorithm(s) used to construct a dataset
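
One possible reading of the history requirement is a small provenance store: each selection algorithm is recorded with its code and parameters, each dataset records its parent and the selection that produced it, and the chain can be walked back. The Python sketch below is an assumption about how this might look; `SelectionRecord`, `DatasetRecord` and `History` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SelectionRecord:
    selection_id: str
    code: str                 # source text of the selection algorithm
    parameters: tuple = ()    # e.g. (("min_pt", 20.0),)


@dataclass(frozen=True)
class DatasetRecord:
    dataset_id: str
    parent_id: Optional[str] = None      # dataset this one was selected from
    selection_id: Optional[str] = None   # selection that produced it


class History:
    def __init__(self):
        self._selections = {}
        self._datasets = {}

    def record_selection(self, rec):
        self._selections[rec.selection_id] = rec

    def record_dataset(self, rec):
        # A derived dataset needs both a parent and a selection, or neither.
        if (rec.parent_id is None) != (rec.selection_id is None):
            raise ValueError("parent_id and selection_id must be set together")
        self._datasets[rec.dataset_id] = rec

    def selections_for(self, dataset_id):
        """Return the selections used to build a dataset, oldest first."""
        chain = []
        current = self._datasets[dataset_id]
        while current.selection_id is not None:
            chain.append(self._selections[current.selection_id])
            current = self._datasets[current.parent_id]
        chain.reverse()
        return chain
```

Storing the code text itself, not just a name, is what lets a later user recover exactly what was run.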

Design

Dataset
• This description of a set of event data is the basis for all analysis
  – More on this later

Analyzer
• User works in an analysis framework which provides the tools required to view and process histograms and tuples
• ROOT is one example

Design (cont)

Task

• Specifies the operation to perform on each event including

– Number of event selections to be performed

– Histograms to be filled

– Tuple to be filled

– Code which makes selections and fills histograms and tuples

Design (cont)

Application
• Description of the executable run by jobs
• Loops over events in a dataset
• Executes task on each to generate event result
• Merges successful event results to form a dataset result
• Specification includes
  – Application name
    > E.g. Athena or ROOT
  – Version or acceptable versions

Design (cont)

Event result
• Flag indicating whether event was accepted for each event selection entry
• Histogram entries for each fill
• Tuple values for each fill
• Return status from task
  – Success or failure

Design (cont)

Dataset result
• New dataset for each event selection
  – Old dataset plus list of IDs for each event selection
• Filled histograms
• Filled tuples
• List of events for which task processing was unsuccessful
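
The relation between event results and the dataset result can be sketched as a merge step. This Python example is illustrative only: the field names and `merge_event_results` are assumptions, and histograms are kept as raw lists of entries instead of binned objects.

```python
from dataclasses import dataclass, field


@dataclass
class EventResult:
    event_id: int
    success: bool                                      # return status from task
    accepted: list = field(default_factory=list)       # one flag per selection
    hist_entries: dict = field(default_factory=dict)   # histogram name -> values
    tuple_rows: list = field(default_factory=list)


@dataclass
class DatasetResult:
    selected_ids: list   # one list of event IDs per event selection
    hist_values: dict    # histogram name -> all merged entries
    tuple_rows: list
    failed_ids: list     # events for which the task was unsuccessful


def merge_event_results(results, n_selections):
    """Merge successful event results; record failed events separately."""
    out = DatasetResult([[] for _ in range(n_selections)], {}, [], [])
    for r in results:
        if not r.success:
            out.failed_ids.append(r.event_id)
            continue
        for ids, flag in zip(out.selected_ids, r.accepted):
            if flag:
                ids.append(r.event_id)
        for name, values in r.hist_entries.items():
            out.hist_values.setdefault(name, []).extend(values)
        out.tuple_rows.extend(r.tuple_rows)
    return out
```

Each list in `selected_ids`, combined with the input dataset, corresponds to the "old dataset plus list of IDs" form of a new dataset.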

Design (cont)

Job scheduler
• Receives request (application, task and dataset) from analyzer
• May divide dataset into sub-datasets
• Creates or locates jobs with a matching application (and possibly task)
• Adds task to jobs if needed
• Passes a dataset to each job, invokes task and receives result
• Merges results and returns to analyzer
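
The scheduler steps above (divide, dispatch, merge) can be mimicked on a single machine with a thread pool. This Python sketch is only an analogy: the function `schedule` is hypothetical, jobs are threads instead of remote processes, and the application is left out, so the task is simply a callable applied to a sub-dataset.

```python
from concurrent.futures import ThreadPoolExecutor


def schedule(task, dataset, n_jobs, merge):
    """Divide a dataset, run the task on each sub-dataset, merge the results.

    task(sub_dataset) -> partial result
    merge(partials)   -> combined result returned to the caller
    """
    # Strided split: event k goes to job k % n_jobs. Empty chunks are dropped.
    chunks = [dataset[i::n_jobs] for i in range(n_jobs)]
    chunks = [c for c in chunks if c]
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        partials = list(pool.map(task, chunks))
    return merge(partials)


# Count even-numbered events in a toy dataset of event IDs.
n_even = schedule(
    task=lambda ids: sum(1 for i in ids if i % 2 == 0),
    dataset=list(range(10)),
    n_jobs=3,
    merge=sum,
)
```

The merge function must be insensitive to how the dataset was divided, which holds for sums, histogram additions and list concatenations of selected IDs.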

Design (cont)

[Diagram: request flow among Analyzer, Scheduler, Application, Task, Dataset 1, Dataset 2, Job 1 and Job 2. Step labels shown in the figure:]
1. to 4. create
5. submit(app,tsk,ds)
6. splitDataset; create
7. create(app,tsk) (shown twice)
8. submit(tsk,ds1); submit(tsk,ds2)

Datasets

Datasets provide interface and means for accessing event data
• Different types
  – Raw
  – Reconstructed
  – Summary
  – Tag
• Organized into EDOs (event data objects)
  – Dataset does not see inside EDO
• Following plots give some examples

Datasets (cont)

[Figure: two event views, labeled reco 1 and reco 2, each containing Raw, TrackClusters, FoundTracks, RefitTracks, Electrons and EMClusters.]
Two complete event views with the same content.

Datasets (cont)

[Figure: two event views built from Raw, TrackClusters, FoundTracks, RefitTracks, Electrons and EMClusters, with some objects marked "absent".]
Two incomplete and consistent event views with the same content.

Datasets (cont)

[Figure: two event views built from Raw, TrackClusters, FoundTracks, RefitTracks, Electrons and EMClusters, each with a second RefitTracks and Electrons. Figure annotations: "Not allowed" and "Not allowed?".]
Ambiguous event view.
Inconsistent event view.

Dataset interface

Event range
• Collection of event IDs

Content
• Collection of content IDs

Event data (event views)
• For each event ID-content ID pair:
  – A means to access the corresponding EDO, or
  – A flag indicating the EDO is not included
• No other event data is included
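
The interface described above maps naturally onto a small class: a fixed set of event IDs, a fixed set of content IDs, and, for every pair, either a locator for the EDO or an "absent" marker. The Python sketch below is a guess at the shape of such an interface, not the DIAL C++ class; `Dataset`, `set_edo`, `get_edo` and the locator strings are hypothetical.

```python
class Dataset:
    """Illustrative dataset: event range, content, and per-pair EDO access."""

    ABSENT = object()   # flag meaning "EDO not included for this pair"

    def __init__(self, event_ids, content_ids):
        self.event_ids = tuple(event_ids)
        self.content_ids = tuple(content_ids)
        self._edo = {}  # (event ID, content ID) -> opaque locator

    def _check(self, event_id, content_id):
        # No event data outside the declared range and content is included.
        if event_id not in self.event_ids or content_id not in self.content_ids:
            raise KeyError((event_id, content_id))

    def set_edo(self, event_id, content_id, locator):
        self._check(event_id, content_id)
        self._edo[(event_id, content_id)] = locator

    def get_edo(self, event_id, content_id):
        """Return the locator for the EDO, or Dataset.ABSENT if not included."""
        self._check(event_id, content_id)
        return self._edo.get((event_id, content_id), Dataset.ABSENT)


ds = Dataset(event_ids=[1, 2], content_ids=["jets", "tracks"])
ds.set_edo(1, "jets", "file_a#17")
```

The locator is deliberately opaque here, in keeping with the statement that the dataset does not see inside an EDO.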

Dataset interface (cont)

[Figure: a dataset drawn against three axes, Event ID, Version (code, params) and Content (type-key, PC, stream), with labels "Event list", "5 Files" and "1 Dataset".]
Example of a dataset and its mapping to data files

Dataset implementation

Datasets are used in many ways
• Inspection by humans
• I/O for processing in C++
  – And other languages
• Cataloging in DBs

Implementation
• Prefer something object oriented
• At present, C++ classes with XML persistence
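
XML persistence serves all three uses listed above: it is readable by humans, parseable from several languages, and easy to catalog. The Python sketch below round-trips a minimal dataset description through XML with the standard library. The element and attribute names (`dataset`, `events`, `event`, `content`, `item`) are invented for this example and are not the DIAL schema.

```python
import xml.etree.ElementTree as ET


def dataset_to_xml(name, event_ids, content_ids):
    """Serialize a minimal dataset description to an XML string."""
    root = ET.Element("dataset", {"name": name})
    events = ET.SubElement(root, "events")
    for eid in event_ids:
        ET.SubElement(events, "event", {"id": str(eid)})
    content = ET.SubElement(root, "content")
    for cid in content_ids:
        ET.SubElement(content, "item", {"id": cid})
    return ET.tostring(root, encoding="unicode")


def dataset_from_xml(text):
    """Parse the XML produced by dataset_to_xml back into Python values."""
    root = ET.fromstring(text)
    name = root.get("name")
    event_ids = [int(e.get("id")) for e in root.find("events")]
    content_ids = [c.get("id") for c in root.find("content")]
    return name, event_ids, content_ids


xml_text = dataset_to_xml("demo", [101, 102], ["jets", "tracks"])
restored = dataset_from_xml(xml_text)
```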

Status and conclusions

High-level design for DIAL is in place
• Described in this talk
• See http://www.usatlas.bnl.gov/~dladams/dial

Detailed design and first implementation of datasets is finished

• See http://www.usatlas.bnl.gov/~dladams/dataset
