AMI Datasets… Solveig Albrand

Upload: eustace-phillips

Post on 23-Dec-2015


Page 1: AMI S.A. Datasets… Solveig Albrand. AMI S.A. A set is… A number of things grouped together according to a system of classification, or conceived as forming

AMI

Datasets…

Solveig Albrand

Page 2:

A set is…

• A number of things grouped together according to a system of classification, or conceived as forming a whole.

• A number of things connected in temporal or spatial succession, or by natural production or formation.

• A collection of instruments, tools, or machines used together in a particular operation.

Just a few of the definitions of sets in the Shorter Oxford Dictionary

Page 3:

Applied to ATLAS

• Our production is TDAQ or Monte-Carlo.

• Our operation is moving from one ATLAS site to another.

An ATLAS dataset is a number of files which have been produced together, or which are usefully grouped together for transport.

Page 4:

The things we put in the sets

• Our things are, in general, files (usually of binary data, but not always).

• What we really want out of the datasets is not the files themselves but the events in the files. It’s just that we have to transport files.

• The connection between files and events is quite “natural” in Monte Carlo production.

Page 5:

Dataset Definition Document

• “A set of data produced under the same logical conditions and is a minimal portion of data movable across GRID by ATLAS Distributed Data Management system, and is expected to consist of uniform files suitable for processing with the same application in the transformation chain”

ATLAS Dataset Definition Document

Page 6:

Monte-Carlo Production

[Diagram: TASK(EVGEN) and TASK(SIMUL), with datasets labelled EVNTS, HITS and LOG]

Task = « a set of jobs »

Page 7:

Notion of “Task”

• A “Task” is a transformation of the events in one or more datasets of a given type into one or more datasets of other types, which are usually (but not necessarily) different from the input type.

• Note that if more than one type is produced by a task, then we define an output dataset for each type, because the input of a succeeding step will be defined as a unique type.
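The task notion above can be sketched as a small data structure. This is only an illustration, not AMI code: the class names `Dataset` and `Task`, the method `define_outputs` and the naming scheme for output datasets are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Dataset:
    name: str       # hypothetical dataset name
    data_type: str  # e.g. "EVNTS", "HITS", "LOG"


@dataclass
class Task:
    name: str
    inputs: List[Dataset]
    output_types: List[str]
    outputs: List[Dataset] = field(default_factory=list)

    def define_outputs(self) -> List[Dataset]:
        # One output dataset per produced type, so that a succeeding
        # task can take a dataset of a single type as its input.
        self.outputs = [
            Dataset(name=f"{self.name}.{data_type}", data_type=data_type)
            for data_type in self.output_types
        ]
        return self.outputs
```

A task producing both events and logs would then define two output datasets, and only the event dataset would be handed to the next step.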

Page 8:

AMI Provenance Diagram

Page 9:

What about real data?

• Discussions are on-going about how datasets will be formed for real data, and even for commissioning.

• For the combined test beam (CTB) in 2004: 1 run = 1 dataset of “RAW” type, then from each “RAW” dataset several “recon” tasks produced ESD.

• This was in pre-DDM days.

Page 10:

DDM requires

• “A set of data produced under the same logical conditions and is a minimal portion of data movable across GRID by ATLAS Distributed Data Management system.”

• It seems that one CSC run is too small to be moved across the grid, so several runs are grouped together, according to the metadata.

• New VERSIONS of the dataset must be defined as runs become validated, or not.
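Grouping runs by their metadata, and defining a new version when the list of validated runs changes, might look like the sketch below. The run records, the metadata keys (`detector`, `run_type`) and both helper functions are hypothetical; the slides do not say which metadata is used for the grouping.

```python
from collections import defaultdict


def group_runs(runs, keys):
    """Group run numbers by the values of the chosen metadata keys."""
    groups = defaultdict(list)
    for run in runs:
        groups[tuple(run[k] for k in keys)].append(run["run"])
    return {key: sorted(numbers) for key, numbers in groups.items()}


def next_version(versions, validated_runs):
    """Record a new version only if the validated run list has changed.

    `versions` is the history of run lists for one dataset; the return
    value is the current version number, counting from 1.
    """
    content = sorted(validated_runs)
    if not versions or versions[-1] != content:
        versions.append(content)
    return len(versions)
```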

Page 11:

Tiles & Larg

[Diagram labels: Greenruns, Blueruns, Redruns]

larg.000038.BarrelP3C.Pedestal.high.v000001

larg.000050.EC_Installation.Trigger.high.v000001
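The two names on this slide look like dot-separated names that start with a prefix and a zero-padded number and end in a version field. A minimal sketch of splitting such a name follows. The field labels `prefix`, `number`, `middle` and `version` are neutral placeholders chosen here, since the slide does not say what each field means.

```python
import re

# Assumed shape: <prefix>.<digits>.<one or more fields>.v<digits>
NAME_RE = re.compile(
    r"^(?P<prefix>[^.]+)\.(?P<number>\d+)\.(?P<middle>.+)\.v(?P<version>\d+)$"
)


def split_dataset_name(name):
    """Split a dot-separated name into its leading, middle and version parts."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected dataset name: {name!r}")
    return {
        "prefix": match.group("prefix"),
        "number": int(match.group("number")),
        "middle": match.group("middle").split("."),
        "version": int(match.group("version")),
    }
```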

Page 12:

How will the datasets be formed?

• TDAQ will write a certain amount of metadata into the header of each file.

• Probably this should be written into a database also – Surely we should not have to open each file to decide which dataset it belongs to?

Page 13:

Event Collections

• Cf. Caitrina’s talk yesterday.

• We are interested in the events, but we can only transport events in files. The files should be transparent to the user.

• Note that the SAME set of files can be required for several DIFFERENT event collections. (How will we tell DDM this? Perhaps we don’t need to.)
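The point that different event collections can require the same files can be shown with a small sketch. Everything in it is made up for illustration (the file names, the event numbers and the `files_needed` helper); it only shows that the file set is derived from the events selected.

```python
def files_needed(collection, event_to_file):
    """Return the set of files that hold the events of a collection."""
    return {event_to_file[event] for event in collection}


# Hypothetical mapping from event number to the file that contains it.
event_to_file = {1: "file_A", 2: "file_A", 3: "file_B", 4: "file_B"}

collection_1 = [1, 3]  # one selection of events
collection_2 = [2, 4]  # a different selection, with no event in common
```

Here both collections resolve to the same two files, even though they share no events.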

Page 14:

2 Collections, same set of files

Page 15:

Other datasets

• For Monte-Carlo production, to get the “cross-section” calculated by EVGEN we need to parse the log files (done by AMI). We need only look at one log file per task.

• Either get ALL evgen logs for all evgen tasks

• OR make a “secondary” dataset (the first two evgen log files of each task's primary evgen log dataset) and open a subscription to this dataset on some site accessible to AMI.

• Actually, even doing this we end up transporting rather more than we need to, because in fact the “log” datasets contain the whole sandbox, and we need only the “log” file output by the job.
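Building the “secondary” log dataset could be sketched as follows. The function name, the input mapping and the choice of sorting file names to decide which files come “first” are assumptions made for this example, not a description of how AMI or DDM do it.

```python
def secondary_log_dataset(primary_log_datasets, n_files=2):
    """Pick the first n_files log files from each task's primary log dataset.

    `primary_log_datasets` maps a task name to the list of file names in
    that task's primary evgen log dataset. "First" is taken to mean first
    in sorted order, which is an assumption of this sketch.
    """
    secondary = []
    for task in sorted(primary_log_datasets):
        secondary.extend(sorted(primary_log_datasets[task])[:n_files])
    return secondary
```

Two files per task rather than one presumably leaves a spare if one log is unusable, while still transporting far fewer files than the full set of logs.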

Page 16:

Conditions DB

• Some trials have been made of transport of snapshots of the conditions DB to ATLAS sites, using DDM.

Page 17:

How many Datasets are we expecting?

• Used the computing model in 2 ways:

• Raw data + analysis model: 128 million

• Storage estimate of the number of files (22350000000), then divided by nFiles/dataset: 22 million

• But 42 million is just as good an answer as any…
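The second estimate is a simple division. The slide gives the number of files and the result but not the number of files per dataset, so the value of 1000 used below is an assumption, chosen only because it reproduces a figure of about 22 million.

```python
n_files = 22_350_000_000   # storage estimate quoted on the slide
files_per_dataset = 1_000  # assumed; not stated on the slide

n_datasets = n_files // files_per_dataset
print(f"{n_datasets / 1e6:.2f} million datasets")  # prints "22.35 million datasets"
```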
