Henry Nebrensky - MICE CM24 - 2 June 2009
MICE Data Flow
Henry Nebrensky
Brunel University
MICE Data and the Grid
Storage, archiving and dissemination of experimental data:
  Not been a high priority so far
  Overall strategy not documented anywhere obvious
  Individual work on parts of this – but do the pieces fit together?
Grid:
  Certain Grid services are separately funded to provide a production service to MICE
  Provides a ready-made set of building blocks – but “we” have to put them together
  MICE need to know what they want, to make sure that the finished edifice meets all their needs (and that Grid includes all the necessary bricks)
Decision Time
We need to start putting the pieces together very soon.
Once data starts going on tape it will not be possible to change how and where it is stored
  need an agreed plan in the near future (i.e. by end of CM24)
There are a number of unresolved issues – see Note 252 and the data flow diagram.
  Data volumes, lifetime and access control mostly unclear
  (LFC) File naming scheme – see MICE Note 247
  File metadata requirements – raised at CM23
The Awesome Power of Grid Computing
The Grid provides seamless interconnection between tens of thousands of computers.
It therefore generates new acronyms and jargon at superhuman speed.
Grid Middleware
We are currently using EGEE/WLCG middleware and resources, as they are receiving significant development effort and are a reasonable match for our needs (shared with various minor experiments such as LHC)
Outside Europe other software may be expected – e.g. the OSG stack in the US. Interoperability is, from our perspective, yet another “known unknown”...
MICE and Grid Data Storage
The Grid provides MICE not only with computing (number-crunching) power, but also with a secure global framework allowing users access to data
Good news: storing development data on the Grid keeps it available to the collaboration – not stuck on an old PC in the corner of the lab
Bad news: loss of ownership – who picks up the data curation responsibilities?
Data can be downloaded from the Grid to a user’s “own” PC – it doesn’t need to be analysed remotely
Grid File Management (1)
Each file is given a unique, machine-generated GUID when stored on the Grid
The file is physically uploaded to one (or more) SEs (Storage Elements) where it is given a machine-generated SURL (Storage URL)
Machine-generated names are not (meant to be) human-usable
A “replica catalogue” tracks the multiple SURLs of a GUID
For sanity's sake we would like to associate sensible filenames with each file (LFN, Logical File Name)
A “file catalogue” is a database that translates between something that looks like a Unix filesystem and the GUIDs and SURLs needed to actually access the data on the Grid
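The relationship between LFN, GUID and SURL can be sketched as two lookup tables. This is a toy in-memory model, not the LFC client API: the class name `ToyCatalogue`, its methods, and the example LFN and SURLs are all hypothetical and exist only to show how one human-readable name resolves to several physical replicas.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyCatalogue:
    """In-memory stand-in for the file and replica catalogues (NOT the LFC API)."""
    lfn_to_guid: dict = field(default_factory=dict)    # "file catalogue"
    guid_to_surls: dict = field(default_factory=dict)  # "replica catalogue"

    def register(self, lfn: str, surl: str) -> str:
        """Register a new file: mint a GUID, record its first replica and its LFN."""
        if lfn in self.lfn_to_guid:
            raise ValueError(f"LFN already registered: {lfn}")
        guid = str(uuid.uuid4())  # machine-generated, not meant for humans
        self.lfn_to_guid[lfn] = guid
        self.guid_to_surls[guid] = [surl]
        return guid

    def add_replica(self, lfn: str, surl: str) -> None:
        """Record an additional copy of an existing file on another SE."""
        self.guid_to_surls[self.lfn_to_guid[lfn]].append(surl)

    def replicas(self, lfn: str) -> list:
        """Resolve a human-readable LFN to every SURL holding the data."""
        return list(self.guid_to_surls[self.lfn_to_guid[lfn]])


catalogue = ToyCatalogue()
lfn = "/grid/mice/users/example/run0001.dat"                 # hypothetical LFN
catalogue.register(lfn, "srm://se1.example.org/mice/f1")     # hypothetical SURLs
catalogue.add_replica(lfn, "srm://se2.example.org/mice/f1")
print(catalogue.replicas(lfn))
```

The user only ever handles the LFN; the GUID and SURLs stay behind the catalogue lookups.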
Grid File Management (2)
MICE has an instance of LFC (LCG File Catalogue) run by the Tier 1 at RAL
The LFC service can do both the replica and LFN cataloguing
LFC presents the user with what looks like a normal Unix filespace - the Grid client SW keeps track of the data behind the scenes.
[Diagram. Entities shown: UI, Local Disk, LFC (File Catalogue and Replica Catalogue), Metadata Catalogue, File Transfer Service, and three SEs (one drawn in detail as a Head Node, three Pools and Tape). Example identifiers shown in the diagram:]
LFN /grid/mice/users/Nebrensky/sw/g4beamline-1.15.3-Linux-g++.tgz
GUID ad9e349c-7a56-4961-8741-8242949433b0
SURL srm://dgc-grid-34.brunel.ac.uk/dpm/brunel.ac.uk/home/mice/generated/2009-03-09/file34c34ee1-f10b-463a-80b3-2d257231261f
TURL gsiftp://dgc-grid-52.brunel.ac.uk/data2/dpmfs/mice/2009-03-09/file34c34ee1-f10b-463a-80b3-2d257231261f.3660836.0
From MICE Note 247
Data Integrity
(For recent SE releases) a checksum is calculated automatically when a file is uploaded.
This can be checked when the file is transferred between SEs, or the value retrieved to check local copies.
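Checking a local copy against the stored value might look like the sketch below. The slides do not say which checksum algorithm the SE uses; Adler-32 is assumed here purely for illustration, and the function names are hypothetical. The value reported by the SE would have to be retrieved separately with the Grid client tools.

```python
import os
import tempfile
import zlib


def adler32_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)  # running checksum over chunks
    return f"{value & 0xFFFFFFFF:08x}"


def matches_catalogue(path: str, expected_hex: str) -> bool:
    """Compare a local copy against the checksum reported by the SE."""
    return adler32_of_file(path) == expected_hex.lower()


# Demonstration with a throwaway file standing in for a downloaded copy.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"MICE raw data placeholder")
    local_copy = tmp.name

stored_checksum = adler32_of_file(local_copy)  # stands in for the SE's value
print(matches_catalogue(local_copy, stored_checksum))
os.remove(local_copy)
```

Reading in chunks keeps memory use flat even for large RAW files.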
The VOMS server
File permissions will be needed, e.g. to ensure that users can’t accidentally delete RAW data. These rules will need to last for at least the life of the experiment.
VOMS is a Grid service that allows us to define specific roles (e.g. DAQ data archiver) which will then be allowed certain privileges (such as writing to tape at RAL Tier 1).
The VOMS service then maps humans to those roles, via their certificates.
MICE VOMS server is provided via GridPP at Manchester, UK.
New Mice are added or assigned to roles by the VO Manager (and Mouse) Paul Hodgson.
Thus the VOMS service provides us with a single portal where we can add/remove/reassign Mice, without needing to negotiate with the operators of every Grid resource worldwide – we actually keep control “in-house.”
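The idea of mapping certificates to roles, and roles to privileges, can be sketched as two small tables. This is only a conceptual model and not the VOMS interface: the role names echo the examples in these slides, while the privilege names and certificate subjects (DNs) are invented for illustration.

```python
# Hypothetical privileges per role; role names follow the examples in the talk.
ROLE_PRIVILEGES = {
    "archiver": {"write_raw_tape"},
    "production": {"read_raw", "write_reco"},
    "user": {"read_reco"},
}

# Hypothetical certificate subjects (DNs), for illustration only.
ROBOT_DN = "/C=UK/O=eScience/CN=daq robot"
MOUSE_DN = "/C=UK/O=eScience/CN=a mouse"

# The VO Manager maintains this mapping from certificate to roles.
MEMBER_ROLES = {
    ROBOT_DN: {"archiver"},
    MOUSE_DN: {"user"},
}


def is_allowed(dn: str, privilege: str) -> bool:
    """True if any role held by this certificate grants the privilege."""
    roles = MEMBER_ROLES.get(dn, set())
    return any(privilege in ROLE_PRIVILEGES[role] for role in roles)


def reassign(dn: str, roles: set) -> None:
    """One central change; no per-site negotiation is needed."""
    MEMBER_ROLES[dn] = set(roles)


print(is_allowed(ROBOT_DN, "write_raw_tape"))  # the archiver may write to tape
print(is_allowed(MOUSE_DN, "write_raw_tape"))  # an ordinary user may not
```

Because resources check the role rather than the person, changing who holds a role touches only the central mapping.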
MICE Data Flow
The basic data flow in MICE is thus something like:
The raw data files from the experiment are sent to tape using Grid protocols, including registering the files in LFC.
The offline reconstruction can then use Grid/LFC to pull down the raw data, and upload reconstructed (“RECO” or DST) files.
Users can use Grid/LFC to access RECO files they want to play with.
If I combine the above description with some background knowledge of the Grid, some snippets of what people are working on and a whole lot of guesswork I get:
MICE Data Flow Diagram
[Figure 1: Data flow from the MICE experiment. Short-dashed entities require confirmation. Long-dashed lines represent borders between subnets.]
Labels recoverable from the diagram:
  Networks: MiceNet DAQ network; Controls & Monitoring network; ISIS or PPD network; Tier 1 network
  Links: Optical Fibre 1 Gbps to Tier 1; 2 TB/day = 200 Mbps; Backup link to Tier 1? Failover link to Tier 2?
  Online: MICE DAQ; Online Buffer; Online Farm; Online Reconstruction; Ephemeral ROOT histograms; Create file metadata?; Transfer Box (MICEACQ05)
  Data: RAW; RECO (ROOT trees); Analysis results; results archive?
  Storage: CASTOR tape; CASTOR disk; Tier 2 SE disk; User local disk
  Processing: Offline Reconstruction; UK GridPP Tier 2 Farms; Any Tier 2 Farm; MC simulation; Semi-automated process; “Chaotic” (on-demand) analysis
  Credentials: “Robot”? certificate, VOMS “archiver”? role, MICE_RAW_TAPE? token; “Anointed user” certificate, VOMS “production” role, MICE_RECO_DISK? token; Generic user certificate
  Services: LFC; AMGA; MyProxy; BDII; VOMS, DNS, NTP?; Conditions Database (EPICS); Configuration Database?; Config DB “API”?
Short-dashed lines indicate entities that still need confirmation
Question marks indicate even higher levels of uncertainty
More details in MICE Note 252
The diagram would look pretty much the same if non-Grid tools were used
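The data flow diagram quotes a RAW rate of 2 TB/day, equated to 200 Mbps, over a 1 Gbps fibre to Tier 1. A quick check of that arithmetic is sketched below, assuming decimal units (1 TB = 10^12 bytes) and a transfer spread evenly over 24 hours; both are assumptions, and the function name is illustrative.

```python
def sustained_rate_mbps(terabytes_per_day: float) -> float:
    """Average rate in Mbps needed to move a daily volume (decimal units)."""
    bits_per_day = terabytes_per_day * 1e12 * 8   # TB -> bytes -> bits
    seconds_per_day = 24 * 60 * 60
    return bits_per_day / seconds_per_day / 1e6   # bits/s -> Mbps


rate = sustained_rate_mbps(2.0)
link_mbps = 1000.0  # the 1 Gbps link shown in the diagram
print(f"{rate:.0f} Mbps sustained, {rate / link_mbps:.0%} of the link")
```

The average comes out slightly under the quoted 200 Mbps, so the figure in the diagram reads as a rounded-up value. Any real transfer would be burstier than this average suggests.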
Data Flow Implementation
Most of this is NOT in place yet (at production level)!
[Figure 1 (data flow from the MICE experiment) repeated with fewer entities. Labels recoverable from this version:]
  Networks: MiceNet DAQ network; Controls & Monitoring network; Tier 1 network
  Links: 2 TB/day = 200 Mbps
  Online: MICE DAQ; Online Buffer; Online Farm; Online Reconstruction; Transfer Box (MICEACQ05)
  Data: RAW; Analysis results
  Storage: CASTOR tape; CASTOR disk; Tier 2 SE disk; User local disk
  Processing: Offline Reconstruction; UK GridPP Tier 2 Farms; Any Tier 2 Farm; MC simulation; “Chaotic” (on-demand) analysis
  Credentials: Generic user certificate
  Services: LFC; Conditions Database (EPICS)
MICE Data Unknowns
MICE Note 252 identifies four main flavours of data: RAW, RECO, analysis results, and Monte Carlo simulation.
For all four, we need to understand the:
  volume (the total amount of data, the rate at which it will be produced, and the size of the individual files in which it will be stored)
  lifetime (ephemeral or longer lasting? will it need archiving to tape?)
  access control (who will create the data? who is allowed to see it? can it be modified or deleted, and if so who has those privileges?)
Also need to identify use cases I’ve missed, especially ones that will need more VOMS roles or CASTOR space tokens.
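One way to keep track of these unknowns is a small record per data flavour, where every unanswered question is left as `None`. The sketch below is only a bookkeeping aid; the field names are hypothetical. The only values filled in are those stated elsewhere in these slides: the 2 TB/day RAW rate from the data flow diagram, and that RAW data goes to tape.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataFlavour:
    """Attributes to pin down for one flavour of data; None means 'unknown'."""
    name: str
    total_volume_tb: Optional[float] = None   # volume: total amount
    rate_tb_per_day: Optional[float] = None   # volume: production rate
    file_size_gb: Optional[float] = None      # volume: size of individual files
    archive_to_tape: Optional[bool] = None    # lifetime
    writers: Optional[str] = None             # access control: who creates it
    readers: Optional[str] = None             # access control: who may see it
    deleters: Optional[str] = None            # access control: who may delete it

    def open_questions(self) -> list:
        """Names of the attributes that still have no answer."""
        return [key for key, value in vars(self).items() if value is None]


FLAVOURS = {
    # Rate taken from the data flow diagram; RAW is described as going to tape.
    "RAW": DataFlavour("RAW", rate_tb_per_day=2.0, archive_to_tape=True),
    "RECO": DataFlavour("RECO"),
    "analysis results": DataFlavour("analysis results"),
    "Monte Carlo simulation": DataFlavour("Monte Carlo simulation"),
}

for flavour in FLAVOURS.values():
    print(flavour.name, "->", flavour.open_questions())
```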
File Catalogue Namespace (1)
Also, we need to agree on a consistent namespace for the file catalogue
Proposal (MICE Note 247, Grid talk at CM23):
  We get given /grid/mice/ by the server
  Five upper-level directories:
    Construction/
      historical data from detector development and QA
    Calibration/
      needed during analysis (large datasets, c.f. DB)
    TestBeam/
      test beam data
    MICE/
      DAQ output and corresponding MC simulation
File Catalogue Namespace (2)
/grid/mice/users/name
  For people to use as scratch space for their own purposes, e.g. analysis
Encourage people to do this through LFC – helps avoid “dark data”
LFC allows Unix-style access permissions
Again, the LFC namespace is something that needs to be finalised before production data can start to be registered.
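A helper that only builds LFNs under the agreed top-level directories is one way to keep the namespace consistent once it is finalised. The sketch below uses the proposed root and the directory names listed in the proposal (Construction, Calibration, TestBeam, MICE, plus users); the function names and validation rules are hypothetical, and the proposal itself was still open at the time of this talk.

```python
import posixpath

ROOT = "/grid/mice"
# Top-level directories from the proposal; subject to change until agreed.
TOP_LEVEL = ("Construction", "Calibration", "TestBeam", "MICE", "users")


def make_lfn(top: str, *parts: str) -> str:
    """Build an LFN under one of the agreed top-level directories."""
    if top not in TOP_LEVEL:
        raise ValueError(f"unknown top-level directory: {top!r}")
    for part in parts:
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"bad path component: {part!r}")
    return posixpath.join(ROOT, top, *parts)


def is_valid_lfn(lfn: str) -> bool:
    """True if the LFN is normalised and sits under an agreed directory."""
    if posixpath.normpath(lfn) != lfn:
        return False  # rejects '..', doubled or trailing slashes
    prefix = ROOT + "/"
    if not lfn.startswith(prefix):
        return False
    top = lfn[len(prefix):].split("/", 1)[0]
    return top in TOP_LEVEL


print(make_lfn("users", "Nebrensky", "sw", "g4beamline-1.15.3-Linux-g++.tgz"))
print(is_valid_lfn("/grid/mice/scratch/file.dat"))
```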
Metadata Catalogue
For many applications – such as analysis – you will want to identify the list of files containing the data that matches some parameters
This is done by a “metadata catalogue”. For MICE this doesn't yet exist
A metadata catalogue can in principle return either the GUID or an LFN – it shouldn’t matter which as long as it’s properly integrated with the other Grid services.
(Grid talk at CM23)
MICE Metadata Catalogue
We need to select a technology to use for this
  use the configuration database?
  gLite AMGA (who else uses it – will it remain supported?)
Need to implement – i.e. register metadata to files
What metadata will be needed for analysis?
Should the catalogue include the file format and compression scheme (gzip ≠ PKzip)?
MICE Metadata Catalogue for Humans
or, in non-Gridspeak:
we have several databases (configuration DB, EPICS, e-Logbook) where we should be able to find all sorts of information about a run/timestamp.
but how do we know which runs to be interested in, for our analysis?
we need an “index” to the MICE data, and for this we need to define the set of “index terms” that will be used to search for relevant datasets.
MICE Metadata
Run, date/time
Step
Nominal 4-d / transverse normalised Emittance
Diffuser setting
Nominal Momentum
Configuration:
  Magnet currents
  Physical geometry
RF?
???
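Searching such an index by its terms could look like the sketch below. It is not AMGA or any existing MICE service: the record layout picks a few of the candidate terms above (run, step, nominal momentum, nominal emittance, diffuser setting), and every value, unit and LFN in `INDEX` is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One index entry: the search terms for a file, plus its LFN."""
    lfn: str
    run: int
    step: int
    momentum: float   # nominal momentum (units are an assumption)
    emittance: float  # nominal emittance (units are an assumption)
    diffuser: int     # diffuser setting


# Entirely hypothetical entries, for illustration only.
INDEX = [
    RunRecord("/grid/mice/MICE/run00001.dat", 1, 1, 200.0, 6.0, 2),
    RunRecord("/grid/mice/MICE/run00002.dat", 2, 1, 140.0, 6.0, 2),
    RunRecord("/grid/mice/MICE/run00003.dat", 3, 1, 200.0, 10.0, 4),
]


def find_files(index, **criteria):
    """Return the LFNs of all records matching every given term exactly."""
    return [
        record.lfn
        for record in index
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


print(find_files(INDEX, step=1, momentum=200.0))
```

The query returns LFNs here; as noted for the metadata catalogue, returning GUIDs would work equally well provided the other Grid services can resolve them.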
Conclusions
The data flow is more complex than people realise…
… and probably won’t work by accident
Some specific issues that need to be understood are the attributes of the data flows (Note 252), the LFC Namespace (Note 247) and the index terms for the metadata catalogue.
This needs discussion and (where necessary) decision pretty soon – by end CM24 – to be ready for data taking.