Scientific Data Management: An Incomplete Experimental HENP Perspective

Page 1: Scientific Data Management: An Incomplete  Experimental HENP Perspective

D. Olson, SDM-ISIC Mtg, 26 Mar 2002

Scientific Data Management: An Incomplete Experimental HENP Perspective

D. Olson, LBNL

26 March 2002
SDM-ISIC Meeting
Gatlinburg

Page 2

Particle Physics Data Grid

www.ppdg.net

PIs: Mount, Livny, Newman

Coordinators: Pordes, Olson

Page 3

Contents

• Quick overview of HENP data
  - Generic data flow
  - Sizes, timescales
  - Average physicist view

• What’s hard
  - Making technology work in production
  - A clear view for average physicist
  - Analysis of large datasets
  - Other things as well

• Today, many issues wrapped in hopes for “Data Grid”

Page 4

Experimental HENP event data

• Basic character of data is “event”
  - May be few particles

Page 5

BaBar event

http://www.slac.stanford.edu/BFROOT/

Page 6

Experimental HENP event data

• Basic character of data is “event”
  - May be few particles
  - May be MANY particles

Page 7

STAR event, Au + Au

www.star.bnl.gov

Page 8

Experimental HENP event data

• Basic character of data is “event”
  - May be few tracks
  - May be MANY tracks

• Detector characteristics, beam types, and triggers affect the type of events recorded

• Physics analysis is a statistical analysis of many (thousands, millions, billions, trillions) independent events
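
As a rough illustration of the last bullet, the sketch below treats each event as an independent record and accumulates a histogram over many of them. The event layout (a dict with an `n_tracks` field), the toy generator, and the bin width are assumptions made for illustration, not any experiment's data format.

```python
import random
from collections import Counter

BIN_WIDTH = 500  # illustrative bin width in tracks per event

def toy_events(n, seed=0):
    """Yield n independent toy events; each carries only a track count (hypothetical layout)."""
    rng = random.Random(seed)
    for event_id in range(n):
        yield {"id": event_id, "n_tracks": rng.randint(1, 2000)}

def multiplicity_histogram(events, min_tracks=0):
    """Count events per track-multiplicity bin, keeping events with at least min_tracks."""
    hist = Counter()
    for ev in events:
        if ev["n_tracks"] >= min_tracks:
            low_edge = ev["n_tracks"] // BIN_WIDTH * BIN_WIDTH
            hist[low_edge] += 1
    return hist

hist = multiplicity_histogram(toy_events(10_000))
for low_edge in sorted(hist):
    print(f"{low_edge:5d} to {low_edge + BIN_WIDTH - 1:5d} tracks: {hist[low_edge]} events")
```

Because the events are independent, the loop can be split across files or machines and the partial histograms added together afterwards.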

Page 9

Generic data flow in HENP

“Skims”, “microDST production”, …

Filtering chosen to make this a convenient size

Page 10

A collaboration of people

$100M, 10 yr, 100 people

Free?, 10 yr, 20 people

Free?, 1 yr, 10 people, 5x/yr

Free?, 1 mo, 1 person, 50x/yr

(“Typical” example today, LHC is larger)

Page 11

Example: CMS Tiers

Page 12

List of major accelerator-based HENP experiments

Experiment    Location     # physicists   Time scale
BaBar         SLAC         800            1999 - 2010
STAR          BNL / RHIC   450            2000 - 2010
PHENIX        BNL / RHIC   450            2000 - 2010
Jlab/CLAS     JLAB         200            2000 - 2010
CDF           FNAL         800            1995 - 2010
D0            FNAL         800            1995 - 2010
ATLAS         CERN         2000           2006 - 2016
CMS           CERN         2000           2006 - 2016
ALICE         CERN         1200           2007 - 2017
Jlab Hall D   JLAB         200            2008 - 2018

Page 13

Size / frequency of basic activities

Size (TB) / Frequency (/yr)

Item                   Typical today                   LHC era (>5 yr)
Raw data               100 TB / yr                     1,000 TB / yr
Event reconstruction   3 / yr                          2 / yr
DST data               1 > DST/raw > 0.1               0.1 > DST/raw > 0.02
microDST production    0.1 > microDST/DST > 0.001      ?
Physics analysis       10 - 100 * #physicists / year   ?
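
To make the ratios on this slide concrete, here is a small worked calculation for the "typical today" column. It assumes the last row means 10 to 100 analysis passes per physicist per year, and it borrows 450 physicists from the experiment list as an example; both are readings of the slide, not figures stated by the author.

```python
RAW_TB_PER_YEAR = 100              # raw data volume, "typical today"
DST_OVER_RAW = (0.1, 1.0)          # range of DST/raw size ratios
MICRO_OVER_DST = (0.001, 0.1)      # range of microDST/DST size ratios
ANALYSES_PER_PHYSICIST = (10, 100) # assumed reading of the last row

def scale(value_range, factor_range):
    """Multiply a (low, high) range by another (low, high) range."""
    return (value_range[0] * factor_range[0], value_range[1] * factor_range[1])

dst_tb = scale((RAW_TB_PER_YEAR, RAW_TB_PER_YEAR), DST_OVER_RAW)
micro_tb = scale(dst_tb, MICRO_OVER_DST)

n_physicists = 450  # example collaboration size (assumption for illustration)
analyses = scale((n_physicists, n_physicists), ANALYSES_PER_PHYSICIST)

print(f"DST per reconstruction pass: {dst_tb[0]:g} to {dst_tb[1]:g} TB")
print(f"microDST per production:     {micro_tb[0]:g} to {micro_tb[1]:g} TB")
print(f"Analysis passes per year:    {analyses[0]} to {analyses[1]}")
```

The spread of several orders of magnitude in microDST size reflects how wide the quoted ratio ranges are, not any precision in the estimate.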

Page 14

Average physicist view

• Mythology, culture, terminology vary a lot from one experiment to another.

• BaBar
  - Object view or primary event store (Objectivity)
  - Event collection objects give primary access points to data
    • Event collection has list of references to all event components of interest
    • With 100,000 collections, how to organize them?
  - Ntuples & PAW for final data format, analysis tool

• STAR (first year data, getting started)
  - A “production, trigger” is all reconstructed events for a trigger type with a certain version of code (P00hg, central)
  - Access point is list of directory paths below which all data are stored on disk
  - WZ will be setting up STACS
  - ROOT for data format and analysis tool

• …
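
The two access-point styles on this slide can be contrasted with a minimal sketch. The class names, fields, and example paths are hypothetical; they only mirror the ideas of a collection holding event references and a (production, trigger) pair mapped to directories, and are not the Objectivity or STAR interfaces.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

@dataclass
class EventCollection:
    """Collection-style access point: a named list of references to event components."""
    name: str
    event_refs: list = field(default_factory=list)  # opaque references (hypothetical)

@dataclass
class Production:
    """File-style access point: a (production, trigger) pair mapped to directory paths."""
    production: str
    trigger: str
    directories: list = field(default_factory=list)

def files_under(production, listing):
    """Return the files from `listing` that sit below any of the production's directories."""
    roots = [PurePosixPath(d) for d in production.directories]
    return [f for f in listing
            if any(root in PurePosixPath(f).parents for root in roots)]

collection = EventCollection("example-skim", event_refs=[("db-07", 1234), ("db-07", 1235)])
prod = Production("P00hg", "central", directories=["/data/P00hg/central"])
listing = [
    "/data/P00hg/central/run1/evts_0001.root",
    "/data/P00hg/minbias/run1/evts_0001.root",
]
print(len(collection.event_refs), files_under(prod, listing))
```

In the first style the user never sees file names; in the second, the directory tree is the catalog, which is simple to start with but leaves questions of content and version to convention.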

Page 15

What’s hard I, living with technology

• Typical computer center today
  - A couple of STK Powderhorn tape silos, HPSS or home-grown MSS
  - 1000 Linux processors
  - Assortment of 100/1000 Mbps network
  - 50 TB disk (1000 spindles)
  - Network s/w for I/O (NFS, Objy AMS, RFIO, …)
  - AFS for distributed collaboration

• Can make large RAID filesystems w/ network access
  - Faults can affect many nodes
    • stale NFS file handles
    • AFS faults affect nodes across country, work
  - Large RAID is $$$

• Desire to reduce effect of faults
  - Fewer faults
  - More tolerance

• …
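
One small form of "more tolerance" is retrying a read when the operating system reports a stale file handle. The sketch below is an assumption about how a client could react, not something described in the talk; the function name, the retry count, and the back-off are illustrative, and `errno.ESTALE` is the standard error code for this condition on Unix-like systems.

```python
import errno
import time

def read_with_retry(path, attempts=3, delay_s=1.0, opener=open):
    """Read a file, retrying when the OS reports a stale file handle (ESTALE)."""
    for attempt in range(1, attempts + 1):
        try:
            with opener(path, "rb") as fh:
                return fh.read()
        except OSError as exc:
            # Only ESTALE is treated as transient here; anything else is re-raised,
            # as is ESTALE on the final attempt.
            if exc.errno != errno.ESTALE or attempt == attempts:
                raise
            time.sleep(delay_s * attempt)  # simple linear back-off

# Usage (hypothetical path): data = read_with_retry("/nfs/example/file.dat")
```

A retry only masks short-lived faults; a fault that persists still surfaces as an exception, so the caller can report it.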

Page 16

What’s hard II, A clear view for average physicist

What’s going on in this box?

Page 17

What’s hard II, A clear view for average physicist

• What data is available?
  - “data” means
    • List of files? (like STAR)
    • Collection object w/ pointers to all events? (like BaBar)
  - “available” means
    • On disk? Where?
    • Exists?

• Does it really have the filters and calibrations I need?

• Is it the “official” version of the data?

• …
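
The questions on this slide amount to a metadata query. A minimal sketch follows, assuming a hypothetical catalog record with fields for location, filters, calibration, and an "official" flag; none of these names come from a real experiment catalog.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetRecord:
    name: str
    location: str        # "disk", "tape", ... (hypothetical values)
    site: str
    filters: frozenset   # filters applied when the dataset was produced
    calibration: str     # calibration version tag
    official: bool       # True if this is the collaboration's official version

def find_usable(catalog, needed_filters, calibration, on_disk_only=True):
    """Return official records that carry the requested filters and calibration."""
    return [r for r in catalog
            if r.official
            and set(needed_filters) <= r.filters
            and r.calibration == calibration
            and (r.location == "disk" or not on_disk_only)]

catalog = [
    DatasetRecord("central-v1", "disk", "site-A", frozenset({"central"}), "calib-1", True),
    DatasetRecord("central-v2", "tape", "site-A", frozenset({"central"}), "calib-2", True),
    DatasetRecord("central-test", "disk", "site-B", frozenset({"central"}), "calib-2", False),
]
print([r.name for r in find_usable(catalog, {"central"}, "calib-2", on_disk_only=False)])
```

Even this toy version shows why the view is unclear in practice: each question needs a field that someone has to record and keep correct.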

Page 18

What’s hard III, Analysis of large datasets

Dataset does not fit on disk, or requires parallel processing, or is a large enough operation that the chance of a fault is high
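
Why size alone raises the chance of a fault can be seen with a short calculation. It assumes faults are independent and that every task has the same small fault probability; the 0.1% figure is purely illustrative.

```python
def chance_of_any_fault(per_task_fault_prob, n_tasks):
    """Probability that at least one of n independent tasks hits a fault: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - per_task_fault_prob) ** n_tasks

for n in (1, 100, 1000):
    print(n, round(chance_of_any_fault(0.001, n), 3))
```

Under these assumptions, a fault rate that is negligible for one task becomes more likely than not once an analysis spans about a thousand tasks.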

Page 19

What’s hard III, Analysis of large datasets

• Dataset does not fit on disk
  - Needs access s/w to couple w/ processing
    • SAM, STACS
  - Does performance meet demand?
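
The coupling of data access with processing can be sketched as staging a dataset through a disk cache in batches. This is a generic illustration and not the SAM or STACS interface; the `stage`, `process`, and `release` callables and the sizes are hypothetical.

```python
def batches_that_fit(file_sizes_gb, cache_gb):
    """Group files, in order, into batches whose total size fits the disk cache."""
    batch, used = [], 0.0
    for name, size in file_sizes_gb:
        if size > cache_gb:
            raise ValueError(f"{name} ({size} GB) is larger than the cache")
        if used + size > cache_gb and batch:
            yield batch
            batch, used = [], 0.0
        batch.append(name)
        used += size
    if batch:
        yield batch

def process_dataset(file_sizes_gb, cache_gb, stage, process, release):
    """Stage one batch, process it, release the space, then move on to the next batch."""
    for batch in batches_that_fit(file_sizes_gb, cache_gb):
        stage(batch)
        for name in batch:
            process(name)
        release(batch)

files = [("f1", 40), ("f2", 40), ("f3", 30)]
print(list(batches_that_fit(files, cache_gb=100)))
```

Whether performance meets demand then depends on how well staging of the next batch overlaps with processing of the current one, which this sequential sketch does not attempt.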

Page 20

SAM (Sequential data Access via Meta-data)

http://d0db.fnal.gov/sam/

Page 21

STACS

http://sdm.lbl.gov/projectindividual.php?ProjectID=STACS

Page 22

What’s hard III, Analysis of large datasets

• Dataset does not fit on disk
  - Needs access s/w to couple w/ processing
    • SAM, STACS
  - Does performance meet demand?

• Needs parallel processing (not very hard)
  - Cannot do analysis on private/personal machine
  - Schedule access to shared resource (CPU and disk)

• Operation for a single analysis is large enough that faults occur
  - Need exception handling
  - Need workflow management to complete failed tasks or, at least, accurately report status
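
The last bullet describes two requirements: re-run what failed, and report accurately what happened. A minimal sketch of both follows, assuming tasks are plain zero-argument callables keyed by name; this interface is hypothetical and not taken from any workflow system named in the talk.

```python
def run_workflow(tasks, max_attempts=3):
    """Run each named task, retrying failures, and return a status report.

    `tasks` maps a task name to a zero-argument callable (hypothetical interface).
    """
    report = {}
    for name, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                task()
            except Exception as exc:
                # Record the fault instead of aborting the whole analysis.
                report[name] = {"status": "failed", "attempts": attempt, "error": repr(exc)}
            else:
                report[name] = {"status": "done", "attempts": attempt, "error": None}
                break
    return report
```

A task that keeps failing ends with status "failed" and the last error recorded, so the analysis as a whole can finish with an accurate report rather than stopping at the first fault.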

Page 23

Example shared nothing cluster

http://www.ihep.ac.cn/~chep01/paper/4-026.pdf

Page 24

PPDG

Page 25

Summary

• Faulty technology sets boundary conditions
  - Fault tolerance will expand boundaries of capabilities

• Data management is coupled with processing
  - Visualization (access w/o processing) is minor in HENP
  - Need access to data when & where it is needed for processing

• Working on data grid as context for data management

• PPDG has SDM ISIC as one of the technology base projects