anaphe oo libraries for data analysis

43
CHEP01 CERN IT/API, Jakub.Mosci [email protected] 1 Anaphe Anaphe OO libraries for data OO libraries for data analysis analysis Jakub T. Mościcki CERN IT/API [email protected]

Upload: ciaran-downs

Post on 31-Dec-2015

26 views

Category:

Documents


3 download

DESCRIPTION

Anaphe OO libraries for data analysis. Jakub T. Mościcki CERN IT/API [email protected]. Outline. Overview of Anaphe and LHC Computing Anaphe components Lizard - Interactive Data Analysis Tool Summary. LHC Computing challenge. LHC & The Alps. Interaction Points. ~100m deep. - PowerPoint PPT Presentation

TRANSCRIPT

CHEP01 CERN IT/API, [email protected]

1

Anaphe Anaphe OO libraries for data analysisOO libraries for data analysis

Jakub T. MościckiCERN IT/[email protected]

CHEP01 CERN IT/API, [email protected] 2

OutlineOutline

Overview of Anaphe and LHC Computing

Anaphe components

Lizard - Interactive Data Analysis Tool

Summary

CHEP01 CERN IT/API, [email protected]

3

LHC Computing challenge

CHEP01 CERN IT/API, [email protected] 4

LHC & The Alps

27km circumference

~100m deepInteractionPoints

CHEP01 CERN IT/API, [email protected] 5

LHC Computing Challenge

4 experiments will create huge amount of data >1 PetaByte/year for each experiment !

1015 Bytes 1,000 TeraBytes 20,000 Redwood tapes 100,000 dual-sided DVD-RAM disks 1,500,000 sets of the Encyclopaedia Britannica (w/o photos)

Need lots of CPU power to reconstruct/analyse about 1000 PC boxes per experiment (2004 ones !)

complex data models

Data mining and analysis by thousands of geographically dispersed scientists around the globe

CHEP01 CERN IT/API, [email protected] 6

Lifetime of LHC software = 25 yrs

WWW

CHEP01 CERN IT/API, [email protected] 7

Technology (R)Evolution

10 yrs major cycle length (HW,SW,OS) ~12 evolutionary changes in the market 1 revolutionary change towards greater diversity don’t forget changes of requirements

Consequences SW written today most probably will be

rewritten tomorrow We must anticipate changes

CHEP01 CERN IT/API, [email protected] 8

Anaphe: what it is

Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments (previously LHC++) memory management and I/O foundation classes histogramming, minimizing/fitting visualization interactive data analysis

Trying to use standards wherever possible Trying to re-use existing class libraries This talk will not cover detector simulation (GEANT-4)

CHEP01 CERN IT/API, [email protected] 9

Anaphe Components

Data Analysis Lizard - AIDA

Custom graphics (2-D) Qt - Qplotter

Basic graphics (3-D) OpenInventor – OpenGL

Basic math NAG C

HEP foundation CLHEP

HEP math FML - Gemini - CLHEP

Histograms HTL

Database HepODBMS

Persistency ODMG/Objectivity DB

C++ Standard Libraries

CHEP01 CERN IT/API, [email protected] 10

‘Layered’ Approach

Components are individual C++ class libraries. Easy to replace one part without throwing away everything Alternative implementations interchangeable

HepODBMS versus HBOOK Ntuples Nag C minimizers versus MINUIT

Easy customization to match experiment specific needs Runtime flexibility Components may be used individually (limited

interdependencies)

Insulate components through Abstract Interfaces “wrapper” layer to implement Interfaces in terms of existing

libs

Identify and use patterns - avoid anti-patterns learn from other people’s experiences/failures

CHEP01 CERN IT/API, [email protected] 11

Anaphe Components: Overview

CHEP01 CERN IT/API, [email protected] 12

Users and Collaborations

AIDA spoken here!

IGUANA (CMS visualization) GAUDI (LHCb) framework ATHENA (Atlas) framework Analyzer modules in Geant 4 JAS Open Scientist …you?

CHEP01 CERN IT/API, [email protected]

13

Anaphe components

CHEP01 CERN IT/API, [email protected] 14

CLHEP

HEP foundation class library Random number generators Physics vectors

3- and 4- vectors Geometry Linear algebra System of units more packages recently added

will continue to evolve

wwwinfo.cern.ch/asd/lhc++/clhep/

CHEP01 CERN IT/API, [email protected] 15

2D Graphics libraries

Qt multi-platform C++ GUI toolkit

C++ class library, not wrapper around C libs superset of Motif and MFC available on Unix and MS Windows no change for developer

commercial but with public domain version www.troll.no

Qplotter “add-on” functionality for HEP

“HIGZ/HPLOT”

CHEP01 CERN IT/API, [email protected] 16

Basic 3D Graphic Libraries

OpenGL (basic graphics) De-facto industry standard for basic 3D graphics Used in CAD/CAE, games, VR, medical imaging

OpenInventor (scene mgmt.) OO 3D toolkit for graphics Cubes, polygons, text, materials Cameras, lights, picking 3D viewers/editors,animation

Based on OpenGL/MesaGL

CHEP01 CERN IT/API, [email protected] 17

Mathematical Libraries

NAG (Numerical Algorithms Group) C Library Covers a broad range of functionality

Linear algebra differential equations quadrature, etc.

Special functions of CERNLIB added to Mark-6 release mostly for theory and accelerator Quality assurance extensive testing done by NAG

www.nag.com

CHEP01 CERN IT/API, [email protected] 18

Histograms: the HTL package

Histograms are the basic tool for physics analysis Statistical information of density distributions

Histogram Template Library (HTL) design based on C++ templates Modular : separation between sampling and display Extensible : open for user defined binning systems Flexible: support transient/persistent at the same time Open: large use of abstract interfaces

recent addition: 3D histograms

CHEP01 CERN IT/API, [email protected] 19

Fitting and Minimization

Fitting and Minimization Library (FML) common OO interface

NAG-C, MINUIT based on Abstract Interfaces

IVector, IModelFunction, … fitting as a special case of minimization

minimize “distance” between data and model replacement for HepFitting (and Gemini)

Gemini common minimization interface very thin layer

CHEP01 CERN IT/API, [email protected] 20

Tags, Ntuples and Events

NtupleTag Library Ntuple navigation and analysis common OO interface for different storage

ODBMS HBook (CERNLIB)

Exploiting Tag concept enhanced Ntuples associated with an underlying persistent store optional association to the Event may be used to navigate

to any other part of the Event even from an interactive visualization program

main use: speedup data selection for analysis… Tag data is typically better clustered than the original data

Object Association

CHEP01 CERN IT/API, [email protected]

21

Interactive Data Analysis

CHEP01 CERN IT/API, [email protected] 22

Interactive Data Analysis

Aim: “OO replacement for PAW” analysis of “ntuple-like data” (“Tags”, “Ntuples”, …) visualisation of data (Histograms, scatter-plot, “Vectors”) fitting of histograms (and other data) access to experiment specific data/code

Maximize flexibility and re-use plug-in structure careful design with limited source and binary

dependencies

Foresee customization/integration allow use from within experiment’s s/w framework!

CHEP01 CERN IT/API, [email protected] 23

Lizard Internals: Interfaces

CHEP01 CERN IT/API, [email protected] 24

Anaphe components

CHEP01 CERN IT/API, [email protected] 25

Architectural issue: Scripting

Typical use of scripting is quite different from programming (reconstruction, analysis, ...) history “go back to where I was before” repetition/looping - with “modifiable parameters”

SWIG to (semi-) automatically create connection to chosen scripting language allows flexibility to choose amongst several scripting languages Python, Perl, Tcl, Guile, Ruby, (Java) …

Python - OO scripting, no “strange $!%-variables” other scripting languages possible (through SWIG)

Can be enhanced and/or replaced by a GUI scripting window within GUI application

CHEP01 CERN IT/API, [email protected] 26

Example script (ntuple)

# get list of names of all tuples from tuplemanagerntm.listTuples()nt1=ntm.findNtuple(“Charm1”) # retrieve tuple by name# create 1D histos to project into h1=hm.create1D(10, “mass” ,100, 0., 5000.) h2=hm.create1D(20, “mass for pt1>10” ,100, 0., 5000.)

# project the attribute ”MASS" into histo h1 without cut ("")nt1.project1D( h1, “” , “MASS”)

# project the attribute ”MASS" into histo h2 with cut (”PT1>10")nt1.project1D( h2, “PT1>10” , “MASS”)

CHEP01 CERN IT/API, [email protected] 27

CHEP01 CERN IT/API, [email protected] 28

Lizard: History and Present Status

Started after CHEP-2000Full version out since June 2001

“PAW like” analysis functionality plus on-demand loading of compiled code using

shared librariesgives full access to experiment’s analysis code and

data

based on Abstract Interfacesflexible and extensible

CHEP01 CERN IT/API, [email protected] 29

Possible Future Enhancements

Access to other implementations of components HBOOK histograms and ntuples (RWN) /coming soon/ OpenScientist, ROOT histograms?

Adding other “scripting” languages Perl , Tcl, cint ?

Communication with Java tools/packages via AIDA

JAS WIRED

CHEP01 CERN IT/API, [email protected] 30

Architectural issue: Distributed Computing

Motivation move code to data parallel analysis

Techniques services via AI late binding plug-in architecture

End-user (Lizard) look-and-feel of local analysis

R&D started and first prototype available soon CORBA

CHEP01 CERN IT/API, [email protected] 31

Summary

The architecture of Anaphe shows some important items for flexible and modular data analysis: weak coupling between components through use of Abstract

Interface basic functionality is covered by C++ class libraries

Major criteria are flexibility, extensibility and interoperability recent example: GEANT-4 space examples using G4Analysis

component (based on AIDA)

Lizard is based on Anaphe components and the Python scripting language (through SWIG) Lizard is young but has very solid base in mature Anaphe

libraries real plug-in structure

CHEP01 CERN IT/API, [email protected] 32

More information

cern.ch/Anaphecern.ch/Anaphe/Lizardaida.freehep.org/cern.ch/DBwwwinfo.cern.ch/asd/lhc++/clhep/

CHEP01 CERN IT/API, [email protected] 33

CHEP01 CERN IT/API, [email protected]

34

Opening bracket: Persistency

CHEP01 CERN IT/API, [email protected] 35

Ntuple versus TagDB Model

Event Data Files Ntuple File

Ad hocextraction prg.

Object Association

Federated DB of Event & Tag

CHEP01 CERN IT/API, [email protected] 36

Object persistencyTwo concepts: serial and page I/O

“Sequential access to objects” (streaming) good in networking context or serial writes to files much like “good old Fortran” often perceived to be “simpler” to implement (“<<“, “>>”)

“Navigational access to objects” (buffered) I/O on demand for complex data models optimized for (random) disk access (disks deliver pages) sequential write to file still ok

Both concepts need to take care about changes of the internal structure of the objects (schema evolution)

CHEP01 CERN IT/API, [email protected] 37

Architectural Issue:Persistency (“Object-I/O”)

Brings a completely new quality into the design

Objects have now lifetime don’t “delete” until you really are sure you want to persistency is kind of “intended memory leak”

Objects may change during their (extended) life “schema evolution” additions/deletions of attributes changes of inheritance relations

CHEP01 CERN IT/API, [email protected] 38

Architectural Issue:Persistency (“Object-I/O”) (II)

Objects can be placed (“clustering”) de-coupling of logical and physical view of data

Special care needed to ensure consistency in data set avoid reading group of objects (tracks, events,...) for which

writing/updating is not (yet) complete clean up if only part of the objects are written typically taken care of by using transactions

Complications possible in distributed computing need to protect disk access now like memory access in past

(“Segmentation violation”)

CHEP01 CERN IT/API, [email protected] 39

Physical Model and Logical Model

• Physical model may be changed to optimise performancePhysical model may be changed to optimise performance• Existing applications continue to workExisting applications continue to work

CHEP01 CERN IT/API, [email protected] 40

Concurrent Access

Data changes are part of a Transaction ACID: Atomicity, Consistency, Isolation, Durability Guarantees consistency of data

Support for multiple concurrent writers e.g. Multiple parallel data streams e.g. Filter or reconstruction farms e.g. Distributed simulation

Access is co-ordinated by a lock server MROW: Multiple Reader, One Writer per container

(Objectivity/DB)

CHEP01 CERN IT/API, [email protected] 41

Plain files vs. Databases

HEP is using DBs since LEP e.g., FATMEN, HepDB

Mainly for “Meta data” event -> file -> tape mappings calibration data (conditions)

Why not use a single system for “Meta data” and “data” ? Overhead for DB administration is there anyway accessing “Meta data” from “data” is significantly easier

simple navigation from event -> calibrationData transaction safety also for event/reconstructed data

CHEP01 CERN IT/API, [email protected] 42

Persistency: Objectivity/DB

ODMG compliant database Object Data Management Group defined a standard

Language binding for C++, Java, Smalltalk ODBMS allow to use persistent objects directly as

variables of the OO language

Storage entity is a complete object State of all data members & Object class Guarantees consistent view of data (DB feature)

C++ Language Support Abstraction, Inheritance, Polymorphism Parameterised Types (Templates)

Location transparent access to objects

CHEP01 CERN IT/API, [email protected]

43

Closing bracket: Persistency