anaphe oo libraries for data analysis
DESCRIPTION
Anaphe OO libraries for data analysis. Jakub T. Mościcki CERN IT/API [email protected]. Outline. Overview of Anaphe and LHC Computing Anaphe components Lizard - Interactive Data Analysis Tool Summary. LHC Computing challenge. LHC & The Alps. Interaction Points. ~100m deep. - PowerPoint PPT PresentationTRANSCRIPT
CHEP01 CERN IT/API, [email protected]
1
Anaphe Anaphe OO libraries for data analysisOO libraries for data analysis
Jakub T. MościckiCERN IT/[email protected]
CHEP01 CERN IT/API, [email protected] 2
OutlineOutline
Overview of Anaphe and LHC Computing
Anaphe components
Lizard - Interactive Data Analysis Tool
Summary
CHEP01 CERN IT/API, [email protected] 4
LHC & The Alps
27km circumference
~100m deepInteractionPoints
CHEP01 CERN IT/API, [email protected] 5
LHC Computing Challenge
4 experiments will create huge amount of data >1 PetaByte/year for each experiment !
1015 Bytes 1,000 TeraBytes 20,000 Redwood tapes 100,000 dual-sided DVD-RAM disks 1,500,000 sets of the Encyclopaedia Britannica (w/o photos)
Need lots of CPU power to reconstruct/analyse about 1000 PC boxes per experiment (2004 ones !)
complex data models
Data mining and analysis by thousands of geographically dispersed scientists around the globe
CHEP01 CERN IT/API, [email protected] 7
Technology (R)Evolution
10 yrs major cycle length (HW,SW,OS) ~12 evolutionary changes in the market 1 revolutionary change towards greater diversity don’t forget changes of requirements
Consequences SW written today most probably will be
rewritten tomorrow We must anticipate changes
CHEP01 CERN IT/API, [email protected] 8
Anaphe: what it is
Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments (previously LHC++) memory management and I/O foundation classes histogramming, minimizing/fitting visualization interactive data analysis
Trying to use standards wherever possible Trying to re-use existing class libraries This talk will not cover detector simulation (GEANT-4)
CHEP01 CERN IT/API, [email protected] 9
Anaphe Components
Data Analysis Lizard - AIDA
Custom graphics (2-D) Qt - Qplotter
Basic graphics (3-D) OpenInventor – OpenGL
Basic math NAG C
HEP foundation CLHEP
HEP math FML - Gemini - CLHEP
Histograms HTL
Database HepODBMS
Persistency ODMG/Objectivity DB
C++ Standard Libraries
CHEP01 CERN IT/API, [email protected] 10
‘Layered’ Approach
Components are individual C++ class libraries. Easy to replace one part without throwing away everything Alternative implementations interchangeable
HepODBMS versus HBOOK Ntuples Nag C minimizers versus MINUIT
Easy customization to match experiment specific needs Runtime flexibility Components may be used individually (limited
interdependencies)
Insulate components through Abstract Interfaces “wrapper” layer to implement Interfaces in terms of existing
libs
Identify and use patterns - avoid anti-patterns learn from other people’s experiences/failures
CHEP01 CERN IT/API, [email protected] 12
Users and Collaborations
AIDA spoken here!
IGUANA (CMS visualization) GAUDI (LHCb) framework ATHENA (Atlas) framework Analyzer modules in Geant 4 JAS Open Scientist …you?
CHEP01 CERN IT/API, [email protected] 14
CLHEP
HEP foundation class library Random number generators Physics vectors
3- and 4- vectors Geometry Linear algebra System of units more packages recently added
will continue to evolve
wwwinfo.cern.ch/asd/lhc++/clhep/
CHEP01 CERN IT/API, [email protected] 15
2D Graphics libraries
Qt multi-platform C++ GUI toolkit
C++ class library, not wrapper around C libs superset of Motif and MFC available on Unix and MS Windows no change for developer
commercial but with public domain version www.troll.no
Qplotter “add-on” functionality for HEP
“HIGZ/HPLOT”
CHEP01 CERN IT/API, [email protected] 16
Basic 3D Graphic Libraries
OpenGL (basic graphics) De-facto industry standard for basic 3D graphics Used in CAD/CAE, games, VR, medical imaging
OpenInventor (scene mgmt.) OO 3D toolkit for graphics Cubes, polygons, text, materials Cameras, lights, picking 3D viewers/editors,animation
Based on OpenGL/MesaGL
CHEP01 CERN IT/API, [email protected] 17
Mathematical Libraries
NAG (Numerical Algorithms Group) C Library Covers a broad range of functionality
Linear algebra differential equations quadrature, etc.
Special functions of CERNLIB added to Mark-6 release mostly for theory and accelerator Quality assurance extensive testing done by NAG
www.nag.com
CHEP01 CERN IT/API, [email protected] 18
Histograms: the HTL package
Histograms are the basic tool for physics analysis Statistical information of density distributions
Histogram Template Library (HTL) design based on C++ templates Modular : separation between sampling and display Extensible : open for user defined binning systems Flexible: support transient/persistent at the same time Open: large use of abstract interfaces
recent addition: 3D histograms
CHEP01 CERN IT/API, [email protected] 19
Fitting and Minimization
Fitting and Minimization Library (FML) common OO interface
NAG-C, MINUIT based on Abstract Interfaces
IVector, IModelFunction, … fitting as a special case of minimization
minimize “distance” between data and model replacement for HepFitting (and Gemini)
Gemini common minimization interface very thin layer
CHEP01 CERN IT/API, [email protected] 20
Tags, Ntuples and Events
NtupleTag Library Ntuple navigation and analysis common OO interface for different storage
ODBMS HBook (CERNLIB)
Exploiting Tag concept enhanced Ntuples associated with an underlying persistent store optional association to the Event may be used to navigate
to any other part of the Event even from an interactive visualization program
main use: speedup data selection for analysis… Tag data is typically better clustered than the original data
Object Association
CHEP01 CERN IT/API, [email protected] 22
Interactive Data Analysis
Aim: “OO replacement for PAW” analysis of “ntuple-like data” (“Tags”, “Ntuples”, …) visualisation of data (Histograms, scatter-plot, “Vectors”) fitting of histograms (and other data) access to experiment specific data/code
Maximize flexibility and re-use plug-in structure careful design with limited source and binary
dependencies
Foresee customization/integration allow use from within experiment’s s/w framework!
CHEP01 CERN IT/API, [email protected] 25
Architectural issue: Scripting
Typical use of scripting is quite different from programming (reconstruction, analysis, ...) history “go back to where I was before” repetition/looping - with “modifiable parameters”
SWIG to (semi-) automatically create connection to chosen scripting language allows flexibility to choose amongst several scripting languages Python, Perl, Tcl, Guile, Ruby, (Java) …
Python - OO scripting, no “strange $!%-variables” other scripting languages possible (through SWIG)
Can be enhanced and/or replaced by a GUI scripting window within GUI application
CHEP01 CERN IT/API, [email protected] 26
Example script (ntuple)
# get list of names of all tuples from tuplemanagerntm.listTuples()nt1=ntm.findNtuple(“Charm1”) # retrieve tuple by name# create 1D histos to project into h1=hm.create1D(10, “mass” ,100, 0., 5000.) h2=hm.create1D(20, “mass for pt1>10” ,100, 0., 5000.)
# project the attribute ”MASS" into histo h1 without cut ("")nt1.project1D( h1, “” , “MASS”)
# project the attribute ”MASS" into histo h2 with cut (”PT1>10")nt1.project1D( h2, “PT1>10” , “MASS”)
CHEP01 CERN IT/API, [email protected] 27
CHEP01 CERN IT/API, [email protected] 28
Lizard: History and Present Status
Started after CHEP-2000Full version out since June 2001
“PAW like” analysis functionality plus on-demand loading of compiled code using
shared librariesgives full access to experiment’s analysis code and
data
based on Abstract Interfacesflexible and extensible
CHEP01 CERN IT/API, [email protected] 29
Possible Future Enhancements
Access to other implementations of components HBOOK histograms and ntuples (RWN) /coming soon/ OpenScientist, ROOT histograms?
Adding other “scripting” languages Perl , Tcl, cint ?
Communication with Java tools/packages via AIDA
JAS WIRED
CHEP01 CERN IT/API, [email protected] 30
Architectural issue: Distributed Computing
Motivation move code to data parallel analysis
Techniques services via AI late binding plug-in architecture
End-user (Lizard) look-and-feel of local analysis
R&D started and first prototype available soon CORBA
CHEP01 CERN IT/API, [email protected] 31
Summary
The architecture of Anaphe shows some important items for flexible and modular data analysis: weak coupling between components through use of Abstract
Interface basic functionality is covered by C++ class libraries
Major criteria are flexibility, extensibility and interoperability recent example: GEANT-4 space examples using G4Analysis
component (based on AIDA)
Lizard is based on Anaphe components and the Python scripting language (through SWIG) Lizard is young but has very solid base in mature Anaphe
libraries real plug-in structure
CHEP01 CERN IT/API, [email protected] 32
More information
cern.ch/Anaphecern.ch/Anaphe/Lizardaida.freehep.org/cern.ch/DBwwwinfo.cern.ch/asd/lhc++/clhep/
CHEP01 CERN IT/API, [email protected] 33
CHEP01 CERN IT/API, [email protected] 35
Ntuple versus TagDB Model
Event Data Files Ntuple File
Ad hocextraction prg.
Object Association
Federated DB of Event & Tag
CHEP01 CERN IT/API, [email protected] 36
Object persistencyTwo concepts: serial and page I/O
“Sequential access to objects” (streaming) good in networking context or serial writes to files much like “good old Fortran” often perceived to be “simpler” to implement (“<<“, “>>”)
“Navigational access to objects” (buffered) I/O on demand for complex data models optimized for (random) disk access (disks deliver pages) sequential write to file still ok
Both concepts need to take care about changes of the internal structure of the objects (schema evolution)
CHEP01 CERN IT/API, [email protected] 37
Architectural Issue:Persistency (“Object-I/O”)
Brings a completely new quality into the design
Objects have now lifetime don’t “delete” until you really are sure you want to persistency is kind of “intended memory leak”
Objects may change during their (extended) life “schema evolution” additions/deletions of attributes changes of inheritance relations
CHEP01 CERN IT/API, [email protected] 38
Architectural Issue:Persistency (“Object-I/O”) (II)
Objects can be placed (“clustering”) de-coupling of logical and physical view of data
Special care needed to ensure consistency in data set avoid reading group of objects (tracks, events,...) for which
writing/updating is not (yet) complete clean up if only part of the objects are written typically taken care of by using transactions
Complications possible in distributed computing need to protect disk access now like memory access in past
(“Segmentation violation”)
CHEP01 CERN IT/API, [email protected] 39
Physical Model and Logical Model
• Physical model may be changed to optimise performancePhysical model may be changed to optimise performance• Existing applications continue to workExisting applications continue to work
CHEP01 CERN IT/API, [email protected] 40
Concurrent Access
Data changes are part of a Transaction ACID: Atomicity, Consistency, Isolation, Durability Guarantees consistency of data
Support for multiple concurrent writers e.g. Multiple parallel data streams e.g. Filter or reconstruction farms e.g. Distributed simulation
Access is co-ordinated by a lock server MROW: Multiple Reader, One Writer per container
(Objectivity/DB)
CHEP01 CERN IT/API, [email protected] 41
Plain files vs. Databases
HEP is using DBs since LEP e.g., FATMEN, HepDB
Mainly for “Meta data” event -> file -> tape mappings calibration data (conditions)
Why not use a single system for “Meta data” and “data” ? Overhead for DB administration is there anyway accessing “Meta data” from “data” is significantly easier
simple navigation from event -> calibrationData transaction safety also for event/reconstructed data
CHEP01 CERN IT/API, [email protected] 42
Persistency: Objectivity/DB
ODMG compliant database Object Data Management Group defined a standard
Language binding for C++, Java, Smalltalk ODBMS allow to use persistent objects directly as
variables of the OO language
Storage entity is a complete object State of all data members & Object class Guarantees consistent view of data (DB feature)
C++ Language Support Abstraction, Inheritance, Polymorphism Parameterised Types (Templates)
Location transparent access to objects