CMS Data Analysis: Current Status and Future Strategy

ACAT 2002, June 2002
Lassi A. Tuura, Northeastern University, Boston, on behalf of the CMS Collaboration
http://iguana.cern.ch


TRANSCRIPT

Page 1: CMS Data Analysis Current Status and Future Strategy

ACAT 2002

June, 2002 Lassi A. Tuura, Northeastern University http://iguana.cern.ch

CMS Data Analysis
Current Status and Future Strategy

On behalf of CMS Collaboration

Lassi A. Tuura

Northeastern University, Boston

Page 2: CMS Data Analysis Current Status and Future Strategy


Overview
The Context: CMS Analysis Today
Data Analysis Environment Architecture
  Overview
  COBRA
  IGUANA
  GRID/Production
Tomorrow and Beyond
  Leveraging current frameworks in the Grid-enriched analysis environment
  Clarens client-server prototype
  Other prototype activities

Page 3: CMS Data Analysis Current Status and Future Strategy


Context
Challenges:
  Complexity
  Geographic Dispersion
  Direct Access To Data
  Migration from Reconstruction to Trigger
Environments:
  Real-Time Event Filter, Online Monitoring
  Pre-emptive Simulation, Reconstruction, Analysis
  Interactive Statistical Analysis

Page 4: CMS Data Analysis Current Status and Future Strategy


Current CMS Production
[Diagram: production data flow. Box labels: Pythia; HEPEVT Ntuples; CMSIM (GEANT3); Zebra files with HITS; OSCAR/COBRA (GEANT4); ORCA/COBRA ooHit Formatter; Objectivity Database; ORCA/COBRA Digitization (merge signal and pile-up); ORCA User Analysis; Ntuples or Root files; IGUANA Interactive Analysis]

Page 5: CMS Data Analysis Current Status and Future Strategy


Complexity of Production 2002

Number of Regional Centers                             11
Number of Computing Centers                            21
Number of CPUs                                         ~1000
Largest Local Center                                   176 CPUs
Number of Production Passes for each Dataset           6-8
  (including analysis group processing done by production)
Number of Files                                        ~11,000
Data Size (not including fz files from Simulation)     17 TB
File Transfer by GDMP and by perl scripts              7 TB toward T1
  over scp/bbcp                                        4 TB toward T2

Page 6: CMS Data Analysis Current Status and Future Strategy


Interactive Analysis
[Screenshot callouts:]
  Lizard Qt plotter
  ANAPHE histogram extended with pointers to CMS events
  Emacs used to edit a CMS C++ plugin to create and fill histograms
  OpenInventor-based display of selected event
  Python shell with Lizard & CMS modules
Most of analysis is done using NTUPLEs in PAW, some in ROOT

Page 7: CMS Data Analysis Current Status and Future Strategy


Behind the Scenes: Frameworks
[Diagram: framework overview. Labels: Federation wizards; Detector/Event Display; Data Browser; Analysis job wizards; Generic analysis Tools; ORCA; FAMOS; OSCAR; COBRA; Objy tools; GRID; Distributed Data Store & Computing Infrastructure; CMS tools; Consistent User Interface; Coherent basic tools and mechanisms]

Page 8: CMS Data Analysis Current Status and Future Strategy


Frameworks Dissected
[Diagram: layered framework structure. Labels: ODBMS; GEANT 3 / 4; CLHEP; PAW Replacement; C++ Standard Library + Extension Toolkits; Basic Services; Adapters and Extensions; Generic Application Framework; Calibration Objects; Configuration Objects; Event Objects; (Grid-aware) Data-Products; Physics modules; Grid-Uploadable; Specific Frameworks; Event Filter; Reconstruction Algorithms; Physics Analysis; Data Monitoring]

Page 9: CMS Data Analysis Current Status and Future Strategy


Framework Design Basis
Several frameworks provide the environment together
  Open: No central framework with all functionality
    – Frameworks are designed to be extensible
    – … and to collaborate with other software
  Coherent: User sees “final” smooth interface
    – Achieved by integrating the frameworks together
    – … but the user does not do this work him/herself!
  Design applied at both framework and object design level
Successfully applied in many parts of CMS software
  Applications, persistency; sub-frameworks; visualisation; …
  No loss of usability, functionality or performance
  Has made it easy to integrate directly with many existing tools
This is nothing novel: it is part of the standard risk-mitigation strategy of any modern industrial solution

Page 10: CMS Data Analysis Current Status and Future Strategy


Frameworks: COBRA
[Diagram: framework overview, same labels as on Page 7]

Page 11: CMS Data Analysis Current Status and Future Strategy


COBRA: Main Components
Push- and pull-mode execution, and any mixture
  Reconstruction-on-demand is a key concept in COBRA
  Detector-centric reconstruction: push data from event
  Reconstruction-unit-centric reconstruction: pull/create data as needed
Event data and related structures
  Basic support for commonly needed objects (hits, digis, containers, …)
Application environments
  Basic application frameworks, various semi-specialised applications
  Lots of error-handling and recovery code (automatic recovery after crash, …)
Meta data: a key component
  Data chunking, system and user collections, data streams, file management, job concepts, configuration and setup records, redirected navigation after reprocessing, …
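The push/pull distinction above can be illustrated with a small sketch. This is not COBRA code (COBRA is a C++ framework); the names `Event`, `producer`, `PRODUCERS` and `push_all` are hypothetical and only show the general idea of creating a data product the first time it is requested and reusing it afterwards.

```python
from typing import Callable, Dict, List

# Hypothetical registry of named data producers (not a COBRA API).
PRODUCERS: Dict[str, Callable[["Event"], object]] = {}


def producer(name: str):
    """Register a function as the producer of the named data product."""
    def register(func):
        PRODUCERS[name] = func
        return func
    return register


class Event:
    """Raw data plus a cache of reconstructed products built on demand."""

    def __init__(self, raw_hits: List[float]) -> None:
        self.raw_hits = raw_hits
        self._products: Dict[str, object] = {}
        self.calls: Dict[str, int] = {}  # how often each producer actually ran

    def get(self, name: str) -> object:
        """Pull mode: build the product on first request, then reuse it."""
        if name not in self._products:
            self.calls[name] = self.calls.get(name, 0) + 1
            self._products[name] = PRODUCERS[name](self)
        return self._products[name]


@producer("clusters")
def make_clusters(event: Event) -> List[float]:
    # Toy "reconstruction": keep hits above an arbitrary threshold.
    return [h for h in event.raw_hits if h > 1.0]


@producer("total_energy")
def make_total_energy(event: Event) -> float:
    # Pulls "clusters", which is created only if nobody built it before.
    return sum(event.get("clusters"))


def push_all(event: Event) -> None:
    """Push mode: eagerly run every registered producer for this event."""
    for name in PRODUCERS:
        event.get(name)
```

Because both modes go through the same cache, a job could push some products eagerly and leave the rest to be pulled later, which is one way to read "any mixture".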

Page 12: CMS Data Analysis Current Status and Future Strategy


COBRA: Main Strengths
Algorithms in plug-ins
  “Publish-yourself-plug-ins”: self-describing data producers
Strong meta-data facilities
  Reconstruction-on-demand matches data product concept very well
    – Grid virtual data products concept really just an extension
  Convenient mapping of data products to chunks: files, containers, …
  Scatter / gather: decompose jobs, gather data
    – One logical job can be chopped into many physical processes; we still know it is logically the same job no matter which process it is running in
Adapts automatically to many environments without special configuration: interactive, batch, farm, stand-alone, trigger, …
  Through appropriate use of enabling techniques (transactions, locking, refs)
  No data post-processing required
  Well-matched to production tools (IMPALA)
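A minimal sketch of the scatter/gather idea mentioned above, under the assumption that a logical job is split by event range. The names `SubJob`, `scatter`, `run` and `gather` are hypothetical and do not come from COBRA or IMPALA; the point is only that every physical piece carries the identity of the logical job it belongs to.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SubJob:
    logical_job: str  # identity shared by all pieces of one logical job
    part: int
    events: range


def scatter(logical_job: str, n_events: int, n_parts: int) -> List[SubJob]:
    """Split one logical job into at most n_parts contiguous event ranges."""
    if n_events <= 0 or n_parts <= 0:
        raise ValueError("n_events and n_parts must be positive")
    step = -(-n_events // n_parts)  # ceiling division
    return [
        SubJob(logical_job, part, range(start, min(start + step, n_events)))
        for part, start in enumerate(range(0, n_events, step))
    ]


def run(sub: SubJob) -> Dict[str, object]:
    """Stand-in for a physical process; it only counts its events."""
    return {"logical_job": sub.logical_job, "part": sub.part,
            "n_processed": len(sub.events)}


def gather(results: List[Dict[str, object]]) -> Dict[str, object]:
    """Merge partial results, refusing to mix different logical jobs."""
    jobs = {r["logical_job"] for r in results}
    if len(jobs) != 1:
        raise ValueError("results belong to different logical jobs")
    return {"logical_job": jobs.pop(),
            "n_processed": sum(r["n_processed"] for r in results)}
```

For example, `gather([run(s) for s in scatter("demo-job", 10, 3)])` reassembles the three pieces into one result for `"demo-job"`.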

Page 13: CMS Data Analysis Current Status and Future Strategy


[Diagram labels: Objectivity; Storage Manager; Schema Manager; Transaction Manager; C++ Binding; File I/O; Lock Server; Page Server; Catalog Manager; DDL Source Processing; Meta Data; Object Access; MSS, Grid & Farm Interface]

Page 14: CMS Data Analysis Current Status and Future Strategy


[Diagram: same labels as on Page 13, with these added: Refs & Navigation; Queries; Cache Management]

Page 15: CMS Data Analysis Current Status and Future Strategy


[Diagram: same labels as on Page 13, with these added: Object Naming; Configurations (Data Sets); Collections; Run Resume & Crash Recovery]

Page 16: CMS Data Analysis Current Status and Future Strategy


[Diagram: same labels as on Page 13, with these added: File Size Control; Farm Management; System Management]

Page 17: CMS Data Analysis Current Status and Future Strategy


Frameworks: IGUANA
[Diagram: framework overview, same labels as on Page 7]

Page 18: CMS Data Analysis Current Status and Future Strategy


User Interface and Visualisation
IGUANA: a generic toolkit for user interfaces and visualisation
  Builds on existing high-quality libraries (Qt, OpenInventor, Anaphe, …)
  Used to implement specific visualisation applications in other projects
Main technical focus: provide a platform that makes it easy to integrate GUIs as a coherent whole, to provide application services and to visualise any application object
  Many categories / layers: GUI gadgets & support, application environment, data visualisers, data representation methods, control panels, …
  Designed to integrate with and into other applications
  Virtually everything is in plug-ins (can still be statically linked)

[Diagram labels: Plug-In Cache; Object Factory; Component Database; Plug-In; Attached; Unattached]
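As a rough analogy for a component database whose plug-ins stay unattached until first used, the sketch below loads a Python module only when an object is first requested from it. IGUANA itself is C++ and its real plug-in mechanism is not shown in these slides; `PluginDatabase`, `register`, `create` and `is_attached` are hypothetical names, and a standard-library module stands in for a plug-in.

```python
import importlib
from typing import Any, Callable, Dict, Tuple


class PluginDatabase:
    """Maps component names to (module, attribute) pairs; loads lazily."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, str]] = {}
        self._attached: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, module: str, attr: str) -> None:
        """Record where a component lives without loading it."""
        self._entries[name] = (module, attr)

    def is_attached(self, name: str) -> bool:
        return name in self._attached

    def create(self, name: str, *args: Any, **kwargs: Any) -> Any:
        """Object factory: attach the plug-in on first use, then build."""
        if name not in self._attached:
            module_name, attr = self._entries[name]
            module = importlib.import_module(module_name)
            self._attached[name] = getattr(module, attr)
        return self._attached[name](*args, **kwargs)


# Usage: the standard-library "fractions" module plays the plug-in.
db = PluginDatabase()
db.register("fraction", "fractions", "Fraction")
half = db.create("fraction", 1, 2)  # attaches the plug-in, then constructs
```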

Page 19: CMS Data Analysis Current Status and Future Strategy


Illustration: 3D Visualisation
[Diagram labels: QMainWindow Browser Site; QMDIShell Browser Site (twice); 3D Browser; Twig Browser]

Page 20: CMS Data Analysis Current Status and Future Strategy


IGUANA GUI Integration
[Screenshot callouts: Integration; Action; Visualise Results, Modify Objects, Further Interaction]

Page 21: CMS Data Analysis Current Status and Future Strategy


Tomorrow and Beyond
Leverage the current frameworks on the grid
  Many native COBRA concepts match well with grid
    – (Virtual) data products ~ reconstruction-on-demand
    – Recording and matching configuration and setup information
    – Production interfaces: catalogs, redirection, MSS hooks
    – Scatter/gather job decomposition, production environment
  COBRA-based applications can be encapsulated for distributed analysis
  IGUANA already separates application objects, model and viewer
    – Many possibilities for introducing distributed links
  IGUANA+COBRA provides a platform for a coherent, well-integrated interface no matter where the code runs and data comes and goes
    – Both have loads of knobs and hooks for integration
Aiming at adapting the existing software where possible
  Adapt and work within CMS software (COBRA, ORCA, …) and existing analysis tools (ROOT, Lizard, …); don’t replace them

Page 22: CMS Data Analysis Current Status and Future Strategy


Prototypes: Clarens Web Portals
[Diagram labels: Client; RPC; http/https; Web Server; Clarens; Service]
Grid-enabling the working environment for physicists' data analysis
Communication with clients via the commodity XML-RPC protocol
  Implementation independence
Server implemented in C++: access to the CMS OO analysis toolkit
Server provides a remote API to Grid tools
  The Virtual Data Toolkit: Object collection access
  Data movement between tier centres using GSI-FTP
  CMS analysis software (ORCA/COBRA)
Security services provided by the Grid (GSI)
No Globus needed on client side, only certificate
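To show what "commodity XML-RPC" means in practice, here is a small sketch using only the Python standard library. It is not the Clarens server (which the slide says is written in C++) and does not use its API: the method names `list_collections` and `collection_size`, the collection name and its contents are made up, and the Grid security layer (GSI) is left out entirely.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Made-up stand-in for object collections held on the server side.
COLLECTIONS = {"demo/jets": [101, 102, 103]}


def list_collections():
    """Hypothetical remote method: names of the available collections."""
    return sorted(COLLECTIONS)


def collection_size(name):
    """Hypothetical remote method: number of events in one collection."""
    return len(COLLECTIONS[name])


def start_server():
    """Serve both methods over plain HTTP on a free local port."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(list_collections)
    server.register_function(collection_size)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_demo():
    """Act as a thin client: it needs only an XML-RPC library."""
    server = start_server()
    host, port = server.server_address
    try:
        client = ServerProxy(f"http://{host}:{port}/")
        return client.list_collections(), client.collection_size("demo/jets")
    finally:
        server.shutdown()
        server.server_close()
```

The client side depends on nothing but the protocol, which matches the slide's point about implementation independence: any language with an XML-RPC library could replace the Python client here.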

Page 23: CMS Data Analysis Current Status and Future Strategy


Prototypes: Clarens Web Portals…
[Diagram: data and query flows between tiers. Labels: Production system and data repositories; ORCA analysis farm(s) (or distributed `farm’ using grid queues); RDBMS based data warehouse(s); PIAF/Proof/.. type analysis farm(s); TAG and AOD extraction/conversion/transport services; Data extraction Web service(s); Query Web service(s); Tool plugin module; Local analysis tool: Lizard/ROOT/…; Web browser; Local disk; User; Tier 0/1/2; Tier 1/2; Tier 3/4/5; Production data flow; TAGs/AODs data flow; Physics Query flow]

Page 24: CMS Data Analysis Current Status and Future Strategy


Other Prototypes
Tag database optimisation
  Fast sample selection is crucial
  Various models already tried
  Experimenting with RDBMS
MOP: distributed job submission system
  Allows submission of CMS production jobs from a central location, run on remote locations, and return results
    – Job Specification: IMPALA
    – Replication: GDMP
    – Globus GRAM
    – Job Scheduling: Condor-G and local systems
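The tag database item above mentions experimenting with an RDBMS for fast sample selection. The sketch below shows the general idea with SQLite from the Python standard library; the table layout, the column names (`n_jets`, `missing_et`) and the cut values are assumptions for illustration, not the CMS tag schema.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_tag_db(rows: Iterable[Tuple[int, int, float]]) -> sqlite3.Connection:
    """Create an in-memory tag table with one small summary row per event."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE tag ("
        "event_id INTEGER PRIMARY KEY, n_jets INTEGER, missing_et REAL)"
    )
    # An index on a frequently cut-on quantity can speed up selection.
    db.execute("CREATE INDEX tag_missing_et ON tag (missing_et)")
    db.executemany("INSERT INTO tag VALUES (?, ?, ?)", rows)
    return db


def select_events(db: sqlite3.Connection, min_jets: int,
                  min_missing_et: float) -> List[int]:
    """Return the ids of events passing both cuts, using tags only."""
    cursor = db.execute(
        "SELECT event_id FROM tag "
        "WHERE n_jets >= ? AND missing_et >= ? ORDER BY event_id",
        (min_jets, min_missing_et),
    )
    return [row[0] for row in cursor]
```

Only the selected event ids would then be used to fetch the full event data, so the expensive access is limited to the sample of interest.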