WLCG Status Report


Legal Arrangements for operating a distributed research infrastructure

WLCG Status Report
Ian Bird
WLCG Overview Board, CERN, 9th March 2012

Since last summer:
- US LLNL granted full member access as an ALICE Tier 2; initially based on a Letter of Intent. Progress on MoU signature to be reported at the next meeting
- Former French Tier 3 LPSC Grenoble became an ALICE and ATLAS Tier 2
- Informal discussions and exchange of information with 3 new countries expressing interest in becoming WLCG Tier 2s (Thailand, Cyprus, Slovakia)

WLCG MoU with EGI has now been signed; this confirms the collaboration.

WLCG MoU Status (18th October 2011)

Castor service at Tier 0 well adapted to the load:
- Heavy ions: more than 6 GB/s to tape (tests show that Castor can easily support >12 GB/s); the actual limit now is the network from the experiment to the CC
- Major improvements in tape efficiency: tape writing at ~native drive speeds, so fewer drives are needed

WLCG: Data in 2011

HI: ALICE data into Castor > 4 GB/s; HI: overall rates to tape > 6 GB/s. 22 PB of data written in 2011.

January: new Castor tape software with fully buffered tape marks in production, giving a 300% tape-speed increase. Many other improvements in tape handling this year. E.g. the public instance (least efficient): 14 MB/s → 39 MB/s (2/3 buffered tape marks) → 49 MB/s → 146 MB/s (only buffered tape marks); the CMS instance, which was better at 61 MB/s, is now at 157 MB/s. The current write speed (150-160 MB/s) is still below the native drive speed (240 MB/s); the current bottleneck is the speed of the disks on the disk servers. Data is pre-loaded into RAM: small files fit into RAM and hence are more efficient. We plan to reduce the number of drives by 30. Other Castor sites will benefit from this; other tape systems may consider this technique.

Note on Castor tape speed improvements
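As a rough illustration of why the faster per-drive writing allows fewer drives, the sketch below divides the Heavy-Ion aggregate rate quoted above (>6 GB/s to tape) by the per-drive write speeds quoted for the public instance. The helper function and the choice of instance are illustrative assumptions; real drive scheduling and the mix of Castor instances are not modelled.

```python
import math

def drives_needed(aggregate_rate_mb_s: float, per_drive_mb_s: float) -> int:
    """Minimum number of tape drives to sustain an aggregate write rate."""
    return math.ceil(aggregate_rate_mb_s / per_drive_mb_s)

# Figures quoted in the slides (approximate):
heavy_ion_rate = 6000   # MB/s, >6 GB/s to tape during the HI run
before = 49             # MB/s per drive, public instance before buffered tape marks
after = 146             # MB/s per drive, public instance with buffered tape marks
native = 240            # MB/s, native drive speed

for speed in (before, after, native):
    print(f"{speed:>3} MB/s per drive -> {drives_needed(heavy_ion_rate, speed)} drives")
#  49 MB/s per drive -> 123 drives
# 146 MB/s per drive ->  42 drives
# 240 MB/s per drive ->  25 drives
```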

WLCG in 2011: 10^9 HEPSPEC-hours/month (~150k CPUs in continuous use)

1.5M jobs/day; usage continues to grow. (Plots: # jobs/day and CPU used.)

Usage of Tier 0+1
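A quick back-of-the-envelope check of these utilisation figures; the ~10 HS06 per core is an assumed average core power, not a number from the slides.

```python
# Rough cross-check of the utilisation figures quoted above.
# Assumption (not in the slides): an average 2011 core delivers ~10 HS06.

HS06_HOURS_PER_MONTH = 1e9
HOURS_PER_MONTH = 30 * 24          # ~720 h
HS06_PER_CORE = 10                 # assumed average core power

cores_in_continuous_use = HS06_HOURS_PER_MONTH / (HOURS_PER_MONTH * HS06_PER_CORE)
print(f"~{cores_in_continuous_use:,.0f} cores busy around the clock")   # ~138,889, i.e. order 150k

JOBS_PER_DAY = 1.5e6
avg_job_hours = cores_in_continuous_use * 24 / JOBS_PER_DAY
print(f"implied average job length ~{avg_job_hours:.1f} h")             # ~2.2 h
```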

CPU and disk occupancy compared to the total available for Tier 0 and Tier 1s (NB: available includes an efficiency factor). Green line: available × efficiency; pink line: pledges. Efficiencies now good; ALICE problem now understood.

Usage vs pledges
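The occupancy plots compare usage against pledged capacity scaled by an efficiency factor ("avail × effic"). A minimal sketch of that comparison follows, with made-up numbers, since the per-site values live in the plots rather than the text.

```python
# Occupancy is judged against pledged capacity scaled by an efficiency factor.
# All numbers below are hypothetical placeholders for the values in the plots.

def usable_capacity(pledged, efficiency):
    """Capacity actually deliverable to the experiments."""
    return pledged * efficiency

pledged_khs06 = 100.0      # hypothetical pledge
efficiency = 0.85          # hypothetical CPU efficiency factor
used_khs06 = 78.0          # hypothetical average usage

occupancy = used_khs06 / usable_capacity(pledged_khs06, efficiency)
print(f"occupancy vs usable capacity: {occupancy:.0%}")   # ~92%
```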

Fewer incidents in general, but longer lasting (or more difficult to resolve). Q4 2011: all except one took >24 hr to resolve.

Incidents

Time to resolution

Experiment computing progress

No shutdown for computing

Activity on 3rd Jan

ATLAS computing in 2011 (incl. Christmas break) and now
- Smooth running for p-p and HI in 2011
- Average 340 Hz p-p, 4.6e6 live seconds, 1.6e9 events
- Simulated events: 2.8e9 (full) + 0.7e9 (fast)
- Compression of RAW events since July (1.2 → 0.64 MB/event); the compressed copy is kept on disk (quantified in the sketch below)
- Dynamically extended Tier 0 capacity into the public share where needed, especially for HI running: used up to 5000 cores (3000 is nominal)
- One reprocessing per year
- Steady improvement of reconstruction software speed to cope with high pileup

Usage over the Christmas break
- HI backlog processing at Tier 0 until mid-December
- Production of improved 7 TeV MC during the break for conferences, using all available capacity including spillover of Grid jobs into Tier 0 (leaving ~500 nodes for residual HI processing)
- Spillover to be refined for the long 2013-14 shutdown to include the HLT
- Need massive 8 TeV MC production in 2012
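To put the RAW compression figure in context, the sketch below estimates the disk saved per copy if all ~1.6e9 events of 2011 were stored at the compressed size. This is an upper bound: compression only started in July, so the real saving is smaller.

```python
# Illustration of what the RAW compression (1.2 -> 0.64 MB/event) is worth.
# Upper bound only: compression was applied from July onwards, not to the
# whole year's data, and only one disk copy is assumed here.

events_2011 = 1.6e9
size_before_mb = 1.2
size_after_mb = 0.64

saving_pb = events_2011 * (size_before_mb - size_after_mb) / 1e9  # MB -> PB
print(f"~{saving_pb:.1f} PB saved per copy if all 2011 RAW were compressed")  # ~0.9 PB
```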

All-Tiers and Tier-0 concurrent jobs: timeline

(Timeline plot: Tier-0 and all-Tiers concurrent jobs, Dec-Jan; MC production, Heavy Ions, T0 usage for MC; 3000-5000 jobs at Tier-0, ~100k jobs across all Tiers.)

2012 expectation
- 400 Hz rate in physics streams
- Expect the LHC to run with β* = 0.6 m, average 24 interactions per event (34 at the beginning of fills)
- ~1.6e9 events as in 2011, assuming shorter running (21 weeks) with a ratio of stable beams to total physics time of 0.3 (see the cross-check below)
- 15 Hz of Zero Bias events for pileup overlay studies and MC
- Plan a 75 Hz rate in delayed streams (to be processed in 2013): ~250e6 events; strong physics case for B-physics
- RAW written to tape ~200 TB × 2 copies; processed to DAODs in 2013, on disk ~100 TB × 2 copies
- HLT, Tier 0 and Grid processing making full use of improvements in the latest software release for CPU and AOD size
- High bunch-charge runs of 2011 had the pileup expected for 2012 and were used in tuning
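A rough cross-check of the 2012 estimates from the figures quoted above; because the inputs are round numbers, it only agrees with the slide values to within roughly 10-20%.

```python
# Cross-check of the 2012 ATLAS estimates using the figures quoted above.

weeks = 21
stable_beam_fraction = 0.3
live_seconds = weeks * 7 * 86400 * stable_beam_fraction      # ~3.8e6 s

physics_rate_hz = 400
delayed_rate_hz = 75
raw_mb_per_event = 0.64                                       # compressed RAW size

physics_events = physics_rate_hz * live_seconds               # ~1.5e9 (slide: ~1.6e9)
delayed_events = delayed_rate_hz * live_seconds               # ~2.9e8 (slide: ~250e6)
delayed_raw_tb = delayed_events * raw_mb_per_event / 1e6      # ~180 TB/copy (slide: ~200 TB)

print(f"{physics_events:.2e} physics events, {delayed_events:.2e} delayed events, "
      f"~{delayed_raw_tb:.0f} TB RAW per copy in the delayed streams")
```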

Activities over the winter (CMS)
- CMS has done a complete processing of the 2011 data and MC with the latest CMSSW version
- 8 TeV MC production began immediately after the HI run
- Tier-0 resources were successfully used for MadGraph LHE production; 4B events were produced over the shutdown
- Analysis activity continued at a high level in preparation for conferences

CMS Tier-0 data
- 2012 will have higher luminosity and more complex events
- Excellent performance of CMSSW_5_2 for high-PU events: faster and less memory
- CMS should be able to promptly reconstruct 300 Hz

- Even with the improvements, CMS will eat more into the time between fills
- We will also make more use of the CAF and lxbatch resources for Tier-0 reconstruction

Data parking
- Given the challenging triggering environment, potentially interesting physics, and the impending long shutdown, CMS would like to take more data than it has resources to reconstruct at the Tier-0
- Additional data will be repacked into RAW and parked at Tier-1s for reconstruction at a later time
- How long data stays parked will depend on available Tier-1 resources; some data can be reconstructed during the year, and some may safely wait

Impact of data parking: processing
- Data-parking scenarios roughly double the dataset taken in 2012: about half promptly reconstructed and half reconstructed later at Tier-1 (illustrated in the sketch below)
- We believe that in 2013 we need 20% more T1 CPU resources than in 2012 and 15% more T2 CPU than we estimated last year
- Further increases over the small changes presented in 2011 in tape and disk storage were not expected from the planning; primarily this is because of changes in what CMS stores and analyzes:
  - Write out fewer MC RAW events, as they are not needed and new MC can be recreated from smaller formats
  - Analysis has moved more completely to AOD formats, which saves space
  - Aggressive clean-up campaigns of old MC and old processing passes also free resources for new things

Impact of data parking on analysis
- Analysis resources are well used; additional data will have some impact in increasing analysis needs, further constraining resources
- Need to ensure that high-priority activities can complete even in the presence of additional load; stretch out lower-priority activities
- Expect to use the glide-in WMS global-queue functionality to enforce priority
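A minimal illustration of the parking arithmetic described above. The total rate derived here simply assumes that parking doubles the promptly reconstructed dataset, as stated in the slide; it is not an official CMS trigger figure.

```python
# Simple illustration of the parking scenario described above.
# prompt_rate_hz and the doubling factor come from the slides; the total rate
# is derived from them for illustration only.

prompt_rate_hz = 300          # promptly reconstructed at the Tier-0
doubling = 2.0                # parking "roughly doubles the dataset taken in 2012"

total_rate_hz = prompt_rate_hz * doubling
parked_rate_hz = total_rate_hz - prompt_rate_hz
print(f"total ~{total_rate_hz:.0f} Hz recorded, "
      f"~{parked_rate_hz:.0f} Hz parked for later Tier-1 reconstruction")
```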

The full 1 fb-1 of 2011 data had been re-processed by the end of November. MC production for the 2011 configuration.


T2s: MC 63%, data reprocessing 28%
- Continuing disk shortage due to larger event size and increased trigger rate
- Reduced copies of data for older processing passes; the situation should improve with the installation of 2012 pledges
- Also commissioning physics-group centralised production to reduce the need for large job runs by each group
- Analysis activity steady at Tier 0 and 1: 1000 concurrent user jobs & 16000 jobs/day
- Starting to prepare the online farm for offline use

During 2011 a number of changes were introduced in the LHCb computing model to bridge the gap between the extended physics reach of LHCb and the available pledges, which were defined before this. Already in 2010, and more clearly in 2011, LHCb decided to expand its physics reach beyond the original vision to include significantly more charm physics. In particular, the recent observation of possible evidence of CP violation in charmed-meson decays pushed the Collaboration, at the end of 2011, into a campaign of optimization of the HLT filter, which will have the effect of increasing the yield of collected charm events. We also expect a general increase in signal events from the operation of the LHC at higher energy and from exploiting the full bandwidth of the LHCb L0 trigger (1 MHz).

As a result, the trigger rate, already increased to 3 kHz in 2011 (from an original 2 kHz), will reach 4.5 kHz in 2012. The foreseen 2012 pledges will not allow the physics potential of the new trigger bandwidth to be fully exploited, due to disk space limitations. Therefore, unless extra resources become available during the year, parts of the recorded data will have to be "locked" during 2012: the stripping will be tuned to produce the same data bandwidth as in 2011. However, this data will be unlocked in the 2013 re-stripping passes by introducing additional channels and looser requirements, increasing the final output by 50%. This will allow both enhanced analyses and true data mining.

LHCb event rate

ALICE: Computing model parameters
Parameters have been updated based on the 2011 exercise (average values over the entire pp and PbPb runs).

Running scenario 2012:
- pp: 145 days (3.3×10^6 s effective), Lint = 4 pb-1 delivered, 1.4×10^9 events (MB + rare)
- pPb: 24 days (5.2×10^5 s effective), Lint = 15-30 nb-1 delivered, 3×10^8 events (MB + rare)

2012 requirements

2013 requirements

Offline status

- Full reconstruction of 2011 pp data; PbPb 2011 reconstructed pass 2 (ready for physics conferences!)
- Presently doing analysis, MC, cosmics & calibration; waiting for beam
- Very good use of parasitic resources
- Resources still tight, but new prototype T1s for ALICE in South Korea (KISTI) and Mexico (UNAM); InGrid2012 in Mumbai will also discuss an Indian T1

Storage is critical; a cleanup campaign will help in the short term.

Good efficiency of production jobs (>80%)

PbPb data taking and services
Very successful data-taking period:
- 140 million events, enriched with rare triggers
- HLT compression operational, ×3 reduction of the RAW data volume
- Reconstruction and fast QA of ~50% of the data during the data-taking period

High-rate throughput test: data accumulation, peak rate up to 4 GB/sec

Excellent performance of data services:
- CASTOR2@CERN supported unprecedented data transfer rates (up to 4 GB/sec)
- Steady performance of tape storage at T1s
- Data replication completed 4 days after the end of the period; average rate 300 MB/sec (see the sketch below)
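For scale, a small helper showing how long a replication backlog drains at the 300 MB/s average quoted above. The 100 TB example volume is a hypothetical placeholder, not the actual size of the 2011 PbPb dataset.

```python
# Time to drain a replication backlog at the average T1 replication rate
# quoted above (300 MB/s). The 100 TB volume is a hypothetical example.

def days_to_replicate(volume_tb: float, rate_mb_s: float) -> float:
    return volume_tb * 1e6 / rate_mb_s / 86400

print(f"{days_to_replicate(100, 300):.1f} days for 100 TB at 300 MB/s")   # ~3.9 days
```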

Pledges vs requirements
All experiments will be pushing the limits of the resources that they will have available.

Installation

- Fears of late availability due to the Thailand floods and consequent disk shortages have not materialized; little impact except at CERN
- The tendering process has finished; paper for the Finance Committee next week, hopefully approval next week

Intent: start prototyping later this year or in 2013, with production to follow.

Remote Tier 0

The WLCG strategy in this area is a topic included in the TEGs, and there will be ongoing work on the use of virtualisation and how to use clouds. Independently, CERN is involved in the Helix Nebula project: EIROforum labs (CERN, EMBL, ESA, others observing) as user communities, together with industrial partners as resource providers.
Goals, some relevant to WLCG: understand costs, policy issues, and the practicality of moving/storing/accessing data in a cloud provider.
Other goals: specifically, to address some of the data privacy issues that prevent European labs from moving services or workloads to clouds.

Helix Nebula: Rationale
The EIROforum labs collaborated on a cloud strategy paper: "The potential benefits of adopting a cloud-based approach to the provision of computing resources have long been recognised in the business world and are readily transferable to the scientific domain. In particular, the ability to share scarce resources and absorb bursts of intensive demand can have a far-reaching effect on the economics of research infrastructures. However, Europe has failed to respond to the opportunity presented by the cloud and has not yet adopted a leadership role."

Role of Helix Nebula: the Science Cloud
Vision of a unified cloud-based infrastructure for the ERA based on a Public/Private Partnership, with 4 goals building on the collective experience of all involved.
- Goal One: Establish HELIX NEBULA, the Science Cloud, as a cloud computing infrastructure addressing the needs of the ERA and capable of serving as a platform for innovation and evolution of the overall e-infrastructure.
- Goal Two: Identify and adopt suitable policies for trust, security and privacy at a European level.
- Goal Three: Create a lightweight governance structure that involves all the stakeholders and which can evolve over time as the infrastructure, services and user base grow.
- Goal Four: Define a funding scheme involving all the stakeholder groups (service suppliers, users, EC and national funding agencies) for a PPP to implement a cloud computing infrastructure that delivers a sustainable and profitable business environment adhering to European-level policies.

Specific outcomes
- Develop strategies for extremely large or highly distributed and heterogeneous scientific data (including service architectures, applications and standardisation) in order to manage the upcoming data deluge
- Analyse and promote trust building towards open scientific data e-infrastructures, covering organisational, operational, legal and technological aspects, including authentication, authorisation and accounting (AAA)
- Develop strategies and establish structures aiming at coordination between e-infrastructure operators
- Create frameworks, including business models for supporting Open Science and cloud infrastructures based on PPP, useful for procurement of computing services suitable for e-Science

Scientific flagships
- CERN LHC (ATLAS): high-throughput computing and large-scale data movement
- EMBL: novel de novo genomic assembly techniques
- ESA: integrated access to data held in existing Earth Observation Super Sites
Each flagship brings out very different features and requirements and exercises different aspects of a cloud offering.

ATLAS use case
- Simulations (~no input) with stage-out to traditional grid storage vs long-term cloud storage
- Data processing (== Tier 1): implies large-scale data import and export to/from the cloud resource
- Distributed analysis (== Tier 2): data accessed remotely (located at grid sites), or data located at the cloud resource (or another?)
- Bursting for urgent tasks: centrally managed (urgent processing) or regionally managed (urgent local analysis needs)
- All experience is immediately transferable to other LHC (& HEP) experiments

Immediate and longer term goals
- Determine the costs of commercial cloud resources from various sources: compute resources, network transfers into and out of the cloud, short- and long-term data storage in the cloud
- Develop an understanding of appropriate SLAs: how can they be broadly applicable to LHC or HEP?
- Understand policy and legal constraints, e.g. in moving scientific data to commercial resources
- Performance and reliability compared to the WLCG baseline
- Use of standards (interfaces, etc.) & interoperability between providers
- Can CERN transparently offload work to a cloud resource? Which type of work makes sense?
- Long term: can we use commercial services as a significant fraction of the overall resources available to CERN experiments? At which point is it economic/practical to rely on 3rd-party providers?

What happens after EMI, EGI?
- OSG will have ongoing funding, but at ~20% less than previously
- EMI finishes in just over 1 year
- EGI support for Heavy User Communities finishes in just over 1 year, but EGI-InSPIRE continues for a further year
- We need to consider the sustainability of what WLCG requires:
  - WLCG TEGs (see later talk) produce the technical strategy
  - Critical middleware: negotiate support directly with institutes and projects
  - Improve WLCG technical collaboration, from discussion in the GDB and as a consequence of the TEGs

Summary
- Grid operations have continued smoothly over 2011 & the holiday period; no major issues
- Experiments are making good progress in data processing and analysis
- Tier 1 and Tier 2 resources are already well utilized and likely to be stretched in 2012
- 3 experiments propose to take additional data in 2012 for processing later
- WLCG needs to ensure the tools and support needed after EMI, and as EGI enters a new phase; several initiatives in progress

ALICE computing model parameters:

Event size (MB/event)   Raw    ESD+AOD
pp                      1.6    0.1
AA                      3.5    3.9

CPU (kHS06 s/event)
pp                      0.11
AA                      2.00
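Combining the 2012 pp running scenario quoted earlier (1.4×10^9 events) with the pp parameters above gives a feel for the scale involved. This is a sketch only: it covers a single reconstruction pass with no MC, replicas or analysis, so it does not reproduce the requirement tables below.

```python
# Rough combination of the 2012 pp running scenario with the computing-model
# parameters above (pp only; one reconstruction pass, no MC or analysis).

pp_events = 1.4e9
raw_mb, esdaod_mb = 1.6, 0.1            # MB/event
cpu_khs06_s = 0.11                      # kHS06*s/event

raw_pb = pp_events * raw_mb / 1e9
esdaod_pb = pp_events * esdaod_mb / 1e9
sustained_khs06 = pp_events * cpu_khs06_s / 3.15e7   # spread over one year of seconds

print(f"RAW ~{raw_pb:.1f} PB, ESD+AOD ~{esdaod_pb:.2f} PB, "
      f"~{sustained_khs06:.1f} kHS06 sustained for one reconstruction pass")
# -> RAW ~2.2 PB, ESD+AOD ~0.14 PB, ~4.9 kHS06
```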

Disk (PB)       T0      CAF     T1s     T2s
Required        13.2    0.24    10.9    19.4
Pledged
Difference

Tape (PB)       T0      T1
Required        23.5    19.1
Pledged
Difference

Tape (PB)       T0      T1
Required        17.1    11.3
Pledged         20.0    11.5
Difference      14%     2%

Disk (PB)       T0      CAF     T1s     T2s
Required        7.6     0.24    7.0     12.4
Pledged         8.1             7.22    9.11 (12.9)
Difference      6%              3%      -36%

CPU (kHS06)     T0      CAF     T1s     T2s
Required        90.0    35.0    95.0    207
Pledged         90              95      115 (194)
Difference      0%              0%      -80%
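The Difference rows in the tables above are consistent with being computed relative to the pledge, i.e. (pledged - required) / pledged. This convention is inferred from the numbers rather than stated in the slides; a quick check:

```python
# The Difference rows above are consistent with (pledged - required) / pledged.
# This convention is inferred from the numbers, not stated in the slides.

def difference(required: float, pledged: float) -> float:
    return (pledged - required) / pledged

checks = [
    ("Tape T0", 17.1, 20.0),     # slide: 14%
    ("Disk T2s", 12.4, 9.11),    # slide: -36%
    ("CPU T2s", 207.0, 115.0),   # slide: -80%
]
for name, req, pled in checks:
    print(f"{name}: {difference(req, pled):+.0%}")
# Tape T0: +14%, Disk T2s: -36%, CPU T2s: -80%
```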

CPU (kHS06)     T0      CAF     T1s     T2s
Required        90.0    35.0    95.0    194.8
Pledged
Difference