The Texas High Energy Grid (THEGrid)
A Proposal to Build a Cooperative Data and
Computing Grid for High Energy Physics and Astrophysics in
Texas
Alan Sill, Texas Tech University
Jae Yu, University of Texas, Arlington
Representing HiPCAT and the members of this workshop
Outline
• High Energy Physics and Astrophysics in Texas
• Work up to this workshop
• CDF, DØ, ATLAS, CMS experiments
• Problems
• A solution
• Implementation of the solution
  – DØ and CDF Grid status
  – ATLAS, CMS
  – Etc.
• Status
• Summary and Plans
High Energy Physics in Texas
• Several universities
  – UT, UH, Rice, TTU, TAMU, UTA, UTB, UTEP, SMU, UTD, etc.
• Many different research facilities used
  – Fermi National Accelerator Laboratory
  – CERN (Switzerland), DESY (Germany), and KEK (Japan)
  – Jefferson Lab
  – Brookhaven National Lab
  – SLAC (CA) and Cornell
  – Natural sources and underground labs
• Sizable community, variety of experiments and needs
• Very large data sets now! Even larger ones coming!!
The Problem
• High Energy Physics and Astrophysics data sets are huge
  – Total expected data size is over 5 PB for CDF and DØ, 10x larger for the CERN experiments
  – Detectors are complicated; many people are needed to construct them and make them work
  – Software is equally complicated
  – Collaborations are large and scattered all over the world
• Solution: use the opportunity of having large data sets to further grid computing technology
  – Allow software development and use at remote institutions
  – Optimize resource management, job scheduling, monitoring tools, and the use of resources
  – Efficient and transparent data delivery and sharing
  – Improve computational capability for education
  – Improve quality of life for researchers and students
Work up to this point
• HiPCAT
  – What is HiPCAT? High Performance Computing Across Texas, a network and organization of computing centers and their directors at many Texas universities
  – Other projects (TIGRE, cooperative education, etc.)
  – Natural forum for this proposal
  – First presentation April 2003
  – Many discussions since then
  – Led to this workshop
DØ and CDF at the Fermilab Tevatron
• World's highest-energy proton-antiproton collider
  – E_cm = 1.96 TeV (= 6.3×10^-7 J per proton; ~13 MJ on a 10^-6 m^2 area), equivalent to the kinetic energy of a 20-t truck at a speed of 80 mi/hr
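The truck analogy on the slide is straightforward kinetic energy arithmetic; a quick sketch (assuming a 20-metric-ton truck, a value not stated more precisely on the slide):

```python
# Sanity check of the slide's analogy: a 20-ton truck at 80 mi/hr
# carries roughly the ~13 MJ attributed to the Tevatron beam.
MPH_TO_MS = 0.44704          # exact miles-per-hour to m/s conversion
mass_kg = 20_000             # 20 metric tons
speed_ms = 80 * MPH_TO_MS    # 80 mi/hr in m/s

kinetic_energy_j = 0.5 * mass_kg * speed_ms ** 2
print(f"{kinetic_energy_j / 1e6:.1f} MJ")  # 12.8 MJ, i.e. roughly 13 MJ
```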
[Map: the Tevatron ring near Chicago, with the CDF and DØ (Dzero) detectors on the p-pbar collider]
Currently generating data at over a petabyte per year
Large-scale cluster computing duplicated all over the world
CDF Data Analysis Flow:
[Diagram: 7 MHz beam crossings and 0.75 million channels feed the L1/L2 triggers (300 Hz output), then the Level 3 trigger (~250 duals) down to 75 Hz; data pass to robotic tape storage at 20 MB/s read/write, through the production farm (~150 duals) for reconstruction and simulation, then to the Central Analysis Farm (CAF, ~500 duals) and user desktops for data analysis]
Distributed clusters in Italy, Germany, Japan, Taiwan, Spain, Korea, several places in the US, the UK, and Canada (more coming).
CDF-GRID: Example of a working practical grid
• CDF-GRID, based on DCAF clusters, is a de-facto working high energy physics distributed computing environment
• Built and developed to be clonable
• Deployment led by TTU
• Large effort on tools usable both on- and off-site
  – Data access (SAM, dCache)
  – Remote / multi-level DB servers
  – Store from remote sites to tape/disk at FNAL
• User MC jobs at remote sites are a reality now
• Analysis on remote data samples is being developed using SAM
  – Up and working, already used for physics!
  – Many pieces borrowed from / developed with / shared with DØ
• This effort is making HEP remote analysis possible -> practical -> working -> easy for physicists to adopt
Basic tools
• Sequential Access via Metadata (SAM)
  – Data replication and cataloging system
• Batch systems
  – FBSNG: Fermilab's own batch system
  – Condor
    • Three of the DØSAR farms consist of desktop machines under Condor
    • CDF: most central resources are already based on Condor
  – PBS
    • More general than FBSNG; most dedicated DØSAR farms use this manager
    • Part of the popular Rocks cluster configuration environment
• Grid framework: JIM (Job and Information Management)
  – Provides a framework for grid operation: job submission, match-making, and scheduling
  – Built upon Condor-G and Globus
  – MonALISA, Ganglia, user monitoring tools
  – Everyone has an account (with suitable controls), so everyone can submit!
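The match-making idea JIM inherits from Condor can be illustrated with a toy sketch. The site advertisements, dataset names, and ranking rule below are illustrative assumptions, not the real SAM-Grid/JIM ClassAd schema:

```python
# Toy Condor-style match-making: a job advertises requirements, sites
# advertise resources, and the broker ranks the sites that satisfy the
# requirements. Site and dataset names here are hypothetical.

sites = [
    {"name": "FNAL", "free_cpus": 40,  "datasets": {"bphys-v4", "top-v2"}},
    {"name": "TTU",  "free_cpus": 120, "datasets": {"top-v2"}},
    {"name": "INFN", "free_cpus": 10,  "datasets": {"bphys-v4"}},
]

def match(job, sites):
    """Return the best site meeting the job's needs, preferring free CPUs."""
    candidates = [s for s in sites
                  if s["free_cpus"] >= job["min_cpus"]
                  and job["dataset"] in s["datasets"]]
    return max(candidates, key=lambda s: s["free_cpus"], default=None)

job = {"dataset": "top-v2", "min_cpus": 8}
print(match(job, sites)["name"])  # TTU: has the dataset and the most free CPUs
```

In the real system the "rank" and "requirements" expressions are ClassAds evaluated by Condor-G; this sketch only mirrors the shape of that negotiation.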
Data Handling: Operation of a SAM Station
[Diagram: producers and consumers interact with the Station & Cache Manager, which coordinates Project Managers, workers, File Stager(s), and a File Storage Server; File Storage Clients move files between the cache disk / temp disk and mass storage systems (MSS) or other stations; data flow and control paths are shown separately]
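The cache-manager role of a station (evicting files from the cache disk when space runs out, while respecting pinned data sets such as those at the INFN and Taiwan sites) can be sketched minimally. The class name and the least-recently-used policy are assumptions for illustration, not SAM's actual replacement algorithm:

```python
from collections import OrderedDict

# Minimal sketch of a station-style file cache: least-recently-used
# eviction of unpinned files; pinned data sets are never evicted.
class StationCache:
    def __init__(self, capacity_gb):
        self.capacity = capacity_gb
        self.used = 0
        self.files = OrderedDict()   # name -> (size_gb, pinned), LRU order

    def fetch(self, name, size_gb, pinned=False):
        if name in self.files:               # cache hit: mark recently used
            self.files.move_to_end(name)
            return "hit"
        # Cache miss: evict least-recently-used unpinned files until it fits.
        while self.used + size_gb > self.capacity:
            victim = next((n for n, (_, p) in self.files.items() if not p), None)
            if victim is None:
                raise RuntimeError("cache full of pinned files")
            vsize, _ = self.files.pop(victim)
            self.used -= vsize
        self.files[name] = (size_gb, pinned)
        self.used += size_gb
        return "miss"

cache = StationCache(capacity_gb=10)
cache.fetch("raw_run1.dat", 4)
cache.fetch("raw_run2.dat", 4)
cache.fetch("raw_run3.dat", 4)   # evicts raw_run1.dat, the LRU file
```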
The tools, cont'd
• Local task management
  – CDF Grid (http://cdfkits.fnal.gov/grid/)
    • Decentralized CDF Analysis Farm = DCAF
    • Develop code anywhere (laptop is supported)
    • Submit to FNAL or TTU or CNAF or Taiwan or San Diego or ...
    • Get output ~everywhere (most desktops OK)
    • User monitoring system including Ganglia; info by queue/user per cluster
  – DØSAR (DØ Southern Analysis Region)
    • Monte Carlo Farm (McFarm) management (cloned to other institutions)
    • DØSAR Grid: submit requests on a local machine; the requests get transferred to a submission site and executed at an execution site
• Various monitoring software
  – Ganglia resource monitoring
  – McFarmGraph: MC job status monitoring
  – McPerM: farm performance monitor
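The DØSAR submit-and-forward flow (local machine, then submission site, then execution site) can be sketched as a three-hop hand-off. The site names, load figures, and the least-loaded selection policy are illustrative assumptions:

```python
# Sketch of a DØSAR-style request flow: a request made on a local
# machine is forwarded to a submission site, which chooses an
# execution site. The least-loaded policy here is illustrative.

EXECUTION_SITES = {"UTA-RAC": 0.9, "OU": 0.3, "LTU": 0.5}  # load fraction

def submit_locally(request):
    # Hop 1: the user's local machine only forwards the request.
    return forward_to_submission_site(request)

def forward_to_submission_site(request):
    # Hop 2: the submission site picks the least-loaded execution site.
    site = min(EXECUTION_SITES, key=EXECUTION_SITES.get)
    return execute(request, site)

def execute(request, site):
    # Hop 3: the execution site runs the job (stubbed out here).
    return {"request": request, "executed_at": site}

result = submit_locally("mcfarm_request_42")
print(result["executed_at"])  # OU, the least-loaded site in this toy table
```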
Background statistics on the CDF Grid
• Data acquisition and data logging rates have increased
  – More data = more physicists
  – Approved by FNAL's Physics Advisory Committee and Director
• Computing needs grow, but the DOE/FNAL-CD budget is flat
• CDF proposal: do 50% of analysis work offsite
  – CDF-GRID: planned at Fermilab, deployment effort led by TTU
    • Have a plan for how to do it
    • Have most tools in place and in use
    • Already in deployment at several locations throughout the world
Hardware resources in CDF-GRID

Site              GHz now  TB now  GHz Summer  TB Summer  Notes
INFN              250      5       950         30         Priority to INFN users; pinned data sets exist
Taiwan            100      2.5     150         2.5        Pinned data sets exist
Korea             120      -       120         -          Running MC only now
UCSD              280      5       280         5          Pools resources from several US groups; min. guaranteed from 2x larger farm (CDF+CMS)
Rutgers           100      4       400         4          In-kind, will do MC production
TTU               6        2       60          4          2 DCAFs, test site + CDF+CMS cluster
GridKa (Germany)  ~200     16      ~240        18         Min. guaranteed CPU from 8x larger pool; open to all by ~Dec (JIM)
Canada            240+     -       240+        -          In-kind, doing MC production, + common pool
Japan             -        -       150         6          Just being deployed (07/2004)
Cantabria         30       1       60          2          ~1 month away
MIT               -        -       200         -          ~1 month away
UK                -        -       400         -          Open to all by ~Dec (JIM), + common pool
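Treating the approximate entries ("~", "+") as exact and the missing ones as zero, the table implies roughly a 2.5x growth in CPU by summer; a quick tally:

```python
# CPU (GHz) totals from the CDF-GRID hardware table, treating "~" and
# "+" entries as exact and "-" entries as zero (an approximation).
ghz_now    = {"INFN": 250, "Taiwan": 100, "Korea": 120, "UCSD": 280,
              "Rutgers": 100, "TTU": 6, "GridKa": 200, "Canada": 240,
              "Cantabria": 30}
ghz_summer = {**ghz_now, "INFN": 950, "Taiwan": 150, "Rutgers": 400,
              "TTU": 60, "GridKa": 240, "Cantabria": 60,
              "Japan": 150, "MIT": 200, "UK": 400}

total_now, total_summer = sum(ghz_now.values()), sum(ghz_summer.values())
print(total_now, total_summer)  # 1326 3250: roughly a 2.5x increase
```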
DØ Grid / Remote Computing, April 2004. Joel Snow, Langston University
DØSAR MC Delivery Stats (as of May 10, 2004)

Institution        Inception  N_MC (TMB) ×10^6
LTU                6/2003     0.4
LU                 7/2003     2.3
OU                 4/2003     1.6
Tata, India        6/2003     2.2
Sao Paulo, Brazil  4/2004     0.6
UTA-HEP            1/2003     3.6
UTA-RAC            12/2003    8.2
DØSAR total        5/10/04    18.9
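As a check, the per-site Monte Carlo figures do sum to the quoted total:

```python
# Per-site DØSAR MC deliveries (millions of TMB events) from the table;
# they add up to the quoted 18.9M total.
delivered = {"LTU": 0.4, "LU": 2.3, "OU": 1.6, "Tata": 2.2,
             "Sao Paulo": 0.6, "UTA-HEP": 3.6, "UTA-RAC": 8.2}
total = round(sum(delivered.values()), 1)
print(total)  # 18.9
```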
DØSAR Computing & Human Resources

Institution     CPU (GHz) [future]  Storage (TB)       People
Cinvestav       13                  1.1                1F + ?
Langston        22                  1.3                1F + 1GA
LTU             25 + [12]           1.0                1F + 1PD + 2GA
KU              12                  ??                 1F + 1PD
KSU             40                  1.2                1F + 2GA
OU              19 + 270 (OSCER)    1.8 + 120 (tape)   4F + 3PD + 2GA
Sao Paulo       60 + [120]          4.5                2F + many
Tata Institute  52                  1.6                1F + 1Sys
UTA             430                 74                 2.5F + 1Sys + 1.5PD + 3GA
Total           943 [1075]          85.5 + 120 (tape)  14.5F + 2Sys + 6.5PD + 10GA
Current Texas Grid Status
• DØSAR Grid
  – At the recent workshop at Louisiana Tech Univ.:
    • 6 clusters form a regional computational grid for MC production
    • Simulated data production on the grid is in progress
  – Institutions are paired to bring up new sites more quickly
  – Collaboration between the DØSAR consortium and the JIM team at Fermilab has begun for further software development
• CDF Grid
  – Less functionality than more ambitious HEP efforts, such as the LHC Grid, but:
    • Works now! Already in use!!
    • Deployment led by TTU
    • Tuned to users' needs
    • Goal-oriented, not just object-oriented, software!
    • Based on working models and sparing use of standards
    • Costs little to get started
• A large amount of documentation and expertise in grid computing has already accumulated between TTU and UTA
• Comparable experience is probably available at other Texas institutions
Also have Sloan Digital Sky Survey and other astrophysics work
  – TTU SDSS DR1 mirror copy (first in the world)
  – Locally hosted MySQL DB
  – Image files stored on university NAS storage
  – Submitted a proposal with Astronomy and CS colleagues for a nationally-oriented database storage model based on a new local observatory
  – Virtual Observatory (VO) storage methods: international standards under development
  – Astrophysics is increasingly moving towards grid methods
Summary and Plans
• Significant progress has been made within Texas in implementing grid computing technologies for current and future HEP experiments
• UTA and TTU are playing leading roles in the Tevatron grid effort for the currently running DØ and CDF, as well as in the LHC ATLAS and CMS experiments
• All HEP experiments are building operating grids for MC data production
• A large amount of documentation and expertise exists within Texas!
• Already doing MC; moving toward data re-processing and analysis
  – Different levels of complexity can be handled by the emerging framework
• Improvements to infrastructure are necessary, especially with respect to network bandwidth
  – THEGrid will boost the stature of Texas in the HEP grid computing world
  – Regional plans: started working with AMPATH, Oklahoma, Louisiana, and Brazilian consortia (tentatively named the BOLT Network)
  – A Texas-based consortium is needed to make progress in HEP and astrophysics computing
Summary and Plans, cont'd
• Many pieces are shared between the DØ and CDF experiments for global grid development: this provides a template for THEGrid work
• Near-term goals:
  – Involve other institutions, including those in Texas
  – Implement and use an analysis grid 4 years before the LHC
  – Work in close relation with, but not as part of, the LHC Grid (so far)
  – Other experiments will benefit from feedback and use cases
  – Lead the development of these technologies for HEP
  – Involve other experiments and disciplines; expand the grid
  – Complete the THEGrid document
• THEGrid will provide ample opportunity to increase inter-disciplinary research and education activities