![Page 1: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/1.jpg)
Caitriana Nicholson
University of Glasgow
Dynamic Data Replication in LCG
2008
![Page 2: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/2.jpg)
Outline
• Introduction
• Grid Replica Optimisation
• The OptorSim grid simulator
• OptorSim architecture
• Experimental setup
• Results
• Conclusions
![Page 3: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/3.jpg)
Introduction
• Large Hadron Collider (LHC) at CERN will have raw data rate of ~15 PB/year
• LHC Computing Grid (LCG) for data storage and computing infrastructure
• 2008 will be first full year of LHC running
• Actual analysis behaviour still unknown
  – Use simulation to investigate behaviour
  – Investigate dynamic data replication
![Page 4: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/4.jpg)
Grid Replica Optimisation
• Many variables determine overall grid performance
  – Impossible to reach one optimal solution!
• Possible to optimise variables which are part of grid middleware
  – Job scheduling, data management etc.
• This talk considers data management only…
  …and dynamic replica optimisation in particular
![Page 5: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/5.jpg)
Dynamic Replica Optimisation
= optimisation of the placement of file replicas on grid sites…
…in a dynamic, automated fashion
![Page 6: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/6.jpg)
Design of a Replica Optimisation Service
• Centralised, hierarchical or distributed?
• Pull or push?
• Choosing a replication trigger
  – On file request?
  – On file popularity?
• Aim to achieve global optimisation as a result of local optimisation
![Page 7: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/7.jpg)
OptorSim
• OptorSim is a grid simulator with a focus on data management
• Developed as part of European DataGrid Work Package 2
• Based on EDG architecture
• Used to examine automated decisions about replica placement and deletion
http://edg-wp2.web.cern.ch/edg-wp2/optimization/optorsim.html
![Page 8: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/8.jpg)
Architecture
• Sites with Computing Element (CE) and/or Storage Element (SE)
• Replica Optimiser decides replications for its site
• Resource Broker schedules jobs
• Replica Catalogue maps logical to physical filenames
• Replica Manager controls and registers replications
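The catalogue role listed above (logical to physical filename mapping) can be illustrated with a minimal sketch. This is not OptorSim code: the class name, method names and the `site:/storage/...` physical-name layout are all hypothetical, chosen only to show the idea of one logical file having several registered replicas.

```python
from collections import defaultdict


class ReplicaCatalogue:
    """Maps a logical file name (LFN) to the physical copies (PFNs) on the grid."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def _pfn(self, lfn, site):
        # Hypothetical physical-name layout, for illustration only.
        return f"{site}:/storage/{lfn}"

    def register(self, lfn, site):
        """Record that a replica of lfn now exists at site."""
        pfn = self._pfn(lfn, site)
        self._replicas[lfn].add(pfn)
        return pfn

    def unregister(self, lfn, site):
        """Forget the replica of lfn at site (e.g. after deletion)."""
        self._replicas[lfn].discard(self._pfn(lfn, site))

    def lookup(self, lfn):
        """Return all known physical copies of lfn, sorted for stable output."""
        return sorted(self._replicas.get(lfn, ()))
```

In this picture, a Replica Manager would call `register` after each successful replication and `unregister` after each deletion, so that lookups always reflect the current placement.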
![Page 9: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/9.jpg)
Algorithms
• Job scheduling
  – Details not covered in this talk
  – “QueueAccessCost” scheduler used in these results
• Data replication
  – No replication
  – Simple replication: “always replicate, delete existing files if necessary”
    • Least Recently Used (LRU)
    • Least Frequently Used (LFU)
  – Economic model: “replicate only if profitable”
    • Sites “buy” and “sell” files using auction mechanism
    • Files deleted if less valuable than new file
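As a rough illustration of the two simple strategies, the sketch below models a storage element that always replicates a requested file and, when full, deletes the least recently used or least frequently used file. It is an assumption-laden toy, not the OptorSim implementation: capacity is counted in files rather than bytes, and the class and attribute names are hypothetical.

```python
class StorageElement:
    """Toy storage element using 'always replicate' with LRU or LFU deletion."""

    def __init__(self, capacity_files, policy="lru"):
        if policy not in ("lru", "lfu"):
            raise ValueError("policy must be 'lru' or 'lfu'")
        self.capacity = capacity_files
        self.policy = policy
        self.last_access = {}   # file name -> logical time of last access
        self.access_count = {}  # file name -> number of accesses while stored
        self._clock = 0

    def _victim(self):
        """Choose the file to delete under the configured policy."""
        if self.policy == "lru":
            return min(self.last_access, key=self.last_access.get)
        # LFU: fewest accesses first; ties broken by least recent access.
        return min(
            self.access_count,
            key=lambda f: (self.access_count[f], self.last_access[f]),
        )

    def request(self, name):
        """Return True on a local hit; otherwise replicate the file here
        (deleting one stored file if the element is full) and return False."""
        self._clock += 1
        hit = name in self.last_access
        if not hit and len(self.last_access) >= self.capacity:
            victim = self._victim()
            del self.last_access[victim]
            del self.access_count[victim]
        self.last_access[name] = self._clock
        self.access_count[name] = self.access_count.get(name, 0) + 1
        return hit
```

With a capacity of two files and the request sequence a, a, b, c, the LRU variant deletes a (oldest access) while the LFU variant deletes b (fewest accesses), which is the only place the two strategies differ.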
![Page 10: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/10.jpg)
Experimental Setup - Jobs & Files
• Job types based on computing models
• “Dataset” for each experiment ~1 year’s AOD (analysis data)
• 2 GB files
• Placed at CERN and Tier-1s at start

| Job | Event size (kB) | Total no. of files | Files per job |
|---|---|---|---|
| alice-pp | 50 | 25000 | 25 |
| alice-hi | 250 | 12500 | 125 |
| atlas | 100 | 100000 | 50 |
| cms | 50 | 37500 | 25 |
| lhcb-small | 75 | 37500 | 38 |
| lhcb-big | 75 | 37500 | 375 |
![Page 11: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/11.jpg)
Experimental Setup - Storage Resources
• CERN & Tier 1 site capacities from LCG Technical Design Report
• “Canonical” Tier 2 capacity of 197 TB each (18.8 PB / 95 sites)
• Define storage metric D = (average SE size) / (total dataset size)
• Memory limitations -> scale down Tier 2 SE sizes to 500 GB
  – Allows file deletion to start quickly
  – Disadvantage of small D
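The arithmetic behind the canonical Tier 2 size and the storage metric D can be checked with a short sketch. Decimal units (1 TB = 10^12 bytes) are an assumption. The example dataset is also an illustration only: it takes the 100000 files of 2 GB listed for the atlas job type and averages over Tier 2 SEs alone, which is not necessarily how D was evaluated for the results in this talk.

```python
GB = 1e9
TB = 1e12
PB = 1e15


def storage_metric(se_sizes_bytes, dataset_size_bytes):
    """D = (average SE size) / (total dataset size)."""
    average_se = sum(se_sizes_bytes) / len(se_sizes_bytes)
    return average_se / dataset_size_bytes


# Canonical Tier 2 capacity: 18.8 PB shared equally over 95 sites.
tier2_capacity = 18.8 * PB / 95

# Illustrative dataset: 100000 files of 2 GB each (200 TB).
example_dataset = 100000 * 2 * GB

# D with canonical Tier 2 sizes, and with SEs scaled down to 500 GB.
d_canonical = storage_metric([tier2_capacity] * 95, example_dataset)
d_scaled = storage_metric([500 * GB] * 95, example_dataset)

print(f"Tier 2 capacity: {tier2_capacity / TB:.1f} TB")
print(f"D (canonical): {d_canonical:.3f}, D (scaled): {d_scaled:.4f}")
```

The scaled-down value is several hundred times smaller than the canonical one, which is why the results vary D by shrinking the dataset rather than by growing the simulated SEs.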
![Page 12: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/12.jpg)
Experimental Setup - Computing & Network
• Most (chaotic) analysis jobs run at Tier 2s
  – Tier 1s not given CE, except those running LHCb jobs
  – CERN Analysis Facility with CE of 7840 kSI2k
  – Tier 2s with averaged CE of 645 kSI2k each (61.3 MSI2k / 95 sites)
• Network based on NREN topologies
  – Sites connected to closest router
  – Default of 155 Mbps if published value not available
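To give a feel for these numbers, the sketch below works out the averaged Tier 2 CE and the time to move one 2 GB file over a default 155 Mbps link. It assumes decimal units and that the transfer has the whole link to itself; any sharing with other transfers would lengthen the time. The function name is hypothetical.

```python
def transfer_time_s(file_size_bytes, bandwidth_bps):
    """Seconds to move a file over a link used by this transfer alone."""
    return file_size_bytes * 8 / bandwidth_bps


# Averaged Tier 2 CE: 61.3 MSI2k shared equally over 95 sites, in kSI2k.
tier2_ce_ksi2k = 61.3e6 / 95 / 1e3

# One 2 GB file over the default 155 Mbps link.
file_time = transfer_time_s(2e9, 155e6)

print(f"Tier 2 CE: {tier2_ce_ksi2k:.0f} kSI2k")
print(f"2 GB at 155 Mbps: {file_time:.0f} s")
```

At well over a minute per file on a default link, a job reading tens or hundreds of remote files spends a large share of its time on transfers, which is the cost that replication aims to reduce.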
![Page 13: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/13.jpg)
Network Topology
![Page 14: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/14.jpg)
Parameters
• Job scheduler “QueueAccessCost”
  – Combines data location and queue information
• Sequential access pattern
• 1000 jobs per simulation
• Site policies set according to LCG Memorandum of Understanding
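The slides say only that the scheduler combines data location and queue information. One way such a combination could work is sketched below: for each candidate site, add the estimated cost of fetching missing files for the jobs already queued to the same cost for the new job, then pick the cheapest site. This is an assumption about the general idea, not the OptorSim scheduler itself; the flat per-file transfer time, the site dictionary layout and the function names are all hypothetical.

```python
def access_cost(job_files, local_files, transfer_time_s):
    """Estimated time to fetch the files a job needs that are not at the site."""
    return sum(transfer_time_s for f in job_files if f not in local_files)


def choose_site(job_files, sites, transfer_time_s=100.0):
    """Pick the site with the lowest combined access cost.

    sites maps a site name to {"files": set of local files,
                               "queue": list of file lists, one per queued job}.
    """
    def total_cost(site):
        queued = sum(
            access_cost(q, site["files"], transfer_time_s) for q in site["queue"]
        )
        return queued + access_cost(job_files, site["files"], transfer_time_s)

    return min(sites, key=lambda name: total_cost(sites[name]))
```

Under this scheme a site holding all of a job's files can still lose out to an empty site if its queue is full of jobs waiting on remote data.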
![Page 15: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/15.jpg)
Evaluation Metrics
• Different grid users will have different criteria of evaluation
• Used in these summary results are:
  – Mean job time
    • Average time taken for job to run, from scheduling to completion
  – Effective Network Usage (ENU)
    • ENU = (file requests which use network resources) / (total number of file requests)
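Both metrics are simple ratios, as the sketch below shows. It follows the definitions on this slide; the function names and the (scheduled, completed) tuple layout are assumptions made for the example.

```python
def mean_job_time(jobs):
    """Average of (completion - scheduling) over (scheduled, completed) pairs."""
    return sum(completed - scheduled for scheduled, completed in jobs) / len(jobs)


def effective_network_usage(network_requests, total_requests):
    """ENU: fraction of file requests that used network resources."""
    return network_requests / total_requests
```

For example, two jobs scheduled at 0 s and 50 s and completed at 100 s and 250 s have a mean job time of 150 s, and 30 network-using requests out of 120 give an ENU of 0.25. A lower ENU means more requests were served from local storage.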
![Page 16: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/16.jpg)
Results: Data Replication
• Performance of algorithms measured with varying D
• D varied by reducing dataset size
• 20-25% gain in mean job time as D approaches realistic value
![Page 17: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/17.jpg)
Results: Data Replication
• ENU shows similar gain
• Allows clearer distinction between strategies
![Page 18: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/18.jpg)
Results: Data Replication
• Number of jobs increased to 4000
• Mean job time increases linearly
• Relative improvement as D increases will hold for higher numbers of jobs
• Realistic number of jobs is >O(10000)
![Page 19: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/19.jpg)
Results: Site Policies
• Vary site policies:
  – All Job Types
    • Sites accept jobs from any VO
  – One Job Type
    • Sites accept jobs from one VO
  – Mixed
    • Default
• All Job Types is ~60% faster than One Job Type
![Page 20: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/20.jpg)
Results: Site Policies
• All Job Types also gives ~25% lower ENU than other policies
• Egalitarian approach benefits all grid users
![Page 21: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/21.jpg)
Results: Access Patterns
• Sequential access likely for many physics applications
• Zipf-like access will also occur
  – Some files accessed frequently, many infrequently
• Replication gives performance gain of ~75% when Zipf access pattern used
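The difference between the two access patterns can be sketched as follows. The Zipf-like generator draws file i (ranked from 1) with probability proportional to 1/i^alpha; the exponent, the seed and the function names are assumptions for illustration, not parameters taken from the simulations reported here.

```python
import random


def sequential_access(files, n):
    """Read files in their listed order, wrapping around if n exceeds the list."""
    return [files[i % len(files)] for i in range(n)]


def zipf_access(files, n, alpha=1.0, seed=0):
    """Draw n requests where the file at rank i has weight 1 / i**alpha."""
    rng = random.Random(seed)
    weights = [1 / (rank ** alpha) for rank in range(1, len(files) + 1)]
    return rng.choices(files, weights=weights, k=n)
```

With a Zipf-like pattern a small set of files attracts most requests, so a replica of a popular file is reused many times after it is created. That reuse is what makes the gain from replication so much larger than under sequential access.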
![Page 22: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/22.jpg)
Results: Access Patterns
• ENU also ~75% lower with Zipf access
• Any Zipf-like element makes replication highly desirable
• Size of efficiency gain depends on streaming model, etc
![Page 23: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/23.jpg)
Conclusions
• OptorSim used to simulate LCG in 2008
• Dynamic data replication reduces running time of simulated grid jobs:
  – 20% reduction with sequential access
  – 75% reduction with Zipf-like access
  – Similar reductions in network usage
• Little difference between replication strategies
  – Simpler LRU, LFU 20-30% faster than economic model
• Site policy which allows all experiments to share resources gives most effective grid use
![Page 24: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/24.jpg)
The End
![Page 25: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/25.jpg)
Backup Slides
![Page 26: Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008](https://reader036.vdocuments.us/reader036/viewer/2022070410/56649ece5503460f94bda62a/html5/thumbnails/26.jpg)
Replica optimiser architecture
• Access Mediator (AM) - contacts replica optimisers to locate the cheapest copies of files and makes them available locally
• Storage Broker (SB) - manages files stored in SE, trying to maximise profit for the finite amount of storage space available
• P2P Mediator (P2PM) - establishes and maintains P2P communication between grid sites
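The division of labour between the Access Mediator and the Storage Broker can be sketched roughly as below. This is a simplified guess at the decision logic ("replicate only if profitable", "files deleted if less valuable than new file"), not the OptorSim economic model: the real auction protocol and file valuation are not described in these slides, so the `value` callable, the offer dictionary and the function names are all hypothetical.

```python
def cheapest_offer(offers):
    """Access Mediator step: pick the site offering the lowest price for a file.

    offers maps site name -> quoted price.
    """
    return min(offers, key=offers.get)


def should_replicate(new_file, stored_files, value, free_slots):
    """Storage Broker step: decide whether to keep a local copy of new_file.

    value is a callable giving the estimated worth of a file to this site.
    Returns (replicate, victim), where victim is the file to delete or None.
    """
    if free_slots > 0:
        return True, None
    if not stored_files:
        return False, None
    victim = min(stored_files, key=value)
    if value(new_file) > value(victim):
        return True, victim
    return False, None
```

Unlike the LRU and LFU strategies, this broker can refuse to replicate at all when every stored file is judged more valuable than the incoming one.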