![Page 1: Dave Kant D.Kant@rl.ac.uk Monitoring and Accounting Dave Kant CCLRC e-Science Centre, UK GridPP 12 Jan 31 st - Feb 1 st 2005](https://reader030.vdocuments.us/reader030/viewer/2022032805/56649eef5503460f94bfeb90/html5/thumbnails/1.jpg)
Dave Kant
Monitoring and Accounting
Dave Kant
CCLRC e-Science Centre, UK
GridPP 12, Jan 31st - Feb 1st 2005
Overview
1. GOC Database
2. Monitoring Tools
3. Accounting
4. Issues
5. Future Plans
GOC Database
– What features?
  • Configuration of monitoring tools
  • Security
  • Organisations
  • Administrative Roles
  • Replication
– What role will it play in the future?
  • New site registration procedure
  • BDII generation
GRID Configuration Database (GOCDB)
GridSite, MySQL
Resource Centre: Resources & Site Information
• EDG, LCG-1, LCG-2, …
• ce, se, bdii, rb
Monitoring Services
• Operations Maps
• Configure other Tools
• Resource Provider
• Organisation Structures
• Secure services
- Site News
- Self Certification
- Accounting
Secure Database Management via HTTPS / X.509
Store a Subset of the Grid Information system
People, Contact Information, Resources
Maintenance Bit
[Diagram labels: RC; https; SERVER; SQL]
GOC DB can also contain information that is not present in the IS, such as: scheduled maintenance; news; organisational structures; geographic coordinates for maps.
EGEE ROC Structure
• EGEE is made up of regions.
• Each region contains many computing centres.
• Regional Operations Centres (ROCs) are a focus for operational activities.
USA
Organisational Structures
• Developed a tool to manage organisational structures, modelled on the GridPP Tier1/2 structure
• Materialised path encoding
• Provide ROCs with a package to monitor the resources in the region
  – Tailored monitoring
  – Administrative roles to the coordinators in GOCDB
EGEE (1)
  France (1.1)
  UK/I (1.2)
    GridPP (1.2.1)
      LondonT2
        IMPERIAL
        QMUL
      ScotGrid
        Edinburgh
  S.E.E (1.3)
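A minimal sketch of materialised path encoding, in Python, to show why it suits this kind of tree: every node stores its full path, so "everything under GridPP" becomes a prefix match. The paths 1, 1.1, 1.2, 1.3 and 1.2.1 come from the slide; the deeper paths, the `OrgNode` class and the helper names are assumptions made for illustration, not the GOCDB schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrgNode:
    name: str
    path: str  # materialised path, e.g. "1.2.1"


NODES = [
    OrgNode("EGEE", "1"),
    OrgNode("France", "1.1"),
    OrgNode("UK/I", "1.2"),
    OrgNode("S.E.E", "1.3"),
    OrgNode("GridPP", "1.2.1"),
    # The paths below are hypothetical; the slide gives no numbers for them.
    OrgNode("LondonT2", "1.2.1.1"),
    OrgNode("ScotGrid", "1.2.1.2"),
    OrgNode("IMPERIAL", "1.2.1.1.1"),
    OrgNode("QMUL", "1.2.1.1.2"),
    OrgNode("Edinburgh", "1.2.1.2.1"),
]


def descendants(nodes, root_path):
    """Names of all nodes strictly below root_path (a simple prefix match)."""
    prefix = root_path + "."  # trailing dot stops "1.2" from matching "1.20"
    return [n.name for n in nodes if n.path.startswith(prefix)]


def ancestors(nodes, path):
    """Names of all nodes strictly above path, listed in the order of nodes."""
    parts = path.split(".")
    wanted = {".".join(parts[:i]) for i in range(1, len(parts))}
    return [n.name for n in nodes if n.path in wanted]
```

In SQL the same descendant query would be a `LIKE '1.2.1.%'` on the path column, which is what makes tailored, per-region monitoring views cheap to produce.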
Adaptive Job Brokering Based on the Monitoring System
Generation of BDII configuration file via feedback into IS
• The total list of all sites is derived from GOCDB (via RGMA)
• GOC bit: sites which have opted out, e.g. for scheduled maintenance
• Core tests are a subset of the site functional tests run by CERN every day
• White list: sites that failed one or more core tests but are well supported are put back in, e.g. a Tier1 site
• Black list: sites that are not trusted
[Diagram labels: 100’s of Sites; Total List of all sites (RGMA); GOC Bit; Sites pass core tests; White List; Black List; Trusted Sites; BDII]
Monitoring Services
• GOC DB site info
• Gstat data
• Site Functional Tests
• GOC hourly tests
Environments: Production, VO, GridPP, …
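The selection rules above can be sketched as plain set operations. This is an illustration only: the order in which the lists are applied (opt-outs first, black list last) and the function name are assumptions, and the site names in the usage note are placeholders.

```python
def trusted_sites(all_sites, goc_bit, passed_core, white_list, black_list):
    """Return the sorted list of sites that would be kept for the BDII.

    Assumed order of rules (not stated on the slide):
      1. drop sites that opted out via the GOC bit,
      2. keep sites that pass the core tests or are on the white list,
      3. drop anything on the black list.
    """
    candidates = set(all_sites) - set(goc_bit)
    selected = {s for s in candidates if s in passed_core or s in white_list}
    return sorted(selected - set(black_list))
```

For example, with five sites where SiteB has opted out, SiteC failed the core tests but is white-listed, SiteD failed and is not, and SiteE is black-listed, only SiteA and SiteC remain.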
How Are New Sites Added?
1. JSPG have written a “Site Registration Policy & Procedure” Document
2. https://edms.cern.ch/document/503198/
3. New GOCDB portal to streamline the site registration process.
[Diagram labels: Site; ROC; GOCDB; EGEE]
[1] Site and ROC liaise
[2] “candidate” site
[3] Site installs middleware
[4] “uncertified” Site
[5] Certification Testing
[6] “certified” Site
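The registration steps read naturally as a small state machine. A sketch under stated assumptions: the three status names are taken from the slide, while the event names and the `advance` helper are hypothetical.

```python
from enum import Enum


class SiteStatus(Enum):
    CANDIDATE = "candidate"
    UNCERTIFIED = "uncertified"
    CERTIFIED = "certified"


# Event names are made up for this sketch; the slide only names the statuses.
TRANSITIONS = {
    (SiteStatus.CANDIDATE, "middleware_installed"): SiteStatus.UNCERTIFIED,
    (SiteStatus.UNCERTIFIED, "certification_passed"): SiteStatus.CERTIFIED,
}


def advance(status, event):
    """Return the next status, or raise ValueError if the step is not allowed."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(
            f"{event!r} is not allowed for a {status.value} site"
        ) from None
```

Keeping the allowed transitions in one table means a portal cannot mark a candidate site as certified without it first passing through the uncertified stage.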
Replication
Two replicas, each with different security considerations:
• “Services” replica managed by Taipei
  – Direct connections to the database by the monitoring tools from known hosts
• “Users” replica to be set up at IN2P3
  – Web portal based on X.509 certificates
  – CIC on duty
Monitoring Tools
• What are the main tools that are used in the day-to-day operations of the LCG Grid?
  – GPPMON
  – GSTAT
  – Site Functional Tests
• Other monitoring tools exist, but I won’t discuss them here
  – GridIce
Operations Map – Job Submission Tests
GPPMON
Displays the results of tests against sites.
Test: Job Submission
The job is a simple test of the grid middleware components, e.g. the Gatekeeper service, the RB service, and the Information System (via JDL requirements).
This kind of test deals with the functional behaviour of core grid services: do simple jobs run? These are lightweight tests which run hourly. However, they have certain limitations, e.g. dteam VO; WN reach (specialised monitoring queues).
Operations Map – Certificate Lifetime
GPPMON
Displays the results of tests against sites.
Test: Certificate Lifetime
Many grid services require a valid certificate for security.
By probing the host certificates on CEs and SEs at sites with a simple SSL client, we can identify certificates which are due to expire and send the sites an early warning. A predictive tool!
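A minimal sketch of such a probe using Python's standard `ssl` module; this is not the GPPMON implementation. The function names, the 30-day threshold and the `cafile` parameter are assumptions. Grid host certificates are normally issued by grid CAs, so a CA bundle would probably have to be supplied, and a certificate that has already expired will fail the handshake rather than be reported.

```python
import socket
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter date (negative if expired).

    not_after uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def fetch_not_after(host, port, cafile=None, timeout=10.0):
    """Connect to host:port over TLS and return the certificate's notAfter."""
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiring_soon(not_after_by_host, warn_days=30, now=None):
    """Hosts whose certificate expires within warn_days (threshold is assumed)."""
    return sorted(
        host
        for host, not_after in not_after_by_host.items()
        if days_until_expiry(not_after, now) <= warn_days
    )
```

A monitoring loop would call `fetch_not_after` for each CE and SE listed in GOCDB, then pass the collected dates to `expiring_soon` to decide which sites to warn.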
GIIS Monitor
• Developed by Min Tsai (GOC Taipei)
• Tool to display and check information published by the site GIIS (sanity checks, fault detection)
• http://goc.grid.sinica.edu.tw/gstat/
Regional Plot:
http://map.gridpp.ac.uk
Site Certification Service
• In terms of middleware, the installation and configuration of a site is quite a complicated procedure.
  – When there is a new release, sites don’t upgrade at the same time
  – Upgrades don’t always go smoothly
  – Unexpected things happen (who turned off the power?)
  – Day-to-day problems; robustness of service under load?
• It’s necessary to actively hunt for problems
• Site certification testing is run by the CERN deployment team on a daily basis. The first step toward providing this service involves running a series of replica manager tests which register files onto the grid, move them around and delete them, and which make 3rd party copies from a remote SE.
• Unlike the simple job submission tests implemented in GPPMON, these tests are more heavyweight and attempt to simulate the life cycle of real applications.
Certification Test Results
http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi
Aggregator RSSReader (Windows Client)
GOC generates RSS feeds which clients can pull using an RSS aggregator.
How can we integrate feeds and ticketing systems?
Syndication of Monitoring Information
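To illustrate what syndicating monitoring results could look like, here is a sketch that builds a minimal RSS 2.0 document with the standard library and reads the item titles back, as an aggregator would. The feed title, link and item fields are placeholders, not the actual GOC feed layout.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Serialise monitoring items (dicts with title/description) as RSS 2.0."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for tag, text in (("title", title), ("link", link), ("description", description)):
        ET.SubElement(channel, tag).text = text
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "description").text = item["description"]
    return ET.tostring(rss, encoding="unicode")


def read_item_titles(xml_text):
    """What a pulling client does: parse the feed and list the item titles."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.findall("./channel/item")]
```

Because the feed is plain XML pulled over HTTP, any consumer that can parse it could use the same items, not just a desktop reader.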
Real Time Grid Monitor
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html
A Visualisation tool to track jobs currently running on the grid.
Applet queries the logging and bookkeeping service to get information about grid jobs.
Why are jobs failing?
Why are jobs queued at sites while others are empty?
Problems with Existing Tools
• Lots of monitoring tools around which have things in common:
  – all the information which they generate is hidden away or difficult to access
  – limited interfaces: the data can only be accessed in specific ways
• Therefore, it’s difficult to build “on-demand” services that allow communities (“players”) to interact with the data.
• The idea is for the services to collect information and put it into a common repository such as an RGMA Archiver. In this way, the information can be shared and made accessible to all.
• Services (EGEE parlance: ROC and CIC services) munch the data and present it to the community.
• How much CPU in the UKI ROC?
  – How much in GridPP?
    • How much in each Tier2?
=> Integrate data from different sources to provide this information
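A sketch of the kind of integration meant here: one source supplies the organisational path of each site (as GOCDB would), another supplies a CPU count per site (as an information system would), and the roll-up answers the three questions at once. The path format, the function name and all figures in the usage note are invented for illustration.

```python
from collections import defaultdict


def cpu_by_level(site_paths, cpu_by_site):
    """Sum per-site CPU counts into every organisational level above the site.

    site_paths:  site name -> "/"-separated organisational path ending in the site
    cpu_by_site: site name -> CPU count (sites missing here count as 0)
    """
    totals = defaultdict(int)
    for site, path in site_paths.items():
        parts = path.split("/")
        for depth in range(1, len(parts) + 1):
            totals["/".join(parts[:depth])] += cpu_by_site.get(site, 0)
    return dict(totals)
```

With made-up counts of 100, 50 and 30 CPUs for IMPERIAL, QMUL and Edinburgh, the result holds 150 for `UKI/GridPP/LondonT2`, 30 for `UKI/GridPP/ScotGrid`, and 180 for both `UKI/GridPP` and `UKI`.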
Monitoring Paradigm
A Better way to unify monitoring information.
GOC Services collect information and publish into an archiver.
ROC/CIC Services provide a means for the community to interact with this information on-demand. GOC provides services tailored to the requirements of the community.
[Diagram labels: Information Repository (RGMA); GOC Services; Accounting; Monitoring; GSTAT; Testing; ROC Services; Self Certification; CIC Services; Communities; VOs; ROCs; EGEE; Sites; Organisations]
Use Cases
• Monitoring services which use RGMA as the backbone for data transport and data location via the registry service.
  – Grid Event Monitoring System
  – “Site Functional Test” Reporting Tool
  – Accounting
Use Cases - GEMS
• Grid Event Monitoring System
• List of resources to monitor is provided by GOCDB
• Alert system that uses RGMA
• Looks for changes of state in the monitoring data tables
• Generates an alert and displays it on the GEMS console
• Notification features
• Event filtering
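The core idea, alerting on a change of state rather than on every reading, can be sketched as a comparison of two snapshots. This is not GEMS code: the snapshot format, the state names and both function names are assumptions.

```python
def detect_changes(previous, current):
    """Return (resource, old_state, new_state) for resources whose state changed.

    previous and current map a resource name to its state in two successive
    snapshots. Resources seen for the first time raise no alert.
    """
    events = []
    for resource in sorted(current):
        old = previous.get(resource)
        new = current[resource]
        if old is not None and old != new:
            events.append((resource, old, new))
    return events


def filter_events(events, to_states=("FAIL",)):
    """Event filtering: keep only transitions into the given states."""
    return [event for event in events if event[2] in to_states]
```

A console would display everything from `detect_changes`, while notifications might go out only for the subset returned by `filter_events`.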
Reporting Tool Prototype
Organisational Identities taken from GOCDB
Accounting
• Information collected at each site from batch logs, gatekeeper logs etc.
• Information joined at site level to select grid jobs and stored in a database on the R-GMA MON box at the site.
• Information published through R-GMA and collected centrally in an R-GMA archive at GOC
• Web site presents various views of this data
• Information schema based on GGF Usage Group
• Structure of Grid taken from GOC DB, the grid configuration database.
• Only normalised CPU time collected (at the moment)
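The site-level join can be pictured as matching batch records against gatekeeper records on the local job identifier: a batch job with no gatekeeper entry did not arrive through the grid and is dropped. The sketch below assumes already-parsed records with hypothetical field names; it is not the APEL parser or schema, and it leaves out CPU normalisation.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    local_job_id: str
    user_dn: str
    vo: str
    cpu_seconds: int


def join_grid_jobs(batch_records, gatekeeper_records):
    """Keep only batch jobs that also appear in the gatekeeper log.

    batch_records:      dicts with "local_job_id" and "cpu_seconds"
    gatekeeper_records: dicts with "local_job_id", "user_dn" and "vo"
    (all field names are assumptions made for this sketch)
    """
    by_job = {g["local_job_id"]: g for g in gatekeeper_records}
    usage = []
    for b in batch_records:
        g = by_job.get(b["local_job_id"])
        if g is None:
            continue  # no gatekeeper entry: a local, non-grid job
        usage.append(
            UsageRecord(b["local_job_id"], g["user_dn"], g["vo"], b["cpu_seconds"])
        )
    return usage
```

The joined records are what a site would then publish for central collection.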
GOC Accounting Services
http://goc.grid-support.ac.uk/gridsite/accounting/index.html
BaseCpuSeconds Aggregated across EGEE
Each Site, per VO, per Month
Simple interface to customise views of data: VO, time frame and Region (default = EGEE)
Each Region, per VO, per Month
On Demand Services to EGEE Community
Other Distributions
Normalised CPU
# Jobs
Web form to apply selection criteria on the data
Aggregate data across an organisation structure (Default = All ROCs)
Select VOs (Default = All)
Select date range
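A sketch of how such selection criteria might be applied to accounting records: each criterion is optional, and leaving it unset reproduces the "all" default. The record layout, the function name and the example values are assumptions, not the GOC web application.

```python
from collections import defaultdict
from datetime import date


def summarise(records, region=None, vos=None, start=None, end=None):
    """Sum CPU seconds per (site, VO, month) after applying optional filters.

    Each record is a dict with "site", "regions" (organisational nodes that
    contain the site), "vo", "day" (a datetime.date) and "cpu_seconds".
    """
    totals = defaultdict(int)
    for r in records:
        if region is not None and region not in r["regions"]:
            continue
        if vos is not None and r["vo"] not in vos:
            continue
        if start is not None and r["day"] < start:
            continue
        if end is not None and r["day"] > end:
            continue
        key = (r["site"], r["vo"], r["day"].strftime("%Y-%m"))
        totals[key] += r["cpu_seconds"]
    return dict(totals)
```

Calling `summarise(records)` with no arguments aggregates everything; passing, say, `region="UKI"` and `vos={"atlas"}` narrows the view in the same way as the form fields.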
VO Index
Summed CPU (Seconds) consumed by resources in selected Region
Selected Date Range
List of Sites Belonging to the Selected ROC
A breakdown of the resource usage per Site, per VO, per Month
Deployment
• Package was released to LCG in August 2004 and certified soon afterwards.
• There was no LCG release after that until LCG2_3_0 on 18th December 2004
• Today there are still very few 2_3_0 sites. There are 28 sites producing accounting records today.
• The 2_3_0 release has some bugs which are fixed in a new release that is available on the accounting home page
• Recommend that sites upgrade accounting to version APEL 3.4.40 available on the accounting homepage
http://goc.grid-support.ac.uk/gridsite/accounting/index.html
Future Plans
• Support for the LSF batch system.
• Understand normalisation issues; do we have faith in the numbers we present?
• Extend accounting schema to include information about the worker node, job efficiency and globalJobID.
• Integrate the LCG schema with de-facto grid accounting standards, namely GGF
  – Share data with other Grid Communities
    • NorduGrid, Grid03
Summary
• GOCDB to take a more important role in the operational environment
• A shift in the monitoring paradigm which relies on sharing data through RGMA
• Accounting: information gathering infrastructure and reporting web site
• Development towards on-demand services to provide the community with up-to-date information, aggregated at different levels.
• Development of Visualisation tools to enhance our understanding of the grid.
• Adaptive Job brokering based on the monitoring system