Monitoring: Grid, Fabric, Network
Jennifer M. Schopf, Argonne National Lab
PPDG Review 28 April 2003, Fermilab
28 Apr 2003 J. Schopf, PPDG Review 2
Monitoring and PPDG
• Many monitoring tools currently available
  – Different use cases
  – Different strengths
  – Legacy systems
• Much of PPDG monitoring work is by non-funded collaborators
  – Les Cottrell, SLAC, IEPM-BW
  – Iosif Legrand, Caltech, MonALISA
  – Brian Tierney, LBNL, NetLogger, pyGMA, NTAF
  – Wisconsin Group, Hawkeye
Tools in a nutshell
Grid level Fabric Level Network Schema
MDS (Globus) X X X
Hawkeye (Condor) X X X
MonALISA X X X
Ganglia X
IEPM-BW X X
NetLogger X X X
pyGMA X
NTAF (LBNL) X
GLUE schema X
PPDG Role in Monitoring
• Deployment and evaluation
  – Use on production testbeds
• Requirements back to developers
  – Additional information sources
  – Realistic use cases
• Furthering of interoperability goals
  – GLUE schema
  – Common interfaces
Deployment
CMS ATLAS STAR BaBar D0 TJNAF
MDS (Globus) X X X X
Hawkeye (Condor) X
MonALISA X
Ganglia X X X X
IEPM-BW X X X
NetLogger X X
pyGMA X
GLUE schema X X
Local Solution X X X
Interoperability between Efforts
MDS Hawkeye MonALISA Ganglia IEPM-BW NetLogger pyGMA NTAF GLUE schema
MDS X X X X U X
Hawkeye X X
MonALISA X X X X U
Ganglia X X X X
IEPM-BW X U X X
NetLogger U X X X X
pyGMA U X X X X X U
NTAF U X X X
GLUE schema X X X X
X – Currently available, U – Under consideration
Overview
• Examples of interfacing between tools
  – STAR use of Ganglia/MDS
  – Ganglia extension in ATLAS
  – MonALISA interfaces to Hawkeye and MDS in CMS
• Scalability analysis
• Some future steps
Ganglia-MDS Interface: STAR efforts and use
Stratos Efstathiadis, BNL
• Developed a modified version of the Ganglia information provider (IP)
  – Perl basis
  – Matches the current CE-GLUE Schema
  – Can connect to the Ganglia Meta Daemon or the Ganglia Monitoring daemon
  – Simpler and more flexible
• Currently being tested at PDSF and BNL
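To make the idea of an information provider concrete, here is a minimal Python sketch (the STAR provider itself is Perl-based and is not reproduced here). It parses a small, hypothetical sample of Ganglia-style XML and prints LDIF-style entries. The metric-to-attribute mapping, the `GlueHostName` naming attribute, and the base DN are illustrative assumptions, not the official CE-GLUE mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample of the XML a Ganglia daemon reports.
SAMPLE_XML = """<GANGLIA_XML VERSION="2.5.1" SOURCE="gmond">
  <CLUSTER NAME="example-cluster">
    <HOST NAME="node01.example.org" IP="192.0.2.1">
      <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs"/>
      <METRIC NAME="cpu_speed" VAL="1000" TYPE="uint32" UNITS="MHz"/>
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""

# Assumed mapping from Ganglia metric names to GLUE-style attribute names.
# Illustrative only; a real provider would follow the published schema.
METRIC_TO_GLUE = {
    "cpu_num": "GlueHostArchitectureSMPSize",
    "cpu_speed": "GlueHostProcessorClockSpeed",
    "load_one": "GlueHostProcessorLoadLast1Min",
}


def ganglia_to_ldif(xml_text, base_dn="mds-vo-name=local,o=grid"):
    """Translate Ganglia-style XML into LDIF-style text, one entry per host.

    The base DN default is an assumption chosen for illustration.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for host in root.iter("HOST"):
        lines = [f"dn: GlueHostName={host.get('NAME')},{base_dn}"]
        for metric in host.iter("METRIC"):
            attr = METRIC_TO_GLUE.get(metric.get("NAME"))
            if attr is not None:  # skip metrics with no mapping
                lines.append(f"{attr}: {metric.get('VAL')}")
        entries.append("\n".join(lines))
    return "\n\n".join(entries)


if __name__ == "__main__":
    print(ganglia_to_ldif(SAMPLE_XML))
```

Keeping the mapping in one table is what makes such a provider easy to realign when the schema changes.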
Ganglia Extensions in ATLAS
Monitor Cluster health
The information added through Ganglia creates an additional level, combining different clusters into a “metacluster”
MonALISA in CMS
– MonALISA (Caltech)
– Dynamic information/resource discovery using intelligent agents
• Java / Jini with interfaces to SNMP, MDS, Ganglia, and Hawkeye
• WSDL / SOAP with UDDI
– Aim to incorporate into a “Grid Control Room” Service
– Integration with MDS and Hawkeye
Scalability Comparison of MDS, R-GMA, Hawkeye
• Zhang, Freschl and Schopf, “A Performance Study of Monitoring and Information Services for Distributed Systems”, to appear in HPDC 2003
• How many users can query an information server at a time?
• How many users can query a directory server?
• How does an information server scale with the amount of data in it?
• How does an aggregator scale with the number of information servers registered to it?
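The first of these questions can be approached with a small load generator. The sketch below is not the harness used in the cited study; it is a hedged illustration in which `toy_query` is a hypothetical stand-in for one query against an information server, and `measure` reports throughput and mean latency for a given number of concurrent users.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def toy_query():
    """Hypothetical stand-in for one query against an information server."""
    time.sleep(0.001)  # pretend the server needs about 1 ms to answer
    return {"status": "ok"}


def measure(query, users, queries_per_user=5):
    """Run `users` concurrent clients and summarize what they observed."""

    def one_user(_user_id):
        latencies = []
        for _ in range(queries_per_user):
            start = time.perf_counter()
            query()
            latencies.append(time.perf_counter() - start)
        return latencies

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = list(pool.map(one_user, range(users)))
    elapsed = time.perf_counter() - wall_start

    all_latencies = [lat for user in per_user for lat in user]
    return {
        "users": users,
        "queries": len(all_latencies),
        "throughput": len(all_latencies) / elapsed,  # queries per second
        "mean_latency": sum(all_latencies) / len(all_latencies),
    }


if __name__ == "__main__":
    # Sweep the number of concurrent users to see where the server saturates.
    for n in (1, 4, 16):
        print(measure(toy_query, users=n))
```

Replacing `toy_query` with a real client call, and sweeping the user count, gives the kind of curve the questions above ask about.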
Overall Results
• Performance can be a matter of deployment
  – Effect of background load
  – Effect of network bandwidth
• Performance can be affected by underlying infrastructure
  – LDAP/Java strengths and weaknesses
• Performance can be improved using standard techniques
  – Caching; multi-threading; etc.
MonALISA Performance
[Plots: IO threads; CPU usage on a Dell I8100, ~1 GHz]
Test: a large SNMP query (~200 metric values) on a 500-node farm every 60 s.
~1600 metric values collected per second from one MonALISA service.
“lxshare” cluster at CERN, ~600 nodes.
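As a quick sanity check, the quoted collection rate follows from the test parameters. The function name below is ours, and the inputs are the approximate figures from the slide.

```python
def collection_rate(metrics_per_node, nodes, interval_s):
    """Metric values collected per second for a periodic poll of a farm."""
    return metrics_per_node * nodes / interval_s


# Approximate figures from the test description: ~200 values, 500 nodes, 60 s.
rate = collection_rate(metrics_per_node=200, nodes=500, interval_s=60)
print(f"{rate:.0f} metric values per second")
```

This gives roughly 1,667 values per second, consistent with the ~1600 figure given that the per-node metric count is itself approximate.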
Future: OGSA and Monitoring
• Open Grid Services Architecture (OGSA) defines standard interfaces and behaviors for distributed system integration, especially:
  – Standard XML-based service information model
  – Standard interfaces for push and pull mode access to service data
    • Notification and subscription
• Every service has its own service data
  – OGSA has a common mechanism to expose a service instance’s state data to service requestors for query, update and change notification
  – Monitoring data is “baked right in”
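The push and pull access modes can be illustrated with a toy in-process model. This is a sketch of the pattern only: the class and method names are hypothetical and do not correspond to the OGSI interfaces or to any Globus Toolkit API.

```python
class ServiceData:
    """Toy model of a service's state data with pull and push access."""

    def __init__(self):
        self._data = {}
        self._subscribers = {}  # element name -> list of callbacks

    def query(self, name):
        """Pull mode: the requestor asks for the current value."""
        return self._data.get(name)

    def subscribe(self, name, callback):
        """Push mode: register interest in changes to one data element."""
        self._subscribers.setdefault(name, []).append(callback)

    def update(self, name, value):
        """Set a value and notify subscribers only if it actually changed."""
        old = self._data.get(name)
        self._data[name] = value
        if value != old:
            for callback in self._subscribers.get(name, []):
                callback(name, value)


if __name__ == "__main__":
    sd = ServiceData()
    sd.subscribe("load", lambda name, value: print("changed:", name, value))
    sd.update("load", 0.5)   # triggers a notification
    sd.update("load", 0.5)   # no change, so no notification
    print(sd.query("load"))  # pull access to the same data
```

The point of the pattern is that monitoring needs no separate channel: the same state data serves both a one-off query and a standing subscription.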
OGSA-Compatible Monitoring
• MDS3
  – Part of OGSA reference implementation GT3
  – Release will include full data in the GLUE schema for CE; service data from RFT, RLS, GRAM; GridFTP server data, SW version and path data
  – Simplest higher-level service is the caching index service
    • Much like the GIIS in MDS 2.x
• MonALISA
  – will be compatible with OGSI-spec registration/subscription services
  – plans to have adapters that can interface to the OGSI service data
• LBNL tools are also adapting to the OGSI spec
Future Work - Interoperability
• Efforts will continue to make tools interoperate more
  – Many tools have the hooks to do this; it’s just a matter of filling in the slots
• We need a better understanding of the requirements from the applications
Summary
• Many monitoring solutions are in use by different experiments
• Additional experience is leading towards common uses and deployments
• Ongoing work towards the use of common tools, common schema and naming conventions
• Still need better identification of requirements and the involvement of application groups to work together on a common/consistent infrastructure
GLUE-Schema Effort
• Part of HICB/JTB GLUE framework
• To address the need for common schemas between projects
  – framework independent
  – something to translate into, not a requirement within fabric layer
• Mail list: [email protected]
• www.hicb.org/glue/glue-schema/schema.html
Glue Schema Status
• Compute Element schema:
  – Currently being used in EDG (MDS) and MDS2
  – Found a couple of minor things missing, which will be added to the next version
  – Will be in MDS-3
• SE schema:
  – Lots of good discussion to finalize this at CHEP
  – Will start to use this in EDG (R-GMA) testbed 2 later this month
• NE schema:
  – Merged ideas from EDG (UK group) with DataTAG (Italian group)
  – GGF NM-WG is now working on this too.
Globus MDS2: Monitoring and Discovery Service
• MDS has been accepted as core software for monitoring and presentation of information at the Grid level
• GIIS set up as part of collaboration with iVDGL
  – Presents overall picture of the state of the Grid sites
• Work continuing to interface it to local monitoring systems
  – Each site/experiment has preferred local solutions
  – Needed GLUE schema to make this happen
MDS-3 in June Release
• All the data currently in core MDS-2
• Full data in the GLUE schema for CE
• Service data from RFT, RLS, GRAM
• GridFTP server data, SW version and path data
• Simplest higher-level service is the caching index service
  – Much like the GIIS in MDS 2.x
  – Will have configurability like a GIIS hierarchy
  – Will also have PHP-style scripts, much as available today
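The behavior of a caching index can be sketched in a few lines. This is an illustration of the general pattern (registered sources, answers cached for a time-to-live), not the MDS-3 or GIIS implementation; the class name, the `ttl_s` parameter, and the injectable clock are assumptions made for the example.

```python
import time


class CachingIndex:
    """Toy aggregator: sources register, queries are answered from a cache."""

    def __init__(self, ttl_s=30.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock    # injectable so the cache is easy to test
        self.providers = {}   # source name -> callable returning fresh data
        self.cache = {}       # source name -> (timestamp, data)

    def register(self, name, fetch):
        """Register an information source under a name."""
        self.providers[name] = fetch

    def query(self, name):
        """Return cached data if still fresh, otherwise refetch and cache."""
        now = self.clock()
        cached = self.cache.get(name)
        if cached is not None and now - cached[0] < self.ttl_s:
            return cached[1]
        data = self.providers[name]()
        self.cache[name] = (now, data)
        return data


if __name__ == "__main__":
    index = CachingIndex(ttl_s=30.0)
    index.register("siteA", lambda: {"free_cpus": 4})
    print(index.query("siteA"))  # first call contacts the source
    print(index.query("siteA"))  # second call is served from the cache
```

The time-to-live is the main tuning knob: a longer value shields the sources from query load at the cost of staler answers.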
MonALISA Current Status
MonALISA has been running for several months at all the US-CMS production sites and at CERN. It has proved to be stable and scalable (at CERN it is monitoring ~600 nodes).
It is used to monitor several major internet connections (CERN-US, CERN-Geant, Taiwan-Chicago, DataTAG link, …)
• MonALISA is a prototype service under development. It is based on the code mobility paradigm, which provides the mechanism for a consistent, dynamic invocation of components in large, distributed systems.
• http://monalisa.cern.ch/MONALISA
Hawkeye
• Developed by Condor Group
• Focus – automatic problem detection
• Underlying infrastructure builds on the Condor ClassAd technology
  – Condor ClassAd language to identify resources in a pool
  – ClassAd Matchmaking to execute jobs based on attribute values of resources to identify problems in a pool
• Schema-free representation allows users to easily add new types of information to Hawkeye
• Information probes run on individual cluster nodes and report to central collector
• Easy to add new information probes
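The idea of matching schema-free attribute sets against problem conditions can be sketched as follows. This is plain Python, not ClassAd syntax or Hawkeye code: the ads are ordinary dictionaries, and the attribute names and trigger conditions are hypothetical.

```python
def find_problems(ads, triggers):
    """Return (node name, trigger label) for every trigger an ad matches.

    Ads are schema-free: a trigger that refers to an attribute an ad does
    not carry simply does not match that ad.
    """
    problems = []
    for ad in ads:
        for label, predicate in triggers.items():
            try:
                hit = predicate(ad)
            except KeyError:
                hit = False  # ad lacks an attribute the trigger needs
            if hit:
                problems.append((ad.get("Name", "?"), label))
    return problems


# Hypothetical probe reports; each node may publish different attributes.
ADS = [
    {"Name": "node1", "LoadAvg": 12.0, "DiskFreeMB": 5000},
    {"Name": "node2", "LoadAvg": 0.3, "DiskFreeMB": 50},
    {"Name": "node3", "Temperature": 71},
]

# Hypothetical problem conditions expressed over attribute values.
TRIGGERS = {
    "high_load": lambda ad: ad["LoadAvg"] > 10,
    "low_disk": lambda ad: ad["DiskFreeMB"] < 100,
}

if __name__ == "__main__":
    for node, label in find_problems(ADS, TRIGGERS):
        print(node, label)
```

Because nothing fixes the set of attributes in advance, a new probe only has to start publishing a new key, and a new trigger can be written against it without touching the rest.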
Hawkeye Recent Accomplishments
• Release candidate for version 1.0 has been released.
• Used to monitor USCMS testbed
• Used to monitor University of Wisconsin-Madison Condor pool.
PingER
PIs: Les Cottrell, SLAC
Impact and Connections
IMPACT:
• increase network and Grid application bulk throughput over high-delay, high-bandwidth networks (like DOE’s ESnet)
• provide troubleshooting information for networkers and users by identifying the onset and magnitude of performance changes, and whether they appear in the application or the network
• provide network performance database, analysis and navigable reports from active monitoring
CONNECTIONS:
• SciDAC: High Energy Nuclear Physics, Bandwidth Estimation, Data Grid, INCITE
• Base: Network Monitoring, Data Grid, Transport Protocols
Milestones/Dates/Status (columns: Mon/Yr, DONE)
• Infrastructure development
  - develop simple window tuning tool 08/01 08/01
  - initial infrastructure developed 12/01 12/01
  - infrastructure installed at one site 01/02 01/02
  - improve and extend infrastructure 06/02
  - deploy at 2nd site 08/02
  - evaluate GIMI/DMF alternatives 10/02
  - extend deployment to PPDG sites 03/03
• Develop analysis/reporting tools
  - first version for standard apps 02/02
• Integrate new apps & net tools
  - GridFTP and demo 05/05
  - INCITE tools 08/02
  - BW measure tools (e.g. pathload) 01/03
• Compare & validate tools
  - GridFTP 09/02
  - BW tools 04/03
PingER novel ideas
• Low impact network performance measurements to most of the Internet connected world, providing delays, loss and connectivity information over long time periods
• Network AND application high throughput performance measurements allowing comparisons, identification of bottlenecks
• Continuous, robust measurement, analysis and web based reporting of results available world wide
• Simple infrastructure enabling rapid deployment, locating within an application host, and local site management to avoid security issues
PingER: Active End-to-end performance monitoring for the Research and Education communities
Tasks:
  - develop/deploy simple, robust ssh based active end-to-end measurement and management infrastructure
  - develop analysis/reporting tools
  - integrate new application and network measurement tools into the infrastructure
  - compare & validate various tools, and determine regions of applicability
www-iepm.slac.stanford.edu
Date Prepared: 1/7/02
High-Performance Network Research - SciDAC/Base
IEPM-BW Status
• Measuring to about 55 sites (mainly Grid, HENP and major networking sites)
• 10 measuring sites in 5 countries, 5 are in production
• Data and analyzed results are available at http://www.slac.stanford.edu/comp/net/bandwidth-tests/antonia/html/slac_wan_bw_tests.html
• PingER results have been plugged into MDS
• IEPM-BW and PingER data available via web services; we are aligning the naming with GGF NMWG and emerging GGF schemas
• We will incorporate and evaluate different tests (e.g. tsunami, GridFTP, UDPmon, new bandwidth estimators, new quick iperf)
• We are also focusing on making the data useful, working with the Internet2 PiPES project, on long and short term predictions and troubleshooting.