OGSA-based Grid Workload Monitoring
R. Zhang¹, S. Heisig², S. Moyle¹ and S. McKeever¹
¹ Oxford University Computing Laboratory
² IBM T.J. Watson Research Centre
Complicated Systems
• The Open Grid Services Architecture (OGSA) is, in a nutshell, the Grid + Web Services.
• While OGSA brings computational power and interoperability, it also inevitably introduces dynamics and complexity.
Complicated Problems
• For instance, the system has been slow (i.e. SLA violation) in the past hour:
– What is causing the problem?
– How can it be fixed and prevented?
• We must find out:
– Grid services (and underlying platforms) touched
– Time spent on services (and underlying platforms)
– End-to-end response time composition
Monitoring: The First Step
• We need to trace work across Grid services from end to end, monitoring workload and reporting data.
• “If you don’t measure it, you can’t control it.” (TQM)
• Workload monitoring is the first step towards achieving a self-managing and self-optimising system.
Instrumentation
[Figure: request path with monitoring points. Labelled components: eDiaMoND client; Tomcat, Globus + Axis and the eDiaMoND Grid service back-end; Globus client and OGSA-DAI client; Tomcat, Globus + Axis and the OGSA-DAI Grid service back-end; DB2 CM.]
• Monitoring points inserted into common (OGSA-based Grid) middleware.
• Requests given a unique ID and traced through the system.
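The ID-based tracing described above can be illustrated with a minimal Python sketch. It is not the authors' middleware instrumentation: the `Trace` class, `new_request_id` and `call_through` are hypothetical names, the UUID scheme is an assumption, and the monitoring point names are simply those labelled in the slides.

```python
import uuid


class Trace:
    """Collects (request_id, monitoring_point, event) tuples in memory."""

    def __init__(self):
        self.events = []

    def record(self, request_id, point, event):
        self.events.append((request_id, point, event))


def new_request_id():
    # Assumed ID scheme: a random UUID assigned once, at the client.
    return uuid.uuid4().hex


def call_through(points, request_id, trace):
    """Pass one request through nested monitoring points, outermost first."""
    if not points:
        return
    point = points[0]
    trace.record(request_id, point, "start")
    # The same ID travels with the request into the next layer.
    call_through(points[1:], request_id, trace)
    trace.record(request_id, point, "stop")


POINTS = ["Client", "Tomcat@eD", "Axis@eD", "Tomcat@Ogsa-Dai", "Axis@Ogsa-Dai"]
trace = Trace()
rid = new_request_id()
call_through(POINTS, rid, trace)

# Services touched by this request, recovered purely from the shared ID.
touched = [p for (r, p, e) in trace.events if r == rid and e == "start"]
print(touched)
```

Because every event carries the request ID, the set of services touched can be recovered afterwards by filtering on that ID alone.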
Measurement
[Figure: nested monitoring points on one request: Start 0 (Client), Start 1 (Tomcat@eD), Start 2 (Axis@eD), Start 3 (Tomcat@Ogsa-Dai), Start 4 (Axis@Ogsa-Dai), with matching Stop 0 to Stop 4.]
• Timer at every monitoring point measures local response time.
• Subtracting the start time from the stop time at the same point gives the elapsed time, so no clock synchronisation is needed.
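A small sketch of this arithmetic, under the assumption that the monitoring points are strictly nested (each point wraps exactly one inner point). The helper names `timed` and `time_composition` and the sample figures are made up for illustration, not measured values.

```python
import time


def timed(fn, *args):
    """Run fn and return (result, elapsed seconds) using only the local clock."""
    start = time.monotonic()
    result = fn(*args)
    stop = time.monotonic()
    return result, stop - start


def time_composition(elapsed):
    """Split nested elapsed times into the time spent at each point alone.

    elapsed[i] is the stop-minus-start time at nested point i (0 = outermost).
    Each point's own share is its elapsed time minus that of the point it wraps.
    """
    own = []
    for i, total in enumerate(elapsed):
        inner = elapsed[i + 1] if i + 1 < len(elapsed) else 0.0
        own.append(total - inner)
    return own


# Made-up elapsed times in milliseconds for monitoring points 0..4.
elapsed_ms = [120.0, 100.0, 90.0, 40.0, 30.0]
own_ms = time_composition(elapsed_ms)
print(own_ms)
```

Only differences of timestamps taken on the same host are ever used, which is why the hosts' clocks do not need to agree.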
Reporting
• Data batched and aggregated at agents to reduce reporting overhead.
• Data reported with Java Message Service (JMS) to provide reliability and scalability.
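As a rough illustration of batching and aggregation at an agent, the sketch below sums measurements per service and publishes one summary per batch. The `Agent` class, its `batch_size` parameter and the summary layout are assumptions; the `publish` callable is only a stand-in for a JMS publisher, not a real JMS API.

```python
from collections import defaultdict


class Agent:
    """Aggregates per-service measurements and publishes them in batches."""

    def __init__(self, publish, batch_size=3):
        self.publish = publish  # stand-in for a JMS publisher (assumption)
        self.batch_size = batch_size
        self.pending = 0
        self.stats = defaultdict(lambda: [0, 0.0])  # service -> [count, total ms]

    def report(self, service, elapsed_ms):
        entry = self.stats[service]
        entry[0] += 1
        entry[1] += elapsed_ms
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        """Publish one aggregated message for everything seen since last flush."""
        if self.pending == 0:
            return
        summary = {s: {"count": c, "total_ms": t} for s, (c, t) in self.stats.items()}
        self.publish(summary)
        self.stats.clear()
        self.pending = 0


sent = []
agent = Agent(sent.append, batch_size=3)
for service, ms in [("A", 10.0), ("B", 4.0), ("A", 6.0), ("B", 5.0)]:
    agent.report(service, ms)
agent.flush()  # push out the final partial batch
print(sent)
```

Four measurements result in two published messages here, which is the overhead reduction the batching is meant to achieve.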
Publisher-Subscriber Framework
[Figure: publisher-subscriber framework, with multiple agents reporting data that is stored in DB2.]
Concurrency Issue
• Parallel invocation is common in practice. For example, Grid service A calls B and D in parallel, and then calls C after B and D return.
• Concurrency is modelled by a response time service Petri-Net (RTSPN), which is constructed automatically from the collected data.
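The slides do not give the formal definition of an RTSPN, so the following is only a sketch of the response-time arithmetic that parallel invocation implies: a parallel stage costs as much as its slowest member, and stages add up in sequence. The function `end_to_end`, the stage layout and all timings are hypothetical.

```python
def end_to_end(local, stages, child_times):
    """Response time of a caller that runs `stages` one after another.

    local       -- the caller's own processing time
    stages      -- list of groups; services within a group run in parallel
    child_times -- response time of each called service
    """
    # A parallel group finishes when its slowest member returns.
    return local + sum(max(child_times[s] for s in group) for group in stages)


# A calls B and D in parallel, then C (made-up times in milliseconds).
times = {"B": 30.0, "D": 50.0, "C": 20.0}
total = end_to_end(5.0, [["B", "D"], ["C"]], times)
print(total)
```

Simply summing the three child times would give 100 ms plus A's own time, which overstates the response time whenever calls overlap; this is why concurrency has to be captured explicitly.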
[Figure: RTSPN for the example, with a service front and a service rear for each of A, B, D and C. Legend: service front, service rear.]
Experiment in eDiamond Setting
Monitoring Data in DB
Visualisation Screen Shot
Conclusions
• We have developed a monitoring infrastructure for OGSA-based Grids that:
– discovers services touched;
– monitors workload in an end-to-end manner;
– captures concurrency in workload;
– provides automated visualisation;
– is portable (thanks to OGSA), scalable and lightweight (5 ms per request per service).
Future Work
• The current infrastructure has enabled research on:
– Performance problem determination;
– End-to-end performance tuning/service differentiation;
– Real eDiamond workload data collection;
– Instrumentation with finer granularity.
We are grateful to
• DTI for project grant
• IBM for software/research support
• eDiaMoND for experiment environment
• all of you for coming along
• Questions?
RTSPN Construction
• Automatic construction from data
• Each service receives ID of the service invoking it.
• Each service receives IDs from services it depends on:
– workflow description
– temporal relation
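One way the temporal relation could be used is sketched below: the services invoked by one caller are grouped into sequential stages, and invocations whose time intervals overlap are placed in the same parallel stage. This is a heuristic for illustration, not the authors' construction algorithm. The record layout, the function `build_stages` and the timestamps are assumptions, and the timestamps are assumed to be taken on the caller's clock so that they can be compared without clock synchronisation.

```python
def build_stages(records, parent):
    """Group the services invoked by `parent` into sequential stages.

    Invocations whose [start, stop] intervals overlap are treated as
    parallel and share a stage; otherwise a new stage begins.
    """
    children = sorted(
        (r for r in records if r["parent"] == parent),
        key=lambda r: r["start"],
    )
    stages = []
    stage_end = None
    for r in children:
        if stages and r["start"] < stage_end:
            stages[-1].append(r["service"])  # overlaps the current stage
            stage_end = max(stage_end, r["stop"])
        else:
            stages.append([r["service"]])  # starts after the stage finished
            stage_end = r["stop"]
    return stages


# Made-up records: each carries the ID of the invoking service.
records = [
    {"service": "A", "parent": None, "start": 0, "stop": 75},
    {"service": "B", "parent": "A", "start": 2, "stop": 32},
    {"service": "D", "parent": "A", "start": 2, "stop": 52},
    {"service": "C", "parent": "A", "start": 53, "stop": 73},
]
stages = build_stages(records, "A")
print(stages)
```

Where a workflow description is available, it would presumably take precedence over this overlap heuristic, since two calls can overlap by coincidence or run back to back even though the workflow permits parallelism.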