OGSA-based Grid Workload Monitoring

R. Zhang¹, S. Heisig², S. Moyle¹ and S. McKeever¹

¹ Oxford University Computing Laboratory
² IBM T.J. Watson Research Centre

Complicated Systems

• The Open Grid Services Architecture (OGSA) is, in a nutshell: the Grid + Web Services

• While OGSA brings computational power and interoperability, it also inevitably brings dynamics and complexity

Complicated Problems

• For instance, the system has been slow (i.e. SLA violation) in the past hour
– What is causing the problem?
– How can it be fixed and prevented?

• We must find out:
– Grid services (and underlying platforms) touched
– Time spent on services (and underlying platforms)
– End-to-end response time composition

Monitoring: The First Step

• We need to trace work across Grid services from end to end, monitoring workload and reporting data.

• “If you don’t measure it, you can’t control it.”– TQM

• Workload monitoring – the first step towards achieving self-managing and self-optimising systems.

Instrumentation

[Diagram: requests and monitoring points across the eDiaMoND client, Globus client, Tomcat, Globus + Axis, the eDiaMoND Grid service back-end, the OGSA-DAI client, Globus client, Tomcat, Globus + Axis, the OGSA-DAI Grid service back-end and DB2 CM.]

• Monitoring points inserted into common (OGSA-based Grid) middleware.

• Requests given a unique ID and traced through the system.
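The ID-based tracing above can be sketched as follows. This is a minimal illustration, not the authors' instrumentation: the use of `uuid4`, the function names and the in-memory `trace` list are all assumptions, and the monitoring point names are taken from the measurement slide.

```python
import uuid

# Monitoring point names as they appear in the slides.
MONITORING_POINTS = [
    "Client",
    "Tomcat@eD",
    "Axis@eD",
    "Tomcat@Ogsa-Dai",
    "Axis@Ogsa-Dai",
]


def new_request_id() -> str:
    """Create a unique ID at the client (uuid4 is an assumed scheme)."""
    return uuid.uuid4().hex


def monitoring_point(name, request_id, trace):
    """Record that the request with this ID passed through the named point."""
    trace.append((request_id, name))


def handle_request(trace):
    """Carry one ID with the request through every instrumented tier."""
    request_id = new_request_id()
    for name in MONITORING_POINTS:
        monitoring_point(name, request_id, trace)
    return request_id


trace = []
rid = handle_request(trace)
# Filtering the trace by ID recovers the services the request touched.
services_touched = [name for (r, name) in trace if r == rid]
```

Because every record carries the request ID, records from different hosts can be joined afterwards without knowing the call path in advance.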

Measurement

[Diagram: timers Start 0 (Client), Start 1 (Tomcat@eD), Start 2 (Axis@eD), Start 3 (Tomcat@Ogsa-Dai) and Start 4 (Axis@Ogsa-Dai), with matching Stop 0 to Stop 4.]

• Timer at every monitoring point measures local response time.

• Subtracting the local start time from the local stop time gives the elapsed time, so no clock synchronisation is needed.
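A minimal sketch of the measurement idea: each monitoring point reads only its own host's clock, and per-tier time is obtained by subtracting elapsed values. The class and function names are hypothetical, the strict nesting of points 0 to 4 is an assumption drawn from the Start/Stop labels, and the millisecond figures are made up for illustration.

```python
import time


class MonitoringPoint:
    """Timer for one monitoring point; it reads only the local clock."""

    def __init__(self, name, clock=time.monotonic):
        self.name = name
        self._clock = clock
        self._started_at = None

    def start(self):
        self._started_at = self._clock()

    def stop(self):
        # Start and stop are taken on the same host, so the difference is
        # meaningful without synchronising clocks across hosts.
        return self._clock() - self._started_at


def exclusive_times(elapsed):
    """Time spent at each level, excluding the level nested inside it.

    `elapsed` lists the elapsed times of strictly nested monitoring
    points, outermost first.
    """
    exclusive = []
    for level, total in enumerate(elapsed):
        nested = elapsed[level + 1] if level + 1 < len(elapsed) else 0
        exclusive.append(total - nested)
    return exclusive


# Illustrative (made-up) elapsed times in ms for monitoring points 0..4.
elapsed_ms = [120, 100, 80, 50, 30]
breakdown = exclusive_times(elapsed_ms)  # [20, 20, 30, 20, 30]
```

The breakdown sums to the end-to-end figure measured at the client, which is one way to read "end-to-end response time composition".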

Reporting

• Data batched and aggregated at agents to reduce reporting overhead.

• Data reported with the Java Message Service (JMS) to provide reliability and scalability.
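The batching and aggregation step might look like the sketch below. It is an assumption-laden illustration: the `Agent` class, the `publish` callback (standing in for a JMS publisher, which is not modelled here), the batch size and the count/mean summary are all hypothetical choices, not the authors' design.

```python
from collections import defaultdict


class Agent:
    """Aggregates local measurements and reports one summary per batch."""

    def __init__(self, publish, batch_size=100):
        self._publish = publish        # hypothetical stand-in for a JMS publisher
        self._batch_size = batch_size  # assumed tuning parameter
        self._pending = 0
        self._stats = defaultdict(lambda: [0, 0.0])  # service -> [count, total ms]

    def record(self, service, elapsed_ms):
        entry = self._stats[service]
        entry[0] += 1
        entry[1] += elapsed_ms
        self._pending += 1
        if self._pending >= self._batch_size:
            self.flush()

    def flush(self):
        """Send one aggregated message instead of one message per measurement."""
        if self._pending == 0:
            return
        summary = {
            service: {"count": count, "mean_ms": total / count}
            for service, (count, total) in self._stats.items()
        }
        self._publish(summary)
        self._stats.clear()
        self._pending = 0


sent = []
agent = Agent(publish=sent.append, batch_size=3)
agent.record("Tomcat@eD", 10)
agent.record("Tomcat@eD", 20)
agent.record("Axis@eD", 6)  # third measurement triggers a single report
```

With a batch size of 3, three measurements produce one message, which is where the reduction in reporting overhead comes from.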

[Diagram: agents report through a publisher-subscriber framework; the data is stored in DB2.]

Concurrency Issue

• Parallel invocation is common in practice. For example, Grid service A calls B and D in parallel, and then C after B and D return.

• Concurrency is modelled by a response time service Petri-Net (RTSPN), which is constructed automatically from the collected data.
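To see why concurrency matters for response time composition, here is a small sketch of the A, B, D, C example. It is a series-parallel simplification and not the RTSPN itself: the tuple encoding is hypothetical and the millisecond values are made up.

```python
def response_time(node):
    """Response time of a series-parallel workflow.

    A node is ("service", ms), ("seq", [children]) or ("par", [children]).
    Sequential steps add up; parallel branches cost as much as the slowest.
    """
    kind, payload = node
    if kind == "service":
        return payload
    if kind == "seq":
        return sum(response_time(child) for child in payload)
    if kind == "par":
        return max(response_time(child) for child in payload)
    raise ValueError(f"unknown node kind: {kind}")


# A does 10 ms of its own work, calls B (30 ms) and D (50 ms) in parallel,
# then calls C (20 ms) once both have returned. All figures are illustrative.
workflow_a = ("seq", [
    ("service", 10),
    ("par", [("service", 30), ("service", 50)]),
    ("service", 20),
])

total_ms = response_time(workflow_a)  # 10 + max(30, 50) + 20 = 80
```

Simply summing the four service times would give 110 ms, so a model that ignores the parallel branch would overstate A's response time.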

[Diagram: RTSPN for the example, in which A invokes B and D in parallel and then C. Legend: service front, service rear.]

Experiment in eDiaMoND Setting

Monitoring Data in DB

Visualisation Screen Shot

Conclusions

• We have developed a monitoring infrastructure for OGSA-based Grids that:
– discovers services touched;
– monitors workload in an end-to-end manner;
– captures concurrency in workload;
– provides automated visualisation;
– is portable (thanks to OGSA), scalable and lightweight (5 ms per request per service).

Future Work

• The current infrastructure has enabled research on:
– Performance problem determination;
– End-to-end performance tuning/service differentiation;
– Real eDiaMoND workload data collection;
– Instrumentation with finer granularity.

We are grateful to

• DTI for project grant

• IBM for software/research support

• eDiaMoND for experiment environment

• all of you for coming along

• Questions?

RTSPN Construction

• Automatic construction from data

• Each service receives ID of the service invoking it.

• Each service receives IDs from services it depends on:
– workflow description
– temporal relation
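One way to picture construction from data is sketched below: caller IDs give the invocation structure, and the temporal relation between sibling calls separates parallel from sequential invocations. The record layout, function names and timestamps are hypothetical, and the timestamps are assumed to be taken on the caller's own clock. This does not build an actual Petri net.

```python
from collections import defaultdict


def build_call_tree(records):
    """Group invocation records under the service that invoked them."""
    children = defaultdict(list)
    for rec in records:
        if rec["caller"] is not None:
            children[rec["caller"]].append(rec)
    for calls in children.values():
        calls.sort(key=lambda r: r["start"])  # stable sort keeps input order on ties
    return children


def parallel_groups(calls):
    """Split sibling calls (sorted by start) into groups that overlap in time.

    Calls in the same group ran in parallel; successive groups ran in sequence.
    """
    groups = []
    group_end = None
    for rec in calls:
        if groups and rec["start"] < group_end:
            groups[-1].append(rec["service"])
            group_end = max(group_end, rec["stop"])
        else:
            groups.append([rec["service"]])
            group_end = rec["stop"]
    return groups


# Made-up records for the example in which A calls B and D in parallel, then C.
records = [
    {"service": "A", "caller": None, "start": 0, "stop": 100},
    {"service": "B", "caller": "A", "start": 10, "stop": 40},
    {"service": "D", "caller": "A", "start": 10, "stop": 60},
    {"service": "C", "caller": "A", "start": 60, "stop": 80},
]

tree = build_call_tree(records)
structure = parallel_groups(tree["A"])  # [["B", "D"], ["C"]]
```

The resulting grouping matches the workflow in the concurrency example: B and D form one parallel stage, followed by C.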
