OGSA-based Grid Workload Monitoring R. Zhang 1 , S. Heisig 2 , S. Moyle 1 and S. McKeever 1 1 Oxford University Computing Laboratory 2 IBM T.J. Watson Research Centre

Upload: oliver-norman

Post on 03-Jan-2016


Page 1:

OGSA-based Grid Workload Monitoring

R. Zhang 1, S. Heisig 2, S. Moyle 1 and S. McKeever 1

1 Oxford University Computing Laboratory
2 IBM T.J. Watson Research Centre

Page 2:

Complicated Systems

• The Open Grid Services Architecture (OGSA) is, in a nutshell, the Grid + Web Services.

• While OGSA brings computational power and interoperability, it also inevitably brings dynamics and complexity.

Page 3:

Complicated Problems

• For instance, the system has been slow (i.e. SLA violation) in the past hour:
– What is causing the problem?
– How can it be fixed and prevented?

• We must find out:
– Grid services (and underlying platforms) touched
– Time spent on services (and underlying platforms)
– End-to-end response time composition

Page 4:

Monitoring: The First Step

• We need to trace work across Grid services from end to end, monitoring workload and reporting data.

• “If you don’t measure it, you can’t control it.” – TQM

• Workload monitoring is the first step towards achieving a self-managing and self-optimising system.

Page 5:

Instrumentation

[Figure: instrumentation diagram. Labels: eDiaMoND Client, Globus Client, Tomcat, Globus + Axis, eDiaMoND Grid Service Back-end, OGSA-DAI Client, Globus Client, Tomcat, Globus + Axis, OGSA-DAI Grid Service Back-end, DB2 CM. Requests and Monitoring Points are marked on the diagram.]

• Monitoring points inserted into common (OGSA-based Grid) middleware.

• Requests given a unique ID and traced through the system.
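
The idea of tagging a request with a unique ID and timing it at each monitoring point can be sketched as follows. This is a minimal illustration, not the authors' instrumentation: the names `RECORDS`, `new_request_id`, `monitoring_point` and `handle_request` are hypothetical, and the in-memory list stands in for whatever local store the real monitoring points use.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sink; the slides do not say how records are held locally.
RECORDS = []

def new_request_id():
    """Assign a unique ID when a request first enters the system."""
    return uuid.uuid4().hex

@contextmanager
def monitoring_point(request_id, name):
    """Time one monitoring point and record (request_id, name, elapsed seconds)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        RECORDS.append((request_id, name, time.perf_counter() - start))

def handle_request():
    """Pass one request through two nested (assumed) middleware layers."""
    rid = new_request_id()
    with monitoring_point(rid, "Tomcat"):
        with monitoring_point(rid, "Axis"):
            time.sleep(0.01)  # stand-in for the real service work
    return rid
```

Because every record carries the same request ID, records from different layers can later be joined to reconstruct the path of a single request.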

Page 6:

Measurement

[Figure: timeline of monitoring points. Start 0 (Client), Start 1 (Tomcat@eD), Start 2 (Axis@eD), Start 3 (Tomcat@Ogsa-Dai), Start 4 (Axis@Ogsa-Dai), with matching Stop 0 to Stop 4.]

• A timer at every monitoring point measures local response time.

• Subtraction gives elapsed time (no clock synchronisation needed).
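
One way to read the subtraction step is sketched below, assuming the five monitoring points nest strictly inside one another and that each reports a duration taken from its own local timer. The figures in `local_ms` are made up for illustration, and `exclusive_times` is a hypothetical helper, not part of the described infrastructure.

```python
# Hypothetical local response times in milliseconds, one per monitoring point,
# each measured by that point's own timer (Stop i minus Start i).
local_ms = {
    "Client": 120.0,
    "Tomcat@eD": 100.0,
    "Axis@eD": 90.0,
    "Tomcat@Ogsa-Dai": 40.0,
    "Axis@Ogsa-Dai": 30.0,
}

def exclusive_times(durations):
    """Subtract each nested duration from the one that encloses it.

    `durations` must be ordered from the outermost to the innermost point.
    """
    names = list(durations)
    result = {}
    for outer, inner in zip(names, names[1:]):
        result[outer] = durations[outer] - durations[inner]
    result[names[-1]] = durations[names[-1]]  # innermost point has nothing nested
    return result

breakdown = exclusive_times(local_ms)
```

Only durations are compared, never absolute timestamps from different machines, which is why no clock synchronisation is required. Under this reading, the difference between two points on different hosts also includes the network time between them.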

Page 7:

Reporting

• Data batched and aggregated at agents to reduce reporting overhead.

• Data reported with the Java Message Service (JMS) to provide reliability and scalability.

[Figure: reporting architecture. Labels: Agent (several), Publisher-Subscriber Framework, DB2.]
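
Batching and aggregation at an agent might look like the sketch below. It is an assumption-laden illustration: the `Agent` class, its `batch_size` parameter and the per-service count/mean summary are hypothetical, and the `publish` callable merely stands in for a JMS publisher.

```python
from collections import defaultdict

class Agent:
    """Hypothetical agent: aggregates samples per service, publishes in batches."""

    def __init__(self, publish, batch_size=100):
        self.publish = publish        # callable standing in for a JMS publisher
        self.batch_size = batch_size  # samples to collect before reporting
        self.pending = 0
        self.stats = defaultdict(lambda: [0, 0.0])  # service -> [count, total_ms]

    def record(self, service, elapsed_ms):
        """Fold one sample into the running totals; report when the batch is full."""
        entry = self.stats[service]
        entry[0] += 1
        entry[1] += elapsed_ms
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        """Publish one aggregated message and reset the batch."""
        if not self.pending:
            return
        summary = {
            service: {"count": count, "mean_ms": total / count}
            for service, (count, total) in self.stats.items()
        }
        self.publish(summary)
        self.stats.clear()
        self.pending = 0
```

With a batch size of 100, one message replaces 100 individual reports, which is the kind of overhead reduction the slide refers to.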

Page 8:

Concurrency Issue

• Parallel invocation is common in practice. For example, Grid service A calls B and D in parallel, and then C after B and D return.

• Concurrency is modelled by a response time service Petri net (RTSPN), which is constructed automatically from the data collected.

[Figure: RTSPN for the example, with services A, B, D and C. Legend: service front, service rear.]
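
To see why concurrency matters for response time composition, the example above can be worked through with made-up numbers. This is a simplified critical-path calculation, not an RTSPN: the helpers `sequence` and `parallel` and all timings are hypothetical.

```python
def sequence(*parts):
    """Steps that run one after another add up."""
    return sum(parts)

def parallel(*branches):
    """Branches that run concurrently cost as much as the slowest one."""
    return max(branches)

# Hypothetical response times in milliseconds.
a_local, b, d, c = 5.0, 40.0, 25.0, 30.0

# A does its own work, calls B and D in parallel, then calls C.
a_response = sequence(a_local, parallel(b, d), c)
```

Summing all four figures would give 100 ms, whereas the parallel composition gives 75 ms, because D finishes inside the time already spent waiting for B.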

Page 9:

Experiment in eDiamond Setting

Page 10:

Monitoring Data in DB

Page 11:

Visualisation Screen Shot

Page 12:

Conclusions

• We have developed a monitoring infrastructure for OGSA-based Grids that:
– discovers services touched;
– monitors workload in an end-to-end manner;
– captures concurrency in workload;
– provides automated visualisation;
– is portable (thanks to OGSA), scalable and lightweight (5 ms/req,service).

Page 13:

Future Work

• The current infrastructure has enabled research on:
– Performance problem determination;
– End-to-end performance tuning/service differentiation;
– Real eDiamond workload data collection;
– Instrumentation with finer granularity.

Page 14:

We are grateful to

• DTI for project grant

• IBM for software/research support

• eDiaMoND for experiment environment

• all of you for coming along

• Questions?

Page 15:

RTSPN Construction

• Automatic construction from data

• Each service receives ID of the service invoking it.

• Each service receives IDs from services it depends on:
– workflow description
– temporal relation
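
The sketch below shows one way caller IDs and temporal relations could be turned into a call structure. It is not the authors' RTSPN construction algorithm: the record layout, the `build_stages` helper and the rule "overlapping sibling calls are parallel" are all assumptions, and the start/stop values are assumed to be taken on the caller's clock so that sibling calls are comparable.

```python
from collections import defaultdict

# Hypothetical records: (service, caller, start, stop). Times are assumed to be
# observed on the caller's clock, so calls made by the same caller can be compared.
records = [
    ("A", None, 0, 100),
    ("B", "A", 5, 45),
    ("D", "A", 5, 30),
    ("C", "A", 50, 80),
]

def build_stages(records):
    """Group each caller's invocations into sequential stages of parallel calls."""
    by_caller = defaultdict(list)
    for service, caller, start, stop in records:
        if caller is not None:
            by_caller[caller].append((start, stop, service))

    stages = {}
    for caller, calls in by_caller.items():
        calls.sort()
        groups, group_end = [], None
        for start, stop, service in calls:
            if group_end is not None and start < group_end:
                groups[-1].append(service)   # overlaps the current stage: parallel
                group_end = max(group_end, stop)
            else:
                groups.append([service])     # starts after the stage ended: sequential
                group_end = stop
        stages[caller] = groups
    return stages

stages = build_stages(records)
```

For the example records, A's calls fall into two stages: B and D together, followed by C, which matches the parallel invocation example given earlier in the talk.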