contributions to the modelisation and optimisation of large scale distributed computing cécile...
Post on 18-Dec-2015
214 views
TRANSCRIPT
Contributions to the modelisation and optimisation of large scale distributed computing
Cécile Germain-RenaudLRI and LAL
http://www.lri.fr/~cecile/RAPH
Habilitation à diriger des recherches
HDR 09/07/2005 2
Summary
• Introduction
• A protocol and a model for Global Computing
• Fault tolerant message-passing
• Grid result-checking
• Grid-enabling medical image analysis
• Perspectives
HDR 09/07/2005 3
Old ideas
« Programmers at computer A have a blurred photo which they want to put into focus. Their program transmits the photo to computer B, which specializes in computer graphics (…). If B requires specialized computer assistance, it may call on computer C for help »
HDR 09/07/2005 4
High performance computing systemsMassively
parallelMassively Distributed
•Homogeneous hard/software•Internal network•Static
•Heterogenous hard/software•Internet•Autonomic management
Blue Gen
e L
Tera Grid
-DEISA
Desktop
grids
SETI@ho
me
OSGEGEE
Tera co
mputer
Cluster
s
HDR 09/07/2005 5
High performance computing systems: new issuesMassively
parallelMassively distributed
•Fault-free•CPU-centric
•Monolithic applications•Performance is speedup
•Exclusive•Application-level scheduling
•«Simple » – N1/2, LogP
•Faults are normal events•Data-centric
•EP or moldable applications•Performance is throughput
•Time-shared•Middleware scheduling
•Very complex
Blue Gen
e L
Tera Grid
-DEISA
Desktop
grids
SETI@ho
me
OSGEGEE
Tera co
mputer
Cluster
s
HDR 09/07/2005 6
Summary
• Introduction
• A protocol and a model for volatile computing
• Grid result-checking
• Fault tolerant message-passing
• Grid-enabling medical image analysis
• Perspectives
HDR 09/07/2005 7
XtremWeb architecture
• Explores the Global Computing modality of grids• A large scale distributed system
– Computing resources (collaborators) are• volatile: come and go unexpectedly• « at the edge of the internet »: low-level and anonymous
– Dedicated to high throughput (like Condor)– vs applets systems• Consequences
a RISgC: Reduced Infrastructure Software for Grid Computing– Not master-slave, but a pull model: the collaborators decide when and
what– A soft-state dispatcher/collaborator protocol: the collaborator state
expires if not refreshed– Deployed at Paris-Sud University for Auger
Dispatcher
WorkerWorkerWorkerCollaborator
Request
ReplyKeepalive
HDR 09/07/2005 8
A performance model
• Applications are not so much moldable – unit duration
• Soft-state introduces overhead – keepalive period
• The fault process on one site is largely unpredictable [Dinda 99]
• But for each task, the fault process is memoryless if the successive execution sites are uncorrelated [Libel et al 02]: Poisson process where is the fault rate
ee
T1
11
1
1
e
T
The system cannot be tuned even for infinite resource
HDR 09/07/2005 9
The Auger Observatory
• Detection of ultra-high energy cosmic rays
HDR 09/07/2005 10
The Auger Observatory
• Detection of ultra-high energy cosmic rays
• UHCR create particle showers
HDR 09/07/2005 11
The Auger Observatory
• Detection of ultra-high energy cosmic rays
• UHCR create particle showers
• Indirect observation– Ground detectors– Fluorescence telescope
• In silico experiments– Shower simulations – CCIN2P3 - XtremWeb
200 physicists, 55 institutions, 15 countries, 3000km2, 1600 tanks, 30 years life time…
HDR 09/07/2005 12
Publications
–F. Cappello, A. Djilali, G. Fedak, C. Germain, O. Lodygensky et V. Neri. Calcul réparti a grande échelle, chapitre XtremWeb : une plateforme de recherche sur le calcul global et pair a pair, pages 153-186 , Lavoisier, 2002. –G. Fedak, C. Germain, V. Neri and F. Cappello. XtremWeb: A generic global computing platform. In IEEE/ACM CCGRID'2001, pages 582-587, IEEE Press, 2001. –C. Germain, G. Fedak, V. Neri and F. Cappello. Global Computing Systems. In 3rd Int. Conf. on Large Scale Scientific Computations, LNCS 2179, pages 218-227, Sozopol, 2001. Springer-Verlag. –C. Germain, V. Néri, G. Fedak and F. Cappello. Xtremweb : Building an experimental platform for global computing. In Proc. 1st IEEE/ACM Intl. Workshop Grid 2000. Springer, 2000.
HDR 09/07/2005 13
Summary
• Introduction
• A protocol and a model for Global Computing
• Grid result-checking
• Fault tolerant message-passing
• Grid-enabling medical imaging
• Perspectives - Towards a grid observatory
HDR 09/07/2005 14
Why do we need to check grid computations more carefully ?
• Attacks are likely on Global Computing systems– Low control over collaborators – Real-world issue: some happened in all deployed GC systems
even with a unique binary application • SETI - wrong FFT, Decrypthon I had 5% errors• All double- or triple- checked their results
• Might happen also in grid systems– Submission tools and grid workflow management are in early
stage– Code version management
HDR 09/07/2005 15
Contexts & Related Work
Hardware support
TCPA/Palladium
Code
encryption
Result-checking: Blum
Property testing: Goldreich
Prevention
Detection
Check that a property holds for an object
Program output: Sorted array
Graph properties: random graph sampleEn masse
checking [Sarmenta FGCS 2002]
HDR 09/07/2005 16
Result checking and Grid Applications
• Typical grid use cases: no independent property to check
• Monte-Carlo simulationsRange of parameter values x internal randomizationLocal interactions: the specification is the program Unknown shape -> non parametric Fault tolerance through robust statistics
• Search for rare events: SETI Not fault tolerance
HDR 09/07/2005 17
Example: shower simulations
Shower Simulation
Fe, 1E20, Fe, 1E20, Pr 1.5E20, ’’• Only the input parameters can
be falsified• But extracting the input
parameters from the data is just the problem!
Detector Simulation
Reconstruction
HDR 09/07/2005 18
En masse result-checking
• Goals – Minimize the checking overhead through adaptive tests
The most likely situations are• Normal: the majority of collaborators are OK• Massive attack: the majority of collaborators cheat (or err)
– Robust to denial of service attacks: the system is unable to assess the quality of its production
– Efficient for anonymous execution: private network, IP spoofing
• Results– Generic 2-phase test based on Wald’s sequential test– Improvement for the Auger Showers: pre-qualification of showers through
empirical detection of outliers
HDR 09/07/2005 19
Overview
Sample Qualification
Batch segmentation Sample selection
OracleRe-execution
HDR 09/07/2005 20
Publications
• C. Germain and D. Monnier-Ragaigne. Grid Result Chechking. In Procs. 2nd Computing Frontiers, Ischia, Mai 2005. ACM Press.
• C. Germain and N. Playez. Result-Checking in Global Computing Systems. In Procs.17h ACM Int.Conf. on Supercomputing, pages 226-233, San Francisco, June 2003. ACM Press.
HDR 09/07/2005 21
Summary
• Introduction
• A protocol and a model for Global Computing
• Grid result-checking
• Fault tolerant message-passing
• Grid-enabling medical imaging
• Perspectives - Towards a grid observatory
HDR 09/07/2005 22
Grids create new contexts for message passing
• Message passing environments and especially MPI are the standard for parallel computing, but have been designed with MPP in mind: fault free + internal network
• New use cases for message passing– Global Computing
• Loosely coupled computations, very frequent faults
– Institutional Grids• Coupled computations, moderately frequent faults
– Very large clusters• Tightly coupled computations, unfrequent faults• Or frequent: time-sharing cf the Connection Machine
Fault tolerance
Tunneling
Fault-free performance
FT-MPI
MPI-FT
FT/MPI
…
HDR 09/07/2005 23
Contributions to MPICH-V
User Application
CommunicationVirtualization
DispatchProcess
Virtualization
Communication library Checkpoint/Restart library
Condor libckpt.aTCP Sockets
HDR 09/07/2005 24
Pessimistic message logging on Channel Memories • Decoupled communication
– Dedicated reliable nodes support Channel Memory servers
– CMs log all messages in FIFO order– Conceptually transactional put/get– A restarted process transparently replays all
communications
• Consistent execution based on partial restarts
• Adaptive to heterogeneous fault behaviour: independent scheduling of process checkpoints
• Tunneling as a byproduct
put
get
CM
MPI process
MPI process
HDR 09/07/2005 25
The MPICH-CM library
ADI
ChannelInterface
CM deviceInterface
MPI_Send
MPID_SendControlMPID_SendChannel
_cmbsend
_cmbsend
_cmbrecv
_cmprobe
_cmfrom
_cmInit
_cmFinalize
- get the src of the last message
- check for any message avail.
- blocking send
- blocking receive
- initialize the client
- finalize the client
ChameleonInterfacePIbsend
Blocking TCP
Read WriteControl + data messages
HDR 09/07/2005 26
Communication overhead
put
get
CM
MPI process
MPI process
x2
Bounded by
node bandwidth
HDR 09/07/2005 27
Towards a hybrid approach
• Coupled applications have – A hierachical structure– Differentiated requirements
• Asynchronous iterations [Bertsekas] also apply to faults
• A reliable tunneling infrastructure is required anyway
• Requires re-coding even of the innermost loop for non-trivial applications eg multi-grid
P0
P1
P2
P3
WANWAN
Self-stabilizing
Self-stabilizing
Fault-tolerant
HDR 09/07/2005 28
Publications
• A.Selikhov and C.Germain. A channel memory based environment for MPI applications. Future Generation Computing Systems, 21(5):709-715, 2004.
• A. Selikhov, G. Bosilca, C. Germain, G. Fedak and F. Cappello. MPICH-CM: a Communication Library Design for a P2P MPI Implementation. In 9th Euro PVM/MPI Conf., LNCS 2474, pages 323-330, Vienna, Oct. 2002. Springer-Verlag.
• G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Hérault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri and A. Selikhov. MPICH-V : Toward a Scalable Fault Tolerant MPI for Volatile Nodes. In IEEE/ACM Int. Conf. for High Performance Computing and Communications 2002 (SC'02 - SuperComputing'02), Baltimore, 2002
HDR 09/07/2005 29
Summary
• Introduction
• A protocol and a model for Global Computing
• Grid result-checking
• Fault tolerant message-passing
• Grid-enabling medical image analysis
• Perspectives
HDR 09/07/2005 30
Medical image analysis exemplifies the need for…
Id Owner Submitted ST PRI Class Running Onf01n01.10873.0 qzha 5/19 07:34 R 50 fewcpu f11n07 f01n03.6292.0 agma 5/22 14:50 R 50 standard f12n02 f01n03.6293.0 publ 5/22 16:16 R 50 standard f03n09 f01n03.6304.0 agma 5/22 22:46 R 50 standard f11n05 f01n03.6309.0 agma 5/23 12:41 R 50 standard f01n11 f01n01.10914.0 ying 5/23 14:17 R 50 fewcpu f06n03f01n02.4596.0 dpan 5/23 15:33 I 50 standardf01n03.6310.0 divi 5/23 16:03 I 50 standard • Seamless integration of grid resources with local tools: analysis, graphics,…
• Unplanned access to high-end computing power and data• Interactivity • But convergence with many other areas
HDR 09/07/2005 31
gPTM3D
• grid-enabling the PTM3D software• PTM3D: poste de travail médical 3D
– A.Osorio & team – LIMSI Clinical use: « cum laude » RSNA 2004– Complex interface: optimized graphics and medically-oriented interactions– Expert interaction is required at and inside all steps– But 3D medical data may be very large – 1GB and computations too
Interaction
RenderExplore Analyse InterpretAcquire
HDR 09/07/2005 32
gPTM3D
• grid-enabling the PTM3D software on a production grid• PTM3D: poste de travail médical 3D
– A.Osorio & team – LIMSI Clinical use: « cum laude » RSNA 2004– Complex interface: optimized graphics and medically-oriented interactions– Expert interaction is required at and inside all steps– But 3D medical data may be very large – 1GB and computations too
Interaction
RenderExplore Analyse InterpretAcquire
HDR 09/07/2005 33
EGEE Computing Resources – April 2004
From The project status slides 1st EGEE review
Country providing resourcesCountry anticipating joining EGEE/LCG
In EGEE-0 (LCG-2): > 130 sites > 14,000 CPUs > 5 PB storage
70 leading institutions 27 countries, federated in regional Grids
~32 M Euros EU funding for first 2 years
HDR 09/07/2005 34
Interactive volume reconstruction
• gPTM3D first results– Optimal response time for volume reconstruction on EGEE– With unmodified interaction scheme
• Demonstrated at the first EGEE review
HDR 09/07/2005 35
Figures for volume reconstruction
Small body
Medium body
Large body
Lungs
Dataset
87MB
210MB
346MB
87MB
Input data
3MB18KB/slice
9.6 MB25KB/slice
15MB22KB/sclice
410KB4KB/slice
Output data
6MB106KB/slice
57MB151KB/slice
86MB131KB/slice
2.3MB24KB/slice
Tasks
169
378
676
95
StandaloneExecution
5mn15s1mn54s
33mn11mn5s
18mn
36s
EGEE
.
37s18s
2mn30s1mn15s
2mn03
24s
HDR 09/07/2005 36
Interactive jobs on a grid: a scheduling problem• Short Deadline Job
– A moldable application – individual tasks are very fine-grained– Soft deadline– No reservation: should be executed immediately or rejected
• Sharing contract– Bounded slowdown for regular jobs – Do not degrade resource utilization – No stong preemption– Fair share across SDJ
• Contexts– (multi) Processor soft real-time scheduling– Network routing Differentiated Services
HDR 09/07/2005 37
Scheduling SDJ
User Interface
Broker
User Interface
Broker
Clus
ter
Sche
dule
r
JSSCE
Node Permanent reservationon virtual processorsTransparent when unused
Interaction Bridge
User Interface
Job submissionProxy Tunneling
Interaction Bridge
Matchmaking
HDR 09/07/2005 38
Scheduling SDJ
User Interface
Broker
User Interface
Broker
Clus
ter
Sche
dule
r
JSSCE
Node Permanent reservationon virtual processorsTransparent when unused
Interaction Bridge
Interaction Bridge
User Interface
Task prioritizationTP
Matchmaking
HDR 09/07/2005 39
Scheduling tasks
Node
Interaction Bridge
TP
•Coping with the submission penalty
– N tasks each with small latency T•Potential completion bandwidth T-1
•Impaired by the submission protocol
–A case for application-level scheduling
SchedulingAgent
Workeragent
HDR 09/07/2005 40
Publications
• C. Germain, R. Texier and A. Osorio. Interactive Volume Reconstruction and Measurement on the Grid. Methods of Information in Medecine, 44(2) 227-232, 2005.
• C. Germain, R. Texier and A. Osorio. Interactive Exploration of Medical Images on the Grid. In Procs. 2nd european HealthGrid Conference, Clermont-Ferrand, Jan. 2004.
• C. Germain, A. Osorio and R. Texier. A Case Study in Medical Imaging and the Grid. In S. Norager, editor, Procs. 1st European HealthGrid Conference, pages 110-118, Lyon, Jan. 2003. EC-IST.
• D. Berry, C. Germain-Renaud, D. Hill, S. Pieper and J. Saltz. Report on the Workshop IMAGE'03: Images, medical analysis and grid environments. TR UKeS-2004-02, UK National e-Science Centre, , Feb. 2004.
HDR 09/07/2005 41
AGIR: Analyse Globalisée des Données d’Imagerie Radiologique
• A multidisciplinary research network funded by ACI Masses de Données
• Advances in medical imaging algorithms and their use – Image processing: raw computing/data power– Sharing data and algorithms: evaluation is a
major issue– From algorithmic research to clinical practice
• Identify and explore new services and mechanisms required by medical imaging
HDR 09/07/2005 42
PartnersAlGorilleCRANCompression
LPC EGEE VO BiomedCHRU Clermont Collaborative medecine
CREATISSegmentation 4D
Rainbow Software componentsEpidaureMedical ImagingCentre Antoine Lacassagne
LRI – coll LALLIMSI - St Anne Tenon FMPInteraction & Grids
14 CS6 physicians6 Phd4 engineers
Collaborations
EGEE
Grid5000
CNRS-STIC
CNRS-IN2P3
INRIA
INSERM
Hospitals
HDR 09/07/2005 43
A cross-section of AGIR
QVAZM3D Compression Partially reliable
transport protocol
Gold standard
Consensus
Bronze standard
Automatic
Nodules CAD
Registration
Algorithms
Evaluation
SPIHT Compression
Network emulation
Evaluation
ADOC
gPTM3D Volume
Reconstruction
PTM3D
calibration
Evaluation
Grid-enabled
Workflow
HDR 09/07/2005 44
More in
• C. Germain, V. Breton, P. Clarysse, Y. Gaudeau, T. Glatard, E. Jeannot, Y. Legré, C. Loomis, J. Montagnat, J-M Moureaux, A. Osorio, X. Pennec and R. Texier. Grid-enabling medical image analysis. In Procs. 3rd BioGrid'05,Cardiff, Mai 2005. IEEE Press.
http://www.aci-agir.org
HDR 09/07/2005 45
Summary
• Introduction
• A protocol and a model for Global Computing
• Grid result-checking
• Fault tolerant message-passing
• Grid-enabling medical image analysis
• Conclusions & Perspectives
HDR 09/07/2005 46
Conclusion
Anatomy Physiology
Ecology
•Message-passing: fault-tolerance, performance and technology constraints point toward the same direction•Revisited result-checking: Monte-Carlo computations
•Message-passing: fault-tolerance, performance and technology constraints point toward the same direction•Revisited result-checking: Monte-Carlo computations
•Compatiblity of soft-state protocols with high troughput
•Compatiblity of soft-state protocols with high troughput
•An architecture for differentiated services•A grid testbed for algorithmic and clinical research in 3D medical imaging
•An architecture for differentiated services•A grid testbed for algorithmic and clinical research in 3D medical imaging
HDR 09/07/2005 47
Perspectives
• Data of interestInteractive grid access requires intelligent prefetch mechanisms to capture and anticipate the way data are explored and analyzed– Automatic selection– A model for describing the resulting requirements and
propagate them to the data source– Optimised access schemes in relation with the structure of
the raw data– Progressivity
HDR 09/07/2005 48
Perspectives
• Towards a grid observatoryOptimizing grid middleware and applications requires trace and models for– Intrinsic characterization of « grid traffic »: eg the data
locality parameters at a computing element– The reaction of the middleware components to these
requirements: eg hits and misses– Spatio-temporal correlation of users - VO– To which extent the latter explains the formers– MAGIE and DEMAIN projects
HDR 09/07/2005 49
Questions