TRANSCRIPT
10/9/13
Process Migration – A Review of Dynamic Load-Balancing Algorithms (Part 1)
Hessah Alsaaran
Presentation #2, CS 655 – Advanced Topics in Distributed Systems
Computer Science Department, Colorado State University
Part 1 Outline
• Introduction
• Dynamic Load-Balancing Systems:
  1. MOSIX
  2. Sprite
  3. Mach
  4. LSF
  5. RoC-LB
• Discussion and Comparison

Part 2 Outline:
• Migration in cloud settings.
• Live migration.
• Any suggested topic.
Introduction

Definition
• "The act of transferring a process between two machines during its execution." [1]

Figure 1 from [1]
Goals (Why use Migration?)
• Dynamic load distribution.
• Data locality.
• Failure recovery (with checkpointing).
• Administrative shutdowns.
General Algorithm for Migrating Process X
Between the Source Node and the Destination Node:
(1) Migration request
(2) Negotiation
(3) Acceptance
(4) Suspend Process X
(5) Queue up incoming communication
(6) Create new instance of X
(7) Transfer X's state
(8) Forward references
(9) Transfer queued communication
(10) Process X resumes
(11) Process X is deleted
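The handshake above can be sketched in code. The following Python is an illustrative toy: the class and field names are invented for this sketch and do not come from any of the systems surveyed.

```python
class Node:
    """Toy sketch of the 11-step migration protocol above.
    All names here are hypothetical, not any real system's API."""

    def __init__(self, name):
        self.name = name
        self.processes = {}      # pid -> state dict
        self.queued_msgs = {}    # pid -> messages buffered during migration

    def migrate(self, pid, dest):
        # (1)-(3) migration request, negotiation, acceptance
        if not dest.accept(pid):
            return False
        # (4)-(5) suspend X and queue incoming communication
        proc = self.processes[pid]
        proc["suspended"] = True
        self.queued_msgs[pid] = []
        # (6)-(7) create a new instance at the destination, transfer X's state
        dest.create_instance(pid, dict(proc))
        # (9) transfer the communication queued during the move
        dest.deliver(pid, self.queued_msgs.pop(pid))
        # (8)/(11) the source keeps only a forward reference; the copy is deleted
        self.processes[pid] = {"forward_to": dest.name}
        # (10) X resumes at the destination
        dest.resume(pid)
        return True

    def accept(self, pid):
        return True  # negotiation always succeeds in this sketch

    def create_instance(self, pid, state):
        self.processes[pid] = state

    def deliver(self, pid, msgs):
        self.processes[pid].setdefault("mailbox", []).extend(msgs)

    def resume(self, pid):
        self.processes[pid]["suspended"] = False
```

Note that step (8), the forward reference, is what later lets messages sent to the old location be redirected.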
Migration → Load Balancing
• Load information: information about the node and its load.
  • Trade-off between size and network overhead.
• Load information dissemination: periodic or event-driven.
  • Trade-off between frequency and stability.
• Load management and migration decisions: centralized or distributed.
Distributed Scheduling
• The migration decision is distributed.
• Policies:
  • Sender-initiated policy: the overloaded node requests the migration.
    • Suitable for lightly loaded systems.
  • Receiver-initiated policy: the underloaded node requests the migration.
    • Suitable for heavily loaded systems.
  • Symmetric policy: a combination of both of the above policies.
  • Random policy: randomly choose a node.
    • Simple but improves performance. (It eliminates the management overhead, but it can miss the critical nodes that are overloaded or underloaded → performance will depend on the load distribution.)
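A symmetric policy combines the sender- and receiver-initiated rules. A minimal sketch, with invented threshold parameters:

```python
def choose_policy_action(my_load, peer_loads, high, low):
    """Symmetric-policy sketch: an overloaded node acts as a sender and
    looks for underloaded peers; an underloaded node acts as a receiver
    and looks for overloaded peers. Thresholds are assumptions."""
    if my_load > high:  # sender-initiated: offload to lightly loaded peers
        targets = [p for p, load in peer_loads.items() if load < low]
        return ("send", targets)
    if my_load < low:   # receiver-initiated: pull work from overloaded peers
        sources = [p for p, load in peer_loads.items() if load > high]
        return ("request", sources)
    return ("stay", [])
```

A random policy would instead pick a peer with `random.choice`, skipping the load comparison entirely.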
Transfer Strategies
• Eager (all) strategy: copy the entire address space at migration time.
• Eager (dirty) strategy: transfer the dirty pages at migration time, then fault in the remaining pages when needed.
• Copy-on-reference strategy: transfer pages on demand. A dirty page must come from the source node; otherwise, it can come from the source node or the backing store.
• Flushing strategy: same as above, but the dirty pages are flushed to the backing store.
• Precopy strategy: the process suspension at the source node is delayed until the transfer is complete. Dirty pages may get recopied multiple times.
• Trade-off: initial migration cost, run-time cost, home-dependency, and reliability.
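The precopy trade-off (repeated recopying of dirty pages in exchange for a shorter suspension) can be shown with a toy simulation. Everything below is an assumption for illustration, not a real migration facility:

```python
def precopy_migrate(all_pages, dirty_per_round, max_rounds=5):
    """Precopy sketch: copy pages while the process keeps running, then
    keep re-copying whatever was dirtied meanwhile. Stop when nothing
    was dirtied or a round limit is hit; the remainder is sent in a
    final stop-and-copy while the process is suspended.
    `dirty_per_round` simulates the page set dirtied during each round."""
    copied = 0
    pending = set(all_pages)  # first round: the whole address space
    rounds = 0
    dirty_iter = iter(dirty_per_round)
    while pending and rounds < max_rounds:
        copied += len(pending)                   # transferred while running
        rounds += 1
        pending = set(next(dirty_iter, set()))   # pages dirtied meanwhile
    copied += len(pending)                       # final stop-and-copy
    return copied, rounds
```

With 4 pages and 1 page dirtied in the first round, 5 page copies are made over 2 rounds: the "recopied multiple times" cost mentioned above.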
Alternatives
• Remote Execution
  • Remote invocation of existing code, or transfer of code only.
  • Faster than process migration (no state transfer).
  • But it doesn't move: once the process is created at a node, it stays there until it terminates or gets killed.
  • A process cannot resume execution; it is killed and then restarted on the other node. (So is it really an alternative?)
  • Can be suitable for very short processes whose state is not worth transferring.
Alternatives
• Process Cloning
  • Remote forking: children inherit state from their parent.
  • Process migration is approximated by terminating the parent.
  • It requires the process's own knowledge and action.
  • No resumption here either.
Alternatives
• Application-level Checkpointing
  • The application checkpoints itself.
  • The application can resume from these checkpoints.
  • This is one way of doing process migration, not an alternative.
Alternatives
• Mobile Agents
  • Processes that save their own state, move to a new node, and resume execution (coded to do so): self-migrating processes.
  • They make the migration decision themselves.
  • No transparency.
(1) MOSIX – "Multicomputer OS for Unix"
MOSIX – Distributed OS
• Main feature: Single System Image.

Figure from [3]
Other MOSIX Main Features
• Fully decentralized control; no global state → scalable.
• Guest (migrated) processes run in sandboxes and have lower priorities.
2 Types of Processes
• Linux processes (not affected by MOSIX):
  • Administrative tasks and non-migratable processes.
• MOSIX processes:
  • Migratable processes that are created using the "mosrun" command.
  • Each has a home-node: the node where the process was created.
Automatic Resource Discovery
• Randomized gossip algorithm.
  • Each node periodically sends its own information, along with received information, to randomly chosen nodes.
• Information:
  • Includes current load, CPU speed, and memory.
  • Load is estimated by the average length of the ready queue over a fixed time period. (Is that enough?)
• Information about new nodes is gradually distributed, while information about disconnected nodes ages and is quickly removed.
• Scalable.
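A minimal sketch of one round of such randomized gossip, assuming a simple dict-based node model (this is not MOSIX's actual protocol; aging of stale entries is omitted):

```python
import random

def gossip_round(nodes, fanout=2):
    """One round of randomized gossip dissemination: each node sends its
    own load information, plus everything it has heard so far, to
    `fanout` randomly chosen peers."""
    for name, node in nodes.items():
        node["known"][name] = node["load"]  # refresh its own entry
        peers = random.sample([n for n in nodes if n != name],
                              min(fanout, len(nodes) - 1))
        for peer in peers:
            # the peer merges the sender's view into its own
            nodes[peer]["known"].update(node["known"])
```

Repeating the round spreads every node's entry epidemically; information about a new node reaches the whole cluster in a number of rounds roughly logarithmic in the cluster size.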
Process Profiling
• Each process has an execution profile:
  • Includes age, size, and rates of system calls, I/O, and IPC.
• Used along with load information for automatic migration decisions.
Dynamic Load-Balancing
• Uses process migration, triggered either manually or automatically.
• Migration: copying the process memory image (compressed when sent over the network) and setting up its run-time environment (sandbox).
• MOSIX uses the eager (dirty) transfer strategy:
  • At migration time: transfer only the dirty pages of the process.
  • When the process resumes execution: text and clean pages are faulted in as needed.
Migration Algorithm
• A migration decision is made:
  • When a node finds another node with significantly reduced load.
  • To migrate processes that request more memory than is available.
  • To migrate processes from slower to faster nodes.
• The selection of a process to migrate preferentially chooses:
  • Older processes.
  • Processes with a communication history with the destination node.
  • Processes with a forking history.
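The preference order above could be expressed as a simple score. The weights and field names below are invented purely for illustration; MOSIX's actual heuristics are not specified at this level in the sources:

```python
def pick_process_to_migrate(procs, dest):
    """Candidate-selection sketch: favor older processes and ones with
    communication or forking history relevant to the destination.
    Each proc is a dict: pid, age, talked_to (set of nodes), has_forked."""
    def score(p):
        return (p["age"]                                   # older is better
                + (10 if dest in p["talked_to"] else 0)    # communication history
                + (5 if p["has_forked"] else 0))           # forking history
    return max(procs, key=score)["pid"]
```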
MOSIX's Virtualization Layer
• MOSIX is a software layer.
• It intercepts all system calls.
• If the caller is a migrated process → forward the call to its home-node, perform the operation on behalf of the process, then send the results back.

Figure from [3]
Results of the Virtualization Layer
• Migrated processes seem to be running in their home-nodes, unaware of the migration.
• Users do not need to know where their programs are running.
• Users do not need to modify their applications.
• Users do not need to log in or copy files to remote nodes.
• Users have the impression they are running on a single machine.
• A migrated process runs in a sandbox for host protection.
Drawbacks of the Virtualization Layer
• Migrated-process management and network overhead.

Table 1 from [4]. Experiment settings:
• Identical Xeon 3.06 GHz servers, with 1 Gb/s Ethernet.
• 4 applications:
  • RC (satisfiability): CPU intensive.
  • SW (protein sequences): small amount of I/O.
  • JELlium (molecular dynamics): large amount of I/O.
  • BLAT (bioinformatics): moderate amount of I/O.
Table 1 from [1]
Migratable Sockets
• Each process has a mailbox.
• A process is identified by its home-node and its PID in that home-node. (mapping?)
• Process location is transparent.
• Network overhead is reduced.

[Figure: message paths between processes X and Y across nodes A–D; the "Without Migratable Sockets" panel shows messages relayed through the home-node, while the "Migratable Sockets (Direct Communication)" panel shows direct delivery.]
Limitations
• Suitable for applications that are compute-intensive or have low to moderate I/O.
  • Otherwise → network and management overhead.
• For Linux: MOSIX supports non-threaded Linux applications.
• For private clusters: only trusted and authorized nodes.
• For clusters with fast networks.
• Not suitable for the node-shutdown case.
Issues Not Discussed
• Node failure: either home-node or remote-node.
  • Checkpointing is supported, but where are the checkpoints (saved process images) stored?
• Fully decentralized? Job queuing system for flood control.
  • MOSIX has a job queue that gradually dispatches jobs according to their priorities, requirements, and currently available resources.
  • Where is the job queue?
(2) Sprite
Sprite
• Like MOSIX:
  • It achieves transparency through the use of home-nodes (single system image).
  • It uses the eager (dirty) transfer strategy.
• Unlike MOSIX:
  • Its goal is to utilize idle nodes.
  • It uses a network file system → reduces home-node service.
  • It uses a centralized load-information management system.
    • A single process maintains the system's state.
    • Scalability issue → bottleneck.
(3) Mach
Mach
• Task migration is a feature added to the Mach microkernel.
• The process stays on the source node; only the task's state is transferred.
• Most system calls are forwarded to the source.

Figure 7 from [1]
Migration Decision
• Distributed file system.
• Distributed scheduling decisions of (profiled) applications:
  • Migration to a node for data locality:
    • Significant amount of remote paging.
    • Significant amount of communication.
  • Migration to a node for load balancing:
    • CPU-intensive applications.
(4) LSF
LSF (Load Sharing Facility)
• LSF achieves load balancing through:
  • Initial placement of processes (primary goal).
  • Process migration with checkpointing (using Condor's checkpointing).
    • Checkpoint, kill, then restart the process on the destination node.
• Centralized load-information management:
  • A single master node.
    • If it fails, another node promotes itself (how?).
  • Other nodes send their information to the master periodically.
  • The master is responsible for scheduling.
    • Based on load information and process requirements.
(5) Rate of Change Load Balancing (RoC-LB)
Campos and Scherson [2]
Rate of Change Load Balancing (RoC-LB)
• The idea: the migration decision should depend on how the load changes over time:
  • To predict future overload or starvation.
  • To minimize node idle time.

[Figure: two pairs of nodes A and B, contrasting "equalizing load" with "utilizing idle nodes".]
Migration Algorithm (When?)
• Each node periodically computes the change in its load since the last time interval → difference in load (DL).
• Assuming that DL will remain the same, each node computes how many time intervals it needs to become idle.
• If this time is less than the network delay (ND) → initiate a migration request.
• The network delay (ND) is the time it takes to receive load after a request (adaptive).
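The "when" rule can be sketched as follows, assuming load is measured in abstract units and ND in time intervals (names are illustrative, not from [2]):

```python
def should_request_load(load, prev_load, network_delay_intervals):
    """RoC-LB 'when' rule sketch: request load if the node predicts it
    will become idle before new load could arrive over the network."""
    dl = load - prev_load  # difference in load (DL) per interval
    if dl >= 0:            # load steady or growing: no idleness predicted
        return False
    intervals_until_idle = load / -dl  # extrapolate the current trend
    return intervals_until_idle < network_delay_intervals
```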
Migration Algorithm (When?) – First Exception
• If the node's load < CT → initiate a load request.
• Reason: a small change in load can cause immediate node starvation.

[Figure: load axis divided by the Critical threshold (CT), Low threshold (LT), and High threshold (HT) into regions labeled Source, Neutral, and Sink.]
Migration Algorithm (When?) – Second Exception
• If the node's load > HT → never initiate a load request.
  • Even if it is predicted that the node will become idle.
• Reason:
  • DL must be high → the value is probably rare and short-lived.
  • To resist high spikes in load → delay load requests until the load falls below HT.
Migration Algorithm (Where?)
• Each node maintains 2 tables: a Sink Table and a Source Table.
  • For every load request it receives, the requesting node is considered a sink.
  • For every load-request reply it receives, the replying node is considered a source.
• When a node initiates a load request:
  • It selects the first source node in the Source Table.
    • If the table is empty → select a random node (not in the Sink Table).
  • Send the load request to it.
    • That node may forward the request to another selected source node.
Migration Algorithm (How many?)
• Using DL, the node predicts what its load will be after ND time → predicted load (PL).
• If PL ≥ zero → the node does not predict idleness in the next ND time period.
• If PL < zero → the node requests abs(PL) load.
• The request receiver will send load:
  • Only if its load > HT.
  • Amount = min(requested amount, a).

[Figure: a node's load with "a" marking the portion above HT.]
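A sketch of the receiver side of the "how many" rule, reading "a" as the load in excess of HT (my interpretation of the slide's figure, not an exact statement from [2]):

```python
def load_to_send(requested, load, high_threshold):
    """RoC-LB 'how many' sketch for the receiver of a load request:
    only a node above HT gives load away, and never more than its
    excess above HT (assumed meaning of 'a' in the figure)."""
    if load <= high_threshold:
        return 0                      # not a source: keep everything
    excess = load - high_threshold    # 'a': load the node can spare
    return min(requested, excess)
```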
Migration Algorithm (Which?)
• Old tasks (by CPU time).
• Reason: they will probably live longer, making the migration worthwhile.
Message Passing System and Table Update Algorithm
• Load Request Sending (requesting node):
  1. The requesting node creates a message.
  2. Pop a source node from its Source Table.
  3. Send the request message to that source node.
• Receiving a Reply (requesting node):
  • Add the replier to the Source Table.

Load Request Message fields:
• Sink Process ID
• Number of Load Units Requested
• Load State of Preceding Nodes
• Number of Forwards
• Receiving a Load Request (receiving node):
  1. Add the requester node to the Sink Table.
  2. If it is a forwarded message → add the preceding node to the table.
  3. Check its load:
    • If it can completely fulfill the request:
      • Send a reply to the requesting node with the requested load.
    • If it can partially fulfill the request:
      a. Send a reply to the requesting node with part of the requested load.
      b. Update the message by:
        • Decreasing the amount of requested load.
        • Incrementing the number of forwards.
        • Changing the preceding-node field to itself.
      c. Check the number of forwards:
        • If it has reached the maximum value: discard the message and inform the requesting node.
        • Otherwise, forward the message to a source node from the table.
    • If it cannot fulfill the request:
      • Repeat steps b and c (without changing the amount of requested load).
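The receiving-node steps above can be sketched as one handler. The dict layout and the `MAX_FORWARDS` value are assumptions for illustration; [2] does not fix them here:

```python
MAX_FORWARDS = 3  # assumed limit; the exact value is not given in the slides

def handle_load_request(node, msg):
    """Sketch of the receiving-node algorithm. `node` has 'name', 'load',
    'high_threshold', and 'sink_table'; `msg` carries 'sink', 'requested',
    'preceding', and 'forwards'. Returns (outcome, load units sent)."""
    node["sink_table"].append(msg["sink"])            # step 1: requester is a sink
    if msg["forwards"] > 0:
        node["sink_table"].append(msg["preceding"])   # step 2: forwarded message
    spare = max(0, node["load"] - node["high_threshold"])  # step 3: check load
    sent = min(spare, msg["requested"])
    msg["requested"] -= sent                          # b: decrease the request
    msg["forwards"] += 1                              # b: count the forward
    msg["preceding"] = node["name"]                   # b: new preceding node
    if msg["requested"] == 0:
        return ("fulfilled", sent)
    if msg["forwards"] >= MAX_FORWARDS:               # c: give up, inform requester
        return ("discarded", sent)
    return ("forwarded", sent)                        # c: pass it on to a source
```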
Comments/Critique
• If an algorithm balances the load through "equalization," it already covers the case of not wanting idle nodes.
• The requested load only covers the next time interval; then the node will need to request load again.
  • The migration process (including message passing) repeats continually.
  • Trade-off between time-interval size and the incurred overhead.
• A node should not only have "enough" load; it should have a mixture of tasks (e.g., CPU-bound and I/O-bound) to effectively utilize its resources.
Comparisons & Discussion

Comparisons & Discussion (1)
• Balancing goal:
  • MOSIX, Mach, LSF → full load balancing.
  • RoC-LB and Sprite → utilizing idle nodes.
Comparisons & Discussion (2)
• Full vs. partial process migration:
  • MOSIX, Sprite, Mach → partial migration.
  • RoC-LB, LSF → full migration.
• In partial migration (dependency between the source and destination nodes):
  • Increased transparency.
  • Management and network overhead. (Inefficiency)
  • Compromises the fault tolerance of the system. (Inefficiency)
• In full migration:
  • Node compatibility becomes an issue.
  • More information needs to be disseminated. (Inefficiency)
  • Longer migration time, but less run-time overhead.
Comparisons & Discussion (3)
• Distributed vs. centralized load-balancing algorithm:
  • MOSIX, Mach, RoC-LB → distributed.
  • Sprite, LSF → centralized.
• Distributed algorithms:
  • More scalable.
  • Load-information storage issue. (Inefficiency)
• Centralized algorithms:
  • Bottleneck. (Inefficiency)
  • Single point of failure. (Inefficiency)
  • Fast, up-to-date information about all nodes.
Inefficiency → Load Representation
• A node's load is estimated based on the number of load units.
• Examples:
  • MOSIX estimates load by the average length of the ready queue over a fixed time period.
  • RoC-LB uses load change over time.
• Problem: load units (processes) are not equal:
  • Different sizes (running times) → many short tasks can be equivalent to one long one.
  • CPU-bound vs. I/O-bound processes → a node must have a mixture to effectively utilize its resources.
If I were to build a system, how would I do it?
• (1) Load balancing through full process migration (no home-node dependency) → more failure-tolerant.
  • Checkpoints at the application level → not transparent, but produces significantly smaller checkpoint files that are also machine-independent.
  • Replicate these checkpoints, at a configurable level, on different nodes.
  • Periodically save checkpoints and update the replicas. If a node holding a replica does not respond, choose a replacement.
  • Conversely, if the process's own node times out, one of the replica-holding nodes takes over the process.
  • When a process has terminated, its node informs the nodes holding replicas to delete them.
If I were to build a system, how would I do it?
• (2) The migration decision is distributed, as in MOSIX → more scalable.
  • But change the load representation: instead of counting units as if they were equal,
    • load units will be weighted by their expected running time.
      • However, this estimation can be an issue.
      • History may be used if the process has been executed recently on the node.
  • Keep a configurable ratio of CPU-bound and I/O-bound processes.
    • This information can be extracted from process profiles.
  • Instead of sending load information to 2 random nodes, let the number of random nodes grow with the scale of the system.
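The proposed load representation could look like the following sketch; the field names, the target ratio, and the tolerance are invented for illustration:

```python
def weighted_load(procs):
    """Suggested load representation: weight each load unit by its
    expected remaining running time instead of counting units equally.
    Each proc is a dict with 'expected_runtime' and 'kind'."""
    return sum(p["expected_runtime"] for p in procs)

def io_cpu_ratio_ok(procs, target_io_fraction=0.3, tolerance=0.2):
    """Check the suggested configurable mix of CPU-bound and I/O-bound
    processes: the I/O-bound fraction should stay near the target."""
    if not procs:
        return True
    io = sum(1 for p in procs if p["kind"] == "io")
    return abs(io / len(procs) - target_io_fraction) <= tolerance
```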
If I were to build a system, how would I do it?
• (3) Implement the system at the application level, as in LSF → for portability and simplicity, although this may degrade performance.

Thank You
References
1. Milojicic, D. S., Douglis, F., Paindaveine, Y., Wheeler, R., and Zhou, S. "Process Migration", ACM Computing Surveys, Vol. 32, No. 3, September 2000, pp. 241–299.
2. Campos, L. M., and Scherson, I. "Rate of Change Load Balancing in Distributed and Parallel Systems", 10th Symposium on Parallel and Distributed Processing, 1999.
3. Barak, A. "Overview of MOSIX", presentation, 2013, Computer Science, Hebrew University. (http://www.mosix.org)
4. Barak, A., and Shiloh, A. "The MOSIX Cluster Operating System for High-Performance Computing on Linux Clusters, Multi-Clusters and Clouds", white paper, 2013. (http://www.mosix.org)
10/15/13
Process Migration (Part 2)
Hessah Alsaaran
Presentation #2 CS 655 – Advanced Topics in Distributed Systems
Computer Science Department, Colorado State University
Outline
• "On minimizing the resource consumption of cloud applications using process migrations"
  • by N. Tziritas, S. U. Khan, C.-Z. Xu, T. Loukopoulos, and S. Lalis
  • Journal of Parallel and Distributed Computing 73, 2013.
• "Rebalancing in a Multi-Cloud Environment"
  • by D. Duplyakin, P. Marshall, K. Keahey, H. Tufo, and A. Alzabarah
  • Science Cloud, 2013.
Distributed Reassignment Algorithm (DRA) (Tziritas et al., 2013)
DRA's Motivation (Problem Statement)
• Cloud services are widely used.
• Pay-as-you-go.
• Minimizing resource consumption is important for users.
• Problem: "Assuming that an application is already hosted in a cloud, reassign the application components to nodes of the system to minimize the total communication and computational resources consumed by the components of the respective application." (Tziritas et al., 2013)
• Using process migration.
Assumptions
• (1) The nodes are in a tree network structure.
  • General network graphs make the assignment problem NP-complete → many researchers limit their topology to solve the problem in polynomial time.
  • It can be organized in a 2-level hierarchy.

Figure 1b from (Tziritas et al., 2013)
Assumptions
• (2) The computing requirement of a process does not change dynamically, and its execution cost is known a priori.
  • Does not always hold → limitation.
Problem Formulation
• Problem: given an assignment between processes and nodes → reassign to minimize application resource consumption, taking capacity into consideration.
• Total cost = total execution cost + total communication cost:
  • The execution cost is fixed (regardless of assignment).
  • The communication cost depends on path length.

Figure 1a from (Tziritas et al., 2013)
Single-Process Migration
• The migration is beneficial if the positive load is larger than the negative load.

Figure 2 from (Tziritas et al., 2013)
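A sketch of the benefit test, under my reading of Figure 2: positive load is the traffic with partners reached through the destination, negative load is the traffic with partners left behind on the source side.

```python
def is_beneficial(process_traffic, dest_side):
    """Single-process migration rule sketch: moving the process one hop
    toward the destination is beneficial when the traffic that gets
    closer outweighs the traffic that gets farther away.
    `process_traffic` maps partner -> data volume; `dest_side` is the
    set of partners reached through the destination node."""
    positive = sum(v for p, v in process_traffic.items() if p in dest_side)
    negative = sum(v for p, v in process_traffic.items() if p not in dest_side)
    return positive > negative
```

Repeating this test at each hop is what eventually leaves the process at its "center of gravity," where no 1-hop move is beneficial.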
When does the process stop migrating?
• When it becomes a center of gravity:
  • No beneficial migration to any of its 1-hop neighbors.
Possible Problem
• Consider migrating p1 or p2 to n2.
• Consider migrating both processes.
• This grouping of co-located processes produces a super-process.

Figure 3 from (Tziritas et al., 2013)
Unbalanced Super-Process Identification
• A super-process that is not at the center of gravity.
• The negative load includes the load with co-located processes; it is weighed against the positive load toward the destination.
• In the example, P2 and P5 form the super-process that maximizes migration benefit.

Figures 5 and 6 from (Tziritas et al., 2013)
Information for Migration Decision
• Each node should know:
  • For each process it hosts, the volume of data used in communication, both locally and externally.
  • Its 1-hop communicating neighbors.
  • For each process, the location of its adjacent processes.
• Concurrent migration of adjacent processes on neighboring nodes (i.e., swapping) is not permitted.
Capacity Constraint
• Problem: the destination node cannot host a super-process.
• Solutions in the extended DRA:
  • Prune some processes from the super-process.
    • The least beneficial processes.
  • Enable multi-hop migration.
    • Each node must know the others that are at most k hops away.
Inefficiencies and Limitations
• (1) Local information vs. global information.
  • Using only 1-hop migrations can cause a series of migrations for a process until it reaches an x-hop node.
  • → Management and network overhead.
  • → Can increase the process execution time. (more in point 4)
• (2) Load balancing is not considered.
  • Only the capacity limit is considered.
  • Communicating processes can overload a node.
  • → Can increase the process execution time. (more in point 4)
• (3) Process length is not considered.
  • Short processes may not be worth migrating.
Inefficiencies and Limitations
• (4) Application cost = execution cost + communication cost, where the execution cost is fixed.
  • Not true; cost is also affected by how long processes use resources (e.g., cost by the hour).
  • Migration can cause either speed-up (e.g., if the process lands on an under-loaded node) or slowdown (e.g., continuous migration).
• (5) Limited by assumptions: tree networks and unchanging process requirements.
  • The authors plan to consider these issues in a future extension.
Rebalancing in a Multi-Cloud Environment
(Duplyakin et al. 2013)
Motivation (Problem Statement)
• Multi-cloud environment with user preferences.
• Example: a user uses a community cloud (the preferred cloud), but depending on the availability of its limited resources, additional work is launched on a public cloud.
• Automatic rebalancing: downscaling and upscaling.
• Automatic upscaling is provided by Nimbus Phantom and Amazon's Auto Scaling service, but automatic downscaling was not addressed.
• Need to define preferences and a set of policies.
Assumptions
• Multi-clouds are deployed by a central auto-scaling service (the authors use the open-source Phantom).
  • It deploys instances across the clouds and monitors them.
• Master-worker environments:
  • Distributed workers request jobs from a master.
  • Central job queue and scheduler.
  • The scheduler:
    • Reschedules jobs upon worker failure or rebalancing termination.
    • Provides worker status information (state, running jobs and their runtimes).
    • Adds and removes worker instances.
• Parallel workflow:
  • Independent jobs that can be executed in any order.
Architecture

Figure 1 from (Duplyakin et al. 2013)
Need for Rebalancing Policies
• Rebalancing in a multi-cloud environment can be disruptive to the workload:
  • Terminating an instance can cause the termination of a running job.
    • The job is rescheduled.
    • Increase in total execution time.
• Without rebalancing:
  • Unexpected expenses.
Downscaling Policies
1. Opportunistic-Idle (OI):
  • Wait until there are idle instances in the less desired clouds, then terminate them.
  • This policy continuously terminates excess idle nodes until it reaches the desired state.
2. Force-Offline (FO):
  • Similar to the previous policy, but with graceful termination:
    • Mark the instance as offline, and let the running job complete.
Downscaling Policies
3. Aggressive Policy (AP):
  • Discard partially completed jobs to satisfy the request ASAP.
  • To minimize re-execution overhead, choose instances with jobs that have the least running time.
  • Use a work threshold for progress.
    • Example: if t = 25% and the instance is running a 2-hour job, then the policy can terminate this job only in the first 30 minutes.
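The work-threshold check in AP can be sketched directly from the example above (parameter names are illustrative):

```python
def ap_can_terminate(elapsed_hours, job_length_hours, work_threshold=0.25):
    """Aggressive Policy (AP) sketch: a running job may be discarded only
    while its progress is below the work threshold t, e.g. the first
    30 minutes of a 2-hour job when t = 25%."""
    return elapsed_hours < work_threshold * job_length_hours
```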
Implementation
• Usage of existing technologies:
  • Uses infrastructure clouds such as Amazon EC2.
  • The implementation is an integration with the open-source Phantom service for auto-scaling.
  • Uses HTCondor as the master-worker workload-management system.
• Implemented components:
  • The sensor and the rebalancing policy (Python).
  • Communication with the HTCondor masters and Phantom.
Inefficiency
• Instead of terminating jobs, they could be checkpointed and then resumed.
  • Checkpointing is an available service in HTCondor.
Policies Trade-offs

Figure 6 from (Duplyakin et al. 2013)
Thank you
References
• Tziritas, N., Khan, S. U., Xu, C.-Z., Loukopoulos, T., and Lalis, S. 2013. On minimizing the resource consumption of cloud applications using process migrations. Journal of Parallel and Distributed Computing 73, 1690–1704.
• Duplyakin, D., Marshall, P., Keahey, K., Tufo, H., and Alzabarah, A. 2013. Rebalancing in a Multi-Cloud Environment. Science Cloud '13, June 17, 2013, New York, NY, USA.