
Proposal for hooks for Clusterwide Process Management in 2.6.10

Bruce Walker and Laura Ramirez

Draft 10 – February 7, 2005

This paper is a lot more than just the proposed hooks. It has most of the design for the clusterwide process model, which of course justifies the hooks. It is organized into several sections. After outlining the goals of the project and some infrastructure assumptions (sections A and B), we discuss process id assignment (section C). Then we outline the very limited changes to the task structure (section D), followed by information on the various process hash structures (section E). Following that are sections on each of the process relationships (sections F thru J). Section K describes how semaphore locking, both the base tasklist_lock and some additional sleep locks (only used when clusterproc is enabled), is utilized. Section L captures the use of special task structures (called surrogates) on various nodes in the cluster. Section M outlines the clusterproc structure that would be hung off each task structure (when clusterproc is enabled). Section N outlines how the installable clusterproc module sets up the kernel to be part of a distributed process cluster. The installable module installs a set of function pointers if CONFIG_CLUSTERPROC is defined (the clusterproc_ops function array). Section O outlines how various forms of process movement are hooked in, including checkpoint/restart. Section P describes the clusterwide /proc model and section Q is an overview of where the hooks are and why they are needed.

A: Goals

The goal is to enable a clusterwide process model, with minimal hooks and impact on the base Linux organization, locking, data structures and performance. A clusterwide process model means:

• clusterwide unique process id’s;
• visibility and access to any process from any node (except kernel threads);
• ability to have distributed process relationships, including parent/child, process group, session, ptrace parent, etc.;
• ability to move running processes from one node to another, either at exec/fork time or at somewhat arbitrary points in their execution;
• ability to transparently checkpoint and restart processes, process groups and thread groups;
• no performance impact if CONFIG_CLUSTERPROC is not defined and very minimal impact if it is;
• ability to have processes continue to execute even if the node they were created on leaves the cluster;
• ability to retain relationships of remaining processes, no matter which nodes may have crashed;
• full, but optional /proc/<pid> capability for all processes from all nodes;
• capability to support either an “init” process per node or a single init for the entire cluster;
• capability to support a shared root filesystem or a root filesystem per node;
• capability to be an installable module that can be installed either from the ramdisk/initramfs or shortly thereafter;
• support for clusters up to 64000 nodes, with optional code to support larger.

To enable the optional inclusion of clusterwide process management (henceforth also referred to as “clusterproc”) capability, a set of entry points is proposed. The infrastructure is patterned after the security hooks. If CONFIG_CLUSTERPROC is not set, the hooks are turned into inline functions that are either empty or return the default value. With CONFIG_CLUSTERPROC defined, the hook functions call clusterproc ops. The default set of ops are again empty functions or trivial return statements. The ops can be replaced, and the clusterproc installable module will replace the ops with routines to provide the goals listed above. The clusterproc module would be loaded early in boot, and when loaded, would provide the routines to be called by these function pointers. All the code to support the clusterwide process model will be under GPL. More detail on what the clusterproc module initialization will do is given in section N.
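As an illustration of the pattern, here is a minimal sketch of what one such hook could look like; the ops-table layout, names and signatures are assumptions patterned on the security hooks, not the actual patch:

    /* include/linux/clusterproc.h (sketch; names are illustrative) */
    #ifdef CONFIG_CLUSTERPROC

    struct clusterproc_ops {
            int (*alloc_pid)(int uniquifier);  /* encode node number into pid */
            /* ... one entry per hook ... */
    };

    extern struct clusterproc_ops *clusterproc_ops;

    static inline int clusterproc_alloc_pid(int uniquifier)
    {
            return clusterproc_ops->alloc_pid(uniquifier);
    }

    #else  /* !CONFIG_CLUSTERPROC */

    /* With the option off, the hook collapses to a trivial inline,
     * so the base kernel pays no cost. */
    static inline int clusterproc_alloc_pid(int uniquifier)
    {
            return uniquifier;
    }

    #endif

The default ops installed at boot would be similarly trivial; the loadable module then swaps in the real routines.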

B: Assumed Infrastructure

The clusterwide process management technology is modular and can work in a variety of environments. It does depend on a cluster membership subsystem which will enforce a consistent membership across nodes and will call registered nodeup and nodedown routines when nodes enter and leave the cluster. Clusterproc expects an interface to determine the cluster_maxnodes value (largest node number in the cluster) so it can adjust its process id assignment (see section C). It is also assumed that node numbers are persistent, although it may be possible in some environments to relax that requirement. There is a cluster_this_node variable that the clusterproc subsystem expects the cluster membership subsystem to fill in before clusterproc is installed. To accommodate the flexibility of working with either a single init for the cluster or an init per node, two external variables – cluster_init_node and cluster_single_init – are expected. Cluster_init_node will have the node number of the node to host the single init if cluster_single_init is true. If cluster_single_init is not true, an init is started on this node.

The clusterproc module does not itself have a communication component included and thus needs access to one. Messages are all RPC in nature. The initial clusterproc implementation will leverage the ICS kernel communication subsystem currently in OpenSSI. Dependency on that specific subsystem is modular.

Clusterproc does depend on a remote way to access user address space, in order to support remote ptrace and the remote ioctl and remote process data via /proc. One implementation of this capability exists in OpenSSI but less invasive implementations are possible.

While clusterproc can work in relatively loosely coupled cluster environments that don’t support remote tty capability and don’t have a cluster filesystem component, the design and hooks assume the remote tty capability can be present and thus distributed controlling tty must be supported. The cluster filesystem is not assumed for clusterproc but process movement is more complete if a cluster filesystem of some form is provided.
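For concreteness, the externals clusterproc expects from the membership layer might be declared as follows; the variable names come from the text above, while the notifier shape is an assumption:

    /* Filled in by the membership subsystem before clusterproc loads. */
    extern int cluster_maxnodes;     /* largest node number in the cluster */
    extern int cluster_this_node;    /* node number of the local node */
    extern int cluster_init_node;    /* node to host init if single-init */
    extern int cluster_single_init;  /* nonzero: one init for the cluster */

    /* Callbacks invoked on membership transitions (shape is assumed). */
    struct cluster_membership_notifier {
            void (*nodeup)(int node);
            void (*nodedown)(int node);
    };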

C: Clusterwide PIDs – Assignment

It is very important to have clusterwide unique process ids and these must fit in the standard pid_t pid field so the clusterwide uniqueness is visible thru standard kernel interfaces (kill, setpgid, etc.). There are several ways we can attain this uniqueness.

One strategy would be to have a designated clusternode hand out the ids. While this could work, it would make fork() inefficient and would make tracking where processes are currently running centralized.

Another strategy is to give pid ranges to nodes and then each node can manage its range. With this strategy fork can be strictly local and process tracking is simplified because you can just query the node that is managing the range containing the pid of the process you are trying to find. We are proposing a variant of this approach.

A simple way to give pid ranges to each node is to encode the node number in the high order bits and have the standard linux code manage a uniquifier set of lower order bits. For example, if you let the uniquifier be 16 bits (65,000 unique pids per node), you have 15 bits of node number (32,000 nodes). We would propose to make the default base system stay the same as it is and have the clusterproc_pid_alloc() hook do the node encoding, using a configuration variable (cluster_maxnodes) to determine how many bits are needed for node number and thus how many can be assigned for uniquifier. More complicated hooks could split up the pid namespace in non-uniform ways to allow some nodes to have a very large number of processes while at the same time allowing a very large number of nodes.

The node a process is created on will assign a process id that has a uniquifier component in the low order bits and the node number of the creating node in the higher order bits. The assigned pid (with node encoding) is the only pid the process is known by and it is of course clusterwide unique. The node assigning the pid is referred to as the origin node. That node will track the existence of that process (so the pid is not reused) and must track where it is running (so in the cluster case, we can find it). When clusterproc is not installed, pids would have no bits for the node number and thus default to the usage model for pids in the base kernel.
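A minimal sketch of the encoding, assuming the split between node bits and uniquifier bits has been computed from cluster_maxnodes at module load (helper names are illustrative; only clusterproc_strip_pid() appears in the hook list of section Q):

    static int pid_uniq_bits;  /* uniquifier width, derived from cluster_maxnodes */

    static inline pid_t clusterproc_encode_pid(int node, pid_t uniquifier)
    {
            /* node number in the high order bits, uniquifier below it */
            return ((pid_t)node << pid_uniq_bits) | uniquifier;
    }

    static inline int clusterproc_pid_to_node(pid_t pid)
    {
            return pid >> pid_uniq_bits;              /* origin node */
    }

    static inline pid_t clusterproc_strip_pid(pid_t pid)
    {
            return pid & ((1 << pid_uniq_bits) - 1);  /* uniquifier part */
    }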

D: Task Structure Modifications


The task structure is modified in a few ways – adding a new bit value for the per process flags, adding a pointer to a clusterproc structure and substituting a sleep lock for a spinlock for proc_lock if clusterproc is defined. If the clusterproc module is not installed, the clusterproc pointer will be null.

The flag bit would be PF_REMOTE and would be in the task->flags data element. That flag would indicate that this structure is not for a process executing on this node. Consequently, the base code would never set the flag. The proposal is to test the flag in a few places in the base code where required, as a way to determine if special processing is needed for this process.

Under CONFIG_CLUSTERPROC, the spinlock proc_lock is replaced with a sleep lock.

If clusterproc is installed, additional information about processes is needed and would be stored in the clusterproc structure. In order to deal with distributed process relationships (remote children, remote parent, etc.), it is proposed that “dummy” task structures may be created for processes which are not currently executing locally (section L below has more detail on when this is necessary). The proposal is that these “dummy” task structures would not have a stack, would not be hashed into the pid_hash[PIDTYPE_PID] hash, would not be on the init_task tasks list, would have the PF_REMOTE flag as true and would never be executable.

These dummy task structures (referred to here as surrogate task structures) are just struct task_struct and would never exist on the same node as the node where the process is currently executing. These structures would be linked off a new hash header (surrogate_hash) using the pids[PIDTYPE_PID].pid_chain. There are several reasons a surrogate could exist on a given node. The reasons are outlined in section L below. One reason is to support the “origin node” capability described in section C above. If a process is executing at its origin node, no surrogate is needed. However, if the process is not currently executing at its origin node, a surrogate exists on the origin node to ensure the pid is not reused and to record where the process is executing (via a field in the clusterproc structure).
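In outline, the task structure changes amount to the following (a sketch only; the PF_REMOTE bit value and the sleep-lock type are assumptions):

    /* include/linux/sched.h (excerpt, sketch) */
    #define PF_REMOTE  0x10000000  /* task is not executing on this node;
                                    * bit value is an assumption */

    struct task_struct {
            /* ... existing fields ... */
            unsigned long flags;             /* may now carry PF_REMOTE */
    #ifdef CONFIG_CLUSTERPROC
            struct semaphore proc_lock;      /* sleep lock replaces spinlock */
            void *clusterproc;               /* NULL unless module installed */
    #else
            spinlock_t proc_lock;
    #endif
            /* ... existing fields ... */
    };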


Figure 1 below shows how tasks running at their origin node would be hashed in; Figure 2 shows the surrogate that would exist on the origin node for a process that is not running at its origin node.

[Figure 1: Process xyz running at its origin node – xyz’s task is hashed into pid_hash[PIDTYPE_PID] via pid_chain.]

[Figure 2: Origin node structures for process xyz, which is not running at its origin node – a surrogate task for xyz is hashed into surrogate_hash via pids[0].pid_chain.]

Note that process tracking (knowing whether a process exists and where it is executing, which is what the origin node does) is not lost if the origin node fails. Failure of the origin node is handled by atomic re-creation of the information at a predefined node, referred to as the surrogate_origin node. Failure of the surrogate_origin is handled by a rebuild on the next surrogate_origin, etc. Re-join of the origin causes that node to take back the responsibility from the surrogate_origin. In the clusterproc structure there are execution node indications and load balancing information (section M).

E: Hashes and Task Lists

In the base Linux, there are 4 hashes that processes can be on (pid_hash[]). In addition, all thread group leaders are linked to (init_task) thru the “tasks” link. The following describes how clusterproc interacts with these base hashes:

PIDTYPE_PID: only processes running locally would be hashed into the pid hash.

PIDTYPE_TGID: Thread group leaders are hashed into this hash. They would only be on the hash if they are executing locally. Note that thread groups are not distributed so all the thread group relationship information is with respect to local processes; thread groups can migrate as a whole, in which case all the relationship information is carried to the new execution node and the thread group leader is hashed into the local TGID hash.


PIDTYPE_PGID: only processes running locally would be hashed into the PGID hash.

PIDTYPE_SID: only processes running locally would be hashed into the SID hash.

NEW FOR clusterproc, and only done if clusterproc installed:

surrogate_hash: surrogate task structures (task structures without a stack, for processes that are not executing locally but for which we need a task structure (local kids, etc.)) are hashed off this new hash header; to avoid adding a new “pids” element in the task structure, surrogates will be chained thru the pids[PIDTYPE_PID].pid_chain pointers.

init_surrogate: a surrogate that heads a list of surrogate thread group leaders on the origin node of those thread group leaders.
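A sketch of how a surrogate might be looked up on this hash, reusing the pids[PIDTYPE_PID].pid_chain linkage as described (the helper name and the reuse of pid_hashfn() are assumptions):

    extern struct hlist_head surrogate_hash[];

    static struct task_struct *find_surrogate(pid_t pid)
    {
            struct task_struct *p;
            struct hlist_node *n;

            hlist_for_each_entry(p, n, &surrogate_hash[pid_hashfn(pid)],
                                 pids[PIDTYPE_PID].pid_chain) {
                    if (p->pid == pid)
                            return p;   /* PF_REMOTE is set on surrogates */
            }
            return NULL;
    }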

F: Parent/Child Relationships

In clusterproc, the complete children/sibling list will only be maintained on the node where the parent process is executing. There are basically two ways we can implement this. The first is to have only local kids linked to the parent and have a supplemental structure for remote kids. In that model, sys_wait4() operations would have to contact each node where children were executing to see if there were any processes to wait for. The other model is to have a surrogate task structure for each remote child linked into the parent process and to maintain some of the fields in those surrogates so the parent can do local wait operations and only go remote when there is a child to reap. Due to the performance difference of the 2 approaches, we have assumed the latter approach in the rest of the design. Section L below details the fields in the surrogate task structure that must be accurately maintained so the parent can do wait processing. On child execution nodes where the parent is not executing, there will be a surrogate task structure for the parent; children will point to their parent surrogate task and the parent surrogate task will have a children/sibling list for locally executing children. Data structures include:

a. on parent process execution node:
   i. linked list of all kids (some may be surrogate tasks for remote kids) (children/sibling list);
   ii. kids point back to parent (parent/real_parent pointers);
b. on child process execution node (parent is locally executing): see a above.
c. on child process execution node (parent is not locally executing):
   i. surrogate task struct for parent; child points at that with parent/real_parent;
   ii. children/sibling pointers from the surrogate task parent thru all locally executing children.

Figure 3 below shows the parent/child structure for process xyz, on the node where xyz is executing. Note that his children “B” and “D” are running on another node but they have surrogate task structures on this node. Figure 4 shows the structures that would exist on the node where children “B” and “D” are executing.

[Process Management – Parent/Child Relationship]

[Figure 3: Parent xyz’s execution node – xyz’s children/sibling list links local children (A, C) and surrogate tasks for remote children B and D, each pointing back to xyz via parent; xyz is on pid_hash[PIDTYPE_PID], the surrogates on surrogate_hash.]

[Figure 4: xyz children’s execution node – locally executing children B and D point via parent at a surrogate task for xyz, which heads their children/sibling list and is hashed on surrogate_hash.]

Note that the alternative to having surrogate tasks in the children/sibling list is to have only local processes on this list and have a new construct for remote kids. Such a layout would affect the wait code. One important consideration is that we want enough information about each child on the parent execution node so the parent can do a complete wait, only going to child execution nodes to do reaps.

F.1: Ptraced Processes

To accommodate debuggers, a process can be ptraced. In this case the “parent” of the process is set to the process doing the ptracing and the “real_parent” is left as the original process that did the fork to create the process. The process is on the children/sibling list of the ptracing process and is on the ptrace_children, ptrace_list lists of the real_parent. Figure 5 shows the relationships set up.


[Process Management – Base Ptrace Relationship]

[Figure 5: Parent xyz of Process A; Process abc is ptracing A – A is on abc’s children/sibling list with parent pointing at abc, and on xyz’s ptrace_children/ptrace_list with real_parent pointing at xyz.]

In clusterproc, the parent or real_parent (or both) could be remote to the process. The distributed structures described above for the non-ptrace case are repeated to handle the real_parent/ptrace_children/ptrace_list. In addition to the parent/children/sibling lists described above, the following are set up:

a. on real_parent execution node:
   i. linked list of all kids (some may be surrogates to remote kids) using ptrace_children, ptrace_list;
   ii. kids point back to parent (real_parent);
b. on child node (real_parent is locally executing): see a above.
c. on child node (real_parent is not locally executing):
   i. surrogate task struct for real parent; child points at that;
   ii. ptrace_children/ptrace_list pointers from the surrogate task real_parent thru all locally executing ptrace children.

G: Thread Group Relationship

Thread groups almost always share address space. Consequently, in clusterproc we don’t spread the members of a thread group onto different nodes, and the thread relationship structures from base linux are used as is. Within pids[PIDTYPE_TGID] in the thread group leader, the pid_list pointer heads the list of thread group members, which are themselves linked via the pids[PIDTYPE_TGID].pid_list pointer. Each member points back to the leader with the group_leader field in the task structure.

All the thread group leaders are linked together, headed by the init_task and linked by the tasks pointer in the task structure. In the clusterproc model, all the locally executing thread group leaders will be linked like this. This link is used for readdir of /proc. With clusterproc hooks enabled, thread group leader surrogate tasks on their origin node will be linked thru an init_surrogate head. This, together with the standard init_task list, will enable an accurate clusterwide readdir for /proc.

H: Process Group (PGRP) Relationships

In the base Linux, process group members are linked off the pids[PIDTYPE_PGID].pid_list of the current pgrp list leader, and are linked together via the pids[PIDTYPE_PGID].pid_list links. This structure is maintained in clusterproc at all nodes where any pgrp member is executing. Additional information about nodes which have members is kept in clusterproc data structures so we can efficiently find all members of a process group. Figure 6 below shows the structures at the pgrp leader origin node if the pgrp leader is executing locally. Figure 7 shows the pgrp leader origin node structures if the pgrp leader is not executing locally.

[Figure 6: Pgrp Leader XYZ Origin Node (leader executing locally) – XYZ heads the local member list (members A and C) via pids[2].pid_list; for clusterproc, a supplemental structure holds a nodelist of nodes where other pgrp members are executing and the pgrp_list_sleep_lock.]

[Figure 7: Pgrp Leader XYZ Origin Node (leader not executing locally) – a locally executing member heads the pids[2].pid_list; the same supplemental nodelist structure and pgrp_list_sleep_lock are present.]

Note that in both cases there is a supplemental data structure that holds information about which other clusternodes have processes in this process group. This other data structure also houses the pgrp_list_sleep_lock and information about which nodes have non-orphan qualifying processes on them. Also note that if the pgrp leader is not executing locally, the pgrp list is headed by any locally executing member (this is new for 2.6.9).


Figure 8 below details the structures and links at the pgrp leader execution node (if he is not at his origin node). This is the same as when he is at the origin node, except there is no nodelist of nodes where other members are located (only kept at the leader origin node) and the pgrp_list_sleep_lock is not used here (only used at the origin/list node).

[Figure 8: Pgrp Leader XYZ Execution Node (leader not at origin node) – XYZ heads the local member list (members A and C) via pids[2].pid_list; no nodelist or pgrp_list_sleep_lock.]

[Figure 9: Pgrp Members Execution Node (leader not executing locally and not at origin) – a locally executing member heads the pids[2].pid_list.]

Figure 9 is again quite similar. It outlines the structures at a member execution node that is not the origin node and on which the pgrp leader is not executing.

H.1: Orphan Pgrp Functionality

A pgrp is orphan if no member has a parent in a different pgrp but with the same sid. Linux needs to know if a process group is orphan in order to determine if processes can stop (SIGTSTP, SIGTTIN, SIGTTOU). If the process group is orphan, they cannot. Linux also needs to know when a process group becomes orphan, because at that point any members that are stopped get a SIGHUP signal and a SIGCONT signal.

In the base linux, complete orphan pgrp tests are done whenever an event that might change the orphan-ness of a pgrp occurs, or when a process/pgrp is trying to TSTP, TTIN or TTOU. Doing the exhaustive test involves looping thru the relevant process group list(s) and checking the pgrp and sid of the parent of each process group member until it finds a qualifying member or doesn't find one. Examples of events that trigger the test are: death of process or parent; setpgid or setsid (note that Linux does not determine if a pgrp has gone orphan for setpgid and setsid, only on exit of process or parent; not sure why not?).

Providing this capability in a completely distributed way without very large potential overhead, but still with complete failure recovery, is challenging. A proposed implementation is given below.


Note that a pgrp can become "un-orphan", although it is unlikely and the change in status does not itself cause anything to happen - just that after the change, members can now stop.

H.2: Providing Orphan Pgrp Capability

One can avoid looping thru the pgrp and testing characteristics of each pgrp member’s parent by "knowing" whether the pgrp is orphan. If we just "knew" if a pgrp was orphan, the functionality of preventing processes from stopping would be simplified. Our proposal is for each process to know if he is contributing to his process group not being an orphan, and to know if the pgrp he is in is orphaned or not. To do this, the pgrp list will know which processes/nodes are contributing, and thus whether the group is orphan or not. As described earlier, the process group management strategy is that on the process group leader origin node, there is a list of locally running pgrp members and a list of nodes where other members are executing. Along with the remote node list is an indication of whether qualifying processes are executing there (qualifying processes are ones who contribute to the pgrp not being orphan). To keep this data accurate, the relevant events (e.g. process or parent exit) will inform the pgrp list if the event changes the characteristic of the remote node (becomes qualifying or ceases to be qualifying).

Process migration may require changes to information at the pgrp list (which nodes are contributing to the non-orphan-ness).

Nodedown of a node requires the pgrp ldr list node to evaluate if the process group just went orphan. If the pgrp ldr list node leaves the cluster, the surrogate_origin is populated by pushing the information from each member node (including whether the pgrp was already orphan or not, so the surrogate_origin can determine if the pgrp just went orphan).

Note that on controlling terminal reads and writes, the pgrp of the process making the request is required to determine if it is in the foreground pgrp, and if not, TTIN or TTOU processing will have to determine if the process is part of an orphan pgrp.
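The per-member "qualifying" test itself mirrors the base kernel's orphaned-pgrp check; a minimal sketch (the helper name is an assumption):

    /* A member keeps its pgrp non-orphan when its parent is in a
     * different pgrp but the same session. */
    static int member_qualifies(struct task_struct *p)
    {
            struct task_struct *parent = p->real_parent;

            return process_group(parent) != process_group(p) &&
                   parent->signal->session == p->signal->session;
    }

On each relevant event, a node would recompute this over its local members and inform the pgrp list node only when its aggregate answer changes.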

I: Session Relationship

Analogous rules to pgrp apply.

J: Controlling Terminal Relationship

In the clusterproc cluster, the controlling terminal may be managed on a node other than that of the session leader or any of the processes using it. There is a relationship in that processes need to know what their controlling terminal is (and where it is), and the controlling terminal needs to know which session it is associated with and which process group is the foreground process group.

In the base linux, processes have a "tty" pointer to their controlling terminal. The tty_struct has a pgrp and a session field.

In clusterproc, the base structures are maintained as is, with the pgrp and session fields in the tty struct and the tty pointer in the task structure. The tty pointer will be maintained if the tty is local to the process. If the tty is not local, the clusterproc structure will have cttynode and cttydev fields to allow code to determine where the controlling terminal is. To avoid hooks in some of the routines being executed at the controlling terminal node, svrprocs doing opens, ioctls, reads and writes of devices will masquerade as the process doing the request (pid, pgrp, session, and tty). To avoid possible problems their masquerading might cause, svrprocs will not be hashed on the pid_hash[PIDTYPE_PID] and will be marked PF_REMOTE.

K: Semaphores / Locks

The base uses the tasklist_lock spin lock (reader/writer spinlock) to manipulate all the relationship lists, and to the extent processes/structures are local, clusterproc will use the same lock. The base also uses the proc_lock to protect proc_dentry. The proposal is for the proc_lock to be a sleep lock when clusterproc is enabled, since it must be held across remote operations. In addition, when clusterproc is enabled, 2 other sleep locks will be used by various calls to ensure clusterwide coherency of the process group and session lists. The base has a task_capability_lock which serializes capability modification operations. It is proposed that this lock be a sleep lock when clusterproc is defined.

A: Task proc_lock

The existing base proc_lock spinlock becomes a sleep lock when clusterproc is defined. In addition, its use is expanded. This lock protects the process while it is migrating or exiting. It is also used to protect the ptrace state of the process. It is used by setpgid (in the case where the setpgid is happening to another process) and setsid, to interlock with the process exiting or migrating. It is used to interlock a parent exit (trying to disinherit its kids) with its child trying to migrate or exit itself. This lock is used in ptrace_unlink to ensure only one agent is doing a ptrace_unlink on a given process at a single time (ptrace_unlink is atomic in the base due to the tasklist_lock, but in the cluster that lock may be released to do remote parent notifications). This lock is only requested on the process’s execution node and is held remotely only in a special case of de_thread(). Descriptions of migrate, setpgid, setsid and exit explain how the lock is used.

B: Pgrp_list_sleep_lock

There is at most one instance of this lock for each active process group. It is created in the supplemental nodelist structure which is created at the pgrp leader origin/list node if there are any remote members of the pgrp. This lock provides clusterwide protection of the pgrp list, in the face of process movement. Under some circumstances the lock is obtained and held remotely (e.g. migrate, which executes on the node where the process is). This lock, which is also obtained to send process group signals and to make local changes to the nodelist of nodes who have pgrp members, ensures that moving processes don’t miss the signal or get it twice. As a consequence of this remote holding of the sleep lock, the clusterproc nodedown cleanup code must be able to release locks held by processes on nodes that have left the cluster. The lock is used in migrate of a member and any code which traverses the pgrp list (e.g. kill_pg_info). Fork does not need the global lock since it will not be changing the pgrp node list.

In the case of a process migrating, we potentially need to change the pgrp node list twice (once to add the new node and once to remove the old node); to avoid the race between pgrp signalling and process migration, a migrating process must hold this lock across the last phase of the migrate (when the last task data is moved to the new node and things are set up on the new node and destroyed on the old node).

C: Session_list_sleep_lock

There is at most one instance of this lock for each active session. It is created in the supplemental nodelist structure which is created if there are any remote members of the session. This lock provides clusterwide protection of the session list, in the face of process movement. Under some circumstances the lock is obtained and held remotely (e.g. migrate, which executes on the node where the process is). This lock, which is also obtained to clear controlling terminal values for all session members and to control changes to the nodelist of nodes who have session members, ensures that moving processes don’t miss controlling terminal updates. As a consequence of this remote holding of the sleep lock, the clusterproc nodedown cleanup code must be able to release locks held by processes on nodes that have left the cluster. Fork does not need the global lock since it will not be changing the session node list. In the case of a process migrating, we potentially need to change the session node list twice (once to add the new node and once to remove the old node); to avoid the race between controlling terminal updates and process migration, a migrating process must hold this lock across the last phase of the migrate (when the last task data is moved to the new node and things are set up on the new node and destroyed on the old node).

D: Migrate_sleep_lock

Each node has a single migrate_sleep_lock. For processes to migrate/rexec, they first get this lock in shared read mode. Global operations on processes (kill_something_info(), sys_capset() for the “all” case, sys_set/getpriority() for the PRIO_USER case) must get the migrate_sleep_lock in exclusive mode on all nodes to ensure no processes migrate during their operation (which might result in a process missing the operation).

E: task_capability_lock

The sleeping version of task_capability_lock replaces the base spinlock task_capability_lock if clusterproc is defined. This sleep lock is defined in each kernel but actually will be acquired on only one node (a single clusterwide instance of the lock). It is not a shared lock.

You must of course get sleep locks before spin locks. The hierarchy within the sleep locks would be:

- first the proc_lock, if needed;
- next the session_list_sleep_lock, if needed;
- next the pgrp_list_sleep_lock, if needed;
- next the task_capability_lock, if needed;
- finally the migrate_sleep_lock.

Process movement requires the first 4 locks; some capability operationsrequire the task_capability_lock and migrate_sleep_lock.
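A sketch of a mover honoring this hierarchy (the lock-lookup helpers are hypothetical; the list locks are assumed to be semaphores and the migrate lock a reader/writer sleep lock):

    static void migrate_acquire_locks(struct task_struct *p)
    {
            down(&p->proc_lock);               /* 1: task proc_lock */
            down(session_sleep_lock_of(p));    /* 2: session_list_sleep_lock */
            down(pgrp_sleep_lock_of(p));       /* 3: pgrp_list_sleep_lock */
            down(&task_capability_lock);       /* 4: task_capability_lock */
            down_read(&migrate_sleep_lock);    /* 5: shared mode for movers */
    }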


L: Rules for Surrogate Task Existence

Below are the rules for the existence of a surrogate task structure on a given node. Note that for each reason the usage count may be bumped (see below). Having a surrogate task structure on a given node means the usage count is set to at least 1 (it is initialized to 0 and get_task_struct() should be called for one or more of the reasons below). put_task_struct() has a clusterproc hook so that if the count goes to 0 for a surrogate task, it unhashes it, frees the memory and calls pidhash_free(). Note that there is never a surrogate task and a task for the same process on the same node. There is never more than one surrogate task for a given process on a given node.

Surrogate task for process A on node X (process A is executing on some other node) because:

1. X is origin node for A:
   - just need the execution node field set; usage +1; set up in migrate out / rexec(); cleared via clusterproc_exit_dealloc() called from release_task().

2. A's parent is executing on node X and thus A is on the children/sibling list of his parent; usage +1 for parent and +1 for real_parent:
   - this list is used for exit (to disinherit) and wait/reap;
   - the surrogate tasks for the children need to have the information to do a wait. To do wait locally (without going to the child execution node), we need these fields in the surrogate task struct filled in and maintained:
     ->real_parent - if real_parent is executing locally;
     ->parent - parent is executing locally;
     ->signal->pgrp - used in eligible_child();
     ->exit_signal - used in eligible_child();
     ->ptrace - used in eligible_child();
     ->tgid - used in eligible_child();
     ->pids[PIDTYPE_TGID].pid_list - just set up to point to self;
     ->state - used to determine STOPPED state;
     ->exit_state - used to determine ZOMBIE state;
     ->stop_state - used in wait_task_continue().
   - These fields are sent over via do_notify_parent() and maintained as follows:
     pgrp - changes with setpgid(), setsid(); update via clusterproc_pgrp_update() and clusterproc_setsid();
     exit_signal - set in fork(), reparent_to_init(), reparent_thread(), zap_other_threads(); update via ???;
     ptrace - changes when straced so need to update; update via clusterproc_update_parent();
     tgid - only gets set in fork();
     state and exit_state - updated via do_notify_parent();
     delay_group_leader(), called in eligible_child(), is hooked to locally determine if the task is a thread group leader and whether the thread list is empty.

3. A's real_parent (if different from parent) is executing on node X and thus A is on the ptrace_children/ptrace_list lists for that real parent:
   - usage +1 for the real_parent being local;
   - this list is used for exit (to disinherit), and in a limited way for wait but not for reap; waits don't really happen to ptrace_children but the wait code does go thru the list to see if an eligible process is on it (so it won't return ECHILD). Thus all the fields needed for a regular wait except "state" will need to be accurate (state must be updated for ZOMBIE but not for STOPPED).

4. A has children executing on node X who have pointers to the surrogate task for A:
   - one ref cnt if you have any locally executing children or ptrace_children;
   - parent surrogate task structures on nodes where children are executing locally need the following fields set:
     ->pid
     ->signal->session
     ->signal->pgrp
     ->children - children/sibling list is maintained for locally executing children;
     ->tgid
     ->sighand, signals - these fields can be checked on the parent node; don't set them.

M: Clusterproc data structure components

TBD

N: Cluster Process Module Initialization

The clusterproc loadable module is expected to be loaded during the ramdisk/initramfs phase or shortly thereafter. It is expected that membership and internode communication are already loaded and this node knows the cluster_maxnodes value (maximum node number for a member in the cluster) and the cluster_this_node value (node number of the local node). It also expects cluster_initnode and cluster_single_init to be set. Cluster_single_init is a flag telling the clusterproc code whether to have a single init for the whole cluster or an init process per node (if you want single_init, clusterproc must be loaded in the ramdisk/initramfs, before init is exec’d). If cluster_single_init is set, cluster_initnode indicates which cluster node is running or going to run init. The initialization routine will do the following:

a. adjust pid_max to reflect how many bits are left for the uniquifier, which is the bits not needed to deal with cluster_maxnode;
b. allocate data structures internal to the clusterproc module;
c. for each existing process, allocate and initialize a clusterproc structure and link it into their task structure;
d. install the clusterproc ops vector;
e. register with the cluster membership service;
f. register with the internode communication service;
g. if cluster_single_init is requested:
   i. change the startup process (was pid 1) to the next available pid (this will be child_reaper but will not be init);
   ii. set hooks so the startup process forks a pid 1 and execs init if this node is designated to run init;
   iii. set hooks so the startup process (which is now child_reaper) does not exit but loops in wait/reap.
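A sketch of how the init routine could implement steps a thru g (every helper name below is an assumption; only the ordering follows the list above):

    static int __init clusterproc_init(void)
    {
            struct task_struct *p;

            clusterproc_adjust_pid_max(cluster_maxnodes); /* a */
            clusterproc_alloc_internal();                 /* b */

            read_lock(&tasklist_lock);                    /* c */
            for_each_process(p)
                    clusterproc_attach(p);
            read_unlock(&tasklist_lock);

            clusterproc_ops = &clusterproc_real_ops;      /* d */
            clusterproc_register_membership();            /* e */
            clusterproc_register_comms();                 /* f */

            if (cluster_single_init)                      /* g */
                    clusterproc_setup_single_init();
            return 0;
    }
    module_init(clusterproc_init);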

O: Process Movement Capabilities and Hooks

The proposed system, and hooks, accommodate several forms of process movement, including a couple of forms of remote exec, an rfork and somewhat arbitrary process migration. In addition, these interfaces allow for transparent and programmatic checkpoint/restart.

The external interfaces to invoke process movement could be new system calls (rexecve(), rfork() and migrate(), etc.) or we can have library routines use /proc interfaces as an interface. To avoid the new system calls, we can have a /proc/<pid>/goto file which can be written from the library. Writes to this file would take a buffer and length. To allow considerable flexibility in specifying the form of the movement and characteristics/functions to be performed as part of the movement, the buffer would consist of a set of stanzas, each made up of a command and arguments to that command. The initial set of commands would be: rexec, rfork, migrate, checkpoint, restart, context and mount, but additional commands can be added. The arguments to rexec, rfork and migrate are a node number. The argument to checkpoint and restart would be a pathname for the checkpoint file. The context command indicates whether the process is to have the context of the node it is moving to or remain the way it was. The arguments to mount would support the private mounts for v9fs that bproc uses.
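From user space the interface might be driven like this; the stanza text is purely illustrative, since the proposal fixes the command names but not the exact encoding:

    #include <stdio.h>

    int main(void)
    {
            /* ask process 1234 to migrate to node 3, taking on the
             * context of the destination node */
            FILE *f = fopen("/proc/1234/goto", "w");

            if (!f)
                    return 1;
            fprintf(f, "migrate 3\ncontext new\n");
            fclose(f);
            return 0;
    }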

do_execve() and do_fork() have hooks which, if clusterproc is configured, will check the goto, and if appropriate, turn an exec into an rexec or a fork into an rfork.

To enable a migrate, besides setting fields in the clusterproc structure hung off the task, a bit would be set in the thread_info structure’s flags element (TIF_MIGPENDING). Each time the process leaves the kernel to return to user space (did a system call or serviced an interrupt), the do_notify_resume() function is called if any of the flags in thread_info.flags are set (normally there are none set). do_notify_resume() now has a hook which will check for the flag and if it is set, a migrate is initiated. This hook will only add pathlength when any of the flags are set (TIF_SIGPENDING, etc.), which is very rarely.
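A sketch of the do_notify_resume() hook on i386 (the flag and hook names follow the text; the exact placement within the routine is an assumption):

    void do_notify_resume(struct pt_regs *regs, sigset_t *oldset,
                          __u32 thread_info_flags)
    {
            /* new: a pending migrate request takes effect as the
             * process leaves the kernel */
            if (thread_info_flags & _TIF_MIGPENDING)
                    clusterproc_migrate(current);

            /* existing behaviour: deliver pending signals */
            if (thread_info_flags & _TIF_SIGPENDING)
                    do_signal(regs, oldset);
    }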

Two forms of kernel-based checkpoint/restart are proposed. The first is transparent to the process, where the action is initiated by another process. The other is when the process is checkpoint/restart aware and is doing the checkpoint on itself. In that case, the process may wish to “know” when it is being restarted. To do that, we propose that the process open the /proc/self/goto file and attach a signal and signal handler to it. Then, when the process is restarted, the signal handler will be called. Checkpoint/restart are variants of migrate. The argument field is a pathname. In the case of checkpoint, TIF_SIGPENDING will be set and at the end of the next system call, the process will save its state in the filename specified. Another argument can determine whether the process is to continue or be destroyed at that point. Restart is done by first creating a new process and then doing the “restart” goto command to populate the new process with the saved image in the file which is specified as an argument.

A further enhancement to ensure the goto is used appropriately would be to only leave it in effect for one system call. For rexecve(), then, the sequence would be to set the goto and then immediately do an exec. Similarly for rfork().

P: Clusterwide /proc

The proposal for a clusterwide /proc is to stack a new pseudo filesystem (say cprocfs) over an unmodified /proc. Hooks will be needed to do the stacking but they will be modest. The proposed semantics would be that:

a. readdir would present all processes from all nodes, and other proc files would either be an aggregation (sysvipc, uptime, net/unix, …) or would pass thru to the local /proc;
b. cprocfs would function ship all ops on processes to the nodes where they were executing and then call the procfs on those nodes;
c. cprocfs inodes would not point at task structures but at small structures which would have hints as to where the process was executing;
d. /proc/node# directories would redirect to the /proc on that node so one could access all the hardware information for that node;
e. readdir of /proc/node# would only show the processes executing on that node.

Q: Descriptions of the Base Hooks

The hooks are in the following base kernel files:

./arch/i386/kernel/ptrace.c
./arch/i386/kernel/signal.c
./drivers/char/n_tty.c
./drivers/char/tty_io.c
./fs/exec.c
./fs/fcntl.c
./fs/open.c
./fs/proc/array.c
./fs/proc/base.c
./include/asm-i386/thread_info.h
./include/linux/capability.h
./include/linux/clusterproc.h
./include/linux/init_task.h
./include/linux/ptrace.h
./include/linux/sched.h
./init/main.c
./kernel/capability.c
./kernel/exit.c
./kernel/fork.c
./kernel/pid.c
./kernel/printk.c
./kernel/ptrace.c
./kernel/signal.c
./kernel/sys.c
./kernel/timer.c

The 4 changes in include/linux/sched.h are to add a PF_REMOTE flag in the task structure to identify surrogate task structures or processes surrogating for remotely executing processes, add a void *clusterproc in the task structure to point to a supplemental data structure allocated if clusterproc is installed, make the proc_lock either a spinlock or sleeplock, and declare do_notify_parent() as having a return value.

The change in ptrace.h is in ptrace_unlink(). Instead of just calling __ptrace_unlink(), it is changed to call do_ptrace_unlink() (a new routine) which can get the process sleep lock before doing the __ptrace_unlink().

The change to capability.h is to make the task_capability_lock either the base spinlock or a sleeplock (based on CONFIG_CLUSTERPROC).

The change to init_task.h is to set up the initialization of the proc_lock, which now can be either a spinlock or a sleep lock, based on whether CONFIG_CLUSTERPROC is defined.

The change to asm-i386/thread_info.h is to add a new flag value to the flags field of the thread_info structure (TIF_MIGPENDING). This flag is set on a process if it has been requested to migrate (through the /proc interface). In do_notify_resume(), there is a hook which will check for the flag and if it is set, a migrate is initiated.

One can view the .c hooks functionally, and below is a set of descriptions in the following categories:

- pid assignment and freeing
- fork
- exec
- exit
- wait
- ptrace
- signalling
- controlling terminal
- setpgid, setsid
- priority and capability
- /proc
- miscellaneous

Pid assignment and freeing:


The hooks are in pid.c, in alloc_pidmap() and free_pidmap().

The clusterproc_alloc_pid() hook encodes the node number into the pid (see section C).

The clusterproc_local_pid() hook returns whether the process id was allocated locally (and thus can be freed) or not, and whether it is still a pgrp or session ldr id (and thus cannot yet be freed).

The clusterproc_strip_pid() routine takes the node encoded pid and returns the uniquifier part, which the local base code has to manage.

Fork:

Hooks are in fork.c, in copy_process() and in do_fork(). In copy_process(), clusterproc_fork_alloc() is called to allocate and initialize the clusterproc data structure hung off the task structure (with cleanup in error cases). Also in copy_process(), __ptrace_link() (which has a hook in it) is called if the child is to be ptraced (if either parent is remote, clusterproc_update_parent() is called to set up the parent/child relationship on those parent nodes (see also ptrace below)). Finally, the clusterproc_update_parent() hook is needed in the CLONE_PARENT case where the grandparent (who is to be made parent) is executing remotely. In do_fork(), there is a hook to see if an rfork() was requested and if so, to basically execute the do_fork on the remote node.

Exec:

There are 2 exec-related hooks. The first is in do_execve(), where a test is inserted to see if an rexec() had been requested or to see if the exec() should be load balanced into an rexec(). The other hook is in de_thread(). If a multi-threaded process does an exec, it must call de_thread(), which, if the thread group is being ptraced, may have to inform parents on other nodes, which in turn will release the tasklist_lock. While the tasklist_lock is released, the parents could exit or ptrace_detach, which, if unprotected, could leave things linked inappropriately. Consequently de_thread() is modified to acquire the proc_lock sleep lock for the parent who is doing the ptracing. This ensures that the ptrace parent cannot exit. It also gets the proc_lock on the leader, to interlock with any competing ptrace_detach().

Exit:

sys_exit() calls do_exit(), which calls exit_notify(); exit_notify() does 3 things - disinherit children with forget_original_parent(), check whether our pgrp is now going to become orphan (will_become_orphaned_pgrp()) and notify our parent. All three have hooks.

forget_original_parent():

- loop thru children (we only do local ones here) and call reparent_thread() on each one unless they are ptraced, in which case we un-ptrace them (__ptrace_unlink); reparent_thread() also has to check the orphan-ness of the pgrp of each child; it may have to do remote activity in this case; if any remote activity was done, we restart the loop since the tasklist_lock has been released.
- loop thru ptrace_children (someone else is tracing them) and if they are local, reparent_thread() them;
- finally, if there are any processes left on the children or ptrace_children lists, they must be executing remotely and we call clusterproc_rmt_reparent_children() to reparent them;
- after that we do one more look to ensure the lists are empty.

will_become_orphaned_pgrp() is only called if the process was contributing to its pgrp not being orphan; if it was contributing, we first see if there are others locally contributing, and if so, then the pgrp did not go orphan; if not, we tell the pgrp list node we as a node are no longer contributing and he determines if that means the pgrp has gone orphan; if it has, he will send SIGHUP/SIGCONT if there are any stopped jobs in the pgrp;

do_notify_parent() is called if exit_signal != -1; this might go remote if the parent is remote; if exit_signal is -1, the exit will call release_task() (see wait below).

Wait:

sys_wait4() calls do_wait(), which loops thru all your children looking for eligible ones (if it doesn't find any you get an -ECHILD); if a child is eligible (we have to ensure the test criteria can be applied on the parent node without remote operations), it checks if it is zombie or stopped or traced (see below). If no eligible regular kids were found, it searches the ptrace_children list. It doesn't process them but they can count towards being eligible and avoiding the -ECHILD.

The eligible_child() routine, which is run on the parent node, requires information about the child, including whether he is a delayed group leader. To avoid a hook in delay_group_leader() (which is in sched.h), an added call to clusterproc_thread_group_empty() is done in eligible_child() if the child is remote; the check is done by interrogating a local flag which is accurate if the child is zombie or stopped or traced.

For zombies, sys_wait4() calls wait_task_zombie(), which will go remote to the node where the child is and run wait_task_zombie() there. wait_task_zombie() will un-ptrace you if necessary so your real_parent can wait for you; if you aren't ptraced, release_task() is done next. release_task() will un-ptrace you if needed; it unhashes the process, copies some info to the parent and reduces the process usage count. Unhashing the process may have to update a remote parent and calls detach_pid(), which takes a process off one of the hash chains and, as a side effect, may have to adjust dummy/surrogate copies of the process on other nodes (done by the hook clusterproc_rmt_adjust_ldr_lists()). If the usage count goes to 0, the task structure is freed. For clusterproc, we just have to free the clusterproc structure hung off the task structure, which is done in clusterproc_exit_dealloc(), which also clears the process's surrogate at its origin node.

If you were stopped instead of zombie, wait_task_stopped() was called, which, like wait_task_zombie(), goes to the child node; there it gathers a little information about the child and returns the pid.

If you were continued instead of zombie, wait_task_continued() was called, which, like wait_task_zombie(), goes to the child node; there it gathers a little information about the child and returns the pid.

In fork.c, put_task_struct(), we propose a hook clusterproc_unhash_stask() that will allow cleaning up surrogate task structures which are set up for processes not running locally but for which a local reference was needed (e.g. parent process was executing locally).

Ptrace:

Hooks for ptrace capability are in arch/*/kernel/ptrace.c and kernel/ptrace.c. sys_ptrace() (in arch/*/ptrace.c) is hooked with a call to clusterproc_rmt_ptrace() if the requested child is not found locally, and to call clusterproc_update_parent() in the PTRACE_TRACEME case.

ptrace_attach() and ptrace_check_attach() are modified to avoid the use of "current", since the operations may be being executed by a svrproc.

In __ptrace_link(), which is called in copy_process(), ptrace_attach() and de_thread(), clusterproc_update_parent() is called to deal with either parent being remote (it ensures the child surrogate task is set up appropriately on the parent execution node).

ptrace_unlink(), called by forget_original_parent(), ptrace_detach() and reparent_to_init(), is enhanced to get the proc_lock so callers are synchronized with de_thread(); it returns -EREMOTE to indicate that the tasklist_lock has been released.

__ptrace_unlink(), which is called by ptrace_unlink(), de_thread(), release_task() and wait_task_zombie(), is modified to inform remote parents, which is done with the clusterproc_update_parent() function.

Signalling:

Signals come from 2 sources: system calls and kernel internal generated signals. System calls include sys_kill, sys_tkill and sys_tgkill. The sys_tkill and sys_tgkill calls are hooked to function ship to either the node of the process in question or the pgrp list node. Once on the correct node, the syscall is re-executed. sys_kill calls kill_something_info(), which calls kill_proc_info(), kill_pg_info() or does the kill -1 code. The kill -1 is hooked to send the request to all nodes (first freezing all process movement so no process is missed). kill_proc_info() attempts to deliver the signal locally and if that fails, calls the hook clusterproc_rmt_sigproc() to find the process and deliver the signal. kill_pg_info() checks if the pgrp list is local. If so, it gets the pgrp_list_sleep_lock, does local signaling and then calls clusterproc_sigpgrp_rmt_members() to signal process group members on other nodes. If the call is not done on the pgrp list node, clusterproc_rmt_sigpgrp() is called to go to the pgrp list node and execute kill_pg_info() over there.

Calls from inside the kernel (not from one of the above kill system calls) come thru a variety of means. Many will call kill_proc(), which will call kill_proc_info(). Some call kill_pg(), which calls kill_pg_info(). In fs/fcntl.c there are two specialized signal calls - send_sigio() and send_sigurg(). These calls don't go thru the standard signal delivery until a very low level so they are individually hooked in manners similar to kill_proc_info() and kill_pg_info() (including the use of async delivery as needed).

Signal processing can sometimes require remote activity. For example, get_signal_to_deliver() can do a notify_parent(), which will notify remote parents if necessary. Handling stop signals can also require remote activity thru do_notify_parent_cldstop().

Controlling terminal:

In base linux, the controlling terminal of a process is designated by the "tty" pointer in the task structure, which points to a tty structure. The proposed strategy is for processes local to their controlling tty to have the tty pointer filled in, and for processes remote to their controlling tty to have the pointer null and to have cttydev and cttynode fields filled in in the clusterproc structure. It is also proposed that the svrproc threads executing tty related operations on behalf of remote processes set up their tty pointer to reflect that of the caller. It is also proposed that each process know if it is currently in the foreground process group (needed to handle failure of the controlling terminal node). Based on this set of strategies, a set of hooks is needed in the kernel to provide controlling terminal capability.

One hook is needed in tty_open() (tty_io.c), where the code checks whether the process already has a controlling terminal; clusterproc_has_ctty() is called to check if the process has a remote ctty. Opens of /dev/tty must be initially handled on the calling process's node and then sent to the node where the controlling terminal is. Tty_open() therefore also has a hook in the open of /dev/tty for the case where the controlling terminal is remote.

For tty reads (read_chan()) and writes (write_chan()) (both in n_tty.c), the checks done to possibly send TTIN or TTOU signals must call is_ignored(sig), which is hooked to check back with the calling process, and call is_orphaned_pgrp(), which is hooked to check the IS_ORPHANED flag in the clusterproc structure. The call to kill_pg() is handled clusterwide (see signalling section).


For tty ioctls, tiocspgrp() and tiocsctty() need hooks. In tiocspgrp(), a call to clusterproc_update_ctty_pgrp() is inserted to handle informing all members of the old pgrp that they are no longer foreground and to inform all members of the new pgrp that they are foreground. In tiocsctty(), which sets controlling terminal, calls to clusterproc_has_ctty() and clusterproc_clear_tty() are inserted. TIOCNOTTY doesn't directly have hooks, but calls disassociate_ctty(), described below.

Several hooks are needed to handle session leader exit, terminal close and hangup operations. do_exit() calls disassociate_ctty(), which calls clusterproc_release_remote_tty() if not on the ctty node (it calls disassociate_ctty() on the ctty node), and calls clusterproc_clear_tty() to clear all tty pointers in all members of the session. Thru the tty_release op, terminal close calls release_dev(), which is hooked to call clusterproc_clear_tty() to handle remote session members. Hangups (either from drivers or thru sys_vhangup()) go thru do_tty_hangup(), which calls clusterproc_clear_tty().

In kernel/printk.c, in tty_write_message(), the use of the tty pointer is augmented by a call to clusterproc_rmt_tty_write_message() if the tty is remote.

In fs/proc/array.c, proc_pid_stat(), clusterproc_get_tty() is called to get the tty dev and foreground pgrp if the tty is not local.

In exit.c, daemonize() must clear the tty and has a hook to clear the cttynode and cttydev fields as well. daemonize() also calls reparent_to_init(), which may have to call clusterproc_update_parent() to adjust the structure at the old parent node, if it is remote.

Setpgid, setsid

In sys.c, sys_setpgid() has to move the call to the node where the process being acted on is executing (usually the caller, but it doesn't have to be); this is done by testing if the process is local (clusterproc_is_process_local) and, if it is, grabbing the process lock. If it isn't, clusterproc_rmt_setpgid() is called, which will go to the correct node and call sys_setpgid() on that node. As part of the setpgid call, there is a check to ensure that someone in the new process group is in the same session as the caller of the setpgid. This check may be satisfied locally or may require scanning the new pgrp (which is done via clusterproc_verify_pgid_session()). Once that is done, we have to leave the old pgrp and join the new one. detach_pid() has the hooks to handle the distributed process group lists for the old group; clusterproc_pgrp_update() then does several things: a) determine if the pgrp we just joined has now changed state from orphan to non-orphan; b) update your surrogate task on your parent's node (if the parent is remote, so it can do accurate wait4's); c) update your pgrp value at your children's execution nodes and recalculate whether any of their pgrps changes state (from orphan to non-orphan or from non-orphan to orphan); d) if not at the origin node, create a pgrp leader list at the origin.
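A sketch of the entry-point test described above (clusterproc_is_process_local() and clusterproc_rmt_setpgid() are hook names from this proposal; the surrounding flow is illustrative):

    asmlinkage long sys_setpgid(pid_t pid, pid_t pgid)
    {
        if (!pid)
            pid = current->tgid;

        /* execute on the node where the target process is running */
        if (!clusterproc_is_process_local(pid))
            return clusterproc_rmt_setpgid(pid, pgid);

        /* local case: grab the process lock, then proceed with the
         * base checks (session membership may still require a remote
         * scan via clusterproc_verify_pgid_session()) */
        /* ... base sys_setpgid body ... */
    }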


sys_getpgid() tries to find the process in question locally but if it is not local, it calls clusterproc_rmt_proc_getattr() to get the pgid.

sys_getsid() tries to find the process in question locally but if it is not local, it calls clusterproc_rmt_proc_getattr() to get the sid.

sys_setsid() has to first get the proc_lock to ensure that someone doesn't do a setpgid() during the setsid (in the base, the setsid is all done under the tasklist_lock, but in the clusterproc case that lock may have to be released and reacquired). Then setsid() calls __set_special_pids() to reset its pgid and sid. The detach_pid() call that __set_special_pids() does will clean up the old pgrp. Clusterproc_setsid() is called to: a) update our surrogate task at our parent's execution node, if that is remote; b) update our surrogate task at each of our children's nodes; c) if we are not executing at our origin node, send a message to that node to create a pgrp leader list and a sid leader list; and d) clear our cttydev and cttynode fields.

Priority and capability

Sys_getpriority() and sys_setpriority() are pretty straightforward. They can be done for an individual process, a process group or by uid. For individual processes we ship the operation to the node where the process is and redo the system call. For process groups, we move the operation to the node where the pgrp list is, do the local members and then do the call on each node that has members (under the pgrp_list_sleep_lock). For the uid case, we first stop all process migration (so processes aren't missed) and then execute the system call on all nodes.
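Illustrative dispatch for the three cases (the helper names are hypothetical; only the shipping and locking strategy comes from the text):

    long cp_setpriority(int which, int who, int niceval)
    {
        switch (which) {
        case PRIO_PROCESS:
            /* ship to the execution node and redo the syscall there */
            if (!clusterproc_is_process_local(who))
                return clusterproc_rmt_setpriority(which, who, niceval);
            break;  /* local: fall into the base code below */
        case PRIO_PGRP:
            /* move to the pgrp-list node; it handles local members,
             * then repeats the call on each member node while holding
             * the pgrp_list_sleep_lock */
            return clusterproc_pgrp_setpriority(who, niceval);
        case PRIO_USER:
            /* freeze migration so no process is missed, then run the
             * syscall on every node */
            return clusterproc_allnodes_setpriority(who, niceval);
        }
        return base_setpriority(which, who, niceval);
    }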

For all capability operations we get a clusterwide sleep lock, to ensure the atomicity of the base. Sys_capget() only works on individual processes, so we do the call on the node where the process is running. This does require the svrproc to masquerade as the calling process w.r.t. the capabilities of the caller. Sys_capset() can be called for a process, a process group or all processes. For the single process we go to the node where it is running. For the process group we go to the process group list node, do local members and then do the syscall on each node where members are executing (under the pgrp_list_sleep_lock). For the "all" case, we suspend process migration and run the system call on all nodes.

/proc

As described in section P above, there are very limited hooks in the /proc code because a new filesystem (cprocfs) will be optionally stacked on the standard /proc. A few hooks are needed to deal with controlling terminal and the proc_lock.

In fs/proc/array.c, proc_pid_stat(), clusterproc_get_tty() is called to get the tty dev and foreground pgrp if the tty is not local.

In fs/proc/base.c, the locking and unlocking of the proc_lock are changed to go thru macros so that when clusterproc is configured, proc_lock can be a sleep lock.
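For example (illustrative only; the macro and field names are assumed, not taken from the patch):

    #ifdef CONFIG_CLUSTERPROC
    /* sleep lock: svrprocs may block while holding it */
    #define proc_pid_lock(task)    down(&(task)->proc_sem)
    #define proc_pid_unlock(task)  up(&(task)->proc_sem)
    #else
    /* base kernel: proc_lock stays a spinlock */
    #define proc_pid_lock(task)    spin_lock(&(task)->proc_lock)
    #define proc_pid_unlock(task)  spin_unlock(&(task)->proc_lock)
    #endif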

Miscellaneous

In kernel/timer.c, there is a slight change in sys_getppid() to have it always return 1 if the parent is child_reaper. This hook allows the flexibility to have a clusterwide single init process and to allow a kernel thread to be child_reaper, while still presenting ppid==1 to mean you lost your parent.
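A sketch of the change, following the shape of the base 2.6 routine (simplified; the SMP retry loop of the base code is omitted):

    asmlinkage long sys_getppid(void)
    {
        struct task_struct *parent = current->group_leader->real_parent;

        /* a per-node kernel child_reaper thread is still reported as
         * pid 1, preserving "ppid == 1" as the orphaned-process idiom */
        if (parent == child_reaper)
            return 1;
        return parent->tgid;
    }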

In init/main.c, in the init() routine, we call clusterproc_start_reaper(), which will create a thread to be the child_reaper and either return to start up init or exit (if we aren't going to run init on this node).

In the architecture routine do_signal(), which is called as processes are returning to user mode, there are tests to see if work needs to be done before returning (like processing signals). A bit is added to the work bit map to indicate that a migrate has been requested; if that bit is set, the clusterproc_do_migrate() hook is invoked at this point (sketched below).
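Sketched below (the flag name and bit number are hypothetical; test_thread_flag()/clear_thread_flag() are the standard accessors):

    /* in asm-i386/thread_info.h (hypothetical bit) */
    #define TIF_MIGRATE_PENDING  8   /* migrate requested for this task */

    /* in the do_signal()/return-to-user work loop */
    if (test_thread_flag(TIF_MIGRATE_PENDING)) {
        clear_thread_flag(TIF_MIGRATE_PENDING);
        clusterproc_do_migrate();  /* safe point: about to re-enter user mode */
    }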

Appendix A: Clusterproc Hooks

Header Files:

sched.h
- in task struct, add:
  a. void *clusterproc;
  b. PF_REMOTE flag in task->flags element;
  c. proc_lock is either a spinlock (base) or a sleep lock;
- change the declaration of do_notify_parent() to return an int;

ptrace.h
- change the declaration of ptrace_unlink() to call do_ptrace_unlink(), which is in ptrace.c, so that hooks to obtain sleep locks can be called;

clusterproc.h
- clusterproc functions with and without ifdef CONFIG_CLUSTERPROC;
- flags for some of the ops;

capability.h
- task_capability_lock is either a spinlock (base) or a sleep lock.

asm-i386/thread_info.h
- a new flag is added to xxx to indicate that there is a pending migrate for this process so that when the process is returning to user space it can be moved.

Changes needed beyond calls to clusterproc_ops:

./kernel/exit.c
- in release_task(), an unlock of the proc_lock is moved to be right after the call to proc_pid_unhash();

- do_notify_parent() has a return code to indicate if the tasklist_lock was released; the call in reparent_thread() cares;

- will_become_orphaned_pgrp() can now return <0 to indicate that the tasklist_lock was released; relevant in reparent_thread();

- will_become_orphaned_pgrp() either locally determines the pgrp is not orphan (and returns 0 as in the base) or it must work clusterwide and thus does any orphan checking and processing and returns -EREMOTE.

- is_orphaned_pgrp() calls clusterproc_is_orphan_pgrp() which, if clusterproc is installed, will always return -EREMOTE, so will_become_orphaned_pgrp() is never called;

- reparent_thread() can now return <0 to indicate the tasklist_lock was released; relevant in forget_original_parent();

- __ptrace_unlink() can now return <0 to indicate the tasklist_lock was released; relevant in forget_original_parent();

- exit_notify() - moved setting of the task state to ZOMBIE to before the call to do_notify_parent() to ensure the parent does not miss the state change in the distributed case (do_notify_parent() may release the tasklist_lock);

- tests for p->flags & PF_REMOTE in wait_task_zombie(), wait_task_stopped() and forget_original_parent();

- added wait_task_continued() from 2.6.10 to allow a cleaner hook for supporting the new “continued” capability added in 2.6.9.

./kernel/ptrace.c
- in sys_ptrace(), change the use of "current" to be the new parent (so svrprocs can execute this code on behalf of remotely executing processes); also change __ptrace_unlink() to have a return code to indicate the tasklist_lock has been released and reacquired;

- tests for p->flags & PF_REMOTE in ptrace_attach()

./fs/exec.c
- in de_thread(), if the thread group leader is being ptraced, the code is modified to handle the fact that in some cases the tasklist_lock will have to be released (and that would allow the parent of the leader to exit or ptrace_detach()). To deal with those races, the proc_lock is obtained on both the leader and its parent (which might be remote), thus preventing the leader's parent from exiting and preventing anyone from trying to detach the leader while it is in a transient state;

./kernel/signal.c
- tests for p->flags & PF_REMOTE in do_notify_parent*();
- do_notify_parent() and do_notify_parent_cldstop() can return -EREMOTE if the tasklist_lock is released and reacquired;

./kernel/timer.c
- in sys_getppid(), if the parent is child_reaper, just return 1 for the ppid; this is done because in the clusterwide single init configuration, there is a local kernel child_reaper thread on each node and it is not process 1.

HOOK Functions:


These are now described in include/linux/clusterproc.h

Appendix B: Implementation Notes

The general execution strategy for the clusterproc clusterwide process model is that operations are executed on the node where the process is. To deal with the race of the process moving (migration or rexec()), operations which don't find the process on the node they thought it was running on will fail back and retry. The execution on a remote node is done by one of a set of svrprocs (identical kernel threads waiting to execute operations from other nodes). These svrprocs in turn call, whenever possible, the existing base routines needed to complete the operation (say kill_proc_info()). Since code like kill_proc_info() has the hooks to enable it to go remote, we need to make sure that if the process moved while we were trying to access it, we don't go remote from the remote node but in fact fail back to the original requesting node and retry. To ensure this, there will be a bit in a flag field in the clusterproc structure - a recursion bit to avoid executing remote code from svrprocs.
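A sketch of how the recursion bit might be consulted (names other than kill_proc_info() and the clusterproc hooks are hypothetical):

    /* inside the hooked remote path of e.g. kill_proc_info() */
    if (error == -ESRCH) {
        struct clusterproc *cp = current->clusterproc;

        /* a svrproc must not chase a moved process from here; fail
         * back so the original requesting node retries instead */
        if (cp && (cp->flags & CP_NO_RECURSION))
            return -EREMOTE;
        error = clusterproc_rmt_sigproc(sig, info, pid);
    }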

If a process is adding or deleting itself from a pgrp (setpgid, setsid, reap) and the result changes whether the node is on or off the nodelist for the pgrp, a message must be sent to the pgrp leader origin node to update the nodelist. At that node the pgrp_list_sleep_lock is obtained in order to update the nodelist. Since this lock is not being held back at the process node, it is possible someone else on that node could do an operation to change the nodelist status back and could be sending a message to that effect at the same time. These messages could be processed out of order. To get the correct final result, any message which is asking to do something (turn on or turn off the bit for this node) that wouldn't change anything is sent back to be recalculated and resent. On the second time in, the other message would have been processed. The same logic is needed when turning a node on or off w.r.t. whether it has non-orphan qualifying processes on it.
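A sketch of the bounce-back rule at the pgrp leader origin node (the structure and names beyond pgrp_list_sleep_lock are hypothetical):

    int pgrp_nodelist_update(struct pgrp_list *pl, int node, int want_on)
    {
        int err = 0;

        down(&pl->pgrp_list_sleep_lock);
        if (!!test_bit(node, pl->nodelist) == want_on) {
            /* no-op request: two opposing updates raced and arrived
             * out of order; bounce it back to be recalculated */
            err = -EAGAIN;
        } else if (want_on) {
            set_bit(node, pl->nodelist);
        } else {
            clear_bit(node, pl->nodelist);
        }
        up(&pl->pgrp_list_sleep_lock);
        return err;
    }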