Scalable Clusters
Jed Liu, 11 April 2002
Overview
- Microsoft Cluster Service
  - Built on Windows NT
  - Provides high availability services
  - Presents itself to clients as a single system
- Frangipani
  - A scalable distributed file system
Microsoft Cluster Service
- Design goals:
  - Cluster composed of commercial off-the-shelf (COTS) components
  - Scalability: able to add components without interrupting services
  - Transparency: clients see the cluster as a single machine
  - Reliability: when a node fails, its services can be restarted on a different node
Cluster Abstractions
- Nodes
- Resources
  - e.g., logical disk volumes, NetBIOS names, SMB shares, mail service, SQL service
- Quorum resource
  - Implements persistent storage for the cluster configuration database and change log
- Resource dependencies
  - Tracks dependencies between resources
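One plausible use of tracked dependencies is to decide the order in which resources are brought online. The sketch below assumes that use and is not MSCS code; the resource names and the `online_order` helper are hypothetical. It relies on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical resources; each key maps to the resources it depends on.
depends_on = {
    "disk_volume": set(),
    "network_name": set(),
    "smb_share": {"disk_volume", "network_name"},
    "sql_service": {"disk_volume", "network_name"},
}

def online_order(deps):
    """Return an order in which every resource comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = online_order(depends_on)
```

Reversing the list gives a safe order for taking the same resources offline. A circular dependency makes `static_order()` raise `graphlib.CycleError`.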
Cluster Abstractions (cont’d)
- Resource groups
  - The unit of migration: resources in the same group are hosted on the same node
- Cluster database
  - Configuration data for starting the cluster is kept in a database, accessed through the Windows registry
  - Database is replicated at each node in the cluster
Node Failure
- Active members broadcast periodic heartbeat messages
- Failure suspicion occurs when a node misses two successive heartbeat messages from some other node
- Regroup algorithm gets initiated to determine new membership information
- Resources that were online at a failed member are brought online at active nodes
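The "two successive missed heartbeats" rule can be sketched as a per-peer counter that resets whenever a heartbeat arrives. This is a minimal illustration only; the class name, the method name, and the idea of evaluating once per heartbeat period are assumptions, not details from MSCS.

```python
from dataclasses import dataclass, field

MISSED_LIMIT = 2  # from the slide: two successive missed heartbeats

@dataclass
class HeartbeatMonitor:
    """Hypothetical per-node tracker of heartbeats missed from each peer."""
    peers: set
    missed: dict = field(default_factory=dict)

    def end_of_period(self, heard_from):
        """Call once per heartbeat period with the peers heard from.

        Returns the peers that are now suspected of having failed.
        """
        suspects = set()
        for peer in self.peers:
            if peer in heard_from:
                self.missed[peer] = 0  # any heartbeat resets the count
            else:
                self.missed[peer] = self.missed.get(peer, 0) + 1
                if self.missed[peer] >= MISSED_LIMIT:
                    suspects.add(peer)
        return suspects
```

A non-empty result is the point at which a node would initiate the regroup algorithm.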
Member Regroup Algorithm
- Lockstep algorithm
  - Activate. Each node waits for a clock tick, then starts sending and collecting status messages
  - Closing. Determines whether partitions exist and whether the current node is in a partition that should survive
  - Pruning. Prunes the surviving group so that all nodes are fully connected
Regroup Algorithm (cont’d)
- Cleanup. Surviving nodes update their local membership information as appropriate
- Stabilized. Done
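The slides do not say how the pruning stage picks which nodes to drop. As an illustration of the goal only (a group in which every pair of nodes can communicate), here is a greedy sketch that repeatedly removes the least-connected node. The function name, the link representation, and the tie-break are all assumptions, and a greedy pass does not always find the largest fully connected group.

```python
def prune_to_fully_connected(nodes, links):
    """Greedy sketch: drop the least-connected node until all remaining
    nodes can talk to each other.

    nodes: iterable of node ids
    links: set of frozenset({a, b}) pairs that can communicate
    """
    group = set(nodes)

    def degree(n):
        # Number of other group members that n can currently reach.
        return sum(1 for m in group if m != n and frozenset((n, m)) in links)

    while any(degree(n) < len(group) - 1 for n in group):
        # Ties go to the highest id so every node computes the same answer.
        worst = min(sorted(group, reverse=True), key=degree)
        group.remove(worst)
    return group
```

For example, with links A-B, A-C, B-C and C-D, node D reaches only C and is pruned, leaving A, B and C.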
Joining a Cluster
- Sponsor authenticates the joining node
  - Denies access if applicant isn’t authorized to join
- Sponsor sends version info of config database
  - Also sends updates as needed, if changes were made while applicant was offline
- Sponsor atomically broadcasts information about applicant to all other members
- Active members update local membership information
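The join steps can be read as a small sponsor-side routine: check authorization, work out which database changes the applicant missed, then announce the new member. The sketch below assumes a change log of numbered entries and uses a plain set update as a stand-in for the atomic broadcast; every name in it is hypothetical.

```python
def sponsor_join(applicant, authorized, change_log, members):
    """Sketch of the sponsor's side of a join.

    applicant:  dict with "name" and "db_version" (last version it saw)
    authorized: set of node names allowed to join
    change_log: list of (version, change) pairs in increasing version order
    members:    set of active member names (updated in place)

    Returns the changes the applicant must apply, or None if access is denied.
    """
    if applicant["name"] not in authorized:
        return None  # applicant is not authorized to join
    # Changes made while the applicant was offline.
    missing = [(v, c) for v, c in change_log if v > applicant["db_version"]]
    # Stand-in for atomically broadcasting the new member to all others.
    members.add(applicant["name"])
    return missing
```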
Forming a Cluster
- Use local registry to find address of quorum resource
- Acquire ownership of quorum resource
  - Arbitration protocol ensures that at most one node owns quorum resource
- Synchronize local cluster database with master copy
Leaving a Cluster
- Member sends an exit message to all other cluster members and shuts down immediately
- Active members gossip about exiting member and update their cluster databases
Node States
- Inactive nodes are offline
- Active members are either online or paused
- All active nodes participate in cluster database updates, vote in the quorum algorithm, and maintain heartbeats
- Only online nodes can take ownership of resource groups
Resource Management
- Achieved by invoking calls through a resource control library (implemented as a DLL)
- Through this library, MSCS can monitor the state of the resource
Resource Migration
- Reasons for migration:
  - Node failure
  - Resource failure
  - Resource group prefers to execute at a different node
  - Operator-requested migration
- In the first case, resource group is pulled to new node
- In all other cases, resource group is pushed
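The push/pull rule reduces to a single test on the reason for migration. A tiny sketch, with hypothetical enum and function names:

```python
from enum import Enum, auto

class Reason(Enum):
    NODE_FAILURE = auto()
    RESOURCE_FAILURE = auto()
    PREFERENCE = auto()        # group prefers to execute at a different node
    OPERATOR_REQUEST = auto()

def migration_mode(reason):
    """Only a node failure leads to a pull; the old host is gone, so it
    cannot push the group anywhere."""
    return "pull" if reason is Reason.NODE_FAILURE else "push"
```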
Pushing a Resource Group
- All resources in the old node are brought offline
- Old host node chooses a new host
- Local copy of MSCS at new host brings up the resource group
Pulling a Resource Group
- Active nodes capable of hosting the group determine amongst themselves the new host for the group
  - New host chosen based on attributes that are stored in the cluster database
  - Since database is replicated at all nodes, decision can be made without any communication!
- New host brings online the resource group
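The "no communication" claim holds when every node runs the same deterministic function over the same replicated data. The sketch below assumes the relevant attribute is an ordered list of preferred owners. That attribute name and the selection rule are illustrative, since the slides do not say which attributes are used.

```python
import copy

def choose_new_host(group, db, active_nodes):
    """Pick the first preferred owner that is still active.

    db is a node's local replica of the cluster database; the
    "preferred_owners" field is a hypothetical attribute.
    """
    for node in db[group]["preferred_owners"]:
        if node in active_nodes:
            return node
    return None  # no active node is able to host the group

# Each surviving node evaluates the rule on its own replica.
master = {"sql_group": {"preferred_owners": ["N1", "N2", "N3"]}}
replicas = {n: copy.deepcopy(master) for n in ("N2", "N3")}
active = {"N2", "N3"}  # N1 has failed
decisions = {n: choose_new_host("sql_group", db, active)
             for n, db in replicas.items()}
```

Both survivors arrive at `N2` independently. This works only if the replicas and the membership view agree, which is what the global update and regroup mechanisms are meant to provide.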
Client Access to Resources
- Normally, clients access SMB resources using names of the form \\node\service
  - This presents a problem: as resources migrate between nodes, the resource name will change
- With MSCS, whenever a resource migrates, the resource’s network name also migrates as part of the resource group
  - Clients see only services and their network names, so the cluster becomes a single virtual node
Membership Manager
- Maintains consensus among active nodes about who is active and who is defined
- A join mechanism admits new members into the cluster
- A regroup mechanism determines current membership on startup or suspected failure
Global Update Manager
- Used to implement atomic broadcast
- A single node in the cluster is always designated as the locker
- Locker node takes over atomic broadcast in case original sender fails in mid-broadcast
Frangipani
- Design goals:
  - Provide users with coherent, shared access to files
  - Arbitrarily scalable to provide more storage, higher performance
  - Highly available in spite of component failures
  - Minimal human administration
    - Full and consistent backups can be made of the entire file system without bringing it down
    - Complexity of administration stays constant despite the addition of components
Server Layering
[Figure: layering diagram. Labels, top to bottom: User program (three), Frangipani file server (two), Petal distributed virtual disk service and Distributed lock service, Physical disks]
Assumptions
- Frangipani servers trust:
  - One another
  - Petal servers
  - Lock service
- Meant to run in a cluster of machines that are under a common administration and can communicate securely
System Structure
- Frangipani implemented as a file system option in the OS kernel
- All file servers read and write the same file system data structures on the shared Petal disk
- Each file server keeps a redo log in Petal so that when it fails, another server can access the log and recover
[Figure: system structure diagram. Labels: User programs, File system switch, Frangipani file server module, and Petal device driver (on each of two Frangipani machines); Network; Petal server and Lock server (on each of three server machines); Petal virtual disk]
Security Considerations
- Any Frangipani machine can access and modify any block of the Petal virtual disk
  - Must run only on machines with trusted OSes
- Petal servers and lock servers should also run on trusted OSes
- All three types of components should authenticate one another
- Network security also important: eavesdropping should be prevented
Disk Layout
- 2^64 bytes of addressable disk space, partitioned into regions:
  - Shared configuration parameters
  - Logs: each server owns a part of this region to hold its private log
  - Allocation bitmaps: each server owns parts of this region for its exclusive use
  - Inodes, small data blocks, large data blocks
Logging and Recovery
- Only log changes to metadata; user data is not logged
- Use write-ahead redo logging
  - Log implemented as a circular buffer
  - When log fills, reclaim oldest ¼ of buffer
- Need to be able to find end of log
  - Add monotonically increasing sequence numbers to each block of the log
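With increasing sequence numbers, the end of a circular log is the one slot whose successor holds a smaller number, or was never written. A minimal sketch, assuming slots are filled in order starting from slot 0 and sequence numbers are unique; the function name and the list-of-integers representation are illustrative:

```python
def find_log_end(seqs):
    """Return the index of the newest block in a circular log.

    seqs[i] is the sequence number stored in slot i, or None if the
    slot has never been written. Returns None for an empty log.
    """
    n = len(seqs)
    for i in range(n):
        nxt = seqs[(i + 1) % n]  # successor slot, wrapping around
        if seqs[i] is not None and (nxt is None or nxt <= seqs[i]):
            return i  # numbers stop increasing here, so this is the end
    return None
```

In `[5, 6, 7, 4]` the writer has wrapped around once: slot 2 holds the newest block and slot 3 the oldest.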
Concurrency Considerations
- Need to ensure logging and recovery work in the presence of multiple logs
  - Updates requested to same data by different servers are serialized
  - Recovery applies a change only if it was logged under an active lock at the time of failure
    - To ensure this, never replay an update that has already been completed
    - Keep a version number on each metadata block
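The version-number rule can be sketched as a replay loop that skips any log record whose version is not newer than the block already on disk. The record layout and the names below are assumptions for illustration, not Frangipani's on-disk format.

```python
def replay(log_records, blocks):
    """Apply redo records to metadata blocks, skipping completed updates.

    log_records: list of (block_id, new_version, new_data) in log order
    blocks:      dict block_id -> (version, data), updated in place

    Returns the number of records actually applied.
    """
    applied = 0
    for block_id, new_version, new_data in log_records:
        current_version, _ = blocks.get(block_id, (0, None))
        if new_version > current_version:
            blocks[block_id] = (new_version, new_data)
            applied += 1
        # Otherwise the block already reflects this update or a later one.
    return applied
```

Because of the version check, replaying the same log twice changes nothing the second time.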
Concurrency Considerations (cont’d)
- Ensure that only one recovery daemon is replaying the log of a given server
  - Do this through an exclusive lock on the log
Cache Coherence
- When lock service detects conflicting lock requests, current lock holder is asked to release or downgrade lock
- Lock service uses read locks and write locks
  - When a read lock is released, corresponding cache entry must be invalidated
  - When a write lock is downgraded, dirty data must be written to disk
  - Releasing a write lock = downgrade to read lock, then release
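The three cache rules compose naturally: releasing a write lock is a downgrade (flush dirty data) followed by a read-lock release (invalidate). The sketch below models one cached block on a file server, with a plain dict standing in for the Petal disk; the class and its methods are hypothetical.

```python
class CachedBlock:
    """One cached block on a file server, guarded by a read or write lock."""

    def __init__(self, disk, key):
        self.disk, self.key = disk, key
        self.lock = None    # None, "read", or "write"
        self.data = None
        self.dirty = False

    def acquire(self, mode):
        if self.data is None:
            self.data = self.disk.get(self.key)  # fill cache from disk
        self.lock = mode

    def write(self, value):
        if self.lock != "write":
            raise RuntimeError("write lock required")
        self.data, self.dirty = value, True

    def downgrade(self):
        """Write lock -> read lock: dirty data must reach disk first."""
        if self.lock != "write":
            raise RuntimeError("no write lock to downgrade")
        if self.dirty:
            self.disk[self.key] = self.data
            self.dirty = False
        self.lock = "read"

    def release(self):
        """Releasing a write lock = downgrade, then release the read lock."""
        if self.lock == "write":
            self.downgrade()
        self.data = None    # read lock released: invalidate the cache entry
        self.lock = None
```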
Synchronization
- Division of on-disk data structures into lockable segments is designed to avoid lock contention
  - Each log is lockable
  - Bitmap space divided into lockable units
  - Unallocated inode or data block is protected by lock on corresponding piece of the bitmap space
  - A single lock protects the inode and any file data that it points to
Locking Service
- Locks are sticky: they’re retained until someone else needs them
- Client failure dealt with by using leases
- Network failures can prevent a Frangipani server from renewing its lease
  - Server discards all locks and all cached data
  - If there was dirty data in the cache, Frangipani throws errors until file system is unmounted
Locking Service Hole
- If a Frangipani server’s lease expires due to temporary network outage, it might still try to access Petal
  - Problem basically caused by lack of clock synchronization
  - Can be fixed without synchronized clocks by including a lease identifier with every Petal request
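The proposed fix amounts to having the storage side reject any request that carries a stale lease identifier. The sketch below shows only that check. How Petal would learn which leases are still valid is not covered by the slides, so the `grant` and `expire` calls are purely hypothetical stand-ins.

```python
class PetalStub:
    """Toy stand-in for a Petal server that checks lease ids on writes."""

    def __init__(self):
        self.blocks = {}
        self.valid_leases = set()  # assumption: kept current by the lock service

    def grant(self, lease_id):
        self.valid_leases.add(lease_id)

    def expire(self, lease_id):
        self.valid_leases.discard(lease_id)

    def write(self, lease_id, block, data):
        """Apply the write only if the request carries a valid lease id."""
        if lease_id not in self.valid_leases:
            return False  # stale lease: request rejected, disk untouched
        self.blocks[block] = data
        return True
```

A server whose lease expired during a network outage then fails safely: its late writes are refused regardless of what its own clock says.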
Adding and Removing Servers
- Adding a server is easy!
  - Just point it to a Petal virtual disk and a lock service, and it automagically gets integrated
- Removing a server is even easier!
  - Just take a sledgehammer to it
  - Alternatively, if you want to be nicer, you can flush dirty data before using the sledgehammer
Backups
- Just use the snapshot features that are built into Petal to do backups
  - Resulting snapshot is crash-consistent: reflects state reachable if all Frangipani servers were to crash
  - This is good enough: if you restore the backup, recovery mechanism can handle the rest
Summary
- Microsoft Cluster Service
  - Aims to provide reliable services running on a cluster
  - Presents itself as a virtual node to its clients
- Frangipani
  - Aims to provide a reliable distributed file system
  - Uses metadata logging to recover from crashes
  - Clients see it as a regular shared disk
  - Adding and removing nodes is really easy