introduction: · web viewthis out-of-band script makes the os update process semi-automated and...


Upload: dotuong

Post on 27-Apr-2018


OS update Script Package

Introduction:

This package provides a manually initiated script that the cluster administrator can run. The script downloads the applicable OS patches and deploys them safely to the VM instances. This out-of-band script makes the OS update process semi-automated and enables the cluster administrator to update all Service Fabric (SF) cluster nodes from one of the cluster VMs.

Contents

Introduction:
Where can I get the Script?
EULA, Limitations and other disclaimers:
What does the Script do?
Prerequisites to using this script:
    User account that has admin access to all nodes
    Ports that need to be opened on all nodes
    Automatic Windows Update must be disabled on all nodes
    Check for other Administrative active operations on the cluster
    Check for health of the cluster
Instructions to run the Script
    Output of the script:
    Content of the VMname_WU.txt
    Output on the PowerShell console:
Frequently Asked questions:
Troubleshooting help
    Disabling or Enabling of node failed
    Restart of Node Failed
    Health check failed and resulted in aborting of script
    Updates were skipped on some nodes

Where can I get the Script?

Download the Script from here


EULA, Limitations and other disclaimers:

1. The PsExec utility is used to execute the script on remote machines. The script currently accepts the EULA required by this utility on behalf of the administrator. If this is unacceptable, do not use the script.

2. This solution has not been tested alongside any other parallel workflow (Azure updates, cluster upgrade, etc.) that can disable or enable nodes and thus affect the availability of the upgrade domain (UD).

3. The script fails if it detects an ongoing cluster upgrade. However, it does not handle the case where a cluster upgrade starts after the script has started.

4. This script does not support rollback. If a patch was installed successfully, it remains on that node, even if the script exits for cluster health reasons before proceeding to the next upgrade domain.

5. This script is provided for use at the user's own risk. Test it until you are comfortable with it before using it in any production environment. We encourage you to review the .PS1 files and all other content of the package before using it.

6. WU updates: Some Windows Update (WU) updates need user interaction during installation. Currently all such updates are skipped and logged accordingly. The administrator can go through the individual node logs and install these updates manually if needed.

What does the Script do?

The script must be executed from one of the cluster VMs. It goes through all cluster nodes, upgrade domain (UD) by upgrade domain, and downloads and installs Windows updates.

Before installing updates on a node, the script disables the node to ensure that it is not running any active SF workload. The script also takes care of any pending reboots. After installation and reboot, the script enables the node and checks that the cluster aggregate health has not been compromised by the update.

The package contains the following:

- WindowsUpdateUtility.ps1: The main script, which orchestrates the OS update on cluster nodes.
- WindowsUpdateAgent.ps1: A script that calls Windows Update (WU) APIs to search for, download and install updates on the local node. WindowsUpdateUtility.ps1 copies this script to each cluster node.
- PsExec utility: WindowsUpdateUtility.ps1 uses this utility to execute WindowsUpdateAgent.ps1 remotely on all SF nodes.
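To get a feel for what a search through the Windows Update APIs looks like, here is a minimal sketch that uses the Windows Update Agent COM object (Microsoft.Update.Session). The function name Get-ApplicableUpdate and the shape of its output are assumptions made for illustration; this is not the code of WindowsUpdateAgent.ps1, which may work differently.

```powershell
# Illustrative sketch only; Get-ApplicableUpdate is a hypothetical helper,
# not part of the package.
function Get-ApplicableUpdate {
    param(
        # Windows Update search criteria; "IsInstalled=0" matches updates
        # that are not yet installed.
        [string] $Query = 'IsInstalled=0'
    )

    # Windows Update Agent COM API (available on Windows only).
    $session  = New-Object -ComObject Microsoft.Update.Session
    $searcher = $session.CreateUpdateSearcher()
    $result   = $searcher.Search($Query)

    foreach ($update in $result.Updates) {
        [pscustomobject]@{
            Title          = $update.Title
            # Updates whose EULA has not been accepted cannot be installed as-is.
            EulaAccepted   = $update.EulaAccepted
            # Updates that may prompt the user are the kind the guide says are skipped.
            NeedsUserInput = $update.InstallationBehavior.CanRequestUserInput
        }
    }
}
```

Running Get-ApplicableUpdate -Query "IsInstalled=0 and BrowseOnly=0" on a node lists the updates that such a query would select, which can help when choosing a custom query.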

Optional reading: Here is a brief summary of the steps that the script performs, in order, once it is started:

1. Collect the candidate nodes: The script connects to the cluster and builds the list of nodes on which OS updates will be initiated. By default, the script updates only active nodes. However, users can pass an option to update disabled nodes or specify a list of specific nodes. Refer to the ForceUpdateDisabledNodes and NodeNames options in the parameter list under "Instructions to run the Script" for more details.

2. Prepare steps: The script copies WindowsUpdateAgent.ps1 to all applicable nodes, then searches for and downloads Windows updates on all nodes simultaneously. By default, the script searches for updates with the "IsInstalled=0" query. However, users can override this by passing an appropriate option. Refer to the WUQuery option in the parameter list for more details.

3. Pre-Reboot steps: Once updates are downloaded on all nodes, the script walks the upgrade domains to install the OS patches. Before starting installation on a node, the script gracefully disables and stops the fabric node to ensure that the node being updated is not running any Service Fabric workload (system or customer). Once the node is gracefully shut down, the script installs the previously downloaded OS updates on it. Note: By default, the script installs updates on all nodes within a UD in parallel. If the administrator feels this could be too disruptive for the cluster, the SerializedUpdateInUD option can be used to serialize the updates within a UD. Refer to the parameter list for more details on this option.

4. Reboot steps: A VM/node reboot might be required to complete the installation. The script restarts all such nodes in parallel and waits for them to come back up. Whether this step runs depends on the result of the Pre-Reboot steps.

5. Post-Reboot steps: Once installation has completed successfully, the script enables the nodes that it disabled in the Pre-Reboot steps. However, if an updated node was already in a disabled state (>= Restart intent) prior to the update, the script does not change its state. After enabling the nodes, the script performs a cluster health check and continues to the next UD if the aggregate health state of the cluster is healthy. Refer to the health check timeout and retry options in the parameter list for more details.

6. Last step: The script updates the node from which it is being run only after all other nodes in the cluster have been updated. If this node requires a reboot to complete the installation, the script sets up a "RunOnce" job that reactivates the node once it has rebooted and the administrator has logged back into it. If the administrator does not log back in (for example via Terminal Services/RDP), the "RunOnce" job is not started and the node remains down.
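The candidate selection in step 1 and the upgrade domain walk in step 3 can be pictured with a small sketch. Get-UpdateBatch is a hypothetical helper written for this guide, not a function of WindowsUpdateUtility.ps1. It assumes node objects that expose NodeName, UpgradeDomain and NodeStatus properties, as the objects returned by Get-ServiceFabricNode do, and it assumes that an "active" node is one whose status is Up.

```powershell
# Illustrative sketch only; Get-UpdateBatch is a hypothetical helper.
function Get-UpdateBatch {
    param(
        # Node objects with NodeName, UpgradeDomain and NodeStatus properties.
        [Parameter(Mandatory)] [object[]] $Nodes,
        # Mirrors the script option of the same name: also include nodes that are not Up.
        [switch] $ForceUpdateDisabledNodes
    )

    # By default only nodes in the Up state are candidates.
    $candidates = $Nodes | Where-Object {
        $ForceUpdateDisabledNodes -or $_.NodeStatus -eq 'Up'
    }

    # One batch per upgrade domain. Sorting by name (a plain string sort) only
    # makes the order deterministic; the real script may order UDs differently.
    $candidates |
        Group-Object -Property UpgradeDomain |
        Sort-Object -Property Name |
        ForEach-Object {
            [pscustomobject]@{
                UpgradeDomain = $_.Name
                NodeNames     = @($_.Group.NodeName)
            }
        }
}
```

On a cluster node, after Connect-ServiceFabricCluster, calling Get-UpdateBatch -Nodes (Get-ServiceFabricNode) shows which nodes would be handled together in each upgrade domain.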

Prerequisites to using this script:

User account that has admin access to all nodes.

The script needs to be run from one of the Service Fabric nodes, under a user account with administrative privileges to perform the following operations:

- Copy files to other Service Fabric node machines. Without this privilege, the script fails when it tries to copy the WindowsUpdateAgent script to remote machines.
- Start processes on other Service Fabric nodes using PsExec.
- Restart other Service Fabric nodes using Restart-Computer.

Apart from holding these privileges on the node where the script is run, the user also needs to ensure that firewall rules do not prevent the above operations from succeeding.

Ports that need to be opened on all nodes.

The WindowsUpdateUtility.ps1 script requires inbound traffic to be allowed on the following ports:

- 135: used for RPC.
- 445: used for Windows File Sharing (SMB).

An administrator needs to open these ports on all the Service Fabric nodes for the script to work.

- For Azure clusters: you can use the same process as for opening application ports. Refer to this document for guidance: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-upgrade#cluster-configurations-that-you-control


- For on-premises clusters: the following cmdlet can be used as a reference for working with Windows Firewall. However, users are free to use any other method to open the firewall on their SF nodes.

    New-NetFirewallRule -Name "SF-WU-Agent-Script-In" -DisplayName "Service Fabric Windows Update Agent (RPC, SMB - IN)" -Protocol TCP -LocalPort 135,445

Automatic Windows Update must be disabled on all nodes.

Ensure that automatic updates are disabled. Leaving them enabled will interfere with the running of the script.

Check for other Administrative active operations on the cluster

An administrator has to ensure that no other parallel workflows or operations that can change the state of an SF node (e.g. a fabric upgrade) are running on the cluster. Performing any administrative action while the cluster OS updates are in progress can result in an unhealthy cluster. The SF API that the script uses to gracefully shut down the VM/node is the same one that the fabric upgrade uses, so the safety of the operation is decided by the Service Fabric failover manager.

Check for health of the cluster.

This script does not start, or proceed to the next upgrade domain, if the cluster is unhealthy. If you want to run the script on an unhealthy cluster, pass the ForceUpdateUnhealthyCluster parameter.
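Before starting the script, the last two prerequisites can be checked by hand. The sketch below is one possible way to do that. Test-ReadyForOsUpdate is a hypothetical helper, not part of the package, and it makes two assumptions: only an aggregated health state of Ok counts as healthy, and any upgrade state containing "InProgress" or "Pending" counts as an ongoing upgrade. The checks inside WindowsUpdateUtility.ps1 itself may differ.

```powershell
# Illustrative sketch only; Test-ReadyForOsUpdate is a hypothetical helper.
function Test-ReadyForOsUpdate {
    param(
        # Aggregated cluster health, e.g. Ok, Warning or Error.
        [Parameter(Mandatory)] [string] $AggregatedHealthState,
        # Cluster upgrade state, e.g. RollingForwardCompleted or RollingForwardInProgress.
        [Parameter(Mandatory)] [string] $UpgradeState
    )

    # Assumption: anything other than Ok is treated as unhealthy.
    $healthy = $AggregatedHealthState -eq 'Ok'
    # Assumption: pending or in-progress states mean an upgrade is running.
    $upgradeRunning = $UpgradeState -match 'InProgress|Pending'

    return ($healthy -and -not $upgradeRunning)
}

# Usage on a cluster node (requires the Service Fabric PowerShell module):
#   Connect-ServiceFabricCluster
#   $health  = Get-ServiceFabricClusterHealth
#   $upgrade = Get-ServiceFabricClusterUpgrade
#   Test-ReadyForOsUpdate -AggregatedHealthState $health.AggregatedHealthState `
#                         -UpgradeState $upgrade.UpgradeState
```

If the function returns False, it is worth investigating the health report or waiting for the upgrade to finish before launching the OS update.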


Instructions to run the Script

1. Prerequisites
   a. Ensure that all prerequisites described in the "Prerequisites to using this script" section are met on all the nodes of your cluster.

2. Remote Desktop (RDP) into one of the active VMs/nodes that make up the cluster. It does not matter which one. (Instructions on how to RDP into an Azure VMSS instance: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-nodetypes )

3. Copy the SFWindowsUpdate.zip file to that VM and unzip it. You can download it directly from here.

4. Run WindowsUpdateUtility.ps1. The entire update process can take 2-3 hours per UD if the nodes have not been updated for a long time.

   a. The following optional parameters can be passed to the script. It is strongly recommended that you review all of them:

Parameter Name
    Description

NodeNames
    Custom list of nodes to which the OS update needs to be applied. If this list is not passed, all active nodes are considered for the OS update.

ForceUpdateUnhealthyCluster
    The script checks the cluster aggregate health before and after installing updates and stops the rollout if the cluster is found to be unhealthy. When this option is passed, all health-related checks are skipped for all operations. This option can be used to force the update of an unhealthy cluster.

ForceUpdateDisabledNodes
    By default, the script updates only active nodes. This option forces the OS update on disabled nodes. The state of each node is preserved after the update, with one exception: nodes disabled with pause intent are enabled after the OS update.

AcceptWUEula
    Some WU updates need the administrator to accept a EULA. If this option is passed, the script accepts the EULA for all such updates on behalf of the administrator. Without this option, all such updates are skipped.

SerializedUpdateInUD
    By default, the script performs the installation in parallel on all nodes within a UD. If the administrator considers this too disruptive for the cluster, this option can be passed to serialize the installation within a UD.

RebootTimeoutSec
    Specifies the duration, in seconds, that the script waits for a reboot operation to finish on a remote node. The default value is 600.

HealthCheckWaitDurationSec
    Specifies the duration, in seconds, that the script waits before it performs the initial health check after it finishes the update on an upgrade domain. The default value is 300.

HealthCheckStableDurationSec
    Specifies the duration, in seconds, that the script waits to verify that the cluster is stable before moving to the next upgrade domain or completing the update. The default value is 0.

HealthCheckRetryTimeoutSec
    Specifies the duration, in seconds, after which the script retries the health check if the previous health check failed. The default value is 30.

HealthCheckRetryCount
    Specifies the number of times the script checks the cluster's health before declaring the cluster unhealthy. The default value is 10.

WUCommandTimeoutInMinutes
    Specifies the duration, in minutes, allowed for WU commands to complete. If WU commands do not complete within this timeout, the script prints the details of the node, and administrator intervention is needed to install updates on all failed nodes. Refer to the troubleshooting section for more details. The default value is 180.

SFCommandTimeoutInSec
    Specifies the duration, in seconds, that the script waits for a Service Fabric command (Enable, Disable, Stop ServiceFabricNode) to complete. If the command does not complete within this timeout, the script prints the details of the node, and administrator intervention is needed to mitigate the problem. Refer to the troubleshooting section for more details. The default value is 300.

WUQuery
    An advanced option that specifies the query to be used when searching for Windows updates to install. The default value is "IsInstalled=0". Refer to https://msdn.microsoft.com/en-us/library/windows/desktop/aa386526(v=vs.85).aspx for valid WU queries.

LogDirectory
    Specifies the root directory for storing log files. By default, logs are stored in the current directory. On every run, the script creates a new log directory named "log_MM_DD_YYYY_HH_MM_SS" under this root directory. All local and remote node logs are available under this folder at the end of the script run.

Examples

- To update all enabled nodes:

    .\WindowsUpdateUtility.ps1

- To update all enabled and disabled nodes:

    .\WindowsUpdateUtility.ps1 -ForceUpdateDisabledNodes

- To update specific nodes with non-optional updates:

    .\WindowsUpdateUtility.ps1 -NodeNames _temp1_0,_temp1_1,_temp1_2 -WUQuery "IsInstalled=0 and BrowseOnly=0"
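When several options are combined, PowerShell splatting keeps the command line readable. The sketch below only uses parameter names from the parameter list; the values (node names, timeouts, log path) are made up for illustration, and it is an assumption that SerializedUpdateInUD behaves like a switch, as ForceUpdateDisabledNodes does in the examples.

```powershell
# Hypothetical values; adjust them to your cluster.
$updateArgs = @{
    NodeNames                    = '_temp1_0', '_temp1_1'
    SerializedUpdateInUD         = $true          # assumed to be a switch parameter
    HealthCheckStableDurationSec = 120
    RebootTimeoutSec             = 900
    LogDirectory                 = 'C:\SFWU\Logs' # hypothetical log root
}

# Run the script from the folder where the package was unzipped.
$scriptPath = '.\WindowsUpdateUtility.ps1'
if (Test-Path -Path $scriptPath) {
    & $scriptPath @updateArgs
}
else {
    Write-Warning "WindowsUpdateUtility.ps1 was not found in the current directory."
}
```

Each key of the hashtable is passed as a named parameter, so this call is equivalent to typing the same options on one long command line.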

Output of the script:

In addition to posting detailed output to the console, the script writes it to a folder called log_<date>_<time>. The log includes information on all patches and indicates which ones were installed.

Content of the log folder

[Figure: the files you can expect to see in the log folder.]

Content of the VMname_WU.txt

[Figure: snapshot of a VMname_WU.txt file.]

Content of output.txt

This is the progress report on what the script did or is doing.

Output on the PowerShell console:

[Figure: screenshot of the PowerShell console.] The screenshot shows a pop-up that appeared once the script had finished installing the patches on all the other machines and wanted to reboot the VM where the script was running. Selecting "Yes" and then RDPing back into the VM allowed the installation of the patches to complete.

Frequently Asked questions:

Q. Where can I see the logs for my last run?
A. Logs are stored under the same folder where the script is present, unless a different root directory is specified with the LogDirectory parameter.

Q. Can I use this script for a standalone setup?
A. Yes.

Q. My script prompts me to restart the computer around the end of execution.
A. Restarting the machine is required to complete the installation of OS updates. The script processes all the other nodes of the cluster first and updates its own node last. Sometimes a restart is needed to complete the installation. The post-installation steps are performed by the same script when the user logs in to the machine (using the Windows RunOnce feature), thereby completing the OS updates for the entire cluster.

Q. The script fails with some error.
A. Refer to the mitigation steps in the console logs or file logs. For most of the known cases, the script prints the mitigation steps needed to restore the cluster to its original state. To re-run, you can run the script either on the entire cluster or on just a subset of nodes using the -NodeNames parameter.

Q. My cluster went into an unhealthy state during the OS update.
A. Refer to the troubleshooting section below for possible solutions.

Q. I have disabled nodes in my cluster. Can I apply the OS update to them?
A. By default, the script only processes nodes that are in the "Up" state. However, the user can include disabled nodes in the OS update by using the ForceUpdateDisabledNodes switch. The script installs the updates and leaves the nodes in the same state it found them in.

Q. My cluster is unhealthy, but I need to do the OS update urgently.
A. You can use the ForceUpdateUnhealthyCluster switch to update the nodes of an unhealthy cluster.
WARNING: Using the ForceUpdateUnhealthyCluster option skips all health checks within the script (i.e. the health checks that happen after updating the nodes of a UD and before proceeding to update the nodes of the next UD).

Q. Why does my script take so long to run?
A. The time taken by the script mostly depends on the following factors:

1. Number of updates available for download and installation: Nodes are processed in parallel to speed things up. The average time taken to download and install an update should not exceed a couple of hours.

2. Speed of the VM on which the script is run: The script uses PowerShell jobs to achieve parallelism. If the VM is not fast enough, the script takes more time to execute.

Troubleshooting help

Disabling or Enabling of node failed

    Nodes for PreRebootSteps step are _temp_1
    Disabling node _temp_1
    Waiting for _temp_1 to Disable.
    Disable-ServiceFabricNode couldn't disable the node _temp_1 in SFCommandTimeoutInSec=0 seconds
    Following nodes are in disabled/disabling and might be in stopped state : _temp_1
    Run following commands for each of the above nodes to recover cluster in original state.
    Start-ServiceFabricNode -NodeName <NodeName>
    Enable-ServiceFabricNode -NodeName <NodeName>
    Aborting script.
    Node _temp_1 couldn't be disabled within SFCommandTimeoutInSec=0 seconds
    At C:\SFWU\WindowsUpdateUtility.ps1:494 char:21
    + throw "Node $($node.NodeName) couldn't be disabled within SF ...
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : OperationStopped: (Node _temp_1 co...InSec=0 seconds:String) [], RuntimeException
        + FullyQualifiedErrorId : Node _temp_1 couldn't be disabled within SFCommandTimeoutInSec=0 seconds

The script disables a node before it installs updates on it. If the node cannot be disabled within the specified timeout, the script stops. To restore the cluster to its original state, run:

- Start-ServiceFabricNode for nodes which were stopped as part of the script.
- Enable-ServiceFabricNode for nodes which were disabled as part of the script.

You may try re-running the script with a higher value for SFCommandTimeoutInSec. The same troubleshooting steps apply when enabling a node fails.

Restart of Node Failed

    Restart-Computer : Failed to restart the computer 10.0.0.7 with the following error message: The computer did not finish restarting within the specified time-out period..
    At line:1 char:1
    + Restart-Computer -ComputerName 10.0.0.7 -Wait -For Wmi -Timeout 1 -Force -ErrorA ...
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : OperationTimeout: (10.0.0.7:String) [Restart-Computer], RestartComputerTimeoutException
        + FullyQualifiedErrorId : RestartComputerTimeout,Microsoft.PowerShell.Commands.RestartComputerCommand

This can happen if the current computer (on which the script was started) is not able to restart a remote computer. Possible reasons are:

- Not enough privileges. Refer to the prerequisites section.
- A firewall might be blocking the operation. Refer to the prerequisites section.
- The specified timeout might not be sufficient. Try increasing RebootTimeoutSec when launching the script.

Health check failed and resulted in aborting of script

This can happen if you try to run the script on an unhealthy cluster. We recommend bringing the cluster to a healthy state before you run the script. While the script is running, the health check is done after updates complete on a UD and before the script proceeds to update the next UD. Users must check the health report to find out why the cluster became unhealthy.

The following steps can be taken to restore the cluster to its original state:

- If the OS updates caused the nodes to go into a bad state: uninstall the OS updates on the individual nodes manually. Users can refer to the log files from the previous run to find out which updates were installed.

- If the cluster required more wait time or stabilisation time: try changing the values of the following parameters as appropriate. Refer to the user guide or "help .\WindowsUpdateUtility.ps1" for more details about these parameters.
    o HealthCheckWaitDurationSec
    o HealthCheckStableDurationSec
    o HealthCheckRetryTimeoutSec
    o HealthCheckRetryCount
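To help choose values for these parameters, the sketch below models how the wait and retry settings could interact: one initial wait, then up to HealthCheckRetryCount checks separated by HealthCheckRetryTimeoutSec. Wait-ClusterHealthy is a hypothetical function written for this guide, based only on the parameter descriptions. The actual loop inside WindowsUpdateUtility.ps1 may differ, and HealthCheckStableDurationSec is deliberately left out.

```powershell
# Illustrative sketch only; Wait-ClusterHealthy is a hypothetical helper.
function Wait-ClusterHealthy {
    param(
        # Script block that returns $true when the cluster is healthy.
        [Parameter(Mandatory)] [scriptblock] $HealthCheck,
        [int] $HealthCheckWaitDurationSec = 300,
        [int] $HealthCheckRetryTimeoutSec = 30,
        [int] $HealthCheckRetryCount      = 10
    )

    # Give the cluster time to settle before the first check.
    Start-Sleep -Seconds $HealthCheckWaitDurationSec

    for ($attempt = 1; $attempt -le $HealthCheckRetryCount; $attempt++) {
        if (& $HealthCheck) {
            return $true
        }
        # Pause between attempts, but not after the last one.
        if ($attempt -lt $HealthCheckRetryCount) {
            Start-Sleep -Seconds $HealthCheckRetryTimeoutSec
        }
    }

    # Every attempt failed: the cluster is declared unhealthy.
    return $false
}

# Usage on a cluster node (requires the Service Fabric PowerShell module):
#   Wait-ClusterHealthy -HealthCheck {
#       (Get-ServiceFabricClusterHealth).AggregatedHealthState -eq 'Ok'
#   }
```

Under this model, with the default values, a cluster that never becomes healthy is declared unhealthy after roughly 300 + 9 x 30 = 570 seconds, plus the time taken by the checks themselves. Raising HealthCheckRetryCount or HealthCheckRetryTimeoutSec extends that window.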

Updates were skipped on some nodes

    Couldn't complete WU Download apis in 180 minutes for _temp_1 node. Aborting Job. Skipping this node for installation.

If you see that a node was skipped for the download or install operation, this could be due to:

- The WU operation timing out. In this case, try increasing the WUCommandTimeoutInMinutes parameter of the script.
- The WU operation failing. In this case, more investigation might be required to find out why the WU operation (download or install) failed. Logs from the node where the failure happened would help in providing further details on why the operation failed.

In either case, we recommend running the script again on the failed nodes. This can be done by running the script with -NodeNames to install updates on specific nodes rather than on the whole cluster.
