How to Handle Backup Failures


Title: Troubleshooting Backup Failures
Purpose: Document Netbackup Support function

Date | Your name | Description of Change | Effective date
10/18/2006 | Anthony Nguyen | Document Creation | 10/18/2006
10/30/2006 | Anthony Nguyen | Changed escalation process per Ron Caplinger and Bryce Pier. | 10/30/2006
11/29/2006 | Jackie Schlitz | Added the following comments: Whenever a new Netbackup alert appears in OVOU, please verify that it is not related to a Sev 2 ticket or any other ticket. If a backup job is re-started, please monitor it and keep the job id handy. | 11/29/2006

INDEX

Responsibilities/Escalation
HPOV Alert
Netbackup Administration Console
Error Numbers and Resolution
Netbackup Troubleshooter
Restarting Failed Backups
Manually Stopping a Backup Job
Troubleshooting Windows NetBackup Clients
Troubleshooting Unix/Linux NetBackup Clients
Shutting off VSP (Veritas SnapShot Provider)
Netbackup Drives Alerts
Contacting IBM on hardware issues

Responsibilities/Escalation

Infrastructure Support attempts to remediate all backup failures and creates a trouble ticket to track each failure. Any failure that Infrastructure Support cannot resolve is escalated to L3-ENT-BACK. Normally, for a single failure, a Sev 3 Med ticket is sufficient. For error code 96 (Out of Media), send a Sev 3 Critical ticket to L3-ENT-BACK.

If there are multiple errors of the same kind, first follow the normal troubleshooting procedures for those error types. Examples: more than two tape drives are down and cannot be brought UP; both tape drives on the same Media Server are down and cannot be brought UP; multiple 219 (Storage Unit Unavailable) or multiple 84 (Media Write Error) errors. If still unable to resolve, send a Sev 3 Critical ticket to L3-ENT-BACK.

Teradata Backups

At this time, Teradata backups are the responsibility of the backup team. Teradata backups can be distinguished by the policy they belong to: the policy name starts with "TD_".

Example: TD_DS32BKP, TD_DS35BKP, TD_1a_Inv_Item_Dict, etc.

For Teradata backups, create a Sev 3 ticket to L3-ENT-BACK.
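The escalation rules above can be summarized as a small decision function. This is only a sketch for illustration: the function name, its parameters and the returned strings are hypothetical, and the ordering of the checks (Teradata first, then error 96) is an assumption drawn from the wording of this section.

```python
def escalation_ticket(policy, status_code, same_kind_failures=1, resolved=False):
    """Return the ticket severity to send to L3-ENT-BACK, or None if no escalation.

    policy             -- backup policy name (Teradata policies start with "TD_")
    status_code        -- NetBackup error number of the failed job
    same_kind_failures -- how many failures of this same kind are currently open
    resolved           -- True if Infrastructure Support already fixed the failure
    """
    # Teradata backups belong to the backup team; the text gives no Med/Critical level.
    if policy.startswith("TD_"):
        return "Sev 3"
    # Error 96 (Out of Media) is always sent as Critical.
    if status_code == 96:
        return "Sev 3 Critical"
    # Anything Infrastructure Support resolved is only tracked, not escalated.
    if resolved:
        return None
    # Multiple unresolved errors of the same kind are Critical.
    if same_kind_failures > 1:
        return "Sev 3 Critical"
    # A single unresolved failure is normally Sev 3 Med.
    return "Sev 3 Med"
```

For example, `escalation_ticket("BDC_Wintel_Prod_File_Z_Drive", 54)` yields a Sev 3 Med ticket, while the same call with `resolved=True` yields no escalation.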


HPOV Alert

Infrastructure Support will get an alert when a backup fails. The alerts will appear in the OVOW console with the message group NBup. The alerts will also appear in the OVOU console with the message group DCTech. The alert will look like the following:

HPOV dxp11uxa.bestbuy.com NBU_JobFailure.log Entry: ds27bkup dvp03fc2z BDC_Wintel_Prod_File_Z_Drive Monthly_Cumulative 3896202 3896202 10/09/2006 06:30:06 54 :timed out connecting to client on 10/09/2006 at 07:07:49

The alert provides the name of the media server (ds27bkup), the name of the client (dvp03fc2z), the backup policy (BDC_Wintel_Prod_File_Z_Drive), the schedule being executed (Monthly_Cumulative), the date & time the job started (10/09/2006 06:30:06), the error number (54), the reason why the backup failed (timed out connecting to client), and the date & time the job abended (10/09/2006 at 07:07:49).

Since NetBackup retries failed backup jobs twice, you should only get one job failure alert, after the 2nd retry.

In addition to the backup job failure alerts, Infrastructure Support will also get alerts on NetBackup Drives.

Whenever a new NetBackup alert appears in OVOU, please verify that it is not related to a Sev 2 ticket or any other ticket.
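If alerts need to be logged or sorted, the fields described above can be pulled out of the alert text with a regular expression. The sketch below is based only on the single sample alert shown in this section, so the pattern is an assumption about the general layout. The two numeric fields after the schedule are not described in the text; they are kept as-is under the hypothetical name `job_id_fields`. Dates are assumed to be month/day/year.

```python
import re
from datetime import datetime

SAMPLE = ("HPOV dxp11uxa.bestbuy.com NBU_JobFailure.log Entry: ds27bkup dvp03fc2z "
          "BDC_Wintel_Prod_File_Z_Drive Monthly_Cumulative 3896202 3896202 "
          "10/09/2006 06:30:06 54 :timed out connecting to client on 10/09/2006 at 07:07:49")

# Layout assumed from the one sample alert above.
ALERT_RE = re.compile(
    r"Entry:\s+(?P<media_server>\S+)\s+(?P<client>\S+)\s+(?P<policy>\S+)\s+"
    r"(?P<schedule>\S+)\s+(?P<id1>\d+)\s+(?P<id2>\d+)\s+"
    r"(?P<started>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+(?P<status>\d+)\s*:"
    r"(?P<reason>.*?) on (?P<ended>\d{2}/\d{2}/\d{4} at \d{2}:\d{2}:\d{2})\s*$"
)

def parse_alert(text):
    """Split a job failure alert into its fields; return None if it does not match."""
    match = ALERT_RE.search(text)
    if match is None:
        return None
    return {
        "media_server": match["media_server"],
        "client": match["client"],
        "policy": match["policy"],
        "schedule": match["schedule"],
        # Undocumented numeric fields, kept verbatim (meaning is an assumption).
        "job_id_fields": (match["id1"], match["id2"]),
        "started": datetime.strptime(match["started"], "%m/%d/%Y %H:%M:%S"),
        "status": int(match["status"]),
        "reason": match["reason"].strip(),
        "ended": datetime.strptime(match["ended"], "%m/%d/%Y at %H:%M:%S"),
    }
```

Calling `parse_alert(SAMPLE)` returns a dictionary whose `status` is 54 and whose `client` is dvp03fc2z, which is enough to look the job up in the Activity Monitor.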

Netbackup Administration Console

Backup failures can be monitored via the Activity Monitor on the NetBackup Administration Console. There are two ways to access the NetBackup Administration Console.

1.       Java GUI - For the DC Tech Team, the Netbackup Administration Console via the Java GUI has limited access.  Please use the Java GUI to view the Activity Monitor for troubleshooting backup failures.

2.       Media Servers - The Netbackup Administration Console on the media servers will give you admin access when you log on with your -a account.  Use the Netbackup Administration Console from the Media Servers to rerun failed backup jobs.  

Error Numbers and Resolution

Some of the common issues that arise can be resolved. Once resolved, the job or jobs can generally be restarted. The common error numbers are listed below:

Hung Jobs - Jobs that are running but show no progress in several hours (or days).

1. Cancel the job.
2. Restart the job that has failed.

Error 1 – “Operation was partially successful”.  Most jobs that end with a status 1 will not alert but there are some that are significant and will cause an alert. These are SQL Cluster backups and Exchange Store backups.

1. If it’s for an SQL Cluster backup, re-run the job.   For Exchange backups, see “Self Correcting Jobs.”

Error 6 – “Errors caused the user backup to fail.”

1. This is an issue that happens when a database agent backup fails.
2. Database backups are run via RMAN and will reschedule on their own. If there are many in a short time period, contact the Oracle DBA On-call (L2-ORACLE-DBA).

Error 12 – “File open failed.”

1. A possible cause is a permission problem with the file. Check permissions on the file.
2. Restart the job.

Error 13 – “Errors caused the user backup to fail”

1. Restart the job that has failed.

Error 25 – “Cannot connect on socket.”

1. Verify that the Master Servers (dv10bkup, dxp11uxa, dxp10uxa) are in the server list on the client.
   a. Log into the client and launch the “Backup, Archive, and Restore” interface.
   b. Click on “File” and select “Specify Netbackup Machines and Policy Type…”
   c. Under “Serverlist”, verify dv10bkup, dxp11uxa, and dxp10uxa are listed. Also, ensure dv10bkup is set to “current”.
2. Verify that the %SystemRoot%\system32\drivers\etc\services file has the following entries:
   a. bpcd     13782/tcp
   b. bprd     13720/tcp
   c. bpdbm    13721/tcp
3. Restart the “NetBackup Client Service” service on the client.
4. Check to see if there is free disk space.
5. Restart the job.
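The services-file check in step 2 can be scripted when many clients need verifying. The following is a minimal sketch, not part of the original procedure: the function name is hypothetical, and it assumes the usual services-file layout of `name port/protocol` with `#` starting a comment.

```python
# Entries required by step 2 of the Error 25 procedure.
REQUIRED_SERVICES = {"bpcd": "13782/tcp", "bprd": "13720/tcp", "bpdbm": "13721/tcp"}

def missing_netbackup_services(services_text):
    """Return the required service names that are absent or have the wrong port."""
    entries = set()
    for line in services_text.splitlines():
        # Drop anything after '#', then split into name, port/protocol, aliases.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            entries.add((fields[0], fields[1]))
    return sorted(
        name for name, port in REQUIRED_SERVICES.items()
        if (name, port) not in entries
    )
```

On a client, the text would come from reading the services file named in step 2; an empty result means all three entries are present.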

Error 52 – “Timed out waiting for the media manager to mount volume.”

1. Verify the tape is available and the drives are up on the Media Server performing the backup.
   a. To determine which media server is performing the backup, go to the NetBackup Administration Console and click on “Activity Monitor”. Look for the job ID of the failed job. The media server performing the backup can be found under the “Media Server” column.
   b. To determine if the drives are up on the media server, go to the NetBackup Administration Console and click on the toggle icon next to “Media and Device Management”. Next, click on “Devices”. In the “All Drives” pane, find the media server under the “Device Host” column. Click to highlight the media server and verify the drives attached to the media server are showing a status of “UP” under the “Drive Status” column. Note: If you do not see the “Drive Status” column, the column is hidden. To unhide the column:
      i. From the menu bar, click on “View” and select “Columns” and select “Layout”.
      ii. Under Heading, click on “Drive Status” to highlight it.
      iii. Click the “Show Column” icon or use “CTRL-S” to show the Drive Status column.
      iv. Click OK.
   c. If the drives are showing a status of “Down”, take note of the “device host” and the “drive name”. Under “Media and Device Management”, click on “Device Monitor”. In the right pane, find the drive name. After finding the drive name, click on it and select “Up Drive”.
2. Restart the job that has failed.

Error 54 – “Timed out connecting to client, General Network issue.”

1. See “Troubleshooting Windows NetBackup Clients” or “Troubleshooting Unix/Linux NetBackup Clients”.
2. Restart the job that has failed.

Error numbers 57 and 59 – “Client refused connection.” This error occurs when a media server is not listed in the client’s approved server list.

1. In the Netbackup Administration Console:
   a. Log into the client that had the error.
   b. Launch the “Backup, Archive, and Restore” interface.
   c. Click on “File” and select “Specify Netbackup Machines and Policy Type…”
   d. Verify the servers listed in the “List of NetBackup Media Servers” table are included in the “Server List” and verify dxp11uxa, dxp10uxa, and dv10bkup are also included.
   e. If there are servers missing, manually add them one at a time.
   f. Restart the job(s).

 Error 71 – “None of the files in the file list exist.”  Files or directories are listed for backup in the client’s policy, but the files or directories displayed do not exist on this client.  Either the client’s configuration was changed and the directories are no longer there, or the policy was changed and no longer references valid paths on this client.

1. Send an email to “*IS – NBUadmin” for further analysis.

NOTE: A couple of servers get 71s all the time due to the nature of the server and the backup policy. DXP12UXA is the primary one this happens to, and it can be ignored.

Error numbers 83 through 86 – These errors describe one of two common issues: a faulty tape or a hardware fault. Hardware faults are more common than bad tapes. A hardware fault does not mean hardware failure.

1. The workaround is simply to suspend the tape, then restart the job.
   a. To suspend a tape, go to the NetBackup Administration Console and click on the Activity Monitor. Take note of the media server the client was using. The tape will be listed in the media server’s media catalog.
   b. Open a command prompt and run the following command:
      d:\veritas\netbackup\bin\admincmd\bpmedia -suspend -m <MEDIA ID> -h <MEDIA SERVER HOSTNAME>
      For example: bpmedia -suspend -m B0100 -h ds25app
2. Send an email to “*IS – NBUadmin” as to which tapes have been frozen.
3. If more than two tapes get an 84 error at one time, create a Sev 3 Critical ticket to L3-ENT-BACK, as it usually indicates a larger failure that they need to know about.

Error 96 – “No more scratch tapes available in the library”

1. Create a Sev 3 Critical trouble ticket for the Enterprise Backup & Recovery group (L3-ENT-BACK).
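For the tape suspend step under error numbers 83 through 86 above, the command line can be assembled by a small helper so that the media ID and host always end up in the right position. This is a sketch under assumptions: the helper name and its validation are hypothetical, and only the path and the `-suspend`, `-m` and `-h` options quoted in this document are used.

```python
# Path as quoted in the Error 83 through 86 procedure.
BPMEDIA = r"d:\veritas\netbackup\bin\admincmd\bpmedia"

def suspend_command(media_id, media_server):
    """Build the argument list for suspending one tape on one media server."""
    for label, value in (("media ID", media_id), ("media server", media_server)):
        # Reject empty values and values with embedded whitespace (illustrative check).
        if not value or value.split() != [value]:
            raise ValueError(f"invalid {label}: {value!r}")
    return [BPMEDIA, "-suspend", "-m", media_id, "-h", media_server]
```

The returned list matches the example in the procedure, for instance `suspend_command("B0100", "ds25app")`, and could be passed to `subprocess.run` from a command prompt session on a media server.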

Error 134 – “Unable to process request because the server resources are busy”

1. The available tape drives are all in use or the host is very busy. NetBackup will requeue the job automatically and try the backup again.

Error 150 – “Process terminated by authorized user or process.” Someone with NetBackup Admin access has terminated the job, or a process has failed.

1. Contact Enterprise Backup & Recovery group.

Error 155 – “Disk full” - A disk is full on the server.

1. Connect to the server and free up space on the drive.
2. Restart the job.

Error 196 – “Client backup was not attempted because backup window closed.” This occurs if the backup started AFTER the window closed. Cause: Possible queuing delay.

1. Restart the job that has failed. See “Restarting Backup jobs via Backup Policies”.

Error 219 – “Storage Unit is currently unavailable”

1. Check for drives down. See “Error 52”.
2. Create a Sev 3 Critical trouble ticket for the Enterprise Backup & Recovery group (L3-ENT-BACK).

Self Correcting Jobs

Some jobs will automatically re-run when they hit certain errors. The Exchange Store jobs (e.g. Exchange_BDC_STG1) are monitored for a status of 1 and are automatically re-run, and an email with a subject of “Restart Notice” is sent. Sometimes this process doesn’t work well due to an issue on the Exchange server, and you’ll see many restarts for the same servers in a short period of time. If this occurs, contact the Enterprise Backup & Recovery group.

Netbackup Troubleshooter

Additional error codes and resolution steps can be found using the Troubleshooter within NetBackup. To access the Troubleshooter, click on the hand/wrench icon. See Figure 1.

Figure 1

Enter the error code into the status code field and click “Lookup” (Figure 2). The Troubleshooter will detail the problem and provide troubleshooting steps based on the error code you entered. See Figure 3.

 

Figure 2

Figure 3

Restarting Failed Backups

There are two ways to restart a failed backup job: via the Activity Monitor or via backup policies. An error code of 196 requires you to restart the backup job via the backup policies.

Restart Backup Jobs via the Activity Monitor

Restarting failed backups can be done via the Netbackup Admin Console on the media servers. To restart a backup:

1. Log into a media server and launch the Netbackup Admin Console.
2. Click on the Activity Monitor. All jobs will be displayed on the right pane.
3. Right-click on the right pane and select “Filter”. The Filters window will display.
4. Click on the empty cell under client and enter the client name of the failed backup job. Click OK. All jobs for that client will display.
5. By default, NetBackup will automatically rerun a failed backup job three times. Before rerunning the backup, verify the status of the backup. NetBackup may have already kicked off the rerun, and the rerun may have already finished successfully or may still be executing. If the rerun is still executing, wait for it to finish. If all three reruns have failed, continue with step 6.
6. To rerun the backup, right-click on the backup and select “Restart Job”.
7. If the job starts and fails with error code 196, the backup you just started was not attempted because the client’s backup window is closed. If this is the case, you will need to restart the backup via backup policies as detailed below.
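The checks in steps 5 through 7 amount to a short decision sequence, sketched below for illustration. The function names, the state labels ("done", "running", "failed") and the returned action strings are hypothetical; treating status 0 as success follows normal NetBackup convention rather than anything stated in this procedure.

```python
AUTOMATIC_RERUNS = 3  # "three times", per step 5

def next_action(rerun_states):
    """Decide what to do from the automatic reruns seen in the Activity Monitor.

    rerun_states -- one entry per automatic rerun: "done", "running" or "failed".
    """
    if "done" in rerun_states:
        return "no action"      # a rerun already finished successfully
    if "running" in rerun_states:
        return "wait"           # let the executing rerun finish
    if rerun_states.count("failed") >= AUTOMATIC_RERUNS:
        return "restart job"    # step 6: right-click and select Restart Job
    return "wait"               # assumption: NetBackup may still start a rerun

def after_restart(status_code):
    """Decide what to do once the manually restarted job has ended."""
    if status_code == 0:
        return "done"
    if status_code == 196:
        return "manual backup via policy"  # step 7: backup window is closed
    return "troubleshoot"
```

For example, three failed reruns lead to "restart job", and a restarted job that ends with 196 leads to the backup-policy procedure.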

Note:

1. If you restart a backup within the client’s backup window, the rerun will pick up where the last backup left off. If you restart a backup outside the client’s backup window, the rerun will start from the beginning.
2. We don’t want two identical backups running at the same time, as they will be in contention with one another. If two identical backups are running, stop the one that was most recently started, following the instructions in the section “Manually Stopping a Backup Job”.
3. If a backup job is re-started, please monitor it and keep the job id handy.

Restarting Backup jobs via Backup Policies

If you get an error code of 196 and you want to restart a backup, do the following:


1. The first thing you need to know is the backup policy and the schedule the failed backup ran under. You can find this information in the HPOV alert or by looking at the Activity Monitor and locating or filtering for the client.
2. Log into one of the media servers listed in the “List of NetBackup Media Servers” chart. Choose the media server for the domain the client belongs to. Launch the NetBackup Administration Console.
3. On the left pane, click on the + next to “Policies”.
4. Locate the backup policy from step 1.
5. Right-click on the backup policy and select “Manual Backup” (Figure 4).

Figure 4

6. Select the schedule (from step 1) on the left pane and select the client from the right pane. Click OK (Figure 5).

 

Figure 5

7. Verify the backup started via the Activity Monitor.

Note: It is OK to restart incremental and full backups. If you restart a backup and users complain about the performance of the server, please kill the backup and restart it at a later time.

Manually Stopping a Backup Job

To manually stop a backup job:

1. Go to the Activity Monitor and find the job that needs to be cancelled.
2. Right-click on the job and select “Cancel Job”.

Note: Be careful not to select “Cancel All Jobs”.

Troubleshooting Windows NetBackup Clients

Important notes:  If you notice someone logged into a media server, DO NOT log them out.  If you need to log into a media server, try a different one (see the list of media servers below). 

1. Verify that the server pings.
2. Verify DNS entry (forward and backward) from the media server.
   Forward: D:\veritas\netbackup\bin\bpclntcmd -hn <client>
   Backward: D:\veritas\netbackup\bin\bpclntcmd -ip <client IP>
   (It is an issue if the backward check resolves to another hostname. If you see this issue or any other problems, escalate to L2-Network to resolve DNS entries.)
3. Verify connection from media server to client. Enter this command from any media server in the same domain as the client you are attempting to verify:
   D:\veritas\netbackup\bin\admincmd\bpgetconfig -M <client>
4. Verify connection from media server to client via bpcd. Enter this command from any media server in the same domain as the client you are attempting to verify:
   telnet <client> bpcd
   This should result in a blank screen. Hit enter and the connection will close. If this happens, there are no issues with the bpcd connection from the media server to the client. However, if the connection closes right away without hitting enter, or if the connection does not close, there is an issue with the client. Use CTRL+] to break the session.
5. If step 4 fails, try recycling the NetBackup service on the client and try step 4 again. If the NetBackup service fails to start, schedule a server reboot.
6. If the NetBackup service fails to start because the executable was not found, try reinstalling NetBackup. Please send a communication to *IS - NBUadmin to inform them NetBackup was reinstalled on the server. The backup team will need to modify the media server list, buffer size, time out settings, etc. on the client.
7. If needed, try kicking off a daily incremental backup on the server. Ensure you have the correct policy for the client.

If there are any problems with steps 3-7 or if these steps seem to work fine but there is still an issue, escalate to L3-ENT-BACK.
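When several clients need checking, the DNS and bpcd checks from steps 2 and 4 can be approximated with a short script. This is a rough sketch, not a replacement for bpclntcmd or telnet: it uses the operating system resolver rather than NetBackup's own lookup, compares short host names only (an assumption), and tests only that a TCP connection to the bpcd port (13782, per the services entries listed under Error 25) can be opened. The function names are hypothetical.

```python
import socket

BPCD_PORT = 13782  # bpcd port from the services file entries under Error 25

def dns_consistent(client, resolve=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """True if the backward lookup of the client's IP returns the same short name."""
    ip = resolve(client)
    name = reverse(ip)[0]  # gethostbyaddr returns (hostname, aliases, addresses)
    return name.split(".")[0].lower() == client.split(".")[0].lower()

def bpcd_reachable(client, port=BPCD_PORT, timeout=5.0):
    """True if a TCP connection to the client's bpcd port can be opened."""
    try:
        with socket.create_connection((client, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A False result from either function points to the same follow-up as the manual steps: L2-Network for DNS mismatches, and steps 5 through 7 for bpcd connection problems.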

  

List of NetBackup Media Servers

NA Domain: DS15BKUP, DS16BKUP, DS17BKUP, DS20BKUP, DS21BKUP, DS22BKUP, DS23BKUP, DS24BKUP, DS25BKUP, DS26BKUP, DS27BKUP, DS28BKUP, RS20BKUP, RS21BKUP

DMZ Domain:
Prod DMZ: DS19BKUP, RS22BKUP
Legacy DMZ: DS29BKUP

Teradata: DS30BKUP, DS31BKUP, DS32BKUP, DS33BKUP, DS34BKUP, DS35BKUP, DS37BKUP, DS38BKUP

Troubleshooting Unix/Linux NetBackup Clients

1. Verify that the server pings.
2. Verify DNS entry (forward and backward) from the master server:
   a. From the master server (dv10bkup):
      Forward: /usr/openv/netbackup/bin/bpclntcmd -hn <client hostname>
      Backward: /usr/openv/netbackup/bin/bpclntcmd -ip <client IP address>
   b. From the client server:
      /usr/openv/netbackup/bin/bpclntcmd -pn
3. Verify connection from master server to client via bpcd.
   a. From the master server:
      telnet <client> bpcd
   b. From the client server:
      telnet <master server> bpcd
   This should result in a blank screen. Hit enter and the connection will close. If this happens, there are no issues with the bpcd connection from the media server to the client. However, if the connection closes right away without hitting enter, or if the connection does not close, there is an issue with the client. Use CTRL+] to break the session.
4. If step 3 fails, escalate to L3-ENT-BACK.

Shutting off VSP (Veritas SnapShot Provider)

If you notice that a VSP temp file is taking up space on a server, delete the file. It was probably left over from a previous install, as the new install does not delete this file. If you can’t delete the temp file because a system process has a lock on it, this indicates that VSP is running on the server. Use Process Explorer to identify this process and close the handle the process has on the VSP temp file. You will then be able to delete the VSP temp file without rebooting the server.

To shut off VSP on a server:

1. Log in to one of the media servers.
2. Launch the NetBackup Administration Console.
3. Click on the + next to “Host Properties”.
4. Click on “Clients”.
5. On the right pane, find the client that has VSP running.
6. Right-click on the client and select “Properties”. The client properties will appear (Figure 6):

Figure 6

7. On the left pane, click the + next to “Windows Client”.
8. Click on “VSP”. The VSP snapshot Provider window will open (Figure 7):

Figure 7

9. In the field below “VSP volume exclude list (drive letters separated by commas):” enter all drives on the server. (example: c,d,e,f,g)
10. Check “Customize the cache sizes”.
11. Select “Cache size in MB”.
12. Click OK.

Netbackup Drives Alerts

The Drive alerts are going through HPOV. The drive alerts and their remediation steps can be found here.

Contacting IBM on hardware issues

If requested by the Backup Team to engage IBM on a hardware issue, IBM's contact information is provided below:

IBM dispatch #: 800-426-7378
IBM Tape Library Model #: 3584
IBM Tape Drive Model #: 3592

Library | Serial # | Phone # Assoc. to Lib.
RDC prod (rtp01) | 78A0347 | (612) 670-2706
BDC prod (dtp01) | 78A0308 | (952) 324-1872
BDC test (dtt09) | 78A1096 | (952) 324-1872

FYI, you will need to give IBM the phone # associated with the library, as well as the library model # (3584, same for all libraries) and serial #. If the problem appears to be a tape drive, or if IBM asks, the tape drive model # we use is 3592.