troubleshooting backups & restores

43
© 2017 Brent Ozar Unlimited®. All rights reserved. 1 Troubleshooting Backups & Restores Because sometimes the GUI doesn’t cut it. www.BrentOzar.com sp_Blitz – sp_BlitzFirst email newsletter – videos SQL Critical Care®

Upload: others

Post on 16-Jan-2022

18 views

Category:

Documents


0 download

TRANSCRIPT

© 2017 Brent Ozar Unlimited®. All rights reserved. 1

Troubleshooting Backups & RestoresBecause sometimes the GUI doesn’t cut it.

www.BrentOzar.comsp_Blitz – sp_BlitzFirst

email newsletter – videosSQL Critical Care®

© 2017 Brent Ozar Unlimited®. All rights reserved. 2

You already know…Recovery models: full, simple, bulk_logged

Backup types: full, differential, log

Using SSMS and T-SQL to leverage those

And you’ve already filled this out.

© 2017 Brent Ozar Unlimited®. All rights reserved. 3

We’ll focus on these.How we react by default

How we can react faster

We’ll use the Stack Overflow db:https://BrentOzar.com/go/querystack

~100GB, SQL Server 2008 & newer

“Oops”

© 2017 Brent Ozar Unlimited®. All rights reserved. 4

The simple situationAt around 7AM morning,we took an hour-long outage to do a deployment

Everyone signed off that the deployment worked

The web site went live again at 9AM

At around 9:30AM, customers started calling

Some (but not all) data in a major table is gone, but more data has been added since we went live

© 2017 Brent Ozar Unlimited®. All rights reserved. 5

The way it looks nowUsers are complaining that their data is gone.

(We’ve double-checked and the row is really gone, not just a bad update of DisplayName.)

We need to read the log.We do log file backups every 5 minutes.

To keep this simple, I’ll use a narrow time range.

© 2017 Brent Ozar Unlimited®. All rights reserved. 6

My strategyOn a development server:

• Restore the full backup, but use norecovery and a standby file

• Query the data to see if it’s there

• Restore the first log backup,using norecovery and a standby file again

• Query the data to see if it’s there

Restore the full with a standby.Using a standby file means we can query it, too.

The standby extension/location doesn’t really matter: it’s just where SQL Server stores temporary log info.

RESTORE DATABASE [StackOverflow]FROM DISK = N'M:\MSSQL\Backup\StackOverflow_FULL_20170210_071324.bak'WITH FILE = 1, NOUNLOAD, REPLACE,STANDBY = N'M:\MSSQL\DATA\StackOverflow_Standby.tuf';

© 2017 Brent Ozar Unlimited®. All rights reserved. 7

Query the data…

Yay! Brent is there! We’re done, right?

Not so fast.When we design systems, the business dictates how much data they’re willing to lose.

If they’d have said 1 hour, or 1 day, we’re done.

But businesses almost never say that.

In this case, our goal is 1 minute of data loss.

© 2017 Brent Ozar Unlimited®. All rights reserved. 8

Restore the first log file

Yay!

Check to see if our data is there

It still is, which means we need to keep moving forward if we’re going to get as close as possible to the bad transaction.

© 2017 Brent Ozar Unlimited®. All rights reserved. 9

Restore the next log and check

And the next log

© 2017 Brent Ozar Unlimited®. All rights reserved. 10

And the next log

Trust me, it’s worse in real life

© 2017 Brent Ozar Unlimited®. All rights reserved. 11

AWWW YEAH

What we’ve learned so farThe delete happened somewhere between 7:40-7:45.

If the business was comfortable losing up to 5 minutes of data, we could restore the 7:40 log backup, then take that table’s contents and push them into prod.

The killer is in here

© 2017 Brent Ozar Unlimited®. All rights reserved. 12

I’ll just jump back one log file, ok?

Nope: we’ve already gone past this log file.

We’re going to have to start all over with the full restore.

Tick tockNow would be a great time to talk about Instant File Initialization, which doesn’t help with the log file.

© 2017 Brent Ozar Unlimited®. All rights reserved. 13

If we’re OK losing 5 min of dataRestore the full backup

Restore all the logs up to 7:40AM

Pipe the data over to production using a linked server query, scripting, SSIS, etc.

(Now, what if you’re doing log backups every hour?Think the business is okay losing an hour of data?)

Next level: using STOPATRestore the full, restore the logs up to 7:40, and then:

But how much do you step forward?And even when you find the data, gotta go back…

© 2017 Brent Ozar Unlimited®. All rights reserved. 14

The iterative approachWHILE 1 = 1

• Restore a log file, bumping STOPAT by one second

• Dump the contents of your valuable table into another table, in a different database, like Users_HHMMSS (074500, 074530, 074600…)

• If no records were found, break out

Good luck with that.

Whew. And that was an easy one.No referential integrity

No cascading deletes

No other big deletes going on at the time

No TRUNCATE TABLE

Only an oops DELETE, not an oops UPDATE

© 2017 Brent Ozar Unlimited®. All rights reserved. 15

RPO/RTO dictates:• How often you take log backups

• How fast your full restores need to be(which determines speed requirements for your backup share, network pipe, disk, etc)

• The requirements for a spare box with enough free space to perform a full restore

• Whether you can restore to a point in time and just lose all user-entered data since

Let’s get ready for “Oops”

© 2017 Brent Ozar Unlimited®. All rights reserved. 16

Method 1: changing the databaseIf you can control and change the database:

• Log schema changes (DDL) with triggers, event notifications

• Log data modifications (DML) with auditing, change tracking, network sniffers

• Build better code releases with unit testing, undo scripts, end user signoffs

Method 1: changing the databaseIf you can control and change the database:

• Log schema changes (DDL) with triggers, event notifications

• Log data modifications (DML) with auditing, change tracking, network sniffers

• Build better code releases with unit testing, undo scripts, end user signoffs

© 2017 Brent Ozar Unlimited®. All rights reserved. 17

Method 2: the tools approach

Transaction log readersNative: fn_dump_dblog

ApexSQL Log Explorer – apexsql.com

Quest LiteSpeed and Toad – quest.com(but doesn’t work with native compressed backups)

Disclaimer: I used to work for Quest, but I have no affiliation with either company today. I don’t care which tool you use.

© 2017 Brent Ozar Unlimited®. All rights reserved. 18

fn_dump_dblogUndocumented. Use at your own risk.

Known bug up to 2012 SP2: consumes worker threads each time it’s executed, only fixed on restart. http://www.sqlskills.com/blogs/paul/using-fn_dblog-fn_dump_dblog-and-restoring-with-stopbeforemark-to-an-lsn/

(Strangely, the fix isn’t documented in SP2, either. They’re serious about that undocumented thing.)

Good for finding the transaction

© 2017 Brent Ozar Unlimited®. All rights reserved. 19

Needle in a haystack

Is it the right needle?Yes, we found a transaction doing deletes

But is it the right one? On the right table?

And can we pass this LSN into RESTORE…STOPAT?

© 2017 Brent Ozar Unlimited®. All rights reserved. 20

We’re going to have to check…Fn_dump_dblog doesn’tinclude the table name.

That would be too easy.

You’re going to have tojoin to sys.allocation_units, sys.partitions, sys.objectsto find which table is involved.

That exercise is left for the reader.

© 2017 Brent Ozar Unlimited®. All rights reserved. 21

Even when you find it…fn_dump_dblog just gives you a log sequence number.

It doesn’t give you an undo script.

All you’re doing is running RESTORE and stopping at the moment the delete happened.

You’re losing all transactions that happened after that moment in time.

That’s not cool. That’s not why you’re here.

3rd party tools are cool.I’m going to show LiteSpeed here, but the same concepts apply for ApexSQL Log.

© 2017 Brent Ozar Unlimited®. All rights reserved. 22

Point it at your backup files

Uh oh

© 2017 Brent Ozar Unlimited®. All rights reserved. 23

How about ApexSQL Log Reader?Let’s see what you can…hello? Hello, Apex?

The single-threaded blues

© 2017 Brent Ozar Unlimited®. All rights reserved. 24

Then stuck here for minutes

Happens all over the product

© 2017 Brent Ozar Unlimited®. All rights reserved. 25

In fact, I get excited when it’s not

Hours later

© 2017 Brent Ozar Unlimited®. All rights reserved. 26

Filter the grid, find the delete, and:

Time to make a choice.1. fn_dump_dblog – gets you to a point in time, but

you lose all subsequent transactions

2. Quest LiteSpeed/Toad – can generate undo scripts, but only if you use uncompressed backups, or license their compression tools on your production servers

3. ApexSQL Log – in real-world database sizes, hasn’t worked for me (but your mileage may vary)

© 2017 Brent Ozar Unlimited®. All rights reserved. 27

It all comes down to this.If you can only lose 1 minute of data,and be down for only 1 hour:

• Log backups every minute

• Super-fast restore server that can rip through restores in minutes, has space

• fn_dump_dblog script ready to go

• Rehearse all this ahead of time

More ambitious goals?If you can’t lose any data,and can be down for only 1 hour:

• Third party backup software that can generate undo scripts

• Super-fast restore server that can rip through restores in minutes, has space

• Tool installed, ready to go, and rehearsed

© 2017 Brent Ozar Unlimited®. All rights reserved. 28

When the disk says “oops”

Data pages inthe data file

System pages in the data file

Log file

Typically discovered when you: Read a pagefrom disk

Read a page from disk

Write a transaction

Which means you find it: Later, after you lost data

Later, after you lost data

Instantly, and the txn fails (usually)

Should you fail over? Probably Probably Probably

To recover, you: Need to restore pages

Need to restore pages

Shrink/regrow the log file

Ways corruption strikes

© 2017 Brent Ozar Unlimited®. All rights reserved. 29

DisclaimerCorruption is caused by one-time or recurring:

• Storage failures • Storage network failures• Crappy OS applications (filter drivers)• SQL Server bugs

Anytime you see corruption, sound an alarm, because you don’t know if it was one-time or recurring.

If you don’t fix the root cause, corruption will return.

But in this session, I’m going to assume you fixed it already, like you failed over to known good storage.

Corruption recovery optionsRecovery Point Objective

(RPO, data loss)Recovery Time Objective

(RTO, downtime)

Automatic page repair (database mirroring & AGs)

Zero Zero (no downtime, but some queries may fail)

Drop & recreate index(only works for NCs)

Zero Zero (no downtime, but some queries may fail)

Online page restore (Enterprise Edition)

Zero* Index is offline for some time based on restore speed

Entire database restore Zero* Everything offline based on restore speed

* - requires full recovery mode, a complete set of log backups including a tail-of-the-log backup for the entire restore.

© 2017 Brent Ozar Unlimited®. All rights reserved. 30

This is how RPO/RTO can dictate edition & architecture.

Recovery Point Objective (RPO,

data loss)

Recovery Time Objective (RTO, downtime)

Automatic page repair (database mirroring & AGs)

Zero Zero (no downtime, but some queries may fail)

Drop & recreate index(only works for NCs)

Zero Zero (no downtime, but some queries may fail)

Online page restore (Enterprise Edition)

Zero* Index is offline for some time based on restore

speed

Entire database restore Zero* Everything offline based on restore speed

Corruption recovery optionsRecovery Point Objective

(RPO, data loss)Recovery Time Objective

(RTO, downtime)

Automatic page repair (database mirroring & AGs)

Zero Zero (no downtime, but some queries may fail)

Drop & recreate index(only works for NCs)

Zero Zero (no downtime, but some queries may fail)

Online page restore (Enterprise Edition)

Zero* Index is offline for some time based on restore speed

Entire database restore Zero* Everything offline based on restore speed

* - requires full recovery mode, a complete set of log backups including a tail-of-the-log backup for the entire restore.

This part can be tricky.

© 2017 Brent Ozar Unlimited®. All rights reserved. 31

© 2017 Brent Ozar Unlimited®. All rights reserved. 32

Uh oh, something’s missing

The full backup last night took 2

hours. Ouch.

© 2017 Brent Ozar Unlimited®. All rights reserved. 33

Restore the last full with NORECOVERY to a different database name

The first log restores just fine

© 2017 Brent Ozar Unlimited®. All rights reserved. 34

But we can’t go farther

Checking our LSNS

© 2017 Brent Ozar Unlimited®. All rights reserved. 35

Take the differential backup

We can restore the differential

© 2017 Brent Ozar Unlimited®. All rights reserved. 36

Run another log backup

And we can restore the log backup that was taken after the differential

© 2017 Brent Ozar Unlimited®. All rights reserved. 37

Differential vs. log backupsDifferential backup

Transaction log backup

12:10 am: Full backup sets

differential base

2:09 pm You take a differential

backupExtents in data file changed since full

backup set differential base

Log records generated since last log backup*

Log records generated since last log backup

Log records generated since last log backup

* Or a full backup if there is no prior log backup

Possible restore sequences

Full backup restored

with NORECOVERY

Differential restore with NORECOVERY

Missing log backup

Transaction log backup

© 2017 Brent Ozar Unlimited®. All rights reserved. 38

What about this time period?

Full backup Differential backup

Missing log

backup file

Successful log backups

Successful log backups

Can I restore to here?

Differential dangersWhat if you don’t have the most recent full backup?

What if you THINK you have the most recent full backup & you’re wrong?

• Example: Danielle the Developer needed a fresh copy of production as of Tuesday, so someone ran a special full backup just for her (and didn’t keep a copy of it).• To mitigate this, you need to use the COPY_ONLY

parameter if you run “extra” full backups

What if the full backup doesn’t restore correctly?

© 2017 Brent Ozar Unlimited®. All rights reserved. 39

Takeaways: missing log backupsIf you have missing or corrupt log files, you must take action

You have a gap in your restore sequence

You can bridge this gap with either:• A new full backup• A differential backup

The differential backup approach has more risks, as it is completely dependent on the prior full backup

(And of course you’ve got to follow up on why those log backups went bad or went missing)

Reacting faster

© 2017 Brent Ozar Unlimited®. All rights reserved. 40

Which one matches you:1. “I’ve got just a few SQL Servers, and I’m busy.”

• Answer: buy a 3rd party tool like Idera SQLsafe, Quest Litespeed, or Red Gate SQL Backup

2. “We’ve got tens of terabytes of data on the SAN.”• Answer: check out SAN snapshot backups

3. “I need faster backups & restores, and I can’t spend money, but I can spend time.”• Answer: do manual tuning and testing

Manual tuning and testingMicrosoft white paper on multi-terabyte tuning:

https://BrentOzar.com/go/FasterBackup (DOCX)

Covers network configuration, parameters, file server recommendations, and more

This kind of work is necessary in cloud VM sizing, too

© 2017 Brent Ozar Unlimited®. All rights reserved. 41

BrentOzar.com/go/azuretests

BrentOzar.com/go/azuretests

© 2017 Brent Ozar Unlimited®. All rights reserved. 42

Recap

These two are hard.Document your current state:

• Take your largest/toughest DB, and go into dev or staging.

• Act as if one of these things just happened. Start the clock.

• Time how long it takes, and how much data you’d lose.

• And this is the BEST case scenario, not the worst.

© 2017 Brent Ozar Unlimited®. All rights reserved. 43

Then tell the business.Give them this document, and let them decide their new goals.

New goals = new work:it’s going to take you time, tools, and servers to get better.