A brief timeline




Page 1: A brief timeline

3D DBA Workshop 16-17 November 2010 1

A brief timeline

• Due to a firmware bug, both controllers of the Sun StorageTek 6540 array reboot within 90 minutes of each other on 18/08/2010.

• All attempts to restore and recover the data to the original hardware fail. The database is corrupted during the recovery process.

• On 01/09/2010 the database is successfully restored to alternate hardware.

• On 02/09/2010 preparations are started to synchronize the database from RAL.

• RAL starts to copy LHCb and ATLAS data to SARA.

Page 2: A brief timeline


• The data is imported into the database on 08/09/2010.

• On the same day CERN brings the streams up.

Page 3: A brief timeline


Some minor issues

• At SARA the COMPATIBLE parameter had to be changed from 10.2.0.3 to 10.2.0.4 to match the one at RAL.

• There is an error in the Oracle Database Administrator's Guide on page 8-37 regarding the syntax of the parameter file of the impdp command.
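The COMPATIBLE mismatch noted above is the kind of thing that can be checked before any data is copied. The following is only a minimal sketch in Python. The slide says only that the value at SARA had to match the one at RAL; the rule coded here (the target setting must not be lower than the source setting) is an assumption, and the function names `parse_version` and `target_is_compatible` are hypothetical helpers, not part of any Oracle tool.

```python
def parse_version(text):
    """Turn a dotted string such as '10.2.0.4' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def target_is_compatible(source_compatible, target_compatible):
    """Return True when the target setting is at least the source setting.

    Assumption: a target with a lower COMPATIBLE value than the source
    is treated as a mismatch that must be fixed before the import.
    """
    src = parse_version(source_compatible)
    dst = parse_version(target_compatible)
    # Pad to equal length so '10.2' and '10.2.0.0' compare as equal.
    width = max(len(src), len(dst))
    src += (0,) * (width - len(src))
    dst += (0,) * (width - len(dst))
    return dst >= src


# Values taken from the slide: RAL (source) at 10.2.0.4, SARA (target) at 10.2.0.3.
needs_change = not target_is_compatible("10.2.0.4", "10.2.0.3")
```

With the values from the slide, `needs_change` is true, which corresponds to the change that had to be made at SARA.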

Page 4: A brief timeline


Conclusions on the resynchronization

• The resynchronization process went rather smoothly (at least from SARA’s point of view).

• For SARA this was a learning opportunity. The procedure has been documented for possible future use.

• The assistance we received from both RAL and CERN was amazing.

Page 5: A brief timeline


Conclusions on data corruption

• We’ve been unable to determine the exact cause of the corruption.

• An upgrade of the storage firmware and a rebuild of the LUNs solved the problem (but for how long?).

• Always set db_block_checking=TRUE in combination with db_block_checksum to detect logical corruption at a very early stage.
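The recommendation above can be turned into a simple audit of a text parameter file. This is a rough sketch in Python, assuming a plain init.ora-style file with `name=value` lines and `#` comments. The helper names `parse_pfile` and `corruption_check_warnings` are hypothetical, and treating only OFF and FALSE as "disabled" is a simplification.

```python
# Values treated as "disabled" for each parameter (a simplifying assumption).
DISABLED = {
    "db_block_checking": {"FALSE", "OFF"},
    "db_block_checksum": {"FALSE", "OFF"},
}


def parse_pfile(text):
    """Parse 'name=value' lines into a dict of lowercase names to uppercase values."""
    params = {}
    for raw in text.splitlines():
        # Drop trailing comments; a '#' inside a quoted value is not handled.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        name = name.strip().lower()
        if name.startswith("*."):  # instance-wide prefix used in some pfiles
            name = name[2:]
        params[name] = value.strip().strip("'\"").upper()
    return params


def corruption_check_warnings(params):
    """List the block-check parameters that are missing or switched off."""
    warnings = []
    for name, off_values in DISABLED.items():
        value = params.get(name)
        if value is None:
            warnings.append(f"{name} is not set explicitly")
        elif value in off_values:
            warnings.append(f"{name} is disabled ({value})")
    return warnings


sample = "*.db_block_checking='TRUE'\n*.db_block_checksum=OFF  # example only\n"
print(corruption_check_warnings(parse_pfile(sample)))
```

A check like this only inspects the file. It does not confirm what a running instance actually uses, so it complements rather than replaces querying the database itself.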
