Database Corruption Challenge #2 – Alternate Solution

Download PDF

With week two of the Database Corruption Challenge being just as interesting as week one, and after seeing yesterdays blog posting showing the solution by Rob Farley, you might be thinking you are ready to take on corruption. If you are interested in giving the corruption challenge a try, please sign up for my newsletter so that you will be notified of the next corruption challenge.

Lets take a look at one of the more interesting solutions submitted by Bogdan Sahlean, who was the first to solve using DBCC Page to find the exact corrupt bytes, and DBCC WritePage to fix those corrupt bytes.

Now if you are reading this, and don’t know about DBCC WritePage, I caution you…  DBCC WritePage is not something you should just play with, and it is certainly not something that you should ever run on a production database. Consider yourself warned.  One of the dangers of using Write page that it’s use will make the DB unsupported if Microsoft support is required.

However after all the warnings about DBCC WritePage are out of the way, Bogdan Sahlean has come up with a really slick solution that just fixed the 8 bytes of corruption that exist in the database.  The same solution could be achieved by editing the database with a hex editor, and that solution was provided by another participant.  Lets take a look at this solution.  First assume that the latest backup provided has been restored to a database called CorruptionChallenge2_Latest_SGBD.

He starts out by finding out what is corrupt:


-- GET [MORE] INFO ABOUT DATABASE CORRUPTION
DBCC CHECKDB('CorruptionChallenge2_Latest_SGBD') WITH NO_INFOMSGS, ALL_ERRORMSGS;

Which results with this output:

Msg 8939, Level 16, State 98, Line 40
Table error: Object ID 2137058649, index ID 1, partition ID 72057594038845440, alloc unit ID 72057594039762944 (type In-row data), page (1:244). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -10.
Msg 8928, Level 16, State 1, Line 40
Object ID 2137058649, index ID 1, partition ID 72057594038845440, alloc unit ID 72057594039762944 (type In-row data): Page (1:244) could not be processed. See other errors for details.
Msg 8976, Level 16, State 1, Line 40
Table error: Object ID 2137058649, index ID 1, partition ID 72057594038845440, alloc unit ID 72057594039762944 (type In-row data). Page (1:244) was not seen in the scan although its parent (1:157) and previous (1:243) refer to it. Check any previous errors.
Msg 8978, Level 16, State 1, Line 40
Table error: Object ID 2137058649, index ID 1, partition ID 72057594038845440, alloc unit ID 72057594039762944 (type In-row data). Page (1:245) is missing a reference from previous page (1:244). Possible chain linkage problem.
CHECKDB found 0 allocation errors and 4 consistency errors in table ‘Revenue’ (object ID 2137058649).
CHECKDB found 0 allocation errors and 4 consistency errors in database ‘CorruptionChallenge2_Latest_SGBD’.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (CorruptionChallenge2_Latest_SGBD).

 

Bogdan noted

Important info (from my point of view):
Page (1:244) was not seen in the scan although its parent (1:157) and previous (1:243) refer to it.

 

So he takes a look at what is in the revenue table:


SELECT * FROM CorruptionChallenge2_Latest_SGBD.dbo.Revenue

 

 

SQL Server detected a logical consistency-based I/O error: invalid protection option.
It occurred during a read of page (1:244) in database ID 7 at offset 0x000000001e8000
in file ‘D:\BD\Corruption_Challenge\week2\CorruptionChallenge2_LatestBackup.mdf’.
Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Bogdan noted:

Important info (from my point of view):
It occurred during a read of page (1:244) in database ID 7 at offset 0x000000001e8000

 

Now where the code gets interesting…


-- LET'S SEE WHAT DATA IS AT OFFSET 0x000000001e8000
 -- Page for offset 0x000000001e8000 ?
 SELECT CONVERT(INT, 0x000000001e8000, 1) / (1024 * 8) --> Page 244
 -- Offset within page 244 ?
 SELECT CONVERT(INT, 0x000000001e8000, 1) -
        (CONVERT(INT, 0x000000001e8000, 1) / (1024 * 8))
        * (1024 * 8) --> 0
 -- Conclusion: first byte[s] from the beggining (header)
 --      of page 1:244 "have problems"

 

So then he starts looking around at the pages to find the problem.


GO
 DBCC TRACEON(3604, -1)
 GO
 DBCC PAGE('CorruptionChallenge2_Latest_SGBD', 1, 244, 2)
 /*
 First bytes:
 020003ff ffff0001 f3000000 01001400 f5000000
 */
 DBCC PAGE('CorruptionChallenge2_Latest_SGBD', 1, 243, 2)
 /*
 First bytes:
 01010400 00020001 f2000000 01001400 f4000000
 */
 DBCC PAGE('CorruptionChallenge2_Latest_SGBD', 1, 245, 2)
 /*
 First bytes:
 01010400 00020001 f4000000 01001400 f6000000
 */
 DBCC PAGE('CorruptionChallenge2_Latest_SGBD', 1, 246, 2)
 /*
 First bytes:
 01010400 00820001 f5000000 01001400 9f000000
 */

Where he makes this discovery:

 1:243 = 01010400 00020001 f2000000 01001400 f4000000
 1:244 = 020003ff ffff0001 f3000000 01001400 f5000000
 1:245 = 01010400 00020001 f4000000 01001400 f6000000
 1:246 = 01010400 00820001 f5000000 01001400 9f000000
         |---------------|
          ^
          |-- It seems that first 8 bytes from 1:244 
              are "bad" and it  should be "01010400 00020001" 
              instead of "01010400 00020001"

Then the fix is made:


BEGIN
ALTER DATABASE CorruptionChallenge2_Latest_SGBD
SET SINGLE_USER

DBCC WRITEPAGE('CorruptionChallenge2_Latest_SGBD',
               1, 244, 0, 8, 0x0101040000020001, 1)

ALTER DATABASE CorruptionChallenge2_Latest_SGBD
SET MULTI_USER

EXEC CorruptionChallenge2_Latest_SGBD.dbo.checkCorruptionChallenge2Result
END

The DBCC WRITEPAGE command is writting 8 bytes into page 244 at position 8 to fix the corruption.

Next DBCC CheckDB() comes back clean and the sproc that tests the databse called checkCorruptionChallenge2Result reports success.

 

Nice job Bogdan Sahlean for finding the 8 specific bytes in the entire database that were corrupt and fixing just those 8 bytes. No data loss, and the database now reports that there is no corruption.

 

I would again like to reiterate my warning about using DBCC WritePage in production systems. I would certainly NEVER Use this without being absolutely sure that there was a really solid backup. DBCC WritePage is like the voodoo of SQL Server, but sometimes it does the job.

Tagged with:
9 comments on “Database Corruption Challenge #2 – Alternate Solution
  1. Jerrod says:

    Great job, Bodan! I highly encourage anyone interested in what those 8 bytes represent taking a look at Paul Randal’s post on the Anatomy of a Page: http://www.sqlskills.com/blogs/paul/inside-the-storage-engine-anatomy-of-a-page/. The portions of the page header that were actually corrupt were the (in order): m_headerVersion, m_type, m_typeFlagBits, m_level (page level in the b-tree), and m_flagBits. I couldn’t find anyone who had a bit/byte map of the page header when I was working on this, so I figured I’d be helpful, as I was keeping track of what byte changes impacted what header info in the output of dbcc page. Cheers!

  2. Patrick Flynn says:

    Hi Jerod

    The translation / mapping of the Page Header has been done by Mark Rasmussen in his ORCAmdf project.

    The post http://improve.dk/reverse-engineering-sql-server-page-headers/ gives the PageHeader.

    Regards

  3. JeremyH says:

    This is a wonderful series of challenges, Steve! I didn’t fully understand Bogdan’s solution at first, but after reading the referenced posts by Paul Randall and Mark Rasmussen, I’ve gotten most of the way there.

    I still have a couple questions about the above solution. For the queries:

    SELECT CONVERT(INT, 0x000000001e8000, 1) / (1024 * 8) –> Page 244
    — Offset within page 244 ?
    SELECT CONVERT(INT, 0x000000001e8000, 1) –
    (CONVERT(INT, 0x000000001e8000, 1) / (1024 * 8))
    * (1024 * 8) –> 0

    Do we not know what the offset from the error message is referencing (file offset or page offset)? What is the second query (above) supposed to produce? Plugging in any offset into both CONVERT statements is going to return 0.

    Does the directORbufferpool option of 1 for WRITEPAGE need to be used since we are changing the m_flagbits to the checksum value of 0x200? It doesn’t work using the default value.

    • SteveStedman SteveStedman says:

      JeremyH,
      You are correct, we did know the page and the offset without doing the math, however it was an interesting formula that was submitted, so I left it in.
      SELECT CONVERT(INT, 0x000000001e8000, 1) / (1024 * 8) –> Page 244
      — Offset within page 244 ?
      SELECT CONVERT(INT, 0x000000001e8000, 1) –
      (CONVERT(INT, 0x000000001e8000, 1) / (1024 * 8))
      * (1024 * 8) –> 0

      As far as the directOrBufferPool option, the default is 0. When zero is specified, it fixes up the checksum appropriately, if it is 1, the checksum is skipped. As far as using DBCC WritePage, it is dangerous, but if you are looking to fix corruption, the default value of 0 for directOrBufferPool should be used instead of the more dangerous paramater of 1.

      -Steve Stedman

      • Zhen Zhang says:

        Interesting point about the directOrBufferPool option, I used 1 as well. However, just to prove the option 0 is not working, I tried again, it turned out that if we use option 0, which changed the value of buffer pool and rely on engine to re calculate the checksum, it would not work.

  4. Kev Riley says:

    I still don’t understand the relevance of this query:

    SELECT CONVERT(INT, 0x000000001e8000, 1) / (1024 * 8) –> Page 244

    SELECT CONVERT(INT, 0x000000001e8000, 1) –
    (CONVERT(INT, 0x000000001e8000, 1) / (1024 * 8))
    * (1024 * 8) –> 0

    What is it trying to show, and how did Bogdan arrive at it? As JeremyH says, you can substitute any values into that 2nd query, and you can simply reduce the formula down to 0 in every case.

    • SteveStedman SteveStedman says:

      Kev,
      The SELECT CONVERT… query is taking address 0x1e8000 converting it into an integer then doing the math to divede by 8k or 1024 * 8 which is the size of a page. it is a way to calculate the page number. Some do this using a calculator, but the SELECT can just we quicker than switching to a calculator to do the math.

      -Steve Stedman

Leave a Reply

Your email address will not be published. Required fields are marked *

*

Time limit is exhausted. Please reload CAPTCHA.