Fixing SQL Server Database Corruption (when you get lucky)

First things first–if you are reading this, and not regularly running consistency checks on your SQL Server databases, you should drop everything you are doing and go do that. What do I mean by regularly? In my opinion, based on years of experience, you should run DBCC CHECKDB at least as frequently as you take a full backup. Unless you have a clean consistency check, you don’t know if that last backup you took is valid. SQL Server will happily back up a corrupted database.

Screenshot of a failed checkdb, followed by a successful backup command

I cheated a little bit here and used an undocumented command called DBCC WRITEPAGE to corrupt a single 8kb page within an non-clustered index on a table I created. You should basically never use this command, unless you are trying to corrupt something for a demo like this, but as you can see, after we’ve corrupted the page, SQL Server fails CHECKDB, and then happily takes a backup of our now corrupted database.

What Causes Database Corruption?

Other than doing something terrible, like editing a page, database corruption is mostly caused by storage failures. An example of this could be your local SAN, where the SAN’s operating system acknowledges that a write operation is complete back to the host operating system, but the write doesn’t actually complete. SQL Server recieved the write acknowledgment and thinks that the data was correctly written to the page, however for whatever reason it didn’t happen. I had this happen a couple of times in a past job when the SAN in our “data center” (it wasn’t) crashed hard when the building lost power (yeah we didn’t have a generator, hence the quotes. Be careful who you buy your medical devices from). What was actually happening is that the SAN was acknowledging a write when the data hit memory on the SAN, which is a performance enhancement, that assumes you have a proper power infrastructure that will prevent the SAN from “crashing hard”. You know what happens what you assume, right?

Anyway, this is far less common than it used to be for a number of reasons, one of which is the use of cloud based storage, which is very robust in terms of data protection. Also, modern enterprise class SANs are more efficient, and less likely to have failures like this. However, it’s still very possible–I had a minor corruption event in an Azure VM a couple of years ago, and we had a customer who filled up their very non-enterprise class SAN, with terrible results (it was all the corruption). So the moral of the story, is wherever you are running SQL Server you need to run checkdb. (Except Azure SQL DB, and possibly Managed Instance).

Fixing Corruption

There is a lot of stuff and tools that people will try to sell you on the internet to fix your database corruption. Almost all of them are crap–if you have corruption in a table or clustered index, or worse in one of the system pages that determines allocation, you are screwed and need to restore your last good backup. (see why backup retention matters here?)

However, in some cases you can get lucky. If your corruption is limited to a nonclustered index, you don’t need to restore the database, and you can just rebuild your index.

However, in my case that just threw the dreaded SQL Server 824 error. I suspect this had something to do with how I corrupted the page, but that investigation is not complete. I was able to disable the index, and then rebuild and we had a sucessful CHECKDB.

Share

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Trust DCAC with your data

Your data systems may be treading water today, but are they prepared for the next phase of your business growth?