This weekend I was migrating some databases from one server to another. As these were rather large databases (about 2 TB in size) and I didn’t have the clusters setup on a SAN (using DAS instead) I needed to setup log shipping from one cluster to another so that the downtime when moving was kept to a minimum.
But a little problem came up, when I restored one database I got a torn page error when trying to roll the logs forward. My first thought was ok, maybe there was a network hiccup or a problem with the storage on the destination server. So I restored the database again, and got a torn page error again. However the second time the error was on a different page. This is very important to know, because the torn page had moved from one page (1:1396373) to another (1:24815312) and I used the same database backup both times, this told me something very important. That the torn page wasn’t caused by the source database or the database backup. If it was then the page which was torn would have been the same both times.
Now you’ll notice that the torn page was detected when the transaction logs were being rolled forward, not while the actual restore was happening. The reason for this is that torn pages aren’t checked for during a full backup restore. Torn pages are only detected during two specific operations, when the pages are read into the buffer pool (my database is in NORECOVERY so that isn’t going to happen), and when the pages are changed (in this case they are changed when the transaction log rolls forward). So the only other way I would have been able to verify that there wasn’t a torn page would have been to bring the database up in STANDYBY mode and to a scan of every index (both clustered and non-clustered) which would have caused all the pages to be read into the buffer pool and caused the torn page message to be thrown. But this can’t happen in this case because I’m log shipping form SQL Server 2005 to SQL Server 2008 and the database can’t be brought up in STANDYBY mode because a database upgrade needs to be done. So checking for additional torn pages requires taking a fully rolled forward database out of NORECOVERY and putting it into RECOVERY, querying every table on every index so that every page is touched (by using index hints) to ensure that the torn page isn’t there.
This not so quick bit of testing (the database backup in question is about 300 Gigs) told me that there was a problem with either the HBAs or the DAS that the cluster was using. To further test we failed the SQL Cluster over to the other node and ran the restore again which didn’t cause any torn page error messages when rolling the logs forward. This would appear to put the DAS in the clear, and put the problem either on the other node of the cluster, the HBAs in that node, cables, etc. Something which only that server uses to connect to the storage.
If you ever setup log shipping and you get torn pages let this be a lesson, don’t assume that there is a problem on the source database. It might be a problem on the destination database instead.
Denny