Recently I did something which I haven’t had to do for a VERY long time, restore a database off of an EMC Data Domain. Thankfully I wasn’t restoring a failed production system, I was restoring to a replacement production system, so I was getting log shipping setup.
I’ve worked in plenty of shops with Data Domains before, but apparently I’ve blocked out the memories of doing a restore on them. Because if your backups are done the way EMC wants them to be done to get the most of the Data Domain (uncompressed SQL Backups in this case) the restore process is basically unusable. The reason that we are backing up the databases uncompressed was the allow the Data Domain to dedupe the backups as much as possible so that the final backup stored on the Data Domain would be as small as possible so that it could be replicated to another Data Domain in another data center.
The database in this case is ~6TB in size, so it’s a big database. Running the restore off of the EMC Data Domain, was painfully slow. I canceled it after about 24 hours. It was at ~2% complete. Doing a little bit of math that database restore was going to take 25 days. While the restore was running we tried calling EMC support to see if there was a way to get the EMC Data Domain to allow the restores to run faster, and they answer was no, that’s as fast as it’ll run.
After stopping the restore, I backed up the same database to a local disk, and restored it to the new server from there. This time the restore took ~8 hours to complete. A much more acceptable number.
If you are using EMC’s Data Domain (or any backup appliance) do not use that appliance as your only location of your SQL Server backups. These appliances are very efficient at writing backups to them, and replicating those backups off to another site (which is what is being done in this case). But they are horrible at rehydrating those backups so that you can actually restore them. The proof of this is in the throughput of the restore commands. Here’s the output of some of the restore commands that were running. These are for full backups, so there’s nothing for SQL Server to process here, it’s just moving blocks from point A to point B.
RESTORE DATABASE successfully processed 931 pages in 6.044 seconds (1.203 MB/sec).
RESTORE DATABASE successfully processed 510596 pages in 1841.175 seconds (2.166 MB/sec).
RESTORE DATABASE successfully processed 157903 pages in 440.849 seconds (2.798 MB/sec).
RESTORE DATABASE successfully processed 2107959 pages in 4696.428 seconds (3.506 MB/sec).
RESTORE DATABASE successfully processed 77307682 pages in 118807.557 seconds (5.083 MB/sec).
RESTORE DATABASE successfully processed 352411 pages in 816.810 seconds (3.370 MB/sec).
RESTORE DATABASE successfully processed 8400718 pages in 23940.799 seconds (2.741 MB/sec).
RESTORE DATABASE successfully processed 51554 pages in 111.890 seconds (3.599 MB/sec).
RESTORE DATABASE successfully processed 1222431 pages in 3167.605 seconds (3.014 MB/sec).
The biggest database there was restoring at 5 Megs a second. That was 33 hours to restore a database which is just ~606,816 Megs (~592 Gigs) in size. Now before you blame the SQL Server’s or the network, all these servers are physical servers running on Cisco UCS hardware. The network is all 10 Gig networking, and the storage on these new servers is a Pure storage array. The proof that the network and storage was fine was the full restore of the database which was done from the backup to disk, as that was restored off of a UNC path which was still attached to the production server.
When testing these appliances, make sure that doing restores within an acceptable time window is part of your testing practice. If we had found this problem during a system down situation, the company would probably have just gone out of business. There’s no way the business could have afforded to be down for ~25 days waiting for the database to restore.
Needless to say, as soon as this problem came up, we provisioned a huge LUN to the servers to start writing backups to. We’ll figure out how to get the backups offsite (the primary reason that the Data Domain exists in this environment) another day (and in another blog post).
“There’s no way the business could have afforded to be down for ~25 days waiting for the database to restore.”
That’s the type of cost, benefit, risk analysis that every architect needs state when they participate in these sorts of decisions. Assuming that architects are allowed to be part of the decisions.
The purpose of backups is restore.
If you want fast backups, back up to /dev/null.
Fast restores? Well now, that’s something we can discuss.
Hell…we don’t even need backups. We Only Need Restores™
We ran into a huge backup discrepancy too (2TB taking more than 13 hours). We were given access to their 2.0 Beta DDBA tool and saw this drop back to our original backup times (averaging 6 hours). Might want to look into this.
I get much better throughput from our Data Domain appliance. The example below is from a much smaller database, however. Also important to note: I backup to Data Domain WITH COMPRESSION. This would seem to have some fairly obvious trade-offs.
RESTORE DATABASE successfully processed 534687 pages in 31.198 seconds (133.894 MB/sec).
Yep, that’s working well because there’s little to no dedupe happening at the data domain layer.
At a previous job, the director of storage engineering was pushing for us to use the Data Domain appliance for our SQL Server boxes. The requirement to send uncompressed backups to the appliance killed that idea. We were running Litespeed at the time because it was way faster than a native backup. We didn’t have time in our maintenance window to run uncompressed backups given the number of databases and servers. This post is further confirmation we made the right call. Yes. this was before SQL Server had added the compressed backup feature.