I was working on performance tuning the data deletion process that I use to remove data from our database. We delete massive amounts of data on a daily basis, so our data deletion process is pretty important to us. As we’ve grown our customer base the amount of data to be deleted every day has also grown, and the amount of time to delete that data has gone up as well.
We it has started to take stupid amounts of time to delete the data, so I started digging into the data deletion procedures.
The first thing I looked at was the actual delete statements. These seams ok, the actual deletes were happening very quickly (we process deletes in batches of 1000 records per batch to minimize locking of data). So next I looked at the part of the code where we select the records to be deleted. Looking at the execution plan, everything looked ok.
But this little chunk of code took about 50 minutes to run. Pretty bad when only returning 1000 numbers back from the database.
[sql]SELECT TOP (@BatchSize) a.PolicyIncidentId
FROM PolicyIncident a WITH (NOLOCK)
JOIN #ComputersToProcess ComputersToProcess ON a.ComputerId = ComputersToProcess.ComputerId
WHERE CaptureTimestamp < ComputersToProcess.StartDeleteAt
The first thing that I did was put a primary key on the @ComputersToProcess table variable. That turned the table scan into a Clustered Index Scan, but didn’t do anything for performance.
The next thing I did was switch the table variable to a temp table (without a primary key). This really didn’t do anything to speed up the process as there is still no statistics on the data. However this time the execution plan actually shows you that there’s no statistic on the temp table.
Now, I didn’t want to put at non-clustered index on the table keeping the table as a heap, and a clustered index that wasn’t a primary key wasn’t going be any more effective than a primary key, so I put a primary key on the table. While the query cost percentage went up from 2% to 7% the actual run time went down from 50 minutes to just 1 second.
Now I didn’t make any other code changes to the procedures, just changing from the table variable to the temp table, and adding a primary key and this one little three line query went from an hour to a second. Its amazing how much such a small change can make things run smoother.
Now obviously this isn’t going to fix every problem. But in my case I’m putting a little over 190k rows into the table variable (now temp table) and this is just to much for the table variable to take. Keep in mind that with table variables the SQL Server has statistics, but it assumes only a single row per temp table, no matter how much data is actually in the table variable.
Is that really a NOLOCK hint on that query? ;c)
Did you capture an ‘actual’ execution plan from the run that took 1 second? I would have expected to see a hash or merge join rather than nested loops – and thicker connecting lines representing the higher row counts.
Your last sentence says “…with table variables the SQL Server has statistics…” – perhaps that is a typo? Statistics are not created for table variables.
Although statistical information is not available for table variables, if you use an OPTION (RECOMPILE) query hint, the run-time [I]cardinality[/I] of the table variable is available to the optimizer.
Using this hint would have allowed the optimizer to ‘see’ the 190,000+ rows in the table variable, resulting in a much better plan. One other benefit to using OPTION (RECOMPILE) here is that the optimizer could see the run-time value of @BatchSize.