A very strange Service Broker and TempDB problem

So, a little while ago I was having a major problem at work that I’d like to share in the hopes that someone else finds this information useful. (When I was having this problem, Google wasn’t able to find anything for me.)

The symptoms that we were having in a nut shell was that the tempdb was taking all the drive space which was allocated to it.  On this system, tempdb is sitting on it’s own 50 Gig drive, so SQL can take the entire thing if it needs to.  The problem was that sp_spaceused was showing that the entire database was empty, but we were getting out of space errors when running queries which needed to use tempdb.  First thing we did was create another tempdb file off on another drive which is a 200 Gig drive.  This gave us space to work with while we tried to figure the entire mess out.

The first thought that we had (well OK, not the first) was that we were having the problem described in MSKB 947204.  The solution to this KB Article was to apply SQL 2005 SP 2 Update 6.  So we did this, but to no avail.

Now, I need to take a little break and give you some additional background before we continue.

About 2 months ago we made a hardware swap on the SQL Server.  The old server wouldn’t take a memory upgrade, so we got a new server with the correct amount of RAM (old server had 8 Gigs, new server had 16) and did a fork lift migration to the new server.  The migration went fine, totally as expected.

Now another piece of information that you’ll need is that we are using the service broker heavily within our database.

What we didn’t notice, and what was eventually found to be the root cause of the problem, was that the service broker wasn’t working correctly.  The reason we didn’t notice it was that there were never any error messages.  Messages were being sent, but they never actually got delivered.

There were no errors being logged in the errorlog, on in SQL Profiler (when monitored using either the Errors and Alerts or Broker monitors).  The error could only be seen by querying the sys.transmission_queue.  In the transmission_status column was some sort of error along the lines of “An error occurred during decryption”.  It also gave some sort of error about the database master key.

 Working off of the database master key error I did an ALTER MASTER KEY and reset the master key, but that didn’t help.  I then decided to reset the connection between the master key and the service key by using the ALTER MASTER KEY ADD ENCRYPTION BY SERVICE MASTER KEY command.  This took forever to execute, which I assumed was a good thing since I didn’t see any blocking.  Once it was finished the CPUs on the SQL Server started running very hot, and the disk IO went up very high.  All the CPU was being taken by SPIDs 16 and 20-24 (or so) with a command of BRKR TASK (Broker Task).  So the service broker was now chewing on all the messages in the system (we had no idea how many there were, we estimated it at millions).  After a while of watching the system process messages which were no longer relevant we decided that we needed to kill all the messages in the service broker as quickly as possible. Fortunitly the few messages which we would need to be able to keep we would be able to easily recreate.

I wrote a nice little script which queried the top 100 conversation IDs from the sys.transmission_queue DMV and did an END CONVERSATION WITH CLEANUP on them to get rid of them.  This was then wrapped in a loop so that it would kill messages in-mass.  Because of the massive load on the database after 2 days of running only about 350,000 conversations had been closed.  During this two day window the SQL service had to be restarted twice because tempdb had grown to over 200 Gigs with 198 Gigs of that showing free space, but the system still reporting out of space messages.

 We decided that a more drastic approach was needed.  We thought about deleting the queues, but that wouldn’t help.  Service broker would simply hold the messages until the queues were back then start processing the messages again.  I thought about deleting the Service Broker Service, and tried it but the process hung waiting for the disk, as I assume it was removing all the messages which went with that service.  I didn’t want that sort of blocking and load on the system, so I killed that.

This is when I remembered that all messages being sent or received are done to a specific service broker GUID, not to a specific database name.  This gave me an idea.  If I change the GUID, the messages which the service broker is chewing on will become invalid and SQL will remove them as bogus messages.  So I stopped the SQL Service, and restarted it in single user mode with a minimal config (not actually needed, but I needed to get all the users out of the database, and with 30,000 people hitting the system 24x7x365 this was just the easiest way to do it).  I then did an ALTER DATABASE MyDatabase WITH NEW_BROKER.  This changed the GUID of my database to a new GUID which stopped SQL from processing the messages.  I restarted the service normally and brought the system back online.  The tempdb is sitting at 30 Gigs.

Some of the queries which will come in handy to see if you are having this same kind of issue are:

The first one is pretty self explanatory.
exec sp_spaceused;

If there are records in the sys.transmission_queue with an error about the master key, or decryption.
FROM sys.transmission_queue

The second value is the important one here. The internal_obj_kb is the amount of space taken in the tempdb by internal objects like table variables, stored procedure input parameter definitions, etc.
SUM (user_object_reserved_page_count)*8 as usr_obj_kb, SUM (internal_object_reserved_page_count)*8 as internal_obj_kb, SUM (version_store_reserved_page_count)*8 as version_store_kb, SUM (unallocated_extent_page_count)*8 as freespace_kb, SUM (mixed_extent_page_count)*8 as mixedextent_kb FROM sys.dm_db_file_space_usage

This query shows the number of pages and the number of Megs that the internal objects where using.
SELECT SUM(internal_object_reserved_page_count) AS [internal object pages used],
(SUM(internal_object_reserved_page_count)*1.0/128) AS [internal object space in MB]
FROM sys.dm_db_file_space_usage;

By using these queries we were able to find out that we had Gigs of space assigned as reserved for internal objects, even those sp_spaceused was showing us that all the space in the database was available because there wasn’t actually any data written to it from a physical object. We normally have between 7 and 15 Megs of internal object space, but until the root cause of the issue was resolved we were running about 100 Gigs of internal object space.

So if you have moved from one server to another, and you use SQL Service Broker, and you aren’t getting messages processed because of a master key or decryption error, and your tempdb is filling up like crazy, this might be your problem.

I hope that no one ever has this problem, but if you do and you have run across this posting, I hope that you find it useful in fixing your problem.


12 Responses

  1. It happened the same to me. Service Broker was consuming the entire tempdb and the mdf was growing really fast because a fire and forget issue was leaving alive the conversations created.

  2. Hello

    we actually have this issue also and your last query just show something huge and we’re getting mad…
    internal object pages used internal object space in MB
    40305040 314883.125000

    we don’t have any idea how to get over it. We have a cluster (2 nodes) and a mirror server…
    any help would be welcome  
  3. hlinds,
    Have you checked to see if there are a bunch open conversations which haven’t closed correctly?  If there are you’ll need to close them.

  4. Thanks Denny for your reply
    Not really because i’m not familiar with this mecanism so i’m gonna check this out and keep you posted.

  5. sp_spaceused being wrong in tempdb isn’t suprising.  When you queried the conversation_endpoints were you in tempdb or your application database?  You should have been in your application database.

  6. yes indeed. I have a huge number of conversation_endpoint so far basically.

    I’ve just launch a select count statement from my application database that’s still running (2mns left).
  7. OK, so stop the query.  There’s no point in leaving that running.  You need to one of two things.  Loop through all the records in conversation_endpoints and do an END CONVERSATION on the ones which should be closed.  Or do an ALTER DATABASE SET NEW_BROKER.  If you using routes to route messages from one database to another the ALTER DATABASE option gets a lot harder as you have to go fix all the routes.

    You also need to go through your code and fix the problem that is causing the conversations to stay open.

  8. Query stopped! find a way to properly kill or close end conversation. I’ll also get information from our dev team and as usual keep you posted.

    Many thanks 

  9. Hello Denny

    Is there a link regarding SQL Server 2008 R2 Miroring database “using” service broker so messages are routed properly?
    To avoid TempDB expansion

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Trust DCAC with your data

Your data systems may be treading water today, but are they prepared for the next phase of your business growth?