If you see your replication log reader slow down for no reason, here’s some stuff to look at.

So a while back we were seeing some very strange behavior with our SQL Replication.  Every once and a while for no apparent reason the log reader would just slow down the pulling of records from a single publisher database.  Our replication was setup with a single publisher, and a single distributor with over a dozen publications all being sent to a couple of different subscribers.

At random times we would see the latency for all the publications for a single database start to climb, eventually being a few hours behind for no apparent reason.  Looking in the normal places didn’t lead me to much.  I looked at some execution plans, and saw a couple of performance issues there (with the Microsoft code) so I threw a couple of new indexes onto the MSlogreader_agents and MSsubscriptions tables (see below) and I also made a couple of tweaks to the sp_MSset_syncstate procedure to fix some of the pathetic code which I found within the procedure (you’ll also find this below).

This helped a little, but it didn’t solve the problem.  What did was when I queried the sys.dm_os_waiting_tasks dynamic management view.  This showed a large number of processes with a wait_type of TRACEWRITE, and these were waiting long enough that blocking was actually starting to pop up (very sporadically making it very hard to see).  A query look at sys.traces told me that there were three traces running against the server. I knew that I didn’t have one running, so I took the session_id values which were shown in sys.traces and looked in sys.dm_exec_sessions for those session IDs to find out who needed to be kicked in the junk.  Turns out that the traces were being run by Quest Software’s Spotlight for SQL Server Enterprise’s Diagnostic Server (the program_name column read “Quest Diagnostic Server (Trace)”).

So I logged into the diagnostic server’s via RDP, and opened Spotlight .  Then edited the properties for the server which is our distributor.  Then I opened the SQL Analysis window, and disabled the SQL Analysis for this server.  Pretty much as soon as I clicked OK through the windows the TRACEWRITE locks went away, and the latency went from 2 hours down to 0.

This just goes to show, just how careful that you have to be when using SQL Profiler (or any sort of tracing) against your database server.

Denny

P.S. If you decide to make these changes to your distributor keep in mind that they may cause anything or everything to break, including patches, etc. that you try and install against the SQL Server engine.  These changes were made for a distributor running SQL Server 2008 R1 build 10.0.1600, use against another build at your own risk. That said, here’s the code.

[sql]USE distribution
GO
CREATE INDEX IX_sp_MSget_new_errorid ON dbo.MSrepl_errors
(id)
WITH (FILLFACTOR=100)
GO
CREATE INDEX IX_sp_MSadd_logreader_history
ON dbo.MSlogreader_agents
(id)
include (name, publication)
GO
CREATE NONCLUSTERED INDEX IX_sp_MSset_syncstate
ON MSsubscriptions
(publisher_id, publisher_db, article_id, subscription_seqno)
include (publication_id)
with (fillfactor=80)
GO
CREATE NONCLUSTERED INDEX IX_sp_MSset_syncstate2
ON MSsubscriptions
(publisher_id, publication_id, sync_type, status, ss_cplt_seqno, publisher_db)
include (article_id, agent_id)
WITH (FILLFACTOR=90, DROP_EXISTING=ON)
GO

ALTER procedure sp_MSset_syncstate
@publisher_id smallint,
@publisher_db sysname,
@article_id int,
@sync_state int,
@xact_seqno varbinary(16)
as
set nocount on
declare @publication_id int

select top 1 @publication_id = s.publication_id
from MSsubscriptions s
where
s.publisher_id = @publisher_id and
s.publisher_db = @publisher_db and
s.article_id = @article_id and
s.subscription_seqno < @xact_seqno

if @publication_id is not null
begin
if( @sync_state = 1 )
begin
if not exists( select * from MSsync_states
where publisher_id = @publisher_id and
publisher_db = @publisher_db and
publication_id = @publication_id )
begin
insert into MSsync_states( publisher_id, publisher_db, publication_id )
values( @publisher_id, @publisher_db, @publication_id )
end
end
else if @sync_state = 0
begin

delete MSsync_states
where
publisher_id = @publisher_id and
publisher_db = @publisher_db and
publication_id = @publication_id

— activate the subscription(s) so the distribution agent can start processing
declare @automatic int
declare @active int
declare @initiated int

select @automatic = 1
select @active = 2
select @initiated = 3

— set status to active, ss_cplt_seqno = commit LSN of xact containing
— syncdone token.

— VERY IMPORTANT: We can only do this because we know that the publisher
— tables are locked in the same transaction that writes the SYNCDONE token.
— If the tables were NOT locked, we could get into a situation where data
— in the table was changed and committed between the time the SYNCDONE token was
— written and the time the SYNCDONE xact was committed. This would cause the
— logreader to replicate the xact with no compensation records, but the advance
— of the ss_cplt_seqno would cause the dist to skip that command since only commands
— with the snapshot bit set will be processed if they are <= ss_cplt_seqno.

update MSsubscriptions
set status = @active,
subscription_time = getdate(),
ss_cplt_seqno = @xact_seqno
where
publisher_id = @publisher_id and
publisher_db = @publisher_db and
publication_id = @publication_id and
sync_type = @automatic and
status = @initiated and
ss_cplt_seqno <= @xact_seqno
OPTION (OPTIMIZE FOR (@automatic=1, @initiated=3, @publisher_id UNKNOWN, @publisher_db UNKNOWN, @xact_seqno UNKNOWN))
end
end
GO

[/sql]

Share

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Trust DCAC with your data

Your data systems may be treading water today, but are they prepared for the next phase of your business growth?