Random Unexplained Replication Job Failure in a large SQL Server Replication Topoligy

One of the environments that I work with includes a very large replication topology. While there are only a few servers involved there are dozens of publications between those servers. One of the problems that replication has is that it can only run so many agents at a time.

Now remember that SQL Server replication agents are just SQL Server Agent jobs. And while there are command line applications on the distributor (and subscriber if you are doing pull replication) that’s not quite how SQL Server replication works. You see, when you setup SQL Server replication there are custom job types for the various replication tasks. These custom tasks tell the SQL Server Agent to call custom DLLs which then do some magic and do the same thing as the command line applications which sit on the servers hard drive. At least that’s the theory.

Now on smaller topologies this all works just fine. Everything is happy, the jobs all run as expected. But on larger topologies you start getting some “strange” behavior from these jobs. They’ll randomly throw error messages. Or the agents will say that they are running, but they actually aren’t. The problem is something to do with the DLL that you are running as part of the SQL Agent job.

Thankfully, the trick to fixing this is actually pretty simple. Simply open up the properties for the SQL Agent job which is running the replication agent, and edit step 2 of the job. Change the type from whatever the replication operation is, to “Operating system (CmdExec)”. Then edit the command. Before the batch of parameters in there you want to add the full command line path to the replication agent that you are trying to run. By default this will be “c:program filesMicrosoft SQL Server120com” for SQL Server 2014. For older versions of SQL Server just change the 120 to the version number 110 for SQL 2012, 100 for SQL 2008, 90 for SQL 2005 and 80 for SQL 2000. Then put the name of the executable that you need to use, you’ll find the list of applications listed below.

Application Name Replication Agent
snapshot.exe Snapshot Agent
logread.exe Load Reader Agent
distrib.exe Distribution Agent

When editing the SQL Agent job, if there’s spaces in the path to the executable, you’ll need to put double quotes around the full path and executable, but not the parameters as shown below.

“c:program filesMicrosoft SQL Server120comsnapshot.exe” -Publisher [Server1] -PublisherDB [MyDB] -Distributor [Server2] -Publication [MyPublication] -DistributorSecurityMode 1

You shouldn’t need to make any changes to the parameters. Just save and close the SQL Agent job, then restart the replication agent either from the replication monitor or by manually starting the SQL Agent job and everything will just keep on cranking along as expected.



Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Trust DCAC with your data

Your data systems may be treading water today, but are they prepared for the next phase of your business growth?