SQL Server replication blocking on cleanup job

For some time my primary workplace has been having a problem with SQL Server replication delaying for several minutes at a time which is surprising given the 12GB of RAM and quad processor hardware behind it.

Activity Monitor showed a number of processes on the distribution database used for SQL Server’s replication were blocked by another process which was in turn blocked by a SQL Agent – TSQL Job executing on the distribution database.

This process’s job GUID that did not match any of our jobs and appeared to be an internal job that regularly executes:

EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_disretention = 32760

This would sit on the PAGEIOLATCH_SH status for over 10 minutes at a time most most likely because it was operating on an MSrepl_commands table with 3.4 million records and a MSrepl_transactions table of 600k records.

The default max_disretention is 72 (hours) and specifies how long to hold the replication history for. This means if a subscriber is down for longer than this period it will loose transactions.

Quite how this system obtained the value 32760 (3.7 years) is a bit of a mystery and I’m assuming that it actually uses a lower value as there have been more than 3.4 million updates in that period.

Setting this value to 84 hours (3.5 days) in Management Studio’s Replication > Distributor Properties > Transaction Retention soon dropped those large MS tables down to 406 records and 300 respectively and resulted in sp\_MSdistribution\_cleanup executing in a couple of seconds.

It’s a very specific scenario but perhaps this blog post will be of use to somebody else!

[)amien

0 responses