SQL Server replication blocking on cleanup job
For some time my primary workplace has been having a problem with SQL Server replication delaying for several minutes at a time which is surprising given the 12GB of RAM and quad processor hardware behind it.
Activity Monitor showed a number of processes on the distribution database used for SQL Server’s replication were blocked by another process which was in turn blocked by a SQL Agent – TSQL Job executing on the distribution database.
This process’s job GUID that did not match any of our jobs and appeared to be an internal job that regularly executes:
EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_disretention = 32760
This would sit on the PAGEIOLATCH_SH status for over 10 minutes at a time most most likely because it was operating on an MSrepl_commands table with 3.4 million records and a MSrepl_transactions table of 600k records.
The default max_disretention is 72 (hours) and specifies how long to hold the replication history for. This means if a subscriber is down for longer than this period it will loose transactions.
Quite how this system obtained the value 32760 (3.7 years) is a bit of a mystery and I’m assuming that it actually uses a lower value as there have been more than 3.4 million updates in that period.
Setting this value to 84 hours (3.5 days) in Management Studio’s Replication > Distributor Properties > Transaction Retention soon dropped those large MS tables down to 406 records and 300 respectively and resulted in sp_MSdistribution_cleanup executing in a couple of seconds.
It’s a very specific scenario but perhaps this blog post will be of use to somebody else!