
For some time my primary workplace has been having a problem with SQL Server replication delaying for several minutes at a time which is surprising given the 12GB of RAM and quad processor hardware behind it.
Activity Monitor showed a number of processes on the distribution database used for SQL Server’s replication were blocked by another process which was in turn blocked by a SQL Agent TSQL Job executing on the distribution database.
This process’s job GUID that did not match any of our jobs and appeared to be an internal job that regularly executes:
EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_disretention = 32760
This would sit on the PAGEIOLATCH_SH status for over 10 minutes at a time most most likely because it was operating on an MSrepl_commands table with 3.4 million records and a MSrepl_transactions table of 600k records.
The default max_disretention is 72 (hours) and specifies how long to hold the replication history for. This means if a subscriber is down for longer than this period it will lose transactions.
How exactly this system obtained the value 32760 (3.7 years) is a bit of a mystery and I’m assuming that it actually uses a lower value as there have been more than 3.4 million updates in that period.
Setting this value to 84 hours (3.5 days) in Management Studio’s Replication > Distributor Properties > Transaction Retention soon dropped those large MS tables down to 406 records and 300 respectively and resulted in sp\_MSdistribution\_cleanup executing in a couple of seconds.
It’s a very specific scenario but perhaps this blog post will be of use to somebody else!
[)amien
0 responses