When SQL Server replication eats disk space
Part of my job involves revising our SQL Server architecture. My plan includes the addition of a read-only reporting SQL pair for non-critical enquiries and reports. This allows the heavy and unpredictable load from reporting away from from the primary SQL pair responsible for critical operations (shipping orders).
We utilised SQL Server’s publisher-subscriber replication on the required databases which, given their legacy nature, had some cross-database dependencies that were added without due consideration.
The reporting SQL pair needed to be rebuilt so the subscriptions were removed whilst they were being rebuilt. Within three days our primary servers were out of disk space and we had a problem that needed to be solved quickly.
Our first step was to remove the new publications. This released a small amount of space in the distribution database (now 100GB) and allowed the replicated databases logs to be truncated and shrunk. We now had a little breathing room.
One publication would not remove in a reasonable time (10+ hours) and even trying to remove individual articles caused a slow-running operation sat on PAGEIOLATCH_SH (which Microsoft describe as waiting on disk IO). A quick investigation revealed a complex query on internal system tables with no indexes – presumably for fast insert performance.
In the mean-time this slow-running operation would block the replication agents which are also used by IBM’s DataMirror product to replicate data back to the iSeries (AS/400) in real time. This blocking would prevent us shipping orders and was therefore unacceptable.
The database in question is a home-grown data warehousing application rebuilt each night and on reflection was a terrible idea to replicate using this method. Each night it generated over twice it’s own size in replication activity. Given we have a 72 hour replication retention and the size of the database that alone was enough to wipe out the disk space on our server within 3 days.
Reinitialize All Subscriptions was the magic bullet which ran in just a minute allowing the publication to be subsequently instantly deleted. The replicated databases transaction logs were then truncated and shrunk giving us ample breathing space on the storage front.
The distribution database was still huge but heading into Replication Monitor’s Common Jobs and starting the Distribution clean up: distribution job was our final step.
The job might take all day but did not block any other replication activity :)