The problem

Imagine a data warehouse that is loaded by many ETL processes, either from a staging area or directly from external systems. There are also many reading tasks that fill data marts for reporting purposes.

Now imagine you have to update many records because you discovered a bug, and you cannot create a maintenance window since the DW has to be online 24/7.

So let us just do the update straight away, right? For that purpose we create a large table, an example fact table:

CREATE TABLE dbo.demotable (
	id INT IDENTITY NOT NULL,
	val1 BIGINT NOT NULL,
	val2 BIGINT NOT NULL,
	val3 BIGINT NOT NULL,
	val4 BIGINT NOT NULL,
	val5 BIGINT NOT NULL,
	val6 BIGINT NOT NULL,
	val7 BIGINT NOT NULL,
	val8 BIGINT NOT NULL,
	val9 BIGINT NOT NULL,
	val10 BIGINT NOT NULL,
	val11 BIGINT NOT NULL,
	val12 BIGINT NOT NULL
);

CREATE CLUSTERED INDEX IX_tempidx_demotable ON dbo.demotable (id);

Now let us fill the table with 4 million data rows. That may take a while. There are faster ways to do this (a set-based sketch follows after the loop), but the simple loop keeps the example minimal.

SET NOCOUNT ON

DECLARE @iterator INT  = 1;

WHILE @iterator <= 4000000
BEGIN;
	INSERT INTO dbo.demotable (val1,val2,val3,val4,val5,val6,val7,val8,val9,val10,val11,val12)
	VALUES (0,0,0,0,0,0,0,0,0,0,0,0)
	SET @iterator = @iterator + 1;
END;

SET NOCOUNT OFF
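
As an aside, here is one of those faster, set-based ways: a sketch of my own (not part of the original walkthrough) that could replace the loop above.

-- Generate 4,000,000 identical rows in a single set-based INSERT:
-- cross-joining a ten-row set seven times yields 10^7 candidate rows,
-- and TOP caps the result at the 4 million we need.
WITH n AS (
	SELECT v FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS t(v)
)
INSERT INTO dbo.demotable (val1,val2,val3,val4,val5,val6,val7,val8,val9,val10,val11,val12)
SELECT TOP (4000000) 0,0,0,0,0,0,0,0,0,0,0,0
FROM n AS a CROSS JOIN n AS b CROSS JOIN n AS c CROSS JOIN n AS d
	CROSS JOIN n AS e CROSS JOIN n AS f CROSS JOIN n AS g;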

In a second tab/session, let us now run the following script. It selects every column of a random row in an infinite loop and afterwards logs the current time in a temporary table. We use that table to measure whether we still had access to our demo table during the update.

IF OBJECT_ID('tempdb..#times') IS NOT NULL
BEGIN;
	DROP TABLE #times;
END;

CREATE TABLE #times (
	[datetime] DATETIME2(3)
)

DECLARE @maxid INT = (SELECT MAX(id) FROM dbo.demotable)

WHILE 1=1
BEGIN;
	SELECT *
	FROM dbo.demotable
	WHERE id = ABS(CHECKSUM(NEWID())) % @maxid + 1;

	INSERT INTO #times VALUES (CURRENT_TIMESTAMP);
END;

In the first tab/session execute the update statement.

UPDATE dbo.demotable 
SET 
	val1 = ABS(CHECKSUM(NEWID())),
	val2 = ABS(CHECKSUM(NEWID())),
	val3 = ABS(CHECKSUM(NEWID())),
	val4 = ABS(CHECKSUM(NEWID())),
	val5 = ABS(CHECKSUM(NEWID())),
	val6 = ABS(CHECKSUM(NEWID())),
	val7 = ABS(CHECKSUM(NEWID())),
	val8 = ABS(CHECKSUM(NEWID())),
	val9 = ABS(CHECKSUM(NEWID())),
	val10 = ABS(CHECKSUM(NEWID())),
	val11 = ABS(CHECKSUM(NEWID())),
	val12 = ABS(CHECKSUM(NEWID()))

After the update statement is done, we stop the infinite loop in the second tab. What interests us now is the maximum gap between two consecutive timestamps:

WITH timeWithDiff AS (
SELECT 
		DATEDIFF(	MILLISECOND,
					[datetime], 
					LEAD([datetime]) OVER(ORDER BY [datetime])
				) as diff
FROM #times
)
SELECT MAX(diff)
FROM timeWithDiff

The result shows 22754 milliseconds, which is essentially the entire execution time of the update statement. That means the whole table was inaccessible for the duration of the update. Now imagine a more complex update with a join and/or subqueries: it would take much longer, and the exclusive locks would be held until the transaction completes.
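
For illustration, such an update might look like the following sketch; dbo.corrections and its column are hypothetical, invented only to show the shape of the statement.

-- Hypothetical complex update joining against a (made-up) corrections table.
-- All exclusive locks taken here are held until the whole statement commits.
UPDATE a
SET a.val1 = c.fixedval1
FROM dbo.demotable AS a
JOIN dbo.corrections AS c
	ON a.id = c.id
WHERE a.val1 <> c.fixedval1;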

In a data warehouse environment, this is simply not an option.

The solution

One possibility is to split the update into a get phase and a set phase.

In the get phase we execute the update logic, but instead of updating the table we select the new values that we will later use to update it, and we also save the id so that in the end we know which row gets which values. The WITH (NOLOCK) query hint speeds the process up and prevents shared locks on the DW table; keep in mind that it allows dirty reads, so only use it when concurrent writes cannot corrupt the values you compute.

SELECT	 id 
		,ABS(CHECKSUM(NEWID())) as newval1
		,ABS(CHECKSUM(NEWID())) as newval2
		,ABS(CHECKSUM(NEWID())) as newval3
		,ABS(CHECKSUM(NEWID())) as newval4
		,ABS(CHECKSUM(NEWID())) as newval5
		,ABS(CHECKSUM(NEWID())) as newval6
		,ABS(CHECKSUM(NEWID())) as newval7
		,ABS(CHECKSUM(NEWID())) as newval8
		,ABS(CHECKSUM(NEWID())) as newval9
		,ABS(CHECKSUM(NEWID())) as newval10
		,ABS(CHECKSUM(NEWID())) as newval11
		,ABS(CHECKSUM(NEWID())) as newval12
INTO #updateTable
FROM	dbo.demotable WITH(NOLOCK)
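
An optional tweak of my own, not part of the original script: a clustered index on the work table makes the per-batch joins and deletes below noticeably cheaper.

-- Optional: index the work table on id so each batch iteration can seek
-- instead of scanning the whole temp table.
CREATE CLUSTERED INDEX IX_updateTable_id ON #updateTable (id);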

Now let us start the time-measuring script posted above in the second tab again.

Done? Now let us start the batch update process. Since each UPDATE statement commits in its own implicit transaction, its locks are released after every batch, giving waiting readers a chance to run in between.

DECLARE @rowsaffected INT = 1;
DECLARE @batchsize INT = 10000;

WHILE @rowsaffected > 0 
BEGIN;
	
	-- delete #updateTableBatch when exists
	IF OBJECT_ID('tempdb..#updateTableBatch') IS NOT NULL
	BEGIN;
		DROP TABLE #updateTableBatch;
	END;

	-- load current batch in #updateTableBatch
	SELECT TOP (@batchsize) *
	INTO #updateTableBatch
	FROM #updateTable;

	-- update table for current batch
	UPDATE a 
	SET 
		 a.val1 = b.newval1
		,a.val2 = b.newval2
		,a.val3 = b.newval3
		,a.val4 = b.newval4
		,a.val5 = b.newval5
		,a.val6 = b.newval6
		,a.val7 = b.newval7
		,a.val8 = b.newval8
		,a.val9 = b.newval9
		,a.val10 = b.newval10
		,a.val11 = b.newval11
		,a.val12 = b.newval12
	FROM dbo.demotable as a
	JOIN #updateTableBatch as b
		ON a.id = b.id;

	SET @rowsaffected = @@ROWCOUNT;

	-- clear batch from todo table
	DELETE a 
	FROM #updateTable as a
	JOIN #updateTableBatch as b 
		ON a.id = b.id ;

END;
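
As a quick sanity check (my addition, not part of the original walkthrough), confirm that the work table has been fully drained:

-- Should return 0 once every batch has been applied and deleted.
SELECT COUNT(*) AS remaining FROM #updateTable;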

Last step: We check the access times by executing the query posted above again:

WITH timeWithDiff AS (
SELECT 
		DATEDIFF(	MILLISECOND,
					[datetime], 
					LEAD([datetime]) OVER(ORDER BY [datetime])
				) as diff
FROM #times
)
SELECT MAX(diff)
FROM timeWithDiff

This time the maximum gap was only 1273 milliseconds, so we practically did not block the table at all.

Remark

This method is only worth it if you have to apply hotfixes to a live data warehouse, or to any application where there is no maintenance window, or where the fix/update would simply take too long and therefore block several other crucial processes.

The price for the minimally blocking update is, of course, time. The batch size is a trade-off: it should be small enough that each transaction, and thus each blocking window, stays short, but as big as possible, because otherwise the update will take MUCH longer. With 4 million rows and a batch size of 10,000, for example, the loop already runs 400 iterations.

Alternatives to this approach exist, e.g. if your fact table is partitioned, try partition switching.
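
A minimal sketch of that idea, assuming dbo.demotable were partitioned and dbo.demotable_stage were an identically structured staging table on the same filegroup (both are assumptions, not part of this demo):

-- Switch the affected partition out to a staging table (a metadata-only
-- operation), fix the data there away from the readers, then switch it back.
ALTER TABLE dbo.demotable SWITCH PARTITION 3 TO dbo.demotable_stage;

UPDATE dbo.demotable_stage
SET val1 = ABS(CHECKSUM(NEWID()));	-- the actual fix, invisible to readers

ALTER TABLE dbo.demotable_stage SWITCH TO dbo.demotable PARTITION 3;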