Wednesday, March 25, 2026

The SQL Server Deadlock Fix I Never Would Have Thought Of

Intro

In my previous post, I shared how a deadlock incident changed how I think about using AI in my day-to-day work as a DBA. Here I'll walk through the incident itself. The data and queries aren't the real ones; I recreated the scenario with my own made-up tables, but the deadlock behavior is the same.

Set up a deadlock demo

Below is the T-SQL to create the table and load it with test data. A couple of things worth knowing about the data:

  • For every ProcessId, there is exactly one active record, the one where EndDate is NULL.
  • The StartDate and EndDate pattern reflects a soft-delete approach: instead of updating a record directly, we close the current one by setting its EndDate, then insert a new record with the updated values.
USE [MyBlogDB];

DROP TABLE IF EXISTS [dbo].[Configuration];
CREATE TABLE [dbo].[Configuration] (
	[Id] INT IDENTITY(1,1),
	[ProcessId] INT NOT NULL,
	[Value1] VARCHAR(50) NOT NULL,
	[Value2] VARCHAR(50) NULL,
	[StartDate] DATETIME NOT NULL,
	[EndDate] DATETIME NULL,
	[IsDefaultConfigOption] BIT NOT NULL DEFAULT 0,
	CONSTRAINT [PK_Configuration_Id] PRIMARY KEY ([Id])
);


-- CLAUDE wrote this code to generate 500,000 rows of test data for the Configuration table 
;WITH Numbers AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1
    FROM Numbers
    WHERE n &lt 500000
)
INSERT INTO [dbo].[Configuration] ([ProcessId], [Value1], [Value2], [StartDate], [EndDate], [IsDefaultConfigOption])
SELECT
    -- ProcessId cycles through 1–5000
    ((n - 1) % 5000) + 1,

    -- Value1: e.g. 'Value_1', 'Value_2', ...
    'Value_' + CAST(n AS VARCHAR(10)),

    -- Value2: nullable, every 3rd row is NULL
    CASE WHEN n % 3 = 0 THEN NULL ELSE 'Option_' + CAST((n % 10) + 1 AS VARCHAR(10)) END,

    -- StartDate: spread over the past ~3 years
    DATEADD(DAY, -(n % 1095), GETDATE()),

    -- EndDate: NULL only on the last occurrence of each ProcessId (n > 495000),
-- all prior rows get an end date 1 year after their start CASE WHEN n > 495000 THEN NULL ELSE DATEADD(DAY, -(n % 1095) + 365, GETDATE()) END, -- IsDefaultConfigOption: alternates in groups of 4 CASE WHEN n % 4 IN (1, 2) THEN 1 ELSE 0 END FROM Numbers OPTION (MAXRECURSION 0); -- Required to allow recursion beyond the default 100 limit SELECT * FROM [dbo].[Configuration] where processid=999;

I also added an index similar to the one that existed in the real-world case. It doesn't perfectly cover the WHERE clause, but it's good enough to be useful. I kept it that way intentionally, because in the real world, I wouldn't rush to throw more indexes at the problem.

DROP INDEX IF EXISTS I1_ProcessId_StartDate_EndDate ON [dbo].[Configuration];
CREATE INDEX I1_ProcessId_StartDate_EndDate ON [dbo].[Configuration] ([ProcessId], [StartDate], [EndDate]);

Here's the stored procedure at the center of it all. You might wonder why there is only one? in the real-world scenario, two colleagues were accidentally running the same operation on the same ProcessId at nearly the same time, which is what caused the deadlock.

CREATE OR ALTER PROCEDURE [dbo].[Configuration_Update_IsDefaultConfigOption]
	@ProcessId INT,
	@IsDefaultConfigOptionInput BIT
AS
BEGIN
	SET NOCOUNT ON;
	DECLARE @CurrentDate DATETIME = GETDATE();
	DECLARE @ID INT;
	DECLARE @Value1 VARCHAR(50);
	DECLARE @Value2 VARCHAR(50);

	BEGIN TRY
		BEGIN TRANSACTION;
		SELECT 
			@ID = [Id],
			@Value1 = [Value1],
			@Value2 = [Value2]
		FROM [dbo].[Configuration]
		WHERE [ProcessId] = @ProcessId
		      AND [EndDate] IS NULL;

		UPDATE [dbo].[Configuration]
		SET [EndDate] = @CurrentDate
		WHERE [ProcessId] = @ProcessId
		      AND [EndDate] IS NULL;

		INSERT INTO [dbo].[Configuration] ([ProcessId], [Value1], [Value2], [StartDate], [EndDate], [IsDefaultConfigOption])
		VALUES (@ProcessId, @Value1, @Value2, @CurrentDate, NULL, @IsDefaultConfigOptionInput);

		COMMIT TRANSACTION;
	END TRY
	BEGIN CATCH
	IF @@TRANCOUNT > 0
		ROLLBACK TRANSACTION;
		THROW;
	END CATCH
END
GO

The proc is straightforward. Given a ProcessId, it fetches the current active record (where EndDate is NULL), closes it with a soft delete by setting EndDate, then inserts a new record with the updated IsDefaultConfigOption value and everything else carried over.

To reproduce a deadlock, I used SQLQueryStress with 4 threads all hammering this same query and ProcessID simultaneously (for 10 iterations each to increase the chance of getting the deadlock). In the real world there were only two competing sessions, but I bumped it up to 4 here just to make the deadlock easier to trigger reliably.

Understand the deadlock

Here is the deadlock graph I obtained from running the sp_BlitzLock proc (from First Responder Kit scripts).

<deadlock>
  <victim-list>
    <victimProcess id="process2b5c1d01048" />
  </victim-list>
  <process-list>
    <process id="process2b5c1d01048" spid="87"
▶ more... lockMode="S" waitresource="KEY: 5:72057594058899456 (294556843a07)" isolationlevel="read committed (2)" currentdbname="MyBlogDB" transactionname="user_transaction" lasttranstarted="2026-03-12T16:06:10.557" clientapp="SQLQueryStress" hostname="1NDWJD3"
> <executionStack> <frame procname="MyBlogDB.dbo.Configuration_Update_IsDefaultConfigOption" line="17"> SELECT @ID = [Id], @Value1 = [Value1], @Value2 = [Value2] FROM [dbo].[Configuration] WHERE [ProcessId] = @ProcessId AND [EndDate] IS NULL </frame>
▶ <frame procname="adhoc">...<frame procname="adhoc" line="1"> exec [dbo].[Configuration_Update_IsDefaultConfigOption] 999, 1 </frame>
</executionStack>
▶ <stackFrames> (39 frames, collapsed)... <stackFrames> <frame id="00" address="0x7FFA979F4694" module="ntdll" /> <frame id="01" address="0x7FFA94FBEB29" module="kernelbase" /> <frame id="02" address="0x7FFA77AA65CA" module="SqlDK" /> <frame id="03" address="0x7FFA77AA64ED" module="SqlDK" /> <frame id="04" address="0x7FFA77AA216C" module="SqlDK" /> <frame id="05" address="0x7FFA77AA1C33" module="SqlDK" /> <frame id="06" address="0x7FFA77AA3B66" module="SqlDK" /> <frame id="07" address="0x7FFA4659855A" module="sqlmin" /> <frame id="08" address="0x7FFA465987D0" module="sqlmin" /> <frame id="09" address="0x7FFA4650D84D" module="sqlmin" /> <frame id="10" address="0x7FFA46515599" module="sqlmin" /> <frame id="11" address="0x7FFA4652016E" module="sqlmin" /> <frame id="12" address="0x7FFA465160DD" module="sqlmin" /> <frame id="13" address="0x7FFA46515DFE" module="sqlmin" /> <frame id="14" address="0x7FFA465343AE" module="sqlmin" /> <frame id="15" address="0x7FFA780E1955" module="SqlTsEs" /> <frame id="16" address="0x7FFA46534419" module="sqlmin" /> <frame id="17" address="0x7FFA46533EBE" module="sqlmin" /> <frame id="18" address="0x7FFA465342B4" module="sqlmin" /> <frame id="19" address="0x7FFA46530020" module="sqlmin" /> <frame id="20" address="0x7FFA46543ECE" module="sqlmin" /> <frame id="21" address="0x7FFA4369D4D0" module="sqllang" /> <frame id="22" address="0x7FFA4369D702" module="sqllang" /> <frame id="23" address="0x7FFA43B022D9" module="sqllang" /> <frame id="24" address="0x7FFA44619DFF" module="sqllang" /> <frame id="25" address="0x7FFA4462D5EA" module="sqllang" /> <frame id="26" address="0x7FFA43697700" module="sqllang" /> <frame id="27" address="0x7FFA436CB62F" module="sqllang" /> <frame id="28" address="0x7FFA436CB451" module="sqllang" /> <frame id="29" address="0x7FFA436CB278" module="sqllang" /> <frame id="30" address="0x7FFA43B022D9" module="sqllang" /> <frame id="31" address="0x7FFA44619DFF" module="sqllang" /> <frame id="32" address="0x7FFA4462D5EA" module="sqllang" /> <frame id="33" address="0x7FFA43697700" module="sqllang" /> <frame id="34" address="0x7FFA436A2E8B" module="sqllang" /> <frame id="35" address="0x7FFA436956F1" module="sqllang" /> <frame id="36" address="0x7FFA43695402" module="sqllang" /> <frame id="37" address="0x7FFA77AA9165" module="SqlDK" /> <frame id="38" address="0x7FFA77AA9880" module="SqlDK" /> <frame id="39" address="0x7FFA77AA94E3" module="SqlDK" /> </stackFrames>
▶ <inputbuf>...<inputbuf> exec [dbo].[Configuration_Update_IsDefaultConfigOption] 999, 1 </inputbuf>
</process> <process id="process2b5c1acc478" spid="88"
▶ more... lockMode="X" waitresource="KEY: 5:72057594058964992 (3963a1a319be)" isolationlevel="read committed (2)" currentdbname="MyBlogDB" transactionname="user_transaction" lasttranstarted="2026-03-12T16:06:10.557" clientapp="SQLQueryStress" hostname="1NDWJD3"
> <executionStack> <frame procname="MyBlogDB.dbo.Configuration_Update_IsDefaultConfigOption" line="25"> UPDATE [dbo].[Configuration] SET [EndDate] = @CurrentDate WHERE [ProcessId] = @ProcessId AND [EndDate] IS NULL </frame>
▶ <frame procname="adhoc">...<frame procname="adhoc" line="1"> exec [dbo].[Configuration_Update_IsDefaultConfigOption] 999, 1 </frame>
</executionStack>
▶ <stackFrames> (39 frames, collapsed)... <stackFrames> <frame id="00" address="0x7FFA979F4694" module="ntdll" /> <frame id="01" address="0x7FFA94FBEB29" module="kernelbase" /> <frame id="02" address="0x7FFA77AA65CA" module="SqlDK" /> <frame id="03" address="0x7FFA77AA64ED" module="SqlDK" /> <frame id="04" address="0x7FFA77AA216C" module="SqlDK" /> <frame id="05" address="0x7FFA77AA1C33" module="SqlDK" /> <frame id="06" address="0x7FFA77AA3B66" module="SqlDK" /> <frame id="07" address="0x7FFA4659855A" module="sqlmin" /> <frame id="08" address="0x7FFA465987D0" module="sqlmin" /> <frame id="09" address="0x7FFA4650D84D" module="sqlmin" /> <frame id="10" address="0x7FFA46515599" module="sqlmin" /> <frame id="11" address="0x7FFA4652016E" module="sqlmin" /> <frame id="12" address="0x7FFA4671B463" module="sqlmin" /> <frame id="13" address="0x7FFA4671B147" module="sqlmin" /> <frame id="14" address="0x7FFA4671AE9E" module="sqlmin" /> <frame id="15" address="0x7FFA4671A770" module="sqlmin" /> <frame id="16" address="0x7FFA780E1B97" module="SqlTsEs" /> <frame id="17" address="0x7FFA4657E38E" module="sqlmin" /> <frame id="18" address="0x7FFA46530020" module="sqlmin" /> <frame id="19" address="0x7FFA46543ECE" module="sqlmin" /> <frame id="20" address="0x7FFA4369D4D0" module="sqllang" /> <frame id="21" address="0x7FFA436B4AB8" module="sqllang" /> <frame id="22" address="0x7FFA436B44D9" module="sqllang" /> <frame id="23" address="0x7FFA43B022D9" module="sqllang" /> <frame id="24" address="0x7FFA44619DFF" module="sqllang" /> <frame id="25" address="0x7FFA4462D5EA" module="sqllang" /> <frame id="26" address="0x7FFA43697700" module="sqllang" /> <frame id="27" address="0x7FFA436CB62F" module="sqllang" /> <frame id="28" address="0x7FFA436CB451" module="sqllang" /> <frame id="29" address="0x7FFA436CB278" module="sqllang" /> <frame id="30" address="0x7FFA43B022D9" module="sqllang" /> <frame id="31" address="0x7FFA44619DFF" module="sqllang" /> <frame id="32" address="0x7FFA4462D5EA" module="sqllang" /> <frame id="33" address="0x7FFA43697700" module="sqllang" /> <frame id="34" address="0x7FFA436A2E8B" module="sqllang" /> <frame id="35" address="0x7FFA436956F1" module="sqllang" /> <frame id="36" address="0x7FFA43695402" module="sqllang" /> <frame id="37" address="0x7FFA77AA9165" module="SqlDK" /> <frame id="38" address="0x7FFA77AA9880" module="SqlDK" /> <frame id="39" address="0x7FFA77AA94E3" module="SqlDK" /> </stackFrames>
▶ <inputbuf>...<inputbuf> exec [dbo].[Configuration_Update_IsDefaultConfigOption] 999, 1 </inputbuf>
</process> </process-list> <resource-list> <keylock objectname="MyBlogDB.dbo.Configuration" indexname="PK_Configuration_Id" mode="X"> <owner-list> <owner id="process2b5c1acc478" mode="X" /> </owner-list> <waiter-list> <waiter id="process2b5c1d01048" mode="S" requestType="wait" /> </waiter-list> </keylock> <keylock objectname="MyBlogDB.dbo.Configuration" indexname="I1_ProcessId_StartDate_EndDate" mode="U"> <owner-list> <owner id="process2b5c1d01048" mode="S" /> </owner-list> <waiter-list> <waiter id="process2b5c1acc478" mode="X" requestType="convert" /> </waiter-list> </keylock> </resource-list> </deadlock>

The deadlock occurred while session 1 (process2b5c1d01048) was processing the SELECT and session 2 (process2b5c1acc478) was processing the UPDATE. Session 1 holds an S lock on the I1 index and wants an S lock on the PK index. Session 2, conversely, holds an X lock on the PK index and wants an X lock on the I1 index. Both sessions want the index that the other already holds a lock on, and neither side is willing to release. Thus, a deadlock occurs.

If the two sessions were updating different ProcessIDs, the key locks would target different rows, and no deadlock would occur. No fix needed if the two sessions never contend for the same rows, but since users expect a clean experience instead of an abrupt failure, let's explore our options.

First, though, an important question: why do the two queries acquire locks on the two indexes in opposite orders? Session 1 performs a seek on I1 first, then does a key lookup on PK to retrieve columns not covered by I1. Session 2 also seeks on I1, but when updating, it applies changes to PK first, then maintains I1. That's what creates the inversion. See their actual execution plans below:


Before I asked AI 

I came up with a few ideas to fix:

  • Tune the queries to finish faster: Both queries are simple and already have a reasonable index.  I could have reordered I1 to (ProcessID, EndDate, StartDate) or added a new index I2 with (ProcessID, EndDate) as key to better match the WHERE clause. However, reordering I1 might affect other queries, and adding a new index means additional maintenance overhead.
  • Create a covering index: adding columns to I1 to make it a covering index would eliminate the key lookup on PK for the SELECT query. But that adds even more maintenance overhead, in my real-world scenario, that would have meant adding 10 more columns.
  •  Add retry logic: Retry the operation automatically when it fails due to a deadlock.

Every option I could think of was adding overhead somewhere, so I landed on retry logic. It's perfectly reasonable; users won't see a strange deadlock error they don't know how to handle, and since two users competing to update the same row is expected behavior, it's fine for one to fail and have the app gracefully retry. I messaged my coworker about what I thought we should do.

Then I asked Claude.

AI as my advisor on the problem

I was curious how the AI would approach this problem, so I gave it the deadlock report, the query text, and the execution plans. The sp_BlitzLock proc collects all of these in one place, which means I didn't have to do that work myself. That's one of the lessons I took from this: when the right tooling makes it easy to gather data, you're much more likely to actually bring that data to AI in the first place, enhancing the quality of the answers.

The AI correctly identified the deadlock and its root cause:

  • "Classic circular wait: A holds NCI S-lock and wants CI; B holds CI X-lock and wants NCI. SQL Server kills A as the victim."  ✓
  • "The lock ordering inversion is driven by two things: (1) The key lookup in the SELECT plan forces an NCI → CI lock acquisition sequence (2) The UPDATE acquires locks in CI → NCI order (it updates the clustered index first, then maintains the non-clustered index)" ✓
For the fixes, it did a better job this time around: it didn't flag the covering index approach as the "Best Fix," which it had in the real-world scenario I mentioned in my last post. My guess is that I used a better prompt this time (borrowed from First Responder Kit - Using AI documentation). 


Here's the fix Claude suggested:
" Fix 1 — Rewrite the SP (Primary Fix):
Eliminate the separate SELECT entirely by using UPDATE ... OUTPUT. This collapses the SELECT + UPDATE into a single statement, removing the interleaved lock acquisition that causes the circular wait. Two concurrent sessions doing UPDATE ... OUTPUT will conflict on the NCI U-lock upfront and serialize cleanly — no deadlock possible."
CREATE OR ALTER PROCEDURE [dbo].[Configuration_Update_IsDefaultConfigOption]
    @ProcessId INT,
    @IsDefaultConfigOptionInput BIT
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @CurrentDate DATETIME = GETDATE();
    DECLARE @ID INT;
    DECLARE @Value1 VARCHAR(50);
    DECLARE @Value2 VARCHAR(50);

    -- Table variable to capture the pre-update row values via OUTPUT
    DECLARE @Captured TABLE
    (
        Id     INT,
        Value1 VARCHAR(50),
        Value2 VARCHAR(50)
    );

    BEGIN TRY
        BEGIN TRANSACTION;

        -- Single statement replaces the separate SELECT + UPDATE.
        -- OUTPUT captures the values we need before the row is modified.
        UPDATE [dbo].[Configuration]
        SET [EndDate] = @CurrentDate
        OUTPUT
            DELETED.[Id],
            DELETED.[Value1],
            DELETED.[Value2]
        INTO @Captured (Id, Value1, Value2)
        WHERE [ProcessId] = @ProcessId
          AND [EndDate] IS NULL;

        SELECT
            @ID     = Id,
            @Value1 = Value1,
            @Value2 = Value2
        FROM @Captured;

        INSERT INTO [dbo].[Configuration]
            ([ProcessId], [Value1], [Value2], [StartDate], [EndDate], [IsDefaultConfigOption])
        VALUES
            (@ProcessId, @Value1, @Value2, @CurrentDate, NULL, @IsDefaultConfigOptionInput);

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        THROW;
    END CATCH
END
GO

The fix is simple and clean in hindsight, but it never crossed my mind, simply because UPDATE...OUTPUT isn't something I reach for often. That's exactly why I think AI is something all of us should learn to use well. I now treat Claude as an advisor I consult for second opinions (one who draws from a much wider surface area than I can hold in my head at any one time XD). And just like any working relationship, I'm still learning how to get the most out of it.

*Of course, I tested the UPDATE...OUTPUT fix, it works 🎉.

 

No comments:

Post a Comment