NGUYENSQL.COM

Tuesday, June 23, 2026

When Msg 3991 Lies: A Swallowed Permissions Error in SQLCLR

How a misleading transaction-nesting error hid the real culprit — a denied EXECUTE permission — and how Extended Events finally surfaced it.

Intro

Msg 3991, Level 16, State 1, Procedure dbo.Demo_BuildParts, Line 6 [Batch Start Line 100]

The context transaction which was active before entering user defined routine, trigger or aggregate "GetParts" has been ended inside of it, which is not allowed. Change application logic to enforce strict transaction nesting.

Last week one of our dev teams hit this while testing an application, and the message sent us chasing the wrong problem. It blames transaction nesting. The real cause is something the message never mentions: a SQLCLR function (written in VB.NET) was calling stored procedures the login didn’t have EXECUTE permission on.

The short version of what actually happened: the function’s Try/Catch caught those permission errors, swallowed them, and returned normally — but the failed calls had already ended the transaction the function was running in. When control came back to SQL Server, @@TRANCOUNT had dropped from 1 to 0, so the engine raised 3991 and complained about transaction nesting. The permission errors that caused all of it never surfaced.

The fix was almost anticlimactic: grant the login EXECUTE on the two procedures. But I wanted to know why the original errors vanished, and that pulled me into the genuinely confusing world of SQL Server error and transaction handling. Erland Sommarskog’s series is what made something click: Error and Transaction Handling in SQL Server

What follows is a reduced repro of the problem, plus the two questions it left me with.

The setup

In the real system this call chain is buried under a lot of code, which is part of why it was so hard to see. Stripped down, the pieces are simple: an INSERT fires an AFTER INSERT trigger, the trigger selects from a CLR table-valued function, and that function calls two stored procedures the user can’t execute.

First, the VB code behind the CLR function. Two things matter. GetParts() calls dbo.DeniedProc1 and dbo.DeniedProc2 through a small helper, RunProc; and RunProc wraps each call in a Try/Catch that, on any error, simply returns its (empty) list — no rethrow.

Imports System
Imports System.Collections
Imports System.Collections.Generic
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports System.Runtime.InteropServices
Imports Microsoft.SqlServer.Server

Public Class Demo3991

    ' Runs one proc on the context connection and returns the result list.
    ' On ANY error it returns the (empty) list from the catch
    Private Shared Function RunProc(ByVal procName As String) As List(Of String)
        Dim items As New List(Of String)
        Try
            Using conn As SqlConnection = New SqlConnection("context connection=true")
                conn.Open()
                Using cmd As SqlCommand = conn.CreateCommand()
                    cmd.CommandType = CommandType.StoredProcedure
                    cmd.CommandText = procName
                    cmd.ExecuteNonQuery()           ' EXECUTE denied -> raises 229
                End Using
            End Using
            items.Add(procName & ":ok")
            Return items
        Catch ex As Exception
            Return items                            ' <-- swallow + return the list
        End Try
    End Function

    ' CLR table-valued function
    <SqlFunction(FillRowMethodName:="FillRow", _
                 TableDefinition:="PartName nvarchar(100)", _
                 DataAccess:=DataAccessKind.Read)> _
    Public Shared Function GetParts() As IEnumerable
        Dim parts As New List(Of String)
        parts.AddRange(RunProc("dbo.DeniedProc1"))   ' 229 #1 (swallowed via Return, like the moldings proc)
        parts.AddRange(RunProc("dbo.DeniedProc2"))   ' continues -> 229 #2 (swallowed via Return, like the panel proc)
        Return parts                                  ' returns the (now empty) list normally
    End Function

    Public Shared Sub FillRow(ByVal obj As Object, <Out()> ByRef PartName As SqlString)
        PartName = New SqlString(CStr(obj))
    End Sub

End Class

Compile that into Demo3991.dll, then register the assembly and create the function:

USE MyBlogDB;    
GO

EXEC sp_configure 'clr enabled', 1; RECONFIGURE;
GO

/* Trust the DLL by hash (SQL Server 2017+, avoids signing / TRUSTWORTHY) */
DECLARE @hash varbinary(64) =
    (SELECT HASHBYTES('SHA2_512', BulkColumn)
     FROM OPENROWSET(BULK 'C:\temp\CLRDemo\Demo3991.dll', SINGLE_BLOB) AS f);
IF NOT EXISTS (SELECT 1 FROM sys.trusted_assemblies WHERE [hash] = @hash)
    EXEC sys.sp_add_trusted_assembly @hash, N'Demo3991';
GO


/* ---- CLR TVF that swallows on error ---- */
CREATE ASSEMBLY Demo3991 FROM 'C:\temp\CLRDemo\Demo3991.dll' WITH PERMISSION_SET = SAFE;
GO
CREATE FUNCTION dbo.GetParts()
RETURNS TABLE (PartName nvarchar(100))
AS EXTERNAL NAME Demo3991.[Demo3991].GetParts;
GO

Create the two procedures the function calls:

/* ---- the two procs the user is NOT granted permissions to execute ---- */
CREATE PROCEDURE dbo.DeniedProc1 AS BEGIN SET NOCOUNT ON; SELECT 1 AS x; END
GO
CREATE PROCEDURE dbo.DeniedProc2 AS BEGIN SET NOCOUNT ON; SELECT 2 AS x; END
GO

Add the table whose INSERT starts everything, the destination table, and the AFTER INSERT trigger that calls GetParts():

/* ---- table whose INSERT kicks everything off ---- */
CREATE TABLE dbo.Orders_Demo (OrderID int IDENTITY PRIMARY KEY, Note nvarchar(50) NULL);
GO

/* ---- destination table ---- */
CREATE TABLE dbo.PartsInProcess (PartName nvarchar(100) NULL);
GO

/* ---- AFTER INSERT trigger (runs inside the INSERT's transaction) ---- */
CREATE TRIGGER dbo.trg_Orders_Demo ON dbo.Orders_Demo AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.PartsInProcess (PartName)
    SELECT PartName FROM dbo.GetParts();
END
GO

Finally, create a test user with INSERT on the orders table but no EXECUTE on the two procedures — exactly the misconfiguration that bit us during testing:

/* 
   INSERT on Orders_Demo is all that's needed; the
   rest of the chain (MSTVF, CLR TVF, INSERT into PartsInProcess) is reached via
   ownership chaining. EXECUTE of the two procs over the CLR context connection
   is re-checked and denied -> error 229 (the CLR function opens a separate connection
   with context connection=true and runs the procs through it, 
   which behaves like dynamic SQL — a fresh execution context that breaks the chain).   */
CREATE USER dbg_demo_user WITHOUT LOGIN;
GRANT INSERT   ON dbo.Orders_Demo     TO dbg_demo_user;
DENY  EXECUTE  ON dbo.DeniedProc1     TO dbg_demo_user;
DENY  EXECUTE  ON dbo.DeniedProc2     TO dbg_demo_user;
GO

Reproducing the failure

Now insert a row as that user:

EXECUTE AS USER = 'dbg_demo_user';
INSERT INTO dbo.Orders_Demo (Note) VALUES ('A');
GO
REVERT;
GO

And here is what comes back:

Msg 3991, Level 16, State 1, Procedure dbo.Demo_BuildParts, Line 6 [Batch Start Line 100]

The context transaction which was active before entering user defined routine, trigger or aggregate "GetParts" has been ended inside of it, which is not allowed. Change application logic to enforce strict transaction nesting.

Notice what’s missing: any mention of permissions. The message points squarely at transaction nesting, so that’s where we looked — hunting for places the transaction was opened and closed unevenly. We even asked Claude for help. After an hour or two it had us weighing rewrites of the CLR code and of the large outer transaction that wraps the insert. All of it was beside the point.

The reveal: Extended Events

What finally cracked it was capturing the errors the engine actually raised, instead of the one it chose to show us. An Extended Events session on error_reported does exactly that:

IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name='Debug_Demo3991')
    DROP EVENT SESSION Debug_Demo3991 ON SERVER;
GO
CREATE EVENT SESSION Debug_Demo3991 ON SERVER
ADD EVENT sqlserver.error_reported
    ( ACTION (sqlserver.tsql_stack, sqlserver.transaction_id, sqlserver.sql_text)
      WHERE ([severity] >= 11) )
ADD TARGET package0.ring_buffer;
GO
ALTER EVENT SESSION Debug_Demo3991 ON SERVER STATE = START;
GO

Run the insert again, then read the ring buffer:

/* ---- what the engine actually raised (expect two 229s + a 3991 per failing run) ---- */
;WITH xe AS (
    SELECT CONVERT(xml, t.target_data) AS x
    FROM sys.dm_xe_sessions s
    JOIN sys.dm_xe_session_targets t ON t.event_session_address = s.address
    WHERE s.name = 'Debug_Demo3991' AND t.target_name = 'ring_buffer'
)
SELECT  e.n.value('(@timestamp)[1]','datetime2')                        AS event_time,
        e.n.value('(data[@name="error_number"]/value)[1]','int')        AS error_number,
        e.n.value('(action[@name="transaction_id"]/value)[1]','bigint') AS transaction_id,
        e.n.value('(data[@name="message"]/value)[1]','nvarchar(max)')   AS message
FROM xe
CROSS APPLY xe.x.nodes('//event') AS e(n)
ORDER BY event_time, error_number;
GO

There they were: two “EXECUTE permission was denied” errors (229) sitting above the 3991. Granting the user EXECUTE on both procedures made the whole thing disappear.

Why did this happen?

That fixed it, but two questions stuck with me:

What rolled back the transaction inside the CLR function and produced the 3991?
Why did the 229 errors never come back to me?

There is no explicit rollback inside the CLR, so what rolled back the transaction?

Erland’s series has the answer. In Part Two he sorts errors into classes by how they behave.

		Without TRY-CATCH				With TRY-CATCH
	SET XACT_ABORT	OFF	ON	OFF	ON	ON or OFF	OFF	ON
Class	Name	Aborts		Rolls Back		Catchable	Dooms transaction
1	Fatal errors	Connection		Yes		No	n/a
2	Batch-aborting	Batch		Yes		Yes	Yes
3	Batch-only aborting	Batch		No	Yes	Yes	No	Yes
4	Statement-terminating	Statement	Batch	No	Yes	Yes	No	Yes
5	Terminates nothing at all	Nothing		No		Yes	Yes/No	Yes
6	Compilation: syntax errors	(Statement)		No		Yes	No	Yes
7	Compilation: binding errors	Scope	Batch	No	Yes	Outer scope only	No	Yes
8	Compilation: optimisation	Batch		Yes		Outer scope only	Yes
9	Attention signal	Batch		No	Yes	No	n/a
10	Informational and warning messages	Nothing		No		No	n/a
11	Uncatchable errors	Varying		Varying		No	n/a
12	Errors that only sets @@error	No		No		No	n/a

The error here aborts and rolls back the transaction; XACT_ABORT was off; and it isn’t a compilation error, which rules out his compilation class. Fatal errors terminate the connection, and ours clearly didn’t. That leaves batch abortion with rollback — errors that, with no CATCH handler on the stack, abort execution on the spot and roll back any open transaction.

The catch is that a permission error isn’t normally batch-aborting; on its own it’s only statement-terminating. So it behaved like a batch-aborting error here because of the context, not because that’s its nature. What promoted it, I believe, is trigger context. We can check that by calling the function outside a trigger:


GRANT SELECT ON GetParts TO dbg_demo_user;
EXECUTE AS USER = 'dbg_demo_user';
SET XACT_ABORT OFF;
BEGIN TRAN;
    SELECT * FROM dbo.GetParts();                 -- DeniedProc1/2 -> 229 (swallowed) -> empty result
    SELECT @@TRANCOUNT AS TranCount_StillOpen;    -- expect 1  (transaction survived)
COMMIT;                                           -- expect success
REVERT;
GO

No 3991, and no permission error surfaces either (more on that next). The transaction is still open returned from the call (no transaction rollback within the CLR) and the result set is simply empty.

Why didn’t the 229 errors come back?

Because the function ate them. Look at the catch again:


Catch ex As Exception
    Return items
End Try

There’s no Throw. A caught exception that’s never rethrown doesn’t propagate, so the 229 never travels from the function up to the trigger, let alone back to us.

But swallowing the exception only hides the error; it does nothing about the side effect, which is the rolled-back transaction. When control returns to the outer INSERT and SQL Server finds the transaction it was counting on gone, it raises the only complaint it has left: 3991. And that is what costs us a few hours of troubleshooting because of the misleading error message.

Takeaway

Don’t swallow errors, whatever your reason — users deserve to know what’s actually going on, and an error message that’s only reacting to the damage can confuse everyone.
error_reported in Extended Events shows you everything that was raised — not just the last one left standing.
There’s another case where 3991 is thrown and the original error never surfaces: when the error is both batch- and transaction-aborting and there’s explicit ROLLBACK inside the SQLCLR (Error and Transaction Handling in SQL Server Appendix 2: CLR Modules)

Conclusion

In my opinion, I don't like using CLR. The safest default is to reach for CLR only when T-SQL genuinely can't do the job. It isn't that CLR caused this bug — the swallowed exception and a missing GRANT did — but CLR brings extra moving parts: an assembly security model you have to get right (SAFE vs. EXTERNAL_ACCESS/UNSAFE, and signing or trusting assemblies under clr strict security), and error handling that behaves differently across the managed/T-SQL boundary. In terms of the performance, it can be faster than T-SQL for CPU-bound work, but for data-access-heavy logic, set-based T-SQL is usually the better and simpler call.

More to read:

Error and Transaction Handling in SQL Server

Cleanup code

ALTER EVENT SESSION Debug_Demo3991 ON SERVER STATE = STOP;
DROP EVENT SESSION Debug_Demo3991 ON SERVER;
DROP TRIGGER   dbo.trg_Orders_Demo;
DROP FUNCTION  dbo.GetParts;
DROP ASSEMBLY  Demo3991;
DROP TABLE     dbo.PartsInProcess;
DROP TABLE     dbo.Orders_Demo;
DROP PROCEDURE dbo.DeniedProc1;
DROP PROCEDURE dbo.DeniedProc2;
DROP USER      dbg_demo_user;

Wednesday, March 25, 2026

The SQL Server Deadlock Fix I Never Would Have Thought Of

Intro

In my previous post, I shared how a deadlock incident changed how I think about using AI in my day-to-day work as a DBA. Here I'll walk through the incident itself. The data and queries aren't the real ones; I recreated the scenario with my own made-up tables, but the deadlock behavior is the same.

Set up a deadlock demo

Below is the T-SQL to create the table and load it with test data. A couple of things worth knowing about the data:

For every ProcessId, there is exactly one active record, the one where EndDate is NULL.
The StartDate and EndDate pattern reflects a soft-delete approach: instead of updating a record directly, we close the current one by setting its EndDate, then insert a new record with the updated values.

USE [MyBlogDB];

DROP TABLE IF EXISTS [dbo].[Configuration];
CREATE TABLE [dbo].[Configuration] (
	[Id] INT IDENTITY(1,1),
	[ProcessId] INT NOT NULL,
	[Value1] VARCHAR(50) NOT NULL,
	[Value2] VARCHAR(50) NULL,
	[StartDate] DATETIME NOT NULL,
	[EndDate] DATETIME NULL,
	[IsDefaultConfigOption] BIT NOT NULL DEFAULT 0,
	CONSTRAINT [PK_Configuration_Id] PRIMARY KEY ([Id])
);


-- CLAUDE wrote this code to generate 500,000 rows of test data for the Configuration table 
;WITH Numbers AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1
    FROM Numbers
    WHERE n &lt 500000
)
INSERT INTO [dbo].[Configuration] ([ProcessId], [Value1], [Value2], [StartDate], [EndDate], [IsDefaultConfigOption])
SELECT
    -- ProcessId cycles through 1–5000
    ((n - 1) % 5000) + 1,

    -- Value1: e.g. 'Value_1', 'Value_2', ...
    'Value_' + CAST(n AS VARCHAR(10)),

    -- Value2: nullable, every 3rd row is NULL
    CASE WHEN n % 3 = 0 THEN NULL ELSE 'Option_' + CAST((n % 10) + 1 AS VARCHAR(10)) END,

    -- StartDate: spread over the past ~3 years
    DATEADD(DAY, -(n % 1095), GETDATE()),

    -- EndDate: NULL only on the last occurrence of each ProcessId (n > 495000),
    --          all prior rows get an end date 1 year after their start
    CASE WHEN n > 495000 THEN NULL ELSE DATEADD(DAY, -(n % 1095) + 365, GETDATE()) END,

    -- IsDefaultConfigOption: alternates in groups of 4
    CASE WHEN n % 4 IN (1, 2) THEN 1 ELSE 0 END

FROM Numbers
OPTION (MAXRECURSION 0); -- Required to allow recursion beyond the default 100 limit

SELECT * FROM [dbo].[Configuration] where processid=999;

I also added an index similar to the one that existed in the real-world case. It doesn't perfectly cover the WHERE clause, but it's good enough to be useful. I kept it that way intentionally, because in the real world, I wouldn't rush to throw more indexes at the problem.

DROP INDEX IF EXISTS I1_ProcessId_StartDate_EndDate ON [dbo].[Configuration];
CREATE INDEX I1_ProcessId_StartDate_EndDate ON [dbo].[Configuration] ([ProcessId], [StartDate], [EndDate]);

Here's the stored procedure at the center of it all. You might wonder why there is only one? in the real-world scenario, two colleagues were accidentally running the same operation on the same ProcessId at nearly the same time, which is what caused the deadlock.

CREATE OR ALTER PROCEDURE [dbo].[Configuration_Update_IsDefaultConfigOption]
	@ProcessId INT,
	@IsDefaultConfigOptionInput BIT
AS
BEGIN
	SET NOCOUNT ON;
	DECLARE @CurrentDate DATETIME = GETDATE();
	DECLARE @ID INT;
	DECLARE @Value1 VARCHAR(50);
	DECLARE @Value2 VARCHAR(50);

	BEGIN TRY
		BEGIN TRANSACTION;
		SELECT 
			@ID = [Id],
			@Value1 = [Value1],
			@Value2 = [Value2]
		FROM [dbo].[Configuration]
		WHERE [ProcessId] = @ProcessId
		      AND [EndDate] IS NULL;

		UPDATE [dbo].[Configuration]
		SET [EndDate] = @CurrentDate
		WHERE [ProcessId] = @ProcessId
		      AND [EndDate] IS NULL;

		INSERT INTO [dbo].[Configuration] ([ProcessId], [Value1], [Value2], [StartDate], [EndDate], [IsDefaultConfigOption])
		VALUES (@ProcessId, @Value1, @Value2, @CurrentDate, NULL, @IsDefaultConfigOptionInput);

		COMMIT TRANSACTION;
	END TRY
	BEGIN CATCH
	IF @@TRANCOUNT > 0
		ROLLBACK TRANSACTION;
		THROW;
	END CATCH
END
GO

The proc is straightforward. Given a ProcessId, it fetches the current active record (where EndDate is NULL), closes it with a soft delete by setting EndDate, then inserts a new record with the updated IsDefaultConfigOption value and everything else carried over.

To reproduce a deadlock, I used SQLQueryStress with 4 threads all hammering this same query and ProcessID simultaneously (for 10 iterations each to increase the chance of getting the deadlock). In the real world there were only two competing sessions, but I bumped it up to 4 here just to make the deadlock easier to trigger reliably.

Understand the deadlock

Here is the deadlock graph I obtained from running the sp_BlitzLock proc (from First Responder Kit scripts).

<deadlock>
  <victim-list>
    <victimProcess id="process2b5c1d01048" />
  </victim-list>
  <process-list>
    <process id="process2b5c1d01048" spid="87" ▶ more...
 lockMode="S" waitresource="KEY: 5:72057594058899456 (294556843a07)" isolationlevel="read committed (2)" currentdbname="MyBlogDB" transactionname="user_transaction" lasttranstarted="2026-03-12T16:06:10.557" clientapp="SQLQueryStress" hostname="1NDWJD3"
>
      <executionStack>
        <frame procname="MyBlogDB.dbo.Configuration_Update_IsDefaultConfigOption" line="17">
SELECT @ID = [Id], @Value1 = [Value1], @Value2 = [Value2]
FROM [dbo].[Configuration]
WHERE [ProcessId] = @ProcessId AND [EndDate] IS NULL
        </frame>
        ▶ <frame procname="adhoc">...
<frame procname="adhoc" line="1">
exec [dbo].[Configuration_Update_IsDefaultConfigOption] 999, 1
        </frame>
      </executionStack>
      ▶ <stackFrames> (39 frames, collapsed)...
      <stackFrames>
        <frame id="00" address="0x7FFA979F4694" module="ntdll" />
        <frame id="01" address="0x7FFA94FBEB29" module="kernelbase" />
        <frame id="02" address="0x7FFA77AA65CA" module="SqlDK" />
        <frame id="03" address="0x7FFA77AA64ED" module="SqlDK" />
        <frame id="04" address="0x7FFA77AA216C" module="SqlDK" />
        <frame id="05" address="0x7FFA77AA1C33" module="SqlDK" />
        <frame id="06" address="0x7FFA77AA3B66" module="SqlDK" />
        <frame id="07" address="0x7FFA4659855A" module="sqlmin" />
        <frame id="08" address="0x7FFA465987D0" module="sqlmin" />
        <frame id="09" address="0x7FFA4650D84D" module="sqlmin" />
        <frame id="10" address="0x7FFA46515599" module="sqlmin" />
        <frame id="11" address="0x7FFA4652016E" module="sqlmin" />
        <frame id="12" address="0x7FFA465160DD" module="sqlmin" />
        <frame id="13" address="0x7FFA46515DFE" module="sqlmin" />
        <frame id="14" address="0x7FFA465343AE" module="sqlmin" />
        <frame id="15" address="0x7FFA780E1955" module="SqlTsEs" />
        <frame id="16" address="0x7FFA46534419" module="sqlmin" />
        <frame id="17" address="0x7FFA46533EBE" module="sqlmin" />
        <frame id="18" address="0x7FFA465342B4" module="sqlmin" />
        <frame id="19" address="0x7FFA46530020" module="sqlmin" />
        <frame id="20" address="0x7FFA46543ECE" module="sqlmin" />
        <frame id="21" address="0x7FFA4369D4D0" module="sqllang" />
        <frame id="22" address="0x7FFA4369D702" module="sqllang" />
        <frame id="23" address="0x7FFA43B022D9" module="sqllang" />
        <frame id="24" address="0x7FFA44619DFF" module="sqllang" />
        <frame id="25" address="0x7FFA4462D5EA" module="sqllang" />
        <frame id="26" address="0x7FFA43697700" module="sqllang" />
        <frame id="27" address="0x7FFA436CB62F" module="sqllang" />
        <frame id="28" address="0x7FFA436CB451" module="sqllang" />
        <frame id="29" address="0x7FFA436CB278" module="sqllang" />
        <frame id="30" address="0x7FFA43B022D9" module="sqllang" />
        <frame id="31" address="0x7FFA44619DFF" module="sqllang" />
        <frame id="32" address="0x7FFA4462D5EA" module="sqllang" />
        <frame id="33" address="0x7FFA43697700" module="sqllang" />
        <frame id="34" address="0x7FFA436A2E8B" module="sqllang" />
        <frame id="35" address="0x7FFA436956F1" module="sqllang" />
        <frame id="36" address="0x7FFA43695402" module="sqllang" />
        <frame id="37" address="0x7FFA77AA9165" module="SqlDK" />
        <frame id="38" address="0x7FFA77AA9880" module="SqlDK" />
        <frame id="39" address="0x7FFA77AA94E3" module="SqlDK" />
      </stackFrames>
      
      ▶ <inputbuf>...
<inputbuf>
exec [dbo].[Configuration_Update_IsDefaultConfigOption] 999, 1
      </inputbuf>
    </process>
    <process id="process2b5c1acc478" spid="88" ▶ more...
 lockMode="X" waitresource="KEY: 5:72057594058964992 (3963a1a319be)" isolationlevel="read committed (2)" currentdbname="MyBlogDB" transactionname="user_transaction" lasttranstarted="2026-03-12T16:06:10.557" clientapp="SQLQueryStress" hostname="1NDWJD3"
>
      <executionStack>
        <frame procname="MyBlogDB.dbo.Configuration_Update_IsDefaultConfigOption" line="25">
UPDATE [dbo].[Configuration]
SET [EndDate] = @CurrentDate
WHERE [ProcessId] = @ProcessId AND [EndDate] IS NULL
        </frame>
        ▶ <frame procname="adhoc">...
<frame procname="adhoc" line="1">
exec [dbo].[Configuration_Update_IsDefaultConfigOption] 999, 1
        </frame>
      </executionStack>
      ▶ <stackFrames> (39 frames, collapsed)...
      <stackFrames>
        <frame id="00" address="0x7FFA979F4694" module="ntdll" />
        <frame id="01" address="0x7FFA94FBEB29" module="kernelbase" />
        <frame id="02" address="0x7FFA77AA65CA" module="SqlDK" />
        <frame id="03" address="0x7FFA77AA64ED" module="SqlDK" />
        <frame id="04" address="0x7FFA77AA216C" module="SqlDK" />
        <frame id="05" address="0x7FFA77AA1C33" module="SqlDK" />
        <frame id="06" address="0x7FFA77AA3B66" module="SqlDK" />
        <frame id="07" address="0x7FFA4659855A" module="sqlmin" />
        <frame id="08" address="0x7FFA465987D0" module="sqlmin" />
        <frame id="09" address="0x7FFA4650D84D" module="sqlmin" />
        <frame id="10" address="0x7FFA46515599" module="sqlmin" />
        <frame id="11" address="0x7FFA4652016E" module="sqlmin" />
        <frame id="12" address="0x7FFA4671B463" module="sqlmin" />
        <frame id="13" address="0x7FFA4671B147" module="sqlmin" />
        <frame id="14" address="0x7FFA4671AE9E" module="sqlmin" />
        <frame id="15" address="0x7FFA4671A770" module="sqlmin" />
        <frame id="16" address="0x7FFA780E1B97" module="SqlTsEs" />
        <frame id="17" address="0x7FFA4657E38E" module="sqlmin" />
        <frame id="18" address="0x7FFA46530020" module="sqlmin" />
        <frame id="19" address="0x7FFA46543ECE" module="sqlmin" />
        <frame id="20" address="0x7FFA4369D4D0" module="sqllang" />
        <frame id="21" address="0x7FFA436B4AB8" module="sqllang" />
        <frame id="22" address="0x7FFA436B44D9" module="sqllang" />
        <frame id="23" address="0x7FFA43B022D9" module="sqllang" />
        <frame id="24" address="0x7FFA44619DFF" module="sqllang" />
        <frame id="25" address="0x7FFA4462D5EA" module="sqllang" />
        <frame id="26" address="0x7FFA43697700" module="sqllang" />
        <frame id="27" address="0x7FFA436CB62F" module="sqllang" />
        <frame id="28" address="0x7FFA436CB451" module="sqllang" />
        <frame id="29" address="0x7FFA436CB278" module="sqllang" />
        <frame id="30" address="0x7FFA43B022D9" module="sqllang" />
        <frame id="31" address="0x7FFA44619DFF" module="sqllang" />
        <frame id="32" address="0x7FFA4462D5EA" module="sqllang" />
        <frame id="33" address="0x7FFA43697700" module="sqllang" />
        <frame id="34" address="0x7FFA436A2E8B" module="sqllang" />
        <frame id="35" address="0x7FFA436956F1" module="sqllang" />
        <frame id="36" address="0x7FFA43695402" module="sqllang" />
        <frame id="37" address="0x7FFA77AA9165" module="SqlDK" />
        <frame id="38" address="0x7FFA77AA9880" module="SqlDK" />
        <frame id="39" address="0x7FFA77AA94E3" module="SqlDK" />
      </stackFrames>
      
      ▶ <inputbuf>...
<inputbuf>
exec [dbo].[Configuration_Update_IsDefaultConfigOption] 999, 1
      </inputbuf>
    </process>
  </process-list>
  <resource-list>
    <keylock objectname="MyBlogDB.dbo.Configuration" indexname="PK_Configuration_Id" mode="X">
      <owner-list>
        <owner id="process2b5c1acc478" mode="X" />
      </owner-list>
      <waiter-list>
        <waiter id="process2b5c1d01048" mode="S" requestType="wait" />
      </waiter-list>
    </keylock>
    <keylock objectname="MyBlogDB.dbo.Configuration" indexname="I1_ProcessId_StartDate_EndDate" mode="U">
      <owner-list>
        <owner id="process2b5c1d01048" mode="S" />
      </owner-list>
      <waiter-list>
        <waiter id="process2b5c1acc478" mode="X" requestType="convert" />
      </waiter-list>
    </keylock>
  </resource-list>
</deadlock>

The deadlock occurred while session 1 (process2b5c1d01048) was processing the SELECT and session 2 (process2b5c1acc478) was processing the UPDATE. Session 1 holds an S lock on the I1 index and wants an S lock on the PK index. Session 2, conversely, holds an X lock on the PK index and wants an X lock on the I1 index. Both sessions want the index that the other already holds a lock on, and neither side is willing to release. Thus, a deadlock occurs.

If the two sessions were updating different ProcessIDs, the key locks would target different rows, and no deadlock would occur. No fix needed if the two sessions never contend for the same rows, but since users expect a clean experience instead of an abrupt failure, let's explore our options.

First, though, an important question: why do the two queries acquire locks on the two indexes in opposite orders? Session 1 performs a seek on I1 first, then does a key lookup on PK to retrieve columns not covered by I1. Session 2 also seeks on I1, but when updating, it applies changes to PK first, then maintains I1. That's what creates the inversion. See their actual execution plans below:

Before I asked AI

I came up with a few ideas to fix:

Tune the queries to finish faster: Both queries are simple and already have a reasonable index. I could have reordered I1 to (ProcessID, EndDate, StartDate) or added a new index I2 with (ProcessID, EndDate) as key to better match the WHERE clause. However, reordering I1 might affect other queries, and adding a new index means additional maintenance overhead.
Create a covering index: adding columns to I1 to make it a covering index would eliminate the key lookup on PK for the SELECT query. But that adds even more maintenance overhead, in my real-world scenario, that would have meant adding 10 more columns.
Add retry logic: Retry the operation automatically when it fails due to a deadlock.

Every option I could think of was adding overhead somewhere, so I landed on retry logic. It's perfectly reasonable; users won't see a strange deadlock error they don't know how to handle, and since two users competing to update the same row is expected behavior, it's fine for one to fail and have the app gracefully retry. I messaged my coworker about what I thought we should do.

Then I asked Claude.

AI as my advisor on the problem

I was curious how the AI would approach this problem, so I gave it the deadlock report, the query text, and the execution plans. The sp_BlitzLock proc collects all of these in one place, which means I didn't have to do that work myself. That's one of the lessons I took from this: when the right tooling makes it easy to gather data, you're much more likely to actually bring that data to AI in the first place, enhancing the quality of the answers.

The AI correctly identified the deadlock and its root cause:

"Classic circular wait: A holds NCI S-lock and wants CI; B holds CI X-lock and wants NCI. SQL Server kills A as the victim." ✓
"The lock ordering inversion is driven by two things: (1) The key lookup in the SELECT plan forces an NCI → CI lock acquisition sequence (2) The UPDATE acquires locks in CI → NCI order (it updates the clustered index first, then maintains the non-clustered index)" ✓

For the fixes, it did a better job this time around: it didn't flag the covering index approach as the "Best Fix," which it had in the real-world scenario I mentioned in my last post. My guess is that I used a better prompt this time (borrowed from First Responder Kit - Using AI documentation).

Here's the fix Claude suggested:

" Fix 1 — Rewrite the SP (Primary Fix):

Eliminate the separate SELECT entirely by using UPDATE ... OUTPUT. This collapses the SELECT + UPDATE into a single statement, removing the interleaved lock acquisition that causes the circular wait. Two concurrent sessions doing UPDATE ... OUTPUT will conflict on the NCI U-lock upfront and serialize cleanly — no deadlock possible."

CREATE OR ALTER PROCEDURE [dbo].[Configuration_Update_IsDefaultConfigOption]
    @ProcessId INT,
    @IsDefaultConfigOptionInput BIT
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @CurrentDate DATETIME = GETDATE();
    DECLARE @ID INT;
    DECLARE @Value1 VARCHAR(50);
    DECLARE @Value2 VARCHAR(50);

    -- Table variable to capture the pre-update row values via OUTPUT
    DECLARE @Captured TABLE
    (
        Id     INT,
        Value1 VARCHAR(50),
        Value2 VARCHAR(50)
    );

    BEGIN TRY
        BEGIN TRANSACTION;

        -- Single statement replaces the separate SELECT + UPDATE.
        -- OUTPUT captures the values we need before the row is modified.
        UPDATE [dbo].[Configuration]
        SET [EndDate] = @CurrentDate
        OUTPUT
            DELETED.[Id],
            DELETED.[Value1],
            DELETED.[Value2]
        INTO @Captured (Id, Value1, Value2)
        WHERE [ProcessId] = @ProcessId
          AND [EndDate] IS NULL;

        SELECT
            @ID     = Id,
            @Value1 = Value1,
            @Value2 = Value2
        FROM @Captured;

        INSERT INTO [dbo].[Configuration]
            ([ProcessId], [Value1], [Value2], [StartDate], [EndDate], [IsDefaultConfigOption])
        VALUES
            (@ProcessId, @Value1, @Value2, @CurrentDate, NULL, @IsDefaultConfigOptionInput);

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        THROW;
    END CATCH
END
GO

The fix is simple and clean in hindsight, but it never crossed my mind, simply because UPDATE...OUTPUT isn't something I reach for often. That's exactly why I think AI is something all of us should learn to use well. I now treat Claude as an advisor I consult for second opinions (one who draws from a much wider surface area than I can hold in my head at any one time XD). And just like any working relationship, I'm still learning how to get the most out of it.

*Of course, I tested the UPDATE...OUTPUT fix, it works 🎉.

Thursday, March 12, 2026

What a Deadlock Taught Me About Working with AI

Why I’m Writing This?

Last week, my coworker sent me a report about a deadlock and asked me to help troubleshoot it. The event gave me so many thoughts that it made me feel like writing a blog post sharing them and exploring how AI has been for me in my day-to-day job as a DBA.

I was afraid the blog would go on too long by mixing technical demos with personal reflections, so I'm splitting this into two posts. In this one, I will share my perspective on AI and my realizations after the deadlock event. In a separate future post, I will cover the deadlock event itself with a demo and how I used AI as a second opinion to solve the problem.

(Spoiler Alert: Claude gave me several options to solve the deadlock. Annoyingly, the first option, according to it, the least risky solution, was to create a covering index spanning every single column, and you know it's against best practice. But the second option that it suggested, using “UPDATE…OUTPUT”, was the one that worked, and that impressed me because I hadn't even considered it, having barely seen or used that syntax in my four years working with T-SQL.)

From Ignorance to Awareness

Amid the chaos and excitement of the AI boom,

I gradually found myself incorporating AI more into my daily work.

I had been using AI since the ChatGPT’s release, and later my company adopted Claude, but only to ask them to help me with things like writing a DMV query, proofreading my emails and messages, or comparing two pieces of code to make sure an improved version was doing the same thing as the original. These are tasks I believe AI can do better than me.

I didn’t trust it to help me rewrite or tune queries, troubleshoot deadlocks, tasks I thought I would handle better myself, since I can collect more data about an issue than the AI can, I know my system better than it does, and sometimes when I did try, the AI just gave me wrong answers that I felt would only waste my time. (I still now resist asking it to rewrite production queries for me, though I can accept its tuning ideas and rewrite them myself.)

Thankfully, I kept myself updated; I followed Brent’s newsletters and read every tech blog he shared. And of course, among those were posts about AI and how DBAs are using it in their work (everyone is crazy about AI now). I learned about writing good prompts, giving the AI as much useful information as I would need myself (like the query text, the query plan, the deadlock report, ect.), and establishing a workflow that makes it simple enough to collect that data in the first place. For example, the sp_BlitzCache proc (First Responder Kit) has incorporated AI. I haven't tried that functionality yet, but my understanding is that you provide it the stored procedure name you're trying to tune along with an AI configuration (e.g., the model, a tailored prompt for the problem at hand). And the proc will automatically collect the query text and execution plan, sends them to the AI with your prompt, and returns a response. It is very neat. There is some setup involved, I believe SQL Server 2025 is required, but it should be straightforward and one time setup. I'll definitely give it a try sometime.

I picked up all of this gradually over time. But it wasn't until last week that I actually used Claude to troubleshoot a deadlock issue. And honestly, I was reluctant at first. I had already analyzed the deadlock report myself and exhausted every solution I could think of, ruling each one out due to its limitations. I sent my findings to my coworker, a lengthy write-up explaining how the deadlock occurred, what the potential solutions were and why none of them would work, and concluding that adding retry logic might be our only fix.

Then, about five minutes later, I had an idea: why not give AI a try? I was confident there wouldn't be any workable solution beyond retry logic, but I was also curious whether the AI would even understand how the deadlock had occurred in the first place. (Looking back, I think the flood of posts about AI replacing white-collar workers, and all the advice to start using AI before it replaces you, had quietly gotten to me. That fear had probably crept in.) So I gave it everything that I used to understand the problem: the query text, the query plans, and the deadlock report.

Claude understood the problem very well. It gave me an answer, not a perfect one (it recommended a covering index with almost every columns in the table as included columns, and referred to this solution as “Best fix, lowest risk”). Then I read the second suggestion.

I stopped for a moment and thought:

“Oh… why didn’t I think of that?”

Claude suggested rewriting the logic using UPDATE…OUTPUT, collapsing the SELECT and UPDATE into a single statement and eliminating the deadlock path entirely. I realized AI has been getting smarter. Or had it always been, and I simply hadn't been using it correctly?

The Advisor

I think people relate to AI differently. Some see it as nothing more than a fancy search engine, others treat it like a personal assistant, and some have even started calling it a friend. For me, the word that feels most right is advisor.

You know those advisors in Civilization? I've been playing that game recently, and they pop up to offer help and suggestions for production and research choices based on where your civilization stands. I almost always ignored them, stubbornly convinced that I should figure everything out myself. And honestly, that might be exactly why I never play well.

But here's the thing: AI is a genuinely smart advisor. It proposed an elegant solution to my deadlock problem, one I never would have found on my own, and probably wouldn't have stumbled across online either. I fully understood the problem. I knew what needed fixing. I just hadn't thought of UPDATE…OUTPUT because I barely use it, and it simply didn't cross my mind. That's exactly what a good advisor is for, not to replace your thinking, but to catch the blind spots you didn't know you had.

So from now on, that's how I plan to use it: as a second opinion from someone smarter than me in ways I haven't figured out yet.

Of course, a good advisor still needs the right information to work with. In my case, that means a well-crafted prompt with everything I would gather if I were solving the problem myself.

Maybe AI Won't Replace Me

Remember that covering index recommendation I mentioned? That one has more than 10 included columns and that goes against general best practices. The more columns you pack into an index, the heavier your modification costs (INSERT, UPDATE, DELETE) and the more painful your maintenance becomes. I still haven't figured out why Claude flagged that as the "Best fix, lowest risk" solution. But honestly? I'm not complaining. It's a reminder that someone needs to catch these things, and that someone is us.

I genuinely believe the best setup for any company right now is a knowledgeable human working alongside a capable AI. Not one replacing the other, but each covering the other's blind spots.

Here’s another story. Before my coworker brought the deadlock to me, he had already tried fixing it by adding a query hint. It didn’t work. My guess is that he had asked Claude for help and applied the suggestion it gave him. Since he’s a developer and not a DBA, he likely pasted only the query into Claude without including the deadlock report or execution plans. I tested that theory by doing the same thing myself and got a similar answer.

The lesson isn’t that AI failed him. It simply worked with the information it was given. Without the right context, even a smart advisor will point you in the wrong direction. That’s where domain knowledge matters. A developer might not know what a deadlock report is, why an execution plan matters, or what we can give so AI has a better understanding of the problem at hand. A DBA does. And that gap is why we're still in the room.

Control what you can control

There's a Stoic idea I keep coming back to: control what you can control.

I can't control whether AI will eventually change my role or reshape my industry. But I can control whether I stay curious, keep learning, and actually engage with the tools that are changing the world around me. Worrying about replacement is wasted energy. Using that same energy to grow; that's a choice I can make every day.

That mindset shift is really what this whole blog is about. I went from ignoring AI for anything serious, to letting it surprise me with a creative solution for my deadlock problem. That's not a small thing. It changed how I think about what AI is for and what my role alongside it looks like.

I'm also genuinely grateful. Grateful to still have a job when so many others haven't been so fortunate. Grateful for my employer. And grateful for the DBA community, for the newsletters, the blog posts, the shared knowledge that quietly kept me moving forward even when I didn't realize it.

Monday, February 2, 2026

Wait Stats First: How I Start my SQL Server Performance Tuning

The Theme Park Analogy

A friend once told me about his trip to Universal Studios: how packed it was, and how it took forever just to get on anything.

His experience basically came down to two kinds of time: the time that actually felt fun and worthwhile (walking around and being on the rides), and the time that felt like dead time (standing around waiting to get in).

When a user application submits a request to SQL Server for data, the same dynamic plays out. Query response time includes CPU processing time, when the query is actually being productive (reading data, processing it, and sending it back), and time spent waiting on a particular resource (disk I/O, memory, network, locking, and so on). In a concurrent database system, with hundreds or thousands of user requests competing for limited resources, waiting is inevitable.

Performance Tuning Starts with Wait Statistics

Sticking with the analogy: there are things visitors and the theme park can do to enhance the overall experience. Visitors may plan their trip on weekdays when things are less busy and plan ahead so that they can get the most out of the day. The theme park can study what contributes most to customers’ wait times and research solutions accordingly.

In the SQL Server world, to reduce response time, there are two parallel levers we can pull:

- Reduce processing time by tuning slow queries: queries can't tune themselves yet (maybe someday, we hope), but for now that's still our job. Maybe the execution plan is inefficient because the query is too complex, maybe there’s a missing index, or maybe it reads so much data that it’s time to consider pagination.

- Reduce wait time: This means finding the bottleneck, what resources the server is waiting on, and directing our tuning efforts accordingly.

When SQL Server as a whole is slow, analyzing wait statistics is an effective way to find the primary reasons queries are waiting, so we can focus our tuning efforts on the right problem. Once we’ve identified the bottlenecks, there are many things we can do to relieve them, depending on the problems we find: add resources, tune queries and indexes, or reconfigure settings.

The Resource Waits

For every wait that happens, SQL Server records the length of the wait and its cause, aka the wait type, which generally tells us what resource the request was waiting on. We can find this data in the sys.dm_os_wait_stats DMV. This DMV records cumulative totals for each wait type since the last SQL Server restart (or since wait stats were manually cleared). I use it to establish a wait baseline chart, which I’ll discuss in more detail below.

Not all waits are meaningful. There are problematic waits and benign waits, and we should filter the benign ones out during analysis. The "More to Read" section at the end links to a blog post by Brent Ozar and a white paper by SQLSkills that cover different wait types in more detail.

Identifying the top waits is only half the battle. Once you know what your server is waiting on, the next question is: what do you actually do about it? I'll be dedicating separate blog posts to the most common wait types I encounter and how I approach troubleshooting each one. Stay tuned for those as this series grows.

The Baseline

If you use First Responder Kit , you can get a quick snapshot of the top waits since the last SQL Server startup with a single call:

exec sp_blitzfirst @sincestartup=1

The downside is that you lose that history when SQL Server restarts. For long-term tracking, I needed something that would survive restarts and let me compare across days, weeks, and upgrades. So, I built my own wait stats chart.

The Chart

The chart tracks two performance metrics side by side:

Batch requests/second (the red line): how many queries the server is handling.
Wait Time (Second) per Core per Second (the bars): how much time (per core, per second) SQL Server spends waiting on specific wait types.

Having both metrics in one view gives me a few things I find valuable:

More context: It’s not just “what are we waiting on?"; the Batch Requests/sec line shows how busy the server actually is, which helps separate "high waits because the server is slammed" from "high waits because something is broken."
Baseline and comparison: I can use it to establish a server performance baseline for things like upgrading SQL Server to a newer version. I can spot anomaly periods where wait time is significantly higher and quickly see which waits are driving the change. For example, at the end of Jan 27 and the beginning of Jan 28, the server had an abnormal increase in CXPACKET waits. After tuning the query that contributed to it, we see a drop by mid-day on Jan 28.

Building the chart

The chart above is powered by a SQL Server Agent job that snapshots wait stats and batch request counters at regular intervals. Here's what the script does at a high level:

It creates a set of tables under a perftune schema to store snapshots and computed deltas. Every time the job fires, it captures the current cumulative counters, computes the difference from the last run, and normalizes the results. One year of history is retained by default.

If you want to build a similar chart:

Create a SQL Server Agent job that runs every 15 minutes (or whatever interval you want; just make sure to change the @n value in the script too).
Replace <YourDatabase> with the database where you want to store the monitoring tables
The script below keeps one year of history by default; you can change that retention window

For reference, this is how the script computes each metric:

Batch Requests/sec: The script reads the current cumulative “Batch Requests/sec” perf counter from sys.dm_os_performance_counters, subtracts the last saved counter value, then divides by the elapsed time window (@delta = @n * 60).

BatchRequestsPerSecond = (NewCounterValue - LastCounterValue) / @delta_seconds

Wait per core per second: The script snapshots sys.dm_os_wait_stats, computes resource wait time as wait_time_ms - signal_wait_time_ms, then takes the delta from the prior snapshot. That delta is converted to seconds, divided by CPU core count, and divided by the time window.

WaitSecondsPerCorePerSecond = ((NewResourceWaitMs - LastResourceWaitMs) / 1000) / @delta_seconds / cores

For the visualization itself, I use Power BI connected directly to the monitoring tables. Once the Agent job is running and data starts accumulating, it's straightforward to point Power BI at the perftune.WaitStatsTop10 and perftune.BatchRequestsPerSecond tables and build the stacked bar chart and line overlay from there.

Below is the full script for the agent job. I've commented it inline so you can follow along:


USE <YourDatabase>;
GO

/*------------------------------------------------------------------------------
PERFTUNE COLLECTION SCRIPT

Runs as a recurring job (every @n minutes; default 15).

Captures:
  1) Batch Requests/sec (per-interval delta from prior run)
  2) Top 10 *resource* waits (seconds/core/second) using deltas from the
     previous wait snapshot

How deltas are calculated:
  - Batch Requests/sec:
      (NewCounterValue - LastCounterValue) / @delta_seconds
  - Resource waits:
      ((NewResourceWaitMs - LastResourceWaitMs) / 1000) / @delta_seconds / cores

Notes:
  - First run seeds baseline counters; delta-based results require a prior run.
  - CollectionDate is rounded down to the minute (HH:mm:00) for consistent keys.
  - History is retained for 365 days (see cleanup section).
------------------------------------------------------------------------------*/


/*------------------------------------------------------------------------------
1) BOOTSTRAP PERSISTENT OBJECTS (schema + tables)
------------------------------------------------------------------------------*/
IF NOT EXISTS (
    SELECT 1
    FROM sys.schemas
    WHERE name = N'perftune'
)
BEGIN
    EXEC('CREATE SCHEMA perftune');
END
GO

IF OBJECT_ID(N'perftune.LastBatchRequestsCount', N'U') IS NULL
BEGIN
  /* Stores last observed cumulative Batch Requests/sec counter value
     to compute a per-interval delta on subsequent runs. */
  CREATE TABLE perftune.LastBatchRequestsCount
  (
      LastBatchRequestsCount bigint
  )
  INSERT INTO perftune.LastBatchRequestsCount VALUES (0)
END
GO

IF OBJECT_ID(N'perftune.BatchRequestsPerSecond', N'U') IS NULL
BEGIN
  /* Time series of calculated Batch Requests/sec per collection interval. */
  CREATE TABLE perftune.BatchRequestsPerSecond
  (
      CollectionDate datetime NOT NULL PRIMARY KEY,
      BatchRequestsPerSecond int
  )
END 
GO

IF OBJECT_ID(N'perftune.LastWaitStats', N'U') IS NULL
BEGIN
  /* Stores the most recent wait snapshot (one row per wait type).
     Truncated/reloaded each run; used to compute per-interval deltas. */
  CREATE TABLE perftune.LastWaitStats
  (
      CollectionDate datetime NOT NULL,
      WaitType NVARCHAR(100) NOT NULL, 
      WaitTimeMs numeric(19,6),
      PRIMARY KEY (WaitType)
  )
END 
GO

IF OBJECT_ID(N'perftune.WaitStatsTop10', N'U') IS NULL
BEGIN
  /* Time series of top 10 resource-wait deltas per interval,
     normalized to seconds/core/second. */
  CREATE TABLE perftune.WaitStatsTop10
  (
      CollectionDate datetime not null,
      WaitType NVARCHAR(100) not null, 
      WaitTimeSecondPerCorePerSecond numeric(19,6),
      PRIMARY KEY (CollectionDate, WaitType)
  )
END 
GO


/*------------------------------------------------------------------------------
2) COLLECT CURRENT INTERVAL SNAPSHOT
------------------------------------------------------------------------------*/
DECLARE @n as int = 15;
DECLARE @delta as decimal(10,2) = @n*60.;

-- Number of logical CPU cores (used for per-core normalization)
DECLARE @NumberOfCores INT =
(
    SELECT cpu_count
    FROM sys.dm_os_sys_info
);

-- Collection date/time (rounded down to the minute boundary)
DECLARE @CollectionDate DATETIME2(0) = SYSDATETIME();
SET @CollectionDate = DATEADD(MINUTE, DATEDIFF(MINUTE, 0, @CollectionDate), 0);

SELECT @CollectionDate AS CollectionDate;


----------------------------------------------------------------
-- Batch Requests/sec (delta from prior run)
----------------------------------------------------------------
DECLARE @OldBatchRequestsCount BIGINT =
(
    SELECT TOP (1) LastBatchRequestsCount
    FROM perftune.LastBatchRequestsCount
);

DECLARE @NewBatchRequestsCount BIGINT =
(
    SELECT cntr_value
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE '%:SQL Statistics%'
      AND counter_name = 'Batch Requests/sec'
);

IF (@OldBatchRequestsCount>0)
BEGIN
  INSERT INTO perftune.BatchRequestsPerSecond
  (
      CollectionDate,
      BatchRequestsPerSecond
  )
  SELECT
      CAST(@CollectionDate AS DATETIME) AS CollectionDate,
      (@NewBatchRequestsCount - @OldBatchRequestsCount) / @delta AS BatchRequestsPerSecond
  WHERE (@NewBatchRequestsCount - @OldBatchRequestsCount) / @delta > 0;
END

-- Persist the latest cumulative counter value for the next run’s delta
UPDATE perftune.LastBatchRequestsCount
SET LastBatchRequestsCount = @NewBatchRequestsCount;

----------------------------------------------------------------
-- Wait Stats (resource waits only); capture current snapshot
-- ResourceWaitTimeMs = wait_time_ms - signal_wait_time_ms
----------------------------------------------------------------
DROP TABLE IF EXISTS #NewWaitStats;

CREATE TABLE #NewWaitStats
(
    CollectionDate DATETIME2(0)   NOT NULL,
    WaitType       NVARCHAR(100)  NOT NULL,
    WaitTimeMs     NUMERIC(19,6) NOT NULL,
    CONSTRAINT PK_NewWaitStats PRIMARY KEY (WaitType)
);

INSERT INTO #NewWaitStats
(
    CollectionDate,
    WaitType,
    WaitTimeMs
)
SELECT
    @CollectionDate AS CollectionDate,
    wait_type       AS WaitType,
    (wait_time_ms - signal_wait_time_ms) AS ResourceWaitTimeMs
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count <> 0
  AND wait_type NOT IN
  (
        -- These wait types are almost 100% never a problem and so they are
        -- filtered out to avoid them skewing the results. Click on the URL
        -- for more information.
        N'BROKER_EVENTHANDLER',        -- https://www.sqlskills.com/help/waits/BROKER_EVENTHANDLER
        N'BROKER_RECEIVE_WAITFOR',     -- https://www.sqlskills.com/help/waits/BROKER_RECEIVE_WAITFOR
        N'BROKER_TASK_STOP',           -- https://www.sqlskills.com/help/waits/BROKER_TASK_STOP
        N'BROKER_TO_FLUSH',            -- https://www.sqlskills.com/help/waits/BROKER_TO_FLUSH
        N'BROKER_TRANSMITTER',         -- https://www.sqlskills.com/help/waits/BROKER_TRANSMITTER
        N'CHECKPOINT_QUEUE',           -- https://www.sqlskills.com/help/waits/CHECKPOINT_QUEUE
        N'CHKPT',                      -- https://www.sqlskills.com/help/waits/CHKPT
        N'CLR_AUTO_EVENT',             -- https://www.sqlskills.com/help/waits/CLR_AUTO_EVENT
        N'CLR_MANUAL_EVENT',           -- https://www.sqlskills.com/help/waits/CLR_MANUAL_EVENT
        N'CLR_SEMAPHORE',              -- https://www.sqlskills.com/help/waits/CLR_SEMAPHORE

        -- Maybe comment this out if you have parallelism issues
        N'CXCONSUMER',                 -- https://www.sqlskills.com/help/waits/CXCONSUMER

        -- Maybe comment these four out if you have mirroring issues
        N'DBMIRROR_DBM_EVENT',         -- https://www.sqlskills.com/help/waits/DBMIRROR_DBM_EVENT
        N'DBMIRROR_EVENTS_QUEUE',      -- https://www.sqlskills.com/help/waits/DBMIRROR_EVENTS_QUEUE
        N'DBMIRROR_WORKER_QUEUE',      -- https://www.sqlskills.com/help/waits/DBMIRROR_WORKER_QUEUE
        N'DBMIRRORING_CMD',            -- https://www.sqlskills.com/help/waits/DBMIRRORING_CMD

        N'DIRTY_PAGE_POLL',            -- https://www.sqlskills.com/help/waits/DIRTY_PAGE_POLL
        N'DISPATCHER_QUEUE_SEMAPHORE', -- https://www.sqlskills.com/help/waits/DISPATCHER_QUEUE_SEMAPHORE
        N'EXECSYNC',                   -- https://www.sqlskills.com/help/waits/EXECSYNC
        N'FSAGENT',                    -- https://www.sqlskills.com/help/waits/FSAGENT
        N'FT_IFTS_SCHEDULER_IDLE_WAIT',-- https://www.sqlskills.com/help/waits/FT_IFTS_SCHEDULER_IDLE_WAIT
        N'FT_IFTSHC_MUTEX',            -- https://www.sqlskills.com/help/waits/FT_IFTSHC_MUTEX

        -- Maybe comment these six out if you have AG issues
        N'HADR_CLUSAPI_CALL',                   -- https://www.sqlskills.com/help/waits/HADR_CLUSAPI_CALL
        N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',  -- https://www.sqlskills.com/help/waits/HADR_FILESTREAM_IOMGR_IOCOMPLETION
        N'HADR_LOGCAPTURE_WAIT',                -- https://www.sqlskills.com/help/waits/HADR_LOGCAPTURE_WAIT
        N'HADR_NOTIFICATION_DEQUEUE',           -- https://www.sqlskills.com/help/waits/HADR_NOTIFICATION_DEQUEUE
        N'HADR_TIMER_TASK',                     -- https://www.sqlskills.com/help/waits/HADR_TIMER_TASK
        N'HADR_WORK_QUEUE',                     -- https://www.sqlskills.com/help/waits/HADR_WORK_QUEUE

        N'KSOURCE_WAKEUP',              -- https://www.sqlskills.com/help/waits/KSOURCE_WAKEUP
        N'LAZYWRITER_SLEEP',            -- https://www.sqlskills.com/help/waits/LAZYWRITER_SLEEP
        N'LOGMGR_QUEUE',                -- https://www.sqlskills.com/help/waits/LOGMGR_QUEUE
        N'MEMORY_ALLOCATION_EXT',       -- https://www.sqlskills.com/help/waits/MEMORY_ALLOCATION_EXT
        N'ONDEMAND_TASK_QUEUE',         -- https://www.sqlskills.com/help/waits/ONDEMAND_TASK_QUEUE
        N'PARALLEL_REDO_DRAIN_WORKER',  -- https://www.sqlskills.com/help/waits/PARALLEL_REDO_DRAIN_WORKER
        N'PARALLEL_REDO_LOG_CACHE',     -- https://www.sqlskills.com/help/waits/PARALLEL_REDO_LOG_CACHE
        N'PARALLEL_REDO_TRAN_LIST',     -- https://www.sqlskills.com/help/waits/PARALLEL_REDO_TRAN_LIST
        N'PARALLEL_REDO_WORKER_SYNC',   -- https://www.sqlskills.com/help/waits/PARALLEL_REDO_WORKER_SYNC
        N'PARALLEL_REDO_WORKER_WAIT_WORK', -- https://www.sqlskills.com/help/waits/PARALLEL_REDO_WORKER_WAIT_WORK

        N'PREEMPTIVE_OS_FLUSHFILEBUFFERS', -- https://www.sqlskills.com/help/waits/PREEMPTIVE_OS_FLUSHFILEBUFFERS
        N'PREEMPTIVE_XE_GETTARGETSTATE',   -- https://www.sqlskills.com/help/waits/PREEMPTIVE_XE_GETTARGETSTATE
        N'PVS_PREALLOCATE',                -- https://www.sqlskills.com/help/waits/PVS_PREALLOCATE

        N'PWAIT_ALL_COMPONENTS_INITIALIZED',     -- https://www.sqlskills.com/help/waits/PWAIT_ALL_COMPONENTS_INITIALIZED
        N'PWAIT_DIRECTLOGCONSUMER_GETNEXT',      -- https://www.sqlskills.com/help/waits/PWAIT_DIRECTLOGCONSUMER_GETNEXT
        N'PWAIT_EXTENSIBILITY_CLEANUP_TASK',     -- https://www.sqlskills.com/help/waits/PWAIT_EXTENSIBILITY_CLEANUP_TASK

        N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP',     -- https://www.sqlskills.com/help/waits/QDS_PERSIST_TASK_MAIN_LOOP_SLEEP
        N'QDS_ASYNC_QUEUE',                      -- https://www.sqlskills.com/help/waits/QDS_ASYNC_QUEUE
        N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP', -- https://www.sqlskills.com/help/waits/QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP
        N'QDS_SHUTDOWN_QUEUE',                   -- https://www.sqlskills.com/help/waits/QDS_SHUTDOWN_QUEUE

        N'REDO_THREAD_PENDING_WORK',     -- https://www.sqlskills.com/help/waits/REDO_THREAD_PENDING_WORK
        N'REQUEST_FOR_DEADLOCK_SEARCH',  -- https://www.sqlskills.com/help/waits/REQUEST_FOR_DEADLOCK_SEARCH
        N'RESOURCE_QUEUE',               -- https://www.sqlskills.com/help/waits/RESOURCE_QUEUE
        N'SERVER_IDLE_CHECK',            -- https://www.sqlskills.com/help/waits/SERVER_IDLE_CHECK

        N'SLEEP_BPOOL_FLUSH',        -- https://www.sqlskills.com/help/waits/SLEEP_BPOOL_FLUSH
        N'SLEEP_DBSTARTUP',          -- https://www.sqlskills.com/help/waits/SLEEP_DBSTARTUP
        N'SLEEP_DCOMSTARTUP',        -- https://www.sqlskills.com/help/waits/SLEEP_DCOMSTARTUP
        N'SLEEP_MASTERDBREADY',      -- https://www.sqlskills.com/help/waits/SLEEP_MASTERDBREADY
        N'SLEEP_MASTERMDREADY',      -- https://www.sqlskills.com/help/waits/SLEEP_MASTERMDREADY
        N'SLEEP_MASTERUPGRADED',     -- https://www.sqlskills.com/help/waits/SLEEP_MASTERUPGRADED
        N'SLEEP_MSDBSTARTUP',        -- https://www.sqlskills.com/help/waits/SLEEP_MSDBSTARTUP
        N'SLEEP_SYSTEMTASK',         -- https://www.sqlskills.com/help/waits/SLEEP_SYSTEMTASK
        N'SLEEP_TASK',               -- https://www.sqlskills.com/help/waits/SLEEP_TASK
        N'SLEEP_TEMPDBSTARTUP',      -- https://www.sqlskills.com/help/waits/SLEEP_TEMPDBSTARTUP

        N'SNI_HTTP_ACCEPT',              -- https://www.sqlskills.com/help/waits/SNI_HTTP_ACCEPT
        N'SOS_WORK_DISPATCHER',          -- https://www.sqlskills.com/help/waits/SOS_WORK_DISPATCHER
        N'SP_SERVER_DIAGNOSTICS_SLEEP',  -- https://www.sqlskills.com/help/waits/SP_SERVER_DIAGNOSTICS_SLEEP

        N'SQLTRACE_BUFFER_FLUSH',            -- https://www.sqlskills.com/help/waits/SQLTRACE_BUFFER_FLUSH
        N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', -- https://www.sqlskills.com/help/waits/SQLTRACE_INCREMENTAL_FLUSH_SLEEP
        N'SQLTRACE_WAIT_ENTRIES',            -- https://www.sqlskills.com/help/waits/SQLTRACE_WAIT_ENTRIES

        N'VDI_CLIENT_OTHER',         -- https://www.sqlskills.com/help/waits/VDI_CLIENT_OTHER
        N'WAIT_FOR_RESULTS',         -- https://www.sqlskills.com/help/waits/WAIT_FOR_RESULTS
        N'WAITFOR',                  -- https://www.sqlskills.com/help/waits/WAITFOR
        N'WAITFOR_TASKSHUTDOWN',     -- https://www.sqlskills.com/help/waits/WAITFOR_TASKSHUTDOWN

        N'WAIT_XTP_RECOVERY',        -- https://www.sqlskills.com/help/waits/WAIT_XTP_RECOVERY
        N'WAIT_XTP_HOST_WAIT',       -- https://www.sqlskills.com/help/waits/WAIT_XTP_HOST_WAIT
        N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', -- https://www.sqlskills.com/help/waits/WAIT_XTP_OFFLINE_CKPT_NEW_LOG
        N'WAIT_XTP_CKPT_CLOSE',      -- https://www.sqlskills.com/help/waits/WAIT_XTP_CKPT_CLOSE

        N'XE_DISPATCHER_JOIN',       -- https://www.sqlskills.com/help/waits/XE_DISPATCHER_JOIN
        N'XE_DISPATCHER_WAIT',       -- https://www.sqlskills.com/help/waits/XE_DISPATCHER_WAIT
        N'XE_TIMER_EVENT'            -- https://www.sqlskills.com/help/waits/XE_TIMER_EVENT
  );

----------------------------------------------------------------
-- Top 10 waits by delta since last snapshot
-- Convert ms delta to seconds, then normalize by interval duration and cores:
--   (delta_WaitTimeMs / 1000) / @delta / cores
----------------------------------------------------------------
IF ((SELECT COUNT(*) FROM perftune.LastWaitStats) <> 0)
BEGIN
  INSERT INTO perftune.WaitStatsTop10
  (
      CollectionDate,
      WaitType,
      WaitTimeSecondPerCorePerSecond
  )
  SELECT TOP (10)
      n.CollectionDate,
      n.WaitType,
      (n.WaitTimeMs - l.WaitTimeMs) / 1000.0 / @delta / NULLIF(@NumberOfCores, 0)
          AS WaitTimeSecondPerCorePerSecond
  FROM #NewWaitStats AS n
  INNER JOIN perftune.LastWaitStats AS l
	ON n.WaitType = l.WaitType
  WHERE (n.WaitTimeMs - l.WaitTimeMs) > 0
  ORDER BY (n.WaitTimeMs - l.WaitTimeMs) DESC
END
GO

----------------------------------------------------------------
-- Persist latest wait snapshot for next run
----------------------------------------------------------------
TRUNCATE TABLE perftune.LastWaitStats;

INSERT INTO perftune.LastWaitStats
(
    CollectionDate,
    WaitType,
    WaitTimeMs
)
SELECT
    CollectionDate,
    WaitType,
    WaitTimeMs
FROM #NewWaitStats;


/*------------------------------------------------------------------------------
3) HISTORY RETENTION (delete rows older than 365 days)
------------------------------------------------------------------------------*/
DECLARE @CleanupNDatesOlder datetime;
SET @CleanupNDatesOlder = DATEADD(dd,-365,GETDATE());

-- perftune.WaitStatsAllTime
IF OBJECT_ID('perftune.WaitStatsTop10', 'U') IS NOT NULL
    DELETE FROM perftune.WaitStatsTop10 WHERE CollectionDate < @CleanupNDatesOlder;

-- perftune.BatchRequestsPerSecond
IF OBJECT_ID('perftune.BatchRequestsPerSecond', 'U') IS NOT NULL
    DELETE FROM perftune.BatchRequestsPerSecond WHERE CollectionDate < @CleanupNDatesOlder;

The Signal Wait

Every wait in SQL Server has two components: resource wait time and signal wait time. Resource wait time (wait_time_ms - signal_wait_time_ms) is how long a task waited for an external resource, a disk read, a lock, a memory grant, and so on. Signal wait time (signal_wait_time_ms) is how long the task waited to get back on a CPU after the resource became available. In other words, it was ready to run but had to wait in line for a CPU to pick it up.

Both matter, but they point to different problems. High resource waits tell you what the server is waiting on: slow storage, lock contention, memory pressure, etc. High signal waits tell you the CPUs are too busy to keep up; tasks are ready to work but there's no core available to run them.

In this post, we focus on resource waits. I'll cover signal waits and CPU scheduling pressure in a future post.

Conclusion

SQL Server performance tuning has a lot in common with surviving a packed theme park: some time is real work (riding the rides), and some is just waiting in line. Wait stats let us see where SQL Server is spending that "line time," so we can stop guessing and focus on the actual bottleneck.

The baseline chart is my way of keeping that visibility over time, so when things change, new workload, code deploys, upgrades, or just a bad day, I can quickly spot what shifted and why.

The takeaway: if your SQL Server is slow and you don't know where to start, start with wait stats. They won't tell you everything, but they'll almost always point you in the right direction.

More to read

Wait Stats - Brent Ozar Unlimited®

sp_BlitzFirst @SinceStartup = 1 Shows You Wait Stats Since, Uh, Startup - Brent Ozar Unlimited®

How to Measure SQL Server Workloads: Wait Time per Core per Second - Brent Ozar Unlimited®

sql-server-performance-tuning-using-wait-statistics-whitepaper.pdf

If you spot anything off in the explanation (or have ideas to improve the script), I'd appreciate your feedback in the comments or via email. Thank you for reading and have a good day!

Pages