SQL Server: Approaching Database Server Performance Issues (GOOD)
Step 1:
When you work as DBA, many people will approach you with a complaint like "Application is taking ages to load the data on a page,could you please check something going wrong with database server?" There might be hundred of other reason for slowness of the page.It might be a Problem with application server,network issues,really a bad implementation or problem with database server due to generation of huge report /job running at that moment. What ever be the issue, database gets the blame first. Then it is our responsibility to cross check the server state.
Let us discuss how we can approach this issue. I use following script to diagnose the issues. The first script which I will run against server is given below:
SELECT
parent_node_id AS Node_Id,
COUNT(*) AS [No.of CPU In the NUMA],
SUM(COUNT(*)) OVER()
AS [Total No. of CPU],SUM(runnable_tasks_count ) AS [Runnable Task Count],
SUM(pending_disk_io_count) AS [Pending disk I/O count],
SUM(work_queue_count) AS [Work queue count]
FROM sys.dm_os_schedulers WHERE status='VISIBLE ONLINE' GROUP BY parent_node_id
This will give following information.
- Number of records in the output will be equal to number of NUMA nodes (if it is fetching only one record , it is not a NUMA supported server)
- Node_id : NUMA node id . Can be mapped into the later scripts.
- No.of CPU in the NUMA : Total number of CPU assigned to the specific NUMA node or the number of schedulers.
- Total No. of CPU : Total No. of CPU available in the server.If you have set the affinity mask, total number of CPU assigned to this instance.
- Runnable Task Count: Number of workers, with tasks assigned to them, that are waiting to be scheduled on the runnable queue. Is not nullable. In short number of request in runnable queue.To understand more about Runnable queue , read my earlier post.
- Pending disk I/O count : Number of pending I/Os that are waiting to be completed. Each scheduler has a list of pending I/Os that are checked to determine whether they have been completed every time there is a context switch. The count is incremented when the request is inserted. This count is decremented when the request is completed.
- Work queue count: Number of tasks in the pending queue. These tasks are waiting for a worker to pick them up.
I have scheduled this scrip to store the output this query in a table for two days in the interval of 10 minutes. That will give baseline data about what is normal in your environment. In my environment people will start complaining once the Runnabable Task Count of most of the nodes goes beyond 10 consistently. In normal scenario, the value of Runnabable Task Count will be always below 10 on each node and never seen a value greater than 0 for work queue count field.This will give a picture of current state of the system.If the output of this step is normal, we are safe to an extent, the slow response might be issue which might be beyond our control or a blocking and slow response is only for a couple of screens(sessions) not for entire system.
This is the first step in my way of diagnosing.
Step 2:
The next step (Step2) in my way of diagnosing is to check the session that are waiting of any resources. Below script will help us. This query required a function as prerequisite, which will help us to display the SQL server agent job name if the session started by SQL server agent.
/*****************************************************************************************
PREREQUISITE FUNCTION
******************************************************************************************/
USE MASTER
GO
CREATE FUNCTION ConvertStringToBinary ( @hexstring VARCHAR(100)
) RETURNS BINARY(34) AS
BEGIN
RETURN(SELECT CAST('' AS XML).value('xs:hexBinary( substring(sql:variable("@hexstring"), sql:column("t.pos")) )', 'varbinary(max)')
FROM (SELECT CASE SUBSTRING(@hexstring, 1, 2) WHEN '0x' THEN 3 ELSE 0 END) AS t(pos))
END
/***************************************************************************************
STEP 2: List the session which are currently waiting for resource
****************************************************************************************/
SELECT node.parent_node_id AS Node_id,
es.HOST_NAME,
es.Login_name,
CASE WHEN es.program_name LIKE '%SQLAgent - TSQL JobStep%' THEN
(
SELECT 'SQL AGENT JOB: '+name FROM msdb..sysjobs WHERE job_id=
MASTER.DBO.ConvertStringToBinary(LTRIM(RTRIM((SUBSTRING(es.program_name,CHARINDEX('(job',es.program_name,0)+4,35)))))
)
ELSE es.program_name END AS [Program Name] ,
DB_NAME(er.database_id) AS DatabaseName,
er.session_id,
wt.blocking_session_id,
wt.wait_duration_ms,
wt.wait_type,
wt.NoThread ,
er.command,
er.status,
er.wait_resource,
er.open_transaction_count,
er.cpu_time,
er.total_elapsed_time AS ElapsedTime_ms,
er.percent_complete ,
er.reads,
er.writes,
er.logical_reads,
wlgrp.name AS ResoursePool ,
SUBSTRING (sqltxt.TEXT,(er.statement_start_offset/2) + 1,
((CASE WHEN er.statement_end_offset = -1
THEN LEN(CONVERT(NVARCHAR(MAX), sqltxt.TEXT)) * 2
ELSE er.statement_end_offset
END - er.statement_start_offset)/2) + 1) AS [Individual Query],
sqltxt.TEXT AS [Batch Query]
FROM (SELECT session_id, SUM(wait_duration_ms) AS
wait_duration_ms,wait_type,blocking_session_id,COUNT(*) AS NoThread
FROM SYS.DM_OS_WAITING_TASKS GROUP BY session_id, wait_type,blocking_session_id) wt
INNER JOIN SYS.DM_EXEC_REQUESTS er ON wt.session_id=er.session_id INNER JOIN SYS.DM_EXEC_SESSIONS es ONes.session_id= er.session_id
INNER JOIN SYS.DM_RESOURCE_GOVERNOR_WORKLOAD_GROUPS wlgrp ON wlgrp.group_id=er.group_id
INNER JOIN (SELECT os.parent_node_id ,task_address FROM SYS.DM_OS_SCHEDULERS OS INNER JOIN
SYS.DM_OS_WORKERS OSW ON OS.scheduler_address=OSW.scheduler_address
WHERE os.status='VISIBLE ONLINE' GROUP BY os.parent_node_id ,task_address ) node
ON node.task_address=er.task_address
CROSS APPLY SYS.DM_EXEC_SQL_TEXT(er.sql_handle) AS sqltxt
WHERE sql_handle IS NOT NULL AND wt.wait_type NOT IN ('WAITFOR','BROKER_RECEIVE_WAITFOR')
GO
The Description of the columns in the result are given below.
Column Name | Description |
Node Id | NUMA node id . Can be mapped to the node id of the scheduler query. |
Host_Name | Name of the computer from the connection is originated. |
Login Name | Login used in the session to connect the database server |
Program Name | Name of the program/application using this session. You can set the application name in the connection string. If this session is part of SQL server agent job, it will show the job name |
Database Name | Current database of the session |
Session Id | The session id |
Blocking Session id | Session id blocking statement |
wait_duration_ms | Total wait time for this wait type, in milliseconds. This time is inclusive of signal wait time |
wait_type | Name of the wait type like SLEEP_TASK,CXPACKET etc |
No of Thread | No of threads running on this session. If the session is in parallel execution |
Command | Identifies the current type of command that is being processed like Select,insert,update,delete etc |
Status | Status of the request. This can be of the following: Background,Running,Runnable,Sleeping and Suspended |
Wait Resource | Resource for which the request is currently waiting |
Open Transaction count | Number of transaction opened in this session |
Cpu Time | CPU time in milliseconds that is used by the request. |
Total Elapsed Time | Total time elapsed in milliseconds since the request arrived |
Percent_Complete | Percent of work completed for certain operations like backup,restore rollback etc. |
Reads | Number of reads performed by this request. |
Writes | Number of writes performed by this request. |
logical_reads | Number of logical reads performed by this request. |
ResoursePool | Name of of Resource Governor Pool |
Individual Query | current statement of the batch running on this session. |
Batch Query | Current batch (procedure/set of sql statement) running on this session. |
If there is a session with very long wait_duration_ms and not blocked by any other session and not going away from the list in the subsequent execution of the same query, I will look into the program name,host name,login name and the statement that is running which will give me an idea about the session.Based on all these information, I might decide to kill that session and look into the implementation of that SQL batch. If the session is blocked, I will look into the blocking session using a different script which I will share later.(Refer this post)
Step 3:
The next step (Step 3) is to list all session which are currently running on the server. I use below query to do that.
/***************************************************************************************
/***************************************************************************************
STEP 3: List the session which are currently waiting/running
****************************************************************************************/SELECT node.parent_node_idAS Node_id,
es.HOST_NAME,
es.login_name,
CASE WHEN es.program_name LIKE '%SQLAgent - TSQL JobStep%' THEN
(SELECT 'SQL AGENT JOB: '+name FROM msdb..sysjobs WHERE job_id=ADMIN.DBO.ConvertStringToBinary(LTRIM(RTRIM((SUBSTRING(es.program_name,CHARINDEX('(job',es.program_name,0)+4,35)))))
)ELSE es.program_name END AS program_name ,
DB_NAME(er.database_id) AS DatabaseName,
er.session_id,
wt.blocking_session_id,
wt.wait_duration_ms,
wt.wait_type,
wt.NoThread ,
er.command,
er.status,
er.wait_resource,
er.open_transaction_count,
er.cpu_time,
er.total_elapsed_time AS ElapsedTime_ms,
er.percent_complete ,
er.reads,er.writes,er.logical_reads,
wlgrp.name AS ResoursePool ,
SUBSTRING (sqltxt.TEXT,(er.statement_start_offset/2) + 1,
((CASE WHEN er.statement_end_offset = -1
THEN LEN(CONVERT(NVARCHAR(MAX), sqltxt.TEXT)) * 2
ELSE er.statement_end_offset
END - er.statement_start_offset)/2) + 1) AS [Individual Query],
sqltxt.TEXT AS [Batch Query]
FROM
SYS.DM_EXEC_REQUESTS er INNER JOIN SYS.DM_EXEC_SESSIONS es ON es.session_id= er.session_id
INNER JOIN SYS.DM_RESOURCE_GOVERNOR_WORKLOAD_GROUPS wlgrp ON wlgrp.group_id=er.group_id
INNER JOIN (SELECT os.parent_node_id ,task_address FROM SYS.DM_OS_SCHEDULERS OS
INNER JOIN SYS.DM_OS_WORKERS OSW ON OS.scheduler_address=OSW.scheduler_address
WHERE os.status='VISIBLE ONLINE' GROUP BY os.parent_node_id ,task_address ) node ONnode.task_address=er.task_address
LEFT JOIN
(SELECT session_id, SUM(wait_duration_ms) AS
wait_duration_ms,wait_type,blocking_session_id,COUNT(*) AS NoThread
FROM SYS.DM_OS_WAITING_TASKS GROUP BY session_id, wait_type,blocking_session_id) wt
ON wt.session_id=er.session_id
CROSS apply SYS.DM_EXEC_SQL_TEXT(er.sql_handle) AS sqltxt
WHERE sql_handle IS NOT NULL AND ISNULL(wt.wait_type ,'') NOT IN
('WAITFOR','BROKER_RECEIVE_WAITFOR')
ORDER BY er.total_elapsed_time DESC
GO
The columns are same as we discussed in step 2 . I used to analyse the sessions with more total_elapsed_time and take appropriate actions like killing the session and look into the implementation. In most of the scenario (where server was running perfectly but all off sudden it become standstill) , I will be able fix the issue by following these steps. In the next part let us discuss about blocking session and session with open transaction which is not active.
Step 4:
In this post let us see how to list the blocking sessions with required information.
******************************************************************************************/
CREATE FUNCTION [dbo].dba_GetStatementForSpid
(
@spid SMALLINT
)
RETURNS NVARCHAR(4000)
BEGIN
DECLARE @SqlHandle BINARY(20)
DECLARE @SqlText NVARCHAR(4000)
SELECT @SqlHandle = sql_handle
FROM sys.sysprocesses WITH (nolock) WHERE spid = @spid
SELECT @SqlText = [text] FROM
sys.dm_exec_sql_text(@SqlHandle)
RETURN @SqlText
END
GO
/*****************************************************************************************
STEP 4: List the current blocking session information
****************************************************************************************/
SELECT
es.session_id,
es.HOST_NAME,
DB_NAME(database_id) AS DatabaseName,
CASE WHEN es.program_name LIKE '%SQLAgent - TSQL JobStep%' THEN
(SELECT 'SQL AGENT JOB: '+name FROM msdb..sysjobsWHEREjob_id=MASTER.DBO.ConvertStringToBinary (LTRIM(RTRIM((SUBSTRING(es.program_name,CHARINDEX('(job',es.program_name,0)+4,35)))))
) ELSE es.program_name END AS program_name ,
es.login_name ,
bes.session_id AS Blocking_session_id,
MASTER.DBO.dba_GetStatementForSpid(es.session_id) AS [Statement],
bes.HOST_NAME AS Blocking_hostname,
CASE WHEN Bes.program_name LIKE '%SQLAgent - TSQL JobStep%' THEN
(SELECT 'SQL AGENT JOB: '+name FROM msdb..sysjobs WHERE job_id=
MASTER.DBO.ConvertStringToBinary
(LTRIM(RTRIM((SUBSTRING(Bes.program_name,CHARINDEX('(job',es.program_name,0)+4,35))))))
ELSE Bes.program_name END AS Blocking_program_name,
bes.login_name AS Blocking_login_name,
MASTER.DBO.dba_GetStatementForSpid(bes.session_id ) AS [Blocking Statement]
FROM sys.dm_exec_requests S
INNER JOIN sys.dm_exec_sessions es ON es.session_id=s.session_id
INNER JOIN sys.dm_exec_sessions bes ON bes.session_id=s.blocking_session_id
This script will list the blocked and blocking statement information and will be helpful to troubleshoot. Below script will
help us to identify the sessions which have an open transaction but not active . That is, a sessions with an open transaction but not running any statement in the last 30 seconds./*****************************************************************************************
STEP 4: List the Open session with transaction which is not active
****************************************************************************************/
SELECT es.session_id,
es.login_name,
es.HOST_NAME,
DB_NAME(SP.dbid) AS DatabaseName,
sp.lastwaittype,
est.TEXT,cn.last_read,
cn.last_write, CASE WHEN es.program_name LIKE '%SQLAgent - TSQL JobStep%' THEN(SELECT 'SQL AGENT JOB: '+name FROM msdb..sysjobsWHERE job_id=MASTER.DBO.ConvertStringToBinary(LTRIM(RTRIM((SUBSTRING(es.program_name,CHARINDEX('(job',es.program_name,0)+4,35)))))
)ELSE es.program_name END AS program_name
FROM sys.dm_exec_sessions es
INNER JOIN sys.dm_tran_session_transactions st ON es.session_id = st.session_id INNER JOINsys.dm_exec_connections cn ON es.session_id = cn.session_id
INNER JOIN sys.sysprocesses SP ON SP.spid=es.session_id
LEFT OUTER JOIN sys.dm_exec_requests er ON st.session_id = er.session_id
AND er.session_id IS NULL
CROSS APPLY sys.dm_exec_sql_text(cn.most_recent_sql_handle) est
WHERE (DATEDIFF(SS,cn.last_read,GETDATE())+DATEDIFF(SS,cn.last_write,GETDATE()))>30
AND lastwaittype NOT IN ('BROKER_RECEIVE_WAITFOR' ,'WAITFOR') GO
Step 5:
In this post let us discuss about listing execution stats from the plan cache.
The below query will help us to list the CPU intensive queries from your plan cache.
/*****************************************************************************************
List heavy query based on CPU/IO. Change the order by clause appropriately
******************************************************************************************/
SELECT TOP 20
DB_NAME(qt.dbid) AS DatabaseName
,DATEDIFF(MI,creation_time,GETDATE()) AS [Age of the Plan(Minutes)]
,last_execution_time AS [Last Execution Time]
,qs.execution_count AS [Total Execution Count]
,CAST((qs.total_elapsed_time) / 1000000.0 AS DECIMAL(28,2)) [Total Elapsed Time(s)]
,CAST((qs.total_elapsed_time ) / 1000000.0/ qs.execution_count AS DECIMAL(28, 2)) AS [Average E
xecution time(s)],CAST((qs.total_worker_time) / 1000000.0 AS DECIMAL(28,2)) AS [Total CPU time (s)]
,CAST(qs.total_worker_time * 100.0 / qs.total_elapsed_time AS DECIMAL(28,2)) AS [% CPU]
,CAST((qs.total_elapsed_time - qs.total_worker_time)* 100.0 /qs.total_elapsed_time AS
DECIMAL(28, 2)) AS [% Waiting],CAST((qs.total_worker_time) / 1000000.0/ qs.execution_count AS DECIMAL(28, 2)) AS [CPU time
average (s)],CAST((qs.total_physical_reads) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg Physical
Read],CAST((qs.total_logical_reads) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg Logical
Reads],CAST((qs.total_logical_writes) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg Logical
Writes],max_physical_reads
,max_logical_reads
,max_logical_writes
, SUBSTRING (qt.TEXT,(qs.statement_start_offset/2) + 1,((CASE WHEN qs.statement_end_offset =
-1 THEN LEN(CONVERT(NVARCHAR(MAX), qt.TEXT)) * 2
ELSE qs.statement_end_offset
END - qs.statement_start_offset)/2) + 1) AS [Individual Query]
, qt.TEXT AS [Batch Statement]
, qp.query_plan
FROM SYS.DM_EXEC_QUERY_STATS qs
CROSS APPLY SYS.DM_EXEC_SQL_TEXT(qs.sql_handle) AS qt
CROSS APPLY SYS.DM_EXEC_QUERY_PLAN(qs.plan_handle) qp
WHERE qs.total_elapsed_time > 0
ORDER BY
[Total CPU time (s)]
--[Avg Physical Read]
--[Avg Logical Reads]
--[Avg Logical Writes]
--[Total Elapsed Time(s)]
--[Total Execution Count]
DESC
Let us discuss the output columns.
Column Name | Description |
DatabaseName | Database Context of the execution plan. |
Age of the Plan(Minutes) | Age of the plan in the plan cache in minutes. |
Last Execution Time | Date and Time of the last execution of this plan. |
Total Execution Count | Total number of execution after the last compilation Or Total number of execution in the last [Age of the Plan(Minutes)] (since it was last compiled) |
Total Elapsed Time(s) | Total time (second) took to execute this plan [Total Execution Count] times |
Average Execution time(s) | Average Time(seconds) took for single execution of this plan |
Total CPU time (s) | Total CPU time(seconds) took to execute this plan [Total Execution Count] times |
% CPU | Percentage of CPU time compared to [Total Elapsed Time(s)] |
% Waiting | Percentage of waiting time(wait for resource) compared to [Total Elapsed Time(s)] |
CPU time average (s) | Average CPU time (Seconds) used for single exection |
Avg Physical Read | Average number of Physical read for single execution |
Avg Logical Reads | Average number of Logical read for single execution |
Avg Logical Writes | Average number of Logical writes for single execution |
max_physical_reads | Maximum number of physical reads that this plan has ever performed during a single execution. |
max_logical_reads | Maximum number of logical reads that this plan has ever performed during a single execution. |
max_logical_writes | Maximum number of logical writes that this plan has ever performed during a single execution. |
Individual Query | Part of Batch Statement |
Batch Statement | Batch Query |
query_plan | XML execution . On clicking this we can see the graphical execution plan |
I used to analyse the first five to ten records ( in each category by changing the order by clause) to see procedures implementation. Most of the time I used to find some thing wrong with the procedure like unnecessary temp table usage, distinct clause , cursors, table joining with out proper joining condition,no proper indexing etc. The other issue usually happen is, enormous calls to database for single procedure (CPU cost and execution time might less). This might be a wrong implementation, which should be fixed by discussing with development team. Usually this can be fixed by adding some types of caching in application.Some time calls to the database is only to check if there is any change in the result data. Some thing like if there is new records in a table, it should be processed immediately. To achieve this, application might be querying the table to find the unprocessed record multiple time in a second, This can be fixed by implementing some kind of asynchronous call to the application by the application which inserts the data into this table or by implementing the notification event using SqlDependency available in the .Net frame work.