Wednesday, November 9, 2016

Why I Avoid Using Hints in my Database Queries

For those who have watched my Pluralsight courses on either Oracle or SQL Server performance tuning, you will notice that I don't talk about database hints in either course.  This is intentional on my part.  I tend not to use hints in any of my production SQL statements and overall I only use hints in very limited cases.

I made the decision not to cover hints in my courses because too often times, I have seen hints used incorrectly, often by someone who didn't understand what the hint was doing.  I've seen cases where someone read about hints "on the Internet" and thought by dropping a hint into their SQL statement, it would act as some sort of magical performance booster for their statement.  Unfortunately, there is no magic going on, and like any technology or technique, the result can actually be a worse outcome if you don't understand what is going on.

So let's explain a little bit about hints.

The Query Optimization Process
When you submit a SQL statement to any database, a piece of software called the Query Optimizer parses the SQL statement and determines the fastest way to process that statement.  It is the query optimizer that determines if an index can be used or a table scan should be performed.  It also determines how to perform joins between tables and in what order the tables should be joined if you have multiple join conditions.

To determine the most efficient way to process your statement, the optimizer looks at the statistics of the tables involved in the statement, including the total number of rows and the distribution of those rows in the table.  It also looks at the indexes on the table and the number of unique keys in the index and matches all of this data up with the where clauses and join criteria you have specified in your statement.  Using these statistics, the optimizer can estimate the cost of all of the different ways to perform the SQL statement, and it will pick the lowest cost combination of these operations.

All of this usually happens in 100 milliseconds or less.  And optimizers today are really, really good at picking the right execution plan that will result in the fastest way to execute your statement.

What Does a Hint Do?
When you supply a database hint, you are taking control of how the statement will execute and taking this control away from the query optimizer.  The problem is that the best way to execute a statement may change over time.  Maybe a table gets more data or the distribution of the data in a table changes.  Under normal circumstances, the query optimizer can adjust to these changes and come up with a new plan which is most efficient for the current data set.

When you provide a hint, you are taking this flexibility away from the optimizer.  As things change in your database, the optimizer cannot adjust, so you end up with a sub-optimal plan, and many times that plan is much less efficient than the plan the query optimizer would have come up with on its own.  The problem is that while you can figure out the right hint for the way the data looks today, you have no way of knowing what the data will look like tomorrow.  Yet by providing a hint, you are committing to a specific execution plan, even if that plan is wildly inefficient for tomorrow's problem.
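To make this concrete, here is roughly what a hint looks like in each database.  This is just an illustrative sketch using a hypothetical Students table and a hypothetical index named IX_Students_LastName, not something I am recommending you put in production code.

-- SQL Server: force the optimizer to use a specific index
SELECT StudentId, LastName, FirstName
FROM Students WITH (INDEX(IX_Students_LastName))
WHERE LastName = 'Smith';

-- Oracle: the same idea, expressed as a hint comment
SELECT /*+ INDEX(s IX_Students_LastName) */ StudentId, LastName, FirstName
FROM Students s
WHERE LastName = 'Smith';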

Is There Ever a Time to Use Hints?
Sometimes I am surprised when I get a certain execution plan.  For example, I may have expected the optimizer to use a different index than what it did.  So in my SQL editor, I'll use a hint to look at that different version of the plan.  In almost every case, the plan produced by the optimizer is less expensive than the plan with the hint.  It's not that I don't trust the optimizer, but sometimes being able to contrast the two execution plans helps me understand why the optimizer is making the decisions it is.
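As a sketch of what that looks like in SQL Server (reusing the hypothetical table and index from above), you can ask for the estimated plans without actually running the statements and compare the two side by side:

SET SHOWPLAN_XML ON;
GO
-- The plan the optimizer chooses on its own
SELECT StudentId, LastName, FirstName FROM Students WHERE LastName = 'Smith';
GO
-- The plan when a hint forces a particular index
SELECT StudentId, LastName, FirstName FROM Students WITH (INDEX(IX_Students_LastName)) WHERE LastName = 'Smith';
GO
SET SHOWPLAN_XML OFF;
GO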

The other time to use a hint would be if you were instructed to do so by Oracle or Microsoft technical support.  I've never had to do this, and I'm guessing these instances are very few and very far between.

Summary
I really feel like it is best to avoid using hints in any production SQL statements your application might run.  At first, they may seem magical.  But the optimizers included in today's database products are really, really good.  When you start including hints in your statement, you are taking away the optimizer's ability to find the best plan for the current state of the table.  This almost always results in a less efficient plan and therefore a slower running SQL statement.  So I encourage you to avoid using hints in any production SQL you might run.

Monday, November 7, 2016

MKE DOT NET 2016

On Saturday, October 29th, I attended MKE DOT NET in Pewaukee, WI. This is a one day conference focused around .NET organized by Centare. I was fortunate enough to be accepted to speak at the event, which gave me a chance to share my knowledge as well as attend and learn from others.
One of the big draws of this event is the speaker lineup. In the two years the event has existed, Centare has used their connections in the .NET community to get some high profile speakers for the event. In 2015 (which I did not attend) they had Uncle Bob Martin give the keynote. This year, the keynote was delivered by Scott Hanselman of Microsoft. In addition, Stephen Cleary and Ben Day were also speaking at this year's event. However, there was also a strong contingent of speakers from the Milwaukee and Madison areas, so there was a nice balance of well known names and local practitioners. This balance of different perspectives was one of the first things I really liked about the event, and I hope it continues in future years. Below, I'll give a brief review of the talks I attended, my impressions and my takeaways.

 Keynote - Scott Hanselman
The morning started with an hour-long talk by Scott Hanselman discussing how and why Microsoft is evolving .NET to the .NET Core platform.  We read a lot about .NET Core these days, but there is very little written about the reasons why the change is being made.  Mainly, Microsoft wants a much smaller package so that someone can get started coding in .NET without having to download a 2 GB package containing Visual Studio and all the other tooling that is necessary to code in .NET today.  He also talked about how, while the .NET Core runtime has been released, a lot of the tooling is still in beta.  Knowing the state of everything is useful as you decide what version to use for new projects.

The one piece of advice I really liked is when he said "know what level of abstraction you are working at and make sure to know how things work one level deeper".  This is great advice, and something I have always believed in.  We always want to know what our tools and frameworks are doing behind the scenes, and this statement sums that up nicely.

Session #2 - Ben Day - Claims Based Security
I was speaking in the first slot, so the first session I attended was Ben Day's session on claims-based security.  I think he did a really good job of summarizing why role based authentication is really too simplistic of a model in today's world.  He described claims-based security, and how most systems will first check if a user is in a role and then check individual claims.  I've started to see more systems move to a claims-based model, and this talk provided a good overview of how this is implemented in .NET using the identity provider.


Session #3 - Josh Schultz - Working With Humans
This was a talk about how to work with people, and I found a lot of the advice very practical and useful.  I loved the suggestion of having lunch with a coworker you may not get along with very well and talking about anything outside of work.  A lot of his advice was very practical, like considering other perspectives and not making assumptions about what others are thinking.  All things we know, but that we should constantly be reminded of.  I also liked that he stressed that there is a really important aspect to development which is understanding the problem you are trying to solve, and asking lots of questions to make sure you do.  This was a really good talk.  While Josh claimed he didn't have all the answers, he had a lot of good advice in his presentation.

Session #4 - Andrew Bloechl - Building SaaS Products in the Cloud
This talk was really more like a retrospective of an effort to move one of their applications to the cloud.  He went through the decisions that they made, why they made those decisions and some of the decisions that they later reversed.  I really liked this talk because it wasn't the typical overview of a cloud service.  It was real world experience of the ups and downs.  

For example, they encountered problems where the system was very slow for the first user because Azure services would all shut down after 20 minutes of non-use, meaning the whole stack had to be fired up again when the next user hit the site.  There is now a feature where you can keep everything in an 'always up' state, but that isn't something you typically think about from the start.  They also had issues where people created virtual machines, forgot about them, left them running and got a big bill at the end of the month.  Overall, they really liked the cloud solution and the flexibility it brought, but it was good to hear about some of the challenges they had to overcome.  So often these solutions are presented in such a blue-sky manner; it's great to see talks like this that remind us there is a learning curve and can help the rest of us get up to speed on that curve.

Session #5 - Samrat Saha - Azure Functions
This was an interesting talk on Azure Functions, which allow you to write a function to be deployed in Azure without worrying at all about how the function is hosted or set up from an infrastructure perspective.  The example the speaker used was a series of functions that could respond to text messages.  He talked about how you could chain these functions together by using message queues so that you could keep each function small and concise, which is the point.  He also talked about some of the challenges they had in sharing code between different functions and how they solved that by coming up with a set of shared libraries.  Azure Functions is something I want to look at more, so this provided a good introduction to the service.


Other Details
The event was extremely well organized.  When checking in, everyone received a nicely printed program listing all of the talks and what room they were in.  In addition, all of this information was posted on the outside of each room.  Each room had plenty of seating and there was a water station set up in the back.

When I went to my room to set up my presentation, they had multiple video adapters at the podium so you could connect your laptop to the projector.  I carry my own adapters, but this is really thoughtful just in case someone forgot to bring theirs.  There was a staff member in each room to help with any setup issues as well.  Finally, each room was equipped with a wireless mic.  From a speaker's perspective, the room setup was one of the best that I have seen.

Both breakfast and lunch were provided at the event.  In addition, they had a social event afterwards where everyone received two drink tickets plus a nice commemorative glass from the event.  Way better than a t-shirt, which all of us already have zillions of.  Now I have a nice glass that brings back good memories of the event any time I use it.

The cost to attend was $99.  While this might be a deterrent to some, I think the event was well worth it: first rate speakers, a nice facility, meals and first rate organization.  I don't have any connection to Centare other than speaking at this conference, but I think they are really to be commended for how well the event was put together.  If you are in the upper Midwest next fall, be sure to watch for when the event is announced, because I really think it is worth attending.

Saturday, October 29, 2016

Slides from my MKE Dot Net talk on SQL Server Performance

Thanks to everyone who attended my talk at MKE Dot Net today.  The slides are available by clicking on the image or the link below.



MKE Dot Net - What Every Developer Should Know About SQL Server Performance - Slides

Thanks again for attending.

Sunday, July 24, 2016

What About Indexing SQL Server Tables That do Not Have a Primary Key or Cluster Key

I had a question in the comments section of my recent Pluralsight course, and the basic premise of the question was, "If your table does not have a primary key or cluster key, is it still worth it to create an index on the table?".  The answer to this question is absolutely, yes.  I want to take this opportunity to discuss this topic further though.

The Most Common Scenario - A Review
In my course, I describe the most typical setup in SQL Server: when you create a table with a primary key, the primary key column(s) are used as the cluster key of the table, and the cluster key is what the data in the table is physically sorted by.  What this means is that the actual table data in SQL Server is stored in a B-tree structure something like what you see below.



In computer science, a B-tree structure can be searched very rapidly, as only a handful of comparisons are required to get you to the lowest level of the tree (the leaf level) where the data is stored.  This is why looking up data in SQL Server by the primary key value is so fast: by default, the primary key is also the cluster key, so SQL Server can take advantage of this tree structure to rapidly find the associated row of data.

Typically though, you need to search for data in a table on some field or fields other than the primary key.  For example, you might not know a student's id number, but you know their first and last name.  Without an index, SQL Server would have to scan through all of the rows of the table, which would not just take a lot of time, but also consume a lot of system resources (CPU and disk IO).  So what we do is create an index on these columns since we commonly use them to search for data in the table.  This index uses the same B-tree structure, but now this tree structure is organized (sorted) by the columns in the index key -- last name and first name in our case.  And in the leaf nodes of the index (the bottom level), the index doesn't contain the data for the row, but instead the value of the primary key of the table.
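As a quick sketch (assuming a hypothetical Students table), creating and then using such an index might look like this:

CREATE NONCLUSTERED INDEX IX_Students_LastName_FirstName
    ON dbo.Students (LastName, FirstName);

-- This search can now seek on the index instead of scanning the whole table
SELECT StudentId, LastName, FirstName
FROM dbo.Students
WHERE LastName = 'Smith'
  AND FirstName = 'Jane';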



So what SQL Server will do is first traverse the index, which again is very fast because we are traversing a tree structure, and find all of the index keys that match the input criteria (WHERE clause) specified.  Then, it will get the primary key values out of the index and go over to the table and look those values up by their primary key.  And as we said before, these lookups in the table are very fast because the table is stored in a tree structure organized by the primary key.

How We Got Here
What I did not mention in the course are some of the other scenarios that can occur.  There is always a dilemma when putting together a course about what to put in and what to leave out, and I chose not to cover the less common scenarios because I felt it was more important to make sure the viewer had a good understanding of the most common scenario.  What I will do is go over those other scenarios here.

A Table With No Cluster Key
It is possible in SQL Server to create a table with no cluster key.  In this case, the rows of the table are stored in what is called a heap, which is just a way of saying that some space is allocated on disk and the rows of the table are stored in no particular order within that allocated area.  So now we don't have our rows organized in a nice tree structure, they are just stored in effectively random order in whatever pages on disk the table is using.

In this case, you can still (and should) create indexes on your table over the columns you search the table by.  Now though, the record in the index doesn't contain the primary key value, because that wouldn't really help us since our data isn't organized by primary key any more.  Instead, it will contain a row pointer value that tells SQL Server where it can find the corresponding row in the table.  This row pointer will include the page where the data is stored as well as a row identifier so SQL Server can find the row within the page.  With this information, SQL Server can very quickly locate where the actual data for the row is, read it off of disk and return it to you.
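A minimal sketch of this scenario (the table and column names here are made up purely for illustration):

-- No primary key and no clustered index, so this table is stored as a heap
CREATE TABLE dbo.AuditLog
(
    LogId    INT           NOT NULL,
    LoggedAt DATETIME      NOT NULL,
    Message  NVARCHAR(400) NOT NULL
);

-- Still worth indexing the columns you search by; the index leaf entries will
-- contain row pointers (RIDs) into the heap rather than cluster key values
CREATE NONCLUSTERED INDEX IX_AuditLog_LoggedAt
    ON dbo.AuditLog (LoggedAt);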

A Table with a Cluster Key Different Than the Primary Key
It is also possible to create a table that has a cluster key that is not the primary key.  That is, in a fictional students table, my primary key is a column named StudentId, but I am going to make my cluster key on the columns LastName and FirstName.  This would mean that the data I store in my table would be physically organized (sorted) by the combination of LastName and FirstName.
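In case it helps to see it, here is a rough sketch of how you would set that up (the column sizes are arbitrary):

CREATE TABLE dbo.Students
(
    StudentId INT           NOT NULL,
    LastName  NVARCHAR(100) NOT NULL,
    FirstName NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_Students PRIMARY KEY NONCLUSTERED (StudentId)
);

-- The table is now physically ordered by name, not by the primary key
CREATE CLUSTERED INDEX IX_Students_Name
    ON dbo.Students (LastName, FirstName);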

You might think that would be a good idea, because after all, if I commonly search for students by their first and last name, then having the data organized like this would make searching super fast, and I'd avoid the lookup step where you have to go look the actual row up by the student id after you have found the entry in the index.  But before you do that, you should read this superbly written article by Michelle Ufford.

Effective Clustered Indexes - https://www.simple-talk.com/sql/learn-sql-server/effective-clustered-indexes/

The problem is that as you insert data into the table, you will often need to insert rows into the middle of the table.  That will lead to page splits, which will over time degrade your performance.  Since our data is stored in a tree structure, it is much better if we are inserting new data at the end of the tree, not somewhere in the middle.

Further, we want our cluster key to be unique, since SQL Server has to have some way to uniquely identify rows.  If you choose a cluster key that is not unique, SQL Server will have to add a unique identifier value to the cluster key so that it can uniquely identify the rows.

Back to our question though: let's say you have a table whose cluster key is different than the primary key and, further, the cluster key is not unique.  Should you still create indexes on the table?  Absolutely, because otherwise you will have to scan each and every row of the table to locate your data.  You will pay a small price because the cluster key is not unique, but having appropriate indexes will still be very beneficial to the overall performance of your application.

In Summary
The rule still holds that for a table of any size, you want to make sure that your SQL Statement is using an index whenever accessing the table.  The execution plan and resulting data paths might look a little different, but a well thought out index will still provide a major performance improvement for your SQL statements.









Sunday, April 24, 2016

DMV Queries From my Pluralsight Course "What Every Developer Should Know About SQL Server Performance"

I've actually broken these queries up into separate blog posts, so here are the links to the individual posts where you will find the queries.

Exposing SQL Server DMV Data to Developers and other non-DBA Users

Using DMV's to Find SQL Server Connection and Session Info

Which Statements Are Currently Running in my SQL Server Database

Finding the Most Expensive, Longest Running Queries in SQL Server

Analyzing Index Usage in SQL Server
(Contains queries for both missing indexes and index usage stats)

Using DMV's to Find SQL Server Connection and Session Info

It is often useful to understand how your application is connecting to SQL Server and what other connections may be present to your application's database.  This can be understood by looking at the sys.dm_exec_sessions view.  Here is a sample query to use:


SELECT
    database_id,    -- SQL Server 2012 and after only
    session_id,
    status,
    login_time,
    cpu_time,
    memory_usage,
    reads,
    writes,
    logical_reads,
    host_name,
    program_name,
    host_process_id,
    client_interface_name,
    login_name as database_login_name,
    last_request_start_time
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
ORDER BY cpu_time DESC;


This simply lists all of the current connections to SQL Server, but we also get some useful information with each connection.  We see we have fields like cpu_time (in milliseconds), memory_usage (in 8 KB pages), reads, writes and logical_reads.  So if we have a session that is consuming a lot of resources in SQL Server, we can immediately see that by looking at these fields.

We also have columns like host_name, program_name and host_process_id which can help us identify where a connection is coming from.  You would be surprised how many times you look at a production database and see users connecting directly to the database from applications like Management Studio or even Excel.
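If you want to zero in on those ad-hoc connections, a small variation on the query above will do it (the program name filters here are just examples):

SELECT session_id, login_name, host_name, program_name, login_time
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
  AND (program_name LIKE '%Management Studio%' OR program_name LIKE '%Excel%')
ORDER BY login_time;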

We can write a little bit different query to get a different look at this data:


SELECT
    login_name,
    host_name,
    host_process_id,
    COUNT(1) As LoginCount
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
GROUP BY
    login_name,
    host_name,
    host_process_id;

This counts up all of the logins by login name and client process.  So if you have an ASP.NET web application, you would see each worker process with the number of connections the worker process has to the database, plus any other processes that might be connected.  Sometimes this is a useful view, because you can validate that all of the machines in your web cluster are indeed talking to the database, and you can get a feel for how many connections they have in their connection pools.

Knowing who is logged in might seem simple, but it can help you understand any connectivity issues you may be having and spot sessions that are using a lot of resources, so these simple queries against the dm_exec_sessions view can tell you quite a bit.


Saturday, April 23, 2016

Exposing SQL Server DMV Data to Developers and other non-DBA Users

As a developer, it is very useful to have access to a number of SQL Server's Dynamic Management Views so you can view performance related statistics about how your application is interacting with the database.  This includes finding which of your SQL Statements are taking the longest, if SQL Server thinks there are any missing indexes and usage data for your existing indexes.

Unfortunately, access to DMV's in SQL Server requires the VIEW SERVER STATE permission.  Many DBA's are reluctant to grant this permission to members of the development team because of the amount of data that one can see with this level of access.  This removes an important set of performance tuning tools from the developer's toolbox, or forces them to ask a DBA for help whenever they need this information.  There has to be a better way.

What we need is a way to grant a developer or any other non-DBA user access to a select subset of DMV information.  For example, it is useful for a developer to see query execution statistics, missing index information and index usage data.  Further, we would like to limit the information to just the database a developer is working in.  So how do we do this?

You would think this would be easy to do--create a view as a DBA user that exposes only the information you want a developer to access and then grant the developer SELECT permission on that view.  This fails though with the following message.

Msg 300, Level 14, State 1, Line 1
VIEW SERVER STATE permission was denied on object 'server', database 'master'.
Msg 297, Level 16, State 1, Line 1
The user does not have permission to perform this action.


The next thought you might have is to create a table valued function with EXECUTE AS set to a user that has permissions to view the DMV's.  However, this also fails with an error message complaining about the module not being digitally signed.

There is a solution, but it turned out to be much more complex than I would have initially thought.  My approach is based on this blog post by Aaron Bertrand, though I had to do things a little bit differently than he did.

The Solution

What really helps is to visualize the solution before we start going through the individual steps, so here is what it looks like.



What we have to do is create a database (called PerfDB in this example) that is going to be a holding area for a set of table valued functions and views that we create to expose our DMV information.  This database will have the TRUSTWORTHY option set to ON, which will allow our table valued function to execute as a user with VIEW SERVER STATE permissions and access the DMV's (this gets around the error about the function not being digitally signed that was mentioned above).  Then, we can create the rest of our objects to expose the DMV information we want to.  This approach gives us a way to expose just the information we want, which is what we are after.

Let's walk through the individual steps we need to take for all of this to work.


Step 1 - Create your PerfDB container database
So to get started, we'll create the PerfDB database.  I did this in SQL Server Management Studio using the UI, mainly because that is the easiest way for me to do it.

Once the database is created, I open a query window and run the following command to set the TRUSTWORTHY option to ON.

ALTER DATABASE PerfDB SET TRUSTWORTHY ON;

Some DBA's might have some concerns about setting TRUSTWORTHY to ON, but that is the reason for using a separate database that is just a container for the objects we create.  By separating these objects out into a separate database, we've limited any security risks we might have.


Step 2 - Create the Custom DMV Performance Functions in the PerfDB Database
Now we are ready to create the table valued functions which will query the DMV's of interest for us.  In this example, I am creating a function that will return back the query execution statistics in SQL Server.  You would create one of these functions for each piece of DMV info you wanted to expose.

The function basically just executes a query against a DMV or DMVs and returns that data in a table object.  In this function, you can see I am taking in a parameter of the database id so that in the next step, I'll be able to limit a user's access to seeing data for only one database.

You will need to create this function as a user that has the VIEW SERVER STATE privilege, so this could be an existing user or you could create a separate, dedicated user that owns this function and others like it.  What is important is that this function has the WITH EXECUTE AS OWNER option set, so that it will run as the user who owns the function.  This will allow the function to query the DMV's even if the calling user does not have permission to see the DMV's.


CREATE FUNCTION dbo.GetSqlExecutionStatistics(@database_id INT)
    RETURNS @x TABLE 
    (
        DatabaseId            INT,
        DatabaseName          VARCHAR(100), 
        ExecutionCount        BIGINT, 
        CpuPerExecution       BIGINT,
        TotalCpu              BIGINT,
        IOPerExecution        BIGINT,
        TotalIO               BIGINT,
        AverageElapsedTime    BIGINT,
        AverageTimeBlocked    BIGINT,
        AverageRowsReturned   BIGINT,
        TotalRowsReturned     BIGINT,
        QueryText             NVARCHAR(max),
        ParentQuery           NVARCHAR(max),
        ExecutionPlan         XML,
        CreationTime          DATETIME,
        LastExecutionTime     DATETIME
    )
    WITH EXECUTE AS OWNER
    AS
BEGIN

    INSERT @x (DatabaseId, DatabaseName, ExecutionCount, CpuPerExecution, TotalCpu, IOPerExecution, TotalIO,
        AverageElapsedTime, AverageTimeBlocked, AverageRowsReturned, TotalRowsReturned,
        QueryText, ParentQuery, ExecutionPlan, CreationTime, LastExecutionTime)
    SELECT
        [DatabaseId] = CONVERT(int, epa.value),
        [DatabaseName] = DB_NAME(CONVERT(int, epa.value)), 
        [ExecutionCount] = qs.execution_count,
        [CpuPerExecution] = total_worker_time / qs.execution_count ,
        [TotalCpu] = total_worker_time,
        [IOPerExecution] = (total_logical_reads + total_logical_writes) / qs.execution_count ,
        [TotalIO] = (total_logical_reads + total_logical_writes) ,
        [AverageElapsedTime] = total_elapsed_time / qs.execution_count,
        [AverageTimeBlocked] = (total_elapsed_time - total_worker_time) / qs.execution_count,
        [AverageRowsReturned] = total_rows / qs.execution_count,
        [TotalRowsReturned] = total_rows,
        [QueryText] = SUBSTRING(qt.text,qs.statement_start_offset/2 +1, 
            (CASE WHEN qs.statement_end_offset = -1 
                THEN LEN(CONVERT(nvarchar(max), qt.text)) * 2 
                ELSE qs.statement_end_offset end - qs.statement_start_offset)
            /2),
        [ParentQuery] = qt.text,
        [ExecutionPlan] = p.query_plan,
        [CreationTime] = qs.creation_time,
        [LastExecutionTime] = qs.last_execution_time   
    FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as qt
    OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) p
    OUTER APPLY sys.dm_exec_plan_attributes(plan_handle) AS epa
    WHERE epa.attribute = 'dbid'
     AND CONVERT(int, epa.value) = @database_id;

    RETURN;
END


Step 3 - Create a Database Specific View to Wrap Your Function
I'll create a view specific to the database I want to expose data for and pass in the appropriate database id to the function created in step 2.  The reason I am doing this is because ultimately, I'll give my users SELECT access on this view and not directly on the table valued function.  This way I have the logic (my query) in just one place (the function) that I can use over and over again, but what my users will see is their specific view for their specific database.


CREATE VIEW AppDatabaseName_SqlExecutionStatistics AS
    SELECT * FROM GetSqlExecutionStatistics( <<application database id>> );

You would replace AppDatabaseName in the name of this view with the name of your application database.  You would also create a view for each different app database you wanted developers to have access to.

If you need to know the database id of your database, this information can be obtained from the sys.databases view in the master database.
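For example, something like this (substitute your own database name) returns the id to plug into the view definition:

SELECT database_id, name
FROM sys.databases
WHERE name = 'YourAppDatabaseName';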


Step 4 - Grant Permission for Users to Query From the View
For your users to be able to query this information, we need to grant them the SELECT privilege on the view we just created in step 3.  But before we do that, we need to map their login to a user in the PerfDB database.  Basically, they have to have access to the database before they can even see the view, so we can do this in the Management Studio GUI or by running the following command.


USE PerfDB;

CREATE USER <<username>> FROM LOGIN <<user login name>>;

For every user you want to have access to this information, you need to do this.  So that means if you have five developers who need access to this information, you'll need to give all five of those developer logins access to the PerfDB database in this way.

Then, we can grant permission so those users can query the view.


GRANT SELECT ON AppDatabaseName_SqlExecutionStatistics 
    TO <<username>>;

You will probably want to create a role that has select permissions for all of the DMV views you create for a database, and then put users inside of that role.  The important point though is that, one way or another, the user has to be granted permission to select from the view.
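As a sketch of that approach (the role name here is just an example, and ALTER ROLE ... ADD MEMBER assumes SQL Server 2012 or later; on older versions sp_addrolemember does the same job):

USE PerfDB;
GO

CREATE ROLE PerfDmvReaders;

GRANT SELECT ON AppDatabaseName_SqlExecutionStatistics TO PerfDmvReaders;

-- Add each developer's PerfDB user to the role
ALTER ROLE PerfDmvReaders ADD MEMBER [<<username>>];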


Step 5 - Querying the Data
Querying the data is as simple as querying the view we just created, and this can be done by our non-DBA user who does not have VIEW SERVER STATE permission.


SELECT * 
    FROM PerfDb.dbo.AppDatabaseName_SqlExecutionStatistics;

If you want to make things a little easier, you can create a synonym in the application database so users don't have to fully qualify the object name in their queries.


CREATE SYNONYM [dbo].[QueryExecutionStatistics] 
    FOR [PerfDb].[dbo].[AppDatabaseName_SqlExecutionStatistics]
GO 

And now users can simply run this query


SELECT * FROM QueryExecutionStatistics;

Now, normal users who have access to this view can access the information contained within.  This also allows you to selectively choose what information is exposed from your DMV's, since you control the query inside of the table valued function and you can limit the results to data from just a single database.

It's unfortunate there isn't a more straightforward way to do this, but at least it can be done.  And of course all of this work can be incorporated into scripts, making it much easier to run.


Saturday, April 2, 2016

Analyzing Index Usage in SQL Server

For our applications, we need to have effective indexes to ensure acceptable database performance.  This really boils down to two questions.

Am I missing any indexes that I should have in my database?
If I am missing an index, it is likely that SQL Server will have to resort to a scan operation over the table in order to find the data needed for a SQL statement.  This means SQL Server will have to read and process each row of the table, which is not only slow but also resource intensive.

Do I have any indexes that are not being used that I should drop?
Indexes are not free.  They are a separate physical structure that SQL Server has to maintain whenever a DML (insert, update or delete) statement is executed against the table.  As long as an index is being used, it is a reasonable trade off to pay the price of maintaining the index because usually, queries that use the index are run far more frequently than DML statements that have to update it.  But how do we know this is the case?

These questions can be easily answered using SQL Server dynamic management views.  DBA's have known about these DMV's for years, but many developers do not.  While in some organizations a DBA is fully involved with the development of an application, in my experience this is not the case in most organizations.  So it is the development team that is creating tables and writing SQL for their data access layer.  To do this effectively and have a well performing application, they need visibility into what is happening inside of SQL Server, especially through views like the ones I am about to show you, which help determine if the right indexes have been created.

So let's see how DMV's can help answer these questions.


Finding Missing Indexes

Here is the query we are going to use

 1  SELECT
 2      TableName = d.statement,
 3      d.equality_columns,
 4      d.inequality_columns,
 5      d.included_columns,
 6      s.user_scans,
 7      s.user_seeks,
 8      s.avg_total_user_cost,
 9      s.avg_user_impact,
10      AverageCostSavings = ROUND(s.avg_total_user_cost * (s.avg_user_impact/100.0), 3),
11      TotalCostSavings = ROUND(s.avg_total_user_cost * (s.avg_user_impact/100.0) * (s.user_seeks + s.user_scans),3)
12  FROM sys.dm_db_missing_index_groups g
13  INNER JOIN sys.dm_db_missing_index_group_stats s
14      ON s.group_handle = g.index_group_handle
15  INNER JOIN sys.dm_db_missing_index_details d
16      ON d.index_handle = g.index_handle
17  WHERE d.database_id = db_id()
18  ORDER BY TableName, TotalCostSavings DESC;

This query works by joining together the sys.dm_db_missing_index_groups, sys.dm_db_missing_index_group_stats, and sys.dm_db_missing_index_details views.  It also limits its results to just the current database by virtue of line 17.  If you wanted to get results database wide, you would simply remove this WHERE criteria.

These views give us information on the indexes SQL Server believes should be created.  This includes the table name (line 2 - no idea why this is called statement in the DMV), the columns that should be part of the index key (lines 3 and 4) and any possible include columns for the index (line 5).  From these columns, we can construct our create index statement.

We also get some statistics around why SQL Server believes the index should be created.  The user_scans and user_seeks values (lines 6 and 7) represent how many times this index could have been used in one of the operations had the index existed.  So if you see a high number, especially in the user_seeks columns, this means there are large numbers of statements executing that could benefit from this index.

The avg_total_user_cost column (line 8) gives us the average cost of statements that have been run against this table that would benefit from the index.  The avg_user_impact column (line 9) tells us the percent that SQL Server believes this cost would be reduced by creating this index.  From these two values, you can calculate the average cost savings per statement (line 10) and the total cost savings for all statement executions (line 11) to give you an idea of how significant the cost savings might be.

I would, however, encourage you not to simply go and create an index for every row in this result set.  What you want to do is look through the recommendations and look for patterns.  You will find a number of recommendations for each table, and many of these will be similar, so analyze them together and figure out which indexes make sense.  This may also mean modifying an existing index rather than creating a new index.  With this view, you can see all the recommendations that exist for a table and come up with the right set of indexes for the table.
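To make that concrete, suppose one row of the results showed equality_columns of [LastName], no inequality_columns, and included_columns of [FirstName] and [EnrollmentDate] for a Students table (a made-up example).  The corresponding index would look something like this:

CREATE NONCLUSTERED INDEX IX_Students_LastName
    ON dbo.Students (LastName)
    INCLUDE (FirstName, EnrollmentDate);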



Index Usage Statistics


Let's now tackle the problems of making sure our indexes are being used and identifying indexes that may need to be dropped.


 1  SELECT
 2      [DatabaseName] = DB_Name(db_id()),
 3      [TableName] = OBJECT_NAME(i.object_id),
 4      [IndexName] = i.name,
 5      [IndexType] = i.type_desc,
 6      [TotalUsage] = IsNull(user_seeks, 0) + IsNull(user_scans, 0) + IsNull(user_lookups, 0),
 7      [UserSeeks] = IsNull(user_seeks, 0),
 8      [UserScans] = IsNull(user_scans, 0),
 9      [UserLookups] = IsNull(user_lookups, 0),
10      [UserUpdates] = IsNull(user_updates, 0)
11  FROM sys.indexes i
12  INNER JOIN sys.objects o
13      ON i.object_id = o.object_id
14  LEFT OUTER JOIN sys.dm_db_index_usage_stats s
15      ON s.object_id = i.object_id
16      AND s.index_id = i.index_id
17  WHERE
18      (OBJECTPROPERTY(i.object_id, 'IsMsShipped') = 0)
19  ORDER BY [TableName], [IndexName];


This query uses the sys.dm_db_index_usage_stats view and joins it with the sys.indexes system view.  Again, we are only getting data for the current database, this time because the sys.objects and sys.indexes views only contain data for the current database.

We have some information about the index, including the table it is on, the index name and the index type.  In the index type column, two of the most common values you will see are CLUSTERED and NONCLUSTERED.  CLUSTERED refers to a clustered index, which is how SQL Server typically lays out the data for a table.  We really want to focus on NONCLUSTERED indexes, because these are indexes we have created on the table for some specific purpose, and we have control over the columns in those indexes and if they should exist or not.

There are three ways an index can be used:

  • user_seeks - SQL Server is traversing the b-tree structure of the index to perform a lookup on a key.  This is the operation we want to see.  SQL Server is using the index as intended.
  • user_scans - SQL Server is reading the entire index to find the value(s) it wants.  If this is a clustered index, that means SQL Server is reading the entire table.  If this is a nonclustered index, all of the index keys have to be read.  This is much more expensive than a seek operation.
  • user_lookups - These are lookup operations against the table, typically when SQL Server is looking up a row in the clustered index structure of a table.
The user_updates column (line 10) gives the maintenance cost of the index -- that is, how many times a DML statement has caused this index to be updated.

What you want to look for are indexes with very few or zero user_seeks.  These are indexes that you are paying to maintain, but for whatever reason SQL Server is not able to use.  And then you want to investigate why that index is not being used.  Maybe the index is a unique index and is there only to enforce a constraint.  Maybe something in the application has changed so an index is no longer used.  Maybe the columns are in the wrong order or the index is not selective enough.  You want to figure this out and then either A) modify the index so it can be used or B) drop the index.  These stats give you visibility into what indexes you need to look at.
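Building on the query above, here is one sketch of how you might surface the likely candidates -- nonclustered indexes with no seeks, ordered by how much you are paying to maintain them:

SELECT
    [TableName] = OBJECT_NAME(i.object_id),
    [IndexName] = i.name,
    [UserSeeks] = ISNULL(s.user_seeks, 0),
    [UserUpdates] = ISNULL(s.user_updates, 0)
FROM sys.indexes i
LEFT OUTER JOIN sys.dm_db_index_usage_stats s
    ON s.object_id = i.object_id
    AND s.index_id = i.index_id
    AND s.database_id = DB_ID()
WHERE i.type_desc = 'NONCLUSTERED'
    AND OBJECTPROPERTY(i.object_id, 'IsMsShipped') = 0
    AND ISNULL(s.user_seeks, 0) = 0
ORDER BY [UserUpdates] DESC;

-- Once you have investigated and are sure an index is safe to remove, dropping it looks like this (hypothetical name):
-- DROP INDEX IX_Students_LastName ON dbo.Students;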





What Statements are Running in my SQL Server Database

Sometimes you need to know what statements are running in your SQL Server database right now.  I've often had this need when someone shows up at my desk and says something like "Something is going on!  Your app is performing really slowly right now!"

In situations like this, there is a need to triage what is going on with the app.  Is it application code?  What about the web server, is the CPU spiked there?  What about the database?  Do we have some long running queries that are making things appear to be stuck?  Once we figure out what tier is responsible for the problem, we can drill down further to find and fix specifically what is going on.

To that end, one of the SQL Server queries I keep around is one that will query the SQL Server DMV's and report back all of the statements that are running right now in the database.  For any database that is in use, this query will always return some rows, because generally there is always some application process somewhere running statements.  But you can also use this query to look for any long running or blocked statements, so let's take a look at the query.


 1  SELECT
 2          [DatabaseName] = db_name(rq.database_id),
 3          s.session_id,
 4          rq.status,
 5          [SqlStatement] = SUBSTRING (qt.text,rq.statement_start_offset/2,
 6              (CASE WHEN rq.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX),
 7              qt.text)) * 2 ELSE rq.statement_end_offset END - rq.statement_start_offset)/2),
 8          [ClientHost] = s.host_name,
 9          [ClientProgram] = s.program_name,
10          [ClientProcessId] = s.host_process_id,
11          [SqlLoginUser] = s.login_name,
12          [DurationInSeconds] = datediff(s,rq.start_time,getdate()),
13          rq.start_time,
14          rq.cpu_time,
15          rq.logical_reads,
16          rq.writes,
17          [ParentStatement] = qt.text,
18          p.query_plan,
19          rq.wait_type,
20          [BlockingSessionId] = bs.session_id,
21          [BlockingHostname] = bs.host_name,
22          [BlockingProgram] = bs.program_name,
23          [BlockingClientProcessId] = bs.host_process_id,
24          [BlockingSql] = SUBSTRING (bt.text, brq.statement_start_offset/2,
25              (CASE WHEN brq.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX),
26              bt.text)) * 2 ELSE brq.statement_end_offset END - brq.statement_start_offset)/2)
27      FROM sys.dm_exec_sessions s
28      INNER JOIN sys.dm_exec_requests rq
29          ON s.session_id = rq.session_id
30      CROSS APPLY sys.dm_exec_sql_text(rq.sql_handle) as qt
31      OUTER APPLY sys.dm_exec_query_plan(rq.plan_handle) p
32      LEFT OUTER JOIN sys.dm_exec_sessions bs
33          ON rq.blocking_session_id = bs.session_id
34      LEFT OUTER JOIN sys.dm_exec_requests brq
35          ON rq.blocking_session_id = brq.session_id
36      OUTER APPLY sys.dm_exec_sql_text(brq.sql_handle) as bt
37      WHERE s.is_user_process = 1
38          AND s.session_id <> @@spid
39          AND rq.database_id = DB_ID()  -- Comment out to look at all databases
40      ORDER BY rq.start_time ASC;


This query is using the sys.dm_exec_sessions view in conjunction with the sys.dm_exec_requests view to get all of the current running statements in SQL Server.  I am restricting this to the current database (line 39), but you can easily comment out this line to look database wide.

With just these two views, we do have some useful information like what time this statement started executing (line 13), how long it has been running so far (line 12), how much CPU it has consumed so far (line 14) and how many logical reads it has performed so far (line 15).  If we see a statement that has been running for a long time and/or has already consumed a lot of CPU and IO resources, then this is cause for further investigation.

I also like to get some information about who is running a SQL statement that might be taking a long time, and that is what lines 8 through 11 do.  You would be amazed at how often someone logs into a production database from their desk and runs an ad-hoc query in the middle of the day from Management Studio or even Excel, with no idea that they are causing a performance impact on the database.  Columns like these help to pinpoint what process is running a statement that might be consuming a lot of resources.

Then, I'm pulling in some other useful information that we can look at in our result set.  Bringing in sys.dm_exec_sql_text (line 30) allows us to include the text of the running statement in the result set; of course, if a statement is taking a long time, you want to know what that statement is.  We also use sys.dm_exec_query_plan (line 31) to pull in the execution plan for the statement.  If something is taking a long time, you probably want to just be able to click through and see the execution plan, so this lets us do that.

Finally, sys.dm_exec_requests contains a column called blocking_session_id.  If this statement is currently blocked by another, this value will be non-zero.  By joining this value back to the sys.dm_exec_sessions and sys.dm_exec_requests views a second time (lines 32-36), you can get information about the blocker (lines 21-26), like the blocking SQL and the program running that SQL.  Sometimes, you may catch a statement that is blocked when you run this query, but if you run this query twice in a row and you see the same blocker, you know you have a problem.

All in all, this query gives you the ability to get a good pulse on what is happening in your database right now.  When you are in the middle of troubleshooting a performance problem, this is very useful, because at a glance, you can determine if the problem you are facing is in the database tier and if so, probably get a pretty good idea of what might be the cause.  So keep this query handy, and next time you need to know what is happening right now inside of SQL Server, you will be prepared.

Finding the Most Expensive, Longest Running Queries in SQL Server

One of the things we often want to know is which statements in our application are the worst performing, so we can fix those statements and improve our overall performance.  We can easily do this by querying the dynamic management views in SQL Server.

One of the great advantages of using the DMV's in this way is that we immediately get a global view of what our application is doing inside of SQL Server and where we might need to concentrate our performance tuning efforts.  The statement below gives us a good idea of what statements are being run from our application, how often, how long they take and how many resources they use on average.  That is a lot of information for the low cost of running a single query.

To run this query, you will need the VIEW SERVER STATE permission in SQL Server.  If you don't have this permission, ask your DBA to run this query for you and give you the results.

OK, let's look at the query.


 1  SELECT TOP 20
 2          DatabaseName = DB_NAME(CONVERT(int, epa.value)),
 3          [Execution count] = qs.execution_count,
 4          [CpuPerExecution] = total_worker_time / qs.execution_count,
 5          [TotalCPU] = total_worker_time,
 6          [IOPerExecution] = (total_logical_reads + total_logical_writes) / qs.execution_count,
 7          [TotalIO] = (total_logical_reads + total_logical_writes),
 8          [AverageElapsedTime] = total_elapsed_time / qs.execution_count,
 9          [AverageTimeBlocked] = (total_elapsed_time - total_worker_time) / qs.execution_count,
10          [AverageRowsReturned] = total_rows / qs.execution_count,
11          [Query Text] = SUBSTRING(qt.text,qs.statement_start_offset/2 +1,
12              (CASE WHEN qs.statement_end_offset = -1
13                  THEN LEN(CONVERT(nvarchar(max), qt.text)) * 2
14                  ELSE qs.statement_end_offset end - qs.statement_start_offset)
15              /2),
16          [Parent Query] = qt.text,
17          [Execution Plan] = p.query_plan,
18          [Creation Time] = qs.creation_time,
19          [Last Execution Time] = qs.last_execution_time
20      FROM sys.dm_exec_query_stats qs
21      CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as qt
22      OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) p
23      OUTER APPLY sys.dm_exec_plan_attributes(plan_handle) AS epa
24      WHERE epa.attribute = 'dbid'
25          AND epa.value = db_id()
26      ORDER BY [AverageElapsedTime] DESC; --Other column aliases can be used

This query selects the 20 statements in the current database that have the longest average elapsed time, that is, the 20 statements that on average took the longest to run.  

  • If you want more statements, then you can modify the "TOP 20" statement in line 1.  You could also just remove the "TOP 20" criteria to get all of the statements running in this database.
  • You can sort the list by different criteria.  CpuPerExecution and IOPerExecution are also good choices because these tell you what statements take the most resources to run.
  • If you have a batch process, you may get a few statements that were only run once or twice at the top of this list, because statements run in a batch process tend to be expensive by nature.  Just add a criteria in the WHERE clause to only show statements with an execution count > 10, 20 or whatever if you want to exclude these (see the sketch after these notes).
What units is the data in?
  • The elapsed time and CPU time numbers are in microseconds, but the MSDN page claims these values are only accurate to the millisecond
  • The number of IO operations is in number of pages
What should I look for?
  • Statements that take a long time to run on average.  For line of business applications, I want to understand anything taking over 500 ms.  But if a statement takes 400 ms and every other statement takes only 50 ms, then I want to look at the 400 ms statement and see what is going on.
  • Any statement that has a high amount of average CPU or IO usage relative to its peers.
  • Statements with high numbers of executions.  Is this legit?  Is the app firing off this statement more than necessary?  Is there a caching opportunity here?
The query above will return the execution plan for each statement in this list, so you can quickly click into the plan to see what is happening and decide if you want to take further action.

What if I want to see a different database on my SQL Server?
  • This query is set up to look at the current database the user is in (line 25 - epa.value = db_id()).  So just switch to the other database or comment out this line to see data for all statements across SQL Server.
How current is the data returned by this query?
  • The data for this query comes from SQL Server's plan cache, which generally holds onto execution plans for statements that have been executed in the last few hours.  It's a good idea to run this query a few different times throughout the day to capture any different loads that might be put on the database.  Most of the data will probably agree, but a few different runs will give you a more complete overall picture.
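Here is one sketch of those variations combined -- sorting by CPU per execution and filtering out statements that have only run a handful of times (the cutoff of 10 is arbitrary):

SELECT TOP 20
    [DatabaseName] = DB_NAME(CONVERT(int, epa.value)),
    [ExecutionCount] = qs.execution_count,
    [CpuPerExecution] = qs.total_worker_time / qs.execution_count,
    [IOPerExecution] = (qs.total_logical_reads + qs.total_logical_writes) / qs.execution_count,
    [QueryText] = qt.text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
OUTER APPLY sys.dm_exec_plan_attributes(qs.plan_handle) AS epa
WHERE epa.attribute = 'dbid'
    AND epa.value = db_id()
    AND qs.execution_count > 10   -- ignore one-off batch statements
ORDER BY [CpuPerExecution] DESC;  -- or [IOPerExecution] DESC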


In my opinion, this is probably the single most useful query of the DMV's because it immediately tells me if there are statements that aren't performing well and which ones they are.  If you have a performance issue in your database, this query will take you right to the heart of what is going on.


Sunday, February 28, 2016

Sample Database for "What Every Developer Should Know About SQL Server Performance"

To download the sample database for "What Every Developer Should Know About SQL Server Performance", use the link below that corresponds to your version of SQL Server.



Once you have the file downloaded, unzip the file.  Then, in SQL Server Management Studio, right click on the databases folder and select "Restore Database".  In the dialog box, choose "Device" and locate the "Students.bak" file you just unzipped.  Follow the prompts and the database will be restored to your system.

Using bcp to Import Data Instead

If you have trouble with the backup file, or just prefer to import your data directly, I've put together a zip file of the raw data exported out using the SQL Server bcp utility.  In this file, I have a readme.txt file that tells the exact steps to bring the data in, but basically, you will do the following:

  1. Go into Management Studio and create a new database called Students.
  2. Run the included Students-Schema.sql file in Management Studio to create all of the schema objects (tables)
  3. At a command prompt, run a series of bcp commands to bring the data into SQL Server.  In the file, there is a list of all the commands you need to run in the correct order.  
The link to the bcp file is here: Students Data as bcp Export


Either way, the result is the same: the data will get into your database and give you an opportunity to play around.