Sunday, July 24, 2016

What About Indexing SQL Server Tables That do Not Have a Primary Key or Cluster Key

I had a question in the comments section of my recent Pluralsight course, and the basic premise of the question was, "If your table does not have a primary key or cluster key, is it still worth it to create an index on the table?".  The answer to this question is absolutely, yes.  I want to take this opportunity to discuss this topic further though.

The Most Common Scenario - A Review
In my course, I describe a the most typical setup in SQL Server, such that when you create a table with a primary key, the primary key column(s) are used as the cluster key of the table, and the cluster key is what the data in the table is physically sorted by.  What this means is that the actual table data in SQL Server is stored in a B-tree structure something like what you see below.



In computer science, a B-tree structure can be searched very rapidly, as only a handful of comparisons are required to get you to the lowest level of the table (the leaf level) where the data is stored.  This is why looking up data in SQL Server by the primary key value is so fast, because by default, the primary key is also the cluster key, and then SQL Server can take advantage of this tree structure to rapidly find the associated row of data.

Typically though, you need to search for data in a table on some other field or fields than the primary key.  For example, you might not know a student's id number, but you know their first and last name.  Without an index, SQL Server would have to scan through all of the rows of the table, and this would not just take a lot of time, but also be very resource intensive in terms of system resources (CPU and disk IO).  So what we do is create an index on these columns since we commonly use them to search for data in the table.  This index uses the same B-tree structure, but now this tree structure is organized (sorted) by the columns in the index key -- last name and first name in our case.  And in the leaf nodes of the index (the bottom level), the index doesn't contain the data for the row, but instead the value of the primary key of the table.



So what SQL Server will do is first traverse the index, which again is very fast because we are traversing a tree structure, and find all of the index keys that match the input criteria (WHERE clause) specified.  Then, it will get the primary key values out of the index and go over to the table and look those values up by their primary key.  And as we said before, these lookups in the table are very fast because the table is stored in a tree structure organized by the primary key.

How We Got Here
What I did not mention in the course are some of the other scenarios that can occur.  There is always a dilemma when putting together a course about what to put in and what to leave out, and I chose not to cover the less common scenarios because I felt like it was more important to make sure the viewer had a good understanding of the most common scenario.  What I will do it go over those other scenarios here.

A Table With No Cluster Key
It is possible in SQL Server to create a table with no cluster key.  In this case, the rows of the table are stored in what is called a heap, which is just a way of saying that some space is allocated on disk and the rows of the table are stored in no particular order within that allocated area.  So now we don't have our rows organized in a nice tree structure, they are just stored in effectively random order in whatever pages on disk the table is using.

In this case, you can still (and should) create indexes on your table over the columns you search the table by.  In this case though, the record in the index doesn't contain the primary key value, because that wouldn't really help us since our data isn't organized by primary key any more.  Instead, it will contain a row pointer value that points tells SQL Server where it can find the corresponding row in the table.  This row pointer value will include the the page the where the data is stored as well as a row identifier so it can find the row in the page.  With this information, SQL Server can very quickly locate where the actual data for the row is, read it off of disk and return it to you.

A Table Key with a Cluster Key Different Than The Primary Key
It is also possible to create a table that has a cluster key that is not the primary key.  That is, in a fictional students table, my primary key is a column named StudentId, but I am going to make my cluster key on the columns LastName and FirstName.  This would mean that the data I store in my table would be physically organized (sorted) by the combination of LastName and FirstName.

You might think that would be a good idea, because after all, if I commonly search for students by their first and last name, then having the data organized like this would be making searching super fast, and I'd avoid the lookup step where you have to go look the actual row up by the student id after you have found the entry in the index.  But before you do that, you should read this superbly written article by Michelle Ufford.

Effective Clustered Indexes - https://www.simple-talk.com/sql/learn-sql-server/effective-clustered-indexes/

The problem is that as you insert data into the table, you are going to need to be inserting data in the middle of the table.  That will lead to page splits, which will over time degrade your performance.  Since our data is stored in a tree structure, it is much better if we are inserting new data at the end of the tree, not somewhere in the middle.

Further, we want our cluster key be unique, since SQL Server has to have some way to uniquely identify rows.  If you choose a cluster key that is not unique, SQL Server will have to add a unique identifier value to the cluster key such that it can uniquely identify the rows.

Back to our question though, and lets say you have a table that has a cluster key different than the primary key and further, the cluster key is not unique.  Should you still create indexes on the table?  Absolutely, because otherwise you will have to scan each and every row of the table to locate your data.  You will pay a small price because the cluster key is not unique, but have appropriate indexes will still be very beneficial to the overall performance of your application.

In Summary
The rule still holds that for a table of any size, you want to make sure that your SQL Statement is using an index whenever accessing the table.  The execution plan and resulting data paths might look a little different, but a well thought out index will still provide a major performance improvement for your SQL statements.