Saturday, April 15, 2017

JSON Functionality in SQL Server

SQL Server 2016 gives us the ability to work with JSON data directly in the database engine, which is very useful.  Traditionally, we've thought of relational databases and NoSQL databases as distinct entities, but with databases like SQL Server implementing JSON functionality directly in the engine, we can start to use relational and NoSQL concepts side by side, which opens up a lot of exciting possibilities.

There is enough to cover that I will split this topic into multiple blog posts.  This is post number one, where I will cover how to store JSON in your tables and how to interact with scalar values.  You can find the other parts of the series here (I'll update the list as I go):

  1. Accessing scalar properties on JSON objects stored in SQL Server (this post)
  2. Accessing array data in JSON objects stored in SQL Server


Also, if you want to recreate the database I have here and try the samples out for yourself, you can find everything you need on GitHub at:



So let's take a look at what we can do.

Storing JSON Data in a Table
SQL Server does not have a dedicated data type for JSON; you simply store your data in a VARCHAR (or NVARCHAR) column of the appropriate size.

Below is a table that I have defined that is going to hold some weather data.

CREATE TABLE WeatherDataJson
(
    ObservationId     INT IDENTITY(1,1)  NOT NULL,
    StationCode       VARCHAR(10)        NOT NULL,
    City              VARCHAR(30)        NOT NULL,
    State             VARCHAR(2)         NOT NULL,
    ObservationDate   DATETIME           NOT NULL,
    ObservationData   VARCHAR(4000)      NOT NULL,
    CONSTRAINT PK_WeatherDataJson
        PRIMARY KEY (ObservationId)
)

Our JSON will be stored in the column named ObservationData.  The data type for this column is simply VARCHAR(4000), which is sufficient to hold the JSON objects we'll be storing in it.  If you have larger objects, you can use a VARCHAR(MAX) column, and indeed we'll see an example of this later.
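
Also, since we're using a plain VARCHAR column, nothing stops someone from putting non-JSON text into it.  If you want the database to enforce that the column always contains valid JSON, you can add a CHECK constraint using the ISJSON function.  Here is a minimal sketch of what that could look like (the constraint name is just one I made up for illustration):

ALTER TABLE WeatherDataJson
    ADD CONSTRAINT CK_WeatherDataJson_ObservationData_IsJson
        CHECK (ISJSON(ObservationData) = 1);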

This is a sample of the JSON we'll be storing in this column.

{
  "stationCode": "04825",
  "location": {
    "city": "Appleton",
    "state": "WI",
    "stationName": "OUTAGAMIE CO RGNL AIRPORT"
  },
  "observationDate": "20160701",
  "observationTime": "1245",
  "observationDateTime": "2016-07-01T12:45:00",
  "skyCondition": "SCT045",
  "visibility": 10,
  "dryBulbFarenheit": 66,
  "dryBulbCelsius": 19,
  "wetBulbFarenheit": 55,
  "wetBulbCelsius": 12.6,
  "dewpointFarenheit": 45,
  "dewpointCelsius": 7,
  "relativeHumidity": 47,
  "windSpeed": 7,
  "windDirection": "360",
  "stationPressure": 29.13,
  "seaLevelPressure": null,
  "recordType": " ",
  "hourlyPrecip": null,
  "altimeter": 30.11
}

So we have a number of weather-related properties on the main object and a nested object for the location.  The data in this column may be for different cities, but it all has the same format, which sets us up for the next step: querying the data.
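
Getting data into the table is just an ordinary INSERT, with the JSON passed as a string.  Here is a minimal sketch with the JSON abbreviated; the GitHub repo mentioned above has what you need to load the full data set:

INSERT INTO WeatherDataJson (StationCode, City, State, ObservationDate, ObservationData)
    VALUES ('04825', 'Appleton', 'WI', '2016-07-01T12:45:00',
            '{ "stationCode": "04825", "location": { "city": "Appleton", "state": "WI" }, "dryBulbFarenheit": 66, "relativeHumidity": 47 }');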

Querying Scalar Values in JSON Data
First, let's look at what we get back if we run a plain old SQL query.  Here is our initial query:

SELECT * 
    FROM WeatherDataJson
    WHERE City = 'Appleton'
        AND State = 'WI'
        AND ObservationDate > '2016-07-01'
        AND ObservationDate < '2016-07-02'

And our results:

We can clearly see there is JSON stored in the ObservationData column.  But now we want to do something with it.  Let's pull out the temperature and humidity readings from each observation and make those columns in our result set.

To do this, we use a new function in SQL Server called JSON_VALUE.  JSON_VALUE takes two arguments:

  • A JSON expression, which is typically the name of a column that contains JSON text, but could also be a T-SQL variable containing JSON.
  • A path expression, which describes how to navigate to the scalar value you want to extract out of the JSON.

Path expressions start with a $ sign and then use dot notation to navigate to the property you want the value of.  So if I wanted to get the name of the station where the measurement was made in our JSON above, I would use the expression '$.location.stationName'.  To get the temperature and humidity, we'll use the expressions '$.dryBulbFarenheit' and '$.relativeHumidity' respectively.
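
If you want to see a path expression in isolation, here is a quick sketch against a T-SQL variable instead of a column (the variable name and the trimmed-down JSON are just for illustration):

DECLARE @observation NVARCHAR(MAX) =
    '{ "location": { "city": "Appleton", "stationName": "OUTAGAMIE CO RGNL AIRPORT" }, "dryBulbFarenheit": 66 }';

SELECT
    JSON_VALUE(@observation, '$.location.stationName') AS StationName,   -- OUTAGAMIE CO RGNL AIRPORT
    JSON_VALUE(@observation, '$.dryBulbFarenheit')     AS Temperature;   -- 66 (returned as a string)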

So now, we are going to rewrite our query to extract these fields out as columns like this.

SELECT 
        ObservationId,
        StationCode,
        City,
        State,
        ObservationDate,
        JSON_VALUE(ObservationData, '$.dryBulbFarenheit') As Temperature,
        JSON_VALUE(ObservationData, '$.relativeHumidity') As Humidity,
        ObservationData
    FROM WeatherDataJson
    WHERE City = 'Appleton'
        AND State = 'WI'
        AND ObservationDate > '2016-07-01'
        AND ObservationDate < '2016-07-02';

And here are our results.

We see that SQL Server used the JSON_VALUE function to dynamically pull those values out of our JSON so they appear as columns in our result set.  And now we can interact with them like we would any other column in a table.  So we can keep this data stored as JSON, but interact with it in a relational manner when that makes sense, which is pretty cool.  Traditionally, we would have needed an ETL process to parse this data out and place each element in a column, but now we can do that on the fly.
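
One thing to keep in mind is that JSON_VALUE returns its result as a string (nvarchar(4000)), so if you want to treat a value as a number, you'll want to cast or convert it.  For example, a quick sketch of averaging the temperature per city might look like this:

SELECT 
        City,
        AVG(CAST(JSON_VALUE(ObservationData, '$.dryBulbFarenheit') AS FLOAT)) AS AvgTemperature
    FROM WeatherDataJson
    GROUP BY City;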

The Power of Virtual (Computed) Columns
With our weather data, it would be pretty cool if I didn't have to write that JSON_VALUE syntax each time I queried the table, but could just get a couple of these key fields, like temperature and humidity, directly.  By combining the JSON_VALUE function with SQL Server's computed columns feature, we can create a set of virtual columns on our table that do just that, pulling the data out of the JSON each time so it appears as just another column on the table.  Here is what the syntax looks like.

ALTER TABLE WeatherDataJson
    ADD Temperature AS JSON_VALUE(ObservationData, '$.dryBulbFarenheit');

ALTER TABLE WeatherDataJson
    ADD Humidity AS JSON_VALUE(ObservationData, '$.relativeHumidity');

Now, if we run our original SELECT * query against the table, we will see that these two columns just show up in our result set like any other columns.  And of course, we can also refer to them by name, whether in a SELECT clause, a WHERE clause, or a JOIN condition.  They function just like any other column, even though behind the scenes SQL Server is parsing the JSON and retrieving the value for us in real time.
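
For example, a query against the new columns might look like the sketch below; note the CAST on Temperature, since the value JSON_VALUE pulls out is still a string:

SELECT 
        ObservationId,
        ObservationDate,
        Temperature,
        Humidity
    FROM WeatherDataJson
    WHERE City = 'Appleton'
        AND CAST(Temperature AS INT) >= 80;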

That is an important point if we want to use these columns in a WHERE clause or a JOIN condition.  We know that for a table of any size, if we query by a column that is not indexed, performance is going to be very slow because SQL Server has to read every row of the table from disk to find the rows we are looking for.  Here, our results would likely be even worse if we filtered by one of these JSON virtual columns, because not only does SQL Server have to scan the whole table, it also has to parse the JSON for each row.  Not good!

Indexing Computed Columns
However, SQL Server allows you to create an index over a computed (virtual) column.  When you do this, the values of the computed column are computed at index creation time and then persisted in the index, so they can be searched quickly like the values of any other column.  In this way, by using the JSON_VALUE function in conjunction with a computed column and an index over that computed column, we can quickly search values in our JSON data just like we would any other column in our table.  Let's look at an example.

Imagine we redefined our table from above to be about as simple as it could be: a surrogate key, ObservationId, for the primary key and a single column to hold our JSON data, so that our CREATE TABLE statement now looks like this.

CREATE TABLE BasicWeatherDataJson
(
    ObservationId     INT IDENTITY(1,1)  NOT NULL,
    ObservationData   VARCHAR(4000)      NOT NULL,
    CONSTRAINT PK_BasicWeatherDataJson
        PRIMARY KEY (ObservationId)
)

This is about as simple as storing data can get.  All of our data is inside the ObservationData column, including the fields we will probably want to look the data up by later, things like the observation time, the city, the state, and the station code.  We could write a query like this...


SELECT 
        ObservationId,
        JSON_VALUE(ObservationData, '$.stationCode') As StationCode,
        JSON_VALUE(ObservationData, '$.location.city') As City,  
        JSON_VALUE(ObservationData, '$.location.state') As State,
        CONVERT(datetime2(3), JSON_VALUE(ObservationData, '$.observationDateTime'), 126) As ObservationDate,  
        JSON_VALUE(ObservationData, '$.dryBulbFarenheit') As Temperature,
        JSON_VALUE(ObservationData, '$.relativeHumidity') As Humidity
    FROM BasicWeatherDataJson
    WHERE
        JSON_VALUE(ObservationData, '$.location.city') = 'Appleton'
        AND JSON_VALUE(ObservationData, '$.location.state') = 'WI'
        AND CONVERT(datetime2(3), JSON_VALUE(ObservationData, '$.observationDateTime'), 126) > '2016-07-01'
        AND CONVERT(datetime2(3), JSON_VALUE(ObservationData, '$.observationDateTime'), 126) < '2016-07-02'

Not exactly the paragon of simplicity.  Further, as the execution plan confirms, this query is doing a full table scan to find the data, so it is both slow and resource intensive.
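
If you want to see the cost for yourself, one option (besides looking at the graphical execution plan) is to turn on STATISTICS IO and STATISTICS TIME before running the query above, then note the logical reads and elapsed time reported in the Messages tab; we can compare those numbers again after adding the indexes below.

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- (re-run the SELECT above and check the Messages tab for the logical
--  reads and CPU/elapsed time reported against BasicWeatherDataJson)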


So first we are going to create some virtual columns to, one, make the table easier to work with, and two, give us something to index in the next step.

ALTER TABLE BasicWeatherDataJson
    ADD StationCode AS JSON_VALUE(ObservationData, '$.stationCode');

ALTER TABLE BasicWeatherDataJson
    ADD City AS JSON_VALUE(ObservationData, '$.location.city');
     
ALTER TABLE BasicWeatherDataJson
    ADD State AS JSON_VALUE(ObservationData, '$.location.state');

ALTER TABLE BasicWeatherDataJson
    ADD ObservationDate AS CONVERT(datetime2(3), JSON_VALUE(ObservationData, '$.observationDateTime'), 126);

ALTER TABLE BasicWeatherDataJson
    ADD Temperature AS JSON_VALUE(ObservationData, '$.dryBulbFarenheit');

ALTER TABLE BasicWeatherDataJson
    ADD Humidity AS JSON_VALUE(ObservationData, '$.relativeHumidity');

And now for the indexes.

CREATE INDEX IX_BasicWeatherJson_ObservationDate_StationCode
    ON BasicWeatherDataJson (ObservationDate, StationCode);

CREATE INDEX IX_BasicWeatherJson_ObservationDate_State_City
    ON BasicWeatherDataJson (ObservationDate, State, City);

When you create these indexes, you will get a warning that you may be exceeding the maximum non-clustered index key size, because theoretically you could be pulling a very large string out of the JSON.  If you have larger strings, you want to make sure they will fit under the 1700-byte limit for a non-clustered index key; in our case, we are well below that, so we are fine to continue.
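
If you want to be explicit about the key size (and avoid the warning), one option is to wrap the JSON_VALUE call in a CAST when you define the computed column, so the column has a fixed, known width.  A sketch of that approach (the column name here is just illustrative, since we already defined City above):

ALTER TABLE BasicWeatherDataJson
    ADD CitySized AS CAST(JSON_VALUE(ObservationData, '$.location.city') AS VARCHAR(30));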

So now, we can rewrite our query so it looks like this:

SELECT 
        ObservationId,
        StationCode,
        City,  
        State,
        ObservationDate,  
        Temperature,
        Humidity
    FROM BasicWeatherDataJson
    WHERE
        City = 'Appleton'
        AND State = 'WI'
        AND ObservationDate > '2016-07-01'
        AND ObservationDate < '2016-07-02';

This is much simpler and performs much better, as the query now uses one of our indexes.  So again, it's pretty cool that not only can we reach into the JSON to grab values out, we can also index any key properties we have so we can quickly search through our data.

Summary
So there are the basics of storing JSON in SQL Server and interacting with scalar values.  In part 2, we'll take a look at interacting with arrays that may be stored in your JSON data.
