https://www.flydata.com/blog/amazon-redshift-distkey-and-sortkey/
Amazon Redshift’s DISTKEY and SORTKEY are a powerful set of tools for optimizing query performance. Because Redshift is a columnar database with compressed storage, it doesn’t use indexes the way a transactional database such as MySQL or PostgreSQL would. Instead, it uses DISTKEYs and SORTKEYs.
Choosing the columns to use as your DISTKEY and SORTKEY is not as straightforward as you might think. In fact, a DISTKEY/SORTKEY that is not well thought out can even worsen your query performance. In this series of articles, I’d like to show you a few interesting examples of how the Amazon Redshift DISTKEY and SORTKEY affect query performance.
# Know Thy Data
In this example, I use a series of tables called system_errors#, where # is a sequence number. Each record represents an error that occurred on a system, with (1) its timestamp and (2) an error code. Each table has 282 million rows in it (lots of errors!). Here is the query I want to optimize; it counts the number of errors per error type, for each time slice.
SELECT err_code, created_at, count(*) FROM system_errors1 GROUP BY created_at, err_code;
Before setting the DISTKEY, let’s create a table without it, and see how the query performs.
CREATE TABLE system_errors1 (
err_code INTEGER,
created_at timestamp
);
On my Redshift cluster (2-node dc1.large), the query took 20.52 seconds to execute. That isn’t too bad, considering the number of rows in the table, but both compute nodes reached around 30% CPU utilization. Let’s see how we can improve this by investigating the query’s performance.
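If you prefer not to use the console, you can also pull recent query timings from Redshift’s system tables. This is a sketch, assuming you have access to STL_QUERY and that the query text matching below finds the statement you just ran:

```sql
-- Recent queries touching system_errors1, with elapsed time in seconds.
-- STL_QUERY records start and end timestamps for completed queries.
SELECT query,
       substring(querytxt, 1, 60)                     AS query_text,
       datediff(ms, starttime, endtime) / 1000.0      AS seconds
FROM   stl_query
WHERE  querytxt ILIKE '%system_errors1%'
ORDER  BY starttime DESC
LIMIT  5;
```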
# Investigating The Query
Let’s check the query performance by checking the Amazon Redshift Console. Thankfully, it offers useful graphs and metrics to analyze query performance. Below is what the “Query Execution Details” for the query looked like.
Look at the warning sign! Something must have been wrong. Let’s see the details.
This warning occurred because the rows to be aggregated (rows sharing the same err_code and created_at values) were spread across multiple compute nodes. Each node must aggregate its own rows first; then the leader node has to aggregate those results again. That’s why you see two “Aggregate” steps in the above screenshot. More importantly, a large amount of data was sent to the leader node across the network, which became the performance bottleneck. We can avoid this by putting all rows sharing the same err_code and created_at values on a single node. This can be done by defining the DISTKEY.
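You can see the same two-stage aggregation in the query plan itself. Running EXPLAIN against the query (the exact operator names vary by cluster and Redshift version) should show an aggregate on the compute nodes, a network step, and then another aggregate:

```sql
-- Inspect the plan: look for two aggregate steps with a
-- network (redistribution/send-to-leader) step between them.
EXPLAIN
SELECT err_code, created_at, count(*)
FROM   system_errors1
GROUP  BY created_at, err_code;
```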
# Adding DISTKEY and SORTKEY
To collocate all relevant rows on a single node, we can use either the column err_code or created_at as the DISTKEY. Since I also want to run a query grouped by err_code without created_at, I chose err_code as both DISTKEY and SORTKEY.
CREATE TABLE system_errors2 (
err_code INTEGER,
created_at timestamp
) DISTKEY(err_code) SORTKEY(err_code);
Let’s see how the query performs against the new table.
SELECT err_code, created_at, count(*) FROM system_errors2 GROUP BY created_at, err_code;
Shockingly, this query took 54.9 seconds! That’s nearly 2.7 times slower than the first query against the table with no DISTKEY/SORTKEY. The new table puts all rows for an error code on the same node and stores them next to each other. Why would a query against this table be even slower than one against the table with no DISTKEY/SORTKEY?
# Solving The Puzzle
We created two tables with and without DISTKEY and found that the one with the DISTKEY was much slower than the other.
no DISTKEY -> 20.52 seconds
err_code as DISTKEY -> 54.9 seconds
Why did this happen? Let’s check the query’s execution details.
You’ll notice the long red lines. This means the slowest node took significantly longer than the average processing time; in this case, about four times longer than average. The slowest node must have had more rows than the other nodes. Let’s run the following query and see how many rows each err_code has.
SELECT err_code, count(*) FROM system_errors2 GROUP BY err_code;
err_code | count
----------+-----------
1210 | 22336
10010 | 252160
1205 | 517440
1207 | 1368576
1208 | 384
10009 | 1825856
1201 | 2162176
1206 | 1229824
1213 | 8960
10006 | 4960640
10007 | 67840
1204 | 269341760
1209 | 95488
10003 | 281856
(14 rows)
You can see that one of the error codes (1204) has an extremely large number of rows compared to the others. It actually holds 95.5% of the rows in the table. Because we used err_code as the DISTKEY, at least 95.5% of the rows were put on one particular node. This is so-called skew. When such skew occurs, total query processing time takes much longer, because performance is capped by the slowest node; the work cannot be spread evenly across the cluster. In this (extreme) case, almost all the rows were processed by a single node. That is why the query took longer than the one against the table without a DISTKEY.
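Redshift can also report skew directly. The system view SVV_TABLE_INFO exposes a skew_rows column (roughly the ratio of rows on the most-populated slice to the least-populated one); values far above 1 suggest a poor DISTKEY choice:

```sql
-- Check distribution skew for our test tables.
-- skew_rows close to 1 means rows are spread evenly across slices.
SELECT "table", diststyle, skew_rows
FROM   svv_table_info
WHERE  "table" LIKE 'system_errors%'
ORDER  BY skew_rows DESC;
```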
# Trying a Different DISTKEY and SORTKEY
Since the values of err_code were too skewed to use as a DISTKEY, let’s use the other column, created_at, instead.
CREATE TABLE system_errors3 (
err_code INTEGER,
created_at timestamp
) DISTKEY(created_at) SORTKEY(created_at);
The same query now takes only 8.32 seconds to return, more than 6 times faster than the previous query and more than twice as fast as our very first one. CPU utilization is also much better: about 10%, versus 30% previously.
Query execution details look good as well. Skew is minimal, and also there is no warning sign for a large data transfer across the network.
If you look at the details of the Hash Aggregation, you’ll notice the steps are much simpler than in our very first query: the double “Aggregate” is gone!
# Summary
- Pick a few important queries you want to optimize your tables for. Unfortunately, you can’t optimize a table for all queries.
- To avoid a large data transfer over the network, define a DISTKEY.
- From the columns used in your queries, choose the one that causes the least skew as the DISTKEY. A column with many distinct values, such as a timestamp, is a good first choice. Avoid columns with few distinct values, such as credit card types or days of the week.
- Even though it will almost never be the fastest, a table with no DISTKEY/SORTKEY is a decent all-around option. It’s reasonable to leave the DISTKEY and SORTKEY undefined until you really understand the nature of your data and queries.
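A quick way to vet candidate DISTKEY columns before committing to a table design is to compare their cardinality and the weight of their most common value. This is a sketch against the example tables above:

```sql
-- Cardinality of each candidate DISTKEY column:
-- more distinct values generally means more even distribution.
SELECT count(DISTINCT err_code)   AS distinct_err_codes,
       count(DISTINCT created_at) AS distinct_timestamps
FROM   system_errors1;

-- Fraction of the table held by the single most common err_code.
-- A value near 1.0 (as with err_code 1204 here) signals severe skew.
SELECT err_code,
       count(*)                                   AS cnt,
       count(*)::float / sum(count(*)) OVER ()    AS fraction
FROM   system_errors1
GROUP  BY err_code
ORDER  BY cnt DESC
LIMIT  1;
```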