https://blog.crunchydata.com/blog/postgres-indexes-for-newbies

If you’ve read Crunchy blogs recently you probably noticed by now that we’re all big fans of indexing. Indexing is key to optimizing your database workloads and reducing query times. Postgres now supports quite a few types of indexes and knowing the basics is a key part of working with Postgres. 

The role of database indexes is similar to the index section at the back of a book. A database index stores information on where a data row is located in a table so the database doesn’t have to scan the entire table for information. When the database has a query to retrieve, it goes to the index first and then uses that information to retrieve the requested data.

Indexes are their own data structures and they’re part of the Postgres data definition language (the DDL). They’re stored on disk along with data tables and other objects. 

  • B-tree indexes are the most common type of index and would be the default if you create an index and don’t specify the type. B-tree indexes are great for general purpose indexing on information you frequently query. 
  • BRIN indexes are block range indexes, specially targeted at very large datasets where the data you’re searching is in blocks, like timestamps and date ranges. They are known to be very performant and space efficient.
  • GIST indexes build a search tree inside your database and are most often used for spatial databases and full-text search use cases. 
  • GIN indexes are useful when you have multiple values in a single column which is very common when you’re storing array or json data. 

I did all my testing on Crunchy Bridge with a hobby instance, which is very nice for this kind of quick data load and testing work. I have some samples available alongside this post if you want to follow along with the data I used. You can also use Crunchy’s learning portal to do an indexing tutorial

Using Explain Analyze

You almost never talk about Postgres indexing without referring to the Explain feature. This is just one of those Postgres Swiss Army knife tools that you need to have in your pocket at all times. Explain analyze will give you information like query plan, execution time, and other useful info for any query. So as you’re working with indexes, you’ll be checking the indexes using explain analyze to review the query path and query time. 

You’ll see that the query plan indicates a “Seq Scan,” or a sequential scan. This means that it scans each data row in the table to see if it matches the query condition. You might be able to guess that for larger tables, a sequential scan could take up quite a bit of time so that’s where the index saves your database workload. 

Seq Scan on weather (cost=0.00..168.00 rows=496 width=102) (actual time=0.011..0.181 rows=100 loops=1)

If you’re using an index, you’ll see an index scan in your Explain results. 

Bitmap Index Scan on idx_weather_type  (cost=0.00..8.00 rows=496 width=0) (actual time=0.027..0.027 rows=496 loops=1

Analyze Database

The ANALYZE command collects information about the tables or database which are used by the internal query planner. There are some cases where you might want to run this before or after adding an index so that the database uses the most up to date query plan. In my testing, this seems to affect larger databases. 

B-Tree Sample

For the B-Tree samples I’ve used some open weather data with data events by type, damage, time and location1. For just a very basic index, I’m going to find everything for winter storms. Once I add this index, this means that to get this data, the database doesn’t have to scan all the weather events to get me additional data about severe weather events, it already knows where to look for those. 

Starting query 

SELECT * FROM weather where event_type='Winter Storm'

Before index explain analyze

Seq Scan on weather  (cost=0.00..9204.64 rows=3158 width=853) (actual time=0.008..27.619 rows=3182 loops=1)
Execution Time: 27.778 ms

Index

CREATE INDEX idx_weather_type ON weather(event_type);

After index explain analyze

Bitmap Index Scan on idx_weather_type  (cost=0.00..35.98 rows=3158 width=0) (actual time=0.247..0.247 rows=3182 loops=1)
Execution Time: 3.005 ms

Look at that drop in query time, cool! 

Multicolumn B-Tree Indexes

Indexes aren’t always created for single columns only – Postgres also supports multicolumn indexes. These can be useful if you know that you’ll be querying a lot on multiple columns at once.

Starting query

SELECT * FROM weather WHERE event_type='Winter Storm'   AND damage_crops > '0'

Before index explain analyze

Seq Scan on weather  (cost=0.00..9402.36 rows=2586 width=853) (actual time=0.007..67.365 rows=2896 loops=1)
Execution Time: 67.499 ms

Multi-column Index for severity and type

CREATE INDEX idx_storm_crop ON weather(event_type,damage_crops);

After Index Explain Analyze

Bitmap Index Scan on idx_storm_crop  (cost=0.00..38.15 rows=2586 width=0) (actual time=0.339..0.339 rows=2896 loops=1)
Execution Time: 2.204 ms

Reduction in query time again, yay!

Oh …. And if you get lost with all your creating indexes and need to see what you’ve got, this will show you all indexes on a particular table:

SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'weatherevents';

Drop index indexname will delete it if you want to test again. 

BRIN Index Sample

BRIN is often very helpful when using large datasets and specifically something with time series or time stamped data. I used a sample from an IoT dataset2 for this with thousands of data rows per day. 

Starting query

SELECT device, humidity FROM iot WHERE ts between '2020-07-13 0:00' AND '2020-07-14 0:00'

Explain Analyze before

Parallel Seq Scan on iot  (cost=0.00..10363.95 rows=896 width=28) (actual time=12.710..42.080 rows=16707 loops=3)
Execution Time: 67.851 ms

Index

CREATE INDEX iot_time ON iot USING brin(ts);

Explain Analyze after

Bitmap Index Scan on iot_time  (cost=0.00..12.26 rows=54025 width=0) (actual time=0.046..0.047 rows=10240 loops=1)
Execution Time: 10.513 ms

You will often hear that BRIN indexes are very space efficient. So as you’re working through indexes, you might want to query the size of the actual index. For example:

pg_size_pretty(pg_relation_size('iot_time'));

Basic Spatial Index with GIST

If you’re using spatial data in your database, you probably have a lot of data and indexing can be crucial. For spatial indexing I used data and examples from the PostGIS tutorial3. If you’re just starting to work with spatial data, I highly recommend this tutorial in general. 

SQL query

SELECT count(blocks.blkid)
FROM nyc_census_blocks blocks
JOIN nyc_subway_stations subways
ON ST_Contains(blocks.geom, subways.geom)
 WHERE subways.name LIKE 'B%';

Beginning explain analyze

Timing: Generation 4.364 ms, Inlining 360.628 ms, Optimization 615.663 ms, Emission 559.573 ms, Total 1540.227 ms
Execution Time: 1467.916 ms

Sample spatial GIST index

CREATE INDEX nyc_census_blocks_geom_idx
  ON nyc_census_blocks
  USING GIST (geom);

After indexing explain analyze

Execution Time: 7.575 ms

Wowza! You see what I mean? Spatial indexes can have a huge impact. 

GIN Index For JSON

JSON data has quite a wide adoption for Postgres users and the Postgres core project has embraced the adoption of the JSON data types with extensive features. There’s some indexing types that can really help if you’ve got data that’s listing several objects inside each field. The GIN index type is commonly used for this. For this example I used a json file from NASA that has meteor location information4

SQL query

SELECT data -> 'name' as name FROM meteors WHERE data @> '{"mass": "100"}';

Explain analyze before indexing

Parallel Seq Scan on meteors  (cost=0.00..23926.28 rows=4245 width=32) (actual time=0.065..114.114 rows=1024 loops=3)
Execution Time: 123.698 ms

Index 

CREATE INDEX gin_test ON meteors USING gin(data)

Explain analyze after indexing

Bitmap Index Scan on gin_test  (cost=0.00..116.40 rows=10187 width=0) (actual time=12.164..12.164 rows=3072 loops=1)
Execution Time: 22.017 ms

Finding the Right Fit With Indexes

You shouldn’t create an index on the fly as you’re about to run a one-off query. A good index plan requires planning and testing. Indexes are stored on disk and so they also take up space so that’s a consideration as well. For each new data row inserted or existing data row updated, index entries are updated automatically by the database. Indexes can absolutely also have an impact on the performance of database write operations, so be sure to research some of that. Just like Craig said recently, everyone needs to find their Goldilocks of indexes: not too big, not too small but just right. 

Footnotes:

1 B-Tree Index Data Sample. https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/, file StormEvents_details-ftp_v1.0_d2011_c20220107.csv.gz.

2 BRIN Index Data Sample. https://www.kaggle.com/garystafford/environmental-sensor-data-132k

3 PostGIS Data Sample. http://s3.cleverelephant.ca/postgis-workshop-2020.zip.

4 JSON Data Sample. https://data.nasa.gov/resource/y77d-th95.json. 

By admin

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.