Vladimir Mihajlovski Software Developer
Date: November 18, 2015 Tags: indexing, postgres, postgresql SUBSCRIBE
ENGINEERING

Indexing and Querying Postgres Hstore

Hstore is a really useful datatype in Postgres because you can easily extend your schema, and store different data. But, you might get to a point where you’ll have to query for specific keys. That’s what happened to us.

The Use Case

We have an entity in our database which contains an hstore column, to store some specifics. At a certain point during our development, we decided we want to query the hstore column, and, of course, we wanted it to be fast. We have a limited amount of keys, several of them, relatively static. The table itself contained a dozen million entries, where some keys were contained for many entries, and some for a smaller percent. The values were unique between themselves for a certain key. We are using PostgreSQL 9.2 and all the examples will be run on this version.

Preparing Test Data

So, let’s prepare the test by first installing the hstore extension into our database:

Now we can insert some data into it:

Let’s say we have three keys: key1key2 and key3 to be used in our hstore. We’ll insert 1 million entries with all three keys, 1 million with keys 1 and 2, 2 millions with key 1, 1.5 million with key 2, and 500 thousand entries with key3. Four million entries will have no values in hstore.

Testing

So, here are now a few timings for some quries:

As can be read, over 2 seconds isn’t something regarded as fast when talking about fetching a single row from a database, and sequential scan is performed, so we should add an index. We’ll test on key1 only. Create the index and check the query again.

Now the fetching of an entity is much faster, since it is using the index we created. Counting of entities containing key1 is still slow, and it uses sequential scan, but it’s not our use case. But when we are talking about indices we should also consider their spatial efficiency. Key1 is the most common one, and only 40% of entities contain it. With the current index, we have 60% of unnecessary entries in the index. So, we’ll add the condition used for checking the number of entities containing key1.

Well, we decreased the runtime for the first query. On the other hand, what’s really bad is that our main query for fetching the entity is once again not using the index. This is the point where we do some digging in our index knowledge, and find that to use the index, query must use the condition set in the index. So, let’s try a different query.

This is something that works. Now that we have choosen our index, and our query, we should add all indices to our database. This is one of the possible weakpoints to this solution, we have to statically create indices over all used keys, but this works in our use case, since we have only several of them, and they are static.

We didn’t take index maintanence into consideration, since our ration of reading towards writing allows it, but depending on your use case you might want to test that as well.

Conclusion

An approach to querying an hstore column over a limited and static set of keys can be solved using conditional indices. That’s what worked for us. There are other ways to index an hstore, of course, and you should try it with your use case. But, always measure, and always measure according to your specific use case. When creating an index, model it together with the query you are using.

SUBSCRIBE

CATEGORIES