Latest news about Bitcoin and all cryptocurrencies. Your daily crypto news habit.
Building Bubble. In this series of posts, our engineering team talks about the inner workings of Bubble, the cloud-based visual programming language thatâs making programming accessible to everyone.
One of the big challenges we have at Bubble is transforming user-written queries into high-performing SQL. Users can define their own data schemas, and write custom searches to filter and sort their results. The searches are often embedded in our userâs application logic, so they need to be lightning fast and respect our transaction semantics.
To enable this, we transform the queries into SQL and run them on our Postgres backend database. Making this both correct and performant is a big challenge. This post is about an interesting problem we recently came across with some slow queries that we were trying to speed up, and how we got around it.
(Iâm going to assume that youâre familiar with using indexes to speed up searches in a database, at least at a general level. If not, this is a great primer: Use The Index, Luke).
We donât know in advance how Bubble users plan to search the data they create. So our strategy for making searches fast is to automatically build indexes on the fly as we detect patterns. If we see a slow search happen a few times, we try to build an index to speed it up.
Indexes, however, are expensive. Each index on a table increases the amount of disk space the table consumes, and worse, adds overhead to every write to the table. Because every index needs to be updated, inserting a new row in a table with a hundred indexes is like doing a hundred writes instead of just one write!
So, we want to make sure every index we build pulls its weight. We want indexes that can be used for a bunch of different searches, not just one search. Our go-to index type is the single column b-tree index. B-tree indexes are Postgresâs default, and they can be used for equality matching, inequality matching, and sorting: theyâre the all purpose workhorses of the index world.
Lately we noticed something weird, however. Custom-sorted searches on large tables with nothing in the WHERE clause were performing really poorly, even with a b-tree index on the column we were sorting on. It was sometimes taking multiple seconds to run a search that only returned 5â10Â results.
In a sense, this isnât too surprising. Sorting is a really expensive operation in databases, because even if you only want to return ten results, you still need to examine every row to figure out which ones are the first ten. So if there are a million rows in your table, getting ten results in sorted order can take forever.
However, indexes provide a way of dealing with this. Letâs say we have a user thatâs building a game, and has a table of scores players have achieved in the game:
CREATE TABLE scores ( uid bigserial PRIMARY KEY, player_id bigint, timestamp bigint, score int);
And the user wants to show the top 10 scores, along with who got them and what day they happened, which we could do with this query:
SELECT * FROM scores ORDER BY score DESC LIMIT 10;
Without an index, Postgres would need to do a full table scan, reading every single score in order to figure out the highest 10Â scores.
However, with a b-tree index on the score column, Postgres can do an index scan: it can read through the score index from highest to lowest until it finds 10 results. Since thereâs no other restrictions on our search, the first ten entries in the index are the rows that Postgres will return, making this query very fast.
So far, so good. But in practice, our query is actually a little more complicated than that. For one thing, Bubble apps keep data up-to-date in real time in the client. This means we need our results to always be returned in a consistent order, because if weâre refreshing the search and the order changes, we donât want things to jump around randomly on the page. So we need a way of breaking ties in a consistent manner. We use our uid column for that, since itâs guaranteed to be unique.
Our user might want to break ties on their end, too. For instance, in a high scores chart, you probably want to break ties by ranking the first person to achieve a score higher. So our actual sort condition might be:
SELECT * FROM scores ORDER BY score DESC, timestamp, uid LIMIT 10;
We also need to worry about rows with null scores. In a table of game scores, we might not have any entries that are missing a score. But remember, we donât know what our user is actually trying to build. Maybe there are situations where nulls do make sense: either way, we need to be prepared for them. Bubble always sorts nulls at the end of searches, because we think itâs the most intuitive way of handling them. Postgres sorts nulls as higher than everything else: they are at the end in ascending searches, but at the beginning of descending searches. So, we need to override that to get the behavior we want:
SELECT * FROM scores ORDER BY score DESC NULLS LAST, timestamp, uid LIMIT 10;
So, the moment of truth. We build our trusty b-tree index:
CREATE INDEX ON scores (score);
⊠add a million scores to the table:
INSERT INTO scores (player_id, timestamp, score) VALUES (17, 1527875748779, 235);⊠and so on âŠ
And run our search!
Quiz: What happens?
Answer: Postgres scans the entire million row table đđđ
The reason is because Postgres is smart, but not that smart. It knows it can read a b-tree index to speed up a sort operation, and it knows how to read an index both forwards and backwards for ascending and descending searches. But for it to use an index to speed up the sort, it needs to know that the order of entries in the index will always exactly satisfy the order requested by the sort.
Our index may or may not exactly match the sort order. For one thing, weâre asking for nulls to sort lower than numbers, but the index has them stored as higher than numbers. The other possible reason things might not match are ties: if two entries have the same score, we want to break ties first by timestamp, and then by uid. The index, however, is single-column: if two entries are tied on score, Postgres doesnât guarantee what order they are stored in.
From a humanâs point of view, these are small differences. The index is basically in the right order already, and just needs some minor adjustments to give us exactly what we want. From Postgresâs point of view, though, these differences are total showstoppers. It doesnât have code to make those adjustments on the fly, so canât use the index at all. So close⊠and yet, so far.
One solution to this would be to build an index that does match the search exactly. For instance:
CREATE INDEX ON scores (score DESC NULLS LAST, timestamp, uid);
But this is an ugly, sad solution. Our users often sort the same table tons of different ways. Do we really want to build an index for every possible combination of sort orderings, and pay the performance penalty for each one?
Meanwhile, we have a perfectly lovely index on the score column, which we probably need to create anyways for efficient comparisons on scores, just sitting there, almost, but not quite, good enough.
Postgres might not be smart enough out of the box to use our score index, but thereâs no reason we canât be smarter than our database.
We know the following facts:
- Weâre going to sort nulls at the end, so we donât need to worry about nulls until we reach the end of search
- The index is going to be in the correct order except for tied scores. This means that the 10th highest score in the index has the same score number as the 10th highest score in our results, even if the user who got that score is different
So, what we do is this. Before performing our search for real, we do a preliminary search to find the 10th highest score:
SELECT score FROM scores WHERE score IS NOT NULL ORDER BY score DESC OFFSET 9 LIMIT 1
Postgres will do an index scan for this query. Weâve kicked nulls out of the search entirely and dropped the âNULLS LASTâ ordering. And we are now only sorting on a single column. So, the order of our index now exactly satisfies the query. This search is extremely fast: we just need to read the first ten entries in the index, and return the score from the last one.
If this preliminary query doesnât return any results, it means weâre basically trying to fetch the whole table (unless thereâs a million null entries, in which case an index on score isnât going to help us much anyway). So in that case, we just perform our original query without any modifications, since a full table scan is probably the best thing we can do in this situation.
On the other hand, if we do get a 10th score back, we add it into our original query as an inequality constraint. So, for example, if the 10th score was 534, we change our original query to be:
SELECT * FROM scores WHERE score >= 534 ORDER BY score DESC NULLS LAST, data, uid LIMIT 10
Hereâs the trick: Postgres isnât smart enough to use the index to help with the sort, but it is smart enough to do a scan on our score index to evaluate the inequality! If you run an EXPLAIN on this query to see how Postgres executes it, youâll see that it first does an index scan filtering on score >= 534, then sorts the output of that scan using our full sort ordering, then applies the 10 row limit.
This is much faster than a whole table scan. Our â>= 534â condition might return more than 10 rows if there are ties for 10th place, but unless you have a million ties, the total number of rows returned by that first index scan is probably pretty close to 10. Postgres still has to do the sort manually, but it only needs to sort those 10-ish rows, it doesnât need to sort the original million rows.
So, this modified query performs almost as well as it would if we had built a highly-specific index tailored for that particular search, but we do it with a generic, simple index that we can re-use for a bunch of other searches too. Victory!!
One caveat if you want to use this technique in practice: if the data changes (say, if you delete a high score) between our first query to fetch the 10th score and our second query to get the final results, the â>=â inequality might cause fewer than 10 results to be returned. To avoid this happening, the two searches need to be in the same transaction, and the transaction needs to run in REPEATABLE READ or higher isolation mode, not Postgresâs default READ COMMITTED mode.
(More on isolation modes here: https://www.postgresql.org/docs/9.6/static/transaction-iso.html. The tl;dr is that by default, two SELECTs in the same transaction can see different snapshots of the database, but in REPEATABLE READ mode, theyâll see the same snapshot of the database).
We actually run both queries within a stored procedure written in plv8 to minimize round-trip latency from the server to the database. Running plv8 in production is an interesting topic in its own right: perhaps the subject for a future post!
Found this interesting? Weâre always looking for great engineers to join us!
Originally published at blog.bubble.is on June 4, 2018.
Using Simple Indexes to Optimize Complicated Sorts in Postgres was originally published in Hacker Noon on Medium, where people are continuing the conversation by highlighting and responding to this story.
Disclaimer
The views and opinions expressed in this article are solely those of the authors and do not reflect the views of Bitcoin Insider. Every investment and trading move involves risk - this is especially true for cryptocurrencies given their volatility. We strongly advise our readers to conduct their own research when making a decision.