CTE vs Subquery: Who gives a 🦆uck?🔗
Inspiration🔗
- https://duckdb.org/2023/05/26/correlated-subqueries-in-sql.html
- https://cs.emis.de/LNI/Proceedings/Proceedings241/383.pdf
- Blog Posts about CTEs vs Subqueries
TL;DR🔗
Rough performance numbers on M1 Macbook Pro 16GB. 100,000 students joined with 1,000,000 exam scores.
Follow along from the github repo if you're comfortable with docker.
Engine | Query Structure | Performance (seconds) |
---|---|---|
DuckDB | Subquery | 0.02 |
DuckDB | CTE | 0.02 |
DuckDB + cached Postgres | Subquery | 0.03 |
DuckDB + cached Postgres | CTE | 0.07 |
Postgres | CTE | 0.49 |
DuckDB + scanned Postgres | Subquery | 0.50 |
DuckDB + scanned Postgres | CTE | 0.50 |
Postgres | Subquery | ~6300 |
Why it matters🔗
This is an example of a query with a subquery:
SQL | |
---|---|
It selects, for each student, the exam with the lowest grade they received out of all their own exams.
If you have experience with SQL joins then you might recognize the students.id = exams.sid
clause, which joins exams to students.
The subquery utilizes each student's id to find their own minimum grade of their exams.
The performance of this query in postgres is atrocious.
On the other hand, the performance of this completely equivalent query using a Common Table Expression (CTE) is very passable in postgres:
SQL | |
---|---|
This is the CTE vs Subquery debate in a nutshell.
You have to keep in mind that subqueries can produce poor query performance and you cannot write all subquery expressions as CTEs!
In comes the OLAP-centric duckdb and de-correlated queries so we can have our cake and eat it too.
Not giving a flock🔗
We'll run these queries over a duckdb database and compare their performance. (HINT: they're the same in duckdb)
You can follow along with this experience in your browser; visit https://shell.duckdb.org to open a duckdb shell.
Then we'll show that you can bring these benefits over to a live postgres instance with the duckdb postgres scanner extension!
Tables🔗
We'll generate some similar tables to the examples in the paper. Namely, a students and an exams table.
Copy and paste the following into a duckdb shell.
Follow along with the sql comments (-- these lines with two dashes are comments
)
You should see something like the following in your shell:
Text Only | |
---|---|
1 2 3 |
|
Subquery🔗
The subquery is the first version of the query I personally would write.
Let's evaluate its performance:
SQL | |
---|---|
This takes about 0.02
seconds on my machine on the first run (about 0.09
in the duckdb browser shell).
CTE🔗
Now for the CTE:
SQL | |
---|---|
This also takes about 0.02
seconds on my machine (about 0.09
in the duckdb browser shell).
So What🔗
Let's run the same experiment in postgres via a docker compose file:
With docker-compose up --build
and this docker-compose.yml
we should get a live postgres container.
Init Postgres Tables🔗
The following should get us a psql
shell in the postgres instance:
Bash | |
---|---|
(NOTE you will probably need to open another terminal)
And then we can create the same demo tables:
PG CTE🔗
For Postgres we'll start with the CTE version of the query:
SQL | |
---|---|
Execution time 493 ms
or 0.493
seconds.
Ok, not atrocious, but slower than duckdb running in WASM in the browser.
So if our subquery query can be re-written as a CTE query then we're mostly safe.
PG Subquery🔗
If you're brave, go ahead and try the following.
In the same psql
shell let's evaluate the subquery:
SQL | |
---|---|
WARNING this took about an hour and 45 minutes on my machine
Blending It Together🔗
But we can do even better!
By combining the power of duckdb and postgres we can run these queries efficiently against a live postgres instance without thinking about the structure of our query!
Duckdb Postgres Scan🔗
We can scan live data with the benefits of the de-correlated algorithm with the original queries.
We'll run duckdb from the CLI with duckdb
and connect to the dockerized postgres instance.
Bash | |
---|---|
And in the duckdb shell we can attach to the postgres instance:
SQL | |
---|---|
Live Scan🔗
Using .read
will execute the sql query in a given file.
This might be more convenient for you than copy and pasting the whole query.
SQL | |
---|---|
About 0.5
seconds for each of these queries.
(NOTE you could take this a step further by isolating the postgres instance on another server. In this experiment they are literally running on the same machine sharing resources)
Both had equivalent performance to the regular Postgres CTE version. Both had orders of magnitude better performance than the Postgres Subquery version.
So write whatever query makes sense to you and your team / organization!
Cached Scan🔗
If data duplication is not an issue, then we can load the postgres table data into duckdb itself and get even better performance!
SQL | |
---|---|
Around 0.07
seconds.
An order of magnitude better than the "efficient" CTE version on Postgres.
Interestingly, the cached subquery duckdb version has the best performance around 0.03
seconds.
(NOTE these queries only differ in the tables they reference: the cached tables. Start duckdb with a command like duckdb students.db
to save the database file and not have to copy tables again)
Duckdb Docker🔗
This mimics a real production scenario where your duckdb instance is querying a live postgres instance either on RDS or some other host.
Perhaps duckdb is running in a lambda or other type of workflow job.
The Dockerfile chooses a basic OS and downloads necessary packages:
.duckdbrc🔗
The .duckdbrc
file defines the initialization of duckdb shells for this image.
It will instruct each duckdb shell you open to connect to the live postgres docker container:
Bash | |
---|---|
Entrypoint🔗
But to set those postgres connection variables at runtime we'll need a small entrypoint bash script that will allow other commands to be entered after:
Docker Compose🔗
Let's add this other container / service to our docker-compose file:
YAML | |
---|---|
The following should get us a duckdb shell that can connect to the live postgres instance:
Bash | |
---|---|
Or we can run a query directly from the cli:
Bash | |
---|---|
This method might cause a performance hit on your machine, but can be tested more rigorously before a cloud pipeline deployment.
Story Time🔗
If you've made it this far then you probably have some interest in the topic.
Here's an example of how I learned about subquery performance in Postgres in the real world:
- Imagine a table full of account records for your customers
- Imagine this table has a column for the date the data was updated
- Imagine you want to select the most up to date information for a large set of accounts
Maybe you would write a query like this:
SQL | |
---|---|
And thus you fell into the trap. This query could take god-knows-how-long and that is not at all useful for developer experience.
Would you rather spend the mental cycles re-writing it to a CTE or move on with your life?
Resources🔗
Keep exploring!
- More on this example from @FranckPachot: dbfiddle.uk/wLZ4H496
- https://duckdb.org/2023/05/26/correlated-subqueries-in-sql.html
- https://duckdb.org/2022/09/30/postgres-scanner.html
- https://hakibenita.com/be-careful-with-cte-in-postgre-sql
- https://cs.emis.de/LNI/Proceedings/Proceedings241/383.pdf
Created: June 7, 2023