This repository was archived by the owner on Jan 28, 2021. It is now read-only.
Use in-memory join when one of the tables is small enough #577
Closed
Description
This is just an idea I want to leave here for the future, which came up while researching how to improve the performance of joins.
Given the two sides of a join, we could keep the smaller side in an in-memory cache instead of iterating it over and over. Depending on how expensive it is for the underlying implementation to iterate over all of its rows, this could be a big performance boost.
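To make the idea concrete, here is a minimal sketch of such a join in Go. It is not based on this project's actual interfaces; `Row`, `RowIter` and the key-extractor functions are assumptions made purely for illustration.

```go
package join

// Row and RowIter are simplified stand-ins for the engine's real row types.
type Row []interface{}

type RowIter interface {
	Next() (Row, bool)
}

// inMemoryJoin buffers the smaller side in a hash map keyed by the join key,
// then streams the larger side once, probing the map for matches.
func inMemoryJoin(small, large RowIter, smallKey, largeKey func(Row) interface{}) []Row {
	// Build phase: materialize the small side into memory, grouped by key.
	buckets := make(map[interface{}][]Row)
	for row, ok := small.Next(); ok; row, ok = small.Next() {
		buckets[smallKey(row)] = append(buckets[smallKey(row)], row)
	}

	// Probe phase: iterate the large side exactly once.
	var out []Row
	for row, ok := large.Next(); ok; row, ok = large.Next() {
		for _, match := range buckets[largeKey(row)] {
			joined := append(append(Row{}, match...), row...)
			out = append(out, joined)
		}
	}
	return out
}
```

The key point is that the large side is traversed a single time, so the cost of re-iterating rows is paid only for the side that fits in memory.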
What we need for this:
- Be able to estimate the size of each side, if possible (the estimate will be more accurate the fewer filters and other operations are applied; an inner join of an inner join might be tricky, for example).
- An optimisation rule that checks the size of each side of an inner join and, when it can, replaces it with an inner join that keeps the smaller side in memory instead of iterating it again and again (see the sketch after this list).
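A rough sketch of what that rule could look like, under heavy assumptions: `Node`, `InnerJoin`, `SizeEstimator`, `newInMemoryJoin` and the threshold are all hypothetical names for illustration, not this project's real API.

```go
package analyzer

// Node, InnerJoin, SizeEstimator and newInMemoryJoin are hypothetical names
// used only to illustrate the shape of the rule.
type Node interface{}

type InnerJoin struct {
	Left, Right, Cond Node
}

// SizeEstimator is the per-implementation hook: each backend would provide
// its own row-count estimate, or report that it cannot estimate at all.
type SizeEstimator interface {
	EstimateRows(n Node) (rows uint64, known bool)
}

// inMemoryThreshold is an arbitrary placeholder; picking a sensible limit
// is one of the caveats discussed below.
const inMemoryThreshold = 10000

// maybeUseInMemoryJoin replaces an inner join with an in-memory variant when
// one of its sides is estimated to be small enough; otherwise it is left as-is.
func maybeUseInMemoryJoin(j *InnerJoin, est SizeEstimator) Node {
	left, lok := est.EstimateRows(j.Left)
	right, rok := est.EstimateRows(j.Right)
	switch {
	case lok && left <= inMemoryThreshold:
		return newInMemoryJoin(j.Left, j.Right, j.Cond) // cache the left side
	case rok && right <= inMemoryThreshold:
		return newInMemoryJoin(j.Right, j.Left, j.Cond) // cache the right side
	default:
		return j
	}
}

// newInMemoryJoin is a placeholder constructor for the hypothetical node type.
func newInMemoryJoin(cached, streamed, cond Node) Node {
	return &InnerJoin{Left: cached, Right: streamed, Cond: cond}
}
```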
Caveats:
- Estimating the size will be really hard and is not portable; each implementation must provide its own.
- It's hard to decide on a fixed limit that is sensible for the in-memory join.
Sadly, this does not solve #454, because it can't and shouldn't be applied in every case. I've been digging into how to make a join avoid iterating one side N times, but that's the only way to do it without keeping a whole side in memory, which is not viable for large datasets.