Open
Description
Currently we are able to convert IN into a semi-join:
select * from t1 where t1.fld1 in (select fld2 from t2);
is executed like:
select * from t1 semi join t2 on t1.fld1 = t2.fld2;
The problem here is that if table t1
is large and the IN predicate is selective (often it contains PK values), then the whole table t1
is read. Instead, it makes more sense to use the subquery as a pre-filter with a plan like this:
select * from t2 lateral join (select * from t1 where t1.fld1 = t2.fld2);
or simply INNER JOIN
with the join order to be chosen by the optimizer.
If the subquery returns a non-unique field, a DISTINCT
should be applied implicitly, but it's still likely to be cheaper (for a small t2
) than a t1
fullscan.
The major problem here is to estimate the subquery cardinality before joining, currently we have no such ability. Without that it's impossible to decide whether DISTINCT + INNER JOIN
is cheaper than SEMI JOIN
.