Skip to content

Joining on a nullable column #32306

Open
Open
@dsaxton

Description

@dsaxton
import pandas as pd

left = pd.DataFrame({"a": [1, 2, pd.NA]}, dtype="Int64")
right = pd.DataFrame({"b": [1, 2, pd.NA]}, dtype="Int64")

pd.merge(left, right, how="inner", left_on="a", right_on="b")
#       a     b
# 0     1     1
# 1     2     2
# 2  <NA>  <NA>

(Above is from 1.0.1 and master)

I think when joining on a nullable column we should not be matching NA with NA and should only be joining where we have unambiguous equality (as in SQL). Also worth noting that this is the same as what happens when we have NaN which also seems incorrect, so could be an opportunity to fix this behavior?

pd.merge(left.astype(float), right.astype(float), how="inner", left_on="a", right_on="b")
#      a    b
# 0  1.0  1.0
# 1  2.0  2.0
# 2  NaN  NaN

Expected Output

      a     b
0     1     1
1     2     2

cc @jorisvandenbossche

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugDeprecateFunctionality to remove in pandasNA - MaskedArraysRelated to pd.NA and nullable extension arraysReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions