BUG: isin() with missing values does not work in 1.3.0 with extension dtypes #42405

Closed
@zmeves

This bug is not present in Pandas < 1.3.0.

In 1.3.0, calling Series.isin() raises a TypeError if both of the following hold:

  • the Series dtype is a nullable extension dtype (pd.Float64Dtype(), pd.Int64Dtype(), ...)
  • the Series contains any missing values (numpy.nan, pd.NA)

The following code snippet tests a few dtypes, determining if each of them supports isin with missing values:

import pandas as pd
import numpy as np

for dtype in (float, int, pd.Float64Dtype(), pd.Int64Dtype(), object):

    x = pd.Series([0, 1, 2, 3, 4], dtype=dtype)
    options = [1, 2, 3]

    print(f"\nTesting with dtype = {x.dtype}:")

    x.isin(options)  # This works every time: no missing values yet

    x.iloc[1] = np.nan  # Set a value to NA

    try:
        x.isin(options)  # On 1.3.0 this raises for the extension dtypes
    except Exception as err:
        print(f"Error! {err}")
    else:
        print("OK")

# Now, show the actual stack trace
print("\nStacktrace for dtype=Int64")
dtype = pd.Int64Dtype()
x = pd.Series([0, 1, 2, 3, 4], dtype=dtype)
options = [1, 2, 3]
x.iloc[1] = np.nan  # Set a value to NA
x.isin(options)

The output is:

Testing with dtype = float64:
OK

Testing with dtype = int64:
OK

Testing with dtype = Float64:
Error! boolean value of NA is ambiguous

Testing with dtype = Int64:
Error! boolean value of NA is ambiguous

Testing with dtype = object:
OK

Stacktrace for dtype=Int64
Traceback (most recent call last):
  File "...dev/pd_1_3_isin_bug.py", line 31, in <module>
    x.isin(options)
  File "..._dev_venv/lib/python3.7/site-packages/pandas/core/series.py", line 5024, in isin
    result = algorithms.isin(self._values, values)
  File "..._dev_venv/lib/python3.7/site-packages/pandas/core/algorithms.py", line 475, in isin
    return comps.isin(values)
  File "..._dev_venv/lib/python3.7/site-packages/pandas/core/arrays/masked.py", line 408, in isin
    if libmissing.NA in values:
  File "pandas/_libs/missing.pyx", line 446, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous
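
The last two frames of the traceback suggest the failure comes from the membership test `libmissing.NA in values`: Python's `in` compares elements with `==`, and `pd.NA == 1` evaluates to `pd.NA`, which cannot be converted to a bool. The snippet below is a minimal sketch of that mechanism outside of pandas internals. The helper `na_in` is hypothetical (it is not a pandas function); it only illustrates an identity-based check that avoids the ambiguous comparison.

```python
import pandas as pd

options = [1, 2, 3]

# Membership via `in` falls back to `==`, which yields pd.NA and then
# fails when Python asks for its truth value.
try:
    pd.NA in options
    raised = False
    message = ""
except TypeError as err:
    raised = True
    message = str(err)

print(raised, message)


def na_in(values):
    """Hypothetical helper: check for pd.NA by identity instead of equality."""
    return any(v is pd.NA for v in values)


print(na_in(options))        # no pd.NA among the options
print(na_in([1, pd.NA, 3]))  # pd.NA found by identity
```

This is only meant to show why the check raises; it is not necessarily how the fix in pandas is (or should be) implemented.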

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.7.4.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1127.13.1.el7.x86_64
Version          : #1 SMP Fri Jun 12 14:34:17 EDT 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.0
numpy            : 1.20.3
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 21.0.1
setuptools       : 40.8.0
Cython           : 0.29.13
pytest           : 5.1.1
hypothesis       : None
sphinx           : 4.0.2
blosc            : None
feather          : None
xlsxwriter       : 1.2.1
lxml.etree       : 4.4.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.8.0
pandas_datareader: None
bs4              : 4.8.0
bottleneck       : 1.2.1
fsspec           : 0.5.2
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : 2.7.0
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : 0.13.0
pyxlsb           : None
s3fs             : None
scipy            : 1.5.4
sqlalchemy       : 1.3.9
tables           : 3.5.2
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : 1.3.0
numba            : 0.45.1

Labels

  • NA - MaskedArrays: Related to pd.NA and nullable extension arrays
  • Regression: Functionality that used to work in a prior pandas version
  • isin: isin method
