Skip to content

pd.NA TypeError in drop_duplicates with object dtype #32992

Closed
@WillAyd

Description

@WillAyd

Code Sample, a copy-pastable example if possible

>>> pd.DataFrame([[1, pd.NA], [2, "a"]]).drop_duplicates()
Traceback (most recent call last):
   ...
  File "/Users/williamayd/miniconda3/envs/sitka/lib/python3.8/site-packages/pandas/core/frame.py", line 4859, in f
    labels, shape = algorithms.factorize(
  File "/Users/williamayd/miniconda3/envs/sitka/lib/python3.8/site-packages/pandas/core/algorithms.py", line 629, in factorize
    codes, uniques = _factorize_array(
  File "/Users/williamayd/miniconda3/envs/sitka/lib/python3.8/site-packages/pandas/core/algorithms.py", line 478, in _factorize_array
    uniques, codes = table.factorize(values, na_sentinel=na_sentinel, na_value=na_value)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1806, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 1728, in pandas._libs.hashtable.PyObjectHashTable._unique
  File "pandas/_libs/missing.pyx", line 360, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous

This same failure isn't present when using an extension type:

>>> df = pd.DataFrame([[1, pd.NA], [2, "a"]], columns=list("ab"))
>>> df["b"] = df["b"].astype("string")
>>> df.drop_duplicates()
   a     b
0  1  <NA>
1  2     a

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions