pandas' recommendation on inplace deprecation and categorical column #57104

Open
@adrinjalali

While making scikit-learn's code compatible with pandas 2.2.0, I started from this minimal reproducer:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"].replace(to_replace="a", value="b", inplace=True)

which results in:

$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"].replace(to_replace="a", value="b", inplace=True)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/generic.py", line 7963, in replace
    warnings.warn(
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

The first pattern doesn't apply here, so from this message, I understand I should do:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"] = df["col"].replace(to_replace="a", value="b")

But this also fails with:

$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"] = df["col"].replace(to_replace="a", value="b")
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/generic.py", line 8135, in replace
    new_data = self._mgr.replace(
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/base.py", line 249, in replace
    return self.apply_with_block(
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 364, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 854, in replace
    values._replace(to_replace=to_replace, value=value, inplace=True)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 2665, in _replace
    warnings.warn(
FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.

After a bit of reading through the docs, it seems I need to do:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"] = df["col"].cat.rename_categories({"a": "b"})

which fails with:

$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"] = df["col"].cat.rename_categories({"a": "b"})
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/accessor.py", line 112, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 2939, in _delegate_method
    res = method(*args, **kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 1205, in rename_categories
    cat._set_categories(new_categories)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 924, in _set_categories
    new_dtype = CategoricalDtype(categories, ordered=self.ordered)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 221, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 378, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 579, in validate_categories
    raise ValueError("Categorical categories must be unique")
ValueError: Categorical categories must be unique
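The error above is consistent with rename_categories being a one-to-one relabeling: mapping "a" onto the existing "b" would leave two categories with the same label. A minimal sketch of the difference, where the label "z" is only an illustrative assumption:

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype="category")

# One-to-one relabeling works: "a" becomes the unused label "z".
renamed = ser.cat.rename_categories({"a": "z"})
print(list(renamed.cat.categories))

# Relabeling "a" as the existing "b" would produce duplicate categories,
# so pandas raises instead of merging the two.
try:
    ser.cat.rename_categories({"a": "b"})
    merge_error = None
except ValueError as exc:
    merge_error = str(exc)
print(merge_error)
```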

So rename_categories is apparently not the one I want. Its "See also" section lists:

  • reorder_categories: Reorder categories.
  • add_categories: Add new categories.
  • remove_categories: Remove the specified categories.
  • remove_unused_categories: Remove categories which are not used.
  • set_categories: Set the categories to the specified ones.

None of them seems to do what I need.

So it seems the way to go would be:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df.loc[df["col"] == "a", "col"] = "b"
df["col"] = df["col"].astype("category").cat.remove_unused_categories()

Which is far from what the warning message suggests.
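For checking this inside a test suite rather than from the command line, the sketch below runs the same workaround with FutureWarning promoted to an error, roughly mirroring `-Werror::FutureWarning`. It omits the `astype("category")` step on the assumption that the column is still categorical after assigning an existing category.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"col": ["a", "b", "c"]}, dtype="category")

with warnings.catch_warnings():
    # Any FutureWarning raised inside this block becomes an exception.
    warnings.simplefilter("error", FutureWarning)
    # "b" is already a category, so this assignment keeps the dtype.
    df.loc[df["col"] == "a", "col"] = "b"
    # Drop "a", which no longer appears in the data.
    df["col"] = df["col"].cat.remove_unused_categories()

print(df["col"].tolist())
print(df["col"].cat.categories.tolist())
```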

So, in the end:

  • Did I arrive at the right conclusion about what the code should look like now?
  • I think the warning message could be more concrete about where users should go.
  • Should there be a method on Series.cat to make this easier?
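To make the last question concrete, here is a rough sketch of what such a helper might do. The name `merge_category` is hypothetical and not part of pandas; the body uses only the existing `add_categories` and `remove_categories` accessor methods, and unlike `remove_unused_categories` it drops only the merged category.

```python
import pandas as pd


def merge_category(ser, old, new):
    """Hypothetical helper: relabel every `old` value as `new`, then drop `old`."""
    out = ser.copy()
    # Make sure the target label is a valid category before assigning it.
    if new not in out.cat.categories:
        out = out.cat.add_categories([new])
    # Boolean-mask assignment on a named Series, so no chained assignment.
    out[out == old] = new
    # Remove only the merged category; other unused categories are kept.
    if old in out.cat.categories:
        out = out.cat.remove_categories([old])
    return out


ser = pd.Series(["a", "b", "c"], dtype="category")
merged = merge_category(ser, "a", "b")
print(merged.tolist(), merged.cat.categories.tolist())
```

The input Series is copied rather than modified, which matches the `df[col] = df[col].method(value)` pattern recommended by the warning.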

Labels: Categorical (Categorical Data Type); Needs Discussion (Requires discussion from core team before further action); inplace (Relating to inplace parameter or equivalent)
