pandas' recommendation on inplace deprecation and categorical column #57104

While making scikit-learn's code compatible with pandas 2.2.0, I started from this minimal reproducer:

```py
import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"].replace(to_replace="a", value="b", inplace=True)
```
which results in:

```
$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"].replace(to_replace="a", value="b", inplace=True)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/generic.py", line 7963, in replace
    warnings.warn(
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
```

The first pattern doesn't apply here, so from this message, I understand I should do:

```py
import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"] = df["col"].replace(to_replace="a", value="b")
```

But this also fails with:

```
$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"] = df["col"].replace(to_replace="a", value="b")
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/generic.py", line 8135, in replace
    new_data = self._mgr.replace(
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/base.py", line 249, in replace
    return self.apply_with_block(
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 364, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 854, in replace
    values._replace(to_replace=to_replace, value=value, inplace=True)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 2665, in _replace
    warnings.warn(
FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
```

After reading the docs a bit, it seems I need to do:

```py
import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"] = df["col"].cat.rename_categories({"a": "b"})
```

which fails with:

```
$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"] = df["col"].cat.rename_categories({"a": "b"})
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/accessor.py", line 112, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 2939, in _delegate_method
    res = method(*args, **kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 1205, in rename_categories
    cat._set_categories(new_categories)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 924, in _set_categories
    new_dtype = CategoricalDtype(categories, ordered=self.ordered)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 221, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 378, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 579, in validate_categories
    raise ValueError("Categorical categories must be unique")
ValueError: Categorical categories must be unique
```
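As far as I can tell, the `ValueError` comes from the target label already being a category: renaming `"a"` to `"b"` would leave two categories called `"b"`. A minimal sketch of that behaviour, where the label `"z"` is just an illustrative name that is not part of my actual data:

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype="category")

# Renaming to a label that is not yet a category works.
renamed = ser.cat.rename_categories({"a": "z"})
print(list(renamed.cat.categories))

# Renaming "a" to the existing "b" would produce duplicate categories,
# which CategoricalDtype rejects.
try:
    ser.cat.rename_categories({"a": "b"})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

So `rename_categories` relabels categories one-to-one; it does not seem to be meant for merging one category into another.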

So `rename_categories` is apparently not what I want. Reading through its "See also" section:

> - [reorder_categories](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.reorder_categories.html#pandas.Series.cat.reorder_categories): Reorder categories.
> - [add_categories](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.add_categories.html#pandas.Series.cat.add_categories): Add new categories.
> - [remove_categories](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.remove_categories.html#pandas.Series.cat.remove_categories): Remove the specified categories.
> - [remove_unused_categories](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.remove_unused_categories.html#pandas.Series.cat.remove_unused_categories): Remove categories which are not used.
> - [set_categories](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.set_categories.html#pandas.Series.cat.set_categories): Set the categories to the specified ones.

None of them seems to do what I need.

So it seems the way to go would be:

```py
import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
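# Assign the existing category "b" to every row that currently holds "a".
df.loc[df["col"] == "a", "col"] = "b"
# Drop the category "a", which no row uses any more.
df["col"] = df["col"].astype("category").cat.remove_unused_categories()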
```

This is far from what the warning message suggests.

So, in the end:

- Did I arrive at the right conclusion about what the code should look like now?
- I think the warning message could be more concrete about where users should go.
- Should there be a method on `Series.cat` to make this easier?
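
To make the last question concrete, here is a rough sketch of what such a helper could do. `merge_category` is a hypothetical name I made up for illustration, not an existing pandas API, and it assumes the input is a categorical `Series`:

```python
import pandas as pd


def merge_category(ser: pd.Series, old, new) -> pd.Series:
    """Hypothetical helper: fold category `old` into category `new`."""
    out = ser.copy()
    if old not in out.cat.categories:
        return out  # nothing to merge
    if new not in out.cat.categories:
        # The target must be a known category before it can be assigned.
        out = out.cat.add_categories([new])
    # Boolean-mask assignment on the Series itself, not on a temporary.
    out[out == old] = new
    # No row holds `old` any more, so removing it does not introduce NaN.
    return out.cat.remove_categories([old])


df = pd.DataFrame({"col": ["a", "b", "c"]}, dtype="category")
df["col"] = merge_category(df["col"], old="a", new="b")
print(df["col"].tolist())
print(list(df["col"].cat.categories))
```

Unlike `remove_unused_categories`, this only drops the merged category and leaves any other unused categories alone, which may or may not be what a real API should do.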
