Description
While making scikit-learn's code compatible with pandas 2.2.0, I ran into the following. Here's a minimal reproducer of where I started:
import pandas as pd
df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"].replace(to_replace="a", value="b", inplace=True)
which results in:
$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
import pandas as pd
Traceback (most recent call last):
File "/tmp/4.py", line 4, in <module>
df["col"].replace(to_replace="a", value="b", inplace=True)
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/generic.py", line 7963, in replace
warnings.warn(
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
The first pattern doesn't apply here, so from this message, I understand I should do:
import pandas as pd
df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"] = df["col"].replace(to_replace="a", value="b")
But this also fails with:
$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
import pandas as pd
Traceback (most recent call last):
File "/tmp/4.py", line 4, in <module>
df["col"] = df["col"].replace(to_replace="a", value="b")
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/generic.py", line 8135, in replace
new_data = self._mgr.replace(
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/base.py", line 249, in replace
return self.apply_with_block(
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 364, in apply
applied = getattr(b, f)(**kwargs)
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 854, in replace
values._replace(to_replace=to_replace, value=value, inplace=True)
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 2665, in _replace
warnings.warn(
FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
After reading the docs a bit, it seems I need to do:
import pandas as pd
df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"] = df["col"].cat.rename_categories({"a": "b"})
which fails with:
$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
import pandas as pd
Traceback (most recent call last):
File "/tmp/4.py", line 4, in <module>
df["col"] = df["col"].cat.rename_categories({"a": "b"})
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/accessor.py", line 112, in f
return self._delegate_method(name, *args, **kwargs)
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 2939, in _delegate_method
res = method(*args, **kwargs)
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 1205, in rename_categories
cat._set_categories(new_categories)
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 924, in _set_categories
new_dtype = CategoricalDtype(categories, ordered=self.ordered)
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 221, in __init__
self._finalize(categories, ordered, fastpath=False)
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 378, in _finalize
categories = self.validate_categories(categories, fastpath=fastpath)
File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 579, in validate_categories
raise ValueError("Categorical categories must be unique")
ValueError: Categorical categories must be unique
So rename_categories is apparently not the one I want, but reading through its "See also" section:
reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.
None of them seems to do what I need.
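As far as I can tell, rename_categories fails here because it only relabels categories one-to-one: mapping "a" to "b" would leave two categories named "b", which the dtype rejects. A minimal sketch of that behaviour follows; the label "d" is purely illustrative and not part of the original example:

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype="category")

# Renaming "a" to the already-existing "b" would yield the categories
# ["b", "b", "c"], so pandas raises ValueError.
try:
    ser.cat.rename_categories({"a": "b"})
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Renaming to a label that is not yet a category (illustrative "d") works,
# because the resulting categories stay unique.
renamed = ser.cat.rename_categories({"a": "d"})

print(error_message)
print(list(renamed.cat.categories))
```

So renaming is fine for relabelling, but it cannot merge two categories into one, which is what the original replace call did.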
So it seems the way to go would be:
import pandas as pd
df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df.loc[df["col"] == "a", "col"] = "b"
df["col"] = df["col"].astype("category").cat.remove_unused_categories()
Which is far from what the warning message suggests.
So in the end:
- Did I arrive at the right conclusion as to what the code should look like now?
- I think the warning message could be more concrete about where users should go.
- Should there be a method on Series.cat to make this easier?
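To make the last question concrete, here is a rough sketch of what such a helper could do, built only on the workaround above. The name merge_category and its signature are hypothetical; nothing like it exists in pandas as far as I know:

```python
import pandas as pd


def merge_category(ser: pd.Series, old, new) -> pd.Series:
    """Return a copy of categorical `ser` with category `old` folded into `new`.

    Hypothetical helper for illustration only, not a pandas API.
    """
    out = ser.copy()
    if old == new or old not in out.cat.categories:
        return out  # nothing to merge
    if new not in out.cat.categories:
        # The target must be a known category before values can be set to it.
        out = out.cat.add_categories([new])
    # Reassign every value equal to `old`, then drop the now-empty category.
    out[out == old] = new
    return out.cat.remove_categories([old])


df = pd.DataFrame({"col": ["a", "b", "c"]}, dtype="category")
df["col"] = merge_category(df["col"], "a", "b")
print(df["col"].cat.categories)
```

Unlike remove_unused_categories, this drops only the merged category, so other categories that happen to be unused are kept.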