Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Currently, pd.get_dummies() returns a boolean dtype by default, while str.get_dummies() returns an integer dtype (np.int64). This inconsistency may cause confusion for users.
To align the behavior, I propose changing the default dtype of str.get_dummies() to a boolean type (bool or boolean), matching the current behavior of pd.get_dummies().
Current Behavior of pd.get_dummies() vs. str.get_dummies()
pd.get_dummies():
import pandas as pd
sr = pd.Series(["A", "B", "A"])
pd.get_dummies(sr)
       A      B
0   True  False
1  False   True
2   True  False
str.get_dummies():
sr.str.get_dummies()
   A  B
0  1  0
1  0  1
2  1  0
Note: The behavior described here is consistent in both pandas 2.2.3 and the current main branch (3.0.0.dev).
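Until the defaults are aligned, the two outputs can be made comparable with an explicit cast. The following is a minimal sketch, not part of the proposal itself: `str_get_dummies_as_bool` is a hypothetical helper name, and it relies only on `Series.str.get_dummies()` and `DataFrame.astype()`.

```python
import pandas as pd


def str_get_dummies_as_bool(sr: pd.Series, sep: str = "|") -> pd.DataFrame:
    """Hypothetical helper: str.get_dummies() output cast to NumPy bool."""
    # The explicit cast is the workaround; it does not change pandas defaults.
    return sr.str.get_dummies(sep=sep).astype(bool)


sr = pd.Series(["A", "B", "A"])
converted = str_get_dummies_as_bool(sr)

# Compare the column dtypes of the two approaches side by side.
print(converted.dtypes)
print(pd.get_dummies(sr).dtypes)
```

Note that this cast always yields NumPy `bool`, so it does not reproduce the extension boolean dtypes that pd.get_dummies() returns for extension-array inputs.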
Background
- In PR #48022 ("ENH: change get_dummies default dtype to bool"), the default dtype of pd.get_dummies() was changed to bool.
- In PR #56291 ("ENH: Make get_dummies return ea booleans for ea inputs"), pd.get_dummies() was updated to return a corresponding boolean dtype (e.g., bool[pyarrow]) when the input is an ExtensionArray.
However, str.get_dummies() has not yet been updated to match this behavior. Currently, it still defaults to np.int64.
Detailed Comparison by Dtype
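The following compares the results of using pd.get_dummies() and str.get_dummies() on Series with various dtypes. The results match only when the dtype is pd.ArrowDtype(pa.string()) (case 4, displayed as string[pyarrow]); in all other cases, including pd.StringDtype("pyarrow") (case 2, which is displayed the same way), the output dtypes differ.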
Code:
import numpy as np
import pandas as pd
import pyarrow as pa
sr_list = [
    pd.Series(["A", "B", "A"]),
    pd.Series(["A", "B", "A"], dtype=pd.StringDtype()),
    pd.Series(["A", "B", "A"], dtype=pd.StringDtype("pyarrow")),
    pd.Series(["A", "B", "A"], dtype=pd.StringDtype("pyarrow_numpy")),
    pd.Series(["A", "B", "A"], dtype=pd.ArrowDtype(pa.string())),
    pd.Series(["A", "B", "A"], dtype="category"),
    pd.Series(
        ["A", "B", "A"],
        dtype=pd.CategoricalDtype(pd.Index(["A", "B"], dtype=pd.ArrowDtype(pa.string()))),
    ),
]
for i, sr in enumerate(sr_list):
    print(f"----- case {i}. {sr.dtype=} -----")
    print(f"pd.get_dummies: {pd.get_dummies(sr)['A'].dtype}")
    print(f"str.get_dummies: {sr.str.get_dummies()['A'].dtype}")
Output:
----- case 0. sr.dtype=dtype('O') -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 1. sr.dtype=string[python] -----
pd.get_dummies: boolean
str.get_dummies: int64
----- case 2. sr.dtype=string[pyarrow] -----
pd.get_dummies: boolean
str.get_dummies: int64
----- case 3. sr.dtype=string[pyarrow_numpy] -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 4. sr.dtype=string[pyarrow] -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: bool[pyarrow]
----- case 5. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object) -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 6. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=string[pyarrow]) -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: int64
Feature Description
1. Modify the default dtype in str.get_dummies() to return boolean values instead of np.int64.
Currently, str.get_dummies() sets the default dtype to np.int64 in the following locations. These can be updated to use np.bool_ for consistency:
- pandas/pandas/core/strings/object_array.py (_str_get_dummies())
- pandas/pandas/core/arrays/string_arrow.py (_str_get_dummies())
- pandas/pandas/core/arrays/arrow/array.py (_str_get_dummies()): already uses np.bool_
2. If the input is an ExtensionArray, the method should return the corresponding boolean dtype (e.g., bool[pyarrow]).
The pd.get_dummies() method already determines the output dtype based on the input Series dtype using the following logic. The same approach can be adapted for str.get_dummies():
- pandas/pandas/core/reshape/encoding.py (_get_dummies_1d())
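To make the dtype-selection idea concrete, here is a rough sketch of a mapping that reproduces the pd.get_dummies() results listed in the comparison output above. It is an illustration written against public pandas classes, not the actual code in encoding.py; `infer_dummy_dtype` is a hypothetical name, and the use of `na_value` to separate NA-backed from NaN-backed string dtypes is an assumption based on that output.

```python
import numpy as np
import pandas as pd


def infer_dummy_dtype(input_dtype):
    """Hypothetical helper: pick a boolean result dtype for an input dtype."""
    # For categoricals, decide based on the dtype of the categories.
    if isinstance(input_dtype, pd.CategoricalDtype):
        input_dtype = input_dtype.categories.dtype
    # Arrow-backed input maps to the Arrow boolean dtype (bool[pyarrow]).
    if isinstance(input_dtype, pd.ArrowDtype):
        import pyarrow as pa  # only needed for Arrow-backed input

        return pd.ArrowDtype(pa.bool_())
    # String dtypes that use pd.NA map to the nullable "boolean" dtype.
    if isinstance(input_dtype, pd.StringDtype) and input_dtype.na_value is pd.NA:
        return pd.BooleanDtype()
    # Everything else falls back to NumPy bool.
    return np.dtype(bool)


print(infer_dummy_dtype(pd.Series(["A", "B", "A"]).dtype))
print(infer_dummy_dtype(pd.StringDtype("python")))
print(infer_dummy_dtype(pd.CategoricalDtype(["A", "B"])))
```

A str.get_dummies() implementation could call a helper like this when no dtype is requested, so that both methods resolve their default from the same rule.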
Related PR:
I previously submitted a PR for this change, but the tests did not pass, and the PR has remained inactive since then. I am now opening this issue to clarify the problem and discuss potential solutions before proceeding with further modifications.
Alternative Solutions
No alternative solutions have been identified.
Additional Context
- Related PRs and Issues (Changes to pd.get_dummies() behavior):
- Current PR (Pending):