ENH: Change default dtype of str.get_dummies() to bool for consistency with pd.get_dummies() #60676

Open
@komo-fr

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, pd.get_dummies() returns a boolean dtype by default, while str.get_dummies() returns an integer dtype (np.int64). This inconsistency may cause confusion for users.
To align the behavior, I propose changing the default dtype of str.get_dummies() to a boolean type (bool or boolean), matching the current behavior of pd.get_dummies().

Current Behavior of pd.get_dummies() vs. str.get_dummies()

pd.get_dummies():

import pandas as pd
sr = pd.Series(["A", "B", "A"])
pd.get_dummies(sr)
       A      B
0   True  False
1  False   True
2   True  False

str.get_dummies():

sr.str.get_dummies()
   A  B
0  1  0
1  0  1
2  1  0

Note: The behavior described here is consistent in both pandas 2.2.3 and the current main branch (3.0.0.dev).
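
As a stopgap until the default changes, the integer output can be cast explicitly. The following is a minimal sketch of that workaround, not a proposed implementation; the variable names are illustrative only.

```python
import pandas as pd

sr = pd.Series(["A", "B", "A"])

# str.get_dummies() currently returns integer indicator columns;
# an explicit cast gives boolean columns like pd.get_dummies().
dummies_from_str = sr.str.get_dummies().astype(bool)
dummies_from_pd = pd.get_dummies(sr)

# Inspect the resulting dtypes and check that the values agree.
print(dummies_from_str.dtypes)
print((dummies_from_str.to_numpy() == dummies_from_pd.to_numpy()).all())
```

Note that astype(bool) always produces the NumPy bool dtype, so it does not reproduce the nullable or Arrow-backed boolean dtypes that pd.get_dummies() returns for extension-array input.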

Background

pd.get_dummies() has defaulted to a boolean dtype since pandas 2.0. However, str.get_dummies() has not yet been updated to match this behavior and still defaults to np.int64.

Detailed Comparison by Dtype

The following compares the results of pd.get_dummies() and str.get_dummies() on Series with various dtypes. The output dtypes match only for pd.ArrowDtype(pa.string()) (case 4, displayed as string[pyarrow]). In all other cases, including pd.StringDtype("pyarrow") (case 2, also displayed as string[pyarrow]), they differ.

Code:

import numpy as np
import pandas as pd
import pyarrow as pa

sr_list = [pd.Series(["A", "B", "A"]),
           pd.Series(["A", "B", "A"], dtype=pd.StringDtype()),
           pd.Series(["A", "B", "A"], dtype=pd.StringDtype("pyarrow")),
           pd.Series(["A", "B", "A"], dtype=pd.StringDtype("pyarrow_numpy")),
           pd.Series(["A", "B", "A"], dtype=pd.ArrowDtype(pa.string())),
           pd.Series(["A", "B", "A"], dtype="category"),
           pd.Series(["A", "B", "A"], dtype=pd.CategoricalDtype(pd.Index(["A", "B"], dtype=pd.ArrowDtype(pa.string()))))
]

for i, sr in enumerate(sr_list):
    print(f"----- case {i}. {sr.dtype=} -----")
    print(f"pd.get_dummies: {pd.get_dummies(sr)['A'].dtype}")
    print(f"str.get_dummies: {sr.str.get_dummies()['A'].dtype}")

Output:

----- case 0. sr.dtype=dtype('O') -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 1. sr.dtype=string[python] -----
pd.get_dummies: boolean
str.get_dummies: int64
----- case 2. sr.dtype=string[pyarrow] -----
pd.get_dummies: boolean
str.get_dummies: int64
----- case 3. sr.dtype=string[pyarrow_numpy] -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 4. sr.dtype=string[pyarrow] -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: bool[pyarrow]
----- case 5. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object) -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 6. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=string[pyarrow]) -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: int64

Feature Description

1. Modify the default dtype in str.get_dummies() to return boolean values instead of np.int64.

Currently, str.get_dummies() sets the default dtype to np.int64 in the following locations. These can be updated to use np.bool_ for consistency:

2. If the input is an ExtensionArray, the method should return the corresponding boolean dtype (e.g., bool[pyarrow] for Arrow-backed input, or boolean for nullable string input).

The pd.get_dummies() method already determines the output dtype based on the input Series dtype using the following logic. The same approach can be adapted for str.get_dummies():
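
The dtype selection described above could look roughly like the sketch below. It is not copied from the pandas source: infer_dummies_dtype is a hypothetical helper name, and the branching is an assumption reconstructed from the pd.get_dummies() outputs listed in the comparison above.

```python
import numpy as np
import pandas as pd


def infer_dummies_dtype(data: pd.Series):
    """Hypothetical helper: choose a boolean dtype that matches the input dtype."""
    input_dtype = data.dtype

    # For categoricals, decide based on the dtype of the categories.
    if isinstance(input_dtype, pd.CategoricalDtype):
        input_dtype = input_dtype.categories.dtype

    # Arrow-backed input -> Arrow-backed boolean (bool[pyarrow]).
    if isinstance(input_dtype, pd.ArrowDtype):
        import pyarrow as pa  # only required for Arrow-backed input

        return pd.ArrowDtype(pa.bool_())

    # Nullable string input (pd.NA as missing value) -> nullable "boolean".
    if isinstance(input_dtype, pd.StringDtype) and input_dtype.na_value is pd.NA:
        return pd.BooleanDtype()

    # Everything else (object, NaN-backed strings, ...) -> NumPy bool.
    return np.dtype(bool)


# Example: object-dtype strings vs. the nullable Python-backed string dtype.
print(infer_dummies_dtype(pd.Series(["A", "B", "A"])))
print(infer_dummies_dtype(pd.Series(["A", "B", "A"], dtype=pd.StringDtype("python"))))
```

With a helper along these lines, str.get_dummies() could fall back to the inferred dtype whenever the caller does not request one explicitly.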

Related PR:
I previously submitted a PR for this change, but the tests did not pass, and the PR has remained inactive since then. I am now opening this issue to clarify the problem and discuss potential solutions before proceeding with further modifications.

Alternative Solutions

No alternative solutions have been identified.

Additional Context

Metadata

Labels: Enhancement, Needs Triage