
ENH: Consistent API between pd.get_dummies() and Series.str.get_dummies() #59235

Open
@wany-oh


Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Series.str.get_dummies() behaves quite differently from pd.get_dummies() and offers much more limited functionality. These differences are not user-friendly.

Feature Description

  1. The dtype of the DataFrame returned by Series.str.get_dummies() should be bool, not int64.

    s = pd.Series(list('abca'))
    s.str.get_dummies()

    before:

       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    

    after (same as pd.get_dummies(s)):

           a      b      c
    0   True  False  False
    1  False   True  False
    2  False  False   True
    3   True  False  False
    
  2. prefix=, prefix_sep=, dummy_na=, sparse=, and dtype= arguments should be added to Series.str.get_dummies().

    s = pd.Series(['a', 'b', np.nan])
    s.str.get_dummies(prefix="dummy", prefix_sep="=", dummy_na=True, dtype=float)

    after (same as pd.get_dummies(s, prefix="dummy", prefix_sep="=", dummy_na=True, dtype=float)):

       dummy=a  dummy=b  dummy=nan
    0      1.0      0.0        0.0
    1      0.0      1.0        0.0
    2      0.0      0.0        1.0
    

    Note: Among the arguments of pd.get_dummies(), columns= is clearly not needed for Series.str.get_dummies(). Whether Series.str.get_dummies() needs a drop_first= argument is debatable, because Series.str.get_dummies() can yield True in multiple columns of the same row, unlike pd.get_dummies().
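To make the request concrete, here is a minimal sketch of how the proposed arguments could be emulated on top of the existing method. The helper name `str_get_dummies_like` is hypothetical and not part of pandas, and labelling the missing-value column with the string "nan" is a simplifying assumption made for illustration.

```python
import numpy as np
import pandas as pd


def str_get_dummies_like(s, sep="|", prefix=None, prefix_sep="_",
                         dummy_na=False, dtype=bool):
    """Hypothetical helper (not a pandas API) emulating the proposed arguments."""
    # Use the existing accessor method, then cast the indicator columns.
    dummies = s.str.get_dummies(sep=sep).astype(dtype)
    if dummy_na:
        # Assumption: label the missing-value column with the string "nan".
        # This would overwrite a real tag that is literally named "nan".
        dummies["nan"] = s.isna().astype(dtype)
    if prefix is not None:
        dummies.columns = [f"{prefix}{prefix_sep}{col}" for col in dummies.columns]
    return dummies


s = pd.Series(["a", "b", np.nan])
print(str_get_dummies_like(s, prefix="dummy", prefix_sep="=",
                           dummy_na=True, dtype=float))

# A single row can carry several tags, so several columns can be True at once.
multi = pd.Series(["a|b", "b"])
print(str_get_dummies_like(multi))
```

The second call shows why drop_first= is less clear-cut here: with multi-tag rows the columns are not mutually exclusive, so dropping one column does not leave it recoverable from the others.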

Alternative Solutions

While there are countless alternative ways to obtain a DataFrame with the same result, none of them would bring consistency to the two methods. The only real alternative might be to simply deprecate Series.str.get_dummies().
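As one example of such a workaround (a sketch only, not a proposal for the implementation), delimited strings can be split and exploded so that pd.get_dummies() and its existing arguments apply, then collapsed back to one row per original entry. The function name `multi_label_dummies` is hypothetical.

```python
import numpy as np
import pandas as pd


def multi_label_dummies(s, sep="|", **get_dummies_kwargs):
    """Hypothetical workaround: route delimited strings through pd.get_dummies()."""
    # One row per tag; the original index label is repeated for multi-tag rows.
    exploded = s.str.split(sep).explode()
    dummies = pd.get_dummies(exploded, **get_dummies_kwargs)
    # Collapse back to one row per original entry: True if any tag matched.
    return dummies.groupby(level=0).any()


s = pd.Series(["a|b", "b", np.nan])
print(multi_label_dummies(s, prefix="dummy", prefix_sep="=", dummy_na=True))
```

Because the final step aggregates with any(), this sketch always returns boolean columns, so a dtype= passed through to pd.get_dummies() would not survive the aggregation.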

Additional Context

No response

Metadata


Labels

  • Enhancement

  • Needs Discussion (Requires discussion from core team before further action)

  • Strings (String extension data type and string data)
