
ENH: Allow different dtype in pandas.Series.str.get_dummies #47872

Closed
@JeffersonQin

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, pandas.Series.str.get_dummies always returns data of dtype numpy.int64. It would be nice if other data types could be specified.

Feature Description

Add a new parameter to str.get_dummies that lets the caller specify the dtype of the returned DataFrame.
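
As a rough illustration of the requested behavior, here is a minimal sketch that emulates it outside pandas. `get_dummies_with_dtype` and its `dtype` argument are hypothetical names chosen for this example, not an existing pandas API, and the loop is written for clarity rather than speed.

```python
import numpy as np
import pandas as pd


def get_dummies_with_dtype(series, sep="|", dtype=np.uint8):
    """Hypothetical helper: indicator columns like Series.str.get_dummies,
    but allocated directly in the requested dtype."""
    # Split each entry into its set of tags; missing values yield no tags.
    tag_sets = [
        set(value.split(sep)) - {""} if isinstance(value, str) else set()
        for value in series
    ]
    # The sorted union of all tags defines the column order.
    columns = sorted(set().union(*tag_sets))
    position = {tag: i for i, tag in enumerate(columns)}

    # Allocate the indicator matrix once, in the small dtype, so no
    # int64 intermediate of the full shape is ever created.
    dummies = np.zeros((len(series), len(columns)), dtype=dtype)
    for row, tags in enumerate(tag_sets):
        for tag in tags:
            dummies[row, position[tag]] = 1
    return pd.DataFrame(dummies, index=series.index, columns=columns)


s = pd.Series(["a|b", "a", "b|c", None])
result = get_dummies_with_dtype(s, dtype=np.bool_)
print(result.dtypes)
```

Calling `s.str.get_dummies().astype(bool)` instead would not help with peak memory, because the full int64 array is allocated before the conversion.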

Alternative Solutions

N/A

Additional Context

pandas.Series.str.get_dummies is the easiest way in pandas to implement multi-hot encoding, so it would be great if it supported more data types. The int64 dtype it uses now can easily cause out-of-memory (OOM) errors. In fact, running into exactly this problem is what prompted me to request the feature:

Traceback (most recent call last):
  File "D:\CodeSpace\comp9727-assn2\preprocessing.py", line 13, in <module>
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\accessor.py", line 101, in wrapper
    return func(self, *args, **kwargs)
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\accessor.py", line 1919, in get_dummies
    result, name = self._data.array._str_get_dummies(sep)
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\object_array.py", line 369, in _str_get_dummies
    dummies = np.empty((len(arr), len(tags2)), dtype=np.int64)
numpy.core._exceptions.MemoryError: Unable to allocate 25.8 GiB for an array with shape (231637, 14942) and data type int64
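
The size in the error message follows directly from the reported shape and the 8-byte int64 element size. The short sketch below redoes that arithmetic and compares it with 1-byte dtypes; the choice of int8 and bool as alternatives is only an illustrative assumption.

```python
# Shape reported in the MemoryError above.
rows, cols = 231637, 14942

# Bytes per element for a few candidate dtypes.
itemsize = {"int64": 8, "int8": 1, "bool": 1}


def dense_size_gib(n_rows, n_cols, bytes_per_element):
    """Memory needed for a dense n_rows x n_cols array, in GiB."""
    return n_rows * n_cols * bytes_per_element / 2**30


sizes = {name: dense_size_gib(rows, cols, size) for name, size in itemsize.items()}
for name, gib in sizes.items():
    print(f"{name}: {gib:.1f} GiB")
# int64: 25.8 GiB
# int8: 3.2 GiB
# bool: 3.2 GiB
```

A 1-byte dtype cuts the allocation by a factor of eight, which is the saving this request is after.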

Metadata

    Labels

    Enhancement, Performance (Memory or execution speed performance), Strings (String extension data type and string data)
