Closed
Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Currently, pandas.Series.str.get_dummies always returns a result with data type numpy.int64. It would be nice if other data types could be specified.
Feature Description
Add a new parameter to str.get_dummies (for example, dtype) that sets the data type of the returned result.
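To make the request concrete, here is a minimal sketch of the behaviour I have in mind. Note that get_dummies_with_dtype and its dtype argument are hypothetical names used only for illustration; this is not an existing pandas API, and the pure-Python loop is not meant to be efficient.

```python
import numpy as np
import pandas as pd


def get_dummies_with_dtype(series, sep="|", dtype=np.uint8):
    """Hypothetical helper: indicator columns like str.get_dummies, with a chosen dtype."""
    # Split each string into tags; non-string values (e.g. NaN) yield no tags.
    tag_lists = [s.split(sep) if isinstance(s, str) else [] for s in series]
    # The sorted, unique, non-empty tags become the output columns.
    tags = sorted({t for tl in tag_lists for t in tl if t})
    col = {t: j for j, t in enumerate(tags)}
    # Allocate the result directly in the requested dtype instead of int64.
    out = np.zeros((len(series), len(tags)), dtype=dtype)
    for i, tl in enumerate(tag_lists):
        for t in tl:
            if t:
                out[i, col[t]] = 1
    return pd.DataFrame(out, index=series.index, columns=tags)


s = pd.Series(["a|b", "a", np.nan, "b|c"])
dummies = get_dummies_with_dtype(s, dtype=np.uint8)
print(dummies)
```

With uint8 (or bool) each cell takes 1 byte instead of 8, which is the whole point of the request.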
Alternative Solutions
N/A
Additional Context
Since pandas.Series.str.get_dummies is the easiest way in pandas to implement multi-hot encoding, it would be great if more data types were supported. The int64 data type used now can easily cause out-of-memory (OOM) errors on large inputs. In fact, running into this problem is what prompted me to request the feature here.
Traceback (most recent call last):
  File "D:\CodeSpace\comp9727-assn2\preprocessing.py", line 13, in <module>
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\accessor.py", line 101, in wrapper
    return func(self, *args, **kwargs)
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\accessor.py", line 1919, in get_dummies
    result, name = self._data.array._str_get_dummies(sep)
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\object_array.py", line 369, in _str_get_dummies
    dummies = np.empty((len(arr), len(tags2)), dtype=np.int64)
numpy.core._exceptions.MemoryError: Unable to allocate 25.8 GiB for an array with shape (231637, 14942) and data type int64
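For reference, the allocation size in the error can be reproduced with simple arithmetic, which also shows roughly how much a smaller data type would save. This is only a back-of-the-envelope sketch; dense_size_gib is a helper name made up for this example, and it counts only the dense array itself.

```python
import numpy as np


def dense_size_gib(rows, cols, dtype):
    """Size in GiB of a dense (rows, cols) array of the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize / 2**30


rows, cols = 231637, 14942  # shape reported in the MemoryError above

for dtype in (np.int64, np.int8, np.bool_):
    print(f"{np.dtype(dtype).name}: {dense_size_gib(rows, cols, dtype):.1f} GiB")
# int64: 25.8 GiB
# int8: 3.2 GiB
# bool: 3.2 GiB
```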