Closed
Versions
pd.show_versions()
INSTALLED VERSIONS
------------------
commit : f2c8480af2f25efdbd803218b9d87980f416563e
python : 3.8.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English
pandas : 1.2.3
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.2.1
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.21.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Description
When doing a groupby followed by a fillna (ffill in the example below) on a column of dtype StringDtype, pandas raises a ValueError: StringArray requires a sequence of strings or pandas.NA. This does not seem to happen with other (extension) dtypes afaik.
The error only occurs if there are still NAs left after the fillna. It does not occur without the groupby.
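The conditions described above can be checked side by side with a small sketch. The helper try_fill is a hypothetical name introduced here for illustration, and the outcome of the first case depends on the installed pandas version, so no output is asserted for it.

```python
import pandas as pd


def try_fill(label, fn):
    """Run fn and report whether it raised, without assuming either outcome."""
    try:
        fn()
        return (label, "ok")
    except ValueError as exc:
        return (label, f"ValueError: {exc}")


# Leading NA: nothing precedes it, so it is still NA after a forward fill.
leading_na = pd.DataFrame({"a": pd.array([None, "a"], dtype="string"), "b": [0, 0]})
# Trailing NA: a forward fill replaces it, so no NA is left afterwards.
trailing_na = pd.DataFrame({"a": pd.array(["a", None], dtype="string"), "b": [0, 0]})

results = [
    try_fill("groupby + ffill, NA remains", lambda: leading_na.groupby("b").ffill()),
    try_fill("groupby + ffill, NA filled", lambda: trailing_na.groupby("b").ffill()),
    try_fill("plain ffill, NA remains", lambda: leading_na.ffill()),
]

for label, outcome in results:
    print(f"{label}: {outcome}")
```

On the version reported here, only the first case is expected to report the ValueError; the other two should report ok.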
Example
>> pd.DataFrame({"a": pd.array([None, "a"], dtype="string"), "b": [0, 0]}).groupby("b").ffill()
Traceback (most recent call last):
  File "C:\venv\lib\site-packages\IPython\core\interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-22-a665c967b563>", line 1, in <module>
    pd.DataFrame({"a": pd.array([None, "a"], dtype="string"), "b": [0, 0]}).groupby("b").ffill()
  File "C:\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 1953, in pad
    return self._fill("ffill", limit=limit)
  File "C:\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 1919, in _fill
    return self._get_cythonized_result(
  File "C:\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 2673, in _get_cythonized_result
    result = algorithms.take_nd(values, result)
  File "C:\venv\lib\site-packages\pandas\core\algorithms.py", line 1699, in take_nd
    return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
  File "C:\venv\lib\site-packages\pandas\core\arrays\_mixins.py", line 78, in take
    return self._from_backing_data(new_data)
  File "C:\venv\lib\site-packages\pandas\core\arrays\numpy_.py", line 190, in _from_backing_data
    return type(self)(arr)
  File "C:\venv\lib\site-packages\pandas\core\arrays\string_.py", line 195, in __init__
    self._validate()
  File "C:\venv\lib\site-packages\pandas\core\arrays\string_.py", line 200, in _validate
    raise ValueError("StringArray requires a sequence of strings or pandas.NA")
ValueError: StringArray requires a sequence of strings or pandas.NA
>> pd.DataFrame({"a": pd.array(["a", None], dtype="string"), "b": [0, 0]}).groupby("b").ffill()
Out[31]:
   a
0  a
1  a
Edit: Updated pandas, still an issue in 1.2.3
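A possible workaround, sketched under the assumption that the failure is specific to StringArray validation after the take step shown in the traceback: cast the column to object before the grouped fill and back to string afterwards. This has not been verified against 1.2.3 and is only meant as a starting point.

```python
import pandas as pd

df = pd.DataFrame({"a": pd.array([None, "a"], dtype="string"), "b": [0, 0]})

# Assumption: an object column avoids the StringArray constructor check
# during the group-wise forward fill; the result is cast back to string.
filled = (
    df.astype({"a": object})
    .groupby("b")
    .ffill()
    .astype({"a": "string"})
)

print(filled)
```

The leading value stays missing because nothing precedes it within its group, which is the same result a plain ffill without the groupby gives.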