BUG: Error writing DataFrame with categorical type column and "Int" data to a CSV file ("int" works of course) #46812

Closed
@eason9

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

d = {'name': ["bob", "todd", "sarah", "john"],
     'gp': [1, 2, np.nan, 2],
     'score': [90, 40, 80, 98]}
df = pd.DataFrame(d)
df.name = df.name.astype("category")
df.gp = df.gp.astype("Int16")     # nullable integer dtype (capital "I")
df.gp = df.gp.astype("category")  # categorical over Int16 categories

print('-pandas version: ', pd.__version__)
print('-df dtypes:\n', df.dtypes)
print('-df.gp:\n', df.gp)
df.to_csv("test.csv")  # raises TypeError on pandas 1.4.2

Issue Description

This bug is similar to #46297. Running the example above on pandas 1.4.2 produces the error below; the same code works on older pandas versions, e.g. 1.2.4. The problem occurs when saving a DataFrame to a .csv file if a categorical column is built on top of a nullable "Int" dtype (note the capital "I"; the default NumPy "int" dtype is not affected).

-pandas version:  1.4.2
-df dtypes:
 name     category
gp       category
score       int64
dtype: object
-df.gp:
 0      1
1      2
2    NaN
3      2
Name: gp, dtype: category
Categories (2, Int16): [1, 2]
Traceback (most recent call last):

  File "<ipython-input-28-995d10a4922f>", line 15, in <module>
    df.to_csv("test.csv")

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\generic.py", line 3551, in to_csv
    return DataFrameRenderer(formatter).to_csv(

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\io\formats\format.py", line 1180, in to_csv
    csv_formatter.save()

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\io\formats\csvs.py", line 261, in save
    self._save()

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\io\formats\csvs.py", line 266, in _save
    self._save_body()

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\io\formats\csvs.py", line 304, in _save_body
    self._save_chunk(start_i, end_i)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\io\formats\csvs.py", line 311, in _save_chunk
    res = df._mgr.to_native_types(**self._number_format)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\internals\managers.py", line 473, in to_native_types
    return self.apply("to_native_types", **kwargs)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\internals\managers.py", line 304, in apply
    applied = getattr(b, f)(**kwargs)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\internals\blocks.py", line 634, in to_native_types
    result = to_native_types(self.values, na_rep=na_rep, quoting=quoting, **kwargs)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\internals\blocks.py", line 2163, in to_native_types
    values = take_nd(

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\array_algos\take.py", line 114, in take_nd
    return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\arrays\masked.py", line 653, in take
    return type(self)(result, mask, copy=False)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\arrays\integer.py", line 315, in __init__
    raise TypeError(

TypeError: values should be integer numpy array. Use the 'pd.array' function instead
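
The last frame points at the `IntegerArray` constructor, which accepts only an integer NumPy array plus a boolean mask. The sketch below isolates that check. The helper name `constructor_rejects` is hypothetical, and passing an object array that holds an empty string is only an assumption about what might reach the constructor; the traceback itself does not confirm it.

```python
import numpy as np
import pandas as pd


def constructor_rejects(values: np.ndarray) -> bool:
    """Return True if pd.arrays.IntegerArray raises TypeError for `values`."""
    # Hypothetical helper: the mask marks no entry as missing.
    mask = np.zeros(len(values), dtype=bool)
    try:
        pd.arrays.IntegerArray(values, mask)
    except TypeError:
        return True
    return False


# An integer NumPy array is what the constructor expects.
print(constructor_rejects(np.array([1, 2, 0, 2], dtype="int16")))

# An object array (assumed here to hold a string placeholder) is not.
print(constructor_rejects(np.array([1, 2, "", 2], dtype=object)))
```

If that assumption holds, the failure would come from a non-integer value ending up in the array handed to the constructor, rather than from the CSV writer itself.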

Expected Behavior

I'd expect the example above to write the DataFrame to a .csv file, as it does in older pandas versions, instead of raising a TypeError.
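
A minimal sketch of how this expectation could be checked without touching the filesystem, assuming an in-memory `io.StringIO` buffer is an acceptable stand-in for "test.csv". The helper name `csv_or_none` is hypothetical and not part of pandas.

```python
import io

import numpy as np
import pandas as pd


def csv_or_none(frame: pd.DataFrame):
    """Return the CSV text for `frame`, or None if to_csv raises TypeError."""
    buf = io.StringIO()
    try:
        frame.to_csv(buf)
    except TypeError:
        return None
    return buf.getvalue()


df = pd.DataFrame({
    "name": ["bob", "todd", "sarah", "john"],
    "gp": [1, 2, np.nan, 2],
    "score": [90, 40, 80, 98],
})
# Categorical built on top of the nullable Int16 dtype, as in the report.
df["gp"] = df["gp"].astype("Int16").astype("category")

text = csv_or_none(df)
print("to_csv raised TypeError" if text is None else text)
```

On an affected version the helper should return None; on a version without the regression it should return the CSV text, header row first.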

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 8 Stepping 2, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en
LOCALE : English_United States.1252

pandas : 1.4.2
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : 0.29.21
pytest : 6.1.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
brotli :
fastparquet : None
fsspec : 0.8.3
gcsfs : None
markupsafe : 1.1.1
matplotlib : 3.3.2
numba : 0.51.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
snappy : None
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
zstandard : None

Labels

  • Bug
  • Categorical (Categorical Data Type)
  • IO CSV (read_csv, to_csv)
  • NA - MaskedArrays (Related to pd.NA and nullable extension arrays)
  • Regression (Functionality that used to work in a prior pandas version)
