
Setting with enlargement on categorical data #25383

Open
@0phoff

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({
    'reg': [0, 1, 2],
    'cat': pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd']),
})
print(df.dtypes)  # reg is int64, cat is category

df.loc[3] = (3, 'c')  # add a row whose categorical value exists in the categories
print(df.dtypes)  # reg is int64, cat is **object**

Problem description

The dtype changes silently, without any warning. In this toy example, the information that 'd' is also a valid category is lost, so simply calling astype('category') afterwards would not restore the original categories.
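
As a stopgap, the original dtype can be captured before the enlargement and re-applied afterwards. This is a minimal sketch, assuming the new value is one of the existing categories (a value outside them would become NaN on the cast); `original_dtype` is only an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "reg": [0, 1, 2],
    "cat": pd.Categorical(["a", "b", "b"], categories=["a", "b", "c", "d"]),
})

# Remember the full CategoricalDtype, including the unused categories.
original_dtype = df["cat"].dtype

# Setting with enlargement; on pandas 0.24.0 this turns "cat" into object.
df.loc[3] = (3, "c")

# Cast back to the saved dtype so that 'd' is still a known category.
df["cat"] = df["cat"].astype(original_dtype)
print(df["cat"].cat.categories.tolist())
```

Unlike astype('category'), casting to the saved dtype keeps categories that do not occur in the data.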

I couldn't find an existing issue about this. I did find related reports, though: performing concat and append on categoricals also changes the dtype. I would love those functions to have a keyword to control that behaviour (e.g. take the union of the categories), but that is a separate issue that has already been discussed. I mention it only to signal that there is demand for this feature, as an alternative to working around it with pandas.api.types.union_categoricals.
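
For context, the union_categoricals workaround mentioned above looks roughly like this. It is a small sketch with made-up data; `left`, `right`, `plain` and `merged` are illustrative names only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype, union_categoricals

left = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b"]))
right = pd.Series(pd.Categorical(["c"], categories=["c", "d"]))

# Plain concat: the categories differ, so the result is no longer categorical.
plain = pd.concat([left, right], ignore_index=True)
print(isinstance(plain.dtype, CategoricalDtype))

# union_categoricals combines the values and takes the union of the categories.
merged = pd.Series(union_categoricals([left, right]))
print(merged.cat.categories.tolist())
```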

Expected Output

Keep the categorical dtype if the added value is in the list of categories, and raise an error or warning otherwise.
Anyone who doesn't care about the categorical dtype can always call .astype('object') before adding the row.

I think this solution is also in the spirit of 'explicit is better than implicit'.
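
The requested behaviour can be approximated in user code. The helper below is a hypothetical sketch, not a pandas API: `append_row_strict` and its parameters are names invented for illustration. It assumes one value per column, in column order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def append_row_strict(df, label, values):
    """Hypothetical helper: add a row, keep categorical dtypes,
    and raise if a value is not among a column's categories."""
    values = list(values)
    for col, value in zip(df.columns, values):
        dtype = df[col].dtype
        if not isinstance(dtype, CategoricalDtype):
            continue
        # Missing values are allowed in any categorical column.
        if not pd.isna(value) and value not in dtype.categories:
            raise ValueError(f"{value!r} is not a category of column {col!r}")

    saved_dtypes = df.dtypes
    out = df.copy()
    out.loc[label] = values  # setting with enlargement
    # Re-apply the saved categorical dtypes in case they were lost.
    for col, dtype in saved_dtypes.items():
        if isinstance(dtype, CategoricalDtype):
            out[col] = out[col].astype(dtype)
    return out


df = pd.DataFrame({
    "reg": [0, 1, 2],
    "cat": pd.Categorical(["a", "b", "b"], categories=["a", "b", "c", "d"]),
})
out = append_row_strict(df, 3, (3, "c"))  # 'c' is a known category
print(out.dtypes)

try:
    append_row_strict(df, 4, (4, "z"))  # 'z' is not a category
except ValueError as exc:
    print("rejected:", exc)
```

Returning a copy instead of modifying df in place is a design choice of this sketch; it leaves the caller's frame untouched when the check fails.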

Output of pd.show_versions()

INSTALLED VERSIONS
------------------

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-33-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: 4.1.1
pip: 18.1
setuptools: 40.2.0
Cython: 0.29.2
numpy: 1.16.0
scipy: 1.1.0
pyarrow: 0.12.0
xarray: None
IPython: 6.5.0
sphinx: 1.7.9
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
