Description
Code Sample, a copy-pastable example if possible
import pandas as pd
df = pd.DataFrame.from_dict({'reg': [0,1,2], 'cat':pd.Categorical(['a','b','b'], categories=['a','b','c','d'])})
print(df.dtypes) # reg is int64, cat is categorical
df.loc[3] = (3, 'c') # add a row whose categorical value already exists in the categories
print(df.dtypes) # reg is int64, cat is **object**
Problem description
There is no warning whatsoever, but the dtype still changes. In this dummy example, this means we lose all information about the fact that 'd' is also a possible value. (So simply doing astype('category') afterwards wouldn't work here.)
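To illustrate why a plain astype('category') does not recover the lost information, here is a minimal sketch. It assumes the column has already ended up as object dtype (the Series below is built by hand to mimic that state) and shows that the categories are then inferred only from the values that are present:

```python
import pandas as pd

# The categories the column was originally declared with.
declared = ['a', 'b', 'c', 'd']

# Hand-built stand-in for the column after its dtype has become object.
as_object = pd.Series(['a', 'b', 'b', 'c'], dtype=object)

# Re-casting infers the categories from the observed values only.
recast = as_object.astype('category')
inferred = list(recast.cat.categories)

print(inferred)                      # ['a', 'b', 'c']
print(set(declared) - set(inferred)) # {'d'}
```

The unused category 'd' is gone, because nothing in the data mentions it anymore.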
I couldn't seem to find an issue about this. However, I did find a few related things, such as concat and append on categoricals also changing dtypes. I would love these functions to have a keyword to control that behaviour (e.g. perform a union of the categories), but this is a different issue that has already been discussed... (just letting you know that there are people out there who would love this feature, instead of having to meddle with pandas.api.types.union_categoricals)
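For reference, this is roughly what the union_categoricals route looks like today. It is only a small sketch with made-up data; the variable names are illustrative:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two categoricals whose category lists overlap but are not identical.
left = pd.Categorical(['a', 'b'], categories=['a', 'b'])
right = pd.Categorical(['c', 'b'], categories=['b', 'c'])

# Combine the values while taking the union of the categories.
combined = union_categoricals([left, right])

print(list(combined))             # ['a', 'b', 'c', 'b']
print(list(combined.categories))  # ['a', 'b', 'c']
```

The result is still a Categorical, but it has to be built by hand outside of concat/append, which is the meddling referred to above.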
Expected Output
Keep the categorical dtype if the added value is in the list of categories, throw an error/warning otherwise.
If people don't care about the categorical, they can always call .astype('object') before adding the row?
I think this solution is also in the spirit of 'explicit is better than implicit'?
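In the meantime, the expected behaviour can be approximated in user code. The sketch below is only a workaround under my own assumptions: add_row_keeping_categoricals is a hypothetical helper, not a pandas API. It checks the new values against the declared categories, raises if one is unknown, and re-applies the original categorical dtypes after the row has been added:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def add_row_keeping_categoricals(df, label, values):
    """Hypothetical helper (not part of pandas): add a row in place,
    refusing values that are not declared categories."""
    original_dtypes = df.dtypes.to_dict()
    row = dict(zip(df.columns, values))

    # Validate before touching the frame, so a failure leaves it unchanged.
    for col, dtype in original_dtypes.items():
        if isinstance(dtype, CategoricalDtype) and row[col] not in dtype.categories:
            raise ValueError(f"{row[col]!r} is not a category of column {col!r}")

    df.loc[label] = values

    # Restore the categorical dtypes (a no-op if they were preserved).
    for col, dtype in original_dtypes.items():
        if isinstance(dtype, CategoricalDtype):
            df[col] = df[col].astype(dtype)
    return df


df = pd.DataFrame({
    'reg': [0, 1, 2],
    'cat': pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd']),
})
add_row_keeping_categoricals(df, 3, (3, 'c'))
print(df['cat'].cat.categories.tolist())  # ['a', 'b', 'c', 'd']
```

Because the original CategoricalDtype is captured up front, the unused category 'd' survives, and a value such as 'z' raises a ValueError instead of silently changing the dtype.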
Output of pd.show_versions()
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-33-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0
pytest: 4.1.1
pip: 18.1
setuptools: 40.2.0
Cython: 0.29.2
numpy: 1.16.0
scipy: 1.1.0
pyarrow: 0.12.0
xarray: None
IPython: 6.5.0
sphinx: 1.7.9
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None