Description
Follow-up of #14545.
We had a long discussion on what the behaviour of `concat` should be when you have categorical data: #13767. In the end, for 0.19.0, we changed the behaviour from raising an error when the categories didn't match to returning object-dtyped data (only data with identical categories and the same ordered attribute gives a categorical as a result). The table below summarizes the changes between 0.18.1 and 0.19.0:
For categorical Series:
left | right | append/concat 0.18 | append/concat 0.19.0 |
---|---|---|---|
category | category (identical categories) | category | category |
category | category (different categories) | error | object |
category | not category | category | object |
category | not category (different categories) | category with NaNs | object |
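As a quick way to check these rows on a given installation, here is a minimal sketch (the variable names are made up for illustration, and it assumes pandas 0.24 or later so that `pd.CategoricalDtype` is available at the top level). It only distinguishes "categorical" from "not categorical", because the exact fallback dtype may differ between pandas versions:

```python
import pandas as pd

def is_categorical(series):
    # True when the Series still carries a categorical dtype.
    return isinstance(series.dtype, pd.CategoricalDtype)

# Two Series with identical categories (['a', 'b']) and the same ordered flag.
same_a = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b"]))
same_b = pd.Series(pd.Categorical(["b", "a"], categories=["a", "b"]))
# A categorical Series whose categories differ (['a', 'c']).
other = pd.Series(pd.Categorical(["a", "c"], categories=["a", "c"]))
# A non-categorical Series.
plain = pd.Series(["a", "b"], dtype=object)

identical = pd.concat([same_a, same_b], ignore_index=True)
different = pd.concat([same_a, other], ignore_index=True)
mixed = pd.concat([same_a, plain], ignore_index=True)

for label, result in [("identical", identical), ("different", different), ("mixed", mixed)]:
    print(label, result.dtype, "categorical" if is_categorical(result) else "not categorical")
```

With the 0.19.0 rules described above, only the first result should stay categorical.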
However, we didn't change the behaviour of `append` for Indexes (the `append` above is for Series):
For `CategoricalIndex`:
left | right | append 0.18 | append 0.19.0 | append 0.19.1 |
---|---|---|---|---|
category | category (identical categories) | category | category | category |
category | category (different categories) | error | error | error |
category | not category | category | category | category |
category | not category (with other values) | error | error | error |
not category | category (with other values) | object | error | object |
The last row, i.e. the case where the calling Index is not a CategoricalIndex, changed by accident in 0.19.0; this is what I corrected in PR #14545 for 0.19.1.
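To see which outcome a particular pandas version gives for each of these cases, something like the following sketch can be used. `describe_append` is a hypothetical helper written for this illustration, not part of pandas; it reports the resulting dtype, or the type of the error if `append` raises. The outcomes are deliberately not hard-coded, since they differ between versions:

```python
import pandas as pd

def describe_append(left, right):
    """Return the dtype name of left.append(right), or the error type raised."""
    try:
        return str(left.append(right).dtype)
    except (TypeError, ValueError) as exc:
        return "error: " + type(exc).__name__

cat_idx = pd.CategoricalIndex(["a", "b", "c"])

results = {
    "identical categories": describe_append(cat_idx, pd.CategoricalIndex(["a", "b", "c"])),
    "values within categories": describe_append(cat_idx, pd.Index(["a"])),
    "values outside categories": describe_append(cat_idx, pd.Index(["d"])),
    "non-categorical caller": describe_append(pd.Index(["d"]), cat_idx),
}

for case, outcome in results.items():
    print(case, "->", outcome)
```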
Questions:
- Do we want the same behaviour for `Index.append` as we now have for `Series.append` with categorical data? This means that the result column in the table above becomes 'object' in every row apart from the first.
- Do we want to make an exception for the case where the values in the 'right' correspond with the categories? (so that `pd.CategoricalIndex(['a', 'b', 'c']).append(pd.Index(['a']))` keeps working)
Changing this to always return object dtype, except for categoricals with identical categories, is easy, but gives a few failures in our test suite. Namely, in some indexing tests (indexing a DataFrame with a CategoricalIndex) the behaviour changes, because indexing with a value that does not exist in the index was performed using `CategoricalIndex.append()`. But we can of course work around this in the indexing code.