Description
I'm currently working on refactoring the code of DataFrame.append
(#22915). One question that came up was: what should the behavior be when appending DataFrames with different index types?
To simplify this, let's put the focus on the index itself, and assume that it is valid for both rows and columns (IMHO consistency between the two might be good).
Code Sample
>>> df1 = pd.DataFrame([[1, 2, 3]], columns=index1)
>>> df2 = pd.DataFrame([[4, 5, 6]], columns=index2)
>>> # what should happen in the next line when index1 and
>>> # index2 are of different types?
>>> result = df1.append(df2)
Current Behavior
Index 0 (rows)
All types of indexes work together, except for:
CategoricalIndex + {another}
When the types don't match, the result is cast to object.
There's also a bug that happens when appending MultiIndex + DatetimeIndex. I will verify it in more detail later and raise an issue.
Index 1 (columns)
All types of indexes work together, except for:
IntervalIndex + {another}
MultiIndex + {another}
{another} + PeriodIndex
{another} + IntervalIndex
{another} + MultiIndex
DatetimeIndex + {another} (works for df.append(series))
TimedeltaIndex + {numeric} (works for df.append(series))
PeriodIndex + {another} (works for df.append(series))
The kinds of exceptions raised here are the most varied and sometimes very cryptic.
For example:
>>> df1 = pd.DataFrame([[1]], columns=['A'])
>>> df2 = pd.DataFrame([[2]], columns=pd.PeriodIndex([2000], freq='A'))
>>> df1.append(df2)
Traceback (most recent call last):
File "<ipython-input-15-8ab0723181fb>", line 1, in <module>
df1.append(df2)
File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/frame.py", line 6211, in append
sort=sort)
File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 226, in concat
return op.get_result()
File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 423, in get_result
copy=self.copy)
File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/internals.py", line 5416, in concatenate_block_managers
elif is_uniform_join_units(join_units):
File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/internals.py", line 5440, in is_uniform_join_units
all(not ju.is_na or ju.block.is_extension for ju in join_units) and
File "/home/andre/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/internals.py", line 5440, in <genexpr>
all(not ju.is_na or ju.block.is_extension for ju in join_units) and
AttributeError: 'NoneType' object has no attribute 'is_extension'
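The two lists above depend on the pandas version, so it may help to regenerate them rather than trust them. Below is a minimal sketch of such a survey. It assumes that pd.concat is an acceptable stand-in for DataFrame.append (the traceback above shows append delegating to concat); the SAMPLES table and the probe helper are hypothetical names introduced only for this illustration, and the sample indexes are not exhaustive.

```python
import itertools

import pandas as pd

# One small sample per index type; the selection is illustrative, not exhaustive.
SAMPLES = {
    "int": pd.Index([1, 2]),
    "float": pd.Index([1.5, 2.5]),
    "str": pd.Index(["a", "b"]),
    "datetime": pd.date_range("2000-01-01", periods=2),
    "timedelta": pd.timedelta_range("1 day", periods=2),
    "period": pd.period_range("2000-01-01", periods=2, freq="D"),
    "interval": pd.interval_range(0, 2),
    "categorical": pd.CategoricalIndex(["x", "y"]),
    "multi": pd.MultiIndex.from_tuples([(1, "a"), (2, "b")]),
}


def probe(left, right, axis):
    """Label two frames with `left` and `right` on `axis`, then concatenate them."""
    if axis == "rows":
        df1 = pd.DataFrame({"v": [0, 1]}, index=left)
        df2 = pd.DataFrame({"v": [2, 3]}, index=right)
    else:
        df1 = pd.DataFrame([[1, 2]], columns=left)
        df2 = pd.DataFrame([[3, 4]], columns=right)
    try:
        result = pd.concat([df1, df2], sort=False)
    except Exception as exc:  # record the failure instead of stopping the survey
        return "error: " + type(exc).__name__
    index = result.index if axis == "rows" else result.columns
    return f"{type(index).__name__}[{index.dtype}]"


# Survey every ordered pair of sample types on both axes.
report = {
    (axis, a, b): probe(SAMPLES[a], SAMPLES[b], axis)
    for axis in ("rows", "columns")
    for a, b in itertools.product(SAMPLES, repeat=2)
}

for key, outcome in report.items():
    print(key, "->", outcome)
```

Each entry is either the resulting index class and dtype or the name of the exception raised, so the pairs that fail, and how cryptically, can be read off directly for whichever pandas version is installed.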
Suggestion
Since index 0 already allows almost all types of index to be concatenated, we don't want to restrict this behavior. My suggestion is to allow any types of indexes to be merged together (and possibly raise a RuntimeWarning when a cast is necessary).
For example, joining two indexes of the same type should produce the same type (as in the current behavior):
>>> df1 = pd.DataFrame([[1]], columns=pd.Int64Index([1]))
>>> df2 = pd.DataFrame([[2]], columns=pd.Int64Index([2]))
>>> result = df1.append(df2)
>>> result
1 2
0 1.0 NaN
0 NaN 2.0
>>> result.columns
Int64Index([1, 2], dtype='int64')
Joining two indexes of different types usually upcasts to object (only when necessary):
>>> df1 = pd.DataFrame([[1]], columns=pd.Int64Index([1]))
>>> df2 = pd.DataFrame([[2]], columns=pd.interval_range(0, 1))
>>> result = df1.append(df2)
>>> result
1 (0, 1]
0 1.0 NaN
0 NaN 2.0
>>> result.columns
Index([1, (0, 1]], dtype='object')
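To make the suggestion concrete, here is a rough sketch of the rule for the row case, where labels are simply concatenated. merge_indexes is a hypothetical helper, not an existing pandas function, and it is deliberately cruder than the proposal: it casts every mismatched pair to object, whereas the proposal would upcast only when necessary (int + float could stay numeric, for instance).

```python
import warnings

import pandas as pd


def merge_indexes(left, right):
    """Hypothetical helper: concatenate two indexes, warning on a cast to object."""
    # Same class and same dtype: keep the type, as pandas already does.
    if type(left) is type(right) and left.dtype == right.dtype:
        return left.append(right)
    # Anything else: fall back to a plain object index so every pair can be combined.
    merged = pd.Index(list(left) + list(right), dtype=object)
    warnings.warn(
        f"combining {type(left).__name__} and {type(right).__name__} "
        "casts the result to object",
        RuntimeWarning,
        stacklevel=2,
    )
    return merged


same = merge_indexes(pd.Index([1]), pd.Index([2]))              # no warning
mixed = merge_indexes(pd.Index([1]), pd.interval_range(0, 1))   # RuntimeWarning
print(same.dtype, mixed.dtype)
```

For columns the same rule would apply to a union of the labels rather than a concatenation.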
Empty indexes
I believe that empty indexes (e.g. those created in a DataFrame with no rows or columns) should be ignored when calculating the final dtype of an index merge. This already seems to be the current behavior:
>>> df1 = pd.DataFrame()
>>> df2 = pd.DataFrame(index=[1])
>>> df1.index
Index([], dtype='object')
>>> df2.index
Int64Index([1], dtype='int64')
>>> df1.append(df2).index
Int64Index([1], dtype='int64')
However, when an empty index has a dtype different from object, we may want to preserve it (as it may have been created explicitly by the user).
>>> # current behavior
>>> df1 = pd.DataFrame(index=pd.Float64Index([]))
>>> df2 = pd.DataFrame(index=[1])
>>> df1.append(df2).index
Int64Index([1], dtype='int64')
>>>
>>> # suggested behavior
>>> df1.append(df2).index
Float64Index([1.0], dtype='float64')
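The empty-index rule could be sketched roughly as follows. Both is_placeholder and append_index are hypothetical helpers written for this illustration. The sketch assumes that an empty object-dtype index carries no type intent, and that an explicitly typed empty index should impose its dtype on the other side whenever that cast is possible.

```python
import pandas as pd


def is_placeholder(index):
    """True for an empty object-dtype index, assumed here to carry no type intent."""
    return len(index) == 0 and index.dtype == object


def append_index(left, right):
    """Hypothetical helper: append two indexes, honoring the dtype of typed empties."""
    # Untyped empties are ignored entirely.
    if is_placeholder(left):
        return right
    if is_placeholder(right):
        return left
    # A typed empty index imposes its dtype on the non-empty side when possible.
    for typed_empty, other in ((left, right), (right, left)):
        if len(typed_empty) == 0 and len(other) > 0:
            try:
                return other.astype(typed_empty.dtype)
            except (TypeError, ValueError):
                return other  # incompatible dtypes: keep the non-empty side as-is
    return left.append(right)


placeholder = pd.Index([], dtype=object)
typed_empty = pd.Index([], dtype="float64")
labels = pd.Index([1])

print(append_index(placeholder, labels).dtype)  # untyped empty is ignored
print(append_index(typed_empty, labels).dtype)  # typed empty is preserved
```

Falling back to the non-empty side when the cast fails is only one possible choice; raising or upcasting to object would be equally defensible.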
Sorry if this was too long. This is my first contribution to open source and I still haven't got the hang of how things work. Any suggestions, whether related to the issue or meta, are welcome!