pd.concat() crashes if dataframe contains duplicate indices but not df.join()

I just found out that when two DataFrames are concatenated horizontally and one of them has duplicate index values, `pd.concat()` raises a `ValueError`, while `df.join()` does not. Instead, `df.join()` spreads the values across all rows that share the same index value. Is this behavior by design? Thanks!

```
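import numpy as np
import pandas as pd
from IPython.display import display  # available by default in Jupyter

# Both frames contain a duplicated index label (3 in df1, 2 in df2).
df1 = pd.DataFrame(np.random.randn(5), index=[0,1,2,3,3], columns=['a'])
df2 = pd.DataFrame(np.random.randn(5), index=[0,1,2,2,4], columns=['b'])
dfj = df1.join(df2, how='outer')  # works: values are repeated for duplicated labels
display(df1, df2, dfj)
dfc = pd.concat([df1, df2], axis=1)  # raises ValueError (see traceback below)
display(dfc)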
```
<img width="165" alt="image" src="https://user-images.githubusercontent.com/10172392/92682234-6e3c1f80-f362-11ea-8b07-21ffe291ad5c.png">

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-150ce3e802ea> in <module>
      3 dfj = df1.join(df2, how='outer')
      4 display(df1, df2, dfj)
----> 5 dfc = pd.concat([df1, df2], axis=1)
      6 display(dfc)

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    282     )
    283 
--> 284     return op.get_result()
    285 
    286 

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py in get_result(self)
    495 
    496             new_data = concatenate_block_managers(
--> 497                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy
    498             )
    499             if not self.copy:

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   2025         blocks.append(b)
   2026 
-> 2027     return BlockManager(blocks, axes)

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    137 
    138         if do_integrity_check:
--> 139             self._verify_integrity()
    140 
    141         self._consolidate_check()

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    332         for block in self.blocks:
    333             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 334                 construction_error(tot_items, block.shape[1:], self.axes)
    335         if len(self.items) != tot_items:
    336             raise AssertionError(

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
   1692     if block_shape[0] == 0:
   1693         raise ValueError("Empty data passed with indices specified.")
-> 1694     raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
   1695 
   1696 

ValueError: Shape of passed values is (9, 2), indices imply (7, 2)
```
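Until this is resolved, the failure can at least be anticipated by checking the indexes before concatenating. The sketch below uses the existing `Index.is_unique` and `Index.duplicated()` APIs; the helper name `duplicated_labels` is made up for illustration and is not part of pandas.

```python
import pandas as pd


def duplicated_labels(df):
    """Hypothetical helper: return the index labels that occur more than once."""
    # Index.duplicated() marks every repeat after the first occurrence.
    return df.index[df.index.duplicated()].unique()


df1 = pd.DataFrame({"a": range(5)}, index=[0, 1, 2, 3, 3])
df2 = pd.DataFrame({"b": range(5)}, index=[0, 1, 2, 2, 4])

for name, df in [("df1", df1), ("df2", df2)]:
    if not df.index.is_unique:
        print(f"{name} has duplicate index labels: {list(duplicated_labels(df))}")
```

This only reports the problem; it does not decide how the duplicated rows should be aligned.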

Ideally, when the DataFrames have duplicate index values, `pd.concat()` would behave like `df.join()`; at the very least, it should NOT crash.
I suggest we introduce an additional argument to handle duplicate indices. For example, if the same index value has X (>0) rows in df1 and Y (>0) rows in df2, then `dup_index=` could take one of:

1. `combinatorial`: after merging, it will have `X*Y` rows, one for every possible combination.
2. `outer-top-align`: after merging, it will have `max(X, Y)` rows, with the rows aligned from the top.
3. `outer-bottom-align`: after merging, it will have `max(X, Y)` rows, with the rows aligned from the bottom.
4. `inner-top-align`: after merging, it will have `min(X, Y)` rows, with the rows aligned from the top.
5. `inner-bottom-align`: after merging, it will have `min(X, Y)` rows, with the rows aligned from the bottom.
6. `raise`: raise an exception with a descriptive error message.
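For reference, two of these modes can be approximated today with existing pandas calls. `combinatorial` is what `df1.join(df2, how='outer')` already does. The top-aligned modes can be sketched by numbering each repeat of an index label with `groupby(level=0).cumcount()` and concatenating on the (label, occurrence) pair, which is unique. The function `concat_top_aligned` and the level name `occurrence` are assumptions for illustration, not a pandas API, and the sketch assumes a single-level index.

```python
import pandas as pd


def concat_top_aligned(left, right, join="outer"):
    """Hypothetical helper: align duplicated index labels by order of appearance.

    join="outer" keeps max(X, Y) rows per label, join="inner" keeps min(X, Y).
    """
    def with_occurrence(df):
        # Number each repeat of a label: 0 for the first row, 1 for the second, ...
        occurrence = df.groupby(level=0).cumcount().rename("occurrence")
        # (label, occurrence) is unique, so concat can align on it.
        return df.set_index(occurrence, append=True)

    combined = pd.concat(
        [with_occurrence(left), with_occurrence(right)], axis=1, join=join
    )
    # Sorting groups the rows by label; this reorders an unsorted index.
    return combined.sort_index().droplevel("occurrence")


df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=[0, 1, 2, 3, 3])
df2 = pd.DataFrame({"b": [10.0, 20.0, 30.0, 40.0, 50.0]}, index=[0, 1, 2, 2, 4])

outer = concat_top_aligned(df1, df2)                # outer-top-align
inner = concat_top_aligned(df1, df2, join="inner")  # inner-top-align
print(outer)
print(inner)
```

The bottom-aligned variants could presumably be built the same way by passing `ascending=False` to `cumcount()`, though that is not shown here.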

pd.concat() crashes if dataframe contains duplicate indices but not df.join() #36263
