
BUG: Concat automatically sorts index when axis=1  #37937

Closed
@marlenezw

Problem description

Hi! I've noticed that when concatenating DataFrames along axis=1, an integer index is automatically sorted even when sort=False. This does not happen along axis=1 with a string index, and it also differs from the behaviour along axis=0. The inconsistency is a bit confusing, so I thought it might be a bug.

In the _unique_indices API, integer indexes are combined with a union, result = result.union(other), which sorts the values by default. So the resulting index is sorted whether or not sort=True is passed.
I think this can be fixed by changing the code to
result = result.union(other, sort=sort) or something along those lines.
I've added examples below. Thank you :D
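
To make the sorting default concrete, here is a minimal sketch that calls Index.union directly on the same integer labels used in the int-index example. It only illustrates the public Index.union behaviour; the variable names are mine, and it does not reproduce the internal concat code path.

```python
import pandas as pd

# Same integer labels as p_int2 and p_int1 in the int-index example.
left = pd.Index([3, 1, 6])
right = pd.Index([1, 2, 3])

# Default (sort=None): the union of these two indexes comes back sorted.
sorted_union = left.union(right)

# sort=False: labels of `left` first, then the labels only found in `right`.
unsorted_union = left.union(right, sort=False)

print(list(sorted_union))    # [1, 2, 3, 6]
print(list(unsorted_union))  # [3, 1, 6, 2]
```

The second result matches the row order shown in the expected output, which is why passing sort through to union looks like a plausible fix.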

Example of difference between stringIndex and intIndex behaviour

# This is what happens with a string index (the union keeps the labels in input order)
>>> p1 = pd.DataFrame(
...                 {"a": [1, 2, 3], "b": [4, 5, 6]}, index=["p", "q", "r"]
...             )
>>> p2 = pd.DataFrame(
...                 {"c": [7, 8, 9], "d": [10, 11, 12]}, index=["r", "p", "z"]
...             )
>>> pd.concat([p1,p2], axis=1)
     a    b    c     d
p  1.0  4.0  8.0  11.0
q  2.0  5.0  NaN   NaN
r  3.0  6.0  7.0  10.0
z  NaN  NaN  9.0  12.0
>>> pd.concat([p2,p1], axis=1)
     c     d    a    b
r  7.0  10.0  3.0  6.0
p  8.0  11.0  1.0  4.0
z  9.0  12.0  NaN  NaN
q  NaN   NaN  2.0  5.0

# This is what happens with an int index (sorted, regardless of the order of the inputs)

>>> p_int1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[1,2,3])
>>> p_int2 = pd.DataFrame({"c": [7, 8, 9], "d": [10, 11, 12]}, index=[3,1,6])
>>> pd.concat([p_int1,p_int2], axis=1)
     a    b    c     d
1  1.0  4.0  8.0  11.0
2  2.0  5.0  NaN   NaN
3  3.0  6.0  7.0  10.0
6  NaN  NaN  9.0  12.0
>>> pd.concat([p_int2,p_int1], axis=1)
     c     d    a    b
1  8.0  11.0  1.0  4.0
2  NaN   NaN  2.0  5.0
3  7.0  10.0  3.0  6.0
6  9.0  12.0  NaN  NaN

Example of expected output

>>> p_int1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[1,2,3])
>>> p_int2 = pd.DataFrame({"c": [7, 8, 9], "d": [10, 11, 12]}, index=[3,1,6])
>>> pd.concat([p_int2,p_int1], axis=1, sort=False)

     c     d    a    b
3  7.0  10.0  3.0  6.0
1  8.0  11.0  1.0  4.0
6  9.0  12.0  NaN  NaN
2  NaN   NaN  2.0  5.0
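
Until the behaviour changes, one possible workaround (a sketch, not a proposed patch) is to build the order-preserving union of the row labels yourself and reindex the concatenated frame to it. The names wanted and result are illustrative only.

```python
import pandas as pd

p_int1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[1, 2, 3])
p_int2 = pd.DataFrame({"c": [7, 8, 9], "d": [10, 11, 12]}, index=[3, 1, 6])

# Order-preserving union of the row labels, following the input order.
wanted = p_int2.index.union(p_int1.index, sort=False)

# Reindex the concatenated frame so its rows follow that order,
# whether or not concat sorted the integer index.
result = pd.concat([p_int2, p_int1], axis=1).reindex(wanted)
print(result)
```

This relies on the row labels being unique across the result, which holds for the frames in this report.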

Output of pd.show_versions() (pandas 1.1.4)

<details>

>>> pd.concat([p_int2,p_int1], axis=1)
     c     d    a    b
1  8.0  11.0  1.0  4.0
2  NaN   NaN  2.0  5.0
3  7.0  10.0  3.0  6.0
6  9.0  12.0  NaN  NaN

</details>

Metadata

Labels: Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)