Skip to content

BUG: Index.sort_values and Series.sort_values reverse duplicate order when ascending=False #35922

Closed
@AlexKirko

Description

@AlexKirko
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

IN:
import pandas as pd
ind = pd.Index([1, 1, 2, 3])

ind, indexer = ind.sort_values(ascending=False, return_indexer=True)
indexer
OUT:
array([3, 2, 1, 0], dtype=int64)

Problem description

Came across this one while fixing #35584 in #35604. sort_values reverses duplicate order when ascending=False. This is clearest when calling Index.sort_values, because it can return an indexer, but it's also true for Series.sort_values and it propagates to DataFrame.sort_values.

#35604 will make sorting in descending order stable for most Index types (leveraging nargsort from sorting.py), but the problem will remain for datetime-like index types and for Series and will require fixing.

Expected Output

array([3, 2, 0, 1], dtype=int64)

Duplicates should maintain order when descending=False. This will also let us leverage the same sorting algorithm both for Index and Series.

Additional use cases

Some additional use cases from the PR.

s = pd.Series(["A", "AA", "BB", "CAC"], dtype="object")
s.sort_values(ascending=False, key=lambda ser: ser.str.len())

OUT:
3    CAC
2     BB
1     AA
0      A
dtype: object

I don't think that swapping is expected here.

Then consider that you might be sorting a DataFrame with several columns, and a column with duplicates might be the first one. In this case you likely wouldn't expect a descending sort to change duplicate order. Or you could be using something like nlargest and get weirdness because there is a descending sort in there and it swaps elements.

Obviously, we could get by with a convention that we always revert duplicate order with a descending sort by being careful, but I believe keeping duplicate order is cleaner. In cases where it doesn't matter, it's the same, and when it does matter (as in nlargest and the like), you don't need to remember that you need extra reversals.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d0ca4b3
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None

pandas : 0.26.0.dev0+4054.gd0ca4b347
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0.post20200714
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.23.3
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : 0.6.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.3.1
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0

Metadata

Metadata

Assignees

Labels

AlgosNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffBugIndexRelated to the Index class or subclasses

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions