Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
IN:
import pandas as pd
ind = pd.Index([1, 1, 2, 3])
ind, indexer = ind.sort_values(ascending=False, return_indexer=True)
indexer
OUT:
array([3, 2, 1, 0], dtype=int64)
Problem description
Came across this one while fixing #35584 in #35604. sort_values reverses the order of duplicates when ascending=False. This is clearest when calling Index.sort_values, because it can return an indexer, but it's also true for Series.sort_values, and it propagates to DataFrame.sort_values.
#35604 will make sorting in descending order stable for most Index types (leveraging nargsort from sorting.py), but the problem will remain for datetime-like index types and for Series, and will require a separate fix.
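To make the difference concrete, here is a minimal NumPy-only sketch of the two behaviours. The helper name stable_descending_argsort is hypothetical and is not claimed to be how nargsort is implemented; it only shows one way to get a descending order that keeps duplicates in their original order, compared with simply reversing a stable ascending sort.

```python
import numpy as np


def stable_descending_argsort(values):
    """Indices that sort `values` in descending order, keeping ties in original order.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values)
    n = len(values)
    # Stable ascending argsort of the reversed array: within a group of ties,
    # elements that came later in the original array now come first.
    reversed_order = np.argsort(values[::-1], kind="stable")
    # Translate positions in the reversed array back to original positions.
    original_positions = n - 1 - reversed_order
    # Reversing the result gives descending order and restores tie order.
    return original_positions[::-1]


values = np.array([1, 1, 2, 3])

# Reversing a stable ascending sort also reverses the duplicates.
naive = np.argsort(values, kind="stable")[::-1]
stable = stable_descending_argsort(values)

print(naive)   # [3 2 1 0]
print(stable)  # [3 2 0 1]
```

The second result matches the expected output given in this report.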
Expected Output
array([3, 2, 0, 1], dtype=int64)
Duplicates should maintain their order when ascending=False. This would also let us use the same sorting algorithm for both Index and Series.
Additional use cases
Some additional use cases from the PR.
IN:
s = pd.Series(["A", "AA", "BB", "CAC"], dtype="object")
s.sort_values(ascending=False, key=lambda ser: ser.str.len())
OUT:
3 CAC
2 BB
1 AA
0 A
dtype: object
I don't think that swapping AA and BB (both of length 2) is expected here.
Then consider that you might be sorting a DataFrame by several columns, and a column with duplicates might be the first one. In this case you likely wouldn't expect a descending sort to change the order of duplicates. Or you could be using something like nlargest and get odd results because there is a descending sort in there and it swaps elements.
Obviously, we could get by with a convention that a descending sort always reverses duplicate order, as long as we are careful, but I believe keeping duplicate order is cleaner. In cases where it doesn't matter, the result is the same, and when it does matter (as in nlargest and the like), you don't need to remember that extra reversals are needed.
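As a rough illustration of the top-n concern, the sketch below uses only Python's built-in sorted, whose reverse=True option is documented to preserve the original order of equal elements. The rows data and the score helper are made up for this example; it is not pandas code and does not show what nlargest does internally.

```python
# Made-up (label, score) rows; "a" and "c" tie on the highest score.
rows = [("a", 2), ("b", 1), ("c", 2), ("d", 1)]


def score(row):
    return row[1]


# Stable descending sort: ties keep their original order.
stable_desc = sorted(rows, key=score, reverse=True)

# Ascending sort followed by a reversal: ties come out in reversed order.
reversed_asc = sorted(rows, key=score)[::-1]

print(stable_desc)   # [('a', 2), ('c', 2), ('b', 1), ('d', 1)]
print(reversed_asc)  # [('c', 2), ('a', 2), ('d', 1), ('b', 1)]

# A "top 1" built on each ordering picks a different row among the ties.
print(stable_desc[:1])   # [('a', 2)]
print(reversed_asc[:1])  # [('c', 2)]
```

With the reversing convention, code that wants the first occurrence among ties has to undo the swap itself; with a stable descending sort it gets that for free.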
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d0ca4b3
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None
pandas : 0.26.0.dev0+4054.gd0ca4b347
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0.post20200714
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.23.3
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : 0.6.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.3.1
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0