Skip to content

BUG: df.sort_values() not respecting na_position with categoricals #22556 #22640

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Oct 18, 2018
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -613,6 +613,7 @@ Categorical
^^^^^^^^^^^

- Bug in :meth:`Categorical.from_codes` where ``NaN`` values in ``codes`` were silently converted to ``0`` (:issue:`21767`). In the future this will raise a ``ValueError``. Also changes the behavior of ``.from_codes([1.1, 2.0])``.
- Bug in :meth:`Categorical.sort_values` where ``NaN`` values were always positioned in front regardless of ``na_position`` value. (:issue:`22556`). `test_stata.py` was incorrectly passing using default ``na_position='last'``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the test_stata part


Datetimelike
^^^^^^^^^^^^
Expand Down
13 changes: 12 additions & 1 deletion pandas/core/sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -241,7 +241,18 @@ def nargsort(items, kind='quicksort', ascending=True, na_position='last'):

# specially handle Categorical
if is_categorical_dtype(items):
return items.argsort(ascending=ascending, kind=kind)
if na_position not in ['first', 'last']:
raise ValueError('invalid na_position: {!r}'.format(na_position))
mask = isna(items)
cnt_null = mask.sum()
sorted_idx = items.argsort(ascending=ascending, kind=kind)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we not have a .sort_values on Categorical that accepts na_position already?

if we don’t all of this code should live there anyhow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way I see it, df.sort_values is not calling sort_values method from column objects, it's calling nargsort. Since pd.Series also doesn't have nargsort, I feel it may fit better in here rather than Categorical.nargsort

if ascending and na_position == 'last':
# NaN is coded as -1 and is listed in front after sorting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this duplicates some code in Categorical.sort_values is it not possible to refactor this to a function, and then call this from Categorical.sort_values (as this works with the raw indexer), while Categorical.sort_values reconstructs a Categorical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great suggestion. updated.

sorted_idx = np.roll(sorted_idx, -cnt_null)
elif not ascending and na_position == 'first':
# NaN is coded as -1 and is listed in the end after sorting
sorted_idx = np.roll(sorted_idx, cnt_null)
return sorted_idx

items = np.asanyarray(items)
idx = np.arange(len(items))
Expand Down
64 changes: 63 additions & 1 deletion pandas/tests/frame/test_sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
from pandas.compat import lrange
from pandas.api.types import CategoricalDtype
from pandas import (DataFrame, Series, MultiIndex, Timestamp,
date_range, NaT, IntervalIndex)
date_range, NaT, IntervalIndex, Categorical)

from pandas.util.testing import assert_series_equal, assert_frame_equal

Expand Down Expand Up @@ -598,3 +598,65 @@ def test_sort_index_intervalindex(self):
closed='right')
result = result.columns.levels[1].categories
tm.assert_index_equal(result, expected)

def test_sort_index_na_position_with_categories(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you parametrize this test? Believe you can use values, na_position and index as parametrized arguments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd parametrized a bit. I am pretty new at this. Thanks for all the suggestions!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries - appreciate the help and we hope you learn as you go through this process.

FWIW this unfortunately not the parametrization we are looking for. There should be a decorator for the function which performs that actual task. What you want is the pytest.mark.parametrize decorator; you'll see this on one other function in the module though if you poke around other modules you'll see more complex applications which can help you out as well

# GH 22556
# Positioning missing value properly when column is Categorical.
df = pd.DataFrame({
'c': pd.Categorical(['A', np.nan, 'B', np.nan, 'C'],
categories=['A', 'B', 'C'],
ordered=True)})
result_ascending_na_first = df.sort_values(by='c',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than naming result / expected like this, can you just name them result & expected, and put a comment before the test case describing; its easier to read.

If you can easily parameterize things would be great as well (though may not be so easy).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback thanks for all the suggestions, I have made the changes except for the parameterization (couldn't figure out a good way to do that.. )

ascending=True,
na_position='first')
expected_ascending_na_first = DataFrame({
'c': Categorical([np.nan, np.nan, 'A', 'B', 'C'],
categories=['A', 'B', 'C'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you see if you can parametrize this a bit

ordered=True)}, index=[1, 3, 0, 2, 4])

assert_frame_equal(result_ascending_na_first,
expected_ascending_na_first)

result_ascending_na_last = df.sort_values(by='c',
ascending=True,
na_position='last')
expected_ascending_na_last = DataFrame({
'c': Categorical(['A', 'B', 'C', np.nan, np.nan],
categories=['A', 'B', 'C'],
ordered=True)}, index=[0, 2, 4, 1, 3])

assert_frame_equal(result_ascending_na_last,
expected_ascending_na_last)

result_descending_na_first = df.sort_values(by='c',
ascending=False,
na_position='first')
expected_descending_na_first = DataFrame({
'c': Categorical([np.nan, np.nan, 'C', 'B', 'A'],
categories=['A', 'B', 'C'],
ordered=True)}, index=[3, 1, 4, 2, 0])

assert_frame_equal(result_descending_na_first,
expected_descending_na_first)

result_descending_na_last = df.sort_values(by='c',
ascending=False,
na_position='last')
expected_descending_na_last = DataFrame({
'c': Categorical(['C', 'B', 'A', np.nan, np.nan],
categories=['A', 'B', 'C'],
ordered=True)}, index=[4, 2, 0, 3, 1])

assert_frame_equal(result_descending_na_last,
expected_descending_na_last)

def test_sort_index_na_position_with_categories_raises(self):
df = pd.DataFrame({
'c': pd.Categorical(['A', np.nan, 'B', np.nan, 'C'],
categories=['A', 'B', 'C'],
ordered=True)})

with pytest.raises(ValueError):
df.sort_values(by='c',
ascending=False,
na_position='bad_position')
2 changes: 1 addition & 1 deletion pandas/tests/io/test_stata.py
Original file line number Diff line number Diff line change
Expand Up @@ -997,7 +997,7 @@ def test_categorical_sorting(self, file):
parsed = read_stata(getattr(self, file))

# Sort based on codes, not strings
parsed = parsed.sort_values("srh")
parsed = parsed.sort_values("srh", na_position='first')

# Don't sort index
parsed.index = np.arange(parsed.shape[0])
Expand Down