Skip to content

BUG: df.sort_values() not respecting na_position with categoricals #22556 #22640

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Oct 18, 2018

Conversation

staftermath
Copy link
Contributor

@staftermath staftermath commented Sep 8, 2018

@pep8speaks
Copy link

pep8speaks commented Sep 8, 2018

Hello @staftermath! Thanks for updating the PR.

Comment last updated on October 10, 2018 at 01:02 Hours UTC

return items.argsort(ascending=ascending, kind=kind)
mask = isna(items)
null_idx = np.where(mask)[0]
sorted_idx = items.argsort(ascending=ascending, kind=kind)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we not have a .sort_values on Categorical that accepts na_position already?

if we don’t all of this code should live there anyhow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way I see it, df.sort_values is not calling sort_values method from column objects, it's calling nargsort. Since pd.Series also doesn't have nargsort, I feel it may fit better in here rather than Categorical.nargsort

@staftermath
Copy link
Contributor Author

ci/circleci: py35_ascii failing related to this?: #21763

@codecov
Copy link

codecov bot commented Sep 9, 2018

Codecov Report

Merging #22640 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22640      +/-   ##
==========================================
- Coverage   92.19%   92.18%   -0.01%     
==========================================
  Files         169      169              
  Lines       50954    50947       -7     
==========================================
- Hits        46975    46968       -7     
  Misses       3979     3979
Flag Coverage Δ
#multiple 90.61% <100%> (-0.01%) ⬇️
#single 42.28% <7.14%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.53% <100%> (-0.1%) ⬇️
pandas/core/sorting.py 98.27% <100%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9285820...8d4aedc. Read the comment docs.

@@ -608,6 +608,7 @@ Categorical
^^^^^^^^^^^

- Bug in :meth:`Categorical.from_codes` where ``NaN`` values in `codes` were silently converted to ``0`` (:issue:`21767`). In the future this will raise a ``ValueError``. Also changes the behavior of `.from_codes([1.1, 2.0])`.
- Bug in :meth:`Categorical.sort_values` where ``NaN`` values were always positioned in front regardless of ```na_position`` value. (:issue:`22556`). `test_stata.py` was incorrectly passing using default ``na_position='last'``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra backtick before na_position

mask = isna(items)
cnt_null = mask.sum()
sorted_idx = items.argsort(ascending=ascending, kind=kind)
if ascending:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you simplify the condition here to just negate cnt_null where required? Return can happen outside of condition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. not sure how to do condition check and negate cnt_null only once. updated to keep the np.roll only when needed and took return outside of condition.

sorted_idx = items.argsort(ascending=ascending, kind=kind)
if ascending:
# NaN is coded as -1 and is listed in front after sorting
return sorted_idx if na_position == 'first' \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically use implicit line continuation where possible. If still needed after refactor of above condition would prefer that style

def test_sort_index_na_position_with_categories(self):
# GH 22556
# Positioning missing value properly when column is Categorical.
df_category_with_nan = pd.DataFrame({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you parametrize these various cases?

@WillAyd WillAyd added the Categorical Categorical Data Type label Sep 11, 2018
@@ -613,6 +613,7 @@ Categorical
^^^^^^^^^^^

- Bug in :meth:`Categorical.from_codes` where ``NaN`` values in ``codes`` were silently converted to ``0`` (:issue:`21767`). In the future this will raise a ``ValueError``. Also changes the behavior of ``.from_codes([1.1, 2.0])``.
- Bug in :meth:`Categorical.sort_values` where ``NaN`` values were always positioned in front regardless of ``na_position`` value. (:issue:`22556`). `test_stata.py` was incorrectly passing using default ``na_position='last'``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the test_stata part

na_position='first')
expected_ascending_na_first = DataFrame({
'c': Categorical([np.nan, np.nan, 'A', 'B', 'C'],
categories=['A', 'B', 'C'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you see if you can parametrize this a bit

cnt_null = mask.sum()
sorted_idx = items.argsort(ascending=ascending, kind=kind)
if ascending and na_position == 'last':
# NaN is coded as -1 and is listed in front after sorting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this duplicates some code in Categorical.sort_values is it not possible to refactor this to a function, and then call this from Categorical.sort_values (as this works with the raw indexer), while Categorical.sort_values reconstructs a Categorical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great suggestion. updated.

@jreback jreback added Bug Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Oct 7, 2018
@jreback jreback added this to the 0.24.0 milestone Oct 7, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. pls rebase on master and ping on green.

'c': pd.Categorical(['A', np.nan, 'B', np.nan, 'C'],
categories=categories,
ordered=True)})
result_ascending_na_first = df.sort_values(by='c',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than naming result / expected like this, can you just name them result & expected, and put a comment before the test case describing; its easier to read.

If you can easily parameterize things would be great as well (though may not be so easy).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback thanks for all the suggestions, I have made the changes except for the parameterization (couldn't figure out a good way to do that.. )

@jreback
Copy link
Contributor

jreback commented Oct 9, 2018

@WillAyd over to you

@@ -598,3 +598,71 @@ def test_sort_index_intervalindex(self):
closed='right')
result = result.columns.levels[1].categories
tm.assert_index_equal(result, expected)

def test_sort_index_na_position_with_categories(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you parametrize this test? Believe you can use values, na_position and index as parametrized arguments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd parametrized a bit. I am pretty new at this. Thanks for all the suggestions!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries - appreciate the help and we hope you learn as you go through this process.

FWIW this unfortunately not the parametrization we are looking for. There should be a decorator for the function which performs that actual task. What you want is the pytest.mark.parametrize decorator; you'll see this on one other function in the module though if you poke around other modules you'll see more complex applications which can help you out as well

categories=categories,
ordered=True)
},
index=na_indices + category_indices)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistic nit but you can move this line up to the one above it. Once you parametrize should only need one of these DataFrame constructors

@pytest.mark.parametrize("na_indices", [[1, 3]])
@pytest.mark.parametrize("na_position_first", ['first'])
@pytest.mark.parametrize("na_position_last", ['last'])
@pytest.mark.parametrize("column_name", ['c'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not quite it as this will generate the Cartesian product of all of the items here.

Giving it a second look, this might be more complicated than I originally thought with parametrization to the point that it might overcomplicated things. I'd be OK with reverting to what you had before my request, and this could maybe be a separate PR if it even makes sense.

Sorry for the back and forth

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem. I have reverted the parametrization. Thanks for the careful checks.

@jreback
Copy link
Contributor

jreback commented Oct 17, 2018

this looked ok, @WillAyd ?

@staftermath can you merge master

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm; ok with merge after updating from master

@jreback jreback merged commit 32ef84b into pandas-dev:master Oct 18, 2018
@jreback
Copy link
Contributor

jreback commented Oct 18, 2018

thanks @staftermath

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

df.sort_values() not respecting na_position with categoricals
4 participants