BUG: df.sort_values() not respecting na_position with categoricals #22556 #22640

staftermath · 2018-09-08T20:51:57Z

closes df.sort_values() not respecting na_position with categoricals #22556
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2018-09-08T20:52:00Z

Hello @staftermath! Thanks for updating the PR.

There are no PEP8 issues in the file pandas/core/arrays/categorical.py !
There are no PEP8 issues in the file pandas/core/sorting.py !
There are no PEP8 issues in the file pandas/tests/frame/test_sorting.py !
There are no PEP8 issues in the file pandas/tests/io/test_stata.py !

Comment last updated on October 10, 2018 at 01:02 Hours UTC

jreback · 2018-09-08T21:55:18Z

pandas/core/sorting.py

-        return items.argsort(ascending=ascending, kind=kind)
+        mask = isna(items)
+        null_idx = np.where(mask)[0]
+        sorted_idx = items.argsort(ascending=ascending, kind=kind)


Do we not already have a .sort_values on Categorical that accepts na_position?

If we don't, all of this code should live there anyhow.

The way I see it, df.sort_values does not call the sort_values method of the column objects; it calls nargsort. Since pd.Series doesn't have nargsort either, I feel this fits better here than in a Categorical.nargsort.
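For readers following the thread, the behaviour under discussion can be checked with a short script. This is a minimal sketch, assuming a pandas release that already contains this fix (the PR is milestoned for 0.24.0); the column name `c` and the sample values mirror the test data quoted later in the review.

```python
import numpy as np
import pandas as pd

# A categorical column with two missing values, as in the PR's test data.
df = pd.DataFrame({
    "c": pd.Categorical(["A", np.nan, "B", np.nan, "C"],
                        categories=["A", "B", "C"],
                        ordered=True)
})

# With the fix, na_position decides where the missing rows end up.
na_first = df.sort_values(by="c", na_position="first")
na_last = df.sort_values(by="c", na_position="last")

print(na_first["c"].isna().tolist())
print(na_last["c"].isna().tolist())
```

Before the fix, both calls placed the missing rows in front, which is the bug reported in #22556.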

staftermath · 2018-09-08T23:16:52Z

Is the failing ci/circleci: py35_ascii check related to this? See #21763.

codecov · 2018-09-09T15:05:10Z

Codecov Report

Merging #22640 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22640      +/-   ##
==========================================
- Coverage   92.19%   92.18%   -0.01%     
==========================================
  Files         169      169              
  Lines       50954    50947       -7     
==========================================
- Hits        46975    46968       -7     
  Misses       3979     3979

Flag	Coverage Δ
#multiple	`90.61% <100%> (-0.01%)`	⬇️
#single	`42.28% <7.14%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/arrays/categorical.py	`95.53% <100%> (-0.1%)`	⬇️
pandas/core/sorting.py	`98.27% <100%> (+0.06%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9285820...8d4aedc.

WillAyd · 2018-09-11T01:21:08Z

doc/source/whatsnew/v0.24.0.txt

@@ -608,6 +608,7 @@ Categorical
 ^^^^^^^^^^^

 - Bug in :meth:`Categorical.from_codes` where ``NaN`` values in `codes` were silently converted to ``0`` (:issue:`21767`). In the future this will raise a ``ValueError``. Also changes the behavior of `.from_codes([1.1, 2.0])`.
+- Bug in :meth:`Categorical.sort_values` where ``NaN`` values were always positioned in front regardless of ```na_position`` value. (:issue:`22556`). `test_stata.py` was incorrectly passing using default ``na_position='last'``.


Extra backtick before na_position

WillAyd · 2018-09-11T01:24:24Z

pandas/core/sorting.py

+        mask = isna(items)
+        cnt_null = mask.sum()
+        sorted_idx = items.argsort(ascending=ascending, kind=kind)
+        if ascending:


Can you simplify the condition here to just negate cnt_null where required? The return can happen outside of the condition.

Good point. I'm not sure how to do the condition check and negate cnt_null only once. Updated to apply np.roll only when needed, and moved the return outside of the condition.
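The rotation idea discussed here can be sketched on raw integer codes. This is an illustration under stated assumptions, not the code merged in the PR: the function name `roll_indexer` is hypothetical, and it relies only on the fact noted in the diff that missing values are coded as -1 and therefore land in front after an ascending sort.

```python
import numpy as np


def roll_indexer(codes, ascending=True, na_position="last"):
    """Sort positions of `codes`, rotating the -1 (missing) block as needed."""
    if na_position not in ("first", "last"):
        raise ValueError("invalid na_position: {!r}".format(na_position))
    codes = np.asarray(codes)
    cnt_null = int((codes == -1).sum())
    sorted_idx = np.argsort(codes, kind="mergesort")
    if not ascending:
        sorted_idx = sorted_idx[::-1]
    # Ascending sorts leave the -1 block in front, descending sorts at the back.
    if ascending and na_position == "last":
        sorted_idx = np.roll(sorted_idx, -cnt_null)
    elif not ascending and na_position == "first":
        sorted_idx = np.roll(sorted_idx, cnt_null)
    # Single return, outside of the conditions.
    return sorted_idx


codes = [0, -1, 1, -1, 2]
print(roll_indexer(codes, ascending=True, na_position="last"))
print(roll_indexer(codes, ascending=False, na_position="first"))
```

Only two of the four combinations need a rotation, which is why the roll stays inside the conditions while the return does not.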

WillAyd · 2018-09-11T01:25:18Z

pandas/core/sorting.py

+        sorted_idx = items.argsort(ascending=ascending, kind=kind)
+        if ascending:
+            # NaN is coded as -1 and is listed in front after sorting
+            return sorted_idx if na_position == 'first' \


We typically use implicit line continuation where possible. If a continuation is still needed after refactoring the condition above, I would prefer that style.
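As a small illustration of the style being requested: wrapping the expression in parentheses lets it span lines without a trailing backslash. The function and argument names below are hypothetical and only stand in for the values in the diff.

```python
def choose_indexer(sorted_idx, rolled_idx, na_position):
    # Parentheses give implicit line continuation; no backslash is needed.
    return (sorted_idx if na_position == 'first'
            else rolled_idx)


print(choose_indexer([1, 3, 0], [0, 1, 3], 'first'))
```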

WillAyd · 2018-09-11T01:25:41Z

pandas/tests/frame/test_sorting.py

+    def test_sort_index_na_position_with_categories(self):
+        # GH 22556
+        # Positioning missing value properly when column is Categorical.
+        df_category_with_nan = pd.DataFrame({


Can you parametrize these various cases?

jreback · 2018-09-18T12:19:57Z

doc/source/whatsnew/v0.24.0.txt

@@ -613,6 +613,7 @@ Categorical
 ^^^^^^^^^^^

 - Bug in :meth:`Categorical.from_codes` where ``NaN`` values in ``codes`` were silently converted to ``0`` (:issue:`21767`). In the future this will raise a ``ValueError``. Also changes the behavior of ``.from_codes([1.1, 2.0])``.
+- Bug in :meth:`Categorical.sort_values` where ``NaN`` values were always positioned in front regardless of ``na_position`` value. (:issue:`22556`). `test_stata.py` was incorrectly passing using default ``na_position='last'``.


you don't need the test_stata part

jreback · 2018-09-18T12:20:56Z

pandas/tests/frame/test_sorting.py

+                                                   na_position='first')
+        expected_ascending_na_first = DataFrame({
+            'c': Categorical([np.nan, np.nan, 'A', 'B', 'C'],
+                             categories=['A', 'B', 'C'],


can you see if you can parametrize this a bit

jreback · 2018-09-18T12:24:03Z

pandas/core/sorting.py

+        cnt_null = mask.sum()
+        sorted_idx = items.argsort(ascending=ascending, kind=kind)
+        if ascending and na_position == 'last':
+            # NaN is coded as -1 and is listed in front after sorting


This duplicates some code in Categorical.sort_values. Is it not possible to refactor this into a function and then call it from Categorical.sort_values? The function would work with the raw indexer, while Categorical.sort_values reconstructs a Categorical.

Great suggestion. Updated.
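The split suggested above can be sketched as two small functions: one that only computes an indexer, and one that rebuilds the Categorical from it. This is a rough sketch, not the refactor that was merged; `na_aware_indexer` and `sort_categorical` are hypothetical names, and the indexer here partitions missing and non-missing positions rather than calling pandas' internal nargsort.

```python
import numpy as np
import pandas as pd


def na_aware_indexer(cat, ascending=True, na_position="last"):
    """Return integer positions that sort `cat`, placing missing values explicitly."""
    if na_position not in ("first", "last"):
        raise ValueError("invalid na_position: {!r}".format(na_position))
    codes = np.asarray(cat.codes)
    null_idx = np.flatnonzero(codes == -1)    # missing values are coded as -1
    valid_idx = np.flatnonzero(codes != -1)
    order = np.argsort(codes[valid_idx], kind="mergesort")
    if not ascending:
        order = order[::-1]
    valid_sorted = valid_idx[order]
    if na_position == "first":
        return np.concatenate([null_idx, valid_sorted])
    return np.concatenate([valid_sorted, null_idx])


def sort_categorical(cat, ascending=True, na_position="last"):
    """Reconstruct a sorted Categorical from the raw indexer."""
    return cat.take(na_aware_indexer(cat, ascending, na_position))


cat = pd.Categorical(["A", np.nan, "B", np.nan, "C"],
                     categories=["A", "B", "C"], ordered=True)
print(sort_categorical(cat, ascending=False, na_position="first"))
```

Keeping the indexer separate means a DataFrame-level sort can reuse it to reorder whole rows, while the Categorical-level method only has to call take.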

jreback

looks good. pls rebase on master and ping on green.

jreback · 2018-10-07T23:29:46Z

pandas/tests/frame/test_sorting.py

+            'c': pd.Categorical(['A', np.nan, 'B', np.nan, 'C'],
+                                categories=categories,
+                                ordered=True)})
+        result_ascending_na_first = df.sort_values(by='c',


Rather than naming result / expected like this, can you just name them result and expected, and put a comment before each test case describing it? It's easier to read.

If you can easily parametrize things, that would be great as well (though it may not be so easy).

@jreback thanks for all the suggestions. I have made the changes except for the parametrization (I couldn't figure out a good way to do that).

jreback · 2018-10-09T12:41:20Z

@WillAyd over to you

WillAyd · 2018-10-09T17:04:03Z

pandas/tests/frame/test_sorting.py

@@ -598,3 +598,71 @@ def test_sort_index_intervalindex(self):
            closed='right')
        result = result.columns.levels[1].categories
        tm.assert_index_equal(result, expected)
+
+    def test_sort_index_na_position_with_categories(self):


Can you parametrize this test? I believe you can use values, na_position and index as parametrized arguments.

@WillAyd parametrized a bit. I am pretty new at this. Thanks for all the suggestions!

No worries - appreciate the help and we hope you learn as you go through this process.

FWIW this is unfortunately not the parametrization we are looking for. There should be a decorator on the function that performs the actual task. What you want is the pytest.mark.parametrize decorator; you'll see it on one other function in this module, and if you poke around other modules you'll see more complex applications which can help you out as well.
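For reference, a decorator-based version could look roughly like the sketch below. It is an illustration only: the test name, the argument names and the choice of cases are assumptions, not the test that ended up in the PR, and it assumes a pandas release that contains this fix.

```python
import numpy as np
import pandas as pd
import pytest

CATEGORIES = ["A", "B", "C"]


# One decorator, one tuple per case: each tuple is a single test run.
@pytest.mark.parametrize("ascending, na_position, expected", [
    (True, "first", [None, None, "A", "B", "C"]),
    (True, "last", ["A", "B", "C", None, None]),
    (False, "first", [None, None, "C", "B", "A"]),
    (False, "last", ["C", "B", "A", None, None]),
])
def test_sort_values_na_position_categorical(ascending, na_position, expected):
    df = pd.DataFrame({
        "c": pd.Categorical(["A", np.nan, "B", np.nan, "C"],
                            categories=CATEGORIES, ordered=True)
    })
    result = df.sort_values(by="c", ascending=ascending,
                            na_position=na_position)
    # Map missing values to None so plain list comparison works.
    observed = [None if pd.isna(v) else v for v in result["c"]]
    assert observed == expected
```

Run under pytest, this collects four test cases from the single function.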

WillAyd · 2018-10-10T23:23:15Z

pandas/tests/frame/test_sorting.py

+                                     categories=categories,
+                                     ordered=True)
+        },
+            index=na_indices + category_indices)


Stylistic nit, but you can move this line up to the one above it. Once you parametrize, you should only need one of these DataFrame constructors.

WillAyd · 2018-10-11T04:18:56Z

pandas/tests/frame/test_sorting.py

+    @pytest.mark.parametrize("na_indices", [[1, 3]])
+    @pytest.mark.parametrize("na_position_first", ['first'])
+    @pytest.mark.parametrize("na_position_last", ['last'])
+    @pytest.mark.parametrize("column_name", ['c'])


Still not quite it, as this will generate the Cartesian product of all of the items here.

Giving it a second look, parametrizing this might be more complicated than I originally thought, to the point that it might overcomplicate things. I'd be OK with reverting to what you had before my request, and this could maybe be a separate PR if it even makes sense.

Sorry for the back and forth

no problem. I have reverted the parametrization. Thanks for the careful checks.
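The Cartesian-product point can be seen in a toy example. Stacked pytest.mark.parametrize decorators are combined as a product, whereas a single decorator with tuples runs exactly the listed pairs. The test names and values below are made up for illustration, and itertools.product is used only to count the combinations.

```python
from itertools import product

import pytest


# Stacked decorators: pytest runs every combination (2 x 2 = 4 cases).
@pytest.mark.parametrize("ascending", [True, False])
@pytest.mark.parametrize("na_position", ["first", "last"])
def test_stacked(ascending, na_position):
    assert na_position in ("first", "last")


# Single decorator with tuples: only the listed pairs run (2 cases).
@pytest.mark.parametrize("ascending, na_position", [
    (True, "last"),
    (False, "first"),
])
def test_paired(ascending, na_position):
    assert na_position in ("first", "last")


# The same enumeration, spelled out for counting.
stacked_cases = list(product([True, False], ["first", "last"]))
paired_cases = [(True, "last"), (False, "first")]
print(len(stacked_cases), len(paired_cases))
```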

jreback · 2018-10-17T14:01:59Z

this looked ok, @WillAyd ?

@staftermath can you merge master

WillAyd

lgtm; ok with merge after updating from master

jreback · 2018-10-18T16:07:31Z

thanks @staftermath

staftermath added 5 commits September 8, 2018 12:28

fix na_position when sort_values by categorical values

d119a00

move test to test_sort.py and fix index

f36346b

using pandas.isna to allow CategoricalIndex calling nargsort

1e26989

linter fix in test_sorting.py

dc4b4c4

add to Whatsnew

a4ee75f

jreback requested changes Sep 8, 2018

test ValueError exception and use np.roll to do the rotate

2b6f0b0

flake8 linter fix

229fff5

WillAyd requested changes Sep 11, 2018

WillAyd added the Categorical (Categorical Data Type) label Sep 11, 2018

staftermath added 2 commits September 10, 2018 22:04

address comments

fbc8a4e

solve conflict

41b29fb

jreback requested changes Sep 18, 2018

staftermath added 2 commits September 26, 2018 18:36

call narg in Categorical.sort_values and address linter

1999d58

resolve conflict

76da923

jreback added the Bug and Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels Oct 7, 2018

jreback added this to the 0.24.0 milestone Oct 7, 2018

jreback requested changes Oct 7, 2018

staftermath added 2 commits October 7, 2018 20:19

Merge remote-tracking branch 'upstream/master'

8f81649

comment rest

8943280

jreback approved these changes Oct 9, 2018

WillAyd requested changes Oct 9, 2018

staftermath added 3 commits October 9, 2018 18:52

parametrize test and use set

1709c91

Merge remote-tracking branch 'upstream/master'

8983dd9

PEP8 indent

7f6c68a

WillAyd reviewed Oct 10, 2018

pytest parametrization

9f0343b

WillAyd requested changes Oct 11, 2018

reverse back from pytest parametrization

9e7afb1

WillAyd approved these changes Oct 17, 2018

Merge remote-tracking branch 'upstream/master'

8d4aedc

jreback merged commit 32ef84b into pandas-dev:master Oct 18, 2018

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018

BUG: df.sort_values() not respecting na_position with categoricals pa…

1df2321

…ndas-dev#22556 (pandas-dev#22640)
