Skip to content

Value counts normalize #33652

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from 3 commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
bd9011a
made nan count when dropna=False
DataInformer Apr 18, 2020
d9d5ec1
updated changelog
DataInformer Apr 18, 2020
86fe7f9
trivial
DataInformer Apr 18, 2020
c34a863
added specific test for groupby valuecount interval fix
DataInformer Apr 18, 2020
9c1c269
merge master
DataInformer Apr 18, 2020
5f8eb1d
updated value_count docstrings
DataInformer Apr 19, 2020
1276166
fixed pep8 style
DataInformer Apr 19, 2020
a1b7197
fixed more minor style
DataInformer Apr 19, 2020
9c3ede3
added test for na in bins
DataInformer Apr 20, 2020
0cff92b
added release notes to 1.1
DataInformer Apr 20, 2020
27aa460
trying to avoid docstring warning
DataInformer Apr 25, 2020
27c9856
trying to avoid docstring warning
DataInformer Apr 25, 2020
f5e9aeb
include nan count when dropna=False
DataInformer Jun 27, 2020
99b7112
listed bugfix
DataInformer Jun 27, 2020
75374b2
avoided tests that highlight groupby.value_count bug
DataInformer Jun 27, 2020
25b6c14
Revert "avoided tests that highlight groupby.value_count bug"
DataInformer Jul 4, 2020
277ce52
use series value_counts for groupby
DataInformer Jul 4, 2020
73ef54b
Merge branch 'master' of https://github.com/pandas-dev/pandas
DataInformer Jul 4, 2020
6b97e0b
Merge branch 'master' into value_counts_part1
DataInformer Jul 4, 2020
797f668
added groupby bin test
DataInformer Jul 4, 2020
fce6998
passing groupy valcount tests
DataInformer Jul 16, 2020
637a609
nan doesnt work for times
DataInformer Jul 26, 2020
c9a4383
passing all value count tests
DataInformer Jul 27, 2020
7ae1280
Merge remote-tracking branch 'upstream/master' into value_counts_part1
DataInformer Jul 27, 2020
d2399ea
speedups 1
DataInformer Jul 27, 2020
3299a36
Merge branch 'value_counts_part1' into value_counts_normalize
DataInformer Jul 27, 2020
ec92f15
speedup?
DataInformer Aug 2, 2020
d6179b0
Revert "speedup?"
DataInformer Aug 2, 2020
5abfb16
fixed comments
DataInformer Aug 10, 2020
83ccfd2
Merge remote-tracking branch 'upstream/master' into value_counts_norm…
DataInformer Aug 30, 2020
5f33834
removed unneeded import
DataInformer Sep 12, 2020
8562f1b
Merge remote-tracking branch 'upstream/master' into value_counts_norm…
DataInformer Sep 12, 2020
f685cb2
updated to use na_sentinal param
DataInformer Sep 12, 2020
c21bdbb
fixed bad test
DataInformer Sep 12, 2020
e4c2552
moved doc, reverted permissions
DataInformer Sep 13, 2020
74b13d8
more doc and permission fix
DataInformer Sep 13, 2020
f0e630a
fixed docstrings
DataInformer Sep 14, 2020
9763e83
file perm
DataInformer Sep 14, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -921,6 +921,7 @@ Numeric
- Bug in :meth:`DataFrame.diff` with ``axis=1`` returning incorrect results with mixed dtypes (:issue:`32995`)
- Bug in :meth:`DataFrame.corr` and :meth:`DataFrame.cov` raising when handling nullable integer columns with ``pandas.NA`` (:issue:`33803`)
- Bug in :class:`DataFrame` and :class:`Series` addition and subtraction between object-dtype objects and ``datetime64`` dtype objects (:issue:`33824`)
- Bug in :meth:`Series.value_counts` with ``normalize=True`` for NA values (:issue:`25970`)

Conversion
^^^^^^^^^^
Expand Down
21 changes: 12 additions & 9 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -693,12 +693,16 @@ def value_counts(
ascending : bool, default False
Sort in ascending order
normalize: bool, default False
If True then compute a relative histogram
bins : integer, optional
Rather than count values, group them into half-open bins,
convenience for pd.cut, only works with numeric data
If True, then compute a relative histogram that outputs the
proportion of each value.
bins : integer or iterable of numeric, optional
Rather than count values, group them into half-open bins.
Only works with numeric data.
If int, interpreted as number of bins and will use pd.cut.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this will use pd.cut either way, so pls amend the doc to say that.

needs a versionchanged tag 1.2

If interable of numeric, will use provided numbers as bin endpoints.
dropna : bool, default True
Don't include counts of NaN
Don't include counts of NaN.
If False and NaNs are present, NaN will be a key in the output.

Returns
-------
Expand All @@ -717,18 +721,17 @@ def value_counts(
except TypeError as err:
raise TypeError("bins argument only works with numeric data.") from err

# count, remove nulls (from the index), and but the bins
# count, remove nulls (from the index), and use the bins
result = ii.value_counts(dropna=dropna)
result = result[result.index.notna()]
result.index = result.index.astype("interval")
result = result.sort_index()

# if we are dropna and we have NO values
if dropna and (result._values == 0).all():
result = result.iloc[0:0]

# normalizing is by len of all (regardless of dropna)
counts = np.array([len(ii)])
# normalizing is by len of what gets included in the bins
counts = result._values

else:

Expand Down
18 changes: 15 additions & 3 deletions pandas/core/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -1180,11 +1180,14 @@ def value_counts(
Sort by frequencies.
ascending : bool, default False
Sort in ascending order.
bins : int, optional
Rather than count values, group them into half-open bins,
a convenience for ``pd.cut``, only works with numeric data.
bins : int or iterable of numeric, optional
Rather than count individual values, group them into half-open bins.
Only works with numeric data.
If int, interpreted as number of bins and will use `pd.cut`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update this doc-string the same way (in theory these could be shared, but that's another day)

If interable of numeric, will use provided numbers as bin endpoints.
dropna : bool, default True
Don't include counts of NaN.
If False and NaNs are present, NaN will be a key in the output.

Returns
-------
Expand Down Expand Up @@ -1230,6 +1233,15 @@ def value_counts(
(3.0, 4.0] 1
dtype: int64

Bins can also be an iterable of numbers. These numbers are treated
as endpoints for the intervals.

>>> s.value_counts(bins=[0,2,4,9])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use spaces between the bins

(2.0, 4.0] 3
(-0.001, 2.0] 2
(4.0, 9.0] 0
dtype: int64

**dropna**

With `dropna` set to `False` we can also see NaN index values.
Expand Down
28 changes: 28 additions & 0 deletions pandas/tests/base/test_value_counts.py
Original file line number Diff line number Diff line change
Expand Up @@ -191,6 +191,34 @@ def test_value_counts_bins(index_or_series):
assert s.nunique() == 0


def test_value_counts_bins_nas():
# GH25970, handle normalizing bins with NA's properly
# First test that NA's are included appropriately
rand_data = np.append(
np.random.randint(1, 5, 50), [np.nan] * np.random.randint(1, 20)
)
s = Series(rand_data)
assert s.value_counts(dropna=False).index.hasnans
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize here on the bins arg.

Then you can split out to another test starting on L208.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, and will merge master soon. I'm still working on a nice way to handle the groupby value_counts that will give consistent output for datetime Groupers that include datetimes not present in the data.

assert not s.value_counts(dropna=True).index.hasnans
assert s.value_counts(dropna=False, bins=3).index.hasnans
assert not s.value_counts(dropna=True, bins=3).index.hasnans
assert s.value_counts(dropna=False, bins=[0, 1, 3, 6]).index.hasnans
assert not s.value_counts(dropna=True, bins=[0, 1, 3, 6]).index.hasnans

# then verify specific example
s2 = Series([1, 2, 2, 3, 3, 3, np.nan, np.nan, 4, 5])
intervals = IntervalIndex.from_breaks([0.995, 2.333, 3.667, 5.0])
expected_dropna = Series([0.375, 0.375, 0.25], intervals.take([1, 0, 2]))
expected_keepna_vals = np.array([0.3, 0.3, 0.2, 0.2])
tm.assert_series_equal(
s2.value_counts(dropna=True, normalize=True, bins=3), expected_dropna
)
tm.assert_numpy_array_equal(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to do this already done in assert_series_equal

s2.value_counts(dropna=False, normalize=True, bins=3).values,
expected_keepna_vals,
)


def test_value_counts_datetime64(index_or_series):
klass = index_or_series

Expand Down
5 changes: 3 additions & 2 deletions pandas/tests/groupby/test_value_counts.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,10 +53,10 @@ def seed_df(seed_nans, n, m):
@pytest.mark.slow
@pytest.mark.parametrize("df, keys, bins, n, m", binned, ids=ids)
@pytest.mark.parametrize("isort", [True, False])
@pytest.mark.parametrize("normalize", [True, False])
@pytest.mark.parametrize("normalize", [False])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you keep these? Removing these reduces test coverage

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I specifically addressed this in my last comment:
the groupby tests now cause conflicting results, since the Series.value_count is doing the right thing but the DataFrame.groupby.value_count is not. It seems wrong to change the test, but I have temporarily reduced the normalize and dropna parameters to skip the problematic cases. Would you rather I combine the fixes again as a single pull request?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea we can't just remove test coverage like this to get things to pass. I'm not sure what the old commits looked like (FYI if you merge master instead of rebasing you don't need to force push, which helps retain history) but probably need to re-integrate that or some aspect of the fix so we don't have to do this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks; will do

@pytest.mark.parametrize("sort", [True, False])
@pytest.mark.parametrize("ascending", [True, False])
@pytest.mark.parametrize("dropna", [True, False])
@pytest.mark.parametrize("dropna", [True])
def test_series_groupby_value_counts(
df, keys, bins, n, m, isort, normalize, sort, ascending, dropna
):
Expand All @@ -71,6 +71,7 @@ def rebuild_index(df):

gr = df.groupby(keys, sort=isort)
left = gr["3rd"].value_counts(**kwargs)
# left.index.names = left.index.names[:-1] + ["3rd"]

gr = df.groupby(keys, sort=isort)
right = gr["3rd"].apply(Series.value_counts, **kwargs)
Expand Down