Value counts normalize #33652
Changes from 3 commits
@@ -1180,11 +1180,14 @@ def value_counts(
         Sort by frequencies.
     ascending : bool, default False
         Sort in ascending order.
-    bins : int, optional
-        Rather than count values, group them into half-open bins,
-        a convenience for ``pd.cut``, only works with numeric data.
+    bins : int or iterable of numeric, optional
+        Rather than count individual values, group them into half-open bins.
+        Only works with numeric data.
+        If int, interpreted as number of bins and will use `pd.cut`.

Review comment: update this doc-string the same way (in theory these could be shared, but that's another day).

+        If iterable of numeric, will use provided numbers as bin endpoints.
     dropna : bool, default True
         Don't include counts of NaN.
+        If False and NaNs are present, NaN will be a key in the output.

     Returns
     -------
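As an aside for readers of this diff, the following minimal sketch (not part of the PR) shows what the two forms of ``bins`` do, using only the public ``Series.value_counts`` and ``pd.cut`` API:

import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4])

# bins as an int: the data range is split into that many equal-width bins
print(s.value_counts(bins=3))

# bins as an iterable of numbers: the numbers are used directly as bin edges
print(s.value_counts(bins=[0, 2, 4]))

# Per the review discussion, the binning is delegated to pd.cut in both cases,
# so the resulting intervals match an explicit cut followed by value_counts.
print(pd.cut(s, bins=[0, 2, 4]).value_counts())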
@@ -1230,6 +1233,15 @@ def value_counts(
     (3.0, 4.0]    1
     dtype: int64

+    Bins can also be an iterable of numbers. These numbers are treated
+    as endpoints for the intervals.
+
+    >>> s.value_counts(bins=[0,2,4,9])

Review comment: use spaces between the bins.

+    (2.0, 4.0]       3
+    (-0.001, 2.0]    2
+    (4.0, 9.0]       0
+    dtype: int64
+
     **dropna**

     With `dropna` set to `False` we can also see NaN index values.
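Since the heart of the PR is how ``normalize`` interacts with ``bins`` and missing values, here is a hedged sketch of the behaviour the new test below asserts; it illustrates the intended semantics and is not an example taken from the docstring:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, np.nan, np.nan, 4, 5])

# dropna=True: NaNs are excluded, so (per the PR's test) the normalized
# frequencies are computed over the 8 non-missing values and sum to 1.
print(s.value_counts(bins=3, normalize=True, dropna=True))

# dropna=False: (per the PR's test) NaN gets its own row and the frequencies
# are computed over all 10 values; this is the GH25970 case the PR targets.
print(s.value_counts(bins=3, normalize=True, dropna=False))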
@@ -191,6 +191,34 @@ def test_value_counts_bins(index_or_series):
     assert s.nunique() == 0
 
 
+def test_value_counts_bins_nas():
+    # GH25970, handle normalizing bins with NA's properly
+    # First test that NA's are included appropriately
+    rand_data = np.append(
+        np.random.randint(1, 5, 50), [np.nan] * np.random.randint(1, 20)
+    )
+    s = Series(rand_data)
+    assert s.value_counts(dropna=False).index.hasnans

Review comment: can you parameterize here on the bins arg? Then you can split out to another test starting on L208.

Reply: Will do, and will merge master soon. I'm still working on a nice way to handle the groupby value_counts that will give consistent output for datetime Groupers that include datetimes not present in the data.

+    assert not s.value_counts(dropna=True).index.hasnans
+    assert s.value_counts(dropna=False, bins=3).index.hasnans
+    assert not s.value_counts(dropna=True, bins=3).index.hasnans
+    assert s.value_counts(dropna=False, bins=[0, 1, 3, 6]).index.hasnans
+    assert not s.value_counts(dropna=True, bins=[0, 1, 3, 6]).index.hasnans
+
+    # then verify specific example
+    s2 = Series([1, 2, 2, 3, 3, 3, np.nan, np.nan, 4, 5])
+    intervals = IntervalIndex.from_breaks([0.995, 2.333, 3.667, 5.0])
+    expected_dropna = Series([0.375, 0.375, 0.25], intervals.take([1, 0, 2]))
+    expected_keepna_vals = np.array([0.3, 0.3, 0.2, 0.2])
+    tm.assert_series_equal(
+        s2.value_counts(dropna=True, normalize=True, bins=3), expected_dropna
+    )
+    tm.assert_numpy_array_equal(

Review comment: you don't need to do this; it is already done in assert_series_equal.

+        s2.value_counts(dropna=False, normalize=True, bins=3).values,
+        expected_keepna_vals,
+    )
+
+
 def test_value_counts_datetime64(index_or_series):
     klass = index_or_series
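One possible shape for the parametrization requested above, sketched with a hypothetical test name and only the ``hasnans`` assertions (not the PR's final code):

import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize("bins", [None, 3, [0, 1, 3, 6]])
def test_value_counts_index_hasnans(bins):  # hypothetical name
    # Same data-generation idea as the PR's test: random integers plus some NaNs.
    data = np.append(
        np.random.randint(1, 5, 50), [np.nan] * np.random.randint(1, 20)
    )
    s = pd.Series(data)
    kwargs = {} if bins is None else {"bins": bins}
    # With dropna=False the NaN bucket should appear in the index;
    # with dropna=True it should not.
    assert s.value_counts(dropna=False, **kwargs).index.hasnans
    assert not s.value_counts(dropna=True, **kwargs).index.hasnans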
@@ -53,10 +53,10 @@ def seed_df(seed_nans, n, m):
 @pytest.mark.slow
 @pytest.mark.parametrize("df, keys, bins, n, m", binned, ids=ids)
 @pytest.mark.parametrize("isort", [True, False])
-@pytest.mark.parametrize("normalize", [True, False])
+@pytest.mark.parametrize("normalize", [False])

Review comment: Can you keep these? Removing these reduces test coverage.

Reply: I specifically addressed this in my last comment:

Review comment: Yea we can't just remove test coverage like this to get things to pass. I'm not sure what the old commits looked like (FYI if you merge master instead of rebasing you don't need to force push, which helps retain history) but probably need to re-integrate that or some aspect of the fix so we don't have to do this.

Reply: Thanks; will do.

 @pytest.mark.parametrize("sort", [True, False])
 @pytest.mark.parametrize("ascending", [True, False])
-@pytest.mark.parametrize("dropna", [True, False])
+@pytest.mark.parametrize("dropna", [True])
 def test_series_groupby_value_counts(
     df, keys, bins, n, m, isort, normalize, sort, ascending, dropna
 ):

@@ -71,6 +71,7 @@ def rebuild_index(df):
 
     gr = df.groupby(keys, sort=isort)
     left = gr["3rd"].value_counts(**kwargs)
+    # left.index.names = left.index.names[:-1] + ["3rd"]
 
     gr = df.groupby(keys, sort=isort)
     right = gr["3rd"].apply(Series.value_counts, **kwargs)
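For context, the invariant this parametrized test exercises is roughly the following; the sketch below uses made-up data and is not the PR's test code:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "a", "b", "b", "b"], "3rd": [1.0, 2.0, 2.0, np.nan, 3.0]}
)
kwargs = dict(normalize=True, sort=True, ascending=False, dropna=True, bins=None)

# SeriesGroupBy.value_counts ...
left = df.groupby("key")["3rd"].value_counts(**kwargs)

# ... should agree with applying Series.value_counts group by group
# (the real test also aligns the index level names before comparing).
right = df.groupby("key")["3rd"].apply(pd.Series.value_counts, **kwargs)

print(left)
print(right)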
Review comment: hmm this will use pd.cut either way, so pls amend the doc to say that. Needs a versionchanged tag 1.2.
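To make those two requests concrete, one way the ``bins`` entry could read is sketched below; this wording is a suggestion, not the text that was actually merged:

    bins : int or iterable of numeric, optional
        Rather than count individual values, group them into half-open bins.
        Only works with numeric data. If an int, that many bins are computed;
        if an iterable of numeric, the given numbers are used as bin endpoints.
        In both cases the binning is performed with ``pd.cut``.

        .. versionchanged:: 1.2
           ``bins`` may also be an iterable of numbers to use as bin endpoints.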