Value counts normalize #33652
Changes from 12 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,6 +23,7 @@ Fixed regressions | |
Bug fixes | ||
~~~~~~~~~ | ||
|
||
|
||
Contributors | ||
~~~~~~~~~~~~ | ||
|
||
|
|
@@ -434,7 +434,8 @@ Performance improvements | |
|
||
Bug fixes | ||
~~~~~~~~~ | ||
|
||
Fixed Series.value_counts so that normalize excludes NA values when dropna=False. (:issue:`25970`) | ||
Review comment: can you move this down into the Numeric section. starts at L482 |
Reply: Done |
||
Fixed Dataframe Groupby value_counts with bins (:issue:`32471`) | ||
Review comment: can you move this down into the Groupby/resample/rolling section. starts on L596. |
||
|
||
Categorical | ||
^^^^^^^^^^^ | ||
|
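The whatsnew entry for `Series.value_counts` concerns how `normalize` interacts with `dropna`. A minimal sketch of the intended behavior (assuming pandas and numpy are available; the data below is made up, loosely mirroring the PR's test case):

```python
import numpy as np
import pandas as pd

# 10 values, two of them NaN
s = pd.Series([1, 2, 2, 3, 3, 3, np.nan, np.nan, 4, 5])

# With dropna=False, NaN becomes its own key and the proportions
# are computed over all 10 values, so they sum to 1.
out = s.value_counts(dropna=False, normalize=True)
```

With `dropna=True` the NaN rows drop out of both the keys and the denominator.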
|
@@ -415,7 +415,7 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray: | |
if is_categorical_dtype(comps): | ||
# TODO(extension) | ||
# handle categoricals | ||
return comps.isin(values) # type: ignore | ||
return comps.isin(values) | ||
|
||
comps, dtype = _ensure_data(comps) | ||
values, _ = _ensure_data(values, dtype=dtype) | ||
|
@@ -663,17 +663,22 @@ def value_counts( | |
ascending : bool, default False | ||
Sort in ascending order | ||
normalize: bool, default False | ||
If True then compute a relative histogram | ||
bins : integer, optional | ||
Rather than count values, group them into half-open bins, | ||
convenience for pd.cut, only works with numeric data | ||
If True, then compute a relative histogram that outputs the | ||
proportion of each value. | ||
bins : integer or iterable of numeric, optional | ||
Rather than count values, group them into half-open bins. | ||
Only works with numeric data. | ||
If int, interpreted as number of bins and will use pd.cut. | ||
Review comment: hmm this will use pd.cut either way, so pls amend the doc to say that. needs a versionchanged tag 1.2 |
||
If iterable of numeric, will use provided numbers as bin endpoints. | ||
dropna : bool, default True | ||
Don't include counts of NaN | ||
Don't include counts of NaN. | ||
If False and NaNs are present, NaN will be a key in the output. | ||
|
||
Returns | ||
------- | ||
Series | ||
""" | ||
|
||
from pandas.core.series import Series | ||
|
||
name = getattr(values, "name", None) | ||
|
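The amended docstring draws a distinction between an integer and an iterable `bins`; as the reviewer notes, `pd.cut` is used either way. A short sketch of the two call styles (pandas assumed; the series is arbitrary):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3])

# int: cut into 3 equal-width half-open bins
by_int = s.value_counts(bins=3)

# iterable: the numbers become bin edges, here (0, 2] and (2, 4]
by_edges = s.value_counts(bins=[0, 2, 4])
```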
@@ -689,16 +694,15 @@ def value_counts( | |
|
||
# count, remove nulls (from the index), and sort the bins | ||
result = ii.value_counts(dropna=dropna) | ||
result = result[result.index.notna()] | ||
result.index = result.index.astype("interval") | ||
result = result.sort_index() | ||
|
||
# if we are dropna and we have NO values | ||
if dropna and (result._values == 0).all(): | ||
result = result.iloc[0:0] | ||
|
||
# normalizing is by len of all (regardless of dropna) | ||
counts = np.array([len(ii)]) | ||
# normalizing is by len of what gets included in the bins | ||
counts = result._values | ||
|
||
else: | ||
|
||
|
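The key change in the binned branch is the normalization denominator: the old code divided by `len(ii)` (every row, NaN included, even with `dropna=True`), while the new code divides by the total of the counts that actually landed in the bins. The arithmetic in isolation (numpy only; the counts are hypothetical):

```python
import numpy as np

# hypothetical per-bin counts after NaN rows were dropped
binned_counts = np.array([3, 3, 2])

# normalize by what is included in the bins (denominator 8),
# not by the raw row count including NaN
proportions = binned_counts / binned_counts.sum()
```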
|
@@ -1176,17 +1176,19 @@ def value_counts( | |
Parameters | ||
---------- | ||
normalize : bool, default False | ||
If True then the object returned will contain the relative | ||
frequencies of the unique values. | ||
If True, outputs the relative frequencies of the unique values. | ||
sort : bool, default True | ||
Sort by frequencies. | ||
ascending : bool, default False | ||
Sort in ascending order. | ||
bins : int, optional | ||
Rather than count values, group them into half-open bins, | ||
a convenience for ``pd.cut``, only works with numeric data. | ||
bins : integer or iterable of numeric, optional | ||
Rather than count individual values, group them into half-open bins. | ||
Only works with numeric data. | ||
If int, interpreted as number of bins and will use `pd.cut`. | ||
Review comment: update this doc-string the same way (in theory these could be shared, but that's another day) |
||
If iterable of numeric, will use provided numbers as bin endpoints. | ||
dropna : bool, default True | ||
Don't include counts of NaN. | ||
If False and NaNs are present, NaN will be a key in the output. | ||
|
||
Returns | ||
------- | ||
|
@@ -1223,15 +1225,26 @@ def value_counts( | |
|
||
Bins can be useful for going from a continuous variable to a | ||
categorical variable; instead of counting unique | ||
apparitions of values, divide the index in the specified | ||
number of half-open bins. | ||
instances of values, count the number of values that fall | ||
into half-open intervals. | ||
|
||
Bins can be an int. | ||
|
||
>>> s.value_counts(bins=3) | ||
(2.0, 3.0] 2 | ||
(0.996, 2.0] 2 | ||
(3.0, 4.0] 1 | ||
dtype: int64 | ||
|
||
Bins can also be an iterable of numbers. These numbers are treated | ||
as endpoints for the intervals. | ||
|
||
>>> s.value_counts(bins=[0,2,4,9]) | ||
Review comment: use spaces between the bins |
||
(2.0, 4.0] 3 | ||
Review comment: is there a space missing here? |
||
(-0.001, 2.0] 2 | ||
(4.0, 9.0] 0 | ||
dtype: int64 | ||
|
||
**dropna** | ||
|
||
With `dropna` set to `False` we can also see NaN index values. | ||
|
@@ -1244,6 +1257,7 @@ def value_counts( | |
1.0 1 | ||
dtype: int64 | ||
""" | ||
|
||
result = value_counts( | ||
self, | ||
sort=sort, | ||
|
|
@@ -7,7 +7,6 @@ | |
""" | ||
from collections import abc, namedtuple | ||
import copy | ||
from functools import partial | ||
from textwrap import dedent | ||
import typing | ||
from typing import ( | ||
|
@@ -41,11 +40,8 @@ | |
maybe_downcast_to_dtype, | ||
) | ||
from pandas.core.dtypes.common import ( | ||
ensure_int64, | ||
ensure_platform_int, | ||
is_bool, | ||
is_integer_dtype, | ||
is_interval_dtype, | ||
is_numeric_dtype, | ||
is_object_dtype, | ||
is_scalar, | ||
|
@@ -671,128 +667,14 @@ def describe(self, **kwargs): | |
def value_counts( | ||
self, normalize=False, sort=True, ascending=False, bins=None, dropna=True | ||
): | ||
|
||
from pandas.core.reshape.tile import cut | ||
from pandas.core.reshape.merge import _get_join_indexers | ||
|
||
if bins is not None and not np.iterable(bins): | ||
# scalar bins cannot be done at top level | ||
# in a backward compatible way | ||
return self.apply( | ||
Series.value_counts, | ||
normalize=normalize, | ||
sort=sort, | ||
ascending=ascending, | ||
bins=bins, | ||
) | ||
|
||
ids, _, _ = self.grouper.group_info | ||
val = self.obj._values | ||
|
||
# groupby removes null keys from groupings | ||
mask = ids != -1 | ||
ids, val = ids[mask], val[mask] | ||
|
||
if bins is None: | ||
lab, lev = algorithms.factorize(val, sort=True) | ||
llab = lambda lab, inc: lab[inc] | ||
else: | ||
|
||
# lab is a Categorical with categories an IntervalIndex | ||
lab = cut(Series(val), bins, include_lowest=True) | ||
lev = lab.cat.categories | ||
lab = lev.take(lab.cat.codes) | ||
llab = lambda lab, inc: lab[inc]._multiindex.codes[-1] | ||
|
||
if is_interval_dtype(lab): | ||
# TODO: should we do this inside II? | ||
sorter = np.lexsort((lab.left, lab.right, ids)) | ||
else: | ||
sorter = np.lexsort((lab, ids)) | ||
|
||
ids, lab = ids[sorter], lab[sorter] | ||
|
||
# group boundaries are where group ids change | ||
idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]] | ||
|
||
# new values are where sorted labels change | ||
lchanges = llab(lab, slice(1, None)) != llab(lab, slice(None, -1)) | ||
inc = np.r_[True, lchanges] | ||
inc[idx] = True # group boundaries are also new values | ||
out = np.diff(np.nonzero(np.r_[inc, True])[0]) # value counts | ||
|
||
# num. of times each group should be repeated | ||
rep = partial(np.repeat, repeats=np.add.reduceat(inc, idx)) | ||
|
||
# multi-index components | ||
codes = self.grouper.reconstructed_codes | ||
codes = [rep(level_codes) for level_codes in codes] + [llab(lab, inc)] | ||
levels = [ping.group_index for ping in self.grouper.groupings] + [lev] | ||
names = self.grouper.names + [self._selection_name] | ||
|
||
if dropna: | ||
mask = codes[-1] != -1 | ||
if mask.all(): | ||
dropna = False | ||
else: | ||
out, codes = out[mask], [level_codes[mask] for level_codes in codes] | ||
|
||
if normalize: | ||
out = out.astype("float") | ||
d = np.diff(np.r_[idx, len(ids)]) | ||
if dropna: | ||
m = ids[lab == -1] | ||
np.add.at(d, m, -1) | ||
acc = rep(d)[mask] | ||
else: | ||
acc = rep(d) | ||
out /= acc | ||
|
||
if sort and bins is None: | ||
cat = ids[inc][mask] if dropna else ids[inc] | ||
sorter = np.lexsort((out if ascending else -out, cat)) | ||
out, codes[-1] = out[sorter], codes[-1][sorter] | ||
|
||
if bins is None: | ||
mi = MultiIndex( | ||
levels=levels, codes=codes, names=names, verify_integrity=False | ||
) | ||
|
||
if is_integer_dtype(out): | ||
out = ensure_int64(out) | ||
return Series(out, index=mi, name=self._selection_name) | ||
|
||
# for compat. with libgroupby.value_counts need to ensure every | ||
# bin is present at every index level, null filled with zeros | ||
diff = np.zeros(len(out), dtype="bool") | ||
for level_codes in codes[:-1]: | ||
diff |= np.r_[True, level_codes[1:] != level_codes[:-1]] | ||
|
||
ncat, nbin = diff.sum(), len(levels[-1]) | ||
|
||
left = [np.repeat(np.arange(ncat), nbin), np.tile(np.arange(nbin), ncat)] | ||
|
||
right = [diff.cumsum() - 1, codes[-1]] | ||
|
||
_, idx = _get_join_indexers(left, right, sort=False, how="left") | ||
out = np.where(idx != -1, out[idx], 0) | ||
|
||
if sort: | ||
sorter = np.lexsort((out if ascending else -out, left[0])) | ||
out, left[-1] = out[sorter], left[-1][sorter] | ||
|
||
# build the multi-index w/ full levels | ||
def build_codes(lev_codes: np.ndarray) -> np.ndarray: | ||
return np.repeat(lev_codes[diff], nbin) | ||
|
||
codes = [build_codes(lev_codes) for lev_codes in codes[:-1]] | ||
codes.append(left[-1]) | ||
|
||
mi = MultiIndex(levels=levels, codes=codes, names=names, verify_integrity=False) | ||
|
||
if is_integer_dtype(out): | ||
out = ensure_int64(out) | ||
return Series(out, index=mi, name=self._selection_name) | ||
return self.apply( | ||
Review comment: This is most likely significantly slower than the existing implementation - can you run the appropriate groupby benchmarks to check? |
||
Series.value_counts, | ||
normalize=normalize, | ||
sort=sort, | ||
ascending=ascending, | ||
bins=bins, | ||
dropna=dropna, | ||
) | ||
|
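The deleted block is replaced by a delegation to `Series.value_counts` through `apply`. Stripped of the PR's keyword plumbing, the new path behaves like this sketch (pandas assumed; the frame is made up):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 1, 2]})

# per-group value counts through apply, as in the simplified code path
res = df.groupby("g")["x"].apply(pd.Series.value_counts)
```

This is far less code, at the cost of per-group Python overhead, which is what the benchmark request in this review is probing.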
||
def count(self) -> Series: | ||
""" | ||
|
|
@@ -191,6 +191,34 @@ def test_value_counts_bins(index_or_series): | |
assert s.nunique() == 0 | ||
|
||
|
||
def test_value_counts_bins_nas(): | ||
# GH25970, handle normalizing bins with NA's properly | ||
# First test that NA's are included appropriately | ||
rand_data = np.append( | ||
np.random.randint(1, 5, 50), [np.nan] * np.random.randint(1, 20) | ||
) | ||
s = Series(rand_data) | ||
assert s.value_counts(dropna=False).index.hasnans | ||
Review comment: can you parameterize here on the bins arg. Then you can split out to another test starting on L208. |
Reply: Will do, and will merge master soon. I'm still working on a nice way to handle the groupby value_counts that will give consistent output for datetime Groupers that include datetimes not present in the data. |
||
assert not s.value_counts(dropna=True).index.hasnans | ||
assert s.value_counts(dropna=False, bins=3).index.hasnans | ||
assert not s.value_counts(dropna=True, bins=3).index.hasnans | ||
assert s.value_counts(dropna=False, bins=[0, 1, 3, 6]).index.hasnans | ||
assert not s.value_counts(dropna=True, bins=[0, 1, 3, 6]).index.hasnans | ||
|
||
# then verify specific example | ||
s2 = Series([1, 2, 2, 3, 3, 3, np.nan, np.nan, 4, 5]) | ||
intervals = IntervalIndex.from_breaks([0.995, 2.333, 3.667, 5.0]) | ||
expected_dropna = Series([0.375, 0.375, 0.25], intervals.take([1, 0, 2])) | ||
expected_keepna_vals = np.array([0.3, 0.3, 0.2, 0.2]) | ||
tm.assert_series_equal( | ||
s2.value_counts(dropna=True, normalize=True, bins=3), expected_dropna | ||
) | ||
tm.assert_numpy_array_equal( | ||
Review comment: you don't need to do this, already done in assert_series_equal |
||
s2.value_counts(dropna=False, normalize=True, bins=3).values, | ||
expected_keepna_vals, | ||
) | ||
|
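The reviewer's ask is to parameterize this test over `bins`. In sketch form (a plain loop standing in for `pytest.mark.parametrize`; pandas/numpy assumed, and only the `dropna=True` half is exercised here since it does not depend on this PR's new behavior):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, np.nan, np.nan, 4, 5])

results = {}
for bins in (3, [0, 1, 3, 6]):
    # dropna=True: NaN never reaches the interval index
    counted = s.value_counts(dropna=True, bins=bins)
    assert not counted.index.hasnans
    results[str(bins)] = counted
```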
||
|
||
def test_value_counts_datetime64(index_or_series): | ||
klass = index_or_series | ||
|
||
|
Review comment: Can you revert unrelated changes? Looks like blank space and file permissions were changed here
Review comment: @DataInformer can you address this
Review comment: can you do this.
Reply: Sorry, I thought I did this before, but apparently it reverted. Should be undone again now.