Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
>>> import pandas as pd
>>> from pandas import DataFrame, Grouper
>>> df = DataFrame(
...     {
...         "Timestamp": [pd.Timestamp(i) for i in range(3)],
...         "Food": ["apple", "apple", "banana"],
...     }
... )
>>> dfg = df.groupby(Grouper(freq="1D", key="Timestamp"))
>>> dfg.value_counts()
../../core/groupby/generic.py:1800: in value_counts
result_series = cast(Series, gb.size())
../../core/groupby/groupby.py:2323: in size
result = self.grouper.size()
../../core/groupby/ops.py:881: in size
ids, _, ngroups = self.group_info
pandas/_libs/properties.pyx:36: in pandas._libs.properties.CachedProperty.__get__
???
../../core/groupby/ops.py:915: in group_info
comp_ids, obs_group_ids = self._get_compressed_codes()
../../core/groupby/ops.py:941: in _get_compressed_codes
group_index = get_group_index(self.codes, self.shape, sort=True, xnull=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
labels = [array([0]), array([0, 1, 2]), array([0, 0, 1])], shape = (1, 3, 2)
sort = True, xnull = True
def get_group_index(
labels, shape: Shape, sort: bool, xnull: bool
) -> npt.NDArray[np.int64]:
"""
For the particular label_list, gets the offsets into the hypothetical list
representing the totally ordered cartesian product of all possible label
combinations, *as long as* this space fits within int64 bounds;
otherwise, though group indices identify unique combinations of
labels, they cannot be deconstructed.
- If `sort`, rank of returned ids preserve lexical ranks of labels.
i.e. returned id's can be used to do lexical sort on labels;
- If `xnull` nulls (-1 labels) are passed through.
Parameters
----------
labels : sequence of arrays
Integers identifying levels at each location
shape : tuple[int, ...]
Number of unique levels at each location
sort : bool
If the ranks of returned ids should match lexical ranks of labels
xnull : bool
If true nulls are excluded. i.e. -1 values in the labels are
passed through.
Returns
-------
An array of type int64 where two elements are equal if their corresponding
labels are equal at all location.
Notes
-----
The length of `labels` and `shape` must be identical.
"""
def _int64_cut_off(shape) -> int:
acc = 1
for i, mul in enumerate(shape):
acc *= int(mul)
if not acc < lib.i8max:
return i
return len(shape)
def maybe_lift(lab, size) -> tuple[np.ndarray, int]:
# promote nan values (assigned -1 label in lab array)
# so that all output values are non-negative
return (lab + 1, size + 1) if (lab == -1).any() else (lab, size)
labels = [ensure_int64(x) for x in labels]
lshape = list(shape)
if not xnull:
for i, (lab, size) in enumerate(zip(labels, shape)):
lab, size = maybe_lift(lab, size)
labels[i] = lab
lshape[i] = size
labels = list(labels)
# Iteratively process all the labels in chunks sized so less
# than lib.i8max unique int ids will be required for each chunk
while True:
# how many levels can be done without overflow:
nlev = _int64_cut_off(lshape)
# compute flat ids for the first `nlev` levels
stride = np.prod(lshape[1:nlev], dtype="i8")
out = stride * labels[0].astype("i8", subok=False, copy=False)
for i in range(1, nlev):
if lshape[i] == 0:
stride = np.int64(0)
else:
stride //= lshape[i]
> out += labels[i] * stride
E ValueError: non-broadcastable output operand with shape (1,) doesn't match the broadcast shape (3,)
../../core/sorting.py:182: ValueError
Issue Description
DataFrameGroupBy.value_counts fails when the groupby uses a Grouper with a freq, while the same operation works for a SeriesGroupBy. There is already a test for the SeriesGroupBy implementation named test_series_groupby_value_counts_with_grouper.
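For context, the ValueError comes from the flat-id computation inside get_group_index shown in the traceback: one of the codes arrays has length 1 (presumably the single daily bin produced by the Grouper) while the other levels have one code per row, so the in-place addition cannot broadcast. Below is a minimal NumPy sketch of that computation, stripped of the overflow chunking and null handling; flat_group_ids is a hypothetical helper written for illustration, not a pandas function. With the codes reported in the traceback it raises the same error.

import numpy as np

def flat_group_ids(labels, shape):
    # Simplified version of the offset computation in get_group_index:
    # each combination of level codes maps to one int64 id in the
    # cartesian product of the level sizes (no overflow chunking here).
    out = np.zeros(len(labels[0]), dtype="i8")
    stride = np.prod(shape, dtype="i8")
    for lab, size in zip(labels, shape):
        stride //= size
        out += lab.astype("i8") * stride
    return out

# Consistent codes (one code per row at every level) work as intended:
flat_group_ids([np.array([0, 0, 0]), np.array([0, 1, 2]), np.array([0, 0, 1])], (1, 3, 2))
# array([0, 2, 5])

# Codes as reported in the traceback: the first level has a single code
# while the others have three, so the in-place addition cannot broadcast.
flat_group_ids([np.array([0]), np.array([0, 1, 2]), np.array([0, 0, 1])], (1, 3, 2))
# ValueError: non-broadcastable output operand with shape (1,) doesn't match the broadcast shape (3,)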
Expected Behavior
In this case, the DataFrame has only one column apart from the grouping key, so it should return a result similar to the SeriesGroupBy implementation:
>>> dfg["Food"].value_counts()
Timestamp Food
1970-01-01 apple 2
banana 1
Name: Food, dtype: int64
This difference between Series and DataFrame behaviors comes from the fact that they currently have two very different implementations. The refactor of these two implementations into a single one might be done in #46940.
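Until the two implementations are unified, the SeriesGroupBy call shown above serves as a workaround; an alternative that stays on the DataFrame is to spell out the composite grouping and count group sizes. A minimal sketch on the example frame (note that value_counts sorts by count within each group, while size keeps the index order):

import pandas as pd
from pandas import DataFrame, Grouper

df = DataFrame(
    {
        "Timestamp": [pd.Timestamp(i) for i in range(3)],
        "Food": ["apple", "apple", "banana"],
    }
)

# Group by the frequency Grouper and the value column explicitly, then
# count group sizes: this yields the same (Timestamp, Food) -> count
# mapping that DataFrameGroupBy.value_counts is expected to return.
df.groupby([Grouper(freq="1D", key="Timestamp"), "Food"]).size()
# Timestamp   Food
# 1970-01-01  apple     2
#             banana    1
# dtype: int64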
Installed Versions
INSTALLED VERSIONS
commit : 997f84b
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-44-generic
Version : #49~20.04.1-Ubuntu SMP Wed May 18 18:44:28 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8
pandas : 1.1.0.dev0+8026.g997f84bd8f.dirty
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 57.5.0
pip : 20.0.2
Cython : 0.29.30
pytest : 7.1.2
hypothesis : 6.46.2
sphinx : 4.5.0
blosc : 1.10.6
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.3.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.4
brotli : None
fastparquet : 0.8.1
fsspec : 2022.3.0
gcsfs : 2022.3.0
matplotlib : 3.5.2
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : 1.1.6
pyxlsb : None
s3fs : 2022.3.0
scipy : 1.8.0
snappy :
sqlalchemy : 1.4.36
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None