
BUG: DataFrameGroupBy.value_counts when grouper has a frequency #47286

Closed
@LucasG0

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> from pandas import DataFrame, Grouper
>>> df = DataFrame(
...     {
...         "Timestamp": [pd.Timestamp(i) for i in range(3)],
...         "Food": ["apple", "apple", "banana"],
...     }
... )

>>> dfg = df.groupby(Grouper(freq="1D", key="Timestamp"))
>>> dfg.value_counts()

../../core/groupby/generic.py:1800: in value_counts
    result_series = cast(Series, gb.size())
../../core/groupby/groupby.py:2323: in size
    result = self.grouper.size()
../../core/groupby/ops.py:881: in size
    ids, _, ngroups = self.group_info
pandas/_libs/properties.pyx:36: in pandas._libs.properties.CachedProperty.__get__
    ???
../../core/groupby/ops.py:915: in group_info
    comp_ids, obs_group_ids = self._get_compressed_codes()
../../core/groupby/ops.py:941: in _get_compressed_codes
    group_index = get_group_index(self.codes, self.shape, sort=True, xnull=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

labels = [array([0]), array([0, 1, 2]), array([0, 0, 1])], shape = (1, 3, 2)
sort = True, xnull = True

    def get_group_index(
        labels, shape: Shape, sort: bool, xnull: bool
    ) -> npt.NDArray[np.int64]:
        """
        For the particular label_list, gets the offsets into the hypothetical list
        representing the totally ordered cartesian product of all possible label
        combinations, *as long as* this space fits within int64 bounds;
        otherwise, though group indices identify unique combinations of
        labels, they cannot be deconstructed.
        - If `sort`, rank of returned ids preserve lexical ranks of labels.
          i.e. returned id's can be used to do lexical sort on labels;
        - If `xnull` nulls (-1 labels) are passed through.
    
        Parameters
        ----------
        labels : sequence of arrays
            Integers identifying levels at each location
        shape : tuple[int, ...]
            Number of unique levels at each location
        sort : bool
            If the ranks of returned ids should match lexical ranks of labels
        xnull : bool
            If true nulls are excluded. i.e. -1 values in the labels are
            passed through.
    
        Returns
        -------
        An array of type int64 where two elements are equal if their corresponding
        labels are equal at all location.
    
        Notes
        -----
        The length of `labels` and `shape` must be identical.
        """
    
        def _int64_cut_off(shape) -> int:
            acc = 1
            for i, mul in enumerate(shape):
                acc *= int(mul)
                if not acc < lib.i8max:
                    return i
            return len(shape)
    
        def maybe_lift(lab, size) -> tuple[np.ndarray, int]:
            # promote nan values (assigned -1 label in lab array)
            # so that all output values are non-negative
            return (lab + 1, size + 1) if (lab == -1).any() else (lab, size)
    
        labels = [ensure_int64(x) for x in labels]
        lshape = list(shape)
        if not xnull:
            for i, (lab, size) in enumerate(zip(labels, shape)):
                lab, size = maybe_lift(lab, size)
                labels[i] = lab
                lshape[i] = size
    
        labels = list(labels)
    
        # Iteratively process all the labels in chunks sized so less
        # than lib.i8max unique int ids will be required for each chunk
        while True:
            # how many levels can be done without overflow:
            nlev = _int64_cut_off(lshape)
    
            # compute flat ids for the first `nlev` levels
            stride = np.prod(lshape[1:nlev], dtype="i8")
            out = stride * labels[0].astype("i8", subok=False, copy=False)
    
            for i in range(1, nlev):
                if lshape[i] == 0:
                    stride = np.int64(0)
                else:
                    stride //= lshape[i]
>               out += labels[i] * stride
E               ValueError: non-broadcastable output operand with shape (1,) doesn't match the broadcast shape (3,)

../../core/sorting.py:182: ValueError
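
The failure can be reproduced outside pandas. The following is a minimal sketch, assuming a simplified version of the flat-index computation: `flat_group_index` is a hypothetical helper written for illustration, not the pandas function, and it omits the int64-overflow chunking and null handling shown above. It encodes per-key integer codes as a single mixed-radix id and accumulates in place, so every codes array must have the same length as the first one. In the traceback the first array has length 1 while the other two have length 3, which appears to be what triggers the broadcast error.

```python
import numpy as np


def flat_group_index(labels, shape):
    """Simplified mixed-radix encoding of per-key integer codes.

    Hypothetical helper for illustration only. Assumes every size in
    `shape` is non-zero and that the product of sizes fits in int64.
    """
    labels = [np.asarray(lab, dtype=np.int64) for lab in labels]
    # The stride of the first key is the product of the sizes of all later keys.
    stride = int(np.prod(shape[1:], dtype=np.int64))
    out = stride * labels[0]
    for lab, size in zip(labels[1:], shape[1:]):
        stride //= size
        # In-place add: `out` keeps the length of labels[0].
        out += lab * stride
    return out


# Row-aligned codes: every array has one entry per row, so this works.
ok = flat_group_index([[0, 0, 0], [0, 1, 2], [0, 0, 1]], (1, 3, 2))

# Codes as reported in the traceback: lengths 1, 3 and 3.
try:
    flat_group_index([[0], [0, 1, 2], [0, 0, 1]], (1, 3, 2))
    error = None
except ValueError as exc:
    error = str(exc)
```

With row-aligned codes the three rows receive distinct-or-equal ids according to their key combinations. With the codes from the traceback, NumPy refuses to add a length-3 array into a length-1 output.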

Issue Description

DataFrameGroupBy.value_counts raises a ValueError when the groupby uses a Grouper with a freq, whereas SeriesGroupBy.value_counts works in the same situation. The SeriesGroupBy implementation is already covered by a test named test_series_groupby_value_counts_with_grouper.

Expected Behavior

In this case, the DataFrame has only one column besides the grouping key, so the call should return a result similar to the SeriesGroupBy implementation:

>>> dfg["Food"].value_counts()
Timestamp   Food  
1970-01-01  apple     2
            banana    1
Name: Food, dtype: int64
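
Until the DataFrame path is fixed, the same counts can probably be obtained by grouping on the remaining column explicitly. This is a sketch of one possible workaround, not something proposed in the report itself: it passes both the Grouper and the "Food" column as keys and calls `size()`. Note that `size()` orders the result by group key rather than by descending count, so the ordering may differ from `value_counts()` in general.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": [pd.Timestamp(i) for i in range(3)],
        "Food": ["apple", "apple", "banana"],
    }
)

# Group by the daily bin and the remaining column, then count rows per group.
counts = df.groupby([pd.Grouper(freq="1D", key="Timestamp"), "Food"]).size()
```

The result is a Series indexed by (Timestamp, Food), like the SeriesGroupBy output above.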

The Series and DataFrame behaviors differ because the two methods currently have very different implementations. Refactoring them into a single implementation might be done in #46940.

Installed Versions

INSTALLED VERSIONS

commit : 997f84b
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-44-generic
Version : #49~20.04.1-Ubuntu SMP Wed May 18 18:44:28 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 1.1.0.dev0+8026.g997f84bd8f.dirty
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 57.5.0
pip : 20.0.2
Cython : 0.29.30
pytest : 7.1.2
hypothesis : 6.46.2
sphinx : 4.5.0
blosc : 1.10.6
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.3.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.4
brotli : None
fastparquet : 0.8.1
fsspec : 2022.3.0
gcsfs : 2022.3.0
matplotlib : 3.5.2
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : 1.1.6
pyxlsb : None
s3fs : 2022.3.0
scipy : 1.8.0
snappy :
sqlalchemy : 1.4.36
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None

Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug, Groupby
