BUG: applying value_counts to a column of a grouped dataframe results in inconsistent output types #42608

Open
@HaydenSansum

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

import pandas as pd

# Test 1 - Group by with value counts returns a series
id_column = ["idA","idA","idA","idA","idA","idA","idB","idB","idB","idB","idB","idB","idB"]
test_bool_value = [True,False,False,False,False,False,True,True,False,False,False,True,True]
test_data = {"id":id_column, "bool_value":test_bool_value}

df = pd.DataFrame(data=test_data)

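# The lambda argument is named `group` so it does not shadow the outer `df`
df_output = df.groupby("id").apply(lambda group: group["bool_value"].value_counts())
print(type(df_output))  # reported on pandas 1.3.0: <class 'pandas.core.series.Series'>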

# Test 2 - Group by with value counts returns a dataframe (same input df shape with different values)
id_column = ["idA","idA","idA","idA","idA","idA","idB","idB","idB","idB","idB","idB","idB"]
test_bool_value = [True,True,True,True,False,False,True,True,False,False,False,True,True]
test_data = {"id":id_column, "bool_value":test_bool_value}

df = pd.DataFrame(data=test_data)

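# Same call as in Test 1; only the values in bool_value differ
df_output = df.groupby("id").apply(lambda group: group["bool_value"].value_counts())
print(type(df_output))  # reported on pandas 1.3.0: <class 'pandas.core.frame.DataFrame'>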

Problem description

When value_counts() is applied to a column inside groupby(...).apply(...), in this instance to count how many True/False values exist within each id group, the output type switches between a Series and a DataFrame depending only on the arrangement of True/False values. The shape of the incoming DataFrame is the same in both tests.
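To narrow down what differs between the two tests, the sketch below lists the index order that value_counts() produces for each group. The helper name `index_order_per_group` and the variables `test_1` and `test_2` are introduced here for illustration only. That apply() combines results differently when the per-group indexes do not match is an assumption about the cause, not something this report confirms.

```python
import pandas as pd

ids = ["idA"] * 6 + ["idB"] * 7

# Same data as Test 1 and Test 2 in the code sample
test_1 = pd.DataFrame({
    "id": ids,
    "bool_value": [True, False, False, False, False, False,
                   True, True, False, False, False, True, True],
})
test_2 = pd.DataFrame({
    "id": ids,
    "bool_value": [True, True, True, True, False, False,
                   True, True, False, False, False, True, True],
})


def index_order_per_group(df):
    """Hypothetical helper: map each id to the index order of its value_counts()."""
    return {
        name: group["bool_value"].value_counts().index.tolist()
        for name, group in df.groupby("id")
    }


order_1 = index_order_per_group(test_1)
order_2 = index_order_per_group(test_2)
print(order_1)
print(order_2)
```

In Test 1 the two groups list their values in opposite orders, because value_counts() sorts by descending count and False is the majority in idA but the minority in idB. In Test 2 both groups list True first. This is the only visible difference between the per-group results.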

Expected Output

Both tests should return the same type, either <class 'pandas.core.frame.DataFrame'> or <class 'pandas.core.series.Series'>. Since value_counts() returns a Series for each group, I would expect the default output of the groupby operation to be a DataFrame.
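Until the behaviour is made consistent, one possible workaround is to avoid apply() and call value_counts() directly on the grouped column, then unstack the result. This is a minimal sketch, not a fix for the underlying bug. The function name `bool_counts_per_id` and the variable names are hypothetical.

```python
import pandas as pd


def bool_counts_per_id(df):
    """Hypothetical helper: count bool_value occurrences per id as a DataFrame."""
    # SeriesGroupBy.value_counts returns a Series indexed by (id, bool_value)
    counts = df.groupby("id")["bool_value"].value_counts()
    # unstack moves the bool_value level into columns; fill_value=0 covers
    # groups in which one of the values never occurs
    return counts.unstack(fill_value=0)


ids = ["idA"] * 6 + ["idB"] * 7
frame_1 = pd.DataFrame({
    "id": ids,
    "bool_value": [True, False, False, False, False, False,
                   True, True, False, False, False, True, True],
})
frame_2 = pd.DataFrame({
    "id": ids,
    "bool_value": [True, True, True, True, False, False,
                   True, True, False, False, False, True, True],
})

result_1 = bool_counts_per_id(frame_1)
result_2 = bool_counts_per_id(frame_2)
print(type(result_1))
print(type(result_2))
```

Both calls produce a DataFrame with one row per id and one column per distinct bool_value, regardless of how the True/False values are arranged.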

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f00ed8f
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.89+
Version : #1 SMP Sat Feb 13 19:45:14 PST 2021
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : 0.29.21
pytest : 6.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.20.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : 0.50.1

Metadata

Labels: Algos, Apply, Bug, Groupby
