Closed
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
>>> df
   A  B
0  1  2
1  1  4
2  5  6
[3 rows x 2 columns]
With as_index=False, selecting a column of the GroupBy object still returns all columns:
>>> g = df.groupby('A', as_index=False)['B']
>>> g.get_group(1)
   A  B
0  1  2
1  1  4
[2 rows x 2 columns]
>>> g = df.groupby('A', as_index=False)
>>> g.get_group(1)
   A  B
0  1  2
1  1  4
[2 rows x 2 columns]
>>> g.get_group(1)['B']
0    2
1    4
Name: B, dtype: int64
So a function passed to apply is applied to all columns:
>>> df.groupby('A', as_index=False)['B'].apply(lambda x: x.cumsum())
   A  B
0  1  2
1  2  6
2  5  6
[3 rows x 2 columns]
With as_index=True it works as expected:
>>> g = df.groupby('A')
>>> g.get_group(1)
   A  B
0  1  2
1  1  4
[2 rows x 2 columns]
>>> g = df.groupby('A')['B']
>>> g.get_group(1)
0    2
1    4
Name: B, dtype: int64
>>> df.groupby('A')['B'].apply(lambda x: x.cumsum())
0    2
1    6
2    6
dtype: int64
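For comparison, here is a minimal sketch of the same per-group cumulative sum using the built-in cumsum method on the selected column instead of apply. It assumes a reasonably recent pandas; the name per_group is only illustrative and does not appear in the report above.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=["A", "B"])

# Built-in cumulative sum within each group of A, computed on column B only.
# The result is a Series aligned with the original row index.
per_group = df.groupby("A")["B"].cumsum()
print(per_group)
```

Because the result keeps the original row labels, it can be assigned straight back as a new column of df if needed.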
A more elaborate example where this came up:
>>> s="""L1 L2 L3
... X 1 200
... X 2 100
... Z 1 15
... X 3 200
... Z 2 10
... Y 1 1
... Z 3 20
... Y 2 10
... Y 3 100"""
>>>
>>> df = pd.read_csv(StringIO(s), sep="\s+")
>>> df.groupby("L1")["L3"].apply(lambda x: x.order().cumsum()/x.sum())
L1
X   1    0.200000
    0    0.600000
    3    1.000000
Y   5    0.009009
    7    0.099099
    8    1.000000
Z   4    0.222222
    2    0.555556
    6    1.000000
dtype: float64
But if I don't want X, Y, Z in the index:
>>> df.groupby("L1", as_index=False)["L3"].apply(lambda x: x.order().cumsum()/x.sum())
This returns an error because x is a DataFrame.
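One possible way to get the cumulative shares without X, Y, Z in the index is sketched below. This is a workaround under stated assumptions, not necessarily how the issue was resolved: it assumes a pandas version where Series.order has been replaced by Series.sort_values, and it relies on the group_keys=False option of groupby. The helper name cumulative_share is hypothetical.

```python
from io import StringIO

import pandas as pd

s = """L1 L2 L3
X 1 200
X 2 100
Z 1 15
X 3 200
Z 2 10
Y 1 1
Z 3 20
Y 2 10
Y 3 100"""

df = pd.read_csv(StringIO(s), sep=r"\s+")


def cumulative_share(x):
    # Sort the group's values (stable sort keeps ties in row order),
    # then express the running total as a fraction of the group total.
    return x.sort_values(kind="mergesort").cumsum() / x.sum()


# group_keys=False asks pandas not to prepend the group labels to the
# index, so only the original row labels remain.
shares = df.groupby("L1", group_keys=False)["L3"].apply(cumulative_share)
print(shares)
```

The values should match the as_index=True output above, with only the original row labels left in the index.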