Commit ca2a46b

Author: Patrick Park
Removed changes to gotchas.rst and simplified example in groupby.rst
1 parent fa94960 commit ca2a46b

2 files changed: +11 −114 lines changed


doc/source/gotchas.rst (−93 lines)

@@ -337,96 +337,3 @@ See `the NumPy documentation on byte order
 <https://docs.scipy.org/doc/numpy/user/basics.byteswapping.html>`__ for more
 details.
-
-Alternative to storing lists in Pandas DataFrame Cells
-------------------------------------------------------
-
-Storing nested lists/arrays inside a pandas object should be avoided for performance and memory use reasons. Instead they should be "exploded" into a flat DataFrame structure.
-
-Example of exploding nested lists into a DataFrame:
-
-.. ipython:: python
-
-   from collections import OrderedDict
-   df = (pd.DataFrame(OrderedDict([('name', ['A.J. Price']*3),
-                                   ('opponent', ['76ers', 'blazers', 'bobcats']),
-                                   ('attribute x', ['A','B','C'])
-                                   ])
-         ))
-   df
-
-   nn = [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']]*3
-   nn
-
-   # Step 1: Create an index with the "parent" columns to be included in the final DataFrame
-   df2 = pd.concat([df[['name','opponent']], pd.DataFrame(nn)], axis=1)
-   df2
-
-   # Step 2: Transform the column with lists into series, which become columns in a new DataFrame.
-   # Note that only the index from the original df is retained -
-   # any other columns in the original df are not part of the new df
-   df3 = df2.set_index(['name', 'opponent'])
-   df3
-
-   # Step 3: Stack the new columns as rows; this creates a new index level we'll want to drop in the next step.
-   # Note that at this point we have a Series, not a DataFrame
-   ser = df3.stack()
-   ser
-
-   # Step 4: Drop the extraneous index level created by the stack
-   ser.reset_index(level=2, drop=True, inplace=True)
-   ser
-
-   # Step 5: Create a DataFrame from the Series
-   df4 = ser.to_frame('nearest_neighbors')
-   df4
-
-   # All steps in one chain
-   df4 = (df2.set_index(['name', 'opponent'])
-          .stack()
-          .reset_index(level=2, drop=True)
-          .to_frame('nearest_neighbors'))
-   df4
-
-Example of exploding a list embedded in a DataFrame:
-
-.. ipython:: python
-
-   df = (pd.DataFrame(OrderedDict([('name', ['A.J. Price']*3),
-                                   ('opponent', ['76ers', 'blazers', 'bobcats']),
-                                   ('attribute x', ['A','B','C']),
-                                   ('nearest_neighbors', [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']]*3)
-                                   ])
-         ))
-
-   df
-
-   # Step 1: Create an index with the "parent" columns to be included in the final DataFrame
-   df2 = df.set_index(['name', 'opponent'])
-   df2
-
-   # Step 2: Transform the column with lists into series, which become columns in a new DataFrame.
-   # Note that only the index from the original df is retained -
-   # any other columns in the original df are not part of the new df
-   df3 = df2.nearest_neighbors.apply(pd.Series)
-   df3
-
-   # Step 3: Stack the new columns as rows; this creates a new index level we'll want to drop in the next step.
-   # Note that at this point we have a Series, not a DataFrame
-   ser = df3.stack()
-   ser
-
-   # Step 4: Drop the extraneous index level created by the stack
-   ser.reset_index(level=2, drop=True, inplace=True)
-   ser
-
-   # Step 5: Create a DataFrame from the Series
-   df4 = ser.to_frame('nearest_neighbors')
-   df4
-
-   # All steps in one chain
-   df4 = (df.set_index(['name', 'opponent'])
-          .nearest_neighbors.apply(pd.Series)
-          .stack()
-          .reset_index(level=2, drop=True)
-          .to_frame('nearest_neighbors'))
-   df4
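The explode pattern removed above can be run outside the Sphinx docs build as a plain script. A minimal sketch, assuming a recent pandas install; it uses a plain dict in place of OrderedDict (dicts preserve insertion order in Python 3.7+), and `flat` is an illustrative name, not one from the original docs:

```python
import pandas as pd

# Build a frame where each row embeds a list -- the pattern the
# removed gotchas section advises against.
df = pd.DataFrame({
    'name': ['A.J. Price'] * 3,
    'opponent': ['76ers', 'blazers', 'bobcats'],
    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin',
                           'Nate Robinson', 'Isaia']] * 3,
})

# "Explode" the lists into a flat frame: index on the parent columns,
# expand each list into its own columns, stack those columns into rows,
# drop the synthetic index level added by stack, and name the result.
flat = (df.set_index(['name', 'opponent'])
          .nearest_neighbors.apply(pd.Series)
          .stack()
          .reset_index(level=2, drop=True)
          .to_frame('nearest_neighbors'))

print(flat.shape)  # 3 rows x 4 neighbors -> (12, 1)
```

This is the same chained form as the "All steps in one" version in the removed section, just condensed into a standalone script.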

doc/source/groupby.rst (+11 −21 lines)

@@ -1017,39 +1017,29 @@ The returned dtype of the grouped will *always* include *all* of the categories
    s.index.dtype

 .. note::
-    Decimal columns are also "nuisance" columns. They are excluded from aggregate functions automatically in groupby.
+    Decimal and object columns are also "nuisance" columns. They are excluded from aggregate functions automatically in groupby.

-If you do wish to include decimal columns in the aggregation, you must do so explicitly:
+If you do wish to include decimal or object columns in an aggregation with other non-nuisance data types, you must do so explicitly.

 .. ipython:: python

    from decimal import Decimal
    dec = pd.DataFrame(
-       {'name': ['foo', 'bar', 'foo', 'bar'],
-        'title': ['boo', 'far', 'boo', 'far'],
-        'id': [123, 456, 123, 456],
-        'int_column': [1, 2, 3, 4],
-        'dec_column1': [Decimal('0.50'), Decimal('0.15'), Decimal('0.25'), Decimal('0.40')],
-        'dec_column2': [Decimal('0.20'), Decimal('0.30'), Decimal('0.55'), Decimal('0.60')]
-       },
-       columns=['name','title','id','int_column','dec_column1','dec_column2']
-   )
-
-   dec.head()
-
-   dec.dtypes
-
-   # Decimal columns excluded from sum by default
-   dec.groupby(['name', 'title', 'id'], as_index=False).sum()
+       {'id': [123, 456, 123, 456],
+        'int_column': [1, 2, 3, 4],
+        'dec_column': [Decimal('0.50'), Decimal('0.15'), Decimal('0.25'), Decimal('0.40')]
+       },
+       columns=['id','int_column','dec_column']
+   )

    # Decimal columns can be sum'd explicitly by themselves...
-   dec.groupby(['name', 'title', 'id'], as_index=False)['dec_column1','dec_column2'].sum()
+   dec.groupby(['id'], as_index=False)['dec_column'].sum()

    # ...but cannot be combined with standard data types or they will be excluded
-   dec.groupby(['name', 'title', 'id'], as_index=False)['int_column','dec_column1','dec_column2'].sum()
+   dec.groupby(['id'], as_index=False)['int_column','dec_column'].sum()

    # Use .agg function to aggregate over standard and "nuisance" data types at the same time
-   dec.groupby(['name', 'title', 'id'], as_index=False).agg({'int_column': 'sum', 'dec_column1': 'sum', 'dec_column2': 'sum'})
+   dec.groupby(['id'], as_index=False).agg({'int_column': 'sum', 'dec_column': 'sum'})

 .. _groupby.missing:
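The diff's claim that nuisance columns are excluded silently reflects pandas behavior at the time this commit was written; later releases deprecated the silent exclusion. The explicit-selection and `.agg` approaches, however, remain portable. A minimal sketch, mirroring the simplified example's names and assuming a reasonably recent pandas:

```python
from decimal import Decimal

import pandas as pd

dec = pd.DataFrame({
    'id': [123, 456, 123, 456],
    'int_column': [1, 2, 3, 4],
    'dec_column': [Decimal('0.50'), Decimal('0.15'),
                   Decimal('0.25'), Decimal('0.40')],
})

# Selecting the Decimal column explicitly makes groupby aggregate it,
# even though it is stored with object dtype.
out = dec.groupby('id', as_index=False)['dec_column'].sum()

# .agg aggregates the integer and Decimal columns in one call.
both = dec.groupby('id', as_index=False).agg(
    {'int_column': 'sum', 'dec_column': 'sum'})
```

Here `out` holds one summed Decimal per `id` group, and `both` carries the integer and Decimal sums side by side.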
