Description
While building up a DataFrame in several steps, I found it difficult to add a new "perpendicular" column, i.e. a column that adds another dimension to already existing columns. To solve this problem I got the idea that this may be done in two steps:
- Add a new column whose values at each cell is a python list of the values the new column takes on.
- Unlistify the column thereby creating a new row for each element in the above lists.
I.e. I propose two new DataFrame methods, listify() and unlistify().
listify(df, column)
: Takes as input a dataframe and the name of a column. It will do a groupby of the df for all columns except column and generate a single row where the values in the column cell is a list of the column column
values.
unlistify(df, column)
: Takes as input a dataframe and the name of a column. It will iterate over the values of the contents of column
for each row and generate a new row for each value.
The functions may be expanded to support multiple columns. listify()
may e.g. support a post processing function, that will be applied on the list.
The following python code illustrates these two functions. But obviously the functionality may be implemented more efficienctly on the C-level.
#!/usr/bin/python
import pandas as pd
def listify(df, column):
matches = [i for i,n in enumerate(df.columns)
if n==column]
if len(matches)==0:
raise Exception('Failed to find column named ' + column +'!')
if len(matches)>1:
raise Exception('More than one column named ' + column +'!')
old_index = df.index
col_idx = matches[0] + len(old_index.shape) # Since we will reset the index
column_names = list(df.index.names) + list(df.columns)
gb_cols = [c for c in column_names
if c!= column]
# Helper function to generate the squashed dataframe
def fnc(d):
row = list(d.values[0])
return pd.DataFrame([row[:col_idx]
+ [[v[col_idx] for v in list(d.values)]]
+ row[col_idx+1:]])
return (df
.reset_index()
.groupby(gb_cols)
.apply(fnc)
.rename(columns = lambda i : column_names[i])
.set_index(old_index.names)
)
def unlistify(df, column):
matches = [i for i,n in enumerate(df.columns)
if n==column]
if len(matches)==0:
raise Exception('Failed to find column named ' + column +'!')
if len(matches)>1:
raise Exception('More than one column named ' + column +'!')
col_idx = matches[0]
# Helper function to expand and repeat the column col_idx
def fnc(d):
row = list(d.values[0])
bef = row[:col_idx]
aft = row[col_idx+1:]
col = row[col_idx]
z = [bef + [c] + aft for c in col]
return pd.DataFrame(z)
col_idx += len(df.index.shape) # Since we will push reset the index
index_names = list(df.index.names)
column_names = list(index_names) + list(df.columns)
return (df
.reset_index()
.groupby(level=0,as_index=0)
.apply(fnc)
.rename(columns = lambda i :column_names[i])
.set_index(index_names)
)
# Examples of how to listify and unlistify a column.
df = pd.DataFrame([[11,range(5),10],
[22,range(3),20]],
columns = ['A','B','C']).set_index('C')
print 'org'
print df
print '--'
df = unlistify(df,'B')
print 'unlistify(df,B)'
print df
print '--'
df = listify(df,'B')
print 'listify(df,B)'
print df
The corresponding output:
org
A B
C
10 11 [0, 1, 2, 3, 4]
20 22 [0, 1, 2]
--
unlistify(df,B)
A B
C
10 11 0
10 11 1
10 11 2
10 11 3
10 11 4
20 22 0
20 22 1
20 22 2
--
listify(df,B)
A B
C
10 11 [0, 1, 2, 3, 4]
20 22 [0, 1, 2]