Skip to content

New DataFrame feature: listify() and unlistify() #10511

Closed
@dov

Description

@dov

While building up a DataFrame in several steps, I found it difficult to add a new "perpendicular" column, i.e. a column that adds another dimension to already existing columns. To solve this problem I got the idea that this may be done in two steps:

  1. Add a new column whose values at each cell is a python list of the values the new column takes on.
  2. Unlistify the column thereby creating a new row for each element in the above lists.

I.e. I propose two new DataFrame methods, listify() and unlistify().

listify(df, column): Takes as input a dataframe and the name of a column. It will do a groupby of the df for all columns except column and generate a single row where the values in the column cell is a list of the column column values.

unlistify(df, column): Takes as input a dataframe and the name of a column. It will iterate over the values of the contents of column for each row and generate a new row for each value.

The functions may be expanded to support multiple columns. listify() may e.g. support a post processing function, that will be applied on the list.

The following python code illustrates these two functions. But obviously the functionality may be implemented more efficienctly on the C-level.

#!/usr/bin/python

import pandas as pd

def listify(df, column):
  matches = [i for i,n in enumerate(df.columns)
             if n==column]

  if len(matches)==0:
    raise Exception('Failed to find column named ' + column +'!')
  if len(matches)>1:
    raise Exception('More than one column named ' + column +'!')

  old_index = df.index
  col_idx = matches[0] + len(old_index.shape) # Since we will reset the index

  column_names = list(df.index.names) + list(df.columns)
  gb_cols = [c for c in column_names
             if c!= column]

  # Helper function to generate the squashed dataframe
  def fnc(d):
    row = list(d.values[0])
    return pd.DataFrame([row[:col_idx]
                         + [[v[col_idx] for v in list(d.values)]]
                         + row[col_idx+1:]])

  return (df
          .reset_index()
          .groupby(gb_cols)
          .apply(fnc)
          .rename(columns = lambda i : column_names[i])
          .set_index(old_index.names)
          )

def unlistify(df, column):
  matches = [i for i,n in enumerate(df.columns)
             if n==column]

  if len(matches)==0:
    raise Exception('Failed to find column named ' + column +'!')
  if len(matches)>1:
    raise Exception('More than one column named ' + column +'!')

  col_idx = matches[0]

  # Helper function to expand and repeat the column col_idx
  def fnc(d):
    row = list(d.values[0])
    bef = row[:col_idx]
    aft = row[col_idx+1:]
    col = row[col_idx]
    z = [bef + [c] + aft for c in col]
    return pd.DataFrame(z)

  col_idx += len(df.index.shape) # Since we will push reset the index
  index_names = list(df.index.names)
  column_names = list(index_names) + list(df.columns)
  return (df
          .reset_index()
          .groupby(level=0,as_index=0)
          .apply(fnc)
          .rename(columns = lambda i :column_names[i])
          .set_index(index_names)
          )


# Examples of how to listify and unlistify a column.

df = pd.DataFrame([[11,range(5),10],
                   [22,range(3),20]],
                   columns = ['A','B','C']).set_index('C')
print 'org'
print df
print '--'
df = unlistify(df,'B')
print 'unlistify(df,B)'
print df
print '--'
df = listify(df,'B')
print 'listify(df,B)'
print df

The corresponding output:

org
     A                B
C                      
10  11  [0, 1, 2, 3, 4]
20  22        [0, 1, 2]
--
unlistify(df,B)
     A  B
C        
10  11  0
10  11  1
10  11  2
10  11  3
10  11  4
20  22  0
20  22  1
20  22  2
--
listify(df,B)
     A                B
C                      
10  11  [0, 1, 2, 3, 4]
20  22        [0, 1, 2]

Metadata

Metadata

Assignees

No one assigned

    Labels

    API DesignReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions