Here is sample code that finds duplicate columns in a DataFrame based on their values (useful for cleaning data):
```python
import pandas as pd

def duplicate_columns(frame):
    # Group column names by dtype so only same-typed columns are compared
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for dtype, col_names in groups.items():
        # Pull each candidate column out as a plain list of values
        dcols = frame[col_names].to_dict(orient="list")
        vs = list(dcols.values())  # list() needed in Python 3: dict views aren't indexable
        ks = list(dcols.keys())
        lvs = len(vs)
        # Pairwise comparison; record the first column of each duplicate pair
        for i in range(lvs):
            for j in range(i + 1, lvs):
                if vs[i] == vs[j]:
                    dups.append(ks[i])
                    break
    return dups
```
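A quick usage sketch (the frame and its column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],          # same values as "a"
    "c": ["x", "y", "z"],
})
print(duplicate_columns(df))  # ['a']
```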
I've seen others suggest something like `df.T.drop_duplicates().T`. However, transposing is a bad idea when working with large DataFrames: it copies the whole frame, and on mixed dtypes the transpose upcasts everything to object.
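A minimal sketch of that dtype cost (the toy frame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2, 3], "s": ["a", "b", "c"]})
print(df.dtypes.tolist())    # [dtype('int64'), dtype('O')]
print(df.T.dtypes.tolist())  # all object: mixed dtypes are upcast on transpose
```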
I would add a pull request, but I'm not sure I even know what that means.