
ENH: adding .unique() to DF (or return_inverse for duplicated) #21357

Open
@h-vetinari

Description

When deduplicating, I need the mapping from the deduplicated records back to the originals. Unfortunately, there is currently no fast way to get it.

Let's say I have a large DF (df) with several object columns (say, sig) that form the deduplication signature (the subset parameter of duplicated).

Then I have (timed here only on a subset of 100k records):

import timeit

import pandas as pd

# df is assumed to be a large DataFrame with several object columns;
# sig is a placeholder for the actual list of signature column names
sig = ['list', 'of', 'column', 'names']
N = 100000

tic = timeit.default_timer()
### no reverse mapping
df[sig].sample(N).duplicated(keep=False)
toc = timeit.default_timer()
print(f'duplicated: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
pd.unique([x for x in df[sig].sample(N).itertuples(name=None, index=False)])
toc = timeit.default_timer()
print(f'pd.unique: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
dups = df.sample(N).groupby(sig).groups.values()  # row indices per group
# map every row of a group to the group's first index
redup = pd.concat([pd.Series(x[0], index=x) for x in dups], axis=0)  # assumes sorted index; otherwise x.min()
toc = timeit.default_timer()
print(f'with groupby: {toc-tic:.2f} sec.')

resulting in:

duplicated: 2.45 sec.
pd.unique: 1.64 sec.
with groupby: 16.89 sec.

The performance gap gets worse with size: for a few million records I have to wait ~20 min to get the inverse, whereas a pure duplicated call only takes around 10 sec.

This led me to hunt down the inner workings of duplicated, to see where the inverse gets lost. It turns out that until almost the very end of duplicated, the indices are still there:

ids = get_group_index(labels, shape, sort=False, xnull=False)
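
For reference, here is a minimal sketch of how such ids can be computed for a DataFrame, assuming pandas' internal get_group_index import path (pandas.core.sorting at the time of writing; internal APIs may move between versions) and a hypothetical helper name compute_ids:

import pandas as pd
from pandas.core.sorting import get_group_index  # internal API, may move

def compute_ids(df, subset):
    # factorize each signature column into integer codes ...
    labels, shape = [], []
    for col in subset:
        codes, uniques = pd.factorize(df[col])
        labels.append(codes.astype('i8', copy=False))
        shape.append(len(uniques))
    # ... then combine the per-column codes into a single int64 id per row,
    # roughly as DataFrame.duplicated does internally
    return get_group_index(labels, shape, sort=False, xnull=False)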

I extracted and applied the guts of duplicated, and once I have ids for the whole set (~10 sec), I get the following:

import timeit

import numpy as np
from pandas._libs.hashtable import duplicated_int64  # internal API

# [calculate ids] (e.g. via the compute_ids sketch above)

tic = timeit.default_timer()
### no reverse mapping
duplicated_int64(ids, keep=False)
toc = timeit.default_timer()
print(f'pd.duplicated_int64: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
np.unique(ids)
toc = timeit.default_timer()
print(f'np.unique, no inverse: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
np.unique(ids, return_inverse=True)
toc = timeit.default_timer()
print(f'np.unique, with inverse: {toc-tic:.2f} sec.')

yielding

pd.duplicated_int64: 1.60 sec.
np.unique, no inverse: 0.31 sec.
np.unique, with inverse: 0.64 sec.

So, exposing a return_inverse argument -- either in .duplicated, in .drop_duplicates, or in a separate .unique method for DF -- would accelerate my problem from ~20 min to around 11 sec.
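
To make this concrete, here is a minimal sketch (with a toy ids array standing in for the real output of get_group_index) of how np.unique's return_index/return_inverse give exactly the row-to-representative mapping that the groupby workaround above builds:

import numpy as np

# toy stand-in for the real int64 ids from get_group_index
ids = np.array([3, 1, 3, 2, 1, 3])

_, first_idx, inv = np.unique(ids, return_index=True, return_inverse=True)

# first_idx[k] is the position of the first occurrence of the k-th unique id,
# so indexing it with the inverse maps every row to its representative
redup = first_idx[inv]
print(redup)  # [0 1 0 3 1 0]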

I would imagine the following signature:

df.unique(subset=None, keep={'first'|'last'}, return_inverse=False)

If return_inverse were added to drop_duplicates, it would have to raise an error for the combination keep=False, return_inverse=True: keep=False drops all duplicated rows, so there is no representative left to map them back to.

I wouldn't know how to adapt the Cython code for duplicated_int64, but since it's 5x slower than np.unique anyway, there's no point, IMHO. I would know how to call the numpy function at the appropriate point...

Edit: I just saw that np.unique sorts its output, which makes the 'first'|'last' handling a fair bit harder...
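
For what it's worth, the sorting seems recoverable: a minimal sketch (same toy ids as above) of restoring first-appearance order on top of np.unique's sorted output via return_index:

import numpy as np

ids = np.array([3, 1, 3, 2, 1, 3])
uniq, first_idx, inv = np.unique(ids, return_index=True, return_inverse=True)

# re-order the (sorted) uniques by first appearance ...
order = np.argsort(first_idx)
uniq_first = uniq[order]
# ... and remap the inverse to match the new order
remap = np.empty_like(order)
remap[order] = np.arange(len(order))
inv_first = remap[inv]

assert (uniq_first[inv_first] == ids).all()
print(uniq_first)  # [3 1 2] -- order of first appearance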
