Description
When deduplicating, I need to get the mapping from the deduplicated records back to the originals.
Unfortunately:
- `drop_duplicates` and `duplicated` provide no information about this mapping
- `np.unique` has a `return_inverse` parameter, but does not work with `object` data due to "python3: regression for unique on dtype=object arrays with varying items types (Trac #2188)" (numpy/numpy#641)
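For context, `return_inverse` is exactly the mapping I'm after: for every original element, the position of its representative among the unique values, so the original can be reconstructed from the deduplicated data. A minimal demonstration on plain integers (it is precisely the `object` case where this breaks, per the numpy issue above):

```python
import numpy as np

arr = np.array([3, 1, 3, 2, 1])
uniques, inverse = np.unique(arr, return_inverse=True)
# uniques -> array([1, 2, 3]), inverse -> array([2, 0, 2, 1, 0])
# every original element maps back to its deduplicated representative:
assert (uniques[inverse] == arr).all()
```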
Let's say I have a large DataFrame `df` with several object columns (say, `sig`) that form the deduplication signature (the `subset` parameter of `duplicated`).
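To keep the snippets below self-contained, imagine a synthetic frame along these lines; the column names and contents are purely hypothetical stand-ins for my real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000_000  # roughly the scale of the real data
df = pd.DataFrame({
    'name': rng.choice(['alice', 'bob', 'carol', 'dave'], n).astype(object),
    'city': rng.choice(['berlin', 'paris', 'tokyo'], n).astype(object),
    'code': rng.integers(0, 1000, n).astype(str).astype(object),
})
# the deduplication signature would then be sig = ['name', 'city', 'code']
```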
Then I have (here only for a subset of 100k records):
import timeit
import pandas as pd

sig = ['list', 'of', 'column', 'names']  # the columns forming the deduplication signature
N = 100000

tic = timeit.default_timer()
### no reverse mapping
df[sig].sample(N).duplicated(keep=False)
toc = timeit.default_timer()
print(f'duplicated: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
pd.unique([x for x in df[sig].sample(N).itertuples(name='', index=False)])
toc = timeit.default_timer()
print(f'pd.unique: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
dups = df.sample(N).groupby(sig).groups.values()  # indices per group
redup = pd.concat([pd.Series(x[0], index=x) for x in dups], axis=0)  # assumes a sorted index; otherwise use x.min()
toc = timeit.default_timer()
print(f'with groupby: {toc-tic:.2f} sec.')
resulting in:
duplicated: 2.45 sec.
pd.unique: 1.64 sec.
with groupby: 16.89 sec.
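To make the reverse mapping concrete: on a toy frame, the groupby construction above yields a Series that maps every row label to the label of its representative (here, the first occurrence):

```python
import pandas as pd

toy = pd.DataFrame({'a': ['x', 'y', 'x', 'y', 'z'],
                    'b': [1, 2, 1, 2, 3]})
groups = toy.groupby(['a', 'b']).groups.values()  # row labels per unique signature
redup = pd.concat([pd.Series(idx[0], index=idx) for idx in groups], axis=0).sort_index()
print(redup)
# 0    0
# 1    1
# 2    0
# 3    1
# 4    4
# dtype: int64
```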
The performance gets worse with size, so for a few million records I need to wait ~20 min to get the inverse, whereas a pure `duplicated` call only takes around 10 sec.
This caused me to hunt down the inner workings of `duplicated`, to see where the inverse gets lost. It turns out that until almost the very end of `duplicated`, the indices are there (see line 4384 in 3147a86).
I extracted and applied the guts of `duplicated`, and once I have `ids` for the whole set (~10 sec), I get the following:
import timeit
import numpy as np
from pandas._libs.hashtable import duplicated_int64  # private helper used by DataFrame.duplicated

[calculate ids]

tic = timeit.default_timer()
### no reverse mapping
duplicated_int64(ids, keep=False)
toc = timeit.default_timer()
print(f'pd.duplicated_int64: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
np.unique(ids)
toc = timeit.default_timer()
print(f'np.unique, no inverse: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
np.unique(ids, return_inverse=True)
toc = timeit.default_timer()
print(f'np.unique, with inverse: {toc-tic:.2f} sec.')
yielding
pd.duplicated_int64: 1.60 sec.
np.unique, no inverse: 0.31 sec.
np.unique, with inverse: 0.64 sec.
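For reference, the `[calculate ids]` step mirrors what `DataFrame.duplicated` does internally: factorize each signature column and collapse the per-column label arrays into a single int64 id per row. A rough sketch against the internals of the pandas version I'm on (`factorize`/`get_group_index` are internal APIs and may change between versions; `signature_ids` is just a name for the sketch):

```python
from pandas.core import algorithms
from pandas.core.sorting import get_group_index

def signature_ids(frame, subset):
    """One int64 id per row; rows with equal ids are duplicates w.r.t. `subset`."""
    labels, shapes = [], []
    for col in subset:
        codes, uniques = algorithms.factorize(frame[col].values)
        labels.append(codes.astype('i8', copy=False))
        shapes.append(len(uniques))
    return get_group_index(labels, shapes, sort=False, xnull=False)

ids = signature_ids(df, sig)
```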
So, by exposing a `return_inverse` argument -- either in `.duplicated`, `.drop_duplicates`, or in a separate `.unique` method for DataFrames -- this would accelerate my problem from ~20 min to around 11 sec.
I would imagine the following signature:
df.unique(subset=None, keep={'first'|'last'}, return_inverse=False)
If `return_inverse` were added to `drop_duplicates`, it would have to raise an error for the combination `keep=False, return_inverse=True`, because no inverse can be returned in this case.
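To illustrate the intended semantics of the hypothetical keyword (none of this exists yet, it is the feature being requested): the returned inverse would map each original row to its kept representative, so the original signature columns could be reconstructed from the deduplicated frame, just like with `np.unique`:

```python
# hypothetical API -- return_inverse does not exist yet
unique_rows, inverse = df.drop_duplicates(subset=sig, keep='first', return_inverse=True)

# inverse: original row label -> row label of the kept representative
reconstructed = unique_rows.loc[inverse]
assert (reconstructed[sig].values == df[sig].values).all()
```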
I wouldn't know how to adapt the cython code for `duplicated_int64`, but since it's ~5x slower than `np.unique` anyway, there's no point, IMHO. I would know how to call the numpy function at the appropriate point...
Edit: just saw that `np.unique` sorts, which makes the `'first'`/`'last'` thing a fair bit harder...
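For what it's worth, the sorting can be undone after the fact: `np.unique` can also return the index of each unique value's first occurrence, which is enough to emulate `keep='first'`. A sketch on the integer ids (not a proposal for the final implementation):

```python
import numpy as np

ids = np.array([7, 3, 7, 5, 3, 7])  # stand-in for the per-row signature ids

uniq_sorted, first_pos, inv_sorted = np.unique(ids, return_index=True, return_inverse=True)

order = np.argsort(first_pos)        # uniques reordered by first appearance
rank = np.empty_like(order)
rank[order] = np.arange(len(order))  # position in sorted order -> position in first-appearance order
uniq_first = uniq_sorted[order]
inv_first = rank[inv_sorted]

assert (uniq_first[inv_first] == ids).all()
# uniq_first -> array([7, 3, 5]); inv_first -> array([0, 1, 0, 2, 1, 0])
```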