Description
Is your feature request related to a problem?
Currently, in isin
, unique
and duplicated
temporary hash-maps are created, albeit sets would be enough (for isin
and for some inputs to unique
and duplicated
). Thus has at least two drawbacks:
- Quite a memory overhead: For every key there are additional 8 bytes overhead for values. This ratio will become even worse for 32bit/16bit keys, once they are be used.
- For big datasets, the running time of the above algorithms is dominated by cache-misses. Thus having twice as many cache-misses, because also values are touched could mean a factor 2 slowdown.
Describe the solution you'd like
Pandas uses khash, which also provides set-implementations, which would solve the above two problems.
Given the current infrastructure it would be no big deal to add sets as well.
Another advantage would be: prior investigations have shown, that the usages of sets (as in isin
) and of maps lead to different optimal strategies for hash-functions and collision-resolution-strategies (see e.g. #36729 (comment))
Describe alternatives you've considered
Given that the most OSes nowdays don't really reserve memory for a proccess if it is not used/written to, one could claim, that the above two problems aren't problems at all (at least for bigger datasets) - because in the above algorithms no values are written to values
.
The above claim is true, as long as there is no rehashing involved. Once a map is rehashed, the garbage values are copied to the new location, which leads OS to reserve this memory to the process. Thus, preallocating big enough maps/hashes would not trigger the usage of additional memory. However, right now it doesn't not happend for different reasons, one of them being the limitation of the size of the preallocated table to 2²⁰ elements (which is a sane approach imo).
However, using sets and not maps would be a more robust and explicit way to avoid unnecessary memory usage.