ENH: consider using sets and not maps for isin, unique and duplicated

#### Is your feature request related to a problem?

Currently, in `isin`, `unique` and `duplicated` temporary hash-maps are created, albeit sets would be enough (for `isin` and for some inputs to `unique` and `duplicated`). Thus has at least two drawbacks:

  1) Quite a memory overhead: For every key there are additional 8 bytes overhead for values. This ratio will become even worse for 32bit/16bit keys, once they are be used.
  2)  For big datasets, the running time of the above algorithms is dominated by cache-misses. Thus having twice as many cache-misses, because also values are touched could mean a factor 2 slowdown.

#### Describe the solution you'd like

Pandas uses khash, which also provides set-implementations, which would solve the above two problems. 

Given the current infrastructure it would be no big deal to add sets as well.

Another advantage would be: prior investigations have shown, that the usages of sets (as in `isin`) and of maps lead to different optimal strategies for hash-functions and collision-resolution-strategies (see e.g. https://github.com/pandas-dev/pandas/pull/36729#issuecomment-716843946) 

#### Describe alternatives you've considered

Given that the most OSes nowdays don't really reserve memory for a proccess if it is not used/written to, one could claim, that the above two problems aren't problems at all (at least for bigger datasets) - because in the above algorithms no values are written to `values`.

The above claim is true, as long as there is no rehashing involved. Once a map is rehashed, the garbage values are copied to the new location, which leads OS to reserve this memory to the process. Thus, preallocating big enough maps/hashes would not trigger the usage of additional memory. However, right now it doesn't not happend for different reasons, one of them being the limitation of the size of the preallocated table to 2²⁰ elements (which is a sane approach imo).

However, using sets and not maps would be a more robust and explicit way to avoid unnecessary memory usage.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: consider using sets and not maps for isin, unique and duplicated #39799

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

ENH: consider using sets and not maps for isin, unique and duplicated #39799

Description

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions