Description
Historically, large monotonically-increasing Index objects would attempt to avoid creating a large hash table on get_loc calls. In service of this goal, many IndexEngine methods have guards like the one in DatetimeEngine.get_loc:
if self.over_size_threshold and self.is_monotonic_increasing:
    if not self.is_unique:
        val = _to_i8(val)
        return self._get_loc_duplicates(val)

    # Do lookup using `searchsorted`.

# Do lookup using hash table.
Since at least 5eecdb2, self.is_unique has been implemented as a property that forces a hash table to be created unless the index has already been marked as unique. Until #10199, the is_monotonic_increasing property would perform a check that would sometimes set self.unique to True, which prevented the large hash table allocation. After the commit linked above, however, the only code path that ever sets IndexEngine.unique is in IndexEngine.initialize, which unconditionally creates a hash table before setting the unique flag.
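To make the current behavior concrete, here is a minimal pure-Python sketch of the flow described above. It is a paraphrase, not the actual Cython implementation in pandas/index.pyx; the class and attribute names are illustrative only.

class SketchIndexEngine(object):
    """Illustrative stand-in for the IndexEngine unique-check behavior."""

    def __init__(self, values):
        self.values = values
        self.mapping = None           # the hash table
        self.unique = False
        self.need_unique_check = True

    @property
    def is_unique(self):
        # Asking "is this index unique?" runs the unique check, which in
        # current pandas amounts to populating the hash table.
        if self.need_unique_check:
            self.initialize()
        return self.unique

    def initialize(self):
        # The hash table is built unconditionally, and only afterwards is
        # the `unique` flag derived from it. For a DatetimeIndex with tens
        # of millions of entries this is the large allocation seen in the
        # repro below.
        self.mapping = dict((v, i) for i, v in enumerate(self.values))
        self.unique = len(self.mapping) == len(self.values)
        self.need_unique_check = False

In this shape, the over_size_threshold guard in get_loc no longer avoids the allocation: evaluating self.is_unique on the first lookup builds the table even when the searchsorted path would have been taken.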
Code Sample, a copy-pastable example if possible
import os

import humanize
import psutil
import pandas as pd


def get_mem_usage():
    pid = os.getpid()
    proc = psutil.Process(pid)
    return humanize.naturalsize(proc.memory_full_info().uss)


print("Pandas Version: " + pd.__version__)
print("Before Index Creation: " + get_mem_usage())

# The cutoff for allocating a hash table inside the index is supposed to be
# 1,000,000 entries in the index. This index is about 10x larger.
data = pd.date_range('1990', '2016', freq='min')
print("After Index Creation: " + get_mem_usage())

# Trigger a hash of the index's contents.
data.get_loc(data[5])
print("After get_loc() call: " + get_mem_usage())
Output (Old Pandas):
$ python repro.py
Pandas Version: 0.16.1
Before Index Creation: 36.8 MB
After Index Creation: 146.5 MB
After get_loc() call: 146.6 MB
Output (Pandas 0.18.1):
$ python repro.py
Pandas Version: 0.18.1
Before Index Creation: 47.6 MB
After Index Creation: 157.4 MB
After get_loc() call: 698.7 MB
For some context, I found this after the internal Jenkins build for Zipline (which makes heavy use of large minutely DatetimeIndexes to represent trading calendars) started failing with memory errors after merging quantopian/zipline#1339.
Assuming that the memory-saving behavior of older pandas is still desired, I think the right immediate fix for this is to change IndexEngine._do_unique_check to actually do a uniqueness check instead of just forcing a hash table creation. Reading through the code, however, there are a bunch of ways that large Indexes could still hit code paths that trigger hash table allocations. For example, DatetimeEngine.__contains__ guards against self.over_size_threshold, but none of the other IndexEngine subclasses do. A more significant refactor is probably needed to provide a meaningful guarantee that indices don't consume too much memory.
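For reference, a uniqueness check that avoids the hash table is cheap once the index is already known to be monotonically increasing. A sketch of that idea (the function name is mine, not a pandas API):

import numpy as np

def monotonic_is_unique(values):
    # For a sorted NumPy array (e.g. a DatetimeIndex's int64 view via
    # `idx.asi8`), uniqueness holds iff no adjacent pair of values is
    # equal. This costs one temporary boolean array rather than a full
    # hash table over every entry.
    values = np.asarray(values)
    if len(values) < 2:
        return True
    return bool((values[1:] != values[:-1]).all())

# Example usage against the repro index above:
#   idx = pd.date_range('1990', '2016', freq='min')
#   monotonic_is_unique(idx.asi8)   # True, without allocating a hash table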
output of pd.show_versions()
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-16-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.1
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 5.1.0
sphinx: 1.3.4
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: None
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)