Large Monotonic Index Objects Always Allocate Hash Tables on get_loc #14266

Closed
@ssanderson

Historically, large monotonically-increasing Index objects would attempt to avoid creating a large hash table on get_loc calls. In service of this goal, many IndexEngine methods have guards like the one in DatetimeEngine.get_loc:

        if self.over_size_threshold and self.is_monotonic_increasing:
            if not self.is_unique:
                val = _to_i8(val)
                return self._get_loc_duplicates(val)
            # Do lookup using `searchsorted`
        # Do lookup using hash table.

Since at least 5eecdb2, self.is_unique has been implemented as a property that forces a hash table to be created unless the index has already been marked as unique. Until #10199, the is_monotonic_increasing property performed a check that would sometimes set self.unique to True, which prevented the large hash table allocation. After the commit linked above, however, the only code path that ever sets IndexEngine.unique is in IndexEngine.initialize, which unconditionally creates a hash table before setting the unique flag.
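To illustrate what the `searchsorted` branch is meant to buy us, here is a minimal NumPy sketch of a lookup on a sorted, duplicate-free array that never builds a hash table. This is not the actual Cython code in IndexEngine; the function name `searchsorted_get_loc` is made up for this example.

```python
import numpy as np


def searchsorted_get_loc(values, key):
    """Find `key` in a sorted, duplicate-free array via binary search.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    # Binary search: O(log n) time, no auxiliary hash table.
    pos = int(np.searchsorted(values, key, side="left"))
    # searchsorted returns an insertion point, so confirm the key is really there.
    if pos == len(values) or values[pos] != key:
        raise KeyError(key)
    return pos


values = np.arange(0, 20, 2, dtype="int64")  # 0, 2, 4, ..., 18
print(searchsorted_get_loc(values, 8))  # 4
```

The point is that once the engine knows the values are monotonic and unique, nothing about the lookup itself requires hashing the full contents.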

Code Sample, a copy-pastable example if possible

import os
import humanize
import psutil
import pandas as pd


def get_mem_usage():
    pid = os.getpid()
    proc = psutil.Process(pid)
    return humanize.naturalsize(proc.memory_full_info().uss)

print("Pandas Version: " + pd.__version__)
print("Before Index Creation: " + get_mem_usage())

# The cutoff for allocating a hash table inside the index is supposed to be
# 1,000,000 entries in the index.  This index is about 10x larger.
data = pd.date_range('1990', '2016', freq='min')

print("After Index Creation: " + get_mem_usage())

# Trigger a hash of the index's contents.
data.get_loc(data[5])

print("After get_loc() call: " + get_mem_usage())

Output (Old Pandas):

$ python repro.py
Pandas Version: 0.16.1
Before Index Creation: 36.8 MB
After Index Creation: 146.5 MB
After get_loc() call: 146.6 MB

Output (Pandas 0.18.1):

$ python repro.py
Pandas Version: 0.18.1
Before Index Creation: 47.6 MB
After Index Creation: 157.4 MB
After get_loc() call: 698.7 MB

For some context, I found this after the internal Jenkins build for Zipline (which makes heavy use of large minutely DatetimeIndexes to represent trading calendars) started failing with memory errors after merging quantopian/zipline#1339.

Assuming that the memory-saving behavior of older pandas is still desired, I think the right immediate fix for this is to change IndexEngine._do_unique_check to actually do a uniqueness check instead of just forcing a hash table creation. Reading through the code, however, there are a bunch of ways that large Indexes could still hit code paths that trigger hash table allocations. For example, DatetimeEngine.__contains__ guards against self.over_size_threshold, but none of the other IndexEngine subclasses do. A more significant refactor is probably needed to provide a meaningful guarantee that indices don't consume too much memory.
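As a rough sketch of what "actually do a uniqueness check" could look like, the following NumPy function determines monotonicity and uniqueness by comparing adjacent elements. The name `monotonic_and_unique` is hypothetical, the real check would live in Cython, and the sketch assumes the values contain no NaN/NaT.

```python
import numpy as np


def monotonic_and_unique(values):
    """Return (is_monotonic_increasing, is_unique) without a hash table.

    Hypothetical helper for illustration only; assumes no NaN/NaT values.
    is_unique is None when the values are not monotonic, because adjacent
    comparisons cannot detect duplicates in unsorted data.
    """
    if len(values) < 2:
        return True, True
    left, right = values[:-1], values[1:]
    # Non-decreasing order means every element is >= its predecessor.
    if not bool((right >= left).all()):
        return False, None
    # In sorted data, duplicates can only appear next to each other.
    return True, bool((right != left).all())


print(monotonic_and_unique(np.array([1, 2, 3])))  # (True, True)
print(monotonic_and_unique(np.array([1, 2, 2])))  # (True, False)
```

This version allocates temporary boolean arrays of one byte per element, which is still far smaller than a hash table over the same values; a Cython loop could avoid even that.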

output of pd.show_versions()

In [4]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-16-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.1
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 5.1.0
sphinx: 1.3.4
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: None
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)

Labels: Indexing (related to indexing on series/frames, not to indexes themselves), Performance (memory or execution speed performance)
