Description
Historically, large monotonically-increasing Index objects would attempt to avoid creating a large hash table on get_loc calls. In service of this goal, many IndexEngine methods have guards like the one in DatetimeEngine.get_loc:
if self.over_size_threshold and self.is_monotonic_increasing:
    if not self.is_unique:
        val = _to_i8(val)
        return self._get_loc_duplicates(val)

    # Do lookup using `searchsorted`.

# Do lookup using hash table.
Since at least 5eecdb2, self.is_unique has been implemented as a property that forces a hash table to be created unless the index has already been marked as unique. Until #10199, the is_monotonic_increasing property would perform a check that would sometimes set self.unique to True, which prevented the large hash table allocation. After the commit linked above, however, the only code path that ever sets IndexEngine.unique is in IndexEngine.initialize, which unconditionally creates a hash table before setting the unique flag.
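To make the current behavior concrete, here is a minimal pure-Python sketch of the flow described above. It is a paraphrase, not the actual Cython implementation in pandas/index.pyx; the class and attribute names are illustrative only.

class SketchIndexEngine(object):
    """Illustrative stand-in for the IndexEngine unique-check behavior."""

    def __init__(self, values):
        self.values = values
        self.mapping = None           # the hash table
        self.unique = False
        self.need_unique_check = True

    @property
    def is_unique(self):
        # Asking "is this index unique?" runs the unique check, which in
        # current pandas amounts to populating the hash table.
        if self.need_unique_check:
            self.initialize()
        return self.unique

    def initialize(self):
        # The hash table is built unconditionally, and only afterwards is
        # the `unique` flag derived from it. For a DatetimeIndex with tens
        # of millions of entries this is the large allocation seen in the
        # repro below.
        self.mapping = dict((v, i) for i, v in enumerate(self.values))
        self.unique = len(self.mapping) == len(self.values)
        self.need_unique_check = False

In this shape, the over_size_threshold guard in get_loc no longer avoids the allocation: evaluating self.is_unique on the first lookup builds the table even when the searchsorted path would have been taken.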
Code Sample, a copy-pastable example if possible
import os

import humanize
import psutil
import pandas as pd


def get_mem_usage():
    pid = os.getpid()
    proc = psutil.Process(pid)
    return humanize.naturalsize(proc.memory_full_info().uss)


print("Pandas Version: " + pd.__version__)
print("Before Index Creation: " + get_mem_usage())

# The cutoff for allocating a hash table inside the index is supposed to be
# 1,000,000 entries in the index. This index is about 10x larger.
data = pd.date_range('1990', '2016', freq='min')
print("After Index Creation: " + get_mem_usage())

# Trigger a hash of the index's contents.
data.get_loc(data[5])
print("After get_loc() call: " + get_mem_usage())
Output (Old Pandas):
$ python repro.py
Pandas Version: 0.16.1
Before Index Creation: 36.8 MB
After Index Creation: 146.5 MB
After get_loc() call: 146.6 MB
Output (Pandas 0.18.1):
$ python repro.py
Pandas Version: 0.18.1
Before Index Creation: 47.6 MB
After Index Creation: 157.4 MB
After get_loc() call: 698.7 MB
For some context, I found this after the internal Jenkins build for Zipline (which makes heavy use of large minutely DatetimeIndexes to represent trading calendars) started failing with memory errors after merging quantopian/zipline#1339.
Assuming that the memory-saving behavior of older pandas is still desired, I think the right immediate fix for this is to change IndexEngine._do_unique_check to actually do a uniqueness check instead of just forcing a hash table creation. Reading through the code, however, there are a bunch of ways that large Indexes could still hit code paths that trigger hash table allocations. For example, DatetimeEngine.__contains__ guards against self.over_size_threshold, but none of the other IndexEngine subclasses do. A more significant refactor is probably needed to provide a meaningful guarantee that indices don't consume too much memory.
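For reference, a uniqueness check that avoids the hash table is cheap once the index is already known to be monotonically increasing. A sketch of that idea (the function name is mine, not a pandas API):

import numpy as np

def monotonic_is_unique(values):
    # For a sorted NumPy array (e.g. a DatetimeIndex's int64 view via
    # `idx.asi8`), uniqueness holds iff no adjacent pair of values is
    # equal. This costs one temporary boolean array rather than a full
    # hash table over every entry.
    values = np.asarray(values)
    if len(values) < 2:
        return True
    return bool((values[1:] != values[:-1]).all())

# Example usage against the repro index above:
#   idx = pd.date_range('1990', '2016', freq='min')
#   monotonic_is_unique(idx.asi8)   # True, without allocating a hash table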
output of pd.show_versions()
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-16-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.1
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 5.1.0
sphinx: 1.3.4
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: None
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)