
PERF: regression in MultiIndex get_loc performance #16319

Closed
@davidswaven

Description

# Minimal reproduction: time DataFrame.combine_first with MultiIndex columns
import time

import numpy as np
import pandas as pd

print(pd.__version__)

# Build a 4 x 2 product MultiIndex to use as the column labels
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
multind = pd.MultiIndex.from_product(iterables, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(4, 8), columns=multind)
df2 = pd.DataFrame(np.random.randn(4, 8), columns=multind)

# Time a single call (wall-clock seconds)
t2 = time.time()
df.combine_first(df2)
print("%f" % (time.time() - t2))

Problem description

The same code takes 116 ms in pandas 0.20.1 but only 3.6 ms in pandas 0.19.2.
That makes 0.20.1 more than 30 times slower (roughly 32x) than 0.19.2 for this method.
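A single `time.time()` measurement is noisy at the millisecond scale, so a repeated measurement may make the comparison between versions easier to reproduce. The following is only a sketch, not part of the original report: the helper names `build_frames` and `best_ms` are hypothetical, and the extra `get_loc` timing is an assumption based on the issue title rather than on anything measured above.

```python
import timeit

import numpy as np
import pandas as pd


def build_frames(seed=0):
    """Build the two 4x8 frames from the report, sharing MultiIndex columns."""
    iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
    columns = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
    rng = np.random.RandomState(seed)  # fixed seed so runs are comparable
    df = pd.DataFrame(rng.randn(4, 8), columns=columns)
    df2 = pd.DataFrame(rng.randn(4, 8), columns=columns)
    return df, df2


def best_ms(func, repeat=5, number=20):
    """Return the best per-call time in milliseconds over several repeats."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number * 1000.0


df, df2 = build_frames()

# The call reported as slow in the issue body
combine_ms = best_ms(lambda: df.combine_first(df2))

# Assumption: the issue title points at MultiIndex.get_loc, so time it too
get_loc_ms = best_ms(lambda: df.columns.get_loc(('bar', 'one')))

print("pandas %s" % pd.__version__)
print("combine_first: %.3f ms per call" % combine_ms)
print("MultiIndex.get_loc: %.3f ms per call" % get_loc_ms)
```

Running the same script under each pandas version and comparing the printed per-call times should show whether the slowdown reproduces on a given machine. Absolute numbers will differ from the 116 ms and 3.6 ms quoted above.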

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: C
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.20.1
pytest: 2.8.5
pip: 9.0.1
setuptools: 19.6.2
Cython: 0.24.1
numpy: 1.11.3
scipy: 0.18.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.8
s3fs: 0.0.9
pandas_gbq: None
pandas_datareader: None

Labels

MultiIndex
Performance: Memory or execution speed performance
Regression: Functionality that used to work in a prior pandas version