Skip to content

Observing significant (up to 30x) __get_item__ performance drop between 0.19.2 and 0.20.x #16644

Closed
@jupiterjosh

Description

@jupiterjosh

Code Sample, a copy-pastable example if possible

# Your code here
#%%
import pandas as pd
import numpy as np


#%%
_M = 10
_N = 1000
tuples = [(i,j) for i in range(_M) for j in range(_N)]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.rand(_N*_M,4),columns=['A','B','C','D'],index=index)

#%%
import itertools
for u,v in itertools.combinations(range(_M),2):
    for w in range(_N):
        someval1 = df.A.loc[u,w] - df.A.loc[v,w]
        someval2 = df.B.loc[u,w] - df.B.loc[v,w]

Problem description

This is my first ever git post, and so please let me know if I can do anything else to make the issue more clear.

This is the gist of a script that I execute in Spyder. It is ugly and could be written to be made faster, but since the "for" loop (which is calling get_item) only took around 10 seconds to run, I didn't worry about cleaning it up. The 3x performance degradation between 0.19.2 to 0.20.2 (and a 30x degradation in 0.20.1) is very disconcerting since get_item is such a low level operation.

[this should explain why the current behaviour is a problem and why the expected output is a better solution.]

Expected Output

The code sample above took around 10 seconds to execute on pandas version 0.19.2. It took over 5 minutes to run in version 0.20.1 and around 30 seconds to run in version 0.20.2. The runtime difference between 0.19.2 and 0.20.2 is still a 3x degradation.

Note the profiler output for get_item using pandas versions 0.19.2, 0.20.1 and 0.20.2 respectively:

image
image
image

1 of 4 -- Output of pd.show_versions() -- VERSION 0.19.2

# Paste the output here pd.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 3.6.1.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel byteorder: little LC_ALL: None LANG: en LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: 0.7.9.None
psycopg2: None
jinja2: 2.9.6
boto: 2.46.1
pandas_datareader: None

2 of 3 -- Output of pd.show_versions() -- VERSION 0.20.1

INSTALLED VERSIONS ------------------ commit: None python: 3.6.1.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel byteorder: little LC_ALL: None LANG: en LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: 0.7.9.None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

3 of 3-- Output of pd.show_versions() -- VERSION 0.20.2

INSTALLED VERSIONS ------------------ commit: None python: 3.6.1.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel byteorder: little LC_ALL: None LANG: en LOCALE: None.None

pandas: 0.20.2
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: 0.7.9.None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    IndexingRelated to indexing on series/frames, not to indexes themselvesPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions