Description
My Code Sample
When I tried to launch a join operation between multiIndexed dataframes as the following code describes
Fully code is posted on:
https://github.com/ericksc/pandas_studies
# My code here
dates = pd.date_range(start='1980-1-1', end='2050-1-1')
strings = np.array(
[
'apple', 'grape', 'orange' ,'pear', 'melon', 'banana',
'watermelon', 'lemon', 'strawberry', 'berry', 'blackberry',
'cherry','pineapple' , 'mango', 'papaya', 'peach', 'coffee',
'planes', 'cars', 'houses', 'dogs', 'cats', 'computers',
'servers', 'sun', 'moon', 'chairs', 'tables', 'screens',
'keyboards', 'shoes', 't-shirts', 'tv', 'radio', 'door', 'windows',
'bed', 'spoon', 'key', 'paper', 'foot', 'bee', 'ants' ,'worm',
'pack', 'phone', 'boat', 'hair', 'yellow', 'disc'
])
start = time.time()
idx = pd.MultiIndex.from_product([dates,strings], names=['date', 'words'])
end = time.time()
print("MultiIndex execution time:", end-start)
lenght = idx.shape[0]
start = time.time()
df_1 = pd.DataFrame({'col_A':np.random.randn(lenght),
'col_2':np.arange(lenght),
'col_b' : np.random.choice(a=[False, True], size=(lenght)),
'col_i' : np.random.random_integers(low=0,high=100, size=lenght),
}, index=idx)
end = time.time()
print("data frame creation execution time:", end-start)
df_2 = pd.DataFrame({'col_x':np.arange(lenght)},
index=idx)
assert df_1.shape == (1278450, 4)
start = time.time()
df_1 = df_1.join(df_2, how='left')
end = time.time()
print("Join execution time:", end - start)
assert df_1.shape == (1278450, 5)
Problem description
The join operation between multiIndexed dataframes (at least 2 levels) with a large combination of labels shows a change of time execution performance to worst from pandas 0.19.1 to 0.23.4, including 0.24.2.
- pandas 0.19.1-> join execution time -> 0.1129 seconds
- pandas 0.23.4-> join execution time -> 1.2409 seconds
- pandas 0.24.2-> join execution time -> 1.2309 seconds
This is around 10.9x times.
Expected Output
The expected output is a similar performance.
Output of pd.show_versions()
p36-pandas 0.19.1
INSTALLED VERSIONS
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.1
nose: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
p36-pandas 0.23.4
INSTALLED VERSIONS
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: 4.6.3
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None