Skip to content

Join operation takes more time.  #27099

Closed
@ericksc

Description

@ericksc

My Code Sample

When I tried to launch a join operation between multiIndexed dataframes as the following code describes
Fully code is posted on:
https://github.com/ericksc/pandas_studies

# My code here
    dates = pd.date_range(start='1980-1-1', end='2050-1-1')
    strings = np.array(
        [
            'apple', 'grape', 'orange' ,'pear', 'melon', 'banana',
            'watermelon', 'lemon', 'strawberry', 'berry', 'blackberry',
            'cherry','pineapple' , 'mango', 'papaya', 'peach', 'coffee',
            'planes', 'cars', 'houses', 'dogs', 'cats', 'computers',
            'servers', 'sun', 'moon', 'chairs', 'tables', 'screens',
            'keyboards', 'shoes', 't-shirts', 'tv', 'radio', 'door', 'windows',
            'bed', 'spoon', 'key', 'paper', 'foot', 'bee', 'ants' ,'worm',
            'pack', 'phone', 'boat', 'hair', 'yellow', 'disc'
    ])
    start = time.time()
    idx = pd.MultiIndex.from_product([dates,strings], names=['date', 'words'])
    end = time.time()
    print("MultiIndex execution time:", end-start)
    lenght = idx.shape[0]
    start = time.time()
    df_1 = pd.DataFrame({'col_A':np.random.randn(lenght),
                  'col_2':np.arange(lenght),
                  'col_b' : np.random.choice(a=[False, True], size=(lenght)),
                  'col_i' : np.random.random_integers(low=0,high=100, size=lenght),
            }, index=idx)
    end = time.time()
    print("data frame creation execution time:", end-start)
    df_2 = pd.DataFrame({'col_x':np.arange(lenght)},
                        index=idx)

    assert df_1.shape == (1278450, 4)

    start = time.time()
    df_1 = df_1.join(df_2, how='left')
    end = time.time()
    print("Join execution time:", end - start)
    assert df_1.shape == (1278450, 5)

Problem description

The join operation between multiIndexed dataframes (at least 2 levels) with a large combination of labels shows a change of time execution performance to worst from pandas 0.19.1 to 0.23.4, including 0.24.2.

  • pandas 0.19.1-> join execution time -> 0.1129 seconds
  • pandas 0.23.4-> join execution time -> 1.2409 seconds
  • pandas 0.24.2-> join execution time -> 1.2309 seconds

This is around 10.9x times.

Expected Output

The expected output is a similar performance.

Output of pd.show_versions()

p36-pandas 0.19.1

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

p36-pandas 0.23.4

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 4.6.3
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    PerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions