Pandas groupby extremely slow in python3 for certain sets of single precision floating point data #13335

Closed
@RogerThomas

Description

In Python 3, pandas groupby on certain sets of single-precision floating-point data is up to ~150 times slower than on the same data in Python 2.

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
from numpy.random import random
from time import time


def do_groupby(df):
    df.groupby(['a'])['b'].sum()


def main():
    tmp1 = (random(10000) * 0.1).astype(np.float32)
    tmp2 = (random(10000) * 10.0).astype(np.float32)
    tmp = np.concatenate((tmp1, tmp2))
    arr = np.repeat(tmp, 100)
    df = pd.DataFrame(dict(a=arr, b=arr))
    t1 = time()
    do_groupby(df)
    print("Took: %s" % (time() - t1,))

main()

On my machine, running
python2 this_file.py
the groupby takes around 0.1s,

whereas running
python3 this_file.py
the groupby takes around 11s.

After some investigation, the size of the run-time discrepancy between Python versions varies hugely with the actual data, but it seems to be largest when half the data is roughly 100 times smaller than the other half.
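
To make that observation easier to check, here is a minimal sketch that times the same groupby while varying the ratio between the two halves of the data. The helper names (`make_frame`, `time_groupby`, `sweep`) and the chosen ratios are illustrative assumptions, not part of the original script; a ratio of 100 reproduces the 0.1 / 10.0 scales used above. It should run under both Python 2 and Python 3 so the timings can be compared.

```python
from timeit import default_timer

import numpy as np
import pandas as pd


def make_frame(small_scale, large_scale, n=10000, repeats=100, seed=0):
    """Build the float32 frame from the report with configurable scales."""
    rng = np.random.RandomState(seed)  # fixed seed so runs are comparable
    small = (rng.random_sample(n) * small_scale).astype(np.float32)
    large = (rng.random_sample(n) * large_scale).astype(np.float32)
    arr = np.repeat(np.concatenate((small, large)), repeats)
    return pd.DataFrame(dict(a=arr, b=arr))


def time_groupby(df):
    """Return the wall-clock seconds taken by the groupby from the report."""
    start = default_timer()
    df.groupby(['a'])['b'].sum()
    return default_timer() - start


def sweep(ratios=(1.0, 10.0, 100.0, 1000.0), n=10000, repeats=100):
    """Time the groupby for several large/small scale ratios (hypothetical values)."""
    results = []
    for ratio in ratios:
        df = make_frame(small_scale=10.0 / ratio, large_scale=10.0,
                        n=n, repeats=repeats)
        results.append((ratio, time_groupby(df)))
    return results


if __name__ == "__main__":
    for ratio, seconds in sweep():
        print("ratio %-7s took: %.3fs" % (ratio, seconds))
```

Running the file once with `python2` and once with `python3` gives one timing per ratio for each interpreter; no particular numbers are claimed here.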

Having profiled this, it seems that almost all of the 11s in the Python 3 version is spent in this method:
https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx#L538
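
For anyone who wants to reproduce the profile, the following is a minimal sketch using the standard library's `cProfile` and `pstats`. The report does not say which profiler was actually used, so this is only one possible approach; `build_frame` and `profile_groupby` are hypothetical helper names. Whether calls into compiled extension code show up as separate entries depends on how that code was built, so the output may be coarser than a line-level profile.

```python
import cProfile
import pstats

import numpy as np
import pandas as pd


def build_frame(n=10000, repeats=100, seed=0):
    """Rebuild the float32 frame from the report (seeded for repeatability)."""
    rng = np.random.RandomState(seed)
    small = (rng.random_sample(n) * 0.1).astype(np.float32)
    large = (rng.random_sample(n) * 10.0).astype(np.float32)
    arr = np.repeat(np.concatenate((small, large)), repeats)
    return pd.DataFrame(dict(a=arr, b=arr))


def profile_groupby(df):
    """Profile the groupby and return a pstats.Stats sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    df.groupby(['a'])['b'].sum()
    profiler.disable()
    return pstats.Stats(profiler).sort_stats('cumulative')


if __name__ == "__main__":
    stats = profile_groupby(build_frame())
    stats.print_stats(15)  # show the 15 most expensive entries
```

Sorting by cumulative time puts the call chain leading to the expensive function at the top of the listing.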

However, I have no idea what is causing the run-time discrepancy.
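
Since the slowdown was observed with single-precision data, one way to narrow it down might be to time the same groupby with the key column stored in different ways. The sketch below is a diagnostic idea only: the helper names and the three variants (float32 keys, the same float32 values cast to float64, and values drawn directly as float64) are assumptions for illustration, and it makes no claim about which variant will be slow.

```python
from timeit import default_timer

import numpy as np
import pandas as pd


def make_keys(n=10000, repeats=100, seed=0):
    """Draw the two-scale data from the report as float64 values."""
    rng = np.random.RandomState(seed)
    small = rng.random_sample(n) * 0.1
    large = rng.random_sample(n) * 10.0
    return np.repeat(np.concatenate((small, large)), repeats)


def time_groupby(keys):
    """Time the groupby from the report for a given key array."""
    df = pd.DataFrame(dict(a=keys, b=keys))
    start = default_timer()
    df.groupby(['a'])['b'].sum()
    return default_timer() - start


def compare_dtypes(n=10000, repeats=100):
    """Return seconds taken for each way of storing the key column."""
    keys64 = make_keys(n, repeats)
    keys32 = keys64.astype(np.float32)
    return {
        'float32': time_groupby(keys32),
        'float32 cast to float64': time_groupby(keys32.astype(np.float64)),
        'native float64': time_groupby(keys64),
    }


if __name__ == "__main__":
    for label, seconds in sorted(compare_dtypes().items()):
        print("%-24s took: %.3fs" % (label, seconds))
```

Comparing the three timings under each interpreter would show whether the slowdown follows the dtype of the column or the values themselves.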

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-22-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 21.2.2
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.2.0
numexpr: 2.5.2
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.7.4.None
psycopg2: None
jinja2: 2.8
boto: 2.38.0
pandas_datareader: None

Labels: Dtype Conversions, Numeric Operations, Performance