
PERF: Significant speed difference between arr.mean() and arr.values.mean() for common dtype columns #34773

Closed
@ianozsvald

Description
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


I'm seeing a significant variance in timings for common math operations (e.g. mean, std, max) on a large Pandas Series vs the underlying NumPy array. A code example is shown below with 1 million elements and a roughly 10x speed difference. The screenshot below uses 10 million elements.

I've generated a testing module (https://github.com/ianozsvald/dtype_pandas_numpy_speed_test) which several people have tried on Intel & AMD hardware: ianozsvald/dtype_pandas_numpy_speed_test#1

This module confirms the general trend that all of these operations are faster on the underlying NumPy array (not surprising, as it avoids the dispatch machinery), but for float operations the speed hit when using Pandas seems to be extreme:

[screenshot: timing graphs for Series vs NumPy array operations, 10 million elements]
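
For reference, here is a minimal standalone sketch of the kind of comparison the linked module performs. It is illustrative only: the operation list, array size, and loop structure are my own choices, not the repository's code.

# Illustrative benchmark sketch (not the repository's module): times
# mean/std/max on a pandas Series vs its underlying NumPy array for
# float64 and int64 data of the same size.
import timeit

import numpy as np
import pandas as pd

N = 1_000_000
data = {
    "float64": pd.Series(np.ones(shape=N, dtype="float64")),
    "int64": pd.Series(np.ones(shape=N, dtype="int64")),
}

for dtype_name, ser in data.items():
    arr = ser.values
    for op in ("mean", "std", "max"):
        # total seconds for 100 calls; *10 converts to ms per call
        t_pd = timeit.timeit(lambda: getattr(ser, op)(), number=100)
        t_np = timeit.timeit(lambda: getattr(arr, op)(), number=100)
        print(f"{dtype_name:8s} {op:4s}  Series: {t_pd * 10:.3f} ms/call  "
              f"ndarray: {t_np * 10:.3f} ms/call  ratio: {t_pd / t_np:.1f}x")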

Code Sample, a copy-pastable example

A Python module exists in this repo, along with reports from several other users and screenshots of their graphs; the same general behaviour is seen across different machines: https://github.com/ianozsvald/dtype_pandas_numpy_speed_test

# note this is copied from my README linked above.
# paste into IPython or a Notebook
import pandas as pd
import numpy as np
arr = pd.Series(np.ones(shape=1_000_000))
arr.values.dtype                                                                                                                                                         
Out[]: dtype('float64')

arr.values.mean() == arr.mean()                                                                                                                                           
Out[]: True

# call arr.mean() vs arr.values.mean(); note the circa 10x speed difference
# (roughly 4.6 ms vs 0.5 ms)
%timeit arr.mean()
4.59 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit arr.values.mean()
485 µs ± 5.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# note that the arr.values dereference itself is very cheap (nanoseconds)
%timeit arr.values 
456 ns ± 0.828 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
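
As a rough point of comparison (my own addition, not part of the original report): pandas reductions skip NaN by default (skipna=True), so np.nanmean may be a closer NumPy analogue of Series.mean() than ndarray.mean(). The snippet below, pasted into IPython like the example above, shows that the NaN-aware path alone carries measurable overhead, though it may not account for the full gap.

# NaN-aware vs plain reductions on the same data (illustrative comparison)
vals = arr.values

%timeit vals.mean()       # plain reduction, no NaN handling
%timeit np.nanmean(vals)  # NaN-aware reduction, typically noticeably slower
%timeit arr.mean()        # pandas Series reduction (skipna=True by default)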

Problem description

Is this slow-down expected? The slowdown feels extreme, but perhaps my testing methodology is flawed? I would expect the float and integer math to run at approximately the same speed, but instead we see a significant slow-down for Pandas float operations vs their NumPy counterparts.
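
One environment variable worth noting (my suggestion, not part of the original report): the pd.show_versions() output below shows bottleneck is not installed, and pandas can route these reductions through bottleneck when it is available and enabled, so timings may differ between environments. A quick way to check the relevant options:

# Check whether pandas is configured to use the optional accelerators
# (useful for narrowing down differences between machines).
import pandas as pd

print(pd.get_option("compute.use_bottleneck"))  # only matters if bottleneck is installed
print(pd.get_option("compute.use_numexpr"))     # numexpr mainly affects expression evaluation

# whether bottleneck / numexpr are installed is also reported by pd.show_versions()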

I've added some extra graphs:

[screenshots: additional timing graphs omitted]

Expected Output

Output of pd.show_versions()

In [2]: pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.6.7-050607-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.1.post20200529
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Labels: Performance (Memory or execution speed performance)
