PERF: pd.util.hash_pandas_object slower on string[pyarrow] than object dtypes #48964

Open
@jrbourbeau

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

While investigating Dask's shuffle performance on string[pyarrow] data, I observed that pd.util.hash_pandas_object was slightly slower on string[pyarrow] data than on regular object-dtype data holding Python objects. This surprised me, as I would have expected hashing pyarrow-backed data to be faster than hashing Python objects.

In [1]: import pandas as pd

In [2]: s = pd.Series(range(2_000))

In [3]: s_object = s.astype(object)

In [4]: s_pyarrow = s.astype("string[pyarrow]")

In [5]: %timeit pd.util.hash_pandas_object(s_object)
859 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit pd.util.hash_pandas_object(s_pyarrow)
1.01 ms ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
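
For anyone who wants to rerun the comparison outside IPython, here is a minimal script-form sketch of the same measurement. The helper names build_series and time_hash are hypothetical and not part of pandas. One deliberate difference from the session above: there, s_object holds Python integers while s_pyarrow holds strings, so this sketch converts to str first so that both series hold the same string values. It also assumes that a missing or too-old pyarrow surfaces as an ImportError (or TypeError) when the dtype is requested, and skips that case. Absolute timings will vary by machine and by pandas/pyarrow version.

```python
import timeit

import pandas as pd


def build_series(n=2_000):
    """Build the same string values under both dtypes (hypothetical helper)."""
    base = pd.Series(range(n)).astype(str)
    series = {"object": base.astype(object)}
    try:
        # Assumption: without a usable pyarrow, requesting this dtype raises.
        series["string[pyarrow]"] = base.astype("string[pyarrow]")
    except (ImportError, TypeError):
        pass  # pyarrow not available; only the object dtype is timed
    return series


def time_hash(series, number=200, repeat=5):
    """Return the best per-call time, in seconds, of hash_pandas_object."""
    timer = timeit.Timer(lambda: pd.util.hash_pandas_object(series))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    for name, s in build_series().items():
        print(f"{name:>16}: {time_hash(s) * 1e6:.1f} µs per call")
```

Taking the minimum over several repeats reduces noise from other processes; %timeit reports a mean and standard deviation instead, so the numbers are not directly comparable to the ones above.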

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
python           : 3.10.6.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.6.0
Version          : Darwin Kernel Version 21.6.0: Wed Aug 10 14:25:27 PDT 2022; root:xnu-8020.141.5~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.0
numpy            : 1.23.3
pytz             : 2022.4
dateutil         : 2.8.2
setuptools       : 65.4.1
pip              : 22.2.2
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 8.5.0
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 9.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

Prior Performance

No response

Labels

  • Arrow: pyarrow functionality

  • Performance: Memory or execution speed performance

  • Strings: String extension data type and string data

  • hashing: hash_pandas_object
