PERF: Inefficient data representation when building dataframe from 2D NumPy array #50756

Closed
@datapythonista

Description

By default, NumPy stores 2D arrays in row-major (C) order, so the values of each row sit next to each other in memory. In this example:

numpy.array([[1, 2, 3, 4],
             [5, 6, 7, 8]])

Operating (e.g. adding up) over 1, 2, 3, 4 will be fast, since the values are adjacent in the buffer and CPU caches will be used efficiently. If we operate at column level (1, 5), caches won't be used efficiently, since the values are far apart in memory, and performance will be significantly worse.
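To make the layout concrete, here is a minimal sketch that inspects the strides of the array above. It assumes an explicit int64 dtype so the byte counts are the same on every platform; the original snippet does not specify one.

```python
import numpy

# Same values as above; dtype fixed to int64 (an assumption) so each item is 8 bytes.
a = numpy.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]], dtype=numpy.int64)

# strides = bytes to step to the next row, bytes to step to the next column.
row_step, col_step = a.strides

row = a[0]        # 1, 2, 3, 4: neighbours are 8 bytes apart
column = a[:, 0]  # 1, 5: neighbours are a full row (32 bytes) apart

print(a.strides, row.strides, column.strides)
print(row.flags.c_contiguous, column.flags.c_contiguous)
```

With only four columns the gap is small, but in an array with many columns each step down a column skips a whole row of bytes.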

The problem is that when I create a dataframe from a 2D NumPy array (e.g. pandas.DataFrame(2d_numpy_array_with_default_strides)), pandas builds the dataframe with the same shape and memory layout: every column in the array becomes a column in pandas, so operating over the columns of the dataframe is inefficient.

Given a dataframe with 2 columns and 10M rows, adding up the column values is roughly an order of magnitude slower with the default layout:

>>> import numpy
>>> import pandas

>>> df_default = pandas.DataFrame(numpy.random.rand(10_000_000, 2))
>>> df_efficient = pandas.DataFrame(numpy.random.rand(2, 10_000_000).T)

>>> %timeit df_default.sum()
340 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit df_efficient.sum()
23.4 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
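The two frames above hold the same kind of data and differ only in the memory layout of the source array. The sketch below looks at that difference at the NumPy level, using a smaller array for brevity, and shows one possible way to obtain the column-contiguous layout from an existing array with numpy.asfortranarray. Whether pandas keeps or copies the layout it is given may depend on the pandas version and the copy argument, so this only illustrates the arrays themselves.

```python
import numpy

n_rows = 1_000  # smaller than the 10M rows used in the timings above

default = numpy.random.rand(n_rows, 2)      # C order: each row is contiguous
efficient = numpy.random.rand(2, n_rows).T  # transposed view: each column is contiguous

# One way to convert an existing array: copy it into column-major (Fortran) order.
converted = numpy.asfortranarray(default)

for name, arr in [("default", default), ("efficient", efficient), ("converted", converted)]:
    print(name, arr.shape, arr.strides, arr.flags.f_contiguous)
```

The conversion costs one full copy of the data, which may or may not be acceptable depending on the array size.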

I'm not sure what a good solution is. I don't think we want to change the behavior to load NumPy arrays transposed, and I'm not sure that rewriting the NumPy array to fix the strides is a great option either. Personally, I'd check the strides of the provided array and, if they are inefficient, show the user a warning with a descriptive message explaining the problem and how to fix it.
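A rough sketch of what such a check could look like, written against plain NumPy. The function names, the warning class and the message are hypothetical and are not part of pandas; a real implementation would need to decide where in the constructor to run the check and which warning category to use.

```python
import warnings

import numpy


def has_inefficient_column_layout(values):
    """Hypothetical check: True if a 2D array's columns are not contiguous in memory."""
    if values.ndim != 2 or values.shape[0] < 2 or values.shape[1] < 2:
        # With a single row or a single column there is nothing to gain.
        return False
    return not values.flags.f_contiguous


def warn_if_inefficient_layout(values):
    """Hypothetical helper: warn when column-wise operations are likely to be slow."""
    if has_inefficient_column_layout(values):
        warnings.warn(
            "The array stores rows contiguously, so column-wise operations "
            "may be slow. Consider passing numpy.asfortranarray(values).",
            UserWarning,
            stacklevel=2,
        )


warn_if_inefficient_layout(numpy.random.rand(2, 1_000).T)  # column-contiguous: silent
```

This simple version would also warn for wide arrays with very few rows, where the cost is probably negligible, so a threshold on the number of rows might be worth considering.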

Labels: Performance (memory or execution speed performance)
