Description
By default, NumPy stores array data in row-major (C) order, meaning the values of each row are contiguous in memory. In this example:
numpy.array([[1, 2, 3, 4],
             [5, 6, 7, 8]])
Operating over a row (e.g. adding up 1, 2, 3, 4) will be fast, since the values sit next to each other in the buffer and CPU caches are used efficiently. If we operate over a column (1, 5), the caches won't be used efficiently, since the values are far apart in memory, and performance will be significantly worse.
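For illustration, this layout is visible in the array's strides (the byte step needed to move along each axis); a quick check, assuming a 64-bit default integer dtype:
>>> import numpy
>>> arr = numpy.array([[1, 2, 3, 4],
...                    [5, 6, 7, 8]])
>>> arr.strides  # 32 bytes to the next row, only 8 to the next column
(32, 8)
>>> arr.flags['C_CONTIGUOUS'], arr.flags['F_CONTIGUOUS']
(True, False)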
The problem is that if I create a dataframe from a 2D numpy array (e.g. pandas.DataFrame(2d_numpy_array_with_default_strides)), pandas builds the dataframe with the same shape, meaning that every column in the array becomes a column in pandas, and operating over the columns of the dataframe is inefficient.
Given a dataframe with 2 columns and 10M rows, adding up the column values is one order of magnitude slower with the default layout:
>>> import numpy
>>> import pandas
>>> df_default = pandas.DataFrame(numpy.random.rand(10_000_000, 2))
>>> df_efficient = pandas.DataFrame(numpy.random.rand(2, 10_000_000).T)
>>> %timeit df_default.sum()
340 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df_efficient.sum()
23.4 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
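For an existing C-ordered array, the same column-contiguous layout as df_efficient can be obtained with numpy.asfortranarray, at the cost of one copy up front (this assumes pandas keeps the provided layout, as the timings above suggest):
>>> arr = numpy.random.rand(10_000_000, 2)              # C order: rows contiguous
>>> df_f = pandas.DataFrame(numpy.asfortranarray(arr))  # F order: columns contiguous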
I'm not sure what a good solution is. I don't think we want to change the behavior to load numpy arrays transposed, and I'm not sure rewriting the numpy array to fix the strides is a great option either. Personally, I'd check the strides of the provided array and, if they are inefficient, show the user a warning with a descriptive message explaining the problem and how to fix it.
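A rough sketch of what that check could look like (the helper name and exact condition here are hypothetical, not existing pandas code; PerformanceWarning is the category pandas.errors already exposes):

import warnings

from pandas.errors import PerformanceWarning

def _warn_if_column_inefficient(values):
    # Hypothetical helper: warn when a 2D array's columns are not
    # contiguous in memory, i.e. column operations will be cache-unfriendly.
    if values.ndim == 2 and values.shape[1] > 1 and not values.flags['F_CONTIGUOUS']:
        warnings.warn(
            "DataFrame is being built from a row-major (C-ordered) array; "
            "column operations may be an order of magnitude slower. "
            "Consider passing numpy.asfortranarray(arr) instead.",
            PerformanceWarning,
            stacklevel=3,
        )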