Description
By default, NumPy stores array data in row-major (C) order, meaning the values of each row are contiguous in memory. In this example:
numpy.array([[1, 2, 3, 4],
             [5, 6, 7, 8]])
Operating over a row (e.g. adding up 1, 2, 3, 4) will be fast, since the values sit next to each other in the buffer and CPU caches are used efficiently. If we operate over a column (1, 5), the caches won't be used efficiently, since the values are far apart in memory, and performance will be significantly worse.
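For illustration, this layout is visible in the array's strides (the byte step needed to move along each axis); a quick check, assuming a 64-bit default integer dtype:
>>> import numpy
>>> arr = numpy.array([[1, 2, 3, 4],
...                    [5, 6, 7, 8]])
>>> arr.strides  # 32 bytes to the next row, only 8 to the next column
(32, 8)
>>> arr.flags['C_CONTIGUOUS'], arr.flags['F_CONTIGUOUS']
(True, False)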
The problem is that if I create a dataframe from a 2D numpy array (e.g. pandas.DataFrame(2d_numpy_array_with_default_strides)), pandas builds the dataframe with the same shape, meaning that every column in the array becomes a column in pandas, and operating over the columns of the dataframe is inefficient.
Given a dataframe with 2 columns and 10M rows, adding up the column values is one order of magnitude slower with the default layout:
>>> import numpy
>>> import pandas
>>> df_default = pandas.DataFrame(numpy.random.rand(10_000_000, 2))
>>> df_efficient = pandas.DataFrame(numpy.random.rand(2, 10_000_000).T)
>>> %timeit df_default.sum()
340 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df_efficient.sum()
23.4 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
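For an existing C-ordered array, the same column-contiguous layout as df_efficient can be obtained with numpy.asfortranarray, at the cost of one copy up front (this assumes pandas keeps the provided layout, as the timings above suggest):
>>> arr = numpy.random.rand(10_000_000, 2)              # C order: rows contiguous
>>> df_f = pandas.DataFrame(numpy.asfortranarray(arr))  # F order: columns contiguous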
I'm not sure what a good solution is. I don't think we want to change the behavior to load numpy arrays transposed, and I'm not sure rewriting the numpy array to fix the strides is a great option either. Personally, I'd check the strides of the provided array and, if they are inefficient, show the user a warning with a descriptive message explaining the problem and how to fix it.
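A rough sketch of what that check could look like (the helper name and exact condition here are hypothetical, not existing pandas code; PerformanceWarning is the category pandas.errors already exposes):

import warnings

from pandas.errors import PerformanceWarning

def _warn_if_column_inefficient(values):
    # Hypothetical helper: warn when a 2D array's columns are not
    # contiguous in memory, i.e. column operations will be cache-unfriendly.
    if values.ndim == 2 and values.shape[1] > 1 and not values.flags['F_CONTIGUOUS']:
        warnings.warn(
            "DataFrame is being built from a row-major (C-ordered) array; "
            "column operations may be an order of magnitude slower. "
            "Consider passing numpy.asfortranarray(arr) instead.",
            PerformanceWarning,
            stacklevel=3,
        )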