
Extremely slow construction of large DataFrames out of ndarrays? #8161

Closed
@aldanor

Description


I was trying to figure out the most efficient way to create a DataFrame out of a large numpy record array (I initially, naively, thought this could be done zero-copy) and ran some simple tests:


import numpy as np
import pandas as pd

n_columns = 2
n_records = int(15e6)

# Two int32 columns of 15M records each.
data = np.zeros((n_columns, n_records), np.int32)
arr = np.core.records.fromarrays(data)  # record array with fields f0, f1

# Construct directly from the record array, with and without copy.
%timeit -n1 -r1 df = pd.DataFrame(arr)
%timeit -n1 -r1 df = pd.DataFrame(arr, copy=False)

# Baseline: copying both columns out of the record array is cheap.
%timeit -n1 -r1 f0, f1 = arr['f0'].copy(), arr['f1'].copy()
f0, f1 = arr['f0'].copy(), arr['f1'].copy()

# Construct from a dict of plain ndarrays.
%timeit -n1 -r1 df = pd.DataFrame({'f0': f0, 'f1': f1})
%timeit -n1 -r1 df = pd.DataFrame({'f0': f0, 'f1': f1}, copy=False)

1 loops, best of 1: 2.47 s per loop
1 loops, best of 1: 2.42 s per loop
1 loops, best of 1: 48.2 ms per loop
1 loops, best of 1: 4.25 s per loop
1 loops, best of 1: 4.57 s per loop

I wonder what the DataFrame constructor is doing for several seconds straight when it's handed an already materialized ndarray (note that copying both arrays takes < 50 ms). This is on pandas v0.14.1.
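
One quick way to see where the constructor spends its time is the standard-library profiler; a sketch, using the same arr as above (any of the slow constructor calls would work as the target):

import cProfile

# Profile the slow path and sort by cumulative time, so the internal
# consolidation/copy steps show up at the top of the listing.
cProfile.run("pd.DataFrame(arr)", sort="cumulative")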

P.S. Is there a more efficient way to make a DataFrame out of a recarray?
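
For what it's worth, here is a sketch of one possible workaround, under the assumption that all fields share a single dtype (as they do above): pass the constructor one homogeneous 2-D array, which pandas can store as a single block and which may avoid the per-column splitting entirely. The column names below are only illustrative.

import numpy as np
import pandas as pd

n_columns = 2
n_records = int(15e6)
data = np.zeros((n_columns, n_records), np.int32)

# DataFrame expects rows along axis 0, hence the transpose; .T is a
# view, so no element-wise copy happens at this step.
df = pd.DataFrame(data.T, columns=['f0', 'f1'])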
