Closed
Description
I was trying to figure out the most efficient way to create a DataFrame from a large numpy record array (I initially, naively, thought this could be done in a zero-copy way) and ran some simple tests:
import numpy as np
import pandas as pd

n_columns = 2
n_records = int(15e6)
data = np.zeros((n_columns, n_records), np.int32)
arr = np.core.records.fromarrays(data)

%timeit -n1 -r1 df = pd.DataFrame(arr)
# 1 loops, best of 1: 2.47 s per loop
%timeit -n1 -r1 df = pd.DataFrame(arr, copy=False)
# 1 loops, best of 1: 2.42 s per loop
%timeit -n1 -r1 f0, f1 = arr['f0'].copy(), arr['f1'].copy()
# 1 loops, best of 1: 48.2 ms per loop
f0, f1 = arr['f0'].copy(), arr['f1'].copy()
%timeit -n1 -r1 df = pd.DataFrame({'f0': f0, 'f1': f1})
# 1 loops, best of 1: 4.25 s per loop
%timeit -n1 -r1 df = pd.DataFrame({'f0': f0, 'f1': f1}, copy=False)
# 1 loops, best of 1: 4.57 s per loop
I wonder what the DataFrame constructor is doing for several seconds straight, even when it's handed materialized ndarrays (note that copying both arrays outright takes < 50 ms). This is on v0.14.1.
P.S. Is there a more efficient way to make a DataFrame out of a recarray?
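For what it's worth, since both fields here share the same dtype, one workaround I've tried (a sketch only; whether pandas actually avoids the copy depends on the version and memory layout) is to skip the recarray entirely and hand the constructor the 2D array, so it can be kept as a single block instead of being assembled column by column:

```python
import numpy as np
import pandas as pd

n_records = 1000  # small size just for illustration
data = np.zeros((2, n_records), np.int32)

# All fields have one dtype, so pass the (n_records, n_columns) view directly
# instead of going through np.core.records.fromarrays + pd.DataFrame(arr).
df = pd.DataFrame(data.T, columns=['f0', 'f1'])
```

This obviously only applies to homogeneous dtypes; with mixed-dtype records the per-column path seems unavoidable.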