Description
In numpy/numpy#14995 I have tried to make numpy consistent with respect to coercing dataframes (and other array-likes which also implement the sequence protocol) to numpy arrays.
With the new PR/behaviour, the __array__ interface would be fully preferred, and no mixed/inconsistent behaviour with respect to also being a sequence-like (with different behaviour) would occur.
Unfortunately, pandas DataFrames have this mixed behaviour, since they are sequence-like. It kicks in during DataFrame coercion, in the following case:
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
df2 = pd.DataFrame([df1, df1])
Here, df2 is currently coerced to a DataFrame with DataFrames inside. This happens because of the following logic:
try:
    if is_list_like(values[0]) or hasattr(values[0], 'len'):  # <-- is hit
        # following convert does nothing; `np.array()` then raises an error...
        values = np.array([convert(v) for v in values])
    elif isinstance(values[0], np.ndarray) and values[0].ndim == 0:
        # GH#21861
        values = np.array([convert(v) for v in values])
    else:
        values = convert(values)
except (ValueError, TypeError):
    values = convert(values)  # <-- Ends up getting called and forces object array.
EDIT: additional code details: convert is a thin wrapper around:
def maybe_convert_platform(values):
    """ try to do platform conversion, allow ndarray or list here """
    if isinstance(values, (list, tuple, range)):
        values = construct_1d_object_array_from_listlike(values)
    # more logic
This takes the first branch (values is a list), which in turn forces a 1-D object array:
def construct_1d_object_array_from_listlike(values):
    # numpy will try to interpret nested lists as further dimensions, hence
    # making a 1D array that contains list-likes is a bit tricky:
    result = np.empty(len(values), dtype='object')
    result[:] = values
    return result
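To illustrate why the helper above pre-allocates the array, here is a minimal sketch using plain nested lists rather than DataFrames. The name `to_1d_object_array` is a hypothetical stand-in for the pandas helper, and the behaviour is what I would expect from a recent NumPy:

```python
import numpy as np


def to_1d_object_array(values):
    # Hypothetical stand-in for construct_1d_object_array_from_listlike.
    # Pre-allocating a 1-D object array and assigning into it keeps each
    # element as-is instead of letting NumPy unpack it into a new dimension.
    result = np.empty(len(values), dtype=object)
    result[:] = values
    return result


nested = [[1, 2], [3, 4]]

# np.array() treats the inner lists as a second dimension.
unpacked = np.array(nested, dtype=object)

# The pre-allocated version stays 1-D and holds the lists themselves.
boxed = to_1d_object_array(nested)

print(unpacked.shape, boxed.shape)
```

The same mechanism is what leaves two whole DataFrames sitting inside a 1-D object array in the fallback path.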
Because np.array([df1, df1]) will raise an error due to the inconsistencies within NumPy, it ends up calling convert([df1, df1]), which in turn creates a NumPy dtype=object array with two dataframes inside.
However, the new/correct behaviour for NumPy would be that np.array([df1, df1]) returns a 3-dimensional array. This ends up raising an error because pandas refuses to coerce a 3D array to a DataFrame.
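As a rough illustration of that behaviour, the sketch below uses a hypothetical class, `FrameLike`, that implements both `__array__` and the sequence protocol, standing in for a DataFrame. It assumes a NumPy version that includes the proposed change, so that `__array__` takes precedence; on older versions the result may differ or raise.

```python
import numpy as np


class FrameLike:
    """Hypothetical DataFrame stand-in: array-like *and* sequence-like."""

    def __init__(self, data, labels):
        self._data = np.asarray(data)  # 2-D values, shape (rows, columns)
        self._labels = list(labels)    # column labels

    def __array__(self, dtype=None, copy=None):
        # Array protocol: expose the 2-D values.
        return self._data if dtype is None else self._data.astype(dtype)

    # Sequence protocol over the column labels, deliberately inconsistent
    # with the array view above (a toy, not how pandas defines these).
    def __len__(self):
        return len(self._labels)

    def __getitem__(self, index):
        return self._labels[index]


frame = FrameLike([[1, 3], [2, 4], [3, 5]], labels=["a", "b"])

# With __array__ preferred, each element contributes its full 2-D array,
# so the list of two frames becomes one 3-D array.
stacked = np.array([frame, frame])
print(stacked.shape)
```

A 3-D result like this is exactly what the DataFrame constructor would then have to reject or special-case.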
It seems safest not to squeeze this into the upcoming NumPy release (planned in a few days). However, I would like to change it in master soon after branching. I am not sure whether you consider the current behaviour important, but it would be nice if you could look into what the final intent should be here. If we (can) change this in NumPy, I am not sure there is a way for pandas to retain the old behaviour.