Pandas Tests rely on inconsistent array coercion #29978

Open
@seberg

In numpy/numpy#14995 I have tried to make numpy consistent with respect to coercing dataframes (and other array-likes which also implement the sequence protocol) to numpy arrays.

With the new PR/behaviour, the __array__ interface would be fully preferred, so there would be no mixed or inconsistent behaviour for objects that are also sequence-like (and behave differently as sequences).

Unfortunately, pandas DataFrames have this behaviour, since they are sequence-like. It kicks in during DataFrame coercion, in the following case:

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
df2 = pd.DataFrame([df1, df1])

Here df2 is currently coerced to a DataFrame with DataFrames inside. This happens because of the following logic:

        try:
            if is_list_like(values[0]) or hasattr(values[0], 'len'):  # <-- is hit
                # the following convert does nothing; `np.array()` then raises an error...
                values = np.array([convert(v) for v in values])
            elif isinstance(values[0], np.ndarray) and values[0].ndim == 0:
                # GH#21861
                values = np.array([convert(v) for v in values])
            else:
                values = convert(values)
        except (ValueError, TypeError):
            values = convert(values)  # <-- Ends up getting called and forces object array.

EDIT: additional code details: convert is a thin wrapper around:

def maybe_convert_platform(values):
    """ try to do platform conversion, allow ndarray or list here """

    if isinstance(values, (list, tuple, range)):
        values = construct_1d_object_array_from_listlike(values)
    # more logic

This takes the first branch (values is a list), which in turn forces a 1-D object array:

def construct_1d_object_array_from_listlike(values):
    # numpy will try to interpret nested lists as further dimensions, hence
    # making a 1D array that contains list-likes is a bit tricky:
    result = np.empty(len(values), dtype='object')
    result[:] = values
    return result
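To make the difference concrete, here is a minimal sketch of why the preallocate-and-assign pattern above is used, with plain nested lists standing in for the dataframes. The variable names are only illustrative and are not taken from pandas:

```python
import numpy as np

nested = [[1, 2], [3, 4]]

# Direct coercion: NumPy interprets the inner lists as a second dimension.
as_2d = np.array(nested, dtype=object)

# Preallocate-and-assign: the target is 1-D, so each inner list is stored
# whole as a single element.
as_1d = np.empty(len(nested), dtype=object)
as_1d[:] = nested

print(as_2d.shape)  # (2, 2)
print(as_1d.shape)  # (2,)
print(type(as_1d[0]).__name__)  # list
```

The second form is what forces a 1-D object array regardless of what the elements look like.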

Because np.array([df1, df1]) raises an error due to the inconsistencies within NumPy, pandas ends up calling convert([df1, df1]), which in turn creates a NumPy dtype=object array with two dataframes inside.
However, the new/correct behaviour for NumPy would be that np.array([df1, df1]) returns a 3-dimensional array. This ends up raising an error because pandas refuses to coerce a 3D array to a DataFrame.
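The following sketch emulates that failure mode without depending on a particular NumPy version. It assumes that coercing each frame with np.asarray and stacking the results is a fair stand-in for what np.array([df1, df1]) would return once __array__ is always preferred; that emulation is my assumption, not code from either project:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})

# Assumption: np.stack over explicitly coerced frames stands in for the
# result of np.array([df1, df1]) under the proposed behaviour.
stacked = np.stack([np.asarray(df1), np.asarray(df1)])
print(stacked.shape)  # (2, 3, 2)

# pandas only accepts 2-D input here, so the 3-D array is rejected.
try:
    pd.DataFrame(stacked)
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__)  # ValueError
```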

It seems safest not to try to squeeze this into the upcoming NumPy release (planned in a few days). However, I would like to change it in master soon after branching. I am not sure whether you see the current behaviour as important, but it would be nice if you could look into what the final intent is here. If we (can) change this in NumPy, I am not sure there is a way for pandas to retain the old behaviour.
