Description
This was brought up by @jorisvandenbossche: if two libraries both use the same library for in-memory data storage (e.g. buffers/columns are backed by NumPy or Arrow arrays), can we avoid iterating through each buffer on each column by directly handing over that native representation?
This is a similar question to https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol_summary.md#what-is-wrong-with-to_numpy-and-to_arrow - but it's not the same; there is one important difference. The key point of that FAQ entry is that it's consumers who should rely on NumPy/Arrow, and not producers. Having a `to_numpy()` method somewhere is at odds with that. Here is an alternative:
1. A `Column` instance may define `__array__` or `__arrow_array__` if and only if the column itself is backed by a single NumPy or Arrow array.
2. `DataFrame` and `Buffer` instances must not define `__array__` or `__arrow_array__`.
(1) is motivated by wanting a simple shortcut like this:
```python
# inside `from_dataframe` constructor
for name in df.column_names():
    col = df.get_column_by_name(name)
    # say my library natively uses Arrow:
    if hasattr(col, '__arrow_array__'):
        # apparently we're both using Arrow, take the shortcut
        columns[name] = col.__arrow_array__()
    elif ...:  # continue parsing dtypes, null values, etc.
```
However, there are other constraints then. For `__array__` this also implies:

- the column either has no missing values, or uses `NaN` or a sentinel value for nulls (and this needs checking first in the code above - otherwise the consumer may still misinterpret the data; see the sketch after this list)
- this does not work for categorical or string dtypes - those are not representable by a single array
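To make the first constraint concrete, here is a minimal consumer-side sketch of that check, to slot into the fallback branch of the loop above. It assumes `describe_null` behaves as in the protocol draft, returning a `(kind, value)` tuple with kinds 0 = non-nullable, 1 = NaN/NaT, 2 = sentinel value, 3 = bit mask, 4 = byte mask; `parse_buffers` is a hypothetical fallback helper.

```python
# Sketch only - assumes the draft protocol's `describe_null` kinds:
# 0 = non-nullable, 1 = NaN/NaT, 2 = sentinel, 3 = bit mask, 4 = byte mask
null_kind, null_value = col.describe_null
if hasattr(col, '__array__') and null_kind in (0, 1, 2):
    # no separate mask buffer: a single ndarray represents the data
    # faithfully, since NaN/sentinel nulls live inside the array itself
    columns[name] = col.__array__()
else:
    # bit/byte mask nulls (or no shortcut at all): take the generic
    # buffer-by-buffer path (`parse_buffers` is a hypothetical helper)
    columns[name] = parse_buffers(col)
```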
For `__arrow_array__` I cannot think of issues right away. Of course the producer should also be careful to ensure that there are no differences in behavior due to adding one of these methods. For example, if there's a dataframe with a nested dtype that is supported by Arrow but not by the protocol, calling `__dataframe__()` should raise because of the unsupported dtype.
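To illustrate that producer-side care, a guard could look like the sketch below; `is_nested` and `ProtocolDataFrame` are hypothetical names, while the `__dataframe__(nan_as_null, allow_copy)` signature follows the protocol draft.

```python
class MyDataFrame:
    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        for dtype in self.dtypes:  # assumed producer attribute
            # nested dtypes are fine for Arrow, but the protocol cannot
            # describe them - fail here rather than behave differently
            # just because `__arrow_array__` happens to be defined
            if is_nested(dtype):  # hypothetical predicate
                raise NotImplementedError(f"unsupported dtype: {dtype}")
        return ProtocolDataFrame(self)  # hypothetical protocol wrapper
```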
The main pro of doing this is:
- A potential performance gain in the dataframe conversion (TBD how significant)
The main con is:
- Extra code complexity to get that performance gain, because now there are two code paths on the consumer side and both must be equivalent.
My impression is: this may be useful to do for `__arrow_array__`; I don't think it's a good idea for `__array__`, because the gain is fairly limited and there are too many constraints or ways to get it wrong (e.g. `describe_null` must always be checked before using `__array__`). If `__array__` is to be added, then maybe at the `Buffer` level, where it plays the same role as `__dlpack__`.
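For concreteness, a `Buffer`-level `__array__` could look like this minimal sketch (a hypothetical buffer class wrapping a contiguous NumPy array). At this level there is only raw memory and no null semantics to misinterpret, which is exactly the role `__dlpack__` plays.

```python
import numpy as np

class Buffer:
    """Hypothetical buffer backed by a 1-D contiguous NumPy array."""

    def __init__(self, x):
        self._x = x

    def __dlpack__(self, stream=None):
        # standard DLPack export of the raw memory (NumPy >= 1.22)
        return self._x.__dlpack__()

    def __array__(self, dtype=None):
        # plays the same role as __dlpack__: hand over the raw block of
        # memory; interpreting nulls remains the Column's job
        return np.asarray(self._x, dtype=dtype)
```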