How to consume a single buffer & connection to array interchange #39

@rgommers

For dataframe interchange, the smallest building block is a "buffer" (see gh-35, gh-38) - a block of memory. Interpreting such a buffer is nontrivial, especially if the goal is to build an interchange protocol in Python. That's why DLPack, the buffer protocol, __array_interface__, __cuda_array_interface__, __array__ and __arrow_array__ all exist - and they are still complicated.

As for what a buffer is: currently it is only a data pointer (ptr) and a size (bufsize), which together describe a contiguous block of memory, plus a device attribute (__dlpack_device__) and optional DLPack support (__dlpack__). One open question is:
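Leaving that question aside: for concreteness, here is a minimal sketch of what such a Buffer object could look like. Attribute names follow the prototype; the device enum values match DLPack's DLDeviceType, and the rest is an illustrative assumption rather than settled API:

import enum

class DlpackDeviceType(enum.IntEnum):
    # A subset of DLPack's DLDeviceType values.
    CPU = 1
    CUDA = 2

class Buffer:
    """A contiguous block of memory - nothing more."""

    @property
    def bufsize(self) -> int:
        """Size of the buffer in bytes."""
        raise NotImplementedError

    @property
    def ptr(self) -> int:
        """Pointer to the start of the buffer, as an integer."""
        raise NotImplementedError

    def __dlpack__(self):
        """Export the buffer as a DLPack capsule (optional)."""
        raise NotImplementedError

    def __dlpack_device__(self):
        """Return the (device type, device id) the memory lives on."""
        raise NotImplementedError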

The other, larger question is how to make buffers nice to deal with for implementers of the protocol. The current Pandas prototype shows the issue:

import ctypes

import numpy as np


def convert_column_to_ndarray(col: ColumnObject) -> np.ndarray:
    """
    Convert a column holding one contiguous buffer of boolean, integer or
    floating-point data to a NumPy ndarray.
    """
    if col.offset != 0:
        raise NotImplementedError("non-zero column.offset not handled yet")

    # describe_null codes: 0 = non-nullable, 1 = NaN/NaT; masks and
    # sentinel values are not supported here.
    if col.describe_null not in (0, 1):
        raise NotImplementedError("Null values represented as masks or "
                                  "sentinel values not handled yet")

    # Handle the dtype. Kind codes: 0 = int, 1 = uint, 2 = float, 20 = bool.
    _dtype = col.dtype
    kind = _dtype[0]
    bitwidth = _dtype[1]
    if kind not in (0, 1, 2, 20):
        raise RuntimeError("Not a boolean, integer or floating-point dtype")

    _ints = {8: np.int8, 16: np.int16, 32: np.int32, 64: np.int64}
    _uints = {8: np.uint8, 16: np.uint16, 32: np.uint32, 64: np.uint64}
    _floats = {32: np.float32, 64: np.float64}
    _np_dtypes = {0: _ints, 1: _uints, 2: _floats, 20: {8: bool}}
    column_dtype = _np_dtypes[kind][bitwidth]

    # No DLPack yet, so need to construct a new ndarray from the data pointer
    # and size in the buffer plus the dtype on the column
    _buffer = col.get_data_buffer()
    ctypes_type = np.ctypeslib.as_ctypes_type(column_dtype)
    data_pointer = ctypes.cast(_buffer.ptr, ctypes.POINTER(ctypes_type))

    # NOTE: `x` does not own its memory, so the caller of this function must
    #       either make a copy or hold on to a reference of the column or
    #       buffer! (not done yet, this is pretty awful ...)
    x = np.ctypeslib.as_array(data_pointer,
                              shape=(_buffer.bufsize // (bitwidth // 8),))

    return x
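Until then, one way to deal with the NOTE above is to pin the producing object to the returned array. A sketch (the subclass and the _owner attribute are inventions for illustration, not part of any protocol):

import numpy as np

class _InterchangeArray(np.ndarray):
    # Unlike a plain ndarray, a subclass instance accepts new attributes,
    # so the producing object can be pinned onto the array itself.
    pass

def keep_alive(x: np.ndarray, owner) -> np.ndarray:
    arr = x.view(_InterchangeArray)
    arr._owner = owner  # `owner` stays reachable as long as `arr` does
    return arr

The last line of convert_column_to_ndarray could then read return keep_alive(x, _buffer) - which only helps if the buffer object itself references whatever owns the memory, which is exactly the open design question here.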

From #38 (review) (@kkraus14 & @rgommers):

In __cuda_array_interface__ we've generally stated that holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well.

Yes, that works and I've thought about it. The trouble is where to hold the reference. You really need one reference per buffer, rather than just storing a reference to the whole exchange dataframe object (buffers can end up elsewhere, outside the new pandas dataframe). And given that a buffer just has a raw pointer plus a size, there's nothing to hold on to. I don't think there's a sane pure Python solution.

__cuda_array_interface__ is directly attached to the object you need to hold on to, which is not the case for this Buffer.

I'd argue this is a place where we should really align with the array interchange protocol though as the same problem is being solved there.

Yep, for numerical data types the solution can simply be: hurry up with implementing __dlpack__, and the problem goes away. The dtypes that DLPack does not support are more of an issue.
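For those numerical dtypes, once __dlpack__ lands on buffers, the consumer side collapses to almost nothing. A sketch, assuming a NumPy new enough to provide np.from_dlpack (1.22+) and a buffer that exposes a 1-D typed view rather than raw bytes:

import numpy as np

def convert_numeric_buffer(buf) -> np.ndarray:
    # The DLPack capsule carries dtype, shape, and a deleter, so both the
    # interpretation problem and the ownership problem are handled for us.
    return np.from_dlpack(buf)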

From #38 (comment) (@jorisvandenbossche):

I personally think it would be useful to keep those existing interface methods (__array__ or __arrow_array__). For people already using those interfaces, that will make working with the interchange protocol easier than manually converting the buffers.

Alternative/extension to the current design

We could change the plain memory description + __dlpack__ to:

  1. Implementations MUST support a memory description with ptr, bufsize, and device
  2. Implementations MAY support buffers in their native format (e.g. add a native enum attribute, and if both producer and consumer happen to use that native format, they can call the corresponding protocol - __arrow_array__ or __array__)
  3. Implementations MAY support any exchange protocol (DLPack, __cuda_array_interface__, buffer protocol, __array_interface__); see the consumer-side sketch after this list.
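
Under this scheme a consumer negotiates down the ladder. A sketch of what that might look like (the raw fallback returns untyped bytes; mapping them to a dtype still requires the column metadata):

import ctypes

import numpy as np

def consume_buffer(buf) -> np.ndarray:
    # (3) Prefer a full exchange protocol when the producer offers one.
    try:
        return np.from_dlpack(buf)
    except (AttributeError, TypeError, NotImplementedError):
        pass
    # (2) Native-format checks (__array__ / __arrow_array__) would slot in
    #     here, when producer and consumer share a native format.
    # (1) Mandatory fallback: the raw ptr + bufsize description. The caller
    #     must keep `buf` alive for as long as the returned array is used.
    ptr = ctypes.cast(buf.ptr, ctypes.POINTER(ctypes.c_uint8))
    return np.ctypeslib.as_array(ptr, shape=(buf.bufsize,))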

(1) is required for any implementation to be able to talk to any other implementation, but it is also the clunkiest to support, because it needs to solve the "who owns this memory and how do you prevent it from being freed" problem all over again. What is needed there is summarized below under "What is missing for dealing with memory buffers".

The advantage of (2) and (3) is that the hairiest issue is already solved there, and they will likely be faster.

And the MUST/MAY split should address @kkraus14's concern that people will just standardize on the lowest common denominator (numpy).

What is missing for dealing with memory buffers

A summary of why this is hard is:

  1. Underlying implementations are not compatible. E.g., NumPy doesn't support variable-length strings or bit masks, while Arrow does not support strided arrays or byte masks (see the mask-conversion sketch after this list).
  2. DLPack is the only protocol with device support, but it does not support all dtypes that are needed.
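
To make the mask mismatch concrete: a consumer that only understands byte masks has to expand an Arrow-style validity bitmap bit by bit. A sketch, assuming LSB bit order as Arrow specifies:

import numpy as np

def bitmask_to_bytemask(bitmask: bytes, length: int) -> np.ndarray:
    # Arrow validity bitmaps are LSB-ordered: bit i of byte i // 8 is 1
    # when element i is valid and 0 when it is null.
    bits = np.unpackbits(np.frombuffer(bitmask, dtype=np.uint8),
                         bitorder="little")
    return bits[:length].astype(bool)

# bitmask_to_bytemask(b"\x0b", 4) -> array([ True,  True, False,  True])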

So what we are aiming for (ambitiously) is:

  • Something flexible enough to be a superset of NumPy and Arrow, with full device support.
  • In pure Python.

The "holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well" seems necessary for supporting the raw memory description. This probably means that (a) the Buffer object should include the right Python object to keep a reference to (for Pandas that would typically be a 1-D numpy array), and (b) there must be some machinery to keep this reference alive (TBD what that looks like, likely not pure Python) in the implementation.
