Data exchange formats

Based on what is defined in https://github.com/wesm/dataframe-protocol/pull/1, the idea is not to support a single format to exchange data, but to support several (e.g. Arrow, NumPy).

Here is a code example to show what this approach implies.

**1. Dataframe implementations should implement the `__dataframe__` method, returning the exchange format we are defining**

For example, let's assume Vaex is using Arrow, and it wants to offer its data in Arrow format to consumers:
```python
import pyarrow


class VaexExchangeDataFrame:
    """
    The format defined by our spec.
    
    Besides `to_arrow` and `to_numpy`, it should implement the rest of
    the spec (`num_rows`, `num_columns`, `column_names`...).
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data

    def to_arrow(self):
        return self.arrow_data

    def to_numpy(self):
        raise NotImplementedError('numpy format not implemented')


class VaexDataFrame:
    """
    The public Vaex dataframe class.

    For simplicity of the example, this just wraps an arrow object received in the constructor,
    but this would be the whole `vaex.DataFrame`.
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data
        
    def __dataframe__(self):
        return VaexExchangeDataFrame(self.arrow_data)

# Creating an instance of the Vaex public dataframe
vaex_df = VaexDataFrame(pyarrow.RecordBatch.from_arrays(
    [pyarrow.array(['pandas', 'vaex', 'modin'], type=pyarrow.string()),
     pyarrow.array([26_300, 4_900, 5_200], type=pyarrow.uint32())],
    names=['name', 'github_stars']))
```

Other implementations could use formats other than Arrow. For example, let's assume Modin wants to offer its data as NumPy arrays:
```python
import numpy


class ModinExchangeDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        return self.numpy_data


class ModinDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def __dataframe__(self):
        return ModinExchangeDataFrame(self.numpy_data)


modin_df = ModinDataFrame({'name': numpy.array(['pandas', 'vaex', 'modin'], dtype='object'),
                           'github_stars': numpy.array([26_300, 4_900, 5_200], dtype='uint32')})
```
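The Vaex docstring earlier notes that the exchange object should also implement the rest of the spec (`num_rows`, `num_columns`, `column_names`...). As a rough sketch, a NumPy-backed exchange object could derive these from its dict of arrays. Exposing them as properties, and the class name `NumpyExchangeDataFrame`, are assumptions made for illustration; the spec under discussion may define them differently.

```python
import numpy


class NumpyExchangeDataFrame:
    """Hypothetical exchange object backed by a dict of equal-length NumPy arrays."""

    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    @property
    def column_names(self):
        # Dicts preserve insertion order, so this is also the column order
        return list(self.numpy_data)

    @property
    def num_columns(self):
        return len(self.numpy_data)

    @property
    def num_rows(self):
        # Assumes all columns have the same length; no columns means no rows
        first = next(iter(self.numpy_data.values()), None)
        return 0 if first is None else len(first)

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        return self.numpy_data


exchange_df = NumpyExchangeDataFrame({
    'name': numpy.array(['pandas', 'vaex', 'modin'], dtype='object'),
    'github_stars': numpy.array([26_300, 4_900, 5_200], dtype='uint32'),
})
print(exchange_df.column_names, exchange_df.num_columns, exchange_df.num_rows)
```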

**2. Direct consumers should be able to understand all formats**

For example, pandas could implement a `from_dataframe` function to create a pandas dataframe from different formats:
```python
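import pandas


def from_dataframe(dataframe):
    # Converters from each known exchange format to a pandas dataframe,
    # tried in the order they are listed here
    known_formats = {'numpy': lambda data: pandas.DataFrame(data),
                     'arrow': lambda data: data.to_pandas()}

    exchange_dataframe = dataframe.__dataframe__()
    for format_, to_pandas in known_formats.items():
        try:
            data = getattr(exchange_dataframe, f'to_{format_}')()
        except NotImplementedError:
            # The producer does not offer this format, try the next one
            continue
        return to_pandas(data)

    raise RuntimeError('Dataframe does not support any known format')


# Attach the function to the pandas namespace for this example
pandas.from_dataframe = from_dataframe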
```

This would allow pandas users to load data from other formats:
```python
pandas_df_1 = pandas.from_dataframe(vaex_df)
pandas_df_2 = pandas.from_dataframe(modin_df)
```
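One detail the loop in `from_dataframe` leaves implicit is preference order: formats are tried in the order the consumer lists them, and the first one the producer supports wins. The sketch below isolates that negotiation step using only the standard library. `negotiate_format` and the two stub exchange classes are hypothetical names for illustration, not part of any proposal.

```python
def negotiate_format(exchange_dataframe, preferred_formats):
    """Return (format_name, data) for the first format the producer supports."""
    for format_ in preferred_formats:
        # A producer may lack the method entirely, or raise NotImplementedError
        to_format = getattr(exchange_dataframe, f'to_{format_}', None)
        if to_format is None:
            continue
        try:
            return format_, to_format()
        except NotImplementedError:
            continue
    raise RuntimeError('Dataframe does not support any known format')


class BothFormatsExchange:
    """Stub exchange object that offers both formats."""

    def to_arrow(self):
        return 'arrow data'

    def to_numpy(self):
        return 'numpy data'


class NoFormatsExchange:
    """Stub exchange object that offers neither format."""

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        raise NotImplementedError('numpy format not implemented')


# The consumer's ordering decides which format is used
print(negotiate_format(BothFormatsExchange(), ['arrow', 'numpy']))
print(negotiate_format(BothFormatsExchange(), ['numpy', 'arrow']))

try:
    negotiate_format(NoFormatsExchange(), ['arrow', 'numpy'])
except RuntimeError as error:
    print(error)
```

With this shape, a consumer that prefers zero-copy Arrow data can list `'arrow'` first and still fall back to NumPy when a producer only offers that.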

Vaex, Modin and any other implementation could implement an equivalent function to load data from other
libraries into their formats.
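As an illustration of such an equivalent function, here is a minimal sketch of a consumer that wants its data as a dict of NumPy arrays. The function name `numpy_from_dataframe` and the stub producer classes are hypothetical, and the Arrow branch assumes the producer returns a `pyarrow.RecordBatch`, as in the Vaex example.

```python
import numpy


def numpy_from_dataframe(dataframe):
    """Hypothetical consumer that wants a dict of NumPy arrays."""
    exchange_dataframe = dataframe.__dataframe__()
    try:
        # Preferred path: the producer already offers NumPy arrays
        return dict(exchange_dataframe.to_numpy())
    except NotImplementedError:
        pass
    try:
        batch = exchange_dataframe.to_arrow()
    except NotImplementedError:
        raise RuntimeError('Dataframe does not support any known format') from None
    # Assumes a pyarrow.RecordBatch; zero_copy_only=False permits a copy for
    # types (such as strings) that cannot be exposed to NumPy without one
    return {name: column.to_numpy(zero_copy_only=False)
            for name, column in zip(batch.schema.names, batch.columns)}


class StubExchangeDataFrame:
    """Stub exchange object that only offers NumPy data."""

    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        return self.numpy_data


class StubDataFrame:
    """Stub producer standing in for any dataframe library."""

    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def __dataframe__(self):
        return StubExchangeDataFrame(self.numpy_data)


stub_df = StubDataFrame({'github_stars': numpy.array([26_300, 4_900, 5_200], dtype='uint32')})
columns = numpy_from_dataframe(stub_df)
print(list(columns), columns['github_stars'].dtype)
```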

**3. Indirect consumers can pick an implementation, and use it to standardize their input**

For example, Seaborn may want to accept any dataframe implementation, but write its data access code against pandas. It could convert any dataframe to pandas, using `from_dataframe` from the previous section:
```python
def seaborn_bar_plot(any_dataframe, x, y):
    pandas_df = pandas.from_dataframe(any_dataframe)
    return pandas_df.plot(kind='bar', x=x, y=y)

seaborn_bar_plot(vaex_df, x='name', y='github_stars')
```

Are people happy with this approach?

CC: @rgommers 
