Data exchange formats #29

Closed
@datapythonista

Based on what is defined in wesm/dataframe-protocol#1, the idea is not to support a single format for exchanging data, but to support multiple formats (e.g. Arrow, NumPy).

Here is a code example to show what this approach implies.

1. Dataframe implementations should implement the __dataframe__ method, which returns the exchange object we are defining

For example, let's assume Vaex is using Arrow, and it wants to offer its data in Arrow format to consumers:

import pyarrow


class VaexExchangeDataFrame:
    """
    The format defined by our spec.
    
    Besides `to_arrow` and `to_numpy`, it should implement the rest of
    the spec: `num_rows`, `num_columns`, `column_names`...
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data

    def to_arrow(self):
        return self.arrow_data

    def to_numpy(self):
        raise NotImplementedError('numpy format not implemented')
    
class VaexDataFrame:
    """
    The public Vaex dataframe class.

    For simplicity of the example, this just wraps an arrow object received in the constructor,
    but this would be the whole `vaex.DataFrame`.
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data
        
    def __dataframe__(self):
        return VaexExchangeDataFrame(self.arrow_data)

# Creating an instance of the Vaex public dataframe
vaex_df = VaexDataFrame(pyarrow.RecordBatch.from_arrays([pyarrow.array(['pandas', 'vaex', 'modin'],
                                                                       type='string'),
                                                         pyarrow.array([26_300, 4_900, 5_200],
                                                                       type='uint32')],
                                                        ['name', 'github_stars']))

Other implementations could use formats other than Arrow. For example, let's assume Modin wants to offer its data as NumPy arrays:

import numpy


class ModinExchangeDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        return self.numpy_data


class ModinDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def __dataframe__(self):
        return ModinExchangeDataFrame(self.numpy_data)


modin_df = ModinDataFrame({'name': numpy.array(['pandas', 'vaex', 'modin'], dtype='object'),
                           'github_stars': numpy.array([26_300, 4_900, 5_200], dtype='uint32')})
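
The Vaex docstring above notes that an exchange object should also implement the rest of the spec (num_rows, num_columns, column_names). For the NumPy-backed case, a minimal sketch could look like the following. It assumes these members are read-only properties (the proposal does not say whether they are attributes or methods), and the class name ModinExchangeDataFrameWithMetadata is hypothetical:

```python
import numpy


class ModinExchangeDataFrameWithMetadata:
    """
    Hypothetical variant of ModinExchangeDataFrame with metadata members.

    Assumes `numpy_data` is a dict mapping column names to equal-length
    one-dimensional numpy arrays, as in the Modin example.
    """
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    @property
    def column_names(self):
        # Dicts preserve insertion order, so this is the column order.
        return list(self.numpy_data)

    @property
    def num_columns(self):
        return len(self.numpy_data)

    @property
    def num_rows(self):
        # All columns are assumed to have the same length; a dataframe
        # without columns is reported as having zero rows.
        if not self.numpy_data:
            return 0
        first_column = next(iter(self.numpy_data.values()))
        return len(first_column)

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        return self.numpy_data


exchange_df = ModinExchangeDataFrameWithMetadata({
    'name': numpy.array(['pandas', 'vaex', 'modin'], dtype='object'),
    'github_stars': numpy.array([26_300, 4_900, 5_200], dtype='uint32'),
})
print(exchange_df.column_names, exchange_df.num_columns, exchange_df.num_rows)
```

Whether consumers can rely on these members without first calling to_numpy or to_arrow is something the spec would need to state.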

2. Direct consumers should be able to understand all formats

For example, pandas could implement a from_dataframe function to create a pandas dataframe from different formats:

import pandas

def from_dataframe(dataframe):
    known_formats = {'numpy': lambda df: pandas.DataFrame(df),
                     'arrow': lambda df: df.to_pandas()}

    exchange_dataframe = dataframe.__dataframe__()
    for format_ in known_formats:
        try:
            data = getattr(exchange_dataframe, f'to_{format_}')()
        except NotImplementedError:
            pass
        else:
            return known_formats[format_](data)

    raise RuntimeError('Dataframe does not support any known format')

pandas.from_dataframe = from_dataframe

This would allow pandas users to load data from other formats:

pandas_df_1 = pandas.from_dataframe(vaex_df)
pandas_df_2 = pandas.from_dataframe(modin_df)

Vaex, Modin and any other implementation could implement an equivalent function to load data from other libraries into their own formats.
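
As a rough illustration of such an equivalent function, here is a sketch of what a Modin-side loader might look like, assuming Modin's internal format is the dict of numpy arrays used earlier. The names modin_from_dataframe, NumpyOnlyExchange and NumpyOnlyDataFrame are hypothetical. The Arrow branch is duck-typed against pyarrow.RecordBatch (schema.names, column(i), Array.to_numpy) and is not exercised by the usage lines below:

```python
import numpy


def modin_from_dataframe(dataframe):
    """Hypothetical Modin-side loader returning a dict of numpy arrays."""

    def arrow_to_numpy(batch):
        # Written against the pyarrow.RecordBatch interface.
        # zero_copy_only=False is needed for types such as strings,
        # which cannot be exposed to numpy without a copy.
        return {name: batch.column(i).to_numpy(zero_copy_only=False)
                for i, name in enumerate(batch.schema.names)}

    # numpy comes first because it needs no conversion on this side.
    known_formats = {'numpy': dict,
                     'arrow': arrow_to_numpy}

    exchange_dataframe = dataframe.__dataframe__()
    for format_, convert in known_formats.items():
        try:
            data = getattr(exchange_dataframe, f'to_{format_}')()
        except NotImplementedError:
            continue
        return convert(data)

    raise RuntimeError('Dataframe does not support any known format')


class NumpyOnlyExchange:
    """Stand-in exchange object that only offers the numpy format."""
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        return self.numpy_data


class NumpyOnlyDataFrame:
    """Stand-in producer, playing the role of another library."""
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def __dataframe__(self):
        return NumpyOnlyExchange(self.numpy_data)


loaded = modin_from_dataframe(NumpyOnlyDataFrame(
    {'github_stars': numpy.array([26_300, 4_900, 5_200], dtype='uint32')}))
```

The only differences from the pandas version are the preferred order of formats and the conversion functions, which suggests each library would list its native format first.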

3. Indirect consumers can pick an implementation and use it to standardize their input

For example, Seaborn may want to accept any dataframe implementation, but write its data-access code against pandas. It could convert any dataframe to pandas, using from_dataframe from the previous section:

def seaborn_bar_plot(any_dataframe, x, y):
    pandas_df = pandas.from_dataframe(any_dataframe)
    return pandas_df.plot(kind='bar', x=x, y=y)

seaborn_bar_plot(vaex_df, x='name', y='github_stars')

Are people happy with this approach?

CC: @rgommers
