Description
Based on what is defined in wesm/dataframe-protocol#1, the idea is not to support a single format to exchange data, but to support several (e.g. Arrow, NumPy).
Here is a code example to see what this approach implies.
1. Dataframe implementations should implement the `__dataframe__` method, returning the exchange format we are defining
For example, let's assume Vaex is using Arrow, and it wants to offer its data in Arrow format to consumers:
import pyarrow


class VaexExchangeDataFrame:
    """
    The format defined by our spec.

    Besides `to_arrow`, `to_numpy` it should implement the rest of
    the spec `num_rows`, `num_columns`, `column_names`...
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data

    def to_arrow(self):
        return self.arrow_data

    def to_numpy(self):
        raise NotImplementedError('numpy format not implemented')


class VaexDataFrame:
    """
    The public Vaex dataframe class.

    For simplicity of the example, this just wraps an arrow object received in the constructor,
    but this would be the whole `vaex.DataFrame`.
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data

    def __dataframe__(self):
        return VaexExchangeDataFrame(self.arrow_data)


# Creating an instance of the Vaex public dataframe
vaex_df = VaexDataFrame(pyarrow.RecordBatch.from_arrays(
    [pyarrow.array(['pandas', 'vaex', 'modin'], type='string'),
     pyarrow.array([26_300, 4_900, 5_200], type='uint32')],
    ['name', 'github_stars']))
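The docstring of `VaexExchangeDataFrame` mentions that the exchange object should also implement `num_rows`, `num_columns` and `column_names`. A minimal sketch of what that could look like, assuming the wrapped object behaves like a `pyarrow.RecordBatch` (which exposes `num_rows`, `num_columns` and `schema.names`). The class name `VaexExchangeDataFrameWithMetadata` is hypothetical, and exposing the metadata as properties rather than methods is an assumption, not something the spec has decided:

```python
from types import SimpleNamespace


class VaexExchangeDataFrameWithMetadata:
    """Hypothetical variant of VaexExchangeDataFrame that also exposes metadata."""

    def __init__(self, arrow_data):
        # Assumed to behave like a pyarrow.RecordBatch
        self.arrow_data = arrow_data

    @property
    def num_rows(self):
        return self.arrow_data.num_rows

    @property
    def num_columns(self):
        return self.arrow_data.num_columns

    @property
    def column_names(self):
        return list(self.arrow_data.schema.names)

    def to_arrow(self):
        return self.arrow_data

    def to_numpy(self):
        raise NotImplementedError('numpy format not implemented')


# Stand-in exposing the same three attributes as a RecordBatch,
# so the sketch runs even without pyarrow installed
fake_batch = SimpleNamespace(num_rows=3,
                             num_columns=2,
                             schema=SimpleNamespace(names=['name', 'github_stars']))
exchange_df = VaexExchangeDataFrameWithMetadata(fake_batch)
print(exchange_df.num_rows, exchange_df.num_columns, exchange_df.column_names)
```

With a real `pyarrow.RecordBatch` in place of `fake_batch`, the same properties would simply forward to the Arrow object.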
Other implementations could use formats other than Arrow. For example, let's assume Modin wants to offer its data as NumPy arrays:
import numpy


class ModinExchangeDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        return self.numpy_data


class ModinDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def __dataframe__(self):
        return ModinExchangeDataFrame(self.numpy_data)


modin_df = ModinDataFrame({'name': numpy.array(['pandas', 'vaex', 'modin'], dtype='object'),
                           'github_stars': numpy.array([26_300, 4_900, 5_200], dtype='uint32')})
2. Direct consumers should be able to understand all formats
For example, pandas could implement a `from_dataframe` function to create a pandas dataframe from different formats:
import pandas


def from_dataframe(dataframe):
    known_formats = {'numpy': lambda df: pandas.DataFrame(df),
                     'arrow': lambda df: df.to_pandas()}

    exchange_dataframe = dataframe.__dataframe__()
    for format_ in known_formats:
        try:
            data = getattr(exchange_dataframe, f'to_{format_}')()
        except NotImplementedError:
            pass
        else:
            return known_formats[format_](data)

    raise RuntimeError('Dataframe does not support any known format')


pandas.from_dataframe = from_dataframe
This would allow pandas users to load data from other formats:
pandas_df_1 = pandas.from_dataframe(vaex_df)
pandas_df_2 = pandas.from_dataframe(modin_df)
Vaex, Modin and any other implementation could implement an equivalent function to load data from other libraries into their formats.
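As a rough illustration of such an equivalent function, the sketch below factors the try-each-format loop into a helper that any consumer could call with its own converters. Everything here is hypothetical and only uses the standard library: `load_with_known_formats`, `rows_from_columns` and the `Columns*`/`Unsupported*` classes are made-up names, the consumer's native format is assumed to be a list of row tuples, and plain lists stand in for NumPy arrays so the example runs without any dataframe library installed:

```python
def load_with_known_formats(dataframe, known_formats):
    """Try each format in order and convert the first one the producer supports."""
    exchange_dataframe = dataframe.__dataframe__()
    for format_, convert in known_formats.items():
        try:
            data = getattr(exchange_dataframe, f'to_{format_}')()
        except NotImplementedError:
            continue
        return convert(data)
    raise RuntimeError('Dataframe does not support any known format')


def rows_from_columns(columns):
    """Convert a mapping of column name -> values into a list of row tuples."""
    return list(zip(*columns.values()))


class ColumnsExchangeDataFrame:
    """Hypothetical producer that only offers the 'numpy' format."""
    def __init__(self, columns):
        self.columns = columns

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        return self.columns


class ColumnsDataFrame:
    def __init__(self, columns):
        self.columns = columns

    def __dataframe__(self):
        return ColumnsExchangeDataFrame(self.columns)


class UnsupportedExchangeDataFrame:
    """Hypothetical producer that offers none of the known formats."""
    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        raise NotImplementedError('numpy format not implemented')


class UnsupportedDataFrame:
    def __dataframe__(self):
        return UnsupportedExchangeDataFrame()


# Converters of the hypothetical consumer, one per format it understands.
# The 'arrow' entry assumes an object with `to_pydict()`, as pyarrow.RecordBatch has.
row_formats = {
    'arrow': lambda batch: rows_from_columns(batch.to_pydict()),
    'numpy': rows_from_columns,
}

rows = load_with_known_formats(
    ColumnsDataFrame({'name': ['pandas', 'vaex', 'modin'],
                      'github_stars': [26_300, 4_900, 5_200]}),
    row_formats)
print(rows)

# A producer with no format in common ends in the RuntimeError branch
try:
    load_with_known_formats(UnsupportedDataFrame(), row_formats)
except RuntimeError as error:
    failure_message = str(error)
    print(failure_message)
```

One consequence of this design is that the order of the converter mapping decides which format wins when a producer supports several, so each consumer would list its preferred format first.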
3. Indirect consumers can pick an implementation, and use it to standardize their input
For example, Seaborn may want to accept any dataframe implementation, but write its data access code against pandas. It could convert any dataframe to pandas, using `from_dataframe` from the previous section:
def seaborn_bar_plot(any_dataframe, x, y):
    pandas_df = pandas.from_dataframe(any_dataframe)
    return pandas_df.plot(kind='bar', x=x, y=y)


seaborn_bar_plot(vaex_df, x='name', y='github_stars')
Are people happy with this approach?
CC: @rgommers