Description
One of the "to be decided" items at https://github.com/data-apis/dataframe-api/blob/dataframe-interchange-protocol/protocol/dataframe_protocol_summary.md#to-be-decided is:
Should there be a standard from_dataframe constructor function? This isn't completely necessary, however it's expected that a full dataframe API standard will have such a function. The array API standard also has such a function, namely from_dlpack. Adding at least a recommendation on syntax for this function would make sense, e.g., from_dataframe(df, stream=None). Discussion at #29 (comment) is relevant.
In the announcement blog post draft I tentatively answered that with "yes", and added an example. The question is what the desired signature should be. The pandas prototype currently has the most basic signature one can think of:
def from_dataframe(df : DataFrameObject) -> pd.DataFrame:
    """
    Construct a pandas DataFrame from ``df`` if it supports ``__dataframe__``
    """
    if isinstance(df, pd.DataFrame):
        return df
    if not hasattr(df, '__dataframe__'):
        raise ValueError("`df` does not support __dataframe__")
    return _from_dataframe(df.__dataframe__())
The above just takes any dataframe supporting the protocol, and turns the whole thing into the "library-native" dataframe. Now of course, it's possible to add functionality to it, to extract only a subset of the data. Most obviously, named columns:
def from_dataframe(df : DataFrameObject, *, colnames : Optional[Iterable[str]]= None) -> pd.DataFrame:
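To make the colnames variant concrete, here is a minimal sketch. It does not use the pandas prototype: ToyProducer and ToyInterchangeFrame are hypothetical stand-ins, the "native" dataframe is just a dict of lists, and select_columns_by_name / get_column_by_name are assumed to be available on the object returned by __dataframe__. Raising KeyError for unknown names is also just one possible choice.

```python
from typing import Any, Dict, Iterable, List, Optional


class ToyInterchangeFrame:
    """Hypothetical stand-in for the object returned by ``__dataframe__``."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self._data = data

    def column_names(self) -> List[str]:
        return list(self._data)

    def get_column_by_name(self, name: str) -> List[Any]:
        # A real implementation would return a column object backed by
        # buffers; the toy version returns a plain list.
        return self._data[name]

    def select_columns_by_name(self, names: Iterable[str]) -> "ToyInterchangeFrame":
        # Assumed method: a new object holding only the requested columns.
        return ToyInterchangeFrame({name: self._data[name] for name in names})


class ToyProducer:
    """Hypothetical dataframe from some producer library."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self._data = data

    def __dataframe__(self) -> ToyInterchangeFrame:
        return ToyInterchangeFrame(self._data)


def from_dataframe(
    df: Any, *, colnames: Optional[Iterable[str]] = None
) -> Dict[str, List[Any]]:
    """Build a toy "native" dataframe (dict of lists) from ``df``."""
    if not hasattr(df, "__dataframe__"):
        raise ValueError("`df` does not support __dataframe__")
    xdf = df.__dataframe__()
    if colnames is not None:
        wanted = list(colnames)
        missing = [name for name in wanted if name not in xdf.column_names()]
        if missing:
            # Design choice for this sketch only: fail loudly on unknown names.
            raise KeyError(f"columns not found: {missing}")
        # Subset before converting, so unwanted columns are never copied.
        xdf = xdf.select_columns_by_name(wanted)
    return {name: list(xdf.get_column_by_name(name)) for name in xdf.column_names()}


producer = ToyProducer({"a": [1, 2], "b": [3.0, 4.0], "c": ["x", "y"]})
subset = from_dataframe(producer, colnames=["a", "c"])
```

The point of doing the selection on the interchange object, rather than after conversion, is that the consumer never has to materialize columns the caller did not ask for.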
Other things we may or may not want to support:
- columns by index
- get a subset of chunks
My personal feeling is:
- columns by index: maybe, and if we do, then with a separate keyword like col_indices=None
- a subset of chunks: probably not. This is more advanced usage, and if one needs it, it's likely one wants to get the object returned by __dataframe__ first, then inspect some metadata, and only then decide what chunks to get.
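A rough sketch of both points, again with hypothetical toy classes rather than the pandas prototype: col_indices is a separate keyword on from_dataframe, while chunk selection is left to the caller, who works directly with the object returned by __dataframe__. The method names select_columns, num_chunks and get_chunks are assumptions about what that object exposes, and to_native is a helper invented for this example.

```python
from typing import Any, Dict, Iterator, List, Optional, Sequence


class ToyInterchangeFrame:
    """Hypothetical stand-in for the object returned by ``__dataframe__``."""

    def __init__(self, data: Dict[str, List[Any]], chunk_size: int = 2) -> None:
        self._data = data
        self._chunk_size = chunk_size

    def column_names(self) -> List[str]:
        return list(self._data)

    def num_rows(self) -> int:
        return len(next(iter(self._data.values()), []))

    def num_chunks(self) -> int:
        return len(range(0, self.num_rows(), self._chunk_size))

    def select_columns(self, indices: Sequence[int]) -> "ToyInterchangeFrame":
        # Assumed method: positional column selection.
        names = self.column_names()
        picked = {names[i]: self._data[names[i]] for i in indices}
        return ToyInterchangeFrame(picked, self._chunk_size)

    def get_chunks(self) -> Iterator["ToyInterchangeFrame"]:
        # Yield consecutive row slices of at most ``chunk_size`` rows.
        for start in range(0, self.num_rows(), self._chunk_size):
            stop = start + self._chunk_size
            part = {name: col[start:stop] for name, col in self._data.items()}
            yield ToyInterchangeFrame(part, self._chunk_size)

    def to_native(self) -> Dict[str, List[Any]]:
        # Invented helper: the toy "native" dataframe is a dict of lists.
        return {name: list(col) for name, col in self._data.items()}


class ToyProducer:
    """Hypothetical dataframe from some producer library."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self._data = data

    def __dataframe__(self) -> ToyInterchangeFrame:
        return ToyInterchangeFrame(self._data)


def from_dataframe(
    df: Any, *, col_indices: Optional[Sequence[int]] = None
) -> Dict[str, List[Any]]:
    """Convert ``df`` as a whole, optionally restricted to some column positions."""
    if not hasattr(df, "__dataframe__"):
        raise ValueError("`df` does not support __dataframe__")
    xdf = df.__dataframe__()
    if col_indices is not None:
        xdf = xdf.select_columns(list(col_indices))
    return xdf.to_native()


producer = ToyProducer({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})

# Simple usage: one call, columns picked by position.
only_b = from_dataframe(producer, col_indices=[1])

# Advanced usage: inspect metadata first, then decide which chunks to convert.
xdf = producer.__dataframe__()
last = xdf.num_chunks() - 1
last_chunk = [c.to_native() for i, c in enumerate(xdf.get_chunks()) if i == last]
```

The second usage shows why a chunk keyword on from_dataframe adds little: the caller needs num_chunks (or similar metadata) before the selection can even be expressed.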
Thoughts?