Signature for a standard from_dataframe constructor function #42

Open
@rgommers

One of the "to be decided" items at https://github.com/data-apis/dataframe-api/blob/dataframe-interchange-protocol/protocol/dataframe_protocol_summary.md#to-be-decided is:

Should there be a standard from_dataframe constructor function? This isn't completely necessary, however it's expected that a full dataframe API standard will have such a function. The array API standard also has such a function, namely from_dlpack. Adding at least a recommendation on syntax for this function would make sense, e.g., from_dataframe(df, stream=None). Discussion at #29 (comment) is relevant.

In the announcement blog post draft I tentatively answered that with "yes", and added an example. The question is what the desired signature should be. The pandas prototype currently has the most basic signature one can think of:

def from_dataframe(df: DataFrameObject) -> pd.DataFrame:
    """
    Construct a pandas DataFrame from ``df`` if it supports ``__dataframe__``
    """
    if isinstance(df, pd.DataFrame):
        return df

    if not hasattr(df, '__dataframe__'):
        raise ValueError("`df` does not support __dataframe__")

    return _from_dataframe(df.__dataframe__())

The above just takes any dataframe supporting the protocol, and turns the whole thing into the "library-native" dataframe. Now of course, it's possible to add functionality to it, to extract only a subset of the data. Most obviously, named columns:

def from_dataframe(df: DataFrameObject, *, colnames: Optional[Iterable[str]] = None) -> pd.DataFrame:
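
To make the `colnames` idea concrete, here is a minimal, self-contained sketch of how such a keyword could behave. It is not the pandas prototype: `ToyFrame` and `ToyInterchangeFrame` are hypothetical stand-ins for a producer library and for the object returned by `__dataframe__`, the method names on the toy interchange object are assumptions made for illustration, and the "library-native" result is a plain dict rather than a `pd.DataFrame`.

```python
from typing import Any, Dict, Iterable, List, Optional


class ToyInterchangeFrame:
    """Hypothetical stand-in for the object returned by ``__dataframe__``."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self._columns = columns

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column_by_name(self, name: str) -> List[Any]:
        return self._columns[name]


class ToyFrame:
    """Hypothetical producer-side dataframe that supports the protocol."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self._columns = dict(columns)

    def __dataframe__(self) -> ToyInterchangeFrame:
        return ToyInterchangeFrame(self._columns)


def from_dataframe(
    df: Any, *, colnames: Optional[Iterable[str]] = None
) -> Dict[str, List[Any]]:
    """Build a 'native' frame (here: a dict of lists) from ``df``.

    Assumption: ``colnames=None`` means "take every column"; otherwise only
    the named columns are converted, in the order they were requested.
    """
    if not hasattr(df, "__dataframe__"):
        raise ValueError("`df` does not support __dataframe__")

    xdf = df.__dataframe__()
    available = xdf.column_names()
    if colnames is None:
        wanted = available
    else:
        wanted = list(colnames)
        missing = [name for name in wanted if name not in available]
        if missing:
            raise KeyError(f"columns not found: {missing}")

    # Only the requested columns are touched; the rest are never converted.
    return {name: list(xdf.get_column_by_name(name)) for name in wanted}
```

With `src = ToyFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})`, calling `from_dataframe(src, colnames=["c", "a"])` returns `{"c": [5, 6], "a": [1, 2]}`. Making `colnames` keyword-only keeps the one-argument call unchanged and leaves room for further keywords later.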

Other things we may or may not want to support:

  • columns by index
  • get a subset of chunks

My personal feeling is:

  • columns by index: maybe, and if we do then with a separate keyword like col_indices=None
  • a subset of chunks: probably not. This is more advanced usage, and if one needs it, it's likely that one wants to get the object returned by __dataframe__ first, inspect some metadata, and only then decide which chunks to get.
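
The two options above can be sketched as follows: a separate `col_indices` keyword on the constructor, and a manual chunk workflow that bypasses the constructor entirely. This is only an illustration under stated assumptions: the toy classes, the `chunk_size` parameter, the helper `take_chunks`, and the method names on the toy interchange object (`num_chunks`, `get_chunks`, and so on) are hypothetical, and the result is a dict of lists rather than a real dataframe.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional, Sequence


class ToyInterchangeFrame:
    """Hypothetical stand-in for the object returned by ``__dataframe__``."""

    def __init__(self, columns: Dict[str, List[Any]], chunk_size: int) -> None:
        self._columns = columns
        self._chunk_size = chunk_size

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column_by_name(self, name: str) -> List[Any]:
        return self._columns[name]

    def num_rows(self) -> int:
        return len(next(iter(self._columns.values()), []))

    def num_chunks(self) -> int:
        # Ceiling division; an empty frame still reports one (empty) chunk.
        return max(1, -(-self.num_rows() // self._chunk_size))

    def get_chunks(self) -> Iterator["ToyInterchangeFrame"]:
        for start in range(0, max(self.num_rows(), 1), self._chunk_size):
            stop = start + self._chunk_size
            part = {k: v[start:stop] for k, v in self._columns.items()}
            yield ToyInterchangeFrame(part, self._chunk_size)


class ToyFrame:
    """Hypothetical producer-side dataframe that supports the protocol."""

    def __init__(self, columns: Dict[str, List[Any]], chunk_size: int = 2) -> None:
        self._columns = dict(columns)
        self._chunk_size = chunk_size

    def __dataframe__(self) -> ToyInterchangeFrame:
        return ToyInterchangeFrame(self._columns, self._chunk_size)


def from_dataframe(
    df: Any, *, col_indices: Optional[Sequence[int]] = None
) -> Dict[str, List[Any]]:
    """Constructor with positional column selection as a separate keyword."""
    if not hasattr(df, "__dataframe__"):
        raise ValueError("`df` does not support __dataframe__")
    xdf = df.__dataframe__()
    names = xdf.column_names()
    if col_indices is not None:
        names = [names[i] for i in col_indices]
    return {name: list(xdf.get_column_by_name(name)) for name in names}


def take_chunks(df: Any, wanted: Iterable[int]) -> Dict[str, List[Any]]:
    """Manual workflow: get the interchange object, inspect it, pick chunks."""
    xdf = df.__dataframe__()
    n_chunks = xdf.num_chunks()  # inspect metadata before deciding
    selected = set(wanted)
    bad = sorted(i for i in selected if not 0 <= i < n_chunks)
    if bad:
        raise IndexError(f"chunk indices out of range: {bad}")

    out: Dict[str, List[Any]] = {name: [] for name in xdf.column_names()}
    for i, chunk in enumerate(xdf.get_chunks()):
        if i in selected:
            for name in chunk.column_names():
                out[name].extend(chunk.get_column_by_name(name))
    return out
```

For `src = ToyFrame({"a": [0, 1, 2, 3, 4], "b": [10, 11, 12, 13, 14]}, chunk_size=2)`, `from_dataframe(src, col_indices=[1])` returns only column `"b"`, while `take_chunks(src, [0, 2])` returns `{"a": [0, 1, 4], "b": [10, 11, 14]}`. The chunk case needs a look at `num_chunks()` before any selection can be validated, which is the reason for leaving it out of the constructor signature.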

Thoughts?
