Dataframe namespaces #23

Open
@datapythonista

In #10, it's been discussed that it would be convenient if the dataframe API allows method chaining. For example:

import pandas

(pandas.read_csv('countries.csv')
       .rename(columns={'name': 'country'})
       .assign(area_km2=lambda df: df['area_m2'].astype(float) / 1_000_000)
       .query('(continent.str.lower() != "antarctica") | (population < area_km2)'))

This implies that most functionality is implemented as methods of the dataframe class. Based on pandas, the number of methods can be 300 or more, so it may be problematic to implement everything in the same namespace. pandas uses a mixed approach, with different techniques to try to organize the API.
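To make the chaining requirement concrete, here is a minimal sketch of why it pushes functionality into methods: each method returns a new object, so the next call can be attached directly. `ChainFrame` and its methods are hypothetical stand-ins written for illustration, not pandas or a proposed API.

```python
# Illustrative only: ChainFrame is a hypothetical stand-in, not a proposed API.
class ChainFrame:
    def __init__(self, columns):
        self.columns = dict(columns)

    def rename(self, mapping):
        # Return a new frame instead of mutating, so calls can be chained.
        return ChainFrame({mapping.get(k, k): v for k, v in self.columns.items()})

    def assign(self, **new_columns):
        # Each keyword maps a column name to a function of the frame.
        updated = dict(self.columns)
        for name, func in new_columns.items():
            updated[name] = func(ChainFrame(updated))
        return ChainFrame(updated)


result = (ChainFrame({"name": ["A"], "area_m2": [2_000_000]})
          .rename({"name": "country"})
          .assign(area_km2=lambda f: [a / 1_000_000 for a in f.columns["area_m2"]]))
```

Every operation that should take part in a chain has to be reachable from the object itself, which is where the namespace pressure comes from.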

Approaches

Top-level methods

df.sum()
df.astype()

Many of the methods are simply implemented directly as methods of dataframe.

Prefixed methods

df.to_csv()
df.to_parquet()

Some of the methods are grouped with a common prefix.

Accessors

df.str.lower()
df.dt.hour

An accessor is a property of the dataframe (or series, but assuming only one dataframe class for simplicity) that groups related methods under it.
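As a rough sketch of the mechanism, an accessor can be modelled as a property that returns a small namespace object bound to the frame. `ToyFrame` and `StringMethods` are hypothetical names used only for illustration; this is not how pandas implements its accessors.

```python
# Toy sketch: none of these classes are pandas or standard API.
class StringMethods:
    """Namespace object holding string-related methods."""

    def __init__(self, frame):
        self._frame = frame

    def lower(self):
        # Return a new frame with every string value lower-cased.
        return ToyFrame({
            name: [v.lower() if isinstance(v, str) else v for v in values]
            for name, values in self._frame.columns.items()
        })


class ToyFrame:
    def __init__(self, columns):
        self.columns = dict(columns)

    @property
    def str(self):
        # The accessor is created on attribute access and bound to this frame.
        return StringMethods(self)


frame = ToyFrame({"continent": ["Europe", "ASIA"], "population": [10, 20]})
lowered = frame.str.lower()
```

The frame itself only gains one attribute (`str`), while the methods live in a separate class.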

Functions

pandas.wide_to_long(df)
pandas.melt(df)

In some cases, functions are used instead of methods.

Functional API

df.apply(func)
df.applymap(func)

pandas also provides a more functional API, where functions can be passed as parameters.

Standard API

I guess we will agree that a uniform and consistent API would be better for the standard. That should make things easier to implement, and also give users a more intuitive experience.

Also, I think it would be good if the API could be extended easily. A couple of examples of how pandas can be extended with custom functions:

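import numpy
import pandas as pd

@pd.api.extensions.register_dataframe_accessor('my_accessor')
class MyAccessor:
    def __init__(self, df):
        # pandas passes the dataframe to the accessor when it is accessed
        self._df = df

    def my_custom_method(self):
        return True

df.my_accessor.my_custom_method()
df.apply(my_custom_function)  # my_custom_function: any user-defined function
df.apply(numpy.sum)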

Conceptually, I think some methods should be grouped together not so much by topic as by the API they follow. The clearest example is reductions, and there was some discussion in #11 (comment).

I think no solution will be perfect, and the options that we have are (feel free to add to the list if I'm missing any option worth considering):

Top-level methods

df.sum()

Prefixed methods

df.reduce_sum()

Accessors

df.reduce.sum()

Functions

mod.reductions.sum(df)

Here mod represents the implementation module (e.g. pandas).

Functional API

df.reduce(mod.reductions.sum)

Personally, my preference is the functional API. I think it's the simplest option that keeps things organized, and the simplest to extend. The main drawback is readability: it may be too verbose. There is the option to allow using a string instead of the function for known functions (e.g. df.reduce('sum')).
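A minimal sketch of how such a functional entry point could look, including the string shortcut for known functions. Everything here is an assumption made for illustration: `ReducibleFrame`, its `reduce` method and the `REDUCTIONS` registry are hypothetical names, and reductions are applied column by column.

```python
# Hypothetical sketch: ReducibleFrame, reduce and REDUCTIONS are illustrative names.
import statistics

# Registry of known reductions, so they can also be requested by name.
REDUCTIONS = {
    "sum": sum,
    "min": min,
    "max": max,
    "mean": statistics.mean,
}


class ReducibleFrame:
    def __init__(self, columns):
        self.columns = dict(columns)

    def reduce(self, func):
        # Accept either a callable or the name of a registered reduction.
        if isinstance(func, str):
            try:
                func = REDUCTIONS[func]
            except KeyError:
                raise ValueError(f"unknown reduction: {func!r}") from None
        # Apply the reduction to each column independently.
        return {name: func(values) for name, values in self.columns.items()}


frame = ReducibleFrame({"population": [10, 20, 30], "area_km2": [1.5, 2.5, 3.0]})
by_function = frame.reduce(sum)
by_name = frame.reduce("sum")
custom = frame.reduce(lambda values: max(values) - min(values))
```

In this sketch a user-defined reduction needs no registration at all, which is the extensibility argument; the registry only exists to shorten calls for the common cases.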

Thoughts? Other ideas?
