Description
In #10, we discussed that it would be convenient for the dataframe API to allow method chaining. For example:
import pandas
(pandas.read_csv('countries.csv')
       .rename(columns={'name': 'country'})
       .assign(area_km2=lambda df: df['area_m2'].astype(float) / 1_000_000)
       .query('(continent.str.lower() != "antarctica") | (population < area_km2)'))
This implies that most functionality is implemented as methods of the dataframe class. Based on pandas, the number of methods can be 300 or more, so it may be problematic to implement everything in the same namespace. pandas uses a mixed approach, with different techniques to try to organize the API.
Approaches
Top-level methods
df.sum()
df.astype()
Many of the methods are simply implemented directly as methods of dataframe.
Prefixed methods
df.to_csv()
df.to_parquet()
Some of the methods are grouped with a common prefix.
Accessors
df.str.lower()
df.dt.hour()
Accessors are a property of dataframe (or series, but assuming only one dataframe class for simplicity) that groups some methods under it.
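To make the accessor idea concrete, here is a minimal sketch in plain Python, not how pandas implements it. `Frame` and `StringMethods` are hypothetical names used only for illustration; the point is that the accessor is an ordinary property returning an object that holds the grouped methods.

```python
class StringMethods:
    """Hypothetical namespace object holding string-related methods."""

    def __init__(self, frame):
        self._frame = frame

    def lower(self):
        # Return a new Frame with every string value lower-cased.
        return Frame({name: [value.lower() for value in values]
                      for name, values in self._frame.columns.items()})


class Frame:
    """Toy stand-in for a dataframe: a dict of column name -> list of values."""

    def __init__(self, columns):
        self.columns = columns

    @property
    def str(self):
        # The accessor is just a property that returns the namespace object.
        return StringMethods(self)


df = Frame({"country": ["Spain", "JAPAN"]})
print(df.str.lower().columns)  # {'country': ['spain', 'japan']}
```

The top-level namespace only gains one name (`str`), and everything string-related lives behind it.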
Functions
pandas.wide_to_long(df)
pandas.melt(df)
In some cases, functions are used instead of methods.
Functional API
df.apply(func)
df.applymap(func)
pandas also provides a more functional API, where functions can be passed as parameters.
Standard API
I guess we will agree that a uniform and consistent API would be better for the standard. It should be easier to implement and give users a more intuitive experience.
I also think it would be good if the API could be extended easily. A couple of examples of how pandas can be extended with custom functions:
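@pandas.api.extensions.register_dataframe_accessor('my_accessor')
class MyAccessor:
    def __init__(self, pandas_obj):
        # pandas passes the dataframe to the accessor's constructor
        self._obj = pandas_obj

    def my_custom_method(self):
        return True

df.my_accessor.my_custom_method()

df.apply(my_custom_function)
df.apply(numpy.sum)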
Conceptually, I think some methods belong together not so much by topic as by the API they follow. The clearest example is reductions, and there was some discussion in #11 (comment).
I think no solution will be perfect, and the options that we have are (feel free to add to the list if I'm missing any option worth considering):
Top-level methods
df.sum()
Prefixed methods
df.reduce_sum()
Accessors
df.reduce.sum()
Functions
mod.reductions.sum(df)
mod represents the implementation module (e.g. pandas).
Functional API
df.reduce(mod.reductions.sum)
Personally, my preference is the functional API. I think it's the simplest option that keeps things organized, and the simplest to extend. The main drawback is readability, since it may be too verbose. There is the option of accepting a string instead of the function for known functions (e.g. df.reduce('sum')).
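As a rough illustration of the functional option, here is a minimal sketch in plain Python of a reduce method that accepts either a callable or a string alias. Everything in it is an assumption made for the example: `Frame` is a toy stand-in for a dataframe, `REDUCTIONS` is a hypothetical registry of known reductions, and `value_range` is a made-up user function. It is not a proposal for the actual signature.

```python
import statistics

# Hypothetical registry of built-in reductions, keyed by their string alias.
REDUCTIONS = {"sum": sum, "min": min, "max": max, "mean": statistics.mean}


class Frame:
    """Toy stand-in for a dataframe: a dict of column name -> list of values."""

    def __init__(self, columns):
        self.columns = columns

    def reduce(self, func):
        # Accept either a callable or the name of a known reduction.
        if isinstance(func, str):
            try:
                func = REDUCTIONS[func]
            except KeyError:
                raise ValueError(f"unknown reduction: {func!r}") from None
        # Apply the reduction column by column.
        return {name: func(values) for name, values in self.columns.items()}


def value_range(values):
    # A user-defined reduction: no registration step is needed.
    return max(values) - min(values)


df = Frame({"population": [10, 30, 20], "area_km2": [1.5, 2.5, 4.0]})
print(df.reduce("sum"))        # {'population': 60, 'area_km2': 8.0}
print(df.reduce(max))          # {'population': 30, 'area_km2': 4.0}
print(df.reduce(value_range))  # {'population': 20, 'area_km2': 2.5}
```

In this sketch the dataframe class exposes a single `reduce` method, and extending the set of reductions only requires writing a function.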
Thoughts? Other ideas?