
How to make a future dataframe API available? #79

Closed

@rgommers
This question got asked recently by @mmccarty (and others have brought it up before), so it's worth taking a stab at an answer. Note that this is slightly speculative, given that we only have fragments of a dataframe API rather than a mostly complete syntax + semantics.

A future API, or individual design elements of it, will certainly have (a) new API surface, and (b) backwards-incompatible changes compared to what dataframe libraries already implement. So how should it be made available?

Options include:

  1. In a separate namespace, à la .array_api in NumPy/CuPy,
  2. In a separate retrievable-only namespace, à la __array_namespace__ (a minimal sketch of this option follows the list),
  3. Behind an environment variable (NumPy has done this a couple of times, for example with __array_function__ and more recently with dtype casting rules changes),
  4. With a context manager,
  5. With a from __future__ import new_behavior style import (i.e., new features enabled on a per-module basis),
  6. As an external package, which may for example monkeypatch internals (added for completeness, not preferred),

One important difference between arrays and dataframes is that for the former we only have to think about functions, while for the latter we're dealing with methods on the main dataframe objects. Hiding/unhiding methods is a little trickier of course: it can be done based on an environment variable read at import time (a rough sketch follows below), but it's more awkward to do with a context manager.
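
As a rough illustration of the import-time switch, here is a hedged sketch; the DATAFRAME_API_STANDARD variable name and the method names are invented for this example and do not correspond to anything pandas actually implements:

import os

# Hypothetical opt-in: set DATAFRAME_API_STANDARD=1 before importing the
# library to get the standard-compliant methods.
_API_STANDARD = os.environ.get("DATAFRAME_API_STANDARD", "0") == "1"

class DataFrame:
    def _unique_standard(self):
        ...  # behavior as described in the API standard

    def _unique_legacy(self):
        ...  # current behavior, kept for backwards compatibility

# The method is attached once, at import time, so the switch cannot be
# toggled per call site - which is exactly the coarseness discussed above.
DataFrame.unique = (
    DataFrame._unique_standard if _API_STANDARD else DataFrame._unique_legacy
)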

For behavior changes it's roughly the opposite: likely not all code will work with the new behavior, so granular control helps, and a context manager is probably better.

Experiences with a separate namespace for the array API standard

The short summary of this is:

  • there's a problem: we now have two array objects, and supporting both in a code base is cumbersome and requires bi-directional conversions.
  • a summary of this problem, and of the approaches taken in scikit-learn and SciPy to work around it, is given in Array API standard and NumPy compatibility (array-api#400).
  • in NumPy the preferred longer-term direction is to make the main numpy namespace converge to the array API standard; this takes time because of backwards-compatibility constraints, but it will avoid the "double namespace" problem and bring several other benefits, for example solving long-standing issues that Numba, CuPy, etc. are running into.

Therefore, using a separate namespace to implement dataframe API standard features/compatibility should likely not be the preferred solution.

Using a context manager

Pandas already has a context manager, namely pandas.option_context. It is used for existing options; see pd.describe_option(). While most options relate to display, styling and I/O, some control behavior that is quite significant and similar in kind to what we'd expect to see in a dataframe API standard. Examples:

  • mode.chained_assignment (raise, warn, or ignore)
  • mode.data_manager ("block" or "array")
  • mode.use_inf_as_null (bool)

It could be used similarly to currently available options, one option per feature:

 with pd.option_context('mode.casting_rules', 'api-standard'):
     do_stuff()

Or there could be a single option to switch to "API-compliant mode":

 with pd.option_context('mode.api_standard', True):
     do_stuff()

Or both of those together.

Question: do other dataframe libraries have a similar context manager?
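
For a library that doesn't have such a mechanism, here is a minimal sketch of what one could look like, using contextvars so the setting stays scoped to the current context (all names below are made up for illustration):

import contextlib
import contextvars

# Hypothetical flag; a ContextVar (rather than a plain global) keeps the
# setting scoped to the current context and behaves sanely with threads.
_api_standard = contextvars.ContextVar("api_standard", default=False)

@contextlib.contextmanager
def api_standard_mode(enabled=True):
    token = _api_standard.set(enabled)
    try:
        yield
    finally:
        _api_standard.reset(token)

def unique(df):
    if _api_standard.get():
        ...  # behavior described in the API standard
    else:
        ...  # current behavior

# Usage, analogous to pd.option_context:
# with api_standard_mode():
#     unique(df)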

Using a from __future__ import

It looks like it's possible to implement features behind a literal from __future__ import itself, via import hooks (see Reference 3 below). That way the spelling would be uniform across libraries, which is nice. Alternatively, a from dflib.__future__ import X is easier to implement (no import hooks needed); however, it runs into the problem also described in Ref 3: it is not desirable to propagate options to nested scopes:

from pandas.__future__ import api_standard_unique

# should use the `unique` behavior described in the API standard
df.unique()

from other_lib import do_stuff

# should NOT use the `unique` behavior described in the API standard,
# because that other library is likely not prepared for that.
do_stuff(df)

Now of course this scope propagation is also what a context manager does. However, the point of a from __future__ import, and of jumping through the hoops required to make that work (which is more esoteric than a context manager), is to gain a switch that is local to the Python module in which it is used.
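
For concreteness, one way the from dflib.__future__ import X variant could work without import hooks is for library functions to inspect the calling module's globals for the imported flag object (roughly the idea discussed in Ref 3). A self-contained toy sketch; the flag name, the helper and the returned strings are all invented, and sys._getframe is CPython-specific:

import sys

# Toy sentinel; in a real library a user would obtain this via
# `from dflib.__future__ import api_standard_unique`.
api_standard_unique = object()

def _caller_opted_in(flag, depth=2):
    # Look at the globals of the module that called the public function and
    # check whether the flag object was imported there.
    caller_globals = sys._getframe(depth).f_globals
    return any(value is flag for value in caller_globals.values())

def unique(df):
    if _caller_opted_in(api_standard_unique):
        return "standard behavior"  # placeholder
    return "legacy behavior"        # placeholder

# A module that did the import sees the new behavior; a module that didn't
# (e.g. other_lib above) keeps the legacy behavior.
print(unique(object()))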

Comparing a context manager and a from __future__ import

For new functions, methods and objects the two are pretty much equivalent, since these will only be used on purpose (the scope propagation issue above is irrelevant).

For changes to existing functions or methods, both will work too. The module-local behavior of a from __future__ import is probably preferred, because code that's imported from another library that happens to use the same functionality under the hood may not expect the different result/behavior.

For behavior changes there's an issue with the from __future__ import: the import hooks rely on AST transforms, so there must be some syntax to trigger on. With something that's very implicit, like casting rules, there is no such syntax (e.g., df_a + df_b reads identically under either set of rules, so there is nothing for a transform to hook into). So it seems there is no good way to toggle that kind of behavior at module scope.

My current impression

  • A separate namespace is not desired, and a separate dataframe object is really not desired.
  • An environment variable is easy to implement, but pretty coarse - given the fairly extensive backwards-compatibility issues that are likely, probably not good enough.
  • A context manager is nicest for behavior changes, and fine for new methods/functions.
  • The from __future__ import xxx approach is perhaps best for adopting changes to existing functions or methods; it has a configurable level of granularity and is explicit, so it should be more robust there than a context manager.

References

  1. Somewhat related discussion on dataframe namespaces: Dataframe namespaces (#23)
  2. How to expose API to downstream libraries? (array-api#16)
  3. Using __future__ style imports for module-specific features in Python: https://stackoverflow.com/questions/29905278/using-future-style-imports-for-module-specific-features-in-python (by @shoyer)
