
How to make a future dataframe API available? #79

Closed

@rgommers
This question got asked recently by @mmccarty (and others have brought it up before), so it's worth taking a stab at an answer. Note that this is slightly speculative, given that we only have fragments of a dataframe API rather than a mostly complete syntax + semantics.

A future API, or individual design elements of it, will certainly have (a) new API surface, and (b) backwards-incompatible changes compared to what dataframe libraries already implement. So how should it be made available?

Options include:

  1. In a separate namespace, à la .array_api in NumPy/CuPy,
  2. In a separate retrievable-only namespace, à la __array_namespace__ (a minimal sketch of this option follows the list),
  3. Behind an environment variable (NumPy has done this a couple of times, for example with __array_function__ and more recently with dtype casting rules changes),
  4. With a context manager,
  5. With a from __future__ import new_behavior style import (i.e., new features enabled on a per-module basis),
  6. As an external package, which may for example monkeypatch internals (added for completeness, not preferred),

One important difference between arrays and dataframes is that for the former we only have to think about functions, while for the latter we're dealing with methods on the main dataframe objects. Hiding/unhiding methods is a little trickier of course: it can be done based on an environment variable read at import time (a rough sketch follows below), but it's more awkward to do with a context manager.
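
As a rough illustration of the import-time switch, here is a hedged sketch; the DATAFRAME_API_STANDARD variable name and the method names are invented for this example and do not correspond to anything pandas actually implements:

import os

# Hypothetical opt-in: set DATAFRAME_API_STANDARD=1 before importing the
# library to get the standard-compliant methods.
_API_STANDARD = os.environ.get("DATAFRAME_API_STANDARD", "0") == "1"

class DataFrame:
    def _unique_standard(self):
        ...  # behavior as described in the API standard

    def _unique_legacy(self):
        ...  # current behavior, kept for backwards compatibility

# The method is attached once, at import time, so the switch cannot be
# toggled per call site - which is exactly the coarseness discussed above.
DataFrame.unique = (
    DataFrame._unique_standard if _API_STANDARD else DataFrame._unique_legacy
)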

For behavior changes it's roughly the opposite: likely not all code will work with the new behavior, so granular control helps, and a context manager is probably better.

Experiences with a separate namespace for the array API standard

The short summary of this is:

  • there's a problem: we now have two array objects, and supporting both in a code base is cumbersome and requires bi-directional conversions.
  • a summary of this problem, and of the approaches taken in scikit-learn and SciPy to work around it, is given in Array API standard and NumPy compatibility (array-api#400).
  • in NumPy the preferred longer-term direction is to make the main numpy namespace converge to the array API standard; this takes time because of backwards-compatibility constraints, but it will avoid the "double namespace" problem and bring several other benefits, for example solving long-standing issues that Numba, CuPy, etc. are running into.

Therefore, using a separate namespace to implement dataframe API standard features/compatibility should likely not be the preferred solution.

Using a context manager

Pandas already has a context manager, namely pandas.option_context. It is used for existing options; see pd.describe_option(). While most options relate to display, styling and I/O, some control behavior that is quite significant and similar in kind to what we'd expect to see in a dataframe API standard. Examples:

  • mode.chained_assignment (raise, warn, or ignore)
  • mode.data_manager ("block" or "array")
  • mode.use_inf_as_null (bool)

It could be used similarly to currently available options, one option per feature:

 with pd.option_context('mode.casting_rules', 'api-standard'):
     do_stuff()

Or there could be a single option to switch to "API-compliant mode":

 with pd.option_context('mode.api_standard', True):
     do_stuff()

Or both of those together.

Question: do other dataframe libraries have a similar context manager?
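
For a library that doesn't have such a mechanism, here is a minimal sketch of what one could look like, using contextvars so the setting stays scoped to the current context (all names below are made up for illustration):

import contextlib
import contextvars

# Hypothetical flag; a ContextVar (rather than a plain global) keeps the
# setting scoped to the current context and behaves sanely with threads.
_api_standard = contextvars.ContextVar("api_standard", default=False)

@contextlib.contextmanager
def api_standard_mode(enabled=True):
    token = _api_standard.set(enabled)
    try:
        yield
    finally:
        _api_standard.reset(token)

def unique(df):
    if _api_standard.get():
        ...  # behavior described in the API standard
    else:
        ...  # current behavior

# Usage, analogous to pd.option_context:
# with api_standard_mode():
#     unique(df)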

Using a from __future__ import

It looks like it's possible to implement features behind a literal from __future__ import itself, via import hooks (see Reference 3 below). That way the spelling would be uniform across libraries, which is nice. Alternatively, a from dflib.__future__ import X is easier to implement (no import hooks needed); however, it runs into the problem also described in Ref 3: it is not desirable to propagate options to nested scopes:

from pandas.__future__ import api_standard_unique

# should use the `unique` behavior described in the API standard
df.unique()

from other_lib import do_stuff

# should NOT use the `unique` behavior described in the API standard,
# because that other library is likely not prepared for that.
do_stuff(df)

Now of course this scope propagation is also what a context manager does. However, the point of a from __future__ import, and of jumping through the hoops required to make that work (which is more esoteric than a context manager), is to gain a switch that is local to the Python module in which it is used.
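
For concreteness, one way the from dflib.__future__ import X variant could work without import hooks is for library functions to inspect the calling module's globals for the imported flag object (roughly the idea discussed in Ref 3). A self-contained toy sketch; the flag name, the helper and the returned strings are all invented, and sys._getframe is CPython-specific:

import sys

# Toy sentinel; in a real library a user would obtain this via
# `from dflib.__future__ import api_standard_unique`.
api_standard_unique = object()

def _caller_opted_in(flag, depth=2):
    # Look at the globals of the module that called the public function and
    # check whether the flag object was imported there.
    caller_globals = sys._getframe(depth).f_globals
    return any(value is flag for value in caller_globals.values())

def unique(df):
    if _caller_opted_in(api_standard_unique):
        return "standard behavior"  # placeholder
    return "legacy behavior"        # placeholder

# A module that did the import sees the new behavior; a module that didn't
# (e.g. other_lib above) keeps the legacy behavior.
print(unique(object()))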

Comparing a context manager and a from __future__ import

For new functions, methods and objects the two are pretty much equivalent, since these will only be used on purpose (the scope propagation issue above is irrelevant).

For changes to existing functions or methods, both will work too. The module-local behavior of a from __future__ import is probably preferred, because code that's imported from another library that happens to use the same functionality under the hood may not expect the different result/behavior.

For behavior changes there's an issue with the from __future__ import: the import hooks rely on AST transforms, so there must be some syntax to trigger on. With something that's very implicit, like casting rules, there is no such syntax (e.g., df_a + df_b reads identically under either set of rules, so there is nothing for a transform to hook into). So it seems there is no good way to toggle that kind of behavior at module scope.

My current impression

  • A separate namespace is not desired, and a separate dataframe object is really not desired.
  • An environment variable is easy to implement, but pretty coarse - given the fairly extensive backwards-compatibility issues that are likely, probably not good enough.
  • A context manager is nicest for behavior changes, and fine for new methods/functions.
  • The from __future__ import xxx approach is perhaps best for adopting changes to existing functions or methods; it has a configurable level of granularity and is explicit, so it should be more robust there than a context manager.

References

  1. Somewhat related discussion on dataframe namespaces: Dataframe namespaces (#23)
  2. How to expose API to downstream libraries? (array-api#16)
  3. Using __future__ style imports for module-specific features in Python: https://stackoverflow.com/questions/29905278/using-future-style-imports-for-module-specific-features-in-python (by @shoyer)
