Description
This is a discussion topic about adding the ability to store pydata/sparse ndarrays inside a DataFrame. It is not (yet) a proposal that we actually do this.
Background
pydata/sparse implements a sparse ndarray supporting (much of) the NumPy API. This differs from scipy.sparse matrices, which are strictly 2D and have their own API. It also differs from pandas' SparseArray, which is strictly 1D and implements the ExtensionArray interface.
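For concreteness, here is a small sketch contrasting the three containers (assuming sparse, scipy, and pandas >= 0.24 are installed):

```python
import numpy as np
import pandas as pd
import scipy.sparse
import sparse

coo = sparse.COO.from_numpy(np.eye(3))        # n-dimensional, NumPy-like API
reshaped = coo.reshape((9,))                  # ndarray-style operations work

mat = scipy.sparse.eye(3).tocsr()             # strictly 2D, its own API

arr = pd.arrays.SparseArray([0.0, 0.0, 1.0])  # strictly 1D ExtensionArray
```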
Motivation
In some workflows (especially machine learning) it's common to convert a dense 1D array to a sparse 2D array. The sparse 2D array is often very wide, and so isn't well-suited to storage in a DataFrame with SparseArray values. Each column of the 2D table needs to be stored independently, which at a minimum requires one Python object per column, and makes it difficult (but not impossible) to have these 1D arrays be views on some 2D object.
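As a minimal sketch of such a workflow (the one-hot construction below is illustrative, not part of the proposal), encoding a 1D categorical column often produces a wide 2D sparse array:

```python
import numpy as np
import sparse

codes = np.array([0, 2, 1, 2])    # dense 1D categorical codes
n_categories = 3                  # often in the thousands in practice
rows = np.arange(len(codes))

# One-hot encode into a 2D COO: one row per observation, one column
# per category, with a single nonzero entry per row.
one_hot = sparse.COO(
    coords=np.stack([rows, codes]),
    data=np.ones(len(codes)),
    shape=(len(codes), n_categories),
)
# Storing one_hot as n_categories separate 1D SparseArray columns would
# require at least one Python object per column.
```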
Since pydata/sparse implements the ndarray interface, we can in theory just store the sparse.COO array where we normally store a 2D ndarray, inside our Block. Indeed, with a minor change, this works:
```diff
diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 8deeb415c1..105c1c1a64 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -427,6 +427,7 @@ class DataFrame(NDFrame):
         dtype: Optional[Dtype] = None,
         copy: bool = False,
     ):
+        import sparse
         if data is None:
             data = {}
         if dtype is not None:
@@ -459,7 +460,7 @@ class DataFrame(NDFrame):
                 data = data.copy()
             mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
-        elif isinstance(data, (np.ndarray, Series, Index)):
+        elif isinstance(data, (np.ndarray, Series, Index, sparse.COO)):
             if data.dtype.names:
                 data_columns = list(data.dtype.names)
                 data = {k: data[k] for k in data_columns}
```
This lets us store the 2D array:
```python
In [1]: import sparse

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(sparse.zeros((10, 4)))

In [4]: df._data.blocks[0]
Out[4]: FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [5]: df._data.blocks[0].values
Out[5]: <COO: shape=(4, 10), dtype=float64, nnz=0, fill_value=0.0>
```
However, many things don't work. Notably:

- Anything calling np.asarray(arr) will raise: sparse doesn't allow implicit conversion from sparse to dense. This includes things like the DataFrame repr (see the sketch after this list).
- Anything touching Cython (groupby, join, factorize, etc.) will likely raise.
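A minimal sketch of the first failure mode, assuming current pydata/sparse behavior where implicit densification raises a RuntimeError:

```python
import numpy as np
import sparse

arr = sparse.zeros((10, 4))
try:
    np.asarray(arr)      # implicit sparse -> dense conversion is refused
except RuntimeError as exc:
    print(exc)           # densify explicitly with arr.todense() instead
```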
So this would primarily be useful for storing data, at least initially.
Arguments Against
The biggest argument against allowing this is that pandas may move to a column store in the future. In that future we won't have 2D blocks, so the value of a 2D sparse array diminishes. We may be able to use tricks similar to those planned for numpy.ndarray, where 1D columns are views on a 2D object.
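To make that trick concrete, here it is with plain NumPy (note that slicing a sparse.COO returns a new array rather than a view, which is part of why this is harder for sparse):

```python
import numpy as np

block = np.zeros((4, 10))   # one 2D block backing four columns
col = block[0]              # a 1D column as a view, no copy made
col[3] = 1.0
assert block[0, 3] == 1.0   # writes through the view mutate the block
```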
The second argument against is that we could potentially make the EA interface 2D and implement an EA-compatible wrapper around a pydata/sparse array (similar to PandasArray); a rough sketch follows.
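A minimal, deliberately incomplete sketch of what such a wrapper might look like (the names SparseExtensionDtype and SparseExtensionArray are hypothetical, and many required ExtensionArray methods are omitted):

```python
import numpy as np
import sparse
from pandas.api.extensions import ExtensionArray, ExtensionDtype


class SparseExtensionDtype(ExtensionDtype):
    # Hypothetical dtype for a pydata/sparse-backed column.
    name = "pydata_sparse"
    type = np.float64

    @classmethod
    def construct_array_type(cls):
        return SparseExtensionArray


class SparseExtensionArray(ExtensionArray):
    # Wraps a 1D sparse.COO; a 2D EA interface would generalize this.
    def __init__(self, coo):
        self._coo = coo

    @property
    def dtype(self):
        return SparseExtensionDtype()

    def __len__(self):
        return self._coo.shape[0]

    def __getitem__(self, item):
        return self._coo[item]

    # isna, take, copy, _concat_same_type, etc. omitted for brevity.
```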
Finally, we can't realistically offer "full" support for sparse-backed columns. Things like joins not working on sparse columns will cause user confusion that may be hard to document and explain.
cc @adrinjalali and @hameerabbasi, just for awareness. Most of the discussion will likely be on the pandas side of things though.