Support pydata/sparse arrays in DataFrame #33182

Open
@TomAugspurger

Description

This is a discussion topic for adding the ability to store pydata/sparse ndarrays inside a DataFrame. It's not proposing that we actually do this at this point.

Background

sparse provides a sparse ndarray that implements (much of) the NumPy API. This differs from scipy.sparse matrices, which are strictly 2D and have their own API. It also differs from pandas' SparseArray, which is strictly 1D and implements the ExtensionArray interface.
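To make the distinction concrete, here is a small sketch of the two array types pandas can already talk to (the 1D SparseArray and the 2D scipy.sparse matrix); pydata/sparse's COO, which is n-dimensional and NumPy-like, is described in comments rather than imported, since it may not be installed:

```python
import numpy as np
import pandas as pd
from scipy import sparse as sp

# pandas' SparseArray: strictly 1D, implements the ExtensionArray interface.
arr = pd.arrays.SparseArray([0, 0, 1, 0], fill_value=0)
assert arr.ndim == 1

# scipy.sparse: strictly 2D, with its own (non-NumPy) API.
mat = sp.csr_matrix(np.eye(3))
assert mat.shape == (3, 3)

# pydata/sparse (not imported here) provides sparse.COO, an n-dimensional
# array implementing much of the NumPy API, e.g. sparse.COO.from_numpy(x).
```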

Motivation

In some workflows (especially machine learning) it's common to convert a dense 1D array to a sparse 2D array. The sparse 2D array is often very wide, and so isn't well-suited to storage in a DataFrame with SparseArray values. Each column of the 2D table needs to be stored independently, which at a minimum requires one Python object per column and makes it difficult (but not impossible) to have these 1D arrays be views on some 2D object.
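The per-column overhead is visible today with one-hot encoding, one of the workflows in question; this sketch uses pd.get_dummies with sparse=True, which produces one independent 1D SparseArray per column rather than a single 2D sparse backing store:

```python
import pandas as pd

# One-hot encoding a 1D column yields a wide, mostly-zero 2D table.
s = pd.Series(["a", "b", "c", "a"])
dummies = pd.get_dummies(s, sparse=True)

# Every column is stored as an independent 1D SparseArray: at minimum
# one Python object per column, with no shared 2D backing store.
assert all(isinstance(dt, pd.SparseDtype) for dt in dummies.dtypes)
assert dummies.shape == (4, 3)
```

With thousands of dummy columns, those per-column objects are exactly the overhead a single 2D sparse block would avoid.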

Since sparse implements the ndarray interface, we can in theory just store the sparse.COO array where we normally store a 2D ndarray, inside a Block. Indeed, with a minor change, this works:

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 8deeb415c1..105c1c1a64 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -427,6 +427,7 @@ class DataFrame(NDFrame):
         dtype: Optional[Dtype] = None,
         copy: bool = False,
     ):
+        import sparse
         if data is None:
             data = {}
         if dtype is not None:
@@ -459,7 +460,7 @@ class DataFrame(NDFrame):
                     data = data.copy()
                 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
 
-        elif isinstance(data, (np.ndarray, Series, Index)):
+        elif isinstance(data, (np.ndarray, Series, Index, sparse.COO)):
             if data.dtype.names:
                 data_columns = list(data.dtype.names)
                 data = {k: data[k] for k in data_columns}

This lets us store the 2D array:

In [1]: import sparse

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(sparse.zeros((10, 4)))

In [4]: df._data.blocks[0]
Out[4]: FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [5]: df._data.blocks[0].values
Out[5]: <COO: shape=(4, 10), dtype=float64, nnz=0, fill_value=0.0>

However, many things don't work. Notably

  1. Anything calling np.asarray(arr) will raise, since sparse doesn't allow implicit conversion from sparse to dense. This includes things like the DataFrame repr.
  2. Anything touching Cython (groupby, join, factorize, etc.) will likely raise.

So this would primarily be useful for storing data, at least initially.
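For comparison, the storage-only use case is partly served today by the scipy.sparse round-trip that pandas already supports, at the cost of one SparseArray per column; a minimal sketch:

```python
import numpy as np
import pandas as pd
from scipy import sparse as sp

# Build a random 2D scipy.sparse matrix and load it into a DataFrame.
mat = sp.random(10, 4, density=0.25, format="csr", random_state=0)
df = pd.DataFrame.sparse.from_spmatrix(mat)

# Each of the 4 columns is a separate 1D SparseArray; the .sparse
# accessor can convert the frame back to a COO-format scipy matrix.
back = df.sparse.to_coo()
assert df.shape == (10, 4)
assert np.allclose(mat.toarray(), back.toarray())
```

A 2D sparse block would store the same data as a single object instead of four.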

Arguments Against

The biggest argument against allowing this is that pandas is potentially moving to a column store in the future. In that future we don't have 2D blocks, so the value of a 2D sparse array diminishes. We may be able to use the same tricks we will use for numpy.ndarray, where 1D columns are views on a 2D object.

The second argument against is that we could potentially make the EA interface 2D, and implement an EA compatible wrapper around a pydata/sparse array (similar to PandasArray).

Finally, we can't really hope to offer "full" support for sparse-backed columns. Things like joining not working on sparse columns will cause user confusion that may be hard to document and explain.

cc @adrinjalali and @hameerabbasi, just for awareness. Most of the discussion will likely be on the pandas side of things though.

Metadata

    Labels

    Closing Candidate (may be closeable, needs more eyeballs), Enhancement, ExtensionArray (extending pandas with custom dtypes or arrays), Internals (related to non-user-accessible pandas implementation), Needs Discussion (requires discussion from core team before further action), Sparse (sparse data type)
