Description
This is a discussion topic about adding the ability to store pydata/sparse ndarrays inside a DataFrame. It is not (yet) a proposal that we actually do this.
Background
pydata/sparse implements a sparse ndarray supporting (much of) the NumPy API. This differs from scipy.sparse matrices, which are strictly 2D and have their own API. It also differs from pandas' SparseArray, which is strictly 1D and implements the ExtensionArray interface.
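For concreteness, here is a small sketch contrasting the three containers (assuming sparse, scipy, and pandas >= 0.24 are installed):

```python
import numpy as np
import pandas as pd
import scipy.sparse
import sparse

coo = sparse.COO.from_numpy(np.eye(3))        # n-dimensional, NumPy-like API
reshaped = coo.reshape((9,))                  # ndarray-style operations work

mat = scipy.sparse.eye(3).tocsr()             # strictly 2D, its own API

arr = pd.arrays.SparseArray([0.0, 0.0, 1.0])  # strictly 1D ExtensionArray
```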
Motivation
In some workflows (especially machine learning) it's common to convert a dense 1D array to a sparse 2D array. The sparse 2D array is often very wide, and so isn't well-suited to storage in a DataFrame with SparseArray values. Each column of the 2D table needs to be stored independently, which at a minimum requires one Python object per column, and makes it difficult (but not impossible) to have these 1D arrays be views on some 2D object.
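As a minimal sketch of such a workflow (the one-hot construction below is illustrative, not part of the proposal), encoding a 1D categorical column often produces a wide 2D sparse array:

```python
import numpy as np
import sparse

codes = np.array([0, 2, 1, 2])    # dense 1D categorical codes
n_categories = 3                  # often in the thousands in practice
rows = np.arange(len(codes))

# One-hot encode into a 2D COO: one row per observation, one column
# per category, with a single nonzero entry per row.
one_hot = sparse.COO(
    coords=np.stack([rows, codes]),
    data=np.ones(len(codes)),
    shape=(len(codes), n_categories),
)
# Storing one_hot as n_categories separate 1D SparseArray columns would
# require at least one Python object per column.
```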
Since pydata/sparse implements the ndarray interface, we can in theory just store the sparse.COO array where we normally store a 2D ndarray, inside our Block. Indeed, with a minor change, this works:
```diff
diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 8deeb415c1..105c1c1a64 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -427,6 +427,7 @@ class DataFrame(NDFrame):
         dtype: Optional[Dtype] = None,
         copy: bool = False,
     ):
+        import sparse
         if data is None:
             data = {}
         if dtype is not None:
@@ -459,7 +460,7 @@ class DataFrame(NDFrame):
                 data = data.copy()
             mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
-        elif isinstance(data, (np.ndarray, Series, Index)):
+        elif isinstance(data, (np.ndarray, Series, Index, sparse.COO)):
             if data.dtype.names:
                 data_columns = list(data.dtype.names)
                 data = {k: data[k] for k in data_columns}
```
This lets us store the 2D array:
```python
In [1]: import sparse

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(sparse.zeros((10, 4)))

In [4]: df._data.blocks[0]
Out[4]: FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [5]: df._data.blocks[0].values
Out[5]: <COO: shape=(4, 10), dtype=float64, nnz=0, fill_value=0.0>
```
However, many things don't work. Notably:

- Anything calling np.asarray(arr) will raise: sparse doesn't allow implicit conversion from sparse to dense. This includes things like the DataFrame repr (see the sketch after this list).
- Anything touching Cython (groupby, join, factorize, etc.) will likely raise.
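A minimal sketch of the first failure mode, assuming current pydata/sparse behavior where implicit densification raises a RuntimeError:

```python
import numpy as np
import sparse

arr = sparse.zeros((10, 4))
try:
    np.asarray(arr)      # implicit sparse -> dense conversion is refused
except RuntimeError as exc:
    print(exc)           # densify explicitly with arr.todense() instead
```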
So this would primarily be useful for storing data, at least initially.
Arguments Against
The biggest argument against allowing this is that pandas may move to a column store in the future. In that future we won't have 2D blocks, so the value of a 2D sparse array diminishes. We may be able to use tricks similar to those planned for numpy.ndarray, where 1D columns are views on a 2D object.
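To make that trick concrete, here it is with plain NumPy (note that slicing a sparse.COO returns a new array rather than a view, which is part of why this is harder for sparse):

```python
import numpy as np

block = np.zeros((4, 10))   # one 2D block backing four columns
col = block[0]              # a 1D column as a view, no copy made
col[3] = 1.0
assert block[0, 3] == 1.0   # writes through the view mutate the block
```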
The second argument against is that we could potentially make the EA interface 2D and implement an EA-compatible wrapper around a pydata/sparse array (similar to PandasArray); a rough sketch follows.
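A minimal, deliberately incomplete sketch of what such a wrapper might look like (the names SparseExtensionDtype and SparseExtensionArray are hypothetical, and many required ExtensionArray methods are omitted):

```python
import numpy as np
import sparse
from pandas.api.extensions import ExtensionArray, ExtensionDtype


class SparseExtensionDtype(ExtensionDtype):
    # Hypothetical dtype for a pydata/sparse-backed column.
    name = "pydata_sparse"
    type = np.float64

    @classmethod
    def construct_array_type(cls):
        return SparseExtensionArray


class SparseExtensionArray(ExtensionArray):
    # Wraps a 1D sparse.COO; a 2D EA interface would generalize this.
    def __init__(self, coo):
        self._coo = coo

    @property
    def dtype(self):
        return SparseExtensionDtype()

    def __len__(self):
        return self._coo.shape[0]

    def __getitem__(self, item):
        return self._coo[item]

    # isna, take, copy, _concat_same_type, etc. omitted for brevity.
```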
Finally, we can't realistically offer "full" support for sparse-backed columns. Things like joins not working on sparse columns will cause user confusion that may be hard to document and explain.
cc @adrinjalali and @hameerabbasi, just for awareness. Most of the discussion will likely be on the pandas side of things though.