Description
Should a dedicated API or column metadata to efficiently support sparse columns be part of the spec?
Context
It can be the case that more than 99% of a given column's values are null or missing (or some other repeated constant value), in which case we waste both memory and computation by not using a dedicated memory representation that avoids materializing these repeated values explicitly.
Use cases
- efficient computation: e.g. computing the mean and standard deviation of a sparse column with more than 99% zeros
- efficient computation: e.g. computing the `nanmean` and `nanstd` of a sparse column where more than 99% of the values are missing
- some machine learning estimators have special treatments for sparse columns (e.g. for memory-efficient representation of one-hot encoded categorical data), but often they could (in theory) be changed to handle categorical variables using a different representation if those are explicitly tagged as such.
Limitations
- treating sparsity at the single-column level can be limiting: some machine learning algorithms can only leverage sparsity when considering many sparse columns together as a sparse matrix in a Compressed Sparse Rows (CSR) representation (e.g. logistic regression with non-coordinate-based gradient solvers (SGD, L-BFGS...) and kernel machines (support vector machines, Gaussian processes, kernel approximation methods...))
- others can leverage sparsity in a column-wise manner, typically by accepting Compressed Sparse Columns (CSC) data (e.g. coordinate descent solvers for the Lasso, random forests, gradient-boosted trees...)
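As an illustration of the column-level vs. whole-matrix distinction, pandas can already bridge the two today: a DataFrame whose columns all use the sparse dtype can be handed to scipy as a single sparse matrix in either layout. A hedged sketch (the data here is synthetic):

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Synthetic example: a DataFrame of sparse-dtype columns, ~99% zeros.
rng = np.random.default_rng(0)
data = rng.random((1000, 5))
data[data < 0.99] = 0.0
df = pd.DataFrame(data).astype(pd.SparseDtype("float64", 0.0))

# Column-wise (CSC) layout: suits coordinate descent solvers, trees...
csc = sp.csc_matrix(df.sparse.to_coo())

# Row-wise (CSR) layout: suits SGD / L-BFGS solvers, kernel machines...
csr = csc.tocsr()

assert np.allclose(csr.toarray(), data)
```

The point is that per-column sparsity metadata is enough to recover either matrix layout without ever densifying, which is what the algorithms listed above would need.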
Survey of existing support
(incomplete, feel free to edit or comment)
- pandas' columns can efficiently handle sparse data
- arrow has sparse tensors but apparently those cannot be used to represent table columns
- vaex does have some interop capabilities with scipy.sparse matrices but I am not sure whether the sparse memory layout is preserved (maybe @maartenbreddels or @JovanVeljanoski can comment)
Questions:
- Should sparse data structures be allowed to represent both missingness and nullness, or only one of those? (I assume both would be useful, as pandas does with the `fill_value` param.)
- Should this be some kind of optional module / extension of the main dataframe API spec?