Description
Should a dedicated API or column metadata to efficiently support sparse columns be part of the spec?
Context
It can be the case that more than 99% of a given column's values are null or missing (or some other repeated constant value), in which case we waste both memory and computation by not using a dedicated memory representation that avoids materializing these repeated values explicitly.
Use cases
- efficient computation: e.g. computing the mean and standard deviation of a sparse column with more than 99% zeros
- efficient computation: e.g. computing the `nanmean` and `nanstd` of a sparse column where more than 99% of the values are missing
- some machine learning estimators have special treatments for sparse columns (e.g. for memory-efficient representation of one-hot encoded categorical data), but often they could (in theory) be changed to handle categorical variables using a different representation if those are explicitly tagged as such.
Limitations
- treating sparsity at the single-column level can be limiting: some machine learning algorithms can only leverage sparsity when considering many sparse columns together as a sparse matrix in a Compressed Sparse Rows (CSR) representation (e.g. logistic regression with non-coordinate-based gradient solvers (SGD, L-BFGS...) and kernel machines (support vector machines, Gaussian processes, kernel approximation methods...))
- others can leverage sparsity in a column-wise manner, typically by accepting Compressed Sparse Columns (CSC) data (e.g. coordinate descent solvers for the Lasso, random forests, gradient-boosted trees...)
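As an illustration of the column-level vs. whole-matrix distinction, pandas can already bridge the two today: a DataFrame whose columns all use the sparse dtype can be handed to scipy as a single sparse matrix in either layout. A hedged sketch (the data here is synthetic):

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Synthetic example: a DataFrame of sparse-dtype columns, ~99% zeros.
rng = np.random.default_rng(0)
data = rng.random((1000, 5))
data[data < 0.99] = 0.0
df = pd.DataFrame(data).astype(pd.SparseDtype("float64", 0.0))

# Column-wise (CSC) layout: suits coordinate descent solvers, trees...
csc = sp.csc_matrix(df.sparse.to_coo())

# Row-wise (CSR) layout: suits SGD / L-BFGS solvers, kernel machines...
csr = csc.tocsr()

assert np.allclose(csr.toarray(), data)
```

The point is that per-column sparsity metadata is enough to recover either matrix layout without ever densifying, which is what the algorithms listed above would need.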
Survey of existing support
(incomplete, feel free to edit or comment)
- pandas' columns can efficiently handle sparse data
- arrow has sparse tensors but apparently those cannot be used to represent table columns
- vaex does have some interop capabilities with scipy.sparse matrices but I am not sure whether the sparse memory layout is preserved (maybe @maartenbreddels or @JovanVeljanoski can comment)
Questions:
- Should sparse data structures be allowed to represent both missingness and nullness, or only one of those? (I assume both would be useful, as pandas does with the `fill_value` param.)
- Should this be some kind of optional module / extension of the main dataframe API spec?