Sparse columns #55

Open
@ogrisel

Description

Should a dedicated API/column metadata to efficiently support sparse columns be part of the spec?

Context

It can be the case that more than 99% of the values in a given column are null or missing (or some other repeated constant value). In that case we would waste both memory and computation unless we use a dedicated memory representation that does not explicitly materialize these repeated values.

Use cases

  • efficient computation: e.g. computing the mean and standard deviation of a sparse column in which more than 99% of the values are zero
  • efficient computation: e.g. computing the nanmean and nanstd of a sparse column in which more than 99% of the values are missing
  • some machine learning estimators treat sparse columns specially (e.g. for memory-efficient representation of one-hot encoded categorical data), but often they could (in theory) be changed to handle categorical variables using a different representation if the columns were explicitly tagged as such.
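
To make the first two use cases concrete, here is a minimal sketch of how the mean and standard deviation can be computed from the explicitly stored values alone. It is not a proposed API: the helper name `sparse_mean_std` and its parameters are hypothetical, and it assumes a column is described by its stored values, its total length, and a single fill value.

```python
import numpy as np


def sparse_mean_std(stored_values, n_total, fill_value=0.0):
    """Mean and population std of a column of length n_total whose
    non-stored entries all equal fill_value (hypothetical helper)."""
    stored_values = np.asarray(stored_values, dtype=float)
    n_fill = n_total - stored_values.size
    # The fill entries contribute n_fill * fill_value to the sum.
    mean = (stored_values.sum() + n_fill * fill_value) / n_total
    # Squared deviations: stored entries plus n_fill identical fill entries.
    sq_dev = ((stored_values - mean) ** 2).sum()
    sq_dev += n_fill * (fill_value - mean) ** 2
    return mean, np.sqrt(sq_dev / n_total)


rng = np.random.default_rng(0)
n_total = 100_000
positions = rng.choice(n_total, size=500, replace=False)  # 0.5% stored
values = rng.normal(loc=5.0, size=500)

# Dense reference columns, built only to check the sparse computation.
dense_zeros = np.zeros(n_total)
dense_zeros[positions] = values
dense_nan = np.full(n_total, np.nan)
dense_nan[positions] = values

mean, std = sparse_mean_std(values, n_total, fill_value=0.0)

# When the fill value means "missing", nanmean/nanstd only look at
# the stored values, so no pass over the full column is needed.
nan_mean, nan_std = values.mean(), values.std()
```

The work is proportional to the number of stored values (500 here) rather than to the column length (100,000).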

Limitations

  • Treating sparsity at the single-column level can be limiting. Some machine learning algorithms that leverage sparsity can only do so when considering many sparse columns together as a sparse matrix in a Compressed Sparse Row (CSR) representation (e.g. logistic regression with non-coordinate-based gradient solvers (SGD, L-BFGS...) and kernel machines (support vector machines, Gaussian processes, kernel approximation methods...)).
  • Others can leverage sparsity in a column-wise manner, typically by accepting Compressed Sparse Column (CSC) data (e.g. coordinate descent solvers for the Lasso, random forests, gradient-boosted trees...).
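
As an illustration of the difference between the two layouts, here is a small sketch using `scipy.sparse`. SciPy is an assumption made for the example; the issue does not tie the spec to any particular library. CSR stores each row's non-zero values contiguously, CSC stores each column's.

```python
import numpy as np
from scipy import sparse

dense = np.array([
    [0.0, 0.0, 3.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])

csr = sparse.csr_matrix(dense)
csc = sparse.csc_matrix(dense)

# CSR: the stored entries of row i live in data[indptr[i]:indptr[i + 1]],
# with their column positions in the matching slice of indices.
row1 = slice(csr.indptr[1], csr.indptr[2])
row1_values, row1_cols = csr.data[row1], csr.indices[row1]

# CSC: the stored entries of column j live in data[indptr[j]:indptr[j + 1]],
# with their row positions in the matching slice of indices.
col2 = slice(csc.indptr[2], csc.indptr[3])
col2_values, col2_rows = csc.data[col2], csc.indices[col2]
```

A per-column sparse representation in a dataframe maps naturally onto the CSC slices, whereas building CSR requires gathering the stored entries of all columns row by row, which is why column-level sparsity alone does not serve the row-oriented solvers.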

Survey of existing support

(incomplete, feel free to edit or comment)

Questions:

  • Should sparse data structures be allowed to represent both missingness and nullness, or only one of those? (I assume both would be useful, as pandas does with the fill_value param)
  • Should this be some kind of optional module / extension of the main dataframe API spec?
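
For reference, the pandas behaviour mentioned in the first question can be sketched as follows, using `pd.arrays.SparseArray`. The variable names are illustrative only. The `fill_value` decides which value is left implicit; anything else, including `NaN` when the fill value is zero, is stored explicitly.

```python
import numpy as np
import pandas as pd

# Zeros are implicit: only the single non-zero value is stored.
zeros_sparse = pd.arrays.SparseArray([0.0, 0.0, 1.5, 0.0], fill_value=0.0)

# Missing values are implicit: again only 1.5 is stored.
nan_sparse = pd.arrays.SparseArray(
    [np.nan, np.nan, 1.5, np.nan], fill_value=np.nan
)

# With fill_value=0.0, a NaN is not the fill value, so it is stored
# explicitly next to 1.5.
mixed = pd.arrays.SparseArray([0.0, np.nan, 1.5, 0.0], fill_value=0.0)

print(zeros_sparse.sp_values, zeros_sparse.fill_value, zeros_sparse.density)
print(nan_sparse.sp_values, nan_sparse.fill_value, nan_sparse.density)
print(len(mixed.sp_values))
```

A single array therefore compresses only one repeated value at a time, which is one way to read the question of whether the spec should allow both kinds of sparsity.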
