
gix corpus - an extendable way to run algorithms and record their results for comparison #858

Open
@Byron

Description


Generally, it maintains information about a corpus of git repositories and writes it into a sqlite database for later data analysis.

The git repositories should be as many of the top-by-stars GitHub repositories smaller than 5GB as can be held by a disk, which came to 80K for a 4TB budget, leaving enough space for worktree checkouts as well. Be sure to also add one of the 100GB repositories by hand, for good measure.
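As a back-of-the-envelope check of the disk budget above (using only the figures stated in the text, with `avg_bytes_per_repo` as a hypothetical helper name):

```rust
// Back-of-the-envelope check of the disk budget from the text:
// 4 TB spread over 80K repositories.
const BUDGET_BYTES: u64 = 4_000_000_000_000; // 4 TB
const REPO_COUNT: u64 = 80_000;

/// Average bytes available per repository under the stated budget,
/// i.e. roughly 50 MB each, well under the 5GB per-repo cap.
pub fn avg_bytes_per_repo() -> u64 {
    BUDGET_BYTES / REPO_COUNT
}
```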

Initialization
  • record information about the corpus as seen at one point in time, with metadata like pack size, object size, and other data by which to select which repos to run on.
  • assume an append-only set of repositories, where removals are the exception that we don't care about
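A minimal sketch of what such an initialization record could look like, with hypothetical names (`RepoRecord`, `repos_within`); the actual schema lives in the sqlite database and is not specified here:

```rust
/// Hypothetical per-repository record captured at initialization time.
/// Field names are illustrative; the real schema is defined in the sqlite database.
#[derive(Debug, Clone)]
pub struct RepoRecord {
    pub path: String,
    pub pack_size_bytes: u64,
    pub object_count: u64,
}

/// Select repositories whose pack size is at most `max_pack_size` bytes,
/// mirroring the idea of choosing which repos to run on by their metadata.
pub fn repos_within(records: &[RepoRecord], max_pack_size: u64) -> Vec<RepoRecord> {
    records
        .iter()
        .filter(|r| r.pack_size_bytes <= max_pack_size)
        .cloned()
        .collect()
}
```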

Run commands

  • dry-run mode which just shows what would be run.
  • inform the user about changes in the corpus; they can then re-run the initialization (non-destructively) to update the statistics
  • offer a filter, ideally as an SQL statement, to be able to choose a subset of repositories to run commands against
  • allow choosing the set of commands to run, or running all of them
  • each command can specify if it can run in parallel with other commands of its kind or not
  • if a command-type can be run in parallel with others, the runner will perform the parallelization. The amount of threads can be configured.
  • keep information about each run along with its own version to be able to see what happened.
  • keep information about the result of each command along with timings (and maybe memory and CPU usage)
  • Each command can return a JSON Value with its own free-form information.
  • It's specifically useful for benchmarks that validate critical performance, like opening repositories or resolving packs.
  • definitely store progress messages via this method on the tree::Root
  • commands can return timings for sub-tasks that they can keep track of themselves, but that are in a format that's usable for storage in the database. This way it's possible to for instance keep track of how long it takes to create an index file from a tree, and then how long it takes to perform an operation on the index.
  • Try to use tracing to record performance data about certain operations, akin to what git does, and store these spans in the database. These spans could be taken verbatim for analysis, ignoring their tree-structure at least at the beginning.
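The runner described above could be shaped roughly as follows, as a std-only sketch with hypothetical names (`Command`, `run_all`); the real implementation would also record versions, timings, and JSON results in the database:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical description of a corpus command.
pub struct Command {
    pub name: &'static str,
    /// Whether instances of this command may run in parallel with each other.
    pub parallelizable: bool,
    /// The work itself; returns free-form result data (a JSON Value in the real design).
    pub run: fn(repo: &str) -> String,
}

/// Run one command over all repositories, honoring its parallelism flag.
/// `threads` caps the number of worker threads when parallel execution is allowed.
pub fn run_all(cmd: &Command, repos: &[String], threads: usize) -> Vec<String> {
    if !cmd.parallelizable || threads <= 1 {
        // Sequential fallback for commands that must not run in parallel.
        return repos.iter().map(|r| (cmd.run)(r)).collect();
    }
    // Shared results slots plus a shared index acting as a simple work queue.
    let results = Arc::new(Mutex::new(vec![String::new(); repos.len()]));
    let next = Arc::new(Mutex::new(0usize));
    let mut handles = Vec::new();
    for _ in 0..threads {
        let results = Arc::clone(&results);
        let next = Arc::clone(&next);
        let repos = repos.to_vec();
        let run = cmd.run;
        handles.push(thread::spawn(move || loop {
            // Claim the next unprocessed repository, if any.
            let idx = {
                let mut n = next.lock().unwrap();
                let i = *n;
                *n += 1;
                i
            };
            if idx >= repos.len() {
                break;
            }
            let out = run(&repos[idx]);
            results.lock().unwrap()[idx] = out;
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    Arc::try_unwrap(results).unwrap().into_inner().unwrap()
}
```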

Analysis

A few very simple commands to answer questions like

  • did the performance of a command get better or worse (also for a subset of all available data)?
  • correlations between certain statistical datapoints, like size of pack, size of objects, and maybe how these affect the performance values (e.g. it got slower only for smaller objects)
  • make it easy to get access to the underlying data, maybe by emitting SQL statements to do so
  • make it easy to print all information about particular runs of commands
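The better-or-worse question in the first bullet boils down to comparing timings between runs; a minimal sketch with a hypothetical `perf_change_percent` helper (the real analysis would pull both timings out of the sqlite database):

```rust
/// Relative change in percent between a baseline timing and a current timing.
/// Negative values mean the command got faster, positive values slower.
pub fn perf_change_percent(baseline_secs: f64, current_secs: f64) -> f64 {
    (current_secs - baseline_secs) / baseline_secs * 100.0
}
```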

Ingestion Implementation

Analysis Implementation

Maybe at first we can limit the corpus run to specific repos that we check by hand in the corpus.db

  • TBD

Metadata

Labels: C-tracking-issue (an issue to track the progress of multiple PRs or issues)