
gix corpus - an extendable way to run algorithms and record their results for comparison #858

Open
@Byron

Description


Generally, it maintains information about a corpus of git repositories and writes it into a sqlite database for later data analysis.

The git repositories should be as many of the top-by-stars GitHub repositories smaller than 5GB as can be held by a disk, which came to 80K for a 4TB budget, leaving enough space for worktree checkouts as well. Be sure to also add one of the 100GB repositories by hand, for good measure.
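As a back-of-the-envelope check of the disk budget above (using only the figures stated in the text, with `avg_bytes_per_repo` as a hypothetical helper name):

```rust
// Back-of-the-envelope check of the disk budget from the text:
// 4 TB spread over 80K repositories.
const BUDGET_BYTES: u64 = 4_000_000_000_000; // 4 TB
const REPO_COUNT: u64 = 80_000;

/// Average bytes available per repository under the stated budget,
/// i.e. roughly 50 MB each, well under the 5GB per-repo cap.
pub fn avg_bytes_per_repo() -> u64 {
    BUDGET_BYTES / REPO_COUNT
}
```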

Initialization
  • record information about the corpus as seen at one point in time, with metadata like pack size, object size, and other data by which to select which repos to run on.
  • assume an append-only set of repositories, where removals are the exception that we don't care about
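A minimal sketch of what such an initialization record could look like, with hypothetical names (`RepoRecord`, `repos_within`); the actual schema lives in the sqlite database and is not specified here:

```rust
/// Hypothetical per-repository record captured at initialization time.
/// Field names are illustrative; the real schema is defined in the sqlite database.
#[derive(Debug, Clone)]
pub struct RepoRecord {
    pub path: String,
    pub pack_size_bytes: u64,
    pub object_count: u64,
}

/// Select repositories whose pack size is at most `max_pack_size` bytes,
/// mirroring the idea of choosing which repos to run on by their metadata.
pub fn repos_within(records: &[RepoRecord], max_pack_size: u64) -> Vec<RepoRecord> {
    records
        .iter()
        .filter(|r| r.pack_size_bytes <= max_pack_size)
        .cloned()
        .collect()
}
```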

Run commands

  • dry-run mode which just shows what would be run.
  • inform the user about changes in the corpus; they can then re-run the initialization (non-destructively) to update the statistics
  • offer a filter, ideally as an SQL statement, to be able to choose a subset of repositories to run commands against
  • allow choosing the set of commands to run, or running all of them
  • each command can specify if it can run in parallel with other commands of its kind or not
  • if a command-type can be run in parallel with others, the runner will perform the parallelization. The amount of threads can be configured.
  • keep information about each run along with its own version to be able to see what happened.
  • keep information about the result of each command along with timings (and maybe memory and CPU usage)
  • Each command can return a JSON Value with its own free-form information.
  • It's specifically useful for benchmarks that validate critical performance, like opening repositories or resolving packs.
  • definitely store progress messages via this method on the tree::Root
  • commands can return timings for sub-tasks that they can keep track of themselves, but that are in a format that's usable for storage in the database. This way it's possible to for instance keep track of how long it takes to create an index file from a tree, and then how long it takes to perform an operation on the index.
  • Try to use tracing to record performance data about certain operations, akin to what git does, and store these spans in the database. These spans could be taken verbatim for analysis, ignoring their tree-structure at least at the beginning.
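The runner described above could be shaped roughly as follows, as a std-only sketch with hypothetical names (`Command`, `run_all`); the real implementation would also record versions, timings, and JSON results in the database:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical description of a corpus command.
pub struct Command {
    pub name: &'static str,
    /// Whether instances of this command may run in parallel with each other.
    pub parallelizable: bool,
    /// The work itself; returns free-form result data (a JSON Value in the real design).
    pub run: fn(repo: &str) -> String,
}

/// Run one command over all repositories, honoring its parallelism flag.
/// `threads` caps the number of worker threads when parallel execution is allowed.
pub fn run_all(cmd: &Command, repos: &[String], threads: usize) -> Vec<String> {
    if !cmd.parallelizable || threads <= 1 {
        // Sequential fallback for commands that must not run in parallel.
        return repos.iter().map(|r| (cmd.run)(r)).collect();
    }
    // Shared results slots plus a shared index acting as a simple work queue.
    let results = Arc::new(Mutex::new(vec![String::new(); repos.len()]));
    let next = Arc::new(Mutex::new(0usize));
    let mut handles = Vec::new();
    for _ in 0..threads {
        let results = Arc::clone(&results);
        let next = Arc::clone(&next);
        let repos = repos.to_vec();
        let run = cmd.run;
        handles.push(thread::spawn(move || loop {
            // Claim the next unprocessed repository, if any.
            let idx = {
                let mut n = next.lock().unwrap();
                let i = *n;
                *n += 1;
                i
            };
            if idx >= repos.len() {
                break;
            }
            let out = run(&repos[idx]);
            results.lock().unwrap()[idx] = out;
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    Arc::try_unwrap(results).unwrap().into_inner().unwrap()
}
```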

Analysis

A few very simple commands to answer questions like

  • did the performance of a command get better or worse (also for a subset of all available data)?
  • correlations between certain statistical datapoints, like size of pack, size of objects, and maybe how these affect the performance values (e.g. it got slower only for smaller objects)
  • make it easy to get access to the underlying data, maybe by emitting SQL statements to do so
  • make it easy to print all information about particular runs of commands
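The better-or-worse question in the first bullet boils down to comparing timings between runs; a minimal sketch with a hypothetical `perf_change_percent` helper (the real analysis would pull both timings out of the sqlite database):

```rust
/// Relative change in percent between a baseline timing and a current timing.
/// Negative values mean the command got faster, positive values slower.
pub fn perf_change_percent(baseline_secs: f64, current_secs: f64) -> f64 {
    (current_secs - baseline_secs) / baseline_secs * 100.0
}
```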

Ingestion Implementation

Analysis Implementation

Maybe at first we can limit the corpus run to specific repos that we check by hand in the corpus.db

  • TBD

Metadata

Labels: C-tracking-issue (an issue to track the progress of multiple PRs or issues)