Implement APIs for Git serialized commit graphs

The Git serialized commit graph should offer speed-up for certain types of history traversal. It was added as experimental feature in Git 2.18 and is greatly detailed in the [blog series by Derrick Stolee](https://blogs.msdn.microsoft.com/devops/2018/06/25/supercharging-the-git-commit-graph/). It would be nice if `go-git` could offer access to the feature, both at low-level (verifying, building and querying the commit graph file) and high-level (speeding up log traversals with it, if applicable).

I've started a proof-of-concept code at https://github.com/filipnavara/go-git/compare/perf-read...filipnavara:commitgraph?expand=1. However, the API is far from nice, the code is not tested (besides running it on my own repository) and it is generally messy.

Since `go-git` currently defines `Commit` objects as `struct` I was left with no other choice but to introduce a `CommitNode` interface, which is watered down version of `Commit` as present in the serialized commit graph. In ideal world there would be only one `Commit` interface and the commit graph implementation of it would lazy-load the real `Commit` objects if necessary.

Here's a rough approch of my implementation:

At the lowest level there is `commitgraph` package, which provides the `Node` and `Index` interfaces representing the data at the file level (https://github.com/git/git/blob/2d3b1c576c85b7f5db1f418907af00ab88e0c303/Documentation/technical/commit-graph-format.txt). There is implementation of the interface using random-access files / memory-mapped files (`FileIndex`), in-memory implementation (`MemoryIndex`) and an `Encoder`, which can write down the memory index into new file.

Building a memory index from all commits reachable from an existing commit could be accomplished by a code like this:
```go
func buildCommitGraph(c *object.Commit) (*commitgraph.MemoryIndex, error) {
	idx := commitgraph.NewMemoryIndex()
	seen := make(map[plumbing.Hash]bool)
	return idx, addCommitToIndex(idx, c, seen)
}

func addCommitToIndex(idx *commitgraph.MemoryIndex, c *object.Commit, seen map[plumbing.Hash]bool) error {
	if seen[c.Hash] {
		return nil
	}
	seen[c.Hash] = true

	// Recursively add parents first
	err := c.Parents().ForEach(func(parent *object.Commit) error {
		return addCommitToIndex(idx, parent, seen)
	})
	if err != nil {
		return err
	}

	// Add this commit if it hasn't been done already
	node := &commitgraph.Node{
		TreeHash:     c.TreeHash,
		ParentHashes: c.ParentHashes,
		When:         c.Committer.When,
	}
	return idx.Add(c.Hash, node)
}
```

The access to the index is exposed through `CommitGraphStorer` interface in the `storer` package that is implemented by the memory storage and filesystem storage. The file system storage implements it using memory-mapped files for reading and using the encoder for writing. [Note to myself: The memory mapped files are currently not closed with the repository!]

In the `object` package `CommitNode` and `CommitNodeIndex` interfaces are added. The `CommitNode` interface is implemented by existing `Commit` structure and also a new lightweight `graphCommitNode`.  The `CommitNodeIndex` interface provides methods for looking up commit parents and getting a full `Commit` object from `CommitNode`. Two implementations of `CommitNodeIndex` exist. The first one is `objectCommitNodeIndex`, which only uses `Commit` objects and implements the interfaces to behave exactly as if no serialized commit graph existed. Second one, `graphCommitNodeIndex`, takes the additional `commitgraph.Index` object and implements the lookup methods by trying the commit graph first and falling back to loading full `Commit` objects if the commit is not present in the commit graph file.

I added `NewCommitNodeIterCTime` iterator as a counterpart to `NewCommitIterCTime`, which operates on top of the `CommitNode` and `CommitNodeIndex` interfaces. Similar thing could be done for other `NewCommit*Iter*` methods. In fact, it is easily possible to reimplement `NewCommit*Iter*` on top of `NewCommitNode*Iter*` and switch between the lookup implementations (`graphCommitNodeIndex` / `objectCommitNodeIndex`) based on the paricular workload at hand. When full commit information, such as Message or Author is needed, then it's more useful to load the objects directly. When only summary information is needed (eg. counting distance between two commits) then the commit graph implementation can be used.

Lastly, I added the method `CommitNodeIndex` to `git.Repository`, which returns either `graphCommitNodeIndex` or `objectCommitNodeIndex` depending on the presence of the serialized commit graph. [Note to myself: It should probably consult the `gc.commitGraph` config option.]



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement APIs for Git serialized commit graphs #965

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Implement APIs for Git serialized commit graphs #965

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions