Commit a807dd1

store all monthly reports in the repository as well

That way they are much more a part of `gitoxide`, and don't risk getting lost when forges come and go.

1 parent 15f1cf7 commit a807dd1

40 files changed: +2891 -0

etc/monthlies/2021/august.md

Lines changed: 73 additions & 0 deletions
### git-ref - welcome to the family
Last month's rabbit hole can finally be waved goodbye now that packed-ref writing is implemented. It's interesting to realize that `packed-refs` are really only good for traversing large amounts of references quickly. When deleting refs, `packed-refs` have to be handled and potentially rewritten if a ref-to-be-deleted is indeed present in that file.
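As a sketch of why deletions must consider `packed-refs`: if the victim ref is listed in the packed file, that file has to be rewritten without it, or the deleted name would "resurrect" from there. The helper below is hypothetical and not the actual `git-ref` API.

```rust
// Sketch (not the actual `git-ref` API): rewrite the `packed-refs`
// content without the ref that is being deleted. The boolean tells the
// caller whether the file on disk actually needs to be rewritten.
fn remove_from_packed_refs(packed: &str, ref_name: &str) -> (String, bool) {
    let mut rewritten = String::new();
    let mut found = false;
    for line in packed.lines() {
        // Data lines look like "<hex-id> <full-ref-name>"; '#' starts the header.
        let is_victim = line
            .split_once(' ')
            .map_or(false, |(_, name)| name == ref_name);
        if is_victim {
            found = true; // drop this line from the rewritten file
        } else {
            rewritten.push_str(line);
            rewritten.push('\n');
        }
    }
    (rewritten, found)
}

fn main() {
    let packed = "# pack-refs with: peeled fully-peeled sorted\n\
                  11aa11aa11aa11aa11aa11aa11aa11aa11aa11aa refs/heads/main\n\
                  22bb22bb22bb22bb22bb22bb22bb22bb22bb22bb refs/tags/v1.0\n";
    let (new_packed, needs_rewrite) = remove_from_packed_refs(packed, "refs/tags/v1.0");
    assert!(needs_rewrite);
    assert!(!new_packed.contains("refs/tags/v1.0"));
}
```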
Updating `packed-refs` is notably different from the canonical git implementation, as it is done as part of the standard transaction system. This brings the benefit of _pseudo_ transactions and rollbacks on error for free. The way this works is a special flag that can be set on the transaction to enable write-through to `packed-refs` and optionally trigger the deletion of the original loose ref. The latter is essentially a `git pack-refs --prune`, but with added safety in case of concurrent processes.
Another new feature is the addition of namespaces, which can be activated in transactions to move all changes into their own, non-overlapping namespace. This is useful for servers that keep namespaced references, allowing a single repository to serve all of its forks. References can have their name prefixed with a namespace, or have the namespace stripped from their name. Namespaces can naturally be used as a prefix when iterating references, and generally need awareness and specific support from the application using them, making them a very explicit part of the crate's API surface.
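A minimal sketch of the namespace idea, using git's `refs/namespaces/<name>/` layout; the helper names below are made up for illustration and are not the crate's API:

```rust
// Sketch: a namespace rewrites full ref names into
// `refs/namespaces/<name>/…` on write, and strips that prefix again
// when presenting refs to the application.
fn namespaced(namespace: &str, full_ref_name: &str) -> String {
    format!("refs/namespaces/{}/{}", namespace, full_ref_name)
}

fn strip_namespace(namespace: &str, namespaced_ref: &str) -> Option<String> {
    let prefix = format!("refs/namespaces/{}/", namespace);
    namespaced_ref.strip_prefix(&prefix).map(ToOwned::to_owned)
}

fn main() {
    let stored = namespaced("fork-42", "refs/heads/main");
    assert_eq!(stored, "refs/namespaces/fork-42/refs/heads/main");
    // Refs from other namespaces are invisible after stripping.
    assert_eq!(strip_namespace("fork-42", &stored).as_deref(), Some("refs/heads/main"));
    assert_eq!(strip_namespace("other", &stored), None);
}
```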
### cargo smart-release - taking back control
With `gitoxide` now spanning more than 20 published and interconnected crates, the trusty `cargo release` started to be far less helpful due to its one-crate-at-a-time mentality. It couldn't help with multi-crate releases, especially their ordering, at all, so I had to track the correct release order manually and pray for mercy each time I ran `cargo release`. Of course, mistakes would always happen, and a release of all relevant crates could easily take 90 painful minutes. Way too much to put up with, so I set my sights on 'writing this quick hack in a day' that would handle publishes for me.
It would take three days and a few hours here and there to finally have a tool that seems to do the job exactly how I want it. Here are some highlights:
* publish one or more crates, and deal with dependencies automatically
* deal with dev-dependencies, which caused me a lot of trouble (hint: the problem is easily solved by not specifying a version number for them)
* dry-run by default, providing an overview of what would be done
* automatic version bumping, but only if actually needed after consulting the crates.io index, along with automatic manifest updates of dependent crates
* it's very fast and doesn't impose wait times
* it's using the `git-repository` crate
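The heart of such a tool is publishing crates in dependency order. A hypothetical sketch of that ordering step, assuming a cycle-free workspace dependency graph (names and structure are illustrative, not `cargo smart-release` internals):

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical sketch of the core problem: given a map of
// crate -> in-workspace dependencies, compute a publish order in which
// every crate is released only after all of its dependencies.
// Assumes the graph has no cycles, or the loop would never terminate.
fn publish_order(deps: &BTreeMap<&str, Vec<&str>>) -> Vec<String> {
    let mut done: BTreeSet<&str> = BTreeSet::new();
    let mut order = Vec::new();
    while done.len() < deps.len() {
        for (krate, needs) in deps {
            if !done.contains(krate) && needs.iter().all(|d| done.contains(d)) {
                done.insert(krate);
                order.push(krate.to_string());
            }
        }
    }
    order
}

fn main() {
    let mut deps = BTreeMap::new();
    deps.insert("git-repository", vec!["git-ref", "git-odb"]);
    deps.insert("git-ref", vec!["git-actor"]);
    deps.insert("git-odb", vec![]);
    deps.insert("git-actor", vec![]);
    let order = publish_order(&deps);
    let pos = |name: &str| order.iter().position(|c| c == name).unwrap();
    assert!(pos("git-ref") < pos("git-repository"));
    assert!(pos("git-odb") < pos("git-repository"));
    assert!(pos("git-actor") < pos("git-ref"));
}
```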
Now releases seem to be a solved problem, and as a positive side-effect `cargo release` also learned a few new tricks.
### git-repository - the final push
This crate is destined to provide the standard abstractions suitable for any kind of application or library, no matter whether it's single-threaded or multi-threaded, a one-off CLI program or a long-running server application. The performance is as good as, or not much worse than, hand-rolled code using the plumbing crates, while providing a convenient, high-level API similar to what's offered by the `git2` crate. Of course, you pay the performance-for-convenience penalty only when needed.
None of the above was fleshed out, until now.
`git_repository::Repository` is now relegated to the plumbing API: a slightly more convenient way of accessing references and objects, but actually working with it requires a lot of knowledge and will feel cumbersome, especially in comparison to `git2`. In exchange it might give that extra percent of performance or control that some applications need. The majority of applications, even high-performance servers, will use the `Easy` API instead. It comes in various flavours to cater to one-shot CLI applications on the one hand, and to long-running multi-threaded servers on the other.
It works by segmenting all `git` data into three groups: data that is written rarely and consumes system resources, data that is read often, and data that is written often. The last group always exists once per thread and is best described as 'caches and acceleration structures', as well as memory to back objects. The first may be shared across threads, and depending on the `Easy*` flavour in use, one may or may not be able to acquire a mutable reference for altering it.
This system is needed to allow multiple objects to be _infused_ with access to the repository and occasionally change shared memory, all while passing the borrow checker. You guessed it: we are using interior mutability, with `RefCell`s or `RwLock`s in various configurations, to achieve exactly the performance trade-off that certain applications need to perform best.
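The split can be illustrated with a toy sketch. None of these types exist in `git-repository`; the code only mirrors the structure described above: rarely-written shared state behind an `Arc`, and per-handle caches behind a `RefCell` so `&self` methods can still mutate them.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;

// Illustrative stand-in for rarely-written, expensive shared state
// such as loaded packs and indices.
struct SharedState {
    object_store: HashMap<u64, Vec<u8>>,
}

// Illustrative stand-in for an `Easy`-like handle: shared state plus
// per-handle caches that need no locking at all.
struct Easy {
    shared: Arc<SharedState>,
    cache: RefCell<HashMap<u64, Vec<u8>>>,
}

impl Easy {
    fn find_object(&self, id: u64) -> Option<Vec<u8>> {
        if let Some(data) = self.cache.borrow().get(&id) {
            return Some(data.clone()); // cache hit, no shared access needed
        }
        let data = self.shared.object_store.get(&id)?.clone();
        self.cache.borrow_mut().insert(id, data.clone());
        Some(data)
    }
}

fn main() {
    let shared = Arc::new(SharedState {
        object_store: HashMap::from([(1, b"tree".to_vec())]),
    });
    let easy = Easy { shared: Arc::clone(&shared), cache: RefCell::new(HashMap::new()) };
    assert_eq!(easy.find_object(1).as_deref(), Some(b"tree".as_ref()));
    assert_eq!(easy.find_object(2), None);
    assert_eq!(easy.cache.borrow().len(), 1); // the successful lookup was cached
}
```

A multi-threaded flavour would swap the `RefCell` for an `RwLock` in the same position, which is exactly the "various configurations" trade-off mentioned above.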
One major shortcoming is that, due to the lack of generic associated types, there currently can't be support for long-running single-threaded applications that are OK with using `Rc<RefCell<_>>` for shared mutable access. These will, for now, have to use the multi-threaded handle called `EasyArcExclusive`.
`cargo smart-release` is already using it to great benefit, and we will keep dog-fooding the API in all of our current and future tooling.
### Gitoxide opens up
Now that `gitoxide`'s `git-protocol` crate is consumed by `radicle-link`, it became clear that the way of working thus far is lacking in many ways. Namely, it lacks transparency for downstream consumers, which also makes development hard to impossible for them to influence. In other words, you won't know about a breaking change until it hits you, maybe even with a version number that suggests no breaking change at all.
`gitoxide` was already using a [project board](https://github.com/Byron/gitoxide/projects/1) to show what's going on and what's planned, as well as issues to track overall work done, but it was hard to see which code actually implemented said features or the steps along the way.
To remedy this, we will now use PRs for planned breaking changes or for greater features that everyone is invited to influence. Of course, we will do our best to keep branches short-lived.
A new _collaborating_ guide was added to outline this workflow, and a stability guide is planned to define how versions are used and which stability guarantees can or should be made.
### Pack Generation - counting objects fast isn't as easy as it seems :/
Last month I left the topic thinking `gitoxide` could be faster than git when counting, but after re-measuring the linux pack performance it feels more like it's always slower, with multiple threads not even beating a single thread of canonical git, even with the one optimization I already had in mind implemented. Thus the counting phase isn't implemented as an iterator anymore, but uses simple 'scoped-thread' based parallelism, while allowing an optimization for the single-threaded case with just a little more boilerplate.
Unfortunately it's absolutely ineffective, with the overwhelming majority of the time spent on decoding packed objects during tree traversal. The algorithm already uses a skip-list to never process objects twice, but that's not enough. It still has to decode a lot of highly deltified packed objects, and the current caches for delta objects aren't very effective in this case: not only do they take up some runtime themselves, they also aren't hit often enough. There should probably be statistics on that, but even knowing the hit rates, there isn't much one can do if the cache is too small and thrashed too much.
The only two caches I have played with are a fixed-size LRU cache based on `uluru` and a dynamically sized hashmap LRU with a memory cap. Once such a cache is full, all it's going to do is thrash memory like there is no tomorrow - a lot of memory is copied and allocated, then dropped and deallocated, probably 60 thousand times per second per core. When profiling the allocations I was quite surprised to see that they quickly went into the tens of millions, with zlib allocations taking the crown and cache thrashing following right after. A test of the statically sized `uluru` with a free-list effectively removed the cache thrashing, but wasn't any faster either.
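To illustrate the memory-capped variant and where the churn comes from, here is a toy sketch; the structure is assumed for illustration and is not the actual cache implementation:

```rust
use std::collections::VecDeque;

// Sketch of a memory-capped LRU for decompressed objects: once the
// byte budget is exceeded, least-recently-used entries are dropped.
// Every eviction frees a buffer and every insert allocates one, which
// is exactly the allocate/deallocate churn described above.
struct MemoryCappedCache {
    entries: VecDeque<(u64, Vec<u8>)>, // front = most recently used
    used_bytes: usize,
    cap_bytes: usize,
}

impl MemoryCappedCache {
    fn new(cap_bytes: usize) -> Self {
        Self { entries: VecDeque::new(), used_bytes: 0, cap_bytes }
    }

    fn get(&mut self, id: u64) -> Option<&[u8]> {
        let pos = self.entries.iter().position(|(k, _)| *k == id)?;
        let entry = self.entries.remove(pos).expect("position is valid");
        self.entries.push_front(entry); // bump to most recently used
        self.entries.front().map(|(_, v)| v.as_slice())
    }

    fn put(&mut self, id: u64, data: Vec<u8>) {
        self.used_bytes += data.len();
        self.entries.push_front((id, data));
        while self.used_bytes > self.cap_bytes {
            if let Some((_, evicted)) = self.entries.pop_back() {
                self.used_bytes -= evicted.len(); // deallocation happens here
            } else {
                break;
            }
        }
    }
}

fn main() {
    let mut cache = MemoryCappedCache::new(8);
    cache.put(1, vec![0; 4]);
    cache.put(2, vec![0; 4]);
    assert!(cache.get(1).is_some()); // 1 is now most recently used
    cache.put(3, vec![0; 4]);        // over budget: evicts 2, the LRU entry
    assert!(cache.get(2).is_none());
    assert!(cache.get(1).is_some());
}
```

A free-list, as tried with `uluru`, would reuse the evicted buffers instead of dropping them, trading the deallocation cost for kept memory.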
The best performance improvement could be witnessed with a 400MB dynamic cache, which brought single-threaded performance down to about 140s for 8 million objects - but git does it in 40s! Maybe a free-list for this cache and more careful tuning can help, but I feel I am running out of options :/.
Besides performance, the `gixp pack-create` code was cleaned up and improved thanks to the 'counting objects' refactoring, and will soon benefit from `git-repository::Easy` as well.
### The bigger picture
From reading the above it's hard to see that we are still in the 'make git-fetch work' block. Now that `git-ref` is fully implemented to the extent needed, one can properly implement fetches that alter references after having received objects. `radicle-link` already implements fetches, and this month's side-tracking was due to the desire to have its tests be supported by `gitoxide` instead of `git2`, so a lot of work on `git-repository` was finally due. On the bright side, this means that `git-repository` is now reasonably well defined, allowing more and more features to be implemented at a higher level, which in turn should better support existing applications as well as test setups.
In the coming month we will venture into the server side and build something akin to `upload-pack`, a first step towards having a git server.
Cheers,
Sebastian
PS: The latest timesheets can be found [here](https://github.com/Byron/byron/blob/main/timesheets/2021.csv).

etc/monthlies/2021/december.md

Lines changed: 58 additions & 0 deletions
### The new object database
After the design sketch for the new object database turned all lights to green, I managed to move forward and develop the sketch into a proof of concept. Starting out with minimal tests, it quickly became functional enough to let it run on its own in the `object-access` benchmark program. There it performed so admirably that it will become the only object database implementation from this point forward. Here are its features:
- entirely lazy, creating an instance does no disk IO at all if [`Slots::Given`][store::init::Slots::Given] is used.
- multi-threaded lazy-loading of indices and packs
- per-thread pack and object caching to avoid cache thrashing.
- most-recently-used packs are always tried first, for speedups if objects are stored in the same pack, typical for packs organized by commit graph and object age.
- lock-free reading for perfect scaling across all cores; changes to it don't affect readers as long as these don't enter the same branch.
- syncs with the state on disk if objects aren't found, to catch up with changes if an object seems to be missing.
- this behaviour can be turned off for all handles if objects are expected to be missing due to sparse checkouts.
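The 'sync with the state on disk on miss' behaviour can be sketched as follows; this is a toy model of the pattern, not the actual `git-odb` types:

```rust
// Sketch of "refresh on miss": if an object (here: a pack) isn't found
// in the loaded state, re-scan the on-disk state once and retry, so
// packs created by other processes are picked up lazily.
struct Odb {
    loaded_packs: Vec<&'static str>,
    on_disk_packs: Vec<&'static str>, // stand-in for what a re-scan would find
    refresh_on_miss: bool,
}

impl Odb {
    fn contains(&mut self, pack: &str) -> bool {
        if self.loaded_packs.iter().any(|p| *p == pack) {
            return true;
        }
        if self.refresh_on_miss {
            // catch up with changes on disk, then look again
            self.loaded_packs = self.on_disk_packs.clone();
            return self.loaded_packs.iter().any(|p| *p == pack);
        }
        false // misses are accepted as final, e.g. for sparse setups
    }
}

fn main() {
    let mut odb = Odb {
        loaded_packs: vec!["pack-a"],
        on_disk_packs: vec!["pack-a", "pack-b"],
        refresh_on_miss: true,
    };
    assert!(odb.contains("pack-b")); // found after the refresh
    odb.refresh_on_miss = false;
    odb.on_disk_packs.push("pack-c");
    assert!(!odb.contains("pack-c")); // refresh disabled: miss is final
}
```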
As of now the previous implementations are deprecated and will eventually be removed from the `git-odb` crate; `git-repository` already uses the new one.
### Multi-pack index support
The new ODB already expects multi-pack indices and was merely waiting for an implementation of the multi-pack index file format. This implementation is now available and provides all the usual access methods. Integrating it into the new ODB will start as soon as possible, and I don't expect any problems with that.
Once this work is completed, `gitoxide` will finally have a production-ready object database with all the features one would expect, which will also serve as a good foundation for adding support for additional file formats, like reachability bitmaps and reverse indices, to speed up pack building.
### Community Outreach part 2
Now that "Learning Rust with Gitoxide" is completed, we moved on to season two's programming with "Getting into Gitoxide", a format to show how to use `git-repository` and how to extend it.
Here is the playlist link: https://youtube.com/playlist?list=PLMHbQxe1e9MkEmuj9csczEK1O06l0Npy5
Sidney and I use it not only to improve the existing API surface, but also to show off the latest improvements or undertakings that are part of the usual `gitoxide` development. It also serves as preparation to help Sidney get started on `git-repository` himself.
### `git-repository` is getting better
After the last rounds of refactoring it feels like it's coming together. Previously there were quite a few shortcomings and inconveniences, but all of them have been removed by now. All this was possible due to learnings from `git-ref` and `git-odb`, which allowed a lot of complexity to move down into the plumbing crates, where it belongs.
`git-ref` and `git-odb` are now expected to provide a usable experience all by themselves, which in turn led to great simplifications on the side of `git-repository`.
### SHA256 support now has a tracking issue
In [#281](https://github.com/Byron/gitoxide/issues/281) one can now track the steps needed to get SHA256 support. In the past days I took the time to give it a push (and made the necessary breaking changes) to parameterize most of the code-base over the hash kind.
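What 'parameterizing over the hash kind' boils down to can be sketched with a simple enum; this is illustrative only and not the crate's actual types:

```rust
// Sketch: instead of hard-coding 20-byte SHA-1 ids everywhere, every
// object id carries (or implies) its hash kind, so lengths and parsing
// are derived from it.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum HashKind {
    Sha1,
    Sha256,
}

impl HashKind {
    fn len_in_bytes(&self) -> usize {
        match self {
            HashKind::Sha1 => 20,
            HashKind::Sha256 => 32,
        }
    }
}

// The hash kind of a hex digest can be inferred from its length.
fn kind_of_hex(hex: &str) -> Option<HashKind> {
    match hex.len() {
        40 => Some(HashKind::Sha1),
        64 => Some(HashKind::Sha256),
        _ => None,
    }
}

fn main() {
    assert_eq!(HashKind::Sha1.len_in_bytes(), 20);
    assert_eq!(HashKind::Sha256.len_in_bytes(), 32);
    assert_eq!(kind_of_hex(&"a".repeat(40)), Some(HashKind::Sha1));
    assert_eq!(kind_of_hex(&"b".repeat(64)), Some(HashKind::Sha256));
    assert_eq!(kind_of_hex("abc"), None);
}
```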
At the end of this effort there could be a tool to convert repositories from one hash kind to another, with the intermediate step being the ability to read either one or the other, and write to it, too.
### So much is still to be done though…
While the major undertakings are shaping up, some work-in-progress seems to be less lucky and doesn't see much movement.
- git-fetch like functionality
  - despite being so close, no progress was made in that direction and it's still something I'd love to finally get done. The building blocks are all there.
- simple object deltification
  - even though the current pack creation capabilities aren't anything to sneeze at, it would be great to experiment with the infrastructure needed to support creating our own deltas
A lot of work, and I am looking forward to all of it :).
Merry Christmas and a happy new year,
Sebastian
PS: The latest timesheets can be found [here](https://github.com/Byron/byron/blob/main/timesheets/2021.csv).

etc/monthlies/2021/july.md

Lines changed: 37 additions & 0 deletions
### The ultimate rabbit hole: Loose file reference DB
Even though the foundation was laid in June with `git-tempfile` and `git-lock`, correctly writing loose references requires more than 'just writing a file', the convenient story I had told myself. The culprit here is that `gitoxide` should never intentionally be worse than `git`, and studying the canonical implementation showed how much care is taken to assure consistency and decent operation even in the face of broken references.
One of the ways to accomplish this is to use transactions: bundles of edits which we try hard to perform in a way that can be rolled back on error. It works by preparing each updated or soon-to-be-deleted ref with a lock file, which blocks other writes and receives updates in place of the actual reference. Under most circumstances a reflog is written as well, guarded by the same lock. Once all lock files have been created, the transaction can be committed by moving each lock onto the reference it locks, or by removing the reference accordingly. There is a ton of edge cases that are tested for and handled faithfully, even though ultimately these transactions aren't really transactional, as many non-atomic operations are involved.
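The lock-file dance can be sketched with plain `std::fs`; this is a simplified model of the protocol, not the `git-lock` API, and it assumes each ref is a plain file containing an object id:

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Prepare: write the new value to `<ref>.lock`. `create_new` makes the
// creation fail if the lock already exists, which is what blocks a
// concurrent writer.
fn prepare(ref_path: &Path, new_value: &str) -> std::io::Result<PathBuf> {
    use std::io::Write;
    let lock_path = ref_path.with_extension("lock");
    let mut lock = fs::OpenOptions::new()
        .write(true)
        .create_new(true)
        .open(&lock_path)?;
    writeln!(lock, "{}", new_value)?;
    Ok(lock_path)
}

// Commit: rename the lock onto the real reference, which atomically
// replaces the old value on POSIX filesystems.
fn commit(lock_path: &Path, ref_path: &Path) -> std::io::Result<()> {
    fs::rename(lock_path, ref_path)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("ref-tx-demo");
    let _ = fs::remove_dir_all(&dir); // clean slate for the demo
    fs::create_dir_all(&dir)?;
    let ref_path = dir.join("HEAD-demo");
    let lock = prepare(&ref_path, "11aa11aa")?;
    assert!(prepare(&ref_path, "22bb22bb").is_err()); // second writer is blocked
    commit(&lock, &ref_path)?;
    assert_eq!(fs::read_to_string(&ref_path)?.trim(), "11aa11aa");
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```

Rolling back simply means deleting the lock files before any of them was renamed, which is what makes the prepare/commit split useful.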
As a side-effect of this, reflogs can now be written and iterated in various ways, and packed-refs are incorporated into find results and reference iteration. As always, `gitoxide` chooses not to handle caches internally but delegates this to the user. For one-off operations this feels very natural, whereas long-running tools will most certainly resort to `git-repository` in some shape or form.
For fun I have run very unscientific benchmarks and saw ~6 million packed references traversed per second, along with 500,000 packed-reference lookups per second, per core.
Even though I have worked on it for more than a month already, it's still not done, as packed references aren't yet handled in transactions. At the very least, refs will have to be removed from the packed-refs file on deletion, and it's certainly possible to think of ways to prefer updating packed-refs over loose refs to avoid spamming the file system with many small files. However, due to the nature of the loose reference DB, loose references are the source of truth and lock files are needed in any case, so performance improvements from handling packed-refs a little differently than canonical git can't really happen. What can happen is to auto-pack refs and avoid creating loose file references, which may be more suitable for servers that don't yet have access to `ref-table`.
And of course, all `gix` and `gixp` tools were upgraded to make use of the new capabilities and became more convenient in the process.
### Get small packs fast*
The improvement for packs this month is very measurable. In the last letter I was talking about linux kernel packs weighing 45GB, written at only 300MB/s. This was due to delta-objects being fully decompressed and then recompressed as base objects, with the latter operation limited by `zlib-ng`, which already is the fastest we know of.
Now the tides have turned with the introduction of delta-object support as well as support for thin packs. The former makes it possible to directly copy existing packed delta objects to the new pack instead of recompressing them, which writes 3.6GB in 5.4 seconds. This is done by processing multiple chunks of pack entries in parallel and bringing them back into order on the fly for writing.
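Bringing parallel chunks back into order on the fly can be sketched with a sequence-numbered channel; this illustrates the technique, not the actual pack-writing code:

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

// Workers send `(chunk_index, bytes)` as they finish, in any order.
// The writer buffers out-of-order chunks and flushes every run that
// becomes consecutive with what was already written.
fn write_in_order(rx: mpsc::Receiver<(usize, Vec<u8>)>, out: &mut Vec<u8>) {
    let mut next = 0;
    let mut pending = BTreeMap::new();
    for (index, bytes) in rx {
        pending.insert(index, bytes);
        while let Some(bytes) = pending.remove(&next) {
            out.extend_from_slice(&bytes);
            next += 1;
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let chunks: Vec<Vec<u8>> = (0u8..4).map(|i| vec![i; 2]).collect();
    for (index, chunk) in chunks.into_iter().enumerate() {
        let tx = tx.clone();
        thread::spawn(move || tx.send((index, chunk)).unwrap());
    }
    drop(tx); // the writer's loop ends once all senders are gone
    let mut out = Vec::new();
    write_in_order(rx, &mut out);
    assert_eq!(out, vec![0, 0, 1, 1, 2, 2, 3, 3]);
}
```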
The currently produced pack sizes are suitable for cloning and fetching, even though we still can't produce delta-objects ourselves.
Another weak spot is the counting stage, as `gitoxide` currently can't use caches, nor is its single-threaded counting speed on par with canonical git. While counting the linux kernel pack is done in ~50s with multiple threads, and thus about 2.4x faster than git, with a single thread we are ~20s behind. This gets worse on smaller repositories, with canonical git clearly beating `gitoxide`, which barely keeps up with a single thread of git even with all threads at its disposal.
One reason for this seems to be the use of a `DashSet` even for single-threaded operation, which appears to cost about 40% of the runtime. In the next month this should be fixable by offering a single-threaded code path for the counting stage, which is required for deterministic packs anyway, and by trying to make the existing multi-threaded version faster using scoped threads instead of a threaded iterator. The iterator design for the counting stage was a mistake: at the time I wasn't aware that counts cannot be streamed, as the number of objects in the pack needs to be known up front and additional sorting has to happen on the list of input objects as well.
\*if the objects destined for the pack have already been in a pack.
### Bonus: Bad Objects
A significant amount of the time allotted for my work on pack generation went into dealing with 'bad objects' adequately. These are, to my mind, _impossible_ objects that are referenced during commit-graph traversal but don't actually exist in the repository. The Rust repository has nearly 1500 of them, the git repository has 4, and `git fsck` doesn't mind them, nor does git care when producing its own packs.

In the statistics produced by `gitoxide` these now show up, and when eventually implementing `gixp fsck` I will be sure to provide more information about them: are they blobs or trees, which commit do they belong to, and any other information that can help understand how they were created in the first place. My guess is that they are from a time long gone when it was possible to 'lose' objects, and `git` ignores them knowing that nothing can be done to get these objects back.
Cheers,
Sebastian
PS: The latest timesheets can be found [here](https://github.com/Byron/byron/blob/main/timesheets/2021.csv).
