Future-proofing Git repository maintenance

Users get the most from Gitaly, the service responsible for the storage and maintenance of all Git repositories in GitLab, when traffic hitting it is efficiently handled. Therefore, we must ensure our Git repositories remain in a well-optimized state. When it comes to Git monorepositories, this maintenance can be a complex task that can cause a lot of overhead by itself because repository housekeeping becomes more expensive the larger the repositories get. This blog post explains in depth what we have done over the past few GitLab releases to rework our approach to repository housekeeping for better scaling and to maintain an optimized state to deliver the best peformance for our users.

The challenge with Git monorepository maintenance

To ensure that Git repositories remain performant, Git regularly runs a set of
maintenance tasks. On the client side, this usually happens by automatically
running git-gc(1) periodically, which:

Compresses revisions into a packed-refs file.
Compresses objects into packfiles.
Prunes objects that aren't reachable by any of the revisions and that have
not been used for a while.
Generates and updates data structures like commit-graphs that help to speed
up queries against the Git repository.

Git periodically runs git gc --auto automatically in the background, which
analyzes your repository and only performs maintenance tasks if required.

At GitLab, we can't use this infrastructure because it does not give us enough
control over which maintenance tasks are executed at what point in time.
Furthermore, it does not give us full control over exactly which data
structures we opt in to. Instead, we have implemented our own maintenance
strategies that are specific to how GitLab works and catered to our specific
needs. Unfortunately, the way GitLab implemented repository maintenance has
been limiting us for quite a while by now.

It is unsuitable for large monorepositories.
It does not give us the ability to easily iterate on the employed maintenance
strategy.

This post explains our previous maintenance strategy and its problems as well as
how we revamped the architecture to allow us to iterate faster and more
efficiently maintain repositories.

Our previous repository maintenance strategy

In the early days of GitLab, most of the application ran on a single server.
On this single server, GitLab directly accessed Git repositories. For various
reasons, this architecture limited us, so we created the standalone Gitaly
server that provides a gRPC API to access Git repositories.

To migrate to exclusively accessing Git repository data using Gitaly we:

Migrated all the logic that was previously contained in the Rails
application to Gitaly.
Created Gitaly RPCs and updated Rails to not execute the logic directly, but
instead invoke the newly-implemented RPC.

While this was the easiest way to tackle the huge task back then, the end
result was that there were still quite a few areas in the Rails codebase that
relied on knowing how the Git repositories were stored on disk.

One such area was repository maintenance. In an ideal world, the Rails server
would not need to know about the on-disk state of a Git repository. Instead,
the Rails server would only care about the data it wants to get out of the
repository or commit to it. Because of the Gitaly migration path we took,
the Rails application was still responsible for executing fine-grained
repository maintenance by calling certain RPCs:

Cleanup to delete stale, temporary files that have accumulated
RepackIncremental and RepackFull to either pack all loose objects into a
new packfile or alternatively to repack all packfiles into a single one
PackRefs to compress all references into a single packed-refs file
WriteCommitGraph to update the commit-graph
GarbageCollect to perform various different tasks

These low-level details of repository maintenance were being managed by the
client. But because clients didn't have any information on the on-disk state of
the repository, they could not even determine which of these maintenance tasks
had to be executed in the first place. Instead, we had a very simple heuristic:
Every few pushes, we ran one of the above RPCs to perform one of the maintenance
tasks. While this heuristic worked, it wasn't great for the following reasons:

Repositories can be modified without using pushes at all. So if users only
use the Web IDE to commit to repositories, they may not get repacked at all.
Because repository maintenance is controlled by the client, Gitaly can't
assume a specific repository state.
The threshold for executing housekeeping tasks is set globally across all
projects rather than on a per-project basis. Consequently, no matter
whether you have a tiny repository or a huge monorepository, we would use the
same intervals for executing maintenance tasks. As you may imagine though,
doing a full repack of a Git repository that is only a few dozen megabytes in
size is a few orders of magnitudes faster than repacking a monorepository
that is multiple gigabytes in size.
Specific types of Git repositories hosted by Gitaly need special care and we
required Gitaly clients to know about these.
Repository maintenance was inefficient overall. Clients do not know about the
on-disk state of repositories. Consequently, they had no choice except to
repeatedly ask Gitaly to optimize specific data structures without knowing
whether this was required in the first place.

Heuristical maintenance strategy

It was clear that we needed to change the strategy we used for repository
maintenance. Most importantly, we wanted to:

Make Gitaly the single source of truth for how we maintain repositories.
Clients should not need to worry about low-level specifics, and Gitaly should
be able to easily iterate on the strategy.
Make the default maintenance strategy work for repositories of all sizes.
Make the maintenance strategy work for repositories of all types. A client
should not need to worry about which maintenance tasks must be executed for
what repository type.
Avoid optimizing data structures that already are in an optimal state.
Improve visibility into the optimizations we perform.

As mentioned in the introduction, Git periodically runs git gc --auto. This
command inspects the repository's state and performs optimizations only when it
finds that the repository is in a sufficiently bad state to warrant the cost.
While using this command directly in the context of Gitaly does not give us
enough flexibility, it did serve as the inspiration for our new architecture.

Instead of providing fine-grained RPCs to maintain various parts of a Git
repository, we now only provide a single RPC OptimizeRepository that works as
a black-box to the caller. This RPC call:

Cleans up stale data in the repository if there is any.
Analyzes the on-disk state of the repository.
Depending on this on-disk state, performs only these maintenance tasks that
are deemed to be necessary.

Because we can analyze and use the on-disk state of the repository, we can be
far more intelligent about repository maintenance compared to the previous
strategy where we optimized some bits of the repository every few pushes.

Packing objects

In the old-style repository maintenance, the client would call either
RepackIncremental or RepackFull. This would either: Pack all loose objects into a new packfile or repack all objects into a single packfile.

By default, we would perform a full repack every five repacks. While this may be
a good default for small repositories, it gets prohibitively expensive for huge
monorepositories where a full repack may easily take several minutes.

The new heuristical maintenance strategy instead scales the allowed number of
packfiles by the total size of all combined packfiles. As a result, the
larger the repository becomes, the less frequently we perform a full repack.

Pruning objects

In the past, clients would periodically call GarbageCollect. In addition to
repacking objects, this RPC would also prune any objects that are unreachable
and that haven't been accessed for a specific grace period.

The new heuristical maintenance strategy scans through all loose objects that
exist in the repository. If the number of loose objects that have a modification
time older than two weeks exceeds a certain threshold, it spawns the
git prune command to prune these objects.

Packing references

In the past, clients would call PackRefs to repack references into the
packed-refs file.

Because the time to compress references scales with the size of the
packed-refs file, the new heuristical maintenance strategy takes into account
both the size of the packed-refs file and the number of loose references that
exist in the repository. If a ratio between these two figures is exceeded, we
compress the loose references.

Auxiliary data structures

There are auxiliary data structures like commit-graphs that are used by Git
to speed up various queries. With the new heuristical maintenance strategy,
Gitaly now automatically updates these as required, either when they are
deemed to be out-of-date, or when they are missing altogether.

Heuristical maintenance strategy rollout

We rolled out this new heuristical maintenance strategy to GitLab.com in March 2022. Initially, we only rolled it out for
gitlab-org/gitlab, which is a
repository where maintenance performed particularly poorly in the past. You can
see the impact of the rollout in the following graph:

Latency of OptimizeRepository for gitlab-org/gitlab

In this graph, you can see that:

Until March 19, we used the legacy fine-grained RPC calls. We spent most
of the time in RepackFull, followed by RepackIncremental and GarbageCollect.
Because March 19 and 20 occurred on a weekend, nothing much happens with
housekeeping.
Early on March 21 we switched gitlab-org/gitlab to use heuristical
housekeeping using OptimizeRepository. Initially, there didn't seem to be
much of an improvement. There wasn't much difference in how much time we
spent maintaining this repository compared to the past.

However, this was caused by an inefficient heuristic. Instead of only pruning
objects when there were stale ones, we always pruned objects when we saw that
there were too many loose objects.
We deployed a fix for this bug on March 22, which led to a significant drop in
time spent optimizing this repository compared to before.

This demonstrated two things:

We're easily able to iterate on the heuristics that we have in Gitaly.
Using the heuristics saves a lot of compute time as we don't unnecessarily
optimize anymore.

We have subsequently rolled this out to all of GitLab.com, starting on March
29, 2022, with similar improvements. With this change, we more than halved the CPU
load when performing repository optimizations.

Observability

While it is great that OptimizeRepository has managed to save us a lot of
compute power, one goal was to improve visibility into repository housekeeping.
More specifically, we wanted to:

Gain visibility on the global level to see what optimizations are performed
across all of our repositories.
Gain visibility on the repository level to know what state a specific
repository is in.

In order to improve global visibility, we expose a set of Prometheus metrics that
allow us to observe important details about our repository maintenance. The
following graphs show the optimizations performed in a 30-minute window of our
production systems on GitLab.com.

The optimizations, which are being performed in general.
The average latency it takes to perform each of these optimizations.
What kind of stale data we are cleaning up.

To improve visibility into the state each repository is in we have started to
log structured data that includes all the relevant bits. A subset of the
information it exposes is:

The number of loose objects and their sizes.
The number of packfiles and their combined size.
The number of loose references.
The size of the packed-refs file.
Information about commit-graphs, bitmaps and other auxiliary data
structures.

This information is also exposed through Prometheus metrics:

Repository state metrics for GitLab.com

These graphs expose important metrics of the on-disk state of our repositories:

The top panel shows which data structures exist.
The heatmaps on the left show how large specific data structures are.
The heatmaps on the right show how many of these data structures we have.

Combining both the global and per-repository information allows us to easily
observe how repository maintenance behaves during normal operations. But more
importantly, it gives us meaningful data when rolling out new features that
change the way repositories are maintained.

Manually enabling heuristical housekeeping

While the heuristical housekeeping is enabled by default starting with GitLab
15.9, it has already been introduced with GitLab 14.10. If you want to use the
new housekeeping strategy before upgrading to 15.9, you can opt in by
setting the optimized_housekeeping feature flag.
You can do so via the gitlab-rails console:

Feature.enable(:optimized_housekeeping)

Future improvements

While the new heuristical optimization strategy has been successfully
battle-tested for a while now for GitLab.com, at the time of writing this
blog post, it still wasn't enabled by default for self-deployed installations.
This has finally changed with GitLab 15.8, where we have default-enabled the new
heuristical maintenance strategy.

We are not done yet, though. Now that Gitaly is the only source of truth for how
repositories are optimized, we are tracking improvements to our maintenance
strategy in epic 7443:

Multi-pack indices and geometric
repacking will help us to further reduce the time spent repacking objects.
Cruft packs will help us to further
reduce the time spent pruning objects and reduce the overall size of
unreachable objects.
Gitaly will automatically run housekeeping tasks when receiving mutating RPC
calls so that clients don't have to call OptimizeRepository at all anymore.

So stay tuned!

Future-proofing Git repository maintenance

The challenge with Git monorepository maintenance

Our previous repository maintenance strategy