Improving monorepo performance
A monorepo is a repository that contains sub-projects. A single application often contains interdependent projects. For example, a backend, a web frontend, an iOS application, and an Android application. Monorepos are common, but they can present performance risks. Some common problems:
- Large binary files.
- Many files with long histories.
- Many simultaneous clones and pushes.
- Vertical scaling limits.
- Network bandwidth limits.
- Disk bandwidth limits.
GitLab is itself based in Git. Its Git storage service, Gitaly, experiences the performance constraints associated with monorepos. What we've learned can help you manage your own monorepo better.
- What repository characteristics can impact performance.
- Some tools and steps to optimize monorepos.
Optimize Gitaly for monorepos
Git compresses objects into packfiles to use less space. When you clone, fetch, or push, Git uses packfiles. They reduce disk space and network bandwidth, but packfile creation requires much CPU and memory.
Massive monorepos have more commits, files, branches, and tags than smaller repositories. When the objects
become larger, and take longer to transfer, packfile creation becomes more expensive
and slower. In Git, the git-pack-objects
process is
the most resource intensive operation, because it:
- Analyzes the commit history and files.
- Determines which files to send back to the client.
- Creates packfiles.
Traffic from git clone
and git fetch
starts a git-pack-objects
process on the server.
Automated continuous integration systems, like GitLab CI/CD, can cause much of this traffic.
High amounts of automated CI/CD traffic send many clone and fetch requests, and can strain your
Gitaly server.
Use these strategies to decrease load on your Gitaly server.
pack-objects
cache
Enable the Gitaly Enable the Gitaly pack-objects
cache,
which reduces server load for clones and fetches.
When a Git client sends a clone or fetch request, the data produced by git-pack-objects
can be
cached for reuse. If your monorepo is cloned frequently, enabling
Gitaly pack-objects
cache,
reduces server load. When enabled, Gitaly maintains an in-memory cache instead of regenerating
response data for each clone or fetch call.
For more information, see Pack-objects cache.
Configure Git bundle URIs
Create and store Git bundles on third-party storage with low latency, like a CDN. Git downloads packages from your bundle first, then fetches any remaining objects and references from your Git remote. This approach bootstraps your object database faster and reduces load on Gitaly.
- It speeds up clones and fetches for users with a poor network connection to your GitLab server.
- It reduces the load on servers that run CI/CD jobs by pre-loading bundles.
To learn more, see Bundle URIs.
Configure Gitaly negotiation timeouts
When attempting to fetch or archive repositories, fatal: the remote end hung up unexpectedly
errors
can happen if you have:
- Large repositories.
- Many repositories in parallel.
- The same large repository in parallel.
To mitigate this issue, increase the default negotiation timeout values.
Size your hardware correctly
Monorepos are usually for larger organizations with many users. To support your monorepo, your GitLab environment should match one of the reference architectures provided by the GitLab Test Platform and Support teams. These architectures are the recommended way to deploy GitLab at scale while maintaining performance.
Reduce the number of Git references
In Git, references
are branch and tag names that point to specific commits. Git stores references as loose files in the
.git/refs
folder of your repository. To see all references in your repository,
run git for-each-ref
.
When the number of references in your repository grows, the seek time needed to find a specific reference also grows. Each time Git parses a reference, the increased seek time leads to increased latency.
To fix this problem, Git uses pack-refs to create a single
.git/packed-refs
file containing all references for that repository. This method reduces the storage
space needed for refs. It also decreases seek time, because seeking in a single file is faster than seeking
through all files in a directory.
Git handles newly created or updated references with loose files. They are not cleaned up and added to the
.git/packed-refs
file until you run git pack-refs
. Gitaly runs git pack-refs
during
housekeeping. While this helps
many repositories, write-heavy repositories still have these performance problems:
- Creating or updating references creates new loose files.
- Deleting references requires editing the existing
packed-refs
file to remove the existing reference.
Git iterates through all references when you fetch or clone a repository. The server reviews ("walks") the internal graph structure of each reference, finds any missing objects, and sends them to the client. The iteration and walking processes are CPU-intensive, and increase latency. This latency can cause a domino effect in repositories with a lot of activity. Each operation is slower, and each operation stalls later operations.
To mitigate the effects of a large number of references in a monorepo:
-
Create an automated process for cleaning up old branches.
-
If certain references don't need to be visible to the client, hide them using the
transfer.hideRefs
configuration setting. Gitaly ignores any on-server Git configuration, so you must change the Gitaly configuration itself in/etc/gitlab/gitlab.rb
:gitaly['configuration'] = { # ... git: { # ... config: [ # ... { key: "transfer.hideRefs", value: "refs/namespace_to_hide" }, ], }, }
In Git 2.42.0 and later, different Git operations can skip over hidden references when doing an object graph walk.
Optimize CI/CD for monorepos
To keep GitLab scalable with your monorepo, optimize how your CI/CD jobs interact with your repository.
Reduce concurrent clones in CI/CD
Reduce CI/CD pipeline concurrency by staggering your scheduled pipelines to run at different times. Even a few minutes apart can help.
CI/CD loads are often concurrent, because pipelines are scheduled at specific times. Git requests to your repository can spike during these times, and affect performance for CI/CD processes and users.
Use shallow clones in CI/CD processes
For git clone
and git fetch
calls in your CI/CD systems, set the
--depth
option with a small number, like 10. A depth of 10 instructs Git to request only the last 10 changes
for a given branch. If your repository has a long backlog, or many large files, this change can
make Git fetches much faster. It reduces the amount of data transferred.
GitLab and GitLab Runner perform a shallow clone by default.
This GitLab CI/CD pipeline configuration example sets the GIT_DEPTH
:
variables:
GIT_DEPTH: 10
test:
script:
- ls -al
git fetch
in CI/CD operations
Use If it's possible to keep a working copy of the repository available, use git fetch
instead of
git clone
on CI/CD systems. git fetch
requires less work from the server:
-
git clone
requests the entire repository from scratch.git-pack-objects
must process and send all branches and tags. -
git fetch
requests only the Git references missing from the repository.git-pack-objects
processes only a subset of the total Git references. This strategy also reduces the total data transferred.
By default, GitLab uses the
fetch
Git strategy recommended for large repositories.
git clone
path
Set a If your monorepo is used with a fork-based workflow, consider setting
GIT_CLONE_PATH
to control
where you clone your repository.
Git stores forks as separate repositories with separate worktrees. GitLab Runner cannot optimize the use of worktrees. Configure and use the GitLab Runner executor only for the given project. To make the process more efficient, don't share it across different projects.
The GIT_CLONE_PATH
must be
in the directory set in $CI_BUILDS_DIR
. You can't pick any path from disk.
git clean
on CI/CD jobs
Disable The git clean
command removes untracked files from the working tree. In large repositories, it uses
a lot of disk I/O. If you reuse existing machines, and can reuse an existing worktree, consider
disabling it on CI/CD jobs. For example, GIT_CLEAN_FLAGS: -ffdx -e .build/
can avoid deleting directories
from the worktree between runs. This can speed up incremental builds.
To disable git clean
on CI/CD jobs, set
GIT_CLEAN_FLAGS
to none
for them.
By default, GitLab ensures that:
- You have your worktree on the given SHA.
- Your repository is clean.
For exact parameters accepted by GIT_CLEAN_FLAGS
, see the Git documentation for
git clean
. The available parameters depend on your Git version.
git fetch
behavior with flags
Change Change the behavior of git fetch
to exclude any data your CI/CD jobs do not need. If your project contains
many tags, and your CI/CD jobs do not need them, use GIT_FETCH_EXTRA_FLAGS
to set
--no-tags
. This setting
can make your fetches faster and more compact.
Even if your repository does not contain many tags, --no-tags
can improve performance in some cases.
For more information, see issue 746
and the GIT_FETCH_EXTRA_FLAGS
Git documentation.
Optimize Git for monorepos
To keep GitLab scalable with your monorepo, optimize the repository itself.
Avoid shallow clones for development
Avoid shallow clones for development. Shallow clones greatly increase the time needed to push changes. Shallow clones work well with CI/CD jobs, because repository contents aren't changed after checkout.
For local development, use partial clones instead, to:
- Filter out blobs, with
git clone --filter=blob:none
- Filter out trees, with
git clone --filter=tree:0
For more information, see Reduce clone size.
Profile your repository to find problems
Large repositories generally experience performance issues in Git. The
git-sizer
project profiles your repository, and helps you understand
potential problems. It can help you develop mitigation strategies to prevent performance problems.
Analyzing your repository requires a full Git mirror or bare clone, to ensure all Git references
are present.
To profile your repository with git-sizer
:
-
Run this command to clone your repository in the bare Git format compatible with
git-sizer
:git clone --mirror <git_repo_url>
-
In the directory of your Git repository, run
git-sizer
with all statistics:git-sizer -v
After processing, the output of git-sizer
should look like this example. Each row includes a
Level of concern for that aspect of the repository. Higher levels of concern are shown with more
asterisks. Items with extremely high levels of concern are shown with exclamation marks. In this example,
a few items have a high level of concern:
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size | | |
| * Commits | | |
| * Count | 723 k | * |
| * Total size | 525 MiB | ** |
| * Trees | | |
| * Count | 3.40 M | ** |
| * Total size | 9.00 GiB | **** |
| * Total tree entries | 264 M | ***** |
| * Blobs | | |
| * Count | 1.65 M | * |
| * Total size | 55.8 GiB | ***** |
| * Annotated tags | | |
| * Count | 534 | |
| * References | | |
| * Count | 539 | |
| | | |
| Biggest objects | | |
| * Commits | | |
| * Maximum size [1] | 72.7 KiB | * |
| * Maximum parents [2] | 66 | ****** |
| * Trees | | |
| * Maximum entries [3] | 1.68 k | * |
| * Blobs | | |
| * Maximum size [4] | 13.5 MiB | * |
| | | |
| History structure | | |
| * Maximum history depth | 136 k | |
| * Maximum tag depth [5] | 1 | |
| | | |
| Biggest checkouts | | |
| * Number of directories [6] | 4.38 k | ** |
| * Maximum path depth [7] | 13 | * |
| * Maximum path length [8] | 134 B | * |
| * Number of files [9] | 62.3 k | * |
| * Total size of files [9] | 747 MiB | |
| * Number of symlinks [10] | 40 | |
| * Number of submodules | 0 | |
Use Git LFS for large binary files
Store binary files (like packages, audio, video, or graphics) as Git Large File Storage (Git LFS) objects.
When users commit files into Git, Git uses the blob
object type to store and manage their content.
Git does not handle large binary data efficiently, so large blobs are problematic for Git. If git-sizer
reports blobs of over 10 MB, you usually have large binary files in your repository. Large binary files
cause problems for both server and client:
- For the server: unlike text-based source code, binary data is often already compressed. Git can't compress binary data further, which leads to large packfiles. Large packfiles require more CPU, memory, and bandwidth to create and send.
- For the client: Git stores blob content in both packfiles (usually in
.git/objects/pack/
) and regular files (in worktrees), binary files require far more space than text-based source code.
Git LFS stores objects externally, such as in object storage. Your Git repository contains a pointer to the object's location, rather than the binary file itself. This can improve repository performance. For more information, see the Git LFS documentation.