Organization
The Organization initiative focuses on reaching feature parity between GitLab.com and GitLab Self-Managed.
Consolidate groups and projects
One facet of the Organization initiative is to consolidate groups and projects, addressing the feature disparity between them. Some features, such as epics, are only available at the group level. Some features, such as issues, are only available at the project level. Other features, such as milestones, are available to both groups and projects.
We receive many requests to add features either to the group or project level. Moving features between levels is problematic for several reasons:
- It requires engineering time to move the features.
- It requires UX overhead to maintain mental models of feature availability.
- It creates redundant code.
When features are copied from one level (project, group, or instance) to another, the copies often have small, nuanced differences between them. These nuances cause extra engineering time when fixes are needed, because the fix must be copied to several locations. These nuances also create different user experiences when the feature is used in different places.
A solution for this problem is to consolidate groups and projects into a single entity, namespace. The work on this solution is split into several phases and is tracked in epic 6473.
How to plan features that interact with Group and ProjectNamespace
As of now, every Project in the system has a record in the namespaces table. This makes it possible to use a common interface to create features that are shared between Groups and Projects. Shared behavior can be added using a concerns mechanism. Because the Namespace model is responsible for UserNamespace methods as well, it is discouraged to use the Namespace model for shared behavior for Projects and Groups.
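As a rough illustration of the concerns approach, the sketch below uses a plain Ruby module in place of ActiveSupport::Concern so that it runs without Rails. The module name HasSharedFeature, the method feature_scope, and the simplified Group, ProjectNamespace, and UserNamespace classes are all hypothetical stand-ins, not GitLab's actual models.

```ruby
# Hypothetical concern-style module. In a Rails codebase this would normally
# be an ActiveSupport::Concern; a plain module keeps the sketch self-contained.
module HasSharedFeature
  def feature_scope
    "#{self.class.name.downcase}:#{id}"
  end
end

# Simplified stand-ins for the real models (assumption, not GitLab code).
class Group
  include HasSharedFeature
  attr_reader :id

  def initialize(id)
    @id = id
  end
end

class ProjectNamespace
  include HasSharedFeature
  attr_reader :id

  def initialize(id)
    @id = id
  end
end

# UserNamespace deliberately does not include the module, so the shared
# behavior stays limited to groups and projects.
class UserNamespace
  attr_reader :id

  def initialize(id)
    @id = id
  end
end
```

Including the module only where it applies avoids placing group and project behavior on a base class that personal namespaces also inherit from.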
Resource-based features
To migrate resource-based features, existing functionality will need to be supported. This can be achieved in two phases.
Phase 1 - Setup
- Link into the namespaces table
  - Add a column to the table
  - For example, in issues a project id points to the projects table. We need to establish a link to the namespaces table.
  - Modify code so that any new record already has the correct data in it
  - Backfill
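The Phase 1 steps can be pictured with a small in-memory simulation. This is only a sketch under stated assumptions: the structs, the PROJECTS lookup, and the project_namespace_id attribute are hypothetical simplifications, and a real migration would use database migrations and batched backfills rather than Ruby loops.

```ruby
# Hypothetical in-memory stand-ins for the projects and issues tables.
Project  = Struct.new(:id, :project_namespace_id, keyword_init: true)
IssueRow = Struct.new(:id, :project_id, :namespace_id, keyword_init: true)

PROJECTS = { 10 => Project.new(id: 10, project_namespace_id: 100) }

# "Modify code so that any new record already has the correct data in it":
# new rows get the namespace link when they are created.
def build_issue(id:, project_id:)
  namespace_id = PROJECTS.fetch(project_id).project_namespace_id
  IssueRow.new(id: id, project_id: project_id, namespace_id: namespace_id)
end

# "Backfill": fill in rows that were created before the column existed.
def backfill!(rows)
  rows.each do |row|
    next unless row.namespace_id.nil?

    row.namespace_id = PROJECTS.fetch(row.project_id).project_namespace_id
  end
end
```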
Phase 2 - Prerequisite work
- Investigate the permission model as well as any performance concerns related to that.
- Permissions need to be checked and kept in place.
- Investigate what other models need to support namespaces for functionality dependent on features you migrate in Phase 1.
- Adjust CRUD services and APIs (REST and GraphQL) to point to the new column you added in Phase 1.
- Consider performance when fetching resources.
Introducing new functionality is very much dependent on every single team and feature.
Settings-related features
Right now, cascading settings are available for NamespaceSettings. By creating ProjectNamespace, we can use this framework to make sure that some settings are applicable on the project level as well.
When working on settings, we need to make sure that:
- They are not used in join queries or modify those queries.
- Updating settings is taken into consideration.
- If we want to move from project to project namespace, we follow a similar database process to the one described in Phase 1.
Organizations & cells
For the Cells project, GitLab will rely on organizations. A cell will host one or more organizations. When a request is made, the HTTP Router Service will route it to the correct cell.
Defining a sharding key for all organizational tables
All tables with the following gitlab_schema are considered organization level:
- gitlab_main_cell
- gitlab_ci
- gitlab_sec
- gitlab_main_user
All newly created organization-level tables are required to have a sharding_key defined in the corresponding db/docs/ file for that table.
The purpose of the sharding key is documented in the Organization isolation blueprint, but in short this column is used to provide a standard way of determining which Organization owns a particular row in the database. The column will be used in the future to enforce that data does not cross Organization boundaries. It will also be used in the future to provide a uniform way to migrate data between Cells.
The actual name of the foreign key can be anything but it must reference a row in projects or groups. The chosen sharding_key column must be non-nullable.
Setting multiple sharding_key columns that are nullable is also allowed, provided that the table has a check constraint that ensures at least one of the keys is non-null for every row in the table. See NOT NULL constraints for multiple columns for instructions on creating these constraints.
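The rule that such a check constraint enforces can be expressed in a few lines of Ruby. This is a minimal sketch of the rule only, not the database constraint itself; the function name and the hash-based row representation are assumptions made for illustration.

```ruby
# Returns true when at least one of the candidate sharding key columns
# is non-null for the given row (represented here as a plain Hash).
def sharding_key_present?(row, keys)
  keys.any? { |key| !row[key].nil? }
end

# Hypothetical rows for a table that can belong to a project or a namespace.
project_row = { project_id: 7, namespace_id: nil }
orphan_row  = { project_id: nil, namespace_id: nil }

sharding_key_present?(project_row, [:project_id, :namespace_id]) # true
sharding_key_present?(orphan_row, [:project_id, :namespace_id])  # false
```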
The following are examples of valid sharding keys:
- The table entries belong to a project only:
    sharding_key:
      project_id: projects
- The table entries belong to a project and the foreign key is target_project_id:
    sharding_key:
      target_project_id: projects
- The table entries belong to a namespace/group only:
    sharding_key:
      namespace_id: namespaces
- The table entries belong to a namespace/group only and the foreign key is group_id:
    sharding_key:
      group_id: namespaces
- The table entries belong to a namespace or a project:
    sharding_key:
      project_id: projects
      namespace_id: namespaces
- (Only for gitlab_main_user) The table entries belong to a user only:
    sharding_key:
      user_id: user
The sharding key must be immutable
The choice of a sharding_key should always be immutable. This is because the sharding key column will be used as an index for the planned Org Mover, and also the enforcement of isolation of Organization data. Any mutation of the sharding_key could result in inconsistent data being read.
Therefore, if your feature requires a user experience which allows data to be
moved between projects or groups/namespaces, then you may need to redesign the
move feature to create new rows.
An example of this can be seen in the move an issue feature. This feature does not actually change the project_id column for an existing issues row but instead creates a new issues row and creates a link in the database from the original issues row.
If there is a particularly challenging
existing feature that needs to allow moving data you will need to reach out to
the Tenant Scale team early on to discuss options for how to manage the
sharding key.
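The "move by creating a new row" idea can be sketched with an in-memory store. Everything here is a hypothetical simplification: the Issue struct, the IssueStore class, and the moved_to_id link column stand in for whatever the real feature uses, and no database is involved.

```ruby
# Hypothetical row shape; moved_to_id links the original row to its copy.
Issue = Struct.new(:id, :project_id, :title, :moved_to_id, keyword_init: true)

class IssueStore
  def initialize
    @rows = {}
    @next_id = 1
  end

  def create(project_id:, title:)
    issue = Issue.new(id: @next_id, project_id: project_id, title: title, moved_to_id: nil)
    @rows[issue.id] = issue
    @next_id += 1
    issue
  end

  # "Moves" an issue without mutating its sharding key: a new row is created
  # in the target project and the original row is linked to it.
  def move(issue, target_project_id:)
    copy = create(project_id: target_project_id, title: issue.title)
    issue.moved_to_id = copy.id
    copy
  end
end

store = IssueStore.new
original = store.create(project_id: 1, title: "Bug report")
copy = store.move(original, target_project_id: 2)
```

After the move, the original row still has its initial project_id, so any index or isolation check keyed on that column stays valid.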
Using namespace_id as sharding key
The namespaces table has rows that can refer to a Group, a ProjectNamespace, or a UserNamespace. The UserNamespace type is also known as a personal namespace.
Using a namespace_id as a sharding key is a good option, except when namespace_id refers to a UserNamespace. Because a user does not necessarily have a related namespace record, this sharding key can be NULL. A sharding key should not have NULL values.
Using the same sharding key for projects and namespaces
Developers may also choose to use namespace_id only for tables that can belong to a project where the feature used by the table is being developed following the Consolidating Groups and Projects blueprint. In that case the namespace_id would need to be the ID of the ProjectNamespace and not the group that the namespace belongs to.
Using organization_id as sharding key
Usually, project_id or namespace_id are the most common sharding keys. However, there are cases where a table does not belong to a project or a namespace. In such cases, organization_id is an option for the sharding key, provided the below guidelines are followed:
- The sharding_key column still needs to be immutable.
- Only add organization_id for root level models (for example, namespaces), and not leaf-level models (for example, issues).
- Ensure such tables do not contain data related to groups, or projects (or records that belong to groups / projects). Instead, use project_id, or namespace_id.
- Tables with lots of rows are not good candidates because we would need to re-write every row if we move the entity to a different organization which can be expensive.
- When there are other tables referencing this table, the application should continue to work if the referencing table records are moved to a different organization.
If you believe that the organization_id is the best option for the sharding key, seek approval from the Tenant Scale group. This is crucial because it has implications for data migration and may require reconsideration of the choice of sharding key.
As an example, see this issue, which added organization_id as a sharding key to an existing table.
For more information about development with organizations, see Organization
Define a desired_sharding_key to automatically backfill a sharding_key
We need to backfill a sharding_key to hundreds of tables that do not have one. This process will involve creating a merge request like https://gitlab.com/gitlab-org/gitlab/-/merge_requests/136800 to add the new column, backfill the data from a related table in the database, and then create subsequent merge requests to add indexes, foreign keys and not-null constraints.
In order to minimize the amount of repetitive effort for developers we've introduced a concise declarative way to describe how to backfill the sharding_key for this specific table. This content will later be used in automation to create all the necessary merge requests.
An example of the desired_sharding_key was added in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/139336 and it looks like:
--- # db/docs/security_findings.yml
table_name: security_findings
classes:
- Security::Finding
# ...
desired_sharding_key:
  project_id:
    references: projects
    backfill_via:
      parent:
        foreign_key: scanner_id
        table: vulnerability_scanners
        table_primary_key: id # Optional. Defaults to 'id'
        sharding_key: project_id
        belongs_to: scanner
To understand best how this YAML data will be used you can map it onto the merge request we created manually in GraphQL https://gitlab.com/gitlab-org/gitlab/-/merge_requests/136800. The idea will be to automatically create this. The content of the YAML specifies the parent table and its sharding_key to backfill from in the batched background migration. It also specifies a belongs_to relation which will be added to the model to automatically populate the sharding_key in the before_save.
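To make the mapping from the YAML to a backfill more concrete, the sketch below parses a dictionary entry with Ruby's standard YAML library and derives a single illustrative UPDATE statement from it. This is not how the real automation works (that uses batched background migrations); the backfill_sql function and the generated SQL are assumptions for illustration only.

```ruby
require "yaml"

# Trimmed copy of the dictionary entry from the example.
DICTIONARY_ENTRY = <<~YAML
  table_name: security_findings
  desired_sharding_key:
    project_id:
      references: projects
      backfill_via:
        parent:
          foreign_key: scanner_id
          table: vulnerability_scanners
          table_primary_key: id
          sharding_key: project_id
          belongs_to: scanner
YAML

# Hypothetical helper: builds one illustrative statement that copies the
# parent's sharding key into rows of the child table that lack it.
def backfill_sql(entry)
  table = entry.fetch("table_name")
  column, config = entry.fetch("desired_sharding_key").first
  parent = config.fetch("backfill_via").fetch("parent")
  parent_table = parent.fetch("table")
  parent_pk = parent.fetch("table_primary_key", "id") # defaults to 'id'

  "UPDATE #{table} SET #{column} = #{parent_table}.#{parent.fetch('sharding_key')} " \
    "FROM #{parent_table} " \
    "WHERE #{table}.#{parent.fetch('foreign_key')} = #{parent_table}.#{parent_pk} " \
    "AND #{table}.#{column} IS NULL"
end

sql = backfill_sql(YAML.safe_load(DICTIONARY_ENTRY))
```

Reading the entry this way shows which fields are needed for the join (foreign_key, table, table_primary_key) and which field supplies the value (sharding_key).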
Define a desired_sharding_key when the parent table also has one
By default, a desired_sharding_key configuration will validate that the chosen sharding_key exists on the parent table. However, if the parent table also has a desired_sharding_key configuration and is itself waiting to be backfilled, you need to include the awaiting_backfill_on_parent field.
For example:
desired_sharding_key:
  project_id:
    references: projects
    backfill_via:
      parent:
        foreign_key: package_file_id
        table: packages_package_files
        table_primary_key: id # Optional. Defaults to 'id'
        sharding_key: project_id
        belongs_to: package_file
    awaiting_backfill_on_parent: true
There are likely edge cases where this desired_sharding_key structure is not suitable for backfilling a sharding_key. In such cases the team owning the table will need to create the necessary merge requests to add the sharding_key manually.
Exempting certain tables from having sharding keys
Certain tables can be exempted from having sharding keys by adding exempt_from_sharding: true to the table's database dictionary file. This can be used for:
- JiHu specific tables, since they do not have any data on the .com database. !145905
- Tables that are marked to be dropped soon, like operations_feature_flag_scopes. !147541
When tables are exempted from sharding key requirements, they also do not show up in our progress dashboard.
Exempted tables must not have foreign key or loose foreign key references, as this may cause the target cell's database to have foreign key violations when data is moved. See #471182 for examples and possible solutions.
Ensure sharding key presence on application level
When you define your sharding key you must make sure it's filled on application level. Every ApplicationRecord model includes a helper populate_sharding_key, which provides a convenient way of defining sharding key logic, and also a corresponding matcher to test your sharding key logic. For example:
# in model.rb
populate_sharding_key :project_id, source: :merge_request, field: :target_project_id
# in model_spec.rb
it { is_expected.to populate_sharding_key(:project_id).from(:merge_request, :target_project_id) }
See more helper examples and RSpec matcher examples.
Map a request to an organization with Current.organization
The application needs to know how to map incoming requests to an organization. The mapping logic is encapsulated in Gitlab::Current::Organization. The outcome of this mapping is stored in an ActiveSupport::CurrentAttributes instance called Current. You can then access the current organization using the Current.organization method.
Since this mapping depends on HTTP requests, Current.organization is only available in the request layer (Rails controllers, Grape API, and GraphQL). It cannot be used in Rake tasks, cron jobs or Sidekiq workers. This is enforced by a RuboCop rule. In those cases, the organization ID should be derived from something else (related data) or passed as an argument.
Availability of Current.organization
Since this mapping depends on HTTP requests, Current.organization is available only in the request layer. You can use it in:
- Rails controllers that inherit from ApplicationController
- GraphQL queries and mutations
- Grape API endpoints (requires usage of a helper)
You cannot use Current.organization in:
- Rake tasks
- Cron jobs
- Sidekiq workers
This restriction is enforced by a RuboCop rule. For these cases, derive the organization ID from related data or pass it as an argument.
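The "pass it as an argument" pattern can be sketched without Rails or Sidekiq. In the example below, RequestContext is a hypothetical thread-local stand-in for the request-scoped Current object, and ExportWorker is a plain class rather than a real Sidekiq worker; both names are assumptions made for illustration.

```ruby
Organization = Struct.new(:id)

# Hypothetical stand-in for a request-scoped context. It is only populated
# while a "request" block is running, which mimics the request layer.
module RequestContext
  def self.organization
    Thread.current[:sketch_organization]
  end

  def self.with_organization(organization)
    Thread.current[:sketch_organization] = organization
    yield
  ensure
    Thread.current[:sketch_organization] = nil
  end
end

# Plain class standing in for a background worker. It never reads the
# request context; the organization ID arrives as an explicit argument.
class ExportWorker
  def perform(organization_id)
    "exporting data for organization #{organization_id}"
  end
end

# Request layer: read the organization once and hand its ID to the worker.
result = RequestContext.with_organization(Organization.new(42)) do
  ExportWorker.new.perform(RequestContext.organization.id)
end
```

Passing a plain ID keeps the worker independent of any request state, which no longer exists by the time a background job runs.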
Usage in Grape API
Current.organization is not available in all Grape API endpoints. Use the set_current_organization helper to set Current.organization:
module API
  class SomeAPIEndpoint < ::API::Base
    before do
      set_current_organization # This will set Current.organization
    end

    # ... api logic ...
  end
end
The default organization
Do not rely on a default organization. Only one cell can access the default organization, and other cells cannot access it.
Default organizations were initially used to assign existing data when introducing the Organization data structure. However, the application no longer depends on default organizations. Do not create or assign default organization objects.
The default organization remains available on GitLab.com only until all data is assigned to new organizations. Hard-coded dependencies on the default organization do not work in cells. All cells should be treated the same.
Organization data sources
An organization serves two purposes:
- A logical grouping of data (for example: a User belongs to one or more Organizations)
- Sharding key for Cells
For data modeling purposes, there is no need to have redundant organization_id attributes. For example, the projects table has an organization_id column. From a normalization point of view, this is not needed because a project belongs to a namespace and a namespace belongs to an organization.
However, for sharding purposes, we violate this normalization rule. Tables that have a parent-child relationship still define organization_id on both the parent table and the child.
To populate the organization_id column, use these methods in order of preference:
- Derive from related data. For example, a subgroup can use the organization that is assigned to the parent group.
- Current.organization. This is available in the request layer and can be passed into Sidekiq workers.
- Ask the user. In some cases, the UI needs to be updated and should include a way of selecting an organization.
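The order of preference above can be read as a simple fallback chain. The function below is a minimal sketch of that idea; its name and keyword arguments are hypothetical, and real code would obtain each value from the model, the request context, or the UI as appropriate.

```ruby
# Picks an organization ID using the documented order of preference:
# related data first, then the current request's organization, then an
# explicit choice made by the user.
def resolve_organization_id(related_organization_id: nil,
                            current_organization_id: nil,
                            user_selected_organization_id: nil)
  related_organization_id ||
    current_organization_id ||
    user_selected_organization_id ||
    raise(ArgumentError, "no organization could be determined")
end

# A subgroup inherits from its parent group even when a request context exists.
resolve_organization_id(related_organization_id: 1, current_organization_id: 2) # 1
```

Raising when nothing is available, instead of falling back to a default organization, matches the guidance not to rely on a default organization.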
Related topics
- Consolidating groups and projects architecture documentation
- Organization user documentation