Infrastructure team(s)

Per our current set of OKRs, the infrastructure team works on making "GitLab.com ready for mission-critical tasks". Specifically, this means the infrastructure team works on

  1. GitLab.com's availability and scalability.
    • Current goal: Availability at 99.9%.
    • Availability here is defined as the uptime of the site linked above (which is in fact the first issue in the gitlab-ce issue tracker; for more on pingdom see the monitoring page), measured per calendar month and as recorded on pingdom.
  2. GitLab.com's performance.
  3. GitLab.com's security.
  4. Keeping it easy to maintain GitLab instances, for administrators all over the world.
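
To make the 99.9% goal concrete, the sketch below converts a monthly availability target into a downtime budget in minutes, and converts recorded downtime back into an availability figure. It is only an illustration of the arithmetic: the function names are hypothetical, and it assumes availability is simply uptime divided by the total minutes in the calendar month, which may differ from how pingdom computes its reports.

```python
import calendar


def minutes_in_month(year: int, month: int) -> int:
    """Total number of minutes in the given calendar month."""
    days = calendar.monthrange(year, month)[1]
    return days * 24 * 60


def downtime_budget_minutes(year: int, month: int, target: float = 0.999) -> float:
    """Minutes of downtime allowed in a month while still meeting the target.

    Assumption: availability = uptime / total minutes in the calendar month.
    """
    return minutes_in_month(year, month) * (1 - target)


def availability(year: int, month: int, downtime_minutes: float) -> float:
    """Availability for a month given the recorded downtime in minutes."""
    return 1 - downtime_minutes / minutes_in_month(year, month)


if __name__ == "__main__":
    # A 30-day month has 43,200 minutes, so 99.9% leaves about 43.2 minutes.
    print(round(downtime_budget_minutes(2017, 4), 2))
    # A single 60-minute outage in that month already misses the goal.
    print(availability(2017, 4, 60) >= 0.999)
```

At 99.9%, a 30-day month allows roughly 43 minutes of downtime in total, which is why allowable downtime is treated as a scarce resource further down this page.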

To do this, we've defined four teams within Infrastructure to tackle:

Collaboration across the company for Site Reliability

Individuals from the Infrastructure team frequently collaborate very closely with different product teams (e.g. Platform, Discussion, CI, Packaging, etc.), and Reliability Experts from product teams collaborate closely with the Infrastructure team. Together, they work on the topics listed above, applying the principles and methods of Site Reliability Engineering a little bit more each day.

Embedded Production Engineers

Additionally, specific production engineers can be "embedded" with one or multiple different teams for anything from a few weeks to months.

If you are an "embedded" Production Engineer, then you are expected to:

Since most "feature sets or services" at GitLab are already in production, this means you work on making sure that the feature set or service meets, after the fact, the requirements for Production Readiness [TODO: add link to production readiness review questionnaire]. This typically involves improving the runbooks and documentation, alerting, monitoring, coding for resiliency, etc. By the time you are done, any other member of the Production Team should be able to tend to the feature set or service in production as well as you can, and the "embedment" stops. At that point you should be listed as an expert in the respective service.

Production and Staging Access

Production access (specifically: ssh access to production) is granted to production engineers, security engineers, and (production) on-call heroes.

Staging access is treated at the same level as production access because it contains production data.

No other engineer, lead, or manager at any level has access to production. If some information is needed from production, it must be obtained by a production engineer through an issue in the infrastructure issue tracker.

There is one temporary exception: release managers require production ssh access to perform deploys. They will have production access until production engineering can offer deployment automation that requires neither Chef nor ssh access. This is an ongoing effort.



Runbooks are public. They are automatically mirrored to our private development environment so that, if GitLab.com is down, we can still access the runbooks there.

These runbooks aim to provide simple solutions for common problems. The solutions are linked to from our alerting system. The runbooks should be kept up to date with whatever we learn as we scale so that our customers can also adopt them.

Runbooks are divided into 2 main sections:

When writing a new runbook, be mindful of what its goal is:

Chef cookbooks

Some basic rules:

Generally our chef cookbooks live in the open, and they get mirrored back to our internal cookbooks group for availability reasons.

Some cookbooks could become a security concern, in which case it is OK to keep them in our private GitLab instance. This should be assessed on a case-by-case basis and documented properly.

Internal documentation

Available in the Chef Repo. Some documentation is specific to GitLab.com: things that are specific to our infrastructure providers or that would create a security threat for our installation.

Still, this documentation is in the Chef Repo, and we aim to start pulling things out of there into the runbooks, until this documentation is thin and GitLab.com only.

Outages and Blameless Post Mortems

Every time there is a production incident we will create an issue in the infrastructure issue tracker with the outage label.

In this issue we will gather the following information:

These issues should also be tagged with any other label that makes sense, for example, if the issue is related to storage, label it as such.

The responsibility of creating this post mortem is initially on the person that handled the incident, unless it gets assigned explicitly to someone else.

Public by default policy

These blameless post mortems have to be public by default with just a few exceptions:

That's it, there are no other reasons.

If what's blocking us from revealing this information is shame because we made a mistake, that is not a good enough reason.

The post mortem is blameless because our mistakes are not one person's mistake but a company mistake. If we made a bad decision because our monitoring failed, we have to fix our monitoring, not blame someone for making a decision based on insufficient data.

On top of this, blameless post-mortems help in the following ways:

Once this post mortem is created, we will tweet from the GitLabStatus account with a link to the issue and a brief explanation of what it is about.

Making Changes to GitLab.com

The production environment of GitLab.com should be treated with care, since we strive to make GitLab.com so reliable that it can be mission-critical for our customers. Allowable downtime is a scarce resource.

Therefore, to be able to deploy useful and cool new things into production, we need to

Changes that need a checklist and a schedule

For example:

Changes that may need a checklist, but not explicit scheduling

For example:

Changes in staging

Testing things in staging typically only needs scheduling to avoid conflicting with others, but is otherwise straightforward since it is mostly self-service.


Deployments of releases and release candidates are a special case. We don't want to block deployments if we can avoid it, and since they are currently performed by release managers, there is generally no need for someone from production engineering to be heavily involved. Still, follow these steps to schedule a deploy:

Production Change Checklist

Any team or individual can initiate a change to GitLab.com by following this checklist. Create an issue in the infrastructure issue tracker and select the change_checklist template.

Make GitLab.com settings the default

As stated in the production engineer job description, one of the goals is "Making GitLab easier to maintain to administrators all over the world". One of the ways we do this is by making GitLab.com settings the default for all our customers. It is very important that GitLab.com is running GitLab Enterprise Edition with all its default settings. We don't want users running GitLab at scale to run into any problems.

If it is not possible to use the default settings, the difference should be documented in GitLab.com settings before applying them to GitLab.com.

Involving Azure