Sustainable Operations

I have recently been reading the code in the Arch Linux infrastructure repository, the operations repository of the Arch Linux operations team. Through Infrastructure as Code, the team has achieved what I would call sustainable operations.

What sustainable operations means:

- As time passes and operations personnel change, operational procedures are not lost and servers do not become black boxes.
- Whether managing 1 server or 10,000, long-term operability, maintainability, and strong consistency are guaranteed. Examples include:
  - Continuously upgrading k8s clusters to secure versions: whether there is one cluster or many, the middleware upgrade process remains essentially the same (continuous service updates).
  - New team members can understand existing services and architecture by reading the operations code. Once they have the foundational skills in Infrastructure as Code (as confirmed during the interview), they are capable of sustained operations (continuous renewal of personnel).

Here, “as code” primarily means using Ansible to record the desired state of each server (which services run, which configuration they need, and so on) as YAML files in a Git repository, then collaborating and iterating on those files through GitLab.
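To make the "state as code" idea concrete, here is a minimal sketch in plain Python. It does not use Ansible; the service names, the state values, and the functions plan_changes and converge are all hypothetical. The point is only that a declared state can be compared with the actual state, and that applying it a second time changes nothing.

```python
from typing import Dict, List

# Hypothetical desired state, as it might be loaded from a YAML file kept in Git.
DESIRED: Dict[str, str] = {"nginx": "running", "sshd": "running", "telnetd": "absent"}


def plan_changes(desired: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Return the actions needed to move `actual` to `desired`."""
    actions = []
    for service, state in sorted(desired.items()):
        # A service missing from `actual` is treated as absent.
        if actual.get(service, "absent") != state:
            actions.append(f"set {service} -> {state}")
    return actions


def converge(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `actual` with every declared service in its desired state."""
    new_state = dict(actual)
    new_state.update(desired)
    return new_state


# A made-up server that has drifted from the declared state.
server = {"nginx": "stopped", "telnetd": "running"}
first_plan = plan_changes(DESIRED, server)
server = converge(DESIRED, server)
second_plan = plan_changes(DESIRED, server)  # nothing left to do
```

Because the second plan is empty, re-running the same operations code is safe, which is what makes a Git commit a usable description of a server.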

Advantages of Infrastructure as Code:

First, versioning. Since all operational procedures are recorded in the GitLab repository via Ansible, any internal employee can see every change made to any server. For example:

- You can use Git to find out how a particular service on a specific server was deployed, when it was deployed, and who performed the deployment or change.
- You can verify directly in the repository whether a server’s configuration needs updating without logging into any production server (manually logging into servers is dangerous).
- You can find all architectural information in a single repository. Avoiding architectural black boxes is tremendously helpful for reviewing and optimizing the operations architecture.

Second, server state versioning. Taking an Nginx cluster configuration as an example, the Nginx cluster state is saved as corresponding Git commits in GitLab. If an Nginx change causes a failure, you can quickly roll back the operations code via git checkout <commit_id> and re-run Ansible to restore the service. Some might ask: when modifying an Nginx configuration, why not just:

cp nginx.conf nginx.conf.bak

And when a failure occurs:

cp -f nginx.conf.bak nginx.conf && nginx -s reload

Why bother with Infrastructure as Code? For one thing, imagine that your cluster has 30 nodes and you need to repeat the steps above on every server: the amount of manual work grows, and so does the likelihood of errors. For another, sustained operations involve many changes over time. If 30 servers each undergo 30 changes, every server can end up with 30 different nginx.conf.bak files, which makes troubleshooting harder. With versioned operations, you can quickly identify configuration differences via git diff <commit1> <commit2> nginx.conf.
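As a rough illustration of what such a comparison gives you, the sketch below uses Python's difflib to diff two made-up versions of nginx.conf. In a real repository git diff does this against actual history; the config contents and the helper config_diff are assumptions for the example.

```python
import difflib
from typing import List

# Two hypothetical versions of the same file, standing in for two commits.
OLD_CONF = """worker_processes 2;
events { worker_connections 1024; }
"""
NEW_CONF = """worker_processes 4;
events { worker_connections 1024; }
"""


def config_diff(old: str, new: str, name: str = "nginx.conf") -> List[str]:
    """Return a unified diff between two versions of a config file."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=f"a/{name}", tofile=f"b/{name}", lineterm=""))


diff_lines = config_diff(OLD_CONF, NEW_CONF)

# Keep only the added/removed lines, skipping the "---"/"+++" file headers.
changed = [line for line in diff_lines
           if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]
print("\n".join(diff_lines))
```

One diff against a single source of truth replaces comparing dozens of .bak files scattered across servers.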

Third, quality improvement. Quality degradation is a nightmare in large-scale cluster operations. If one error is applied to all servers (say, 100 servers), it produces 100 errors. Therefore, quality improvement is crucial for sustainable operations. How does Infrastructure as Code improve quality? Git and GitLab provide multi-person collaboration capabilities. All operations code undergoes code review on GitLab. Anyone can submit operations code, but only reviewed and reliable code is merged into the main branch and ultimately applied to production servers by personnel with the appropriate permissions and experience. Operations quality gradually improves through successive rounds of review and iteration.

Fourth, high efficiency. Once we have high-quality operations code through collaboration, the operational cost is nearly the same whether managing 1 server or 100 servers. By executing the corresponding operations code, Ansible applies changes to all servers (regardless of how many). Once a change is proven reliable on one server, we can expect the same result on the remaining 99, provided they are configured consistently.

Examples include:

- Once a k8s cluster can be stably upgraded through operations code, k8s clusters in other data centers can also be stably upgraded.
- Once a production service can be stably upgraded through operations code, production services in other data centers can also be stably upgraded.
- Once a Linux distribution can be stably upgraded through operations code, the same Linux distribution in other data centers can also be stably upgraded.

Traditional manual operations consist of repeating simple tasks. Large volumes of repetitive work easily lead to human error and lower the morale of operations personnel. The “nobody dares to manage it, nobody wants to manage it” problem comes from declining motivation and rigid responsibility structures, and it ultimately makes operations unsustainable at the human level. People get tired, but code does not.

Again using Nginx configuration updates as an example: if 100 Nginx servers need a configuration update, the traditional approach is to modify one server, test it, sync the file to the other 99 servers via scp or rsync, and run nginx -t && nginx -s reload on each one, which involves a great deal of manual work. With Infrastructure as Code, we only need to commit the Nginx configuration change and push it to GitLab for review; after approval, an authorized person executes the operations code. Ansible handles the configuration updates and reliability checks across all 100 Nginx servers, drastically reducing manual operations.
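The "check, then reload, on every server" loop can be sketched as follows. This is not how Ansible is implemented; it is a simplified Python model in which validate stands in for nginx -t and apply stands in for nginx -s reload, and the host names are invented. Stopping at the first failed check is a design choice of this sketch, not a statement about Ansible's defaults.

```python
from typing import Callable, List, Optional, Tuple


def rollout(hosts: List[str],
            validate: Callable[[str], bool],
            apply: Callable[[str], None]) -> Tuple[List[str], Optional[str]]:
    """Apply a change host by host, stopping at the first failed check.

    Returns the hosts that were updated and the host that failed (or None).
    """
    updated: List[str] = []
    for host in hosts:
        if not validate(host):   # stands in for `nginx -t`
            return updated, host
        apply(host)              # stands in for `nginx -s reload`
        updated.append(host)
    return updated, None


# 100 hypothetical servers; web042 is made to fail its config check.
hosts = [f"web{i:03d}" for i in range(1, 101)]
reloaded: List[str] = []
updated, failed = rollout(hosts, lambda h: h != "web042", reloaded.append)
print(f"updated {len(updated)} hosts, stopped at {failed}")
```

The same function handles 1 host or 100, which is the efficiency argument in miniature: the human effort goes into writing and reviewing the loop once.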

Fifth, improved security. With reliable review processes and efficient operations, operations gradually become stable and reliable, and human errors are greatly reduced. Furthermore, high efficiency restores and improves operations personnel’s motivation (because simple tasks only need to be done once, not 100 times), and security patches are applied more promptly and reliably.

Sixth, automation. Through GitLab and GitLab Runner, some simple but operationally expensive tasks can be automated, such as:

- Automatically updating account passwords on all servers
- Automatically checking security patch installation status on all servers
- Automatically checking recent login activity on all servers
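As one example, a patch-status check could look roughly like the sketch below. It assumes that a scheduled job has already collected the installed version of some package from every server; the host names, version numbers, and the functions parse_version and unpatched_hosts are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.0.13' into (3, 0, 13) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def unpatched_hosts(installed: Dict[str, str], minimum: str) -> List[str]:
    """Return the hosts whose installed version is older than `minimum`."""
    required = parse_version(minimum)
    return sorted(host for host, version in installed.items()
                  if parse_version(version) < required)


# Hypothetical report gathered from all servers by a scheduled job.
REPORTED = {"web01": "3.0.13", "web02": "3.0.9", "db01": "3.0.15"}
outdated = unpatched_hosts(REPORTED, "3.0.13")
print(outdated)
```

Comparing versions as tuples of integers matters here: as plain strings, "3.0.9" would sort after "3.0.13" and the outdated host would be missed.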

The sustainability of operations (across time and in the context of staff turnover) is essentially about solving a series of problems: avoiding operational black boxes, avoiding inefficiency in large-scale operations, and preventing declining motivation among operations personnel.

There is always a gap between ideals and reality. Never forget the ideal, but live in the real world.