29 Apr 2021

Sysadmin Technical Debt or why admins never have time


Technical debt is a commonly used metaphor in IT. It refers to shortcuts that make work more difficult in the long term. While we hear a lot about it in software development, it's rare to find someone talking about sysadmin technical debt. Few realize that its sources can be drastically different from those of the more "popular" software technical debt.

The difference pops up because system administration work is different from programming. In programming, you're making something, adding a feature, or fixing a bug. In administration, you combine already available tools and hardware to create systems that satisfy requirements. Sysadmins are also expected to keep those systems running indefinitely. Not really forever, but usually for a long, undefined period of time.

Preferably, your systems should never, ever, and not even then go offline. That makes production problems quite stressful and hard to fix. And unlike in programming, the quick and dirty fix that keeps things online is harder to clean up later. Playing around in production tends to break production. Quite a few changes also require an app or server restart, which in production usually means scheduling it and notifying people. This sort of social activity is generally frowned upon by sysadmins, meaning that the cleanup either never happens or waits for the next big outage, which sysadmins use as an opportunity to play around in production.

Major pain points of sysadmin technical debt

All technical debt is bad, and sysadmin technical debt is no different. Below are a few cultural and organizational practices that massively increase technical debt. You probably have a lot of sysadmin technical debt if you constantly have to poke sysadmins to get anything done.

Snowflake systems

A snowflake system is a system that exists as a "one-off" without easy reproducibility. It's often hard to change and easy to break. The name comes from "special snowflake", generally meaning an easily upset and delusional person. When no one really knows how to change a system, and when that system breaks on every change, we call it a "snowflake".

Because of their fragility, they break easily, and for the same reason every hotfix is hard to replace with a proper fix later on. They simply end up as a house of cards, with years of hotfixes that somehow keep everything together. As long as you don't touch it, look at it, or walk too loudly next to its server. There are two major ways of addressing such systems: making them reproducible and remaking them in a more stable environment.

Reproducibility will allow you to redeploy the previous "working" version. This means that you will be free to change and experiment, because you will have a rollback in case things go bad. The current hip way to deal with this is to use Infrastructure as Code (IaC) to automate deployments. The actual implementation can be through GitOps, Ansible, Puppet, or any other IaC tool.

Stabilizing an environment may end up being hard or even impossible. Let's face it, some software is just too ancient or messy to be maintainable without years of expertise. Mail servers and LDAP servers are notorious for their complexity. At the very least, ensure that the system gets documented, including its design decisions, tradeoff justifications, and hotfixes. Then slowly move it towards a cleaner state whenever possible.
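
The rollback idea behind reproducibility can be sketched in a few lines of Python. This is only a toy model under loud assumptions: the ReleaseHistory class, its deploy and rollback methods, and the nginx/workers values are all made up for illustration. A real setup would keep the declarations in Git and let a tool such as Ansible or Puppet apply them.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ReleaseHistory:
    """Hypothetical store that remembers every deployed configuration."""

    releases: list = field(default_factory=list)

    def deploy(self, config: dict) -> dict:
        # Keep a deep copy so later edits by the caller cannot rewrite history.
        self.releases.append(copy.deepcopy(config))
        return self.current()

    def current(self) -> dict:
        if not self.releases:
            raise LookupError("nothing has been deployed yet")
        return copy.deepcopy(self.releases[-1])

    def rollback(self) -> dict:
        # Drop the newest release and fall back to the one before it.
        if len(self.releases) < 2:
            raise LookupError("no previous release to roll back to")
        self.releases.pop()
        return self.current()


history = ReleaseHistory()
history.deploy({"nginx": "1.18", "workers": 4})  # known "working" version
history.deploy({"nginx": "1.20", "workers": 8})  # the experiment
restored = history.rollback()                    # the experiment went bad
print(restored)
```

The point of the sketch is that the previous state is written down somewhere other than on the server itself, so experimenting no longer risks losing the only working copy.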

Extreme territorialism

This is commonly expressed by sysadmins jealously restricting access to the servers and services that they maintain. It is quite common in sysadmin cultures with a long tradition of one-man sysadmin teams. The same goes for organizations that never really considered upgrading from an IT Crowd-like setup of a few admins in a basement.
