In any software company, whether product or SaaS, most people recognize that quality of service, security, and uptime are just as important as the features you deliver. The phrase “keep the lights on” (KTLO) became popular as enterprise IT matured, particularly in the 1990s and early 2000s, when companies began to formally recognize and budget for the ongoing maintenance overhead essential to sustaining business operations. Observability and traceability became increasingly important, and practices like “chaos engineering” gained attention, most famously through Netflix’s Chaos Monkey. KTLO encompasses the essential maintenance tasks that keep your software operational: bug fixes, security patches, performance tuning, and more.

As a decision maker or engineering manager, you must recognize KTLO’s dual role. On one side, there’s the managerial challenge: balancing budgets, allocating resources, and deciding how much capacity to reserve for sustaining operations versus driving innovation. On the other side is the technical reality your developers and operations teams grapple with every day. With the right metrics in hand, you can pinpoint where your KTLO efforts need reinforcement, and you can quantify and optimize the maintenance work itself.

Best practices in KTLO planning & management

If you care about your enterprise’s longevity and long-term success, prioritize and budget KTLO just as carefully as you do new feature development. Maintenance work is non-negotiable: it’s the bedrock that keeps the system reliable. Allocate a fixed percentage of your engineering resources specifically for KTLO activities, and use robust metrics to continuously track performance indicators such as mean time to repair (MTTR), bug resolution rates, and downtime. These insights let you fine-tune your allocations so that money spent on maintenance goes where it has the most impact. Extracting those metrics is not always easy, but we’re building tools to achieve exactly that in Milestone.
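To make two of these indicators concrete, here is a minimal Python sketch of how MTTR and downtime could be computed from incident records. The `Incident` type and its field names are hypothetical, and the downtime calculation assumes incidents do not overlap; real data from a ticketing or paging tool would need its own mapping.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """One resolved outage or bug (hypothetical field names)."""
    opened_at: datetime
    resolved_at: datetime


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Average time between an incident being opened and resolved."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((i.resolved_at - i.opened_at for i in incidents), timedelta())
    return total / len(incidents)


def downtime_fraction(
    incidents: list[Incident], window_start: datetime, window_end: datetime
) -> float:
    """Share of the reporting window spent in an incident.

    Assumes incidents do not overlap; each one is clipped to the window.
    """
    window = window_end - window_start
    if window <= timedelta(0):
        raise ValueError("window_end must be after window_start")
    down = timedelta()
    for incident in incidents:
        start = max(incident.opened_at, window_start)
        end = min(incident.resolved_at, window_end)
        if end > start:
            down += end - start
    return down / window
```

Tracking these two numbers per quarter is one simple way to check whether a change in KTLO allocation is actually moving reliability.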

If your budget and team size allow it, you can establish dedicated KTLO teams. If they don’t, the next best option is to rotate KTLO responsibilities among your existing developer and infrastructure teams. Rotation keeps all engineers engaged with the product’s overall health and fosters a more holistic view of your infrastructure. Use your data to determine which approach minimizes downtime and maximizes efficiency in your specific context.

It’s also a good idea to monitor and manage technical debt. A proactive maintenance schedule helps you avoid ugly and stressful situations, such as unexpected downtime or the discovery of a zero-day vulnerability in your codebase or in one of your dependencies, which could expose you directly or indirectly. As a rule of thumb, you can use the 20/80 rule to dedicate a sustainable percentage of engineering effort to KTLO, and you can go further by monitoring your infrastructure and calculating what percentage of time gives the best tradeoff between effort and cost.

Best practices for developers & infrastructure teams

Predictable project structures and builds

As a developer, infrastructure engineer, or team leader, you can streamline KTLO for your teams by embracing a culture of automation and robust monitoring. You can minimize overhead by providing default “cookie-cutter” repo templates pre-populated with CI/CD pipelines for testing and code coverage, vulnerability checking, and building and publishing to your artifact registry. Make sure to parametrize those recipes so that you can bump language and project dependencies in one place, and standardize artifact naming so that you never have to do it manually again. Nothing is more frustrating than watching a build fail because the same value had to be changed in two places.
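The “one place” idea can be illustrated with a small Python sketch that renders a pipeline definition from a single parameter set. The pipeline format, the service name, and the registry host below are all made up for illustration and do not correspond to any particular CI system; the point is only that versions and artifact names are defined once.

```python
from string import Template

# Hypothetical parameters: the single place where versions and names get bumped.
PARAMS = {
    "service": "billing-api",
    "lang_version": "1.22",
    "registry": "registry.example.com",
}

# Illustrative pipeline text, not the syntax of a real CI product.
# "$$" is string.Template's escape for a literal "$".
PIPELINE_TEMPLATE = Template(
    """\
image: golang:$lang_version
steps:
  - test: go test -cover ./...
  - build: docker build -t $registry/$service:$$GIT_SHA .
  - publish: docker push $registry/$service:$$GIT_SHA
"""
)


def render_pipeline(params: dict[str, str]) -> str:
    """Fill the template; a missing parameter raises KeyError instead of
    silently producing a half-configured pipeline."""
    return PIPELINE_TEMPLATE.substitute(params)
```

Because `substitute` fails loudly on a missing key, a forgotten parameter surfaces when the template is rendered rather than halfway through a build.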

Always-on static code analysis

When it comes to bugs, patches, and security fixes, efficiency is critical. Implement rolling updates that minimize disruptions instead of deploying big batch updates that can introduce risk. Keep your dependencies current by proactively tracking library vulnerabilities and integrating DevSecOps practices into your workflows. Use a secrets management tool like Vault to avoid committing secrets to your repos and the overhead of erasing them afterward. Many modern languages come with native static code analysis and vulnerability checkers, such as Go’s govulncheck, which scan a project’s dependency tree and, if configured as a required step in a CI/CD pipeline, can stop a build with a known vulnerability from succeeding. This proactive method integrates security into your KTLO processes with minimal effort and long-lasting results. Code linters in the CI/CD pipeline are also an elegant way to reduce technical debt. Many of them can be configured to match your codebase’s coding style, which helps ensure uniformity and clarity and lowers the need for refactoring later.
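As a rough sketch of such a pipeline gate, the Python helper below runs a scanner command and treats any non-zero exit status as a failed check. It assumes the scanner signals findings through its exit status, which is how govulncheck behaves in its default text output mode; the function name `scanner_passes` is hypothetical.

```python
import subprocess
import sys


def scanner_passes(command: list[str]) -> bool:
    """Run a scanner command and report whether it exited with status 0.

    Assumption: the scanner uses a non-zero exit status to signal findings.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the scanner's own report so the failure is actionable.
        print(result.stdout, result.stderr, sep="\n", file=sys.stderr)
        return False
    return True
```

In a pipeline step, this could be called as `scanner_passes(["govulncheck", "./..."])`, assuming the tool is installed on the build agent, with the step exiting non-zero when the function returns `False`.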

Observability and monitoring

Developers can build observability and monitoring directly into the code, making it an integral part of their KTLO mindset. This allows you to track requests across services and build statistics about the code these services are running, such as which functions were executed, how long they took, what arguments they used, and what their response was. This information can be transmitted to a central collection point like Elastic APM, where dashboards and graphs make trends visible.

This is a prime example of KTLO since it’s live and easily understandable by nearly everyone. It provides intuition about bottlenecks and control flow and allows for automated alarms when things are about to go wrong. While tricky to implement, the visibility this provides is invaluable. Milestone can help identify patterns in this data and draw valuable insights from it.
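The per-function tracing described above can be sketched with a small Python decorator that records the function name, arguments, response, and duration of each call. The in-memory `SPANS` list is only a stand-in for a real collector; an actual setup would hand these records to an APM agent, and the names `traced` and `price_with_tax` are illustrative.

```python
import functools
import time
from typing import Any, Callable

# Stand-in for a central collection point; a real system would ship
# these records to an APM agent instead of keeping them in memory.
SPANS: list[dict[str, Any]] = []


def traced(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record name, arguments, result, outcome, and duration of each call."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        outcome = "error"
        result = None
        try:
            result = func(*args, **kwargs)
            outcome = "ok"
            return result
        finally:
            # Runs on success and on exceptions, so failures are traced too.
            SPANS.append(
                {
                    "function": func.__qualname__,
                    "args": repr(args),
                    "kwargs": repr(kwargs),
                    "result": repr(result),
                    "outcome": outcome,
                    "duration_s": time.perf_counter() - start,
                }
            )

    return wrapper


@traced
def price_with_tax(net: float, rate: float = 0.2) -> float:
    """Example business function (hypothetical)."""
    return round(net * (1 + rate), 2)
```

Recording arguments and responses verbatim can leak sensitive data, so a production version would likely redact or sample them before sending anything to a collector.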

Self-healing infrastructure

To reduce firefighting and midnight calls, look into your orchestrator’s capabilities for probing your components’ readiness and health, and configure it to automatically restart services that it deems unhealthy. Typically, the control plane can tell you how many and which services restarted, what image they were running at the time, and possibly which logs were emitted close to the restart. This way, you can avoid stress and let your team focus on innovation. Combined with AI-driven monitoring tools that provide predictive insights and automated adjustments, this approach makes for a robust, resilient infrastructure. Together, they help your operations run smoothly and keep your KTLO efforts efficient.
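In practice the orchestrator does this for you through configuration, but the underlying logic is simple enough to sketch in Python: probe the service on a schedule and restart it after several consecutive failures. The `Supervisor` class, its field names, and the default threshold of three are assumptions made for illustration, not any orchestrator’s API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Supervisor:
    """Restart a service after repeated failed health probes (illustrative)."""

    probe: Callable[[], bool]      # returns True when the service is healthy
    restart: Callable[[], None]    # restarts the service
    failure_threshold: int = 3     # consecutive failures tolerated before restart
    consecutive_failures: int = 0
    restarts: int = 0

    def tick(self) -> None:
        """Run one probe cycle; call this on a fixed interval."""
        if self.probe():
            # A healthy probe resets the failure streak.
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.restart()
            self.restarts += 1
            self.consecutive_failures = 0
```

Requiring several consecutive failures before restarting avoids bouncing a service on a single slow response, and the `restarts` counter is the kind of figure a control plane typically exposes for later review.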

Conclusion

For managers, effective KTLO involves balancing infrastructure maintenance with innovation. To this end, allocate a fixed percentage of engineering capacity to KTLO and use robust metrics, like MTTR and downtime, to proactively manage technical debt and optimize resource allocation. Modern AI tools can further enhance this process by monitoring performance, predicting issues, and helping keep maintenance efforts aligned with business objectives.

Reducing the KTLO burden for developers and infrastructure teams means embracing automation and standardization. To handle updates and vulnerabilities efficiently, implement templated repositories with CI/CD pipelines, automated testing, and static code analysis. Enhance observability with integrated monitoring tools and self-healing orchestrators, augmented by AI-driven insights that detect emerging hotspots before they escalate.
