After Hours Calls: How Software Teams Decide When to Wake People

Software teams face a critical challenge: determining which problems justify waking an engineer at 3 AM and which can wait until morning. This article examines how leading organizations establish effective on-call escalation policies, drawing on insights from engineering leaders and incident response experts. The strategies covered range from tracking time-sensitive issues to implementing automated decision frameworks that reduce alert fatigue while maintaining system reliability.

Prioritize Harm That Escalates Over Time

The threshold we've settled on is surprisingly simple: we don't wake people up because a system is broken, we wake people up because customers are being harmed in a way that compounds with time.
There was an incident where part of our service was degraded late at night. Error rates were elevated, but the system was still technically running. The dashboards looked ugly enough that an inexperienced team might have declared an emergency immediately. What mattered, though, was understanding the shape of the failure. Were users hitting a temporary inconvenience that would still be there in the morning, or were they losing data, missing critical workflows, or building up a backlog that would become much harder to unwind later?
That distinction changed how we handled incidents. A lot of founders assume uptime is the thing to optimize for. In practice, customer impact is the thing to optimize for. Sometimes a 5% outage deserves a phone call at 2 a.m. because it affects a critical action. Other times a much larger technical issue can safely wait because users have workarounds and nothing irreversible is happening.
One signal I've come to trust is whether every passing hour makes the recovery materially more painful for customers. If the answer is yes, people get woken up. If the answer is no, and the problem is stable rather than cascading, normal business hours are often the better choice.
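One way to make that test explicit is to write it down as a small rule. The sketch below is only an illustration of the reasoning above, not the contributor's actual tooling; the `Incident` fields and the `should_page` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All fields are hypothetical labels for the questions asked in the text.
    data_loss: bool              # are users losing data right now?
    backlog_growing: bool        # is work piling up that gets harder to unwind?
    critical_flow_blocked: bool  # is a critical customer action failing?
    has_workaround: bool         # can users still get the job done another way?


def should_page(incident: Incident) -> bool:
    """Page only when each passing hour makes recovery worse for customers."""
    # Irreversible or compounding harm always justifies a wake-up.
    if incident.data_loss or incident.backlog_growing:
        return True
    # A blocked critical action with no workaround cannot wait either.
    if incident.critical_flow_blocked and not incident.has_workaround:
        return True
    # Stable, non-compounding problems wait for business hours.
    return False
```

Under this rule, an ugly but stable dashboard with a workaround available returns `False`, while a small failure that blocks a critical action returns `True`.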
The unexpected lesson is that many overnight pages aren't really about customer impact at all. They're about the discomfort of seeing a graph move in the wrong direction. Those are very different things. Teams that can't tell the difference tend to burn out their best engineers long before they improve their reliability.

Use KPI Spikes With Automated Decision Briefs

We wake the team when a business KPI we own, cTAT90 (scan to signed report), shows an out-of-hours spike that meaningfully degrades clinician-facing service. Our clearest signal is an automated one-page decision brief that pulls ops data, compares current cTAT90 against its normal baseline, and triggers push alerts to on-call when the deviation exceeds expected variance. In one real incident, that brief flagged a night-shift cTAT90 spike. We pushed alerts to on-call and ran a time-boxed edge-caching pilot instead of immediately buying more capacity. Six weeks later, the pilot had reduced cTAT90 from ~70 to ~55 minutes and given us the evidence to prioritize the full fix.
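The baseline comparison can be sketched in a few lines. This is a minimal illustration that assumes "expected variance" is measured as a number of standard deviations above the historical mean; the source does not say how Medicai computes it, and the function names, the threshold of 3.0, and the sample values are all made up for this example.

```python
from statistics import mean, stdev
from typing import Sequence


def deviation_score(baseline_minutes: Sequence[float], current_minutes: float) -> float:
    """How many baseline standard deviations the current value sits above the mean."""
    mu = mean(baseline_minutes)
    sigma = stdev(baseline_minutes)  # needs at least two baseline readings
    if sigma == 0:
        # A perfectly flat baseline: any increase at all counts as a deviation.
        return float("inf") if current_minutes > mu else 0.0
    return (current_minutes - mu) / sigma


def should_alert(
    baseline_minutes: Sequence[float],
    current_minutes: float,
    z_threshold: float = 3.0,  # assumed cutoff, tune to your own data
) -> bool:
    """Alert on-call only when the KPI is well above its normal range."""
    return deviation_score(baseline_minutes, current_minutes) > z_threshold


# Made-up readings for illustration only.
night_baseline = [54.0, 55.0, 56.0, 55.0, 54.0, 56.0]
print(should_alert(night_baseline, 70.0))  # far outside the usual range
print(should_alert(night_baseline, 56.0))  # within normal variation
```

Only upward deviations alert here, since a faster turnaround is not an incident. A production version would likely keep separate baselines per shift or per hour of day.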

Andrei Blaj, Co-founder, Medicai

Alert Owners If Core Actions Fail

Wake the team when the degradation threatens a customer-facing promise, not when a dashboard simply looks noisy.
At Ronas IT, the signal I trust most is not a single alert. It's a match between technical evidence and user impact. If monitoring shows errors or latency in a production path and that path blocks a core action, such as login, checkout, booking, payment, or data submission, we wake the responsible engineer. If the problem affects an internal admin screen, a delayed report, or a non-critical background task, we document it, add context for the morning team, and keep watching.
The threshold I like is: can a normal user complete the action they came for? If the answer is no, it's an incident. If the answer is yes but the experience is slower, we look at scope and trend. A short spike that recovers is different from a slow climb that keeps getting worse every few minutes. The second one deserves attention because waiting can turn a contained fix into a recovery problem.
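The difference between a short spike and a slow climb can be checked mechanically. The following is a rough sketch, assuming you have a short window of recent error-rate readings ordered oldest first; `classify_trend`, its labels, and the three-reading window are illustrative choices, not something described by the contributor.

```python
from typing import Sequence


def classify_trend(samples: Sequence[float], threshold: float) -> str:
    """Label a window of error-rate readings (oldest first)."""
    over = [s > threshold for s in samples]
    if not any(over):
        return "healthy"
    if not over[-1]:
        # It crossed the threshold earlier but has since come back down.
        return "recovered_spike"
    # Still above the threshold: is it getting worse with each reading?
    recent = list(samples[-3:])
    rising = len(recent) == 3 and all(b > a for a, b in zip(recent, recent[1:]))
    return "worsening" if rising else "elevated_stable"
```

A "worsening" result is the case that deserves attention even when users can still complete the action, because waiting can turn a contained fix into a recovery problem.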
One good call came from treating customer impact as the deciding factor. We saw degraded performance outside business hours, but the first alerts were not enough to wake the whole team. The turning point was when synthetic checks and logs showed the same pattern on a core user flow. At that point, the question changed from "is the server unhappy?" to "are users failing to complete the product's main action?" We woke the engineer who owned that area, kept the rest of the team asleep, and handled it as a focused incident instead of a panic broadcast.
My advice is to define wake-up rules before anyone is tired. Use business flows as the threshold, assign clear ownership, and separate "wake one accountable person" from "wake everyone." A good after-hours process protects both uptime and the team's ability to think clearly the next day.
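Writing the wake-up rules down ahead of time can be as simple as a routing function. The sketch below assumes two independent signals (a monitoring alert and a failing synthetic check) and a flow-to-owner map. The flow names come from the examples earlier in this section, while the owner names, the fallback contact, and the `route` function are hypothetical.

```python
from typing import Optional, Tuple

# Core actions a normal user comes to complete (taken from the examples above).
CORE_FLOWS = {"login", "checkout", "booking", "payment", "data_submission"}

# Hypothetical ownership map: one accountable engineer per core flow.
OWNERS = {
    "login": "engineer_a",
    "checkout": "engineer_b",
    "booking": "engineer_b",
    "payment": "engineer_c",
    "data_submission": "engineer_a",
}


def route(flow: str, monitoring_alert: bool, synthetic_check_failed: bool) -> Tuple[str, Optional[str]]:
    """Decide who, if anyone, gets woken for a degraded flow."""
    if flow in CORE_FLOWS and monitoring_alert and synthetic_check_failed:
        # Technical evidence and user impact agree on a core flow:
        # wake the one accountable owner, not the whole team.
        return ("page_owner", OWNERS.get(flow, "on_call_primary"))
    if monitoring_alert or synthetic_check_failed:
        # Something looks off, but it is not a confirmed core-flow failure:
        # document it for the morning team and keep watching.
        return ("note_for_morning", None)
    return ("no_action", None)
```

The function never returns a "wake everyone" outcome on purpose. Escalating from one owner to a wider group is a separate decision that the owner makes once they are awake and have context.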

Enforce a Predefined Severity Matrix

The decision to wake an engineering team at 3:00 AM should never be made in the heat of the moment; it must be governed by a severity matrix codified long before an incident occurs. When a system degrades, the question is not whether the team is capable of fixing it, but whether the business impact justifies the operational cost and the cognitive fatigue that inevitably follows.
I rely on three non-negotiable criteria: direct financial impact, safety or compliance risks, and the complete paralysis of a critical business unit. If a manufacturing line stalls, a payroll run fails, or a regulatory deadline faces immediate risk, we wake the team. These scenarios are binary and non-debatable. Conversely, minor degradations in non-critical modules or systems with viable manual workarounds wait for business hours.
The danger of waking a team for low-impact issues is the rapid erosion of trust in the on-call process. When every alert is treated as an emergency, effectively none of them is. I instruct my teams to rigorously evaluate the impact-versus-effort threshold. If the effort to fix the issue immediately is high but the cost of waiting is low, we document it, triage it during morning stand-up, and address it then.
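A severity matrix of this kind can be codified so that nobody has to interpret it at 3:00 AM. What follows is a minimal sketch of the three criteria named above; the `Action` enum, the `decide` function, and the parameter names are assumptions made for illustration, not the contributor's actual matrix.

```python
from enum import Enum


class Action(Enum):
    WAKE_TEAM = "wake_team"
    MORNING_TRIAGE = "morning_triage"


def decide(
    direct_financial_impact: bool,
    safety_or_compliance_risk: bool,
    critical_unit_paralyzed: bool,
) -> Action:
    """Apply the three wake-up criteria; any single one is sufficient."""
    if direct_financial_impact or safety_or_compliance_risk or critical_unit_paralyzed:
        return Action.WAKE_TEAM
    # Everything else is documented and triaged at morning stand-up.
    return Action.MORNING_TRIAGE


# A failed payroll run has direct financial impact, so it wakes the team.
print(decide(direct_financial_impact=True,
             safety_or_compliance_risk=False,
             critical_unit_paralyzed=False))
```

Keeping the inputs boolean mirrors the point that these scenarios are binary: the debate about what counts as financial impact or paralysis happens when the matrix is written, not during the incident.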
Preserving a team's mental energy is a strategic asset. By establishing a rigid, transparent threshold for what constitutes a wake-up event, I protect my engineers from burnout and ensure that when a genuine catastrophe occurs, they are rested, focused, and ready to deploy the right solution. Effective incident management is not about working harder at night; it is about having the discipline to distinguish between a fire that requires immediate intervention and a problem that can wait for the light of day.

Girish Songirkar, Delivery Manager, Enterprise Software Engineering, Arionerp

Copyright © 2026 Featured. All rights reserved.