Alert fatigue is the single most common reason an industrial IoT product stops working — not because the sensors failed, but because the operators silenced the app a month in and stopped looking at it. Every team I've worked with has tripped over this at some stage. It's not a "tune the thresholds better" problem. It's a design problem.
This is a short framework for thinking about alerts in industrial, agricultural and cold-chain IoT — the lens I wish I'd had on day one of my first deployment.
The three-layer model
Useful alerting isn't one thing. It's three distinct layers that most products conflate into a single "notification" surface:
- Observation — something in the system crossed a threshold. This is what the sensor saw.
- Interpretation — is the observation actually meaningful in context? Is it a transient spike, a known maintenance window, a seasonal pattern, or a real anomaly?
- Action — who needs to know, when do they need to know, and what should they do about it?
Most alert pipelines I've audited skip the interpretation layer entirely. The sensor fires, the notification goes out, the operator gets paged. The observation-to-notification path is so tight that the product cannot distinguish a faulty sensor from an unfolding disaster. Both look the same on the operator's phone.
The first disciplined thing to add to any alerting system is a gap between observation and action. That gap is where interpretation lives.
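To make the gap concrete, here is a minimal sketch of the three layers as distinct stages. All names (`Observation`, `interpret`, `act`, the 8°C band) are hypothetical illustrations, not a real API — the point is only that nothing reaches the action layer without passing through interpretation:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Observation:
    """Layer 1: what the sensor saw. No judgement attached."""
    device_id: str
    metric: str
    value: float
    at: datetime

@dataclass
class Interpretation:
    """Layer 2's verdict: is the observation meaningful in context?"""
    observation: Observation
    meaningful: bool
    reason: str

def interpret(obs: Observation, in_maintenance: bool) -> Interpretation:
    """Layer 2: apply context before anything is allowed to notify."""
    if in_maintenance:
        return Interpretation(obs, meaningful=False, reason="maintenance window")
    if obs.metric == "temp_c" and obs.value > 8.0:
        return Interpretation(obs, meaningful=True, reason="above target band")
    return Interpretation(obs, meaningful=False, reason="within band")

def act(interp: Interpretation) -> Optional[str]:
    """Layer 3: only meaningful interpretations produce an action."""
    if not interp.meaningful:
        return None  # logged for the timeline, never pushed
    return f"notify: {interp.observation.device_id} {interp.reason}"
```

The same reading produces an action or silence depending on context — which is exactly the distinction a tightly-coupled sensor-to-pager path cannot make.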
Severity is a product decision, not a threshold
A common anti-pattern: "if temperature > 8°C, send alert". That rule generates thousands of alerts per month in a real facility, because temperatures fluctuate above 8°C constantly — during door openings, defrost cycles, brief load events. Each of those is fine; the sustained version is not.
Severity belongs on the alert, not on the threshold. Useful severity tiers generally look like:
- Info — logged for auditability, surfaced in a timeline. Not pushed anywhere.
- Warning — pushed to on-shift operator, no escalation. Can be acknowledged or ignored; silences on its own if the condition clears.
- Critical — pushed to on-shift operator, escalates if not acknowledged within a defined window. Continues to escalate until acknowledged.
- Safety — always escalates immediately, cannot be silenced at the device level, requires a documented resolution entered into the system before it clears.
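The tiers differ in behaviour, not just in label, so they are worth encoding as data. A sketch of the four tiers as a policy table (the field names and `TierPolicy` structure are illustrative, not a standard):

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"
    SAFETY = "safety"

@dataclass(frozen=True)
class TierPolicy:
    pushed: bool            # notified, or only logged to the timeline?
    escalates: bool         # keeps climbing the chain if unacknowledged?
    silenceable: bool       # can the device/operator silence it?
    needs_resolution: bool  # requires a documented resolution to clear?

# One row per tier, mirroring the descriptions above.
TIERS = {
    Severity.INFO:     TierPolicy(pushed=False, escalates=False, silenceable=True,  needs_resolution=False),
    Severity.WARNING:  TierPolicy(pushed=True,  escalates=False, silenceable=True,  needs_resolution=False),
    Severity.CRITICAL: TierPolicy(pushed=True,  escalates=True,  silenceable=True,  needs_resolution=False),
    Severity.SAFETY:   TierPolicy(pushed=True,  escalates=True,  silenceable=False, needs_resolution=True),
}
```

Keeping the tier semantics in one table means the rest of the pipeline asks "what does this tier do?" instead of scattering severity-specific if/else logic.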
Three rules do most of the work of moving an observation into one of those tiers:
- Sustained, not transient. A Warning only becomes Critical if the breach persists past a duration threshold, not the first time it's seen.
- Scoped, not global. A Warning that fires across 20% of your fleet simultaneously is not 20 Warnings; it's a correlated event that deserves a single aggregated notification with its own handling path.
- Context-aware, not raw. A temperature reading outside the target band during a known compressor maintenance window is not a Warning. The system should know about the maintenance window.
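The three rules above can each be expressed as a small predicate. A sketch, with hypothetical thresholds (15 minutes, 20% of fleet) standing in for whatever your domain actually needs:

```python
from datetime import datetime, timedelta

def is_sustained(breach_start, now, min_duration=timedelta(minutes=15)):
    """Rule 1: a breach only promotes Warning -> Critical once it persists."""
    return breach_start is not None and (now - breach_start) >= min_duration

def is_correlated(breached_devices, fleet_size, threshold=0.2):
    """Rule 2: many simultaneous breaches are one fleet event, not N alerts."""
    return len(breached_devices) / fleet_size >= threshold

def in_maintenance(windows, now):
    """Rule 3: readings inside a known maintenance window are suppressed."""
    return any(start <= now <= end for start, end in windows)
```

Each predicate runs in the interpretation layer, before any notification is constructed — which is what keeps door-opening spikes and defrost cycles off the operator's phone.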
Escalation is a conversation, not a broadcast
The shape of a good escalation policy looks like this: the alert starts as a quiet, dismissable notification aimed at whoever is closest to the problem — a line operator, a field technician, the farmer themselves. If they acknowledge it, the escalation stops. If they don't, after a defined wait, it tries the next tier — the shift supervisor, the regional manager — through an increasingly disruptive channel. Only if everyone in the chain is unresponsive does it reach the "wake someone up" level.
Two design rules that go a long way:
- Acknowledgement gates escalation. An acknowledgement is a promise to handle the issue, not a promise it's fixed. Escalation halts on acknowledgement; a separate resolution marks the alert closed. Conflating the two ("I acknowledged, the alert went away, then it came back an hour later and everyone was confused") is a classic operations failure.
- The operator gets to say 'snooze for 30 minutes because I'm actively fixing it'. Alerts that keep firing while the operator is physically working on the problem breed resentment. A first-class snooze-with-reason is respectful of the operator's time and produces a useful audit trail.
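The ack/resolve/snooze semantics fit naturally into a small state machine. A sketch with hypothetical timings (10-minute escalation window, three-tier chain) — the shape matters more than the numbers:

```python
from datetime import datetime, timedelta

class Alert:
    """Sketch of acknowledge-vs-resolve and snooze-with-reason semantics."""

    def __init__(self, now: datetime):
        self.acknowledged = False
        self.resolved = False
        self.snoozed_until = None
        self.tier = 0  # index into the escalation chain (0 = closest person)
        self.next_escalation = now + timedelta(minutes=10)

    def acknowledge(self):
        # A promise to handle the issue: halts escalation, does NOT close it.
        self.acknowledged = True

    def resolve(self, note: str):
        # A separate, explicit step closes the alert, with an audit note.
        self.resolved = True
        self.resolution_note = note

    def snooze(self, now: datetime, minutes: int, reason: str):
        # First-class snooze: pauses firing while someone is physically on it.
        self.snoozed_until = now + timedelta(minutes=minutes)
        self.snooze_reason = reason  # kept for the audit trail

    def tick(self, now: datetime, chain_length: int = 3):
        """Escalate to the next tier only if unacknowledged and not snoozed."""
        if self.acknowledged or self.resolved:
            return
        if self.snoozed_until and now < self.snoozed_until:
            return
        if now >= self.next_escalation and self.tier < chain_length - 1:
            self.tier += 1
            self.next_escalation = now + timedelta(minutes=10)
```

Note that `acknowledge` and `resolve` are separate methods on purpose: an acknowledged alert stops escalating but stays open until someone records a resolution, which is exactly the distinction the "it came back an hour later" failure mode erases.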
Pushing the right alert to the right person
The other ingredient most products get wrong: there is no single "operator". Who should see a particular alert depends on the kind of alert, the time of day, the scope of the problem, and who's actually rostered.
A useful mental model is a three-column matrix:
| Scope | On-shift | After-hours |
|---|---|---|
| Single device | Nearest operator | Nearest operator |
| Single site | Shift lead | On-call rotation |
| Fleet-wide correlated event | Regional manager | On-call + engineering |
"Nearest operator" is a real concept — determined either by geography or by who recently interacted with the device. An operator who just walked past sensor 7 does not need to know sensor 7 is fine; but if sensor 7 starts behaving oddly five minutes after they touched it, they are absolutely the right person to hear about it first.
Building the routing as a declarative policy, rather than a hard-coded if/else, means you can change it without shipping code. That flexibility has outsized value in the first year of deployment, when the team is still learning how its own operations actually run.
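A minimal version of that declarative policy is just the matrix above as data. The role names below mirror the table and are placeholders — in practice this table would live in config (YAML, JSON, a database row), not in source:

```python
# (scope, shift) -> recipients. Data, not if/else, so operations can
# change the routing without a code deploy.
ROUTING = {
    ("device", "on_shift"):    ["nearest_operator"],
    ("device", "after_hours"): ["nearest_operator"],
    ("site",   "on_shift"):    ["shift_lead"],
    ("site",   "after_hours"): ["on_call_rotation"],
    ("fleet",  "on_shift"):    ["regional_manager"],
    ("fleet",  "after_hours"): ["on_call_rotation", "engineering"],
}

def route(scope: str, shift: str) -> list[str]:
    """Look up who should hear about an alert of this scope, right now."""
    return ROUTING[(scope, shift)]
```

Resolving a role like `"nearest_operator"` to an actual person (by geography or recent interaction) happens at dispatch time, so the policy itself stays stable while rosters change.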
Channels have personalities
A WhatsApp message and an SMS and a phone call are not the same alert, even if the content is identical. Operators read them differently. WhatsApp is relational — it feels like a colleague asking for help. SMS is bureaucratic — it feels like a system message that can be dealt with later. A phone call is physical — it demands immediate attention.
Match the channel to the severity deliberately:
- Warning-tier content belongs in the app, and optionally a WhatsApp message for shift operators who have opted in.
- Critical-tier content belongs in a push notification plus a structured message on a chat channel where the team expects to see operational events.
- Safety-tier content belongs on every available channel simultaneously, including a phone call if the initial acknowledgement doesn't land.
Picking a single channel for everything ("we use WhatsApp for alerts") is a recipe for either alert fatigue (if Warnings show up on WhatsApp) or missed Critical events (if they don't).
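The channel-to-severity mapping above can also be declarative. A sketch — channel names are placeholders, and the phone-call fallback for Safety is modelled as kicking in only after the first acknowledgement window lapses:

```python
# Severity -> channels, ordered roughly by disruptiveness.
CHANNELS = {
    "info":     [],  # timeline only, never pushed
    "warning":  ["app", "whatsapp_opt_in"],
    "critical": ["push", "team_chat"],
    "safety":   ["push", "team_chat", "sms", "phone_call"],
}

def channels_for(severity: str, ack_timed_out: bool = False) -> list[str]:
    """Which channels to use now, given whether the first ack landed."""
    chans = list(CHANNELS[severity])
    # Safety escalates to a phone call only if the initial ack doesn't land.
    if severity == "safety" and not ack_timed_out:
        chans.remove("phone_call")
    return chans
```

The useful property is that "which channel" is answered by the severity tier, not by a team habit — so a Warning can never accidentally arrive with the urgency of a phone call, and a Safety event can never be quietly absorbed into a chat thread.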
Instrument the alert system itself
The last — and most underused — discipline is to treat your alerting pipeline as a product that deserves its own dashboard. At minimum, you want to see:
- How many alerts fired per day, per severity tier.
- How many of each tier were acknowledged, how quickly.
- How many escalated, and how far.
- Which devices / sites produce disproportionate alert volumes (usually a clue that a threshold is miscalibrated or a sensor is ageing out).
- Which time-of-day windows are hottest.
If you can't answer those questions at a glance, you cannot improve your alerting system. And you will not improve it by guessing — every tuning change that isn't grounded in alert telemetry is a coin flip.
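Most of those dashboard questions fall out of simple aggregation over the alert log. A sketch assuming a hypothetical log of `(severity, device_id, fired_at, acked_at-or-None)` records:

```python
from collections import Counter
from datetime import datetime

def alert_metrics(log):
    """Aggregate a raw alert log into the dashboard numbers listed above."""
    fired_per_tier = Counter(sev for sev, _, _, _ in log)
    acked_per_tier = Counter(sev for sev, _, _, acked_at in log if acked_at)
    by_device = Counter(dev for _, dev, _, _ in log)
    # Disproportionate producers: often a miscalibrated threshold
    # or a sensor ageing out.
    noisiest = by_device.most_common(3)
    # Which time-of-day windows are hottest.
    by_hour = Counter(fired_at.hour for _, _, fired_at, _ in log)
    return {
        "fired": fired_per_tier,
        "acked": acked_per_tier,
        "noisiest": noisiest,
        "by_hour": by_hour,
    }
```

Time-to-acknowledge and escalation depth need a couple more fields on the log record, but the pattern is the same: the alerting pipeline emits its own telemetry, and the dashboard is a fold over it.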
The short version
If a team asked me for a five-bullet summary:
- Separate observation, interpretation, and action. Don't let a sensor reading become a page without passing through interpretation.
- Sustained, scoped, context-aware — three rules before any notification fires.
- Severity is a product decision. Four tiers cover almost everything.
- Acknowledgement halts escalation, it does not close the alert.
- Instrument the alerting pipeline and review its metrics the same way you review feature metrics.
Alert fatigue is a design symptom, not an operator failing. The operators are doing their jobs; the product is asking them to live with noise. Fixing the noise is the product's responsibility.