    4 min read
    IoT
    Reliability
    Rural Deployment
    Bangladesh
    Agritech

    Designing IoT systems for 2G-only villages: reliability patterns that actually work

    Six hard-earned patterns for keeping IoT sensor networks alive in regions where power, cellular, and patience are all in short supply.

    Most IoT reliability advice is written for data centers. The sensor you deploy in a greenhouse two hours outside Rajshahi does not live in a data center. It lives on a rooftop that loses grid power for four hours every afternoon, talks to a 2G tower that drops packets whenever it rains, and is maintained by a farmer who has never opened a terminal.

    This is a short, opinionated list of the patterns that have kept our deployments alive across cold storage, polynet greenhouses, and aquaculture farms. None of them are revolutionary. All of them got removed and re-added at least once after we shipped them.

    1. Treat the network as unreliable by default — and measure it

    The single biggest lie we tell ourselves as engineers is that "the connection is basically fine". It isn't. When we instrumented edge gateways to log every publish attempt, retry, and timeout, we found that even "good" rural sites were dropping 8–12% of their outbound MQTT packets across a week, with localized peaks above 40% during monsoon afternoons.

    Two practical consequences:

    • Every command from cloud → device must be idempotent and carry a correlation ID. If a compressor gets the "turn on" message twice because the ack got lost on the way back, nothing bad should happen. And if a delayed retry of "turn on" arrives after a newer "turn off", you need a monotonically increasing version number to recognize the retry as stale and discard it.
    • Every telemetry publish must have a local fallback. We buffer the last 48 hours of sensor readings to on-disk SQLite on the gateway. When the tower comes back, we replay with exponential-backoff pacing so we don't DDoS our own broker.
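    The idempotency rule in the first bullet can be sketched in a few lines. This is illustrative, not our actual schema: the `correlation_id` and `version` field names, and the relay handler itself, are stand-ins for whatever your command payload carries.

```python
# Hypothetical command de-duplication for a gateway-side relay controller.
# Each cloud command carries a correlation_id (apply-at-most-once) and a
# monotonically increasing version (discard stale, out-of-order commands).

seen_correlation_ids: set = set()
last_applied_version = -1
relay_state = "off"

def apply_command(cmd: dict) -> str:
    """Apply a command idempotently; return the resulting relay state."""
    global last_applied_version, relay_state

    # Duplicate delivery (lost ack, broker redelivery): apply at most once.
    if cmd["correlation_id"] in seen_correlation_ids:
        return relay_state

    # Stale command arriving out of order: a newer one already won.
    if cmd["version"] <= last_applied_version:
        return relay_state

    seen_correlation_ids.add(cmd["correlation_id"])
    last_applied_version = cmd["version"]
    relay_state = cmd["action"]  # "on" or "off"
    return relay_state

# The "turn on" message delivered twice is harmless:
apply_command({"correlation_id": "a1", "version": 1, "action": "on"})
apply_command({"correlation_id": "a1", "version": 1, "action": "on"})
# A late retry of version 1 cannot undo a newer "off":
apply_command({"correlation_id": "b2", "version": 2, "action": "off"})
state = apply_command({"correlation_id": "a1r", "version": 1, "action": "on"})
```

    In production the seen-ID set needs expiry, but the two guards are the whole idea.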

    The measurement itself is the point. You cannot fix a reliability problem you haven't quantified.
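    The buffer-and-replay bullet above looks roughly like this, with stdlib SQLite and a hypothetical `publish()` callback standing in for the real MQTT client:

```python
import sqlite3
import time

# Store-and-forward sketch: buffer readings locally, replay oldest-first
# with exponential-backoff pacing once connectivity returns.

db = sqlite3.connect(":memory:")  # an on-disk file in production
db.execute("CREATE TABLE IF NOT EXISTS buffer (ts REAL, payload TEXT)")

def buffer_reading(payload: str) -> None:
    db.execute("INSERT INTO buffer VALUES (?, ?)", (time.time(), payload))
    db.commit()

def replay(publish, base_delay=0.01, max_delay=5.0) -> int:
    """Drain the buffer; back off exponentially while publishes fail."""
    delay, sent = base_delay, 0
    while True:  # real code would also bail out on a shutdown signal
        row = db.execute(
            "SELECT rowid, payload FROM buffer ORDER BY ts LIMIT 1"
        ).fetchone()
        if row is None:
            return sent
        rowid, payload = row
        if publish(payload):
            db.execute("DELETE FROM buffer WHERE rowid = ?", (rowid,))
            db.commit()
            sent += 1
            delay = base_delay           # reset pacing after a success
        else:
            time.sleep(delay)            # don't DDoS our own broker
            delay = min(delay * 2, max_delay)

buffer_reading("temp=4.2")
buffer_reading("temp=4.3")

attempts = {"n": 0}
def flaky_publish(payload: str) -> bool:
    attempts["n"] += 1
    return attempts["n"] > 2  # first two attempts fail, then the tower is back

sent = replay(flaky_publish)
```

    The delete-after-ack ordering matters: a crash mid-replay duplicates a reading (harmless, the broker-side dedup catches it) rather than losing one.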

    2. Separate the control plane from the data plane

    A temperature reading arriving six seconds late is annoying. A "turn off the compressor" command arriving six seconds late can cost a cold-storage operator several lakh taka of spoiled fish.

    We run two MQTT topic hierarchies with different QoS and retention policies:

    • telemetry/# — QoS 0, retained = false, batched publishes every 15 seconds.
    • control/# — QoS 2, retained = true, never batched, always ack-confirmed end-to-end.

    Control topics also have a stricter client-side circuit breaker: if three consecutive commands time out, the gateway refuses to accept new commands from the cloud for 30 seconds and surfaces an operator alert. Better to fail loudly than to queue up eight conflicting instructions for a physical relay.
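    The control-plane breaker is small enough to show whole. The three-strike threshold and 30-second hold are the values from our deployments; the class itself is an illustrative sketch, with the clock injected so the example is deterministic.

```python
import time

class CommandCircuitBreaker:
    """Refuse new cloud commands after repeated timeouts, then recover."""

    def __init__(self, threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock               # injectable for testing
        self.failures = 0
        self.open_until = 0.0

    def allow(self) -> bool:
        return self.clock() >= self.open_until

    def record_timeout(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.open_until = self.clock() + self.cooldown_s
            self.failures = 0
            # The real gateway also raises an operator alert here.

    def record_success(self) -> None:
        self.failures = 0

# Simulated clock, so the cooldown can be fast-forwarded.
now = {"t": 0.0}
breaker = CommandCircuitBreaker(clock=lambda: now["t"])
for _ in range(3):
    breaker.record_timeout()
blocked = not breaker.allow()      # breaker is open: commands refused
now["t"] = 31.0
recovered = breaker.allow()        # cooldown elapsed: commands accepted
```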

    3. Design the offline mode as the primary mode

    The first version of our cold-storage UI loaded real-time sensor data from our cloud API. When the tower went down, the operator's phone app showed a spinning loader and the operator concluded the app was broken.

    We flipped the mental model. Now the gateway has a local HTTP endpoint on the facility's Wi-Fi that serves cached state. The mobile app prefers the local endpoint when it can reach it, falls back to cloud when it can't, and reconciles on reconnect. Operators stop noticing when the internet drops, which is exactly what we want.
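    The read path is a plain preference chain. A sketch, with hypothetical `fetch_local`/`fetch_cloud` callables standing in for the two HTTP calls:

```python
# Local-first read path: prefer the gateway's LAN endpoint, fall back to
# cloud, and tag the result with its origin so the UI can show the source.

def get_state(fetch_local, fetch_cloud) -> dict:
    for source, fetch in (("local", fetch_local), ("cloud", fetch_cloud)):
        try:
            state = fetch()
            return {"source": source, **state}
        except ConnectionError:
            continue
    return {"source": "none"}  # both unreachable: render last cached state

def dead_local():
    raise ConnectionError("gateway unreachable")

result = get_state(dead_local, lambda: {"temp_c": 4.1})
```

    Tagging the source is what lets the UI say "showing cloud data, local gateway unreachable" instead of a spinner.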

    The rule we settled on: if the local network is healthy, the product must work — full stop — even if the cloud is on fire.

    4. Push configuration, pull telemetry

    Early on we tried to be clever and push configuration changes in real time. It felt modern. It was a disaster. A misconfigured alert threshold could propagate across 240 sensors in under a minute, and rolling it back required manual re-pushes that sometimes failed halfway.

    We switched to a pull model with explicit versions. The gateway fetches a signed configuration manifest every 60 seconds, with a local fallback to the last-known-good config. Cloud becomes an advisor, not a dictator. When something goes wrong, we pause the manifest rollout and the fleet self-heals to the previous good state.
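    The pull loop in miniature, with signature checking reduced to a hypothetical `verify_signature()` and the manifest as a plain dict. The version gate is the important part:

```python
# Versioned, pull-based configuration with last-known-good (LKG) fallback.
# fetch_manifest and verify_signature stand in for the real HTTP fetch and
# cryptographic check.

last_known_good = {"version": 4, "alert_threshold_c": 8.0}

def refresh_config(fetch_manifest, verify_signature) -> dict:
    global last_known_good
    try:
        manifest = fetch_manifest()
    except ConnectionError:
        return last_known_good           # offline: keep running on LKG
    if not verify_signature(manifest):
        return last_known_good           # tampered or corrupt: refuse it
    if manifest["version"] <= last_known_good["version"]:
        return last_known_good           # never roll backwards implicitly
    last_known_good = manifest           # accept and persist the new config
    return manifest

cfg = refresh_config(
    fetch_manifest=lambda: {"version": 5, "alert_threshold_c": 6.5},
    verify_signature=lambda m: True,
)
```

    Pausing a bad rollout is then just holding the manifest at the previous version; every gateway's next poll converges on it.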

    The aesthetic lesson: config is not commands. Treat it like software releases, not like MQTT messages.

    5. Power is a first-class dependency

    Grid power in rural Bangladesh is not a continuous variable; it's a step function with steep edges. Every deployment has:

    • A small UPS (ideally 2+ hours for the gateway).
    • A sensor schedule that degrades gracefully during brownouts — critical parameters continue at full rate, nice-to-haves drop to 1× per minute.
    • A "safe shutdown" sequence on the gateway that flushes SQLite, closes MQTT cleanly, and syncs the clock from a cached NTP snapshot before the battery dies.
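    The shutdown sequence is worth writing down because the order matters: flush durable state first, close the network session cleanly, then persist a clock snapshot so timestamps survive the outage. A sketch with illustrative helper names:

```python
import sqlite3

# Safe-shutdown sketch, triggered by a low-battery signal from the UPS.
# close_mqtt and save_clock_snapshot are stand-ins for the real helpers.

def safe_shutdown(db: sqlite3.Connection, close_mqtt, save_clock_snapshot):
    steps_done = []
    db.commit()                     # flush buffered readings to disk
    steps_done.append("sqlite_flushed")
    close_mqtt()                    # clean DISCONNECT, not a TCP reset
    steps_done.append("mqtt_closed")
    save_clock_snapshot()           # cached NTP offset for next boot
    steps_done.append("clock_saved")
    return steps_done

db = sqlite3.connect(":memory:")
done = safe_shutdown(db, close_mqtt=lambda: None,
                     save_clock_snapshot=lambda: None)
```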

    We learned the last one the hard way after spending two weeks debugging "corrupted" sensor data that turned out to be timestamps from a gateway whose clock had drifted 47 minutes while running off a dead battery.

    6. Operational UX is half the system

    Technology people under-weight this. The best architecture in the world dies quietly when the operator can't tell whether the problem is with their device, the internet, the cloud, or the sensor. Every screen we ship now answers three questions at a glance:

    • Is the sensor alive? (last heartbeat within expected window)
    • Is the link alive? (last successful publish to cloud)
    • Is the data fresh? (what the operator is looking at — when was it captured, not when was it rendered)
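    The three questions reduce to one tiny pure function. The thresholds below are illustrative, not our production values:

```python
# Green/amber/red status from the three freshness questions. All inputs
# are "seconds since last event"; the dot is as bad as the worst of them.

def status_dot(heartbeat_age_s, publish_age_s, data_age_s,
               warn_s=120, dead_s=600):
    ages = (heartbeat_age_s, publish_age_s, data_age_s)
    if any(a >= dead_s for a in ages):
        return "red"      # sensor, link, or data is effectively gone
    if any(a >= warn_s for a in ages):
        return "amber"    # something is stale; worth a look
    return "green"        # alive, connected, fresh

dot = status_dot(heartbeat_age_s=30, publish_age_s=200, data_age_s=45)
```

    Note the dot deliberately conflates nothing: a healthy sensor behind a dead link shows amber or red on the link question, which is exactly the diagnosis the operator needs to relay over a phone call.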

    A tiny green/amber/red dot in the corner of the dashboard has prevented more support calls than any five backend improvements combined.

    The unglamorous summary

    Reliability in rural IoT is not a single heroic decision. It's dozens of small, boring decisions — idempotent commands, local caching, config manifests, power budgets, operator-legible status indicators — that compound into a system that keeps working when you aren't watching. If you take exactly one thing from this post: instrument before you optimize. Every one of the six patterns above started as a spike in a log we almost didn't bother to ship.

    © 2026 Mushfiqur Rahaman · Building for a sustainable future