MQTT topic design for industrial IoT: the questions that matter more than the answer

The topic hierarchy is one of the few architectural decisions in an MQTT-based IoT product that you will almost never get to redo cleanly. It sits underneath every feature, every auth rule, every reconnect behavior, every migration. Teams tend to either rush past it in the first sprint or overthink it for months. Neither extreme ends well.

This is not a blueprint post. I'm not going to give you the exact topic tree we use in production — the tree itself is part of the product. Instead, this is the decision framework I wish I'd had before my first serious attempt. If you're about to design an MQTT hierarchy for a cold-chain, agritech or general industrial-IoT workload, these are the questions to put on the wall before anyone proposes a solution.

Every topic segment has to answer four questions at once

Any proposed hierarchy is, implicitly, a bet on how four concerns interact:

Routing — who needs to receive the message?
Authorization — who is allowed to publish or subscribe?
Retention — should a late-arriving subscriber see the last value?
Cardinality — how many distinct topics will this produce at scale?

Teams that design for routing alone tend to ship a tree that produces millions of mostly-empty topics, authorization rules that need regex gymnastics, and a retention story that silently eats broker memory.

Before you write a single segment, force yourself to sketch how each of the four would behave for your biggest anticipated tenant, not your smallest demo.
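A back-of-envelope cardinality estimate is often enough to expose a problem early. The sketch below multiplies out one hypothetical tenant. Every name and number in it (TenantProfile, the site and device counts, the payload size) is an assumption for illustration, and it counts payload bytes only, not per-topic broker overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantProfile:
    """Hypothetical sizing figures for one tenant (all values are assumptions)."""
    sites: int
    devices_per_site: int
    streams_per_device: int   # distinct topic leaves each device publishes to
    retained_fraction: float  # share of those leaves that hold a retained message
    avg_payload_bytes: int


def estimate(profile: TenantProfile) -> dict:
    """Return the topic count and a rough retained-payload footprint."""
    topics = profile.sites * profile.devices_per_site * profile.streams_per_device
    retained = int(topics * profile.retained_fraction)
    return {
        "topics": topics,
        "retained_topics": retained,
        "retained_payload_bytes": retained * profile.avg_payload_bytes,
    }


biggest = TenantProfile(sites=40, devices_per_site=500, streams_per_device=12,
                        retained_fraction=0.25, avg_payload_bytes=200)
print(estimate(biggest))
```

Running the same function for the smallest demo tenant and the biggest anticipated one makes the gap between them concrete before anyone argues about segment names.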

The five traps I see repeatedly

None of these are hypothetical. Every one of them cost us weeks the first time we met them.

1. Encoding metadata into topics

The most common mistake. A segment like <device>/<firmware-version>/heartbeat sounds clever but is a time bomb. The first firmware update orphans every retained message under the old branch, breaks every subscription that hard-coded the version, and forces a cross-topic reconciliation dance that nobody budgeted for.

The rule I'd tattoo on a whiteboard: topics express identity, payloads express state. Anything that changes over the life of the thing belongs in the payload. The topic is the thing's address, not its description.
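As a minimal sketch of that rule, here is a heartbeat where the firmware version travels in the payload. The device/<id>/heartbeat layout, the function names, and the payload fields are hypothetical, chosen only to show that the address stays put across an update.

```python
import json


def heartbeat_topic(device_id: str) -> str:
    """The topic carries only the stable identity of the device."""
    return f"device/{device_id}/heartbeat"


def heartbeat_payload(firmware_version: str, uptime_s: int) -> str:
    """Anything that can change over the device's life travels in the payload."""
    return json.dumps({"fw": firmware_version, "uptime_s": uptime_s}, sort_keys=True)


# Before and after a firmware update: same address, different state.
topic_before = heartbeat_topic("dev-0042")
payload_before = heartbeat_payload("1.4.2", 86400)
topic_after = heartbeat_topic("dev-0042")
payload_after = heartbeat_payload("1.5.0", 30)
```

Subscriptions and retained messages keyed on the topic survive the update untouched; only consumers that care about the version need to read the new field.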

2. Uniform QoS and retention across all streams

An MQTT tree with a single QoS / retention policy is a tree that's about to have a capacity incident. High-frequency telemetry and rare safety-critical commands have radically different delivery requirements — pretending they don't is how you end up with retained sensor readings the broker is too busy to serve and un-retained command ack topics that silently drop critical confirmations.

The productive exercise is to catalogue every kind of message your product will produce (telemetry, commands, command acks, state snapshots, heartbeats, config, audit, diagnostics), and decide the delivery contract for each category in isolation. Encode those contracts in a shared library that every client and server uses, so application code cannot override them on a whim.
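One way such a shared library could look, as a sketch: a table that maps each message kind to its contract, and a helper that is the only path to a publish call. The Kind, Contract, and publish_args names are hypothetical, and the QoS and retain values are placeholders rather than recommendations.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    TELEMETRY = "telemetry"
    COMMAND = "command"
    STATE_SNAPSHOT = "state_snapshot"
    HEARTBEAT = "heartbeat"


@dataclass(frozen=True)
class Contract:
    qos: int      # MQTT QoS level: 0, 1 or 2
    retain: bool  # whether the broker keeps the last message for late subscribers


# Illustrative values only; the right contract depends on your product.
CONTRACTS = {
    Kind.TELEMETRY: Contract(qos=0, retain=False),
    Kind.COMMAND: Contract(qos=1, retain=False),
    Kind.STATE_SNAPSHOT: Contract(qos=1, retain=True),
    Kind.HEARTBEAT: Contract(qos=0, retain=False),
}


def publish_args(kind: Kind, topic: str, payload: bytes) -> dict:
    """Build publish arguments; callers choose a kind, never a raw QoS or retain flag."""
    contract = CONTRACTS[kind]
    return {"topic": topic, "payload": payload,
            "qos": contract.qos, "retain": contract.retain}
```

Because application code only ever names a Kind, changing a delivery contract becomes a one-line, reviewable change in one place.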

3. Flat trees that sounded simpler at the time

A hierarchy like tenant/<tenant>/device/<id>/... is briefly attractive because it looks minimal. It collapses the moment you need multi-site visibility, role-based access, or any kind of logical grouping that doesn't map to the physical "tenant → device" flattening.

Shape the hierarchy around how your operators think about the system, not how your database joins it. If an operator's mental model has layers ("site", "line", "zone", "room"), the topic tree should have corresponding prefixes. A subscription or ACL rule on a prefix followed by the multi-level wildcard # then covers everything beneath that layer, which collapses most authorization and aggregation problems into one-liners.
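To make the wildcard point concrete, here is a small sketch of MQTT topic-filter matching: + matches exactly one level, and # matches all remaining levels and may only appear last. The function name and the acme/site/... topics are hypothetical, and the special handling of $-prefixed system topics is left out.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a concrete topic name.

    '+' matches exactly one level; '#' matches the remaining levels (including
    none) and is only valid as the last level. '$'-prefixed system topics are
    not handled in this sketch.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


# An operator scoped to one site needs a single rule.
operator_scope = "acme/site/berlin/#"
print(topic_matches(operator_scope, "acme/site/berlin/line/3/temp"))  # True
print(topic_matches(operator_scope, "acme/site/munich/line/1/temp"))  # False
```

With a flat tenant/<tenant>/device/<id> tree there is no prefix that means "this site", so the same rule would need an explicit list of device IDs.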

4. Mixing environments in the same namespace

Dev, staging, and prod must be separated at a level that cannot be overridden by any client library or configuration tweak. Production MQTT brokers and dev MQTT brokers should ideally not even share infrastructure — but if they do, the topmost segment has to make it structurally impossible for a dev client to publish into production by accident. This is a lesson every IoT team learns at least once the hard way.
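A sketch of what "structurally impossible" might mean on the client side: the environment is fixed when the builder is constructed and there is no API for changing it. Env, TopicBuilder, and belongs_to are hypothetical names. A client-side guard like this is only half of the story; the broker's ACLs need to enforce the same first-segment rule against each client's credentials.

```python
from enum import Enum


class Env(Enum):
    DEV = "dev"
    STAGING = "staging"
    PROD = "prod"


class TopicBuilder:
    """Builds topics under one environment root that is fixed at construction."""

    def __init__(self, env: Env):
        self._root = env.value

    def topic(self, *segments: str) -> str:
        for segment in segments:
            # Reject empty segments, separators and wildcards smuggled into a segment.
            if not segment or any(ch in segment for ch in "/+#"):
                raise ValueError(f"illegal topic segment: {segment!r}")
        return "/".join((self._root, *segments))


def belongs_to(env: Env, topic: str) -> bool:
    """The check a broker-side policy could mirror: first level equals the env."""
    return topic.split("/", 1)[0] == env.value


dev_topics = TopicBuilder(Env.DEV)
t = dev_topics.topic("site-01", "freezer-3", "telemetry")
print(t, belongs_to(Env.PROD, t))
```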

5. Treating naming like cosmetics

Casing, separators, and length limits sound like bikeshedding until you spend an evening debugging a fleet of devices where half publish to Site_01 and half publish to site-01. Pick one convention, enforce it in CI via a simple linter, and never relax it. Future-you will not remember which segment was supposed to be CamelCase and which was supposed to be kebab-case.
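The linter really can be this small. The sketch below assumes a house convention of lowercase kebab-case segments of at most 32 characters; both the convention and the lint_topic name are assumptions, so substitute your own rules.

```python
import re

# House convention assumed for this sketch: lowercase kebab-case, at most 32 chars.
SEGMENT_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
MAX_SEGMENT_LEN = 32


def lint_topic(topic: str) -> list[str]:
    """Return human-readable violations; an empty list means the topic passes."""
    problems = []
    for index, segment in enumerate(topic.split("/")):
        if len(segment) > MAX_SEGMENT_LEN:
            problems.append(f"segment {index} is longer than {MAX_SEGMENT_LEN} characters")
        if not SEGMENT_RE.fullmatch(segment):
            problems.append(f"segment {index} ({segment!r}) is not lowercase kebab-case")
    return problems


print(lint_topic("site-01/freezer-3/temperature"))  # []
print(lint_topic("Site_01/freezer-3/temperature"))  # one violation, for segment 0
```

In CI, a wrapper would collect every topic string declared in the codebase, run each through the linter, and exit non-zero if any list comes back non-empty.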

Commands are a different product from telemetry

If you take nothing else from this post, take this. Telemetry and commands travel over the same MQTT broker but they are not the same product and they should not be designed as if they were.

Telemetry is high-volume, loss-tolerant, largely one-way, and useful in aggregate. It can tolerate being dropped, reordered, or re-sent. A missed temperature reading is annoying; you'll publish another one in a few seconds.

Commands are low-volume, loss-intolerant, bidirectional, and consequential. A command that arrives twice, or arrives hours late, or fires during a connection flap, can produce an unsafe physical action in the real world. The delivery semantics, the audit trail, the acknowledgement model, and the replay rules for commands deserve their own design exercise — and, often, their own topic subtree with its own contract.
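To illustrate two of those rules, here is a device-side sketch that refuses commands that are stale or already seen. Command, CommandGate, and the open_valve action are hypothetical. The premise is that QoS 1 is at-least-once delivery, so duplicates are expected and the receiver has to deduplicate on its own identifier.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Command:
    action: str
    expires_at: float  # Unix time after which the command must not run
    command_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class CommandGate:
    """Device-side guard: run each command at most once, and never after expiry."""

    def __init__(self):
        # Unbounded here for brevity; a real device would cap or persist this set.
        self._seen = set()

    def should_execute(self, cmd: Command, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if now >= cmd.expires_at:
            return False  # stale: arrived too late to be safe
        if cmd.command_id in self._seen:
            return False  # duplicate: a redelivery or a replay
        self._seen.add(cmd.command_id)
        return True


gate = CommandGate()
cmd = Command("open_valve", expires_at=100.0)
print(gate.should_execute(cmd, now=50.0))   # True: fresh and unseen
print(gate.should_execute(cmd, now=51.0))   # False: same command_id again
```

None of this logic makes sense for telemetry, which is a good sign that the two deserve separate contracts.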

The single biggest source of "weird IoT bugs" I've encountered is teams treating a command as a slightly-more-important telemetry message. It is not.

Questions to stress-test your design before you ship it

Before you commit to a hierarchy, I'd walk a draft through these four scenarios and see if it survives:

The disconnect-reconnect storm. What happens when 30% of your fleet reconnects inside a 60-second window after a regional outage? Does the broker queue explode? Do retained messages fire commands that are no longer valid? Is there anything stale in the tree that will be delivered incorrectly?
The mis-shipped device. A device configured for site-A ends up physically installed at site-B. Can it publish into site-B without deliberate re-provisioning? If yes, that's an auth bug waiting to happen.
The rogue subscriber. An operator's credentials are compromised. What's the blast radius — can that operator read or write across your entire fleet, or only their scoped sub-tree? The answer should be "only their scoped sub-tree", and your topic hierarchy is what makes that achievable.
The noisy neighbour. A single misbehaving device publishes at 1000x the expected rate. Can you identify, throttle, and quarantine it without taking down its neighbours? Your topic design plus your broker's policy engine determine this answer together.
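For the noisy-neighbour scenario, identification is the part a topic design can make easy: if the device identity is a fixed segment of the topic, a monitor can count per device. The sketch below is a sliding-window counter under that assumption; RateMonitor and its parameters are hypothetical, and throttling or quarantine would still be the job of the broker's policy engine.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flags publishers that exceed a message budget within a sliding window."""

    def __init__(self, max_messages: int, window_s: float):
        self.max_messages = max_messages
        self.window_s = window_s
        self._events = defaultdict(deque)  # device_id -> timestamps of recent messages

    def record(self, device_id: str, now: float) -> bool:
        """Record one message; return True if the device is now over budget."""
        events = self._events[device_id]
        events.append(now)
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= now - self.window_s:
            events.popleft()
        return len(events) > self.max_messages


monitor = RateMonitor(max_messages=3, window_s=10.0)
flags = [monitor.record("dev-0042", t) for t in (0.0, 1.0, 2.0, 3.0)]
print(flags)  # [False, False, False, True]
```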

If any of these scenarios produces a shrug rather than a clear answer, your hierarchy is not done yet.

The unsexy bits that matter most

Two operational practices are worth more than any amount of theoretical hierarchy elegance:

Observability of the topic tree itself. You should be able to answer, at any moment: how many distinct topics are currently active, which branches are growing, which are stale, and which retained messages have been sitting there longest. If you can't, you will discover problems weeks later, in the form of bills or outages.
Convention linting in CI. A twenty-line script that fails a pull request when anyone introduces a topic that doesn't match the house convention pays for itself in avoided pages. Topics are content; treat them like content.
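The observability questions reduce to simple aggregations once you have a map of topic to last-seen timestamp. How you collect that map (a monitoring subscriber, broker metrics, logs) is deployment-specific and assumed here, as are the function names and sample topics.

```python
from collections import Counter


def branch_counts(topics, depth: int) -> Counter:
    """Count distinct topics under each prefix of the given depth."""
    counts = Counter()
    for topic in set(topics):
        prefix = "/".join(topic.split("/")[:depth])
        counts[prefix] += 1
    return counts


def stale_topics(last_seen: dict, now: float, max_age_s: float) -> list:
    """Return topics whose latest message is older than max_age_s, oldest first."""
    stale = [t for t, ts in last_seen.items() if now - ts > max_age_s]
    return sorted(stale, key=lambda t: last_seen[t])


# Hypothetical snapshot: topic -> Unix time of the most recent message.
last_seen = {
    "site-01/freezer-1/temp": 1000.0,
    "site-01/freezer-2/temp": 400.0,
    "site-02/dock-1/door": 100.0,
}
print(branch_counts(last_seen, depth=1))
print(stale_topics(last_seen, now=1060.0, max_age_s=300.0))
```

Tracking branch_counts over time shows which branches are growing; stale_topics gives a cleanup list before the broker's memory does.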

TL;DR

MQTT topic design is a decision framework, not a template. The framework, in its shortest form:

Every segment has to answer routing, authorization, retention, and cardinality simultaneously.
Identity goes in topics; state goes in payloads.
Telemetry and commands are separate products with separate delivery contracts.
Shape the tree around operator mental models, not database joins.
Enforce naming conventions in CI from day one.
Stress-test the design against disconnect storms, mis-shipped devices, compromised credentials, and noisy-neighbour failures before you ship.

If you're thinking about this for a cold-chain, agritech or industrial-IoT workload, I'm happy to talk through specifics — reach out through the contact section. The right answer is always context-dependent, and it's almost never the first design on the whiteboard.