300 million devices and a 47-second problem

5 min read
samsung-smartthings distributed-systems reliability

We built jitter into the scheduling system on purpose. A random offset between zero and fifty-nine seconds, baked into every timer automation at creation. At 300 million devices, it was the only way to survive 7:00 PM.

The problem is, we forgot to tell anyone what “random” means in practice.

The math behind the minute

Timer-based automations are one of the most common patterns in a smart home system. “Turn on the porch light at sunset.” “Start the coffee at 7:15.” “Lock the door at 10 PM.” At scale, these are cron jobs: the automation service queues events, and a scheduler fires them at the right time.

The naive version of this falls apart immediately. If you have 10 million devices with a “7:00 PM” automation and your cron job checks the queue at the top of every minute, you get 10 million events trying to fire at the same instant. The event pipeline chokes. Latency spikes. Downstream services get hammered. Users notice their porch lights all turned on three minutes late, in a cascade.

The fix is jitter: when an automation is created, assign it a random offset between 0 and 59 seconds. The “7:00 PM” trigger becomes “7:00:23 PM” or “7:00:47 PM” at creation time. The user sees “7:00 PM” in the app. The system fires it sometime that minute, spread across the full 60 seconds. Load distributed. Pipeline happy.
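A minimal sketch of that idea, not the production code. The `TimerAutomation` record, its field names, and both helper functions are hypothetical, made up for illustration. What it shows is the split the paragraph describes: the offset is drawn once at creation and stored, while the user-facing time stays at the whole minute.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TimerAutomation:
    """Hypothetical automation record; names are illustrative only."""
    name: str
    hour: int
    minute: int
    jitter_seconds: int  # drawn once at creation, then stored


def create_automation(name, hour, minute, rng=random):
    # randint is inclusive on both ends, so this yields 0..59.
    return TimerAutomation(name, hour, minute, rng.randint(0, 59))


def fire_time(automation, day):
    # The app shows hour:minute; the scheduler adds the stored offset.
    base = day.replace(hour=automation.hour, minute=automation.minute,
                       second=0, microsecond=0)
    return base + timedelta(seconds=automation.jitter_seconds)


porch = create_automation("porch light", 19, 0)
print(fire_time(porch, datetime.now()))  # some second within 19:00
```

Because the offset lives on the record rather than being redrawn at fire time, a given automation lands on the same second every day. That is what made the 47 in the next section repeatable.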

This worked. It worked well. It’s the kind of thing you build once, forget about, and later list as “reliability engineering” on your resume.

The meeting

Then someone from the product org scheduled a demo. An executive wanted to see a timer automation fire. They set it for a specific time, counted down, and watched. The event fired 47 seconds into the minute.

Not broken. Correct. Jitter doing exactly what it was designed to do. The automation fired within the minute it was scheduled, which is what the system promised.

The exec did not experience it that way.

“Why is it 47 seconds late?” is not the question you want to answer by explaining distributed systems load balancing in a room full of product managers. “By design” is technically accurate. It is not, in that context, a satisfying answer.

Two kinds of event

The meeting forced a harder look at what the event queue actually contained. Not one type of event. Two.

The first: user-visible timer automations. Porch lights, coffee makers, alarms, door locks. Events a person set deliberately and might be watching for. When one fires 47 seconds into the minute, that’s not invisible latency. That’s the thing they just saw.

The second: background measurement events. Device health polls. Metric collection. State sync jobs that run on a schedule but produce nothing a user ever directly sees. These need to happen sometime in the correct minute. Nobody is counting the seconds.

The original system assigned jitter randomly to both. That made sense when it was built: the problem being solved was load, not latency perception. A user-visible automation and a background health poll got the same random 0-to-59-second offset at creation. Statistically, load smoothed out fine. But “statistically” also meant that any single porch light automation could draw 47, and an exec could spend a meeting asking why the product team built a slow timer.

The fix

Two tiers. Same jitter principle, different windows.

User-visible automations, flagged as priority at creation, were given the first five seconds of their target minute: offset 0 to 4. The exec’s 7:00 PM timer would now fire between 7:00:00 and 7:00:04. Close enough that nobody calls it late.

Background measurement events kept the wider band: offset 5 to 59. Fifty-five seconds of spread for jobs the user never sees. Load still distributed. Pipeline still protected.
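The two tiers can be sketched in a few lines. The constant names and the `assign_jitter` function are hypothetical; only the windows themselves (0 to 4 and 5 to 59) come from the description above.

```python
import random

# Inclusive second offsets within the target minute.
PRIORITY_WINDOW = (0, 4)     # user-visible automations
BACKGROUND_WINDOW = (5, 59)  # measurement and sync jobs


def assign_jitter(user_visible, rng=random):
    """Pick an offset from the window that matches the automation's tier."""
    low, high = PRIORITY_WINDOW if user_visible else BACKGROUND_WINDOW
    return rng.randint(low, high)  # randint includes both endpoints


print(assign_jitter(user_visible=True))   # somewhere in 0..4
print(assign_jitter(user_visible=False))  # somewhere in 5..59
```

The windows do not overlap, so background jobs never compete with user-visible events for the first five seconds of the minute.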

The 7:00 PM problem didn’t come back. Ten million user-visible “7:00 PM” automations spread over five seconds is a higher peak rate than the same ten million spread over sixty, but it is still far more manageable than ten million at the same instant. The math held.
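A quick back-of-the-envelope check, assuming offsets are spread evenly across the window and treating "the same instant" as a single second. The worst case here, where all ten million automations are user-visible, is the one from the paragraph above; the function name is made up.

```python
def events_per_second(total_events, window_seconds):
    """Average rate if events are spread evenly across the window."""
    return total_events / window_seconds


events = 10_000_000
print(f"{events_per_second(events, 1):,.0f}")   # 10,000,000 (no jitter)
print(f"{events_per_second(events, 5):,.0f}")   # 2,000,000  (priority tier)
print(f"{events_per_second(events, 60):,.0f}")  # 166,667    (original full-minute spread)
```

So the priority tier peaks about twelve times higher than the old full-minute spread, and about five times lower than no jitter at all.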

The implementation was a flag on the automation object, a change to the jitter calculation at creation time, and a migration to re-bucket existing automations. Not a rewrite. Not a new service. About a week of work.
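One way the re-bucketing step could look. This is a sketch under assumptions: the dict keys `user_visible` and `jitter_seconds` are hypothetical, and the choice to leave offsets that already sit in the right window untouched is mine, not a detail of the actual migration.

```python
import random

PRIORITY_WINDOW = (0, 4)     # user-visible automations
BACKGROUND_WINDOW = (5, 59)  # background measurement events


def rebucket(automations, rng=random):
    """Redraw offsets that fall outside the window for their tier.

    `automations` is an iterable of dicts with hypothetical keys
    "user_visible" (bool) and "jitter_seconds" (int).
    Mutates the dicts in place and returns how many were changed.
    """
    changed = 0
    for automation in automations:
        low, high = (PRIORITY_WINDOW if automation["user_visible"]
                     else BACKGROUND_WINDOW)
        if not (low <= automation["jitter_seconds"] <= high):
            automation["jitter_seconds"] = rng.randint(low, high)
            changed += 1
    return changed


rows = [
    {"user_visible": True, "jitter_seconds": 47},   # redrawn into 0..4
    {"user_visible": False, "jitter_seconds": 30},  # already fine, kept
]
print(rebucket(rows))  # 1
```

Running it a second time changes nothing, which makes a migration like this safe to retry.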

What it taught me

The system was correct. The system was also incomplete. Those are different things.

“By design” is the beginning of an answer, not the end of one. The design handled load. It made one assumption: that all timer events were equivalent. That was true when it was written and stopped being true when the product grew. The assumption was never written down. It just quietly became wrong.

At scale, this pattern shows up constantly. A constraint gets baked in early, for good reasons, and then the product grows past it. The fix is rarely to throw out the constraint. It’s to make it conditional. Add the tier. Add the flag. Scope the rule to the cases where it still applies.

The exec’s timer now fires in the first five seconds of the minute. Nobody had to rebuild anything.
