When is the swarm actually done?
TL;DR — When one AI agent hands work to another, which hands it to a third, no single participant can tell whether the whole chain actually finished. Each agent only sees its own immediate neighbor say “got it.” Below is a tiny three-program demo you can copy, paste, and run in two minutes: a root caller asks A, A asks B, B answers 200 accepted — and then silently drops the work it just promised. Root prints SUCCESS. Every process exits 0. The system looks healthy and is lying, and nobody in the chain misbehaved. The reason isn’t a bug you can patch in the demo: once A’s request to B closes, A literally cannot learn B’s downstream fate. There is no global observer of the cascade. This has a name distributed-systems researchers have studied for forty years — distributed termination detection — and the agent ecosystem is building swarms as if it doesn’t exist.
Picture a relay race where you, the coach, can see the first runner take off but not the finish line. The first runner passes the baton, jogs back, and tells you “handed it off, all good.” You write down race complete. But the runner three legs downstream just tripped, dropped the baton, and walked off the track. Your first runner doesn’t know that. You don’t know that. The only report you got — “I passed the baton” — was completely honest and completely useless for the question you actually care about: did anyone cross the finish line?
Swap runners for AI agents and that is the exact shape of a problem the agent industry is racing past. An orchestrator agent delegates to a specialist; the specialist delegates to a tool-running sub-agent; that one delegates again. Everyone reports back to the agent directly above them, and everyone reports the truth. And the system as a whole can still have no idea whether the work got done.
Let me make that concrete enough to run on your laptop.
Three programs that lie to each other politely
The entire system fits in three scripts. Root is a script. A and B are two tiny web servers on two different ports — separate processes, talking over the network, exactly like real agents do. Pure FastAPI and httpx, no agent frameworks, nothing private. You need only pip install fastapi uvicorn httpx — uvicorn is the tiny web server that runs A and B.
(If async/await isn’t your daily language: read await as “wait here for this.” That’s enough to follow every line.)
b.py, the agent that accepts work and then drops it:

```python
# b.py -- run with: uvicorn b:app --port 8002
import asyncio
import logging

from fastapi import FastAPI

app = FastAPI()
_running = set()  # anchor background tasks so they can't be silently garbage-collected


async def do_the_real_work():
    # The actual job B promised to do. Pretend it is a long task.
    try:
        await asyncio.sleep(1)
        raise RuntimeError("B's worker died here")
    except Exception:
        logging.exception("WORK DROPPED -- and no one upstream is listening")


@app.post("/work")
async def work():
    # Kick off the real work in the background, do NOT wait for it...
    t = asyncio.create_task(do_the_real_work())
    _running.add(t)
    t.add_done_callback(_running.discard)
    # ...and immediately answer the caller: "accepted!"
    return {"status": "accepted"}  # HTTP 200, instantly
```

a.py, the middle agent that trusts B’s “accepted”:
```python
# a.py -- run with: uvicorn a:app --port 8001
import httpx
from fastapi import FastAPI

app = FastAPI()


@app.post("/run")
async def run():
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:8002/work")
        # A sees B's 200 "accepted" and considers ITS OWN job done.
        return {"status": "done", "downstream": resp.json()}
```

root.py, the orchestrator that believes A:
```python
# root.py -- run with: python root.py
import httpx

resp = httpx.post("http://localhost:8001/run")
print("Root got:", resp.json())
print("SUCCESS" if resp.json()["status"] == "done" else "FAILURE")
```

Start the two servers in two terminals, then run the root in a third:
```shell
pip install fastapi uvicorn httpx
uvicorn b:app --port 8002   # terminal 1
uvicorn a:app --port 8001   # terminal 2
python root.py              # terminal 3
```

Root’s terminal prints:
```
Root got: {'status': 'done', 'downstream': {'status': 'accepted'}}
SUCCESS
```

SUCCESS. Exit code 0 (the success code). Meanwhile, moments later, in B’s terminal, you will see a line like WORK DROPPED -- and no one upstream is listening followed by a traceback. B already answered 200 before the work ran, so when the work blows up, the failure has nowhere to go. The job you asked for did not happen, and every single participant believes everything is fine.
The line t = asyncio.create_task(do_the_real_work()) is the whole trick: it starts the work running but does not wait for it — Python schedules it to run later and immediately moves on to return the 200. That single line is the gap between “accepted” and “done,” and it is everywhere. Swap create_task for a queue, a background thread, or a message to another service, and the trap is identical: nearly every async API in the world answers accepted and lets you assume completed.
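The ordering can be seen without any web server at all. What follows is a minimal sketch using only the standard library; `handler` and `events` are illustrative names that do not appear in the demo, and the 0.01-second sleep stands in for B's long task.

```python
import asyncio

events = []        # records the order in which things actually happen
_running = set()   # hold a reference so the task is not garbage-collected


async def do_the_real_work():
    try:
        await asyncio.sleep(0.01)              # stand-in for a long task
        raise RuntimeError("worker died")
    except RuntimeError:
        events.append("work dropped")          # nobody upstream hears this


async def handler():
    task = asyncio.create_task(do_the_real_work())   # scheduled, NOT awaited
    _running.add(task)
    task.add_done_callback(_running.discard)
    events.append("replied accepted")                # the reply goes out first
    return {"status": "accepted"}


async def main():
    reply = await handler()
    await asyncio.sleep(0.05)   # the caller already has its reply by now
    return reply


reply = asyncio.run(main())
print(reply, events)
```

The reply is recorded before the failure, and the failure never reaches the caller. That is the same gap the three-script demo stretches across a network.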
That gap is the crux. What each reply actually licenses you to believe:
| What the reply says | What it actually means | What it does NOT mean |
|---|---|---|
| 200 / accepted | B received the request | B finished, or even started, the work |
| A returns done | A heard B say accepted | the work below A succeeded |
| Root prints SUCCESS | every reply came back | the job got done |
The work is gone. The system is green. No one lied.
Why A genuinely cannot know
If you go looking for a bug in the code, you won’t find one — not in the usual sense. Each node behaved correctly given what it could see. The failure is in the shape of the system, not in any one node’s code.
Walk the timeline. A opens an HTTP request to B. B accepts it and returns 200. That response closes the connection. From that instant, A and B share no channel. The work B does next happens on B’s side of a boundary A can no longer see across. A asking “did B finish?” after the request closed is like asking “is the runner still running?” after they have left the stadium — there is no wire left to carry the answer.
You could patch this demo (have B wait for the work, or call A back when it is done). But every patch just moves the boundary. Make B call A back, and now B can’t see whether A finished relaying that result up to Root. Add a C downstream of B and the frontier — the set of still-active hops in the cascade — runs off the edge of everyone’s vision again. The structural fact survives every patch.
That distinction is the spine of the whole problem:
A local fact is something one node can see for itself. A global fact is something true of the whole cascade. No node in a delegation chain can directly observe a global fact — it can only see its own edges.
Look at the right-hand column: every row that actually answers “are we done?” reads nobody.
| Fact | Kind | Who can see it directly |
|---|---|---|
| “I sent my request” | Local | The sender |
| “I got a 200 back” | Local | The caller on that one hop |
| “My immediate child accepted” | Local | The parent of that hop |
| “B’s background work succeeded” | Global | Nobody upstream of B: it happens after the hop closed |
| “The entire A→B→C cascade has settled” | Global | Nobody: no participant sees all the edges |
Every row a node can see is local. The information each agent holds is true and insufficient, all at once.
What each node can — and can’t — see
The reason no clever logging fixes this is that the knowledge is partitioned by design. Each participant holds one slice of the truth and none holds the union:
| Node | Knows for certain | Is blind to |
|---|---|---|
| Root | “A returned done.” | Whether B (or anyone below) actually did the work. |
| A | “B returned accepted.” | Whether B’s accepted work ran, dropped, or died. |
| B | “I accepted the work; it later crashed.” | That Root already declared SUCCESS and stopped listening. |
Stack those rows and the gap is visible: there is no column for “the whole system,” because there is no node standing where the whole system is visible.
This is the part that separates a mesh from a single box. On one machine, in one process, you can at least imagine one watcher seeing everything: the same memory, one event loop you could instrument. (That single-box version has its own quiet failure mode, covered in “Green is not evidence,” but it lives inside one process where, in principle, one observer could exist.) The moment you cross a network boundary, that imagined watcher is gone for good. A and B are separate processes. The instant A’s request to B closes, the only wire between them is severed. There is no global observer, and there cannot be one. This isn’t a missing feature someone forgot to build; no single participant sits where it could see the whole picture. That is what “distributed” means.
“Just wait a few seconds” is not an answer
Faced with this, the reflex is to throw a timeout at it: give the cascade ten seconds, then call it done. It feels prudent. It is the wrong shape of answer, not just a weak one.
A timeout asks the clock. The question you actually have is about the mesh: has every hop, including ones you can’t see, settled? Those are different questions, and the clock’s answer is never the mesh’s answer:
| Scenario when the 10-second timeout fires | What is really happening | Verdict the timeout gives you |
|---|---|---|
| Cascade settled in 1 second | The work finished long ago | Right answer, wasted 9 seconds |
| A hop is still working at second 11 | Live work, declared done | Wrong: green over running work |
| A hop died silently at second 3 | The work is already a corpse | Wrong: SUCCESS over a corpse |
| Cascade settled in exactly 10 seconds | The work finished as the timer fired | Right by luck |
A timeout is right only by coincidence. It can be too slow and too fast in the same system, because the work it is guessing about has no fixed duration. Tune it short and it fails healthy chains; tune it long and it passes dead ones. There is no right number, because a timeout is not a slightly-weak version of the right check — it is a different category. It samples one node’s clock; the property you want is about all the edges being empty at once. You cannot measure a global property with a local stopwatch.
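A scaled-down sketch makes the mismatch visible. It assumes a hypothetical `hop` coroutine standing in for a downstream agent and a 0.10-second timer standing in for the 10-second timeout; neither name comes from the demo. The gate consults only the clock, so its verdict is the same in every case.

```python
import asyncio


async def hop(duration, crashes):
    """Hypothetical downstream hop: runs for `duration` seconds, may die silently."""
    await asyncio.sleep(duration)
    return "dropped" if crashes else "finished"


async def timeout_gate(duration, crashes, timeout):
    """The anti-pattern: start the hop, wait `timeout` seconds, assume done."""
    task = asyncio.create_task(hop(duration, crashes))
    await asyncio.sleep(timeout)
    verdict = "done"    # only the clock was consulted, never the hop
    # The gate itself never looks at this; we peek only to compare with reality.
    truth = task.result() if task.done() else "still running"
    task.cancel()       # tidy up the hop that is still running (no-op if finished)
    return verdict, truth


async def main():
    cases = [(0.01, False), (0.30, False), (0.03, True)]  # fast, slow, silently dead
    return [await timeout_gate(d, c, timeout=0.10) for d, c in cases]


results = asyncio.run(main())
for verdict, truth in results:
    print("gate says:", verdict, "| truth:", truth)
```

All three runs report done, while the truth column reads finished, still running, and dropped. No choice of `timeout` changes that, because the gate never observes the hop.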
This problem has a name (and it’s older than you think)
Giving the problem its name converts a vague unease into a known, heavy-artillery problem with a literature behind it, the kind of thing worth carrying into your next architecture review or a board meeting.
What you are looking at is not a quirk of FastAPI or of agents. Proving that a computation spread across many independent participants has globally stopped — every node idle, and no work still in flight between them — is a named, studied, genuinely hard problem in distributed systems, and computer scientists have wrestled with it since the early 1980s:
Distributed termination detection — proving that a spread-out computation has globally stopped: every participant is idle, and no work is still traveling between them.
It is hard for exactly the reason the demo shows. Any single node can be idle right now while a message carrying more work is traveling toward it. To declare the whole system done, you need to know that all nodes are idle and nothing is in transit, a global property no single participant can see alone. (Our two-hop toy shows the idle-misread half; the “in transit” half is what makes the general A→B→C problem, with a real C still being fed, even harder.) There are real algorithms for it (Dijkstra–Scholten, the Dijkstra–Feijen–van Gasteren token ring, and others) and entire textbook chapters, with proven lower bounds on how much coordination it costs. It is inherently hard, which is precisely why it tends to get skipped.
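To give a feel for what such an algorithm does, here is a simplified, single-threaded simulation in the spirit of Dijkstra–Scholten. It is a sketch under strong assumptions, not a production implementation: delivery is reliable, no node crashes, the delegation graph is acyclic, and `run_cascade` and its `topology` argument are hypothetical names. Each node counts the work messages it has sent that are not yet acknowledged (its deficit), and it acknowledges the node that first engaged it only once that count is back to zero.

```python
from collections import deque


def run_cascade(topology, root="root"):
    """Simulate one diffusing computation and return the ordered event log.

    ASSUMPTIONS: reliable delivery, no crashes, acyclic `topology`
    (a dict mapping each node name to the names it delegates to).
    """
    parent = {name: None for name in topology}   # who first engaged each node
    deficit = {name: 0 for name in topology}     # work sent but not yet acked
    in_transit = deque()                         # (kind, sender, receiver)
    log = []

    def send_work(src, dst):
        deficit[src] += 1
        in_transit.append(("work", src, dst))

    def release_if_quiet(name):
        # A non-root node acks its parent only once all its own sends are acked.
        if deficit[name] == 0 and parent[name] is not None:
            in_transit.append(("ack", name, parent[name]))
            parent[name] = None

    for child in topology[root]:
        send_work(root, child)

    while in_transit:
        kind, src, dst = in_transit.popleft()
        if kind == "work":
            if dst != root and parent[dst] is None:
                parent[dst] = src                      # first engagement: defer the ack
            else:
                in_transit.append(("ack", dst, src))   # already engaged: ack at once
            log.append(f"{dst} worked")
            for child in topology[dst]:
                send_work(dst, child)
            release_if_quiet(dst)
        else:  # an ack came back
            deficit[dst] -= 1
            release_if_quiet(dst)
            if dst == root and deficit[root] == 0:
                log.append("root: cascade terminated")
    return log


log = run_cascade({"root": ["A"], "A": ["B", "C"], "B": [], "C": []})
print(log)
```

The root's verdict arrives only after every acknowledgment has drained back up the tree, so it reflects the whole cascade and not just the first hop. Two caveats apply. Termination is not the same as success, and under these assumptions a node that dies without acknowledging leaves the root's deficit above zero forever. Handling crashes takes additional machinery that this sketch does not attempt.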
For the reader who wants one glanceable glossary to walk away with:
| Plain phrase | The CS name | Why it’s hard |
|---|---|---|
| “Is the whole swarm done?” | Distributed termination detection | No node can see the global state |
| “I waited 10s, assume done” | Timeout-based liveness | Guesses with a clock; never observes the mesh |
| “B said accepted” | A local liveness fact | True, but nowhere near sufficient |
Stated plainly: the agent ecosystem is currently blind to it. The frameworks now wiring agents into delegating chains — one model calling another, fanning work across services — are rebuilding multi-node distributed systems without the vocabulary the multi-node world spent forty years building. They ship the topology and skip the termination detection. The result is exactly our demo at scale: swarms of A→B→C cascades that print SUCCESS and exit 0 while the frontier is quietly non-empty.
I think quiescence — knowing a mesh has actually gone quiet — deserves to be a first-class thing you can ask for, not a wall-clock guess. And the bar for a real answer is exact: a sound quiescence signal must fire only when every node is idle and no work is in transit between them — it has to observe the whole active frontier of the mesh, not just the edges any one node can see. A closed stream is not that. A clock is not that. Whether you reach it with the classic algorithms (Dijkstra–Scholten, token rings) or something new, that is the invariant a real answer has to satisfy. We haven’t closed the gap yet, but the shape of the missing primitive is strictly defined.
The proof that this is real, not theory
This is not an extrapolation from a toy. This exact blind spot was sitting in Google’s reference A2A SDK — a teardown bug I found and fixed, now open as a PR upstream. And it is not one vendor’s slip: the A2A protocol itself punts on the hard part, and the official conformance test kit fakes the answer. Three independent admissions, from three layers of the stack, that the ecosystem has solved “the stream closed” and has not solved “the mesh is done.”
That is “Task was destroyed but it is pending”: the same blindness, this time with primary sources you can click. The fix I submitted there is real, tested, and local: it makes one process’s teardown deterministic. It does not hand the ecosystem the missing drain primitive. The in-process bug and the cross-network gap in this demo are the same shape at two zoom levels, but solving the first does not solve the second. The problem is systemic and profound; the fix I submitted is local and precise.
The portable rule
| What read green | What was actually true | The rule it became |
|---|---|---|
| Root printed SUCCESS; exit 0 | B dropped the work it accepted; no one upstream can ever learn it | A reply means “my neighbor received it,” never “the work finished” |
| “The request returned 200” | A local fact about one edge | Never let a local fact stand in for the global one it can’t speak to |
| “We waited 10 seconds, nothing screamed” | The clock advanced; the mesh’s actual state was never observed | Gate on an observed settlement signal, never on a wall-clock timeout |
| Each node reported truthfully | True local facts composed into a false global one | No participant in a mesh can see the whole frontier — so don’t ask one to |
If a planted, work-dropping node can survive your cascade and your system stays green, your “done” is decoration. Stop asking one node whether the swarm finished. No node can answer. That is the portable rule — and the reason “are we done yet?” is a harder question than the entire agent industry is currently treating it as.
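The planted-node check can be sketched as a small fault-injection test. Everything here is an in-process stand-in with hypothetical names (`dropping_leaf`, `fire_and_forget_hop`, `awaiting_hop`, `survives_planted_fault`); it models a single hop, so it illustrates the check and makes no claim to solve the cross-network problem.

```python
import asyncio

_running = set()   # hold references so background tasks are not garbage-collected


async def dropping_leaf():
    """Planted fault: accepts the work, then fails to do it."""
    await asyncio.sleep(0.01)
    raise RuntimeError("planted fault: work dropped")


async def fire_and_forget_hop(downstream):
    """Replies 'done' as soon as the downstream call is scheduled."""
    task = asyncio.create_task(downstream())
    _running.add(task)
    task.add_done_callback(_running.discard)
    task.add_done_callback(lambda t: t.exception())  # failure is swallowed, nobody is told
    return "done"


async def awaiting_hop(downstream):
    """Replies only after observing the downstream outcome."""
    try:
        await downstream()
        return "done"
    except RuntimeError:
        return "failed"


async def survives_planted_fault(hop):
    """True if the hop still reports 'done' with a work-dropping node planted."""
    verdict = await hop(dropping_leaf)
    await asyncio.sleep(0.05)   # let any background failure land before judging
    return verdict == "done"


async def main():
    return (await survives_planted_fault(fire_and_forget_hop),
            await survives_planted_fault(awaiting_hop))


forget_green, await_green = asyncio.run(main())
print("fire-and-forget stays green over a dropped job:", forget_green)
print("awaiting hop stays green over a dropped job:", await_green)
```

A hop that stays green with the fault planted is reporting decoration. The awaiting variant passes this check only because the whole outcome is observable inside one process. Across a real network it moves the boundary one hop further out, as the patching discussion earlier in this piece describes.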
The in-process priors are next door. “Your AI coding agents can’t hear each other” is the same blindness on a single box: peers that cannot tell when a neighbor finishes or fails. “Green is not evidence” is the false-green lens this whole cascade wears. This piece is the cross-process, cross-network sequel: the single box can be given a nervous system, but the mesh does not yet have a way to know it has gone quiet.
Frequently asked questions
- What is the demo actually showing?

  Three programs across a network: Root calls A, A calls B. B answers HTTP 200 'accepted' and then drops the work it promised, which dies a moment later with nobody listening. A only ever saw the 200, so it reports 'done', and Root prints SUCCESS and exits 0. The work never happened, and every participant told the truth about what it could see.

- Why can't A just check whether B finished?

  Because once A's HTTP request to B returns, the connection closes and A and B share no channel. B's work happens on the far side of a boundary A can no longer see across. A can't learn B's downstream fate even in principle: the information channel is gone.

- What is distributed termination detection?

  It is the classic distributed-systems problem of proving that a computation spread across many independent participants has globally stopped: every node idle AND no work still in transit between them. It has been studied since the early 1980s, with named algorithms (Dijkstra–Scholten and others) and proven lower bounds on coordination cost. It is hard precisely because no single participant can see the global state.

- How is this different from 'green is not evidence'?

  That piece is a single-process failure: one program exits 0 over work it never observed, on one machine, where in principle one observer could exist. This piece is a topology failure across a network boundary: three or more separate processes where no participant can see the whole frontier, so no single observer can exist at all. It is the cross-network cousin, not a rerun.

- Does the author claim to have solved this?

  No. This piece names the problem; it does not claim to fix the mesh. The only concrete fix, covered in the next piece, is local: it makes one process's teardown deterministic. The cross-network gap stays open. The stance across both pieces is deliberate: name a systemic failure honestly, while standing behind only the localized fix you can actually defend.
The proof that this is real, not theory — the bug in Google’s A2A SDK, the protocol punting on it, the official test kit faking it — is the next piece.