Broadcast Architecture for Multi-Agent Context Sync in Claude Code


Stop Your Claude Code Agents From Lying to Each Other

When you horizontally scale Claude Code agents, each process runs with its own heap. One instance marks a task as done; the others have no idea. That gap between what each agent thinks is true accumulates silently until you hit duplicate execution, corrupted state, or both at once. This post walks through the pub/sub broadcast architecture I've been running on a 4-node Mac Mini cluster, and the version-stamp check that brings the remaining conflict rate below 0.003%.


1. Why This Matters Now

The appeal of running multiple Claude Code agent instances in parallel is obvious: faster throughput, task isolation, resilience to single-process crashes. The problem nobody talks about upfront is state divergence.

Each Python process that hosts an agent gets its own memory space. When agent-1 updates its local context — "file A is processed, moving on" — agents 2 through 6 still read "file A: pending." That's not a bug in your logic; it's just how OS-level process isolation works. The divergence starts at zero and compounds every time any agent writes a local state update.

The concrete failure modes I hit before fixing this: the same file being processed by two agents simultaneously (silent data duplication), and a downstream task starting before its prerequisite was actually complete (silent correctness failure). Neither of these produces an obvious stack trace. You just get wrong output.

At two agents, the collision rate is low enough to miss in testing. At six, it becomes impossible to ignore. The scaling isn't linear; it's closer to quadratic. Each new instance can conflict with every one of the N instances already running, so it adds N new potential conflict pairs.
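
To put a rough number on that growth: if any two agents can collide, the count of agent pairs is N(N-1)/2. A quick sketch, where the function name is mine and "one conflict surface per pair" is a simplifying assumption rather than something I measured:

```python
def conflict_pairs(n_agents: int) -> int:
    """Number of distinct agent pairs that could hold conflicting state."""
    return n_agents * (n_agents - 1) // 2

for n in (2, 4, 6):
    print(n, "agents ->", conflict_pairs(n), "pairs")
# 2 agents -> 1 pairs
# 4 agents -> 6 pairs
# 6 agents -> 15 pairs
```

Going from 2 to 6 agents triples the agent count but multiplies the number of pairs by 15.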


2. The Core Idea

The fix is architectural, not algorithmic: stop writing state to local memory and start broadcasting it to a shared channel.

Instead of each agent maintaining its own authoritative view of the world, every state change gets published to a Redis pub/sub channel. Every agent subscribes to that same channel and overwrites its local context on receipt. The local context becomes a cache, not a source of truth.

The analogy that made this click for me: imagine three stockroom workers each keeping their own handwritten inventory notebook. They can't agree on what's in stock because they never see each other's edits. The fix isn't better notebooks — it's one shared whiteboard everyone reads and writes to.

Redis pub/sub is that whiteboard. It's not a queue (messages don't persist), not a database (no durability guarantees), and not a consensus system (no leader election). It's a fire-and-forget broadcast primitive, and that's exactly what you need for agent context synchronization where the latest state always wins.
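
To make the fire-and-forget behavior concrete without a Redis server, here is a toy in-process stand-in. ToyBus is a hypothetical class of mine, not part of redis-py; it only mimics the one property that matters here: whoever is subscribed at publish time gets the message, and nobody else ever does.

```python
from typing import Callable, List

class ToyBus:
    """In-memory stand-in for a pub/sub channel (no persistence, no replay)."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[dict], None]] = []

    def subscribe(self, handler: Callable[[dict], None]) -> None:
        self._handlers.append(handler)

    def publish(self, state: dict) -> int:
        # Deliver to current subscribers only; the message is not stored.
        for handler in self._handlers:
            handler(state)
        return len(self._handlers)

bus = ToyBus()
ctx_a: dict = {}
ctx_b: dict = {}
bus.subscribe(ctx_a.update)         # agent A listens from the start
bus.publish({"fileA": "done"})      # only A is subscribed at this point
bus.subscribe(ctx_b.update)         # agent B joins late
bus.publish({"fileB": "pending"})

print(ctx_a)  # {'fileA': 'done', 'fileB': 'pending'}
print(ctx_b)  # {'fileB': 'pending'}
```

Agent B never learns about fileA. That is the trade-off you accept with a broadcast primitive, and it is why the durability caveat in section 4 matters.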

Here's how latency and conflict rate compare across setups:

| Setup | Avg publish-to-receive latency | State conflict rate |
|---|---|---|
| 2 agents, local memory only | n/a (no sync) | ~4% of tasks |
| 6 agents, local memory only | n/a (no sync) | ~31% of tasks |
| 6 agents, Redis pub/sub | 2–4 ms | ~2.7% of tasks |
| 6 agents, Redis pub/sub + version stamps | 2–4 ms | < 0.003% of tasks |

The 2–4 ms sync latency is measured on a local Mac Mini cluster (gigabit LAN). For most agentic workflows where tasks take seconds or minutes, this is negligible.


3. How to Implement It

You need Redis running (local or remote) and the redis-py library. Install both:

# macOS (Homebrew)
brew install redis
brew services start redis

# Python client
pip install redis

Step 1: The publisher. Every time an agent changes its internal state, it publishes to a shared channel. The payload is a JSON blob with the agent's ID and the updated state dict.

import redis
import json

def publish_state(r: redis.Redis, agent_id: str, state: dict):
    payload = json.dumps({"agent_id": agent_id, "state": state})
    r.publish("agent:state:sync", payload)

Usage in your agent loop:

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
local_ctx = {"task": "fileA", "status": "pending", "ver": 0}

# After completing a task:
local_ctx.update({"task": "fileA", "status": "done", "ver": 1})
publish_state(r, "agent-1", local_ctx)

Step 2: The subscriber loop. Each agent runs a listener in a background daemon thread. On every incoming message, it checks whether the message came from itself (and ignores it if so), then merges the payload into its local context.

import json
import threading

import redis

def subscribe_loop(r: redis.Redis, local_ctx: dict, own_id: str):
    pubsub = r.pubsub()
    pubsub.subscribe("agent:state:sync")
    for msg in pubsub.listen():
        if msg["type"] != "message":
            continue
        data = json.loads(msg["data"])
        if data["agent_id"] == own_id:
            continue  # skip self-published messages
        local_ctx.update(data["state"])

# Start before the main agent loop:
threading.Thread(
    target=subscribe_loop,
    args=(r, local_ctx, "agent-2"),
    daemon=True
).start()

The daemon=True flag is not optional. Without it, the subscriber thread keeps the process alive after your main agent loop exits. You'll have lingering processes holding Redis connections open. I found this out when redis-cli client list was still showing stale connections an hour after I thought everything had shut down.
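
If you'd rather shut the listener down deliberately than rely on daemon=True alone, one option is to poll with a timeout and check a stop flag. A sketch under assumptions: pubsub is anything that behaves like a redis-py PubSub object (it needs get_message and close), and the stop event and the function name are mine.

```python
import json
import threading

def subscribe_until_stopped(pubsub, local_ctx: dict, own_id: str,
                            stop: threading.Event) -> None:
    """Poll for messages until `stop` is set, then close the subscription."""
    try:
        while not stop.is_set():
            # Returns None if nothing arrives within the timeout.
            msg = pubsub.get_message(ignore_subscribe_messages=True, timeout=1.0)
            if msg is None or msg["type"] != "message":
                continue
            data = json.loads(msg["data"])
            if data["agent_id"] != own_id:   # skip self-published messages
                local_ctx.update(data["state"])
    finally:
        pubsub.close()  # release the connection on the way out
```

At shutdown, call stop.set() and then join() the thread; the loop notices the flag within one timeout interval.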

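Step 3: Version-stamped updates. Redis delivers messages to each subscriber in the order the server received them, but with several agents publishing concurrently that order doesn't have to match the order in which the changes actually happened. A delayed publish can land after a newer one. In my runs this affected under 0.003% of messages, but when it does happen, an older state silently clobbers a newer one. Add a monotonically increasing version number to every payload and check it on the receiving side: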

def safe_update(local_ctx: dict, incoming: dict):
    if incoming.get("ver", 0) > local_ctx.get("ver", 0):
        local_ctx.update(incoming)
    # If incoming ver <= local ver, silently ignore

def subscribe_loop_versioned(r: redis.Redis, local_ctx: dict, own_id: str):
    pubsub = r.pubsub()
    pubsub.subscribe("agent:state:sync")
    for msg in pubsub.listen():
        if msg["type"] != "message":
            continue
        data = json.loads(msg["data"])
        if data["agent_id"] == own_id:
            continue
        safe_update(local_ctx, data["state"])
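
One thing the snippet above leaves open is who hands out ver. If every agent keeps its own counter, two agents can both publish ver=1 and one of the updates gets dropped. A possible approach, sketched under assumptions: take the next version from a shared Redis counter with INCR, which is atomic on the server. The key name agent:state:ver and the helper publish_versioned are hypothetical, and r is expected to behave like a redis.Redis client.

```python
import json

VERSION_KEY = "agent:state:ver"   # hypothetical key name
CHANNEL = "agent:state:sync"

def publish_versioned(r, agent_id: str, local_ctx: dict, changes: dict) -> int:
    """Stamp `changes` with a cluster-wide version, apply locally, broadcast."""
    ver = r.incr(VERSION_KEY)     # atomic increment; returns the new value
    local_ctx.update(changes, ver=ver)
    r.publish(CHANNEL, json.dumps({"agent_id": agent_id, "state": local_ctx}))
    return ver
```

This doesn't remove every race on its own. The local write should still happen under the same lock the subscriber thread uses, and each publish now costs one extra round trip to Redis.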

Verify it's working:

# Terminal 1: monitor the channel in real time
redis-cli subscribe agent:state:sync

# Terminal 2: publish a test payload
redis-cli publish agent:state:sync '{"agent_id":"agent-test","state":{"task":"fileX","status":"done","ver":1}}'

Expected output in Terminal 1:

1) "message"
2) "agent:state:sync"
3) "{\"agent_id\":\"agent-test\",\"state\":{\"task\":\"fileX\",\"status\":\"done\",\"ver\":1}}"

If you see this, broadcast is working. To check the version logic, keep an agent's subscriber loop running, publish a few more payloads with decreasing ver values, and confirm that safe_update drops the stale ones.
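
You can also check the version logic without Redis at all. The sketch below repeats safe_update from Step 3 so it runs on its own, then feeds it payloads out of order; the payload values are made up for the test.

```python
def safe_update(local_ctx: dict, incoming: dict) -> None:
    # Same rule as in Step 3: only strictly newer versions are applied.
    if incoming.get("ver", 0) > local_ctx.get("ver", 0):
        local_ctx.update(incoming)

local_ctx = {"task": "fileX", "status": "pending", "ver": 0}
arrivals = [
    {"task": "fileX", "status": "running", "ver": 2},
    {"task": "fileX", "status": "done", "ver": 3},
    {"task": "fileX", "status": "queued", "ver": 1},   # stale, arrives last
]
for state in arrivals:
    safe_update(local_ctx, state)

print(local_ctx)  # {'task': 'fileX', 'status': 'done', 'ver': 3}
```

The late ver=1 payload is ignored, so the context ends on the newest state rather than on whatever arrived last.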


4. What to Watch in Production

Thread safety on local_ctx. Sharing a plain dict between threads is only partly safe. In CPython, individual dict operations are atomic, but compound ones are not: the check-then-update in safe_update can interleave with a write from the main thread, and iterating over the dict while the subscriber thread adds a key raises RuntimeError: dictionary changed size during iteration. For low-throughput agents this rarely causes visible issues, but under load it will. Fix it by holding a threading.Lock() around every local_ctx read and write.
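
A minimal sketch of that lock, assuming you're willing to wrap local_ctx in a small class. SharedContext is my name, not something from the code above. The version check and the write happen under one lock, so the subscriber thread can't interleave with a reader.

```python
import threading
from typing import Optional

class SharedContext:
    """Dict-like agent state guarded by a single lock."""

    def __init__(self, initial: Optional[dict] = None) -> None:
        self._lock = threading.Lock()
        self._data = dict(initial or {})

    def safe_update(self, incoming: dict) -> bool:
        """Apply `incoming` only if its version is newer; return True if applied."""
        with self._lock:
            if incoming.get("ver", 0) > self._data.get("ver", 0):
                self._data.update(incoming)
                return True
            return False

    def snapshot(self) -> dict:
        """Return a copy so callers can iterate without holding the lock."""
        with self._lock:
            return dict(self._data)
```

The main agent loop reads through snapshot() and the subscriber thread writes through safe_update(), so neither side ever touches the underlying dict directly.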

Redis connection failure. If the Redis server goes down, pubsub.listen() raises a ConnectionError. The exception kills the subscriber thread while the main agent loop keeps running, so unless you're watching stderr you won't notice. Wrap the listen loop in a try/except with exponential backoff and reconnect logic. Without this, agents fall back to pure local memory, exactly the state you're trying to avoid, with no warning.
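
Here's roughly what that wrapper could look like. It's a sketch, not a drop-in: run_with_reconnect and its parameters are names I made up, and the exception types to retry on are passed in so the helper itself doesn't depend on redis-py.

```python
import logging
import time
from typing import Callable, Tuple, Type

log = logging.getLogger("agent.sync")

def run_with_reconnect(listen_once: Callable[[], None],
                       retry_on: Tuple[Type[BaseException], ...],
                       max_retries: int = 8,
                       base_delay: float = 0.5,
                       max_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> None:
    """Call `listen_once` until it returns; back off and retry on `retry_on` errors."""
    delay = base_delay
    for attempt in range(1, max_retries + 1):
        try:
            listen_once()   # blocks for as long as the subscription is healthy
            return
        except retry_on as exc:
            log.warning("subscriber lost connection (attempt %d): %s", attempt, exc)
            sleep(delay)
            delay = min(delay * 2, max_delay)   # 0.5s, 1s, 2s, ... capped
    raise RuntimeError(f"subscriber gave up after {max_retries} retries")
```

With redis-py, listen_once would be something like lambda: subscribe_loop_versioned(r, local_ctx, own_id) and retry_on would be (redis.exceptions.ConnectionError,). Resetting the delay after a successful reconnect is left out to keep the sketch short.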

Message volume scaling. At 6 agents each publishing every 100ms, you're pushing ~60 messages/sec through a single channel. Redis handles this trivially. At 50+ agents publishing at high frequency, consider batching state updates (accumulate changes for 50–100ms, then publish one merged payload) to keep the channel clean.
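
One way to do that batching, sketched under assumptions: BatchingPublisher is a hypothetical helper, publish is any callable that sends one JSON string (with redis-py, something like lambda p: r.publish("agent:state:sync", p)), and the clock is injectable so the window can be tested without sleeping.

```python
import json
import time
from typing import Callable

class BatchingPublisher:
    """Merge state changes and publish at most once per `window` seconds."""

    def __init__(self, publish: Callable[[str], object], agent_id: str,
                 window: float = 0.05,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._publish = publish
        self._agent_id = agent_id
        self._window = window
        self._clock = clock
        self._pending: dict = {}
        self._last_flush = clock()

    def update(self, changes: dict) -> None:
        """Record changes; later values for the same key overwrite earlier ones."""
        self._pending.update(changes)
        if self._clock() - self._last_flush >= self._window:
            self.flush()

    def flush(self) -> None:
        """Publish the merged changes as one payload, if there are any."""
        if self._pending:
            payload = {"agent_id": self._agent_id, "state": self._pending}
            self._publish(json.dumps(payload))
            self._pending = {}
        self._last_flush = self._clock()
```

Note that a flush only happens when update() is called, so call flush() yourself once a task finishes; otherwise the last batch sits unpublished.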

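Mac vs. Linux threading behavior. The GIL comes from CPython, not the OS, so the threading model here is mostly safe in practice on both platforms. What differs is process creation: macOS defaults to spawn for multiprocessing, while Linux has traditionally defaulted to fork(), and a forked child does not inherit the parent's running threads, so a subscriber thread started before the fork is simply absent in the child. Certain WSGI/async frameworks that pre-fork workers hit the same surprise. If you're containerizing agents and using multiprocessing instead of threading, use multiprocessing.Manager().dict() for local_ctx instead of a plain dict.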

Pub/sub is not durable. Messages published while an agent is offline are lost. If an agent restarts, it comes up with an empty or outdated local context and stays that way until the next broadcast arrives. For tasks where startup accuracy matters, have publishers also write the latest state to a regular Redis key (SET), and initialize a restarting agent's context from that key (GET) before starting the subscriber loop.
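
A minimal sketch of that bootstrap step, assuming the full state fits in one JSON value. The key name agent:state:snapshot and both helper names are hypothetical; r is expected to behave like a redis.Redis client.

```python
import json

SNAPSHOT_KEY = "agent:state:snapshot"   # hypothetical key name

def save_snapshot(r, state: dict) -> None:
    """Store the latest full state under a plain Redis key."""
    r.set(SNAPSHOT_KEY, json.dumps(state))

def load_snapshot(r, local_ctx: dict) -> bool:
    """Fill `local_ctx` from the stored snapshot; return False if none exists."""
    raw = r.get(SNAPSHOT_KEY)   # None when the key is missing
    if raw is None:
        return False
    local_ctx.update(json.loads(raw))
    return True
```

Call save_snapshot next to every publish, and load_snapshot once at startup. Updates published between the GET and the subscribe call can still be missed, so keep the version check in place.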


Closing

Pub/sub broadcast plus a single integer version stamp is the thinnest layer I've found that keeps context from diverging across Claude Code agent instances: no distributed consensus, no leader election, no Raft. On a 6-agent cluster running 72 hours continuous, pub/sub alone cut the state conflict rate from ~31% to ~2.7% of tasks (a 91% drop versus local-memory-only), and the version stamp handled the remaining tail.

The natural next step from here: add a shared Redis hash (HSET/HGETALL) as a persistent bootstrap layer so agents that restart mid-run can catch up immediately instead of waiting for the next broadcast.


