Most agent frameworks treat communication as an afterthought — agents call each other via HTTP, pass JSON payloads, and hope for the best. This works for demos. It breaks in production the moment you need real-time coordination, streaming results, or reliable delivery under load.
Agent-to-agent communication is a distributed systems problem, and distributed systems problems require serious protocol design. Here's what we've learned building communication infrastructure for agents that need to work together at machine speed.
The Communication Spectrum
Not all agent interactions need the same communication model. The right protocol depends on three factors: latency requirements, reliability needs, and the interaction pattern.
Protocol 1: Request-Response (HTTP/REST)
The simplest model. Agent A sends a request to Agent B and waits for a response. Familiar, well-understood, widely supported.
Best for: Simple, synchronous tasks. "Translate this text." "Classify this document." "Extract entities from this paragraph."
Limitations: The call blocks. With a synchronous client, Agent A can't do anything else while waiting for Agent B, so a slow Agent B stalls Agent A. If Agent B fails, Agent A needs its own retry logic. At scale, connection pools become a bottleneck.
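As a rough illustration of the retry logic mentioned above, here is a minimal sketch using only the Python standard library. The names `http_send` and `call_with_retry`, the retry count, and the backoff values are assumptions made for this example, not part of any particular framework.

```python
import json
import time
import urllib.request
from typing import Any, Callable


def http_send(url: str, payload: dict[str, Any], timeout_s: float = 30.0) -> dict[str, Any]:
    """POST a JSON payload to another agent and return its decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises URLError or TimeoutError on failure; both subclass OSError.
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


def call_with_retry(
    send: Callable[[dict[str, Any]], dict[str, Any]],
    payload: dict[str, Any],
    attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call `send`, retrying network-level failures with exponential backoff."""
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            return send(payload)
        except OSError as error:  # connection refused, timeout, HTTP error, ...
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError(f"agent call failed after {attempts} attempts") from last_error
```

A caller would wrap the transport, for example `call_with_retry(lambda p: http_send(agent_b_url, p), {"text": "hello"})`, where `agent_b_url` is a placeholder. Note that the caller still blocks for the whole retry sequence, which is the limitation described above.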
Protocol 2: Asynchronous Messaging (Queue-Based)
Agent A publishes a task to a message queue. Agent B picks it up when available and publishes the result to a response queue. Neither agent blocks.
Best for: Tasks where latency isn't critical but reliability is. Batch processing, background jobs, and any task where "done in the next 5 minutes" is acceptable.
Implementation: Use a message broker (Redis Streams, RabbitMQ, or cloud-native options like Amazon SQS or Google Cloud Pub/Sub). Include a correlation ID in every message so you can match requests to responses.
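The correlation-ID pattern can be sketched without a real broker. The example below uses in-memory `asyncio.Queue` objects as a stand-in for the task and response queues; the message field names (`correlation_id`, `body`) and the uppercase "work" are assumptions for illustration only.

```python
import asyncio
import uuid


async def worker(tasks: asyncio.Queue, results: asyncio.Queue) -> None:
    """Agent B: consume tasks and publish results carrying the same correlation ID."""
    while True:
        message = await tasks.get()
        reply = {
            "correlation_id": message["correlation_id"],
            "body": message["body"].upper(),  # placeholder for real work
        }
        await results.put(reply)
        tasks.task_done()


async def main() -> dict[str, str]:
    """Agent A: publish tasks, then match replies to requests by correlation ID."""
    tasks: asyncio.Queue = asyncio.Queue()
    results: asyncio.Queue = asyncio.Queue()
    worker_task = asyncio.create_task(worker(tasks, results))

    pending: dict[str, str] = {}  # correlation_id -> original request body
    for body in ["alpha", "beta"]:
        correlation_id = str(uuid.uuid4())
        pending[correlation_id] = body
        await tasks.put({"correlation_id": correlation_id, "body": body})

    matched: dict[str, str] = {}
    while pending:
        reply = await results.get()
        request_body = pending.pop(reply["correlation_id"])
        matched[request_body] = reply["body"]

    worker_task.cancel()  # stop the in-memory worker once all replies arrived
    return matched


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With a real broker the two queues would live outside the process, but the matching step stays the same: the requester keeps a table of outstanding correlation IDs and resolves each reply against it.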
Protocol 3: Server-Sent Events (SSE) / Streaming
Agent B streams its response back to Agent A as it generates it. Instead of waiting for the complete result, Agent A receives partial results in real time.
Best for: Long-running tasks where progress visibility matters. A research agent that streams findings as it discovers them. A code generation agent that streams functions as it writes them.
Implementation: SSE is simpler than WebSockets and sufficient for most streaming use cases. The server pushes events; the client receives them. One-directional, but that's usually all you need for result streaming.
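To make the one-directional event flow concrete, here is a small sketch of the SSE wire format: each event is a few `field: value` lines followed by a blank line. The helper names `format_sse` and `parse_sse`, the event names, and the JSON payloads are assumptions for this example; a real deployment would send these lines over an HTTP response with the `text/event-stream` content type.

```python
import json
from typing import Iterable, Iterator


def format_sse(data: dict, event: str = "message") -> str:
    """Serialize one event in the text/event-stream format."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_name, payload) pairs from an iterable of stream lines."""
    event, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield event, json.loads("\n".join(data_lines))
            event, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The format strips a single leading space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)


# Agent B emits two events; Agent A consumes them as they arrive.
stream = format_sse({"finding": "A"}, "partial") + format_sse({"status": "done"}, "complete")
events = list(parse_sse(stream.splitlines()))
# events == [("partial", {"finding": "A"}), ("complete", {"status": "done"})]
```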
Protocol 4: WebSocket (Bidirectional Real-Time)
Full-duplex communication where both agents can send messages at any time. Essential for collaborative patterns where agents need to negotiate, debate, or coordinate in real time.
Best for: Agent negotiation, real-time auctions, collaborative problem-solving where agents need to exchange multiple messages quickly.
Implementation: More complex than HTTP or SSE. Requires connection management, heartbeats, and reconnection logic. Use it only when you genuinely need bidirectional real-time communication.
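The heartbeat and reconnection bookkeeping can be sketched independently of any WebSocket library. The class and function below are hypothetical helpers, and the interval, timeout, and backoff values are assumptions; the actual ping, pong, and connect calls would come from whichever WebSocket client you use.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConnectionHealth:
    """Tracks ping/pong timestamps (in seconds) for one connection."""
    ping_interval_s: float = 10.0
    pong_timeout_s: float = 30.0
    last_ping_at: float = 0.0
    last_pong_at: float = 0.0

    def ping_due(self, now: float) -> bool:
        """True when it is time to send another ping."""
        return now - self.last_ping_at >= self.ping_interval_s

    def mark_ping(self, now: float) -> None:
        self.last_ping_at = now

    def mark_pong(self, now: float) -> None:
        self.last_pong_at = now

    def is_stale(self, now: float) -> bool:
        """True when no pong arrived within the timeout; the caller should reconnect."""
        return now - self.last_pong_at > self.pong_timeout_s


def reconnect_delay(
    attempt: int,
    base_s: float = 1.0,
    cap_s: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return rng() * ceiling
```

A connection loop would call `ping_due` and `is_stale` on a timer, close the socket when the connection goes stale, and sleep for `reconnect_delay(attempt)` before each reconnection attempt. The jitter keeps many agents from reconnecting at the same instant after an outage.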
Message Sequence: A Typical Agent Collaboration
A typical collaboration involves multiple protocol types: synchronous discovery (steps 1-2), parallel assignment (steps 3-4), streaming progress (step 5), peer-to-peer communication between workers (steps 6-7), and asynchronous completion (steps 8-9). A production system needs to support all of these patterns.
Message Design Principles
Regardless of which protocol you use, agent messages should follow these principles:
- Self-describing — Every message includes its type, version, and schema. The receiving agent can validate and process the message without external documentation.
- Idempotent — Processing the same message twice produces the same result. Networks are unreliable; messages will be duplicated.
- Traceable — Every message carries a correlation ID that links it to the original request. When debugging a distributed agent system, you need to reconstruct the full message chain.
- Versioned — Message schemas evolve. Versioning lets old agents and new agents coexist during transitions.
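The four principles above can be combined in a single message envelope. The sketch below is one possible shape, not a standard: the field names, the `SUPPORTED_VERSIONS` table, the `task.request` type, and the length-counting "work" are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any

# Hypothetical registry: which schema versions this agent accepts per message type.
SUPPORTED_VERSIONS: dict[str, set[int]] = {"task.request": {1, 2}}


@dataclass(frozen=True)
class Envelope:
    type: str                # self-describing: what kind of message this is
    version: int             # versioned: which schema the payload follows
    correlation_id: str      # traceable: links back to the original request
    payload: dict[str, Any]
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class IdempotentHandler:
    """Processes each message_id once and replays the stored result on duplicates."""

    def __init__(self) -> None:
        self._results: dict[str, dict[str, Any]] = {}
        self.processed_count = 0

    def handle(self, message: Envelope) -> dict[str, Any]:
        if message.version not in SUPPORTED_VERSIONS.get(message.type, set()):
            raise ValueError(f"unsupported {message.type} v{message.version}")
        if message.message_id in self._results:
            return self._results[message.message_id]  # duplicate delivery
        self.processed_count += 1
        result = {
            "correlation_id": message.correlation_id,
            "length": len(message.payload["text"]),  # placeholder for real work
        }
        self._results[message.message_id] = result
        return result
```

In a long-running agent the in-memory result table would need an expiry policy or a shared store; that detail is left out here.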
Error Handling in Agent Communication
Distributed systems fail in distributed ways. Your communication layer needs to handle:
- Agent unavailability — The target agent is down or overloaded. Implement circuit breakers that stop sending requests after N failures, with exponential backoff on retries.
- Partial failures — In fan-out patterns, some agents succeed and others fail. Decide in advance: do you return partial results or wait for all agents? The answer depends on the use case.
- Timeout cascades — Agent A waits for Agent B, which waits for Agent C. Set timeouts at each level, and make them shorter as you go deeper. If C times out at 10 seconds, B should time out at 15, and A at 20.
- Poison messages — A message that crashes the receiving agent every time it's processed. Without dead-letter handling, the message gets retried forever. Move failed messages to a dead-letter queue after N attempts.
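Two of the failure modes above, agent unavailability and poison messages, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the class and function names, the thresholds, and the message shape (`body`, `attempts`) are hypothetical, and the queue is an in-memory `deque` rather than a real broker.

```python
import time
from collections import deque
from typing import Any, Callable


class CircuitOpenError(RuntimeError):
    """Raised when calls are being skipped because the target keeps failing."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, func: Callable[..., Any], *args: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; skipping call")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        self.failures = 0  # any success closes the circuit
        return result


def drain(queue: deque, handler: Callable[[Any], None],
          dead_letters: list, max_attempts: int = 3) -> None:
    """Process messages; park any that fail max_attempts times in dead_letters."""
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
        except Exception as error:
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= max_attempts:
                dead_letters.append({**message, "error": repr(error)})
            else:
                queue.append(message)  # retry later, behind other messages
```

The breaker stops hammering a failing agent, while the dead-letter list keeps one bad message from blocking the rest of the queue. A production version would also add backoff between retries and persist the dead letters for inspection.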
Choosing Your Protocol Stack
For most agent systems, start with this stack:
- Synchronous work: HTTP/REST with structured JSON payloads and 30-second timeouts
- Background work: Message queue with at-least-once delivery and dead-letter handling
- Streaming: SSE for progress updates and partial results
- Real-time coordination: WebSockets only when you genuinely need bidirectional communication
Don't over-engineer. Start with HTTP. Add queues when you need reliability. Add streaming when you need progress visibility. Add WebSockets when you need real-time negotiation.
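The incremental approach can be written down as a simple rule of thumb. The sketch below is only an illustration of the advice above; the `Requirements` fields and the protocol labels are assumptions, not a formal decision procedure.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    needs_reliable_delivery: bool = False   # work must survive crashes and restarts
    needs_progress_updates: bool = False    # callers want partial results
    needs_bidirectional: bool = False       # agents negotiate in real time


def choose_stack(req: Requirements) -> list[str]:
    """Start with HTTP and add a protocol only when a requirement calls for it."""
    stack = ["http"]
    if req.needs_reliable_delivery:
        stack.append("queue")
    if req.needs_progress_updates:
        stack.append("sse")
    if req.needs_bidirectional:
        stack.append("websocket")
    return stack
```

For example, `choose_stack(Requirements(needs_progress_updates=True))` returns `["http", "sse"]`: plain HTTP for the synchronous calls plus SSE for streaming results.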