@sym-bot/sym 0.1.0 → 0.2.0
- package/PRD.md +1 -1
- package/README.md +165 -64
- package/TECHNICAL-SPEC.md +250 -197
- package/bin/setup-claude.sh +31 -1
- package/docs/mesh-memory-protocol.md +563 -0
- package/docs/mmp-architecture-image-prompt.txt +12 -0
- package/docs/p2p-protocol-research.md +907 -0
- package/docs/protocol-wake.md +242 -0
- package/integrations/claude-code/mcp-server.js +240 -18
- package/integrations/telegram/bot.js +418 -0
- package/lib/node.js +488 -39
- package/lib/transport.js +88 -0
- package/package.json +3 -2
- package/sym-relay/Dockerfile +7 -0
- package/sym-relay/lib/logger.js +28 -0
- package/sym-relay/lib/relay.js +388 -0
- package/sym-relay/package-lock.json +40 -0
- package/sym-relay/package.json +18 -0
- package/sym-relay/render.yaml +14 -0
- package/sym-relay/server.js +67 -0
- package/.mcp.json +0 -12
@@ -0,0 +1,907 @@
# P2P Protocol Research for AI Agent Mesh Networks

Research conducted: 2026-03-23

---

## 1. libp2p (IPFS, Filecoin, Ethereum)

**What it is:** A modular networking stack extracted from IPFS. Provides building blocks for P2P applications: transports, peer discovery, content routing, NAT traversal, pubsub.

### Core Design Principles
- **Unix philosophy:** Small, composable, swappable modules
- **Transport agnostic:** TCP, QUIC, WebSocket, WebRTC, WebTransport all supported
- **Identity-centric:** Every peer has a cryptographic identity (PeerId derived from public key)
- **Protocol negotiation:** Peers negotiate which protocols they support at connection time

### Discovery Mechanism
- **Kademlia DHT** for wide-area peer discovery
- **mDNS** for local network discovery (LAN)
- **Bootstrap nodes** as initial entry points
- **Rendezvous protocol** for topic-based discovery

### Transport Model
- Multi-transport: TCP, QUIC, WebSocket, WebRTC, WebTransport
- Connections upgraded to secure channels via TLS 1.3 or Noise protocol
- Stream multiplexing over single connections (yamux, mplex)
- NAT traversal via relay nodes (Circuit Relay v2) and hole punching

### Message Format
- Protocol Buffers for wire format
- Multiaddr for addressing (self-describing network addresses)
- GossipSub for pubsub (mesh + gossip hybrid)
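
To make the self-describing address format concrete, here is a minimal sketch of splitting a multiaddr string into protocol/value pairs. The address and the small value-less protocol set are illustrative assumptions; a real implementation would use the `multiaddr` package and its full registered protocol table.

```javascript
// Illustrative multiaddr parser: splits a self-describing address into
// [protocol, value] pairs. Real implementations also validate each
// protocol against a registered code table.
function parseMultiaddr(addr) {
  const parts = addr.split('/').filter(Boolean);
  // Protocols that carry no value (incomplete, illustrative set):
  const valueless = new Set(['quic', 'quic-v1', 'p2p-circuit', 'ws', 'wss']);
  const pairs = [];
  for (let i = 0; i < parts.length; ) {
    const proto = parts[i++];
    pairs.push(valueless.has(proto) ? [proto, null] : [proto, parts[i++]]);
  }
  return pairs;
}

// 'QmPeerIdHash' is a placeholder PeerId, not a real hash:
const pairs = parseMultiaddr('/ip4/147.75.83.83/tcp/4001/p2p/QmPeerIdHash');
// → [['ip4','147.75.83.83'], ['tcp','4001'], ['p2p','QmPeerIdHash']]
```

The point of the format: the address itself tells a node which transport stack to dial, which is what makes libp2p's multi-transport design workable.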

### State Synchronization
- No built-in state sync — delegates to application layer
- Kademlia DHT for distributed key-value lookups
- GossipSub for real-time message propagation
- Bitswap for content exchange (IPFS-specific)

### Offline/Reconnect Behavior
- DHT routing tables gradually evict unresponsive peers via periodic PING
- Peers re-announce themselves on reconnection
- No built-in message queuing for offline peers
- Relay nodes can bridge connectivity gaps

### Central vs P2P
- **Fully P2P** in design, but practically relies on bootstrap nodes and relay infrastructure
- No central authority; any node can serve any role

### Tradeoffs
- (+) Extremely modular and extensible
- (+) Battle-tested at scale (IPFS, Filecoin, Ethereum)
- (+) Rich NAT traversal capabilities
- (-) Complexity — many moving parts to configure
- (-) No built-in offline message delivery
- (-) DHT can be slow for real-time discovery

### Key Lessons for Agent Mesh
- **Modularity is essential** — agents may run on different platforms with different transport capabilities
- **PeerId as cryptographic identity** is the right model for agent identity
- **GossipSub** is an excellent pubsub model for topic-based agent communication
- **Circuit Relay** pattern solves NAT traversal without requiring all agents to be publicly reachable

---

## 2. Matrix Protocol

**What it is:** An open standard for decentralized, federated real-time communication. Each user connects to a homeserver; homeservers federate with each other.

### Core Design Principles
- **Decentralized conversation history** — no single server owns the truth
- **Eventual consistency** via event DAGs
- **Federation** — servers replicate room state to each other
- **End-to-end encryption** (Megolm/Olm) as first-class feature

### Discovery Mechanism
- Users are addressed as `@user:homeserver.org`
- Room aliases (`#room:server`) resolve to room IDs via server lookups
- Server discovery via `.well-known` DNS records and SRV records
- Room directory for public room discovery

### Transport Model
- Client-Server API: RESTful HTTP + JSON (long-polling or SSE for sync)
- Server-Server (Federation) API: HTTPS with signed JSON
- All events are cryptographically signed by originating server

### Message Format
- JSON events with standardized schema
- Each event has: type, content, sender, room_id, origin_server_ts, event_id
- State events have additional `state_key` field
- Events form a DAG (Directed Acyclic Graph) — each references parent events

### State Synchronization
- **Event DAG:** Each event references its parent events, forming a partial order
- **State Resolution Algorithm (v2):** Deterministic algorithm to resolve conflicts when servers have divergent views of room state
- **Backfill:** Servers can request missing events from other servers in the room
- Events are signed and immutable — append-only log per room
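
The backfill idea reduces to a simple check over the DAG: which parent references point at events we have never received? A sketch, with hypothetical event IDs and a simplified event shape (`event_id` plus `prev_events` only):

```javascript
// Given locally known events, each referencing parent event IDs, find the
// parent references we have never seen. A homeserver would then request
// exactly these IDs from another server in the room (backfill).
function missingAncestors(events) {
  const known = new Set(events.map(e => e.event_id));
  const missing = new Set();
  for (const e of events) {
    for (const parent of e.prev_events) {
      if (!known.has(parent)) missing.add(parent);
    }
  }
  return [...missing];
}

// Hypothetical events: we hold '$b' and '$c' but never received '$a'.
const events = [
  { event_id: '$c', prev_events: ['$b'] },
  { event_id: '$b', prev_events: ['$a'] },
];
// missingAncestors(events) → ['$a']
```

Because every event names its parents, gaps in history are self-describing: a node always knows what it is missing, which is what makes catch-up after downtime tractable.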

### Offline/Reconnect Behavior
- Homeserver stores all events while user is offline
- Client syncs via `/sync` endpoint with a `since` token — gets all missed events
- Federation: servers backfill from peers when they come back online
- **Strong offline support** — the homeserver acts as a persistent store-and-forward relay

### Central vs P2P
- **Federated** — each homeserver is a central point for its users, but no global center
- Experimental P2P Matrix (Pinecone) removes the homeserver requirement
- In practice, matrix.org is a dominant homeserver

### Tradeoffs
- (+) Excellent offline handling — homeserver stores everything
- (+) Rich state model with conflict resolution
- (+) E2E encryption built in
- (+) Bridges to other protocols (Slack, IRC, etc.)
- (-) Homeservers are resource-intensive (Synapse is notoriously heavy)
- (-) State resolution can be complex and expensive
- (-) Federation has trust/moderation challenges
- (-) Event DAG grows indefinitely

### Key Lessons for Agent Mesh
- **Event DAG** is a powerful model for ordering agent interactions
- **State resolution** is critical when agents have divergent views
- **Homeserver-as-proxy** pattern solves mobile background restrictions — agent has a persistent relay
- **Room model** maps well to agent collaboration spaces
- **Backfill** mechanism is exactly what agents need for catching up after being offline

---

## 3. Nostr (Notes and Other Stuff Transmitted by Relays)

**What it is:** A minimalist protocol for decentralized social networking. Clients publish signed events to relays; relays store and serve events.

### Core Design Principles
- **Radical simplicity** — the protocol is tiny (NIP-01 is a few pages)
- **User identity = keypair** (secp256k1). No registration, no usernames, no servers to trust
- **Relays are dumb storage** — they store events and serve them on request
- **Client intelligence** — all logic lives in the client
- **Censorship resistance** — users can publish to multiple relays

### Discovery Mechanism
- Users publish a "relay list" event (NIP-65) declaring which relays they use
- Clients query multiple relays to find a user's events
- No global directory — discovery is organic via relay overlap
- NIP-05: DNS-based identity verification (user@domain.com mapping)

### Transport Model
- **WebSocket** connections between client and relay
- Simple JSON messages over WebSocket:
  - `["EVENT", <event>]` — publish
  - `["REQ", <sub_id>, <filters>]` — subscribe/query
  - `["CLOSE", <sub_id>]` — unsubscribe
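
The entire client-to-relay wire protocol is just those three JSON arrays. A sketch of constructing them (the subscription ID and filter contents are examples, not protocol constants):

```javascript
// The three Nostr client->relay messages as plain JSON arrays.
function publishMsg(event) { return JSON.stringify(['EVENT', event]); }
function subscribeMsg(subId, ...filters) { return JSON.stringify(['REQ', subId, ...filters]); }
function closeMsg(subId) { return JSON.stringify(['CLOSE', subId]); }

// e.g. subscribe to the latest 10 text notes (kind 1) from one author:
const msg = subscribeMsg('sub1', { kinds: [1], authors: ['abc123'], limit: 10 });
// msg === '["REQ","sub1",{"kinds":[1],"authors":["abc123"],"limit":10}]'
```

That a complete client can be written against this in an afternoon is the "radical simplicity" claim made tangible.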

### Message Format
- Single event type with fields: `id`, `pubkey`, `created_at`, `kind`, `tags`, `content`, `sig`
- `kind` integer determines event semantics (0=metadata, 1=text note, 4=DM, etc.)
- Tags provide structured metadata: `["e", <event_id>]`, `["p", <pubkey>]`
- Events are signed with Schnorr signatures

### State Synchronization
- **No sync protocol** — relays are independent and may have different events
- Clients query multiple relays and merge results client-side
- "Replaceable events" (NIP-16): newer event of same kind+pubkey replaces older
- "Parameterized replaceable events": newer event of same kind+pubkey+d-tag replaces older
- No conflict resolution beyond "latest timestamp wins"

### Offline/Reconnect Behavior
- Events persist on relays indefinitely (relay policy permitting)
- Client reconnects and re-subscribes; gets current state from relay
- No guaranteed delivery — if a relay was down when event was published, it may never get it
- Users mitigate by publishing to multiple relays

### Central vs P2P
- **Neither federated nor P2P** — it's a novel "relay" architecture
- Relays are independent servers; clients choose which relays to use
- No relay-to-relay communication (by default)
- Users are sovereign; relays are interchangeable infrastructure

### Tradeoffs
- (+) Extreme simplicity — easy to implement
- (+) User sovereignty — identity is just a keypair
- (+) Censorship resistant — multi-relay redundancy
- (+) No complex state sync needed
- (-) No guaranteed delivery
- (-) Relay discovery can be unreliable
- (-) No built-in E2E encryption (NIP-44 adds it optionally)
- (-) Spam management is relay-by-relay
- (-) "Latest timestamp wins" is a weak conflict resolution
- (-) No built-in E2E encryption by default

### Key Lessons for Agent Mesh
- **Simplicity wins** — Nostr's adoption grew fast because the protocol is trivial to implement
- **Keypair identity** removes all registration/discovery overhead
- **Multi-relay redundancy** is a powerful pattern for resilience
- **"Dumb relay, smart client"** maps well to "dumb infrastructure, smart agent"
- **Event-based model** with kinds is highly extensible
- **Replaceable events** pattern works well for agent state announcements

---

## 4. MQTT (Message Queuing Telemetry Transport)

**What it is:** A lightweight publish/subscribe messaging protocol designed for constrained devices and unreliable networks. OASIS standard.

### Core Design Principles
- **Minimal overhead** — 2-byte fixed header minimum
- **Pub/sub decoupling** — publishers and subscribers don't know about each other
- **Quality of Service levels** — flexible delivery guarantees
- **Designed for unreliable networks** — built-in keepalive, will messages, session persistence

### Discovery Mechanism
- **No peer discovery** — broker-centric architecture
- Clients connect to a known broker endpoint
- Topic-based routing — subscribers discover publishers by subscribing to topics
- Wildcard subscriptions (`+` single level, `#` multi-level)
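
The wildcard semantics fit in a few lines: `+` matches exactly one topic level, and `#` (only valid as the final level) matches all remaining levels, including none. A sketch, ignoring edge cases such as `$`-prefixed system topics:

```javascript
// MQTT topic filter matching: '+' matches one level, '#' matches the rest.
function topicMatches(filter, topic) {
  const f = filter.split('/');
  const t = topic.split('/');
  for (let i = 0; i < f.length; i++) {
    if (f[i] === '#') return true;   // matches the remaining levels
    if (i >= t.length) return false; // filter is longer than the topic
    if (f[i] !== '+' && f[i] !== t[i]) return false;
  }
  return f.length === t.length;      // no trailing unmatched topic levels
}

topicMatches('sensors/+/temperature', 'sensors/kitchen/temperature'); // true
topicMatches('sensors/#', 'sensors/kitchen/temperature');             // true
topicMatches('sensors/+', 'sensors/kitchen/temperature');             // false
```

Matching is purely structural, so a broker can route any payload by topic alone, with no knowledge of message contents.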

### Transport Model
- TCP-based (MQTT 3.1.1)
- MQTT 5.0 adds enhanced features (reason codes, shared subscriptions, topic aliases)
- MQTT-SN variant for UDP/non-TCP transports (sensor networks)
- TLS for encryption
- WebSocket transport available

### Message Format
- Binary protocol with compact fixed header (2 bytes minimum)
- Variable header and payload depend on packet type
- 14 packet types: CONNECT, CONNACK, PUBLISH, PUBACK, SUBSCRIBE, etc.
- Topic strings (UTF-8) for routing
- Payload is opaque bytes — any format

### State Synchronization
- **No state sync** — MQTT is purely a message transport
- **Retained messages:** Last message on a topic is stored and delivered to new subscribers
- **Persistent sessions:** Broker stores subscriptions and queued QoS 1/2 messages for disconnected clients
- **Last Will and Testament (LWT):** Broker publishes a pre-configured message when client disconnects unexpectedly

### Offline/Reconnect Behavior
- **Persistent sessions** (Clean Session = false): Broker queues messages while client is offline
- On reconnect, client receives all queued QoS 1/2 messages
- QoS 0: At most once (fire and forget) — lost if offline
- QoS 1: At least once — queued and retried
- QoS 2: Exactly once — queued with full handshake
- **Session Expiry Interval** (MQTT 5.0): configurable session lifetime

### Central vs P2P
- **Centralized broker** — all messages flow through the broker
- Broker clustering for HA (vendor-specific)
- Not P2P at all; the broker is the single point of coordination

### Tradeoffs
- (+) Extremely lightweight — runs on microcontrollers
- (+) Excellent offline handling with persistent sessions
- (+) QoS levels provide flexible delivery guarantees
- (+) Retained messages provide "current state" semantics
- (+) Wildcard subscriptions enable flexible topic hierarchies
- (+) LWT provides presence/failure notification
- (-) Central broker is a single point of failure
- (-) No peer discovery
- (-) No built-in encryption at protocol level (relies on TLS)
- (-) Topic-based only — no content-based routing

### Key Lessons for Agent Mesh
- **QoS levels** are essential for agent communication — some messages must be guaranteed, others are fire-and-forget
- **Retained messages** pattern is perfect for agent state/capability announcements
- **Last Will and Testament** is an elegant offline detection mechanism
- **Persistent sessions** solve the mobile background problem — broker stores messages
- **Topic hierarchies with wildcards** are a natural fit for agent capability routing
- **2-byte overhead** shows how minimal a protocol can be

---

## 5. WebRTC (Web Real-Time Communication)

**What it is:** A set of protocols and APIs for real-time peer-to-peer communication of audio, video, and data directly between browsers/apps.

### Core Design Principles
- **Direct peer-to-peer** communication (no intermediary for media)
- **NAT traversal as core requirement** — ICE framework
- **Secure by default** — DTLS mandatory
- **Codec negotiation** via SDP (Session Description Protocol)
- **Signaling is out of scope** — protocol only handles media/data transport

### Discovery Mechanism
- **No built-in discovery** — requires external signaling server
- Signaling exchanges SDP offers/answers and ICE candidates
- Signaling can use any transport (WebSocket, HTTP, even QR codes)
- ICE candidates gathered from: local interfaces, STUN servers, TURN servers
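
Because signaling is out of scope, the application only has to provide a channel that relays opaque offers, answers, and candidates between named peers. A sketch of that minimum, as an in-memory switchboard (in production this would sit behind a WebSocket server; peer names and SDP strings are placeholders):

```javascript
// Minimal signaling switchboard: relays opaque messages between peers.
class SignalingChannel {
  constructor() { this.handlers = new Map(); }
  register(peerId, onMessage) { this.handlers.set(peerId, onMessage); }
  send(from, to, message) { // message: { type: 'offer'|'answer'|'candidate', payload }
    const handler = this.handlers.get(to);
    if (!handler) throw new Error(`unknown peer: ${to}`);
    handler({ from, ...message });
  }
}

const channel = new SignalingChannel();
const received = [];
channel.register('bob', msg => received.push(msg));
channel.register('alice', msg => received.push(msg));
channel.send('alice', 'bob', { type: 'offer', payload: 'v=0 ...sdp...' });
channel.send('bob', 'alice', { type: 'answer', payload: 'v=0 ...sdp...' });
// received now holds the offer (delivered to bob) and the answer (to alice)
```

The switchboard never inspects the payloads; that opacity is why any transport, even a QR code, can carry WebRTC signaling.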

### Transport Model
- **ICE (Interactive Connectivity Establishment):** Tests multiple connection paths, picks best
- **STUN:** Discovers public IP/port; enables direct connection through NAT
- **TURN:** Relay fallback when direct connection impossible (~10-15% of connections)
- **DTLS:** Encryption over UDP
- **SRTP:** Encrypted real-time media
- **DataChannels:** Reliable or unreliable data over SCTP/DTLS

### Message Format
- SDP (Session Description Protocol) for negotiation — text-based
- RTP/SRTP for media streams
- SCTP for data channels — supports binary and text
- DataChannels can be configured as ordered/unordered, reliable/unreliable

### State Synchronization
- **No state sync** — WebRTC is a transport, not a state system
- DataChannels provide raw bidirectional communication
- Application must implement its own state sync over DataChannels

### Offline/Reconnect Behavior
- **Connection is lost** when peer goes offline — ICE connection fails
- **ICE restart** can attempt to re-establish connection with new candidates
- No message queuing or store-and-forward
- Requires signaling server to coordinate reconnection

### Central vs P2P
- **P2P for data/media transport** — direct peer connections
- **Requires signaling server** for connection setup (not fully P2P)
- **TURN relay** as fallback makes it partially client-server
- In practice: hybrid architecture

### Tradeoffs
- (+) True P2P data transfer — low latency
- (+) Excellent NAT traversal (ICE/STUN/TURN)
- (+) Built into every modern browser
- (+) DataChannels support both reliable and unreliable modes
- (+) Mandatory encryption
- (-) Requires signaling infrastructure
- (-) Connection setup is complex and slow (ICE gathering)
- (-) No offline message delivery
- (-) Peer-to-peer doesn't scale beyond small groups (mesh topology)
- (-) Maintaining connections drains mobile battery

### Key Lessons for Agent Mesh
- **ICE framework** is the gold standard for NAT traversal
- **DataChannels** provide flexible transport (reliable/unreliable, ordered/unordered)
- **Signaling is separate from transport** — good architectural separation
- **Direct P2P is ideal for latency-sensitive agent communication** (real-time sensing)
- **TURN relay pattern** is essential fallback — not all agents can connect directly
- **Connection setup overhead** means WebRTC is better for persistent connections than one-off messages

---

## 6. Kademlia DHT

**What it is:** A distributed hash table protocol that uses XOR distance metric for efficient peer lookup and content routing. Used by BitTorrent, IPFS, Ethereum.

### Core Design Principles
- **XOR distance metric** — distance = XOR of node IDs, interpreted as integer
- **Logarithmic lookup** — O(log n) hops to find any node/value
- **Self-organizing** — routing tables maintained through normal traffic
- **Redundancy** — values stored on k closest nodes (typically k=20)
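
The XOR metric itself is trivial to compute, which is part of its appeal. A sketch over hex node IDs (the short two-hex-digit IDs are illustrative; real Kademlia IDs are 160+ bits): distance is symmetric, zero only for identical IDs, and the bucket a peer belongs to is just the position of the highest differing bit.

```javascript
// Kademlia XOR distance over hex node IDs, using BigInt for arbitrary width.
function xorDistance(idA, idB) {
  return BigInt('0x' + idA) ^ BigInt('0x' + idB);
}

// k-bucket index = index of the highest differing bit (-1 if identical).
function bucketIndex(idA, idB) {
  const d = xorDistance(idA, idB);
  return d === 0n ? -1 : d.toString(2).length - 1;
}

xorDistance('0f', '05'); // 0b1111 ^ 0b0101 = 0b1010 → 10n
bucketIndex('0f', '05'); // highest differing bit is bit 3 → 3
xorDistance('0f', '05') === xorDistance('05', '0f'); // symmetric → true
```

Symmetry is the property that makes routing coherent: two nodes always agree on how far apart they are, so "closest k nodes to a key" is a globally consistent notion.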

### Discovery Mechanism
- **Bootstrap:** New node needs address of just ONE existing node
- **Self-lookup:** New node performs a lookup on its own ID to populate routing table
- **k-buckets:** Routing table organized by XOR distance ranges; each bucket holds up to k entries
- **Bucket refresh:** Periodic lookups on random IDs within each bucket's range

### Transport Model
- UDP-based RPC (original paper)
- Four RPCs: PING, STORE, FIND_NODE, FIND_VALUE
- Implementations vary (libp2p uses TCP+multiplexing)

### Message Format
- Simple RPC messages containing: sender node ID, target ID, and payload
- Responses include closest known nodes to target

### State Synchronization
- **Key-value store:** Values stored on k closest nodes to key
- **Republishing:** Values periodically re-stored to maintain redundancy
- **No conflict resolution** — last write wins or application-defined
- **Expiration:** Values expire after configurable TTL

### Offline/Reconnect Behavior
- **Graceful degradation:** k-bucket entries are evicted only after confirmed failure (PING timeout)
- **Preference for long-lived nodes:** Existing entries preferred over new ones (stability heuristic)
- **Automatic recovery:** Rejoining node performs self-lookup, quickly rebuilds routing table
- **Redundancy covers gaps:** k replicas ensure availability even when some nodes are offline

### Central vs P2P
- **Fully P2P** — no central coordination
- Bootstrap nodes are the only "well-known" infrastructure
- All nodes are equal participants

### Tradeoffs
- (+) Proven at massive scale (millions of nodes in BitTorrent)
- (+) O(log n) lookup efficiency
- (+) Self-healing — network adapts to churn automatically
- (+) Preference for stable nodes reduces routing table churn
- (-) Lookup latency (multiple round trips)
- (-) Vulnerable to Sybil attacks (attacker creates many node IDs)
- (-) Eclipse attacks (surrounding a target with malicious nodes)
- (-) Not suitable for real-time communication
- (-) Bootstrap node dependency for initial join

### Key Lessons for Agent Mesh
- **XOR distance** is an elegant, symmetric distance metric — useful for agent capability routing
- **k-bucket structure** naturally maintains a balanced view of the network
- **Preference for long-lived nodes** is smart — stable agents should be preferred routing partners
- **Self-lookup on join** is an efficient bootstrapping pattern
- **k-redundancy** ensures resilience without coordination
- **Not suitable as sole discovery mechanism** for real-time agent communication — too slow

---

## 7. Gossip Protocols (SWIM, HyParView)

### SWIM (Scalable Weakly-consistent Infection-style Process Group Membership)

**What it is:** A protocol for membership management and failure detection in large distributed systems. Used by Consul, Serf (HashiCorp).

#### Core Design Principles
- **Separate failure detection from dissemination** — two distinct components
- **Randomized probing** — each node randomly selects one peer per round to probe
- **Piggyback dissemination** — membership updates ride on failure detection messages
- **Suspicion before declaration** — nodes are suspected before declared dead

#### Discovery Mechanism
- Join via any known member
- Membership list propagated via gossip (piggyback on ping/ack)
- New member announced infection-style through the cluster

#### How It Works
1. Each period: node picks random peer, sends PING
2. If no ACK: sends PING-REQ to k random peers (indirect probe)
3. If still no ACK: marks peer as SUSPECTED
4. Suspected peers get grace period before being declared DEAD
5. Membership changes (join/leave/fail) piggyback on ping/ack messages

#### Scalability
- O(1) message load per member per period (constant, not proportional to cluster size)
- Compared to heartbeat protocols: O(n^2) vs O(n) total messages
- Infection-style dissemination reaches all nodes in O(log n) rounds
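
The logarithmic-rounds claim is easy to see in a toy simulation: each round, every node that already has an update pushes it to one randomly chosen peer, so coverage grows roughly exponentially. A sketch with a small seeded PRNG so the run is deterministic (the constants and cluster size are illustrative, not a benchmark):

```javascript
// Simulate push-style infection dissemination in a cluster of n nodes.
// Each round, every infected node pushes the update to one random node.
function roundsToInfectAll(n, seed = 42) {
  let s = seed >>> 0;
  const rand = max => { // tiny LCG: deterministic across runs
    s = (s * 1664525 + 1013904223) >>> 0;
    return s % max;
  };
  const infected = new Set([0]); // node 0 originates the update
  let rounds = 0;
  while (infected.size < n) {
    rounds++;
    for (const node of [...infected]) infected.add(rand(n));
  }
  return rounds;
}

roundsToInfectAll(256); // far fewer rounds than nodes: growth is ~log2(n)
```

Coverage can at most double per round (each infected node reaches one peer), so log2(n) is a hard lower bound; the simulated count sits modestly above it because late rounds waste pushes on already-infected nodes.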

### HyParView

**What it is:** A membership protocol designed specifically for reliable gossip-based broadcast.

#### Core Design Principles
- **Two partial views:** Small active view (log(n)+c peers) + larger passive view (k*(log(n)+c) peers)
- **Active view = overlay network** — broadcast is flooding over active view connections
- **Passive view = backup pool** — nodes promoted to active when active members fail
- **TCP-based** — connections double as failure detectors

#### How It Works
- Active view maintained via TCP connections — broken connection = immediate failure detection
- Passive view maintained via periodic shuffle (random exchange of partial views)
- When active member fails: promote random passive member
- Recovers from up to 80% node failure while maintaining broadcast reliability

### Tradeoffs (Both)
- (+) Scalable failure detection (SWIM: constant overhead per node)
- (+) Fast dissemination (logarithmic rounds)
- (+) Resilient to high failure rates (HyParView: survives 80% failure)
- (+) No central coordinator
- (-) Eventual consistency — membership view may be temporarily inconsistent
- (-) False positives possible (mitigated by suspicion mechanism)
- (-) Requires continuous background traffic

### Key Lessons for Agent Mesh
- **SWIM's piggyback pattern** is brilliant — reuse existing messages for metadata propagation
- **Suspicion mechanism** is essential — avoid declaring agents dead prematurely
- **HyParView's dual-view** approach balances active communication with fault tolerance
- **Infection-style dissemination** is ideal for propagating agent capability updates
- **O(1) per-member overhead** is critical for mobile agents with limited resources
- **TCP as failure detector** (HyParView) is practical and efficient

---

## 8. CRDTs (Conflict-Free Replicated Data Types)

**What it is:** Data structures that can be replicated across multiple nodes and updated independently, with a mathematically guaranteed merge that always converges to the same state.

### Core Design Principles
- **Strong eventual consistency** — all replicas that have received the same set of updates converge to the same state
- **No coordination needed** — updates applied locally without consensus
- **Merge must be commutative, associative, and idempotent** (forms a join-semilattice)
- **Two flavors:** State-based (CvRDT) and Operation-based (CmRDT)

### Types

**State-based CRDTs (CvRDTs):**
- Replicas send full state to peers
- Merge function computes join of two states
- Only requires gossip protocol (unreliable transport OK)
- Higher bandwidth (full state transfer)

**Operation-based CRDTs (CmRDTs):**
- Replicas send only operations
- Requires reliable, causal-order delivery
- Lower bandwidth (just operations)
- More complex middleware requirements

### Common CRDT Types
| Type | Description | Use Case |
|------|-------------|----------|
| G-Counter | Grow-only counter, per-node counts | View counts, upvotes |
| PN-Counter | Positive-negative counter | Likes minus dislikes |
| G-Set | Grow-only set | Tag collections |
| OR-Set | Observed-remove set (add wins on concurrent add/remove) | Member lists |
| LWW-Register | Last-writer-wins register | Profile fields |
| LWW-Map | Map of LWW-Registers | Agent state objects |
| Sequence/RGA | Replicated growable array | Collaborative editing |
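
The simplest entry in the table, the state-based G-Counter, already shows the whole CRDT recipe: each replica increments only its own slot, and merge takes the per-replica maximum, which is commutative, associative, and idempotent, so any merge order converges. A sketch (replica IDs are illustrative):

```javascript
// State-based G-Counter: per-replica counts, merged by element-wise max.
class GCounter {
  constructor(replicaId) { this.id = replicaId; this.counts = {}; }
  increment(by = 1) { this.counts[this.id] = (this.counts[this.id] || 0) + by; }
  value() { return Object.values(this.counts).reduce((a, b) => a + b, 0); }
  merge(other) { // join of two states: safe to apply in any order, any number of times
    for (const [id, n] of Object.entries(other.counts)) {
      this.counts[id] = Math.max(this.counts[id] || 0, n);
    }
  }
}

const a = new GCounter('agent-a');
const b = new GCounter('agent-b');
a.increment(); a.increment(); // a counted 2 while offline
b.increment();                // b counted 1 independently
a.merge(b); b.merge(a);       // gossip state both ways
// a.value() === 3 && b.value() === 3 — replicas converge
```

Note why naive summing would fail: re-merging the same state would double-count, while the max-merge is idempotent, which is exactly the semilattice property the design principles above call for.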
|
|
501
|
+
|
|
502
|
+
### State Synchronization
|
|
503
|
+
- **State-based:** Periodically gossip full state; merge on receive
|
|
504
|
+
- **Operation-based:** Broadcast operations; apply in causal order
|
|
505
|
+
- **Delta-state CRDTs:** Send only the delta (diff) since last sync — best of both worlds
|
|
506
|
+
- Merge is always safe — applying the same update multiple times has no effect (idempotent)

### Offline/Reconnect Behavior
- **Perfect offline support** — updates applied locally, merged on reconnect
- No conflict resolution needed — merge is automatic and deterministic
- Reconnecting node sends its state/operations; receiving nodes merge
- Works perfectly with intermittent connectivity

### Tradeoffs
- (+) No coordination/consensus needed — fully decentralized
- (+) Perfect offline support
- (+) Mathematically guaranteed convergence
- (+) Low latency — local writes, async merge
- (-) Limited data structure types — not everything can be a CRDT
- (-) State-based CRDTs can be bandwidth-heavy
- (-) Metadata overhead (tombstones, vector clocks)
- (-) Monotonically growing state (hard to garbage collect)
- (-) "Last writer wins" may not always be the right semantics

### Key Lessons for Agent Mesh
- **CRDTs are the ideal state synchronization mechanism for agent mesh networks**
- **Agent state (mood, energy, capabilities) maps naturally to LWW-Registers and OR-Sets**
- **Delta-state CRDTs** balance bandwidth and consistency — critical for mobile agents
- **Offline-first by design** — agents can operate independently and merge later
- **G-Counter pattern** works for agent interaction counts, coupling strength
- **OR-Set** is perfect for agent peer lists, subscriptions, capability sets
- **Metadata growth** must be managed — agent state should be prunable
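As a concrete illustration of the OR-Set lesson, here is a minimal (and deliberately naive) observed-remove set for an agent peer list. A production implementation would also keep tombstones for removed tags so later merges cannot resurrect them; this sketch omits that:

```javascript
// Minimal OR-Set: each add gets a unique tag; remove only deletes
// tags it has locally observed, so a concurrent add always survives.
class ORSet {
  constructor() { this.entries = new Map(); } // element -> Set of tags
  add(el, tag) {
    if (!this.entries.has(el)) this.entries.set(el, new Set());
    this.entries.get(el).add(tag);
  }
  remove(el) { this.entries.delete(el); } // drops only locally-seen tags
  has(el) { return this.entries.has(el) && this.entries.get(el).size > 0; }
  merge(other) {
    for (const [el, tags] of other.entries) {
      for (const t of tags) this.add(el, t);
    }
  }
}

const a = new ORSet(), b = new ORSet();
a.add('agent-42', 'a:1');       // replica A couples with agent-42
b.add('agent-42', 'b:1');       // concurrently, replica B does too
a.remove('agent-42');           // A drops it, but only knows tag a:1
a.merge(b);                     // B's concurrent add survives
console.log(a.has('agent-42')); // true: add wins
```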

---

## 9. ActivityPub

**What it is:** W3C standard for federated social networking. Powers the Fediverse (Mastodon, Pixelfed, PeerTube, etc.).

### Core Design Principles
- **Actor model** — every entity (user, group, app) is an Actor with inbox/outbox
- **ActivityStreams 2.0** vocabulary — standardized JSON-LD for describing activities
- **Dual API:** Client-to-Server (C2S) for posting, Server-to-Server (S2S) for federation
- **HTTP-based** — standard web infrastructure

### Discovery Mechanism
- **WebFinger** — `/.well-known/webfinger?resource=acct:user@domain` resolves actor URL
- **Actor URLs** are HTTPS endpoints returning JSON-LD actor documents
- **Followers/following** collections provide social graph discovery
- No global directory — discovery through social graph traversal
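The WebFinger flow can be sketched as two pure steps: build the well-known query URL, then extract the `self` link from the returned JRD document (the handle and hosts below are examples):

```javascript
// Step 1: build the WebFinger query URL for acct:user@domain.
function webfingerUrl(handle) {
  const [user, domain] = handle.replace(/^@/, '').split('@');
  return `https://${domain}/.well-known/webfinger?resource=acct:${user}@${domain}`;
}

// Step 2: pick the 'self' link of type activity+json from the JRD response.
function actorUrlFromJrd(jrd) {
  const link = (jrd.links || []).find(
    (l) => l.rel === 'self' && /activity\+json/.test(l.type || '')
  );
  return link ? link.href : null;
}

// Example JRD shaped like a typical Mastodon response:
const jrd = {
  subject: 'acct:alice@example.social',
  links: [
    { rel: 'http://webfinger.net/rel/profile-page', type: 'text/html',
      href: 'https://example.social/@alice' },
    { rel: 'self', type: 'application/activity+json',
      href: 'https://example.social/users/alice' },
  ],
};
console.log(webfingerUrl('@alice@example.social'));
// https://example.social/.well-known/webfinger?resource=acct:alice@example.social
console.log(actorUrlFromJrd(jrd)); // https://example.social/users/alice
```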

### Transport Model
- **HTTPS POST** to actor inboxes for delivery
- **HTTP Signatures** for authentication (proving sender identity)
- JSON-LD with ActivityStreams 2.0 vocabulary
- Polling via `outbox` collection for pull-based access

### Message Format
- JSON-LD wrapping ActivityStreams 2.0
- Activities: Create, Update, Delete, Follow, Accept, Reject, Like, Announce, etc.
- Objects: Note, Article, Image, Video, Event, etc.
- Each actor has: `id` (URL), `inbox`, `outbox`, `followers`, `following`, `publicKey`

### State Synchronization
- **Push-based:** Activities POSTed to follower inboxes
- **No global state** — each server maintains its own copy of remote actors/content
- **No conflict resolution** — activities are append-only
- **No backfill** — if a server was down when an activity was sent, it may be lost

### Offline/Reconnect Behavior
- **Server inbox queues** — activities wait in sender's outbox retry queue
- Delivery retried with exponential backoff (implementation-specific)
- No guaranteed delivery — retries eventually give up
- No catch-up mechanism for missed activities during extended outages

### Central vs P2P
- **Federated** — each instance is a server for its users
- No central authority, but large instances have outsized influence
- Server admins control moderation, defederation

### Tradeoffs
- (+) Uses standard web infrastructure (HTTPS, JSON, DNS)
- (+) Actor model is intuitive and extensible
- (+) Large ecosystem (Mastodon, Pixelfed, etc.)
- (+) ActivityStreams vocabulary is rich and standardized
- (-) No guaranteed delivery
- (-) No backfill/catch-up mechanism
- (-) JSON-LD is complex and often implemented incompletely
- (-) Instance admins have significant power over users
- (-) Discovery is slow (requires multiple HTTP requests)

### Key Lessons for Agent Mesh
- **Actor model with inbox/outbox** is a natural fit for agents
- **ActivityStreams vocabulary** could be extended for agent activities (Sense, Decide, Act, Couple)
- **HTTP Signatures** pattern is practical for agent authentication
- **WebFinger** is an elegant discovery mechanism for named agents
- **Lack of backfill is a critical gap** — agents need catch-up mechanisms
- **JSON-LD complexity is a warning** — keep the message format simple

---

## 10. Bluetooth Mesh

**What it is:** Bluetooth SIG standard for many-to-many device communication using BLE advertising. Designed for building automation, lighting, sensor networks.

### Core Design Principles
- **Managed flooding** — messages broadcast to all nodes, relayed hop-by-hop
- **Publish/Subscribe** — nodes publish to and subscribe to addresses (topics)
- **Low power support** — Friend/Low Power Node pattern
- **Provisioning** — secure device onboarding with network key distribution
- **No routing tables** — flooding eliminates routing complexity

### Discovery Mechanism
- **Provisioning:** Unprovisioned devices send beacons; a Provisioner discovers and onboards them
- **Network keys** define mesh membership — only nodes with the key can participate
- **Group addresses** for multicast communication

### Transport Model
- **BLE advertising** as bearer — no connection needed for mesh messages
- **GATT connections** via Proxy Nodes for non-mesh devices (phones)
- **Managed flooding** with TTL to limit propagation
- **Network layer relay** — relay nodes rebroadcast messages
- **Segmentation** for messages larger than one advertisement

### Message Format
- Binary, compact format
- Layers: Bearer → Network (encryption, relay) → Transport (segmentation) → Access (application data) → Model (standardized behaviors)
- Network PDU: ~29 bytes maximum per advertisement
- Messages encrypted with network key (network layer) and application key (application layer)

### State Synchronization
- **Publish/Subscribe model** — state changes published to group addresses
- **No global state** — each node maintains its own state
- **Provisioning data** (keys, addresses) distributed during device onboarding
- **Configuration model** standardizes how nodes are configured

### Offline/Reconnect Behavior
- **Friend Node pattern:** A mains-powered Friend Node stores messages for associated Low Power Nodes (LPNs)
- LPN wakes periodically, polls Friend Node, receives queued messages
- LPN can sleep for hours/days — Friend stores messages until polled
- **Replay protection relies on sequence numbers** — long-sleeping nodes must keep their sequence state current on wake
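Applied to agents, the Friend Node idea reduces to a per-LPN message queue that is drained on poll. A rough sketch, with illustrative names:

```javascript
// Friend/LPN store-and-poll pattern in relay form.
// Class and method names are illustrative, not from the BT Mesh spec.
class FriendNode {
  constructor() { this.queues = new Map(); } // lpnId -> queued messages
  store(lpnId, msg) {
    if (!this.queues.has(lpnId)) this.queues.set(lpnId, []);
    this.queues.get(lpnId).push(msg);
  }
  poll(lpnId) {
    // LPN wakes, drains everything queued while it slept, sleeps again.
    const msgs = this.queues.get(lpnId) || [];
    this.queues.set(lpnId, []);
    return msgs;
  }
}

const friend = new FriendNode();
friend.store('lpn-7', { topic: 'lights/kitchen', state: 'on' });
friend.store('lpn-7', { topic: 'lights/kitchen', state: 'off' });
console.log(friend.poll('lpn-7').length); // 2
console.log(friend.poll('lpn-7').length); // 0: queue drained
```

The same shape reappears below as the cloud relay for sleeping mobile agents.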

### Central vs P2P
- **Fully P2P** in message transport (flooding)
- **Provisioner** is a semi-central role for onboarding
- No routing infrastructure — any relay node forwards any message

### Tradeoffs
- (+) No routing tables — extreme simplicity
- (+) Friend/LPN pattern brilliantly solves the low-power problem
- (+) Managed flooding is robust to topology changes
- (+) Double encryption (network + application layer)
- (+) Provisioning provides secure onboarding
- (-) Flooding doesn't scale — limited to ~hundreds of nodes
- (-) Bandwidth limited (BLE advertising)
- (-) No acknowledgment at mesh layer
- (-) Latency increases with network size
- (-) TTL must be tuned carefully

### Key Lessons for Agent Mesh
- **Friend/LPN pattern** is directly applicable — a cloud relay stores messages for sleeping mobile agents
- **Managed flooding with TTL** works for small agent clusters
- **Provisioning model** is relevant for agent onboarding (key exchange, capability declaration)
- **Publish/Subscribe with group addresses** maps to agent topic subscriptions
- **Flooding won't scale** — need structured routing for large agent networks
- **Double encryption model** (transport + application) provides defense in depth

---

## Mobile P2P: Background Restrictions and Patterns

### iOS Restrictions
- **Multipeer Connectivity:** Only works in foreground; networking stops within ~3 minutes of backgrounding
- **BLE scanning in background:** Reduced frequency; `AllowDuplicates` option ignored; only one advertisement received
- **BLE advertising in background:** Severely throttled; Apple limits frequency and eventually cuts off
- **No background TCP/UDP sockets** — suspended apps lose all connections
- **Background App Refresh:** Limited, unpredictable scheduling by OS
- **Push Notifications (APNs):** Only reliable way to wake an app — but limited to triggering brief background execution (~30 seconds)
- **Background Processing Tasks (BGTaskScheduler):** Can run for minutes, but OS controls scheduling

### Android Restrictions
- **Doze Mode:** Network access suspended; alarms deferred; wake locks ignored
- **`setAndAllowWhileIdle()`:** Can fire alarms in Doze, but at most once per 9 minutes per app
- **Foreground Services:** Keep app alive but require persistent notification; battery drain
- **Background Services:** Killed by OS in recent Android versions
- **Firebase Cloud Messaging (FCM):** Reliable wake mechanism (equivalent to APNs)

### Patterns for Waking Sleeping Mobile Nodes

1. **Push Notification Relay:** Central server sends a push notification (APNs/FCM) to wake the agent; the agent connects to the relay, receives pending messages, processes them, and goes back to sleep
2. **Friend Node Pattern (from Bluetooth Mesh):** A cloud-based "friend" stores messages for sleeping agents; the agent polls when it wakes
3. **Scheduled Background Tasks:** Agent requests periodic wake-ups via `BGTaskScheduler` (iOS) or `WorkManager` (Android); unpredictable timing but no server needed
4. **BLE Peripheral Mode:** Agent advertises as a BLE peripheral; can receive connections even when backgrounded (iOS allows this with the `bluetooth-peripheral` background mode)
5. **VoIP Push (iOS):** `PushKit` provides a high-priority push that wakes the app instantly — but Apple rejects apps that misuse this for non-VoIP purposes
6. **Foreground Service (Android):** Keeps the agent fully alive at the cost of a persistent notification and battery drain

### Recommended Architecture for Mobile Agents
The most practical pattern combines:
- **Cloud relay** (like an MQTT broker or Matrix homeserver) as persistent message store
- **Push notifications** (APNs/FCM) to wake agents when messages arrive
- **Brief wake processing** — agent connects to relay, syncs state, processes messages, returns to sleep
- **CRDTs** for state sync — agents merge state on each wake cycle, no coordination needed
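One push-triggered wake cycle under this architecture might look like the sketch below, with a toy in-memory relay standing in for a real broker (all names are hypothetical, and plain object assignment stands in for a CRDT merge):

```javascript
// In-memory stand-in for a cloud relay with monotonic sequence numbers.
const relay = { seq: 0, queue: [], publish(ev) { ev.seq = ++this.seq; this.queue.push(ev); } };

function fetchSince(lastSeq) { return relay.queue.filter((ev) => ev.seq > lastSeq); }

function wakeCycle(agent) {
  const pending = fetchSince(agent.lastSeq);                 // pull messages stored while asleep
  for (const ev of pending) {
    Object.assign(agent.state, ev.stateDelta);               // stand-in for a CRDT merge
    agent.lastSeq = Math.max(agent.lastSeq, ev.seq);
    if (ev.priority === 'high') agent.handled.push(ev.kind); // process priority messages only
  }
  relay.publish({ kind: 'state', stateDelta: agent.state }); // announce updated state, then sleep
}

const agent = { lastSeq: 0, state: {}, handled: [] };
relay.publish({ kind: 'mood', stateDelta: { mood: 'curious' }, priority: 'high' });
relay.publish({ kind: 'ping', stateDelta: {}, priority: 'low' });
wakeCycle(agent); // agent.state.mood === 'curious', only the high-priority event handled
```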

---

## Mesh Networks and Intermittent Connectivity

### Delay-Tolerant Networking (DTN)
- **Core insight:** Treat disconnection as the norm, not the exception
- **Bundle Protocol (BP):** Messages ("bundles") are self-contained units with addressing, TTL, and custody transfer
- **Store-Carry-Forward:** Nodes store bundles, physically carry them (mobile nodes), and forward when connectivity is available
- **Custody Transfer:** Responsibility for delivery can be handed off between nodes
- **Frame Sequence Numbers:** Each producer assigns monotonic sequence numbers; consumers request missing sequences on reconnect
- **NASA uses DTN** for deep-space communication (minutes/hours of delay)

### Patterns for Intermittent Connectivity

1. **Sequence Numbers + Gap Detection:** Each producer maintains a monotonic sequence (FSEQ). Consumers track the last received FSEQ. On reconnect, request the missing range. Simple and effective.

2. **Vector Clocks / Version Vectors:** Each node maintains a vector of counters (one per known node). On sync, compare vectors to determine what's missing. More complex but handles multi-source updates.

3. **Merkle Trees / Hash-Based Sync:** Compare root hashes to detect divergence. Walk the tree to find specific differences. Efficient for large state with small changes (used by Git, Cassandra).

4. **CRDTs + Gossip:** Combine CRDTs for conflict-free merge with gossip for opportunistic sync. Each connection is an opportunity to merge state. Perfect for intermittent connectivity.

5. **Log-Based Sync (Event Sourcing):** Append-only log of events, each with a globally unique ID. On reconnect, peers exchange events the other has not seen. Matrix's event DAG is a variant.
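Pattern 1 is simple enough to sketch directly: given the sequence numbers a consumer has seen and the producer's latest sequence, compute the ranges to re-request on reconnect:

```javascript
// Gap detection over producer sequence numbers (FSEQ).
// Returns [start, end] ranges (inclusive) that were never received.
function missingRanges(seenSeqs, latestSeq) {
  const seen = new Set(seenSeqs);
  const ranges = [];
  let start = null;
  for (let s = 1; s <= latestSeq; s++) {
    if (!seen.has(s) && start === null) start = s;       // gap opens
    if (seen.has(s) && start !== null) {                  // gap closes
      ranges.push([start, s - 1]);
      start = null;
    }
  }
  if (start !== null) ranges.push([start, latestSeq]);    // trailing gap
  return ranges;
}

// Consumer saw 1,2,5,6 and learns the producer is at 9:
console.log(missingRanges([1, 2, 5, 6], 9)); // [[3, 4], [7, 9]]
```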

---

## Summary: Patterns and Principles for AI Agent Mesh Protocol

### Architecture Patterns

| Pattern | Source Protocols | Applicability to Agent Mesh |
|---------|-----------------|---------------------------|
| **Keypair Identity** | Nostr, libp2p | Agent identity = cryptographic keypair. No registration needed. |
| **Relay/Friend Node** | Nostr, MQTT, BT Mesh | Cloud relay stores messages for sleeping/offline agents. Essential for mobile. |
| **Pub/Sub with Topics** | MQTT, GossipSub, BT Mesh | Agents subscribe to capability/mood/context topics. Natural routing. |
| **Event DAG** | Matrix | Ordered, immutable record of agent interactions. Supports backfill. |
| **Actor Model (Inbox/Outbox)** | ActivityPub | Each agent has inbox (receives) and outbox (publishes). Clean abstraction. |
| **Managed Flooding + TTL** | BT Mesh, Gossip | Works for small clusters. TTL prevents infinite propagation. |
| **Structured Routing (DHT)** | Kademlia | O(log n) discovery for large networks. Not real-time. |
| **Dual View (Active + Passive)** | HyParView | Active peers for communication, passive for fault tolerance. |
| **Piggyback Dissemination** | SWIM | Reuse existing messages to propagate membership/state changes. Zero extra cost. |

### State Synchronization Strategy

**Recommended: Delta-state CRDTs + Gossip + Event Log**

1. **Agent state** (mood, energy, capabilities, coupling weights) → **LWW-Map CRDT**
   - Each agent maintains its own state as a CRDT
   - On any connection, agents merge state — order doesn't matter, always converges
   - Delta-state variant minimizes bandwidth

2. **Agent peer sets** (who I'm coupled with) → **OR-Set CRDT**
   - Add/remove peers without coordination
   - Concurrent add+remove → add wins (safe default)

3. **Interaction history** → **Append-only event log with sequence numbers**
   - Each agent maintains a monotonic sequence number
   - On reconnect, request missing events by sequence range
   - Events are immutable and signed

4. **Coupling strength** → **Bounded Counter CRDT**
   - Decays over time (agent applies local decay function)
   - Incremented by interactions
   - Merge takes maximum of decayed values
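Item 4 can be sketched as a decayed max-merge; the half-life parameter below is illustrative:

```javascript
// Coupling strength entry: { value, updatedAt }.
// Value halves every halfLifeMs of inactivity.
function decayed(entry, now, halfLifeMs = 86_400_000) {
  const age = now - entry.updatedAt;
  return entry.value * Math.pow(0.5, age / halfLifeMs);
}

// Merge takes the maximum of the decayed values, which keeps the
// operation commutative and idempotent like any CRDT merge.
function mergeCoupling(a, b, now) {
  const va = decayed(a, now), vb = decayed(b, now);
  return va >= vb ? { value: va, updatedAt: now } : { value: vb, updatedAt: now };
}

const now = Date.now();
const local = { value: 8, updatedAt: now - 86_400_000 }; // one half-life old -> 4
const remote = { value: 5, updatedAt: now };             // fresh -> 5
console.log(mergeCoupling(local, remote, now).value);    // 5
```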

### Discovery Strategy

**Recommended: Layered Discovery**

1. **Bootstrap:** Known relay servers (like MQTT brokers or Nostr relays)
2. **Local:** mDNS / BLE for same-network agent discovery
3. **Global:** DHT for finding agents by capability/ID
4. **Social:** Agent publishes its relay list (Nostr NIP-65 pattern); other agents find you through your relays
5. **Push:** APNs/FCM to wake sleeping mobile agents

### Message Routing Strategy

**Recommended: Hybrid Pub/Sub + Direct**

1. **Topic-based pub/sub** for broadcast (mood changes, capability announcements)
   - Via relay servers (MQTT/Nostr pattern)
   - Topics map to agent contexts: `mood/#`, `capability/#`, `mesh/coupling/#`

2. **Direct messaging** for agent-to-agent coupling
   - Via relay (store-and-forward for offline agents)
   - Via direct connection (WebRTC DataChannel for real-time, low-latency)

3. **Gossip** for metadata propagation
   - SWIM-style piggyback on existing messages
   - Membership changes, capability updates, health signals
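The topic filters above assume MQTT-style matching, which is compact to implement (`+` matches exactly one level, `#` matches the remainder):

```javascript
// MQTT-style topic matching for agent topics.
function topicMatches(filter, topic) {
  const f = filter.split('/'), t = topic.split('/');
  for (let i = 0; i < f.length; i++) {
    if (f[i] === '#') return true;                   // matches the remainder
    if (i >= t.length) return false;                 // topic ran out of levels
    if (f[i] !== '+' && f[i] !== t[i]) return false; // literal level mismatch
  }
  return f.length === t.length;                      // no trailing topic levels
}

console.log(topicMatches('mood/#', 'mood/agent-42/curious'));                  // true
console.log(topicMatches('capability/+/announce', 'capability/ocr/announce')); // true
console.log(topicMatches('mesh/coupling/#', 'mood/agent-42'));                 // false
```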

### Offline/Reconnect Strategy

**Recommended: Store-Forward-Merge**

1. **Relay stores messages** while agent is offline (MQTT persistent session pattern)
2. **Push notification wakes agent** when a priority message arrives (APNs/FCM)
3. **Agent connects to relay, pulls pending messages** (Nostr REQ pattern)
4. **Agent merges state with CRDTs** — automatic, conflict-free convergence
5. **Agent publishes its updated state** — peers merge on their next sync
6. **Sequence numbers enable gap detection** — agent requests missing events (DTN pattern)

### Mobile-Specific Design Principles

1. **Assume agents are usually asleep** — design for intermittent connectivity as the norm
2. **Cloud relay is not optional** — iOS/Android kill background networking; the relay is the Friend Node
3. **Push notifications are the only reliable wake mechanism** on mobile
4. **Minimize wake processing** — connect, sync CRDTs, process priority messages, sleep
5. **BLE for proximity** — when agents are physically nearby, BLE works even in background (iOS)
6. **Battery budget is the primary constraint** — every design choice must consider power

### Protocol Design Recommendations

1. **Keep it Nostr-simple:** Keypair identity, JSON events, WebSocket transport, relay architecture
2. **Add MQTT's QoS:** Support fire-and-forget (QoS 0) and guaranteed delivery (QoS 1)
3. **Use CRDTs for state:** Agent state is a delta-state CRDT — merge on every sync
4. **Adopt SWIM's piggyback:** Propagate mesh metadata on existing messages for free
5. **Support Matrix-style backfill:** Agents can request historical events by sequence range
6. **Design for the Friend Node pattern:** Relay stores messages for sleeping agents, delivers on poll
7. **Layer discovery:** mDNS (local) → relay (cloud) → DHT (global) → push (wake)
8. **Event-based, not request-response:** All communication is signed, immutable events with kinds
9. **Coupling as first-class concept:** Unlike any existing protocol, agent coupling strength should be a protocol-level concept encoded in CRDTs
10. **Autonomous, not automated:** Agents decide whether to respond to events based on their coupling engine — the protocol transports; it doesn't mandate behavior

### What No Existing Protocol Provides

None of the surveyed protocols address these agent-specific requirements:

- **Coupling dynamics** — weighted, decaying relationships between peers that influence routing
- **Mood/state-aware routing** — message priority and routing based on agent emotional/energy state
- **Autonomous response decisions** — protocol-level support for agents choosing whether to act
- **Context-aware presence** — richer than online/offline; includes mood, energy, attention, coupling strength
- **Collective state emergence** — mesh-level state that emerges from individual agent states

These gaps define the unique value proposition of a purpose-built agent mesh protocol.

---

## Sources

### libp2p
- [libp2p Official Site](https://libp2p.io/)
- [libp2p Architecture Spec](https://github.com/libp2p/specs/blob/master/_archive/4-architecture.md)
- [libp2p Peers Documentation](https://docs.libp2p.io/concepts/fundamentals/peers/)
- [GossipSub Spec](https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.0.md)

### Matrix
- [Matrix Specification](https://spec.matrix.org/latest/)
- [Matrix Protocol Wikipedia](https://en.wikipedia.org/wiki/Matrix_(protocol))
- [Matrix.org](https://matrix.org/)
- [Matrix Protocol Comprehensive Study](https://link.springer.com/article/10.1186/s13677-025-00829-7)

### Nostr
- [Nostr Protocol](https://nostr.com/)
- [NIP-01 Basic Protocol](https://nips.nostr.com/1)
- [Nostr GitHub](https://github.com/nostr-protocol/nostr)
- [How Nostr Works](https://nostr.how/en/the-protocol)

### MQTT
- [MQTT.org](https://mqtt.org/)
- [HiveMQ MQTT Essentials](https://www.hivemq.com/mqtt/)
- [MQTT Pub/Sub Architecture](https://www.hivemq.com/blog/mqtt-essentials-part2-publish-subscribe/)
- [EMQ MQTT Guide](https://www.emqx.com/en/blog/the-easiest-guide-to-getting-started-with-mqtt)

### WebRTC
- [WebRTC Protocols - MDN](https://developer.mozilla.org/en-US/docs/Web/API/WebRTC_API/Protocols)
- [WebRTC Getting Started](https://webrtc.org/getting-started/peer-connections)
- [WebRTC for the Curious](https://webrtcforthecurious.com/docs/01-what-why-and-how/)

### Kademlia
- [Kademlia Guide - Stanford](https://codethechange.stanford.edu/guides/guide_kademlia.html)
- [IPFS DHT Docs](https://docs.ipfs.tech/concepts/dht/)
- [Kademlia Wikipedia](https://en.wikipedia.org/wiki/Kademlia)

### Gossip Protocols
- [SWIM Protocol Explained](https://www.brianstorti.com/swim/)
- [SWIM Paper (Cornell)](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf)
- [HyParView Paper](https://asc.di.fct.unl.pt/~jleitao/pdf/dsn07-leitao.pdf)

### CRDTs
- [CRDT.tech](https://crdt.tech/)
- [CRDT Dictionary](https://www.iankduncan.com/engineering/2025-11-27-crdt-dictionary/)
- [CRDTs - Redis](https://redis.io/blog/diving-into-crdts/)
- [CRDTs - Ably](https://ably.com/blog/crdts-distributed-data-consistency-challenges)

### ActivityPub
- [W3C ActivityPub Spec](https://www.w3.org/TR/activitypub/)
- [ActivityPub Rocks](https://activitypub.rocks/)
- [ActivityPub Wikipedia](https://en.wikipedia.org/wiki/ActivityPub)

### Bluetooth Mesh
- [Bluetooth Mesh FAQ](https://www.bluetooth.com/learn-about-bluetooth/feature-enhancements/mesh/mesh-faq/)
- [Bluetooth Mesh Guide](https://novelbits.io/bluetooth-mesh-networking-the-ultimate-guide/)
- [Bluetooth Mesh Directed Forwarding](https://www.bluetooth.com/mesh-directed-forwarding/)

### Mobile P2P
- [iOS Core Bluetooth Background Processing](https://developer.apple.com/library/archive/documentation/NetworkingInternetWeb/Conceptual/CoreBluetooth_concepts/CoreBluetoothBackgroundProcessingForIOSApps/PerformingTasksWhileYourAppIsInTheBackground.html)
- [Apple Multipeer Connectivity](https://developer.apple.com/documentation/multipeerconnectivity)
- [iOS P2P Exploration - Thali Project](http://thaliproject.org/iOSp2p/)

### Delay-Tolerant Networking
- [Z-Mesh DTN Overview](https://z-mesh.org/overviews/delay-tolerant.html)
- [NASA DTN Tutorial](https://www.nasa.gov/wp-content/uploads/2023/09/dtn-tutorial-v3.2-0.pdf)
- [DTN Wikipedia](https://en.wikipedia.org/wiki/Delay-tolerant_networking)