[High] Race Condition in Route Update Propagation #22

Closed
opened 2026-02-11 19:31:45 +00:00 by thabeta · 1 comment
Owner

Issue

Concurrent updates to the routing table during peer synchronization can cause stale metric caching in the Babel protocol implementation.

Location

mycelium/src/babel/route_request.rs

Problem Description

When multiple peers send route updates simultaneously, the route_request handler does not use atomic operations or sufficient locking to ensure all metric updates are consistently applied. This can lead to:

  • Incorrect path metrics being used for route decisions
  • Packets being routed through suboptimal paths
  • Transient routing loops during network topology changes

Impact

  • Severity: HIGH (affects routing correctness)
  • Frequency: Occurs under high peer churn or in large mesh networks
  • User Impact: Unstable routing, higher latency, potential packet loss

Remediation

  1. Use a version-based or epoch-based approach to make route updates atomic
  2. Guard route table access with a read-write lock (e.g. RwLock)
  3. Add integration tests that stress-test concurrent route updates
  4. Document the thread-safety guarantees of the routing table
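A minimal sketch of remediation steps 1 and 2 combined, assuming a single shared table guarded by std's RwLock and a per-update epoch counter. The names RouteTable, RouteEntry, apply_update and lookup are hypothetical and are not taken from the mycelium codebase; the real routing table may be structured quite differently.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical route entry; field names are illustrative only.
#[derive(Clone, Debug, PartialEq)]
pub struct RouteEntry {
    pub next_hop: String,
    pub metric: u16,
    /// Version of the update that produced this entry (assumed to be
    /// assigned by the caller, e.g. from a monotonically increasing counter).
    pub epoch: u64,
}

/// Hypothetical route table keyed by prefix string.
#[derive(Default)]
pub struct RouteTable {
    inner: RwLock<HashMap<String, RouteEntry>>,
}

impl RouteTable {
    /// Applies an update only if its epoch is newer than the stored one.
    /// The check and the insert happen under one write lock, so a reader
    /// never sees a next hop from one update paired with a metric from another.
    /// Returns true if the update was applied.
    pub fn apply_update(&self, prefix: &str, next_hop: &str, metric: u16, epoch: u64) -> bool {
        let mut table = self.inner.write().unwrap();
        let stale = table.get(prefix).map_or(false, |e| e.epoch >= epoch);
        if stale {
            return false;
        }
        table.insert(
            prefix.to_string(),
            RouteEntry { next_hop: next_hop.to_string(), metric, epoch },
        );
        true
    }

    /// Returns a snapshot of the entry; many readers can hold the read lock at once.
    pub fn lookup(&self, prefix: &str) -> Option<RouteEntry> {
        self.inner.read().unwrap().get(prefix).cloned()
    }
}
```

With this shape, an update carrying an epoch that is equal to or older than the stored one is dropped rather than overwriting newer state, which is one way to read "version-based" in step 1.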

Testing

  • Create a chaos test with 100+ peers sending contradictory routes
  • Verify no stale metrics are observed in routing decisions
  • Measure update propagation latency under concurrent load
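The first two test items could be approximated in-process along these lines. This is a sketch under stated assumptions, not a test against mycelium itself: the table is a plain HashMap behind an RwLock, each simulated peer writes a (peer id, metric) pair where the metric is always peer id times 10, and a reader counts any pair that breaks that invariant. The function name stress_route_table is hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Spawns `writers` threads that repeatedly overwrite the same prefix with
/// contradictory routes, while one reader checks that the peer id and metric
/// it observes always come from the same update.
/// Returns the number of inconsistent ("torn") reads observed.
pub fn stress_route_table(writers: u32, rounds: u32) -> usize {
    let table: Arc<RwLock<HashMap<&'static str, (u32, u32)>>> =
        Arc::new(RwLock::new(HashMap::new()));

    // Each simulated peer advertises its own metric for the same prefix.
    let handles: Vec<_> = (0..writers)
        .map(|peer| {
            let table = Arc::clone(&table);
            thread::spawn(move || {
                for _ in 0..rounds {
                    table.write().unwrap().insert("10.0.0.0/8", (peer, peer * 10));
                }
            })
        })
        .collect();

    // The reader stands in for the route request handler.
    let reader = {
        let table = Arc::clone(&table);
        thread::spawn(move || {
            let mut torn: usize = 0;
            for _ in 0..(rounds * 10) {
                if let Some(&(peer, metric)) = table.read().unwrap().get("10.0.0.0/8") {
                    if metric != peer * 10 {
                        torn += 1;
                    }
                }
            }
            torn
        })
    };

    for h in handles {
        h.join().unwrap();
    }
    reader.join().unwrap()
}
```

Calling stress_route_table(100, 50) corresponds to the "100+ peers" scenario; a non-zero return value would indicate that a reader saw a metric paired with the wrong peer. Latency measurement from the third item is not covered here.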
Owner

Route requests are handled by reads from the routing table (#17 (comment), https://forge.ourworld.tf/geomind_code/mycelium_network/issues/17#issuecomment-14018), which at that point holds the most up-to-date calculated metrics. There may be updates still queued for processing, but that is always the case: an update could equally still be in flight, in which case the receiving node does not know about it yet. Note that the Babel spec accounts for this by preventing routing loops.
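For context, the loop-avoidance mechanism in the Babel spec (RFC 8966) is the feasibility condition: a node only accepts an update for a source if it is strictly better than the best (seqno, metric) pair it has ever advertised for that source, so acting on slightly outdated information cannot create a persistent loop. The following is a sketch of that condition as described in the RFC, not mycelium's actual code; the type and function names are hypothetical.

```rust
/// Metric value Babel uses for a retraction (infinite metric).
pub const INFINITY: u16 = 0xFFFF;

/// True if sequence number `a` is strictly older than `b`,
/// using arithmetic modulo 2^16.
pub fn seqno_lt(a: u16, b: u16) -> bool {
    let d = b.wrapping_sub(a);
    d != 0 && d < 0x8000
}

/// Feasibility distance stored per source: the best (seqno, metric)
/// this node has advertised so far. Hypothetical type name.
#[derive(Clone, Copy)]
pub struct FeasibilityDistance {
    pub seqno: u16,
    pub metric: u16,
}

/// An update is feasible if it is a retraction, if there is no stored
/// feasibility distance, or if it is strictly better than the stored one:
/// a newer seqno, or the same seqno with a strictly smaller metric.
pub fn is_feasible(fd: Option<FeasibilityDistance>, seqno: u16, metric: u16) -> bool {
    if metric == INFINITY {
        return true;
    }
    match fd {
        None => true,
        Some(fd) => {
            seqno_lt(fd.seqno, seqno) || (fd.seqno == seqno && metric < fd.metric)
        }
    }
}
```

Under this rule, an update with an older seqno, or with the same seqno and an equal or worse metric, is not selected as a route, regardless of the order in which concurrent updates are processed.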
lee closed this issue 2026-03-20 11:39:34 +00:00
thabeta added this to the ACTIVE project 2026-03-23 14:27:06 +00:00
Reference
geomind_code/mycelium_network#22