Investigate adding latency telemetry to hero_aibroker_server #158
Reference
lhumina_code/hero_aibroker#158
Problem
hero_aibroker_server currently has limited visibility into where time is spent per request. The existing Metrics struct counts requests/errors, and the request log records only total duration. When latency feels high, we can't easily tell whether the bottleneck is routing, key pool waits, serialization, upstream TTFB, full upstream response, tool execution, or response formatting.
Questions to answer
/metrics endpoint suffice as a starting point?
Candidate approaches
Option A: Prometheus histograms
Extend the existing /metrics endpoint with latency histograms. Simplest to deploy, no extra collector, easy dashboards/alerts.
Option B: OpenTelemetry + Jaeger
Instrument the chat path with spans and export to Jaeger. Best for per-request deep dives, but requires running Jaeger or an OTel collector.
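To make the span idea concrete, here is a dependency-free sketch of what nested spans capture for one request: a name, a parent, and a duration. It deliberately does not use the tracing or OpenTelemetry APIs; Trace, SpanRecord, and in_span are hypothetical names for illustration, and the span names are borrowed from the progress notes further down.

```rust
use std::time::{Duration, Instant};

/// One timed operation inside a request (hypothetical type, illustration only).
struct SpanRecord {
    name: &'static str,
    /// Index of the enclosing span, or None for the root span.
    parent: Option<usize>,
    start: Instant,
    /// Filled in when the span closes.
    duration: Option<Duration>,
}

/// All spans recorded for a single request.
struct Trace {
    spans: Vec<SpanRecord>,
    /// Stack of open span indices; the top becomes the parent of the next span.
    open: Vec<usize>,
}

impl Trace {
    fn new() -> Self {
        Self { spans: Vec::new(), open: Vec::new() }
    }

    /// Run `f` inside a span named `name`, nested under the currently open span.
    fn in_span<T>(&mut self, name: &'static str, f: impl FnOnce(&mut Trace) -> T) -> T {
        let idx = self.spans.len();
        self.spans.push(SpanRecord {
            name,
            parent: self.open.last().copied(),
            start: Instant::now(),
            duration: None,
        });
        self.open.push(idx);
        let result = f(self);
        self.open.pop();
        let elapsed = self.spans[idx].start.elapsed();
        self.spans[idx].duration = Some(elapsed);
        result
    }

    /// Span names in start order, indented by nesting depth.
    fn tree(&self) -> Vec<String> {
        self.spans
            .iter()
            .map(|span| {
                let mut depth = 0;
                let mut parent = span.parent;
                while let Some(i) = parent {
                    depth += 1;
                    parent = self.spans[i].parent;
                }
                format!("{}{}", "  ".repeat(depth), span.name)
            })
            .collect()
    }
}
```

Wrapping the handler in in_span("chat.completions", ...) and each stage in its own nested call yields a per-request tree, which is the view a trace backend such as Jaeger renders as a timeline.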
Option C: Both
Prometheus for aggregate monitoring first, OpenTelemetry for traces as a follow-up.
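As a rough illustration of Option A, the sketch below keeps a fixed-bucket latency histogram and renders it in the Prometheus text exposition format, using only the standard library. LatencyHistogram, the metric name, the stage label, and the bucket bounds are all assumptions for illustration; a real implementation might use an existing metrics crate and extend the existing Metrics struct instead.

```rust
use std::fmt::Write;
use std::time::Duration;

/// Fixed-bucket latency histogram (hypothetical type, not from the codebase).
struct LatencyHistogram {
    /// Bucket upper bounds in seconds, ascending.
    bounds: Vec<f64>,
    /// Non-cumulative counts per bucket; the last slot is the +Inf overflow bucket.
    counts: Vec<u64>,
    /// Sum of all observed durations, in seconds.
    sum_seconds: f64,
}

impl LatencyHistogram {
    fn new(bounds: &[f64]) -> Self {
        Self {
            bounds: bounds.to_vec(),
            counts: vec![0; bounds.len() + 1],
            sum_seconds: 0.0,
        }
    }

    /// Record one observation in the first bucket whose bound is >= the value.
    fn observe(&mut self, d: Duration) {
        let secs = d.as_secs_f64();
        let idx = self
            .bounds
            .iter()
            .position(|bound| secs <= *bound)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
        self.sum_seconds += secs;
    }

    /// Render as Prometheus text; bucket values are cumulative, as the format requires.
    fn render(&self, name: &str, stage: &str) -> String {
        let mut out = String::new();
        let mut cumulative = 0u64;
        writeln!(out, "# TYPE {name} histogram").unwrap();
        for (i, count) in self.counts.iter().enumerate() {
            cumulative += count;
            let le = match self.bounds.get(i) {
                Some(bound) => bound.to_string(),
                None => "+Inf".to_string(),
            };
            writeln!(out, "{name}_bucket{{stage=\"{stage}\",le=\"{le}\"}} {cumulative}").unwrap();
        }
        writeln!(out, "{name}_sum{{stage=\"{stage}\"}} {}", self.sum_seconds).unwrap();
        writeln!(out, "{name}_count{{stage=\"{stage}\"}} {cumulative}").unwrap();
        out
    }
}
```

One histogram per stage (routing, key pool wait, upstream TTFB, and so on), distinguished by a label, would let a dashboard show which stage dominates without running any extra collector.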
What we want to measure
21-06-2026
Current progress on OpenTelemetry/Jaeger tracing for hero_aibroker_server:
Done
opentelemetry, opentelemetry_sdk, opentelemetry-otlp, and tracing-opentelemetry.
crates/hero_aibroker_server/src/telemetry.rs with opt-in OTLP/HTTP trace export, env fallbacks (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME), and graceful shutdown.
--telemetry and --telemetry-endpoint to main.rs.
HeroTracingLayer.
chat.completions (router entry, total duration, status)
chat.route (model routing)
keypool.acquire
upstream.chat / upstream.chat_stream (OpenAI + OpenRouter)
response.serialize
broker.tool_loop / broker.tool.execute / broker.streaming_tool_loop / broker.streaming_tool_turn
reqwest HTTP export without panics.
cargo check, cargo clippy -D warnings, and cargo test -p hero_aibroker_server (119 tests pass).
Manual test (without Jaeger)
hero_aibroker_server --fake --telemetry --address 127.0.0.1 --port 8080.
Open / next
docker-compose.telemetry.yml to run Jaeger + broker together for local end-to-end trace viewing.
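For anyone wiring up the compose file, the opt-in export and env fallbacks listed under Done could resolve roughly as sketched below. This is not the code in telemetry.rs: TelemetryConfig and resolve_telemetry are hypothetical names, the precedence (flag first, then environment, then default) is an assumption, and so are the defaults (http://localhost:4318 is the conventional OTLP/HTTP endpoint; the service name default is a guess).

```rust
/// Resolved telemetry settings (hypothetical struct; field names are illustrative).
#[derive(Debug, PartialEq)]
struct TelemetryConfig {
    endpoint: String,
    service_name: String,
}

/// Resolve telemetry settings, or None when telemetry was not requested.
///
/// `enabled` stands for the --telemetry flag and `cli_endpoint` for
/// --telemetry-endpoint. `env` abstracts environment lookup so the function
/// can be tested without touching the real process environment.
fn resolve_telemetry(
    enabled: bool,
    cli_endpoint: Option<&str>,
    env: impl Fn(&str) -> Option<String>,
) -> Option<TelemetryConfig> {
    // Export is opt-in: without the flag, nothing is configured.
    if !enabled {
        return None;
    }
    // Assumed precedence: explicit flag, then env var, then the OTLP/HTTP default.
    let endpoint = cli_endpoint
        .map(str::to_owned)
        .or_else(|| env("OTEL_EXPORTER_OTLP_ENDPOINT"))
        .unwrap_or_else(|| "http://localhost:4318".to_owned());
    // Assumed default service name; the real default may differ.
    let service_name = env("OTEL_SERVICE_NAME")
        .unwrap_or_else(|| "hero_aibroker_server".to_owned());
    Some(TelemetryConfig { endpoint, service_name })
}
```

In the binary, the lookup closure would be something like |key| std::env::var(key).ok(), so a compose file only needs to set OTEL_EXPORTER_OTLP_ENDPOINT on the broker container to point it at the Jaeger service.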