- Rust 66%
- HTML 33.4%
- CSS 0.4%
- Makefile 0.2%
Switch from the two-phase early-bind-stub → rebind-with-full-router pattern to a simpler OnceLock<Arc<AppState>> global: serve_domains (macro-generated) owns rpc.sock binding and --info handling from the start; the REST server is spawned via a oneshot channel once models finish loading. EmbedService and AdminService become unit structs that read from APP_STATE rather than carrying an Option<Arc<AppState>> field. Also bumps reqwest 0.11→0.12 and drops the hardcoded [[binaries.tcp]] from service.toml in favour of the HERO_EMBEDDER_PROVIDER_TCP_BIND env var. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|---|---|---|
| .forgejo/workflows | ||
| crates | ||
| .gitignore | ||
| Cargo.lock | ||
| Cargo.toml | ||
| Cargo.toml.hero_builder_backup | ||
| LICENSE | ||
| OPENAI_COMPAT_VERIFICATION.md | ||
| README.md | ||
| rust-toolchain.toml | ||
hero_embedder_provider
Stateless OpenAI-compatible embeddings + reranking daemon backed by BGE ONNX models. No auth, no sessions, no database — models are loaded once at startup and kept in memory.
What it does
- Embeddings — four BGE quality levels (INT8 fast → FP32 best), 384d or 768d
- Reranking — BGE cross-encoder reranker scores
(query, doc)pairs - OpenAI-compatible REST on
rest.sock+ TCP 8092 for drop-in client use - Hero-native JSON-RPC 2.0 on
rpc.sockwith fullstatus/mem_infointrospection - Admin dashboard on
admin.sock— live status, embed/rerank playground, performance benchmarks
Crates
| Crate | Role |
|---|---|
hero_embedder_provider_server |
Axum server — REST, RPC, TCP 8092; ONNX engine in src/engine/ (embedder + reranker, download, install) |
hero_embedder_provider_admin |
Admin dashboard web UI |
hero_embedder_provider_sdk |
Generated OpenRPC client (embed + admin domains) for Rust consumers |
hero_embedder_provider_examples |
Runnable embed / rerank examples over the SDK |
hero_embedder_provider_tests |
Integration tests against a running instance (RPC + OpenAI REST) |
Quality levels
Four quality levels are exposed on the embed call and as REST model IDs.
Select by quality (1–4) in JSON-RPC, or by model in the REST API:
| Level | Model ID | BGE model | Precision | Dimensions | Max tokens | Use case |
|---|---|---|---|---|---|---|
| Q1 | bge-small-q1 |
bge-small-en-v1.5 | INT8 | 384 | 128 | Fast, low memory |
| Q2 | bge-small-q2 |
bge-small-en-v1.5 | FP16 | 384 | 256 | Balanced |
| Q3 | bge-base-q3 |
bge-base-en-v1.5 | INT8 | 768 | 256 | High quality |
| Q4 | bge-base-q4 |
bge-base-en-v1.5 | FP16 | 768 | 512 | Best accuracy |
| — | bge-reranker-base |
bge-reranker-base | FP32 | n/a | 512 | Cross-encoder rerank |
OpenAI alias IDs are also accepted on the REST endpoint:
text-embedding-3-small→bge-small-q2text-embedding-3-large→bge-base-q4
Bind address
| Interface | Address |
|---|---|
| TCP | 127.0.0.1:8092 (Linux: mycelium overlay if up) |
| REST socket | $PATH_SOCKETS/hero_embedder_provider/rest.sock |
| RPC socket | $PATH_SOCKETS/hero_embedder_provider/rpc.sock |
| Admin socket | $PATH_SOCKETS/hero_embedder_provider/admin.sock |
On Linux the TCP listener probes the local mycelium daemon at startup; if a mycelium overlay address is found it binds there so other nodes on the overlay can reach it directly.
Set HERO_EMBEDDER_PROVIDER_TCP_BIND=<ip>:<port> to override the resolved
bind address (useful when the mycelium daemon API is not locally reachable).
REST API (OpenAI-compatible)
| Method | Path | Description |
|---|---|---|
POST |
/v1/embeddings |
Compute dense vectors for one or more inputs |
POST |
/v1/rerank |
Score (query, doc) pairs and return sorted results |
GET |
/v1/models |
List available local model IDs |
GET |
/health |
{"status", "service", "version", "models_ready", "download", "model_load"} |
Embeddings request
POST /v1/embeddings
Content-Type: application/json
{ "model": "bge-small-q1", "input": "Hello world" }
Batch input and the text-embedding-3-* aliases work the same way.
JSON-RPC 2.0 API (Hero-native)
The service is split into two domains, both served on the single rpc.sock
(the OpenRPC spec for each is at GET /api/{domain}/openrpc.json, and
GET /api/domains.json lists them):
embed(inference) →POST /api/embed/rpcadmin(operational) →POST /api/admin/rpc
| Domain | Method | Params | Description |
|---|---|---|---|
admin |
health |
— | Service status + models_ready boolean |
admin |
system_ping |
— | Liveness smoke test (alias of health) |
admin |
status |
— | Download progress, model load progress, models root path |
admin |
models_list |
— | List locally available model IDs |
admin |
mem_info |
— | Process RSS, system RAM, per-model disk sizes |
admin |
ort_info |
— | ONNX Runtime version, path, min-version check |
embed |
embed |
request: { texts, quality? } |
Compute embeddings; preserves INT8 quantization scale |
embed |
rerank |
request: { query, docs, top_k? } |
BGE cross-encoder rerank |
Methods take a single by-name argument. The control plane on rpc.sock also
answers GET /api/ping and POST /api/rpc (ping, discover_domains).
embed example — POST /api/embed/rpc
{ "jsonrpc": "2.0", "id": 1, "method": "embed",
"params": { "request": { "texts": ["Hello world", "Machine learning"], "quality": 1 } } }
Response includes embeddings, precision (int8/fp16), dimensions,
quality, and model name. INT8 embeddings carry a scale field for
dequantisation.
rerank example — POST /api/embed/rpc
{ "jsonrpc": "2.0", "id": 2, "method": "rerank",
"params": { "request": {
"query": "what is ML?",
"docs": [{"id":"d1","text":"ML is..."},{"id":"d2","text":"Weather..."}],
"top_k": 5 } } }
Returns { "results": [{id, score}, …] } sorted by descending score.
Health during startup
Model loading takes seconds (CPU-only) to minutes (cold first-run download).
Both sockets and TCP are bound immediately at startup and return
{"status": "starting"} until models are ready — hero_proc probes never
flap. The status RPC method exposes per-file download progress and
per-model load progress during this window.
ONNX Runtime
ONNX Runtime 1.25.0+ is required. At startup the server:
- Checks for an existing installation (hero lib dir → Homebrew → system paths).
- If not found, downloads the official GitHub release archive automatically.
To override: set ORT_DYLIB_PATH to the full path of the dylib before
starting. The ort_info RPC method reports the detected path and version.
Build
cargo build --workspace --release
Use from Rust
use hero_embedder_provider_sdk::EmbedderProviderClient;
let client = EmbedderProviderClient::connect().await?;
// Embed at Q1 (fast INT8)
let resp = client.embed(vec!["hello world".into()], Some(1)).await?;
println!("{}d {}", resp.dimensions, resp.precision);
// Rerank
let hits = client.rerank(
"what is ML?".into(),
vec![("d1".into(), "ML is...".into()), ("d2".into(), "Weather".into())],
Some(5),
).await?;
Or point any OpenAI client at http://127.0.0.1:8092/v1 with no auth.