Stateless OpenAI-compatible embeddings and reranking daemon using BGE ONNX models.

Rust 66%
HTML 33.4%
CSS 0.4%
Makefile 0.2%

Find a file

Kristof ad1a0d242b All checks were successful lab release / release (push) Successful in 15m16s Details lab release (gnu) / publish-gnu (push) Successful in 29m52s Details refactor(server): replace early/late rebind with OnceLock state + serve_domains Switch from the two-phase early-bind-stub → rebind-with-full-router pattern to a simpler OnceLock<Arc<AppState>> global: serve_domains (macro-generated) owns rpc.sock binding and --info handling from the start; the REST server is spawned via a oneshot channel once models finish loading. EmbedService and AdminService become unit structs that read from APP_STATE rather than carrying an Option<Arc<AppState>> field. Also bumps reqwest 0.11→0.12 and drops the hardcoded [[binaries.tcp]] from service.toml in favour of the HERO_EMBEDDER_PROVIDER_TCP_BIND env var. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>		2026-06-24 19:26:22 +02:00
.forgejo/workflows	ci: canonical lab-release (cargo check + multi-arch + hero.releaser)	2026-06-10 20:10:52 +02:00
crates	refactor(server): replace early/late rebind with OnceLock state + serve_domains	2026-06-24 19:26:22 +02:00
.gitignore	feat: initial scaffold — stateless OpenAI-compatible embedder daemon	2026-05-26 14:59:23 +02:00
Cargo.lock	refactor(server): replace early/late rebind with OnceLock state + serve_domains	2026-06-24 19:26:22 +02:00
Cargo.toml	refactor(server): split into embed + admin domains; fold lib into server	2026-06-24 10:24:49 +02:00
Cargo.toml.hero_builder_backup	chore: switch to baso_info! macro, herolib_derive/openrpc, and patch hero_rpc_derive locally	2026-06-01 09:30:58 +02:00
LICENSE	feat: initial scaffold — stateless OpenAI-compatible embedder daemon	2026-05-26 14:59:23 +02:00
OPENAI_COMPAT_VERIFICATION.md	test(openai): add OpenAI REST interface compatibility tests	2026-06-24 10:40:05 +02:00
README.md	refactor(server): replace early/late rebind with OnceLock state + serve_domains	2026-06-24 19:26:22 +02:00
rust-toolchain.toml	chore: switch to baso_info! macro, herolib_derive/openrpc, and patch hero_rpc_derive locally	2026-06-01 09:30:58 +02:00

README.md

hero_embedder_provider

Stateless OpenAI-compatible embeddings + reranking daemon backed by BGE ONNX models. No auth, no sessions, no database — models are loaded once at startup and kept in memory.

What it does

Embeddings — four BGE quality levels (INT8 fast → FP32 best), 384d or 768d
Reranking — BGE cross-encoder reranker scores (query, doc) pairs
OpenAI-compatible REST on rest.sock + TCP 8092 for drop-in client use
Hero-native JSON-RPC 2.0 on rpc.sock with full status / mem_info introspection
Admin dashboard on admin.sock — live status, embed/rerank playground, performance benchmarks

Crates

Crate	Role
`hero_embedder_provider_server`	Axum server — REST, RPC, TCP 8092; ONNX engine in `src/engine/` (embedder + reranker, download, install)
`hero_embedder_provider_admin`	Admin dashboard web UI
`hero_embedder_provider_sdk`	Generated OpenRPC client (`embed` + `admin` domains) for Rust consumers
`hero_embedder_provider_examples`	Runnable `embed` / `rerank` examples over the SDK
`hero_embedder_provider_tests`	Integration tests against a running instance (RPC + OpenAI REST)

Quality levels

Four quality levels are exposed on the embed call and as REST model IDs. Select by quality (1–4) in JSON-RPC, or by model in the REST API:

Level	Model ID	BGE model	Precision	Dimensions	Max tokens	Use case
Q1	`bge-small-q1`	bge-small-en-v1.5	INT8	384	128	Fast, low memory
Q2	`bge-small-q2`	bge-small-en-v1.5	FP16	384	256	Balanced
Q3	`bge-base-q3`	bge-base-en-v1.5	INT8	768	256	High quality
Q4	`bge-base-q4`	bge-base-en-v1.5	FP16	768	512	Best accuracy
—	`bge-reranker-base`	bge-reranker-base	FP32	n/a	512	Cross-encoder rerank

OpenAI alias IDs are also accepted on the REST endpoint:

text-embedding-3-small → bge-small-q2
text-embedding-3-large → bge-base-q4

Bind address

Interface	Address
TCP	`127.0.0.1:8092` (Linux: mycelium overlay if up)
REST socket	`$PATH_SOCKETS/hero_embedder_provider/rest.sock`
RPC socket	`$PATH_SOCKETS/hero_embedder_provider/rpc.sock`
Admin socket	`$PATH_SOCKETS/hero_embedder_provider/admin.sock`

On Linux the TCP listener probes the local mycelium daemon at startup; if a mycelium overlay address is found it binds there so other nodes on the overlay can reach it directly.

Set HERO_EMBEDDER_PROVIDER_TCP_BIND=<ip>:<port> to override the resolved bind address (useful when the mycelium daemon API is not locally reachable).

REST API (OpenAI-compatible)

Method	Path	Description
`POST`	`/v1/embeddings`	Compute dense vectors for one or more inputs
`POST`	`/v1/rerank`	Score `(query, doc)` pairs and return sorted results
`GET`	`/v1/models`	List available local model IDs
`GET`	`/health`	`{"status", "service", "version", "models_ready", "download", "model_load"}`

Embeddings request

POST /v1/embeddings
Content-Type: application/json

{ "model": "bge-small-q1", "input": "Hello world" }

Batch input and the text-embedding-3-* aliases work the same way.

JSON-RPC 2.0 API (Hero-native)

The service is split into two domains, both served on the single rpc.sock (the OpenRPC spec for each is at GET /api/{domain}/openrpc.json, and GET /api/domains.json lists them):

embed (inference) → POST /api/embed/rpc
admin (operational) → POST /api/admin/rpc

Domain	Method	Params	Description
`admin`	`health`	—	Service status + `models_ready` boolean
`admin`	`system_ping`	—	Liveness smoke test (alias of `health`)
`admin`	`status`	—	Download progress, model load progress, models root path
`admin`	`models_list`	—	List locally available model IDs
`admin`	`mem_info`	—	Process RSS, system RAM, per-model disk sizes
`admin`	`ort_info`	—	ONNX Runtime version, path, min-version check
`embed`	`embed`	`request: { texts, quality? }`	Compute embeddings; preserves INT8 quantization scale
`embed`	`rerank`	`request: { query, docs, top_k? }`	BGE cross-encoder rerank

Methods take a single by-name argument. The control plane on rpc.sock also answers GET /api/ping and POST /api/rpc (ping, discover_domains).

`embed` example — `POST /api/embed/rpc`

{ "jsonrpc": "2.0", "id": 1, "method": "embed",
  "params": { "request": { "texts": ["Hello world", "Machine learning"], "quality": 1 } } }

Response includes embeddings, precision (int8/fp16), dimensions, quality, and model name. INT8 embeddings carry a scale field for dequantisation.

`rerank` example — `POST /api/embed/rpc`

{ "jsonrpc": "2.0", "id": 2, "method": "rerank",
  "params": { "request": {
    "query": "what is ML?",
    "docs": [{"id":"d1","text":"ML is..."},{"id":"d2","text":"Weather..."}],
    "top_k": 5 } } }

Returns { "results": [{id, score}, …] } sorted by descending score.

Health during startup

Model loading takes seconds (CPU-only) to minutes (cold first-run download). Both sockets and TCP are bound immediately at startup and return {"status": "starting"} until models are ready — hero_proc probes never flap. The status RPC method exposes per-file download progress and per-model load progress during this window.

ONNX Runtime

ONNX Runtime 1.25.0+ is required. At startup the server:

Checks for an existing installation (hero lib dir → Homebrew → system paths).
If not found, downloads the official GitHub release archive automatically.

To override: set ORT_DYLIB_PATH to the full path of the dylib before starting. The ort_info RPC method reports the detected path and version.

Build

cargo build --workspace --release

Use from Rust

use hero_embedder_provider_sdk::EmbedderProviderClient;

let client = EmbedderProviderClient::connect().await?;

// Embed at Q1 (fast INT8)
let resp = client.embed(vec!["hello world".into()], Some(1)).await?;
println!("{}d {}", resp.dimensions, resp.precision);

// Rerank
let hits = client.rerank(
    "what is ML?".into(),
    vec![("d1".into(), "ML is...".into()), ("d2".into(), "Weather".into())],
    Some(5),
).await?;

Or point any OpenAI client at http://127.0.0.1:8092/v1 with no auth.

README.md Unescape Escape