fix(osis): make OsisIndexer async, bridge sync handlers via dedicated runtime #130

Closed
timur wants to merge 2 commits from fix-osis-indexer-async into development
Owner

Summary

  • OsisIndexer is now async throughout — dropped the owned Runtime, public methods (index_document, delete_document, search, try_initialize) are async fn and .await hero_indexer_sdk directly. Schema registration runs once via tokio::sync::OnceCell::get_or_try_init.

  • Added sync shims (index_document_blocking, delete_document_blocking, search_blocking, try_initialize_blocking) that dispatch async work onto a shared OnceLock<Runtime> bridge and block the caller on a stdlib sync_channel. Safe from any caller context (inside or outside another tokio runtime). The async surface remains the source of truth; the shims exist only to bridge the existing sync OSIS handler boundary.

  • Generator (rust_osis.rs) emits the *_blocking shims from the generated <root>_new/_set/_delete write-through and <root>_find arm. Before/after at the write-through call site:

    // before — panicked when invoked from inside an outer tokio runtime
    let _ = self.indexer.index_document("recipe", &sid, fields);
    
    // after — safe from any context
    self.indexer.index_document_blocking("recipe", &sid, fields);
    
  • Smoke test → #[tokio::test] + .await. Added blocking_shims_safe_inside_runtime to exercise the sync shims from inside an outer tokio runtime — exactly the pre-fix panic site — to lock the regression in.

  • Bench wires the now-async try_initialize through the bench-owned Runtime via rt.block_on(...).

Why

hero_rpc#127 wired the per-domain Tantivy indexer into the generated OSIS CRUD methods via a sync facade — OsisIndexer owned a private tokio::runtime::Runtime and block_on-ed every hero_indexer_sdk call. That panicked with "Cannot start a runtime from within a runtime" the moment the generated handler was invoked from inside an outer tokio context (jsonrpsee server worker, or hero_service #[tokio::test]/#[tokio::main] driving CRUD).

The architectural choice was wrong: hero_indexer_sdk is generated via openrpc_client! and is async-only. Wrapping it in a blocking facade is the bug. Principle locked in via codegen:

Generated SDK clients are always async; consumers in async generated handlers .await them; never wrap in a blocking facade.

The future-state migration (when the generated <root>_new/_set/_delete themselves become async fn) drops the shims entirely and .awaits the async surface — a one-line change in the codegen at that time.

Test plan

  • cargo test -p hero_rpc_osis --test indexer_smoke --features rpc -- --nocapture2/2 pass including blocking_shims_safe_inside_runtime (the pre-fix panic site, now safe).
  • BENCH_LARGE=1000 cargo bench -p hero_rpc_osis_benches --bench index_perf -- --quick 'query_indexed_vs_full_scan/wire_find.title/1000' — preflight asserts 333 hits over 1000 rows as expected; bench completes in ~3.2 ms per iter with no nested-runtime panic.
  • cargo test -p hero_rpc_osis --lib --features rpc — 85 unit tests pass.
  • cargo test -p hero_rpc_generator --lib — 137 generator unit tests pass.
  • cargo build --workspace clean.

Related: hero_rpc#127 (original indexer wiring), hero_rpc#128 (codegen-emitted _find_* E2E tests that surfaced the panic when invoked from hero_service's outer tokio runtime).

## Summary - `OsisIndexer` is now async throughout — dropped the owned `Runtime`, public methods (`index_document`, `delete_document`, `search`, `try_initialize`) are `async fn` and `.await` `hero_indexer_sdk` directly. Schema registration runs once via `tokio::sync::OnceCell::get_or_try_init`. - Added sync shims (`index_document_blocking`, `delete_document_blocking`, `search_blocking`, `try_initialize_blocking`) that dispatch async work onto a shared `OnceLock<Runtime>` bridge and block the caller on a stdlib `sync_channel`. Safe from any caller context (inside or outside another tokio runtime). The async surface remains the source of truth; the shims exist only to bridge the existing sync OSIS handler boundary. - Generator (`rust_osis.rs`) emits the `*_blocking` shims from the generated `<root>_new`/`_set`/`_delete` write-through and `<root>_find` arm. Before/after at the write-through call site: ```rust // before — panicked when invoked from inside an outer tokio runtime let _ = self.indexer.index_document("recipe", &sid, fields); // after — safe from any context self.indexer.index_document_blocking("recipe", &sid, fields); ``` - Smoke test → `#[tokio::test]` + `.await`. Added `blocking_shims_safe_inside_runtime` to exercise the sync shims from inside an outer tokio runtime — exactly the pre-fix panic site — to lock the regression in. - Bench wires the now-async `try_initialize` through the bench-owned `Runtime` via `rt.block_on(...)`. ## Why hero_rpc#127 wired the per-domain Tantivy indexer into the generated OSIS CRUD methods via a sync facade — `OsisIndexer` owned a private `tokio::runtime::Runtime` and `block_on`-ed every `hero_indexer_sdk` call. That panicked with **"Cannot start a runtime from within a runtime"** the moment the generated handler was invoked from inside an outer tokio context (jsonrpsee server worker, or hero_service `#[tokio::test]`/`#[tokio::main]` driving CRUD). The architectural choice was wrong: `hero_indexer_sdk` is generated via `openrpc_client!` and is async-only. Wrapping it in a blocking facade is the bug. Principle locked in via codegen: > **Generated SDK clients are always async; consumers in async generated handlers `.await` them; never wrap in a blocking facade.** The future-state migration (when the generated `<root>_new`/`_set`/`_delete` themselves become `async fn`) drops the shims entirely and `.await`s the async surface — a one-line change in the codegen at that time. ## Test plan - `cargo test -p hero_rpc_osis --test indexer_smoke --features rpc -- --nocapture` — **2/2 pass** including `blocking_shims_safe_inside_runtime` (the pre-fix panic site, now safe). - `BENCH_LARGE=1000 cargo bench -p hero_rpc_osis_benches --bench index_perf -- --quick 'query_indexed_vs_full_scan/wire_find.title/1000'` — preflight asserts `333 hits over 1000 rows` as expected; bench completes in `~3.2 ms` per iter with no nested-runtime panic. - `cargo test -p hero_rpc_osis --lib --features rpc` — 85 unit tests pass. - `cargo test -p hero_rpc_generator --lib` — 137 generator unit tests pass. - `cargo build --workspace` clean. Related: hero_rpc#127 (original indexer wiring), hero_rpc#128 (codegen-emitted `_find_*` E2E tests that surfaced the panic when invoked from hero_service's outer tokio runtime).
hero_rpc#127 wired the per-domain Tantivy indexer into the generated
OSIS `_new`/`_set`/`_delete` write-through and the `_find` arm via a
sync facade — `OsisIndexer` owned a private `tokio::runtime::Runtime`
and `block_on`-ed every `hero_indexer_sdk` call. That panicked with
'Cannot start a runtime from within a runtime' the moment the
generated handler was invoked from inside an outer tokio context (the
jsonrpsee server worker, or a hero_service `#[tokio::test]` /
`#[tokio::main]` driving CRUD through the wire).

The architectural choice was wrong. `hero_indexer_sdk` is generated
via `openrpc_client!` and is async-only — consumers must `.await` it.
Wrapping it in a blocking facade is the bug.

This commit:

- Drops the owned `Runtime` from `OsisIndexer`. Public surface
  (`index_document`, `delete_document`, `search`, `try_initialize`)
  becomes `async fn` and `.await`s the SDK directly. Schema
  registration is hoisted into a `tokio::sync::OnceCell::get_or_try_init`
  fast-path so the lazy connect+`db.create`+`db.select` chain runs at
  most once and subsequent calls are a single atomic load.

- Adds `*_blocking` shims (`index_document_blocking`,
  `delete_document_blocking`, `search_blocking`,
  `try_initialize_blocking`) that bridge the existing sync OSIS
  handler boundary. These spawn the async work onto a shared
  multi-thread bridge runtime and block the caller's thread on a
  stdlib `sync_channel` — safe regardless of whether the caller is
  inside another tokio runtime or not. `hero_indexer_sdk` itself
  stays async throughout; the shims exist only because the generated
  `<root>_new`/`_set`/`_delete`/`_find` handler bodies are currently
  sync `pub fn`. Async callers (smoke test, benches, future async
  handlers) should keep calling the async surface directly.

- Updates the smoke test to `#[tokio::test]` + `.await`. Adds a second
  test (`blocking_shims_safe_inside_runtime`) that exercises the
  sync shims from inside an outer tokio runtime — the exact
  pre-fix panic scenario — to lock the regression in.

- Updates the bench to `rt.block_on(indexer.try_initialize())` since
  `try_initialize` is now async. The bench file owns its own
  `Runtime` (criterion can't be `async fn`), which is the right
  place to `block_on` async work — exactly the pattern the
  architecture prescribes for sync consumers outside the OSIS
  handler boundary.

Codegen change to emit the new sync shims lands in the next commit.
fix(generator): emit OsisIndexer sync shims from generated CRUD bodies
Some checks failed
Test / test (push) Failing after 3m8s
Test / test (pull_request) Failing after 2m14s
9ca9e8a6b8
The generated `<root>_new`/`_set`/`_delete`/`_find` methods on
`Osis<Domain>` are sync `pub fn` — they're invoked from sync
`*_rpc_new` dispatch arms and (over the wire) from the sync
jsonrpsee `register_method` closures. They can't `.await` the async
`OsisIndexer` surface directly.

Switch the emitted calls to the `*_blocking` shims hero_rpc#129
adds on `OsisIndexer`. These dispatch the async work onto a shared
bridge runtime and block on a stdlib channel — safe to call from
any caller context (inside or outside another tokio runtime), which
is exactly the panic the previous `block_on`-in-place facade hit.

Before (panicked when called from inside an outer tokio runtime):

    let _ = self.indexer.index_document("recipe", &sid, fields);

After (safe from any context):

    self.indexer.index_document_blocking("recipe", &sid, fields);

Same change for the delete path (`delete_document_blocking`) and the
`_find` arm (`search_blocking`). The async surface on `OsisIndexer`
remains the source of truth; future async OSIS handlers should
`.await` it directly and skip the shims entirely.
Author
Owner

Cross-repo validation note

hero_service development cannot currently build against this PR (or against hero_rpc development either, independently of this work): it imports hero_rpc_osis::rpc::bootstrap::MultiDomainBuilder, which only exists on the unmerged issue-124-lifecycle-alignment branch in this repo (commit 5e72817). Both before and after this PR, cargo build --workspace in hero_service fails with the same MultiDomainBuilder not-found error.

The full cross-repo cargo test --workspace validation the task spec called for is therefore blocked on hero_rpc#124 landing first — it is not regressed by this PR. The OSIS indexer panic this PR fixes is covered by the in-repo smoke test (blocking_shims_safe_inside_runtime, which reproduces the exact pre-fix scenario from inside an outer #[tokio::test] runtime) and by the bench wire-arm preflight (333 hits over 1000 rows over wire_find.title/1000).

### Cross-repo validation note hero_service `development` cannot currently build against this PR (or against hero_rpc `development` either, independently of this work): it imports `hero_rpc_osis::rpc::bootstrap::MultiDomainBuilder`, which only exists on the unmerged `issue-124-lifecycle-alignment` branch in this repo (commit `5e72817`). Both before and after this PR, `cargo build --workspace` in hero_service fails with the same `MultiDomainBuilder` not-found error. The full cross-repo `cargo test --workspace` validation the task spec called for is therefore blocked on hero_rpc#124 landing first — it is not regressed by this PR. The OSIS indexer panic this PR fixes is covered by the in-repo smoke test (`blocking_shims_safe_inside_runtime`, which reproduces the exact pre-fix scenario from inside an outer `#[tokio::test]` runtime) and by the bench wire-arm preflight (`333 hits over 1000 rows` over `wire_find.title/1000`).
Author
Owner

Closing in favor of #133, which implements the ideal architectural fix instead of bridging the panic at the sync->async boundary.

#133 makes every link in the chain async: the generated OSIS CRUD handlers (_new / _set / _delete / _find / _get / _list / _list_full / _exists) are emitted as async fn, the OsisAppRpcHandler trait is async (RPITIT), rpc2_adapter uses register_async_method, and OsisIndexer keeps the async surface this branch salvaged. The result is one continuous .await stack from the network closure to the indexer socket — no spawn_blocking on the CRUD hot path, no bridge runtime, no *_blocking shims.

The async OsisIndexer code in crates/osis/src/index/remote.rs from this branch was salvaged as the starting point of #133. The bench file + smoke test changes from this branch landed too (minus the now-obsolete blocking_shims_safe_inside_runtime test, whose premise goes away once the shims do).

Not deleting the branch yet — #133's history references it.

Closing in favor of #133, which implements the ideal architectural fix instead of bridging the panic at the sync->async boundary. #133 makes every link in the chain async: the generated OSIS CRUD handlers (_new / _set / _delete / _find / _get / _list / _list_full / _exists) are emitted as `async fn`, the `OsisAppRpcHandler` trait is async (RPITIT), `rpc2_adapter` uses `register_async_method`, and `OsisIndexer` keeps the async surface this branch salvaged. The result is one continuous `.await` stack from the network closure to the indexer socket — no `spawn_blocking` on the CRUD hot path, no bridge runtime, no `*_blocking` shims. The async OsisIndexer code in `crates/osis/src/index/remote.rs` from this branch was salvaged as the starting point of #133. The bench file + smoke test changes from this branch landed too (minus the now-obsolete `blocking_shims_safe_inside_runtime` test, whose premise goes away once the shims do). Not deleting the branch yet — #133's history references it.
timur closed this pull request 2026-05-22 13:57:21 +00:00
Some checks failed
Test / test (push) Failing after 3m8s
Test / test (pull_request) Failing after 2m14s

Pull request closed

Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_rpc!130
No description provided.