fix: process tree tests — zombie handling for Docker CI containers #14

Merged
mik-tf merged 5 commits from development-fix-ci into development 2026-01-30 14:13:23 +00:00

Summary

Fixes the 03_process_tree integration tests (both bash and Rhai) that were failing in CI but passing locally. The root cause was that the tests counted zombie processes as still alive in Docker containers that lack a proper init system.

Problem

In our CI Docker containers, PID 1 is tail -f /dev/null — not a proper init process. When zinit stops a service and kills its process group, orphaned child processes receive the signal and terminate, but PID 1 never calls wait() to reap them. They become zombies (state Z) re-parented to PID 1.

The tests used ps -p $PID to verify the child was killed, but ps -p reports zombies as existing processes — causing a false "still alive" assertion failure.

A secondary issue was using pgrep -f "sleep 200" for process discovery, which is fragile in CI environments where stale processes from previous runs or other containers may match.
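
The zombie distinction described above can be checked by reading the `State:` field of `/proc/PID/status`. The following is a minimal sketch, not the code from this PR: the function names `proc_state` and `is_terminated` and the `PROC_ROOT` override are hypothetical, the latter added only so the helper can be exercised against fixture files.

```bash
#!/usr/bin/env bash
# Sketch only: proc_state, is_terminated and PROC_ROOT are illustrative names,
# not identifiers taken from the zinit test scripts.

# Print the one-letter state (R, S, D, T, Z, X, ...) of a PID.
# Returns non-zero if the PID has no readable status file.
proc_state() {
    local pid="$1"
    local status="${PROC_ROOT:-/proc}/$pid/status"
    [ -r "$status" ] || return 1
    # The line looks like "State:<TAB>Z (zombie)"; field 2 is the letter.
    awk '/^State:/ { print $2; exit }' "$status" 2>/dev/null
}

# Succeed if the process is gone, a zombie (Z) or dead (X).
is_terminated() {
    local state
    state=$(proc_state "$1") || return 0   # no status file: fully reaped
    case "$state" in
        Z|X) return 0 ;;   # terminated, but not yet reaped by its parent
        *)   return 1 ;;   # any other state counts as still alive
    esac
}
```

Unlike `ps -p`, which succeeds for any PID that still has a process table entry, this treats an unreaped zombie as terminated.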

Changes

tests/scripts/03_process_tree.sh

  • Use PID file written by the service script instead of pgrep -f for reliable child PID discovery
  • Replace ps -p with /proc/PID/status state check — zombies (Z) and dead processes (X) are correctly treated as terminated
  • Add diagnostic logging (PGID, process state) for CI debugging
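
The PID file approach in the first bullet might look roughly like the sketch below. It assumes the service script records its child's PID with something like `sleep 200 & echo $! > "$PID_FILE"`; the function name `read_child_pid`, the retry count, and the file path are hypothetical and not taken from the actual test script.

```bash
#!/usr/bin/env bash
# Sketch only: read_child_pid and its arguments are illustrative names.

# Read a numeric PID from a file, retrying once per second because the
# service may not have written the file yet when the test first looks.
# Usage: read_child_pid <pid_file> [tries]
read_child_pid() {
    local pid_file="$1"
    local tries="${2:-10}"
    local pid
    while :; do
        if [ -s "$pid_file" ]; then
            pid=$(tr -d '[:space:]' < "$pid_file")
            case "$pid" in
                ''|*[!0-9]*) ;;                 # empty or non-numeric: retry
                *) echo "$pid"; return 0 ;;     # looks like a PID
            esac
        fi
        tries=$((tries - 1))
        if [ "$tries" -le 0 ]; then
            return 1
        fi
        sleep 1
    done
}
```

Because the PID comes from the service under test rather than from a global `pgrep -f` match, a stale `sleep 200` left over from another run cannot be picked up by mistake.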

tests/rhai/03_process_tree.rhai

  • Same PID file approach and /proc/PID/status zombie detection as the bash version
  • Removed zinit_delete() cleanup call (consistent with other Rhai tests; also avoids a name normalization inconsistency in the zinit_delete global function)

src/server/process.rs

  • Fixed stale doc comment on find_processes_on_ports that still referenced a removed netstat fallback

.forgejo/workflows/test.yaml

  • Removed temporary development-fix-ci branch from trigger list

Testing

  • All 186 Rust integration tests pass locally
  • All 5 Rhai integration tests pass locally
  • CI run on development-fix-ci branch passes: all tests green

Related

Builds on the earlier CI fixes already merged to development:

  • e8b3c0e — cmdline filter fix + separate test/publish workflows
  • 0237513 — find_processes_on_ports socket inode rewrite
- Read child PID from a file written by the service script instead of
  using pgrep -f 'sleep 200' (which is global and may match stale
  processes from previous tests or other contexts).
- Log service and child PGID for diagnostics.
- Add extended wait (5s extra) with diagnostics if child survives
  initial 2s wait, to handle slow CI containers before failing.
- Show process status info on failure for CI debugging.
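
A two-phase wait like the one this commit message describes (a short initial window, then a longer grace period with a diagnostic line) could be sketched as follows. The names `poll_until` and `wait_two_phase` are hypothetical, and the one-second polling interval is an assumption rather than what the test script actually uses.

```bash
#!/usr/bin/env bash
# Sketch only: poll_until and wait_two_phase are illustrative names.

# Run "$@" once per second until it succeeds or <seconds> have elapsed.
# Usage: poll_until <seconds> <command> [args...]
poll_until() {
    local remaining="$1"
    shift
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Wait <initial> seconds for the condition; if it still fails, log a
# diagnostic line to stderr and allow <extra> more seconds before giving up.
# Usage: wait_two_phase <initial> <extra> <command> [args...]
wait_two_phase() {
    local initial="$1"
    local extra="$2"
    shift 2
    if poll_until "$initial" "$@"; then
        return 0
    fi
    echo "condition not met after ${initial}s, waiting up to ${extra}s more" >&2
    poll_until "$extra" "$@"
}
```

With the values from the commit message, the call would be along the lines of `wait_two_phase 2 5 child_is_gone "$CHILD_PID"`, where `child_is_gone` stands for whatever termination check the test uses.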
ci: add development-fix-ci branch to test.yaml triggers
Some checks failed
Tests / test (push) Failing after 2m23s
e81e063af0
fix: treat zombie processes as dead in 03_process_tree.sh
All checks were successful
Tests / test (push) Successful in 2m50s
1c8f6b0a60
In Docker containers without a proper init (PID 1 = tail -f /dev/null),
orphaned child processes become zombies (Z state) re-parented to PID 1.
They received the signal and terminated, but PID 1 never calls wait()
to reap them. ps -p still reports zombies as existing, causing the test
to fail.

Fix: use /proc/PID/status to check actual process state. Zombies (Z)
and dead processes (X) are treated as dead, which they are.
fix: rhai process tree test zombie handling + doc comment
All checks were successful
Tests / test (push) Successful in 2m51s
7e8b1966fb
- Rewrite 03_process_tree.rhai to use PID file instead of pgrep
- Check /proc/PID/status for zombie (Z) and dead (X) states
  instead of relying on ps -p (which reports zombies as alive)
- Fix doc comment on find_processes_on_ports: removed stale
  reference to netstat fallback
ci: remove temporary test branch from workflow triggers
All checks were successful
Tests / test (pull_request) Successful in 4m5s
53c1db4758
mik-tf merged commit fbdf84603a into development 2026-01-30 14:13:23 +00:00
mik-tf deleted branch development-fix-ci 2026-01-30 14:13:23 +00:00
Reference
geomind_code/zinit!14