
Treat Codex Failures as Recoverable Runtime States: Self-Healing Hermes Agent Workers

A fact-grounded analysis of Hermes Agent’s Codex auth, timeout, and Kanban protocol-violation failure modes, using public GitHub cases and one observed worker incident to outline a safer recovery design.

Publisher: WayDigital
Published: 2026-05-13 04:32 UTC
Language: en
Region: global
Category: Essays

In a multi-agent coding system, the most costly failures are often not bad patches or broken tests. A worker may have already collected evidence, written logs, and produced most of the required artifacts, only to lose the final model call, auth context, or Kanban state transition. If the system only looks at the last process exit, it can turn a recoverable runtime fault into a blocked task.

That is the key lesson from recent Hermes Agent + Codex work: a Codex call failure should not automatically mean task failure. The watchdog should inspect durable artifacts first, then choose recovery, completion, or a precise block.

Failure mode 1: isolated profiles cannot see Codex login state

Hermes profiles isolate HOME and configuration for different workers. That isolation is useful, but it creates a common auth failure: the main user may be logged into Codex through ~/.codex/auth.json, while a worker running under PROFILE_HOME cannot see that file. The symptom can be a 401, a missing bearer error, or a not-logged-in Codex process.

A good worker preflight should be mechanical:

Before starting the worker, run HOME="$PROFILE_HOME" codex login status.
If it reports Logged in using ChatGPT, continue.
If it reports 401, missing bearer, or not logged in, inspect $PROFILE_HOME/.codex/auth.json.
If the main user’s ~/.codex/auth.json exists, create a symlink into the profile and check status again.
Only if that still fails should the task block, with a clear instruction to re-login to Codex rather than a vague request for an API token.
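The preflight steps above can be sketched in Python. This is a minimal illustration, not Hermes code: the function names, the "ok" / "repaired" / "blocked" return values, and the injectable `status` callable are hypothetical, and the success text "Logged in using" is taken from the status message quoted above. The sketch never overwrites an existing profile auth file; it only links one in when none is present.

```python
import os
import subprocess
from pathlib import Path
from typing import Callable


def codex_login_status(profile_home: Path) -> str:
    """Run `codex login status` with HOME pointed at the worker profile."""
    env = dict(os.environ, HOME=str(profile_home))
    result = subprocess.run(
        ["codex", "login", "status"],
        env=env, capture_output=True, text=True, check=False,
    )
    # Combine both streams so the check does not depend on which one is used.
    return result.stdout + result.stderr


def preflight_codex_auth(
    profile_home: Path,
    main_home: Path,
    status: Callable[[Path], str] = codex_login_status,
) -> str:
    """Return "ok", "repaired", or "blocked" (hypothetical labels)."""
    if "Logged in using" in status(profile_home):
        return "ok"

    profile_auth = profile_home / ".codex" / "auth.json"
    main_auth = main_home / ".codex" / "auth.json"
    already_present = profile_auth.exists() or profile_auth.is_symlink()

    if main_auth.exists() and not already_present:
        # Link the main user's Codex auth file into the profile, then re-check.
        profile_auth.parent.mkdir(parents=True, exist_ok=True)
        profile_auth.symlink_to(main_auth)
        if "Logged in using" in status(profile_home):
            return "repaired"

    # Caller should block with a clear "re-login to Codex" instruction.
    return "blocked"
```

Passing `status` as a parameter keeps the repair logic testable without a real Codex install; a caller would normally rely on the default.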

This matches the direction visible in public Hermes Agent work. PR #18555 adds an opt-in compatibility mode that reuses Codex CLI auth from CODEX_HOME/auth.json or ~/.codex/auth.json. PR #20457 describes a related case: Hermes’ own ~/.hermes/auth.json can be empty while the Codex CLI already has valid credentials in ~/.codex/auth.json. The proposed fix bootstraps Hermes credentials from the Codex CLI file.

The broader point is that Codex auth is not a single switch. It is a state-sync problem across profiles, HOME directories, Hermes auth storage, credential pools, and the Codex CLI shared auth file.

Failure mode 2: not every 401 means “not logged in”

401 errors also need classification. Issue #23670 reports preflight compression surfacing a Codex OAuth 401 with User not found while the main turn continues. That is not necessarily a whole-task failure; it is an auxiliary fallback gap.

Issue #23896 is more specific. In residency-enforced ChatGPT workspaces, Hermes openai-codex requests could return HTTP 401: Workspace is not authorized in this region even when the Codex CLI worked. PR #23935 addresses this class of problem: it extracts chatgpt_data_residency, falls back to chatgpt_compute_residency, and sends the x-openai-internal-codex-residency header.

So a watchdog should not flatten all 401s into one message. It should distinguish missing profile auth, stale Hermes/Codex auth synchronization, auxiliary fallback failure, and missing residency headers.
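One way to keep these cases apart is a small classifier over the error text plus a little call context. The sketch below is an assumption-laden illustration: the enum, the function name, the two boolean flags, and the order of the checks are all hypothetical, and the matched substrings are simply the messages quoted in this article, not a stable API contract.

```python
from enum import Enum


class Auth401(Enum):
    MISSING_PROFILE_AUTH = "missing profile auth"
    STALE_SYNC = "stale Hermes/Codex auth synchronization"
    AUXILIARY = "auxiliary fallback failure"
    RESIDENCY = "missing residency header"
    UNCLASSIFIED = "unclassified 401"


def classify_401(
    message: str,
    *,
    auxiliary_call: bool = False,      # e.g. the error came from preflight compression
    hermes_store_empty: bool = False,  # Hermes auth store empty, Codex CLI file present
) -> Auth401:
    """Map a 401 message and its context to one of the classes named above."""
    text = message.lower()
    if "not authorized in this region" in text:
        return Auth401.RESIDENCY
    if "missing bearer" in text or "not logged in" in text:
        return Auth401.MISSING_PROFILE_AUTH
    if hermes_store_empty:
        return Auth401.STALE_SYNC
    if auxiliary_call:
        return Auth401.AUXILIARY
    # Do not guess: surface the raw message for a human or a later rule.
    return Auth401.UNCLASSIFIED
```

Keeping an explicit UNCLASSIFIED value is a deliberate choice here: an unknown 401 is reported as unknown rather than being folded into "not logged in".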

Failure mode 3: timeouts should not erase completed work

Issue #13834 shows a practical transport problem: on the same macOS machine and same network, the official Codex CLI worked, while Hermes openai-codex repeatedly hit APIConnectionError and APITimeoutError, ending with API failed after 3 retries — Connection error.

In long-running worker tasks, this kind of failure often happens late. Raw evidence may already exist. JSON logs may already be saved. The missing piece may only be the final Markdown, HTML, or kanban_complete call. Blindly rerunning all collection work is the wrong default.

A safer recovery path is:

Retry once or twice with exponential backoff.
If the call still fails, inspect acceptance artifacts: raw evidence, logs, final MD/HTML, and project state files.
If raw evidence exists but final outputs are missing, resume from the evidence and generate only the missing artifacts.
If evidence is insufficient, block with an exact missing-file list and the next command to run.
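The recovery path above can be sketched as two small helpers: a bounded retry with exponential backoff, and a planner that inspects files on disk. All names, the "block" / "resume" / "validate" labels, and the default exception types are hypothetical. In particular, SDK-specific errors such as APITimeoutError are not assumed to be subclasses of the built-in exceptions, so a caller would pass its own types in `retry_on`.

```python
import time
from pathlib import Path
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retries: int = 2,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call once, then retry up to `retries` times, doubling the delay each time."""
    for attempt in range(retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == retries:
                raise  # Out of retries: let the watchdog inspect artifacts.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def _nonempty(path: Path) -> bool:
    return path.is_file() and path.stat().st_size > 0


def plan_after_failure(
    evidence: Sequence[Path], finals: Sequence[Path]
) -> tuple[str, list[Path]]:
    """Decide the next step from what is on disk, not from the exit code."""
    missing_evidence = [p for p in evidence if not _nonempty(p)]
    if missing_evidence:
        return "block", missing_evidence   # Report the exact missing files.
    missing_finals = [p for p in finals if not _nonempty(p)]
    if missing_finals:
        return "resume", missing_finals    # Generate only these outputs.
    return "validate", []                  # Everything exists: run acceptance.
```

The planner returns the list of missing paths alongside the decision so that a block message can name them exactly.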

Failure mode 4: protocol violation may mean “finished but did not close”

Hermes Kanban workers are expected to end by calling kanban_complete or kanban_block. The current hermes_cli/kanban_db.py code treats a worker that exits with rc=0 while its task remains running as a protocol violation: the worker may have answered conversationally instead of making the terminal Kanban tool call. The regression test test_detect_crashed_workers_protocol_violation_auto_blocks pins that behavior.

Public PR #24388 improves evidence preservation for this case. It keeps the final worker log summary when a Kanban worker exits rc=0 without a terminal Kanban tool call, and attaches log path, tail, and summary evidence to protocol_violation and gave_up events.

That is a good step, but the next step is recovery. A protocol violation should first trigger artifact inspection. If the expected deliverables are present and locally validated, the watchdog can complete the task. If they are not present, then it should block with a precise recovery note.

The t_52cac9ce incident

The recent t_52cac9ce task falls into this pattern. The worker had already collected raw evidence. A later Codex API timeout caused the process to exit without calling kanban_complete or kanban_block, leaving the task as a protocol violation. From the protocol layer, that is invalid. From the artifact layer, the task did not fail from scratch: the collected evidence was still on disk and only the final steps were missing.

The correct sequence is:

After a worker crash or protocol violation, inspect acceptance artifacts before deciding the task state.
If raw evidence exists but final Markdown or HTML is missing, generate the missing artifacts from the evidence instead of rerunning collection.
Run local acceptance: non-empty MD/HTML; required subjects and dimensions covered; raw evidence path exists; PROJECT_STATE.md and PROJECT_STATE.html updated.
Call kanban_complete only after acceptance passes.
Then immediately dispatch any downstream tasks that are now ready.
If evidence is insufficient, call kanban_block and state exactly what is missing and which command should be run next.
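The local acceptance step in this sequence can be sketched as a function that returns a list of concrete problems. This is illustrative only: the `Acceptance` type, the parameter names, the substring test for "required subjects and dimensions", and the use of file modification time to decide whether a state file was "updated" are all assumptions, not Hermes behavior.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Acceptance:
    problems: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.problems


def run_local_acceptance(
    final_docs: list[Path],      # e.g. the final MD and HTML outputs
    required_terms: list[str],   # subjects and dimensions that must be covered
    evidence_path: Path,         # where raw evidence was saved
    state_files: list[Path],     # PROJECT_STATE.md and PROJECT_STATE.html
    task_started_at: float,      # POSIX timestamp of the task start
) -> Acceptance:
    result = Acceptance()
    for doc in final_docs:
        if not doc.is_file() or doc.stat().st_size == 0:
            result.problems.append(f"missing or empty: {doc}")
            continue
        text = doc.read_text(encoding="utf-8", errors="replace").lower()
        for term in required_terms:
            if term.lower() not in text:
                result.problems.append(f"{doc.name} does not cover: {term}")
    if not evidence_path.exists():
        result.problems.append(f"raw evidence path not found: {evidence_path}")
    for state in state_files:
        # Assumption: "updated" means modified at or after the task start.
        if not state.is_file() or state.stat().st_mtime < task_started_at:
            result.problems.append(f"state file not updated: {state}")
    return result
```

If `passed` is true, the watchdog can complete the task; otherwise the `problems` list is already the exact "what is missing" text that a block note needs.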

Environment health is part of task health

Not all worker failures are model failures. Issue #23725 reports that Hermes 0.13.0 referenced a built-in kanban-worker skill that existed in the source tree but was not auto-installed on fresh install or update. Workers then failed with Unknown skill(s): kanban-worker, and tasks auto-blocked after repeated failures. PR #23884 addresses this by auto-installing bundled skills during hermes kanban init.

That case belongs in the same recovery model: before declaring a task failed, check worker prerequisites such as skills, profile HOME, Codex auth, and required files.

Recommended watchdog rules

401 Unauthorized: run Codex profile auth repair first; inspect profile .codex/auth.json, main-user Codex CLI auth, Hermes auth storage, credential pool state, and residency headers.
Timeout or connection error: retry 1–2 times with exponential backoff; if intermediate artifacts exist, resume rather than restarting collection.
Protocol violation with rc=0: do not give up immediately; check whether artifacts are complete but the worker skipped kanban_complete.
Blocked task with completed artifacts: run local validation; if it passes, complete the task.
No ready tasks because a parent is blocked: inspect whether the parent is recoverable before reporting the queue as blocked.
All recovery actions: write them to state/PROJECT_STATE.md and state/PROJECT_STATE.html.

Conclusion

Hermes Agent’s strength is that it combines models, tools, skills, memory, profiles, Kanban, and messaging gateways into a long-running agent system. Such a system should not define task state by a single Codex API call. It should separate runtime failure from task failure: inspect durable artifacts, recover when possible, complete when evidence is sufficient, and block only when the missing work is explicit.

The truth of a long-running task is usually on disk, not only in the last API response.
