Why Hermes Agent keeps stopping for preflight compression
An explanation of Hermes Agent's preflight compression in long gateway sessions: why it triggers, why it can time out, how to fix it, and what state is affected.
If you use Hermes Agent through Feishu, Telegram, or another gateway for a long-running project, you may eventually see a line like this:
📦 Preflight compression: ~165,575 tokens >= 136,000 threshold. This may take a moment.
That message usually does not mean the model is broken. It means the active chat history has grown large enough that Hermes does not want to send the whole transcript to the model as-is. Before it handles your new message, it tries to compact older turns so the request stays inside a safe context budget.
The 136,000 number is a useful clue. Hermes has a gateway-level safety pass that triggers at 85% of the model context window. A threshold of 136,000 lines up with a 160,000-token context window. In other words, the session has grown close to the point where another full turn could become unsafe.
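As a quick sanity check, the arithmetic matches; the values below come straight from the log line above, with Python used only as a calculator:

    context_window = 160_000    # model context window, in tokens
    hygiene_threshold = 0.85    # gateway-level safety pass trigger
    print(int(context_window * hygiene_threshold))  # 136000, the threshold in the message
    print(165_575 >= 136_000)                       # True, so the hygiene pass fires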
The main cause: one chat has become the project database
Hermes has two separate compression layers. The first one runs inside the agent loop and is the normal context manager. Its default threshold is 50% of the context window. The second one runs in the gateway before the agent starts processing the next message. Hermes calls this session hygiene in the code. It is a higher-threshold safety valve, set at 85%, for gateway sessions that have grown too large between turns.
The source comments are blunt about the reason. Long-lived gateway sessions can accumulate enough history that every new message rehydrates an oversized transcript, leading to repeated truncation or context failures. The hygiene pass estimates the transcript size before the agent starts. It prefers the last API-reported prompt token count when available and falls back to a rough estimate when it has to.
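In sketch form, that estimate-then-check logic reduces to something like the following. The function and field names here are illustrative, not the actual Hermes internals:

    # Illustrative sketch of the gateway hygiene check; names are hypothetical.
    def needs_preflight_compression(session, context_window: int) -> bool:
        if session.last_prompt_tokens is not None:
            # Prefer the prompt token count the API reported on the last turn.
            estimate = session.last_prompt_tokens
        else:
            # Fallback: a rough characters-per-token heuristic over the transcript.
            estimate = sum(len(m.text) for m in session.messages) // 4
        return estimate >= int(0.85 * context_window)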
So yes, the pattern you are seeing is mostly caused by too much conversation history. The longer the project runs in the same chat, and the more files, logs, tool calls, screenshots, searches, and retries it contains, the more often this will happen.
Why compression itself can take so long
Hermes is not just deleting old turns. The default ContextCompressor sends the middle part of the conversation to an auxiliary model and asks it to write a structured handoff summary. The summary is supposed to preserve the goal, constraints, progress, decisions, files, open work, and critical errors. Recent messages stay intact. A few early messages are also protected. The middle gets replaced by the summary.
That is a real model call. If the middle of the transcript is huge, the compression call can be slow. If it contains long tool outputs, code, logs, web pages, images, or failed retries, it gets worse. Hermes' own developer documentation also warns that the summary model needs a context window at least as large as the main model. If the auxiliary compression model is too small, the summary call can fail. When that happens, the compressor may fall back to a marker instead of a useful summary, and the active context loses the middle of the session.
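Put together, the strategy looks roughly like the sketch below. It is a simplified reconstruction of the behavior described above, not the real ContextCompressor code:

    # Simplified sketch of summarize-the-middle compression; not the real implementation.
    def compress(messages, summarize, keep_head=3, keep_tail=10):
        head = messages[:keep_head]            # a few early messages stay protected
        tail = messages[-keep_tail:]           # recent messages stay intact
        middle = messages[keep_head:-keep_tail]
        if not middle:
            return messages                    # too short to compact
        try:
            # A real model call: the summary should preserve the goal, constraints,
            # progress, decisions, files, open work, and critical errors.
            summary = summarize(middle)
        except Exception:
            # If the summary call fails, only a marker replaces the middle,
            # and the active context loses that part of the session.
            summary = "[earlier context compressed; summary unavailable]"
        return head + [summary] + tail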
The public issue tracker shows that this is an active reliability area, not just a local annoyance. There are reports and fixes around compression being interrupted by gateway messages, preflight compression pass budgets, compression timeout defaults, Codex OAuth failures during preflight compression, custom provider context length handling, and smart model routing using the wrong threshold. One recent PR raised the default auxiliary compression timeout from 120 seconds to 300 seconds because large compression passes could time out too easily. Another protects compression from gateway-session interrupts.
How to fix it in practice
Raising the threshold is not the fix. The real fix is to stop using one chat as an endless working memory.
Start new sessions at natural project boundaries
When a phase is done, start a new session with /new or /reset. Research done? New session. Implementation done? New session. Release handoff done? New session. The old transcript is still stored and searchable, but the active model window no longer has to carry several days or weeks of history into every message.
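In practice this is a single command in the gateway chat:

    /new     # start a fresh session at the phase boundary
    /reset   # alternative command; either way, the old transcript stays stored and searchable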
Use a reliable auxiliary compression model
Check auxiliary.compression in config.yaml. The compression model should have a context window at least as large as the main model, should be stable, and should have a timeout long enough for large summaries. If your local config still has auxiliary.compression.timeout set to 120 seconds, that matches a known failure mode for large compression jobs. In many setups, 300 or 600 seconds is more realistic.
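As a concrete illustration, a config.yaml excerpt might look like the following. The exact key layout is an assumption based on the settings named in this article, so check it against your own config and the current docs:

    auxiliary:
      compression:
        provider: your-provider           # a stable provider for summary calls
        model: your-large-context-model   # context window must be >= the main model's
        timeout: 300                      # seconds; the old 120 default timed out on large passes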
Set context_length correctly
Custom providers and OpenAI-compatible proxies need accurate context_length settings. If the value is too small, Hermes compresses too often. If it is too large, Hermes may wait until the provider starts rejecting requests. There are open and closed discussions in the Hermes repository about context_length handling for custom providers and auxiliary compression, so this is worth checking rather than guessing.
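For a custom or OpenAI-compatible provider, that means the declared window has to match what the backend actually accepts. The entry below is illustrative, and the provider block structure may differ in your version:

    providers:
      my-proxy:                     # hypothetical custom provider entry
        base_url: https://llm.example.internal/v1
        context_length: 160000      # too small: compresses too often; too large: provider rejects requests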
Move durable state out of chat
Files, plans, issues, PR descriptions, and project notes are better places for durable project state. The chat should carry the current step, not the whole project archive. For large work, split tasks into subagents, cron jobs, kanban tasks, or fresh sessions. Each worker should carry the context it needs, not the entire history of the parent chat.
Keep Hermes updated
Compression has had multiple reliability fixes. The public repository includes fixes or reports involving preflight compression budgets, compression interruptions, compression timeout defaults, Telegram topic bindings after session splits, OAuth failures, and context-window detection. If you run very long tasks, updating Hermes is part of the fix.
Do not simply disable compression
Turning compression off may make the next message feel faster, but it removes a safety mechanism. After that, the session can hit the model's context limit directly. The failure mode then becomes worse: provider errors, degraded answers, lost tool calls, or the agent forgetting important recent work. Compression is the guardrail. The problem is that your session is large enough, and your compression path slow enough, that the guardrail has become visible.
What changes after you fix it
Persistent memory is not erased by normal context compression. USER profile, MEMORY, skills, files on disk, code repositories, cron jobs, and kanban boards are separate systems. Compression changes the active message list that gets sent to the model, not your whole Hermes installation.
The tradeoff is in the active conversation. Older middle turns are replaced by a summary. Recent messages remain. The first few messages remain. Fine-grained details that were never written to a file or memory may not survive word for word. Hermes also rotates or creates a new session id when compression rewrites the transcript, while the old transcript remains stored and searchable.
Prompt caching is affected too. Hermes' developer documentation says compression invalidates the cache for the compressed region, although the system prompt cache can survive. The rolling cache is rebuilt over the next one or two turns. That means the first couple of turns after compression can be slower or more expensive than usual.
The practical operating rule
Treat this as context management, not just a bug. Long-term preferences belong in memory. Repeatable procedures belong in skills. Project state belongs in files, issues, plans, or PRs. The chat should carry the current working set.
For the situation described here, I would do three things first: update Hermes, verify auxiliary.compression model/provider/context_length/timeout, and start a fresh session for the next phase of the project. That should reduce the repeated preflight compression stalls without deleting memory or project files.
Sources
- Hermes Agent developer docs: Context Compression and Caching
- Hermes Agent source: agent/context_compressor.py
- Hermes Agent source: gateway/run.py session hygiene logic
- PR #23752: raise default compression timeout
- Issue #23975: compression interrupted by gateway messages
- PR #24001: protect context compression from session interrupts
- PR #22871: honor preflight compression pass budget
- Issue #23670: preflight compression and Codex OAuth 401
- Issue #7798: smart model routing and preflight compression threshold
- Issue #14690: context auto-compression threshold bug
- Issue #19539: custom provider context_length and auxiliary compression