fix(mcp): stability fix pack — reload timeout, shutdown cleanup, event loop handler, OAuth non-blocking #4757

Merged
teknium1 merged 1 commit into main from hermes/hermes-10baf9e9 on Apr 3, 2026

teknium1 commented Apr 3, 2026

Summary

Four fixes for MCP server stability issues reported by community member the77helios (terminal lockup with Obsidian MCP server, zombie processes accumulating, escape sequence pollution, startup hang).

What this PR does

Fix 1: MCP reload timeout guard (cli.py)
_check_config_mcp_changes now runs _reload_mcp in a separate daemon thread with a 30s hard timeout. Previously, a hung MCP server could block the process_loop thread indefinitely, freezing the entire TUI: the user could type but nothing happened, and only Ctrl+D/Ctrl+\ still worked.
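
A minimal sketch of the timeout-guard pattern described above, not the code in cli.py. The names `run_with_timeout` and `fake_hung_reload` are hypothetical and only illustrate the idea: the potentially blocking call runs in a daemon thread, and the caller waits on an event for a bounded time instead of joining indefinitely.

```python
import threading
import time


def run_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread.

    Returns True if fn finished (or raised) within timeout_s seconds,
    False if it is still running when the timeout expires.
    """
    done = threading.Event()

    def _worker():
        try:
            fn()
        finally:
            done.set()  # signal completion even if fn raised

    # daemon=True so a permanently hung worker cannot keep the process alive
    thread = threading.Thread(target=_worker, name="mcp-reload", daemon=True)
    thread.start()
    return done.wait(timeout_s)


def fake_hung_reload():
    """Hypothetical stand-in for a reload that blocks on an unresponsive server."""
    time.sleep(2)
```

With this shape, the calling loop can log a warning and carry on when `run_with_timeout(...)` returns False. The hung worker is abandoned rather than cancelled, which is the usual trade-off of guarding a blocking call with a thread.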

Fix 2: MCP stdio subprocess PID tracking (mcp_tool.py)
Tracks child PIDs spawned by stdio_client via before/after snapshots of /proc children. On shutdown, _stop_mcp_loop force-kills any tracked PIDs that survived the SDK's graceful SIGTERM→SIGKILL cleanup. Prevents zombie MCP server processes from accumulating across sessions.
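
The snapshot-and-kill idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in mcp_tool.py: `snapshot_children` and `force_kill` are hypothetical names, and the sketch finds children by scanning `/proc/<pid>/stat` for a matching parent PID, which may differ from how the PR reads `/proc`. It is Linux-only; where `/proc` is absent the snapshot is simply empty.

```python
import os
import signal


def snapshot_children(parent_pid):
    """Return the set of PIDs whose parent is parent_pid (Linux /proc only)."""
    children = set()
    try:
        entries = os.listdir("/proc")
    except OSError:
        return children  # no /proc available: tracking is a no-op
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat", "r", errors="replace") as fh:
                stat = fh.read()
        except OSError:
            continue  # process exited between listdir() and open()
        # comm is parenthesised and may contain spaces, so split after the last ')'
        fields = stat.rpartition(")")[2].split()
        # fields[0] is the state, fields[1] is the parent PID
        if len(fields) > 1 and fields[1] == str(parent_pid):
            children.add(int(entry))
    return children


def force_kill(pids):
    """Send SIGKILL to each PID; return the PIDs that were actually signalled."""
    killed = []
    for pid in sorted(pids):
        try:
            os.kill(pid, signal.SIGKILL)
        except (ProcessLookupError, PermissionError):
            continue  # already gone, or not ours to kill
        killed.append(pid)
    return killed
```

Taking one snapshot before spawning the server and one after, then keeping `after - before`, gives the set of PIDs to pass to `force_kill` at shutdown if they outlive the SDK's own cleanup.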

Fix 3: MCP event loop exception handler (mcp_tool.py)
Installs _mcp_loop_exception_handler on the MCP background event loop — same pattern as the existing _suppress_closed_loop_errors on prompt_toolkit's loop. Suppresses benign 'Event loop is closed' RuntimeError from httpx transport __del__ during MCP shutdown.
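
For readers unfamiliar with the pattern, a handler of this kind might look like the sketch below. The function name `suppress_closed_loop_handler` and the exact matching rule are assumptions for illustration; only `loop.set_exception_handler`, `loop.call_exception_handler`, and `loop.default_exception_handler` are standard asyncio APIs.

```python
import asyncio


def suppress_closed_loop_handler(loop, context):
    """Drop the benign 'Event loop is closed' RuntimeError; forward everything else."""
    exc = context.get("exception")
    if isinstance(exc, RuntimeError) and "Event loop is closed" in str(exc):
        return  # shutdown noise from late finalizers: ignore
    # Any other error keeps asyncio's normal logging behaviour
    loop.default_exception_handler(context)


def install(loop):
    """Attach the handler to the given (background) event loop."""
    loop.set_exception_handler(suppress_closed_loop_handler)
```

Forwarding to `default_exception_handler` matters here: the handler is meant to silence one known-harmless message, not to hide real failures on the loop.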

Fix 4: MCP OAuth non-blocking (mcp_oauth.py + mcp_tool.py)
Replaces the blocking input() call in _wait_for_callback with a raised OAuthNonInteractiveError. Adds _is_interactive() TTY detection. In non-interactive environments, build_oauth_auth() still returns a provider (cached tokens and token refresh keep working), but the callback handler raises immediately instead of blocking the MCP event loop for 120s. OAuth setup failures are re-raised in _run_http so failed servers are reported cleanly without blocking others.
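
The fail-fast behaviour can be illustrated with a small sketch. It is written under assumptions and is not the code in mcp_oauth.py: the exception name matches the PR, but its base class, the `is_interactive` and `fail_fast_if_headless` helpers, and the error message are hypothetical.

```python
import sys


class OAuthNonInteractiveError(RuntimeError):
    """Raised when an OAuth browser flow is needed but no TTY is attached."""


def is_interactive(stream=None):
    """Return True if the given stream (default: sys.stdin) is a terminal."""
    stream = sys.stdin if stream is None else stream
    try:
        return bool(stream.isatty())
    except (AttributeError, ValueError):
        return False  # stdin missing or already closed: treat as headless


def fail_fast_if_headless(stream=None):
    """Raise immediately instead of prompting when nobody can answer the prompt."""
    if not is_interactive(stream):
        raise OAuthNonInteractiveError(
            "OAuth authorization requires an interactive terminal"
        )
```

Raising a dedicated exception type lets the caller report that one server as failed and continue with the others, instead of leaving the event loop parked on a prompt.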

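Attribution

Fix 3 is salvaged from PR #2538 (acsezen). Fix 4 is salvaged from PRs #4521 (voidborne-d) and #4465 (heathley).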

Test plan

  • 33 new/updated tests across test_mcp_stability.py and test_mcp_oauth.py — all passing
  • Tests cover: exception handler suppression + forwarding, PID tracking lifecycle, dead PID cleanup, interactive/non-interactive detection, callback timeout raises instead of input(), cached token warning behavior

Issues

Closes #2537, closes #4462
Related: #4128, #3436

…t loop handler, OAuth non-blocking

Four fixes for MCP server stability issues reported by community member
(terminal lockup, zombie processes, escape sequence pollution, startup hang):

1. MCP reload timeout guard (cli.py): _check_config_mcp_changes now runs
   _reload_mcp in a separate daemon thread with a 30s hard timeout. Previously,
   a hung MCP server could block the process_loop thread indefinitely, freezing
   the entire TUI (user can type but nothing happens, only Ctrl+D/Ctrl+\ work).

2. MCP stdio subprocess PID tracking (mcp_tool.py): Tracks child PIDs spawned
   by stdio_client via before/after snapshots of /proc children. On shutdown,
   _stop_mcp_loop force-kills any tracked PIDs that survived the SDK's graceful
   SIGTERM→SIGKILL cleanup. Prevents zombie MCP server processes from
   accumulating across sessions.

3. MCP event loop exception handler (mcp_tool.py): Installs
   _mcp_loop_exception_handler on the MCP background event loop — same pattern
   as the existing _suppress_closed_loop_errors on prompt_toolkit's loop.
   Suppresses benign 'Event loop is closed' RuntimeError from httpx transport
   __del__ during MCP shutdown. Salvaged from PR #2538 (acsezen).

4. MCP OAuth non-blocking (mcp_oauth.py): Replaces blocking input() call in
   _wait_for_callback with OAuthNonInteractiveError raise. Adds _is_interactive()
   TTY detection. In non-interactive environments, build_oauth_auth() still
   returns a provider (cached tokens + refresh work), but the callback handler
   raises immediately instead of blocking the MCP event loop for 120s. Re-raises
   OAuth setup failures in _run_http so failed servers are reported cleanly
   without blocking others. Salvaged from PRs #4521 (voidborne-d) and #4465
   (heathley).

Closes #2537, closes #4462
Related: #4128, #3436
teknium1 merged commit cc54818 into main on Apr 3, 2026
3 of 4 checks passed
jooray added a commit to jooray/hermes-agent that referenced this pull request Apr 3, 2026
* upstream/main: (38 commits)
  fix(memory): Fix ByteRover plugin - run brv query synchronously before LLM call
  chore: release v0.7.0 (2026.4.3) (NousResearch#4812)
  fix: route memory provider tools in sequential execution path (NousResearch#4803)
  fix: persist API server sessions to shared SessionDB (state.db) (NousResearch#4802)
  fix(discord): register /approve and /deny slash commands, wire up button-based approval UI (NousResearch#4800)
  fix: respect per-platform disabled skills in Telegram menu and gateway dispatch (NousResearch#4799)
  fix(gateway): route /approve and /deny through running-agent guard (NousResearch#4798)
  docs: add community FAQ entries — multi-model workflows, WhatsApp binding, verbose control, skills config, thread sessions, migration, install troubleshooting (NousResearch#4797)
  fix: handle None mcp_servers in _get_platform_tools()
  fix(mcp): stability fix pack — reload timeout, shutdown cleanup, event loop handler, OAuth non-blocking (NousResearch#4757)
  fix: prevent compression death spiral from API disconnects (NousResearch#2153) (NousResearch#4750)
  fix: handle Anthropic Sonnet long-context tier 429 by reducing to 200k (NousResearch#4747)
  fix: correct qwen3.6-plus model slug
  fix: handle Anthropic long-context tier 429 by reducing to 200k
  docs(acp): fix zed config
  fix: use get_hermes_home(), consolidate git_cmd, update tests
  Add fork detection and upstream sync to hermes update
  fix(update): handle conflicted git index during hermes update (NousResearch#4735)
  fix: remove redundant restart message from update launchd path
  fix(update): avoid launchd restart race on macOS
  ...