
fix(gateway): retry transient send failures and notify user on exhaustion #3288

Merged
teknium1 merged 1 commit into main from hermes/hermes-64c3ceb2 on Mar 27, 2026

@teknium1
Contributor

Summary

Salvage of PR #3108 by @Mibayy (authorship preserved). Fixes #2910.

When send() failed due to a network error (ConnectError, ReadTimeout, etc.), the failure was only logged and the user received no feedback, so the gateway appeared to hang. In one reported case, a user waited over an hour for a response that had already been generated but was never delivered.

Changes

Adds _send_with_retry() to BasePlatformAdapter:

| Error type | Behavior |
| --- | --- |
| Success | Returns immediately, no overhead |
| Transient (network) | Retries up to 2x with exponential backoff + jitter. On exhaustion, sends user a delivery-failure notice. |
| Permanent (formatting, permission) | Falls back to plain-text version once, no retry loop. |
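
For readers without the diff open, the three behaviors above can be sketched roughly as follows. This is not the code from this PR: it is synchronous for brevity, and `send_with_retry`, `backoff_delay`, `to_plain_text`, the delay constants, and the notice text are all assumptions made for illustration. Only `SendResult.retryable` and the "up to 2 retries" limit come from the description.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SendResult:
    success: bool
    error: Optional[str] = None
    retryable: bool = False  # field named in this PR; the other fields are assumed


def backoff_delay(attempt: int, base: float = 1.0, jitter: float = 0.5) -> float:
    """Delay before retry `attempt` (0-based): base * 2**attempt plus random jitter."""
    return base * (2 ** attempt) + random.uniform(0, jitter)


def to_plain_text(text: str) -> str:
    """Hypothetical fallback formatter: drop a few Markdown control characters."""
    return text.replace("*", "").replace("_", "").replace("`", "")


def send_with_retry(
    send: Callable[[str], SendResult],
    text: str,
    max_retries: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> SendResult:
    result = send(text)
    if result.success:
        return result  # success: return immediately

    retries = 0
    while result.retryable and retries < max_retries:
        sleep(backoff_delay(retries))
        retries += 1
        result = send(text)
        if result.success:
            return result

    if result.retryable:
        # Retries exhausted and the last error was still transient: notify the user.
        send("Message delivery failed after several attempts. Please try again.")
        return result

    # Permanent error (or one that became permanent mid-retry): one plain-text attempt.
    return send(to_plain_text(text))
```

Passing `sleep` as a parameter is only a convenience so the backoff can be skipped in tests; production code would more likely await an async sleep.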

Also adds:

  • SendResult.retryable field for platform-specific transient error flagging
  • _RETRYABLE_ERROR_PATTERNS constant for string-based transient detection
  • _is_retryable_error() static method
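
As a rough illustration of how these three pieces could fit together, the sketch below combines a string-pattern check with the explicit `retryable` flag. The pattern list is hypothetical (only ConnectError and ReadTimeout are named in this PR), and `should_retry` is an illustrative helper, not a function from the codebase.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns; the actual contents of _RETRYABLE_ERROR_PATTERNS are not shown here.
RETRYABLE_ERROR_PATTERNS = ("connecterror", "readtimeout", "connection reset", "timed out")


@dataclass
class SendResult:
    success: bool
    error: Optional[str] = None
    retryable: bool = False  # lets an adapter flag platform-specific transient errors


def is_retryable_error(error: Optional[str]) -> bool:
    """Case-insensitive substring match against the known transient patterns."""
    if not error:
        return False
    lowered = error.lower()
    return any(pattern in lowered for pattern in RETRYABLE_ERROR_PATTERNS)


def should_retry(result: SendResult) -> bool:
    """Retry when the adapter flagged the error or its message looks transient."""
    if result.success:
        return False
    return result.retryable or is_retryable_error(result.error)
```

With this shape, an adapter can mark an error such as a platform rate limit as transient via `retryable=True` even when its message matches none of the generic patterns.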

All adapters benefit automatically via BasePlatformAdapter inheritance — no per-adapter changes needed.

Follow-up improvements over original PR

  • Removed unused event parameter from _send_with_retry signature
  • Hoisted import random to module-level instead of per-call import
  • Fixed for/else logic bug: original code sent a misleading delivery-failure notice when error transitioned from network to non-network mid-retry. Now correctly falls through to the plain-text fallback instead.
  • Cleaned up test imports (removed unused MagicMock, dataclass, field)
  • Added test for the network→non-network transition path
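
The for/else item is easier to follow with a toy example. The original buggy loop is not reproduced here; the sketch below only shows how Python's `for`/`else` can separate "every retry failed with a network error" from "the error stopped being a network error", using a hypothetical `retry_outcome` helper and string labels in place of real send results.

```python
from typing import Iterable


def retry_outcome(attempt_results: Iterable[str]) -> str:
    """Classify a retry sequence; each item is 'ok', 'network', or 'other'."""
    for result in attempt_results:
        if result == "ok":
            return "delivered"
        if result != "network":
            break  # error is no longer transient: stop retrying
    else:
        # Runs only when the loop finished without `break`,
        # i.e. every retry failed with a network error.
        return "send_failure_notice"
    # Reached only via `break`: handle as a permanent error.
    return "plain_text_fallback"
```

The `else` branch belongs to the loop, not to the `if`, so it is skipped whenever `break` fires. That is the property that keeps the delivery-failure notice from being sent on the network→non-network path.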

Tests

27 tests in tests/gateway/test_send_retry.py. The full suite has 6294 passing tests; the only failure is a pre-existing anthropic 429 flake.

…tion

When send() fails due to a network error (ConnectError, ReadTimeout, etc.),
the failure was silently logged and the user received no feedback — appearing
as a hang. In one reported case, a user waited 1+ hour for a response that
had already been generated but failed to deliver (#2910).

Adds _send_with_retry() to BasePlatformAdapter:
- Transient errors: retry up to 2x with exponential backoff + jitter
- On exhaustion: send delivery-failure notice so user knows to retry
- Permanent errors: fall back to plain-text version (preserves existing behavior)
- SendResult.retryable flag for platform-specific transient errors

All adapters benefit automatically via BasePlatformAdapter inheritance.

Cherry-picked from PR #3108 by Mibayy.
@github-actions

⚠️ Supply Chain Risk Detected

This PR contains patterns commonly associated with supply chain attacks. This does not mean the PR is malicious — but these patterns require careful human review before merging.

⚠️ WARNING: Install hook files modified

These files can execute code during package installation or interpreter startup.

Files:

hermes_cli/setup.py

Automated scan triggered by supply-chain-audit. If this is a false positive, a maintainer can approve after manual review.

@teknium1 teknium1 merged commit bde45f5 into main Mar 27, 2026
1 of 2 checks passed
StreamOfRon pushed a commit to StreamOfRon/hermes-agent that referenced this pull request Mar 29, 2026
…tion (NousResearch#3288)

When send() fails due to a network error (ConnectError, ReadTimeout, etc.),
the failure was silently logged and the user received no feedback — appearing
as a hang. In one reported case, a user waited 1+ hour for a response that
had already been generated but failed to deliver (NousResearch#2910).

Adds _send_with_retry() to BasePlatformAdapter:
- Transient errors: retry up to 2x with exponential backoff + jitter
- On exhaustion: send delivery-failure notice so user knows to retry
- Permanent errors: fall back to plain-text version (preserves existing behavior)
- SendResult.retryable flag for platform-specific transient errors

All adapters benefit automatically via BasePlatformAdapter inheritance.

Cherry-picked from PR NousResearch#3108 by Mibayy.

Co-authored-by: Mibayy <mibayy@users.noreply.github.com>
dlkakbs added a commit to dlkakbs/hermes-agent that referenced this pull request Mar 30, 2026
When sendMessage times out, the Bot API may have already delivered the
message even though the HTTP client got no response.  PR NousResearch#3288 added
_send_with_retry() which retried on any transient error (including
timeouts), stacking on top of TelegramAdapter.send()'s existing 3-
attempt internal loop — risking 2–3 duplicate messages per response.

- Add SendResult.delivery_uncertain flag; when True, _send_with_retry()
  returns immediately without retrying or falling back to plain text.
- Add TelegramAdapter._looks_like_send_timeout() to detect TimedOut /
  ReadTimeout / WriteTimeout exceptions (with and without the
  python-telegram-bot import).
- Set delivery_uncertain=True in send()'s final except clause when the
  exhausted error is a send timeout.

Fixes NousResearch#3906.
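
A minimal sketch of the idea in the commit above, under stated assumptions: the name-based timeout check and the `next_action` helper are illustrative guesses at one way to do this, not the code from that commit. Only `SendResult.delivery_uncertain` and the TimedOut / ReadTimeout / WriteTimeout names come from the message.

```python
from dataclasses import dataclass
from typing import Optional

# Exception class names listed in the commit message.
SEND_TIMEOUT_NAMES = {"TimedOut", "ReadTimeout", "WriteTimeout"}


@dataclass
class SendResult:
    success: bool
    error: Optional[str] = None
    retryable: bool = False
    delivery_uncertain: bool = False  # the message may already have been delivered


def looks_like_send_timeout(exc: BaseException) -> bool:
    """Match by class name across the MRO, so no library import is required (assumed approach)."""
    return any(cls.__name__ in SEND_TIMEOUT_NAMES for cls in type(exc).__mro__)


def next_action(result: SendResult) -> str:
    """Decide what a retry wrapper should do with a send result."""
    if result.success:
        return "done"
    if result.delivery_uncertain:
        return "stop"  # neither retry nor fall back: either could duplicate the message
    if result.retryable:
        return "retry"
    return "plain_text_fallback"
```

The `delivery_uncertain` check has to come before the `retryable` check, because a timeout would otherwise count as a transient error and be retried.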

Development

Successfully merging this pull request may close these issues.

Telegram message delivery failure not surfaced to user - appears as 'hang/crash'
