
feat(skill): add self-improve autonomous code improvement skill #2074

Merged
Yeachan-Heo merged 4 commits into Yeachan-Heo:dev from lonj7798:feat/self-improve-skill
Apr 1, 2026

lonj7798 (Contributor) commented Apr 1, 2026

Summary

  • Add a self-contained self-improve skill that autonomously improves any target codebase through tournament-based evolutionary optimization
  • Spawns parallel agent pairs (researcher → planner → executor), benchmarks experiments in isolated git worktrees, merges only the best-performing change per iteration
  • Leverages 4 existing OMC agents (planner, architect, critic, executor) with 4 custom reference docs for specialized roles (researcher, benchmark-builder, goal-clarifier, tournament logic)
  • Skill-only invocation via /oh-my-claudecode:self-improve (no keyword trigger — "self improve" is too common in English)

What's included

New skill directory (skills/self-improve/):

  • SKILL.md — Loop controller with 11-step iteration cycle, resumability, cancellation
  • si-researcher.md, si-benchmark-builder.md, si-goal-clarifier.md — Custom agent prompts
  • data_contracts.md — 12 JSON schemas for inter-agent communication
  • scripts/validate.sh — Sealed file + plan/result schema validation
  • scripts/plot_progress.py — Progress visualization (matplotlib with text fallback)
  • templates/ — Default configs (settings, agent-state, goal, harness, ideas)

Integration points (4 files):

  • src/tools/state-tools.ts: 'self-improve' in STATE_TOOL_MODES + EXTRA_STATE_ONLY_MODES
  • src/hooks/skill-state/index.ts: 'self-improve': 'heavy' (10 reinforcements, 30min TTL)
  • skills/cancel/SKILL.md — Position 11 in cancellation dependency order
  • CLAUDE.md — Added to workflow skill catalog

Test update: src/__tests__/skills.test.ts — Updated expected skill counts

Inspired by

lonj7798/self-improvement — evolutionary code improvement engine with tournament selection, sealed benchmarks, and institutional memory.

Test plan

  • npm test passes (skill count assertions updated)
  • /oh-my-claudecode:self-improve loads the skill
  • state_read(mode='self-improve') works
  • Stop hook reinforces with heavy protection (10x, 30min)
  • /oh-my-claudecode:cancel clears self-improve state
  • scripts/validate.sh runs without errors
  • Manual: full iteration cycle with a test repo + benchmark

🤖 Generated with Claude Code


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 80ef26fb59

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

2. **Filter** to `status: "success"` only. If zero candidates, skip to Step 9 (Record & Visualize).
3. **Rank** by `benchmark_score` (respecting `benchmark_direction`)
4. **Ranked-candidate loop** — for each candidate in rank order (best first):
a. **No-regression check**: candidate score must be >= current `best_score`

P1: Respect metric direction in regression gate

The tournament gate currently hard-codes candidate score >= best_score, which is only valid for higher_is_better. For goals configured as lower_is_better (e.g., latency/error), genuinely better candidates will be rejected before merge, so the loop can stall even when improvements exist. The comparison in this step needs to branch on benchmark_direction just like ranking does.
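A minimal sketch of a direction-aware gate, written in Python for illustration only (the skill itself expresses this step as prose in SKILL.md). The function name `passes_regression_gate` is hypothetical; the two direction values are the ones named in the comment above.

```python
def passes_regression_gate(score: float, best_score: float, direction: str) -> bool:
    """Return True when `score` is at least as good as `best_score`.

    `direction` mirrors the `benchmark_direction` setting discussed above.
    """
    if direction == "higher_is_better":
        return score >= best_score
    if direction == "lower_is_better":
        return score <= best_score
    # Fail loudly on an unknown direction instead of silently picking one.
    raise ValueError(f"unknown benchmark_direction: {direction!r}")
```

With this shape, a latency of 80 against a best of 100 passes under `lower_is_better`, while the hard-coded `>=` comparison would have rejected it.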


Comment on lines +86 to +87
base_commit=$(git -C "${GIT_DIR}" merge-base HEAD HEAD~1 2>/dev/null || echo "HEAD~1")
modified_files_str=$(git -C "${GIT_DIR}" diff --name-only "${base_commit}" 2>/dev/null || true)

P1: Compare sealed files against the correct baseline

In --worktree mode, sealed-file detection diffs against merge-base HEAD HEAD~1 (effectively HEAD~1 for normal commits), which is not the branch baseline. In a fresh experiment branch this can include unrelated changes from the parent commit and falsely report sealed-file violations, and in multi-commit experiments it can miss sealed-file edits made before the last commit. This makes sealed-file enforcement both noisy and unreliable during executor runs.
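One way to read this suggestion, sketched in Python under stated assumptions: diff the experiment against the merge-base with its improvement branch rather than against `HEAD~1`. The helper names, the `improve/demo` branch, and the idea that sealed files are given as glob patterns are all hypothetical; only `git diff --name-only A...HEAD` (which diffs `HEAD` against the merge-base of `A` and `HEAD`) is standard Git behavior.

```python
from fnmatch import fnmatch


def changed_files_command(git_dir: str, baseline_branch: str) -> list[str]:
    """Build the git argv that lists files changed since the branch point.

    `A...HEAD` compares HEAD with merge-base(A, HEAD), so every commit on the
    experiment branch is covered and parent-only changes are excluded.
    """
    return ["git", "-C", git_dir, "diff", "--name-only", f"{baseline_branch}...HEAD"]


def sealed_violations(changed: list[str], sealed_patterns: list[str]) -> list[str]:
    """Return the changed paths that match any sealed pattern (assumed globs)."""
    return [
        path
        for path in changed
        if any(fnmatch(path, pattern) for pattern in sealed_patterns)
    ]
```

The argv from `changed_files_command` could be passed to `subprocess.run(..., capture_output=True, text=True)` and its output split into lines before calling `sealed_violations`.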


Yeachan-Heo (Owner) commented

CI failed — 2 test failures in keyword-detector-script.test.ts and hook-templates.test.ts. These tests were updated in recent dev merges (#2068 changed keyword-detector to inject SKILL.md content directly instead of using Skill tool invocations).

Please rebase on dev and the tests should align:

git fetch origin dev
git rebase origin/dev


[repo owner's gaebal-gajae (clawdbot) 🦞]

Integrate an evolutionary self-improvement loop as a self-contained skill
that autonomously improves any target codebase through tournament selection.
The skill spawns parallel agent pairs (researcher → planner → executor),
benchmarks each experiment in isolated git worktrees, and merges only the
best-performing change per iteration.

Skill structure:
- SKILL.md: Loop controller with 11-step iteration cycle, resumability,
  and cancellation support
- si-researcher.md, si-benchmark-builder.md, si-goal-clarifier.md:
  Custom reference docs for roles without OMC agent equivalents
- data_contracts.md: 12 JSON schemas for inter-agent communication
- scripts/validate.sh: Sealed file + plan/result schema validation
- scripts/plot_progress.py: Progress visualization (matplotlib with text fallback)
- templates/: Default config for settings, agent state, goal, harness, ideas

Integration points (4 files):
- state-tools.ts: Register in STATE_TOOL_MODES + EXTRA_STATE_ONLY_MODES
- skill-state/index.ts: SKILL_PROTECTION 'heavy' (10 reinforcements, 30min)
- cancel/SKILL.md: Position 11 in dependency order with cleanup semantics
- CLAUDE.md: Added to workflow skill catalog

Agent mapping (evidence-based):
- 4 OMC agents leveraged: planner, architect, critic, executor
- 4 custom roles: researcher, benchmark-builder, goal-clarifier, tournament
- Skill-only invocation (/oh-my-claudecode:self-improve) — no keyword trigger

Inspired by github.com/lonj7798/self-improvement

Constraint: EXTRA_STATE_ONLY_MODES (not full ExecutionMode) to minimize integration surface
Rejected: Global agents | context bloat in every session
Rejected: Full OMC agent wrapping | impedance mismatch (explore=haiku, git-master=atomic only)
Rejected: Keyword trigger | "self improve" too common in English, accidental trigger risk
Confidence: high
Scope-risk: narrow
Not-tested: Full multi-iteration loop with real benchmark (manual testing recommended)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lonj7798 force-pushed the feat/self-improve-skill branch from 80ef26f to 86c34a7 on April 1, 2026 04:31
lonj7798 and others added 2 commits March 31, 2026 21:33
…seline

In --worktree mode, detect the improve/* branch as the correct merge-base
for sealed file comparison instead of HEAD~1. This ensures multi-commit
experiments are fully checked and avoids false positives from unrelated
parent commit changes.

Constraint: Must handle repos where improve/* branch does not yet exist
Confidence: high
Scope-risk: narrow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on check

The no-regression gate hard-coded >= which only works for higher_is_better.
For lower_is_better goals (latency, error rate), better candidates would be
rejected. Now branches on benchmark_direction like the ranking step does.

Confidence: high
Scope-risk: narrow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c13231305c


Comment on lines +89 to +90
improve_branch=$(git -C "${GIT_DIR}" branch -a --list 'improve/*' 2>/dev/null | head -1 | tr -d ' *' || true)
if [[ -z "${improve_branch}" ]]; then

P1: Derive sealed-file baseline from current experiment branch

In --worktree mode the sealed-file diff baseline is chosen from the first improve/* branch returned by Git, not from the improvement branch that the current experiment/round_* branch was created from. If a repo has multiple improvement branches, this can diff against an unrelated history and either falsely report sealed-file violations or miss real ones for this executor run. Fresh evidence: this revision now uses git branch -a --list 'improve/*' | head -1, which is still branch-agnostic.


lonj7798 (Contributor, Author) replied:

The current skill is designed to maintain a single improvement branch at a time (only the winner branch is kept), which keeps the branch as clean as possible and reduces complexity. So head -1 will return the correct branch.


# Settings path must be provided or discovered from .omc/self-improve/config/
SETTINGS=""
VALID_APPROACH_FAMILIES="architecture training_config data infrastructure optimization testing documentation other"

P2: Accept custom approach families in plan validation

The validator hard-codes VALID_APPROACH_FAMILIES and rejects anything outside that list, but the skill contracts explicitly allow custom approach families from harness.md (data_contracts.md states these are valid). As soon as a user configures a custom family, check_plan_schema will fail and block executor benchmarking for otherwise valid plans.
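A rough Python sketch of the behavior this comment asks for: extend the built-in list with whatever custom families the user has configured before rejecting a plan. How custom families are actually declared in harness.md is not shown in this thread, so they are passed in here as a plain list (an assumption), and the function names are hypothetical.

```python
# Built-in families, copied from the VALID_APPROACH_FAMILIES line quoted above.
DEFAULT_FAMILIES = {
    "architecture", "training_config", "data", "infrastructure",
    "optimization", "testing", "documentation", "other",
}


def allowed_families(custom_families=()) -> set:
    """Union of built-in families and user-configured ones (assumed input)."""
    return DEFAULT_FAMILIES | set(custom_families)


def check_approach_family(plan: dict, custom_families=()) -> list:
    """Return a list of error strings; an empty list means the plan passes."""
    family = plan.get("approach_family")
    if family not in allowed_families(custom_families):
        return [f"unknown approach_family: {family!r}"]
    return []
```

An alternative, which a later commit in this thread takes, is to drop the enum check from the validator entirely and leave taxonomy validation to the critic.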


lonj7798 (Contributor, Author) commented Apr 1, 2026

The CI failure is a flaky performance test at bridge.test.ts:193: expected 159 to be less than 100 (timing-dependent, the CI runner was slow). It is not related to this PR's changes. 409/410 test files pass, 7280/7281 tests pass.

Can you re-run CI or merge with the flaky test acknowledged?

Yeachan-Heo (Owner) commented

CI is all green now ✅ (the flaky test passed on re-run). Will review the PR shortly.


[repo owner's gaebal-gajae (clawdbot) 🦞]

Yeachan-Heo (Owner) left a comment

Review: REQUEST_CHANGES

Thanks for the ambitious PR — the self-improve concept is directionally interesting, but in its current form there are blocking concerns that need to be addressed before merge.

Blocking Issues

1. Unsandboxed arbitrary-code execution / repo trust boundary

  • SKILL.md:96-100, 213-225, si-benchmark-builder.md:47-49, 63-67
  • The feature accepts an arbitrary repo_path, creates/uses benchmark code in that repo, then runs benchmark commands autonomously
  • Git worktrees isolate Git state, not process/network/env access
  • As written, this is effectively an autonomous code-execution loop over arbitrary repos with no sandbox/trust gate

2. Autonomous push/PR behavior is too risky

  • SKILL.md:130, 243, 302
  • The loop auto-pushes winners and may create a PR upstream
  • Without explicit opt-in and remote verification, this is too dangerous for a new skill

3. Cancel/resume integration is incomplete

  • SKILL.md:147, 276, 301-312, cancel/SKILL.md:118, state-tools.ts:46-49
  • The skill promises user_stopped, preserved iteration_state, orphaned worktree cleanup, and resume safety, but the patch only adds a state-tool enum entry, heavy skill protection, and one cancel bullet
  • There is no real self-improve-specific cancel flow implementing those promises

Non-blocking Issues

4. Validator contradicts the documented contract: data_contracts.md:144-157 docs say custom approach families from harness.md are valid, but validate.sh hardcodes a fixed whitelist

5. Sealed-file baseline is nondeterministic: validate.sh:83-94 picks the first improve/* branch via head -1, which can be the wrong baseline if multiple improvement branches exist

6. Docs are under-integrated: CLAUDE.md is updated, but broader user-facing docs/skill inventories are not

What Needs to Happen

  • Add a trust/sandbox model for the execution loop
  • Make push/PR creation explicitly opt-in (not default behavior)
  • Implement real cancel/resume integration (not just documentation)
  • Fix validator contract mismatches

Looking forward to a v2!


[repo owner's gaebal-gajae (clawdbot) 🦞]

Yeachan-Heo (Owner) commented

Review Summary

Gate: ✅ Star gate passed (starred oh-my-claudecode)

CI: ✅ All checks passing (after dev fix in #2075 + flaky perf test rerun)

Code Review:

Core integration (3 files) — Clean

  • state-tools.ts: adds self-improve to STATE_TOOL_MODES + EXTRA_STATE_ONLY_MODES — follows existing pattern
  • skill-state/index.ts: self-improve: 'heavy' protection (10 reinforcements, 30min TTL) — same as deepinit, appropriate for long-running
  • skills.test.ts: 31→32 canonical, 32→33 total — correctly updated

Skill content (14 files) — Self-contained

  • SKILL.md (345 lines): well-structured loop controller with 11 steps, state tracking, agent mapping, git strategy, stop conditions
  • Supporting docs: data_contracts.md (12 JSON schemas), si-researcher.md, si-benchmark-builder.md, si-goal-clarifier.md
  • Scripts: validate.sh (sealed file + schema validation), plot_progress.py (matplotlib visualization)
  • Templates: default configs for settings, agent-state, goal, harness, ideas

Notes

  • No keyword trigger (skill-only invocation via /oh-my-claudecode:self-improve) — good call, "self improve" is too common
  • Cancel integration documented in position 11 of dependency order
  • Inspired by lonj7798/self-improvement
  • All state isolated under .omc/self-improve/

Decision needed

This is a feature addition — owner approval needed for inclusion. Code quality looks good, integration is minimal and follows patterns.


[repo owner's gaebal-gajae (clawdbot) 🦞]

…in, cancel/resume

Addresses all 3 blocking issues from maintainer review on Yeachan-Heo#2074:

1. Trust gate: Setup phase now requires explicit user confirmation of
   repo_path and benchmark_command before any autonomous execution.
   Consent persisted as trust_confirmed in agent-settings.json; skipped
   on resume. Gate enforced alongside si_setting_* flags.

2. Push/PR opt-in: auto_push and auto_pr default to false in settings.
   Tournament push (Step 8) and completion PR gated on these flags.
   When disabled, exact manual commands are logged for the user.

3. Cancel/resume: New Step 0 cleans stale worktrees idempotently on
   every iteration start and resume. Step 2 detects cancel with explicit
   cleanup flow (set user_stopped, update iteration_state, clean worktrees).
   Resumability handles user_stopped (ask), crash (auto-resume), and
   fresh start with Step 0 as a hard prerequisite.

Also fixes non-blocker: removed approach_family enum check from
validate.sh — critic owns taxonomy validation (supports custom families
from harness.md).

Constraint: All fixes are SKILL.md prompt changes, no new TypeScript
Rejected: Compiled trust infrastructure | disproportionate for skill-level feature
Confidence: high
Scope-risk: narrow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
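The trust gate and push opt-in described in this commit are prompt-level rules, not compiled code. As an illustration only, the same decisions could be sketched in Python as below. The setting names `trust_confirmed` and `auto_push` come from the commit message; the function names and the `git push origin <branch>` command form are assumptions.

```python
def may_run_autonomously(settings: dict) -> bool:
    """Autonomous execution is allowed only after explicit user consent."""
    return settings.get("trust_confirmed", False) is True


def push_plan(settings: dict, branch: str) -> dict:
    """Decide whether to push the winner or just log the manual command.

    `auto_push` defaults to False, so a missing key means "do not push".
    """
    command = f"git push origin {branch}"  # assumed remote name "origin"
    if settings.get("auto_push", False):
        return {"run": [command], "log_for_user": []}
    return {"run": [], "log_for_user": [command]}
```

Defaulting both lookups to `False` means a settings file that predates these flags behaves conservatively.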
@lonj7798 lonj7798 requested a review from Yeachan-Heo April 1, 2026 05:31

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e1ff7cd38


3. **Rank** by `benchmark_score` (respecting `benchmark_direction`)
4. **Ranked-candidate loop** — for each candidate in rank order (best first):
a. **No-regression check**: candidate score must improve or hold even vs `best_score`, respecting `benchmark_direction` (`higher_is_better`: score >= best_score; `lower_is_better`: score <= best_score)
b. **Merge** via `oh-my-claudecode:git-master`: `git merge experiment/round_{n}_executor_{id} --no-ff -m "Iteration {n}: {hypothesis} (score: {before} → {after})"`

P1: Checkout improvement branch before running merge command

This merge step uses git merge experiment/... without switching branches first, but Setup explicitly leaves the main repo on {target_branch} after creating improve/{goal_slug}. Because git merge always merges into the current branch, following these instructions literally can merge experiment commits into the protected baseline branch instead of improve/{goal_slug}. Add an explicit checkout/switch to the improvement branch immediately before this command (or use a command form that names both source and destination).
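The ordering this comment asks for can be sketched as a small Python helper that emits the two Git commands in sequence. The helper name and the example branch names are hypothetical; `git switch` and `git merge --no-ff -m` are standard Git commands.

```python
def merge_commands(improve_branch: str, experiment_branch: str, message: str) -> list:
    """Return git argv lists: switch to the destination first, then merge.

    `git merge` always merges into the current branch, so the switch must
    come first to avoid merging into the protected baseline branch.
    """
    return [
        ["git", "switch", improve_branch],
        ["git", "merge", experiment_branch, "--no-ff", "-m", message],
    ]
```

Each argv list could be run with `subprocess.run(argv, check=True)` so that a failed switch stops the sequence before the merge is attempted.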


Comment on lines +205 to +208
elif [[ "${field}" == "benchmark_score" ]]; then
exists=$(jq --arg f "${field}" 'has($f)' "${result_file}" 2>/dev/null || echo "false")
if [[ "${exists}" != "true" ]]; then
missing="${missing} ${field}"

P2: Validate benchmark_score type for success results

Result validation only checks whether benchmark_score exists, so non-numeric values like strings are accepted even when status is success. The tournament step ranks and compares candidates by score, so allowing non-numeric values can produce incorrect winner selection or comparison failures. Enforce that benchmark_score is numeric for successful runs (and only relax this for error/timeout statuses if needed).
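A minimal Python sketch of the stricter check, assuming the result file has already been parsed into a dict. The function name is hypothetical; the `status` and `benchmark_score` fields are the ones named above. Rejecting non-finite values is an extra assumption beyond what the comment requests.

```python
import math


def score_errors(result: dict) -> list:
    """Return error strings for the benchmark_score field; empty means OK."""
    if "benchmark_score" not in result:
        return ["missing benchmark_score"]
    score = result["benchmark_score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if result.get("status") == "success" and not (is_number and math.isfinite(score)):
        return ["benchmark_score must be a finite number when status is success"]
    return []
```

Non-success statuses are deliberately left relaxed here, matching the comment's note that error or timeout results may not carry a usable score.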


Comment on lines +177 to +179
steps_len=$(jq '.steps | length' "${plan_file}" 2>/dev/null || echo "0")
if [[ "${steps_len}" -eq 0 ]]; then
err "steps must be a non-empty array"

P2: Require steps to be an array during plan validation

The plan check uses .steps | length but never verifies that steps is an array, so malformed payloads like a string still pass schema validation when non-empty. Downstream executor logic expects an ordered list of step objects, so this can allow invalid plans through and cause execution ambiguity/failures later in the loop. Add an explicit type == "array" check before evaluating length.
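The same idea in a short Python sketch, assuming the plan has been parsed into a dict: check the type before the length. The function name is hypothetical, and the per-step object check is an added assumption based on the comment's mention of "step objects".

```python
def steps_errors(plan: dict) -> list:
    """Return error strings for the plan's `steps` field; empty means OK."""
    steps = plan.get("steps")
    if not isinstance(steps, list):
        # A non-empty string has a length too, so type must be checked first.
        return ["steps must be an array"]
    if not steps:
        return ["steps must be a non-empty array"]
    if not all(isinstance(step, dict) for step in steps):
        return ["each step must be an object"]
    return []
```

In the shell validator the equivalent guard would test the JSON type of `.steps` before evaluating its length.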


@Yeachan-Heo Yeachan-Heo merged commit a59c097 into Yeachan-Heo:dev Apr 1, 2026

2 participants