feat(skill): add self-improve autonomous code improvement skill#2074
Yeachan-Heo merged 4 commits into `Yeachan-Heo:dev`.
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 80ef26fb59
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
skills/self-improve/SKILL.md
Outdated
```
2. **Filter** to `status: "success"` only. If zero candidates, skip to Step 9 (Record & Visualize).
3. **Rank** by `benchmark_score` (respecting `benchmark_direction`)
4. **Ranked-candidate loop** — for each candidate in rank order (best first):
   a. **No-regression check**: candidate score must be >= current `best_score`
```
Respect metric direction in regression gate
The tournament gate currently hard-codes candidate score >= best_score, which is only valid for higher_is_better. For goals configured as lower_is_better (e.g., latency/error), genuinely better candidates will be rejected before merge, so the loop can stall even when improvements exist. The comparison in this step needs to branch on benchmark_direction just like ranking does.
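A direction-aware gate can branch the same way the ranking step does. A minimal sketch of the idea (the function name and the awk-based float comparison are illustrative, not the skill's actual code):

```shell
# Hypothetical no-regression gate that branches on benchmark_direction.
# awk handles float comparison, which bash arithmetic cannot.
passes_gate() {
  local candidate="$1" best="$2" direction="$3"
  if [[ "${direction}" == "lower_is_better" ]]; then
    awk -v c="${candidate}" -v b="${best}" 'BEGIN { exit !((c + 0) <= (b + 0)) }'
  else
    awk -v c="${candidate}" -v b="${best}" 'BEGIN { exit !((c + 0) >= (b + 0)) }'
  fi
}
```

Under `lower_is_better`, a latency candidate of 0.2s against a best of 0.5s passes this gate, whereas a hard-coded `>=` would wrongly reject it.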
```shell
base_commit=$(git -C "${GIT_DIR}" merge-base HEAD HEAD~1 2>/dev/null || echo "HEAD~1")
modified_files_str=$(git -C "${GIT_DIR}" diff --name-only "${base_commit}" 2>/dev/null || true)
```
Compare sealed files against the correct baseline
In --worktree mode, sealed-file detection diffs against merge-base HEAD HEAD~1 (effectively HEAD~1 for normal commits), which is not the branch baseline. In a fresh experiment branch this can include unrelated changes from the parent commit and falsely report sealed-file violations, and in multi-commit experiments it can miss sealed-file edits made before the last commit. This makes sealed-file enforcement both noisy and unreliable during executor runs.
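The suggested fix can be sketched against a throwaway repo; the branch names here (`improve/goal`, `experiment/round_1_executor_1`) are illustrative, not the script's actual variables:

```shell
# Self-contained demo: derive the sealed-file baseline from the improvement
# branch the experiment was cut from, not from HEAD~1.
GIT_DIR=$(mktemp -d)
git -C "${GIT_DIR}" init -q
git -C "${GIT_DIR}" -c user.name=t -c user.email=t@t commit -q --allow-empty -m base
git -C "${GIT_DIR}" checkout -q -b improve/goal
git -C "${GIT_DIR}" -c user.name=t -c user.email=t@t commit -q --allow-empty -m improve
git -C "${GIT_DIR}" checkout -q -b experiment/round_1_executor_1
git -C "${GIT_DIR}" -c user.name=t -c user.email=t@t commit -q --allow-empty -m exp1
git -C "${GIT_DIR}" -c user.name=t -c user.email=t@t commit -q --allow-empty -m exp2
# merge-base against the improvement branch covers every experiment commit;
# HEAD~1 would miss the edits made in exp1.
base_commit=$(git -C "${GIT_DIR}" merge-base HEAD improve/goal)
modified_files=$(git -C "${GIT_DIR}" diff --name-only "${base_commit}")
```

Here `base_commit` resolves to the tip of `improve/goal`, so a two-commit experiment is diffed in full.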
CI failed — 2 test failures. Please rebase on dev: `git fetch origin dev && git rebase origin/dev`.
Integrate an evolutionary self-improvement loop as a self-contained skill that autonomously improves any target codebase through tournament selection. The skill spawns parallel agent pairs (researcher → planner → executor), benchmarks each experiment in isolated git worktrees, and merges only the best-performing change per iteration.

Skill structure:
- SKILL.md: Loop controller with 11-step iteration cycle, resumability, and cancellation support
- si-researcher.md, si-benchmark-builder.md, si-goal-clarifier.md: Custom reference docs for roles without OMC agent equivalents
- data_contracts.md: 12 JSON schemas for inter-agent communication
- scripts/validate.sh: Sealed file + plan/result schema validation
- scripts/plot_progress.py: Progress visualization with matplotlib fallback
- templates/: Default config for settings, agent state, goal, harness, ideas

Integration points (4 files):
- state-tools.ts: Register in STATE_TOOL_MODES + EXTRA_STATE_ONLY_MODES
- skill-state/index.ts: SKILL_PROTECTION 'heavy' (10 reinforcements, 30min)
- cancel/SKILL.md: Position 11 in dependency order with cleanup semantics
- CLAUDE.md: Added to workflow skill catalog

Agent mapping (evidence-based):
- 4 OMC agents leveraged: planner, architect, critic, executor
- 4 custom roles: researcher, benchmark-builder, goal-clarifier, tournament
- Skill-only invocation (/oh-my-claudecode:self-improve) — no keyword trigger

Inspired by github.com/lonj7798/self-improvement

Constraint: EXTRA_STATE_ONLY_MODES (not full ExecutionMode) to minimize integration surface
Rejected: Global agents | context bloat in every session
Rejected: Full OMC agent wrapping | impedance mismatch (explore=haiku, git-master=atomic only)
Rejected: Keyword trigger | "self improve" too common in English, accidental trigger risk
Confidence: high
Scope-risk: narrow
Not-tested: Full multi-iteration loop with real benchmark (manual testing recommended)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 80ef26f to 86c34a7.
…seline

In --worktree mode, detect the improve/* branch as the correct merge-base for sealed file comparison instead of HEAD~1. This ensures multi-commit experiments are fully checked and avoids false positives from unrelated parent commit changes.

Constraint: Must handle repos where improve/* branch does not yet exist
Confidence: high
Scope-risk: narrow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on check

The no-regression gate hard-coded >=, which only works for higher_is_better. For lower_is_better goals (latency, error rate), better candidates would be rejected. Now branches on benchmark_direction like the ranking step does.

Confidence: high
Scope-risk: narrow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c13231305c
```shell
improve_branch=$(git -C "${GIT_DIR}" branch -a --list 'improve/*' 2>/dev/null | head -1 | tr -d ' *' || true)
if [[ -z "${improve_branch}" ]]; then
```
Derive sealed-file baseline from current experiment branch
In --worktree mode the sealed-file diff baseline is chosen from the first improve/* branch returned by Git, not from the improvement branch that the current experiment/round_* branch was created from. If a repo has multiple improvement branches, this can diff against an unrelated history and either falsely report sealed-file violations or miss real ones for this executor run. Fresh evidence: this revision now uses git branch -a --list 'improve/*' | head -1, which is still branch-agnostic.
The current skill is designed to maintain a single improvement branch at a time (only the winner branch is kept), which keeps branches as clean as possible and reduces complexity. So `head -1` will return the correct branch.
```shell
# Settings path must be provided or discovered from .omc/self-improve/config/
SETTINGS=""
VALID_APPROACH_FAMILIES="architecture training_config data infrastructure optimization testing documentation other"
```
Accept custom approach families in plan validation
The validator hard-codes VALID_APPROACH_FAMILIES and rejects anything outside that list, but the skill contracts explicitly allow custom approach families from harness.md (data_contracts.md states these are valid). As soon as a user configures a custom family, check_plan_schema will fail and block executor benchmarking for otherwise valid plans.
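One way to honor the contract is to fold user-configured families into the whitelist rather than hard-coding it. A hypothetical sketch (the `CUSTOM_FAMILIES` variable and its source are assumptions, e.g. parsed from harness.md):

```shell
# Hypothetical: accept custom approach families alongside the defaults
# instead of rejecting everything outside a fixed whitelist.
DEFAULT_FAMILIES="architecture training_config data infrastructure optimization testing documentation other"
is_valid_family() {
  local fam="$1" f
  # CUSTOM_FAMILIES is an assumed space-separated list from user config.
  for f in ${DEFAULT_FAMILIES} ${CUSTOM_FAMILIES:-}; do
    [[ "${f}" == "${fam}" ]] && return 0
  done
  return 1
}
```

With `CUSTOM_FAMILIES="prompting"`, a plan declaring the `prompting` family would validate instead of blocking executor benchmarking.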
CI failure is a flaky performance test. Can you re-run CI, or merge with the flaky test acknowledged?
CI is all green now ✅ (the flaky test passed on re-run). Will review the PR shortly.
Yeachan-Heo left a comment
Review: REQUEST_CHANGES
Thanks for the ambitious PR — the self-improve concept is directionally interesting, but in its current form there are blocking concerns that need to be addressed before merge.
Blocking Issues
1. Unsandboxed arbitrary-code execution / repo trust boundary
`SKILL.md:96-100, 213-225`, `si-benchmark-builder.md:47-49, 63-67`
- The feature accepts an arbitrary repo_path, creates/uses benchmark code in that repo, then runs benchmark commands autonomously
- Git worktrees isolate Git state, not process/network/env access
- As written, this is effectively an autonomous code-execution loop over arbitrary repos with no sandbox/trust gate
2. Autonomous push/PR behavior is too risky
`SKILL.md:130, 243, 302`
- The loop auto-pushes winners and may create a PR upstream
- Without explicit opt-in and remote verification, this is too dangerous for a new skill
3. Cancel/resume integration is incomplete
`SKILL.md:147, 276, 301-312`, `cancel/SKILL.md:118`, `state-tools.ts:46-49`
- The skill promises user_stopped, preserved iteration_state, orphaned worktree cleanup, and resume safety, but the patch only adds a state-tool enum entry, heavy skill protection, and one cancel bullet
- There is no real self-improve-specific cancel flow implementing those promises
Non-blocking Issues
4. Validator contradicts the documented contract — data_contracts.md:144-157 docs say custom approach families from harness.md are valid, but validate.sh hardcodes a fixed whitelist
5. Sealed-file baseline is nondeterministic — validate.sh:83-94 picks the first improve/* branch via head -1, which can be the wrong baseline if multiple improvement branches exist
6. Docs are under-integrated — CLAUDE.md is updated, but broader user-facing docs/skill inventories are not
What Needs to Happen
- Add a trust/sandbox model for the execution loop
- Make push/PR creation explicitly opt-in (not default behavior)
- Implement real cancel/resume integration (not just documentation)
- Fix validator contract mismatches
Looking forward to a v2!
—
[repo owner's gaebal-gajae (clawdbot) 🦞]
Review Summary

Gate: ✅ Star gate passed (starred oh-my-claudecode)
CI: ✅ All checks passing (after dev fix in #2075 + flaky perf test rerun)
Code Review:
- Core integration (3 files) — Clean
- Skill content (14 files) — Self-contained

Notes

Decision needed: This is a feature addition — owner approval needed for inclusion. Code quality looks good, integration is minimal and follows patterns.
…in, cancel/resume

Addresses all 3 blocking issues from maintainer review on Yeachan-Heo#2074:

1. Trust gate: Setup phase now requires explicit user confirmation of repo_path and benchmark_command before any autonomous execution. Consent persisted as trust_confirmed in agent-settings.json; skipped on resume. Gate enforced alongside si_setting_* flags.
2. Push/PR opt-in: auto_push and auto_pr default to false in settings. Tournament push (Step 8) and completion PR gated on these flags. When disabled, exact manual commands are logged for the user.
3. Cancel/resume: New Step 0 cleans stale worktrees idempotently on every iteration start and resume. Step 2 detects cancel with explicit cleanup flow (set user_stopped, update iteration_state, clean worktrees). Resumability handles user_stopped (ask), crash (auto-resume), and fresh start with Step 0 as a hard prerequisite.

Also fixes non-blocker: removed approach_family enum check from validate.sh — critic owns taxonomy validation (supports custom families from harness.md).

Constraint: All fixes are SKILL.md prompt changes, no new TypeScript
Rejected: Compiled trust infrastructure | disproportionate for skill-level feature
Confidence: high
Scope-risk: narrow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
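The trust gate described in the commit message might be read back like this. The field names mirror the commit message (`trust_confirmed` in `agent-settings.json`), but the snippet is a sketch with a made-up payload, not the shipped code:

```shell
# Hypothetical trust-gate check: refuse autonomous execution until the user
# has confirmed repo_path and benchmark_command (field names assumed).
settings=$(mktemp)
printf '{"trust_confirmed": false, "auto_push": false, "auto_pr": false}' > "${settings}"
trusted=$(jq -r '.trust_confirmed // false' "${settings}")
if [[ "${trusted}" != "true" ]]; then
  echo "trust gate: confirm repo_path and benchmark_command before autonomous execution" >&2
fi
```

The same `jq -r '.flag // false'` pattern would gate `auto_push` and `auto_pr`, so pushes and PRs stay off unless explicitly enabled.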
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e1ff7cd38
```
3. **Rank** by `benchmark_score` (respecting `benchmark_direction`)
4. **Ranked-candidate loop** — for each candidate in rank order (best first):
   a. **No-regression check**: candidate score must improve or hold even vs `best_score`, respecting `benchmark_direction` (`higher_is_better`: score >= best_score; `lower_is_better`: score <= best_score)
   b. **Merge** via `oh-my-claudecode:git-master`: `git merge experiment/round_{n}_executor_{id} --no-ff -m "Iteration {n}: {hypothesis} (score: {before} → {after})"`
```
Checkout improvement branch before running merge command
This merge step uses git merge experiment/... without switching branches first, but Setup explicitly leaves the main repo on {target_branch} after creating improve/{goal_slug}. Because git merge always merges into the current branch, following these instructions literally can merge experiment commits into the protected baseline branch instead of improve/{goal_slug}. Add an explicit checkout/switch to the improvement branch immediately before this command (or use a command form that names both source and destination).
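A throwaway-repo demo of the suggested fix (branch names are illustrative): switching to the improvement branch first makes the merge destination explicit, so experiment commits never land on the protected baseline:

```shell
# Demo: git merge always merges into the current branch, so check out the
# improvement branch before merging the experiment branch.
repo=$(mktemp -d)
git -C "${repo}" init -q
git -C "${repo}" -c user.name=t -c user.email=t@t commit -q --allow-empty -m base
git -C "${repo}" branch improve/goal
git -C "${repo}" checkout -q -b experiment/round_1_executor_1
git -C "${repo}" -c user.name=t -c user.email=t@t commit -q --allow-empty -m exp
git -C "${repo}" checkout -q improve/goal   # explicit merge destination
git -C "${repo}" -c user.name=t -c user.email=t@t merge -q --no-ff \
  -m "Iteration 1: demo" experiment/round_1_executor_1
```

Afterward the merge commit sits on `improve/goal`, while the baseline branch is untouched.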
```shell
elif [[ "${field}" == "benchmark_score" ]]; then
  exists=$(jq --arg f "${field}" 'has($f)' "${result_file}" 2>/dev/null || echo "false")
  if [[ "${exists}" != "true" ]]; then
    missing="${missing} ${field}"
```
Validate benchmark_score type for success results
Result validation only checks whether benchmark_score exists, so non-numeric values like strings are accepted even when status is success. The tournament step ranks and compares candidates by score, so allowing non-numeric values can produce incorrect winner selection or comparison failures. Enforce that benchmark_score is numeric for successful runs (and only relax this for error/timeout statuses if needed).
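The gap can be demonstrated with a made-up payload: a success result carrying a string score passes the existence check but fails a type check like this one (a sketch, not the validator's actual code):

```shell
# Sketch: require a numeric benchmark_score whenever status is "success";
# error/timeout results are exempt. Sample payload is invented.
result_file=$(mktemp)
printf '{"status": "success", "benchmark_score": "fast"}' > "${result_file}"
score_ok=$(jq '(.status != "success") or ((.benchmark_score | type) == "number")' "${result_file}")
```

Here `score_ok` is `false`; replacing `"fast"` with `0.93` (or setting `status` to `"error"`) yields `true`.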
```shell
steps_len=$(jq '.steps | length' "${plan_file}" 2>/dev/null || echo "0")
if [[ "${steps_len}" -eq 0 ]]; then
  err "steps must be a non-empty array"
```
Require steps to be an array during plan validation
The plan check uses .steps | length but never verifies that steps is an array, so malformed payloads like a string still pass schema validation when non-empty. Downstream executor logic expects an ordered list of step objects, so this can allow invalid plans through and cause execution ambiguity/failures later in the loop. Add an explicit type == "array" check before evaluating length.
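The failure mode is easy to reproduce: jq's `length` also works on strings, so a plan whose `steps` is a string still reports a non-zero length. An explicit type check closes the hole (sample payload is invented):

```shell
# Sketch: verify steps is actually an array before checking its length.
plan_file=$(mktemp)
printf '{"steps": "do everything"}' > "${plan_file}"
steps_len=$(jq '.steps | length' "${plan_file}")          # 13 — string length!
steps_is_array=$(jq '.steps | type == "array"' "${plan_file}")
```

With the type check in place, this malformed plan is rejected even though its `length` is non-zero.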
Summary
- `self-improve` skill that autonomously improves any target codebase through tournament-based evolutionary optimization
- Invoked via `/oh-my-claudecode:self-improve` (no keyword trigger — "self improve" is too common in English)

What's included

New skill directory (`skills/self-improve/`):
- `SKILL.md` — Loop controller with 11-step iteration cycle, resumability, cancellation
- `si-researcher.md`, `si-benchmark-builder.md`, `si-goal-clarifier.md` — Custom agent prompts
- `data_contracts.md` — 12 JSON schemas for inter-agent communication
- `scripts/validate.sh` — Sealed file + plan/result schema validation
- `scripts/plot_progress.py` — Progress visualization (matplotlib with text fallback)
- `templates/` — Default configs (settings, agent-state, goal, harness, ideas)

Integration points (4 files):
- `src/tools/state-tools.ts` — `'self-improve'` in `STATE_TOOL_MODES` + `EXTRA_STATE_ONLY_MODES`
- `src/hooks/skill-state/index.ts` — `'self-improve': 'heavy'` (10 reinforcements, 30min TTL)
- `skills/cancel/SKILL.md` — Position 11 in cancellation dependency order
- `CLAUDE.md` — Added to workflow skill catalog

Test update:
- `src/__tests__/skills.test.ts` — Updated expected skill counts

Inspired by

lonj7798/self-improvement — evolutionary code improvement engine with tournament selection, sealed benchmarks, and institutional memory.

Test plan
- `npm test` passes (skill count assertions updated)
- `/oh-my-claudecode:self-improve` loads the skill
- `state_read(mode='self-improve')` works
- `/oh-my-claudecode:cancel` clears self-improve state
- `scripts/validate.sh` runs without errors

🤖 Generated with Claude Code