/large-plan

Scaffold multi-phase plans with async agents. Triggers: large feature, multi-PR refactor, parallel cmux.

$ golems-cli skills install large-plan

Good

97% best pass rate

77 assertions

13 evals

4 workflows

Updated 5 days ago

Invoke as: /large-plan (single segment).
Source: ~/Gits/golems/skills/golem-powers/large-plan/ (symlinked at ~/.claude/commands/large-plan).

Scaffold folder-based plans with phase folders, execute them through the branch-PR-review cycle, and coordinate async agent collaboration.

Quick Actions

| What you want to do | Workflow |
|---------------------|----------|
| Create a new plan from a description | workflows/scaffold.md |
| Execute the next phase in a plan | workflows/execute-phase.md |
| Start async collab on a phase | workflows/collab.md |

Available Scripts

| Script | Purpose | Usage |
|--------|---------|-------|
| `scripts/scaffold-plan.sh` | Create folder-based plan structure | `bash scripts/scaffold-plan.sh <plan-dir> <plan-name> <phase-count>` |

Core Concept

Large plans are folder-based: one folder per phase, each containing a README.md (steps) and findings.md (shared knowledge). A main README.md acts as the index with a progress table and routing.

plan-dir/
  README.md              # Index: progress table, routing, execution rules
  collab.md              # Created when parallel phases exist (see below)
  phase-1-name/
    README.md            # Steps for this phase
    findings.md          # Shared knowledge room (agents write here)
  phase-2-name/
    README.md
    findings.md
  ...
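
As a rough illustration of the layout above, the following Python sketch creates the same folder structure. It is not the actual `scripts/scaffold-plan.sh`; the function name `scaffold_plan` and the placeholder file contents are assumptions made for this example only.

```python
import tempfile
from pathlib import Path


def scaffold_plan(plan_dir: Path, phase_names: list) -> list:
    """Create plan_dir with one folder per phase (hypothetical helper).

    Each phase folder gets a README.md (steps) and a findings.md
    (shared knowledge), matching the tree shown above.
    """
    plan_dir.mkdir(parents=True, exist_ok=True)
    # Placeholder index; the real index holds the progress table and routing.
    (plan_dir / "README.md").write_text("# Plan index\n")
    phase_dirs = []
    for number, name in enumerate(phase_names, start=1):
        phase_dir = plan_dir / f"phase-{number}-{name}"
        phase_dir.mkdir(exist_ok=True)
        (phase_dir / "README.md").write_text(f"# Phase {number}: {name}\n")
        (phase_dir / "findings.md").write_text(f"# Phase {number} Findings\n")
        phase_dirs.append(phase_dir)
    return phase_dirs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_plan(Path(tmp) / "plan-dir", ["setup", "search"])
        print([p.name for p in created])
```

The sketch leaves out collab.md on purpose, since that file is only created when a round contains parallel phases.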

Execution Decision: Sequential vs Parallel

EVERY plan must decide this at scaffold time. Analyze the dependency graph:

Phases with NO cross-dependencies  →  Parallel (collab.md + multiple agents)
Phases that depend on each other   →  Sequential (execute-phase, one at a time)
Mixed                              →  Rounds (parallel within round, sequential between rounds)

Decision tree:

1. Draw the dependency graph from phase Depends On fields
2. Group independent phases into rounds (phases in the same round can run in parallel)
3. If ANY round has 2+ phases → create collab.md at plan root
4. Add ## Execution Strategy to the main README.md showing rounds and parallelism

Example:

## Execution Strategy
 
| Round | Phases | Mode | Agents |
|-------|--------|------|--------|
| 1 | Phase 1, Phase 2 | **parallel** (collab) | brainClaude, golemsClaude |
| 2 | Phase 3 (depends on 1+2) | sequential | mainClaude |
| 3 | Phase 4, Phase 5 | **parallel** (collab) | brainClaude, golemsClaude |
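
The grouping step of the decision tree can be sketched as a layered topological sort. This is a minimal illustration, not part of the skill: the names `group_into_rounds` and `needs_collab` are hypothetical, and the example assumes Phases 4 and 5 depend on Phase 3, which the table above implies but does not state.

```python
def group_into_rounds(depends_on: dict) -> list:
    """Group phases into rounds; phases in one round share no unmet dependency.

    depends_on maps each phase name to the set of phases it depends on.
    """
    remaining = {phase: set(deps) for phase, deps in depends_on.items()}
    done = set()
    rounds = []
    while remaining:
        # A phase is ready once all of its dependencies are in earlier rounds.
        ready = sorted(p for p, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(
                "dependency cycle or unknown phase among: "
                + ", ".join(sorted(remaining))
            )
        rounds.append(ready)
        done.update(ready)
        for phase in ready:
            del remaining[phase]
    return rounds


def needs_collab(rounds: list) -> bool:
    """collab.md is required when ANY round has 2+ phases."""
    return any(len(phases) >= 2 for phases in rounds)


if __name__ == "__main__":
    graph = {
        "Phase 1": set(),
        "Phase 2": set(),
        "Phase 3": {"Phase 1", "Phase 2"},
        "Phase 4": {"Phase 3"},  # assumed dependency
        "Phase 5": {"Phase 3"},  # assumed dependency
    }
    rounds = group_into_rounds(graph)
    print(rounds)
    print(needs_collab(rounds))
```

A round with one phase maps to sequential execution and a round with two or more maps to parallel collab, so the same output can drive the Mode column.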

When a round has parallel phases, the orchestrator:

1. Creates/updates collab.md using the collab protocol
2. Spawns one managed cmux agent per phase in the parent/caller agent's workspace unless the user explicitly asks for a different workspace, with explicit role:"worker" unless the seat is the coordinating lead. Do not route by the human's currently focused workspace; if cmux must focus the parent workspace to initialize a terminal, it must restore the user's prior focus after the new pane is available. If a successor/coordinating lead is spawned in a workspace that already has a left lead pane, it must become a tab in that same lead pane; a new left split or third column is a health failure, not an acceptable layout variant.
3. Writes the returned agent_id, surface, workspace, role, session/resume health, inbox health, and topology health into the collab Agent table
4. Each agent's kickoff prompt includes the collab.md path and required DONE marker
5. Orchestrator monitors collab.md, wait_for(agent_id), inbox events, and output DONE files, then advances rounds when all phases are verified done. For file-backed cmux workers, the wait_for(done) call must include the worker's report_path and exact done_marker.
6. Worker completion must wake the owning lead through an adapter-owned guard: Claude native /loop/Cron/Monitor, file-backed wait_for({target_state:"done",report_path,done_marker}), live inbox/subscription, cmux notify, or a recorded temporary watcher. A report DONE marker or visible TASK_DONE without lead notification is completion_notification_missing health evidence. The worker does not have to post to the collab for this to fire; the report path and DONE marker are state, and the inbox/notification/watcher path is the event.
7. Codex leads must arm a guard before delegating: live inbox heartbeat, file-backed wait_for(done) receipt owned by the lead, or a background watcher/loop PID/task id that watches the registered collab/goal/report/DONE paths. Native cmuxlayer delivery may satisfy this when present, but the skill contract does not depend on it. No guard is blocked:health.
8. Completed workers should be harvested, reviewed or handed off, then closed. Keeping a completed worker open is valid for active review, successor orientation, live process state, or paired continuation; stale purposeless panes are done_worker_left_open health evidence. If kept, the Agent row must say KEPT_OPEN:<reason> with owner and next check.
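
The distinction between state (report path plus DONE marker) and event (lead notification) can be sketched as below. This is not the cmux `wait_for` implementation; `report_is_done` and `completion_health` are hypothetical helpers, and treating the marker as the last non-empty line of the report is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Optional


def report_is_done(report_path: Path, done_marker: str) -> bool:
    """True only if the report exists and ends with the exact marker.

    Assumption: the marker must be the last non-empty line of the file.
    """
    if not report_path.is_file():
        return False
    lines = [ln.strip() for ln in report_path.read_text().splitlines() if ln.strip()]
    return bool(lines) and lines[-1] == done_marker


def completion_health(done: bool, lead_notified: bool) -> Optional[str]:
    """Return a health-evidence label, or None when nothing is wrong."""
    if done and not lead_notified:
        # DONE state exists but no event ever reached the owning lead.
        return "completion_notification_missing"
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        report = Path(tmp) / "phase-1-report.md"
        report.write_text("Summary of work\nTASK_DONE\n")
        done = report_is_done(report, "TASK_DONE")
        print(done, completion_health(done, lead_notified=False))
```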

Continuation / Recovery Authority

A large-plan dispatch is not permission to park. When the user has approved the plan or assigned a lane, agents have delegated authority to complete the normal path for that lane: commit, push, PR, review loop, merge when the lead/goal owns merge, restart in-scope services/MCPs, and resume or replace workers that cannot continue. Generic statements like "I can't commit", "I can't merge", "I can't restart BrainLayer/cmuxlayer MCP", or "Codex cannot reconnect MCPs" are not terminal blockers.

Recoverable blockers must become actions:

- For PR work: invoke /pr-loop; approved work already authorizes the full loop unless the goal explicitly says "no merge".
- For daemon/MCP work: checkpoint, restart/reload the in-scope service, verify with a real probe, and continue.
- For Codex MCP reconnect limits: write a handoff and spawn/resume a managed successor that boots with the needed MCPs.
- For stale or idle workers: the lead monitor nudges, supersedes, resumes, or replaces; it does not wait for the user to discover the idle pane.

Only irreversible data loss, force-push/history rewrite, destructive cleanup of unowned work, credential/account changes, paid external actions, or human-only license acceptance become blocked:human.
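
One way to keep this split explicit is a small lookup that turns a blocker kind into either a recovery action or blocked:human. The sketch below is illustrative only: the kind keys (such as `pr_work` and `force_push`) are labels invented for this example, and raising on an unknown kind is an assumption, since the text does not say how to treat blockers outside both lists.

```python
# Hypothetical blocker kinds mapped to the recovery actions listed above.
RECOVERY_ACTIONS = {
    "pr_work": "invoke /pr-loop",
    "daemon_mcp": "checkpoint, restart/reload the service, probe, continue",
    "codex_mcp_reconnect": "write a handoff and spawn/resume a managed successor",
    "stale_worker": "nudge, supersede, resume, or replace the worker",
}

# Hypothetical labels for the narrow set that needs a human.
HUMAN_ONLY = {
    "irreversible_data_loss",
    "force_push",
    "destructive_cleanup_of_unowned_work",
    "credential_or_account_change",
    "paid_external_action",
    "license_acceptance",
}


def classify_blocker(kind: str) -> str:
    """Return 'blocked:human' or 'action: <what to do>' for a blocker kind."""
    if kind in HUMAN_ONLY:
        return "blocked:human"
    if kind in RECOVERY_ACTIONS:
        return "action: " + RECOVERY_ACTIONS[kind]
    # Assumption: unknown kinds are surfaced rather than silently parked.
    raise ValueError(f"unclassified blocker kind: {kind}")


if __name__ == "__main__":
    print(classify_blocker("daemon_mcp"))
    print(classify_blocker("force_push"))
```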

Plan Lifecycle

Scaffold plan  →  Analyze dependencies  →  Group into rounds
                                                |
                    ┌───────────────────────────┘
                    ▼
              Round has 1 phase?  →  Execute sequentially (execute-phase)
              Round has 2+ phases? → Create collab.md, spawn agents in parallel
                    |
                    ▼
              All round phases done  →  Advance to next round  →  Repeat

Non-Code Deliverables Check (MANDATORY at scaffold time)

Root cause (April 5 overnight sprint): orcClaude missed the second track — user wanted 9 entity files enhanced for morning walk + code PRs by dawn. Agent only scaffolded the code track.

At scaffold time, ALWAYS ask: "Are there non-code deliverables alongside the code phases?"

Common non-code deliverables:

- Data enrichment / content curation (entity files, research docs, grill enhancement)
- Documentation updates (READMEs, portfolio pages, design docs)
- Configuration changes (LaunchAgents, hooks, environment)
- Research outputs (A/B test results, comparative analysis)

If yes, add a separate phase or parallel track for the non-code work. Non-code deliverables are often the user's PRIMARY goal — the code is just infrastructure supporting it.

Branch Lifecycle (per phase)

master -> feature/phase-N-name -> implement -> commit -> push -> PR
  ^                                                            |
  |__ merge <-- approve <-- fix <-- review (CodeRabbit + Cursor Bugbot + DeepSource)

Phase Template

Each phase README follows this template:

# Phase N: Name
 
> [Back to main plan](../README.md)
 
## Goal
One sentence describing what this phase achieves.
 
## Time
- **Estimate:** NNmin (basis: [complexity/rolling avg from prior phases])
- **Started:** HH:MM
- **Completed:** —
- **Actual:** —
- **Error ratio:** —
 
## Round
Round M (parallel with Phase X, Phase Y) OR Round M (sequential).
 
## Tools
- **Research:** [gemini|cursor|codex] — what to research
- **Code:** [cursor|haiku|sonnet] — what to implement
- **MCPs:** [list relevant MCP servers]
 
## Steps
1. Step one
2. Step two
3. ...
 
## Depends On
- Phase X (for Y reason)
 
## Status
- [ ] Step one
- [ ] Step two

Findings Template

Each phase findings.md is the shared collaboration room:

# Phase N Findings
 
## Decisions
- [timestamp] Decision: ...
 
## Research
- [timestamp] Agent: Found that ...
 
## Task Board
| Task | Owner | Status |
|------|-------|--------|
| Research X | gemini | done |
| Implement Y | cursor | in progress |

Parallel Execution (Collab Protocol)

When a round has 2+ independent phases, use the full collab protocol defined in workflows/collab.md.

The orchestrator MUST:

1. Create collab.md at plan root using the template from the collab workflow
2. Fill in all mandatory sections (Goal, Agents, Task Board, Constraints, Gates)
3. Spawn agents through managed cmux lifecycle in the caller/current workspace unless explicitly overridden, with explicit roles (orchestrator for the coordinating lead, worker for phase executors)
4. Record each returned agent_id plus surface/workspace/session/inbox/topology health before the first task dispatch, and verify get_agent_state agrees with list_surfaces
5. Monitor collab.md, file-backed wait_for, inbox_check, Metacomm inbox events, and file DONE markers; advance rounds only when all agents are verified done

Key rule: If the human has to tell you to update the collab file, the collab has failed. Agents must self-coordinate.

Health rule: missing agent_id, auto-* discovered agents, null cli_session_id on crash-recover/long-running workers, non-resumability, inbox_check.monitor_alive=false, wedged prompts, registry/screen disagreement, registry/surface workspace disagreement, surprise workspace creation, role mismatch, worker-in-left-column placement, or unexpected three-column topology are blocked:health entries, not normal status.

Add these live dispatch failures to the same blocked:health bucket:

- successor_lead_wrong_pane: an orchestrator/successor lead lands as a separate left split instead of a tab in the existing lead pane.
- boot_prompt_typed_not_submitted: the task prompt is visible in the Codex/Cursor/Gemini composer after spawn but was never submitted. Recover with Return if needed, but record the failure even if the worker later produces its report.

Complexity tiers (from collab workflow):

- Lightweight (~40 lines): 2 agents, fully independent work
- Standard (~100 lines): 2-3 agents, some dependencies
- Complex (~200 lines): 3+ agents, multi-repo, round-based

See workflows/collab.md for the full protocol, mandatory sections, update gates, message format, and anti-patterns.

Integration with Other Skills (Building Blocks)

MANDATORY for every phase:

| Skill | When | Why |
|-------|------|-----|
| `/pr-loop` | Every phase completion | The FULL loop: branch through MERGED. Not optional. |
| `/superpowers:test-driven-development` | All implementation | Red-green-refactor. No code without failing test first. |
| `/superpowers:verification-before-completion` | Before claiming "done" | Evidence before assertions. Always. |
| `/never-fabricate` | Before reporting results | Read() files before summarizing them. |

Optional per phase:

| Skill | When to use |
|-------|-------------|
| `/coderabbit` | Verify phase output with targeted review |
| Manual QA checklist | Generate test plans per phase from the diff |
| `/prd` | Create PRDs from phase specs |
| `/pr-loop` step 5 | CodeRabbit review + atomic commit |
| `/create-pr` | Create PR (step 7 of pr-loop) |

PR Review Cycle (per phase)

After push, automated reviewers comment. Classify each:

| Type | Action |
|------|--------|
| Real bug | FIX immediately |
| Style preference | Fix if genuinely better |
| Over-engineering | SKIP |
| Out of context | Comment explaining why |

Repeat push-fix cycle until no real bugs remain.

Platform Features vs Universal Fallbacks

Claude Code features are listed first. If running on Codex or Cursor, use the universal fallback. Full adapter docs: adapters/

| Feature | Claude Code | Universal Fallback |
|---------|-------------|--------------------|
| Parallel phase agents | `Agent(isolation="worktree", run_in_background=true)` | `spawn_agent({repo, cli:"codex", role:"worker", worktree, prompt})` and record `agent_id` |
| Phase worktree isolation | `Agent(isolation="worktree")` (auto-creates + cleans up) | `git worktree add -b feature/phase-N ../<dir> master` |
| Collab/report monitoring | `CronCreate` or `/loop 5m` | Codex/Cursor use a recorded 5-minute watcher over collab + goal/report/DONE paths, or a live inbox/wait guard; watcher may call `dispatch_to_agent(nudge:"auto")` or notify the owning lead |
| Cron cleanup (plan done) | `CronDelete(<id>)` (mandatory) | `kill <bg-monitor-pid>` |
| Plan mode (spec first) | `EnterPlanMode → ExitPlanMode` | Write plan to `docs.local/plan/<name>/README.md` manually |
| Memory persistence | `brain_store()` / `brain_search()` via BrainLayer | Append to `<plan-dir>/findings.md` |
| Session resume | `claude --resume` | Verify `cli_session_id` via `get_agent_state`; null session id is a health failure |
| Background phase execution | `Agent(run_in_background=true)` | Managed cmux worker + output file with final DONE marker |
| Goal delivery command | Native Claude goal/task command if present | File-backed goal is universal; slash-command syntax is not. Codex may use `/goal` only after acceptance is verified, Cursor needs UI-state verification, Gemini/Antigravity should receive a plain file-contract instruction. |

Worker Pane Hygiene

Worker panes are working memory, not trophies. After a worker reaches TASK_DONE, Goal achieved, or a verified report DONE marker, the lead should:

- harvest the report/output path and exact DONE marker;
- review it or transfer it to a successor/reviewer when needed;
- close/archive the pane unless it is actively useful.

It is valid to keep a completed worker open when it supports active review, successor orientation, live process state, or paired continuation. If a worker stays open, the Agent row must say KEPT_OPEN:<reason> with owner and next check. A completed or stale worker left visible with no current purpose is done_worker_left_open health evidence. Use judgment; the rule is critical cleanup, not bureaucracy.

Cursor-Workflows Inner Loop Recon

For read-heavy, cross-skill remediation, use cursor-workflows as an inner loop, not a single shallow gather. Use lib/autocursor.py primitives: parallel() for independent file reads, then loop_until_dry() to expand from findings into follow-up checks, red-team passes, and targeted probes until consecutive rounds add no stable new findings. Keep it read-only unless the user explicitly asks for mutation; the lead still owns edits, verification, and final synthesis.

Use this when the contract spans multiple skills/adapters/evals/reports and a single pass could miss second-order drift. Do not use it reflexively for simple local edits.

Time Tracking & Estimation Calibration (MANDATORY)

Data from April 5 overnight sprint (brainlayer, Codex workers): estimated 90min/phase, actual 15min average. Started at 6x overestimate, auto-calibrated to 1.25x by phase 7. Record timestamps at phase start + PR creation. Without tracking, estimates never calibrate.

At Scaffold Time

The main README.md progress table MUST include estimate and actual columns:

## Progress
 
| Phase | Status | Estimate | Started | Completed | Actual | Error |
|-------|--------|----------|---------|-----------|--------|-------|
| 1. Setup | ✅ done | 30min | 1:15 AM | 1:28 AM | 13min | 2.3x |
| 2. Search | ✅ done | 30min | 1:30 AM | 1:42 AM | 12min | 2.5x |
| 3. Hybrid | 🔄 active | 15min* | 1:45 AM | — | — | — |
| 4. Evals | ⏳ pending | 15min* | — | — | — | — |
 
*Auto-recalibrated from rolling avg of phases 1-2 (12.5min → round to 15min)
Rolling calibration: 2.3x → 2.5x → tracking...

At Phase Start (CLOCK IN)

brain_store(
  content: "CLOCK IN [plan-name / Phase N]: Started HH:MM. Estimate: NNmin. Basis: [first phase=complexity, later=rolling avg].",
  tags: ["time-tracking", "clock-in", "<project>"],
  importance: 5
)

Fill in the phase template's Time section: Started, Estimate.

At Phase Complete (CLOCK OUT)

brain_store(
  content: "CLOCK OUT [plan-name / Phase N]: PR merged HH:MM. Actual: NNmin. Estimated: NNmin. Error: X.Xx. Rolling avg (last 3): NNmin.",
  tags: ["time-tracking", "clock-out", "<project>"],
  importance: 5
)

Fill in the phase template's Time section: Completed, Actual, Error ratio. Update the main README progress table.

Auto-Recalibration (after 3+ phases)

Once 3 phases have actuals:

rolling_avg = average(last 3 actuals)
remaining_phases × rolling_avg = estimated total remaining

Report: "Phases 1-3 done in 38min total. Rolling avg: 12.7min.
         Remaining 4 phases: ~51min at current pace.
         Sprint total ETA: ~89min (original estimate was 630min = 7.1x overestimate)"

Rule: After 3+ phases, new estimates MUST be within 2x of rolling average. Don't keep estimating 90min when actuals are 15min.
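
The recalibration arithmetic can be checked with a short sketch. The function names are hypothetical, the third actual of 13min is inferred from the 38min total in the report above, and reading "within 2x" as the range from half to double the rolling average is an assumption.

```python
def rolling_average(actuals: list, window: int = 3) -> float:
    """Average of the last `window` actual durations, in minutes."""
    recent = actuals[-window:]
    return sum(recent) / len(recent)


def remaining_estimate(actuals: list, remaining_phases: int, window: int = 3) -> float:
    """remaining_phases x rolling_avg = estimated total remaining."""
    return remaining_phases * rolling_average(actuals, window)


def estimate_is_calibrated(estimate: float, actuals: list, factor: float = 2.0) -> bool:
    """Assumed reading of the rule: avg / factor <= estimate <= avg * factor."""
    avg = rolling_average(actuals)
    return avg / factor <= estimate <= avg * factor


if __name__ == "__main__":
    actuals = [13, 12, 13]  # phases 1-3, 38min total
    avg = rolling_average(actuals)
    remaining = remaining_estimate(actuals, remaining_phases=4)
    eta = sum(actuals) + remaining
    print(f"Rolling avg: {avg:.1f}min")            # 12.7min
    print(f"Remaining 4 phases: ~{remaining:.0f}min")  # ~51min
    print(f"Sprint total ETA: ~{eta:.0f}min")      # ~89min
    print(f"Overestimate: {630 / eta:.1f}x")       # 7.1x
    print(estimate_is_calibrated(90, actuals))     # False
    print(estimate_is_calibrated(15, actuals))     # True
```

With these inputs a 90min estimate fails the check while 15min passes, which is the behavior the rule asks for.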

Why This Matters

User correction (April 5): "No, I'm saying it will take probably hours, not weeks" — after orc estimated a 2-week timeline for work that took one evening. Time tracking turns this from a repeated correction into self-correcting behavior.

Quality Gates (before marking phase done)

| Gate | Check |
|------|-------|
| Typed right | No `any`, proper interfaces |
| Documented | JSDoc on exports, CLAUDE.md updated if needed |
| DRY | No duplicated logic |
| Tests pass | `bun test` / `npm test` green |
| Build passes | No compile errors |

Phase N+1: Adversarial Evaluator (NON-NEGOTIABLE)

Closes the self-audit-as-evaluator substitution loophole. Observed at P5 fix queue 2026-05-17 — agent self-graded "evaluator replay PASS" without dispatching a separate evaluator. /goal hook silently passed.

Every /large-plan output that produces code, scripts, configs, or plist drafts MUST end with a Phase N+1 that:

1. Spawns a separate evaluator subagent (NOT the producing agent's self-audit).
   - Use Agent(subagent_type=evaluator, ...) or equivalent platform fallback.
   - The evaluator MUST be a different agent invocation from the one that produced the work.
2. Hands the evaluator a verbatim copy of every "Pass criterion" from the original /goal hook (no paraphrasing, no summarization).
3. Requires the evaluator to re-Read each cited file:line and run anti-fabrication checks (per /never-fabricate Live-citation gate).
4. The evaluator MUST score ≥8/10 OR produce an ITERATE verdict with specific fixes.
5. SELF-AUDIT IS NOT EVALUATION. If the producing agent grades its own work, the /goal hook does not pass; re-dispatch with explicit subagent_type ≠ producing agent.

Template: workflows/phase-evaluator.md — minimal evaluator-subagent dispatch (prompt format, scoring rubric link).

Done-gate semantics:

| Producing agent emits | /goal hook treats as |
|-----------------------|----------------------|
| `TASK_DONE` without evaluator dispatch transcript | FAIL (substitution loophole) |
| `TASK_NEEDS_EVALUATOR` + transcript of separate evaluator scoring ≥8/10 | PASS |
| `TASK_NEEDS_EVALUATOR` + evaluator `ITERATE` verdict | RE-DISPATCH (do not declare done) |
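
The done-gate table can be read as a small decision function. The sketch below is an illustration, not the actual /goal hook: `goal_hook_verdict` and its parameters are hypothetical, and returning FAIL for combinations the table does not list (for example a score below 8 with no ITERATE verdict) is a conservative assumption.

```python
from typing import Optional


def goal_hook_verdict(
    signal: str,
    has_separate_evaluator_transcript: bool,
    score: Optional[int] = None,
    verdict: Optional[str] = None,
) -> str:
    """Map what the producing agent emitted to PASS, FAIL, or RE-DISPATCH."""
    if not has_separate_evaluator_transcript:
        # Self-audit or no evaluator at all: the substitution loophole.
        return "FAIL"
    if signal == "TASK_NEEDS_EVALUATOR":
        if verdict == "ITERATE":
            return "RE-DISPATCH"
        if score is not None and score >= 8:
            return "PASS"
    # Assumption: anything not covered by the table is treated as FAIL.
    return "FAIL"


if __name__ == "__main__":
    print(goal_hook_verdict("TASK_DONE", False))
    print(goal_hook_verdict("TASK_NEEDS_EVALUATOR", True, score=9))
    print(goal_hook_verdict("TASK_NEEDS_EVALUATOR", True, verdict="ITERATE"))
```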

Evidence: 4-of-4 /goal outputs 2026-05-17 night surfaced critical issues only when externally evaluated. P5 fix queue silently substituted self-audit for the required external evaluator replay (skillcreator-p5fix mine [1438]).

Good

- Best Pass Rate: 97% (Opus 4.6)
- Assertions: 3 models tested
- Avg Cost / Run: $0.1564 across models
- Fastest (p50): 1.8s (Haiku 4.5)

Behavior Evals

Phase 2 baseline — skill quality on Claude

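Behavior Baseline

| Model | Pass rate | Assertions passed |
|-------|-----------|-------------------|
| Opus 4.6 | 97% | 75/77 |
| Sonnet 4.6 | 88% | 68/77 |
| Haiku 4.5 | 77% | 59/77 |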

Assertion	Opus 4.6	Sonnet 4.6	Haiku 4.5	Consensus
scaffolds-folder-structure				3/3
includes-phase-dependencies				3/3
pr-loop-per-phase				3/3
includes-quality-gates				3/3
references-tdd				3/3
refuses-to-skip-quality-gate				2/3
suggests-scope-reduction				3/3
maintains-pr-loop				3/3
spawns-parallel-agents				3/3
uses-collab-files				2/3
defines-merge-strategy				3/3
uses-task-create				3/3
updates-collab-before-commits				3/3
brain-store-checkpoints				3/3
teaches-coderabbit				3/3
honesty-rule-present				3/3
sets-up-monitoring-loops				2/3
agent-entity-awareness				2/3
hook-file-before-register				2/3
reads-collab-file				2/3
checks-stale-working				3/3
suggests-loop-monitoring				3/3
checks-task-status				3/3
explicit-role-selection				2/3
managed-agent-registry				3/3
caller-workspace-default				3/3
focus-restore-after-pane-init				2/3
codex-lead-is-valid				3/3
placement-verification				3/3
subscription-nudge-contract				2/3
done-marker-required				3/3
health-failures-loud				3/3
reuse-before-spawn				2/3
full-file-backed-goal				2/3
goal-delivery-through-managed-agent				3/3
harness-command-state-verified				3/3
collab-registry-updated				3/3
workspace-topology-health				2/3
file-done-monitoring				2/3
no-duplicate-worker-by-default				2/3
closure-invariant-named				2/3
done-file-required-before-close				3/3
blocked-handoff-required				3/3
transfer-record-required				3/3
closure-without-artifact-health-failure				2/3
clean-workspace-not-proof				2/3
rejects-rename-promotion				3/3
managed-agent-id-required				2/3
orchestrator-left-placement				3/3
lead-goal-delegation				3/3
interrupted-spawn-cleanup				2/3
file-done-plus-task-done-wins				3/3
records-stale-registry-health				2/3
marks-done-closeable				3/3
consolidates-temporary-workers				3/3
does-not-wait-forever-on-idle				3/3
closure-invariant-preserved				1/3
approved-work-authorizes-pr-loop				3/3
mcp-restart-recoverable				3/3
codex-reconnect-successor				3/3
lead-monitor-must-act				2/3
post-recovery-verification				3/3
human-blockers-narrow				3/3
does-not-park				3/3
done-marker-counts				3/3
missing-notification-health-failure				2/3
requires-durable-wake-path				2/3
does-not-require-collab-post				3/3
requires-armed-lead-guard				3/3
records-session-inbox-health				3/3
lead-harvests-before-advancing				2/3
does-not-rely-on-manual-discovery				3/3
harvests-before-cleanup				3/3
keeps-only-with-purpose				1/3
records-kept-open-contract				2/3
closes-stale-completed-workers				3/3
flags-unexplained-visible-done-worker				3/3

Token Usage and Cost per Run

| Model | Input Tokens | Output Tokens | Total Tokens | Cost / Run | Cost / 1K Runs |
|-------|--------------|---------------|--------------|------------|----------------|
| Opus 4.6 | 4,137 | 4,909 | 9,046 | $0.4302 | $430.20 |
| Sonnet 4.6 | 1,960 | 2,049 | 4,009 | $0.0366 | $36.60 |
| Haiku 4.5 | 2,114 | 1,548 | 3,662 | $0.0025 | $2.50 |

Response Time (p50 / p95)

| Model | p50 | p95 | Overhead |
|-------|-----|-----|----------|
| Opus 4.6 | 6.8s | 10.2s | +50% |
| Sonnet 4.6 | 1.8s | 3.0s | +63% |
| Haiku 4.5 | 1.8s | 3.6s | +99% |

Last evaluated: 2026-03-12 · Data is generated from skill assertions (real cross-model benchmarks coming soon)

Workflows

/large-plan:collab
/large-plan:execute-phase
/large-plan:phase-evaluator
/large-plan:scaffold