Other

/ecosystem-health

Run ecosystem health checks — MCP connections, BrainLayer stats, skill evals, friction scans. Use this skill when asked about ecosystem health, maintenance checks, skill monitoring, 'is everything working', 'run a health check', 'what's broken', or when proactively auditing the system. Also triggers for 'maintenance Claude', 'ecosystem audit', 'skill eval', 'MCP status', or 'BrainLayer health'. Run this before and after major changes to catch regressions.

$ golems-cli skills install ecosystem-health

Experimental

Updated 2 months ago

Audit the golems ecosystem: MCP servers, BrainLayer, skills, friction patterns. Two modes: quick (2-3 min) and deep (10-15 min).

Why This Exists

The ecosystem has 22 repos, 45+ skills, 2 MCP servers (BrainLayer, VoiceLayer), a 7GB semantic memory store, and multiple long-lived Claude sessions. Things break silently — MCP disconnects, WAL files grow, skills degrade, friction accumulates. This skill catches those problems before the user notices.

Quick Check (weekly)

Run these in order. Stop and report if anything is red.

1. MCP Connectivity

# BrainLayer
brain_recall(mode="stats")
# Expected: chunk count, entity count, last enrichment time
# RED if: timeout, error, or chunk count dropped
 
# VoiceLayer
voice_speak(message="Health check ping", mode="think")
# Expected: silent log, no error
# RED if: timeout or MCP unavailable

2. BrainLayer Vitals

python3 -c "
import os, sqlite3, sys

# Auto-detect DB path (brainlayer moved from zikaron.db to brainlayer.db)
candidates = [
    '~/.local/share/brainlayer/brainlayer.db',
    '~/.local/share/zikaron/zikaron.db',
]
db = None
for c in candidates:
    p = os.path.expanduser(c)
    # Check for main file OR WAL-only mode (shm exists)
    if os.path.exists(p) or os.path.exists(p + '-shm'):
        db = p
        break
if not db:
    print('RED: No BrainLayer DB found at known paths')
    sys.exit(1)
 
print(f'DB path: {db}')
db_exists = os.path.exists(db)
db_size = os.path.getsize(db) if db_exists else 0
wal_path = db + '-wal'
wal_size = os.path.getsize(wal_path) if os.path.exists(wal_path) else 0
shm_exists = os.path.exists(db + '-shm')
 
# If main file missing but shm exists, DB is in WAL-only mode (MCP holds it open)
if not db_exists and shm_exists:
    print('DB file missing but -shm exists: WAL-only mode (MCP holds DB in memory)')
    print(f'WAL size: {wal_size / 1e6:.0f} MB')
    print('Use brain_recall(mode=\"stats\") to verify chunk count via MCP')
else:
    conn = sqlite3.connect(f'file:{db}?mode=ro', uri=True)
    chunks = conn.execute('SELECT COUNT(*) FROM chunks').fetchone()[0]
    print(f'Chunks: {chunks}')
    print(f'DB size: {db_size / 1e9:.1f} GB')
    print(f'WAL size: {wal_size / 1e6:.0f} MB')
    print(f'WAL ratio: {wal_size / db_size * 100:.1f}%' if db_size > 0 else 'WAL ratio: N/A')
    conn.close()
"

Thresholds:

WAL > 100MB = YELLOW (checkpoint needed)
WAL > 500MB = RED (queries will time out)
DB growing > 500MB/week = investigate enrichment
Chunk count dropped = RED (data loss)
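The thresholds above can be sketched as a small classifier. This is an illustration only: the function name classify_vitals and its parameters are hypothetical, and megabytes are taken as 1e6 bytes to match the vitals script.

```python
# Hypothetical helper: maps BrainLayer vitals onto the GREEN/YELLOW/RED scale.
WAL_YELLOW_BYTES = 100 * 1_000_000  # WAL > 100MB = YELLOW
WAL_RED_BYTES = 500 * 1_000_000     # WAL > 500MB = RED


def classify_vitals(wal_bytes, chunks, prev_chunks=None):
    """Return 'GREEN', 'YELLOW', or 'RED' for one set of measurements."""
    # A drop in chunk count is treated as data loss, regardless of WAL size.
    if prev_chunks is not None and chunks < prev_chunks:
        return "RED"
    if wal_bytes > WAL_RED_BYTES:
        return "RED"
    if wal_bytes > WAL_YELLOW_BYTES:
        return "YELLOW"
    return "GREEN"
```

Passing the chunk count from the previous run as prev_chunks is one way to detect a drop; where that previous value is stored is left open here.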

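Fix WAL:

python3 -c "import sqlite3,os; c=sqlite3.connect(os.path.expanduser('~/.local/share/brainlayer/brainlayer.db')); c.execute('PRAGMA wal_checkpoint(PASSIVE)'); c.close()"

If the WAL does not shrink, kill stale brainlayer-mcp processes first (pgrep -fl brainlayer), then retry with RESTART. By default, PASSIVE and RESTART leave the WAL file at its current size on disk for reuse; only wal_checkpoint(TRUNCATE) resets the file to zero bytes.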
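A minimal sketch of a reusable checkpoint helper, assuming direct access to the SQLite file. The name checkpoint_wal is hypothetical. TRUNCATE is SQLite's checkpoint mode that resets the WAL file to zero bytes once the checkpoint completes; the row returned by the pragma is (busy, log, checkpointed).

```python
import os
import sqlite3

ALLOWED_MODES = {"PASSIVE", "FULL", "RESTART", "TRUNCATE"}


def checkpoint_wal(db_path, mode="TRUNCATE"):
    """Run a WAL checkpoint and return SQLite's (busy, log, checkpointed) row."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown checkpoint mode: {mode}")
    conn = sqlite3.connect(os.path.expanduser(db_path))
    try:
        # busy == 1 means another connection blocked the checkpoint.
        return conn.execute(f"PRAGMA wal_checkpoint({mode})").fetchone()
    finally:
        conn.close()
```

If the first value in the returned row is 1, another process still holds the database, which matches the advice to kill stale brainlayer-mcp processes before retrying.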

3. MCP Process Health

# Count brainlayer-mcp processes (should be 1-3, not 7+)
pgrep -fl brainlayer-mcp | wc -l
# Count voicelayer-mcp processes
pgrep -fl voicelayer-mcp | wc -l

Thresholds:

brainlayer-mcp > 4 = YELLOW (stale sessions, kill oldest)
voicelayer-mcp > 3 = YELLOW
Any MCP at 100% CPU = RED (check with top -l 1 -pid PID)
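The process-count thresholds could be applied programmatically along these lines. The names count_processes, classify_count, and LIMITS are hypothetical; the sketch assumes pgrep is on the PATH and does not cover the CPU check.

```python
import subprocess

# YELLOW when the count exceeds these limits (taken from the thresholds above).
LIMITS = {"brainlayer-mcp": 4, "voicelayer-mcp": 3}


def count_processes(pattern):
    """Count processes whose full command line matches pattern (pgrep -f)."""
    result = subprocess.run(
        ["pgrep", "-f", pattern], capture_output=True, text=True
    )
    # pgrep exits non-zero with empty stdout when nothing matches.
    return len(result.stdout.split())


def classify_count(count, yellow_above):
    """Return 'YELLOW' if count exceeds the limit, else 'GREEN'."""
    return "YELLOW" if count > yellow_above else "GREEN"


def process_report():
    """Return {process name: (count, status)} for every entry in LIMITS."""
    report = {}
    for name, limit in LIMITS.items():
        count = count_processes(name)
        report[name] = (count, classify_count(count, limit))
    return report
```

Calling process_report() returns one (count, status) pair per MCP server, ready to drop into the report table.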

4. Skill Inventory

# Top-level skill directories (sub-skills are discovered recursively)
ls ~/.claude/commands/ | wc -l
# Expected: ~10-12 top-level skill dirs (each contains nested sub-skills)
# Claude Code discovers SKILL.md files recursively within these
# Full skill count visible in session via /skills
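# Count broken symlinks (expected: 0); ls never prints the word "broken", so test each link target instead
find ~/.claude/commands/ -maxdepth 1 -type l ! -exec test -e {} \; -print | wc -l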
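For a check that names the offending entries rather than only counting them, a broken-symlink scan might look like this. The function name broken_symlinks is hypothetical; the directory is the one used above.

```python
import os


def broken_symlinks(directory):
    """Return sorted names of symlinks in directory whose target is missing."""
    directory = os.path.expanduser(directory)
    broken = []
    for entry in os.scandir(directory):
        # is_symlink() inspects the link itself; os.path.exists() follows it.
        if entry.is_symlink() and not os.path.exists(entry.path):
            broken.append(entry.name)
    return sorted(broken)
```

For example, broken_symlinks("~/.claude/commands") returns an empty list when every skill link resolves.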

5. Active Sessions

# Check for zombie Claude sessions
pgrep -fl "claude" | grep -v "claude-code" | wc -l
# Check cmux workspace health
cmux list-workspaces
cmux surface-health

Deep Check (monthly)

Everything in Quick Check, plus:

6. Friction Scan

python3 ~/Gits/orchestrator/scripts/friction-scan.py --threshold 5

Compare results against previous scan (state file at ~/.local/share/brainlayer/friction-scan-state.json or ~/.local/share/zikaron/friction-scan-state.json). Look for:

New friction categories appearing
Same friction recurring (not getting fixed)
Friction count trending up or down
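The comparison above could be automated roughly as follows. The layout of friction-scan-state.json is not documented here, so this sketch assumes each scan can be reduced to a mapping of friction category to occurrence count; compare_friction and load_state are hypothetical names.

```python
import json
import os


def load_state(path):
    """Load a scan state file; assumed (not confirmed) to hold {category: count}."""
    with open(os.path.expanduser(path)) as f:
        return json.load(f)


def compare_friction(previous, current):
    """Summarize how the current scan differs from the previous one."""
    new = sorted(set(current) - set(previous))
    # Recurring: present with a non-zero count in both scans.
    recurring = sorted(
        c for c in current if previous.get(c, 0) > 0 and current[c] > 0
    )
    delta = sum(current.values()) - sum(previous.values())
    return {"new": new, "recurring": recurring, "delta": delta}
```

A positive delta means friction is trending up; entries under "recurring" are the ones that are not getting fixed.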

7. Skill Eval Sampling

Pick 3-5 skills from different domains. For each, check that the eval directory exists and has test cases:

for skill in coach pr-loop commit research cli-agents; do
  echo "=== $skill ==="
  ls ~/Gits/orchestrator/skill-evals/$skill/ 2>/dev/null || echo "NO EVALS"
done

For skills with evals, run one test case and verify it passes. Use the skill-creator eval framework if available.

8. Cross-Repo Staleness

Check which repos have recent activity:

for repo in golems brainlayer voicelayer orchestrator 6pm-mini domica etanheyman.com t3code; do
  last=$(git -C ~/Gits/$repo log -1 --format="%ar" 2>/dev/null || echo "not found")
  echo "$repo: $last"
done

Repos with no activity > 2 weeks during active development = investigate.
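The two-week rule can be checked numerically instead of reading relative dates by eye. A sketch, assuming git is on the PATH; last_commit_epoch and staleness are hypothetical names, and %ct is git's committer timestamp in Unix seconds.

```python
import os
import subprocess
import time

SECONDS_PER_DAY = 86400


def last_commit_epoch(repo_path):
    """Return the Unix time of the latest commit, or None if unavailable."""
    result = subprocess.run(
        ["git", "-C", os.path.expanduser(repo_path), "log", "-1", "--format=%ct"],
        capture_output=True,
        text=True,
    )
    out = result.stdout.strip()
    return int(out) if result.returncode == 0 and out else None


def staleness(last_commit, now=None, max_days=14):
    """Classify a repo as 'not found', 'stale', or 'active'."""
    if last_commit is None:
        return "not found"
    now = time.time() if now is None else now
    age_days = (now - last_commit) / SECONDS_PER_DAY
    return "stale" if age_days > max_days else "active"
```

For example, staleness(last_commit_epoch("~/Gits/golems")) yields one of the three labels for a single repo.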

9. BrainLayer Search Quality

Run 3 known-good queries and verify they return expected results:

brain_search("component reasoning brainlayer-session-start")
# Expected: manual-f9c5a44d5f3a4fef chunk

brain_search("friction patterns coachClaude categories")
# Expected: friction research from March 7

brain_search("orchestrator golem concept architecture")
# Expected: architectural overview

If any query returns empty or irrelevant results, search quality has degraded: check the FTS5 index and embedding generation.
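A known-good query check can be expressed as a small harness. brain_search is an MCP tool, so this sketch takes a search callable as a parameter; that callable, the function failing_queries, and the idea of matching on a substring of each result are all assumptions for illustration.

```python
def failing_queries(search, expectations):
    """Return the queries whose results do not mention the expected marker.

    search: callable taking a query string and returning a list of results
            (hypothetical wrapper around brain_search).
    expectations: mapping of query -> substring expected in some result.
    """
    failures = []
    for query, expected in expectations.items():
        results = search(query) or []
        found = any(expected.lower() in str(r).lower() for r in results)
        if not found:
            failures.append(query)
    return failures
```

An empty return value corresponds to "3/3 queries returned expected" in the report; any listed query warrants a closer look at the index.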

10. Hook Health

# Check hooks exist and are executable
ls -la ~/.claude/hooks/brainlayer-*.py
# Check hook settings
python3 -m json.tool ~/.claude/settings.json | grep -A5 '"hooks"'

Verify: SessionStart and UserPromptSubmit hooks are wired. No PostToolUse hooks (those cause hangs).
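The hook wiring rule can be verified against the parsed settings rather than by reading grep output. This sketch assumes settings.json keeps hooks under a top-level "hooks" object keyed by event name; check_hooks and load_settings are hypothetical names.

```python
import json
import os

REQUIRED_EVENTS = ("SessionStart", "UserPromptSubmit")
FORBIDDEN_EVENTS = ("PostToolUse",)  # reported above to cause hangs


def load_settings(path="~/.claude/settings.json"):
    """Parse the settings file into a dict."""
    with open(os.path.expanduser(path)) as f:
        return json.load(f)


def check_hooks(settings):
    """Return a list of problems; an empty list means the wiring looks right."""
    hooks = settings.get("hooks", {})
    problems = []
    for event in REQUIRED_EVENTS:
        if not hooks.get(event):
            problems.append(f"missing hook: {event}")
    for event in FORBIDDEN_EVENTS:
        if hooks.get(event):
            problems.append(f"unexpected hook: {event}")
    return problems
```

Running check_hooks(load_settings()) gives a list that can be pasted into the Actions Needed section of the report.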

Report Format

After running checks, produce a summary:

# Ecosystem Health Report — YYYY-MM-DD
 
## Status: GREEN / YELLOW / RED
 
### Quick Checks
| Check | Status | Value | Notes |
|-------|--------|-------|-------|
| BrainLayer MCP | GREEN | 297K+ chunks | responding in <1s |
| VoiceLayer MCP | GREEN | connected | 3 processes |
| WAL size | GREEN | 0MB | clean |
| MCP processes | GREEN | 4 brainlayer, 3 voicelayer | within limits |
| Skills count | GREEN | 10 top-level | all symlinks valid, sub-skills discovered recursively |
 
### Deep Checks (if run)
| Check | Status | Notes |
|-------|--------|-------|
| Friction scan | ... | N new candidates |
| Skill evals | ... | N/M passing |
| Search quality | ... | 3/3 queries returned expected |
 
### Actions Needed
1. [specific action]
2. [specific action]

Store the report in BrainLayer:

brain_store(content: "Ecosystem health report YYYY-MM-DD: [summary]", tags: ["health-check", "ecosystem", "maintenance"], importance: 7)
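The overall Status line in the report can be derived from the individual rows. The sketch below assumes the overall status is simply the worst status among the checks, which the report format implies but does not state; overall_status and render_row are hypothetical names.

```python
SEVERITY = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def overall_status(statuses):
    """Return the most severe status, or GREEN when no checks were run."""
    return max(statuses, key=SEVERITY.__getitem__, default="GREEN")


def render_row(check, status, value, notes):
    """Format one row of the Quick Checks table."""
    return f"| {check} | {status} | {value} | {notes} |"
```

For example, a report with one YELLOW row and no RED rows gets an overall status of YELLOW.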

Automation (future)

This skill is designed to be run by a Scheduled Task or cron job:

Weekly quick check: Monday 9am
Monthly deep check: First Sunday of month
On-demand: whenever user asks
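The schedule above can be expressed as a date rule for whatever scheduler ends up running the skill. This is a sketch only; due_check is a hypothetical name, and the 9am run time is left to the scheduler.

```python
import datetime


def due_check(day):
    """Return 'deep', 'quick', or None for a given calendar date."""
    # First Sunday of the month: a Sunday that falls on day 1 through 7.
    if day.weekday() == 6 and day.day <= 7:
        return "deep"
    # Every Monday: weekly quick check.
    if day.weekday() == 0:
        return "quick"
    return None
```

For example, due_check(datetime.date.today()) tells a daily job which mode, if any, to run.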

The friction-scan.py script already has state tracking. Future: add state tracking to all checks so we can trend over time.

Quick Install

golems-cli skills install ecosystem-health

Or paste into Claude Code:

Install and configure the ecosystem-health skill for Claude Code.
Download from github.com/EtanHey/golems/tree/master/skills/golem-powers/ecosystem-health
and symlink to ~/.claude/commands/ecosystem-health/. If ~/.golems/config.yaml
doesn't exist, run the setup wizard first. Then follow the First-Time
Setup section in the SKILL.md.

Grade: Experimental
Category: Other
Fixtures: None
Updated: 2 months ago
