helixevo 0.2.7 → 0.2.8

package/README.md CHANGED
@@ -12,7 +12,7 @@ HelixEvo builds on ideas from [EvoSkill](https://arxiv.org/abs/2603.02766) and [
  
  Every proposed change goes through:
  1. **3 independent LLM judges** (Task Completion, Correction Alignment, Side-Effect Check)
- 2. **Regression testing** against golden cases
+ 2. **Regression testing** against skill tests
  3. **3-day canary deployment** with auto-rollback
  
  ## Prerequisites
@@ -57,7 +57,7 @@ npm link
  ## Quick Start
  
  ```bash
- # 1. Initialize — imports existing skills + generates golden cases
+ # 1. Initialize — imports existing skills + generates skill tests
  helixevo init
  
  # 2. Capture failures from a session
@@ -80,7 +80,7 @@ helixevo dashboard
  | `helixevo watch` | Always-on learning: auto-capture + auto-evolve |
  | `helixevo metrics` | Correction rates, skill trends, evolution impact |
  | `helixevo health` | Network health: cohesion, coverage, balance, transfer |
- | `helixevo init` | Import existing skills + generate golden cases |
+ | `helixevo init` | Import existing skills + generate skill tests |
  | `helixevo capture <session>` | Extract failures from a session file |
  | `helixevo evolve` | Evolve skills from captured failures |
  | `helixevo generalize` | Promote cross-project patterns ↑ |
@@ -126,7 +126,7 @@ All data is stored in `~/.helix/`:
  ├── failures.jsonl # Captured failures
  ├── frontier.json # Pareto frontier (top-k configurations)
  ├── evolution-history.json # All evolution runs + proposals
- ├── golden-cases.jsonl # Regression test cases
+ ├── skill-tests.jsonl # Regression test cases
  ├── skill-graph.json # Cached network (nodes + edges)
  ├── canary-registry.json # Active canary deployments
  ├── knowledge-buffer.json # Research discoveries + drafts
@@ -122,7 +122,7 @@ function analyzeNetworkAdaptation(
  })
  result.suggestions.push({
  type: 'rewire',
- description: `Review golden cases for [${partners.join(', ')}] after this change`
+ description: `Review skill tests for [${partners.join(', ')}] after this change`
  })
  }
  }
@@ -148,7 +148,7 @@ function ArchitectureDiagram() {
  <div className="guide-diagram-box guide-diagram-check" style={{ direction: 'ltr' }}>
  <div className="guide-diagram-box-label">Validate</div>
  <div className="guide-diagram-box-title">Regression Tests</div>
- <div className="guide-diagram-box-desc">Golden cases + cross-skill</div>
+ <div className="guide-diagram-box-desc">Skill tests + cross-skill</div>
  </div>
  <div className="guide-diagram-arrow" style={{ direction: 'ltr' }}>←</div>
  <div className="guide-diagram-box guide-diagram-judge" style={{ direction: 'ltr' }}>
@@ -317,7 +317,7 @@ cd helixevo && npm install && npm run build && npm link`}</Code>
  <Code title="Terminal">{`helixevo init`}</Code>
  <p className="guide-text-sm">
  This scans your existing SKILL.md files (from <code>~/.agents/skills/</code>), imports them into HelixEvo,
- and generates golden test cases for each skill. It also creates the data directory at <code>~/.helix/</code>.
+ and generates skill tests for each skill. It also creates the data directory at <code>~/.helix/</code>.
  </p>
  </Step>
  
@@ -370,7 +370,7 @@ helixevo status`}</Code>
  },
  {
  cmd: 'helixevo init',
- desc: 'Import existing skills and generate golden test cases. Scans ~/.agents/skills/ and creates the HelixEvo data directory.',
+ desc: 'Import existing skills and generate skill tests. Scans ~/.agents/skills/ and creates the HelixEvo data directory.',
  flags: ['--verbose'],
  },
  {
@@ -545,7 +545,7 @@ Project B: "Use FlashList not FlatList" (React Native perf)
  → Abstract skill created: "react-native-performance" (domain layer)
  → Project A skill inherits from it
  → Project B skill inherits from it
- → Domain skill tested against all golden cases
+ → Domain skill tested against all skill tests
  → Deployed if regression passes`}</Code>
  <Callout type="tip">
  Auto-generalization is the key to the <strong>double helix</strong> metaphor: as projects evolve, skills
@@ -601,7 +601,7 @@ Project B: "Use FlashList not FlatList" (React Native perf)
  <div className="guide-pipeline-connector" />
  <PipelineStep
  icon="5" title="Regression Testing" color="var(--red)"
- desc="The modified skill is tested against all golden cases for that skill AND co-evolved partner skills. Must maintain ≥95% pass rate."
+ desc="The modified skill is tested against all skill tests for that skill AND co-evolved partner skills. Must maintain ≥95% pass rate."
  />
  <div className="guide-pipeline-connector" />
  <PipelineStep
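The regression gate described in the hunk above (own tests and partner tests, each held to a 95% pass rate) can be sketched as follows. This is a minimal illustration, not the package's code: `TestResult`, `passRate`, and `regressionGate` are hypothetical names, and evaluating the two rates separately is an assumption. The only detail taken from the diff is that an empty test set counts as a pass rate of 1, as `runRegression` returns when no tests exist.

```typescript
// Hypothetical shapes for illustration; not taken from the package.
type TestResult = { skill: string; passed: boolean };
type GateResult = { ownRate: number; partnerRate: number; accepted: boolean };

// An empty set counts as fully passing, mirroring `passRate: 1` in runRegression.
function passRate(results: TestResult[]): number {
  if (results.length === 0) return 1;
  return results.filter((r) => r.passed).length / results.length;
}

// Assumption: the skill's own tests and its partners' tests are gated separately.
function regressionGate(
  results: TestResult[],
  skill: string,
  partners: string[],
  minRate = 0.95,
): GateResult {
  const ownRate = passRate(results.filter((r) => r.skill === skill));
  const partnerRate = passRate(results.filter((r) => partners.includes(r.skill)));
  return { ownRate, partnerRate, accepted: ownRate >= minRate && partnerRate >= minRate };
}

// 19 of 20 own tests pass; 9 of 10 partner tests pass.
const results: TestResult[] = [
  ...Array.from({ length: 20 }, (_, i) => ({ skill: "react-patterns", passed: i !== 0 })),
  ...Array.from({ length: 10 }, (_, i) => ({ skill: "react-native-performance", passed: i !== 0 })),
];
const gate = regressionGate(results, "react-patterns", ["react-native-performance"]);
const gateNoPartners = regressionGate(results, "react-patterns", []);
console.log(gate, gateNoPartners);
```

In this example the skill's own rate sits exactly on the threshold and passes, while the partner rate of 0.9 blocks the proposal; with no partners the same change would be accepted.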
@@ -693,29 +693,29 @@ Project B: "Use FlashList not FlatList" (React Native perf)
  </Section>
  
  {/* ─── Regression Testing ─── */}
- <Section id="regression" title="Regression Testing" subtitle="Golden cases and cross-skill validation ensure quality.">
- <h3 className="guide-h3">Golden Cases</h3>
+ <Section id="regression" title="Regression Testing" subtitle="Skill tests and cross-skill validation ensure quality.">
+ <h3 className="guide-h3">Skill Tests</h3>
  <p className="guide-text">
- Golden cases are regression test scenarios tied to specific skills. They&apos;re created when:
+ Skill tests are regression test scenarios tied to specific skills. They&apos;re created when:
  </p>
  <ul className="guide-list">
  <li><strong>Init:</strong> Automatically generated from existing SKILL.md files during <code>helixevo init</code></li>
- <li><strong>Evolution:</strong> When a failure is resolved, the scenario is promoted to a golden case</li>
+ <li><strong>Evolution:</strong> When a failure is resolved, the scenario is promoted to a skill test</li>
  </ul>
  <p className="guide-text">
- Each golden case stores the input, context, and expected behavior. During regression testing,
+ Each skill test stores the input, context, and expected behavior. During regression testing,
  an LLM judge evaluates whether the modified skill would still handle each scenario correctly.
  </p>
  
  <h3 className="guide-h3">Cross-Skill Regression</h3>
  <p className="guide-text">
- When skill A is modified, HelixEvo also tests golden cases from co-evolved, dependent, and enhancing
+ When skill A is modified, HelixEvo also tests skill tests from co-evolved, dependent, and enhancing
  partner skills. This catches silent incompatibilities where changing one skill breaks a related skill&apos;s behavior.
  </p>
  <Code title="How it works">{`Skill A evolves
  → Load skill graph edges
  → Find partners (co-evolves, depends, enhances)
- → Test partner golden cases against Skill A's changes
+ → Test partner skill tests against Skill A's changes
  → Block if partner pass rate < 95%`}</Code>
  </Section>
  
@@ -796,10 +796,10 @@ generation: 3
  <div className="guide-params">
  <Param name="quality.judgePassScore" type="number" desc="Minimum judge score to pass (1-10)." def="7" />
  <Param name="quality.judgeConsensusMin" type="number" desc="Minimum judges that must pass." def="2" />
- <Param name="quality.regressionPassRate" type="number" desc="Minimum golden case pass rate (0-1)." def="0.95" />
+ <Param name="quality.regressionPassRate" type="number" desc="Minimum skill test pass rate (0-1)." def="0.95" />
  <Param name="quality.canaryDurationDays" type="number" desc="Days to monitor canary deployments." def="3" />
  <Param name="quality.autoRollbackThreshold" type="number" desc="Failure rate multiplier triggering rollback." def="1.5" />
- <Param name="quality.maxGoldenCases" type="number" desc="Maximum golden cases per skill." def="50" />
+ <Param name="quality.maxSkillTests" type="number" desc="Maximum skill tests per skill." def="50" />
  </div>
  
  <Code title="~/.helix/config.json">{`{
@@ -819,7 +819,7 @@ generation: 3
  "regressionPassRate": 0.95,
  "canaryDurationDays": 3,
  "autoRollbackThreshold": 1.5,
- "maxGoldenCases": 50
+ "maxSkillTests": 50
  }
  }`}</Code>
  </Section>
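The hunks shown here rename the `quality.maxGoldenCases` key to `quality.maxSkillTests` but do not show how an existing `~/.helix/config.json` that still uses the old key is handled; that logic, if any, is outside this diff. A minimal sketch of one way a user-side script could carry the old value forward, assuming nothing about the package's own `loadConfig` (the `renameLegacyKey` helper is hypothetical):

```typescript
// Hypothetical helper, not part of helixevo: copies a legacy
// `maxGoldenCases` value to `maxSkillTests` unless the new key is already set.
type Quality = Record<string, unknown>;

function renameLegacyKey(quality: Quality): Quality {
  const out: Quality = { ...quality };
  const legacy = out["maxGoldenCases"];
  delete out["maxGoldenCases"];
  // An explicit new-style value always wins over the legacy one.
  if (legacy !== undefined && out["maxSkillTests"] === undefined) {
    out["maxSkillTests"] = legacy;
  }
  return out;
}

const migrated = renameLegacyKey({ regressionPassRate: 0.95, maxGoldenCases: 30 });
const untouched = renameLegacyKey({ maxSkillTests: 50, maxGoldenCases: 30 });
console.log(migrated, untouched);
```

The helper returns a copy rather than mutating its input, so the original config object can still be inspected or written back unchanged.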
@@ -831,7 +831,7 @@ generation: 3
  ├── failures.jsonl # Captured failure records (append-only)
  ├── frontier.json # Pareto frontier (top-K programs)
  ├── evolution-history.json # All evolution iterations + proposals
- ├── golden-cases.jsonl # Regression test cases (append-only)
+ ├── skill-tests.jsonl # Regression test cases (append-only)
  ├── skill-graph.json # Cached network (nodes + edges)
  ├── canary-registry.json # Active canary deployments
  ├── knowledge-buffer.json # Research discoveries + drafts
@@ -858,7 +858,7 @@ generation: 3
  }`}</Code>
  </div>
  <div className="guide-data-card">
- <div className="guide-data-title">Golden Case</div>
+ <div className="guide-data-title">Skill Test</div>
  <Code>{`{
  "id": "gc_react_42",
  "skill": "react-patterns",
@@ -949,7 +949,7 @@ generation: 3
  </FAQItem>
  <FAQItem q="How does cross-skill regression work?">
  When Skill A evolves, HelixEvo checks the skill graph for co-evolved, dependent, and enhancing
- partners. It tests their golden cases against Skill A&apos;s changes. If partner pass rate drops below 95%,
+ partners. It tests their skill tests against Skill A&apos;s changes. If partner pass rate drops below 95%,
  the proposal is rejected.
  </FAQItem>
  <FAQItem q="How does the knowledge buffer work?">
@@ -89,7 +89,9 @@ export function loadHistory(): { iterations: Iteration[] } {
  return readJson<{ iterations: Iteration[] }>('evolution-history.json', { iterations: [] })
  }
  
- export function loadGoldenCases(): { id: string; skill: string; input: string }[] {
+ export function loadSkillTests(): { id: string; skill: string; input: string }[] {
+ const newFile = readJsonl('skill-tests.jsonl')
+ if (newFile.length > 0) return newFile
  return readJsonl('golden-cases.jsonl')
  }
  
@@ -126,7 +128,7 @@ export function getDashboardSummary() {
  const history = loadHistory()
  const buffer = loadBuffer()
  const canaries = loadCanaries()
- const goldenCases = loadGoldenCases()
+ const skillTests = loadSkillTests()
  
  const evolved = graph.nodes.filter(n => n.generation > 0)
  const totalProposals = history.iterations.flatMap(i => i.proposals)
@@ -141,6 +143,6 @@ export function getDashboardSummary() {
  evolution: { runs: history.iterations.length, accepted: accepted.length, rejected: rejected.length },
  buffer: { discoveries: buffer.discoveries.length, drafts: buffer.drafts.length },
  canaries: canaries.entries.length,
- goldenCases: goldenCases.length,
+ skillTests: skillTests.length,
  }
  }
package/dist/cli.js CHANGED
@@ -2129,7 +2129,7 @@ var init_config = __esm(() => {
  regressionPassRate: 0.95,
  canaryDurationDays: 3,
  autoRollbackThreshold: 1.5,
- maxGoldenCases: 50
+ maxSkillTests: 50
  },
  reporting: {
  schedule: "0 8 * * *",
@@ -9226,11 +9226,14 @@ function loadHistory() {
  function saveHistory(history) {
  writeJson("evolution-history.json", history);
  }
- function loadGoldenCases() {
+ function loadSkillTests() {
+ const newFile = readJsonl("skill-tests.jsonl");
+ if (newFile.length > 0)
+ return newFile;
  return readJsonl("golden-cases.jsonl");
  }
- function appendGoldenCase(gc) {
- appendJsonl("golden-cases.jsonl", gc);
+ function appendSkillTest(gc) {
+ appendJsonl("skill-tests.jsonl", gc);
  }
  function loadSkillGraph() {
  return readJson("skill-graph.json", {
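The hunk above gives `skill-tests.jsonl` precedence whenever it is non-empty and sends all appends to it. The sketch below replays that logic against an in-memory map to make one consequence visible: as the hunk reads, entries in a legacy `golden-cases.jsonl` stop being returned once the first new test is appended. `Files` and this `readJsonl` are hypothetical stand-ins for the package's on-disk helpers, and treating a missing file as empty is an assumption.

```typescript
// Sketch only: `Files` and this `readJsonl` are hypothetical stand-ins for the
// package's on-disk helpers, so the example runs without touching ~/.helix/.
type SkillTest = { id: string; skill: string; input: string };
type Files = Record<string, string>;

// Parse one JSON object per non-empty line; a missing file yields [] (assumed).
function readJsonl(files: Files, name: string): SkillTest[] {
  const text = files[name] ?? "";
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as SkillTest);
}

// Same precedence as the hunk: the new file wins whenever it is non-empty.
function loadSkillTests(files: Files): SkillTest[] {
  const newFile = readJsonl(files, "skill-tests.jsonl");
  if (newFile.length > 0) return newFile;
  return readJsonl(files, "golden-cases.jsonl");
}

// Appends go to the new file only, as in appendSkillTest.
function appendSkillTest(files: Files, test: SkillTest): void {
  files["skill-tests.jsonl"] = (files["skill-tests.jsonl"] ?? "") + JSON.stringify(test) + "\n";
}

// A data directory that only has the legacy file.
const files: Files = {
  "golden-cases.jsonl": JSON.stringify({ id: "gc_a_1", skill: "a", input: "legacy" }) + "\n",
};
const beforeAppend = loadSkillTests(files).map((t) => t.id);
appendSkillTest(files, { id: "gc_a_2", skill: "a", input: "new" });
const afterAppend = loadSkillTests(files).map((t) => t.id);
console.log(beforeAppend, afterAppend);
```

If legacy cases should keep running after the upgrade, copying the contents of `golden-cases.jsonl` into `skill-tests.jsonl` before the first append would be one way to preserve them under this reading of the code.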
@@ -9577,8 +9580,8 @@ import { join as join3 } from "node:path";
  import { homedir as homedir2 } from "node:os";
  import { existsSync as existsSync4, cpSync } from "node:fs";
  
- // src/prompts/golden-gen.ts
- function buildGoldenGenPrompt(skill) {
+ // src/prompts/test-gen.ts
+ function buildTestGenPrompt(skill) {
  return `Read this skill and generate 3 typical usage scenarios where the skill should guide correct behavior.
  
  ## Skill: ${skill.meta.name}
@@ -9650,13 +9653,13 @@ async function initCommand(options) {
  console.log(`
  Imported ${imported} new skills
  `);
- if (!options.skipGolden) {
+ if (!options.skipTests) {
  const generalSkills = loadAllGeneralSkills();
- console.log(` Generating golden cases...
+ console.log(` Generating skill tests...
  `);
  for (const skill of generalSkills) {
  try {
- const prompt = buildGoldenGenPrompt(skill);
+ const prompt = buildTestGenPrompt(skill);
  const output = await chatJson({ prompt });
  for (const c of output.cases) {
  const gc = {
@@ -9671,11 +9674,11 @@ async function initCommand(options) {
  lastResult: "pass",
  consecutivePasses: 1
  };
- appendGoldenCase(gc);
+ appendSkillTest(gc);
  }
- console.log(` ✓ ${skill.slug}: ${output.cases.length} golden cases`);
+ console.log(` ✓ ${skill.slug}: ${output.cases.length} skill tests`);
  } catch (err) {
- console.log(` ✗ ${skill.slug}: failed to generate golden cases (${err})`);
+ console.log(` ✗ ${skill.slug}: failed to generate skill tests (${err})`);
  }
  }
  }
@@ -9989,11 +9992,11 @@ init_config();
  init_llm();
  async function runRegression(skillSlug, newSkillContent, verbose = false) {
  const config = loadConfig();
- const allCases = loadGoldenCases();
+ const allCases = loadSkillTests();
  const cases = allCases.filter((gc) => gc.skill === skillSlug);
  if (cases.length === 0) {
  if (verbose)
- console.log(` No golden cases for ${skillSlug}, skipping regression`);
+ console.log(` No skill tests for ${skillSlug}, skipping regression`);
  return { total: 0, passed: 0, passRate: 1, failures: [] };
  }
  const failures = [];
@@ -10020,7 +10023,7 @@ async function runRegression(skillSlug, newSkillContent, verbose = false) {
  failures
  };
  }
- function promoteToGoldenCase(failure, skillSlug, replayResult) {
+ function promoteToSkillTest(failure, skillSlug, replayResult) {
  const gc = {
  id: `gc_${skillSlug}_${Date.now() % 1e5}`,
  addedAt: new Date().toISOString(),
@@ -10033,7 +10036,7 @@ function promoteToGoldenCase(failure, skillSlug, replayResult) {
  lastResult: "pass",
  consecutivePasses: 1
  };
- appendGoldenCase(gc);
+ appendSkillTest(gc);
  }
  function buildRegressionJudgePrompt(gc, skillContent) {
  return `You are a regression test judge. Determine if a modified skill can still handle this scenario correctly.
@@ -10072,7 +10075,7 @@ async function runCrossSkillRegression(skillSlug, newSkillContent, verbose = fal
  return { total: 0, passed: 0, passRate: 1, testedSkills: [] };
  }
  const config = loadConfig();
- const allCases = loadGoldenCases();
+ const allCases = loadSkillTests();
  const partnerCases = allCases.filter((gc) => partners.includes(gc.skill));
  if (partnerCases.length === 0) {
  return { total: 0, passed: 0, passRate: 1, testedSkills: partners };
@@ -10496,7 +10499,7 @@ async function evolveCommand(options) {
  const skillFailureCount = allFailures.filter((f) => f.skillsActive.includes(skillSlug2)).length;
  deployCanary(skillSlug2, `v${generation}`, backupPath, config.quality.canaryDurationDays, skillFailureCount);
  console.log(` \uD83D\uDC25 Canary deployed: ${config.quality.canaryDurationDays} day monitoring period`);
- promoteToGoldenCase(testFailure, skillSlug2, replayResult);
+ promoteToSkillTest(testFailure, skillSlug2, replayResult);
  const program2 = {
  id: `gen${generation}-${skillSlug2}`,
  generation,
@@ -10703,7 +10706,7 @@ async function statusCommand() {
  const unresolved = failures.filter((f) => !f.resolved);
  const frontier = loadFrontier();
  const skills = loadAllGeneralSkills();
- const goldenCases = loadGoldenCases();
+ const skillTests = loadSkillTests();
  const stagnation = getStagnationCount();
  const recentIter = getRecentIterations(7);
  console.log(`\uD83E\uDDEC HelixEvo Status
@@ -10722,7 +10725,7 @@ async function statusCommand() {
  }
  console.log(`
  Failures: ${unresolved.length} unresolved / ${failures.length} total`);
- console.log(` Golden cases: ${goldenCases.length}`);
+ console.log(` Skill tests: ${skillTests.length}`);
  const buffer = getBufferStats();
  console.log(`
  Knowledge Buffer:`);
@@ -12924,12 +12927,12 @@ async function metricsCommand(options) {
  
  // src/cli.ts
  var program2 = new Command;
- program2.name("helixevo").description("Self-evolving skill ecosystem for AI agents").version("0.2.7").addHelpText("after", `
+ program2.name("helixevo").description("Self-evolving skill ecosystem for AI agents").version("0.2.8").addHelpText("after", `
  Examples:
  $ helixevo watch Always-on learning (auto-capture + auto-evolve)
  $ helixevo watch --project myapp Watch with project context
  $ helixevo metrics Show correction rates and evolution impact
- $ helixevo init Import skills + generate golden cases
+ $ helixevo init Import skills + generate skill tests
  $ helixevo status Show system health
  $ helixevo evolve --verbose Evolve skills from failures
  $ helixevo evolve --dry-run Preview proposals without applying
@@ -12943,7 +12946,7 @@ Examples:
  $ helixevo graph --optimize Detect merge/split/conflicts
  $ helixevo report --days 7 Weekly evolution report
  $ helixevo capture <session.json> Extract failures from session`);
- program2.command("init").description("Import existing skills + generate golden cases").option("--skills-paths <paths...>", "Paths to scan for existing skills").option("--skip-golden", "Skip golden case generation").action(initCommand);
+ program2.command("init").description("Import existing skills + generate skill tests").option("--skills-paths <paths...>", "Paths to scan for existing skills").option("--skip-tests", "Skip skill test generation").action(initCommand);
  program2.command("capture").description("Capture failures from a Craft Agent session").argument("<sessionPath>", "Path to session conversation file").option("--project <name>", "Project name override").action(captureCommand);
  program2.command("evolve").description("Evolve skills from failures [--dry-run] [--verbose] [--max-proposals <n>]").option("--dry-run", "Show proposals without applying").option("--verbose", "Show detailed LLM interactions").option("--max-proposals <n>", "Max proposals per run", "5").action(evolveCommand);
  program2.command("generalize").description("Promote cross-skill patterns to higher layer ↑ [--dry-run] [--verbose]").option("--dry-run", "Show candidates without applying").option("--verbose", "Show detailed analysis").action(generalizeCommand);
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "helixevo",
- "version": "0.2.7",
+ "version": "0.2.8",
  "description": "Self-evolving skill ecosystem for AI agents. Skills and projects co-evolve through multi-judge evaluation and a Pareto frontier.",
  "type": "module",
  "bin": {