agentv 2.2.0 → 2.5.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,77 +1,78 @@
- ---
- description: Iteratively optimize prompt files against AgentV evaluation datasets by analyzing failures and refining instructions.
- ---
-
- # AgentV Prompt Optimizer
-
- ## Input Variables
- - `eval-path`: Path or glob pattern to the AgentV evaluation file(s) to optimize against
- - `optimization-log-path` (optional): Path where optimization progress should be logged
-
- ## Workflow
-
- 1. **Initialize**
- - Verify `<eval-path>` (file or glob) targets the correct system.
- - **Identify Prompt Files**:
- - Infer prompt files from the eval file content (look for `file:` references in `input_messages` that match these patterns).
- - Recursively check referenced prompt files for *other* prompt references (dependencies).
- - If multiple prompts are found, consider ALL of them as candidates for optimization.
- - **Identify Optimization Log**:
- - If `<optimization-log-path>` is provided, use it.
- - If not, create a new one in the parent directory of the eval files: `optimization-[timestamp].md`.
- - Read content of the identified prompt file.
-
- 2. **Optimization Loop** (Max 10 iterations)
- - **Execute (The Generator)**: Run `agentv eval <eval-path>`.
- - *Targeted Run*: If iterating on specific stubborn failures, use `--eval-id <case_id>` to run only the relevant eval cases.
- - **Analyze (The Reflector)**:
- - Locate the results file path from the console output (e.g., `.agentv/results/eval_...jsonl`).
- - **Orchestrate Subagent**: Use `runSubagent` to analyze the results.
- - **Task**: Read the results file, calculate pass rate, and perform root cause analysis.
- - **Output**: Return a structured analysis including:
- - **Score**: Current pass rate.
- - **Root Cause**: Why failures occurred (e.g., "Ambiguous definition", "Hallucination").
- - **Insight**: Key learning or pattern identified from the failures.
- - **Strategy**: High-level plan to fix the prompt (e.g., "Clarify section X", "Add negative constraint").
- - **Decide**:
- - If **100% pass**: STOP and report success.
- - If **Score decreased**: Revert last change, try different approach.
- - If **No improvement** (2x): STOP and report stagnation.
- - **Refine (The Curator)**:
- - **Orchestrate Subagent**: Use `runSubagent` to apply the fix.
- - **Task**: Read the relevant prompt file(s), apply the **Strategy** from the Reflector, and generate the log entry.
- - **Output**: The **Log Entry** describing the specific operation performed.
- ```markdown
- ### Iteration [N]
- - **Operation**: [ADD / UPDATE / DELETE]
- - **Target**: [Section Name]
- - **Change**: [Specific text added/modified]
- - **Trigger**: [Specific failing test case or error pattern]
- - **Rationale**: [From Reflector: Root Cause]
- - **Score**: [From Reflector: Current Pass Rate]
- - **Insight**: [From Reflector: Key Learning]
- ```
- - **Strategy**: Treat the prompt as a structured set of rules. Execute atomic operations:
- - **ADD**: Insert a new rule if a constraint was missed.
- - **UPDATE**: Refine an existing rule to be clearer or more general.
- - *Clarify*: Make ambiguous instructions specific.
- - *Generalize*: Refactor specific fixes into high-level principles (First Principles).
- - **DELETE**: Remove obsolete, redundant, or harmful rules.
- - *Prune*: If a general rule covers specific cases, delete the specific ones.
- - **Negative Constraint**: If hallucinating, explicitly state what NOT to do. Prefer generalized prohibitions over specific forbidden tokens where possible.
- - **Safety Check**: Ensure new rules don't contradict existing ones (unless intended).
- - **Constraint**: Avoid rewriting large sections. Make surgical, additive changes to preserve existing behavior.
- - **Log Result**:
- - Append the **Log Entry** returned by the Curator to the optimization log file.
-
- 3. **Completion**
- - Report final score.
- - Summarize key changes made to the prompt.
- - **Finalize Optimization Log**: Add a summary header to the optimization log file indicating the session completion and final score.
-
- ## Guidelines
- - **Generalization First**: Prefer broad, principle-based guidelines over specific examples or "hotfixes". Only use specific rules if generalized instructions fail to achieve the desired score.
- - **Simplicity ("Less is More")**: Avoid overfitting to the test set. If a specific rule doesn't significantly improve the score compared to a general one, choose the general one.
- - **Structure**: Maintain existing Markdown headers/sections.
- - **Progressive Disclosure**: If the prompt grows too large (>200 lines), consider moving specialized logic into a separate file or skill.
- - **Quality Criteria**: Ensure the prompt defines a clear persona, specific task, and measurable success criteria.
+ ---
+ name: agentv-prompt-optimizer
+ description: Iteratively optimize prompt files against AgentV evaluation datasets by analyzing failures and refining instructions.
+ ---
+
+ # AgentV Prompt Optimizer
+
+ ## Input Variables
+ - `eval-path`: Path or glob pattern to the AgentV evaluation file(s) to optimize against
+ - `optimization-log-path` (optional): Path where optimization progress should be logged
+
+ ## Workflow
+
+ 1. **Initialize**
+ - Verify `<eval-path>` (file or glob) targets the correct system.
+ - **Identify Prompt Files**:
+ - Infer prompt files from the eval file content (look for `file:` references in `input_messages` that match these patterns).
+ - Recursively check referenced prompt files for *other* prompt references (dependencies).
+ - If multiple prompts are found, consider ALL of them as candidates for optimization.
+ - **Identify Optimization Log**:
+ - If `<optimization-log-path>` is provided, use it.
+ - If not, create a new one in the parent directory of the eval files: `optimization-[timestamp].md`.
+ - Read content of the identified prompt file.
+
+ 2. **Optimization Loop** (Max 10 iterations)
+ - **Execute (The Generator)**: Run `agentv eval <eval-path>`.
+ - *Targeted Run*: If iterating on specific stubborn failures, use `--eval-id <case_id>` to run only the relevant eval cases.
+ - **Analyze (The Reflector)**:
+ - Locate the results file path from the console output (e.g., `.agentv/results/eval_...jsonl`).
+ - **Orchestrate Subagent**: Use `runSubagent` to analyze the results.
+ - **Task**: Read the results file, calculate pass rate, and perform root cause analysis.
+ - **Output**: Return a structured analysis including:
+ - **Score**: Current pass rate.
+ - **Root Cause**: Why failures occurred (e.g., "Ambiguous definition", "Hallucination").
+ - **Insight**: Key learning or pattern identified from the failures.
+ - **Strategy**: High-level plan to fix the prompt (e.g., "Clarify section X", "Add negative constraint").
+ - **Decide**:
+ - If **100% pass**: STOP and report success.
+ - If **Score decreased**: Revert last change, try different approach.
+ - If **No improvement** (2x): STOP and report stagnation.
+ - **Refine (The Curator)**:
+ - **Orchestrate Subagent**: Use `runSubagent` to apply the fix.
+ - **Task**: Read the relevant prompt file(s), apply the **Strategy** from the Reflector, and generate the log entry.
+ - **Output**: The **Log Entry** describing the specific operation performed.
+ ```markdown
+ ### Iteration [N]
+ - **Operation**: [ADD / UPDATE / DELETE]
+ - **Target**: [Section Name]
+ - **Change**: [Specific text added/modified]
+ - **Trigger**: [Specific failing test case or error pattern]
+ - **Rationale**: [From Reflector: Root Cause]
+ - **Score**: [From Reflector: Current Pass Rate]
+ - **Insight**: [From Reflector: Key Learning]
+ ```
+ - **Strategy**: Treat the prompt as a structured set of rules. Execute atomic operations:
+ - **ADD**: Insert a new rule if a constraint was missed.
+ - **UPDATE**: Refine an existing rule to be clearer or more general.
+ - *Clarify*: Make ambiguous instructions specific.
+ - *Generalize*: Refactor specific fixes into high-level principles (First Principles).
+ - **DELETE**: Remove obsolete, redundant, or harmful rules.
+ - *Prune*: If a general rule covers specific cases, delete the specific ones.
+ - **Negative Constraint**: If hallucinating, explicitly state what NOT to do. Prefer generalized prohibitions over specific forbidden tokens where possible.
+ - **Safety Check**: Ensure new rules don't contradict existing ones (unless intended).
+ - **Constraint**: Avoid rewriting large sections. Make surgical, additive changes to preserve existing behavior.
+ - **Log Result**:
+ - Append the **Log Entry** returned by the Curator to the optimization log file.
+
+ 3. **Completion**
+ - Report final score.
+ - Summarize key changes made to the prompt.
+ - **Finalize Optimization Log**: Add a summary header to the optimization log file indicating the session completion and final score.
+
+ ## Guidelines
+ - **Generalization First**: Prefer broad, principle-based guidelines over specific examples or "hotfixes". Only use specific rules if generalized instructions fail to achieve the desired score.
+ - **Simplicity ("Less is More")**: Avoid overfitting to the test set. If a specific rule doesn't significantly improve the score compared to a general one, choose the general one.
+ - **Structure**: Maintain existing Markdown headers/sections.
+ - **Progressive Disclosure**: If the prompt grows too large (>200 lines), consider moving specialized logic into a separate file or skill.
+ - **Quality Criteria**: Ensure the prompt defines a clear persona, specific task, and measurable success criteria.
@@ -1,5 +1,5 @@
- ---
- description: 'Create and maintain AgentV YAML evaluation files'
- ---
-
+ ---
+ description: 'Create and maintain AgentV YAML evaluation files'
+ ---
+
  #file:../../.claude/skills/agentv-eval-builder/SKILL.md
@@ -1,4 +1,4 @@
- ---
- description: Iteratively optimize prompt files against an AgentV evaluation suite
- ---
+ ---
+ description: Iteratively optimize prompt files against an AgentV evaluation suite
+ ---
  #file:../../.claude/skills/agentv-prompt-optimizer/SKILL.md
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "agentv",
- "version": "2.2.0",
+ "version": "2.5.2",
  "description": "CLI entry point for AgentV",
  "type": "module",
  "repository": {
@@ -14,10 +14,7 @@
  "bin": {
  "agentv": "./dist/cli.js"
  },
- "files": [
- "dist",
- "README.md"
- ],
+ "files": ["dist", "README.md"],
  "scripts": {
  "dev": "bun --watch src/index.ts",
  "build": "tsup && bun run copy-readme",
@@ -31,7 +28,7 @@
  "test:watch": "bun test --watch"
  },
  "dependencies": {
- "@agentv/core": "2.0.2",
+ "@agentv/core": "2.5.1",
  "@mariozechner/pi-agent": "^0.9.0",
  "@mariozechner/pi-ai": "^0.37.2",
  "cmd-ts": "^0.14.3",