@mutagent/cli 0.1.156 → 0.1.158
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +8 -7
- package/dist/bin/cli.js +132 -27
- package/dist/bin/cli.js.map +3 -3
- package/package.json +1 -1
package/README.md
CHANGED
@@ -251,8 +251,7 @@ mutagent auth login
 ```
 
 > `mutagent login` is the canonical command. `mutagent auth login` is preserved
-> as a back-compat alias; both
-> See `docs/cli-design-principles.md` → "Login Unification".
+> as a back-compat alias; both behave identically.
 
 ### 2. Post-Onboarding (Interactive)
 
@@ -656,12 +655,14 @@ bun run verify # Full verification (lint + typecheck + build + test)
 
 ---
 
-##
+## See Also
 
-- [
-- [
-- [
-- [
+- **[@mutagent/sdk](https://www.npmjs.com/package/@mutagent/sdk)** — TypeScript SDK for programmatic access
+- **[docs.mutagent.io](https://docs.mutagent.io)** — Full platform documentation
+- **[CLI Commands Reference](https://docs.mutagent.io/cli/commands)** — All commands with flags
+- **[Integration Guides](https://docs.mutagent.io/integrations/overview)** — Mastra, LangChain, LangGraph, Vercel AI
+- **[Tracing Setup](https://docs.mutagent.io/tracing/setup)** — OTel integration walkthrough
+- **[mutagent.io](https://mutagent.io)** — Homepage
 
 ---
 
package/dist/bin/cli.js
CHANGED
@@ -9623,7 +9623,7 @@ description: |
 
 1. **\`--json\` on EVERY command.** No exceptions. Agents use JSON mode exclusively.
 2. **\`<command> --help\` BEFORE first use of any command.** The CLI is the source of truth for flags — this SKILL never inlines them.
-3. **NEVER auto-generate eval criteria — collect from user.** Ask the user for each rubric field. See [concepts/eval-criteria.md](./concepts/eval-criteria.md) for the
+3. **NEVER auto-generate eval criteria — collect from user.** Ask the user for each rubric field. See [concepts/eval-criteria.md](./concepts/eval-criteria.md) for the rubric format.
 4. **Explore-before-modify.** Run \`mutagent explore --json\` before any write operation. Present findings, get user confirmation. Never mutate without discovery first.
 5. **Cost transparency before \`optimize start\`.** Run \`mutagent usage --json\` and show the result to the user. Get explicit confirmation before any optimization job.
 6. **Before optimizing, run \`mutagent providers list --models\` to verify available models.** This calls \`/providers/catalog\` and shows which models are available per provider. Use the output to pick valid \`--exec-model\` and \`--eval-model\` values.
@@ -9656,7 +9656,7 @@ Match the user's first request. Load ONLY the matching subfile. Do NOT preload t
 | [workflows/exploration.md](./workflows/exploration.md) | User wants to scan codebase, identify prompts vs agents | Read-only discovery; output taxonomy to user | Run only; no writes |
 | [workflows/agents.md](./workflows/agents.md) | Multi-turn / tool-calling code detected | WIP — do NOT attempt optimizer, surface partnership link | Show WIP card to user verbatim |
 | [concepts/prompt-variables.md](./concepts/prompt-variables.md) | Any question about \`{var}\` vs \`{{var}}\`, delimiter inference | Brace convention + conversion rules | Load before \`prompts create\` in optimization workflow |
-| [concepts/eval-criteria.md](./concepts/eval-criteria.md) | Any question about rubric design, MVC, Output Standards |
+| [concepts/eval-criteria.md](./concepts/eval-criteria.md) | Any question about rubric design, MVC, Output Standards | granular rubric format — INPUT-param vs OUTPUT-param scope | Load before \`evaluation create --guided\` in optimization workflow |
 
 ---
 
@@ -9768,7 +9768,8 @@ description: |
 Canonical source for MutagenT evaluation-criteria framing:
 INPUT-param criteria → Minimum Viable Context (MVC);
 OUTPUT-param criteria → Output Standards.
-
+Granular rubric discipline (match anchors to the dimension's observable quality levels; binary scoring (1.0/0.0) for yes/no checks):
+grounded, observable, never vague.
 Includes current platform validation rules for criterion shape.
 Mirrored in mutagent/src/modules/prompts/prompt-evaluations/README.md.
 triggers:
@@ -9817,17 +9818,84 @@ Do not skip any field. Do not pre-fill answers. The user must provide each rubri
 
 ---
 
+## Your rubric is the instruction to the G-Eval LLM
+
+The rubric text in each criterion's \`description\` field is read verbatim by
+the LLM-as-Judge (G-Eval). **The more precise your anchor descriptions, the
+more consistent and accurate the scores. Vague rubrics produce vague scores.**
+
+A rubric like "0.8 if mostly good, 0.2 if mostly bad" gives the judge no
+grounding — it will invent its own interpretation. A rubric with concrete
+tier definitions, observable characteristics, and specific examples locks the
+judge's interpretation to yours.
+
+The target quality bar is the G-Eval system's own internal scoring guidelines:
+each tier has a **named level**, a **score range**, a **definition**, **observable
+characteristics**, and a **concrete example**. Your rubrics should aspire to
+that same precision for the dimensions that matter most to your domain.
+
+---
+
 ## Input-param criteria → Minimum Viable Context (MVC)
 
 **Scope**: the \`{variables}\` the prompt template consumes. Each \`{variable}\` is
 an **input param**. The criterion asks: *is the information required for the
 prompt to succeed actually present?*
 
-###
+### Rubric format (match anchors to observable quality levels)
+
+Each anchor must be **observable** (a human can assign it by reading one input
+row) and **grounded** (describes a concrete property, not a feeling).
+
+\`\`\`
+"Evaluate the completeness and usability of the {variable} input field.
+
+Score 0.95-1.00 (Exceptional):
+All required context is present with rich, unambiguous detail. The prompt
+can produce a high-quality output without any hedging or assumption.
+Observable: full narrative prose, field-specific depth (e.g., >= 500 chars),
+no placeholder text, no ambiguous referents.
+Example: {document} = 800-word technical article with clear subject,
+context, and argument — summarizer has everything it needs.
+
+Score 0.80-0.90 (Adequate):
+All required context is present but with minor gaps or imprecision that
+may cause hedging. The prompt can attempt a reasonable answer.
+Observable: complete field with minor quality shortfalls
+(e.g., 150-499 chars, one ambiguous term).
+Example: {document} = 200-word product description with one unclear
+abbreviation — summarizer can work but may flag the ambiguity.
+
+Score 0.60-0.70 (Marginal):
+Most required context is present; the prompt can attempt an answer but
+the output will be noticeably incomplete or generic.
+Observable: partial field content, missing secondary context
+(e.g., 50-149 chars, missing key metadata).
+Example: {document} = three-sentence product blurb with no technical
+specifics — summarizer produces a generic response.
+
+Score 0.40-0.50 (Insufficient):
+Partial context only. The prompt will produce a low-quality or generic
+response that cannot be acted on.
+Observable: very short field (< 50 chars), or content present but off-topic.
+Example: {document} = "See attached" — summarizer has nothing to work with.
+
+Score 0.20-0.30 (Minimal):
+Only a stub or placeholder. The prompt will fail or produce a useless
+response.
+Observable: summary stubs, auto-generated filler, one-word answers.
+Example: {document} = "TBD" or "N/A" — useless for any summarization.
+
+Score 0.00-0.10 (Absent):
+Critical context is missing. The prompt cannot succeed regardless of model.
+Observable: empty string, null, filename without content, TODO marker.
+Example: {document} = "" or null or 'document.pdf'."
+\`\`\`
 
--
--
-
+> **Binary checks use two-anchor scoring (1.0 / 0.0) — the criterion is either satisfied or not** (e.g., "Is the
+> document non-empty?"). For any spectrum dimension — quality, completeness,
+> accuracy — use as many scoring anchors as the dimension has meaningfully distinguishable quality levels, so the optimizer has fine-grained signal to
+> act on. Binary checks (yes/no) need 2 anchors. Spectrum dimensions (quality, completeness, accuracy) typically need 4-8, depending on how many distinct quality levels a human could reliably tell apart.
 
 ### Discipline rules
 
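Editor's note: the section added in this hunk pins the target rubric anatomy to five parts (named level, score range, definition, observable characteristics, concrete example). That structure is regular enough to assemble rubric strings from data rather than by hand. A minimal TypeScript sketch of that assembly, with hypothetical helper names that are not part of the shipped CLI:

```typescript
// Sketch only: model the five-part tier anatomy the skill text describes,
// then render it into the rubric string that goes in a criterion's
// `description` field. All names here are illustrative assumptions.
interface RubricTier {
  level: string;            // named level, e.g. "Exceptional"
  range: [number, number];  // score range, e.g. [0.95, 1.0]
  definition: string;       // what this tier means for the dimension
  observable: string;       // properties a reader can verify directly
  example: string;          // one concrete instance of the tier
}

function renderRubric(intro: string, tiers: RubricTier[]): string {
  const body = tiers
    .map(
      (t) =>
        `Score ${t.range[0].toFixed(2)}-${t.range[1].toFixed(2)} (${t.level}):\n` +
        `${t.definition}\n` +
        `Observable: ${t.observable}\n` +
        `Example: ${t.example}`,
    )
    .join("\n\n");
  return `${intro}\n\n${body}`;
}
```

A rubric built this way cannot silently drop a definition, observable, or example from any tier, which is exactly the failure mode the new section warns against.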
@@ -9845,27 +9913,38 @@ Enumerate variables using the delimiter inferred by \`mutagent explore --json\`:
 
 See [concepts/prompt-variables.md](./prompt-variables.md) for the full inference contract.
 
-### Example
+### Example (compact inline format)
 
-For \`"Summarize {document} for {audience}"
+For \`"Summarize {document} for {audience}"\`, the full-depth rubric above can
+be condensed to inline form for the JSON \`description\` field:
 
 \`\`\`json
 [
 {
 "name": "document-present",
 "evaluationParameter": "document",
-"description": "1.0
+"description": "Evaluate the usability of the document input. Score 0.95-1.00 (Exceptional): rich prose >= 500 chars, full context, no ambiguity — summarizer has everything it needs. Score 0.80-0.90 (Adequate): complete prose >= 100 chars, minor gaps — summarizer can work but may hedge one point. Score 0.60-0.70 (Marginal): short but usable text 50-99 chars — output will be generic. Score 0.40-0.50 (Insufficient): very short snippet < 50 chars or off-topic content — output will be low-quality. Score 0.20-0.30 (Minimal): summary stub, placeholder, or filler text — prompt cannot produce a useful response. Score 0.00-0.10 (Absent): empty, null, filename, or TODO — prompt cannot succeed."
 },
 {
 "name": "audience-concrete",
 "evaluationParameter": "audience",
-"description": "1.
+"description": "Evaluate how concretely the audience is specified. Score 0.95-1.00 (Exceptional): concrete persona with role, seniority, and domain context (e.g., 'junior Python devs at an early-stage startup') — summarizer can tailor depth and vocabulary precisely. Score 0.80-0.90 (Adequate): concrete role with seniority but no domain ('junior Python devs') — good but summarizer must assume domain. Score 0.60-0.70 (Marginal): role with seniority but no discipline ('senior engineers') — summarizer must assume tech stack. Score 0.40-0.50 (Insufficient): broad category without seniority ('engineers') — output will be generic. Score 0.20-0.30 (Minimal): vague group ('technical people', 'our team') — barely actionable. Score 0.00-0.10 (Absent): empty, 'general', 'everyone', or null — no tailoring possible."
 }
 ]
 \`\`\`
 
-
-
+#### Binary exception
+
+Some dimensions are genuinely binary — no spectrum exists. For these, 1.0/0.0
+is correct and adding extra anchors would be artificial:
+
+\`\`\`json
+{
+"name": "language-valid",
+"evaluationParameter": "language",
+"description": "Score 1.0 if the value is a valid BCP-47 language tag (e.g. 'en', 'fr-CA'). Score 0.0 if empty, null, or not a valid BCP-47 tag. No intermediate states exist — a tag is either valid or it is not."
+}
+\`\`\`
 
 ---
 
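Editor's note: the `language-valid` criterion added in this hunk is a genuinely binary check, and it is one of the few that could also be verified deterministically, without a judge model at all. A sketch (not package code) using the standard `Intl.getCanonicalLocales` API, which throws a `RangeError` on structurally malformed tags; note this validates the tag's grammar, not its membership in the IANA subtag registry:

```typescript
// Sketch only: deterministic stand-in for the binary BCP-47 criterion above.
function isValidBcp47(tag: string | null | undefined): boolean {
  if (!tag) return false; // empty or null maps to score 0.0
  try {
    Intl.getCanonicalLocales(tag); // throws RangeError if malformed
    return true;                   // maps to score 1.0
  } catch {
    return false;                  // maps to score 0.0
  }
}

console.log(isValidBcp47("fr-CA"));     // true
console.log(isValidBcp47(""));          // false
console.log(isValidBcp47("not a tag")); // false
```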
@@ -9881,24 +9960,47 @@ to the task and write one criterion per dimension.
 - **Groundedness** — facts in the output traceable to facts in the input
 - **Format compliance** — JSON validity, markdown shape, regex match
 
-###
+### Full-depth example: summary_accuracy
 
-
-
+This rubric demonstrates the gold-standard format for an OUTPUT criterion that
+evaluates a complex, multi-dimensional field (the factual accuracy of a generated
+summary against its source document):
 
-
+\`\`\`json
+{
+"name": "summary-accuracy",
+"evaluationParameter": "summary",
+"description": "Evaluate the factual accuracy of the generated summary against the source document.\\n\\nScore 0.95-1.00 (Flawless):\\n Every claim in the summary traces directly to the source. No additions, no omissions of key facts, no distortions. A fact-checker would approve without notes.\\n Observable: each stated figure, date, or claim appears verbatim or with lossless paraphrase in the source; nothing is added that the source does not support.\\n Example: Source describes Q3 revenue of €4.2M with 12% YoY growth. Summary states exactly these figures in proper context.\\n\\nScore 0.80-0.90 (Accurate):\\n All major facts correct. 1-2 minor simplifications that do not mislead (e.g., rounding €4.2M to 'over €4M').\\n Observable: core claims verified; minor imprecision in secondary detail does not change the reader's understanding.\\n Example: Summary captures the revenue figure but describes growth as 'double-digit' instead of the precise 12%.\\n\\nScore 0.60-0.70 (Mostly Accurate):\\n Core narrative correct but 2-3 details are imprecise or missing. Reader gets the right general picture but would fail a quiz on specifics.\\n Observable: main conclusion correct; at least one number or attribution is off or absent.\\n Example: Summary states revenue grew but omits the percentage and rounds the figure to the nearest million.\\n\\nScore 0.40-0.50 (Partially Accurate):\\n Mix of correct and incorrect claims. Key facts present but some figures wrong or attributed to wrong context.\\n Observable: overall topic correct; at least one material claim contradicts or misattributes source data.\\n Example: Revenue figure correct but growth rate stated as 20% (source says 12%); quarter attribution swapped.\\n\\nScore 0.20-0.30 (Largely Inaccurate):\\n Summary contradicts source on important points or invents claims not present in the original.\\n Observable: multiple fabricated or inverted facts; reader would form a wrong understanding of the source.\\n Example: Summary inverts the YoY direction ('revenue declined') when the source reports growth.\\n\\nScore 0.00-0.10 (Fabricated):\\n Summary bears no factual relationship to the source document, is empty, or is a boilerplate placeholder.\\n Observable: empty string; '[Summary goes here]'; figures invented wholesale with no source basis.\\n Example: summary field is empty, or contains figures from a completely different document."
+}
+\`\`\`
+
+### Simpler example: input_completeness (fewer tiers, still full depth)
+
+Not every criterion needs 6 tiers. For an input check where the spectrum is
+narrower, 5 tiers can be right — as long as each tier has definition,
+observables, and an example:
+
+\`\`\`json
+{
+"name": "context-completeness",
+"evaluationParameter": "context",
+"description": "Evaluate whether the input context provides sufficient information for the task.\\n\\nScore 0.95-1.00 (Comprehensive):\\n All required fields populated with specific, actionable detail. A human could complete the task using only this context without asking clarifying questions.\\n Observable: every required field present and non-empty; values are specific rather than generic placeholders.\\n Example: data extraction task where source_text, target_fields, and output_format are all fully specified with concrete values.\\n\\nScore 0.80-0.90 (Sufficient):\\n All required fields present with adequate detail. 1-2 optional fields missing but the task can proceed without them.\\n Observable: required fields complete; one optional field absent or using a safe default.\\n Example: translation task where source_text and target_language are present, but tone_style is unspecified — translation can proceed with neutral tone.\\n\\nScore 0.60-0.70 (Workable):\\n Core information present but some fields are vague or use placeholder language. The model can attempt the task but output will lack specificity.\\n Observable: required fields present but one uses generic language ('some text', 'relevant context'); output will be shallow.\\n Example: code review task where the code snippet is present but the review_focus field says 'check for issues' instead of specifying which aspects to evaluate.\\n\\nScore 0.40-0.50 (Thin):\\n Only basic identifiers present (name, category). Critical context fields are empty or contain single-word entries. Output will be generic.\\n Observable: task topic identifiable but most content fields empty or trivially short; model must hallucinate detail to respond.\\n Example: summarization task where source_document is only a title with no body text.\\n\\nScore 0.00-0.20 (Unusable):\\n Missing critical fields. The model cannot produce a meaningful output from this input alone.\\n Observable: required fields absent or null; no basis for task execution.\\n Example: data extraction task where source_text is empty or null."
+}
+\`\`\`
+
+### Inline compact format (for production use)
+
+The full-depth format above is for documentation and teaching. In production
+\`description\` fields (which are single-line strings), compress as follows:
 
 \`\`\`json
 {
 "name": "summary-correctness",
 "evaluationParameter": "summary",
-"description": "1.
+"description": "Evaluate the correctness of the summary field against the source document and required format. Score 0.95-1.00 (Exceptional): valid JSON, all 3 required fields present, all key arguments covered accurately, no hallucinated facts, prose is precise and well-structured. Score 0.80-0.90 (Strong): valid JSON, all fields present, one argument understated but not wrong — does not change the conclusion. Score 0.60-0.70 (Adequate): valid JSON, all fields present, 1-2 arguments missing but no hallucinations — output is usable but incomplete. Score 0.40-0.50 (Weak): valid JSON, 1-2 required fields missing, or one argument hedged incorrectly — output is partially wrong. Score 0.20-0.30 (Poor): valid JSON but substantive content missing or severely incomplete — output provides little value. Score 0.00-0.10 (Failure): invalid JSON, any fabricated facts, or empty output."
 }
 \`\`\`
 
-The \`evaluationParameter\` here is the output-schema field name, not an input
-variable. Same 1:1 discipline — one criterion per output dimension.
-
 ---
 
 ## Platform validation rules (current)
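Editor's note: the "Inline compact format" guidance added in this hunk amounts to a whitespace-flattening pass over the full-depth teaching format. A hypothetical helper (not part of the CLI) sketching that compression:

```typescript
// Sketch only: flatten a readable multi-line rubric into the single-line
// string a production `description` field expects. Tier order is preserved;
// only the line structure is collapsed.
function toInlineDescription(fullDepthRubric: string): string {
  return fullDepthRubric
    .trim()
    .split(/\n\s*\n/)                                // split at tier boundaries
    .map((block) => block.replace(/\s+/g, " ").trim()) // collapse inner newlines
    .join(" ");
}
```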
@@ -9926,12 +10028,13 @@ each criterion must pass these platform-enforced checks:
 2. **Ask the user**: "Evaluate INPUTS (is context sufficient) or OUTPUTS
 (is response correct) first?" — let the user pick the scope.
 3. **Collect criteria**: use AskUserQuestion to collect from user, never auto-generate — one per variable (INPUT) or per dimension (OUTPUT),
-always with a
+always with a granular rubric (anchors matched to the dimension's observable quality levels) describing observable behavior. Use binary scoring (1.0/0.0) only
+for genuinely binary checks (membership tests, exact-match fields).
 4. **Map to platform shape**:
 \`\`\`typescript
 {
 name: string; // short, slug-like
-description: string; // the
+description: string; // the rubric verbatim
 evaluationParameter: string; // the variable name OR output field
 }
 \`\`\`
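Editor's note: the platform shape in this hunk is small enough to capture as a type with a pre-submit sanity check. A hedged TypeScript sketch; the actual platform validation rules live in the skill's "Platform validation rules" section, which this diff does not quote, so the checks below are illustrative assumptions, not the platform's enforcement:

```typescript
// Sketch: the criterion shape from the snippet above, plus rough checks a
// collector could run before calling the CLI. Rules here are assumptions.
interface EvalCriterion {
  name: string;                // short, slug-like
  description: string;         // the rubric verbatim
  evaluationParameter: string; // the variable name OR output field
}

function checkCriterion(c: EvalCriterion): string[] {
  const problems: string[] = [];
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(c.name)) {
    problems.push("name should be slug-like, e.g. 'document-present'");
  }
  if (!c.description.includes("Score ")) {
    problems.push("description contains no scoring anchors");
  }
  if (c.evaluationParameter.length === 0) {
    problems.push("evaluationParameter must name a variable or output field");
  }
  return problems;
}
```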
@@ -9947,7 +10050,9 @@ the output to collect rubrics in the correct order.
 
 - **Auto-generating criteria** — Rule 3: NEVER. Always collect from user.
 - **Mixing input and output in one criterion** — breaks signal; split into two.
-- **Vague rubrics** — "0.8 if mostly good" → rewrite with
+- **Vague rubrics** — "0.8 if mostly good" → rewrite with named tier, definition, observables, example.
+- **Too few anchors for spectrum dimensions** — using only two or three scoring levels for quality/completeness dimensions starves the optimizer of signal; use as many anchors as the dimension has meaningfully distinguishable quality levels so the gradient is meaningful.
+- **One-liner anchors** — "1.0 = good, 0.6 = partial, 0.0 = bad" gives G-Eval no grounding to distinguish similar outputs. Each anchor needs a definition + observable + example.
 - **One criterion for many variables** — reduces signal, slows optimization.
 - **Scoring the model, not the data** — MVC scores the INPUT data quality.
 
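Editor's note: both anti-patterns added in this hunk (too few anchors, one-liner anchors) are mechanically detectable, since the rubrics follow the `Score <lo>-<hi> (<Level>):` convention. A rough lint sketch, hypothetical and not part of the CLI; thresholds follow the skill text (2 anchors for binary checks, 4-8 for spectrum dimensions):

```typescript
// Sketch: count scoring anchors in a rubric and flag the anti-patterns above.
function countAnchors(description: string): number {
  // Matches both "Score 0.95-1.00 (Exceptional):" and binary "Score 1.0 ...".
  return (description.match(/Score \d/g) ?? []).length;
}

function lintAnchorCount(description: string, isBinary: boolean): string | null {
  const n = countAnchors(description);
  if (isBinary) {
    return n === 2 ? null : `binary check should have exactly 2 anchors, found ${n}`;
  }
  return n >= 4 ? null : `spectrum dimension has only ${n} anchors; aim for 4-8`;
}
```

This catches anchor count, not anchor quality; the one-liner anti-pattern still needs a human eye on each anchor's definition, observable, and example.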
@@ -10655,7 +10760,7 @@ Read the **5 rules** in [SKILL.md](../SKILL.md) before executing. All 5 rules ap
 | Step | Pre-read | Why |
 |---|---|---|
 | Before \`prompts create\` | [concepts/prompt-variables.md](../concepts/prompt-variables.md) | Brace convention — single \`{var}\` vs double \`{{var}}\` affects how variables are parsed |
-| Before \`evaluation create --guided\` | [concepts/eval-criteria.md](../concepts/eval-criteria.md) | INPUT MVC + OUTPUT Standards —
+| Before \`evaluation create --guided\` | [concepts/eval-criteria.md](../concepts/eval-criteria.md) | INPUT MVC + OUTPUT Standards — granular rubric format |
 
 ---
 
@@ -10809,7 +10914,7 @@ mutagent prompts dataset add <prompt-id> -d '<constructed-json>' --name '<name>'
 
 - [SKILL.md](../SKILL.md) → 5 rules + journey router
 - [concepts/prompt-variables.md](../concepts/prompt-variables.md) → brace convention + conversion (critical for steps 3 and 15)
-- [concepts/eval-criteria.md](../concepts/eval-criteria.md) → INPUT MVC + OUTPUT Standards +
+- [concepts/eval-criteria.md](../concepts/eval-criteria.md) → INPUT MVC + OUTPUT Standards + granular rubric (critical for steps 7-8)
 - [workflows/exploration.md](./exploration.md) → step 1 of this workflow
 - [workflows/tracing.md](./tracing.md) → parallel or follow-up path
 `,
@@ -12379,5 +12484,5 @@ if (isInteractive && !isSkillCommand) {
 }
 program.parse();
 
-//# debugId=
+//# debugId=D02CA9EC0C26385264756E2164756E21
 //# sourceMappingURL=cli.js.map