@mutagent/cli 0.1.156 → 0.1.158
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +8 -7
- package/dist/bin/cli.js +132 -27
- package/dist/bin/cli.js.map +3 -3
- package/package.json +1 -1
package/README.md
CHANGED
@@ -251,8 +251,7 @@ mutagent auth login
 ```
 
 > `mutagent login` is the canonical command. `mutagent auth login` is preserved
-> as a back-compat alias; both
-> See `docs/cli-design-principles.md` → "Login Unification".
+> as a back-compat alias; both behave identically.
 
 ### 2. Post-Onboarding (Interactive)
 
@@ -656,12 +655,14 @@ bun run verify # Full verification (lint + typecheck + build + test)
 
 ---
 
-##
+## See Also
 
-- [
-- [
-- [
-- [
+- **[@mutagent/sdk](https://www.npmjs.com/package/@mutagent/sdk)** — TypeScript SDK for programmatic access
+- **[docs.mutagent.io](https://docs.mutagent.io)** — Full platform documentation
+- **[CLI Commands Reference](https://docs.mutagent.io/cli/commands)** — All commands with flags
+- **[Integration Guides](https://docs.mutagent.io/integrations/overview)** — Mastra, LangChain, LangGraph, Vercel AI
+- **[Tracing Setup](https://docs.mutagent.io/tracing/setup)** — OTel integration walkthrough
+- **[mutagent.io](https://mutagent.io)** — Homepage
 
 ---
 
package/dist/bin/cli.js
CHANGED
@@ -9623,7 +9623,7 @@ description: |
 
 1. **\`--json\` on EVERY command.** No exceptions. Agents use JSON mode exclusively.
 2. **\`<command> --help\` BEFORE first use of any command.** The CLI is the source of truth for flags — this SKILL never inlines them.
-3. **NEVER auto-generate eval criteria — collect from user.** Ask the user for each rubric field. See [concepts/eval-criteria.md](./concepts/eval-criteria.md) for the
+3. **NEVER auto-generate eval criteria — collect from user.** Ask the user for each rubric field. See [concepts/eval-criteria.md](./concepts/eval-criteria.md) for the rubric format.
 4. **Explore-before-modify.** Run \`mutagent explore --json\` before any write operation. Present findings, get user confirmation. Never mutate without discovery first.
 5. **Cost transparency before \`optimize start\`.** Run \`mutagent usage --json\` and show the result to the user. Get explicit confirmation before any optimization job.
 6. **Before optimizing, run \`mutagent providers list --models\` to verify available models.** This calls \`/providers/catalog\` and shows which models are available per provider. Use the output to pick valid \`--exec-model\` and \`--eval-model\` values.
@@ -9656,7 +9656,7 @@ Match the user's first request. Load ONLY the matching subfile. Do NOT preload t
 | [workflows/exploration.md](./workflows/exploration.md) | User wants to scan codebase, identify prompts vs agents | Read-only discovery; output taxonomy to user | Run only; no writes |
 | [workflows/agents.md](./workflows/agents.md) | Multi-turn / tool-calling code detected | WIP — do NOT attempt optimizer, surface partnership link | Show WIP card to user verbatim |
 | [concepts/prompt-variables.md](./concepts/prompt-variables.md) | Any question about \`{var}\` vs \`{{var}}\`, delimiter inference | Brace convention + conversion rules | Load before \`prompts create\` in optimization workflow |
-| [concepts/eval-criteria.md](./concepts/eval-criteria.md) | Any question about rubric design, MVC, Output Standards |
+| [concepts/eval-criteria.md](./concepts/eval-criteria.md) | Any question about rubric design, MVC, Output Standards | granular rubric format — INPUT-param vs OUTPUT-param scope | Load before \`evaluation create --guided\` in optimization workflow |
 
 ---
 
@@ -9768,7 +9768,8 @@ description: |
 Canonical source for MutagenT evaluation-criteria framing:
 INPUT-param criteria → Minimum Viable Context (MVC);
 OUTPUT-param criteria → Output Standards.
-
+Granular rubric discipline (match anchors to the dimension's observable quality levels; binary scoring (1.0/0.0) for yes/no checks):
+grounded, observable, never vague.
 Includes current platform validation rules for criterion shape.
 Mirrored in mutagent/src/modules/prompts/prompt-evaluations/README.md.
 triggers:
@@ -9817,17 +9818,84 @@ Do not skip any field. Do not pre-fill answers. The user must provide each rubri
 
 ---
 
+## Your rubric is the instruction to the G-Eval LLM
+
+The rubric text in each criterion's \`description\` field is read verbatim by
+the LLM-as-Judge (G-Eval). **The more precise your anchor descriptions, the
+more consistent and accurate the scores. Vague rubrics produce vague scores.**
+
+A rubric like "0.8 if mostly good, 0.2 if mostly bad" gives the judge no
+grounding — it will invent its own interpretation. A rubric with concrete
+tier definitions, observable characteristics, and specific examples locks the
+judge's interpretation to yours.
+
+The target quality bar is the G-Eval system's own internal scoring guidelines:
+each tier has a **named level**, a **score range**, a **definition**, **observable
+characteristics**, and a **concrete example**. Your rubrics should aspire to
+that same precision for the dimensions that matter most to your domain.
+
+---
+
 ## Input-param criteria → Minimum Viable Context (MVC)
 
 **Scope**: the \`{variables}\` the prompt template consumes. Each \`{variable}\` is
 an **input param**. The criterion asks: *is the information required for the
 prompt to succeed actually present?*
 
-###
+### Rubric format (match anchors to observable quality levels)
+
+Each anchor must be **observable** (a human can assign it by reading one input
+row) and **grounded** (describes a concrete property, not a feeling).
+
+\`\`\`
+"Evaluate the completeness and usability of the {variable} input field.
+
+Score 0.95-1.00 (Exceptional):
+All required context is present with rich, unambiguous detail. The prompt
+can produce a high-quality output without any hedging or assumption.
+Observable: full narrative prose, field-specific depth (e.g., >= 500 chars),
+no placeholder text, no ambiguous referents.
+Example: {document} = 800-word technical article with clear subject,
+context, and argument — summarizer has everything it needs.
+
+Score 0.80-0.90 (Adequate):
+All required context is present but with minor gaps or imprecision that
+may cause hedging. The prompt can attempt a reasonable answer.
+Observable: complete field with minor quality shortfalls
+(e.g., 150-499 chars, one ambiguous term).
+Example: {document} = 200-word product description with one unclear
+abbreviation — summarizer can work but may flag the ambiguity.
+
+Score 0.60-0.70 (Marginal):
+Most required context is present; the prompt can attempt an answer but
+the output will be noticeably incomplete or generic.
+Observable: partial field content, missing secondary context
+(e.g., 50-149 chars, missing key metadata).
+Example: {document} = three-sentence product blurb with no technical
+specifics — summarizer produces a generic response.
+
+Score 0.40-0.50 (Insufficient):
+Partial context only. The prompt will produce a low-quality or generic
+response that cannot be acted on.
+Observable: very short field (< 50 chars), or content present but off-topic.
+Example: {document} = "See attached" — summarizer has nothing to work with.
+
+Score 0.20-0.30 (Minimal):
+Only a stub or placeholder. The prompt will fail or produce a useless
+response.
+Observable: summary stubs, auto-generated filler, one-word answers.
+Example: {document} = "TBD" or "N/A" — useless for any summarization.
+
+Score 0.00-0.10 (Absent):
+Critical context is missing. The prompt cannot succeed regardless of model.
+Observable: empty string, null, filename without content, TODO marker.
+Example: {document} = "" or null or 'document.pdf'."
+\`\`\`
 
--
--
-
+> **Binary checks use two-anchor scoring (1.0 / 0.0) — the criterion is either satisfied or not** (e.g., "Is the
+> document non-empty?"). For any spectrum dimension — quality, completeness,
+> accuracy — use as many scoring anchors as the dimension has meaningfully distinguishable quality levels, so the optimizer has fine-grained signal to
+> act on. Binary checks (yes/no) need 2 anchors. Spectrum dimensions (quality, completeness, accuracy) typically need 4-8, depending on how many distinct quality levels a human could reliably tell apart.
 
 ### Discipline rules
 
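Editor's note: the section added in this hunk pins the target rubric anatomy to five parts (named level, score range, definition, observable characteristics, concrete example). That structure is regular enough to assemble rubric strings from data rather than by hand. A minimal TypeScript sketch of that assembly, with hypothetical helper names that are not part of the shipped CLI:

```typescript
// Sketch only: model the five-part tier anatomy the skill text describes,
// then render it into the rubric string that goes in a criterion's
// `description` field. All names here are illustrative assumptions.
interface RubricTier {
  level: string;            // named level, e.g. "Exceptional"
  range: [number, number];  // score range, e.g. [0.95, 1.0]
  definition: string;       // what this tier means for the dimension
  observable: string;       // properties a reader can verify directly
  example: string;          // one concrete instance of the tier
}

function renderRubric(intro: string, tiers: RubricTier[]): string {
  const body = tiers
    .map(
      (t) =>
        `Score ${t.range[0].toFixed(2)}-${t.range[1].toFixed(2)} (${t.level}):\n` +
        `${t.definition}\n` +
        `Observable: ${t.observable}\n` +
        `Example: ${t.example}`,
    )
    .join("\n\n");
  return `${intro}\n\n${body}`;
}
```

A rubric built this way cannot silently drop a definition, observable, or example from any tier, which is exactly the failure mode the new section warns against.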
@@ -9845,27 +9913,38 @@ Enumerate variables using the delimiter inferred by \`mutagent explore --json\`:
 
 See [concepts/prompt-variables.md](./prompt-variables.md) for the full inference contract.
 
-### Example
+### Example (compact inline format)
 
-For \`"Summarize {document} for {audience}"
+For \`"Summarize {document} for {audience}"\`, the full-depth rubric above can
+be condensed to inline form for the JSON \`description\` field:
 
 \`\`\`json
 [
 {
 "name": "document-present",
 "evaluationParameter": "document",
-"description": "1.0
+"description": "Evaluate the usability of the document input. Score 0.95-1.00 (Exceptional): rich prose >= 500 chars, full context, no ambiguity — summarizer has everything it needs. Score 0.80-0.90 (Adequate): complete prose >= 100 chars, minor gaps — summarizer can work but may hedge one point. Score 0.60-0.70 (Marginal): short but usable text 50-99 chars — output will be generic. Score 0.40-0.50 (Insufficient): very short snippet < 50 chars or off-topic content — output will be low-quality. Score 0.20-0.30 (Minimal): summary stub, placeholder, or filler text — prompt cannot produce a useful response. Score 0.00-0.10 (Absent): empty, null, filename, or TODO — prompt cannot succeed."
 },
 {
 "name": "audience-concrete",
 "evaluationParameter": "audience",
-"description": "1.
+"description": "Evaluate how concretely the audience is specified. Score 0.95-1.00 (Exceptional): concrete persona with role, seniority, and domain context (e.g., 'junior Python devs at an early-stage startup') — summarizer can tailor depth and vocabulary precisely. Score 0.80-0.90 (Adequate): concrete role with seniority but no domain ('junior Python devs') — good but summarizer must assume domain. Score 0.60-0.70 (Marginal): role with seniority but no discipline ('senior engineers') — summarizer must assume tech stack. Score 0.40-0.50 (Insufficient): broad category without seniority ('engineers') — output will be generic. Score 0.20-0.30 (Minimal): vague group ('technical people', 'our team') — barely actionable. Score 0.00-0.10 (Absent): empty, 'general', 'everyone', or null — no tailoring possible."
 }
 ]
 \`\`\`
 
-
-
+#### Binary exception
+
+Some dimensions are genuinely binary — no spectrum exists. For these, 1.0/0.0
+is correct and adding extra anchors would be artificial:
+
+\`\`\`json
+{
+"name": "language-valid",
+"evaluationParameter": "language",
+"description": "Score 1.0 if the value is a valid BCP-47 language tag (e.g. 'en', 'fr-CA'). Score 0.0 if empty, null, or not a valid BCP-47 tag. No intermediate states exist — a tag is either valid or it is not."
+}
+\`\`\`
 
 ---
 
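Editor's note: the `language-valid` criterion added in this hunk is a genuinely binary check, and it is one of the few that could also be verified deterministically, without a judge model at all. A sketch (not package code) using the standard `Intl.getCanonicalLocales` API, which throws a `RangeError` on structurally malformed tags; note this validates the tag's grammar, not its membership in the IANA subtag registry:

```typescript
// Sketch only: deterministic stand-in for the binary BCP-47 criterion above.
function isValidBcp47(tag: string | null | undefined): boolean {
  if (!tag) return false; // empty or null maps to score 0.0
  try {
    Intl.getCanonicalLocales(tag); // throws RangeError if malformed
    return true;                   // maps to score 1.0
  } catch {
    return false;                  // maps to score 0.0
  }
}

console.log(isValidBcp47("fr-CA"));     // true
console.log(isValidBcp47(""));          // false
console.log(isValidBcp47("not a tag")); // false
```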
@@ -9881,24 +9960,47 @@ to the task and write one criterion per dimension.
 - **Groundedness** — facts in the output traceable to facts in the input
 - **Format compliance** — JSON validity, markdown shape, regex match
 
-###
+### Full-depth example: summary_accuracy
 
-
-
+This rubric demonstrates the gold-standard format for an OUTPUT criterion that
+evaluates a complex, multi-dimensional field (the factual accuracy of a generated
+summary against its source document):
 
-
+\`\`\`json
+{
+"name": "summary-accuracy",
+"evaluationParameter": "summary",
+"description": "Evaluate the factual accuracy of the generated summary against the source document.\\n\\nScore 0.95-1.00 (Flawless):\\n Every claim in the summary traces directly to the source. No additions, no omissions of key facts, no distortions. A fact-checker would approve without notes.\\n Observable: each stated figure, date, or claim appears verbatim or with lossless paraphrase in the source; nothing is added that the source does not support.\\n Example: Source describes Q3 revenue of €4.2M with 12% YoY growth. Summary states exactly these figures in proper context.\\n\\nScore 0.80-0.90 (Accurate):\\n All major facts correct. 1-2 minor simplifications that do not mislead (e.g., rounding €4.2M to 'over €4M').\\n Observable: core claims verified; minor imprecision in secondary detail does not change the reader's understanding.\\n Example: Summary captures the revenue figure but describes growth as 'double-digit' instead of the precise 12%.\\n\\nScore 0.60-0.70 (Mostly Accurate):\\n Core narrative correct but 2-3 details are imprecise or missing. Reader gets the right general picture but would fail a quiz on specifics.\\n Observable: main conclusion correct; at least one number or attribution is off or absent.\\n Example: Summary states revenue grew but omits the percentage and rounds the figure to the nearest million.\\n\\nScore 0.40-0.50 (Partially Accurate):\\n Mix of correct and incorrect claims. Key facts present but some figures wrong or attributed to wrong context.\\n Observable: overall topic correct; at least one material claim contradicts or misattributes source data.\\n Example: Revenue figure correct but growth rate stated as 20% (source says 12%); quarter attribution swapped.\\n\\nScore 0.20-0.30 (Largely Inaccurate):\\n Summary contradicts source on important points or invents claims not present in the original.\\n Observable: multiple fabricated or inverted facts; reader would form a wrong understanding of the source.\\n Example: Summary inverts the YoY direction ('revenue declined') when the source reports growth.\\n\\nScore 0.00-0.10 (Fabricated):\\n Summary bears no factual relationship to the source document, is empty, or is a boilerplate placeholder.\\n Observable: empty string; '[Summary goes here]'; figures invented wholesale with no source basis.\\n Example: summary field is empty, or contains figures from a completely different document."
+}
+\`\`\`
+
+### Simpler example: input_completeness (fewer tiers, still full depth)
+
+Not every criterion needs 6 tiers. For an input check where the spectrum is
+narrower, 5 tiers can be right — as long as each tier has definition,
+observables, and an example:
+
+\`\`\`json
+{
+"name": "context-completeness",
+"evaluationParameter": "context",
+"description": "Evaluate whether the input context provides sufficient information for the task.\\n\\nScore 0.95-1.00 (Comprehensive):\\n All required fields populated with specific, actionable detail. A human could complete the task using only this context without asking clarifying questions.\\n Observable: every required field present and non-empty; values are specific rather than generic placeholders.\\n Example: data extraction task where source_text, target_fields, and output_format are all fully specified with concrete values.\\n\\nScore 0.80-0.90 (Sufficient):\\n All required fields present with adequate detail. 1-2 optional fields missing but the task can proceed without them.\\n Observable: required fields complete; one optional field absent or using a safe default.\\n Example: translation task where source_text and target_language are present, but tone_style is unspecified — translation can proceed with neutral tone.\\n\\nScore 0.60-0.70 (Workable):\\n Core information present but some fields are vague or use placeholder language. The model can attempt the task but output will lack specificity.\\n Observable: required fields present but one uses generic language ('some text', 'relevant context'); output will be shallow.\\n Example: code review task where the code snippet is present but the review_focus field says 'check for issues' instead of specifying which aspects to evaluate.\\n\\nScore 0.40-0.50 (Thin):\\n Only basic identifiers present (name, category). Critical context fields are empty or contain single-word entries. Output will be generic.\\n Observable: task topic identifiable but most content fields empty or trivially short; model must hallucinate detail to respond.\\n Example: summarization task where source_document is only a title with no body text.\\n\\nScore 0.00-0.20 (Unusable):\\n Missing critical fields. The model cannot produce a meaningful output from this input alone.\\n Observable: required fields absent or null; no basis for task execution.\\n Example: data extraction task where source_text is empty or null."
+}
+\`\`\`
+
+### Inline compact format (for production use)
+
+The full-depth format above is for documentation and teaching. In production
+\`description\` fields (which are single-line strings), compress as follows:
 
 \`\`\`json
 {
 "name": "summary-correctness",
 "evaluationParameter": "summary",
-"description": "1.
+"description": "Evaluate the correctness of the summary field against the source document and required format. Score 0.95-1.00 (Exceptional): valid JSON, all 3 required fields present, all key arguments covered accurately, no hallucinated facts, prose is precise and well-structured. Score 0.80-0.90 (Strong): valid JSON, all fields present, one argument understated but not wrong — does not change the conclusion. Score 0.60-0.70 (Adequate): valid JSON, all fields present, 1-2 arguments missing but no hallucinations — output is usable but incomplete. Score 0.40-0.50 (Weak): valid JSON, 1-2 required fields missing, or one argument hedged incorrectly — output is partially wrong. Score 0.20-0.30 (Poor): valid JSON but substantive content missing or severely incomplete — output provides little value. Score 0.00-0.10 (Failure): invalid JSON, any fabricated facts, or empty output."
 }
 \`\`\`
 
-The \`evaluationParameter\` here is the output-schema field name, not an input
-variable. Same 1:1 discipline — one criterion per output dimension.
-
 ---
 
 ## Platform validation rules (current)
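Editor's note: the "Inline compact format" guidance added in this hunk amounts to a whitespace-flattening pass over the full-depth teaching format. A hypothetical helper (not part of the CLI) sketching that compression:

```typescript
// Sketch only: flatten a readable multi-line rubric into the single-line
// string a production `description` field expects. Tier order is preserved;
// only the line structure is collapsed.
function toInlineDescription(fullDepthRubric: string): string {
  return fullDepthRubric
    .trim()
    .split(/\n\s*\n/)                                // split at tier boundaries
    .map((block) => block.replace(/\s+/g, " ").trim()) // collapse inner newlines
    .join(" ");
}
```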
@@ -9926,12 +10028,13 @@ each criterion must pass these platform-enforced checks:
 2. **Ask the user**: "Evaluate INPUTS (is context sufficient) or OUTPUTS
 (is response correct) first?" — let the user pick the scope.
 3. **Collect criteria**: use AskUserQuestion to collect from user, never auto-generate — one per variable (INPUT) or per dimension (OUTPUT),
-always with a
+always with a granular rubric (anchors matched to the dimension's observable quality levels) describing observable behavior. Use binary scoring (1.0/0.0) only
+for genuinely binary checks (membership tests, exact-match fields).
 4. **Map to platform shape**:
 \`\`\`typescript
 {
 name: string; // short, slug-like
-description: string; // the
+description: string; // the rubric verbatim
 evaluationParameter: string; // the variable name OR output field
 }
 \`\`\`
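Editor's note: the platform shape in this hunk is small enough to capture as a type with a pre-submit sanity check. A hedged TypeScript sketch; the actual platform validation rules live in the skill's "Platform validation rules" section, which this diff does not quote, so the checks below are illustrative assumptions, not the platform's enforcement:

```typescript
// Sketch: the criterion shape from the snippet above, plus rough checks a
// collector could run before calling the CLI. Rules here are assumptions.
interface EvalCriterion {
  name: string;                // short, slug-like
  description: string;         // the rubric verbatim
  evaluationParameter: string; // the variable name OR output field
}

function checkCriterion(c: EvalCriterion): string[] {
  const problems: string[] = [];
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(c.name)) {
    problems.push("name should be slug-like, e.g. 'document-present'");
  }
  if (!c.description.includes("Score ")) {
    problems.push("description contains no scoring anchors");
  }
  if (c.evaluationParameter.length === 0) {
    problems.push("evaluationParameter must name a variable or output field");
  }
  return problems;
}
```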
@@ -9947,7 +10050,9 @@ the output to collect rubrics in the correct order.
 
 - **Auto-generating criteria** — Rule 3: NEVER. Always collect from user.
 - **Mixing input and output in one criterion** — breaks signal; split into two.
-- **Vague rubrics** — "0.8 if mostly good" → rewrite with
+- **Vague rubrics** — "0.8 if mostly good" → rewrite with named tier, definition, observables, example.
+- **Too few anchors for spectrum dimensions** — using only two or three scoring levels for quality/completeness dimensions starves the optimizer of signal; use as many anchors as the dimension has meaningfully distinguishable quality levels so the gradient is meaningful.
+- **One-liner anchors** — "1.0 = good, 0.6 = partial, 0.0 = bad" gives G-Eval no grounding to distinguish similar outputs. Each anchor needs a definition + observable + example.
 - **One criterion for many variables** — reduces signal, slows optimization.
 - **Scoring the model, not the data** — MVC scores the INPUT data quality.
 
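Editor's note: both anti-patterns added in this hunk (too few anchors, one-liner anchors) are mechanically detectable, since the rubrics follow the `Score <lo>-<hi> (<Level>):` convention. A rough lint sketch, hypothetical and not part of the CLI; thresholds follow the skill text (2 anchors for binary checks, 4-8 for spectrum dimensions):

```typescript
// Sketch: count scoring anchors in a rubric and flag the anti-patterns above.
function countAnchors(description: string): number {
  // Matches both "Score 0.95-1.00 (Exceptional):" and binary "Score 1.0 ...".
  return (description.match(/Score \d/g) ?? []).length;
}

function lintAnchorCount(description: string, isBinary: boolean): string | null {
  const n = countAnchors(description);
  if (isBinary) {
    return n === 2 ? null : `binary check should have exactly 2 anchors, found ${n}`;
  }
  return n >= 4 ? null : `spectrum dimension has only ${n} anchors; aim for 4-8`;
}
```

This catches anchor count, not anchor quality; the one-liner anti-pattern still needs a human eye on each anchor's definition, observable, and example.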
@@ -10655,7 +10760,7 @@ Read the **5 rules** in [SKILL.md](../SKILL.md) before executing. All 5 rules ap
 | Step | Pre-read | Why |
 |---|---|---|
 | Before \`prompts create\` | [concepts/prompt-variables.md](../concepts/prompt-variables.md) | Brace convention — single \`{var}\` vs double \`{{var}}\` affects how variables are parsed |
-| Before \`evaluation create --guided\` | [concepts/eval-criteria.md](../concepts/eval-criteria.md) | INPUT MVC + OUTPUT Standards —
+| Before \`evaluation create --guided\` | [concepts/eval-criteria.md](../concepts/eval-criteria.md) | INPUT MVC + OUTPUT Standards — granular rubric format |
 
 ---
 
@@ -10809,7 +10914,7 @@ mutagent prompts dataset add <prompt-id> -d '<constructed-json>' --name '<name>'
 
 - [SKILL.md](../SKILL.md) → 5 rules + journey router
 - [concepts/prompt-variables.md](../concepts/prompt-variables.md) → brace convention + conversion (critical for steps 3 and 15)
-- [concepts/eval-criteria.md](../concepts/eval-criteria.md) → INPUT MVC + OUTPUT Standards +
+- [concepts/eval-criteria.md](../concepts/eval-criteria.md) → INPUT MVC + OUTPUT Standards + granular rubric (critical for steps 7-8)
 - [workflows/exploration.md](./exploration.md) → step 1 of this workflow
 - [workflows/tracing.md](./tracing.md) → parallel or follow-up path
 `,
@@ -12379,5 +12484,5 @@ if (isInteractive && !isSkillCommand) {
 }
 program.parse();
 
-//# debugId=
+//# debugId=D02CA9EC0C26385264756E2164756E21
 //# sourceMappingURL=cli.js.map