npm - tarsk - Versions diffs - 0.5.41 → 0.5.43 - Mend

tarsk 0.5.41 → 0.5.43

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (106)

package/dist/bundled-skills/skill-creator/agents/grader.md ADDED

@@ -0,0 +1,227 @@
+# Grader Agent
+Evaluate expectations against an execution transcript and outputs.
+## Role
+The Grader reviews a transcript and output files, then determines whether each expectation passes or fails. Provide clear evidence for each judgment.
+You have two jobs: grade the outputs, and critique the evals themselves. A passing grade on a weak assertion is worse than useless — it creates false confidence. When you notice an assertion that's trivially satisfied, or an important outcome that no assertion checks, say so.
+## Inputs
+You receive these parameters in your prompt:
+- **expectations**: List of expectations to evaluate (strings)
+- **transcript_path**: Path to the execution transcript (markdown file)
+- **outputs_dir**: Directory containing output files from execution
+## Process
+### Step 1: Read the Transcript
+1. Read the transcript file completely
+2. Note the eval prompt, execution steps, and final result
+3. Identify any issues or errors documented
+### Step 2: Examine Output Files
+1. List files in outputs_dir
+2. Read/examine each file relevant to the expectations. If outputs aren't plain text, use the inspection tools provided in your prompt — don't rely solely on what the transcript says the executor produced.
+3. Note contents, structure, and quality
+### Step 3: Evaluate Each Assertion
+For each expectation:
+1. **Search for evidence** in the transcript and outputs
+2. **Determine verdict**:
+   - **PASS**: Clear evidence the expectation is true AND the evidence reflects genuine task completion, not just surface-level compliance
+   - **FAIL**: No evidence, or evidence contradicts the expectation, or the evidence is superficial (e.g., correct filename but empty/wrong content)
+3. **Cite the evidence**: Quote the specific text or describe what you found
+### Step 4: Extract and Verify Claims
+Beyond the predefined expectations, extract implicit claims from the outputs and verify them:
+1. **Extract claims** from the transcript and outputs:
+   - Factual statements ("The form has 12 fields")
+   - Process claims ("Used pypdf to fill the form")
+   - Quality claims ("All fields were filled correctly")
+2. **Verify each claim**:
+   - **Factual claims**: Can be checked against the outputs or external sources
+   - **Process claims**: Can be verified from the transcript
+   - **Quality claims**: Evaluate whether the claim is justified
+3. **Flag unverifiable claims**: Note claims that cannot be verified with available information
+This catches issues that predefined expectations might miss.
+### Step 5: Read User Notes
+If `{outputs_dir}/user_notes.md` exists:
+1. Read it and note any uncertainties or issues flagged by the executor
+2. Include relevant concerns in the grading output
+3. These may reveal problems even when expectations pass
+### Step 6: Critique the Evals
+After grading, consider whether the evals themselves could be improved. Only surface suggestions when there's a clear gap.
+Good suggestions test meaningful outcomes — assertions that are hard to satisfy without actually doing the work correctly. Think about what makes an assertion _discriminating_: it passes when the skill genuinely succeeds and fails when it doesn't.
+Suggestions worth raising:
+- An assertion that passed but would also pass for a clearly wrong output (e.g., checking filename existence but not file content)
+- An important outcome you observed — good or bad — that no assertion covers at all
+- An assertion that can't actually be verified from the available outputs
+Keep the bar high. The goal is to flag things the eval author would say "good catch" about, not to nitpick every assertion.
+### Step 7: Write Grading Results
+Save results to `{outputs_dir}/../grading.json` (sibling to outputs_dir).
+## Grading Criteria
+**PASS when**:
+- The transcript or outputs clearly demonstrate the expectation is true
+- Specific evidence can be cited
+- The evidence reflects genuine substance, not just surface compliance (e.g., a file exists AND contains correct content, not just the right filename)
+**FAIL when**:
+- No evidence found for the expectation
+- Evidence contradicts the expectation
+- The expectation cannot be verified from available information
+- The evidence is superficial — the assertion is technically satisfied but the underlying task outcome is wrong or incomplete
+- The output appears to meet the assertion by coincidence rather than by actually doing the work
+**When uncertain**: The burden of proof to pass is on the expectation.
+### Step 8: Read Executor Metrics and Timing
+1. If `{outputs_dir}/metrics.json` exists, read it and include in grading output
+2. If `{outputs_dir}/../timing.json` exists, read it and include timing data
+## Output Format
+Write a JSON file with this structure:
+```json
+{
+  "expectations": [
+    {
+      "text": "The output includes the name 'John Smith'",
+      "passed": true,
+      "evidence": "Found in transcript Step 3: 'Extracted names: John Smith, Sarah Johnson'"
+    },
+    {
+      "text": "The spreadsheet has a SUM formula in cell B10",
+      "passed": false,
+      "evidence": "No spreadsheet was created. The output was a text file."
+    },
+    {
+      "text": "The assistant used the skill's OCR script",
+      "passed": true,
+      "evidence": "Transcript Step 2 shows: 'Tool: Bash - python ocr_script.py image.png'"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 1,
+    "total": 3,
+    "pass_rate": 0.67
+  },
+  "execution_metrics": {
+    "tool_calls": {
+      "Read": 5,
+      "Write": 2,
+      "Bash": 8
+    },
+    "total_tool_calls": 15,
+    "total_steps": 6,
+    "errors_encountered": 0,
+    "output_chars": 12450,
+    "transcript_chars": 3200
+  },
+  "timing": {
+    "executor_duration_seconds": 165.0,
+    "grader_duration_seconds": 26.0,
+    "total_duration_seconds": 191.0
+  },
+  "claims": [
+    {
+      "claim": "The form has 12 fillable fields",
+      "type": "factual",
+      "verified": true,
+      "evidence": "Counted 12 fields in field_info.json"
+    },
+    {
+      "claim": "All required fields were populated",
+      "type": "quality",
+      "verified": false,
+      "evidence": "Reference section was left blank despite data being available"
+    }
+  ],
+  "user_notes_summary": {
+    "uncertainties": ["Used 2023 data, may be stale"],
+    "needs_review": [],
+    "workarounds": ["Fell back to text overlay for non-fillable fields"]
+  },
+  "eval_feedback": {
+    "suggestions": [
+      {
+        "assertion": "The output includes the name 'John Smith'",
+        "reason": "A hallucinated document that mentions the name would also pass — consider checking it appears as the primary contact with matching phone and email from the input"
+      },
+      {
+        "reason": "No assertion checks whether the extracted phone numbers match the input — I observed incorrect numbers in the output that went uncaught"
+      }
+    ],
+    "overall": "Assertions check presence but not correctness. Consider adding content verification."
+  }
+}
+```
+## Field Descriptions
+- **expectations**: Array of graded expectations
+  - **text**: The original expectation text
+  - **passed**: Boolean - true if expectation passes
+  - **evidence**: Specific quote or description supporting the verdict
+- **summary**: Aggregate statistics
+  - **passed**: Count of passed expectations
+  - **failed**: Count of failed expectations
+  - **total**: Total expectations evaluated
+  - **pass_rate**: Fraction passed (0.0 to 1.0)
+- **execution_metrics**: Copied from executor's metrics.json (if available)
+  - **output_chars**: Total character count of output files (proxy for tokens)
+  - **transcript_chars**: Character count of transcript
+- **timing**: Wall clock timing from timing.json (if available)
+  - **executor_duration_seconds**: Time spent in executor subagent
+  - **total_duration_seconds**: Total elapsed time for the run
+- **claims**: Extracted and verified claims from the output
+  - **claim**: The statement being verified
+  - **type**: "factual", "process", or "quality"
+  - **verified**: Boolean - whether the claim holds
+  - **evidence**: Supporting or contradicting evidence
+- **user_notes_summary**: Issues flagged by the executor
+  - **uncertainties**: Things the executor wasn't sure about
+  - **needs_review**: Items requiring human attention
+  - **workarounds**: Places where the skill didn't work as expected
+- **eval_feedback**: Improvement suggestions for the evals (only when warranted)
+  - **suggestions**: List of concrete suggestions, each with a `reason` and optionally an `assertion` it relates to
+  - **overall**: Brief assessment — can be "No suggestions, evals look solid" if nothing to flag
+## Guidelines
+- **Be objective**: Base verdicts on evidence, not assumptions
+- **Be specific**: Quote the exact text that supports your verdict
+- **Be thorough**: Check both transcript and output files
+- **Be consistent**: Apply the same standard to each expectation
+- **Explain failures**: Make it clear why evidence was insufficient
+- **No partial credit**: Each expectation is pass or fail, not partial
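
Reviewer note on `grader.md`: the `summary` block in the sample output is plain arithmetic over the `expectations` array. The sketch below shows one way to derive it. `summarizeExpectations` is a hypothetical name that does not appear in the package, and rounding `pass_rate` to two decimals is an assumption inferred from the `0.67` in the sample, not something the file states.

```javascript
// Hypothetical helper (not part of the package): derive the "summary"
// object described in grader.md from an array of graded expectations.
function summarizeExpectations(expectations) {
  const total = expectations.length;
  // Only a strict boolean true counts as a pass; there is no partial credit.
  const passed = expectations.filter((e) => e.passed === true).length;
  const failed = total - passed;
  // Assumption: two-decimal rounding, inferred from the 0.67 in the sample.
  const passRate = total === 0 ? 0 : Math.round((passed / total) * 100) / 100;
  return { passed, failed, total, pass_rate: passRate };
}

// The three expectations from the sample grading output above.
const sampleExpectations = [
  { text: "The output includes the name 'John Smith'", passed: true },
  { text: "The spreadsheet has a SUM formula in cell B10", passed: false },
  { text: "The assistant used the skill's OCR script", passed: true },
];

console.log(summarizeExpectations(sampleExpectations));
```

For the sample data this yields 2 passed, 1 failed, 3 total, and a `pass_rate` of 0.67, matching the JSON example in the file. The empty-list case returns 0 rather than dividing by zero, which is a design choice of this sketch.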

package/dist/bundled-skills/skill-creator/assets/eval_review.html ADDED

@@ -0,0 +1,292 @@
+<!doctype html>
+<html lang="en">
+  <head>
+    <meta charset="UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <title>Eval Set Review - __SKILL_NAME_PLACEHOLDER__</title>
+    <link rel="preconnect" href="https://fonts.googleapis.com" />
+    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
+    <link
+      href="https://fonts.googleapis.com/css2?family=Poppins:wght@500;600&family=Lora:wght@400;500&display=swap"
+      rel="stylesheet"
+    />
+    <style>
+      * {
+        box-sizing: border-box;
+        margin: 0;
+        padding: 0;
+      }
+      body {
+        font-family: "Lora", Georgia, serif;
+        background: #faf9f5;
+        padding: 2rem;
+        color: #141413;
+      }
+      h1 {
+        font-family: "Poppins", sans-serif;
+        margin-bottom: 0.5rem;
+        font-size: 1.5rem;
+      }
+      .description {
+        color: #b0aea5;
+        margin-bottom: 1.5rem;
+        font-style: italic;
+        max-width: 900px;
+      }
+      .controls {
+        margin-bottom: 1rem;
+        display: flex;
+        gap: 0.5rem;
+      }
+      .btn {
+        font-family: "Poppins", sans-serif;
+        padding: 0.5rem 1rem;
+        border: none;
+        border-radius: 6px;
+        cursor: pointer;
+        font-size: 0.875rem;
+        font-weight: 500;
+      }
+      .btn-add {
+        background: #6a9bcc;
+        color: white;
+      }
+      .btn-add:hover {
+        background: #5889b8;
+      }
+      .btn-export {
+        background: #d97757;
+        color: white;
+      }
+      .btn-export:hover {
+        background: #c4613f;
+      }
+      table {
+        width: 100%;
+        max-width: 1100px;
+        border-collapse: collapse;
+        background: white;
+        border-radius: 6px;
+        overflow: hidden;
+        box-shadow: 0 1px 3px rgba(0, 0, 0, 0.08);
+      }
+      th {
+        font-family: "Poppins", sans-serif;
+        background: #141413;
+        color: #faf9f5;
+        padding: 0.75rem 1rem;
+        text-align: left;
+        font-size: 0.875rem;
+      }
+      td {
+        padding: 0.75rem 1rem;
+        border-bottom: 1px solid #e8e6dc;
+        vertical-align: top;
+      }
+      tr:nth-child(even) td {
+        background: #faf9f5;
+      }
+      tr:hover td {
+        background: #f3f1ea;
+      }
+      .section-header td {
+        background: #e8e6dc;
+        font-family: "Poppins", sans-serif;
+        font-weight: 500;
+        font-size: 0.8rem;
+        color: #141413;
+        text-transform: uppercase;
+        letter-spacing: 0.05em;
+      }
+      .query-input {
+        width: 100%;
+        padding: 0.4rem;
+        border: 1px solid #e8e6dc;
+        border-radius: 4px;
+        font-size: 0.875rem;
+        font-family: "Lora", Georgia, serif;
+        resize: vertical;
+        min-height: 60px;
+      }
+      .query-input:focus {
+        outline: none;
+        border-color: #d97757;
+        box-shadow: 0 0 0 2px rgba(217, 119, 87, 0.15);
+      }
+      .toggle {
+        position: relative;
+        display: inline-block;
+        width: 44px;
+        height: 24px;
+      }
+      .toggle input {
+        opacity: 0;
+        width: 0;
+        height: 0;
+      }
+      .toggle .slider {
+        position: absolute;
+        inset: 0;
+        background: #b0aea5;
+        border-radius: 24px;
+        cursor: pointer;
+        transition: 0.2s;
+      }
+      .toggle .slider::before {
+        content: "";
+        position: absolute;
+        width: 18px;
+        height: 18px;
+        left: 3px;
+        bottom: 3px;
+        background: white;
+        border-radius: 50%;
+        transition: 0.2s;
+      }
+      .toggle input:checked + .slider {
+        background: #d97757;
+      }
+      .toggle input:checked + .slider::before {
+        transform: translateX(20px);
+      }
+      .btn-delete {
+        background: #c44;
+        color: white;
+        padding: 0.3rem 0.6rem;
+        border: none;
+        border-radius: 4px;
+        cursor: pointer;
+        font-size: 0.75rem;
+        font-family: "Poppins", sans-serif;
+      }
+      .btn-delete:hover {
+        background: #a33;
+      }
+      .summary {
+        margin-top: 1rem;
+        color: #b0aea5;
+        font-size: 0.875rem;
+      }
+    </style>
+  </head>
+  <body>
+    <h1>Eval Set Review: <span id="skill-name">__SKILL_NAME_PLACEHOLDER__</span></h1>
+    <p class="description">
+      Current description: <span id="skill-desc">__SKILL_DESCRIPTION_PLACEHOLDER__</span>
+    </p>
+    <div class="controls">
+      <button class="btn btn-add" onclick="addRow()">+ Add Query</button>
+      <button class="btn btn-export" onclick="exportEvalSet()">Export Eval Set</button>
+    </div>
+    <table>
+      <thead>
+        <tr>
+          <th style="width: 65%">Query</th>
+          <th style="width: 18%">Should Trigger</th>
+          <th style="width: 10%">Actions</th>
+        </tr>
+      </thead>
+      <tbody id="eval-body"></tbody>
+    </table>
+    <p class="summary" id="summary"></p>
+    <script>
+      const EVAL_DATA = __EVAL_DATA_PLACEHOLDER__;
+      let evalItems = [...EVAL_DATA];
+      function render() {
+        const tbody = document.getElementById("eval-body");
+        tbody.innerHTML = "";
+        // Sort: should-trigger first, then should-not-trigger
+        const sorted = evalItems
+          .map((item, origIdx) => ({ ...item, origIdx }))
+          .sort((a, b) => (b.should_trigger ? 1 : 0) - (a.should_trigger ? 1 : 0));
+        let lastGroup = null;
+        sorted.forEach((item) => {
+          const group = item.should_trigger ? "trigger" : "no-trigger";
+          if (group !== lastGroup) {
+            const headerRow = document.createElement("tr");
+            headerRow.className = "section-header";
+            headerRow.innerHTML = `<td colspan="3">${item.should_trigger ? "Should Trigger" : "Should NOT Trigger"}</td>`;
+            tbody.appendChild(headerRow);
+            lastGroup = group;
+          }
+          const idx = item.origIdx;
+          const tr = document.createElement("tr");
+          tr.innerHTML = `
+          <td><textarea class="query-input" onchange="updateQuery(${idx}, this.value)">${escapeHtml(item.query)}</textarea></td>
+          <td>
+            <label class="toggle">
+              <input type="checkbox" ${item.should_trigger ? "checked" : ""} onchange="updateTrigger(${idx}, this.checked)">
+              <span class="slider"></span>
+            </label>
+            <span style="margin-left:8px;font-size:0.8rem;color:#b0aea5">${item.should_trigger ? "Yes" : "No"}</span>
+          </td>
+          <td><button class="btn-delete" onclick="deleteRow(${idx})">Delete</button></td>
+        `;
+          tbody.appendChild(tr);
+        });
+        updateSummary();
+      }
+      function escapeHtml(text) {
+        const div = document.createElement("div");
+        div.textContent = text;
+        return div.innerHTML;
+      }
+      function updateQuery(idx, value) {
+        evalItems[idx].query = value;
+        updateSummary();
+      }
+      function updateTrigger(idx, value) {
+        evalItems[idx].should_trigger = value;
+        render();
+      }
+      function deleteRow(idx) {
+        evalItems.splice(idx, 1);
+        render();
+      }
+      function addRow() {
+        evalItems.push({ query: "", should_trigger: true });
+        render();
+        const inputs = document.querySelectorAll(".query-input");
+        inputs[evalItems.filter((i) => i.should_trigger).length - 1].focus();
+      }
+      function updateSummary() {
+        const trigger = evalItems.filter((i) => i.should_trigger).length;
+        const noTrigger = evalItems.filter((i) => !i.should_trigger).length;
+        document.getElementById("summary").textContent =
+          `${evalItems.length} queries total: ${trigger} should trigger, ${noTrigger} should not trigger`;
+      }
+      function exportEvalSet() {
+        const valid = evalItems.filter((i) => i.query.trim() !== "");
+        const data = valid.map((i) => ({
+          query: i.query.trim(),
+          should_trigger: i.should_trigger,
+        }));
+        const blob = new Blob([JSON.stringify(data, null, 2)], { type: "application/json" });
+        const url = URL.createObjectURL(blob);
+        const a = document.createElement("a");
+        a.href = url;
+        a.download = "eval_set.json";
+        document.body.appendChild(a);
+        a.click();
+        document.body.removeChild(a);
+        URL.revokeObjectURL(url);
+      }
+      render();
+    </script>
+  </body>
+</html>
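
Reviewer note on `eval_review.html`: the file is a template with three placeholders (`__SKILL_NAME_PLACEHOLDER__`, `__SKILL_DESCRIPTION_PLACEHOLDER__`, `__EVAL_DATA_PLACEHOLDER__`). The diff shown here does not include the code that fills them, so the following is only a sketch of how a caller might do it safely. `fillEvalReviewTemplate` and `escapeHtmlText` are hypothetical names, and the sample skill name, description, and query are made-up values.

```javascript
// Hypothetical helpers (not part of the package): fill the three
// placeholders that appear in eval_review.html.
function escapeHtmlText(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function fillEvalReviewTemplate(template, { skillName, skillDescription, evalData }) {
  // Escape "<" so a query containing "</script>" cannot end the script
  // element early; "\u003c" is still read as "<" inside a JS string literal.
  const dataLiteral = JSON.stringify(evalData).replace(/</g, "\\u003c");
  // split/join replaces every occurrence and sidesteps the special "$"
  // patterns that String.prototype.replace interprets in replacement strings.
  return template
    .split("__SKILL_NAME_PLACEHOLDER__").join(escapeHtmlText(skillName))
    .split("__SKILL_DESCRIPTION_PLACEHOLDER__").join(escapeHtmlText(skillDescription))
    .split("__EVAL_DATA_PLACEHOLDER__").join(dataLiteral);
}

// Usage with a cut-down template and made-up sample values.
const miniTemplate =
  "<title>__SKILL_NAME_PLACEHOLDER__</title>" +
  "<script>const EVAL_DATA = __EVAL_DATA_PLACEHOLDER__;</script>";

const filled = fillEvalReviewTemplate(miniTemplate, {
  skillName: "pdf-tools",
  skillDescription: "Fill & inspect PDF forms",
  evalData: [{ query: "Fill in this PDF form", should_trigger: true }],
});

console.log(filled);
```

The data is injected as a JSON literal because the template assigns it directly to `const EVAL_DATA`, while the name and description land in HTML text and therefore get entity escaping instead.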