npm - @dailephd/my-dev-kit-lab - Versions diffs - 0.2.0 - Mend

@dailephd/my-dev-kit-lab 0.2.0


Files changed (250)

package/docs/METRICS.md ADDED

@@ -0,0 +1,286 @@
+# Metrics
+This document is the canonical metric glossary for my-dev-kit-lab. It defines every metric that appears in benchmark profiles, prompt variants, controlled experiment artifacts, and rendered reports.
+Related documentation:
+- [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) — how metrics flow through the pipeline
+- [docs/TUTORIAL.md](docs/TUTORIAL.md) — how to read token savings and correctness scores in the report
+- [docs/CURRENT_STATE.md](docs/CURRENT_STATE.md) — current baseline and limitations
+## Metric interpretation quick reference
+| Metric | Positive or higher means | Negative or lower means | N/A means |
+|---|---|---|---|
+| `tokenSavings` | my-dev-kit used fewer tokens | my-dev-kit used more tokens | Token totals unavailable for one or both runs |
+| `correctnessScore` | More answer-key facts matched | Fewer answer-key facts matched | Run did not complete |
+| `complexityScore` | Higher project complexity | Lower project complexity | — |
+**Token savings notes:**
+- A positive token savings value means the my-dev-kit-guided strategy used fewer tokens than raw-full-file for that run pair.
+- A negative token savings value means my-dev-kit-guided used more tokens. This can happen on small projects where raw-full-file is cheaper.
+- Token savings are only computed when both paired runs expose token totals. Claude does not expose token totals. Codex may expose token totals but can produce timeouts or invalid-output runs.
+- Token counts in fake-agent runs are estimated using `Math.ceil(characterCount / 4)`. These are context-size estimates, not provider billing totals.
+**Correctness scoring notes:**
+- Correctness is scored deterministically against benchmark answer keys. It is not semantic LLM judging.
+- A run passes if it meets or exceeds the `minimumCorrectFacts` threshold defined in the answer key.
+**Complexity score notes:**
+- The complexity score is a heuristic 0-100 weighted score. Higher scores indicate projects where raw full-file reading is less attractive.
+- Small projects may show negative token savings because raw-full-file is cheaper when the entire project fits easily in context.
+- Larger, more localized tasks are where my-dev-kit is expected to become more useful.
+## Project Complexity Metrics
+- `fileCount`
+  Meaning: total non-generated files captured in the benchmark file tree.
+  Appears in: `benchmarks/contracts/benchmark-project-profiles.json`, experiment report project sections.
+  Formula: count of file entries in `src/evaluation/projectFileTree.ts`.
+  Interpretation: higher means more total files to scan.
+  Caveat: includes config and docs files, not only source.
+- `sourceFileCount`
+  Meaning: number of source-role files in the benchmark project.
+  Appears in: project profiles and reports.
+  Formula: file-tree entries where role is `source`.
+  Interpretation: higher means broader implementation surface.
+  Caveat: role detection is path-based.
+- `testFileCount`
+  Meaning: number of test-role files in the benchmark project.
+  Appears in: project profiles and reports.
+  Formula: file-tree entries where role is `test`.
+  Interpretation: higher can increase raw full-file context size.
+  Caveat: test helpers outside test roots still depend on role detection.
+- `totalLinesOfCode`
+  Meaning: approximate code lines across source and test files.
+  Appears in: project profiles and reports.
+  Formula: nonblank, non-comment lines counted by `countApproximateCodeLines`.
+  Interpretation: higher means more code context overall.
+  Caveat: this is approximate and language-agnostic.
+- `sourceLinesOfCode`
+  Meaning: approximate code lines across source files only.
+  Appears in: project profiles and reports.
+  Formula: approximate code-line count over source-role files.
+  Interpretation: higher usually means more production logic.
+  Caveat: comment stripping is simple.
+- `testLinesOfCode`
+  Meaning: approximate code lines across test files only.
+  Appears in: project profiles and reports.
+  Formula: approximate code-line count over test-role files.
+  Interpretation: higher can increase context noise for raw reads.
+  Caveat: tests may still be relevant to answer-key tasks.
+- `languageCount`
+  Meaning: number of detected code languages in source and test files.
+  Appears in: project profiles and reports.
+  Formula: unique language count from file-tree metadata.
+  Interpretation: higher means more language switching cost.
+  Caveat: only known file extensions are counted.
+- `internalImportCount`
+  Meaning: approximate count of local/internal imports in source files.
+  Appears in: project profiles and reports.
+  Formula: import-pattern count from `countInternalImports`.
+  Interpretation: higher means more cross-file coupling.
+  Caveat: regex-based and approximate.
+- `exportedSymbolEstimate`
+  Meaning: approximate count of exported or top-level callable symbols.
+  Appears in: project profiles and reports.
+  Formula: regex-based count from `countExportedSymbols`.
+  Interpretation: higher means more symbol-selection work.
+  Caveat: Python counting treats top-level defs/classes as exported.
+- `taskCount`
+  Meaning: number of benchmark tasks associated with the project suite or case set used to profile it.
+  Appears in: project profiles and reports.
+  Formula: provided task stats input during profile generation.
+  Interpretation: higher suggests broader benchmark coverage.
+  Caveat: this is metadata, not code structure.
+- `expectedRelevantFilesAverage`
+  Meaning: average count of expected relevant files across answer-key tasks.
+  Appears in: project profiles and reports.
+  Formula: average expected-file count from profiled tasks.
+  Interpretation: higher means tasks span more files.
+  Caveat: depends on case selection quality.
+- `expectedRelevantSymbolsAverage`
+  Meaning: average count of expected relevant symbols across answer-key tasks.
+  Appears in: project profiles and reports.
+  Formula: average expected-symbol count from profiled tasks.
+  Interpretation: higher means symbol selection is less trivial.
+  Caveat: depends on answer-key breadth.
+- `maxFileLines`
+  Meaning: maximum raw line count of any code file in the project.
+  Appears in: project profiles and reports.
+  Formula: max `lines` value from code-role file-tree entries.
+  Interpretation: higher means a single-file read can be heavier.
+  Caveat: uses raw line counts, not approximate code lines.
+- `averageFileLines`
+  Meaning: average raw line count across code files.
+  Appears in: project profiles and reports.
+  Formula: average `lines` value across source and test code entries.
+  Interpretation: higher means longer files on average.
+  Caveat: small files and tests can pull the average down.
+- `complexityScore`
+  Meaning: 0-100 weighted project complexity score.
+  Appears in: project profiles and experiment reports.
+  Formula: `benchmark-project-complexity-v1` in `src/evaluation/projectComplexity.ts`.
+  Interpretation: higher means raw full-file reading should be less attractive.
+  Caveat: it is heuristic, not a runtime truth metric.
+- `complexityLevel`
+  Meaning: bucketed project size label such as `small`, `medium`, or `large`.
+  Appears in: project profiles and reports.
+  Formula: manually assigned profile label.
+  Interpretation: human-readable size category.
+  Caveat: coarse label; use the score and metrics for detail.
+## Prompt Complexity Metrics
+- `promptChars`
+  Meaning: prompt length in characters.
+  Appears in: prompt variants, experiment runs, and prompt report tables.
+  Formula: `promptText.length`.
+  Interpretation: higher means more instruction payload.
+  Caveat: character count is not provider billing.
+- `promptEstimatedTokens`
+  Meaning: estimated prompt tokens.
+  Appears in: prompt variants and prompt report tables.
+  Formula: `estimated_chars_div_4` via `src/core/countTokens.ts`.
+  Interpretation: useful for rough relative comparisons.
+  Caveat: not provider-reported usage.
+- `instructionCount`
+  Meaning: approximate count of instruction-like phrases in the prompt.
+  Appears in: prompt report tables.
+  Formula: regex count in `measurePromptComplexity`.
+  Interpretation: higher means denser instruction framing.
+  Caveat: approximate text heuristic.
+- `constraintCount`
+  Meaning: approximate count of constraint-like phrases in the prompt.
+  Appears in: prompt report tables.
+  Formula: regex count in `measurePromptComplexity`.
+  Interpretation: higher means tighter behavioral constraints.
+  Caveat: approximate text heuristic.
+- `requestedOutputFieldCount`
+  Meaning: count of output fields explicitly requested from the agent.
+  Appears in: prompt report tables.
+  Formula: number of known field names found in the prompt text.
+  Interpretation: higher means a more structured answer contract.
+  Caveat: limited to predefined field names.
+- `taskStepCount`
+  Meaning: count of numbered steps in the prompt body.
+  Appears in: prompt report tables.
+  Formula: regex count of `1.`, `2.`, and so on.
+  Interpretation: higher means more explicit workflow steps.
+  Caveat: only numbered steps count.
+- `expectedFactCount`
+  Meaning: number of answer-key facts in scope for the prompt.
+  Appears in: prompt report tables.
+  Formula: answer-key fact count.
+  Interpretation: higher means more correctness evidence required.
+  Caveat: depends on case design.
+- `expectedFileCount`
+  Meaning: number of expected relevant files in the answer key.
+  Appears in: prompt report tables.
+  Formula: answer-key expected-file count.
+  Interpretation: higher means broader context demand.
+  Caveat: answer-key driven.
+- `expectedSymbolCount`
+  Meaning: number of expected relevant symbols in the answer key.
+  Appears in: prompt report tables.
+  Formula: answer-key expected-symbol count.
+  Interpretation: higher means more symbol-level targeting.
+  Caveat: answer-key driven.
+- `requiresGraphGuidedRetrieval`
+  Meaning: whether the prompt explicitly requires my-dev-kit retrieval flow.
+  Appears in: prompt report tables.
+  Formula: `strategy === "my-dev-kit-guided"`.
+  Interpretation: `true` means command-guided retrieval is expected.
+  Caveat: not a guarantee that the agent followed it.
+- `requiresCommandExecution`
+  Meaning: whether the prompt expects command execution.
+  Appears in: prompt report tables.
+  Formula: `strategy === "my-dev-kit-guided"`.
+  Interpretation: `true` means retrieval commands are part of the task.
+  Caveat: prompt intent only.
+## Experiment And Run Metrics
+- `durationMs`
+  Meaning: measured wall-clock duration of a normalized run.
+  Appears in: experiment runs, comparisons, and reports.
+  Formula: runtime duration from `runMeasuredCommand` or orchestrator timing.
+  Interpretation: lower is faster.
+  Caveat: includes local CLI overhead.
+- `status`
+  Meaning: normalized run outcome such as `completed`, `failed`, `timeout`, `agent-unavailable`, `agent-limit-reached`, or `invalid-output`.
+  Appears in: experiment runs and reports.
+  Formula: outcome classification in `src/evaluation/classifyAgentRunOutcome.ts`.
+  Interpretation: explains whether a run is usable for comparison.
+  Caveat: external account/session failures are not code regressions.
+- `tokenUsageSource`
+  Meaning: where token counts came from.
+  Appears in: experiment runs and reports.
+  Formula: adapter normalization from `src/agents`.
+  Interpretation: provider-reported sources are stronger than missing values.
+  Caveat: depends on adapter output format.
+- `tokenUsageReliability`
+  Meaning: trust label for token usage fields.
+  Appears in: experiment runs and reports.
+  Formula: adapter normalization from `src/agents`.
+  Interpretation: stronger labels mean better comparison quality.
+  Caveat: missing or partial token fields reduce reliability.
+- `inputTokens`
+  Meaning: provider-reported input token count when available.
+  Appears in: experiment runs and reports.
+  Formula: parsed from agent output.
+  Interpretation: lower means less prompt/context input.
+  Caveat: may be unavailable.
+- `outputTokens`
+  Meaning: provider-reported output token count when available.
+  Appears in: experiment runs and reports.
+  Formula: parsed from agent output.
+  Interpretation: lower means a shorter generated response.
+  Caveat: may be unavailable.
+- `totalTokens`
+  Meaning: provider-reported total token count when available.
+  Appears in: experiment runs, comparisons, and reports.
+  Formula: parsed from agent output or combined provider fields.
+  Interpretation: used for token savings comparisons.
+  Caveat: prompt estimates do not replace missing totals.
+- `correctnessScore`
+  Meaning: deterministic answer-key-based correctness score.
+  Appears in: correctness artifacts and reports.
+  Formula: `0.25 * fileMatchScore + 0.25 * symbolMatchScore + 0.50 * factMatchScore`.
+  Interpretation: higher is better; pass threshold is `>= 0.70` with required fact checks.
+  Caveat: not semantic judging.
+- `fileMatchScore`
+  Meaning: fraction of expected files found by the parsed answer.
+  Appears in: correctness artifacts and reports.
+  Formula: expected files found divided by expected files total.
+  Interpretation: higher means better file targeting.
+  Caveat: exact-file matching is strict.
+- `symbolMatchScore`
+  Meaning: fraction of expected symbols found by the parsed answer.
+  Appears in: correctness artifacts and reports.
+  Formula: expected symbols found divided by expected symbols total.
+  Interpretation: higher means better symbol targeting.
+  Caveat: depends on parsed answer quality.
+- `factMatchScore`
+  Meaning: weighted fraction of expected facts found by the parsed answer.
+  Appears in: correctness artifacts and reports.
+  Formula: matched fact weights divided by total fact weights.
+  Interpretation: higher means better factual correctness coverage.
+  Caveat: answer-key fact wording still matters.
+- `tokenSavingsPercent`
+  Meaning: percent reduction in total tokens for my-dev-kit versus raw full-file.
+  Appears in: experiment comparisons, summaries, and reports.
+  Formula: `(rawTotalTokens - myDevKitTotalTokens) / rawTotalTokens * 100`.
+  Interpretation: positive means my-dev-kit used fewer tokens; negative means it used more.
+  Caveat: only valid when both paired runs expose total tokens.
+- `durationReductionPercent`
+  Meaning: percent reduction in wall-clock duration for my-dev-kit versus raw full-file.
+  Appears in: experiment comparisons, summaries, and reports.
+  Formula: `(rawDurationMs - myDevKitDurationMs) / rawDurationMs * 100`.
+  Interpretation: positive means my-dev-kit was faster; negative means it was slower.
+  Caveat: local machine noise affects timing.
+- `reliabilityLabel`
+  Meaning: comparison-level quality label such as `strong`, `correctness-only`, `partial`, `unavailable`, `limit-reached`, or `failed`.
+  Appears in: experiment comparisons and reports.
+  Formula: derived from paired run outcomes and metric availability.
+  Interpretation: stronger labels mean safer aggregate interpretation.
+  Caveat: comparison reliability is not the same as correctness.
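The formulas quoted in METRICS.md for `tokenSavingsPercent`, `durationReductionPercent`, `correctnessScore`, and the fake-agent token estimate can be sketched in a few lines of TypeScript. This is a minimal illustration, not the package's implementation: the function names `estimateTokens`, `percentReduction`, and `weightedCorrectness` are hypothetical, `characterCount` is assumed to be the string length, and returning `null` for a missing or zero baseline is an assumption standing in for the "N/A" case.

```typescript
// Hypothetical helper names; only the formulas are taken from METRICS.md.

/** Context-size estimate, Math.ceil(characterCount / 4). Not a provider billing total. */
function estimateTokens(text: string): number {
  // Assumption: characterCount is the JavaScript string length.
  return Math.ceil(text.length / 4);
}

/**
 * (baseline - candidate) / baseline * 100.
 * Returns null when either value is missing, mirroring the "N/A" column.
 * Treating a zero baseline as N/A is an assumption of this sketch.
 */
function percentReduction(
  baseline: number | null,
  candidate: number | null,
): number | null {
  if (baseline === null || candidate === null || baseline === 0) {
    return null;
  }
  return ((baseline - candidate) / baseline) * 100;
}

/** 0.25 * fileMatchScore + 0.25 * symbolMatchScore + 0.50 * factMatchScore. */
function weightedCorrectness(
  fileMatchScore: number,
  symbolMatchScore: number,
  factMatchScore: number,
): number {
  return 0.25 * fileMatchScore + 0.25 * symbolMatchScore + 0.5 * factMatchScore;
}

// Illustrative inputs only; these are not measured results.
console.log(estimateTokens("abcdefghij"));      // 10 characters
console.log(percentReduction(2000, 1500));      // guided run used fewer tokens
console.log(percentReduction(1000, 1250));      // guided run used more tokens
console.log(percentReduction(null, 1250));      // one run exposes no total
console.log(weightedCorrectness(1, 0.5, 1));
```

The same `percentReduction` helper covers both tokens and duration because the two documented formulas have the same shape, with raw-full-file as the baseline and my-dev-kit-guided as the candidate.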

package/examples/demo-report-input.json ADDED

@@ -0,0 +1,78 @@
+{
+  "reportId": "demo-report",
+  "title": "Benchmark Validation Demo Report",
+  "projectName": "my-dev-kit-lab",
+  "benchmarkProject": "todo-ts",
+  "workflowName": "benchmark validation demo",
+  "generatedAt": "2026-06-10T15:00:00.000Z",
+  "summary": "Deterministic sample report for Prompt 2 report rendering and screenshot capture. This report does not claim token-savings evaluation.",
+  "steps": [
+    {
+      "id": "load-contracts",
+      "label": "Load benchmark contracts",
+      "command": "npm run verify:benchmarks",
+      "status": "pass",
+      "durationMs": 42,
+      "notes": "Validated benchmark contract structure."
+    },
+    {
+      "id": "run-project-tests",
+      "label": "Run benchmark project tests",
+      "command": "npm run test:benchmarks",
+      "status": "pass",
+      "durationMs": 118,
+      "notes": "Todo benchmark behavior stayed aligned."
+    },
+    {
+      "id": "prepare-report",
+      "label": "Prepare report artifacts",
+      "command": "npm run capture-demo-report",
+      "status": "pass",
+      "durationMs": 35,
+      "notes": "No token-savings metrics are reported in this demo."
+    }
+  ],
+  "metrics": [
+    {
+      "id": "benchmark-project-count",
+      "label": "Benchmark projects",
+      "value": 4,
+      "interpretation": "Prompt 1 benchmark suite is available."
+    },
+    {
+      "id": "step-count",
+      "label": "Workflow steps",
+      "value": 3,
+      "interpretation": "Demo workflow uses a small deterministic sequence."
+    },
+    {
+      "id": "warning-count",
+      "label": "Warnings",
+      "value": 1,
+      "interpretation": "This demo explicitly notes that token comparison is not implemented yet."
+    },
+    {
+      "id": "artifact-count",
+      "label": "Artifacts declared",
+      "value": 2,
+      "interpretation": "JSON and HTML are always expected before optional PNG capture."
+    }
+  ],
+  "artifacts": [
+    {
+      "id": "report-json",
+      "label": "Demo report JSON",
+      "path": "lab-output/demo-report/demo-report.json",
+      "kind": "json"
+    },
+    {
+      "id": "report-html",
+      "label": "Demo report HTML",
+      "path": "lab-output/demo-report/demo-report.html",
+      "kind": "html"
+    }
+  ],
+  "warnings": [
+    "Token-savings evaluation is not implemented in Prompt 2."
+  ]
+}
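In the demo report input above, the `step-count`, `artifact-count`, and `warning-count` metric values happen to equal the lengths of the `steps`, `artifacts`, and `warnings` arrays. A minimal sketch of a check for that relationship follows. The interface `DemoReportInput` and the function `findCountMismatches` are hypothetical names, and the idea that these metric ids are meant to mirror array lengths is an assumption drawn only from this one fixture.

```typescript
// Hypothetical shape covering only the fields this sketch reads.
interface DemoReportInput {
  steps: { id: string; status: string; durationMs: number }[];
  metrics: { id: string; label: string; value: number }[];
  artifacts: { id: string; path: string; kind: string }[];
  warnings: string[];
}

/** Lists count metrics whose declared value differs from the matching array length. */
function findCountMismatches(report: DemoReportInput): string[] {
  // Assumption: these three metric ids describe array lengths.
  const actualCounts = new Map<string, number>([
    ["step-count", report.steps.length],
    ["artifact-count", report.artifacts.length],
    ["warning-count", report.warnings.length],
  ]);
  const problems: string[] = [];
  for (const metric of report.metrics) {
    const actual = actualCounts.get(metric.id);
    if (actual !== undefined && metric.value !== actual) {
      problems.push(`${metric.id}: declared ${metric.value}, found ${actual}`);
    }
  }
  return problems;
}

// Trimmed copy of the fixture, keeping ids, counts, and durations.
const demoReport: DemoReportInput = {
  steps: [
    { id: "load-contracts", status: "pass", durationMs: 42 },
    { id: "run-project-tests", status: "pass", durationMs: 118 },
    { id: "prepare-report", status: "pass", durationMs: 35 },
  ],
  metrics: [
    { id: "benchmark-project-count", label: "Benchmark projects", value: 4 },
    { id: "step-count", label: "Workflow steps", value: 3 },
    { id: "warning-count", label: "Warnings", value: 1 },
    { id: "artifact-count", label: "Artifacts declared", value: 2 },
  ],
  artifacts: [
    { id: "report-json", path: "lab-output/demo-report/demo-report.json", kind: "json" },
    { id: "report-html", path: "lab-output/demo-report/demo-report.html", kind: "html" },
  ],
  warnings: ["Token-savings evaluation is not implemented in Prompt 2."],
};

console.log(findCountMismatches(demoReport)); // prints an empty array
```

Metrics the sketch does not recognize, such as `benchmark-project-count`, are skipped rather than flagged.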

package/examples/lab-demo-cases.json ADDED

@@ -0,0 +1,35 @@
+[
+  {
+    "id": "todo-ts-create-task-demo",
+    "title": "Todo TS create task demo retrieval",
+    "benchmarkProject": "todo-ts",
+    "projectProfileRef": "todo-ts",
+    "targetRoot": "benchmarks/projects/todo-ts",
+    "sourceRoots": ["src", "tests"],
+    "query": "create task deterministic id task-1",
+    "expectedFiles": ["src/taskService.ts", "src/taskStore.ts"],
+    "expectedSymbols": ["createTask", "TaskService"],
+    "answerKey": {
+      "expectedFiles": ["src/taskService.ts", "src/taskStore.ts"],
+      "expectedSymbols": ["createTask", "TaskService"],
+      "expectedFacts": [
+        {
+          "id": "create-deterministic-id",
+          "text": "createTask assigns deterministic IDs such as task-1 and task-2.",
+          "weight": 1,
+          "required": true
+        },
+        {
+          "id": "create-stores-incomplete-task",
+          "text": "createTask stores new tasks with completed set to false.",
+          "weight": 1,
+          "required": true
+        }
+      ],
+      "minimumCorrectFacts": 2,
+      "notes": "Demo metadata prepares later correctness scoring without changing the current lab-demo command."
+    },
+    "rawIncludeGlobs": ["src/**/*", "tests/**/*"],
+    "notes": "Small single-case lab demo fixture."
+  }
+]
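The `answerKey` in this fixture carries the inputs that METRICS.md describes for `factMatchScore` (matched fact weights divided by total fact weights) and for the `minimumCorrectFacts` threshold. The sketch below shows one way those two values could be derived. It is not the package's scorer: `scoreFacts` is a hypothetical name, how a fact is judged "matched" is left out and passed in as a set of fact ids, and returning 0 for an empty answer key is an assumption.

```typescript
interface ExpectedFact {
  id: string;
  text: string;
  weight: number;
  required: boolean;
}

interface AnswerKeyFacts {
  expectedFacts: ExpectedFact[];
  minimumCorrectFacts: number;
}

interface FactScore {
  factMatchScore: number;
  meetsMinimum: boolean;
  missingRequired: string[];
}

/** Hypothetical scorer: takes the ids of facts already judged as matched. */
function scoreFacts(key: AnswerKeyFacts, matchedFactIds: Set<string>): FactScore {
  const matched = key.expectedFacts.filter((fact) => matchedFactIds.has(fact.id));
  const totalWeight = key.expectedFacts.reduce((sum, fact) => sum + fact.weight, 0);
  const matchedWeight = matched.reduce((sum, fact) => sum + fact.weight, 0);
  return {
    // Assumption: an answer key with no weighted facts scores 0.
    factMatchScore: totalWeight === 0 ? 0 : matchedWeight / totalWeight,
    meetsMinimum: matched.length >= key.minimumCorrectFacts,
    missingRequired: key.expectedFacts
      .filter((fact) => fact.required && !matchedFactIds.has(fact.id))
      .map((fact) => fact.id),
  };
}

// Facts copied from the todo-ts-create-task-demo case above.
const demoAnswerKey: AnswerKeyFacts = {
  expectedFacts: [
    {
      id: "create-deterministic-id",
      text: "createTask assigns deterministic IDs such as task-1 and task-2.",
      weight: 1,
      required: true,
    },
    {
      id: "create-stores-incomplete-task",
      text: "createTask stores new tasks with completed set to false.",
      weight: 1,
      required: true,
    },
  ],
  minimumCorrectFacts: 2,
};

// Hypothetical run in which only the first fact was matched.
const partialRun = scoreFacts(demoAnswerKey, new Set(["create-deterministic-id"]));
console.log(partialRun);
```

With one of two equally weighted facts matched, the run in this example falls below `minimumCorrectFacts` and reports the second fact as a missing required fact.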

package/examples/real-agent-campaign-cases.json ADDED

@@ -0,0 +1,118 @@
+[
+  {
+    "id": "task-workflow-import-dedup",
+    "title": "Task workflow import and dedupe retrieval",
+    "benchmarkProject": "task-workflow-medium-ts",
+    "projectProfileRef": "task-workflow-medium-ts",
+    "targetRoot": "benchmarks/projects/task-workflow-medium-ts",
+    "sourceRoots": ["src", "tests"],
+    "query": "import tasks duplicate normalized title deterministic id summarize workflow project",
+    "expectedFiles": [
+      "src/services/importTasks.ts",
+      "src/store/taskStore.ts",
+      "src/validation/taskValidation.ts"
+    ],
+    "expectedSymbols": ["importTasks", "validateImportInput", "findDuplicate", "createTask"],
+    "answerKey": {
+      "expectedFiles": [
+        "src/services/importTasks.ts",
+        "src/store/taskStore.ts",
+        "src/validation/taskValidation.ts"
+      ],
+      "expectedSymbols": ["importTasks", "validateImportInput", "findDuplicate", "createTask"],
+      "expectedFacts": [
+        {
+          "id": "workflow-import-validates-title-and-project",
+          "text": "importTasks validates normalized task titles and project ids before inserting records.",
+          "weight": 1,
+          "required": true
+        },
+        {
+          "id": "workflow-import-skips-duplicates-by-project-and-normalized-title",
+          "text": "importTasks skips duplicates when a task with the same normalized title already exists in the same project.",
+          "weight": 1,
+          "required": true
+        },
+        {
+          "id": "workflow-import-keeps-deterministic-task-sequences",
+          "text": "New imported tasks still receive deterministic ids such as task-1 and task-2 from the shared store sequence.",
+          "weight": 1,
+          "required": true
+        }
+      ],
+      "minimumCorrectFacts": 3,
+      "notes": "This case is intended to require multiple files: service, validation, and store logic."
+    },
+    "rawIncludeGlobs": ["src/**/*", "tests/**/*"],
+    "notes": "Medium project case for real-agent comparison."
+  },
+  {
+    "id": "task-analytics-health-label",
+    "title": "Task analytics health label retrieval",
+    "benchmarkProject": "task-analytics-large-mixed",
+    "projectProfileRef": "task-analytics-large-mixed",
+    "targetRoot": "benchmarks/projects/task-analytics-large-mixed",
+    "sourceRoots": ["ts/src", "ts/tests", "py/task_analytics", "py/tests"],
+    "query": "completion rate stale tasks quality label health report build analytics snapshot mixed language",
+    "expectedFiles": [
+      "ts/src/services/buildAnalyticsSnapshot.ts",
+      "ts/src/reporting/formatTaskHealthReport.ts",
+      "py/task_analytics/metrics.py",
+      "py/task_analytics/quality.py",
+      "py/task_analytics/reporting.py"
+    ],
+    "expectedSymbols": [
+      "buildAnalyticsSnapshot",
+      "formatTaskHealthReport",
+      "calculate_project_metrics",
+      "determine_quality_label",
+      "build_health_report"
+    ],
+    "answerKey": {
+      "expectedFiles": [
+        "ts/src/services/buildAnalyticsSnapshot.ts",
+        "ts/src/reporting/formatTaskHealthReport.ts",
+        "py/task_analytics/metrics.py",
+        "py/task_analytics/quality.py",
+        "py/task_analytics/reporting.py"
+      ],
+      "expectedSymbols": [
+        "buildAnalyticsSnapshot",
+        "formatTaskHealthReport",
+        "calculate_project_metrics",
+        "determine_quality_label",
+        "build_health_report"
+      ],
+      "expectedFacts": [
+        {
+          "id": "analytics-snapshot-computes-completion-rate",
+          "text": "The analytics snapshot computes completionRate as completedTasks divided by totalTasks times 100, rounded to two decimals.",
+          "weight": 1,
+          "required": true
+        },
+        {
+          "id": "analytics-snapshot-counts-stale-open-tasks",
+          "text": "Stale tasks count only open tasks whose updated day is at least ten days behind the current day.",
+          "weight": 1,
+          "required": true
+        },
+        {
+          "id": "python-quality-labels-healthy-watch-risk",
+          "text": "The Python quality layer maps metrics to healthy, watch, or risk labels based on completion rate and stale task thresholds.",
+          "weight": 1,
+          "required": true
+        },
+        {
+          "id": "python-reporting-renders-quality-lines",
+          "text": "The Python report writer emits one line per project with the quality label plus completion, open, and stale counts.",
+          "weight": 1,
+          "required": true
+        }
+      ],
+      "minimumCorrectFacts": 3,
+      "notes": "This case intentionally spans TypeScript snapshot building and Python quality reporting."
+    },
+    "rawIncludeGlobs": ["ts/src/**/*", "ts/tests/**/*", "py/task_analytics/**/*", "py/tests/**/*"],
+    "notes": "Large mixed-language case for real-agent comparison."
+  }
+]
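METRICS.md defines `expectedRelevantFilesAverage` and `expectedRelevantSymbolsAverage` as averages of the expected-file and expected-symbol counts over profiled tasks. The sketch below computes those averages from case entries shaped like the two above. The function name `averageExpectedCounts` is hypothetical, and grouping cases by `benchmarkProject` is an assumption about how tasks are attributed to a project profile.

```typescript
// Only the fields needed for the averages; other case fields are omitted.
interface CaseSummary {
  benchmarkProject: string;
  expectedFiles: string[];
  expectedSymbols: string[];
}

interface ExpectedAverages {
  expectedRelevantFilesAverage: number;
  expectedRelevantSymbolsAverage: number;
}

/** Hypothetical helper: averages expected-file and expected-symbol counts per project. */
function averageExpectedCounts(cases: CaseSummary[]): Map<string, ExpectedAverages> {
  const groups = new Map<string, CaseSummary[]>();
  for (const entry of cases) {
    const group = groups.get(entry.benchmarkProject) ?? [];
    group.push(entry);
    groups.set(entry.benchmarkProject, group);
  }
  const averages = new Map<string, ExpectedAverages>();
  groups.forEach((group, project) => {
    const fileTotal = group.reduce((sum, c) => sum + c.expectedFiles.length, 0);
    const symbolTotal = group.reduce((sum, c) => sum + c.expectedSymbols.length, 0);
    averages.set(project, {
      expectedRelevantFilesAverage: fileTotal / group.length,
      expectedRelevantSymbolsAverage: symbolTotal / group.length,
    });
  });
  return averages;
}

// Expected files and symbols copied from the two campaign cases above.
const campaignCases: CaseSummary[] = [
  {
    benchmarkProject: "task-workflow-medium-ts",
    expectedFiles: [
      "src/services/importTasks.ts",
      "src/store/taskStore.ts",
      "src/validation/taskValidation.ts",
    ],
    expectedSymbols: ["importTasks", "validateImportInput", "findDuplicate", "createTask"],
  },
  {
    benchmarkProject: "task-analytics-large-mixed",
    expectedFiles: [
      "ts/src/services/buildAnalyticsSnapshot.ts",
      "ts/src/reporting/formatTaskHealthReport.ts",
      "py/task_analytics/metrics.py",
      "py/task_analytics/quality.py",
      "py/task_analytics/reporting.py",
    ],
    expectedSymbols: [
      "buildAnalyticsSnapshot",
      "formatTaskHealthReport",
      "calculate_project_metrics",
      "determine_quality_label",
      "build_health_report",
    ],
  },
];

const campaignAverages = averageExpectedCounts(campaignCases);
campaignAverages.forEach((value, project) => console.log(project, value));
```

Because each project has a single case in this file, each average equals that case's raw count.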

package/examples/token-savings-cases.json ADDED

@@ -0,0 +1,122 @@
+[
+  {
+    "id": "todo-ts-create-task",
+    "title": "Todo TS create task retrieval",
+    "benchmarkProject": "todo-ts",
+    "projectProfileRef": "todo-ts",
+    "targetRoot": "benchmarks/projects/todo-ts",
+    "sourceRoots": ["src", "tests"],
+    "query": "create task deterministic id task-1",
+    "expectedFiles": ["src/taskService.ts", "src/taskStore.ts"],
+    "expectedSymbols": ["createTask", "TaskService"],
+    "answerKey": {
+      "expectedFiles": ["src/taskService.ts", "src/taskStore.ts"],
+      "expectedSymbols": ["createTask", "TaskService"],
+      "expectedFacts": [
+        {
+          "id": "create-deterministic-id",
+          "text": "createTask assigns deterministic IDs such as task-1 and task-2.",
+          "weight": 1,
+          "required": true
+        },
+        {
+          "id": "create-validates-title",
+          "text": "createTask validates that the title is not empty or whitespace-only.",
+          "weight": 1,
+          "required": true
+        }
+      ],
+      "minimumCorrectFacts": 2,
+      "notes": "Example metadata is optional for the current token-savings evaluator."
+    },
+    "rawIncludeGlobs": ["src/**/*", "tests/**/*"],
+    "notes": "TypeScript create task path."
+  },
+  {
+    "id": "todo-python-complete-task",
+    "title": "Todo Python complete task retrieval",
+    "benchmarkProject": "todo-python",
+    "projectProfileRef": "todo-python",
+    "targetRoot": "benchmarks/projects/todo-python",
+    "sourceRoots": ["src", "tests"],
+    "query": "complete task by id",
+    "expectedFiles": ["src/task_service.py"],
+    "expectedSymbols": ["complete_task"],
+    "answerKey": {
+      "expectedFiles": ["src/task_service.py"],
+      "expectedSymbols": ["complete_task"],
+      "expectedFacts": [
+        {
+          "id": "complete-finds-by-id",
+          "text": "complete_task locates the matching task by id.",
+          "weight": 1,
+          "required": true
+        },
+        {
+          "id": "complete-marks-completed",
+          "text": "complete_task marks the matching task as completed.",
+          "weight": 1,
+          "required": true
+        }
+      ],
+      "minimumCorrectFacts": 2
+    },
+    "rawIncludeGlobs": ["src/**/*", "tests/**/*"]
+  },
+  {
+    "id": "todo-js-open-tasks",
+    "title": "Todo JS open tasks retrieval",
+    "benchmarkProject": "todo-js",
+    "projectProfileRef": "todo-js",
+    "targetRoot": "benchmarks/projects/todo-js",
+    "sourceRoots": ["src", "tests"],
+    "query": "list open tasks incomplete tasks",
+    "expectedFiles": ["src/taskService.js"],
+    "expectedSymbols": ["listOpenTasks"],
+    "answerKey": {
+      "expectedFiles": ["src/taskService.js"],
+      "expectedSymbols": ["listOpenTasks"],
+      "expectedFacts": [
+        {
+          "id": "open-filters-incomplete",
+          "text": "listOpenTasks returns tasks where completed is false.",
+          "weight": 1,
+          "required": true
+        }
+      ],
+      "minimumCorrectFacts": 1
+    },
+    "rawIncludeGlobs": ["src/**/*", "tests/**/*"]
+  },
+  {
+    "id": "todo-mixed-summary",
+    "title": "Todo mixed summary retrieval",
+    "benchmarkProject": "todo-mixed-ts-py",
+    "projectProfileRef": "todo-mixed-ts-py",
+    "targetRoot": "benchmarks/projects/todo-mixed-ts-py",
+    "sourceRoots": ["src", "python", "tests"],
+    "query": "summarize tasks total open completed",
+    "expectedFiles": ["src/taskCli.ts", "python/task_service.py"],
+    "expectedSymbols": ["summarizeTasks", "summarize_tasks"],
+    "answerKey": {
+      "expectedFiles": ["src/taskCli.ts", "python/task_service.py"],
+      "expectedSymbols": ["summarizeTasks", "summarize_tasks"],
+      "expectedFacts": [
+        {
+          "id": "summary-total",
+          "text": "Summary reports total task count.",
+          "weight": 1,
+          "required": true
+        },
+        {
+          "id": "summary-open-completed",
+          "text": "Summary reports open and completed task counts.",
+          "weight": 1,
+          "required": true
+        }
+      ],
+      "minimumCorrectFacts": 2
+    },
+    "rawIncludeGlobs": ["src/**/*", "python/**/*", "tests/**/*"]
+  }
+]
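METRICS.md describes `fileMatchScore` and `symbolMatchScore` as the expected items found divided by the expected items total, and notes that exact-file matching is strict. A minimal sketch of that fraction, applied to the `todo-mixed-summary` case above, is shown below. The name `matchFraction` and the parsed-answer values are hypothetical, and whether the real scorer normalizes paths before comparing is not stated in the source, so plain string equality is an assumption here.

```typescript
/**
 * Fraction of expected items present in the found list, using exact string equality.
 * Assumption: an empty expected list scores 0 rather than 1.
 */
function matchFraction(expected: string[], found: string[]): number {
  if (expected.length === 0) {
    return 0;
  }
  const foundSet = new Set(found);
  const hits = expected.filter((item) => foundSet.has(item)).length;
  return hits / expected.length;
}

// Expected values copied from the todo-mixed-summary answer key.
const expectedFiles = ["src/taskCli.ts", "python/task_service.py"];
const expectedSymbols = ["summarizeTasks", "summarize_tasks"];

// Hypothetical parsed agent answer; the "./" prefix defeats exact matching.
const answeredFiles = ["src/taskCli.ts", "./python/task_service.py"];
const answeredSymbols = ["summarize_tasks", "summarizeTasks", "main"];

const fileMatchScore = matchFraction(expectedFiles, answeredFiles);
const symbolMatchScore = matchFraction(expectedSymbols, answeredSymbols);

console.log({ fileMatchScore, symbolMatchScore });
```

In this example the extra symbol `main` does not lower `symbolMatchScore`, since the documented formula divides by the expected total only, while the differently written path costs half of `fileMatchScore`.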