bluera-knowledge 0.34.1 → 0.34.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  {
    "name": "bluera-knowledge",
-   "version": "0.34.1",
+   "version": "0.34.2",
    "description": "Clone repos, crawl docs, search locally. Fast, authoritative answers for AI coding agents.",
    "author": {
      "name": "Bluera Inc",
package/CHANGELOG.md CHANGED
@@ -2,6 +2,14 @@

  All notable changes to this project will be documented in this file. See [commit-and-tag-version](https://github.com/absolute-version/commit-and-tag-version) for commit guidelines.

+ ## [0.34.2](https://github.com/blueraai/bluera-knowledge/compare/v0.34.0...v0.34.2) (2026-03-19)
+
+
+ ### Bug Fixes
+
+ * **hooks:** make Python hooks executable, add stdin drain, remove stale files ([219b645](https://github.com/blueraai/bluera-knowledge/commit/219b6459e955764645c8edd0c98ff0be2b9c96b8))
+ * **mcp:** expand shell variables in PROJECT_ROOT to prevent literal ${PWD} directories ([2ab025f](https://github.com/blueraai/bluera-knowledge/commit/2ab025f42fb8d063cccc3dacfc47ed87d299d634))
+
  ## [0.34.1](https://github.com/blueraai/bluera-knowledge/compare/v0.34.0...v0.34.1) (2026-03-12)


File without changes
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "bluera-knowledge",
-   "version": "0.34.1",
+   "version": "0.34.2",
    "description": "CLI tool for managing knowledge stores with semantic search",
    "type": "module",
    "bin": {
@@ -6,6 +6,9 @@
  # It exits quickly (0) if already set up, or runs full setup if needed.
  # Non-interactive: cannot prompt for user input (no TTY).

+ # Drain stdin so the pipe doesn't hang
+ cat > /dev/null 2>&1 || true
+
  PLUGIN_ROOT="${CLAUDE_PLUGIN_ROOT:-$(dirname "$(dirname "$0")")}"

  # Colors for output
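The changelog's **hooks** fix adds the same stdin drain to the Python hooks. A sketch of the pattern (the function name and payload are assumptions for illustration, not taken from the package):

```python
import io
import sys

def drain_stdin(stream=None) -> int:
    """Read and discard everything from the given stream (default: the
    real stdin), so the process writing into the pipe never blocks on a
    full pipe buffer when the hook ignores its input.
    Returns the number of bytes discarded."""
    stream = stream if stream is not None else sys.stdin.buffer
    return len(stream.read())

# Hooks receive a payload on stdin; even when it is unused,
# it must still be consumed:
drain_stdin(io.BytesIO(b'{"event": "setup"}'))
```

Accepting an injectable stream keeps the helper testable without wiring up a real pipe.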
@@ -3,271 +3,51 @@ name: advanced-workflows
  description: Multi-tool orchestration patterns for BK operations
  ---

- # Advanced Bluera Knowledge Workflows
+ # Advanced BK Workflows

- Master complex multi-tool operations that combine multiple MCP tools for efficient knowledge retrieval and management.
+ Multi-tool patterns for efficient knowledge retrieval and management.

- ## Progressive Library Exploration
+ ## Core Patterns

- When exploring a new library or codebase, use this pattern for efficient discovery:
+ ### Progressive Library Exploration

- ### Workflow: Find Relevant Code in Unknown Library
-
- ```
- 1. list_stores()
- → See what's indexed, identify target store
-
- 2. get_store_info(store)
- → Get metadata: file paths, size, indexed files
- → Understand scope before searching
-
- 3. search(query, detail='minimal', stores=[target])
- → Get high-level summaries of relevant code
- → Review relevance scores (>0.7 = good match)
-
- 4. get_full_context(result_ids[top_3])
- → Deep dive on most relevant results only
- → Get complete code with full context
- ```
-
- **Example:**
-
- User: "How does Vue's computed properties work?"
-
- ```
- list_stores()
- → Found: vue, react, pydantic
-
- get_store_info('vue')
- → Path: .bluera/bluera-knowledge/repos/vue/
- → Files: 2,847 indexed
-
- search("computed properties", intent='find-implementation', detail='minimal', stores=['vue'])
- → Result 1: packages/reactivity/src/computed.ts (score: 0.92)
- → Result 2: packages/reactivity/__tests__/computed.spec.ts (score: 0.85)
- → Result 3: packages/runtime-core/src/apiComputed.ts (score: 0.78)
-
- get_full_context(['result_1_id', 'result_2_id'])
- → Full code for ComputedRefImpl class
- → Complete API implementation
-
- Now explain with authoritative source code.
- ```
-
- ## Adding New Library with Job Monitoring
-
- When adding large libraries, monitor indexing progress to know when search is ready:
-
- ### Workflow: Add Library and Wait for Index
-
- ```
- 1. create_store(url_or_path, name)
- → Returns: job_id
- → Background indexing starts
-
- 2. check_job_status(job_id)
- → Poll every 10-30 seconds
- → Status: 'pending' | 'running' | 'completed' | 'failed'
- → Progress: percentage, current file
-
- 3. When status='completed':
- list_stores()
- → Verify store appears in list
-
- 4. search(query, stores=[new_store], limit=5)
- → Test search works
- → Verify indexing quality
- ```
-
- **Example:**
-
- ```
- create_store('https://github.com/fastapi/fastapi', 'fastapi')
- → job_id: 'job_abc123'
- → Status: Indexing started in background
-
- # Poll for completion (typically 30-120 seconds for medium repos)
- check_job_status('job_abc123')
- → Status: running
- → Progress: 45% (processing src/fastapi/routing.py)
-
- # ... wait 30 seconds ...
-
- check_job_status('job_abc123')
- → Status: completed
- → Indexed: 487 files, 125k lines
-
- # Verify and test
- list_stores()
- → fastapi: 487 files, vector + FTS indexed
-
- search("dependency injection", stores=['fastapi'], limit=3)
- → Returns relevant FastAPI DI patterns
- → Store is ready for use!
- ```
-
- ## Handling Large Result Sets
-
- When initial search returns many results, use progressive detail to avoid context overload:
-
- ### Workflow: Progressive Detail Strategy
-
- ```
- 1. search(query, detail='minimal', limit=20)
- → Get summaries only (~100 tokens/result)
- → Review all 20 summaries quickly
-
- 2. Filter by relevance score:
- - Score > 0.8: Excellent match
- - Score 0.6-0.8: Good match
- - Score < 0.6: Possibly irrelevant
-
- 3. For top 3-5 results (score > 0.7):
- get_full_context(selected_ids)
- → Fetch complete code only for relevant items
- → Saves ~80% context vs fetching all upfront
-
- 4. If nothing relevant:
- search(refined_query, detail='contextual', limit=10)
- → Try different query with more context
- → Or broaden/narrow the search
- ```
-
- **Example:**
-
- ```
- # Initial broad search
- search("authentication middleware", detail='minimal', limit=20)
- → 20 results, scores ranging 0.45-0.92
- → Total context: ~2k tokens (minimal)
-
- # Filter by score
- Top results (>0.7):
- - Result 3: auth/jwt.ts (score: 0.92)
- - Result 7: middleware/authenticate.ts (score: 0.85)
- - Result 12: auth/session.ts (score: 0.74)
-
- # Get full code for top 3 only
- get_full_context(['result_3', 'result_7', 'result_12'])
- → Complete implementations for relevant files only
- → Context: ~3k tokens (vs ~15k if we fetched all 20)
-
- # Found what we needed! If not, would refine query and retry.
- ```
-
- ## Multi-Store Search with Ranking
-
- When searching across multiple stores, use ranking to prioritize results:
-
- ### Workflow: Cross-Library Search
-
- ```
- 1. search(query, limit=10)
- → Searches ALL stores
- → Returns mixed results ranked by relevance
-
- 2. Review store distribution:
- - If dominated by one store: might narrow to specific stores
- - If balanced: good cross-library perspective
-
- 3. For specific library focus:
- search(query, stores=['lib1', 'lib2'], limit=15)
- → Search only relevant libraries
- → Get more results from target libraries
  ```
-
- **Example:**
-
- User: "How do different frameworks handle routing?"
-
+ 1. list_stores() → identify target store
+ 2. search(query, detail='minimal', stores=[target]) → scan summaries
+ 3. get_full_context(top_result_ids) → deep dive on best matches
  ```
- # Search all indexed frameworks
- search("routing implementation", intent='find-implementation', limit=15)
- → Result mix:
- - express (score: 0.91)
- - fastapi (score: 0.89)
- - hono (score: 0.87)
- - vue-router (score: 0.82)
- - ...
-
- # All stores represented, good comparative view!
-
- # If user wants deeper FastAPI focus:
- search("routing implementation", stores=['fastapi', 'starlette'], limit=20)
- → More FastAPI/Starlette-specific results
- → Deeper exploration of Python framework routing
- ```
-
- ## Error Recovery

- When operations fail, use these recovery patterns:
-
- ### Workflow: Handle Indexing Failures
+ ### Add Library + Wait for Index

  ```
- 1. create_store() fails or job_status shows 'failed'
- Check error message
- Common issues:
- - Git auth required (private repo)
- - Invalid URL/path
- - Disk space
- - Network timeout
-
- 2. Recovery actions:
- - Auth issue: Provide credentials or use HTTPS
- - Invalid path: Verify URL/path exists
- - Disk space: delete_store() unused stores
- - Network: Retry with smaller repo or use --shallow
-
- 3. Verify recovery:
- list_stores() → Check store appeared
- search(test_query, stores=[new_store]) → Verify searchable
+ 1. create_store(url, name) → job_id
+ 2. check_job_status(job_id) → poll every 15-30s
+ 3. When completed: search(query, stores=[name]) → verify
  ```

- **Example:**
+ ### Progressive Detail Strategy

  ```
- create_store('https://github.com/private/repo', 'my-repo')
- job_id: 'job_xyz'
-
- check_job_status('job_xyz')
- → Status: failed
- → Error: "Authentication required for private repository"
-
- # Recovery: Use authenticated URL or SSH
- create_store('git@github.com:private/repo.git', 'my-repo')
- → job_id: 'job_xyz2'
-
- check_job_status('job_xyz2')
- → Status: completed
- → Success!
+ 1. search(query, detail='minimal', limit=20) → ~100 tokens/result
+ 2. Filter by score (>0.7 = strong match)
+ 3. get_full_context(top_3_ids) → saves ~80% context
+ 4. If nothing relevant: refine query and retry
  ```

- ## Combining Workflows
-
- Real-world usage often combines these patterns:
+ ### Cross-Library Comparison

  ```
- User: "I need to understand how Express and Hono handle middleware differently"
-
- 1. list_stores() → check if both indexed
- 2. If not: create_store() for missing framework(s)
- 3. check_job_status() → wait for indexing
- 4. search("middleware implementation", stores=['express', 'hono'], detail='minimal')
- 5. Review summaries, identify key files
- 6. get_full_context() for 2-3 most relevant from each framework
- 7. Compare implementations with full context
+ 1. search(query, limit=10) → searches ALL stores
+ 2. Review store distribution
+ 3. Narrow: search(query, stores=['lib1', 'lib2'], limit=15)
  ```

- This multi-step workflow is efficient, targeted, and conserves context.
-
  ## Best Practices

- 1. **Always start with detail='minimal'** - Get summaries first, full context selectively
- 2. **Monitor background jobs** - Don't search newly added stores until indexing completes
- 3. **Use intent parameter** - Helps ranking ('find-implementation' vs 'find-pattern' vs 'find-usage')
- 4. **Filter by stores when known** - Faster, more focused results
- 5. **Check relevance scores** - >0.7 is usually a strong match, <0.5 might be noise
- 6. **Progressive refinement** - Broad search → filter → narrow → full context
+ 1. **Start with detail='minimal'** — summaries first, full context selectively
+ 2. **Monitor background jobs** — don't search until indexing completes
+ 3. **Use intent parameter** — 'find-implementation' vs 'find-usage' vs 'find-documentation'
+ 4. **Filter by stores** — faster, more focused results
+ 5. **Check scores** — >0.7 strong match, <0.5 likely noise

- These workflows reduce token usage, minimize tool calls, and get you to the right answer faster.
+ Detailed examples and error recovery: [references/examples.md](references/examples.md)
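The progressive-detail pattern the rewritten skill condenses (minimal search, filter by score, `get_full_context` on the top few) is plain filtering logic. A sketch in Python; the result-dict shape is an assumption for illustration, not the MCP tool's actual schema:

```python
def select_for_full_context(results, threshold=0.7, top_n=3):
    """Keep only strong matches (score > threshold), then cap how many
    IDs go to get_full_context. Fetching full code for 3 of 20 results
    is where the ~80% context saving comes from."""
    strong = [r for r in results if r["score"] > threshold]
    strong.sort(key=lambda r: r["score"], reverse=True)
    return [r["id"] for r in strong[:top_n]]

results = [
    {"id": "result_3", "score": 0.92},   # auth/jwt.ts
    {"id": "result_7", "score": 0.85},   # middleware/authenticate.ts
    {"id": "result_12", "score": 0.74},  # auth/session.ts
    {"id": "result_19", "score": 0.45},  # below threshold: dropped
]
print(select_for_full_context(results))  # → ['result_3', 'result_7', 'result_12']
```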
@@ -0,0 +1,86 @@
+ # Workflow Examples
+
+ ## Progressive Library Exploration
+
+ User: "How does Vue's computed properties work?"
+
+ ```
+ list_stores()
+ → Found: vue, react, pydantic
+
+ get_store_info('vue')
+ → Path: .bluera/bluera-knowledge/repos/vue/
+ → Files: 2,847 indexed
+
+ search("computed properties", intent='find-implementation', detail='minimal', stores=['vue'])
+ → Result 1: packages/reactivity/src/computed.ts (score: 0.92)
+ → Result 2: packages/reactivity/__tests__/computed.spec.ts (score: 0.85)
+ → Result 3: packages/runtime-core/src/apiComputed.ts (score: 0.78)
+
+ get_full_context(['result_1_id', 'result_2_id'])
+ → Full code for ComputedRefImpl class
+ → Complete API implementation
+ ```
+
+ ## Add Library + Monitor
+
+ ```
+ create_store('https://github.com/fastapi/fastapi', 'fastapi')
+ → job_id: 'job_abc123'
+
+ check_job_status('job_abc123')
+ → Status: running, Progress: 45%
+
+ # ... wait 30 seconds ...
+
+ check_job_status('job_abc123')
+ → Status: completed, Indexed: 487 files
+
+ search("dependency injection", stores=['fastapi'], limit=3)
+ → Returns relevant FastAPI DI patterns
+ ```
+
+ ## Progressive Detail Strategy
+
+ ```
+ # Initial broad search
+ search("authentication middleware", detail='minimal', limit=20)
+ → 20 results, scores 0.45-0.92, ~2k tokens total
+
+ # Filter by score (>0.7):
+ - auth/jwt.ts (score: 0.92)
+ - middleware/authenticate.ts (score: 0.85)
+ - auth/session.ts (score: 0.74)
+
+ # Get full code for top 3 only
+ get_full_context(['result_3', 'result_7', 'result_12'])
+ → ~3k tokens (vs ~15k if fetched all 20)
+ ```
+
+ ## Error Recovery
+
+ ```
+ create_store('https://github.com/private/repo', 'my-repo')
+ → job_id: 'job_xyz'
+
+ check_job_status('job_xyz')
+ → Status: failed, Error: "Authentication required"
+
+ # Recovery: Use SSH
+ create_store('git@github.com:private/repo.git', 'my-repo')
+ → Status: completed
+ ```
+
+ ## Combining Workflows
+
+ User: "Compare Express and Hono middleware"
+
+ ```
+ 1. list_stores() → check if both indexed
+ 2. If not: create_store() for missing
+ 3. check_job_status() → wait for indexing
+ 4. search("middleware implementation", stores=['express', 'hono'], detail='minimal')
+ 5. Review summaries, identify key files
+ 6. get_full_context() for 2-3 most relevant from each
+ 7. Compare implementations with full context
+ ```
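The Add Library + Monitor example above polls `check_job_status` by hand; the loop generalizes as below. The status-dict shape and the injected `check_job_status` callable are assumptions for illustration, not the MCP tool's actual interface:

```python
import time

def wait_for_index(check_job_status, job_id, interval=15.0, timeout=600.0):
    """Poll until the indexing job reaches a terminal state.
    `check_job_status` is any callable returning {"status": ...};
    returns 'completed' or 'failed', or raises on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = check_job_status(job_id)["status"]
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {timeout}s")

# Simulated job: running once, then completed
responses = iter([{"status": "running"}, {"status": "completed"}])
print(wait_for_index(lambda _id: next(responses), "job_abc123", interval=0))
```

Injecting the status callable keeps the polling logic testable without a live indexing job.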
@@ -7,216 +7,29 @@ context: fork

  # Agent Quality Evaluation

- Compare how well Claude answers library questions across three access levels.
+ Compare how well Claude answers library questions across three access levels:

- For each query, three agents run in parallel:
- - **Without BK** — uses only web search and training knowledge
- - **BK Grep** — can Grep/Read/Glob the cloned source repos but has no vector search
- - **BK Full** — uses BK vector search + get_full_context + Grep/Read (all BK tools)
-
- Then score all three answers on accuracy, specificity, completeness, and source grounding.
+ - **Without BK** — web search + training knowledge only
+ - **BK Grep** — Grep/Read/Glob on cloned repos, no vector search
+ - **BK Full** — vector search + get_full_context + Grep/Read

  ## Arguments

  Parse `$ARGUMENTS`:

- - **No arguments or empty**: Show usage help
- - **Quoted string** (not starting with `--`): Arbitrary query mode — run eval for that single question
- - **`--predefined`**: Run all predefined queries (skip any whose stores are not indexed)
- - **`--predefined N`**: Run predefined query #N only (1-based index)
-
- If no arguments provided, show:
- ```
- Usage:
- /bluera-knowledge:eval "How does Express handle errors?" # Arbitrary query
- /bluera-knowledge:eval --predefined # Run all predefined queries
- /bluera-knowledge:eval --predefined 3 # Run predefined query #3
- ```
-
- ## Step 1: Prerequisites Check
-
- 1. Call MCP `execute` with `{ command: "stores" }` to list indexed stores
- 2. If no stores are indexed, show error and abort:
- ```
- No knowledge stores indexed. Add at least one library first:
- /bluera-knowledge:add-repo https://github.com/expressjs/express --name express
- ```
- 3. Record the list of available store names — you'll pass these to the BK Full agent
- 4. Build a `STORE_PATHS` mapping from the store response: for each store with a `path` field, record `- **<name>**: \`<path>\`` (one per line, as a markdown list). This gets passed to the BK Grep agent.
-
- ## Step 2: Resolve Queries
-
- ### Predefined mode (`--predefined`)
-
- 1. Read the predefined queries file: `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml`
- 2. Parse the YAML content
- 3. For each query, check if ANY of its `store_hint` values match an available store name
- 4. Split into **runnable** (store available) and **skipped** (store not available) lists
- 5. If `--predefined N` was specified, select only query at index N from the full list (skip if store not available)
- 6. If no queries are runnable, show what stores to add and abort
-
- ### Arbitrary mode (bare query string)
-
- 1. Use the raw query string as the question
- 2. Set `expected_topics` and `anti_patterns` to empty lists
- 3. Set `id` to "arbitrary", `category` to "general", `difficulty` to "unknown"
-
- ## Step 3: Load Templates
-
- Read these files from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`:
-
- 1. `without-bk-agent.md` — instructions for the baseline agent
- 2. `bk-grep-agent.md` — instructions for the BK Grep agent
- 3. `with-bk-agent.md` — instructions for the BK Full agent
- 4. `judge.md` — grading rubric
-
- ## Step 4: Run Eval (for each query)
-
- ### Spawn ALL THREE agents in parallel (same turn, three Task tool calls)
-
- **Without-BK agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `without-bk-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Send as the task prompt
-
- **BK Grep agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `bk-grep-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Replace `{{STORE_PATHS}}` with the store name-to-path mapping built in Step 1
- - Send as the task prompt
-
- **BK Full agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `with-bk-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Replace `{{STORES}}` with the list of available store names (one per line, as a markdown list)
- - Send as the task prompt
-
- Wait for all three agents to complete.
-
- ### Capture Token Usage
-
- From each Task tool response, parse the `<usage>` block to extract:
- - `total_tokens` — the total tokens consumed by the agent
- - `duration_ms` — wall-clock time for the agent
-
- If usage data is not available in a Task response, show "N/A" for that agent.
-
- ### Judge the results
-
- Using the rubric from `judge.md`, evaluate all three answers yourself:
-
- 1. Read all three agent responses
- 2. For each answer, score all 4 criteria (1-5):
- - **Factual Accuracy**: Are the claims correct?
- - **Specificity**: Does it cite specific files, functions, code?
- - **Completeness**: Does it cover the full answer?
- - **Source Grounding**: Are claims backed by evidence?
- 3. If the query has `expected_topics`, check which answers mention each topic
- 4. If the query has `anti_patterns`, flag if any answer makes those claims
- 5. Calculate totals (max 20 each), determine winner and deltas
-
- ## Step 5: Output Results
-
- ### Single query output (arbitrary or `--predefined N`)
-
- Show the full comparison:
-
- ```
- ## Eval: "<question>"
-
- | Criterion | Without BK | BK Grep | BK Full |
- |-------------------|:----------:|:-------:|:-------:|
- | Accuracy | X | X | X |
- | Specificity | X | X | X |
- | Completeness | X | X | X |
- | Source Grounding | X | X | X |
- | **Total** | **X** | **X** | **X** |
-
- | Usage | Without BK | BK Grep | BK Full |
- |-------------------|:----------:|:-------:|:-------:|
- | Tokens | X,XXX | X,XXX | X,XXX |
- | Duration (s) | X.X | X.X | X.X |
-
- **Winner:** [BK Full | BK Grep | Without BK | Tie] ([significant | marginal | none])
- **Key Difference:** [One sentence explaining the most important quality gap]
- **Grep vs Full:** [One sentence on whether vector search outperformed manual grep, and if so how]
- ```
-
- If expected topics were provided:
- ```
- ### Expected Topics
- - [x] topic covered by all three
- - [x] topic covered by BK Full + BK Grep only
- - [x] topic covered by BK Full only
- - [ ] topic missed by all
- ```
-
- ### Multi-query output (`--predefined`)
-
- Show a summary row per query, then aggregate:
-
- ```
- ## Agent Quality Eval Summary
-
- Ran X/8 queries (Y skipped — stores not indexed)
-
- | # | Query | Difficulty | w/o BK | Grep | Full | Winner | Delta |
- |---|-------|:----------:|:------:|:----:|:----:|--------|-------|
- | 1 | query-id | medium | 9/20 | 15/20 | 19/20 | Full | significant |
- | 2 | query-id | easy | 14/20 | 17/20 | 18/20 | Full | marginal |
- | ... |
-
- ### Token Usage
-
- | # | Query | w/o BK tokens | Grep tokens | Full tokens |
- |---|-------|:-------------:|:-----------:|:-----------:|
- | 1 | query-id | 2,340 | 8,120 | 5,670 |
- | 2 | query-id | 1,890 | 6,450 | 4,230 |
- | ... |
-
- ### Aggregate
- - **Without BK mean:** X.X/20 (avg X,XXX tokens)
- - **BK Grep mean:** X.X/20 (avg X,XXX tokens)
- - **BK Full mean:** X.X/20 (avg X,XXX tokens)
- - **Full vs Without:** +X.X points (+XX%)
- - **Full vs Grep:** +X.X points (+XX%)
- - **Grep vs Without:** +X.X points (+XX%)
- - **Full win rate:** X/X (XX%)
- - **Significant wins (Full):** X
-
- ### By Category
- | Category | w/o BK | Grep | Full | Full delta |
- |----------|:------:|:----:|:----:|------------|
- | implementation | X.X | X.X | X.X | +X.X |
- | api | X.X | X.X | X.X | +X.X |
-
- ### By Difficulty
- | Difficulty | w/o BK | Grep | Full | Full delta |
- |------------|:------:|:----:|:----:|------------|
- | easy | X.X | X.X | X.X | +X.X |
- | medium | X.X | X.X | X.X | +X.X |
- | hard | X.X | X.X | X.X | +X.X |
+ - **No arguments**: Show usage help
+ - **Quoted string**: Run eval for that single question
+ - **`--predefined`**: Run all predefined queries
+ - **`--predefined N`**: Run predefined query #N only

- ### Token Efficiency
- | Agent | Mean Score | Mean Tokens | Score/1K Tokens |
- |-------|:----------:|:-----------:|:---------------:|
- | Without BK | X.X | X,XXX | X.XX |
- | BK Grep | X.X | X,XXX | X.XX |
- | BK Full | X.X | X,XXX | X.XX |
- ```
+ ## Workflow

- If any queries were skipped:
- ```
- ### Skipped (store not indexed)
- - vue-reactivity-tracking — add with: /bluera-knowledge:add-repo https://github.com/vuejs/core --name vue
- - fastapi-dependency-injection — add with: /bluera-knowledge:add-repo https://github.com/fastapi/fastapi --name fastapi
- ```
+ 1. **Prerequisites**: Call `execute` with `{ command: "stores" }` to list stores. Abort if none.
+ 2. **Resolve queries**: Load from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml` or use arbitrary query.
+ 3. **Load templates**: Read agent prompts + judge rubric from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`
+ 4. **Spawn 3 agents in parallel** per query (replace `{{QUESTION}}`, `{{STORES}}`, `{{STORE_PATHS}}`)
+ 5. **Judge**: Score all 4 criteria (1-5): Accuracy, Specificity, Completeness, Source Grounding

- ## Important Notes
+ Detailed procedures: [references/procedures.md](references/procedures.md)

- - Each query spawns 3 subagents. For `--predefined` with 8 queries, that's up to 24 agent runs. Process one query at a time (but spawn all three agents for each query in parallel).
- - The without-BK agent may use WebSearch — this is intentional. We're comparing against "the best Claude can do without BK."
- - The BK Grep agent may NOT use WebSearch. It tests what an agent can discover by exploring raw source code, to isolate the value of vector search.
- - Scoring is somewhat subjective. The value is in the comparison (relative scores) rather than absolute numbers. Look at the delta and key differences.
- - The Token Efficiency table reveals cost-effectiveness: if BK Grep achieves similar scores to BK Full with fewer tokens, it suggests vector search isn't adding much for that query type.
- - For arbitrary queries without expected topics, grading relies entirely on the 4 general criteria. This is fine — it still reveals whether BK adds value.
+ Output format: [references/output-format.md](references/output-format.md)
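The removed Aggregate section computed per-agent means and deltas from the 20-point totals; that arithmetic reduces to a few lines. A sketch with illustrative scores (the agent names mirror the eval's three access levels; the numbers are made up for the demo):

```python
def aggregate(totals):
    """totals: {agent_name: [per-query totals out of 20]}.
    Returns the mean per agent plus the Full-vs-Without point delta
    that the old summary reported."""
    means = {agent: sum(scores) / len(scores) for agent, scores in totals.items()}
    delta = means["BK Full"] - means["Without BK"]
    return means, delta

means, delta = aggregate({
    "Without BK": [9, 14],
    "BK Grep": [15, 17],
    "BK Full": [19, 18],
})
print(means, delta)  # → Full mean 18.5, Without mean 11.5, delta 7.0
```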