bluera-knowledge 0.34.0 → 0.34.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (34)
  1. package/.claude-plugin/plugin.json +1 -1
  2. package/CHANGELOG.md +15 -0
  3. package/dist/{chunk-4S6LWHKI.js → chunk-TD3VX74F.js} +2 -2
  4. package/dist/{chunk-K2EB4PGE.js → chunk-V5MWZM5X.js} +8 -4
  5. package/dist/chunk-V5MWZM5X.js.map +1 -0
  6. package/dist/{chunk-FYHKBCIH.js → chunk-VELBEZVB.js} +12 -2
  7. package/dist/chunk-VELBEZVB.js.map +1 -0
  8. package/dist/index.js +3 -3
  9. package/dist/mcp/bootstrap.js +6 -2
  10. package/dist/mcp/bootstrap.js.map +1 -1
  11. package/dist/mcp/server.js +2 -2
  12. package/dist/workers/background-worker-cli.js +2 -2
  13. package/hooks/posttooluse-bk-reminder.py +0 -0
  14. package/hooks/posttooluse-web-research.py +0 -0
  15. package/hooks/posttooluse-websearch-bk.py +0 -0
  16. package/package.json +1 -1
  17. package/scripts/auto-setup.sh +3 -0
  18. package/skills/advanced-workflows/SKILL.md +26 -246
  19. package/skills/advanced-workflows/references/examples.md +86 -0
  20. package/skills/eval/SKILL.md +16 -203
  21. package/skills/eval/references/output-format.md +73 -0
  22. package/skills/eval/references/procedures.md +61 -0
  23. package/skills/store-lifecycle/SKILL.md +16 -441
  24. package/skills/store-lifecycle/references/operations.md +75 -0
  25. package/skills/store-lifecycle/references/source-types.md +48 -0
  26. package/skills/test-plugin/SKILL.md +8 -515
  27. package/skills/test-plugin/references/output-format.md +43 -0
  28. package/skills/test-plugin/references/test-procedures.md +107 -0
  29. package/dist/chunk-FYHKBCIH.js.map +0 -1
  30. package/dist/chunk-K2EB4PGE.js.map +0 -1
  31. package/hooks/pretooluse-bk-suggest.py +0 -296
  32. package/hooks/skill-activation.py +0 -221
  33. package/hooks/skill-rules.json +0 -131
  34. package/dist/{chunk-4S6LWHKI.js.map → chunk-TD3VX74F.js.map} +0 -0
--- a/package/skills/eval/SKILL.md
+++ b/package/skills/eval/SKILL.md
@@ -7,216 +7,29 @@ context: fork
 
 # Agent Quality Evaluation
 
- Compare how well Claude answers library questions across three access levels.
+ Compare how well Claude answers library questions across three access levels:
 
- For each query, three agents run in parallel:
- - **Without BK** — uses only web search and training knowledge
- - **BK Grep** — can Grep/Read/Glob the cloned source repos but has no vector search
- - **BK Full** — uses BK vector search + get_full_context + Grep/Read (all BK tools)
-
- Then score all three answers on accuracy, specificity, completeness, and source grounding.
+ - **Without BK** — web search + training knowledge only
+ - **BK Grep** — Grep/Read/Glob on cloned repos, no vector search
+ - **BK Full** — vector search + get_full_context + Grep/Read
 
 ## Arguments
 
 Parse `$ARGUMENTS`:
 
- - **No arguments or empty**: Show usage help
- - **Quoted string** (not starting with `--`): Arbitrary query mode — run eval for that single question
- - **`--predefined`**: Run all predefined queries (skip any whose stores are not indexed)
- - **`--predefined N`**: Run predefined query #N only (1-based index)
-
- If no arguments provided, show:
- ```
- Usage:
- /bluera-knowledge:eval "How does Express handle errors?" # Arbitrary query
- /bluera-knowledge:eval --predefined # Run all predefined queries
- /bluera-knowledge:eval --predefined 3 # Run predefined query #3
- ```
-
- ## Step 1: Prerequisites Check
-
- 1. Call MCP `execute` with `{ command: "stores" }` to list indexed stores
- 2. If no stores are indexed, show error and abort:
- ```
- No knowledge stores indexed. Add at least one library first:
- /bluera-knowledge:add-repo https://github.com/expressjs/express --name express
- ```
- 3. Record the list of available store names — you'll pass these to the BK Full agent
- 4. Build a `STORE_PATHS` mapping from the store response: for each store with a `path` field, record `- **<name>**: \`<path>\`` (one per line, as a markdown list). This gets passed to the BK Grep agent.
-
- ## Step 2: Resolve Queries
-
- ### Predefined mode (`--predefined`)
-
- 1. Read the predefined queries file: `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml`
- 2. Parse the YAML content
- 3. For each query, check if ANY of its `store_hint` values match an available store name
- 4. Split into **runnable** (store available) and **skipped** (store not available) lists
- 5. If `--predefined N` was specified, select only query at index N from the full list (skip if store not available)
- 6. If no queries are runnable, show what stores to add and abort
-
- ### Arbitrary mode (bare query string)
-
- 1. Use the raw query string as the question
- 2. Set `expected_topics` and `anti_patterns` to empty lists
- 3. Set `id` to "arbitrary", `category` to "general", `difficulty` to "unknown"
-
- ## Step 3: Load Templates
-
- Read these files from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`:
-
- 1. `without-bk-agent.md` — instructions for the baseline agent
- 2. `bk-grep-agent.md` — instructions for the BK Grep agent
- 3. `with-bk-agent.md` — instructions for the BK Full agent
- 4. `judge.md` — grading rubric
-
- ## Step 4: Run Eval (for each query)
-
- ### Spawn ALL THREE agents in parallel (same turn, three Task tool calls)
-
- **Without-BK agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `without-bk-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Send as the task prompt
-
- **BK Grep agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `bk-grep-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Replace `{{STORE_PATHS}}` with the store name-to-path mapping built in Step 1
- - Send as the task prompt
-
- **BK Full agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `with-bk-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Replace `{{STORES}}` with the list of available store names (one per line, as a markdown list)
- - Send as the task prompt
-
- Wait for all three agents to complete.
-
- ### Capture Token Usage
-
- From each Task tool response, parse the `<usage>` block to extract:
- - `total_tokens` — the total tokens consumed by the agent
- - `duration_ms` — wall-clock time for the agent
-
- If usage data is not available in a Task response, show "N/A" for that agent.
-
- ### Judge the results
-
- Using the rubric from `judge.md`, evaluate all three answers yourself:
-
- 1. Read all three agent responses
- 2. For each answer, score all 4 criteria (1-5):
- - **Factual Accuracy**: Are the claims correct?
- - **Specificity**: Does it cite specific files, functions, code?
- - **Completeness**: Does it cover the full answer?
- - **Source Grounding**: Are claims backed by evidence?
- 3. If the query has `expected_topics`, check which answers mention each topic
- 4. If the query has `anti_patterns`, flag if any answer makes those claims
- 5. Calculate totals (max 20 each), determine winner and deltas
-
- ## Step 5: Output Results
-
- ### Single query output (arbitrary or `--predefined N`)
-
- Show the full comparison:
-
- ```
- ## Eval: "<question>"
-
- | Criterion | Without BK | BK Grep | BK Full |
- |-------------------|:----------:|:-------:|:-------:|
- | Accuracy | X | X | X |
- | Specificity | X | X | X |
- | Completeness | X | X | X |
- | Source Grounding | X | X | X |
- | **Total** | **X** | **X** | **X** |
-
- | Usage | Without BK | BK Grep | BK Full |
- |-------------------|:----------:|:-------:|:-------:|
- | Tokens | X,XXX | X,XXX | X,XXX |
- | Duration (s) | X.X | X.X | X.X |
-
- **Winner:** [BK Full | BK Grep | Without BK | Tie] ([significant | marginal | none])
- **Key Difference:** [One sentence explaining the most important quality gap]
- **Grep vs Full:** [One sentence on whether vector search outperformed manual grep, and if so how]
- ```
-
- If expected topics were provided:
- ```
- ### Expected Topics
- - [x] topic covered by all three
- - [x] topic covered by BK Full + BK Grep only
- - [x] topic covered by BK Full only
- - [ ] topic missed by all
- ```
-
- ### Multi-query output (`--predefined`)
-
- Show a summary row per query, then aggregate:
-
- ```
- ## Agent Quality Eval Summary
-
- Ran X/8 queries (Y skipped — stores not indexed)
-
- | # | Query | Difficulty | w/o BK | Grep | Full | Winner | Delta |
- |---|-------|:----------:|:------:|:----:|:----:|--------|-------|
- | 1 | query-id | medium | 9/20 | 15/20 | 19/20 | Full | significant |
- | 2 | query-id | easy | 14/20 | 17/20 | 18/20 | Full | marginal |
- | ... |
-
- ### Token Usage
-
- | # | Query | w/o BK tokens | Grep tokens | Full tokens |
- |---|-------|:-------------:|:-----------:|:-----------:|
- | 1 | query-id | 2,340 | 8,120 | 5,670 |
- | 2 | query-id | 1,890 | 6,450 | 4,230 |
- | ... |
-
- ### Aggregate
- - **Without BK mean:** X.X/20 (avg X,XXX tokens)
- - **BK Grep mean:** X.X/20 (avg X,XXX tokens)
- - **BK Full mean:** X.X/20 (avg X,XXX tokens)
- - **Full vs Without:** +X.X points (+XX%)
- - **Full vs Grep:** +X.X points (+XX%)
- - **Grep vs Without:** +X.X points (+XX%)
- - **Full win rate:** X/X (XX%)
- - **Significant wins (Full):** X
-
- ### By Category
- | Category | w/o BK | Grep | Full | Full delta |
- |----------|:------:|:----:|:----:|------------|
- | implementation | X.X | X.X | X.X | +X.X |
- | api | X.X | X.X | X.X | +X.X |
-
- ### By Difficulty
- | Difficulty | w/o BK | Grep | Full | Full delta |
- |------------|:------:|:----:|:----:|------------|
- | easy | X.X | X.X | X.X | +X.X |
- | medium | X.X | X.X | X.X | +X.X |
- | hard | X.X | X.X | X.X | +X.X |
+ - **No arguments**: Show usage help
+ - **Quoted string**: Run eval for that single question
+ - **`--predefined`**: Run all predefined queries
+ - **`--predefined N`**: Run predefined query #N only
 
- ### Token Efficiency
- | Agent | Mean Score | Mean Tokens | Score/1K Tokens |
- |-------|:----------:|:-----------:|:---------------:|
- | Without BK | X.X | X,XXX | X.XX |
- | BK Grep | X.X | X,XXX | X.XX |
- | BK Full | X.X | X,XXX | X.XX |
- ```
+ ## Workflow
 
- If any queries were skipped:
- ```
- ### Skipped (store not indexed)
- - vue-reactivity-tracking add with: /bluera-knowledge:add-repo https://github.com/vuejs/core --name vue
- - fastapi-dependency-injection add with: /bluera-knowledge:add-repo https://github.com/fastapi/fastapi --name fastapi
- ```
+ 1. **Prerequisites**: Call `execute` with `{ command: "stores" }` to list stores. Abort if none.
+ 2. **Resolve queries**: Load from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml` or use arbitrary query.
+ 3. **Load templates**: Read agent prompts + judge rubric from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`
+ 4. **Spawn 3 agents in parallel** per query (replace `{{QUESTION}}`, `{{STORES}}`, `{{STORE_PATHS}}`)
+ 5. **Judge**: Score all 4 criteria (1-5): Accuracy, Specificity, Completeness, Source Grounding
 
- ## Important Notes
+ Detailed procedures: [references/procedures.md](references/procedures.md)
 
- - Each query spawns 3 subagents. For `--predefined` with 8 queries, that's up to 24 agent runs. Process one query at a time (but spawn all three agents for each query in parallel).
- - The without-BK agent may use WebSearch — this is intentional. We're comparing against "the best Claude can do without BK."
- - The BK Grep agent may NOT use WebSearch. It tests what an agent can discover by exploring raw source code, to isolate the value of vector search.
- - Scoring is somewhat subjective. The value is in the comparison (relative scores) rather than absolute numbers. Look at the delta and key differences.
- - The Token Efficiency table reveals cost-effectiveness: if BK Grep achieves similar scores to BK Full with fewer tokens, it suggests vector search isn't adding much for that query type.
- - For arbitrary queries without expected topics, grading relies entirely on the 4 general criteria. This is fine — it still reveals whether BK adds value.
+ Output format: [references/output-format.md](references/output-format.md)
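The `$ARGUMENTS` dispatch removed from SKILL.md and condensed in the new version can be sketched in Python. This is an illustrative helper, not code shipped by the package; the mode names returned here are assumptions:

```python
def parse_eval_args(arguments: str):
    """Classify the raw $ARGUMENTS string into one of the eval modes
    described in the Arguments section. Returns (mode, payload)."""
    args = arguments.strip()
    if not args:
        # No arguments or empty: show usage help
        return ("help", None)
    if args.startswith("--predefined"):
        rest = args[len("--predefined"):].strip()
        if rest.isdigit():
            # --predefined N: run predefined query #N (1-based)
            return ("predefined-one", int(rest))
        # --predefined: run all predefined queries
        return ("predefined-all", None)
    # Anything else is an arbitrary query; strip surrounding quotes
    return ("arbitrary", args.strip('"'))
```

One design note: treating any non-`--predefined` string as an arbitrary query matches the "quoted string" rule while staying forgiving about missing quotes.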
--- /dev/null
+++ b/package/skills/eval/references/output-format.md
@@ -0,0 +1,73 @@
+ # Eval Output Format
+
+ ## Single Query Output
+
+ ```
+ ## Eval: "<question>"
+
+ | Criterion | Without BK | BK Grep | BK Full |
+ |-------------------|:----------:|:-------:|:-------:|
+ | Accuracy | X | X | X |
+ | Specificity | X | X | X |
+ | Completeness | X | X | X |
+ | Source Grounding | X | X | X |
+ | **Total** | **X** | **X** | **X** |
+
+ | Usage | Without BK | BK Grep | BK Full |
+ |-------------------|:----------:|:-------:|:-------:|
+ | Tokens | X,XXX | X,XXX | X,XXX |
+ | Duration (s) | X.X | X.X | X.X |
+
+ **Winner:** [BK Full | BK Grep | Without BK | Tie] ([significant | marginal | none])
+ **Key Difference:** [One sentence explaining the most important quality gap]
+ **Grep vs Full:** [One sentence on whether vector search outperformed manual grep]
+ ```
+
+ If expected topics provided:
+ ```
+ ### Expected Topics
+ - [x] topic covered by all three
+ - [x] topic covered by BK Full only
+ - [ ] topic missed by all
+ ```
+
+ ## Multi-Query Output (`--predefined`)
+
+ ```
+ ## Agent Quality Eval Summary
+
+ Ran X/8 queries (Y skipped — stores not indexed)
+
+ | # | Query | Difficulty | w/o BK | Grep | Full | Winner | Delta |
+ |---|-------|:----------:|:------:|:----:|:----:|--------|-------|
+ | 1 | query-id | medium | 9/20 | 15/20 | 19/20 | Full | significant |
+
+ ### Token Usage
+
+ | # | Query | w/o BK tokens | Grep tokens | Full tokens |
+ |---|-------|:-------------:|:-----------:|:-----------:|
+ | 1 | query-id | 2,340 | 8,120 | 5,670 |
+
+ ### Aggregate
+ - **Without BK mean:** X.X/20 (avg X,XXX tokens)
+ - **BK Grep mean:** X.X/20 (avg X,XXX tokens)
+ - **BK Full mean:** X.X/20 (avg X,XXX tokens)
+ - **Full vs Without:** +X.X points (+XX%)
+ - **Full vs Grep:** +X.X points (+XX%)
+ - **Full win rate:** X/X (XX%)
+
+ ### By Category / Difficulty
+ Tables breaking down scores by category and difficulty level.
+
+ ### Token Efficiency
+ | Agent | Mean Score | Mean Tokens | Score/1K Tokens |
+ |-------|:----------:|:-----------:|:---------------:|
+ | Without BK | X.X | X,XXX | X.XX |
+ | BK Grep | X.X | X,XXX | X.XX |
+ | BK Full | X.X | X,XXX | X.XX |
+ ```
+
+ ### Skipped queries
+ ```
+ - query-id — add with: /bluera-knowledge:add-repo <url> --name <name>
+ ```
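The Aggregate and Token Efficiency figures in the output format above are plain arithmetic over per-query results. A minimal Python sketch, assuming an illustrative result schema (the field names are not the package's):

```python
def aggregate(results):
    """Compute the summary numbers: per-agent means, Full-vs-X deltas,
    Full win rate, and the Score/1K Tokens efficiency column.
    `results` is a list of dicts keyed by agent, each holding a
    score out of 20 and a token count (hypothetical shape)."""
    agents = ("without_bk", "grep", "full")
    n = len(results)
    means = {a: sum(r[a]["score"] for r in results) / n for a in agents}
    mean_tokens = {a: sum(r[a]["tokens"] for r in results) / n for a in agents}
    # A "Full win" = Full strictly outscores both other agents
    full_wins = sum(
        1 for r in results
        if r["full"]["score"] > max(r["without_bk"]["score"], r["grep"]["score"])
    )
    return {
        "means": means,
        "full_vs_without": means["full"] - means["without_bk"],
        "full_vs_grep": means["full"] - means["grep"],
        "win_rate": full_wins / n,
        # Score/1K Tokens: mean score divided by mean tokens in thousands
        "score_per_1k": {a: means[a] / (mean_tokens[a] / 1000) for a in agents},
    }
```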
--- /dev/null
+++ b/package/skills/eval/references/procedures.md
@@ -0,0 +1,61 @@
+ # Eval Procedures
+
+ ## Step 1: Prerequisites Check
+
+ 1. Call MCP `execute` with `{ command: "stores" }` to list indexed stores
+ 2. If no stores indexed, show error and abort
+ 3. Record available store names — pass to BK Full agent
+ 4. Build `STORE_PATHS` mapping from store response: `- **<name>**: \`<path>\`` per store
+
+ ## Step 2: Resolve Queries
+
+ ### Predefined mode (`--predefined`)
+
+ 1. Read: `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml`
+ 2. For each query, check if ANY `store_hint` values match available stores
+ 3. Split into runnable and skipped lists
+ 4. If `--predefined N`, select only query at index N
+
+ ### Arbitrary mode
+
+ 1. Use raw query as question
+ 2. Set `expected_topics` and `anti_patterns` to empty
+ 3. Set `id` to "arbitrary", `category` to "general", `difficulty` to "unknown"
+
+ ## Step 3: Load Templates
+
+ From `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`:
+ 1. `without-bk-agent.md` — baseline agent
+ 2. `bk-grep-agent.md` — grep-only agent
+ 3. `with-bk-agent.md` — full BK agent
+ 4. `judge.md` — grading rubric
+
+ ## Step 4: Run Eval
+
+ Spawn ALL THREE agents in parallel per query using the Task tool:
+
+ **Without-BK**: Replace `{{QUESTION}}` in `without-bk-agent.md`
+ **BK Grep**: Replace `{{QUESTION}}` and `{{STORE_PATHS}}` in `bk-grep-agent.md`
+ **BK Full**: Replace `{{QUESTION}}` and `{{STORES}}` in `with-bk-agent.md`
+
+ ### Capture Token Usage
+
+ From each Task response, extract `total_tokens` and `duration_ms` from `<usage>` block.
+
+ ### Judge
+
+ Score all 4 criteria (1-5) per answer:
+ - **Factual Accuracy**: Are claims correct?
+ - **Specificity**: Does it cite specific files, functions, code?
+ - **Completeness**: Does it cover the full answer?
+ - **Source Grounding**: Are claims backed by evidence?
+
+ Check `expected_topics` coverage and `anti_patterns` violations.
+
+ ## Notes
+
+ - Each query spawns 3 subagents. Process one query at a time.
+ - Without-BK agent may use WebSearch — intentional baseline.
+ - BK Grep agent may NOT use WebSearch — isolates vector search value.
+ - Scoring is subjective; value is in relative comparison (deltas).
+ - Token Efficiency reveals cost-effectiveness across modes.
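Steps 1 and 2 of the procedures file above reduce to two small transformations: rendering the `STORE_PATHS` markdown list and partitioning queries by `store_hint`. A hedged Python sketch; helper names and the exact store/query dict shapes are assumptions, not the package's actual code:

```python
def build_store_paths(stores):
    """Step 1.4: render one markdown bullet per store that has a
    `path` field, in the `- **<name>**: `<path>`` shape shown above."""
    return "\n".join(
        f"- **{s['name']}**: `{s['path']}`" for s in stores if s.get("path")
    )

def split_queries(queries, available_stores):
    """Step 2: a query is runnable if ANY of its store_hint values
    names an indexed store; otherwise it is skipped."""
    available = set(available_stores)
    runnable, skipped = [], []
    for q in queries:
        if available.intersection(q.get("store_hint", [])):
            runnable.append(q)
        else:
            skipped.append(q)
    return runnable, skipped
```

The `STORE_PATHS` string would be substituted into the BK Grep agent template, while the skipped list feeds the "Skipped queries" section of the output format.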