bluera-knowledge 0.34.0 → 0.34.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (34)
  1. package/.claude-plugin/plugin.json +1 -1
  2. package/CHANGELOG.md +15 -0
  3. package/dist/{chunk-4S6LWHKI.js → chunk-TD3VX74F.js} +2 -2
  4. package/dist/{chunk-K2EB4PGE.js → chunk-V5MWZM5X.js} +8 -4
  5. package/dist/chunk-V5MWZM5X.js.map +1 -0
  6. package/dist/{chunk-FYHKBCIH.js → chunk-VELBEZVB.js} +12 -2
  7. package/dist/chunk-VELBEZVB.js.map +1 -0
  8. package/dist/index.js +3 -3
  9. package/dist/mcp/bootstrap.js +6 -2
  10. package/dist/mcp/bootstrap.js.map +1 -1
  11. package/dist/mcp/server.js +2 -2
  12. package/dist/workers/background-worker-cli.js +2 -2
  13. package/hooks/posttooluse-bk-reminder.py +0 -0
  14. package/hooks/posttooluse-web-research.py +0 -0
  15. package/hooks/posttooluse-websearch-bk.py +0 -0
  16. package/package.json +1 -1
  17. package/scripts/auto-setup.sh +3 -0
  18. package/skills/advanced-workflows/SKILL.md +26 -246
  19. package/skills/advanced-workflows/references/examples.md +86 -0
  20. package/skills/eval/SKILL.md +16 -203
  21. package/skills/eval/references/output-format.md +73 -0
  22. package/skills/eval/references/procedures.md +61 -0
  23. package/skills/store-lifecycle/SKILL.md +16 -441
  24. package/skills/store-lifecycle/references/operations.md +75 -0
  25. package/skills/store-lifecycle/references/source-types.md +48 -0
  26. package/skills/test-plugin/SKILL.md +8 -515
  27. package/skills/test-plugin/references/output-format.md +43 -0
  28. package/skills/test-plugin/references/test-procedures.md +107 -0
  29. package/dist/chunk-FYHKBCIH.js.map +0 -1
  30. package/dist/chunk-K2EB4PGE.js.map +0 -1
  31. package/hooks/pretooluse-bk-suggest.py +0 -296
  32. package/hooks/skill-activation.py +0 -221
  33. package/hooks/skill-rules.json +0 -131
  34. package/dist/{chunk-4S6LWHKI.js.map → chunk-TD3VX74F.js.map} +0 -0
--- a/package/skills/eval/SKILL.md
+++ b/package/skills/eval/SKILL.md
@@ -7,216 +7,29 @@ context: fork
 
 # Agent Quality Evaluation
 
- Compare how well Claude answers library questions across three access levels.
+ Compare how well Claude answers library questions across three access levels:
 
- For each query, three agents run in parallel:
- - **Without BK** — uses only web search and training knowledge
- - **BK Grep** — can Grep/Read/Glob the cloned source repos but has no vector search
- - **BK Full** — uses BK vector search + get_full_context + Grep/Read (all BK tools)
-
- Then score all three answers on accuracy, specificity, completeness, and source grounding.
+ - **Without BK** — web search + training knowledge only
+ - **BK Grep** — Grep/Read/Glob on cloned repos, no vector search
+ - **BK Full** — vector search + get_full_context + Grep/Read
 
 ## Arguments
 
 Parse `$ARGUMENTS`:
 
- - **No arguments or empty**: Show usage help
- - **Quoted string** (not starting with `--`): Arbitrary query mode — run eval for that single question
- - **`--predefined`**: Run all predefined queries (skip any whose stores are not indexed)
- - **`--predefined N`**: Run predefined query #N only (1-based index)
-
- If no arguments provided, show:
- ```
- Usage:
- /bluera-knowledge:eval "How does Express handle errors?" # Arbitrary query
- /bluera-knowledge:eval --predefined # Run all predefined queries
- /bluera-knowledge:eval --predefined 3 # Run predefined query #3
- ```
-
- ## Step 1: Prerequisites Check
-
- 1. Call MCP `execute` with `{ command: "stores" }` to list indexed stores
- 2. If no stores are indexed, show error and abort:
- ```
- No knowledge stores indexed. Add at least one library first:
- /bluera-knowledge:add-repo https://github.com/expressjs/express --name express
- ```
- 3. Record the list of available store names — you'll pass these to the BK Full agent
- 4. Build a `STORE_PATHS` mapping from the store response: for each store with a `path` field, record `- **<name>**: \`<path>\`` (one per line, as a markdown list). This gets passed to the BK Grep agent.
-
- ## Step 2: Resolve Queries
-
- ### Predefined mode (`--predefined`)
-
- 1. Read the predefined queries file: `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml`
- 2. Parse the YAML content
- 3. For each query, check if ANY of its `store_hint` values match an available store name
- 4. Split into **runnable** (store available) and **skipped** (store not available) lists
- 5. If `--predefined N` was specified, select only query at index N from the full list (skip if store not available)
- 6. If no queries are runnable, show what stores to add and abort
-
- ### Arbitrary mode (bare query string)
-
- 1. Use the raw query string as the question
- 2. Set `expected_topics` and `anti_patterns` to empty lists
- 3. Set `id` to "arbitrary", `category` to "general", `difficulty` to "unknown"
-
- ## Step 3: Load Templates
-
- Read these files from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`:
-
- 1. `without-bk-agent.md` — instructions for the baseline agent
- 2. `bk-grep-agent.md` — instructions for the BK Grep agent
- 3. `with-bk-agent.md` — instructions for the BK Full agent
- 4. `judge.md` — grading rubric
-
- ## Step 4: Run Eval (for each query)
-
- ### Spawn ALL THREE agents in parallel (same turn, three Task tool calls)
-
- **Without-BK agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `without-bk-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Send as the task prompt
-
- **BK Grep agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `bk-grep-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Replace `{{STORE_PATHS}}` with the store name-to-path mapping built in Step 1
- - Send as the task prompt
-
- **BK Full agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `with-bk-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Replace `{{STORES}}` with the list of available store names (one per line, as a markdown list)
- - Send as the task prompt
-
- Wait for all three agents to complete.
-
- ### Capture Token Usage
-
- From each Task tool response, parse the `<usage>` block to extract:
- - `total_tokens` — the total tokens consumed by the agent
- - `duration_ms` — wall-clock time for the agent
-
- If usage data is not available in a Task response, show "N/A" for that agent.
-
- ### Judge the results
-
- Using the rubric from `judge.md`, evaluate all three answers yourself:
-
- 1. Read all three agent responses
- 2. For each answer, score all 4 criteria (1-5):
- - **Factual Accuracy**: Are the claims correct?
- - **Specificity**: Does it cite specific files, functions, code?
- - **Completeness**: Does it cover the full answer?
- - **Source Grounding**: Are claims backed by evidence?
- 3. If the query has `expected_topics`, check which answers mention each topic
- 4. If the query has `anti_patterns`, flag if any answer makes those claims
- 5. Calculate totals (max 20 each), determine winner and deltas
-
- ## Step 5: Output Results
-
- ### Single query output (arbitrary or `--predefined N`)
-
- Show the full comparison:
-
- ```
- ## Eval: "<question>"
-
- | Criterion | Without BK | BK Grep | BK Full |
- |-------------------|:----------:|:-------:|:-------:|
- | Accuracy | X | X | X |
- | Specificity | X | X | X |
- | Completeness | X | X | X |
- | Source Grounding | X | X | X |
- | **Total** | **X** | **X** | **X** |
-
- | Usage | Without BK | BK Grep | BK Full |
- |-------------------|:----------:|:-------:|:-------:|
- | Tokens | X,XXX | X,XXX | X,XXX |
- | Duration (s) | X.X | X.X | X.X |
-
- **Winner:** [BK Full | BK Grep | Without BK | Tie] ([significant | marginal | none])
- **Key Difference:** [One sentence explaining the most important quality gap]
- **Grep vs Full:** [One sentence on whether vector search outperformed manual grep, and if so how]
- ```
-
- If expected topics were provided:
- ```
- ### Expected Topics
- - [x] topic covered by all three
- - [x] topic covered by BK Full + BK Grep only
- - [x] topic covered by BK Full only
- - [ ] topic missed by all
- ```
-
- ### Multi-query output (`--predefined`)
-
- Show a summary row per query, then aggregate:
-
- ```
- ## Agent Quality Eval Summary
-
- Ran X/8 queries (Y skipped — stores not indexed)
-
- | # | Query | Difficulty | w/o BK | Grep | Full | Winner | Delta |
- |---|-------|:----------:|:------:|:----:|:----:|--------|-------|
- | 1 | query-id | medium | 9/20 | 15/20 | 19/20 | Full | significant |
- | 2 | query-id | easy | 14/20 | 17/20 | 18/20 | Full | marginal |
- | ... |
-
- ### Token Usage
-
- | # | Query | w/o BK tokens | Grep tokens | Full tokens |
- |---|-------|:-------------:|:-----------:|:-----------:|
- | 1 | query-id | 2,340 | 8,120 | 5,670 |
- | 2 | query-id | 1,890 | 6,450 | 4,230 |
- | ... |
-
- ### Aggregate
- - **Without BK mean:** X.X/20 (avg X,XXX tokens)
- - **BK Grep mean:** X.X/20 (avg X,XXX tokens)
- - **BK Full mean:** X.X/20 (avg X,XXX tokens)
- - **Full vs Without:** +X.X points (+XX%)
- - **Full vs Grep:** +X.X points (+XX%)
- - **Grep vs Without:** +X.X points (+XX%)
- - **Full win rate:** X/X (XX%)
- - **Significant wins (Full):** X
-
- ### By Category
- | Category | w/o BK | Grep | Full | Full delta |
- |----------|:------:|:----:|:----:|------------|
- | implementation | X.X | X.X | X.X | +X.X |
- | api | X.X | X.X | X.X | +X.X |
-
- ### By Difficulty
- | Difficulty | w/o BK | Grep | Full | Full delta |
- |------------|:------:|:----:|:----:|------------|
- | easy | X.X | X.X | X.X | +X.X |
- | medium | X.X | X.X | X.X | +X.X |
- | hard | X.X | X.X | X.X | +X.X |
+ - **No arguments**: Show usage help
+ - **Quoted string**: Run eval for that single question
+ - **`--predefined`**: Run all predefined queries
+ - **`--predefined N`**: Run predefined query #N only
 
- ### Token Efficiency
- | Agent | Mean Score | Mean Tokens | Score/1K Tokens |
- |-------|:----------:|:-----------:|:---------------:|
- | Without BK | X.X | X,XXX | X.XX |
- | BK Grep | X.X | X,XXX | X.XX |
- | BK Full | X.X | X,XXX | X.XX |
- ```
+ ## Workflow
 
- If any queries were skipped:
- ```
- ### Skipped (store not indexed)
- - vue-reactivity-tracking add with: /bluera-knowledge:add-repo https://github.com/vuejs/core --name vue
- - fastapi-dependency-injection add with: /bluera-knowledge:add-repo https://github.com/fastapi/fastapi --name fastapi
- ```
+ 1. **Prerequisites**: Call `execute` with `{ command: "stores" }` to list stores. Abort if none.
+ 2. **Resolve queries**: Load from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml` or use arbitrary query.
+ 3. **Load templates**: Read agent prompts + judge rubric from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`
+ 4. **Spawn 3 agents in parallel** per query (replace `{{QUESTION}}`, `{{STORES}}`, `{{STORE_PATHS}}`)
+ 5. **Judge**: Score all 4 criteria (1-5): Accuracy, Specificity, Completeness, Source Grounding
 
- ## Important Notes
+ Detailed procedures: [references/procedures.md](references/procedures.md)
 
- - Each query spawns 3 subagents. For `--predefined` with 8 queries, that's up to 24 agent runs. Process one query at a time (but spawn all three agents for each query in parallel).
- - The without-BK agent may use WebSearch — this is intentional. We're comparing against "the best Claude can do without BK."
- - The BK Grep agent may NOT use WebSearch. It tests what an agent can discover by exploring raw source code, to isolate the value of vector search.
- - Scoring is somewhat subjective. The value is in the comparison (relative scores) rather than absolute numbers. Look at the delta and key differences.
- - The Token Efficiency table reveals cost-effectiveness: if BK Grep achieves similar scores to BK Full with fewer tokens, it suggests vector search isn't adding much for that query type.
- - For arbitrary queries without expected topics, grading relies entirely on the 4 general criteria. This is fine — it still reveals whether BK adds value.
+ Output format: [references/output-format.md](references/output-format.md)
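The `$ARGUMENTS` dispatch removed from SKILL.md and condensed in the new version can be sketched in Python. This is an illustrative helper, not code shipped by the package; the mode names returned here are assumptions:

```python
def parse_eval_args(arguments: str):
    """Classify the raw $ARGUMENTS string into one of the eval modes
    described in the Arguments section. Returns (mode, payload)."""
    args = arguments.strip()
    if not args:
        # No arguments or empty: show usage help
        return ("help", None)
    if args.startswith("--predefined"):
        rest = args[len("--predefined"):].strip()
        if rest.isdigit():
            # --predefined N: run predefined query #N (1-based)
            return ("predefined-one", int(rest))
        # --predefined: run all predefined queries
        return ("predefined-all", None)
    # Anything else is an arbitrary query; strip surrounding quotes
    return ("arbitrary", args.strip('"'))
```

One design note: treating any non-`--predefined` string as an arbitrary query matches the "quoted string" rule while staying forgiving about missing quotes.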
--- /dev/null
+++ b/package/skills/eval/references/output-format.md
@@ -0,0 +1,73 @@
+ # Eval Output Format
+
+ ## Single Query Output
+
+ ```
+ ## Eval: "<question>"
+
+ | Criterion | Without BK | BK Grep | BK Full |
+ |-------------------|:----------:|:-------:|:-------:|
+ | Accuracy | X | X | X |
+ | Specificity | X | X | X |
+ | Completeness | X | X | X |
+ | Source Grounding | X | X | X |
+ | **Total** | **X** | **X** | **X** |
+
+ | Usage | Without BK | BK Grep | BK Full |
+ |-------------------|:----------:|:-------:|:-------:|
+ | Tokens | X,XXX | X,XXX | X,XXX |
+ | Duration (s) | X.X | X.X | X.X |
+
+ **Winner:** [BK Full | BK Grep | Without BK | Tie] ([significant | marginal | none])
+ **Key Difference:** [One sentence explaining the most important quality gap]
+ **Grep vs Full:** [One sentence on whether vector search outperformed manual grep]
+ ```
+
+ If expected topics provided:
+ ```
+ ### Expected Topics
+ - [x] topic covered by all three
+ - [x] topic covered by BK Full only
+ - [ ] topic missed by all
+ ```
+
+ ## Multi-Query Output (`--predefined`)
+
+ ```
+ ## Agent Quality Eval Summary
+
+ Ran X/8 queries (Y skipped — stores not indexed)
+
+ | # | Query | Difficulty | w/o BK | Grep | Full | Winner | Delta |
+ |---|-------|:----------:|:------:|:----:|:----:|--------|-------|
+ | 1 | query-id | medium | 9/20 | 15/20 | 19/20 | Full | significant |
+
+ ### Token Usage
+
+ | # | Query | w/o BK tokens | Grep tokens | Full tokens |
+ |---|-------|:-------------:|:-----------:|:-----------:|
+ | 1 | query-id | 2,340 | 8,120 | 5,670 |
+
+ ### Aggregate
+ - **Without BK mean:** X.X/20 (avg X,XXX tokens)
+ - **BK Grep mean:** X.X/20 (avg X,XXX tokens)
+ - **BK Full mean:** X.X/20 (avg X,XXX tokens)
+ - **Full vs Without:** +X.X points (+XX%)
+ - **Full vs Grep:** +X.X points (+XX%)
+ - **Full win rate:** X/X (XX%)
+
+ ### By Category / Difficulty
+ Tables breaking down scores by category and difficulty level.
+
+ ### Token Efficiency
+ | Agent | Mean Score | Mean Tokens | Score/1K Tokens |
+ |-------|:----------:|:-----------:|:---------------:|
+ | Without BK | X.X | X,XXX | X.XX |
+ | BK Grep | X.X | X,XXX | X.XX |
+ | BK Full | X.X | X,XXX | X.XX |
+ ```
+
+ ### Skipped queries
+ ```
+ - query-id — add with: /bluera-knowledge:add-repo <url> --name <name>
+ ```
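The Aggregate and Token Efficiency figures in the output format above are plain arithmetic over per-query results. A minimal Python sketch, assuming an illustrative result schema (the field names are not the package's):

```python
def aggregate(results):
    """Compute the summary numbers: per-agent means, Full-vs-X deltas,
    Full win rate, and the Score/1K Tokens efficiency column.
    `results` is a list of dicts keyed by agent, each holding a
    score out of 20 and a token count (hypothetical shape)."""
    agents = ("without_bk", "grep", "full")
    n = len(results)
    means = {a: sum(r[a]["score"] for r in results) / n for a in agents}
    mean_tokens = {a: sum(r[a]["tokens"] for r in results) / n for a in agents}
    # A "Full win" = Full strictly outscores both other agents
    full_wins = sum(
        1 for r in results
        if r["full"]["score"] > max(r["without_bk"]["score"], r["grep"]["score"])
    )
    return {
        "means": means,
        "full_vs_without": means["full"] - means["without_bk"],
        "full_vs_grep": means["full"] - means["grep"],
        "win_rate": full_wins / n,
        # Score/1K Tokens: mean score divided by mean tokens in thousands
        "score_per_1k": {a: means[a] / (mean_tokens[a] / 1000) for a in agents},
    }
```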
--- /dev/null
+++ b/package/skills/eval/references/procedures.md
@@ -0,0 +1,61 @@
+ # Eval Procedures
+
+ ## Step 1: Prerequisites Check
+
+ 1. Call MCP `execute` with `{ command: "stores" }` to list indexed stores
+ 2. If no stores indexed, show error and abort
+ 3. Record available store names — pass to BK Full agent
+ 4. Build `STORE_PATHS` mapping from store response: `- **<name>**: \`<path>\`` per store
+
+ ## Step 2: Resolve Queries
+
+ ### Predefined mode (`--predefined`)
+
+ 1. Read: `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml`
+ 2. For each query, check if ANY `store_hint` values match available stores
+ 3. Split into runnable and skipped lists
+ 4. If `--predefined N`, select only query at index N
+
+ ### Arbitrary mode
+
+ 1. Use raw query as question
+ 2. Set `expected_topics` and `anti_patterns` to empty
+ 3. Set `id` to "arbitrary", `category` to "general", `difficulty` to "unknown"
+
+ ## Step 3: Load Templates
+
+ From `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`:
+ 1. `without-bk-agent.md` — baseline agent
+ 2. `bk-grep-agent.md` — grep-only agent
+ 3. `with-bk-agent.md` — full BK agent
+ 4. `judge.md` — grading rubric
+
+ ## Step 4: Run Eval
+
+ Spawn ALL THREE agents in parallel per query using the Task tool:
+
+ **Without-BK**: Replace `{{QUESTION}}` in `without-bk-agent.md`
+ **BK Grep**: Replace `{{QUESTION}}` and `{{STORE_PATHS}}` in `bk-grep-agent.md`
+ **BK Full**: Replace `{{QUESTION}}` and `{{STORES}}` in `with-bk-agent.md`
+
+ ### Capture Token Usage
+
+ From each Task response, extract `total_tokens` and `duration_ms` from `<usage>` block.
+
+ ### Judge
+
+ Score all 4 criteria (1-5) per answer:
+ - **Factual Accuracy**: Are claims correct?
+ - **Specificity**: Does it cite specific files, functions, code?
+ - **Completeness**: Does it cover the full answer?
+ - **Source Grounding**: Are claims backed by evidence?
+
+ Check `expected_topics` coverage and `anti_patterns` violations.
+
+ ## Notes
+
+ - Each query spawns 3 subagents. Process one query at a time.
+ - Without-BK agent may use WebSearch — intentional baseline.
+ - BK Grep agent may NOT use WebSearch — isolates vector search value.
+ - Scoring is subjective; value is in relative comparison (deltas).
+ - Token Efficiency reveals cost-effectiveness across modes.
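Steps 1 and 2 of the procedures file above reduce to two small transformations: rendering the `STORE_PATHS` markdown list and partitioning queries by `store_hint`. A hedged Python sketch; helper names and the exact store/query dict shapes are assumptions, not the package's actual code:

```python
def build_store_paths(stores):
    """Step 1.4: render one markdown bullet per store that has a
    `path` field, in the `- **<name>**: `<path>`` shape shown above."""
    return "\n".join(
        f"- **{s['name']}**: `{s['path']}`" for s in stores if s.get("path")
    )

def split_queries(queries, available_stores):
    """Step 2: a query is runnable if ANY of its store_hint values
    names an indexed store; otherwise it is skipped."""
    available = set(available_stores)
    runnable, skipped = [], []
    for q in queries:
        if available.intersection(q.get("store_hint", [])):
            runnable.append(q)
        else:
            skipped.append(q)
    return runnable, skipped
```

The `STORE_PATHS` string would be substituted into the BK Grep agent template, while the skipped list feeds the "Skipped queries" section of the output format.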