bluera-knowledge 0.34.1 → 0.34.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  {
    "name": "bluera-knowledge",
-   "version": "0.34.1",
+   "version": "0.34.2",
    "description": "Clone repos, crawl docs, search locally. Fast, authoritative answers for AI coding agents.",
    "author": {
      "name": "Bluera Inc",
package/CHANGELOG.md CHANGED
@@ -2,6 +2,14 @@

  All notable changes to this project will be documented in this file. See [commit-and-tag-version](https://github.com/absolute-version/commit-and-tag-version) for commit guidelines.

+ ## [0.34.2](https://github.com/blueraai/bluera-knowledge/compare/v0.34.0...v0.34.2) (2026-03-19)
+
+
+ ### Bug Fixes
+
+ * **hooks:** make Python hooks executable, add stdin drain, remove stale files ([219b645](https://github.com/blueraai/bluera-knowledge/commit/219b6459e955764645c8edd0c98ff0be2b9c96b8))
+ * **mcp:** expand shell variables in PROJECT_ROOT to prevent literal ${PWD} directories ([2ab025f](https://github.com/blueraai/bluera-knowledge/commit/2ab025f42fb8d063cccc3dacfc47ed87d299d634))
+
  ## [0.34.1](https://github.com/blueraai/bluera-knowledge/compare/v0.34.0...v0.34.1) (2026-03-12)


File without changes
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "bluera-knowledge",
-   "version": "0.34.1",
+   "version": "0.34.2",
    "description": "CLI tool for managing knowledge stores with semantic search",
    "type": "module",
    "bin": {
@@ -6,6 +6,9 @@
  # It exits quickly (0) if already set up, or runs full setup if needed.
  # Non-interactive: cannot prompt for user input (no TTY).

+ # Drain stdin so the pipe doesn't hang
+ cat > /dev/null 2>&1 || true
+
  PLUGIN_ROOT="${CLAUDE_PLUGIN_ROOT:-$(dirname "$(dirname "$0")")}"

  # Colors for output
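The changelog's **hooks** fix adds the same stdin drain to the Python hooks. A sketch of the pattern (the function name and payload are assumptions for illustration, not taken from the package):

```python
import io
import sys

def drain_stdin(stream=None) -> int:
    """Read and discard everything from the given stream (default: the
    real stdin), so the process writing into the pipe never blocks on a
    full pipe buffer when the hook ignores its input.
    Returns the number of bytes discarded."""
    stream = stream if stream is not None else sys.stdin.buffer
    return len(stream.read())

# Hooks receive a payload on stdin; even when it is unused,
# it must still be consumed:
drain_stdin(io.BytesIO(b'{"event": "setup"}'))
```

Accepting an injectable stream keeps the helper testable without wiring up a real pipe.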
@@ -3,271 +3,51 @@ name: advanced-workflows
  description: Multi-tool orchestration patterns for BK operations
  ---

- # Advanced Bluera Knowledge Workflows
+ # Advanced BK Workflows

- Master complex multi-tool operations that combine multiple MCP tools for efficient knowledge retrieval and management.
+ Multi-tool patterns for efficient knowledge retrieval and management.

- ## Progressive Library Exploration
+ ## Core Patterns

- When exploring a new library or codebase, use this pattern for efficient discovery:
+ ### Progressive Library Exploration

- ### Workflow: Find Relevant Code in Unknown Library
-
- ```
- 1. list_stores()
- → See what's indexed, identify target store
-
- 2. get_store_info(store)
- → Get metadata: file paths, size, indexed files
- → Understand scope before searching
-
- 3. search(query, detail='minimal', stores=[target])
- → Get high-level summaries of relevant code
- → Review relevance scores (>0.7 = good match)
-
- 4. get_full_context(result_ids[top_3])
- → Deep dive on most relevant results only
- → Get complete code with full context
- ```
-
- **Example:**
-
- User: "How does Vue's computed properties work?"
-
- ```
- list_stores()
- → Found: vue, react, pydantic
-
- get_store_info('vue')
- → Path: .bluera/bluera-knowledge/repos/vue/
- → Files: 2,847 indexed
-
- search("computed properties", intent='find-implementation', detail='minimal', stores=['vue'])
- → Result 1: packages/reactivity/src/computed.ts (score: 0.92)
- → Result 2: packages/reactivity/__tests__/computed.spec.ts (score: 0.85)
- → Result 3: packages/runtime-core/src/apiComputed.ts (score: 0.78)
-
- get_full_context(['result_1_id', 'result_2_id'])
- → Full code for ComputedRefImpl class
- → Complete API implementation
-
- Now explain with authoritative source code.
- ```
-
- ## Adding New Library with Job Monitoring
-
- When adding large libraries, monitor indexing progress to know when search is ready:
-
- ### Workflow: Add Library and Wait for Index
-
- ```
- 1. create_store(url_or_path, name)
- → Returns: job_id
- → Background indexing starts
-
- 2. check_job_status(job_id)
- → Poll every 10-30 seconds
- → Status: 'pending' | 'running' | 'completed' | 'failed'
- → Progress: percentage, current file
-
- 3. When status='completed':
- list_stores()
- → Verify store appears in list
-
- 4. search(query, stores=[new_store], limit=5)
- → Test search works
- → Verify indexing quality
- ```
-
- **Example:**
-
- ```
- create_store('https://github.com/fastapi/fastapi', 'fastapi')
- → job_id: 'job_abc123'
- → Status: Indexing started in background
-
- # Poll for completion (typically 30-120 seconds for medium repos)
- check_job_status('job_abc123')
- → Status: running
- → Progress: 45% (processing src/fastapi/routing.py)
-
- # ... wait 30 seconds ...
-
- check_job_status('job_abc123')
- → Status: completed
- → Indexed: 487 files, 125k lines
-
- # Verify and test
- list_stores()
- → fastapi: 487 files, vector + FTS indexed
-
- search("dependency injection", stores=['fastapi'], limit=3)
- → Returns relevant FastAPI DI patterns
- → Store is ready for use!
- ```
-
- ## Handling Large Result Sets
-
- When initial search returns many results, use progressive detail to avoid context overload:
-
- ### Workflow: Progressive Detail Strategy
-
- ```
- 1. search(query, detail='minimal', limit=20)
- → Get summaries only (~100 tokens/result)
- → Review all 20 summaries quickly
-
- 2. Filter by relevance score:
- - Score > 0.8: Excellent match
- - Score 0.6-0.8: Good match
- - Score < 0.6: Possibly irrelevant
-
- 3. For top 3-5 results (score > 0.7):
- get_full_context(selected_ids)
- → Fetch complete code only for relevant items
- → Saves ~80% context vs fetching all upfront
-
- 4. If nothing relevant:
- search(refined_query, detail='contextual', limit=10)
- → Try different query with more context
- → Or broaden/narrow the search
- ```
-
- **Example:**
-
- ```
- # Initial broad search
- search("authentication middleware", detail='minimal', limit=20)
- → 20 results, scores ranging 0.45-0.92
- → Total context: ~2k tokens (minimal)
-
- # Filter by score
- Top results (>0.7):
- - Result 3: auth/jwt.ts (score: 0.92)
- - Result 7: middleware/authenticate.ts (score: 0.85)
- - Result 12: auth/session.ts (score: 0.74)
-
- # Get full code for top 3 only
- get_full_context(['result_3', 'result_7', 'result_12'])
- → Complete implementations for relevant files only
- → Context: ~3k tokens (vs ~15k if we fetched all 20)
-
- # Found what we needed! If not, would refine query and retry.
- ```
-
- ## Multi-Store Search with Ranking
-
- When searching across multiple stores, use ranking to prioritize results:
-
- ### Workflow: Cross-Library Search
-
- ```
- 1. search(query, limit=10)
- → Searches ALL stores
- → Returns mixed results ranked by relevance
-
- 2. Review store distribution:
- - If dominated by one store: might narrow to specific stores
- - If balanced: good cross-library perspective
-
- 3. For specific library focus:
- search(query, stores=['lib1', 'lib2'], limit=15)
- → Search only relevant libraries
- → Get more results from target libraries
  ```
-
- **Example:**
-
- User: "How do different frameworks handle routing?"
-
+ 1. list_stores() → identify target store
+ 2. search(query, detail='minimal', stores=[target]) → scan summaries
+ 3. get_full_context(top_result_ids) → deep dive on best matches
  ```
- # Search all indexed frameworks
- search("routing implementation", intent='find-implementation', limit=15)
- → Result mix:
- - express (score: 0.91)
- - fastapi (score: 0.89)
- - hono (score: 0.87)
- - vue-router (score: 0.82)
- - ...
-
- # All stores represented, good comparative view!
-
- # If user wants deeper FastAPI focus:
- search("routing implementation", stores=['fastapi', 'starlette'], limit=20)
- → More FastAPI/Starlette-specific results
- → Deeper exploration of Python framework routing
- ```
-
- ## Error Recovery

- When operations fail, use these recovery patterns:
-
- ### Workflow: Handle Indexing Failures
+ ### Add Library + Wait for Index

  ```
- 1. create_store() fails or job_status shows 'failed'
- Check error message
- Common issues:
- - Git auth required (private repo)
- - Invalid URL/path
- - Disk space
- - Network timeout
-
- 2. Recovery actions:
- - Auth issue: Provide credentials or use HTTPS
- - Invalid path: Verify URL/path exists
- - Disk space: delete_store() unused stores
- - Network: Retry with smaller repo or use --shallow
-
- 3. Verify recovery:
- list_stores() → Check store appeared
- search(test_query, stores=[new_store]) → Verify searchable
+ 1. create_store(url, name) → job_id
+ 2. check_job_status(job_id) → poll every 15-30s
+ 3. When completed: search(query, stores=[name]) → verify
  ```

- **Example:**
+ ### Progressive Detail Strategy

  ```
- create_store('https://github.com/private/repo', 'my-repo')
- job_id: 'job_xyz'
-
- check_job_status('job_xyz')
- → Status: failed
- → Error: "Authentication required for private repository"
-
- # Recovery: Use authenticated URL or SSH
- create_store('git@github.com:private/repo.git', 'my-repo')
- → job_id: 'job_xyz2'
-
- check_job_status('job_xyz2')
- → Status: completed
- → Success!
+ 1. search(query, detail='minimal', limit=20) → ~100 tokens/result
+ 2. Filter by score (>0.7 = strong match)
+ 3. get_full_context(top_3_ids) → saves ~80% context
+ 4. If nothing relevant: refine query and retry
  ```

- ## Combining Workflows
-
- Real-world usage often combines these patterns:
+ ### Cross-Library Comparison

  ```
- User: "I need to understand how Express and Hono handle middleware differently"
-
- 1. list_stores() → check if both indexed
- 2. If not: create_store() for missing framework(s)
- 3. check_job_status() → wait for indexing
- 4. search("middleware implementation", stores=['express', 'hono'], detail='minimal')
- 5. Review summaries, identify key files
- 6. get_full_context() for 2-3 most relevant from each framework
- 7. Compare implementations with full context
+ 1. search(query, limit=10) → searches ALL stores
+ 2. Review store distribution
+ 3. Narrow: search(query, stores=['lib1', 'lib2'], limit=15)
  ```

- This multi-step workflow is efficient, targeted, and conserves context.
-
  ## Best Practices

- 1. **Always start with detail='minimal'** - Get summaries first, full context selectively
- 2. **Monitor background jobs** - Don't search newly added stores until indexing completes
- 3. **Use intent parameter** - Helps ranking ('find-implementation' vs 'find-pattern' vs 'find-usage')
- 4. **Filter by stores when known** - Faster, more focused results
- 5. **Check relevance scores** - >0.7 is usually a strong match, <0.5 might be noise
- 6. **Progressive refinement** - Broad search → filter → narrow → full context
+ 1. **Start with detail='minimal'** — summaries first, full context selectively
+ 2. **Monitor background jobs** — don't search until indexing completes
+ 3. **Use intent parameter** — 'find-implementation' vs 'find-usage' vs 'find-documentation'
+ 4. **Filter by stores** — faster, more focused results
+ 5. **Check scores** — >0.7 strong match, <0.5 likely noise

- These workflows reduce token usage, minimize tool calls, and get you to the right answer faster.
+ Detailed examples and error recovery: [references/examples.md](references/examples.md)
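The progressive-detail pattern the rewritten skill condenses (minimal search, filter by score, `get_full_context` on the top few) is plain filtering logic. A sketch in Python; the result-dict shape is an assumption for illustration, not the MCP tool's actual schema:

```python
def select_for_full_context(results, threshold=0.7, top_n=3):
    """Keep only strong matches (score > threshold), then cap how many
    IDs go to get_full_context. Fetching full code for 3 of 20 results
    is where the ~80% context saving comes from."""
    strong = [r for r in results if r["score"] > threshold]
    strong.sort(key=lambda r: r["score"], reverse=True)
    return [r["id"] for r in strong[:top_n]]

results = [
    {"id": "result_3", "score": 0.92},   # auth/jwt.ts
    {"id": "result_7", "score": 0.85},   # middleware/authenticate.ts
    {"id": "result_12", "score": 0.74},  # auth/session.ts
    {"id": "result_19", "score": 0.45},  # below threshold: dropped
]
print(select_for_full_context(results))  # → ['result_3', 'result_7', 'result_12']
```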
@@ -0,0 +1,86 @@
+ # Workflow Examples
+
+ ## Progressive Library Exploration
+
+ User: "How does Vue's computed properties work?"
+
+ ```
+ list_stores()
+ → Found: vue, react, pydantic
+
+ get_store_info('vue')
+ → Path: .bluera/bluera-knowledge/repos/vue/
+ → Files: 2,847 indexed
+
+ search("computed properties", intent='find-implementation', detail='minimal', stores=['vue'])
+ → Result 1: packages/reactivity/src/computed.ts (score: 0.92)
+ → Result 2: packages/reactivity/__tests__/computed.spec.ts (score: 0.85)
+ → Result 3: packages/runtime-core/src/apiComputed.ts (score: 0.78)
+
+ get_full_context(['result_1_id', 'result_2_id'])
+ → Full code for ComputedRefImpl class
+ → Complete API implementation
+ ```
+
+ ## Add Library + Monitor
+
+ ```
+ create_store('https://github.com/fastapi/fastapi', 'fastapi')
+ → job_id: 'job_abc123'
+
+ check_job_status('job_abc123')
+ → Status: running, Progress: 45%
+
+ # ... wait 30 seconds ...
+
+ check_job_status('job_abc123')
+ → Status: completed, Indexed: 487 files
+
+ search("dependency injection", stores=['fastapi'], limit=3)
+ → Returns relevant FastAPI DI patterns
+ ```
+
+ ## Progressive Detail Strategy
+
+ ```
+ # Initial broad search
+ search("authentication middleware", detail='minimal', limit=20)
+ → 20 results, scores 0.45-0.92, ~2k tokens total
+
+ # Filter by score (>0.7):
+ - auth/jwt.ts (score: 0.92)
+ - middleware/authenticate.ts (score: 0.85)
+ - auth/session.ts (score: 0.74)
+
+ # Get full code for top 3 only
+ get_full_context(['result_3', 'result_7', 'result_12'])
+ → ~3k tokens (vs ~15k if fetched all 20)
+ ```
+
+ ## Error Recovery
+
+ ```
+ create_store('https://github.com/private/repo', 'my-repo')
+ → job_id: 'job_xyz'
+
+ check_job_status('job_xyz')
+ → Status: failed, Error: "Authentication required"
+
+ # Recovery: Use SSH
+ create_store('git@github.com:private/repo.git', 'my-repo')
+ → Status: completed
+ ```
+
+ ## Combining Workflows
+
+ User: "Compare Express and Hono middleware"
+
+ ```
+ 1. list_stores() → check if both indexed
+ 2. If not: create_store() for missing
+ 3. check_job_status() → wait for indexing
+ 4. search("middleware implementation", stores=['express', 'hono'], detail='minimal')
+ 5. Review summaries, identify key files
+ 6. get_full_context() for 2-3 most relevant from each
+ 7. Compare implementations with full context
+ ```
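The Add Library + Monitor example above polls `check_job_status` by hand; the loop generalizes as below. The status-dict shape and the injected `check_job_status` callable are assumptions for illustration, not the MCP tool's actual interface:

```python
import time

def wait_for_index(check_job_status, job_id, interval=15.0, timeout=600.0):
    """Poll until the indexing job reaches a terminal state.
    `check_job_status` is any callable returning {"status": ...};
    returns 'completed' or 'failed', or raises on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = check_job_status(job_id)["status"]
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {timeout}s")

# Simulated job: running once, then completed
responses = iter([{"status": "running"}, {"status": "completed"}])
print(wait_for_index(lambda _id: next(responses), "job_abc123", interval=0))
```

Injecting the status callable keeps the polling logic testable without a live indexing job.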
@@ -7,216 +7,29 @@ context: fork

  # Agent Quality Evaluation

- Compare how well Claude answers library questions across three access levels.
+ Compare how well Claude answers library questions across three access levels:

- For each query, three agents run in parallel:
- - **Without BK** — uses only web search and training knowledge
- - **BK Grep** — can Grep/Read/Glob the cloned source repos but has no vector search
- - **BK Full** — uses BK vector search + get_full_context + Grep/Read (all BK tools)
-
- Then score all three answers on accuracy, specificity, completeness, and source grounding.
+ - **Without BK** — web search + training knowledge only
+ - **BK Grep** — Grep/Read/Glob on cloned repos, no vector search
+ - **BK Full** — vector search + get_full_context + Grep/Read

  ## Arguments

  Parse `$ARGUMENTS`:

- - **No arguments or empty**: Show usage help
- - **Quoted string** (not starting with `--`): Arbitrary query mode — run eval for that single question
- - **`--predefined`**: Run all predefined queries (skip any whose stores are not indexed)
- - **`--predefined N`**: Run predefined query #N only (1-based index)
-
- If no arguments provided, show:
- ```
- Usage:
- /bluera-knowledge:eval "How does Express handle errors?" # Arbitrary query
- /bluera-knowledge:eval --predefined # Run all predefined queries
- /bluera-knowledge:eval --predefined 3 # Run predefined query #3
- ```
-
- ## Step 1: Prerequisites Check
-
- 1. Call MCP `execute` with `{ command: "stores" }` to list indexed stores
- 2. If no stores are indexed, show error and abort:
- ```
- No knowledge stores indexed. Add at least one library first:
- /bluera-knowledge:add-repo https://github.com/expressjs/express --name express
- ```
- 3. Record the list of available store names — you'll pass these to the BK Full agent
- 4. Build a `STORE_PATHS` mapping from the store response: for each store with a `path` field, record `- **<name>**: \`<path>\`` (one per line, as a markdown list). This gets passed to the BK Grep agent.
-
- ## Step 2: Resolve Queries
-
- ### Predefined mode (`--predefined`)
-
- 1. Read the predefined queries file: `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml`
- 2. Parse the YAML content
- 3. For each query, check if ANY of its `store_hint` values match an available store name
- 4. Split into **runnable** (store available) and **skipped** (store not available) lists
- 5. If `--predefined N` was specified, select only query at index N from the full list (skip if store not available)
- 6. If no queries are runnable, show what stores to add and abort
-
- ### Arbitrary mode (bare query string)
-
- 1. Use the raw query string as the question
- 2. Set `expected_topics` and `anti_patterns` to empty lists
- 3. Set `id` to "arbitrary", `category` to "general", `difficulty` to "unknown"
-
- ## Step 3: Load Templates
-
- Read these files from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`:
-
- 1. `without-bk-agent.md` — instructions for the baseline agent
- 2. `bk-grep-agent.md` — instructions for the BK Grep agent
- 3. `with-bk-agent.md` — instructions for the BK Full agent
- 4. `judge.md` — grading rubric
-
- ## Step 4: Run Eval (for each query)
-
- ### Spawn ALL THREE agents in parallel (same turn, three Task tool calls)
-
- **Without-BK agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `without-bk-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Send as the task prompt
-
- **BK Grep agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `bk-grep-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Replace `{{STORE_PATHS}}` with the store name-to-path mapping built in Step 1
- - Send as the task prompt
-
- **BK Full agent** — Use the Task tool with `subagent_type: "general-purpose"`:
- - Take the content from `with-bk-agent.md`
- - Replace `{{QUESTION}}` with the actual question
- - Replace `{{STORES}}` with the list of available store names (one per line, as a markdown list)
- - Send as the task prompt
-
- Wait for all three agents to complete.
-
- ### Capture Token Usage
-
- From each Task tool response, parse the `<usage>` block to extract:
- - `total_tokens` — the total tokens consumed by the agent
- - `duration_ms` — wall-clock time for the agent
-
- If usage data is not available in a Task response, show "N/A" for that agent.
-
- ### Judge the results
-
- Using the rubric from `judge.md`, evaluate all three answers yourself:
-
- 1. Read all three agent responses
- 2. For each answer, score all 4 criteria (1-5):
- - **Factual Accuracy**: Are the claims correct?
- - **Specificity**: Does it cite specific files, functions, code?
- - **Completeness**: Does it cover the full answer?
- - **Source Grounding**: Are claims backed by evidence?
- 3. If the query has `expected_topics`, check which answers mention each topic
- 4. If the query has `anti_patterns`, flag if any answer makes those claims
- 5. Calculate totals (max 20 each), determine winner and deltas
-
- ## Step 5: Output Results
-
- ### Single query output (arbitrary or `--predefined N`)
-
- Show the full comparison:
-
- ```
- ## Eval: "<question>"
-
- | Criterion | Without BK | BK Grep | BK Full |
- |-------------------|:----------:|:-------:|:-------:|
- | Accuracy | X | X | X |
- | Specificity | X | X | X |
- | Completeness | X | X | X |
- | Source Grounding | X | X | X |
- | **Total** | **X** | **X** | **X** |
-
- | Usage | Without BK | BK Grep | BK Full |
- |-------------------|:----------:|:-------:|:-------:|
- | Tokens | X,XXX | X,XXX | X,XXX |
- | Duration (s) | X.X | X.X | X.X |
-
- **Winner:** [BK Full | BK Grep | Without BK | Tie] ([significant | marginal | none])
- **Key Difference:** [One sentence explaining the most important quality gap]
- **Grep vs Full:** [One sentence on whether vector search outperformed manual grep, and if so how]
- ```
-
- If expected topics were provided:
- ```
- ### Expected Topics
- - [x] topic covered by all three
- - [x] topic covered by BK Full + BK Grep only
- - [x] topic covered by BK Full only
- - [ ] topic missed by all
- ```
-
- ### Multi-query output (`--predefined`)
-
- Show a summary row per query, then aggregate:
-
- ```
- ## Agent Quality Eval Summary
-
- Ran X/8 queries (Y skipped — stores not indexed)
-
- | # | Query | Difficulty | w/o BK | Grep | Full | Winner | Delta |
- |---|-------|:----------:|:------:|:----:|:----:|--------|-------|
- | 1 | query-id | medium | 9/20 | 15/20 | 19/20 | Full | significant |
- | 2 | query-id | easy | 14/20 | 17/20 | 18/20 | Full | marginal |
- | ... |
-
- ### Token Usage
-
- | # | Query | w/o BK tokens | Grep tokens | Full tokens |
- |---|-------|:-------------:|:-----------:|:-----------:|
- | 1 | query-id | 2,340 | 8,120 | 5,670 |
- | 2 | query-id | 1,890 | 6,450 | 4,230 |
- | ... |
-
- ### Aggregate
- - **Without BK mean:** X.X/20 (avg X,XXX tokens)
- - **BK Grep mean:** X.X/20 (avg X,XXX tokens)
- - **BK Full mean:** X.X/20 (avg X,XXX tokens)
- - **Full vs Without:** +X.X points (+XX%)
- - **Full vs Grep:** +X.X points (+XX%)
- - **Grep vs Without:** +X.X points (+XX%)
- - **Full win rate:** X/X (XX%)
- - **Significant wins (Full):** X
-
- ### By Category
- | Category | w/o BK | Grep | Full | Full delta |
- |----------|:------:|:----:|:----:|------------|
- | implementation | X.X | X.X | X.X | +X.X |
- | api | X.X | X.X | X.X | +X.X |
-
- ### By Difficulty
- | Difficulty | w/o BK | Grep | Full | Full delta |
- |------------|:------:|:----:|:----:|------------|
- | easy | X.X | X.X | X.X | +X.X |
- | medium | X.X | X.X | X.X | +X.X |
- | hard | X.X | X.X | X.X | +X.X |
+ - **No arguments**: Show usage help
+ - **Quoted string**: Run eval for that single question
+ - **`--predefined`**: Run all predefined queries
+ - **`--predefined N`**: Run predefined query #N only

- ### Token Efficiency
- | Agent | Mean Score | Mean Tokens | Score/1K Tokens |
- |-------|:----------:|:-----------:|:---------------:|
- | Without BK | X.X | X,XXX | X.XX |
- | BK Grep | X.X | X,XXX | X.XX |
- | BK Full | X.X | X,XXX | X.XX |
- ```
+ ## Workflow

- If any queries were skipped:
- ```
- ### Skipped (store not indexed)
- - vue-reactivity-tracking — add with: /bluera-knowledge:add-repo https://github.com/vuejs/core --name vue
- - fastapi-dependency-injection — add with: /bluera-knowledge:add-repo https://github.com/fastapi/fastapi --name fastapi
- ```
+ 1. **Prerequisites**: Call `execute` with `{ command: "stores" }` to list stores. Abort if none.
+ 2. **Resolve queries**: Load from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml` or use arbitrary query.
+ 3. **Load templates**: Read agent prompts + judge rubric from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`
+ 4. **Spawn 3 agents in parallel** per query (replace `{{QUESTION}}`, `{{STORES}}`, `{{STORE_PATHS}}`)
+ 5. **Judge**: Score all 4 criteria (1-5): Accuracy, Specificity, Completeness, Source Grounding

- ## Important Notes
+ Detailed procedures: [references/procedures.md](references/procedures.md)

- - Each query spawns 3 subagents. For `--predefined` with 8 queries, that's up to 24 agent runs. Process one query at a time (but spawn all three agents for each query in parallel).
- - The without-BK agent may use WebSearch — this is intentional. We're comparing against "the best Claude can do without BK."
- - The BK Grep agent may NOT use WebSearch. It tests what an agent can discover by exploring raw source code, to isolate the value of vector search.
- - Scoring is somewhat subjective. The value is in the comparison (relative scores) rather than absolute numbers. Look at the delta and key differences.
- - The Token Efficiency table reveals cost-effectiveness: if BK Grep achieves similar scores to BK Full with fewer tokens, it suggests vector search isn't adding much for that query type.
- - For arbitrary queries without expected topics, grading relies entirely on the 4 general criteria. This is fine — it still reveals whether BK adds value.
+ Output format: [references/output-format.md](references/output-format.md)
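The removed Aggregate section computed per-agent means and deltas from the 20-point totals; that arithmetic reduces to a few lines. A sketch with illustrative scores (the agent names mirror the eval's three access levels; the numbers are made up for the demo):

```python
def aggregate(totals):
    """totals: {agent_name: [per-query totals out of 20]}.
    Returns the mean per agent plus the Full-vs-Without point delta
    that the old summary reported."""
    means = {agent: sum(scores) / len(scores) for agent, scores in totals.items()}
    delta = means["BK Full"] - means["Without BK"]
    return means, delta

means, delta = aggregate({
    "Without BK": [9, 14],
    "BK Grep": [15, 17],
    "BK Full": [19, 18],
})
print(means, delta)  # → Full mean 18.5, Without mean 11.5, delta 7.0
```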