bluera-knowledge 0.34.1 → 0.34.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +1 -1
- package/CHANGELOG.md +8 -0
- package/hooks/posttooluse-bk-reminder.py +0 -0
- package/hooks/posttooluse-web-research.py +0 -0
- package/hooks/posttooluse-websearch-bk.py +0 -0
- package/package.json +1 -1
- package/scripts/auto-setup.sh +3 -0
- package/skills/advanced-workflows/SKILL.md +26 -246
- package/skills/advanced-workflows/references/examples.md +86 -0
- package/skills/eval/SKILL.md +16 -203
- package/skills/eval/references/output-format.md +73 -0
- package/skills/eval/references/procedures.md +61 -0
- package/skills/store-lifecycle/SKILL.md +16 -441
- package/skills/store-lifecycle/references/operations.md +75 -0
- package/skills/store-lifecycle/references/source-types.md +48 -0
- package/skills/test-plugin/SKILL.md +8 -515
- package/skills/test-plugin/references/output-format.md +43 -0
- package/skills/test-plugin/references/test-procedures.md +107 -0
- package/hooks/pretooluse-bk-suggest.py +0 -296
- package/hooks/skill-activation.py +0 -221
- package/hooks/skill-rules.json +0 -131
package/CHANGELOG.md
CHANGED

````diff
@@ -2,6 +2,14 @@
 
 All notable changes to this project will be documented in this file. See [commit-and-tag-version](https://github.com/absolute-version/commit-and-tag-version) for commit guidelines.
 
+## [0.34.2](https://github.com/blueraai/bluera-knowledge/compare/v0.34.0...v0.34.2) (2026-03-19)
+
+
+### Bug Fixes
+
+* **hooks:** make Python hooks executable, add stdin drain, remove stale files ([219b645](https://github.com/blueraai/bluera-knowledge/commit/219b6459e955764645c8edd0c98ff0be2b9c96b8))
+* **mcp:** expand shell variables in PROJECT_ROOT to prevent literal ${PWD} directories ([2ab025f](https://github.com/blueraai/bluera-knowledge/commit/2ab025f42fb8d063cccc3dacfc47ed87d299d634))
+
 ## [0.34.1](https://github.com/blueraai/bluera-knowledge/compare/v0.34.0...v0.34.1) (2026-03-12)
 
 
````

package/hooks/posttooluse-bk-reminder.py
File without changes

package/hooks/posttooluse-web-research.py
File without changes

package/hooks/posttooluse-websearch-bk.py
File without changes
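The `mcp` fix in the 0.34.2 changelog addresses configuration values like `${PWD}` being used verbatim as directory names instead of being expanded. A minimal sketch of that expansion pattern in Python — the function name and config value are hypothetical illustrations, not the plugin's actual code:

```python
import os

def resolve_project_root(raw: str) -> str:
    # Expand $VAR / ${VAR} references before using the value as a path.
    # Without this step, a config value such as "${PWD}/.bluera" is treated
    # as a literal path segment, creating a junk "${PWD}" directory on disk.
    expanded = os.path.expandvars(raw)
    return os.path.abspath(expanded)

os.environ["PWD"] = "/home/user/project"
print(resolve_project_root("${PWD}/.bluera"))  # /home/user/project/.bluera
```

`os.path.expandvars` leaves unknown variables untouched, so callers may still want to validate that no `$` remains after expansion.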
package/package.json
CHANGED
package/scripts/auto-setup.sh
CHANGED

````diff
@@ -6,6 +6,9 @@
 # It exits quickly (0) if already set up, or runs full setup if needed.
 # Non-interactive: cannot prompt for user input (no TTY).
 
+# Drain stdin so the pipe doesn't hang
+cat > /dev/null 2>&1 || true
+
 PLUGIN_ROOT="${CLAUDE_PLUGIN_ROOT:-$(dirname "$(dirname "$0")")}"
 
 # Colors for output
````
package/skills/advanced-workflows/SKILL.md
CHANGED

````diff
@@ -3,271 +3,51 @@ name: advanced-workflows
 description: Multi-tool orchestration patterns for BK operations
 ---
 
-# Advanced
+# Advanced BK Workflows
 
-
+Multi-tool patterns for efficient knowledge retrieval and management.
 
-##
+## Core Patterns
 
-
+### Progressive Library Exploration
 
-### Workflow: Find Relevant Code in Unknown Library
-
-```
-1. list_stores()
-   → See what's indexed, identify target store
-
-2. get_store_info(store)
-   → Get metadata: file paths, size, indexed files
-   → Understand scope before searching
-
-3. search(query, detail='minimal', stores=[target])
-   → Get high-level summaries of relevant code
-   → Review relevance scores (>0.7 = good match)
-
-4. get_full_context(result_ids[top_3])
-   → Deep dive on most relevant results only
-   → Get complete code with full context
-```
-
-**Example:**
-
-User: "How does Vue's computed properties work?"
-
-```
-list_stores()
-→ Found: vue, react, pydantic
-
-get_store_info('vue')
-→ Path: .bluera/bluera-knowledge/repos/vue/
-→ Files: 2,847 indexed
-
-search("computed properties", intent='find-implementation', detail='minimal', stores=['vue'])
-→ Result 1: packages/reactivity/src/computed.ts (score: 0.92)
-→ Result 2: packages/reactivity/__tests__/computed.spec.ts (score: 0.85)
-→ Result 3: packages/runtime-core/src/apiComputed.ts (score: 0.78)
-
-get_full_context(['result_1_id', 'result_2_id'])
-→ Full code for ComputedRefImpl class
-→ Complete API implementation
-
-Now explain with authoritative source code.
-```
-
-## Adding New Library with Job Monitoring
-
-When adding large libraries, monitor indexing progress to know when search is ready:
-
-### Workflow: Add Library and Wait for Index
-
-```
-1. create_store(url_or_path, name)
-   → Returns: job_id
-   → Background indexing starts
-
-2. check_job_status(job_id)
-   → Poll every 10-30 seconds
-   → Status: 'pending' | 'running' | 'completed' | 'failed'
-   → Progress: percentage, current file
-
-3. When status='completed':
-   list_stores()
-   → Verify store appears in list
-
-4. search(query, stores=[new_store], limit=5)
-   → Test search works
-   → Verify indexing quality
-```
-
-**Example:**
-
-```
-create_store('https://github.com/fastapi/fastapi', 'fastapi')
-→ job_id: 'job_abc123'
-→ Status: Indexing started in background
-
-# Poll for completion (typically 30-120 seconds for medium repos)
-check_job_status('job_abc123')
-→ Status: running
-→ Progress: 45% (processing src/fastapi/routing.py)
-
-# ... wait 30 seconds ...
-
-check_job_status('job_abc123')
-→ Status: completed
-→ Indexed: 487 files, 125k lines
-
-# Verify and test
-list_stores()
-→ fastapi: 487 files, vector + FTS indexed
-
-search("dependency injection", stores=['fastapi'], limit=3)
-→ Returns relevant FastAPI DI patterns
-→ Store is ready for use!
-```
-
-## Handling Large Result Sets
-
-When initial search returns many results, use progressive detail to avoid context overload:
-
-### Workflow: Progressive Detail Strategy
-
-```
-1. search(query, detail='minimal', limit=20)
-   → Get summaries only (~100 tokens/result)
-   → Review all 20 summaries quickly
-
-2. Filter by relevance score:
-   - Score > 0.8: Excellent match
-   - Score 0.6-0.8: Good match
-   - Score < 0.6: Possibly irrelevant
-
-3. For top 3-5 results (score > 0.7):
-   get_full_context(selected_ids)
-   → Fetch complete code only for relevant items
-   → Saves ~80% context vs fetching all upfront
-
-4. If nothing relevant:
-   search(refined_query, detail='contextual', limit=10)
-   → Try different query with more context
-   → Or broaden/narrow the search
-```
-
-**Example:**
-
-```
-# Initial broad search
-search("authentication middleware", detail='minimal', limit=20)
-→ 20 results, scores ranging 0.45-0.92
-→ Total context: ~2k tokens (minimal)
-
-# Filter by score
-Top results (>0.7):
-- Result 3: auth/jwt.ts (score: 0.92)
-- Result 7: middleware/authenticate.ts (score: 0.85)
-- Result 12: auth/session.ts (score: 0.74)
-
-# Get full code for top 3 only
-get_full_context(['result_3', 'result_7', 'result_12'])
-→ Complete implementations for relevant files only
-→ Context: ~3k tokens (vs ~15k if we fetched all 20)
-
-# Found what we needed! If not, would refine query and retry.
-```
-
-## Multi-Store Search with Ranking
-
-When searching across multiple stores, use ranking to prioritize results:
-
-### Workflow: Cross-Library Search
-
-```
-1. search(query, limit=10)
-   → Searches ALL stores
-   → Returns mixed results ranked by relevance
-
-2. Review store distribution:
-   - If dominated by one store: might narrow to specific stores
-   - If balanced: good cross-library perspective
-
-3. For specific library focus:
-   search(query, stores=['lib1', 'lib2'], limit=15)
-   → Search only relevant libraries
-   → Get more results from target libraries
 ```
-
-
-
-User: "How do different frameworks handle routing?"
-
+1. list_stores() → identify target store
+2. search(query, detail='minimal', stores=[target]) → scan summaries
+3. get_full_context(top_result_ids) → deep dive on best matches
 ```
-# Search all indexed frameworks
-search("routing implementation", intent='find-implementation', limit=15)
-→ Result mix:
-  - express (score: 0.91)
-  - fastapi (score: 0.89)
-  - hono (score: 0.87)
-  - vue-router (score: 0.82)
-  - ...
-
-# All stores represented, good comparative view!
-
-# If user wants deeper FastAPI focus:
-search("routing implementation", stores=['fastapi', 'starlette'], limit=20)
-→ More FastAPI/Starlette-specific results
-→ Deeper exploration of Python framework routing
-```
-
-## Error Recovery
 
-
-
-### Workflow: Handle Indexing Failures
+### Add Library + Wait for Index
 
 ```
-1. create_store()
-
-
-   - Git auth required (private repo)
-   - Invalid URL/path
-   - Disk space
-   - Network timeout
-
-2. Recovery actions:
-   - Auth issue: Provide credentials or use HTTPS
-   - Invalid path: Verify URL/path exists
-   - Disk space: delete_store() unused stores
-   - Network: Retry with smaller repo or use --shallow
-
-3. Verify recovery:
-   list_stores() → Check store appeared
-   search(test_query, stores=[new_store]) → Verify searchable
+1. create_store(url, name) → job_id
+2. check_job_status(job_id) → poll every 15-30s
+3. When completed: search(query, stores=[name]) → verify
 ```
 
-
+### Progressive Detail Strategy
 
 ```
-
-
-
-
-→ Status: failed
-→ Error: "Authentication required for private repository"
-
-# Recovery: Use authenticated URL or SSH
-create_store('git@github.com:private/repo.git', 'my-repo')
-→ job_id: 'job_xyz2'
-
-check_job_status('job_xyz2')
-→ Status: completed
-→ Success!
+1. search(query, detail='minimal', limit=20) → ~100 tokens/result
+2. Filter by score (>0.7 = strong match)
+3. get_full_context(top_3_ids) → saves ~80% context
+4. If nothing relevant: refine query and retry
 ```
 
-
-
-Real-world usage often combines these patterns:
+### Cross-Library Comparison
 
 ```
-
-
-
-2. If not: create_store() for missing framework(s)
-3. check_job_status() → wait for indexing
-4. search("middleware implementation", stores=['express', 'hono'], detail='minimal')
-5. Review summaries, identify key files
-6. get_full_context() for 2-3 most relevant from each framework
-7. Compare implementations with full context
+1. search(query, limit=10) → searches ALL stores
+2. Review store distribution
+3. Narrow: search(query, stores=['lib1', 'lib2'], limit=15)
 ```
 
-This multi-step workflow is efficient, targeted, and conserves context.
-
 ## Best Practices
 
-1. **
-2. **Monitor background jobs**
-3. **Use intent parameter**
-4. **Filter by stores
-5. **Check
-6. **Progressive refinement** - Broad search → filter → narrow → full context
+1. **Start with detail='minimal'** — summaries first, full context selectively
+2. **Monitor background jobs** — don't search until indexing completes
+3. **Use intent parameter** — 'find-implementation' vs 'find-usage' vs 'find-documentation'
+4. **Filter by stores** — faster, more focused results
+5. **Check scores** — >0.7 strong match, <0.5 likely noise
 
-
+Detailed examples and error recovery: [references/examples.md](references/examples.md)
````
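The "Add Library + Wait for Index" pattern in the hunk above (create a store, poll `check_job_status` until completion) can be sketched as a generic polling loop. This is an illustration only: `check_job_status` is passed in as a caller-supplied callable standing in for the MCP tool of the same name, and the status values mirror those listed in the removed SKILL text:

```python
import time

def wait_for_index(check_job_status, job_id: str,
                   interval_s: float = 15.0, timeout_s: float = 600.0) -> dict:
    """Poll a background indexing job until it finishes or times out.

    check_job_status: callable returning a dict with a 'status' key, one of
    'pending' | 'running' | 'completed' | 'failed' (a stand-in for the tool).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = check_job_status(job_id)
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(job.get("error", "indexing failed"))
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")
```

Raising on `failed` rather than returning keeps the caller's happy path simple and routes auth or network errors into the Error Recovery workflow.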
package/skills/advanced-workflows/references/examples.md
ADDED

````diff
@@ -0,0 +1,86 @@
+# Workflow Examples
+
+## Progressive Library Exploration
+
+User: "How does Vue's computed properties work?"
+
+```
+list_stores()
+→ Found: vue, react, pydantic
+
+get_store_info('vue')
+→ Path: .bluera/bluera-knowledge/repos/vue/
+→ Files: 2,847 indexed
+
+search("computed properties", intent='find-implementation', detail='minimal', stores=['vue'])
+→ Result 1: packages/reactivity/src/computed.ts (score: 0.92)
+→ Result 2: packages/reactivity/__tests__/computed.spec.ts (score: 0.85)
+→ Result 3: packages/runtime-core/src/apiComputed.ts (score: 0.78)
+
+get_full_context(['result_1_id', 'result_2_id'])
+→ Full code for ComputedRefImpl class
+→ Complete API implementation
+```
+
+## Add Library + Monitor
+
+```
+create_store('https://github.com/fastapi/fastapi', 'fastapi')
+→ job_id: 'job_abc123'
+
+check_job_status('job_abc123')
+→ Status: running, Progress: 45%
+
+# ... wait 30 seconds ...
+
+check_job_status('job_abc123')
+→ Status: completed, Indexed: 487 files
+
+search("dependency injection", stores=['fastapi'], limit=3)
+→ Returns relevant FastAPI DI patterns
+```
+
+## Progressive Detail Strategy
+
+```
+# Initial broad search
+search("authentication middleware", detail='minimal', limit=20)
+→ 20 results, scores 0.45-0.92, ~2k tokens total
+
+# Filter by score (>0.7):
+- auth/jwt.ts (score: 0.92)
+- middleware/authenticate.ts (score: 0.85)
+- auth/session.ts (score: 0.74)
+
+# Get full code for top 3 only
+get_full_context(['result_3', 'result_7', 'result_12'])
+→ ~3k tokens (vs ~15k if fetched all 20)
+```
+
+## Error Recovery
+
+```
+create_store('https://github.com/private/repo', 'my-repo')
+→ job_id: 'job_xyz'
+
+check_job_status('job_xyz')
+→ Status: failed, Error: "Authentication required"
+
+# Recovery: Use SSH
+create_store('git@github.com:private/repo.git', 'my-repo')
+→ Status: completed
+```
+
+## Combining Workflows
+
+User: "Compare Express and Hono middleware"
+
+```
+1. list_stores() → check if both indexed
+2. If not: create_store() for missing
+3. check_job_status() → wait for indexing
+4. search("middleware implementation", stores=['express', 'hono'], detail='minimal')
+5. Review summaries, identify key files
+6. get_full_context() for 2-3 most relevant from each
+7. Compare implementations with full context
+```
````
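The "Progressive Detail Strategy" example in the new reference file above filters minimal-detail results by score and fetches full context for only the top few. That selection step can be sketched concretely — the result-dict shape (`id`/`score` keys) is an assumption for illustration, not the tool's documented schema:

```python
def select_for_full_context(results, threshold=0.7, top_n=3):
    # results: list of {'id': ..., 'score': ...} dicts from a minimal-detail
    # search. Keep only strong matches (score above threshold), best first,
    # and cap at top_n so get_full_context fetches a handful of items
    # instead of all twenty.
    strong = [r for r in results if r["score"] > threshold]
    strong.sort(key=lambda r: r["score"], reverse=True)
    return [r["id"] for r in strong[:top_n]]
```

With the example's scores (0.92, 0.85, 0.74, plus weaker hits), this returns exactly the three IDs handed to `get_full_context`.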
package/skills/eval/SKILL.md
CHANGED

````diff
@@ -7,216 +7,29 @@ context: fork
 
 # Agent Quality Evaluation
 
-Compare how well Claude answers library questions across three access levels
+Compare how well Claude answers library questions across three access levels:
 
-
-- **
-- **BK
-- **BK Full** — uses BK vector search + get_full_context + Grep/Read (all BK tools)
-
-Then score all three answers on accuracy, specificity, completeness, and source grounding.
+- **Without BK** — web search + training knowledge only
+- **BK Grep** — Grep/Read/Glob on cloned repos, no vector search
+- **BK Full** — vector search + get_full_context + Grep/Read
 
 ## Arguments
 
 Parse `$ARGUMENTS`:
 
-- **No arguments
-- **Quoted string
-- **`--predefined`**: Run all predefined queries
-- **`--predefined N`**: Run predefined query #N only
-
-If no arguments provided, show:
-```
-Usage:
-  /bluera-knowledge:eval "How does Express handle errors?"  # Arbitrary query
-  /bluera-knowledge:eval --predefined                       # Run all predefined queries
-  /bluera-knowledge:eval --predefined 3                     # Run predefined query #3
-```
-
-## Step 1: Prerequisites Check
-
-1. Call MCP `execute` with `{ command: "stores" }` to list indexed stores
-2. If no stores are indexed, show error and abort:
-   ```
-   No knowledge stores indexed. Add at least one library first:
-   /bluera-knowledge:add-repo https://github.com/expressjs/express --name express
-   ```
-3. Record the list of available store names — you'll pass these to the BK Full agent
-4. Build a `STORE_PATHS` mapping from the store response: for each store with a `path` field, record `- **<name>**: \`<path>\`` (one per line, as a markdown list). This gets passed to the BK Grep agent.
-
-## Step 2: Resolve Queries
-
-### Predefined mode (`--predefined`)
-
-1. Read the predefined queries file: `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml`
-2. Parse the YAML content
-3. For each query, check if ANY of its `store_hint` values match an available store name
-4. Split into **runnable** (store available) and **skipped** (store not available) lists
-5. If `--predefined N` was specified, select only query at index N from the full list (skip if store not available)
-6. If no queries are runnable, show what stores to add and abort
-
-### Arbitrary mode (bare query string)
-
-1. Use the raw query string as the question
-2. Set `expected_topics` and `anti_patterns` to empty lists
-3. Set `id` to "arbitrary", `category` to "general", `difficulty` to "unknown"
-
-## Step 3: Load Templates
-
-Read these files from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`:
-
-1. `without-bk-agent.md` — instructions for the baseline agent
-2. `bk-grep-agent.md` — instructions for the BK Grep agent
-3. `with-bk-agent.md` — instructions for the BK Full agent
-4. `judge.md` — grading rubric
-
-## Step 4: Run Eval (for each query)
-
-### Spawn ALL THREE agents in parallel (same turn, three Task tool calls)
-
-**Without-BK agent** — Use the Task tool with `subagent_type: "general-purpose"`:
-- Take the content from `without-bk-agent.md`
-- Replace `{{QUESTION}}` with the actual question
-- Send as the task prompt
-
-**BK Grep agent** — Use the Task tool with `subagent_type: "general-purpose"`:
-- Take the content from `bk-grep-agent.md`
-- Replace `{{QUESTION}}` with the actual question
-- Replace `{{STORE_PATHS}}` with the store name-to-path mapping built in Step 1
-- Send as the task prompt
-
-**BK Full agent** — Use the Task tool with `subagent_type: "general-purpose"`:
-- Take the content from `with-bk-agent.md`
-- Replace `{{QUESTION}}` with the actual question
-- Replace `{{STORES}}` with the list of available store names (one per line, as a markdown list)
-- Send as the task prompt
-
-Wait for all three agents to complete.
-
-### Capture Token Usage
-
-From each Task tool response, parse the `<usage>` block to extract:
-- `total_tokens` — the total tokens consumed by the agent
-- `duration_ms` — wall-clock time for the agent
-
-If usage data is not available in a Task response, show "N/A" for that agent.
-
-### Judge the results
-
-Using the rubric from `judge.md`, evaluate all three answers yourself:
-
-1. Read all three agent responses
-2. For each answer, score all 4 criteria (1-5):
-   - **Factual Accuracy**: Are the claims correct?
-   - **Specificity**: Does it cite specific files, functions, code?
-   - **Completeness**: Does it cover the full answer?
-   - **Source Grounding**: Are claims backed by evidence?
-3. If the query has `expected_topics`, check which answers mention each topic
-4. If the query has `anti_patterns`, flag if any answer makes those claims
-5. Calculate totals (max 20 each), determine winner and deltas
-
-## Step 5: Output Results
-
-### Single query output (arbitrary or `--predefined N`)
-
-Show the full comparison:
-
-```
-## Eval: "<question>"
-
-| Criterion | Without BK | BK Grep | BK Full |
-|-------------------|:----------:|:-------:|:-------:|
-| Accuracy | X | X | X |
-| Specificity | X | X | X |
-| Completeness | X | X | X |
-| Source Grounding | X | X | X |
-| **Total** | **X** | **X** | **X** |
-
-| Usage | Without BK | BK Grep | BK Full |
-|-------------------|:----------:|:-------:|:-------:|
-| Tokens | X,XXX | X,XXX | X,XXX |
-| Duration (s) | X.X | X.X | X.X |
-
-**Winner:** [BK Full | BK Grep | Without BK | Tie] ([significant | marginal | none])
-**Key Difference:** [One sentence explaining the most important quality gap]
-**Grep vs Full:** [One sentence on whether vector search outperformed manual grep, and if so how]
-```
-
-If expected topics were provided:
-```
-### Expected Topics
-- [x] topic covered by all three
-- [x] topic covered by BK Full + BK Grep only
-- [x] topic covered by BK Full only
-- [ ] topic missed by all
-```
-
-### Multi-query output (`--predefined`)
-
-Show a summary row per query, then aggregate:
-
-```
-## Agent Quality Eval Summary
-
-Ran X/8 queries (Y skipped — stores not indexed)
-
-| # | Query | Difficulty | w/o BK | Grep | Full | Winner | Delta |
-|---|-------|:----------:|:------:|:----:|:----:|--------|-------|
-| 1 | query-id | medium | 9/20 | 15/20 | 19/20 | Full | significant |
-| 2 | query-id | easy | 14/20 | 17/20 | 18/20 | Full | marginal |
-| ... |
-
-### Token Usage
-
-| # | Query | w/o BK tokens | Grep tokens | Full tokens |
-|---|-------|:-------------:|:-----------:|:-----------:|
-| 1 | query-id | 2,340 | 8,120 | 5,670 |
-| 2 | query-id | 1,890 | 6,450 | 4,230 |
-| ... |
-
-### Aggregate
-- **Without BK mean:** X.X/20 (avg X,XXX tokens)
-- **BK Grep mean:** X.X/20 (avg X,XXX tokens)
-- **BK Full mean:** X.X/20 (avg X,XXX tokens)
-- **Full vs Without:** +X.X points (+XX%)
-- **Full vs Grep:** +X.X points (+XX%)
-- **Grep vs Without:** +X.X points (+XX%)
-- **Full win rate:** X/X (XX%)
-- **Significant wins (Full):** X
-
-### By Category
-| Category | w/o BK | Grep | Full | Full delta |
-|----------|:------:|:----:|:----:|------------|
-| implementation | X.X | X.X | X.X | +X.X |
-| api | X.X | X.X | X.X | +X.X |
-
-### By Difficulty
-| Difficulty | w/o BK | Grep | Full | Full delta |
-|------------|:------:|:----:|:----:|------------|
-| easy | X.X | X.X | X.X | +X.X |
-| medium | X.X | X.X | X.X | +X.X |
-| hard | X.X | X.X | X.X | +X.X |
+- **No arguments**: Show usage help
+- **Quoted string**: Run eval for that single question
+- **`--predefined`**: Run all predefined queries
+- **`--predefined N`**: Run predefined query #N only
 
-
-| Agent | Mean Score | Mean Tokens | Score/1K Tokens |
-|-------|:----------:|:-----------:|:---------------:|
-| Without BK | X.X | X,XXX | X.XX |
-| BK Grep | X.X | X,XXX | X.XX |
-| BK Full | X.X | X,XXX | X.XX |
-```
+## Workflow
 
-
-
-
-
-
-```
+1. **Prerequisites**: Call `execute` with `{ command: "stores" }` to list stores. Abort if none.
+2. **Resolve queries**: Load from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml` or use arbitrary query.
+3. **Load templates**: Read agent prompts + judge rubric from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/`
+4. **Spawn 3 agents in parallel** per query (replace `{{QUESTION}}`, `{{STORES}}`, `{{STORE_PATHS}}`)
+5. **Judge**: Score all 4 criteria (1-5): Accuracy, Specificity, Completeness, Source Grounding
 
-
+Detailed procedures: [references/procedures.md](references/procedures.md)
 
-
-- The without-BK agent may use WebSearch — this is intentional. We're comparing against "the best Claude can do without BK."
-- The BK Grep agent may NOT use WebSearch. It tests what an agent can discover by exploring raw source code, to isolate the value of vector search.
-- Scoring is somewhat subjective. The value is in the comparison (relative scores) rather than absolute numbers. Look at the delta and key differences.
-- The Token Efficiency table reveals cost-effectiveness: if BK Grep achieves similar scores to BK Full with fewer tokens, it suggests vector search isn't adding much for that query type.
-- For arbitrary queries without expected topics, grading relies entirely on the 4 general criteria. This is fine — it still reveals whether BK adds value.
+Output format: [references/output-format.md](references/output-format.md)
````