bluera-knowledge 0.13.0 → 0.13.2
- package/.claude/rules/code-quality.md +12 -0
- package/.claude/rules/git.md +5 -0
- package/.claude/rules/versioning.md +7 -0
- package/.claude-plugin/plugin.json +2 -15
- package/.mcp.json +11 -0
- package/CHANGELOG.md +7 -0
- package/CLAUDE.md +5 -13
- package/CONTRIBUTING.md +307 -0
- package/README.md +58 -1167
- package/commands/crawl.md +2 -1
- package/commands/test-plugin.md +197 -72
- package/docs/claude-code-best-practices.md +458 -0
- package/docs/cli.md +170 -0
- package/docs/commands.md +392 -0
- package/docs/crawler-architecture.md +89 -0
- package/docs/mcp-integration.md +130 -0
- package/docs/token-efficiency.md +91 -0
- package/eslint.config.js +1 -1
- package/hooks/check-dependencies.sh +18 -1
- package/hooks/hooks.json +2 -2
- package/hooks/posttooluse-bk-reminder.py +30 -2
- package/package.json +1 -1
- package/scripts/test-mcp-dev.js +260 -0
- package/src/mcp/plugin-mcp-config.test.ts +26 -19
- package/tests/integration/cli-consistency.test.ts +3 -2
- package/docs/plans/2024-12-17-ai-search-quality-implementation.md +0 -752
- package/docs/plans/2024-12-17-ai-search-quality-testing-design.md +0 -201
- package/docs/plans/2025-12-16-bluera-knowledge-cli.md +0 -2951
- package/docs/plans/2025-12-16-phase2-features.md +0 -1518
- package/docs/plans/2025-12-17-hil-implementation.md +0 -926
- package/docs/plans/2025-12-17-hil-quality-testing.md +0 -224
- package/docs/plans/2025-12-17-search-quality-phase1-implementation.md +0 -1416
- package/docs/plans/2025-12-17-search-quality-testing-v2-design.md +0 -212
- package/docs/plans/2025-12-28-ai-agent-optimization.md +0 -1630
@@ -1,224 +0,0 @@
-# Human-in-the-Loop Quality Testing
-
-## Overview
-
-Add optional human-in-the-loop (HIL) capabilities to the search quality testing system. Enables human visibility, judgment, and feedback alongside AI evaluation.
-
-## Commands
-
-```
-test:quality [--quiet|--silent] [--set <name|all>]    Run tests (verbose by default)
-test:quality:queries --list                           List query sets
-test:quality:queries --review --set <name|all>        Edit existing queries
-test:quality:generate [--set <seed>]                  Generate new queries with HIL
-test:quality:review --list                            List test runs
-test:quality:review <run-id>                          Review run results with HIL
-```
-
-### Global `--set` Convention
-
-Applies to all commands accepting `--set`:
-- `--set core` - specific set
-- `--set all` - combine all curated sets
-- `--set generated/*` - all generated sets
-- Omit flag → defaults to `core`
-
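The `--set` convention in the removed plan can be sketched in TypeScript as follows. This is a minimal illustration, not the package's implementation: `resolveSets`, `QuerySetInfo`, and the `kind` field are hypothetical names, the sample set list mirrors the plan's own `--list` example, and throwing on an unknown set name is an assumption.

```typescript
// Hypothetical sketch of the `--set` rules; none of these names come from the package.
type QuerySetKind = "curated" | "ai-generated";

interface QuerySetInfo {
  name: string; // e.g. "core" or "generated/2025-12-17"
  kind: QuerySetKind;
}

function resolveSets(flag: string | undefined, available: QuerySetInfo[]): string[] {
  const value = flag ?? "core"; // omitted flag defaults to `core`

  if (value === "all") {
    // `--set all` combines all curated sets
    return available.filter((s) => s.kind === "curated").map((s) => s.name);
  }
  if (value === "generated/*") {
    // `--set generated/*` selects every generated set
    return available.filter((s) => s.name.indexOf("generated/") === 0).map((s) => s.name);
  }

  // Otherwise the value names one specific set.
  const matches = available.filter((s) => s.name === value);
  if (matches.length === 0) {
    throw new Error(`Unknown query set: ${value}`); // assumed error behaviour
  }
  return [value];
}

// Sample data taken from the plan's example listing.
const available: QuerySetInfo[] = [
  { name: "core", kind: "curated" },
  { name: "hono-focus", kind: "curated" },
  { name: "generated/2025-12-17", kind: "ai-generated" },
];
```

Under these assumptions, `resolveSets(undefined, available)` yields only `core`, while `resolveSets("all", available)` yields the two curated sets and leaves the generated one out.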
----
-
-## 1. Query Management
-
-### List Query Sets
-
-`npm run test:quality:queries -- --list`
-
-```
-Available query sets:
-  core                  15 queries  (curated, v1.0.0)
-  hono-focus             8 queries  (curated, v1.0.0)
-  generated/2025-12-17  12 queries  (ai-generated)
-```
-
-### Generate New Queries
-
-`npm run test:quality:generate`
-
-1. AI proposes 10-15 queries as batch
-2. Display structured list with categories and intents
-3. Prompt: `Actions: [a]ccept all, [e]dit list, [r]egenerate, [q]uit`
-4. Edit mode: `drop 3,7,12`, `edit 5`, `add`
-5. AI fills gaps if needed
-6. Saves to `tests/fixtures/queries/<name>.json`
-
-Optional `--set <existing>` seeds from existing set.
-
-### Review Existing Queries
-
-`npm run test:quality:queries -- --review --set core`
-
-Loads existing query set into same HIL edit loop. When reviewing `--set all`, shows combined list with source labels. Saves back to original files.
-
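The edit-mode commands listed in step 4 of the removed plan (`drop 3,7,12`, `edit 5`, `add`) could be parsed roughly as below. This is a sketch under stated assumptions: `EditCommand`, `parseEditCommand`, and `applyDrop` are hypothetical names, and the indices are assumed to be 1-based positions in the displayed list, which the plan does not state.

```typescript
// Hypothetical parser for the plan's edit-mode commands; not the package's code.
type EditCommand =
  | { kind: "drop"; indices: number[] }
  | { kind: "edit"; index: number }
  | { kind: "add" };

function parseEditCommand(input: string): EditCommand | null {
  const trimmed = input.trim();
  if (trimmed === "add") return { kind: "add" };

  // "drop 3,7,12": a comma-separated list of positions
  const drop = /^drop\s+(\d+(?:\s*,\s*\d+)*)$/.exec(trimmed);
  if (drop) {
    const indices = (drop[1] ?? "").split(",").map((part) => Number(part.trim()));
    return { kind: "drop", indices };
  }

  // "edit 5": a single position
  const edit = /^edit\s+(\d+)$/.exec(trimmed);
  if (edit) return { kind: "edit", index: Number(edit[1]) };

  return null; // unrecognised input; the caller can re-prompt
}

// Remove the listed positions, assuming they are 1-based as displayed to the reviewer.
function applyDrop<T>(items: T[], indices: number[]): T[] {
  return items.filter((_, i) => indices.indexOf(i + 1) === -1);
}
```

Returning `null` for unrecognised input keeps the interactive loop in control of re-prompting, instead of throwing from inside the parser.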
----
-
-## 2. Test Run Output
-
-### Default (Verbose)
-
-`npm run test:quality`
-
-```
-[1/15] "zod schema validation"
-  → 1. [0.82] /fixtures/zod/validation.md
-       "Zod provides a parse() method that validates and returns typed data..."
-  → 2. [0.71] /fixtures/zod/schemas.md
-       "Define schemas using z.object(), z.string(), z.number()..."
-  ...
-  ✓ AI: relevance=0.80 ranking=0.85 coverage=0.70 snippet=0.68 overall=0.76
-
-[2/15] "express middleware error handling"
-  ...
-```
-
-### Quiet Mode
-
-`npm run test:quality -- --quiet`
-
-```
-[1/15] "zod schema validation" - overall: 0.76
-[2/15] "express middleware error handling" - overall: 0.55
-```
-
-### Silent Mode (CI)
-
-`npm run test:quality -- --silent`
-
-```
-📈 Average overall score: 0.43
-✅ No regressions detected
-```
-
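One way to read the three output modes above is as a single verbosity setting that controls what each query prints. The sketch below is illustrative only: `Verbosity`, `resolveVerbosity`, and `formatQueryLine` are hypothetical names, and letting `--silent` win when both flags are passed is an assumption, since the plan shows the flags as alternatives.

```typescript
// Hypothetical sketch of the verbose/quiet/silent modes; not the package's code.
type Verbosity = "verbose" | "quiet" | "silent";

function resolveVerbosity(args: string[]): Verbosity {
  if (args.indexOf("--silent") !== -1) return "silent"; // assumed precedence
  if (args.indexOf("--quiet") !== -1) return "quiet";
  return "verbose"; // verbose by default
}

// Returns the per-query header line for a mode, or null when nothing is printed.
function formatQueryLine(
  index: number,
  total: number,
  query: string,
  overall: number,
  mode: Verbosity
): string | null {
  if (mode === "silent") return null; // silent mode prints only the run summary
  const header = `[${index}/${total}] "${query}"`;
  // Quiet mode appends the overall score; verbose mode prints results and scores on later lines.
  return mode === "quiet" ? `${header} - overall: ${overall.toFixed(2)}` : header;
}
```

With these definitions, `formatQueryLine(1, 15, "zod schema validation", 0.76, "quiet")` reproduces the first line of the quiet-mode example.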
----
-
-## 3. Post-Run Review
-
-### List Runs
-
-`npm run test:quality:review -- --list`
-
-```
-Recent test runs:
-  2025-12-17T22-32-30  core        15 queries  overall=0.43  (no HIL review)
-  2025-12-17T18-15-22  core        15 queries  overall=0.37  (reviewed)
-  2025-12-16T14-20-11  hono-focus   8 queries  overall=0.31  (no HIL review)
-```
-
-### Review Session
-
-`npm run test:quality:review -- 2025-12-17T22-32-30`
-
-```
-Reviewing run: 2025-12-17T22-32-30 (15 queries, overall=0.43)
-
-[1/15] "zod schema validation"
-  AI overall: 0.76
-
-  Results returned:
-  → 1. [0.82] /fixtures/zod/validation.md
-       "Zod provides a parse() method that validates..."
-  → 2. [0.71] /fixtures/zod/schemas.md
-       "Define schemas using z.object()..."
-  ...
-
-  How did the search do?
-  [g]ood  [o]kay  [p]oor  [t]errible  [n]ote only  [enter] skip
-```
-
-### Human Judgment Scale
-
-| Judgment | Score | Meaning |
-|----------|-------|---------|
-| good | 1.0 | Results nail it - exactly what I'd want |
-| okay | 0.7 | Decent results, room for improvement |
-| poor | 0.4 | Missing key results or bad ranking |
-| terrible | 0.1 | Completely wrong or useless |
-
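The judgment scale above maps directly onto a lookup table. The following sketch assumes hypothetical names (`Judgment`, `JUDGMENT_SCORES`, `judgmentFromKey`); the scores and key letters come from the plan's table and prompt, while treating `n` and Enter as "no score" is an interpretation of "note only" and "skip".

```typescript
// Hypothetical encoding of the plan's judgment scale; not the package's code.
type Judgment = "good" | "okay" | "poor" | "terrible";

// Scores copied from the Human Judgment Scale table.
const JUDGMENT_SCORES: Record<Judgment, number> = {
  good: 1.0,
  okay: 0.7,
  poor: 0.4,
  terrible: 0.1,
};

// Map a key press from the review prompt to a judgment.
function judgmentFromKey(key: string): Judgment | null {
  switch (key.trim().toLowerCase()) {
    case "g":
      return "good";
    case "o":
      return "okay";
    case "p":
      return "poor";
    case "t":
      return "terrible";
    default:
      return null; // "n" (note only) and Enter (skip) are assumed to carry no score
  }
}
```

Typing the table as `Record<Judgment, number>` means the compiler flags any judgment that is added to the union without a matching score.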
----
-
-## 4. HIL Data Storage
-
-All HIL data stored inline in existing JSONL results file.
-
-### Per-Query HIL Data
-
-```json
-{
-  "type": "query_evaluation",
-  "data": {
-    "query": { "query": "zod schema validation", "intent": "..." },
-    "evaluation": { "scores": { "overall": 0.76 } },
-    "hil": {
-      "reviewed": true,
-      "judgment": "okay",
-      "humanScore": 0.7,
-      "note": "Good top result, but missing error_handling.md",
-      "reviewedAt": "2025-12-17T23:45:00Z"
-    }
-  }
-}
-```
-
-### Run Summary HIL Section
-
-```json
-{
-  "type": "run_summary",
-  "data": {
-    "averageScores": { "overall": 0.43 },
-    "hilReview": {
-      "reviewedAt": "2025-12-17T23:50:00Z",
-      "queriesReviewed": 12,
-      "queriesSkipped": 3,
-      "queriesFlagged": 1,
-      "humanAverageScore": 0.51,
-      "aiVsHumanDelta": -0.08,
-      "synthesis": "AI consistently underscores coverage. Snippet quality ratings align well.",
-      "actionItems": [
-        "Improve coverage for validation-related queries",
-        "Refine 'hono vs express' query or remove"
-      ]
-    }
-  }
-}
-```
-
-Synthesis and action items generated by AI based on human feedback.
-
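The numeric fields of the run summary can be derived from the per-query records. The sketch below is an assumption-laden illustration: the interface shapes are inferred from the JSON examples, `EvaluatedQuery`, `HilSummary`, and `summarizeHil` are hypothetical names, and the delta is taken as the AI average over all queries minus the human average, which matches the example figures (0.43 − 0.51 = −0.08) but is not stated in the plan. `queriesFlagged` is left out because the plan does not define it.

```typescript
// Hypothetical shapes inferred from the plan's JSON examples; not the package's types.
interface HilQueryData {
  reviewed: boolean;
  judgment?: "good" | "okay" | "poor" | "terrible";
  humanScore?: number;
  note?: string;
  reviewedAt?: string;
}

interface EvaluatedQuery {
  aiOverall: number; // evaluation.scores.overall
  hil?: HilQueryData;
}

interface HilSummary {
  queriesReviewed: number;
  queriesSkipped: number;
  humanAverageScore: number | null;
  aiVsHumanDelta: number | null;
}

function round2(x: number): number {
  return Math.round(x * 100) / 100;
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function summarizeHil(queries: EvaluatedQuery[]): HilSummary {
  const reviewed = queries.filter((q) => q.hil !== undefined && q.hil.reviewed);

  // Only reviews that carry a numeric score feed the human average.
  const humanScores: number[] = [];
  for (const q of reviewed) {
    if (q.hil !== undefined && typeof q.hil.humanScore === "number") {
      humanScores.push(q.hil.humanScore);
    }
  }

  const humanAverage = humanScores.length > 0 ? mean(humanScores) : null;
  const aiAverage = queries.length > 0 ? mean(queries.map((q) => q.aiOverall)) : null;

  return {
    queriesReviewed: reviewed.length,
    queriesSkipped: queries.length - reviewed.length,
    humanAverageScore: humanAverage === null ? null : round2(humanAverage),
    // Assumed sign convention: AI average minus human average.
    aiVsHumanDelta:
      humanAverage === null || aiAverage === null ? null : round2(aiAverage - humanAverage),
  };
}
```

Returning `null` when no query has a human score avoids reporting a misleading average of zero for runs that were never reviewed.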
----
-
-## 5. Implementation
-
-### New Files
-
-| File | Purpose |
-|------|---------|
-| `tests/scripts/quality-queries.ts` | Query generation & editing HIL loop |
-| `tests/scripts/quality-review.ts` | Post-run review HIL loop |
-| `tests/scripts/quality-shared.ts` | Shared utilities (list sets, prompt helpers, scoring) |
-
-### Modified Files
-
-| File | Changes |
-|------|---------|
-| `tests/scripts/search-quality.ts` | Default verbose output, add `--quiet`/`--silent`, `--set all` |
-| `tests/scripts/search-quality.types.ts` | Add `HilReview`, `HilQueryData` types |
-| `package.json` | Add new npm scripts |
-
-### New npm Scripts
-
-```json
-{
-  "test:quality": "npx tsx tests/scripts/search-quality.ts",
-  "test:quality:generate": "npx tsx tests/scripts/quality-queries.ts --generate",
-  "test:quality:queries": "npx tsx tests/scripts/quality-queries.ts",
-  "test:quality:review": "npx tsx tests/scripts/quality-review.ts"
-}
-```