bluera-knowledge 0.13.0 → 0.13.1

@@ -1,224 +0,0 @@
1
- # Human-in-the-Loop Quality Testing
2
-
3
- ## Overview
4
-
5
- Add optional human-in-the-loop (HIL) capabilities to the search quality testing system. These let a human inspect results, record judgments, and give feedback alongside the AI evaluation.
6
-
7
- ## Commands
8
-
9
- ```
10
- test:quality [--quiet|--silent] [--set <name|all>] Run tests (verbose by default)
11
- test:quality:queries --list List query sets
12
- test:quality:queries --review --set <name|all> Edit existing queries
13
- test:quality:generate [--set <seed>] Generate new queries with HIL
14
- test:quality:review --list List test runs
15
- test:quality:review <run-id> Review run results with HIL
16
- ```
17
-
18
- ### Global `--set` Convention
19
-
20
- Applies to all commands accepting `--set`:
21
- - `--set core` - specific set
22
- - `--set all` - combine all curated sets
23
- - `--set generated/*` - all generated sets
24
- - Omit flag → defaults to `core`
25
-
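The `--set` convention above can be sketched as a small resolver. This is a minimal illustration, not the project's implementation: `QuerySetInfo`, `resolveSets`, and the `source` field are hypothetical names, and the split between curated and generated sets is inferred from the list above.

```typescript
// Hypothetical shape for a discovered query set (not taken from the project).
interface QuerySetInfo {
  name: string; // e.g. "core", "hono-focus", "generated/2025-12-17"
  source: "curated" | "ai-generated";
}

// Resolve a --set value to the list of set names it refers to.
function resolveSets(spec: string | undefined, available: QuerySetInfo[]): string[] {
  const value = spec ?? "core"; // omitting the flag defaults to `core`
  if (value === "all") {
    // `all` combines the curated sets only
    return available.filter((s) => s.source === "curated").map((s) => s.name);
  }
  if (value === "generated/*") {
    return available.filter((s) => s.name.startsWith("generated/")).map((s) => s.name);
  }
  const match = available.find((s) => s.name === value);
  if (!match) throw new Error(`Unknown query set: ${value}`);
  return [match.name];
}
```

Throwing on an unknown name is an assumption; the original does not say how an invalid `--set` value is handled.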
26
- ---
27
-
28
- ## 1. Query Management
29
-
30
- ### List Query Sets
31
-
32
- `npm run test:quality:queries -- --list`
33
-
34
- ```
35
- Available query sets:
36
- core 15 queries (curated, v1.0.0)
37
- hono-focus 8 queries (curated, v1.0.0)
38
- generated/2025-12-17 12 queries (ai-generated)
39
- ```
40
-
41
- ### Generate New Queries
42
-
43
- `npm run test:quality:generate`
44
-
45
- 1. AI proposes 10-15 queries as a batch
46
- 2. Display structured list with categories and intents
47
- 3. Prompt: `Actions: [a]ccept all, [e]dit list, [r]egenerate, [q]uit`
48
- 4. Edit mode: `drop 3,7,12`, `edit 5`, `add`
49
- 5. AI fills gaps if needed
50
- 6. Saves to `tests/fixtures/queries/<name>.json`
51
-
52
- The optional `--set <existing>` flag seeds generation from an existing set.
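The edit-mode commands in step 4 (`drop 3,7,12`, `edit 5`, `add`) could be parsed along these lines. This is a sketch under assumptions: `EditCommand`, `parseEditCommand`, and `applyDrop` are hypothetical names, and treating the numbers as 1-based positions in the displayed list is a guess based on the examples.

```typescript
// Hypothetical command model for the edit loop.
type EditCommand =
  | { kind: "drop"; indices: number[] }
  | { kind: "edit"; index: number }
  | { kind: "add" };

// Accept only positive integers such as "3" or "12".
const toIndex = (s: string): number | null => (/^[1-9]\d*$/.test(s) ? Number(s) : null);

// Returns null for anything that is not a recognized command.
function parseEditCommand(input: string): EditCommand | null {
  const [verb, ...rest] = input.trim().split(/\s+/);
  const arg = rest.join(""); // tolerate "drop 3, 7" as well as "drop 3,7"
  if (verb === "drop") {
    const indices: number[] = [];
    for (const part of arg.split(",")) {
      const index = toIndex(part);
      if (index === null) return null;
      indices.push(index);
    }
    return { kind: "drop", indices };
  }
  if (verb === "edit") {
    const index = toIndex(arg);
    return index === null ? null : { kind: "edit", index };
  }
  if (verb === "add" && rest.length === 0) return { kind: "add" };
  return null;
}

// Remove the items at the given 1-based positions.
function applyDrop<T>(items: T[], indices: number[]): T[] {
  const drop = new Set(indices);
  return items.filter((_, i) => !drop.has(i + 1));
}
```

For example, `parseEditCommand("drop 3,7,12")` yields a `drop` command whose indices can be passed to `applyDrop` together with the proposed query list.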
53
-
54
- ### Review Existing Queries
55
-
56
- `npm run test:quality:queries -- --review --set core`
57
-
58
- Loads the existing query set into the same HIL edit loop. When reviewing `--set all`, it shows a combined list with source labels. Changes are saved back to the original files.
59
-
60
- ---
61
-
62
- ## 2. Test Run Output
63
-
64
- ### Default (Verbose)
65
-
66
- `npm run test:quality`
67
-
68
- ```
69
- [1/15] "zod schema validation"
70
- → 1. [0.82] /fixtures/zod/validation.md
71
- "Zod provides a parse() method that validates and returns typed data..."
72
- → 2. [0.71] /fixtures/zod/schemas.md
73
- "Define schemas using z.object(), z.string(), z.number()..."
74
- ...
75
- ✓ AI: relevance=0.80 ranking=0.85 coverage=0.70 snippet=0.68 overall=0.76
76
-
77
- [2/15] "express middleware error handling"
78
- ...
79
- ```
80
-
81
- ### Quiet Mode
82
-
83
- `npm run test:quality -- --quiet`
84
-
85
- ```
86
- [1/15] "zod schema validation" - overall: 0.76
87
- [2/15] "express middleware error handling" - overall: 0.55
88
- ```
89
-
90
- ### Silent Mode (CI)
91
-
92
- `npm run test:quality -- --silent`
93
-
94
- ```
95
- 📈 Average overall score: 0.43
96
- ✅ No regressions detected
97
- ```
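The three output levels above could be selected from the command-line flags roughly as follows. This is a sketch only: `Verbosity`, `resolveVerbosity`, and `formatQuietLine` are hypothetical names, and letting `--silent` win when both flags are present is an assumption (the usage line suggests the flags are mutually exclusive).

```typescript
type Verbosity = "verbose" | "quiet" | "silent";

// Verbose is the default; --quiet and --silent reduce the output.
function resolveVerbosity(argv: string[]): Verbosity {
  if (argv.includes("--silent")) return "silent"; // assumption: --silent takes precedence
  if (argv.includes("--quiet")) return "quiet";
  return "verbose";
}

// One line per query, matching the quiet-mode sample output.
function formatQuietLine(index: number, total: number, query: string, overall: number): string {
  return `[${index}/${total}] "${query}" - overall: ${overall.toFixed(2)}`;
}
```

With `npm run test:quality -- --quiet`, the script would receive `--quiet` in its argument list, so `resolveVerbosity(process.argv.slice(2))` would return `"quiet"`.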
98
-
99
- ---
100
-
101
- ## 3. Post-Run Review
102
-
103
- ### List Runs
104
-
105
- `npm run test:quality:review -- --list`
106
-
107
- ```
108
- Recent test runs:
109
- 2025-12-17T22-32-30 core 15 queries overall=0.43 (no HIL review)
110
- 2025-12-17T18-15-22 core 15 queries overall=0.37 (reviewed)
111
- 2025-12-16T14-20-11 hono-focus 8 queries overall=0.31 (no HIL review)
112
- ```
113
-
114
- ### Review Session
115
-
116
- `npm run test:quality:review -- 2025-12-17T22-32-30`
117
-
118
- ```
119
- Reviewing run: 2025-12-17T22-32-30 (15 queries, overall=0.43)
120
-
121
- [1/15] "zod schema validation"
122
- AI overall: 0.76
123
-
124
- Results returned:
125
- → 1. [0.82] /fixtures/zod/validation.md
126
- "Zod provides a parse() method that validates..."
127
- → 2. [0.71] /fixtures/zod/schemas.md
128
- "Define schemas using z.object()..."
129
- ...
130
-
131
- How did the search do?
132
- [g]ood [o]kay [p]oor [t]errible [n]ote only [enter] skip
133
- ```
134
-
135
- ### Human Judgment Scale
136
-
137
- | Judgment | Score | Meaning |
138
- |----------|-------|---------|
139
- | good | 1.0 | Results nail it - exactly what I'd want |
140
- | okay | 0.7 | Decent results, room for improvement |
141
- | poor | 0.4 | Missing key results or bad ranking |
142
- | terrible | 0.1 | Completely wrong or useless |
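The judgment scale and the review prompt keys can be tied together in one lookup. The scores come from the table above; `ReviewAction`, `interpretKey`, and the handling of unrecognized input are illustrative assumptions.

```typescript
type Judgment = "good" | "okay" | "poor" | "terrible";

// Scores from the judgment scale table.
const JUDGMENT_SCORES: Record<Judgment, number> = {
  good: 1.0,
  okay: 0.7,
  poor: 0.4,
  terrible: 0.1,
};

// Prompt keys: [g]ood [o]kay [p]oor [t]errible
const KEY_TO_JUDGMENT = new Map<string, Judgment>([
  ["g", "good"],
  ["o", "okay"],
  ["p", "poor"],
  ["t", "terrible"],
]);

// Hypothetical result type for one prompt response.
type ReviewAction =
  | { kind: "judgment"; judgment: Judgment; humanScore: number }
  | { kind: "note" } // [n]ote only
  | { kind: "skip" } // [enter]
  | { kind: "invalid" }; // assumption: unknown keys are rejected so the prompt can repeat

function interpretKey(input: string): ReviewAction {
  const key = input.trim().toLowerCase();
  if (key === "") return { kind: "skip" };
  if (key === "n") return { kind: "note" };
  const judgment = KEY_TO_JUDGMENT.get(key);
  if (judgment === undefined) return { kind: "invalid" };
  return { kind: "judgment", judgment, humanScore: JUDGMENT_SCORES[judgment] };
}
```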
143
-
144
- ---
145
-
146
- ## 4. HIL Data Storage
147
-
148
- All HIL data is stored inline in the existing JSONL results file.
149
-
150
- ### Per-Query HIL Data
151
-
152
- ```json
153
- {
154
- "type": "query_evaluation",
155
- "data": {
156
- "query": { "query": "zod schema validation", "intent": "..." },
157
- "evaluation": { "scores": { "overall": 0.76 } },
158
- "hil": {
159
- "reviewed": true,
160
- "judgment": "okay",
161
- "humanScore": 0.7,
162
- "note": "Good top result, but missing error_handling.md",
163
- "reviewedAt": "2025-12-17T23:45:00Z"
164
- }
165
- }
166
- }
167
- ```
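A type for the per-query `hil` object, plus a helper that writes it into the matching JSONL line, might look like the sketch below. The field names follow the JSON example above; which fields are optional, the `QueryEvaluationRecord` shape, and the `attachHil` helper are assumptions rather than the project's actual types.

```typescript
// Field names taken from the example; optionality is a guess.
interface HilQueryData {
  reviewed: boolean;
  judgment?: "good" | "okay" | "poor" | "terrible";
  humanScore?: number;
  note?: string;
  reviewedAt?: string; // ISO 8601 timestamp
}

// Only the parts of the record shown in the example are modeled here.
interface QueryEvaluationRecord {
  type: "query_evaluation";
  data: {
    query: { query: string; intent?: string };
    evaluation: { scores: { overall: number } };
    hil?: HilQueryData;
  };
}

// Return the JSONL text with `hil` set on the record whose query text matches.
function attachHil(jsonl: string, query: string, hil: HilQueryData): string {
  return jsonl
    .split("\n")
    .map((line) => {
      if (line.trim() === "") return line; // keep blank lines untouched
      const record = JSON.parse(line) as { type?: string };
      if (record.type !== "query_evaluation") return line; // e.g. run_summary
      const evaluation = record as QueryEvaluationRecord;
      if (evaluation.data.query.query !== query) return line;
      evaluation.data.hil = hil;
      return JSON.stringify(evaluation);
    })
    .join("\n");
}
```

Matching records by query text is a simplification; a real results file may carry a more stable identifier.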
168
-
169
- ### Run Summary HIL Section
170
-
171
- ```json
172
- {
173
- "type": "run_summary",
174
- "data": {
175
- "averageScores": { "overall": 0.43 },
176
- "hilReview": {
177
- "reviewedAt": "2025-12-17T23:50:00Z",
178
- "queriesReviewed": 12,
179
- "queriesSkipped": 3,
180
- "queriesFlagged": 1,
181
- "humanAverageScore": 0.51,
182
- "aiVsHumanDelta": -0.08,
183
- "synthesis": "AI consistently underscores coverage. Snippet quality ratings align well.",
184
- "actionItems": [
185
- "Improve coverage for validation-related queries",
186
- "Refine 'hono vs express' query or remove"
187
- ]
188
- }
189
- }
190
- }
191
- ```
192
-
193
- The synthesis and action items are generated by AI based on the human feedback.
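The numeric fields of `hilReview` can be derived from the per-query human scores. The sketch below assumes the delta is the AI average minus the human average, which matches the example values (0.43 - 0.51 = -0.08), and that a skipped query is represented as `null`. `HilSummaryStats` and `summarizeHil` are hypothetical names, and `queriesFlagged` is left out because the original does not define what flags a query.

```typescript
interface HilSummaryStats {
  queriesReviewed: number;
  queriesSkipped: number;
  humanAverageScore: number;
  aiVsHumanDelta: number;
}

// Round to two decimals, as in the example output.
const round2 = (n: number): number => Math.round(n * 100) / 100;

// humanScores holds one entry per query; null means the query was skipped.
function summarizeHil(aiAverageOverall: number, humanScores: Array<number | null>): HilSummaryStats | null {
  const reviewed = humanScores.filter((s): s is number => s !== null);
  if (reviewed.length === 0) return null; // nothing to summarize
  const humanAverage = reviewed.reduce((sum, s) => sum + s, 0) / reviewed.length;
  return {
    queriesReviewed: reviewed.length,
    queriesSkipped: humanScores.length - reviewed.length,
    humanAverageScore: round2(humanAverage),
    // assumption: delta = AI average minus human average
    aiVsHumanDelta: round2(aiAverageOverall - humanAverage),
  };
}
```

A negative delta then means the AI scored the run lower than the human reviewer did.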
194
-
195
- ---
196
-
197
- ## 5. Implementation
198
-
199
- ### New Files
200
-
201
- | File | Purpose |
202
- |------|---------|
203
- | `tests/scripts/quality-queries.ts` | Query generation & editing HIL loop |
204
- | `tests/scripts/quality-review.ts` | Post-run review HIL loop |
205
- | `tests/scripts/quality-shared.ts` | Shared utilities (list sets, prompt helpers, scoring) |
206
-
207
- ### Modified Files
208
-
209
- | File | Changes |
210
- |------|---------|
211
- | `tests/scripts/search-quality.ts` | Default verbose output, add `--quiet`/`--silent`, `--set all` |
212
- | `tests/scripts/search-quality.types.ts` | Add `HilReview`, `HilQueryData` types |
213
- | `package.json` | Add new npm scripts |
214
-
215
- ### New npm Scripts
216
-
217
- ```json
218
- {
219
- "test:quality": "npx tsx tests/scripts/search-quality.ts",
220
- "test:quality:generate": "npx tsx tests/scripts/quality-queries.ts --generate",
221
- "test:quality:queries": "npx tsx tests/scripts/quality-queries.ts",
222
- "test:quality:review": "npx tsx tests/scripts/quality-review.ts"
223
- }
224
- ```
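Since `test:quality:generate` and `test:quality:queries` both run `quality-queries.ts`, that script has to choose a mode from its flags. A minimal sketch of that dispatch follows, assuming the flags shown in the Commands section; `QueriesMode` and `parseQueriesArgs` are hypothetical names, and defaulting `--review` to the `core` set follows the global `--set` convention.

```typescript
// Hypothetical modes for quality-queries.ts.
type QueriesMode =
  | { mode: "generate"; seedSet: string | undefined }
  | { mode: "list" }
  | { mode: "review"; set: string };

function parseQueriesArgs(argv: string[]): QueriesMode {
  const setIndex = argv.indexOf("--set");
  const set = setIndex >= 0 ? argv[setIndex + 1] : undefined;
  if (argv.includes("--generate")) return { mode: "generate", seedSet: set };
  if (argv.includes("--list")) return { mode: "list" };
  if (argv.includes("--review")) return { mode: "review", set: set ?? "core" };
  // assumption: running with no mode flag is treated as a usage error
  throw new Error("Expected one of --generate, --list, --review");
}
```

With the scripts above, `npm run test:quality:generate -- --set core` would reach the script as `--generate --set core`, which this sketch resolves to generate mode seeded from `core`.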