bluera-knowledge 0.12.11 → 0.13.1

Files changed (38)
  1. package/.claude/rules/code-quality.md +12 -0
  2. package/.claude/rules/git.md +5 -0
  3. package/.claude/rules/versioning.md +7 -0
  4. package/.claude-plugin/plugin.json +1 -1
  5. package/CHANGELOG.md +2 -0
  6. package/CLAUDE.md +5 -13
  7. package/README.md +11 -2
  8. package/commands/crawl.md +2 -1
  9. package/commands/test-plugin.md +197 -72
  10. package/dist/{chunk-7DZZHYDU.js → chunk-6ZVW2P2F.js} +66 -38
  11. package/dist/chunk-6ZVW2P2F.js.map +1 -0
  12. package/dist/{chunk-S5VW7NPH.js → chunk-GCUKVV33.js} +2 -2
  13. package/dist/{chunk-XVVMSRLO.js → chunk-H5AKKHY7.js} +2 -2
  14. package/dist/index.js +3 -3
  15. package/dist/mcp/server.js +2 -2
  16. package/dist/workers/background-worker-cli.js +2 -2
  17. package/docs/claude-code-best-practices.md +458 -0
  18. package/eslint.config.js +1 -1
  19. package/hooks/check-dependencies.sh +18 -1
  20. package/hooks/hooks.json +2 -2
  21. package/hooks/posttooluse-bk-reminder.py +30 -2
  22. package/package.json +1 -1
  23. package/scripts/test-mcp-dev.js +260 -0
  24. package/src/services/index.service.test.ts +347 -0
  25. package/src/services/index.service.ts +93 -44
  26. package/tests/integration/cli-consistency.test.ts +3 -2
  27. package/dist/chunk-7DZZHYDU.js.map +0 -1
  28. package/docs/plans/2024-12-17-ai-search-quality-implementation.md +0 -752
  29. package/docs/plans/2024-12-17-ai-search-quality-testing-design.md +0 -201
  30. package/docs/plans/2025-12-16-bluera-knowledge-cli.md +0 -2951
  31. package/docs/plans/2025-12-16-phase2-features.md +0 -1518
  32. package/docs/plans/2025-12-17-hil-implementation.md +0 -926
  33. package/docs/plans/2025-12-17-hil-quality-testing.md +0 -224
  34. package/docs/plans/2025-12-17-search-quality-phase1-implementation.md +0 -1416
  35. package/docs/plans/2025-12-17-search-quality-testing-v2-design.md +0 -212
  36. package/docs/plans/2025-12-28-ai-agent-optimization.md +0 -1630
  37. package/dist/{chunk-S5VW7NPH.js.map → chunk-GCUKVV33.js.map} +0 -0
  38. package/dist/{chunk-XVVMSRLO.js.map → chunk-H5AKKHY7.js.map} +0 -0
@@ -1,926 +0,0 @@
1
- # HIL Quality Testing Implementation Plan
2
-
3
- > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
4
-
5
- **Goal:** Add human-in-the-loop capabilities to search quality testing with query management, verbose output, and post-run review.
6
-
7
- **Architecture:** Three new scripts share utilities from `quality-shared.ts`. Main test runner gets verbose output by default. HIL data stored inline in existing JSONL format.
8
-
9
- **Tech Stack:** TypeScript, Node.js readline for interactive prompts, existing Claude CLI integration
10
-
11
- ---
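As a rough illustration of the "inline in existing JSONL" idea above, the sketch below merges review data into the `query_evaluation` records of a run file and leaves every other record untouched. The `JsonlRecord` and `HilData` shapes and the `attachHil` helper are assumptions made for this example, not names from the plan.

```typescript
// Hypothetical record shape: one JSON object per line, tagged by `type`.
interface JsonlRecord {
  type: string;
  data?: Record<string, unknown>;
}

// Minimal HIL payload for the example (the plan's HilQueryData has more fields).
interface HilData {
  reviewed: boolean;
  judgment?: 'good' | 'okay' | 'poor' | 'terrible';
  note?: string;
}

// Attach HIL data to query_evaluation records, keyed by evaluation index
// (0 for the first query_evaluation line, 1 for the second, and so on).
function attachHil(jsonl: string, hilByIndex: Map<number, HilData>): string {
  let evalIndex = 0;
  const out = jsonl
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => {
      const record = JSON.parse(line) as JsonlRecord;
      if (record.type !== 'query_evaluation') return line;
      const hil = hilByIndex.get(evalIndex++);
      return hil ? JSON.stringify({ ...record, data: { ...record.data, hil } }) : line;
    });
  return out.join('\n') + '\n';
}

const input = [
  JSON.stringify({ type: 'run_start', runId: 'example' }),
  JSON.stringify({ type: 'query_evaluation', data: { query: 'first' } }),
  JSON.stringify({ type: 'query_evaluation', data: { query: 'second' } }),
].join('\n');

const updated = attachHil(
  input,
  new Map<number, HilData>([[1, { reviewed: true, judgment: 'good' }]]),
);
console.log(updated);
```

Keeping the review data on the same line as the evaluation it describes means a run file stays self-contained, at the cost of rewriting the file after each review.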
12
-
13
- ## Task 1: Add HIL Types
14
-
15
- **Files:**
16
- - Modify: `tests/scripts/search-quality.types.ts`
17
-
18
- **Step 1: Add HIL type definitions**
19
-
20
- Add to end of file:
21
-
22
- ```typescript
23
- // HIL (Human-in-the-Loop) types
24
- export type HilJudgment = 'good' | 'okay' | 'poor' | 'terrible';
25
-
26
- export const HIL_JUDGMENT_SCORES: Record<HilJudgment, number> = {
27
- good: 1.0,
28
- okay: 0.7,
29
- poor: 0.4,
30
- terrible: 0.1,
31
- };
32
-
33
- export interface HilQueryData {
34
- reviewed: boolean;
35
- judgment?: HilJudgment;
36
- humanScore?: number;
37
- note?: string;
38
- flagged?: boolean;
39
- reviewedAt?: string;
40
- }
41
-
42
- export interface HilReviewSummary {
43
- reviewedAt: string;
44
- queriesReviewed: number;
45
- queriesSkipped: number;
46
- queriesFlagged: number;
47
- humanAverageScore: number;
48
- aiVsHumanDelta: number;
49
- synthesis: string;
50
- actionItems: string[];
51
- }
52
-
53
- // Extended types with HIL support
54
- export interface QueryEvaluationWithHil extends QueryEvaluation {
55
- hil?: HilQueryData;
56
- }
57
-
58
- export interface RunSummaryWithHil extends RunSummary {
59
- hilReview?: HilReviewSummary;
60
- }
61
- ```
62
-
63
- **Step 2: Commit**
64
-
65
- ```bash
66
- git add tests/scripts/search-quality.types.ts
67
- git commit -m "feat(tests): add HIL type definitions"
68
- ```
69
-
70
- ---
71
-
72
- ## Task 2: Create Shared Utilities
73
-
74
- **Files:**
75
- - Create: `tests/scripts/quality-shared.ts`
76
-
77
- **Step 1: Create the shared utilities file**
78
-
79
- ```typescript
80
- #!/usr/bin/env npx tsx
81
-
82
- import { readFileSync, readdirSync, existsSync } from 'node:fs';
83
- import { join, basename } from 'node:path';
84
- import { createInterface } from 'node:readline';
85
- import type { QuerySet, HilJudgment } from './search-quality.types.js';
86
-
87
- // Directory constants
88
- export const ROOT_DIR = join(import.meta.dirname, '..', '..');
89
- export const RESULTS_DIR = join(import.meta.dirname, '..', 'quality-results');
90
- export const QUERIES_DIR = join(import.meta.dirname, '..', 'fixtures', 'queries');
91
-
92
- // Query set discovery
93
- export interface QuerySetInfo {
94
- name: string;
95
- path: string;
96
- queryCount: number;
97
- source: 'curated' | 'ai-generated';
98
- version?: string;
99
- }
100
-
101
- export function listQuerySets(): QuerySetInfo[] {
102
- const sets: QuerySetInfo[] = [];
103
-
104
- // Curated sets (top-level .json files)
105
- if (existsSync(QUERIES_DIR)) {
106
- const files = readdirSync(QUERIES_DIR);
107
- for (const file of files) {
108
- if (file.endsWith('.json')) {
109
- const path = join(QUERIES_DIR, file);
110
- const data = JSON.parse(readFileSync(path, 'utf-8')) as QuerySet;
111
- sets.push({
112
- name: basename(file, '.json'),
113
- path,
114
- queryCount: data.queries.length,
115
- source: data.source ?? 'curated',
116
- version: data.version,
117
- });
118
- }
119
- }
120
- }
121
-
122
- // Generated sets (in generated/ subdirectory)
123
- const generatedDir = join(QUERIES_DIR, 'generated');
124
- if (existsSync(generatedDir)) {
125
- const files = readdirSync(generatedDir);
126
- for (const file of files) {
127
- if (file.endsWith('.json')) {
128
- const path = join(generatedDir, file);
129
- const data = JSON.parse(readFileSync(path, 'utf-8')) as QuerySet;
130
- sets.push({
131
- name: `generated/${basename(file, '.json')}`,
132
- path,
133
- queryCount: data.queries.length,
134
- source: 'ai-generated',
135
- version: data.version,
136
- });
137
- }
138
- }
139
- }
140
-
141
- return sets;
142
- }
143
-
144
- export function loadQuerySet(name: string): QuerySet {
145
- const sets = listQuerySets();
146
- const set = sets.find(s => s.name === name);
147
- if (!set) {
148
- throw new Error(`Query set not found: ${name}`);
149
- }
150
- return JSON.parse(readFileSync(set.path, 'utf-8')) as QuerySet;
151
- }
152
-
153
- export function loadAllQuerySets(): QuerySet {
154
- const sets = listQuerySets().filter(s => s.source === 'curated');
155
- const combined: QuerySet = {
156
- version: '1.0.0',
157
- description: 'Combined curated query sets',
158
- queries: [],
159
- source: 'curated',
160
- };
161
-
162
- for (const set of sets) {
163
- const data = JSON.parse(readFileSync(set.path, 'utf-8')) as QuerySet;
164
- combined.queries.push(...data.queries.map(q => ({
165
- ...q,
166
- id: `${set.name}:${q.id}`,
167
- })));
168
- }
169
-
170
- return combined;
171
- }
172
-
173
- // Run discovery
174
- export interface RunInfo {
175
- id: string;
176
- path: string;
177
- querySet: string;
178
- queryCount: number;
179
- overallScore: number;
180
- hasHilReview: boolean;
181
- timestamp: string;
182
- }
183
-
184
- export function listRuns(): RunInfo[] {
185
- const runs: RunInfo[] = [];
186
-
187
- if (!existsSync(RESULTS_DIR)) return runs;
188
-
189
- const files = readdirSync(RESULTS_DIR)
190
- .filter(f => f.endsWith('.jsonl'))
191
- .sort()
192
- .reverse();
193
-
194
- for (const file of files) {
195
- const path = join(RESULTS_DIR, file);
196
- const lines = readFileSync(path, 'utf-8').trim().split('\n');
197
-
198
- // Find summary line
199
- for (const line of lines) {
200
- const parsed = JSON.parse(line);
201
- if (parsed.type === 'run_summary') {
202
- runs.push({
203
- id: basename(file, '.jsonl'),
204
- path,
205
- querySet: parsed.data.config?.querySet ?? 'unknown',
206
- queryCount: parsed.data.totalQueries,
207
- overallScore: parsed.data.averageScores?.overall ?? 0,
208
- hasHilReview: !!parsed.data.hilReview,
209
- timestamp: parsed.data.timestamp,
210
- });
211
- break;
212
- }
213
- }
214
- }
215
-
216
- return runs;
217
- }
218
-
219
- // Interactive prompts
220
- export function createPrompt(): { question: (q: string) => Promise<string>; close: () => void } {
221
- const rl = createInterface({
222
- input: process.stdin,
223
- output: process.stdout,
224
- });
225
-
226
- return {
227
- question: (q: string) => new Promise(resolve => rl.question(q, resolve)),
228
- close: () => rl.close(),
229
- };
230
- }
231
-
232
- export function formatJudgmentPrompt(): string {
233
- return '[g]ood [o]kay [p]oor [t]errible [n]ote only [enter] skip';
234
- }
235
-
236
- export function parseJudgment(input: string): HilJudgment | 'note' | 'skip' {
237
- const lower = input.toLowerCase().trim();
238
- if (lower === '' || lower === 's') return 'skip';
239
- if (lower === 'g' || lower === 'good') return 'good';
240
- if (lower === 'o' || lower === 'okay') return 'okay';
241
- if (lower === 'p' || lower === 'poor') return 'poor';
242
- if (lower === 't' || lower === 'terrible') return 'terrible';
243
- if (lower === 'n' || lower === 'note') return 'note';
244
- return 'skip';
245
- }
246
-
247
- // Display helpers
248
- export function printQuerySets(sets: QuerySetInfo[]): void {
249
- console.log('Available query sets:');
250
- for (const set of sets) {
251
- const source = set.source === 'curated' ? `curated, v${set.version}` : 'ai-generated';
252
- console.log(` ${set.name.padEnd(20)} ${String(set.queryCount).padStart(3)} queries (${source})`);
253
- }
254
- }
255
-
256
- export function printRuns(runs: RunInfo[]): void {
257
- console.log('Recent test runs:');
258
- for (const run of runs.slice(0, 10)) {
259
- const reviewed = run.hasHilReview ? '(reviewed)' : '(no HIL review)';
260
- console.log(` ${run.id} ${run.querySet.padEnd(10)} ${String(run.queryCount).padStart(3)} queries overall=${run.overallScore.toFixed(2)} ${reviewed}`);
261
- }
262
- }
263
- ```
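`listRuns()` above calls `JSON.parse` on every line, so a blank or partially written line in a results file would throw. One tolerant alternative is sketched below, assuming the same `{ type: 'run_summary', data: ... }` record layout; `findRunSummary` and `SummaryData` are hypothetical names, not part of the plan.

```typescript
// Subset of the summary fields that listRuns() reads.
interface SummaryData {
  totalQueries?: number;
  timestamp?: string;
  hilReview?: unknown;
}

// Return the first run_summary payload in a JSONL string, or null if none parses.
function findRunSummary(content: string): SummaryData | null {
  for (const line of content.split('\n')) {
    if (line.trim() === '') continue; // skip blank lines instead of throwing
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip truncated or corrupt lines
    }
    if (typeof parsed === 'object' && parsed !== null) {
      const record = parsed as { type?: unknown; data?: unknown };
      if (record.type === 'run_summary' && typeof record.data === 'object' && record.data !== null) {
        return record.data as SummaryData;
      }
    }
  }
  return null;
}

const sample = [
  '{"type":"query_evaluation","data":{}}',
  '',
  '{"type":"run_summ', // simulated truncated write
  '{"type":"run_summary","data":{"totalQueries":12,"timestamp":"2025-12-17T00:00:00Z"}}',
].join('\n');

console.log(findRunSummary(sample)?.totalQueries);
console.log(findRunSummary(''));
```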
264
-
265
- **Step 2: Commit**
266
-
267
- ```bash
268
- git add tests/scripts/quality-shared.ts
269
- git commit -m "feat(tests): add shared HIL utilities"
270
- ```
271
-
272
- ---
273
-
274
- ## Task 3: Update Test Runner - Verbose Output
275
-
276
- **Files:**
277
- - Modify: `tests/scripts/search-quality.ts`
278
-
279
- **Step 1: Add output mode parsing after CLI args parsing**
280
-
281
- Find this section (around lines 413-415):
282
- ```typescript
283
- const isExplore = args.includes('--explore');
284
- const updateBaseline = args.includes('--update-baseline');
285
- ```
286
-
287
- Add after it:
288
- ```typescript
289
- // Output verbosity
290
- const quietMode = args.includes('--quiet');
291
- const silentMode = args.includes('--silent');
292
- ```
293
-
294
- **Step 2: Update the query loop output**
295
-
296
- Find the console.log in the query loop (around line 502):
297
- ```typescript
298
- console.log(` ${progress} "${q.query.slice(0, 40)}${q.query.length > 40 ? '...' : ''}" - overall: ${evaluation.scores.overall.toFixed(2)}`);
299
- ```
300
-
301
- Replace the entire query loop output section with:
302
- ```typescript
303
- if (!silentMode) {
304
- if (quietMode) {
305
- // Quiet mode: summary only
306
- console.log(` ${progress} "${q.query.slice(0, 40)}${q.query.length > 40 ? '...' : ''}" - overall: ${evaluation.scores.overall.toFixed(2)}`);
307
- } else {
308
- // Default verbose mode: full results
309
- console.log(`\n${progress} "${q.query}"`);
310
- for (const r of results.slice(0, 5)) {
311
- console.log(` → ${r.rank}. [${r.score.toFixed(2)}] ${r.source}`);
312
- const snippet = r.content.slice(0, 100).replace(/\n/g, ' ');
313
- console.log(` "${snippet}${r.content.length > 100 ? '...' : ''}"`);
314
- }
315
- if (results.length > 5) {
316
- console.log(` ... and ${results.length - 5} more results`);
317
- }
318
- const s = evaluation.scores;
319
- console.log(` ✓ AI: relevance=${s.relevance.toFixed(2)} ranking=${s.ranking.toFixed(2)} coverage=${s.coverage.toFixed(2)} snippet=${s.snippetQuality.toFixed(2)} overall=${s.overall.toFixed(2)}`);
320
- }
321
- }
322
- ```
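The nested `if` above implies a precedence: `--silent` wins over `--quiet`, and verbose is the default when neither flag is passed. A small sketch of that resolution follows; `OutputMode`, `resolveOutputMode`, and `truncate` are illustrative names that do not appear in the plan.

```typescript
type OutputMode = 'silent' | 'quiet' | 'verbose';

// --silent takes precedence over --quiet; verbose is the default.
function resolveOutputMode(args: readonly string[]): OutputMode {
  if (args.includes('--silent')) return 'silent';
  if (args.includes('--quiet')) return 'quiet';
  return 'verbose';
}

// Shorten a string to `max` characters, appending '...' only when text was cut.
function truncate(text: string, max: number): string {
  return text.length > max ? `${text.slice(0, max)}...` : text;
}

console.log(resolveOutputMode(['--set', 'core']));
console.log(resolveOutputMode(['--quiet', '--silent']));
console.log(truncate('how do I configure the vector store', 10));
```

Resolving the mode once, up front, keeps the query loop to a single `switch` or `if` chain on one value instead of two booleans.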
323
-
324
- **Step 3: Commit**
325
-
326
- ```bash
327
- git add tests/scripts/search-quality.ts
328
- git commit -m "feat(tests): add verbose output by default with --quiet/--silent"
329
- ```
330
-
331
- ---
332
-
333
- ## Task 4: Update Test Runner - Set All Support
334
-
335
- **Files:**
336
- - Modify: `tests/scripts/search-quality.ts`
337
-
338
- **Step 1: Add import for shared utilities**
339
-
340
- Add near top imports:
341
- ```typescript
342
- import { listQuerySets, loadAllQuerySets } from './quality-shared.js';
343
- ```
344
-
345
- **Step 2: Update query loading logic**
346
-
347
- Find where queries are loaded (around lines 458-462):
348
- ```typescript
349
- } else {
350
- const querySet = loadQuerySet(querySetName);
351
- ```
352
-
353
- Replace with:
354
- ```typescript
355
- } else {
356
- let querySet: QuerySet;
357
- if (querySetName === 'all') {
358
- querySet = loadAllQuerySets();
359
- console.log(`📋 Loaded ${querySet.queries.length} queries from all curated sets\n`);
360
- } else {
361
- querySet = loadQuerySet(querySetName);
362
- console.log(`📋 Loaded ${querySet.queries.length} queries from ${querySetName}.json\n`);
363
- }
364
- ```
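For `--set all`, `loadAllQuerySets()` from Task 2 prefixes each query ID with its set name so that IDs stay unique after merging. The sketch below shows that merge on in-memory data; the `Query` and `NamedSet` types and the `combineSets` function are assumptions for illustration, whereas the function in the plan reads the sets from disk.

```typescript
interface Query {
  id: string;
  query: string;
}

interface NamedSet {
  name: string;
  queries: Query[];
}

// Merge several sets into one list, namespacing IDs as "<set>:<id>".
function combineSets(sets: readonly NamedSet[]): Query[] {
  const combined: Query[] = [];
  for (const set of sets) {
    for (const q of set.queries) {
      combined.push({ ...q, id: `${set.name}:${q.id}` });
    }
  }
  return combined;
}

// Two sets that reuse the same local ID.
const merged = combineSets([
  { name: 'core', queries: [{ id: 'q-001', query: 'error handling' }] },
  { name: 'extended', queries: [{ id: 'q-001', query: 'retry logic' }] },
]);

console.log(merged.map(q => q.id));
```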
365
-
366
- **Step 3: Commit**
367
-
368
- ```bash
369
- git add tests/scripts/search-quality.ts
370
- git commit -m "feat(tests): add --set all support for combined query sets"
371
- ```
372
-
373
- ---
374
-
375
- ## Task 5: Create Query Management Script
376
-
377
- **Files:**
378
- - Create: `tests/scripts/quality-queries.ts`
379
-
380
- **Step 1: Create the query management script**
381
-
382
- ```typescript
383
- #!/usr/bin/env npx tsx
384
-
385
- import { readFileSync, writeFileSync, existsSync, mkdirSync } from 'node:fs';
386
- import { join, dirname } from 'node:path';
387
- import { execSync } from 'node:child_process';
388
- import {
389
- listQuerySets,
390
- loadQuerySet,
391
- loadAllQuerySets,
392
- printQuerySets,
393
- createPrompt,
394
- QUERIES_DIR,
395
- ROOT_DIR,
396
- } from './quality-shared.js';
397
- import type { QuerySet, CoreQuery } from './search-quality.types.js';
398
-
399
- const CLAUDE_CLI = process.env.CLAUDE_CLI || `${process.env.HOME}/.claude/local/claude`;
400
-
401
- function shellEscape(str: string): string {
402
- return `'${str.replace(/'/g, "'\"'\"'")}'`;
403
- }
404
-
405
- async function generateQueries(seedSet?: string): Promise<void> {
406
- const prompt = createPrompt();
407
-
408
- console.log('🔍 Query Generation Mode\n');
409
-
410
- // Load seed queries if specified
411
- let seedQueries: CoreQuery[] = [];
412
- if (seedSet) {
413
- const seed = seedSet === 'all' ? loadAllQuerySets() : loadQuerySet(seedSet);
414
- seedQueries = seed.queries;
415
- console.log(`Seeding from ${seedSet} (${seedQueries.length} queries)\n`);
416
- }
417
-
418
- // Generate queries via Claude
419
- console.log('Generating queries from corpus analysis...\n');
420
-
421
- const generatePrompt = `You have access to explore the tests/fixtures/ directory which contains content indexed in a knowledge store.
422
-
423
- ${seedQueries.length > 0 ? `Existing queries to build on:\n${seedQueries.map(q => `- "${q.query}" (${q.category})`).join('\n')}\n\nPropose 10-15 NEW queries that complement the existing ones.` : 'Propose 10-15 diverse search queries.'}
424
-
425
- For each query provide:
426
- - query: the search string
427
- - intent: what the user is trying to find
428
- - category: one of code-pattern, concept, api-reference, troubleshooting, comparison
429
-
430
- Return as JSON array.`;
431
-
432
- const args = [
433
- CLAUDE_CLI,
434
- '-p', shellEscape(generatePrompt),
435
- '--output-format', 'json',
436
- '--allowedTools', 'Glob,Read',
437
- ];
438
-
439
- const result = execSync(args.join(' '), {
440
- cwd: ROOT_DIR,
441
- encoding: 'utf-8',
442
- timeout: 120000,
443
- });
444
-
445
- const parsed = JSON.parse(result);
446
- let queries: CoreQuery[] = parsed.structured_output ?? JSON.parse(parsed.result);
447
-
448
- // Assign IDs
449
- queries = queries.map((q, i) => ({
450
- ...q,
451
- id: `gen-${String(i + 1).padStart(3, '0')}`,
452
- }));
453
-
454
- // Interactive review loop
455
- let done = false;
456
- while (!done) {
457
- console.log(`\nProposed queries (${queries.length}):\n`);
458
- queries.forEach((q, i) => {
459
- console.log(`${String(i + 1).padStart(2)}. [${q.category}] "${q.query}"`);
460
- console.log(` Intent: ${q.intent}\n`);
461
- });
462
-
463
- const action = await prompt.question('Actions: [a]ccept, [d]rop <nums>, [e]dit <num>, [r]egenerate, [q]uit: ');
464
-
465
- if (action === 'a' || action === 'accept') {
466
- done = true;
467
- } else if (action.startsWith('d ') || action.startsWith('drop ')) {
468
- const nums = action.replace(/^d(rop)?\s+/, '').split(',').map(n => parseInt(n.trim(), 10) - 1);
469
- queries = queries.filter((_, i) => !nums.includes(i));
470
- console.log(`Dropped ${nums.length} queries.`);
471
- } else if (action.startsWith('e ') || action.startsWith('edit ')) {
472
- const num = parseInt(action.replace(/^e(dit)?\s+/, ''), 10) - 1;
473
- if (queries[num]) {
474
- const newQuery = await prompt.question(`Query [${queries[num].query}]: `);
475
- const newIntent = await prompt.question(`Intent [${queries[num].intent}]: `);
476
- if (newQuery.trim()) queries[num].query = newQuery.trim();
477
- if (newIntent.trim()) queries[num].intent = newIntent.trim();
478
- }
479
- } else if (action === 'r' || action === 'regenerate') {
480
- console.log('Regenerating...');
481
- // Would call Claude again here - simplified for now
482
- } else if (action === 'q' || action === 'quit') {
483
- prompt.close();
484
- console.log('Cancelled.');
485
- return;
486
- }
487
- }
488
-
489
- // Save to file
490
- const name = await prompt.question('Save as (name): ');
491
- const filename = name.trim() || `generated-${new Date().toISOString().split('T')[0]}`;
492
-
493
- const outputDir = join(QUERIES_DIR, 'generated');
494
- if (!existsSync(outputDir)) {
495
- mkdirSync(outputDir, { recursive: true });
496
- }
497
-
498
- const outputPath = join(outputDir, `${filename}.json`);
499
- const querySet: QuerySet = {
500
- version: '1.0.0',
501
- description: `AI-generated queries from ${new Date().toISOString()}`,
502
- queries,
503
- source: 'ai-generated',
504
- generatedAt: new Date().toISOString(),
505
- };
506
-
507
- writeFileSync(outputPath, JSON.stringify(querySet, null, 2));
508
- console.log(`\n✓ Saved ${queries.length} queries to ${outputPath}`);
509
-
510
- prompt.close();
511
- }
512
-
513
- async function reviewQueries(setName: string): Promise<void> {
514
- const prompt = createPrompt();
515
-
516
- console.log(`📝 Reviewing query set: ${setName}\n`);
517
-
518
- const querySet = setName === 'all' ? loadAllQuerySets() : loadQuerySet(setName);
519
- let queries = [...querySet.queries];
520
-
521
- // Interactive review loop (same as generate)
522
- let done = false;
523
- while (!done) {
524
- console.log(`\nQueries (${queries.length}):\n`);
525
- queries.forEach((q, i) => {
526
- console.log(`${String(i + 1).padStart(2)}. [${q.category}] "${q.query}"`);
527
- console.log(` Intent: ${q.intent}\n`);
528
- });
529
-
530
- const action = await prompt.question('Actions: [d]rop <nums>, [e]dit <num>, [a]dd, [s]ave, [q]uit: ');
531
-
532
- if (action === 's' || action === 'save') {
533
- done = true;
534
- } else if (action.startsWith('d ') || action.startsWith('drop ')) {
535
- const nums = action.replace(/^d(rop)?\s+/, '').split(',').map(n => parseInt(n.trim(), 10) - 1);
536
- queries = queries.filter((_, i) => !nums.includes(i));
537
- console.log(`Dropped ${nums.length} queries.`);
538
- } else if (action.startsWith('e ') || action.startsWith('edit ')) {
539
- const num = parseInt(action.replace(/^e(dit)?\s+/, ''), 10) - 1;
540
- if (queries[num]) {
541
- const newQuery = await prompt.question(`Query [${queries[num].query}]: `);
542
- const newIntent = await prompt.question(`Intent [${queries[num].intent}]: `);
543
- if (newQuery.trim()) queries[num].query = newQuery.trim();
544
- if (newIntent.trim()) queries[num].intent = newIntent.trim();
545
- }
546
- } else if (action === 'a' || action === 'add') {
547
- const newQuery = await prompt.question('Query: ');
548
- const newIntent = await prompt.question('Intent: ');
549
- const newCategory = await prompt.question('Category (code-pattern/concept/api-reference/troubleshooting/comparison): ');
550
- queries.push({
551
- id: `manual-${queries.length + 1}`,
552
- query: newQuery.trim(),
553
- intent: newIntent.trim(),
554
- category: (newCategory.trim() || 'code-pattern') as CoreQuery['category'],
555
- });
556
- } else if (action === 'q' || action === 'quit') {
557
- prompt.close();
558
- console.log('Cancelled without saving.');
559
- return;
560
- }
561
- }
562
-
563
- // Save back
564
- const sets = listQuerySets();
565
- const setInfo = sets.find(s => s.name === setName);
566
- if (setInfo && setName !== 'all') {
567
- // Backup original
568
- const backup = readFileSync(setInfo.path, 'utf-8');
569
- writeFileSync(`${setInfo.path}.bak`, backup);
570
-
571
- querySet.queries = queries;
572
- writeFileSync(setInfo.path, JSON.stringify(querySet, null, 2));
573
- console.log(`\n✓ Saved ${queries.length} queries to ${setInfo.path}`);
574
- console.log(` Backup: ${setInfo.path}.bak`);
575
- } else {
576
- console.log('Cannot save "all" - edits would need to be saved to individual sets.');
577
- }
578
-
579
- prompt.close();
580
- }
581
-
582
- async function main(): Promise<void> {
583
- const args = process.argv.slice(2);
584
-
585
- const showList = args.includes('--list');
586
- const generateMode = args.includes('--generate');
587
- const reviewMode = args.includes('--review');
588
- const setArg = args.find(a => a.startsWith('--set='))?.split('=')[1]
589
- ?? (args.includes('--set') ? args[args.indexOf('--set') + 1] : undefined);
590
-
591
- if (showList || (!generateMode && !reviewMode && !setArg)) {
592
- const sets = listQuerySets();
593
- printQuerySets(sets);
594
- console.log('\nUsage:');
595
- console.log(' --generate [--set <seed>] Generate new queries');
596
- console.log(' --review --set <name|all> Review existing queries');
597
- return;
598
- }
599
-
600
- if (generateMode) {
601
- await generateQueries(setArg);
602
- } else if (reviewMode) {
603
- if (!setArg) {
604
- console.error('--review requires --set <name|all>');
605
- process.exit(1);
606
- }
607
- await reviewQueries(setArg);
608
- }
609
- }
610
-
611
- main().catch(console.error);
612
- ```
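Argument lookups built on `indexOf` need a guard, because `indexOf` returns `-1` for a missing flag and `args[-1 + 1]` is `args[0]`. A guarded parser for the `--set` option is sketched here under the hypothetical name `parseSetArg`; it accepts both the `--set=<name>` and `--set <name>` forms.

```typescript
// Return the value of --set, accepting both "--set=<name>" and "--set <name>".
// Returns undefined when the flag is missing or has no value.
function parseSetArg(args: readonly string[]): string | undefined {
  const inline = args.find(a => a.startsWith('--set='));
  if (inline !== undefined) {
    const value = inline.slice('--set='.length);
    return value === '' ? undefined : value;
  }
  const idx = args.indexOf('--set');
  if (idx === -1) return undefined; // flag absent: do not fall through to args[0]
  const next = args[idx + 1];
  // Treat a following flag (e.g. "--set --review") as a missing value.
  return next !== undefined && !next.startsWith('--') ? next : undefined;
}

console.log(parseSetArg(['--generate']));
console.log(parseSetArg(['--review', '--set', 'core']));
console.log(parseSetArg(['--set=all']));
```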
613
-
614
- **Step 2: Commit**
615
-
616
- ```bash
617
- git add tests/scripts/quality-queries.ts
618
- git commit -m "feat(tests): add interactive query management script"
619
- ```
620
-
621
- ---
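Both scripts in this plan join their arguments into one string for `execSync` and rely on a hand-written `shellEscape`. An alternative is Node's `execFileSync`, which takes the arguments as an array and does not go through a shell by default, so no quoting is needed. The sketch below assumes the same CLI flags the plan uses (`-p`, `--output-format`, `--allowedTools`); `buildClaudeArgs` and `runClaude` are hypothetical helpers.

```typescript
import { execFileSync } from 'node:child_process';

// Build the argument vector; each element is passed to the process verbatim.
function buildClaudeArgs(prompt: string, allowedTools: readonly string[] = []): string[] {
  const args = ['-p', prompt, '--output-format', 'json'];
  if (allowedTools.length > 0) {
    args.push('--allowedTools', allowedTools.join(','));
  }
  return args;
}

// Run the CLI without a shell. `cliPath` is whatever binary the caller resolved.
function runClaude(cliPath: string, prompt: string, cwd: string): string {
  return execFileSync(cliPath, buildClaudeArgs(prompt, ['Glob', 'Read']), {
    cwd,
    encoding: 'utf-8',
    timeout: 120_000,
  });
}

// Quotes in the prompt need no escaping because no shell parses them.
console.log(buildClaudeArgs(`Summarize "it's fine"`, ['Glob', 'Read']));
```

Because the prompt travels as a single array element, quotes, backticks, and `$` inside it reach the CLI unchanged.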
622
-
623
- ## Task 6: Create Review Script
624
-
625
- **Files:**
626
- - Create: `tests/scripts/quality-review.ts`
627
-
628
- **Step 1: Create the review script**
629
-
630
- ```typescript
631
- #!/usr/bin/env npx tsx
632
-
633
- import { readFileSync, writeFileSync } from 'node:fs';
634
- import { execSync } from 'node:child_process';
635
- import {
636
- listRuns,
637
- printRuns,
638
- createPrompt,
639
- formatJudgmentPrompt,
640
- parseJudgment,
641
- ROOT_DIR,
642
- } from './quality-shared.js';
643
- import type {
644
- QueryEvaluation,
645
- RunSummary,
646
- HilJudgment,
647
- HilQueryData,
648
- HilReviewSummary,
649
- HIL_JUDGMENT_SCORES,
650
- } from './search-quality.types.js';
651
-
652
- const CLAUDE_CLI = process.env.CLAUDE_CLI || `${process.env.HOME}/.claude/local/claude`;
653
-
654
- const JUDGMENT_SCORES: Record<HilJudgment, number> = {
655
- good: 1.0,
656
- okay: 0.7,
657
- poor: 0.4,
658
- terrible: 0.1,
659
- };
660
-
661
- interface ParsedRun {
662
- lines: Array<{ type: string; data?: unknown; timestamp?: string; runId?: string; config?: unknown }>;
663
- evaluations: QueryEvaluation[];
664
- summary: RunSummary | null;
665
- }
666
-
667
- function parseRunFile(path: string): ParsedRun {
668
- const content = readFileSync(path, 'utf-8');
669
- const lines = content.trim().split('\n').map(l => JSON.parse(l));
670
-
671
- const evaluations: QueryEvaluation[] = [];
672
- let summary: RunSummary | null = null;
673
-
674
- for (const line of lines) {
675
- if (line.type === 'query_evaluation') {
676
- evaluations.push(line.data as QueryEvaluation);
677
- } else if (line.type === 'run_summary') {
678
- summary = line.data as RunSummary;
679
- }
680
- }
681
-
682
- return { lines, evaluations, summary };
683
- }
684
-
685
- function shellEscape(str: string): string {
686
- return `'${str.replace(/'/g, "'\"'\"'")}'`;
687
- }
688
-
689
- async function reviewRun(runId: string): Promise<void> {
690
- const runs = listRuns();
691
- const run = runs.find(r => r.id === runId);
692
-
693
- if (!run) {
694
- console.error(`Run not found: ${runId}`);
695
- console.log('\nAvailable runs:');
696
- printRuns(runs);
697
- process.exit(1);
698
- }
699
-
700
- const prompt = createPrompt();
701
- const parsed = parseRunFile(run.path);
702
-
703
- console.log(`\n📊 Reviewing run: ${runId}`);
704
- console.log(` ${parsed.evaluations.length} queries, overall=${run.overallScore.toFixed(2)}\n`);
705
-
706
- const hilData: Map<number, HilQueryData> = new Map();
707
-
708
- for (let i = 0; i < parsed.evaluations.length; i++) {
709
- const eval_ = parsed.evaluations[i];
710
- const progress = `[${i + 1}/${parsed.evaluations.length}]`;
711
-
712
- console.log(`\n${progress} "${eval_.query.query}"`);
713
- console.log(` AI overall: ${eval_.evaluation.scores.overall.toFixed(2)}`);
714
- console.log(`\n Results returned:`);
715
-
716
- for (const r of eval_.searchResults.slice(0, 5)) {
717
- console.log(` → ${r.source}`);
718
- const snippet = r.snippet.slice(0, 80).replace(/\n/g, ' ');
719
- console.log(` "${snippet}${r.snippet.length > 80 ? '...' : ''}"`);
720
- }
721
-
722
- console.log(`\n How did the search do?`);
723
- console.log(` ${formatJudgmentPrompt()}`);
724
-
725
- const input = await prompt.question('> ');
726
- const judgment = parseJudgment(input);
727
-
728
- if (judgment === 'skip') {
729
- hilData.set(i, { reviewed: false });
730
- continue;
731
- }
732
-
733
- const hil: HilQueryData = {
734
- reviewed: true,
735
- reviewedAt: new Date().toISOString(),
736
- };
737
-
738
- if (judgment !== 'note') {
739
- hil.judgment = judgment;
740
- hil.humanScore = JUDGMENT_SCORES[judgment];
741
- }
742
-
743
- const note = await prompt.question(' Note (optional): ');
744
- if (note.trim()) {
745
- hil.note = note.trim();
746
- }
747
-
748
- hilData.set(i, hil);
749
- }
750
-
751
- prompt.close();
752
-
753
- // Calculate summary stats
754
- const reviewed = [...hilData.values()].filter(h => h.reviewed);
755
- const scores = reviewed.filter(h => h.humanScore !== undefined).map(h => h.humanScore!);
756
- const humanAvg = scores.length > 0 ? scores.reduce((a, b) => a + b, 0) / scores.length : 0;
757
- const flagged = reviewed.filter(h => h.flagged).length;
758
-
759
- // Generate synthesis via Claude
760
- console.log('\n📝 Generating synthesis...');
761
-
762
- const synthesisPrompt = `Summarize this human review of search quality results:
763
-
764
- AI average score: ${run.overallScore.toFixed(2)}
765
- Human average score: ${humanAvg.toFixed(2)}
766
- Queries reviewed: ${reviewed.length}/${parsed.evaluations.length}
767
-
768
- Human judgments:
769
- ${[...hilData.entries()].filter(([_, h]) => h.reviewed).map(([i, h]) => {
770
- const q = parsed.evaluations[i].query.query;
771
- return `- "${q}": ${h.judgment ?? 'note only'}${h.note ? ` - "${h.note}"` : ''}`;
772
- }).join('\n')}
773
-
774
- Provide:
775
- 1. A 2-3 sentence synthesis of the human feedback
776
- 2. 2-4 specific action items for improving search quality
777
-
778
- Return as JSON: { "synthesis": "...", "actionItems": ["...", "..."] }`;
779
-
780
- const result = execSync([
781
- CLAUDE_CLI,
782
- '-p', shellEscape(synthesisPrompt),
783
- '--output-format', 'json',
784
- ].join(' '), {
785
- cwd: ROOT_DIR,
786
- encoding: 'utf-8',
787
- timeout: 60000,
788
- });
789
-
790
- const synthResult = JSON.parse(result);
791
- const synthesis = synthResult.structured_output ?? JSON.parse(synthResult.result);
792
-
793
- // Build HIL review summary
794
- const hilReview: HilReviewSummary = {
795
- reviewedAt: new Date().toISOString(),
796
- queriesReviewed: reviewed.length,
797
- queriesSkipped: parsed.evaluations.length - reviewed.length,
798
- queriesFlagged: flagged,
799
- humanAverageScore: Math.round(humanAvg * 100) / 100,
800
- aiVsHumanDelta: Math.round((humanAvg - run.overallScore) * 100) / 100,
801
- synthesis: synthesis.synthesis,
802
- actionItems: synthesis.actionItems,
803
- };
804
-
805
- // Update the JSONL file with HIL data
806
- const updatedLines = parsed.lines.map((line, idx) => {
807
- if (line.type === 'query_evaluation') {
808
- const evalIdx = parsed.lines.slice(0, idx).filter(l => l.type === 'query_evaluation').length;
809
- const hil = hilData.get(evalIdx);
810
- if (hil) {
811
- return { ...line, data: { ...(line.data as object), hil } };
812
- }
813
- }
814
- if (line.type === 'run_summary') {
815
- return { ...line, data: { ...(line.data as object), hilReview } };
816
- }
817
- return line;
818
- });
819
-
820
- writeFileSync(run.path, updatedLines.map(l => JSON.stringify(l)).join('\n') + '\n');
821
-
822
- console.log(`\n✓ Review saved to ${run.path}`);
823
- console.log(`\n📊 Summary:`);
824
- console.log(` Human avg: ${humanAvg.toFixed(2)} (AI: ${run.overallScore.toFixed(2)}, delta: ${hilReview.aiVsHumanDelta >= 0 ? '+' : ''}${hilReview.aiVsHumanDelta.toFixed(2)})`);
825
- console.log(` ${hilReview.synthesis}`);
826
- console.log(`\n🎯 Action items:`);
827
- hilReview.actionItems.forEach((item, i) => console.log(` ${i + 1}. ${item}`));
828
- }
829
-
830
- async function main(): Promise<void> {
831
- const args = process.argv.slice(2);
832
-
833
- const showList = args.includes('--list');
834
- const runId = args.find(a => !a.startsWith('--'));
835
-
836
- if (showList || !runId) {
837
- const runs = listRuns();
838
- printRuns(runs);
839
- console.log('\nUsage: npm run test:quality:review -- <run-id>');
840
- return;
841
- }
842
-
843
- await reviewRun(runId);
844
- }
845
-
846
- main().catch(console.error);
847
- ```
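The summary statistics in `reviewRun()` reduce to a small pure function: average the human scores that exist, subtract the AI score, and round both values to two decimals. A sketch under that reading follows; `summarizeScores`, `Judgment`, and `round2` are hypothetical names, and the score table is copied from the plan.

```typescript
type Judgment = 'good' | 'okay' | 'poor' | 'terrible';

// Score table as given in the plan.
const SCORES: Record<Judgment, number> = { good: 1.0, okay: 0.7, poor: 0.4, terrible: 0.1 };

const round2 = (n: number): number => Math.round(n * 100) / 100;

// Entries without a judgment (skipped or note-only) do not affect the average.
function summarizeScores(
  judgments: ReadonlyArray<Judgment | undefined>,
  aiOverall: number,
): { humanAverageScore: number; aiVsHumanDelta: number } {
  const scores: number[] = [];
  for (const j of judgments) {
    if (j !== undefined) scores.push(SCORES[j]);
  }
  // As in the plan, an empty review falls back to an average of 0.
  const avg = scores.length > 0 ? scores.reduce((a, b) => a + b, 0) / scores.length : 0;
  return { humanAverageScore: round2(avg), aiVsHumanDelta: round2(avg - aiOverall) };
}

console.log(summarizeScores(['good', undefined, 'poor'], 0.85));
```

Note that with no scored judgments the average is 0, so the delta then simply mirrors the negated AI score and should not be read as a disagreement signal.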
848
-
849
- **Step 2: Commit**
850
-
851
- ```bash
852
- git add tests/scripts/quality-review.ts
853
- git commit -m "feat(tests): add post-run HIL review script"
854
- ```
855
-
856
- ---
857
-
858
- ## Task 7: Update package.json Scripts
859
-
860
- **Files:**
861
- - Modify: `package.json`
862
-
863
- **Step 1: Add new npm scripts**
864
-
865
- Find the scripts section and update:
866
-
867
- ```json
868
- "scripts": {
869
- "build": "tsup",
870
- "dev": "tsup --watch",
871
- "start": "node dist/index.js",
872
- "test": "vitest",
873
- "test:run": "vitest run",
874
- "test:corpus:index": "npx tsx tests/scripts/corpus-index.ts",
875
- "test:quality": "npx tsx tests/scripts/search-quality.ts",
876
- "test:quality:explore": "npx tsx tests/scripts/search-quality.ts --explore",
877
- "test:quality:baseline": "npx tsx tests/scripts/search-quality.ts --update-baseline",
878
- "test:quality:queries": "npx tsx tests/scripts/quality-queries.ts",
879
- "test:quality:generate": "npx tsx tests/scripts/quality-queries.ts --generate",
880
- "test:quality:review": "npx tsx tests/scripts/quality-review.ts",
881
- "lint": "eslint src/",
882
- "typecheck": "tsc --noEmit"
883
- }
884
- ```
885
-
886
- **Step 2: Commit**
887
-
888
- ```bash
889
- git add package.json
890
- git commit -m "chore: add HIL quality testing npm scripts"
891
- ```
892
-
893
- ---
894
-
895
- ## Task 8: Integration Test
896
-
897
- **Step 1: Test query listing**
898
-
899
- ```bash
900
- npm run test:quality:queries
901
- ```
902
-
903
- Expected: Shows available query sets
904
-
905
- **Step 2: Test quiet output**
906
-
907
- ```bash
908
- npm run test:quality -- --quiet --set core
909
- ```
910
-
911
- Expected: Runs with summary output per query
912
-
913
- **Step 3: Test run listing**
914
-
915
- ```bash
916
- npm run test:quality:review -- --list
917
- ```
918
-
919
- Expected: Shows recent test runs
920
-
921
- **Step 4: Final commit**
922
-
923
- ```bash
924
- git add -A
925
- git commit -m "test: verify HIL quality testing integration"
926
- ```