bluera-knowledge 0.13.0 → 0.13.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,752 +0,0 @@
1
- # AI Search Quality Testing Implementation Plan
2
-
3
- > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
4
-
5
- **Goal:** Build a CLI command that uses Claude to generate test queries and evaluate search result quality, outputting structured JSONL for trend tracking.
6
-
7
- **Architecture:** A TypeScript script invokes the Claude CLI once per run to generate queries from fixtures, then once per generated query to evaluate its search results. Results stream to timestamped JSONL files.
8
-
9
- **Tech Stack:** TypeScript, Node.js child_process, Claude CLI with JSON schema validation
10
-
11
- ---
12
-
13
- ### Task 1: Create Configuration File
14
-
15
- **Files:**
16
- - Create: `tests/quality-config.json`
17
-
18
- **Step 1: Create the config file**
19
-
20
- ```json
21
- {
22
- "queryCount": 15,
23
- "searchLimit": 10,
24
- "searchMode": "hybrid",
25
- "stores": null,
26
- "maxRetries": 3,
27
- "timeoutMs": 60000
28
- }
29
- ```
30
-
31
- **Step 2: Commit**
32
-
33
- ```bash
34
- git add tests/quality-config.json
35
- git commit -m "chore: add search quality test configuration"
36
- ```
37
-
38
- ---
39
-
40
- ### Task 2: Create Results Directory
41
-
42
- **Files:**
43
- - Create: `tests/quality-results/.gitkeep`
44
-
45
- **Step 1: Create directory with .gitkeep**
46
-
47
- ```bash
48
- mkdir -p tests/quality-results
49
- touch tests/quality-results/.gitkeep
50
- ```
51
-
52
- **Step 2: Add to .gitignore (keep .gitkeep, ignore results)**
53
-
54
- Append to `.gitignore`:
55
- ```
56
- tests/quality-results/*.jsonl
57
- ```
58
-
59
- **Step 3: Commit**
60
-
61
- ```bash
62
- git add tests/quality-results/.gitkeep .gitignore
63
- git commit -m "chore: add quality results directory structure"
64
- ```
65
-
66
- ---
67
-
68
- ### Task 3: Create JSON Schemas for Claude Output
69
-
70
- **Files:**
71
- - Create: `tests/scripts/schemas/query-generation.json`
72
- - Create: `tests/scripts/schemas/evaluation.json`
73
-
74
- **Step 1: Create query generation schema**
75
-
76
- `tests/scripts/schemas/query-generation.json`:
77
- ```json
78
- {
79
- "type": "object",
80
- "properties": {
81
- "queries": {
82
- "type": "array",
83
- "items": {
84
- "type": "object",
85
- "properties": {
86
- "query": { "type": "string" },
87
- "intent": { "type": "string" },
88
- "expectedContentTypes": {
89
- "type": "array",
90
- "items": { "type": "string" }
91
- }
92
- },
93
- "required": ["query", "intent"]
94
- }
95
- }
96
- },
97
- "required": ["queries"]
98
- }
99
- ```
100
-
101
- **Step 2: Create evaluation schema**
102
-
103
- `tests/scripts/schemas/evaluation.json`:
104
- ```json
105
- {
106
- "type": "object",
107
- "properties": {
108
- "scores": {
109
- "type": "object",
110
- "properties": {
111
- "relevance": { "type": "number", "minimum": 0, "maximum": 1 },
112
- "ranking": { "type": "number", "minimum": 0, "maximum": 1 },
113
- "coverage": { "type": "number", "minimum": 0, "maximum": 1 },
114
- "snippetQuality": { "type": "number", "minimum": 0, "maximum": 1 },
115
- "overall": { "type": "number", "minimum": 0, "maximum": 1 }
116
- },
117
- "required": ["relevance", "ranking", "coverage", "snippetQuality", "overall"]
118
- },
119
- "analysis": {
120
- "type": "object",
121
- "properties": {
122
- "relevance": { "type": "string" },
123
- "ranking": { "type": "string" },
124
- "coverage": { "type": "string" },
125
- "snippetQuality": { "type": "string" }
126
- },
127
- "required": ["relevance", "ranking", "coverage", "snippetQuality"]
128
- },
129
- "suggestions": {
130
- "type": "array",
131
- "items": { "type": "string" }
132
- },
133
- "resultAssessments": {
134
- "type": "array",
135
- "items": {
136
- "type": "object",
137
- "properties": {
138
- "rank": { "type": "integer" },
139
- "source": { "type": "string" },
140
- "relevant": { "type": "boolean" },
141
- "note": { "type": "string" }
142
- },
143
- "required": ["rank", "source", "relevant"]
144
- }
145
- }
146
- },
147
- "required": ["scores", "analysis", "suggestions", "resultAssessments"]
148
- }
149
- ```
150
-
151
- **Step 3: Commit**
152
-
153
- ```bash
154
- git add tests/scripts/schemas/
155
- git commit -m "feat: add JSON schemas for Claude quality evaluation"
156
- ```
157
-
158
- ---
159
-
160
- ### Task 4: Create Type Definitions
161
-
162
- **Files:**
163
- - Create: `tests/scripts/search-quality.types.ts`
164
-
165
- **Step 1: Create TypeScript types**
166
-
167
- ```typescript
168
- export interface QualityConfig {
169
- queryCount: number;
170
- searchLimit: number;
171
- searchMode: 'vector' | 'fts' | 'hybrid';
172
- stores: string[] | null;
173
- maxRetries: number;
174
- timeoutMs: number;
175
- }
176
-
177
- export interface GeneratedQuery {
178
- query: string;
179
- intent: string;
180
- expectedContentTypes?: string[];
181
- }
182
-
183
- export interface QueryGenerationResult {
184
- queries: GeneratedQuery[];
185
- }
186
-
187
- export interface Scores {
188
- relevance: number;
189
- ranking: number;
190
- coverage: number;
191
- snippetQuality: number;
192
- overall: number;
193
- }
194
-
195
- export interface Analysis {
196
- relevance: string;
197
- ranking: string;
198
- coverage: string;
199
- snippetQuality: string;
200
- }
201
-
202
- export interface ResultAssessment {
203
- rank: number;
204
- source: string;
205
- relevant: boolean;
206
- note?: string;
207
- }
208
-
209
- export interface EvaluationResult {
210
- scores: Scores;
211
- analysis: Analysis;
212
- suggestions: string[];
213
- resultAssessments: ResultAssessment[];
214
- }
215
-
216
- export interface QueryEvaluation {
217
- timestamp: string;
218
- query: string;
219
- queryIntent: string;
220
- searchMode: string;
221
- resultCount: number;
222
- scores: Scores;
223
- analysis: Analysis;
224
- suggestions: string[];
225
- results: ResultAssessment[];
226
- }
227
-
228
- export interface RunSummary {
229
- type: 'summary';
230
- timestamp: string;
231
- totalQueries: number;
232
- averageScores: Scores;
233
- topIssues: string[];
234
- recommendedFocus: string;
235
- }
236
- ```
237
-
238
- **Step 2: Commit**
239
-
240
- ```bash
241
- git add tests/scripts/search-quality.types.ts
242
- git commit -m "feat: add TypeScript types for quality testing"
243
- ```
244
-
245
- ---
246
-
247
- ### Task 5: Create Main Script - Imports and Config Loading
248
-
249
- **Files:**
250
- - Create: `tests/scripts/search-quality.ts`
251
-
252
- **Step 1: Create script with imports and config loading**
253
-
254
- ```typescript
255
- #!/usr/bin/env npx tsx
256
-
257
- import { execSync, execFileSync } from 'node:child_process';
258
- import { readFileSync, appendFileSync, mkdirSync, existsSync } from 'node:fs';
259
- import { join, dirname } from 'node:path';
260
- import { fileURLToPath } from 'node:url';
261
- import type {
262
- QualityConfig,
263
- QueryGenerationResult,
264
- EvaluationResult,
265
- QueryEvaluation,
266
- RunSummary,
267
- Scores,
268
- } from './search-quality.types.js';
269
-
270
- const __filename = fileURLToPath(import.meta.url);
271
- const __dirname = dirname(__filename);
272
- const ROOT_DIR = join(__dirname, '..', '..');
273
- const FIXTURES_DIR = join(__dirname, '..', 'fixtures');
274
- const RESULTS_DIR = join(__dirname, '..', 'quality-results');
275
- const SCHEMAS_DIR = join(__dirname, 'schemas');
276
-
277
- function loadConfig(): QualityConfig {
278
- const configPath = join(__dirname, '..', 'quality-config.json');
279
- const defaultConfig: QualityConfig = {
280
- queryCount: 15,
281
- searchLimit: 10,
282
- searchMode: 'hybrid',
283
- stores: null,
284
- maxRetries: 3,
285
- timeoutMs: 60000,
286
- };
287
-
288
- if (existsSync(configPath)) {
289
- const userConfig = JSON.parse(readFileSync(configPath, 'utf-8'));
290
- return { ...defaultConfig, ...userConfig };
291
- }
292
- return defaultConfig;
293
- }
294
-
295
- function loadSchema(name: string): string {
296
- return readFileSync(join(SCHEMAS_DIR, `${name}.json`), 'utf-8');
297
- }
298
-
299
- // Continue in next task...
300
- ```
301
-
302
- **Step 2: Commit**
303
-
304
- ```bash
305
- git add tests/scripts/search-quality.ts
306
- git commit -m "feat: add quality script skeleton with config loading"
307
- ```
308
-
309
- ---
310
-
311
- ### Task 6: Add Query Generation Function
312
-
313
- **Files:**
314
- - Modify: `tests/scripts/search-quality.ts`
315
-
316
- **Step 1: Add generateQueries function after loadSchema**
317
-
318
- ```typescript
319
- function generateQueries(config: QualityConfig): QueryGenerationResult {
320
- console.log('šŸ” Generating test queries from tests/fixtures/...');
321
-
322
- const schema = loadSchema('query-generation');
323
- const prompt = `You have access to explore the tests/fixtures/ directory which contains content that has been indexed in a knowledge store search system.
324
-
325
- Your task:
326
- 1. Use the Glob and Read tools to explore tests/fixtures/ and understand what content is available
327
- 2. Generate exactly ${config.queryCount} diverse search queries that would thoroughly test the search system
328
-
329
- Generate queries that:
330
- - Cover different content types (code, documentation, READMEs)
331
- - Range from specific (function names) to conceptual (design patterns)
332
- - Include some ambiguous queries that could match multiple files
333
- - Test edge cases (very short queries, natural language questions)
334
-
335
- Return your queries in the specified JSON format.`;
336
-
337
- const args = [
338
- 'claude',
339
- '-p',
340
- '--output-format', 'json',
341
- '--json-schema', JSON.stringify(JSON.parse(schema)),
342
- '--allowedTools', 'Glob,Read',
343
- prompt,
344
- ];
345
-
346
- try {
347
- const result = execSync(args.join(' '), {
348
- encoding: 'utf-8',
349
- timeout: config.timeoutMs * 2, // Extra time for exploration
350
- cwd: ROOT_DIR,
351
- maxBuffer: 10 * 1024 * 1024,
352
- });
353
-
354
- const parsed = JSON.parse(result) as QueryGenerationResult;
355
- console.log(`✓ Generated ${parsed.queries.length} queries\n`);
356
- return parsed;
357
- } catch (error) {
358
- console.error('Failed to generate queries:', error);
359
- throw error;
360
- }
361
- }
362
- ```
363
-
364
- **Step 2: Commit**
365
-
366
- ```bash
367
- git add tests/scripts/search-quality.ts
368
- git commit -m "feat: add query generation using Claude CLI"
369
- ```
370
-
371
- ---
372
-
373
- ### Task 7: Add Search Execution Function
374
-
375
- **Files:**
376
- - Modify: `tests/scripts/search-quality.ts`
377
-
378
- **Step 1: Add runSearch function**
379
-
380
- ```typescript
381
- interface SearchResult {
382
- rank: number;
383
- score: number;
384
- source: string;
385
- content: string;
386
- }
387
-
388
- function runSearch(query: string, config: QualityConfig): { results: SearchResult[]; rawOutput: string } {
389
- const args = [
390
- 'node', 'dist/index.js', 'search',
391
- query, // passed verbatim; execFileSync below supplies it as a single argv entry
392
- '--mode', config.searchMode,
393
- '--limit', String(config.searchLimit),
394
- '--include-content',
395
- ];
396
-
397
- if (config.stores && config.stores.length > 0) {
398
- args.push('--stores', config.stores.join(','));
399
- }
400
-
401
- try {
402
- const rawOutput = execFileSync(args[0], args.slice(1), { // no shell, so no quoting needed
403
- encoding: 'utf-8',
404
- timeout: config.timeoutMs,
405
- cwd: ROOT_DIR,
406
- });
407
-
408
- // Parse the output format: "1. [-0.23] /path/to/file.ts\n content..."
409
- const results: SearchResult[] = [];
410
- const lines = rawOutput.split('\n');
411
- let currentResult: Partial<SearchResult> | null = null;
412
-
413
- for (const line of lines) {
414
- const headerMatch = line.match(/^(\d+)\.\s+\[(-?[0-9.]+)\]\s+(.+)$/);
415
- if (headerMatch) {
416
- if (currentResult && currentResult.content !== undefined) {
417
- results.push(currentResult as SearchResult);
418
- }
419
- currentResult = {
420
- rank: parseInt(headerMatch[1], 10),
421
- score: parseFloat(headerMatch[2]),
422
- source: headerMatch[3].trim(),
423
- content: '',
424
- };
425
- } else if (currentResult && line.startsWith(' ')) {
426
- currentResult.content += (currentResult.content ? '\n' : '') + line.trim();
427
- }
428
- }
429
- if (currentResult && currentResult.content !== undefined) {
430
- results.push(currentResult as SearchResult);
431
- }
432
-
433
- return { results, rawOutput };
434
- } catch (error) {
435
- console.error(`Search failed for query "${query}":`, error);
436
- return { results: [], rawOutput: '' };
437
- }
438
- }
439
- ```
440
-
441
- **Step 2: Commit**
442
-
443
- ```bash
444
- git add tests/scripts/search-quality.ts
445
- git commit -m "feat: add search execution and result parsing"
446
- ```
447
-
448
- ---
449
-
450
- ### Task 8: Add Evaluation Function
451
-
452
- **Files:**
453
- - Modify: `tests/scripts/search-quality.ts`
454
-
455
- **Step 1: Add evaluateResults function**
456
-
457
- ```typescript
458
- function evaluateResults(
459
- query: string,
460
- intent: string,
461
- results: SearchResult[],
462
- config: QualityConfig
463
- ): EvaluationResult {
464
- const schema = loadSchema('evaluation');
465
-
466
- const resultsForPrompt = results.map(r => ({
467
- rank: r.rank,
468
- score: r.score,
469
- source: r.source,
470
- contentPreview: r.content.slice(0, 500) + (r.content.length > 500 ? '...' : ''),
471
- }));
472
-
473
- const prompt = `Evaluate these search results for quality.
474
-
475
- **Query:** "${query}"
476
- **Intent:** ${intent}
477
-
478
- **Search Results (${results.length} returned):**
479
- ${JSON.stringify(resultsForPrompt, null, 2)}
480
-
481
- Evaluate on these dimensions (0.0 to 1.0 scale):
482
-
483
- 1. **Relevance**: Do the results actually relate to the query intent?
484
- 2. **Ranking**: Are the most relevant results at the top?
485
- 3. **Coverage**: Did the search find the expected content? (Consider what SHOULD match)
486
- 4. **Snippet Quality**: Are the content previews useful and showing relevant sections?
487
- 5. **Overall**: Weighted assessment of search quality
488
-
489
- Provide:
490
- - Numeric scores for each dimension
491
- - Detailed analysis explaining each score
492
- - Specific, actionable suggestions for improving the search system
493
- - Assessment of each result (relevant or not, with notes)
494
-
495
- Be critical and specific. Your feedback will be used to improve the search system.`;
496
-
497
- const args = [
498
- '-p',
499
- '--output-format', 'json',
500
- '--json-schema', JSON.stringify(JSON.parse(schema)),
501
- prompt,
502
- ];
503
-
504
- try {
505
- // execFileSync (node:child_process) passes the multi-line prompt without shell quoting
506
- const result = execFileSync('claude', args, {
507
- encoding: 'utf-8',
508
- timeout: config.timeoutMs,
509
- cwd: ROOT_DIR,
510
- maxBuffer: 10 * 1024 * 1024,
511
- });
512
-
513
- return JSON.parse(result) as EvaluationResult;
514
- } catch (error) {
515
- console.error(`Evaluation failed for query "${query}":`, error);
516
- // Return a failure result
517
- return {
518
- scores: { relevance: 0, ranking: 0, coverage: 0, snippetQuality: 0, overall: 0 },
519
- analysis: {
520
- relevance: 'Evaluation failed',
521
- ranking: 'Evaluation failed',
522
- coverage: 'Evaluation failed',
523
- snippetQuality: 'Evaluation failed',
524
- },
525
- suggestions: ['Evaluation failed - check logs'],
526
- resultAssessments: [],
527
- };
528
- }
529
- }
530
- ```
531
-
532
- **Step 2: Commit**
533
-
534
- ```bash
535
- git add tests/scripts/search-quality.ts
536
- git commit -m "feat: add result evaluation using Claude CLI"
537
- ```
538
-
539
- ---
540
-
541
- ### Task 9: Add Summary Generation and Main Function
542
-
543
- **Files:**
544
- - Modify: `tests/scripts/search-quality.ts`
545
-
546
- **Step 1: Add summary generation and main function**
547
-
548
- ```typescript
549
- function generateSummary(evaluations: QueryEvaluation[]): RunSummary {
550
- const avgScores: Scores = {
551
- relevance: 0,
552
- ranking: 0,
553
- coverage: 0,
554
- snippetQuality: 0,
555
- overall: 0,
556
- };
557
-
558
- for (const eval_ of evaluations) {
559
- avgScores.relevance += eval_.scores.relevance;
560
- avgScores.ranking += eval_.scores.ranking;
561
- avgScores.coverage += eval_.scores.coverage;
562
- avgScores.snippetQuality += eval_.scores.snippetQuality;
563
- avgScores.overall += eval_.scores.overall;
564
- }
565
-
566
- const count = evaluations.length || 1;
567
- avgScores.relevance = Math.round((avgScores.relevance / count) * 100) / 100;
568
- avgScores.ranking = Math.round((avgScores.ranking / count) * 100) / 100;
569
- avgScores.coverage = Math.round((avgScores.coverage / count) * 100) / 100;
570
- avgScores.snippetQuality = Math.round((avgScores.snippetQuality / count) * 100) / 100;
571
- avgScores.overall = Math.round((avgScores.overall / count) * 100) / 100;
572
-
573
- // Collect all suggestions and count occurrences
574
- const suggestionCounts = new Map<string, number>();
575
- for (const eval_ of evaluations) {
576
- for (const suggestion of eval_.suggestions) {
577
- const key = suggestion.toLowerCase().slice(0, 50);
578
- suggestionCounts.set(key, (suggestionCounts.get(key) || 0) + 1);
579
- }
580
- }
581
-
582
- const topIssues = [...suggestionCounts.entries()]
583
- .sort((a, b) => b[1] - a[1])
584
- .slice(0, 5)
585
- .map(([issue]) => issue);
586
-
587
- // Find lowest scoring dimension
588
- const dimensions = ['relevance', 'ranking', 'coverage', 'snippetQuality'] as const;
589
- const lowestDim = dimensions.reduce((min, dim) =>
590
- avgScores[dim] < avgScores[min] ? dim : min
591
- );
592
-
593
- return {
594
- type: 'summary',
595
- timestamp: new Date().toISOString(),
596
- totalQueries: evaluations.length,
597
- averageScores: avgScores,
598
- topIssues,
599
- recommendedFocus: `${lowestDim} (avg: ${avgScores[lowestDim]}) - review evaluations for specific improvement suggestions`,
600
- };
601
- }
602
-
603
- async function main() {
604
- const config = loadConfig();
605
-
606
- // Ensure results directory exists
607
- if (!existsSync(RESULTS_DIR)) {
608
- mkdirSync(RESULTS_DIR, { recursive: true });
609
- }
610
-
611
- // Generate output filename
612
- const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, 19);
613
- const outputPath = join(RESULTS_DIR, `${timestamp}.jsonl`);
614
-
615
- // Phase 1: Generate queries
616
- const { queries } = generateQueries(config);
617
-
618
- // Phase 2: Evaluate each query
619
- console.log('šŸ“Š Evaluating search quality...');
620
- const evaluations: QueryEvaluation[] = [];
621
-
622
- for (let i = 0; i < queries.length; i++) {
623
- const q = queries[i];
624
- const progress = `[${i + 1}/${queries.length}]`;
625
-
626
- // Run search
627
- const { results } = runSearch(q.query, config);
628
-
629
- // Evaluate results
630
- const evaluation = evaluateResults(q.query, q.intent, results, config);
631
-
632
- // Build full evaluation record
633
- const record: QueryEvaluation = {
634
- timestamp: new Date().toISOString(),
635
- query: q.query,
636
- queryIntent: q.intent,
637
- searchMode: config.searchMode,
638
- resultCount: results.length,
639
- scores: evaluation.scores,
640
- analysis: evaluation.analysis,
641
- suggestions: evaluation.suggestions,
642
- results: evaluation.resultAssessments,
643
- };
644
-
645
- evaluations.push(record);
646
-
647
- // Write immediately (append)
648
- appendFileSync(outputPath, JSON.stringify(record) + '\n');
649
-
650
- console.log(` ${progress} "${q.query.slice(0, 40)}${q.query.length > 40 ? '...' : ''}" - overall: ${evaluation.scores.overall.toFixed(2)}`);
651
- }
652
-
653
- // Write summary
654
- const summary = generateSummary(evaluations);
655
- appendFileSync(outputPath, JSON.stringify(summary) + '\n');
656
-
657
- console.log(`\n✓ Results written to ${outputPath}`);
658
- console.log(`šŸ“ˆ Average overall score: ${summary.averageScores.overall}`);
659
-
660
- if (summary.topIssues.length > 0) {
661
- console.log('\nšŸŽÆ Top issues to address:');
662
- summary.topIssues.forEach((issue, i) => console.log(` ${i + 1}. ${issue}`));
663
- }
664
-
665
- console.log(`\nšŸ’” Recommended focus: ${summary.recommendedFocus}`);
666
- }
667
-
668
- main().catch((error) => { console.error(error); process.exit(1); });
669
- ```
670
-
671
- **Step 2: Commit**
672
-
673
- ```bash
674
- git add tests/scripts/search-quality.ts
675
- git commit -m "feat: add summary generation and main execution loop"
676
- ```
677
-
678
- ---
679
-
680
- ### Task 10: Add npm Script
681
-
682
- **Files:**
683
- - Modify: `package.json`
684
-
685
- **Step 1: Add npm script**
686
-
687
- Add to "scripts" section:
688
- ```json
689
- "test:search-quality": "npx tsx tests/scripts/search-quality.ts"
690
- ```
691
-
692
- **Step 2: Commit**
693
-
694
- ```bash
695
- git add package.json
696
- git commit -m "chore: add npm script for search quality testing"
697
- ```
698
-
699
- ---
700
-
701
- ### Task 11: Test the Implementation
702
-
703
- **Step 1: Build the project**
704
-
705
- ```bash
706
- npm run build
707
- ```
708
-
709
- **Step 2: Ensure test fixtures are indexed**
710
-
711
- ```bash
712
- node dist/index.js store create test-fixtures --type file --source tests/fixtures
713
- node dist/index.js index test-fixtures
714
- ```
715
-
716
- **Step 3: Run the quality test**
717
-
718
- ```bash
719
- npm run test:search-quality
720
- ```
721
-
722
- **Step 4: Verify output**
723
-
724
- Check that `tests/quality-results/` contains a new JSONL file with:
725
- - One JSON object per line for each query evaluation
726
- - Final line is the summary object
727
-
728
- **Step 5: Commit any fixes if needed**
729
-
730
- ---
731
-
732
- ### Task 12: Final Commit - Feature Complete
733
-
734
- **Step 1: Verify all files are committed**
735
-
736
- ```bash
737
- git status
738
- ```
739
-
740
- **Step 2: Create final commit if any uncommitted changes**
741
-
742
- ```bash
743
- git add -A
744
- git commit -m "feat: complete AI-powered search quality testing
745
-
746
- Adds npm run test:search-quality command that:
747
- - Uses Claude CLI to generate test queries from fixtures
748
- - Evaluates search results with detailed scoring and analysis
749
- - Outputs JSONL for trend tracking
750
-
751
- Closes #AI-quality-testing"
752
- ```