bluera-knowledge 0.13.0 → 0.13.2
- package/.claude/rules/code-quality.md +12 -0
- package/.claude/rules/git.md +5 -0
- package/.claude/rules/versioning.md +7 -0
- package/.claude-plugin/plugin.json +2 -15
- package/.mcp.json +11 -0
- package/CHANGELOG.md +7 -0
- package/CLAUDE.md +5 -13
- package/CONTRIBUTING.md +307 -0
- package/README.md +58 -1167
- package/commands/crawl.md +2 -1
- package/commands/test-plugin.md +197 -72
- package/docs/claude-code-best-practices.md +458 -0
- package/docs/cli.md +170 -0
- package/docs/commands.md +392 -0
- package/docs/crawler-architecture.md +89 -0
- package/docs/mcp-integration.md +130 -0
- package/docs/token-efficiency.md +91 -0
- package/eslint.config.js +1 -1
- package/hooks/check-dependencies.sh +18 -1
- package/hooks/hooks.json +2 -2
- package/hooks/posttooluse-bk-reminder.py +30 -2
- package/package.json +1 -1
- package/scripts/test-mcp-dev.js +260 -0
- package/src/mcp/plugin-mcp-config.test.ts +26 -19
- package/tests/integration/cli-consistency.test.ts +3 -2
- package/docs/plans/2024-12-17-ai-search-quality-implementation.md +0 -752
- package/docs/plans/2024-12-17-ai-search-quality-testing-design.md +0 -201
- package/docs/plans/2025-12-16-bluera-knowledge-cli.md +0 -2951
- package/docs/plans/2025-12-16-phase2-features.md +0 -1518
- package/docs/plans/2025-12-17-hil-implementation.md +0 -926
- package/docs/plans/2025-12-17-hil-quality-testing.md +0 -224
- package/docs/plans/2025-12-17-search-quality-phase1-implementation.md +0 -1416
- package/docs/plans/2025-12-17-search-quality-testing-v2-design.md +0 -212
- package/docs/plans/2025-12-28-ai-agent-optimization.md +0 -1630
|
@@ -1,752 +0,0 @@
|
|
|
1
|
-
# AI Search Quality Testing Implementation Plan
|
|
2
|
-
|
|
3
|
-
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
|
4
|
-
|
|
5
|
-
**Goal:** Build a CLI command that uses Claude to generate test queries and evaluate search result quality, outputting structured JSONL for trend tracking.
|
|
6
|
-
|
|
7
|
-
**Architecture:** TypeScript script invokes Claude CLI twice per run - once to generate queries from fixtures, once per query to evaluate search results. Results stream to timestamped JSONL files.
|
|
8
|
-
|
|
9
|
-
**Tech Stack:** TypeScript, Node.js child_process, Claude CLI with JSON schema validation
|
|
10
|
-
|
|
11
|
-
---
|
|
12
|
-
|
|
13
|
-
### Task 1: Create Configuration File
|
|
14
|
-
|
|
15
|
-
**Files:**
|
|
16
|
-
- Create: `tests/quality-config.json`
|
|
17
|
-
|
|
18
|
-
**Step 1: Create the config file**
|
|
19
|
-
|
|
20
|
-
```json
|
|
21
|
-
{
|
|
22
|
-
"queryCount": 15,
|
|
23
|
-
"searchLimit": 10,
|
|
24
|
-
"searchMode": "hybrid",
|
|
25
|
-
"stores": null,
|
|
26
|
-
"maxRetries": 3,
|
|
27
|
-
"timeoutMs": 60000
|
|
28
|
-
}
|
|
29
|
-
```
|
|
30
|
-
|
|
31
|
-
**Step 2: Commit**
|
|
32
|
-
|
|
33
|
-
```bash
|
|
34
|
-
git add tests/quality-config.json
|
|
35
|
-
git commit -m "chore: add search quality test configuration"
|
|
36
|
-
```
|
|
37
|
-
|
|
38
|
-
---
|
|
39
|
-
|
|
40
|
-
### Task 2: Create Results Directory
|
|
41
|
-
|
|
42
|
-
**Files:**
|
|
43
|
-
- Create: `tests/quality-results/.gitkeep`
|
|
44
|
-
|
|
45
|
-
**Step 1: Create directory with .gitkeep**
|
|
46
|
-
|
|
47
|
-
```bash
|
|
48
|
-
mkdir -p tests/quality-results
|
|
49
|
-
touch tests/quality-results/.gitkeep
|
|
50
|
-
```
|
|
51
|
-
|
|
52
|
-
**Step 2: Add to .gitignore (keep .gitkeep, ignore results)**
|
|
53
|
-
|
|
54
|
-
Append to `.gitignore`:
|
|
55
|
-
```
|
|
56
|
-
tests/quality-results/*.jsonl
|
|
57
|
-
```
|
|
58
|
-
|
|
59
|
-
**Step 3: Commit**
|
|
60
|
-
|
|
61
|
-
```bash
|
|
62
|
-
git add tests/quality-results/.gitkeep .gitignore
|
|
63
|
-
git commit -m "chore: add quality results directory structure"
|
|
64
|
-
```
|
|
65
|
-
|
|
66
|
-
---
|
|
67
|
-
|
|
68
|
-
### Task 3: Create JSON Schemas for Claude Output
|
|
69
|
-
|
|
70
|
-
**Files:**
|
|
71
|
-
- Create: `tests/scripts/schemas/query-generation.json`
|
|
72
|
-
- Create: `tests/scripts/schemas/evaluation.json`
|
|
73
|
-
|
|
74
|
-
**Step 1: Create query generation schema**
|
|
75
|
-
|
|
76
|
-
`tests/scripts/schemas/query-generation.json`:
|
|
77
|
-
```json
|
|
78
|
-
{
|
|
79
|
-
"type": "object",
|
|
80
|
-
"properties": {
|
|
81
|
-
"queries": {
|
|
82
|
-
"type": "array",
|
|
83
|
-
"items": {
|
|
84
|
-
"type": "object",
|
|
85
|
-
"properties": {
|
|
86
|
-
"query": { "type": "string" },
|
|
87
|
-
"intent": { "type": "string" },
|
|
88
|
-
"expectedContentTypes": {
|
|
89
|
-
"type": "array",
|
|
90
|
-
"items": { "type": "string" }
|
|
91
|
-
}
|
|
92
|
-
},
|
|
93
|
-
"required": ["query", "intent"]
|
|
94
|
-
}
|
|
95
|
-
}
|
|
96
|
-
},
|
|
97
|
-
"required": ["queries"]
|
|
98
|
-
}
|
|
99
|
-
```
|
|
100
|
-
|
|
101
|
-
**Step 2: Create evaluation schema**
|
|
102
|
-
|
|
103
|
-
`tests/scripts/schemas/evaluation.json`:
|
|
104
|
-
```json
|
|
105
|
-
{
|
|
106
|
-
"type": "object",
|
|
107
|
-
"properties": {
|
|
108
|
-
"scores": {
|
|
109
|
-
"type": "object",
|
|
110
|
-
"properties": {
|
|
111
|
-
"relevance": { "type": "number", "minimum": 0, "maximum": 1 },
|
|
112
|
-
"ranking": { "type": "number", "minimum": 0, "maximum": 1 },
|
|
113
|
-
"coverage": { "type": "number", "minimum": 0, "maximum": 1 },
|
|
114
|
-
"snippetQuality": { "type": "number", "minimum": 0, "maximum": 1 },
|
|
115
|
-
"overall": { "type": "number", "minimum": 0, "maximum": 1 }
|
|
116
|
-
},
|
|
117
|
-
"required": ["relevance", "ranking", "coverage", "snippetQuality", "overall"]
|
|
118
|
-
},
|
|
119
|
-
"analysis": {
|
|
120
|
-
"type": "object",
|
|
121
|
-
"properties": {
|
|
122
|
-
"relevance": { "type": "string" },
|
|
123
|
-
"ranking": { "type": "string" },
|
|
124
|
-
"coverage": { "type": "string" },
|
|
125
|
-
"snippetQuality": { "type": "string" }
|
|
126
|
-
},
|
|
127
|
-
"required": ["relevance", "ranking", "coverage", "snippetQuality"]
|
|
128
|
-
},
|
|
129
|
-
"suggestions": {
|
|
130
|
-
"type": "array",
|
|
131
|
-
"items": { "type": "string" }
|
|
132
|
-
},
|
|
133
|
-
"resultAssessments": {
|
|
134
|
-
"type": "array",
|
|
135
|
-
"items": {
|
|
136
|
-
"type": "object",
|
|
137
|
-
"properties": {
|
|
138
|
-
"rank": { "type": "integer" },
|
|
139
|
-
"source": { "type": "string" },
|
|
140
|
-
"relevant": { "type": "boolean" },
|
|
141
|
-
"note": { "type": "string" }
|
|
142
|
-
},
|
|
143
|
-
"required": ["rank", "source", "relevant"]
|
|
144
|
-
}
|
|
145
|
-
}
|
|
146
|
-
},
|
|
147
|
-
"required": ["scores", "analysis", "suggestions", "resultAssessments"]
|
|
148
|
-
}
|
|
149
|
-
```
|
|
150
|
-
|
|
151
|
-
**Step 3: Commit**
|
|
152
|
-
|
|
153
|
-
```bash
|
|
154
|
-
git add tests/scripts/schemas/
|
|
155
|
-
git commit -m "feat: add JSON schemas for Claude quality evaluation"
|
|
156
|
-
```
|
|
157
|
-
|
|
158
|
-
---
|
|
159
|
-
|
|
160
|
-
### Task 4: Create Type Definitions
|
|
161
|
-
|
|
162
|
-
**Files:**
|
|
163
|
-
- Create: `tests/scripts/search-quality.types.ts`
|
|
164
|
-
|
|
165
|
-
**Step 1: Create TypeScript types**
|
|
166
|
-
|
|
167
|
-
```typescript
|
|
168
|
-
export interface QualityConfig {
|
|
169
|
-
queryCount: number;
|
|
170
|
-
searchLimit: number;
|
|
171
|
-
searchMode: 'vector' | 'fts' | 'hybrid';
|
|
172
|
-
stores: string[] | null;
|
|
173
|
-
maxRetries: number;
|
|
174
|
-
timeoutMs: number;
|
|
175
|
-
}
|
|
176
|
-
|
|
177
|
-
export interface GeneratedQuery {
|
|
178
|
-
query: string;
|
|
179
|
-
intent: string;
|
|
180
|
-
expectedContentTypes?: string[];
|
|
181
|
-
}
|
|
182
|
-
|
|
183
|
-
export interface QueryGenerationResult {
|
|
184
|
-
queries: GeneratedQuery[];
|
|
185
|
-
}
|
|
186
|
-
|
|
187
|
-
export interface Scores {
|
|
188
|
-
relevance: number;
|
|
189
|
-
ranking: number;
|
|
190
|
-
coverage: number;
|
|
191
|
-
snippetQuality: number;
|
|
192
|
-
overall: number;
|
|
193
|
-
}
|
|
194
|
-
|
|
195
|
-
export interface Analysis {
|
|
196
|
-
relevance: string;
|
|
197
|
-
ranking: string;
|
|
198
|
-
coverage: string;
|
|
199
|
-
snippetQuality: string;
|
|
200
|
-
}
|
|
201
|
-
|
|
202
|
-
export interface ResultAssessment {
|
|
203
|
-
rank: number;
|
|
204
|
-
source: string;
|
|
205
|
-
relevant: boolean;
|
|
206
|
-
note?: string;
|
|
207
|
-
}
|
|
208
|
-
|
|
209
|
-
export interface EvaluationResult {
|
|
210
|
-
scores: Scores;
|
|
211
|
-
analysis: Analysis;
|
|
212
|
-
suggestions: string[];
|
|
213
|
-
resultAssessments: ResultAssessment[];
|
|
214
|
-
}
|
|
215
|
-
|
|
216
|
-
export interface QueryEvaluation {
|
|
217
|
-
timestamp: string;
|
|
218
|
-
query: string;
|
|
219
|
-
queryIntent: string;
|
|
220
|
-
searchMode: string;
|
|
221
|
-
resultCount: number;
|
|
222
|
-
scores: Scores;
|
|
223
|
-
analysis: Analysis;
|
|
224
|
-
suggestions: string[];
|
|
225
|
-
results: ResultAssessment[];
|
|
226
|
-
}
|
|
227
|
-
|
|
228
|
-
export interface RunSummary {
|
|
229
|
-
type: 'summary';
|
|
230
|
-
timestamp: string;
|
|
231
|
-
totalQueries: number;
|
|
232
|
-
averageScores: Scores;
|
|
233
|
-
topIssues: string[];
|
|
234
|
-
recommendedFocus: string;
|
|
235
|
-
}
|
|
236
|
-
```
|
|
237
|
-
|
|
238
|
-
**Step 2: Commit**
|
|
239
|
-
|
|
240
|
-
```bash
|
|
241
|
-
git add tests/scripts/search-quality.types.ts
|
|
242
|
-
git commit -m "feat: add TypeScript types for quality testing"
|
|
243
|
-
```
|
|
244
|
-
|
|
245
|
-
---
|
|
246
|
-
|
|
247
|
-
### Task 5: Create Main Script - Imports and Config Loading
|
|
248
|
-
|
|
249
|
-
**Files:**
|
|
250
|
-
- Create: `tests/scripts/search-quality.ts`
|
|
251
|
-
|
|
252
|
-
**Step 1: Create script with imports and config loading**
|
|
253
|
-
|
|
254
|
-
```typescript
|
|
255
|
-
#!/usr/bin/env npx tsx
|
|
256
|
-
|
|
257
|
-
import { execSync } from 'node:child_process';
|
|
258
|
-
import { readFileSync, writeFileSync, appendFileSync, mkdirSync, existsSync } from 'node:fs';
|
|
259
|
-
import { join, dirname } from 'node:path';
|
|
260
|
-
import { fileURLToPath } from 'node:url';
|
|
261
|
-
import type {
|
|
262
|
-
QualityConfig,
|
|
263
|
-
QueryGenerationResult,
|
|
264
|
-
EvaluationResult,
|
|
265
|
-
QueryEvaluation,
|
|
266
|
-
RunSummary,
|
|
267
|
-
Scores,
|
|
268
|
-
} from './search-quality.types.js';
|
|
269
|
-
|
|
270
|
-
const __filename = fileURLToPath(import.meta.url);
|
|
271
|
-
const __dirname = dirname(__filename);
|
|
272
|
-
const ROOT_DIR = join(__dirname, '..', '..');
|
|
273
|
-
const FIXTURES_DIR = join(__dirname, '..', 'fixtures');
|
|
274
|
-
const RESULTS_DIR = join(__dirname, '..', 'quality-results');
|
|
275
|
-
const SCHEMAS_DIR = join(__dirname, 'schemas');
|
|
276
|
-
|
|
277
|
-
function loadConfig(): QualityConfig {
|
|
278
|
-
const configPath = join(__dirname, '..', 'quality-config.json');
|
|
279
|
-
const defaultConfig: QualityConfig = {
|
|
280
|
-
queryCount: 15,
|
|
281
|
-
searchLimit: 10,
|
|
282
|
-
searchMode: 'hybrid',
|
|
283
|
-
stores: null,
|
|
284
|
-
maxRetries: 3,
|
|
285
|
-
timeoutMs: 60000,
|
|
286
|
-
};
|
|
287
|
-
|
|
288
|
-
if (existsSync(configPath)) {
|
|
289
|
-
const userConfig = JSON.parse(readFileSync(configPath, 'utf-8'));
|
|
290
|
-
return { ...defaultConfig, ...userConfig };
|
|
291
|
-
}
|
|
292
|
-
return defaultConfig;
|
|
293
|
-
}
|
|
294
|
-
|
|
295
|
-
function loadSchema(name: string): string {
|
|
296
|
-
return readFileSync(join(SCHEMAS_DIR, `${name}.json`), 'utf-8');
|
|
297
|
-
}
|
|
298
|
-
|
|
299
|
-
// Continue in next task...
|
|
300
|
-
```
|
|
301
|
-
|
|
302
|
-
**Step 2: Commit**
|
|
303
|
-
|
|
304
|
-
```bash
|
|
305
|
-
git add tests/scripts/search-quality.ts
|
|
306
|
-
git commit -m "feat: add quality script skeleton with config loading"
|
|
307
|
-
```
|
|
308
|
-
|
|
309
|
-
---
|
|
310
|
-
|
|
311
|
-
### Task 6: Add Query Generation Function
|
|
312
|
-
|
|
313
|
-
**Files:**
|
|
314
|
-
- Modify: `tests/scripts/search-quality.ts`
|
|
315
|
-
|
|
316
|
-
**Step 1: Add generateQueries function after loadSchema**
|
|
317
|
-
|
|
318
|
-
```typescript
|
|
319
|
-
function generateQueries(config: QualityConfig): QueryGenerationResult {
|
|
320
|
-
console.log('Generating test queries from tests/fixtures/...');
|
|
321
|
-
|
|
322
|
-
const schema = loadSchema('query-generation');
|
|
323
|
-
const prompt = `You have access to explore the tests/fixtures/ directory which contains content that has been indexed in a knowledge store search system.
|
|
324
|
-
|
|
325
|
-
Your task:
|
|
326
|
-
1. Use the Glob and Read tools to explore tests/fixtures/ and understand what content is available
|
|
327
|
-
2. Generate exactly ${config.queryCount} diverse search queries that would thoroughly test the search system
|
|
328
|
-
|
|
329
|
-
Generate queries that:
|
|
330
|
-
- Cover different content types (code, documentation, READMEs)
|
|
331
|
-
- Range from specific (function names) to conceptual (design patterns)
|
|
332
|
-
- Include some ambiguous queries that could match multiple files
|
|
333
|
-
- Test edge cases (very short queries, natural language questions)
|
|
334
|
-
|
|
335
|
-
Return your queries in the specified JSON format.`;
|
|
336
|
-
|
|
337
|
-
const args = [
|
|
338
|
-
'claude',
|
|
339
|
-
'-p',
|
|
340
|
-
'--output-format', 'json',
|
|
341
|
-
'--json-schema', JSON.stringify(JSON.parse(schema)),
|
|
342
|
-
'--allowedTools', 'Glob,Read',
|
|
343
|
-
prompt,
|
|
344
|
-
];
|
|
345
|
-
|
|
346
|
-
try {
|
|
347
|
-
const result = execSync(args.join(' '), {
|
|
348
|
-
encoding: 'utf-8',
|
|
349
|
-
timeout: config.timeoutMs * 2, // Extra time for exploration
|
|
350
|
-
cwd: ROOT_DIR,
|
|
351
|
-
maxBuffer: 10 * 1024 * 1024,
|
|
352
|
-
});
|
|
353
|
-
|
|
354
|
-
const parsed = JSON.parse(result) as QueryGenerationResult;
|
|
355
|
-
console.log(`Generated ${parsed.queries.length} queries\n`);
|
|
356
|
-
return parsed;
|
|
357
|
-
} catch (error) {
|
|
358
|
-
console.error('Failed to generate queries:', error);
|
|
359
|
-
throw error;
|
|
360
|
-
}
|
|
361
|
-
}
|
|
362
|
-
```
|
|
363
|
-
|
|
364
|
-
**Step 2: Commit**
|
|
365
|
-
|
|
366
|
-
```bash
|
|
367
|
-
git add tests/scripts/search-quality.ts
|
|
368
|
-
git commit -m "feat: add query generation using Claude CLI"
|
|
369
|
-
```
|
|
370
|
-
|
|
371
|
-
---
|
|
372
|
-
|
|
373
|
-
### Task 7: Add Search Execution Function
|
|
374
|
-
|
|
375
|
-
**Files:**
|
|
376
|
-
- Modify: `tests/scripts/search-quality.ts`
|
|
377
|
-
|
|
378
|
-
**Step 1: Add runSearch function**
|
|
379
|
-
|
|
380
|
-
```typescript
|
|
381
|
-
interface SearchResult {
|
|
382
|
-
rank: number;
|
|
383
|
-
score: number;
|
|
384
|
-
source: string;
|
|
385
|
-
content: string;
|
|
386
|
-
}
|
|
387
|
-
|
|
388
|
-
function runSearch(query: string, config: QualityConfig): { results: SearchResult[]; rawOutput: string } {
|
|
389
|
-
const args = [
|
|
390
|
-
'node', 'dist/index.js', 'search',
|
|
391
|
-
JSON.stringify(query),
|
|
392
|
-
'--mode', config.searchMode,
|
|
393
|
-
'--limit', String(config.searchLimit),
|
|
394
|
-
'--include-content',
|
|
395
|
-
];
|
|
396
|
-
|
|
397
|
-
if (config.stores && config.stores.length > 0) {
|
|
398
|
-
args.push('--stores', config.stores.join(','));
|
|
399
|
-
}
|
|
400
|
-
|
|
401
|
-
try {
|
|
402
|
-
const rawOutput = execSync(args.join(' '), {
|
|
403
|
-
encoding: 'utf-8',
|
|
404
|
-
timeout: config.timeoutMs,
|
|
405
|
-
cwd: ROOT_DIR,
|
|
406
|
-
});
|
|
407
|
-
|
|
408
|
-
// Parse the output format: "1. [-0.23] /path/to/file.ts\n content..."
|
|
409
|
-
const results: SearchResult[] = [];
|
|
410
|
-
const lines = rawOutput.split('\n');
|
|
411
|
-
let currentResult: Partial<SearchResult> | null = null;
|
|
412
|
-
|
|
413
|
-
for (const line of lines) {
|
|
414
|
-
const headerMatch = line.match(/^(\d+)\.\s+\[(-?[0-9.]+)\]\s+(.+)$/);
|
|
415
|
-
if (headerMatch) {
|
|
416
|
-
if (currentResult && currentResult.content !== undefined) {
|
|
417
|
-
results.push(currentResult as SearchResult);
|
|
418
|
-
}
|
|
419
|
-
currentResult = {
|
|
420
|
-
rank: parseInt(headerMatch[1], 10),
|
|
421
|
-
score: parseFloat(headerMatch[2]),
|
|
422
|
-
source: headerMatch[3].trim(),
|
|
423
|
-
content: '',
|
|
424
|
-
};
|
|
425
|
-
} else if (currentResult && line.startsWith(' ')) {
|
|
426
|
-
currentResult.content += (currentResult.content ? '\n' : '') + line.trim();
|
|
427
|
-
}
|
|
428
|
-
}
|
|
429
|
-
if (currentResult && currentResult.content !== undefined) {
|
|
430
|
-
results.push(currentResult as SearchResult);
|
|
431
|
-
}
|
|
432
|
-
|
|
433
|
-
return { results, rawOutput };
|
|
434
|
-
} catch (error) {
|
|
435
|
-
console.error(`Search failed for query "${query}":`, error);
|
|
436
|
-
return { results: [], rawOutput: '' };
|
|
437
|
-
}
|
|
438
|
-
}
|
|
439
|
-
```
|
|
440
|
-
|
|
441
|
-
**Step 2: Commit**
|
|
442
|
-
|
|
443
|
-
```bash
|
|
444
|
-
git add tests/scripts/search-quality.ts
|
|
445
|
-
git commit -m "feat: add search execution and result parsing"
|
|
446
|
-
```
|
|
447
|
-
|
|
448
|
-
---
|
|
449
|
-
|
|
450
|
-
### Task 8: Add Evaluation Function
|
|
451
|
-
|
|
452
|
-
**Files:**
|
|
453
|
-
- Modify: `tests/scripts/search-quality.ts`
|
|
454
|
-
|
|
455
|
-
**Step 1: Add evaluateResults function**
|
|
456
|
-
|
|
457
|
-
```typescript
|
|
458
|
-
function evaluateResults(
|
|
459
|
-
query: string,
|
|
460
|
-
intent: string,
|
|
461
|
-
results: SearchResult[],
|
|
462
|
-
config: QualityConfig
|
|
463
|
-
): EvaluationResult {
|
|
464
|
-
const schema = loadSchema('evaluation');
|
|
465
|
-
|
|
466
|
-
const resultsForPrompt = results.map(r => ({
|
|
467
|
-
rank: r.rank,
|
|
468
|
-
score: r.score,
|
|
469
|
-
source: r.source,
|
|
470
|
-
contentPreview: r.content.slice(0, 500) + (r.content.length > 500 ? '...' : ''),
|
|
471
|
-
}));
|
|
472
|
-
|
|
473
|
-
const prompt = `Evaluate these search results for quality.
|
|
474
|
-
|
|
475
|
-
**Query:** "${query}"
|
|
476
|
-
**Intent:** ${intent}
|
|
477
|
-
|
|
478
|
-
**Search Results (${results.length} returned):**
|
|
479
|
-
${JSON.stringify(resultsForPrompt, null, 2)}
|
|
480
|
-
|
|
481
|
-
Evaluate on these dimensions (0.0 to 1.0 scale):
|
|
482
|
-
|
|
483
|
-
1. **Relevance**: Do the results actually relate to the query intent?
|
|
484
|
-
2. **Ranking**: Are the most relevant results at the top?
|
|
485
|
-
3. **Coverage**: Did the search find the expected content? (Consider what SHOULD match)
|
|
486
|
-
4. **Snippet Quality**: Are the content previews useful and showing relevant sections?
|
|
487
|
-
5. **Overall**: Weighted assessment of search quality
|
|
488
|
-
|
|
489
|
-
Provide:
|
|
490
|
-
- Numeric scores for each dimension
|
|
491
|
-
- Detailed analysis explaining each score
|
|
492
|
-
- Specific, actionable suggestions for improving the search system
|
|
493
|
-
- Assessment of each result (relevant or not, with notes)
|
|
494
|
-
|
|
495
|
-
Be critical and specific. Your feedback will be used to improve the search system.`;
|
|
496
|
-
|
|
497
|
-
const args = [
|
|
498
|
-
'claude',
|
|
499
|
-
'-p',
|
|
500
|
-
'--output-format', 'json',
|
|
501
|
-
'--json-schema', JSON.stringify(JSON.parse(schema)),
|
|
502
|
-
prompt,
|
|
503
|
-
];
|
|
504
|
-
|
|
505
|
-
try {
|
|
506
|
-
const result = execSync(args.join(' '), {
|
|
507
|
-
encoding: 'utf-8',
|
|
508
|
-
timeout: config.timeoutMs,
|
|
509
|
-
cwd: ROOT_DIR,
|
|
510
|
-
maxBuffer: 10 * 1024 * 1024,
|
|
511
|
-
});
|
|
512
|
-
|
|
513
|
-
return JSON.parse(result) as EvaluationResult;
|
|
514
|
-
} catch (error) {
|
|
515
|
-
console.error(`Evaluation failed for query "${query}":`, error);
|
|
516
|
-
// Return a failure result
|
|
517
|
-
return {
|
|
518
|
-
scores: { relevance: 0, ranking: 0, coverage: 0, snippetQuality: 0, overall: 0 },
|
|
519
|
-
analysis: {
|
|
520
|
-
relevance: 'Evaluation failed',
|
|
521
|
-
ranking: 'Evaluation failed',
|
|
522
|
-
coverage: 'Evaluation failed',
|
|
523
|
-
snippetQuality: 'Evaluation failed',
|
|
524
|
-
},
|
|
525
|
-
suggestions: ['Evaluation failed - check logs'],
|
|
526
|
-
resultAssessments: [],
|
|
527
|
-
};
|
|
528
|
-
}
|
|
529
|
-
}
|
|
530
|
-
```
|
|
531
|
-
|
|
532
|
-
**Step 2: Commit**
|
|
533
|
-
|
|
534
|
-
```bash
|
|
535
|
-
git add tests/scripts/search-quality.ts
|
|
536
|
-
git commit -m "feat: add result evaluation using Claude CLI"
|
|
537
|
-
```
|
|
538
|
-
|
|
539
|
-
---
|
|
540
|
-
|
|
541
|
-
### Task 9: Add Summary Generation and Main Function
|
|
542
|
-
|
|
543
|
-
**Files:**
|
|
544
|
-
- Modify: `tests/scripts/search-quality.ts`
|
|
545
|
-
|
|
546
|
-
**Step 1: Add summary generation and main function**
|
|
547
|
-
|
|
548
|
-
```typescript
|
|
549
|
-
function generateSummary(evaluations: QueryEvaluation[]): RunSummary {
|
|
550
|
-
const avgScores: Scores = {
|
|
551
|
-
relevance: 0,
|
|
552
|
-
ranking: 0,
|
|
553
|
-
coverage: 0,
|
|
554
|
-
snippetQuality: 0,
|
|
555
|
-
overall: 0,
|
|
556
|
-
};
|
|
557
|
-
|
|
558
|
-
for (const eval_ of evaluations) {
|
|
559
|
-
avgScores.relevance += eval_.scores.relevance;
|
|
560
|
-
avgScores.ranking += eval_.scores.ranking;
|
|
561
|
-
avgScores.coverage += eval_.scores.coverage;
|
|
562
|
-
avgScores.snippetQuality += eval_.scores.snippetQuality;
|
|
563
|
-
avgScores.overall += eval_.scores.overall;
|
|
564
|
-
}
|
|
565
|
-
|
|
566
|
-
const count = evaluations.length || 1;
|
|
567
|
-
avgScores.relevance = Math.round((avgScores.relevance / count) * 100) / 100;
|
|
568
|
-
avgScores.ranking = Math.round((avgScores.ranking / count) * 100) / 100;
|
|
569
|
-
avgScores.coverage = Math.round((avgScores.coverage / count) * 100) / 100;
|
|
570
|
-
avgScores.snippetQuality = Math.round((avgScores.snippetQuality / count) * 100) / 100;
|
|
571
|
-
avgScores.overall = Math.round((avgScores.overall / count) * 100) / 100;
|
|
572
|
-
|
|
573
|
-
// Collect all suggestions and count occurrences
|
|
574
|
-
const suggestionCounts = new Map<string, number>();
|
|
575
|
-
for (const eval_ of evaluations) {
|
|
576
|
-
for (const suggestion of eval_.suggestions) {
|
|
577
|
-
const key = suggestion.toLowerCase().slice(0, 50);
|
|
578
|
-
suggestionCounts.set(key, (suggestionCounts.get(key) || 0) + 1);
|
|
579
|
-
}
|
|
580
|
-
}
|
|
581
|
-
|
|
582
|
-
const topIssues = [...suggestionCounts.entries()]
|
|
583
|
-
.sort((a, b) => b[1] - a[1])
|
|
584
|
-
.slice(0, 5)
|
|
585
|
-
.map(([issue]) => issue);
|
|
586
|
-
|
|
587
|
-
// Find lowest scoring dimension
|
|
588
|
-
const dimensions = ['relevance', 'ranking', 'coverage', 'snippetQuality'] as const;
|
|
589
|
-
const lowestDim = dimensions.reduce((min, dim) =>
|
|
590
|
-
avgScores[dim] < avgScores[min] ? dim : min
|
|
591
|
-
);
|
|
592
|
-
|
|
593
|
-
return {
|
|
594
|
-
type: 'summary',
|
|
595
|
-
timestamp: new Date().toISOString(),
|
|
596
|
-
totalQueries: evaluations.length,
|
|
597
|
-
averageScores: avgScores,
|
|
598
|
-
topIssues,
|
|
599
|
-
recommendedFocus: `${lowestDim} (avg: ${avgScores[lowestDim]}) - review evaluations for specific improvement suggestions`,
|
|
600
|
-
};
|
|
601
|
-
}
|
|
602
|
-
|
|
603
|
-
async function main() {
|
|
604
|
-
const config = loadConfig();
|
|
605
|
-
|
|
606
|
-
// Ensure results directory exists
|
|
607
|
-
if (!existsSync(RESULTS_DIR)) {
|
|
608
|
-
mkdirSync(RESULTS_DIR, { recursive: true });
|
|
609
|
-
}
|
|
610
|
-
|
|
611
|
-
// Generate output filename
|
|
612
|
-
const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, 19);
|
|
613
|
-
const outputPath = join(RESULTS_DIR, `${timestamp}.jsonl`);
|
|
614
|
-
|
|
615
|
-
// Phase 1: Generate queries
|
|
616
|
-
const { queries } = generateQueries(config);
|
|
617
|
-
|
|
618
|
-
// Phase 2: Evaluate each query
|
|
619
|
-
console.log('Evaluating search quality...');
|
|
620
|
-
const evaluations: QueryEvaluation[] = [];
|
|
621
|
-
|
|
622
|
-
for (let i = 0; i < queries.length; i++) {
|
|
623
|
-
const q = queries[i];
|
|
624
|
-
const progress = `[${i + 1}/${queries.length}]`;
|
|
625
|
-
|
|
626
|
-
// Run search
|
|
627
|
-
const { results } = runSearch(q.query, config);
|
|
628
|
-
|
|
629
|
-
// Evaluate results
|
|
630
|
-
const evaluation = evaluateResults(q.query, q.intent, results, config);
|
|
631
|
-
|
|
632
|
-
// Build full evaluation record
|
|
633
|
-
const record: QueryEvaluation = {
|
|
634
|
-
timestamp: new Date().toISOString(),
|
|
635
|
-
query: q.query,
|
|
636
|
-
queryIntent: q.intent,
|
|
637
|
-
searchMode: config.searchMode,
|
|
638
|
-
resultCount: results.length,
|
|
639
|
-
scores: evaluation.scores,
|
|
640
|
-
analysis: evaluation.analysis,
|
|
641
|
-
suggestions: evaluation.suggestions,
|
|
642
|
-
results: evaluation.resultAssessments,
|
|
643
|
-
};
|
|
644
|
-
|
|
645
|
-
evaluations.push(record);
|
|
646
|
-
|
|
647
|
-
// Write immediately (append)
|
|
648
|
-
appendFileSync(outputPath, JSON.stringify(record) + '\n');
|
|
649
|
-
|
|
650
|
-
console.log(` ${progress} "${q.query.slice(0, 40)}${q.query.length > 40 ? '...' : ''}" - overall: ${evaluation.scores.overall.toFixed(2)}`);
|
|
651
|
-
}
|
|
652
|
-
|
|
653
|
-
// Write summary
|
|
654
|
-
const summary = generateSummary(evaluations);
|
|
655
|
-
appendFileSync(outputPath, JSON.stringify(summary) + '\n');
|
|
656
|
-
|
|
657
|
-
console.log(`\nResults written to ${outputPath}`);
|
|
658
|
-
console.log(`Average overall score: ${summary.averageScores.overall}`);
|
|
659
|
-
|
|
660
|
-
if (summary.topIssues.length > 0) {
|
|
661
|
-
console.log('\nTop issues to address:');
|
|
662
|
-
summary.topIssues.forEach((issue, i) => console.log(` ${i + 1}. ${issue}`));
|
|
663
|
-
}
|
|
664
|
-
|
|
665
|
-
console.log(`\nRecommended focus: ${summary.recommendedFocus}`);
|
|
666
|
-
}
|
|
667
|
-
|
|
668
|
-
main().catch(console.error);
|
|
669
|
-
```
|
|
670
|
-
|
|
671
|
-
**Step 2: Commit**
|
|
672
|
-
|
|
673
|
-
```bash
|
|
674
|
-
git add tests/scripts/search-quality.ts
|
|
675
|
-
git commit -m "feat: add summary generation and main execution loop"
|
|
676
|
-
```
|
|
677
|
-
|
|
678
|
-
---
|
|
679
|
-
|
|
680
|
-
### Task 10: Add npm Script
|
|
681
|
-
|
|
682
|
-
**Files:**
|
|
683
|
-
- Modify: `package.json`
|
|
684
|
-
|
|
685
|
-
**Step 1: Add npm script**
|
|
686
|
-
|
|
687
|
-
Add to "scripts" section:
|
|
688
|
-
```json
|
|
689
|
-
"test:search-quality": "npx tsx tests/scripts/search-quality.ts"
|
|
690
|
-
```
|
|
691
|
-
|
|
692
|
-
**Step 2: Commit**
|
|
693
|
-
|
|
694
|
-
```bash
|
|
695
|
-
git add package.json
|
|
696
|
-
git commit -m "chore: add npm script for search quality testing"
|
|
697
|
-
```
|
|
698
|
-
|
|
699
|
-
---
|
|
700
|
-
|
|
701
|
-
### Task 11: Test the Implementation
|
|
702
|
-
|
|
703
|
-
**Step 1: Build the project**
|
|
704
|
-
|
|
705
|
-
```bash
|
|
706
|
-
npm run build
|
|
707
|
-
```
|
|
708
|
-
|
|
709
|
-
**Step 2: Ensure test fixtures are indexed**
|
|
710
|
-
|
|
711
|
-
```bash
|
|
712
|
-
node dist/index.js store create test-fixtures --type file --source tests/fixtures
|
|
713
|
-
node dist/index.js index test-fixtures
|
|
714
|
-
```
|
|
715
|
-
|
|
716
|
-
**Step 3: Run the quality test**
|
|
717
|
-
|
|
718
|
-
```bash
|
|
719
|
-
npm run test:search-quality
|
|
720
|
-
```
|
|
721
|
-
|
|
722
|
-
**Step 4: Verify output**
|
|
723
|
-
|
|
724
|
-
Check that `tests/quality-results/` contains a new JSONL file with:
|
|
725
|
-
- One JSON object per line for each query evaluation
|
|
726
|
-
- Final line is the summary object
|
|
727
|
-
|
|
728
|
-
**Step 5: Commit any fixes if needed**
|
|
729
|
-
|
|
730
|
-
---
|
|
731
|
-
|
|
732
|
-
### Task 12: Final Commit - Feature Complete
|
|
733
|
-
|
|
734
|
-
**Step 1: Verify all files are committed**
|
|
735
|
-
|
|
736
|
-
```bash
|
|
737
|
-
git status
|
|
738
|
-
```
|
|
739
|
-
|
|
740
|
-
**Step 2: Create final commit if any uncommitted changes**
|
|
741
|
-
|
|
742
|
-
```bash
|
|
743
|
-
git add -A
|
|
744
|
-
git commit -m "feat: complete AI-powered search quality testing
|
|
745
|
-
|
|
746
|
-
Adds npm run test:search-quality command that:
|
|
747
|
-
- Uses Claude CLI to generate test queries from fixtures
|
|
748
|
-
- Evaluates search results with detailed scoring and analysis
|
|
749
|
-
- Outputs JSONL for trend tracking
|
|
750
|
-
|
|
751
|
-
Closes #AI-quality-testing"
|
|
752
|
-
```
|
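Note on the removed plan: `generateQueries`, `runSearch`, and `evaluateResults` each build their command with `args.join(' ')` and pass the resulting string to `execSync`, so a multi-line prompt or a JSON schema containing quotes is re-parsed by the shell before the CLI sees it. The following is a minimal sketch of one alternative, assuming only Node's standard `child_process` module. `runCommand` is a hypothetical helper name that does not appear in the package, and the demonstration runs the Node binary itself rather than the Claude CLI.

```typescript
import { execFileSync } from 'node:child_process';

// Hypothetical helper (not part of the package): run a command with an
// argument vector so that no shell re-parses the arguments.
function runCommand(file: string, args: string[], timeoutMs = 60_000): string {
  return execFileSync(file, args, {
    encoding: 'utf-8',
    timeout: timeoutMs,
    maxBuffer: 10 * 1024 * 1024,
  });
}

// Demonstration: a "prompt" containing quotes, a semicolon, a shell variable,
// and a newline is delivered to the child process as one unmodified argument.
const prompt = 'Query: "auth flow"; echo $HOME\nsecond line';
const echoed = runCommand(process.execPath, [
  '-e',
  'process.stdout.write(process.argv[1])',
  prompt,
]);

console.log(echoed === prompt);
```

With this shape, the argument lists from the plan could be passed as arrays (first element as `file`, the rest as `args`) without wrapping the query in `JSON.stringify` for quoting purposes.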