bluera-knowledge 0.13.0 → 0.13.1

@@ -1,201 +0,0 @@
- # AI-Powered Search Quality Testing
-
- ## Overview
-
- A two-phase evaluation system where Claude CLI both generates test queries and judges search result quality, producing structured feedback to drive search improvements.
-
- ## Goals
-
- - **Comprehensive evaluation**: Score relevance, ranking, coverage, and snippet quality
- - **Actionable feedback**: Detailed analysis and specific improvement suggestions
- - **Trend tracking**: JSONL output for monitoring quality over time
- - **Self-evolving**: AI generates queries from actual indexed content
-
- ## Execution Flow
-
- ```
- npm run test:search-quality
-
- ├─► Phase 1: Query Generation
- │   └─► Claude explores tests/fixtures/ directory
- │   └─► Generates 10-15 diverse test queries
-
- ├─► Phase 2: Search & Evaluation
- │   └─► For each query:
- │       ├─► Run bkbsearch
- │       └─► Claude evaluates results (scores + analysis + suggestions)
-
- └─► Output: tests/quality-results/2024-12-17T10-30-00.jsonl
- ```
-
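The output path in the flow above appears to be derived from the run's UTC start time. A minimal sketch of that naming follows; `resultFileName` is a hypothetical helper, and the exact form of the milliseconds suffix (mentioned under Output Handling) is an assumption.

```typescript
// Hypothetical helper: turn a run's start time into a result file name
// such as 2024-12-17T10-30-00.jsonl (UTC, colons replaced with hyphens).
function resultFileName(start: Date, withMillis = false): string {
  const iso = start.toISOString(); // e.g. 2024-12-17T10:30:00.000Z
  const seconds = iso.slice(0, 19).replace(/:/g, "-");
  if (!withMillis) return `${seconds}.jsonl`;
  // Assumed suffix format for runs that collide within the same second.
  const millis = iso.slice(20, 23);
  return `${seconds}-${millis}.jsonl`;
}

const started = new Date(Date.UTC(2024, 11, 17, 10, 30, 0));
console.log(resultFileName(started)); // 2024-12-17T10-30-00.jsonl
```

Using `toISOString()` keeps the name independent of the machine's local time zone, which matches the `Z` timestamps in the sample records.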
31
- ## File Structure
32
-
33
- ```
34
- tests/
35
- ├── quality-config.json # Configuration
36
- ├── quality-results/ # JSONL output files
37
- │ └── .gitkeep
38
- └── scripts/
39
- └── search-quality.ts # Main test runner
40
- ```
41
-
- ## Configuration
-
- `tests/quality-config.json`:
- ```json
- {
-   "queryCount": 15,
-   "searchLimit": 10,
-   "searchMode": "hybrid",
-   "stores": null
- }
- ```
-
- - `queryCount`: Number of queries Claude generates (default: 15)
- - `searchLimit`: Max results per search (default: 10)
- - `searchMode`: `vector` | `fts` | `hybrid` (default: `hybrid`)
- - `stores`: Specific stores to test, or `null` for all
-
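One way the runner might load this file is to merge it over the documented defaults. This is a sketch only: `resolveConfig` is a hypothetical helper, and typing `stores` as an array of store names is an assumption, since the document only says "specific stores".

```typescript
type SearchMode = "vector" | "fts" | "hybrid";

interface QualityConfig {
  queryCount: number;
  searchLimit: number;
  searchMode: SearchMode;
  stores: string[] | null; // assumed shape; null means "test all stores"
}

// Defaults taken from the option list above.
const DEFAULT_CONFIG: QualityConfig = {
  queryCount: 15,
  searchLimit: 10,
  searchMode: "hybrid",
  stores: null,
};

// Hypothetical helper: parse the raw JSON text and fill in missing fields.
function resolveConfig(rawJson: string): QualityConfig {
  const parsed = JSON.parse(rawJson) as Partial<QualityConfig>;
  const config: QualityConfig = { ...DEFAULT_CONFIG, ...parsed };
  const modes: SearchMode[] = ["vector", "fts", "hybrid"];
  if (!modes.includes(config.searchMode)) {
    throw new Error(`Unknown searchMode: ${String(config.searchMode)}`);
  }
  return config;
}

console.log(resolveConfig('{"queryCount": 5}').searchMode); // hybrid
```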
- ## Output Format
-
- Each line in the JSONL file represents one query evaluation:
-
- ```json
- {
-   "timestamp": "2024-12-17T10:30:00.000Z",
-   "query": "JWT token refresh implementation",
-   "queryIntent": "Find code handling JWT refresh token logic",
-   "searchMode": "hybrid",
-   "resultCount": 10,
-   "scores": {
-     "relevance": 0.85,
-     "ranking": 0.70,
-     "coverage": 0.90,
-     "snippetQuality": 0.75,
-     "overall": 0.80
-   },
-   "analysis": {
-     "relevance": "8/10 results directly relate to JWT tokens. Results #6 and #9 are about general auth headers, not JWT specifically.",
-     "ranking": "Top result is OAuth flow, not JWT refresh. jwt-auth.ts with refreshToken function should rank #1 but is #3.",
-     "coverage": "Found the main JWT implementation. Minor gap: didn't surface the token expiry handling in error-handling.ts.",
-     "snippetQuality": "Snippets are readable but some cut off mid-function. Result #4 snippet doesn't show the most relevant code section."
-   },
-   "suggestions": [
-     "Improve semantic distinction between 'refresh token' (JWT concept) vs 'OAuth refresh' (flow type)",
-     "Consider boosting exact function name matches - 'refreshToken' function exists but ranks below partial matches",
-     "Chunk boundaries may be splitting functions - review chunking strategy for code files"
-   ],
-   "results": [
-     {"rank": 1, "source": "oauth-flow.ts", "score": -0.23, "relevant": true},
-     {"rank": 2, "source": "jwt-auth.ts", "score": -0.31, "relevant": true}
-   ]
- }
- ```
-
- **End-of-file summary** (last line):
- ```json
- {
-   "type": "summary",
-   "timestamp": "2024-12-17T10:30:45.000Z",
-   "totalQueries": 15,
-   "averageScores": {
-     "relevance": 0.82,
-     "ranking": 0.68,
-     "coverage": 0.85,
-     "snippetQuality": 0.71,
-     "overall": 0.77
-   },
-   "topIssues": ["Ranking of exact matches", "Code chunk boundaries", "Semantic disambiguation"],
-   "recommendedFocus": "Ranking algorithm - exact and function name matches consistently rank lower than expected"
- }
- ```
-
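To illustrate how the two record kinds relate, the sketch below reads a JSONL file's text, skips the summary line, and recomputes `averageScores` from the per-query lines. The helper names are hypothetical, and rounding to two decimals is an assumption based on the sample values; the document does not say how the summary is actually computed.

```typescript
interface Scores {
  relevance: number;
  ranking: number;
  coverage: number;
  snippetQuality: number;
  overall: number;
}

interface Evaluation {
  query: string;
  scores: Scores;
}

const DIMENSIONS = ["relevance", "ranking", "coverage", "snippetQuality", "overall"] as const;

// Parse JSONL text, keeping only per-query lines (the summary has type "summary").
function parseEvaluations(jsonl: string): Evaluation[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line))
    .filter((record) => record.type !== "summary");
}

// Average each dimension across evaluations, rounded to two decimals (assumed).
function averageScores(evaluations: Evaluation[]): Scores {
  const avg: Scores = { relevance: 0, ranking: 0, coverage: 0, snippetQuality: 0, overall: 0 };
  if (evaluations.length === 0) return avg;
  for (const dim of DIMENSIONS) {
    const total = evaluations.reduce((sum, e) => sum + e.scores[dim], 0);
    avg[dim] = Math.round((total / evaluations.length) * 100) / 100;
  }
  return avg;
}
```

Because each line is an independent JSON document, a partially written file from an interrupted run can still be parsed line by line.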
- ## Implementation Approach
-
- ### Phase 1: Query Generation
-
- ```bash
- claude -p --output-format json --json-schema '...' \
-   "Explore tests/fixtures/ to understand the indexed content,
-    then generate ${queryCount} diverse search queries..."
- ```
-
- Claude uses its Glob/Read tools to browse the fixtures, identifies the content types present (auth code, API docs, READMEs), and generates queries that meaningfully exercise search.
-
- ### Phase 2: Per-Query Evaluation
-
- ```bash
- bkbsearch "JWT refresh" --include-content | \
-   claude -p --output-format json --json-schema '...' \
-   "Evaluate these search results for the query 'JWT refresh'..."
- ```
-
- The actual search results are piped to Claude, which returns an evaluation that conforms to the JSON schema.
-
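The schema is elided (`'...'`) in both commands above. The sketch below shows one plausible shape for the Phase 2 schema and how the runner could assemble the `claude` arguments without going through a shell. The schema contents, the 0 to 1 score range, and `buildEvaluationArgs` are all assumptions for illustration; only the flags themselves come from the commands above, and the prompt text stays truncated as in the original.

```typescript
// Assumed schema for one evaluation; the real schema is not shown in the document.
const evaluationSchema = {
  type: "object",
  required: ["scores", "analysis", "suggestions"],
  properties: {
    scores: {
      type: "object",
      required: ["relevance", "ranking", "coverage", "snippetQuality", "overall"],
      // Range inferred from the sample records, which all fall between 0 and 1.
      additionalProperties: { type: "number", minimum: 0, maximum: 1 },
    },
    analysis: { type: "object", additionalProperties: { type: "string" } },
    suggestions: { type: "array", items: { type: "string" } },
  },
};

// Hypothetical helper: build the argument list for the Phase 2 `claude` call.
// The prompt is kept elided ("...") exactly as in the command above.
function buildEvaluationArgs(query: string, schema: object): string[] {
  return [
    "-p",
    "--output-format", "json",
    "--json-schema", JSON.stringify(schema),
    `Evaluate these search results for the query '${query}'...`,
  ];
}

console.log(buildEvaluationArgs("JWT refresh", evaluationSchema).slice(0, 4));
```

Passing the arguments as an array (for example to Node's `child_process.spawn`, with the search output written to stdin) would avoid shell-quoting problems with the serialized schema and with queries that contain quotes.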
- ### npm Script
-
- ```json
- {
-   "scripts": {
-     "test:search-quality": "npx tsx tests/scripts/search-quality.ts"
-   }
- }
- ```
-
- ## Error Handling
-
- ### Claude CLI Failures
- - **Rate limits**: Retry with exponential backoff (max 3 retries)
- - **Invalid JSON**: Log raw response to stderr, skip query, continue
- - **Timeout**: 60s per CLI call, plus a `--max-budget-usd 0.50` safety cap
-
- ### Search Failures
- - **Empty results**: Evaluate as valid feedback (coverage = 0)
- - **Store not found**: Fail fast with clear error
- - **Malformed output**: Log, skip, include in summary as "evaluation failed"
-
- ### Output Handling
- - **Directory missing**: Create automatically
- - **Partial run (Ctrl+C)**: Write each evaluation as soon as it completes, so an interrupted run keeps its results
- - **Duplicate timestamps**: Append milliseconds suffix
-
- ### AI Evaluation Edge Cases
- - **Ambiguous queries**: Flag in analysis ("query could mean X or Y")
- - **All results relevant**: Valid outcome, high scores
- - **No fixtures indexed**: Detect early, abort with helpful message
-
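The rate-limit rule above (exponential backoff, at most 3 retries) could be sketched as follows. `withRetry`, `backoffDelayMs`, the 1-second base delay, and the `isRetryable` predicate are illustrative assumptions; the document does not specify the delays or how a rate limit is detected.

```typescript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** (attempt - 1);
}

// Hypothetical wrapper: run `task`, retrying up to `maxRetries` times when
// `isRetryable` says the failure is transient (for example a rate limit).
async function withRetry<T>(
  task: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      // Give up once the retry budget is spent or the error is not transient.
      if (attempt >= maxRetries || !isRetryable(error)) throw error;
      await sleep(backoffDelayMs(attempt + 1));
    }
  }
}
```

With the defaults this makes at most four calls in total (one initial attempt plus three retries). The injectable `sleep` parameter lets tests run without real delays.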
- ## Console Output
-
- ```
- 🔍 Generating test queries from tests/fixtures/...
- ✓ Generated 15 queries
-
- 📊 Evaluating search quality...
- [1/15] "JWT token refresh" - overall: 0.80
- [2/15] "error handling middleware" - overall: 0.85
- ...
- [15/15] "database repository pattern" - overall: 0.72
-
- ✓ Results written to tests/quality-results/2024-12-17T10-30-00.jsonl
- 📈 Average overall score: 0.77
- ```
-
- ## Score Dimensions
-
- | Dimension | Description |
- |-----------|-------------|
- | **Relevance** | Do results actually relate to the query intent? |
- | **Ranking** | Are the most relevant results at the top? |
- | **Coverage** | Did the search find all expected matches? |
- | **Snippet Quality** | Are snippets readable and showing relevant sections? |
- | **Overall** | Weighted combination of all dimensions |
-
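The table does not give the weights behind **Overall**. As a sketch under the assumption of equal weights, which happens to reproduce the sample record's 0.80 from 0.85, 0.70, 0.90, and 0.75, the combination could look like this (`overallScore` and `EQUAL_WEIGHTS` are hypothetical names):

```typescript
type Dimension = "relevance" | "ranking" | "coverage" | "snippetQuality";

// Assumed weights; the document does not state the real ones.
const EQUAL_WEIGHTS: Record<Dimension, number> = {
  relevance: 0.25,
  ranking: 0.25,
  coverage: 0.25,
  snippetQuality: 0.25,
};

// Weighted average of the four dimensions, rounded to two decimals.
function overallScore(
  scores: Record<Dimension, number>,
  weights: Record<Dimension, number> = EQUAL_WEIGHTS,
): number {
  const dims = Object.keys(weights) as Dimension[];
  const totalWeight = dims.reduce((sum, d) => sum + weights[d], 0);
  const weighted = dims.reduce((sum, d) => sum + scores[d] * weights[d], 0);
  return Math.round((weighted / totalWeight) * 100) / 100;
}

const sample = { relevance: 0.85, ranking: 0.7, coverage: 0.9, snippetQuality: 0.75 };
console.log(overallScore(sample)); // 0.8
```

Dividing by the total weight means the weights do not need to sum to 1, so a dimension such as ranking can be emphasized by raising its weight alone.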
- ## Usage for Improvements
-
- The detailed feedback supports several workflows:
-
- 1. **Identify patterns**: Run multiple times, aggregate `topIssues` across runs
- 2. **Track regressions**: Compare `averageScores` between runs after changes
- 3. **Prioritize work**: `recommendedFocus` highlights highest-impact improvements
- 4. **Debug specific issues**: `analysis` explains exactly why scores are low
- 5. **Validate fixes**: Re-run after changes to confirm improvements
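
Items 1 and 2 in the list above can be sketched as two small helpers that operate on the summary records of several runs. The names `scoreDeltas` and `countTopIssues` are hypothetical, and the two-decimal rounding is an assumption chosen to match the sample scores.

```typescript
type ScoreMap = Record<string, number>;

// Per-dimension change between two runs' averageScores (after minus before).
function scoreDeltas(before: ScoreMap, after: ScoreMap): ScoreMap {
  const deltas: ScoreMap = {};
  for (const dim of Object.keys(before)) {
    if (dim in after) {
      deltas[dim] = Math.round((after[dim] - before[dim]) * 100) / 100;
    }
  }
  return deltas;
}

// How often each issue appears in topIssues across several runs.
function countTopIssues(runs: { topIssues: string[] }[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const run of runs) {
    for (const issue of run.topIssues) {
      counts.set(issue, (counts.get(issue) ?? 0) + 1);
    }
  }
  return counts;
}
```

A negative delta in any dimension after a change flags a possible regression, and issues with the highest counts are candidates for the next round of work.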