bluera-knowledge 0.13.0 → 0.13.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude/rules/code-quality.md +12 -0
- package/.claude/rules/git.md +5 -0
- package/.claude/rules/versioning.md +7 -0
- package/.claude-plugin/plugin.json +2 -15
- package/.mcp.json +11 -0
- package/CHANGELOG.md +7 -0
- package/CLAUDE.md +5 -13
- package/CONTRIBUTING.md +307 -0
- package/README.md +58 -1167
- package/commands/crawl.md +2 -1
- package/commands/test-plugin.md +197 -72
- package/docs/claude-code-best-practices.md +458 -0
- package/docs/cli.md +170 -0
- package/docs/commands.md +392 -0
- package/docs/crawler-architecture.md +89 -0
- package/docs/mcp-integration.md +130 -0
- package/docs/token-efficiency.md +91 -0
- package/eslint.config.js +1 -1
- package/hooks/check-dependencies.sh +18 -1
- package/hooks/hooks.json +2 -2
- package/hooks/posttooluse-bk-reminder.py +30 -2
- package/package.json +1 -1
- package/scripts/test-mcp-dev.js +260 -0
- package/src/mcp/plugin-mcp-config.test.ts +26 -19
- package/tests/integration/cli-consistency.test.ts +3 -2
- package/docs/plans/2024-12-17-ai-search-quality-implementation.md +0 -752
- package/docs/plans/2024-12-17-ai-search-quality-testing-design.md +0 -201
- package/docs/plans/2025-12-16-bluera-knowledge-cli.md +0 -2951
- package/docs/plans/2025-12-16-phase2-features.md +0 -1518
- package/docs/plans/2025-12-17-hil-implementation.md +0 -926
- package/docs/plans/2025-12-17-hil-quality-testing.md +0 -224
- package/docs/plans/2025-12-17-search-quality-phase1-implementation.md +0 -1416
- package/docs/plans/2025-12-17-search-quality-testing-v2-design.md +0 -212
- package/docs/plans/2025-12-28-ai-agent-optimization.md +0 -1630
@@ -1,201 +0,0 @@
# AI-Powered Search Quality Testing

## Overview

A two-phase evaluation system where Claude CLI both generates test queries and judges search result quality, producing structured feedback to drive search improvements.

## Goals

- **Comprehensive evaluation**: Score relevance, ranking, coverage, and snippet quality
- **Actionable feedback**: Detailed analysis and specific improvement suggestions
- **Trend tracking**: JSONL output for monitoring quality over time
- **Self-evolving**: AI generates queries from actual indexed content

## Execution Flow

```
npm run test:search-quality
│
├─► Phase 1: Query Generation
│     ├─► Claude explores tests/fixtures/ directory
│     └─► Generates 10-15 diverse test queries
│
├─► Phase 2: Search & Evaluation
│     └─► For each query:
│           ├─► Run bkbsearch
│           └─► Claude evaluates results (scores + analysis + suggestions)
│
└─► Output: tests/quality-results/2024-12-17T10-30-00.jsonl
```
## File Structure

```
tests/
├── quality-config.json       # Configuration
├── quality-results/          # JSONL output files
│   └── .gitkeep
└── scripts/
    └── search-quality.ts     # Main test runner
```
## Configuration

`tests/quality-config.json`:
```json
{
  "queryCount": 15,
  "searchLimit": 10,
  "searchMode": "hybrid",
  "stores": null
}
```

- `queryCount`: Number of queries Claude generates (default: 15)
- `searchLimit`: Max results per search (default: 10)
- `searchMode`: `vector` | `fts` | `hybrid` (default: `hybrid`)
- `stores`: Specific stores to test, or `null` for all
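Inside the runner this file can be read into a typed object with the documented defaults applied. A minimal TypeScript sketch; `QualityConfig`, `loadConfig`, and the module path are illustrative names, not part of the published package:

```typescript
// tests/scripts/config.ts (hypothetical module)
import { readFileSync } from "node:fs";

interface QualityConfig {
  queryCount: number;                      // queries Claude generates (default 15)
  searchLimit: number;                     // max results per search (default 10)
  searchMode: "vector" | "fts" | "hybrid"; // search mode (default "hybrid")
  stores: string[] | null;                 // null = test all stores
}

const DEFAULTS: QualityConfig = {
  queryCount: 15,
  searchLimit: 10,
  searchMode: "hybrid",
  stores: null,
};

export function loadConfig(path = "tests/quality-config.json"): QualityConfig {
  try {
    const raw = JSON.parse(readFileSync(path, "utf8"));
    return { ...DEFAULTS, ...raw }; // file values override the defaults
  } catch {
    // Missing or unreadable config falls back to the documented defaults.
    return DEFAULTS;
  }
}
```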
## Output Format

Each line in the JSONL file represents one query evaluation:

```json
{
  "timestamp": "2024-12-17T10:30:00.000Z",
  "query": "JWT token refresh implementation",
  "queryIntent": "Find code handling JWT refresh token logic",
  "searchMode": "hybrid",
  "resultCount": 10,
  "scores": {
    "relevance": 0.85,
    "ranking": 0.70,
    "coverage": 0.90,
    "snippetQuality": 0.75,
    "overall": 0.80
  },
  "analysis": {
    "relevance": "8/10 results directly relate to JWT tokens. Results #6 and #9 are about general auth headers, not JWT specifically.",
    "ranking": "Top result is OAuth flow, not JWT refresh. jwt-auth.ts with refreshToken function should rank #1 but is #3.",
    "coverage": "Found the main JWT implementation. Minor gap: didn't surface the token expiry handling in error-handling.ts.",
    "snippetQuality": "Snippets are readable but some cut off mid-function. Result #4 snippet doesn't show the most relevant code section."
  },
  "suggestions": [
    "Improve semantic distinction between 'refresh token' (JWT concept) vs 'OAuth refresh' (flow type)",
    "Consider boosting exact function name matches - 'refreshToken' function exists but ranks below partial matches",
    "Chunk boundaries may be splitting functions - review chunking strategy for code files"
  ],
  "results": [
    {"rank": 1, "source": "oauth-flow.ts", "score": -0.23, "relevant": true},
    {"rank": 2, "source": "jwt-auth.ts", "score": -0.31, "relevant": true}
  ]
}
```

**End-of-file summary** (last line):
```json
{
  "type": "summary",
  "timestamp": "2024-12-17T10:30:45.000Z",
  "totalQueries": 15,
  "averageScores": {
    "relevance": 0.82,
    "ranking": 0.68,
    "coverage": 0.85,
    "snippetQuality": 0.71,
    "overall": 0.77
  },
  "topIssues": ["Ranking of exact matches", "Code chunk boundaries", "Semantic disambiguation"],
  "recommendedFocus": "Ranking algorithm - exact and function name matches consistently rank lower than expected"
}
```
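If the runner is written in TypeScript, these two record shapes map directly onto types. The names below (`QueryEvaluation`, `RunSummary`, `DimensionScores`) are illustrative only; they mirror the JSON examples above rather than any published API:

```typescript
// Shapes mirroring the JSONL examples above; names are illustrative.
interface DimensionScores {
  relevance: number;
  ranking: number;
  coverage: number;
  snippetQuality: number;
  overall: number;
}

interface QueryEvaluation {
  timestamp: string;
  query: string;
  queryIntent: string;
  searchMode: "vector" | "fts" | "hybrid";
  resultCount: number;
  scores: DimensionScores;
  analysis: { relevance: string; ranking: string; coverage: string; snippetQuality: string };
  suggestions: string[];
  results: { rank: number; source: string; score: number; relevant: boolean }[];
}

interface RunSummary {
  type: "summary";
  timestamp: string;
  totalQueries: number;
  averageScores: DimensionScores;
  topIssues: string[];
  recommendedFocus: string;
}
```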
## Implementation Approach

### Phase 1: Query Generation

```bash
claude -p --output-format json --json-schema '...' \
  "Explore tests/fixtures/ to understand the indexed content,
   then generate ${queryCount} diverse search queries..."
```

Claude uses Glob/Read tools to browse fixtures, understands content types (auth code, API docs, READMEs), and generates queries that meaningfully test search.
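From the test runner, the same call can be made with Node's `child_process`. A rough sketch, assuming a hypothetical `generateQueries` helper; the prompt text and flags come from the shell example above, while the schema string and the shape of the parsed response depend on what the runner actually supplies:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Phase 1: ask the Claude CLI for test queries as structured JSON.
async function generateQueries(queryCount: number, schema: string): Promise<string[]> {
  const prompt =
    `Explore tests/fixtures/ to understand the indexed content, ` +
    `then generate ${queryCount} diverse search queries...`;
  const { stdout } = await execFileAsync(
    "claude",
    ["-p", "--output-format", "json", "--json-schema", schema, prompt],
    { timeout: 60_000 }, // 60s per CLI call, per the error-handling rules below
  );
  // Parsing here depends on the CLI's JSON envelope and the supplied schema.
  return JSON.parse(stdout).queries;
}
```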
### Phase 2: Per-Query Evaluation

```bash
bkbsearch "JWT refresh" --include-content | \
  claude -p --output-format json --json-schema '...' \
    "Evaluate these search results for the query 'JWT refresh'..."
```

The runner pipes the actual search results to Claude, which evaluates them against the JSON schema.
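The same pattern works in the runner. In this sketch the search output is embedded in the prompt rather than piped on stdin as in the shell version; `evaluateQuery` is a hypothetical helper, the schema is elided, and the `searchLimit`/`stores` options are omitted because their exact CLI spellings are not specified in this design:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Phase 2: run the search, then ask Claude to score the results.
async function evaluateQuery(query: string, schema: string): Promise<unknown> {
  const search = await execFileAsync("bkbsearch", [query, "--include-content"]);

  const prompt =
    `Evaluate these search results for the query '${query}'...\n\n` + search.stdout;
  const evaluation = await execFileAsync(
    "claude",
    ["-p", "--output-format", "json", "--json-schema", schema, prompt],
    { timeout: 60_000 },
  );
  return JSON.parse(evaluation.stdout);
}
```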
### npm Script

```json
{
  "scripts": {
    "test:search-quality": "npx tsx tests/scripts/search-quality.ts"
  }
}
```
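Tying the pieces together, `tests/scripts/search-quality.ts` reduces to a small loop. This is a hypothetical skeleton only, assuming the helpers sketched above (`loadConfig`, `generateQueries`, `evaluateQuery`) and the `resultsPath`/`appendJsonl` writer sketched under "Output Handling" below:

```typescript
// Hypothetical skeleton of tests/scripts/search-quality.ts.
async function main(): Promise<void> {
  const config = loadConfig();
  const schema = "..."; // generation/evaluation JSON schemas, elided here
  const outPath = resultsPath(); // see the "Output Handling" sketch below

  console.log("🔍 Generating test queries from tests/fixtures/...");
  const queries = await generateQueries(config.queryCount, schema);
  console.log(`✓ Generated ${queries.length} queries`);

  console.log("📊 Evaluating search quality...");
  for (const [i, query] of queries.entries()) {
    const evaluation = await evaluateQuery(query, schema);
    appendJsonl(outPath, evaluation); // write each record immediately (partial-run safety)
    console.log(`  [${i + 1}/${queries.length}] "${query}"`);
  }
  // A final summary line with average scores would be appended here.
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```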
## Error Handling

### Claude CLI Failures
- **Rate limits**: Retry with exponential backoff (max 3 retries); see the sketch below
- **Invalid JSON**: Log raw response to stderr, skip query, continue
- **Timeout**: 60s per CLI call, `--max-budget-usd 0.50` safety cap
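A generic retry wrapper covers the rate-limit case. A minimal sketch; the 3-retry cap comes from the list above, while the 1-second backoff base is an assumption:

```typescript
// Retry an async call with exponential backoff (max 3 retries).
async function withRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s; base is an assumption
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Usage: const queries = await withRetry(() => generateQueries(15, schema));
```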
### Search Failures
- **Empty results**: Evaluate as valid feedback (coverage = 0)
- **Store not found**: Fail fast with clear error
- **Malformed output**: Log, skip, include in summary as "evaluation failed"

### Output Handling
- **Directory missing**: Create automatically (see the sketch below)
- **Partial run (Ctrl+C)**: Write completed evaluations immediately
- **Duplicate timestamps**: Append milliseconds suffix
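These three behaviours fit in a few lines of Node's `fs` API. A sketch with hypothetical helper names (`resultsPath`, `appendJsonl`); the filename format follows the example path shown in the execution flow:

```typescript
import { appendFileSync, existsSync, mkdirSync } from "node:fs";
import { join } from "node:path";

const RESULTS_DIR = "tests/quality-results";

// Build a run file name like 2024-12-17T10-30-00.jsonl; fall back to the full
// millisecond timestamp if a file for that second already exists.
function resultsPath(now = new Date()): string {
  mkdirSync(RESULTS_DIR, { recursive: true }); // create the directory if missing
  const stamp = now.toISOString().replace(/:/g, "-");
  const base = join(RESULTS_DIR, `${stamp.slice(0, 19)}.jsonl`);
  return existsSync(base) ? join(RESULTS_DIR, `${stamp}.jsonl`) : base;
}

// Append one evaluation per line so a Ctrl+C still leaves completed records on disk.
function appendJsonl(path: string, record: unknown): void {
  appendFileSync(path, JSON.stringify(record) + "\n", "utf8");
}
```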
### AI Evaluation Edge Cases
- **Ambiguous queries**: Flag in analysis ("query could mean X or Y")
- **All results relevant**: Valid outcome, high scores
- **No fixtures indexed**: Detect early, abort with helpful message

## Console Output

```
🔍 Generating test queries from tests/fixtures/...
✓ Generated 15 queries

📊 Evaluating search quality...
  [1/15] "JWT token refresh" - overall: 0.80
  [2/15] "error handling middleware" - overall: 0.85
  ...
  [15/15] "database repository pattern" - overall: 0.72

✓ Results written to tests/quality-results/2024-12-17T10-30-00.jsonl
📈 Average overall score: 0.77
```

## Score Dimensions

| Dimension | Description |
|-----------|-------------|
| **Relevance** | Do results actually relate to the query intent? |
| **Ranking** | Are the most relevant results at the top? |
| **Coverage** | Did the search find all expected matches? |
| **Snippet Quality** | Are snippets readable and showing relevant sections? |
| **Overall** | Weighted combination of all dimensions |
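The design does not pin down the weights, but in the per-query sample above the overall score equals the plain average of the four dimensions ((0.85 + 0.70 + 0.90 + 0.75) / 4 = 0.80), so an equal-weight combination is a reasonable placeholder:

```typescript
// Equal-weight "overall" score; the weighting itself is an assumption,
// chosen only because it reproduces the per-query sample record above.
function overallScore(s: {
  relevance: number;
  ranking: number;
  coverage: number;
  snippetQuality: number;
}): number {
  const mean = (s.relevance + s.ranking + s.coverage + s.snippetQuality) / 4;
  return Math.round(mean * 100) / 100; // two decimals, as in the JSONL examples
}
```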
## Usage for Improvements

The detailed feedback enables:

1. **Identify patterns**: Run multiple times, aggregate `topIssues` across runs
2. **Track regressions**: Compare `averageScores` between runs after changes (see the comparison sketch below)
3. **Prioritize work**: `recommendedFocus` highlights highest-impact improvements
4. **Debug specific issues**: `analysis` explains exactly why scores are low
5. **Validate fixes**: Re-run after changes to confirm improvements
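For the regression case, the summary lines of two result files can be compared directly, since the summary is always the last line of each JSONL file. A sketch with hypothetical helper names (`readSummary`, `compareRuns`) and illustrative file paths:

```typescript
import { readFileSync } from "node:fs";

type Scores = Record<"relevance" | "ranking" | "coverage" | "snippetQuality" | "overall", number>;

// The summary record is the last line of a quality-results JSONL file.
function readSummary(path: string): { averageScores: Scores } {
  const lines = readFileSync(path, "utf8").trim().split("\n");
  return JSON.parse(lines[lines.length - 1]);
}

// Print the per-dimension delta between a baseline run and a newer run.
function compareRuns(baselinePath: string, currentPath: string): void {
  const before = readSummary(baselinePath).averageScores;
  const after = readSummary(currentPath).averageScores;
  for (const key of Object.keys(before) as (keyof Scores)[]) {
    const delta = after[key] - before[key];
    console.log(`${key}: ${before[key]} -> ${after[key]} (${delta >= 0 ? "+" : ""}${delta.toFixed(2)})`);
  }
}

// Example (paths are illustrative):
// compareRuns("tests/quality-results/2024-12-17T10-30-00.jsonl",
//             "tests/quality-results/2024-12-18T09-00-00.jsonl");
```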