bluera-knowledge 0.13.0 → 0.13.2
- package/.claude/rules/code-quality.md +12 -0
- package/.claude/rules/git.md +5 -0
- package/.claude/rules/versioning.md +7 -0
- package/.claude-plugin/plugin.json +2 -15
- package/.mcp.json +11 -0
- package/CHANGELOG.md +7 -0
- package/CLAUDE.md +5 -13
- package/CONTRIBUTING.md +307 -0
- package/README.md +58 -1167
- package/commands/crawl.md +2 -1
- package/commands/test-plugin.md +197 -72
- package/docs/claude-code-best-practices.md +458 -0
- package/docs/cli.md +170 -0
- package/docs/commands.md +392 -0
- package/docs/crawler-architecture.md +89 -0
- package/docs/mcp-integration.md +130 -0
- package/docs/token-efficiency.md +91 -0
- package/eslint.config.js +1 -1
- package/hooks/check-dependencies.sh +18 -1
- package/hooks/hooks.json +2 -2
- package/hooks/posttooluse-bk-reminder.py +30 -2
- package/package.json +1 -1
- package/scripts/test-mcp-dev.js +260 -0
- package/src/mcp/plugin-mcp-config.test.ts +26 -19
- package/tests/integration/cli-consistency.test.ts +3 -2
- package/docs/plans/2024-12-17-ai-search-quality-implementation.md +0 -752
- package/docs/plans/2024-12-17-ai-search-quality-testing-design.md +0 -201
- package/docs/plans/2025-12-16-bluera-knowledge-cli.md +0 -2951
- package/docs/plans/2025-12-16-phase2-features.md +0 -1518
- package/docs/plans/2025-12-17-hil-implementation.md +0 -926
- package/docs/plans/2025-12-17-hil-quality-testing.md +0 -224
- package/docs/plans/2025-12-17-search-quality-phase1-implementation.md +0 -1416
- package/docs/plans/2025-12-17-search-quality-testing-v2-design.md +0 -212
- package/docs/plans/2025-12-28-ai-agent-optimization.md +0 -1630
@@ -1,212 +0,0 @@
-# Search Quality Testing v2 - Design
-
-## Goal
-
-Build a valid, reproducible search quality testing system that enables real-world performance tracking and drives actionable improvements over time.
-
-## Core Requirements
-
-- **Valid tests**: Real representative data, not synthetic placeholders
-- **Regression tracking**: Stable queries against stable corpus to measure changes
-- **Exploratory testing**: Generate fresh queries to discover new issues
-- **Actionable output**: AI-judged with spot-checks for calibration
-
----
-
-## Test Corpus
-
-### Structure
-
-```
-tests/fixtures/corpus/           # Committed directly (no .git folders)
-├── oss-repos/
-│   ├── zod/                     # TypeScript schema validation
-│   └── hono/                    # Lightweight web framework
-├── documentation/
-│   └── express-docs/            # Express.js guide excerpts
-├── articles/                    # Technical blog posts, tutorials
-├── papers/                      # Research papers (markdown)
-└── VERSION.md                   # Corpus version documentation
-```
-
-### Selection Criteria
-
-- Small but representative (~50-100 documents)
-- Mix of content types: code + docs, articles, reference
-- Pinned versions (cleaned snapshots, no .git)
-- Reflects real usage: dev docs, documented codebases, articles
-
----
-
-## Query Management
-
-### Core Queries (`queries/core.json`)
-
-```json
-{
-  "version": "1.0.0",
-  "description": "Stable regression benchmark queries",
-  "queries": [
-    {
-      "id": "auth-001",
-      "query": "JWT token validation middleware",
-      "intent": "Find authentication middleware implementations",
-      "category": "code-pattern",
-      "addedAt": "2025-12-17",
-      "expectedSources": []
-    }
-  ]
-}
-```
-
-### Query Categories
-
-- `code-pattern` - Find implementation patterns
-- `concept` - Explain a concept or approach
-- `api-reference` - Look up specific API/function
-- `troubleshooting` - Debug/error resolution
-- `comparison` - Compare approaches or tools
-
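The `queries/core.json` shape and the five categories in the removed plan could be captured as types with a small sanity check. This is a minimal sketch, not code from the package: `QuerySet`, `CoreQuery`, and `validateQuerySet` are hypothetical names, and treating `expectedSources` as a list of strings is an assumption, since the plan only shows an empty array.

```typescript
// Hypothetical types mirroring the `queries/core.json` example in the removed plan.
const QUERY_CATEGORIES = [
  "code-pattern",
  "concept",
  "api-reference",
  "troubleshooting",
  "comparison",
] as const;

type QueryCategory = (typeof QUERY_CATEGORIES)[number];

interface CoreQuery {
  id: string;
  query: string;
  intent: string;
  category: QueryCategory;
  addedAt: string; // calendar date such as "2025-12-17"
  expectedSources: string[]; // element type is an assumption
}

interface QuerySet {
  version: string;
  description: string;
  queries: CoreQuery[];
}

// Returns human-readable problems; an empty list means no problems were found.
function validateQuerySet(set: QuerySet): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const q of set.queries) {
    if (seen.has(q.id)) problems.push(`duplicate id: ${q.id}`);
    seen.add(q.id);
    if (!(QUERY_CATEGORIES as readonly string[]).includes(q.category)) {
      problems.push(`unknown category for ${q.id}: ${q.category}`);
    }
    if (!/^\d{4}-\d{2}-\d{2}$/.test(q.addedAt)) {
      problems.push(`bad addedAt for ${q.id}: ${q.addedAt}`);
    }
  }
  return problems;
}
```

Checking ids for uniqueness matters for a regression set, because per-query scores can only be compared across runs if each id refers to exactly one query.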
-### Generated Queries
-
-- Saved to `queries/generated/YYYY-MM-DD-HH-MM.json`
-- Same structure with `"source": "ai-generated"`
-- Can promote good queries to core set manually
-
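The naming convention for generated query files can be illustrated with a short helper. This is a sketch under stated assumptions: `generatedQueryPath` and `tagAsGenerated` are hypothetical names, and using UTC for the timestamp is an assumption, since the plan does not say which time zone the file name uses.

```typescript
// Builds `queries/generated/YYYY-MM-DD-HH-MM.json` from a Date.
// UTC is an assumption; the plan does not specify a time zone.
function generatedQueryPath(now: Date): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const stamp = [
    String(now.getUTCFullYear()),
    pad(now.getUTCMonth() + 1), // getUTCMonth() is zero-based
    pad(now.getUTCDate()),
    pad(now.getUTCHours()),
    pad(now.getUTCMinutes()),
  ].join("-");
  return `queries/generated/${stamp}.json`;
}

// Marks a query object as AI-generated, leaving its other fields untouched.
function tagAsGenerated<T extends object>(query: T): T & { source: "ai-generated" } {
  return { ...query, source: "ai-generated" as const };
}
```

For example, `generatedQueryPath(new Date(Date.UTC(2025, 11, 17, 16, 23)))` yields `queries/generated/2025-12-17-16-23.json`.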
----
-
-## Results & Tracking
-
-### Structure
-
-```
-tests/quality-results/
-├── runs/                        # Individual run outputs
-│   └── 2025-12-17T16-23-58.jsonl
-├── baseline.json                # Current performance baseline
-└── history.json                 # Score trends over time
-```
-
-### Baseline (`baseline.json`)
-
-```json
-{
-  "updatedAt": "2025-12-17",
-  "corpus": "v1.0.0",
-  "querySet": "core@1.0.0",
-  "scores": {
-    "relevance": 0.72,
-    "ranking": 0.68,
-    "coverage": 0.65,
-    "snippetQuality": 0.70,
-    "overall": 0.69
-  },
-  "thresholds": {
-    "regression": 0.05,
-    "improvement": 0.03
-  }
-}
-```
-
-### Comparison Output
-
-```
-📊 Search Quality Results (vs baseline)
-
-Relevance:  0.75 (+0.03) ✅
-Ranking:    0.66 (-0.02)
-Coverage:   0.71 (+0.06) ✅
-Snippet:    0.68 (-0.02)
-Overall:    0.70 (+0.01)
-
-✅ No regressions detected
-```
-
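The baseline comparison shown in the removed plan can be sketched as a per-metric delta checked against the two thresholds. This is an illustrative sketch rather than the package's script: `compareToBaseline` and its types are hypothetical names. Treating a gain exactly equal to the `improvement` threshold as an improvement matches the sample output (+0.03 is marked ✅), while treating a drop exactly equal to `regression` as a regression is an assumption.

```typescript
type Metric = "relevance" | "ranking" | "coverage" | "snippetQuality" | "overall";
type Scores = Record<Metric, number>;

interface Thresholds {
  regression: number;  // a drop of at least this much counts as a regression (assumed inclusive)
  improvement: number; // a gain of at least this much counts as an improvement
}

type Status = "regression" | "improvement" | "unchanged";

interface MetricDelta {
  metric: Metric;
  current: number;
  delta: number;
  status: Status;
}

function compareToBaseline(baseline: Scores, current: Scores, t: Thresholds): MetricDelta[] {
  const metrics = Object.keys(baseline) as Metric[];
  return metrics.map((metric) => {
    // Round to two decimals so floating-point noise in 0.75 - 0.72
    // does not affect the comparison against 0.03.
    const delta = Math.round((current[metric] - baseline[metric]) * 100) / 100;
    let status: Status = "unchanged";
    if (delta <= -t.regression) status = "regression";
    else if (delta >= t.improvement) status = "improvement";
    return { metric, current: current[metric], delta, status };
  });
}

// True when any metric dropped by at least the regression threshold.
function hasRegression(deltas: MetricDelta[]): boolean {
  return deltas.some((d) => d.status === "regression");
}
```

With the baseline scores and the run shown in the plan, this marks relevance and coverage as improvements and reports no regressions, since the largest drop (0.02) is below the 0.05 threshold.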
----
-
-## Test Execution
-
-### Commands
-
-| Command | Purpose |
-|---------|---------|
-| `npm run test:corpus:index` | Create store + index committed corpus |
-| `npm run test:search-quality` | Regression check against baseline |
-| `npm run test:search-quality -- --explore` | Generate fresh queries + run |
-| `npm run test:search-quality -- --set <name>` | Re-run historical query set |
-| `npm run test:search-quality -- --update-baseline` | Lock current scores as baseline |
-
-### CI Integration
-
-```yaml
-- run: npm run test:corpus:index
-- run: npm run test:search-quality
-```
-
----
-
-## AI Judgment Calibration
-
-### Spot-Check Workflow
-
-```bash
-npm run test:search-quality -- --review
-```
-
-Interactive review of AI judgments to track agreement rate.
-
-### Calibration Data (`queries/calibration.json`)
-
-```json
-{
-  "judgments": [...],
-  "stats": {
-    "totalReviewed": 47,
-    "agreementRate": 0.89,
-    "lastReview": "2025-12-17"
-  }
-}
-```
-
-### When to Recalibrate
-
-- Agreement rate drops below 85%
-- After major prompt changes
-- Quarterly as hygiene
-
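The recalibration rules in the removed plan lend themselves to a small check over the `stats` object. This is a sketch under stated assumptions: the shape of an individual judgment is elided in the plan, so `ReviewedJudgment` and its `humanAgrees` field are hypothetical, as are the function names, and "quarterly" is approximated as 90 days.

```typescript
// Hypothetical record for one spot-checked judgment; the plan elides the real shape.
interface ReviewedJudgment {
  queryId: string;
  humanAgrees: boolean; // did the reviewer agree with the AI judgment?
}

interface CalibrationStats {
  totalReviewed: number;
  agreementRate: number;
  lastReview: string; // calendar date such as "2025-12-17"
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const QUARTER_DAYS = 90; // assumption: "quarterly" approximated as 90 days

// Fraction of reviewed judgments where the human agreed with the AI.
function agreementRate(judgments: ReviewedJudgment[]): number {
  if (judgments.length === 0) return 0;
  const agreed = judgments.filter((j) => j.humanAgrees).length;
  return agreed / judgments.length;
}

// Lists the reasons a recalibration is due; an empty list means none apply.
function recalibrationReasons(
  stats: CalibrationStats,
  now: Date,
  promptChanged: boolean = false,
): string[] {
  const reasons: string[] = [];
  if (stats.agreementRate < 0.85) reasons.push("agreement rate below 85%");
  if (promptChanged) reasons.push("major prompt change");
  const daysSince = (now.getTime() - new Date(stats.lastReview).getTime()) / MS_PER_DAY;
  if (daysSince > QUARTER_DAYS) reasons.push("more than a quarter since last review");
  return reasons;
}
```

With the example stats from the plan (agreement rate 0.89, last review 2025-12-17), no reason applies a few weeks later; the quarterly rule is the first to trigger once 90 days have passed.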
----
-
-## Implementation Priority
-
-### Phase 1 - Foundation
-
-1. Build corpus: clone Zod + Hono, clean .git dirs, commit
-2. Add 5-10 articles/docs manually
-3. Create `core.json` with 15-20 curated queries
-4. Update script for committed corpus + named query sets
-5. Add baseline comparison output
-
-### Phase 2 - Tracking
-
-1. Implement `baseline.json` and `history.json`
-2. Add `--update-baseline` flag
-3. Regression detection with threshold alerts
-4. Before/after comparison in output
-
-### Phase 3 - Calibration
-
-1. Interactive `--review` command
-2. `calibration.json` tracking
-3. Agreement rate reporting
-
-### Phase 4 - CI
-
-1. GitHub Actions workflow
-2. PR comments with score changes
-3. Block merges on regression
-
-### Out of Scope (YAGNI)
-
-- PDF support (add later if needed)
-- Visualization dashboards
-- Automated query promotion