diogenes 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/.release-please-config.json +16 -0
- data/.release-please-manifest.json +3 -0
- data/CHANGELOG.md +7 -0
- data/CLAUDE.md +138 -0
- data/README.md +192 -19
- data/docs/context.md +60 -0
- data/docs/contributing.md +228 -0
- data/docs/dashboard.md +365 -0
- data/docs/examples.md +162 -0
- data/docs/framework.md +146 -0
- data/lib/diogenes/version.rb +1 -1
- data/mise.lock +48 -0
- data/mise.toml +6 -0
- metadata +42 -17
|
@@ -0,0 +1,228 @@
|
|
|
1
|
+
# Contributing to Diogenes
|
|
2
|
+
|
|
3
|
+
Diogenes is opinionated by design. Before contributing, read the philosophy section of the README and the architecture notes in CLAUDE.md. A PR that conflicts with the core philosophy — even if the code is excellent — will not be merged.
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## What We're Looking For
|
|
8
|
+
|
|
9
|
+
**Good contributions:**
|
|
10
|
+
- New gates that encode a real, defensible constraint
|
|
11
|
+
- Improvements to grounding verification accuracy or the verifier prompt
|
|
12
|
+
- New eval matchers that cover failure modes existing matchers don't
|
|
13
|
+
- Drift detection improvements — staleness scoring, webhook integrations
|
|
14
|
+
- Better failure messages — they should be actionable, not cryptic
|
|
15
|
+
- Dashboard improvements that surface existing data more clearly
|
|
16
|
+
- Bug fixes with accompanying regression tests
|
|
17
|
+
- Documentation that makes the framework clearer
|
|
18
|
+
|
|
19
|
+
**Not a good fit:**
|
|
20
|
+
- LLM client integrations (Diogenes is provider-agnostic by design)
|
|
21
|
+
- RAG pipeline or embedding implementations
|
|
22
|
+
- Gates that can be bypassed with configuration
|
|
23
|
+
- Anything that makes a gate failure a warning instead of an error
|
|
24
|
+
- Database-backed eval golden pairs
|
|
25
|
+
- Authentication for the dashboard
|
|
26
|
+
|
|
27
|
+
If you're unsure whether your idea fits, open a discussion issue before writing code.
|
|
28
|
+
|
|
29
|
+
---
|
|
30
|
+
|
|
31
|
+
## Getting Started
|
|
32
|
+
|
|
33
|
+
```bash
|
|
34
|
+
git clone https://github.com/your-org/diogenes
|
|
35
|
+
cd diogenes
|
|
36
|
+
bundle install
|
|
37
|
+
bundle exec rspec
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
The test suite should pass with no configuration. If it doesn't, open an issue.
|
|
41
|
+
|
|
42
|
+
To run the full suite including Rails integration tests:
|
|
43
|
+
|
|
44
|
+
```bash
|
|
45
|
+
bundle exec rspec
|
|
46
|
+
bundle exec rspec spec/rails
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
---
|
|
50
|
+
|
|
51
|
+
## Contributing to Gates
|
|
52
|
+
|
|
53
|
+
Gates are the core primitive. Adding one is a significant decision — each gate adds cognitive load to every team that adopts the gem. New gates should address a failure mode that existing gates don't cover.
|
|
54
|
+
|
|
55
|
+
### Steps
|
|
56
|
+
|
|
57
|
+
1. Create `lib/diogenes/gates/your_gate.rb` inheriting from `Diogenes::Gates::Base`
|
|
58
|
+
2. Implement `#valid?` and `#failure_message`
|
|
59
|
+
3. Add any cross-gate incompatibilities to `lib/diogenes/gates/compatibility.rb`
|
|
60
|
+
4. Register in `Diogenes::Feature`
|
|
61
|
+
5. Write unit specs in `spec/diogenes/gates/your_gate_spec.rb`
|
|
62
|
+
6. Document in `docs/framework.md` following the existing format
|
|
63
|
+
7. Add to `docs/examples.md` if the gate meaningfully changes a verdict
|
|
64
|
+
|
|
65
|
+
### Gate Design Rules
|
|
66
|
+
|
|
67
|
+
- Gates fail loudly or pass. No warnings, no degraded modes.
|
|
68
|
+
- `#failure_message` must include the feature class name, the gate name, and a plain-English instruction for what to change.
|
|
69
|
+
- Gates are validated at boot. If your gate requires runtime information to validate, reconsider the design.
|
|
70
|
+
- Default configuration should be the most conservative option.
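
As a rough illustration of the rules above, a gate might look like the sketch below. `SketchGateBase` is a stand-in so the example runs on its own; the constructor arguments, the `validate!` hook, and the `ReviewQueueGate` name are assumptions, not the actual `Diogenes::Gates::Base` interface.

```ruby
# Stand-in for Diogenes::Gates::Base so this sketch runs on its own.
# The real base class's constructor and helpers are not shown in this guide.
class SketchGateBase
  attr_reader :feature_class, :config

  def initialize(feature_class, config = {})
    @feature_class = feature_class
    @config = config
  end

  # Hypothetical boot-time hook: raise loudly unless the gate passes.
  def validate!
    raise ArgumentError, failure_message unless valid?

    true
  end
end

# Hypothetical gate: every feature must declare a human-review queue.
class ReviewQueueGate < SketchGateBase
  GATE_NAME = "review_queue"

  # Conservative default: a missing or blank queue name fails the gate.
  def valid?
    queue = config[:review_queue]
    queue.is_a?(String) && !queue.strip.empty?
  end

  # Names the feature class, the gate, and what to change.
  def failure_message
    "#{feature_class.name} failed the #{GATE_NAME} gate: " \
      "declare a non-empty review_queue in the feature definition."
  end
end

class SupportAssistant; end

gate = ReviewQueueGate.new(SupportAssistant, review_queue: "")
begin
  gate.validate!
rescue ArgumentError => e
  puts e.message
end
```

There is no warning path here: the gate either returns `true` or raises, which matches the "fail loudly or pass" rule.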
|
|
71
|
+
|
|
72
|
+
### Versioning
|
|
73
|
+
|
|
74
|
+
Adding a new gate incompatibility to the compatibility matrix is a **minor** bump if it catches configurations that were silently wrong. It is a **major** bump if it invalidates configurations that were intentionally valid.
|
|
75
|
+
|
|
76
|
+
---
|
|
77
|
+
|
|
78
|
+
## Contributing to Grounding Verification
|
|
79
|
+
|
|
80
|
+
The grounding verifier has two separate concerns: the prompt and the logic. Keep them separate.
|
|
81
|
+
|
|
82
|
+
### The Prompt (`lib/diogenes/grounding/prompt.rb`)
|
|
83
|
+
|
|
84
|
+
The prompt is versioned independently. When you change it:
|
|
85
|
+
|
|
86
|
+
- Increment `Diogenes::Grounding::Prompt::VERSION`
|
|
87
|
+
- Audit records store the prompt version — old records remain interpretable
|
|
88
|
+
- Add a spec asserting that the new prompt correctly classifies a fixture set of (context, response) pairs as supported/unsupported/contradicted
|
|
89
|
+
|
|
90
|
+
Don't change the prompt to improve performance on one case without running the full fixture set. A prompt that is better on average but worse on contradictions is not an improvement.
|
|
91
|
+
|
|
92
|
+
### The Verifier Logic (`lib/diogenes/grounding/verifier.rb`)
|
|
93
|
+
|
|
94
|
+
The verifier accepts any LLM callable. Tests use a stub:
|
|
95
|
+
|
|
96
|
+
```ruby
|
|
97
|
+
Diogenes::Grounding::Verifier.stub(result: :pass)
|
|
98
|
+
Diogenes::Grounding::Verifier.stub(result: :flag, unsupported: ["claim text"])
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
Never make live LLM calls in specs.
|
|
102
|
+
|
|
103
|
+
### The Result (`lib/diogenes/grounding/result.rb`)
|
|
104
|
+
|
|
105
|
+
`Diogenes::Grounding::Result` is a value object. Adding fields is a minor bump. Removing or renaming fields is a major bump — audit records store serialized results.
|
|
106
|
+
|
|
107
|
+
---
|
|
108
|
+
|
|
109
|
+
## Contributing to Drift Detection
|
|
110
|
+
|
|
111
|
+
### Staleness Scoring (`lib/diogenes/drift/staleness_score.rb`)
|
|
112
|
+
|
|
113
|
+
The staleness score is a pure function of timestamps and diff size. Keep it that way — no database access, no I/O. Specs use fixed timestamps.
|
|
114
|
+
|
|
115
|
+
```ruby
|
|
116
|
+
score = Diogenes::Drift::StalenessScore.calculate(
|
|
117
|
+
source_updated_at: 47.days.ago,
|
|
118
|
+
indexed_at: 89.days.ago,
|
|
119
|
+
diff_size: :major
|
|
120
|
+
)
|
|
121
|
+
```
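
To show what "pure function of timestamps and diff size" can mean in practice, here is a minimal sketch that runs without Rails. The weights and the formula are assumptions chosen for illustration; the shipped `StalenessScore` may weight things differently.

```ruby
# Hypothetical weights; the real StalenessScore weighting is not documented here.
DIFF_WEIGHTS = { minor: 0.5, moderate: 1.0, major: 2.0 }.freeze
SECONDS_PER_DAY = 86_400.0

# Pure function: the same inputs always give the same score. No clock, no I/O.
def staleness_score(source_updated_at:, indexed_at:, diff_size:)
  # An index at least as new as the source is not stale.
  return 0.0 if indexed_at >= source_updated_at

  gap_days = (source_updated_at - indexed_at) / SECONDS_PER_DAY
  gap_days * DIFF_WEIGHTS.fetch(diff_size)
end

# Fixed timestamps, as a spec would use.
now = Time.utc(2025, 1, 1)
score = staleness_score(
  source_updated_at: now - 47 * SECONDS_PER_DAY,
  indexed_at: now - 89 * SECONDS_PER_DAY,
  diff_size: :major
)
puts score # 42-day gap times the :major weight of 2.0 => 84.0
```

Because the function never reads the current time itself, specs can pin every input and assert exact values.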
|
|
122
|
+
|
|
123
|
+
### Re-index Jobs
|
|
124
|
+
|
|
125
|
+
Diogenes ships a base job class. The host app subclasses it with embedding logic. Never add embedding implementation to the base class.
|
|
126
|
+
|
|
127
|
+
```ruby
|
|
128
|
+
# What we ship
|
|
129
|
+
class Diogenes::Drift::ReindexJob < ApplicationJob
|
|
130
|
+
def perform(document_id)
|
|
131
|
+
raise NotImplementedError, "Subclass with your embedding logic"
|
|
132
|
+
end
|
|
133
|
+
end
|
|
134
|
+
|
|
135
|
+
# What the host app writes
|
|
136
|
+
class ReindexDocumentJob < Diogenes::Drift::ReindexJob
|
|
137
|
+
def perform(document_id)
|
|
138
|
+
# fetch, embed, store
|
|
139
|
+
end
|
|
140
|
+
end
|
|
141
|
+
```
|
|
142
|
+
|
|
143
|
+
### Webhooks
|
|
144
|
+
|
|
145
|
+
Drift alert webhooks send a standard payload. If you're adding webhook support for a new event type, use the existing payload shape and add a `type` field. Don't introduce a new shape.
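
A sketch of the "one shape, distinguished by `type`" idea follows. Every field name other than `type` is hypothetical, as are the event names; the shipped payload is not documented in this guide.

```ruby
require "json"
require "time"

# Hypothetical payload builder: all events share the same keys and differ
# only in the value of "type" and the contents of "details".
def drift_webhook_payload(type:, document_id:, occurred_at:, details: {})
  {
    "type" => type,
    "document_id" => document_id,
    "occurred_at" => occurred_at.utc.iso8601,
    "details" => details
  }
end

stale = drift_webhook_payload(
  type: "drift.document_stale",
  document_id: "refund-policy-v2",
  occurred_at: Time.utc(2025, 1, 1, 9, 0, 0),
  details: { "staleness" => "critical" }
)

reindexed = drift_webhook_payload(
  type: "drift.document_reindexed",
  document_id: "refund-policy-v2",
  occurred_at: Time.utc(2025, 1, 2, 9, 0, 0)
)

puts JSON.generate(stale)
```

Receivers can then branch on `type` without needing a parser per event.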
|
|
146
|
+
|
|
147
|
+
---
|
|
148
|
+
|
|
149
|
+
## Contributing to the Eval Runner
|
|
150
|
+
|
|
151
|
+
### Adding Matchers (`lib/diogenes/evals/matchers.rb`)
|
|
152
|
+
|
|
153
|
+
Matchers are composable predicates. A new matcher should:
|
|
154
|
+
|
|
155
|
+
- Accept a response struct and return true/false
|
|
156
|
+
- Have a clear, descriptive failure message when it returns false
|
|
157
|
+
- Be usable inside `all_of` and `one_of`
|
|
158
|
+
- Be tested against fixture responses, never live LLM calls
|
|
159
|
+
|
|
160
|
+
```ruby
|
|
161
|
+
module Diogenes
|
|
162
|
+
module Evals
|
|
163
|
+
module Matchers
|
|
164
|
+
def your_matcher(arg)
|
|
165
|
+
->(result) {
|
|
166
|
+
# evaluate result, return true/false
|
|
167
|
+
}
|
|
168
|
+
end
|
|
169
|
+
end
|
|
170
|
+
end
|
|
171
|
+
end
|
|
172
|
+
```
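
The composability requirement can be sketched with plain lambdas. The `Response` struct and the `MatcherSketch` module are stand-ins invented for this example, and `contains` here is a simple substring check rather than the semantic comparison the real matcher may perform.

```ruby
# Hypothetical response shape; the real response struct is not shown in this guide.
Response = Struct.new(:text, :sources, keyword_init: true)

module MatcherSketch
  module_function

  # Each matcher returns a lambda that takes a response and returns true/false.
  def contains(text)
    ->(response) { response.text.include?(text) }
  end

  def does_not_contain(text)
    ->(response) { !response.text.include?(text) }
  end

  def grounded_in(source)
    ->(response) { response.sources.include?(source) }
  end

  # Combinators accept matchers and return a matcher, so they nest freely.
  def all_of(*matchers)
    ->(response) { matchers.all? { |matcher| matcher.call(response) } }
  end

  def one_of(*matchers)
    ->(response) { matchers.any? { |matcher| matcher.call(response) } }
  end
end

# A fixture response, never a live LLM call.
fixture = Response.new(
  text: "Request a refund from the billing page.",
  sources: ["refund-policy"]
)

matcher = MatcherSketch.all_of(
  MatcherSketch.grounded_in("refund-policy"),
  MatcherSketch.contains("billing page"),
  MatcherSketch.does_not_contain("24 hours")
)

puts matcher.call(fixture)
```

Returning a lambda from every matcher is what lets `all_of` and `one_of` treat built-in and contributed matchers identically.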
|
|
173
|
+
|
|
174
|
+
### Golden Pairs Live in Code
|
|
175
|
+
|
|
176
|
+
Don't add database storage for golden pairs. They are version-controlled files. If a contributor wants to propose database-backed pairs, the answer is no — visibility in code review is a feature, not a limitation.
|
|
177
|
+
|
|
178
|
+
### Regression Detection
|
|
179
|
+
|
|
180
|
+
The regression detector (`lib/diogenes/evals/regression.rb`) stores the last passing response and diffs it against the first failing one. If you're improving diff quality, use the fixture set in `spec/fixtures/evals/regressions/`. Don't introduce a dependency on a diff library — the current implementation is intentionally minimal.
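
For a sense of how small a dependency-free diff can be, here is a sketch that compares responses as whole lines. It is not the shipped implementation: it ignores line order and collapses duplicate lines, which may or may not match what `regression.rb` does.

```ruby
# Compares two responses line by line using only core Array operations.
# Returns the lines that disappeared and the lines that appeared.
def minimal_line_diff(passing, failing)
  old_lines = passing.lines.map(&:chomp)
  new_lines = failing.lines.map(&:chomp)

  { removed: old_lines - new_lines, added: new_lines - old_lines }
end

passing = "Data is retained for 7 years.\nContact legal for details."
failing = "Data retention policies vary by region.\nContact legal for details."

diff = minimal_line_diff(passing, failing)
puts "- #{diff[:removed].join("\n- ")}"
puts "+ #{diff[:added].join("\n+ ")}"
```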
|
|
181
|
+
|
|
182
|
+
---
|
|
183
|
+
|
|
184
|
+
## Contributing to the Dashboard
|
|
185
|
+
|
|
186
|
+
The dashboard engine has no authentication and no CSS framework dependency. Keep it that way.
|
|
187
|
+
|
|
188
|
+
### Adding a New View
|
|
189
|
+
|
|
190
|
+
Views use ERB. No ViewComponent, no Stimulus, no Turbo (unless it's already a dependency by the time you're reading this — check the gemspec). Interactivity should be achievable with standard Rails UJS or vanilla JS.
|
|
191
|
+
|
|
192
|
+
### Adding a New Route
|
|
193
|
+
|
|
194
|
+
Add to `lib/diogenes/engine/config/routes.rb`. Follow the existing namespace pattern. Every new route needs a corresponding controller action and view, even if the view is a stub.
|
|
195
|
+
|
|
196
|
+
### What the Dashboard Must Never Do
|
|
197
|
+
|
|
198
|
+
- Make LLM calls
|
|
199
|
+
- Trigger re-indexing without an explicit user action
|
|
200
|
+
- Display raw query or response content (hashes only — PII protection)
|
|
201
|
+
- Require specific authentication middleware
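
The "hashes only" rule could be met with something like the sketch below. The helper name is hypothetical, and the choice of SHA-256 is an assumption; the guide does not say which digest Diogenes stores.

```ruby
require "digest"

# Hypothetical helper: the dashboard shows this fingerprint in place of
# the raw query or response text.
def content_fingerprint(text)
  Digest::SHA256.hexdigest(text)
end

query = "How long does a refund take?"
fingerprint = content_fingerprint(query)

# The same text always yields the same 64-character hex string, so repeated
# queries can still be grouped without displaying their content.
puts fingerprint.length
```

Note that an unsalted hash of short, guessable text can be reversed by brute force, so a view that needs stronger protection may want a keyed digest instead.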
|
|
202
|
+
|
|
203
|
+
---
|
|
204
|
+
|
|
205
|
+
## Pull Request Guidelines
|
|
206
|
+
|
|
207
|
+
- One logical change per PR
|
|
208
|
+
- Tests for all new behaviour
|
|
209
|
+
- Updated documentation if you're changing or adding functionality
|
|
210
|
+
- A clear description of *why*, not just *what*
|
|
211
|
+
|
|
212
|
+
PRs that add code without updating relevant docs will be sent back.
|
|
213
|
+
|
|
214
|
+
---
|
|
215
|
+
|
|
216
|
+
## Versioning
|
|
217
|
+
|
|
218
|
+
Diogenes follows semantic versioning strictly.
|
|
219
|
+
|
|
220
|
+
- **Patch:** bug fixes, documentation, internal refactors
|
|
221
|
+
- **Minor:** new gates, new matchers, new dashboard views, non-breaking additions
|
|
222
|
+
- **Major:** changes to gate semantics, changes to audit record shape, removal of features, breaking DSL changes
|
|
223
|
+
|
|
224
|
+
---
|
|
225
|
+
|
|
226
|
+
## Code of Conduct
|
|
227
|
+
|
|
228
|
+
Be direct. Be specific. Assume good intent. If you're reviewing a PR, explain your reasoning — "this conflicts with the philosophy" is not sufficient without explaining which part and why.
|
data/docs/dashboard.md
ADDED
|
@@ -0,0 +1,365 @@
|
|
|
1
|
+
# The Diogenes Dashboard
|
|
2
|
+
|
|
3
|
+
The Diogenes dashboard is a Rails engine that mounts into your existing application and gives you a live view of AI feature health across three dimensions: grounding verification, document drift, and eval regressions.
|
|
4
|
+
|
|
5
|
+
Think of it as Sidekiq Web for AI accountability — one place to see whether your AI features are doing what you think they're doing.
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## Mounting
|
|
10
|
+
|
|
11
|
+
```ruby
|
|
12
|
+
# config/routes.rb
|
|
13
|
+
mount Diogenes::Engine => '/diogenes'
|
|
14
|
+
```
|
|
15
|
+
|
|
16
|
+
Diogenes has no opinions about authentication. Protect the route the same way you protect any sensitive admin surface:
|
|
17
|
+
|
|
18
|
+
```ruby
|
|
19
|
+
# With Devise + admin constraint
|
|
20
|
+
authenticate :user, ->(u) { u.admin? } do
|
|
21
|
+
mount Diogenes::Engine => '/diogenes'
|
|
22
|
+
end
|
|
23
|
+
|
|
24
|
+
# With HTTP basic auth via a constraint
|
|
25
|
+
constraints Diogenes::AdminConstraint.new(ENV['DIOGENES_PASSWORD']) do
|
|
26
|
+
mount Diogenes::Engine => '/diogenes'
|
|
27
|
+
end
|
|
28
|
+
|
|
29
|
+
# With a custom routing constraint
|
|
30
|
+
mount Diogenes::Engine => '/diogenes', constraints: YourAuthConstraint
|
|
31
|
+
```
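
For the last option, Rails accepts any object that responds to `matches?(request)` as a constraint. The sketch below shows one possible `YourAuthConstraint`; the `:admin` session key is an assumption about the host app, and `FakeRequest` exists only so the example runs without Rails loaded.

```ruby
# Hypothetical constraint: the router calls `matches?(request)` and only
# dispatches to the mounted engine when it returns true.
class YourAuthConstraint
  def self.matches?(request)
    request.session[:admin] == true
  end
end

# Stand-in for a Rails request, exposing only what the constraint reads.
FakeRequest = Struct.new(:session)

admin_request = FakeRequest.new({ admin: true })
guest_request = FakeRequest.new({})

puts YourAuthConstraint.matches?(admin_request)
puts YourAuthConstraint.matches?(guest_request)
```

A request that fails the constraint falls through to the next matching route, which usually means a 404 rather than a login redirect.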
|
|
32
|
+
|
|
33
|
+
---
|
|
34
|
+
|
|
35
|
+
## Engine Routes
|
|
36
|
+
|
|
37
|
+
```ruby
|
|
38
|
+
# lib/diogenes/engine/config/routes.rb
|
|
39
|
+
|
|
40
|
+
Diogenes::Engine.routes.draw do
|
|
41
|
+
root to: 'dashboard#overview'
|
|
42
|
+
|
|
43
|
+
namespace :dashboard do
|
|
44
|
+
# Grounding
|
|
45
|
+
get 'grounding', to: 'grounding#index'
|
|
46
|
+
get 'grounding/:feature', to: 'grounding#show'
|
|
47
|
+
get 'grounding/:feature/:id', to: 'grounding#verification'
|
|
48
|
+
|
|
49
|
+
# Drift
|
|
50
|
+
get 'drift', to: 'drift#index'
|
|
51
|
+
get 'drift/:index_name', to: 'drift#show'
|
|
52
|
+
post 'drift/:index_name/reindex', to: 'drift#reindex'
|
|
53
|
+
post 'drift/documents/:document_id/reindex', to: 'drift#reindex_document' # "documents/" prefix keeps this from colliding with the :index_name route above
|
|
54
|
+
|
|
55
|
+
# Evals
|
|
56
|
+
get 'evals', to: 'evals#index'
|
|
57
|
+
get 'evals/:feature', to: 'evals#show'
|
|
58
|
+
post 'evals/:feature/run', to: 'evals#run'
|
|
59
|
+
get 'evals/:feature/:run_id', to: 'evals#run_result'
|
|
60
|
+
end
|
|
61
|
+
end
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
---
|
|
65
|
+
|
|
66
|
+
## Controller Structure
|
|
67
|
+
|
|
68
|
+
```
|
|
69
|
+
lib/diogenes/engine/
|
|
70
|
+
app/
|
|
71
|
+
controllers/
|
|
72
|
+
diogenes/
|
|
73
|
+
dashboard_controller.rb # Overview tab
|
|
74
|
+
dashboard/
|
|
75
|
+
grounding_controller.rb
|
|
76
|
+
drift_controller.rb
|
|
77
|
+
evals_controller.rb
|
|
78
|
+
views/
|
|
79
|
+
diogenes/
|
|
80
|
+
dashboard/
|
|
81
|
+
overview.html.erb
|
|
82
|
+
grounding/
|
|
83
|
+
index.html.erb
|
|
84
|
+
show.html.erb
|
|
85
|
+
verification.html.erb
|
|
86
|
+
drift/
|
|
87
|
+
index.html.erb
|
|
88
|
+
show.html.erb
|
|
89
|
+
evals/
|
|
90
|
+
index.html.erb
|
|
91
|
+
show.html.erb
|
|
92
|
+
run_result.html.erb
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
---
|
|
96
|
+
|
|
97
|
+
## The Overview Tab
|
|
98
|
+
|
|
99
|
+
`GET /diogenes`
|
|
100
|
+
|
|
101
|
+
The entry point. One row per gated feature, showing aggregate health across all three dimensions. Red features surface immediately.
|
|
102
|
+
|
|
103
|
+
```
|
|
104
|
+
┌─────────────────────────────────────────────────────────────────────────────┐
|
|
105
|
+
│ DIOGENES Last updated: 2 min ago│
|
|
106
|
+
├──────────────────────┬────────┬─────────────┬──────────────┬────────────────┤
|
|
107
|
+
│ Feature │ Gates │ Grounding │ Drift │ Evals │
|
|
108
|
+
├──────────────────────┼────────┼─────────────┼──────────────┼────────────────┤
|
|
109
|
+
│ SupportAssistant │ 5/5 ✓ │ 94% clean │ 2 stale docs │ 47/50 passing │
|
|
110
|
+
│ CodebaseOracle │ 5/5 ✓ │ 98% clean │ All current │ 30/30 passing │
|
|
111
|
+
│ ComplianceSearcher │ 5/5 ✓ │ 87% clean ⚠ │ 11 stale docs│ 18/25 passing ✗│
|
|
112
|
+
└──────────────────────┴────────┴─────────────┴──────────────┴────────────────┘
|
|
113
|
+
```
|
|
114
|
+
|
|
115
|
+
`ComplianceSearcher` has a low grounding rate, stale documents, and a failing eval suite. That's the feature that needs attention today — and you can see it before a user does.
|
|
116
|
+
|
|
117
|
+
---
|
|
118
|
+
|
|
119
|
+
## The Grounding Tab
|
|
120
|
+
|
|
121
|
+
`GET /diogenes/dashboard/grounding`
|
|
122
|
+
|
|
123
|
+
Per-feature grounding verification history. Shows flag rates over time and a live feed of recent verifications.
|
|
124
|
+
|
|
125
|
+
```
|
|
126
|
+
SupportAssistant — Grounding History (last 30 days)
|
|
127
|
+
|
|
128
|
+
Flag rate: 6.2% ▼ from 8.1% last month
|
|
129
|
+
|
|
130
|
+
Recent verifications:
|
|
131
|
+
✓ Supported "You can cancel your subscription from the billing page..."
|
|
132
|
+
✓ Supported "The enterprise tier includes unlimited seats..."
|
|
133
|
+
⚠ Unsupported "Refunds are processed within 24 hours..." → flagged for review
|
|
134
|
+
✗ Contradicted "The API rate limit is 100 req/min..." → blocked, not served
|
|
135
|
+
```
|
|
136
|
+
|
|
137
|
+
### Verification Detail
|
|
138
|
+
|
|
139
|
+
`GET /diogenes/dashboard/grounding/:feature/:id`
|
|
140
|
+
|
|
141
|
+
Drill into any verification to see the full picture:
|
|
142
|
+
|
|
143
|
+
```
|
|
144
|
+
Verification #8472 — SupportAssistant
|
|
145
|
+
Query: "How long does a refund take?"
|
|
146
|
+
Status: ⚠ FLAGGED — unsupported claim detected
|
|
147
|
+
|
|
148
|
+
Retrieved context:
|
|
149
|
+
[chunk 1] "Refund requests are reviewed within 2 business days..." ← source: refund-policy.md
|
|
150
|
+
[chunk 2] "Enterprise customers receive priority support..." ← source: enterprise-terms.md
|
|
151
|
+
|
|
152
|
+
AI response:
|
|
153
|
+
"Refunds are typically processed within 24 hours."
|
|
154
|
+
|
|
155
|
+
Verdict:
|
|
156
|
+
supported: []
|
|
157
|
+
unsupported: ["processed within 24 hours"] ← no chunk supports this timeframe
|
|
158
|
+
contradicted: []
|
|
159
|
+
|
|
160
|
+
Action taken: flagged_for_review
|
|
161
|
+
Reviewed by: sarah@company.com (approved with edit)
|
|
162
|
+
Final sent: "Refunds are reviewed within 2 business days..."
|
|
163
|
+
```
|
|
164
|
+
|
|
165
|
+
This is the audit trail that makes human-in-the-loop real. Every verification, every flag, every human decision — all linked.
|
|
166
|
+
|
|
167
|
+
---
|
|
168
|
+
|
|
169
|
+
## The Drift Tab
|
|
170
|
+
|
|
171
|
+
`GET /diogenes/dashboard/drift`
|
|
172
|
+
|
|
173
|
+
A staleness leaderboard: every indexed document, ranked by the gap between when its source last changed and when its embedding was last updated.
|
|
174
|
+
|
|
175
|
+
```
|
|
176
|
+
Document Drift — All Indexes
|
|
177
|
+
|
|
178
|
+
● 11 documents require attention
|
|
179
|
+
|
|
180
|
+
CRITICAL (source changed > 30 days ago, not re-indexed)
|
|
181
|
+
┌─────────────────────────────────┬──────────────┬──────────────┬──────────┐
|
|
182
|
+
│ Document │ Source updated│ Last indexed │ Staleness│
|
|
183
|
+
├─────────────────────────────────┼──────────────┼──────────────┼──────────┤
|
|
184
|
+
│ refund-policy.md │ 47 days ago │ 89 days ago │ ████████ │
|
|
185
|
+
│ enterprise-pricing.md │ 33 days ago │ 102 days ago │ ████████ │
|
|
186
|
+
└─────────────────────────────────┴──────────────┴──────────────┴──────────┘
|
|
187
|
+
|
|
188
|
+
WARNING (source changed 7–30 days ago)
|
|
189
|
+
┌─────────────────────────────────┬──────────────┬──────────────┬──────────┐
|
|
190
|
+
│ api-rate-limits.md │ 12 days ago │ 45 days ago │ █████░░░ │
|
|
191
|
+
│ ... │ │ │ │
|
|
192
|
+
└─────────────────────────────────┴──────────────┴──────────────┴──────────┘
|
|
193
|
+
|
|
194
|
+
[Re-index all critical] [Re-index all warnings]
|
|
195
|
+
```
|
|
196
|
+
|
|
197
|
+
### Re-indexing
|
|
198
|
+
|
|
199
|
+
Clicking re-index queues an `ActiveJob`. The job is defined in the host app — Diogenes provides the interface and the tracking, not the embedding logic:
|
|
200
|
+
|
|
201
|
+
```ruby
|
|
202
|
+
# config/initializers/diogenes.rb
|
|
203
|
+
Diogenes.configure do |config|
|
|
204
|
+
config.drift.reindex_job = ReindexDocumentJob
|
|
205
|
+
config.drift.staleness_thresholds = {
|
|
206
|
+
warning: 7.days,
|
|
207
|
+
critical: 30.days
|
|
208
|
+
}
|
|
209
|
+
config.drift.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
|
|
210
|
+
end
|
|
211
|
+
```
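
One way to read the thresholds, consistent with the CRITICAL and WARNING headings in the table above, is sketched here. The function name, the `:current` and `:pending` labels, and the exact boundary handling are assumptions; only the 7-day and 30-day values come from the configuration shown.

```ruby
DAY = 86_400 # seconds
THRESHOLDS = { warning: 7 * DAY, critical: 30 * DAY }.freeze

# Classifies one document from its two timestamps and an explicit "now".
def drift_level(source_updated_at:, indexed_at:, now:, thresholds: THRESHOLDS)
  # Re-indexed after the last source change: nothing to do.
  return :current if indexed_at >= source_updated_at

  age = now - source_updated_at # seconds since the source changed
  if age > thresholds[:critical]
    :critical
  elsif age >= thresholds[:warning]
    :warning
  else
    :pending # changed recently, not yet past the warning threshold
  end
end

now = Time.utc(2025, 1, 1)

# refund-policy.md: source changed 47 days ago, last indexed 89 days ago.
puts drift_level(source_updated_at: now - 47 * DAY, indexed_at: now - 89 * DAY, now: now)

# api-rate-limits.md: source changed 12 days ago, last indexed 45 days ago.
puts drift_level(source_updated_at: now - 12 * DAY, indexed_at: now - 45 * DAY, now: now)
```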
|
|
212
|
+
|
|
213
|
+
### How Drift Is Detected
|
|
214
|
+
|
|
215
|
+
Diogenes tracks two timestamps per indexed document:
|
|
216
|
+
|
|
217
|
+
- `source_updated_at` — populated by the host app when indexing, updated by a webhook or scheduled check
|
|
218
|
+
- `indexed_at` — set by Diogenes when an embedding is stored
|
|
219
|
+
|
|
220
|
+
The staleness score is a function of the gap between them, weighted by how much the source changed (line diff if available, timestamp delta otherwise).
|
|
221
|
+
|
|
222
|
+
```ruby
|
|
223
|
+
# Inform Diogenes that a source document has changed
|
|
224
|
+
Diogenes::Drift.source_updated(
|
|
225
|
+
document_id: 'refund-policy-v2',
|
|
226
|
+
updated_at: Time.current,
|
|
227
|
+
diff_size: :major # :minor, :moderate, :major — affects staleness weight
|
|
228
|
+
)
|
|
229
|
+
```
|
|
230
|
+
|
|
231
|
+
---
|
|
232
|
+
|
|
233
|
+
## The Evals Tab
|
|
234
|
+
|
|
235
|
+
`GET /diogenes/dashboard/evals`
|
|
236
|
+
|
|
237
|
+
One of the hardest problems in production AI is knowing whether your feature is getting better or worse over time. The evals tab is Diogenes's answer: define golden question/answer pairs, run them on a schedule, track pass rates over time, and alert on regression.
|
|
238
|
+
|
|
239
|
+
```
|
|
240
|
+
Evals — All Features
|
|
241
|
+
|
|
242
|
+
SupportAssistant 47/50 passing 94% ▲ from 91% last week [Run now]
|
|
243
|
+
CodebaseOracle 30/30 passing 100% → unchanged [Run now]
|
|
244
|
+
ComplianceSearcher 18/25 passing 72% ▼ from 88% last week ✗ [Run now]
|
|
245
|
+
```
|
|
246
|
+
|
|
247
|
+
### Defining Golden Pairs
|
|
248
|
+
|
|
249
|
+
Golden pairs live in your codebase as Ruby, not in a database. They're version-controlled, reviewed in PRs, and treated as first-class tests:
|
|
250
|
+
|
|
251
|
+
```ruby
|
|
252
|
+
# test/diogenes/evals/support_assistant_evals.rb
|
|
253
|
+
|
|
254
|
+
Diogenes::Evals.define(SupportAssistant) do
|
|
255
|
+
eval "basic refund question" do
|
|
256
|
+
query "How do I request a refund?"
|
|
257
|
+
expects all_of(
|
|
258
|
+
grounded_in("refund-policy"),
|
|
259
|
+
contains("billing page"),
|
|
260
|
+
does_not_contain("24 hours") # known wrong answer to guard against
|
|
261
|
+
)
|
|
262
|
+
end
|
|
263
|
+
|
|
264
|
+
eval "enterprise tier question" do
|
|
265
|
+
query "Does the enterprise plan include SSO?"
|
|
266
|
+
expects all_of(
|
|
267
|
+
grounded_in("enterprise-terms"),
|
|
268
|
+
semantically_similar_to("Yes, SSO is included in all enterprise plans")
|
|
269
|
+
)
|
|
270
|
+
end
|
|
271
|
+
|
|
272
|
+
eval "question with no good answer" do
|
|
273
|
+
query "What is the API rate limit for the legacy v1 endpoints?"
|
|
274
|
+
expects one_of(
|
|
275
|
+
low_confidence_response,
|
|
276
|
+
routes_to_human_review
|
|
277
|
+
)
|
|
278
|
+
end
|
|
279
|
+
end
|
|
280
|
+
```
|
|
281
|
+
|
|
282
|
+
### Matchers
|
|
283
|
+
|
|
284
|
+
| Matcher | What it checks |
|
|
285
|
+
|---|---|
|
|
286
|
+
| `grounded_in(source)` | Retrieved context includes chunks from the named source |
|
|
287
|
+
| `contains(text)` | Response includes the specified text or close semantic equivalent |
|
|
288
|
+
| `does_not_contain(text)` | Response does not include the specified text |
|
|
289
|
+
| `semantically_similar_to(text)` | Response embedding is within threshold cosine distance |
|
|
290
|
+
| `low_confidence_response` | Feature returned a flagged or low-confidence result |
|
|
291
|
+
| `routes_to_human_review` | Feature queued the output for human review |
|
|
292
|
+
| `all_of(*matchers)` | All matchers must pass |
|
|
293
|
+
| `one_of(*matchers)` | At least one matcher must pass |
|
|
294
|
+
|
|
295
|
+
### Running Evals
|
|
296
|
+
|
|
297
|
+
Evals can be run on demand from the dashboard, via a Rake task, or on a schedule:
|
|
298
|
+
|
|
299
|
+
```bash
|
|
300
|
+
# On demand
|
|
301
|
+
bundle exec rake diogenes:evals:run[SupportAssistant]
|
|
302
|
+
bundle exec rake diogenes:evals:run # all features
|
|
303
|
+
```
|
|
304
|
+
|
|
305
|
+
```ruby
|
|
306
|
+
# Scheduled via your existing job infrastructure
|
|
307
|
+
class RunDiogenesEvalsJob < ApplicationJob
|
|
308
|
+
def perform
|
|
309
|
+
Diogenes::Evals.run_all
|
|
310
|
+
end
|
|
311
|
+
end
|
|
312
|
+
```
|
|
313
|
+
|
|
314
|
+
### Regression Detection
|
|
315
|
+
|
|
316
|
+
When a passing eval starts failing, Diogenes records the regression point — exactly which run it broke on — and stores a diff between the last passing response and the first failing one. The evals detail view surfaces this:
|
|
317
|
+
|
|
318
|
+
```
|
|
319
|
+
ComplianceSearcher — Eval Regression Detected
|
|
320
|
+
|
|
321
|
+
"What is our data retention policy?"
|
|
322
|
+
|
|
323
|
+
Status: ✗ FAILING since run #142 (3 days ago)
|
|
324
|
+
|
|
325
|
+
Last passing response (run #141):
|
|
326
|
+
"Data is retained for 7 years per our compliance obligations..."
|
|
327
|
+
|
|
328
|
+
Current failing response (run #144):
|
|
329
|
+
"Data retention policies vary by region..."
|
|
330
|
+
|
|
331
|
+
Likely cause: 'data-retention-policy.md' updated 4 days ago, not re-indexed
|
|
332
|
+
→ See drift tab
|
|
333
|
+
```
|
|
334
|
+
|
|
335
|
+
That last line is the integration that makes the dashboard coherent: a failing eval points to a stale document in the drift tab, which explains the low grounding score on the grounding tab. Three separate signals, one root cause.
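
Finding the regression point from a run history can be sketched as below. The `Run` struct and `regression_point` function are hypothetical; they illustrate the "last passing, first failing" pairing described above rather than the stored schema.

```ruby
# Hypothetical record of one eval run.
Run = Struct.new(:id, :passed, :response, keyword_init: true)

# Returns [last_passing_run, first_failing_run] for the most recent
# pass-to-fail transition, or nil when the latest run passes or the eval
# has never passed.
def regression_point(runs)
  ordered = runs.sort_by(&:id)
  return nil if ordered.empty? || ordered.last.passed

  last_pass_index = ordered.rindex(&:passed)
  return nil if last_pass_index.nil?

  [ordered[last_pass_index], ordered[last_pass_index + 1]]
end

history = [
  Run.new(id: 140, passed: true,  response: "Data is retained for 7 years..."),
  Run.new(id: 141, passed: true,  response: "Data is retained for 7 years..."),
  Run.new(id: 142, passed: false, response: "Data retention policies vary by region..."),
  Run.new(id: 143, passed: false, response: "Data retention policies vary by region..."),
  Run.new(id: 144, passed: false, response: "Data retention policies vary by region...")
]

last_passing, first_failing = regression_point(history)
puts "FAILING since run ##{first_failing.id}, last passing run ##{last_passing.id}"
```

With this history the pair is runs #141 and #142, matching the detail view shown earlier.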
|
|
336
|
+
|
|
337
|
+
---
|
|
338
|
+
|
|
339
|
+
## Configuration Reference
|
|
340
|
+
|
|
341
|
+
```ruby
|
|
342
|
+
# config/initializers/diogenes.rb
|
|
343
|
+
|
|
344
|
+
Diogenes.configure do |config|
|
|
345
|
+
# Grounding
|
|
346
|
+
config.grounding.default_threshold = 0.8
|
|
347
|
+
config.grounding.verifier_llm = ->(prompt) { YourLLMClient.complete(prompt) }
|
|
348
|
+
|
|
349
|
+
# Drift
|
|
350
|
+
config.drift.reindex_job = ReindexDocumentJob
|
|
351
|
+
config.drift.staleness_thresholds = { warning: 7.days, critical: 30.days }
|
|
352
|
+
config.drift.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
|
|
353
|
+
config.drift.check_interval = 1.hour
|
|
354
|
+
|
|
355
|
+
# Evals
|
|
356
|
+
config.evals.schedule = '0 9 * * *' # daily at 9am, if using sidekiq-cron
|
|
357
|
+
config.evals.alert_on_regression = true
|
|
358
|
+
config.evals.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
|
|
359
|
+
config.evals.passing_threshold = 0.9 # alert if pass rate drops below 90%
|
|
360
|
+
|
|
361
|
+
# Audit log
|
|
362
|
+
config.audit.retention_days = 90
|
|
363
|
+
config.audit.pii_scrubber = Diogenes::PII::DefaultScrubber
|
|
364
|
+
end
|
|
365
|
+
```
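
As a worked example of `passing_threshold`, the check below applies a 0.9 threshold to the pass counts shown on the evals tab. The function names are hypothetical; only the threshold value and the counts come from this document.

```ruby
# Fraction of evals that passed; an empty suite counts as 0.0.
def pass_rate(passed, total)
  return 0.0 if total.zero?

  passed.to_f / total
end

# True when the pass rate has dropped below the configured threshold.
def alert_needed?(passed:, total:, threshold: 0.9)
  pass_rate(passed, total) < threshold
end

puts alert_needed?(passed: 47, total: 50) # SupportAssistant, 94%   => false
puts alert_needed?(passed: 30, total: 30) # CodebaseOracle, 100%    => false
puts alert_needed?(passed: 18, total: 25) # ComplianceSearcher, 72% => true
```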
|