diogenes 0.1.2 → 0.1.3
- checksums.yaml +4 -4
- data/.mise/config.toml +72 -0
- data/.mise/mise.lock +179 -0
- data/.mise/tasks/update-hk-import +79 -0
- data/.release-please-config.json +1 -1
- data/.release-please-manifest.json +2 -2
- data/CHANGELOG.md +7 -0
- data/CLAUDE.md +107 -99
- data/CONTRIBUTING.md +206 -0
- data/README.md +157 -134
- data/Rakefile +15 -1
- data/Steepfile +11 -0
- data/docs/gates.md +178 -0
- data/docs/targets.md +11 -0
- data/exe/diogenes +6 -0
- data/hk.pkl +46 -0
- data/lib/diogenes/cli/init.rb +88 -0
- data/lib/diogenes/cli.rb +95 -0
- data/lib/diogenes/templates/init/artifacts/decision_record.md.erb +53 -0
- data/lib/diogenes/templates/init/diogenes.rb +13 -0
- data/lib/diogenes/templates/init/hooks/README.md +15 -0
- data/lib/diogenes/templates/init/rules/five_gates.rb +33 -0
- data/lib/diogenes/templates/init/skills/example_skill.rb +33 -0
- data/lib/diogenes/version.rb +2 -1
- data/lib/diogenes.rb +27 -2
- data/sig/generated/diogenes/cli/init.rbs +34 -0
- data/sig/generated/diogenes/cli.rbs +34 -0
- data/sig/generated/diogenes/version.rbs +5 -0
- data/sig/generated/diogenes.rbs +26 -0
- metadata +23 -9
- data/docs/context.md +0 -60
- data/docs/contributing.md +0 -228
- data/docs/dashboard.md +0 -365
- data/docs/examples.md +0 -162
- data/docs/framework.md +0 -146
- data/mise.lock +0 -48
- data/mise.toml +0 -6
data/docs/dashboard.md DELETED
@@ -1,365 +0,0 @@
-# The Diogenes Dashboard
-
-The Diogenes dashboard is a Rails engine that mounts into your existing application and gives you a live view of AI feature health across three dimensions: grounding verification, document drift, and eval regressions.
-
-Think of it as Sidekiq Web for AI accountability — one place to see whether your AI features are doing what you think they're doing.
-
----
-
-## Mounting
-
-```ruby
-# config/routes.rb
-mount Diogenes::Engine => '/diogenes'
-```
-
-Diogenes has no opinions about authentication. Protect the route the same way you protect any sensitive admin surface:
-
-```ruby
-# With Devise + admin constraint
-authenticate :user, ->(u) { u.admin? } do
-  mount Diogenes::Engine => '/diogenes'
-end
-
-# With HTTP basic auth via a constraint
-constraints Diogenes::AdminConstraint.new(ENV['DIOGENES_PASSWORD']) do
-  mount Diogenes::Engine => '/diogenes'
-end
-
-# With a custom routing constraint (any object that responds to matches?)
-mount Diogenes::Engine => '/diogenes', constraints: YourAuthConstraint
-```
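A Rails routing constraint only has to respond to `matches?(request)`, so the `YourAuthConstraint` placeholder above could look roughly like the sketch below. This is an illustration under assumptions, not code shipped by Diogenes: the session key `:admin_user_id` is hypothetical, and `FakeRequest` exists only so the sketch runs outside Rails.

```ruby
# Hypothetical constraint for the `constraints:` option shown above.
# The router calls `matches?(request)` and routes to the engine only
# when it returns true.
class YourAuthConstraint
  # Assumed session key; use whatever your app stores after an admin signs in.
  SESSION_KEY = :admin_user_id

  def self.matches?(request)
    !request.session[SESSION_KEY].nil?
  end
end

# Stand-in for a Rails request so the sketch runs without Rails.
FakeRequest = Struct.new(:session)

puts YourAuthConstraint.matches?(FakeRequest.new({ admin_user_id: 42 })) # true
puts YourAuthConstraint.matches?(FakeRequest.new({}))                    # false
```

Because the routes file passes the class itself rather than an instance, `matches?` is defined as a class method here.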
-
----
-
-## Engine Routes
-
-```ruby
-# lib/diogenes/engine/routes.rb
-
-Diogenes::Engine.routes.draw do
-  root to: 'dashboard#overview'
-
-  namespace :dashboard do
-    # Grounding
-    get 'grounding', to: 'grounding#index'
-    get 'grounding/:feature', to: 'grounding#show'
-    get 'grounding/:feature/:id', to: 'grounding#verification'
-
-    # Drift
-    get 'drift', to: 'drift#index'
-    get 'drift/:index_name', to: 'drift#show'
-    post 'drift/:index_name/reindex', to: 'drift#reindex'
-    # Distinct prefix: 'drift/:document_id/reindex' would match the same
-    # pattern as the index route above and never be reached.
-    post 'drift/documents/:document_id/reindex', to: 'drift#reindex_document'
-
-    # Evals
-    get 'evals', to: 'evals#index'
-    get 'evals/:feature', to: 'evals#show'
-    post 'evals/:feature/run', to: 'evals#run'
-    get 'evals/:feature/:run_id', to: 'evals#run_result'
-  end
-end
-```
-
----
-
-## Controller Structure
-
-```
-lib/diogenes/engine/
-  app/
-    controllers/
-      diogenes/
-        dashboard_controller.rb   # Overview tab
-        dashboard/
-          grounding_controller.rb
-          drift_controller.rb
-          evals_controller.rb
-    views/
-      diogenes/
-        dashboard/
-          overview.html.erb
-          grounding/
-            index.html.erb
-            show.html.erb
-            verification.html.erb
-          drift/
-            index.html.erb
-            show.html.erb
-          evals/
-            index.html.erb
-            show.html.erb
-            run_result.html.erb
-```
-
----
-
-## The Overview Tab
-
-`GET /diogenes`
-
-The entry point. One row per gated feature, showing aggregate health across all three dimensions. Red features surface immediately.
-
-```
-┌─────────────────────────────────────────────────────────────────────────────┐
-│ DIOGENES                                             Last updated: 2 min ago│
-├──────────────────────┬────────┬─────────────┬──────────────┬────────────────┤
-│ Feature              │ Gates  │ Grounding   │ Drift        │ Evals          │
-├──────────────────────┼────────┼─────────────┼──────────────┼────────────────┤
-│ SupportAssistant     │ 5/5 ✓  │ 94% clean   │ 2 stale docs │ 47/50 passing  │
-│ CodebaseOracle       │ 5/5 ✓  │ 98% clean   │ All current  │ 30/30 passing  │
-│ ComplianceSearcher   │ 5/5 ✓  │ 87% clean ⚠ │ 11 stale docs│ 18/25 passing ✗│
-└──────────────────────┴────────┴─────────────┴──────────────┴────────────────┘
-```
-
-`ComplianceSearcher` has a low grounding rate, stale documents, and a failing eval suite. That's the feature that needs attention today — and you can see it before a user does.
-
----
-
-## The Grounding Tab
-
-`GET /diogenes/dashboard/grounding`
-
-Per-feature grounding verification history. Shows flag rates over time and a live feed of recent verifications.
-
-```
-SupportAssistant — Grounding History (last 30 days)
-
-Flag rate: 6.2% ▼ from 8.1% last month
-
-Recent verifications:
-  ✓ Supported     "You can cancel your subscription from the billing page..."
-  ✓ Supported     "The enterprise tier includes unlimited seats..."
-  ⚠ Unsupported   "Refunds are processed within 24 hours..." → flagged for review
-  ✗ Contradicted  "The API rate limit is 100 req/min..." → blocked, not served
-```
-
-### Verification Detail
-
-`GET /diogenes/dashboard/grounding/:feature/:id`
-
-Drill into any verification to see the full picture:
-
-```
-Verification #8472 — SupportAssistant
-Query:  "How long does a refund take?"
-Status: ⚠ FLAGGED — unsupported claim detected
-
-Retrieved context:
-  [chunk 1] "Refund requests are reviewed within 2 business days..." ← source: refund-policy.md
-  [chunk 2] "Enterprise customers receive priority support..." ← source: enterprise-terms.md
-
-AI response:
-  "Refunds are typically processed within 24 hours."
-
-Verdict:
-  supported:    []
-  unsupported:  ["processed within 24 hours"] ← no chunk supports this timeframe
-  contradicted: []
-
-Action taken: flagged_for_review
-Reviewed by:  sarah@company.com (approved with edit)
-Final sent:   "Refunds are reviewed within 2 business days..."
-```
-
-This is the audit trail that makes human-in-the-loop real. Every verification, every flag, every human decision — all linked.
-
----
-
-## The Drift Tab
-
-`GET /diogenes/dashboard/drift`
-
-A staleness leaderboard — every indexed document, ranked by how long since its embedding was updated versus when the source last changed.
-
-```
-Document Drift — All Indexes
-
-● 11 documents require attention
-
-CRITICAL (source changed > 30 days ago, not re-indexed)
-┌─────────────────────────────────┬────────────────┬──────────────┬──────────┐
-│ Document                        │ Source updated │ Last indexed │ Staleness│
-├─────────────────────────────────┼────────────────┼──────────────┼──────────┤
-│ refund-policy.md                │ 47 days ago    │ 89 days ago  │ ████████ │
-│ enterprise-pricing.md           │ 33 days ago    │ 102 days ago │ ████████ │
-└─────────────────────────────────┴────────────────┴──────────────┴──────────┘
-
-WARNING (source changed 7–30 days ago)
-┌─────────────────────────────────┬────────────────┬──────────────┬──────────┐
-│ api-rate-limits.md              │ 12 days ago    │ 45 days ago  │ █████░░░ │
-│ ...                             │                │              │          │
-└─────────────────────────────────┴────────────────┴──────────────┴──────────┘
-
-[Re-index all critical] [Re-index all warnings]
-```
-
-### Re-indexing
-
-Clicking re-index queues an `ActiveJob`. The job is defined in the host app — Diogenes provides the interface and the tracking, not the embedding logic:
-
-```ruby
-# config/initializers/diogenes.rb
-Diogenes.configure do |config|
-  config.drift.reindex_job = ReindexDocumentJob
-  config.drift.staleness_thresholds = {
-    warning: 7.days,
-    critical: 30.days
-  }
-  config.drift.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
-end
-```
-
-### How Drift Is Detected
-
-Diogenes tracks two timestamps per indexed document:
-
-- `source_updated_at` — populated by the host app when indexing, updated by a webhook or scheduled check
-- `indexed_at` — set by Diogenes when an embedding is stored
-
-The staleness score is a function of the gap between them, weighted by how much the source changed (line diff if available, timestamp delta otherwise).
-
-```ruby
-# Inform Diogenes that a source document has changed
-Diogenes::Drift.source_updated(
-  document_id: 'refund-policy-v2',
-  updated_at: Time.current,
-  diff_size: :major # :minor, :moderate, :major — affects staleness weight
-)
-```
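The document does not give the scoring formula, so the following is only a sketch of one way the two timestamps and `diff_size` could combine. The weights in `DIFF_WEIGHTS`, the `staleness` helper, and the `:recent` / `:current` level names are assumptions for illustration; only the 7-day and 30-day thresholds come from the text above.

```ruby
DAY = 86_400 # seconds

# Assumed weights per diff_size; the real weighting is not documented here.
DIFF_WEIGHTS = { minor: 0.5, moderate: 1.0, major: 2.0 }.freeze
THRESHOLDS   = { warning: 7 * DAY, critical: 30 * DAY }.freeze

# Returns a score and a level for one document. A document indexed after
# its last source change is treated as current.
def staleness(source_updated_at:, indexed_at:, now:, diff_size: :moderate)
  return { score: 0.0, level: :current } if indexed_at >= source_updated_at

  age = now - source_updated_at # seconds the change has gone un-indexed
  level =
    if age > THRESHOLDS[:critical] then :critical
    elsif age >= THRESHOLDS[:warning] then :warning
    else :recent
    end

  { score: (age / DAY) * DIFF_WEIGHTS.fetch(diff_size), level: level }
end

now = Time.utc(2024, 3, 1)
result = staleness(
  source_updated_at: now - 47 * DAY, # matches the refund-policy.md row above
  indexed_at: now - 89 * DAY,
  now: now,
  diff_size: :major
)
puts "#{result[:level]} (score #{result[:score]})" # critical (score 94.0)
```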
-
----
-
-## The Evals Tab
-
-`GET /diogenes/dashboard/evals`
-
-The hardest unsolved problem in production AI is knowing whether your feature is getting better or worse over time. The evals tab is Diogenes's answer: define golden question/answer pairs, run them on a schedule, track pass rates over time, and alert on regression.
-
-```
-Evals — All Features
-
-SupportAssistant     47/50 passing   94%  ▲ from 91% last week     [Run now]
-CodebaseOracle       30/30 passing  100%  → unchanged              [Run now]
-ComplianceSearcher   18/25 passing   72%  ▼ from 88% last week  ✗  [Run now]
-```
-
-### Defining Golden Pairs
-
-Golden pairs live in your codebase as Ruby, not in a database. They're version-controlled, reviewed in PRs, and treated as first-class tests:
-
-```ruby
-# test/diogenes/evals/support_assistant_evals.rb
-
-Diogenes::Evals.define(SupportAssistant) do
-  eval "basic refund question" do
-    query "How do I request a refund?"
-    expects all_of(
-      grounded_in("refund-policy"),
-      contains("billing page"),
-      does_not_contain("24 hours") # known wrong answer to guard against
-    )
-  end
-
-  eval "enterprise tier question" do
-    query "Does the enterprise plan include SSO?"
-    expects all_of(
-      grounded_in("enterprise-terms"),
-      semantically_similar_to("Yes, SSO is included in all enterprise plans")
-    )
-  end
-
-  eval "question with no good answer" do
-    query "What is the API rate limit for the legacy v1 endpoints?"
-    expects one_of(
-      low_confidence_response,
-      routes_to_human_review
-    )
-  end
-end
-```
-
-### Matchers
-
-| Matcher | What it checks |
-|---|---|
-| `grounded_in(source)` | Retrieved context includes chunks from the named source |
-| `contains(text)` | Response includes the specified text or close semantic equivalent |
-| `does_not_contain(text)` | Response does not include the specified text |
-| `semantically_similar_to(text)` | Response embedding is within threshold cosine distance |
-| `low_confidence_response` | Feature returned a flagged or low-confidence result |
-| `routes_to_human_review` | Feature queued the output for human review |
-| `all_of(*matchers)` | All matchers must pass |
-| `one_of(*matchers)` | At least one matcher must pass |
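To make the composition rules in the table concrete, here is a minimal sketch that models each matcher as a callable returning true or false. It is not the Diogenes implementation: the response hash shape (`:text`, `:sources`) is an assumption, and `contains` here does a plain substring check rather than the semantic-equivalence check the table describes.

```ruby
# Each matcher is a lambda that takes a response hash and returns true/false.
# The response shape ({ text:, sources: }) is assumed for this sketch.
def contains(text)
  ->(response) { response[:text].include?(text) }
end

def does_not_contain(text)
  ->(response) { !response[:text].include?(text) }
end

def grounded_in(source)
  ->(response) { response[:sources].any? { |s| s.start_with?(source) } }
end

# Combinators: all_of passes only if every matcher passes,
# one_of passes if at least one does.
def all_of(*matchers)
  ->(response) { matchers.all? { |m| m.call(response) } }
end

def one_of(*matchers)
  ->(response) { matchers.any? { |m| m.call(response) } }
end

response = {
  text: "You can request a refund from the billing page.",
  sources: ["refund-policy.md"]
}

check = all_of(
  grounded_in("refund-policy"),
  contains("billing page"),
  does_not_contain("24 hours")
)
puts check.call(response) # true
```

Because combinators return the same kind of callable they accept, `all_of` and `one_of` can be nested freely.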
-
-### Running Evals
-
-Evals can be run on demand from the dashboard, via a Rake task, or on a schedule:
-
-```bash
-# On demand
-bundle exec rake diogenes:evals:run[SupportAssistant]
-bundle exec rake diogenes:evals:run # all features
-```
-
-```ruby
-# Scheduled via your existing job infrastructure
-class RunDiogenesEvalsJob < ApplicationJob
-  def perform
-    Diogenes::Evals.run_all
-  end
-end
-```
-
-### Regression Detection
-
-When a passing eval starts failing, Diogenes records the regression point — exactly which run it broke on — and stores a diff between the last passing response and the first failing one. The evals detail view surfaces this:
-
-```
-ComplianceSearcher — Eval Regression Detected
-
-"What is our data retention policy?"
-
-Status: ✗ FAILING since run #142 (3 days ago)
-
-Last passing response (run #141):
-  "Data is retained for 7 years per our compliance obligations..."
-
-Current failing response (run #144):
-  "Data retention policies vary by region..."
-
-Likely cause: 'data-retention-policy.md' updated 4 days ago, not re-indexed
-→ See drift tab
-```
-
-That last line is the integration that makes the dashboard coherent: a failing eval points to a stale document in the drift tab, which explains the low grounding score on the grounding tab. Three separate signals, one root cause.
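Finding the regression point described in this section amounts to locating the most recent pass-to-fail transition in an eval's run history. The sketch below shows one way to do that; the `Run` struct and `regression_point` helper are hypothetical names, and the run data simply mirrors the example output above.

```ruby
# Hypothetical record of one eval run, oldest first in the history array.
Run = Struct.new(:id, :passed, :response)

# Returns [last_passing_run, first_failing_run] for the current failing
# streak, or nil when the latest run passes or the eval has never passed.
def regression_point(runs)
  return nil if runs.empty? || runs.last.passed

  last_pass_index = runs.rindex(&:passed)
  return nil if last_pass_index.nil? # never passed, so nothing regressed

  [runs[last_pass_index], runs[last_pass_index + 1]]
end

history = [
  Run.new(140, true,  "Data is retained for 7 years per our compliance obligations..."),
  Run.new(141, true,  "Data is retained for 7 years per our compliance obligations..."),
  Run.new(142, false, "Data retention policies vary by region..."),
  Run.new(143, false, "Data retention policies vary by region..."),
  Run.new(144, false, "Data retention policies vary by region...")
]

last_pass, first_fail = regression_point(history)
puts "FAILING since run ##{first_fail.id} (last passing: ##{last_pass.id})"
# FAILING since run #142 (last passing: #141)
```

The two returned runs are the pair whose responses would be diffed for the detail view.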
-
----
-
-## Configuration Reference
-
-```ruby
-# config/initializers/diogenes.rb
-
-Diogenes.configure do |config|
-  # Grounding
-  config.grounding.default_threshold = 0.8
-  config.grounding.verifier_llm = -> (prompt) { YourLLMClient.complete(prompt) }
-
-  # Drift
-  config.drift.reindex_job = ReindexDocumentJob
-  config.drift.staleness_thresholds = { warning: 7.days, critical: 30.days }
-  config.drift.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
-  config.drift.check_interval = 1.hour
-
-  # Evals
-  config.evals.schedule = '0 9 * * *' # daily at 9am, if using sidekiq-cron
-  config.evals.alert_on_regression = true
-  config.evals.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
-  config.evals.passing_threshold = 0.9 # alert if pass rate drops below 90%
-
-  # Audit log
-  config.audit.retention_days = 90
-  config.audit.pii_scrubber = Diogenes::PII::DefaultScrubber
-end
-```
data/docs/examples.md DELETED
@@ -1,162 +0,0 @@
-# Examples
-
-These two examples use the same codebase, the same RAG pipeline, and the same underlying LLM. They reach different verdicts. That's the point.
-
----
-
-## Example A: Support Agent Assistant ✅ Passes
-
-**The feature:** A text interface inside your Rails admin that lets support agents paste a customer's problem and receive relevant documentation, similar past tickets, and a suggested first response — all sourced and citable.
-
-**The user:** A trained support agent. Not the customer.
-
-### Gate Evaluation
-
-```ruby
-class SupportAssistant
-  include Diogenes::Feature
-  include Diogenes::Grounding
-
-  gate :failure_mode,
-    severity: :recoverable
-    # A wrong suggested response gets caught by the agent before
-    # it reaches the customer. Failure is visible and correctable.
-
-  gate :user_calibration,
-    audience: :trained_agent
-    # Agents know the product. They will notice a wrong answer.
-    # They have the context to evaluate AI output critically.
-
-  gate :human_in_loop,
-    verified: true,
-    max_daily_reviews: 80,
-    review_sla_minutes: 5
-    # The agent reads the suggestion before sending anything.
-    # The customer never sees unreviewed AI output.
-
-  gate :observability,
-    logging: :full,
-    alerting: :enabled,
-    drift_detection: true
-    # We track suggestion acceptance rate, edit rate, and
-    # escalation rate as proxy signals for output quality.
-
-  gate :necessity,
-    alternatives_considered: [
-      "Static knowledge base with keyword search",
-      "Pre-written response templates by category"
-    ],
-    reasoning: "Agents handle 400+ tickets/day across 12 product areas and
-                6 pricing tiers. Template coverage is <40% of actual ticket
-                variance. Search requires agents to know what to search for."
-
-  verify_grounding threshold: 0.8, on_failure: :flag_for_review
-
-  def answer(query, agent:)
-    context = retriever.retrieve(query)
-    response = llm.complete(query, context: context)
-
-    verify_and_return(
-      response,
-      context: context,
-      reviewed_by: agent
-    )
-  end
-end
-```
-
-### Why It Passes
-
-The human is genuinely in the loop — the agent reads the suggestion and decides what to send. The audience is calibrated — agents know when an answer feels wrong. The failure mode is recoverable — a bad suggestion gets edited or ignored before it causes harm. Observability is real — acceptance and edit rates tell you when quality degrades before users notice.
-
-The RAG pipeline underneath is the same one described in the implementation guide. The grounding verifier ensures the suggested response is actually supported by retrieved context, not confabulated. Chunks that surface PII from past tickets are scrubbed before storage — the anonymization pipeline is what makes it legal to index support history at all.
-
----
-
-## Example B: Invoice Explainer ❌ Fails
-
-**The feature:** A "What does this mean?" button on invoice line items that uses AI to explain charges to customers in plain language.
-
-**The user:** Any customer. General consumer audience.
-
-### Gate Evaluation
-
-```ruby
-class InvoiceExplainer
-  include Diogenes::Feature
-
-  gate :failure_mode,
-    severity: :financial_dispute
-    # A wrong explanation of a charge — "that fee doesn't apply to
-    # your account" when it does — creates a customer expectation that
-    # is difficult and expensive to walk back. It may constitute a
-    # misrepresentation of contract terms.
-
-  gate :user_calibration,
-    audience: :general_consumer
-    # Customers don't have access to the billing logic. They cannot
-    # evaluate whether the explanation is correct. A confident wrong
-    # answer is worse than no answer.
-
-  gate :human_in_loop,
-    verified: false
-    # There is no human between the AI output and the customer.
-    # Real-time invoice explanations cannot wait for review.
-
-  # Diogenes::UnsafeFeatureError raised at boot:
-  #   InvoiceExplainer fails 3 gates:
-  #   - :failure_mode severity :financial_dispute requires :human_in_loop verified: true
-  #   - :user_calibration audience :general_consumer incompatible with :failure_mode severity :financial_dispute
-  #   - :human_in_loop verified: false blocks all consumer-facing output
-end
-```
-
-### Why It Fails — And What To Do Instead
-
-The feature fails because the failure mode is financial, the audience can't self-verify, and there's no human review. These aren't independent problems — they compound. An incorrect explanation of a financial charge, delivered confidently to someone who has no way to check it, with no human who saw it before it went out, is a product liability problem.
-
-**The right solution isn't a better AI feature. It's better UI.**
-
-```ruby
-# What to build instead:
-#
-# 1. Plain-English line item labels in the invoice model itself
-#    "Platform fee (10% of monthly usage)" not "PLAT_FEE_USG_PCT"
-#
-# 2. A tooltip or expandable section with a static explanation
-#    per charge type — written once by a human, accurate forever
-#
-# 3. A "why was I charged this?" link that opens a pre-filtered
-#    support ticket — staffed by agents using the SupportAssistant above
-#
-# This solves 90% of user confusion, is faster to build,
-# costs nothing to run, and has zero hallucination risk.
-```
-
-### The Honest Conversation
-
-When a client comes to you with the invoice explainer idea, the Diogenes gate output is the conversation:
-
-> "We ran this through the framework. It fails on failure mode severity, user calibration, and human review. Here's what that means in practice, and here's what we'd build instead."
-
-That's not a rejection. That's a better answer to the same problem — delivered with evidence.
-
----
-
-## What the Dashboard Shows
-
-Three weeks after shipping `SupportAssistant`, the Diogenes overview tab looks like this:
-
-```
-┌──────────────────────┬────────┬─────────────┬──────────────┬───────────────┐
-│ Feature              │ Gates  │ Grounding   │ Drift        │ Evals         │
-├──────────────────────┼────────┼─────────────┼──────────────┼───────────────┤
-│ SupportAssistant     │ 5/5 ✓  │ 94% clean   │ 1 stale doc  │ 47/50 passing │
-└──────────────────────┴────────┴─────────────┴──────────────┴───────────────┘
-```
-
-The one stale document is `refund-policy.md`, updated by the product team two weeks ago and not yet re-indexed. The grounding tab shows three recent verifications flagged for containing the old refund timeframe. The drift tab surfaces the document with a re-index button. The eval for "how long does a refund take?" is among the three failing evals.
-
-All three signals point to the same root cause. One click queues the re-index job. The problem is visible before a customer receives a wrong answer.
-
-`InvoiceExplainer` doesn't appear in the dashboard at all. It never got past boot.
data/docs/framework.md DELETED
@@ -1,146 +0,0 @@
-# The Diogenes Framework
-
-Diogenes is built around a single question: **can you defend this decision?**
-
-Not defend it philosophically — defend it in a postmortem, in a client meeting, in an audit, or in a conversation with a user who got a wrong answer and wants to know why.
-
-The framework is five gates. Each one represents a question your team must answer before shipping an AI-enhanced feature to users. Some gates are enforced by code. Some are enforced by documentation. All of them are enforced by the fact that they live in your codebase, get reviewed in PRs, and travel with the feature forever.
-
----
-
-## Gate 1: Failure Mode
-
-**The question:** What happens when the AI is wrong — and is that acceptable?
-
-Every AI feature will produce wrong answers. The question is never *if*, it's *what happens when*. The failure mode gate forces you to classify the severity of a wrong answer before you ship.
-
-```ruby
-gate :failure_mode, severity: :recoverable
-# Acceptable severities: :cosmetic, :recoverable, :degraded_experience
-# Unacceptable without additional gates: :financial_dispute, :safety_risk, :legal_exposure
-```
-
-A wrong restaurant recommendation is cosmetic. A wrong answer about a return policy is recoverable. A wrong answer about medication interactions is a safety risk. The gem will warn at boot if you declare a high-severity failure mode without also declaring strong human-in-the-loop configuration.
-
-**What this forces you to think about:**
-- The worst plausible wrong answer, not the average wrong answer
-- Whether wrong answers compound over time or reset per interaction
-- Whether the user will know they received a wrong answer
-
----
-
-## Gate 2: User Calibration
-
-**The question:** Can your target user tell when the output is wrong?
-
-This is the gate that kills the most features — and it should. The danger of a confident, fluent AI answer isn't that it's wrong. It's that it's wrong in a way that sounds exactly like being right.
-
-A developer reading Copilot output has the domain knowledge to spot a hallucinated API method. A consumer reading an AI-generated insurance summary probably does not.
-
-```ruby
-gate :user_calibration, audience: :trained_agent
-# Audiences: :domain_expert, :trained_agent, :power_user, :general_consumer
-```
-
-`:general_consumer` doesn't automatically fail — but it requires strong grounding verification and conservative output thresholds. The gem will not allow `:general_consumer` + `:safety_risk` without raising at boot.
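The boot-time rules mentioned so far (high-severity failure modes need verified human review, and `:general_consumer` cannot be paired with `:safety_risk`) can be pictured as a small validation pass over the declared gates. The sketch below is an assumption about how such a check might be structured, not the gem's code: `UnsafeFeatureError` here is a local stand-in class, `gate_violations` and `check_gates!` are hypothetical helpers, and `DosageAdvisor` is an invented feature name.

```ruby
# Local stand-in for the error the docs describe; not the gem's class.
class UnsafeFeatureError < StandardError; end

# Severities the framework lists as unacceptable without additional gates.
HIGH_SEVERITY = %i[financial_dispute safety_risk legal_exposure].freeze

# gates is a plain hash of declared gate options, e.g.
#   { failure_mode: { severity: :recoverable }, human_in_loop: { verified: true } }
def gate_violations(gates)
  severity = gates.dig(:failure_mode, :severity)
  audience = gates.dig(:user_calibration, :audience)
  reviewed = gates.dig(:human_in_loop, :verified) == true

  violations = []
  if HIGH_SEVERITY.include?(severity) && !reviewed
    violations << "severity #{severity.inspect} requires human_in_loop verified: true"
  end
  if audience == :general_consumer && severity == :safety_risk
    violations << "audience :general_consumer cannot be combined with severity :safety_risk"
  end
  violations
end

# Raises when any rule is violated; returns nil otherwise.
def check_gates!(name, gates)
  violations = gate_violations(gates)
  return if violations.empty?

  raise UnsafeFeatureError, "#{name} fails #{violations.size} gate check(s): #{violations.join('; ')}"
end

risky = {
  failure_mode: { severity: :safety_risk },
  user_calibration: { audience: :general_consumer },
  human_in_loop: { verified: false }
}

begin
  check_gates!("DosageAdvisor", risky) # hypothetical feature name
rescue UnsafeFeatureError => e
  puts e.message
end
```

Collecting every violation before raising means one boot failure reports all the problems at once, which matches the multi-line error shown in the examples document.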
-
-**What this forces you to think about:**
-- Who the actual user is, not the assumed user
-- Whether the UI makes uncertainty visible
-- Whether users have any recourse when they suspect an error
-
----
-
-## Gate 3: Human in the Loop
-
-**The question:** Is a human genuinely reviewing output, or rubber-stamping at volume?
-
-"A human reviews it" is one of the most misused phrases in AI feature development. A human reviewing 400 AI outputs a day for 30 seconds each is not a meaningful review. This gate requires you to specify not just that a human is involved, but the conditions under which that review is real.
-
-```ruby
-gate :human_in_loop,
-  verified: true,
-  max_daily_reviews: 50,
-  review_sla_minutes: 30
-```
-
-Diogenes tracks review volume per reviewer. If a reviewer exceeds `max_daily_reviews`, the queue is flagged and additional reviewers are required before new outputs are released.
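Per-reviewer volume tracking of this kind can be sketched as a counter keyed by reviewer and day. The `ReviewLoad` class below is hypothetical and keeps counts in memory purely for illustration; how Diogenes actually stores review counts is not described here, and the second reviewer address is made up.

```ruby
require "date"

# Hypothetical in-memory tracker for reviews completed per reviewer per day.
class ReviewLoad
  def initialize(max_daily_reviews:)
    @max = max_daily_reviews
    @counts = Hash.new(0) # [reviewer, date] => number of reviews
  end

  def record(reviewer, on: Date.today)
    @counts[[reviewer, on]] += 1
  end

  # Reviewers who have exceeded the daily limit on the given date.
  def over_capacity(on: Date.today)
    @counts.select { |(_, date), n| date == on && n > @max }.keys.map(&:first)
  end

  # True when at least one reviewer is over the limit, i.e. the queue
  # should be flagged until more reviewers are added.
  def flagged?(on: Date.today)
    !over_capacity(on: on).empty?
  end
end

tracker = ReviewLoad.new(max_daily_reviews: 50)
day = Date.new(2024, 1, 15)

51.times { tracker.record("sarah@company.com", on: day) }
10.times { tracker.record("lee@company.com", on: day) } # made-up second reviewer

puts tracker.flagged?(on: day)                  # true
puts tracker.over_capacity(on: day).join(", ")  # sarah@company.com
```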
-
-**What this forces you to think about:**
-- Whether reviewers have enough context to actually evaluate the output
-- Whether reviewers have the authority to reject or modify, not just approve
-- Whether review load is sustainable as the feature scales
-
----
-
-## Gate 4: Observability
-
-**The question:** Will you know when things are going wrong in production?
-
-Most teams ship AI features with less monitoring than they'd give a sorting algorithm. The outputs are nondeterministic, which means traditional error rate monitoring misses the most important failure modes — outputs that are wrong but don't raise exceptions.
-
-```ruby
-gate :observability,
-  logging: :full,
-  alerting: :enabled,
-  drift_detection: true
-```
-
-Declaring this gate without the corresponding observability infrastructure raises at boot. Diogenes checks for a configured audit log destination and alert webhook before allowing the feature to serve.
-
-**What this forces you to think about:**
-- How you detect when hallucination rates increase
-- How you detect when retrieved context is becoming stale
-- What a "degraded AI feature" looks like in your alerting dashboard
-
----
-
-## Gate 5: Necessity
-
-**The question:** Is AI actually the right solution, or just the exciting one?
-
-This is the hardest gate to encode and the most important one to ask. A well-designed UI, a good search index, a rule-based system, or better documentation will outperform a brittle AI feature in most cases — and will be faster to build, cheaper to run, and easier to debug.
-
-```ruby
-gate :necessity,
-  alternatives_considered: [
-    "Better invoice UI with plain-English line item labels",
-    "Keyword search across the knowledge base",
-    "Contextual help tooltips on common confusion points"
-  ],
-  reasoning: "Support agents handle 400+ tickets/day across 12 product areas.
-              The volume and variance exceeds what static documentation can address."
-```
-
-Diogenes doesn't enforce this gate programmatically. It enforces it by requiring you to write it down, put it in a PR, and have someone else read it.
-
-**What this forces you to think about:**
-- Whether the problem is actually an AI problem or a UX problem
-- Whether you're solving for users or solving for a roadmap item
-- Whether you'll still want this feature in 18 months when the novelty has worn off
-
----
-
-## The Verdict
-
-A feature that passes all five gates is ready to build — with the understanding that the gates are revisited when the feature scales, when the user base changes, or when the failure mode landscape shifts.
-
-A feature that fails a gate isn't necessarily dead. It's a specification of what needs to change before it ships. Sometimes that's a better review workflow. Sometimes that's a UI change that improves user calibration. Sometimes it's the answer that saves you six months of work on the wrong solution.
-
-That's the point.
-
----
-
-## After the Gates
-
-Passing the gates is the beginning, not the end. Once a feature ships, Diogenes continues watching it through three lenses:
-
-**Grounding** — are AI responses actually supported by the context they're drawing from, or are they drifting into confabulation?
-
-**Drift** — are the documents feeding your retrieval pipeline still current, or have they gone stale while the embeddings haven't caught up?
-
-**Evals** — are the things the feature was supposed to do when you shipped it still working, or has something quietly regressed?
-
-All three surface in the mounted dashboard. The gates tell you whether to build it. These tell you whether it's still working.
-
-See [docs/dashboard.md](docs/dashboard.md) for full documentation.