diogenes 0.1.0 → 0.1.1

@@ -0,0 +1,228 @@
1
+ # Contributing to Diogenes
2
+
3
+ Diogenes is opinionated by design. Before contributing, read the philosophy section of the README and the architecture notes in CLAUDE.md. A PR that conflicts with the core philosophy — even if the code is excellent — will not be merged.
4
+
5
+ ---
6
+
7
+ ## What We're Looking For
8
+
9
+ **Good contributions:**
10
+ - New gates that encode a real, defensible constraint
11
+ - Improvements to grounding verification accuracy or the verifier prompt
12
+ - New eval matchers that cover failure modes existing matchers don't
13
+ - Drift detection improvements — staleness scoring, webhook integrations
14
+ - Better failure messages — they should be actionable, not cryptic
15
+ - Dashboard improvements that surface existing data more clearly
16
+ - Bug fixes with accompanying regression tests
17
+ - Documentation that makes the framework clearer
18
+
19
+ **Not a good fit:**
20
+ - LLM client integrations (Diogenes is provider-agnostic by design)
21
+ - RAG pipeline or embedding implementations
22
+ - Gates that can be bypassed with configuration
23
+ - Anything that makes a gate failure a warning instead of an error
24
+ - Database-backed eval golden pairs
25
+ - Authentication for the dashboard
26
+
27
+ If you're unsure whether your idea fits, open a discussion issue before writing code.
28
+
29
+ ---
30
+
31
+ ## Getting Started
32
+
33
+ ```bash
34
+ git clone https://github.com/your-org/diogenes
35
+ cd diogenes
36
+ bundle install
37
+ bundle exec rspec
38
+ ```
39
+
40
+ The test suite should pass with no configuration. If it doesn't, open an issue.
41
+
42
+ To run the full suite including Rails integration tests:
43
+
44
+ ```bash
45
+ bundle exec rspec
46
+ bundle exec rspec spec/rails
47
+ ```
48
+
49
+ ---
50
+
51
+ ## Contributing to Gates
52
+
53
+ Gates are the core primitive. Adding one is a significant decision — each gate adds cognitive load to every team that adopts the gem. New gates should address a failure mode that existing gates don't cover.
54
+
55
+ ### Steps
56
+
57
+ 1. Create `lib/diogenes/gates/your_gate.rb` inheriting from `Diogenes::Gates::Base`
58
+ 2. Implement `#valid?` and `#failure_message`
59
+ 3. Add any cross-gate incompatibilities to `lib/diogenes/gates/compatibility.rb`
60
+ 4. Register in `Diogenes::Feature`
61
+ 5. Write unit specs in `spec/diogenes/gates/your_gate_spec.rb`
62
+ 6. Document in `docs/framework.md` following the existing format
63
+ 7. Add to `docs/examples.md` if the gate meaningfully changes a verdict
64
+
65
+ ### Gate Design Rules
66
+
67
+ - Gates fail loudly or pass. No warnings, no degraded modes.
68
+ - `#failure_message` must include the feature class name, the gate name, and a plain-English instruction for what to change.
69
+ - Gates are validated at boot. If your gate requires runtime information to validate, reconsider the design.
70
+ - Default configuration should be the most conservative option.
71
+
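The design rules above can be sketched in plain Ruby. This is a minimal illustration, not the gem's code: `GateSketch::BaseGate` is a hypothetical stand-in for `Diogenes::Gates::Base`, and `ReviewQueueGate` with its `review_queue` setting is invented for the example.

```ruby
# Hypothetical stand-in for Diogenes::Gates::Base so the sketch runs on its own.
module GateSketch
  class BaseGate
    attr_reader :feature_name, :config

    def initialize(feature_name, config = {})
      @feature_name = feature_name
      @config = config
    end

    # Gates fail loudly or pass: an invalid gate raises, it never warns.
    def validate!
      raise ArgumentError, failure_message unless valid?

      true
    end
  end

  # Hypothetical gate: every feature must name a human review queue.
  class ReviewQueueGate < BaseGate
    def valid?
      queue = config[:review_queue]
      queue.is_a?(String) && !queue.strip.empty?
    end

    # Names the feature, the gate, and what to change.
    def failure_message
      "#{feature_name} failed the review_queue gate: " \
        "set review_queue to the queue that receives flagged responses."
    end
  end
end

gate = GateSketch::ReviewQueueGate.new("SupportAssistant")
begin
  gate.validate!
rescue ArgumentError => e
  puts e.message
end
```

The check needs only static configuration, so it could run at boot, and the most conservative default (no queue configured) fails rather than passes.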
72
+ ### Versioning
73
+
74
+ Adding a new gate incompatibility to the compatibility matrix is a **minor** bump if it catches configurations that were silently wrong. It is a **major** bump if it invalidates configurations that were intentionally valid.
75
+
76
+ ---
77
+
78
+ ## Contributing to Grounding Verification
79
+
80
+ The grounding verifier has two separate concerns: the prompt and the logic. Keep them separate.
81
+
82
+ ### The Prompt (`lib/diogenes/grounding/prompt.rb`)
83
+
84
+ The prompt is versioned independently. When you change it:
85
+
86
+ - Increment `Diogenes::Grounding::Prompt::VERSION`
87
+ - Audit records store the prompt version — old records remain interpretable
88
+ - Add a spec asserting that the new prompt correctly classifies a fixture set of (context, response) pairs as supported/unsupported/contradicted
89
+
90
+ Don't change the prompt to improve performance on one case without running the full fixture set. A prompt that is better on average but worse on contradictions is not an improvement.
91
+
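One way to keep an average from hiding a regression on contradictions is to report accuracy per expected label. The sketch below assumes a hypothetical fixture shape and uses a stub classifier in place of the verifier prompt; the fixture texts are made up for illustration.

```ruby
# Hypothetical fixture shape: each entry pairs a context and a response with
# the label a correct prompt should produce. The texts are illustrative.
FIXTURES = [
  { context: "Refunds are reviewed within 2 business days.",
    response: "Refunds are reviewed within 2 business days.", expected: :supported },
  { context: "Refunds are reviewed within 2 business days.",
    response: "Enterprise plans include SSO.", expected: :unsupported },
  { context: "The API rate limit is 60 req/min.",
    response: "The API rate limit is 100 req/min.", expected: :contradicted }
].freeze

# Returns accuracy per expected label, so a drop on :contradicted stays
# visible even when the overall average improves.
def accuracy_by_label(fixtures, classifier)
  fixtures.group_by { |f| f[:expected] }.transform_values do |group|
    correct = group.count { |f| classifier.call(f[:context], f[:response]) == f[:expected] }
    correct.fdiv(group.size)
  end
end

# Stub classifier standing in for the prompt; no live LLM call is made.
stub = ->(context, response) { context == response ? :supported : :unsupported }

accuracy_by_label(FIXTURES, stub).each do |label, accuracy|
  puts format("%-13s %.2f", label, accuracy)
end
```

The stub never returns `:contradicted`, so that label scores 0.0 while the other two score 1.0, which is exactly the kind of gap an overall average would blur.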
92
+ ### The Verifier Logic (`lib/diogenes/grounding/verifier.rb`)
93
+
94
+ The verifier accepts any LLM callable. Tests use a stub:
95
+
96
+ ```ruby
97
+ Diogenes::Grounding::Verifier.stub(result: :pass)
98
+ Diogenes::Grounding::Verifier.stub(result: :flag, unsupported: ["claim text"])
99
+ ```
100
+
101
+ Never make live LLM calls in specs.
102
+
103
+ ### The Result (`lib/diogenes/grounding/result.rb`)
104
+
105
+ `Diogenes::Grounding::Result` is a value object. Adding fields is a minor bump. Removing or renaming fields is a major bump — audit records store serialized results.
106
+
107
+ ---
108
+
109
+ ## Contributing to Drift Detection
110
+
111
+ ### Staleness Scoring (`lib/diogenes/drift/staleness_score.rb`)
112
+
113
+ The staleness score is a pure function of timestamps and diff size. Keep it that way — no database access, no I/O. Specs use fixed timestamps.
114
+
115
+ ```ruby
116
+ score = Diogenes::Drift::StalenessScore.calculate(
117
+   source_updated_at: 47.days.ago,
118
+   indexed_at: 89.days.ago,
119
+   diff_size: :major
120
+ )
121
+ ```
122
+
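A pure scoring function of this kind might look like the sketch below. The formula, the weights, and the explicit `now:` argument are all assumptions made for illustration; the real calculation lives in `staleness_score.rb`. Passing the current time in keeps the function free of clock reads, so specs can use fixed timestamps.

```ruby
module StalenessSketch
  SECONDS_PER_DAY = 86_400.0
  # Hypothetical weights per diff size; not the gem's actual values.
  WEIGHTS = { minor: 0.5, moderate: 1.0, major: 2.0 }.freeze

  # Pure function: every input, including the current time, is an argument.
  def self.calculate(source_updated_at:, indexed_at:, diff_size:, now:)
    # An embedding stored after the last source change is not stale.
    return 0.0 if indexed_at >= source_updated_at

    days_unindexed = (now - source_updated_at) / SECONDS_PER_DAY
    days_unindexed * WEIGHTS.fetch(diff_size)
  end
end

now = Time.utc(2024, 6, 1)
score = StalenessSketch.calculate(
  source_updated_at: now - 47 * 86_400,
  indexed_at: now - 89 * 86_400,
  diff_size: :major,
  now: now
)
puts score
```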
123
+ ### Re-index Jobs
124
+
125
+ Diogenes ships a base job class. The host app subclasses it with embedding logic. Never add embedding implementation to the base class.
126
+
127
+ ```ruby
128
+ # What we ship
129
+ class Diogenes::Drift::ReindexJob < ApplicationJob
130
+   def perform(document_id)
131
+     raise NotImplementedError, "Subclass with your embedding logic"
132
+   end
133
+ end
134
+
135
+ # What the host app writes
136
+ class ReindexDocumentJob < Diogenes::Drift::ReindexJob
137
+   def perform(document_id)
138
+     # fetch, embed, store
139
+   end
140
+ end
141
+ ```
142
+
143
+ ### Webhooks
144
+
145
+ Drift alert webhooks send a standard payload. If you're adding webhook support for a new event type, use the existing payload shape and add a `type` field. Don't introduce a new shape.
146
+
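The idea of one payload shape with a distinguishing `type` field can be sketched as follows. The field names and event type strings here are illustrative assumptions; the existing payload shape in the codebase is the authority.

```ruby
require "json"
require "time"

# Hypothetical payload keys, chosen only to illustrate a single shared shape.
PAYLOAD_KEYS = %i[type feature occurred_at details].freeze

# Every event goes through the same builder; only `type` and `details` vary.
def build_payload(type:, feature:, occurred_at:, details: {})
  { type: type, feature: feature, occurred_at: occurred_at.utc.iso8601, details: details }
end

drift = build_payload(
  type: "drift.critical", feature: "SupportAssistant",
  occurred_at: Time.utc(2024, 6, 1), details: { document_id: "refund-policy-v2" }
)
regression = build_payload(
  type: "eval.regression", feature: "ComplianceSearcher",
  occurred_at: Time.utc(2024, 6, 1), details: { run_id: 142 }
)

puts JSON.generate(drift)
puts drift.keys == regression.keys
```

Because both events share the same top-level keys, a receiver can route on `type` without parsing a second schema.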
147
+ ---
148
+
149
+ ## Contributing to the Eval Runner
150
+
151
+ ### Adding Matchers (`lib/diogenes/evals/matchers.rb`)
152
+
153
+ Matchers are composable predicates. A new matcher should:
154
+
155
+ - Accept a response struct and return true/false
156
+ - Have a clear, descriptive failure message when it returns false
157
+ - Be usable inside `all_of` and `one_of`
158
+ - Be tested against fixture responses, never live LLM calls
159
+
160
+ ```ruby
161
+ module Diogenes
162
+   module Evals
163
+     module Matchers
164
+       def your_matcher(arg)
165
+         ->(result) {
166
+           # evaluate result, return true/false
167
+         }
168
+       end
169
+     end
170
+   end
171
+ end
172
+ ```
173
+
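To show why returning a lambda makes matchers composable, here is a self-contained sketch. `MatcherSketch` and its `Response` struct are hypothetical, and `contains` is simplified to an exact substring check rather than whatever the real matcher does.

```ruby
module MatcherSketch
  # Hypothetical response struct; the real one may carry more fields.
  Response = Struct.new(:text, :sources, keyword_init: true)

  module_function

  def contains(text)
    ->(response) { response.text.include?(text) }
  end

  def does_not_contain(text)
    ->(response) { !response.text.include?(text) }
  end

  def grounded_in(source)
    ->(response) { response.sources.include?(source) }
  end

  # Combinators take matchers and return a matcher, so they nest freely.
  def all_of(*matchers)
    ->(response) { matchers.all? { |matcher| matcher.call(response) } }
  end

  def one_of(*matchers)
    ->(response) { matchers.any? { |matcher| matcher.call(response) } }
  end
end

response = MatcherSketch::Response.new(
  text: "Request a refund from the billing page.",
  sources: ["refund-policy"]
)
check = MatcherSketch.all_of(
  MatcherSketch.grounded_in("refund-policy"),
  MatcherSketch.contains("billing page"),
  MatcherSketch.does_not_contain("24 hours")
)
puts check.call(response)
```

Since every matcher has the same shape (response in, boolean out), a new matcher works inside `all_of` and `one_of` without either combinator knowing about it.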
174
+ ### Golden Pairs Live in Code
175
+
176
+ Don't add database storage for golden pairs. They are version-controlled files. If a contributor wants to propose database-backed pairs, the answer is no — visibility in code review is a feature, not a limitation.
177
+
178
+ ### Regression Detection
179
+
180
+ The regression detector (`lib/diogenes/evals/regression.rb`) stores the last passing response and diffs it against the first failing one. If you're improving diff quality, use the fixture set in `spec/fixtures/evals/regressions/`. Don't introduce a dependency on a diff library — the current implementation is intentionally minimal.
181
+
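For a sense of how little a dependency-free diff needs, consider the sketch below. It is not the implementation in `regression.rb`; `minimal_diff` is a hypothetical helper that only reports which words disappeared and which appeared, ignoring order and repetition.

```ruby
# Hypothetical minimal diff: words present only before, and words present only after.
def minimal_diff(last_passing, first_failing)
  before = last_passing.scan(/\w+/)
  after = first_failing.scan(/\w+/)
  { removed: before - after, added: after - before }
end

diff = minimal_diff(
  "Data is retained for 7 years per our compliance obligations",
  "Data retention policies vary by region"
)
puts "removed: #{diff[:removed].join(' ')}"
puts "added:   #{diff[:added].join(' ')}"
```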
182
+ ---
183
+
184
+ ## Contributing to the Dashboard
185
+
186
+ The dashboard engine has no authentication and no CSS framework dependency. Keep it that way.
187
+
188
+ ### Adding a New View
189
+
190
+ Views use ERB. No ViewComponent, no Stimulus, no Turbo (unless it's already a dependency by the time you're reading this — check the gemspec). Interactivity should be achievable with standard Rails UJS or vanilla JS.
191
+
192
+ ### Adding a New Route
193
+
194
+ Add to `lib/diogenes/engine/config/routes.rb`. Follow the existing namespace pattern. Every new route needs a corresponding controller action and view, even if the view is a stub.
195
+
196
+ ### What the Dashboard Must Never Do
197
+
198
+ - Make LLM calls
199
+ - Trigger re-indexing without an explicit user action
200
+ - Display raw query or response content (hashes only — PII protection)
201
+ - Require specific authentication middleware
202
+
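The hashes-only rule can be illustrated with Ruby's standard `Digest` library. `content_fingerprint` is a hypothetical helper, and the 12-character truncation is an arbitrary choice for display.

```ruby
require "digest"

# Hypothetical helper: a view would render this fingerprint instead of the raw text.
def content_fingerprint(text)
  Digest::SHA256.hexdigest(text)[0, 12]
end

puts content_fingerprint("How long does a refund take?")
```

Identical queries produce identical fingerprints, so repeated queries can still be grouped without the dashboard ever holding their content.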
203
+ ---
204
+
205
+ ## Pull Request Guidelines
206
+
207
+ - One logical change per PR
208
+ - Tests for all new behaviour
209
+ - Updated documentation if you're changing or adding functionality
210
+ - A clear description of *why*, not just *what*
211
+
212
+ PRs that add code without updating relevant docs will be sent back.
213
+
214
+ ---
215
+
216
+ ## Versioning
217
+
218
+ Diogenes follows semantic versioning strictly.
219
+
220
+ - **Patch:** bug fixes, documentation, internal refactors
221
+ - **Minor:** new gates, new matchers, new dashboard views, non-breaking additions
222
+ - **Major:** changes to gate semantics, changes to audit record shape, removal of features, breaking DSL changes
223
+
224
+ ---
225
+
226
+ ## Code of Conduct
227
+
228
+ Be direct. Be specific. Assume good intent. If you're reviewing a PR, explain your reasoning — "this conflicts with the philosophy" is not sufficient without explaining which part and why.
data/docs/dashboard.md ADDED
@@ -0,0 +1,365 @@
1
+ # The Diogenes Dashboard
2
+
3
+ The Diogenes dashboard is a Rails engine that mounts into your existing application and gives you a live view of AI feature health across three dimensions: grounding verification, document drift, and eval regressions.
4
+
5
+ Think of it as Sidekiq Web for AI accountability — one place to see whether your AI features are doing what you think they're doing.
6
+
7
+ ---
8
+
9
+ ## Mounting
10
+
11
+ ```ruby
12
+ # config/routes.rb
13
+ mount Diogenes::Engine => '/diogenes'
14
+ ```
15
+
16
+ Diogenes has no opinions about authentication. Protect the route the same way you protect any sensitive admin surface:
17
+
18
+ ```ruby
19
+ # With Devise + admin constraint
20
+ authenticate :user, ->(u) { u.admin? } do
21
+   mount Diogenes::Engine => '/diogenes'
22
+ end
23
+
24
+ # With HTTP basic auth via a constraint
25
+ constraints Diogenes::AdminConstraint.new(ENV['DIOGENES_PASSWORD']) do
26
+   mount Diogenes::Engine => '/diogenes'
27
+ end
28
+
29
+ # With a custom route constraint (any object that responds to matches?(request))
30
+ mount Diogenes::Engine => '/diogenes', constraints: YourAuthConstraint
31
+ ```
32
+
33
+ ---
34
+
35
+ ## Engine Routes
36
+
37
+ ```ruby
38
+ # lib/diogenes/engine/config/routes.rb
39
+
40
+ Diogenes::Engine.routes.draw do
41
+   root to: 'dashboard#overview'
42
+
43
+   namespace :dashboard do
44
+     # Grounding
45
+     get 'grounding', to: 'grounding#index'
46
+     get 'grounding/:feature', to: 'grounding#show'
47
+     get 'grounding/:feature/:id', to: 'grounding#verification'
48
+
49
+     # Drift
50
+     get 'drift', to: 'drift#index'
51
+     get 'drift/:index_name', to: 'drift#show'
52
+     post 'drift/:index_name/reindex', to: 'drift#reindex'
53
+     post 'drift/documents/:document_id/reindex', to: 'drift#reindex_document' # distinct prefix: 'drift/:document_id/reindex' has the same shape as the route above and would never match
54
+
55
+     # Evals
56
+     get 'evals', to: 'evals#index'
57
+     get 'evals/:feature', to: 'evals#show'
58
+     post 'evals/:feature/run', to: 'evals#run'
59
+     get 'evals/:feature/:run_id', to: 'evals#run_result'
60
+   end
61
+ end
62
+ ```
63
+
64
+ ---
65
+
66
+ ## Controller Structure
67
+
68
+ ```
69
+ lib/diogenes/engine/
70
+   app/
71
+     controllers/
72
+       diogenes/
73
+         dashboard_controller.rb   # Overview tab
74
+         dashboard/
75
+           grounding_controller.rb
76
+           drift_controller.rb
77
+           evals_controller.rb
78
+     views/
79
+       diogenes/
80
+         dashboard/
81
+           overview.html.erb
82
+           grounding/
83
+             index.html.erb
84
+             show.html.erb
85
+             verification.html.erb
86
+           drift/
87
+             index.html.erb
88
+             show.html.erb
89
+           evals/
90
+             index.html.erb
91
+             show.html.erb
92
+             run_result.html.erb
93
+ ```
94
+
95
+ ---
96
+
97
+ ## The Overview Tab
98
+
99
+ `GET /diogenes`
100
+
101
+ The entry point. One row per gated feature, showing aggregate health across all three dimensions. Red features surface immediately.
102
+
103
+ ```
104
+ ┌─────────────────────────────────────────────────────────────────────────────┐
105
+ │ DIOGENES Last updated: 2 min ago│
106
+ ├──────────────────────┬────────┬─────────────┬──────────────┬────────────────┤
107
+ │ Feature │ Gates │ Grounding │ Drift │ Evals │
108
+ ├──────────────────────┼────────┼─────────────┼──────────────┼────────────────┤
109
+ │ SupportAssistant │ 5/5 ✓ │ 94% clean │ 2 stale docs │ 47/50 passing │
110
+ │ CodebaseOracle │ 5/5 ✓ │ 98% clean │ All current │ 30/30 passing │
111
+ │ ComplianceSearcher │ 5/5 ✓ │ 87% clean ⚠ │ 11 stale docs│ 18/25 passing ✗│
112
+ └──────────────────────┴────────┴─────────────┴──────────────┴────────────────┘
113
+ ```
114
+
115
+ `ComplianceSearcher` has a low grounding rate, stale documents, and a failing eval suite. That's the feature that needs attention today — and you can see it before a user does.
116
+
117
+ ---
118
+
119
+ ## The Grounding Tab
120
+
121
+ `GET /diogenes/dashboard/grounding`
122
+
123
+ Per-feature grounding verification history. Shows flag rates over time and a live feed of recent verifications.
124
+
125
+ ```
126
+ SupportAssistant — Grounding History (last 30 days)
127
+
128
+ Flag rate: 6.2% ▼ from 8.1% last month
129
+
130
+ Recent verifications:
131
+ ✓ Supported "You can cancel your subscription from the billing page..."
132
+ ✓ Supported "The enterprise tier includes unlimited seats..."
133
+ ⚠ Unsupported "Refunds are processed within 24 hours..." → flagged for review
134
+ ✗ Contradicted "The API rate limit is 100 req/min..." → blocked, not served
135
+ ```
136
+
137
+ ### Verification Detail
138
+
139
+ `GET /diogenes/dashboard/grounding/:feature/:id`
140
+
141
+ Drill into any verification to see the full picture:
142
+
143
+ ```
144
+ Verification #8472 — SupportAssistant
145
+ Query: "How long does a refund take?"
146
+ Status: ⚠ FLAGGED — unsupported claim detected
147
+
148
+ Retrieved context:
149
+ [chunk 1] "Refund requests are reviewed within 2 business days..." ← source: refund-policy.md
150
+ [chunk 2] "Enterprise customers receive priority support..." ← source: enterprise-terms.md
151
+
152
+ AI response:
153
+ "Refunds are typically processed within 24 hours."
154
+
155
+ Verdict:
156
+ supported: []
157
+ unsupported: ["processed within 24 hours"] ← no chunk supports this timeframe
158
+ contradicted: []
159
+
160
+ Action taken: flagged_for_review
161
+ Reviewed by: sarah@company.com (approved with edit)
162
+ Final sent: "Refunds are reviewed within 2 business days..."
163
+ ```
164
+
165
+ This is the audit trail that makes human-in-the-loop real. Every verification, every flag, every human decision — all linked.
166
+
167
+ ---
168
+
169
+ ## The Drift Tab
170
+
171
+ `GET /diogenes/dashboard/drift`
172
+
173
+ A staleness leaderboard: every indexed document, ranked by how far its embedding lags behind the last change to its source.
174
+
175
+ ```
176
+ Document Drift — All Indexes
177
+
178
+ ● 11 documents require attention
179
+
180
+ CRITICAL (source changed > 30 days ago, not re-indexed)
181
+ ┌─────────────────────────────────┬──────────────┬──────────────┬──────────┐
182
+ │ Document │ Source updated│ Last indexed │ Staleness│
183
+ ├─────────────────────────────────┼──────────────┼──────────────┼──────────┤
184
+ │ refund-policy.md │ 47 days ago │ 89 days ago │ ████████ │
185
+ │ enterprise-pricing.md │ 33 days ago │ 102 days ago │ ████████ │
186
+ └─────────────────────────────────┴──────────────┴──────────────┴──────────┘
187
+
188
+ WARNING (source changed 7–30 days ago)
189
+ ┌─────────────────────────────────┬──────────────┬──────────────┬──────────┐
190
+ │ api-rate-limits.md │ 12 days ago │ 45 days ago │ █████░░░ │
191
+ │ ... │ │ │ │
192
+ └─────────────────────────────────┴──────────────┴──────────────┴──────────┘
193
+
194
+ [Re-index all critical] [Re-index all warnings]
195
+ ```
196
+
197
+ ### Re-indexing
198
+
199
+ Clicking re-index queues an `ActiveJob`. The job is defined in the host app — Diogenes provides the interface and the tracking, not the embedding logic:
200
+
201
+ ```ruby
202
+ # config/initializers/diogenes.rb
203
+ Diogenes.configure do |config|
204
+   config.drift.reindex_job = ReindexDocumentJob
205
+   config.drift.staleness_thresholds = {
206
+     warning: 7.days,
207
+     critical: 30.days
208
+   }
209
+   config.drift.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
210
+ end
211
+ ```
212
+
213
+ ### How Drift Is Detected
214
+
215
+ Diogenes tracks two timestamps per indexed document:
216
+
217
+ - `source_updated_at` — populated by the host app when indexing, updated by a webhook or scheduled check
218
+ - `indexed_at` — set by Diogenes when an embedding is stored
219
+
220
+ The staleness score is a function of the gap between them, weighted by how much the source changed (line diff if available, timestamp delta otherwise).
221
+
222
+ ```ruby
223
+ # Inform Diogenes that a source document has changed
224
+ Diogenes::Drift.source_updated(
225
+   document_id: 'refund-policy-v2',
226
+   updated_at: Time.current,
227
+   diff_size: :major # :minor, :moderate, :major; affects staleness weight
228
+ )
229
+ ```
230
+
231
+ ---
232
+
233
+ ## The Evals Tab
234
+
235
+ `GET /diogenes/dashboard/evals`
236
+
237
+ One of the hardest problems in production AI is knowing whether your feature is getting better or worse over time. The evals tab is Diogenes's answer: define golden question/answer pairs, run them on a schedule, track pass rates over time, and alert on regression.
238
+
239
+ ```
240
+ Evals — All Features
241
+
242
+ SupportAssistant 47/50 passing 94% ▲ from 91% last week [Run now]
243
+ CodebaseOracle 30/30 passing 100% → unchanged [Run now]
244
+ ComplianceSearcher 18/25 passing 72% ▼ from 88% last week ✗ [Run now]
245
+ ```
246
+
247
+ ### Defining Golden Pairs
248
+
249
+ Golden pairs live in your codebase as Ruby, not in a database. They're version-controlled, reviewed in PRs, and treated as first-class tests:
250
+
251
+ ```ruby
252
+ # test/diogenes/evals/support_assistant_evals.rb
253
+
254
+ Diogenes::Evals.define(SupportAssistant) do
255
+   eval "basic refund question" do
256
+     query "How do I request a refund?"
257
+     expects all_of(
258
+       grounded_in("refund-policy"),
259
+       contains("billing page"),
260
+       does_not_contain("24 hours") # known wrong answer to guard against
261
+     )
262
+   end
263
+
264
+   eval "enterprise tier question" do
265
+     query "Does the enterprise plan include SSO?"
266
+     expects all_of(
267
+       grounded_in("enterprise-terms"),
268
+       semantically_similar_to("Yes, SSO is included in all enterprise plans")
269
+     )
270
+   end
271
+
272
+   eval "question with no good answer" do
273
+     query "What is the API rate limit for the legacy v1 endpoints?"
274
+     expects one_of(
275
+       low_confidence_response,
276
+       routes_to_human_review
277
+     )
278
+   end
279
+ end
280
+ ```
281
+
282
+ ### Matchers
283
+
284
+ | Matcher | What it checks |
285
+ |---|---|
286
+ | `grounded_in(source)` | Retrieved context includes chunks from the named source |
287
+ | `contains(text)` | Response includes the specified text or close semantic equivalent |
288
+ | `does_not_contain(text)` | Response does not include the specified text |
289
+ | `semantically_similar_to(text)` | Response embedding is within a threshold cosine distance of the embedding of `text` |
290
+ | `low_confidence_response` | Feature returned a flagged or low-confidence result |
291
+ | `routes_to_human_review` | Feature queued the output for human review |
292
+ | `all_of(*matchers)` | All matchers must pass |
293
+ | `one_of(*matchers)` | At least one matcher must pass |
294
+
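For readers unfamiliar with the cosine-distance check behind `semantically_similar_to`, here is a plain-Ruby sketch. The function names and the `max_distance` default of 0.2 are assumptions for illustration; the toy vectors stand in for real embeddings.

```ruby
# Cosine similarity of two equal-length numeric vectors.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.size == b.size

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Cosine distance is 1 - similarity; the threshold here is a hypothetical default.
def within_cosine_distance?(a, b, max_distance: 0.2)
  (1.0 - cosine_similarity(a, b)) <= max_distance
end

puts within_cosine_distance?([1.0, 0.0], [0.9, 0.1])
puts within_cosine_distance?([1.0, 0.0], [0.0, 1.0])
```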
295
+ ### Running Evals
296
+
297
+ Evals can be run on demand from the dashboard, via a Rake task, or on a schedule:
298
+
299
+ ```bash
300
+ # On demand
301
+ bundle exec rake 'diogenes:evals:run[SupportAssistant]' # quoted so shells such as zsh do not glob the brackets
302
+ bundle exec rake diogenes:evals:run # all features
303
+ ```
304
+
305
+ ```ruby
306
+ # Scheduled via your existing job infrastructure
307
+ class RunDiogenesEvalsJob < ApplicationJob
308
+   def perform
309
+     Diogenes::Evals.run_all
310
+   end
311
+ end
312
+ ```
313
+
314
+ ### Regression Detection
315
+
316
+ When a passing eval starts failing, Diogenes records the regression point — exactly which run it broke on — and stores a diff between the last passing response and the first failing one. The evals detail view surfaces this:
317
+
318
+ ```
319
+ ComplianceSearcher — Eval Regression Detected
320
+
321
+ "What is our data retention policy?"
322
+
323
+ Status: ✗ FAILING since run #142 (3 days ago)
324
+
325
+ Last passing response (run #141):
326
+ "Data is retained for 7 years per our compliance obligations..."
327
+
328
+ Current failing response (run #144):
329
+ "Data retention policies vary by region..."
330
+
331
+ Likely cause: 'data-retention-policy.md' updated 4 days ago, not re-indexed
332
+ → See drift tab
333
+ ```
334
+
335
+ That last line is the integration that makes the dashboard coherent: a failing eval points to a stale document in the drift tab, which explains the low grounding score on the grounding tab. Three separate signals, one root cause.
336
+
337
+ ---
338
+
339
+ ## Configuration Reference
340
+
341
+ ```ruby
342
+ # config/initializers/diogenes.rb
343
+
344
+ Diogenes.configure do |config|
345
+   # Grounding
346
+   config.grounding.default_threshold = 0.8
347
+   config.grounding.verifier_llm = ->(prompt) { YourLLMClient.complete(prompt) }
348
+
349
+   # Drift
350
+   config.drift.reindex_job = ReindexDocumentJob
351
+   config.drift.staleness_thresholds = { warning: 7.days, critical: 30.days }
352
+   config.drift.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
353
+   config.drift.check_interval = 1.hour
354
+
355
+   # Evals
356
+   config.evals.schedule = '0 9 * * *' # daily at 9am, if using sidekiq-cron
357
+   config.evals.alert_on_regression = true
358
+   config.evals.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
359
+   config.evals.passing_threshold = 0.9 # alert if pass rate drops below 90%
360
+
361
+   # Audit log
362
+   config.audit.retention_days = 90
363
+   config.audit.pii_scrubber = Diogenes::PII::DefaultScrubber
364
+ end
365
+ ```