diogenes 0.1.2 → 0.1.3

data/docs/dashboard.md DELETED
@@ -1,365 +0,0 @@
1
- # The Diogenes Dashboard
2
-
3
- The Diogenes dashboard is a Rails engine that mounts into your existing application and gives you a live view of AI feature health across three dimensions: grounding verification, document drift, and eval regressions.
4
-
5
- Think of it as Sidekiq Web for AI accountability — one place to see whether your AI features are doing what you think they're doing.
6
-
7
- ---
8
-
9
- ## Mounting
10
-
11
- ```ruby
12
- # config/routes.rb
13
- mount Diogenes::Engine => '/diogenes'
14
- ```
15
-
16
- Diogenes has no opinions about authentication. Protect the route the same way you protect any sensitive admin surface:
17
-
18
- ```ruby
19
- # With Devise + admin constraint
20
- authenticate :user, ->(u) { u.admin? } do
21
- mount Diogenes::Engine => '/diogenes'
22
- end
23
-
24
- # With HTTP basic auth via a constraint
25
- constraints Diogenes::AdminConstraint.new(ENV['DIOGENES_PASSWORD']) do
26
- mount Diogenes::Engine => '/diogenes'
27
- end
28
-
29
- # With a custom routing constraint (an object that responds to matches?)
30
- mount Diogenes::Engine => '/diogenes', constraints: YourAuthConstraint
31
- ```
32
-
33
- ---
34
-
35
- ## Engine Routes
36
-
37
- ```ruby
38
- # lib/diogenes/engine/routes.rb
39
-
40
- Diogenes::Engine.routes.draw do
41
- root to: 'dashboard#overview'
42
-
43
- namespace :dashboard do
44
- # Grounding
45
- get 'grounding', to: 'grounding#index'
46
- get 'grounding/:feature', to: 'grounding#show'
47
- get 'grounding/:feature/:id', to: 'grounding#verification'
48
-
49
- # Drift
50
- get 'drift', to: 'drift#index'
51
- get 'drift/:index_name', to: 'drift#show'
52
- post 'drift/:index_name/reindex', to: 'drift#reindex'
53
- post 'drift/documents/:document_id/reindex', to: 'drift#reindex_document' # distinct prefix: 'drift/:document_id/reindex' would be shadowed by the route above
54
-
55
- # Evals
56
- get 'evals', to: 'evals#index'
57
- get 'evals/:feature', to: 'evals#show'
58
- post 'evals/:feature/run', to: 'evals#run'
59
- get 'evals/:feature/:run_id', to: 'evals#run_result'
60
- end
61
- end
62
- ```
63
-
64
- ---
65
-
66
- ## Controller Structure
67
-
68
- ```
69
- lib/diogenes/engine/
70
-   app/
71
-     controllers/
72
-       diogenes/
73
-         dashboard_controller.rb   # Overview tab
74
-         dashboard/
75
-           grounding_controller.rb
76
-           drift_controller.rb
77
-           evals_controller.rb
78
-     views/
79
-       diogenes/
80
-         dashboard/
81
-           overview.html.erb
82
-           grounding/
83
-             index.html.erb
84
-             show.html.erb
85
-             verification.html.erb
86
-           drift/
87
-             index.html.erb
88
-             show.html.erb
89
-           evals/
90
-             index.html.erb
91
-             show.html.erb
92
-             run_result.html.erb
93
- ```
94
-
95
- ---
96
-
97
- ## The Overview Tab
98
-
99
- `GET /diogenes`
100
-
101
- The entry point. One row per gated feature, showing aggregate health across all three dimensions. Red features surface immediately.
102
-
103
- ```
104
- ┌─────────────────────────────────────────────────────────────────────────────┐
105
- │ DIOGENES Last updated: 2 min ago│
106
- ├──────────────────────┬────────┬─────────────┬──────────────┬────────────────┤
107
- │ Feature │ Gates │ Grounding │ Drift │ Evals │
108
- ├──────────────────────┼────────┼─────────────┼──────────────┼────────────────┤
109
- │ SupportAssistant │ 5/5 ✓ │ 94% clean │ 2 stale docs │ 47/50 passing │
110
- │ CodebaseOracle │ 5/5 ✓ │ 98% clean │ All current │ 30/30 passing │
111
- │ ComplianceSearcher │ 5/5 ✓ │ 87% clean ⚠ │ 11 stale docs│ 18/25 passing ✗│
112
- └──────────────────────┴────────┴─────────────┴──────────────┴────────────────┘
113
- ```
114
-
115
- `ComplianceSearcher` has a low grounding rate, stale documents, and a failing eval suite. That's the feature that needs attention today — and you can see it before a user does.
116
-
117
- ---
118
-
119
- ## The Grounding Tab
120
-
121
- `GET /diogenes/dashboard/grounding`
122
-
123
- Per-feature grounding verification history. Shows flag rates over time and a live feed of recent verifications.
124
-
125
- ```
126
- SupportAssistant — Grounding History (last 30 days)
127
-
128
- Flag rate: 6.2% ▼ from 8.1% last month
129
-
130
- Recent verifications:
131
- ✓ Supported "You can cancel your subscription from the billing page..."
132
- ✓ Supported "The enterprise tier includes unlimited seats..."
133
- ⚠ Unsupported "Refunds are processed within 24 hours..." → flagged for review
134
- ✗ Contradicted "The API rate limit is 100 req/min..." → blocked, not served
135
- ```
136
-
137
- ### Verification Detail
138
-
139
- `GET /diogenes/dashboard/grounding/:feature/:id`
140
-
141
- Drill into any verification to see the full picture:
142
-
143
- ```
144
- Verification #8472 — SupportAssistant
145
- Query: "How long does a refund take?"
146
- Status: ⚠ FLAGGED — unsupported claim detected
147
-
148
- Retrieved context:
149
- [chunk 1] "Refund requests are reviewed within 2 business days..." ← source: refund-policy.md
150
- [chunk 2] "Enterprise customers receive priority support..." ← source: enterprise-terms.md
151
-
152
- AI response:
153
- "Refunds are typically processed within 24 hours."
154
-
155
- Verdict:
156
- supported: []
157
- unsupported: ["processed within 24 hours"] ← no chunk supports this timeframe
158
- contradicted: []
159
-
160
- Action taken: flagged_for_review
161
- Reviewed by: sarah@company.com (approved with edit)
162
- Final sent: "Refunds are reviewed within 2 business days..."
163
- ```
164
-
165
- This is the audit trail that makes human-in-the-loop real. Every verification, every flag, every human decision — all linked.
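The feed and detail view above imply a simple mapping from verdict to action: supported responses are served, unsupported claims are flagged for review, and contradicted claims are blocked. A minimal sketch of that mapping, assuming a verdict is a hash with `supported`, `unsupported`, and `contradicted` arrays as in the detail view. The `action_for` helper and the `:serve` and `:block` symbols are hypothetical names, not part of the documented API:

```ruby
# Hypothetical helper: choose an action from a grounding verdict.
# The verdict shape mirrors the detail view; the method name and the
# :serve / :block symbols are assumptions made for illustration.
def action_for(verdict)
  # A contradicted claim is the most severe outcome, so check it first.
  return :block unless verdict.fetch(:contradicted, []).empty?
  return :flag_for_review unless verdict.fetch(:unsupported, []).empty?

  :serve
end

verdict = {
  supported: [],
  unsupported: ["processed within 24 hours"],
  contradicted: []
}

puts action_for(verdict) # prints "flag_for_review"
```

Checking `contradicted` before `unsupported` means a response with both kinds of problem is blocked rather than merely flagged.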
166
-
167
- ---
168
-
169
- ## The Drift Tab
170
-
171
- `GET /diogenes/dashboard/drift`
172
-
173
- A staleness leaderboard: every indexed document, ranked by the gap between when its source last changed and when its embedding was last updated.
174
-
175
- ```
176
- Document Drift — All Indexes
177
-
178
- ● 11 documents require attention
179
-
180
- CRITICAL (source changed > 30 days ago, not re-indexed)
181
- ┌─────────────────────────────────┬──────────────┬──────────────┬──────────┐
182
- │ Document │ Source updated│ Last indexed │ Staleness│
183
- ├─────────────────────────────────┼──────────────┼──────────────┼──────────┤
184
- │ refund-policy.md │ 47 days ago │ 89 days ago │ ████████ │
185
- │ enterprise-pricing.md │ 33 days ago │ 102 days ago │ ████████ │
186
- └─────────────────────────────────┴──────────────┴──────────────┴──────────┘
187
-
188
- WARNING (source changed 7–30 days ago)
189
- ┌─────────────────────────────────┬──────────────┬──────────────┬──────────┐
190
- │ api-rate-limits.md │ 12 days ago │ 45 days ago │ █████░░░ │
191
- │ ... │ │ │ │
192
- └─────────────────────────────────┴──────────────┴──────────────┴──────────┘
193
-
194
- [Re-index all critical] [Re-index all warnings]
195
- ```
196
-
197
- ### Re-indexing
198
-
199
- Clicking re-index queues an `ActiveJob`. The job is defined in the host app — Diogenes provides the interface and the tracking, not the embedding logic:
200
-
201
- ```ruby
202
- # config/initializers/diogenes.rb
203
- Diogenes.configure do |config|
204
- config.drift.reindex_job = ReindexDocumentJob
205
- config.drift.staleness_thresholds = {
206
- warning: 7.days,
207
- critical: 30.days
208
- }
209
- config.drift.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
210
- end
211
- ```
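One way to read the thresholds above, consistent with the CRITICAL and WARNING groupings in the drift view, is sketched below. It is an illustration under stated assumptions, not the gem's implementation: the `staleness_level` method, the `:current` and `:pending` labels, and the use of plain seconds instead of `7.days` (so the snippet runs without ActiveSupport) are all hypothetical.

```ruby
DAY = 24 * 60 * 60 # seconds

# Mirrors the warning/critical thresholds configured above, in seconds.
THRESHOLDS = { warning: 7 * DAY, critical: 30 * DAY }.freeze

# Hypothetical classifier: a document is stale only if its source changed
# after the last index; severity then depends on how long ago that was.
def staleness_level(source_updated_at:, indexed_at:, now: Time.now)
  return :current if indexed_at >= source_updated_at

  age = now - source_updated_at # Float seconds since the source changed
  if age > THRESHOLDS[:critical]
    :critical
  elsif age >= THRESHOLDS[:warning]
    :warning
  else
    :pending # changed recently, below the warning threshold
  end
end

now = Time.utc(2025, 1, 1)
puts staleness_level(source_updated_at: now - 47 * DAY, indexed_at: now - 89 * DAY, now: now) # prints "critical"
puts staleness_level(source_updated_at: now - 12 * DAY, indexed_at: now - 45 * DAY, now: now) # prints "warning"
```

The two calls reuse the `refund-policy.md` and `api-rate-limits.md` figures from the drift view.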
212
-
213
- ### How Drift Is Detected
214
-
215
- Diogenes tracks two timestamps per indexed document:
216
-
217
- - `source_updated_at` — populated by the host app when indexing, updated by a webhook or scheduled check
218
- - `indexed_at` — set by Diogenes when an embedding is stored
219
-
220
- The staleness score is a function of the gap between them, weighted by how much the source changed (line diff if available, timestamp delta otherwise).
221
-
222
- ```ruby
223
- # Inform Diogenes that a source document has changed
224
- Diogenes::Drift.source_updated(
225
- document_id: 'refund-policy-v2',
226
- updated_at: Time.current,
227
- diff_size: :major # :minor, :moderate, :major — affects staleness weight
228
- )
229
- ```
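The staleness score described above (the gap between the two timestamps, weighted by how much the source changed) could be computed roughly as follows. The exact formula is not documented, so the weights and the `staleness_score` method are assumptions chosen only to make the idea concrete:

```ruby
SECONDS_PER_DAY = 86_400.0

# Hypothetical weights; the text only says diff_size affects the weight.
DIFF_WEIGHTS = { minor: 0.5, moderate: 1.0, major: 2.0 }.freeze

# Score = days between the source change and the last index, scaled by the
# size of the change. Falls back to a neutral weight when diff_size is unknown.
def staleness_score(source_updated_at:, indexed_at:, diff_size: nil)
  gap_days = (source_updated_at - indexed_at) / SECONDS_PER_DAY
  return 0.0 if gap_days <= 0 # indexed after the last source change

  gap_days * DIFF_WEIGHTS.fetch(diff_size, 1.0)
end

indexed = Time.utc(2025, 1, 1)
changed = Time.utc(2025, 1, 11)
puts staleness_score(source_updated_at: changed, indexed_at: indexed, diff_size: :major) # prints 20.0
puts staleness_score(source_updated_at: changed, indexed_at: indexed)                    # prints 10.0
```

With no `diff_size`, the score reduces to the plain timestamp delta in days, which matches the "timestamp delta otherwise" fallback.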
230
-
231
- ---
232
-
233
- ## The Evals Tab
234
-
235
- `GET /diogenes/dashboard/evals`
236
-
237
- The hardest unsolved problem in production AI is knowing whether your feature is getting better or worse over time. The evals tab is Diogenes's answer: define golden question/answer pairs, run them on a schedule, track pass rates over time, and alert on regression.
238
-
239
- ```
240
- Evals — All Features
241
-
242
- SupportAssistant 47/50 passing 94% ▲ from 91% last week [Run now]
243
- CodebaseOracle 30/30 passing 100% → unchanged [Run now]
244
- ComplianceSearcher 18/25 passing 72% ▼ from 88% last week ✗ [Run now]
245
- ```
246
-
247
- ### Defining Golden Pairs
248
-
249
- Golden pairs live in your codebase as Ruby, not in a database. They're version-controlled, reviewed in PRs, and treated as first-class tests:
250
-
251
- ```ruby
252
- # test/diogenes/evals/support_assistant_evals.rb
253
-
254
- Diogenes::Evals.define(SupportAssistant) do
255
- eval "basic refund question" do
256
- query "How do I request a refund?"
257
- expects all_of(
258
- grounded_in("refund-policy"),
259
- contains("billing page"),
260
- does_not_contain("24 hours") # known wrong answer to guard against
261
- )
262
- end
263
-
264
- eval "enterprise tier question" do
265
- query "Does the enterprise plan include SSO?"
266
- expects all_of(
267
- grounded_in("enterprise-terms"),
268
- semantically_similar_to("Yes, SSO is included in all enterprise plans")
269
- )
270
- end
271
-
272
- eval "question with no good answer" do
273
- query "What is the API rate limit for the legacy v1 endpoints?"
274
- expects one_of(
275
- low_confidence_response,
276
- routes_to_human_review
277
- )
278
- end
279
- end
280
- ```
281
-
282
- ### Matchers
283
-
284
- | Matcher | What it checks |
285
- |---|---|
286
- | `grounded_in(source)` | Retrieved context includes chunks from the named source |
287
- | `contains(text)` | Response includes the specified text or close semantic equivalent |
288
- | `does_not_contain(text)` | Response does not include the specified text |
289
- | `semantically_similar_to(text)` | Response embedding is within the configured cosine-distance threshold of the embedding of `text` |
290
- | `low_confidence_response` | Feature returned a flagged or low-confidence result |
291
- | `routes_to_human_review` | Feature queued the output for human review |
292
- | `all_of(*matchers)` | All matchers must pass |
293
- | `one_of(*matchers)` | At least one matcher must pass |
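To show how the combinators in the table compose, here is a small sketch that models each matcher as a lambda over a response. The response shape (`:text`, `:sources`) is an assumption, and `contains` is simplified to a literal substring check; it does not attempt the semantic-equivalence matching the table describes.

```ruby
# Each matcher returns a lambda that takes a response hash and returns
# true or false. The :text / :sources keys are assumed for this sketch.
def contains(text)
  ->(response) { response[:text].include?(text) }
end

def does_not_contain(text)
  ->(response) { !response[:text].include?(text) }
end

def grounded_in(source)
  ->(response) { response[:sources].any? { |s| s.include?(source) } }
end

# Passes only if every matcher passes.
def all_of(*matchers)
  ->(response) { matchers.all? { |m| m.call(response) } }
end

# Passes if at least one matcher passes.
def one_of(*matchers)
  ->(response) { matchers.any? { |m| m.call(response) } }
end

response = {
  text: "You can request a refund from the billing page.",
  sources: ["refund-policy.md"]
}

check = all_of(
  grounded_in("refund-policy"),
  contains("billing page"),
  does_not_contain("24 hours")
)

puts check.call(response) # prints "true"
```

Because combinators return the same kind of object as leaf matchers, `all_of` and `one_of` can be nested freely.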
294
-
295
- ### Running Evals
296
-
297
- Evals can be run on demand from the dashboard, via a Rake task, or on a schedule:
298
-
299
- ```bash
300
- # On demand
301
- bundle exec rake diogenes:evals:run[SupportAssistant]
302
- bundle exec rake diogenes:evals:run # all features
303
- ```
304
-
305
- ```ruby
306
- # Scheduled via your existing job infrastructure
307
- class RunDiogenesEvalsJob < ApplicationJob
308
- def perform
309
- Diogenes::Evals.run_all
310
- end
311
- end
312
- ```
313
-
314
- ### Regression Detection
315
-
316
- When a passing eval starts failing, Diogenes records the regression point — exactly which run it broke on — and stores a diff between the last passing response and the first failing one. The evals detail view surfaces this:
317
-
318
- ```
319
- ComplianceSearcher — Eval Regression Detected
320
-
321
- "What is our data retention policy?"
322
-
323
- Status: ✗ FAILING since run #142 (3 days ago)
324
-
325
- Last passing response (run #141):
326
- "Data is retained for 7 years per our compliance obligations..."
327
-
328
- Current failing response (run #144):
329
- "Data retention policies vary by region..."
330
-
331
- Likely cause: 'data-retention-policy.md' updated 4 days ago, not re-indexed
332
- → See drift tab
333
- ```
334
-
335
- That last line is the integration that makes the dashboard coherent: a failing eval points to a stale document in the drift tab, which explains the low grounding score on the grounding tab. Three separate signals, one root cause.
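Locating the regression point, meaning the last passing run and the first failing run after it, can be sketched as below. The `Run` struct and `regression_point` method are hypothetical; how Diogenes actually stores run history is not described here.

```ruby
# Hypothetical run record; the real storage schema is not documented.
Run = Struct.new(:id, :passed)

# Returns [last_passing, first_failing] for the most recent pass-to-fail
# transition, or nil if the latest run passed or no run ever passed.
def regression_point(runs)
  ordered = runs.sort_by(&:id)
  return nil if ordered.empty? || ordered.last.passed

  last_pass_index = ordered.rindex(&:passed)
  return nil if last_pass_index.nil?

  [ordered[last_pass_index], ordered[last_pass_index + 1]]
end

# Run ids taken from the regression example above.
runs = [
  Run.new(141, true),
  Run.new(142, false),
  Run.new(143, false),
  Run.new(144, false)
]

last_passing, first_failing = regression_point(runs)
puts "failing since run ##{first_failing.id}, last passed on run ##{last_passing.id}"
# prints "failing since run #142, last passed on run #141"
```

Storing the responses from those two runs is then enough to produce the diff shown in the detail view.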
336
-
337
- ---
338
-
339
- ## Configuration Reference
340
-
341
- ```ruby
342
- # config/initializers/diogenes.rb
343
-
344
- Diogenes.configure do |config|
345
- # Grounding
346
- config.grounding.default_threshold = 0.8
347
- config.grounding.verifier_llm = -> (prompt) { YourLLMClient.complete(prompt) }
348
-
349
- # Drift
350
- config.drift.reindex_job = ReindexDocumentJob
351
- config.drift.staleness_thresholds = { warning: 7.days, critical: 30.days }
352
- config.drift.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
353
- config.drift.check_interval = 1.hour
354
-
355
- # Evals
356
- config.evals.schedule = '0 9 * * *' # daily at 9am, if using sidekiq-cron
357
- config.evals.alert_on_regression = true
358
- config.evals.alert_webhook = ENV['DIOGENES_ALERT_WEBHOOK']
359
- config.evals.passing_threshold = 0.9 # alert if pass rate drops below 90%
360
-
361
- # Audit log
362
- config.audit.retention_days = 90
363
- config.audit.pii_scrubber = Diogenes::PII::DefaultScrubber
364
- end
365
- ```
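As a worked example of `passing_threshold`, the sketch below applies a 0.9 threshold to the pass counts shown in the evals overview. The `features_below_threshold` helper is hypothetical and only illustrates the arithmetic; it is not how the gem necessarily evaluates alerts.

```ruby
PASSING_THRESHOLD = 0.9 # mirrors config.evals.passing_threshold above

# Pass counts taken from the evals overview earlier in this document.
results = {
  "SupportAssistant"   => { passed: 47, total: 50 }, # 0.94
  "CodebaseOracle"     => { passed: 30, total: 30 }, # 1.00
  "ComplianceSearcher" => { passed: 18, total: 25 }  # 0.72
}

# Hypothetical helper: names of features whose pass rate is below threshold.
def features_below_threshold(results, threshold)
  results.select { |_, r| r[:passed].fdiv(r[:total]) < threshold }.keys
end

puts features_below_threshold(results, PASSING_THRESHOLD).inspect
# prints ["ComplianceSearcher"]
```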
data/docs/examples.md DELETED
@@ -1,162 +0,0 @@
1
- # Examples
2
-
3
- These two examples use the same codebase, the same RAG pipeline, and the same underlying LLM. They reach different verdicts. That's the point.
4
-
5
- ---
6
-
7
- ## Example A: Support Agent Assistant ✅ Passes
8
-
9
- **The feature:** A text interface inside your Rails admin that lets support agents paste a customer's problem and receive relevant documentation, similar past tickets, and a suggested first response — all sourced and citable.
10
-
11
- **The user:** A trained support agent. Not the customer.
12
-
13
- ### Gate Evaluation
14
-
15
- ```ruby
16
- class SupportAssistant
17
- include Diogenes::Feature
18
- include Diogenes::Grounding
19
-
20
- gate :failure_mode,
21
- severity: :recoverable
22
- # A wrong suggested response gets caught by the agent before
23
- # it reaches the customer. Failure is visible and correctable.
24
-
25
- gate :user_calibration,
26
- audience: :trained_agent
27
- # Agents know the product. They will notice a wrong answer.
28
- # They have the context to evaluate AI output critically.
29
-
30
- gate :human_in_loop,
31
- verified: true,
32
- max_daily_reviews: 80,
33
- review_sla_minutes: 5
34
- # The agent reads the suggestion before sending anything.
35
- # The customer never sees unreviewed AI output.
36
-
37
- gate :observability,
38
- logging: :full,
39
- alerting: :enabled,
40
- drift_detection: true
41
- # We track suggestion acceptance rate, edit rate, and
42
- # escalation rate as proxy signals for output quality.
43
-
44
- gate :necessity,
45
- alternatives_considered: [
46
- "Static knowledge base with keyword search",
47
- "Pre-written response templates by category"
48
- ],
49
- reasoning: "Agents handle 400+ tickets/day across 12 product areas and
50
- 6 pricing tiers. Template coverage is <40% of actual ticket
51
- variance. Search requires agents to know what to search for."
52
-
53
- verify_grounding threshold: 0.8, on_failure: :flag_for_review
54
-
55
- def answer(query, agent:)
56
- context = retriever.retrieve(query)
57
- response = llm.complete(query, context: context)
58
-
59
- verify_and_return(
60
- response,
61
- context: context,
62
- reviewed_by: agent
63
- )
64
- end
65
- end
66
- ```
67
-
68
- ### Why It Passes
69
-
70
- The human is genuinely in the loop — the agent reads the suggestion and decides what to send. The audience is calibrated — agents know when an answer feels wrong. The failure mode is recoverable — a bad suggestion gets edited or ignored before it causes harm. Observability is real — acceptance and edit rates tell you when quality degrades before users notice.
71
-
72
- The RAG pipeline underneath is the same one described in the implementation guide. The grounding verifier ensures the suggested response is actually supported by retrieved context, not confabulated. Chunks that surface PII from past tickets are scrubbed before storage — the anonymization pipeline is what makes it legal to index support history at all.
73
-
74
- ---
75
-
76
- ## Example B: Invoice Explainer ❌ Fails
77
-
78
- **The feature:** A "What does this mean?" button on invoice line items that uses AI to explain charges to customers in plain language.
79
-
80
- **The user:** Any customer. General consumer audience.
81
-
82
- ### Gate Evaluation
83
-
84
- ```ruby
85
- class InvoiceExplainer
86
- include Diogenes::Feature
87
-
88
- gate :failure_mode,
89
- severity: :financial_dispute
90
- # A wrong explanation of a charge — "that fee doesn't apply to
91
- # your account" when it does — creates a customer expectation that
92
- # is difficult and expensive to walk back. It may constitute a
93
- # misrepresentation of contract terms.
94
-
95
- gate :user_calibration,
96
- audience: :general_consumer
97
- # Customers don't have access to the billing logic. They cannot
98
- # evaluate whether the explanation is correct. A confident wrong
99
- # answer is worse than no answer.
100
-
101
- gate :human_in_loop,
102
- verified: false
103
- # There is no human between the AI output and the customer.
104
- # Real-time invoice explanations cannot wait for review.
105
-
106
- # Diogenes::UnsafeFeatureError raised at boot:
107
- # InvoiceExplainer fails 3 gates:
108
- # - :failure_mode severity :financial_dispute requires :human_in_loop verified: true
109
- # - :user_calibration audience :general_consumer incompatible with :failure_mode severity :financial_dispute
110
- # - :human_in_loop verified: false blocks all consumer-facing output
111
- end
112
- ```
113
-
114
- ### Why It Fails — And What To Do Instead
115
-
116
- The feature fails because the failure mode is financial, the audience can't self-verify, and there's no human review. These aren't independent problems — they compound. An incorrect explanation of a financial charge, delivered confidently to someone who has no way to check it, with no human who saw it before it went out, is a product liability problem.
117
-
118
- **The right solution isn't a better AI feature. It's better UI.**
119
-
120
- ```ruby
121
- # What to build instead:
122
- #
123
- # 1. Plain-English line item labels in the invoice model itself
124
- # "Platform fee (10% of monthly usage)" not "PLAT_FEE_USG_PCT"
125
- #
126
- # 2. A tooltip or expandable section with a static explanation
127
- # per charge type — written once by a human, accurate forever
128
- #
129
- # 3. A "why was I charged this?" link that opens a pre-filtered
130
- # support ticket — staffed by agents using the SupportAssistant above
131
- #
132
- # This solves 90% of user confusion, is faster to build,
133
- # costs nothing to run, and has zero hallucination risk.
134
- ```
135
-
136
- ### The Honest Conversation
137
-
138
- When a client comes to you with the invoice explainer idea, the Diogenes gate output is the conversation:
139
-
140
- > "We ran this through the framework. It fails on failure mode severity, user calibration, and human review. Here's what that means in practice, and here's what we'd build instead."
141
-
142
- That's not a rejection. That's a better answer to the same problem — delivered with evidence.
143
-
144
- ---
145
-
146
- ## What the Dashboard Shows
147
-
148
- Three weeks after shipping `SupportAssistant`, the Diogenes overview tab looks like this:
149
-
150
- ```
151
- ┌──────────────────────┬────────┬─────────────┬──────────────┬───────────────┐
152
- │ Feature │ Gates │ Grounding │ Drift │ Evals │
153
- ├──────────────────────┼────────┼─────────────┼──────────────┼───────────────┤
154
- │ SupportAssistant │ 5/5 ✓ │ 94% clean │ 1 stale doc │ 47/50 passing │
155
- └──────────────────────┴────────┴─────────────┴──────────────┴───────────────┘
156
- ```
157
-
158
- The one stale document is `refund-policy.md`, updated by the product team two weeks ago and not yet re-indexed. The grounding tab shows three recent verifications flagged for containing the old refund timeframe. The drift tab surfaces the document with a re-index button. The eval for "how long does a refund take?" is among the three failing evals.
159
-
160
- All three signals point to the same root cause. One click queues the re-index job. The problem is visible before a customer receives a wrong answer.
161
-
162
- `InvoiceExplainer` doesn't appear in the dashboard at all. It never got past boot.
data/docs/framework.md DELETED
@@ -1,146 +0,0 @@
1
- # The Diogenes Framework
2
-
3
- Diogenes is built around a single question: **can you defend this decision?**
4
-
5
- Not defend it philosophically — defend it in a postmortem, in a client meeting, in an audit, or in a conversation with a user who got a wrong answer and wants to know why.
6
-
7
- The framework is five gates. Each one represents a question your team must answer before shipping an AI-enhanced feature to users. Some gates are enforced by code. Some are enforced by documentation. All of them are enforced by the fact that they live in your codebase, get reviewed in PRs, and travel with the feature forever.
8
-
9
- ---
10
-
11
- ## Gate 1: Failure Mode
12
-
13
- **The question:** What happens when the AI is wrong — and is that acceptable?
14
-
15
- Every AI feature will produce wrong answers. The question is never *if*, it's *what happens when*. The failure mode gate forces you to classify the severity of a wrong answer before you ship.
16
-
17
- ```ruby
18
- gate :failure_mode, severity: :recoverable
19
- # Acceptable severities: :cosmetic, :recoverable, :degraded_experience
20
- # Unacceptable without additional gates: :financial_dispute, :safety_risk, :legal_exposure
21
- ```
22
-
23
- A wrong restaurant recommendation is cosmetic. A wrong answer about a return policy is recoverable. A wrong answer about medication interactions is a safety risk. The gem will warn at boot if you declare a high-severity failure mode without also declaring strong human-in-the-loop configuration.
24
-
25
- **What this forces you to think about:**
26
- - The worst plausible wrong answer, not the average wrong answer
27
- - Whether wrong answers compound over time or reset per interaction
28
- - Whether the user will know they received a wrong answer
29
-
30
- ---
31
-
32
- ## Gate 2: User Calibration
33
-
34
- **The question:** Can your target user tell when the output is wrong?
35
-
36
- This is the gate that kills the most features — and it should. The danger of a confident, fluent AI answer isn't that it's wrong. It's that it's wrong in a way that sounds exactly like being right.
37
-
38
- A developer reading Copilot output has the domain knowledge to spot a hallucinated API method. A consumer reading an AI-generated insurance summary probably does not.
39
-
40
- ```ruby
41
- gate :user_calibration, audience: :trained_agent
42
- # Audiences: :domain_expert, :trained_agent, :power_user, :general_consumer
43
- ```
44
-
45
- `:general_consumer` doesn't automatically fail, but it requires strong grounding verification and conservative output thresholds. The gem raises at boot if `:general_consumer` is combined with `:safety_risk`.
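A boot-time compatibility check of this kind might look like the sketch below. It is an assumption-laden illustration: the top-level `UnsafeFeatureError` stands in for the gem's error class, `validate_gates!` and the `SymptomChecker` feature are hypothetical, and only two rules are encoded (high severity requires verified human review; `:general_consumer` is incompatible with `:safety_risk`). The gem's actual rule set may differ.

```ruby
# Stand-in error class for illustration only.
class UnsafeFeatureError < StandardError; end

# Severities listed above as unacceptable without additional gates.
HIGH_SEVERITIES = %i[financial_dispute safety_risk legal_exposure].freeze

# Hypothetical validator: collects every failed rule, then raises once.
def validate_gates!(feature, gates)
  failures = []
  severity = gates.dig(:failure_mode, :severity)
  audience = gates.dig(:user_calibration, :audience)
  reviewed = gates.dig(:human_in_loop, :verified) == true

  if HIGH_SEVERITIES.include?(severity) && !reviewed
    failures << ":failure_mode severity :#{severity} requires :human_in_loop verified: true"
  end
  if audience == :general_consumer && severity == :safety_risk
    failures << ":user_calibration audience :general_consumer incompatible with :failure_mode severity :safety_risk"
  end

  return true if failures.empty?

  raise UnsafeFeatureError, "#{feature} fails #{failures.size} gate(s):\n- #{failures.join("\n- ")}"
end

begin
  validate_gates!(
    "SymptomChecker",
    {
      failure_mode: { severity: :safety_risk },
      user_calibration: { audience: :general_consumer },
      human_in_loop: { verified: false }
    }
  )
rescue UnsafeFeatureError => e
  puts e.message # lists both failed rules
end
```

Collecting all failures before raising lets a single boot error report every incompatible gate at once, rather than one per restart.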
46
-
47
- **What this forces you to think about:**
48
- - Who the actual user is, not the assumed user
49
- - Whether the UI makes uncertainty visible
50
- - Whether users have any recourse when they suspect an error
51
-
52
- ---
53
-
54
- ## Gate 3: Human in the Loop
55
-
56
- **The question:** Is a human genuinely reviewing output, or rubber-stamping at volume?
57
-
58
- "A human reviews it" is one of the most misused phrases in AI feature development. A human who spends 30 seconds on each of 400 AI outputs a day is not performing a meaningful review. This gate requires you to specify not just that a human is involved, but the conditions under which that review is real.
59
-
60
- ```ruby
61
- gate :human_in_loop,
62
- verified: true,
63
- max_daily_reviews: 50,
64
- review_sla_minutes: 30
65
- ```
66
-
67
- Diogenes tracks review volume per reviewer. If a reviewer exceeds `max_daily_reviews`, the queue is flagged and additional reviewers are required before new outputs are released.
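The per-reviewer cap can be pictured with a small in-memory counter. This is a sketch only: the `ReviewLoad` class and its methods are hypothetical, and a real implementation would persist counts and reset them each day.

```ruby
# Hypothetical in-memory tracker for daily review counts per reviewer.
class ReviewLoad
  def initialize(max_daily_reviews:)
    @max = max_daily_reviews
    @counts = Hash.new(0)
  end

  # Record one completed review for the given reviewer.
  def record(reviewer)
    @counts[reviewer] += 1
  end

  # Reviewers who have exceeded the daily cap.
  def overloaded
    @counts.select { |_, n| n > @max }.keys
  end

  # The queue is flagged as soon as any reviewer is over the cap.
  def queue_flagged?
    !overloaded.empty?
  end
end

tracker = ReviewLoad.new(max_daily_reviews: 50)
51.times { tracker.record("reviewer_a") }
20.times { tracker.record("reviewer_b") }

puts tracker.queue_flagged?     # prints "true"
puts tracker.overloaded.inspect # prints ["reviewer_a"]
```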
68
-
69
- **What this forces you to think about:**
70
- - Whether reviewers have enough context to actually evaluate the output
71
- - Whether reviewers have the authority to reject or modify, not just approve
72
- - Whether review load is sustainable as the feature scales
73
-
74
- ---
75
-
76
- ## Gate 4: Observability
77
-
78
- **The question:** Will you know when things are going wrong in production?
79
-
80
- Most teams ship AI features with less monitoring than they'd give a sorting algorithm. The outputs are nondeterministic, which means traditional error rate monitoring misses the most important failure modes — outputs that are wrong but don't raise exceptions.
81
-
82
- ```ruby
83
- gate :observability,
84
- logging: :full,
85
- alerting: :enabled,
86
- drift_detection: true
87
- ```
88
-
89
- Declaring this gate without the corresponding observability infrastructure raises at boot. Diogenes checks for a configured audit log destination and alert webhook before allowing the feature to serve.
90
-
91
- **What this forces you to think about:**
92
- - How you detect when hallucination rates increase
93
- - How you detect when retrieved context is becoming stale
94
- - What a "degraded AI feature" looks like in your alerting dashboard
95
-
96
- ---
97
-
98
- ## Gate 5: Necessity
99
-
100
- **The question:** Is AI actually the right solution, or just the exciting one?
101
-
102
- This is the hardest gate to encode and the most important one to ask. A well-designed UI, a good search index, a rule-based system, or better documentation will outperform a brittle AI feature in most cases — and will be faster to build, cheaper to run, and easier to debug.
103
-
104
- ```ruby
105
- gate :necessity,
106
- alternatives_considered: [
107
- "Better invoice UI with plain-English line item labels",
108
- "Keyword search across the knowledge base",
109
- "Contextual help tooltips on common confusion points"
110
- ],
111
- reasoning: "Support agents handle 400+ tickets/day across 12 product areas.
112
- The volume and variance exceed what static documentation can address."
113
- ```
114
-
115
- Diogenes doesn't enforce this gate programmatically. It enforces it by requiring you to write it down, put it in a PR, and have someone else read it.
116
-
117
- **What this forces you to think about:**
118
- - Whether the problem is actually an AI problem or a UX problem
119
- - Whether you're solving for users or solving for a roadmap item
120
- - Whether you'll still want this feature in 18 months when the novelty has worn off
121
-
122
- ---
123
-
124
- ## The Verdict
125
-
126
- A feature that passes all five gates is ready to build — with the understanding that the gates are revisited when the feature scales, when the user base changes, or when the failure mode landscape shifts.
127
-
128
- A feature that fails a gate isn't necessarily dead. It's a specification of what needs to change before it ships. Sometimes that's a better review workflow. Sometimes that's a UI change that improves user calibration. Sometimes it's the answer that saves you six months of work on the wrong solution.
129
-
130
- That's the point.
131
-
132
- ---
133
-
134
- ## After the Gates
135
-
136
- Passing the gates is the beginning, not the end. Once a feature ships, Diogenes continues watching it through three lenses:
137
-
138
- **Grounding** — are AI responses actually supported by the context they're drawing from, or are they drifting into confabulation?
139
-
140
- **Drift** — are the documents feeding your retrieval pipeline still current, or have they gone stale while the embeddings haven't caught up?
141
-
142
- **Evals** — are the things the feature was supposed to do when you shipped it still working, or has something quietly regressed?
143
-
144
- All three surface in the mounted dashboard. The gates tell you whether to build it. These tell you whether it's still working.
145
-
146
- See [docs/dashboard.md](docs/dashboard.md) for full documentation.