@simbimbo/brainstem 0.0.3 → 0.0.4

package/docs/adapters.md CHANGED
@@ -1,435 +1,131 @@
1
1
  # Adapters and Canonical Event Contract
2
2
 
3
- _Status: design contract for intake breadth without connector chaos_
3
+ _Status: aligned contract for implemented intake runtime_
4
4
 
5
- ## Purpose
5
+ This document defines the current adapter and driver surface and preserves the same intent as design governance without promising unshipped connectors.
6
6
 
7
- This document defines how brAInstem should accept many different source types without turning into an ungoverned pile of bespoke connector logic.
7
+ Read with:
8
8
 
9
- Its job is to answer:
10
- - what an adapter is
11
- - what an adapter is allowed to do
12
- - what an adapter is not allowed to do
13
- - what the raw input contract is
14
- - what the canonical event contract is
15
- - how failures should be handled
16
- - what "universal input" means in practice
17
-
18
- This document should be read together with:
19
9
  - `design-governance.md`
20
10
  - `architecture.md`
21
11
  - `v0.0.1.md`
22
12
 
23
- ---
24
-
25
- ## 1. Why adapters exist
26
-
27
- brAInstem should eventually ingest many classes of operational input:
28
- - syslog
29
- - local log files
30
- - JSON log streams
31
- - webhook payloads
32
- - monitoring/alert APIs
33
- - vendor-specific event formats
34
- - later: journald, Windows events, queue/stream sources, cloud audit feeds
35
-
36
- Those sources all have different:
37
- - transport behavior
38
- - payload shapes
39
- - timestamp formats
40
- - metadata conventions
41
- - source identity hints
42
- - failure modes
43
-
44
- The adapter layer exists so that source-specific ugliness stays at the edge.
45
-
46
- The rest of the product should primarily deal with:
47
- - raw input envelopes
48
- - canonical events
49
- - attention and routing
50
- - discovery and memory
51
-
52
- ### Design rule
53
- Adapter complexity belongs at the edges.
54
- The discovery apparatus should not need to know what transport/protocol originally delivered an event.
55
-
56
- ---
57
-
58
- ## 2. What "universal input" means
59
-
60
- "Universal" does **not** mean:
61
- - native first-class support for every source in the first release
62
- - a giant vendor integration matrix before the core event model is stable
63
- - a custom parser for every odd format on day one
64
-
65
- "Universal" **does** mean:
66
- - every source can be represented by a raw input envelope
67
- - every successful parse can become a canonical event
68
- - every canonical event can enter the same attention/discovery pipeline
69
- - new sources can be added by implementing a constrained adapter contract instead of creating system-wide exceptions
70
-
71
- Universal input is a property of the architecture, not a promise of immediate breadth.
72
-
73
- ---
74
-
75
- ## 3. Adapter responsibilities
76
-
77
- An adapter is responsible for:
78
- 1. receiving source data from a specific source class
79
- 2. preserving enough provenance to audit where the input came from
80
- 3. emitting a valid `RawInputEnvelope`
81
- 4. optionally performing source-local pre-parse validation
82
- 5. handing the envelope into the parser/canonicalizer stage
83
-
84
- An adapter is **not** responsible for:
85
- - long-term memory decisions
86
- - discovery logic
87
- - attention scoring policy
88
- - promotion policy
89
- - operator-facing explanation generation
90
-
91
- Adapters should stay narrow.
92
-
93
- ---
94
-
95
- ## 4. Adapter categories
96
-
97
- Early useful categories:
98
-
99
- ### 4.1 File adapter
100
- For:
101
- - local file tails
102
- - rotated logs
103
- - directory watch patterns
104
- - line-oriented service/application logs
105
-
106
- ### 4.2 Syslog adapter
107
- For:
108
- - UDP syslog
109
- - TCP syslog
110
- - later TLS syslog if needed
111
-
112
- ### 4.3 HTTP/webhook adapter
113
- For:
114
- - generic JSON event POST
115
- - vendor webhooks
116
- - batched events
117
-
118
- ### 4.4 API pull adapter
119
- For:
120
- - periodic polling of event history
121
- - alert/event backfill
122
- - vendor APIs like LogicMonitor where polling is useful
123
-
124
- ### 4.5 Stream adapter
125
- For:
126
- - stdin or pipeline-fed events
127
- - queue/stream integrations later
128
- - replay tooling
129
-
130
- These are categories, not promises that all must ship in `v0.0.1`.
131
-
132
- ---
133
-
134
- ## 5. Raw input envelope contract
135
-
136
- Every adapter must emit a raw input envelope before deeper normalization.
137
-
138
- This contract preserves:
139
- - provenance
140
- - transport identity
141
- - original payload fidelity
142
- - parser/debug visibility
143
-
144
- ## Required fields
145
-
146
- ### `envelope_id`
147
- - unique id for the raw envelope
148
- - generated at receipt if source does not provide one
149
-
150
- ### `source_id`
151
- - stable identifier for the source instance
152
- - examples:
153
- - `syslog:edge-fw-01`
154
- - `file:/var/log/auth.log`
155
- - `http:logicmonitor-prod-webhook`
156
-
157
- ### `source_type`
158
- - broad source class
159
- - examples:
160
- - `syslog`
161
- - `file`
162
- - `http`
163
- - `logicmonitor`
164
- - `stream`
165
-
166
- ### `tenant_id`
167
- - logical tenant/environment owner
168
- - may be defaulted in early local mode
169
- - should still exist conceptually even if the first release uses a single tenant
170
-
171
- ### `received_at`
172
- - timestamp when brAInstem received the input
173
-
174
- ### `raw_payload`
175
- - original raw line/body/payload in preserved form
176
- - may be string, bytes, or structured object depending on implementation, but must remain recoverable
177
-
178
- ## Strongly recommended fields
179
-
180
- ### `observed_at`
181
- - source-reported timestamp if available
182
- - may differ from `received_at`
183
-
184
- ### `transport`
185
- - example values:
186
- - `syslog-udp`
187
- - `syslog-tcp`
188
- - `http-post`
189
- - `file-tail`
190
- - `api-poll`
191
-
192
- ### `source_metadata`
193
- - adapter/source-specific metadata
194
- - examples:
195
- - file path
196
- - listener port
197
- - remote ip
198
- - vendor alert id
199
- - request headers subset
200
- - offset/sequence info
201
-
202
- ### `parse_status`
203
- - initial parse state marker
204
- - example values:
205
- - `pending`
206
- - `parsed`
207
- - `parse_error`
208
- - `unsupported`
209
-
210
- ### `sequence_hint`
211
- - optional source ordering hint when available
212
-
213
- ## Design rule
214
- The raw envelope is not the product output.
215
- It is the preserved intake truth that allows everything else to be audited.
216
-
217
- ---
218
-
219
- ## 6. Canonical event contract
220
-
221
- After parsing/canonicalization, a successful input should become a canonical event.
222
-
223
- This is the shared internal stream of consciousness.
224
-
225
- Once an event becomes canonical, the discovery apparatus should not care whether it came from:
226
- - syslog
227
- - file tail
228
- - webhook
229
- - LogicMonitor
230
- - future sources
231
-
232
- ## Required fields
233
-
234
- ### `event_id`
235
- - stable unique id for the canonical event
236
-
237
- ### `tenant_id`
238
- - the tenant/environment the event belongs to
239
-
240
- ### `source_type`
241
- - normalized source family
242
-
243
- ### `timestamp`
244
- - best normalized event timestamp
245
- - prefers true observed time when trustworthy
246
- - may fall back to receipt time
247
-
248
- ### `kind`
249
- - normalized event class
250
- - examples:
251
- - `auth_failure`
252
- - `service_restart`
253
- - `vpn_flap`
254
- - `generic_warning`
255
-
256
- ### `message_raw`
257
- - original message body after basic extraction
258
-
259
- ### `message_normalized`
260
- - normalized message used for fingerprinting/grouping
261
-
262
- ### `raw_ref`
263
- - reference back to the raw input envelope or raw store
264
-
265
- ## Strongly recommended fields
266
-
267
- ### `source_name`
268
- - human-meaningful source instance name
269
-
270
- ### `host`
271
- - normalized host/device identity where possible
272
-
273
- ### `asset_id`
274
- - stable asset identifier if known
275
-
276
- ### `service`
277
- - normalized service/subsystem name
278
-
279
- ### `severity`
280
- - normalized severity value or band
281
-
282
- ### `labels`
283
- - tag-like annotations for routing/discovery
284
-
285
- ### `structured_fields`
286
- - extracted structured values
287
-
288
- ### `correlation_keys`
289
- - fields likely useful for grouping/spread/recurrence logic
290
-
291
- ### `ingest_metadata`
292
- - useful canonicalization metadata that should travel downstream
293
-
294
- ---
295
-
296
- ## 7. Parse failure handling
297
-
298
- brAInstem must never quietly erase parse failures.
299
-
300
- If an adapter can receive input but canonicalization fails, the system should:
301
- - preserve the raw envelope
302
- - emit or record a parse-failure state
303
- - increment parse/decode error counters
304
- - allow operators/builders to inspect representative failures
305
-
306
- ### Why this matters
307
- A malformed payload can still be operationally meaningful.
308
- Also, if adapters or parsers silently drop bad inputs, trust dies.
309
-
310
- ### Design rule
311
- Bad parse is a first-class ingest outcome, not an invisible discard path.
312
-
313
- ---
314
-
315
- ## 8. Normalization responsibilities
316
-
317
- The parser/canonicalizer layer, not the adapter, should own the canonical transformation rules where possible.
318
-
319
- Normalization responsibilities include:
320
- - timestamp parsing
321
- - host/service extraction
322
- - volatility stripping
323
- - field normalization
324
- - message cleanup
325
- - kind classification
326
- - preparation for fingerprinting
327
-
328
- Adapters may do source-local preprocessing when unavoidable, but the canonicalization logic should remain centralized enough that the system has one real opinion about event shape.
329
-
330
- ---
331
-
332
- ## 9. Adapter boundaries
333
-
334
- To prevent connector chaos, adapters should obey these rules.
335
-
336
- ### Adapter may:
337
- - receive source input
338
- - preserve provenance
339
- - perform source-local validation
340
- - map obvious source metadata into envelope fields
341
- - pass through source-specific metadata needed later
342
-
343
- ### Adapter should avoid:
344
- - inventing bespoke downstream fields that only one adapter knows about
345
- - performing discovery logic
346
- - performing long-term suppression policy
347
- - making promotion decisions
348
- - reshaping canonical semantics without going through the canonicalization contract
349
-
350
- ### Strong anti-pattern
351
- "This source is special, so we built a one-off downstream path just for it."
352
-
353
- That is how architecture rots.
13
+ ## 1. Current runtime intake scope
354
14
 
355
- ---
15
+ The implemented foundation includes:
16
+ - `syslog` adapter + source driver
17
+ - `file` adapter + source driver
356
18
 
357
- ## 10. Early recommended source support strategy
19
+ Both are line-oriented, envelope-first sources. Everything else remains in the roadmap.
358
20
 
359
- To keep the product honest and focused, new adapters should be added in this order:
21
+ The UDP listener in `brainstem.listener` is built on the same `syslog` source driver used by API ingestion.
360
22
 
361
- ### First
362
- - file/log ingestion
363
- - syslog-like ingestion
364
- - generic HTTP/webhook ingestion
23
+ ## 2. Why this adapter layer exists
365
24
 
366
- ### Then
367
- - LogicMonitor
368
- - other monitoring/alert sources with strong MSP relevance
25
+ The adapter layer contains source-specific parsing and provenance capture so downstream discovery stays source-agnostic.
369
26
 
370
- ### Later
371
- - richer vendor connectors
372
- - queue/stream integrations
373
- - platform-specific event systems
27
+ It should remain narrow:
28
+ - parse raw input into a `RawInputEnvelope`
29
+ - preserve enough source context for replay and forensics
30
+ - avoid discovery/scoring/promotion policy in adapter code
374
31
 
375
- This order keeps the architecture universal without pretending infinite source breadth on day one.
32
+ ## 3. Source-driver contract (runtime code)
376
33
 
377
- ---
34
+ `source_drivers.py` registers drivers by `source_type`.
378
35
 
379
- ## 11. Attention and adapters
36
+ Current contract:
37
+ - `source_type` string
38
+ - `parse_payload(payload, tenant_id, source_path="", on_parse_error=None) -> list[RawInputEnvelope]`
380
39
 
381
- Adapters do not assign final operator attention.
40
+ Implemented drivers:
41
+ - `file`
42
+ - `syslog`
382
43
 
383
- However, adapters may contribute source metadata that attention scoring later uses, such as:
384
- - source reliability/trust
385
- - source criticality
386
- - source class
387
- - environment/tenant tags
388
- - transport characteristics
44
+ Runtime behavior:
45
+ - returns zero or more envelopes
46
+ - may call `on_parse_error` callback on parse failure
47
+ - should not swallow parse exceptions silently
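
The source-driver contract above can be illustrated with a minimal sketch. Only `source_type`, the `parse_payload(...)` signature, `RawInputEnvelope`, and the `raw_line` metadata key come from the document. The `FileDriverSketch` class, the `register_driver`/`get_driver` helpers, the callback signature, and the rule that whitespace-only lines count as parse failures are assumptions made for illustration, not the package's actual code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional


@dataclass
class RawInputEnvelope:
    """Hypothetical minimal envelope; the real class carries more fields."""
    tenant_id: str
    source_type: str
    timestamp: str
    message_raw: str
    source_path: str = ""
    metadata: dict = field(default_factory=dict)


# Assumed callback shape: receives the offending raw line and the exception.
ParseErrorCallback = Callable[[str, Exception], None]


class FileDriverSketch:
    """Line-oriented driver: one envelope per parseable line."""
    source_type = "file"

    def parse_payload(self, payload: str, tenant_id: str, source_path: str = "",
                      on_parse_error: Optional[ParseErrorCallback] = None
                      ) -> List[RawInputEnvelope]:
        envelopes: List[RawInputEnvelope] = []
        for line in payload.splitlines():
            try:
                if not line.strip():
                    # Assumed rule for this sketch only: blank lines fail to parse.
                    raise ValueError("empty line")
                envelopes.append(RawInputEnvelope(
                    tenant_id=tenant_id,
                    source_type=self.source_type,
                    timestamp=datetime.now(timezone.utc).isoformat(),
                    message_raw=line,
                    source_path=source_path,
                    metadata={"raw_line": line},
                ))
            except ValueError as exc:
                if on_parse_error is None:
                    raise  # no callback supplied: surface the failure, never swallow it
                on_parse_error(line, exc)
        return envelopes


# Hypothetical registry keyed by source_type.
_DRIVERS: Dict[str, FileDriverSketch] = {}


def register_driver(driver) -> None:
    _DRIVERS[driver.source_type] = driver


def get_driver(source_type: str):
    if source_type not in _DRIVERS:
        raise KeyError(f"no driver registered for source_type {source_type!r}")
    return _DRIVERS[source_type]


register_driver(FileDriverSketch())
```

In this sketch a caller that passes `on_parse_error` gets every good line back as an envelope and every bad line reported through the callback; a caller that passes nothing sees the exception directly.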
389
48
 
390
- This distinction matters:
391
- - adapters provide evidence and provenance
392
- - the scoring/discovery apparatus decides attention
49
+ ## 4. Raw input envelope contract
393
50
 
394
- ---
51
+ Adapters are required to return:
52
+ - `tenant_id` (required by ingestion API)
53
+ - `source_type` (`file` or `syslog`)
54
+ - `timestamp` (ISO-ish string; defaults to UTC now when unknown)
55
+ - `message_raw` (non-empty for successful canonicalization)
395
56
 
396
- ## 12. Audit and replay expectations
57
+ Adapters may populate:
58
+ - `source_id`
59
+ - `source_name`
60
+ - `source_path`
61
+ - `host`
62
+ - `service`
63
+ - `severity`
64
+ - `asset_id`
65
+ - `facility`
66
+ - `structured_fields`
67
+ - `correlation_keys`
68
+ - `metadata` (adapter-local audit data; e.g. `raw_line`)
397
69
 
398
- The adapter + raw envelope system should eventually support:
399
- - replay of raw inputs into canonicalization/discovery
400
- - inspection of parse failures
401
- - verification of source attribution
402
- - sampling of suppressed/ignored inputs for trust calibration
70
+ Canonicalization outcomes are not part of this contract; they are recorded by the storage layer.
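
As a rough illustration of the required and optional envelope fields listed above, the following dataclass enforces the four required fields and applies the UTC-now default for an unknown timestamp. The class name `EnvelopeSketch`, the validation in `__post_init__`, and the subset of optional fields shown are assumptions; the actual `RawInputEnvelope` definition may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# The two source types the document lists as implemented.
ALLOWED_SOURCE_TYPES = {"file", "syslog"}


@dataclass
class EnvelopeSketch:
    # Required fields.
    tenant_id: str
    source_type: str
    message_raw: str
    timestamp: str = ""  # filled with UTC now when the source gives none
    # A few of the optional fields; the rest are omitted for brevity.
    source_path: Optional[str] = None
    host: Optional[str] = None
    service: Optional[str] = None
    severity: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.tenant_id:
            raise ValueError("tenant_id is required")
        if self.source_type not in ALLOWED_SOURCE_TYPES:
            raise ValueError(f"unsupported source_type: {self.source_type!r}")
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()
```

Note that an empty `message_raw` is deliberately accepted here: per the list above it only matters for successful canonicalization, which is a later stage.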
403
71
 
404
- Even if replay tooling is not fully mature in `v0.0.1`, the architecture should preserve the possibility.
72
+ ## 5. Canonical event contract
405
73
 
406
- ---
74
+ `canonicalize_raw_input_envelope` currently emits:
75
+ - `tenant_id`
76
+ - `source_type`
77
+ - `timestamp`
78
+ - `message_raw`
79
+ - optional `raw_envelope_id`
80
+ - `host`
81
+ - `service`
82
+ - `severity`
83
+ - `asset_id`
84
+ - `source_path`
85
+ - `facility`
86
+ - `structured_fields`
87
+ - `correlation_keys`
88
+ - `message_normalized`
89
+ - `signature_input`
90
+ - `ingest_metadata` (including `canonicalization_source`, `canonicalized_at`, and raw envelope linkage)
407
91
 
408
- ## 13. What a good adapter contract enables
92
+ There is no separate explicit `source_id`/`source_name` field on canonical events in this milestone.
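
A hedged sketch of what a canonicalization step with this output shape might look like follows. The field names match the list above, but the function name `canonicalize_sketch`, the dict-based envelope, the normalization rule (collapse whitespace, lowercase), the way `signature_input` is composed, and the value stored in `canonicalization_source` are all assumptions; the document does not specify how `canonicalize_raw_input_envelope` derives them.

```python
import re
from datetime import datetime, timezone


def canonicalize_sketch(envelope: dict) -> dict:
    """Turn a raw-envelope dict into a canonical-event dict (illustrative only)."""
    message_raw = envelope.get("message_raw") or ""
    if not message_raw.strip():
        # Successful canonicalization requires a non-empty message_raw.
        raise ValueError("message_raw is empty")

    # Assumed normalization: collapse whitespace runs and lowercase.
    message_normalized = re.sub(r"\s+", " ", message_raw.strip()).lower()
    source_type = envelope["source_type"]
    service = envelope.get("service") or ""

    return {
        "tenant_id": envelope["tenant_id"],
        "source_type": source_type,
        "timestamp": envelope["timestamp"],
        "message_raw": message_raw,
        "raw_envelope_id": envelope.get("raw_envelope_id"),
        "host": envelope.get("host"),
        "service": envelope.get("service"),
        "severity": envelope.get("severity"),
        "message_normalized": message_normalized,
        # Assumed composition of the fingerprint input.
        "signature_input": f"{source_type}|{service}|{message_normalized}",
        "ingest_metadata": {
            "canonicalization_source": source_type,  # assumed value
            "canonicalized_at": datetime.now(timezone.utc).isoformat(),
            "raw_envelope_id": envelope.get("raw_envelope_id"),
        },
    }
```

The point of the sketch is the shape: the original message is preserved untouched in `message_raw`, while grouping and fingerprinting work from the derived fields.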
409
93
 
410
- If this contract is followed, brAInstem can:
411
- - expand source breadth over time without discovery-layer chaos
412
- - remain source-agnostic in the core pipeline
413
- - preserve trust via provenance and replayability
414
- - maintain one real stream of operational consciousness
94
+ ## 6. Parse failure handling
415
95
 
416
- If this contract is ignored, brAInstem becomes:
417
- - connector soup
418
- - parsing exceptions everywhere
419
- - brittle discovery logic
420
- - untrustworthy ingestion
96
+ When canonicalization fails:
97
+ - intake row is still tracked as `parse_failed`
98
+ - the raw envelope remains queryable in storage
99
+ - parsing failure reason is captured
100
+ - later replay is possible via `/replay/raw` (DB-backed replay path)
421
101
 
422
- That must be avoided.
102
+ This is an explicit trust boundary requirement, not a silent discard path.
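
The failure-handling behavior described above can be sketched with an in-memory stand-in for the storage layer. Only the `parse_failed` status and the idea of replaying stored raw envelopes come from the document; the `IntakeStoreSketch` class, the other status value, the row layout, and the `replay_failed` method are hypothetical and do not reflect the real schema or the `/replay/raw` endpoint's implementation.

```python
from typing import Callable, Dict, List


class IntakeStoreSketch:
    """In-memory stand-in for the DB-backed intake table."""

    def __init__(self) -> None:
        self.rows: List[Dict] = []

    def ingest(self, envelope: dict, canonicalize: Callable[[dict], dict]) -> Dict:
        # The raw envelope is stored before canonicalization is attempted,
        # so a failure can never erase it.
        row = {"envelope": envelope, "status": "received",
               "failure_reason": None, "event": None}
        self.rows.append(row)
        self._attempt(row, canonicalize)
        return row

    def _attempt(self, row: Dict, canonicalize: Callable[[dict], dict]) -> None:
        try:
            row["event"] = canonicalize(row["envelope"])
            row["status"] = "canonicalized"  # assumed status name
            row["failure_reason"] = None
        except ValueError as exc:
            row["status"] = "parse_failed"
            row["failure_reason"] = str(exc)

    def replay_failed(self, canonicalize: Callable[[dict], dict]) -> List[Dict]:
        """Re-run canonicalization over every stored parse_failed row."""
        retried = [r for r in self.rows if r["status"] == "parse_failed"]
        for row in retried:
            self._attempt(row, canonicalize)
        return retried
```

Because the row is written first and only its status changes afterwards, a fixed parser can be pointed at the same stored envelopes later without re-receiving the input.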
423
103
 
424
- ---
104
+ ## 7. Adapter boundaries
425
105
 
426
- ## 14. v0.0.1 implication
106
+ Adapters may:
107
+ - adapt transport/source quirks
108
+ - extract obvious source metadata
427
109
 
428
- For `v0.0.1`, the key requirement is not broad adapter count.
429
- It is that the repo clearly defines:
430
- - the adapter model
431
- - the raw envelope concept
432
- - the canonical event concept
433
- - the relationship between adapters and the attention/discovery pipeline
110
+ Adapters should not:
111
+ - assign attention
112
+ - alter candidate generation policy
113
+ - run promotion logic
114
+ - persist raw envelopes or candidates directly
115
+
116
+ ## 8. Planned intake categories
434
117
 
435
- That is enough for a truthful first release.
118
+ These remain design targets beyond the current milestone:
119
+ - TCP/TLS syslog transport
120
+ - webhook/API pull sources
121
+ - queue/stream drivers
122
+ - richer vendor-native adapters
123
+
124
+ ## 9. Why this is still the right architecture now
125
+
126
+ Even with two drivers, the architecture can stay universal:
127
+ - all registered drivers produce one `RawInputEnvelope` shape
128
+ - all successful envelopes become canonical events in one stream
129
+ - parse failures remain inspectable and replayable
130
+
131
+ This is the smallest practical intake foundation that matches current runtime implementation.