source_monitor 0.3.0 → 0.3.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. checksums.yaml +4 -4
  2. data/.claude/skills/sm-architecture/SKILL.md +233 -0
  3. data/.claude/skills/sm-architecture/reference/extraction-patterns.md +192 -0
  4. data/.claude/skills/sm-architecture/reference/module-map.md +194 -0
  5. data/.claude/skills/sm-configuration-setting/SKILL.md +264 -0
  6. data/.claude/skills/sm-configuration-setting/reference/settings-catalog.md +248 -0
  7. data/.claude/skills/sm-configuration-setting/reference/settings-pattern.md +297 -0
  8. data/.claude/skills/sm-configure/SKILL.md +153 -0
  9. data/.claude/skills/sm-configure/reference/configuration-reference.md +321 -0
  10. data/.claude/skills/sm-dashboard-widget/SKILL.md +344 -0
  11. data/.claude/skills/sm-dashboard-widget/reference/dashboard-patterns.md +304 -0
  12. data/.claude/skills/sm-domain-model/SKILL.md +188 -0
  13. data/.claude/skills/sm-domain-model/reference/model-graph.md +114 -0
  14. data/.claude/skills/sm-domain-model/reference/table-structure.md +348 -0
  15. data/.claude/skills/sm-engine-migration/SKILL.md +395 -0
  16. data/.claude/skills/sm-engine-migration/reference/migration-conventions.md +255 -0
  17. data/.claude/skills/sm-engine-test/SKILL.md +302 -0
  18. data/.claude/skills/sm-engine-test/reference/test-helpers.md +259 -0
  19. data/.claude/skills/sm-engine-test/reference/test-patterns.md +411 -0
  20. data/.claude/skills/sm-event-handler/SKILL.md +265 -0
  21. data/.claude/skills/sm-event-handler/reference/events-api.md +229 -0
  22. data/.claude/skills/sm-health-rule/SKILL.md +327 -0
  23. data/.claude/skills/sm-health-rule/reference/health-system.md +269 -0
  24. data/.claude/skills/sm-host-setup/SKILL.md +223 -0
  25. data/.claude/skills/sm-host-setup/reference/initializer-template.md +195 -0
  26. data/.claude/skills/sm-host-setup/reference/setup-checklist.md +134 -0
  27. data/.claude/skills/sm-job/SKILL.md +263 -0
  28. data/.claude/skills/sm-job/reference/job-conventions.md +245 -0
  29. data/.claude/skills/sm-model-extension/SKILL.md +287 -0
  30. data/.claude/skills/sm-model-extension/reference/extension-api.md +317 -0
  31. data/.claude/skills/sm-pipeline-stage/SKILL.md +254 -0
  32. data/.claude/skills/sm-pipeline-stage/reference/completion-handlers.md +152 -0
  33. data/.claude/skills/sm-pipeline-stage/reference/entry-processing.md +191 -0
  34. data/.claude/skills/sm-pipeline-stage/reference/feed-fetcher-architecture.md +198 -0
  35. data/.claude/skills/sm-scraper-adapter/SKILL.md +284 -0
  36. data/.claude/skills/sm-scraper-adapter/reference/adapter-contract.md +167 -0
  37. data/.claude/skills/sm-scraper-adapter/reference/example-adapter.md +274 -0
  38. data/.vbw-planning/.notification-log.jsonl +102 -0
  39. data/.vbw-planning/.session-log.jsonl +505 -0
  40. data/AGENTS.md +20 -57
  41. data/CHANGELOG.md +19 -0
  42. data/CLAUDE.md +44 -1
  43. data/CONTRIBUTING.md +5 -5
  44. data/Gemfile.lock +20 -21
  45. data/README.md +18 -5
  46. data/VERSION +1 -0
  47. data/docs/deployment.md +1 -1
  48. data/docs/setup.md +4 -4
  49. data/lib/source_monitor/setup/skills_installer.rb +94 -0
  50. data/lib/source_monitor/setup/workflow.rb +17 -2
  51. data/lib/source_monitor/version.rb +1 -1
  52. data/lib/tasks/source_monitor_setup.rake +58 -0
  53. data/source_monitor.gemspec +1 -0
  54. metadata +39 -1
data/.claude/skills/sm-pipeline-stage/reference/feed-fetcher-architecture.md
@@ -0,0 +1,198 @@
# FeedFetcher Architecture

## Module Structure

The `FeedFetcher` was refactored from a 627-line monolith into a 285-line coordinator with 3 sub-modules. Each sub-module is a plain Ruby class instantiated lazily via accessor methods.

```
FeedFetcher (285 lines) -- coordinator
  |
  +-- AdaptiveInterval (141 lines) -- fetch interval math
  +-- SourceUpdater (200 lines) -- source persistence + fetch logs
  +-- EntryProcessor (89 lines) -- feed entry iteration
```

## FeedFetcher (Coordinator)

**File:** `lib/source_monitor/fetching/feed_fetcher.rb`

Responsibilities:
- Perform HTTP request via Faraday client
- Route response by status code (200, 304, else)
- Parse feed body with Feedjira
- Delegate to sub-modules for processing
- Emit instrumentation events
- Handle and classify errors

### Key Data Structures

```ruby
Result = Struct.new(:status, :feed, :response, :body, :error, :item_processing, :retry_decision)
EntryProcessingResult = Struct.new(:created, :updated, :failed, :items, :errors, :created_items, :updated_items)
ResponseWrapper = Struct.new(:status, :headers, :body)
```

### Request Flow

```
call()
  -> perform_fetch(started_at, payload)
     -> perform_request()                 # Faraday GET with conditional headers
     -> handle_response(response)
          |
          +-- 200  -> handle_success()
          |            -> parse_feed()    # Feedjira.parse
          |            -> entry_processor.process_feed_entries()
          |            -> source_updater.update_source_for_success()
          |            -> source_updater.create_fetch_log()
          |
          +-- 304  -> handle_not_modified()
          |            -> source_updater.update_source_for_not_modified()
          |            -> source_updater.create_fetch_log()
          |
          +-- else -> raise HTTPError
  rescue FetchError -> handle_failure()
     -> source_updater.update_source_for_failure()
     -> source_updater.create_fetch_log()
```

### Conditional Request Headers

The fetcher sends conditional headers when available:
- `If-None-Match` -- uses `source.etag`
- `If-Modified-Since` -- uses `source.last_modified.httpdate`
- Custom headers from `source.custom_headers`

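The header assembly can be sketched as a small pure function. This is a sketch with hypothetical helper naming, assuming only that `source` responds to `etag`, `last_modified`, and `custom_headers` as listed above; the real logic lives inside `FeedFetcher#perform_request`:

```ruby
require "time" # provides Time#httpdate

# Build conditional request headers from a source, mirroring the list above.
def conditional_headers(source)
  headers = {}
  headers["If-None-Match"] = source.etag if source.etag
  headers["If-Modified-Since"] = source.last_modified.httpdate if source.last_modified
  headers.merge(source.custom_headers || {})
end
```

A 304 response to a request built this way means the cached copy is still current, so no entries need processing.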
### Sub-Module Instantiation

Sub-modules are lazily instantiated and cached:

```ruby
def adaptive_interval
  @adaptive_interval ||= AdaptiveInterval.new(source: source, jitter_proc: jitter_proc)
end

def source_updater
  @source_updater ||= SourceUpdater.new(source: source, adaptive_interval: adaptive_interval)
end

def entry_processor
  @entry_processor ||= EntryProcessor.new(source: source)
end
```

### Backward Compatibility

Forwarding methods maintain backward compatibility with existing tests:

```ruby
def process_feed_entries(feed) = entry_processor.process_feed_entries(feed)
def jitter_offset(interval_seconds) = adaptive_interval.jitter_offset(interval_seconds)
# ... etc
```

## AdaptiveInterval Sub-Module

**File:** `lib/source_monitor/fetching/feed_fetcher/adaptive_interval.rb`

Controls dynamic fetch scheduling based on content changes and failures.

### Algorithm

| Condition | Factor | Effect |
|-----------|--------|--------|
| Content changed | `DECREASE_FACTOR` (0.75) | Fetch more often |
| No change | `INCREASE_FACTOR` (1.25) | Fetch less often |
| Failure | `FAILURE_INCREASE_FACTOR` (1.5) | Back off significantly |

### Boundaries

| Constant | Default | Purpose |
|----------|---------|---------|
| `MIN_FETCH_INTERVAL` | 5 minutes | Floor for interval |
| `MAX_FETCH_INTERVAL` | 24 hours | Ceiling for interval |
| `JITTER_PERCENT` | 10% | Random offset to prevent thundering herd |

### Configuration Override

All constants can be overridden via `SourceMonitor.config.fetching`:
- `min_interval_minutes`
- `max_interval_minutes`
- `increase_factor`
- `decrease_factor`
- `failure_increase_factor`
- `jitter_percent`

### Fixed vs Adaptive

When `source.adaptive_fetching_enabled?` is false, the interval uses a simple fixed schedule:

```ruby
fixed_minutes = [source.fetch_interval_minutes.to_i, 1].max
attributes[:next_fetch_at] = Time.current + fixed_minutes.minutes
```

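Putting the factors and boundaries together, the adaptive update can be modeled as a pure function. This is a simplified sketch with the default constants hard-coded and the random jitter passed in as a parameter; the real module reads these values from configuration and draws jitter via `jitter_proc`:

```ruby
MIN_FETCH_INTERVAL = 5 * 60        # seconds (5 minutes)
MAX_FETCH_INTERVAL = 24 * 60 * 60  # seconds (24 hours)
JITTER_PERCENT     = 0.10

# Next fetch interval in seconds. `jitter` is a value in [-1, 1] supplied
# by the caller; the production code randomizes it to spread fetches out.
def next_interval(current, changed:, failed:, jitter: 0.0)
  factor =
    if failed       then 1.5   # FAILURE_INCREASE_FACTOR: back off
    elsif changed   then 0.75  # DECREASE_FACTOR: fetch more often
    else                 1.25  # INCREASE_FACTOR: fetch less often
    end
  base = (current * factor).clamp(MIN_FETCH_INTERVAL, MAX_FETCH_INTERVAL)
  base + base * JITTER_PERCENT * jitter
end
```

Note that the failure factor is checked first: a fetch that failed backs off even if earlier content had been changing.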
## SourceUpdater Sub-Module

**File:** `lib/source_monitor/fetching/feed_fetcher/source_updater.rb`

Handles all source record mutations after a fetch attempt.

### Update Methods

| Method | When Called | Key Updates |
|--------|------------|-------------|
| `update_source_for_success` | HTTP 200 | Clear errors, update etag/last_modified, adaptive interval, reset retry state |
| `update_source_for_not_modified` | HTTP 304 | Clear errors, update etag/last_modified, adaptive interval |
| `update_source_for_failure` | Any error | Increment failure_count, apply retry strategy, adaptive interval with failure flag |

### Fetch Log Creation

Every fetch attempt creates a `FetchLog` record via `create_fetch_log` with:
- Timing (started_at, completed_at, duration_ms)
- HTTP details (status, response headers)
- Item counts (created, updated, failed)
- Error details (class, message, backtrace)
- Feed metadata (parser, signature, item errors)

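The fields above might be assembled like this. This is a hypothetical attribute builder for illustration; the key names are illustrative groupings of the list above, not the actual `FetchLog` schema:

```ruby
# Gather one fetch attempt's outcome into a FetchLog-style attributes hash.
def fetch_log_attributes(started_at:, completed_at:, http_status:, counts:, error: nil)
  {
    started_at: started_at,
    completed_at: completed_at,
    duration_ms: ((completed_at - started_at) * 1000).round,
    http_status: http_status,
    items_created: counts.fetch(:created, 0),
    items_updated: counts.fetch(:updated, 0),
    items_failed: counts.fetch(:failed, 0),
    error_class: error&.class&.name,
    error_message: error&.message
  }
end
```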
### Feed Signature

Content change detection uses a SHA256 digest of the response body:

```ruby
def feed_signature_changed?(feed_signature)
  (source.metadata || {}).fetch("last_feed_signature", nil) != feed_signature
end
```

### Retry Strategy

On failure, `apply_retry_strategy!` delegates to `RetryPolicy`:
- If retrying: set `fetch_retry_attempt` and schedule the retry
- If the circuit opens: set `fetch_circuit_opened_at` and `fetch_circuit_until`
- In both cases, update `next_fetch_at` and `backoff_until` accordingly

## EntryProcessor Sub-Module

**File:** `lib/source_monitor/fetching/feed_fetcher/entry_processor.rb`

Iterates over `feed.entries` and calls `ItemCreator.call` for each entry.

### Processing Loop

```ruby
Array(feed.entries).each do |entry|
  result = ItemCreator.call(source:, entry:)
  Events.run_item_processors(source:, entry:, result:)
  if result.created?
    Events.after_item_created(item: result.item, source:, entry:, result:)
  end
rescue StandardError => error
  # Normalize error, continue processing remaining entries
end
```

Key behaviors:
- Individual entry failures don't stop processing of remaining entries
- Events are dispatched for both item processors and item creation
- Error normalization captures GUID and title for debugging
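The last behavior can be sketched as follows. This is a hypothetical helper shape showing what "normalization" captures; the field names are illustrative, not the engine's exact structure:

```ruby
# Capture enough context from a failing entry to debug it later,
# without aborting the processing loop.
def normalize_entry_error(error, entry)
  {
    error_class: error.class.name,
    message: error.message,
    entry_guid: entry.respond_to?(:entry_id) ? entry.entry_id : nil,
    title: entry.respond_to?(:title) ? entry.title : nil
  }
end
```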
data/.claude/skills/sm-scraper-adapter/SKILL.md
@@ -0,0 +1,284 @@
---
name: sm-scraper-adapter
description: Use when creating custom scraper adapters for SourceMonitor, inheriting from Scrapers::Base, implementing the adapter contract, or registering/unregistering scrapers.
allowed-tools: Read, Write, Edit, Bash, Glob, Grep
---

# sm-scraper-adapter: Custom Scraper Adapters

Build custom content scrapers that integrate with SourceMonitor's scraping pipeline.

## When to Use

- Creating a new scraper adapter for a specific content type or source
- Customizing how content is fetched and parsed
- Understanding the scraper adapter contract
- Registering or swapping scraper adapters in configuration
- Debugging scraper failures

## Architecture Overview

```
SourceMonitor::Scrapers::Base (abstract)
  |
  +-- SourceMonitor::Scrapers::Readability (built-in)
  +-- MyApp::Scrapers::Custom (your adapter)
```

Scrapers are registered in configuration and selected per-source. Each adapter:
1. Receives an `item`, `source`, and merged `settings` hash
2. Performs HTTP fetching and content parsing
3. Returns a `Result` struct with status, HTML, content, and metadata

+
33
+ ## The Adapter Contract
34
+
35
+ ### Base Class: `SourceMonitor::Scrapers::Base`
36
+
37
+ Location: `lib/source_monitor/scrapers/base.rb`
38
+
39
+ All custom scrapers **must** inherit from `SourceMonitor::Scrapers::Base`.
40
+
41
+ ### Required: `#call` Instance Method
42
+
43
+ Must return a `SourceMonitor::Scrapers::Base::Result`:
44
+
45
+ ```ruby
46
+ Result = Struct.new(:status, :html, :content, :metadata, keyword_init: true)
47
+ ```
48
+
49
+ | Field | Type | Description |
50
+ |---|---|---|
51
+ | `status` | Symbol | `:success`, `:partial`, or `:failed` |
52
+ | `html` | String/nil | Raw HTML fetched from the URL |
53
+ | `content` | String/nil | Extracted/cleaned text content |
54
+ | `metadata` | Hash/nil | Diagnostics: headers, timings, URL, error info |
55
+
56
+ ### Class Methods (Optional Overrides)
57
+
58
+ | Method | Default | Description |
59
+ |---|---|---|
60
+ | `self.adapter_name` | Derived from class name | Name used in registry |
61
+ | `self.default_settings` | `{}` | Default settings hash for this adapter |
62
+ | `self.call(item:, source:, settings:, http:)` | Creates instance, calls `#call` | Class-level entry point |
63
+
64
+ ### Protected Accessors
65
+
66
+ Available inside `#call`:
67
+
68
+ | Accessor | Type | Description |
69
+ |---|---|---|
70
+ | `item` | `SourceMonitor::Item` | The item being scraped |
71
+ | `source` | `SourceMonitor::Source` | The owning source |
72
+ | `http` | Module | HTTP client module (`SourceMonitor::HTTP`) |
73
+ | `settings` | HashWithIndifferentAccess | Merged settings (see Settings Merging) |
74
+
75
+ ### Settings Merging
76
+
77
+ Settings are merged in priority order:
78
+ 1. `self.class.default_settings` (adapter defaults)
79
+ 2. `source.scrape_settings` (source-level overrides)
80
+ 3. `settings` parameter (per-invocation overrides)
81
+
82
+ All keys are normalized to strings with indifferent access.
83
+
84
+ ## Creating a Custom Adapter
85
+
86
+ ### Step 1: Create the Adapter Class
87
+
88
+ ```ruby
89
+ # app/scrapers/my_app/scrapers/premium.rb
90
+ module MyApp
91
+ module Scrapers
92
+ class Premium < SourceMonitor::Scrapers::Base
93
+ def self.default_settings
94
+ {
95
+ api_key: nil,
96
+ extract_images: true,
97
+ timeout: 30
98
+ }
99
+ end
100
+
101
+ def call
102
+ url = item.canonical_url.presence || item.url
103
+ return failure("missing_url", "No URL available") unless url.present?
104
+
105
+ response = fetch_content(url)
106
+ return failure("fetch_failed", response[:error]) unless response[:success]
107
+
108
+ content = extract_content(response[:body])
109
+
110
+ Result.new(
111
+ status: :success,
112
+ html: response[:body],
113
+ content: content,
114
+ metadata: {
115
+ url: url,
116
+ http_status: response[:status],
117
+ extraction_method: "premium"
118
+ }
119
+ )
120
+ rescue StandardError => error
121
+ failure(error.class.name, error.message)
122
+ end
123
+
124
+ private
125
+
126
+ def fetch_content(url)
127
+ conn = http.client(
128
+ timeout: settings[:timeout],
129
+ headers: { "Authorization" => "Bearer #{settings[:api_key]}" }
130
+ )
131
+ response = conn.get(url)
132
+ { success: true, body: response.body, status: response.status }
133
+ rescue Faraday::Error => e
134
+ { success: false, error: e.message }
135
+ end
136
+
137
+ def extract_content(html)
138
+ # Your custom extraction logic
139
+ html.gsub(/<[^>]+>/, " ").squeeze(" ").strip
140
+ end
141
+
142
+ def failure(error, message)
143
+ Result.new(
144
+ status: :failed,
145
+ html: nil,
146
+ content: nil,
147
+ metadata: { error: error, message: message }
148
+ )
149
+ end
150
+ end
151
+ end
152
+ end
153
+ ```
154
+
155
+ ### Step 2: Register the Adapter
156
+
157
+ ```ruby
158
+ # config/initializers/source_monitor.rb
159
+ SourceMonitor.configure do |config|
160
+ config.scrapers.register(:premium, "MyApp::Scrapers::Premium")
161
+ end
162
+ ```
163
+
164
+ ### Step 3: Assign to Sources
165
+
166
+ Set the scraper adapter name on individual sources. The source's `scrape_settings` JSON column can hold adapter-specific overrides.
167
+
168
+ ## Built-in Adapter: Readability
169
+
170
+ Location: `lib/source_monitor/scrapers/readability.rb`
171
+
172
+ The built-in Readability adapter:
173
+ 1. Fetches HTML via `HttpFetcher`
174
+ 2. Parses content via `ReadabilityParser`
175
+ 3. Supports CSS selector overrides via settings
176
+
177
+ Default settings structure:
178
+ ```ruby
179
+ {
180
+ http: { headers: {...}, timeout: 15, open_timeout: 5, proxy: nil },
181
+ selectors: { content: nil, title: nil },
182
+ readability: {
183
+ remove_unlikely_candidates: true,
184
+ clean_conditionally: true,
185
+ retry_length: 250,
186
+ min_text_length: 25
187
+ }
188
+ }
189
+ ```
190
+
191
+ ## Registration API
192
+
193
+ ```ruby
194
+ # Register by class
195
+ config.scrapers.register(:custom, MyApp::Scrapers::Custom)
196
+
197
+ # Register by string (lazy constantization)
198
+ config.scrapers.register(:custom, "MyApp::Scrapers::Custom")
199
+
200
+ # Unregister
201
+ config.scrapers.unregister(:custom)
202
+
203
+ # Look up
204
+ adapter_class = config.scrapers.adapter_for(:custom)
205
+
206
+ # Iterate
207
+ config.scrapers.each { |name, klass| puts "#{name}: #{klass}" }
208
+ ```
209
+
210
+ Name validation: must match `/\A[a-z0-9_]+\z/i`, normalized to lowercase.
211
+
212
+ ## Key Source Files
213
+
214
+ | File | Purpose |
215
+ |---|---|
216
+ | `lib/source_monitor/scrapers/base.rb` | Abstract base class and Result struct |
217
+ | `lib/source_monitor/scrapers/readability.rb` | Built-in Readability adapter |
218
+ | `lib/source_monitor/scrapers/fetchers/http_fetcher.rb` | HTTP fetching helper |
219
+ | `lib/source_monitor/scrapers/parsers/readability_parser.rb` | Content parsing |
220
+ | `lib/source_monitor/configuration/scraper_registry.rb` | Registration/lookup |
221
+ | `lib/source_monitor/scraping/item_scraper.rb` | Scraping orchestration |
222
+
223
+ ## References
224
+
225
+ - `reference/adapter-contract.md` -- Detailed interface specification
226
+ - `reference/example-adapter.md` -- Complete working example
227
+ - `lib/source_monitor/scrapers/readability.rb` -- Reference implementation
228
+
229
+ ## Testing
230
+
231
+ ```ruby
232
+ require "test_helper"
233
+
234
+ class PremiumScraperTest < ActiveSupport::TestCase
235
+ setup do
236
+ @source = create_source!
237
+ @item = @source.items.create!(
238
+ title: "Test",
239
+ url: "https://example.com/article",
240
+ external_id: "test-1"
241
+ )
242
+ end
243
+
244
+ test "scrapes content successfully" do
245
+ stub_request(:get, "https://example.com/article")
246
+ .to_return(status: 200, body: "<html><body><p>Content</p></body></html>")
247
+
248
+ result = MyApp::Scrapers::Premium.call(item: @item, source: @source)
249
+
250
+ assert_equal :success, result.status
251
+ assert_includes result.content, "Content"
252
+ assert_equal 200, result.metadata[:http_status]
253
+ end
254
+
255
+ test "handles fetch failure" do
256
+ stub_request(:get, "https://example.com/article")
257
+ .to_return(status: 500, body: "Error")
258
+
259
+ result = MyApp::Scrapers::Premium.call(item: @item, source: @source)
260
+
261
+ assert_equal :failed, result.status
262
+ end
263
+
264
+ test "handles missing URL" do
265
+ @item.update!(url: nil)
266
+ result = MyApp::Scrapers::Premium.call(item: @item, source: @source)
267
+
268
+ assert_equal :failed, result.status
269
+ assert_equal "missing_url", result.metadata[:error]
270
+ end
271
+ end
272
+ ```
273
+
274
+ ## Checklist
275
+
276
+ - [ ] Adapter inherits from `SourceMonitor::Scrapers::Base`
277
+ - [ ] `#call` returns a `Result` struct
278
+ - [ ] `status` is one of `:success`, `:partial`, `:failed`
279
+ - [ ] `metadata` includes `url` and error details on failure
280
+ - [ ] `self.default_settings` defined if adapter has configurable options
281
+ - [ ] Adapter registered in initializer
282
+ - [ ] Exception handling catches `StandardError` in `#call`
283
+ - [ ] Uses `http` accessor for HTTP requests (thread-safe)
284
+ - [ ] Tests cover success, failure, and edge cases
data/.claude/skills/sm-scraper-adapter/reference/adapter-contract.md
@@ -0,0 +1,167 @@
# Scraper Adapter Contract

Detailed specification of the interface required by custom scraper adapters.

Source: `lib/source_monitor/scrapers/base.rb`

## Inheritance Requirement

All scraper adapters **must** inherit from `SourceMonitor::Scrapers::Base`:

```ruby
class MyAdapter < SourceMonitor::Scrapers::Base
  def call
    # implementation
  end
end
```

The `ScraperRegistry` validates this at registration time and raises `ArgumentError` if the adapter does not inherit from `Base`.

## Constructor Signature

```ruby
def initialize(item:, source:, settings: nil, http: SourceMonitor::HTTP)
```

| Parameter | Type | Description |
|---|---|---|
| `item` | `SourceMonitor::Item` | The item to scrape |
| `source` | `SourceMonitor::Source` | The owning source/feed |
| `settings` | Hash/nil | Per-invocation setting overrides |
| `http` | Module | HTTP client module (default: `SourceMonitor::HTTP`) |

The constructor is defined on `Base` -- do not override it. Use `#call` for your logic.

## Required Instance Method: `#call`

Must return a `SourceMonitor::Scrapers::Base::Result`:

```ruby
Result = Struct.new(:status, :html, :content, :metadata, keyword_init: true)
```

### Result Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `status` | Symbol | Yes | `:success`, `:partial`, or `:failed` |
| `html` | String/nil | No | Raw HTML body from the fetch |
| `content` | String/nil | No | Extracted/cleaned text content |
| `metadata` | Hash/nil | No | Diagnostics and additional context |

### Status Values

| Status | Meaning |
|---|---|
| `:success` | Content fully extracted |
| `:partial` | Content extracted but incomplete (e.g., truncated, missing elements) |
| `:failed` | Unable to extract content |

### Metadata Conventions

On success:

```ruby
{
  url: "https://example.com/article",
  http_status: 200,
  content_type: "text/html",
  extraction_strategy: "custom",
  title: "Article Title"
}
```

On failure:

```ruby
{
  error: "fetch_error",           # Error classification
  message: "Connection refused",  # Human-readable message
  url: "https://example.com/article",
  http_status: 500                # If available
}
```

## Optional Class Methods

### `self.adapter_name`

Default: derived from the class name by removing the `Scraper` suffix and underscoring.

```ruby
MyApp::Scrapers::Premium       # => "premium"
MyApp::Scrapers::CustomScraper # => "custom"
```

### `self.default_settings`

Default: `{}`

Return a Hash of adapter-specific default settings. These are merged with source-level and invocation-level overrides.

```ruby
def self.default_settings
  {
    api_key: nil,
    max_retries: 3,
    selectors: { content: "article", title: "h1" }
  }
end
```

### `self.call(item:, source:, settings: nil, http: SourceMonitor::HTTP)`

The default implementation creates a new instance and calls `#call`. It rarely needs to be overridden.

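The `adapter_name` derivation can be approximated in plain Ruby. This is a sketch of the rule stated above, not the engine's exact code (which presumably uses ActiveSupport's `demodulize`/`underscore` helpers):

```ruby
# Approximate the default adapter_name: demodulize, drop the "Scraper"
# suffix, then underscore and downcase.
def derive_adapter_name(class_name)
  base = class_name.split("::").last          # demodulize
  base = base.sub(/Scraper\z/, "")            # drop the suffix
  base.gsub(/([a-z0-9])([A-Z])/, '\1_\2').downcase
end
```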
## Protected Accessors

Available inside `#call`:

| Accessor | Type | Description |
|---|---|---|
| `item` | `SourceMonitor::Item` | The item being scraped |
| `source` | `SourceMonitor::Source` | The owning source |
| `http` | Module | HTTP client module |
| `settings` | HashWithIndifferentAccess | Merged settings (see below) |

## Settings Merge Order

Settings are deep-merged in this priority order (later wins):

```
1. self.class.default_settings (adapter defaults)
2. source.scrape_settings (source-level, from DB JSON column)
3. settings parameter (per-invocation overrides)
```

All keys are normalized to strings with `ActiveSupport::HashWithIndifferentAccess`, so you can access them with either string or symbol keys.

## Thread Safety

Adapters must be stateless and thread-safe:
- A new instance is created per invocation via `self.call`
- Use only instance variables set in the constructor
- Do not use class-level mutable state
- The `http` client is safe to share

## Error Handling

Adapters should:
1. Catch expected errors (network, parsing) and return a `Result` with `:failed` status
2. Let unexpected errors propagate (they will be caught by the scraping pipeline)
3. Never swallow errors silently -- populate `metadata` with error details

```ruby
def call
  # ... scraping logic ...
rescue Faraday::Error => error
  # Expected failure: give it a stable classification
  Result.new(
    status: :failed,
    metadata: { error: "fetch_error", message: error.message }
  )
rescue StandardError => error
  # Safety net for anything unexpected; keep the class name for debugging
  Result.new(
    status: :failed,
    metadata: { error: error.class.name, message: error.message }
  )
end
```