source_monitor 0.3.0 → 0.3.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. checksums.yaml +4 -4
  2. data/.claude/skills/sm-architecture/SKILL.md +233 -0
  3. data/.claude/skills/sm-architecture/reference/extraction-patterns.md +192 -0
  4. data/.claude/skills/sm-architecture/reference/module-map.md +194 -0
  5. data/.claude/skills/sm-configuration-setting/SKILL.md +264 -0
  6. data/.claude/skills/sm-configuration-setting/reference/settings-catalog.md +248 -0
  7. data/.claude/skills/sm-configuration-setting/reference/settings-pattern.md +297 -0
  8. data/.claude/skills/sm-configure/SKILL.md +153 -0
  9. data/.claude/skills/sm-configure/reference/configuration-reference.md +321 -0
  10. data/.claude/skills/sm-dashboard-widget/SKILL.md +344 -0
  11. data/.claude/skills/sm-dashboard-widget/reference/dashboard-patterns.md +304 -0
  12. data/.claude/skills/sm-domain-model/SKILL.md +188 -0
  13. data/.claude/skills/sm-domain-model/reference/model-graph.md +114 -0
  14. data/.claude/skills/sm-domain-model/reference/table-structure.md +348 -0
  15. data/.claude/skills/sm-engine-migration/SKILL.md +395 -0
  16. data/.claude/skills/sm-engine-migration/reference/migration-conventions.md +255 -0
  17. data/.claude/skills/sm-engine-test/SKILL.md +302 -0
  18. data/.claude/skills/sm-engine-test/reference/test-helpers.md +259 -0
  19. data/.claude/skills/sm-engine-test/reference/test-patterns.md +411 -0
  20. data/.claude/skills/sm-event-handler/SKILL.md +265 -0
  21. data/.claude/skills/sm-event-handler/reference/events-api.md +229 -0
  22. data/.claude/skills/sm-health-rule/SKILL.md +327 -0
  23. data/.claude/skills/sm-health-rule/reference/health-system.md +269 -0
  24. data/.claude/skills/sm-host-setup/SKILL.md +223 -0
  25. data/.claude/skills/sm-host-setup/reference/initializer-template.md +195 -0
  26. data/.claude/skills/sm-host-setup/reference/setup-checklist.md +134 -0
  27. data/.claude/skills/sm-job/SKILL.md +263 -0
  28. data/.claude/skills/sm-job/reference/job-conventions.md +245 -0
  29. data/.claude/skills/sm-model-extension/SKILL.md +287 -0
  30. data/.claude/skills/sm-model-extension/reference/extension-api.md +317 -0
  31. data/.claude/skills/sm-pipeline-stage/SKILL.md +254 -0
  32. data/.claude/skills/sm-pipeline-stage/reference/completion-handlers.md +152 -0
  33. data/.claude/skills/sm-pipeline-stage/reference/entry-processing.md +191 -0
  34. data/.claude/skills/sm-pipeline-stage/reference/feed-fetcher-architecture.md +198 -0
  35. data/.claude/skills/sm-scraper-adapter/SKILL.md +284 -0
  36. data/.claude/skills/sm-scraper-adapter/reference/adapter-contract.md +167 -0
  37. data/.claude/skills/sm-scraper-adapter/reference/example-adapter.md +274 -0
  38. data/.vbw-planning/.notification-log.jsonl +102 -0
  39. data/.vbw-planning/.session-log.jsonl +505 -0
  40. data/AGENTS.md +20 -57
  41. data/CHANGELOG.md +19 -0
  42. data/CLAUDE.md +44 -1
  43. data/CONTRIBUTING.md +5 -5
  44. data/Gemfile.lock +20 -21
  45. data/README.md +18 -5
  46. data/VERSION +1 -0
  47. data/docs/deployment.md +1 -1
  48. data/docs/setup.md +4 -4
  49. data/lib/source_monitor/setup/skills_installer.rb +94 -0
  50. data/lib/source_monitor/setup/workflow.rb +17 -2
  51. data/lib/source_monitor/version.rb +1 -1
  52. data/lib/tasks/source_monitor_setup.rake +58 -0
  53. data/source_monitor.gemspec +1 -0
  54. metadata +39 -1
data/.claude/skills/sm-pipeline-stage/SKILL.md @@ -0,0 +1,254 @@
---
name: sm-pipeline-stage
description: How to add or modify fetch and scrape pipeline stages in SourceMonitor. Use when working on FeedFetcher, EntryProcessor, ItemCreator, completion handlers, or adding new processing steps to the feed ingestion pipeline.
allowed-tools: Read, Write, Edit, Bash, Glob, Grep
---

# SourceMonitor Pipeline Stage Development

## Overview

The SourceMonitor fetch pipeline transforms RSS/Atom/JSON feeds into persisted `Item` records. The pipeline has two main phases: **fetching** (HTTP + parsing) and **item processing** (entry parsing + content extraction + persistence).

## Pipeline Architecture

```
FetchRunner (orchestrator)
 |
 +-- AdvisoryLock (PG advisory lock per source)
 |
 +-- FeedFetcher (HTTP fetch + parse + process)
 |    |
 |    +-- AdaptiveInterval (next_fetch_at calculation)
 |    +-- SourceUpdater (source record updates + fetch logs)
 |    +-- EntryProcessor (iterates feed entries)
 |         |
 |         +-- ItemCreator (per-entry)
 |              |
 |              +-- EntryParser (attribute extraction)
 |              |    +-- MediaExtraction (enclosures, thumbnails)
 |              |
 |              +-- ContentExtractor (readability processing)
 |
 +-- Completion Handlers (post-fetch)
      +-- RetentionHandler (prune old items)
      +-- FollowUpHandler (enqueue scrape jobs)
      +-- EventPublisher (dispatch callbacks)
```

## Key Files

| File | Purpose | Lines |
|------|---------|-------|
| `lib/source_monitor/fetching/fetch_runner.rb` | Orchestrator: lock, fetch, completion handlers | 142 |
| `lib/source_monitor/fetching/feed_fetcher.rb` | HTTP request, response routing, error handling | 285 |
| `lib/source_monitor/fetching/feed_fetcher/adaptive_interval.rb` | Dynamic fetch interval calculation | 141 |
| `lib/source_monitor/fetching/feed_fetcher/source_updater.rb` | Persists source state + creates fetch logs | 200 |
| `lib/source_monitor/fetching/feed_fetcher/entry_processor.rb` | Iterates feed entries, calls ItemCreator | 89 |
| `lib/source_monitor/fetching/completion/retention_handler.rb` | Post-fetch item retention pruning | 30 |
| `lib/source_monitor/fetching/completion/follow_up_handler.rb` | Enqueues scrape jobs for new items | 37 |
| `lib/source_monitor/fetching/completion/event_publisher.rb` | Dispatches `after_fetch_completed` event | 22 |
| `lib/source_monitor/fetching/retry_policy.rb` | Per-error-type retry/circuit-breaker decisions | 85 |
| `lib/source_monitor/fetching/advisory_lock.rb` | PG advisory lock wrapper | 54 |
| `lib/source_monitor/items/item_creator.rb` | Find-or-create items by GUID/fingerprint | 174 |
| `lib/source_monitor/items/item_creator/entry_parser.rb` | Extracts all attributes from feed entries | 294 |
| `lib/source_monitor/items/item_creator/content_extractor.rb` | Readability-based content processing | 113 |
| `lib/source_monitor/items/item_creator/entry_parser/media_extraction.rb` | Enclosures, thumbnails, media content | 96 |

## Adding a New Pipeline Stage

### Option 1: Add a Completion Handler

Completion handlers run after every fetch, inside the `FetchRunner`. Best for cross-cutting post-fetch logic.

**Step 1:** Create the handler class:

```ruby
# lib/source_monitor/fetching/completion/my_handler.rb
# frozen_string_literal: true

module SourceMonitor
  module Fetching
    module Completion
      class MyHandler
        def initialize(**deps)
          # Accept dependencies for testability
        end

        def call(source:, result:)
          return unless should_run?(source:, result:)
          # Your logic here
        end

        private

        def should_run?(source:, result:)
          result&.status == :fetched
        end
      end
    end
  end
end
```

**Step 2:** Wire it into `FetchRunner#initialize`:

```ruby
# In FetchRunner#initialize, add a parameter:
def initialize(source:, ..., my_handler: nil)
  @my_handler = my_handler || Completion::MyHandler.new
end
```

**Step 3:** Call it in `FetchRunner#run` (inside the lock block):

```ruby
def run
  result = nil # assigned inside the block, used after it
  lock.with_lock do
    mark_fetching!
    result = fetcher_class.new(source: source).call
    retention_handler.call(source:, result:)
    follow_up_handler.call(source:, result:)
    my_handler.call(source:, result:) # <-- Add here
    schedule_retry_if_needed(result)
    mark_complete!(result)
  end
  event_publisher.call(source:, result:)
  result
end
```

### Option 2: Add an Entry Processor Hook

Use `SourceMonitor::Events.run_item_processors` to add per-item processing without modifying the pipeline core.

```ruby
# In an initializer or engine setup:
SourceMonitor.configure do |config|
  config.events.on_item_processed do |source:, entry:, result:|
    # Custom per-item logic
  end
end
```

### Option 3: Add an EntryParser Extension

To extract new fields from feed entries, extend `EntryParser`:

```ruby
# Add a new extract method to EntryParser
def extract_my_field
  return unless entry.respond_to?(:my_field)
  string_or_nil(entry.my_field)
end
```

Then add it to the `parse` method's return hash.

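As a self-contained sketch of where the new method plugs in (a trimmed-down, illustrative `EntryParser`, not the gem's actual class; `my_field` is a made-up field):

```ruby
require "ostruct"

# Illustrative, trimmed-down EntryParser showing where a new field lands.
class EntryParser
  def initialize(entry)
    @entry = entry
  end

  def parse
    {
      title: string_or_nil(entry.title),
      my_field: extract_my_field # <-- new field added to the return hash
    }
  end

  private

  attr_reader :entry

  def extract_my_field
    return unless entry.respond_to?(:my_field)
    string_or_nil(entry.my_field)
  end

  def string_or_nil(value)
    text = value.to_s.strip
    text.empty? ? nil : text
  end
end

entry = OpenStruct.new(title: "Hello", my_field: "  widget  ")
EntryParser.new(entry).parse
# => { title: "Hello", my_field: "widget" }
```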
### Option 4: Add a New Retention Strategy

```ruby
# lib/source_monitor/items/retention_strategies/archive.rb
module SourceMonitor
  module Items
    module RetentionStrategies
      class Archive
        def initialize(source:)
          @source = source
        end

        def apply(batch:, now: Time.current)
          # Your archival logic
          count = 0
          batch.each do |item|
            item.update!(archived_at: now)
            count += 1
          end
          count
        end

        private

        attr_reader :source
      end
    end
  end
end
```

Register it in `RetentionPruner::STRATEGY_CLASSES`.

## Data Flow Details

See `reference/` for detailed documentation:
- `reference/feed-fetcher-architecture.md` -- FeedFetcher module structure
- `reference/completion-handlers.md` -- Completion handler patterns
- `reference/entry-processing.md` -- Entry processing pipeline

## Error Handling

The pipeline uses a typed error hierarchy rooted at `FetchError`:

| Error Class | Code | Trigger |
|-------------|------|---------|
| `TimeoutError` | `timeout` | Request timeout |
| `ConnectionError` | `connection` | Connection/SSL failure |
| `HTTPError` | `http_error` | Non-200/304 HTTP status |
| `ParsingError` | `parsing` | Feedjira parse failure |
| `UnexpectedResponseError` | `unexpected_response` | Any other StandardError |

Each error type maps to a `RetryPolicy` with configurable attempts, wait times, and circuit-breaker thresholds.

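A minimal sketch of what such a hierarchy can look like (class names and codes come from the table above; the `code` accessor shape is an assumption, not the gem's actual implementation):

```ruby
module SourceMonitor
  # Base class; each subclass carries a stable error code for logging/retry lookup.
  class FetchError < StandardError
    def self.code = :fetch_error
    def code = self.class.code
  end

  class TimeoutError < FetchError
    def self.code = :timeout
  end

  class ConnectionError < FetchError
    def self.code = :connection
  end

  class HTTPError < FetchError
    def self.code = :http_error
  end

  class ParsingError < FetchError
    def self.code = :parsing
  end

  class UnexpectedResponseError < FetchError
    def self.code = :unexpected_response
  end
end

# A retry policy lookup can then key off the code:
SourceMonitor::TimeoutError.new("read timed out").code # => :timeout
```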
## Result Types

**FeedFetcher::Result** -- returned from `FeedFetcher#call`:
- `status` -- `:fetched`, `:not_modified`, or `:failed`
- `feed` -- parsed Feedjira feed object
- `response` -- HTTP response
- `body` -- raw response body
- `error` -- FetchError (on failure)
- `item_processing` -- EntryProcessingResult
- `retry_decision` -- RetryPolicy::Decision

**ItemCreator::Result** -- returned from `ItemCreator.call`:
- `item` -- the Item record
- `status` -- `:created` or `:updated`
- `matched_by` -- `:guid` or `:fingerprint` (for updates)

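A sketch of consuming the fetch result (the `Struct` stand-ins mirror the fields listed above; the real classes live in the gem):

```ruby
# Stand-ins mirroring the documented result fields.
FetchResult = Struct.new(:status, :feed, :response, :body, :error,
                         :item_processing, :retry_decision, keyword_init: true)
Processing  = Struct.new(:created, :updated, :failed, keyword_init: true)

def summarize(result)
  case result.status
  when :fetched
    p = result.item_processing
    "fetched: #{p.created} created, #{p.updated} updated, #{p.failed} failed"
  when :not_modified
    "not modified (HTTP 304)"
  when :failed
    "failed: #{result.error}"
  end
end

ok = FetchResult.new(status: :fetched,
                     item_processing: Processing.new(created: 3, updated: 1, failed: 0))
summarize(ok) # => "fetched: 3 created, 1 updated, 0 failed"
```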
## Testing

- Test helpers: `create_source!`, `with_inline_jobs`
- WebMock blocks all external HTTP; stub responses manually
- Use `PARALLEL_WORKERS=1` for single test files
- Inject dependencies (client, lock_factory) for isolation

```ruby
test "processes new feed entries" do
  source = create_source!(feed_url: "https://example.com/feed.xml")
  stub_request(:get, source.feed_url).to_return(
    status: 200,
    body: File.read("test/fixtures/files/sample_feed.xml")
  )

  result = SourceMonitor::Fetching::FeedFetcher.new(source: source).call

  assert_equal :fetched, result.status
  assert result.item_processing.created.positive?
end
```

## Checklist

- [ ] New stage follows the dependency-injection pattern (accepts collaborators in `initialize`)
- [ ] Stage has a `call(source:, result:)` interface (for completion handlers)
- [ ] Error handling returns gracefully (don't crash the pipeline)
- [ ] Instrumentation payload updated if the stage adds metrics
- [ ] Tests cover success, failure, and skip conditions
- [ ] No N+1 queries (use `includes`/`preload`)
- [ ] Documented in this skill's reference files

## References

- `lib/source_monitor/fetching/` -- All fetching pipeline code
- `lib/source_monitor/items/` -- Item creation and retention
- `test/lib/source_monitor/fetching/` -- Fetching tests
- `test/lib/source_monitor/items/` -- Item processing tests
data/.claude/skills/sm-pipeline-stage/reference/completion-handlers.md @@ -0,0 +1,152 @@
# Completion Handlers

## Overview

Completion handlers are post-fetch processing steps managed by `FetchRunner`. They execute inside the advisory lock (except `EventPublisher`, which runs after the lock is released).

## Execution Order

```
lock.with_lock do
  mark_fetching!
  result = fetcher.call
  1. RetentionHandler -- prune old items
  2. FollowUpHandler  -- enqueue scrape jobs for new items
  3. schedule_retry   -- re-enqueue on transient failure
  mark_complete!
end
4. EventPublisher -- dispatch after_fetch_completed callback (outside lock)
```

## RetentionHandler

**File:** `lib/source_monitor/fetching/completion/retention_handler.rb`

Applies item retention pruning after every fetch.

```ruby
class RetentionHandler
  def initialize(pruner: SourceMonitor::Items::RetentionPruner)
    @pruner = pruner
  end

  def call(source:, result:)
    @pruner.call(source:, strategy: SourceMonitor.config.retention.strategy)
  end
end
```

- Delegates to `RetentionPruner`, which supports age-based and count-based limits
- Rescues errors gracefully (logs and returns nil)
- Strategy comes from global config (`:destroy` or `:soft_delete`)

### RetentionPruner Details

Two pruning modes:
1. **Age-based** (`items_retention_days`): removes items older than N days (uses `COALESCE(published_at, created_at)`)
2. **Count-based** (`max_items`): keeps only the N most recent items

Strategies:
- `Destroy` -- calls `item.destroy!` per record
- `SoftDelete` -- sets the `deleted_at` timestamp and adjusts the `items_count` counter

Configuration priority: source-level setting > global config.

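The two modes can be sketched in plain Ruby (illustrative only -- the real pruner operates on ActiveRecord relations, not arrays):

```ruby
# Each item is a hash with :published_at / :created_at timestamps.
def prune(items, retention_days: nil, max_items: nil, now: Time.now)
  # Most recent first, using COALESCE(published_at, created_at) semantics.
  keep = items.sort_by { |i| i[:published_at] || i[:created_at] }.reverse

  # Age-based: drop items older than N days.
  if retention_days
    cutoff = now - retention_days * 86_400
    keep = keep.select { |i| (i[:published_at] || i[:created_at]) >= cutoff }
  end

  # Count-based: keep only the N most recent.
  keep = keep.first(max_items) if max_items

  keep
end
```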
## FollowUpHandler

**File:** `lib/source_monitor/fetching/completion/follow_up_handler.rb`

Enqueues scrape jobs for newly created items when auto-scraping is enabled.

```ruby
class FollowUpHandler
  def initialize(enqueuer_class:, job_class:)
    @enqueuer_class = enqueuer_class
    @job_class = job_class
  end

  def call(source:, result:)
    return unless should_enqueue?(source:, result:)
    result.item_processing.created_items.each do |item|
      next unless item.present? && item.scraped_at.nil?
      @enqueuer_class.enqueue(item:, source:, job_class: @job_class, reason: :auto)
    end
  end
end
```

Guard conditions:
- Result status must be `:fetched`
- Source must have `scraping_enabled?` and `auto_scrape?`
- At least one item was created
- Item must not already be scraped (`scraped_at.nil?`)

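Taken together, the guards amount to a single predicate along these lines (the stand-in structs are illustrative; the real objects come from the pipeline):

```ruby
# Stand-ins for the real Source/result objects (illustrative).
Source = Struct.new(:scraping_enabled, :auto_scrape, keyword_init: true) do
  def scraping_enabled? = scraping_enabled
  def auto_scrape?      = auto_scrape
end
FetchOutcome = Struct.new(:status, :item_processing, keyword_init: true)
ItemBatch    = Struct.new(:created_items, keyword_init: true)

def should_enqueue?(source:, result:)
  result&.status == :fetched &&               # only after a successful fetch
    source.scraping_enabled? &&               # scraping must be on for the source
    source.auto_scrape? &&                    # and auto-scrape enabled
    result.item_processing.created_items.any? # with at least one new item
end

src = Source.new(scraping_enabled: true, auto_scrape: true)
batch = FetchOutcome.new(status: :fetched,
                         item_processing: ItemBatch.new(created_items: [:item]))
should_enqueue?(source: src, result: batch) # => true
```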
## EventPublisher

**File:** `lib/source_monitor/fetching/completion/event_publisher.rb`

Dispatches the `after_fetch_completed` callback to all registered listeners.

```ruby
class EventPublisher
  def initialize(dispatcher: SourceMonitor::Events)
    @dispatcher = dispatcher
  end

  def call(source:, result:)
    @dispatcher.after_fetch_completed(source:, result:)
  end
end
```

- Runs **outside** the advisory lock to prevent long-running callbacks from holding the lock
- The health monitoring system (`SourceMonitor::Health`) registers a callback here
- Callbacks are registered via `SourceMonitor.config.events.after_fetch_completed(lambda)`

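A toy dispatcher illustrating the register/dispatch contract (the real dispatcher is `SourceMonitor::Events`; the class and method names below are illustrative, not the gem's API):

```ruby
# Toy dispatcher showing the after_fetch_completed callback contract.
class EventDispatcher
  def initialize
    @after_fetch_completed = []
  end

  # Registration side, mirroring config.events.after_fetch_completed(lambda)
  def after_fetch_completed(callback = nil, &block)
    @after_fetch_completed << (callback || block)
  end

  # Dispatch side, as EventPublisher would invoke it after the lock is released
  def publish_after_fetch_completed(source:, result:)
    @after_fetch_completed.each { |cb| cb.call(source: source, result: result) }
  end
end

events = EventDispatcher.new
events.after_fetch_completed(->(source:, result:) { puts "#{source}: #{result}" })
events.publish_after_fetch_completed(source: "blog", result: :fetched)
# prints "blog: fetched"
```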
## Adding a New Completion Handler

### Pattern

```ruby
# lib/source_monitor/fetching/completion/my_handler.rb
module SourceMonitor
  module Fetching
    module Completion
      class MyHandler
        def initialize(**deps)
          @dependency = deps[:dependency] || DefaultDependency
        end

        def call(source:, result:)
          return unless should_run?(source:, result:)
          # Your logic
        rescue StandardError => error
          Rails.logger.error("[SourceMonitor] MyHandler failed: #{error.message}")
          nil
        end

        private

        attr_reader :dependency

        def should_run?(source:, result:)
          result&.status == :fetched
        end
      end
    end
  end
end
```

### Wiring

1. Add a require in `fetch_runner.rb`
2. Add a parameter to `FetchRunner#initialize` with a default instance
3. Add an `attr_reader` for the handler
4. Call `my_handler.call(source:, result:)` in `#run`
5. Decide placement: inside the lock (for data consistency) or outside it (for non-critical work)

### Testing

```ruby
test "my_handler is called on successful fetch" do
  calls = []
  handler = ->(source:, result:) { calls << [source, result] }

  runner = FetchRunner.new(source: source, my_handler: handler)
  stub_successful_fetch(source)
  runner.run

  assert_equal 1, calls.size
end
```
data/.claude/skills/sm-pipeline-stage/reference/entry-processing.md @@ -0,0 +1,191 @@
# Entry Processing Pipeline

## Overview

The entry processing pipeline transforms raw feed entries into persisted `Item` records. The pipeline was refactored from a 601-line `ItemCreator` monolith into three focused classes.

```
EntryProcessor (89 lines)
  |
  +-- ItemCreator (174 lines) -- find-or-create orchestrator
        |
        +-- EntryParser (294 lines) -- attribute extraction
        |     +-- MediaExtraction (96 lines) -- media-specific extraction
        |
        +-- ContentExtractor (113 lines) -- readability processing
```

## EntryProcessor

**File:** `lib/source_monitor/fetching/feed_fetcher/entry_processor.rb`

Iterates `feed.entries`, calling `ItemCreator.call` for each. Individual entry failures are caught and counted without stopping the batch.

Returns `EntryProcessingResult` with:
- `created` / `updated` / `failed` counts
- `items` -- all processed Item records
- `created_items` / `updated_items` -- separate lists
- `errors` -- normalized error details for failed entries

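The catch-and-count behavior can be sketched like this (plain Ruby; the real processor returns an `EntryProcessingResult` object rather than a hash, and `item_creator` here is a stand-in):

```ruby
# item_creator.call returns { status: :created | :updated, item: ... } or raises.
def process_entries(entries, item_creator:)
  result = { created: 0, updated: 0, failed: 0, items: [], errors: [] }
  entries.each do |entry|
    outcome = item_creator.call(entry)
    result[outcome[:status] == :created ? :created : :updated] += 1
    result[:items] << outcome[:item]
  rescue StandardError => error
    # An individual bad entry is counted, not fatal to the batch.
    result[:failed] += 1
    result[:errors] << { entry: entry, message: error.message }
  end
  result
end
```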
## ItemCreator

**File:** `lib/source_monitor/items/item_creator.rb`

Orchestrates finding or creating an Item record for a single feed entry.

### Deduplication Strategy

Items are matched in priority order:

1. **GUID match** (case-insensitive) -- if the entry has an `entry_id`
2. **Content fingerprint** -- SHA256 of `title + url + content`

```ruby
def existing_item_for(attributes, raw_guid_present:)
  if raw_guid_present
    existing = find_item_by_guid(guid)
    return [existing, :guid] if existing
  end
  if fingerprint.present?
    existing = find_item_by_fingerprint(fingerprint)
    return [existing, :fingerprint] if existing
  end
  [nil, nil]
end
```

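The fingerprint half of the strategy can be sketched as follows (the exact concatenation and separator the gem uses are assumptions; the point is a stable SHA256 over title, URL, and content):

```ruby
require "digest"

def content_fingerprint(title:, url:, content:)
  # Unit-separator join is an illustrative choice, not the gem's exact scheme.
  Digest::SHA256.hexdigest([title, url, content].map(&:to_s).join("\x1F"))
end

a = content_fingerprint(title: "Post", url: "https://example.com/1", content: "<p>hi</p>")
b = content_fingerprint(title: "Post", url: "https://example.com/1", content: "<p>hi</p>")
a == b # => true: identical entries collide, which is the point
```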
### Concurrent Duplicate Handling

When `ActiveRecord::RecordNotUnique` is raised during creation, the code falls back to finding and updating the conflicting record:

```ruby
def create_new_item(attributes, raw_guid_present:)
  new_item = source.items.new
  apply_attributes(new_item, attributes)
  new_item.save!
  Result.new(item: new_item, status: :created)
rescue ActiveRecord::RecordNotUnique
  handle_concurrent_duplicate(attributes, raw_guid_present:)
end
```

### Result Struct

```ruby
Result = Struct.new(:item, :status, :matched_by) do
  def created? = status == :created
  def updated? = status == :updated
end
```

## EntryParser

**File:** `lib/source_monitor/items/item_creator/entry_parser.rb`

Extracts all item attributes from a Feedjira entry object. Handles RSS 2.0, Atom, and JSON Feed formats.

### Extracted Fields

| Field | Method | Notes |
|-------|--------|-------|
| `guid` | `extract_guid` | `entry_id` preferred; falls back to `id` if not the same as the URL |
| `url` | `extract_url` | Tries `url`, `link_nodes` (alternate), `links` |
| `title` | -- | Direct from entry |
| `author` | `extract_author` | Single author string |
| `authors` | `extract_authors` | Aggregates from `rss_authors`, `dc_creators`, `author_nodes`, JSON Feed |
| `summary` | `extract_summary` | Entry summary/description |
| `content` | `extract_content` | Tries `content`, `content_encoded`, `summary` |
| `published_at` | `extract_timestamp` | First of `published`, `updated` |
| `updated_at_source` | `extract_updated_timestamp` | Entry `updated` field |
| `categories` | `extract_categories` | From `categories`, `tags`, JSON Feed tags |
| `tags` | `extract_tags` | Subset of categories |
| `keywords` | `extract_keywords` | From `media_keywords_raw`, `itunes_keywords_raw` |
| `enclosures` | `extract_enclosures` | RSS enclosures, Atom links, JSON attachments |
| `media_thumbnail_url` | `extract_media_thumbnail_url` | Media RSS thumbnails, entry image |
| `media_content` | `extract_media_content` | Media RSS content nodes |
| `language` | `extract_language` | Entry or JSON Feed language |
| `copyright` | `extract_copyright` | Entry or JSON Feed copyright |
| `comments_url` | `extract_comments_url` | RSS comments element |
| `comments_count` | `extract_comments_count` | `slash:comments` or `comments_count` |
| `metadata` | `extract_metadata` | Full entry hash under the `feedjira_entry` key |
| `content_fingerprint` | `generate_fingerprint` | SHA256 of title+url+content |

### Feed Format Detection

```ruby
def json_entry?
  defined?(Feedjira::Parser::JSONFeedItem) && entry.is_a?(Feedjira::Parser::JSONFeedItem)
end

def atom_entry?
  defined?(Feedjira::Parser::AtomEntry) && entry.is_a?(Feedjira::Parser::AtomEntry)
end
```

### Helper Methods

- `string_or_nil(value)` -- strips and returns nil for blank strings
- `sanitize_string_array(values)` -- deduplicates and compacts
- `split_keywords(value)` -- splits on `,` or `;`
- `safe_integer(value)` -- safe Integer conversion
- `normalize_metadata(value)` -- JSON round-trip for a serializable hash

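Plausible implementations of these helpers, for orientation (illustrative; the gem's versions may differ in edge cases):

```ruby
require "json"

def string_or_nil(value)
  text = value.to_s.strip
  text.empty? ? nil : text
end

def sanitize_string_array(values)
  Array(values).map { |v| string_or_nil(v) }.compact.uniq
end

def split_keywords(value)
  value.to_s.split(/[,;]/).map(&:strip).reject(&:empty?)
end

def safe_integer(value)
  Integer(value, exception: false)
end

def normalize_metadata(value)
  JSON.parse(JSON.generate(value)) # round-trip to a plain, serializable hash
end
```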
## ContentExtractor

**File:** `lib/source_monitor/items/item_creator/content_extractor.rb`

Processes HTML content through readability parsing when enabled on the source.

### Processing Flow

```
process_feed_content(raw_content, title:)
  -> should_process_feed_content?(raw_content)
       -> source.feed_content_readability_enabled?
       -> raw_content.present?
       -> html_fragment?(raw_content)
  -> wrap_content_for_readability(raw_content, title:)
       -> builds a full HTML document with the title
  -> ReadabilityParser.new.parse(html:, readability:)
  -> build_feed_content_metadata(result:, raw_content:, processed_content:)
  -> returns [processed_content, metadata]
```

### Guard Conditions

Content processing only runs when:
1. The source has `feed_content_readability_enabled?`
2. Content is present (not blank)
3. Content looks like HTML (`html_fragment?` checks for a `<tag` pattern)

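The HTML check can be as simple as looking for an opening tag (a sketch; the gem's exact pattern is not shown here):

```ruby
def html_fragment?(text)
  # True when the string contains something that looks like an opening tag.
  text.to_s.match?(/<[a-z][^>]*>/i)
end

html_fragment?("<p>Hello</p>")    # => true
html_fragment?("plain text only") # => false
```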
### Metadata

Processing metadata is stored under the `feed_content_processing` key:
- `strategy` -- always "readability"
- `status` -- parser result status
- `applied` -- whether processed content was used
- `changed` -- whether content differs from the raw input
- `readability_text_length` -- extracted text length
- `title` -- extracted title

## MediaExtraction

**File:** `lib/source_monitor/items/item_creator/entry_parser/media_extraction.rb`

Mixed into `EntryParser` to handle media-specific fields.

### Enclosure Sources

| Format | Source | Key Fields |
|--------|--------|------------|
| RSS 2.0 | `enclosure_nodes` | url, type, length |
| Atom | `link_nodes` with `rel="enclosure"` | url, type, length |
| JSON Feed | `json["attachments"]` | url, mime_type, size_in_bytes, duration |

### Media Content

From Media RSS `media_content_nodes`: url, type, medium, height, width, file_size, duration, expression.

### Thumbnails

Priority: `media_thumbnail_nodes` first, then `entry.image`.