source_monitor 0.3.0 → 0.3.2
- checksums.yaml +4 -4
- data/.claude/skills/sm-architecture/SKILL.md +233 -0
- data/.claude/skills/sm-architecture/reference/extraction-patterns.md +192 -0
- data/.claude/skills/sm-architecture/reference/module-map.md +194 -0
- data/.claude/skills/sm-configuration-setting/SKILL.md +264 -0
- data/.claude/skills/sm-configuration-setting/reference/settings-catalog.md +248 -0
- data/.claude/skills/sm-configuration-setting/reference/settings-pattern.md +297 -0
- data/.claude/skills/sm-configure/SKILL.md +153 -0
- data/.claude/skills/sm-configure/reference/configuration-reference.md +321 -0
- data/.claude/skills/sm-dashboard-widget/SKILL.md +344 -0
- data/.claude/skills/sm-dashboard-widget/reference/dashboard-patterns.md +304 -0
- data/.claude/skills/sm-domain-model/SKILL.md +188 -0
- data/.claude/skills/sm-domain-model/reference/model-graph.md +114 -0
- data/.claude/skills/sm-domain-model/reference/table-structure.md +348 -0
- data/.claude/skills/sm-engine-migration/SKILL.md +395 -0
- data/.claude/skills/sm-engine-migration/reference/migration-conventions.md +255 -0
- data/.claude/skills/sm-engine-test/SKILL.md +302 -0
- data/.claude/skills/sm-engine-test/reference/test-helpers.md +259 -0
- data/.claude/skills/sm-engine-test/reference/test-patterns.md +411 -0
- data/.claude/skills/sm-event-handler/SKILL.md +265 -0
- data/.claude/skills/sm-event-handler/reference/events-api.md +229 -0
- data/.claude/skills/sm-health-rule/SKILL.md +327 -0
- data/.claude/skills/sm-health-rule/reference/health-system.md +269 -0
- data/.claude/skills/sm-host-setup/SKILL.md +223 -0
- data/.claude/skills/sm-host-setup/reference/initializer-template.md +195 -0
- data/.claude/skills/sm-host-setup/reference/setup-checklist.md +134 -0
- data/.claude/skills/sm-job/SKILL.md +263 -0
- data/.claude/skills/sm-job/reference/job-conventions.md +245 -0
- data/.claude/skills/sm-model-extension/SKILL.md +287 -0
- data/.claude/skills/sm-model-extension/reference/extension-api.md +317 -0
- data/.claude/skills/sm-pipeline-stage/SKILL.md +254 -0
- data/.claude/skills/sm-pipeline-stage/reference/completion-handlers.md +152 -0
- data/.claude/skills/sm-pipeline-stage/reference/entry-processing.md +191 -0
- data/.claude/skills/sm-pipeline-stage/reference/feed-fetcher-architecture.md +198 -0
- data/.claude/skills/sm-scraper-adapter/SKILL.md +284 -0
- data/.claude/skills/sm-scraper-adapter/reference/adapter-contract.md +167 -0
- data/.claude/skills/sm-scraper-adapter/reference/example-adapter.md +274 -0
- data/.vbw-planning/.notification-log.jsonl +102 -0
- data/.vbw-planning/.session-log.jsonl +505 -0
- data/AGENTS.md +20 -57
- data/CHANGELOG.md +19 -0
- data/CLAUDE.md +44 -1
- data/CONTRIBUTING.md +5 -5
- data/Gemfile.lock +20 -21
- data/README.md +18 -5
- data/VERSION +1 -0
- data/docs/deployment.md +1 -1
- data/docs/setup.md +4 -4
- data/lib/source_monitor/setup/skills_installer.rb +94 -0
- data/lib/source_monitor/setup/workflow.rb +17 -2
- data/lib/source_monitor/version.rb +1 -1
- data/lib/tasks/source_monitor_setup.rake +58 -0
- data/source_monitor.gemspec +1 -0
- metadata +39 -1
+++ b/data/.claude/skills/sm-pipeline-stage/SKILL.md
@@ -0,0 +1,254 @@
---
name: sm-pipeline-stage
description: How to add or modify fetch and scrape pipeline stages in SourceMonitor. Use when working on FeedFetcher, EntryProcessor, ItemCreator, completion handlers, or adding new processing steps to the feed ingestion pipeline.
allowed-tools: Read, Write, Edit, Bash, Glob, Grep
---

# SourceMonitor Pipeline Stage Development

## Overview

The SourceMonitor fetch pipeline transforms RSS/Atom/JSON feeds into persisted `Item` records. The pipeline has two main phases: **fetching** (HTTP + parsing) and **item processing** (entry parsing + content extraction + persistence).

## Pipeline Architecture

```
FetchRunner (orchestrator)
 |
 +-- AdvisoryLock (PG advisory lock per source)
 |
 +-- FeedFetcher (HTTP fetch + parse + process)
 |    |
 |    +-- AdaptiveInterval (next_fetch_at calculation)
 |    +-- SourceUpdater (source record updates + fetch logs)
 |    +-- EntryProcessor (iterates feed entries)
 |         |
 |         +-- ItemCreator (per-entry)
 |              |
 |              +-- EntryParser (attribute extraction)
 |              |    +-- MediaExtraction (enclosures, thumbnails)
 |              |
 |              +-- ContentExtractor (readability processing)
 |
 +-- Completion Handlers (post-fetch)
      +-- RetentionHandler (prune old items)
      +-- FollowUpHandler (enqueue scrape jobs)
      +-- EventPublisher (dispatch callbacks)
```

## Key Files

| File | Purpose | Lines |
|------|---------|-------|
| `lib/source_monitor/fetching/fetch_runner.rb` | Orchestrator: lock, fetch, completion handlers | 142 |
| `lib/source_monitor/fetching/feed_fetcher.rb` | HTTP request, response routing, error handling | 285 |
| `lib/source_monitor/fetching/feed_fetcher/adaptive_interval.rb` | Dynamic fetch interval calculation | 141 |
| `lib/source_monitor/fetching/feed_fetcher/source_updater.rb` | Persists source state + creates fetch logs | 200 |
| `lib/source_monitor/fetching/feed_fetcher/entry_processor.rb` | Iterates feed entries, calls ItemCreator | 89 |
| `lib/source_monitor/fetching/completion/retention_handler.rb` | Post-fetch item retention pruning | 30 |
| `lib/source_monitor/fetching/completion/follow_up_handler.rb` | Enqueues scrape jobs for new items | 37 |
| `lib/source_monitor/fetching/completion/event_publisher.rb` | Dispatches `after_fetch_completed` event | 22 |
| `lib/source_monitor/fetching/retry_policy.rb` | Per-error-type retry/circuit-breaker decisions | 85 |
| `lib/source_monitor/fetching/advisory_lock.rb` | PG advisory lock wrapper | 54 |
| `lib/source_monitor/items/item_creator.rb` | Find-or-create items by GUID/fingerprint | 174 |
| `lib/source_monitor/items/item_creator/entry_parser.rb` | Extracts all attributes from feed entries | 294 |
| `lib/source_monitor/items/item_creator/content_extractor.rb` | Readability-based content processing | 113 |
| `lib/source_monitor/items/item_creator/entry_parser/media_extraction.rb` | Enclosures, thumbnails, media content | 96 |

## Adding a New Pipeline Stage

### Option 1: Add a Completion Handler

Completion handlers run after every fetch, inside the `FetchRunner`. Best for cross-cutting post-fetch logic.

**Step 1:** Create the handler class:

```ruby
# lib/source_monitor/fetching/completion/my_handler.rb
# frozen_string_literal: true

module SourceMonitor
  module Fetching
    module Completion
      class MyHandler
        def initialize(**deps)
          # Accept dependencies for testability
        end

        def call(source:, result:)
          return unless should_run?(source:, result:)
          # Your logic here
        end

        private

        def should_run?(source:, result:)
          result&.status == :fetched
        end
      end
    end
  end
end
```

**Step 2:** Wire it into `FetchRunner#initialize`:

```ruby
# In FetchRunner#initialize, add parameter:
def initialize(source:, ..., my_handler: nil)
  @my_handler = my_handler || Completion::MyHandler.new
end
```

**Step 3:** Call it in `FetchRunner#run` (inside the lock block). Note that `result` is assigned before the block so it stays in scope after the lock is released:

```ruby
def run
  result = nil
  lock.with_lock do
    mark_fetching!
    result = fetcher_class.new(source: source).call
    retention_handler.call(source:, result:)
    follow_up_handler.call(source:, result:)
    my_handler.call(source:, result:) # <-- Add here
    schedule_retry_if_needed(result)
    mark_complete!(result)
  end
  event_publisher.call(source:, result:)
  result
end
```

### Option 2: Add an Entry Processor Hook

Use `SourceMonitor::Events.run_item_processors` to add per-item processing without modifying the pipeline core.

```ruby
# In an initializer or engine setup:
SourceMonitor.configure do |config|
  config.events.on_item_processed do |source:, entry:, result:|
    # Custom per-item logic
  end
end
```

### Option 3: Add an EntryParser Extension

To extract new fields from feed entries, extend `EntryParser`:

```ruby
# Add a new extract method to EntryParser
def extract_my_field
  return unless entry.respond_to?(:my_field)
  string_or_nil(entry.my_field)
end
```

Then add it to the `parse` method's return hash.

### Option 4: Add a New Retention Strategy

```ruby
# lib/source_monitor/items/retention_strategies/archive.rb
module SourceMonitor
  module Items
    module RetentionStrategies
      class Archive
        def initialize(source:)
          @source = source
        end

        def apply(batch:, now: Time.current)
          # Your archival logic
          count = 0
          batch.each do |item|
            item.update!(archived_at: now)
            count += 1
          end
          count
        end

        private

        attr_reader :source
      end
    end
  end
end
```

Register in `RetentionPruner::STRATEGY_CLASSES`.
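The registry lookup that registration enables can be sketched in plain Ruby. This is a hedged illustration, not the engine's `RetentionPruner` code: `ArchiveStrategy` is a stand-in and the items are bare hashes.

```ruby
# Illustrative sketch of a strategy registry like RetentionPruner::STRATEGY_CLASSES.
# ArchiveStrategy is a simplified stand-in; items are plain hashes for demonstration.
require "time"

class ArchiveStrategy
  def initialize(source:)
    @source = source
  end

  # Marks each item in the batch as archived and returns the count,
  # mirroring the Archive#apply contract shown above.
  def apply(batch:, now: Time.now)
    batch.each { |item| item[:archived_at] = now }
    batch.size
  end
end

# Registry mapping strategy names to strategy classes.
STRATEGY_CLASSES = { archive: ArchiveStrategy }.freeze

def apply_retention(strategy_name, source:, batch:)
  # fetch raises KeyError for unregistered strategy names
  STRATEGY_CLASSES.fetch(strategy_name).new(source: source).apply(batch: batch)
end
```
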
## Data Flow Details

See `reference/` for detailed documentation:
- `reference/feed-fetcher-architecture.md` -- FeedFetcher module structure
- `reference/completion-handlers.md` -- Completion handler patterns
- `reference/entry-processing.md` -- Entry processing pipeline

## Error Handling

The pipeline uses a typed error hierarchy rooted at `FetchError`:

| Error Class | Code | Trigger |
|-------------|------|---------|
| `TimeoutError` | `timeout` | Request timeout |
| `ConnectionError` | `connection` | Connection/SSL failure |
| `HTTPError` | `http_error` | Non-200/304 HTTP status |
| `ParsingError` | `parsing` | Feedjira parse failure |
| `UnexpectedResponseError` | `unexpected_response` | Any other StandardError |

Each error type maps to a `RetryPolicy` with configurable attempts, wait times, and circuit-breaker thresholds.
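That mapping can be sketched as a small decision function. The `Decision` struct, the `LIMITS` hash, and the backoff formula below are illustrative assumptions, not the actual `RetryPolicy` API:

```ruby
# Hypothetical sketch of per-error-code retry decisions; not the engine's
# RetryPolicy. Limits and backoff values are made up for illustration.
Decision = Struct.new(:should_retry, :wait_seconds, keyword_init: true)

# Assumed per-code attempt limits keyed by the error codes in the table above.
LIMITS = { timeout: 5, connection: 3, http_error: 2, parsing: 1 }.freeze

def decide(code:, attempt:)
  limit = LIMITS.fetch(code, 0) # unknown codes get no retries
  return Decision.new(should_retry: false, wait_seconds: nil) if attempt >= limit

  # Simple exponential backoff: 30s, 60s, 120s, ...
  Decision.new(should_retry: true, wait_seconds: 30 * (2**attempt))
end
```
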
## Result Types

**FeedFetcher::Result** -- returned from `FeedFetcher#call`:
- `status` -- `:fetched`, `:not_modified`, or `:failed`
- `feed` -- parsed Feedjira feed object
- `response` -- HTTP response
- `body` -- raw response body
- `error` -- FetchError (on failure)
- `item_processing` -- EntryProcessingResult
- `retry_decision` -- RetryPolicy::Decision

**ItemCreator::Result** -- returned from `ItemCreator.call`:
- `item` -- the Item record
- `status` -- `:created` or `:updated`
- `matched_by` -- `:guid` or `:fingerprint` (for updates)
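Consuming a fetch result typically branches on `status`. The struct below is a stand-in mirroring the fields listed above, not the engine's real result class:

```ruby
# Stand-in for FeedFetcher::Result with the subset of fields used here.
FetchResult = Struct.new(:status, :error, :item_processing, keyword_init: true)

# Branches on the three documented status values.
def summarize(result)
  case result.status
  when :fetched      then "ok"
  when :not_modified then "unchanged"
  when :failed       then "failed: #{result.error}"
  end
end
```
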
## Testing

- Test helpers: `create_source!`, `with_inline_jobs`
- WebMock blocks all external HTTP; stub responses manually
- Use `PARALLEL_WORKERS=1` for single test files
- Inject dependencies (client, lock_factory) for isolation

```ruby
test "processes new feed entries" do
  source = create_source!(feed_url: "https://example.com/feed.xml")
  stub_request(:get, source.feed_url).to_return(
    status: 200,
    body: File.read("test/fixtures/files/sample_feed.xml")
  )

  result = SourceMonitor::Fetching::FeedFetcher.new(source: source).call

  assert_equal :fetched, result.status
  assert result.item_processing.created.positive?
end
```

## Checklist

- [ ] New stage follows the dependency injection pattern (accepts collaborators in `initialize`)
- [ ] Stage has a `call(source:, result:)` interface (for completion handlers)
- [ ] Error handling returns gracefully (don't crash the pipeline)
- [ ] Instrumentation payload updated if the stage adds metrics
- [ ] Tests cover success, failure, and skip conditions
- [ ] No N+1 queries (use `includes`/`preload`)
- [ ] Documented in this skill's reference files

## References

- `lib/source_monitor/fetching/` -- All fetching pipeline code
- `lib/source_monitor/items/` -- Item creation and retention
- `test/lib/source_monitor/fetching/` -- Fetching tests
- `test/lib/source_monitor/items/` -- Item processing tests

+++ b/data/.claude/skills/sm-pipeline-stage/reference/completion-handlers.md
@@ -0,0 +1,152 @@
# Completion Handlers

## Overview

Completion handlers are post-fetch processing steps managed by `FetchRunner`. They execute inside the advisory lock (except `EventPublisher`, which runs after the lock is released).

## Execution Order

```
lock.with_lock do
  mark_fetching!
  result = fetcher.call
  1. RetentionHandler -- prune old items
  2. FollowUpHandler  -- enqueue scrape jobs for new items
  3. schedule_retry   -- re-enqueue on transient failure
  mark_complete!
end
4. EventPublisher -- dispatch after_fetch_completed callback (outside lock)
```

## RetentionHandler

**File:** `lib/source_monitor/fetching/completion/retention_handler.rb`

Applies item retention pruning after every fetch.

```ruby
class RetentionHandler
  def initialize(pruner: SourceMonitor::Items::RetentionPruner)
    @pruner = pruner
  end

  def call(source:, result:)
    pruner.call(source:, strategy: SourceMonitor.config.retention.strategy)
  end

  private

  attr_reader :pruner
end
```

- Delegates to `RetentionPruner`, which supports age-based and count-based limits
- Rescues errors gracefully (logs and returns nil)
- Strategy comes from global config (`:destroy` or `:soft_delete`)

### RetentionPruner Details

Two pruning modes:
1. **Age-based** (`items_retention_days`): removes items older than N days (uses `COALESCE(published_at, created_at)`)
2. **Count-based** (`max_items`): keeps only the N most recent items

Strategies:
- `Destroy` -- calls `item.destroy!` per record
- `SoftDelete` -- sets `deleted_at` timestamp, adjusts `items_count` counter

Configuration priority: source-level setting > global config.
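That priority rule amounts to a nil-aware override lookup. A minimal sketch, assuming the setting names from the modes above (the helper itself is hypothetical):

```ruby
# Hypothetical sketch of "source-level setting overrides global config".
# Setting names come from the pruning modes above; defaults are invented.
GLOBAL_RETENTION = { items_retention_days: 90, max_items: nil }.freeze

# Returns the source-level value when present, otherwise the global default.
def effective_setting(key, source_overrides)
  value = source_overrides[key]
  value.nil? ? GLOBAL_RETENTION[key] : value
end
```
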
## FollowUpHandler

**File:** `lib/source_monitor/fetching/completion/follow_up_handler.rb`

Enqueues scrape jobs for newly created items when auto-scraping is enabled.

```ruby
class FollowUpHandler
  def initialize(enqueuer_class:, job_class:)
    @enqueuer_class = enqueuer_class
    @job_class = job_class
  end

  def call(source:, result:)
    return unless should_enqueue?(source:, result:)
    result.item_processing.created_items.each do |item|
      next unless item.present? && item.scraped_at.nil?
      enqueuer_class.enqueue(item:, source:, job_class:, reason: :auto)
    end
  end

  private

  attr_reader :enqueuer_class, :job_class
end
```

Guard conditions:
- Result status must be `:fetched`
- Source must have `scraping_enabled?` and `auto_scrape?`
- At least one item was created
- Item must not already be scraped (`scraped_at.nil?`)
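A plausible `should_enqueue?` implementing the first three guards (the structs are stand-ins; the real method lives in `FollowUpHandler` and may differ in detail):

```ruby
# Stand-ins for the source and result objects, keeping only the fields
# the guard conditions above consult.
FollowUpSource = Struct.new(:scraping_enabled, :auto_scrape, keyword_init: true)
FollowUpResult = Struct.new(:status, :created_items, keyword_init: true)

def should_enqueue?(source:, result:)
  result&.status == :fetched &&
    source.scraping_enabled &&
    source.auto_scrape &&
    result.created_items.any?
end
```

The per-item `scraped_at.nil?` guard stays in the iteration loop, since it is evaluated item by item rather than once per fetch.
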
## EventPublisher

**File:** `lib/source_monitor/fetching/completion/event_publisher.rb`

Dispatches the `after_fetch_completed` callback to all registered listeners.

```ruby
class EventPublisher
  def initialize(dispatcher: SourceMonitor::Events)
    @dispatcher = dispatcher
  end

  def call(source:, result:)
    dispatcher.after_fetch_completed(source:, result:)
  end

  private

  attr_reader :dispatcher
end
```

- Runs **outside** the advisory lock to prevent long-running callbacks from holding the lock
- The health monitoring system (`SourceMonitor::Health`) registers a callback here
- Callbacks are registered via `SourceMonitor.config.events.after_fetch_completed(lambda)`
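The register-then-dispatch mechanic can be sketched with a tiny callback registry. `EventRegistry` is a hypothetical stand-in for `SourceMonitor::Events`, shown only to illustrate the flow:

```ruby
# Hypothetical callback registry illustrating register/dispatch; not the
# actual SourceMonitor::Events implementation.
class EventRegistry
  def initialize
    @callbacks = []
  end

  # Registration: store any callable that accepts source:/result: keywords.
  def after_fetch_completed(callable)
    @callbacks << callable
  end

  # Dispatch: invoke every registered callback, return how many ran.
  def dispatch(source:, result:)
    @callbacks.each { |cb| cb.call(source: source, result: result) }
    @callbacks.size
  end
end
```
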
## Adding a New Completion Handler

### Pattern

```ruby
# lib/source_monitor/fetching/completion/my_handler.rb
module SourceMonitor
  module Fetching
    module Completion
      class MyHandler
        def initialize(**deps)
          @dependency = deps[:dependency] || DefaultDependency
        end

        def call(source:, result:)
          return unless should_run?(source:, result:)
          # Your logic
        rescue StandardError => error
          Rails.logger.error("[SourceMonitor] MyHandler failed: #{error.message}")
          nil
        end

        private

        attr_reader :dependency

        def should_run?(source:, result:)
          result&.status == :fetched
        end
      end
    end
  end
end
```

### Wiring

1. Add require in `fetch_runner.rb`
2. Add parameter to `FetchRunner#initialize` with default instance
3. Add `attr_reader` for the handler
4. Call `my_handler.call(source:, result:)` in `#run`
5. Decide placement: inside the lock (for data consistency) or outside (for non-critical work)

### Testing

```ruby
test "my_handler is called on successful fetch" do
  handler = Minitest::Mock.new
  handler.expect(:call, nil, source: source, result: anything)

  runner = FetchRunner.new(source: source, my_handler: handler)
  stub_successful_fetch(source)
  runner.run

  handler.verify
end
```

+++ b/data/.claude/skills/sm-pipeline-stage/reference/entry-processing.md
@@ -0,0 +1,191 @@
# Entry Processing Pipeline

## Overview

The entry processing pipeline transforms raw feed entries into persisted `Item` records. The pipeline was refactored from a 601-line `ItemCreator` monolith into three focused classes.

```
EntryProcessor (89 lines)
 |
 +-- ItemCreator (174 lines) -- find-or-create orchestrator
      |
      +-- EntryParser (294 lines) -- attribute extraction
      |    +-- MediaExtraction (96 lines) -- media-specific extraction
      |
      +-- ContentExtractor (113 lines) -- readability processing
```

## EntryProcessor

**File:** `lib/source_monitor/fetching/feed_fetcher/entry_processor.rb`

Iterates `feed.entries`, calling `ItemCreator.call` for each. Individual entry failures are caught and counted without stopping the batch.

Returns `EntryProcessingResult` with:
- `created` / `updated` / `failed` counts
- `items` -- all processed Item records
- `created_items` / `updated_items` -- separate lists
- `errors` -- normalized error details for failed entries

## ItemCreator

**File:** `lib/source_monitor/items/item_creator.rb`

Orchestrates finding or creating an Item record for a single feed entry.

### Deduplication Strategy

Items are matched in priority order:

1. **GUID match** (case-insensitive) -- if the entry has an `entry_id`
2. **Content fingerprint** -- SHA256 of `title + url + content`

```ruby
def existing_item_for(attributes, raw_guid_present:)
  if raw_guid_present
    existing = find_item_by_guid(guid)
    return [existing, :guid] if existing
  end
  if fingerprint.present?
    existing = find_item_by_fingerprint(fingerprint)
    return [existing, :fingerprint] if existing
  end
  [nil, nil]
end
```
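The fingerprint half of that match can be sketched directly from the "SHA256 of `title + url + content`" description; the exact normalization and separator used by `EntryParser` may differ:

```ruby
# Illustrative fingerprint generation; the real EntryParser may normalize
# or join the fields differently before hashing.
require "digest"

def content_fingerprint(title:, url:, content:)
  # nil-safe: map each field to a string before joining
  Digest::SHA256.hexdigest([title, url, content].map(&:to_s).join("\n"))
end
```

Identical inputs always yield the same 64-character hex digest, which is what makes it usable as a deduplication key.
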
### Concurrent Duplicate Handling

When `ActiveRecord::RecordNotUnique` is raised during creation, the code falls back to finding and updating the conflicting record:

```ruby
def create_new_item(attributes, raw_guid_present:)
  new_item = source.items.new
  apply_attributes(new_item, attributes)
  new_item.save!
  Result.new(item: new_item, status: :created)
rescue ActiveRecord::RecordNotUnique
  handle_concurrent_duplicate(attributes, raw_guid_present:)
end
```

### Result Struct

```ruby
Result = Struct.new(:item, :status, :matched_by) do
  def created? = status == :created
  def updated? = status == :updated
end
```

## EntryParser

**File:** `lib/source_monitor/items/item_creator/entry_parser.rb`

Extracts all item attributes from a Feedjira entry object. Handles RSS 2.0, Atom, and JSON Feed formats.

### Extracted Fields

| Field | Method | Notes |
|-------|--------|-------|
| `guid` | `extract_guid` | `entry_id` preferred; falls back to `id` if not same as URL |
| `url` | `extract_url` | Tries `url`, `link_nodes` (alternate), `links` |
| `title` | -- | Direct from entry |
| `author` | `extract_author` | Single author string |
| `authors` | `extract_authors` | Aggregates from rss_authors, dc_creators, author_nodes, JSON Feed |
| `summary` | `extract_summary` | Entry summary/description |
| `content` | `extract_content` | Tries `content`, `content_encoded`, `summary` |
| `published_at` | `extract_timestamp` | First of `published`, `updated` |
| `updated_at_source` | `extract_updated_timestamp` | Entry `updated` field |
| `categories` | `extract_categories` | From `categories`, `tags`, JSON Feed tags |
| `tags` | `extract_tags` | Subset of categories |
| `keywords` | `extract_keywords` | From `media_keywords_raw`, `itunes_keywords_raw` |
| `enclosures` | `extract_enclosures` | RSS enclosures, Atom links, JSON attachments |
| `media_thumbnail_url` | `extract_media_thumbnail_url` | Media RSS thumbnails, entry image |
| `media_content` | `extract_media_content` | Media RSS content nodes |
| `language` | `extract_language` | Entry or JSON Feed language |
| `copyright` | `extract_copyright` | Entry or JSON Feed copyright |
| `comments_url` | `extract_comments_url` | RSS comments element |
| `comments_count` | `extract_comments_count` | slash:comments or comments_count |
| `metadata` | `extract_metadata` | Full entry hash under `feedjira_entry` key |
| `content_fingerprint` | `generate_fingerprint` | SHA256 of title+url+content |

### Feed Format Detection

```ruby
def json_entry?
  defined?(Feedjira::Parser::JSONFeedItem) && entry.is_a?(Feedjira::Parser::JSONFeedItem)
end

def atom_entry?
  defined?(Feedjira::Parser::AtomEntry) && entry.is_a?(Feedjira::Parser::AtomEntry)
end
```

### Helper Methods

- `string_or_nil(value)` -- strips and returns nil for blank strings
- `sanitize_string_array(values)` -- deduplicates and compacts
- `split_keywords(value)` -- splits on `,` or `;`
- `safe_integer(value)` -- safe Integer conversion
- `normalize_metadata(value)` -- JSON round-trip for a serializable hash
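Plausible pure-Ruby versions of several of these helpers, for illustration; the real implementations in `EntryParser` may handle more edge cases:

```ruby
# Illustrative helper implementations matching the descriptions above;
# not copied from EntryParser.
require "json"

# Strips whitespace; returns nil for blank strings.
def string_or_nil(value)
  s = value.to_s.strip
  s.empty? ? nil : s
end

# Splits on "," or ";", trimming and dropping empty tokens.
def split_keywords(value)
  value.to_s.split(/[,;]/).map(&:strip).reject(&:empty?)
end

# Safe Integer conversion: nil instead of raising on bad input.
def safe_integer(value)
  Integer(value, exception: false)
end

# JSON round-trip yields a hash with string keys and only serializable values.
def normalize_metadata(value)
  JSON.parse(JSON.generate(value))
end
```
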
## ContentExtractor

**File:** `lib/source_monitor/items/item_creator/content_extractor.rb`

Processes HTML content through readability parsing when enabled on the source.

### Processing Flow

```
process_feed_content(raw_content, title:)
  -> should_process_feed_content?(raw_content)
       -> source.feed_content_readability_enabled?
       -> raw_content.present?
       -> html_fragment?(raw_content)
  -> wrap_content_for_readability(raw_content, title:)
       -> builds full HTML document with title
  -> ReadabilityParser.new.parse(html:, readability:)
  -> build_feed_content_metadata(result:, raw_content:, processed_content:)
  -> returns [processed_content, metadata]
```

### Guard Conditions

Content processing only runs when:
1. Source has `feed_content_readability_enabled?`
2. Content is present (not blank)
3. Content looks like HTML (`html_fragment?` checks for a `<tag` pattern)
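The HTML check can be approximated with a simple regex; this is an assumption about `html_fragment?`, not its actual implementation:

```ruby
# Hypothetical html_fragment? matching the "<tag pattern" description above.
# A bare "<" (as in "a < b") does not count; it must open a tag name.
def html_fragment?(content)
  content.to_s.match?(/<[a-zA-Z][^>]*>/)
end
```
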
### Metadata

Processing metadata is stored under the `feed_content_processing` key:
- `strategy` -- always "readability"
- `status` -- parser result status
- `applied` -- whether processed content was used
- `changed` -- whether content differs from raw
- `readability_text_length` -- extracted text length
- `title` -- extracted title

## MediaExtraction

**File:** `lib/source_monitor/items/item_creator/entry_parser/media_extraction.rb`

Mixed into `EntryParser` to handle media-specific fields.

### Enclosure Sources

| Format | Source | Key Fields |
|--------|--------|------------|
| RSS 2.0 | `enclosure_nodes` | url, type, length |
| Atom | `link_nodes` with `rel="enclosure"` | url, type, length |
| JSON Feed | `json["attachments"]` | url, mime_type, size_in_bytes, duration |

### Media Content

From Media RSS `media_content_nodes`: url, type, medium, height, width, file_size, duration, expression.

### Thumbnails

Priority: `media_thumbnail_nodes` first, then `entry.image`.