source_monitor 0.12.4 → 0.13.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.claude/skills/sm-architecture/reference/module-map.md +1 -0
- data/.claude/skills/sm-upgrade/reference/version-history.md +14 -0
- data/CHANGELOG.md +8 -0
- data/Gemfile.lock +1 -1
- data/README.md +3 -3
- data/VERSION +1 -1
- data/app/assets/builds/source_monitor/application.css +17 -13
- data/docs/setup.md +2 -2
- data/docs/upgrade.md +15 -0
- data/lib/source_monitor/fetching/advisory_lock.rb +27 -0
- data/lib/source_monitor/fetching/feed_fetcher/entry_processor.rb +37 -12
- data/lib/source_monitor/fetching/fetch_runner.rb +23 -1
- data/lib/source_monitor/items/batch_item_creator.rb +86 -0
- data/lib/source_monitor/items/item_creator.rb +39 -10
- data/lib/source_monitor/version.rb +1 -1
- metadata +2 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ba447fce3d49e4605a01154bfbf1a28179ac1833ba1f240add5f1b9adae3ecb3
+  data.tar.gz: 5c90d5148475fd74aa53568df2df60e7143613496df6b985d98acfb5fd84b4c6
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4427aaa63507229534a56998290cc6175a18116bba7881c80ce16ef97c9b2dee9b64a4fd37f4d30cac2155805fa19b53258dbb7831f28581ffba56f1a67d814e
+  data.tar.gz: 1b072b1041d24be68a54f2c3a282101c29129fa0e5b34aff15a1ef07f2665ed494760cf37b7827b19cc399b1332cd720696519b2ab46e0c17df0d1a850e28fd0
data/.claude/skills/sm-architecture/reference/module-map.md
CHANGED
@@ -66,6 +66,7 @@ Complete module tree with each module's responsibility.
 | Module | File | Responsibility |
 |--------|------|----------------|
 | `ItemCreator` | `items/item_creator.rb` | Create or update Item from feed entry |
+| `BatchItemCreator` | `items/batch_item_creator.rb` | Pre-fetch lookup index of existing items for batch entry processing |
 | `ItemCreator::EntryParser` | `items/item_creator/entry_parser.rb` | Parse Feedjira entry into attribute hash |
 | `ItemCreator::ContentExtractor` | `items/item_creator/content_extractor.rb` | Process content through readability parser |
 | `RetentionPruner` | `items/retention_pruner.rb` | Prune items by age/count per source |
data/.claude/skills/sm-upgrade/reference/version-history.md
CHANGED
@@ -2,6 +2,20 @@
 
 Version-specific migration notes for each major/minor version transition. Agents should reference this file when guiding users through multi-version upgrades.
 
+## 0.12.4 to 0.13.0
+
+**Key changes:**
+- Performance: GUID normalization to lowercase on write; plain btree index used instead of LOWER(guid) sequential scans
+- New `AdvisoryLock#acquire!` and `AdvisoryLock#release!` methods alongside existing `with_lock` block API
+- New `BatchItemCreator` class for bulk item lookups; reduces per-fetch DB queries from ~N*2 to 2
+- `ItemCreator` accepts optional `existing_items_index` parameter for batch-mode deduplication
+- `FetchRunner` restructured into 3-phase execution (lock/fetch/write) so DB connections are not held idle during HTTP requests
+- `EntryProcessor` integrates batch index from `BatchItemCreator`
+
+**Action items:**
+1. `bundle update source_monitor`
+2. No migrations, config changes, or breaking changes.
+
 ## 0.12.3 to 0.12.4
 
 **Key changes:**
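The "~N*2 to 2" reduction claimed above is simple arithmetic; a toy counter (illustrative only, not code from the gem) makes the scaling concrete:

```ruby
# Illustrative query-count arithmetic for the "~N*2 to 2" claim:
# the old path issued one GUID SELECT and one fingerprint SELECT per
# entry, while the batched path issues two IN (...) queries total.
def per_entry_queries(entry_count)
  entry_count * 2
end

def batched_queries(_entry_count)
  2
end

per_entry_queries(50)  # a 50-entry feed used to cost ~100 queries
batched_queries(50)    # the batched path costs 2, regardless of size
```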
data/CHANGELOG.md
CHANGED
@@ -15,6 +15,14 @@ All notable changes to this project are documented below. The format follows [Ke
 
 - No unreleased changes yet.
 
+## [0.13.0] - 2026-03-24
+
+### Changed
+- **Fetch pipeline performance overhaul** — resolves production DB overload on small servers (2-core/4GB) where 569 fetch jobs backed up and PostgreSQL hit 150% CPU
+  - Normalize GUIDs to lowercase on write, replacing `LOWER(guid)` queries that forced sequential scans — existing btree indexes now used correctly
+  - Restructure advisory lock in FetchRunner into 3 phases (lock/fetch/write) so DB connections are not held idle during HTTP requests
+  - Add BatchItemCreator for bulk item lookups — reduces per-fetch queries from ~N*2 to 2 regardless of entry count
+
 ## [0.12.4] - 2026-03-17
 
 ### Fixed
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -9,8 +9,8 @@ SourceMonitor is a production-ready Rails 8 mountable engine for ingesting, norm
 In your host Rails app:
 
 ```bash
-bundle add source_monitor --version "~> 0.
-# or add `gem "source_monitor", "~> 0.
+bundle add source_monitor --version "~> 0.13.0"
+# or add `gem "source_monitor", "~> 0.13.0"` manually, then run:
 bundle install
 ```
 
@@ -46,7 +46,7 @@ This exposes `bin/source_monitor` (via Bundler binstubs) so you can run the guid
 Before running any SourceMonitor commands inside your host app, add the gem and install dependencies:
 
 ```bash
-bundle add source_monitor --version "~> 0.
+bundle add source_monitor --version "~> 0.13.0"
 # or edit your Gemfile, then run
 bundle install
 ```
data/VERSION
CHANGED
@@ -1 +1 @@
-0.
+0.13.0
data/app/assets/builds/source_monitor/application.css
CHANGED
@@ -1275,14 +1275,14 @@ video {
   border-color: transparent;
 }
 
-.fm-admin .border-violet-
+.fm-admin .border-violet-100 {
   --tw-border-opacity: 1;
-  border-color: rgb(
+  border-color: rgb(237 233 254 / var(--tw-border-opacity, 1));
 }
 
-.fm-admin .border-violet-
+.fm-admin .border-violet-200 {
   --tw-border-opacity: 1;
-  border-color: rgb(
+  border-color: rgb(221 214 254 / var(--tw-border-opacity, 1));
 }
 
 .fm-admin .bg-amber-100 {
@@ -1673,6 +1673,10 @@ video {
   text-transform: uppercase;
 }
 
+.fm-admin .lowercase {
+  text-transform: lowercase;
+}
+
 .fm-admin .capitalize {
   text-transform: capitalize;
 }
@@ -1856,6 +1860,11 @@ video {
   color: rgb(109 40 217 / var(--tw-text-opacity, 1));
 }
 
+.fm-admin .text-violet-900 {
+  --tw-text-opacity: 1;
+  color: rgb(76 29 149 / var(--tw-text-opacity, 1));
+}
+
 .fm-admin .text-white {
   --tw-text-opacity: 1;
   color: rgb(255 255 255 / var(--tw-text-opacity, 1));
@@ -1866,11 +1875,6 @@ video {
   color: rgb(161 98 7 / var(--tw-text-opacity, 1));
 }
 
-.fm-admin .text-violet-900 {
-  --tw-text-opacity: 1;
-  color: rgb(76 29 149 / var(--tw-text-opacity, 1));
-}
-
 .fm-admin .underline {
   text-decoration-line: underline;
 }
@@ -2064,14 +2068,14 @@ video {
   color: rgb(15 23 42 / var(--tw-text-opacity, 1));
 }
 
-.fm-admin .hover\:text-
+.fm-admin .hover\:text-violet-600:hover {
   --tw-text-opacity: 1;
-  color: rgb(
+  color: rgb(124 58 237 / var(--tw-text-opacity, 1));
 }
 
-.fm-admin .hover\:text-
+.fm-admin .hover\:text-white:hover {
   --tw-text-opacity: 1;
-  color: rgb(
+  color: rgb(255 255 255 / var(--tw-text-opacity, 1));
 }
 
 .fm-admin .hover\:underline:hover {
data/docs/setup.md
CHANGED
@@ -18,8 +18,8 @@ This guide consolidates the new guided installer, verification commands, and rol
 Run these commands inside your host Rails application before invoking the guided workflow:
 
 ```bash
-bundle add source_monitor --version "~> 0.
-# or add gem "source_monitor", "~> 0.
+bundle add source_monitor --version "~> 0.13.0"
+# or add gem "source_monitor", "~> 0.13.0" to Gemfile manually
 bundle install
 ```
data/docs/upgrade.md
CHANGED
@@ -46,6 +46,21 @@ If a removed option raises an error (`SourceMonitor::DeprecatedOptionError`), yo
 
 ## Version-Specific Notes
 
+### Upgrading to 0.13.0
+
+**What changed:**
+- **Performance:** GUID normalization to lowercase on write; plain btree index used instead of LOWER(guid) sequential scans
+- **Fetch pipeline:** Restructured FetchRunner into 3-phase execution (lock/fetch/write) so DB connections are not held idle during HTTP requests
+- **Batch processing:** New BatchItemCreator for bulk item lookups reduces per-fetch queries from ~N*2 to 2
+
+**Upgrade steps:**
+```bash
+bundle update source_monitor
+bin/rails source_monitor:upgrade
+```
+
+No migrations, configuration changes, or breaking changes required.
+
 ### Upgrading to 0.12.0
 
 **What changed:**
data/lib/source_monitor/fetching/advisory_lock.rb
CHANGED
@@ -13,6 +13,8 @@ module SourceMonitor
     @connection_pool = connection_pool
   end
 
+  # Block-based API: acquires lock, yields, releases. Holds a DB connection
+  # for the entire duration of the block.
   def with_lock
     connection_pool.with_connection do |connection|
       locked = try_lock(connection)
@@ -26,6 +28,31 @@ module SourceMonitor
     end
   end
 
+  # Non-blocking acquire: tries to get the advisory lock. Returns true if
+  # acquired, false otherwise. Raises NotAcquiredError when raise_on_failure
+  # is true (default). The lock is session-scoped -- it stays held until
+  # release! is called on the same DB connection, or the connection is closed.
+  def acquire!(raise_on_failure: true)
+    locked = false
+    connection_pool.with_connection do |connection|
+      locked = try_lock(connection)
+    end
+    raise NotAcquiredError, "advisory lock #{namespace}/#{key} busy" if !locked && raise_on_failure
+
+    locked
+  end
+
+  # Releases the advisory lock. Safe to call even if the lock is not held.
+  # Because advisory locks are session-scoped, this must run on the same
+  # connection that acquired the lock. In a connection pool the pool returns
+  # the same connection to the same thread, so this works correctly as long
+  # as acquire! and release! are called from the same thread.
+  def release!
+    connection_pool.with_connection do |connection|
+      release(connection)
+    end
+  end
+
   private
 
   attr_reader :namespace, :key, :connection_pool
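The intended usage of the new non-block API can be sketched with an in-memory stand-in for the Postgres advisory lock. `FakeAdvisoryLock` is hypothetical; only the `acquire!`/`release!`/`with_lock` method names mirror the gem's API, and no database is involved:

```ruby
# In-memory stand-in for AdvisoryLock; no Postgres involved. Only the
# acquire!/release!/with_lock method names mirror the gem's API.
class FakeAdvisoryLock
  NotAcquiredError = Class.new(StandardError)

  def initialize
    @held = false
  end

  # Non-blocking: returns true on success, false (or raises) when busy.
  def acquire!(raise_on_failure: true)
    if @held
      raise NotAcquiredError, "lock busy" if raise_on_failure
      return false
    end
    @held = true
  end

  # Safe to call even when the lock is not held.
  def release!
    @held = false
    nil
  end

  # The pre-existing block API: acquire, yield, always release.
  def with_lock
    acquire!
    yield
  ensure
    release!
  end
end

lock = FakeAdvisoryLock.new
lock.acquire!                          # => true, lock now held
lock.acquire!(raise_on_failure: false) # => false, already busy
lock.release!
lock.with_lock { :work }               # => :work, released on exit
```

The split API exists so a caller can do other work (such as HTTP I/O) between acquiring and releasing, which the block form cannot express without holding a connection.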
data/lib/source_monitor/fetching/feed_fetcher/entry_processor.rb
CHANGED
@@ -1,5 +1,7 @@
 # frozen_string_literal: true
 
+require "source_monitor/items/batch_item_creator"
+
 module SourceMonitor
   module Fetching
     class FeedFetcher
@@ -22,6 +24,34 @@ module SourceMonitor
         updated_items: []
       ) unless feed.respond_to?(:entries)
 
+      entries = Array(feed.entries)
+      return empty_result if entries.empty?
+
+      # Pre-fetch existing items in bulk (2 SELECTs instead of N per-entry).
+      # If the batch index build fails, fall back to per-entry lookups.
+      existing_items_index = begin
+        SourceMonitor::Items::BatchItemCreator.build_index(source: source, entries: entries)
+      rescue StandardError
+        nil
+      end
+
+      process_entries_with_index(entries, existing_items_index)
+    end
+
+    private
+
+    def empty_result
+      FeedFetcher::EntryProcessingResult.new(
+        created: 0, updated: 0, unchanged: 0, failed: 0,
+        items: [], errors: [], created_items: [], updated_items: []
+      )
+    end
+
+    # Processes entries one at a time through ItemCreator.call.
+    # When existing_items_index is provided, ItemCreator skips per-entry
+    # SELECT queries and uses the pre-fetched index instead.
+    # When nil, ItemCreator falls back to individual DB lookups.
+    def process_entries_with_index(entries, existing_items_index)
       created = 0
       updated = 0
       unchanged = 0
@@ -31,9 +61,12 @@ module SourceMonitor
       updated_items = []
       errors = []
 
-
+      entries.each do |entry|
        begin
-          result = SourceMonitor::Items::ItemCreator.call(
+          result = SourceMonitor::Items::ItemCreator.call(
+            source: source, entry: entry,
+            existing_items_index: existing_items_index
+          )
          SourceMonitor::Events.run_item_processors(source:, entry:, result: result)
          items << result.item
          if result.created?
@@ -54,19 +87,11 @@ module SourceMonitor
      end
 
      FeedFetcher::EntryProcessingResult.new(
-        created:,
-
-        unchanged:,
-        failed:,
-        items:,
-        errors: errors.compact,
-        created_items:,
-        updated_items:
+        created:, updated:, unchanged:, failed:,
+        items:, errors: errors.compact, created_items:, updated_items:
      )
    end
 
-    private
-
    def enqueue_image_download(item)
      return unless SourceMonitor.config.images.download_enabled?
      return if item.content.blank?
data/lib/source_monitor/fetching/fetch_runner.rb
CHANGED
@@ -56,13 +56,35 @@
   @retry_scheduled = false
   result = nil
 
-  lock.
+  # Phase 1: Acquire advisory lock and mark source as fetching.
+  # Uses a DB connection briefly, then releases it.
+  lock.acquire!
+  begin
     mark_fetching!
+  rescue StandardError
+    lock.release!
+    raise
+  end
+
+  # Phase 2: HTTP fetch -- no DB connection held during network I/O.
+  # This is the key optimization: on slow feeds (up to 30s timeout),
+  # we no longer hold a DB connection idle while waiting for HTTP.
+  begin
     result = fetcher_class.new(source: source).call
+  rescue StandardError => fetch_error
+    # Ensure lock is released before propagating
+    lock.release!
+    raise fetch_error
+  end
+
+  # Phase 3: Post-fetch DB writes under the advisory lock (still held).
+  begin
     log_handler_result("RetentionHandler", retention_handler.call(source:, result:))
     log_handler_result("FollowUpHandler", follow_up_handler.call(source:, result:))
     schedule_retry_if_needed(result)
     mark_complete!(result)
+  ensure
+    lock.release!
   end
 
   log_handler_result("EventPublisher", event_publisher.call(source:, result:))
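The 3-phase shape of the restructured runner can be sketched as a plain method with stub collaborators. All names here (`StubLock`, `run_three_phase`, the `fetch:`/`write:` lambdas) are illustrative, not the gem's:

```ruby
# Stub lock tracking held state; stands in for the advisory lock.
class StubLock
  attr_reader :held

  def initialize
    @held = false
  end

  def acquire!
    @held = true
  end

  def release!
    @held = false
  end
end

# Phase 1: take the lock. Phase 2: network I/O with no DB connection
# checked out. Phase 3: DB writes, releasing the lock no matter what.
def run_three_phase(lock, fetch:, write:)
  lock.acquire!
  begin
    result = fetch.call # network I/O; the DB pool is untouched here
    write.call(result)  # post-fetch writes while the lock is held
    result
  ensure
    lock.release!       # released on success and on error alike
  end
end

lock = StubLock.new
run_three_phase(lock, fetch: -> { :feed }, write: ->(r) { r })
# lock.held is false again afterwards
```

The point of the split is the middle phase: with the old single `with_lock` block, a 30-second feed timeout pinned a pooled DB connection for the whole request.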
data/lib/source_monitor/items/batch_item_creator.rb
CHANGED
@@ -0,0 +1,86 @@
+# frozen_string_literal: true
+
+require "source_monitor/items/item_creator"
+
+module SourceMonitor
+  module Items
+    # Builds a pre-fetched lookup index of existing items for a batch of entries.
+    #
+    # Instead of N individual SELECT queries (one per feed entry) to check for
+    # existing items, this class:
+    #   1. Pre-parses all entries to collect GUIDs + fingerprints
+    #   2. Does a single WHERE guid IN (...) query to find existing items by GUID
+    #   3. Does a single WHERE content_fingerprint IN (...) for remaining entries
+    #   4. Returns an index hash that ItemCreator can use to skip per-entry SELECTs
+    #
+    # The actual item creation/update is still done by ItemCreator.call, which
+    # accepts the index via the existing_items_index parameter.
+    class BatchItemCreator
+      # Builds a lookup index from a batch of feed entries.
+      # Returns a Hash with :by_guid and :by_fingerprint keys.
+      def self.build_index(source:, entries:)
+        new(source: source, entries: entries).build_index
+      end
+
+      def initialize(source:, entries:)
+        @source = source
+        @entries = Array(entries)
+      end
+
+      def build_index
+        return { by_guid: {}, by_fingerprint: {} } if @entries.empty?
+
+        # Step 1: Pre-parse entries to extract GUIDs and fingerprints for bulk lookup.
+        entry_identifiers = @entries.map do |entry|
+          parser = ItemCreator::EntryParser.new(
+            source: @source,
+            entry: entry,
+            content_extractor: content_extractor
+          )
+          attrs = parser.parse
+          raw_guid = attrs[:guid]
+          normalized_guid = raw_guid.present? ? raw_guid.downcase : nil
+          guid = normalized_guid.presence || attrs[:content_fingerprint]
+
+          { guid: guid, fingerprint: attrs[:content_fingerprint], raw_guid_present: normalized_guid.present? }
+        end
+
+        # Step 2: Batch-fetch existing items by GUID (single query)
+        guids = entry_identifiers
+          .select { |ei| ei[:raw_guid_present] }
+          .filter_map { |ei| ei[:guid] }
+          .uniq
+
+        existing_by_guid = if guids.any?
+          @source.all_items.where(guid: guids).index_by(&:guid)
+        else
+          {}
+        end
+
+        # Step 3: For entries without a GUID match, batch-fetch by fingerprint
+        unmatched_fingerprints = entry_identifiers.filter_map do |ei|
+          guid = ei[:guid]
+          next if ei[:raw_guid_present] && existing_by_guid.key?(guid)
+
+          ei[:fingerprint].presence
+        end.uniq
+
+        existing_by_fingerprint = if unmatched_fingerprints.any?
+          @source.all_items
+            .where(content_fingerprint: unmatched_fingerprints)
+            .index_by(&:content_fingerprint)
+        else
+          {}
+        end
+
+        { by_guid: existing_by_guid, by_fingerprint: existing_by_fingerprint }
+      end
+
+      private
+
+      def content_extractor
+        @content_extractor ||= ItemCreator::ContentExtractor.new(source: @source)
+      end
+    end
+  end
+end
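The two-step bulk lookup in `BatchItemCreator#build_index` can be mimicked with plain hashes. This is an in-memory sketch: the arrays of hashes stand in for ActiveRecord rows and parsed entries, and each `select` stands in for one SQL `IN (...)` query:

```ruby
# In-memory sketch of the two-query index pattern. `items` stands in for
# persisted rows, `entries` for parsed feed entries; names are illustrative.
def build_index(items, entries)
  guids = entries.filter_map { |e| e[:guid]&.downcase }.uniq
  # "Query 1": one pass over items covering every GUID at once
  by_guid = items.select { |i| guids.include?(i[:guid]) }
                 .to_h { |i| [i[:guid], i] }

  # "Query 2": fingerprints only for entries not already matched by GUID
  unmatched = entries.reject { |e| by_guid.key?(e[:guid]&.downcase) }
                     .filter_map { |e| e[:fingerprint] }.uniq
  by_fp = items.select { |i| unmatched.include?(i[:fingerprint]) }
               .to_h { |i| [i[:fingerprint], i] }

  { by_guid: by_guid, by_fingerprint: by_fp }
end

items   = [{ guid: "a", fingerprint: "f1" }, { guid: "x", fingerprint: "f2" }]
entries = [{ guid: "A", fingerprint: "f9" }, { guid: nil, fingerprint: "f2" }]
build_index(items, entries)
# => { by_guid: { "a" => ... }, by_fingerprint: { "f2" => ... } }
```

Note the same normalization rule applies here as in the real class: entry GUIDs are lowercased before lookup, so `"A"` matches the stored `"a"`.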
data/lib/source_monitor/items/item_creator.rb
CHANGED
@@ -33,21 +33,30 @@ module SourceMonitor
   KEYWORD_SEPARATORS = /[,;]+/.freeze
   METADATA_ROOT_KEY = "feedjira_entry".freeze
 
-
-
+  # Process a single feed entry, creating or updating the corresponding item.
+  #
+  # @param existing_items_index [Hash, nil] Optional pre-fetched lookup of
+  #   existing items keyed by guid and content_fingerprint. When provided,
+  #   skips per-entry SELECT queries (used by BatchItemCreator).
+  def self.call(source:, entry:, existing_items_index: nil)
+    new(source:, entry:, existing_items_index: existing_items_index).call
   end
 
-  def initialize(source:, entry:)
+  def initialize(source:, entry:, existing_items_index: nil)
     @source = source
     @entry = entry
+    @existing_items_index = existing_items_index
   end
 
   def call
     attributes = build_attributes
     raw_guid = attributes[:guid]
-
+    # Normalize GUID to lowercase so the plain btree index on guid is used
+    # for lookups instead of LOWER(guid) which forces sequential scans.
+    normalized_guid = raw_guid.present? ? raw_guid.downcase : nil
+    attributes[:guid] = normalized_guid.presence || attributes[:content_fingerprint]
 
-    existing_item, matched_by = existing_item_for(attributes, raw_guid_present:
+    existing_item, matched_by = existing_item_for(attributes, raw_guid_present: normalized_guid.present?)
 
     if existing_item
       apply_attributes(existing_item, attributes)
@@ -61,34 +70,54 @@ module SourceMonitor
     end
   end
 
-    create_new_item(attributes, raw_guid_present:
+    create_new_item(attributes, raw_guid_present: normalized_guid.present?)
   end
 
   private
 
-  attr_reader :source, :entry
+  attr_reader :source, :entry, :existing_items_index
 
   def existing_item_for(attributes, raw_guid_present:)
     guid = attributes[:guid]
     fingerprint = attributes[:content_fingerprint]
 
     if raw_guid_present
-      existing =
+      existing = lookup_by_guid(guid)
       return [ existing, :guid ] if existing
     end
 
     if fingerprint.present?
-      existing =
+      existing = lookup_by_fingerprint(fingerprint)
      return [ existing, :fingerprint ] if existing
    end
 
    [ nil, nil ]
  end
 
+  # When a pre-fetched index is available (batch mode), look up from it
+  # instead of issuing a per-entry SELECT query.
+  def lookup_by_guid(guid)
+    if existing_items_index
+      existing_items_index[:by_guid]&.dig(guid)
+    else
+      find_item_by_guid(guid)
+    end
+  end
+
+  def lookup_by_fingerprint(fingerprint)
+    if existing_items_index
+      existing_items_index[:by_fingerprint]&.dig(fingerprint)
+    else
+      find_item_by_fingerprint(fingerprint)
+    end
+  end
+
   def find_item_by_guid(guid)
     return if guid.blank?
 
-
+    # GUIDs are normalized to lowercase on write, so we can use a plain
+    # equality check that hits the btree index on (source_id, guid).
+    source.all_items.find_by(guid: guid.downcase)
   end
 
   def find_item_by_fingerprint(fingerprint)
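The write-side rule in `ItemCreator#call` boils down to a small pure function: lowercase the raw GUID, and fall back to the content fingerprint when the GUID is missing. `normalize_guid` below is an illustrative helper, not part of the gem's API:

```ruby
# Illustrative helper for the GUID normalization rule applied on write:
# lowercase when present, otherwise fall back to the content fingerprint.
def normalize_guid(raw_guid, fingerprint)
  normalized = raw_guid && !raw_guid.empty? ? raw_guid.downcase : nil
  normalized || fingerprint
end

normalize_guid("ABC-123", "fp") # => "abc-123"
normalize_guid(nil, "fp")       # => "fp"
```

Because every stored `guid` is lowercase, reads can use plain equality (`guid = $1`) that hits the btree index, rather than `LOWER(guid) = $1`, which bypasses it.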
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: source_monitor
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.13.0
 platform: ruby
 authors:
 - dchuk
@@ -620,6 +620,7 @@ files:
 - lib/source_monitor/import_sessions/health_check_updater.rb
 - lib/source_monitor/import_sessions/opml_importer.rb
 - lib/source_monitor/instrumentation.rb
+- lib/source_monitor/items/batch_item_creator.rb
 - lib/source_monitor/items/item_creator.rb
 - lib/source_monitor/items/item_creator/content_extractor.rb
 - lib/source_monitor/items/item_creator/entry_parser.rb