source_monitor 0.3.0 → 0.3.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. checksums.yaml +4 -4
  2. data/.claude/skills/sm-architecture/SKILL.md +233 -0
  3. data/.claude/skills/sm-architecture/reference/extraction-patterns.md +192 -0
  4. data/.claude/skills/sm-architecture/reference/module-map.md +194 -0
  5. data/.claude/skills/sm-configuration-setting/SKILL.md +264 -0
  6. data/.claude/skills/sm-configuration-setting/reference/settings-catalog.md +248 -0
  7. data/.claude/skills/sm-configuration-setting/reference/settings-pattern.md +297 -0
  8. data/.claude/skills/sm-configure/SKILL.md +153 -0
  9. data/.claude/skills/sm-configure/reference/configuration-reference.md +321 -0
  10. data/.claude/skills/sm-dashboard-widget/SKILL.md +344 -0
  11. data/.claude/skills/sm-dashboard-widget/reference/dashboard-patterns.md +304 -0
  12. data/.claude/skills/sm-domain-model/SKILL.md +188 -0
  13. data/.claude/skills/sm-domain-model/reference/model-graph.md +114 -0
  14. data/.claude/skills/sm-domain-model/reference/table-structure.md +348 -0
  15. data/.claude/skills/sm-engine-migration/SKILL.md +395 -0
  16. data/.claude/skills/sm-engine-migration/reference/migration-conventions.md +255 -0
  17. data/.claude/skills/sm-engine-test/SKILL.md +302 -0
  18. data/.claude/skills/sm-engine-test/reference/test-helpers.md +259 -0
  19. data/.claude/skills/sm-engine-test/reference/test-patterns.md +411 -0
  20. data/.claude/skills/sm-event-handler/SKILL.md +265 -0
  21. data/.claude/skills/sm-event-handler/reference/events-api.md +229 -0
  22. data/.claude/skills/sm-health-rule/SKILL.md +327 -0
  23. data/.claude/skills/sm-health-rule/reference/health-system.md +269 -0
  24. data/.claude/skills/sm-host-setup/SKILL.md +223 -0
  25. data/.claude/skills/sm-host-setup/reference/initializer-template.md +195 -0
  26. data/.claude/skills/sm-host-setup/reference/setup-checklist.md +134 -0
  27. data/.claude/skills/sm-job/SKILL.md +263 -0
  28. data/.claude/skills/sm-job/reference/job-conventions.md +245 -0
  29. data/.claude/skills/sm-model-extension/SKILL.md +287 -0
  30. data/.claude/skills/sm-model-extension/reference/extension-api.md +317 -0
  31. data/.claude/skills/sm-pipeline-stage/SKILL.md +254 -0
  32. data/.claude/skills/sm-pipeline-stage/reference/completion-handlers.md +152 -0
  33. data/.claude/skills/sm-pipeline-stage/reference/entry-processing.md +191 -0
  34. data/.claude/skills/sm-pipeline-stage/reference/feed-fetcher-architecture.md +198 -0
  35. data/.claude/skills/sm-scraper-adapter/SKILL.md +284 -0
  36. data/.claude/skills/sm-scraper-adapter/reference/adapter-contract.md +167 -0
  37. data/.claude/skills/sm-scraper-adapter/reference/example-adapter.md +274 -0
  38. data/.vbw-planning/.notification-log.jsonl +102 -0
  39. data/.vbw-planning/.session-log.jsonl +505 -0
  40. data/AGENTS.md +20 -57
  41. data/CHANGELOG.md +19 -0
  42. data/CLAUDE.md +44 -1
  43. data/CONTRIBUTING.md +5 -5
  44. data/Gemfile.lock +20 -21
  45. data/README.md +18 -5
  46. data/VERSION +1 -0
  47. data/docs/deployment.md +1 -1
  48. data/docs/setup.md +4 -4
  49. data/lib/source_monitor/setup/skills_installer.rb +94 -0
  50. data/lib/source_monitor/setup/workflow.rb +17 -2
  51. data/lib/source_monitor/version.rb +1 -1
  52. data/lib/tasks/source_monitor_setup.rake +58 -0
  53. data/source_monitor.gemspec +1 -0
  54. metadata +39 -1
data/.claude/skills/sm-pipeline-stage/reference/feed-fetcher-architecture.md
@@ -0,0 +1,198 @@
# FeedFetcher Architecture

## Module Structure

The `FeedFetcher` was refactored from a 627-line monolith into a 285-line coordinator with 3 sub-modules. Each sub-module is a plain Ruby class instantiated lazily via accessor methods.

```
FeedFetcher (285 lines) -- coordinator
  |
  +-- AdaptiveInterval (141 lines) -- fetch interval math
  +-- SourceUpdater (200 lines) -- source persistence + fetch logs
  +-- EntryProcessor (89 lines) -- feed entry iteration
```

## FeedFetcher (Coordinator)

**File:** `lib/source_monitor/fetching/feed_fetcher.rb`

Responsibilities:
- Perform HTTP request via Faraday client
- Route response by status code (200, 304, else)
- Parse feed body with Feedjira
- Delegate to sub-modules for processing
- Emit instrumentation events
- Handle and classify errors

### Key Data Structures

```ruby
Result = Struct.new(:status, :feed, :response, :body, :error, :item_processing, :retry_decision)
EntryProcessingResult = Struct.new(:created, :updated, :failed, :items, :errors, :created_items, :updated_items)
ResponseWrapper = Struct.new(:status, :headers, :body)
```

### Request Flow

```
call()
  -> perform_fetch(started_at, payload)
     -> perform_request()                 # Faraday GET with conditional headers
     -> handle_response(response)
          |
          +-- 200  -> handle_success()
          |            -> parse_feed()    # Feedjira.parse
          |            -> entry_processor.process_feed_entries()
          |            -> source_updater.update_source_for_success()
          |            -> source_updater.create_fetch_log()
          |
          +-- 304  -> handle_not_modified()
          |            -> source_updater.update_source_for_not_modified()
          |            -> source_updater.create_fetch_log()
          |
          +-- else -> raise HTTPError
  rescue FetchError -> handle_failure()
     -> source_updater.update_source_for_failure()
     -> source_updater.create_fetch_log()
```

### Conditional Request Headers

The fetcher sends conditional headers when available:
- `If-None-Match` -- uses `source.etag`
- `If-Modified-Since` -- uses `source.last_modified.httpdate`
- Custom headers from `source.custom_headers`

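The header assembly can be sketched as a small pure function. This is a sketch with hypothetical helper naming, assuming only that `source` responds to `etag`, `last_modified`, and `custom_headers` as listed above; the real logic lives inside `FeedFetcher#perform_request`:

```ruby
require "time" # provides Time#httpdate

# Build conditional request headers from a source, mirroring the list above.
def conditional_headers(source)
  headers = {}
  headers["If-None-Match"] = source.etag if source.etag
  headers["If-Modified-Since"] = source.last_modified.httpdate if source.last_modified
  headers.merge(source.custom_headers || {})
end
```

A 304 response to a request built this way means the cached copy is still current, so no entries need processing.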
### Sub-Module Instantiation

Sub-modules are lazily instantiated and cached:

```ruby
def adaptive_interval
  @adaptive_interval ||= AdaptiveInterval.new(source: source, jitter_proc: jitter_proc)
end

def source_updater
  @source_updater ||= SourceUpdater.new(source: source, adaptive_interval: adaptive_interval)
end

def entry_processor
  @entry_processor ||= EntryProcessor.new(source: source)
end
```

### Backward Compatibility

Forwarding methods maintain backward compatibility with existing tests:

```ruby
def process_feed_entries(feed) = entry_processor.process_feed_entries(feed)
def jitter_offset(interval_seconds) = adaptive_interval.jitter_offset(interval_seconds)
# ... etc
```

## AdaptiveInterval Sub-Module

**File:** `lib/source_monitor/fetching/feed_fetcher/adaptive_interval.rb`

Controls dynamic fetch scheduling based on content changes and failures.

### Algorithm

| Condition | Factor | Effect |
|-----------|--------|--------|
| Content changed | `DECREASE_FACTOR` (0.75) | Fetch more often |
| No change | `INCREASE_FACTOR` (1.25) | Fetch less often |
| Failure | `FAILURE_INCREASE_FACTOR` (1.5) | Back off significantly |

### Boundaries

| Constant | Default | Purpose |
|----------|---------|---------|
| `MIN_FETCH_INTERVAL` | 5 minutes | Floor for interval |
| `MAX_FETCH_INTERVAL` | 24 hours | Ceiling for interval |
| `JITTER_PERCENT` | 10% | Random offset to prevent thundering herd |

### Configuration Override

All constants can be overridden via `SourceMonitor.config.fetching`:
- `min_interval_minutes`
- `max_interval_minutes`
- `increase_factor`
- `decrease_factor`
- `failure_increase_factor`
- `jitter_percent`

### Fixed vs Adaptive

When `source.adaptive_fetching_enabled?` is false, the interval uses a simple fixed schedule:

```ruby
fixed_minutes = [source.fetch_interval_minutes.to_i, 1].max
attributes[:next_fetch_at] = Time.current + fixed_minutes.minutes
```

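Putting the factors and boundaries together, the adaptive update can be modeled as a pure function. This is a simplified sketch with the default constants hard-coded and the random jitter passed in as a parameter; the real module reads these values from configuration and draws jitter via `jitter_proc`:

```ruby
MIN_FETCH_INTERVAL = 5 * 60        # seconds (5 minutes)
MAX_FETCH_INTERVAL = 24 * 60 * 60  # seconds (24 hours)
JITTER_PERCENT     = 0.10

# Next fetch interval in seconds. `jitter` is a value in [-1, 1] supplied
# by the caller; the production code randomizes it to spread fetches out.
def next_interval(current, changed:, failed:, jitter: 0.0)
  factor =
    if failed       then 1.5   # FAILURE_INCREASE_FACTOR: back off
    elsif changed   then 0.75  # DECREASE_FACTOR: fetch more often
    else                 1.25  # INCREASE_FACTOR: fetch less often
    end
  base = (current * factor).clamp(MIN_FETCH_INTERVAL, MAX_FETCH_INTERVAL)
  base + base * JITTER_PERCENT * jitter
end
```

Note that the failure factor is checked first: a fetch that failed backs off even if earlier content had been changing.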
## SourceUpdater Sub-Module

**File:** `lib/source_monitor/fetching/feed_fetcher/source_updater.rb`

Handles all source record mutations after a fetch attempt.

### Update Methods

| Method | When Called | Key Updates |
|--------|------------|-------------|
| `update_source_for_success` | HTTP 200 | Clear errors, update etag/last_modified, adaptive interval, reset retry state |
| `update_source_for_not_modified` | HTTP 304 | Clear errors, update etag/last_modified, adaptive interval |
| `update_source_for_failure` | Any error | Increment failure_count, apply retry strategy, adaptive interval with failure flag |

### Fetch Log Creation

Every fetch attempt creates a `FetchLog` record via `create_fetch_log` with:
- Timing (started_at, completed_at, duration_ms)
- HTTP details (status, response headers)
- Item counts (created, updated, failed)
- Error details (class, message, backtrace)
- Feed metadata (parser, signature, item errors)

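The fields above might be assembled like this. This is a hypothetical attribute builder for illustration; the key names are illustrative groupings of the list above, not the actual `FetchLog` schema:

```ruby
# Gather one fetch attempt's outcome into a FetchLog-style attributes hash.
def fetch_log_attributes(started_at:, completed_at:, http_status:, counts:, error: nil)
  {
    started_at: started_at,
    completed_at: completed_at,
    duration_ms: ((completed_at - started_at) * 1000).round,
    http_status: http_status,
    items_created: counts.fetch(:created, 0),
    items_updated: counts.fetch(:updated, 0),
    items_failed: counts.fetch(:failed, 0),
    error_class: error&.class&.name,
    error_message: error&.message
  }
end
```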
### Feed Signature

Content change detection uses a SHA256 digest of the response body:

```ruby
def feed_signature_changed?(feed_signature)
  (source.metadata || {}).fetch("last_feed_signature", nil) != feed_signature
end
```

### Retry Strategy

On failure, `apply_retry_strategy!` delegates to `RetryPolicy`:
- If retrying: set `fetch_retry_attempt` and schedule the retry
- If the circuit opens: set `fetch_circuit_opened_at` and `fetch_circuit_until`
- In both cases, update `next_fetch_at` and `backoff_until` accordingly

## EntryProcessor Sub-Module

**File:** `lib/source_monitor/fetching/feed_fetcher/entry_processor.rb`

Iterates over `feed.entries` and calls `ItemCreator.call` for each entry.

### Processing Loop

```ruby
Array(feed.entries).each do |entry|
  result = ItemCreator.call(source:, entry:)
  Events.run_item_processors(source:, entry:, result:)
  if result.created?
    Events.after_item_created(item: result.item, source:, entry:, result:)
  end
rescue StandardError => error
  # Normalize error, continue processing remaining entries
end
```

Key behaviors:
- Individual entry failures don't stop processing of remaining entries
- Events are dispatched for both item processors and item creation
- Error normalization captures GUID and title for debugging
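The last behavior can be sketched as follows. This is a hypothetical helper shape showing what "normalization" captures; the field names are illustrative, not the engine's exact structure:

```ruby
# Capture enough context from a failing entry to debug it later,
# without aborting the processing loop.
def normalize_entry_error(error, entry)
  {
    error_class: error.class.name,
    message: error.message,
    entry_guid: entry.respond_to?(:entry_id) ? entry.entry_id : nil,
    title: entry.respond_to?(:title) ? entry.title : nil
  }
end
```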
data/.claude/skills/sm-scraper-adapter/SKILL.md
@@ -0,0 +1,284 @@
---
name: sm-scraper-adapter
description: Use when creating custom scraper adapters for SourceMonitor, inheriting from Scrapers::Base, implementing the adapter contract, or registering/unregistering scrapers.
allowed-tools: Read, Write, Edit, Bash, Glob, Grep
---

# sm-scraper-adapter: Custom Scraper Adapters

Build custom content scrapers that integrate with SourceMonitor's scraping pipeline.

## When to Use

- Creating a new scraper adapter for a specific content type or source
- Customizing how content is fetched and parsed
- Understanding the scraper adapter contract
- Registering or swapping scraper adapters in configuration
- Debugging scraper failures

## Architecture Overview

```
SourceMonitor::Scrapers::Base (abstract)
  |
  +-- SourceMonitor::Scrapers::Readability (built-in)
  +-- MyApp::Scrapers::Custom (your adapter)
```

Scrapers are registered in configuration and selected per-source. Each adapter:
1. Receives an `item`, `source`, and merged `settings` hash
2. Performs HTTP fetching and content parsing
3. Returns a `Result` struct with status, HTML, content, and metadata

+
33
+ ## The Adapter Contract
34
+
35
+ ### Base Class: `SourceMonitor::Scrapers::Base`
36
+
37
+ Location: `lib/source_monitor/scrapers/base.rb`
38
+
39
+ All custom scrapers **must** inherit from `SourceMonitor::Scrapers::Base`.
40
+
41
+ ### Required: `#call` Instance Method
42
+
43
+ Must return a `SourceMonitor::Scrapers::Base::Result`:
44
+
45
+ ```ruby
46
+ Result = Struct.new(:status, :html, :content, :metadata, keyword_init: true)
47
+ ```
48
+
49
+ | Field | Type | Description |
50
+ |---|---|---|
51
+ | `status` | Symbol | `:success`, `:partial`, or `:failed` |
52
+ | `html` | String/nil | Raw HTML fetched from the URL |
53
+ | `content` | String/nil | Extracted/cleaned text content |
54
+ | `metadata` | Hash/nil | Diagnostics: headers, timings, URL, error info |
55
+
56
+ ### Class Methods (Optional Overrides)
57
+
58
+ | Method | Default | Description |
59
+ |---|---|---|
60
+ | `self.adapter_name` | Derived from class name | Name used in registry |
61
+ | `self.default_settings` | `{}` | Default settings hash for this adapter |
62
+ | `self.call(item:, source:, settings:, http:)` | Creates instance, calls `#call` | Class-level entry point |
63
+
64
+ ### Protected Accessors
65
+
66
+ Available inside `#call`:
67
+
68
+ | Accessor | Type | Description |
69
+ |---|---|---|
70
+ | `item` | `SourceMonitor::Item` | The item being scraped |
71
+ | `source` | `SourceMonitor::Source` | The owning source |
72
+ | `http` | Module | HTTP client module (`SourceMonitor::HTTP`) |
73
+ | `settings` | HashWithIndifferentAccess | Merged settings (see Settings Merging) |
74
+
75
+ ### Settings Merging
76
+
77
+ Settings are merged in priority order:
78
+ 1. `self.class.default_settings` (adapter defaults)
79
+ 2. `source.scrape_settings` (source-level overrides)
80
+ 3. `settings` parameter (per-invocation overrides)
81
+
82
+ All keys are normalized to strings with indifferent access.
83
+
84
+ ## Creating a Custom Adapter
85
+
86
+ ### Step 1: Create the Adapter Class
87
+
88
+ ```ruby
89
+ # app/scrapers/my_app/scrapers/premium.rb
90
+ module MyApp
91
+ module Scrapers
92
+ class Premium < SourceMonitor::Scrapers::Base
93
+ def self.default_settings
94
+ {
95
+ api_key: nil,
96
+ extract_images: true,
97
+ timeout: 30
98
+ }
99
+ end
100
+
101
+ def call
102
+ url = item.canonical_url.presence || item.url
103
+ return failure("missing_url", "No URL available") unless url.present?
104
+
105
+ response = fetch_content(url)
106
+ return failure("fetch_failed", response[:error]) unless response[:success]
107
+
108
+ content = extract_content(response[:body])
109
+
110
+ Result.new(
111
+ status: :success,
112
+ html: response[:body],
113
+ content: content,
114
+ metadata: {
115
+ url: url,
116
+ http_status: response[:status],
117
+ extraction_method: "premium"
118
+ }
119
+ )
120
+ rescue StandardError => error
121
+ failure(error.class.name, error.message)
122
+ end
123
+
124
+ private
125
+
126
+ def fetch_content(url)
127
+ conn = http.client(
128
+ timeout: settings[:timeout],
129
+ headers: { "Authorization" => "Bearer #{settings[:api_key]}" }
130
+ )
131
+ response = conn.get(url)
132
+ { success: true, body: response.body, status: response.status }
133
+ rescue Faraday::Error => e
134
+ { success: false, error: e.message }
135
+ end
136
+
137
+ def extract_content(html)
138
+ # Your custom extraction logic
139
+ html.gsub(/<[^>]+>/, " ").squeeze(" ").strip
140
+ end
141
+
142
+ def failure(error, message)
143
+ Result.new(
144
+ status: :failed,
145
+ html: nil,
146
+ content: nil,
147
+ metadata: { error: error, message: message }
148
+ )
149
+ end
150
+ end
151
+ end
152
+ end
153
+ ```
154
+
155
+ ### Step 2: Register the Adapter
156
+
157
+ ```ruby
158
+ # config/initializers/source_monitor.rb
159
+ SourceMonitor.configure do |config|
160
+ config.scrapers.register(:premium, "MyApp::Scrapers::Premium")
161
+ end
162
+ ```
163
+
164
+ ### Step 3: Assign to Sources
165
+
166
+ Set the scraper adapter name on individual sources. The source's `scrape_settings` JSON column can hold adapter-specific overrides.
167
+
168
+ ## Built-in Adapter: Readability
169
+
170
+ Location: `lib/source_monitor/scrapers/readability.rb`
171
+
172
+ The built-in Readability adapter:
173
+ 1. Fetches HTML via `HttpFetcher`
174
+ 2. Parses content via `ReadabilityParser`
175
+ 3. Supports CSS selector overrides via settings
176
+
177
+ Default settings structure:
178
+ ```ruby
179
+ {
180
+ http: { headers: {...}, timeout: 15, open_timeout: 5, proxy: nil },
181
+ selectors: { content: nil, title: nil },
182
+ readability: {
183
+ remove_unlikely_candidates: true,
184
+ clean_conditionally: true,
185
+ retry_length: 250,
186
+ min_text_length: 25
187
+ }
188
+ }
189
+ ```
190
+
191
+ ## Registration API
192
+
193
+ ```ruby
194
+ # Register by class
195
+ config.scrapers.register(:custom, MyApp::Scrapers::Custom)
196
+
197
+ # Register by string (lazy constantization)
198
+ config.scrapers.register(:custom, "MyApp::Scrapers::Custom")
199
+
200
+ # Unregister
201
+ config.scrapers.unregister(:custom)
202
+
203
+ # Look up
204
+ adapter_class = config.scrapers.adapter_for(:custom)
205
+
206
+ # Iterate
207
+ config.scrapers.each { |name, klass| puts "#{name}: #{klass}" }
208
+ ```
209
+
210
+ Name validation: must match `/\A[a-z0-9_]+\z/i`, normalized to lowercase.
211
+
212
+ ## Key Source Files
213
+
214
+ | File | Purpose |
215
+ |---|---|
216
+ | `lib/source_monitor/scrapers/base.rb` | Abstract base class and Result struct |
217
+ | `lib/source_monitor/scrapers/readability.rb` | Built-in Readability adapter |
218
+ | `lib/source_monitor/scrapers/fetchers/http_fetcher.rb` | HTTP fetching helper |
219
+ | `lib/source_monitor/scrapers/parsers/readability_parser.rb` | Content parsing |
220
+ | `lib/source_monitor/configuration/scraper_registry.rb` | Registration/lookup |
221
+ | `lib/source_monitor/scraping/item_scraper.rb` | Scraping orchestration |
222
+
223
+ ## References
224
+
225
+ - `reference/adapter-contract.md` -- Detailed interface specification
226
+ - `reference/example-adapter.md` -- Complete working example
227
+ - `lib/source_monitor/scrapers/readability.rb` -- Reference implementation
228
+
229
+ ## Testing
230
+
231
+ ```ruby
232
+ require "test_helper"
233
+
234
+ class PremiumScraperTest < ActiveSupport::TestCase
235
+ setup do
236
+ @source = create_source!
237
+ @item = @source.items.create!(
238
+ title: "Test",
239
+ url: "https://example.com/article",
240
+ external_id: "test-1"
241
+ )
242
+ end
243
+
244
+ test "scrapes content successfully" do
245
+ stub_request(:get, "https://example.com/article")
246
+ .to_return(status: 200, body: "<html><body><p>Content</p></body></html>")
247
+
248
+ result = MyApp::Scrapers::Premium.call(item: @item, source: @source)
249
+
250
+ assert_equal :success, result.status
251
+ assert_includes result.content, "Content"
252
+ assert_equal 200, result.metadata[:http_status]
253
+ end
254
+
255
+ test "handles fetch failure" do
256
+ stub_request(:get, "https://example.com/article")
257
+ .to_return(status: 500, body: "Error")
258
+
259
+ result = MyApp::Scrapers::Premium.call(item: @item, source: @source)
260
+
261
+ assert_equal :failed, result.status
262
+ end
263
+
264
+ test "handles missing URL" do
265
+ @item.update!(url: nil)
266
+ result = MyApp::Scrapers::Premium.call(item: @item, source: @source)
267
+
268
+ assert_equal :failed, result.status
269
+ assert_equal "missing_url", result.metadata[:error]
270
+ end
271
+ end
272
+ ```
273
+
274
+ ## Checklist
275
+
276
+ - [ ] Adapter inherits from `SourceMonitor::Scrapers::Base`
277
+ - [ ] `#call` returns a `Result` struct
278
+ - [ ] `status` is one of `:success`, `:partial`, `:failed`
279
+ - [ ] `metadata` includes `url` and error details on failure
280
+ - [ ] `self.default_settings` defined if adapter has configurable options
281
+ - [ ] Adapter registered in initializer
282
+ - [ ] Exception handling catches `StandardError` in `#call`
283
+ - [ ] Uses `http` accessor for HTTP requests (thread-safe)
284
+ - [ ] Tests cover success, failure, and edge cases
data/.claude/skills/sm-scraper-adapter/reference/adapter-contract.md
@@ -0,0 +1,167 @@
# Scraper Adapter Contract

Detailed specification of the interface required by custom scraper adapters.

Source: `lib/source_monitor/scrapers/base.rb`

## Inheritance Requirement

All scraper adapters **must** inherit from `SourceMonitor::Scrapers::Base`:

```ruby
class MyAdapter < SourceMonitor::Scrapers::Base
  def call
    # implementation
  end
end
```

The `ScraperRegistry` validates this at registration time and raises `ArgumentError` if the adapter does not inherit from `Base`.

## Constructor Signature

```ruby
def initialize(item:, source:, settings: nil, http: SourceMonitor::HTTP)
```

| Parameter | Type | Description |
|---|---|---|
| `item` | `SourceMonitor::Item` | The item to scrape |
| `source` | `SourceMonitor::Source` | The owning source/feed |
| `settings` | Hash/nil | Per-invocation setting overrides |
| `http` | Module | HTTP client module (default: `SourceMonitor::HTTP`) |

The constructor is defined on `Base` -- do not override it. Use `#call` for your logic.

## Required Instance Method: `#call`

Must return a `SourceMonitor::Scrapers::Base::Result`:

```ruby
Result = Struct.new(:status, :html, :content, :metadata, keyword_init: true)
```

### Result Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `status` | Symbol | Yes | `:success`, `:partial`, or `:failed` |
| `html` | String/nil | No | Raw HTML body from the fetch |
| `content` | String/nil | No | Extracted/cleaned text content |
| `metadata` | Hash/nil | No | Diagnostics and additional context |

### Status Values

| Status | Meaning |
|---|---|
| `:success` | Content fully extracted |
| `:partial` | Content extracted but incomplete (e.g., truncated, missing elements) |
| `:failed` | Unable to extract content |

### Metadata Conventions

On success:

```ruby
{
  url: "https://example.com/article",
  http_status: 200,
  content_type: "text/html",
  extraction_strategy: "custom",
  title: "Article Title"
}
```

On failure:

```ruby
{
  error: "fetch_error",           # Error classification
  message: "Connection refused",  # Human-readable message
  url: "https://example.com/article",
  http_status: 500                # If available
}
```

## Optional Class Methods

### `self.adapter_name`

Default: derived from the class name by removing the `Scraper` suffix and underscoring.

```ruby
MyApp::Scrapers::Premium       # => "premium"
MyApp::Scrapers::CustomScraper # => "custom"
```

### `self.default_settings`

Default: `{}`

Return a Hash of adapter-specific default settings. These are merged with source-level and invocation-level overrides.

```ruby
def self.default_settings
  {
    api_key: nil,
    max_retries: 3,
    selectors: { content: "article", title: "h1" }
  }
end
```

### `self.call(item:, source:, settings: nil, http: SourceMonitor::HTTP)`

The default implementation creates a new instance and calls `#call`. It rarely needs to be overridden.

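The `adapter_name` derivation can be approximated in plain Ruby. This is a sketch of the rule stated above, not the engine's exact code (which presumably uses ActiveSupport's `demodulize`/`underscore` helpers):

```ruby
# Approximate the default adapter_name: demodulize, drop the "Scraper"
# suffix, then underscore and downcase.
def derive_adapter_name(class_name)
  base = class_name.split("::").last          # demodulize
  base = base.sub(/Scraper\z/, "")            # drop the suffix
  base.gsub(/([a-z0-9])([A-Z])/, '\1_\2').downcase
end
```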
## Protected Accessors

Available inside `#call`:

| Accessor | Type | Description |
|---|---|---|
| `item` | `SourceMonitor::Item` | The item being scraped |
| `source` | `SourceMonitor::Source` | The owning source |
| `http` | Module | HTTP client module |
| `settings` | HashWithIndifferentAccess | Merged settings (see below) |

## Settings Merge Order

Settings are deep-merged in this priority order (later wins):

```
1. self.class.default_settings (adapter defaults)
2. source.scrape_settings (source-level, from DB JSON column)
3. settings parameter (per-invocation overrides)
```

All keys are normalized to strings with `ActiveSupport::HashWithIndifferentAccess`, so you can access them with either string or symbol keys.

## Thread Safety

Adapters must be stateless and thread-safe:
- A new instance is created per invocation via `self.call`
- Use only instance variables set in the constructor
- Do not use class-level mutable state
- The `http` client is safe to share

## Error Handling

Adapters should:
1. Catch expected errors (network, parsing) and return a `Result` with `:failed` status
2. Let unexpected errors propagate (they will be caught by the scraping pipeline)
3. Never swallow errors silently -- populate `metadata` with error details

```ruby
def call
  # ... scraping logic ...
rescue Faraday::Error => error
  # Expected failure: give it a stable classification
  Result.new(
    status: :failed,
    metadata: { error: "fetch_error", message: error.message }
  )
rescue StandardError => error
  # Safety net for anything unexpected; keep the class name for debugging
  Result.new(
    status: :failed,
    metadata: { error: error.class.name, message: error.message }
  )
end
```