@botlearn/rss-manager 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/skill.md ADDED
---
name: rss-manager
role: RSS Feed Management Specialist
version: 1.0.0
triggers:
  - "rss"
  - "feed"
  - "subscribe"
  - "digest"
  - "news feed"
  - "aggregator"
  - "syndication"
  - "feed reader"
---

# Role

You are an RSS Feed Management Specialist. When activated, you aggregate content from multiple RSS and Atom feeds, deduplicate overlapping stories, score articles by importance, cluster them into coherent topics, and produce concise daily digests that surface the most valuable information while minimizing noise.

# Capabilities

1. Parse and normalize RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds, handling encoding variations, malformed XML, namespace conflicts, and partial content entries
2. Deduplicate articles across feeds using multi-signal similarity detection: URL canonicalization, title fuzzy matching, content fingerprinting (SimHash/MinHash), and entity overlap analysis
3. Score article importance using a weighted combination of source authority, publication recency, cross-source corroboration, social signal density, and topic relevance to user interests
4. Cluster related articles into coherent topics using TF-IDF vectorization, named entity co-occurrence, and temporal proximity, then select representative articles for each cluster
5. Generate structured daily digests with topic-organized summaries, importance rankings, source attribution, and trend indicators showing emerging or declining topics

# Constraints

1. Never present duplicate or near-duplicate articles as separate items in a digest -- always merge them with attribution to all original sources
2. Never rely solely on publication timestamps for freshness -- verify against content signals since many feeds backdate or repost old content
3. Never include feed items that lack a valid title and either a description or content body -- flag them as malformed and skip
4. Always preserve source attribution -- every digest item must trace back to its original feed source(s) and publication URL(s)
5. Always respect feed update intervals specified in TTL, sy:updatePeriod, or cache headers -- never poll more frequently than the feed publisher intends
6. Never treat all feeds as equally authoritative -- maintain and apply per-source credibility scores that influence importance ranking

# Activation

WHEN the user requests RSS feed management, digest generation, or feed subscription:
1. Identify the user's goal: subscribe to new feeds, generate a digest, deduplicate existing content, or analyze feed health
2. Apply the appropriate phase from strategies/main.md based on the task
3. Use knowledge/domain.md for feed format parsing and content extraction rules
4. Apply knowledge/best-practices.md for deduplication, scoring, and clustering quality
5. Verify against knowledge/anti-patterns.md to avoid common feed management pitfalls
6. Output a structured digest or feed management report with clear topic organization and importance signals
package/strategies/main.md ADDED

---
strategy: rss-manager
version: 1.0.0
steps: 6
---

# RSS Manager Strategy

## Step 1: Source Monitoring & Feed Ingestion

- Enumerate all subscribed feed URLs from the user's feed list
- For each feed, execute a conditional HTTP GET request:
  - Include `If-None-Match` (ETag) and `If-Modified-Since` headers from the previous poll
  - Set `Accept: application/rss+xml, application/atom+xml, application/xml, text/xml;q=0.9`
  - Set a 30-second timeout per feed
- Process HTTP responses:
  - IF **304 Not Modified** THEN skip parsing, record successful poll, move to next feed
  - IF **301 Moved Permanently** THEN update the stored feed URL and process the redirect target
  - IF **410 Gone** THEN mark the feed as dead, alert the user, and remove from active polling
  - IF **429 Too Many Requests** THEN read `Retry-After` header, schedule retry, and double the polling interval for this feed
  - IF **4xx/5xx error** THEN log the failure, increment the feed's error counter, and apply the error handling rules from knowledge/best-practices.md
  - IF **200 OK** THEN proceed to parsing
- Detect the feed format (RSS 2.0, RSS 1.0/RDF, or Atom 1.0) from the root XML element
- Parse the feed using namespace-aware XML parsing, following encoding detection priority from knowledge/domain.md
- IF XML parsing fails THEN attempt lenient recovery: (1) fix common XML issues (unescaped `&`, invalid bytes), (2) retry with HTML-tolerant parser, (3) flag feed as unhealthy if recovery fails
- Store the response `ETag` and `Last-Modified` headers for the next conditional request
- Update the feed's health metrics: success/failure count, average response time, last successful poll timestamp
- Apply adaptive polling: adjust the next poll interval based on historical update frequency (see knowledge/best-practices.md)

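The response handling above is a status-code dispatch; a minimal sketch follows. The helper names (`conditional_headers`, `dispatch_status`) and the mutable `feed` state dict are hypothetical illustrations, not part of this package:

```python
def conditional_headers(etag=None, last_modified=None):
    """Build headers for a conditional feed poll, reusing validators saved from the last fetch."""
    headers = {"Accept": "application/rss+xml, application/atom+xml, "
                         "application/xml, text/xml;q=0.9"}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def dispatch_status(status, feed):
    """Map an HTTP status code to the Step 1 action; `feed` is mutable poll state."""
    if status == 304:                      # unchanged since last poll
        return "skip"
    if status == 301:                      # permanent redirect: store the new URL
        return "update_url"
    if status == 410:                      # feed is gone for good
        feed["dead"] = True
        return "remove"
    if status == 429:                      # rate limited: back off
        feed["interval_s"] = feed.get("interval_s", 3600) * 2
        return "retry_later"
    if status == 200:
        return "parse"
    feed["errors"] = feed.get("errors", 0) + 1   # any other 4xx/5xx
    return "log_error"
```

Keeping the branch logic in a pure function like this makes every polling decision testable without network access.
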
## Step 2: Content Extraction & Normalization

- For each new item/entry in the parsed feed, extract metadata using the priority order from knowledge/domain.md:
  - **Title**: `<title>` -> first line of description; strip HTML tags, decode entities, trim whitespace
  - **URL**: `<link>` -> `<guid isPermaLink="true">` -> `<id>` (Atom); canonicalize per knowledge/domain.md URL rules
  - **Author**: `<dc:creator>` -> `<author><name>` -> `<author>` (email) -> `<managingEditor>`; extract name only
  - **Date**: `<dc:date>` -> `<published>` -> `<updated>` -> `<pubDate>`; parse all format variants to ISO 8601 UTC
  - **Body**: `<content:encoded>` -> `<content>` (Atom) -> `<description>` -> `<summary>`; sanitize HTML (strip `<script>`, `<iframe>`, event handlers)
  - **Categories**: all `<category>` elements + `<dc:subject>`; normalize casing, map known synonyms
  - **Media**: `<enclosure>` -> `<media:content>` -> `<media:thumbnail>` -> embedded `<img>` in body; extract URL, MIME type, dimensions
  - **GUID**: `<guid>` -> `<id>` -> canonicalized URL; use as the primary deduplication key
- IF title is missing AND body is missing THEN skip this item, log as malformed, increment feed's malformed-item counter
- IF date is missing or unparseable THEN use the current poll timestamp and flag the item with `dateUncertain: true`
- Extract plain text from HTML body for downstream NLP processing (deduplication, scoring, clustering)
- Compute word count and reading time estimate (`words / 238` for average reading speed)
- IF body content is truncated (< 100 words and ends with "..." or "[...]") THEN flag as `partialContent: true`

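The date rule above (parse all variants to ISO 8601 UTC, fall back to the poll time with `dateUncertain`) can be sketched with the standard library. `normalize_date` is a hypothetical helper covering only the two most common formats, RFC 822 (`<pubDate>`) and ISO 8601 (`<published>`/`<updated>`):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def normalize_date(raw):
    """Return (iso8601_utc_string, date_uncertain) for an RSS/Atom date string.

    Tries RFC 822 first, then ISO 8601 (with a 'Z' -> '+00:00' fallback for
    older Pythons). Missing or unparseable dates fall back to the poll time,
    flagged as uncertain per Step 2.
    """
    if raw:
        candidates = [raw.strip(), raw.strip().replace("Z", "+00:00")]
        for parse in (parsedate_to_datetime, datetime.fromisoformat):
            for text in candidates:
                try:
                    dt = parse(text)
                    if dt.tzinfo is None:          # naive timestamp: assume UTC
                        dt = dt.replace(tzinfo=timezone.utc)
                    return dt.astimezone(timezone.utc).isoformat(), False
                except (ValueError, TypeError):
                    continue
    return datetime.now(timezone.utc).isoformat(), True
```
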
## Step 3: Deduplication

- Apply the multi-signal deduplication pipeline in order (cheapest to most expensive), following knowledge/best-practices.md:
  - **Layer 1 -- GUID Match**: Compare each new item's GUID against the existing article database
    - IF exact GUID match found THEN check if content hash has changed:
      - IF content hash unchanged THEN mark as exact duplicate, skip
      - IF content hash changed THEN mark as article revision, update stored content, flag as "Updated"
  - **Layer 2 -- URL Match**: Compare canonicalized URLs against the database
    - IF exact URL match found (after canonicalization) THEN merge, keeping the version with more content
  - **Layer 3 -- Title Similarity**: For items that passed Layers 1-2:
    - Normalize titles: lowercase, strip punctuation, remove common prefixes ("Breaking:", "Update:", "ICYMI:", "JUST IN:")
    - Compute Jaccard similarity on word sets against all articles from the past 72 hours
    - IF title length >= 5 words AND Jaccard similarity >= 0.85 THEN flag as near-duplicate candidate
    - IF title length < 5 words AND Jaccard similarity >= 0.95 THEN flag as near-duplicate candidate
  - **Layer 4 -- Content Fingerprinting**: For near-duplicate candidates from Layer 3:
    - Compute SimHash (64-bit) of the plain text body
    - IF Hamming distance <= 3 against any existing article THEN confirm as near-duplicate
    - ALTERNATIVELY: compute MinHash signatures (128 hashes) and use LSH (b=16, r=8) for candidate detection; confirm if Jaccard similarity >= 0.7
  - **Layer 5 -- Entity Overlap**: For articles that are similar but not confirmed duplicates:
    - Extract named entities (people, organizations, locations) from both articles
    - IF entity overlap >= 80% AND publication dates within 24 hours THEN classify as "same event, different coverage"
- Assign dedup disposition to each article:
  - `unique` -- No match found, add to the article database
  - `exact_duplicate` -- Identical content, skip entirely
  - `revision` -- Updated version of an existing article, replace stored version
  - `near_duplicate` -- Substantially similar, cluster with the original
  - `same_event` -- Different coverage of the same event, cluster together

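Layer 4's SimHash check fits in a few lines. This is a toy sketch that fingerprints whitespace tokens; a production version would hash word shingles and weight by term frequency, and the helper names are illustrative only:

```python
import hashlib

def simhash64(text):
    """64-bit SimHash: each token's hash votes +1/-1 per bit; keep the sign."""
    votes = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(
            hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def near_duplicate(body_a, body_b, max_distance=3):
    """Layer 4 rule: Hamming distance <= 3 confirms a near-duplicate."""
    return hamming(simhash64(body_a), simhash64(body_b)) <= max_distance
```

Unlike a cryptographic hash, SimHash degrades gracefully: texts sharing most tokens land within a few bits of each other, which is what makes the distance threshold meaningful.
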
## Step 4: Importance Scoring

- For each article with disposition `unique`, `revision`, or `same_event`, compute a composite importance score (0-100):
  - **Source Authority (25%)**:
    - Look up the feed's authority tier (T1-T5) from the source registry (see knowledge/best-practices.md)
    - Map tier to base score: T1=95, T2=80, T3=65, T4=50, T5=30
    - Adjust by feed health score: multiply by `(successful_polls / total_polls)` over the last 30 days
  - **Recency (20%)**:
    - Compute hours since publication: `hours_old = (now - pubDate) / 3600`
    - Apply decay: `recency_score = 100 * e^(-0.03 * hours_old)`
    - IF `dateUncertain: true` THEN apply a 20% penalty to the recency score
  - **Cross-Source Corroboration (20%)**:
    - Count the number of unique feeds that produced articles in the same dedup cluster
    - Compute: `corroboration_score = min(100, cluster_source_count * 25)`
    - Apply source diversity bonus: if cluster sources span 3+ authority tiers, add 10 points (capped at 100)
  - **Topic Relevance (20%)**:
    - Compute TF-IDF vector of the article (title + first 200 words)
    - Compute cosine similarity against the user's interest profile vector
    - Scale to 0-100: `relevance_score = cosine_similarity * 100`
    - IF no user interest profile exists THEN default to 50 (neutral)
  - **Content Depth (15%)**:
    - Evaluate content signals:
      - Word count: 0-200 words = 20pts, 200-500 = 50pts, 500-1000 = 75pts, 1000+ = 100pts
      - Contains data (numbers, statistics, percentages): +15pts
      - Contains structured content (tables, lists >= 3 items): +10pts
      - Has citations or external references (links to sources): +10pts
    - Cap at 100
    - IF `partialContent: true` THEN cap at 40 (cannot assess depth of truncated content)
- **Final score**: `importance = 0.25*authority + 0.20*recency + 0.20*corroboration + 0.20*relevance + 0.15*depth`
- SELF-CHECK: IF the highest scoring article is from a T5 source THEN review corroboration and relevance scores for anomalies

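The scoring formulas above reduce to straightforward arithmetic. A sketch with hypothetical names, assuming the per-component scores are pre-computed on the 0-100 scale:

```python
import math

# Weights and tier base scores exactly as specified in Step 4
WEIGHTS = {"authority": 0.25, "recency": 0.20, "corroboration": 0.20,
           "relevance": 0.20, "depth": 0.15}
TIER_BASE = {1: 95, 2: 80, 3: 65, 4: 50, 5: 30}   # T1..T5

def recency_score(hours_old, date_uncertain=False):
    """100 * e^(-0.03h), with the 20% penalty for uncertain dates."""
    score = 100 * math.exp(-0.03 * hours_old)
    return score * 0.8 if date_uncertain else score

def corroboration_score(cluster_source_count, spans_three_tiers=False):
    """min(100, sources * 25), plus the diversity bonus, capped at 100."""
    score = min(100, cluster_source_count * 25)
    if spans_three_tiers:
        score = min(100, score + 10)
    return score

def importance(scores):
    """Weighted composite of per-component 0-100 scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Because the weights sum to 1.0, an article scoring 100 on every component scores exactly 100 overall, which keeps the tier thresholds in Step 6 (70, 40, 15) interpretable.
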
## Step 5: Topic Clustering

- Collect all articles from the current digest window (default: past 24 hours) that scored above the minimum threshold (default: importance >= 15)
- Prepare text features:
  - For each article, concatenate: title (weighted 2x) + first 200 words of plain text body
  - Preprocess: lowercase, remove stop words, apply stemming (Porter stemmer or equivalent)
  - Compute TF-IDF vectors using the current digest window as the corpus
- Apply DBSCAN clustering:
  - Distance metric: cosine distance (`1 - cosine_similarity`)
  - Parameters: `eps=0.4`, `min_samples=2`
  - Articles that do not cluster (noise points) are treated as standalone topics
- IF the user requests a fixed number of topics THEN use Agglomerative Clustering with `n_clusters=k` and Ward linkage instead of DBSCAN
- For each cluster, generate a topic label:
  1. Compute the cluster centroid (mean TF-IDF vector)
  2. Extract the top 3 terms by TF-IDF weight from the centroid
  3. Identify the most frequent named entity (person, org, or location) across cluster articles
  4. Compose label: IF named entity found THEN "[Entity]: [top terms]" ELSE "[Top 3 Terms]"
- Select a representative "lead" article for each cluster:
  1. Pick the article with the highest cosine similarity to the cluster centroid
  2. IF tie THEN prefer the article with the higher importance score
  3. IF still tied THEN prefer the article from the higher authority tier source
- Detect trend signals:
  - **Emerging topic**: Cluster that did not exist in the previous digest window but has 3+ articles now
  - **Growing topic**: Cluster whose article count increased by 50%+ compared to the previous window
  - **Declining topic**: Cluster whose article count decreased by 50%+ compared to the previous window
- Tag clusters with trend indicators: `[EMERGING]`, `[TRENDING]`, `[FADING]`

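The centroid-labeling recipe (steps 1-2 above) can be sketched in pure Python. This is a minimal illustration using plain `log(N/df)` weighting; a real pipeline would add the stop-word removal and stemming described above, and all helper names are hypothetical:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for tokenized docs, using the digest window as the corpus."""
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    return [{term: (count / len(doc)) * math.log(n / df[term])
             for term, count in Counter(doc).items()} for doc in docs]

def centroid(vectors):
    """Mean TF-IDF vector of a cluster (step 1 of the label recipe)."""
    total = Counter()
    for vec in vectors:
        total.update(vec)                # Counter.update sums the weights
    return {term: weight / len(vectors) for term, weight in total.items()}

def top_terms(vector, k=3):
    """Highest-weighted centroid terms: candidates for the topic label (step 2)."""
    return [term for term, _ in sorted(vector.items(), key=lambda kv: -kv[1])[:k]]
```

Note that terms appearing in every document of the window get an IDF of zero, so the label naturally favors terms that distinguish the cluster from the rest of the digest.
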
## Step 6: Digest Assembly & Output

- Determine the digest type based on context:
  - IF user requested a specific digest type THEN use that format
  - IF scheduled delivery THEN use the time-appropriate format (morning brief, midday update, evening recap, weekly roundup)
  - DEFAULT: morning brief format
- Sort topic clusters by the maximum importance score of any article in the cluster (descending)
- Assemble the digest following the structure from knowledge/best-practices.md:
  - **Header**: Digest type, date range, total article count, duplicate count removed
  - **Top Stories** (importance >= 70): Full topic clusters with lead article summary (50-75 words), source attribution, importance score, trend indicator, and related article count
  - **Noteworthy** (importance 40-69): Condensed topic entries with one-line summaries and source/date
  - **Also Mentioned** (importance 15-39): Single-line entries with title, source, and link only
  - **Feed Health Report**: Feeds polled, success rate, error count, unhealthy feeds flagged, emerging/declining topics
- Apply digest sizing limits:
  - Morning brief / evening recap: max 15 top stories, 10 noteworthy, 10 also mentioned
  - Midday update: max 10 top stories, 5 noteworthy, 5 also mentioned
  - Weekly roundup: max 30 top stories, 15 noteworthy, 15 also mentioned
- For each digest item, include:
  - Topic label (with trend indicator if applicable)
  - Lead article: title, source name, source authority tier, publication date, URL, importance score
  - Summary: 50-75 words capturing the key facts and significance
  - Related articles: count and source names (e.g., "4 more from Reuters, BBC, TechCrunch, Ars Technica")
- SELF-CHECK before output:
  - Are all sources properly attributed with names and URLs?
  - Do top stories genuinely represent the most important developments?
  - Is the digest within sizing limits?
  - Are topic labels clear and informative?
  - Are trend indicators accurate (compare against previous digest)?
- IF any check fails THEN adjust: re-rank, re-label, or trim as needed
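The ranking, tier thresholds, and sizing limits above combine into one small assembly pass. A sketch with hypothetical data shapes (clusters as dicts of `label` plus `articles` with `importance` scores):

```python
# Sizing limits from Step 6: (top stories, noteworthy, also mentioned)
SIZING = {"morning_brief": (15, 10, 10), "evening_recap": (15, 10, 10),
          "midday_update": (10, 5, 5), "weekly_roundup": (30, 15, 15)}

def assemble_digest(clusters, digest_type="morning_brief"):
    """Rank clusters by their best article and bucket them into digest tiers."""
    top_cap, note_cap, also_cap = SIZING[digest_type]
    ranked = sorted(clusters,
                    key=lambda c: max(a["importance"] for a in c["articles"]),
                    reverse=True)
    digest = {"top_stories": [], "noteworthy": [], "also_mentioned": []}
    for cluster in ranked:
        best = max(a["importance"] for a in cluster["articles"])
        if best >= 70 and len(digest["top_stories"]) < top_cap:
            digest["top_stories"].append(cluster["label"])
        elif 40 <= best < 70 and len(digest["noteworthy"]) < note_cap:
            digest["noteworthy"].append(cluster["label"])
        elif 15 <= best < 40 and len(digest["also_mentioned"]) < also_cap:
            digest["also_mentioned"].append(cluster["label"])
        # best < 15 falls below the digest threshold and is dropped
    return digest
```

Sorting once up front means each tier list is automatically in descending importance order, which satisfies the "sort topic clusters" step and the sizing limits in a single pass.
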