@botlearn/rss-manager 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 BotLearn

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,35 @@
# @botlearn/rss-manager

> Multi-source RSS/Atom feed aggregation, deduplication, importance scoring, topic clustering, and daily digest generation for OpenClaw Agent

## Installation

```bash
# via npm
npm install @botlearn/rss-manager

# via clawhub
clawhub install @botlearn/rss-manager
```

## Category

Information Retrieval

## Dependencies

None

## Files

| File | Description |
|------|-------------|
| `manifest.json` | Skill metadata and configuration |
| `skill.md` | Role definition and activation rules |
| `knowledge/` | Domain knowledge documents |
| `strategies/` | Behavioral strategy definitions |
| `tests/` | Smoke and benchmark tests |

## License

MIT
@@ -0,0 +1,94 @@
---
domain: rss-manager
topic: anti-patterns
priority: medium
ttl: 30d
---

# RSS Manager -- Anti-Patterns

## Feed Parsing Anti-Patterns

### 1. Assuming Well-Formed XML
- **Problem**: Treating all feeds as valid XML and failing hard on the first parse error. In practice, 5-15% of feeds in the wild contain malformed XML: unescaped ampersands, invalid characters, unclosed tags, or mixed encoding declarations
- **Fix**: Use a lenient XML parser with fallback strategies. On parse failure: (1) attempt HTML-tolerant parsing, (2) try fixing common issues (unescaped `&`, invalid UTF-8 bytes), (3) attempt regex-based extraction as last resort. Log parse errors per feed for health monitoring

### 2. Ignoring Namespace Prefixes
- **Problem**: Hardcoding namespace prefixes like `content:encoded` instead of resolving by namespace URI. Feed A might use `content:encoded` while Feed B uses `c:encoded` -- both map to the same namespace URI but a prefix-based parser breaks on Feed B
- **Fix**: Resolve all elements by their full namespace URI (`http://purl.org/rss/1.0/modules/content/`), not the prefix string. Use a namespace-aware XML parser

### 3. Trusting Feed-Declared Encoding
- **Problem**: Accepting the XML declaration's encoding attribute (`encoding="UTF-8"`) without verification. Many feeds declare UTF-8 but actually serve Windows-1252 or ISO-8859-1, causing mojibake in non-ASCII characters
- **Fix**: Detect actual encoding using BOM detection and byte-pattern analysis. If detected encoding conflicts with declared encoding, trust the detection. Always validate that decoded text is valid Unicode before processing

### 4. Ignoring Content Type Negotiation
- **Problem**: Not sending proper `Accept` headers when requesting feeds, or not checking the response `Content-Type`. Some servers return HTML error pages or redirects with `text/html` instead of the expected feed XML
- **Fix**: Send `Accept: application/rss+xml, application/atom+xml, application/xml, text/xml;q=0.9` header. Verify response `Content-Type` before parsing. If HTML is returned, attempt feed auto-discovery from the HTML page

## Deduplication Anti-Patterns

### 5. URL-Only Deduplication
- **Problem**: Deduplicating solely by comparing URLs. This misses: (1) the same article with different tracking parameters, (2) the same article syndicated to multiple domains, (3) updated articles with new URLs, and (4) AMP vs canonical URL variants
- **Fix**: Use the multi-signal deduplication pipeline from knowledge/best-practices.md. URL matching should be one layer (after canonicalization), not the only layer. Always combine with title similarity and content fingerprinting

### 6. Title-Only Deduplication
- **Problem**: Deduplicating solely by title match. Short titles like "Q3 Earnings Report" or "Weekly Update" produce massive false positive matches across unrelated feeds. Conversely, slightly reworded titles ("Company X Acquires Y" vs "Y Acquired by Company X") produce false negatives
- **Fix**: Never use title matching alone. Combine with at least one content-level signal (fingerprint or entity overlap). For titles under 5 words, require additional confirmation signals. Use fuzzy matching with appropriate thresholds per title length

### 7. Aggressive Deduplication Without Clustering
- **Problem**: Discarding all but one article when duplicates are detected, losing valuable diverse perspectives. If Reuters, BBC, and a domain expert all cover the same event, the domain expert's unique analysis gets discarded
- **Fix**: Cluster related articles rather than deleting duplicates. Select the most comprehensive article as the "lead" but preserve other sources as "Related" entries. The digest should show source diversity, not suppress it

### 8. Ignoring Article Updates
- **Problem**: Treating an article with a matching GUID but updated content as a duplicate and ignoring the update. Many feeds legitimately update articles: correcting errors, adding developments, or appending editor notes
- **Fix**: When a GUID matches but content has changed (detected via content hash), treat it as an article revision. Keep the latest version but note "Updated" in the digest. Track revision history for articles that update frequently

## Importance Scoring Anti-Patterns

### 9. Recency-Only Ranking
- **Problem**: Ranking articles purely by publication date, treating all newer articles as more important. This surfaces low-quality recent content above high-quality older content and is easily gamed by feeds that backdate or repeatedly update timestamps
- **Fix**: Use the multi-dimensional scoring model from knowledge/best-practices.md. Recency should be one factor (20% weight), balanced against source authority, cross-source corroboration, topic relevance, and content depth

### 10. Equal Source Weighting
- **Problem**: Treating all feed sources as equally authoritative. A random blog post and a Reuters wire report about the same event get the same importance score, leading to unreliable content surfacing in top stories
- **Fix**: Maintain per-source authority tiers (T1-T5) and apply authority weight to importance scoring. Initialize tiers from known source lists and refine based on historical signal quality, factual accuracy, and user feedback

### 11. Ignoring Cross-Source Corroboration
- **Problem**: Scoring articles independently without considering how many other sources cover the same story. A single blog post about an event gets the same importance as a story covered by 15 major outlets
- **Fix**: After deduplication clustering, use the cluster size as a corroboration signal. Stories covered by multiple independent sources are more likely to be genuinely important. Weight: cluster_size * source_diversity_factor

### 12. Static Interest Profiles
- **Problem**: Using a fixed user interest profile that never adapts. The user's interests shift over time, but the digest keeps surfacing topics they no longer care about while missing emerging interests
- **Fix**: Implement interest profile decay (reduce weight for unengaged topics by 10%/week) and reinforcement (boost topics the user clicks through on). Allow explicit user feedback ("more like this" / "less like this") to directly adjust weights

## Digest Generation Anti-Patterns

### 13. Information Overload Digests
- **Problem**: Including every article from every feed in the digest, producing an overwhelming wall of text. Users receiving 200+ item digests stop reading them entirely, defeating the purpose of aggregation
- **Fix**: Apply strict digest sizing limits from knowledge/best-practices.md. Morning briefs: 10-15 top stories with 50-75 word summaries. Use importance scoring to ruthlessly prioritize. Surface detail on demand ("Show me more about this topic") rather than by default

### 14. Flat List Presentation
- **Problem**: Presenting digest items as a flat chronological list with no organization. Users must scan the entire list to find topics they care about, and related articles about the same event appear scattered throughout
- **Fix**: Organize digests by topic clusters. Group related articles under topic headings. Within each topic, order by importance score. Provide a table of contents at the top with topic labels and article counts

### 15. Missing Source Attribution
- **Problem**: Summarizing articles without attributing the original source. Users cannot verify information, assess credibility, or navigate to the full article. Aggregation without attribution also raises ethical and legal concerns
- **Fix**: Every digest item must include: source name, publication date, direct URL to the original article, and source authority tier. When clustering, list all contributing sources

### 16. Stale Digest Windows
- **Problem**: Using a fixed 24-hour digest window regardless of user behavior or news cycle. Breaking news gets delayed until the next scheduled digest, while slow news days produce padding with low-quality content
- **Fix**: Support multiple digest cadences (morning brief, midday update, evening recap). Implement "breaking news" threshold: if an article scores above 90 importance and is from a T1 source, consider immediate notification outside the regular digest schedule

## Feed Health Anti-Patterns

### 17. Silent Feed Failures
- **Problem**: Continuing to poll feeds that consistently fail (404, 500, parse errors) without alerting the user. The user believes they are monitoring a source that has actually been dead for weeks
- **Fix**: Track per-feed error rate over a rolling window. After 3 consecutive failures or >50% failure rate over 7 days, mark the feed as unhealthy and alert the user. Suggest alternatives if available (e.g., the feed may have moved to a new URL)

### 18. No Feed Diversity Monitoring
- **Problem**: Not tracking the topical diversity of subscribed feeds. The user may subscribe to 20 feeds that all cover the same narrow topic, creating an echo chamber with massive duplication and no breadth
- **Fix**: Periodically analyze the topic distribution across all subscribed feeds. Report: "80% of your feeds cover AI/ML; consider adding feeds for [underrepresented topics based on stated interests]." Show a diversity dashboard with topic coverage breakdown

### 19. Ignoring Feed Freshness Decay
- **Problem**: Continuing to poll feeds at the same rate even when they haven't published new content in weeks or months. This wastes resources and clutters the feed list with dormant sources
- **Fix**: Implement adaptive polling (see knowledge/best-practices.md). After 14+ days of no new content, reduce polling to once daily. After 30+ days, reduce to weekly. After 90+ days, prompt the user: "This feed appears inactive. Keep monitoring or unsubscribe?"
@@ -0,0 +1,208 @@
---
domain: rss-manager
topic: feed-management-dedup-and-scoring
priority: high
ttl: 30d
---

# RSS Manager -- Best Practices

## Feed Collection & Polling

### 1. Respect Publisher Update Intervals

Before polling a feed, check for update frequency hints:
- **RSS `<ttl>`** -- Minutes the feed can be cached (e.g., `<ttl>60</ttl>` means poll no more than once per hour)
- **`<sy:updatePeriod>` and `<sy:updateFrequency>`** (RSS syndication module, also common in RSS 2.0 feeds) -- e.g., `hourly` with frequency `1` means once per hour
- **HTTP `Cache-Control` / `Expires` headers** -- Standard cache directives
- **HTTP `ETag` / `Last-Modified` headers** -- Use conditional requests (`If-None-Match`, `If-Modified-Since`) to avoid re-downloading unchanged feeds
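The conditional-request hint above can be sketched as a small header-building helper; the function name is illustrative, not part of this package's API:

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers for a conditional feed fetch.

    Pass the ETag / Last-Modified values cached from the previous
    response; an unchanged feed then returns 304 with no body.
    """
    headers = {
        "Accept": "application/rss+xml, application/atom+xml, "
                  "application/xml, text/xml;q=0.9",
    }
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

On the next poll, store the response's `ETag` and `Last-Modified` and feed them back into this helper.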

Default polling schedule when no hints are available:

| Feed Type | Suggested Interval | Rationale |
|-----------|-------------------|-----------|
| Breaking news | 15 minutes | High update frequency expected |
| Major news outlets | 30 minutes | Regular updates throughout the day |
| Blog / personal site | 2-4 hours | Updates less frequently |
| Weekly newsletter | 12-24 hours | Low frequency, conserve resources |
| Dormant / low-activity | 24 hours | Check daily, reclassify if activity increases |

### 2. Adaptive Polling

Track feed update patterns over time and adjust polling frequency:
- If a feed hasn't changed in 5 consecutive polls, double the interval (up to 24 hours max)
- If a feed has new content on every poll, halve the interval (down to the minimum allowed by TTL/headers)
- Track the average number of new items per poll to predict optimal timing
- Maintain a per-feed "reliability score" based on: uptime, valid XML rate, consistent timestamps
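The doubling/halving rule above can be sketched as follows, assuming a 15-minute floor and 24-hour ceiling (names and bounds are illustrative; the real floor should come from `<ttl>`/cache headers):

```python
MIN_INTERVAL = 15 * 60        # assumed floor, e.g. from <ttl> or Cache-Control
MAX_INTERVAL = 24 * 60 * 60   # 24-hour ceiling

def next_interval(current, unchanged_polls, new_on_every_poll):
    """Adjust the polling interval (seconds) for one feed.

    Double after 5 consecutive unchanged polls; halve when every recent
    poll brought new items; clamp to [MIN_INTERVAL, MAX_INTERVAL].
    """
    if unchanged_polls >= 5:
        current *= 2
    elif new_on_every_poll:
        current //= 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, current))
```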

### 3. Error Handling

- **HTTP 301 (Moved Permanently)** -- Update the stored feed URL
- **HTTP 410 (Gone)** -- Mark feed as dead; alert the user; stop polling
- **HTTP 429 (Too Many Requests)** -- Back off using `Retry-After` header; double interval
- **XML parse failures** -- Attempt recovery with lenient parsing; if persistent (3+ failures), flag feed as unhealthy
- **Timeout** -- Set a 30-second timeout per feed; retry once after 60 seconds; mark as slow if persistent

## Deduplication Strategies

### Multi-Signal Deduplication Pipeline

Deduplication should use a layered approach, from cheapest to most expensive:

#### Layer 1: GUID / Entry ID Match (Exact)
- Compare `<guid>` (RSS) or `<id>` (Atom) values directly
- Cheapest and most reliable when present
- Caveat: Some feeds reuse GUIDs or change them on updates

#### Layer 2: URL Canonicalization Match (Exact)
- Canonicalize URLs (see knowledge/domain.md) and compare
- Catches the same article shared via different tracking URLs
- Handles `http` vs `https`, `www` vs non-www variants

#### Layer 3: Title Similarity (Fuzzy)
- Normalize titles: lowercase, strip punctuation, remove common prefixes ("Breaking:", "Update:", "ICYMI:")
- Use Jaccard similarity on word sets; threshold >= 0.85 indicates a likely duplicate
- For short titles (< 5 words), require a higher threshold (>= 0.95) to avoid false positives
- Optionally use Levenshtein distance ratio as a secondary signal
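Layer 3 can be sketched as below; the prefix list and thresholds follow the rules above, and the function names are illustrative:

```python
import string

PREFIXES = ("breaking:", "update:", "icymi:")

def normalize_title(title):
    """Lowercase, strip known prefixes, drop punctuation, split to words."""
    t = title.lower().strip()
    for p in PREFIXES:
        if t.startswith(p):
            t = t[len(p):].strip()
    t = t.translate(str.maketrans("", "", string.punctuation))
    return t.split()

def title_similarity(a, b):
    """Jaccard similarity of the two normalized word sets."""
    wa, wb = set(normalize_title(a)), set(normalize_title(b))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def is_duplicate_title(a, b):
    """Apply >= 0.85 normally, >= 0.95 for titles under 5 words."""
    words = min(len(normalize_title(a)), len(normalize_title(b)))
    threshold = 0.95 if words < 5 else 0.85
    return title_similarity(a, b) >= threshold
```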

#### Layer 4: Content Fingerprinting (Fuzzy)
- **SimHash**: Compute a 64-bit fingerprint of the article body (after HTML stripping and normalization). Articles with Hamming distance <= 3 are near-duplicates
- **MinHash with LSH (Locality-Sensitive Hashing)**: Compute k-shingle sets, generate MinHash signatures (128 hashes), use LSH bands (b=16, r=8) for candidate pair detection. Jaccard similarity >= 0.7 confirms near-duplicate
- **Sentence-level overlap**: Extract the first 3 non-trivial sentences (> 10 words each); if 2+ match another article, flag as near-duplicate
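A minimal, unweighted SimHash sketch of the first fingerprinting option (a production version would weight tokens, e.g. by TF-IDF, rather than counting each once):

```python
import hashlib

def simhash64(text):
    """64-bit SimHash over whitespace tokens of pre-cleaned text."""
    vec = [0] * 64
    for token in text.lower().split():
        # Stable 64-bit hash per token (first 8 bytes of MD5).
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            vec[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if vec[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming(a, b):
    """Number of differing bits; <= 3 indicates a near-duplicate."""
    return bin(a ^ b).count("1")
```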

#### Layer 5: Entity & Fact Overlap (Semantic)
- Extract named entities (people, organizations, locations, dates) from both articles
- If entity overlap >= 80% and temporal proximity <= 24 hours, likely covering the same event
- Use this layer to merge "same story, different angle" articles into a cluster rather than discarding

### Dedup Decision Matrix

| Signal Combination | Action |
|-------------------|--------|
| GUID match | Exact duplicate -- merge, keep latest version |
| URL match (after canonicalization) | Exact duplicate -- merge, keep the one with more content |
| Title similarity >= 0.85 + URL domain differs | Near-duplicate from different sources -- cluster together |
| Content fingerprint match | Near-duplicate -- cluster together, note source diversity |
| Entity overlap >= 80% + within 24h | Same event coverage -- cluster, select most comprehensive as primary |
| Title similarity >= 0.85 + content fingerprint mismatch | Updated/revised article -- keep both, flag as revision |

## Importance Scoring

### Weighted Scoring Model

Score each article on a 0-100 scale using the following weighted dimensions:

| Dimension | Weight | Signal Sources |
|-----------|--------|---------------|
| Source Authority | 25% | Domain reputation tier (T1-T5), historical accuracy, feed health score |
| Recency | 20% | Publication age relative to poll time; decay function: `score = 100 * e^(-0.03 * hours_old)` |
| Cross-Source Corroboration | 20% | Number of independent sources covering the same story (from dedup clustering) |
| Topic Relevance | 20% | Cosine similarity between article TF-IDF vector and user interest profile vector |
| Content Depth | 15% | Word count (normalized), presence of data/citations, structured content (tables, lists) |
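The weighted model reduces to a simple weighted sum once each dimension has been scored on 0-100 (dictionary keys are illustrative shorthand for the table's dimensions):

```python
WEIGHTS = {
    "authority": 0.25,      # Source Authority
    "recency": 0.20,        # Recency
    "corroboration": 0.20,  # Cross-Source Corroboration
    "relevance": 0.20,      # Topic Relevance
    "depth": 0.15,          # Content Depth
}

def importance(scores):
    """Combine per-dimension scores (each 0-100) into one 0-100 score."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```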

### Source Authority Tiers

| Tier | Description | Base Score | Examples |
|------|-------------|------------|---------|
| T1 | Wire services, official sources | 90-100 | Reuters, AP, government feeds, RFC publications |
| T2 | Major established outlets | 75-89 | NYT, BBC, Nature, IEEE Spectrum |
| T3 | Respected niche/industry sources | 60-74 | TechCrunch, Ars Technica, The Verge, domain-specific journals |
| T4 | Community & expert blogs | 40-59 | Popular personal blogs, Medium publications with editors, curated newsletters |
| T5 | Unverified / user-generated | 20-39 | Anonymous blogs, auto-generated feeds, low-quality aggregators |

### Recency Decay Function

Apply an exponential decay to the recency score so that newer articles score higher:

```
recency_score = 100 * e^(-lambda * hours_since_publication)

lambda = 0.03 (half-life ~ 23 hours)
```

| Age | Recency Score |
|-----|--------------|
| 0 hours | 100 |
| 6 hours | 84 |
| 12 hours | 70 |
| 24 hours | 49 |
| 48 hours | 24 |
| 72 hours | 12 |

Adjust lambda per user preference: set lambda lower (e.g., 0.01) for users who prefer weekly digests; higher (e.g., 0.05) for real-time monitoring.
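The decay formula and table above translate directly to code:

```python
import math

def recency_score(hours_old, lam=0.03):
    """Exponential recency decay: 100 * e^(-lambda * age_in_hours).

    With lam=0.03 the half-life is ln(2)/0.03, roughly 23 hours,
    matching the table (24h old -> ~49, 48h old -> ~24).
    """
    return 100 * math.exp(-lam * hours_old)
```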

### User Interest Profile

Maintain a vector of topic weights representing the user's interests:
- Initialize from explicit topic subscriptions and feed categories
- Update dynamically based on click-through behavior (articles the user reads vs skips)
- Decay stale interests: reduce weight by 10% per week for topics the user hasn't engaged with
- Use the interest profile vector to compute topic relevance scores via cosine similarity with article TF-IDF vectors

## Topic Clustering

### TF-IDF Vectorization

1. Preprocess text: lowercase, remove stop words, apply stemming or lemmatization
2. Compute TF-IDF vectors using the corpus of articles from the current digest window
3. Use only the title + first 200 words of the body for efficiency
4. Maintain a background IDF model updated weekly from all ingested articles

### Clustering Algorithm Selection

| Method | When to Use | Parameters |
|--------|------------|------------|
| DBSCAN | Default choice; handles variable cluster sizes and noise well | eps=0.4 (cosine distance), min_samples=2 |
| Agglomerative (Ward) | When you need a fixed number of topics (e.g., "give me 5 topics") | n_clusters=k, linkage=ward |
| Online K-Means | Streaming / real-time updates where articles arrive continuously | n_clusters=k (estimated from historical data) |

### Cluster Labeling

For each cluster, generate a human-readable topic label:
1. Extract the top 3 TF-IDF terms from the cluster centroid
2. Identify the most common named entity (person, org, or location) across cluster articles
3. Combine into a label: "[Named Entity]: [Top Terms]" (e.g., "OpenAI: language model GPT release")
4. If no clear named entity, use the top 3 terms as the label

### Representative Article Selection

From each cluster, select one "lead" article to represent the topic:
1. Pick the article closest to the cluster centroid (most representative)
2. Break ties by importance score (higher is better)
3. Break further ties by source authority tier (prefer T1/T2)
4. Include the lead article's summary in the digest; list other cluster articles as "Related"

## Digest Generation

### Digest Structure

```
# Daily Digest -- [Date]
## Top Stories (importance >= 70)
[Topic Label 1]
- Lead article summary (source, date, importance score)
- Related: [n] more articles from [sources]
[Topic Label 2]
- ...

## Noteworthy (importance 40-69)
[Topic clusters organized by category]

## Also Mentioned (importance < 40)
[Brief one-line entries]

## Feed Health Report
- [n] feeds polled, [n] successful, [n] errors
- [n] new articles, [n] duplicates removed
- Emerging topic: [topic gaining traction]
- Declining topic: [topic losing traction]
```

### Digest Sizing

| Digest Type | Max Stories | Max Words per Summary | Delivery Window |
|-------------|-----------|---------------------|----------------|
| Morning brief | 10-15 | 50-75 | 06:00-08:00 local |
| Midday update | 5-10 | 30-50 | 11:30-13:00 local |
| Evening recap | 10-15 | 50-75 | 17:00-19:00 local |
| Weekly roundup | 20-30 | 100-150 | Saturday/Sunday morning |
@@ -0,0 +1,203 @@
---
domain: rss-manager
topic: feed-formats-parsing-and-content-extraction
priority: high
ttl: 30d
---

# RSS/Atom Feed Formats, XML Parsing & Content Extraction

## Feed Format Specifications

### RSS 2.0 (Really Simple Syndication)

RSS 2.0 is the most widely deployed syndication format. A valid RSS 2.0 document is XML with a root `<rss>` element containing a single `<channel>`.

#### Channel-Level Elements

| Element | Required | Description |
|---------|----------|-------------|
| `<title>` | Yes | Name of the feed (e.g., "TechCrunch") |
| `<link>` | Yes | URL of the HTML website associated with the feed |
| `<description>` | Yes | Summary of what the feed contains |
| `<language>` | No | Language code (e.g., "en-us", "zh-cn") |
| `<lastBuildDate>` | No | Last time feed content changed (RFC 822 date) |
| `<pubDate>` | No | Publication date of the feed content (RFC 822 date) |
| `<ttl>` | No | Minutes the feed can be cached before refresh |
| `<image>` | No | Channel logo with `<url>`, `<title>`, `<link>` sub-elements |
| `<generator>` | No | Software that generated the feed |
| `<managingEditor>` | No | Email of the editorial contact |
| `<category>` | No | One or more categories for the feed |

#### Item-Level Elements

| Element | Required | Description |
|---------|----------|-------------|
| `<title>` | Conditional | Title of the article (required if no description) |
| `<link>` | No | URL of the full article |
| `<description>` | Conditional | Article summary or full content (required if no title) |
| `<author>` | No | Email address of the author |
| `<category>` | No | One or more categories |
| `<pubDate>` | No | Publication date (RFC 822 format) |
| `<guid>` | No | Globally unique identifier; `isPermaLink="true"` means it is a URL |
| `<enclosure>` | No | Attached media; attributes: `url`, `length`, `type` |
| `<comments>` | No | URL of the comments page |
| `<source>` | No | Original feed the item came from; attribute: `url` |

#### RSS 2.0 Namespace Extensions

Common namespace extensions enrich standard RSS:

- **`content:encoded`** (xmlns:content="http://purl.org/rss/1.0/modules/content/") — Full HTML content body, preferred over `<description>` for complete article text
- **`dc:creator`** (xmlns:dc="http://purl.org/dc/elements/1.1/") — Dublin Core author name (more reliable than `<author>`)
- **`dc:date`** — ISO 8601 date (more precise than `<pubDate>`)
- **`slash:comments`** — Comment count (integer)
- **`wfw:commentRss`** — RSS feed URL for the item's comments
- **`media:content`** — Rich media attachments with `url`, `medium`, `type`, `width`, `height`
- **`media:thumbnail`** — Thumbnail image URL

### RSS 1.0 (RDF Site Summary)

RSS 1.0 is RDF-based and uses XML namespaces extensively. Less common than RSS 2.0 but still found in academic and government feeds.

#### Key Differences from RSS 2.0

- Root element is `<rdf:RDF>` (not `<rss>`)
- Items are listed both inside `<channel><items><rdf:Seq>` as references and as top-level `<item>` elements
- Uses `rdf:about` attribute for resource identification
- Relies heavily on Dublin Core (`dc:`) namespace for metadata
- Extensible through RDF modules: `mod_syndication` (update schedule), `mod_taxonomy` (topic classification)

### Atom 1.0 (RFC 4287)

Atom is a more formally specified format than RSS, with clearer semantics and mandatory fields.

#### Feed-Level Elements

| Element | Required | Description |
|---------|----------|-------------|
| `<title>` | Yes | Feed title (supports `type` attribute: text, html, xhtml) |
| `<id>` | Yes | Permanent, universally unique feed identifier (IRI) |
| `<updated>` | Yes | Last time the feed was modified (RFC 3339 / ISO 8601) |
| `<author>` | Conditional | At least one `<author>` with `<name>`, optional `<email>`, `<uri>`; required unless every entry has its own author |
| `<link>` | Recommended | Should include `rel="self"` (feed URL) and `rel="alternate"` (website URL) |
| `<subtitle>` | No | Feed description |
| `<generator>` | No | Software that generated the feed |
| `<icon>` | No | Small feed icon URL |
| `<logo>` | No | Feed logo URL |
| `<rights>` | No | Copyright notice |
| `<category>` | No | One or more categories with `term`, `scheme`, `label` attributes |

#### Entry-Level Elements

| Element | Required | Description |
|---------|----------|-------------|
| `<title>` | Yes | Entry title (with `type` attribute) |
| `<id>` | Yes | Permanent unique identifier for the entry (IRI) |
| `<updated>` | Yes | Last modification timestamp |
| `<published>` | No | Original publication timestamp |
| `<author>` | Conditional | Required if feed-level author is absent |
| `<content>` | Recommended | Full entry content; `type` attribute: text, html, xhtml, or media type |
| `<summary>` | Recommended | Short summary; required if `<content>` is absent or non-text |
| `<link>` | Recommended | `rel="alternate"` for the article URL |
| `<category>` | No | One or more categories |
| `<contributor>` | No | Additional contributors |
| `<source>` | No | Original feed metadata if the entry was aggregated |

## XML Parsing Considerations

### Encoding Detection Priority

1. HTTP `Content-Type` header charset (highest priority)
2. BOM (Byte Order Mark) detection (a BOM overrides the in-document declaration)
3. XML declaration encoding attribute: `<?xml version="1.0" encoding="UTF-8"?>`
4. Default to UTF-8 if none specified

### Common Encoding Issues

- **Double encoding**: UTF-8 bytes decoded as Latin-1/Windows-1252 and re-encoded as UTF-8, producing mojibake (e.g., `Ã©` instead of `é`)
- **Windows-1252 mislabeled as ISO-8859-1**: Characters in the 0x80-0x9F range render incorrectly
- **HTML entities in XML**: `&nbsp;` is valid in HTML but not in XML -- must use `&#160;` or be wrapped in CDATA
- **Unescaped ampersands**: `&` in URLs or text breaks XML parsing -- must be `&amp;`

### CDATA Section Handling

Many feeds wrap HTML content in CDATA sections to avoid XML escaping issues:

```xml
<description><![CDATA[<p>Article with <a href="https://example.com">links</a> and <img src="photo.jpg" /></p>]]></description>
```

Processing steps:
1. Extract raw content from CDATA (stripping `<![CDATA[` and `]]>`)
2. Parse the inner HTML separately
3. Sanitize: strip `<script>`, `<iframe>`, event handlers, `javascript:` URIs
4. Extract plain text for indexing; preserve HTML for display
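Steps 1, 3 (partially), and 4 can be sketched with the standard-library `html.parser`; this is a minimal illustration, not a full sanitizer (it drops blocked elements and extracts text, but does not strip event handlers or `javascript:` URIs from preserved HTML):

```python
from html.parser import HTMLParser

BLOCKED = {"script", "iframe", "style"}

class TextExtractor(HTMLParser):
    """Collect text content, skipping everything inside blocked elements."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in BLOCKED and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def cdata_to_text(raw):
    """Strip an optional CDATA wrapper, then extract whitespace-normalized text."""
    if raw.startswith("<![CDATA[") and raw.endswith("]]>"):
        raw = raw[len("<![CDATA["):-len("]]>")]
    parser = TextExtractor()
    parser.feed(raw)
    return " ".join(" ".join(parser.parts).split())
```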

### Namespace Resolution

Feeds commonly use multiple namespaces. A robust parser must:

1. Resolve namespace prefixes to their URIs (prefixes may differ between feeds)
2. Recognize elements by namespace URI, not prefix (e.g., `content:encoded` and `c:encoded` are the same if both map to `http://purl.org/rss/1.0/modules/content/`)
3. Handle default namespace declarations on the root element
4. Gracefully ignore unknown namespaces rather than failing
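Point 2 can be demonstrated with `xml.etree.ElementTree`, which stores qualified names as `{uri}local`, so lookups by namespace URI succeed regardless of the prefix a feed chose (the sample feed here is fabricated for illustration):

```python
import xml.etree.ElementTree as ET

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# This feed uses the unusual prefix "c:" for the content module --
# a prefix-based parser looking for "content:encoded" would miss it.
FEED = """<rss version="2.0" xmlns:c="http://purl.org/rss/1.0/modules/content/">
  <channel><item>
    <title>Example</title>
    <c:encoded>&lt;p&gt;Full body&lt;/p&gt;</c:encoded>
  </item></channel></rss>"""

root = ET.fromstring(FEED)
# Lookup by {uri}local ignores the prefix entirely.
body = root.find(f".//{{{CONTENT_NS}}}encoded").text
```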
145
+
146
+ ## Content Extraction
147
+
148
+ ### Extracting Article Text
149
+
150
+ Priority order for obtaining article body:
151
+
152
+ 1. **Atom `<content type="html">`** or **`<content type="xhtml">`** -- fullest content
153
+ 2. **RSS `<content:encoded>`** -- full HTML body (namespace extension)
154
+ 3. **Atom `<summary>`** -- may be full text or truncated
155
+ 4. **RSS `<description>`** -- may be full text, truncated, or just a snippet
156
+ 5. **Fetch the linked URL** -- fallback when feed only provides a title or minimal snippet
157
+
158
+ ### Metadata Extraction Checklist
159
+
160
+ For each feed item, extract and normalize:
161
+
162
+ | Field | Primary Source | Fallback | Normalization |
163
+ |-------|---------------|----------|---------------|
164
+ | Title | `<title>` | First line of description | Strip HTML, decode entities, trim whitespace |
165
+ | URL | `<link>` / `<guid isPermaLink="true">` | `<id>` (Atom) | Canonicalize: lowercase host, remove tracking params |
166
+ | Author | `<dc:creator>` / `<author><name>` | `<author>` (email) / `<managingEditor>` | Extract name, discard email if present |
167
+ | Date | `<dc:date>` / `<published>` / `<updated>` | `<pubDate>` | Parse to ISO 8601 UTC; handle RFC 822, RFC 3339, and common non-standard formats |
168
+ | Body | `<content:encoded>` / `<content>` | `<description>` / `<summary>` | Sanitize HTML, extract plain text, calculate word count |
169
+ | Categories | `<category>` (multiple) | `<dc:subject>` | Normalize casing, map synonyms |
170
+ | Media | `<enclosure>` / `<media:content>` | `<media:thumbnail>` / embedded `<img>` | Extract URL, MIME type, dimensions |
171
+ | GUID | `<guid>` / `<id>` | URL | Use as-is for deduplication key |
172
+
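As one concrete row from the table, the Title normalization column might look like the sketch below. Only a handful of common entities are decoded here; a full implementation would use a complete entity table.

```typescript
const ENTITIES: Record<string, string> = {
  "&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"', "&#39;": "'",
};

function normalizeTitle(raw: string): string {
  // Strip tags first, then decode entities (so decoded "<" is not re-stripped)
  let t = raw.replace(/<[^>]+>/g, "");
  t = t.replace(/&(amp|lt|gt|quot|#39);/g, (m) => ENTITIES[m]);
  // Collapse and trim whitespace
  return t.replace(/\s+/g, " ").trim();
}
```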
173
+ ### Date Parsing
174
+
175
+ Feeds use inconsistent date formats. A robust parser must handle:
176
+
177
+ - **RFC 822**: `Mon, 15 Jan 2024 13:45:00 GMT` (RSS 2.0 standard)
178
+ - **RFC 3339 / ISO 8601**: `2024-01-15T13:45:00Z` (Atom standard)
179
+ - **Non-standard variations**: `Jan 15, 2024`, `2024/01/15`, `15-01-2024`, `1705326300` (Unix timestamp)
180
+ - **Timezone ambiguity**: `EST` vs `-0500`; always convert to UTC for consistent comparison
181
+ - **Missing timezone**: Assume UTC and flag as uncertain
182
+
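A tolerant parser covering the cases above might look like this sketch: the Unix-timestamp branch is checked first because `Date.parse` rejects bare digit strings, and everything else funnels through `Date.parse` (which handles RFC 822 and RFC 3339 in practice) with an `uncertain` flag when no timezone marker is visible.

```typescript
function parseFeedDate(raw: string): { iso: string; uncertain: boolean } | null {
  const s = raw.trim();
  if (/^\d{9,11}$/.test(s)) {
    // Unix timestamp in seconds
    return { iso: new Date(Number(s) * 1000).toISOString(), uncertain: false };
  }
  const ms = Date.parse(s);
  if (!Number.isNaN(ms)) {
    // Flag as uncertain when the string carries no timezone information
    const hasTz = /(?:Z|GMT|UTC|[+-]\d{2}:?\d{2}|\b[A-Z]{3}\b)\s*$/.test(s);
    return { iso: new Date(ms).toISOString(), uncertain: !hasTz };
  }
  return null; // unparseable; caller decides on a fallback (e.g. fetch time)
}
```

Note that `Date.parse` on timezone-less strings applies engine-specific rules, which is precisely why such dates should be flagged rather than trusted.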
183
+ ### URL Canonicalization
184
+
185
+ To detect duplicate URLs pointing to the same article:
186
+
187
+ 1. Convert scheme and host to lowercase: `HTTPS://WWW.Example.COM` -> `https://www.example.com`
188
+ 2. Remove default ports: `:80` for HTTP, `:443` for HTTPS
189
+ 3. Remove trailing slashes on paths (unless the path is `/`)
190
+ 4. Sort query parameters alphabetically
191
+ 5. Remove known tracking parameters: `utm_source`, `utm_medium`, `utm_campaign`, `utm_content`, `utm_term`, `ref`, `source`, `fbclid`, `gclid`, `mc_cid`, `mc_eid`
192
+ 6. Decode unnecessary percent-encoding: `%41` -> `A`
193
+ 7. Remove fragment identifiers (`#section`) unless they are part of a single-page app route
194
+
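Most of these steps fall out of the WHATWG `URL` API, which already lowercases scheme and host and elides default ports. The sketch below covers steps 1-5 and 7; percent-encoding normalization (step 6) is omitted for brevity.

```typescript
// Tracking parameters from step 5 above
const TRACKING = new Set([
  "utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term",
  "ref", "source", "fbclid", "gclid", "mc_cid", "mc_eid",
]);

function canonicalizeUrl(raw: string): string {
  const u = new URL(raw);        // steps 1-2: lowercase scheme/host, drop default port
  u.hash = "";                   // step 7: drop fragment
  const params = [...u.searchParams.entries()]
    .filter(([k]) => !TRACKING.has(k))          // step 5: drop tracking params
    .sort(([a], [b]) => a.localeCompare(b));    // step 4: sort params
  u.search = new URLSearchParams(params).toString();
  if (u.pathname !== "/" && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);       // step 3: trim trailing slash
  }
  return u.toString();
}
```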
195
+ ### Feed Discovery
196
+
197
+ When given a website URL instead of a feed URL, discover feeds as follows:
198
+
199
+ 1. Check `<link rel="alternate" type="application/rss+xml">` in HTML `<head>`
200
+ 2. Check `<link rel="alternate" type="application/atom+xml">` in HTML `<head>`
201
+ 3. Try common paths: `/feed`, `/rss`, `/atom.xml`, `/feed.xml`, `/rss.xml`, `/index.xml`, `/feeds/posts/default` (Blogger)
202
+ 4. Check `/.well-known/` resources
203
+ 5. Parse the page for embedded feed links in the body content
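Steps 1-2 of the discovery list can be sketched by scanning the HTML head for `<link rel="alternate">` feed declarations. This regex-based version is illustrative only; a real implementation would use an HTML parser and then fall back to probing the common paths from step 3.

```typescript
const FEED_TYPES = ["application/rss+xml", "application/atom+xml"];

function discoverFeedLinks(html: string): string[] {
  const found: string[] = [];
  for (const m of html.matchAll(/<link\b[^>]*>/gi)) {
    const tag = m[0];
    // Pull out the attributes we care about (quoted values only, for simplicity)
    const type = tag.match(/type\s*=\s*["']([^"']+)["']/i)?.[1]?.toLowerCase();
    const rel = tag.match(/rel\s*=\s*["']([^"']+)["']/i)?.[1]?.toLowerCase();
    const href = tag.match(/href\s*=\s*["']([^"']+)["']/i)?.[1];
    if (rel === "alternate" && type && FEED_TYPES.includes(type) && href) {
      found.push(href); // may be relative; resolve against the page URL before fetching
    }
  }
  return found;
}
```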
package/manifest.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "name": "@botlearn/rss-manager",
3
+ "version": "0.1.0",
4
+ "description": "Multi-source RSS/Atom feed aggregation, deduplication, importance scoring, topic clustering, and daily digest generation for OpenClaw Agent",
5
+ "category": "information-retrieval",
6
+ "author": "BotLearn",
7
+ "benchmarkDimension": "information-retrieval",
8
+ "expectedImprovement": 30,
9
+ "dependencies": {},
10
+ "compatibility": {
11
+ "openclaw": ">=0.5.0"
12
+ },
13
+ "files": {
14
+ "skill": "skill.md",
15
+ "knowledge": [
16
+ "knowledge/domain.md",
17
+ "knowledge/best-practices.md",
18
+ "knowledge/anti-patterns.md"
19
+ ],
20
+ "strategies": [
21
+ "strategies/main.md"
22
+ ],
23
+ "smokeTest": "tests/smoke.json",
24
+ "benchmark": "tests/benchmark.json"
25
+ }
26
+ }
package/package.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "name": "@botlearn/rss-manager",
3
+ "version": "0.1.0",
4
+ "description": "Multi-source RSS/Atom feed aggregation, deduplication, importance scoring, topic clustering, and daily digest generation for OpenClaw Agent",
5
+ "type": "module",
6
+ "main": "manifest.json",
7
+ "files": [
8
+ "manifest.json",
9
+ "skill.md",
10
+ "knowledge/",
11
+ "strategies/",
12
+ "tests/",
13
+ "README.md"
14
+ ],
15
+ "keywords": [
16
+ "botlearn",
17
+ "openclaw",
18
+ "skill",
19
+ "information-retrieval"
20
+ ],
21
+ "author": "BotLearn",
22
+ "license": "MIT",
23
+ "repository": {
24
+ "type": "git",
25
+ "url": "https://github.com/readai-team/botlearn-awesome-skills.git",
26
+ "directory": "packages/skills/rss-manager"
27
+ },
28
+ "homepage": "https://github.com/readai-team/botlearn-awesome-skills/tree/main/packages/skills/rss-manager",
29
+ "bugs": {
30
+ "url": "https://github.com/readai-team/botlearn-awesome-skills/issues"
31
+ },
32
+ "publishConfig": {
33
+ "access": "public"
34
+ }
35
+ }