@botlearn/rss-manager 0.1.0
- package/LICENSE +21 -0
- package/README.md +35 -0
- package/knowledge/anti-patterns.md +94 -0
- package/knowledge/best-practices.md +208 -0
- package/knowledge/domain.md +203 -0
- package/manifest.json +26 -0
- package/package.json +35 -0
- package/skill.md +45 -0
- package/strategies/main.md +161 -0
- package/tests/benchmark.json +476 -0
- package/tests/smoke.json +54 -0
package/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2025 BotLearn
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
package/README.md
ADDED
|
@@ -0,0 +1,35 @@
|
|
|
1
|
+
# @botlearn/rss-manager
|
|
2
|
+
|
|
3
|
+
> Multi-source RSS/Atom feed aggregation, deduplication, importance scoring, topic clustering, and daily digest generation for OpenClaw Agent
|
|
4
|
+
|
|
5
|
+
## Installation
|
|
6
|
+
|
|
7
|
+
```bash
|
|
8
|
+
# via npm
|
|
9
|
+
npm install @botlearn/rss-manager
|
|
10
|
+
|
|
11
|
+
# via clawhub
|
|
12
|
+
clawhub install @botlearn/rss-manager
|
|
13
|
+
```
|
|
14
|
+
|
|
15
|
+
## Category
|
|
16
|
+
|
|
17
|
+
Information Retrieval
|
|
18
|
+
|
|
19
|
+
## Dependencies
|
|
20
|
+
|
|
21
|
+
None
|
|
22
|
+
|
|
23
|
+
## Files
|
|
24
|
+
|
|
25
|
+
| File | Description |
|
|
26
|
+
|------|-------------|
|
|
27
|
+
| `manifest.json` | Skill metadata and configuration |
|
|
28
|
+
| `skill.md` | Role definition and activation rules |
|
|
29
|
+
| `knowledge/` | Domain knowledge documents |
|
|
30
|
+
| `strategies/` | Behavioral strategy definitions |
|
|
31
|
+
| `tests/` | Smoke and benchmark tests |
|
|
32
|
+
|
|
33
|
+
## License
|
|
34
|
+
|
|
35
|
+
MIT
|
package/knowledge/anti-patterns.md
ADDED
|
@@ -0,0 +1,94 @@
|
|
|
1
|
+
---
|
|
2
|
+
domain: rss-manager
|
|
3
|
+
topic: anti-patterns
|
|
4
|
+
priority: medium
|
|
5
|
+
ttl: 30d
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# RSS Manager -- Anti-Patterns
|
|
9
|
+
|
|
10
|
+
## Feed Parsing Anti-Patterns
|
|
11
|
+
|
|
12
|
+
### 1. Assuming Well-Formed XML
|
|
13
|
+
- **Problem**: Treating all feeds as valid XML and failing hard on the first parse error. In practice, 5-15% of feeds in the wild contain malformed XML: unescaped ampersands, invalid characters, unclosed tags, or mixed encoding declarations
|
|
14
|
+
- **Fix**: Use a lenient XML parser with fallback strategies. On parse failure: (1) attempt HTML-tolerant parsing, (2) try fixing common issues (unescaped `&`, invalid UTF-8 bytes), (3) attempt regex-based extraction as last resort. Log parse errors per feed for health monitoring
|
|
15
|
+
|
|
16
|
+
### 2. Ignoring Namespace Prefixes
|
|
17
|
+
- **Problem**: Hardcoding namespace prefixes like `content:encoded` instead of resolving by namespace URI. Feed A might use `content:encoded` while Feed B uses `c:encoded` -- both map to the same namespace URI but a prefix-based parser breaks on Feed B
|
|
18
|
+
- **Fix**: Resolve all elements by their full namespace URI (`http://purl.org/rss/1.0/modules/content/`), not the prefix string. Use a namespace-aware XML parser
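A minimal sketch of URI-based lookup, assuming Python's standard `xml.etree.ElementTree`. The helper name `extract_encoded` and the two sample items are illustrative and not part of this package:

```python
import xml.etree.ElementTree as ET
from typing import Optional

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

def extract_encoded(item_xml: str) -> Optional[str]:
    """Return the content-module `encoded` text, whatever prefix the feed uses."""
    item = ET.fromstring(item_xml)
    # ElementTree expands prefixes to "{namespace-uri}localname", so this
    # lookup does not depend on the prefix string chosen by the publisher.
    node = item.find(f"{{{CONTENT_NS}}}encoded")
    return node.text if node is not None else None

# Same namespace URI, different prefixes (hypothetical sample items).
feed_a = (
    '<item xmlns:content="http://purl.org/rss/1.0/modules/content/">'
    "<content:encoded>Body A</content:encoded></item>"
)
feed_b = (
    '<item xmlns:c="http://purl.org/rss/1.0/modules/content/">'
    "<c:encoded>Body B</c:encoded></item>"
)
```

Both `extract_encoded(feed_a)` and `extract_encoded(feed_b)` resolve the element, because the match is on the namespace URI rather than on `content:` or `c:`.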
|
|
19
|
+
|
|
20
|
+
### 3. Trusting Feed-Declared Encoding
|
|
21
|
+
- **Problem**: Accepting the XML declaration's encoding attribute (`encoding="UTF-8"`) without verification. Many feeds declare UTF-8 but actually serve Windows-1252 or ISO-8859-1, causing mojibake in non-ASCII characters
|
|
22
|
+
- **Fix**: Detect actual encoding using BOM detection and byte-pattern analysis. If detected encoding conflicts with declared encoding, trust the detection. Always validate that decoded text is valid Unicode before processing
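One way this could look, as a rough sketch using only the standard library. The function name `decode_feed` and the choice of Windows-1252 as the fallback are assumptions made for illustration; UTF-32 BOMs and statistical byte-pattern analysis are left out:

```python
import codecs
from typing import Tuple

# BOM -> codec that consumes the BOM while decoding.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def decode_feed(raw: bytes, declared: str = "utf-8") -> Tuple[str, str]:
    """Return (decoded text, encoding actually used)."""
    # 1. A BOM is direct evidence of the real encoding; trust it first.
    for bom, codec in BOMS:
        if raw.startswith(bom):
            return raw.decode(codec), codec
    try:
        # 2. Strict decoding doubles as validation of the declared encoding.
        return raw.decode(declared), declared
    except (UnicodeDecodeError, LookupError):
        # 3. Assumed fallback: Windows-1252, the common mislabel noted above.
        return raw.decode("cp1252", errors="replace"), "cp1252"
```

Note that a declared single-byte encoding such as ISO-8859-1 never fails strict decoding, so this check only catches feeds that falsely declare a multi-byte encoding like UTF-8.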
|
|
23
|
+
|
|
24
|
+
### 4. Ignoring Content Type Negotiation
|
|
25
|
+
- **Problem**: Not sending proper `Accept` headers when requesting feeds, or not checking the response `Content-Type`. Some servers return HTML error pages or redirects with `text/html` instead of the expected feed XML
|
|
26
|
+
- **Fix**: Send `Accept: application/rss+xml, application/atom+xml, application/xml, text/xml;q=0.9` header. Verify response `Content-Type` before parsing. If HTML is returned, attempt feed auto-discovery from the HTML page
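As a hedged illustration, the request and the response check might be sketched with `urllib.request` from the standard library. `build_feed_request` and `looks_like_feed` are hypothetical helper names; no network call is made here:

```python
from urllib.request import Request

FEED_ACCEPT = (
    "application/rss+xml, application/atom+xml, "
    "application/xml, text/xml;q=0.9"
)
FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/xml",
    "text/xml",
}

def build_feed_request(url: str) -> Request:
    """Prepare a request that advertises the feed media types."""
    return Request(url, headers={"Accept": FEED_ACCEPT})

def looks_like_feed(content_type: str) -> bool:
    """Check the response Content-Type, ignoring parameters such as charset."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in FEED_TYPES
```

When `looks_like_feed` returns `False` for a `text/html` response, the caller would fall through to feed auto-discovery rather than handing the body to the XML parser.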
|
|
27
|
+
|
|
28
|
+
## Deduplication Anti-Patterns
|
|
29
|
+
|
|
30
|
+
### 5. URL-Only Deduplication
|
|
31
|
+
- **Problem**: Deduplicating solely by comparing URLs. This misses: (1) the same article with different tracking parameters, (2) the same article syndicated to multiple domains, (3) updated articles with new URLs, and (4) AMP vs canonical URL variants
|
|
32
|
+
- **Fix**: Use the multi-signal deduplication pipeline from knowledge/best-practices.md. URL matching should be one layer (after canonicalization), not the only layer. Always combine with title similarity and content fingerprinting
|
|
33
|
+
|
|
34
|
+
### 6. Title-Only Deduplication
|
|
35
|
+
- **Problem**: Deduplicating solely by title match. Short titles like "Q3 Earnings Report" or "Weekly Update" produce massive false positive matches across unrelated feeds. Conversely, slightly reworded titles ("Company X Acquires Y" vs "Y Acquired by Company X") produce false negatives
|
|
36
|
+
- **Fix**: Never use title matching alone. Combine with at least one content-level signal (fingerprint or entity overlap). For titles under 5 words, require additional confirmation signals. Use fuzzy matching with appropriate thresholds per title length
|
|
37
|
+
|
|
38
|
+
### 7. Aggressive Deduplication Without Clustering
|
|
39
|
+
- **Problem**: Discarding all but one article when duplicates are detected, losing valuable diverse perspectives. If Reuters, BBC, and a domain expert all cover the same event, the domain expert's unique analysis gets discarded
|
|
40
|
+
- **Fix**: Cluster related articles rather than deleting duplicates. Select the most comprehensive article as the "lead" but preserve other sources as "Related" entries. The digest should show source diversity, not suppress it
|
|
41
|
+
|
|
42
|
+
### 8. Ignoring Article Updates
|
|
43
|
+
- **Problem**: Treating an article with a matching GUID but updated content as a duplicate and ignoring the update. Many feeds legitimately update articles: correcting errors, adding developments, or appending editor notes
|
|
44
|
+
- **Fix**: When a GUID matches but content has changed (detected via content hash), treat it as an article revision. Keep the latest version but note "Updated" in the digest. Track revision history for articles that update frequently
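A small sketch of the GUID-plus-content-hash check, assuming an in-memory dict as the store and SHA-256 over whitespace-normalized text as the content hash. Both choices, and the names `content_hash` and `classify_entry`, are illustrative:

```python
import hashlib
from typing import Dict

def content_hash(body: str) -> str:
    """Hash the body after collapsing whitespace, so reflowed text is not a revision."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def classify_entry(seen: Dict[str, str], guid: str, body: str) -> str:
    """Return 'new', 'duplicate', or 'revision' and remember the latest hash."""
    digest = content_hash(body)
    previous = seen.get(guid)
    seen[guid] = digest  # always keep the latest version
    if previous is None:
        return "new"
    return "duplicate" if previous == digest else "revision"
```

An entry classified as `revision` is the case the digest would mark as "Updated".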
|
|
45
|
+
|
|
46
|
+
## Importance Scoring Anti-Patterns
|
|
47
|
+
|
|
48
|
+
### 9. Recency-Only Ranking
|
|
49
|
+
- **Problem**: Ranking articles purely by publication date, treating all newer articles as more important. This surfaces low-quality recent content above high-quality older content and is easily gamed by feeds that post-date items or repeatedly refresh timestamps
|
|
50
|
+
- **Fix**: Use the multi-dimensional scoring model from knowledge/best-practices.md. Recency should be one factor (20% weight), balanced against source authority, cross-source corroboration, topic relevance, and content depth
|
|
51
|
+
|
|
52
|
+
### 10. Equal Source Weighting
|
|
53
|
+
- **Problem**: Treating all feed sources as equally authoritative. A random blog post and a Reuters wire report about the same event get the same importance score, leading to unreliable content surfacing in top stories
|
|
54
|
+
- **Fix**: Maintain per-source authority tiers (T1-T5) and apply authority weight to importance scoring. Initialize tiers from known source lists and refine based on historical signal quality, factual accuracy, and user feedback
|
|
55
|
+
|
|
56
|
+
### 11. Ignoring Cross-Source Corroboration
|
|
57
|
+
- **Problem**: Scoring articles independently without considering how many other sources cover the same story. A single blog post about an event gets the same importance as a story covered by 15 major outlets
|
|
58
|
+
- **Fix**: After deduplication clustering, use the cluster size as a corroboration signal. Stories covered by multiple independent sources are more likely to be genuinely important. Weight: cluster_size * source_diversity_factor
|
|
59
|
+
|
|
60
|
+
### 12. Static Interest Profiles
|
|
61
|
+
- **Problem**: Using a fixed user interest profile that never adapts. The user's interests shift over time, but the digest keeps surfacing topics they no longer care about while missing emerging interests
|
|
62
|
+
- **Fix**: Implement interest profile decay (reduce weight for unengaged topics by 10%/week) and reinforcement (boost topics the user clicks through on). Allow explicit user feedback ("more like this" / "less like this") to directly adjust weights
|
|
63
|
+
|
|
64
|
+
## Digest Generation Anti-Patterns
|
|
65
|
+
|
|
66
|
+
### 13. Information Overload Digests
|
|
67
|
+
- **Problem**: Including every article from every feed in the digest, producing an overwhelming wall of text. Users receiving 200+ item digests stop reading them entirely, defeating the purpose of aggregation
|
|
68
|
+
- **Fix**: Apply strict digest sizing limits from knowledge/best-practices.md. Morning briefs: 10-15 top stories with 50-75 word summaries. Use importance scoring to ruthlessly prioritize. Surface detail on demand ("Show me more about this topic") rather than by default
|
|
69
|
+
|
|
70
|
+
### 14. Flat List Presentation
|
|
71
|
+
- **Problem**: Presenting digest items as a flat chronological list with no organization. Users must scan the entire list to find topics they care about, and related articles about the same event appear scattered throughout
|
|
72
|
+
- **Fix**: Organize digests by topic clusters. Group related articles under topic headings. Within each topic, order by importance score. Provide a table of contents at the top with topic labels and article counts
|
|
73
|
+
|
|
74
|
+
### 15. Missing Source Attribution
|
|
75
|
+
- **Problem**: Summarizing articles without attributing the original source. Users cannot verify information, assess credibility, or navigate to the full article. Aggregation without attribution also raises ethical and legal concerns
|
|
76
|
+
- **Fix**: Every digest item must include: source name, publication date, direct URL to the original article, and source authority tier. When clustering, list all contributing sources
|
|
77
|
+
|
|
78
|
+
### 16. Stale Digest Windows
|
|
79
|
+
- **Problem**: Using a fixed 24-hour digest window regardless of user behavior or news cycle. Breaking news gets delayed until the next scheduled digest, while slow news days produce padding with low-quality content
|
|
80
|
+
- **Fix**: Support multiple digest cadences (morning brief, midday update, evening recap). Implement "breaking news" threshold: if an article scores above 90 importance and is from a T1 source, consider immediate notification outside the regular digest schedule
|
|
81
|
+
|
|
82
|
+
## Feed Health Anti-Patterns
|
|
83
|
+
|
|
84
|
+
### 17. Silent Feed Failures
|
|
85
|
+
- **Problem**: Continuing to poll feeds that consistently fail (404, 500, parse errors) without alerting the user. The user believes they are monitoring a source that has actually been dead for weeks
|
|
86
|
+
- **Fix**: Track per-feed error rate over a rolling window. After 3 consecutive failures or >50% failure rate over 7 days, mark the feed as unhealthy and alert the user. Suggest alternatives if available (e.g., the feed may have moved to a new URL)
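The two thresholds above could be checked roughly as follows. This is a sketch under the assumption that poll history is kept as a list of `(timestamp, succeeded)` tuples, oldest first; `is_unhealthy` is a hypothetical name:

```python
from datetime import datetime, timedelta
from typing import List, Tuple

def is_unhealthy(
    polls: List[Tuple[datetime, bool]],
    now: datetime,
    window: timedelta = timedelta(days=7),
) -> bool:
    """Apply the two rules: 3 consecutive failures, or >50% failures in the window."""
    # Rule 1: the three most recent polls all failed.
    if len(polls) >= 3 and not any(ok for _, ok in polls[-3:]):
        return True
    # Rule 2: failure rate over the rolling window exceeds 50%.
    recent = [ok for ts, ok in polls if now - ts <= window]
    if not recent:
        return False
    return recent.count(False) / len(recent) > 0.5
```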
|
|
87
|
+
|
|
88
|
+
### 18. No Feed Diversity Monitoring
|
|
89
|
+
- **Problem**: Not tracking the topical diversity of subscribed feeds. The user may subscribe to 20 feeds that all cover the same narrow topic, creating an echo chamber with massive duplication and no breadth
|
|
90
|
+
- **Fix**: Periodically analyze the topic distribution across all subscribed feeds. Report: "80% of your feeds cover AI/ML; consider adding feeds for [underrepresented topics based on stated interests]." Show a diversity dashboard with topic coverage breakdown
|
|
91
|
+
|
|
92
|
+
### 19. Ignoring Feed Freshness Decay
|
|
93
|
+
- **Problem**: Continuing to poll feeds at the same rate even when they haven't published new content in weeks or months. This wastes resources and clutters the feed list with dormant sources
|
|
94
|
+
- **Fix**: Implement adaptive polling (see knowledge/best-practices.md). After 14+ days of no new content, reduce polling to once daily. After 30+ days, reduce to weekly. After 90+ days, prompt the user: "This feed appears inactive. Keep monitoring or unsubscribe?"
|
package/knowledge/best-practices.md
ADDED
|
@@ -0,0 +1,208 @@
|
|
|
1
|
+
---
|
|
2
|
+
domain: rss-manager
|
|
3
|
+
topic: feed-management-dedup-and-scoring
|
|
4
|
+
priority: high
|
|
5
|
+
ttl: 30d
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# RSS Manager -- Best Practices
|
|
9
|
+
|
|
10
|
+
## Feed Collection & Polling
|
|
11
|
+
|
|
12
|
+
### 1. Respect Publisher Update Intervals
|
|
13
|
+
|
|
14
|
+
Before polling a feed, check for update frequency hints:
|
|
15
|
+
- **RSS `<ttl>`** -- Minutes the feed can be cached (e.g., `<ttl>60</ttl>` means poll no more than once per hour)
|
|
16
|
+
- **Syndication module `<sy:updatePeriod>` and `<sy:updateFrequency>`** (RSS 1.0 `mod_syndication`, also seen in RSS 2.0 feeds) -- e.g., `hourly` with frequency `1` means once per hour
|
|
17
|
+
- **HTTP `Cache-Control` / `Expires` headers** -- Standard cache directives
|
|
18
|
+
- **HTTP `ETag` / `Last-Modified` headers** -- Use conditional requests (`If-None-Match`, `If-Modified-Since`) to avoid re-downloading unchanged feeds
|
|
19
|
+
|
|
20
|
+
Default polling schedule when no hints are available:
|
|
21
|
+
|
|
22
|
+
| Feed Type | Suggested Interval | Rationale |
|
|
23
|
+
|-----------|-------------------|-----------|
|
|
24
|
+
| Breaking news | 15 minutes | High update frequency expected |
|
|
25
|
+
| Major news outlets | 30 minutes | Regular updates throughout the day |
|
|
26
|
+
| Blog / personal site | 2-4 hours | Updates less frequently |
|
|
27
|
+
| Weekly newsletter | 12-24 hours | Low frequency, conserve resources |
|
|
28
|
+
| Dormant / low-activity | 24 hours | Check daily, reclassify if activity increases |
|
|
29
|
+
|
|
30
|
+
### 2. Adaptive Polling
|
|
31
|
+
|
|
32
|
+
Track feed update patterns over time and adjust polling frequency:
|
|
33
|
+
- If a feed hasn't changed in 5 consecutive polls, double the interval (up to 24 hours max)
|
|
34
|
+
- If a feed has new content on every poll, halve the interval (down to the minimum allowed by TTL/headers)
|
|
35
|
+
- Track the average number of new items per poll to predict optimal timing
|
|
36
|
+
- Maintain a per-feed "reliability score" based on: uptime, valid XML rate, consistent timestamps
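The doubling and halving rules might be sketched like this. The function name `next_interval` is hypothetical, and reading "new content on every poll" as "every poll in the recent history passed in" is an assumption:

```python
from typing import List

MAX_INTERVAL_MIN = 24 * 60  # 24-hour ceiling from the rule above

def next_interval(current_min: int, recent_new_counts: List[int], floor_min: int) -> int:
    """Return the next polling interval in minutes.

    recent_new_counts: new-item counts for the most recent polls, oldest first.
    floor_min: minimum interval allowed by TTL/cache headers.
    """
    last_five = recent_new_counts[-5:]
    # Unchanged for 5 consecutive polls: back off, capped at 24 hours.
    if len(last_five) == 5 and all(n == 0 for n in last_five):
        return min(current_min * 2, MAX_INTERVAL_MIN)
    # New content on every recent poll: speed up, but never below the floor.
    if last_five and all(n > 0 for n in last_five):
        return max(current_min // 2, floor_min)
    return current_min
```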
|
|
37
|
+
|
|
38
|
+
### 3. Error Handling
|
|
39
|
+
|
|
40
|
+
- **HTTP 301 (Moved Permanently)** -- Update the stored feed URL
|
|
41
|
+
- **HTTP 410 (Gone)** -- Mark feed as dead; alert the user; stop polling
|
|
42
|
+
- **HTTP 429 (Too Many Requests)** -- Back off using `Retry-After` header; double interval
|
|
43
|
+
- **XML parse failures** -- Attempt recovery with lenient parsing; if persistent (3+ failures), flag feed as unhealthy
|
|
44
|
+
- **Timeout** -- Set a 30-second timeout per feed; retry once after 60 seconds; mark as slow if persistent
|
|
45
|
+
|
|
46
|
+
## Deduplication Strategies
|
|
47
|
+
|
|
48
|
+
### Multi-Signal Deduplication Pipeline
|
|
49
|
+
|
|
50
|
+
Deduplication should use a layered approach, from cheapest to most expensive:
|
|
51
|
+
|
|
52
|
+
#### Layer 1: GUID / Entry ID Match (Exact)
|
|
53
|
+
- Compare `<guid>` (RSS) or `<id>` (Atom) values directly
|
|
54
|
+
- Cheapest and most reliable when present
|
|
55
|
+
- Caveat: Some feeds reuse GUIDs or change them on updates
|
|
56
|
+
|
|
57
|
+
#### Layer 2: URL Canonicalization Match (Exact)
|
|
58
|
+
- Canonicalize URLs (see knowledge/domain.md) and compare
|
|
59
|
+
- Catches the same article shared via different tracking URLs
|
|
60
|
+
- Handles `http` vs `https`, `www` vs non-www variants
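A rough canonicalization sketch using `urllib.parse`. The tracking-parameter list (`utm_*`, `fbclid`, `gclid`) and the exact normalization steps are assumptions made for illustration; the authoritative rules are the ones in knowledge/domain.md:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)        # assumed list, for illustration only
TRACKING_PARAMS = {"fbclid", "gclid"}

def canonicalize(url: str) -> str:
    """Produce a comparison key for a URL (not necessarily a fetchable URL)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    # Drop tracking parameters and sort the rest for a stable order.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/") or "/"
    # Force one scheme and drop the fragment so http/https variants compare equal.
    return urlunsplit(("https", host, path, urlencode(query), ""))
```

With these assumptions, `http://www.Example.com/post/?utm_source=rss&id=7#top` and `https://example.com/post?id=7&fbclid=abc` map to the same key.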
|
|
61
|
+
|
|
62
|
+
#### Layer 3: Title Similarity (Fuzzy)
|
|
63
|
+
- Normalize titles: lowercase, strip punctuation, remove common prefixes ("Breaking:", "Update:", "ICYMI:")
|
|
64
|
+
- Use Jaccard similarity on word sets; threshold >= 0.85 indicates a likely duplicate
|
|
65
|
+
- For short titles (< 5 words), require higher threshold (>= 0.95) to avoid false positives
|
|
66
|
+
- Optionally use Levenshtein distance ratio as a secondary signal
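A minimal sketch of this layer with the thresholds listed above. The helper names are illustrative, and returning 0.0 for empty titles is an added assumption so that blank titles never match:

```python
import re
from typing import Set

PREFIXES = ("breaking:", "update:", "icymi:")

def normalize_title(title: str) -> Set[str]:
    """Lowercase, drop a leading common prefix, strip punctuation, return the word set."""
    text = title.lower().strip()
    for prefix in PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    return set(re.sub(r"[^\w\s]", " ", text).split())

def jaccard(a: Set[str], b: Set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def titles_match(t1: str, t2: str) -> bool:
    a, b = normalize_title(t1), normalize_title(t2)
    # Short titles (< 5 words) need the stricter 0.95 threshold.
    threshold = 0.95 if min(len(a), len(b)) < 5 else 0.85
    return jaccard(a, b) >= threshold
```

Reworded titles such as "Company X Acquires Y" and "Y Acquired by Company X" score only 0.5 here, which is why this layer is combined with content-level signals.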
|
|
67
|
+
|
|
68
|
+
#### Layer 4: Content Fingerprinting (Fuzzy)
|
|
69
|
+
- **SimHash**: Compute a 64-bit fingerprint of the article body (after HTML stripping and normalization). Articles with Hamming distance <= 3 are near-duplicates
|
|
70
|
+
- **MinHash with LSH (Locality-Sensitive Hashing)**: Compute k-shingle sets, generate MinHash signatures (128 hashes), use LSH bands (b=16, r=8) for candidate pair detection. Jaccard similarity >= 0.7 confirms near-duplicate
|
|
71
|
+
- **Sentence-level overlap**: Extract the first 3 non-trivial sentences (> 10 words each); if 2+ match another article, flag as near-duplicate
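To make the SimHash bullet concrete, here is a simplified sketch. It assumes unweighted word tokens and takes each token's 64-bit hash from the first 8 bytes of SHA-256; production variants usually weight tokens (for example by TF-IDF) and may use shingles instead of single words:

```python
import hashlib
import re

BITS = 64

def simhash(text: str) -> int:
    """Compute a 64-bit SimHash over lowercase word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    return hamming(simhash(a), simhash(b)) <= max_distance
```

Because the fingerprint is built from per-token votes, small edits flip only a few bits, which is what makes the Hamming-distance threshold meaningful.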
|
|
72
|
+
|
|
73
|
+
#### Layer 5: Entity & Fact Overlap (Semantic)
|
|
74
|
+
- Extract named entities (people, organizations, locations, dates) from both articles
|
|
75
|
+
- If entity overlap >= 80% and temporal proximity <= 24 hours, likely covering the same event
|
|
76
|
+
- Use this layer to merge "same story, different angle" articles into a cluster rather than discarding
|
|
77
|
+
|
|
78
|
+
### Dedup Decision Matrix
|
|
79
|
+
|
|
80
|
+
| Signal Combination | Action |
|
|
81
|
+
|-------------------|--------|
|
|
82
|
+
| GUID match | Exact duplicate -- merge, keep latest version |
|
|
83
|
+
| URL match (after canonicalization) | Exact duplicate -- merge, keep the one with more content |
|
|
84
|
+
| Title similarity >= 0.85 + URL domain differs | Near-duplicate from different sources -- cluster together |
|
|
85
|
+
| Content fingerprint match | Near-duplicate -- cluster together, note source diversity |
|
|
86
|
+
| Entity overlap >= 80% + within 24h | Same event coverage -- cluster, select most comprehensive as primary |
|
|
87
|
+
| Title similarity >= 0.85 + content fingerprint mismatch | Updated/revised article -- keep both, flag as revision |
|
|
88
|
+
|
|
89
|
+
## Importance Scoring
|
|
90
|
+
|
|
91
|
+
### Weighted Scoring Model
|
|
92
|
+
|
|
93
|
+
Score each article on a 0-100 scale using the following weighted dimensions:
|
|
94
|
+
|
|
95
|
+
| Dimension | Weight | Signal Sources |
|
|
96
|
+
|-----------|--------|---------------|
|
|
97
|
+
| Source Authority | 25% | Domain reputation tier (T1-T5), historical accuracy, feed health score |
|
|
98
|
+
| Recency | 20% | Publication age relative to poll time; decay function: `score = 100 * e^(-0.03 * hours_old)` |
|
|
99
|
+
| Cross-Source Corroboration | 20% | Number of independent sources covering the same story (from dedup clustering) |
|
|
100
|
+
| Topic Relevance | 20% | Cosine similarity between article TF-IDF vector and user interest profile vector |
|
|
101
|
+
| Content Depth | 15% | Word count (normalized), presence of data/citations, structured content (tables, lists) |
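The weighted sum in the table can be written out as a short sketch. The dimension keys and the `importance` function name are illustrative; each input is assumed to be already normalized to 0-100:

```python
from typing import Dict

# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "source_authority": 0.25,
    "recency": 0.20,
    "corroboration": 0.20,
    "topic_relevance": 0.20,
    "content_depth": 0.15,
}

def importance(scores: Dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-100) into one 0-100 importance score."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())
```

For example, an article scoring 95 on authority, 49 on recency, 60 on corroboration, 80 on relevance, and 50 on depth comes out at 23.75 + 9.8 + 12 + 16 + 7.5 = 69.05.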
|
|
102
|
+
|
|
103
|
+
### Source Authority Tiers
|
|
104
|
+
|
|
105
|
+
| Tier | Description | Base Score | Examples |
|
|
106
|
+
|------|-------------|------------|---------|
|
|
107
|
+
| T1 | Wire services, official sources | 90-100 | Reuters, AP, government feeds, RFC publications |
|
|
108
|
+
| T2 | Major established outlets | 75-89 | NYT, BBC, Nature, IEEE Spectrum |
|
|
109
|
+
| T3 | Respected niche/industry sources | 60-74 | TechCrunch, Ars Technica, The Verge, domain-specific journals |
|
|
110
|
+
| T4 | Community & expert blogs | 40-59 | Popular personal blogs, Medium publications with editors, curated newsletters |
|
|
111
|
+
| T5 | Unverified / user-generated | 20-39 | Anonymous blogs, auto-generated feeds, low-quality aggregators |
|
|
112
|
+
|
|
113
|
+
### Recency Decay Function
|
|
114
|
+
|
|
115
|
+
Apply an exponential decay to the recency score so that newer articles score higher:
|
|
116
|
+
|
|
117
|
+
```
|
|
118
|
+
recency_score = 100 * e^(-lambda * hours_since_publication)
|
|
119
|
+
|
|
120
|
+
lambda = 0.03 (half-life ~ 23 hours)
|
|
121
|
+
```
|
|
122
|
+
|
|
123
|
+
| Age | Recency Score |
|
|
124
|
+
|-----|--------------|
|
|
125
|
+
| 0 hours | 100 |
|
|
126
|
+
| 6 hours | 84 |
|
|
127
|
+
| 12 hours | 70 |
|
|
128
|
+
| 24 hours | 49 |
|
|
129
|
+
| 48 hours | 24 |
|
|
130
|
+
| 72 hours | 12 |
|
|
131
|
+
|
|
132
|
+
Adjust lambda per user preference: set lambda lower (e.g., 0.01) for users who prefer weekly digests; higher (e.g., 0.05) for real-time monitoring.
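The decay function and its half-life can be checked with a few lines of Python. The function names are illustrative; the formula and the default lambda are the ones given above:

```python
import math

def recency_score(hours_old: float, lam: float = 0.03) -> float:
    """Exponential decay: 100 at publication time, halving every ln(2)/lam hours."""
    return 100 * math.exp(-lam * hours_old)

def half_life_hours(lam: float) -> float:
    return math.log(2) / lam

# Reproduce the table: rounded scores at 0, 6, 12, 24, 48, and 72 hours.
table = [round(recency_score(h)) for h in (0, 6, 12, 24, 48, 72)]
```

The same helper shows what the suggested alternatives imply: lambda 0.01 gives a half-life of roughly 69 hours, and lambda 0.05 roughly 14 hours.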
|
|
133
|
+
|
|
134
|
+
### User Interest Profile
|
|
135
|
+
|
|
136
|
+
Maintain a vector of topic weights representing the user's interests:
|
|
137
|
+
- Initialize from explicit topic subscriptions and feed categories
|
|
138
|
+
- Update dynamically based on click-through behavior (articles the user reads vs skips)
|
|
139
|
+
- Decay stale interests: reduce weight by 10% per week for topics the user hasn't engaged with
|
|
140
|
+
- Use the interest profile vector to compute topic relevance scores via cosine similarity with article TF-IDF vectors
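A sketch of the two operations described here, assuming sparse vectors stored as `term -> weight` dicts. The function names and the dict representation are assumptions for illustration:

```python
import math
from typing import Dict, Set

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def decay_profile(profile: Dict[str, float], engaged: Set[str], rate: float = 0.10) -> Dict[str, float]:
    """Weekly decay: reduce weight by `rate` for topics without engagement."""
    return {
        topic: (w if topic in engaged else w * (1 - rate))
        for topic, w in profile.items()
    }
```

Calling `decay_profile` once per week with the set of topics the user engaged with gives the 10%-per-week decay; the resulting profile is then passed to `cosine` alongside each article's TF-IDF vector.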
|
|
141
|
+
|
|
142
|
+
## Topic Clustering
|
|
143
|
+
|
|
144
|
+
### TF-IDF Vectorization
|
|
145
|
+
|
|
146
|
+
1. Preprocess text: lowercase, remove stop words, apply stemming or lemmatization
|
|
147
|
+
2. Compute TF-IDF vectors using the corpus of articles from the current digest window
|
|
148
|
+
3. Use only the title + first 200 words of the body for efficiency
|
|
149
|
+
4. Maintain a background IDF model updated weekly from all ingested articles
|
|
150
|
+
|
|
151
|
+
### Clustering Algorithm Selection
|
|
152
|
+
|
|
153
|
+
| Method | When to Use | Parameters |
|
|
154
|
+
|--------|------------|------------|
|
|
155
|
+
| DBSCAN | Default choice; handles variable cluster sizes and noise well | eps=0.4 (cosine distance), min_samples=2 |
|
|
156
|
+
| Agglomerative (Ward) | When you need a fixed number of topics (e.g., "give me 5 topics") | n_clusters=k, linkage=ward |
|
|
157
|
+
| Online K-Means | Streaming / real-time updates where articles arrive continuously | n_clusters=k (estimated from historical data) |
|
|
158
|
+
|
|
159
|
+
### Cluster Labeling
|
|
160
|
+
|
|
161
|
+
For each cluster, generate a human-readable topic label:
|
|
162
|
+
1. Extract the top 3 TF-IDF terms from the cluster centroid
|
|
163
|
+
2. Identify the most common named entity (person, org, or location) across cluster articles
|
|
164
|
+
3. Combine into a label: "[Named Entity]: [Top Terms]" (e.g., "OpenAI: language model GPT release")
|
|
165
|
+
4. If no clear named entity, use the top 3 terms as the label
|
|
166
|
+
|
|
167
|
+
### Representative Article Selection
|
|
168
|
+
|
|
169
|
+
From each cluster, select one "lead" article to represent the topic:
|
|
170
|
+
1. Pick the article closest to the cluster centroid (most representative)
|
|
171
|
+
2. Break ties by importance score (higher is better)
|
|
172
|
+
3. Break further ties by source authority tier (prefer T1/T2)
|
|
173
|
+
4. Include the lead article's summary in the digest; list other cluster articles as "Related"
|
|
174
|
+
|
|
175
|
+
## Digest Generation
|
|
176
|
+
|
|
177
|
+
### Digest Structure
|
|
178
|
+
|
|
179
|
+
```
|
|
180
|
+
# Daily Digest -- [Date]
|
|
181
|
+
## Top Stories (importance >= 70)
|
|
182
|
+
[Topic Label 1]
|
|
183
|
+
- Lead article summary (source, date, importance score)
|
|
184
|
+
- Related: [n] more articles from [sources]
|
|
185
|
+
[Topic Label 2]
|
|
186
|
+
- ...
|
|
187
|
+
|
|
188
|
+
## Noteworthy (importance 40-69)
|
|
189
|
+
[Topic clusters organized by category]
|
|
190
|
+
|
|
191
|
+
## Also Mentioned (importance < 40)
|
|
192
|
+
[Brief one-line entries]
|
|
193
|
+
|
|
194
|
+
## Feed Health Report
|
|
195
|
+
- [n] feeds polled, [n] successful, [n] errors
|
|
196
|
+
- [n] new articles, [n] duplicates removed
|
|
197
|
+
- Emerging topic: [topic gaining traction]
|
|
198
|
+
- Declining topic: [topic losing traction]
|
|
199
|
+
```
|
|
200
|
+
|
|
201
|
+
### Digest Sizing
|
|
202
|
+
|
|
203
|
+
| Digest Type | Max Stories | Max Words per Summary | Delivery Window |
|
|
204
|
+
|-------------|-----------|---------------------|----------------|
|
|
205
|
+
| Morning brief | 10-15 | 50-75 | 06:00-08:00 local |
|
|
206
|
+
| Midday update | 5-10 | 30-50 | 11:30-13:00 local |
|
|
207
|
+
| Evening recap | 10-15 | 50-75 | 17:00-19:00 local |
|
|
208
|
+
| Weekly roundup | 20-30 | 100-150 | Saturday/Sunday morning |
|
package/knowledge/domain.md
ADDED
|
@@ -0,0 +1,203 @@
|
|
|
1
|
+
---
|
|
2
|
+
domain: rss-manager
|
|
3
|
+
topic: feed-formats-parsing-and-content-extraction
|
|
4
|
+
priority: high
|
|
5
|
+
ttl: 30d
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# RSS/Atom Feed Formats, XML Parsing & Content Extraction
|
|
9
|
+
|
|
10
|
+
## Feed Format Specifications
|
|
11
|
+
|
|
12
|
+
### RSS 2.0 (Really Simple Syndication)
|
|
13
|
+
|
|
14
|
+
RSS 2.0 is the most widely deployed syndication format. A valid RSS 2.0 document is XML with a root `<rss>` element containing a single `<channel>`.
|
|
15
|
+
|
|
16
|
+
#### Channel-Level Elements
|
|
17
|
+
|
|
18
|
+
| Element | Required | Description |
|
|
19
|
+
|---------|----------|-------------|
|
|
20
|
+
| `<title>` | Yes | Name of the feed (e.g., "TechCrunch") |
|
|
21
|
+
| `<link>` | Yes | URL of the HTML website associated with the feed |
|
|
22
|
+
| `<description>` | Yes | Summary of what the feed contains |
|
|
23
|
+
| `<language>` | No | Language code (e.g., "en-us", "zh-cn") |
|
|
24
|
+
| `<lastBuildDate>` | No | Last time feed content changed (RFC 822 date) |
|
|
25
|
+
| `<pubDate>` | No | Publication date of the feed content (RFC 822 date) |
|
|
26
|
+
| `<ttl>` | No | Minutes the feed can be cached before refresh |
|
|
27
|
+
| `<image>` | No | Channel logo with `<url>`, `<title>`, `<link>` sub-elements |
|
|
28
|
+
| `<generator>` | No | Software that generated the feed |
|
|
29
|
+
| `<managingEditor>` | No | Email of the editorial contact |
|
|
30
|
+
| `<category>` | No | One or more categories for the feed |
|
|
31
|
+
|
|
32
|
+
#### Item-Level Elements
|
|
33
|
+
|
|
34
|
+
| Element | Required | Description |
|
|
35
|
+
|---------|----------|-------------|
|
|
36
|
+
| `<title>` | Conditional | Title of the article (required if no description) |
|
|
37
|
+
| `<link>` | No | URL of the full article |
|
|
38
|
+
| `<description>` | Conditional | Article summary or full content (required if no title) |
|
|
39
|
+
| `<author>` | No | Email address of the author |
|
|
40
|
+
| `<category>` | No | One or more categories |
|
|
41
|
+
| `<pubDate>` | No | Publication date (RFC 822 format) |
|
|
42
|
+
| `<guid>` | No | Globally unique identifier; `isPermaLink="true"` means it is a URL |
|
|
43
|
+
| `<enclosure>` | No | Attached media; attributes: `url`, `length`, `type` |
|
|
44
|
+
| `<comments>` | No | URL of the comments page |
|
|
45
|
+
| `<source>` | No | Original feed the item came from; attribute: `url` |
|
|
46
|
+
|
|
47
|
+
#### RSS 2.0 Namespace Extensions
|
|
48
|
+
|
|
49
|
+
Common namespace extensions enrich standard RSS:
|
|
50
|
+
|
|
51
|
+
- **`content:encoded`** (xmlns:content="http://purl.org/rss/1.0/modules/content/") — Full HTML content body, preferred over `<description>` for complete article text
|
|
52
|
+
- **`dc:creator`** (xmlns:dc="http://purl.org/dc/elements/1.1/") — Dublin Core author name (more reliable than `<author>`)
|
|
53
|
+
- **`dc:date`** — ISO 8601 date (more precise than `<pubDate>`)
|
|
54
|
+
- **`slash:comments`** — Comment count (integer)
|
|
55
|
+
- **`wfw:commentRss`** — RSS feed URL for the item's comments
|
|
56
|
+
- **`media:content`** — Rich media attachments with `url`, `medium`, `type`, `width`, `height`
|
|
57
|
+
- **`media:thumbnail`** — Thumbnail image URL
|
|
58
|
+
|
|
59
|
+
### RSS 1.0 (RDF Site Summary)
|
|
60
|
+
|
|
61
|
+
RSS 1.0 is RDF-based and uses XML namespaces extensively. Less common than RSS 2.0 but still found in academic and government feeds.
|
|
62
|
+
|
|
63
|
+
#### Key Differences from RSS 2.0
|
|
64
|
+
|
|
65
|
+
- Root element is `<rdf:RDF>` (not `<rss>`)
|
|
66
|
+
- Items are listed both inside `<channel><items><rdf:Seq>` as references and as top-level `<item>` elements
|
|
67
|
+
- Uses `rdf:about` attribute for resource identification
|
|
68
|
+
- Relies heavily on Dublin Core (`dc:`) namespace for metadata
|
|
69
|
+
- Extensible through RDF modules: `mod_syndication` (update schedule), `mod_taxonomy` (topic classification)
|
|
70
|
+
|
|
71
|
+
### Atom 1.0 (RFC 4287)
|
|
72
|
+
|
|
73
|
+
Atom is a more formally specified format than RSS, with clearer semantics and mandatory fields.
|
|
74
|
+
|
|
75
|
+
#### Feed-Level Elements
|
|
76
|
+
|
|
77
|
+
| Element | Required | Description |
|
|
78
|
+
|---------|----------|-------------|
|
|
79
|
+
| `<title>` | Yes | Feed title (supports `type` attribute: text, html, xhtml) |
|
|
80
|
+
| `<id>` | Yes | Permanent, universally unique feed identifier (IRI) |
|
|
81
|
+
| `<updated>` | Yes | Last time the feed was modified (RFC 3339 / ISO 8601) |
|
|
82
|
+
| `<author>` | Conditional | One or more `<author>` elements with `<name>`, optional `<email>`, `<uri>`; required unless every entry carries its own `<author>` |
|
|
83
|
+
| `<link>` | Recommended | Should include `rel="self"` (feed URL); `rel="alternate"` points to the website URL |
|
|
84
|
+
| `<subtitle>` | No | Feed description |
|
|
85
|
+
| `<generator>` | No | Software that generated the feed |
|
|
86
|
+
| `<icon>` | No | Small feed icon URL |
|
|
87
|
+
| `<logo>` | No | Feed logo URL |
|
|
88
|
+
| `<rights>` | No | Copyright notice |
|
|
89
|
+
| `<category>` | No | One or more categories with `term`, `scheme`, `label` attributes |

#### Entry-Level Elements

| Element | Required | Description |
|---------|----------|-------------|
| `<title>` | Yes | Entry title (with `type` attribute) |
| `<id>` | Yes | Permanent unique identifier for the entry (IRI) |
| `<updated>` | Yes | Last modification timestamp |
| `<published>` | No | Original publication timestamp |
| `<author>` | Conditional | Required if feed-level author is absent |
| `<content>` | Recommended | Full entry content; `type` attribute: text, html, xhtml, or media type |
| `<summary>` | Recommended | Short summary; required if `<content>` uses `src` or is Base64-encoded (non-text media type) |
| `<link>` | Recommended | `rel="alternate"` for the article URL; required if `<content>` is absent |
| `<category>` | No | One or more categories |
| `<contributor>` | No | Additional contributors |
| `<source>` | No | Original feed metadata if the entry was aggregated |

## XML Parsing Considerations

### Encoding Detection Priority

1. HTTP `Content-Type` header charset (highest priority)
2. XML declaration encoding attribute: `<?xml version="1.0" encoding="UTF-8"?>`
3. BOM (Byte Order Mark) detection
4. Default to UTF-8 if none specified
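
The priority order above can be sketched as a small Python helper. This is a minimal illustration, not part of the package: `detect_encoding` and its parameters are hypothetical names, and the XML-declaration check only works for ASCII-compatible encodings.

```python
import codecs
import re
from typing import Optional

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]
_CHARSET_RE = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)', re.IGNORECASE)
_XML_DECL_RE = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._:-]+)["\']'
)


def detect_encoding(body: bytes, content_type: Optional[str] = None) -> str:
    """Return a codec name following the priority list in this section."""
    # 1. HTTP Content-Type charset (highest priority).
    if content_type:
        match = _CHARSET_RE.search(content_type)
        if match:
            return match.group(1).lower()
    # 2. encoding attribute of the XML declaration near the start of the body.
    match = _XML_DECL_RE.search(body[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Byte Order Mark.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 4. Nothing specified: default to UTF-8.
    return "utf-8"
```

A caller would typically pass the raw response bytes and the `Content-Type` header value, then decode the body with the returned codec name.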

### Common Encoding Issues

- **Double encoding**: UTF-8 bytes are misread as a single-byte encoding and encoded as UTF-8 again, producing mojibake (e.g., `Ã©` instead of `é`)
- **Windows-1252 mislabeled as ISO-8859-1**: Characters in the 0x80-0x9F range render incorrectly
- **HTML entities in XML**: `&nbsp;` is valid in HTML but not in XML -- must use `&#160;` or be wrapped in CDATA
- **Unescaped ampersands**: `&` in URLs or text breaks XML parsing -- must be `&amp;`
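
As a rough illustration of the first and last issues, the sketch below reproduces double encoding, reverses one round of it, and escapes an ampersand for XML. `repair_mojibake` is a hypothetical helper, and it assumes the misread went through Latin-1; real feeds often involve Windows-1252 instead, which needs a different codec.

```python
from xml.sax.saxutils import escape


def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes having been decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not double-encoded (or not repairable this way): keep the input.
        return text


# How the damage happens: UTF-8 bytes decoded with the wrong codec.
garbled = "café".encode("utf-8").decode("latin-1")
repaired = repair_mojibake(garbled)

# Ampersands in URLs or text must be escaped before embedding in XML.
safe_url = escape("https://example.com/?a=1&b=2")
```

Because the repair is attempted blindly, it is safest to apply it only when the text shows telltale sequences such as `Ã` followed by another accented character.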

### CDATA Section Handling

Many feeds wrap HTML content in CDATA sections to avoid XML escaping issues:

```xml
<description><![CDATA[<p>Article with <a href="https://example.com">links</a> and <img src="photo.jpg" /></p>]]></description>
```

Processing steps:
1. Extract raw content from CDATA (stripping `<![CDATA[` and `]]>`)
2. Parse the inner HTML separately
3. Sanitize: strip `<script>`, `<iframe>`, event handlers, `javascript:` URIs
4. Extract plain text for indexing; preserve HTML for display
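
The four steps can be sketched with the Python standard library. The names `Sanitizer` and `process_description` are hypothetical, and the blocklist approach below only illustrates the steps; a production pipeline would more likely rely on an allowlist-based sanitizer.

```python
import xml.etree.ElementTree as ET
from html import escape
from html.parser import HTMLParser

BLOCKED_TAGS = {"script", "iframe"}


class Sanitizer(HTMLParser):
    """Collects sanitized HTML and plain text from an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.html_parts = []
        self.text_parts = []
        self._blocked = 0  # nesting depth inside <script>/<iframe>

    def _emit_start(self, tag, attrs):
        kept = []
        for name, value in attrs:
            value = value or ""
            if name.startswith("on"):  # event handlers such as onclick
                continue
            if value.strip().lower().startswith("javascript:"):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.html_parts.append(f"<{tag}{''.join(kept)}>")

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED_TAGS:
            self._blocked += 1
        elif not self._blocked:
            self._emit_start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> have no separate end tag.
        if tag not in BLOCKED_TAGS and not self._blocked:
            self._emit_start(tag, attrs)

    def handle_endtag(self, tag):
        if tag in BLOCKED_TAGS:
            self._blocked = max(0, self._blocked - 1)
        elif not self._blocked:
            self.html_parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._blocked:
            self.html_parts.append(escape(data, quote=False))
            self.text_parts.append(data)


def process_description(item_xml: str):
    """Return (safe_html, plain_text) for an item's <description>."""
    # Step 1: the XML parser already strips the CDATA wrapper.
    raw_html = ET.fromstring(item_xml).findtext("description") or ""
    # Steps 2-3: parse the inner HTML separately and sanitize it.
    parser = Sanitizer()
    parser.feed(raw_html)
    parser.close()
    # Step 4: keep HTML for display, whitespace-normalized text for indexing.
    safe_html = "".join(parser.html_parts)
    plain_text = " ".join("".join(parser.text_parts).split())
    return safe_html, plain_text
```

Note that step 1 needs no manual string stripping when a real XML parser is used: the CDATA delimiters never reach the application.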

### Namespace Resolution

Feeds commonly use multiple namespaces. A robust parser must:

1. Resolve namespace prefixes to their URIs (prefixes may differ between feeds)
2. Recognize elements by namespace URI, not prefix (e.g., `content:encoded` and `c:encoded` are the same if both map to `http://purl.org/rss/1.0/modules/content/`)
3. Handle default namespace declarations on the root element
4. Gracefully ignore unknown namespaces rather than failing
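
A short sketch of URI-based matching, using Python's `xml.etree.ElementTree`, which rewrites every tag to `{namespace-uri}localname` regardless of the prefix the feed chose. The helper names `encoded_bodies` and `atom_entry_titles` are illustrative only.

```python
import xml.etree.ElementTree as ET

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ATOM_NS = "http://www.w3.org/2005/Atom"


def encoded_bodies(feed_xml: str):
    """Find content:encoded elements by namespace URI, whatever the prefix."""
    root = ET.fromstring(feed_xml)
    return [el.text for el in root.iter(f"{{{CONTENT_NS}}}encoded")]


def atom_entry_titles(feed_xml: str):
    """Handle a default namespace declared on the root <feed> element."""
    root = ET.fromstring(feed_xml)
    return [
        entry.findtext(f"{{{ATOM_NS}}}title")
        for entry in root.iter(f"{{{ATOM_NS}}}entry")
    ]
```

Elements in namespaces the code never asks for are simply not matched, so unknown extensions are ignored without any special handling.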

## Content Extraction

### Extracting Article Text

Priority order for obtaining article body:

1. **Atom `<content type="html">`** or **`<content type="xhtml">`** -- fullest content
2. **RSS `<content:encoded>`** -- full HTML body (namespace extension)
3. **Atom `<summary>`** -- may be full text or truncated
4. **RSS `<description>`** -- may be full text, truncated, or just a snippet
5. **Fetch the linked URL** -- fallback when feed only provides a title or minimal snippet
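
One way to express this priority order in code is shown below. `pick_body` is a hypothetical helper; it returns a label naming the source it used, and for `type="xhtml"` it returns only the text content because serializing the inline markup is left out of this sketch.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
CONTENT = "{http://purl.org/rss/1.0/modules/content/}"


def _text(element):
    """Concatenate all text inside an element; empty string if missing."""
    if element is None:
        return ""
    return "".join(element.itertext()).strip()


def pick_body(node):
    """Return (source_label, body) for an RSS <item> or Atom <entry>."""
    # 1. Atom <content> carrying HTML or XHTML.
    atom_content = node.find(f"{ATOM}content")
    if atom_content is not None and atom_content.get("type") in ("html", "xhtml"):
        body = _text(atom_content)
        if body:
            return "atom:content", body
    # 2-4. Remaining in-feed sources, in priority order.
    for label, tag in (
        ("content:encoded", f"{CONTENT}encoded"),
        ("atom:summary", f"{ATOM}summary"),
        ("description", "description"),
    ):
        body = _text(node.find(tag))
        if body:
            return label, body
    # 5. Nothing usable in the feed: the caller should fetch the linked URL.
    return "fetch-link", None
```

Returning the label alongside the body lets later stages treat a `description` snippet with more suspicion than a full `content:encoded` body.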

### Metadata Extraction Checklist

For each feed item, extract and normalize:

| Field | Primary Source | Fallback | Normalization |
|-------|---------------|----------|---------------|
| Title | `<title>` | First line of description | Strip HTML, decode entities, trim whitespace |
| URL | `<link>` / `<guid isPermaLink="true">` | `<id>` (Atom) | Canonicalize: lowercase host, remove tracking params |
| Author | `<dc:creator>` / `<author><name>` | `<author>` (email) / `<managingEditor>` | Extract name, discard email if present |
| Date | `<dc:date>` / `<published>` / `<updated>` | `<pubDate>` | Parse to ISO 8601 UTC; handle RFC 822, RFC 3339, and common non-standard formats |
| Body | `<content:encoded>` / `<content>` | `<description>` / `<summary>` | Sanitize HTML, extract plain text, calculate word count |
| Categories | `<category>` (multiple) | `<dc:subject>` | Normalize casing, map synonyms |
| Media | `<enclosure>` / `<media:content>` | `<media:thumbnail>` / embedded `<img>` | Extract URL, MIME type, dimensions |
| GUID | `<guid>` / `<id>` | URL | Use as-is for deduplication key |

### Date Parsing

Feeds use inconsistent date formats. A robust parser must handle:

- **RFC 822**: `Mon, 15 Jan 2024 13:45:00 GMT` (RSS 2.0 standard)
- **RFC 3339 / ISO 8601**: `2024-01-15T13:45:00Z` (Atom standard)
- **Non-standard variations**: `Jan 15, 2024`, `2024/01/15`, `15-01-2024`, `1705322700` (Unix timestamp)
- **Timezone ambiguity**: `EST` vs `-0500`; always convert to UTC for consistent comparison
- **Missing timezone**: Assume UTC and flag as uncertain
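
A possible shape for such a parser, using only the standard library, is sketched here. `parse_feed_date` and its `(datetime, timezone_certain)` return convention are assumptions made for illustration; the list of fallback formats covers only the examples above, and any all-digit string is treated as a Unix timestamp.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

_FALLBACK_FORMATS = ("%b %d, %Y", "%Y/%m/%d", "%d-%m-%Y")


def parse_feed_date(raw: str):
    """Return (UTC datetime or None, timezone_certain)."""
    raw = raw.strip()
    # Unix timestamp: seconds since the epoch are UTC by definition.
    if raw.isdigit():
        return datetime.fromtimestamp(int(raw), tz=timezone.utc), True

    parsed = None
    # RFC 822 (RSS 2.0), including named zones such as GMT or EST.
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        pass
    # RFC 3339 / ISO 8601 (Atom); older Pythons reject a trailing "Z".
    if parsed is None:
        try:
            parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        except ValueError:
            pass
    # Common non-standard variations.
    if parsed is None:
        for fmt in _FALLBACK_FORMATS:
            try:
                parsed = datetime.strptime(raw, fmt)
                break
            except ValueError:
                continue
    if parsed is None:
        return None, False
    # Missing timezone: assume UTC and flag the result as uncertain.
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc), False
    return parsed.astimezone(timezone.utc), True
```

The boolean flag lets downstream sorting or deduplication treat assumed-UTC timestamps with a wider tolerance than explicitly zoned ones.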

### URL Canonicalization

To detect duplicate URLs pointing to the same article:

1. Convert scheme and host to lowercase: `HTTPS://WWW.Example.COM` -> `https://www.example.com`
2. Remove default ports: `:80` for HTTP, `:443` for HTTPS
3. Remove trailing slashes on paths (unless the path is `/`)
4. Sort query parameters alphabetically
5. Remove known tracking parameters: `utm_source`, `utm_medium`, `utm_campaign`, `utm_content`, `utm_term`, `ref`, `source`, `fbclid`, `gclid`, `mc_cid`, `mc_eid`
6. Decode unnecessary percent-encoding: `%41` -> `A`
7. Remove fragment identifiers (`#section`) unless they are part of a single-page app route
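
The seven steps map fairly directly onto `urllib.parse`. The sketch below is one possible reading, with a hypothetical `canonicalize_url` helper; it always drops the fragment (the single-page-app exception is left to the caller), decodes only unreserved characters in the path, and does not handle userinfo or IPv6 literals.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term",
    "ref", "source", "fbclid", "gclid", "mc_cid", "mc_eid",
}
_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def _decode_unreserved(match):
    """Decode %XX only when it encodes an unreserved character."""
    char = chr(int(match.group(1), 16))
    return char if char in _UNRESERVED else match.group(0).upper()


def canonicalize_url(url: str) -> str:
    parts = urlsplit(url.strip())
    # Step 1: lowercase scheme and host.
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Step 2: keep the port only when it is not the scheme default.
    port = parts.port
    if port is not None and (scheme, port) not in (("http", 80), ("https", 443)):
        host = f"{host}:{port}"
    # Step 3: strip trailing slashes unless the path is just "/".
    path = parts.path
    if len(path) > 1:
        path = path.rstrip("/") or "/"
    # Step 6: decode percent-escapes that were never necessary.
    path = re.sub(r"%([0-9A-Fa-f]{2})", _decode_unreserved, path)
    # Steps 4-5: drop tracking parameters, then sort what remains.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    # Step 7: the empty last component removes the fragment.
    return urlunsplit((scheme, host, path, urlencode(query), ""))
```

Two URLs can then be compared for duplication by comparing their canonical forms rather than the raw strings.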

### Feed Discovery

When given a website URL instead of a feed URL, discover feeds by:

1. Check `<link rel="alternate" type="application/rss+xml">` in HTML `<head>`
2. Check `<link rel="alternate" type="application/atom+xml">` in HTML `<head>`
3. Try common paths: `/feed`, `/rss`, `/atom.xml`, `/feed.xml`, `/rss.xml`, `/index.xml`, `/feeds/posts/default` (Blogger)
4. Check `/.well-known/` resources
5. Parse the page for embedded feed links in the body content
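
Steps 1 to 3 can be sketched without any network access: given a page URL and its already-fetched HTML, collect the advertised feed links, and fall back to the common paths as candidates to probe. `FeedLinkParser` and `discover_feeds` are hypothetical names; steps 4 and 5 are not covered here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
COMMON_PATHS = [
    "/feed", "/rss", "/atom.xml", "/feed.xml",
    "/rss.xml", "/index.xml", "/feeds/posts/default",
]


class FeedLinkParser(HTMLParser):
    """Collects <link rel="alternate"> elements that advertise a feed."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rels = attributes.get("rel", "").lower().split()
        feed_type = attributes.get("type", "").lower()
        href = attributes.get("href", "")
        if "alternate" in rels and feed_type in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(page_url: str, html: str):
    """Return advertised feed URLs, or common-path candidates to probe."""
    parser = FeedLinkParser(page_url)
    parser.feed(html)
    parser.close()
    if parser.feeds:
        return parser.feeds
    return [urljoin(page_url, path) for path in COMMON_PATHS]
```

Candidates returned from the fallback list are only guesses; each still has to be fetched and checked for a parseable RSS or Atom document.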
|
package/manifest.json
ADDED
|
@@ -0,0 +1,26 @@
{
  "name": "@botlearn/rss-manager",
  "version": "0.1.0",
  "description": "Multi-source RSS/Atom feed aggregation, deduplication, importance scoring, topic clustering, and daily digest generation for OpenClaw Agent",
  "category": "information-retrieval",
  "author": "BotLearn",
  "benchmarkDimension": "information-retrieval",
  "expectedImprovement": 30,
  "dependencies": {},
  "compatibility": {
    "openclaw": ">=0.5.0"
  },
  "files": {
    "skill": "skill.md",
    "knowledge": [
      "knowledge/domain.md",
      "knowledge/best-practices.md",
      "knowledge/anti-patterns.md"
    ],
    "strategies": [
      "strategies/main.md"
    ],
    "smokeTest": "tests/smoke.json",
    "benchmark": "tests/benchmark.json"
  }
}
package/package.json
ADDED
|
@@ -0,0 +1,35 @@
{
  "name": "@botlearn/rss-manager",
  "version": "0.1.0",
  "description": "Multi-source RSS/Atom feed aggregation, deduplication, importance scoring, topic clustering, and daily digest generation for OpenClaw Agent",
  "type": "module",
  "main": "manifest.json",
  "files": [
    "manifest.json",
    "skill.md",
    "knowledge/",
    "strategies/",
    "tests/",
    "README.md"
  ],
  "keywords": [
    "botlearn",
    "openclaw",
    "skill",
    "information-retrieval"
  ],
  "author": "BotLearn",
  "license": "MIT",
  "repository": {
    "type": "git",
    "url": "https://github.com/readai-team/botlearn-awesome-skills.git",
    "directory": "packages/skills/rss-manager"
  },
  "homepage": "https://github.com/readai-team/botlearn-awesome-skills/tree/main/packages/skills/rss-manager",
  "bugs": {
    "url": "https://github.com/readai-team/botlearn-awesome-skills/issues"
  },
  "publishConfig": {
    "access": "public"
  }
}