@aikeytake/social-automation 2.0.0

package/MASTER_PLAN.md ADDED
# 📋 Master Plan: AI Agent Content Research Tool

**Project:** Content Research & Aggregation Tool for AI Agents
**Purpose:** Provide AI agents with organized, scraped content from multiple sources
**Last Updated:** 2025-03-06
**Version:** 3.0 (Simplified - Agent-Focused)

---

## 🎯 Executive Summary

**What this tool IS:**
- A data scraping and organization tool
- An MCP (Model Context Protocol) server for AI agents
- A content aggregator that feeds AI agents
- Simple, efficient, agent-friendly

**What this tool is NOT:**
- ❌ Automated social media posting system
- ❌ Scheduled content generator
- ❌ Complex publishing pipeline
- ❌ Auto-publisher to social platforms

**The Philosophy:**
> "Let the AI agents do what they're good at (writing, creativity), while this tool handles the boring work (scraping, organizing, storing data)."

---

## 🚀 System Purpose

### The Problem
- AI agents need fresh, relevant content to create social media posts
- Manual research is time-consuming
- Multiple sources are hard to track
- Data needs to be organized and accessible

### The Solution
A simple tool that:
1. **Scrapes** content from multiple sources (RSS, Reddit, HN, LinkedIn)
2. **Organizes** the data effectively
3. **Stores** it in a structured format
4. **Exposes** it to AI agents via MCP or a simple API

### The Workflow
```
┌─────────────────────────────────────────────────────────┐
│                      DATA SOURCES                       │
│  RSS Feeds │ Reddit │ Hacker News │ LinkedIn │ Twitter  │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                  CONTENT SCRAPER TOOL                   │
│  • Fetches from all sources                             │
│  • Deduplicates and filters                             │
│  • Scores by engagement                                 │
│  • Stores in JSON                                       │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                 ORGANIZED DATA STORAGE                  │
│  • Trending items (scored)                              │
│  • Categorized by topic                                 │
│  • Timestamped for freshness                            │
│  • Easy to query/filter                                 │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                 AI AGENT (Claude, etc.)                 │
│  • Queries data for specific topics                     │
│  • Reads organized content                              │
│  • Writes creative posts                                │
│  • Adds business angle                                  │
│  • Creates engaging copy                                │
└─────────────────────────────────────────────────────────┘
```

---

## 🎯 Core Features

### 1. Data Scraping (COMPLETE)

#### ✅ RSS Feeds
```bash
# Fetch from RSS feeds
npm run scrape:rss

# Sources:
# - TechCrunch AI
# - OpenAI Blog
# - Anthropic News
# - Google AI Blog
# - arXiv AI/ML
# - MIT Technology Review
# - And more...
```

#### ✅ Reddit
```bash
# Fetch from Reddit
npm run scrape:reddit

# Subreddits:
# - r/MachineLearning
# - r/artificial
# - r/ArtificialIntelligence
# - r/deeplearning
# - r/OpenAI
# - r/LocalLLaMA
# - r/singularity
```

#### ✅ Hacker News
```bash
# Fetch from Hacker News
npm run scrape:hn

# Filters by AI keywords:
# - AI, machine learning, GPT, LLM
# - OpenAI, Anthropic, Google AI
```
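
The keyword filter behind `npm run scrape:hn` could be sketched with the public Hacker News Firebase API; the function names and keyword list below are illustrative, not the tool's actual code:

```javascript
// Keyword match used to keep only AI-related stories. Naive substring
// matching, so a word like "maintain" would also match "ai" — good
// enough for a sketch.
const KEYWORDS = ["AI", "machine learning", "GPT", "LLM", "OpenAI", "Anthropic"];

function matchesKeywords(title, keywords = KEYWORDS) {
  const haystack = title.toLowerCase();
  return keywords.some((kw) => haystack.includes(kw.toLowerCase()));
}

// Fetch top stories and keep only matching ones (Node 18+ for global fetch).
// The endpoints are the official Hacker News Firebase API.
async function scrapeHackerNews(limit = 30) {
  const base = "https://hacker-news.firebaseio.com/v0";
  const ids = await (await fetch(`${base}/topstories.json`)).json();
  const stories = await Promise.all(
    ids.slice(0, 100).map((id) =>
      fetch(`${base}/item/${id}.json`).then((r) => r.json())
    )
  );
  return stories
    .filter((s) => s && s.title && matchesKeywords(s.title))
    .slice(0, limit);
}
```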

#### ✅ LinkedIn (Optional)
```bash
# Fetch from LinkedIn KOLs
npm run scrape:linkedin

# Requires BrightData setup
```

### 2. Data Organization (COMPLETE)

#### Automatic Organization
- **Deduplication**: Same story across sources = 1 entry
- **Scoring**: By engagement (upvotes, points, comments)
- **Categorization**: By topic and keywords
- **Timestamping**: Track freshness
- **Source tracking**: Know where it came from

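As a sketch, the per-item engagement score might combine upvotes/points with a comment bonus; the exact weights below are an assumption, not the tool's real formula:

```javascript
// Illustrative scoring: upvotes (Reddit) or points (HN), plus a bonus for
// comments, since active discussion signals engagement. Weights are arbitrary.
function engagementScore(item) {
  const upvotes = item.upvotes ?? item.points ?? 0;
  const comments = item.comments ?? 0;
  return upvotes + comments * 2;
}
```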
#### Data Organization by Day

The tool automatically creates dated folders and organized files:

```
data/
├── 2025-03-06/              # Today's folder
│   ├── trending.json        # Top 20 trending (all sources combined)
│   ├── reddit.json          # All Reddit items
│   ├── hackernews.json      # All HN items
│   ├── rss.json             # All RSS items
│   ├── linkedin.json        # All LinkedIn items (if enabled)
│   └── all.json             # Everything combined
│
├── 2025-03-05/              # Yesterday
│   ├── trending.json
│   ├── reddit.json
│   └── ...
│
├── 2025-03-04/              # Day before
│   └── ...
│
└── archive/
    ├── week-2025-03-04/     # Archived by week
    ├── week-2025-02-25/
    └── ...
```

#### File Structure

Each daily folder contains:

**trending.json** - Top trending items from all sources
```json
{
  "date": "2025-03-06",
  "generated_at": "2025-03-06T10:00:00Z",
  "items": [
    {
      "rank": 1,
      "score": 4750,
      "sources": ["reddit", "hackernews"],
      "title": "GPT-5 Release Confirmed",
      "url": "https://...",
      "summary": "..."
    }
  ]
}
```

**reddit.json** - All Reddit items
```json
{
  "date": "2025-03-06",
  "source": "reddit",
  "items": [
    {
      "id": "reddit_abc123",
      "subreddit": "MachineLearning",
      "title": "...",
      "upvotes": 4500,
      "comments": 823,
      "url": "..."
    }
  ]
}
```

**all.json** - Combined data from all sources
```json
{
  "date": "2025-03-06",
  "total_items": 84,
  "sources": {
    "reddit": 47,
    "hackernews": 12,
    "rss": 25
  },
  "items": [...]
}
```

### 3. Agent Access (TO BUILD)

#### Option A: MCP Server (Recommended)
```
// Exposes tools to AI agents
mcp-server/
├── tools/
│   ├── scrape-all.js      # Scrape all sources
│   ├── get-trending.js    # Get trending items
│   ├── get-by-topic.js    # Get items by topic
│   ├── get-fresh.js       # Get items from last N hours
│   └── search.js          # Search content
└── index.js               # MCP server
```

#### Option B: Simple CLI
```bash
# Get trending items
npm run get:trending

# Get items by topic
npm run get:topic --topic=GPT

# Get fresh items (last 6 hours)
npm run get:fresh --hours=6

# Search content
npm run search --query="automation"
```

#### Option C: Simple API
```
// Lightweight Express server
GET /api/trending          # Get trending items
GET /api/topic/:topic      # Get by topic
GET /api/fresh/:hours      # Get by freshness
GET /api/search?q=query    # Search content
```

---

## 📁 Simplified Project Structure

```
social-automation/
├── src/
│   ├── scrapers/          # Data source scrapers
│   │   ├── rss.js         # RSS feed scraper
│   │   ├── reddit.js      # Reddit scraper
│   │   ├── hackernews.js  # Hacker News scraper
│   │   └── linkedin.js    # LinkedIn scraper
│   │
│   ├── processors/        # Data processing
│   │   ├── dedupe.js      # Deduplication
│   │   ├── score.js       # Engagement scoring
│   │   ├── categorize.js  # Topic categorization
│   │   └── filter.js      # Age/quality filters
│   │
│   ├── storage/           # Data management
│   │   ├── save.js        # Save to JSON
│   │   ├── load.js        # Load from JSON
│   │   └── query.js       # Query/filter data
│   │
│   ├── mcp/               # MCP server (if using MCP)
│   │   ├── server.js      # MCP server
│   │   └── tools/         # MCP tools
│   │
│   └── cli.js             # Simple CLI interface
│
├── data/
│   ├── 2025-03-06/        # Today's scraped data
│   │   ├── trending.json  # Top trending items
│   │   ├── reddit.json    # Reddit items
│   │   ├── hackernews.json # HN items
│   │   ├── rss.json       # RSS items
│   │   └── all.json       # All items combined
│   ├── 2025-03-05/        # Yesterday's data
│   ├── 2025-03-04/        # Previous day
│   └── archive/           # Older data (weekly folders)
│
├── config/
│   └── sources.json       # Source configuration
│
├── .env                   # API keys (minimal)
├── package.json
├── MASTER_PLAN.md         # This file
├── INSTRUCTIONS.md        # Usage instructions
└── README.md
```

---

## 🔄 Daily Workflow (Organized by Day)

### 📅 Monday - Research & Planning

**Morning (9:00 AM)**
```bash
# 1. Scrape fresh data for the week
npm run fetch

# 2. Review what's trending
npm run queue

# 3. Identify top themes for the week
cat data/queue/*.json | jq '[.[] | select(.metadata.score > 100)]'
```

**Midday (12:00 PM)**
```bash
# 4. Feed to AI agent
"Analyze the trending data and identify the top 3 AI themes for this week"

# 5. Plan content calendar
"Create a content plan for Mon-Sun based on these trends"
```

**Afternoon (3:00 PM)**
```bash
# 6. Generate the week's first content
"Write Monday's LinkedIn post about the #1 trending story"

# 7. Create content for the week
"Generate 7 posts (one for each day) based on the trends"
```

---

### 📅 Tuesday - Content Generation

**Morning (9:00 AM)**
```bash
# 1. Check for fresh updates
npm run fetch

# 2. Review Monday's performance
"Which topics from Monday got the most engagement?"

# 3. Refresh Tuesday's content
"Update Tuesday's post with any new developments"
```

**Midday (12:00 PM)**
```bash
# 4. Generate variations
"Create 3 different versions of Tuesday's post:
 - Professional/serious
 - Casual/friendly
 - Educational"
```

**Afternoon (3:00 PM)**
```bash
# 5. Prepare social media
"Create a Twitter thread version of Tuesday's post"

# 6. Generate hashtags
"Suggest 10 relevant hashtags for Tuesday's content"
```

---

### 📅 Wednesday - Mid-Week Review

**Morning (9:00 AM)**
```bash
# 1. Scrape fresh data
npm run fetch

# 2. Compare with Monday's trends
"How have the trending topics changed since Monday?"

# 3. Identify emerging stories
"Are there any new trending stories that weren't on Monday's list?"
```

**Midday (12:00 PM)**
```bash
# 4. Weekly roundup
"Create a 'This Week in AI' summary post combining Mon-Wed trends"

# 5. Case study content
"Turn the #2 trending story into a case study format"
```

**Afternoon (3:00 PM)**
```bash
# 6. Community engagement
"Generate 5 discussion questions based on this week's trends"

# 7. Interactive content
"Create a poll idea based on the most controversial AI topic"
```

---

### 📅 Thursday - Deep Dives

**Morning (9:00 AM)**
```bash
# 1. Scrape fresh data
npm run fetch

# 2. Select a deep-dive topic
"From the trending data, which topic deserves a detailed explanation?"

# 3. Research mode
cat data/queue/*.json | jq '[.[] | select(.title | contains("GPT"))]'
```

**Midday (12:00 PM)**
```bash
# 4. Educational content
"Write an 'Explain Like I'm 5' post about the chosen topic"

# 5. Technical deep-dive
"Create a more technical version for the community page"
```

**Afternoon (3:00 PM)**
```bash
# 6. Myth-busting
"Identify common misconceptions about this topic and create a myth-busting post"

# 7. FAQ content
"Generate 5 FAQ items based on this week's trends"
```

---

### 📅 Friday - Weekly Summary

**Morning (9:00 AM)**
```bash
# 1. Final scrape of the week
npm run fetch

# 2. Week-over-week comparison
"Compare this week's trends with last week's trends"
```

**Midday (12:00 PM)**
```bash
# 3. Weekly roundup
"Create a comprehensive 'Top 5 AI Stories This Week' post"

# 4. Winners & losers
"Which technologies gained momentum? Which declined?"
```

**Afternoon (3:00 PM)**
```bash
# 5. Weekend content
"Create lighter, weekend-appropriate content based on trends"

# 6. Look ahead
"Based on this week's trends, what should we watch next week?"
```

---

### 📅 Saturday - Light Content

**Morning (10:00 AM)**
```bash
# 1. Quick scrape check
npm run fetch

# 2. Fun facts
"Extract interesting/fun facts from this week's trending data"

# 3. Trivia
"Create AI-themed trivia based on the trends"
```

**Afternoon (2:00 PM)**
```bash
# 4. Weekend reading
"Create a 'Weekend Reading' list from the best long-form content"

# 5. Community highlights
"Highlight the most interesting community discussions"
```

---

### 📅 Sunday - Planning for Next Week

**Morning (10:00 AM)**
```bash
# 1. Archive this week's data
mkdir -p data/archive/week-$(date +%Y-%m-%d)
mv data/queue/*.json data/archive/week-$(date +%Y-%m-%d)/

# 2. Fresh scrape for next week
npm run fetch
```

**Midday (12:00 PM)**
```bash
# 3. Weekly review
"Summarize the key AI developments from this week"

# 4. Performance analysis
"Which types of content performed best this week?"
```

**Afternoon (3:00 PM)**
```bash
# 5. Next week planning
"Based on current trends, what should we focus on next week?"

# 6. Content calendar
"Draft a content plan for Mon-Fri of next week"
```

---

## 🔄 Quick Daily Workflow (For Busy Days)

**When You're Short on Time**

```bash
# 15-Minute Routine
npm run fetch                                 # 2 min
npm run queue | head -20                      # 3 min
"Write a post about the #1 trending story"    # 10 min
```

**When You Have More Time**

```bash
# 1-Hour Deep Dive
npm run fetch                   # 2 min
# Review all trending items       10 min
# Select top 3 stories            5 min
# Generate content for all 3      30 min
# Review and refine               13 min
```

---

## 🎯 Implementation Schedule (Day-by-Day)

### Week 1: Simplify & Organize

**Day 1: Cleanup**
- [ ] Remove unused scheduling code
- [ ] Remove auto-publishing code
- [ ] Remove daily post generators
- [ ] Keep only scraping functionality

**Day 2: Simplify Data Structure**
- [ ] Standardize JSON format
- [ ] Remove unnecessary fields
- [ ] Add source tracking
- [ ] Improve metadata

**Day 3: Build Query CLI**
- [ ] `npm run query --trending`
- [ ] `npm run query --topic=NAME`
- [ ] `npm run query --fresh=HOURS`
- [ ] `npm run query --search=QUERY`

**Day 4: Documentation**
- [ ] Update README
- [ ] Create agent guide
- [ ] Add examples
- [ ] Create quick start

**Day 5: Testing & Bug Fixes**
- [ ] Test all scrapers
- [ ] Fix any bugs
- [ ] Test query CLI
- [ ] Performance check

**Day 6-7: Buffer/Polish**
- [ ] Refine based on testing
- [ ] Add error handling
- [ ] Improve logging

---

### Week 2: MCP Server (Optional)

**Day 8: MCP Setup**
- [ ] Initialize MCP server
- [ ] Basic server structure
- [ ] Tool registration

**Day 9: Scraping Tools**
- [ ] MCP tool: `scrape_all()`
- [ ] MCP tool: `scrape_rss()`
- [ ] MCP tool: `scrape_reddit()`
- [ ] MCP tool: `scrape_hn()`

**Day 10: Query Tools**
- [ ] MCP tool: `get_trending()`
- [ ] MCP tool: `get_by_topic()`
- [ ] MCP tool: `get_fresh()`
- [ ] MCP tool: `search()`

**Day 11: Integration**
- [ ] Test with Claude
- [ ] Test with other agents
- [ ] Fix issues

**Day 12-14: Polish**
- [ ] Error handling
- [ ] Rate limiting
- [ ] Documentation

---

### Week 3: Optimization & Enhancements

**Day 15: Performance**
- [ ] Optimize scraping speed
- [ ] Add caching
- [ ] Parallel requests
- [ ] Batch processing

**Day 16: Data Quality**
- [ ] Better deduplication
- [ ] Improved scoring
- [ ] Smart categorization
- [ ] Spam filtering

**Day 17: User Experience**
- [ ] Better error messages
- [ ] Progress indicators
- [ ] Colored output
- [ ] Summary statistics

**Day 18: More Sources**
- [ ] Twitter/X integration
- [ ] YouTube channels
- [ ] More RSS feeds
- [ ] News API

**Day 19-21: Testing**
- [ ] Comprehensive testing
- [ ] Load testing
- [ ] Bug fixes
- [ ] Documentation updates

---

### Week 4: Advanced Features (Optional)

**Day 22: Trending Detection**
- [ ] Velocity tracking
- [ ] Emerging topics
- [ ] Trend predictions
- [ ] Hot right now

**Day 23: Analytics**
- [ ] Engagement tracking
- [ ] Source performance
- [ ] Topic trends
- [ ] Weekly reports

**Day 24: Export Features**
- [ ] Export to CSV
- [ ] Export to Markdown
- [ ] Generate reports
- [ ] Email summaries

**Day 25: Automation Helpers**
- [ ] Watch mode
- [ ] Auto-scrape on timer
- [ ] Webhook support
- [ ] API endpoints

**Day 26-28: Final Polish**
- [ ] All features complete
- [ ] Full documentation
- [ ] Examples & tutorials
- [ ] Release preparation

---

## 🛠️ Configuration

### Minimal Environment Variables

```bash
# Only needed for LinkedIn scraping (optional)
BRIGHTDATA_API_KEY=            # If using LinkedIn

# Optional: Rate limiting
RATE_LIMIT_DELAY=1000          # ms between requests
MAX_REQUESTS_PER_MINUTE=60
```
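
A sketch of how `RATE_LIMIT_DELAY` could be honored between requests; the env var name matches the block above, while the helper itself is illustrative:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch a list of URLs one at a time, pausing between requests.
// fetchFn defaults to the global fetch (Node 18+); the delay defaults
// to RATE_LIMIT_DELAY from the environment.
async function fetchSequentially(
  urls,
  fetchFn = fetch,
  delay = Number(process.env.RATE_LIMIT_DELAY ?? 1000)
) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchFn(url)); // one request at a time
    await sleep(delay);               // pause before the next one
  }
  return results;
}
```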

### Source Configuration (config/sources.json)

```json
{
  "rss": [
    {
      "name": "TechCrunch AI",
      "url": "https://techcrunch.com/category/artificial-intelligence/feed/",
      "enabled": true
    },
    {
      "name": "OpenAI Blog",
      "url": "https://openai.com/blog/rss.xml",
      "enabled": true
    }
  ],
  "reddit": {
    "enabled": true,
    "subreddits": [
      "MachineLearning",
      "artificial",
      "OpenAI"
    ],
    "limit": 50
  },
  "hackernews": {
    "enabled": true,
    "keywords": ["AI", "machine learning", "GPT"],
    "limit": 30
  },
  "filters": {
    "min_score": 100,
    "max_age_hours": 24,
    "deduplicate": true
  }
}
```

---

## 📊 Data Models

### Scraped Item
```json
{
  "id": "reddit_abc123",
  "title": "GPT-5 Release Confirmed by OpenAI",
  "summary": "OpenAI has officially confirmed...",
  "url": "https://reddit.com/r/...",
  "source": "reddit",
  "source_name": "r/MachineLearning",
  "score": 4500,
  "engagement": {
    "upvotes": 4500,
    "comments": 823
  },
  "topic": "GPT-5",
  "keywords": ["GPT", "OpenAI", "LLM"],
  "timestamp": "2025-03-06T10:00:00Z",
  "age_hours": 2
}
```

### Deduplicated Item
```json
{
  "id": "merged_123",
  "title": "GPT-5 Release Confirmed",
  "summary": "Combined from multiple sources...",
  "sources": [
    {
      "name": "reddit",
      "url": "https://reddit.com/...",
      "score": 4500
    },
    {
      "name": "hackernews",
      "url": "https://news.ycombinator.com/...",
      "score": 250
    }
  ],
  "combined_score": 4750,
  "topic": "GPT-5",
  "timestamp": "2025-03-06T10:00:00Z"
}
```
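
Merging scraped items into the deduplicated shape could be sketched like this; grouping by exact lowercased title is an assumption for illustration — the real dedupe may use fuzzier matching:

```javascript
// Group items that share a title and fold them into one entry with a
// sources list and a combined score, mirroring the model above.
function dedupe(items) {
  const byKey = new Map();
  for (const item of items) {
    const key = item.title.toLowerCase().trim(); // naive grouping key
    const existing = byKey.get(key);
    if (!existing) {
      byKey.set(key, {
        title: item.title,
        sources: [{ name: item.source, url: item.url, score: item.score }],
        combined_score: item.score,
      });
    } else {
      existing.sources.push({ name: item.source, url: item.url, score: item.score });
      existing.combined_score += item.score;
    }
  }
  return [...byKey.values()];
}
```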

---

## 🎯 Success Metrics

### Technical
- **Scraping speed**: <30 seconds for all sources
- **Data freshness**: <1 hour old
- **Deduplication accuracy**: >95%
- **Uptime**: >99%

### Data Quality
- **Relevant items**: >80% of scraped content
- **Trending accuracy**: Matches Reddit/HN front pages
- **Source diversity**: 3+ sources per topic
- **Engagement correlation**: Scored items actually get engagement

### Agent Experience
- **Easy to query**: Simple CLI/API
- **Well-organized**: Logical categorization
- **Fast response**: <1 second query time
- **Clear data**: Understandable structure

---

## 🚀 Quick Start for Agents

### 1. Setup
```bash
cd /home/vankhoa/projects/social-automation
npm install
```

### 2. Configure
```bash
# Edit sources if needed
nano config/sources.json
```

### 3. Use
```bash
# Scrape all sources
npm run scrape:all

# Get trending items
npm run get:trending

# Get by topic
npm run get:topic --topic=GPT

# Get fresh items
npm run get:fresh --hours=6
```

### 4. Query
```bash
# Interactive query mode
npm run query

> Show me trending AI news
> What's hot about GPT?
> Stories from last 6 hours
```

---

## 📋 Today's Tasks (Daily Checklist)

### Morning Checklist ✅
```
□ Run: npm run fetch
□ Check: npm run queue
□ Review: Top 5 trending items
□ Identify: Today's main theme
```

### Midday Tasks ✅
```
□ Generate content with AI agent
□ Create multiple versions (A/B test)
□ Prepare hashtags
□ Draft engagement questions
```

### Afternoon Tasks ✅
```
□ Review and refine content
□ Add French translation
□ Create image prompts
□ Schedule/post content
```

### End of Day ✅
```
□ Backup today's data
□ Note what worked well
□ Plan tomorrow's focus
□ Archive completed items
```

---

## 🔄 Daily Workflow (Agent vs. Human)

### For AI Agent
```
1. Scrape fresh data
   npm run scrape:all

2. Query for what's needed
   npm run get:trending --limit=10

3. Receive organized data
   [Array of scored, deduped items]

4. Generate creative content
   (Agent does this part)

5. Done! No publishing automation needed
```

### For Human
```
1. Morning scrape
   npm run scrape:all

2. Review trending
   npm run get:trending

3. Share with AI agent
   "Here's what's trending today, write a post about..."

4. Agent creates content

5. Review and post manually
   (Human does this part)
```

---

## 🎯 What This Tool Doesn't Do

### Intentionally NOT Included

❌ **No automated writing**
- AI agents are better at creativity
- This tool focuses on data, not content

❌ **No scheduling**
- Run it when you need data
- No cron jobs needed

❌ **No auto-publishing**
- Human oversight is important
- Quality over automation

❌ **No complex queues**
- Simple JSON storage
- No database needed

❌ **No web dashboard**
- CLI is faster
- Agents don't need GUIs

---

## 📋 Implementation Checklist

### Core Functionality
- [x] RSS scraper
- [x] Reddit scraper
- [x] Hacker News scraper
- [x] Deduplication logic
- [x] Scoring system
- [x] JSON storage

### Agent Interface
- [ ] MCP server setup
- [ ] Simple CLI
- [ ] Query interface
- [ ] Agent documentation

### Polish
- [ ] Error handling
- [ ] Rate limiting
- [ ] Tests
- [ ] Documentation

---

## 🎯 Next Steps

### Immediate (This Week)
1. ✅ Keep existing scrapers
2. ✅ Simplify data structure
3. [ ] Remove unnecessary features (scheduling, auto-publishing)
4. [ ] Build simple query interface
5. [ ] Create agent-friendly documentation

### Short-term (Next 2 Weeks)
1. Build MCP server
2. Add query tools
3. Test with AI agents
4. Optimize performance

### Long-term (Optional)
1. Add more sources
2. Improve categorization
3. Add trending detection
4. Build simple web UI (if needed)

---

## 📝 Summary

**This is a research tool, not a publishing platform.**

The goal is to:
1. Scrape data from multiple sources
2. Organize it effectively
3. Make it easy to query
4. Let AI agents handle the creative work

**Simple, focused, agent-friendly.**

---

## 🗓️ Week-at-a-Glance

### Monday
- **Focus**: Research & Planning
- **Scrape**: Full data refresh
- **Output**: Content calendar for the week

### Tuesday
- **Focus**: Content Generation
- **Scrape**: Update check
- **Output**: Tuesday's content + variations

### Wednesday
- **Focus**: Mid-Week Review
- **Scrape**: Fresh updates
- **Output**: Weekly roundup, case studies

### Thursday
- **Focus**: Deep Dives
- **Scrape**: Topic research
- **Output**: Educational content, FAQs

### Friday
- **Focus**: Weekly Summary
- **Scrape**: Final weekly scrape
- **Output**: Top 5 stories, weekend preview

### Saturday
- **Focus**: Light Content
- **Scrape**: Quick check
- **Output**: Fun facts, trivia, highlights

### Sunday
- **Focus**: Planning Next Week
- **Scrape**: Archive & refresh
- **Output**: Next week's plan, weekly review

---

## 🎯 Daily Success Metrics

### Daily Goals
- ✅ Scraped fresh data
- ✅ Reviewed trending items
- ✅ Created 1-3 quality posts
- ✅ Engaged with audience

### Weekly Goals
- ✅ 7-10 posts published
- ✅ Covered trending topics
- ✅ Maintained consistency
- ✅ Grew engagement

### Monthly Goals
- ✅ 30-40 posts published
- ✅ Identified top-performing content
- ✅ Refined content strategy
- ✅ Built audience

---

**Version:** 3.0 (Simplified)
**Last Updated:** 2025-03-06
**Focus:** Agent Tool, Not Automated Platform