@aikeytake/social-automation 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md ADDED
# Social Automation — Content Research Tool

Content aggregation tool that scrapes AI news from multiple sources and stores structured JSON for AI agents to consume.

## What It Does

- Scrapes **17 RSS feeds** (TechCrunch, OpenAI, Anthropic, Claude Blog, Google AI, DeepMind, Hugging Face, arXiv, and more)
- Scrapes **Reddit** (7 AI subreddits, top posts with 100+ upvotes)
- Scrapes **Hacker News** (AI-related stories with 50+ points)
- Scrapes **LinkedIn KOL posts** via BrightData SERP (top 20 KOLs from your list)
- Outputs a `trending.json` with the top 20 ranked items
- Saves everything as structured JSON for AI agents

## Quick Start

```bash
cd /home/vankhoa/projects/social-automation
npm install
npm run scrape
```

## The Only Command You Need

```bash
npm run scrape
```

Output is saved to `data/YYYY-MM-DD/`:

| File | Contents |
|------|----------|
| `all.json` | All items from all sources combined |
| `trending.json` | Top 20 items ranked by engagement score |
| `rss.json` | All RSS feed items |
| `reddit.json` | All Reddit posts |
| `hackernews.json` | All Hacker News stories |
| `linkedin.json` | LinkedIn KOL posts via BrightData |
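The weights behind the `trending.json` ranking live in `config/sources.json` under `trendAnalysis` (`engagementWeight: 0.6`, `recencyWeight: 0.4`). As a rough sketch of how such a blend can be computed (the formula and helper below are illustrative assumptions, not the package's actual scoring code):

```javascript
// Hypothetical weighted trend score: 60% engagement, 40% recency.
// The weights mirror trendAnalysis in config/sources.json; the formula
// itself is an assumption, not the scraper's real implementation.
const WEIGHTS = { engagement: 0.6, recency: 0.4 };

function trendScore(item, now = Date.now()) {
  // Engagement: log-scale upvotes so a single viral post doesn't dominate.
  const upvotes = item.engagement?.upvotes ?? 0;
  const engagement = Math.min(Math.log10(1 + upvotes) / 4, 1); // ~1.0 at 10k upvotes

  // Recency: linear decay to 0 over 48 hours (the configured maxAgeHours).
  const ageHours = (now - new Date(item.pubDate).getTime()) / 3.6e6;
  const recency = Math.max(0, 1 - ageHours / 48);

  return Math.round(100 * (WEIGHTS.engagement * engagement + WEIGHTS.recency * recency));
}
```

A fresh, highly upvoted item scores near 100; anything past the recency window keeps only its engagement share.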
## Project Structure

```
social-automation/
├── src/
│   ├── fetchers/
│   │   ├── rss.js          # 17 RSS feeds
│   │   ├── reddit.js       # 7 AI subreddits
│   │   ├── hackernews.js   # HN top stories
│   │   └── linkedin.js     # LinkedIn KOL posts via BrightData SERP
│   ├── utils/
│   │   └── logger.js
│   ├── cli.js
│   └── index.js            # Main scraper
├── config/
│   └── sources.json        # All source configuration
├── data/
│   └── YYYY-MM-DD/         # Daily scraped output
├── .env                    # API keys
└── package.json
```

## Configuration

### Environment Variables (`.env`)

Already configured. Key variables:

```bash
BRIGHTDATA_API_KEY=...        # Used for LinkedIn KOL scraping
BRIGHTDATA_ZONE=mcp_unlocker  # BrightData zone
ANTHROPIC_API_KEY=...         # Claude API (for future AI processing)
```
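Missing keys are easier to diagnose when they are caught at startup. A minimal fail-fast sketch (an assumed helper, not code that ships in `src/`):

```javascript
// Hypothetical startup guard: verify required environment variables exist
// before any scraping begins. The names come from the .env section above.
const REQUIRED_ENV = ["BRIGHTDATA_API_KEY", "BRIGHTDATA_ZONE"];

function checkEnv(env = process.env) {
  const missing = REQUIRED_ENV.filter((key) => !env[key] || env[key].trim() === "");
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return true;
}
```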

### Sources (`config/sources.json`)

**RSS Feeds (17 sources):**
- TechCrunch AI, The Gradient, MIT Technology Review AI
- OpenAI Blog, Anthropic Blog, **Claude Blog**
- Google AI Blog, DeepMind Blog, Hugging Face Blog
- Meta Engineering, Netflix Tech Blog, AWS ML Blog
- Microsoft AI Blog, NVIDIA Blog, LinkedIn Engineering
- arXiv AI (cs.AI), arXiv Machine Learning (cs.LG)

**Reddit:** MachineLearning, artificial, ArtificialIntelligence, deeplearning, OpenAI, LocalLLaMA, singularity

**Hacker News:** keyword-filtered (AI, LLM, GPT, Anthropic, etc.), 50+ points

**LinkedIn:** top 20 KOLs from `workspace/marketing/linkedin_kol_clean.json`, scraped via BrightData SERP

### Adding an RSS Feed

Edit `config/sources.json`:

```json
{
  "rssFeeds": [
    {
      "name": "My Blog",
      "url": "https://example.com/feed.xml",
      "category": "ai-news",
      "enabled": true
    }
  ]
}
```
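Before enabling a new feed, it helps to sanity-check the entry's shape. A small validator sketch (an assumed helper; the required fields mirror the entry format above):

```javascript
// Hypothetical validator for an rssFeeds entry in config/sources.json.
// Checks the fields the config format uses: name, url, category, enabled.
function validateFeedEntry(entry) {
  const errors = [];
  if (typeof entry.name !== "string" || entry.name.length === 0) errors.push("name must be a non-empty string");
  try {
    const url = new URL(entry.url);
    if (!["http:", "https:"].includes(url.protocol)) errors.push("url must be http(s)");
  } catch {
    errors.push("url is not a valid URL");
  }
  if (typeof entry.category !== "string") errors.push("category must be a string");
  if (typeof entry.enabled !== "boolean") errors.push("enabled must be a boolean");
  return errors;
}
```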

### Adjusting LinkedIn KOL Limit

Edit `config/sources.json`:

```json
{
  "linkedin": {
    "limit": 20
  }
}
```

## Reading the Data

```bash
# View today's trending items
cat data/$(date +%Y-%m-%d)/trending.json | jq '.items[] | {rank, title, score}'

# View all items from a specific source
cat data/$(date +%Y-%m-%d)/all.json | jq '[.items[] | select(.source == "reddit")]'

# Search by keyword
cat data/$(date +%Y-%m-%d)/all.json | jq '[.items[] | select(.title | contains("GPT"))]'

# View LinkedIn KOL posts
cat data/$(date +%Y-%m-%d)/linkedin.json | jq '.items[]'
```
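Agents that prefer Node over `jq` can apply the same queries in code. A sketch over the parsed `{ items: [...] }` shape (the sample data and helper names are illustrative):

```javascript
// Hypothetical Node equivalents of the jq queries above, operating on the
// parsed { items: [...] } shape of all.json / trending.json.
function bySource(items, source) {
  return items.filter((item) => item.source === source);
}

function byKeyword(items, keyword) {
  return items.filter((item) => item.title.includes(keyword));
}

// Against a real file this would be driven by something like:
//   const { items } = JSON.parse(fs.readFileSync(`data/${today}/all.json`, "utf8"));
```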

## Using with AI Agents

Point the agent at today's data folder:

```
Read data/$(date +%Y-%m-%d)/trending.json and create a LinkedIn post about the top trending AI story.
```

Or for deeper research:

```
Read data/$(date +%Y-%m-%d)/all.json and summarize the most important AI developments from the last 24 hours.
```

## Browser-Based Sources (Twitter/X & LinkedIn Browser)

Two sources use a real Chrome browser via Playwright to scrape without an API: **Twitter/X** and **LinkedIn Browser**. They share the same browser profile, stored at `data/playwright-profile/`.

### One-Time Setup

Run the setup script once to log in and save the browser session:

```bash
npm run setup:twitter
```

This opens a real Chrome window. **Log in to both X and LinkedIn** in that window (they share the same profile). Once you're logged in to both, close the window — the session is saved automatically.

> ⚠️ Use a **dedicated scraping account**, not your personal account. Sessions last several weeks. Re-run `npm run setup:twitter` when you see auth errors.

---

### Twitter / X

**Enable in `config/sources.json`:**

```json
"trendingSources": {
  "twitter": {
    "enabled": true,
    "accounts": ["AndrewYNg", "ylecun", "OpenAI", "AnthropicAI", "karpathy"],
    "minLikes": 100,
    "maxTweetsPerAccount": 5,
    "maxAgeHours": 24,
    "delayBetweenAccountsMs": 3000
  }
}
```

**Config options:**

| Key | Description | Default |
|-----|-------------|---------|
| `accounts` | X handles to scrape (without `@`) | `[]` |
| `minLikes` | Skip tweets below this like count | `0` |
| `maxTweetsPerAccount` | Max tweets to fetch per account | `10` |
| `maxAgeHours` | Only include tweets from the last N hours | `24` |
| `delayBetweenAccountsMs` | Base delay between accounts (ms) | `3000` |

**Run:**

```bash
npm run test:twitter   # isolated test, prints results, no files written
npm run scrape         # full pipeline
```

**How it works:**
- Visits the X home feed first, then searches for each account via the search box
- Clicks the matching result to navigate to the profile
- Scrolls the timeline and extracts the top N tweets
- Applies a random 20–30s delay between accounts to avoid rate limiting
- Account visit order is randomised each run
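The shuffled visit order and jittered delay described above can be sketched as follows (assumed helpers, not the fetcher's actual code):

```javascript
// Hypothetical sketch of the anti-rate-limit behaviour described above:
// shuffle the account order, then wait a random 20-30 seconds per account.
function shuffle(list) {
  // Fisher-Yates shuffle on a copy, so the config array stays untouched.
  const out = [...list];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

function jitterMs(minMs = 20_000, maxMs = 30_000) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// The scrape loop would then look roughly like:
// for (const account of shuffle(config.accounts)) {
//   await scrapeAccount(account);
//   await sleep(jitterMs());
// }
```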

---

### LinkedIn Browser

Scrapes posts from LinkedIn profiles by navigating directly to each profile's recent-activity page.

**Enable in `config/sources.json`:**

```json
"linkedin_browser": {
  "enabled": true,
  "accounts": ["julienchaumond", "another-slug"],
  "maxPostsPerAccount": 5,
  "maxAgeHours": 48,
  "delayBetweenAccountsMs": 10000
}
```

The `accounts` value is the LinkedIn profile slug — the part after `linkedin.com/in/`.

**Config options:**

| Key | Description | Default |
|-----|-------------|---------|
| `accounts` | LinkedIn profile slugs to scrape | `[]` |
| `maxPostsPerAccount` | Max posts to fetch per account | `5` |
| `maxAgeHours` | Only include posts from the last N hours | `48` |
| `delayBetweenAccountsMs` | Base delay between accounts (ms) | `10000` |

**Run:**

```bash
npm run test:linkedin   # isolated test, prints results, no files written
npm run scrape          # full pipeline
```

**How it works:**
- Navigates directly to `linkedin.com/in/{slug}/recent-activity/all/`
- Scrolls to load posts; extracts text, reactions, comments, and timestamp
- Constructs the post URL from LinkedIn's `data-urn` attribute
- Account visit order is randomised each run
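The `data-urn` step can be sketched as below. The `urn:li:activity:<id>` format and the `/feed/update/` URL shape are assumptions about LinkedIn's current markup rather than a documented API, so treat this as illustrative:

```javascript
// Hypothetical reconstruction of a post URL from a data-urn attribute.
// Assumes URNs of the form "urn:li:activity:<numeric id>", which LinkedIn's
// feed markup commonly uses; this shape is not guaranteed to be stable.
function postUrlFromUrn(urn) {
  if (!/^urn:li:activity:\d+$/.test(urn)) return null;
  return `https://www.linkedin.com/feed/update/${urn}/`;
}
```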

---

### Output Files

| File | Source |
|------|--------|
| `data/YYYY-MM-DD/twitter.json` | Twitter/X posts |
| `data/YYYY-MM-DD/linkedin_browser.json` | LinkedIn browser posts |

Both sources feed into `all.json` and `trending.json` automatically.

---

## Troubleshooting

**LinkedIn returns 0 items:**
- Check the logs for BrightData errors: `cat logs/*.log | grep -i linkedin`
- Confirm the KOL file exists: `ls /home/vankhoa/projects/aikeytake/workspace/marketing/linkedin_kol_clean.json`
- The BrightData zone `mcp_unlocker` must exist in your BrightData account

**RSS feed fails:**
- Some feeds go down temporarily; the scraper skips them and continues
- Check the logs in `logs/` for specific feed errors

**No data for today:**
```bash
# Run the scraper
npm run scrape

# Check that the data folder was created
ls data/$(date +%Y-%m-%d)/
```
package/config/sources.json ADDED
{
  "rssFeeds": [
    {
      "name": "TechCrunch AI",
      "url": "https://techcrunch.com/category/artificial-intelligence/feed/",
      "category": "ai-news",
      "enabled": true
    },
    {
      "name": "The Gradient",
      "url": "https://thegradient.pub/rss/",
      "category": "ai-research",
      "enabled": true
    },
    {
      "name": "MIT Technology Review AI",
      "url": "https://www.technologyreview.com/feed/",
      "category": "tech-news",
      "enabled": true
    },
    {
      "name": "OpenAI Blog",
      "url": "https://openai.com/blog/rss.xml",
      "category": "company-news",
      "enabled": true
    },
    {
      "name": "Anthropic Blog",
      "url": "https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_anthropic.xml",
      "category": "company-news",
      "enabled": true
    },
    {
      "name": "Claude Blog",
      "url": "https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_claude.xml",
      "category": "company-news",
      "enabled": true
    },
    {
      "name": "Google AI Blog",
      "url": "https://blog.google/technology/ai/rss/",
      "category": "company-news",
      "enabled": true
    },
    {
      "name": "DeepMind Blog",
      "url": "https://deepmind.google/blog/rss.xml",
      "category": "research",
      "enabled": true
    },
    {
      "name": "Hugging Face Blog",
      "url": "https://huggingface.co/blog/feed.xml",
      "category": "ml-frameworks",
      "enabled": true
    },
    {
      "name": "Meta Engineering",
      "url": "https://engineering.fb.com/feed/",
      "category": "company-engineering",
      "enabled": true
    },
    {
      "name": "Netflix Tech Blog",
      "url": "https://medium.com/feed/netflix-techblog",
      "category": "company-engineering",
      "enabled": true
    },
    {
      "name": "AWS Machine Learning Blog",
      "url": "https://aws.amazon.com/blogs/machine-learning/feed/",
      "category": "cloud-ai",
      "enabled": true
    },
    {
      "name": "Microsoft AI Blog",
      "url": "https://blogs.microsoft.com/ai/feed/",
      "category": "company-news",
      "enabled": true
    },
    {
      "name": "NVIDIA Technical Blog",
      "url": "https://blogs.nvidia.com/feed/",
      "category": "company-engineering",
      "enabled": true
    },
    {
      "name": "LinkedIn Engineering",
      "url": "https://engineering.linkedin.com/blog.rss",
      "category": "company-engineering",
      "enabled": true
    },
    {
      "name": "arXiv AI",
      "url": "https://rss.arxiv.org/rss/cs.AI",
      "category": "research-papers",
      "enabled": true
    },
    {
      "name": "arXiv Machine Learning",
      "url": "https://rss.arxiv.org/rss/cs.LG",
      "category": "research-papers",
      "enabled": true
    }
  ],
  "linkedin_browser": {
    "enabled": true,
    "accounts": ["julienchaumond"],
    "maxPostsPerAccount": 5,
    "maxAgeHours": 48,
    "delayBetweenAccountsMs": 10000
  },
  "apiSources": [
    {
      "id": "goodailist",
      "name": "Good AI List",
      "enabled": true,
      "weight": 0.5,
      "request": {
        "url": "https://goodailist.com/api/repos",
        "method": "GET",
        "params": { "page": 1, "limit": 100, "sort": "star_1d", "order": "desc" }
      },
      "response": {
        "itemsPath": "repos"
      },
      "mapping": {
        "title": "repo",
        "link": "https://github.com/{repo}",
        "summary": "description",
        "content": "description",
        "author": { "field": "repo", "split": "/", "index": 0 },
        "pubDate": "created_at",
        "category": "category",
        "tags": { "field": "keywords", "split": "," },
        "engagement.upvotes": "star_1d",
        "metadata.stars": "stars",
        "metadata.star_7d": "star_7d",
        "metadata.forks": "forks",
        "metadata.language": "language"
      }
    },
    {
      "id": "producthunt",
      "name": "Product Hunt",
      "enabled": true,
      "auth": {
        "type": "oauth2_client_credentials",
        "tokenUrl": "https://api.producthunt.com/v2/oauth/token",
        "clientIdEnv": "PRODUCT_HUNT_API_KEY",
        "clientSecretEnv": "PRODUCT_HUNT_API_SECRET"
      },
      "request": {
        "url": "https://api.producthunt.com/v2/api/graphql",
        "method": "POST",
        "graphql": {
          "query": "query($first: Int!, $order: PostsOrder!, $postedAfter: DateTime!) { posts(first: $first, order: $order, postedAfter: $postedAfter) { edges { node { name tagline description url website votesCount commentsCount createdAt topics { edges { node { name } } } user { name } } } } }",
          "variables": { "first": 30, "order": "VOTES" }
        },
        "computedVariables": {
          "postedAfter": { "type": "daysAgo", "days": 7 }
        }
      },
      "response": {
        "itemsPath": "data.posts.edges",
        "itemUnwrap": "node"
      },
      "filter": {
        "field": "votesCount",
        "min": 50
      },
      "mapping": {
        "title": "name",
        "link": "url",
        "summary": "tagline",
        "content": ["tagline", "description"],
        "author": { "path": "user.name" },
        "pubDate": "createdAt",
        "category": { "path": "topics.edges", "map": "node.name", "index": 0 },
        "tags": { "path": "topics.edges", "map": "node.name" },
        "engagement.upvotes": "votesCount",
        "engagement.comments": "commentsCount",
        "metadata.website": "website"
      }
    }
  ],
  "linkedin": {
    "profilesFile": "/home/vankhoa/projects/aikeytake/workspace/marketing/linkedin_kol_clean.json",
    "enabled": true,
    "batchSize": 8,
    "budgetPerRun": 25,
    "checkIntervalHours": 24,
    "timeRange": "w",
    "resultsPerBatch": 10,
    "enrichContent": true,
    "enrichConcurrency": 5
  },
  "youtube": {
    "channels": [
      {
        "name": "Andrej Karpathy",
        "channelId": "UC之以A5_BH8q-8v6Fn4qF5A",
        "enabled": false
      },
      {
        "name": "Yannic Kilcher",
        "channelId": "UC媒介ucH6r6tiKnM2LTC1cw",
        "enabled": false
      }
    ],
    "enabled": false
  },
  "keywords": {
    "primary": [
      "artificial intelligence",
      "machine learning",
      "deep learning",
      "LLM",
      "GPT",
      "Claude",
      "transformer",
      "neural network",
      "AGI",
      "AI research"
    ],
    "secondary": [
      "computer vision",
      "NLP",
      "reinforcement learning",
      "diffusion model",
      "multimodal",
      "fine-tuning",
      "RAG",
      "agent",
      "LangChain",
      "vector database"
    ]
  },
  "filtering": {
    "minEngagementScore": 10,
    "maxAgeHours": 48,
    "deduplicationWindow": 72
  },
  "trendingSources": {
    "reddit": {
      "enabled": true,
      "subreddits": [
        "MachineLearning",
        "artificial",
        "ArtificialIntelligence",
        "deeplearning",
        "OpenAI",
        "LocalLLaMA",
        "singularity"
      ],
      "minScore": 100,
      "maxAge": "24h"
    },
    "hackernews": {
      "enabled": true,
      "keywords": [
        "AI",
        "artificial intelligence",
        "machine learning",
        "deep learning",
        "GPT",
        "LLM",
        "OpenAI",
        "Anthropic",
        "Google AI",
        "neural network"
      ],
      "minPoints": 50
    },
    "twitter": {
      "enabled": false,
      "accounts": [
        "AndrewYNg",
        "ylecun",
        "OpenAI",
        "AnthropicAI",
        "GoogleAI"
      ],
      "minLikes": 100,
      "maxTweetsPerAccount": 5,
      "maxAgeHours": 24,
      "delayBetweenAccountsMs": 3000
    }
  },
  "trendAnalysis": {
    "minTrendScore": 70,
    "sourceDiversity": true,
    "engagementWeight": 0.6,
    "recencyWeight": 0.4
  }
}
package/package.json ADDED
{
  "name": "@aikeytake/social-automation",
  "version": "2.0.0",
  "description": "Content research and aggregation tool for AI agents",
  "main": "src/index.js",
  "type": "module",
  "scripts": {
    "start": "node src/index.js scrape",
    "scrape": "node src/index.js scrape",
    "query": "node src/query.js",
    "queue": "node src/cli.js queue",
    "test": "node src/test.js",
    "setup:twitter": "node src/setup/twitter-login.js",
    "test:twitter": "node src/test/twitter.js",
    "test:linkedin": "node src/test/linkedin.js"
  },
  "keywords": [
    "social-media",
    "content-aggregation",
    "ai",
    "research-tool",
    "agent-tool"
  ],
  "author": "aikeytake",
  "license": "MIT",
  "dependencies": {
    "@supabase/supabase-js": "^2.47.0",
    "axios": "^1.7.9",
    "cheerio": "^1.0.0",
    "dotenv": "^16.4.7",
    "playwright": "^1.58.2",
    "rss-parser": "^3.13.0"
  },
  "devDependencies": {
    "@types/node": "^22.13.5"
  }
}