spectrawl 0.6.4 → 0.6.5

package/README.md CHANGED
@@ -171,6 +171,49 @@ Spectrawl detects block/challenge pages from **8 anti-bot services** and reports

  When a block is detected, the response includes `blocked: true` and `blockInfo: { type, detail }`.

+ ### Site-Specific Fallbacks
+
+ Some sites block all datacenter IPs regardless of stealth. Spectrawl automatically routes these through alternative APIs:
+
+ | Site | Problem | Fallback | Cost |
+ |------|---------|----------|------|
+ | **Reddit** | Blocks all datacenter IPs | [PullPush API](https://api.pullpush.io) — Reddit archive | Free |
+ | **Amazon** | CAPTCHA wall on product pages | [Jina Reader](https://r.jina.ai) — server-side rendering | Free |
+ | **X/Twitter** | Login wall on posts | [xAI Responses API](https://docs.x.ai) with `x_search` | ~$0.06/post |
+ | **LinkedIn** | HTTP 999, IP fingerprinting | Requires residential proxy (see below) | ~$7/GB |
+
+ These fallbacks activate automatically — just `browse()` the URL and Spectrawl picks the right path. No config needed for Reddit and Amazon. X requires `XAI_API_KEY` env var. LinkedIn requires a residential proxy.
+
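The routing described in the fallback table can be pictured as a hostname lookup. The sketch below is illustrative only: `pickFallback`, the `FALLBACKS` table, and the strategy labels are hypothetical names, and the diff does not show how Spectrawl actually selects a path.

```javascript
// Hypothetical routing table mirroring the rows of the README table.
// Strategy labels are illustrative, not Spectrawl identifiers.
const FALLBACKS = [
  { match: /(^|\.)reddit\.com$/, strategy: 'pullpush', needs: null },
  { match: /(^|\.)amazon\.[a-z.]+$/, strategy: 'jina-reader', needs: null },
  { match: /(^|\.)(x|twitter)\.com$/, strategy: 'xai-x_search', needs: 'XAI_API_KEY' },
  { match: /(^|\.)linkedin\.com$/, strategy: 'residential-proxy', needs: 'proxy' }
]

// Return the fallback strategy for a URL, or 'direct' when no rule matches.
function pickFallback(url) {
  const host = new URL(url).hostname.toLowerCase()
  const hit = FALLBACKS.find(f => f.match.test(host))
  return hit
    ? { strategy: hit.strategy, needs: hit.needs }
    : { strategy: 'direct', needs: null }
}
```

The `needs` field captures the README's note that Reddit and Amazon need no configuration, while X needs an API key and LinkedIn needs a proxy.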
+ #### LinkedIn: Why It's Different
+
+ LinkedIn fingerprints the IP where cookies were created. Even valid cookies get rejected from a different IP. Every free approach fails from datacenter servers:
+
+ - Direct browse: HTTP 999
+ - Voyager API with cookies: 401 (IP mismatch)
+ - Jina Reader: empty response
+ - Facebook/Googlebot UA: 317K of CSS, zero content
+
+ **The only working solution is a residential proxy.** We recommend [Smartproxy](https://smartproxy.com) ($7/GB pay-as-you-go, 55M residential IPs, 3-day free trial). At typical usage (~10 LinkedIn pages/month), cost is under $0.50/month.
+
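The cost claim can be checked with simple arithmetic. At $7/GB, a $0.50 monthly budget buys roughly 71 MB of proxied traffic, or about 7 MB per page across 10 pages. The helpers below are a sketch under stated assumptions: the function names are hypothetical, decimal gigabytes (1 GB = 1000 MB) are assumed, and the 5 MB page weight is an illustrative guess, not a figure from the README.

```javascript
// Estimated monthly proxy spend in USD for a given page count and page weight.
// Assumes billing in decimal GB (1 GB = 1000 MB).
function monthlyProxyCost(pagesPerMonth, mbPerPage, usdPerGb = 7) {
  const gb = (pagesPerMonth * mbPerPage) / 1000
  return gb * usdPerGb
}

// Largest average page weight (MB) that stays within a monthly budget.
function maxMbPerPage(budgetUsd, pagesPerMonth, usdPerGb = 7) {
  return (budgetUsd / usdPerGb) * 1000 / pagesPerMonth
}

// 10 pages at an assumed 5 MB each comes to about $0.35 at $7/GB.
const estimate = monthlyProxyCost(10, 5)
// A $0.50 budget over 10 pages allows a little over 7 MB per page.
const ceiling = maxMbPerPage(0.5, 10)
```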
+ Setup:
+ ```bash
+ # Add your proxy to Spectrawl config
+ npx spectrawl config set proxy '{"host":"gate.smartproxy.com","port":10001,"username":"YOUR_USER","password":"YOUR_PASS"}'
+
+ # Store your LinkedIn cookies (export from browser)
+ npx spectrawl login linkedin --account yourname --cookies ./linkedin-cookies.json
+
+ # Now browse LinkedIn normally
+ curl localhost:3900/browse -d '{"url":"https://www.linkedin.com/in/someone"}'
+ ```
+
+ Other residential proxy providers that work:
+ - [IPRoyal](https://iproyal.com) — $7/GB, 32M IPs
+ - [Bright Data](https://brightdata.com) — premium quality, higher cost
+ - [Oxylabs](https://oxylabs.io) — enterprise-grade
+
+ > ⚠️ **Avoid WebShare** — recycled datacenter IPs marketed as residential, no HTTPS support.
+

  ### CAPTCHA Solving

  Built-in CAPTCHA solver using **Gemini Vision** (free tier: 1,500 req/day):
@@ -685,27 +728,36 @@ Error types: `bad-request` (400), `unauthorized` (401), `forbidden` (403), `not-

  ## Proxy Configuration

- Route browsing through residential or datacenter proxies:
+ Route browsing through residential or datacenter proxies. **Required for LinkedIn** — see [Site-Specific Fallbacks](#site-specific-fallbacks) for why.

  ```json
  {
    "browse": {
      "proxy": {
-       "host": "proxy.example.com",
-       "port": 8080,
-       "username": "user",
-       "password": "pass"
+       "host": "gate.smartproxy.com",
+       "port": 10001,
+       "username": "YOUR_USER",
+       "password": "YOUR_PASS"
      }
    }
  }
  ```

- The proxy is used for all Playwright and Camoufox browsing sessions. You can also start a local rotating proxy server:
+ The proxy is used for all Playwright and Camoufox browsing sessions. You can also start a local rotating proxy server that rotates through multiple upstream proxies:

  ```bash
  npx spectrawl proxy --port 8080
  ```
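Since the README says the proxy applies to Playwright sessions, the `browse.proxy` object from the JSON config above can be mapped onto Playwright's `proxy` launch option (`server`, `username`, `password`). This is a sketch only: `toPlaywrightProxy` is a hypothetical helper, the `http` scheme default is an assumption, and the diff does not show how Spectrawl wires the proxy internally.

```javascript
// Convert a { host, port, username, password } proxy object into the
// { server, username, password } shape accepted by Playwright's launch options.
function toPlaywrightProxy(cfg, scheme = 'http') {
  if (!cfg || !cfg.host || !cfg.port) {
    throw new Error('proxy config needs host and port')
  }
  const proxy = { server: `${scheme}://${cfg.host}:${cfg.port}` }
  // Credentials are optional; only copy them when present.
  if (cfg.username) proxy.username = cfg.username
  if (cfg.password) proxy.password = cfg.password
  return proxy
}

const config = {
  browse: {
    proxy: {
      host: 'gate.smartproxy.com',
      port: 10001,
      username: 'YOUR_USER',
      password: 'YOUR_PASS'
    }
  }
}

const launchProxy = toPlaywrightProxy(config.browse.proxy)
// With Playwright installed: chromium.launch({ proxy: launchProxy })
```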

+ **Recommended providers:**
+
+ | Provider | Price | IPs | Best For |
+ |----------|-------|-----|----------|
+ | [Smartproxy](https://smartproxy.com) | $7/GB | 55M | Best budget option, 3-day free trial |
+ | [IPRoyal](https://iproyal.com) | $7/GB | 32M | Good alternative |
+ | [Bright Data](https://brightdata.com) | $12+/GB | 72M | Best quality, enterprise |
+ | [Oxylabs](https://oxylabs.io) | $10+/GB | 100M+ | Enterprise-grade |
+
  ## MCP Server

  Works with any MCP-compatible agent (Claude, Cursor, OpenClaw, LangChain):
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "spectrawl",
-   "version": "0.6.4",
+   "version": "0.6.5",
    "description": "The unified web layer for AI agents. Search (8 engines), stealth browse, auth, act on 24 platforms. Self-hosted.",
    "main": "src/index.js",
    "types": "index.d.ts",
@@ -16,6 +16,8 @@ class XAdapter {
      switch (action) {
        case 'post':
          return this._post(params, ctx)
+       case 'article':
+         return this._postArticle(params, ctx)
        case 'like':
          return this._like(params, ctx)
        case 'retweet':
@@ -126,6 +128,235 @@ class XAdapter {
      return { tweetId: data.data?.id, url: `https://x.com/i/status/${data.data?.id}` }
    }

+   /**
+    * Post an X Article (long-form) via browser automation.
+    * X API doesn't support articles — must use the web composer.
+    *
+    * Flow:
+    * 1. Navigate to /compose/articles (article list)
+    * 2. Find existing draft or create new article via GraphQL
+    * 3. Navigate to /compose/articles/edit/{articleId}
+    * 4. Fill title (data-testid="twitter-article-title") and body (contenteditable)
+    * 5. Auto-save triggers, or click Publish
+    *
+    * @param {object} params - { title, body, account, _cookies, publish, articleId }
+    *   publish: true = auto-publish, false = save as draft (default: false for safety)
+    *   articleId: edit existing article (optional)
+    */
+   async _postArticle(params, ctx) {
+     const { title, body, account, _cookies, publish = false, articleId } = params
+
+     if (!_cookies) {
+       throw new Error(`No auth for X/${account}. Run: spectrawl login x --account ${account}`)
+     }
+     if (!title) throw new Error('X article requires a title')
+     if (!body) throw new Error('X article requires a body')
+
+     // Step 1: Get article editor URL
+     let editorUrl
+     if (articleId) {
+       editorUrl = `https://x.com/compose/articles/edit/${articleId}`
+     } else {
+       // First go to articles list to find/create an article
+       const { page: listPage, context: listCtx } = await ctx.browse.getPage({
+         _cookies,
+         url: 'https://x.com/compose/articles'
+       })
+
+       try {
+         await listPage.waitForTimeout(2000 + Math.random() * 1000)
+
+         // Try to create a new article via the GraphQL API
+         const csrfToken = _cookies.find(c => c.name === 'ct0')?.value
+         const newArticleId = await listPage.evaluate(async (csrf) => {
+           try {
+             const res = await fetch('https://x.com/i/api/graphql/uKxr91kGF4E4mdN-G3x0Yw/CreateArticle', {
+               method: 'POST',
+               headers: {
+                 'Content-Type': 'application/json',
+                 'X-Csrf-Token': csrf,
+                 'Authorization': 'Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA',
+                 'X-Twitter-Auth-Type': 'OAuth2Session'
+               },
+               body: JSON.stringify({
+                 variables: {},
+                 queryId: 'uKxr91kGF4E4mdN-G3x0Yw'
+               })
+             })
+             const data = await res.json()
+             return data?.data?.article_create?.article_results?.result?.rest_id || null
+           } catch { return null }
+         }, csrfToken)
+
+         if (newArticleId) {
+           editorUrl = `https://x.com/compose/articles/edit/${newArticleId}`
+         } else {
+           // Fallback: find existing draft link or the write button
+           const draftLink = await listPage.$('a[href*="/compose/articles/edit/"]')
+           if (draftLink) {
+             const href = await draftLink.getAttribute('href')
+             editorUrl = `https://x.com${href}`
+           } else {
+             // Last resort: look for a "new article" / "write" link
+             editorUrl = await listPage.evaluate(() => {
+               const links = Array.from(document.querySelectorAll('a[href*="article"]'))
+               for (const l of links) {
+                 if (l.textContent.toLowerCase().includes('write') || l.textContent.toLowerCase().includes('new')) {
+                   return l.href
+                 }
+               }
+               return null
+             })
+           }
+         }
+
+         await listPage.close()
+       } catch (e) {
+         await listPage.close().catch(() => {})
+         throw e
+       }
+     }
+
+     if (!editorUrl) {
+       throw new Error('Could not find or create X article editor. Try passing articleId directly.')
+     }
+
+     // Step 2: Open the article editor
+     const { page, context } = await ctx.browse.getPage({
+       _cookies,
+       url: editorUrl
+     })
+
+     try {
+       await page.waitForTimeout(2000 + Math.random() * 1000)
+
+       // Check we're in the editor
+       const hasEditor = await page.$('[data-testid="twitter-article-title"], [contenteditable="true"]')
+       if (!hasEditor) {
+         const content = await page.evaluate(() => document.body.innerText)
+         throw new Error(`Not in article editor. Page content: ${content.slice(0, 200)}`)
+       }
+
+       // Step 3: Fill the title
+       // Title: data-testid="twitter-article-title" or placeholder "Add a title"
+       // Must click and type — execCommand doesn't work on this component
+       const titleEl = await page.$('[data-testid="twitter-article-title"]')
+       if (titleEl) {
+         await titleEl.click()
+         await page.waitForTimeout(300)
+         // Select all existing text and replace
+         await page.keyboard.down('Control')
+         await page.keyboard.press('a')
+         await page.keyboard.up('Control')
+         await page.waitForTimeout(100)
+         await page.keyboard.type(title, { delay: 15 + Math.random() * 25 })
+       } else {
+         // Fallback: find by placeholder
+         await page.evaluate(() => {
+           const el = document.querySelector('[data-placeholder="Add a title"]')
+           if (el) { el.click(); el.focus() }
+         })
+         await page.waitForTimeout(300)
+         await page.keyboard.type(title, { delay: 15 + Math.random() * 25 })
+       }
+
+       await page.waitForTimeout(500 + Math.random() * 500)
+
+       // Step 4: Fill the body
+       // Body has placeholder "Start writing" — it's a contenteditable div
+       // Click it directly to avoid navigating away
+       const bodyFilled = await page.evaluate((bodyText) => {
+         // Find body editor — the one with "Start writing" placeholder
+         const candidates = document.querySelectorAll('[contenteditable="true"]')
+         let bodyEl = null
+         for (const el of candidates) {
+           const placeholder = el.getAttribute('data-placeholder') || el.getAttribute('aria-describedby') || ''
+           const text = el.textContent || ''
+           // The body editor usually has "Start writing" or is the main content area
+           if (placeholder.includes('writing') || placeholder.includes('Start') ||
+               text.includes('Start writing') || el.getAttribute('aria-multiline') === 'true') {
+             bodyEl = el
+             break
+           }
+         }
+         // If not found by placeholder, take the contenteditable that's NOT the title
+         if (!bodyEl) {
+           const title = document.querySelector('[data-testid="twitter-article-title"]')
+           for (const el of candidates) {
+             if (el !== title && !title?.contains(el)) {
+               bodyEl = el
+               break
+             }
+           }
+         }
+         if (!bodyEl) return false
+
+         bodyEl.focus()
+         // Use insertText for proper React/editor state
+         document.execCommand('selectAll', false, null)
+         document.execCommand('insertText', false, bodyText)
+         return true
+       }, body)
+
+       if (!bodyFilled) {
+         // Fallback: click on "Start writing" text and type
+         const bodyArea = await page.evaluate(() => {
+           const els = document.querySelectorAll('[contenteditable="true"]')
+           for (const el of els) {
+             if (el.textContent.includes('Start writing') || el.getAttribute('aria-multiline') === 'true') {
+               el.click()
+               el.focus()
+               return true
+             }
+           }
+           return false
+         })
+         if (bodyArea) {
+           await page.waitForTimeout(300)
+           await page.keyboard.type(body, { delay: 5 })
+         }
+       }
+
+       await page.waitForTimeout(1500) // Let auto-save trigger
+
+       // Take screenshot for verification
+       const screenshot = await page.screenshot({ type: 'png' }).catch(() => null)
+       const draftUrl = page.url()
+
+       if (publish) {
+         // Find and click Publish button
+         const pubBtn = await page.$('button:has-text("Publish"), [role="button"]:has-text("Publish")')
+         if (pubBtn) {
+           await pubBtn.click()
+           await page.waitForTimeout(3000 + Math.random() * 2000)
+
+           // Handle confirmation dialog if present
+           const confirmBtn = await page.$('button:has-text("Publish"), [data-testid*="confirm"]')
+           if (confirmBtn) {
+             await confirmBtn.click()
+             await page.waitForTimeout(3000)
+           }
+         }
+
+         const finalUrl = page.url()
+         await page.close()
+         return { url: finalUrl, status: 'published', title }
+       } else {
+         await page.close()
+         return {
+           url: draftUrl,
+           status: 'draft',
+           title,
+           screenshot: screenshot ? screenshot.toString('base64') : null,
+           message: 'Article saved as draft. Set publish: true to auto-publish, or review at: ' + draftUrl
+         }
+       }
+     } catch (e) {
+       await page.close().catch(() => {})
+       throw e
+     }
+   }
+
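Step 1 of `_postArticle` tries four sources for the editor URL, in order: a caller-supplied `articleId`, the ID returned by the `CreateArticle` request, an existing draft link, and finally a "write"/"new" link. The pure function below is a sketch of that precedence, not code from the package. `resolveEditorUrl` and its parameter names are hypothetical, and it resolves the draft `href` with `new URL` where the diff uses string concatenation.

```javascript
const EDIT_BASE = 'https://x.com/compose/articles/edit/'

// Pick the editor URL from whichever source is available, in priority order.
// Parameter names are hypothetical labels for values gathered from the page.
function resolveEditorUrl({ articleId, createdId, draftHref, writeLinkHref } = {}) {
  if (articleId) return EDIT_BASE + articleId   // 1. caller-supplied ID
  if (createdId) return EDIT_BASE + createdId   // 2. ID returned by CreateArticle
  if (draftHref) {
    // 3. existing draft link; href may be relative to x.com
    return new URL(draftHref, 'https://x.com').href
  }
  if (writeLinkHref) return writeLinkHref       // 4. "write"/"new" link, already absolute
  return null                                   // caller should throw, as _postArticle does
}
```

Keeping the precedence in a pure function like this would make the fallback order unit-testable without a browser.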
    async _like(params, ctx) {
      // TODO: implement like via GraphQL
      throw new Error('X like not yet implemented')
@@ -626,7 +626,21 @@ class BrowseEngine {
      const context = await this._createContext(browser, opts)

      if (opts._cookies) {
-       await context.addCookies(opts._cookies)
+       const playwrightCookies = opts._cookies.map(c => {
+         const clean = { ...c }
+         if (!clean.sameSite || !['Strict', 'Lax', 'None'].includes(clean.sameSite)) {
+           clean.sameSite = 'Lax'
+         }
+         if (clean.domain && clean.domain.startsWith('.')) {
+           clean.domain = clean.domain.slice(1)
+         }
+         delete clean.hostOnly; delete clean.session; delete clean.storeId; delete clean.id
+         if (clean.expirationDate && !clean.expires) {
+           clean.expires = clean.expirationDate; delete clean.expirationDate
+         }
+         return clean
+       })
+       await context.addCookies(playwrightCookies)
      }

      const page = await context.newPage()
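The cookie handling added in this hunk can be read as a standalone normalizer. The sketch below mirrors the same rules in a testable function. `toPlaywrightCookie` and the two constant names are hypothetical, and the sample cookie is made-up data in the shape a browser cookie-export extension might produce.

```javascript
const SAME_SITE = ['Strict', 'Lax', 'None']
const EXTENSION_ONLY = ['hostOnly', 'session', 'storeId', 'id']

// Return a copy of a browser-exported cookie, adjusted the same way as the hunk above.
function toPlaywrightCookie(cookie) {
  const clean = { ...cookie }
  // Any missing or unrecognized sameSite value falls back to 'Lax'.
  if (!SAME_SITE.includes(clean.sameSite)) clean.sameSite = 'Lax'
  // Strip the leading dot from the domain.
  if (clean.domain && clean.domain.startsWith('.')) clean.domain = clean.domain.slice(1)
  // Drop fields that only exist in extension exports.
  for (const key of EXTENSION_ONLY) delete clean[key]
  // Rename expirationDate to expires unless expires is already set.
  if (clean.expirationDate && !clean.expires) {
    clean.expires = clean.expirationDate
    delete clean.expirationDate
  }
  return clean
}

// Made-up sample in the shape of an extension export.
const exported = {
  name: 'ct0', value: 'abc', domain: '.x.com', path: '/',
  sameSite: 'no_restriction', hostOnly: false, session: false,
  storeId: '0', id: 1, expirationDate: 1790000000
}
const normalized = toPlaywrightCookie(exported)
```

One consequence of this rule, shared with the code in the diff, is that an export value such as `no_restriction` becomes `Lax` rather than `None`.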