spectrawl 0.6.4 → 0.6.5 (npm)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

package/README.md CHANGED

@@ -171,6 +171,49 @@ Spectrawl detects block/challenge pages from **8 anti-bot services** and reports
 When a block is detected, the response includes `blocked: true` and `blockInfo: { type, detail }`.
+### Site-Specific Fallbacks
+Some sites block all datacenter IPs regardless of stealth. Spectrawl automatically routes these through alternative APIs:
+| Site | Problem | Fallback | Cost |
+|------|---------|----------|------|
+| **Reddit** | Blocks all datacenter IPs | [PullPush API](https://api.pullpush.io) — Reddit archive | Free |
+| **Amazon** | CAPTCHA wall on product pages | [Jina Reader](https://r.jina.ai) — server-side rendering | Free |
+| **X/Twitter** | Login wall on posts | [xAI Responses API](https://docs.x.ai) with `x_search` | ~$0.06/post |
+| **LinkedIn** | HTTP 999, IP fingerprinting | Requires residential proxy (see below) | ~$7/GB |
+These fallbacks activate automatically — just `browse()` the URL and Spectrawl picks the right path. No config needed for Reddit and Amazon. X requires `XAI_API_KEY` env var. LinkedIn requires a residential proxy.
+#### LinkedIn: Why It's Different
+LinkedIn fingerprints the IP where cookies were created. Even valid cookies get rejected from a different IP. Every free approach fails from datacenter servers:
+- Direct browse: HTTP 999
+- Voyager API with cookies: 401 (IP mismatch)
+- Jina Reader: empty response
+- Facebook/Googlebot UA: 317K of CSS, zero content
+**The only working solution is a residential proxy.** We recommend [Smartproxy](https://smartproxy.com) ($7/GB pay-as-you-go, 55M residential IPs, 3-day free trial). At typical usage (~10 LinkedIn pages/month), cost is under $0.50/month.
+Setup:
+```bash
+# Add your proxy to Spectrawl config
+npx spectrawl config set proxy '{"host":"gate.smartproxy.com","port":10001,"username":"YOUR_USER","password":"YOUR_PASS"}'
+# Store your LinkedIn cookies (export from browser)
+npx spectrawl login linkedin --account yourname --cookies ./linkedin-cookies.json
+# Now browse LinkedIn normally
+curl localhost:3900/browse -d '{"url":"https://www.linkedin.com/in/someone"}'
+```
+Other residential proxy providers that work:
+- [IPRoyal](https://iproyal.com) — $7/GB, 32M IPs
+- [Bright Data](https://brightdata.com) — premium quality, higher cost
+- [Oxylabs](https://oxylabs.io) — enterprise-grade
+> ⚠️ **Avoid WebShare** — recycled datacenter IPs marketed as residential, no HTTPS support.
 ### CAPTCHA Solving
 Built-in CAPTCHA solver using **Gemini Vision** (free tier: 1,500 req/day):
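
The Site-Specific Fallbacks table added in this hunk amounts to a hostname lookup. The sketch below is one way to picture that routing; it is not Spectrawl's implementation. The names `FALLBACKS` and `pickFallback`, the fallback identifiers, and the choice to match only `amazon.com` are assumptions made for illustration.

```javascript
// Hypothetical routing table mirroring the README's fallback table.
// None of these names come from the Spectrawl source.
const FALLBACKS = [
  { site: 'Reddit', hosts: ['reddit.com'], fallback: 'pullpush', requires: null },
  { site: 'Amazon', hosts: ['amazon.com'], fallback: 'jina-reader', requires: null },
  { site: 'X/Twitter', hosts: ['x.com', 'twitter.com'], fallback: 'xai-x_search', requires: 'XAI_API_KEY' },
  { site: 'LinkedIn', hosts: ['linkedin.com'], fallback: 'residential-proxy', requires: 'proxy' }
]

// Return the table entry whose host matches the URL's hostname
// (exact match or any subdomain), or null when no fallback applies.
function pickFallback(url) {
  const host = new URL(url).hostname.toLowerCase()
  for (const entry of FALLBACKS) {
    if (entry.hosts.some(h => host === h || host.endsWith('.' + h))) {
      return entry
    }
  }
  return null
}
```

Matching on `host === h || host.endsWith('.' + h)` rather than a plain substring check keeps unrelated domains such as `notreddit.com` out of the Reddit path.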
@@ -685,27 +728,36 @@ Error types: `bad-request` (400), `unauthorized` (401), `forbidden` (403), `not-
 ## Proxy Configuration
-Route browsing through residential or datacenter proxies:
+Route browsing through residential or datacenter proxies. **Required for LinkedIn** — see [Site-Specific Fallbacks](#site-specific-fallbacks) for why.
 ```json
 {
   "browse": {
     "proxy": {
-      "host": "proxy.example.com",
-      "port": 8080,
-      "username": "user",
-      "password": "pass"
+      "host": "gate.smartproxy.com",
+      "port": 10001,
+      "username": "YOUR_USER",
+      "password": "YOUR_PASS"
     }
   }
 }
 ```
-The proxy is used for all Playwright and Camoufox browsing sessions. You can also start a local rotating proxy server:
+The proxy is used for all Playwright and Camoufox browsing sessions. You can also start a local rotating proxy server that rotates through multiple upstream proxies:
 ```bash
 npx spectrawl proxy --port 8080
 ```
+**Recommended providers:**
+| Provider | Price | IPs | Best For |
+|----------|-------|-----|----------|
+| [Smartproxy](https://smartproxy.com) | $7/GB | 55M | Best budget option, 3-day free trial |
+| [IPRoyal](https://iproyal.com) | $7/GB | 32M | Good alternative |
+| [Bright Data](https://brightdata.com) | $12+/GB | 72M | Best quality, enterprise |
+| [Oxylabs](https://oxylabs.io) | $10+/GB | 100M+ | Enterprise-grade |
 ## MCP Server
 Works with any MCP-compatible agent (Claude, Cursor, OpenClaw, LangChain):
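
The README states that the `browse.proxy` block is used for all Playwright sessions. A minimal sketch of how such a block could be mapped onto Playwright's `proxy` launch option (`server`, `username`, `password`) follows. The helper name `toPlaywrightProxy` and the `http` default scheme are assumptions; the diff does not show how Spectrawl performs this mapping.

```javascript
// Hypothetical helper: convert a Spectrawl-style proxy block
// ({ host, port, username, password }) into the object shape that
// Playwright accepts as its `proxy` option.
function toPlaywrightProxy(proxy, scheme = 'http') {
  if (!proxy || !proxy.host || !proxy.port) return undefined
  const out = { server: `${scheme}://${proxy.host}:${proxy.port}` }
  if (proxy.username) out.username = proxy.username
  if (proxy.password) out.password = proxy.password
  return out
}

// Example using the placeholder values from the README hunk.
const exampleProxy = toPlaywrightProxy({
  host: 'gate.smartproxy.com',
  port: 10001,
  username: 'YOUR_USER',
  password: 'YOUR_PASS'
})
```

The result can be passed as `{ proxy: exampleProxy }` when launching a browser or creating a context in Playwright.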

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "spectrawl",
-  "version": "0.6.4",
+  "version": "0.6.5",
   "description": "The unified web layer for AI agents. Search (8 engines), stealth browse, auth, act on 24 platforms. Self-hosted.",
   "main": "src/index.js",
   "types": "index.d.ts",

package/src/act/adapters/x.js CHANGED

@@ -16,6 +16,8 @@ class XAdapter {
     switch (action) {
       case 'post':
         return this._post(params, ctx)
+      case 'article':
+        return this._postArticle(params, ctx)
       case 'like':
         return this._like(params, ctx)
       case 'retweet':
@@ -126,6 +128,235 @@ class XAdapter {
     return { tweetId: data.data?.id, url: `https://x.com/i/status/${data.data?.id}` }
   }
+  /**
+   * Post an X Article (long-form) via browser automation.
+   * X API doesn't support articles — must use the web composer.
+   *
+   * Flow:
+   * 1. Navigate to /compose/articles (article list)
+   * 2. Find existing draft or create new article via GraphQL
+   * 3. Navigate to /compose/articles/edit/{articleId}
+   * 4. Fill title (data-testid="twitter-article-title") and body (contenteditable)
+   * 5. Auto-save triggers, or click Publish
+   *
+   * @param {object} params - { title, body, account, _cookies, publish, articleId }
+   * publish: true = auto-publish, false = save as draft (default: false for safety)
+   * articleId: edit existing article (optional)
+   */
+  async _postArticle(params, ctx) {
+    const { title, body, account, _cookies, publish = false, articleId } = params
+    if (!_cookies) {
+      throw new Error(`No auth for X/${account}. Run: spectrawl login x --account ${account}`)
+    }
+    if (!title) throw new Error('X article requires a title')
+    if (!body) throw new Error('X article requires a body')
+    // Step 1: Get article editor URL
+    let editorUrl
+    if (articleId) {
+      editorUrl = `https://x.com/compose/articles/edit/${articleId}`
+    } else {
+      // First go to articles list to find/create an article
+      const { page: listPage, context: listCtx } = await ctx.browse.getPage({
+        _cookies,
+        url: 'https://x.com/compose/articles'
+      })
+      try {
+        await listPage.waitForTimeout(2000 + Math.random() * 1000)
+        // Try to create a new article via the GraphQL API
+        const csrfToken = _cookies.find(c => c.name === 'ct0')?.value
+        const newArticleId = await listPage.evaluate(async (csrf) => {
+          try {
+            const res = await fetch('https://x.com/i/api/graphql/uKxr91kGF4E4mdN-G3x0Yw/CreateArticle', {
+              method: 'POST',
+              headers: {
+                'Content-Type': 'application/json',
+                'X-Csrf-Token': csrf,
+                'Authorization': 'Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA',
+                'X-Twitter-Auth-Type': 'OAuth2Session'
+              },
+              body: JSON.stringify({
+                variables: {},
+                queryId: 'uKxr91kGF4E4mdN-G3x0Yw'
+              })
+            })
+            const data = await res.json()
+            return data?.data?.article_create?.article_results?.result?.rest_id || null
+          } catch { return null }
+        }, csrfToken)
+        if (newArticleId) {
+          editorUrl = `https://x.com/compose/articles/edit/${newArticleId}`
+        } else {
+          // Fallback: find existing draft link or the write button
+          const draftLink = await listPage.$('a[href*="/compose/articles/edit/"]')
+          if (draftLink) {
+            const href = await draftLink.getAttribute('href')
+            editorUrl = `https://x.com${href}`
+          } else {
+            // Last resort: look for a "new article" / "write" link
+            editorUrl = await listPage.evaluate(() => {
+              const links = Array.from(document.querySelectorAll('a[href*="article"]'))
+              for (const l of links) {
+                if (l.textContent.toLowerCase().includes('write') || l.textContent.toLowerCase().includes('new')) {
+                  return l.href
+                }
+              }
+              return null
+            })
+          }
+        }
+        await listPage.close()
+      } catch (e) {
+        await listPage.close().catch(() => {})
+        throw e
+      }
+    }
+    if (!editorUrl) {
+      throw new Error('Could not find or create X article editor. Try passing articleId directly.')
+    }
+    // Step 2: Open the article editor
+    const { page, context } = await ctx.browse.getPage({
+      _cookies,
+      url: editorUrl
+    })
+    try {
+      await page.waitForTimeout(2000 + Math.random() * 1000)
+      // Check we're in the editor
+      const hasEditor = await page.$('[data-testid="twitter-article-title"], [contenteditable="true"]')
+      if (!hasEditor) {
+        const content = await page.evaluate(() => document.body.innerText)
+        throw new Error(`Not in article editor. Page content: ${content.slice(0, 200)}`)
+      }
+      // Step 3: Fill the title
+      // Title: data-testid="twitter-article-title" or placeholder "Add a title"
+      // Must click and type — execCommand doesn't work on this component
+      const titleEl = await page.$('[data-testid="twitter-article-title"]')
+      if (titleEl) {
+        await titleEl.click()
+        await page.waitForTimeout(300)
+        // Select all existing text and replace
+        await page.keyboard.down('Control')
+        await page.keyboard.press('a')
+        await page.keyboard.up('Control')
+        await page.waitForTimeout(100)
+        await page.keyboard.type(title, { delay: 15 + Math.random() * 25 })
+      } else {
+        // Fallback: find by placeholder
+        await page.evaluate(() => {
+          const el = document.querySelector('[data-placeholder="Add a title"]')
+          if (el) { el.click(); el.focus() }
+        })
+        await page.waitForTimeout(300)
+        await page.keyboard.type(title, { delay: 15 + Math.random() * 25 })
+      }
+      await page.waitForTimeout(500 + Math.random() * 500)
+      // Step 4: Fill the body
+      // Body has placeholder "Start writing" — it's a contenteditable div
+      // Click it directly to avoid navigating away
+      const bodyFilled = await page.evaluate((bodyText) => {
+        // Find body editor — the one with "Start writing" placeholder
+        const candidates = document.querySelectorAll('[contenteditable="true"]')
+        let bodyEl = null
+        for (const el of candidates) {
+          const placeholder = el.getAttribute('data-placeholder') || el.getAttribute('aria-describedby') || ''
+          const text = el.textContent || ''
+          // The body editor usually has "Start writing" or is the main content area
+          if (placeholder.includes('writing') || placeholder.includes('Start') ||
+              text.includes('Start writing') || el.getAttribute('aria-multiline') === 'true') {
+            bodyEl = el
+            break
+          }
+        }
+        // If not found by placeholder, take the contenteditable that's NOT the title
+        if (!bodyEl) {
+          const title = document.querySelector('[data-testid="twitter-article-title"]')
+          for (const el of candidates) {
+            if (el !== title && !title?.contains(el)) {
+              bodyEl = el
+              break
+            }
+          }
+        }
+        if (!bodyEl) return false
+        bodyEl.focus()
+        // Use insertText for proper React/editor state
+        document.execCommand('selectAll', false, null)
+        document.execCommand('insertText', false, bodyText)
+        return true
+      }, body)
+      if (!bodyFilled) {
+        // Fallback: click on "Start writing" text and type
+        const bodyArea = await page.evaluate(() => {
+          const els = document.querySelectorAll('[contenteditable="true"]')
+          for (const el of els) {
+            if (el.textContent.includes('Start writing') || el.getAttribute('aria-multiline') === 'true') {
+              el.click()
+              el.focus()
+              return true
+            }
+          }
+          return false
+        })
+        if (bodyArea) {
+          await page.waitForTimeout(300)
+          await page.keyboard.type(body, { delay: 5 })
+        }
+      }
+      await page.waitForTimeout(1500) // Let auto-save trigger
+      // Take screenshot for verification
+      const screenshot = await page.screenshot({ type: 'png' }).catch(() => null)
+      const draftUrl = page.url()
+      if (publish) {
+        // Find and click Publish button
+        const pubBtn = await page.$('button:has-text("Publish"), [role="button"]:has-text("Publish")')
+        if (pubBtn) {
+          await pubBtn.click()
+          await page.waitForTimeout(3000 + Math.random() * 2000)
+          // Handle confirmation dialog if present
+          const confirmBtn = await page.$('button:has-text("Publish"), [data-testid*="confirm"]')
+          if (confirmBtn) {
+            await confirmBtn.click()
+            await page.waitForTimeout(3000)
+          }
+        }
+        const finalUrl = page.url()
+        await page.close()
+        return { url: finalUrl, status: 'published', title }
+      } else {
+        await page.close()
+        return {
+          url: draftUrl,
+          status: 'draft',
+          title,
+          screenshot: screenshot ? screenshot.toString('base64') : null,
+          message: 'Article saved as draft. Set publish: true to auto-publish, or review at: ' + draftUrl
+        }
+      }
+    } catch (e) {
+      await page.close().catch(() => {})
+      throw e
+    }
+  }
   async _like(params, ctx) {
     // TODO: implement like via GraphQL
     throw new Error('X like not yet implemented')
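
Step 1 of `_postArticle` above picks the editor URL from up to four sources, in priority order: an explicit `articleId`, an ID returned by the `CreateArticle` call, an existing draft link, and finally a "write"/"new" link. The sketch below restates that ordering as a pure function so it can be reasoned about without a browser. The function name `resolveEditorUrl` and its parameter names are hypothetical and not part of the adapter.

```javascript
const X_ORIGIN = 'https://x.com'

// Hypothetical restatement of the editor-URL priority order in _postArticle.
// Inputs stand in for values the adapter obtains from params and the page.
function resolveEditorUrl({ articleId, createdId, draftHref, writeHref } = {}) {
  if (articleId) return `${X_ORIGIN}/compose/articles/edit/${articleId}`
  if (createdId) return `${X_ORIGIN}/compose/articles/edit/${createdId}`
  // new URL() accepts both relative and absolute hrefs, unlike the
  // string concatenation in the diff, which assumes a relative href.
  if (draftHref) return new URL(draftHref, X_ORIGIN).href
  return writeHref || null
}
```

A `null` result corresponds to the "Could not find or create X article editor" error thrown by the adapter.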

package/src/browse/index.js CHANGED

@@ -626,7 +626,21 @@ class BrowseEngine {
     const context = await this._createContext(browser, opts)
     if (opts._cookies) {
-      await context.addCookies(opts._cookies)
+      const playwrightCookies = opts._cookies.map(c => {
+        const clean = { ...c }
+        if (!clean.sameSite || !['Strict', 'Lax', 'None'].includes(clean.sameSite)) {
+          clean.sameSite = 'Lax'
+        }
+        if (clean.domain && clean.domain.startsWith('.')) {
+          clean.domain = clean.domain.slice(1)
+        }
+        delete clean.hostOnly; delete clean.session; delete clean.storeId; delete clean.id
+        if (clean.expirationDate && !clean.expires) {
+          clean.expires = clean.expirationDate; delete clean.expirationDate
+        }
+        return clean
+      })
+      await context.addCookies(playwrightCookies)
     }
     const page = await context.newPage()
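
The cookie mapping added in this hunk can be exercised on its own. The sketch below lifts the same steps into a standalone function and applies it to a sample cookie in the shape a browser cookie-export extension might produce. The function name `toPlaywrightCookie` and the sample values are assumptions for illustration.

```javascript
const VALID_SAME_SITE = ['Strict', 'Lax', 'None']

// Standalone restatement of the normalization in the hunk above:
// coerce sameSite, strip a leading dot from domain, drop
// extension-only fields, and rename expirationDate to expires.
function toPlaywrightCookie(cookie) {
  const clean = { ...cookie }
  if (!VALID_SAME_SITE.includes(clean.sameSite)) clean.sameSite = 'Lax'
  if (clean.domain && clean.domain.startsWith('.')) {
    clean.domain = clean.domain.slice(1)
  }
  for (const key of ['hostOnly', 'session', 'storeId', 'id']) delete clean[key]
  if (clean.expirationDate && !clean.expires) {
    clean.expires = clean.expirationDate
    delete clean.expirationDate
  }
  return clean
}

// Sample input with made-up values, shaped like an extension export.
const normalized = toPlaywrightCookie({
  name: 'ct0',
  value: 'abc',
  domain: '.x.com',
  path: '/',
  sameSite: 'no_restriction',
  hostOnly: false,
  session: false,
  storeId: '0',
  expirationDate: 1893456000
})
```

Two behaviors are worth noting when reading the diff. Any unrecognized `sameSite` value, including an extension's `no_restriction`, becomes `Lax` rather than `None`. Stripping the leading dot may also narrow a cookie's scope, since Playwright's `addCookies` documentation describes a dot-prefixed domain as applying to subdomains; that is worth checking against the Playwright version in use.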