npm - alvin-bot - Versions diffs - 4.20.2 → 4.21.0 - Mend

alvin-bot 4.20.2 → 4.21.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5) hide show

package/CHANGELOG.md +20 -0
package/bin/cli.js +16 -0
package/package.json +1 -1
package/skills/agent-browser/SKILL.md +183 -0
package/skills/browse/SKILL.md +8 -0

package/CHANGELOG.md CHANGED Viewed

@@ -2,6 +2,26 @@
 All notable changes to Alvin Bot are documented here.
+## [4.21.0] — 2026-05-04
+### 🌐 New skill: Agent Browser (Tier-1.5)
+Adds a new bundled skill, `skills/agent-browser/SKILL.md`, that teaches the bot to use the `agent-browser` CLI when it's available. Agent Browser is a [Vercel Labs](https://github.com/vercel-labs/agent-browser) tool that exposes pages as accessibility-tree snapshots with `@e1`, `@e2`, … refs — interactions cost ~200–400 tokens per turn instead of parsing rendered HTML, which is roughly 90 % cheaper than a Playwright/Puppeteer-driven flow.
+The skill is **opt-in by install, not by config**: it only activates when `command -v agent-browser` succeeds. No new dependency in `package.json`, no postinstall hook, no extra disk on a fresh install. Existing browser strategies (Tier 1 Stealth, Tier 2 CDP, Tier 3 Extension) keep working untouched and remain the right tool for stealth scraping, logged-in personal accounts, and watch-along flows.
+The bundled `Browser Automation` skill (`skills/browse/SKILL.md`) was updated to route the bot to the Agent Browser skill first when the binary is on the PATH and the task is interactive (click/fill/extract on cooperative pages).
+`alvin-bot doctor` shows a new `Browser tools:` section reporting whether agent-browser is installed, and gives the one-liner install command if not:
+```
+npm i -g agent-browser && agent-browser install
+```
+The first command pulls the Node CLI; the second downloads a private Chrome-for-Testing build into `~/.agent-browser/`. Together about 240 MB — that's why we don't bundle it.
+No code changes in the bot's core pipeline. Existing users notice nothing unless they install the CLI.
 ## [4.20.2] — 2026-05-04
 ### 🛡️ Security: Web UI loopback by default + Slack caller allowlist

package/bin/cli.js CHANGED Viewed

@@ -1392,6 +1392,22 @@ async function doctor() {
     }
   }
+  // ── Browser tools (optional Tier-1.5 agent-browser) ──
+  console.log("\n  Browser tools:");
+  let agentBrowserVersion = "";
+  try {
+    agentBrowserVersion = execSync("agent-browser --version 2>/dev/null", { encoding: "utf-8", timeout: 3000 }).trim();
+  } catch {}
+  if (agentBrowserVersion) {
+    // `agent-browser --version` prints "agent-browser X.Y.Z" — strip the prefix.
+    const v = agentBrowserVersion.replace(/^agent-browser\s+/i, "");
+    console.log(`  ✅ agent-browser ${v} — Tier-1.5 (token-efficient snapshot+ref) available`);
+  } else {
+    console.log(`  ℹ️  agent-browser not installed (optional Tier-1.5)`);
+    console.log(`     Install for ~90% cheaper interactive automation:`);
+    console.log(`       npm i -g agent-browser && agent-browser install`);
+  }
   // ── Memory (semantic search backend) ──
   console.log("\n  Memory:");
   const embJson = resolve(DATA_DIR, "memory", ".embeddings.json");

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "alvin-bot",
-  "version": "4.20.2",
+  "version": "4.21.0",
   "description": "Alvin Bot — Your personal AI agent on Telegram, WhatsApp, Discord, Signal, and Web.",
   "type": "module",
   "main": "dist/index.js",

package/skills/agent-browser/SKILL.md ADDED Viewed

@@ -0,0 +1,183 @@
+---
+name: Agent Browser (Snapshot+Ref)
+description: Token-efficient browser automation via the `agent-browser` CLI (Vercel Labs). Uses accessibility-tree snapshots with @eN refs (~200–400 tokens per page) instead of raw HTML parsing — typically 90%+ cheaper than Playwright/Puppeteer. Use for click-fill-extract on public pages, single-page test flows, structured form submission, and screenshots-with-refs. Optional dependency — only active if `agent-browser` is on the PATH; otherwise the regular Browser Automation skill takes over.
+triggers: snapshot the page, get refs, list interactive elements, click @e, fill @e, agent-browser, click button on, click the button, fill in the field, extract from page, find on page, scrape page interactively, visit and click, open page and click, navigate and fill, semantic locator, accessibility tree, snapshot+ref, schau auf der Seite nach, klicke auf den Button, fülle das Feld, formular ausfüllen
+priority: 9
+category: automation
+---
+# Agent Browser — Token-Efficient Snapshot+Ref Workflow
+Use this skill when interactive browser automation is needed (click, fill,
+extract, screenshot) AND `agent-browser` is installed. The accessibility-tree
+snapshot makes per-page interaction roughly an order of magnitude cheaper in
+tokens than parsing rendered HTML with Playwright.
+## Pre-flight: is the CLI installed?
+```bash
+command -v agent-browser >/dev/null 2>&1 \
+  && echo "agent-browser ok" \
+  || echo "fall back to the Browser Automation skill"
+```
+If absent: **stop and use the regular Browser Automation skill** (Tier 1
+Stealth / Tier 2 CDP). Don't suggest installing it unless the user asks —
+it's an opt-in tool, see `alvin-bot doctor` for installation hints.
+## Core loop
+```bash
+agent-browser open <url>
+agent-browser snapshot -i               # interactive elements, with @e1..@eN refs
+agent-browser click @e3                 # act on a ref
+agent-browser snapshot -i               # CRITICAL — re-snapshot after every page change
+agent-browser close
+```
+Refs (`@e1`, `@e2`, …) are **assigned fresh every snapshot**. They go stale
+the moment the page changes (click that navigates, form submit, dynamic
+re-render, modal open). Always re-snapshot before the next ref interaction.
+This single rule is the most common pitfall.
+A snapshot looks like:
+```
+Page: Example - Log in
+URL: https://example.com/login
+@e1 [heading] "Log in"
+@e2 [form]
+  @e3 [input type="email"] placeholder="Email"
+  @e4 [input type="password"] placeholder="Password"
+  @e5 [button type="submit"] "Continue"
+  @e6 [link] "Forgot password?"
+```
+## Common patterns
+### Read a page
+```bash
+agent-browser snapshot -i               # interactive only (preferred)
+agent-browser snapshot -i -u            # include href URLs on links
+agent-browser snapshot -i --json        # machine-readable
+agent-browser get text @e1              # visible text of an element
+agent-browser get attr @e10 href        # any attribute
+agent-browser get url                   # current URL
+```
+### Interact
+```bash
+agent-browser click @e1
+agent-browser fill @e2 "user@example.com"  # clear + type
+agent-browser type @e2 " more text"        # type without clearing
+agent-browser press Enter
+agent-browser select @e4 "option-value"
+agent-browser upload @e5 file.pdf
+agent-browser scroll down 500
+agent-browser screenshot result.png
+```
+### Wait for the right thing (most failures come from bad waits)
+```bash
+agent-browser wait @e1                     # until an element appears
+agent-browser wait --text "Success"        # until specific text on the page
+agent-browser wait --url "**/dashboard"    # until URL matches glob
+agent-browser wait --load networkidle      # post-navigation catch-all
+```
+Avoid bare `wait 2000` except in throwaway debugging. Default timeout: 25 s.
+### Find by semantics when refs aren't ergonomic
+```bash
+agent-browser find role button click --name "Submit"
+agent-browser find text "Sign In" click --exact
+agent-browser find label "Email" fill "user@example.com"
+agent-browser find placeholder "Search" type "query"
+agent-browser find testid "submit-btn" click
+```
+### Multiple isolated browser sessions (parallel users)
+```bash
+agent-browser --session a open https://app.example.com
+agent-browser --session b open https://app.example.com
+agent-browser --session a fill @e1 "alice@test.com"
+agent-browser --session b fill @e1 "bob@test.com"
+```
+### Persist login across runs
+```bash
+# Save once after a successful login:
+agent-browser state save ./auth.json
+# Resume already-logged-in:
+agent-browser --state ./auth.json open https://app.example.com
+```
+### Auth vault (don't put passwords in shell history)
+```bash
+agent-browser auth save my-app --url https://app.example.com/login \
+  --username user@example.com --password-stdin
+# (paste password, Ctrl+D)
+agent-browser auth login my-app
+```
+### Iframes
+Iframes are inlined in the snapshot — refs work transparently. To scope a
+snapshot to one iframe:
+```bash
+agent-browser frame @e3
+agent-browser snapshot -i
+agent-browser frame main
+```
+### Mock network (testing)
+```bash
+agent-browser network route "**/api/users" --body '{"users":[]}'
+agent-browser network route "**/analytics" --abort
+agent-browser network har start /tmp/trace.har
+# ... do stuff ...
+agent-browser network har stop
+```
+## When NOT to use this skill
+| Situation | Skill |
+|---|---|
+| Bot-protected site (Cloudflare, DataDome) | regular **Browser Automation** skill, Tier 1 Stealth |
+| Logged-in personal account on LinkedIn / Gmail | **Browser Automation**, Tier 2 CDP (`alvin-bot browser …`) |
+| User wants to watch a complex flow live | **Browser Automation**, Tier 3 Extension |
+| Static HTML / public JSON / RSS / API | `curl` / WebFetch — no browser engine needed |
+agent-browser is great for **task automation on cooperative pages** (your
+own apps, public data sites, form submissions). It is *not* a stealth tool.
+## Diagnostics
+```bash
+agent-browser doctor                # full env check
+agent-browser doctor --quick        # local-only
+agent-browser dashboard start       # observability UI on :4848
+agent-browser skills get core       # the upstream tool's own usage guide
+```
+## One-liner sanity test
+```bash
+agent-browser open https://example.com \
+  && agent-browser snapshot -i \
+  && agent-browser close
+```
+Expect two `@e` refs (heading + link). If that works, the tool is healthy.

package/skills/browse/SKILL.md CHANGED Viewed

@@ -15,12 +15,20 @@ Du hast drei Browser-Strategien plus WebFetch. **Wähle die billigste passende S
 | Task | Tool | Warum |
 |------|------|-------|
 | Einzelne öffentliche Seite, nur Text | `curl` oder WebFetch | Am schnellsten, keine Browser-Engine |
+| Interaktiv (klicken/füllen/extrahieren) auf kooperativer Seite | **Tier 1.5 agent-browser** *(falls installiert)* | Snapshot+Ref-Workflow ist ~90 % token-günstiger als rohes Playwright. Siehe Skill „Agent Browser". |
 | Öffentliche Seite mit JS / Cloudflare | **Tier 1 Stealth** | Headless + Fingerprint-Masking |
 | Login-pflichtige Seite (LinkedIn, Gmail, …) | **Tier 2 CDP** | Echtes Chromium, persistente Cookies |
 | Komplexer Multi-Step-Flow, User soll zusehen | **Tier 3 Extension** | Nur in interaktiven CLI-Sessions |
 **NIEMALS** nacktes `node -e "const {chromium}…"` für externe Seiten — wird sofort geblockt.
+**Vorab prüfen ob agent-browser verfügbar ist:**
+```bash
+command -v agent-browser >/dev/null 2>&1 && echo "Tier 1.5 verfügbar"
+```
+Falls ja und der Task ist „klick X, lies Y, fülle Z aus" → den `agent-browser`-Skill nehmen.
+Falls nein → mit Tier 1/2/3 weitermachen wie unten. Installation auf Wunsch des Users: `npm i -g agent-browser && agent-browser install`.
 ---
 ## Tier 0 — curl / WebFetch (schnellster Pfad)