PyPI - ember-browser - Versions diffs - 0.1.0__tar.gz - Mend

ember-browser 0.1.0__tar.gz

Files changed (29)

ember_browser-0.1.0/LICENSE +11 -0
ember_browser-0.1.0/PKG-INFO +338 -0
ember_browser-0.1.0/README.md +300 -0
ember_browser-0.1.0/emb/__init__.py +38 -0
ember_browser-0.1.0/emb/_browser.py +157 -0
ember_browser-0.1.0/emb/_url_validator.py +59 -0
ember_browser-0.1.0/emb/agent.py +70 -0
ember_browser-0.1.0/emb/api.py +193 -0
ember_browser-0.1.0/emb/cli.py +1041 -0
ember_browser-0.1.0/emb/crawl.py +174 -0
ember_browser-0.1.0/emb/interact.py +156 -0
ember_browser-0.1.0/emb/map.py +109 -0
ember_browser-0.1.0/emb/mcp.py +126 -0
ember_browser-0.1.0/emb/scrape.py +207 -0
ember_browser-0.1.0/emb/search.py +27 -0
ember_browser-0.1.0/emb/types.py +60 -0
ember_browser-0.1.0/ember_browser.egg-info/PKG-INFO +338 -0
ember_browser-0.1.0/ember_browser.egg-info/SOURCES.txt +27 -0
ember_browser-0.1.0/ember_browser.egg-info/dependency_links.txt +1 -0
ember_browser-0.1.0/ember_browser.egg-info/entry_points.txt +2 -0
ember_browser-0.1.0/ember_browser.egg-info/requires.txt +17 -0
ember_browser-0.1.0/ember_browser.egg-info/top_level.txt +1 -0
ember_browser-0.1.0/pyproject.toml +55 -0
ember_browser-0.1.0/setup.cfg +4 -0
ember_browser-0.1.0/tests/test_api.py +330 -0
ember_browser-0.1.0/tests/test_cli.py +403 -0
ember_browser-0.1.0/tests/test_core.py +121 -0
ember_browser-0.1.0/tests/test_mcp.py +264 -0
ember_browser-0.1.0/tests/test_unit.py +1131 -0

ember_browser-0.1.0/LICENSE ADDED

@@ -0,0 +1,11 @@
+GNU AFFERO GENERAL PUBLIC LICENSE
+Version 3, 19 November 2007
+Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
+Everyone is permitted to copy and distribute verbatim copies
+of this license document, but changing it is not allowed.
+...
+The full AGPL-3.0 license text is available at:
+https://www.gnu.org/licenses/agpl-3.0.txt

ember_browser-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,338 @@
+Metadata-Version: 2.4
+Name: ember-browser
+Version: 0.1.0
+Summary: Open source, lightweight headless browser for AI agents. pip install ember-browser.
+Author: Anda Usman, AndaLabX
+License: AGPL-3.0
+Project-URL: Homepage, https://github.com/andalabx/ember
+Project-URL: Documentation, https://andalabx.com/ember
+Project-URL: Source, https://github.com/andalabx/ember
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: License :: OSI Approved :: GNU Affero General Public License v3
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: trafilatura>=2.0.0
+Requires-Dist: beautifulsoup4>=4.12
+Requires-Dist: lxml>=5.0
+Requires-Dist: httpx>=0.28
+Requires-Dist: ddgs>=1.0
+Requires-Dist: typer>=0.15
+Requires-Dist: rich>=13.0
+Requires-Dist: fastapi>=0.115
+Requires-Dist: uvicorn[standard]>=0.34
+Requires-Dist: pydantic>=2.0
+Requires-Dist: pypdf>=4.0
+Provides-Extra: dev
+Requires-Dist: pytest>=8; extra == "dev"
+Provides-Extra: mcp
+Requires-Dist: fastmcp>=2.0; extra == "mcp"
+Dynamic: license-file
+<div align="center">
+<pre>
+  ███████╗███╗   ███╗██████╗ ███████╗██████╗
+  ██╔════╝████╗ ████║██╔══██╗██╔════╝██╔══██╗
+  █████╗  ██╔████╔██║██████╔╝█████╗  ██████╔╝
+  ██╔══╝  ██║╚██╔╝██║██╔══██╗██╔══╝  ██╔══██╗
+  ███████╗██║ ╚═╝ ██║██████╔╝███████╗██║  ██║
+  ╚══════╝╚═╝     ╚═╝╚═════╝ ╚══════╝╚═╝  ╚═╝
+</pre>
+**Open source, lightweight headless browser for AI agents.**
+[![PyPI](https://img.shields.io/pypi/v/ember-browser)](https://pypi.org/project/ember-browser/)
+[![Python](https://img.shields.io/pypi/pyversions/ember-browser)](https://pypi.org/project/ember-browser/)
+[![License: AGPL v3](https://img.shields.io/badge/license-AGPL--3.0-blue)](LICENSE)
+```bash
+pip install ember-browser
+```
+*No Docker. No API key to start.*
+</div>
+---
+## Why ember
+Most web tools for agents ship with Chromium (641 MB) or require Docker just to get started. We needed something an agent could use on a VPS, a laptop, or a Raspberry Pi without thinking about it.
+ember runs at ~17 MB idle. It decides whether a page needs a browser — you just pass it a URL.
+|                     | ember              | Crawl4AI           |
+|---------------------|--------------------|--------------------|
+| Import footprint    | ~54 MB             | 171.8 MB           |
+| Browser binary      | 20 MB (Lightpanda) | 641 MB (Chromium)  |
+| Scrape success rate | ~85% (trafilatura) / ~95%+ (+ Lightpanda) | 90% |
+| Docker required     | No                 | No                 |
+| API key required    | No                 | No                 |
+---
+## Quick start
+```bash
+pip install ember-browser
+ember                          # start the interactive session
+ember url https://example.com  # or run a one-shot command
+ember serve                    # start the REST API
+```
+---
+## CLI
+### Interactive session
+`ember` with no arguments opens a persistent session. Commands and a save guide are shown on startup — no need to type `help` first.
+```
+  ███████╗███╗   ███╗██████╗ ███████╗██████╗
+  ...
+  ╚══════╝╚═╝     ╚═╝╚═════╝ ╚══════╝╚═╝  ╚═╝
+  v0.1.0  lightweight headless browser for AI agents
+  url        <url>              scrape a page to markdown
+  search     <query>            web search
+  crawl      <url>              crawl a whole website
+  map        <url>              discover all URLs on a site
+  interact   <url>              control a browser with natural language
+  extract    <url>              pull structured data with an LLM
+  batch      <urls.txt>         scrape many URLs concurrently
+  ─── saving results ───────────────────────────────────────────
+  one result   url example.com -o page.md
+  everything   output ./research/  then all results auto-save
+  last result  save page.md        after any command
+ember › url andausman.com
+ember › save page.md
+ember › output ./research/       # auto-save everything from here
+ember/research › search "python asyncio" -n 10
+ember/research › crawl docs.example.com
+ember/research › output clear    # stop auto-saving
+ember › quit
+```
+### One-shot commands
+Every command works standalone too:
+```bash
+ember url https://example.com                         # scrape a page
+ember search "AI agents python" -n 10                 # web search
+ember crawl https://docs.example.com --max-pages 20   # crawl a site
+ember map https://example.com                         # discover all URLs
+ember interact https://amazon.com \
+  --prompt 'find a mechanical keyboard under $100'
+ember extract https://example.com/pricing \
+  --prompt "list all plans and prices as JSON"
+```
+### Saving results
+All commands accept `-o` to save that run:
+```bash
+ember url https://example.com -o page.md
+ember search "python" -o results.json
+ember crawl https://docs.example.com -o ./pages/   # one .md per page
+ember map https://example.com -o urls.txt
+ember extract https://example.com -o data.json
+```
+Set a default save directory so you never need `-o`:
+```bash
+ember config --save-dir ./research/    # persists across sessions
+ember config                           # show current settings
+ember config --save-dir ""             # clear it
+```
+Or use an environment variable for the current shell:
+```bash
+EMBER_SAVE_DIR=./out ember url https://example.com
+```
+In a session, the three ways to save:
+```
+ember › url example.com -o page.md     # save just this run
+ember › save page.md                   # save the last result
+ember › output ./research/             # auto-save all results from now on
+```
+### Async batch scraping
+```bash
+# urls.txt — one URL per line, # = comment
+ember batch urls.txt                      # 5 concurrent by default
+ember batch urls.txt -c 20 -o ./pages/   # 20 parallel, save to dir
+```
+---
+## Python API
+```python
+from emb.scrape import scrape_url, scrape_markdown
+from emb.search import search
+from emb.crawl import crawl
+from emb.map import map_url
+# Scrape a page → ScrapeResult
+result = scrape_url("https://example.com")
+print(result.markdown)   # full page content as markdown
+print(result.title)      # page title
+print(result.success)    # True / False
+# Just the markdown text
+md = scrape_markdown("https://example.com")
+# Crawl a site
+result = crawl("https://docs.example.com", max_pages=20, max_depth=3)
+for page in result.pages:
+    print(page.url, len(page.markdown))
+# Discover URLs
+result = map_url("https://example.com", max_links=100)
+print(result.links)   # list[str]
+# Search the web
+results = search("python asyncio tutorial", limit=5)
+for r in results:
+    print(r.title, r.url)
+# Browser interaction with natural language
+from emb.interact import interact
+result = interact("https://example.com", prompt="click the login button")
+print(result.content)   # what the agent did / saw
+# LLM-powered structured extraction
+from emb.agent import extract
+data = extract("https://example.com/pricing", prompt="list all plans and prices")
+print(data)   # dict
+```
+### Async
+```python
+import asyncio
+from emb.scrape import scrape_url_async
+async def main():
+    results = await asyncio.gather(
+        scrape_url_async("https://example.com"),
+        scrape_url_async("https://httpbin.org/get"),
+    )
+    for r in results:
+        print(r.url, r.success)
+asyncio.run(main())
+```
+---
+## REST API
+```bash
+ember serve               # http://127.0.0.1:51251
+ember serve --port 8080   # custom port
+EMBER_API_KEY=your-secret ember serve   # require auth
+```
+```bash
+curl -X POST http://localhost:51251/scrape \
+  -H "Content-Type: application/json" \
+  -H "X-API-Key: your-secret" \
+  -d '{"url": "https://example.com"}'
+curl -X POST http://localhost:51251/search \
+  -H "Content-Type: application/json" \
+  -d '{"query": "AI agents", "limit": 5}'
+curl -X POST http://localhost:51251/crawl \
+  -H "Content-Type: application/json" \
+  -d '{"url": "https://docs.example.com", "max_pages": 10}'
+```
+Endpoints: `/scrape` `/search` `/crawl` `/map` `/interact` `/extract` `/agent` `/health`
+---
+## MCP
+```json
+{
+  "mcpServers": {
+    "ember": {
+      "command": "ember",
+      "args": ["mcp"]
+    }
+  }
+}
+```
+Works with Claude Code, Cursor, and any MCP-compatible host.
+Available tools: `scrape`, `search_web`, `crawl_site`, `map_site`, `batch_scrape`, `interact_page`, `extract_data`.
+---
+## How it works
+Not every page needs a browser. ember knows the difference.
+**Tier 1 — trafilatura** handles ~90% of the web: blogs, news, documentation, Wikipedia. Pure HTTP, no browser process, no memory overhead.
+**Tier 2 — Lightpanda** handles JavaScript-heavy pages, SPAs, and interactive content. It's a real browser engine written in Zig, built for machines rather than humans — 20 MB total. ember downloads and caches it automatically on first use, and only falls back to it when tier 1 produces thin content.
+Most requests never reach the browser.
+### Memory footprint
+| State                  | RAM     |
+|------------------------|---------|
+| Idle                   | ~17 MB  |
+| Scraping a static page | ~20 MB  |
+| Running the browser    | ~140 MB |
+Firecrawl needs 4–8 GB in Docker. Crawl4AI imports at 171 MB before scraping anything. ember fits where your agent already runs.
+---
+## Environment variables
+| Variable                  | Default                        | Description |
+|---------------------------|--------------------------------|-------------|
+| `EMBER_SAVE_DIR`          | _(none)_                       | Default directory for saved results. Overrides `ember config --save-dir` for the current shell. |
+| `EMBER_API_KEY`           | _(none)_                       | Enables API key auth on the REST server (`X-API-Key` header). |
+| `EMBER_PORT`              | `51251`                        | Default port for `ember serve`. Overridden by `--port` flag. |
+| `EMBER_INTERACT_PROVIDER` | `openai`                       | LLM provider for `interact` (`openai`, `anthropic`, `ollama`, etc.). |
+| `EMBER_LLM_API_KEY`       | _(none)_                       | API key for LLM-powered extraction. |
+| `EMBER_LLM_BASE_URL`      | `https://api.openai.com/v1`    | LLM API endpoint for extraction. |
+| `EMBER_LLM_MODEL`         | `gpt-4o-mini`                  | Model used by `extract`. |
+| `EMBER_LIGHTPANDA_PATH`   | _(auto)_                       | Path to a custom Lightpanda binary. Skips auto-download if set. |
+---
+## License
+[AGPL-3.0](LICENSE) — open source forever.
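
The REST API section of the README shows `curl` calls against `ember serve`. The same request can be built from Python with only the standard library. This is a minimal sketch, not part of the package: `build_scrape_request` is a hypothetical helper, the base URL is the README's documented default, and the response format is not shown in the README, so the sketch stops at constructing the request.

```python
import json
import urllib.request
from typing import Optional


def build_scrape_request(
    url: str,
    base_url: str = "http://127.0.0.1:51251",  # default from the README
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST /scrape request (hypothetical helper, not an ember API)."""
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        # Only needed when the server was started with EMBER_API_KEY set.
        headers["X-API-Key"] = api_key
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/scrape", data=body, headers=headers, method="POST"
    )


req = build_scrape_request("https://example.com", api_key="your-secret")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running server. Note that `urllib` normalizes header names, so the key is stored as `X-api-key` on the request object; HTTP header names are case-insensitive on the wire.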

ember_browser-0.1.0/README.md ADDED

@@ -0,0 +1,300 @@
+<div align="center">
+<pre>
+  ███████╗███╗   ███╗██████╗ ███████╗██████╗
+  ██╔════╝████╗ ████║██╔══██╗██╔════╝██╔══██╗
+  █████╗  ██╔████╔██║██████╔╝█████╗  ██████╔╝
+  ██╔══╝  ██║╚██╔╝██║██╔══██╗██╔══╝  ██╔══██╗
+  ███████╗██║ ╚═╝ ██║██████╔╝███████╗██║  ██║
+  ╚══════╝╚═╝     ╚═╝╚═════╝ ╚══════╝╚═╝  ╚═╝
+</pre>
+**Open source, lightweight headless browser for AI agents.**
+[![PyPI](https://img.shields.io/pypi/v/ember-browser)](https://pypi.org/project/ember-browser/)
+[![Python](https://img.shields.io/pypi/pyversions/ember-browser)](https://pypi.org/project/ember-browser/)
+[![License: AGPL v3](https://img.shields.io/badge/license-AGPL--3.0-blue)](LICENSE)
+```bash
+pip install ember-browser
+```
+*No Docker. No API key to start.*
+</div>
+---
+## Why ember
+Most web tools for agents ship with Chromium (641 MB) or require Docker just to get started. We needed something an agent could use on a VPS, a laptop, or a Raspberry Pi without thinking about it.
+ember runs at ~17 MB idle. It decides whether a page needs a browser — you just pass it a URL.
+|                     | ember              | Crawl4AI           |
+|---------------------|--------------------|--------------------|
+| Import footprint    | ~54 MB             | 171.8 MB           |
+| Browser binary      | 20 MB (Lightpanda) | 641 MB (Chromium)  |
+| Scrape success rate | ~85% (trafilatura) / ~95%+ (+ Lightpanda) | 90% |
+| Docker required     | No                 | No                 |
+| API key required    | No                 | No                 |
+---
+## Quick start
+```bash
+pip install ember-browser
+ember                          # start the interactive session
+ember url https://example.com  # or run a one-shot command
+ember serve                    # start the REST API
+```
+---
+## CLI
+### Interactive session
+`ember` with no arguments opens a persistent session. Commands and a save guide are shown on startup — no need to type `help` first.
+```
+  ███████╗███╗   ███╗██████╗ ███████╗██████╗
+  ...
+  ╚══════╝╚═╝     ╚═╝╚═════╝ ╚══════╝╚═╝  ╚═╝
+  v0.1.0  lightweight headless browser for AI agents
+  url        <url>              scrape a page to markdown
+  search     <query>            web search
+  crawl      <url>              crawl a whole website
+  map        <url>              discover all URLs on a site
+  interact   <url>              control a browser with natural language
+  extract    <url>              pull structured data with an LLM
+  batch      <urls.txt>         scrape many URLs concurrently
+  ─── saving results ───────────────────────────────────────────
+  one result   url example.com -o page.md
+  everything   output ./research/  then all results auto-save
+  last result  save page.md        after any command
+ember › url andausman.com
+ember › save page.md
+ember › output ./research/       # auto-save everything from here
+ember/research › search "python asyncio" -n 10
+ember/research › crawl docs.example.com
+ember/research › output clear    # stop auto-saving
+ember › quit
+```
+### One-shot commands
+Every command works standalone too:
+```bash
+ember url https://example.com                         # scrape a page
+ember search "AI agents python" -n 10                 # web search
+ember crawl https://docs.example.com --max-pages 20   # crawl a site
+ember map https://example.com                         # discover all URLs
+ember interact https://amazon.com \
+  --prompt 'find a mechanical keyboard under $100'
+ember extract https://example.com/pricing \
+  --prompt "list all plans and prices as JSON"
+```
+### Saving results
+All commands accept `-o` to save that run:
+```bash
+ember url https://example.com -o page.md
+ember search "python" -o results.json
+ember crawl https://docs.example.com -o ./pages/   # one .md per page
+ember map https://example.com -o urls.txt
+ember extract https://example.com -o data.json
+```
+Set a default save directory so you never need `-o`:
+```bash
+ember config --save-dir ./research/    # persists across sessions
+ember config                           # show current settings
+ember config --save-dir ""             # clear it
+```
+Or use an environment variable for the current shell:
+```bash
+EMBER_SAVE_DIR=./out ember url https://example.com
+```
+In a session, the three ways to save:
+```
+ember › url example.com -o page.md     # save just this run
+ember › save page.md                   # save the last result
+ember › output ./research/             # auto-save all results from now on
+```
+### Async batch scraping
+```bash
+# urls.txt — one URL per line, # = comment
+ember batch urls.txt                      # 5 concurrent by default
+ember batch urls.txt -c 20 -o ./pages/   # 20 parallel, save to dir
+```
+---
+## Python API
+```python
+from emb.scrape import scrape_url, scrape_markdown
+from emb.search import search
+from emb.crawl import crawl
+from emb.map import map_url
+# Scrape a page → ScrapeResult
+result = scrape_url("https://example.com")
+print(result.markdown)   # full page content as markdown
+print(result.title)      # page title
+print(result.success)    # True / False
+# Just the markdown text
+md = scrape_markdown("https://example.com")
+# Crawl a site
+result = crawl("https://docs.example.com", max_pages=20, max_depth=3)
+for page in result.pages:
+    print(page.url, len(page.markdown))
+# Discover URLs
+result = map_url("https://example.com", max_links=100)
+print(result.links)   # list[str]
+# Search the web
+results = search("python asyncio tutorial", limit=5)
+for r in results:
+    print(r.title, r.url)
+# Browser interaction with natural language
+from emb.interact import interact
+result = interact("https://example.com", prompt="click the login button")
+print(result.content)   # what the agent did / saw
+# LLM-powered structured extraction
+from emb.agent import extract
+data = extract("https://example.com/pricing", prompt="list all plans and prices")
+print(data)   # dict
+```
+### Async
+```python
+import asyncio
+from emb.scrape import scrape_url_async
+async def main():
+    results = await asyncio.gather(
+        scrape_url_async("https://example.com"),
+        scrape_url_async("https://httpbin.org/get"),
+    )
+    for r in results:
+        print(r.url, r.success)
+asyncio.run(main())
+```
+---
+## REST API
+```bash
+ember serve               # http://127.0.0.1:51251
+ember serve --port 8080   # custom port
+EMBER_API_KEY=your-secret ember serve   # require auth
+```
+```bash
+curl -X POST http://localhost:51251/scrape \
+  -H "Content-Type: application/json" \
+  -H "X-API-Key: your-secret" \
+  -d '{"url": "https://example.com"}'
+curl -X POST http://localhost:51251/search \
+  -H "Content-Type: application/json" \
+  -d '{"query": "AI agents", "limit": 5}'
+curl -X POST http://localhost:51251/crawl \
+  -H "Content-Type: application/json" \
+  -d '{"url": "https://docs.example.com", "max_pages": 10}'
+```
+Endpoints: `/scrape` `/search` `/crawl` `/map` `/interact` `/extract` `/agent` `/health`
+---
+## MCP
+```json
+{
+  "mcpServers": {
+    "ember": {
+      "command": "ember",
+      "args": ["mcp"]
+    }
+  }
+}
+```
+Works with Claude Code, Cursor, and any MCP-compatible host.
+Available tools: `scrape`, `search_web`, `crawl_site`, `map_site`, `batch_scrape`, `interact_page`, `extract_data`.
+---
+## How it works
+Not every page needs a browser. ember knows the difference.
+**Tier 1 — trafilatura** handles ~90% of the web: blogs, news, documentation, Wikipedia. Pure HTTP, no browser process, no memory overhead.
+**Tier 2 — Lightpanda** handles JavaScript-heavy pages, SPAs, and interactive content. It's a real browser engine written in Zig, built for machines rather than humans — 20 MB total. ember downloads and caches it automatically on first use, and only falls back to it when tier 1 produces thin content.
+Most requests never reach the browser.
+### Memory footprint
+| State                  | RAM     |
+|------------------------|---------|
+| Idle                   | ~17 MB  |
+| Scraping a static page | ~20 MB  |
+| Running the browser    | ~140 MB |
+Firecrawl needs 4–8 GB in Docker. Crawl4AI imports at 171 MB before scraping anything. ember fits where your agent already runs.
+---
+## Environment variables
+| Variable                  | Default                        | Description |
+|---------------------------|--------------------------------|-------------|
+| `EMBER_SAVE_DIR`          | _(none)_                       | Default directory for saved results. Overrides `ember config --save-dir` for the current shell. |
+| `EMBER_API_KEY`           | _(none)_                       | Enables API key auth on the REST server (`X-API-Key` header). |
+| `EMBER_PORT`              | `51251`                        | Default port for `ember serve`. Overridden by `--port` flag. |
+| `EMBER_INTERACT_PROVIDER` | `openai`                       | LLM provider for `interact` (`openai`, `anthropic`, `ollama`, etc.). |
+| `EMBER_LLM_API_KEY`       | _(none)_                       | API key for LLM-powered extraction. |
+| `EMBER_LLM_BASE_URL`      | `https://api.openai.com/v1`    | LLM API endpoint for extraction. |
+| `EMBER_LLM_MODEL`         | `gpt-4o-mini`                  | Model used by `extract`. |
+| `EMBER_LIGHTPANDA_PATH`   | _(auto)_                       | Path to a custom Lightpanda binary. Skips auto-download if set. |
+---
+## License
+[AGPL-3.0](LICENSE) — open source forever.
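
The README's "How it works" section describes a two-tier strategy: try plain HTTP extraction first and fall back to the browser only when the result is thin. The sketch below illustrates that control flow under stated assumptions. The names `is_thin`, `scrape_two_tier`, and `MIN_CHARS`, and the length-based definition of "thin", are hypothetical; the README does not say how ember measures thin content, and the real fetchers are replaced by injected callables.

```python
from typing import Callable, Optional, Tuple

MIN_CHARS = 200  # hypothetical threshold; ember's real heuristic is not documented here

Fetcher = Callable[[str], Optional[str]]


def is_thin(text: Optional[str], min_chars: int = MIN_CHARS) -> bool:
    """Treat missing or very short extracted text as thin."""
    return text is None or len(text.strip()) < min_chars


def scrape_two_tier(
    url: str,
    fetch_static: Fetcher,    # stand-in for tier 1 (HTTP + text extraction)
    fetch_rendered: Fetcher,  # stand-in for tier 2 (headless browser)
    min_chars: int = MIN_CHARS,
) -> Tuple[str, Optional[str]]:
    """Return (tier_used, text), calling the browser tier only when needed."""
    text = fetch_static(url)
    if not is_thin(text, min_chars):
        return "static", text
    rendered = fetch_rendered(url)
    # Keep the static result if the browser tier produced nothing at all.
    return "browser", rendered if rendered is not None else text


# Stub fetchers so the sketch runs without network access.
tier, text = scrape_two_tier(
    "https://example.com",
    fetch_static=lambda u: "short",
    fetch_rendered=lambda u: "content rendered after JavaScript ran",
)
print(tier)
```

Injecting the fetchers keeps the fallback decision testable on its own, independent of whichever extraction library or browser binary sits behind each tier.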

ember_browser-0.1.0/emb/__init__.py ADDED

@@ -0,0 +1,38 @@
+"""ember — open source, lightweight headless browser for AI agents."""
+from __future__ import annotations
+__version__ = "0.1.0"
+# Lazily re-export the most-used public functions so `from emb import scrape_url`
+# works without loading heavy dependencies at `import emb` time.
+#
+# Names that clash with a same-named submodule (search, crawl) can't be re-exported
+# this way — Python returns the submodule before __getattr__ fires. Use the submodule
+# form for those: `from emb.search import search`, `from emb.crawl import crawl`.
+__all__ = [
+    "__version__",
+    "scrape_url",
+    "scrape_url_async",
+    "scrape_markdown",
+    "scrape_markdown_async",
+    "map_url",
+]
+_LAZY: dict[str, str] = {
+    "scrape_url":            "emb.scrape",
+    "scrape_url_async":      "emb.scrape",
+    "scrape_markdown":       "emb.scrape",
+    "scrape_markdown_async": "emb.scrape",
+    "map_url":               "emb.map",
+}
+def __getattr__(name: str):
+    module_path = _LAZY.get(name)
+    if module_path is None:
+        raise AttributeError(f"module 'emb' has no attribute {name!r}")
+    import importlib
+    module = importlib.import_module(module_path)
+    return getattr(module, name)
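
The lazy re-export above relies on module-level `__getattr__` (PEP 562), which Python calls only when normal attribute lookup on the module fails. The sketch below reproduces the pattern in isolation. It is an illustration rather than the package's code: `lazy_demo` is a hypothetical in-memory module, and the standard-library modules `math` and `json` stand in for `emb.scrape` and `emb.map`.

```python
import importlib
import sys
import types

# Hypothetical stand-in for a package like `emb`, built in memory so the
# sketch runs without installing anything.
pkg = types.ModuleType("lazy_demo")

# Public name -> module that actually defines it (stdlib modules as stand-ins).
_LAZY = {"sqrt": "math", "dumps": "json"}


def _lazy_getattr(name: str):
    module_path = _LAZY.get(name)
    if module_path is None:
        raise AttributeError(f"module 'lazy_demo' has no attribute {name!r}")
    # The target module is imported only when the name is requested.
    return getattr(importlib.import_module(module_path), name)


# A module-level __getattr__ is looked up in the module's own namespace.
pkg.__getattr__ = _lazy_getattr
sys.modules["lazy_demo"] = pkg

from lazy_demo import sqrt  # resolved through _lazy_getattr

print(sqrt(9.0))

# Once a name is bound on the module, __getattr__ is no longer consulted.
pkg.dumps = "bound attribute"
print(pkg.dumps)
```

The last two lines mirror the caveat in the source comment: importing a submodule such as `emb.search` binds `search` on the package, so normal lookup finds the submodule and the `__getattr__` hook never runs for that name.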