charlotte-crawler 1.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (29)
  1. charlotte_crawler-1.1.0/LICENSE +21 -0
  2. charlotte_crawler-1.1.0/PKG-INFO +394 -0
  3. charlotte_crawler-1.1.0/README.md +336 -0
  4. charlotte_crawler-1.1.0/charlotte/__init__.py +72 -0
  5. charlotte_crawler-1.1.0/charlotte/adapters/__init__.py +14 -0
  6. charlotte_crawler-1.1.0/charlotte/adapters/base.py +64 -0
  7. charlotte_crawler-1.1.0/charlotte/adapters/groq.py +250 -0
  8. charlotte_crawler-1.1.0/charlotte/adapters/local.py +364 -0
  9. charlotte_crawler-1.1.0/charlotte/config.py +85 -0
  10. charlotte_crawler-1.1.0/charlotte/core/__init__.py +0 -0
  11. charlotte_crawler-1.1.0/charlotte/core/adapter_validation.py +272 -0
  12. charlotte_crawler-1.1.0/charlotte/core/engine.py +599 -0
  13. charlotte_crawler-1.1.0/charlotte/core/extractor.py +196 -0
  14. charlotte_crawler-1.1.0/charlotte/core/fetcher.py +350 -0
  15. charlotte_crawler-1.1.0/charlotte/core/find_link.py +141 -0
  16. charlotte_crawler-1.1.0/charlotte/core/normalizer.py +262 -0
  17. charlotte_crawler-1.1.0/charlotte/core/plausibility.py +231 -0
  18. charlotte_crawler-1.1.0/charlotte/core/provenance.py +162 -0
  19. charlotte_crawler-1.1.0/charlotte/core/robots.py +218 -0
  20. charlotte_crawler-1.1.0/charlotte/core/sanitizer.py +131 -0
  21. charlotte_crawler-1.1.0/charlotte/exceptions.py +94 -0
  22. charlotte_crawler-1.1.0/charlotte/models.py +207 -0
  23. charlotte_crawler-1.1.0/charlotte_crawler.egg-info/PKG-INFO +394 -0
  24. charlotte_crawler-1.1.0/charlotte_crawler.egg-info/SOURCES.txt +27 -0
  25. charlotte_crawler-1.1.0/charlotte_crawler.egg-info/dependency_links.txt +1 -0
  26. charlotte_crawler-1.1.0/charlotte_crawler.egg-info/requires.txt +17 -0
  27. charlotte_crawler-1.1.0/charlotte_crawler.egg-info/top_level.txt +1 -0
  28. charlotte_crawler-1.1.0/pyproject.toml +58 -0
  29. charlotte_crawler-1.1.0/setup.cfg +4 -0
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Boss Button Studios
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,394 @@
1
+ Metadata-Version: 2.4
2
+ Name: charlotte-crawler
3
+ Version: 1.1.0
4
+ Summary: A goal-directed web navigation agent
5
+ Author: Boss Button Studios
6
+ License: MIT License
7
+
8
+ Copyright (c) 2026 Boss Button Studios
9
+
10
+ Permission is hereby granted, free of charge, to any person obtaining a copy
11
+ of this software and associated documentation files (the "Software"), to deal
12
+ in the Software without restriction, including without limitation the rights
13
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
14
+ copies of the Software, and to permit persons to whom the Software is
15
+ furnished to do so, subject to the following conditions:
16
+
17
+ The above copyright notice and this permission notice shall be included in all
18
+ copies or substantial portions of the Software.
19
+
20
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
21
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
22
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
23
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
24
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
25
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
26
+ SOFTWARE.
27
+
28
+ Project-URL: Homepage, https://github.com/Boss-Button-Studios/charlotte
29
+ Project-URL: Repository, https://github.com/Boss-Button-Studios/charlotte
30
+ Project-URL: Issues, https://github.com/Boss-Button-Studios/charlotte/issues
31
+ Keywords: web,crawler,agent,navigation,llm
32
+ Classifier: Development Status :: 5 - Production/Stable
33
+ Classifier: Intended Audience :: Developers
34
+ Classifier: License :: OSI Approved :: MIT License
35
+ Classifier: Programming Language :: Python :: 3
36
+ Classifier: Programming Language :: Python :: 3.11
37
+ Classifier: Programming Language :: Python :: 3.12
38
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
39
+ Classifier: Topic :: Internet :: WWW/HTTP
40
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
41
+ Requires-Python: >=3.11
42
+ Description-Content-Type: text/markdown
43
+ License-File: LICENSE
44
+ Requires-Dist: httpx<1.0,>=0.27.0
45
+ Requires-Dist: beautifulsoup4<5.0,>=4.12.0
46
+ Provides-Extra: playwright
47
+ Requires-Dist: playwright<3.0,>=1.40.0; extra == "playwright"
48
+ Provides-Extra: groq
49
+ Requires-Dist: groq<2.0,>=0.5.0; extra == "groq"
50
+ Provides-Extra: ollama
51
+ Requires-Dist: ollama<1.0,>=0.1.0; extra == "ollama"
52
+ Provides-Extra: dev
53
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
54
+ Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
55
+ Requires-Dist: respx>=0.20.0; extra == "dev"
56
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
57
+ Dynamic: license-file
58
+
59
+ # Charlotte
60
+
61
+ `charlotte-crawler` is a goal-directed web navigation agent. Given a starting URL and a natural language goal, Charlotte navigates a website purposefully — evaluating each page and deciding which links to follow — until she finds what she is looking for or exhausts her search budget.
62
+
63
+ Charlotte is a **library, not a service.** Import it into any Python project. It has no server to run, no API to call, and no data leaves your environment unless you choose a cloud adapter.
64
+
65
+ ---
66
+
67
+ ## Installation
68
+
69
+ ```bash
70
+ # Base install — httpx fetcher, BeautifulSoup extractor, no model adapter
71
+ pip install charlotte-crawler
72
+
73
+ # With Groq cloud adapter (recommended for cloud deployments)
74
+ pip install charlotte-crawler[groq]
75
+
76
+ # With JavaScript rendering (headless Chromium via Playwright)
77
+ pip install charlotte-crawler[playwright]
78
+ playwright install chromium
79
+ ```
80
+
81
+ The `LocalAdapter` talks to any OpenAI-compatible local server (Ollama, LM Studio, llama.cpp) using `httpx`, which is already a required dependency. No extra install needed.
82
+
83
+ ---
84
+
85
+ ## Quick Start
86
+
87
+ ### Find a link
88
+
89
+ ```python
90
+ import asyncio
91
+ from charlotte import find_link
92
+ from charlotte.adapters.groq import GroqAdapter
93
+
94
+ async def main():
95
+     result = await find_link(
95
+         start_url="https://www.example.edu",
96
+         goal="Find the academic calendar page",
97
+         model=GroqAdapter(),  # requires GROQ_API_KEY env var
98
+     )
99
+
100
+     if result.found:
101
+         print(result.urls[0])  # URL of the matching page
102
+     else:
103
+         print(result.note)  # explains why nothing was found
105
+
106
+ asyncio.run(main())
107
+ ```
108
+
109
+ ### Crawl with streaming events
110
+
111
+ ```python
112
+ import asyncio
113
+ from charlotte import crawl, ResultFound, CrawlComplete
114
+ from charlotte.adapters.local import LocalAdapter
115
+
116
+ async def main():
117
+     async for event in crawl(
117
+         start_url="https://docs.example.com",
118
+         goal="Find the API reference for the payments module",
119
+         model=LocalAdapter(),  # connects to Ollama at localhost:11434
120
+         stream=True,
121
+     ):
122
+         if isinstance(event, ResultFound):
123
+             print(f"Found: {event.url} (confidence {event.confidence:.0%})")
124
+         elif isinstance(event, CrawlComplete):
125
+             print(f"Done — {event.result_count} result(s), {event.pages_visited} page(s) visited")
127
+
128
+ asyncio.run(main())
129
+ ```
130
+
131
+ ### Extract a fact
132
+
133
+ ```python
134
+ import asyncio
135
+ from charlotte import crawl
136
+ from charlotte.adapters.groq import GroqAdapter
137
+
138
+ async def main():
139
+     result = await crawl(
139
+         start_url="https://www.ucsd.edu/about/",
140
+         goal="Find the main switchboard phone number",
141
+         model=GroqAdapter(),
142
+         stream=False,
143
+     )
144
+
145
+     if result.found and result.answers:
146
+         print(result.answers[0])  # e.g. "(858) 534-2230"
148
+
149
+ asyncio.run(main())
150
+ ```
151
+
152
+ ---
153
+
154
+ ## Adapters
155
+
156
+ An adapter is any async callable with the signature shown under "Bring Your Own Model (BYOM)" below. Charlotte ships two.
157
+
158
+ ### GroqAdapter
159
+
160
+ Calls **Llama 3.1 8B Instruct** via the [Groq API](https://console.groq.com). Fast, accurate, and free to start. Requires the `[groq]` extra and a `GROQ_API_KEY` environment variable.
161
+
162
+ ```python
163
+ from charlotte.adapters.groq import GroqAdapter
164
+
165
+ model = GroqAdapter() # reads GROQ_API_KEY from env
166
+ model = GroqAdapter(api_key="gsk_…") # or pass directly
167
+ model = GroqAdapter(model="llama-3.3-70b-versatile") # override model
168
+ ```
169
+
170
+ ### LocalAdapter
171
+
172
+ Calls any **OpenAI-compatible local inference endpoint** — Ollama, LM Studio, llama.cpp server, text-generation-webui. Defaults to `deepseek-r1:14b` at `http://localhost:11434`.
173
+
174
+ ```python
175
+ from charlotte.adapters.local import LocalAdapter
176
+
177
+ model = LocalAdapter() # deepseek-r1:14b @ localhost:11434
178
+ model = LocalAdapter(model_name="llama3.2:3b") # lighter model
179
+ model = LocalAdapter(
180
+     base_url="http://gpu-box:11434",
181
+     model_name="qwen2.5:14b",
182
+     verbose=True,  # stream tokens to stderr
183
+ )
184
+ ```
185
+
186
+ Pull the default model with: `ollama pull deepseek-r1:14b`
187
+
188
+ ### Bring Your Own Model (BYOM)
189
+
190
+ Any async callable that matches this signature works as a `model=` argument:
191
+
192
+ ```python
193
+ from typing import Any
194
+
195
+ async def my_adapter(
196
+     *,
196
+     goal: str,
197
+     navigation_hint: str | None,
198
+     page_title: str,
199
+     page_url: str,
200
+     page_summary: str,
201
+     available_links: list[dict[str, str]],  # [{"text": "…", "url": "…"}, …]
202
+     visit_history: list[str],
203
+     results_so_far: int,
204
+     schema_hint: str | None = None,
205
+ ) -> dict[str, Any]:
206
+     ...
207
+     return {
208
+         "found": True,  # bool
209
+         "confidence": 0.95,  # float 0.0–1.0
210
+         "result_url": page_url,  # str when found=True, else None
211
+         "links_to_follow": [],  # list[str] of URLs to visit next
212
+         "reasoning": "Found it on this page.",  # non-empty str
213
+         "answer": None,  # str for facts, None for navigation
214
+     }
216
+ ```
217
+
218
+ Charlotte validates the response dict against this schema before use. Malformed output triggers one retry with a reinforced prompt; two failures skip the page with `AdapterOutputError`.
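The validate-then-retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in `charlotte/core/adapter_validation.py`: the names `schema_problems`, `evaluate_with_retry`, and `AdapterOutputErrorSketch` are hypothetical, the exact checks are inferred from the comments in the BYOM example, and the prompt reinforcement step is reduced to a plain second call.

```python
import asyncio
from typing import Any

REQUIRED_KEYS = ("found", "confidence", "result_url",
                 "links_to_follow", "reasoning", "answer")


class AdapterOutputErrorSketch(Exception):
    """Hypothetical stand-in for Charlotte's AdapterOutputError."""


def schema_problems(decision: Any) -> list[str]:
    """Return human-readable schema violations; an empty list means valid."""
    if not isinstance(decision, dict):
        return ["response is not a dict"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in decision]
    if problems:
        return problems
    if not isinstance(decision["found"], bool):
        problems.append("found must be a bool")
    conf = decision["confidence"]
    # bool is a subclass of int, so reject it explicitly before the range check.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number in [0.0, 1.0]")
    if decision["found"] is True and not isinstance(decision["result_url"], str):
        problems.append("result_url must be a str when found is True")
    links = decision["links_to_follow"]
    if not isinstance(links, list) or not all(isinstance(u, str) for u in links):
        problems.append("links_to_follow must be a list of str")
    reasoning = decision["reasoning"]
    if not isinstance(reasoning, str) or not reasoning.strip():
        problems.append("reasoning must be a non-empty str")
    if decision["answer"] is not None and not isinstance(decision["answer"], str):
        problems.append("answer must be a str or None")
    return problems


async def evaluate_with_retry(adapter, **page_kwargs: Any) -> dict[str, Any]:
    """Call the adapter, retry once on malformed output, then give up."""
    problems: list[str] = []
    for _attempt in range(2):  # first call plus one retry
        decision = await adapter(**page_kwargs)
        problems = schema_problems(decision)
        if not problems:
            return decision
        # The README says the retry uses a reinforced prompt; this sketch
        # simply calls the adapter again with the same arguments.
    raise AdapterOutputErrorSketch("; ".join(problems))
```

A caller would catch the final exception and skip the page, which matches the "two failures skip the page" wording above.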
219
+
220
+ ---
221
+
222
+ ## `crawl()` — Parameters
223
+
224
+ ```python
225
+ crawl(
226
+     start_url,  # str — absolute URL to start from
226
+     goal,  # str — natural language description of what to find
227
+     *,
228
+     model=None,  # AdapterProtocol | None — None resolves via CHARLOTTE_DEFAULT_ADAPTER
229
+     max_pages=20,  # int — hard ceiling on pages fetched
230
+     max_depth=5,  # int — max link-hops from start_url
231
+     max_results=1,  # int | None — stop after N results; None = collect all
232
+     confidence_threshold=0.70,  # float — minimum confidence to record a result
233
+     render_js=False,  # bool — use Playwright for JS-rendered pages
234
+     allowed_domains=None,  # list[str] | None — defaults to start_url domain
235
+     return_content=False,  # bool — include sanitized page text in CrawlResult
236
+     navigation_hint=None,  # str | None — extra context for the model
237
+     stream=None,  # bool | None — None reads CHARLOTTE_STREAM (default True)
238
+     respect_robots=None,  # bool | None — None reads CHARLOTTE_RESPECT_ROBOTS (default True)
239
+     connect_timeout=10.0,  # float — TCP connection timeout (seconds)
240
+     read_timeout=30.0,  # float — response body read timeout (seconds)
241
+     render_timeout=15.0,  # float — JS settle timeout for Playwright (seconds)
242
+     default_delay=1.0,  # float — floor for polite inter-request delay (seconds)
243
+     chromium_executable=None,  # str | None — path to Chromium binary (Playwright)
245
+ )
246
+ ```
247
+
248
+ **Returns:**
249
+ - `AsyncGenerator[StreamEvent, None]` when `stream=True`
250
+ - `Coroutine[CrawlResult]` when `stream=False` — use `await crawl(...)`
251
+
252
+ **`CrawlResult` fields:**
253
+
254
+ | Field | Type | Description |
255
+ |---|---|---|
256
+ | `found` | `bool` | Whether at least one result was confirmed |
257
+ | `result_urls` | `list[str]` | URLs of all confirmed results, in discovery order |
258
+ | `answers` | `list[str \| None] \| None` | Extracted facts parallel to `result_urls`; `None` if nothing found |
259
+ | `content` | `list[str] \| None` | Sanitized page text per result (only when `return_content=True`) |
260
+ | `confidence` | `float` | Confidence of the best result |
261
+ | `pages_visited` | `int` | Total pages fetched |
262
+ | `depth_reached` | `int` | Deepest link-hop reached |
263
+ | `visit_log` | `list[VisitLogEntry]` | Per-page URL, depth, found flag, confidence, reasoning |
264
+ | `best_candidate_url` | `str \| None` | Highest-confidence URL seen, even if below threshold |
265
+ | `budget_exhausted` | `bool` | True if `max_pages` or `max_depth` was hit before finding a result |
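To make the interplay between `max_pages`, `max_depth`, `confidence_threshold`, `best_candidate_url`, and `budget_exhausted` concrete, here is a minimal sketch over an in-memory link graph. It makes several assumptions that the README does not confirm: the traversal is breadth-first, a precomputed `scores` table stands in for the model's confidence, the search stops at the first result (as with `max_results=1`), and `budget_exhausted` is reported only when nothing was found. The function name `budgeted_search` is hypothetical and is not part of Charlotte.

```python
from collections import deque
from typing import Any


def budgeted_search(
    graph: dict[str, list[str]],
    scores: dict[str, float],
    start: str,
    *,
    max_pages: int = 20,
    max_depth: int = 5,
    confidence_threshold: float = 0.70,
) -> dict[str, Any]:
    """Breadth-first search that stops at the first page meeting the threshold."""
    queue = deque([(start, 0)])
    seen = {start}
    best_conf = 0.0
    state: dict[str, Any] = {
        "found": False, "result_urls": [], "confidence": 0.0,
        "pages_visited": 0, "depth_reached": 0,
        "best_candidate_url": None, "budget_exhausted": False,
    }
    while queue:
        if state["pages_visited"] >= max_pages:
            state["budget_exhausted"] = True  # page ceiling hit with work left
            break
        url, depth = queue.popleft()
        state["pages_visited"] += 1
        state["depth_reached"] = max(state["depth_reached"], depth)
        conf = scores.get(url, 0.0)  # stand-in for the model's confidence
        if conf > best_conf:
            best_conf, state["best_candidate_url"] = conf, url
        if conf >= confidence_threshold:
            state.update(found=True, result_urls=[url], confidence=conf,
                         budget_exhausted=False)
            break
        for link in graph.get(url, []):
            if link in seen:
                continue
            if depth + 1 > max_depth:
                state["budget_exhausted"] = True  # link dropped by the depth cap
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return state
```

Note how `best_candidate_url` is tracked even for pages below the threshold, so a caller still gets a lead when the search comes up empty.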
266
+
267
+ ---
268
+
269
+ ## `find_link()` — Parameters
270
+
271
+ `find_link()` is a thin wrapper around `crawl()` with two fixed differences: `max_results=None` (collect every match) and `return_content=False` (always). The remaining parameters listed below have the same meaning and defaults as in `crawl()`.
272
+
273
+ ```python
274
+ find_link(
275
+     start_url,  # str
275
+     goal,  # str
276
+     *,
277
+     model=None,
278
+     max_pages=20,
279
+     max_depth=5,
280
+     confidence_threshold=0.70,
281
+     render_js=False,
282
+     allowed_domains=None,
283
+     navigation_hint=None,
284
+     stream=None,
285
+     respect_robots=None,
286
+     connect_timeout=10.0,
287
+     read_timeout=30.0,
288
+     render_timeout=15.0,
289
+     default_delay=1.0,
291
+ )
292
+ ```
293
+
294
+ **Returns:**
295
+ - `AsyncGenerator[StreamEvent, None]` when `stream=True`
296
+ - `Coroutine[LinkResult]` when `stream=False` — use `await find_link(...)`
297
+
298
+ **`LinkResult` fields:**
299
+
300
+ | Field | Type | Description |
301
+ |---|---|---|
302
+ | `found` | `bool` | Whether at least one link was found |
303
+ | `urls` | `list[str]` | All matching URLs, in discovery order |
304
+ | `confidence` | `float` | Confidence of the best match |
305
+ | `pages_visited` | `int` | Total pages fetched |
306
+ | `best_candidate_url` | `str \| None` | Highest-confidence URL seen, even if below threshold |
307
+ | `budget_exhausted` | `bool` | True if the budget was exhausted before a match |
308
+ | `note` | `str \| None` | Human-readable explanation when `found=False` |
309
+
310
+ ---
311
+
312
+ ## Environment Variables
313
+
314
+ | Variable | Default | Effect |
315
+ |---|---|---|
316
+ | `CHARLOTTE_DEFAULT_ADAPTER` | `"groq"` | `"groq"` or `"local"` — adapter used when `model=None` |
317
+ | `CHARLOTTE_LOCAL_BASE_URL` | `"http://localhost:11434"` | Base URL for `LocalAdapter` |
318
+ | `CHARLOTTE_LOCAL_MODEL` | `"deepseek-r1:14b"` | Model name for `LocalAdapter` |
319
+ | `CHARLOTTE_STREAM` | `"true"` | `"true"` or `"false"` — default for `stream=None` |
320
+ | `CHARLOTTE_RESPECT_ROBOTS` | `"true"` | `"true"` or `"false"` — default for `respect_robots=None` |
321
+ | `GROQ_API_KEY` | *(none)* | Required when using `GroqAdapter` |
322
+
323
+ Direct `crawl()` / `find_link()` parameters always take precedence over env vars.
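The precedence rule above (explicit argument first, then environment variable, then built-in default) can be sketched for the boolean settings as below. The helper name `resolve_flag` is hypothetical, and the parsing rule (case-insensitive, anything other than `"true"` counts as false) is an assumption; `charlotte/config.py` may behave differently.

```python
import os
from typing import Optional


def resolve_flag(explicit: Optional[bool], env_name: str, default: bool = True) -> bool:
    """Explicit argument wins; otherwise the env var; otherwise the default."""
    if explicit is not None:
        return explicit
    raw = os.environ.get(env_name)
    if raw is None:
        return default
    # Assumption: case-insensitive match, and any value other than "true" is False.
    return raw.strip().lower() == "true"
```

Under these assumptions, `resolve_flag(None, "CHARLOTTE_STREAM")` follows the environment, while `resolve_flag(False, "CHARLOTTE_STREAM")` ignores it.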
324
+
325
+ ---
326
+
327
+ ## robots.txt Policy
328
+
329
+ Charlotte fetches and obeys `robots.txt` before visiting any page, unless `respect_robots=False`.
330
+
331
+ - **404** response → no restrictions; crawl proceeds normally
332
+ - **401 / 403** → no restrictions; crawl proceeds normally
333
+ - **5xx / timeout / parse error** → `RobotsError`; crawl does not start
334
+ - **`Disallow` rule matched** → `RobotsError`; affected page skipped; other pages continue
335
+ - **`Crawl-delay` directive** → honoured; the larger of the directive and `default_delay` is used
336
+ - **Cross-domain redirect** → each domain's `robots.txt` is checked independently; permissions never inherit across domain boundaries
337
+ - **User-agent matching** — `charlotte-crawler` first, then `*`
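Several of the rules in the list above can be approximated with the standard library's `urllib.robotparser`, as in the sketch below. This swaps in the stdlib parser purely for illustration and is not the code in `charlotte/core/robots.py`; the names `robots_decision`, `effective_delay`, and `RobotsErrorSketch` are hypothetical, and redirect handling and timeouts are left out.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

USER_AGENT = "charlotte-crawler"


class RobotsErrorSketch(Exception):
    """Hypothetical stand-in for Charlotte's RobotsError."""


def effective_delay(crawl_delay: Optional[float], default_delay: float = 1.0) -> float:
    """Use whichever is larger: the Crawl-delay directive or the default floor."""
    if crawl_delay is None:
        return default_delay
    return max(default_delay, float(crawl_delay))


def robots_decision(
    status_code: int, body: str, url: str, default_delay: float = 1.0
) -> tuple[bool, float]:
    """Return (allowed, delay_seconds) for one URL given a robots.txt response."""
    if status_code in (401, 403, 404):
        return True, default_delay  # treated as "no restrictions"
    if status_code >= 500:
        raise RobotsErrorSketch(f"robots.txt returned {status_code}")
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    # The stdlib parser prefers a matching named group and falls back to "*".
    allowed = parser.can_fetch(USER_AGENT, url)
    delay = effective_delay(parser.crawl_delay(USER_AGENT), default_delay)
    return allowed, delay
```

One caveat of this stand-in: the stdlib parser only accepts integer `Crawl-delay` values, so fractional directives would be ignored here.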
338
+
339
+ ---
340
+
341
+ ## Streaming Events
342
+
343
+ When `stream=True`, Charlotte yields the following events in order:
344
+
345
+ | Event | Fields | When emitted |
346
+ |---|---|---|
347
+ | `CrawlStarted` | `url`, `goal` | Once, immediately |
348
+ | `PageFetched` | `url`, `status_code`, `depth`, `render_js` | After each successful fetch |
349
+ | `ModelDecision` | `url`, `found`, `confidence`, `reasoning`, `links_queued`, `links_available`, `links_suggested` | After each model evaluation |
350
+ | `ResultFound` | `url`, `confidence`, `result_index`, `answer` | When a result is confirmed |
351
+ | `PageSkipped` | `url`, `reason`, `error_type` | When a page is skipped (fetch error, schema failure, plausibility, robots) |
352
+ | `BudgetExhausted` | `url`, `reason` | When `max_pages` or `max_depth` is reached without a result |
353
+ | `CrawlComplete` | `found`, `result_count`, `pages_visited`, `elapsed_seconds` | Once, always last |
354
+
355
+ All events include `type: str` and `timestamp: float` (Unix time) fields.
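A consumer of the event stream might fold it into a small summary, as in the sketch below. To stay self-contained it accepts any async iterator and dispatches on each event's class name rather than importing Charlotte's event classes; the helper name `summarize_events` is hypothetical, and only the `url` and `reason` fields listed in the table above are read.

```python
import asyncio
from collections import Counter
from typing import Any, AsyncIterator


async def summarize_events(events: AsyncIterator[Any]) -> dict[str, Any]:
    """Count events by class name and collect found and skipped URLs."""
    counts: Counter = Counter()
    found_urls: list[str] = []
    skipped: list[tuple[str, str]] = []
    async for event in events:
        name = type(event).__name__
        counts[name] += 1
        if name == "ResultFound":
            found_urls.append(event.url)
        elif name == "PageSkipped":
            skipped.append((event.url, event.reason))
    return {"counts": dict(counts), "found_urls": found_urls, "skipped": skipped}
```

With the real package, something like `await summarize_events(crawl(..., stream=True))` would presumably work, since the README describes the streaming return value as an async generator; code that imports the event classes can use `isinstance` checks instead of class names.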
356
+
357
+ Import event types directly from the package:
358
+
359
+ ```python
360
+ from charlotte import (
361
+     CrawlStarted, PageFetched, ModelDecision, ResultFound,
361
+     PageSkipped, BudgetExhausted, CrawlComplete,
363
+ )
364
+ ```
365
+
366
+ ---
367
+
368
+ ## Error Classes
369
+
370
+ All Charlotte exceptions inherit from `CharlotteError`. Third-party exceptions (`httpx`, `groq`, `playwright`) are caught at component boundaries and re-raised as one of these — they never reach the caller.
371
+
372
+ | Exception | Raised when |
373
+ |---|---|
374
+ | `CharlotteConfigError` | Invalid configuration — bad URL, missing API key, Playwright not installed, invalid parameter |
375
+ | `CharlotteNetworkError` | HTTP error response (4xx / 5xx) that is not retried |
376
+ | `CharlotteTimeoutError` | Connect, read, render, or model timeout |
377
+ | `CharlotteRedirectError` | Cross-domain redirect to a disallowed host |
378
+ | `RobotsError` | robots.txt blocks a URL or cannot be fetched |
379
+ | `AdapterOutputError` | Model returned malformed JSON or failed schema validation after retry |
380
+ | `CharlotteInternalError` | Unexpected engine-level state (should not occur; file a bug) |
381
+
382
+ `CharlotteConfigError` is raised eagerly — before any network I/O — when configuration is invalid. All others surface as `PageSkipped` events (stream mode) or as logged debug entries in the `visit_log` (non-stream mode). `crawl()` and `find_link()` never raise after the crawl has started.
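The "catch at the component boundary and re-raise" pattern described in this section can be sketched with a decorator. Everything here is illustrative: the `*Sketch` classes are hypothetical stand-ins for the exceptions in `charlotte/exceptions.py`, and the built-in `TimeoutError` and `OSError` play the role of third-party errors so the example runs without `httpx` installed.

```python
import functools


class CharlotteErrorSketch(Exception):
    """Hypothetical base class, standing in for CharlotteError."""


class TimeoutErrorSketch(CharlotteErrorSketch):
    """Stand-in for CharlotteTimeoutError."""


class NetworkErrorSketch(CharlotteErrorSketch):
    """Stand-in for CharlotteNetworkError."""


def at_boundary(func):
    """Translate foreign exceptions into the library's own hierarchy."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except CharlotteErrorSketch:
            raise  # already one of ours; pass it through unchanged
        except TimeoutError as exc:  # checked first: it subclasses OSError
            raise TimeoutErrorSketch(str(exc)) from exc
        except OSError as exc:
            raise NetworkErrorSketch(str(exc)) from exc
    return wrapper
```

Using `raise ... from exc` keeps the original exception available as `__cause__`, so callers handle a single hierarchy without losing the underlying detail.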
383
+
384
+ ---
385
+
386
+ ## Specification
387
+
388
+ The full technical specification — adapter authoring guide, streaming events reference, security model, URL normalization rules — is at `docs/charlotte-spec-v1.4.md`.
389
+
390
+ ---
391
+
392
+ ## Licence
393
+
394
+ MIT — see `LICENSE`.