charlotte-crawler 1.1.0__tar.gz
- charlotte_crawler-1.1.0/LICENSE +21 -0
- charlotte_crawler-1.1.0/PKG-INFO +394 -0
- charlotte_crawler-1.1.0/README.md +336 -0
- charlotte_crawler-1.1.0/charlotte/__init__.py +72 -0
- charlotte_crawler-1.1.0/charlotte/adapters/__init__.py +14 -0
- charlotte_crawler-1.1.0/charlotte/adapters/base.py +64 -0
- charlotte_crawler-1.1.0/charlotte/adapters/groq.py +250 -0
- charlotte_crawler-1.1.0/charlotte/adapters/local.py +364 -0
- charlotte_crawler-1.1.0/charlotte/config.py +85 -0
- charlotte_crawler-1.1.0/charlotte/core/__init__.py +0 -0
- charlotte_crawler-1.1.0/charlotte/core/adapter_validation.py +272 -0
- charlotte_crawler-1.1.0/charlotte/core/engine.py +599 -0
- charlotte_crawler-1.1.0/charlotte/core/extractor.py +196 -0
- charlotte_crawler-1.1.0/charlotte/core/fetcher.py +350 -0
- charlotte_crawler-1.1.0/charlotte/core/find_link.py +141 -0
- charlotte_crawler-1.1.0/charlotte/core/normalizer.py +262 -0
- charlotte_crawler-1.1.0/charlotte/core/plausibility.py +231 -0
- charlotte_crawler-1.1.0/charlotte/core/provenance.py +162 -0
- charlotte_crawler-1.1.0/charlotte/core/robots.py +218 -0
- charlotte_crawler-1.1.0/charlotte/core/sanitizer.py +131 -0
- charlotte_crawler-1.1.0/charlotte/exceptions.py +94 -0
- charlotte_crawler-1.1.0/charlotte/models.py +207 -0
- charlotte_crawler-1.1.0/charlotte_crawler.egg-info/PKG-INFO +394 -0
- charlotte_crawler-1.1.0/charlotte_crawler.egg-info/SOURCES.txt +27 -0
- charlotte_crawler-1.1.0/charlotte_crawler.egg-info/dependency_links.txt +1 -0
- charlotte_crawler-1.1.0/charlotte_crawler.egg-info/requires.txt +17 -0
- charlotte_crawler-1.1.0/charlotte_crawler.egg-info/top_level.txt +1 -0
- charlotte_crawler-1.1.0/pyproject.toml +58 -0
- charlotte_crawler-1.1.0/setup.cfg +4 -0
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 Boss Button Studios
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
|
@@ -0,0 +1,394 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: charlotte-crawler
|
|
3
|
+
Version: 1.1.0
|
|
4
|
+
Summary: A goal-directed web navigation agent
|
|
5
|
+
Author: Boss Button Studios
|
|
6
|
+
License: MIT License
|
|
7
|
+
|
|
8
|
+
Copyright (c) 2026 Boss Button Studios
|
|
9
|
+
|
|
10
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
11
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
12
|
+
in the Software without restriction, including without limitation the rights
|
|
13
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
14
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
15
|
+
furnished to do so, subject to the following conditions:
|
|
16
|
+
|
|
17
|
+
The above copyright notice and this permission notice shall be included in all
|
|
18
|
+
copies or substantial portions of the Software.
|
|
19
|
+
|
|
20
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
21
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
22
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
23
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
24
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
25
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
26
|
+
SOFTWARE.
|
|
27
|
+
|
|
28
|
+
Project-URL: Homepage, https://github.com/Boss-Button-Studios/charlotte
|
|
29
|
+
Project-URL: Repository, https://github.com/Boss-Button-Studios/charlotte
|
|
30
|
+
Project-URL: Issues, https://github.com/Boss-Button-Studios/charlotte/issues
|
|
31
|
+
Keywords: web,crawler,agent,navigation,llm
|
|
32
|
+
Classifier: Development Status :: 5 - Production/Stable
|
|
33
|
+
Classifier: Intended Audience :: Developers
|
|
34
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
35
|
+
Classifier: Programming Language :: Python :: 3
|
|
36
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
37
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
38
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
39
|
+
Classifier: Topic :: Internet :: WWW/HTTP
|
|
40
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
41
|
+
Requires-Python: >=3.11
|
|
42
|
+
Description-Content-Type: text/markdown
|
|
43
|
+
License-File: LICENSE
|
|
44
|
+
Requires-Dist: httpx<1.0,>=0.27.0
|
|
45
|
+
Requires-Dist: beautifulsoup4<5.0,>=4.12.0
|
|
46
|
+
Provides-Extra: playwright
|
|
47
|
+
Requires-Dist: playwright<3.0,>=1.40.0; extra == "playwright"
|
|
48
|
+
Provides-Extra: groq
|
|
49
|
+
Requires-Dist: groq<2.0,>=0.5.0; extra == "groq"
|
|
50
|
+
Provides-Extra: ollama
|
|
51
|
+
Requires-Dist: ollama<1.0,>=0.1.0; extra == "ollama"
|
|
52
|
+
Provides-Extra: dev
|
|
53
|
+
Requires-Dist: pytest>=8.0.0; extra == "dev"
|
|
54
|
+
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
|
|
55
|
+
Requires-Dist: respx>=0.20.0; extra == "dev"
|
|
56
|
+
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
|
|
57
|
+
Dynamic: license-file
|
|
58
|
+
|
|
59
|
+
# Charlotte
|
|
60
|
+
|
|
61
|
+
`charlotte-crawler` is a goal-directed web navigation agent. Given a starting URL and a natural language goal, Charlotte navigates a website purposefully — evaluating each page and deciding which links to follow — until she finds what she is looking for or exhausts her search budget.
|
|
62
|
+
|
|
63
|
+
Charlotte is a **library, not a service.** Import it into any Python project. It has no server to run, no API to call, and no data leaves your environment unless you choose a cloud adapter.
|
|
64
|
+
|
|
65
|
+
---
|
|
66
|
+
|
|
67
|
+
## Installation
|
|
68
|
+
|
|
69
|
+
```bash
|
|
70
|
+
# Base install — httpx fetcher, BeautifulSoup extractor, no model adapter
|
|
71
|
+
pip install charlotte-crawler
|
|
72
|
+
|
|
73
|
+
# With Groq cloud adapter (recommended for cloud deployments)
|
|
74
|
+
pip install charlotte-crawler[groq]
|
|
75
|
+
|
|
76
|
+
# With JavaScript rendering (headless Chromium via Playwright)
|
|
77
|
+
pip install charlotte-crawler[playwright]
|
|
78
|
+
playwright install chromium
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
The `LocalAdapter` talks to any OpenAI-compatible local server (Ollama, LM Studio, llama.cpp) using `httpx`, which is already a required dependency. No extra install needed.
|
|
82
|
+
|
|
83
|
+
---
|
|
84
|
+
|
|
85
|
+
## Quick Start
|
|
86
|
+
|
|
87
|
+
### Find a link
|
|
88
|
+
|
|
89
|
+
```python
import asyncio
from charlotte import find_link
from charlotte.adapters.groq import GroqAdapter

async def main():
    result = await find_link(
        start_url="https://www.example.edu",
        goal="Find the academic calendar page",
        model=GroqAdapter(),  # requires GROQ_API_KEY env var
        stream=False,  # await a LinkResult rather than iterating events
    )

    if result.found:
        print(result.urls[0])  # URL of the matching page
    else:
        print(result.note)  # explains why nothing was found

asyncio.run(main())
```
|
|
108
|
+
|
|
109
|
+
### Crawl with streaming events
|
|
110
|
+
|
|
111
|
+
```python
import asyncio
from charlotte import crawl, ResultFound, CrawlComplete
from charlotte.adapters.local import LocalAdapter

async def main():
    async for event in crawl(
        start_url="https://docs.example.com",
        goal="Find the API reference for the payments module",
        model=LocalAdapter(),  # connects to Ollama at localhost:11434
        stream=True,
    ):
        if isinstance(event, ResultFound):
            print(f"Found: {event.url} (confidence {event.confidence:.0%})")
        elif isinstance(event, CrawlComplete):
            print(f"Done — {event.result_count} result(s), {event.pages_visited} page(s) visited")

asyncio.run(main())
```
|
|
130
|
+
|
|
131
|
+
### Extract a fact
|
|
132
|
+
|
|
133
|
+
```python
import asyncio
from charlotte import crawl
from charlotte.adapters.groq import GroqAdapter

async def main():
    result = await crawl(
        start_url="https://www.ucsd.edu/about/",
        goal="Find the main switchboard phone number",
        model=GroqAdapter(),
        stream=False,
    )

    if result.found and result.answers:
        print(result.answers[0])  # e.g. "(858) 534-2230"

asyncio.run(main())
```
|
|
151
|
+
|
|
152
|
+
---
|
|
153
|
+
|
|
154
|
+
## Adapters
|
|
155
|
+
|
|
156
|
+
An adapter is any async callable that matches the signature shown under "Bring Your Own Model (BYOM)". Charlotte ships two.
|
|
157
|
+
|
|
158
|
+
### GroqAdapter
|
|
159
|
+
|
|
160
|
+
Calls **Llama 3.1 8B Instruct** via the [Groq API](https://console.groq.com). Fast, accurate, and free to start. Requires the `[groq]` extra and a `GROQ_API_KEY` environment variable.
|
|
161
|
+
|
|
162
|
+
```python
|
|
163
|
+
from charlotte.adapters.groq import GroqAdapter
|
|
164
|
+
|
|
165
|
+
model = GroqAdapter() # reads GROQ_API_KEY from env
|
|
166
|
+
model = GroqAdapter(api_key="gsk_…") # or pass directly
|
|
167
|
+
model = GroqAdapter(model="llama-3.3-70b-versatile") # override model
|
|
168
|
+
```
|
|
169
|
+
|
|
170
|
+
### LocalAdapter
|
|
171
|
+
|
|
172
|
+
Calls any **OpenAI-compatible local inference endpoint** — Ollama, LM Studio, llama.cpp server, text-generation-webui. Defaults to `deepseek-r1:14b` at `http://localhost:11434`.
|
|
173
|
+
|
|
174
|
+
```python
from charlotte.adapters.local import LocalAdapter

model = LocalAdapter()                          # deepseek-r1:14b @ localhost:11434
model = LocalAdapter(model_name="llama3.2:3b")  # lighter model
model = LocalAdapter(
    base_url="http://gpu-box:11434",
    model_name="qwen2.5:14b",
    verbose=True,  # stream tokens to stderr
)
```
|
|
185
|
+
|
|
186
|
+
Pull the default model with: `ollama pull deepseek-r1:14b`
|
|
187
|
+
|
|
188
|
+
### Bring Your Own Model (BYOM)
|
|
189
|
+
|
|
190
|
+
Any async callable that matches this signature works as a `model=` argument:
|
|
191
|
+
|
|
192
|
+
```python
from typing import Any

async def my_adapter(
    *,
    goal: str,
    navigation_hint: str | None,
    page_title: str,
    page_url: str,
    page_summary: str,
    available_links: list[dict[str, str]],  # [{"text": "…", "url": "…"}, …]
    visit_history: list[str],
    results_so_far: int,
    schema_hint: str | None = None,
) -> dict[str, Any]:
    ...
    return {
        "found": True,                          # bool
        "confidence": 0.95,                     # float 0.0–1.0
        "result_url": page_url,                 # str when found=True, else null
        "links_to_follow": [],                  # list[str] of URLs to visit next
        "reasoning": "Found it on this page.",  # non-empty str
        "answer": None,                         # str for facts, null for navigation
    }
```
|
|
217
|
+
|
|
218
|
+
Charlotte validates the response dict against this schema before use. Malformed output triggers one retry with a reinforced prompt; two failures skip the page with `AdapterOutputError`.
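
As a rough illustration of the schema check described above, the sketch below validates a response dict against the field rules listed in the BYOM return value. `validate_adapter_output` is a hypothetical helper written for this example; it is not Charlotte's own validator in `charlotte/core/adapter_validation.py`, whose exact rules may differ.

```python
from typing import Any


def validate_adapter_output(response: Any) -> list[str]:
    """Return a list of schema problems; an empty list means the dict looks valid.

    Hypothetical helper for illustration only, not part of Charlotte's API.
    """
    if not isinstance(response, dict):
        return ["response must be a dict"]

    problems: list[str] = []

    found = response.get("found")
    if not isinstance(found, bool):
        problems.append("found must be a bool")

    confidence = response.get("confidence")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        problems.append("confidence must be a number")
    elif not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be between 0.0 and 1.0")

    result_url = response.get("result_url")
    if found is True and not isinstance(result_url, str):
        problems.append("result_url must be a str when found is True")

    links = response.get("links_to_follow")
    if not isinstance(links, list) or not all(isinstance(u, str) for u in links):
        problems.append("links_to_follow must be a list of str")

    reasoning = response.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        problems.append("reasoning must be a non-empty str")

    answer = response.get("answer")
    if answer is not None and not isinstance(answer, str):
        problems.append("answer must be a str or None")

    return problems
```

Running a check like this inside a custom adapter's own tests can catch malformed output before Charlotte has to retry or skip the page.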
|
|
219
|
+
|
|
220
|
+
---
|
|
221
|
+
|
|
222
|
+
## `crawl()` — Parameters
|
|
223
|
+
|
|
224
|
+
```python
crawl(
    start_url,                  # str — absolute URL to start from
    goal,                       # str — natural language description of what to find
    *,
    model=None,                 # AdapterProtocol | None — None resolves via CHARLOTTE_DEFAULT_ADAPTER
    max_pages=20,               # int — hard ceiling on pages fetched
    max_depth=5,                # int — max link-hops from start_url
    max_results=1,              # int | None — stop after N results; None = collect all
    confidence_threshold=0.70,  # float — minimum confidence to record a result
    render_js=False,            # bool — use Playwright for JS-rendered pages
    allowed_domains=None,       # list[str] | None — defaults to start_url domain
    return_content=False,       # bool — include sanitized page text in CrawlResult
    navigation_hint=None,       # str | None — extra context for the model
    stream=None,                # bool | None — None reads CHARLOTTE_STREAM (default True)
    respect_robots=None,        # bool | None — None reads CHARLOTTE_RESPECT_ROBOTS (default True)
    connect_timeout=10.0,       # float — TCP connection timeout (seconds)
    read_timeout=30.0,          # float — response body read timeout (seconds)
    render_timeout=15.0,        # float — JS settle timeout for Playwright (seconds)
    default_delay=1.0,          # float — floor for polite inter-request delay (seconds)
    chromium_executable=None,   # str | None — path to Chromium binary (Playwright)
)
```
|
|
247
|
+
|
|
248
|
+
**Returns:**
|
|
249
|
+
- `AsyncGenerator[StreamEvent, None]` when `stream=True`
|
|
250
|
+
- `Coroutine[CrawlResult]` when `stream=False` — use `await crawl(...)`
|
|
251
|
+
|
|
252
|
+
**`CrawlResult` fields:**
|
|
253
|
+
|
|
254
|
+
| Field | Type | Description |
|
|
255
|
+
|---|---|---|
|
|
256
|
+
| `found` | `bool` | Whether at least one result was confirmed |
|
|
257
|
+
| `result_urls` | `list[str]` | URLs of all confirmed results, in discovery order |
|
|
258
|
+
| `answers` | `list[str \| None] \| None` | Extracted facts parallel to `result_urls`; `None` if nothing found |
|
|
259
|
+
| `content` | `list[str] \| None` | Sanitized page text per result (only when `return_content=True`) |
|
|
260
|
+
| `confidence` | `float` | Confidence of the best result |
|
|
261
|
+
| `pages_visited` | `int` | Total pages fetched |
|
|
262
|
+
| `depth_reached` | `int` | Deepest link-hop reached |
|
|
263
|
+
| `visit_log` | `list[VisitLogEntry]` | Per-page URL, depth, found flag, confidence, reasoning |
|
|
264
|
+
| `best_candidate_url` | `str \| None` | Highest-confidence URL seen, even if below threshold |
|
|
265
|
+
| `budget_exhausted` | `bool` | True if `max_pages` or `max_depth` was hit before finding a result |
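
To make the relationship between `confidence_threshold`, `result_urls`, and `best_candidate_url` concrete, here is a minimal sketch of one way such bookkeeping could work. The `Tally` class is hypothetical and is not Charlotte's `CrawlResult`. It assumes a result is recorded when confidence is at or above the threshold, and that the best candidate is tracked for every evaluated page regardless of its `found` flag.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Tally:
    """Hypothetical accumulator for illustration, not Charlotte's CrawlResult."""

    threshold: float = 0.70
    result_urls: list[str] = field(default_factory=list)
    best_candidate_url: Optional[str] = None
    best_confidence: float = 0.0

    def record(self, url: str, found: bool, confidence: float) -> None:
        # Track the strongest candidate regardless of the threshold.
        if confidence > self.best_confidence:
            self.best_confidence = confidence
            self.best_candidate_url = url
        # Only confirmed, sufficiently confident pages become results.
        if found and confidence >= self.threshold:
            self.result_urls.append(url)


tally = Tally()
tally.record("https://example.com/a", found=True, confidence=0.50)
tally.record("https://example.com/b", found=True, confidence=0.65)
print(tally.result_urls, tally.best_candidate_url)
```

In this sketch neither page clears the 0.70 threshold, so no result is recorded, but the second URL is still kept as the best candidate for the caller to inspect.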
|
|
266
|
+
|
|
267
|
+
---
|
|
268
|
+
|
|
269
|
+
## `find_link()` — Parameters
|
|
270
|
+
|
|
271
|
+
`find_link()` is a thin wrapper around `crawl()` with two fixed differences: `max_results=None` (collect every match) and `return_content=False` (always). All other parameters are identical.
|
|
272
|
+
|
|
273
|
+
```python
find_link(
    start_url,  # str
    goal,       # str
    *,
    model=None,
    max_pages=20,
    max_depth=5,
    confidence_threshold=0.70,
    render_js=False,
    allowed_domains=None,
    navigation_hint=None,
    stream=None,
    respect_robots=None,
    connect_timeout=10.0,
    read_timeout=30.0,
    render_timeout=15.0,
    default_delay=1.0,
)
```
|
|
293
|
+
|
|
294
|
+
**Returns:**
|
|
295
|
+
- `AsyncGenerator[StreamEvent, None]` when `stream=True`
|
|
296
|
+
- `Coroutine[LinkResult]` when `stream=False` — use `await find_link(...)`
|
|
297
|
+
|
|
298
|
+
**`LinkResult` fields:**
|
|
299
|
+
|
|
300
|
+
| Field | Type | Description |
|
|
301
|
+
|---|---|---|
|
|
302
|
+
| `found` | `bool` | Whether at least one link was found |
|
|
303
|
+
| `urls` | `list[str]` | All matching URLs, in discovery order |
|
|
304
|
+
| `confidence` | `float` | Confidence of the best match |
|
|
305
|
+
| `pages_visited` | `int` | Total pages fetched |
|
|
306
|
+
| `best_candidate_url` | `str \| None` | Highest-confidence URL seen, even if below threshold |
|
|
307
|
+
| `budget_exhausted` | `bool` | True if the budget was exhausted before a match |
|
|
308
|
+
| `note` | `str \| None` | Human-readable explanation when `found=False` |
|
|
309
|
+
|
|
310
|
+
---
|
|
311
|
+
|
|
312
|
+
## Environment Variables
|
|
313
|
+
|
|
314
|
+
| Variable | Default | Effect |
|
|
315
|
+
|---|---|---|
|
|
316
|
+
| `CHARLOTTE_DEFAULT_ADAPTER` | `"groq"` | `"groq"` or `"local"` — adapter used when `model=None` |
|
|
317
|
+
| `CHARLOTTE_LOCAL_BASE_URL` | `"http://localhost:11434"` | Base URL for `LocalAdapter` |
|
|
318
|
+
| `CHARLOTTE_LOCAL_MODEL` | `"deepseek-r1:14b"` | Model name for `LocalAdapter` |
|
|
319
|
+
| `CHARLOTTE_STREAM` | `"true"` | `"true"` or `"false"` — default for `stream=None` |
|
|
320
|
+
| `CHARLOTTE_RESPECT_ROBOTS` | `"true"` | `"true"` or `"false"` — default for `respect_robots=None` |
|
|
321
|
+
| `GROQ_API_KEY` | *(none)* | Required when using `GroqAdapter` |
|
|
322
|
+
|
|
323
|
+
Direct `crawl()` / `find_link()` parameters always take precedence over env vars.
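
The precedence rule can be sketched as a small resolver: an explicit argument wins, then the environment variable, then the documented default. `resolve_flag` is a hypothetical helper for illustration, and the string parsing shown (only a case-insensitive `"true"` enables the flag) is an assumption rather than Charlotte's confirmed behaviour.

```python
import os
from typing import Optional


def resolve_flag(explicit: Optional[bool], env_name: str, default: bool = True) -> bool:
    """Return the explicit argument if given, else the env var, else the default."""
    if explicit is not None:
        return explicit
    raw = os.environ.get(env_name)
    if raw is None:
        return default
    # Assumption: only "true" (case-insensitive) enables the flag.
    return raw.strip().lower() == "true"


os.environ["CHARLOTTE_STREAM"] = "false"
print(resolve_flag(None, "CHARLOTTE_STREAM"))  # env var decides
print(resolve_flag(True, "CHARLOTTE_STREAM"))  # explicit argument wins
```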
|
|
324
|
+
|
|
325
|
+
---
|
|
326
|
+
|
|
327
|
+
## robots.txt Policy
|
|
328
|
+
|
|
329
|
+
Charlotte fetches and obeys `robots.txt` before visiting any page, unless `respect_robots=False`.
|
|
330
|
+
|
|
331
|
+
- **404** response → no restrictions; crawl proceeds normally
|
|
332
|
+
- **401 / 403** → no restrictions; crawl proceeds normally
|
|
333
|
+
- **5xx / timeout / parse error** → `RobotsError`; crawl does not start
|
|
334
|
+
- **`Disallow` rule matched** → `RobotsError`; affected page skipped; other pages continue
|
|
335
|
+
- **`Crawl-delay` directive** → honoured; the larger of the directive and `default_delay` is used
|
|
336
|
+
- **Cross-domain redirect** → each domain's `robots.txt` is checked independently; permissions never inherit across domain boundaries
|
|
337
|
+
- **User-agent matching** — `charlotte-crawler` first, then `*`
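
Several of the rules above (agent-specific matching before `*`, `Disallow` checks, and the `Crawl-delay` floor) can be tried out with Python's standard-library `urllib.robotparser`. This is only a sketch using the standard library as a stand-in; Charlotte's own `charlotte/core/robots.py` may be implemented differently. The `effective_delay` helper and the sample `robots.txt` are made up for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: charlotte-crawler
Disallow: /private/
Crawl-delay: 3

User-agent: *
Disallow: /
"""


def effective_delay(parser: RobotFileParser, user_agent: str, default_delay: float) -> float:
    """Use the larger of the Crawl-delay directive and the configured floor."""
    directive = parser.crawl_delay(user_agent)  # None when no directive applies
    return max(float(directive or 0), default_delay)


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "charlotte-crawler"
print(parser.can_fetch(agent, "https://example.com/docs"))       # agent-specific rules apply
print(parser.can_fetch(agent, "https://example.com/private/x"))  # matched by Disallow
print(effective_delay(parser, agent, default_delay=1.0))
```

Note that the standard library's `RobotFileParser.read()` treats a 401 or 403 response as "disallow everything", which differs from the policy listed above, so the status-code handling would need to be done separately.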
|
|
338
|
+
|
|
339
|
+
---
|
|
340
|
+
|
|
341
|
+
## Streaming Events
|
|
342
|
+
|
|
343
|
+
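When `stream=True`, Charlotte yields the event types below. `CrawlStarted` is emitted first and `CrawlComplete` is always last; the events in between repeat as pages are processed.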
|
|
344
|
+
|
|
345
|
+
| Event | Fields | When emitted |
|
|
346
|
+
|---|---|---|
|
|
347
|
+
| `CrawlStarted` | `url`, `goal` | Once, immediately |
|
|
348
|
+
| `PageFetched` | `url`, `status_code`, `depth`, `render_js` | After each successful fetch |
|
|
349
|
+
| `ModelDecision` | `url`, `found`, `confidence`, `reasoning`, `links_queued`, `links_available`, `links_suggested` | After each model evaluation |
|
|
350
|
+
| `ResultFound` | `url`, `confidence`, `result_index`, `answer` | When a result is confirmed |
|
|
351
|
+
| `PageSkipped` | `url`, `reason`, `error_type` | When a page is skipped (fetch error, schema failure, plausibility, robots) |
|
|
352
|
+
| `BudgetExhausted` | `url`, `reason` | When `max_pages` or `max_depth` is reached without a result |
|
|
353
|
+
| `CrawlComplete` | `found`, `result_count`, `pages_visited`, `elapsed_seconds` | Once, always last |
|
|
354
|
+
|
|
355
|
+
All events include `type: str` and `timestamp: float` (Unix time) fields.
|
|
356
|
+
|
|
357
|
+
Import event types directly from the package:
|
|
358
|
+
|
|
359
|
+
```python
from charlotte import (
    CrawlStarted, PageFetched, ModelDecision, ResultFound,
    PageSkipped, BudgetExhausted, CrawlComplete,
)
```
|
|
365
|
+
|
|
366
|
+
---
|
|
367
|
+
|
|
368
|
+
## Error Classes
|
|
369
|
+
|
|
370
|
+
All Charlotte exceptions inherit from `CharlotteError`. Third-party exceptions (`httpx`, `groq`, `playwright`) are caught at component boundaries and re-raised as one of these — they never reach the caller.
|
|
371
|
+
|
|
372
|
+
| Exception | Raised when |
|
|
373
|
+
|---|---|
|
|
374
|
+
| `CharlotteConfigError` | Invalid configuration — bad URL, missing API key, Playwright not installed, invalid parameter |
|
|
375
|
+
| `CharlotteNetworkError` | HTTP error response (4xx / 5xx) that is not retried |
|
|
376
|
+
| `CharlotteTimeoutError` | Connect, read, render, or model timeout |
|
|
377
|
+
| `CharlotteRedirectError` | Cross-domain redirect to a disallowed host |
|
|
378
|
+
| `RobotsError` | robots.txt blocks a URL or cannot be fetched |
|
|
379
|
+
| `AdapterOutputError` | Model returned malformed JSON or failed schema validation after retry |
|
|
380
|
+
| `CharlotteInternalError` | Unexpected engine-level state (should not occur; file a bug) |
|
|
381
|
+
|
|
382
|
+
`CharlotteConfigError` is raised eagerly — before any network I/O — when configuration is invalid. All others surface as `PageSkipped` events (stream mode) or as logged debug entries in the `visit_log` (non-stream mode). `crawl()` and `find_link()` never raise after the crawl has started.
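
The idea of catching third-party exceptions at a component boundary and re-raising them under one base class can be sketched as follows. The classes here are local stand-ins that reuse the documented names for readability; they are not imported from `charlotte`. The mapping from built-in exceptions to `CharlotteTimeoutError` and `CharlotteNetworkError` is an assumption for illustration, as is the `call_at_boundary` helper.

```python
class CharlotteError(Exception):
    """Local stand-in for the documented base class."""


class CharlotteTimeoutError(CharlotteError):
    """Local stand-in for the documented timeout error."""


class CharlotteNetworkError(CharlotteError):
    """Local stand-in for the documented network error."""


def call_at_boundary(operation):
    """Run a callable and translate low-level failures into library errors."""
    try:
        return operation()
    except TimeoutError as exc:  # must precede OSError, its parent class
        raise CharlotteTimeoutError(str(exc)) from exc
    except OSError as exc:
        raise CharlotteNetworkError(str(exc)) from exc


def flaky():
    raise TimeoutError("read timed out")


try:
    call_at_boundary(flaky)
except CharlotteError as exc:
    # The original exception stays reachable through __cause__.
    print(type(exc).__name__, "caused by", type(exc.__cause__).__name__)
```

Chaining with `raise ... from exc` keeps the original error available for debugging while callers only ever need to handle the single base class.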
|
|
383
|
+
|
|
384
|
+
---
|
|
385
|
+
|
|
386
|
+
## Specification
|
|
387
|
+
|
|
388
|
+
The full technical specification — adapter authoring guide, streaming events reference, security model, URL normalization rules — is at `docs/charlotte-spec-v1.4.md`.
|
|
389
|
+
|
|
390
|
+
---
|
|
391
|
+
|
|
392
|
+
## Licence
|
|
393
|
+
|
|
394
|
+
MIT — see `LICENSE`.
|