contextractor 0.4.1__tar.gz → 0.4.2__tar.gz
- {contextractor-0.4.1 → contextractor-0.4.2}/PKG-INFO +1 -1
- {contextractor-0.4.1 → contextractor-0.4.2}/SPEC.md +7 -7
- {contextractor-0.4.1 → contextractor-0.4.2}/pyproject.toml +1 -1
- {contextractor-0.4.1 → contextractor-0.4.2}/src/contextractor/_manifest.py +3 -1
- {contextractor-0.4.1 → contextractor-0.4.2}/src/contextractor/_options.py +9 -2
- {contextractor-0.4.1 → contextractor-0.4.2}/src/contextractor/_run.py +26 -5
- {contextractor-0.4.1 → contextractor-0.4.2}/.gitignore +0 -0
- {contextractor-0.4.1 → contextractor-0.4.2}/README.md +0 -0
- {contextractor-0.4.1 → contextractor-0.4.2}/hatch_build.py +0 -0
- {contextractor-0.4.1 → contextractor-0.4.2}/scripts/stage_vendor.py +0 -0
- {contextractor-0.4.1 → contextractor-0.4.2}/src/contextractor/__init__.py +0 -0
- {contextractor-0.4.1 → contextractor-0.4.2}/src/contextractor/__main__.py +0 -0
- {contextractor-0.4.1 → contextractor-0.4.2}/src/contextractor/_errors.py +0 -0
- {contextractor-0.4.1 → contextractor-0.4.2}/src/contextractor/_install.py +0 -0
- {contextractor-0.4.1 → contextractor-0.4.2}/src/contextractor/_redact.py +0 -0
- {contextractor-0.4.1 → contextractor-0.4.2}/src/contextractor/_runtime.py +0 -0
- {contextractor-0.4.1 → contextractor-0.4.2}/src/contextractor/_vendor/__init__.py +0 -0
- {contextractor-0.4.1 → contextractor-0.4.2}/src/contextractor/py.typed +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: contextractor
-Version: 0.4.
+Version: 0.4.2
 Summary: Drive the Contextractor Node crawler/extractor from Python — clean main-content text in txt, markdown, json, or html.
 Project-URL: Homepage, https://apify.com/glueo/contextractor
 Project-URL: Repository, https://github.com/contextractor/contextractor
@@ -48,8 +48,8 @@ reads exactly what extract wrote — no bucket names are threaded.
 - export `0` → read manifest; non-zero → raise.
 - Both runners (`_run_sync` / `_run_async`) capture raw bytes and decode stdout/stderr as UTF-8 with `errors="replace"` — never the locale codec (Windows cp1252/cp932 mojibake) and never universal-newline translation (which would corrupt `original` raw HTML).
 - Playwright "Executable doesn't exist" in stderr → `MissingBrowserError` pointing at `python -m contextractor install`.
-- Child stderr is redacted
-- A `timeout` (sync or async) raises `ContextractorError("contextractor timed out")` — never the raw `subprocess.TimeoutExpired`, whose `cmd` would leak the `--proxy` argv.
+- Child stderr is redacted before being surfaced — proxy URLs, header values (e.g. `Authorization` tokens), and cookie values (each ≥ 4 chars) are all registered as secrets; argv is never echoed when it carries a proxy.
+- A `timeout` (sync or async) raises `ContextractorError("contextractor timed out")` — never the raw `subprocess.TimeoutExpired`, whose `cmd` would leak the `--proxy` argv. Cancelling the surrounding task in the async path also kills and reaps the child — it is never orphaned.

 ## Single-page orchestration (`extract_one`)

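The redaction bullet in the hunk above can be illustrated with a minimal sketch. The `redact` helper and the `***` placeholder are hypothetical: `_redact.py` is unchanged in this release and its contents are not shown in this diff. The sketch only assumes that registered secrets are replaced literally in the decoded stderr text.

```python
def redact(text: str, secrets: tuple) -> str:
    """Replace every registered secret in ``text`` with a fixed placeholder.

    Hypothetical helper for illustration, not the package's own code.
    """
    # Replace longer secrets first so a secret that contains another one
    # is never left partially visible.
    for secret in sorted(set(secrets), key=len, reverse=True):
        if secret:
            text = text.replace(secret, "***")
    return text


raw_stderr = b"connect failed via http://user:pw@proxy.example:8080 (Bearer abc123)"
# Decode as UTF-8 with errors="replace", as the spec requires for both runners.
decoded = raw_stderr.decode("utf-8", errors="replace")
redacted = redact(decoded, ("http://user:pw@proxy.example:8080", "Bearer abc123"))
print(redacted)
```

Literal replacement is also why the spec skips values shorter than 4 characters: replacing a one- or two-character string everywhere would mangle the error text.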
@@ -70,9 +70,9 @@ A single data-driven table, `OPTION_SPECS`, applied immediately before spawn (pe
 `OPTION_SPECS` keys (enforced by `tests/test_options.py`). Categories:

 - **scalar** → `--flag <value>` (e.g. `max_crawl_depth`, `mode`, `start_urls_file`, …).
-- **bool-pair** → `--flag` / `--no-flag`: `headless`, `block_media`, `images`.
+- **bool-pair** → `--flag` / `--no-flag`: `headless`, `block_media`, `images`, `close_cookie_modals`.
 - **negation-only** (default include; `False` emits the `--no-` flag): `links`, `comments`, `tables`.
-- **bare-switch** (`True` emits the flag): `purge`, `ignore_cors_and_csp`, `
+- **bare-switch** (`True` emits the flag): `purge`, `ignore_cors_and_csp`, `ignore_https_errors`, `keep_url_fragment`, `use_sitemaps`, `respect_robots_txt`, `store_skipped_urls`, `verbose` (`-v`).
 - **repeatable** (one flag per item): `proxy`, `globs`, `exclude`, `save` (`format-destination` tokens, e.g. `markdown-kvs`).
 - **json** (`--flag <json.dumps>`): `cookies`, `headers`.

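The practical effect of moving `close_cookie_modals` from bare-switch to bool-pair is that `False` now produces an explicit `--no-` flag. The sketch below is a simplified stand-in, not the package's `_Spec` / `_Kind` definitions (which this diff does not show); `Spec`, `Kind`, and `emit_flags` are hypothetical names, and treating `None` as "emit nothing" is an assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Kind(Enum):
    BOOL_PAIR = auto()
    NEGATION_ONLY = auto()
    BARE_SWITCH = auto()


@dataclass(frozen=True)
class Spec:
    flag: str
    kind: Kind
    negated: str = ""  # only meaningful for BOOL_PAIR


def emit_flags(spec: Spec, value: Optional[bool]) -> list:
    """Translate one boolean option into CLI flags (illustrative only)."""
    if value is None:
        # Assumption: an unset option emits nothing and leaves the CLI default.
        return []
    if spec.kind is Kind.BOOL_PAIR:
        # True emits the flag, False emits its --no- counterpart.
        return [spec.flag] if value else [spec.negated]
    if spec.kind is Kind.NEGATION_ONLY:
        # The stored flag is already the --no- form; only False emits it.
        return [] if value else [spec.flag]
    # BARE_SWITCH: only True emits anything.
    return [spec.flag] if value else []


old = Spec("--close-cookie-modals", Kind.BARE_SWITCH)
new = Spec("--close-cookie-modals", Kind.BOOL_PAIR, "--no-close-cookie-modals")
print(emit_flags(old, False))
print(emit_flags(new, False))
```

Under the bare-switch model a caller passing `False` could not override a CLI default of "on"; under the bool-pair model the negated flag is emitted.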
@@ -119,11 +119,11 @@ browsers are never bundled (`python -m contextractor install`).

 ## Packaging & distribution

-- Backend: hatchling + `hatch_build.py` (`pure_python=False
-- `version` is static in `pyproject.toml` (the `/git:release` and `/publish:
+- Backend: hatchling + `hatch_build.py` (`pure_python=False`; explicit `tag = py3-none-{platform}`, pinned via `CONTEXTRACTOR_WHEEL_PLATFORM` in CI, inferred locally) → `py3-none-{platform}` wheels. Forbid maturin / scikit-build-core / uv_build.
+- `version` is static in `pyproject.toml` (the `/git:release` and `/publish:pypi` bump target); `__version__` is read from installed metadata.
 - `readme = "README.md"` → the PyPI project page (per `.claude/rules/user-facing-docs.md`); included in the sdist.
 - Wheel matrix: `macosx_*_arm64`, `macosx_*_x86_64`, `manylinux_2_28_x86_64`, `manylinux_2_28_aarch64`, `win_amd64`, plus an sdist. **musl is unsupported** — the napi loader throws a clear import error rather than ship a broken `.node`.
-- CI: `.github/workflows/release-pypi.yml` (cibuildwheel; `CIBW_BEFORE_ALL` stages `_vendor`; auditwheel/delocate repair disabled — there is no ELF Python extension; publish via PyPI Trusted Publishing / OIDC). It is sequenced **after** the napi-refresh PR opened by `build-napi.yml` for a `v*` tag, so wheels bundle current `.node` files — the gate is encoded in `/publish:
+- CI: `.github/workflows/release-pypi.yml` (cibuildwheel; `CIBW_BEFORE_ALL` stages `_vendor`; auditwheel/delocate repair disabled — there is no ELF Python extension; publish via PyPI Trusted Publishing / OIDC). It is sequenced **after** the napi-refresh PR opened by `build-napi.yml` for a `v*` tag, so wheels bundle current `.node` files — the gate is encoded in `/publish:pypi`.

 ## Tests

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

 [project]
 name = "contextractor"
-version = "0.4.
+version = "0.4.2"
 description = "Drive the Contextractor Node crawler/extractor from Python — clean main-content text in txt, markdown, json, or html."
 readme = "README.md"
 requires-python = ">=3.12"
@@ -28,10 +28,12 @@ def read_summary(manifest_path: Path, output_dir: Path) -> ExtractSummary:
     """Read and tally the manifest at ``manifest_path``."""
     try:
         raw = manifest_path.read_text(encoding="utf-8")
-    except
+    except FileNotFoundError as exc:
         raise ContextractorError(
             f"manifest not found at {manifest_path} — the export step wrote nothing"
         ) from exc
+    except OSError as exc:
+        raise ContextractorError(f"could not read manifest at {manifest_path}") from exc
     try:
         records = json.loads(raw)
     except json.JSONDecodeError as exc:
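The split between `FileNotFoundError` and the broader `OSError` matters because reads can fail for reasons other than a missing file. The following sketch shows the two branches producing different messages; `ContextractorError` here is a local stand-in (assumed to be an ordinary `Exception` subclass) and `read_manifest_text` is a hypothetical name, not the package's `read_summary`.

```python
import tempfile
from pathlib import Path


class ContextractorError(Exception):
    """Local stand-in for the package's base error."""


def read_manifest_text(manifest_path: Path) -> str:
    try:
        return manifest_path.read_text(encoding="utf-8")
    except FileNotFoundError as exc:
        # Must come first: FileNotFoundError is itself a subclass of OSError.
        raise ContextractorError(f"manifest not found at {manifest_path}") from exc
    except OSError as exc:
        # Any other read failure, e.g. the path is a directory or unreadable.
        raise ContextractorError(f"could not read manifest at {manifest_path}") from exc


messages = []
with tempfile.TemporaryDirectory() as tmp:
    # First path does not exist; second path exists but is a directory.
    for path in (Path(tmp) / "missing.json", Path(tmp)):
        try:
            read_manifest_text(path)
        except ContextractorError as exc:
            messages.append(str(exc).split(" at ")[0])
print(messages)
```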
@@ -73,6 +73,9 @@ OPTION_SPECS: dict[str, _Spec] = {
     "headless": _Spec("--headless", _Kind.BOOL_PAIR, "--no-headless"),
     "block_media": _Spec("--block-media", _Kind.BOOL_PAIR, "--no-block-media"),
     "images": _Spec("--images", _Kind.BOOL_PAIR, "--no-images"),
+    "close_cookie_modals": _Spec(
+        "--close-cookie-modals", _Kind.BOOL_PAIR, "--no-close-cookie-modals"
+    ),
     # --- negation-only (default include; False emits the --no- flag) ---------
     "links": _Spec("--no-links", _Kind.NEGATION_ONLY),
     "comments": _Spec("--no-comments", _Kind.NEGATION_ONLY),
@@ -80,7 +83,6 @@ OPTION_SPECS: dict[str, _Spec] = {
     # --- bare switches ------------------------------------------------------
     "purge": _Spec("--purge", _Kind.BARE_SWITCH),
     "ignore_cors_and_csp": _Spec("--ignore-cors-and-csp", _Kind.BARE_SWITCH),
-    "close_cookie_modals": _Spec("--close-cookie-modals", _Kind.BARE_SWITCH),
     "ignore_https_errors": _Spec("--ignore-https-errors", _Kind.BARE_SWITCH),
     "keep_url_fragment": _Spec("--keep-url-fragment", _Kind.BARE_SWITCH),
     "use_sitemaps": _Spec("--use-sitemaps", _Kind.BARE_SWITCH),
@@ -218,7 +220,12 @@ def validate_proxies(proxies: Sequence[str]) -> None:
     The raw URL (which carries credentials) is never echoed in the error.
     """
     for raw in proxies:
-
+        try:
+            scheme = urlsplit(raw).scheme.lower()
+        except ValueError:
+            # urlsplit rejects e.g. malformed IPv6 brackets; stay inside the
+            # ContextractorError hierarchy and never echo the raw URL.
+            raise ProxySchemeError("malformed proxy URL") from None
         if scheme not in _ALLOWED_PROXY_SCHEMES:
             raise ProxySchemeError(
                 f"unsupported proxy scheme {scheme or '(none)'!r}; "
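The new `try`/`except` exists because `urllib.parse.urlsplit` can raise `ValueError` instead of returning a result, for example when an IPv6 bracket is left unclosed. A small sketch of that behaviour follows; `scheme_or_none` is a hypothetical helper and the proxy URLs are made up.

```python
from typing import Optional
from urllib.parse import urlsplit


def scheme_or_none(raw: str) -> Optional[str]:
    """Return the lower-cased scheme, or None when urlsplit rejects the URL."""
    try:
        return urlsplit(raw).scheme.lower()
    except ValueError:
        # Deliberately do not include ``raw`` in any message: it may carry
        # credentials.
        return None


print(scheme_or_none("HTTP://user:pw@proxy.example:8080"))
print(scheme_or_none("http://[::1"))  # unclosed IPv6 bracket
```

Without the guard, the second call would surface a bare `ValueError` outside the package's own error hierarchy, which is the case the hunk closes off.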
@@ -53,6 +53,24 @@ def _normalize_urls(urls: list[str] | str) -> list[str]:
     return [urls] if isinstance(urls, str) else list(urls)


+def _collect_secrets(opts: dict[str, Any]) -> tuple[str, ...]:
+    """Everything the child could echo back in stderr: proxy URLs, header
+    values (e.g. Authorization tokens), and cookie values. Values shorter than
+    4 chars are skipped — they are not meaningful secrets and replacing them
+    literally would mangle the error detail."""
+    headers = opts.get("headers") or {}
+    cookies = opts.get("cookies") or []
+    return (
+        *(opts.get("proxy") or ()),
+        *(v for v in headers.values() if isinstance(v, str) and len(v) >= 4),
+        *(
+            c["value"]
+            for c in cookies
+            if isinstance(c, dict) and isinstance(c.get("value"), str) and len(c["value"]) >= 4
+        ),
+    )
+
+
 def _build_plan(
     urls: list[str] | str,
     output_dir: str | None,
@@ -76,14 +94,13 @@ def _build_plan(
         else Path(tempfile.mkdtemp(prefix="contextractor-storage-"))
     )

-    secrets = tuple(opts.get("proxy") or ())
     return _Plan(
         urls=url_list,
         extract_flags=extract_flags,
         output_dir=out_dir,
         storage_dir=store_dir,
         owns_storage=owns_storage,
-        secrets=
+        secrets=_collect_secrets(opts),
     )


@@ -217,9 +234,13 @@ async def _run_async(argv: list[str], timeout: float | None) -> tuple[int | None
         async with asyncio.timeout(timeout):
             out, err = await proc.communicate()
     except TimeoutError:
-        proc.kill()
-        await proc.wait()
         raise ContextractorError("contextractor timed out") from None
+    finally:
+        # Reap the child whenever communicate() did not finish — timeout or
+        # cancellation of the surrounding task — so the Node crawler is never orphaned.
+        if proc.returncode is None:
+            proc.kill()
+            await proc.wait()
     return proc.returncode, out.decode(errors="replace"), err.decode(errors="replace")


@@ -272,7 +293,7 @@ def _build_one_plan(
         url=url,
         formats=_normalize_formats(formats),
         flags=flags,
-        secrets=
+        secrets=_collect_secrets(opts),
     )


File without changes