contextractor 0.4.1__tar.gz → 0.4.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: contextractor
- Version: 0.4.1
+ Version: 0.4.2
  Summary: Drive the Contextractor Node crawler/extractor from Python — clean main-content text in txt, markdown, json, or html.
  Project-URL: Homepage, https://apify.com/glueo/contextractor
  Project-URL: Repository, https://github.com/contextractor/contextractor
@@ -48,8 +48,8 @@ reads exactly what extract wrote — no bucket names are threaded.
  - export `0` → read manifest; non-zero → raise.
  - Both runners (`_run_sync` / `_run_async`) capture raw bytes and decode stdout/stderr as UTF-8 with `errors="replace"` — never the locale codec (Windows cp1252/cp932 mojibake) and never universal-newline translation (which would corrupt `original` raw HTML).
  - Playwright "Executable doesn't exist" in stderr → `MissingBrowserError` pointing at `python -m contextractor install`.
- - Child stderr is redacted (proxy credentials masked) before being surfaced; argv is never echoed when it carries a proxy.
- - A `timeout` (sync or async) raises `ContextractorError("contextractor timed out")` — never the raw `subprocess.TimeoutExpired`, whose `cmd` would leak the `--proxy` argv.
+ - Child stderr is redacted before being surfaced — proxy URLs, header values (e.g. `Authorization` tokens), and cookie values (each ≥ 4 chars) are all registered as secrets; argv is never echoed when it carries a proxy.
+ - A `timeout` (sync or async) raises `ContextractorError("contextractor timed out")` — never the raw `subprocess.TimeoutExpired`, whose `cmd` would leak the `--proxy` argv. Cancelling the surrounding task in the async path also kills and reaps the child — it is never orphaned.

  ## Single-page orchestration (`extract_one`)

@@ -70,9 +70,9 @@ A single data-driven table, `OPTION_SPECS`, applied immediately before spawn (pe
  `OPTION_SPECS` keys (enforced by `tests/test_options.py`). Categories:

  - **scalar** → `--flag <value>` (e.g. `max_crawl_depth`, `mode`, `start_urls_file`, …).
- - **bool-pair** → `--flag` / `--no-flag`: `headless`, `block_media`, `images`.
+ - **bool-pair** → `--flag` / `--no-flag`: `headless`, `block_media`, `images`, `close_cookie_modals`.
  - **negation-only** (default include; `False` emits the `--no-` flag): `links`, `comments`, `tables`.
- - **bare-switch** (`True` emits the flag): `purge`, `ignore_cors_and_csp`, `close_cookie_modals`, `ignore_https_errors`, `keep_url_fragment`, `use_sitemaps`, `respect_robots_txt`, `store_skipped_urls`, `verbose` (`-v`).
+ - **bare-switch** (`True` emits the flag): `purge`, `ignore_cors_and_csp`, `ignore_https_errors`, `keep_url_fragment`, `use_sitemaps`, `respect_robots_txt`, `store_skipped_urls`, `verbose` (`-v`).
  - **repeatable** (one flag per item): `proxy`, `globs`, `exclude`, `save` (`format-destination` tokens, e.g. `markdown-kvs`).
  - **json** (`--flag <json.dumps>`): `cookies`, `headers`.

@@ -119,11 +119,11 @@ browsers are never bundled (`python -m contextractor install`).

  ## Packaging & distribution

- - Backend: hatchling + `hatch_build.py` (`pure_python=False`, `infer_tag=True`) → `py3-none-{platform}` wheels. Forbid maturin / scikit-build-core / uv_build.
- - `version` is static in `pyproject.toml` (the `/git:release` and `/publish:all` bump target); `__version__` is read from installed metadata.
+ - Backend: hatchling + `hatch_build.py` (`pure_python=False`; explicit `tag = py3-none-{platform}`, pinned via `CONTEXTRACTOR_WHEEL_PLATFORM` in CI, inferred locally) → `py3-none-{platform}` wheels. Forbid maturin / scikit-build-core / uv_build.
+ - `version` is static in `pyproject.toml` (the `/git:release` and `/publish:pypi` bump target); `__version__` is read from installed metadata.
  - `readme = "README.md"` → the PyPI project page (per `.claude/rules/user-facing-docs.md`); included in the sdist.
  - Wheel matrix: `macosx_*_arm64`, `macosx_*_x86_64`, `manylinux_2_28_x86_64`, `manylinux_2_28_aarch64`, `win_amd64`, plus an sdist. **musl is unsupported** — the napi loader throws a clear import error rather than ship a broken `.node`.
- - CI: `.github/workflows/release-pypi.yml` (cibuildwheel; `CIBW_BEFORE_ALL` stages `_vendor`; auditwheel/delocate repair disabled — there is no ELF Python extension; publish via PyPI Trusted Publishing / OIDC). It is sequenced **after** the napi-refresh PR opened by `build-napi.yml` for a `v*` tag, so wheels bundle current `.node` files — the gate is encoded in `/publish:all`.
+ - CI: `.github/workflows/release-pypi.yml` (cibuildwheel; `CIBW_BEFORE_ALL` stages `_vendor`; auditwheel/delocate repair disabled — there is no ELF Python extension; publish via PyPI Trusted Publishing / OIDC). It is sequenced **after** the napi-refresh PR opened by `build-napi.yml` for a `v*` tag, so wheels bundle current `.node` files — the gate is encoded in `/publish:pypi`.

  ## Tests

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

  [project]
  name = "contextractor"
- version = "0.4.1"
+ version = "0.4.2"
  description = "Drive the Contextractor Node crawler/extractor from Python — clean main-content text in txt, markdown, json, or html."
  readme = "README.md"
  requires-python = ">=3.12"
@@ -28,10 +28,12 @@ def read_summary(manifest_path: Path, output_dir: Path) -> ExtractSummary:
      """Read and tally the manifest at ``manifest_path``."""
      try:
          raw = manifest_path.read_text(encoding="utf-8")
-     except OSError as exc:
+     except FileNotFoundError as exc:
          raise ContextractorError(
              f"manifest not found at {manifest_path} — the export step wrote nothing"
          ) from exc
+     except OSError as exc:
+         raise ContextractorError(f"could not read manifest at {manifest_path}") from exc
      try:
          records = json.loads(raw)
      except json.JSONDecodeError as exc:
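The split between a missing manifest and an unreadable one relies on `FileNotFoundError` being a subclass of `OSError`, so the narrower handler has to come first. A minimal sketch of that ordering, using a hypothetical `describe_read_failure` helper rather than the package's `read_summary`:

```python
import tempfile
from pathlib import Path


def describe_read_failure(path: Path) -> str:
    """Classify why a text file could not be read (illustrative helper)."""
    try:
        path.read_text(encoding="utf-8")
    except FileNotFoundError:
        # Must precede OSError: FileNotFoundError is one of its subclasses.
        return "not found"
    except OSError:
        # Everything else: permission problems, the path being a directory, etc.
        return "unreadable"
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    print(describe_read_failure(root / "missing.json"))  # not found
    print(describe_read_failure(root))                   # unreadable (a directory)
```

Before this change, both cases would have been reported as "manifest not found", which is misleading when the file exists but cannot be read.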
@@ -73,6 +73,9 @@ OPTION_SPECS: dict[str, _Spec] = {
      "headless": _Spec("--headless", _Kind.BOOL_PAIR, "--no-headless"),
      "block_media": _Spec("--block-media", _Kind.BOOL_PAIR, "--no-block-media"),
      "images": _Spec("--images", _Kind.BOOL_PAIR, "--no-images"),
+     "close_cookie_modals": _Spec(
+         "--close-cookie-modals", _Kind.BOOL_PAIR, "--no-close-cookie-modals"
+     ),
      # --- negation-only (default include; False emits the --no- flag) ---------
      "links": _Spec("--no-links", _Kind.NEGATION_ONLY),
      "comments": _Spec("--no-comments", _Kind.NEGATION_ONLY),
@@ -80,7 +83,6 @@ OPTION_SPECS: dict[str, _Spec] = {
      # --- bare switches ------------------------------------------------------
      "purge": _Spec("--purge", _Kind.BARE_SWITCH),
      "ignore_cors_and_csp": _Spec("--ignore-cors-and-csp", _Kind.BARE_SWITCH),
-     "close_cookie_modals": _Spec("--close-cookie-modals", _Kind.BARE_SWITCH),
      "ignore_https_errors": _Spec("--ignore-https-errors", _Kind.BARE_SWITCH),
      "keep_url_fragment": _Spec("--keep-url-fragment", _Kind.BARE_SWITCH),
      "use_sitemaps": _Spec("--use-sitemaps", _Kind.BARE_SWITCH),
@@ -218,7 +220,12 @@ def validate_proxies(proxies: Sequence[str]) -> None:
      The raw URL (which carries credentials) is never echoed in the error.
      """
      for raw in proxies:
-         scheme = urlsplit(raw).scheme.lower()
+         try:
+             scheme = urlsplit(raw).scheme.lower()
+         except ValueError:
+             # urlsplit rejects e.g. malformed IPv6 brackets; stay inside the
+             # ContextractorError hierarchy and never echo the raw URL.
+             raise ProxySchemeError("malformed proxy URL") from None
          if scheme not in _ALLOWED_PROXY_SCHEMES:
              raise ProxySchemeError(
                  f"unsupported proxy scheme {scheme or '(none)'!r}; "
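The failure mode guarded against here can be reproduced with the standard library alone: `urlsplit` raises `ValueError` when a netloc has an unbalanced IPv6 bracket. The sketch below uses a hypothetical `ProxyError` class and an assumed `ALLOWED` set, since neither `ProxySchemeError` nor the contents of `_ALLOWED_PROXY_SCHEMES` appear in this diff.

```python
from urllib.parse import urlsplit

# Assumed for illustration; the package's real allow-list is not shown in the diff.
ALLOWED = {"http", "https", "socks5"}


class ProxyError(Exception):
    """Stand-in for the package's ProxySchemeError (hypothetical)."""


def check_proxy(raw: str) -> str:
    """Return the lowercased scheme, or raise without echoing the raw URL."""
    try:
        scheme = urlsplit(raw).scheme.lower()
    except ValueError:
        # `from None` suppresses exception chaining, so the original
        # urlsplit error is not displayed in the traceback.
        raise ProxyError("malformed proxy URL") from None
    if scheme not in ALLOWED:
        raise ProxyError(f"unsupported proxy scheme {scheme or '(none)'!r}")
    return scheme


print(check_proxy("HTTP://user:pw@host:8080"))  # http

try:
    check_proxy("http://user:pw@[::1:8080")  # '[' without a closing ']'
except ProxyError as exc:
    print(exc)  # malformed proxy URL
```

Neither error message contains the credentials, which matches the docstring's stated requirement that the raw URL is never echoed.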
@@ -53,6 +53,24 @@ def _normalize_urls(urls: list[str] | str) -> list[str]:
      return [urls] if isinstance(urls, str) else list(urls)


+ def _collect_secrets(opts: dict[str, Any]) -> tuple[str, ...]:
+     """Everything the child could echo back in stderr: proxy URLs, header
+     values (e.g. Authorization tokens), and cookie values. Values shorter than
+     4 chars are skipped — they are not meaningful secrets and replacing them
+     literally would mangle the error detail."""
+     headers = opts.get("headers") or {}
+     cookies = opts.get("cookies") or []
+     return (
+         *(opts.get("proxy") or ()),
+         *(v for v in headers.values() if isinstance(v, str) and len(v) >= 4),
+         *(
+             c["value"]
+             for c in cookies
+             if isinstance(c, dict) and isinstance(c.get("value"), str) and len(c["value"]) >= 4
+         ),
+     )
+
+
  def _build_plan(
      urls: list[str] | str,
      output_dir: str | None,
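How the tuple returned by `_collect_secrets` might be applied to child stderr can be sketched with plain literal replacement. The `redact` helper, the `***` mask, and the longest-first ordering are assumptions for illustration; the package's actual redaction routine is not part of this diff.

```python
def redact(text: str, secrets: tuple[str, ...], mask: str = "***") -> str:
    """Replace every literal occurrence of each secret with a mask.

    Hypothetical helper, not taken from the package source.
    """
    # Longest first, so a secret that contains a shorter one is masked whole.
    for secret in sorted(set(secrets), key=len, reverse=True):
        if secret:
            text = text.replace(secret, mask)
    return text


stderr = "proxy http://user:pw@host:8080 refused; header Bearer abc123 rejected"
secrets = ("http://user:pw@host:8080", "Bearer abc123")
print(redact(stderr, secrets))  # proxy *** refused; header *** rejected
```

Literal replacement also shows why very short values are skipped upstream: masking a two-character cookie value would rewrite unrelated parts of the error text.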
@@ -76,14 +94,13 @@ def _build_plan(
          else Path(tempfile.mkdtemp(prefix="contextractor-storage-"))
      )

-     secrets = tuple(opts.get("proxy") or ())
      return _Plan(
          urls=url_list,
          extract_flags=extract_flags,
          output_dir=out_dir,
          storage_dir=store_dir,
          owns_storage=owns_storage,
-         secrets=secrets,
+         secrets=_collect_secrets(opts),
      )


@@ -217,9 +234,13 @@ async def _run_async(argv: list[str], timeout: float | None) -> tuple[int | None
          async with asyncio.timeout(timeout):
              out, err = await proc.communicate()
      except TimeoutError:
-         proc.kill()
-         await proc.wait()
          raise ContextractorError("contextractor timed out") from None
+     finally:
+         # Reap the child whenever communicate() did not finish — timeout or
+         # cancellation of the surrounding task — so the Node crawler is never orphaned.
+         if proc.returncode is None:
+             proc.kill()
+             await proc.wait()
      return proc.returncode, out.decode(errors="replace"), err.decode(errors="replace")


@@ -272,7 +293,7 @@ def _build_one_plan(
          url=url,
          formats=_normalize_formats(formats),
          flags=flags,
-         secrets=tuple(opts.get("proxy") or ()),
+         secrets=_collect_secrets(opts),
      )


File without changes
File without changes