@lde/distribution-probe 0.1.13 → 0.2.1

package/README.md CHANGED
@@ -24,14 +24,48 @@ Sends `POST` with the configured query (default `SELECT * { ?s ?p ?o } LIMIT 1`)
24
24
 
25
25
  ### Data dumps
26
26
 
27
- Sends `HEAD` with `Accept: <distribution.mimeType>` and `Accept-Encoding: identity`. If `Content-Length` is missing or ≤ 10 KB, retries with `GET` to validate the body – this also catches servers that return `0` from `HEAD`.
27
+ #### Reachability (the default)
28
+
29
+ Sends `HEAD` with `Accept: <distribution.mimeType>` and `Accept-Encoding: identity`. A successful `HEAD` settles reachability and gathers metadata (`Content-Length`, `Last-Modified`) **without reading the body**. If `HEAD` is unsuccessful — e.g. a server that returns `405`/`501` because it does not implement `HEAD` — the probe falls back to a body-less `GET` to confirm the endpoint is up. The body is never downloaded.
30
+
31
+ This is deliberately cheap: reading a body forces a slow, generate-on-the-fly endpoint (a TriplyDB dump, a SPARQL `CONSTRUCT` export) to start producing its export, which a `HEAD` does not.
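The reachability flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: `checkReachable`, `FetchLike`, and the injectable `fetchFn` parameter are hypothetical names introduced here, and the 2xx/3xx success rule mirrors the `isHttpSuccess` helper that appears in `probe.js`.

```ts
// Hypothetical sketch: HEAD first, GET only when HEAD is unsuccessful.
type FetchLike = (url: string, init: RequestInit) => Promise<Response>;

async function checkReachable(
  url: string,
  mimeType: string,
  fetchFn: FetchLike = fetch, // injectable so the flow can be exercised offline
): Promise<{ status: number; method: 'HEAD' | 'GET' }> {
  const headers = { Accept: mimeType, 'Accept-Encoding': 'identity' };

  // A successful HEAD settles reachability without touching the body.
  const head = await fetchFn(url, { method: 'HEAD', headers });
  if (head.status >= 200 && head.status < 400) {
    return { status: head.status, method: 'HEAD' };
  }

  // HEAD was rejected (e.g. 405/501): confirm with a GET, then cancel the
  // body immediately so nothing is downloaded.
  const get = await fetchFn(url, { method: 'GET', headers });
  await get.body?.cancel();
  return { status: get.status, method: 'GET' };
}
```

Cancelling the body right after the headers arrive is what keeps the fallback cheap: the status line is all the check needs.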
28
32
 
29
33
  - **Content-Type is checked as a soft warning, not a hard failure.** If the server’s Content-Type disagrees with the distribution’s declared `mimeType`, a message is appended to `result.warnings` but `isSuccess()` stays `true`. Compression wrappers (`application/gzip`, `application/x-gzip`, `application/octet-stream`) are skipped so a gzipped Turtle file doesn’t trigger a warning.
30
- - **Body is parse-validated only for Turtle, N-Triples, and N-Quads** (Content-Type starting with `text/turtle`, `application/n-triples`, or `application/n-quads`). Empty bodies and parse errors fail the probe. Other RDF serializations (RDF/XML, JSON-LD, TriG, …) are not parse-validated – only HTTP status and headers are checked.
31
- - Bodies larger than 10 KB are not fetched; only `HEAD` metadata is inspected.
34
+
35
+ #### Content validation (opt-in)
36
+
37
+ Set `validateRdfContent: true` to additionally confirm that a dump actually carries RDF. It applies only to distributions whose **declared** `mimeType` is an RDF serialization (`text/turtle`, `application/n-triples`, `application/n-quads`, `application/trig`, `text/n3`, `application/ld+json`, `application/rdf+xml`); non-RDF and undeclared-type distributions stay reachability-only.
38
+
39
+ When on, the probe `GET`s the dump — **regardless of size** — and reads only a **bounded prefix** (256 KiB), never the whole body:
40
+
41
+ - It settles on the **first triple** and stops, so a large dump is validated from its opening chunk. The line/statement-oriented serializations and RDF/XML stream a triple out of the prefix; **JSON-LD is not streamable** (its parser needs the whole document), so a JSON-LD dump is only validated when it fits the prefix in full — a larger one is reported reachable but unvalidated.
42
+ - A gzip body that `fetch` did not decompress (a `.gz` dump, or one served with a non-standard `Content-Encoding`) is inflated in-place; a gzip that will not inflate when the **complete** compressed body was read fails as `Distribution is not valid gzip`.
43
+ - Empty bodies (`Distribution is empty`) and bodies that parse to **zero** triples (`Distribution contains no RDF triples`) fail the probe. A deliberately truncated prefix is never mistaken for either — it is inconclusive.
44
+ - **Reachability is settled by the response, so validation never turns a reachable dump into a failure.** If no triple surfaces within `rdfValidationBudgetMs` (default `min(timeoutMs, 2000)`, clamped to `timeoutMs`), the read is aborted and the distribution is reported reachable but unvalidated (no `failureReason`). This bounds the extra latency content validation adds on slow, generate-on-the-fly endpoints.
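As a rough sketch of how the validation budget above resolves, assuming the defaulting in `resolveOptions` and the clamp in `validateDumpBody` from `probe.js` combine as shown. The function name `resolveValidationBudget` is hypothetical; the two constants carry the values declared in `probe.js`.

```ts
const DEFAULT_TIMEOUT_MS = 5000;
const DEFAULT_RDF_VALIDATION_BUDGET_MS = 2000;

// Hypothetical helper: default to min(timeoutMs, 2000), then clamp to timeoutMs.
function resolveValidationBudget(
  options: { timeoutMs?: number; rdfValidationBudgetMs?: number } = {},
): number {
  const timeoutMs = options.timeoutMs ?? DEFAULT_TIMEOUT_MS;
  const requested =
    options.rdfValidationBudgetMs ??
    Math.min(timeoutMs, DEFAULT_RDF_VALIDATION_BUDGET_MS);
  // A budget longer than the request timeout is meaningless, so clamp it.
  return Math.min(requested, timeoutMs);
}

console.log(resolveValidationBudget()); // 2000
console.log(resolveValidationBudget({ timeoutMs: 1000 })); // 1000
console.log(resolveValidationBudget({ rdfValidationBudgetMs: 8000 })); // 5000
```

In other words, raising `rdfValidationBudgetMs` above `timeoutMs` has no effect; to give slow endpoints longer to surface a triple, both values need to be raised.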
32
45
 
33
46
  ### Network errors
34
47
 
35
48
  A thrown exception from `fetch` (DNS failure, connection refused, socket reset, TLS error, timeout after the configured `timeoutMs` – default 5 000 ms) is a connection-level failure. The probe retries these up to `retries` times (default 2) with a short backoff before giving up and returning a `NetworkError`. This turns a transient transport blip into a reliable single measurement without looking backward across checks. A genuine outage still resolves to a `NetworkError` on the current check – every attempt fails – but note each attempt gets its own `timeoutMs`, so an endpoint that fails only by timing out takes up to `(retries + 1) × timeoutMs` (plus backoff) to be reported down. HTTP error responses (4xx/5xx) and content-validation failures are real ‘down’ states and are **never** retried.
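To make the `(retries + 1) × timeoutMs` (plus backoff) bound concrete, here is a small worked calculation. It assumes the backoff is applied once before each retry and that the nth retry waits `n × 250` ms, as the `RETRY_BACKOFF_MS` comment in `probe.js` states. `worstCaseDownMs` is a hypothetical helper, not part of the package.

```ts
const RETRY_BACKOFF_MS = 250; // base backoff declared in probe.js

// Hypothetical helper: upper bound on the time to report a timing-out endpoint down.
function worstCaseDownMs(timeoutMs = 5000, retries = 2): number {
  // Every attempt (the first try plus each retry) can run to its full timeout.
  const attempts = (retries + 1) * timeoutMs;
  // Backoff before retries 1..n sums to base × (1 + 2 + ... + n).
  const backoff = (RETRY_BACKOFF_MS * retries * (retries + 1)) / 2;
  return attempts + backoff;
}

console.log(worstCaseDownMs()); // 15750
console.log(worstCaseDownMs(5000, 0)); // 5000
```

With the defaults, an endpoint that only ever times out can therefore take close to 16 seconds to be reported down; lowering `retries` or `timeoutMs` shortens that at the cost of tolerance for transient blips.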
36
49
 
37
50
  `NetworkError.message` includes the underlying `error.cause` (e.g. `ECONNRESET`, `UND_ERR_SOCKET “other side closed”`) when Node wraps one, so observations record what actually failed rather than a bare ‘fetch failed’.
51
+
52
+ ## Probing many distributions
53
+
54
+ `probeMany` probes an array of distributions concurrently and returns one result per input, in input order. Each distribution is probed once with `probe`, so every behaviour above applies per distribution; like `probe`, `probeMany` never throws – a probe that fails is reported as a `NetworkError` in its slot.
55
+
56
+ ```ts
57
+ import { probeMany } from '@lde/distribution-probe';
58
+
59
+ const results = await probeMany(distributions, {
60
+ concurrency: 20, // max probes in flight across all hosts (default 20)
61
+ perHostConcurrency: 4, // max probes in flight against one host (default 4)
62
+ validateRdfContent: true, // any ProbeOptions are forwarded to each probe
63
+ });
64
+ ```
65
+
66
+ Two caps bound the batch:
67
+
68
+ - **`concurrency`** bounds the total fan-out, so a large catalogue does not exhaust sockets or buffer too many response bodies at once.
69
+ - **`perHostConcurrency`** bounds the burst any one server sees, keeping the batch a polite client: a catalogue that declares many distributions on a single host (e.g. a download endpoint per named graph) will not trip that server’s rate limiter (HTTP 429). Distributions sharing a host (by `accessUrl`) contend for the same budget; a probe whose host is saturated waits while probes for other hosts proceed, so one busy host never idles the global pool.
70
+
71
+ All other `ProbeOptions` (`timeoutMs`, `retries`, `validateRdfContent`, and the rest) are forwarded unchanged to every probe.
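To anticipate which distributions will share a per-host budget, the host key can be reproduced outside the package. The sketch below assumes the key is `accessUrl.host || accessUrl.href`, as in `probeMany` in `probe.js`; `hostKey` and `countPerHost` are hypothetical helpers and the URLs are placeholders.

```ts
// Hypothetical helper mirroring the host key used for the per-host cap.
function hostKey(accessUrl: URL): string {
  // Authority-less URLs (urn:, file:) have an empty host, so use the full href.
  return accessUrl.host || accessUrl.href;
}

// Count how many distributions would contend for each host's budget.
function countPerHost(urls: readonly URL[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const url of urls) {
    const key = hostKey(url);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const counts = countPerHost([
  new URL('https://data.example.org/a.ttl'),
  new URL('https://data.example.org/b.ttl'),
  new URL('https://other.example.org/c.nt'),
  new URL('urn:example:dump'),
]);
console.log(counts.get('data.example.org')); // 2
```

Any host whose count exceeds `perHostConcurrency` will have some of its probes queued while probes for other hosts proceed. Note that `URL.host` includes the port, so the same hostname on two ports counts as two hosts.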
package/dist/index.d.ts CHANGED
@@ -1,2 +1,2 @@
1
- export { probe, NetworkError, SparqlProbeResult, DataDumpProbeResult, type ProbeOptions, type ProbeResultType, } from './probe.js';
1
+ export { probe, probeMany, NetworkError, SparqlProbeResult, DataDumpProbeResult, type ProbeOptions, type ProbeManyOptions, type ProbeResultType, } from './probe.js';
2
2
  //# sourceMappingURL=index.d.ts.map
@@ -1 +1 @@
1
- {"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,EACL,KAAK,EACL,YAAY,EACZ,iBAAiB,EACjB,mBAAmB,EACnB,KAAK,YAAY,EACjB,KAAK,eAAe,GACrB,MAAM,YAAY,CAAC"}
1
+ {"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,EACL,KAAK,EACL,SAAS,EACT,YAAY,EACZ,iBAAiB,EACjB,mBAAmB,EACnB,KAAK,YAAY,EACjB,KAAK,gBAAgB,EACrB,KAAK,eAAe,GACrB,MAAM,YAAY,CAAC"}
package/dist/index.js CHANGED
@@ -1 +1 @@
1
- export { probe, NetworkError, SparqlProbeResult, DataDumpProbeResult, } from './probe.js';
1
+ export { probe, probeMany, NetworkError, SparqlProbeResult, DataDumpProbeResult, } from './probe.js';
package/dist/probe.d.ts CHANGED
@@ -28,6 +28,49 @@ export interface ProbeOptions {
28
28
  * the default; negative values are clamped to `0`.
29
29
  */
30
30
  retries?: number;
31
+ /**
32
+ * Validate the body content of data-dump distributions whose declared media
33
+ * type is an RDF serialization, by reading a bounded prefix and confirming it
34
+ * carries at least one triple. When `false` (the default) a data dump is only
35
+ * checked for reachability (a `HEAD`, with a body-less `GET` fallback if `HEAD`
36
+ * is unsupported) and its body is never read. When `true`, every declared-RDF
37
+ * dump — regardless of size — is fetched and validated; non-RDF and
38
+ * undeclared-type distributions are still reachability-only. Validation is
39
+ * opt-in because reading a body forces a slow, generate-on-the-fly endpoint to
40
+ * start producing its export, which a `HEAD` does not.
41
+ */
42
+ validateRdfContent?: boolean;
43
+ /**
44
+ * Soft deadline, in milliseconds, for finding the first triple when
45
+ * {@link validateRdfContent} is on. Reachability is settled by the response
46
+ * itself; if no triple has surfaced within this budget the read is aborted and
47
+ * the distribution is reported reachable but unvalidated (no `failureReason`),
48
+ * never failed. This bounds the extra latency content validation adds on slow,
49
+ * generate-on-the-fly endpoints. Clamped to {@link timeoutMs} (a longer budget
50
+ * is meaningless — the request times out first). Defaults to
51
+ * `min(timeoutMs, 2000)`.
52
+ */
53
+ rdfValidationBudgetMs?: number;
54
+ }
55
+ /**
56
+ * Options for {@link probeMany}: the per-probe {@link ProbeOptions} plus the
57
+ * concurrency budgets that bound the batch.
58
+ */
59
+ export interface ProbeManyOptions extends ProbeOptions {
60
+ /**
61
+ * Maximum number of probes to run at once across all hosts. Bounds the batch’s
62
+ * total fan-out so a large catalogue does not exhaust sockets or buffer too many
63
+ * response bodies at once. Default 20.
64
+ */
65
+ concurrency?: number;
66
+ /**
67
+ * Maximum number of probes to run at once against a single host. Bounds the
68
+ * burst any one server sees, so a catalogue that declares many distributions on
69
+ * one host (e.g. a download endpoint per named graph) does not trip its rate
70
+ * limiter (HTTP 429). A probe whose host is at this cap waits while probes for
71
+ * other hosts proceed, so this never idles the global pool. Default 4.
72
+ */
73
+ perHostConcurrency?: number;
31
74
  }
32
75
  /**
33
76
  * Result of a network error during probing.
@@ -80,10 +123,25 @@ export type ProbeResultType = SparqlProbeResult | DataDumpProbeResult | NetworkE
80
123
  *
81
124
  * For SPARQL endpoints, issues the configured SPARQL query (default: a
82
125
  * minimal `SELECT`). For data dumps, issues `HEAD` (with a `GET` fallback
83
- * for small or unknown-size bodies).
126
+ * for small or unknown-size bodies, reading only a bounded prefix so a large
127
+ * streamed dump is never downloaded in full).
84
128
  *
85
129
  * Returns a pure result object; never throws.
86
130
  */
87
131
  export declare function probe(distribution: Distribution, options?: ProbeOptions): Promise<ProbeResultType>;
132
+ /**
133
+ * Probe many distributions concurrently, bounded by a global cap and a per-host
134
+ * cap, returning one result per input in input order. Like {@link probe}, this
135
+ * never throws: a probe that somehow fails is reported as a {@link NetworkError}
136
+ * in its slot.
137
+ *
138
+ * The per-host cap keeps the batch a polite client. Distributions sharing a host
139
+ * (by {@link Distribution.accessUrl}) contend for the same budget, so no single
140
+ * server is hit by the full global pool at once — the burst that trips a rate
141
+ * limiter (HTTP 429). When the next queued probe’s host is saturated it is
142
+ * skipped in favour of a later probe on a different host, so one busy host never
143
+ * idles the global pool (no head-of-line blocking).
144
+ */
145
+ export declare function probeMany(distributions: readonly Distribution[], options?: ProbeManyOptions): Promise<ProbeResultType[]>;
88
146
  export {};
89
147
  //# sourceMappingURL=probe.d.ts.map
@@ -1 +1 @@
1
- {"version":3,"file":"probe.d.ts","sourceRoot":"","sources":["../src/probe.ts"],"names":[],"mappings":"AAAA,OAAO,EAAyB,YAAY,EAAE,MAAM,cAAc,CAAC;AAInE;;GAEG;AACH,MAAM,WAAW,YAAY;IAC3B,0DAA0D;IAC1D,SAAS,CAAC,EAAE,MAAM,CAAC;IACnB;;;OAGG;IACH,OAAO,CAAC,EAAE,OAAO,CAAC;IAClB;;;;;OAKG;IACH,WAAW,CAAC,EAAE,MAAM,CAAC;IACrB;;;;;;;;;OASG;IACH,OAAO,CAAC,EAAE,MAAM,CAAC;CAClB;AASD;;GAEG;AACH,qBAAa,YAAY;aAEL,GAAG,EAAE,MAAM;aACX,OAAO,EAAE,MAAM;aACf,cAAc,EAAE,MAAM;gBAFtB,GAAG,EAAE,MAAM,EACX,OAAO,EAAE,MAAM,EACf,cAAc,EAAE,MAAM;CAEzC;AAED;;GAEG;AACH,uBAAe,WAAW;aAUN,GAAG,EAAE,MAAM;IAT7B,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,YAAY,EAAE,IAAI,GAAG,IAAI,CAAQ;IACjD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAC;IAC3C,SAAgB,aAAa,EAAE,MAAM,GAAG,IAAI,CAAC;IAC7C,SAAgB,QAAQ,EAAE,MAAM,EAAE,CAAM;IACxC,SAAgB,cAAc,EAAE,MAAM,CAAC;gBAGrB,GAAG,EAAE,MAAM,EAC3B,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;IAa9B,SAAS,IAAI,OAAO;CAO5B;AAqBD;;GAEG;AACH,qBAAa,iBAAkB,SAAQ,WAAW;IAChD;;;;;OAKG;IACH,SAAgB,oBAAoB,EAAE,SAAS,MAAM,EAAE,CAAC;gBAGtD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,oBAAoB,EAAE,MAAM,GAAG,SAAS,MAAM,EAAE,EAChD,aAAa,GAAE,MAAM,GAAG,IAAW;IAS5B,SAAS,IAAI,OAAO;CAQ9B;AAED;;GAEG;AACH,qBAAa,mBAAoB,SAAQ,WAAW;IAClD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAQ;gBAGhD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;CAQtC;AAED,MAAM,MAAM,eAAe,GACvB,iBAAiB,GACjB,mBAAmB,GACnB,YAAY,CAAC;AAIjB;;;;;;;;GAQG;AACH,wBAAsB,KAAK,CACzB,YAAY,EAAE,YAAY,EAC1B,OAAO,CAAC,EAAE,YAAY,GACrB,OAAO,CAAC,eAAe,CAAC,CAqD1B"}
1
+ {"version":3,"file":"probe.d.ts","sourceRoot":"","sources":["../src/probe.ts"],"names":[],"mappings":"AAAA,OAAO,EAAyB,YAAY,EAAE,MAAM,cAAc,CAAC;AAKnE;;GAEG;AACH,MAAM,WAAW,YAAY;IAC3B,0DAA0D;IAC1D,SAAS,CAAC,EAAE,MAAM,CAAC;IACnB;;;OAGG;IACH,OAAO,CAAC,EAAE,OAAO,CAAC;IAClB;;;;;OAKG;IACH,WAAW,CAAC,EAAE,MAAM,CAAC;IACrB;;;;;;;;;OASG;IACH,OAAO,CAAC,EAAE,MAAM,CAAC;IACjB;;;;;;;;;;OAUG;IACH,kBAAkB,CAAC,EAAE,OAAO,CAAC;IAC7B;;;;;;;;;OASG;IACH,qBAAqB,CAAC,EAAE,MAAM,CAAC;CAChC;AAED;;;GAGG;AACH,MAAM,WAAW,gBAAiB,SAAQ,YAAY;IACpD;;;;OAIG;IACH,WAAW,CAAC,EAAE,MAAM,CAAC;IACrB;;;;;;OAMG;IACH,kBAAkB,CAAC,EAAE,MAAM,CAAC;CAC7B;AAkCD;;GAEG;AACH,qBAAa,YAAY;aAEL,GAAG,EAAE,MAAM;aACX,OAAO,EAAE,MAAM;aACf,cAAc,EAAE,MAAM;gBAFtB,GAAG,EAAE,MAAM,EACX,OAAO,EAAE,MAAM,EACf,cAAc,EAAE,MAAM;CAEzC;AAED;;GAEG;AACH,uBAAe,WAAW;aAUN,GAAG,EAAE,MAAM;IAT7B,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,YAAY,EAAE,IAAI,GAAG,IAAI,CAAQ;IACjD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAC;IAC3C,SAAgB,aAAa,EAAE,MAAM,GAAG,IAAI,CAAC;IAC7C,SAAgB,QAAQ,EAAE,MAAM,EAAE,CAAM;IACxC,SAAgB,cAAc,EAAE,MAAM,CAAC;gBAGrB,GAAG,EAAE,MAAM,EAC3B,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;IAa9B,SAAS,IAAI,OAAO;CAO5B;AAqBD;;GAEG;AACH,qBAAa,iBAAkB,SAAQ,WAAW;IAChD;;;;;OAKG;IACH,SAAgB,oBAAoB,EAAE,SAAS,MAAM,EAAE,CAAC;gBAGtD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,oBAAoB,EAAE,MAAM,GAAG,SAAS,MAAM,EAAE,EAChD,aAAa,GAAE,MAAM,GAAG,IAAW;IAS5B,SAAS,IAAI,OAAO;CAQ9B;AAED;;GAEG;AACH,qBAAa,mBAAoB,SAAQ,WAAW;IAClD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAQ;gBAGhD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;CAQtC;AAED,MAAM,MAAM,eAAe,GACvB,iBAAiB,GACjB,mBAAmB,GACnB,YAAY,CAAC;AAIjB;;;;;;;;;GASG;AACH,wBAAsB,KAAK,CACzB,YAAY,EAAE,YAAY,EAC1B,OAAO,CAAC,EAAE,YAAY,GACrB,OAAO,CAAC,eAAe,CAAC,CAqD1B;AAED;;;;;;;;;;;;GAYG;AACH,wBAAsB,SAAS,CAC7B,aAAa,EAAE,SAAS,YAAY,EAAE,EACtC,OAAO,CAAC,EAAE,gBAAgB,GACzB,OAAO,CAAC,eAAe,EAAE,CAAC,CA4B5B"}
package/dist/probe.js CHANGED
@@ -1,9 +1,32 @@
1
1
  import { compressionMediaTypes } from '@lde/dataset';
2
2
  import { rdfParser } from 'rdf-parse';
3
3
  import { Readable } from 'node:stream';
4
+ import { createGunzip } from 'node:zlib';
4
5
  const DEFAULT_SPARQL_QUERY = 'SELECT * { ?s ?p ?o } LIMIT 1';
5
6
  const DEFAULT_TIMEOUT_MS = 5000;
6
7
  const DEFAULT_RETRIES = 2;
8
+ const DEFAULT_PROBE_CONCURRENCY = 20;
9
+ const DEFAULT_PROBE_PER_HOST_CONCURRENCY = 4;
10
+ /**
11
+ * Default soft deadline for finding the first triple when content validation is
12
+ * on (capped at `timeoutMs`). Two seconds comfortably covers a static file
13
+ * server's first chunk while keeping the extra wait bounded on a slow,
14
+ * generate-on-the-fly endpoint.
15
+ */
16
+ const DEFAULT_RDF_VALIDATION_BUDGET_MS = 2000;
17
+ /** Sentinel: the validation budget elapsed before a triple surfaced. */
18
+ const VALIDATION_TIMED_OUT = Symbol('rdf-validation-timed-out');
19
+ /**
20
+ * Maximum number of body bytes the data-dump probe reads before it stops and
21
+ * releases the connection. Reachability needs only that the endpoint answered
22
+ * with a success status and produced bytes; a large dump must never be
23
+ * downloaded in full within the probe's timeout budget. 256 KiB comfortably
24
+ * surfaces the first RDF triple — the signal {@link validateBody} needs — while
25
+ * bounding the read regardless of the dump's true size, chunked transfer, or
26
+ * compression. Applied to both the raw read and, for a gzip body, the inflated
27
+ * output.
28
+ */
29
+ const MAX_PROBE_BODY_BYTES = 256 * 1024;
7
30
  /** Base backoff between retries; the nth retry waits `n × base`. */
8
31
  const RETRY_BACKOFF_MS = 250;
9
32
  /**
@@ -107,7 +130,8 @@ export class DataDumpProbeResult extends ProbeResult {
107
130
  *
108
131
  * For SPARQL endpoints, issues the configured SPARQL query (default: a
109
132
  * minimal `SELECT`). For data dumps, issues `HEAD` (with a `GET` fallback
110
- * for small or unknown-size bodies).
133
+ * for small or unknown-size bodies, reading only a bounded prefix so a large
134
+ * streamed dump is never downloaded in full).
111
135
  *
112
136
  * Returns a pure result object; never throws.
113
137
  */
@@ -147,6 +171,96 @@ export async function probe(distribution, options) {
147
171
  // real cost of a down endpoint.
148
172
  return new NetworkError(url, describeNetworkError(lastError), Math.round(performance.now() - overallStart));
149
173
  }
174
+ /**
175
+ * Probe many distributions concurrently, bounded by a global cap and a per-host
176
+ * cap, returning one result per input in input order. Like {@link probe}, this
177
+ * never throws: a probe that somehow fails is reported as a {@link NetworkError}
178
+ * in its slot.
179
+ *
180
+ * The per-host cap keeps the batch a polite client. Distributions sharing a host
181
+ * (by {@link Distribution.accessUrl}) contend for the same budget, so no single
182
+ * server is hit by the full global pool at once — the burst that trips a rate
183
+ * limiter (HTTP 429). When the next queued probe’s host is saturated it is
184
+ * skipped in favour of a later probe on a different host, so one busy host never
185
+ * idles the global pool (no head-of-line blocking).
186
+ */
187
+ export async function probeMany(distributions, options) {
188
+ // Clamp the budgets to a positive integer, mirroring how probe() treats an
189
+ // invalid retries value: a zero, negative, fractional, or NaN limit would
190
+ // otherwise stall the scheduler (no task ever starts, so the promise never
191
+ // resolves) or overrun the cap, so fall back to the default rather than trust
192
+ // the caller.
193
+ const globalLimit = positiveIntOrDefault(options?.concurrency, DEFAULT_PROBE_CONCURRENCY);
194
+ const perHostLimit = positiveIntOrDefault(options?.perHostConcurrency, DEFAULT_PROBE_PER_HOST_CONCURRENCY);
195
+ // Probes contend per host. An authority-less URL (e.g. urn:, file:) has an
196
+ // empty host, so it falls back to its full href and never shares a budget with
197
+ // an unrelated one.
198
+ const hostKeys = distributions.map((distribution) => distribution.accessUrl.host || distribution.accessUrl.href);
199
+ return mapHostLimited(distributions, hostKeys, globalLimit, perHostLimit, (distribution) => probe(distribution, options));
200
+ }
201
+ /**
202
+ * Coerce an optional concurrency budget to a usable value: a positive integer is
203
+ * taken as-is; undefined, zero, negative, fractional, or NaN falls back to the
204
+ * default. Matches probe()’s treatment of an invalid retries value.
205
+ */
206
+ function positiveIntOrDefault(value, fallback) {
207
+ return value !== undefined && Number.isInteger(value) && value >= 1
208
+ ? value
209
+ : fallback;
210
+ }
211
+ /**
212
+ * Run `task` over `items` with two concurrency caps — a global cap and a per-host
213
+ * cap keyed by `hostKeys[index]` — resolving to results in input order. When the
214
+ * next queued item’s host is at the per-host cap it is skipped for a later item on
215
+ * a different host, so a saturated host never idles the global pool (no head-of-line
216
+ * blocking); the skipped host always has a task in flight, whose completion re-runs
217
+ * the scheduler, so the queue always drains. `task` must not reject — callers wrap
218
+ * failures into a result value — as a rejection would leave the promise pending.
219
+ */
220
+ function mapHostLimited(items, hostKeys, globalLimit, perHostLimit, task) {
221
+ const results = new Array(items.length);
222
+ const perHostInFlight = new Map();
223
+ const pending = items.map((_unused, index) => index);
224
+ let globalInFlight = 0;
225
+ let settledCount = 0;
226
+ const adjustHost = (host, delta) => {
227
+ perHostInFlight.set(host, (perHostInFlight.get(host) ?? 0) + delta);
228
+ };
229
+ return new Promise((resolve) => {
230
+ const schedule = () => {
231
+ let cursor = 0;
232
+ while (cursor < pending.length && globalInFlight < globalLimit) {
233
+ const index = pending[cursor];
234
+ const host = hostKeys[index];
235
+ if ((perHostInFlight.get(host) ?? 0) >= perHostLimit) {
236
+ cursor++; // Host saturated; leave it queued and try a later, different host.
237
+ continue;
238
+ }
239
+ pending.splice(cursor, 1);
240
+ globalInFlight++;
241
+ adjustHost(host, 1);
242
+ void task(items[index]).then((result) => {
243
+ results[index] = result;
244
+ globalInFlight--;
245
+ adjustHost(host, -1);
246
+ settledCount++;
247
+ if (settledCount === items.length) {
248
+ resolve(results);
249
+ }
250
+ else {
251
+ schedule();
252
+ }
253
+ });
254
+ // pending[cursor] now holds the next queued item; do not advance cursor.
255
+ }
256
+ };
257
+ schedule();
258
+ // Resolve immediately when there is nothing to settle (empty input); a
259
+ // non-empty run resolves via the task completion above.
260
+ if (settledCount === items.length)
261
+ resolve(results);
262
+ });
263
+ }
150
264
  function delay(milliseconds) {
151
265
  return new Promise((resolve) => setTimeout(resolve, milliseconds));
152
266
  }
@@ -186,6 +300,9 @@ function resolveOptions(options) {
186
300
  retries: retries === undefined || !Number.isInteger(retries)
187
301
  ? DEFAULT_RETRIES
188
302
  : Math.max(0, retries),
303
+ validateRdfContent: options?.validateRdfContent ?? false,
304
+ rdfValidationBudgetMs: options?.rdfValidationBudgetMs ??
305
+ Math.min(options?.timeoutMs ?? DEFAULT_TIMEOUT_MS, DEFAULT_RDF_VALIDATION_BUDGET_MS),
189
306
  };
190
307
  }
191
308
  /**
@@ -350,30 +467,201 @@ async function probeDataDump(url, distribution, options, authHeaders, start) {
350
467
  method: 'HEAD',
351
468
  ...requestOptions,
352
469
  });
353
- const contentLength = headResponse.headers.get('Content-Length');
354
- const contentLengthBytes = contentLength ? parseInt(contentLength) : 0;
355
- // For small or unknown-size files, do a GET to validate body content.
356
- // This also handles servers that incorrectly return 0 Content-Length for HEAD.
357
- if (contentLengthBytes <= 10_240) {
358
- const getResponse = await fetch(url, {
359
- method: 'GET',
360
- ...requestOptions,
361
- });
362
- const body = await getResponse.text();
363
- const isHttpSuccess = getResponse.status >= 200 && getResponse.status < 400;
364
- const failureReason = isHttpSuccess
365
- ? await validateBody(body, getResponse.headers.get('Content-Type'), url, options.timeoutMs)
366
- : null;
367
- const responseTimeMs = Math.round(performance.now() - start);
368
- const result = new DataDumpProbeResult(url, getResponse, responseTimeMs, failureReason);
369
- checkContentTypeMismatch(result, distribution);
370
- return result;
470
+ // Validate body content only when asked to and the distribution declares an
471
+ // RDF media type; otherwise the probe is reachability-only and never reads a
472
+ // body, which keeps it from forcing a slow, generate-on-the-fly endpoint to
473
+ // start producing its export.
474
+ if (options.validateRdfContent &&
475
+ isDeclaredRdf(distribution) &&
476
+ isHttpSuccess(headResponse)) {
477
+ const { response, failureReason } = await validateDumpBody(url, headers, options, headResponse);
478
+ return finalizeDataDump(url, distribution, response, start, failureReason);
479
+ }
480
+ // Reachability only. A successful HEAD is enough; otherwise confirm with a
481
+ // body-less GET, which rescues servers that reject or do not implement HEAD.
482
+ if (isHttpSuccess(headResponse)) {
483
+ return finalizeDataDump(url, distribution, headResponse, start, null);
371
484
  }
485
+ const getResponse = await fetch(url, { method: 'GET', ...requestOptions });
486
+ await getResponse.body?.cancel();
487
+ return finalizeDataDump(url, distribution, getResponse, start, null);
488
+ }
489
+ /** Whether an HTTP response carries a success (2xx/3xx) status. */
490
+ function isHttpSuccess(response) {
491
+ return response.status >= 200 && response.status < 400;
492
+ }
493
+ /** Whether the distribution declares an RDF serialization as its media type. */
494
+ function isDeclaredRdf(distribution) {
495
+ const declared = distribution.mimeType?.toLowerCase();
496
+ return declared !== undefined && rdfContentTypes.includes(declared);
497
+ }
498
+ /** Build a DataDumpProbeResult and attach any Content-Type-mismatch warning. */
499
+ function finalizeDataDump(url, distribution, response, start, failureReason) {
372
500
  const responseTimeMs = Math.round(performance.now() - start);
373
- const result = new DataDumpProbeResult(url, headResponse, responseTimeMs);
501
+ const result = new DataDumpProbeResult(url, response, responseTimeMs, failureReason);
374
502
  checkContentTypeMismatch(result, distribution);
375
503
  return result;
376
504
  }
505
+ /**
506
+ * GET the dump and validate that its body carries a triple, but only for as long
507
+ * as the validation budget allows. Reachability is already settled by the prior
508
+ * HEAD, so any shortfall — a budget that elapses before a triple, a read error,
509
+ * a GET that cannot start — yields a `null` failureReason (reachable,
510
+ * unvalidated), never a failure. Returns the response to draw metadata from
511
+ * (the GET, or the HEAD when the GET could not start) alongside that reason.
512
+ */
513
+ async function validateDumpBody(url, headers, options, headResponse) {
514
+ const budgetMs = Math.min(options.rdfValidationBudgetMs, options.timeoutMs);
515
+ // Aborting on budget expiry stops a slow endpoint from streaming on in the
516
+ // background once we have given up waiting for a triple.
517
+ const budgetController = new AbortController();
518
+ let getResponse;
519
+ try {
520
+ getResponse = await fetch(url, {
521
+ method: 'GET',
522
+ headers,
523
+ signal: AbortSignal.any([
524
+ AbortSignal.timeout(options.timeoutMs),
525
+ budgetController.signal,
526
+ ]),
527
+ });
528
+ }
529
+ catch {
530
+ // The GET could not even return headers; the HEAD already proved the
531
+ // distribution reachable, so report it unvalidated rather than down.
532
+ return { response: headResponse, failureReason: null };
533
+ }
534
+ if (!isHttpSuccess(getResponse)) {
535
+ await getResponse.body?.cancel();
536
+ return { response: getResponse, failureReason: null };
537
+ }
538
+ const validation = (async () => {
539
+ const bounded = await readBoundedBody(getResponse, MAX_PROBE_BODY_BYTES);
540
+ const { text, truncated, corrupt } = await decodeProbeBody(bounded);
541
+ return corrupt
542
+ ? 'Distribution is not valid gzip'
543
+ : await validateBody(text, getResponse.headers.get('Content-Type'), url, budgetMs, truncated);
544
+ })().catch(() => null);
545
+ let budgetTimer;
546
+ const budgetExpiry = new Promise((resolve) => {
547
+ budgetTimer = setTimeout(() => {
548
+ budgetController.abort();
549
+ resolve(VALIDATION_TIMED_OUT);
550
+ }, budgetMs);
551
+ });
552
+ try {
553
+ const outcome = await Promise.race([validation, budgetExpiry]);
554
+ return {
555
+ response: getResponse,
556
+ failureReason: outcome === VALIDATION_TIMED_OUT ? null : outcome,
557
+ };
558
+ }
559
+ finally {
560
+ clearTimeout(budgetTimer);
561
+ }
562
+ }
563
+ /**
564
+ * Read at most `maxBytes` from a response body, then cancel the stream to free
565
+ * the underlying connection. Returns the bytes read and whether the body was
566
+ * longer than the cap (`truncated`), so the caller can tell a complete, small
567
+ * body — whose emptiness or parse errors are meaningful — from a deliberately
568
+ * cut-off prefix of a large one, where only the presence of content is
569
+ * conclusive. This is what keeps the probe from downloading a multi-hundred-MB
570
+ * streamed dump in full just to confirm it is reachable.
571
+ */
572
+ async function readBoundedBody(response, maxBytes) {
573
+ const stream = response.body;
574
+ if (stream === null) {
575
+ return { bytes: new Uint8Array(0), truncated: false };
576
+ }
577
+ const chunks = [];
578
+ let total = 0;
579
+ let truncated = false;
580
+ // Breaking out of `for await` cancels the stream, which stops any further
581
+ // download and releases the underlying connection — so a large dump is never
582
+ // pulled in full once we have the prefix we need.
583
+ for await (const chunk of stream) {
584
+ chunks.push(chunk);
585
+ total += chunk.length;
586
+ if (total >= maxBytes) {
587
+ truncated = true;
588
+ break;
589
+ }
590
+ }
591
+ return { bytes: Buffer.concat(chunks), truncated };
592
+ }
593
+ /**
594
+ * Decode a bounded body to text for RDF validation, inflating it first when it
595
+ * is a gzip stream that `fetch` did not transparently decompress — e.g. a `.gz`
596
+ * data dump served as-is, or one labelled with a non-standard Content-Encoding
597
+ * (`application/gzip`) that undici does not recognise as a content coding.
598
+ * Detection is by the gzip magic on the delivered bytes, so a body that `fetch`
599
+ * already inflated (a standard `Content-Encoding: gzip`) is passed through
600
+ * untouched. A truncated gzip tail is expected — we only read a prefix — and
601
+ * inflates cleanly up to the cut, so it is never mistaken for corruption.
602
+ */
603
+ async function decodeProbeBody(bounded) {
604
+ if (!isGzip(bounded.bytes)) {
605
+ return {
606
+ text: decodeUtf8(bounded.bytes),
607
+ truncated: bounded.truncated,
608
+ corrupt: false,
609
+ };
610
+ }
611
+ // The compressed body is complete only when the raw read was not itself cut
612
+ // off: a gzip error on a complete body is genuine corruption, on a prefix we
613
+ // cut it is just the dropped tail.
614
+ const inflated = await gunzipPrefix(bounded.bytes, MAX_PROBE_BODY_BYTES, !bounded.truncated);
615
+ return {
616
+ text: decodeUtf8(inflated.bytes),
617
+ truncated: bounded.truncated || inflated.truncated,
618
+ corrupt: inflated.corrupt,
619
+ };
620
+ }
621
+ /** Whether the bytes begin with the gzip magic number (RFC 1952 §2.3.1). */
622
+ function isGzip(bytes) {
623
+ return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
624
+ }
+ /**
+  * Decode bytes as UTF-8 without throwing: an incomplete multi-byte sequence at
+  * the truncation boundary is replaced rather than fatal, since the RDF parser
+  * only needs the leading, intact portion to find the first triple.
+  */
+ function decodeUtf8(bytes) {
+     return new TextDecoder('utf-8', { fatal: false }).decode(bytes);
+ }
+ /**
+  * Inflate up to `maxBytes` of output from a gzip prefix, stopping once the cap
+  * is reached or the input runs out. `inputComplete` says whether the caller
+  * handed us the whole compressed body (true) or a prefix it had already cut
+  * (false). An inflate error therefore means different things: on a complete body
+  * the gzip is genuinely corrupt; on a cut prefix it is just the dropped tail, so
+  * whatever inflated cleanly is reported as a (truncated) partial inflate.
+  */
+ function gunzipPrefix(bytes, maxBytes, inputComplete) {
+     return new Promise((resolve) => {
+         const gunzip = createGunzip();
+         const chunks = [];
+         let total = 0;
+         // `resolve` and `destroy` are both idempotent, so the first outcome wins and
+         // any later event (e.g. a premature-close error emitted by `destroy`) is a
+         // harmless no-op — no `settled` guard needed.
+         function finish(outcome) {
+             gunzip.destroy();
+             resolve({ bytes: Buffer.concat(chunks), ...outcome });
+         }
+         gunzip.on('data', (chunk) => {
+             chunks.push(chunk);
+             total += chunk.length;
+             if (total >= maxBytes) {
+                 finish({ truncated: true, corrupt: false });
+             }
+         });
+         gunzip.on('error', () => finish({ truncated: !inputComplete, corrupt: inputComplete }));
+         gunzip.on('end', () => finish({ truncated: false, corrupt: false }));
+         gunzip.end(bytes);
+     });
+ }
  // The RDF serializations whose bodies we parse to confirm they carry triples. A
  // non-empty body in one of these formats that yields zero triples — an empty
  // graph such as a JSON-LD `{}`, an `<rdf:RDF/>`, or prefix-only Turtle — is a
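The hunk above relies on two properties of a bounded read: a gzip prefix still inflates up to the cut, and a UTF-8 sequence split at the cut can be replaced instead of thrown on. The sketch below illustrates both with Node's built-in `zlib` and `TextDecoder`. It is not the package's code: it swaps the streaming `createGunzip()` approach for the synchronous `gunzipSync` with `finishFlush: Z_SYNC_FLUSH`, and the sample N-Triples payload and `example.org` IRIs are made up for illustration.

```javascript
const zlib = require('node:zlib');

// Same check as `isGzip` in the diff: the two gzip magic bytes.
function isGzip(bytes) {
    return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Hypothetical sample data: 500 ASCII-only N-Triples lines.
const original = Array.from({ length: 500 }, (_, i) =>
    `<http://example.org/s${i}> <http://example.org/p> "v${i}" .`).join('\n') + '\n';
const compressed = zlib.gzipSync(Buffer.from(original, 'utf8'));

// Simulate a bounded read that stopped halfway through the compressed body.
const prefix = compressed.subarray(0, Math.floor(compressed.length / 2));

// With the default finishFlush (Z_FINISH) the missing tail is an error...
let strictError = null;
try {
    zlib.gunzipSync(prefix);
} catch (error) {
    strictError = error;
}

// ...whereas Z_SYNC_FLUSH returns whatever inflated cleanly before the cut.
const partial = zlib
    .gunzipSync(prefix, { finishFlush: zlib.constants.Z_SYNC_FLUSH })
    .toString('utf8');

// A multi-byte character split at the cut: non-fatal decoding substitutes
// U+FFFD, fatal decoding throws.
const cutEuro = Buffer.from('a€', 'utf8').subarray(0, 3); // 'a' plus 2 of the 3 bytes of '€'
const lenient = new TextDecoder('utf-8', { fatal: false }).decode(cutEuro);
let fatalError = null;
try {
    new TextDecoder('utf-8', { fatal: true }).decode(cutEuro);
} catch (error) {
    fatalError = error;
}

console.log(isGzip(compressed), strictError !== null, original.startsWith(partial));
console.log(JSON.stringify(lenient), fatalError !== null);
```

The partial text is a strict prefix of the original, which is all a statement-oriented RDF parser needs to find a first triple.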
@@ -389,9 +677,21 @@ const rdfContentTypes = [
      'application/ld+json',
      'application/rdf+xml',
  ];
- async function validateBody(body, contentType, baseIRI, timeoutMs) {
+ // Serializations a streaming parser cannot validate from a truncated prefix.
+ // The line/statement-oriented formats (N-Triples, N-Quads, Turtle, TriG, N3) and
+ // SAX-based RDF/XML all yield their first triple from the opening chunk, but
+ // JSON-LD is a single JSON value whose parser emits nothing until the whole
+ // document closes — a truncated JSON-LD body parses to an ‘unclosed document’
+ // error, never a triple. So a truncated body in one of these can only be
+ // validated if it happened to fit the read cap in full; beyond that it is
+ // inconclusive, and we must not download it in full to find out.
+ const nonStreamableRdfContentTypes = ['application/ld+json'];
+ async function validateBody(body, contentType, baseIRI, timeoutMs, truncated) {
      if (body.length === 0) {
-         return 'Distribution is empty';
+         // A complete, empty body is a faulty distribution; an empty *prefix* (a
+         // truncated read that yielded no bytes, e.g. a corrupt gzip header) is
+         // inconclusive — the endpoint answered, we just could not validate content.
+         return truncated ? null : 'Distribution is empty';
      }
      // Media types are case-insensitive (RFC 9110 §8.3.1), so normalise before
      // matching the lower-case allow-list — a server sending `Application/LD+JSON`
@@ -400,7 +700,13 @@ async function validateBody(body, contentType, baseIRI, timeoutMs) {
      if (!serialization || !rdfContentTypes.includes(serialization)) {
          return null;
      }
-     const outcome = await classifyRdfBody(body, serialization, baseIRI, timeoutMs);
+     if (truncated && nonStreamableRdfContentTypes.includes(serialization)) {
+         // A bounded prefix of a non-streamable serialization (JSON-LD) can never
+         // yield a triple, so skip the doomed parse and report it inconclusive — only
+         // a complete document, small enough to fit the read cap, can be validated.
+         return null;
+     }
+     const outcome = await classifyRdfBody(body, serialization, baseIRI, timeoutMs, truncated);
      switch (outcome.type) {
          case 'empty':
              return 'Distribution contains no RDF triples';
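Why a truncated JSON-LD prefix is skipped while a truncated N-Triples prefix is still parsed can be seen without an RDF parser at all. In this rough sketch `JSON.parse` stands in for the JSON-LD parser and a line split stands in for a streaming N-Triples parser. Both stand-ins, the sample documents, and the `example.org` IRIs are assumptions for illustration, not what the package runs.

```javascript
// Hypothetical sample documents describing the same two statements.
const jsonld = JSON.stringify({
    '@id': 'http://example.org/s',
    'http://example.org/p': [{ '@value': 'first' }, { '@value': 'second' }],
});
const ntriples =
    '<http://example.org/s> <http://example.org/p> "first" .\n' +
    '<http://example.org/s> <http://example.org/p> "second" .\n';

// Simulate a bounded read that drops the last 10 characters of a body.
const cut = (text) => text.slice(0, text.length - 10);

// JSON is a single value: without its closing brackets nothing can be parsed.
let jsonError = null;
try {
    JSON.parse(cut(jsonld));
} catch (error) {
    jsonError = error;
}

// N-Triples is line-oriented: every line before the cut is still a complete
// statement, so the fragment after the last newline is simply discarded.
const completeLines = cut(ntriples).split('\n').slice(0, -1);

console.log(jsonError instanceof SyntaxError, completeLines.length);
```

This is why the diff treats a truncated `application/ld+json` body as inconclusive up front instead of letting the parser fail on it.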
@@ -422,8 +728,13 @@ async function validateBody(body, contentType, baseIRI, timeoutMs) {
   * on expiry — and likewise when a remote `@context` is unreachable — the outcome
   * is 'inconclusive', so a valid distribution is never flagged faulty for a
   * context host's failure. `baseIRI` resolves any relative IRIs in the document.
+  *
+  * When `truncated` is true the body is only a bounded prefix of a larger one, so
+  * only finding a triple ('hasTriples') is conclusive: a parse error at the cut
+  * or a clean end with no triple yet means we did not read far enough, not that
+  * the distribution is empty or malformed, and is reported as 'inconclusive'.
   */
- function classifyRdfBody(body, contentType, baseIRI, timeoutMs) {
+ function classifyRdfBody(body, contentType, baseIRI, timeoutMs, truncated) {
      return new Promise((resolve) => {
          const quads = rdfParser.parse(Readable.from([body]), {
              contentType,
@@ -441,10 +752,10 @@ function classifyRdfBody(body, contentType, baseIRI, timeoutMs) {
          }
          quads
              .on('data', () => settle({ type: 'hasTriples' }))
-             .on('error', (error) => settle(isRemoteContextError(error)
+             .on('error', (error) => settle(truncated || isRemoteContextError(error)
                  ? { type: 'inconclusive' }
                  : { type: 'parseError', message: error.message }))
-             .on('end', () => settle({ type: 'empty' }));
+             .on('end', () => settle(truncated ? { type: 'inconclusive' } : { type: 'empty' }));
      });
  }
  /**
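Taken together, the two changed handlers and the untouched `data` handler reduce to a small decision table. The function below restates it as plain code so the cases can be compared side by side. `classifyEvent` and its boolean `remoteContextError` parameter are names invented for this sketch; the diff itself calls `isRemoteContextError(error)` inline and settles a promise.

```javascript
// Hypothetical restatement of the handlers in `classifyRdfBody`.
function classifyEvent(event, truncated, remoteContextError = false) {
    switch (event) {
        case 'data':
            // A triple was emitted: conclusive whether or not the body was cut.
            return 'hasTriples';
        case 'error':
            // An error on a prefix may just be the cut; a remote @context failure
            // says nothing about the distribution either.
            return truncated || remoteContextError ? 'inconclusive' : 'parseError';
        case 'end':
            // A clean end without triples only proves emptiness on a complete body.
            return truncated ? 'inconclusive' : 'empty';
        default:
            throw new Error(`Unexpected parser event: ${event}`);
    }
}

for (const truncated of [false, true]) {
    for (const event of ['data', 'error', 'end']) {
        console.log(`truncated=${truncated} ${event} -> ${classifyEvent(event, truncated)}`);
    }
}
```

Only `hasTriples` is conclusive for a truncated body; `parseError` and `empty` can be reached only when the whole body was read.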
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@lde/distribution-probe",
-   "version": "0.1.13",
+   "version": "0.2.1",
    "repository": {
      "url": "git+https://github.com/ldelements/lde.git",
      "directory": "packages/distribution-probe"