@lde/distribution-probe 0.1.13 → 0.2.1
- package/README.md +37 -3
- package/dist/index.d.ts +1 -1
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +1 -1
- package/dist/probe.d.ts +59 -1
- package/dist/probe.d.ts.map +1 -1
- package/dist/probe.js +337 -26
- package/package.json +1 -1
package/README.md
CHANGED
@@ -24,14 +24,48 @@ Sends `POST` with the configured query (default `SELECT * { ?s ?p ?o } LIMIT 1`)
 
 ### Data dumps
 
-
+#### Reachability (the default)
+
+Sends `HEAD` with `Accept: <distribution.mimeType>` and `Accept-Encoding: identity`. A successful `HEAD` settles reachability and gathers metadata (`Content-Length`, `Last-Modified`) **without reading the body**. If `HEAD` is unsuccessful — e.g. a server that returns `405`/`501` because it does not implement `HEAD` — the probe falls back to a body-less `GET` to confirm the endpoint is up. The body is never downloaded.
+
+This is deliberately cheap: reading a body forces a slow, generate-on-the-fly endpoint (a TriplyDB dump, a SPARQL `CONSTRUCT` export) to start producing its export, which a `HEAD` does not.
 
 - **Content-Type is checked as a soft warning, not a hard failure.** If the server’s Content-Type disagrees with the distribution’s declared `mimeType`, a message is appended to `result.warnings` but `isSuccess()` stays `true`. Compression wrappers (`application/gzip`, `application/x-gzip`, `application/octet-stream`) are skipped so a gzipped Turtle file doesn’t trigger a warning.
-
-
+
+#### Content validation (opt-in)
+
+Set `validateRdfContent: true` to additionally confirm that a dump actually carries RDF. It applies only to distributions whose **declared** `mimeType` is an RDF serialization (`text/turtle`, `application/n-triples`, `application/n-quads`, `application/trig`, `text/n3`, `application/ld+json`, `application/rdf+xml`); non-RDF and undeclared-type distributions stay reachability-only.
+
+When on, the probe `GET`s the dump — **regardless of size** — and reads only a **bounded prefix** (256 KiB), never the whole body:
+
+- It settles on the **first triple** and stops, so a large dump is validated from its opening chunk. The line/statement-oriented serializations and RDF/XML stream a triple out of the prefix; **JSON-LD is not streamable** (its parser needs the whole document), so a JSON-LD dump is only validated when it fits the prefix in full — a larger one is reported reachable but unvalidated.
+- A gzip body that `fetch` did not decompress (a `.gz` dump, or one served with a non-standard `Content-Encoding`) is inflated in-place; a gzip that will not inflate when the **complete** compressed body was read fails as `Distribution is not valid gzip`.
+- Empty bodies (`Distribution is empty`) and bodies that parse to **zero** triples (`Distribution contains no RDF triples`) fail the probe. A deliberately truncated prefix is never mistaken for either — it is inconclusive.
+- **Reachability is settled by the response, so validation never turns a reachable dump into a failure.** If no triple surfaces within `rdfValidationBudgetMs` (default `min(timeoutMs, 2000)`, clamped to `timeoutMs`), the read is aborted and the distribution is reported reachable but unvalidated (no `failureReason`). This bounds the extra latency content validation adds on slow, generate-on-the-fly endpoints.
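The budget rule in the last bullet (default `min(timeoutMs, 2000)`, clamped to `timeoutMs`) can be restated as a small pure function. This is a minimal sketch based on the README text and on the `resolveOptions` and `validateDumpBody` lines in `probe.js`; the names `effectiveValidationBudgetMs` and `BudgetOptions` are hypothetical and not part of the package's API.

```typescript
// Hypothetical helper for illustration; not exported by @lde/distribution-probe.
interface BudgetOptions {
  timeoutMs?: number;
  rdfValidationBudgetMs?: number;
}

// Defaults as stated in the README and in probe.js.
const DEFAULT_TIMEOUT_MS = 5000;
const DEFAULT_RDF_VALIDATION_BUDGET_MS = 2000;

function effectiveValidationBudgetMs(options: BudgetOptions = {}): number {
  const timeoutMs = options.timeoutMs ?? DEFAULT_TIMEOUT_MS;
  // Without an explicit budget, fall back to min(timeoutMs, 2000).
  const requested =
    options.rdfValidationBudgetMs ??
    Math.min(timeoutMs, DEFAULT_RDF_VALIDATION_BUDGET_MS);
  // An explicit budget longer than the request timeout is clamped to it,
  // because the request would time out first.
  return Math.min(requested, timeoutMs);
}
```

With no options this yields 2000 ms; with `timeoutMs: 1000` it yields 1000 ms; an explicit `rdfValidationBudgetMs: 10000` under the default timeout is clamped to 5000 ms.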
 
 ### Network errors
 
 A thrown exception from `fetch` (DNS failure, connection refused, socket reset, TLS error, timeout after the configured `timeoutMs` – default 5 000 ms) is a connection-level failure. The probe retries these up to `retries` times (default 2) with a short backoff before giving up and returning a `NetworkError`. This turns a transient transport blip into a reliable single measurement without looking backward across checks. A genuine outage still resolves to a `NetworkError` on the current check – every attempt fails – but note each attempt gets its own `timeoutMs`, so an endpoint that fails only by timing out takes up to `(retries + 1) × timeoutMs` (plus backoff) to be reported down. HTTP error responses (4xx/5xx) and content-validation failures are real ‘down’ states and are **never** retried.
 
 `NetworkError.message` includes the underlying `error.cause` (e.g. `ECONNRESET`, `UND_ERR_SOCKET “other side closed”`) when Node wraps one, so observations record what actually failed rather than a bare ‘fetch failed’.
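The `(retries + 1) × timeoutMs` (plus backoff) figure can be worked through numerically. The sketch below assumes the backoff is applied once before each retry and that the nth retry waits `n × 250` ms, as the `RETRY_BACKOFF_MS` comment in `probe.js` states; the function name `worstCaseDownMs` is hypothetical.

```typescript
// Hypothetical helper for illustration; not part of the package.
// Base backoff taken from the RETRY_BACKOFF_MS constant in probe.js.
const RETRY_BACKOFF_MS = 250;

function worstCaseDownMs(timeoutMs = 5000, retries = 2): number {
  // The first try plus each retry may each run to its own timeout.
  const attempts = (retries + 1) * timeoutMs;
  // Assumption: the nth retry waits n × base before it starts, so the total
  // backoff is base × (1 + 2 + ... + retries).
  const backoff = (RETRY_BACKOFF_MS * retries * (retries + 1)) / 2;
  return attempts + backoff;
}

console.log(worstCaseDownMs()); // 15750 with the defaults (15000 + 250 + 500)
```

Under these assumptions an endpoint that only ever times out is reported down after roughly 15.75 s with the default options, which is worth keeping in mind when scheduling checks.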
+
+## Probing many distributions
+
+`probeMany` probes an array of distributions concurrently and returns one result per input, in input order. Each distribution is probed once with `probe`, so every behaviour above applies per distribution; like `probe`, `probeMany` never throws – a probe that fails is reported as a `NetworkError` in its slot.
+
+```ts
+import { probeMany } from '@lde/distribution-probe';
+
+const results = await probeMany(distributions, {
+  concurrency: 20, // max probes in flight across all hosts (default 20)
+  perHostConcurrency: 4, // max probes in flight against one host (default 4)
+  validateRdfContent: true, // any ProbeOptions are forwarded to each probe
+});
+```
+
+Two caps bound the batch:
+
+- **`concurrency`** bounds the total fan-out, so a large catalogue does not exhaust sockets or buffer too many response bodies at once.
+- **`perHostConcurrency`** bounds the burst any one server sees, keeping the batch a polite client: a catalogue that declares many distributions on a single host (e.g. a download endpoint per named graph) will not trip that server’s rate limiter (HTTP 429). Distributions sharing a host (by `accessUrl`) contend for the same budget; a probe whose host is saturated waits while probes for other hosts proceed, so one busy host never idles the global pool.
+
+All other `ProbeOptions` (`timeoutMs`, `retries`, `validateRdfContent`, and the rest) are forwarded unchanged to every probe.
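To see which distributions would contend for the same `perHostConcurrency` budget, the host-key expression used by `probeMany` in `probe.js` (`accessUrl.host || accessUrl.href`) can be applied to a catalogue up front. This is a minimal sketch; `DistributionLike`, `hostKey` and `countByHost` are hypothetical names, and the URLs are made up for illustration.

```typescript
// Hypothetical stand-in for the package's Distribution type: only the
// accessUrl field matters for grouping.
interface DistributionLike {
  accessUrl: URL;
}

// Same expression as in probeMany: the host, or the full href when the URL
// has no authority (for example a urn: URL, whose host is the empty string).
function hostKey(distribution: DistributionLike): string {
  return distribution.accessUrl.host || distribution.accessUrl.href;
}

// Count how many distributions share each host key.
function countByHost(distributions: readonly DistributionLike[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const distribution of distributions) {
    const key = hostKey(distribution);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const sample: DistributionLike[] = [
  { accessUrl: new URL('https://data.example.org/graph/a.nt') },
  { accessUrl: new URL('https://data.example.org/graph/b.nt') },
  { accessUrl: new URL('https://other.example.org:8443/dump.ttl') },
  { accessUrl: new URL('urn:example:dump') },
];
const perHostCounts = countByHost(sample);
```

Here the two `data.example.org` distributions share one budget, while the host with an explicit port and the `urn:` URL each get their own key.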
package/dist/index.d.ts
CHANGED
@@ -1,2 +1,2 @@
-export { probe, NetworkError, SparqlProbeResult, DataDumpProbeResult, type ProbeOptions, type ProbeResultType, } from './probe.js';
+export { probe, probeMany, NetworkError, SparqlProbeResult, DataDumpProbeResult, type ProbeOptions, type ProbeManyOptions, type ProbeResultType, } from './probe.js';
 //# sourceMappingURL=index.d.ts.map
package/dist/index.d.ts.map
CHANGED
@@ -1 +1 @@
-{"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,EACL,KAAK,EACL,YAAY,EACZ,iBAAiB,EACjB,mBAAmB,EACnB,KAAK,YAAY,EACjB,KAAK,eAAe,GACrB,MAAM,YAAY,CAAC"}
+{"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,EACL,KAAK,EACL,SAAS,EACT,YAAY,EACZ,iBAAiB,EACjB,mBAAmB,EACnB,KAAK,YAAY,EACjB,KAAK,gBAAgB,EACrB,KAAK,eAAe,GACrB,MAAM,YAAY,CAAC"}
package/dist/index.js
CHANGED
@@ -1 +1 @@
-export { probe, NetworkError, SparqlProbeResult, DataDumpProbeResult, } from './probe.js';
+export { probe, probeMany, NetworkError, SparqlProbeResult, DataDumpProbeResult, } from './probe.js';
package/dist/probe.d.ts
CHANGED
@@ -28,6 +28,49 @@ export interface ProbeOptions {
      * the default; negative values are clamped to `0`.
      */
     retries?: number;
+    /**
+     * Validate the body content of data-dump distributions whose declared media
+     * type is an RDF serialization, by reading a bounded prefix and confirming it
+     * carries at least one triple. When `false` (the default) a data dump is only
+     * checked for reachability (a `HEAD`, with a body-less `GET` fallback if `HEAD`
+     * is unsupported) and its body is never read. When `true`, every declared-RDF
+     * dump — regardless of size — is fetched and validated; non-RDF and
+     * undeclared-type distributions are still reachability-only. Validation is
+     * opt-in because reading a body forces a slow, generate-on-the-fly endpoint to
+     * start producing its export, which a `HEAD` does not.
+     */
+    validateRdfContent?: boolean;
+    /**
+     * Soft deadline, in milliseconds, for finding the first triple when
+     * {@link validateRdfContent} is on. Reachability is settled by the response
+     * itself; if no triple has surfaced within this budget the read is aborted and
+     * the distribution is reported reachable but unvalidated (no `failureReason`),
+     * never failed. This bounds the extra latency content validation adds on slow,
+     * generate-on-the-fly endpoints. Clamped to {@link timeoutMs} (a longer budget
+     * is meaningless — the request times out first). Defaults to
+     * `min(timeoutMs, 2000)`.
+     */
+    rdfValidationBudgetMs?: number;
+}
+/**
+ * Options for {@link probeMany}: the per-probe {@link ProbeOptions} plus the
+ * concurrency budgets that bound the batch.
+ */
+export interface ProbeManyOptions extends ProbeOptions {
+    /**
+     * Maximum number of probes to run at once across all hosts. Bounds the batch’s
+     * total fan-out so a large catalogue does not exhaust sockets or buffer too many
+     * response bodies at once. Default 20.
+     */
+    concurrency?: number;
+    /**
+     * Maximum number of probes to run at once against a single host. Bounds the
+     * burst any one server sees, so a catalogue that declares many distributions on
+     * one host (e.g. a download endpoint per named graph) does not trip its rate
+     * limiter (HTTP 429). A probe whose host is at this cap waits while probes for
+     * other hosts proceed, so this never idles the global pool. Default 4.
+     */
+    perHostConcurrency?: number;
 }
 /**
  * Result of a network error during probing.
@@ -80,10 +123,25 @@ export type ProbeResultType = SparqlProbeResult | DataDumpProbeResult | NetworkE
  *
  * For SPARQL endpoints, issues the configured SPARQL query (default: a
  * minimal `SELECT`). For data dumps, issues `HEAD` (with a `GET` fallback
- * for small or unknown-size bodies
+ * for small or unknown-size bodies, reading only a bounded prefix so a large
+ * streamed dump is never downloaded in full).
  *
  * Returns a pure result object; never throws.
  */
 export declare function probe(distribution: Distribution, options?: ProbeOptions): Promise<ProbeResultType>;
+/**
+ * Probe many distributions concurrently, bounded by a global cap and a per-host
+ * cap, returning one result per input in input order. Like {@link probe}, this
+ * never throws: a probe that somehow fails is reported as a {@link NetworkError}
+ * in its slot.
+ *
+ * The per-host cap keeps the batch a polite client. Distributions sharing a host
+ * (by {@link Distribution.accessUrl}) contend for the same budget, so no single
+ * server is hit by the full global pool at once — the burst that trips a rate
+ * limiter (HTTP 429). When the next queued probe’s host is saturated it is
+ * skipped in favour of a later probe on a different host, so one busy host never
+ * idles the global pool (no head-of-line blocking).
+ */
+export declare function probeMany(distributions: readonly Distribution[], options?: ProbeManyOptions): Promise<ProbeResultType[]>;
 export {};
 //# sourceMappingURL=probe.d.ts.map
package/dist/probe.d.ts.map
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"probe.d.ts","sourceRoot":"","sources":["../src/probe.ts"],"names":[],"mappings":"AAAA,OAAO,EAAyB,YAAY,EAAE,MAAM,cAAc,CAAC;
|
|
1
|
+
{"version":3,"file":"probe.d.ts","sourceRoot":"","sources":["../src/probe.ts"],"names":[],"mappings":"AAAA,OAAO,EAAyB,YAAY,EAAE,MAAM,cAAc,CAAC;AAKnE;;GAEG;AACH,MAAM,WAAW,YAAY;IAC3B,0DAA0D;IAC1D,SAAS,CAAC,EAAE,MAAM,CAAC;IACnB;;;OAGG;IACH,OAAO,CAAC,EAAE,OAAO,CAAC;IAClB;;;;;OAKG;IACH,WAAW,CAAC,EAAE,MAAM,CAAC;IACrB;;;;;;;;;OASG;IACH,OAAO,CAAC,EAAE,MAAM,CAAC;IACjB;;;;;;;;;;OAUG;IACH,kBAAkB,CAAC,EAAE,OAAO,CAAC;IAC7B;;;;;;;;;OASG;IACH,qBAAqB,CAAC,EAAE,MAAM,CAAC;CAChC;AAED;;;GAGG;AACH,MAAM,WAAW,gBAAiB,SAAQ,YAAY;IACpD;;;;OAIG;IACH,WAAW,CAAC,EAAE,MAAM,CAAC;IACrB;;;;;;OAMG;IACH,kBAAkB,CAAC,EAAE,MAAM,CAAC;CAC7B;AAkCD;;GAEG;AACH,qBAAa,YAAY;aAEL,GAAG,EAAE,MAAM;aACX,OAAO,EAAE,MAAM;aACf,cAAc,EAAE,MAAM;gBAFtB,GAAG,EAAE,MAAM,EACX,OAAO,EAAE,MAAM,EACf,cAAc,EAAE,MAAM;CAEzC;AAED;;GAEG;AACH,uBAAe,WAAW;aAUN,GAAG,EAAE,MAAM;IAT7B,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,YAAY,EAAE,IAAI,GAAG,IAAI,CAAQ;IACjD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAC;IAC3C,SAAgB,aAAa,EAAE,MAAM,GAAG,IAAI,CAAC;IAC7C,SAAgB,QAAQ,EAAE,MAAM,EAAE,CAAM;IACxC,SAAgB,cAAc,EAAE,MAAM,CAAC;gBAGrB,GAAG,EAAE,MAAM,EAC3B,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;IAa9B,SAAS,IAAI,OAAO;CAO5B;AAqBD;;GAEG;AACH,qBAAa,iBAAkB,SAAQ,WAAW;IAChD;;;;;OAKG;IACH,SAAgB,oBAAoB,EAAE,SAAS,MAAM,EAAE,CAAC;gBAGtD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,oBAAoB,EAAE,MAAM,GAAG,SAAS,MAAM,EAAE,EAChD,aAAa,GAAE,MAAM,GAAG,IAAW;IAS5B,SAAS,IAAI,OAAO;CAQ9B;AAED;;GAEG;AACH,qBAAa,mBAAoB,SAAQ,WAAW;IAClD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAQ;gBAGhD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;CAQtC;AAED,MAAM,MAAM,eAAe,GACvB,iBAAiB,GACjB,mBAAmB,GACnB,YAAY,CAAC;AAIjB;;;;;;;;;GASG;AACH,wBAAsB,KAAK,CACzB,YAAY,EAAE,YAAY,EAC1B,OAAO,CAAC,EAAE,YAAY,GACrB,OAAO,CAAC,eAAe,CAAC,CAqD1B;AAED;;;;;;;;;;;;GAYG;AACH,wBAAsB,SAAS,CAC7B,aAAa,EAAE,SAAS,YAAY,EAAE,EACtC,OAAO,CAAC,EAAE,gBAAgB,GACzB,OAAO,CAAC,eAAe,EAAE,CAAC,CA4B5B"}
|
package/dist/probe.js
CHANGED
@@ -1,9 +1,32 @@
 import { compressionMediaTypes } from '@lde/dataset';
 import { rdfParser } from 'rdf-parse';
 import { Readable } from 'node:stream';
+import { createGunzip } from 'node:zlib';
 const DEFAULT_SPARQL_QUERY = 'SELECT * { ?s ?p ?o } LIMIT 1';
 const DEFAULT_TIMEOUT_MS = 5000;
 const DEFAULT_RETRIES = 2;
+const DEFAULT_PROBE_CONCURRENCY = 20;
+const DEFAULT_PROBE_PER_HOST_CONCURRENCY = 4;
+/**
+ * Default soft deadline for finding the first triple when content validation is
+ * on (capped at `timeoutMs`). Two seconds comfortably covers a static file
+ * server's first chunk while keeping the extra wait bounded on a slow,
+ * generate-on-the-fly endpoint.
+ */
+const DEFAULT_RDF_VALIDATION_BUDGET_MS = 2000;
+/** Sentinel: the validation budget elapsed before a triple surfaced. */
+const VALIDATION_TIMED_OUT = Symbol('rdf-validation-timed-out');
+/**
+ * Maximum number of body bytes the data-dump probe reads before it stops and
+ * releases the connection. Reachability needs only that the endpoint answered
+ * with a success status and produced bytes; a large dump must never be
+ * downloaded in full within the probe's timeout budget. 256 KiB comfortably
+ * surfaces the first RDF triple — the signal {@link validateBody} needs — while
+ * bounding the read regardless of the dump's true size, chunked transfer, or
+ * compression. Applied to both the raw read and, for a gzip body, the inflated
+ * output.
+ */
+const MAX_PROBE_BODY_BYTES = 256 * 1024;
 /** Base backoff between retries; the nth retry waits `n × base`. */
 const RETRY_BACKOFF_MS = 250;
 /**
@@ -107,7 +130,8 @@ export class DataDumpProbeResult extends ProbeResult {
  *
  * For SPARQL endpoints, issues the configured SPARQL query (default: a
  * minimal `SELECT`). For data dumps, issues `HEAD` (with a `GET` fallback
- * for small or unknown-size bodies
+ * for small or unknown-size bodies, reading only a bounded prefix so a large
+ * streamed dump is never downloaded in full).
  *
  * Returns a pure result object; never throws.
  */
@@ -147,6 +171,96 @@ export async function probe(distribution, options) {
     // real cost of a down endpoint.
     return new NetworkError(url, describeNetworkError(lastError), Math.round(performance.now() - overallStart));
 }
+/**
+ * Probe many distributions concurrently, bounded by a global cap and a per-host
+ * cap, returning one result per input in input order. Like {@link probe}, this
+ * never throws: a probe that somehow fails is reported as a {@link NetworkError}
+ * in its slot.
+ *
+ * The per-host cap keeps the batch a polite client. Distributions sharing a host
+ * (by {@link Distribution.accessUrl}) contend for the same budget, so no single
+ * server is hit by the full global pool at once — the burst that trips a rate
+ * limiter (HTTP 429). When the next queued probe’s host is saturated it is
+ * skipped in favour of a later probe on a different host, so one busy host never
+ * idles the global pool (no head-of-line blocking).
+ */
+export async function probeMany(distributions, options) {
+    // Clamp the budgets to a positive integer, mirroring how probe() treats an
+    // invalid retries value: a zero, negative, fractional, or NaN limit would
+    // otherwise stall the scheduler (no task ever starts, so the promise never
+    // resolves) or overrun the cap, so fall back to the default rather than trust
+    // the caller.
+    const globalLimit = positiveIntOrDefault(options?.concurrency, DEFAULT_PROBE_CONCURRENCY);
+    const perHostLimit = positiveIntOrDefault(options?.perHostConcurrency, DEFAULT_PROBE_PER_HOST_CONCURRENCY);
+    // Probes contend per host. An authority-less URL (e.g. urn:, file:) has an
+    // empty host, so it falls back to its full href and never shares a budget with
+    // an unrelated one.
+    const hostKeys = distributions.map((distribution) => distribution.accessUrl.host || distribution.accessUrl.href);
+    return mapHostLimited(distributions, hostKeys, globalLimit, perHostLimit, (distribution) => probe(distribution, options));
+}
+/**
+ * Coerce an optional concurrency budget to a usable value: a positive integer is
+ * taken as-is; undefined, zero, negative, fractional, or NaN falls back to the
+ * default. Matches probe()’s treatment of an invalid retries value.
+ */
+function positiveIntOrDefault(value, fallback) {
+    return value !== undefined && Number.isInteger(value) && value >= 1
+        ? value
+        : fallback;
+}
+/**
+ * Run `task` over `items` with two concurrency caps — a global cap and a per-host
+ * cap keyed by `hostKeys[index]` — resolving to results in input order. When the
+ * next queued item’s host is at the per-host cap it is skipped for a later item on
+ * a different host, so a saturated host never idles the global pool (no head-of-line
+ * blocking); the skipped host always has a task in flight, whose completion re-runs
+ * the scheduler, so the queue always drains. `task` must not reject — callers wrap
+ * failures into a result value — as a rejection would leave the promise pending.
+ */
+function mapHostLimited(items, hostKeys, globalLimit, perHostLimit, task) {
+    const results = new Array(items.length);
+    const perHostInFlight = new Map();
+    const pending = items.map((_unused, index) => index);
+    let globalInFlight = 0;
+    let settledCount = 0;
+    const adjustHost = (host, delta) => {
+        perHostInFlight.set(host, (perHostInFlight.get(host) ?? 0) + delta);
+    };
+    return new Promise((resolve) => {
+        const schedule = () => {
+            let cursor = 0;
+            while (cursor < pending.length && globalInFlight < globalLimit) {
+                const index = pending[cursor];
+                const host = hostKeys[index];
+                if ((perHostInFlight.get(host) ?? 0) >= perHostLimit) {
+                    cursor++; // Host saturated; leave it queued and try a later, different host.
+                    continue;
+                }
+                pending.splice(cursor, 1);
+                globalInFlight++;
+                adjustHost(host, 1);
+                void task(items[index]).then((result) => {
+                    results[index] = result;
+                    globalInFlight--;
+                    adjustHost(host, -1);
+                    settledCount++;
+                    if (settledCount === items.length) {
+                        resolve(results);
+                    }
+                    else {
+                        schedule();
+                    }
+                });
+                // pending[cursor] now holds the next queued item; do not advance cursor.
+            }
+        };
+        schedule();
+        // Resolve immediately when there is nothing to settle (empty input); a
+        // non-empty run resolves via the task completion above.
+        if (settledCount === items.length)
+            resolve(results);
+    });
+}
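The "no head-of-line blocking" behaviour of `mapHostLimited` comes down to one decision per scheduling pass: which queued item may start next. The sketch below isolates that decision as a synchronous function so it can be checked without any network or timers. It is a simplified restatement, not the package's code; `pickNext` is a hypothetical name, and the real scheduler additionally checks the global cap and mutates its queue in place.

```typescript
// Hypothetical, simplified restatement of one scheduling decision:
// return the position in `pending` of the first item whose host is below
// the per-host cap, or -1 when every queued host is saturated.
function pickNext(
  pending: readonly number[],
  hostKeys: readonly string[],
  perHostInFlight: ReadonlyMap<string, number>,
  perHostLimit: number,
): number {
  for (let cursor = 0; cursor < pending.length; cursor++) {
    const index = pending[cursor];
    const host = hostKeys[index];
    if ((perHostInFlight.get(host) ?? 0) < perHostLimit) {
      return cursor;
    }
    // Host saturated: leave the item queued and look at a later one.
  }
  return -1;
}

// Items 0 to 2 target host "a" and item 3 targets host "b". Host "a" already
// has two probes in flight, which equals the cap, so item 2 is skipped and
// item 3 (position 1 in the queue) is picked.
const exampleHostKeys = ['a', 'a', 'a', 'b'];
const exampleInFlight = new Map([['a', 2]]);
const position = pickNext([2, 3], exampleHostKeys, exampleInFlight, 2);
```

When `pickNext` returns -1 nothing can start, which is safe in the real scheduler because a saturated host always has a task in flight whose completion re-runs the scheduling pass.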
 function delay(milliseconds) {
     return new Promise((resolve) => setTimeout(resolve, milliseconds));
 }
@@ -186,6 +300,9 @@ function resolveOptions(options) {
         retries: retries === undefined || !Number.isInteger(retries)
             ? DEFAULT_RETRIES
             : Math.max(0, retries),
+        validateRdfContent: options?.validateRdfContent ?? false,
+        rdfValidationBudgetMs: options?.rdfValidationBudgetMs ??
+            Math.min(options?.timeoutMs ?? DEFAULT_TIMEOUT_MS, DEFAULT_RDF_VALIDATION_BUDGET_MS),
     };
 }
 /**
@@ -350,30 +467,201 @@ async function probeDataDump(url, distribution, options, authHeaders, start) {
         method: 'HEAD',
         ...requestOptions,
     });
-
-
-    //
-    //
-    if (
-
-
-
-
-
-
-
-
-
-        const responseTimeMs = Math.round(performance.now() - start);
-        const result = new DataDumpProbeResult(url, getResponse, responseTimeMs, failureReason);
-        checkContentTypeMismatch(result, distribution);
-        return result;
+    // Validate body content only when asked to and the distribution declares an
+    // RDF media type; otherwise the probe is reachability-only and never reads a
+    // body — which keeps it from forcing a slow, generate-on-the-fly endpoint to
+    // start producing its export.
+    if (options.validateRdfContent &&
+        isDeclaredRdf(distribution) &&
+        isHttpSuccess(headResponse)) {
+        const { response, failureReason } = await validateDumpBody(url, headers, options, headResponse);
+        return finalizeDataDump(url, distribution, response, start, failureReason);
+    }
+    // Reachability only. A successful HEAD is enough; otherwise confirm with a
+    // body-less GET, which rescues servers that reject or do not implement HEAD.
+    if (isHttpSuccess(headResponse)) {
+        return finalizeDataDump(url, distribution, headResponse, start, null);
     }
+    const getResponse = await fetch(url, { method: 'GET', ...requestOptions });
+    await getResponse.body?.cancel();
+    return finalizeDataDump(url, distribution, getResponse, start, null);
+}
+/** Whether an HTTP response carries a success (2xx/3xx) status. */
+function isHttpSuccess(response) {
+    return response.status >= 200 && response.status < 400;
+}
+/** Whether the distribution declares an RDF serialization as its media type. */
+function isDeclaredRdf(distribution) {
+    const declared = distribution.mimeType?.toLowerCase();
+    return declared !== undefined && rdfContentTypes.includes(declared);
+}
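`isDeclaredRdf` depends on an `rdfContentTypes` list that is not visible in this part of the diff. The sketch below fills it with the media types the README lists under "Content validation (opt-in)"; that list is an assumption about the constant's contents, and the `{ mimeType?: string }` parameter type is a hypothetical stand-in for the package's `Distribution`.

```typescript
// Assumption: the media types named in the README's content-validation
// section. The package's own rdfContentTypes constant is not shown here.
const rdfContentTypes: readonly string[] = [
  'text/turtle',
  'application/n-triples',
  'application/n-quads',
  'application/trig',
  'text/n3',
  'application/ld+json',
  'application/rdf+xml',
];

// Hypothetical stand-in for Distribution: only mimeType is needed here.
function isDeclaredRdf(distribution: { mimeType?: string }): boolean {
  // Case-insensitive, exact match against the declared media type.
  const declared = distribution.mimeType?.toLowerCase();
  return declared !== undefined && rdfContentTypes.includes(declared);
}
```

In this sketch `Text/Turtle` matches, while `text/csv` and a distribution with no declared type do not. Because the comparison is exact, a declared value carrying parameters such as `text/turtle; charset=utf-8` would not match here; whether the package normalises such values elsewhere cannot be told from this diff.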
|
|
498
|
+
/** Build a DataDumpProbeResult and attach any Content-Type-mismatch warning. */
|
|
499
|
+
function finalizeDataDump(url, distribution, response, start, failureReason) {
|
|
372
500
|
const responseTimeMs = Math.round(performance.now() - start);
|
|
373
|
-
const result = new DataDumpProbeResult(url,
|
|
501
|
+
const result = new DataDumpProbeResult(url, response, responseTimeMs, failureReason);
|
|
374
502
|
checkContentTypeMismatch(result, distribution);
|
|
375
503
|
return result;
|
|
376
504
|
}
|
|
505
|
+
/**
|
|
506
|
+
* GET the dump and validate that its body carries a triple, but only for as long
|
|
507
|
+
* as the validation budget allows. Reachability is already settled by the prior
|
|
508
|
+
* HEAD, so any shortfall — a budget that elapses before a triple, a read error,
|
|
509
|
+
* a GET that cannot start — yields a `null` failureReason (reachable,
|
|
510
|
+
* unvalidated), never a failure. Returns the response to draw metadata from
|
|
511
|
+
* (the GET, or the HEAD when the GET could not start) alongside that reason.
|
|
512
|
+
*/
|
|
513
|
+
async function validateDumpBody(url, headers, options, headResponse) {
|
|
514
|
+
const budgetMs = Math.min(options.rdfValidationBudgetMs, options.timeoutMs);
|
|
515
|
+
// Aborting on budget expiry stops a slow endpoint from streaming on in the
|
|
516
|
+
// background once we have given up waiting for a triple.
|
|
517
|
+
const budgetController = new AbortController();
|
|
518
|
+
let getResponse;
|
|
519
|
+
try {
|
|
520
|
+
getResponse = await fetch(url, {
|
|
521
|
+
method: 'GET',
|
|
522
|
+
headers,
|
|
523
|
+
signal: AbortSignal.any([
|
|
524
|
+
AbortSignal.timeout(options.timeoutMs),
|
|
525
|
+
budgetController.signal,
|
|
526
|
+
]),
|
|
527
|
+
});
|
|
528
|
+
}
|
|
529
|
+
catch {
|
|
530
|
+
// The GET could not even return headers; the HEAD already proved the
|
|
531
|
+
// distribution reachable, so report it unvalidated rather than down.
|
|
532
|
+
return { response: headResponse, failureReason: null };
|
|
533
|
+
}
|
|
534
|
+
if (!isHttpSuccess(getResponse)) {
|
|
535
|
+
await getResponse.body?.cancel();
|
|
536
|
+
return { response: getResponse, failureReason: null };
|
|
537
|
+
}
|
|
538
|
+
const validation = (async () => {
|
|
539
|
+
const bounded = await readBoundedBody(getResponse, MAX_PROBE_BODY_BYTES);
|
|
540
|
+
const { text, truncated, corrupt } = await decodeProbeBody(bounded);
|
|
541
|
+
return corrupt
|
|
542
|
+
? 'Distribution is not valid gzip'
|
|
543
|
+
: await validateBody(text, getResponse.headers.get('Content-Type'), url, budgetMs, truncated);
|
|
544
|
+
})().catch(() => null);
|
|
545
|
+
let budgetTimer;
|
|
546
|
+
const budgetExpiry = new Promise((resolve) => {
|
|
547
|
+
budgetTimer = setTimeout(() => {
|
|
548
|
+
budgetController.abort();
|
|
549
|
+
resolve(VALIDATION_TIMED_OUT);
|
|
550
|
+
}, budgetMs);
|
|
551
|
+
});
|
|
552
|
+
try {
|
|
553
|
+
const outcome = await Promise.race([validation, budgetExpiry]);
|
|
554
|
+
return {
|
|
555
|
+
response: getResponse,
|
|
556
|
+
failureReason: outcome === VALIDATION_TIMED_OUT ? null : outcome,
|
|
557
|
+
};
|
|
558
|
+
}
|
|
559
|
+
finally {
|
|
560
|
+
clearTimeout(budgetTimer);
|
|
561
|
+
}
|
|
562
|
+
}
|
|
563
|
+
/**
|
|
564
|
+
* Read at most `maxBytes` from a response body, then cancel the stream to free
|
|
565
|
+
* the underlying connection. Returns the bytes read and whether the body was
|
|
566
|
+
* longer than the cap (`truncated`), so the caller can tell a complete, small
|
|
567
|
+
* body — whose emptiness or parse errors are meaningful — from a deliberately
|
|
568
|
+
* cut-off prefix of a large one, where only the presence of content is
|
|
569
|
+
* conclusive. This is what keeps the probe from downloading a multi-hundred-MB
|
|
570
|
+
* streamed dump in full just to confirm it is reachable.
|
|
571
|
+
*/
|
|
572
|
+
async function readBoundedBody(response, maxBytes) {
|
|
573
|
+
const stream = response.body;
|
|
574
|
+
if (stream === null) {
|
|
575
|
+
return { bytes: new Uint8Array(0), truncated: false };
|
|
576
|
+
}
|
|
577
|
+
const chunks = [];
|
|
578
|
+
let total = 0;
|
|
579
|
+
let truncated = false;
|
|
580
|
+
// Breaking out of `for await` cancels the stream, which stops any further
|
|
581
|
+
// download and releases the underlying connection — so a large dump is never
|
|
582
|
+
// pulled in full once we have the prefix we need.
|
|
583
|
+
for await (const chunk of stream) {
|
|
584
|
+
chunks.push(chunk);
|
|
585
|
+
total += chunk.length;
|
|
586
|
+
if (total >= maxBytes) {
|
|
587
|
+
truncated = true;
|
|
588
|
+
break;
|
|
589
|
+
}
|
|
590
|
+
}
|
|
591
|
+
return { bytes: Buffer.concat(chunks), truncated };
|
|
592
|
+
}
|
|
593
|
+
/**
|
|
594
|
+
* Decode a bounded body to text for RDF validation, inflating it first when it
|
|
595
|
+
* is a gzip stream that `fetch` did not transparently decompress — e.g. a `.gz`
|
|
596
|
+
* data dump served as-is, or one labelled with a non-standard Content-Encoding
|
|
597
|
+
* (`application/gzip`) that undici does not recognise as a content coding.
|
|
598
|
+
* Detection is by the gzip magic on the delivered bytes, so a body that `fetch`
|
|
599
|
+
* already inflated (a standard `Content-Encoding: gzip`) is passed through
|
|
600
|
+
* untouched. A truncated gzip tail is expected — we only read a prefix — and
|
|
601
|
+
* inflates cleanly up to the cut, so it is never mistaken for corruption.
|
|
602
|
+
*/
|
|
603
|
+
async function decodeProbeBody(bounded) {
|
|
604
|
+
if (!isGzip(bounded.bytes)) {
|
|
605
|
+
return {
|
|
606
|
+
text: decodeUtf8(bounded.bytes),
|
|
607
|
+
truncated: bounded.truncated,
|
|
608
|
+
corrupt: false,
|
|
609
|
+
};
|
|
610
|
+
}
|
|
611
|
+
// The compressed body is complete only when the raw read was not itself cut
|
|
612
|
+
// off: a gzip error on a complete body is genuine corruption, on a prefix we
|
|
613
|
+
// cut it is just the dropped tail.
|
|
614
|
+
const inflated = await gunzipPrefix(bounded.bytes, MAX_PROBE_BODY_BYTES, !bounded.truncated);
|
|
615
|
+
return {
|
|
616
|
+
text: decodeUtf8(inflated.bytes),
|
|
617
|
+
truncated: bounded.truncated || inflated.truncated,
|
|
618
|
+
corrupt: inflated.corrupt,
|
|
619
|
+
};
|
|
620
|
+
}
|
|
621
|
+
/** Whether the bytes begin with the gzip magic number (RFC 1952 §2.3.1). */
|
|
622
|
+
function isGzip(bytes) {
|
|
623
|
+
return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
|
|
624
|
+
}
|
|
625
|
+
/**
|
|
626
|
+
* Decode bytes as UTF-8 without throwing: an incomplete multi-byte sequence at
|
|
627
|
+
* the truncation boundary is replaced rather than fatal, since the RDF parser
|
|
628
|
+
* only needs the leading, intact portion to find the first triple.
|
|
629
|
+
*/
|
|
630
|
+
function decodeUtf8(bytes) {
|
|
631
|
+
return new TextDecoder('utf-8', { fatal: false }).decode(bytes);
|
|
632
|
+
}
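
The non-fatal decoding described in the comment above can be illustrated in isolation. This is a minimal sketch, not part of the package: `cutMidCharacter` is a hypothetical sample input, chosen so that a two-byte character loses its second byte at the cut.

```javascript
// Sketch only: `cutMidCharacter` is a hypothetical sample, not package code.
// 'aé' encodes to 0x61 0xC3 0xA9; keeping two bytes cuts 'é' in half.
const cutMidCharacter = Buffer.from('aé', 'utf8').subarray(0, 2);

// Non-fatal decoding substitutes U+FFFD for the incomplete trailing sequence.
const lenient = new TextDecoder('utf-8', { fatal: false }).decode(cutMidCharacter);

// Fatal decoding rejects the same bytes with a TypeError.
let strictThrew = false;
try {
  new TextDecoder('utf-8', { fatal: true }).decode(cutMidCharacter);
} catch (error) {
  strictThrew = error instanceof TypeError;
}

console.log(JSON.stringify(lenient), strictThrew);
```

The intact leading text survives in the lenient result, which is all a parser looking for the first triple needs.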
|
|
633
|
+
/**
|
|
634
|
+
* Inflate up to `maxBytes` of output from a gzip prefix, stopping once the cap
|
|
635
|
+
* is reached or the input runs out. `inputComplete` says whether the caller
|
|
636
|
+
* handed us the whole compressed body (true) or a prefix it had already cut
|
|
637
|
+
* (false). An inflate error therefore means different things: on a complete body
|
|
638
|
+
* the gzip is genuinely corrupt; on a cut prefix it is just the dropped tail, so
|
|
639
|
+
* whatever inflated cleanly is reported as a (truncated) partial inflate.
|
|
640
|
+
*/
|
|
641
|
+
function gunzipPrefix(bytes, maxBytes, inputComplete) {
|
|
642
|
+
return new Promise((resolve) => {
|
|
643
|
+
const gunzip = createGunzip();
|
|
644
|
+
const chunks = [];
|
|
645
|
+
let total = 0;
|
|
646
|
+
// `resolve` and `destroy` are both idempotent, so the first outcome wins and
|
|
647
|
+
// any later event (e.g. a premature-close error emitted by `destroy`) is a
|
|
648
|
+
// harmless no-op — no `settled` guard needed.
|
|
649
|
+
function finish(outcome) {
|
|
650
|
+
gunzip.destroy();
|
|
651
|
+
resolve({ bytes: Buffer.concat(chunks), ...outcome });
|
|
652
|
+
}
|
|
653
|
+
gunzip.on('data', (chunk) => {
|
|
654
|
+
chunks.push(chunk);
|
|
655
|
+
total += chunk.length;
|
|
656
|
+
if (total >= maxBytes) {
|
|
657
|
+
finish({ truncated: true, corrupt: false });
|
|
658
|
+
}
|
|
659
|
+
});
|
|
660
|
+
gunzip.on('error', () => finish({ truncated: !inputComplete, corrupt: inputComplete }));
|
|
661
|
+
gunzip.on('end', () => finish({ truncated: false, corrupt: false }));
|
|
662
|
+
gunzip.end(bytes);
|
|
663
|
+
});
|
|
664
|
+
}
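
The prefix-inflate idea can also be tried synchronously. The sketch below is an assumption-laden alternative, not the streaming approach in the diff: it swaps in `zlib.gunzipSync` with `finishFlush: Z_SYNC_FLUSH`, the Node.js option for decompressing truncated input, and it applies no output cap. The names `original`, `prefix`, `startsWithGzipMagic` and `inflatePrefixSync` are hypothetical.

```javascript
const zlib = require('node:zlib');

// Hypothetical sample data: 1000 N-Triples-like lines, gzipped, then cut in half
// to imitate a bounded read that stopped before the end of the compressed body.
const original = Array.from(
  { length: 1000 },
  (_, i) => `<http://example.org/s${i}> <http://example.org/p> "${i}" .`,
).join('\n');
const compressed = zlib.gzipSync(Buffer.from(original, 'utf8'));
const prefix = compressed.subarray(0, Math.floor(compressed.length / 2));

// Same check as `isGzip` in the diff: the two-byte gzip magic number.
function startsWithGzipMagic(bytes) {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Z_SYNC_FLUSH as the finishing flush makes zlib return whatever inflated
// cleanly instead of throwing on the missing tail. Unlike `gunzipPrefix`,
// this sketch does not limit the size of the output.
function inflatePrefixSync(bytes) {
  return zlib.gunzipSync(bytes, { finishFlush: zlib.constants.Z_SYNC_FLUSH });
}

const partial = inflatePrefixSync(prefix).toString('utf8');
console.log(startsWithGzipMagic(prefix), partial.length, original.length);
```

The streaming version in the diff has one advantage this sketch lacks: it can stop as soon as `maxBytes` of output has been produced, which matters for highly compressible dumps.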
|
|
377
665
|
// The RDF serializations whose bodies we parse to confirm they carry triples. A
|
|
378
666
|
// non-empty body in one of these formats that yields zero triples — an empty
|
|
379
667
|
// graph such as a JSON-LD `{}`, an `<rdf:RDF/>`, or prefix-only Turtle — is a
|
|
@@ -389,9 +677,21 @@ const rdfContentTypes = [
|
|
|
389
677
|
'application/ld+json',
|
|
390
678
|
'application/rdf+xml',
|
|
391
679
|
];
|
|
392
|
-
|
|
680
|
+
// Serializations a streaming parser cannot validate from a truncated prefix.
|
|
681
|
+
// The line/statement-oriented formats (N-Triples, N-Quads, Turtle, TriG, N3) and
|
|
682
|
+
// SAX-based RDF/XML all yield their first triple from the opening chunk, but
|
|
683
|
+
// JSON-LD is a single JSON value whose parser emits nothing until the whole
|
|
684
|
+
// document closes — a truncated JSON-LD body parses to an ‘unclosed document’
|
|
685
|
+
// error, never a triple. So a body in one of these formats can only be
|
|
686
|
+
// validated if it fits within the read cap in full; beyond that it is
|
|
687
|
+
// inconclusive, and we must not download it in full to find out.
|
|
688
|
+
const nonStreamableRdfContentTypes = ['application/ld+json'];
|
|
689
|
+
async function validateBody(body, contentType, baseIRI, timeoutMs, truncated) {
|
|
393
690
|
if (body.length === 0) {
|
|
394
|
-
|
|
691
|
+
// A complete, empty body is a faulty distribution; an empty *prefix* (a
|
|
692
|
+
// truncated read that yielded no bytes, e.g. a corrupt gzip header) is
|
|
693
|
+
// inconclusive — the endpoint answered, we just could not validate content.
|
|
694
|
+
return truncated ? null : 'Distribution is empty';
|
|
395
695
|
}
|
|
396
696
|
// Media types are case-insensitive (RFC 9110 §8.3.1), so normalise before
|
|
397
697
|
// matching the lower-case allow-list — a server sending `Application/LD+JSON`
|
|
@@ -400,7 +700,13 @@ async function validateBody(body, contentType, baseIRI, timeoutMs) {
|
|
|
400
700
|
if (!serialization || !rdfContentTypes.includes(serialization)) {
|
|
401
701
|
return null;
|
|
402
702
|
}
|
|
403
|
-
|
|
703
|
+
if (truncated && nonStreamableRdfContentTypes.includes(serialization)) {
|
|
704
|
+
// A bounded prefix of a non-streamable serialization (JSON-LD) can never
|
|
705
|
+
// yield a triple, so skip the doomed parse and report it inconclusive — only
|
|
706
|
+
// a complete document, small enough to fit the read cap, can be validated.
|
|
707
|
+
return null;
|
|
708
|
+
}
|
|
709
|
+
const outcome = await classifyRdfBody(body, serialization, baseIRI, timeoutMs, truncated);
|
|
404
710
|
switch (outcome.type) {
|
|
405
711
|
case 'empty':
|
|
406
712
|
return 'Distribution contains no RDF triples';
|
|
@@ -422,8 +728,13 @@ async function validateBody(body, contentType, baseIRI, timeoutMs) {
|
|
|
422
728
|
* on expiry — and likewise when a remote `@context` is unreachable — the outcome
|
|
423
729
|
* is 'inconclusive', so a valid distribution is never flagged faulty for a
|
|
424
730
|
* context host's failure. `baseIRI` resolves any relative IRIs in the document.
|
|
731
|
+
*
|
|
732
|
+
* When `truncated` is true the body is only a bounded prefix of a larger one, so
|
|
733
|
+
* only finding a triple ('hasTriples') is conclusive: a parse error at the cut
|
|
734
|
+
* or a clean end with no triple yet means we did not read far enough, not that
|
|
735
|
+
* the distribution is empty or malformed, and is reported as 'inconclusive'.
|
|
425
736
|
*/
|
|
426
|
-
function classifyRdfBody(body, contentType, baseIRI, timeoutMs) {
|
|
737
|
+
function classifyRdfBody(body, contentType, baseIRI, timeoutMs, truncated) {
|
|
427
738
|
return new Promise((resolve) => {
|
|
428
739
|
const quads = rdfParser.parse(Readable.from([body]), {
|
|
429
740
|
contentType,
|
|
@@ -441,10 +752,10 @@ function classifyRdfBody(body, contentType, baseIRI, timeoutMs) {
|
|
|
441
752
|
}
|
|
442
753
|
quads
|
|
443
754
|
.on('data', () => settle({ type: 'hasTriples' }))
|
|
444
|
-
.on('error', (error) => settle(isRemoteContextError(error)
|
|
755
|
+
.on('error', (error) => settle(truncated || isRemoteContextError(error)
|
|
445
756
|
? { type: 'inconclusive' }
|
|
446
757
|
: { type: 'parseError', message: error.message }))
|
|
447
|
-
.on('end', () => settle({ type: 'empty' }));
|
|
758
|
+
.on('end', () => settle(truncated ? { type: 'inconclusive' } : { type: 'empty' }));
|
|
448
759
|
});
|
|
449
760
|
}
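
The way `truncated` changes the three parser handlers can be summarised as a small decision function. This is a hypothetical helper written for illustration, assuming only the outcome types visible in the diff (`hasTriples`, `inconclusive`, `parseError`, `empty`); `outcomeFor` and its option names do not exist in the package.

```javascript
// Hypothetical helper: maps a parser event to an outcome, mirroring the
// 'data' / 'error' / 'end' handlers shown in the diff.
function outcomeFor(event, { truncated, remoteContextError = false, message = '' }) {
  switch (event) {
    case 'data':
      // A triple is conclusive whether or not the body was cut.
      return { type: 'hasTriples' };
    case 'error':
      // On a prefix, an error may just be the cut; a remote @context failure
      // is never the distribution's fault.
      return truncated || remoteContextError
        ? { type: 'inconclusive' }
        : { type: 'parseError', message };
    case 'end':
      // A clean end without triples only proves emptiness on a complete body.
      return truncated ? { type: 'inconclusive' } : { type: 'empty' };
    default:
      throw new Error(`Unknown parser event: ${event}`);
  }
}

console.log(outcomeFor('end', { truncated: true }).type);
console.log(outcomeFor('end', { truncated: false }).type);
```

Only the first event counts in the real code, since `settle` resolves the promise once; the helper models a single event for the same reason.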
|
|
450
761
|
/**
|