@lde/distribution-probe 0.1.13 → 0.2.0
- package/README.md +16 -3
- package/dist/probe.d.ts +25 -1
- package/dist/probe.d.ts.map +1 -1
- package/dist/probe.js +245 -26
- package/package.json +1 -1
package/README.md
CHANGED
@@ -24,11 +24,24 @@ Sends `POST` with the configured query (default `SELECT * { ?s ?p ?o } LIMIT 1`)
 
 ### Data dumps
 
-
+#### Reachability (the default)
+
+Sends `HEAD` with `Accept: <distribution.mimeType>` and `Accept-Encoding: identity`. A successful `HEAD` settles reachability and gathers metadata (`Content-Length`, `Last-Modified`) **without reading the body**. If `HEAD` is unsuccessful — e.g. a server that returns `405`/`501` because it does not implement `HEAD` — the probe falls back to a body-less `GET` to confirm the endpoint is up. The body is never downloaded.
+
+This is deliberately cheap: reading a body forces a slow, generate-on-the-fly endpoint (a TriplyDB dump, a SPARQL `CONSTRUCT` export) to start producing its export, which a `HEAD` does not.
 
 - **Content-Type is checked as a soft warning, not a hard failure.** If the server’s Content-Type disagrees with the distribution’s declared `mimeType`, a message is appended to `result.warnings` but `isSuccess()` stays `true`. Compression wrappers (`application/gzip`, `application/x-gzip`, `application/octet-stream`) are skipped so a gzipped Turtle file doesn’t trigger a warning.
-
-
+
+#### Content validation (opt-in)
+
+Set `validateRdfContent: true` to additionally confirm that a dump actually carries RDF. It applies only to distributions whose **declared** `mimeType` is an RDF serialization (`text/turtle`, `application/n-triples`, `application/n-quads`, `application/trig`, `text/n3`, `application/ld+json`, `application/rdf+xml`); non-RDF and undeclared-type distributions stay reachability-only.
+
+When on, the probe `GET`s the dump — **regardless of size** — and reads only a **bounded prefix** (256 KiB), never the whole body:
+
+- It settles on the **first triple** and stops, so a large dump is validated from its opening chunk. The line/statement-oriented serializations and RDF/XML stream a triple out of the prefix; **JSON-LD is not streamable** (its parser needs the whole document), so a JSON-LD dump is only validated when it fits the prefix in full — a larger one is reported reachable but unvalidated.
+- A gzip body that `fetch` did not decompress (a `.gz` dump, or one served with a non-standard `Content-Encoding`) is inflated in-place; a gzip that will not inflate when the **complete** compressed body was read fails as `Distribution is not valid gzip`.
+- Empty bodies (`Distribution is empty`) and bodies that parse to **zero** triples (`Distribution contains no RDF triples`) fail the probe. A deliberately truncated prefix is never mistaken for either — it is inconclusive.
+- **Reachability is settled by the response, so validation never turns a reachable dump into a failure.** If no triple surfaces within `rdfValidationBudgetMs` (default `min(timeoutMs, 2000)`, clamped to `timeoutMs`), the read is aborted and the distribution is reported reachable but unvalidated (no `failureReason`). This bounds the extra latency content validation adds on slow, generate-on-the-fly endpoints.
 
 ### Network errors
 
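Review note on the README hunk above: read together with `probeDataDump` in `package/dist/probe.js`, the choice between the three request paths can be restated as a small pure function. This is only a sketch for orientation. The function name `planDataDumpProbe` and its return labels are hypothetical and do not exist in the package; the status range and the media-type list are taken from this diff.

```javascript
// Hypothetical helper: the name and the returned labels are illustrative only.
const RDF_TYPES = [
  'text/turtle', 'application/n-triples', 'application/n-quads',
  'application/trig', 'text/n3', 'application/ld+json', 'application/rdf+xml',
];

// Same rule as isHttpSuccess in the diff: 2xx and 3xx count as success.
function isSuccessStatus(status) {
  return status >= 200 && status < 400;
}

function planDataDumpProbe({ validateRdfContent = false, mimeType, headStatus }) {
  const declaredRdf =
    typeof mimeType === 'string' && RDF_TYPES.includes(mimeType.toLowerCase());
  // Validation needs the opt-in flag, a declared RDF type, and a successful HEAD.
  if (validateRdfContent && declaredRdf && isSuccessStatus(headStatus)) {
    return 'get-and-validate-prefix';
  }
  // Otherwise a successful HEAD settles reachability on its own.
  if (isSuccessStatus(headStatus)) {
    return 'head-only';
  }
  // HEAD rejected (for example 405 or 501): confirm with a GET whose body is cancelled.
  return 'get-without-body';
}
```

One consequence that is easy to miss: as far as this diff shows, a server that rejects `HEAD` takes the body-less `GET` path even when `validateRdfContent` is `true`, so its content is not validated.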
package/dist/probe.d.ts
CHANGED
@@ -28,6 +28,29 @@ export interface ProbeOptions {
      * the default; negative values are clamped to `0`.
      */
     retries?: number;
+    /**
+     * Validate the body content of data-dump distributions whose declared media
+     * type is an RDF serialization, by reading a bounded prefix and confirming it
+     * carries at least one triple. When `false` (the default) a data dump is only
+     * checked for reachability (a `HEAD`, with a body-less `GET` fallback if `HEAD`
+     * is unsupported) and its body is never read. When `true`, every declared-RDF
+     * dump — regardless of size — is fetched and validated; non-RDF and
+     * undeclared-type distributions are still reachability-only. Validation is
+     * opt-in because reading a body forces a slow, generate-on-the-fly endpoint to
+     * start producing its export, which a `HEAD` does not.
+     */
+    validateRdfContent?: boolean;
+    /**
+     * Soft deadline, in milliseconds, for finding the first triple when
+     * {@link validateRdfContent} is on. Reachability is settled by the response
+     * itself; if no triple has surfaced within this budget the read is aborted and
+     * the distribution is reported reachable but unvalidated (no `failureReason`),
+     * never failed. This bounds the extra latency content validation adds on slow,
+     * generate-on-the-fly endpoints. Clamped to {@link timeoutMs} (a longer budget
+     * is meaningless — the request times out first). Defaults to
+     * `min(timeoutMs, 2000)`.
+     */
+    rdfValidationBudgetMs?: number;
 }
 /**
  * Result of a network error during probing.
@@ -80,7 +103,8 @@ export type ProbeResultType = SparqlProbeResult | DataDumpProbeResult | NetworkE
  *
  * For SPARQL endpoints, issues the configured SPARQL query (default: a
  * minimal `SELECT`). For data dumps, issues `HEAD` (with a `GET` fallback
- * for small or unknown-size bodies
+ * for small or unknown-size bodies, reading only a bounded prefix so a large
+ * streamed dump is never downloaded in full).
  *
  * Returns a pure result object; never throws.
  */
package/dist/probe.d.ts.map
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"probe.d.ts","sourceRoot":"","sources":["../src/probe.ts"],"names":[],"mappings":"AAAA,OAAO,EAAyB,YAAY,EAAE,MAAM,cAAc,CAAC;
|
|
1
|
+
{"version":3,"file":"probe.d.ts","sourceRoot":"","sources":["../src/probe.ts"],"names":[],"mappings":"AAAA,OAAO,EAAyB,YAAY,EAAE,MAAM,cAAc,CAAC;AAKnE;;GAEG;AACH,MAAM,WAAW,YAAY;IAC3B,0DAA0D;IAC1D,SAAS,CAAC,EAAE,MAAM,CAAC;IACnB;;;OAGG;IACH,OAAO,CAAC,EAAE,OAAO,CAAC;IAClB;;;;;OAKG;IACH,WAAW,CAAC,EAAE,MAAM,CAAC;IACrB;;;;;;;;;OASG;IACH,OAAO,CAAC,EAAE,MAAM,CAAC;IACjB;;;;;;;;;;OAUG;IACH,kBAAkB,CAAC,EAAE,OAAO,CAAC;IAC7B;;;;;;;;;OASG;IACH,qBAAqB,CAAC,EAAE,MAAM,CAAC;CAChC;AAgCD;;GAEG;AACH,qBAAa,YAAY;aAEL,GAAG,EAAE,MAAM;aACX,OAAO,EAAE,MAAM;aACf,cAAc,EAAE,MAAM;gBAFtB,GAAG,EAAE,MAAM,EACX,OAAO,EAAE,MAAM,EACf,cAAc,EAAE,MAAM;CAEzC;AAED;;GAEG;AACH,uBAAe,WAAW;aAUN,GAAG,EAAE,MAAM;IAT7B,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,YAAY,EAAE,IAAI,GAAG,IAAI,CAAQ;IACjD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAC;IAC3C,SAAgB,aAAa,EAAE,MAAM,GAAG,IAAI,CAAC;IAC7C,SAAgB,QAAQ,EAAE,MAAM,EAAE,CAAM;IACxC,SAAgB,cAAc,EAAE,MAAM,CAAC;gBAGrB,GAAG,EAAE,MAAM,EAC3B,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;IAa9B,SAAS,IAAI,OAAO;CAO5B;AAqBD;;GAEG;AACH,qBAAa,iBAAkB,SAAQ,WAAW;IAChD;;;;;OAKG;IACH,SAAgB,oBAAoB,EAAE,SAAS,MAAM,EAAE,CAAC;gBAGtD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,oBAAoB,EAAE,MAAM,GAAG,SAAS,MAAM,EAAE,EAChD,aAAa,GAAE,MAAM,GAAG,IAAW;IAS5B,SAAS,IAAI,OAAO;CAQ9B;AAED;;GAEG;AACH,qBAAa,mBAAoB,SAAQ,WAAW;IAClD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAQ;gBAGhD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;CAQtC;AAED,MAAM,MAAM,eAAe,GACvB,iBAAiB,GACjB,mBAAmB,GACnB,YAAY,CAAC;AAIjB;;;;;;;;;GASG;AACH,wBAAsB,KAAK,CACzB,YAAY,EAAE,YAAY,EAC1B,OAAO,CAAC,EAAE,YAAY,GACrB,OAAO,CAAC,eAAe,CAAC,CAqD1B"}
|
package/dist/probe.js
CHANGED
@@ -1,9 +1,30 @@
 import { compressionMediaTypes } from '@lde/dataset';
 import { rdfParser } from 'rdf-parse';
 import { Readable } from 'node:stream';
+import { createGunzip } from 'node:zlib';
 const DEFAULT_SPARQL_QUERY = 'SELECT * { ?s ?p ?o } LIMIT 1';
 const DEFAULT_TIMEOUT_MS = 5000;
 const DEFAULT_RETRIES = 2;
+/**
+ * Default soft deadline for finding the first triple when content validation is
+ * on (capped at `timeoutMs`). Two seconds comfortably covers a static file
+ * server's first chunk while keeping the extra wait bounded on a slow,
+ * generate-on-the-fly endpoint.
+ */
+const DEFAULT_RDF_VALIDATION_BUDGET_MS = 2000;
+/** Sentinel: the validation budget elapsed before a triple surfaced. */
+const VALIDATION_TIMED_OUT = Symbol('rdf-validation-timed-out');
+/**
+ * Maximum number of body bytes the data-dump probe reads before it stops and
+ * releases the connection. Reachability needs only that the endpoint answered
+ * with a success status and produced bytes; a large dump must never be
+ * downloaded in full within the probe's timeout budget. 256 KiB comfortably
+ * surfaces the first RDF triple — the signal {@link validateBody} needs — while
+ * bounding the read regardless of the dump's true size, chunked transfer, or
+ * compression. Applied to both the raw read and, for a gzip body, the inflated
+ * output.
+ */
+const MAX_PROBE_BODY_BYTES = 256 * 1024;
 /** Base backoff between retries; the nth retry waits `n × base`. */
 const RETRY_BACKOFF_MS = 250;
 /**
@@ -107,7 +128,8 @@ export class DataDumpProbeResult extends ProbeResult {
  *
  * For SPARQL endpoints, issues the configured SPARQL query (default: a
  * minimal `SELECT`). For data dumps, issues `HEAD` (with a `GET` fallback
- * for small or unknown-size bodies
+ * for small or unknown-size bodies, reading only a bounded prefix so a large
+ * streamed dump is never downloaded in full).
  *
  * Returns a pure result object; never throws.
  */
@@ -186,6 +208,9 @@ function resolveOptions(options) {
         retries: retries === undefined || !Number.isInteger(retries)
             ? DEFAULT_RETRIES
             : Math.max(0, retries),
+        validateRdfContent: options?.validateRdfContent ?? false,
+        rdfValidationBudgetMs: options?.rdfValidationBudgetMs ??
+            Math.min(options?.timeoutMs ?? DEFAULT_TIMEOUT_MS, DEFAULT_RDF_VALIDATION_BUDGET_MS),
     };
 }
 /**
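Review note on `resolveOptions`: the budget is settled in two places, a default here and a clamp later in `validateDumpBody` (`Math.min(options.rdfValidationBudgetMs, options.timeoutMs)`). The sketch below folds both into one hypothetical helper, `effectiveBudgetMs`, to make the resulting numbers easy to check. It assumes the resolved `timeoutMs` is simply the caller's value or the 5000 ms default; how the package validates `timeoutMs` is not visible in this diff.

```javascript
const DEFAULT_TIMEOUT_MS = 5000;
const DEFAULT_RDF_VALIDATION_BUDGET_MS = 2000;

// Hypothetical helper, not part of the package: combines the default from
// resolveOptions with the clamp applied in validateDumpBody.
function effectiveBudgetMs(options = {}) {
  // Assumption: timeoutMs resolves to the given value or the default.
  const timeoutMs = options.timeoutMs ?? DEFAULT_TIMEOUT_MS;
  const requested =
    options.rdfValidationBudgetMs ??
    Math.min(timeoutMs, DEFAULT_RDF_VALIDATION_BUDGET_MS);
  // A budget longer than the request timeout is meaningless, so clamp it.
  return Math.min(requested, timeoutMs);
}

console.log(effectiveBudgetMs());                     // 2000
console.log(effectiveBudgetMs({ timeoutMs: 1000 }));  // 1000
console.log(effectiveBudgetMs({ rdfValidationBudgetMs: 9000 })); // 5000
```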
@@ -350,30 +375,201 @@ async function probeDataDump(url, distribution, options, authHeaders, start) {
         method: 'HEAD',
         ...requestOptions,
     });
-
-
-    //
-    //
-    if (
-
-
-
-
-
-
-
-
-
-        const responseTimeMs = Math.round(performance.now() - start);
-        const result = new DataDumpProbeResult(url, getResponse, responseTimeMs, failureReason);
-        checkContentTypeMismatch(result, distribution);
-        return result;
+    // Validate body content only when asked to and the distribution declares an
+    // RDF media type; otherwise the probe is reachability-only and never reads a
+    // body — which keeps it from forcing a slow, generate-on-the-fly endpoint to
+    // start producing its export.
+    if (options.validateRdfContent &&
+        isDeclaredRdf(distribution) &&
+        isHttpSuccess(headResponse)) {
+        const { response, failureReason } = await validateDumpBody(url, headers, options, headResponse);
+        return finalizeDataDump(url, distribution, response, start, failureReason);
+    }
+    // Reachability only. A successful HEAD is enough; otherwise confirm with a
+    // body-less GET, which rescues servers that reject or do not implement HEAD.
+    if (isHttpSuccess(headResponse)) {
+        return finalizeDataDump(url, distribution, headResponse, start, null);
     }
+    const getResponse = await fetch(url, { method: 'GET', ...requestOptions });
+    await getResponse.body?.cancel();
+    return finalizeDataDump(url, distribution, getResponse, start, null);
+}
+/** Whether an HTTP response carries a success (2xx/3xx) status. */
+function isHttpSuccess(response) {
+    return response.status >= 200 && response.status < 400;
+}
+/** Whether the distribution declares an RDF serialization as its media type. */
+function isDeclaredRdf(distribution) {
+    const declared = distribution.mimeType?.toLowerCase();
+    return declared !== undefined && rdfContentTypes.includes(declared);
+}
+/** Build a DataDumpProbeResult and attach any Content-Type-mismatch warning. */
+function finalizeDataDump(url, distribution, response, start, failureReason) {
     const responseTimeMs = Math.round(performance.now() - start);
-    const result = new DataDumpProbeResult(url,
+    const result = new DataDumpProbeResult(url, response, responseTimeMs, failureReason);
     checkContentTypeMismatch(result, distribution);
     return result;
 }
+/**
+ * GET the dump and validate that its body carries a triple, but only for as long
+ * as the validation budget allows. Reachability is already settled by the prior
+ * HEAD, so any shortfall — a budget that elapses before a triple, a read error,
+ * a GET that cannot start — yields a `null` failureReason (reachable,
+ * unvalidated), never a failure. Returns the response to draw metadata from
+ * (the GET, or the HEAD when the GET could not start) alongside that reason.
+ */
+async function validateDumpBody(url, headers, options, headResponse) {
+    const budgetMs = Math.min(options.rdfValidationBudgetMs, options.timeoutMs);
+    // Aborting on budget expiry stops a slow endpoint from streaming on in the
+    // background once we have given up waiting for a triple.
+    const budgetController = new AbortController();
+    let getResponse;
+    try {
+        getResponse = await fetch(url, {
+            method: 'GET',
+            headers,
+            signal: AbortSignal.any([
+                AbortSignal.timeout(options.timeoutMs),
+                budgetController.signal,
+            ]),
+        });
+    }
+    catch {
+        // The GET could not even return headers; the HEAD already proved the
+        // distribution reachable, so report it unvalidated rather than down.
+        return { response: headResponse, failureReason: null };
+    }
+    if (!isHttpSuccess(getResponse)) {
+        await getResponse.body?.cancel();
+        return { response: getResponse, failureReason: null };
+    }
+    const validation = (async () => {
+        const bounded = await readBoundedBody(getResponse, MAX_PROBE_BODY_BYTES);
+        const { text, truncated, corrupt } = await decodeProbeBody(bounded);
+        return corrupt
+            ? 'Distribution is not valid gzip'
+            : await validateBody(text, getResponse.headers.get('Content-Type'), url, budgetMs, truncated);
+    })().catch(() => null);
+    let budgetTimer;
+    const budgetExpiry = new Promise((resolve) => {
+        budgetTimer = setTimeout(() => {
+            budgetController.abort();
+            resolve(VALIDATION_TIMED_OUT);
+        }, budgetMs);
+    });
+    try {
+        const outcome = await Promise.race([validation, budgetExpiry]);
+        return {
+            response: getResponse,
+            failureReason: outcome === VALIDATION_TIMED_OUT ? null : outcome,
+        };
+    }
+    finally {
+        clearTimeout(budgetTimer);
+    }
+}
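
Review note on `validateDumpBody`: the core pattern is a soft deadline, where the work races a timer and the timer both aborts the work and resolves to a sentinel. A minimal standalone sketch of that pattern follows. The helper name `withSoftDeadline` and the `TIMED_OUT` symbol are hypothetical; only `Promise.race`, `AbortController`, and `setTimeout` are used, all standard.

```javascript
const TIMED_OUT = Symbol('timed-out');

// Hypothetical helper: runs `work(signal)` for at most `budgetMs`.
// Resolves to the work's result, or to null when the budget expires or the
// work throws (both are treated as "inconclusive", as in the diff).
async function withSoftDeadline(work, budgetMs) {
  const controller = new AbortController();
  let timer;
  const expiry = new Promise((resolve) => {
    timer = setTimeout(() => {
      controller.abort(); // lets the work stop instead of running on in the background
      resolve(TIMED_OUT);
    }, budgetMs);
  });
  // Wrapping in then() also catches a synchronous throw from `work`.
  const guarded = Promise.resolve()
    .then(() => work(controller.signal))
    .catch(() => null);
  try {
    const outcome = await Promise.race([guarded, expiry]);
    return outcome === TIMED_OUT ? null : outcome;
  } finally {
    clearTimeout(timer); // do not keep the event loop alive after a fast result
  }
}
```

The `finally` matters for a library: without `clearTimeout`, every fast validation would leave a timer pending until the budget elapsed.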
+/**
+ * Read at most `maxBytes` from a response body, then cancel the stream to free
+ * the underlying connection. Returns the bytes read and whether the body was
+ * longer than the cap (`truncated`), so the caller can tell a complete, small
+ * body — whose emptiness or parse errors are meaningful — from a deliberately
+ * cut-off prefix of a large one, where only the presence of content is
+ * conclusive. This is what keeps the probe from downloading a multi-hundred-MB
+ * streamed dump in full just to confirm it is reachable.
+ */
+async function readBoundedBody(response, maxBytes) {
+    const stream = response.body;
+    if (stream === null) {
+        return { bytes: new Uint8Array(0), truncated: false };
+    }
+    const chunks = [];
+    let total = 0;
+    let truncated = false;
+    // Breaking out of `for await` cancels the stream, which stops any further
+    // download and releases the underlying connection — so a large dump is never
+    // pulled in full once we have the prefix we need.
+    for await (const chunk of stream) {
+        chunks.push(chunk);
+        total += chunk.length;
+        if (total >= maxBytes) {
+            truncated = true;
+            break;
+        }
+    }
+    return { bytes: Buffer.concat(chunks), truncated };
+}
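
Review note on `readBoundedBody`: as written, the chunk that crosses the cap is kept whole, so the returned buffer can overshoot `maxBytes` by up to one chunk, and a body that ends exactly at the cap is still flagged `truncated`. The second point is the safe direction, since a truncated result is only ever treated as inconclusive. The sketch below is a hypothetical variant, `readPrefix`, that trims the last chunk so the cap is exact. It accepts any async iterable of `Uint8Array` chunks, which is an assumption made so it can be exercised without a network.

```javascript
// Hypothetical variant of the bounded read; not the package's implementation.
async function readPrefix(chunks, maxBytes) {
  const parts = [];
  let total = 0;
  let truncated = false;
  for await (const chunk of chunks) {
    const room = maxBytes - total;
    if (chunk.length >= room) {
      parts.push(chunk.subarray(0, room)); // trim so the result never exceeds the cap
      total += room;
      truncated = true; // reaching the cap counts as truncated, as in the diff
      break; // leaving `for await` calls the iterator's return(), cancelling the source
    }
    parts.push(chunk);
    total += chunk.length;
  }
  // Concatenate without Buffer so the sketch also runs outside Node.
  const bytes = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    bytes.set(part, offset);
    offset += part.length;
  }
  return { bytes, truncated };
}
```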
+/**
+ * Decode a bounded body to text for RDF validation, inflating it first when it
+ * is a gzip stream that `fetch` did not transparently decompress — e.g. a `.gz`
+ * data dump served as-is, or one labelled with a non-standard Content-Encoding
+ * (`application/gzip`) that undici does not recognise as a content coding.
+ * Detection is by the gzip magic on the delivered bytes, so a body that `fetch`
+ * already inflated (a standard `Content-Encoding: gzip`) is passed through
+ * untouched. A truncated gzip tail is expected — we only read a prefix — and
+ * inflates cleanly up to the cut, so it is never mistaken for corruption.
+ */
+async function decodeProbeBody(bounded) {
+    if (!isGzip(bounded.bytes)) {
+        return {
+            text: decodeUtf8(bounded.bytes),
+            truncated: bounded.truncated,
+            corrupt: false,
+        };
+    }
+    // The compressed body is complete only when the raw read was not itself cut
+    // off: a gzip error on a complete body is genuine corruption, on a prefix we
+    // cut it is just the dropped tail.
+    const inflated = await gunzipPrefix(bounded.bytes, MAX_PROBE_BODY_BYTES, !bounded.truncated);
+    return {
+        text: decodeUtf8(inflated.bytes),
+        truncated: bounded.truncated || inflated.truncated,
+        corrupt: inflated.corrupt,
+    };
+}
+/** Whether the bytes begin with the gzip magic number (RFC 1952 §2.3.1). */
+function isGzip(bytes) {
+    return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
+}
+/**
+ * Decode bytes as UTF-8 without throwing: an incomplete multi-byte sequence at
+ * the truncation boundary is replaced rather than fatal, since the RDF parser
+ * only needs the leading, intact portion to find the first triple.
+ */
+function decodeUtf8(bytes) {
+    return new TextDecoder('utf-8', { fatal: false }).decode(bytes);
+}
+/**
+ * Inflate up to `maxBytes` of output from a gzip prefix, stopping once the cap
+ * is reached or the input runs out. `inputComplete` says whether the caller
+ * handed us the whole compressed body (true) or a prefix it had already cut
+ * (false). An inflate error therefore means different things: on a complete body
+ * the gzip is genuinely corrupt; on a cut prefix it is just the dropped tail, so
+ * whatever inflated cleanly is reported as a (truncated) partial inflate.
+ */
+function gunzipPrefix(bytes, maxBytes, inputComplete) {
+    return new Promise((resolve) => {
+        const gunzip = createGunzip();
+        const chunks = [];
+        let total = 0;
+        // `resolve` and `destroy` are both idempotent, so the first outcome wins and
+        // any later event (e.g. a premature-close error emitted by `destroy`) is a
+        // harmless no-op — no `settled` guard needed.
+        function finish(outcome) {
+            gunzip.destroy();
+            resolve({ bytes: Buffer.concat(chunks), ...outcome });
+        }
+        gunzip.on('data', (chunk) => {
+            chunks.push(chunk);
+            total += chunk.length;
+            if (total >= maxBytes) {
+                finish({ truncated: true, corrupt: false });
+            }
+        });
+        gunzip.on('error', () => finish({ truncated: !inputComplete, corrupt: inputComplete }));
+        gunzip.on('end', () => finish({ truncated: false, corrupt: false }));
+        gunzip.end(bytes);
+    });
+}
 // The RDF serializations whose bodies we parse to confirm they carry triples. A
 // non-empty body in one of these formats that yields zero triples — an empty
 // graph such as a JSON-LD `{}`, an `<rdf:RDF/>`, or prefix-only Turtle — is a
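Review note on the gzip handling in this hunk: the claim that a cut-off gzip prefix still inflates up to the cut can be checked in isolation. The sketch below deliberately swaps in a different technique from the diff, Node's synchronous `zlib.gunzipSync` with `finishFlush: Z_SYNC_FLUSH`, which the Node.js documentation describes for decompressing truncated input. The helper names `looksLikeGzip` and `inflatePrefix` are hypothetical. Unlike the streaming `gunzipPrefix`, this variant inflates everything the input allows before slicing, so it does not bound memory the same way and should not be read as a drop-in replacement.

```javascript
const zlib = require('node:zlib');

// Same magic-number test as isGzip in the diff (RFC 1952: 0x1f 0x8b).
function looksLikeGzip(bytes) {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Hypothetical helper: inflate whatever a possibly cut-off gzip buffer allows
// and return at most maxBytes of output. Z_SYNC_FLUSH suppresses the
// "unexpected end of file" error that a truncated input would otherwise raise.
function inflatePrefix(bytes, maxBytes) {
  const out = zlib.gunzipSync(bytes, {
    finishFlush: zlib.constants.Z_SYNC_FLUSH,
  });
  return out.subarray(0, maxBytes);
}

// Usage: compress some N-Triples-like lines, drop the second half of the
// compressed bytes, and inflate the remaining prefix.
const lines = [];
for (let i = 0; i < 2000; i++) {
  lines.push(`<http://example.org/s${i}> <http://example.org/p> "${i}" .`);
}
const original = Buffer.from(lines.join('\n'), 'utf8');
const compressed = zlib.gzipSync(original);
const cut = compressed.subarray(0, Math.floor(compressed.length / 2));
const partial = inflatePrefix(cut, 1024);
console.log(looksLikeGzip(cut), partial.length);
```

A side effect of suppressing the end-of-file error is that this variant cannot tell a corrupt tail from a truncated one, which is exactly the distinction the diff preserves through `inputComplete`.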
@@ -389,9 +585,21 @@ const rdfContentTypes = [
     'application/ld+json',
     'application/rdf+xml',
 ];
-
+// Serializations a streaming parser cannot validate from a truncated prefix.
+// The line/statement-oriented formats (N-Triples, N-Quads, Turtle, TriG, N3) and
+// SAX-based RDF/XML all yield their first triple from the opening chunk, but
+// JSON-LD is a single JSON value whose parser emits nothing until the whole
+// document closes — a truncated JSON-LD body parses to an ‘unclosed document’
+// error, never a triple. So a truncated body in one of these can only be
+// validated if it happened to fit the read cap in full; beyond that it is
+// inconclusive, and we must not download it in full to find out.
+const nonStreamableRdfContentTypes = ['application/ld+json'];
+async function validateBody(body, contentType, baseIRI, timeoutMs, truncated) {
     if (body.length === 0) {
-
+        // A complete, empty body is a faulty distribution; an empty *prefix* (a
+        // truncated read that yielded no bytes, e.g. a corrupt gzip header) is
+        // inconclusive — the endpoint answered, we just could not validate content.
+        return truncated ? null : 'Distribution is empty';
     }
     // Media types are case-insensitive (RFC 9110 §8.3.1), so normalise before
     // matching the lower-case allow-list — a server sending `Application/LD+JSON`
@@ -400,7 +608,13 @@ async function validateBody(body, contentType, baseIRI, timeoutMs) {
     if (!serialization || !rdfContentTypes.includes(serialization)) {
         return null;
     }
-
+    if (truncated && nonStreamableRdfContentTypes.includes(serialization)) {
+        // A bounded prefix of a non-streamable serialization (JSON-LD) can never
+        // yield a triple, so skip the doomed parse and report it inconclusive — only
+        // a complete document, small enough to fit the read cap, can be validated.
+        return null;
+    }
+    const outcome = await classifyRdfBody(body, serialization, baseIRI, timeoutMs, truncated);
     switch (outcome.type) {
         case 'empty':
             return 'Distribution contains no RDF triples';
@@ -422,8 +636,13 @@ async function validateBody(body, contentType, baseIRI, timeoutMs) {
  * on expiry — and likewise when a remote `@context` is unreachable — the outcome
  * is 'inconclusive', so a valid distribution is never flagged faulty for a
  * context host's failure. `baseIRI` resolves any relative IRIs in the document.
+ *
+ * When `truncated` is true the body is only a bounded prefix of a larger one, so
+ * only finding a triple ('hasTriples') is conclusive: a parse error at the cut
+ * or a clean end with no triple yet means we did not read far enough, not that
+ * the distribution is empty or malformed, and is reported as 'inconclusive'.
  */
-function classifyRdfBody(body, contentType, baseIRI, timeoutMs) {
+function classifyRdfBody(body, contentType, baseIRI, timeoutMs, truncated) {
     return new Promise((resolve) => {
         const quads = rdfParser.parse(Readable.from([body]), {
             contentType,
@@ -441,10 +660,10 @@ function classifyRdfBody(body, contentType, baseIRI, timeoutMs) {
         }
         quads
             .on('data', () => settle({ type: 'hasTriples' }))
-            .on('error', (error) => settle(isRemoteContextError(error)
+            .on('error', (error) => settle(truncated || isRemoteContextError(error)
                 ? { type: 'inconclusive' }
                 : { type: 'parseError', message: error.message }))
-            .on('end', () => settle({ type: 'empty' }));
+            .on('end', () => settle(truncated ? { type: 'inconclusive' } : { type: 'empty' }));
     });
 }
 /**
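Review note on the `validateBody` and `classifyRdfBody` changes: the effect of the new `truncated` flag is easiest to see as a truth table. The sketch below restates it as two pure functions. Both names, `classifyEvent` and `shouldParse`, are hypothetical, and the `remoteContextError` flag stands in for the package's `isRemoteContextError(error)` check, whose body is not part of this diff.

```javascript
// Hypothetical restatement of the event-to-outcome mapping in classifyRdfBody.
function classifyEvent(event, { truncated, remoteContextError = false }) {
  switch (event) {
    case 'data':
      return 'hasTriples'; // a triple is conclusive even on a cut-off prefix
    case 'error':
      // On a prefix, a parse error may just be the cut; never blame the dump.
      return truncated || remoteContextError ? 'inconclusive' : 'parseError';
    case 'end':
      // A clean end without triples only means "empty" for a complete body.
      return truncated ? 'inconclusive' : 'empty';
    default:
      throw new Error(`unknown event: ${event}`);
  }
}

// The JSON-LD short-circuit from validateBody: a truncated body in a
// non-streamable serialization is never handed to the parser at all.
function shouldParse(serialization, truncated) {
  return !(truncated && serialization === 'application/ld+json');
}

console.log(classifyEvent('end', { truncated: false })); // empty
console.log(classifyEvent('end', { truncated: true }));  // inconclusive
console.log(shouldParse('application/ld+json', true));   // false
```

Read this way, the only outcomes that can fail a probe (`empty`, `parseError`) are reachable solely when the whole body fit inside the 256 KiB cap.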