@lde/distribution-probe 0.1.12 → 0.2.0

package/README.md CHANGED
@@ -24,11 +24,24 @@ Sends `POST` with the configured query (default `SELECT * { ?s ?p ?o } LIMIT 1`)
 
  ### Data dumps
 
- Sends `HEAD` with `Accept: <distribution.mimeType>` and `Accept-Encoding: identity`. If `Content-Length` is missing or ≤ 10 KB, retries with `GET` to validate the body – this also catches servers that return `0` from `HEAD`.
+ #### Reachability (the default)
+
+ Sends `HEAD` with `Accept: <distribution.mimeType>` and `Accept-Encoding: identity`. A successful `HEAD` settles reachability and gathers metadata (`Content-Length`, `Last-Modified`) **without reading the body**. If `HEAD` is unsuccessful — e.g. a server that returns `405`/`501` because it does not implement `HEAD` — the probe falls back to a body-less `GET` to confirm the endpoint is up. The body is never downloaded.
+
+ This is deliberately cheap: reading a body forces a slow, generate-on-the-fly endpoint (a TriplyDB dump, a SPARQL `CONSTRUCT` export) to start producing its export, which a `HEAD` does not.
 
  - **Content-Type is checked as a soft warning, not a hard failure.** If the server’s Content-Type disagrees with the distribution’s declared `mimeType`, a message is appended to `result.warnings` but `isSuccess()` stays `true`. Compression wrappers (`application/gzip`, `application/x-gzip`, `application/octet-stream`) are skipped so a gzipped Turtle file doesn’t trigger a warning.
- - **Body is parse-validated only for Turtle, N-Triples, and N-Quads** (Content-Type starting with `text/turtle`, `application/n-triples`, or `application/n-quads`). Empty bodies and parse errors fail the probe. Other RDF serializations (RDF/XML, JSON-LD, TriG, …) are not parse-validated – only HTTP status and headers are checked.
- - Bodies larger than 10 KB are not fetched; only `HEAD` metadata is inspected.
+
+ #### Content validation (opt-in)
+
+ Set `validateRdfContent: true` to additionally confirm that a dump actually carries RDF. It applies only to distributions whose **declared** `mimeType` is an RDF serialization (`text/turtle`, `application/n-triples`, `application/n-quads`, `application/trig`, `text/n3`, `application/ld+json`, `application/rdf+xml`); non-RDF and undeclared-type distributions stay reachability-only.
+
+ When on, the probe `GET`s the dump — **regardless of size** — and reads only a **bounded prefix** (256 KiB), never the whole body:
+
+ - It settles on the **first triple** and stops, so a large dump is validated from its opening chunk. The line/statement-oriented serializations and RDF/XML stream a triple out of the prefix; **JSON-LD is not streamable** (its parser needs the whole document), so a JSON-LD dump is only validated when it fits the prefix in full — a larger one is reported reachable but unvalidated.
+ - A gzip body that `fetch` did not decompress (a `.gz` dump, or one served with a non-standard `Content-Encoding`) is inflated in-place; a gzip that will not inflate when the **complete** compressed body was read fails as `Distribution is not valid gzip`.
+ - Empty bodies (`Distribution is empty`) and bodies that parse to **zero** triples (`Distribution contains no RDF triples`) fail the probe. A deliberately truncated prefix is never mistaken for either — it is inconclusive.
+ - **Reachability is settled by the response, so validation never turns a reachable dump into a failure.** If no triple surfaces within `rdfValidationBudgetMs` (default `min(timeoutMs, 2000)`, clamped to `timeoutMs`), the read is aborted and the distribution is reported reachable but unvalidated (no `failureReason`). This bounds the extra latency content validation adds on slow, generate-on-the-fly endpoints.
 
  ### Network errors
 
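The budget arithmetic described in the README text above can be illustrated with a small worked sketch. It assumes the default request timeout of 5000 ms (the `DEFAULT_TIMEOUT_MS` constant in `probe.js`); the helper name `resolveBudgetMs` is hypothetical and not part of the package, and the package's own handling of `timeoutMs` may differ in details not shown in this diff.

```javascript
// Hypothetical helper, not part of the package's API: it only restates the
// documented arithmetic for the content-validation budget.
const DEFAULT_TIMEOUT_MS = 5000; // same value as the constant in probe.js
const DEFAULT_RDF_VALIDATION_BUDGET_MS = 2000;

function resolveBudgetMs(options = {}) {
  const timeoutMs = options.timeoutMs ?? DEFAULT_TIMEOUT_MS;
  // Default: min(timeoutMs, 2000) when no explicit budget is given.
  const requested =
    options.rdfValidationBudgetMs ??
    Math.min(timeoutMs, DEFAULT_RDF_VALIDATION_BUDGET_MS);
  // A budget longer than the request timeout is meaningless, so clamp it.
  return Math.min(requested, timeoutMs);
}

console.log(resolveBudgetMs()); // 2000
console.log(resolveBudgetMs({ timeoutMs: 1000 })); // 1000
console.log(resolveBudgetMs({ timeoutMs: 5000, rdfValidationBudgetMs: 9000 })); // 5000
```

In other words, an explicit budget only has an effect when it is shorter than the request timeout.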
package/dist/probe.d.ts CHANGED
@@ -28,6 +28,29 @@ export interface ProbeOptions {
       * the default; negative values are clamped to `0`.
       */
      retries?: number;
+     /**
+      * Validate the body content of data-dump distributions whose declared media
+      * type is an RDF serialization, by reading a bounded prefix and confirming it
+      * carries at least one triple. When `false` (the default) a data dump is only
+      * checked for reachability (a `HEAD`, with a body-less `GET` fallback if `HEAD`
+      * is unsupported) and its body is never read. When `true`, every declared-RDF
+      * dump — regardless of size — is fetched and validated; non-RDF and
+      * undeclared-type distributions are still reachability-only. Validation is
+      * opt-in because reading a body forces a slow, generate-on-the-fly endpoint to
+      * start producing its export, which a `HEAD` does not.
+      */
+     validateRdfContent?: boolean;
+     /**
+      * Soft deadline, in milliseconds, for finding the first triple when
+      * {@link validateRdfContent} is on. Reachability is settled by the response
+      * itself; if no triple has surfaced within this budget the read is aborted and
+      * the distribution is reported reachable but unvalidated (no `failureReason`),
+      * never failed. This bounds the extra latency content validation adds on slow,
+      * generate-on-the-fly endpoints. Clamped to {@link timeoutMs} (a longer budget
+      * is meaningless — the request times out first). Defaults to
+      * `min(timeoutMs, 2000)`.
+      */
+     rdfValidationBudgetMs?: number;
  }
  /**
   * Result of a network error during probing.
@@ -80,7 +103,8 @@ export type ProbeResultType = SparqlProbeResult | DataDumpProbeResult | NetworkE
   *
   * For SPARQL endpoints, issues the configured SPARQL query (default: a
   * minimal `SELECT`). For data dumps, issues `HEAD` (with a `GET` fallback
-  * for small or unknown-size bodies).
+  * for small or unknown-size bodies, reading only a bounded prefix so a large
+  * streamed dump is never downloaded in full).
   *
   * Returns a pure result object; never throws.
   */
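To make the interaction between `validateRdfContent`, the declared media type, and the `HEAD` status easier to follow, here is a minimal sketch of the decision. The helper `choosePath` and its return labels are hypothetical; the media-type list is copied from the README text in this diff, and the success range (2xx/3xx) follows `isHttpSuccess` in `probe.js`.

```javascript
// RDF media types as listed in the README section of this diff.
const RDF_TYPES = [
  'text/turtle',
  'application/n-triples',
  'application/n-quads',
  'application/trig',
  'text/n3',
  'application/ld+json',
  'application/rdf+xml',
];

// Hypothetical helper: names the path a data-dump probe would take.
function choosePath(options, distribution, headStatus) {
  const headOk = headStatus >= 200 && headStatus < 400;
  const declared = distribution.mimeType?.toLowerCase();
  const declaredRdf = declared !== undefined && RDF_TYPES.includes(declared);
  if (options.validateRdfContent && declaredRdf && headOk) {
    return 'validate-prefix'; // GET and read a bounded prefix
  }
  // Reachability only: HEAD suffices, otherwise a body-less GET confirms.
  return headOk ? 'head-only' : 'get-fallback';
}

console.log(choosePath({ validateRdfContent: true }, { mimeType: 'Text/Turtle' }, 200)); // validate-prefix
console.log(choosePath({}, { mimeType: 'text/turtle' }, 200)); // head-only
console.log(choosePath({ validateRdfContent: true }, { mimeType: 'text/csv' }, 200)); // head-only
console.log(choosePath({ validateRdfContent: true }, { mimeType: 'text/turtle' }, 405)); // get-fallback
```

The last call shows a consequence worth noting: when `HEAD` is rejected, the probe settles reachability with the body-less `GET` and does not validate content, even with `validateRdfContent` on.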
@@ -1 +1 @@
- {"version":3,"file":"probe.d.ts","sourceRoot":"","sources":["../src/probe.ts"],"names":[],"mappings":"AAAA,OAAO,EAAyB,YAAY,EAAE,MAAM,cAAc,CAAC;AAInE;;GAEG;AACH,MAAM,WAAW,YAAY;IAC3B,0DAA0D;IAC1D,SAAS,CAAC,EAAE,MAAM,CAAC;IACnB;;;OAGG;IACH,OAAO,CAAC,EAAE,OAAO,CAAC;IAClB;;;;;OAKG;IACH,WAAW,CAAC,EAAE,MAAM,CAAC;IACrB;;;;;;;;;OASG;IACH,OAAO,CAAC,EAAE,MAAM,CAAC;CAClB;AASD;;GAEG;AACH,qBAAa,YAAY;aAEL,GAAG,EAAE,MAAM;aACX,OAAO,EAAE,MAAM;aACf,cAAc,EAAE,MAAM;gBAFtB,GAAG,EAAE,MAAM,EACX,OAAO,EAAE,MAAM,EACf,cAAc,EAAE,MAAM;CAEzC;AAED;;GAEG;AACH,uBAAe,WAAW;aAUN,GAAG,EAAE,MAAM;IAT7B,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,YAAY,EAAE,IAAI,GAAG,IAAI,CAAQ;IACjD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAC;IAC3C,SAAgB,aAAa,EAAE,MAAM,GAAG,IAAI,CAAC;IAC7C,SAAgB,QAAQ,EAAE,MAAM,EAAE,CAAM;IACxC,SAAgB,cAAc,EAAE,MAAM,CAAC;gBAGrB,GAAG,EAAE,MAAM,EAC3B,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;IAa9B,SAAS,IAAI,OAAO;CAO5B;AAqBD;;GAEG;AACH,qBAAa,iBAAkB,SAAQ,WAAW;IAChD;;;;;OAKG;IACH,SAAgB,oBAAoB,EAAE,SAAS,MAAM,EAAE,CAAC;gBAGtD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,oBAAoB,EAAE,MAAM,GAAG,SAAS,MAAM,EAAE,EAChD,aAAa,GAAE,MAAM,GAAG,IAAW;IAS5B,SAAS,IAAI,OAAO;CAQ9B;AAED;;GAEG;AACH,qBAAa,mBAAoB,SAAQ,WAAW;IAClD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAQ;gBAGhD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;CAQtC;AAED,MAAM,MAAM,eAAe,GACvB,iBAAiB,GACjB,mBAAmB,GACnB,YAAY,CAAC;AAIjB;;;;;;;;GAQG;AACH,wBAAsB,KAAK,CACzB,YAAY,EAAE,YAAY,EAC1B,OAAO,CAAC,EAAE,YAAY,GACrB,OAAO,CAAC,eAAe,CAAC,CAqD1B"}
+ {"version":3,"file":"probe.d.ts","sourceRoot":"","sources":["../src/probe.ts"],"names":[],"mappings":"AAAA,OAAO,EAAyB,YAAY,EAAE,MAAM,cAAc,CAAC;AAKnE;;GAEG;AACH,MAAM,WAAW,YAAY;IAC3B,0DAA0D;IAC1D,SAAS,CAAC,EAAE,MAAM,CAAC;IACnB;;;OAGG;IACH,OAAO,CAAC,EAAE,OAAO,CAAC;IAClB;;;;;OAKG;IACH,WAAW,CAAC,EAAE,MAAM,CAAC;IACrB;;;;;;;;;OASG;IACH,OAAO,CAAC,EAAE,MAAM,CAAC;IACjB;;;;;;;;;;OAUG;IACH,kBAAkB,CAAC,EAAE,OAAO,CAAC;IAC7B;;;;;;;;;OASG;IACH,qBAAqB,CAAC,EAAE,MAAM,CAAC;CAChC;AAgCD;;GAEG;AACH,qBAAa,YAAY;aAEL,GAAG,EAAE,MAAM;aACX,OAAO,EAAE,MAAM;aACf,cAAc,EAAE,MAAM;gBAFtB,GAAG,EAAE,MAAM,EACX,OAAO,EAAE,MAAM,EACf,cAAc,EAAE,MAAM;CAEzC;AAED;;GAEG;AACH,uBAAe,WAAW;aAUN,GAAG,EAAE,MAAM;IAT7B,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,UAAU,EAAE,MAAM,CAAC;IACnC,SAAgB,YAAY,EAAE,IAAI,GAAG,IAAI,CAAQ;IACjD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAC;IAC3C,SAAgB,aAAa,EAAE,MAAM,GAAG,IAAI,CAAC;IAC7C,SAAgB,QAAQ,EAAE,MAAM,EAAE,CAAM;IACxC,SAAgB,cAAc,EAAE,MAAM,CAAC;gBAGrB,GAAG,EAAE,MAAM,EAC3B,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;IAa9B,SAAS,IAAI,OAAO;CAO5B;AAqBD;;GAEG;AACH,qBAAa,iBAAkB,SAAQ,WAAW;IAChD;;;;;OAKG;IACH,SAAgB,oBAAoB,EAAE,SAAS,MAAM,EAAE,CAAC;gBAGtD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,oBAAoB,EAAE,MAAM,GAAG,SAAS,MAAM,EAAE,EAChD,aAAa,GAAE,MAAM,GAAG,IAAW;IAS5B,SAAS,IAAI,OAAO;CAQ9B;AAED;;GAEG;AACH,qBAAa,mBAAoB,SAAQ,WAAW;IAClD,SAAgB,WAAW,EAAE,MAAM,GAAG,IAAI,CAAQ;gBAGhD,GAAG,EAAE,MAAM,EACX,QAAQ,EAAE,QAAQ,EAClB,cAAc,EAAE,MAAM,EACtB,aAAa,GAAE,MAAM,GAAG,IAAW;CAQtC;AAED,MAAM,MAAM,eAAe,GACvB,iBAAiB,GACjB,mBAAmB,GACnB,YAAY,CAAC;AAIjB;;;;;;;;;GASG;AACH,wBAAsB,KAAK,CACzB,YAAY,EAAE,YAAY,EAC1B,OAAO,CAAC,EAAE,YAAY,GACrB,OAAO,CAAC,eAAe,CAAC,CAqD1B"}
package/dist/probe.js CHANGED
@@ -1,9 +1,30 @@
  import { compressionMediaTypes } from '@lde/dataset';
  import { rdfParser } from 'rdf-parse';
  import { Readable } from 'node:stream';
+ import { createGunzip } from 'node:zlib';
  const DEFAULT_SPARQL_QUERY = 'SELECT * { ?s ?p ?o } LIMIT 1';
  const DEFAULT_TIMEOUT_MS = 5000;
  const DEFAULT_RETRIES = 2;
+ /**
+  * Default soft deadline for finding the first triple when content validation is
+  * on (capped at `timeoutMs`). Two seconds comfortably covers a static file
+  * server's first chunk while keeping the extra wait bounded on a slow,
+  * generate-on-the-fly endpoint.
+  */
+ const DEFAULT_RDF_VALIDATION_BUDGET_MS = 2000;
+ /** Sentinel: the validation budget elapsed before a triple surfaced. */
+ const VALIDATION_TIMED_OUT = Symbol('rdf-validation-timed-out');
+ /**
+  * Maximum number of body bytes the data-dump probe reads before it stops and
+  * releases the connection. Reachability needs only that the endpoint answered
+  * with a success status and produced bytes; a large dump must never be
+  * downloaded in full within the probe's timeout budget. 256 KiB comfortably
+  * surfaces the first RDF triple — the signal {@link validateBody} needs — while
+  * bounding the read regardless of the dump's true size, chunked transfer, or
+  * compression. Applied to both the raw read and, for a gzip body, the inflated
+  * output.
+  */
+ const MAX_PROBE_BODY_BYTES = 256 * 1024;
  /** Base backoff between retries; the nth retry waits `n × base`. */
  const RETRY_BACKOFF_MS = 250;
  /**
  /**
@@ -107,7 +128,8 @@ export class DataDumpProbeResult extends ProbeResult {
   *
   * For SPARQL endpoints, issues the configured SPARQL query (default: a
   * minimal `SELECT`). For data dumps, issues `HEAD` (with a `GET` fallback
-  * for small or unknown-size bodies).
+  * for small or unknown-size bodies, reading only a bounded prefix so a large
+  * streamed dump is never downloaded in full).
   *
   * Returns a pure result object; never throws.
   */
@@ -186,6 +208,9 @@ function resolveOptions(options) {
          retries: retries === undefined || !Number.isInteger(retries)
              ? DEFAULT_RETRIES
              : Math.max(0, retries),
+         validateRdfContent: options?.validateRdfContent ?? false,
+         rdfValidationBudgetMs: options?.rdfValidationBudgetMs ??
+             Math.min(options?.timeoutMs ?? DEFAULT_TIMEOUT_MS, DEFAULT_RDF_VALIDATION_BUDGET_MS),
      };
  }
  /**
@@ -350,30 +375,201 @@ async function probeDataDump(url, distribution, options, authHeaders, start) {
          method: 'HEAD',
          ...requestOptions,
      });
-     const contentLength = headResponse.headers.get('Content-Length');
-     const contentLengthBytes = contentLength ? parseInt(contentLength) : 0;
-     // For small or unknown-size files, do a GET to validate body content.
-     // This also handles servers that incorrectly return 0 Content-Length for HEAD.
-     if (contentLengthBytes <= 10_240) {
-         const getResponse = await fetch(url, {
-             method: 'GET',
-             ...requestOptions,
-         });
-         const body = await getResponse.text();
-         const isHttpSuccess = getResponse.status >= 200 && getResponse.status < 400;
-         const failureReason = isHttpSuccess
-             ? await validateBody(body, getResponse.headers.get('Content-Type'), url, options.timeoutMs)
-             : null;
-         const responseTimeMs = Math.round(performance.now() - start);
-         const result = new DataDumpProbeResult(url, getResponse, responseTimeMs, failureReason);
-         checkContentTypeMismatch(result, distribution);
-         return result;
+     // Validate body content only when asked to and the distribution declares an
+     // RDF media type; otherwise the probe is reachability-only and never reads a
+     // body which keeps it from forcing a slow, generate-on-the-fly endpoint to
+     // start producing its export.
+     if (options.validateRdfContent &&
+         isDeclaredRdf(distribution) &&
+         isHttpSuccess(headResponse)) {
+         const { response, failureReason } = await validateDumpBody(url, headers, options, headResponse);
+         return finalizeDataDump(url, distribution, response, start, failureReason);
+     }
+     // Reachability only. A successful HEAD is enough; otherwise confirm with a
+     // body-less GET, which rescues servers that reject or do not implement HEAD.
+     if (isHttpSuccess(headResponse)) {
+         return finalizeDataDump(url, distribution, headResponse, start, null);
      }
+     const getResponse = await fetch(url, { method: 'GET', ...requestOptions });
+     await getResponse.body?.cancel();
+     return finalizeDataDump(url, distribution, getResponse, start, null);
+ }
+ /** Whether an HTTP response carries a success (2xx/3xx) status. */
+ function isHttpSuccess(response) {
+     return response.status >= 200 && response.status < 400;
+ }
+ /** Whether the distribution declares an RDF serialization as its media type. */
+ function isDeclaredRdf(distribution) {
+     const declared = distribution.mimeType?.toLowerCase();
+     return declared !== undefined && rdfContentTypes.includes(declared);
+ }
+ /** Build a DataDumpProbeResult and attach any Content-Type-mismatch warning. */
+ function finalizeDataDump(url, distribution, response, start, failureReason) {
      const responseTimeMs = Math.round(performance.now() - start);
-     const result = new DataDumpProbeResult(url, headResponse, responseTimeMs);
+     const result = new DataDumpProbeResult(url, response, responseTimeMs, failureReason);
      checkContentTypeMismatch(result, distribution);
      return result;
  }
+ /**
+  * GET the dump and validate that its body carries a triple, but only for as long
+  * as the validation budget allows. Reachability is already settled by the prior
+  * HEAD, so any shortfall — a budget that elapses before a triple, a read error,
+  * a GET that cannot start — yields a `null` failureReason (reachable,
+  * unvalidated), never a failure. Returns the response to draw metadata from
+  * (the GET, or the HEAD when the GET could not start) alongside that reason.
+  */
+ async function validateDumpBody(url, headers, options, headResponse) {
+     const budgetMs = Math.min(options.rdfValidationBudgetMs, options.timeoutMs);
+     // Aborting on budget expiry stops a slow endpoint from streaming on in the
+     // background once we have given up waiting for a triple.
+     const budgetController = new AbortController();
+     let getResponse;
+     try {
+         getResponse = await fetch(url, {
+             method: 'GET',
+             headers,
+             signal: AbortSignal.any([
+                 AbortSignal.timeout(options.timeoutMs),
+                 budgetController.signal,
+             ]),
+         });
+     }
+     catch {
+         // The GET could not even return headers; the HEAD already proved the
+         // distribution reachable, so report it unvalidated rather than down.
+         return { response: headResponse, failureReason: null };
+     }
+     if (!isHttpSuccess(getResponse)) {
+         await getResponse.body?.cancel();
+         return { response: getResponse, failureReason: null };
+     }
+     const validation = (async () => {
+         const bounded = await readBoundedBody(getResponse, MAX_PROBE_BODY_BYTES);
+         const { text, truncated, corrupt } = await decodeProbeBody(bounded);
+         return corrupt
+             ? 'Distribution is not valid gzip'
+             : await validateBody(text, getResponse.headers.get('Content-Type'), url, budgetMs, truncated);
+     })().catch(() => null);
+     let budgetTimer;
+     const budgetExpiry = new Promise((resolve) => {
+         budgetTimer = setTimeout(() => {
+             budgetController.abort();
+             resolve(VALIDATION_TIMED_OUT);
+         }, budgetMs);
+     });
+     try {
+         const outcome = await Promise.race([validation, budgetExpiry]);
+         return {
+             response: getResponse,
+             failureReason: outcome === VALIDATION_TIMED_OUT ? null : outcome,
+         };
+     }
+     finally {
+         clearTimeout(budgetTimer);
+     }
+ }
+ /**
+  * Read at most `maxBytes` from a response body, then cancel the stream to free
+  * the underlying connection. Returns the bytes read and whether the body was
+  * longer than the cap (`truncated`), so the caller can tell a complete, small
+  * body — whose emptiness or parse errors are meaningful — from a deliberately
+  * cut-off prefix of a large one, where only the presence of content is
+  * conclusive. This is what keeps the probe from downloading a multi-hundred-MB
+  * streamed dump in full just to confirm it is reachable.
+  */
+ async function readBoundedBody(response, maxBytes) {
+     const stream = response.body;
+     if (stream === null) {
+         return { bytes: new Uint8Array(0), truncated: false };
+     }
+     const chunks = [];
+     let total = 0;
+     let truncated = false;
+     // Breaking out of `for await` cancels the stream, which stops any further
+     // download and releases the underlying connection — so a large dump is never
+     // pulled in full once we have the prefix we need.
+     for await (const chunk of stream) {
+         chunks.push(chunk);
+         total += chunk.length;
+         if (total >= maxBytes) {
+             truncated = true;
+             break;
+         }
+     }
+     return { bytes: Buffer.concat(chunks), truncated };
+ }
+ /**
+  * Decode a bounded body to text for RDF validation, inflating it first when it
+  * is a gzip stream that `fetch` did not transparently decompress — e.g. a `.gz`
+  * data dump served as-is, or one labelled with a non-standard Content-Encoding
+  * (`application/gzip`) that undici does not recognise as a content coding.
+  * Detection is by the gzip magic on the delivered bytes, so a body that `fetch`
+  * already inflated (a standard `Content-Encoding: gzip`) is passed through
+  * untouched. A truncated gzip tail is expected — we only read a prefix — and
+  * inflates cleanly up to the cut, so it is never mistaken for corruption.
+  */
+ async function decodeProbeBody(bounded) {
+     if (!isGzip(bounded.bytes)) {
+         return {
+             text: decodeUtf8(bounded.bytes),
+             truncated: bounded.truncated,
+             corrupt: false,
+         };
+     }
+     // The compressed body is complete only when the raw read was not itself cut
+     // off: a gzip error on a complete body is genuine corruption, on a prefix we
+     // cut it is just the dropped tail.
+     const inflated = await gunzipPrefix(bounded.bytes, MAX_PROBE_BODY_BYTES, !bounded.truncated);
+     return {
+         text: decodeUtf8(inflated.bytes),
+         truncated: bounded.truncated || inflated.truncated,
+         corrupt: inflated.corrupt,
+     };
+ }
+ /** Whether the bytes begin with the gzip magic number (RFC 1952 §2.3.1). */
+ function isGzip(bytes) {
+     return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
+ }
+ /**
+  * Decode bytes as UTF-8 without throwing: an incomplete multi-byte sequence at
+  * the truncation boundary is replaced rather than fatal, since the RDF parser
+  * only needs the leading, intact portion to find the first triple.
+  */
+ function decodeUtf8(bytes) {
+     return new TextDecoder('utf-8', { fatal: false }).decode(bytes);
+ }
+ /**
+  * Inflate up to `maxBytes` of output from a gzip prefix, stopping once the cap
+  * is reached or the input runs out. `inputComplete` says whether the caller
+  * handed us the whole compressed body (true) or a prefix it had already cut
+  * (false). An inflate error therefore means different things: on a complete body
+  * the gzip is genuinely corrupt; on a cut prefix it is just the dropped tail, so
+  * whatever inflated cleanly is reported as a (truncated) partial inflate.
+  */
+ function gunzipPrefix(bytes, maxBytes, inputComplete) {
+     return new Promise((resolve) => {
+         const gunzip = createGunzip();
+         const chunks = [];
+         let total = 0;
+         // `resolve` and `destroy` are both idempotent, so the first outcome wins and
+         // any later event (e.g. a premature-close error emitted by `destroy`) is a
+         // harmless no-op — no `settled` guard needed.
+         function finish(outcome) {
+             gunzip.destroy();
+             resolve({ bytes: Buffer.concat(chunks), ...outcome });
+         }
+         gunzip.on('data', (chunk) => {
+             chunks.push(chunk);
+             total += chunk.length;
+             if (total >= maxBytes) {
+                 finish({ truncated: true, corrupt: false });
+             }
+         });
+         gunzip.on('error', () => finish({ truncated: !inputComplete, corrupt: inputComplete }));
+         gunzip.on('end', () => finish({ truncated: false, corrupt: false }));
+         gunzip.end(bytes);
+     });
+ }
  // The RDF serializations whose bodies we parse to confirm they carry triples. A
  // non-empty body in one of these formats that yields zero triples — an empty
  // graph such as a JSON-LD `{}`, an `<rdf:RDF/>`, or prefix-only Turtle — is a
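The bounded-read behaviour in the hunk above can be tried out without a network. The sketch below is an illustration under stated assumptions, not the package's code: `readPrefix` is a hypothetical stand-in that takes any async iterable of byte chunks instead of a `Response`, and `endlessDump` is a synthetic source that never ends, standing in for a very large dump.

```javascript
// Read at most roughly maxBytes from an async iterable of byte chunks.
async function readPrefix(source, maxBytes) {
  const chunks = [];
  let total = 0;
  let truncated = false;
  for await (const chunk of source) {
    chunks.push(chunk);
    total += chunk.length;
    if (total >= maxBytes) {
      truncated = true;
      break; // leaving the loop calls the iterator's return(), stopping the source
    }
  }
  return { bytes: Buffer.concat(chunks), truncated };
}

// Synthetic "large dump": yields 1 KiB chunks forever and records its cleanup.
let producerStopped = false;
async function* endlessDump() {
  try {
    while (true) yield new Uint8Array(1024);
  } finally {
    producerStopped = true;
  }
}

const demo = readPrefix(endlessDump(), 4096).then((result) => {
  console.log(result.bytes.length, result.truncated, producerStopped); // 4096 true true
  return result;
});
```

Two properties of this shape are worth keeping in mind. The cap is checked after a chunk is appended, so the result can exceed `maxBytes` by up to one chunk, and a body whose length lands exactly on the cap is also flagged as truncated, which errs on the side of an inconclusive result rather than a false failure.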
@@ -389,9 +585,21 @@ const rdfContentTypes = [
      'application/ld+json',
      'application/rdf+xml',
  ];
- async function validateBody(body, contentType, baseIRI, timeoutMs) {
+ // Serializations a streaming parser cannot validate from a truncated prefix.
+ // The line/statement-oriented formats (N-Triples, N-Quads, Turtle, TriG, N3) and
+ // SAX-based RDF/XML all yield their first triple from the opening chunk, but
+ // JSON-LD is a single JSON value whose parser emits nothing until the whole
+ // document closes — a truncated JSON-LD body parses to an ‘unclosed document’
+ // error, never a triple. So a truncated body in one of these can only be
+ // validated if it happened to fit the read cap in full; beyond that it is
+ // inconclusive, and we must not download it in full to find out.
+ const nonStreamableRdfContentTypes = ['application/ld+json'];
+ async function validateBody(body, contentType, baseIRI, timeoutMs, truncated) {
      if (body.length === 0) {
-         return 'Distribution is empty';
+         // A complete, empty body is a faulty distribution; an empty *prefix* (a
+         // truncated read that yielded no bytes, e.g. a corrupt gzip header) is
+         // inconclusive — the endpoint answered, we just could not validate content.
+         return truncated ? null : 'Distribution is empty';
      }
      // Media types are case-insensitive (RFC 9110 §8.3.1), so normalise before
      // matching the lower-case allow-list — a server sending `Application/LD+JSON`
@@ -400,7 +608,13 @@ async function validateBody(body, contentType, baseIRI, timeoutMs) {
      if (!serialization || !rdfContentTypes.includes(serialization)) {
          return null;
      }
-     const outcome = await classifyRdfBody(body, serialization, baseIRI, timeoutMs);
+     if (truncated && nonStreamableRdfContentTypes.includes(serialization)) {
+         // A bounded prefix of a non-streamable serialization (JSON-LD) can never
+         // yield a triple, so skip the doomed parse and report it inconclusive — only
+         // a complete document, small enough to fit the read cap, can be validated.
+         return null;
+     }
+     const outcome = await classifyRdfBody(body, serialization, baseIRI, timeoutMs, truncated);
      switch (outcome.type) {
          case 'empty':
              return 'Distribution contains no RDF triples';
@@ -422,8 +636,13 @@ async function validateBody(body, contentType, baseIRI, timeoutMs) {
   * on expiry — and likewise when a remote `@context` is unreachable — the outcome
   * is 'inconclusive', so a valid distribution is never flagged faulty for a
   * context host's failure. `baseIRI` resolves any relative IRIs in the document.
+  *
+  * When `truncated` is true the body is only a bounded prefix of a larger one, so
+  * only finding a triple ('hasTriples') is conclusive: a parse error at the cut
+  * or a clean end with no triple yet means we did not read far enough, not that
+  * the distribution is empty or malformed, and is reported as 'inconclusive'.
   */
- function classifyRdfBody(body, contentType, baseIRI, timeoutMs) {
+ function classifyRdfBody(body, contentType, baseIRI, timeoutMs, truncated) {
      return new Promise((resolve) => {
          const quads = rdfParser.parse(Readable.from([body]), {
              contentType,
@@ -441,10 +660,10 @@ function classifyRdfBody(body, contentType, baseIRI, timeoutMs) {
          }
          quads
              .on('data', () => settle({ type: 'hasTriples' }))
-             .on('error', (error) => settle(isRemoteContextError(error)
+             .on('error', (error) => settle(truncated || isRemoteContextError(error)
                  ? { type: 'inconclusive' }
                  : { type: 'parseError', message: error.message }))
-             .on('end', () => settle({ type: 'empty' }));
+             .on('end', () => settle(truncated ? { type: 'inconclusive' } : { type: 'empty' }));
      });
  }
  /**
  /**
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@lde/distribution-probe",
-   "version": "0.1.12",
+   "version": "0.2.0",
    "repository": {
      "url": "git+https://github.com/ldelements/lde.git",
      "directory": "packages/distribution-probe"