@lde/search 0.0.0

package/README.md ADDED
@@ -0,0 +1,170 @@
1
+ # @lde/search
2
+
3
+ Engine-agnostic search projection for RDF-backed pipelines. **`projectGraph`**
4
+ streams the result of a SPARQL `CONSTRUCT` into flat search documents, with no
5
+ engine and no vocabulary baked in. Internally it does two things per subject of
6
+ a root type: frame its one-hop subgraph into a JSON-LD IR node, then project
7
+ that node into a flat document from a **declarative field spec**.
8
+
9
+ An engine adapter (e.g. [`@lde/search-typesense`](../search-typesense)) then
10
+ writes those documents to a search backend.
11
+
12
+ ```ts
13
+ import { projectGraph, type Projection } from '@lde/search';
14
+
15
+ const projection: Projection = {
16
+ /* type + field spec — see below */
17
+ };
18
+
19
+ for await (const document of projectGraph(quads, [projection])) {
20
+ // one flat search document per matching subject, streamed
21
+ }
22
+ ```
23
+
24
+ `projectGraph` is fully streaming: subjects are grouped and framed one at a time
25
+ and documents are yielded as they are produced, so beyond a subject index memory
26
+ stays flat at scale (framing the whole graph at once is roughly O(N²)). Duplicate
27
+ triples are collapsed first, because some SPARQL engines (e.g. QLever) do not
28
+ deduplicate `CONSTRUCT` output. The IR carries no `@context`, so a `derivation`
29
+ reading it sees full predicate IRIs with language tags preserved.
30
+
31
+ ## Projection
32
+
33
+ The mapping is data, not code. Each field declares the IR `path` to read and a
34
+ `kind`; the conventions (per-locale split, diacritic folding via
35
+ [`@lde/text-normalization`](../text-normalization), facet arrays, numeric
36
+ coercion) are applied for you. Computed fields are `derivations` — hooks that
37
+ read the node and set fields the kinds can't.
38
+
39
+ ```ts
40
+ import { projectGraph, irisOf, type Projection } from '@lde/search';
41
+
42
+ const projection: Projection = {
43
+ type: 'http://www.w3.org/ns/dcat#Dataset',
44
+ fields: [
45
+ // → title_nl, title_en, title_search_nl, title_search_en, title_sort_nl, title_sort_en
46
+ {
47
+ name: 'title',
48
+ path: 'http://purl.org/dc/terms/title',
49
+ kind: {
50
+ type: 'langText',
51
+ locales: ['nl', 'en'],
52
+ display: true,
53
+ search: true,
54
+ sort: true,
55
+ },
56
+ },
57
+ // → publisher (IRI facet)
58
+ {
59
+ name: 'publisher',
60
+ path: 'http://purl.org/dc/terms/publisher',
61
+ kind: { type: 'facet', iri: true },
62
+ },
63
+ // → size (int)
64
+ { name: 'size', path: 'urn:dr:size', kind: { type: 'number' } },
65
+ ],
66
+ derivations: [
67
+ (document, node) => {
68
+ document.class_count = irisOf(node, 'urn:dr:class').length;
69
+ },
70
+ ],
71
+ };
72
+
73
+ for await (const document of projectGraph(quads, [projection])) {
74
+ // …
75
+ }
76
+ ```
77
+
78
+ **Kinds**
79
+
80
+ | kind | emits |
81
+ | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
82
+ | `langText` | per locale (see below), each opt-in: `_${locale}` display with `display`, `_search_${locale}` folded with `search`, `_sort_${locale}` folded with `sort` |
83
+ | `facet` | the field as a deduped array; `iri` reads `@id`; `search` adds a folded `_search`; `transform` rewrites values |
84
+ | `number` | a numeric scalar, truncated to an integer; `date` is a separate kind that parses an ISO date-time to Unix seconds |
85
+
86
+ ## Locales
87
+
88
+ `locales` is the **single** list of languages a `langText` field projects;
89
+ `display`, `search` and `sort` are independent opt-in families that each fan out
90
+ over it (so a field emits exactly what it opts into):
91
+
92
+ - `display` → `title_nl`/`title_en` (accents preserved);
93
+ - `search` → `title_search_nl`/`title_search_en` (folded; one field per locale
94
+ lets a query `query_by` them and rank the user’s language higher via
95
+ `query_by_weights`, and lets a language that needs a dedicated tokenizer set
96
+ its own `locale` in the schema);
97
+ - `sort` → `title_sort_nl`/`title_sort_en` (folded, so a locale-switching UI
98
+ sorts on the active language).
99
+
100
+ A field with `search` but no `display` is **search-only** — folded and stemmed
101
+ for retrieval but never rendered (e.g. a `publisher` searched here but shown via
102
+ a separate label).
103
+
104
+ Folding the search fields is what lets diacritic-insensitive matching and
105
+ stemming coexist. A search engine on its **default** locale typically folds case
106
+ and diacritics for you (Typesense v30, verified, even folds ø/æ/ß) — so there the
107
+ folding here is belt-and-suspenders. But enabling a language’s **stemming**
108
+ requires setting that language’s `locale` (e.g. `locale: 'nl'` + `stem: true` so
109
+ `huizen` matches `huis`), and a non-default locale switches the engine to ICU
110
+ tokenization, which **preserves** diacritics. At that point the engine no longer
111
+ folds them, and `fold()` is what keeps matching diacritic-insensitive. Stemming
112
+ is a per-field engine-schema choice (the consumer’s), and being rules-based it
113
+ can mangle proper nouns and place names — e.g. the Dutch stemmer reduces the city
114
+ `Bergen` to `berg`, colliding it with “mountain”.
115
+
116
+ Recommended split: enable stemming on the **free-text** search fields
117
+ (`*_search_${locale}`, descriptions, keywords) where morphological recall helps
118
+ (`verhaal` ↔ `verhalen`), and keep **place names and other proper-noun facets on
119
+ a separate, unstemmed field** (facets are exact-match anyway). That captures the
120
+ recall without the `Bergen`/`berg` collision in the facet. A `stem_dictionary`
121
+ can pin specific names if you need stemmed free text without particular collisions.
122
+
123
+ **Only listed locales are indexed.** A literal whose language tag is not in
124
+ `locales` is not projected at all — no display, no search, no sort field — so it
125
+ is invisible to the index. To index a language, add it to `locales`.
126
+
127
+ Per-locale fields are **omitted, never empty**, when a document lacks that
128
+ language, so declare them `optional: true` in the engine schema. At query time,
129
+ sort with `missing_values: last` to push documents lacking the active locale to
130
+ the end, and `query_by` all the per-locale search fields (weighting the user’s
131
+ locale higher) to keep cross-language recall.
132
+
133
+ A literal with no `@language` tag matches no locale, so it is not projected. Tag
134
+ your source literals (or pre-process them) for the languages you index.
135
+
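Reviewer note on the Locales section: the fan-out rules can be illustrated with a small self-contained sketch. `langTextFieldNames`, `LangTextOptions` and `presentLanguages` are hypothetical names used only here and are not part of `@lde/search`. The function mirrors the naming rules stated in the README (`_${locale}`, `_search_${locale}`, `_sort_${locale}`, and nothing for a locale without values), not the package's actual implementation.

```ts
// Hypothetical helper, not part of @lde/search: lists the keys a langText
// field would emit, following the naming rules described in the README.
interface LangTextOptions {
  readonly locales: readonly string[];
  readonly display?: boolean;
  readonly search?: boolean;
  readonly sort?: boolean;
}

function langTextFieldNames(
  name: string,
  options: LangTextOptions,
  presentLanguages: ReadonlySet<string>,
): string[] {
  const keys: string[] = [];
  for (const locale of options.locales) {
    // A listed locale with no value in the source emits nothing
    // (omitted, never empty).
    if (!presentLanguages.has(locale)) continue;
    if (options.display) keys.push(`${name}_${locale}`);
    if (options.search) keys.push(`${name}_search_${locale}`);
    if (options.sort) keys.push(`${name}_sort_${locale}`);
  }
  return keys;
}

// A title with Dutch and French literals, projected for nl and en only.
// French is not listed, so it is ignored; English is listed but absent.
const titleKeys = langTextFieldNames(
  'title',
  { locales: ['nl', 'en'], display: true, search: true },
  new Set(['nl', 'fr']),
);
console.log(titleKeys); // ['title_nl', 'title_search_nl']
```

Because `title_en` and `title_search_en` are simply absent here, an engine schema would need those fields marked optional, as the README recommends.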
136
+ ## Querying
137
+
138
+ The search fields are stored already case- and diacritic-folded, so **the query
139
+ must be folded the same way** with the same `fold()` from
140
+ [`@lde/text-normalization`](../text-normalization) before it reaches the engine.
141
+ Otherwise index and query are normalized differently and matches silently miss
142
+ (the user sees no results, with no error). An engine on its default locale would
143
+ fold a raw query for you, but one set to a stemming locale (which preserves
144
+ diacritics) or a non-folding backend will not — so always fold, and matching
145
+ stays correct on any engine.
146
+
147
+ ```ts
148
+ import { fold } from '@lde/text-normalization';
149
+
150
+ await client
151
+ .collections(collection)
152
+ .documents()
153
+ .search({
154
+ q: fold(userQuery),
155
+ query_by: 'title_search_nl,title_search_en',
156
+ query_by_weights: '2,1', // rank the user’s locale higher
157
+ });
158
+ ```
159
+
160
+ This contract holds for **any** consumer, including a search API built on top of
161
+ this package: index-time and query-time folding must use the same `fold()`, or
162
+ non-decomposing terms silently miss.
163
+
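Reviewer note on the Querying section: the folding contract can be demonstrated without any engine. The sketch below uses `foldSketch`, a hypothetical stand-in for `fold()` from `@lde/text-normalization`. It assumes folding means lowercasing plus stripping combining marks after NFD normalization; the real `fold()` may do more (for example, handle characters that do not decompose). The substring check is only a placeholder for engine matching.

```ts
// Stand-in for fold(), for illustration only. Assumption: lowercase plus
// removal of combining marks after NFD. The real fold() may normalize more.
function foldSketch(text: string): string {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase();
}

// Index time: the search field is stored folded.
const indexedTitle = foldSketch('Café Élan');

// Query time: the same function is applied before matching.
const rawQuery = 'CAFÉ';
const foldedHit = indexedTitle.includes(foldSketch(rawQuery));
const unfoldedHit = indexedTitle.includes(rawQuery);

console.log(indexedTitle); // 'cafe elan'
console.log(foldedHit, unfoldedHit); // true false
```

The unfolded query misses without raising any error, which is the silent failure mode the README warns about.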
164
+ ## Why a spec
165
+
166
+ The field spec's vocabulary mirrors SHACL on purpose: `path` is `sh:path`, and
167
+ the kind is derivable from `sh:datatype` / `sh:nodeKind` / `sh:maxCount` plus
168
+ search annotations. So the same projection engine that runs a hand-written spec
169
+ today will run a **SHACL-generated** spec tomorrow — the engine and the IR stay;
170
+ only spec-authoring gets automated. Nothing is thrown away.
@@ -0,0 +1,16 @@
1
+ import type { Quad } from '@rdfjs/types';
2
+ /** A framed JSON-LD node (full-IRI keys); the engine-agnostic search IR. */
3
+ export type FramedNode = Record<string, unknown>;
4
+ /**
5
+ * Frame CONSTRUCT quads into one JSON-LD node per subject of `rootType`. Each
6
+ * root subject’s own triples plus the one-hop nodes it references (e.g. nested
7
+ * publisher/distribution resources) are grouped lazily and framed one at a
8
+ * time, so beyond the subject index only a single subgraph is held — whole-graph
9
+ * `jsonld.frame()` is ~O(N²). Duplicate triples are collapsed first because some
10
+ * SPARQL engines
11
+ * (e.g. QLever) do not dedupe CONSTRUCT output. The caller supplies the root
12
+ * type, keeping the framing domain-agnostic; the frame carries no `@context`, so
13
+ * framed keys are full predicate IRIs.
14
+ */
15
+ export declare function frameByType(quads: readonly Quad[], rootType: string): AsyncIterable<FramedNode>;
16
+ //# sourceMappingURL=frame-by-type.d.ts.map
@@ -0,0 +1 @@
1
+ {"version":3,"file":"frame-by-type.d.ts","sourceRoot":"","sources":["../src/frame-by-type.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,IAAI,EAAE,MAAM,cAAc,CAAC;AAMzC,4EAA4E;AAC5E,MAAM,MAAM,UAAU,GAAG,MAAM,CAAC,MAAM,EAAE,OAAO,CAAC,CAAC;AAIjD;;;;;;;;;;GAUG;AACH,wBAAuB,WAAW,CAChC,KAAK,EAAE,SAAS,IAAI,EAAE,EACtB,QAAQ,EAAE,MAAM,GACf,aAAa,CAAC,UAAU,CAAC,CAU3B"}
@@ -0,0 +1,63 @@
1
+ import jsonld from 'jsonld';
2
+ import { rdf } from '@tpluscode/rdf-ns-builders';
3
+ const RDF_TYPE = rdf.type.value;
4
+ const FRAME_OPTIONS = { omitGraph: false, embed: '@always' };
5
+ /**
6
+ * Frame CONSTRUCT quads into one JSON-LD node per subject of `rootType`. Each
7
+ * root subject’s own triples plus the one-hop nodes it references (e.g. nested
8
+ * publisher/distribution resources) are grouped lazily and framed one at a
9
+ * time, so beyond the subject index only a single subgraph is held — whole-graph
10
+ * `jsonld.frame()` is ~O(N²). Duplicate triples are collapsed first because some
11
+ * SPARQL engines
12
+ * (e.g. QLever) do not dedupe CONSTRUCT output. The caller supplies the root
13
+ * type, keeping the framing domain-agnostic; the frame carries no `@context`, so
14
+ * framed keys are full predicate IRIs.
15
+ */
16
+ export async function* frameByType(quads, rootType) {
17
+ const frame = { '@type': rootType };
18
+ for (const subgraph of groupByRoot(quads, rootType)) {
19
+ const expanded = await jsonld.fromRDF(subgraph);
20
+ const framed = await jsonld.frame(expanded, frame, FRAME_OPTIONS);
21
+ const node = framed['@graph']?.[0];
22
+ if (node !== undefined) {
23
+ yield node;
24
+ }
25
+ }
26
+ }
27
+ /**
28
+ * Yield one self-contained quad subgraph per root subject – its own (deduped)
29
+ * triples plus the triples of the one-hop IRI or blank nodes it references –
30
+ * lazily, so only the subject index and the current subgraph are held at once
31
+ * (never the whole materialized list of subgraphs).
32
+ */
33
+ function* groupByRoot(quads, rootType) {
34
+ const bySubject = new Map();
35
+ const rootIris = [];
36
+ const seen = new Set();
37
+ for (const quad of quads) {
38
+ const key = `${quad.subject.value} ${quad.predicate.value} ${quad.object.value} ${quad.object.termType === 'Literal' ? quad.object.language || quad.object.datatype.value : ''}`;
39
+ if (seen.has(key)) {
40
+ continue;
41
+ }
42
+ seen.add(key);
43
+ const subject = quad.subject.value;
44
+ const owned = bySubject.get(subject);
45
+ if (owned === undefined) {
46
+ bySubject.set(subject, [quad]);
47
+ }
48
+ else {
49
+ owned.push(quad);
50
+ }
51
+ if (quad.predicate.value === RDF_TYPE && quad.object.value === rootType) {
52
+ rootIris.push(subject);
53
+ }
54
+ }
55
+ for (const iri of rootIris) {
56
+ const owned = bySubject.get(iri) ?? [];
57
+ const referenced = owned
58
+ .filter((quad) => quad.object.termType === 'NamedNode' ||
59
+ quad.object.termType === 'BlankNode')
60
+ .flatMap((quad) => bySubject.get(quad.object.value) ?? []);
61
+ yield [...owned, ...referenced];
62
+ }
63
+ }
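Reviewer note on `groupByRoot`: the dedupe-then-group step can be tried in isolation with simplified quads. This is a sketch under stated assumptions. `MiniQuad` and `groupByRootSketch` are hypothetical names, the quad shape is flattened to plain strings instead of RDF/JS terms, the dedupe key uses `JSON.stringify` rather than the space-joined key in the compiled code, and the function returns an array eagerly where the original is a lazy generator.

```ts
// Simplified quad shape for illustration; the package itself uses RDF/JS quads.
interface MiniQuad {
  readonly subject: string;
  readonly predicate: string;
  readonly object: string;
  readonly objectIsNode: boolean; // true for an IRI or blank-node object
}

const RDF_TYPE_IRI = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type';

function groupByRootSketch(
  quads: readonly MiniQuad[],
  rootType: string,
): MiniQuad[][] {
  const bySubject = new Map<string, MiniQuad[]>();
  const roots: string[] = [];
  const seen = new Set<string>();
  for (const quad of quads) {
    // Collapse duplicate triples before indexing them by subject.
    const key = JSON.stringify([
      quad.subject,
      quad.predicate,
      quad.object,
      quad.objectIsNode,
    ]);
    if (seen.has(key)) continue;
    seen.add(key);
    const owned = bySubject.get(quad.subject) ?? [];
    owned.push(quad);
    bySubject.set(quad.subject, owned);
    if (quad.predicate === RDF_TYPE_IRI && quad.object === rootType) {
      roots.push(quad.subject);
    }
  }
  // One subgraph per root: its own triples plus those of one-hop references.
  return roots.map((root) => {
    const owned = bySubject.get(root) ?? [];
    const referenced = owned
      .filter((quad) => quad.objectIsNode)
      .flatMap((quad) => bySubject.get(quad.object) ?? []);
    return [...owned, ...referenced];
  });
}

const DATASET = 'http://www.w3.org/ns/dcat#Dataset';
const sampleQuads: MiniQuad[] = [
  { subject: 'urn:ex:d1', predicate: RDF_TYPE_IRI, object: DATASET, objectIsNode: true },
  { subject: 'urn:ex:d1', predicate: RDF_TYPE_IRI, object: DATASET, objectIsNode: true }, // duplicate
  { subject: 'urn:ex:d1', predicate: 'http://purl.org/dc/terms/publisher', object: 'urn:ex:org', objectIsNode: true },
  { subject: 'urn:ex:org', predicate: 'http://xmlns.com/foaf/0.1/name', object: 'Example Org', objectIsNode: false },
  { subject: 'urn:ex:other', predicate: 'http://xmlns.com/foaf/0.1/name', object: 'Unrelated', objectIsNode: false },
];

const subgraphSizes = groupByRootSketch(sampleQuads, DATASET).map(
  (subgraph) => subgraph.length,
);
console.log(subgraphSizes); // [3]
```

The single subgraph holds the root's type and publisher triples plus the publisher's name triple; the duplicate type triple and the unrelated subject are left out.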
@@ -0,0 +1,4 @@
1
+ export { projectGraph, irisOf, literalsOf, firstLiteralOf } from './project.js';
2
+ export type { SearchDocument, Projection, FieldSpec, FieldKind, LangTextKind, FacetKind, NumberKind, DateKind, Derivation, } from './project.js';
3
+ export type { FramedNode } from './frame-by-type.js';
4
+ //# sourceMappingURL=index.d.ts.map
@@ -0,0 +1 @@
1
+ {"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,YAAY,EAAE,MAAM,EAAE,UAAU,EAAE,cAAc,EAAE,MAAM,cAAc,CAAC;AAChF,YAAY,EACV,cAAc,EACd,UAAU,EACV,SAAS,EACT,SAAS,EACT,YAAY,EACZ,SAAS,EACT,UAAU,EACV,QAAQ,EACR,UAAU,GACX,MAAM,cAAc,CAAC;AACtB,YAAY,EAAE,UAAU,EAAE,MAAM,oBAAoB,CAAC"}
package/dist/index.js ADDED
@@ -0,0 +1 @@
1
+ export { projectGraph, irisOf, literalsOf, firstLiteralOf } from './project.js';
@@ -0,0 +1,99 @@
1
+ import type { Quad } from '@rdfjs/types';
2
+ import { type FramedNode } from './frame-by-type.js';
3
+ /** A flat search document. `id` is the engine document key. */
4
+ export type SearchDocument = {
5
+ id: string;
6
+ } & Record<string, unknown>;
7
+ /**
8
+ * How one framed-IR property projects into search fields. The vocabulary mirrors
9
+ * SHACL so a generator can later emit it from shapes + search annotations:
10
+ * `path` is `sh:path`, and the kind is derivable from `sh:datatype`/`sh:nodeKind`
11
+ * /`sh:maxCount` plus the search annotations.
12
+ */
13
+ export type FieldKind = LangTextKind | FacetKind | NumberKind | DateKind;
14
+ /**
15
+ * Language-tagged text, projected per locale. `locales` is the single source of
16
+ * truth for which languages this field emits; `display`, `search` and `sort` are
17
+ * three independent opt-in families that each fan out over it:
18
+ * - `display` → `${name}_${locale}` display label, accents preserved;
19
+ * - `search` → `${name}_search_${locale}` folded match field (one per locale so
20
+ * the engine can tokenize/stem each language and the query can rank the user’s
21
+ * locale higher);
22
+ * - `sort` → `${name}_sort_${locale}` folded sort key (one per locale so a
23
+ * locale-switching UI sorts on the active language).
24
+ *
25
+ * All three default off — a field emits exactly the families it opts into (e.g.
26
+ * `search` alone is a search-only field, shown via a separate label). Only listed
27
+ * locales are projected: a value whose language tag is not in `locales`
28
+ * (including an untagged literal) is not indexed at all.
29
+ */
30
+ export interface LangTextKind {
31
+ readonly type: 'langText';
32
+ /** The languages to project; drives whichever of the families are enabled. */
33
+ readonly locales: readonly string[];
34
+ /** Emit the per-locale display labels `${name}_${locale}` (accents preserved). */
35
+ readonly display?: boolean;
36
+ /** Emit a folded `${name}_search_${locale}` per locale (matchable). */
37
+ readonly search?: boolean;
38
+ /** Emit a folded `${name}_sort_${locale}` per locale (sortable). */
39
+ readonly sort?: boolean;
40
+ }
41
+ /** A faceted multi-value field, optionally also folded for search. */
42
+ export interface FacetKind {
43
+ readonly type: 'facet';
44
+ /** Read IRI references (`@id`) rather than literal values. */
45
+ readonly iri?: boolean;
46
+ /** Also emit a folded `${name}_search` array. */
47
+ readonly search?: boolean;
48
+ /** Transform each value before faceting (e.g. strip a media-type prefix). */
49
+ readonly transform?: (value: string) => string;
50
+ }
51
+ /** A numeric scalar. */
52
+ export interface NumberKind {
53
+ readonly type: 'number';
54
+ }
55
+ /** An ISO date-time, parsed into Unix seconds. */
56
+ export interface DateKind {
57
+ readonly type: 'date';
58
+ }
59
+ /**
60
+ * One field of a projection: an output `name`, the framed-IR predicate `path` to
61
+ * read (the SHACL `sh:path`), and the kind-specific config discriminated by
62
+ * `type`.
63
+ */
64
+ export type FieldSpec = {
65
+ /** Output field base name; per-kind suffixes are appended. */
66
+ readonly name: string;
67
+ /** Framed-IR predicate IRI to read (the SHACL `sh:path`). */
68
+ readonly path: string;
69
+ } & FieldKind;
70
+ /** A computed field that is not a direct projection of a single path
71
+ * (e.g. a status rank, or a group derived from a code table). */
72
+ export type Derivation = (document: SearchDocument, node: FramedNode) => void;
73
+ /**
74
+ * One root type’s complete projection — the runtime form of a single SHACL
75
+ * NodeShape: `type` is its `sh:targetClass` (and the framed node’s `@type`),
76
+ * `fields` are its property shapes, and `derivations` are its `sh:rule`-shaped
77
+ * computed fields. A generator emits one of these per NodeShape.
78
+ */
79
+ export interface Projection {
80
+ readonly type: string;
81
+ readonly fields: readonly FieldSpec[];
82
+ readonly derivations?: readonly Derivation[];
83
+ }
84
+ /**
85
+ * Project one framed JSON-LD node into a flat search document: apply each field
86
+ * spec, then run the derivations (which may read fields the specs already set).
87
+ */
88
+ export declare function projectDocument(node: FramedNode, projection: Projection): SearchDocument;
89
+ /**
90
+ * Frame `quads` for every projection’s root type and project each node with its
91
+ * type’s projection — the multi-shape pipeline. Streams one document at a time
92
+ * so memory stays flat. The IR maps to a projection by type, so adding a shape
93
+ * is adding a `Projection` (no engine change).
94
+ */
95
+ export declare function projectGraph(quads: readonly Quad[], projections: readonly Projection[]): AsyncIterable<SearchDocument>;
96
+ export declare function literalsOf(node: FramedNode, path: string): string[];
97
+ export declare function firstLiteralOf(node: FramedNode, path: string): string | undefined;
98
+ export declare function irisOf(node: FramedNode, path: string): string[];
99
+ //# sourceMappingURL=project.d.ts.map
@@ -0,0 +1 @@
1
+ {"version":3,"file":"project.d.ts","sourceRoot":"","sources":["../src/project.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,IAAI,EAAE,MAAM,cAAc,CAAC;AAEzC,OAAO,EAAe,KAAK,UAAU,EAAE,MAAM,oBAAoB,CAAC;AAElE,+DAA+D;AAC/D,MAAM,MAAM,cAAc,GAAG;IAAE,EAAE,EAAE,MAAM,CAAA;CAAE,GAAG,MAAM,CAAC,MAAM,EAAE,OAAO,CAAC,CAAC;AAEtE;;;;;GAKG;AACH,MAAM,MAAM,SAAS,GAAG,YAAY,GAAG,SAAS,GAAG,UAAU,GAAG,QAAQ,CAAC;AAEzE;;;;;;;;;;;;;;;GAeG;AACH,MAAM,WAAW,YAAY;IAC3B,QAAQ,CAAC,IAAI,EAAE,UAAU,CAAC;IAC1B,8EAA8E;IAC9E,QAAQ,CAAC,OAAO,EAAE,SAAS,MAAM,EAAE,CAAC;IACpC,kFAAkF;IAClF,QAAQ,CAAC,OAAO,CAAC,EAAE,OAAO,CAAC;IAC3B,uEAAuE;IACvE,QAAQ,CAAC,MAAM,CAAC,EAAE,OAAO,CAAC;IAC1B,oEAAoE;IACpE,QAAQ,CAAC,IAAI,CAAC,EAAE,OAAO,CAAC;CACzB;AAED,sEAAsE;AACtE,MAAM,WAAW,SAAS;IACxB,QAAQ,CAAC,IAAI,EAAE,OAAO,CAAC;IACvB,8DAA8D;IAC9D,QAAQ,CAAC,GAAG,CAAC,EAAE,OAAO,CAAC;IACvB,iDAAiD;IACjD,QAAQ,CAAC,MAAM,CAAC,EAAE,OAAO,CAAC;IAC1B,6EAA6E;IAC7E,QAAQ,CAAC,SAAS,CAAC,EAAE,CAAC,KAAK,EAAE,MAAM,KAAK,MAAM,CAAC;CAChD;AAED,wBAAwB;AACxB,MAAM,WAAW,UAAU;IACzB,QAAQ,CAAC,IAAI,EAAE,QAAQ,CAAC;CACzB;AAED,kDAAkD;AAClD,MAAM,WAAW,QAAQ;IACvB,QAAQ,CAAC,IAAI,EAAE,MAAM,CAAC;CACvB;AAED;;;;GAIG;AACH,MAAM,MAAM,SAAS,GAAG;IACtB,8DAA8D;IAC9D,QAAQ,CAAC,IAAI,EAAE,MAAM,CAAC;IACtB,6DAA6D;IAC7D,QAAQ,CAAC,IAAI,EAAE,MAAM,CAAC;CACvB,GAAG,SAAS,CAAC;AAEd;kEACkE;AAClE,MAAM,MAAM,UAAU,GAAG,CAAC,QAAQ,EAAE,cAAc,EAAE,IAAI,EAAE,UAAU,KAAK,IAAI,CAAC;AAE9E;;;;;GAKG;AACH,MAAM,WAAW,UAAU;IACzB,QAAQ,CAAC,IAAI,EAAE,MAAM,CAAC;IACtB,QAAQ,CAAC,MAAM,EAAE,SAAS,SAAS,EAAE,CAAC;IACtC,QAAQ,CAAC,WAAW,CAAC,EAAE,SAAS,UAAU,EAAE,CAAC;CAC9C;AAED;;;GAGG;AACH,wBAAgB,eAAe,CAC7B,IAAI,EAAE,UAAU,EAChB,UAAU,EAAE,UAAU,GACrB,cAAc,CAehB;AAED;;;;;GAKG;AACH,wBAAuB,YAAY,CACjC,KAAK,EAAE,SAAS,IAAI,EAAE,EACtB,WAAW,EAAE,SAAS,UAAU,EAAE,GACjC,aAAa,CAAC,cAAc,CAAC,CAS/B;AA+FD,wBAAgB,UAAU,CAAC,IAAI,EAAE,UAAU,EAAE,IAAI,EAAE,MAAM,GAAG,MAAM,EAAE,CAInE;AAED,wBAAgB,cAAc,CAC5B,IAAI,EAAE,UAAU,EAChB,IAAI,EAAE,MAAM,GACX,MAAM,GAAG,SAAS,CAEpB;AAED,wBAAgB,MAAM,CAAC,IAAI,EAAE,UAAU,EAAE,IAAI,EAAE,MAAM,GAAG,MAAM,EAAE,CAI/D"}
@@ -0,0 +1,170 @@
1
+ import { fold } from '@lde/text-normalization';
2
+ import { frameByType } from './frame-by-type.js';
3
+ /**
4
+ * Project one framed JSON-LD node into a flat search document: apply each field
5
+ * spec, then run the derivations (which may read fields the specs already set).
6
+ */
7
+ export function projectDocument(node, projection) {
8
+ const id = node['@id'];
9
+ if (typeof id !== 'string') {
10
+ throw new Error(`Cannot project a ${projection.type} node without an @id: every search document needs a stable key, and an empty one would collide with other keyless nodes.`);
11
+ }
12
+ const document = { id };
13
+ for (const field of projection.fields) {
14
+ applyField(document, node, field);
15
+ }
16
+ for (const derive of projection.derivations ?? []) {
17
+ derive(document, node);
18
+ }
19
+ return document;
20
+ }
21
+ /**
22
+ * Frame `quads` for every projection’s root type and project each node with its
23
+ * type’s projection — the multi-shape pipeline. Streams one document at a time
24
+ * so memory stays flat. The IR maps to a projection by type, so adding a shape
25
+ * is adding a `Projection` (no engine change).
26
+ */
27
+ export async function* projectGraph(quads, projections) {
28
+ const byType = new Map(projections.map((projection) => [projection.type, projection]));
29
+ for (const projection of byType.values()) {
30
+ for await (const node of frameByType(quads, projection.type)) {
31
+ yield projectDocument(node, projection);
32
+ }
33
+ }
34
+ }
35
+ function applyField(document, node, field) {
36
+ switch (field.type) {
37
+ case 'langText':
38
+ return applyLangText(document, langValuesOf(node, field.path), field);
39
+ case 'facet':
40
+ return applyFacet(document, node, field);
41
+ case 'number':
42
+ return setNumber(document, field.name, toInteger(firstLiteralOf(node, field.path)));
43
+ case 'date':
44
+ return setNumber(document, field.name, isoToUnix(firstLiteralOf(node, field.path)));
45
+ }
46
+ }
47
+ function applyLangText(document, values, { name, locales, display, search, sort }) {
48
+ if (locales.length === 0) {
49
+ throw new Error(`langText field “${name}” must declare at least one locale; nothing would be projected otherwise.`);
50
+ }
51
+ for (const locale of locales) {
52
+ const localeValues = values
53
+ .filter((value) => value.lang === locale)
54
+ .map((value) => value.value);
55
+ if (localeValues.length === 0) {
56
+ continue;
57
+ }
58
+ // Display shows one label (accents preserved); sort keys off that same
59
+ // primary value (folded); search folds every value of the locale so all
60
+ // are matchable. Absent locales emit nothing (the field stays optional).
61
+ const [primary] = localeValues;
62
+ if (display) {
63
+ setString(document, `${name}_${locale}`, primary);
64
+ }
65
+ if (search) {
66
+ setString(document, `${name}_search_${locale}`, fold(localeValues.join(' ')).trim());
67
+ }
68
+ if (sort) {
69
+ setString(document, `${name}_sort_${locale}`, fold(primary));
70
+ }
71
+ }
72
+ }
73
+ function applyFacet(document, node, { name, path, iri, search, transform }) {
74
+ const raw = iri ? irisOf(node, path) : literalsOf(node, path);
75
+ const values = dedupe(transform ? raw.map(transform) : raw);
76
+ setArray(document, name, values);
77
+ if (search) {
78
+ setArray(document, `${name}_search`, dedupe(values.map((value) => fold(value))));
79
+ }
80
+ }
81
+ function langValuesOf(node, path) {
82
+ return valuesOf(node, path)
83
+ .map(toLangValue)
84
+ .filter((value) => value !== undefined);
85
+ }
86
+ export function literalsOf(node, path) {
87
+ return valuesOf(node, path)
88
+ .map(literalString)
89
+ .filter((value) => value !== undefined);
90
+ }
91
+ export function firstLiteralOf(node, path) {
92
+ return literalsOf(node, path)[0];
93
+ }
94
+ export function irisOf(node, path) {
95
+ return valuesOf(node, path)
96
+ .map(iriString)
97
+ .filter((value) => value !== undefined);
98
+ }
99
+ function valuesOf(node, path) {
100
+ const value = node[path];
101
+ if (value === undefined) {
102
+ return [];
103
+ }
104
+ return Array.isArray(value) ? value : [value];
105
+ }
106
+ function toLangValue(value) {
107
+ const literal = literalString(value);
108
+ if (literal === undefined) {
109
+ return undefined;
110
+ }
111
+ const lang = isObject(value) && typeof value['@language'] === 'string'
112
+ ? value['@language']
113
+ : '';
114
+ return { value: literal, lang };
115
+ }
116
+ function literalString(value) {
117
+ if (typeof value === 'string') {
118
+ return value;
119
+ }
120
+ if (isObject(value)) {
121
+ const inner = value['@value'];
122
+ if (typeof inner === 'string') {
123
+ return inner;
124
+ }
125
+ if (typeof inner === 'number' || typeof inner === 'boolean') {
126
+ return String(inner);
127
+ }
128
+ }
129
+ return undefined;
130
+ }
131
+ function iriString(value) {
132
+ if (typeof value === 'string') {
133
+ return value;
134
+ }
135
+ if (isObject(value) && typeof value['@id'] === 'string') {
136
+ return value['@id'];
137
+ }
138
+ return undefined;
139
+ }
140
+ function toInteger(literal) {
141
+ return literal === undefined ? undefined : Math.trunc(Number(literal));
142
+ }
143
+ function isoToUnix(iso) {
144
+ if (iso === undefined) {
145
+ return undefined;
146
+ }
147
+ const millis = new Date(iso).getTime();
148
+ return Number.isNaN(millis) ? undefined : Math.trunc(millis / 1000);
149
+ }
150
+ function setNumber(document, field, value) {
151
+ if (value !== undefined && !Number.isNaN(value)) {
152
+ document[field] = value;
153
+ }
154
+ }
155
+ function dedupe(values) {
156
+ return [...new Set(values)];
157
+ }
158
+ function setString(document, field, value) {
159
+ if (value !== undefined && value !== '') {
160
+ document[field] = value;
161
+ }
162
+ }
163
+ function setArray(document, field, values) {
164
+ if (values.length > 0) {
165
+ document[field] = values;
166
+ }
167
+ }
168
+ function isObject(value) {
169
+ return typeof value === 'object' && value !== null && !Array.isArray(value);
170
+ }
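Reviewer note on the `number` and `date` kinds: the coercion in the compiled code above can be exercised on its own. `toIntegerSketch` and `isoToUnixSketch` are hypothetical re-typed copies for illustration. They fold the `NaN` check from `setNumber` into the coercion itself, so `undefined` here corresponds to "field not set" in the original.

```ts
// Re-typed copies of the coercion helpers, for illustration only.
function toIntegerSketch(literal: string | undefined): number | undefined {
  if (literal === undefined) return undefined;
  const value = Math.trunc(Number(literal));
  // In the compiled code this NaN check lives in setNumber.
  return Number.isNaN(value) ? undefined : value;
}

function isoToUnixSketch(iso: string | undefined): number | undefined {
  if (iso === undefined) return undefined;
  const millis = new Date(iso).getTime();
  return Number.isNaN(millis) ? undefined : Math.trunc(millis / 1000);
}

console.log(toIntegerSketch('12.9')); // 12
console.log(toIntegerSketch('n/a')); // undefined
console.log(toIntegerSketch('')); // 0
console.log(isoToUnixSketch('1970-01-01T00:01:00Z')); // 60
console.log(isoToUnixSketch('not a date')); // undefined
```

Two consequences are worth keeping in mind when writing a field spec: fractional values are truncated rather than rounded, and an empty literal coerces to `0` because `Number('')` is `0`.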
package/package.json ADDED
@@ -0,0 +1,38 @@
1
+ {
2
+ "name": "@lde/search",
3
+ "version": "0.0.0",
4
+ "description": "Engine-agnostic search projection for RDF-backed pipelines: frame CONSTRUCT quads into a JSON-LD IR, then project that IR into flat search documents from a declarative field spec (the artifact a SHACL generator would emit)",
5
+ "repository": {
6
+ "url": "git+https://github.com/ldelements/lde.git",
7
+ "directory": "packages/search"
8
+ },
9
+ "license": "MIT",
10
+ "type": "module",
11
+ "exports": {
12
+ "./package.json": "./package.json",
13
+ ".": {
14
+ "types": "./dist/index.d.ts",
15
+ "import": "./dist/index.js",
16
+ "development": "./src/index.ts",
17
+ "default": "./dist/index.js"
18
+ }
19
+ },
20
+ "main": "./dist/index.js",
21
+ "module": "./dist/index.js",
22
+ "types": "./dist/index.d.ts",
23
+ "files": [
24
+ "dist",
25
+ "!**/*.tsbuildinfo"
26
+ ],
27
+ "dependencies": {
28
+ "@lde/text-normalization": "^0.0.0",
29
+ "@rdfjs/types": "^2.0.1",
30
+ "@tpluscode/rdf-ns-builders": "^5.0.0",
31
+ "jsonld": "^9.0.0",
32
+ "tslib": "^2.3.0"
33
+ },
34
+ "devDependencies": {
35
+ "@types/jsonld": "^1.5.15",
36
+ "n3": "^2.0.1"
37
+ }
38
+ }