linguistic-enricher 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md ADDED
@@ -0,0 +1,248 @@
+ # linguistic-enricher
+
+ **linguistic-enricher** is a deterministic linguistic processing pipeline for Node.js that incrementally enriches plain text with structured linguistic information.
+
+ It takes raw text as input and produces a single, fully structured document that contains:
+
+ - a canonical text surface,
+ - sentence segmentation and tokenization,
+ - part-of-speech information,
+ - multi-word expressions (MWEs),
+ - shallow phrase structure (chunks),
+ - syntactic heads,
+ - and deterministic, token-level linguistic relations.
+
+ The pipeline is **library-first**, **schema-driven**, and **additive by design**.
+ It focuses strictly on *linguistic structure*, not domain logic, business rules, or normative interpretation.
+
+ ---
+
+ ## What this project is
+
+ `linguistic-enricher` is best described as a **linguistic enricher**:
+
+ - It **adds structure** to text.
+ - It does **not rewrite or reinterpret** the original text.
+ - It does **not apply domain semantics, rules, or policies**.
+ - It produces **reproducible, explainable results**.
+
+ The output is a single, incrementally enriched document that represents the linguistic state of the input text up to the level of accepted linguistic relations.
+
+ This makes the package suitable as:
+
+ - a preprocessing layer for downstream NLP systems,
+ - a compiler-like front end for controlled or structured language processing,
+ - or a general-purpose linguistic analysis engine embedded directly into Node.js applications.
+
+ ---
+
+ ## What this project is not
+
+ `linguistic-enricher` deliberately does **not**:
+
+ - perform business or domain reasoning,
+ - assert norms, obligations, or policies,
+ - infer facts beyond what is linguistically explicit in the text,
+ - or depend on any specific downstream framework or ontology.
+
+ The pipeline’s authoritative output ends at **linguistic relations**.
+ Anything beyond that (assertions, governance rules, domain models) belongs in a separate, downstream layer.
+
+ ---
+
+ ## Core principles
+
+ ### Deterministic
+
+ Given the same input text and configuration, the pipeline produces the same output.
+ Probabilistic model output is treated as observational input only and is never accepted as authoritative truth.
+
+ ### Additive
+
+ The text is enriched step by step.
+ Earlier structures are never removed or rewritten.
+ Later stages only add structure or precision.
+
+ ### Anchored
+
+ All annotations are explicitly anchored to the canonical text using character spans, tokens, or segments.
+ There is no implicit or floating interpretation.
+
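+ For illustration, a span-anchored annotation might look like the following sketch. The property names are illustrative only, not the actual schema:
+
+ ```js
+ // Illustrative shape only: a token anchored to the canonical text
+ // "A webshop is an online store ..." by a character span.
+ const token = {
+   id: "t1",
+   span: { start: 2, end: 9 }, // the surface slice "webshop"
+   surface: "webshop",
+   pos: "NOUN" // added later by an additive stage
+ };
+ ```
+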
+ ### Schema-driven
+
+ The output conforms to a single, evolving document schema that represents the complete linguistic enrichment state.
+
+ ### Library-first
+
+ All functionality is available through a JavaScript API and can be embedded directly into any Node.js project.
+ A CLI is provided only as a thin wrapper around the same API.
+
+ ---
+
+ ## High-level pipeline overview
+
+ Conceptually, the pipeline performs the following transformations:
+
+ 1. **Canonicalization**
+    A single authoritative text surface is established. All later offsets and annotations refer to this text.
+
+ 2. **Segmentation and tokenization**
+    The text is segmented (typically into sentences) and tokenized into a stable token stream.
+
+ 3. **Part-of-speech tagging**
+    Each token is enriched with grammatical category information.
+
+ 4. **Multi-word expression detection and materialization**
+    Lexical units spanning multiple tokens are detected and deterministically materialized as authoritative MWEs.
+
+ 5. **Shallow parsing (chunking)**
+    Tokens are grouped into flat syntactic phrases (e.g. noun phrases, verb phrases).
+
+ 6. **Head identification**
+    Each phrase receives exactly one deterministic syntactic head token.
+
+ 7. **Relation extraction**
+    Token-level linguistic relations are derived deterministically and stored as accepted relations.
+
+    Relations represent **linguistic predicate–argument structure**, not conceptual, ontological, or domain semantics.
+
+ The result is a fully enriched linguistic document with stable structure and traceable provenance.
+
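+ Because the pipeline is additive, it can be cut off at any intermediate target. A minimal sketch, assuming the `target` values shown elsewhere in this README (`"canonical"`, `"relations_extracted"`); the function name is illustrative:
+
+ ```js
+ const { runPipeline } = require("linguistic-enricher");
+
+ async function demo(text) {
+   // Stop after canonicalization only.
+   const canonical = await runPipeline(text, { target: "canonical" });
+
+   // Run the full pipeline up to accepted relations.
+   const full = await runPipeline(text, { target: "relations_extracted" });
+
+   // Earlier structure is never removed; later stages only add to it.
+   console.log(canonical.stage, full.stage);
+ }
+ ```
+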
+ ---
+
+ ## Input and output
+
+ ### Input
+
+ The minimal input is plain text:
+
+ ```js
+ const text = `
+ A webshop is an online store where customers can select products,
+ place them in a cart, and complete a purchase.
+ `;
+ ```
+
+ Optionally, a partially enriched document that already conforms to the schema may be provided to resume processing.
+
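+ A minimal sketch of resuming, assuming the partially enriched document is accepted in place of the plain-text input (the exact entry point is defined by the library API):
+
+ ```js
+ const { runPipeline } = require("linguistic-enricher");
+
+ async function resume(text) {
+   // First pass: stop at an early target.
+   const partial = await runPipeline(text, { target: "canonical" });
+
+   // Later pass: hand the partially enriched document back in to continue.
+   return runPipeline(partial, { target: "relations_extracted" });
+ }
+ ```
+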
+ ---
+
+ ### Output
+
+ The output is a single JavaScript object representing the enriched document.
+
+ It includes:
+
+ - the canonical text,
+ - segments and tokens with stable spans,
+ - annotations for MWEs, chunks, and heads,
+ - and accepted token-level relations.
+
+ The output is designed to be:
+
+ - machine-readable,
+ - human-inspectable,
+ - and suitable for downstream processing.
+
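+ A rough sketch of the document shape (property names are illustrative; the authoritative definition is the package's `schema.json`):
+
+ ```js
+ // Illustrative only — see schema.json for the actual schema.
+ const enriched = {
+   stage: "relations_extracted",              // pipeline cutoff reached
+   text: "A webshop is an online store ...",  // canonical text surface
+   segments: [ /* sentence spans into the canonical text */ ],
+   tokens: [ /* tokens with stable character spans */ ],
+   annotations: { /* MWEs, chunks, heads */ },
+   relations: [ /* accepted token-level relations */ ]
+ };
+ ```
+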
+ ---
+
+ ## Usage as a library
+
+ ```js
+ const { runPipeline } = require("linguistic-enricher");
+
+ const text = "A webshop is an online store.";
+
+ async function main() {
+   const result = await runPipeline(text, {
+     target: "relations_extracted"
+   });
+
+   console.log(result.stage);
+ }
+
+ main();
+ ```
+
+ The library API is the primary interface.
+ File I/O, serialization, and CLI concerns are intentionally kept outside the core logic.
+
+ ---
+
+ ## Current maturity and semantic parity
+
+ This package currently provides a stable baseline implementation of the full 00..11 pipeline surface.
+
+ - Baseline orchestration, validation hooks, CLI/API integration, and deterministic utilities are implemented and tested.
+ - Stage-by-stage linguistic parity hardening against the intended semantic baseline is still in progress.
+ - The legacy semantic corpus is treated as a semantic reference only, not as a technical implementation template.
+
+ ---
+
+ ## Optional external services
+
+ ### Lexical signals (Wikipedia title index)
+
+ Some enrichment stages optionally use **lexical signals provided by a Wikipedia title index service**.
+
+ This service is expected to expose the HTTP API of
+ [`wikipedia-title-index`](https://www.npmjs.com/package/wikipedia-title-index) and to provide deterministic
+ title-based lookup signals (exact matches, prefix counts, and variant matches).
+
+ `linguistic-enricher` does **not** embed or bundle this data.
+ Instead, an external service endpoint can be configured:
+
+ ```js
+ // Inside an async function (CommonJS has no top-level await):
+ await runPipeline(text, {
+   services: {
+     "wikipedia-title-index": {
+       endpoint: "http://localhost:3000"
+     }
+   }
+ });
+ ```
+
+ If no endpoint is configured, all enrichment stages that depend on lexical title signals run deterministically without those signals.
+
+ ---
+
+ ## Python runtime integration
+
+ Some enrichment stages rely on established Python-based NLP tooling (for example, part-of-speech tagging and dependency analysis).
+
+ This tooling is handled internally:
+
+ - Python is invoked as a subprocess.
+ - Communication uses JSON over stdin/stdout only.
+ - Consumers of the Node.js API do not interact with Python directly.
+
+ A built-in runtime check (`doctor`) can be used to verify Python availability and required dependencies, as sketched below.
+
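+ The same check is available programmatically. A minimal sketch, based on the `runDoctor` call used by the bundled CLI (the result shape shown is the one the CLI reads):
+
+ ```js
+ const { runDoctor } = require("linguistic-enricher");
+
+ async function checkRuntime() {
+   const result = await runDoctor({
+     pythonExecutable: process.env.PYTHON_EXECUTABLE, // optional override
+     timeoutMs: 10000
+   });
+   // Resolved Python executable and version, plus the installed spaCy model.
+   console.log(result.python.executable, result.python.version);
+   console.log(result.model.name);
+ }
+ ```
+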
+ ---
+
+ ## CLI (optional)
+
+ A command-line interface is provided for convenience:
+
+ ```bash
+ npx linguistic-enricher run --in input.txt --out result.json
+ npx linguistic-enricher run --text "A webshop is an online store." --target canonical --pretty
+ npx linguistic-enricher validate --in result.json
+ npx linguistic-enricher doctor
+ ```
+
+ The CLI is a thin wrapper around the same library API and is fully cross-platform.
+
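+ The `validate` subcommand, for example, is a direct wrapper around the library's `validateDocument` function; the sketch below mirrors what the bundled CLI source does:
+
+ ```js
+ const fs = require("node:fs");
+ const { validateDocument } = require("linguistic-enricher");
+
+ const doc = JSON.parse(fs.readFileSync("result.json", "utf8"));
+ console.log(validateDocument(doc)); // validation result object
+ ```
+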
+ ---
+
+ ## Design boundary
+
+ `linguistic-enricher` intentionally produces authoritative output only up to **linguistic relations**.
+
+ The underlying document schema is forward-compatible and may include later enrichment stages, but this package itself does **not** generate normative assertions, obligations, or domain-level interpretations.
+
+ This boundary keeps the project:
+
+ - reusable,
+ - framework-agnostic,
+ - and stable as a foundational linguistic layer.
+
+ ---
+
+ ## License
+
+ MIT
package/bin/linguistic-enricher.js ADDED
@@ -0,0 +1,222 @@
+ #!/usr/bin/env node
+ "use strict";
+
+ const fs = require("node:fs");
+ const api = require("../src/index");
+
+ /** Creates an error marked as a usage error so the top-level handler prints command help. */
+ function usageError(message) {
+   const error = new Error(message);
+   error.code = "E_CLI_USAGE";
+   return error;
+ }
+
+ function printGlobalUsage() {
+   console.log("Usage: linguistic-enricher <command> [options]");
+   console.log("");
+   console.log("Commands:");
+   console.log("  run       Run pipeline and print JSON");
+   console.log("  validate  Validate JSON document");
+   console.log("  doctor    Check runtime prerequisites");
+   console.log("");
+   console.log("Use \"linguistic-enricher <command> --help\" for command options.");
+ }
+
+ function printRunUsage() {
+   console.log("Usage: linguistic-enricher run (--text <text> | --in <file>) [options]");
+   console.log("Options:");
+   console.log("  --out <file>                  Write JSON output to file");
+   console.log("  --target <pipeline-target>    Pipeline cutoff target");
+   console.log("  --pretty                      Pretty-print JSON");
+   console.log("  --service-wti-endpoint <url>  wikipedia-title-index endpoint");
+   console.log("  --timeout-ms <ms>             Service/runtime timeout");
+   console.log("  --strict                      Strict mode for optional dependencies");
+ }
+
+ function printValidateUsage() {
+   console.log("Usage: linguistic-enricher validate --in <file> [--pretty]");
+ }
+
+ function printDoctorUsage() {
+   console.log("Usage: linguistic-enricher doctor [--python-executable <path>] [--timeout-ms <ms>]");
+ }
+
+ /**
+  * Parses command-specific CLI flags into an options object.
+  * @param {string[]} argv Full process argv.
+  * @param {string} command Subcommand name, used to select the allowed flag set.
+  */
+ function parseArgs(argv, command) {
+   const args = argv.slice(3);
+   const out = {};
+   const noValueFlags = new Set(["--pretty", "--strict", "--help", "-h"]);
+   const needsValue = new Set(["--in", "--text", "--out", "--target", "--service-wti-endpoint", "--timeout-ms", "--python-executable"]);
+   const allowedByCommand = {
+     run: new Set(["--in", "--text", "--out", "--target", "--pretty", "--service-wti-endpoint", "--timeout-ms", "--strict", "--help", "-h"]),
+     validate: new Set(["--in", "--pretty", "--help", "-h"]),
+     doctor: new Set(["--python-executable", "--timeout-ms", "--help", "-h"])
+   };
+   const allowed = allowedByCommand[command] || new Set(["--help", "-h"]);
+
+   for (let i = 0; i < args.length; i += 1) {
+     const arg = args[i];
+     if (!arg.startsWith("-")) {
+       throw usageError("Unexpected positional argument: " + arg);
+     }
+     if (!allowed.has(arg)) {
+       throw usageError("Unknown flag for " + command + ": " + arg);
+     }
+     if (noValueFlags.has(arg)) {
+       if (arg === "--help" || arg === "-h") {
+         out.help = true;
+       } else {
+         out[arg.slice(2)] = true;
+       }
+       continue;
+     }
+     if (needsValue.has(arg)) {
+       const next = args[i + 1];
+       if (!next || next.startsWith("-")) {
+         throw usageError("Missing value for flag: " + arg);
+       }
+       out[arg.slice(2)] = next;
+       i += 1;
+       continue;
+     }
+     throw usageError("Unknown flag: " + arg);
+   }
+
+   return out;
+ }
+
+ /** Resolves the input text from --text or --in (mutually exclusive). */
+ function readInput(options) {
+   if (options.text && options.in) {
+     throw usageError("Use either --text or --in, not both.");
+   }
+
+   if (options.text) {
+     return options.text;
+   }
+
+   if (options.in) {
+     return fs.readFileSync(options.in, "utf8");
+   }
+
+   throw usageError("Missing input. Use --text or --in.");
+ }
+
+ /** CLI command for `run`: executes the pipeline and prints or writes the JSON result. */
+ async function run(argv) {
+   const options = parseArgs(argv, "run");
+   if (options.help) {
+     printRunUsage();
+     return;
+   }
+   const input = readInput(options);
+
+   const pipelineOptions = {
+     target: options.target,
+     timeoutMs: options["timeout-ms"] ? Number(options["timeout-ms"]) : undefined,
+     services: options["service-wti-endpoint"]
+       ? { "wikipedia-title-index": { endpoint: options["service-wti-endpoint"] } }
+       : undefined,
+     strict: options.strict === true
+   };
+
+   const result = await api.runPipeline(input, pipelineOptions);
+   const serialized = JSON.stringify(result, null, options.pretty ? 2 : 0);
+
+   if (options.out) {
+     fs.writeFileSync(options.out, serialized + "\n", "utf8");
+   } else {
+     console.log(serialized);
+   }
+ }
+
+ /**
+  * CLI command for `doctor`: checks Python availability and required dependencies.
+  * @param {string[]} argv Full process argv.
+  * @returns {Promise<void>}
+  */
+ async function doctor(argv) {
+   const options = parseArgs(argv, "doctor");
+   if (options.help) {
+     printDoctorUsage();
+     return;
+   }
+   const result = await api.runDoctor({
+     pythonExecutable: options["python-executable"] || process.env.PYTHON_EXECUTABLE,
+     timeoutMs: options["timeout-ms"] ? Number(options["timeout-ms"]) : undefined
+   });
+   console.log("Doctor checks passed.");
+   console.log(
+     "Python executable: " +
+       result.python.executable +
+       " (" +
+       result.python.version +
+       ")"
+   );
+   console.log("spaCy: ok");
+   console.log("spaCy model: " + result.model.name + " (installed)");
+ }
+
+ /** CLI command for `validate`: parses a JSON document and prints the validation result. */
+ function validate(argv) {
+   const options = parseArgs(argv, "validate");
+   if (options.help) {
+     printValidateUsage();
+     return;
+   }
+   if (!options.in) {
+     throw usageError("validate requires --in <path>");
+   }
+
+   const raw = fs.readFileSync(options.in, "utf8");
+   const doc = JSON.parse(raw);
+   const result = api.validateDocument(doc);
+   console.log(JSON.stringify(result, null, options.pretty ? 2 : 0));
+ }
+
+ /** Dispatches to the requested subcommand. */
+ async function main(argv) {
+   const command = argv[2];
+
+   if (!command || command === "--help" || command === "-h") {
+     printGlobalUsage();
+     return;
+   }
+
+   if (command === "run") {
+     await run(argv);
+     return;
+   }
+
+   if (command === "doctor") {
+     await doctor(argv);
+     return;
+   }
+
+   if (command === "validate") {
+     validate(argv);
+     return;
+   }
+
+   throw usageError("Unknown command. Supported commands: run, doctor, validate");
+ }
+
+ if (require.main === module) {
+   main(process.argv).catch(function onMainError(error) {
+     const code = error && error.code ? error.code : "E_CLI_FAILED";
+     console.error("CLI failed [" + code + "]: " + error.message);
+     if (code === "E_CLI_USAGE") {
+       const cmd = process.argv[2];
+       if (cmd === "run") {
+         printRunUsage();
+       } else if (cmd === "validate") {
+         printValidateUsage();
+       } else if (cmd === "doctor") {
+         printDoctorUsage();
+       } else {
+         printGlobalUsage();
+       }
+     }
+     process.exit(1);
+   });
+ }
+
+ module.exports = {
+   run,
+   doctor,
+   validate,
+   main
+ };
package/package.json ADDED
@@ -0,0 +1,48 @@
+ {
+   "name": "linguistic-enricher",
+   "version": "1.0.0",
+   "description": "Deterministic linguistic enrichment pipeline for Node.js (CommonJS)",
+   "repository": {
+     "type": "git",
+     "url": "https://github.com/svenschaefer/linguistic-enricher"
+   },
+   "main": "src/index.js",
+   "bin": {
+     "linguistic-enricher": "bin/linguistic-enricher.js"
+   },
+   "files": [
+     "src/**",
+     "bin/**",
+     "README.md",
+     "schema.json",
+     "LICENSE"
+   ],
+   "scripts": {
+     "test": "npm run lint && npm run test:unit && npm run test:integration",
+     "test:unit": "node --test test/unit/*.test.js",
+     "test:integration": "node --test test/integration/*.test.js",
+     "lint": "eslint src bin test",
+     "doctor": "node bin/linguistic-enricher.js doctor",
+     "check:cache-live": "node scripts/check-live-cache.js"
+   },
+   "keywords": [
+     "nlp",
+     "linguistics",
+     "pipeline",
+     "commonjs"
+   ],
+   "license": "MIT",
+   "engines": {
+     "node": ">=24.10.0"
+   },
+   "devDependencies": {
+     "eslint": "^8.57.0"
+   },
+   "dependencies": {
+     "ajv": "^8.17.1",
+     "sbd": "^1.0.19",
+     "wikipedia-title-index": "^1.2.6",
+     "wink-pos-tagger": "^2.2.2",
+     "wink-tokenizer": "^5.2.0"
+   }
+ }