npm - objectivist-ner - Versions diffs - 0.0.0 → 0.0.2 - Mend

objectivist-ner 0.0.0 → 0.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (3)

package/README.md CHANGED

@@ -1,83 +1,139 @@
 # objectivist-ner
-Objectivist-inspired Named Entity Recognition with grammar-constrained LLM output.
+Most Named Entity Recognition tools treat language as a bag of words to be statistically tagged.
-Uses [node-llama-cpp](https://github.com/withcatai/node-llama-cpp) to run a small language model locally, enforcing structured output via JSON schema grammars. No API keys, no network calls -- everything runs on your machine.
+This tool takes a different approach.
-The CLI is installed as the `ner` command.
+It is built on the Objectivist recognition that concepts are not arbitrary labels — they are integrations of observed reality, formed by identifying essential characteristics and omitting measurements. A valid concept must be grounded in percepts, organized hierarchically, and maintain identity across contexts.
-## Features
+That is why `objectivist-ner` emphasizes:
-- Exact span extraction -- entity `text` is the substring from the input, not a paraphrase
-- Schema-constrained output via llama.cpp grammar (guaranteed valid JSON)
-- Restrict entity classes, attribute keys, and attribute values
-- Hierarchical class taxonomies
-- Relation extraction between entities
-- Coreference resolution (group mentions of the same entity)
-- Negation and modality detection
-- Confidence scores
-- Schema definition files for reusable ontologies
-- Long document chunking with `--file`
-- Batch processing with `--batch`
-- Three built-in model tiers: `--fast`, `--balanced`, `--best`
-- Reads from argument, file, or stdin
-- Compact JSON output for non-TTY / piping
+- **Exact entity spans** — because a concept must refer to something specific in reality
+- **Hierarchical classification** — because proper concept formation requires understanding genus and differentia
+- **Negation detection** — because the relationship of a concept to existence is epistemologically essential
+- **Coreference resolution** — because the law of identity demands we recognize the same existent across multiple descriptions
+- **Relations** — because concepts do not exist in isolation, they integrate into propositions
-## Installation
+It runs completely locally using a small language model. No API keys. No data leaves your machine.
+## What Makes This Different
+### 1. Assertion vs Negation vs Hypothetical
+> "The patient has diabetes but does not have cancer. He might develop hypertension."
+**Typical NER** sees three diseases. **objectivist-ner** sees three different relationships to reality:
 ```bash
-# Local development
-bun install
+ner --detect-negation "The patient has diabetes but does not have cancer. He might develop hypertension."
+```
-# Install globally as the `ner` command
-bun install -g objectivist-ner
+```json
+[
+  { "class": "disease", "text": "diabetes", "assertion": "present" },
+  { "class": "disease", "text": "cancer", "assertion": "negated" },
+  { "class": "disease", "text": "hypertension", "assertion": "hypothetical" }
+]
 ```
-After global install, use the `ner` command directly.
+The `assertion` field tells you whether the text claims something is **present**, **negated**, or **hypothetical**.
+### 2. Identity Across References
+> "Dr. Chen published a paper. She later won the Nobel Prize. The neurologist was celebrated."
-## Usage
+**Typical NER** sees three separate people. **objectivist-ner** knows they are the same person:
 ```bash
-# Uses --best (4B) by default
-ner "the cat is blue and is feeling sad"
+ner --resolve "Dr. Chen published a paper. She later won the Nobel Prize. The neurologist was celebrated."
+```
-# Pick a model tier
-ner --fast "the cat is blue"
-ner --balanced "John works at Google in NYC"
-ner --best "complex medical research text"
+```json
+[
+  {
+    "class": "person",
+    "text": "Dr. Chen",
+    "entity_id": "e1",
+    "is_canonical": true
+  },
+  {
+    "class": "person",
+    "text": "She",
+    "entity_id": "e1",
+    "is_canonical": false
+  },
+  {
+    "class": "person",
+    "text": "The neurologist",
+    "entity_id": "e1",
+    "is_canonical": false
+  },
+  {
+    "class": "event",
+    "text": "the Nobel Prize",
+    "entity_id": "e2",
+    "is_canonical": true
+  }
+]
 ```
-### Entity constraints
+`entity_id` groups coreferent mentions. `is_canonical` marks the most specific reference.
-```bash
-# Restrict entity classes
-ner "John works at Google" --classes person,organization
+### 3. Hierarchical Classification
-# Restrict attribute keys
-ner "Alice is sad in Paris" --attributes emotional_state,location
+Define your ontology as a tree with mixed arrays (leaf nodes) and objects (nested hierarchies):
-# Restrict attribute values with enums
-ner "The sky is blue" --attr-values '{"color":["blue","red","green"]}'
+```
+organism
+├── person
+└── animal
+    ├── dog
+    └── cat
+idea
+├── dream
+└── principle
+```
-# Hierarchical class taxonomy
-ner "Dr. Chen lives in Boston with her cat" \
-  --taxonomy '{"organism":["person","animal"],"place":["city","country"]}'
+```bash
+ner --taxonomy '{"organism":["person",{"animal":["dog","cat"]}],"idea":["dream","principle"]}' \
+  "The child recounted a vivid dream about the golden retriever."
+```
+```json
+[
+  {
+    "class": "person",
+    "text": "The child",
+    "taxonomyPath": ["organism", "person"]
+  },
+  {
+    "class": "dream",
+    "text": "a vivid dream",
+    "taxonomyPath": ["idea", "dream"]
+  },
+  {
+    "class": "dog",
+    "text": "the golden retriever",
+    "taxonomyPath": ["organism", "animal", "dog"]
+  }
+]
 ```
-### Relation extraction
+The model classifies at the most specific (leaf) level, and `taxonomyPath` preserves the full hierarchy.
+### 4. Conceptual Integration (Relations)
 ```bash
 ner --relations "Dr. Chen works at MIT and collaborates with Prof. Wright"
 ```
-Output:
 ```json
 {
   "entities": [
-    { "class": "person", "text": "Dr. Chen", "attributes": {} },
-    { "class": "organization", "text": "MIT", "attributes": {} },
-    { "class": "person", "text": "Prof. Wright", "attributes": {} }
+    { "class": "person", "text": "Dr. Chen" },
+    { "class": "organization", "text": "MIT" },
+    { "class": "person", "text": "Prof. Wright" }
   ],
   "relations": [
     { "source": "Dr. Chen", "target": "MIT", "relation": "works at" },
@@ -90,266 +146,229 @@ Output:
 }
 ```
-### Coreference resolution
+Relations show how entities connect — extracting the "connective tissue" between concepts.
+You can also categorize relations by class:
 ```bash
-ner --resolve "Dr. Chen published a paper. She later won the Nobel Prize. The neurologist was celebrated."
+ner --relations --relation-classes "employment,location,causal,professional" \
+  "Dr. Chen works at MIT and collaborates with Prof. Wright"
+```
+```json
+{
+  "entities": [
+    { "class": "person", "text": "Dr. Chen" },
+    { "class": "organization", "text": "MIT" },
+    { "class": "person", "text": "Prof. Wright" }
+  ],
+  "relations": [
+    {
+      "source": "Dr. Chen",
+      "target": "MIT",
+      "relation": "works at",
+      "class": "employment"
+    },
+    {
+      "source": "Dr. Chen",
+      "target": "Prof. Wright",
+      "relation": "collaborates with",
+      "class": "professional"
+    }
+  ]
+}
+```
+The `class` field categorizes the relation type (e.g., employment, causal, spatial), allowing you to group and analyze connections by category.
+## Installation
+```bash
+bun install -g objectivist-ner
+```
+## Quick Start
+```bash
+# Basic extraction
+ner "the cat is blue and is feeling sad"
+# Choose quality vs speed
+ner --fast "simple text"
+ner --balanced "moderate text"
+ner --best "complex text"
 ```
-Output:
+## Usage Examples
+### Constrain entity classes
+```bash
+ner "John works at Google" --classes person,organization
+```
+```json
+[
+  { "class": "person", "text": "John" },
+  { "class": "organization", "text": "Google" }
+]
+```
+### Constrain attribute keys
+```bash
+ner "Alice is sad in Paris" --attributes emotional_state,location
+```
 ```json
 [
   {
     "class": "person",
-    "text": "Dr. Chen",
-    "attributes": {},
-    "entity_id": "e1"
-  },
-  { "class": "person", "text": "She", "attributes": {}, "entity_id": "e1" },
-  {
-    "class": "person",
-    "text": "The neurologist",
-    "attributes": {},
-    "entity_id": "e1"
-  },
-  {
-    "class": "event",
-    "text": "the Nobel Prize",
-    "attributes": {},
-    "entity_id": "e2"
+    "text": "Alice",
+    "attributes": { "emotional_state": "sad", "location": "Paris" }
   }
 ]
 ```
-`entity_id` is a grammar-enforced top-level field, not inside `attributes`.
-### Negation detection
+### Constrain attribute values
 ```bash
-ner --detect-negation "The patient has diabetes but does not have cancer. He might develop hypertension."
+ner "The sky is blue" --attr-values '{"color":["blue","red","green"]}'
 ```
-### Confidence scores
-```bash
-ner --include-confidence "Dr. Maria Chen works at MIT. Someone named Bob might be there too."
+```json
+[
+  {
+    "class": "object",
+    "text": "sky",
+    "attributes": { "color": "blue" }
+  }
+]
 ```
-### Schema definition files
+### Schema files
-Define your ontology in a JSON file and reuse it:
+Define your ontology once and reuse it:
 ```json
 {
   "taxonomy": {
     "organism": ["person", "animal"],
-    "place": ["city", "country", "building"],
-    "institution": ["company", "university", "government_agency"]
+    "animal": ["dog", "cat"]
   },
-  "attributes": ["role", "age", "location", "affiliation"],
-  "relations": ["works_at", "located_in", "affiliated_with"]
+  "attributes": ["role", "location"],
+  "relations": ["works_at", "collaborates_with"]
 }
 ```
 ```bash
-ner --schema schema.json "Dr. Chen works at MIT in Boston"
+ner --schema ontology.json "Dr. Chen works at MIT"
 ```
-Schema files support `taxonomy`, `classes`, `attributes`, `attrValues`, and `relations`. CLI flags override schema file values.
 ### File and batch processing
 ```bash
 # Process a long document (auto-chunked)
 ner --file document.txt
-# Process a JSONL file (one text per line, outputs JSONL)
+# Process a JSONL file
 ner --batch inputs.jsonl
 # Process a directory of .txt files
 ner --batch ./documents/
 ```
-### Other options
+### Read from stdin
 ```bash
-# Append to the built-in system prompt
-ner "text" --system-prompt-append "Focus only on emotions"
-# Replace the system prompt entirely
-ner "text" --system-prompt "You are a custom extractor."
-# Read from stdin
 echo "the cat is blue" | ner
-# Compact JSON output
-ner "the cat is blue" --compact
+cat article.txt | ner --detect-negation
 ```
-## Models
-fastner ships with three built-in model tiers. Pick one with a flag -- the model is downloaded automatically on first use to `~/.fastner/models/`.
-| Flag         | Model             | Size | Download | Best for                        |
-| ------------ | ----------------- | ---- | -------- | ------------------------------- |
-| `--fast`     | Qwen3.5-0.8B Q8_0 | 0.8B | ~0.9 GB  | Simple text, single entities    |
-| `--balanced` | Qwen3.5-2B Q8_0   | 2B   | ~2.3 GB  | Moderate complexity, most tasks |
-| `--best`     | Qwen3.5-4B Q8_0   | 4B   | ~4.5 GB  | Dense text, rare entity types   |
-`--best` is the default. See [Benchmarks](#benchmarks) for why.
-## Options
-| Flag                              | Description                                    |
-| --------------------------------- | ---------------------------------------------- |
-| `--fast`                          | Use 0.8B model -- quick, simple text only      |
-| `--balanced`                      | Use 2B model -- good accuracy/speed tradeoff   |
-| `--best`                          | Use 4B model -- best accuracy (default)        |
-| `-c, --classes <list>`            | Comma-separated allowed entity classes         |
-| `-a, --attributes <list>`         | Comma-separated allowed attribute keys         |
-| `--attr-values <json>`            | JSON enum map for attribute values             |
-| `--taxonomy <json>`               | Class hierarchy JSON                           |
-| `--relations`                     | Extract relations between entities             |
-| `--resolve`                       | Resolve coreferences                           |
-| `--include-confidence`            | Include confidence scores per entity           |
-| `--detect-negation`               | Detect negated/hypothetical entities           |
-| `--schema <path>`                 | Load schema definition from JSON file          |
-| `--file <path>`                   | Read input from file (with chunking)           |
-| `--batch <path>`                  | Process JSONL file or directory of .txt files  |
-| `--system-prompt <string>`        | Replace the built-in system prompt entirely    |
-| `--system-prompt-append <string>` | Append to the built-in system prompt           |
-| `--compact`                       | Output compact JSON (auto-enabled for non-TTY) |
-| `-m, --model <uri>`               | Use any GGUF model (see below)                 |
+## Model Tiers
+| Flag         | Size   | Download | Best for                        |
+| ------------ | ------ | -------- | ------------------------------- |
+| `--fast`     | Small  | ~0.9 GB  | Simple text, single entities    |
+| `--balanced` | Medium | ~2.3 GB  | Moderate complexity, most tasks |
+| `--best`     | Large  | ~4.5 GB  | Dense text, rare entity types   |
+`--best` is the default. See [Benchmarks](#benchmarks).
+## Options Reference
+| Flag                              | Description                                        |
+| --------------------------------- | -------------------------------------------------- |
+| `--fast`                          | Use smallest model                                 |
+| `--balanced`                      | Use mid-size model                                 |
+| `--best`                          | Use largest model (default)                        |
+| `-c, --classes <list>`            | Allowed entity classes                             |
+| `-a, --attributes <list>`         | Allowed attribute keys                             |
+| `--attr-values <json>`            | Enum map for attribute values                      |
+| `--taxonomy <json>`               | Class hierarchy (parent → children)                |
+| `--relations`                     | Extract relations between entities                 |
+| `--relation-classes <list>`       | Allowed relation classes (e.g. employment,causal)  |
+| `--resolve`                       | Resolve coreferences (adds entity_id)              |
+| `--detect-negation`               | Add assertion field (present/negated/hypothetical) |
+| `--include-confidence`            | Add confidence field (low/medium/high)             |
+| `--schema <path>`                 | Load schema from JSON file                         |
+| `--file <path>`                   | Read from file (with chunking)                     |
+| `--batch <path>`                  | Process JSONL file or directory                    |
+| `--system-prompt <string>`        | Replace system prompt                              |
+| `--system-prompt-append <string>` | Append to system prompt                            |
+| `--compact`                       | Compact JSON output                                |
+| `-m, --model <uri>`               | Use custom GGUF model                              |
 ## Benchmarks
-We tested all three tiers against a complex input containing 11 entities across 6 classes (person, organization, location, disease, drug, event):
-> "Dr. Maria Chen, a 42-year-old neurologist at Massachusetts General Hospital in Boston, published a groundbreaking paper with her colleague Prof. James Wright from Oxford University about a rare genetic mutation called BRCA3-delta found in 12 patients from rural Bangladesh, while simultaneously consulting for Pfizer on their new drug Nexavion priced at 450 dollars per dose, which the WHO classified as a Category A essential medicine last Tuesday during their Geneva summit"
-| Entity             | `--fast`   | `--balanced` | `--best` (default)       |
-| ------------------ | ---------- | ------------ | ------------------------ |
-| Dr. Maria Chen     | person     | person       | person                   |
-| Prof. James Wright | person     | person       | person, role: colleague  |
-| MGH                | -          | org          | org                      |
-| Oxford University  | -          | org          | org                      |
-| BRCA3-delta        | -          | disease      | disease                  |
-| Bangladesh         | -          | -            | location                 |
-| Pfizer             | -          | org          | org                      |
-| Nexavion           | -          | drug         | drug, price: 450 dollars |
-| WHO                | -          | -            | org, category: Cat A     |
-| Geneva summit      | -          | event        | location                 |
-| Boston             | location   | -            | location                 |
-| **Entities found** | **3 / 11** | **8 / 11**   | **11 / 11**              |
-All three tiers produce zero hallucinations with the current prompt design.
-## Epistemological design
-fastner's feature set is informed by Objectivist epistemology -- the theory that concepts are formed by abstracting essential characteristics from concretes, organized into hierarchical structures, and held in a specific relationship to reality.
-### Identity: A is A (`--resolve`)
-The law of identity demands that we track _what a thing is_ across all its references. When a text says "Dr. Chen", "she", and "the neurologist", these are three linguistic expressions of one entity. Without coreference resolution, an NER system treats them as three unrelated extractions -- a failure to maintain identity. `--resolve` enforces that A remains A regardless of how it is named.
-### Hierarchical concept formation (`--taxonomy`)
-Objectivist epistemology holds that concepts are organized hierarchically through a process of abstraction. "Cat" is subsumed under "animal", which is subsumed under "organism". Each level retains the essential characteristics of its parent while adding differentia. The `--taxonomy` flag mirrors this structure directly -- you define genus-species relationships between entity classes, and the model classifies at the most specific level it can justify. This isn't just organization; it's how valid concepts are formed.
-### Distinguishing existence from assertion (`--detect-negation`)
-A concept must be connected to reality. "The patient has diabetes" and "the patient does not have diabetes" both contain the entity "diabetes", but their relationship to existence is opposite. Naive NER systems that extract "diabetes" from both sentences without distinguishing assertion from negation commit a fundamental error -- they detach the concept from its existential status. `--detect-negation` forces every entity to declare its relationship to reality: present, negated, or hypothetical.
-### Certainty and the hierarchy of evidence (`--include-confidence`)
-Knowledge exists on a spectrum from certain to speculative. "Dr. Maria Chen" appearing with a full name and title is a high-confidence extraction. "Someone named Bob" is low-confidence. Objectivism rejects both dogmatism (asserting certainty where none exists) and skepticism (denying certainty where it does). `--include-confidence` makes the epistemic status of each extraction explicit, letting downstream systems apply appropriate thresholds.
-### Relations as conceptual integration (`--relations`)
+Tested on a complex input with 11 entities across 6 classes:
-Entities don't exist in isolation. The relationship "Dr. Chen works at MIT" is not a property of Chen or of MIT alone -- it's a fact about reality that connects two existents. Extracting entities without their relations is like forming concepts without integrating them into propositions. `--relations` extracts the connective tissue between entities, producing a knowledge graph rather than an isolated list.
+| Entity             | `--fast` | `--balanced` | `--best`  |
+| ------------------ | -------- | ------------ | --------- |
+| Dr. Maria Chen     | person   | person       | person    |
+| Prof. James Wright | person   | person       | person    |
+| MGH                | —        | org          | org       |
+| Oxford University  | —        | org          | org       |
+| BRCA3-delta        | —        | disease      | disease   |
+| Bangladesh         | —        | —            | location  |
+| Pfizer             | —        | org          | org       |
+| Nexavion           | —        | drug         | drug      |
+| WHO                | —        | —            | org       |
+| Geneva summit      | —        | event        | location  |
+| Boston             | location | —            | location  |
+| **Found**          | **3/11** | **8/11**     | **11/11** |
-### Schema files as objective definitions (`--schema`)
+## Integration with objectivist-lattice
-Definitions, in Objectivist epistemology, identify the essential characteristics that distinguish a concept from all others. A schema file serves this function for NER: it defines your ontology once -- the class hierarchy, the valid attributes, the relation types -- and applies it consistently across all extractions. This is the difference between ad-hoc classification and principled concept formation.
+This tool is designed to work with **[objectivist-lattice](https://github.com/richardanaya/objectivist-lattice)** — a knowledge management system that enforces the Objectivist hierarchy: percepts → concepts → principles → actions.
-### Grammar enforcement as logical constraint
+**objectivist-ner** extracts the percepts and concepts. **objectivist-lattice** validates and organizes them into principles you can act on.
-Several fields (`assertion`, `confidence`, `entity_id`, `class` enums) are enforced at the grammar level, not merely prompted. The model literally cannot produce an invalid value. This is the computational equivalent of the principle that contradictions cannot exist -- the system's structure makes certain errors impossible rather than merely unlikely.
-## Building Up Knowledge
-fastner is designed as a tool for the Objectivist project of building knowledge from percepts through concepts to principles and finally to action — the exact process implemented in the companion project **[objectivist-lattice](https://github.com/richardanaya/objectivist-lattice)**.
-### The Epistemological Pipeline
-Objectivism holds that all knowledge begins with **percepts** (raw sensory data), which are integrated into **concepts**, which are organized into **principles** (general truths), which are finally applied as **actions** in specific contexts.
-`objectivist-lattice` enforces this hierarchy strictly on a filesystem of Markdown files with validation rules:
-- **Axioms** and **percepts** are bedrock — they have no `reduces_to` links
-- **Principles** must reduce to axioms or percepts
-- **Applications** must reduce to principles
-- Promotion from `Tentative/Hypothesis` to `Integrated/Validated` can only happen bottom-up
-### How NER Helps Build the Lattice
-fastner acts as the **percept-to-concept extraction layer** for this system:
-1. **Percept Extraction** (`--detect-negation`)
-   - Identifies concrete entities from source material (books, articles, personal observations)
-   - Distinguishes what is asserted as present, negated, or hypothetical
-   - Feeds raw perceptual data into the `02-Percepts/` directory
-2. **Concept Formation** (`--classes`, `--taxonomy`, `--resolve`)
-   - Groups multiple mentions of the same entity (`entity_id`)
-   - Classifies entities into hierarchical taxonomies (`organism > person > neurologist`)
-   - Maintains identity across contexts — "Dr. Chen", "she", and "the neurologist" are recognized as the same existent
-3. **Principle Discovery** (`--relations`, `--schema`)
-   - Extracts relations between entities ("works at", "causes", "implies")
-   - Uses schema files to enforce your ontological commitments
-   - Surfaces potential principles by showing what consistently reduces to what
-4. **Action Guidance** (`--include-confidence`)
-   - Rates confidence in each extraction
-   - Helps distinguish high-certainty principles (suitable for action) from speculative ones (still tentative)
-### Practical Workflow
+### Workflow
 ```bash
-# Extract entities from a book chapter
-ner --file chapter1.txt --detect-negation --resolve --include-confidence > percepts.json
+# Extract structured observations from text
+ner --file chapter1.txt --detect-negation --resolve > percepts.json
-# Convert to lattice format
-cat percepts.json | jq '.[] | {title: .text, level: "percept", proposition: (.text + " was observed")}' > 02-Percepts/20260315-percept-001.md
-# Later, when forming principles
-ner --relations --schema ontology.json "text from multiple chapters" > principles.json
+# Import into your knowledge lattice
+# (See objectivist-lattice documentation for details)
 ```
-The combination of **objectivist-ner** (extraction) and **objectivist-lattice** (validation and organization) creates a complete pipeline:
-**Percepts → Concepts → Principles → Validated Knowledge → Action**
+## Epistemological Design
-This is not just information extraction. It is epistemological engineering — using computation to enforce the proper hierarchical structure of knowledge, preventing floating abstractions and ensuring every principle is grounded in percepts and axioms.
+Each feature maps to an Objectivist principle:
-The grammar-enforced fields (`assertion`, `confidence`, `entity_id`) are not arbitrary features. They are computational implementations of fundamental epistemological requirements: every concept must have a relationship to reality, every claim must have an epistemic status, and identity must be maintained across contexts.
+- **`--resolve`** — The law of identity (A is A)
+- **`--taxonomy`** — Hierarchical concept formation (genus and differentia)
+- **`--detect-negation`** — Grounding concepts in reality (existence vs non-existence)
+- **`--relations`** — Conceptual integration (concepts form connected propositions)
+- **Grammar enforcement** — Non-contradiction (structure prevents invalid values)
-See the [objectivist-lattice](https://github.com/richardanaya/objectivist-lattice) repository for the validation and knowledge management layer that pairs with this tool.
+## Custom Models
-## Custom models
-If the built-in tiers don't fit your needs, you can pass any GGUF model with `--model`. This overrides `--fast`/`--balanced`/`--best`.
+Use any GGUF model:
 ```bash
-# HuggingFace URI
 ner "text" --model "hf:unsloth/Qwen3-8B-GGUF:Qwen3-8B-Q4_K_M.gguf"
-# Local file
-ner "text" --model ./my-custom-model.gguf
+ner "text" --model ./my-model.gguf
 ```
-## License
-MIT © Richard Anaya
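
A consumer of the `--detect-negation` output documented in the README above might bucket entities by their `assertion` value. The sketch below assumes the JSON shape shown in the README example; the `Entity` type and the `groupByAssertion` helper are hypothetical names for illustration and are not part of the package.

```typescript
// Hypothetical consumer of `ner --detect-negation` output.
// Field names mirror the README example; the types are assumptions.
type Assertion = "present" | "negated" | "hypothetical";

interface Entity {
  class: string;
  text: string;
  assertion: Assertion;
}

// Bucket entity texts by how the source text relates them to reality.
function groupByAssertion(entities: Entity[]): Record<Assertion, string[]> {
  const groups: Record<Assertion, string[]> = {
    present: [],
    negated: [],
    hypothetical: [],
  };
  for (const entity of entities) {
    groups[entity.assertion].push(entity.text);
  }
  return groups;
}

// Sample copied from the README's example output.
const sample: Entity[] = [
  { class: "disease", text: "diabetes", assertion: "present" },
  { class: "disease", text: "cancer", assertion: "negated" },
  { class: "disease", text: "hypertension", assertion: "hypothetical" },
];

const grouped = groupByAssertion(sample);
console.log(grouped.present); // only the conditions the text asserts
```

Keeping the three buckets separate means a downstream step can, for example, act only on `present` entities without re-reading the source text.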

package/index.ts CHANGED

@@ -44,23 +44,104 @@ interface SchemaFile {
   attributes?: string[];
   attrValues?: Record<string, string[]>;
   relations?: string[];
+  relationClasses?: string[];
 }
 // === TAXONOMY HELPERS ===
-function flattenTaxonomy(taxonomy: Record<string, string[]>): string[] {
-  const all = new Set<string>();
-  for (const [parent, children] of Object.entries(taxonomy)) {
-    all.add(parent);
-    for (const child of children) all.add(child);
+// Taxonomy format: {"organism": ["person", {"animal": ["dog", "cat"]}], "idea": ["dream", "principle"]}
+// Arrays contain leaf nodes, objects contain nested taxonomies
+function getLeafNodes(taxonomy: Record<string, any>): string[] {
+  const leaves: string[] = [];
+  function traverse(node: any) {
+    if (Array.isArray(node)) {
+      for (const item of node) {
+        if (typeof item === "string") {
+          leaves.push(item);
+        } else if (typeof item === "object" && item !== null) {
+          traverse(item);
+        }
+      }
+    } else if (typeof node === "object" && node !== null) {
+      for (const [, value] of Object.entries(node)) {
+        traverse(value);
+      }
+    }
   }
-  return [...all];
+  traverse(taxonomy);
+  return leaves;
 }
-function taxonomyToPrompt(taxonomy: Record<string, string[]>): string {
-  const lines = Object.entries(taxonomy)
-    .map(([parent, children]) => `  ${parent}: ${children.join(", ")}`)
-    .join("\n");
-  return `Use the following class hierarchy. Classify at the most specific level.\n${lines}`;
+function getTaxonomyPath(
+  leaf: string,
+  taxonomy: Record<string, any>,
+): string[] {
+  function findPath(
+    node: any,
+    target: string,
+    currentPath: string[],
+  ): string[] | null {
+    if (Array.isArray(node)) {
+      for (const item of node) {
+        if (typeof item === "string" && item === target) {
+          return [...currentPath, item];
+        } else if (typeof item === "object" && item !== null) {
+          const result = findPath(item, target, currentPath);
+          if (result) return result;
+        }
+      }
+    } else if (typeof node === "object" && node !== null) {
+      for (const [key, value] of Object.entries(node)) {
+        const result = findPath(value, target, [...currentPath, key]);
+        if (result) return result;
+      }
+    }
+    return null;
+  }
+  // Try each root node
+  for (const [rootKey, rootValue] of Object.entries(taxonomy)) {
+    const path = findPath(rootValue, leaf, [rootKey]);
+    if (path) return path;
+  }
+  return [leaf];
+}
+function taxonomyToPrompt(taxonomy: Record<string, any>): string {
+  function formatNode(node: any, indent: string): string[] {
+    const lines: string[] = [];
+    if (Array.isArray(node)) {
+      for (const item of node) {
+        if (typeof item === "string") {
+          lines.push(`${indent}- ${item}`);
+        } else if (typeof item === "object" && item !== null) {
+          lines.push(...formatNode(item, indent));
+        }
+      }
+    } else if (typeof node === "object" && node !== null) {
+      for (const [key, value] of Object.entries(node)) {
+        lines.push(`${indent}- ${key}`);
+        lines.push(...formatNode(value, indent + "  "));
+      }
+    }
+    return lines;
+  }
+  const lines: string[] = [
+    "Use the following class hierarchy. Classify at the most specific (leaf) level:",
+  ];
+  for (const [rootKey, rootValue] of Object.entries(taxonomy)) {
+    lines.push(`- ${rootKey}`);
+    lines.push(...formatNode(rootValue, "  "));
+  }
+  return lines.join("\n");
 }
 // === CHUNKING ===
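
To see how the new taxonomy helpers in this hunk behave, the sketch below copies `getLeafNodes` and `getTaxonomyPath` (logic unchanged, lightly condensed) and runs them on two inputs: the nested taxonomy from the README's `--taxonomy` example and the flat taxonomy from the README's schema-file example. The variable names `nested` and `flat` are illustrative. The observations hold only for the helper code shown here, since other parts of `index.ts` are not visible in this diff.

```typescript
// Condensed copies of the helpers added in this hunk, so the sketch runs standalone.
function getLeafNodes(taxonomy: Record<string, any>): string[] {
  const leaves: string[] = [];
  function traverse(node: any): void {
    if (Array.isArray(node)) {
      for (const item of node) {
        if (typeof item === "string") leaves.push(item);
        else if (typeof item === "object" && item !== null) traverse(item);
      }
    } else if (typeof node === "object" && node !== null) {
      for (const value of Object.values(node)) traverse(value);
    }
  }
  traverse(taxonomy);
  return leaves;
}

function getTaxonomyPath(leaf: string, taxonomy: Record<string, any>): string[] {
  function findPath(node: any, target: string, path: string[]): string[] | null {
    if (Array.isArray(node)) {
      for (const item of node) {
        if (typeof item === "string" && item === target) return [...path, item];
        if (typeof item === "object" && item !== null) {
          const result = findPath(item, target, path);
          if (result) return result;
        }
      }
    } else if (typeof node === "object" && node !== null) {
      for (const [key, value] of Object.entries(node)) {
        const result = findPath(value, target, [...path, key]);
        if (result) return result;
      }
    }
    return null;
  }
  for (const [rootKey, rootValue] of Object.entries(taxonomy)) {
    const path = findPath(rootValue, leaf, [rootKey]);
    if (path) return path;
  }
  return [leaf]; // fallback when the name is not found as a leaf string
}

// Nested form from the README's --taxonomy example.
const nested = {
  organism: ["person", { animal: ["dog", "cat"] }],
  idea: ["dream", "principle"],
};
// Flat form from the README's schema-file example.
const flat = { organism: ["person", "animal"], animal: ["dog", "cat"] };

console.log(getLeafNodes(nested)); // person, dog, cat, dream, principle
console.log(getTaxonomyPath("dog", nested)); // organism, animal, dog
console.log(getLeafNodes(flat)); // person, animal, dog, cat
console.log(getTaxonomyPath("dog", flat)); // animal, dog
console.log(getTaxonomyPath("animal", nested)); // animal (object keys are not matched)
```

With these helpers, only the nested form yields the full `organism > animal > dog` path. In the flat form, `animal` is both a leaf under `organism` and a separate root, so `dog` resolves to the shorter path `animal > dog`.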
@@ -98,6 +179,10 @@ program
     'Class hierarchy JSON e.g. {"organism":["animal","plant"]}',
   )
   .option("--relations", "Extract relations between entities")
+  .option(
+    "--relation-classes <list>",
+    "Comma-separated allowed relation classes (e.g. employment,location,causal)",
+  )
   .option("--resolve", "Resolve coreferences (group mentions of same entity)")
   .option("--include-confidence", "Include confidence scores per entity")
   .option("--detect-negation", "Detect negated/hypothetical entities")
@@ -130,17 +215,16 @@ program
     "after",
     `
 Examples:
-  fastner "the cat is blue"
-  fastner --fast "simple short text"
-  fastner "John works at Google" --classes person,organization
-  fastner "sky is blue" --attr-values '{"color":["blue","red"]}'
-  fastner --relations "Dr. Chen works at MIT"
-  fastner --resolve "Dr. Chen published a paper. She won an award."
-  fastner --detect-negation "The patient does not have diabetes"
-  fastner --schema schema.json "complex text"
-  fastner --file document.txt
-  fastner --batch inputs.jsonl
-  echo "the cat is blue" | fastner
+  ner "the cat is blue"
+  ner "John works at Google" --classes person,organization
+  ner "sky is blue" --attr-values '{"color":["blue","red"]}'
+  ner --relations "Dr. Chen works at MIT"
+  ner --resolve "Dr. Chen published a paper. She won an award."
+  ner --detect-negation "The patient does not have cancer"
+  ner --schema schema.json "complex text"
+  ner --file document.txt
+  ner --batch inputs.jsonl
+  echo "the cat is blue" | ner
 `,
   )
   .parse();
@@ -214,6 +298,9 @@ if (opts.taxonomy) {
 const enableRelations = opts.relations || !!schemaFile?.relations;
 const relationTypes: string[] | undefined = schemaFile?.relations || undefined;
+const relationClasses: string[] | undefined = opts.relationClasses
+  ? opts.relationClasses.split(",").map((s: string) => s.trim())
+  : schemaFile?.relationClasses;
 const enableResolve = !!opts.resolve;
 const enableConfidence = !!opts.includeConfidence;
 const enableNegation = !!opts.detectNegation;
@@ -293,6 +380,7 @@ function buildSystemPrompt(): string {
   }
   if (enableResolve) {
     base += `\n- Every entity has a top-level "entity_id" field. If multiple text spans refer to the same real-world entity (e.g. "Dr. Chen" and "she"), they share the same entity_id. Use short IDs like "e1", "e2".`;
+    base += `\n- When multiple mentions share an entity_id, exactly ONE of them must have "is_canonical": true (the most specific reference like a proper name). The others must have "is_canonical": false.`;
   }
   let prompt = `${base}\n\n${FEW_SHOT_EXAMPLES}`;
@@ -306,8 +394,9 @@ function buildSystemPrompt(): string {
 // === BUILD GRAMMAR SCHEMA ===
 function buildGrammarSchema() {
-  // Determine allowed classes from taxonomy or explicit list
-  const classEnum = taxonomy ? flattenTaxonomy(taxonomy) : allowedClasses;
+  // When using taxonomy, only allow leaf nodes as valid classes.
+  // This forces the model to classify at the most specific level.
+  const classEnum = taxonomy ? getLeafNodes(taxonomy) : allowedClasses;
   const attributesSchema: any = {
     type: "object",
@@ -343,7 +432,9 @@ function buildGrammarSchema() {
   }
   if (enableResolve) {
     properties.entity_id = { type: "string" };
+    properties.is_canonical = { type: "boolean" };
     required.push("entity_id");
+    required.push("is_canonical");
   }
   const schema: any = {
@@ -361,6 +452,23 @@ function buildGrammarSchema() {
 // === BUILD RELATIONS SCHEMA ===
 function buildRelationsSchema() {
+  const relationProperties: any = {
+    source: { type: "string" },
+    target: { type: "string" },
+    relation: {
+      type: "string",
+      ...(relationTypes && { enum: relationTypes }),
+    },
+  };
+  // Add class field if relationClasses is specified
+  if (relationClasses) {
+    relationProperties.class = {
+      type: "string",
+      enum: relationClasses,
+    };
+  }
   const relSchema: any = {
     type: "object",
     properties: {
@@ -369,15 +477,10 @@ function buildRelationsSchema() {
         type: "array",
         items: {
           type: "object",
-          properties: {
-            source: { type: "string" },
-            target: { type: "string" },
-            relation: {
-              type: "string",
-              ...(relationTypes && { enum: relationTypes }),
-            },
-          },
-          required: ["source", "target", "relation"],
+          properties: relationProperties,
+          required: relationClasses
+            ? ["source", "target", "relation", "class"]
+            : ["source", "target", "relation"],
           additionalProperties: false,
         },
       },
@@ -414,13 +517,21 @@ function buildConstraints(): string {
     constraints += `\nEvery entity has a "confidence" field (not in attributes). Example: [{"class":"person","text":"John","confidence":"high","attributes":{}}]`;
   }
   if (enableResolve) {
-    constraints += `\nEvery entity has an "entity_id" field (not in attributes). Coreferent mentions share the same entity_id. Example: [{"class":"person","text":"Dr. Chen","entity_id":"e1","attributes":{}},{"class":"person","text":"She","entity_id":"e1","attributes":{}}]`;
+    constraints += `\nEvery entity has "entity_id" and "is_canonical" fields. Coreferent mentions share entity_id; exactly one per group has is_canonical:true (the most specific reference). Example: [{"class":"person","text":"Dr. Chen","entity_id":"e1","is_canonical":true,"attributes":{}},{"class":"person","text":"She","entity_id":"e1","is_canonical":false,"attributes":{}}]`;
   }
   if (enableRelations) {
-    constraints += `\nAlso extract relations between entities. Return {"entities": [...], "relations": [{"source": "entity text", "target": "entity text", "relation": "relation type"}]}.`;
+    let relDesc = `\nAlso extract relations between entities.`;
+    if (relationClasses) {
+      relDesc += ` Each relation must have a "class" field categorizing the relation type.`;
+    }
+    relDesc += ` Return {"entities": [...], "relations": [{"source": "entity text", "target": "entity text", "relation": "relation type"${relationClasses ? ', "class": "relation class"' : ""}}]}.`;
+    constraints += relDesc;
     if (relationTypes) {
       constraints += ` Allowed relation types: ${relationTypes.join(", ")}.`;
     }
+    if (relationClasses) {
+      constraints += ` Allowed relation classes: ${relationClasses.join(", ")}.`;
+    }
   }
   return constraints;
@@ -456,6 +567,15 @@ async function processText(
     }
   }
+  // If taxonomy is used, add taxonomyPath showing full hierarchy
+  if (taxonomy && !enableRelations && Array.isArray(parsed)) {
+    for (const entity of parsed) {
+      if (entity.class && typeof entity.class === "string") {
+        entity.taxonomyPath = getTaxonomyPath(entity.class, taxonomy);
+      }
+    }
+  }
   return parsed;
 }
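
The hunks above call `getLeafNodes` and `getTaxonomyPath`, but this diff shows only the tail of the path helper. As a rough illustration of the idea (a leaf-only class enum plus a root-to-leaf path that falls back to `[leaf]`), here is a minimal sketch. The names `leafNodes`, `pathTo`, and `TaxonomyNode` are hypothetical and not taken from the package, and the sketch assumes arrays contain only strings, which is narrower than what `taxonomyToPrompt` accepts.

```typescript
// Hypothetical stand-ins for the package's getLeafNodes / getTaxonomyPath.
// Assumption: a node is either an array of leaf names or an object of children.
type TaxonomyNode = string[] | { [key: string]: TaxonomyNode };

// Collect the most specific class names in the hierarchy.
function leafNodes(node: TaxonomyNode): string[] {
  if (Array.isArray(node)) return [...node];
  const leaves: string[] = [];
  for (const key of Object.keys(node)) {
    const childLeaves = leafNodes(node[key]);
    // A key whose subtree is empty is treated as a leaf itself.
    if (childLeaves.length > 0) leaves.push(...childLeaves);
    else leaves.push(key);
  }
  return leaves;
}

// Return the chain of class names from the root down to `name`, or null.
function pathTo(
  node: TaxonomyNode,
  name: string,
  prefix: string[] = [],
): string[] | null {
  if (Array.isArray(node)) {
    return node.indexOf(name) !== -1 ? [...prefix, name] : null;
  }
  for (const key of Object.keys(node)) {
    if (key === name) return [...prefix, key];
    const found = pathTo(node[key], name, [...prefix, key]);
    if (found) return found;
  }
  return null;
}

const taxonomy: TaxonomyNode = {
  organism: { animal: ["cat", "dog"], plant: ["tree"] },
};

const leaves = leafNodes(taxonomy);
// Same fallback as the diff: an unknown class yields a one-element path.
const catPath = pathTo(taxonomy, "cat") ?? ["cat"];
const unknownPath = pathTo(taxonomy, "fish") ?? ["fish"];
```

Restricting the grammar enum to `leaves` is what would prevent the model from answering with an intermediate class such as `animal`; the path lookup then restores the hierarchy in the output.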

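In the hunks shown, the grammar schema types `is_canonical` as a boolean, while the rule that exactly one mention per `entity_id` group is canonical appears only in the prompt text. A caller who wants to verify that rule on parsed output could use a check along these lines. This is a sketch under that assumption; the `Mention` interface and the `groupsViolatingCanonicalRule` helper are hypothetical and not part of the package.

```typescript
// Hypothetical shape of one parsed entity when --resolve is enabled.
interface Mention {
  text: string;
  entity_id: string;
  is_canonical: boolean;
}

// Return the entity_ids whose group does not have exactly one canonical mention.
function groupsViolatingCanonicalRule(mentions: Mention[]): string[] {
  const canonicalCounts: { [id: string]: number } = {};
  for (const m of mentions) {
    const previous = canonicalCounts[m.entity_id] || 0;
    canonicalCounts[m.entity_id] = previous + (m.is_canonical ? 1 : 0);
  }
  return Object.keys(canonicalCounts).filter(
    (id) => canonicalCounts[id] !== 1,
  );
}

// Mirrors the example embedded in the constraints string of the diff.
const wellFormed: Mention[] = [
  { text: "Dr. Chen", entity_id: "e1", is_canonical: true },
  { text: "She", entity_id: "e1", is_canonical: false },
];

// A group with no canonical mention, which the boolean type alone permits.
const malformed: Mention[] = [
  { text: "the lab", entity_id: "e2", is_canonical: false },
  { text: "it", entity_id: "e2", is_canonical: false },
];

const okViolations = groupsViolatingCanonicalRule(wellFormed);
const badViolations = groupsViolatingCanonicalRule(malformed);
```
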
package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "objectivist-ner",
-  "version": "0.0.0",
+  "version": "0.0.2",
   "description": "Objectivist-inspired Named Entity Recognition with grammar-constrained LLM output",
   "bin": {
     "ner": "index.ts"