fact_db 0.0.1

Files changed (89)

checksums.yaml +7 -0
data/.envrc +1 -0
data/CHANGELOG.md +48 -0
data/COMMITS.md +196 -0
data/README.md +102 -0
data/Rakefile +41 -0
data/db/migrate/001_enable_extensions.rb +7 -0
data/db/migrate/002_create_contents.rb +44 -0
data/db/migrate/003_create_entities.rb +36 -0
data/db/migrate/004_create_entity_aliases.rb +18 -0
data/db/migrate/005_create_facts.rb +65 -0
data/db/migrate/006_create_entity_mentions.rb +18 -0
data/db/migrate/007_create_fact_sources.rb +18 -0
data/docs/api/extractors/index.md +71 -0
data/docs/api/extractors/llm.md +162 -0
data/docs/api/extractors/manual.md +92 -0
data/docs/api/extractors/rule-based.md +165 -0
data/docs/api/facts.md +300 -0
data/docs/api/index.md +66 -0
data/docs/api/models/content.md +165 -0
data/docs/api/models/entity.md +202 -0
data/docs/api/models/fact.md +270 -0
data/docs/api/models/index.md +77 -0
data/docs/api/pipeline/extraction.md +175 -0
data/docs/api/pipeline/index.md +72 -0
data/docs/api/pipeline/resolution.md +209 -0
data/docs/api/services/content-service.md +166 -0
data/docs/api/services/entity-service.md +202 -0
data/docs/api/services/fact-service.md +223 -0
data/docs/api/services/index.md +55 -0
data/docs/architecture/database-schema.md +293 -0
data/docs/architecture/entity-resolution.md +293 -0
data/docs/architecture/index.md +149 -0
data/docs/architecture/temporal-facts.md +268 -0
data/docs/architecture/three-layer-model.md +242 -0
data/docs/assets/css/custom.css +137 -0
data/docs/assets/fact_db.jpg +0 -0
data/docs/assets/images/fact_db.jpg +0 -0
data/docs/concepts.md +183 -0
data/docs/examples/basic-usage.md +235 -0
data/docs/examples/hr-onboarding.md +312 -0
data/docs/examples/index.md +64 -0
data/docs/examples/news-analysis.md +288 -0
data/docs/getting-started/database-setup.md +170 -0
data/docs/getting-started/index.md +71 -0
data/docs/getting-started/installation.md +98 -0
data/docs/getting-started/quick-start.md +191 -0
data/docs/guides/batch-processing.md +325 -0
data/docs/guides/configuration.md +243 -0
data/docs/guides/entity-management.md +364 -0
data/docs/guides/extracting-facts.md +299 -0
data/docs/guides/index.md +22 -0
data/docs/guides/ingesting-content.md +252 -0
data/docs/guides/llm-integration.md +299 -0
data/docs/guides/temporal-queries.md +315 -0
data/docs/index.md +121 -0
data/examples/README.md +130 -0
data/examples/basic_usage.rb +164 -0
data/examples/entity_management.rb +216 -0
data/examples/hr_system.rb +428 -0
data/examples/rule_based_extraction.rb +258 -0
data/examples/temporal_queries.rb +245 -0
data/lib/fact_db/config.rb +71 -0
data/lib/fact_db/database.rb +45 -0
data/lib/fact_db/errors.rb +10 -0
data/lib/fact_db/extractors/base.rb +117 -0
data/lib/fact_db/extractors/llm_extractor.rb +179 -0
data/lib/fact_db/extractors/manual_extractor.rb +53 -0
data/lib/fact_db/extractors/rule_based_extractor.rb +228 -0
data/lib/fact_db/llm/adapter.rb +109 -0
data/lib/fact_db/models/content.rb +62 -0
data/lib/fact_db/models/entity.rb +84 -0
data/lib/fact_db/models/entity_alias.rb +26 -0
data/lib/fact_db/models/entity_mention.rb +33 -0
data/lib/fact_db/models/fact.rb +192 -0
data/lib/fact_db/models/fact_source.rb +35 -0
data/lib/fact_db/pipeline/extraction_pipeline.rb +146 -0
data/lib/fact_db/pipeline/resolution_pipeline.rb +129 -0
data/lib/fact_db/resolution/entity_resolver.rb +261 -0
data/lib/fact_db/resolution/fact_resolver.rb +259 -0
data/lib/fact_db/services/content_service.rb +93 -0
data/lib/fact_db/services/entity_service.rb +150 -0
data/lib/fact_db/services/fact_service.rb +193 -0
data/lib/fact_db/temporal/query.rb +125 -0
data/lib/fact_db/temporal/timeline.rb +134 -0
data/lib/fact_db/version.rb +5 -0
data/lib/fact_db.rb +141 -0
data/mkdocs.yml +198 -0
metadata +288 -0

data/docs/architecture/database-schema.md ADDED

@@ -0,0 +1,293 @@
+# Database Schema
+FactDb uses PostgreSQL with the pgvector extension for semantic search capabilities.
+## Entity Relationship Diagram
+```mermaid
+erDiagram
+    contents ||--o{ fact_sources : "sourced by"
+    entities ||--o{ entity_aliases : "has"
+    entities ||--o{ entity_mentions : "mentioned in"
+    facts ||--o{ entity_mentions : "mentions"
+    facts ||--o{ fact_sources : "sourced from"
+    facts ||--o| facts : "superseded by"
+    contents {
+        bigint id PK
+        string content_hash UK
+        string content_type
+        text raw_text
+        string title
+        string source_uri
+        jsonb source_metadata
+        vector embedding
+        timestamptz captured_at
+        timestamptz created_at
+    }
+    entities {
+        bigint id PK
+        string canonical_name
+        string entity_type
+        string resolution_status
+        bigint merged_into_id FK
+        jsonb metadata
+        vector embedding
+        timestamptz created_at
+    }
+    entity_aliases {
+        bigint id PK
+        bigint entity_id FK
+        string alias_text
+        string alias_type
+        float confidence
+    }
+    facts {
+        bigint id PK
+        text fact_text
+        string fact_hash
+        timestamptz valid_at
+        timestamptz invalid_at
+        string status
+        bigint superseded_by_id FK
+        bigint[] derived_from_ids
+        bigint[] corroborated_by_ids
+        float confidence
+        string extraction_method
+        jsonb metadata
+        vector embedding
+        timestamptz created_at
+    }
+    entity_mentions {
+        bigint id PK
+        bigint fact_id FK
+        bigint entity_id FK
+        string mention_text
+        string mention_role
+        float confidence
+    }
+    fact_sources {
+        bigint id PK
+        bigint fact_id FK
+        bigint content_id FK
+        string source_type
+        text excerpt
+        float confidence
+    }
+```
+## Tables
+### contents
+Stores immutable source documents.
+```sql
+CREATE TABLE contents (
+    id BIGSERIAL PRIMARY KEY,
+    content_hash VARCHAR(64) NOT NULL UNIQUE,
+    content_type VARCHAR(50) NOT NULL,
+    raw_text TEXT NOT NULL,
+    title VARCHAR(255),
+    source_uri TEXT,
+    source_metadata JSONB NOT NULL DEFAULT '{}',
+    embedding VECTOR(1536),
+    captured_at TIMESTAMPTZ NOT NULL,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
+);
+CREATE INDEX idx_contents_type ON contents(content_type);
+CREATE INDEX idx_contents_captured ON contents(captured_at);
+CREATE INDEX idx_contents_text ON contents USING gin(to_tsvector('english', raw_text));
+CREATE INDEX idx_contents_embedding ON contents USING hnsw(embedding vector_cosine_ops);
+```
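The `content_hash VARCHAR(64) NOT NULL UNIQUE` column is wide enough for a hex-encoded SHA-256 digest, which suggests content is deduplicated by hashing the raw text. The sketch below illustrates that idea under two assumptions: that the hash is SHA-256, and that ingestion returns the existing row when the hash is already present. `content_hash` and `ingest` are hypothetical helpers, not part of the gem's API.

```ruby
require "digest"

# Hypothetical helper: derive a 64-character key from the raw text.
# SHA-256 is an assumption; the schema only fixes the column width.
def content_hash(raw_text)
  Digest::SHA256.hexdigest(raw_text)
end

# Hypothetical in-memory stand-in for the UNIQUE constraint on
# contents.content_hash: the first row stored for a hash wins.
def ingest(store, raw_text)
  key = content_hash(raw_text)
  store[key] ||= { content_hash: key, raw_text: raw_text }
end

store  = {}
first  = ingest(store, "Paula Chen joined Microsoft.")
second = ingest(store, "Paula Chen joined Microsoft.")

puts first[:content_hash].length # width of the hex digest
puts first.equal?(second)        # same row returned for identical text
```

In PostgreSQL the same effect would typically come from the unique index itself, with the application handling the conflict on insert.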
+### entities
+Stores resolved identities.
+```sql
+CREATE TABLE entities (
+    id BIGSERIAL PRIMARY KEY,
+    canonical_name VARCHAR(255) NOT NULL,
+    entity_type VARCHAR(50) NOT NULL,
+    resolution_status VARCHAR(20) NOT NULL DEFAULT 'unresolved',
+    merged_into_id BIGINT REFERENCES entities(id),
+    metadata JSONB NOT NULL DEFAULT '{}',
+    embedding VECTOR(1536),
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
+);
+CREATE INDEX idx_entities_name ON entities(canonical_name);
+CREATE INDEX idx_entities_type ON entities(entity_type);
+CREATE INDEX idx_entities_status ON entities(resolution_status);
+CREATE INDEX idx_entities_embedding ON entities USING hnsw(embedding vector_cosine_ops);
+```
+### entity_aliases
+Stores alternative names for entities.
+```sql
+CREATE TABLE entity_aliases (
+    id BIGSERIAL PRIMARY KEY,
+    entity_id BIGINT NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
+    alias_text VARCHAR(255) NOT NULL,
+    alias_type VARCHAR(50),
+    confidence FLOAT DEFAULT 1.0
+);
+CREATE INDEX idx_aliases_entity ON entity_aliases(entity_id);
+CREATE INDEX idx_aliases_text ON entity_aliases(alias_text);
+CREATE UNIQUE INDEX idx_aliases_unique ON entity_aliases(entity_id, alias_text);
+```
+### facts
+Stores temporal assertions.
+```sql
+CREATE TABLE facts (
+    id BIGSERIAL PRIMARY KEY,
+    fact_text TEXT NOT NULL,
+    fact_hash VARCHAR(64) NOT NULL,
+    valid_at TIMESTAMPTZ NOT NULL,
+    invalid_at TIMESTAMPTZ,
+    status VARCHAR(20) NOT NULL DEFAULT 'canonical',
+    superseded_by_id BIGINT REFERENCES facts(id),
+    derived_from_ids BIGINT[],
+    corroborated_by_ids BIGINT[],
+    confidence FLOAT DEFAULT 1.0,
+    extraction_method VARCHAR(50) NOT NULL,
+    metadata JSONB NOT NULL DEFAULT '{}',
+    embedding VECTOR(1536),
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
+);
+CREATE INDEX idx_facts_status ON facts(status);
+CREATE INDEX idx_facts_valid ON facts(valid_at);
+CREATE INDEX idx_facts_invalid ON facts(invalid_at);
+CREATE INDEX idx_facts_temporal ON facts(valid_at, invalid_at);
+CREATE INDEX idx_facts_method ON facts(extraction_method);
+CREATE INDEX idx_facts_text ON facts USING gin(to_tsvector('english', fact_text));
+CREATE INDEX idx_facts_embedding ON facts USING hnsw(embedding vector_cosine_ops);
+```
+### entity_mentions
+Links facts to mentioned entities.
+```sql
+CREATE TABLE entity_mentions (
+    id BIGSERIAL PRIMARY KEY,
+    fact_id BIGINT NOT NULL REFERENCES facts(id) ON DELETE CASCADE,
+    entity_id BIGINT NOT NULL REFERENCES entities(id),
+    mention_text VARCHAR(255) NOT NULL,
+    mention_role VARCHAR(50) NOT NULL,
+    confidence FLOAT DEFAULT 1.0
+);
+CREATE INDEX idx_mentions_fact ON entity_mentions(fact_id);
+CREATE INDEX idx_mentions_entity ON entity_mentions(entity_id);
+CREATE INDEX idx_mentions_role ON entity_mentions(mention_role);
+```
+### fact_sources
+Links facts to source content.
+```sql
+CREATE TABLE fact_sources (
+    id BIGSERIAL PRIMARY KEY,
+    fact_id BIGINT NOT NULL REFERENCES facts(id) ON DELETE CASCADE,
+    content_id BIGINT NOT NULL REFERENCES contents(id),
+    source_type VARCHAR(50) NOT NULL DEFAULT 'primary',
+    excerpt TEXT,
+    confidence FLOAT DEFAULT 1.0
+);
+CREATE INDEX idx_sources_fact ON fact_sources(fact_id);
+CREATE INDEX idx_sources_content ON fact_sources(content_id);
+CREATE INDEX idx_sources_type ON fact_sources(source_type);
+```
+## Vector Indexes
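+FactDb uses HNSW indexes for fast approximate nearest neighbor search. These are the same embedding indexes defined with each table above, shown here with explicit build parameters (`m = 16` and `ef_construction = 64` are pgvector's defaults), so create each index only once: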
+```sql
+-- Contents semantic search
+CREATE INDEX idx_contents_embedding ON contents
+    USING hnsw(embedding vector_cosine_ops)
+    WITH (m = 16, ef_construction = 64);
+-- Entities semantic search
+CREATE INDEX idx_entities_embedding ON entities
+    USING hnsw(embedding vector_cosine_ops)
+    WITH (m = 16, ef_construction = 64);
+-- Facts semantic search
+CREATE INDEX idx_facts_embedding ON facts
+    USING hnsw(embedding vector_cosine_ops)
+    WITH (m = 16, ef_construction = 64);
+```
+## Temporal Query Patterns
+### Currently Valid Facts
+```sql
+SELECT * FROM facts
+WHERE status = 'canonical'
+AND invalid_at IS NULL;
+```
+### Facts Valid at Point in Time
+```sql
+SELECT * FROM facts
+WHERE status IN ('canonical', 'corroborated')
+AND valid_at <= '2024-03-15'
+AND (invalid_at IS NULL OR invalid_at > '2024-03-15');
+```
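The point-in-time query treats each fact as valid on a half-open interval: it starts at `valid_at` (inclusive) and ends at `invalid_at` (exclusive), with `NULL` meaning "still valid". A minimal in-memory sketch of the same predicate follows; `FactRow`, `VISIBLE_STATUSES`, `valid_at?`, and `facts_valid_at` are illustrative names, not FactDb classes or methods.

```ruby
# Illustrative row type mirroring the columns used by the SQL query.
FactRow = Struct.new(:text, :status, :valid_at, :invalid_at, keyword_init: true)

# Statuses the SQL query accepts.
VISIBLE_STATUSES = %w[canonical corroborated].freeze

# Same predicate as the WHERE clause: status filter, start inclusive,
# end exclusive, nil end means the fact has not been invalidated.
def valid_at?(fact, at)
  VISIBLE_STATUSES.include?(fact.status) &&
    fact.valid_at <= at &&
    (fact.invalid_at.nil? || fact.invalid_at > at)
end

def facts_valid_at(facts, at)
  facts.select { |fact| valid_at?(fact, at) }
end

rows = [
  FactRow.new(text: "Paula is a Senior Engineer", status: "canonical",
              valid_at: Time.utc(2024, 1, 10), invalid_at: Time.utc(2024, 6, 1)),
  FactRow.new(text: "Paula is a Principal Engineer", status: "canonical",
              valid_at: Time.utc(2024, 6, 1), invalid_at: nil)
]

facts_valid_at(rows, Time.utc(2024, 3, 15)).each { |fact| puts fact.text }
```

Because the end of the interval is exclusive, a fact that is replaced at a given instant and its replacement never both match the same timestamp.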
+### Entity Timeline
+```sql
+SELECT f.* FROM facts f
+JOIN entity_mentions em ON em.fact_id = f.id
+WHERE em.entity_id = 123
+ORDER BY f.valid_at ASC;
+```
+### Semantic Search
+```sql
+SELECT *, embedding <=> '[...]' AS distance
+FROM contents
+ORDER BY embedding <=> '[...]'
+LIMIT 10;
+```
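pgvector's `<=>` operator returns cosine distance, that is, one minus the cosine similarity of the two vectors, so smaller values mean closer matches. The sketch below reproduces that ordering in plain Ruby for small arrays; `cosine_distance` and `nearest` are illustrative helpers, and a real query would let the HNSW index do this work inside PostgreSQL.

```ruby
# Cosine distance: 1 - (a . b) / (|a| * |b|).
# Zero-length vectors are not handled here (the result would be NaN).
def cosine_distance(a, b)
  raise ArgumentError, "dimension mismatch" unless a.length == b.length

  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

# Exact (brute-force) equivalent of ORDER BY embedding <=> query LIMIT n.
def nearest(rows, query, limit: 10)
  rows.sort_by { |row| cosine_distance(row[:embedding], query) }.first(limit)
end

rows = [
  { title: "onboarding memo", embedding: [0.9, 0.1] },
  { title: "press release",   embedding: [0.0, 1.0] }
]

nearest(rows, [1.0, 0.0], limit: 1).each { |row| puts row[:title] }
```

Unlike this brute-force scan, an HNSW index returns approximate neighbors, so results can differ slightly from the exact ordering.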
+## Maintenance
+### Vacuum and Analyze
+```sql
+VACUUM ANALYZE contents;
+VACUUM ANALYZE entities;
+VACUUM ANALYZE facts;
+```
+### Reindex Vectors
+```sql
+REINDEX INDEX idx_contents_embedding;
+REINDEX INDEX idx_entities_embedding;
+REINDEX INDEX idx_facts_embedding;
+```

data/docs/architecture/entity-resolution.md ADDED

@@ -0,0 +1,293 @@
+# Entity Resolution
+Entity resolution is the process of matching text mentions to canonical entities in the system.
+## Overview
+When extracting facts from content, mentions like "Paula", "P. Chen", or "Paula Chen" need to be resolved to a single canonical entity.
+```mermaid
+graph LR
+    M1["'Paula'"] --> R{EntityResolver}
+    M2["'P. Chen'"] --> R
+    M3["'Paula Chen'"] --> R
+    M4["'Chen, Paula'"] --> R
+    R --> E["Entity: Paula Chen"]
+    style M1 fill:#1E40AF,stroke:#1E3A8A,color:#FFFFFF
+    style M2 fill:#1E40AF,stroke:#1E3A8A,color:#FFFFFF
+    style M3 fill:#1E40AF,stroke:#1E3A8A,color:#FFFFFF
+    style M4 fill:#1E40AF,stroke:#1E3A8A,color:#FFFFFF
+    style R fill:#B45309,stroke:#92400E,color:#FFFFFF
+    style E fill:#047857,stroke:#065F46,color:#FFFFFF
+```
+## Resolution Strategies
+The resolver tries multiple strategies in order:
+### 1. Exact Match
+Direct match against canonical names:
+```ruby
+# Looking for "Microsoft"
+entity = facts.resolve_entity("Microsoft")
+# Matches: Entity(canonical_name: "Microsoft")
+```
+### 2. Alias Match
+Match against registered aliases:
+```ruby
+# Entity has aliases: ["MS", "MSFT", "Microsoft Corp"]
+entity = facts.resolve_entity("MSFT")
+# Matches via alias
+```
+### 3. Fuzzy Match
+Levenshtein distance for typos and variations:
+```ruby
+# Looking for "Microsft" (typo)
+entity = facts.resolve_entity("Microsft")
+# Fuzzy matches "Microsoft" if similarity > threshold
+```
+Configuration:
+```ruby
+FactDb.configure do |config|
+  config.fuzzy_match_threshold = 0.85  # 85% similarity required
+  config.auto_merge_threshold = 0.95   # Auto-merge at 95%
+end
+```
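To make the threshold concrete, here is a minimal sketch of how a Levenshtein distance can be turned into a similarity score between 0 and 1. The normalization (one minus distance divided by the longer string's length) and the case folding are assumptions for illustration; the gem's actual scoring formula is not shown in this document. `levenshtein`, `similarity`, and `fuzzy_match?` are hypothetical helpers.

```ruby
# Classic two-row dynamic-programming Levenshtein distance.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,       # deletion
        current[j - 1] + 1,    # insertion
        previous[j - 1] + cost # substitution or match
      ].min
    end
    previous = current
  end
  previous.last
end

# Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
def similarity(a, b)
  a = a.downcase
  b = b.downcase
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein(a, b).to_f / longest
end

# Mirrors the "similarity > threshold" rule described above.
def fuzzy_match?(query, candidate, threshold: 0.85)
  similarity(query, candidate) > threshold
end

puts similarity("Microsft", "Microsoft").round(3)
puts fuzzy_match?("Microsft", "Microsoft")
```

Under this normalization "Microsft" is one edit away from the nine-character "Microsoft", so it clears a 0.85 threshold, while short names tolerate far fewer typos because each edit costs a larger share of the length.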
+### 4. Type-Constrained
+Limit matches to specific entity types:
+```ruby
+# Only match person entities
+person = facts.resolve_entity("Paula", type: :person)
+# Only match organizations
+org = facts.resolve_entity("Platform", type: :organization)
+```
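The ordering of the strategies above can be sketched as a chain that returns on the first hit. This is a simplified, in-memory illustration, not the gem's `EntityResolver`: `EntityRecord` and `SketchResolver` are hypothetical names, matching is assumed to be case-insensitive, and the fuzzy step is left out (it would run after the alias lookup).

```ruby
# Illustrative record; the real model has more columns.
EntityRecord = Struct.new(:canonical_name, :type, :aliases, keyword_init: true)

# Hypothetical resolver: exact match first, then alias match,
# with an optional type constraint applied to both steps.
class SketchResolver
  def initialize(entities)
    @entities = entities
  end

  # Returns [entity, strategy] or nil when nothing matches.
  def resolve(text, type: nil)
    needle = text.downcase
    candidates = type ? @entities.select { |e| e.type == type } : @entities

    exact = candidates.find { |e| e.canonical_name.downcase == needle }
    return [exact, :exact] if exact

    via_alias = candidates.find do |e|
      e.aliases.any? { |name| name.downcase == needle }
    end
    return [via_alias, :alias] if via_alias

    nil
  end
end

resolver = SketchResolver.new([
  EntityRecord.new(canonical_name: "Microsoft", type: :organization,
                   aliases: ["MS", "MSFT", "Microsoft Corp"]),
  EntityRecord.new(canonical_name: "Paula Chen", type: :person,
                   aliases: ["Paula", "P. Chen"])
])

entity, strategy = resolver.resolve("MSFT")
puts "#{entity.canonical_name} via #{strategy}"
```

Returning the strategy alongside the entity makes it easy to log how a mention was resolved, which helps when reviewing weaker matches.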
+## Creating Entities
+### Basic Creation
+```ruby
+entity = facts.entity_service.create(
+  "Paula Chen",
+  type: :person
+)
+```
+### With Aliases
+```ruby
+entity = facts.entity_service.create(
+  "Paula Chen",
+  type: :person,
+  aliases: ["Paula", "P. Chen", "Chen, Paula"]
+)
+```
+### With Metadata
+```ruby
+entity = facts.entity_service.create(
+  "Paula Chen",
+  type: :person,
+  aliases: ["Paula"],
+  metadata: {
+    department: "Engineering",
+    start_date: "2024-01-10",
+    employee_id: "E12345"
+  }
+)
+```
+## Managing Aliases
+### Add Alias
+```ruby
+facts.entity_service.add_alias(
+  entity.id,
+  "P. Chen",
+  type: :abbreviation,
+  confidence: 0.9
+)
+```
+### List Aliases
+```ruby
+entity.entity_aliases.each do |alias_record|
+  puts "#{alias_record.alias_text} (#{alias_record.alias_type})"
+end
+```
+### Remove Alias
+```ruby
+facts.entity_service.remove_alias(entity.id, "Old Name")
+```
+## Merging Entities
+When duplicate entities are discovered:
+```ruby
+# Merge entity2 into entity1
+facts.entity_service.merge(
+  entity1.id,  # Survivor: keep this one
+  entity2.id   # Duplicate: merged into entity1
+)
+# After merge:
+# - entity2.resolution_status => "merged"
+# - entity2.merged_into_id => entity1.id
+# - All facts mentioning entity2 now also reference entity1
+```
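The post-merge state described in the comments above can be modeled with two fields, `resolution_status` and `merged_into_id`. The sketch below shows that bookkeeping in memory, plus a lookup that follows `merged_into_id` links to the surviving entity. `EntityRow`, `merge_entities`, and `canonical_entity` are hypothetical helpers; moving aliases and fact mentions, which a real merge also has to handle, is omitted.

```ruby
# Illustrative row with only the columns involved in a merge.
EntityRow = Struct.new(:id, :canonical_name, :resolution_status, :merged_into_id,
                       keyword_init: true)

# Mark the duplicate as merged and point it at the survivor.
def merge_entities(entities_by_id, keep_id, merge_id)
  raise ArgumentError, "cannot merge an entity into itself" if keep_id == merge_id

  keep      = entities_by_id.fetch(keep_id)
  duplicate = entities_by_id.fetch(merge_id)
  duplicate.resolution_status = "merged"
  duplicate.merged_into_id    = keep.id
  keep
end

# Follow merged_into_id links until reaching an entity that was never merged.
def canonical_entity(entities_by_id, id)
  entity = entities_by_id.fetch(id)
  seen = {}
  while entity.merged_into_id
    raise "merge cycle at entity #{entity.id}" if seen[entity.id]

    seen[entity.id] = true
    entity = entities_by_id.fetch(entity.merged_into_id)
  end
  entity
end

entities = {
  1 => EntityRow.new(id: 1, canonical_name: "Paula Chen", resolution_status: "resolved"),
  2 => EntityRow.new(id: 2, canonical_name: "P. Chen", resolution_status: "unresolved")
}

merge_entities(entities, 1, 2)
puts canonical_entity(entities, 2).canonical_name
```

Keeping the merged row instead of deleting it preserves an audit trail and lets older references to the duplicate's id still resolve to the survivor.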
+### Automatic Merging
+High-confidence matches can be auto-merged:
+```ruby
+FactDb.configure do |config|
+  config.auto_merge_threshold = 0.95
+end
+# When resolving, if similarity > 0.95, entities auto-merge
+```
+## Resolution in Extraction
+### Manual Resolution
+```ruby
+fact = facts.fact_service.create(
+  "Paula joined the team",
+  valid_at: Date.today,
+  mentions: [
+    {
+      entity: paula_entity,
+      text: "Paula",
+      role: "subject",
+      confidence: 1.0
+    }
+  ]
+)
+```
+### Automatic Resolution
+The LLM extractor resolves mentions automatically:
+```ruby
+extracted = facts.extract_facts(content.id, extractor: :llm)
+extracted.each do |fact|
+  fact.entity_mentions.each do |mention|
+    puts "Resolved '#{mention.mention_text}' to #{mention.entity.canonical_name}"
+    puts "  Role: #{mention.mention_role}"
+    puts "  Confidence: #{mention.confidence}"
+  end
+end
+```
+## Mention Roles
+When linking entities to facts, specify the role:
+| Role | Description | Example |
+|------|-------------|---------|
+| `subject` | Primary actor | "Paula joined..." |
+| `object` | Target of action | "...hired Paula" |
+| `organization` | Company/team | "...at Microsoft" |
+| `location` | Place | "...in Seattle" |
+| `role` | Job title/position | "...as Engineer" |
+| `temporal` | Time reference | "...in Q4 2024" |
+```ruby
+fact = facts.fact_service.create(
+  "Paula Chen joined Microsoft as Principal Engineer in Seattle",
+  valid_at: Date.parse("2024-01-10"),
+  mentions: [
+    { entity: paula, role: "subject", text: "Paula Chen" },
+    { entity: microsoft, role: "organization", text: "Microsoft" },
+    { entity: seattle, role: "location", text: "Seattle" }
+  ]
+)
+```
+## Batch Resolution
+For processing multiple entities efficiently:
+```ruby
+names = ["Paula Chen", "John Smith", "Acme Corp", "Seattle"]
+results = facts.batch_resolve_entities(names)
+results.each do |result|
+  puts "#{result[:name]}: #{result[:status]}"
+  puts "  Entity: #{result[:entity]&.canonical_name}"
+end
+```
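One way to make batch resolution cheap is to build a lookup index once and reuse it for every name, instead of scanning all entities per name. The sketch below illustrates that idea and produces result hashes with the same keys used above (`:name`, `:status`, `:entity`). `build_index` and `batch_resolve` are hypothetical helpers, the `:resolved` and `:unresolved` status values are assumptions, and only exact and alias matches are covered.

```ruby
# Map every lowercase canonical name and alias to its entity.
# The first entity to claim a name keeps it.
def build_index(entities)
  entities.each_with_object({}) do |entity, index|
    ([entity[:canonical_name]] + entity[:aliases]).each do |name|
      index[name.downcase] ||= entity
    end
  end
end

# Resolve each distinct name with a single hash lookup.
def batch_resolve(names, index)
  names.uniq.map do |name|
    entity = index[name.downcase]
    { name: name, status: entity ? :resolved : :unresolved, entity: entity }
  end
end

index = build_index([
  { canonical_name: "Paula Chen", aliases: ["Paula", "P. Chen"] },
  { canonical_name: "Acme Corp",  aliases: ["Acme"] }
])

batch_resolve(["Paula Chen", "Acme", "Seattle"], index).each do |result|
  puts "#{result[:name]}: #{result[:status]}"
end
```

Names that come back unresolved can then be routed to fuzzy matching or manual review rather than failing the whole batch.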
+## Best Practices
+### 1. Create Comprehensive Aliases
+```ruby
+# Include common variations
+entity = facts.entity_service.create(
+  "International Business Machines Corporation",
+  type: :organization,
+  aliases: [
+    "IBM",
+    "Big Blue",
+    "International Business Machines"
+  ]
+)
+```
+### 2. Use Type Constraints
+```ruby
+# Avoid ambiguous matches
+entity = facts.resolve_entity("Apple", type: :organization)
+# Won't match "Apple" as a fruit/food entity
+```
+### 3. Review Fuzzy Matches
+```ruby
+# Log low-confidence resolutions for review.
+# `resolution` here stands for a resolver result that carries a confidence score.
+if resolution.confidence < 0.9
+  logger.warn "Low confidence resolution: #{resolution}"
+end
+```
+### 4. Handle Unresolved Mentions
+```ruby
+entity = facts.resolve_entity("Unknown Person")
+if entity.nil?
+  # Create new entity or flag for review
+  entity = facts.entity_service.create(
+    "Unknown Person",
+    type: :person,
+    metadata: { needs_review: true }
+  )
+end
+```