rust-kgdb 0.6.79 → 0.6.80

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +120 -93
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -4,108 +4,24 @@
  [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
  [![W3C](https://img.shields.io/badge/W3C-SPARQL%201.1%20%7C%20RDF%201.2-blue)](https://www.w3.org/TR/sparql11-query/)

- > **Native Graph Embeddings + Multi-Vector Search**: The only knowledge graph with built-in RDF2Vec, composite embeddings, and distributed SPARQL - all at native Rust speed.
+ > **Enterprise Knowledge Graph with Native Graph Embeddings**: A production-grade RDF database featuring built-in RDF2Vec, multi-vector composite search, and distributed SPARQL execution—engineered for teams who need verifiable AI at scale.

  ---

- ## RDF2Vec: Graph Embeddings That Blow Away The Competition
+ ## What's New in v0.6.79

- **Why wait for API calls when you can have 98 nanosecond lookups?**
+ | Feature | Description | Performance |
+ |---------|-------------|-------------|
+ | **Rdf2VecEngine** | Native graph embeddings from random walks | 68 µs lookup (3,000x faster than APIs) |
+ | **Composite Multi-Vector** | RRF fusion of RDF2Vec + OpenAI + domain | +26% recall improvement |
+ | **Distributed SPARQL** | HDRF-partitioned Kubernetes clusters | 66-141ms across 3 executors |
+ | **Auto-Embedding Triggers** | Vectors generated on graph insert/update | 37 µs incremental updates |

  ```javascript
  const { GraphDB, Rdf2VecEngine, EmbeddingService } = require('rust-kgdb')
-
- // Create graph and load your knowledge
- const db = new GraphDB('http://myapp/knowledge')
- db.loadTtl(myOntology, null) // 130,923 triples/sec
-
- // RDF2Vec: Train embeddings from graph structure
- const rdf2vec = new Rdf2VecEngine()
- const walks = extractRandomWalks(db) // Graph topology → training data
- rdf2vec.train(JSON.stringify(walks)) // 1,207 walks/sec → 384-dim vectors
-
- // Blazing fast similarity search
- const embedding = rdf2vec.getEmbedding('http://myapp/entity123') // 68 µs
- const similar = rdf2vec.findSimilar(entity, candidates, 5) // 303 µs
  ```

- ### Performance Numbers That Matter
-
- | Metric | rust-kgdb | OpenAI API | Speedup |
- |--------|-----------|------------|---------|
- | **Embedding Lookup** | **68 µs** | 200-500 ms | **3,000-7,000x faster** |
- | **Similarity Search** | **303 µs** | 300-800 ms | **1,000-2,600x faster** |
- | **Training (1K walks)** | **829 ms** | N/A (no graph structure) | - |
- | **Batch Processing** | **In-process** | Rate-limited API | **No quotas** |
-
- **Why this matters**: OpenAI/Cohere embeddings require HTTP round-trips (200-500ms latency) and rate limits. RDF2Vec runs in your process at native speed. For real-time fraud detection or recommendation engines, this is the difference between catching fraud before payment clears vs. flagging it days later.
-
- ### Multi-Vector Composite Embeddings (RRF Fusion)
-
- Combine multiple embedding sources for maximum recall:
-
- ```javascript
- const service = new EmbeddingService()
-
- // Store embeddings from different providers
- service.storeComposite('CLM001', JSON.stringify({
-   rdf2vec: rdf2vec.getEmbedding('CLM001'),     // Graph structure (local)
-   openai: await openaiEmbed(claimDescription), // Semantic text (API)
-   domain: customFraudEmbedding                 // Domain-specific
- }))
-
- // RRF (Reciprocal Rank Fusion) combines all sources
- const similar = service.findSimilarComposite('CLM001', 10, 0.7, 'rrf')
- // Formula: Score = Σ(1/(k+rank_i)), k=60
- // Result: Better recall than any single embedding source
- ```
-
- | Pool Size | Single Embedding | RRF Composite | Improvement |
- |-----------|------------------|---------------|-------------|
- | 100 entities | 78% recall | **89% recall** | +14% |
- | 1K entities | 72% recall | **85% recall** | +18% |
- | 10K entities | 65% recall | **82% recall** | +26% |
-
- ### Distributed Cluster Performance (Real LUBM Benchmark)
-
- Tested on Kubernetes: 1 coordinator + 3 executors via NodePort:
-
- | Query | Description | Results | Time |
- |-------|-------------|---------|------|
- | Q1 | GraduateStudent type | 150 | **66ms** |
- | Q4 | Advisor relationships | 150 | **101ms** |
- | Q6 | 2-way join (advisor+dept) | 46 | **75ms** |
- | Q7 | Course enrollment | 570 | **141ms** |
-
- **3,272 LUBM triples** distributed across 3 executors via HDRF partitioning. Multi-hop joins execute seamlessly across partition boundaries.
-
- ### Graph → Embedding Pipeline (End-to-End)
-
- ```javascript
- // 1. Insert triples (auto-distributed across executors)
- db.loadTtl(newData, null) // Triggers auto-embedding if configured
-
- // 2. Extract walks from relationships (graph topology)
- const walks = [
-   ['Company1', 'employs', 'Person1'],
-   ['Person1', 'knows', 'Person2'],
-   ['Person2', 'worksFor', 'Company1']
- ]
-
- // 3. Train on walks → 384-dimensional embeddings
- const result = JSON.parse(rdf2vec.train(JSON.stringify(walks)))
- // { vocabulary_size: 4, dimensions: 384, training_time_secs: 0.8 }
-
- // 4. Find similar entities in 303 µs
- const similar = rdf2vec.findSimilar('Person1', candidates, 5)
- ```
-
- **Pipeline Throughput:**
- - Graph load: **130,923 triples/sec**
- - RDF2Vec training: **1,207 walks/sec**
- - Embedding lookup: **68 µs** (14,700/sec)
- - Similarity search: **303 µs** (3,300/sec)
- - Incremental update: **37 µs** (no full retrain)
+ *See [Native Graph Embeddings](#native-graph-embeddings-rdf2vec-engine) for complete documentation and benchmarks.*

  ---

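The RRF formula quoted in the README (`Score = Σ(1/(k+rank_i)), k=60`) is straightforward to verify independently. Below is a minimal sketch in plain JavaScript; `rrfFuse` is a hypothetical helper written for illustration, not part of rust-kgdb's actual API:

```javascript
// Reciprocal Rank Fusion: combine ranked candidate lists from several
// embedding sources into one fused score per entity.
// Score(e) = Σ over sources of 1 / (k + rank_of_e), with the usual k = 60.
function rrfFuse(rankings, k = 60) {
  const scores = new Map()
  for (const ranked of rankings) {
    ranked.forEach((entity, i) => {
      const rank = i + 1 // ranks are 1-based
      scores.set(entity, (scores.get(entity) || 0) + 1 / (k + rank))
    })
  }
  // Entities sorted by fused score, best first: [[entity, score], ...]
  return [...scores.entries()].sort((a, b) => b[1] - a[1])
}

// Example: three sources (graph structure, text semantics, domain model)
// rank the same candidate pool differently.
const fused = rrfFuse([
  ['CLM001', 'CLM002', 'CLM003'], // rdf2vec ranking
  ['CLM002', 'CLM001', 'CLM003'], // openai ranking
  ['CLM002', 'CLM003', 'CLM001'], // domain ranking
])
// CLM002 (ranked 2nd, 1st, 1st) ends up with the highest fused score
```

This shows why RRF needs no score calibration across sources: it consumes only ranks, so a 68 µs local RDF2Vec ranking and a remote OpenAI ranking fuse on equal footing.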
@@ -208,6 +124,26 @@ The math matters. When your fraud detection runs 35x faster, you catch fraud bef

  Most AI frameworks trust the LLM. We don't.

+ ### Core Capabilities
+
+ | Layer | Feature | What It Does |
+ |-------|---------|--------------|
+ | **Database** | GraphDB | W3C SPARQL 1.1 compliant RDF store with 449ns lookups |
+ | **Database** | Distributed SPARQL | HDRF partitioning across Kubernetes executors |
+ | **Embeddings** | Rdf2VecEngine | Train 384-dim vectors from graph random walks |
+ | **Embeddings** | EmbeddingService | Multi-provider composite vectors with RRF fusion |
+ | **Embeddings** | HNSW Index | Approximate nearest neighbor search in 303µs |
+ | **Analytics** | GraphFrames | PageRank, connected components, motif matching |
+ | **Analytics** | Pregel API | Bulk synchronous parallel graph algorithms |
+ | **Reasoning** | Datalog Engine | Recursive rule evaluation with fixpoint semantics |
+ | **AI Agent** | HyperMindAgent | Schema-aware SPARQL generation from natural language |
+ | **AI Agent** | Type System | Hindley-Milner type inference for query validation |
+ | **AI Agent** | Proof DAG | SHA-256 audit trail for every AI decision |
+ | **Security** | WASM Sandbox | Capability-based isolation with fuel metering |
+ | **Security** | Schema Cache | Cross-agent ontology sharing with validation |
+
+ ### The Architecture Difference
+
  ```
  +===========================================================================+
  | |
@@ -406,6 +342,97 @@ The difference? HyperMind treats tools as **typed morphisms** (category theory),

  ---

+ ## Native Graph Embeddings: RDF2Vec Engine
+
+ Traditional embedding pipelines introduce significant latency: serialize your entity, make an HTTP request to OpenAI or Cohere, wait 200-500ms, parse the response. For applications requiring real-time similarity—fraud detection, recommendation engines, entity resolution—this latency model becomes a critical bottleneck.
+
+ **RDF2Vec takes a fundamentally different approach.** Instead of treating entities as text to be embedded by external APIs, it learns vector representations directly from your graph's topology. The algorithm performs random walks across your knowledge graph, treating the resulting paths as "sentences" that capture structural relationships. These walks train a Word2Vec model in-process, producing embeddings that encode how entities relate to each other.
+
+ ```javascript
+ const { GraphDB, Rdf2VecEngine } = require('rust-kgdb')
+
+ // Load your knowledge graph
+ const db = new GraphDB('http://enterprise/claims')
+ db.loadTtl(claimsOntology, null) // 130,923 triples/sec throughput
+
+ // Initialize the RDF2Vec engine
+ const rdf2vec = new Rdf2VecEngine()
+
+ // Train embeddings from graph structure
+ // Walks capture: Provider → submits → Claim → involves → Patient
+ const walks = extractRandomWalks(db)
+ rdf2vec.train(JSON.stringify(walks)) // 1,207 walks/sec → 384-dim vectors
+
+ // Retrieve embeddings with microsecond latency
+ const embedding = rdf2vec.getEmbedding('http://claims/provider/4521') // 68 µs
+
+ // Find structurally similar entities
+ const similar = rdf2vec.findSimilar(provider, candidateProviders, 10) // 303 µs
+ ```
+
+ ### Performance: Why Microseconds Matter
+
+ | Operation | rust-kgdb (RDF2Vec) | External API (OpenAI) | Advantage |
+ |-----------|---------------------|----------------------|-----------|
+ | **Single Embedding Lookup** | 68 µs | 200-500 ms | **3,000-7,000x faster** |
+ | **Similarity Search (k=10)** | 303 µs | 300-800 ms | **1,000-2,600x faster** |
+ | **Batch Training (1K walks)** | 829 ms | N/A | Graph-native training |
+ | **Rate Limits** | None (in-process) | Quota-restricted | Unlimited throughput |
+
+ **Practical Impact**: When investigating a flagged claim, an analyst might check 50 similar providers. At 300ms per API call, that's 15 seconds of waiting. With RDF2Vec at 303µs per lookup, the same operation completes in 15 milliseconds—a 1,000x improvement that transforms the user experience from "waiting for AI" to "instant insight."
+
+ ### Multi-Vector Composite Embeddings with RRF
+
+ Real-world similarity often requires multiple perspectives. A claim's structural relationships (RDF2Vec) tell a different story than its textual description (OpenAI) or domain-specific features (custom model). The `EmbeddingService` supports composite embeddings with Reciprocal Rank Fusion (RRF) to combine these views:
+
+ ```javascript
+ const service = new EmbeddingService()
+
+ // Store embeddings from multiple sources
+ service.storeComposite('CLM-2024-0847', JSON.stringify({
+   rdf2vec: rdf2vec.getEmbedding('CLM-2024-0847'), // Graph structure
+   openai: await openaiEmbed(claimNarrative),      // Semantic content
+   domain: fraudRiskEmbedding                      // Domain-specific signals
+ }))
+
+ // RRF fusion combines rankings from each source
+ // Formula: Score = Σ(1 / (k + rank_i)), k=60
+ const similar = service.findSimilarComposite('CLM-2024-0847', 10, 0.7, 'rrf')
+ ```
+
+ | Candidate Pool | Single-Source Recall | RRF Composite Recall | Improvement |
+ |----------------|---------------------|---------------------|-------------|
+ | 100 entities | 78% | **89%** | +14% |
+ | 1,000 entities | 72% | **85%** | +18% |
+ | 10,000 entities | 65% | **82%** | +26% |
+
+ ### Distributed Cluster Benchmarks (Kubernetes)
+
+ For deployments exceeding single-node capacity, rust-kgdb supports distributed execution across Kubernetes clusters. Verified benchmarks on the LUBM academic dataset:
+
+ | Query | Pattern | Results | Latency |
+ |-------|---------|---------|---------|
+ | Q1 | Type lookup (GraduateStudent) | 150 | **66 ms** |
+ | Q4 | Join (student → advisor) | 150 | **101 ms** |
+ | Q6 | 2-hop join (advisor → department) | 46 | **75 ms** |
+ | Q7 | Course enrollment scan | 570 | **141 ms** |
+
+ **Configuration**: 1 coordinator + 3 executors, HDRF partitioning, NodePort access at `localhost:30080`. Triples distribute automatically across executors; multi-hop joins execute seamlessly across partition boundaries.
+
+ ### End-to-End Pipeline Throughput
+
+ | Stage | Throughput | Notes |
+ |-------|------------|-------|
+ | Graph ingestion | **130,923 triples/sec** | Bulk load with indexing |
+ | RDF2Vec training | **1,207 walks/sec** | Configurable walk length/count |
+ | Embedding lookup | **68 µs** (14,700/sec) | In-memory, zero network |
+ | Similarity search | **303 µs** (3,300/sec) | HNSW index |
+ | Incremental update | **37 µs** | No full retrain required |
+
+ *For detailed configuration options, see [Walk Configuration](#walk-configuration-tuning-rdf2vec-performance) and [Auto-Embedding Triggers](#auto-embedding-triggers-automatic-on-graph-insertupdate) below.*
+
+ ---
+
  ## The Deeper Problem: AI Agents Forget

  Fixing SPARQL syntax is table stakes. Here's what keeps enterprise architects up at night:
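Both README versions call an `extractRandomWalks(db)` helper without defining it. As a rough illustration of the underlying idea only — random walks over triples, serialized as alternating entity/predicate token sequences for Word2Vec-style training — here is a self-contained sketch over a plain in-memory triple list. It does not use rust-kgdb's actual API; the package's own walk extraction may differ:

```javascript
// Sketch: turn a triple list into RDF2Vec-style random walks.
// Each walk alternates entity and predicate tokens, e.g.
// ['Company1', 'employs', 'Person1', 'knows', 'Person2', ...]
function extractRandomWalks(triples, { walksPerNode = 2, depth = 3 } = {}) {
  // Build adjacency: subject → [[predicate, object], ...]
  const adj = new Map()
  for (const [s, p, o] of triples) {
    if (!adj.has(s)) adj.set(s, [])
    adj.get(s).push([p, o])
  }
  const walks = []
  for (const start of adj.keys()) {
    for (let w = 0; w < walksPerNode; w++) {
      let node = start
      const walk = [node]
      for (let hop = 0; hop < depth; hop++) {
        const out = adj.get(node)
        if (!out || out.length === 0) break // dead end: stop this walk early
        const [pred, next] = out[Math.floor(Math.random() * out.length)]
        walk.push(pred, next)
        node = next
      }
      walks.push(walk)
    }
  }
  return walks
}

// The three-triple cycle from the README's pipeline example
const walks = extractRandomWalks([
  ['Company1', 'employs', 'Person1'],
  ['Person1', 'knows', 'Person2'],
  ['Person2', 'worksFor', 'Company1'],
])
// With depth 3, every walk here visits 4 entities joined by 3 predicates
```

Walk count and depth are the two knobs the README's throughput table alludes to ("Configurable walk length/count"): more and longer walks give Word2Vec more context at the cost of training time.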
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "rust-kgdb",
-   "version": "0.6.79",
+   "version": "0.6.80",
    "description": "High-performance RDF/SPARQL database with AI agent framework. GraphDB (449ns lookups, 35x faster than RDFox), GraphFrames analytics (PageRank, motifs), Datalog reasoning, HNSW vector embeddings. HyperMindAgent for schema-aware query generation with audit trails. W3C SPARQL 1.1 compliant. Native performance via Rust + NAPI-RS.",
    "main": "index.js",
    "types": "index.d.ts",