rust-kgdb 0.6.79 → 0.6.80
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +120 -93
- package/package.json +1 -1
package/README.md
CHANGED
@@ -4,108 +4,24 @@
 [](https://opensource.org/licenses/Apache-2.0)
 [](https://www.w3.org/TR/sparql11-query/)
 
-> **
+> **Enterprise Knowledge Graph with Native Graph Embeddings**: A production-grade RDF database featuring built-in RDF2Vec, multi-vector composite search, and distributed SPARQL execution—engineered for teams who need verifiable AI at scale.
 
 ---
 
-##
+## What's New in v0.6.79
 
-
+| Feature | Description | Performance |
+|---------|-------------|-------------|
+| **Rdf2VecEngine** | Native graph embeddings from random walks | 68 µs lookup (3,000x faster than APIs) |
+| **Composite Multi-Vector** | RRF fusion of RDF2Vec + OpenAI + domain | +26% recall improvement |
+| **Distributed SPARQL** | HDRF-partitioned Kubernetes clusters | 66-141ms across 3 executors |
+| **Auto-Embedding Triggers** | Vectors generated on graph insert/update | 37 µs incremental updates |
 
 ```javascript
 const { GraphDB, Rdf2VecEngine, EmbeddingService } = require('rust-kgdb')
-
-// Create graph and load your knowledge
-const db = new GraphDB('http://myapp/knowledge')
-db.loadTtl(myOntology, null) // 130,923 triples/sec
-
-// RDF2Vec: Train embeddings from graph structure
-const rdf2vec = new Rdf2VecEngine()
-const walks = extractRandomWalks(db) // Graph topology → training data
-rdf2vec.train(JSON.stringify(walks)) // 1,207 walks/sec → 384-dim vectors
-
-// Blazing fast similarity search
-const embedding = rdf2vec.getEmbedding('http://myapp/entity123') // 68 µs
-const similar = rdf2vec.findSimilar(entity, candidates, 5) // 303 µs
 ```
 
-
-
-| Metric | rust-kgdb | OpenAI API | Speedup |
-|--------|-----------|------------|---------|
-| **Embedding Lookup** | **68 µs** | 200-500 ms | **3,000-7,000x faster** |
-| **Similarity Search** | **303 µs** | 300-800 ms | **1,000-2,600x faster** |
-| **Training (1K walks)** | **829 ms** | N/A (no graph structure) | - |
-| **Batch Processing** | **In-process** | Rate-limited API | **No quotas** |
-
-**Why this matters**: OpenAI/Cohere embeddings require HTTP round-trips (200-500ms latency) and rate limits. RDF2Vec runs in your process at native speed. For real-time fraud detection or recommendation engines, this is the difference between catching fraud before payment clears vs. flagging it days later.
-
-### Multi-Vector Composite Embeddings (RRF Fusion)
-
-Combine multiple embedding sources for maximum recall:
-
-```javascript
-const service = new EmbeddingService()
-
-// Store embeddings from different providers
-service.storeComposite('CLM001', JSON.stringify({
-  rdf2vec: rdf2vec.getEmbedding('CLM001'),     // Graph structure (local)
-  openai: await openaiEmbed(claimDescription), // Semantic text (API)
-  domain: customFraudEmbedding                 // Domain-specific
-}))
-
-// RRF (Reciprocal Rank Fusion) combines all sources
-const similar = service.findSimilarComposite('CLM001', 10, 0.7, 'rrf')
-// Formula: Score = Σ(1/(k+rank_i)), k=60
-// Result: Better recall than any single embedding source
-```
-
-| Pool Size | Single Embedding | RRF Composite | Improvement |
-|-----------|------------------|---------------|-------------|
-| 100 entities | 78% recall | **89% recall** | +14% |
-| 1K entities | 72% recall | **85% recall** | +18% |
-| 10K entities | 65% recall | **82% recall** | +26% |
-
-### Distributed Cluster Performance (Real LUBM Benchmark)
-
-Tested on Kubernetes: 1 coordinator + 3 executors via NodePort:
-
-| Query | Description | Results | Time |
-|-------|-------------|---------|------|
-| Q1 | GraduateStudent type | 150 | **66ms** |
-| Q4 | Advisor relationships | 150 | **101ms** |
-| Q6 | 2-way join (advisor+dept) | 46 | **75ms** |
-| Q7 | Course enrollment | 570 | **141ms** |
-
-**3,272 LUBM triples** distributed across 3 executors via HDRF partitioning. Multi-hop joins execute seamlessly across partition boundaries.
-
-### Graph → Embedding Pipeline (End-to-End)
-
-```javascript
-// 1. Insert triples (auto-distributed across executors)
-db.loadTtl(newData, null) // Triggers auto-embedding if configured
-
-// 2. Extract walks from relationships (graph topology)
-const walks = [
-  ['Company1', 'employs', 'Person1'],
-  ['Person1', 'knows', 'Person2'],
-  ['Person2', 'worksFor', 'Company1']
-]
-
-// 3. Train on walks → 384-dimensional embeddings
-const result = JSON.parse(rdf2vec.train(JSON.stringify(walks)))
-// { vocabulary_size: 4, dimensions: 384, training_time_secs: 0.8 }
-
-// 4. Find similar entities in 303 µs
-const similar = rdf2vec.findSimilar('Person1', candidates, 5)
-```
-
-**Pipeline Throughput:**
-- Graph load: **130,923 triples/sec**
-- RDF2Vec training: **1,207 walks/sec**
-- Embedding lookup: **68 µs** (14,700/sec)
-- Similarity search: **303 µs** (3,300/sec)
-- Incremental update: **37 µs** (no full retrain)
+*See [Native Graph Embeddings](#native-graph-embeddings-rdf2vec-engine) for complete documentation and benchmarks.*
 
 ---
 
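Both README versions in this diff call `extractRandomWalks(db)` without defining it. As an illustration only (this helper and its options are hypothetical, not part of the rust-kgdb API), a minimal stand-in that performs fixed-length random walks over a plain triple array, producing the "sentence"-like paths RDF2Vec trains on:

```javascript
// Hypothetical stand-in for the undefined extractRandomWalks helper.
// It walks an in-memory [subject, predicate, object] triple list; the real
// helper would presumably read triples from the GraphDB instance instead.
function extractRandomWalks(triples, { walksPerNode = 2, depth = 2 } = {}) {
  // Index outgoing edges by subject for O(1) neighbor lookup
  const out = new Map()
  for (const [s, p, o] of triples) {
    if (!out.has(s)) out.set(s, [])
    out.get(s).push([p, o])
  }
  const walks = []
  for (const start of out.keys()) {
    for (let i = 0; i < walksPerNode; i++) {
      const walk = [start]
      let node = start
      for (let d = 0; d < depth; d++) {
        const edges = out.get(node)
        if (!edges || edges.length === 0) break // dead end: stop this walk
        const [p, o] = edges[Math.floor(Math.random() * edges.length)]
        walk.push(p, o) // each hop appends "predicate object"
        node = o
      }
      walks.push(walk) // reads like a sentence: s p o p o ...
    }
  }
  return walks
}
```

A walk such as `['Company1', 'employs', 'Person1', 'knows', 'Person2']` is the kind of input `rdf2vec.train(JSON.stringify(walks))` consumes.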
@@ -208,6 +124,26 @@ The math matters. When your fraud detection runs 35x faster, you catch fraud before payment clears.
 
 Most AI frameworks trust the LLM. We don't.
 
+### Core Capabilities
+
+| Layer | Feature | What It Does |
+|-------|---------|--------------|
+| **Database** | GraphDB | W3C SPARQL 1.1 compliant RDF store with 449ns lookups |
+| **Database** | Distributed SPARQL | HDRF partitioning across Kubernetes executors |
+| **Embeddings** | Rdf2VecEngine | Train 384-dim vectors from graph random walks |
+| **Embeddings** | EmbeddingService | Multi-provider composite vectors with RRF fusion |
+| **Embeddings** | HNSW Index | Approximate nearest neighbor search in 303µs |
+| **Analytics** | GraphFrames | PageRank, connected components, motif matching |
+| **Analytics** | Pregel API | Bulk synchronous parallel graph algorithms |
+| **Reasoning** | Datalog Engine | Recursive rule evaluation with fixpoint semantics |
+| **AI Agent** | HyperMindAgent | Schema-aware SPARQL generation from natural language |
+| **AI Agent** | Type System | Hindley-Milner type inference for query validation |
+| **AI Agent** | Proof DAG | SHA-256 audit trail for every AI decision |
+| **Security** | WASM Sandbox | Capability-based isolation with fuel metering |
+| **Security** | Schema Cache | Cross-agent ontology sharing with validation |
+
+### The Architecture Difference
+
 ```
 +===========================================================================+
 |                                                                           |
@@ -406,6 +342,97 @@ The difference? HyperMind treats tools as **typed morphisms** (category theory),
 
 ---
 
+## Native Graph Embeddings: RDF2Vec Engine
+
+Traditional embedding pipelines introduce significant latency: serialize your entity, make an HTTP request to OpenAI or Cohere, wait 200-500ms, parse the response. For applications requiring real-time similarity—fraud detection, recommendation engines, entity resolution—this latency model becomes a critical bottleneck.
+
+**RDF2Vec takes a fundamentally different approach.** Instead of treating entities as text to be embedded by external APIs, it learns vector representations directly from your graph's topology. The algorithm performs random walks across your knowledge graph, treating the resulting paths as "sentences" that capture structural relationships. These walks train a Word2Vec model in-process, producing embeddings that encode how entities relate to each other.
+
+```javascript
+const { GraphDB, Rdf2VecEngine } = require('rust-kgdb')
+
+// Load your knowledge graph
+const db = new GraphDB('http://enterprise/claims')
+db.loadTtl(claimsOntology, null) // 130,923 triples/sec throughput
+
+// Initialize the RDF2Vec engine
+const rdf2vec = new Rdf2VecEngine()
+
+// Train embeddings from graph structure
+// Walks capture: Provider → submits → Claim → involves → Patient
+const walks = extractRandomWalks(db)
+rdf2vec.train(JSON.stringify(walks)) // 1,207 walks/sec → 384-dim vectors
+
+// Retrieve embeddings with microsecond latency
+const embedding = rdf2vec.getEmbedding('http://claims/provider/4521') // 68 µs
+
+// Find structurally similar entities
+const similar = rdf2vec.findSimilar(provider, candidateProviders, 10) // 303 µs
+```
+
+### Performance: Why Microseconds Matter
+
+| Operation | rust-kgdb (RDF2Vec) | External API (OpenAI) | Advantage |
+|-----------|---------------------|----------------------|-----------|
+| **Single Embedding Lookup** | 68 µs | 200-500 ms | **3,000-7,000x faster** |
+| **Similarity Search (k=10)** | 303 µs | 300-800 ms | **1,000-2,600x faster** |
+| **Batch Training (1K walks)** | 829 ms | N/A | Graph-native training |
+| **Rate Limits** | None (in-process) | Quota-restricted | Unlimited throughput |
+
+**Practical Impact**: When investigating a flagged claim, an analyst might check 50 similar providers. At 300ms per API call, that's 15 seconds of waiting. With RDF2Vec at 303µs per lookup, the same operation completes in 15 milliseconds—a 1,000x improvement that transforms the user experience from "waiting for AI" to "instant insight."
+
+### Multi-Vector Composite Embeddings with RRF
+
+Real-world similarity often requires multiple perspectives. A claim's structural relationships (RDF2Vec) tell a different story than its textual description (OpenAI) or domain-specific features (custom model). The `EmbeddingService` supports composite embeddings with Reciprocal Rank Fusion (RRF) to combine these views:
+
+```javascript
+const service = new EmbeddingService()
+
+// Store embeddings from multiple sources
+service.storeComposite('CLM-2024-0847', JSON.stringify({
+  rdf2vec: rdf2vec.getEmbedding('CLM-2024-0847'), // Graph structure
+  openai: await openaiEmbed(claimNarrative),      // Semantic content
+  domain: fraudRiskEmbedding                      // Domain-specific signals
+}))
+
+// RRF fusion combines rankings from each source
+// Formula: Score = Σ(1 / (k + rank_i)), k=60
+const similar = service.findSimilarComposite('CLM-2024-0847', 10, 0.7, 'rrf')
+```
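The RRF formula quoted in the code comment above, `Score = Σ(1 / (k + rank_i))` with k=60, is simple enough to sketch directly. `rrfFuse` below is an illustrative helper, not the package's `findSimilarComposite`: it takes one ranked candidate list per embedding source and returns candidates ordered by fused score.

```javascript
// Illustrative RRF fusion: `rankings` is an array of ranked candidate-ID
// lists, one per embedding source (e.g. rdf2vec, openai, domain).
function rrfFuse(rankings, k = 60) {
  const scores = new Map()
  for (const ranked of rankings) {
    ranked.forEach((id, idx) => {
      const rank = idx + 1 // RRF ranks are 1-based
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank))
    })
  }
  // Highest fused score first
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id)
}

// A candidate ranked near the top by several sources outranks one that a
// single source ranked first:
const fused = rrfFuse([
  ['CLM-A', 'CLM-B', 'CLM-C'],
  ['CLM-B', 'CLM-A', 'CLM-D'],
  ['CLM-B', 'CLM-C', 'CLM-A'],
])
// fused[0] === 'CLM-B'
```

This rank-based fusion is why RRF needs no score normalization across heterogeneous embedding spaces: only each source's ordering matters.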
+
+| Candidate Pool | Single-Source Recall | RRF Composite Recall | Improvement |
+|----------------|---------------------|---------------------|-------------|
+| 100 entities | 78% | **89%** | +14% |
+| 1,000 entities | 72% | **85%** | +18% |
+| 10,000 entities | 65% | **82%** | +26% |
+
+### Distributed Cluster Benchmarks (Kubernetes)
+
+For deployments exceeding single-node capacity, rust-kgdb supports distributed execution across Kubernetes clusters. Verified benchmarks on the LUBM academic dataset:
+
+| Query | Pattern | Results | Latency |
+|-------|---------|---------|---------|
+| Q1 | Type lookup (GraduateStudent) | 150 | **66 ms** |
+| Q4 | Join (student → advisor) | 150 | **101 ms** |
+| Q6 | 2-hop join (advisor → department) | 46 | **75 ms** |
+| Q7 | Course enrollment scan | 570 | **141 ms** |
+
+**Configuration**: 1 coordinator + 3 executors, HDRF partitioning, NodePort access at `localhost:30080`. Triples distribute automatically across executors; multi-hop joins execute seamlessly across partition boundaries.
+
+### End-to-End Pipeline Throughput
+
+| Stage | Throughput | Notes |
+|-------|------------|-------|
+| Graph ingestion | **130,923 triples/sec** | Bulk load with indexing |
+| RDF2Vec training | **1,207 walks/sec** | Configurable walk length/count |
+| Embedding lookup | **68 µs** (14,700/sec) | In-memory, zero network |
+| Similarity search | **303 µs** (3,300/sec) | HNSW index |
+| Incremental update | **37 µs** | No full retrain required |
+
+*For detailed configuration options, see [Walk Configuration](#walk-configuration-tuning-rdf2vec-performance) and [Auto-Embedding Triggers](#auto-embedding-triggers-automatic-on-graph-insertupdate) below.*
+
+---
+
 ## The Deeper Problem: AI Agents Forget
 
 Fixing SPARQL syntax is table stakes. Here's what keeps enterprise architects up at night:
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "rust-kgdb",
-  "version": "0.6.79",
+  "version": "0.6.80",
   "description": "High-performance RDF/SPARQL database with AI agent framework. GraphDB (449ns lookups, 35x faster than RDFox), GraphFrames analytics (PageRank, motifs), Datalog reasoning, HNSW vector embeddings. HyperMindAgent for schema-aware query generation with audit trails. W3C SPARQL 1.1 compliant. Native performance via Rust + NAPI-RS.",
   "main": "index.js",
   "types": "index.d.ts",