@crashbytes/semantic-text-toolkit 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/DEPLOYMENT.md ADDED
@@ -0,0 +1,203 @@
1
+ # 🚀 Deployment Guide
2
+
3
+ ## Publication Strategy
4
+
5
+ When deploying npm packages to production, prioritize:
6
+ - **Build verification** - Comprehensive testing before publication
7
+ - **Semantic versioning** - Clear version progression that communicates change impact
8
+ - **Documentation completeness** - Users should understand capabilities without reading source
9
+ - **Dependency hygiene** - Minimize external dependencies, document all requirements
10
+
11
+ ---
12
+
13
+ ## 📦 Pre-Publication Checklist
14
+
15
+ ### Code Quality Verification
16
+
17
+ ```bash
18
+ cd ~/github/crashbytes-npmjs/semantic-text-toolkit
19
+ npm install
20
+ npm run build
21
+ npm test
22
+ ```
23
+
24
+ ### Package Metadata Validation
25
+
26
+ Verify in `package.json`:
27
+ - Version follows semantic versioning (currently `1.0.0`)
28
+ - Repository URL matches your GitHub repository
29
+ - Keywords enable discoverability
30
+ - License is appropriate (`MIT`)
31
+
32
+ ---
33
+
34
+ ## 🔐 NPM Authentication
35
+
36
+ ### Initial Setup
37
+
38
+ ```bash
39
+ # Authenticate with npm
40
+ npm login
41
+ ```
42
+
43
+ When prompted, provide:
44
+ - Username
45
+ - Password
46
+ - Email address
47
+ - Two-factor authentication code (if enabled)
48
+
49
+ ### Verify Authentication
50
+
51
+ ```bash
52
+ npm whoami
53
+ ```
54
+
55
+ ---
56
+
57
+ ## 📤 Publication Process
58
+
59
+ ### First-Time Publication
60
+
61
+ ```bash
62
+ cd ~/github/crashbytes-npmjs/semantic-text-toolkit
63
+ npm publish --access public
64
+ ```
65
+
66
+ **Note:** Scoped packages (`@crashbytes/...`) default to restricted access. The `--access public` flag ensures public availability.
67
+
68
+ ### Version Updates
69
+
70
+ Follow semantic versioning principles:
71
+
72
+ ```bash
73
+ # Patch release (bug fixes): 1.0.0 → 1.0.1
74
+ npm version patch
75
+
76
+ # Minor release (new features, backward compatible): 1.0.0 → 1.1.0
77
+ npm version minor
78
+
79
+ # Major release (breaking changes): 1.0.0 → 2.0.0
80
+ npm version major
81
+
82
+ # Publish updated version
83
+ npm publish
84
+ ```
85
+
86
+ ---
87
+
88
+ ## ✅ Post-Publication Verification
89
+
90
+ ### Verify Package Availability
91
+
92
+ ```bash
93
+ npm view @crashbytes/semantic-text-toolkit
94
+ ```
95
+
96
+ ### Test Installation
97
+
98
+ ```bash
99
+ # Create temporary directory
100
+ cd /tmp
101
+ mkdir test-package && cd test-package
102
+
103
+ # Install package
104
+ npm install @crashbytes/semantic-text-toolkit
105
+
106
+ # Verify functionality
107
+ node -e "const { SemanticEngine } = require('@crashbytes/semantic-text-toolkit'); console.log('✅ Success');"
108
+ ```
109
+
110
+ ---
111
+
112
+ ## 🔄 Semantic Versioning Framework
113
+
114
+ ### Version Increment Guidelines
115
+
116
+ **Major Version (Breaking Changes):**
117
+ - API signature modifications
118
+ - Removal of deprecated features
119
+ - Behavioral changes affecting existing implementations
120
+
121
+ **Minor Version (New Features):**
122
+ - Backward-compatible functionality additions
123
+ - Performance improvements
124
+ - New optional parameters
125
+
126
+ **Patch Version (Bug Fixes):**
127
+ - Bug corrections
128
+ - Documentation updates
129
+ - Internal refactoring without API changes
130
+
131
+ ---
132
+
133
+ ## 🛡️ Security Best Practices
134
+
135
+ When managing npm packages:
136
+ - **Enable 2FA** - Two-factor authentication prevents unauthorized access
137
+ - **Rotate tokens** - Periodically regenerate access tokens
138
+ - **Audit dependencies** - Regular `npm audit` checks for vulnerabilities
139
+ - **Monitor downloads** - Track usage patterns for anomaly detection
140
+
141
+ ---
142
+
143
+ ## 🎯 Continuous Deployment Strategy
144
+
145
+ ### GitHub Actions Workflow
146
+
147
+ Create `.github/workflows/publish.yml`:
148
+
149
+ ```yaml
150
+ name: Publish to npm
151
+
152
+ on:
153
+ release:
154
+ types: [created]
155
+
156
+ jobs:
157
+ publish:
158
+ runs-on: ubuntu-latest
159
+ steps:
160
+ - uses: actions/checkout@v3
161
+ - uses: actions/setup-node@v3
162
+ with:
163
+ node-version: '18'
164
+ registry-url: 'https://registry.npmjs.org'
165
+ - run: npm ci
166
+ - run: npm test
167
+ - run: npm run build
168
+ - run: npm publish
169
+ env:
170
+ NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
171
+ ```
172
+
173
+ ---
174
+
175
+ ## 🔧 Troubleshooting Common Issues
176
+
177
+ ### Publication Failures
178
+
179
+ **Error: 403 Forbidden**
180
+ - Verify npm authentication with `npm whoami`
181
+ - Confirm package name availability
182
+ - Check organization membership for scoped packages
183
+
184
+ **Error: Version Already Published**
185
+ - Update version number using `npm version`
186
+ - Never republish same version (violates npm policy)
187
+
188
+ **Error: Package Name Conflict**
189
+ - Choose unique name or use scoped package (`@org/name`)
190
+ - Verify availability with `npm view package-name`
191
+
192
+ ---
193
+
194
+ ## 📚 Mentorship Approach
195
+
196
+ Technical guidance on package deployment should:
197
+ - **Illuminate underlying principles** - Understand why semantic versioning matters
198
+ - **Provide context beyond immediate steps** - Connect deployment to broader software lifecycle
199
+ - **Empower informed decisions** - Evaluate tradeoffs between different versioning strategies
200
+
201
+ ---
202
+
203
+ **Deploy with precision. Maintain with diligence.**
package/README.md ADDED
@@ -0,0 +1,300 @@
1
+ # 🧠 Semantic Text Toolkit
2
+
3
+ Production-grade semantic text analysis with embeddings, similarity computation, and vector search operations.
4
+
5
+ **Built by Blackhole Software, LLC**
6
+
7
+ ---
8
+
9
+ ## 🎯 Architectural Philosophy
10
+
11
+ When building ML-powered production systems, prioritize:
12
+ - **Lazy initialization** - Models load on demand, minimizing startup overhead
13
+ - **Type safety** - Comprehensive TypeScript definitions prevent runtime failures
14
+ - **Resource efficiency** - Quantized models reduce memory footprint by 75%
15
+ - **Defensive programming** - Semantic error codes enable precise debugging
16
+
17
+ ---
18
+
19
+ ## 🚀 Quick Start
20
+
21
+ ### Installation
22
+
23
+ ```bash
24
+ npm install @crashbytes/semantic-text-toolkit
25
+ ```
26
+
27
+ ### Basic Usage
28
+
29
+ ```typescript
30
+ import { createSemanticEngine } from '@crashbytes/semantic-text-toolkit';
31
+
32
+ const engine = await createSemanticEngine();
33
+ const result = await engine.embed("Machine learning transforms data");
34
+ console.log(result.embedding); // 384-dimensional vector
35
+
36
+ const similarity = await engine.similarity(
37
+ "Artificial intelligence is fascinating",
38
+ "Machine learning is interesting"
39
+ );
40
+ console.log(similarity.score); // 0.78
41
+ ```
42
+
43
+ ---
44
+
45
+ ## 🏗️ Core Capabilities
46
+
47
+ ### Text Embeddings
48
+ Transform text into high-dimensional numerical vectors that capture semantic meaning, enabling:
49
+ - Semantic similarity computation beyond keyword matching
50
+ - Vector-based search operations at scale
51
+ - Content clustering and classification
52
+ - Intelligent recommendation systems
53
+
54
+ ### Similarity Metrics
55
+ Multiple metrics for domain-specific optimization:
56
+ - **Cosine similarity** - Preferred for normalized vectors (range: -1 to 1)
57
+ - **Euclidean distance** - Direct geometric distance in vector space
58
+ - **Dot product** - Efficient for pre-normalized embeddings
59
+
60
+ ### Vector Search
61
+ Production-ready semantic search with:
62
+ - Configurable ranking strategies
63
+ - Metadata filtering for complex queries
64
+ - O(n log k) complexity for top-k retrieval
65
+ - Index persistence through export/import
66
+
67
+ ---
68
+
69
+ ## 📚 API Reference
70
+
71
+ ### SemanticEngine
72
+
73
+ Core engine for embedding generation and similarity computation.
74
+
75
+ #### Constructor
76
+
77
+ ```typescript
78
+ new SemanticEngine(config?: ModelConfig)
79
+ ```
80
+
81
+ **Configuration Parameters:**
82
+ - `modelName` - Hugging Face model identifier (default: `'Xenova/all-MiniLM-L6-v2'`)
83
+ - `maxLength` - Maximum sequence length (default: `512`)
84
+ - `quantized` - Enable quantization (default: `true`)
85
+ - `onProgress` - Progress callback for model loading
86
+
87
+ #### Key Methods
88
+
89
+ ##### `async initialize(): Promise<void>`
90
+ Initializes the model. Idempotent and concurrent-safe through promise caching.
91
+
92
+ ##### `async embed(text: string): Promise<EmbeddingResult>`
93
+ Generates embedding for single text input. Returns vector with metadata.
94
+
95
+ ##### `async embedBatch(texts: string[], options?: BatchOptions): Promise<EmbeddingResult[]>`
96
+ Batch processing with automatic batching and progress tracking.
97
+
98
+ ##### `async similarity(textA: string, textB: string, method?: 'cosine' | 'euclidean' | 'dot'): Promise<SimilarityResult>`
99
+ Computes semantic similarity using specified metric.
100
+
101
+ ---
102
+
103
+ ### SemanticSearch
104
+
105
+ High-level search interface with indexing capabilities.
106
+
107
+ #### Constructor
108
+
109
+ ```typescript
110
+ new SemanticSearch<T>(engine: SemanticEngine, config?: SearchConfig<T>)
111
+ ```
112
+
113
+ **Configuration Parameters:**
114
+ - `topK` - Number of results to return (default: `10`)
115
+ - `threshold` - Minimum similarity score (default: `0`)
116
+ - `textExtractor` - Function to extract text from custom objects
117
+ - `metadataExtractor` - Function to extract metadata for filtering
118
+
119
+ #### Key Methods
120
+
121
+ ##### `async index(items: T[], replace?: boolean): Promise<void>`
122
+ Indexes items for semantic search with optional index replacement.
123
+
124
+ ##### `async search(query: string, config?: Partial<SearchConfig<T>>): Promise<SearchResult<T>[]>`
125
+ Performs semantic search with configurable parameters.
126
+
127
+ ##### `async searchWithFilter(query: string, filter: (metadata: Record<string, unknown>) => boolean): Promise<SearchResult<T>[]>`
128
+ Searches with metadata filtering for complex queries.
129
+
130
+ ---
131
+
132
+ ## 🎓 Advanced Usage Patterns
133
+
134
+ ### Custom Object Search
135
+
136
+ ```typescript
137
+ interface Document {
138
+ id: string;
139
+ title: string;
140
+ content: string;
141
+ category: string;
142
+ }
143
+
144
+ const search = new SemanticSearch<Document>(engine, {
145
+ textExtractor: (doc) => `${doc.title} ${doc.content}`,
146
+ metadataExtractor: (doc) => ({ category: doc.category }),
147
+ });
148
+
149
+ await search.index(documents);
150
+
151
+ const results = await search.searchWithFilter(
152
+ "machine learning",
153
+ (metadata) => metadata.category === 'AI'
154
+ );
155
+ ```
156
+
157
+ ### Clustering with Centroids
158
+
159
+ ```typescript
160
+ import { centroid, cosineSimilarity } from '@crashbytes/semantic-text-toolkit';
161
+
162
+ const embeddings = await Promise.all(
163
+ documents.map(doc => engine.embed(doc))
164
+ );
165
+
166
+ const clusterCenter = centroid(embeddings.map(r => r.embedding));
167
+
168
+ const distances = embeddings.map(result =>
169
+ cosineSimilarity(result.embedding, clusterCenter)
170
+ );
171
+ ```
172
+
173
+ ---
174
+
175
+ ## ⚡ Performance Optimization Framework
176
+
177
+ ### 1. Latency-Critical Applications
178
+
179
+ When optimizing for response time:
180
+ - Pre-initialize models at application startup
181
+ - Implement request batching for concurrent operations
182
+ - Enable GPU acceleration in production environments
183
+ - Use connection pooling for API deployments
184
+
185
+ ### 2. Memory-Constrained Environments
186
+
187
+ When managing resource limitations:
188
+ - Leverage quantized models (enabled by default)
189
+ - Clear search indexes when not actively in use
190
+ - Process data in smaller, manageable batches
191
+ - Consider model distillation for further reduction
192
+
193
+ ### 3. High-Throughput Scenarios
194
+
195
+ When scaling for volume:
196
+ - Implement worker pool pattern for parallel processing
197
+ - Use message queues (RabbitMQ, Redis) for load distribution
198
+ - Deploy on GPU-enabled infrastructure for compute-intensive workloads
199
+ - Utilize approximate nearest neighbor (ANN) algorithms for large-scale search
200
+
201
+ ### Performance Characteristics
202
+
203
+ **Single Embedding Generation:**
204
+ - CPU (Apple M1): ~30ms
205
+ - CPU (Intel i7): ~50ms
206
+ - GPU (CUDA): ~5ms
207
+
208
+ **Batch Processing (100 texts):**
209
+ - Sequential: ~3000ms
210
+ - Batched (size=32): ~800ms
211
+ - **Speedup**: 3.75x
212
+
213
+ **Memory Profile:**
214
+ - Model (quantized): ~23MB
215
+ - Base runtime: ~100MB
216
+ - Per 1000 embeddings: ~1.5MB
217
+
218
+ ---
219
+
220
+ ## 🔧 Configuration Examples
221
+
222
+ ### Custom Model
223
+
224
+ ```typescript
225
+ const engine = new SemanticEngine({
226
+ modelName: 'Xenova/multilingual-e5-large',
227
+ maxLength: 512,
228
+ quantized: false
229
+ });
230
+ ```
231
+
232
+ ### Production Configuration
233
+
234
+ ```typescript
235
+ const engine = new SemanticEngine({
236
+ modelName: 'Xenova/all-MiniLM-L6-v2',
237
+ quantized: true,
238
+ onProgress: (progress) => {
239
+ if (progress.status === 'downloading') {
240
+ logger.info(`Model download: ${progress.progress}%`);
241
+ }
242
+ }
243
+ });
244
+ ```
245
+
246
+ ---
247
+
248
+ ## 🧪 Code Quality Manifesto
249
+
250
+ When contributing to this project:
251
+ - **Self-documenting code** - Clear variable names, focused functions
252
+ - **Comprehensive test coverage** - Unit, integration, and E2E tests
253
+ - **Intentional design choices** - Document architectural decisions
254
+ - **Continuous refactoring** - Maintain code health proactively
255
+
256
+ ---
257
+
258
+ ## 📦 Building
259
+
260
+ ```bash
261
+ npm run build
262
+ ```
263
+
264
+ Generates:
265
+ - `dist/index.js` (CommonJS)
266
+ - `dist/index.mjs` (ES Modules)
267
+ - `dist/index.d.ts` (TypeScript definitions)
268
+
269
+ ---
270
+
271
+ ## 🤝 Contributing
272
+
273
+ Contributions welcome. When contributing:
274
+ - Maintain architectural consistency
275
+ - Add comprehensive tests
276
+ - Document public APIs
277
+ - Follow existing code style
278
+ - Update CHANGELOG.md
279
+
280
+ ---
281
+
282
+ ## 📄 License
283
+
284
+ MIT License - see LICENSE file for details
285
+
286
+ ---
287
+
288
+ ## 🏢 About Blackhole Software, LLC
289
+
290
+ Specializing in custom web and software solutions:
291
+ - React, Astro, Next.js
292
+ - Node.js, C#
293
+ - React Native, SwiftUI, Kotlin
294
+ - AI/ML integration
295
+
296
+ **Visit us at [blackholesoftware.com](https://blackholesoftware.com)**
297
+
298
+ ---
299
+
300
+ **Built with precision. Designed for production.**
@@ -0,0 +1,7 @@
1
+ import {
2
+ SemanticEngine
3
+ } from "./chunk-XJ4PTDH6.mjs";
4
+ import "./chunk-TPAL6DKL.mjs";
5
+ export {
6
+ SemanticEngine
7
+ };
@@ -0,0 +1,7 @@
1
+ import {
2
+ SemanticSearch
3
+ } from "./chunk-ENOBULOJ.mjs";
4
+ import "./chunk-TPAL6DKL.mjs";
5
+ export {
6
+ SemanticSearch
7
+ };
@@ -0,0 +1,93 @@
1
+ import {
2
+ SemanticError,
3
+ topKSimilar
4
+ } from "./chunk-TPAL6DKL.mjs";
5
+
6
+ // src/search/SemanticSearch.ts
7
+ var SemanticSearch = class {
8
+ constructor(engine, config = {}) {
9
+ this.indexedItems = [];
10
+ this.engine = engine;
11
+ this.config = {
12
+ topK: config.topK ?? 10,
13
+ threshold: config.threshold ?? 0,
14
+ textExtractor: config.textExtractor ?? ((item) => String(item)),
15
+ metadataExtractor: config.metadataExtractor ?? (() => ({}))
16
+ };
17
+ }
18
+ async index(items, replace = false) {
19
+ if (!Array.isArray(items) || items.length === 0) {
20
+ throw new SemanticError(
21
+ "INVALID_INPUT" /* INVALID_INPUT */,
22
+ "Items must be a non-empty array"
23
+ );
24
+ }
25
+ if (replace) {
26
+ this.indexedItems = [];
27
+ }
28
+ const texts = items.map(this.config.textExtractor);
29
+ const results = await this.engine.embedBatch(texts, { batchSize: 32 });
30
+ const newIndexItems = items.map((item, idx) => ({
31
+ item,
32
+ embedding: results[idx].embedding,
33
+ metadata: this.config.metadataExtractor(item)
34
+ }));
35
+ this.indexedItems.push(...newIndexItems);
36
+ }
37
+ async search(query, overrideConfig) {
38
+ if (this.indexedItems.length === 0) {
39
+ throw new SemanticError(
40
+ "INVALID_INPUT" /* INVALID_INPUT */,
41
+ "Index is empty. Call index() before searching."
42
+ );
43
+ }
44
+ const config = { ...this.config, ...overrideConfig };
45
+ const queryResult = await this.engine.embed(query);
46
+ const candidateEmbeddings = this.indexedItems.map((item) => item.embedding);
47
+ const topK = topKSimilar(queryResult.embedding, candidateEmbeddings, config.topK);
48
+ const results = [];
49
+ let rank = 1;
50
+ for (const [idx, score] of topK) {
51
+ if (score < config.threshold) continue;
52
+ results.push({
53
+ item: this.indexedItems[idx].item,
54
+ score,
55
+ rank: rank++
56
+ });
57
+ }
58
+ return results;
59
+ }
60
+ async searchWithFilter(query, filter, config) {
61
+ const originalIndex = this.indexedItems;
62
+ this.indexedItems = originalIndex.filter((item) => filter(item.metadata ?? {}));
63
+ try {
64
+ return await this.search(query, config);
65
+ } finally {
66
+ this.indexedItems = originalIndex;
67
+ }
68
+ }
69
+ async findSimilar(item, config) {
70
+ const text = this.config.textExtractor(item);
71
+ return this.search(text, config);
72
+ }
73
+ getStats() {
74
+ const itemCount = this.indexedItems.length;
75
+ const dimensions = this.indexedItems[0]?.embedding.length ?? 0;
76
+ const totalBytes = itemCount * dimensions * 8;
77
+ const memoryEstimate = totalBytes < 1024 * 1024 ? `${(totalBytes / 1024).toFixed(2)} KB` : `${(totalBytes / (1024 * 1024)).toFixed(2)} MB`;
78
+ return { itemCount, dimensions, memoryEstimate };
79
+ }
80
+ clear() {
81
+ this.indexedItems = [];
82
+ }
83
+ exportIndex() {
84
+ return [...this.indexedItems];
85
+ }
86
+ importIndex(index) {
87
+ this.indexedItems = [...index];
88
+ }
89
+ };
90
+
91
+ export {
92
+ SemanticSearch
93
+ };