@archetypeai/ds-cli 0.3.9 → 0.3.10

---
name: embedding-from-file
description: Run an Embedding Lens by streaming sensor data from a CSV file. Use when extracting embeddings from time-series CSV data for visualization, clustering, dimensionality reduction, or similarity analysis.
argument-hint: [csv-file-path]
---

# Embedding Lens — Stream from CSV File

Generate a script that streams time-series data from a CSV file to the Archetype AI Embedding Lens and collects embedding vectors. Supports both Python and JavaScript/Web.

> **Frontend architecture:** When building a web UI for this skill, decompose it into components (file input, status display, visualization) rather than a monolithic page. Extract API/streaming logic into `$lib/api/`. See `@rules/frontend-architecture` for conventions and `@skills/create-dashboard` / `@skills/build-pattern` for layout and component patterns.

---

## Python Implementation

### Requirements

- `archetypeai` Python package
- `pandas`, `numpy`
- Environment variables: `ATAI_API_KEY`, optionally `ATAI_API_ENDPOINT`

### Architecture

#### 1. API Client Setup

```python
import os

from archetypeai.api_client import ArchetypeAI

api_key = os.getenv("ATAI_API_KEY")
api_endpoint = os.getenv("ATAI_API_ENDPOINT", ArchetypeAI.get_default_endpoint())
client = ArchetypeAI(api_key, api_endpoint=api_endpoint)
```

#### 2. Lens YAML Configuration

The embedding lens uses `lens_timeseries_embedding_processor` — no n-shot files or KNN config needed.

```yaml
lens_name: Embedding Lens
lens_config:
  model_pipeline:
    - processor_name: lens_timeseries_embedding_processor
      processor_config: {}
  model_parameters:
    model_name: OmegaEncoder
    model_version: OmegaEncoder::omega_embeddings_01
    normalize_input: true
    buffer_size: {window_size}
    csv_configs:
      timestamp_column: timestamp
      data_columns: ['a1', 'a2', 'a3', 'a4']
      window_size: {window_size}
      step_size: {step_size}
  output_streams:
    - stream_type: server_sent_events_writer
```

**Key difference from Machine State Lens**: no `input_n_shot`, no `knn_configs`. The processor outputs raw embedding vectors instead of class predictions.
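
The `{window_size}` and `{step_size}` placeholders in the YAML above are filled in before the config is handed to the client. A minimal sketch using `str.format` (the template below is abbreviated and reuses field names from the config above; note that a literal-brace field like `processor_config: {}` would need escaping as `{{}}` with this approach):

```python
# Abbreviated template; fields mirror the YAML config shown above.
# Fields with literal braces (e.g. processor_config: {}) are omitted here
# because str.format would require them to be escaped as {{}}.
yaml_template = """
lens_name: Embedding Lens
lens_config:
  model_parameters:
    buffer_size: {window_size}
    csv_configs:
      window_size: {window_size}
      step_size: {step_size}
"""

def render_config(window_size: int, step_size: int) -> str:
    """Substitute the window/step placeholders into the YAML template."""
    return yaml_template.format(window_size=window_size, step_size=step_size)

config = render_config(100, 100)
print(config)
```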

#### 3. Session Callback — Stream Windows

```python
import time

import pandas as pd


def session_callback(session_id, session_endpoint, client, args):
    sse_reader = client.lens.sessions.create_sse_consumer(
        session_id, max_read_time_sec=args["max_run_time_sec"]
    )

    # Load CSV with pandas
    df = pd.read_csv(args["file_path"])
    columns = ["a1", "a2", "a3", "a4"]
    data = df[columns].values.T.tolist()  # Transpose: [channels][samples]

    # Stream data in windows
    total_samples = len(df)
    start = 0
    counter = 0
    while start < total_samples:
        end = start + args["window_size"]
        chunk = [series[start:end] for series in data]

        payload = {
            "type": "session.update",
            "event_data": {
                "type": "data.json",
                "event_data": {
                    "sensor_data": chunk,
                    "sensor_metadata": {
                        "sensor_timestamp": time.time(),
                        "sensor_id": f"streamed_sensor_{counter}",
                    },
                },
            },
        }
        client.lens.sessions.process_event(session_id, payload)
        start += args["step_size"]
        counter += 1

    # Collect embeddings. stop_flag can be flipped externally (e.g. by a
    # signal handler); otherwise the reader exits after max_read_time_sec.
    stop_flag = False
    embeddings = []
    for event in sse_reader.read(block=True):
        if stop_flag:
            break
        etype = event.get("type")
        if etype == "inference.result":
            ed = event.get("event_data", {})
            embedding = ed.get("response")
            meta = ed.get("query_metadata", {})

            # Flatten 4×768 to a 3072D vector
            if isinstance(embedding, list) and len(embedding) > 0:
                if isinstance(embedding[0], list):
                    flat = [val for row in embedding for val in row]
                else:
                    flat = embedding

                embeddings.append({
                    "window_index": len(embeddings),
                    "query_timestamp": meta.get("query_timestamp", "N/A"),
                    "embedding": flat,
                })
                print(f"[{len(embeddings)}] Embedding: {len(flat)}D")

    sse_reader.close()
    return embeddings
```

#### 4. Create and Run Lens

```python
client.lens.create_and_run_lens(
    yaml_config, session_callback,
    client=client, args=args,
)
```

### Embedding Response Structure

The `inference.result` response contains:

- `response`: nested list of shape `(4, 768)` — one 768D vector per input channel
- `query_metadata.query_timestamp`: timestamp of the window
- `query_metadata.sensor_id`: sensor identifier
- `query_metadata.read_index`: window position in the data

Flatten `response` to a single `3072D` vector by concatenating the four channels: `[a1_768D, a2_768D, a3_768D, a4_768D]`.
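
The flattening step can be sketched with a dummy response of the documented shape:

```python
# Hypothetical example: flatten a (4, 768) embedding response into a
# single 3072-dimensional vector by concatenating the channel rows.
response = [[float(i) for i in range(768)] for _ in range(4)]  # dummy 4x768

flat = [val for row in response for val in row]

assert len(flat) == 4 * 768 == 3072
# Channel order is preserved: a1's 768 values come first, then a2, etc.
assert flat[:768] == response[0]
```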

### Saving Embeddings to CSV

```python
import csv
from datetime import datetime

with open(output_file, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['window_index', 'query_timestamp', 'analysis_timestamp',
                     'read_index', 'window_size', 'step_size', 'embedding_vector'])

    for emb in embeddings:
        writer.writerow([
            emb["window_index"],
            emb["query_timestamp"],
            datetime.now().isoformat(),
            emb.get("read_index", "N/A"),
            args["window_size"],
            args["step_size"],
            str(emb["embedding"]),
        ])
```

### CLI Arguments

```
--api-key             API key (falls back to the ATAI_API_KEY env var)
--api-endpoint        API endpoint (default from SDK)
--file-path           Path to CSV file to analyze (required)
--window-size         Window size in samples (default: 100)
--step-size           Step size in samples (default: 100)
--max-run-time-sec    Max runtime in seconds (default: 500)
--output-file         Path to save embeddings CSV (optional)
```
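
A minimal `argparse` sketch wiring up these flags (names and defaults are taken from the list above; connecting the parsed values to the client is left out):

```python
import argparse
import os

def parse_args(argv=None):
    """Parse the CLI flags documented above."""
    parser = argparse.ArgumentParser(description="Embedding Lens from CSV")
    parser.add_argument("--api-key", default=os.getenv("ATAI_API_KEY"))
    parser.add_argument("--api-endpoint", default=None)
    parser.add_argument("--file-path", required=True)
    parser.add_argument("--window-size", type=int, default=100)
    parser.add_argument("--step-size", type=int, default=100)
    parser.add_argument("--max-run-time-sec", type=int, default=500)
    parser.add_argument("--output-file", default=None)
    return parser.parse_args(argv)

# Example invocation with an explicit argv (hypothetical file path):
args = parse_args(["--file-path", "data.csv", "--window-size", "50"])
```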

---

## Web / JavaScript Implementation

Uses direct `fetch` calls to the Archetype AI REST API.

### API Reference

| Operation | Method | Endpoint | Body |
|-----------|--------|----------|------|
| Register lens | POST | `/lens/register` | `{ lens_config: config }` |
| Create session | POST | `/lens/sessions/create` | `{ lens_id }` |
| Process event | POST | `/lens/sessions/events/process` | `{ session_id, event }` |
| Delete lens | POST | `/lens/delete` | `{ lens_id }` |
| Destroy session | POST | `/lens/sessions/destroy` | `{ session_id }` |
| SSE consumer | GET | `/lens/sessions/consumer/{sessionId}` | — |

### Helper: API fetch wrapper

```typescript
const API_ENDPOINT = 'https://api.u1.archetypeai.app/v0.5'

async function apiPost<T>(path: string, apiKey: string, body: unknown, timeoutMs = 5000): Promise<T> {
  const controller = new AbortController()
  const timeoutId = setTimeout(() => controller.abort(), timeoutMs)

  try {
    const response = await fetch(`${API_ENDPOINT}${path}`, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
      signal: controller.signal,
    })

    if (!response.ok) {
      const errorBody = await response.json().catch(() => ({}))
      throw new Error(`API POST ${path} failed: ${response.status} - ${JSON.stringify(errorBody)}`)
    }

    return response.json()
  } finally {
    clearTimeout(timeoutId)
  }
}
```

### Step 1: Build and register the embedding lens

```typescript
const windowSize = 100
const stepSize = 100

const lensConfig = {
  lens_name: 'embedding_lens',
  lens_config: {
    model_pipeline: [
      { processor_name: 'lens_timeseries_embedding_processor', processor_config: {} },
    ],
    model_parameters: {
      model_name: 'OmegaEncoder',
      model_version: 'OmegaEncoder::omega_embeddings_01',
      normalize_input: true,
      buffer_size: windowSize,
      csv_configs: {
        timestamp_column: 'timestamp',
        data_columns: ['a1', 'a2', 'a3', 'a4'],
        window_size: windowSize,
        step_size: stepSize,
      },
    },
    output_streams: [
      { stream_type: 'server_sent_events_writer' },
    ],
  },
}

// Register lens
const registeredLens = await apiPost<{ lens_id: string }>(
  '/lens/register', apiKey, { lens_config: lensConfig }
)
const lensId = registeredLens.lens_id

// Create session
const session = await apiPost<{ session_id: string }>(
  '/lens/sessions/create', apiKey, { lens_id: lensId }
)
const sessionId = session.session_id

// Wait for session ready
async function waitForSessionReady(sessionId: string, maxWaitMs = 30000): Promise<boolean> {
  const start = Date.now()
  while (Date.now() - start < maxWaitMs) {
    const status = await apiPost<{ session_status: string }>(
      '/lens/sessions/events/process', apiKey,
      { session_id: sessionId, event: { type: 'session.status' } },
      10000
    )
    if (status.session_status === 'LensSessionStatus.SESSION_STATUS_RUNNING' ||
        status.session_status === '3') return true
    if (status.session_status === 'LensSessionStatus.SESSION_STATUS_FAILED' ||
        status.session_status === '6') return false
    await new Promise(r => setTimeout(r, 500))
  }
  return false
}

await waitForSessionReady(sessionId)

// Optionally delete the lens once the session is up (lifecycle step 4)
await apiPost('/lens/delete', apiKey, { lens_id: lensId })
```

### Step 2: Stream CSV data in windows

```typescript
// Parse CSV (PapaParse or similar)
const rows = parsedCsv.data
const columns = ['a1', 'a2', 'a3', 'a4']

let start = 0
let counter = 0

while (start < rows.length) {
  const end = Math.min(start + windowSize, rows.length)
  const window = rows.slice(start, end)

  // Transpose to [channels][samples], matching the Python payload shape
  const sensorData = columns.map(col =>
    window.map(row => Number(row[col]))
  )

  await apiPost('/lens/sessions/events/process', apiKey, {
    session_id: sessionId,
    event: {
      type: 'session.update',
      event_data: {
        type: 'data.json',
        event_data: {
          sensor_data: sensorData,
          sensor_metadata: {
            sensor_timestamp: Date.now() / 1000,
            sensor_id: `web_sensor_${counter}`,
          },
        },
      },
    },
  }, 10000)

  start += stepSize
  counter++
}
```

### Step 3: Consume SSE embedding results

```typescript
import { fetchEventSource } from '@microsoft/fetch-event-source'

interface EmbeddingResult {
  windowIndex: number
  queryTimestamp: string
  embedding: number[] // 3072D flattened
}

const embeddings: EmbeddingResult[] = []

fetchEventSource(`${API_ENDPOINT}/lens/sessions/consumer/${sessionId}`, {
  headers: { Authorization: `Bearer ${apiKey}` },
  onmessage(event) {
    const parsed = JSON.parse(event.data)

    if (parsed.type === 'inference.result') {
      const response = parsed.event_data.response
      const meta = parsed.event_data.query_metadata

      // Flatten 4×768 → 3072D
      const flat = Array.isArray(response[0])
        ? response.flat()
        : response

      embeddings.push({
        windowIndex: embeddings.length,
        queryTimestamp: meta?.query_timestamp ?? 'N/A',
        embedding: flat,
      })
      console.log(`[${embeddings.length}] Embedding: ${flat.length}D`)
    }

    if (parsed.type === 'sse.stream.end') {
      console.log('Stream complete')
    }
  },
})
```

### Step 4: Cleanup

```typescript
await apiPost('/lens/sessions/destroy', apiKey, { session_id: sessionId })
```

### Web Lifecycle Summary

```
1. Register lens          -> POST /lens/register { lens_config: config }
2. Create session         -> POST /lens/sessions/create { lens_id }
3. Wait for ready         -> POST /lens/sessions/events/process (poll session.status)
4. (Optional) Delete lens -> POST /lens/delete { lens_id }
5. Stream windowed data   -> POST /lens/sessions/events/process { session_id, event } (loop)
6. Consume SSE results    -> GET /lens/sessions/consumer/{sessionId}
7. Destroy session        -> POST /lens/sessions/destroy { session_id }
```

---

## Expected CSV Format

```csv
timestamp,a1,a2,a3,a4
1700000000.0,100,200,300,374
```

- Column names are configurable via `csv_configs.data_columns`
- `a4` is typically the magnitude: `sqrt(a1² + a2² + a3²)`
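
The magnitude column can be derived from the component channels; a small sketch consistent with the sample row above:

```python
# Sketch: derive the a4 magnitude column from the a1..a3 components.
# The rounded value matches the sample row (sqrt(100^2 + 200^2 + 300^2)).
import math

def magnitude(a1: float, a2: float, a3: float) -> float:
    return math.sqrt(a1**2 + a2**2 + a3**2)

row = {"timestamp": 1700000000.0, "a1": 100, "a2": 200, "a3": 300}
row["a4"] = round(magnitude(row["a1"], row["a2"], row["a3"]))
print(row["a4"])  # 374
```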

## Key Differences from Machine State Lens

| | Embedding Lens | Machine State Lens |
|---|---|---|
| Processor | `lens_timeseries_embedding_processor` | `lens_timeseries_state_processor` |
| N-shot files | Not needed | Required (one per class) |
| KNN config | Not needed | Required |
| Output | Raw embedding vectors (4×768 = 3072D) | Class predictions + confidence scores |
| Use case | Visualization, clustering, similarity | Classification, anomaly detection |