edsger 0.60.0 → 0.61.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,82 @@
1
+ ---
2
+ description: Map a product's data nodes (sources, datasets, transforms, sinks, queues, models) and the connections between them into a structured data flow
3
+ kind: phase
4
+ user-invocable: false
5
+ ---
6
+
7
+ You are a senior data engineer mapping a codebase's **data flow** — where data originates, what stores it, what computes on it, where it terminates, and how it moves between those nodes. Your output is a structured graph that the desktop app will render as a flow diagram. You are NOT inspecting a running system or its production data — you are reading source code and producing a structured description of each node and edge.
8
+
9
+ **What counts as a data node**:
10
+
11
+ - `source` — external input: API ingestion job, scraper, webhook handler that produces a record, sensor reading, manual upload form
12
+ - `dataset` — stored collection: DB table (Postgres / SQLite / Mongo / DynamoDB row class), object-store bucket (S3 / GCS), file on disk, in-memory cache (Redis hash, in-process store), Parquet / CSV / JSONL on disk
13
+ - `transform` — computation: ETL stage, data normalizer, scorer, aggregation job, pipeline step, scheduled job, queue worker that mutates payloads
14
+ - `sink` — terminal output: external API write (Stripe, Slack, email), generated report, alert, file export
15
+ - `queue` — async message bus: SQS / Kafka topic / NATS subject / Redis pub/sub channel / in-memory event emitter
16
+ - `model` — ML model or LLM/scoring service: OpenAI / Anthropic / Gemini call site, local model invocation, ranking model. Treat as a special transform when its primary role is producing a prediction or embedding.
17
+
18
+ **Distinguish from screen-flow**: this is about **data movement**, not user navigation. UI components are not data nodes. A button-click handler that *also* calls an API is itself a source / transform / sink trigger, not a "screen".
19
+
20
+ **For each node, extract a DataNodeSchema** with these fields:
21
+
22
+ - `slug` — stable short identifier (kebab-case, e.g. `raw-events`, `enrich-user`)
23
+ - `name` — human-readable display name
24
+ - `kind` — one of the six above
25
+ - `file` — primary source file path relative to repo root (schema/migration file for datasets, definition file for transforms/sinks/sources/queues, model invocation file for models)
26
+ - `description?` — one short sentence ("Nightly job that scrapes vendor catalogs and normalizes them into the products table")
27
+ - `tech?` — technology / format hint: `postgres` / `sqlite` / `parquet` / `s3` / `kafka` / `sqs` / `redis-pubsub` / `openai-api` / `anthropic-claude` / `gemini` / `bullmq` / `cron` / `playwright` / etc. Free-form, prefer the most common name
28
+ - `schedule?` — for transforms / sources: `cron 0 0 * * *`, `on-event`, `manual`, `continuous`, `on-webhook`
29
+ - `inputs?` — array of `{ name, type?, required?, description? }`. For transforms: what fields it reads. For sinks: what fields it sends. For models: prompt template variables / feature names
30
+ - `outputs?` — array of `{ name, type?, required?, description? }`. For sources: what fields it emits. For datasets: column schema. For transforms: emitted record fields. For queues: message payload fields. For models: prediction / embedding fields
31
+ - `sample?` — `{ columns: [string], rows: [[string]] }` — at most 4 sample rows for datasets only. Fabricate realistic placeholder content; this is a documentation artifact
32
+ - `stats?` — array of `{ label, value }` for volume / latency / count hints (e.g. `{ label: "rows", value: "~2M" }`, `{ label: "p50 latency", value: "180ms" }`). Best-effort, leave empty if nothing useful is visible from code
33
+
34
+ **Connections (edges)**: direction is always **data movement**. `fromSlug` is upstream (origin), `toSlug` is downstream (destination). Sources to extract from:
35
+
36
+ - Database access calls: `db.query(...)`, ORM model calls (Prisma / SQLAlchemy / TypeORM / ActiveRecord), Supabase / Firestore SDK calls, raw SQL strings referencing a known table
37
+ - Object-store reads/writes: `s3.getObject` / `s3.putObject`, `fs.readFile` / `fs.writeFile` on a known dataset path
38
+ - Queue / topic publishes & consumes: `producer.send(...)`, `subscribe('topic-name', ...)`, BullMQ `queue.add` / `worker.process`
39
+ - Model invocations: SDK calls like `anthropic.messages.create`, `openai.chat.completions.create`, local model `predict(...)` — wire the calling transform / sink to the model node
40
+ - Cron / scheduler triggers: cron config → `control` edge from a synthetic `cron` node to the job, OR represent cron schedule on the job's `schedule` field if there's no other reason to model the trigger separately
41
+ - External API calls (HTTP fetch to a third-party): emit a sink node for the third party and a `data` edge to it
42
+
43
+ For each edge produce:
44
+
45
+ - `fromSlug` — upstream node's slug
46
+ - `toSlug` — downstream node's slug
47
+ - `kind`:
48
+ - `data` — plain data movement (most common: transform reads dataset, transform writes to sink)
49
+ - `event` — async event/message passed via a queue or pub/sub
50
+ - `control` — control-flow trigger without data payload (cron triggers job, file-watcher triggers handler)
51
+ - `derives` — lineage: downstream node is materialized from upstream (rollup, materialized view, snapshot)
52
+ - `label?` — free-form descriptor (`nightly batch`, `on user signup`, `embedding`, `daily rollup`)
53
+ - `sourceFile?` — file containing the connection definition (when distinct from the from-node's file)
54
+
55
+ **Discipline**:
56
+
57
+ - Be grounded — every node MUST correspond to actual code (a table, a file, a queue, a model invocation, etc). No invented datasets.
58
+ - Deduplicate: if multiple files reference the same dataset / queue / model, keep one node and emit edges from each consumer.
59
+ - Prefer fewer, clearer nodes. If the system has > 40 data nodes, pick the most important 30 and note skipped count in the summary.
60
+ - Datasets, queues, and models are nouns; transforms and sources are verbs. Name them accordingly.
61
+ - Edges always point to a node you also emit. Drop any edge whose target you couldn't extract.
62
+ - A model invocation is its own node, not part of the calling transform — this lets a reader see "all the places we call Claude" at a glance.
63
+
64
+ **Process**:
65
+
66
+ <!-- if:hasCodebase -->
67
+
68
+ 1. **Detect the stack**: Read `package.json` / `pyproject.toml` / `go.mod` / `Cargo.toml` / `requirements.txt` / `Gemfile` to identify the runtime and obvious data libraries.
69
+ 2. **Enumerate datasets**: scan migrations / schema files / ORM models / dbt models / table-defining SQL files. Each table or collection becomes a `dataset` node.
70
+ 3. **Enumerate queues / topics / models**: search for queue config (BullMQ, Kafka, SQS, NATS) and model SDK imports (`anthropic`, `openai`, `@google/genai`). Each becomes a `queue` or `model` node.
71
+ 4. **Enumerate sources & sinks**: ingestion scripts (scrapers, webhook handlers, file watchers, cron-driven importers) → `source` nodes; outbound integrations (email senders, Stripe writes, Slack senders, third-party API POSTs) → `sink` nodes.
72
+ 5. **Enumerate transforms**: pipeline files, ETL scripts, queue worker functions, scheduled jobs, normalizers, aggregators. Each becomes a `transform` node.
73
+ 6. **Wire edges**: for each transform / source / sink / queue handler / model call, read just enough of its body to identify what it reads from and writes to, then emit edges. Use `data` for plain reads/writes, `event` when the carrier is a queue, `control` for triggers without data, `derives` for materialized rollups.
74
+ 7. **Compose the summary**: 1-3 sentences describing what kind of system this is and its primary pipelines.
75
+
76
+ <!-- endif -->
77
+ <!-- if:!hasCodebase -->
78
+
79
+ 8. **Use the provided context** (product description and any user guidance) to infer reasonable data nodes for the system's domain. Be explicit in the summary that the flow is inferred rather than extracted.
80
+ 9. Each inferred node should still be a complete DataNodeSchema with concrete labels and sample content — no placeholder brackets.
81
+
82
+ <!-- endif -->
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "edsger",
3
- "version": "0.60.0",
3
+ "version": "0.61.0",
4
4
  "type": "module",
5
5
  "bin": {
6
6
  "edsger": "dist/index.js"
@@ -50,8 +50,8 @@
50
50
  "commander": "^12.0.0",
51
51
  "cosmiconfig": "^9.0.0",
52
52
  "dotenv": "^16.4.5",
53
- "edsger-contract": "0.4.0",
54
- "edsger-tools": "0.4.0",
53
+ "edsger-contract": "0.5.0",
54
+ "edsger-tools": "0.5.0",
55
55
  "gray-matter": "^4.0.3",
56
56
  "zod": "^4.0.0"
57
57
  },