npm - dirsql - Versions diffs - 0.2.4 → 0.2.9 - Mend

dirsql 0.2.4 → 0.2.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

package/docs/guide/cli.md ADDED

@@ -0,0 +1,124 @@
+---
+canonical: https://thekevinscott.github.io/dirsql/guide/cli
+---
+# Command-Line Interface
+> Online: <https://thekevinscott.github.io/dirsql/guide/cli>
+`dirsql` starts an HTTP server that exposes the same functionality as the SDK.
+## Installation
+::: code-group
+```bash [npm]
+npx dirsql
+```
+```bash [PyPI]
+uvx dirsql
+```
+```bash [Cargo]
+# Installs the binary only (non-default feature)
+cargo install dirsql --features cli
+dirsql
+```
+:::
+::: tip For Rust library consumers
+The `cli` feature is **opt-in**. Adding `dirsql` as a library dependency (`cargo add dirsql`) pulls no CLI dependencies — only the core library. See the [Rust library README](https://github.com/thekevinscott/dirsql/tree/main/packages/rust) for details.
+:::
+## Running the server
+Run `dirsql` from the directory containing your files:
+```bash
+$ dirsql
+Running at localhost:7117
+```
+### Flags
+| Flag | Default | Description |
+|---|---|---|
+| `--config <path>` | `./.dirsql.toml` | Path to the config file. The index is rooted at the directory containing this file. |
+| `--host <addr>` | `localhost` | Bind address |
+| `--port <n>` | `7117` | TCP port to bind |
+## HTTP API
+### `POST /query`
+Run a SQL query. Request body is JSON:
+```json
+{"sql": "SELECT title, author FROM posts WHERE draft = 0"}
+```
+Response is a JSON array of row objects:
+```json
+[
+  {"title": "Hello World", "author": "alice"},
+  {"title": "Second Post", "author": "bob"}
+]
+```
+On error, the server returns a non-2xx status with a JSON body:
+```json
+{"error": "syntax error near \"SLECT\""}
+```
+Malformed SQL returns `400`, not `500` — the client sent bad input. Missing / unreadable config returns `503`.
+```bash
+curl -s http://localhost:7117/query \
+  -H 'content-type: application/json' \
+  -d '{"sql":"SELECT COUNT(*) AS n FROM posts"}' \
+  | jq
+```
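The same request can be made from a script. The following is a minimal Python sketch using only the standard library; it assumes the server is running on the default `localhost:7117`, and the helper names `build_query_request` and `run_query` are hypothetical, not part of any dirsql SDK.

```python
import json
import urllib.error
import urllib.request


def build_query_request(sql: str, base_url: str = "http://localhost:7117") -> urllib.request.Request:
    """Build the POST /query request; the body is {"sql": "..."} as documented."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def run_query(sql: str, base_url: str = "http://localhost:7117") -> list:
    """Send the query and return the decoded rows (a list of dicts)."""
    try:
        with urllib.request.urlopen(build_query_request(sql, base_url)) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        # Non-2xx responses carry a JSON body of the form {"error": "..."}.
        message = json.loads(err.read()).get("error", err.reason)
        raise RuntimeError(f"dirsql returned {err.code}: {message}") from err
```

With a server running, `run_query("SELECT COUNT(*) AS n FROM posts")` would return the row list, and a malformed statement would surface as a `RuntimeError` carrying the server's `error` message.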
+### `GET /events`
+Opens a [Server-Sent Events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events) stream of change events. Each `data:` payload is the same JSON schema the SDK emits from [`db.watch()`](./watching.md#event-types):
+```
+event: row
+data: {"action":"insert","table":"posts","file_path":"posts/hello.json","row":{"title":"Hello World","author":"alice"},"old_row":null}
+event: row
+data: {"action":"update","table":"posts","file_path":"posts/hello.json","row":{"title":"Hello, world","author":"alice"},"old_row":{"title":"Hello World","author":"alice"}}
+event: row
+data: {"action":"delete","table":"posts","file_path":"posts/second.json","row":{"title":"Second Post","author":"bob"},"old_row":null}
+```
+Errors during extraction appear as `{"action":"error",...}` events on the same stream. They do **not** terminate the stream — a malformed file is a per-event problem, not a server-wide one.
+```bash
+curl -N http://localhost:7117/events
+```
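For consumers that are not shell pipelines, the frames can be decoded by hand. This is a simplified Python sketch of an SSE reader: it only handles `data:` fields and blank-line frame boundaries, and the function name `parse_sse` is hypothetical. It assumes each `data:` payload is a JSON object, as described above.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON payload per SSE frame.

    Simplified: handles `data:` fields and blank-line frame boundaries;
    ignores `event:`, `id:`, `retry:` and comment lines.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # A single leading space after the colon is not part of the value.
            value = line[len("data:"):]
            data_parts.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_parts:
            # A blank line terminates the frame.
            yield json.loads("\n".join(data_parts))
            data_parts = []
```

Any iterable of text lines works as input, for example the decoded lines of a `urllib.request.urlopen("http://localhost:7117/events")` response.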
+## Piping event streams
+The SSE stream is easy to tee into shell tools with `curl -N` plus `jq`:
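+```bash
+# Log every delete to a file
+curl -N http://localhost:7117/events \
+  | jq -cR --unbuffered 'ltrimstr("data:") | fromjson? | select(.action=="delete")' \
+  >> deletes.log
+# Alert on errors
+curl -N http://localhost:7117/events \
+  | jq -cR --unbuffered 'ltrimstr("data:") | fromjson? | select(.action=="error")' \
+  | while read -r line; do notify-send "dirsql error" "$line"; done
+```
+(`-R` makes `jq` read each line as a raw string, `ltrimstr("data:")` strips the SSE `data:` framing, and `fromjson?` silently skips lines that are not JSON, such as `event: row`. `--unbuffered` flushes each match as it arrives. Drop the framing steps if your SSE client is already parsing frames.)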

package/docs/guide/config.md ADDED

@@ -0,0 +1,205 @@
+---
+canonical: https://thekevinscott.github.io/dirsql/guide/config
+---
+# Configuration File
+> Online: <https://thekevinscott.github.io/dirsql/guide/config>
+`dirsql` can be configured with a `.dirsql.toml` file, allowing you to define tables declaratively without writing code.
+## Basic Example
+```toml
+[dirsql]
+ignore = ["node_modules/**", ".git/**"]
+[[table]]
+ddl = "CREATE TABLE posts (title TEXT, author TEXT)"
+glob = "posts/*.json"
+```
+The `format` is inferred from the glob extension (`.json` -> JSON, `.jsonl` -> JSONL, `.csv` -> CSV, etc.). Each JSON key maps to a column with the same name.
+## Loading a Config File
+Pass the config file path to the `DirSQL` constructor:
+::: code-group
+```python [Python]
+from dirsql import DirSQL
+db = DirSQL(config="./my-project/.dirsql.toml")
+await db.ready()
+```
+```rust [Rust]
+use dirsql::DirSQL;
+let db = DirSQL::builder()
+    .config("./my-project/.dirsql.toml")
+    .build()?;
+```
+```typescript [TypeScript]
+import { DirSQL } from "dirsql";
+// String argument is interpreted as a config file path.
+const db = new DirSQL("./my-project/.dirsql.toml");
+await db.ready;
+```
+:::
+By default, the root directory scanned is the config file's parent directory. Override it by passing `root` explicitly (the explicit value wins and a warning is emitted) or by declaring `[dirsql].root` in the config file itself.
+## Root Directory
+By default, the config file's parent directory is the scan root. To index a different location, declare `[dirsql].root` (relative paths are resolved relative to the config file's parent):
+```toml
+[dirsql]
+root = "../data"
+ignore = ["node_modules/**"]
+```
+## Supported Formats
+| Extension | Format | Rows |
+|---|---|---|
+| `.json` | JSON | Object = 1 row, Array = many rows |
+| `.jsonl`, `.ndjson` | JSONL | One row per line |
+| `.csv` | CSV | One row per data line (header = columns) |
+| `.tsv` | TSV | One row per data line (tab-separated) |
+| `.toml` | TOML | One row per file |
+| `.yaml`, `.yml` | YAML | Mapping = 1 row, Sequence = many rows |
+| `.md` | Frontmatter | YAML frontmatter + body column |
+## Path Captures
+Use `{name}` in glob patterns to extract path segments as columns:
+```toml
+[[table]]
+ddl = "CREATE TABLE comments (thread_id TEXT, body TEXT, author TEXT)"
+glob = "_comments/{thread_id}/index.jsonl"
+```
+The directory name (e.g., `abc123`) becomes the `thread_id` column value for every row in that file.
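To make the capture semantics concrete, here is an illustrative Python sketch that turns a capture glob into a regular expression. It is not dirsql's implementation: the helper names are hypothetical, and it assumes `{name}` matches exactly one path segment and only handles `{name}` and single `*` wildcards.

```python
import re
from typing import Optional


def capture_pattern(glob: str) -> "re.Pattern[str]":
    """Translate a glob with {name} captures into a regex (illustrative only).

    Supports `{name}` (one path segment) and `*` (within a segment);
    other glob syntax such as `**` is not handled.
    """
    out = []
    for part in re.split(r"(\{\w+\}|\*)", glob):
        if part.startswith("{") and part.endswith("}"):
            # Named group that cannot cross a directory separator.
            out.append(f"(?P<{part[1:-1]}>[^/]+)")
        elif part == "*":
            out.append("[^/]*")
        else:
            out.append(re.escape(part))
    return re.compile("".join(out))


def extract_captures(glob: str, path: str) -> Optional[dict]:
    """Return the captured segments for a matching path, or None if it does not match."""
    match = capture_pattern(glob).fullmatch(path)
    return match.groupdict() if match else None


captures = extract_captures("_comments/{thread_id}/index.jsonl", "_comments/abc123/index.jsonl")
```

Here `captures` holds the `thread_id` taken from the directory name, which is the value that would be attached to every row extracted from that file.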
+## Nested Data
+Use `each` to navigate into nested JSON structures:
+```toml
+[[table]]
+ddl = "CREATE TABLE items (name TEXT, price REAL)"
+glob = "catalog/*.json"
+each = "data.items"
+```
+This extracts rows from `{"data": {"items": [...]}}`.
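The navigation itself amounts to following a dotted path. A minimal Python sketch, assuming each path component is an object key (the function name `rows_at` is hypothetical, and treating a non-list value as a single row is an assumption rather than documented behavior):

```python
import json
from typing import Any


def rows_at(document: Any, each: str) -> list:
    """Follow a dotted path such as "data.items" and return the rows found there."""
    node = document
    for key in each.split("."):
        node = node[key]
    # Assumption: a list yields one row per element; anything else is one row.
    return node if isinstance(node, list) else [node]


catalog = json.loads(
    '{"data": {"items": [{"name": "pen", "price": 1.5}, {"name": "ink", "price": 4.0}]}}'
)
rows = rows_at(catalog, "data.items")
```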
+## Column Mapping
+Use `columns` to map SQL column names to nested fields or path captures:
+```toml
+[[table]]
+ddl = "CREATE TABLE posts (display_name TEXT, body TEXT)"
+glob = "posts/*.json"
+[table.columns]
+display_name = "metadata.author.name"
+body        = "body"
+```
+::: warning `[table.columns]` is a complete projection, not a partial rename
+When a `[table.columns]` section is present, `dirsql` switches to fully
+declarative projection: **only the columns listed in the mapping are
+populated**. Any column in the DDL that is not mentioned in the mapping
+is set to `NULL` for every row — the original key from the file is not
+auto-copied.
+This is intentional: `[table.columns]` means "here is exactly where
+every column comes from", not "rename these specific keys".
+**Trap to avoid.** A config like this:
+```toml
+[[table]]
+ddl = "CREATE TABLE comments (id TEXT, body TEXT, display_name TEXT)"
+glob = "*.json"
+[table.columns]
+display_name = "author"   # intended: "just rename author -> display_name"
+```
+against a file `one.json`:
+```json
+{"id": "a1", "body": "hello", "author": "Alice"}
+```
+produces:
+```json
+[{"id": null, "body": null, "display_name": "Alice"}]
+```
+`id` and `body` are `NULL` because they are not listed in
+`[table.columns]`. To keep them populated, add them to the mapping
+explicitly:
+```toml
+[table.columns]
+id           = "id"
+body         = "body"
+display_name = "author"
+```
+:::
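The projection rule can be modeled in a few lines of Python. This sketch only mirrors the behavior described in the warning; the `lookup` and `project` helpers are hypothetical, and path captures as mapping sources are left out.

```python
from typing import Any, Optional


def lookup(record: dict, dotted: str) -> Any:
    """Resolve a dotted path like "metadata.author.name"; None if any step is missing."""
    node: Any = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def project(record: dict, ddl_columns: list, mapping: Optional[dict] = None) -> dict:
    """Build one row. With a mapping, only mapped columns are populated."""
    if mapping is None:
        # No [table.columns]: each key maps to the column of the same name.
        return {col: record.get(col) for col in ddl_columns}
    return {
        col: lookup(record, mapping[col]) if col in mapping else None
        for col in ddl_columns
    }


record = {"id": "a1", "body": "hello", "author": "Alice"}
columns = ["id", "body", "display_name"]

# Partial mapping: unmapped columns come back as None (NULL).
partial = project(record, columns, {"display_name": "author"})
# Complete mapping: every column names its source explicitly.
full = project(record, columns, {"id": "id", "body": "body", "display_name": "author"})
```

`partial` reproduces the trap from the warning, with `id` and `body` empty, while `full` populates all three columns.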
+## Ignore Patterns
+Paths matching an `ignore` pattern are skipped entirely; they are never scanned:
+```toml
+[dirsql]
+ignore = ["node_modules/**", ".git/**", "*.pyc", "__pycache__/**"]
+```
+## Strict Mode
+By default, extra keys in file content are ignored and missing keys become NULL. Enable strict mode to error on mismatches:
+```toml
+[[table]]
+ddl = "CREATE TABLE posts (title TEXT, author TEXT)"
+glob = "posts/*.json"
+strict = true
+```
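As a rough model of the difference, the Python sketch below contrasts the two modes. It assumes that both extra keys and missing keys count as a mismatch in strict mode, which the text above implies but does not spell out; `build_row` is a hypothetical helper, not a dirsql API.

```python
def build_row(record: dict, ddl_columns: list, strict: bool = False) -> dict:
    """Lenient: ignore extra keys, fill missing ones with None. Strict: raise instead."""
    extra = set(record) - set(ddl_columns)
    missing = set(ddl_columns) - set(record)
    if strict and (extra or missing):
        raise ValueError(
            f"strict mode: extra keys {sorted(extra)}, missing keys {sorted(missing)}"
        )
    return {col: record.get(col) for col in ddl_columns}


# One extra key ("tags") and one missing key ("author").
lenient_row = build_row({"title": "Hi", "tags": ["x"]}, ["title", "author"])
```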
+## Full Example
+```toml
+[dirsql]
+ignore = ["node_modules/**", ".git/**", "dist/**"]
+[[table]]
+ddl = "CREATE TABLE comments (thread_id TEXT, body TEXT, author TEXT, resolved INTEGER)"
+glob = "_comments/{thread_id}/index.jsonl"
+[[table]]
+ddl = "CREATE TABLE documents (title TEXT, draft INTEGER, body TEXT)"
+glob = "**/index.md"
+[[table]]
+ddl = "CREATE TABLE metrics (date TEXT, requests INTEGER, errors INTEGER)"
+glob = "logs/*.csv"
+[[table]]
+ddl = "CREATE TABLE config (key TEXT, value TEXT)"
+glob = "config/*.toml"
+strict = true
+```

package/docs/guide/crdt.md ADDED

@@ -0,0 +1,160 @@
+---
+canonical: https://thekevinscott.github.io/dirsql/guide/crdt
+---
+# Collaboration with CRDTs
+> Online: <https://thekevinscott.github.io/dirsql/guide/crdt>
+`dirsql` treats the filesystem as the source of truth. That works well when a single process (or a single human) is writing, but breaks down for multi-writer collaboration: two peers editing the same file concurrently produce a merge conflict, not a merged result.
+[Conflict-free Replicated Data Types](https://crdt.tech/) (CRDTs) solve that merge problem at the data-structure level, not the filesystem level. Two replicas that apply the same set of edits -- in any order, with any network partitions in between -- converge on the same final state, without a central arbiter.
+This guide is **opinionated**. It picks one library, explains the integration pattern with `dirsql`, and names the alternatives so you can steer if your situation is different.
+## Recommendation: Automerge
+Use [Automerge](https://automerge.org/) (specifically the 2.x series with [automerge-repo](https://automerge.org/automerge-repo/) for sync).
+Why Automerge over the alternatives:
+- **JSON-shaped document model.** Automerge docs look like nested maps, lists, and text. That matches `dirsql`'s one-object-per-file workflow -- each Automerge document is the thing your `extract` function projects into rows.
+- **Cross-language SDKs that mirror `dirsql`'s parity story.** First-class Rust ([`automerge`](https://crates.io/crates/automerge)), TypeScript ([`@automerge/automerge`](https://www.npmjs.com/package/@automerge/automerge)), and Python ([`automerge`](https://pypi.org/project/automerge/)) implementations exist today, all driven by the same Rust core. If you already have all three `dirsql` SDKs in play, Automerge won't force a language monoculture on you.
+- **Filesystem-friendly sync primitives.** `automerge-repo` ships a [`NodeFSStorageAdapter`](https://automerge.org/docs/repositories/storage/) that shards document history into regular files under a directory. That directory is exactly the kind of tree `dirsql` is designed to watch.
+- **Binary format with a deterministic JSON view.** You never hand-edit a CRDT file, but every replica projects the same canonical JSON from the binary state. That canonical JSON is what you feed to `dirsql`'s extract function, so two peers that have synced will produce identical rows.
+## The integration shape
+There are two files per logical "document":
+```
+workspace/
+  posts/
+    hello/
+      doc.automerge   <-- binary CRDT state (the source of truth for writers)
+      view.json       <-- materialized JSON snapshot (written on each merge)
+```
+- Writers (editors, sync peers, etc.) mutate `doc.automerge` through Automerge APIs.
+- After every mutation, the writer serializes the current document to `view.json`. This is the file `dirsql` indexes.
+- `dirsql` watches `view.json`, not `doc.automerge`. The CRDT file is an implementation detail of how the JSON got there.
+This keeps `dirsql`'s extract function oblivious to CRDT semantics: it reads a plain JSON file, exactly as it would without Automerge.
+::: tip Why not `extract` directly from `.automerge`?
+You could -- the Rust and Python Automerge SDKs let you load a binary doc and walk its fields. But it couples your table schema to the CRDT library version, makes `extract` non-pure (it allocates CRDT state on every file change), and buys nothing: the writer is the only place that can produce a valid Automerge blob, so it might as well produce the JSON view at the same time.
+:::
+### Example: posts as Automerge documents
+::: code-group
+```python [Python]
+from dirsql import DirSQL, Table
+import json
+db = DirSQL(
+    "./workspace",
+    tables=[
+        Table(
+            ddl="CREATE TABLE posts (id TEXT, title TEXT, body TEXT, updated INTEGER)",
+            # Match the JSON view, not the raw CRDT binary.
+            glob="posts/*/view.json",
+            extract=lambda path, content: [json.loads(content)],
+        ),
+    ],
+)
+```
+```rust [Rust]
+use dirsql::{DirSQL, Table};
+// See `row_from_json` in getting-started.md.
+let db = DirSQL::new(
+    "./workspace",
+    vec![
+        Table::new(
+            "CREATE TABLE posts (id TEXT, title TEXT, body TEXT, updated INTEGER)",
+            "posts/*/view.json",
+            |_path, content| vec![row_from_json(content)],
+        ),
+    ],
+)?;
+```
+```typescript [TypeScript]
+import { DirSQL, type TableDef } from 'dirsql';
+const tables: TableDef[] = [
+  {
+    ddl: 'CREATE TABLE posts (id TEXT, title TEXT, body TEXT, updated INTEGER)',
+    glob: 'posts/*/view.json',
+    extract: (_path, content) => [JSON.parse(content)],
+  },
+];
+const db = new DirSQL('./workspace', tables);
+```
+:::
+The Automerge writer (sketch, TypeScript):
+```typescript
+import * as Automerge from '@automerge/automerge';
+import { existsSync, readFileSync, writeFileSync } from 'node:fs';
+// Document shape; mirrors the columns of the posts table.
+type Post = { id: string; title: string; body: string; updated: number };
+const docPath = 'workspace/posts/hello/doc.automerge';
+// Load the CRDT doc, or create an empty one if it doesn't exist yet.
+let doc = existsSync(docPath)
+  ? Automerge.load<Post>(readFileSync(docPath))
+  : Automerge.init<Post>();
+// Apply an edit.
+doc = Automerge.change(doc, (d) => {
+  d.title = 'Hello, world';
+  d.updated = Date.now();
+});
+// Persist both the CRDT state and the materialized view.
+writeFileSync(docPath, Automerge.save(doc));
+writeFileSync('workspace/posts/hello/view.json', JSON.stringify(doc));
+```
+`dirsql`'s watcher picks up the change to `view.json`, re-runs `extract`, and emits an `update` row event. Queries over `posts` reflect the merged state without `dirsql` knowing Automerge exists.
+## Multi-writer, in practice
+1. Each peer runs an `automerge-repo` instance with the filesystem storage adapter pointed at its local `workspace/`.
+2. Peers sync via any transport `automerge-repo` supports ([WebSocket](https://automerge.org/docs/repositories/networking/), [BroadcastChannel](https://automerge.org/docs/repositories/networking/), or a custom adapter).
+3. On every sync, the repo applies incoming ops to the local CRDT, writes the updated `doc.automerge`, and the writer code re-serializes `view.json`.
+4. Every peer's `dirsql` sees the same eventual `view.json` and produces the same rows.
+The key invariant: **`view.json` is a deterministic projection of `doc.automerge`**. Two peers that have converged on the CRDT state must produce byte-identical (or at least semantically identical) JSON views. Otherwise you get spurious `update` events that flap with sync order. `JSON.stringify` does not sort keys by itself, so pass it a key-sorting replacer or use a stable-stringify helper (in Python, `json.dumps(..., sort_keys=True)`) to guarantee this.
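A minimal Python sketch of such a deterministic serializer follows. The function name `canonical_view` and the sample documents are illustrative; fixing the separators as well as the key order is an extra precaution, not something the guide requires.

```python
import json


def canonical_view(doc: dict) -> str:
    """Serialize with sorted keys and fixed separators so equal docs give equal bytes."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Same content, different key insertion order (as two peers might produce).
a = {"title": "Hello, world", "updated": 1700000000000, "id": "hello"}
b = {"id": "hello", "updated": 1700000000000, "title": "Hello, world"}
same_bytes = canonical_view(a) == canonical_view(b)
```

Because the output depends only on the content, writing `canonical_view(doc)` to `view.json` on every peer avoids differences caused purely by key order.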
+## Tradeoffs vs plain files
+When **plain files** are the right answer:
+- Single writer. A solo user editing `posts/*.json` will never hit a merge conflict. Adding a CRDT is overhead.
+- Human-readable history matters. `git diff` on a JSON file tells a story; `git diff` on a CRDT binary does not.
+- Schema churn is frequent. Renaming a field in plain JSON is a `sed`; in a CRDT it's a migration.
+When **CRDTs** earn their complexity:
+- Multi-writer without a central server (local-first, peer-to-peer).
+- Offline edits that need to merge on reconnect.
+- Fine-grained collaborative editing (cursor-level merging of a shared text field).
+Hybrid is common: keep configuration and reference data as plain files, and use CRDTs only for the documents that genuinely have multiple writers.
+## Alternatives we considered
+- [**Yjs**](https://docs.yjs.dev/) -- the dominant JS CRDT, excellent for rich-text collaboration (it backs many of the production collab editors you've used). Skipped as the primary recommendation because its Rust port ([`yrs`](https://crates.io/crates/yrs)) and Python bindings lag the JS implementation. If your workload is browser-first and text-heavy, prefer Yjs.
+- [**Loro**](https://loro.dev/) -- Rust-first CRDT with a clean API and good cross-language story. Worth watching; we'd consider it once its Python bindings are GA. Try it if you're Rust-centric and don't need Automerge's ecosystem.
+- **Operational Transform / hand-rolled merge logic** -- don't. OT is correct but hard to implement right, and you lose the offline-peer story that CRDTs give you for free.
+- **Git as the merge engine** -- tempting because `dirsql` already lives on the filesystem, but three-way merges of structured JSON produce garbage conflict markers that no extract function can parse. Use a CRDT.
+## See also
+- [Ink & Switch's local-first essay](https://www.inkandswitch.com/local-first/) -- the design space CRDTs sit in.
+- [Automerge documentation](https://automerge.org/docs/) -- API reference and sync-adapter guides.
+- [`crdt.tech`](https://crdt.tech/) -- library survey across languages.

package/docs/guide/querying.md ADDED

@@ -0,0 +1,216 @@
+---
+canonical: https://thekevinscott.github.io/dirsql/guide/querying
+---
+# Querying
+> Online: <https://thekevinscott.github.io/dirsql/guide/querying>
+Once a `DirSQL` instance is created, the initial directory scan is complete and you can run SQL queries against the indexed data.
+## Basic queries
+::: code-group
+```python [Python]
+# All rows from a table
+results = db.query("SELECT * FROM comments")
+# Filter with WHERE
+results = db.query("SELECT * FROM comments WHERE author = 'alice'")
+# Aggregations
+results = db.query("SELECT author, COUNT(*) as n FROM comments GROUP BY author")
+# JOINs across tables
+results = db.query("""
+    SELECT posts.title, authors.name
+    FROM posts
+    JOIN authors ON posts.author_id = authors.id
+""")
+```
+```rust [Rust]
+// All rows from a table
+let results = db.query("SELECT * FROM comments")?;
+// Filter with WHERE
+let results = db.query("SELECT * FROM comments WHERE author = 'alice'")?;
+// Aggregations
+let results = db.query("SELECT author, COUNT(*) as n FROM comments GROUP BY author")?;
+// JOINs across tables
+let results = db.query(
+    "SELECT posts.title, authors.name \
+     FROM posts JOIN authors ON posts.author_id = authors.id"
+)?;
+```
+```typescript [TypeScript]
+// All rows from a table
+const results = await db.query('SELECT * FROM comments');
+// Filter with WHERE
+const filtered = await db.query("SELECT * FROM comments WHERE author = 'alice'");
+// Aggregations
+const counts = await db.query('SELECT author, COUNT(*) as n FROM comments GROUP BY author');
+// JOINs across tables
+const joined = await db.query(`
+  SELECT posts.title, authors.name
+  FROM posts
+  JOIN authors ON posts.author_id = authors.id
+`);
+```
+:::
+Any valid SQLite **SELECT** works. The in-memory database supports the full SQLite dialect including subqueries, CTEs, window functions, and aggregate functions. See [Read-only queries](#read-only-queries) below for why write statements (`INSERT`, `UPDATE`, `DELETE`, `DROP`, etc.) are rejected.
+## Return format
+`query()` returns a list of dicts (Python), a `Vec<HashMap<String, Value>>` (Rust), or an array of objects (TypeScript). Each entry maps column names to values.
+::: code-group
+```python [Python]
+results = db.query("SELECT title, author FROM posts")
+# [
+#     {"title": "Hello World", "author": "alice"},
+#     {"title": "Second Post", "author": "bob"},
+# ]
+```
+```rust [Rust]
+let results = db.query("SELECT title, author FROM posts")?;
+// Vec<HashMap<String, Value>>
+// [{"title": "Hello World", "author": "alice"}, ...]
+```
+```typescript [TypeScript]
+const results = await db.query('SELECT title, author FROM posts');
+// [
+//   { title: 'Hello World', author: 'alice' },
+//   { title: 'Second Post', author: 'bob' },
+// ]
+```
+:::
+SQLite types map back to Python types:
+| SQLite type | Python type |
+|-------------|-------------|
+| TEXT        | `str`       |
+| INTEGER     | `int`       |
+| REAL        | `float`     |
+| BLOB        | `bytes`     |
+| NULL        | `None`      |
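Python's standard-library `sqlite3` module applies the same conversions, so the table can be observed without `dirsql` at all. The sketch below does not call `dirsql`; it assumes only that the bindings behave as the table states, and the `sample` table is made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (t TEXT, i INTEGER, r REAL, b BLOB, n TEXT)")
conn.execute(
    "INSERT INTO sample VALUES (?, ?, ?, ?, ?)",
    ("hello", 42, 1.5, b"\x00\x01", None),
)
row = conn.execute("SELECT t, i, r, b, n FROM sample").fetchone()
# Name of the Python type each SQLite value came back as.
python_types = [type(value).__name__ for value in row]
conn.close()
```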
+## Internal columns
+`dirsql` adds internal tracking columns (`_dirsql_file_path`, `_dirsql_row_index`) to each table for file-change diffing. These columns are automatically excluded from `SELECT *` results, so day-to-day queries don't need to account for them.
+If you want to know which file a row came from, you can name the tracking columns explicitly in the projection:
+::: code-group
+```python [Python]
+rows = db.query("SELECT title, _dirsql_file_path FROM posts")
+# [{"title": "Hello World", "_dirsql_file_path": "posts/hello.json"}, ...]
+```
+```rust [Rust]
+let rows = db.query("SELECT title, _dirsql_file_path FROM posts")?;
+// [{"title": "Hello World", "_dirsql_file_path": "posts/hello.json"}, ...]
+```
+```typescript [TypeScript]
+const rows = await db.query('SELECT title, _dirsql_file_path FROM posts');
+// [{ title: 'Hello World', _dirsql_file_path: 'posts/hello.json' }, ...]
+```
+:::
+Tracking columns are only returned when named explicitly — `SELECT *` continues to exclude them.
+## Read-only queries
+`query()` accepts only read-only statements. Each statement is prepared on SQLite and then classified via `sqlite3_stmt_readonly`; anything SQLite itself flags as a write — `INSERT`, `UPDATE`, `DELETE`, `DROP`, `CREATE`, `ALTER`, `REPLACE`, `VACUUM`, `ANALYZE`, etc. — is rejected before any rows are produced.
+This keeps the in-memory index consistent with the on-disk files that back it. Mutations only happen through the watcher/indexer pipeline: to change data, edit the underlying file and let the watcher re-extract rows.
+::: code-group
+```python [Python]
+# Raises a RuntimeError; the index is unchanged.
+db.query("DELETE FROM posts")
+```
+```rust [Rust]
+// Returns DirSqlError::WriteForbidden; the index is unchanged.
+let err = db.query("DELETE FROM posts").unwrap_err();
+assert!(matches!(err, dirsql::DirSqlError::WriteForbidden));
+```
+```typescript [TypeScript]
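+// Fails with an Error whose message explains writes are not accepted.
+// The async wrapper covers both a synchronous throw and a rejected promise.
+await expect((async () => db.query('DELETE FROM posts'))()).rejects.toThrow(/read-only/i);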
+```
+:::
+## Error handling
+Invalid SQL raises an exception:
+::: code-group
+```python [Python]
+try:
+    db.query("NOT VALID SQL")
+except Exception as e:
+    print(f"Query error: {e}")
+```
+```rust [Rust]
+match db.query("NOT VALID SQL") {
+    Ok(results) => println!("{:?}", results),
+    Err(e) => eprintln!("Query error: {}", e),
+}
+```
+```typescript [TypeScript]
+try {
+  await db.query('NOT VALID SQL');
+} catch (e) {
+  console.error(`Query error: ${e}`);
+}
+```
+:::
+## Empty results
+Queries that match no rows return an empty collection:
+::: code-group
+```python [Python]
+results = db.query("SELECT * FROM posts WHERE author = 'nobody'")
+assert results == []
+```
+```rust [Rust]
+let results = db.query("SELECT * FROM posts WHERE author = 'nobody'")?;
+assert!(results.is_empty());
+```
+```typescript [TypeScript]
+const results = await db.query("SELECT * FROM posts WHERE author = 'nobody'");
+console.assert(results.length === 0);
+```
+:::