dirsql 0.3.43__tar.gz → 0.3.45__tar.gz
- {dirsql-0.3.43 → dirsql-0.3.45}/Cargo.lock +3 -1
- {dirsql-0.3.43 → dirsql-0.3.45}/PKG-INFO +1 -1
- {dirsql-0.3.43/packages/python → dirsql-0.3.45}/docs/cli/config.md +82 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/Cargo.toml +1 -1
- {dirsql-0.3.43 → dirsql-0.3.45/packages/python}/docs/cli/config.md +82 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/Cargo.toml +6 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/cli/config.md +82 -0
- dirsql-0.3.45/packages/rust/src/command.rs +487 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/config.rs +62 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/lib.rs +215 -19
- {dirsql-0.3.43 → dirsql-0.3.45}/Cargo.toml +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/README.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/__init__.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/_async.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/_dirsql.pyi +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/cli/__init__.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/cli/binary_path.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/cli/interpret/__init__.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/cli/is_windows.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/cli/main.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/py.typed +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/.claude/CLAUDE.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/.vitepress/config.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/.vitepress/theme/index.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/.vitepress/theme/lang.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/AGENTS.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/api/index.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/cli/http-api.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/cli/index.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/cli/init.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/cli/server.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/getting-started.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/async.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/crdt.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/persistence.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/querying.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/tables.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/watching.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/index.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/migrations.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/package.json +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/playwright.config.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/pnpm-lock.yaml +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/pnpm-workspace.yaml +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/tests/integration/home.spec.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/tests/integration/language-flag.spec.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/tests/integration/sidebar.spec.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/tests/unit/config.test.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/tests/unit/lang.test.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/docs/vitest.config.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/README.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/conftest.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/.claude/CLAUDE.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/.vitepress/config.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/.vitepress/theme/index.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/.vitepress/theme/lang.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/AGENTS.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/api/index.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/cli/http-api.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/cli/index.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/cli/init.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/cli/server.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/getting-started.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/async.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/crdt.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/persistence.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/querying.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/tables.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/watching.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/index.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/migrations.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/package.json +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/playwright.config.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/pnpm-lock.yaml +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/pnpm-workspace.yaml +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/tests/integration/home.spec.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/tests/integration/language-flag.spec.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/tests/integration/sidebar.spec.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/tests/unit/config.test.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/tests/unit/lang.test.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/vitest.config.ts +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/e2e-attestation.json +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/src/lib.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/tests/__init__.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/tests/conftest.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/tests/e2e/__init__.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/tests/integration/__init__.py +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/README.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/benches/db_bench.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/benches/differ_bench.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/benches/matcher_bench.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/benches/scanner_bench.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/api/index.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/cli/http-api.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/cli/index.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/cli/init.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/cli/server.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/getting-started.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/async.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/crdt.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/persistence.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/querying.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/tables.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/watching.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/index.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/migrations.md +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/bin/dirsql.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/cli/init.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/cli/mod.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/cli/router.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/cli/serialize.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/cli/server.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/db.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/differ.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/matcher.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/persist.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/scanner.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/watcher.rs +0 -0
- {dirsql-0.3.43 → dirsql-0.3.45}/pyproject.toml +0 -0
@@ -478,11 +478,13 @@ dependencies = [
 "rusqlite",
 "serde",
 "serde_json",
+"shlex",
 "tempfile",
 "thiserror",
 "tokio",
 "tokio-stream",
 "toml",
+"wait-timeout",
 "walkdir",
 ]
 
@@ -499,7 +501,7 @@ dependencies = [
 
 [[package]]
 name = "dirsql-py-ext"
-version = "0.3.
+version = "0.3.45"
 dependencies = [
 "dirsql",
 "pyo3",
@@ -191,6 +191,57 @@ always filtered to the DDL's declared columns regardless. Strict mode
 applies only to keys produced by an extract callback (relevant for
 programmatic [tables](../guide/tables.md)).
 
+### Per-file commands (`on-file`)
+
+Reach for `on-file` when a table's rows come from the *contents* of each
+matched file, not just its path and stat metadata. A filesystem-fact table
+gives you one row per file; `on-file` runs a command per file that reads the
+file and emits as many rows as it likes.
+
+```toml
+[[table]]
+ddl = "CREATE TABLE papers (paper_id TEXT, title TEXT)"
+glob = "**/meta.json"
+on-file = "uv run python extract_papers.py {path}"
+```
+
+For every file matched by `glob`, `dirsql` runs the command. **The command
+reads the file itself and prints a JSON array of row objects on stdout**; each
+object becomes one row, its fields mapped to columns:
+
+```json
+[
+{ "paper_id": "arXiv:2401.001", "title": "On Directories" },
+{ "paper_id": "arXiv:2401.002", "title": "SQL All The Way Down" }
+]
+```
+
+Placeholders substituted into the command:
+
+| Placeholder | Value |
+|-------------|-------|
+| `{path}` | The matched file's path **relative to the index root**. Appended automatically when the command omits it, so `extract.py` and `extract.py {path}` behave identically. |
+| `{abspath}` | The matched file's absolute path. |
+| `{root}` | The index root directory. |
+
+Filesystem facts (stat virtuals and glob captures) are still merged onto every
+`on-file` row, so you can declare `_path`, `_basename`, `{capture}`, etc. in the
+DDL alongside the command's own columns — a column emitted by the command wins
+over a same-named filesystem fact.
+
+JSON values map to SQLite as follows: `null` → NULL; `true`/`false` → `1`/`0`;
+an integer → INTEGER, any other number → REAL; a string → TEXT; a nested array
+or object → its JSON text as TEXT.
+
+**Per-file error isolation.** If a file's command fails — a non-zero exit, a
+timeout, a spawn error, or output that isn't a JSON array of objects — that
+file is skipped (it contributes no rows) and a one-line warning naming the file
+and the error is written to stderr. One bad file never aborts the scan; the
+other files' rows are indexed normally.
+
+See [Command execution](#command-execution) for the full contract (argv
+splitting, injection safety, cwd, environment, timeout, and output framing).
+
 ### Full Example
 
 ```toml
@@ -209,3 +260,34 @@ glob = "**/index.md"
 ddl = "CREATE TABLE logs (_path TEXT, _size INTEGER, _mtime INTEGER)"
 glob = "logs/*.csv"
 ```
+
+## Command execution
+
+Config keys that run an external command — today `on-file`, with more events to
+follow — share one execution contract:
+
+- **argv, not a shell.** The command string is split into an argv with
+shell-like quoting (spaces separate arguments; quotes group them), but **no
+shell is invoked** — there is no globbing, piping, `$VAR` expansion, or
+`&&`/`;` chaining. To get those, ask for a shell explicitly:
+`sh -c 'grep foo {path} | sort'` — the quoted script stays a single argument.
+- **Injection-safe placeholders.** Each placeholder (`{path}`, `{abspath}`,
+`{root}`, …) is substituted into whole argv tokens, every occurrence, in a
+single left-to-right pass. A substituted value is always exactly one argv
+element, so a path with spaces — or untrusted content that itself contains
+`{…}` or shell metacharacters — is inert and never re-scanned. An unknown
+`{…}` is left literal.
+- **Working directory.** The command runs in the **config file's directory**,
+so relative paths in the command resolve predictably regardless of where you
+launched `dirsql`.
+- **Environment.** The command inherits `dirsql`'s environment, so tools like
+`uvx --with …` / `npx …` resolve their dependencies as usual.
+- **Output framing.** The command's result is the **last non-empty line of
+stdout**; any log/chatter lines above it are ignored. stderr is never data —
+it is captured only to enrich error messages.
+- **Timeout.** Each command run is bounded by a fixed **30-second** timeout (no
+per-table override yet); a command that exceeds it is killed and treated as a
+failure.
+- **Errors.** A non-zero exit, a timeout, a spawn failure, or output that does
+not parse as expected is a per-file failure: the file is skipped with a
+stderr warning and the scan continues.
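The `on-file` contract added in the hunk above asks the command to read the matched file and print a JSON array of row objects. A minimal sketch of what the `extract_papers.py` named in the TOML example might look like follows. The script body is not part of this diff, and the input shape of `meta.json` (a top-level `papers` list with `id` and `title` keys) is purely an assumption for illustration.

```python
"""Hypothetical extract_papers.py: turn one meta.json into row objects."""
import json
import sys


def rows_from_meta(meta):
    # Assumed input shape (not taken from the dirsql docs):
    #   {"papers": [{"id": "...", "title": "..."}, ...]}
    # Keys in each returned object match the columns declared in the DDL.
    return [
        {"paper_id": paper["id"], "title": paper["title"]}
        for paper in meta.get("papers", [])
    ]


def format_rows(rows):
    # json.dumps without indent yields a single line, so the array is the
    # last non-empty line of stdout, as the output-framing rule expects.
    return json.dumps(rows)


def main(argv):
    # argv[1] is the value substituted for {path}.
    with open(argv[1], encoding="utf-8") as handle:
        meta = json.load(handle)
    print(format_rows(rows_from_meta(meta)))
    return 0


# To use this as a script, end the file with:
#     raise SystemExit(main(sys.argv))
```

Because the command runs in the config file's directory while `{path}` is relative to the index root, this sketch assumes the two are the same directory. If they differ, `{abspath}` would likely be the safer placeholder to pass.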
@@ -4,7 +4,7 @@ name = "dirsql-py-ext"
 # pypi/maturin handler can rewrite it via `write-version` before
 # `maturin build`. `pyproject.toml` declares `dynamic = ["version"]`
 # and maturin reads this field. Mirrors `packages/rust/Cargo.toml`.
-version = "0.3.
+version = "0.3.45"
 edition.workspace = true
 publish = false
 readme = "README.md"
@@ -191,6 +191,57 @@ always filtered to the DDL's declared columns regardless. Strict mode
 applies only to keys produced by an extract callback (relevant for
 programmatic [tables](../guide/tables.md)).
 
+### Per-file commands (`on-file`)
+
+Reach for `on-file` when a table's rows come from the *contents* of each
+matched file, not just its path and stat metadata. A filesystem-fact table
+gives you one row per file; `on-file` runs a command per file that reads the
+file and emits as many rows as it likes.
+
+```toml
+[[table]]
+ddl = "CREATE TABLE papers (paper_id TEXT, title TEXT)"
+glob = "**/meta.json"
+on-file = "uv run python extract_papers.py {path}"
+```
+
+For every file matched by `glob`, `dirsql` runs the command. **The command
+reads the file itself and prints a JSON array of row objects on stdout**; each
+object becomes one row, its fields mapped to columns:
+
+```json
+[
+{ "paper_id": "arXiv:2401.001", "title": "On Directories" },
+{ "paper_id": "arXiv:2401.002", "title": "SQL All The Way Down" }
+]
+```
+
+Placeholders substituted into the command:
+
+| Placeholder | Value |
+|-------------|-------|
+| `{path}` | The matched file's path **relative to the index root**. Appended automatically when the command omits it, so `extract.py` and `extract.py {path}` behave identically. |
+| `{abspath}` | The matched file's absolute path. |
+| `{root}` | The index root directory. |
+
+Filesystem facts (stat virtuals and glob captures) are still merged onto every
+`on-file` row, so you can declare `_path`, `_basename`, `{capture}`, etc. in the
+DDL alongside the command's own columns — a column emitted by the command wins
+over a same-named filesystem fact.
+
+JSON values map to SQLite as follows: `null` → NULL; `true`/`false` → `1`/`0`;
+an integer → INTEGER, any other number → REAL; a string → TEXT; a nested array
+or object → its JSON text as TEXT.
+
+**Per-file error isolation.** If a file's command fails — a non-zero exit, a
+timeout, a spawn error, or output that isn't a JSON array of objects — that
+file is skipped (it contributes no rows) and a one-line warning naming the file
+and the error is written to stderr. One bad file never aborts the scan; the
+other files' rows are indexed normally.
+
+See [Command execution](#command-execution) for the full contract (argv
+splitting, injection safety, cwd, environment, timeout, and output framing).
+
 ### Full Example
 
 ```toml
@@ -209,3 +260,34 @@ glob = "**/index.md"
 ddl = "CREATE TABLE logs (_path TEXT, _size INTEGER, _mtime INTEGER)"
 glob = "logs/*.csv"
 ```
+
+## Command execution
+
+Config keys that run an external command — today `on-file`, with more events to
+follow — share one execution contract:
+
+- **argv, not a shell.** The command string is split into an argv with
+shell-like quoting (spaces separate arguments; quotes group them), but **no
+shell is invoked** — there is no globbing, piping, `$VAR` expansion, or
+`&&`/`;` chaining. To get those, ask for a shell explicitly:
+`sh -c 'grep foo {path} | sort'` — the quoted script stays a single argument.
+- **Injection-safe placeholders.** Each placeholder (`{path}`, `{abspath}`,
+`{root}`, …) is substituted into whole argv tokens, every occurrence, in a
+single left-to-right pass. A substituted value is always exactly one argv
+element, so a path with spaces — or untrusted content that itself contains
+`{…}` or shell metacharacters — is inert and never re-scanned. An unknown
+`{…}` is left literal.
+- **Working directory.** The command runs in the **config file's directory**,
+so relative paths in the command resolve predictably regardless of where you
+launched `dirsql`.
+- **Environment.** The command inherits `dirsql`'s environment, so tools like
+`uvx --with …` / `npx …` resolve their dependencies as usual.
+- **Output framing.** The command's result is the **last non-empty line of
+stdout**; any log/chatter lines above it are ignored. stderr is never data —
+it is captured only to enrich error messages.
+- **Timeout.** Each command run is bounded by a fixed **30-second** timeout (no
+per-table override yet); a command that exceeds it is killed and treated as a
+failure.
+- **Errors.** A non-zero exit, a timeout, a spawn failure, or output that does
+not parse as expected is a per-file failure: the file is skipped with a
+stderr warning and the scan continues.
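The "argv, not a shell" and "injection-safe placeholders" rules in the hunk above can be illustrated with a small Python sketch. This is not the code in `command.rs`, which is not shown in this diff. The helper names `build_argv` and `substitute` are hypothetical, and the sketch assumes a placeholder may appear anywhere inside a token, which is only one reading of "substituted into whole argv tokens".

```python
import shlex


def substitute(token, values):
    """Fill known {name} placeholders in one token, in a single pass."""
    out = []
    i = 0
    while i < len(token):
        if token[i] == "{":
            end = token.find("}", i)
            if end != -1 and token[i + 1:end] in values:
                # Append the value and jump past the placeholder; the value
                # itself is never scanned again, so braces inside it are inert.
                out.append(values[token[i + 1:end]])
                i = end + 1
                continue
        # Unknown placeholders and ordinary characters are copied literally.
        out.append(token[i])
        i += 1
    return "".join(out)


def build_argv(template, values):
    """Split a command template with shell-like quoting, then substitute."""
    # shlex.split only tokenizes; no shell runs, so pipes, globs and $VAR
    # in the template stay plain text.
    argv = shlex.split(template)
    # Mirror the documented convenience: append {path} when it is omitted.
    if not any("{path}" in token for token in argv):
        argv.append("{path}")
    # Substitution happens after splitting, so a value with spaces stays
    # inside the single argv element it was substituted into.
    return [substitute(token, values) for token in argv]
```

For example, with `values = {"path": "notes/my paper/meta.json"}`, `build_argv("extract.py", values)` produces a two-element argv whose second element is the whole path, space included.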
@@ -60,10 +60,16 @@ regex.workspace = true
 rusqlite = { version = "0.34", features = ["bundled", "load_extension"] }
 serde.workspace = true
 serde_json.workspace = true
+# Split a command *template* into argv with shell-like quoting (so
+# `sh -c '...'` keeps its script as one arg) WITHOUT invoking a shell.
+shlex = "1"
 thiserror.workspace = true
 tokio.workspace = true
 tokio-stream = { version = "0.1", features = ["sync"], optional = true }
 toml = "0.8"
+# Enforce a per-command timeout: wait for the child up to a deadline,
+# then kill it. No async runtime required in the core.
+wait-timeout = "0.2"
 walkdir = "2"
 
 [[bin]]
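The two new dependencies above back the timeout and argv rules of the command contract. As a rough analogue in Python, not the Rust implementation, the run-with-deadline, output-framing and per-file error rules might be sketched as follows. The function names and the warning text are hypothetical; only the 30-second bound and the "last non-empty line of stdout" rule come from the documentation in this diff.

```python
import json
import subprocess
import sys

TIMEOUT_SECONDS = 30  # the fixed timeout stated in the docs


def last_nonempty_line(stdout):
    """Return the last non-blank line of stdout, or None if there is none."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    return lines[-1] if lines else None


def parse_rows(stdout):
    """Parse the framed result; reject anything but a JSON array of objects."""
    line = last_nonempty_line(stdout)
    if line is None:
        raise ValueError("command produced no output")
    rows = json.loads(line)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
        raise ValueError("output is not a JSON array of objects")
    return rows


def run_for_file(argv, cwd, label):
    """Run one command; on any failure warn on stderr and return no rows."""
    try:
        # subprocess.run kills the child itself when the timeout expires.
        done = subprocess.run(
            argv, cwd=cwd, capture_output=True, text=True,
            timeout=TIMEOUT_SECONDS,
        )
        if done.returncode != 0:
            raise RuntimeError(f"exit status {done.returncode}: {done.stderr.strip()}")
        return parse_rows(done.stdout)
    except (OSError, subprocess.TimeoutExpired, RuntimeError, ValueError) as err:
        # One bad file is skipped; the caller keeps scanning the others.
        print(f"warning: {label}: {err}", file=sys.stderr)
        return []
```

Returning an empty list on failure keeps the caller's loop simple: a skipped file and a file that legitimately emits `[]` both contribute no rows, and only the stderr warning distinguishes them.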
@@ -191,6 +191,57 @@ always filtered to the DDL's declared columns regardless. Strict mode
 applies only to keys produced by an extract callback (relevant for
 programmatic [tables](../guide/tables.md)).
 
+### Per-file commands (`on-file`)
+
+Reach for `on-file` when a table's rows come from the *contents* of each
+matched file, not just its path and stat metadata. A filesystem-fact table
+gives you one row per file; `on-file` runs a command per file that reads the
+file and emits as many rows as it likes.
+
+```toml
+[[table]]
+ddl = "CREATE TABLE papers (paper_id TEXT, title TEXT)"
+glob = "**/meta.json"
+on-file = "uv run python extract_papers.py {path}"
+```
+
+For every file matched by `glob`, `dirsql` runs the command. **The command
+reads the file itself and prints a JSON array of row objects on stdout**; each
+object becomes one row, its fields mapped to columns:
+
+```json
+[
+{ "paper_id": "arXiv:2401.001", "title": "On Directories" },
+{ "paper_id": "arXiv:2401.002", "title": "SQL All The Way Down" }
+]
+```
+
+Placeholders substituted into the command:
+
+| Placeholder | Value |
+|-------------|-------|
+| `{path}` | The matched file's path **relative to the index root**. Appended automatically when the command omits it, so `extract.py` and `extract.py {path}` behave identically. |
+| `{abspath}` | The matched file's absolute path. |
+| `{root}` | The index root directory. |
+
+Filesystem facts (stat virtuals and glob captures) are still merged onto every
+`on-file` row, so you can declare `_path`, `_basename`, `{capture}`, etc. in the
+DDL alongside the command's own columns — a column emitted by the command wins
+over a same-named filesystem fact.
+
+JSON values map to SQLite as follows: `null` → NULL; `true`/`false` → `1`/`0`;
+an integer → INTEGER, any other number → REAL; a string → TEXT; a nested array
+or object → its JSON text as TEXT.
+
+**Per-file error isolation.** If a file's command fails — a non-zero exit, a
+timeout, a spawn error, or output that isn't a JSON array of objects — that
+file is skipped (it contributes no rows) and a one-line warning naming the file
+and the error is written to stderr. One bad file never aborts the scan; the
+other files' rows are indexed normally.
+
+See [Command execution](#command-execution) for the full contract (argv
+splitting, injection safety, cwd, environment, timeout, and output framing).
+
 ### Full Example
 
 ```toml
@@ -209,3 +260,34 @@ glob = "**/index.md"
 ddl = "CREATE TABLE logs (_path TEXT, _size INTEGER, _mtime INTEGER)"
 glob = "logs/*.csv"
 ```
+
+## Command execution
+
+Config keys that run an external command — today `on-file`, with more events to
+follow — share one execution contract:
+
+- **argv, not a shell.** The command string is split into an argv with
+shell-like quoting (spaces separate arguments; quotes group them), but **no
+shell is invoked** — there is no globbing, piping, `$VAR` expansion, or
+`&&`/`;` chaining. To get those, ask for a shell explicitly:
+`sh -c 'grep foo {path} | sort'` — the quoted script stays a single argument.
+- **Injection-safe placeholders.** Each placeholder (`{path}`, `{abspath}`,
+`{root}`, …) is substituted into whole argv tokens, every occurrence, in a
+single left-to-right pass. A substituted value is always exactly one argv
+element, so a path with spaces — or untrusted content that itself contains
+`{…}` or shell metacharacters — is inert and never re-scanned. An unknown
+`{…}` is left literal.
+- **Working directory.** The command runs in the **config file's directory**,
+so relative paths in the command resolve predictably regardless of where you
+launched `dirsql`.
+- **Environment.** The command inherits `dirsql`'s environment, so tools like
+`uvx --with …` / `npx …` resolve their dependencies as usual.
+- **Output framing.** The command's result is the **last non-empty line of
+stdout**; any log/chatter lines above it are ignored. stderr is never data —
+it is captured only to enrich error messages.
+- **Timeout.** Each command run is bounded by a fixed **30-second** timeout (no
+per-table override yet); a command that exceeds it is killed and treated as a
+failure.
+- **Errors.** A non-zero exit, a timeout, a spawn failure, or output that does
+not parse as expected is a per-file failure: the file is skipped with a
+stderr warning and the scan continues.
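The JSON-to-SQLite value mapping described in the `on-file` section can be checked with a short Python sketch using the standard `sqlite3` module. The function `to_sqlite_value` is a hypothetical name, and the exact JSON text dirsql stores for nested values (spacing, key order) is not stated in the diff, so `json.dumps` defaults are an assumption here.

```python
import json
import sqlite3


def to_sqlite_value(value):
    """Map one decoded JSON value to a value SQLite can bind."""
    if value is None:
        return None                  # null -> NULL
    if isinstance(value, bool):      # test bool before int: bool is an int subclass
        return 1 if value else 0     # true/false -> 1/0
    if isinstance(value, (int, float, str)):
        return value                 # INTEGER, REAL, or TEXT
    return json.dumps(value)         # nested array/object -> JSON text as TEXT


def storage_classes(values):
    """Insert the mapped values and report the storage class SQLite chose."""
    conn = sqlite3.connect(":memory:")
    # An untyped column avoids type affinity changing the stored class.
    conn.execute("CREATE TABLE t (v)")
    conn.executemany(
        "INSERT INTO t VALUES (?)",
        [(to_sqlite_value(v),) for v in values],
    )
    kinds = [row[0] for row in conn.execute("SELECT typeof(v) FROM t ORDER BY rowid")]
    conn.close()
    return kinds
```

Calling `storage_classes(json.loads('[null, true, 7, 1.5, "x", [1, 2]]'))` is one way to see the mapping end to end against an in-memory database.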