dirsql 0.3.43__tar.gz → 0.3.45__tar.gz

Files changed (119)
  1. {dirsql-0.3.43 → dirsql-0.3.45}/Cargo.lock +3 -1
  2. {dirsql-0.3.43 → dirsql-0.3.45}/PKG-INFO +1 -1
  3. {dirsql-0.3.43/packages/python → dirsql-0.3.45}/docs/cli/config.md +82 -0
  4. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/Cargo.toml +1 -1
  5. {dirsql-0.3.43 → dirsql-0.3.45/packages/python}/docs/cli/config.md +82 -0
  6. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/Cargo.toml +6 -0
  7. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/cli/config.md +82 -0
  8. dirsql-0.3.45/packages/rust/src/command.rs +487 -0
  9. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/config.rs +62 -0
  10. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/lib.rs +215 -19
  11. {dirsql-0.3.43 → dirsql-0.3.45}/Cargo.toml +0 -0
  12. {dirsql-0.3.43 → dirsql-0.3.45}/README.md +0 -0
  13. {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/__init__.py +0 -0
  14. {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/_async.py +0 -0
  15. {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/_dirsql.pyi +0 -0
  16. {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/cli/__init__.py +0 -0
  17. {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/cli/binary_path.py +0 -0
  18. {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/cli/interpret/__init__.py +0 -0
  19. {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/cli/is_windows.py +0 -0
  20. {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/cli/main.py +0 -0
  21. {dirsql-0.3.43 → dirsql-0.3.45}/dirsql/py.typed +0 -0
  22. {dirsql-0.3.43 → dirsql-0.3.45}/docs/.claude/CLAUDE.md +0 -0
  23. {dirsql-0.3.43 → dirsql-0.3.45}/docs/.vitepress/config.ts +0 -0
  24. {dirsql-0.3.43 → dirsql-0.3.45}/docs/.vitepress/theme/index.ts +0 -0
  25. {dirsql-0.3.43 → dirsql-0.3.45}/docs/.vitepress/theme/lang.ts +0 -0
  26. {dirsql-0.3.43 → dirsql-0.3.45}/docs/AGENTS.md +0 -0
  27. {dirsql-0.3.43 → dirsql-0.3.45}/docs/api/index.md +0 -0
  28. {dirsql-0.3.43 → dirsql-0.3.45}/docs/cli/http-api.md +0 -0
  29. {dirsql-0.3.43 → dirsql-0.3.45}/docs/cli/index.md +0 -0
  30. {dirsql-0.3.43 → dirsql-0.3.45}/docs/cli/init.md +0 -0
  31. {dirsql-0.3.43 → dirsql-0.3.45}/docs/cli/server.md +0 -0
  32. {dirsql-0.3.43 → dirsql-0.3.45}/docs/getting-started.md +0 -0
  33. {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/async.md +0 -0
  34. {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/crdt.md +0 -0
  35. {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/persistence.md +0 -0
  36. {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/querying.md +0 -0
  37. {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/tables.md +0 -0
  38. {dirsql-0.3.43 → dirsql-0.3.45}/docs/guide/watching.md +0 -0
  39. {dirsql-0.3.43 → dirsql-0.3.45}/docs/index.md +0 -0
  40. {dirsql-0.3.43 → dirsql-0.3.45}/docs/migrations.md +0 -0
  41. {dirsql-0.3.43 → dirsql-0.3.45}/docs/package.json +0 -0
  42. {dirsql-0.3.43 → dirsql-0.3.45}/docs/playwright.config.ts +0 -0
  43. {dirsql-0.3.43 → dirsql-0.3.45}/docs/pnpm-lock.yaml +0 -0
  44. {dirsql-0.3.43 → dirsql-0.3.45}/docs/pnpm-workspace.yaml +0 -0
  45. {dirsql-0.3.43 → dirsql-0.3.45}/docs/tests/integration/home.spec.ts +0 -0
  46. {dirsql-0.3.43 → dirsql-0.3.45}/docs/tests/integration/language-flag.spec.ts +0 -0
  47. {dirsql-0.3.43 → dirsql-0.3.45}/docs/tests/integration/sidebar.spec.ts +0 -0
  48. {dirsql-0.3.43 → dirsql-0.3.45}/docs/tests/unit/config.test.ts +0 -0
  49. {dirsql-0.3.43 → dirsql-0.3.45}/docs/tests/unit/lang.test.ts +0 -0
  50. {dirsql-0.3.43 → dirsql-0.3.45}/docs/vitest.config.ts +0 -0
  51. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/README.md +0 -0
  52. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/conftest.py +0 -0
  53. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/.claude/CLAUDE.md +0 -0
  54. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/.vitepress/config.ts +0 -0
  55. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/.vitepress/theme/index.ts +0 -0
  56. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/.vitepress/theme/lang.ts +0 -0
  57. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/AGENTS.md +0 -0
  58. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/api/index.md +0 -0
  59. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/cli/http-api.md +0 -0
  60. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/cli/index.md +0 -0
  61. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/cli/init.md +0 -0
  62. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/cli/server.md +0 -0
  63. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/getting-started.md +0 -0
  64. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/async.md +0 -0
  65. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/crdt.md +0 -0
  66. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/persistence.md +0 -0
  67. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/querying.md +0 -0
  68. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/tables.md +0 -0
  69. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/guide/watching.md +0 -0
  70. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/index.md +0 -0
  71. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/migrations.md +0 -0
  72. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/package.json +0 -0
  73. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/playwright.config.ts +0 -0
  74. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/pnpm-lock.yaml +0 -0
  75. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/pnpm-workspace.yaml +0 -0
  76. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/tests/integration/home.spec.ts +0 -0
  77. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/tests/integration/language-flag.spec.ts +0 -0
  78. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/tests/integration/sidebar.spec.ts +0 -0
  79. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/tests/unit/config.test.ts +0 -0
  80. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/tests/unit/lang.test.ts +0 -0
  81. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/docs/vitest.config.ts +0 -0
  82. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/e2e-attestation.json +0 -0
  83. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/src/lib.rs +0 -0
  84. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/tests/__init__.py +0 -0
  85. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/tests/conftest.py +0 -0
  86. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/tests/e2e/__init__.py +0 -0
  87. {dirsql-0.3.43 → dirsql-0.3.45}/packages/python/tests/integration/__init__.py +0 -0
  88. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/README.md +0 -0
  89. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/benches/db_bench.rs +0 -0
  90. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/benches/differ_bench.rs +0 -0
  91. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/benches/matcher_bench.rs +0 -0
  92. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/benches/scanner_bench.rs +0 -0
  93. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/api/index.md +0 -0
  94. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/cli/http-api.md +0 -0
  95. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/cli/index.md +0 -0
  96. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/cli/init.md +0 -0
  97. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/cli/server.md +0 -0
  98. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/getting-started.md +0 -0
  99. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/async.md +0 -0
  100. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/crdt.md +0 -0
  101. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/persistence.md +0 -0
  102. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/querying.md +0 -0
  103. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/tables.md +0 -0
  104. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/guide/watching.md +0 -0
  105. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/index.md +0 -0
  106. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/migrations.md +0 -0
  107. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/bin/dirsql.rs +0 -0
  108. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/cli/init.rs +0 -0
  109. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/cli/mod.rs +0 -0
  110. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/cli/router.rs +0 -0
  111. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/cli/serialize.rs +0 -0
  112. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/cli/server.rs +0 -0
  113. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/db.rs +0 -0
  114. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/differ.rs +0 -0
  115. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/matcher.rs +0 -0
  116. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/persist.rs +0 -0
  117. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/scanner.rs +0 -0
  118. {dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/src/watcher.rs +0 -0
  119. {dirsql-0.3.43 → dirsql-0.3.45}/pyproject.toml +0 -0
{dirsql-0.3.43 → dirsql-0.3.45}/Cargo.lock
@@ -478,11 +478,13 @@ dependencies = [
  "rusqlite",
  "serde",
  "serde_json",
+ "shlex",
  "tempfile",
  "thiserror",
  "tokio",
  "tokio-stream",
  "toml",
+ "wait-timeout",
  "walkdir",
  ]

@@ -499,7 +501,7 @@ dependencies = [

  [[package]]
  name = "dirsql-py-ext"
- version = "0.3.43"
+ version = "0.3.45"
  dependencies = [
  "dirsql",
  "pyo3",
{dirsql-0.3.43 → dirsql-0.3.45}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: dirsql
- Version: 0.3.43
+ Version: 0.3.45
  Requires-Dist: pytest>=8 ; extra == 'dev'
  Requires-Dist: pytest-describe>=2 ; extra == 'dev'
  Requires-Dist: pytest-asyncio>=0.23 ; extra == 'dev'
{dirsql-0.3.43/packages/python → dirsql-0.3.45}/docs/cli/config.md
@@ -191,6 +191,57 @@ always filtered to the DDL's declared columns regardless. Strict mode
  applies only to keys produced by an extract callback (relevant for
  programmatic [tables](../guide/tables.md)).

+ ### Per-file commands (`on-file`)
+
+ Reach for `on-file` when a table's rows come from the *contents* of each
+ matched file, not just its path and stat metadata. A filesystem-fact table
+ gives you one row per file; `on-file` runs a command per file that reads the
+ file and emits as many rows as it likes.
+
+ ```toml
+ [[table]]
+ ddl = "CREATE TABLE papers (paper_id TEXT, title TEXT)"
+ glob = "**/meta.json"
+ on-file = "uv run python extract_papers.py {path}"
+ ```
+
+ For every file matched by `glob`, `dirsql` runs the command. **The command
+ reads the file itself and prints a JSON array of row objects on stdout**; each
+ object becomes one row, its fields mapped to columns:
+
+ ```json
+ [
+ { "paper_id": "arXiv:2401.001", "title": "On Directories" },
+ { "paper_id": "arXiv:2401.002", "title": "SQL All The Way Down" }
+ ]
+ ```
+
+ Placeholders substituted into the command:
+
+ | Placeholder | Value |
+ |-------------|-------|
+ | `{path}` | The matched file's path **relative to the index root**. Appended automatically when the command omits it, so `extract.py` and `extract.py {path}` behave identically. |
+ | `{abspath}` | The matched file's absolute path. |
+ | `{root}` | The index root directory. |
+
+ Filesystem facts (stat virtuals and glob captures) are still merged onto every
+ `on-file` row, so you can declare `_path`, `_basename`, `{capture}`, etc. in the
+ DDL alongside the command's own columns — a column emitted by the command wins
+ over a same-named filesystem fact.
+
+ JSON values map to SQLite as follows: `null` → NULL; `true`/`false` → `1`/`0`;
+ an integer → INTEGER, any other number → REAL; a string → TEXT; a nested array
+ or object → its JSON text as TEXT.
+
+ **Per-file error isolation.** If a file's command fails — a non-zero exit, a
+ timeout, a spawn error, or output that isn't a JSON array of objects — that
+ file is skipped (it contributes no rows) and a one-line warning naming the file
+ and the error is written to stderr. One bad file never aborts the scan; the
+ other files' rows are indexed normally.
+
+ See [Command execution](#command-execution) for the full contract (argv
+ splitting, injection safety, cwd, environment, timeout, and output framing).
+
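For readers who want to try the `on-file` contract added above, here is a minimal sketch of what an `extract_papers.py` could look like. The package diff does not include that script, so the input shape (`{"papers": [{"id": ..., "title": ...}]}`) and the helper names `rows_from_meta` and `emit_rows` are assumptions made purely for illustration.

```python
import json
import sys


def rows_from_meta(meta: dict) -> list[dict]:
    """Map one parsed meta.json to row objects for the `papers` table.

    Hypothetical input shape: {"papers": [{"id": ..., "title": ...}]}.
    The output keys match the columns declared in the example DDL.
    """
    return [
        {"paper_id": paper["id"], "title": paper["title"]}
        for paper in meta.get("papers", [])
    ]


def emit_rows(path: str) -> None:
    """Read the matched file and print its rows as one JSON array."""
    with open(path, encoding="utf-8") as handle:
        meta = json.load(handle)
    # json.dumps without `indent` produces a single line, so the whole
    # array is also the last non-empty line of stdout.
    print(json.dumps(rows_from_meta(meta)))


# Only act when invoked as `extract_papers.py <path>`.
if __name__ == "__main__" and len(sys.argv) == 2:
    emit_rows(sys.argv[1])
```

The docs state that `{path}` is relative to the index root while the command runs in the config file's directory. If those two directories differ, passing `{abspath}` instead may be the safer choice for a script that opens the file directly.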
  ### Full Example

  ```toml
@@ -209,3 +260,34 @@ glob = "**/index.md"
  ddl = "CREATE TABLE logs (_path TEXT, _size INTEGER, _mtime INTEGER)"
  glob = "logs/*.csv"
  ```
+
+ ## Command execution
+
+ Config keys that run an external command — today `on-file`, with more events to
+ follow — share one execution contract:
+
+ - **argv, not a shell.** The command string is split into an argv with
+ shell-like quoting (spaces separate arguments; quotes group them), but **no
+ shell is invoked** — there is no globbing, piping, `$VAR` expansion, or
+ `&&`/`;` chaining. To get those, ask for a shell explicitly:
+ `sh -c 'grep foo {path} | sort'` — the quoted script stays a single argument.
+ - **Injection-safe placeholders.** Each placeholder (`{path}`, `{abspath}`,
+ `{root}`, …) is substituted into whole argv tokens, every occurrence, in a
+ single left-to-right pass. A substituted value is always exactly one argv
+ element, so a path with spaces — or untrusted content that itself contains
+ `{…}` or shell metacharacters — is inert and never re-scanned. An unknown
+ `{…}` is left literal.
+ - **Working directory.** The command runs in the **config file's directory**,
+ so relative paths in the command resolve predictably regardless of where you
+ launched `dirsql`.
+ - **Environment.** The command inherits `dirsql`'s environment, so tools like
+ `uvx --with …` / `npx …` resolve their dependencies as usual.
+ - **Output framing.** The command's result is the **last non-empty line of
+ stdout**; any log/chatter lines above it are ignored. stderr is never data —
+ it is captured only to enrich error messages.
+ - **Timeout.** Each command run is bounded by a fixed **30-second** timeout (no
+ per-table override yet); a command that exceeds it is killed and treated as a
+ failure.
+ - **Errors.** A non-zero exit, a timeout, a spawn failure, or output that does
+ not parse as expected is a per-file failure: the file is skipped with a
+ stderr warning and the scan continues.
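The "argv, not a shell" and "injection-safe placeholders" rules above can be illustrated with a short Python sketch. This is not dirsql's code (the crate is Rust and, per the `Cargo.toml` change, depends on the `shlex` crate). It uses Python's standard `shlex` module as a stand-in, and `build_argv` is a hypothetical helper that only mirrors the documented behaviour.

```python
import re
import shlex

# Matches {name}-style placeholders such as {path}, {abspath}, {root}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def build_argv(template: str, values: dict[str, str]) -> list[str]:
    """Split a command template into argv, then fill placeholders per token."""
    # Shell-like splitting only: quotes group words, nothing is executed.
    tokens = shlex.split(template)
    # Documented behaviour: {path} is appended when the command omits it.
    if not any("{path}" in token for token in tokens):
        tokens.append("{path}")

    def fill(token: str) -> str:
        # re.sub makes one left-to-right pass and never re-scans the text
        # it just inserted; unknown placeholders are returned unchanged.
        return PLACEHOLDER.sub(
            lambda match: values.get(match.group(1), match.group(0)), token
        )

    # Each token stays exactly one argv element, whatever the value contains.
    return [fill(token) for token in tokens]


values = {"path": "my papers/meta.json", "root": "/srv/index"}
print(build_argv("extract.py", values))
print(build_argv("sh -c 'grep foo {path} | sort'", values))
```

Because substitution happens after splitting, a path containing spaces cannot turn into extra arguments, which is the property the docs call injection safety.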
{dirsql-0.3.43 → dirsql-0.3.45}/packages/python/Cargo.toml
@@ -4,7 +4,7 @@ name = "dirsql-py-ext"
  # pypi/maturin handler can rewrite it via `write-version` before
  # `maturin build`. `pyproject.toml` declares `dynamic = ["version"]`
  # and maturin reads this field. Mirrors `packages/rust/Cargo.toml`.
- version = "0.3.43"
+ version = "0.3.45"
  edition.workspace = true
  publish = false
  readme = "README.md"
{dirsql-0.3.43 → dirsql-0.3.45/packages/python}/docs/cli/config.md
@@ -191,6 +191,57 @@ always filtered to the DDL's declared columns regardless. Strict mode
  applies only to keys produced by an extract callback (relevant for
  programmatic [tables](../guide/tables.md)).

+ ### Per-file commands (`on-file`)
+
+ Reach for `on-file` when a table's rows come from the *contents* of each
+ matched file, not just its path and stat metadata. A filesystem-fact table
+ gives you one row per file; `on-file` runs a command per file that reads the
+ file and emits as many rows as it likes.
+
+ ```toml
+ [[table]]
+ ddl = "CREATE TABLE papers (paper_id TEXT, title TEXT)"
+ glob = "**/meta.json"
+ on-file = "uv run python extract_papers.py {path}"
+ ```
+
+ For every file matched by `glob`, `dirsql` runs the command. **The command
+ reads the file itself and prints a JSON array of row objects on stdout**; each
+ object becomes one row, its fields mapped to columns:
+
+ ```json
+ [
+ { "paper_id": "arXiv:2401.001", "title": "On Directories" },
+ { "paper_id": "arXiv:2401.002", "title": "SQL All The Way Down" }
+ ]
+ ```
+
+ Placeholders substituted into the command:
+
+ | Placeholder | Value |
+ |-------------|-------|
+ | `{path}` | The matched file's path **relative to the index root**. Appended automatically when the command omits it, so `extract.py` and `extract.py {path}` behave identically. |
+ | `{abspath}` | The matched file's absolute path. |
+ | `{root}` | The index root directory. |
+
+ Filesystem facts (stat virtuals and glob captures) are still merged onto every
+ `on-file` row, so you can declare `_path`, `_basename`, `{capture}`, etc. in the
+ DDL alongside the command's own columns — a column emitted by the command wins
+ over a same-named filesystem fact.
+
+ JSON values map to SQLite as follows: `null` → NULL; `true`/`false` → `1`/`0`;
+ an integer → INTEGER, any other number → REAL; a string → TEXT; a nested array
+ or object → its JSON text as TEXT.
+
+ **Per-file error isolation.** If a file's command fails — a non-zero exit, a
+ timeout, a spawn error, or output that isn't a JSON array of objects — that
+ file is skipped (it contributes no rows) and a one-line warning naming the file
+ and the error is written to stderr. One bad file never aborts the scan; the
+ other files' rows are indexed normally.
+
+ See [Command execution](#command-execution) for the full contract (argv
+ splitting, injection safety, cwd, environment, timeout, and output framing).
+
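The JSON-to-SQLite value mapping described in this hunk can be checked with a small Python sketch. `to_sqlite` is a hypothetical helper that follows the documented table, not dirsql's Rust implementation. The compact separators used for nested values are an assumption, since the docs only say the value is stored as "its JSON text".

```python
import json
import sqlite3


def to_sqlite(value):
    """Convert one parsed JSON value to something SQLite can bind."""
    if value is None:
        return None  # null -> NULL
    if isinstance(value, bool):
        # Checked before int because bool is a subclass of int in Python.
        return int(value)  # true/false -> 1/0
    if isinstance(value, (int, float, str)):
        return value  # INTEGER, REAL, or TEXT
    # Nested array or object: stored as JSON text (exact spacing assumed).
    return json.dumps(value, separators=(",", ":"))


row = json.loads('{"a": null, "b": true, "c": 7, "d": 1.5, "e": "x", "f": [1, 2]}')
conn = sqlite3.connect(":memory:")
types = conn.execute(
    "SELECT typeof(?), typeof(?), typeof(?), typeof(?), typeof(?), typeof(?)",
    [to_sqlite(value) for value in row.values()],
).fetchone()
print(types)  # ('null', 'integer', 'integer', 'real', 'text', 'text')
```

Booleans end up as `integer` because SQLite has no separate boolean storage class, which matches the `1`/`0` mapping in the docs.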
  ### Full Example

  ```toml
@@ -209,3 +260,34 @@ glob = "**/index.md"
  ddl = "CREATE TABLE logs (_path TEXT, _size INTEGER, _mtime INTEGER)"
  glob = "logs/*.csv"
  ```
+
+ ## Command execution
+
+ Config keys that run an external command — today `on-file`, with more events to
+ follow — share one execution contract:
+
+ - **argv, not a shell.** The command string is split into an argv with
+ shell-like quoting (spaces separate arguments; quotes group them), but **no
+ shell is invoked** — there is no globbing, piping, `$VAR` expansion, or
+ `&&`/`;` chaining. To get those, ask for a shell explicitly:
+ `sh -c 'grep foo {path} | sort'` — the quoted script stays a single argument.
+ - **Injection-safe placeholders.** Each placeholder (`{path}`, `{abspath}`,
+ `{root}`, …) is substituted into whole argv tokens, every occurrence, in a
+ single left-to-right pass. A substituted value is always exactly one argv
+ element, so a path with spaces — or untrusted content that itself contains
+ `{…}` or shell metacharacters — is inert and never re-scanned. An unknown
+ `{…}` is left literal.
+ - **Working directory.** The command runs in the **config file's directory**,
+ so relative paths in the command resolve predictably regardless of where you
+ launched `dirsql`.
+ - **Environment.** The command inherits `dirsql`'s environment, so tools like
+ `uvx --with …` / `npx …` resolve their dependencies as usual.
+ - **Output framing.** The command's result is the **last non-empty line of
+ stdout**; any log/chatter lines above it are ignored. stderr is never data —
+ it is captured only to enrich error messages.
+ - **Timeout.** Each command run is bounded by a fixed **30-second** timeout (no
+ per-table override yet); a command that exceeds it is killed and treated as a
+ failure.
+ - **Errors.** A non-zero exit, a timeout, a spawn failure, or output that does
+ not parse as expected is a per-file failure: the file is skipped with a
+ stderr warning and the scan continues.
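The "output framing" rule above is easy to get wrong in an extractor, so here is a sketch of how a consumer might apply it. `parse_rows` is a hypothetical name. Treating empty output as an error is an assumption based on the docs' statement that anything other than a JSON array of objects counts as a failure.

```python
import json


def parse_rows(stdout: str) -> list[dict]:
    """Take the last non-empty stdout line and parse it as a JSON array of objects."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        # Assumption: no output at all is handled like malformed output.
        raise ValueError("command produced no output")
    rows = json.loads(lines[-1])  # raises ValueError on invalid JSON
    if not isinstance(rows, list) or not all(isinstance(row, dict) for row in rows):
        raise ValueError("expected a JSON array of objects")
    return rows


# Chatter above the result line is ignored.
print(parse_rows('loading model...\n[{"paper_id": "arXiv:2401.001"}]\n\n'))
```

Under this reading the array has to fit on one line: a pretty-printed array would leave only `]` as the last non-empty line and fail to parse. Printing with a compact serializer avoids that.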
{dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/Cargo.toml
@@ -60,10 +60,16 @@ regex.workspace = true
  rusqlite = { version = "0.34", features = ["bundled", "load_extension"] }
  serde.workspace = true
  serde_json.workspace = true
+ # Split a command *template* into argv with shell-like quoting (so
+ # `sh -c '...'` keeps its script as one arg) WITHOUT invoking a shell.
+ shlex = "1"
  thiserror.workspace = true
  tokio.workspace = true
  tokio-stream = { version = "0.1", features = ["sync"], optional = true }
  toml = "0.8"
+ # Enforce a per-command timeout: wait for the child up to a deadline,
+ # then kill it. No async runtime required in the core.
+ wait-timeout = "0.2"
  walkdir = "2"

  [[bin]]
{dirsql-0.3.43 → dirsql-0.3.45}/packages/rust/docs/cli/config.md
@@ -191,6 +191,57 @@ always filtered to the DDL's declared columns regardless. Strict mode
  applies only to keys produced by an extract callback (relevant for
  programmatic [tables](../guide/tables.md)).

+ ### Per-file commands (`on-file`)
+
+ Reach for `on-file` when a table's rows come from the *contents* of each
+ matched file, not just its path and stat metadata. A filesystem-fact table
+ gives you one row per file; `on-file` runs a command per file that reads the
+ file and emits as many rows as it likes.
+
+ ```toml
+ [[table]]
+ ddl = "CREATE TABLE papers (paper_id TEXT, title TEXT)"
+ glob = "**/meta.json"
+ on-file = "uv run python extract_papers.py {path}"
+ ```
+
+ For every file matched by `glob`, `dirsql` runs the command. **The command
+ reads the file itself and prints a JSON array of row objects on stdout**; each
+ object becomes one row, its fields mapped to columns:
+
+ ```json
+ [
+ { "paper_id": "arXiv:2401.001", "title": "On Directories" },
+ { "paper_id": "arXiv:2401.002", "title": "SQL All The Way Down" }
+ ]
+ ```
+
+ Placeholders substituted into the command:
+
+ | Placeholder | Value |
+ |-------------|-------|
+ | `{path}` | The matched file's path **relative to the index root**. Appended automatically when the command omits it, so `extract.py` and `extract.py {path}` behave identically. |
+ | `{abspath}` | The matched file's absolute path. |
+ | `{root}` | The index root directory. |
+
+ Filesystem facts (stat virtuals and glob captures) are still merged onto every
+ `on-file` row, so you can declare `_path`, `_basename`, `{capture}`, etc. in the
+ DDL alongside the command's own columns — a column emitted by the command wins
+ over a same-named filesystem fact.
+
+ JSON values map to SQLite as follows: `null` → NULL; `true`/`false` → `1`/`0`;
+ an integer → INTEGER, any other number → REAL; a string → TEXT; a nested array
+ or object → its JSON text as TEXT.
+
+ **Per-file error isolation.** If a file's command fails — a non-zero exit, a
+ timeout, a spawn error, or output that isn't a JSON array of objects — that
+ file is skipped (it contributes no rows) and a one-line warning naming the file
+ and the error is written to stderr. One bad file never aborts the scan; the
+ other files' rows are indexed normally.
+
+ See [Command execution](#command-execution) for the full contract (argv
+ splitting, injection safety, cwd, environment, timeout, and output framing).
+
  ### Full Example

  ```toml
@@ -209,3 +260,34 @@ glob = "**/index.md"
  ddl = "CREATE TABLE logs (_path TEXT, _size INTEGER, _mtime INTEGER)"
  glob = "logs/*.csv"
  ```
+
+ ## Command execution
+
+ Config keys that run an external command — today `on-file`, with more events to
+ follow — share one execution contract:
+
+ - **argv, not a shell.** The command string is split into an argv with
+ shell-like quoting (spaces separate arguments; quotes group them), but **no
+ shell is invoked** — there is no globbing, piping, `$VAR` expansion, or
+ `&&`/`;` chaining. To get those, ask for a shell explicitly:
+ `sh -c 'grep foo {path} | sort'` — the quoted script stays a single argument.
+ - **Injection-safe placeholders.** Each placeholder (`{path}`, `{abspath}`,
+ `{root}`, …) is substituted into whole argv tokens, every occurrence, in a
+ single left-to-right pass. A substituted value is always exactly one argv
+ element, so a path with spaces — or untrusted content that itself contains
+ `{…}` or shell metacharacters — is inert and never re-scanned. An unknown
+ `{…}` is left literal.
+ - **Working directory.** The command runs in the **config file's directory**,
+ so relative paths in the command resolve predictably regardless of where you
+ launched `dirsql`.
+ - **Environment.** The command inherits `dirsql`'s environment, so tools like
+ `uvx --with …` / `npx …` resolve their dependencies as usual.
+ - **Output framing.** The command's result is the **last non-empty line of
+ stdout**; any log/chatter lines above it are ignored. stderr is never data —
+ it is captured only to enrich error messages.
+ - **Timeout.** Each command run is bounded by a fixed **30-second** timeout (no
+ per-table override yet); a command that exceeds it is killed and treated as a
+ failure.
+ - **Errors.** A non-zero exit, a timeout, a spawn failure, or output that does
+ not parse as expected is a per-file failure: the file is skipped with a
+ stderr warning and the scan continues.
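The working-directory, environment, timeout, and error rules above combine into a simple run loop. The sketch below uses Python's `subprocess.run` to show that shape; dirsql itself is Rust and, per the `Cargo.toml` comment, uses the `wait-timeout` crate instead. `run_command`, `scan`, and the warning format are hypothetical.

```python
import subprocess
import sys

TIMEOUT_SECONDS = 30.0  # the fixed limit stated in the docs


def run_command(argv: list[str], cwd: str, timeout: float = TIMEOUT_SECONDS) -> str:
    """Run one command without a shell and return its stdout."""
    done = subprocess.run(
        argv,
        cwd=cwd,              # documented: the config file's directory
        capture_output=True,  # stderr is kept only for error messages
        text=True,
        timeout=timeout,      # on expiry the child is killed, TimeoutExpired raised
        check=True,           # non-zero exit raises CalledProcessError
    )                         # env is not passed, so the child inherits ours
    return done.stdout


def scan(commands: dict[str, list[str]], cwd: str, timeout: float = TIMEOUT_SECONDS) -> dict[str, str]:
    """Run every file's command; a failing file is skipped with a warning."""
    outputs = {}
    for path, argv in commands.items():
        try:
            outputs[path] = run_command(argv, cwd, timeout)
        except (OSError, subprocess.SubprocessError) as error:
            # Spawn failure, non-zero exit, or timeout: warn and keep going.
            print(f"warning: skipping {path}: {error}", file=sys.stderr)
    return outputs
```

Catching the error inside the loop, rather than around it, is what keeps one bad file from aborting the rest of the scan.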