doctrail 0.3.1__py3-none-any.whl

Files changed (175)
  1. doctrail/__init__.py +1 -0
  2. doctrail/_docs/SKILL.md +319 -0
  3. doctrail/_docs/llms.txt +1272 -0
  4. doctrail/_examples/tutorial/.doctrail/config.yml +30 -0
  5. doctrail/_examples/tutorial/.doctrail/enrichments/country_mentions.yml +18 -0
  6. doctrail/_examples/tutorial/.doctrail/enrichments/country_stance.yml +11 -0
  7. doctrail/_examples/tutorial/.doctrail/enrichments/econ_threat.yml +30 -0
  8. doctrail/_examples/tutorial/.doctrail/enrichments/mentions_climate.yml +11 -0
  9. doctrail/_examples/tutorial/.doctrail/enrichments/optimism.yml +11 -0
  10. doctrail/_examples/tutorial/.doctrail/enrichments/securitization.yml +28 -0
  11. doctrail/_examples/tutorial/.doctrail/enrichments/test.yml +27 -0
  12. doctrail/_examples/tutorial/.doctrail/replay/country_mentions.jsonl +10 -0
  13. doctrail/_examples/tutorial/.doctrail/replay/country_stance.jsonl +20 -0
  14. doctrail/_examples/tutorial/.doctrail/replay/econ_threat.jsonl +10 -0
  15. doctrail/_examples/tutorial/.doctrail/replay/mentions_climate.jsonl +20 -0
  16. doctrail/_examples/tutorial/.doctrail/replay/optimism.jsonl +20 -0
  17. doctrail/_examples/tutorial/.doctrail/replay/securitization.jsonl +10 -0
  18. doctrail/_examples/tutorial/.doctrail/replay/test.jsonl +18 -0
  19. doctrail/_examples/tutorial/.doctrail/views/country_mentions.yml +14 -0
  20. doctrail/_examples/tutorial/.doctrail/views/econ_threat.yml +13 -0
  21. doctrail/_examples/tutorial/README.md +41 -0
  22. doctrail/_examples/tutorial/corpus/federalist/federalist_01.html +9 -0
  23. doctrail/_examples/tutorial/corpus/federalist/federalist_02.docx +0 -0
  24. doctrail/_examples/tutorial/corpus/federalist/federalist_06.docx +0 -0
  25. doctrail/_examples/tutorial/corpus/federalist/federalist_09.docx +0 -0
  26. doctrail/_examples/tutorial/corpus/federalist/federalist_10.docx +0 -0
  27. doctrail/_examples/tutorial/corpus/federalist/federalist_14.docx +0 -0
  28. doctrail/_examples/tutorial/corpus/federalist/federalist_49.pdf +99 -0
  29. doctrail/_examples/tutorial/corpus/federalist/federalist_50.pdf +99 -0
  30. doctrail/_examples/tutorial/corpus/federalist/federalist_51.pdf +99 -0
  31. doctrail/_examples/tutorial/corpus/federalist/federalist_52.pdf +99 -0
  32. doctrail/_examples/tutorial/corpus/federalist/federalist_53.pdf +99 -0
  33. doctrail/_examples/tutorial/corpus/federalist/federalist_54.pdf +99 -0
  34. doctrail/_examples/tutorial/corpus/federalist/federalist_55.pdf +99 -0
  35. doctrail/_examples/tutorial/corpus/federalist/federalist_56.pdf +99 -0
  36. doctrail/_examples/tutorial/corpus/federalist/federalist_57.pdf +99 -0
  37. doctrail/_examples/tutorial/corpus/federalist/federalist_58.pdf +99 -0
  38. doctrail/_examples/tutorial/corpus/federalist/federalist_62.html +9 -0
  39. doctrail/_examples/tutorial/corpus/federalist/federalist_63.html +9 -0
  40. doctrail/_examples/tutorial/corpus/gt_editorials/README.md +29 -0
  41. doctrail/_examples/tutorial/corpus/gt_editorials/gt_2012_philippines.txt +25 -0
  42. doctrail/_examples/tutorial/corpus/gt_editorials/gt_2013_eu_wine.txt +23 -0
  43. doctrail/_examples/tutorial/corpus/gt_editorials/gt_2013_north_korea.txt +29 -0
  44. doctrail/_examples/tutorial/corpus/gt_editorials/gt_2016_south_korea_thaad.txt +31 -0
  45. doctrail/_examples/tutorial/corpus/gt_editorials/gt_2017_australia.txt +23 -0
  46. doctrail/_examples/tutorial/corpus/gt_editorials/gt_2018_canada_meng.txt +29 -0
  47. doctrail/_examples/tutorial/corpus/gt_editorials/gt_2018_us_trade_war.txt +25 -0
  48. doctrail/_examples/tutorial/corpus/gt_editorials/gt_2021_afghanistan.txt +23 -0
  49. doctrail/_examples/tutorial/corpus/gt_editorials/gt_2021_europe.txt +21 -0
  50. doctrail/_examples/tutorial/corpus/gt_editorials/gt_2021_us_taiwan.txt +29 -0
  51. doctrail/_examples/tutorial/corpus/manifest.json +290 -0
  52. doctrail/_examples/tutorial/corpus/un_speeches/au_general_debate_2023.html +4 -0
  53. doctrail/_examples/tutorial/corpus/un_speeches/br_general_debate_2023.pdf +80 -0
  54. doctrail/_examples/tutorial/corpus/un_speeches/ca_general_debate_2023.docx +0 -0
  55. doctrail/_examples/tutorial/corpus/un_speeches/fj_general_debate_2023.pdf +80 -0
  56. doctrail/_examples/tutorial/corpus/un_speeches/ie_general_debate_2023.html +3 -0
  57. doctrail/_examples/tutorial/corpus/un_speeches/in_general_debate_2023.pdf +80 -0
  58. doctrail/_examples/tutorial/corpus/un_speeches/jp_general_debate_2023.docx +0 -0
  59. doctrail/_examples/tutorial/corpus/un_speeches/ke_general_debate_2023.docx +0 -0
  60. doctrail/_examples/tutorial/corpus/un_speeches/ng_general_debate_2023.html +4 -0
  61. doctrail/_examples/tutorial/corpus/un_speeches/za_general_debate_2023.pdf +80 -0
  62. doctrail/cli/__init__.py +32 -0
  63. doctrail/cli/__main__.py +7 -0
  64. doctrail/cli/enrich.py +610 -0
  65. doctrail/cli/export.py +51 -0
  66. doctrail/cli/icr.py +295 -0
  67. doctrail/cli/ingest.py +306 -0
  68. doctrail/cli/main.py +2024 -0
  69. doctrail/cli/models.py +505 -0
  70. doctrail/cli/query.py +575 -0
  71. doctrail/cli/review.py +76 -0
  72. doctrail/cli/serve.py +82 -0
  73. doctrail/cli/utils.py +162 -0
  74. doctrail/cli/view.py +5 -0
  75. doctrail/cli.py +19 -0
  76. doctrail/config/__init__.py +6 -0
  77. doctrail/config/config_manager.py +213 -0
  78. doctrail/config/validators.py +189 -0
  79. doctrail/constants.py +105 -0
  80. doctrail/core.py +30 -0
  81. doctrail/core_runtime/__init__.py +22 -0
  82. doctrail/core_runtime/batch.py +778 -0
  83. doctrail/core_runtime/commands.py +1062 -0
  84. doctrail/core_runtime/enrichment.py +946 -0
  85. doctrail/core_runtime/shared.py +796 -0
  86. doctrail/core_utils.py +584 -0
  87. doctrail/cost_estimation.py +33 -0
  88. doctrail/db_operations.py +8 -0
  89. doctrail/db_ops/__init__.py +24 -0
  90. doctrail/db_ops/audit_runs.py +868 -0
  91. doctrail/db_ops/common.py +595 -0
  92. doctrail/db_ops/enrichments.py +1722 -0
  93. doctrail/db_ops/migrations.py +421 -0
  94. doctrail/db_ops/views.py +1470 -0
  95. doctrail/enrichment_config.py +356 -0
  96. doctrail/export_operations.py +170 -0
  97. doctrail/extractors/__init__.py +1 -0
  98. doctrail/extractors/djvu_extractor.py +73 -0
  99. doctrail/extractors/doc_extractor.py +62 -0
  100. doctrail/extractors/docx_extractor.py +116 -0
  101. doctrail/extractors/epub_extractor.py +198 -0
  102. doctrail/extractors/html_extractor.py +71 -0
  103. doctrail/extractors/mhtml-to-html.py +428 -0
  104. doctrail/extractors/mhtml_extractor.py +644 -0
  105. doctrail/extractors/mobi_extractor.py +59 -0
  106. doctrail/extractors/pdf_extractor.py +256 -0
  107. doctrail/extractors/presentation_extractor.py +173 -0
  108. doctrail/extractors/smart_html_extractor.py +217 -0
  109. doctrail/extractors/spreadsheet_extractor.py +482 -0
  110. doctrail/file_filters.py +207 -0
  111. doctrail/ingest/__init__.py +20 -0
  112. doctrail/ingest/base.py +11 -0
  113. doctrail/ingest/core.py +722 -0
  114. doctrail/ingest/database.py +214 -0
  115. doctrail/ingest/document_processor.py +1274 -0
  116. doctrail/ingest/extractors.py +47 -0
  117. doctrail/ingest/file_utils.py +119 -0
  118. doctrail/ingest/manifest.py +109 -0
  119. doctrail/ingest/text_processing.py +196 -0
  120. doctrail/ingester.py +40 -0
  121. doctrail/llm/__init__.py +5 -0
  122. doctrail/llm/client.py +160 -0
  123. doctrail/llm/token_utils.py +120 -0
  124. doctrail/llm_operations.py +1719 -0
  125. doctrail/llm_providers/__init__.py +46 -0
  126. doctrail/llm_providers/anthropic_provider.py +568 -0
  127. doctrail/llm_providers/claude_sdk_provider.py +272 -0
  128. doctrail/llm_providers/cli_provider.py +478 -0
  129. doctrail/llm_providers/factory.py +154 -0
  130. doctrail/llm_providers/gemini_provider.py +705 -0
  131. doctrail/llm_providers/openai_provider.py +551 -0
  132. doctrail/llm_providers/replay_provider.py +155 -0
  133. doctrail/main.py +20 -0
  134. doctrail/plugins/README.md +166 -0
  135. doctrail/plugins/__init__.py +135 -0
  136. doctrail/plugins/_chinese_converter.py +190 -0
  137. doctrail/plugins/doi_connector.py +636 -0
  138. doctrail/plugins/example_custom.py +145 -0
  139. doctrail/plugins/zotero.py +812 -0
  140. doctrail/plugins/zotero_connector.py +61 -0
  141. doctrail/plugins/zotero_ingester.py +660 -0
  142. doctrail/presets/__init__.py +2 -0
  143. doctrail/presets/document_type.yml +25 -0
  144. doctrail/presets/extract_entities.yml +34 -0
  145. doctrail/presets/keywords.yml +23 -0
  146. doctrail/presets/language.yml +24 -0
  147. doctrail/presets/relevance.yml +24 -0
  148. doctrail/presets/research_methods.yml +38 -0
  149. doctrail/presets/sentiment.yml +21 -0
  150. doctrail/presets/summarize.yml +20 -0
  151. doctrail/pydantic_schema.py +499 -0
  152. doctrail/review_server.py +549 -0
  153. doctrail/schema_managers.py +455 -0
  154. doctrail/search.py +629 -0
  155. doctrail/server.py +954 -0
  156. doctrail/server_config.py +296 -0
  157. doctrail/server_ingestor/__init__.py +11 -0
  158. doctrail/server_ingestor/app.py +1205 -0
  159. doctrail/templates/parallel-translation.md +13 -0
  160. doctrail/types.py +104 -0
  161. doctrail/utils/__init__.py +1 -0
  162. doctrail/utils/build_documentation.py +289 -0
  163. doctrail/utils/cost_estimation.py +561 -0
  164. doctrail/utils/dependency_check.py +72 -0
  165. doctrail/utils/logging_config.py +135 -0
  166. doctrail/utils/model_pricing.py +778 -0
  167. doctrail/utils/progress.py +89 -0
  168. doctrail/utils/query_utils.py +103 -0
  169. doctrail/utils/simple_error_handler.py +90 -0
  170. doctrail/utils/validate_config.py +217 -0
  171. doctrail-0.3.1.dist-info/METADATA +103 -0
  172. doctrail-0.3.1.dist-info/RECORD +175 -0
  173. doctrail-0.3.1.dist-info/WHEEL +4 -0
  174. doctrail-0.3.1.dist-info/entry_points.txt +2 -0
  175. doctrail-0.3.1.dist-info/licenses/LICENSE +21 -0
doctrail/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # Doctrail - SQLite database enrichment tool using LLMs
doctrail/_docs/SKILL.md ADDED
@@ -0,0 +1,319 @@
1
+ ---
2
+ name: doctrail
3
+ description: "Initialize and operate Doctrail projects: ingest document corpora into SQLite, define SQL-scoped YAML enrichments, run sync or provider-batch LLM coding, inspect normalized audit/enrichment storage through run/pivot/spec views, compare model coders, and finalize human-review datasets."
4
+ allowed-tools: Bash
5
+ ---
6
+
7
+ # Doctrail
8
+
9
+ Use this when the user wants to start or operate a Doctrail project: ingest files into SQLite, create or edit enrichment YAML, run LLM coding over SQL-selected rows, inspect run history, build review or analysis views, compare model coders, or repair normalized storage.
10
+
11
+ This skill is the operating doctrine: the mental model, the judgment rules, and worked examples. The complete reference — every YAML key, every command and flag, the storage contract — ships inside the package: `doctrail docs` prints the whole manual to stdout, offline. When you need a detail not covered here, read the manual rather than guessing.
12
+
13
+ ## Core mental model
14
+
15
+ Every answer Doctrail has ever produced lives in one long table, `_enrichments`: one row per document, enrichment, field, model, and prompt version. Three kinds of object share the database, distinguishable at a glance:
16
+
17
+ - Source tables or views keep the original inputs. `documents` is only a default scaffold name.
18
+ - Tables prefixed `_` are Doctrail's ledger: `_enrichments` (parsed answers, long form), `_enrichment_audit` (every raw model call and response, plus the exact projection payload used to populate `_enrichments`), `_enrichment_runs` and `_enrichment_run_items` (provenance and the exact input rowset).
19
+ - Views prefixed `v_` are what people actually read: `v_run_*` shows one specific run in wide form, `v_final_*` applies human overrides to a run, and `doctrail view pivot` builds reusable wide views over `_enrichments`.
20
+
21
+ Rule: store long, inspect wide.
22
+
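A minimal sketch of "store long, inspect wide", using Python's standard `sqlite3` module and an in-memory database. Only `key_value` and `enrichment_name` are column names taken from the text; the table name `demo_enrichments`, the columns `field_name` and `value`, and the view `demo_wide` are assumptions made for illustration, and the view is written by hand rather than produced by `doctrail view pivot`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical long-form ledger: one row per document, enrichment, and field.
conn.execute(
    "CREATE TABLE demo_enrichments ("
    "key_value TEXT, enrichment_name TEXT, field_name TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO demo_enrichments VALUES (?, ?, ?, ?)",
    [
        ("doc1", "classify", "category", "trade"),
        ("doc1", "classify", "confidence", "high"),
        ("doc2", "classify", "category", "security"),
        ("doc2", "classify", "confidence", "low"),
    ],
)

# Wide view: one row per document, one column per field.
conn.execute(
    """
    CREATE VIEW demo_wide AS
    SELECT key_value,
           MAX(CASE WHEN field_name = 'category' THEN value END) AS category,
           MAX(CASE WHEN field_name = 'confidence' THEN value END) AS confidence
    FROM demo_enrichments
    WHERE enrichment_name = 'classify'
    GROUP BY key_value
    """
)

wide_rows = conn.execute("SELECT * FROM demo_wide ORDER BY key_value").fetchall()
```

The long table can accept new fields without a schema change, while the view gives readers the one-row-per-document shape they expect.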
23
+ ## Happy path
24
+
25
+ 1. Inspect the database or current project.
26
+ 2. Prefer `doctrail init` if starting fresh.
27
+ 3. Use `doctrail new` or patch `.doctrail/enrichments/*.yml` to define the enrichment.
28
+ 4. Ingest if needed.
29
+ 5. Run `doctrail enrich <name> --dry-run`.
30
+ 6. Run `doctrail enrich <name> --limit 5`.
31
+ 7. Inspect with `doctrail runs`, `doctrail view create --run-id <run_id>`, `doctrail view spec`, or `doctrail view pivot`.
32
+ 8. Iterate with `doctrail edit <name>` or direct YAML edits until the prompt and schema are stable.
33
+ 9. Run the full corpus, using `--execution-mode batch` when direct provider batch mode is appropriate.
34
+ 10. Materialize the review surface or final editable table with `doctrail view create`, `doctrail view spec`, `doctrail view pivot`, or `doctrail finalize`.
35
+ 11. Export and import overrides or run ICR if needed.
36
+
37
+ `doctrail run` is an alias of `doctrail enrich`; the two are interchangeable everywhere.
38
+
39
+ ## Project setup
40
+
41
+ Prefer using the same CLI that a human user would use.
42
+
43
+ ```bash
44
+ doctrail init
45
+ doctrail new
46
+ doctrail ingest --input-dir docs/ --yes
47
+ doctrail enrich language --dry-run
48
+ doctrail enrich language --limit 5
49
+ ```
50
+
51
+ If `doctrail init` already created `.doctrail/config.yml` and `.doctrail/enrichments/*.yml`, patch those files rather than inventing parallel config formats.
52
+
53
+ ## Offline runs and demos (replay)
54
+
55
+ The model name `replay` (or `replay/<label>`) returns canned responses from `.doctrail/replay/<enrichment>.jsonl` fixtures instead of calling an API. Everything downstream — audit rows, `_enrichments`, views, runs, ICR — is the real pipeline, so replay runs are how you exercise a config end to end without an API key. Distinct labels act as distinct coders for ICR (`-m replay/coder-a -m replay/coder-b`).
56
+
57
+ `doctrail init test` scaffolds a complete canned tutorial workspace into the current directory: small public corpora, a pre-ingested `out/database.db`, prepared enrichment configs, and replay fixtures. Use it to walk a new user through the whole loop offline.
58
+
59
+ ## Before running an enrichment
60
+
61
+ - Inspect the source schema first:
62
+
63
+ ```bash
64
+ sqlite3 /path/to/db.db ".tables"
65
+ sqlite3 /path/to/db.db ".schema your_source_table"
66
+ sqlite3 /path/to/db.db "SELECT * FROM your_source_table LIMIT 3"
67
+ ```
68
+
69
+ - Confirm the key column. It must be present in the input query.
70
+ - Do not assume the source table is named `documents`. The real contract is: valid SQL input query + stable key column + schema.
71
+ - Keep prompts and schema small on the first pass.
72
+ - Use `--output-db` when the user wants a non-destructive run into a separate database.
73
+ - For targeted reruns, prefer `--where "..."` to filter the enrichment's existing query; use `--query` only when replacing the SQL entirely.
74
+ - Default to `--dedupe-scope query` unless the user explicitly wants prompt-family reuse.
75
+ - Append-mode dedupe is success-based: a row is only "already done" when a successful normalized result exists in `_enrichments` for the current dedupe scope.
76
+ - Null answers count as answered. A field the model explicitly returned as null is stored as a row with a NULL value, and dedupe treats that row as complete; rerunning without `--overwrite` will not re-ask it.
77
+ - Raw attempt history in `_enrichment_audit` and per-run snapshots in `_enrichment_run_items` are provenance, not completion signals. Error-only rows should retry without `--overwrite`.
78
+ - Use `--overwrite` to force reruns of rows that already succeeded.
79
+
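The dedupe rules above can be illustrated with a small sketch. It assumes hypothetical stand-in tables (`source_docs`, `demo_enrichments`, `demo_audit`) and a single enrichment named `classify`; it is not Doctrail's own query, only a way to see why a stored null counts as done while an error-only attempt does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_docs (doc_id TEXT PRIMARY KEY);
    -- Hypothetical stand-ins for the ledger tables; the real schemas differ.
    CREATE TABLE demo_enrichments (key_value TEXT, enrichment_name TEXT, value TEXT);
    CREATE TABLE demo_audit (key_value TEXT, enrichment_name TEXT, error TEXT);

    INSERT INTO source_docs VALUES ('a'), ('b'), ('c'), ('d');
    INSERT INTO demo_enrichments VALUES ('a', 'classify', 'trade');
    INSERT INTO demo_enrichments VALUES ('b', 'classify', NULL);  -- explicit null answer
    INSERT INTO demo_audit VALUES ('a', 'classify', NULL);
    INSERT INTO demo_audit VALUES ('b', 'classify', NULL);
    INSERT INTO demo_audit VALUES ('c', 'classify', 'timeout');   -- error-only attempt
    """
)

# A row is pending when no normalized result exists for it, whatever the audit log says.
pending = [
    row[0]
    for row in conn.execute(
        """
        SELECT s.doc_id
        FROM source_docs AS s
        WHERE NOT EXISTS (
            SELECT 1
            FROM demo_enrichments AS e
            WHERE e.key_value = s.doc_id
              AND e.enrichment_name = 'classify'
        )
        ORDER BY s.doc_id
        """
    )
]
```

Here `b` is skipped because its null answer is stored as a row, `c` is retried because only its failed attempt was logged, and `d` has never been attempted.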
80
+ ## Enrichment config rules
81
+
82
+ - `input.query` may be a named SQL query or inline SQL.
83
+ - `input.input_columns` controls what the model sees.
84
+ - Use `table.column` for multi-table inputs.
85
+ - Use `:N` suffix to truncate large fields, for example `raw_content:3000`.
86
+ - `output_column` can still name a single extracted field.
87
+ - For short screening tasks, use `pack_size` to group multiple rows into one structured call.
88
+ - `pack_response_mode: selected_indexes` is the compact mode for a single boolean field. The model returns zero-based matching item indexes only, and Doctrail unpacks them into ordinary row-level `_enrichment_audit` and `_enrichments` rows.
89
+ - `pack_response_mode: exhaustive` returns one structured result per packed item. Prefer it for multi-field schemas or when omission risk matters more than output-token savings.
90
+ - Packed prompt shaping is separate from `--execution-mode batch`. Today `pack_size` is sync-only.
91
+ - Packed mode is most appropriate for short inputs like titles, abstracts, or snippets. It is a poor default for full-document prompts.
92
+ - Design schemas so the common answer is cheap. For rare-hit screening over a large corpus, prefer a `false`/`0`/small-enum default over nullable prose fields, and combine it with `pack_size` so non-hits cost almost no output tokens.
93
+ - Older configs may include `output_table`, but the core storage model is `_enrichments` plus derived views. Do not describe `output_table` as the primary storage contract.
94
+ - Do not name enrichment output fields the same as source table columns. If a schema field matches a source column, Doctrail will error. Rename the field (e.g., `fulltext_cleaned` instead of `fulltext_clean`) or pass `--allow-column-collision` to proceed (the source column becomes `<name>_input` in views).
95
+
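As an illustration of the `:N` suffix, the sketch below parses a column spec and truncates the value. The helper names are hypothetical, and it assumes N counts characters; the source does not say which unit Doctrail uses, so treat that as an assumption rather than documented behavior.

```python
from typing import Dict, Optional, Tuple


def parse_input_column(spec: str) -> Tuple[str, Optional[int]]:
    """Split 'name:N' into (name, N); a bare name has no limit."""
    name, sep, limit = spec.rpartition(":")
    if sep and limit.isdigit():
        return name, int(limit)
    return spec, None


def render_column(spec: str, row: Dict[str, str]) -> str:
    """Return the row's value for the column, cut to N characters if N is given."""
    name, limit = parse_input_column(spec)
    value = row[name]
    return value if limit is None else value[:limit]


row = {"title": "Federalist No. 10", "raw_content": "abcdefghij"}
short = render_column("raw_content:4", row)
full = render_column("title", row)
```

Truncating at render time leaves the stored source text untouched, which is the point of controlling input size in the config rather than in the data.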
96
+ ## Prompt construction
97
+
98
+ - Write the prompt as a codebook, the way you would brief a research assistant: define every enum value, anchor every scale point (`0 = no mention; 5 = explicit existential threat`), and state explicitly when gated fields must be null. Vague constructs produce coder disagreement, with models exactly as with humans; if ICR comes back low, fix the codebook before blaming the model.
99
+ - Provider prompt caching is prefix-based. OpenAI caches eligible matching prompt prefixes automatically. Gemini implicit caching is automatic on Gemini 2.5 and newer models, but Google does not guarantee savings on every request. In both cases, cache reuse depends on the longest identical prefix counted from the top of the request.
100
+ - Therefore the rule is: everything static first, everything per-row last. Doctrail's default renderer puts the prompt first. The schema follows: on JSON-mode paths it is appended to the prompt as instructions, and otherwise it is sent as a provider-native schema payload that stays constant across requests. The per-row `input_columns` content is appended at the end. When OpenAI or Gemini report a cache hit, a long codebook prefix can be billed at cached-input rates. Doctrail does not currently set Anthropic `cache_control` markers, so do not count on Anthropic prompt-cache discounts from this layout.
101
+ - `{column}` placeholders inside the prompt can break this. Each substitution makes the request differ from row to row at that point, reducing cache reuse for everything after it. Feed per-row content through `input_columns` instead; if you genuinely must interpolate, put the placeholder at the end of the prompt so the static prefix above it still matches.
102
+ - Control input size with `:N` truncation on `input_columns` (e.g. `raw_content:3000`) rather than editing source data.
103
+
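To make the prefix rule concrete, this sketch compares two hypothetical prompt layouts and measures how many leading characters two rendered requests share. The codebook text and function names are invented for the example, and character count is only a rough proxy: real provider caches work on tokens and apply their own eligibility rules.

```python
import os

# Invented codebook text standing in for a long static prompt.
CODEBOOK = "Code each editorial. 0 = no mention; 5 = explicit existential threat.\n"


def static_first(text: str) -> str:
    """Static codebook first, per-row content last."""
    return CODEBOOK + "Document:\n" + text


def row_first(text: str) -> str:
    """Per-row content interpolated before the codebook."""
    return "Document:\n" + text + "\n" + CODEBOOK


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest identical prefix of two rendered prompts."""
    return len(os.path.commonprefix([a, b]))


row_a = "Tariffs rose sharply."
row_b = "Negotiations resumed."

good = shared_prefix_len(static_first(row_a), static_first(row_b))
bad = shared_prefix_len(row_first(row_a), row_first(row_b))
```

With the static layout the whole codebook is shared between rows; with the row-first layout the shared prefix ends at the first differing character of the document text.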
104
+ Example packed boolean screen:
105
+
106
+ ```yaml
107
+ enrichments:
108
+   - name: packed_relevance
109
+     input:
110
+       query: all_docs
111
+       input_columns: [raw_content:240]
112
+     prompt: |
113
+       Return the items that mention sanctions or trade restrictions.
114
+     output_column: is_relevant
115
+     schema: {type: "boolean"}
116
+     pack_size: 10
117
+     pack_response_mode: selected_indexes
118
+ ```
119
+
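The unpacking step that `selected_indexes` implies can be sketched as follows. The function name and the key format are hypothetical; the sketch only shows how a list of zero-based matching indexes expands into one boolean per packed row, with unlisted rows becoming `False`.

```python
from typing import Dict, List


def unpack_selected_indexes(keys: List[str], selected: List[int]) -> Dict[str, bool]:
    """Expand zero-based matching indexes into one boolean per packed row."""
    out_of_range = [i for i in selected if not 0 <= i < len(keys)]
    if out_of_range:
        # Reject indexes that do not refer to any item in the pack.
        raise ValueError(f"indexes outside the pack: {out_of_range}")
    chosen = set(selected)
    return {key: (i in chosen) for i, key in enumerate(keys)}


pack = ["doc-01", "doc-02", "doc-03", "doc-04"]
result = unpack_selected_indexes(pack, [0, 2])
no_hits = unpack_selected_indexes(pack, [])
```

An empty index list is a valid answer and marks every row in the pack as a non-hit, which is why rare-hit screens cost so few output tokens in this mode.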
120
+ ## Iterative workflow
121
+
122
+ Use this workflow for prompt development:
123
+
124
+ ```bash
125
+ doctrail enrich classify --dry-run
126
+ doctrail enrich classify --limit 10
127
+ doctrail runs
128
+ doctrail view create --run-id <run_id>
129
+ ```
130
+
131
+ Then revise the YAML and rerun on another small sample. Once the prompt is stable:
132
+
133
+ ```bash
134
+ doctrail enrich classify
135
+ doctrail view create --run-id <final_run_id>
136
+ ```
137
+
138
+ If the user wants a reusable wide analysis surface instead of a single-run snapshot:
139
+
140
+ ```bash
141
+ doctrail view pivot review_surface -e classify --include "title,raw_content:500"
142
+ ```
143
+
144
+ ## Provider batch workflow
145
+
146
+ Use `--execution-mode batch` for direct provider batch runs. The older `--execution-mode openai-batch` spelling still works as a compatibility alias, but do not teach it as the current command.
147
+
148
+ - Direct OpenAI model ids must appear in Doctrail's verified OpenAI batch catalog.
149
+ - That catalog is fetched from the official OpenAI model docs, not guessed from a static local allowlist.
150
+ - A model is only accepted for this path if the docs page shows both `v1/batch` and `v1/chat/completions`.
151
+ - The catalog stores batch input, cached-input, and output prices per 1M tokens, documented snapshots, and the fetch date.
152
+ - Direct OpenAI batch cost estimation uses that catalog, not the normal sync pricing table.
153
+
154
+ Inspect or refresh the catalog with:
155
+
156
+ ```bash
157
+ doctrail models --openai-batch
158
+ doctrail models --openai-batch --refresh
159
+ ```
160
+
161
+ Operational rule:
162
+ - if a direct OpenAI model is outside the verified batch catalog, Doctrail should refuse the batch run instead of silently falling back
163
+ - if the docs are unreachable, Doctrail may use the cached or bootstrap catalog, and the CLI should make that visible
164
+
165
+ Direct batch backends behind `--execution-mode batch`:
166
+ - direct OpenAI bare model ids use OpenAI's `/v1/batches` API with request lines targeting `/v1/chat/completions`
167
+ - direct Anthropic model ids like `claude-*` or `anthropic/*` use Anthropic's `/v1/messages/batches` API with request params targeting `/v1/messages`
168
+ - direct Gemini model ids like `gemini-*` or `models/gemini-*` use Gemini's File API plus `v1beta/models/{model}:batchGenerateContent`
169
+ - the verified OpenAI batch catalog only governs the direct OpenAI branch; Anthropic and Gemini do not use that catalog
170
+ - submit with `doctrail enrich <name> --execution-mode batch`
171
+ - reconcile later with `doctrail batch poll --run-id <run_id>` or `doctrail batch watch --run-id <run_id>`
172
+ - after reconciliation, successful rows land in both `_enrichment_audit` and `_enrichments`
173
+ - failed rows may still land in `_enrichment_audit`, but remain eligible for retry on the next append-mode run
174
+
175
+ To monitor a live batch run, use `doctrail batch watch --run-id <run_id>` for a live feed, or query the shard table directly for a precise picture:
176
+
177
+ ```bash
178
+ sqlite3 /path/to/db.db "
179
+ SELECT id, status, request_count, completed_count, failed_count,
180
+ output_file_id IS NOT NULL AS has_output,
181
+ error_file_id IS NOT NULL AS has_errors,
182
+ last_polled_at, completed_at, reconciled_at
183
+ FROM _enrichment_batch_jobs
184
+ WHERE run_id = 'your_run_id'
185
+ ORDER BY id
186
+ "
187
+ ```
188
+
189
+ Large runs are sharded into multiple provider jobs — one row per shard. A run is fully done when every shard has `reconciled_at` set; until then `completed_count` climbs toward `request_count` per shard, and rows land in `_enrichments` only at reconciliation (`batch poll`/`batch watch`), not while the provider is still processing.
190
+
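The completion rule for sharded runs can be checked programmatically. The sketch below builds a throwaway table with the column names used in the monitoring query above; the column types, the `status` strings, and the timestamps are assumptions for the demo, and `run_is_done` is a hypothetical helper rather than part of Doctrail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names follow the monitoring query; the types are assumed.
conn.execute(
    "CREATE TABLE _enrichment_batch_jobs ("
    "id INTEGER PRIMARY KEY, run_id TEXT, status TEXT, "
    "request_count INTEGER, completed_count INTEGER, failed_count INTEGER, "
    "reconciled_at TEXT)"
)
conn.executemany(
    "INSERT INTO _enrichment_batch_jobs VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "run-1", "completed", 500, 500, 0, "2025-01-01T10:00:00"),
        (2, "run-1", "in_progress", 500, 320, 0, None),
    ],
)


def run_is_done(db: sqlite3.Connection, run_id: str) -> bool:
    """True when the run has shards and every shard has reconciled_at set."""
    total, pending = db.execute(
        "SELECT COUNT(*), COALESCE(SUM(reconciled_at IS NULL), 0) "
        "FROM _enrichment_batch_jobs WHERE run_id = ?",
        (run_id,),
    ).fetchone()
    return total > 0 and pending == 0


before = run_is_done(conn, "run-1")

# Simulate the second shard being reconciled.
conn.execute(
    "UPDATE _enrichment_batch_jobs "
    "SET reconciled_at = '2025-01-01T11:00:00', status = 'completed', "
    "completed_count = 500 WHERE id = 2"
)
after = run_is_done(conn, "run-1")
```

Checking `reconciled_at` rather than `status` matches the rule that rows reach `_enrichments` only at reconciliation.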
191
+ Practical recovery rule for partial batch failures:
192
+
193
+ ```bash
194
+ doctrail batch poll --run-id <run_id>
195
+ doctrail enrich <name>
196
+ ```
197
+
198
+ Use `--overwrite` only if you want to rerun rows that already succeeded.
199
+
200
+ GPT-5 operational note:
201
+ - direct OpenAI `gpt-5*` chat-completions runs default to `reasoning_effort=minimal` unless the enrichment config overrides it
202
+ - allowed values are `minimal`, `low`, `medium`, `high`
203
+ - use that knob deliberately for cost control on both sync and batch runs
204
+
205
+ ## Views
206
+
207
+ After a run, Doctrail also maintains an automatic wide view over the source table (e.g. `v_documents_enriched`): one row per document, one column per field, latest value wins. For anything beyond that default, use the right view type for the task.
208
+
209
+ - `doctrail view create --run-id <run_id>`
210
+   - exact snapshot of one run (`v_run_*`)
211
+   - best for pilot runs, final runs, and human review
212
+ - `doctrail view create <enrichment>`
213
+   - latest run for that enrichment
214
+   - convenient but less explicit than `--run-id`
215
+ - `doctrail view pivot <name> -e <enrichment>`
216
+   - reusable wide view over normalized enrichment data
217
+   - good for coding surfaces and ICR comparison
218
+ - `doctrail view spec <name>`
219
+   - YAML-driven review surface for a common workflow
220
+   - supports source columns, extra enrichment columns, and one exploded JSON-array field
221
+ - `doctrail view pivot ... --by-model`
222
+   - model-by-model columns for comparison work
223
+
224
+ For human-readable exports of a materialized view:
225
+
226
+ ```bash
227
+ doctrail view render payments_review --output payments_review.html
228
+ ```
229
+
230
+ ## Review and overrides
231
+
232
+ Human correction workflow:
233
+
234
+ ```bash
235
+ doctrail overrides-export --run-id <run_id>
236
+ doctrail overrides-import --run-id <run_id> --input overrides.csv --reviewer alice
237
+ ```
238
+
239
+ `v_final_*` views layer overrides on top of the original run output. The storage tables remain unchanged; the view is the merged surface.
240
+
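One way to picture a merged surface that leaves storage unchanged is a view that prefers an override when one exists. This is a sketch under assumed names (`demo_run_output`, `demo_overrides`, `demo_final`), not the definition Doctrail uses for `v_final_*`; in particular, the `COALESCE` approach shown here cannot express an override that sets a value back to NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical tables: immutable run output plus reviewer overrides.
    CREATE TABLE demo_run_output (key_value TEXT, field_name TEXT, value TEXT);
    CREATE TABLE demo_overrides (
        key_value TEXT, field_name TEXT, value TEXT, reviewer TEXT
    );

    INSERT INTO demo_run_output VALUES ('doc1', 'category', 'trade');
    INSERT INTO demo_run_output VALUES ('doc2', 'category', 'trade');
    INSERT INTO demo_overrides VALUES ('doc2', 'category', 'security', 'alice');

    -- Merged surface: the override wins where present, else the machine value.
    CREATE VIEW demo_final AS
    SELECT r.key_value,
           r.field_name,
           COALESCE(o.value, r.value) AS value,
           o.reviewer IS NOT NULL AS overridden
    FROM demo_run_output AS r
    LEFT JOIN demo_overrides AS o
      ON o.key_value = r.key_value AND o.field_name = r.field_name;
    """
)

final_rows = conn.execute(
    "SELECT key_value, value, overridden FROM demo_final ORDER BY key_value"
).fetchall()
machine_rows = conn.execute(
    "SELECT key_value, value FROM demo_run_output ORDER BY key_value"
).fetchall()
```

The machine output for `doc2` still reads `trade` in the run table, so the reviewer's change stays traceable to the original coder output.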
241
+ Short rule:
242
+ - model runs are immutable coder outputs
243
+ - compare runs in views, but edit final values in one clean final surface
244
+ - agreement between models is ICR
245
+ - do not invent duplicate `machine_*` columns in the final editable table
246
+ - keep only enough provenance to join back to machine history
247
+
248
+ Current implementation note:
249
+ - `doctrail finalize --run-id <run_id> --table <table_name>` materializes one writable final table from a chosen run
250
+ - `doctrail finalize --view <view_name> --table <table_name>` materializes any existing review view into a writable table
251
+ - `v_final_*` views plus override import/export still exist for lightweight review and CSV workflows
252
+
253
+ For repeated extracted items, prefer exploded review views so one extracted item becomes one row.
254
+
255
+ If the user wants to repair or verify normalized storage from audit:
256
+
257
+ ```bash
258
+ doctrail rebuild-enrichments --db-path /path/to/db.db --yes
259
+ ```
260
+
261
+ This only works for audit rows that have the persisted projection payload. It is strict by design and will fail rather than guess from legacy raw JSON.
262
+
263
+ ## ICR workflow
264
+
265
+ For model agreement work:
266
+
267
+ ```bash
268
+ doctrail icr classify -m gpt-4o-mini -m openrouter/google/gemini-2.5-flash --sample 100 --seed 42
269
+ doctrail icr-report --db-path /path/to/db.db --field category
270
+ ```
271
+
272
+ The default install computes Krippendorff's alpha and Cohen's kappa; `doctrail[icr]` remains accepted for older setup scripts but is no longer required.
273
+
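For readers who want to see what an agreement statistic measures, here is the textbook two-coder Cohen's kappa in plain Python. It is a sketch for intuition with invented labels, not the code behind `doctrail icr`, and it does not cover Krippendorff's alpha.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(coder_a: Sequence[str], coder_b: Sequence[str]) -> float:
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(coder_a)
    # Observed agreement: share of items where both coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


labels_a = ["trade", "trade", "security", "security", "other", "trade"]
labels_b = ["trade", "security", "security", "security", "other", "trade"]
kappa = cohens_kappa(labels_a, labels_b)
```

In this sample the coders agree on 5 of 6 items, chance agreement is 13/36, and kappa works out to 17/23, roughly 0.74.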
274
+ ## Working with the database directly
275
+
276
+ Everything Doctrail produces lands in plain SQLite, so most inspection and reshaping is just SQL. `sqlite3` covers ad-hoc queries; for anything heavier, `sqlite-utils` is the right companion CLI (reference: https://sqlite-utils.datasette.io/). It is not bundled with doctrail — install it once with `uv tool install sqlite-utils`. A ready reckoner of the moves you will actually make:
277
+
278
+ - list tables and views: `sqlite-utils tables out/database.db` (add `--views` for views)
279
+ - inspect schema: `sqlite-utils schema out/database.db documents`
280
+ - peek at a wide view: `sqlite-utils rows out/database.db v_documents_enriched --limit 5`
281
+ - run a query as JSON lines: `sqlite-utils out/database.db "SELECT ... FROM v_..." --nl`
282
+ - count answered rows for an enrichment: `sqlite-utils out/database.db "SELECT COUNT(DISTINCT key_value) FROM _enrichments WHERE enrichment_name='x'"`
283
+ - export a review view to CSV: `sqlite-utils rows out/database.db v_run_<id> --csv > review.csv`
284
+
285
+ Read-only inspection is always safe. Reshape into your own tables and views freely, but do not hand-edit the `_`-prefixed ledger tables — let doctrail write those, and use overrides/finalize for human corrections.
286
+
287
+ ## Troubleshooting
288
+
289
+ - `No config found`: run `doctrail init` or pass `--config`.
290
+ - `key_column not found`: include it in the SQL `SELECT`.
291
+ - Tables named `enrichments` / `enrichment_audit` without the `_` prefix mean a pre-rename database. Any doctrail command migrates it in place automatically (tracked via `PRAGMA user_version`); do not rename tables by hand.
292
+ - Unsure what changed between prompt versions: use `doctrail runs` and `doctrail diff-runs`.
293
+ - Unsure what the user should inspect: materialize a `v_run_*` view first; it is the safest default review surface.
294
+ - `Rows were skipped but should retry`: check `_enrichments`, not just `_enrichment_audit` or `_enrichment_run_items`.
295
+
296
+ ```bash
297
+ sqlite3 /path/to/db.db "
298
+ SELECT COUNT(DISTINCT key_value)
299
+ FROM _enrichments
300
+ WHERE enrichment_name = 'your_enrichment'
301
+ "
302
+ ```
303
+
304
+ - `Batch run shows many errors`: inspect per-row status in the run snapshot.
305
+
306
+ ```bash
307
+ sqlite3 /path/to/db.db "
308
+ SELECT status, COUNT(*)
309
+ FROM _enrichment_run_items
310
+ WHERE run_id = 'your_run_id'
311
+ GROUP BY status
312
+ "
313
+ ```
314
+
315
+ - `Field name collides with source column`: rename the enrichment field if possible. If you need to proceed immediately, use:
316
+
317
+ ```bash
318
+ doctrail enrich <name> --allow-column-collision
319
+ ```