doctrail 0.3.1__py3-none-any.whl
- doctrail/__init__.py +1 -0
- doctrail/_docs/SKILL.md +319 -0
- doctrail/_docs/llms.txt +1272 -0
- doctrail/_examples/tutorial/.doctrail/config.yml +30 -0
- doctrail/_examples/tutorial/.doctrail/enrichments/country_mentions.yml +18 -0
- doctrail/_examples/tutorial/.doctrail/enrichments/country_stance.yml +11 -0
- doctrail/_examples/tutorial/.doctrail/enrichments/econ_threat.yml +30 -0
- doctrail/_examples/tutorial/.doctrail/enrichments/mentions_climate.yml +11 -0
- doctrail/_examples/tutorial/.doctrail/enrichments/optimism.yml +11 -0
- doctrail/_examples/tutorial/.doctrail/enrichments/securitization.yml +28 -0
- doctrail/_examples/tutorial/.doctrail/enrichments/test.yml +27 -0
- doctrail/_examples/tutorial/.doctrail/replay/country_mentions.jsonl +10 -0
- doctrail/_examples/tutorial/.doctrail/replay/country_stance.jsonl +20 -0
- doctrail/_examples/tutorial/.doctrail/replay/econ_threat.jsonl +10 -0
- doctrail/_examples/tutorial/.doctrail/replay/mentions_climate.jsonl +20 -0
- doctrail/_examples/tutorial/.doctrail/replay/optimism.jsonl +20 -0
- doctrail/_examples/tutorial/.doctrail/replay/securitization.jsonl +10 -0
- doctrail/_examples/tutorial/.doctrail/replay/test.jsonl +18 -0
- doctrail/_examples/tutorial/.doctrail/views/country_mentions.yml +14 -0
- doctrail/_examples/tutorial/.doctrail/views/econ_threat.yml +13 -0
- doctrail/_examples/tutorial/README.md +41 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_01.html +9 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_02.docx +0 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_06.docx +0 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_09.docx +0 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_10.docx +0 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_14.docx +0 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_49.pdf +99 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_50.pdf +99 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_51.pdf +99 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_52.pdf +99 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_53.pdf +99 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_54.pdf +99 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_55.pdf +99 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_56.pdf +99 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_57.pdf +99 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_58.pdf +99 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_62.html +9 -0
- doctrail/_examples/tutorial/corpus/federalist/federalist_63.html +9 -0
- doctrail/_examples/tutorial/corpus/gt_editorials/README.md +29 -0
- doctrail/_examples/tutorial/corpus/gt_editorials/gt_2012_philippines.txt +25 -0
- doctrail/_examples/tutorial/corpus/gt_editorials/gt_2013_eu_wine.txt +23 -0
- doctrail/_examples/tutorial/corpus/gt_editorials/gt_2013_north_korea.txt +29 -0
- doctrail/_examples/tutorial/corpus/gt_editorials/gt_2016_south_korea_thaad.txt +31 -0
- doctrail/_examples/tutorial/corpus/gt_editorials/gt_2017_australia.txt +23 -0
- doctrail/_examples/tutorial/corpus/gt_editorials/gt_2018_canada_meng.txt +29 -0
- doctrail/_examples/tutorial/corpus/gt_editorials/gt_2018_us_trade_war.txt +25 -0
- doctrail/_examples/tutorial/corpus/gt_editorials/gt_2021_afghanistan.txt +23 -0
- doctrail/_examples/tutorial/corpus/gt_editorials/gt_2021_europe.txt +21 -0
- doctrail/_examples/tutorial/corpus/gt_editorials/gt_2021_us_taiwan.txt +29 -0
- doctrail/_examples/tutorial/corpus/manifest.json +290 -0
- doctrail/_examples/tutorial/corpus/un_speeches/au_general_debate_2023.html +4 -0
- doctrail/_examples/tutorial/corpus/un_speeches/br_general_debate_2023.pdf +80 -0
- doctrail/_examples/tutorial/corpus/un_speeches/ca_general_debate_2023.docx +0 -0
- doctrail/_examples/tutorial/corpus/un_speeches/fj_general_debate_2023.pdf +80 -0
- doctrail/_examples/tutorial/corpus/un_speeches/ie_general_debate_2023.html +3 -0
- doctrail/_examples/tutorial/corpus/un_speeches/in_general_debate_2023.pdf +80 -0
- doctrail/_examples/tutorial/corpus/un_speeches/jp_general_debate_2023.docx +0 -0
- doctrail/_examples/tutorial/corpus/un_speeches/ke_general_debate_2023.docx +0 -0
- doctrail/_examples/tutorial/corpus/un_speeches/ng_general_debate_2023.html +4 -0
- doctrail/_examples/tutorial/corpus/un_speeches/za_general_debate_2023.pdf +80 -0
- doctrail/cli/__init__.py +32 -0
- doctrail/cli/__main__.py +7 -0
- doctrail/cli/enrich.py +610 -0
- doctrail/cli/export.py +51 -0
- doctrail/cli/icr.py +295 -0
- doctrail/cli/ingest.py +306 -0
- doctrail/cli/main.py +2024 -0
- doctrail/cli/models.py +505 -0
- doctrail/cli/query.py +575 -0
- doctrail/cli/review.py +76 -0
- doctrail/cli/serve.py +82 -0
- doctrail/cli/utils.py +162 -0
- doctrail/cli/view.py +5 -0
- doctrail/cli.py +19 -0
- doctrail/config/__init__.py +6 -0
- doctrail/config/config_manager.py +213 -0
- doctrail/config/validators.py +189 -0
- doctrail/constants.py +105 -0
- doctrail/core.py +30 -0
- doctrail/core_runtime/__init__.py +22 -0
- doctrail/core_runtime/batch.py +778 -0
- doctrail/core_runtime/commands.py +1062 -0
- doctrail/core_runtime/enrichment.py +946 -0
- doctrail/core_runtime/shared.py +796 -0
- doctrail/core_utils.py +584 -0
- doctrail/cost_estimation.py +33 -0
- doctrail/db_operations.py +8 -0
- doctrail/db_ops/__init__.py +24 -0
- doctrail/db_ops/audit_runs.py +868 -0
- doctrail/db_ops/common.py +595 -0
- doctrail/db_ops/enrichments.py +1722 -0
- doctrail/db_ops/migrations.py +421 -0
- doctrail/db_ops/views.py +1470 -0
- doctrail/enrichment_config.py +356 -0
- doctrail/export_operations.py +170 -0
- doctrail/extractors/__init__.py +1 -0
- doctrail/extractors/djvu_extractor.py +73 -0
- doctrail/extractors/doc_extractor.py +62 -0
- doctrail/extractors/docx_extractor.py +116 -0
- doctrail/extractors/epub_extractor.py +198 -0
- doctrail/extractors/html_extractor.py +71 -0
- doctrail/extractors/mhtml-to-html.py +428 -0
- doctrail/extractors/mhtml_extractor.py +644 -0
- doctrail/extractors/mobi_extractor.py +59 -0
- doctrail/extractors/pdf_extractor.py +256 -0
- doctrail/extractors/presentation_extractor.py +173 -0
- doctrail/extractors/smart_html_extractor.py +217 -0
- doctrail/extractors/spreadsheet_extractor.py +482 -0
- doctrail/file_filters.py +207 -0
- doctrail/ingest/__init__.py +20 -0
- doctrail/ingest/base.py +11 -0
- doctrail/ingest/core.py +722 -0
- doctrail/ingest/database.py +214 -0
- doctrail/ingest/document_processor.py +1274 -0
- doctrail/ingest/extractors.py +47 -0
- doctrail/ingest/file_utils.py +119 -0
- doctrail/ingest/manifest.py +109 -0
- doctrail/ingest/text_processing.py +196 -0
- doctrail/ingester.py +40 -0
- doctrail/llm/__init__.py +5 -0
- doctrail/llm/client.py +160 -0
- doctrail/llm/token_utils.py +120 -0
- doctrail/llm_operations.py +1719 -0
- doctrail/llm_providers/__init__.py +46 -0
- doctrail/llm_providers/anthropic_provider.py +568 -0
- doctrail/llm_providers/claude_sdk_provider.py +272 -0
- doctrail/llm_providers/cli_provider.py +478 -0
- doctrail/llm_providers/factory.py +154 -0
- doctrail/llm_providers/gemini_provider.py +705 -0
- doctrail/llm_providers/openai_provider.py +551 -0
- doctrail/llm_providers/replay_provider.py +155 -0
- doctrail/main.py +20 -0
- doctrail/plugins/README.md +166 -0
- doctrail/plugins/__init__.py +135 -0
- doctrail/plugins/_chinese_converter.py +190 -0
- doctrail/plugins/doi_connector.py +636 -0
- doctrail/plugins/example_custom.py +145 -0
- doctrail/plugins/zotero.py +812 -0
- doctrail/plugins/zotero_connector.py +61 -0
- doctrail/plugins/zotero_ingester.py +660 -0
- doctrail/presets/__init__.py +2 -0
- doctrail/presets/document_type.yml +25 -0
- doctrail/presets/extract_entities.yml +34 -0
- doctrail/presets/keywords.yml +23 -0
- doctrail/presets/language.yml +24 -0
- doctrail/presets/relevance.yml +24 -0
- doctrail/presets/research_methods.yml +38 -0
- doctrail/presets/sentiment.yml +21 -0
- doctrail/presets/summarize.yml +20 -0
- doctrail/pydantic_schema.py +499 -0
- doctrail/review_server.py +549 -0
- doctrail/schema_managers.py +455 -0
- doctrail/search.py +629 -0
- doctrail/server.py +954 -0
- doctrail/server_config.py +296 -0
- doctrail/server_ingestor/__init__.py +11 -0
- doctrail/server_ingestor/app.py +1205 -0
- doctrail/templates/parallel-translation.md +13 -0
- doctrail/types.py +104 -0
- doctrail/utils/__init__.py +1 -0
- doctrail/utils/build_documentation.py +289 -0
- doctrail/utils/cost_estimation.py +561 -0
- doctrail/utils/dependency_check.py +72 -0
- doctrail/utils/logging_config.py +135 -0
- doctrail/utils/model_pricing.py +778 -0
- doctrail/utils/progress.py +89 -0
- doctrail/utils/query_utils.py +103 -0
- doctrail/utils/simple_error_handler.py +90 -0
- doctrail/utils/validate_config.py +217 -0
- doctrail-0.3.1.dist-info/METADATA +103 -0
- doctrail-0.3.1.dist-info/RECORD +175 -0
- doctrail-0.3.1.dist-info/WHEEL +4 -0
- doctrail-0.3.1.dist-info/entry_points.txt +2 -0
- doctrail-0.3.1.dist-info/licenses/LICENSE +21 -0
doctrail/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
1
|
+
# Doctrail - SQLite database enrichment tool using LLMs
|
doctrail/_docs/SKILL.md
ADDED
|
@@ -0,0 +1,319 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: doctrail
|
|
3
|
+
description: "Initialize and operate Doctrail projects: ingest document corpora into SQLite, define SQL-scoped YAML enrichments, run sync or provider-batch LLM coding, inspect normalized audit/enrichment storage through run/pivot/spec views, compare model coders, and finalize human-review datasets."
|
|
4
|
+
allowed-tools: Bash
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
# Doctrail
|
|
8
|
+
|
|
9
|
+
Use this when the user wants to start or operate a Doctrail project: ingest files into SQLite, create or edit enrichment YAML, run LLM coding over SQL-selected rows, inspect run history, build review or analysis views, compare model coders, or repair normalized storage.
|
|
10
|
+
|
|
11
|
+
This skill is the operating doctrine: the mental model, the judgment rules, and worked examples. The complete reference — every YAML key, every command and flag, the storage contract — ships inside the package: `doctrail docs` prints the whole manual to stdout, offline. When you need a detail not covered here, read the manual rather than guessing.
|
|
12
|
+
|
|
13
|
+
## Core mental model
|
|
14
|
+
|
|
15
|
+
Every answer Doctrail has ever produced lives in one long table, `_enrichments`: one row per document, enrichment, field, model, and prompt version. Three kinds of object share the database, distinguishable at a glance:
|
|
16
|
+
|
|
17
|
+
- Source tables or views keep the original inputs. `documents` is only a default scaffold name.
|
|
18
|
+
- Tables prefixed `_` are Doctrail's ledger: `_enrichments` (parsed answers, long form), `_enrichment_audit` (every raw model call and response, plus the exact projection payload used to populate `_enrichments`), `_enrichment_runs` and `_enrichment_run_items` (provenance and the exact input rowset).
|
|
19
|
+
- Views prefixed `v_` are what people actually read: `v_run_*` shows one specific run in wide form, `v_final_*` applies human overrides to a run, and `doctrail view pivot` builds reusable wide views over `_enrichments`.
|
|
20
|
+
|
|
21
|
+
Rule: store long, inspect wide.
|
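A minimal sketch of "store long, inspect wide", using Python's standard `sqlite3` module. The table `long_answers` is a hypothetical, simplified stand-in for `_enrichments`, and the column names `field_name` and `value` are assumptions made for illustration (only `key_value` and `enrichment_name` appear in this skill's own queries). It is not Doctrail's actual view-building code.

```python
import sqlite3

# Hypothetical stand-in for the long-form ledger (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE long_answers ("
    " key_value TEXT, enrichment_name TEXT, field_name TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO long_answers VALUES (?, ?, ?, ?)",
    [
        ("doc1", "classify", "category", "trade"),
        ("doc1", "classify", "confidence", "high"),
        ("doc2", "classify", "category", "security"),
        ("doc2", "classify", "confidence", "low"),
    ],
)

# Pivot: one row per document, one column per field.
wide_rows = conn.execute(
    """
    SELECT key_value,
           MAX(CASE WHEN field_name = 'category' THEN value END) AS category,
           MAX(CASE WHEN field_name = 'confidence' THEN value END) AS confidence
    FROM long_answers
    WHERE enrichment_name = 'classify'
    GROUP BY key_value
    ORDER BY key_value
    """
).fetchall()
print(wide_rows)
```

The long table can absorb new fields, models, and prompt versions without schema changes, while the pivot query gives readers the one-row-per-document shape they expect.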
|
22
|
+
|
|
23
|
+
## Happy path
|
|
24
|
+
|
|
25
|
+
1. Inspect the database or current project.
|
|
26
|
+
2. Prefer `doctrail init` if starting fresh.
|
|
27
|
+
3. Use `doctrail new` or patch `.doctrail/enrichments/*.yml` to define the enrichment.
|
|
28
|
+
4. Ingest if needed.
|
|
29
|
+
5. Run `doctrail enrich <name> --dry-run`.
|
|
30
|
+
6. Run `doctrail enrich <name> --limit 5`.
|
|
31
|
+
7. Inspect with `doctrail runs`, `doctrail view create --run-id <run_id>`, `doctrail view spec`, or `doctrail view pivot`.
|
|
32
|
+
8. Iterate with `doctrail edit <name>` or direct YAML edits until the prompt and schema are stable.
|
|
33
|
+
9. Run the full corpus, using `--execution-mode batch` when direct provider batch mode is appropriate.
|
|
34
|
+
10. Materialize the review surface or final editable table with `doctrail view create`, `doctrail view spec`, `doctrail view pivot`, or `doctrail finalize`.
|
|
35
|
+
11. Export and import overrides or run ICR if needed.
|
|
36
|
+
|
|
37
|
+
`doctrail run` is an alias of `doctrail enrich`; the two are interchangeable everywhere.
|
|
38
|
+
|
|
39
|
+
## Project setup
|
|
40
|
+
|
|
41
|
+
Prefer using the same CLI that a human user would use.
|
|
42
|
+
|
|
43
|
+
```bash
|
|
44
|
+
doctrail init
|
|
45
|
+
doctrail new
|
|
46
|
+
doctrail ingest --input-dir docs/ --yes
|
|
47
|
+
doctrail enrich language --dry-run
|
|
48
|
+
doctrail enrich language --limit 5
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
If `doctrail init` already created `.doctrail/config.yml` and `.doctrail/enrichments/*.yml`, patch those files rather than inventing parallel config formats.
|
|
52
|
+
|
|
53
|
+
## Offline runs and demos (replay)
|
|
54
|
+
|
|
55
|
+
The model name `replay` (or `replay/<label>`) returns canned responses from `.doctrail/replay/<enrichment>.jsonl` fixtures instead of calling an API. Everything downstream — audit rows, `_enrichments`, views, runs, ICR — is the real pipeline, so replay runs are how you exercise a config end to end without an API key. Distinct labels act as distinct coders for ICR (`-m replay/coder-a -m replay/coder-b`).
|
|
56
|
+
|
|
57
|
+
`doctrail init test` scaffolds a complete canned tutorial workspace into the current directory: small public corpora, a pre-ingested `out/database.db`, prepared enrichment configs, and replay fixtures. Use it to walk a new user through the whole loop offline.
|
|
58
|
+
|
|
59
|
+
## Before running an enrichment
|
|
60
|
+
|
|
61
|
+
- Inspect the source schema first:
|
|
62
|
+
|
|
63
|
+
```bash
|
|
64
|
+
sqlite3 /path/to/db.db ".tables"
|
|
65
|
+
sqlite3 /path/to/db.db ".schema your_source_table"
|
|
66
|
+
sqlite3 /path/to/db.db "SELECT * FROM your_source_table LIMIT 3"
|
|
67
|
+
```
|
|
68
|
+
|
|
69
|
+
- Confirm the key column. It must be present in the input query.
|
|
70
|
+
- Do not assume the source table is named `documents`. The real contract is: valid SQL input query + stable key column + schema.
|
|
71
|
+
- Keep prompts and schema small on the first pass.
|
|
72
|
+
- Use `--output-db` when the user wants a non-destructive run into a separate database.
|
|
73
|
+
- For targeted reruns, prefer `--where "..."` to filter the enrichment's existing query; use `--query` only when replacing the SQL entirely.
|
|
74
|
+
- Default to `--dedupe-scope query` unless the user explicitly wants prompt-family reuse.
|
|
75
|
+
- Append-mode dedupe is success-based: a row is only "already done" when a successful normalized result exists in `_enrichments` for the current dedupe scope.
|
|
76
|
+
- Null answers count as answered. A field the model explicitly returned as null is stored as a row with a NULL value, and dedupe treats that row as complete; rerunning without `--overwrite` will not re-ask it.
|
|
77
|
+
- Raw attempt history in `_enrichment_audit` and per-run snapshots in `_enrichment_run_items` are provenance, not completion signals. Error-only rows should retry without `--overwrite`.
|
|
78
|
+
- Use `--overwrite` to force reruns of rows that already succeeded.
|
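The dedupe rules above can be sketched as a small decision function. This is an illustration of the stated behavior only, not Doctrail's implementation; the function name `rows_to_run` and the dictionary representation of stored answers are hypothetical.

```python
def rows_to_run(candidate_keys, stored_answers, overwrite=False):
    """Return the keys an append-mode run would still process.

    stored_answers maps key -> stored value for keys with a successful
    result. The value may be None when the model explicitly answered null.
    Keys that only have failed attempts are absent from the mapping.
    """
    if overwrite:
        # Overwrite forces every selected row to rerun.
        return list(candidate_keys)
    # Membership, not truthiness: a stored None still counts as answered.
    return [key for key in candidate_keys if key not in stored_answers]


stored = {"doc1": "trade", "doc2": None}  # doc2 holds an explicit null answer
candidates = ["doc1", "doc2", "doc3"]     # doc3 only has an error in the audit log
pending = rows_to_run(candidates, stored)
forced = rows_to_run(candidates, stored, overwrite=True)
print(pending, forced)
```

Testing membership rather than truthiness is what keeps an explicit null answer from being asked again.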
|
79
|
+
|
|
80
|
+
## Enrichment config rules
|
|
81
|
+
|
|
82
|
+
- `input.query` may be a named SQL query or inline SQL.
|
|
83
|
+
- `input.input_columns` controls what the model sees.
|
|
84
|
+
- Use `table.column` for multi-table inputs.
|
|
85
|
+
- Use `:N` suffix to truncate large fields, for example `raw_content:3000`.
|
|
86
|
+
- `output_column` can still name a single extracted field.
|
|
87
|
+
- For short screening tasks, use `pack_size` to group multiple rows into one structured call.
|
|
88
|
+
- `pack_response_mode: selected_indexes` is the compact mode for a single boolean field. The model returns zero-based matching item indexes only, and Doctrail unpacks them into ordinary row-level `_enrichment_audit` and `_enrichments` rows.
|
|
89
|
+
- `pack_response_mode: exhaustive` returns one structured result per packed item. Prefer it for multi-field schemas or when omission risk matters more than output-token savings.
|
|
90
|
+
- Packed prompt shaping is separate from `--execution-mode batch`. Today `pack_size` is sync-only.
|
|
91
|
+
- Packed mode is most appropriate for short inputs like titles, abstracts, or snippets. It is a poor default for full-document prompts.
|
|
92
|
+
- Design schemas so the common answer is cheap. For rare-hit screening over a large corpus, prefer a `false`/`0`/small-enum default over nullable prose fields, and combine it with `pack_size` so non-hits cost almost no output tokens.
|
|
93
|
+
- Older configs may include `output_table`, but the core storage model is `_enrichments` plus derived views. Do not describe `output_table` as the primary storage contract.
|
|
94
|
+
- Do not name enrichment output fields the same as source table columns. If a schema field matches a source column, Doctrail will error. Rename the field (e.g., `fulltext_cleaned` instead of `fulltext_clean`) or pass `--allow-column-collision` to proceed (the source column becomes `<name>_input` in views).
|
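To make `pack_response_mode: selected_indexes` concrete, here is a hedged sketch of how a list of zero-based matching indexes could be expanded into one boolean per packed row. The function name and the range check are assumptions for illustration; Doctrail's actual unpacking code may differ.

```python
def unpack_selected_indexes(pack_keys, selected_indexes):
    """Expand zero-based matching indexes into one boolean per packed row."""
    selected = set(selected_indexes)
    out_of_range = sorted(i for i in selected if not 0 <= i < len(pack_keys))
    if out_of_range:
        # An index outside the pack means the response cannot be trusted.
        raise ValueError(f"indexes outside the pack: {out_of_range}")
    return {key: (i in selected) for i, key in enumerate(pack_keys)}


row_level = unpack_selected_indexes(["doc_a", "doc_b", "doc_c"], [0, 2])
print(row_level)
```

Every row absent from the index list becomes an explicit `False`, which is why this mode suits a single boolean field: non-hits cost no output tokens yet still produce ordinary row-level results.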
|
95
|
+
|
|
96
|
+
## Prompt construction
|
|
97
|
+
|
|
98
|
+
- Write the prompt as a codebook, the way you would brief a research assistant: define every enum value, anchor every scale point (`0 = no mention; 5 = explicit existential threat`), and state explicitly when gated fields must be null. Vague constructs produce coder disagreement, with models exactly as with humans; if ICR comes back low, fix the codebook before blaming the model.
|
|
99
|
+
- Provider prompt caching is prefix-based. OpenAI caches eligible matching prompt prefixes automatically. Gemini implicit caching is automatic on Gemini 2.5 and newer models, but Google does not guarantee savings on every request. In both cases, cache reuse depends on the longest identical prefix counted from the top of the request.
|
|
100
|
+
- Therefore the rule is: everything static first, everything per-row last. Doctrail's default renderer puts the prompt first. On JSON-mode paths it appends the schema instructions to the prompt; on provider-native schema paths it sends the schema as constant request structure. It then appends the per-row `input_columns` content at the end. When OpenAI or Gemini reports a cache hit, a long codebook prefix can be billed at cached-input rates. Doctrail does not currently set Anthropic `cache_control` markers, so do not count on Anthropic prompt-cache discounts from this layout.
|
|
101
|
+
- `{column}` placeholders inside the prompt can break this. Each substitution makes the request differ from row to row at that point, reducing cache reuse for everything after it. Feed per-row content through `input_columns` instead; if you genuinely must interpolate, put the placeholder at the end of the prompt so the static prefix above it still matches.
|
|
102
|
+
- Control input size with `:N` truncation on `input_columns` (e.g. `raw_content:3000`) rather than editing source data.
|
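The effect of placeholder position on prefix reuse can be illustrated with plain string comparison. This sketch only measures the longest identical prefix between two rendered prompts. The prompt text and both render functions are hypothetical, and real provider caches work on tokens with their own minimum lengths, so treat the numbers as a rough analogy.

```python
import os

# Hypothetical long static codebook shared by every request.
CODEBOOK = "0 = no mention; 5 = explicit existential threat. " * 20


def render_static_first(row_text):
    # Static codebook first, per-row content last.
    return CODEBOOK + "\n\nInput:\n" + row_text


def render_placeholder_first(row_text):
    # Per-row content interpolated near the top of the prompt.
    return "Document title: " + row_text + "\n\n" + CODEBOOK


def shared_prefix_len(a, b):
    """Length of the longest identical leading substring of a and b."""
    return len(os.path.commonprefix([a, b]))


row_1, row_2 = "Sanctions on wine", "Trade war escalates"
static_shared = shared_prefix_len(render_static_first(row_1), render_static_first(row_2))
early_shared = shared_prefix_len(
    render_placeholder_first(row_1), render_placeholder_first(row_2)
)
print(static_shared, early_shared)
```

With the static-first layout the two requests share the entire codebook; with the early placeholder they diverge after the short literal that precedes the row text.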
|
103
|
+
|
|
104
|
+
Example packed boolean screen:
|
|
105
|
+
|
|
106
|
+
```yaml
|
|
107
|
+
enrichments:
|
|
108
|
+
- name: packed_relevance
|
|
109
|
+
input:
|
|
110
|
+
query: all_docs
|
|
111
|
+
input_columns: [raw_content:240]
|
|
112
|
+
prompt: |
|
|
113
|
+
Return the items that mention sanctions or trade restrictions.
|
|
114
|
+
output_column: is_relevant
|
|
115
|
+
schema: {type: "boolean"}
|
|
116
|
+
pack_size: 10
|
|
117
|
+
pack_response_mode: selected_indexes
|
|
118
|
+
```
|
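The `raw_content:240` entry in the example uses the `:N` truncation suffix. A minimal sketch of how such a spec could be parsed and applied follows. It assumes `N` counts characters, which the text here does not state, and the helper names are hypothetical rather than Doctrail's own.

```python
def parse_column_spec(spec):
    """Split 'column:N' into (column, N); return (spec, None) without a suffix."""
    name, sep, limit = spec.rpartition(":")
    if sep and name and limit.isdigit():
        return name, int(limit)
    return spec, None


def apply_spec(row, spec):
    """Return the column value, truncated when the spec carries a limit."""
    name, limit = parse_column_spec(spec)
    text = row[name]
    # Assumption: the limit is a character count.
    return text if limit is None else text[:limit]


row = {"raw_content": "Sanctions were announced on Tuesday.", "title": "Trade"}
truncated = apply_spec(row, "raw_content:9")
untouched = apply_spec(row, "title")
print(truncated, untouched)
```

Truncating at render time leaves the source rows intact, so the limit can be tuned per enrichment.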
|
119
|
+
|
|
120
|
+
## Iterative workflow
|
|
121
|
+
|
|
122
|
+
Use this workflow for prompt development:
|
|
123
|
+
|
|
124
|
+
```bash
|
|
125
|
+
doctrail enrich classify --dry-run
|
|
126
|
+
doctrail enrich classify --limit 10
|
|
127
|
+
doctrail runs
|
|
128
|
+
doctrail view create --run-id <run_id>
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
Then revise the YAML and rerun on another small sample. Once the prompt is stable:
|
|
132
|
+
|
|
133
|
+
```bash
|
|
134
|
+
doctrail enrich classify
|
|
135
|
+
doctrail view create --run-id <final_run_id>
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
If the user wants a reusable wide analysis surface instead of a single-run snapshot:
|
|
139
|
+
|
|
140
|
+
```bash
|
|
141
|
+
doctrail view pivot review_surface -e classify --include "title,raw_content:500"
|
|
142
|
+
```
|
|
143
|
+
|
|
144
|
+
## Provider batch workflow
|
|
145
|
+
|
|
146
|
+
Use `--execution-mode batch` for direct provider batch runs. The older `--execution-mode openai-batch` spelling still works as a compatibility alias, but do not teach it as the current command.
|
|
147
|
+
|
|
148
|
+
- Direct OpenAI model ids must appear in Doctrail's verified OpenAI batch catalog.
|
|
149
|
+
- That catalog is fetched from the official OpenAI model docs, not guessed from a static local allowlist.
|
|
150
|
+
- A model is only accepted for this path if the docs page shows both `v1/batch` and `v1/chat/completions`.
|
|
151
|
+
- The catalog stores batch input, cached-input, and output prices per 1M tokens, documented snapshots, and the fetch date.
|
|
152
|
+
- Direct OpenAI batch cost estimation uses that catalog, not the normal sync pricing table.
|
|
153
|
+
|
|
154
|
+
Inspect or refresh the catalog with:
|
|
155
|
+
|
|
156
|
+
```bash
|
|
157
|
+
doctrail models --openai-batch
|
|
158
|
+
doctrail models --openai-batch --refresh
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
Operational rule:
|
|
162
|
+
- if a direct OpenAI model is outside the verified batch catalog, Doctrail should refuse the batch run instead of silently falling back
|
|
163
|
+
- if the docs are unreachable, Doctrail may use the cached or bootstrap catalog, and the CLI should make that visible
|
|
164
|
+
|
|
165
|
+
Direct batch backends behind `--execution-mode batch`:
|
|
166
|
+
- direct OpenAI bare model ids use OpenAI's `/v1/batches` API with request lines targeting `/v1/chat/completions`
|
|
167
|
+
- direct Anthropic model ids like `claude-*` or `anthropic/*` use Anthropic's `/v1/messages/batches` API with request params targeting `/v1/messages`
|
|
168
|
+
- direct Gemini model ids like `gemini-*` or `models/gemini-*` use Gemini's File API plus `v1beta/models/{model}:batchGenerateContent`
|
|
169
|
+
- the verified OpenAI batch catalog only governs the direct OpenAI branch; Anthropic and Gemini do not use that catalog
|
|
170
|
+
- submit with `doctrail enrich <name> --execution-mode batch`
|
|
171
|
+
- reconcile later with `doctrail batch poll --run-id <run_id>` or `doctrail batch watch --run-id <run_id>`
|
|
172
|
+
- after reconciliation, successful rows land in both `_enrichment_audit` and `_enrichments`
|
|
173
|
+
- failed rows may still land in `_enrichment_audit`, but remain eligible for retry on the next append-mode run
|
|
174
|
+
|
|
175
|
+
To monitor a live batch run, use `doctrail batch watch --run-id <run_id>` for a live feed, or query the shard table directly for a precise picture:
|
|
176
|
+
|
|
177
|
+
```bash
|
|
178
|
+
sqlite3 /path/to/db.db "
|
|
179
|
+
SELECT id, status, request_count, completed_count, failed_count,
|
|
180
|
+
output_file_id IS NOT NULL AS has_output,
|
|
181
|
+
error_file_id IS NOT NULL AS has_errors,
|
|
182
|
+
last_polled_at, completed_at, reconciled_at
|
|
183
|
+
FROM _enrichment_batch_jobs
|
|
184
|
+
WHERE run_id = 'your_run_id'
|
|
185
|
+
ORDER BY id
|
|
186
|
+
"
|
|
187
|
+
```
|
|
188
|
+
|
|
189
|
+
Large runs are sharded into multiple provider jobs — one row per shard. A run is fully done when every shard has `reconciled_at` set; until then `completed_count` climbs toward `request_count` per shard, and rows land in `_enrichments` only at reconciliation (`batch poll`/`batch watch`), not while the provider is still processing.
|
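The "every shard has `reconciled_at` set" rule can be checked programmatically. The sketch below builds a throwaway in-memory table with a subset of the columns from the query above. The column types, the status strings, and the helper name `run_is_done` are assumptions for illustration, not the real schema.

```python
import sqlite3


def run_is_done(conn, run_id):
    """True when the run has at least one shard and none is unreconciled."""
    total, pending = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(reconciled_at IS NULL), 0) "
        "FROM _enrichment_batch_jobs WHERE run_id = ?",
        (run_id,),
    ).fetchone()
    return total > 0 and pending == 0


# Throwaway demo table; types and status values are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE _enrichment_batch_jobs ("
    " id INTEGER PRIMARY KEY, run_id TEXT, status TEXT,"
    " request_count INTEGER, completed_count INTEGER, reconciled_at TEXT)"
)
conn.executemany(
    "INSERT INTO _enrichment_batch_jobs"
    " (run_id, status, request_count, completed_count, reconciled_at)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        ("run_a", "completed", 100, 100, "2025-01-01T00:00:00Z"),
        ("run_a", "in_progress", 100, 40, None),
        ("run_b", "completed", 50, 50, "2025-01-02T00:00:00Z"),
    ],
)
print(run_is_done(conn, "run_a"), run_is_done(conn, "run_b"))
```

Against a real project database the same query could be pointed at the actual file, opened read-only, since the ledger tables should not be edited by hand.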
|
190
|
+
|
|
191
|
+
Practical recovery rule for partial batch failures:
|
|
192
|
+
|
|
193
|
+
```bash
|
|
194
|
+
doctrail batch poll --run-id <run_id>
|
|
195
|
+
doctrail enrich <name>
|
|
196
|
+
```
|
|
197
|
+
|
|
198
|
+
Use `--overwrite` only if you want to rerun rows that already succeeded.
|
|
199
|
+
|
|
200
|
+
GPT-5 operational note:
|
|
201
|
+
- direct OpenAI `gpt-5*` chat-completions runs default to `reasoning_effort=minimal` unless the enrichment config overrides it
|
|
202
|
+
- allowed values are `minimal`, `low`, `medium`, `high`
|
|
203
|
+
- use that knob deliberately for cost control on both sync and batch runs
|
|
204
|
+
|
|
205
|
+
## Views
|
|
206
|
+
|
|
207
|
+
After a run, Doctrail also maintains an automatic wide view over the source table (e.g. `v_documents_enriched`): one row per document, one column per field, latest value wins. For anything beyond that default, use the right view type for the task.
|
|
208
|
+
|
|
209
|
+
- `doctrail view create --run-id <run_id>`
|
|
210
|
+
- exact snapshot of one run (`v_run_*`)
|
|
211
|
+
- best for pilot runs, final runs, and human review
|
|
212
|
+
- `doctrail view create <enrichment>`
|
|
213
|
+
- latest run for that enrichment
|
|
214
|
+
- convenient but less explicit than `--run-id`
|
|
215
|
+
- `doctrail view pivot <name> -e <enrichment>`
|
|
216
|
+
- reusable wide view over normalized enrichment data
|
|
217
|
+
- good for coding surfaces and ICR comparison
|
|
218
|
+
- `doctrail view spec <name>`
|
|
219
|
+
- YAML-driven review surface for a common workflow
|
|
220
|
+
- supports source columns, extra enrichment columns, and one exploded JSON-array field
|
|
221
|
+
- `doctrail view pivot ... --by-model`
|
|
222
|
+
- model-by-model columns for comparison work
|
|
223
|
+
|
|
224
|
+
For human-readable exports of a materialized view:
|
|
225
|
+
|
|
226
|
+
```bash
|
|
227
|
+
doctrail view render payments_review --output payments_review.html
|
|
228
|
+
```
|
|
229
|
+
|
|
230
|
+
## Review and overrides
|
|
231
|
+
|
|
232
|
+
Human correction workflow:
|
|
233
|
+
|
|
234
|
+
```bash
|
|
235
|
+
doctrail overrides-export --run-id <run_id>
|
|
236
|
+
doctrail overrides-import --run-id <run_id> --input overrides.csv --reviewer alice
|
|
237
|
+
```
|
|
238
|
+
|
|
239
|
+
`v_final_*` views layer overrides on top of the original run output. The storage tables remain unchanged; the view is the merged surface.
|
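One way to picture the layering is a view that prefers a human override where one exists and falls back to the machine value otherwise. The table and view names below (`run_output`, `overrides`, `v_final_demo`) and their columns are hypothetical; this is a sketch of the idea, not Doctrail's generated SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical machine output for one run.
    CREATE TABLE run_output (key_value TEXT PRIMARY KEY, category TEXT);
    -- Hypothetical human corrections, stored separately.
    CREATE TABLE overrides (key_value TEXT PRIMARY KEY, category TEXT, reviewer TEXT);
    INSERT INTO run_output VALUES ('doc1', 'trade'), ('doc2', 'security');
    INSERT INTO overrides VALUES ('doc2', 'trade', 'alice');

    -- Merged surface: override wins, machine value is the fallback.
    CREATE VIEW v_final_demo AS
    SELECT r.key_value,
           COALESCE(o.category, r.category) AS category,
           o.reviewer IS NOT NULL AS overridden
    FROM run_output AS r
    LEFT JOIN overrides AS o ON o.key_value = r.key_value;
    """
)
final_rows = conn.execute("SELECT * FROM v_final_demo ORDER BY key_value").fetchall()
machine_rows = conn.execute("SELECT * FROM run_output ORDER BY key_value").fetchall()
print(final_rows, machine_rows)
```

The machine rows are never updated; only the view changes what a reader sees. Note that a plain `COALESCE` cannot express "override this value to null", so a real implementation would need a separate marker for that case.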
|
240
|
+
|
|
241
|
+
Short rule:
|
|
242
|
+
- model runs are immutable coder outputs
|
|
243
|
+
- compare runs in views, but edit final values in one clean final surface
|
|
244
|
+
- agreement between models is ICR
|
|
245
|
+
- do not invent duplicate `machine_*` columns in the final editable table
|
|
246
|
+
- keep only enough provenance to join back to machine history
|
|
247
|
+
|
|
248
|
+
Current implementation note:
|
|
249
|
+
- `doctrail finalize --run-id <run_id> --table <table_name>` materializes one writable final table from a chosen run
|
|
250
|
+
- `doctrail finalize --view <view_name> --table <table_name>` materializes any existing review view into a writable table
|
|
251
|
+
- `v_final_*` views plus override import/export still exist for lightweight review and CSV workflows
|
|
252
|
+
|
|
253
|
+
For repeated extracted items, prefer exploded review views so one extracted item becomes one row.
|
|
254
|
+
|
|
255
|
+
If the user wants to repair or verify normalized storage from audit:
|
|
256
|
+
|
|
257
|
+
```bash
|
|
258
|
+
doctrail rebuild-enrichments --db-path /path/to/db.db --yes
|
|
259
|
+
```
|
|
260
|
+
|
|
261
|
+
This only works for audit rows that have the persisted projection payload. It is strict by design and will fail rather than guess from legacy raw JSON.
|
|
262
|
+
|
|
263
|
+
## ICR workflow
|
|
264
|
+
|
|
265
|
+
For model agreement work:
|
|
266
|
+
|
|
267
|
+
```bash
|
|
268
|
+
doctrail icr classify -m gpt-4o-mini -m openrouter/google/gemini-2.5-flash --sample 100 --seed 42
|
|
269
|
+
doctrail icr-report --db-path /path/to/db.db --field category
|
|
270
|
+
```
|
|
271
|
+
|
|
272
|
+
The default install computes Krippendorff's alpha and Cohen's kappa; `doctrail[icr]` remains accepted for older setup scripts but is no longer required.
|
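For intuition about what the agreement numbers mean, here is a small self-contained computation of Cohen's kappa for two coders over nominal labels, using the standard formula (observed agreement minus chance agreement, divided by one minus chance agreement). It is a plain-Python illustration with made-up labels, not the code path `doctrail icr` uses.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equally long lists of nominal labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(coder_a)
    # Share of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


model_a = ["trade", "trade", "security", "security"]
model_b = ["trade", "security", "security", "security"]
kappa = cohens_kappa(model_a, model_b)
print(kappa)
```

Here the coders agree on three of four items (0.75), chance agreement is 0.5, and kappa is 0.5, which shows how raw agreement overstates reliability when few labels are in play.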
|
273
|
+
|
|
274
|
+
## Working with the database directly
|
|
275
|
+
|
|
276
|
+
Everything Doctrail produces lands in plain SQLite, so most inspection and reshaping is just SQL. `sqlite3` covers ad-hoc queries; for anything heavier, `sqlite-utils` is the right companion CLI (reference: https://sqlite-utils.datasette.io/). It is not bundled with doctrail — install it once with `uv tool install sqlite-utils`. A ready reckoner of the moves you will actually make:
|
|
277
|
+
|
|
278
|
+
- list tables and views: `sqlite-utils tables out/database.db` (add `--views` for views)
|
|
279
|
+
- inspect schema: `sqlite-utils schema out/database.db documents`
|
|
280
|
+
- peek at a wide view: `sqlite-utils rows out/database.db v_documents_enriched --limit 5`
|
|
281
|
+
- run a query as JSON lines: `sqlite-utils out/database.db "SELECT ... FROM v_..." --nl`
|
|
282
|
+
- count answered rows for an enrichment: `sqlite-utils out/database.db "SELECT COUNT(DISTINCT key_value) FROM _enrichments WHERE enrichment_name='x'"`
|
|
283
|
+
- export a review view to CSV: `sqlite-utils rows out/database.db v_run_<id> --csv > review.csv`
|
|
284
|
+
|
|
285
|
+
Read-only inspection is always safe. Reshape into your own tables and views freely, but do not hand-edit the `_`-prefixed ledger tables — let doctrail write those, and use overrides/finalize for human corrections.
|
|
286
|
+
|
|
287
|
+
## Troubleshooting
|
|
288
|
+
|
|
289
|
+
- `No config found`: run `doctrail init` or pass `--config`.
|
|
290
|
+
- `key_column not found`: include it in the SQL `SELECT`.
|
|
291
|
+
- Tables named `enrichments` / `enrichment_audit` without the `_` prefix mean a pre-rename database. Any doctrail command migrates it in place automatically (tracked via `PRAGMA user_version`); do not rename tables by hand.
|
|
292
|
+
- Unsure what changed between prompt versions: use `doctrail runs` and `doctrail diff-runs`.
|
|
293
|
+
- Unsure what the user should inspect: materialize a `v_run_*` view first; it is the safest default review surface.
|
|
294
|
+
- `Rows were skipped but should retry`: check `_enrichments`, not just `_enrichment_audit` or `_enrichment_run_items`.
|
|
295
|
+
|
|
296
|
+
```bash
|
|
297
|
+
sqlite3 /path/to/db.db "
|
|
298
|
+
SELECT COUNT(DISTINCT key_value)
|
|
299
|
+
FROM _enrichments
|
|
300
|
+
WHERE enrichment_name = 'your_enrichment'
|
|
301
|
+
"
|
|
302
|
+
```
|
|
303
|
+
|
|
304
|
+
- `Batch run shows many errors`: inspect per-row status in the run snapshot.
|
|
305
|
+
|
|
306
|
+
```bash
|
|
307
|
+
sqlite3 /path/to/db.db "
|
|
308
|
+
SELECT status, COUNT(*)
|
|
309
|
+
FROM _enrichment_run_items
|
|
310
|
+
WHERE run_id = 'your_run_id'
|
|
311
|
+
GROUP BY status
|
|
312
|
+
"
|
|
313
|
+
```
|
|
314
|
+
|
|
315
|
+
- `Field name collides with source column`: rename the enrichment field if possible. If you need to proceed immediately, use:
|
|
316
|
+
|
|
317
|
+
```bash
|
|
318
|
+
doctrail enrich <name> --allow-column-collision
|
|
319
|
+
```
|