aiex-cli 0.0.5-beta.3 → 0.0.5-beta.4
- package/README.md +6 -16
- package/dist/cli.mjs +242 -321
- package/dist/{doctor-collector-CQPDBVTw.mjs → doctor-collector-Cv7RArla.mjs} +8 -5
- package/dist/index.mjs +1 -1
- package/dist/web/assets/AISettings-BlyTFIIy.js +272 -0
- package/dist/web/assets/ExtractionViewer-BhhWrBs2.js +1 -0
- package/dist/web/assets/{index-BWm_fhNt.js → index-CKV2X6sS.js} +2 -2
- package/dist/web/assets/index-Csdgio76.css +2 -0
- package/dist/web/index.html +2 -2
- package/dist/{zh-CN-CKxdpj8c.mjs → zh-CN-CyL-61Ow.mjs} +1 -2
- package/package.json +1 -1
- package/dist/web/assets/AISettings-DoDVYWfb.js +0 -272
- package/dist/web/assets/ExtractionViewer-DqIrBGNK.js +0 -1
- package/dist/web/assets/index-CvY9TGny.css +0 -2
package/README.md
CHANGED
@@ -70,7 +70,6 @@ aiex extract -s <schema> -f <file> # from file (txt, pdf, png, jpg, ...)
 aiex extract -s <schema> -f <file> -m <model> # specify AI model (overrides auto-selection)
 aiex extract -s <schema> -f <file> --no-insert # extract and save JSON without inserting into SQLite
 aiex extract -s <schema> -f <file> --force # force re-extraction even if already processed
-aiex extract -s <schema> -f <file> --agent # run ReAct agent mode (ideal for large documents)
 aiex extract -s <schema> -d <directory> # batch extract all supported files in a directory
 aiex extract -s <schema> -d <dir> -g "*.pdf" # batch with glob filter
 aiex extract history # list extraction audit records
@@ -129,7 +128,6 @@ Dumps all extracted data for a given schema (or table) from the SQLite database
 | `aiex extract -s <name> -f <file> -m <model>` | Extract with a specific AI model |
 | `aiex extract -s <name> -f <file> --no-insert` | Extract and save JSON without inserting into SQLite |
 | `aiex extract -s <name> -f <file> --force` | Force re-extraction even if the file has already been processed |
-| `aiex extract -s <name> -f <file> --agent` | Extract data in ReAct agent mode (using tool navigation) |
 | `aiex extract -s <name> -d <dir>` | Batch extract all supported files in a directory |
 | `aiex extract -s <name> -d <dir> -g "*.pdf"` | Batch extract with glob filter |
 | `aiex extract history` | List extraction audit records |
@@ -204,22 +202,14 @@ aiex completion fish | source

 <br>

-## 📄 Large Document Processing
+## 📄 Large Document Processing

-When processing very large documents (exceeding `40,000` characters), `aiex`
+When processing very large documents (exceeding `40,000` characters), `aiex` runs an optimized **Pipeline Mode** to handle context window limits and control API costs:

-
-- **
-- **
-
-### 2. ReAct Agent Mode
-- **Mechanism**: Spawns an agent equipped with document navigation tools:
-- `listChunks()`: Returns a Table of Contents (headings, sizes, indices).
-- `readChunk(chunkId)`: Fetches a specific section.
-- `searchChunks(query)`: Matches keywords across all chunks.
-- `submitExtraction(data)`: Submits the final structured JSON payload.
-The agent uses these tools to dynamically browse and retrieve only the relevant parts, drastically reducing API token costs for giant documents.
-- **How to run**: Pass `--agent` / `-a` via the CLI, or toggle **Extraction Mode** under the **Prompts** tab in the Web UI.
+- **Sliding Window & Overlapping Slices**: Splits the document logically at Markdown headings or paragraph boundaries. It uses an overlapping sliding window to ensure contextual continuity at slice boundaries. Active heading hierarchies are tracked and prepended to each chunk as context.
+- **Concurrency Limiting**: To respect strict model rate limits, chunk extractions are processed in parallel with a strict concurrency limit (capped at 2 concurrent requests).
+- **Pre-filtering**: Integrates hybrid search-based pre-filtering to score and select only the most relevant document chunks based on schema queries, preventing unnecessary token usage on unrelated sections.
+- **Recursive Merging**: The final extracted JSON objects from each chunk are recursively merged, concatenating lists and deduplicating primitive fields.

 <br>

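
The Pipeline Mode steps described in the new README text (overlapping slices, a concurrency cap of 2, recursive merging) can be sketched roughly as below. This is a minimal illustration in Python, not the code shipped in `package/dist/cli.mjs`. The names `slice_paragraphs`, `extract_all`, and `merge_extractions` are hypothetical, splitting only on blank lines with a one-paragraph overlap is a simplification, and keeping the first non-empty value when primitive fields conflict is an assumption about what "deduplicating" means. Heading tracking and pre-filtering are left out.

```python
import asyncio
from typing import Any, Awaitable, Callable


def slice_paragraphs(text: str, max_chars: int, overlap_paragraphs: int = 1) -> list[str]:
    """Split text at paragraph boundaries into overlapping slices.

    Assumption: paragraphs are separated by blank lines, and the overlap is
    expressed as a number of trailing paragraphs repeated in the next slice.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    slices: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            slices.append("\n\n".join(current))
            # Carry the tail of the finished slice forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        slices.append("\n\n".join(current))
    return slices


async def extract_all(
    chunks: list[str],
    extract_one: Callable[[str], Awaitable[dict]],
    limit: int = 2,
) -> list[dict]:
    """Run extract_one over all chunks with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(chunk: str) -> dict:
        async with semaphore:
            return await extract_one(chunk)

    # gather preserves the input order of the chunks in its result list.
    return await asyncio.gather(*(guarded(chunk) for chunk in chunks))


def merge_extractions(a: Any, b: Any) -> Any:
    """Recursively merge two per-chunk results.

    Dicts are merged key by key, lists are concatenated, and for primitives
    the first non-empty value wins (an assumed conflict rule).
    """
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            merged[key] = merge_extractions(merged[key], value) if key in merged else value
        return merged
    if isinstance(a, list) and isinstance(b, list):
        return a + b
    if a is None or a == "":
        return b
    return a
```

With a real extraction coroutine in place of `extract_one`, the per-chunk results could be folded into one object with `functools.reduce(merge_extractions, results, {})`.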