aiex-cli 0.0.5-beta.3 → 0.0.5-beta.5

package/README.md CHANGED
@@ -70,7 +70,6 @@ aiex extract -s <schema> -f <file> # from file (txt, pdf, png, jpg, ...)
  aiex extract -s <schema> -f <file> -m <model> # specify AI model (overrides auto-selection)
  aiex extract -s <schema> -f <file> --no-insert # extract and save JSON without inserting into SQLite
  aiex extract -s <schema> -f <file> --force # force re-extraction even if already processed
- aiex extract -s <schema> -f <file> --agent # run ReAct agent mode (ideal for large documents)
  aiex extract -s <schema> -d <directory> # batch extract all supported files in a directory
  aiex extract -s <schema> -d <dir> -g "*.pdf" # batch with glob filter
  aiex extract history # list extraction audit records
@@ -129,7 +128,6 @@ Dumps all extracted data for a given schema (or table) from the SQLite database
  | `aiex extract -s <name> -f <file> -m <model>` | Extract with a specific AI model |
  | `aiex extract -s <name> -f <file> --no-insert` | Extract and save JSON without inserting into SQLite |
  | `aiex extract -s <name> -f <file> --force` | Force re-extraction even if the file has already been processed |
- | `aiex extract -s <name> -f <file> --agent` | Extract data in ReAct agent mode (using tool navigation) |
  | `aiex extract -s <name> -d <dir>` | Batch extract all supported files in a directory |
  | `aiex extract -s <name> -d <dir> -g "*.pdf"` | Batch extract with glob filter |
  | `aiex extract history` | List extraction audit records |
@@ -204,22 +202,14 @@ aiex completion fish | source
 
  <br>
 
- ## 📄 Large Document Processing (Pipeline vs. ReAct Agent)
+ ## 📄 Large Document Processing
 
- When processing very large documents (exceeding `40,000` characters), `aiex` provides two separate modes to handle context window limits and cost:
+ When processing very large documents (exceeding `40,000` characters), `aiex` runs an optimized **Pipeline Mode** to handle context window limits and control API costs:
 
- ### 1. Pipeline Mode (Default)
- - **Mechanism**: Splits the document logically at Markdown headings or paragraph boundaries. It processes each chunk sequentially through the LLM, prepending active heading stacks as context to prevent losing track of document structure (like headers). Finally, it merges the outputs recursively.
- - **Best for**: Small-to-medium files or structures where every single section must be scanned completely (e.g. log files).
-
- ### 2. ReAct Agent Mode
- - **Mechanism**: Spawns an agent equipped with document navigation tools:
- - `listChunks()`: Returns a Table of Contents (headings, sizes, indices).
- - `readChunk(chunkId)`: Fetches a specific section.
- - `searchChunks(query)`: Matches keywords across all chunks.
- - `submitExtraction(data)`: Submits the final structured JSON payload.
- The agent uses these tools to dynamically browse and retrieve only the relevant parts, drastically reducing API token costs for giant documents.
- - **How to run**: Pass `--agent` / `-a` via the CLI, or toggle **Extraction Mode** under the **Prompts** tab in the Web UI.
+ - **Token-Aware AST Splitting**: Parses structural Markdown elements (headings, paragraphs, lists) using an AST-based parser (`marked.lexer`) and splits them using precise token counters (`js-tiktoken`). Active heading hierarchies are tracked and prepended to each chunk as context. Tables and code blocks are kept intact (atomic blocks) to avoid syntax corruption.
+ - **Concurrency Limiting**: To respect strict model rate limits, chunk extractions are processed in parallel with a strict concurrency limit (capped at 2 concurrent requests).
+ - **Pre-filtering**: Integrates hybrid search-based pre-filtering to score and select only the most relevant document chunks based on schema queries, preventing unnecessary token usage on unrelated sections.
+ - **Recursive Merging**: The final extracted JSON objects from each chunk are recursively merged, concatenating lists and deduplicating primitive fields.
 
  <br>
 
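
Note on the new "Recursive Merging" bullet: the README does not show how per-chunk results are combined, so the following is only a minimal sketch of one plausible reading, not the `aiex` implementation. The names `Json`, `mergeExtractions`, and `mergeAll` are hypothetical, and the rule "first non-null primitive wins" is an assumption about what "deduplicating primitive fields" means.

```typescript
// Hypothetical sketch; none of these names come from the aiex source.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function isPlainObject(value: Json): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Merge two per-chunk results:
// - arrays are concatenated,
// - nested objects are merged key by key (recursively),
// - for primitives (or mismatched shapes) the earlier value is kept
//   unless it is null (assumed meaning of "deduplicating").
function mergeExtractions(a: Json, b: Json): Json {
  if (Array.isArray(a) && Array.isArray(b)) {
    return [...a, ...b];
  }
  if (isPlainObject(a) && isPlainObject(b)) {
    const out: { [key: string]: Json } = { ...a };
    for (const key of Object.keys(b)) {
      const existing = out[key];
      const incoming = b[key];
      if (incoming === undefined) continue;
      out[key] = existing === undefined ? incoming : mergeExtractions(existing, incoming);
    }
    return out;
  }
  return a === null ? b : a;
}

// Fold the results of all chunks, in document order, into one object.
function mergeAll(results: Json[]): Json {
  return results.reduce<Json>((acc, r) => mergeExtractions(acc, r), null);
}

// Example: two chunks of the same invoice, each seeing part of the data.
const firstChunk: Json = { vendor: "Acme", total: null, items: [{ sku: "A1" }] };
const secondChunk: Json = { vendor: "Acme", total: 42, items: [{ sku: "B2" }] };
console.log(JSON.stringify(mergeAll([firstChunk, secondChunk])));
```

Under these assumptions, a field that one chunk could not fill (`total: null`) is completed by a later chunk, while list fields such as `items` accumulate entries from every chunk.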