delm 0.1.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. delm-0.1.3/LICENSE.md +21 -0
  2. delm-0.1.3/PKG-INFO +377 -0
  3. delm-0.1.3/README.md +332 -0
  4. delm-0.1.3/pyproject.toml +109 -0
  5. delm-0.1.3/setup.cfg +4 -0
  6. delm-0.1.3/src/delm/__init__.py +166 -0
  7. delm-0.1.3/src/delm/config.py +709 -0
  8. delm-0.1.3/src/delm/constants.py +165 -0
  9. delm-0.1.3/src/delm/core/__init__.py +15 -0
  10. delm-0.1.3/src/delm/core/data_processor.py +156 -0
  11. delm-0.1.3/src/delm/core/experiment_manager.py +835 -0
  12. delm-0.1.3/src/delm/core/extraction_manager.py +526 -0
  13. delm-0.1.3/src/delm/delm.py +389 -0
  14. delm-0.1.3/src/delm/exceptions.py +76 -0
  15. delm-0.1.3/src/delm/logging.py +130 -0
  16. delm-0.1.3/src/delm/models.py +45 -0
  17. delm-0.1.3/src/delm/schemas/__init__.py +17 -0
  18. delm-0.1.3/src/delm/schemas/schema_manager.py +75 -0
  19. delm-0.1.3/src/delm/schemas/schemas.py +749 -0
  20. delm-0.1.3/src/delm/strategies/__init__.py +26 -0
  21. delm-0.1.3/src/delm/strategies/data_loaders.py +381 -0
  22. delm-0.1.3/src/delm/strategies/scoring_strategies.py +131 -0
  23. delm-0.1.3/src/delm/strategies/splitting_strategies.py +163 -0
  24. delm-0.1.3/src/delm/utils/__init__.py +13 -0
  25. delm-0.1.3/src/delm/utils/concurrent_processing.py +151 -0
  26. delm-0.1.3/src/delm/utils/cost_estimation.py +202 -0
  27. delm-0.1.3/src/delm/utils/cost_tracker.py +187 -0
  28. delm-0.1.3/src/delm/utils/model_price_database.py +121 -0
  29. delm-0.1.3/src/delm/utils/performance_estimation.py +467 -0
  30. delm-0.1.3/src/delm/utils/post_processing.py +378 -0
  31. delm-0.1.3/src/delm/utils/retry_handler.py +58 -0
  32. delm-0.1.3/src/delm/utils/semantic_cache.py +619 -0
  33. delm-0.1.3/src/delm/utils/type_checks.py +9 -0
  34. delm-0.1.3/src/delm.egg-info/PKG-INFO +377 -0
  35. delm-0.1.3/src/delm.egg-info/SOURCES.txt +36 -0
  36. delm-0.1.3/src/delm.egg-info/dependency_links.txt +1 -0
  37. delm-0.1.3/src/delm.egg-info/requires.txt +23 -0
  38. delm-0.1.3/src/delm.egg-info/top_level.txt +1 -0
delm-0.1.3/LICENSE.md ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Eric Fithian and Kirill Skobelev
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
delm-0.1.3/PKG-INFO ADDED
@@ -0,0 +1,377 @@
1
+ Metadata-Version: 2.4
2
+ Name: delm
3
+ Version: 0.1.3
4
+ Summary: Data Extraction Language Model - A pipeline for extracting structured data from text using language models
5
+ Author: Eric Fithian - Chicago Booth CAAI Lab
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/your-org/delm
8
+ Project-URL: Repository, https://github.com/your-org/delm
9
+ Project-URL: Documentation, https://github.com/your-org/delm#readme
10
+ Project-URL: Issues, https://github.com/your-org/delm/issues
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: Programming Language :: Python :: 3
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
18
+ Classifier: Topic :: Text Processing :: Linguistic
19
+ Requires-Python: >=3.10
20
+ Description-Content-Type: text/markdown
21
+ License-File: LICENSE.md
22
+ Requires-Dist: pandas>=1.5.0
23
+ Requires-Dist: numpy>=1.21.0
24
+ Requires-Dist: pyarrow>=7.0.0
25
+ Requires-Dist: instructor>=0.4.0
26
+ Requires-Dist: pydantic>=2.0.0
27
+ Requires-Dist: pyyaml>=6.0
28
+ Requires-Dist: python-dotenv>=1.0.0
29
+ Requires-Dist: tqdm>=4.64.0
30
+ Requires-Dist: rapidfuzz>=3.0.0
31
+ Requires-Dist: beautifulsoup4>=4.11.0
32
+ Requires-Dist: python-docx>=0.8.11
33
+ Requires-Dist: tiktoken>=0.5.0
34
+ Provides-Extra: dev
35
+ Requires-Dist: pytest>=7.0.0; extra == "dev"
36
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
37
+ Requires-Dist: black>=22.0.0; extra == "dev"
38
+ Requires-Dist: flake8>=5.0.0; extra == "dev"
39
+ Requires-Dist: mypy>=1.0.0; extra == "dev"
40
+ Requires-Dist: openpyxl>=3.0.0; extra == "dev"
41
+ Requires-Dist: marker-pdf>=0.1.0; extra == "dev"
42
+ Requires-Dist: zstandard>=0.21.0; extra == "dev"
43
+ Requires-Dist: lmdb>=1.3.0; extra == "dev"
44
+ Dynamic: license-file
45
+
46
+ # DELM (Data Extraction with Language Models)
47
+
48
+ A comprehensive Python toolkit for extracting structured data from unstructured text using language models. DELM provides a configurable, scalable pipeline with built-in cost tracking, caching, and evaluation capabilities.
49
+
50
+ ## Features
51
+
52
+ - **Multi-format Support**: TXT, HTML, MD, DOCX, PDF, CSV, Excel, Parquet, Feather
53
+ - **Progressive Schema System**: Simple → Nested → Multiple schemas for any complexity
54
+ - **Multi-Provider Support**: OpenAI, Anthropic, Google, Groq, Together AI, Fireworks AI
55
+ - **Smart Processing**: Configurable text splitting, relevance scoring, and filtering
56
+ - **Cost Optimization**: Built-in cost tracking, caching, and budget management
57
+ - **Batch Processing**: Parallel execution with checkpointing and resume capabilities
58
+ - **Comprehensive Evaluation**: Performance metrics and cost analysis tools
59
+
60
+ ## Installation
61
+
62
+ ```bash
63
+ # Clone the repository
64
+ git clone https://github.com/your-org/delm.git
65
+ cd delm
66
+
67
+ # Install from source
68
+ pip install -e .
69
+ ```
70
+
71
+ ## Quick Start
72
+
73
+ ### Basic Usage
74
+
75
+ ```python
76
+ from pathlib import Path
77
+ from delm import DELM
78
+
79
+ # Initialize DELM from a pipeline config YAML
80
+ delm = DELM.from_yaml(
81
+     config_path="example.config.yaml",
82
+     experiment_name="my_experiment",
83
+     experiment_directory=Path("experiments"),
84
+ )
85
+
86
+ # Process data
87
+ df = delm.prep_data("data/input.txt")
88
+ results = delm.process_via_llm()
89
+
90
+ # Get results
91
+ final_df = delm.get_extraction_results()
92
+ cost_summary = delm.get_cost_summary()
93
+ ```
94
+
95
+ ### Configuration Files
96
+
97
+ DELM uses two configuration files:
98
+
99
+ **1. Pipeline Configuration (`config.yaml`)**
100
+ ```yaml
101
+ llm_extraction:
102
+   provider: "openai"
103
+   name: "gpt-4o-mini"
104
+   temperature: 0.0
105
+   batch_size: 10
106
+   track_cost: true
107
+   max_budget: 50.0
108
+
109
+ data_preprocessing:
110
+   target_column: "text"
111
+   splitting:
112
+     type: "ParagraphSplit"
113
+   scoring:
114
+     type: "KeywordScorer"
115
+     keywords: ["price", "forecast", "guidance"]
116
+
117
+ schema:
118
+   spec_path: "schema_spec.yaml"
119
+ ```
120
+
121
+ **2. Schema Specification (`schema_spec.yaml`)**
122
+ ```yaml
123
+ schema_type: "nested"
124
+ container_name: "commodities"
125
+
126
+ variables:
127
+   - name: "commodity_type"
128
+     description: "Type of commodity mentioned"
129
+     data_type: "string"
130
+     required: true
131
+     allowed_values: ["oil", "gas", "copper", "gold"]
132
+
133
+   - name: "price_value"
134
+     description: "Price mentioned in text"
135
+     data_type: "number"
136
+     required: false
137
+ ```
138
+
139
+ ## Schema Types
140
+
141
+ DELM supports three levels of schema complexity:
142
+
143
+ ### Simple Schema (Level 1)
144
+ Extract key-value pairs from each text chunk:
145
+ ```yaml
146
+ schema_type: "simple"
147
+ variables:
148
+   - name: "price"
149
+     description: "Price mentioned"
150
+     data_type: "number"
151
+   - name: "company"
152
+     description: "Company name"
153
+     data_type: "string"
154
+ ```
155
+
156
+ ### Nested Schema (Level 2)
157
+ Extract structured objects with multiple fields:
158
+ ```yaml
159
+ schema_type: "nested"
160
+ container_name: "commodities"
161
+ variables:
162
+   - name: "type"
163
+     description: "Commodity type"
164
+     data_type: "string"
165
+   - name: "price"
166
+     description: "Price value"
167
+     data_type: "number"
168
+ ```
169
+
170
+ ### Multiple Schema (Level 3)
171
+ Extract multiple independent schemas simultaneously:
172
+ ```yaml
173
+ schema_type: "multiple"
174
+ commodities:
175
+   schema_type: "nested"
176
+   container_name: "commodities"
177
+   variables: [...]
178
+ companies:
179
+   schema_type: "nested"
180
+   container_name: "companies"
181
+   variables: [...]
182
+ ```
183
+
184
+ ## Supported Data Types
185
+
186
+ | Type | Description | Example |
187
+ |------|-------------|---------|
188
+ | `string` | Text values | `"Apple Inc."` |
189
+ | `number` | Floating-point numbers | `150.5` |
190
+ | `integer` | Whole numbers | `2024` |
191
+ | `boolean` | True/False values | `true` |
192
+ | `[string]` | List of strings | `["oil", "gas"]` |
193
+ | `[number]` | List of numbers | `[100, 200, 300]` |
194
+ | `[integer]` | List of integers | `[1, 2, 3, 4]` |
195
+ | `[boolean]` | List of booleans | `[true, false, true]` |
196
+
197
+
198
+ ## Advanced Features
199
+
200
+ ### Cost Summary
201
+ ```python
202
+ # Get cost summary after extraction
203
+ cost_summary = delm.get_cost_summary()
204
+ print(f"Total cost: ${cost_summary['total_cost']}")
205
+ ```
206
+
207
+ ### Semantic Caching
208
+ Reuses API responses for identical calls, so experiment re-runs that repeat the same requests do not spend additional API credits.
209
+ ```yaml
210
+ semantic_cache:
211
+   backend: "sqlite" # sqlite, lmdb, filesystem
212
+   path: ".delm_cache"
213
+   max_size_mb: 512
214
+   synchronous: "normal" # sqlite only: "normal" or "full"
215
+ ```
216
+
217
+ ### Relevance Filtering
218
+ ```yaml
219
+ data_preprocessing:
220
+   scoring:
221
+     type: "KeywordScorer"
222
+     keywords: ["price", "forecast", "guidance"]
223
+   pandas_score_filter: "delm_score >= 0.7"
224
+ ```
225
+ If a scorer is configured but no `pandas_score_filter` is provided, all chunks are kept (a warning is logged).
226
+
227
+ ### Text Splitting Strategies
228
+ ```yaml
229
+ data_preprocessing:
230
+   splitting:
231
+     type: "ParagraphSplit" # Split by paragraphs
232
+     # type: "FixedWindowSplit" # Split by sentence count
233
+     # window: 5
234
+     # stride: 2
235
+     # type: "RegexSplit" # Custom regex pattern
236
+     # pattern: "\n\n"
237
+ ```
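The commented-out `FixedWindowSplit` options above (`window`, `stride`) suggest a sliding window over sentences. The following is only a minimal sketch of that idea, not DELM's implementation: the helper names `split_sentences` and `fixed_window_chunks`, the naive punctuation-based sentence splitter, and the reading of `stride` as "sentences to advance per step" are all assumptions made for illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def fixed_window_chunks(sentences: list[str], window: int = 5, stride: int = 2) -> list[str]:
    """Join `window` consecutive sentences, advancing `stride` sentences per step."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # this window already reaches the end of the text
    return chunks


text = "One. Two. Three. Four. Five. Six. Seven."
chunks = fixed_window_chunks(split_sentences(text), window=5, stride=2)
for chunk in chunks:
    print(chunk)
```

With a stride smaller than the window, consecutive chunks overlap, which trades extra tokens for context that is not cut at chunk boundaries.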
238
+
239
+ ## Performance & Evaluation
240
+
241
+ ### Cost Estimation
242
+ Estimate the total cost of your current configuration before running the full extraction.
243
+ ```python
244
+ from delm.utils.cost_estimation import estimate_input_token_cost, estimate_total_cost
245
+
246
+ # Estimate input token costs without API calls
247
+ input_cost = estimate_input_token_cost(
248
+     config="config.yaml",
249
+     data_source="data.csv"
250
+ )
251
+ print(f"Input token cost: ${input_cost:.2f}")
252
+
253
+ # Estimate total costs using API calls on a sample
254
+ total_cost = estimate_total_cost(
255
+     config="config.yaml",
256
+     data_source="data.csv",
257
+     sample_size=100
258
+ )
259
+ print(f"Estimated total cost: ${total_cost:.2f}")
260
+ ```
261
+
262
+ ### Performance Evaluation
263
+ Estimate the performance of your current configuration before running the full extraction.
264
+ ```python
265
+ from delm.utils.performance_estimation import estimate_performance
266
+
267
+ # Evaluate against human-labeled data
268
+ metrics, expected_and_extracted_df = estimate_performance(
269
+     config="config.yaml",
270
+     data_source="test_data.csv",
271
+     expected_extraction_output_df=human_labeled_df,
272
+     true_json_column="expected_json",
273
+     matching_id_column="id",
274
+     record_sample_size=50 # Optional: limit sample size
275
+ )
276
+
277
+ # Display performance metrics
278
+ for key, value in metrics.items():
279
+     precision = value.get("precision", 0)
280
+     recall = value.get("recall", 0)
281
+     f1 = value.get("f1", 0)
282
+     print(f"{key:<30} Precision: {precision:.3f} Recall: {recall:.3f} F1: {f1:.3f}")
283
+ ```
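The loop above reads `precision`, `recall`, and `f1` per key. As a rough illustration of how such numbers relate to each other, the sketch below computes them from sets of expected and extracted values. How DELM actually matches extracted values against labels is not stated here, so the exact-match set comparison and the helper name `precision_recall_f1` are assumptions.

```python
def precision_recall_f1(expected: set, extracted: set) -> dict[str, float]:
    """Set-based metrics (assumption: a value counts as correct only on exact match)."""
    true_positives = len(expected & extracted)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and model output for a single field.
metrics = {
    "commodity_type": precision_recall_f1(
        expected={"oil", "gas", "gold"},
        extracted={"oil", "gas", "copper"},
    )
}

for key, value in metrics.items():
    print(f"{key:<30} Precision: {value['precision']:.3f} "
          f"Recall: {value['recall']:.3f} F1: {value['f1']:.3f}")
```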
284
+
285
+ ## Configuration Reference
286
+
287
+ ### Required Fields
288
+ - `llm_extraction.provider`: LLM provider (openai, anthropic, google, etc.)
289
+ - `llm_extraction.name`: Model name (gpt-4o-mini, claude-3-sonnet, etc.)
290
+ - `schema.spec_path`: Path to schema specification file
291
+
292
+ ### Optional Fields with Defaults
293
+ - `llm_extraction.temperature`: 0.0 (deterministic)
294
+ - `llm_extraction.batch_size`: 10 (records per batch)
295
+ - `llm_extraction.max_workers`: 1 (concurrent workers)
296
+ - `llm_extraction.track_cost`: true (cost tracking)
297
+ - `semantic_cache.backend`: "sqlite" (cache backend)
298
+
299
+ ### Additional LLM Fields
300
+ - `llm_extraction.max_retries`: 3 (retry attempts)
301
+ - `llm_extraction.base_delay`: 1.0 (seconds, exponential backoff base)
302
+ - `llm_extraction.dotenv_path`: null (path to `.env` for credentials)
303
+ - `llm_extraction.model_input_cost_per_1M_tokens`: null (override pricing)
304
+ - `llm_extraction.model_output_cost_per_1M_tokens`: null (override pricing)
305
+
306
+ If using providers not present in the built-in pricing DB, set both `model_input_cost_per_1M_tokens` and `model_output_cost_per_1M_tokens`, or set `track_cost: false`.
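To make the per-1M-token override fields concrete, here is a small arithmetic sketch. The function names, the token counts, and the prices are hypothetical, and the budget check is only one plausible reading of `max_budget`; none of this is taken from DELM's cost tracker.

```python
def estimate_call_cost(
    input_tokens: int,
    output_tokens: int,
    input_cost_per_1m: float,
    output_cost_per_1m: float,
) -> float:
    """Cost in dollars, given prices quoted per one million tokens."""
    return (
        (input_tokens / 1_000_000) * input_cost_per_1m
        + (output_tokens / 1_000_000) * output_cost_per_1m
    )


def within_budget(spent: float, next_call_cost: float, max_budget: float) -> bool:
    """Assumed policy: allow a call only if it keeps total spend at or under the budget."""
    return spent + next_call_cost <= max_budget


# Hypothetical prices: $0.50 per 1M input tokens, $2.00 per 1M output tokens.
cost = estimate_call_cost(200_000, 50_000, input_cost_per_1m=0.50, output_cost_per_1m=2.00)
print(f"Estimated cost: ${cost:.2f}")
print("Within budget:", within_budget(spent=10.0, next_call_cost=cost, max_budget=50.0))
```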
307
+
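The `max_retries` and `base_delay` fields listed under Additional LLM Fields describe retries with exponential backoff. A minimal sketch of that pattern follows. The doubling factor, the helper names `backoff_delays` and `call_with_retries`, and the choice to retry on any exception are assumptions; DELM's retry handler may behave differently.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int = 3, base_delay: float = 1.0) -> list[float]:
    """Delay before each retry, doubling every attempt (assumed growth factor of 2)."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call `fn`, retrying up to `max_retries` times after the first failure."""
    delays = backoff_delays(max_retries, base_delay)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])
    raise RuntimeError("unreachable")


print(backoff_delays(max_retries=3, base_delay=1.0))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.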
308
+ ### Data Preprocessing Fields
309
+ - `data_preprocessing.drop_target_column`: false
310
+ - `data_preprocessing.pandas_score_filter`: null (e.g., "delm_score >= 0.7")
311
+ - `data_preprocessing.preprocessed_data_path`: null (path to a `.feather` file with `delm_text_chunk` and `delm_chunk_id` columns; when set, omit the splitting, scoring, and filter fields)
312
+
313
+ ### Semantic Cache Fields
314
+ - `semantic_cache.backend`: "sqlite" | "lmdb" | "filesystem"
315
+ - `semantic_cache.path`: ".delm_cache"
316
+ - `semantic_cache.max_size_mb`: 512
317
+ - `semantic_cache.synchronous`: "normal" | "full" (sqlite only)
318
+
319
+ ## Experiment Storage & Logging
320
+
321
+ - Disk storage (default): checkpointing, resume, and results persisted under `delm_experiments/<experiment_name>/`.
322
+ - In-memory storage: `use_disk_storage=False` for fast prototyping (no persistence, no resume).
323
+ - Logging: by default, rotating file logs under `delm_logs/<experiment_name>/` when `save_file_log=True`.
324
+ - Tunables: `save_file_log`, `log_dir`, `console_log_level`, `file_log_level`, `override_logging`.
325
+ - Or call `delm.logging.configure(...)` directly.
326
+
327
+ ## Architecture
328
+
329
+ ### Core Components
330
+ 1. **DataProcessor**: Handles loading, splitting, and scoring
331
+ 2. **SchemaManager**: Manages schema loading and validation
332
+ 3. **ExtractionManager**: Orchestrates LLM extraction
333
+ 4. **ExperimentManager**: Handles experiment state and checkpointing
334
+ 5. **CostTracker**: Monitors API costs and budgets
335
+
336
+ ### Strategy Classes
337
+ - **SplitStrategy**: Text chunking (Paragraph, FixedWindow, Regex)
338
+ - **RelevanceScorer**: Content scoring (Keyword, Fuzzy)
339
+ - **SchemaRegistry**: Schema type management
340
+
341
+ ### Estimation Functions
342
+ - **estimate_input_token_cost**: Estimate input token costs without API calls
343
+ - **estimate_total_cost**: Estimate total costs using API calls on a sample
344
+ - **estimate_performance**: Evaluate extraction performance against human-labeled data
345
+
346
+ ## File Format Support
347
+
348
+ | Format | Extension | Requirements |
349
+ |--------|-----------|--------------|
350
+ | Text | `.txt` | Built-in |
351
+ | HTML/Markdown | `.html`, `.htm`, `.md` | `beautifulsoup4` |
352
+ | Word Documents | `.docx` | `python-docx` |
353
+ | PDF | `.pdf` | `marker-pdf` (OCR) |
354
+ | CSV | `.csv` | `pandas` |
355
+ | Excel | `.xlsx` | `openpyxl` |
356
+ | Parquet | `.parquet` | `pyarrow` |
357
+ | Feather | `.feather` | `pyarrow` |
358
+
359
+ ## Documentation
360
+
361
+ ### Local MkDocs Site
362
+ 1. Install the documentation dependencies: `pip install -e ".[docs]"`
363
+ 2. Serve the docs locally: `mkdocs serve`
364
+ 3. Open `http://127.0.0.1:8000/` in your browser to explore the site.
365
+
366
+ Use `mkdocs build` to generate a static site in the `site/` directory when you need a distributable bundle.
367
+
368
+ ### Reference Materials
369
+ - [Schema Reference](SCHEMA_REFERENCE.md) - Detailed schema configuration guide
370
+ - [Configuration Examples](example.config.yaml) - Complete configuration templates
371
+ - [Schema Examples](example.schema_spec.yaml) - Schema specification templates
372
+
373
+ ## Acknowledgments
374
+
375
+ - Built on [Instructor](https://python.useinstructor.com/) for structured outputs
376
+ - Uses [Marker](https://pypi.org/project/marker-pdf/) for PDF processing
377
+ - Developed at the Center for Applied AI at Chicago Booth