gaik 0.2.6__tar.gz

gaik-0.2.6/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 GAIK - GenAI for knowledge mgt
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
gaik-0.2.6/PKG-INFO ADDED
@@ -0,0 +1,275 @@
1
+ Metadata-Version: 2.4
2
+ Name: gaik
3
+ Version: 0.2.6
4
+ Summary: General AI Kit - Reusable AI/ML components for Python
5
+ Author: GAIK Project
6
+ License: MIT License
7
+
8
+ Copyright (c) 2025 GAIK - GenAI for knowledge mgt
9
+
10
+ Permission is hereby granted, free of charge, to any person obtaining a copy
11
+ of this software and associated documentation files (the "Software"), to deal
12
+ in the Software without restriction, including without limitation the rights
13
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
14
+ copies of the Software, and to permit persons to whom the Software is
15
+ furnished to do so, subject to the following conditions:
16
+
17
+ The above copyright notice and this permission notice shall be included in all
18
+ copies or substantial portions of the Software.
19
+
20
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
21
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
22
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
23
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
24
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
25
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
26
+ SOFTWARE.
27
+
28
+ Project-URL: Homepage, https://gaik.ai/
29
+ Project-URL: Repository, https://github.com/GAIK-project/toolkit-shared-components
30
+ Project-URL: Documentation, https://github.com/GAIK-project/toolkit-shared-components/tree/main/gaik-py
31
+ Project-URL: Issues, https://github.com/GAIK-project/toolkit-shared-components/issues
32
+ Keywords: ai,ml,langchain,openai,anthropic,google,structured-outputs,pydantic,schema,extraction
33
+ Classifier: Development Status :: 3 - Alpha
34
+ Classifier: Intended Audience :: Developers
35
+ Classifier: License :: OSI Approved :: MIT License
36
+ Classifier: Programming Language :: Python :: 3
37
+ Classifier: Programming Language :: Python :: 3.10
38
+ Classifier: Programming Language :: Python :: 3.11
39
+ Classifier: Programming Language :: Python :: 3.12
40
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
41
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
42
+ Requires-Python: >=3.10
43
+ Description-Content-Type: text/markdown
44
+ License-File: LICENSE
45
+ Requires-Dist: pydantic>=2.12.3
46
+ Requires-Dist: langchain-core>=1.0.3
47
+ Requires-Dist: langchain-openai>=1.0.2
48
+ Requires-Dist: langchain-anthropic>=1.0.1
49
+ Requires-Dist: langchain-google-genai>=3.0.1
50
+ Provides-Extra: dev
51
+ Requires-Dist: ruff>=0.14.1; extra == "dev"
52
+ Requires-Dist: build>=1.0; extra == "dev"
53
+ Requires-Dist: twine>=4.0; extra == "dev"
54
+ Provides-Extra: vision
55
+ Requires-Dist: openai>=1.40.0; extra == "vision"
56
+ Requires-Dist: pdf2image>=1.17.0; extra == "vision"
57
+ Requires-Dist: pillow>=10.0.0; extra == "vision"
58
+ Requires-Dist: python-dotenv>=1.0.0; extra == "vision"
59
+ Dynamic: license-file
60
+
61
+ # GAIK - General AI Kit
62
+
63
+ **Reusable AI/ML components for Python**
64
+
65
+ Multi-provider AI toolkit for structured data extraction. Supports OpenAI, Anthropic Claude, Google Gemini, and Azure OpenAI.
66
+
67
+ ## Features
68
+
69
+ ### 🔍 Dynamic Data Extraction (`gaik.extract`)
70
+
71
+ Extract structured data from unstructured text using LangChain's structured outputs:
72
+
73
+ - ✅ **Multi-provider** - OpenAI, Anthropic, Azure, Google - easy switching
74
+ - ✅ **Guaranteed structure** - API-enforced schema compliance
75
+ - ✅ **Type-safe** - Full Pydantic validation
76
+ - ✅ **No code generation** - Uses Pydantic's `create_model()`, no `eval()`
77
+ - ✅ **Cost-effective** - Minimal API calls
78
+ - ✅ **Simple & clean** - Easy to understand, minimal dependencies
79
+
80
+ ### 🖼️ Vision PDF Parsing (`gaik.parsers`)
81
+
82
+ Convert PDF pages to Markdown with OpenAI or Azure OpenAI vision models:
83
+
84
+ - ✅ **Single API surface** - Works with standard OpenAI or Azure deployments
85
+ - ✅ **Optional extras** - Install with `pip install gaik[vision]`
86
+ - ✅ **CLI ready** - See `examples/demo_vision_parser.py` for quick conversions
87
+ - ✅ **Table-aware** - Keeps multi-page tables intact with optional cleanup
88
+
89
+ ## Installation
90
+
91
+ ```bash
92
+ # Install from Test PyPI
93
+ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ gaik
94
+ ```
95
+
96
+ ## Quick Start
97
+
98
+ ### 1. Set up your provider API key
99
+
100
+ **OpenAI (default):**
101
+
102
+ ```bash
103
+ export OPENAI_API_KEY='sk-...' # Get from: https://platform.openai.com/api-keys
104
+ ```
105
+
106
+ **Anthropic:**
107
+
108
+ ```bash
109
+ export ANTHROPIC_API_KEY='sk-ant-...' # Get from: https://console.anthropic.com
110
+ ```
111
+
112
+ **Google:**
113
+
114
+ ```bash
115
+ export GOOGLE_API_KEY='...' # Get from: https://ai.google.dev
116
+ ```
117
+
118
+ **Azure OpenAI:**
119
+
120
+ ```bash
121
+ export AZURE_OPENAI_API_KEY='...'
122
+ export AZURE_OPENAI_ENDPOINT='https://your-resource.openai.azure.com/'
123
+ ```
124
+
125
+ ### 2. Simple Extraction
126
+
127
+ ```python
128
+ from gaik.extract import SchemaExtractor
129
+
130
+ # Using default OpenAI provider
131
+ extractor = SchemaExtractor("Extract name and age from text")
132
+ result = extractor.extract_one("Alice is 25 years old")
133
+ print(result)
134
+ # {'name': 'Alice', 'age': 25}
135
+
136
+ # Switch to Anthropic Claude
137
+ extractor = SchemaExtractor(
138
+ "Extract name and age from text",
139
+ provider="anthropic"
140
+ )
141
+
142
+ # Use Google Gemini
143
+ extractor = SchemaExtractor(
144
+ "Extract name and age from text",
145
+ provider="google"
146
+ )
147
+ ```
148
+
149
+ ### 3. Batch Extraction
150
+
151
+ ```python
152
+ from gaik.extract import dynamic_extraction_workflow
153
+
154
+ description = """
155
+ Extract from invoices:
156
+ - Invoice number
157
+ - Total amount in USD
158
+ - Vendor name
159
+ """
160
+
161
+ documents = [
162
+ "Invoice #12345 from Acme Corp. Total: $1,500",
163
+ "INV-67890, Supplier: TechCo, Amount: $2,750"
164
+ ]
165
+
166
+ # Use any provider
167
+ results = dynamic_extraction_workflow(
168
+ description,
169
+ documents,
170
+ provider="openai" # or "anthropic", "google", "azure"
171
+ )
172
+
173
+ for result in results:
174
+ print(f"Invoice: {result['invoice_number']}, Amount: ${result['total_amount']}")
175
+ ```
176
+
177
+ ### 4. Reusable Extractor (Recommended)
178
+
179
+ ```python
180
+ from gaik.extract import SchemaExtractor
181
+
182
+ # Create extractor once
183
+ extractor = SchemaExtractor("""
184
+ Extract from project reports:
185
+ - Project title
186
+ - Lead institution
187
+ - Total funding in euros
188
+ - List of partner countries
189
+ """)
190
+
191
+ # Reuse for multiple batches
192
+ batch1_results = extractor.extract(documents_batch1)
193
+ batch2_results = extractor.extract(documents_batch2)
194
+
195
+ # Inspect the schema
196
+ print(f"Fields: {extractor.field_names}")
197
+ # ['project_title', 'lead_institution', 'total_funding', 'partner_countries']
198
+ ```
199
+
200
+ ### 5. Schema-Only Generation
201
+
202
+ Generate Pydantic schemas without extraction:
203
+
204
+ ```python
205
+ from gaik.extract import FieldSpec, ExtractionRequirements, create_extraction_model
206
+
207
+ requirements = ExtractionRequirements(
208
+ use_case_name="Invoice",
209
+ fields=[
210
+ FieldSpec(
211
+ field_name="invoice_number",
212
+ field_type="str",
213
+ description="Invoice identifier",
214
+ required=True
215
+ ),
216
+ FieldSpec(
217
+ field_name="amount",
218
+ field_type="float",
219
+ description="Total amount",
220
+ required=True
221
+ )
222
+ ]
223
+ )
224
+
225
+ # Create Pydantic model
226
+ InvoiceModel = create_extraction_model(requirements)
227
+ schema = InvoiceModel.model_json_schema()
228
+ ```
229
+
230
+ ## API Reference
231
+
232
+ | Function/Class | Purpose |
233
+ | ------------------------------- | ------------------------------------------------- |
234
+ | `SchemaExtractor` | Reusable extractor with provider selection |
235
+ | `dynamic_extraction_workflow()` | One-shot extraction from natural language |
236
+ | `create_extraction_model()` | Generate Pydantic model from field specifications |
237
+ | `FieldSpec` | Define a single extraction field |
238
+ | `ExtractionRequirements` | Collection of field specifications |
239
+
240
+ ### Provider Parameters
241
+
242
+ ```python
243
+ SchemaExtractor(
244
+ user_description: str | None = None, # Optional if requirements provided
245
+ provider: Literal["openai", "anthropic", "google", "azure"] = "openai",
246
+ model: str | None = None, # Optional: override default model
247
+ api_key: str | None = None, # Optional: override env variable
248
+ client: BaseChatModel | None = None, # Optional: custom LangChain client
249
+ requirements: ExtractionRequirements | None = None # Optional: pre-defined schema
250
+ )
251
+ ```
252
+
253
+ **Note:**
254
+
255
+ - IDEs with type checking (VS Code, PyCharm) will show autocomplete for the `provider` parameter
256
+ - Either `user_description` or `requirements` must be provided
257
+ - Using `requirements` skips the LLM parsing step (faster & cheaper)
258
+
259
+ ## Default Models
260
+
261
+ - OpenAI: `gpt-4.1`
262
+ - Anthropic: `claude-sonnet-4-5-20250929`
263
+ - Google: `gemini-2.5-flash`
264
+ - Azure: `gpt-4.1`
265
+
266
+ ## Resources
267
+
268
+ - [GitHub Repository](https://github.com/GAIK-project/toolkit-shared-components)
269
+ - [Examples Directory](https://github.com/GAIK-project/toolkit-shared-components/tree/main/examples)
270
+ - [LangChain Documentation](https://python.langchain.com/docs/how_to/structured_output/)
271
+ - [Pydantic Documentation](https://docs.pydantic.dev/)
272
+
273
+ ## License
274
+
275
+ MIT License - see [LICENSE](LICENSE) file for details.
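The "Provider Parameters" and "Default Models" sections of the embedded README imply a lookup from provider name to default model, with an optional per-call override. A minimal sketch of that lookup follows. It assumes nothing about gaik's internals: `DEFAULT_MODELS` and `resolve_model` are hypothetical names, and only the provider names and model identifiers are taken from the README.

```python
from __future__ import annotations

from typing import Literal, get_args

# Provider names as listed in the README's SchemaExtractor signature.
Provider = Literal["openai", "anthropic", "google", "azure"]

# Default model per provider, copied from the README's "Default Models" list.
# The dict itself is a hypothetical illustration, not gaik's internal table.
DEFAULT_MODELS: dict[str, str] = {
    "openai": "gpt-4.1",
    "anthropic": "claude-sonnet-4-5-20250929",
    "google": "gemini-2.5-flash",
    "azure": "gpt-4.1",
}


def resolve_model(provider: str = "openai", model: str | None = None) -> str:
    """Return the explicit model if given, otherwise the provider's default."""
    if provider not in get_args(Provider):
        raise ValueError(f"Unknown provider: {provider!r}")
    return model if model is not None else DEFAULT_MODELS[provider]


print(resolve_model())
print(resolve_model("anthropic"))
print(resolve_model("google", model="my-custom-model"))
```

Rejecting unknown provider names early keeps a typo from surfacing later as an opaque API error; whether gaik validates at the same point is not shown in this diff.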
gaik-0.2.6/README.md ADDED
@@ -0,0 +1,215 @@
1
+ # GAIK - General AI Kit
2
+
3
+ **Reusable AI/ML components for Python**
4
+
5
+ Multi-provider AI toolkit for structured data extraction. Supports OpenAI, Anthropic Claude, Google Gemini, and Azure OpenAI.
6
+
7
+ ## Features
8
+
9
+ ### 🔍 Dynamic Data Extraction (`gaik.extract`)
10
+
11
+ Extract structured data from unstructured text using LangChain's structured outputs:
12
+
13
+ - ✅ **Multi-provider** - OpenAI, Anthropic, Azure, Google - easy switching
14
+ - ✅ **Guaranteed structure** - API-enforced schema compliance
15
+ - ✅ **Type-safe** - Full Pydantic validation
16
+ - ✅ **No code generation** - Uses Pydantic's `create_model()`, no `eval()`
17
+ - ✅ **Cost-effective** - Minimal API calls
18
+ - ✅ **Simple & clean** - Easy to understand, minimal dependencies
19
+
20
+ ### 🖼️ Vision PDF Parsing (`gaik.parsers`)
21
+
22
+ Convert PDF pages to Markdown with OpenAI or Azure OpenAI vision models:
23
+
24
+ - ✅ **Single API surface** - Works with standard OpenAI or Azure deployments
25
+ - ✅ **Optional extras** - Install with `pip install gaik[vision]`
26
+ - ✅ **CLI ready** - See `examples/demo_vision_parser.py` for quick conversions
27
+ - ✅ **Table-aware** - Keeps multi-page tables intact with optional cleanup
28
+
29
+ ## Installation
30
+
31
+ ```bash
32
+ # Install from Test PyPI
33
+ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ gaik
34
+ ```
35
+
36
+ ## Quick Start
37
+
38
+ ### 1. Set up your provider API key
39
+
40
+ **OpenAI (default):**
41
+
42
+ ```bash
43
+ export OPENAI_API_KEY='sk-...' # Get from: https://platform.openai.com/api-keys
44
+ ```
45
+
46
+ **Anthropic:**
47
+
48
+ ```bash
49
+ export ANTHROPIC_API_KEY='sk-ant-...' # Get from: https://console.anthropic.com
50
+ ```
51
+
52
+ **Google:**
53
+
54
+ ```bash
55
+ export GOOGLE_API_KEY='...' # Get from: https://ai.google.dev
56
+ ```
57
+
58
+ **Azure OpenAI:**
59
+
60
+ ```bash
61
+ export AZURE_OPENAI_API_KEY='...'
62
+ export AZURE_OPENAI_ENDPOINT='https://your-resource.openai.azure.com/'
63
+ ```
64
+
65
+ ### 2. Simple Extraction
66
+
67
+ ```python
68
+ from gaik.extract import SchemaExtractor
69
+
70
+ # Using default OpenAI provider
71
+ extractor = SchemaExtractor("Extract name and age from text")
72
+ result = extractor.extract_one("Alice is 25 years old")
73
+ print(result)
74
+ # {'name': 'Alice', 'age': 25}
75
+
76
+ # Switch to Anthropic Claude
77
+ extractor = SchemaExtractor(
78
+ "Extract name and age from text",
79
+ provider="anthropic"
80
+ )
81
+
82
+ # Use Google Gemini
83
+ extractor = SchemaExtractor(
84
+ "Extract name and age from text",
85
+ provider="google"
86
+ )
87
+ ```
88
+
89
+ ### 3. Batch Extraction
90
+
91
+ ```python
92
+ from gaik.extract import dynamic_extraction_workflow
93
+
94
+ description = """
95
+ Extract from invoices:
96
+ - Invoice number
97
+ - Total amount in USD
98
+ - Vendor name
99
+ """
100
+
101
+ documents = [
102
+ "Invoice #12345 from Acme Corp. Total: $1,500",
103
+ "INV-67890, Supplier: TechCo, Amount: $2,750"
104
+ ]
105
+
106
+ # Use any provider
107
+ results = dynamic_extraction_workflow(
108
+ description,
109
+ documents,
110
+ provider="openai" # or "anthropic", "google", "azure"
111
+ )
112
+
113
+ for result in results:
114
+ print(f"Invoice: {result['invoice_number']}, Amount: ${result['total_amount']}")
115
+ ```
116
+
117
+ ### 4. Reusable Extractor (Recommended)
118
+
119
+ ```python
120
+ from gaik.extract import SchemaExtractor
121
+
122
+ # Create extractor once
123
+ extractor = SchemaExtractor("""
124
+ Extract from project reports:
125
+ - Project title
126
+ - Lead institution
127
+ - Total funding in euros
128
+ - List of partner countries
129
+ """)
130
+
131
+ # Reuse for multiple batches
132
+ batch1_results = extractor.extract(documents_batch1)
133
+ batch2_results = extractor.extract(documents_batch2)
134
+
135
+ # Inspect the schema
136
+ print(f"Fields: {extractor.field_names}")
137
+ # ['project_title', 'lead_institution', 'total_funding', 'partner_countries']
138
+ ```
139
+
140
+ ### 5. Schema-Only Generation
141
+
142
+ Generate Pydantic schemas without extraction:
143
+
144
+ ```python
145
+ from gaik.extract import FieldSpec, ExtractionRequirements, create_extraction_model
146
+
147
+ requirements = ExtractionRequirements(
148
+ use_case_name="Invoice",
149
+ fields=[
150
+ FieldSpec(
151
+ field_name="invoice_number",
152
+ field_type="str",
153
+ description="Invoice identifier",
154
+ required=True
155
+ ),
156
+ FieldSpec(
157
+ field_name="amount",
158
+ field_type="float",
159
+ description="Total amount",
160
+ required=True
161
+ )
162
+ ]
163
+ )
164
+
165
+ # Create Pydantic model
166
+ InvoiceModel = create_extraction_model(requirements)
167
+ schema = InvoiceModel.model_json_schema()
168
+ ```
169
+
170
+ ## API Reference
171
+
172
+ | Function/Class | Purpose |
173
+ | ------------------------------- | ------------------------------------------------- |
174
+ | `SchemaExtractor` | Reusable extractor with provider selection |
175
+ | `dynamic_extraction_workflow()` | One-shot extraction from natural language |
176
+ | `create_extraction_model()` | Generate Pydantic model from field specifications |
177
+ | `FieldSpec` | Define a single extraction field |
178
+ | `ExtractionRequirements` | Collection of field specifications |
179
+
180
+ ### Provider Parameters
181
+
182
+ ```python
183
+ SchemaExtractor(
184
+ user_description: str | None = None, # Optional if requirements provided
185
+ provider: Literal["openai", "anthropic", "google", "azure"] = "openai",
186
+ model: str | None = None, # Optional: override default model
187
+ api_key: str | None = None, # Optional: override env variable
188
+ client: BaseChatModel | None = None, # Optional: custom LangChain client
189
+ requirements: ExtractionRequirements | None = None # Optional: pre-defined schema
190
+ )
191
+ ```
192
+
193
+ **Note:**
194
+
195
+ - IDEs with type checking (VS Code, PyCharm) will show autocomplete for the `provider` parameter
196
+ - Either `user_description` or `requirements` must be provided
197
+ - Using `requirements` skips the LLM parsing step (faster & cheaper)
198
+
199
+ ## Default Models
200
+
201
+ - OpenAI: `gpt-4.1`
202
+ - Anthropic: `claude-sonnet-4-5-20250929`
203
+ - Google: `gemini-2.5-flash`
204
+ - Azure: `gpt-4.1`
205
+
206
+ ## Resources
207
+
208
+ - [GitHub Repository](https://github.com/GAIK-project/toolkit-shared-components)
209
+ - [Examples Directory](https://github.com/GAIK-project/toolkit-shared-components/tree/main/examples)
210
+ - [LangChain Documentation](https://python.langchain.com/docs/how_to/structured_output/)
211
+ - [Pydantic Documentation](https://docs.pydantic.dev/)
212
+
213
+ ## License
214
+
215
+ MIT License - see [LICENSE](LICENSE) file for details.
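The README's "No code generation" bullet says models are built with Pydantic's `create_model()` rather than `eval()`. The sketch below shows roughly what that technique looks like with Pydantic v2. The tuple-based `specs` list is a stand-in invented for this example; gaik's own `FieldSpec` and `create_extraction_model` are not reproduced here and may work differently.

```python
from typing import Optional

from pydantic import Field, create_model

# Hypothetical field specs as plain tuples: (name, type, description, required).
# These stand in for gaik's FieldSpec objects, whose internals are not shown.
specs = [
    ("invoice_number", str, "Invoice identifier", True),
    ("amount", float, "Total amount", True),
    ("vendor", str, "Vendor name", False),
]

field_definitions = {}
for name, py_type, description, required in specs:
    if required:
        # Ellipsis marks the field as required in Pydantic.
        field_definitions[name] = (py_type, Field(..., description=description))
    else:
        # Optional fields accept None and default to it.
        field_definitions[name] = (
            Optional[py_type],
            Field(default=None, description=description),
        )

# create_model builds a real BaseModel subclass at runtime, with no eval/exec.
InvoiceModel = create_model("Invoice", **field_definitions)

invoice = InvoiceModel(invoice_number="12345", amount=1500.0)
schema = InvoiceModel.model_json_schema()
print(invoice.model_dump())
print(schema["required"])
```

Because the result is an ordinary Pydantic model, validation errors and JSON Schema export behave the same as for a hand-written class.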
@@ -0,0 +1,72 @@
1
+ [project]
2
+ name = "gaik"
3
+ version = "0.2.6"
4
+ description = "General AI Kit - Reusable AI/ML components for Python"
5
+ readme = "README.md"
6
+ requires-python = ">=3.10"
7
+ license = { file = "LICENSE" }
8
+ authors = [{ name = "GAIK Project" }]
9
+ keywords = [
10
+ "ai",
11
+ "ml",
12
+ "langchain",
13
+ "openai",
14
+ "anthropic",
15
+ "google",
16
+ "structured-outputs",
17
+ "pydantic",
18
+ "schema",
19
+ "extraction",
20
+ ]
21
+ classifiers = [
22
+ "Development Status :: 3 - Alpha",
23
+ "Intended Audience :: Developers",
24
+ "License :: OSI Approved :: MIT License",
25
+ "Programming Language :: Python :: 3",
26
+ "Programming Language :: Python :: 3.10",
27
+ "Programming Language :: Python :: 3.11",
28
+ "Programming Language :: Python :: 3.12",
29
+ "Topic :: Software Development :: Libraries :: Python Modules",
30
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
31
+ ]
32
+
33
+ dependencies = [
34
+ "pydantic>=2.12.3",
35
+ "langchain-core>=1.0.3",
36
+ "langchain-openai>=1.0.2",
37
+ "langchain-anthropic>=1.0.1",
38
+ "langchain-google-genai>=3.0.1",
39
+ ]
40
+
41
+ [project.optional-dependencies]
42
+ dev = ["ruff>=0.14.1", "build>=1.0", "twine>=4.0"]
43
+ vision = [
44
+ "openai>=1.40.0",
45
+ "pdf2image>=1.17.0",
46
+ "pillow>=10.0.0",
47
+ "python-dotenv>=1.0.0",
48
+ ]
49
+
50
+ [project.urls]
51
+ Homepage = "https://gaik.ai/"
52
+ Repository = "https://github.com/GAIK-project/toolkit-shared-components"
53
+ Documentation = "https://github.com/GAIK-project/toolkit-shared-components/tree/main/gaik-py"
54
+ Issues = "https://github.com/GAIK-project/toolkit-shared-components/issues"
55
+
56
+ [build-system]
57
+ requires = ["setuptools>=61.0", "wheel"]
58
+ build-backend = "setuptools.build_meta"
59
+
60
+ [tool.setuptools.packages.find]
61
+ where = ["src"]
62
+
63
+ [tool.setuptools.package-data]
64
+ gaik = ["py.typed"]
65
+
66
+ [tool.ruff]
67
+ line-length = 100
68
+ target-version = "py310"
69
+
70
+ [tool.ruff.lint]
71
+ select = ["E", "F", "I", "N", "W", "UP"]
72
+ ignore = []
gaik-0.2.6/setup.cfg ADDED
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,35 @@
1
+ """General AI Kit (GAIK) - Reusable AI/ML components for Python.
2
+
3
+ GAIK provides modular, production-ready tools for common AI/ML tasks including:
4
+ - Dynamic data extraction with structured outputs
5
+ - Multi-provider LLM support (OpenAI, Anthropic, Azure, Google)
6
+ - And more modules coming soon...
7
+
8
+ Available modules:
9
+ - gaik.extract: Dynamic data extraction with LangChain structured outputs
10
+ - gaik.providers: Multi-provider LLM interface (OpenAI, Anthropic, Azure, Google)
11
+ - gaik.parsers: Vision-enabled PDF to Markdown parsing utilities
12
+
13
+ Example:
14
+ >>> from gaik.extract import SchemaExtractor
15
+ >>>
16
+ >>> # Using default OpenAI provider
17
+ >>> extractor = SchemaExtractor("Extract title and date from articles")
18
+ >>> results = extractor.extract(documents)
19
+ >>>
20
+ >>> # Using Anthropic Claude
21
+ >>> # IDE autocomplete shows: "openai" | "anthropic" | "google" | "azure"
22
+ >>> extractor = SchemaExtractor(
23
+ ... "Extract name and age",
24
+ ... provider="anthropic"
25
+ ... )
26
+ """
27
+
28
+ import importlib.metadata
29
+
30
+ try:
31
+ __version__ = importlib.metadata.version("gaik")
32
+ except importlib.metadata.PackageNotFoundError:
33
+ __version__ = "0.0.0.dev"
34
+
35
+ __all__ = ["__version__"]
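The `try`/`except` around `importlib.metadata.version` in the hunk above covers the case where the code is imported from a source checkout that was never installed. The same pattern is shown here in isolation as a small function; `resolve_version` is a name made up for this example and the distribution name is chosen only because it should not exist.

```python
import importlib.metadata


def resolve_version(dist_name: str, fallback: str = "0.0.0.dev") -> str:
    """Return the installed version of dist_name, or fallback if not installed."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        # No installed distribution metadata, e.g. running from a raw checkout.
        return fallback


# A distribution name that is assumed not to be installed.
print(resolve_version("surely-not-an-installed-distribution-xyz"))
```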
@@ -0,0 +1,69 @@
1
+ """Dynamic data extraction with LangChain structured outputs.
2
+
3
+ This module provides tools for extracting structured data from unstructured text
4
+ using dynamically created Pydantic schemas and LangChain's structured outputs.
5
+
6
+ Benefits of this approach:
7
+ - Type-safe and guaranteed structure (enforced by the API)
8
+ - Cost-effective (fewer tokens, no code generation)
9
+ - Secure (no eval/exec needed)
10
+ - Simple and maintainable
11
+ - Reliable results with automatic retries
12
+
13
+ Quick Start:
14
+ >>> from gaik.extract import dynamic_extraction_workflow
15
+ >>>
16
+ >>> results = dynamic_extraction_workflow(
17
+ ... user_description="Extract title, date, and author from articles",
18
+ ... documents=[doc1, doc2, doc3]
19
+ ... )
20
+
21
+ Advanced Usage:
22
+ >>> from gaik.extract import SchemaExtractor
23
+ >>>
24
+ >>> # Reuse the same schema for multiple batches
25
+ >>> extractor = SchemaExtractor("Extract invoice number and amount")
26
+ >>> batch1 = extractor.extract(documents1)
27
+ >>> batch2 = extractor.extract(documents2)
28
+ >>>
29
+ >>> # Access the generated Pydantic model
30
+ >>> schema = extractor.model.model_json_schema()
31
+ >>> print(schema)
32
+
33
+ Custom Field Specifications:
34
+ >>> from gaik.extract import (
35
+ ... FieldSpec,
36
+ ... ExtractionRequirements,
37
+ ... create_extraction_model,
38
+ ... )
39
+ >>>
40
+ >>> fields = [
41
+ ... FieldSpec(
42
+ ... field_name="invoice_number",
43
+ ... field_type="str",
44
+ ... description="Extract invoice ID",
45
+ ... required=True
46
+ ... )
47
+ ... ]
48
+ >>> requirements = ExtractionRequirements(
49
+ ... use_case_name="Invoice",
50
+ ... fields=fields
51
+ ... )
52
+ >>> model = create_extraction_model(requirements)
53
+ """
54
+
55
+ from gaik.extract.extractor import SchemaExtractor, dynamic_extraction_workflow
56
+ from gaik.extract.models import ExtractionRequirements, FieldSpec
57
+ from gaik.extract.utils import create_extraction_model, sanitize_model_name
58
+
59
+ __all__ = [
60
+ # Main API
61
+ "SchemaExtractor",
62
+ "dynamic_extraction_workflow",
63
+ # Models
64
+ "FieldSpec",
65
+ "ExtractionRequirements",
66
+ # Utilities
67
+ "create_extraction_model",
68
+ "sanitize_model_name",
69
+ ]
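The module exports `sanitize_model_name`, but its implementation is not part of this excerpt. To illustrate why such a helper is useful, the sketch below turns a free-text use-case name into an identifier-safe class name before it would be handed to a model factory. `to_class_name` is a hypothetical function written for this example and should not be read as gaik's actual behavior.

```python
import re


def to_class_name(raw: str, default: str = "Extraction") -> str:
    """Turn free text into a PascalCase identifier usable as a model name."""
    # Split on anything that is not an ASCII letter or digit.
    words = [w for w in re.split(r"[^0-9A-Za-z]+", raw) if w]
    name = "".join(w[0].upper() + w[1:] for w in words)
    if not name:
        # Nothing usable survived, so fall back to a generic name.
        return default
    # Identifiers cannot start with a digit, so prefix one that would.
    if name[0].isdigit():
        name = f"Model{name}"
    return name


print(to_class_name("project report (2025)"))
print(to_class_name("2024 invoices"))
print(to_class_name("!!!"))
```

Restricting the result to ASCII letters and digits is a deliberately conservative choice for the sketch; a real implementation might allow underscores or non-ASCII letters.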