openaivec 1.0.0__py3-none-any.whl → 1.0.1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- openaivec-1.0.1.dist-info/METADATA +377 -0
- {openaivec-1.0.0.dist-info → openaivec-1.0.1.dist-info}/RECORD +4 -4
- openaivec-1.0.0.dist-info/METADATA +0 -807
- {openaivec-1.0.0.dist-info → openaivec-1.0.1.dist-info}/WHEEL +0 -0
- {openaivec-1.0.0.dist-info → openaivec-1.0.1.dist-info}/licenses/LICENSE +0 -0
@@ -0,0 +1,377 @@

Metadata-Version: 2.4
Name: openaivec
Version: 1.0.1
Summary: Generative mutation for tabular calculation
Project-URL: Homepage, https://microsoft.github.io/openaivec/
Project-URL: Repository, https://github.com/microsoft/openaivec
Author-email: Hiroki Mizukami <hmizukami@microsoft.com>
License: MIT
License-File: LICENSE
Keywords: llm,openai,openai-api,openai-python,pandas,pyspark
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Requires-Dist: ipywidgets>=8.1.7
Requires-Dist: openai>=1.74.0
Requires-Dist: pandas>=2.2.3
Requires-Dist: tiktoken>=0.9.0
Requires-Dist: tqdm>=4.67.1
Provides-Extra: spark
Requires-Dist: pyspark>=3.5.5; extra == 'spark'
Description-Content-Type: text/markdown

# openaivec

Transform pandas and Spark workflows with AI-powered text processing—batching, caching, and guardrails included.

[Contributor guidelines](AGENTS.md)

## Quick start

```bash
pip install openaivec
```

```python
import os
import pandas as pd
from openaivec import pandas_ext

# Auth: choose OpenAI or Azure OpenAI
os.environ["OPENAI_API_KEY"] = "your-api-key"
# Azure alternative:
# os.environ["AZURE_OPENAI_API_KEY"] = "your-azure-key"
# os.environ["AZURE_OPENAI_BASE_URL"] = "https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/"
# os.environ["AZURE_OPENAI_API_VERSION"] = "preview"

pandas_ext.set_responses_model("gpt-5.1")  # Optional override (use deployment name for Azure)

reviews = pd.Series([
    "Great coffee and friendly staff.",
    "Delivery was late and the package was damaged.",
])

sentiment = reviews.ai.responses(
    "Summarize sentiment in one short sentence.",
    reasoning={"effort": "medium"},  # Mirrors OpenAI SDK for reasoning models
)
print(sentiment.tolist())
```

**Try it live:** https://microsoft.github.io/openaivec/examples/pandas/

## Contents

- [Why openaivec?](#why-openaivec)
- [Core Workflows](#core-workflows)
- [Using with Apache Spark UDFs](#using-with-apache-spark-udfs)
- [Building Prompts](#building-prompts)
- [Using with Microsoft Fabric](#using-with-microsoft-fabric)
- [Contributing](#contributing)
- [Additional Resources](#additional-resources)
- [Community](#community)

## Why openaivec?

- Drop-in `.ai` and `.aio` accessors keep pandas analysts in familiar tooling.
- Smart batching (`BatchingMapProxy`/`AsyncBatchingMapProxy`) dedupes prompts, preserves order, and releases waiters on failure.
- Reasoning support mirrors the OpenAI SDK; structured outputs accept Pydantic `response_format`.
- Built-in caches and retries remove boilerplate; helpers reuse caches across pandas, Spark, and async flows.
- Spark UDFs and Microsoft Fabric guides move notebooks into production-scale ETL.
- Prompt tooling (`FewShotPromptBuilder`, `improve`) and the task library ship curated prompts with validated outputs.
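
The dedupe-and-ordering behavior described in the batching bullet can be pictured with a toy stand-in for the API call. This is an illustration of the idea only, not openaivec's actual implementation; `batched_map` and `fake_llm` are hypothetical names:

```python
def batched_map(inputs, fn):
    # Toy sketch of smart batching: call fn once per distinct input,
    # then fan the results back out so output order matches input order.
    distinct = list(dict.fromkeys(inputs))  # first-seen order, no duplicates
    mapped = dict(zip(distinct, fn(distinct)))
    return [mapped[x] for x in inputs]

calls = []

def fake_llm(batch):
    # Stand-in for a batched OpenAI call; records what it was asked.
    calls.append(list(batch))
    return [s.upper() for s in batch]

out = batched_map(["a", "b", "a", "c", "b"], fake_llm)
# fake_llm is invoked once, sees each distinct prompt exactly once,
# and out preserves the original order: ["A", "B", "A", "C", "B"]
```

With 1,000 rows but only a handful of distinct values, this is why token spend tracks distinct prompts rather than row count.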

# Overview

Vectorized OpenAI access so you process many inputs per call instead of one by one. Batching proxies dedupe inputs, enforce ordered outputs, and unblock waiters even on upstream errors. Cache helpers (`responses_with_cache`, Spark UDF builders) plug into the same layer so expensive prompts are reused across pandas, Spark, and async flows. Reasoning models honor SDK semantics. Requires Python 3.10+.

## Core Workflows

### Direct API usage

For maximum control over batch processing:

```python
import os
from openai import OpenAI
from openaivec import BatchResponses

# Initialize the batch client
client = BatchResponses.of(
    client=OpenAI(),
    model_name="gpt-5.1",
    system_message="Please answer only with 'xx family' and do not output anything else.",
    # batch_size defaults to None (automatic optimization)
)

result = client.parse(
    ["panda", "rabbit", "koala"],
    reasoning={"effort": "medium"},  # Required for gpt-5.1
)
print(result)  # Expected output: ['bear family', 'rabbit family', 'koala family']
```

📓 **[Complete tutorial →](https://microsoft.github.io/openaivec/examples/pandas/)**

### pandas integration (recommended)

The easiest way to get started with your DataFrames:

```python
import os
import pandas as pd
from openaivec import pandas_ext

# Authentication Option 1: Environment variables (automatic detection)
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
# Or for Azure OpenAI:
# os.environ["AZURE_OPENAI_API_KEY"] = "your-azure-key"
# os.environ["AZURE_OPENAI_BASE_URL"] = "https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/"
# os.environ["AZURE_OPENAI_API_VERSION"] = "preview"

# Authentication Option 2: Custom client (optional)
# from openai import OpenAI, AsyncOpenAI
# pandas_ext.set_client(OpenAI())
# pandas_ext.set_async_client(AsyncOpenAI())

# Configure model (optional - defaults to gpt-5.1; use deployment name for Azure)
pandas_ext.set_responses_model("gpt-5.1")

# Create your data
df = pd.DataFrame({"name": ["panda", "rabbit", "koala"]})

# Add AI-powered columns
result = df.assign(
    family=lambda df: df.name.ai.responses(
        "What animal family? Answer with 'X family'",
        reasoning={"effort": "medium"},
    ),
    habitat=lambda df: df.name.ai.responses(
        "Primary habitat in one word",
        reasoning={"effort": "medium"},
    ),
    fun_fact=lambda df: df.name.ai.responses(
        "One interesting fact in 10 words or less",
        reasoning={"effort": "medium"},
    ),
)
```

| name   | family           | habitat | fun_fact                   |
| ------ | ---------------- | ------- | -------------------------- |
| panda  | bear family      | forest  | Eats bamboo 14 hours daily |
| rabbit | rabbit family    | meadow  | Can see nearly 360 degrees |
| koala  | marsupial family | tree    | Sleeps 22 hours per day    |

📓 **[Interactive pandas examples →](https://microsoft.github.io/openaivec/examples/pandas/)**

### Using with reasoning models

Reasoning models (o1-preview, o1-mini, o3-mini, etc.) work without special flags. `reasoning` mirrors the OpenAI SDK.

```python
pandas_ext.set_responses_model("o1-mini")  # Set your reasoning model

result = df.assign(
    analysis=lambda df: df.text.ai.responses(
        "Analyze this text step by step",
        reasoning={"effort": "medium"},  # Optional: mirrors the OpenAI SDK argument
    )
)
```

You can omit `reasoning` to use the model defaults, or tune it per request with the same shape (a `dict` with `effort`) as the OpenAI SDK.

### Using pre-configured tasks

For common text processing operations, openaivec provides ready-to-use tasks that eliminate the need to write custom prompts:

```python
from openaivec.task import nlp, customer_support

text_df = pd.DataFrame({
    "text": [
        "Great product, fast delivery!",
        "Need help with billing issue",
        "How do I reset my password?",
    ]
})

results = text_df.assign(
    sentiment=lambda df: df.text.ai.task(nlp.SENTIMENT_ANALYSIS),
    intent=lambda df: df.text.ai.task(customer_support.INTENT_ANALYSIS),
)

# Extract structured results into separate columns
extracted_results = results.ai.extract("sentiment")
```

**Task categories:** Text analysis (`nlp.SENTIMENT_ANALYSIS`, `nlp.MULTILINGUAL_TRANSLATION`, `nlp.NAMED_ENTITY_RECOGNITION`, `nlp.KEYWORD_EXTRACTION`); Content classification (`customer_support.INTENT_ANALYSIS`, `customer_support.URGENCY_ANALYSIS`, `customer_support.INQUIRY_CLASSIFICATION`).
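
What `ai.extract` does to a structured column can be pictured with a small pure-Python sketch. This is illustrative only (the real accessor operates on DataFrame columns holding Pydantic-validated task outputs); `extract` here is a hypothetical stand-in:

```python
def extract(rows, column):
    # Toy sketch of extraction: expand a dict-valued column into
    # prefixed scalar fields, e.g. "sentiment" -> "sentiment_label", ...
    out = []
    for row in rows:
        flat = {k: v for k, v in row.items() if k != column}
        for key, value in row[column].items():
            flat[f"{column}_{key}"] = value
        out.append(flat)
    return out

rows = [{"text": "Great product!", "sentiment": {"label": "positive", "score": 0.97}}]
extracted = extract(rows, "sentiment")
# → [{"text": "Great product!", "sentiment_label": "positive", "sentiment_score": 0.97}]
```

The prefixing is what keeps extracted fields from colliding when several task columns are expanded in the same frame.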

### Asynchronous processing with `.aio`

For high-throughput workloads, use the `.aio` accessor, which provides async versions of all operations:

```python
import asyncio
import pandas as pd
from openaivec import pandas_ext

pandas_ext.set_responses_model("gpt-5.1")

df = pd.DataFrame({"text": [
    "This product is amazing!",
    "Terrible customer service",
    "Good value for money",
    "Not what I expected",
] * 250})  # 1000 rows for demonstration

async def process_data():
    return await df["text"].aio.responses(
        "Analyze sentiment and classify as positive/negative/neutral",
        reasoning={"effort": "medium"},  # Required for gpt-5.1
        max_concurrency=12,  # Allow up to 12 concurrent requests
    )

sentiments = asyncio.run(process_data())
```

**Performance benefits:** Parallel processing with automatic batching/deduplication, built-in rate limiting and error handling, and memory-efficient streaming for large datasets.

## Using with Apache Spark UDFs

Scale to enterprise datasets with distributed processing.

📓 **[Spark tutorial →](https://microsoft.github.io/openaivec/examples/spark/)**

First, obtain a Spark session and configure authentication:

```python
from pyspark.sql import SparkSession
from openaivec.spark import setup, setup_azure

spark = SparkSession.builder.getOrCreate()

# Option 1: Using OpenAI
setup(
    spark,
    api_key="your-openai-api-key",
    responses_model_name="gpt-5.1",  # Optional: set default model
    embeddings_model_name="text-embedding-3-small",  # Optional: set default model
)

# Option 2: Using Azure OpenAI
# setup_azure(
#     spark,
#     api_key="your-azure-openai-api-key",
#     base_url="https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/",
#     api_version="preview",
#     responses_model_name="my-gpt4-deployment",  # Optional: set default deployment
#     embeddings_model_name="my-embedding-deployment",  # Optional: set default deployment
# )
```

Create and register UDFs using the provided helpers:

```python
from openaivec.spark import responses_udf, task_udf, embeddings_udf, count_tokens_udf, similarity_udf, parse_udf
from pydantic import BaseModel

spark.udf.register(
    "extract_brand",
    responses_udf(
        instructions="Extract the brand name from the product. Return only the brand name.",
        reasoning={"effort": "medium"},  # Recommended with gpt-5.1
    ),
)

class Translation(BaseModel):
    en: str
    fr: str
    ja: str

spark.udf.register(
    "translate_struct",
    responses_udf(
        instructions="Translate the text to English, French, and Japanese.",
        response_format=Translation,
        reasoning={"effort": "medium"},  # Recommended with gpt-5.1
    ),
)

spark.udf.register("embed_text", embeddings_udf())
spark.udf.register("count_tokens", count_tokens_udf())
spark.udf.register("compute_similarity", similarity_udf())
```

### Spark performance tips

- Duplicate detection automatically caches repeated inputs per partition for UDFs.
- `batch_size=None` auto-optimizes; set 32–128 for fixed sizes if needed.
- `max_concurrency` is per executor; total concurrency = executors × max_concurrency. Start with 4–12.
- Monitor rate limits and adjust concurrency to your OpenAI tier.
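
The executor arithmetic in the tips above is worth sanity-checking before launching a job (plain arithmetic against illustrative numbers, not an openaivec API):

```python
def cluster_concurrency(num_executors, max_concurrency):
    # Total concurrent OpenAI requests a Spark job can issue:
    # max_concurrency applies per executor, so multiply by executor count.
    return num_executors * max_concurrency

# e.g. 8 executors with max_concurrency=12 each
print(cluster_concurrency(8, 12))  # 96 concurrent requests against your rate limits
```

If that total exceeds what your OpenAI tier allows, lower `max_concurrency` rather than the executor count.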

## Building Prompts

Few-shot prompts improve LLM quality. `FewShotPromptBuilder` structures purpose, cautions, and examples; `improve()` iterates with OpenAI to remove contradictions.

```python
from openaivec import FewShotPromptBuilder

prompt = (
    FewShotPromptBuilder()
    .purpose("Return the smallest category that includes the given word")
    .caution("Never use proper nouns as categories")
    .example("Apple", "Fruit")
    .example("Car", "Vehicle")
    .improve(max_iter=1)  # optional
    .build()
)
```

📓 **[Advanced prompting techniques →](https://microsoft.github.io/openaivec/examples/prompt/)**

## Using with Microsoft Fabric

[Microsoft Fabric](https://www.microsoft.com/en-us/microsoft-fabric/) is a unified, cloud-based analytics platform. Add `openaivec` from PyPI in your Fabric environment, select it in your notebook, and use `openaivec.spark` like standard Spark. Detailed walkthrough: 📓 **[Fabric guide →](https://microsoft.github.io/openaivec/examples/fabric/)**.

## Contributing

We welcome contributions! Please:

1. Fork and branch from `main`.
2. Add or update tests when you change code.
3. Run formatting and tests before opening a PR.

Install dev deps:

```bash
uv sync --all-extras --dev
```

Lint and format:

```bash
uv run ruff check . --fix
```

Quick test pass:

```bash
uv run pytest -m "not slow and not requires_api"
```

## Additional Resources

📓 **[Customer feedback analysis →](https://microsoft.github.io/openaivec/examples/customer_analysis/)** - Sentiment analysis & prioritization
📓 **[Survey data transformation →](https://microsoft.github.io/openaivec/examples/survey_transformation/)** - Unstructured to structured data
📓 **[Asynchronous processing examples →](https://microsoft.github.io/openaivec/examples/aio/)** - High-performance async workflows
📓 **[Auto-generate FAQs from documents →](https://microsoft.github.io/openaivec/examples/generate_faq/)** - Create FAQs using AI
📓 **[All examples →](https://microsoft.github.io/openaivec/examples/pandas/)** - Complete collection of tutorials and use cases

## Community

Join our Discord community for support and announcements: https://discord.gg/vbb83Pgn
@@ -33,7 +33,7 @@ openaivec/task/nlp/sentiment_analysis.py,sha256=1igoAhns-VgsDE8XI47Dw-zeOcR5wEY9
 openaivec/task/nlp/translation.py,sha256=TtV7F6bmKPqLi3_Ok7GoOqT_GKJiemotVq-YEbKd6IA,6617
 openaivec/task/table/__init__.py,sha256=kJz15WDJXjyC7UIHKBvlTRhCf347PCDMH5T5fONV2sU,83
 openaivec/task/table/fillna.py,sha256=4j27fWT5IzOhQqCPwLhomjBOAWPBslyIBbBMspjqtbw,6877
-openaivec-1.0.
-openaivec-1.0.
-openaivec-1.0.
-openaivec-1.0.
+openaivec-1.0.1.dist-info/METADATA,sha256=xLWrSZd9aX_mgAElLZgHft-jUbvrZfumvx_uJbr8C1Y,12991
+openaivec-1.0.1.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+openaivec-1.0.1.dist-info/licenses/LICENSE,sha256=ws_MuBL-SCEBqPBFl9_FqZkaaydIJmxHrJG2parhU4M,1141
+openaivec-1.0.1.dist-info/RECORD,,
@@ -1,807 +0,0 @@
|
|
|
1
|
-
Metadata-Version: 2.4
|
|
2
|
-
Name: openaivec
|
|
3
|
-
Version: 1.0.0
|
|
4
|
-
Summary: Generative mutation for tabular calculation
|
|
5
|
-
Project-URL: Homepage, https://microsoft.github.io/openaivec/
|
|
6
|
-
Project-URL: Repository, https://github.com/microsoft/openaivec
|
|
7
|
-
Author-email: Hiroki Mizukami <hmizukami@microsoft.com>
|
|
8
|
-
License: MIT
|
|
9
|
-
License-File: LICENSE
|
|
10
|
-
Keywords: llm,openai,openai-api,openai-python,pandas,pyspark
|
|
11
|
-
Classifier: Development Status :: 4 - Beta
|
|
12
|
-
Classifier: Intended Audience :: Developers
|
|
13
|
-
Classifier: License :: OSI Approved :: MIT License
|
|
14
|
-
Classifier: Programming Language :: Python :: 3.10
|
|
15
|
-
Classifier: Programming Language :: Python :: 3.11
|
|
16
|
-
Classifier: Programming Language :: Python :: 3.12
|
|
17
|
-
Requires-Python: >=3.10
|
|
18
|
-
Requires-Dist: ipywidgets>=8.1.7
|
|
19
|
-
Requires-Dist: openai>=1.74.0
|
|
20
|
-
Requires-Dist: pandas>=2.2.3
|
|
21
|
-
Requires-Dist: tiktoken>=0.9.0
|
|
22
|
-
Requires-Dist: tqdm>=4.67.1
|
|
23
|
-
Provides-Extra: spark
|
|
24
|
-
Requires-Dist: pyspark>=3.5.5; extra == 'spark'
|
|
25
|
-
Description-Content-Type: text/markdown
|
|
26
|
-
|
|
27
|
-
# openaivec
|
|
28
|
-
|
|
29
|
-
[Contributor guidelines](AGENTS.md)
|
|
30
|
-
|
|
31
|
-
**Transform your data analysis with AI-powered text processing at scale.**
|
|
32
|
-
|
|
33
|
-
**openaivec** enables data analysts to seamlessly integrate OpenAI's language models into their pandas and Spark workflows. Process thousands of text records with natural language instructions, turning unstructured data into actionable insights with just a few lines of code.
|
|
34
|
-
|
|
35
|
-
## Contents
|
|
36
|
-
- [Why openaivec?](#why-openaivec)
|
|
37
|
-
- [Quick Start](#-quick-start-from-text-to-insights-in-seconds)
|
|
38
|
-
- [Real-World Impact](#-real-world-impact)
|
|
39
|
-
- [Overview](#overview)
|
|
40
|
-
- [Core Workflows](#core-workflows)
|
|
41
|
-
- [Using with Apache Spark UDFs](#using-with-apache-spark-udfs)
|
|
42
|
-
- [Building Prompts](#building-prompts)
|
|
43
|
-
- [Using with Microsoft Fabric](#using-with-microsoft-fabric)
|
|
44
|
-
- [Contributing](#contributing)
|
|
45
|
-
- [Additional Resources](#additional-resources)
|
|
46
|
-
- [Community](#community)
|
|
47
|
-
|
|
48
|
-
## Why openaivec?
|
|
49
|
-
- Drop-in `.ai` and `.aio` DataFrame accessors keep pandas analysts in their favorite tools.
|
|
50
|
-
- Smart batching (`BatchingMapProxy`) deduplicates prompts, enforces ordered outputs, and shortens runtimes without manual tuning.
|
|
51
|
-
- Built-in caches, retry logic, and reasoning model safeguards cut noisy boilerplate from production pipelines.
|
|
52
|
-
- Ready-made Spark UDF helpers and Microsoft Fabric guides take AI workloads from notebooks into enterprise-scale ETL.
|
|
53
|
-
- Pre-configured task library and `FewShotPromptBuilder` ship curated prompts and structured outputs validated by Pydantic.
|
|
54
|
-
- Supports OpenAI and Azure OpenAI clients interchangeably, including async workloads and embeddings.
|
|
55
|
-
|
|
56
|
-
## 🚀 Quick Start: From Text to Insights in Seconds
|
|
57
|
-
|
|
58
|
-
Imagine analyzing 10,000 customer reviews. Instead of manual work, just write:
|
|
59
|
-
|
|
60
|
-
```python
|
|
61
|
-
import pandas as pd
|
|
62
|
-
from openaivec import pandas_ext
|
|
63
|
-
|
|
64
|
-
# Your data
|
|
65
|
-
reviews = pd.DataFrame({
|
|
66
|
-
"review": ["Great product, fast delivery!", "Terrible quality, very disappointed", ...]
|
|
67
|
-
})
|
|
68
|
-
|
|
69
|
-
# AI-powered analysis in one line
|
|
70
|
-
results = reviews.assign(
|
|
71
|
-
sentiment=lambda df: df.review.ai.responses("Classify sentiment: positive/negative/neutral"),
|
|
72
|
-
issues=lambda df: df.review.ai.responses("Extract main issues or compliments"),
|
|
73
|
-
priority=lambda df: df.review.ai.responses("Priority for follow-up: low/medium/high")
|
|
74
|
-
)
|
|
75
|
-
```
|
|
76
|
-
|
|
77
|
-
**Result**: Thousands of reviews classified and analyzed in minutes, not days.
|
|
78
|
-
|
|
79
|
-
📓 **[Try it yourself →](https://microsoft.github.io/openaivec/examples/pandas/)**
|
|
80
|
-
|
|
81
|
-
## 💡 Real-World Impact
|
|
82
|
-
|
|
83
|
-
### Customer Feedback Analysis
|
|
84
|
-
|
|
85
|
-
```python
|
|
86
|
-
# Process 50,000 support tickets automatically
|
|
87
|
-
tickets.assign(
|
|
88
|
-
category=lambda df: df.description.ai.responses("Categorize: billing/technical/feature_request"),
|
|
89
|
-
urgency=lambda df: df.description.ai.responses("Urgency level: low/medium/high/critical"),
|
|
90
|
-
solution_type=lambda df: df.description.ai.responses("Best resolution approach")
|
|
91
|
-
)
|
|
92
|
-
```
|
|
93
|
-
|
|
94
|
-
### Market Research at Scale
|
|
95
|
-
|
|
96
|
-
```python
|
|
97
|
-
# Analyze multilingual social media data
|
|
98
|
-
social_data.assign(
|
|
99
|
-
english_text=lambda df: df.post.ai.responses("Translate to English"),
|
|
100
|
-
brand_mention=lambda df: df.english_text.ai.responses("Extract brand mentions and sentiment"),
|
|
101
|
-
market_trend=lambda df: df.english_text.ai.responses("Identify emerging trends or concerns")
|
|
102
|
-
)
|
|
103
|
-
```
|
|
104
|
-
|
|
105
|
-
### Survey Data Transformation
|
|
106
|
-
|
|
107
|
-
```python
|
|
108
|
-
# Convert free-text responses to structured data
|
|
109
|
-
from pydantic import BaseModel
|
|
110
|
-
|
|
111
|
-
class Demographics(BaseModel):
|
|
112
|
-
age_group: str
|
|
113
|
-
location: str
|
|
114
|
-
interests: list[str]
|
|
115
|
-
|
|
116
|
-
survey_responses.assign(
|
|
117
|
-
structured=lambda df: df.response.ai.responses(
|
|
118
|
-
"Extract demographics as structured data",
|
|
119
|
-
response_format=Demographics
|
|
120
|
-
)
|
|
121
|
-
).ai.extract("structured") # Auto-expands to columns
|
|
122
|
-
```
|
|
123
|
-
|
|
124
|
-
📓 **[See more examples →](https://microsoft.github.io/openaivec/examples/pandas/)**
|
|
125
|
-
|
|
126
|
-
# Overview
|
|
127
|
-
|
|
128
|
-
This package provides a vectorized interface for the OpenAI API, enabling you to process multiple inputs with a single
|
|
129
|
-
API call instead of sending requests one by one.
|
|
130
|
-
This approach helps reduce latency and simplifies your code.
|
|
131
|
-
|
|
132
|
-
Additionally, it integrates effortlessly with Pandas DataFrames and Apache Spark UDFs, making it easy to incorporate
|
|
133
|
-
into your data processing pipelines.
|
|
134
|
-
|
|
135
|
-
Behind the scenes, `BatchingMapProxy` and `AsyncBatchingMapProxy` deduplicate repeated inputs, guarantee response order,
|
|
136
|
-
and unblock waiters even when upstream APIs error. Caches created via helpers such as `responses_with_cache` plug into
|
|
137
|
-
this batching layer so expensive prompts are reused across pandas, Spark, and async flows. Progress bars surface
|
|
138
|
-
automatically in notebook environments when `show_progress=True`.
|
|
139
|
-
|
|
140
|
-
## Core Capabilities
|
|
141
|
-
|
|
142
|
-
- Vectorized request batching with automatic deduplication, retries, and cache hooks for any OpenAI-compatible client.
|
|
143
|
-
- pandas `.ai` and `.aio` accessors for synchronous and asynchronous DataFrame pipelines, including `ai.extract` helpers.
|
|
144
|
-
- Task library with Pydantic-backed schemas for consistent structured outputs across pandas and Spark jobs.
|
|
145
|
-
- Spark UDF helpers (`responses_udf`, `embeddings_udf`, `parse_udf`, `task_udf`, etc.) for large-scale ETL and BI.
|
|
146
|
-
- Embeddings, token counting, and similarity utilities for search and retrieval use cases.
|
|
147
|
-
- Prompt tooling (`FewShotPromptBuilder`, `improve`) to craft and iterate production-ready instructions.
|
|
148
|
-
|
|
149
|
-
## Key Benefits
|
|
150
|
-
|
|
151
|
-
- **🚀 Throughput**: Smart batching and concurrency tuning process thousands of records in minutes, not hours.
|
|
152
|
-
- **💰 Cost Efficiency**: Input deduplication and optional caches cut redundant token usage on real-world datasets.
|
|
153
|
-
- **🛡️ Reliability**: Guardrails for reasoning models, informative errors, and automatic waiter release keep pipelines healthy.
|
|
154
|
-
- **🔗 Integration**: pandas, Spark, async, and Fabric workflows share the same API surface—no bespoke adapters required.
|
|
155
|
-
- **🎯 Consistency**: Pre-configured tasks and extractors deliver structured outputs validated with Pydantic models.
|
|
156
|
-
- **🏢 Enterprise Ready**: Azure OpenAI parity, Microsoft Fabric walkthroughs, and Spark UDFs shorten the path to production.
|
|
157
|
-
|
|
158
|
-
## Requirements
|
|
159
|
-
|
|
160
|
-
- Python 3.10 or higher
|
|
161
|
-
|
|
162
|
-
## Installation
|
|
163
|
-
|
|
164
|
-
Install the package with:
|
|
165
|
-
|
|
166
|
-
```bash
|
|
167
|
-
pip install openaivec
|
|
168
|
-
```
|
|
169
|
-
|
|
170
|
-
If you want to uninstall the package, you can do so with:
|
|
171
|
-
|
|
172
|
-
```bash
|
|
173
|
-
pip uninstall openaivec
|
|
174
|
-
```
|
|
175
|
-
|
|
176
|
-
## Core Workflows
|
|
177
|
-
|
|
178
|
-
### Direct API Usage
|
|
179
|
-
|
|
180
|
-
For maximum control over batch processing:
|
|
181
|
-
|
|
182
|
-
```python
|
|
183
|
-
import os
|
|
184
|
-
from openai import OpenAI
|
|
185
|
-
from openaivec import BatchResponses
|
|
186
|
-
|
|
187
|
-
# Initialize the batch client
|
|
188
|
-
client = BatchResponses.of(
|
|
189
|
-
client=OpenAI(),
|
|
190
|
-
model_name="gpt-4.1-mini",
|
|
191
|
-
system_message="Please answer only with 'xx family' and do not output anything else.",
|
|
192
|
-
# batch_size defaults to None (automatic optimization)
|
|
193
|
-
)
|
|
194
|
-
|
|
195
|
-
result = client.parse(["panda", "rabbit", "koala"])
|
|
196
|
-
print(result) # Expected output: ['bear family', 'rabbit family', 'koala family']
|
|
197
|
-
```
|
|
198
|
-
|
|
199
|
-
📓 **[Complete tutorial →](https://microsoft.github.io/openaivec/examples/pandas/)**
|
|
200
|
-
|
|
201
|
-
### Pandas Integration (Recommended)
|
|
202
|
-
|
|
203
|
-
The easiest way to get started with your DataFrames:
|
|
204
|
-
|
|
205
|
-
```python
|
|
206
|
-
import os
|
|
207
|
-
import pandas as pd
|
|
208
|
-
from openaivec import pandas_ext
|
|
209
|
-
|
|
210
|
-
# Authentication Option 1: Environment variables (automatic detection)
|
|
211
|
-
# For OpenAI:
|
|
212
|
-
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
|
|
213
|
-
# Or for Azure OpenAI:
|
|
214
|
-
# os.environ["AZURE_OPENAI_API_KEY"] = "your-azure-key"
|
|
215
|
-
# os.environ["AZURE_OPENAI_BASE_URL"] = "https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/"
|
|
216
|
-
# os.environ["AZURE_OPENAI_API_VERSION"] = "preview"
|
|
217
|
-
|
|
218
|
-
# Authentication Option 2: Custom client (optional)
|
|
219
|
-
# from openai import OpenAI, AsyncOpenAI
|
|
220
|
-
# pandas_ext.set_client(OpenAI())
|
|
221
|
-
# For async operations:
|
|
222
|
-
# pandas_ext.set_async_client(AsyncOpenAI())
|
|
223
|
-
|
|
224
|
-
# Configure model (optional - defaults to gpt-4.1-mini)
|
|
225
|
-
# For Azure OpenAI: use your deployment name, for OpenAI: use model name
|
|
226
|
-
pandas_ext.set_responses_model("gpt-4.1-mini")
|
|
227
|
-
|
|
228
|
-
# Create your data
|
|
229
|
-
df = pd.DataFrame({"name": ["panda", "rabbit", "koala"]})
|
|
230
|
-
|
|
231
|
-
# Add AI-powered columns
|
|
232
|
-
result = df.assign(
|
|
233
|
-
family=lambda df: df.name.ai.responses("What animal family? Answer with 'X family'"),
|
|
234
|
-
habitat=lambda df: df.name.ai.responses("Primary habitat in one word"),
|
|
235
|
-
fun_fact=lambda df: df.name.ai.responses("One interesting fact in 10 words or less")
|
|
236
|
-
)
|
|
237
|
-
```
|
|
238
|
-
|
|
239
|
-
| name | family | habitat | fun_fact |
|
|
240
|
-
| ------ | ---------------- | ------- | -------------------------- |
|
|
241
|
-
| panda | bear family | forest | Eats bamboo 14 hours daily |
|
|
242
|
-
| rabbit | rabbit family | meadow | Can see nearly 360 degrees |
|
|
243
|
-
| koala | marsupial family | tree | Sleeps 22 hours per day |
|
|
244
|
-
|
|
245
|
-
📓 **[Interactive pandas examples →](https://microsoft.github.io/openaivec/examples/pandas/)**
|
|
246
|
-
|
|
247
|
-
### Using with Reasoning Models
|
|
248
|
-
|
|
249
|
-
When using reasoning models (o1-preview, o1-mini, o3-mini, etc.), you must set `temperature=None` to avoid API errors:
|
|
250
|
-
|
|
251
|
-
```python
|
|
252
|
-
# For reasoning models like o1-preview, o1-mini, o3-mini
|
|
253
|
-
pandas_ext.set_responses_model("o1-mini") # Set your reasoning model
|
|
254
|
-
|
|
255
|
-
# MUST use temperature=None with reasoning models
|
|
256
|
-
result = df.assign(
|
|
257
|
-
analysis=lambda df: df.text.ai.responses(
|
|
258
|
-
"Analyze this text step by step",
|
|
259
|
-
temperature=None # Required for reasoning models
|
|
260
|
-
)
|
|
261
|
-
)
|
|
262
|
-
```
|
|
263
|
-
|
|
264
|
-
**Why this is needed**: Reasoning models don't support temperature parameters and will return an error if temperature is specified. The library automatically detects these errors and provides guidance on how to fix them.
|
|
265
|
-
|
|
266
|
-
**Reference**: [Azure OpenAI Reasoning Models](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/reasoning)
|
|
267
|
-
|
|
268
|
-
### Using Pre-configured Tasks

For common text processing operations, openaivec provides ready-to-use tasks that eliminate the need to write custom prompts:

```python
from openaivec.task import nlp, customer_support

# Text analysis with pre-configured tasks
text_df = pd.DataFrame({
    "text": [
        "Great product, fast delivery!",
        "Need help with billing issue",
        "How do I reset my password?"
    ]
})

# Use pre-configured tasks for consistent, optimized results
results = text_df.assign(
    sentiment=lambda df: df.text.ai.task(nlp.SENTIMENT_ANALYSIS),
    entities=lambda df: df.text.ai.task(nlp.NAMED_ENTITY_RECOGNITION),
    intent=lambda df: df.text.ai.task(customer_support.INTENT_ANALYSIS),
    urgency=lambda df: df.text.ai.task(customer_support.URGENCY_ANALYSIS)
)

# Extract structured results into separate columns (one at a time)
extracted_results = (results
    .ai.extract("sentiment")
    .ai.extract("entities")
    .ai.extract("intent")
    .ai.extract("urgency")
)
```

**Available Task Categories:**

- **Text Analysis**: `nlp.SENTIMENT_ANALYSIS`, `nlp.MULTILINGUAL_TRANSLATION`, `nlp.NAMED_ENTITY_RECOGNITION`, `nlp.KEYWORD_EXTRACTION`
- **Content Classification**: `customer_support.INTENT_ANALYSIS`, `customer_support.URGENCY_ANALYSIS`, `customer_support.INQUIRY_CLASSIFICATION`

**Benefits of Pre-configured Tasks:**

- Optimized prompts tested across diverse datasets
- Consistent structured outputs with Pydantic validation
- Multilingual support with standardized categorical fields
- Extensible framework for adding domain-specific tasks
- Direct compatibility with Spark UDFs

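Conceptually, each `.ai.extract` call flattens a column of structured results into prefixed scalar columns. A minimal pure-Python sketch of that flattening, assuming illustrative field names rather than the library's actual schema:

```python
def extract_column(rows, column):
    # Flatten a dict-valued column into "<column>_<field>" keys,
    # conceptually what extracting a structured-output column does.
    out = []
    for row in rows:
        flat = {k: v for k, v in row.items() if k != column}
        for field, value in row[column].items():
            flat[f"{column}_{field}"] = value
        out.append(flat)
    return out

rows = [{"text": "Great product!", "sentiment": {"label": "positive", "confidence": 0.95}}]
extracted = extract_column(rows, "sentiment")
# → [{"text": "Great product!", "sentiment_label": "positive", "sentiment_confidence": 0.95}]
```

The prefix keeps columns from different tasks (sentiment, entities, intent, urgency) from colliding when extracted into the same DataFrame.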
### Asynchronous Processing with `.aio`

For high-performance concurrent processing, use the `.aio` accessor, which provides asynchronous versions of all AI operations:

```python
import asyncio
import pandas as pd
from openaivec import pandas_ext

# Setup (same as the synchronous version)
pandas_ext.set_responses_model("gpt-4.1-mini")

df = pd.DataFrame({"text": [
    "This product is amazing!",
    "Terrible customer service",
    "Good value for money",
    "Not what I expected"
] * 250})  # 1000 rows for demonstration

async def process_data():
    # Asynchronous processing with fine-tuned concurrency control
    results = await df["text"].aio.responses(
        "Analyze sentiment and classify as positive/negative/neutral",
        # batch_size defaults to None (automatic optimization)
        max_concurrency=12  # Allow up to 12 concurrent requests
    )
    return results

# Run the async operation
sentiments = asyncio.run(process_data())
```

**Key Parameters for Performance Tuning:**

- **`batch_size`** (default: None): Controls how many inputs are grouped into a single API request. When None (the default), automatic batch-size optimization adjusts based on execution time. Set a positive integer for a fixed batch size. Higher values reduce API call overhead but increase memory usage and per-request processing time.
- **`max_concurrency`** (default: 8): Limits the number of concurrent API requests. Higher values increase throughput but may hit rate limits or overwhelm the API.

**Performance Benefits:**

- Process thousands of records in parallel
- Automatic request batching and deduplication
- Built-in rate limiting and error handling
- Memory-efficient streaming for large datasets

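The two knobs interact simply: inputs are grouped into batches, and at most `max_concurrency` batches are in flight at once. A minimal sketch of that scheduling pattern, with a stand-in coroutine in place of a real API call:

```python
import asyncio

def batched(items, batch_size):
    # Group inputs into fixed-size batches (a fixed batch_size; the
    # library's default mode adapts this size at runtime instead).
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

async def process_all(items, batch_size=4, max_concurrency=2):
    # Cap the number of in-flight requests with a semaphore, the same
    # role max_concurrency plays above.
    sem = asyncio.Semaphore(max_concurrency)

    async def call_api(batch):
        async with sem:
            await asyncio.sleep(0)  # stand-in for one API request
            return [x * 2 for x in batch]

    results = await asyncio.gather(*(call_api(b) for b in batched(items, batch_size)))
    # gather preserves input order, so the flattened output lines up with the input
    return [y for batch in results for y in batch]

out = asyncio.run(process_all(list(range(10))))
# → [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Raising `batch_size` trades fewer requests for larger ones; raising `max_concurrency` trades throughput against rate-limit pressure.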
## Using with Apache Spark UDFs

Scale to enterprise datasets with distributed processing:

📓 **[Complete Spark tutorial →](https://microsoft.github.io/openaivec/examples/spark/)**

First, obtain a Spark session and configure authentication:

```python
from pyspark.sql import SparkSession
from openaivec.spark import setup, setup_azure

spark = SparkSession.builder.getOrCreate()

# Option 1: Using OpenAI
setup(
    spark,
    api_key="your-openai-api-key",
    responses_model_name="gpt-4.1-mini",  # Optional: set default model
    embeddings_model_name="text-embedding-3-small"  # Optional: set default model
)

# Option 2: Using Azure OpenAI
# setup_azure(
#     spark,
#     api_key="your-azure-openai-api-key",
#     base_url="https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/",
#     api_version="preview",
#     responses_model_name="my-gpt4-deployment",  # Optional: set default deployment
#     embeddings_model_name="my-embedding-deployment"  # Optional: set default deployment
# )
```

Next, create and register UDFs using the provided functions:

```python
from openaivec.spark import responses_udf, task_udf, embeddings_udf, count_tokens_udf, similarity_udf, parse_udf
from pydantic import BaseModel

# --- Register Responses UDF (String Output) ---
spark.udf.register(
    "extract_brand",
    responses_udf(
        instructions="Extract the brand name from the product. Return only the brand name."
    )
)

# --- Register Responses UDF (Structured Output with Pydantic) ---
class Translation(BaseModel):
    en: str
    fr: str
    ja: str

spark.udf.register(
    "translate_struct",
    responses_udf(
        instructions="Translate the text to English, French, and Japanese.",
        response_format=Translation
    )
)

# --- Register Embeddings UDF ---
spark.udf.register(
    "embed_text",
    embeddings_udf()
)

# --- Register Token Counting UDF ---
spark.udf.register("count_tokens", count_tokens_udf())

# --- Register Similarity UDF ---
spark.udf.register("compute_similarity", similarity_udf())

# --- Register UDFs with Pre-configured Tasks ---
from openaivec.task import nlp, customer_support

spark.udf.register(
    "analyze_sentiment",
    task_udf(
        task=nlp.SENTIMENT_ANALYSIS
    )
)

spark.udf.register(
    "classify_intent",
    task_udf(
        task=customer_support.INTENT_ANALYSIS
    )
)

# --- Register UDF for Reasoning Models ---
# For reasoning models (o1-preview, o1-mini, o3, etc.), set temperature=None
spark.udf.register(
    "reasoning_analysis",
    responses_udf(
        instructions="Analyze this step by step with detailed reasoning",
        temperature=None  # Required for reasoning models
    )
)

# --- Register Parse UDF (Dynamic Schema Inference) ---
spark.udf.register(
    "parse_dynamic",
    parse_udf(
        instructions="Extract key entities and attributes from the text",
        example_table_name="sample_texts",  # Infer schema from examples
        example_field_name="text",
        max_examples=50
    )
)
```

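`compute_similarity` compares two embedding vectors; the standard measure for embeddings is cosine similarity (an assumption here, not confirmed by this README). A minimal sketch of that computation:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 for vectors pointing
    # the same way, 0.0 for orthogonal (unrelated) vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

same = cosine_similarity([1.0, 0.0], [2.0, 0.0])        # → 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # → 0.0
```

Because cosine similarity ignores vector magnitude, it compares the direction of embeddings, which is what makes it a useful semantic-closeness score.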
You can now use these UDFs in Spark SQL:

```sql
-- Create a sample table (replace with your actual table)
CREATE OR REPLACE TEMP VIEW product_reviews AS SELECT * FROM VALUES
    ('1001', 'The new TechPhone X camera quality is amazing, Nexus Corp really outdid themselves this time!'),
    ('1002', 'Quantum Galaxy has great battery life but the price is too high for what you get'),
    ('1003', 'Zephyr mobile phone crashed twice today, very disappointed with this purchase')
AS product_reviews(id, review_text);

-- Use the registered UDFs (including pre-configured tasks)
SELECT
    id,
    review_text,
    extract_brand(review_text) AS brand,
    translate_struct(review_text) AS translation,
    analyze_sentiment(review_text).sentiment AS sentiment,
    analyze_sentiment(review_text).confidence AS sentiment_confidence,
    classify_intent(review_text).primary_intent AS intent,
    classify_intent(review_text).action_required AS action_required,
    embed_text(review_text) AS embedding,
    count_tokens(review_text) AS token_count
FROM product_reviews;
```

Example output (structure may vary slightly):

| id   | review_text                                                                   | brand      | translation                 | sentiment | sentiment_confidence | intent           | action_required    | embedding              | token_count |
| ---- | ----------------------------------------------------------------------------- | ---------- | --------------------------- | --------- | -------------------- | ---------------- | ------------------ | ---------------------- | ----------- |
| 1001 | The new TechPhone X camera quality is amazing, Nexus Corp really outdid...    | Nexus Corp | {en: ..., fr: ..., ja: ...} | positive  | 0.95                 | provide_feedback | acknowledge_review | [0.1, -0.2, ..., 0.5]  | 19          |
| 1002 | Quantum Galaxy has great battery life but the price is too high for what...   | Quantum    | {en: ..., fr: ..., ja: ...} | mixed     | 0.78                 | provide_feedback | follow_up_pricing  | [-0.3, 0.1, ..., -0.1] | 16          |
| 1003 | Zephyr mobile phone crashed twice today, very disappointed with this purchase | Zephyr     | {en: ..., fr: ..., ja: ...} | negative  | 0.88                 | complaint        | investigate_issue  | [0.0, 0.4, ..., 0.2]   | 12          |

### Spark Performance Tuning

When using openaivec with Spark, proper configuration of `batch_size` and `max_concurrency` is crucial for optimal performance:

**Automatic Caching** (New):

- **Duplicate Detection**: All AI-powered UDFs (`responses_udf`, `task_udf`, `embeddings_udf`) automatically cache duplicate inputs within each partition
- **Cost Reduction**: Significantly reduces API calls and costs on datasets with repeated content
- **Transparent**: Works automatically without code changes; your existing UDFs become more efficient
- **Partition-Level**: Each partition maintains its own cache, which suits distributed processing patterns

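Within a partition, this caching amounts to memoizing the API call per unique input. A minimal sketch of the idea (the real implementation is the library's; this just illustrates the effect):

```python
def cached_map(texts, call_api):
    # Send each unique input to the API once; duplicates reuse the
    # cached result, cutting API calls on repetitive data.
    cache = {}
    out = []
    for text in texts:
        if text not in cache:
            cache[text] = call_api(text)
        out.append(cache[text])
    return out

calls = []
def fake_api(text):
    calls.append(text)  # record how many "real" requests were made
    return text.upper()

result = cached_map(["a", "b", "a", "a"], fake_api)
# → ["A", "B", "A", "A"], with only two underlying calls ("a" and "b")
```

On a column with many repeated values (categories, short replies, boilerplate), the number of API calls drops from the row count to the number of distinct values per partition.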
**`batch_size`** (default: None):

- Controls how many rows are processed together in each API request within a partition
- **Default (None)**: Automatic batch-size optimization adjusts based on execution time
- **Positive integer**: Fixed batch size; larger values reduce API calls but increase memory usage
- **Recommendation**: Use the default automatic optimization, or set 32-128 for a fixed batch size

**`max_concurrency`** (default: 8):

- **Important**: This is the number of concurrent API requests **per executor**
- Total cluster concurrency = `max_concurrency × number_of_executors`
- **Higher values**: Faster processing, but may overwhelm API rate limits
- **Lower values**: More conservative, better for shared API quotas
- **Recommendation**: 4-12 per executor, considering your OpenAI tier limits

**Example for a 10-executor cluster:**

```python
# With max_concurrency=8, total cluster concurrency = 8 × 10 = 80 concurrent requests
spark.udf.register(
    "analyze_sentiment",
    responses_udf(
        instructions="Analyze sentiment as positive/negative/neutral",
        # batch_size defaults to None (automatic optimization)
        max_concurrency=8  # 80 total concurrent requests across cluster
    )
)
```

**Monitoring and Scaling:**

- Monitor OpenAI API rate limits and adjust `max_concurrency` accordingly
- Use the Spark UI to optimize partition sizes and executor configurations
- Consider your OpenAI tier limits when scaling clusters

## Building Prompts

Building prompts is a crucial step in using LLMs. In particular, providing a few examples in a prompt can significantly improve an LLM's performance, a technique known as "few-shot learning." Typically, a few-shot prompt consists of a purpose, cautions, and examples.

📓 **[Advanced prompting techniques →](https://microsoft.github.io/openaivec/examples/prompt/)**

The `FewShotPromptBuilder` helps you create structured, high-quality prompts with examples, cautions, and automatic improvement.

### Basic Usage

`FewShotPromptBuilder` needs only a purpose, cautions, and examples; the `build` method returns the rendered prompt in XML format.

Here is an example:

```python
from openaivec import FewShotPromptBuilder

prompt: str = (
    FewShotPromptBuilder()
    .purpose("Return the smallest category that includes the given word")
    .caution("Never use proper nouns as categories")
    .example("Apple", "Fruit")
    .example("Car", "Vehicle")
    .example("Tokyo", "City")
    .example("Keiichi Sogabe", "Musician")
    .example("America", "Country")
    .build()
)
print(prompt)
```

The output will be:

```xml
<Prompt>
  <Purpose>Return the smallest category that includes the given word</Purpose>
  <Cautions>
    <Caution>Never use proper nouns as categories</Caution>
  </Cautions>
  <Examples>
    <Example>
      <Input>Apple</Input>
      <Output>Fruit</Output>
    </Example>
    <Example>
      <Input>Car</Input>
      <Output>Vehicle</Output>
    </Example>
    <Example>
      <Input>Tokyo</Input>
      <Output>City</Output>
    </Example>
    <Example>
      <Input>Keiichi Sogabe</Input>
      <Output>Musician</Output>
    </Example>
    <Example>
      <Input>America</Input>
      <Output>Country</Output>
    </Example>
  </Examples>
</Prompt>
```

### Improve with OpenAI

For most users, it can be challenging to write a prompt entirely free of contradictions, ambiguities, or redundancies. `FewShotPromptBuilder` provides an `improve` method to refine your prompt using OpenAI's API.

The `improve` method attempts to eliminate contradictions, ambiguities, and redundancies in the prompt, iterating the process up to `max_iter` times.

Here is an example:

```python
from openai import OpenAI
from openaivec import FewShotPromptBuilder

client = OpenAI(...)
model_name = "<your-model-name>"
improved_prompt: str = (
    FewShotPromptBuilder()
    .purpose("Return the smallest category that includes the given word")
    .caution("Never use proper nouns as categories")
    # Examples with contradictions, ambiguities, or redundancies
    .example("Apple", "Fruit")
    .example("Apple", "Technology")
    .example("Apple", "Company")
    .example("Apple", "Color")
    .example("Apple", "Animal")
    # Improve the prompt with OpenAI's API
    .improve()
    .build()
)
print(improved_prompt)
```

The result is an improved prompt with extra examples and a refined purpose and cautions:

```xml
<Prompt>
  <Purpose>Classify a given word into its most relevant category by considering its context and potential meanings. The input is a word accompanied by context, and the output is the appropriate category based on that context. This is useful for disambiguating words with multiple meanings, ensuring accurate understanding and categorization.</Purpose>
  <Cautions>
    <Caution>Ensure the context of the word is clear to avoid incorrect categorization.</Caution>
    <Caution>Be aware of words with multiple meanings and provide the most relevant category.</Caution>
    <Caution>Consider the possibility of new or uncommon contexts that may not fit traditional categories.</Caution>
  </Cautions>
  <Examples>
    <Example>
      <Input>Apple (as a fruit)</Input>
      <Output>Fruit</Output>
    </Example>
    <Example>
      <Input>Apple (as a tech company)</Input>
      <Output>Technology</Output>
    </Example>
    <Example>
      <Input>Java (as a programming language)</Input>
      <Output>Technology</Output>
    </Example>
    <Example>
      <Input>Java (as an island)</Input>
      <Output>Geography</Output>
    </Example>
    <Example>
      <Input>Mercury (as a planet)</Input>
      <Output>Astronomy</Output>
    </Example>
    <Example>
      <Input>Mercury (as an element)</Input>
      <Output>Chemistry</Output>
    </Example>
    <Example>
      <Input>Bark (as a sound made by a dog)</Input>
      <Output>Animal Behavior</Output>
    </Example>
    <Example>
      <Input>Bark (as the outer covering of a tree)</Input>
      <Output>Botany</Output>
    </Example>
    <Example>
      <Input>Bass (as a type of fish)</Input>
      <Output>Aquatic Life</Output>
    </Example>
    <Example>
      <Input>Bass (as a low-frequency sound)</Input>
      <Output>Music</Output>
    </Example>
  </Examples>
</Prompt>
```

## Using with Microsoft Fabric

[Microsoft Fabric](https://www.microsoft.com/en-us/microsoft-fabric/) is a unified, cloud-based analytics platform that seamlessly integrates data engineering, warehousing, and business intelligence to simplify the journey from raw data to actionable insights.

This section provides instructions on how to integrate and use `openaivec` within Microsoft Fabric. Follow these steps:

1. **Create an Environment in Microsoft Fabric:**

   - In Microsoft Fabric, click on **New item** in your workspace.
   - Select **Environment** to create a new environment for Apache Spark.
   - Determine the environment name, e.g. `openai-environment`.
   - _Figure: Creating a new Environment in Microsoft Fabric._

2. **Add `openaivec` to the Environment from Public Library:**

   - Once your environment is set up, go to the **Custom Library** section within that environment.
   - Click on **Add from PyPI** and search for the latest version of `openaivec`.
   - Save and publish to reflect the changes.
   - _Figure: Add `openaivec` from PyPI to Public Library._

3. **Use the Environment from a Notebook:**

   - Open a notebook within Microsoft Fabric.
   - Select the environment you created in the previous steps.
   - _Figure: Using custom environment from a notebook._
   - In the notebook, import and use `openaivec.spark` functions as you normally would. For example:

   ```python
   from openaivec.spark import setup_azure, responses_udf, embeddings_udf

   # In Microsoft Fabric, the Spark session is automatically available
   # spark = SparkSession.builder.getOrCreate()

   # Configure Azure OpenAI authentication
   setup_azure(
       spark,
       api_key="<your-api-key>",
       base_url="https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/",
       api_version="preview",
       responses_model_name="my-gpt4-deployment"  # Your Azure deployment name
   )

   # Register UDFs
   spark.udf.register(
       "analyze_text",
       responses_udf(
           instructions="Analyze the sentiment of the text",
           model_name="gpt-4.1-mini"  # Use your Azure deployment name here
       )
   )
   ```

Following these steps allows you to successfully integrate and use `openaivec` within Microsoft Fabric.

## Contributing

We welcome contributions to this project! If you would like to contribute, please follow these guidelines:

1. Fork the repository and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. Ensure the test suite passes.
4. Make sure your code lints.

### Installing Dependencies

To install the necessary dependencies for development, run:

```bash
uv sync --all-extras --dev
```

### Code Formatting

To reformat the code, use the following command:

```bash
uv run ruff check . --fix
```

## Additional Resources

📓 **[Customer feedback analysis →](https://microsoft.github.io/openaivec/examples/customer_analysis/)** - Sentiment analysis & prioritization
📓 **[Survey data transformation →](https://microsoft.github.io/openaivec/examples/survey_transformation/)** - Unstructured to structured data
📓 **[Asynchronous processing examples →](https://microsoft.github.io/openaivec/examples/aio/)** - High-performance async workflows
📓 **[Auto-generate FAQs from documents →](https://microsoft.github.io/openaivec/examples/generate_faq/)** - Create FAQs using AI
📓 **[All examples →](https://microsoft.github.io/openaivec/examples/pandas/)** - Complete collection of tutorials and use cases

## Community

Join our Discord community for developers: https://discord.gg/vbb83Pgn