openaivec 0.12.5__py3-none-any.whl → 1.0.10__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (46)
  1. openaivec/__init__.py +13 -4
  2. openaivec/_cache/__init__.py +12 -0
  3. openaivec/_cache/optimize.py +109 -0
  4. openaivec/_cache/proxy.py +806 -0
  5. openaivec/{di.py → _di.py} +36 -12
  6. openaivec/_embeddings.py +203 -0
  7. openaivec/{log.py → _log.py} +2 -2
  8. openaivec/_model.py +113 -0
  9. openaivec/{prompt.py → _prompt.py} +95 -28
  10. openaivec/_provider.py +207 -0
  11. openaivec/_responses.py +511 -0
  12. openaivec/_schema/__init__.py +9 -0
  13. openaivec/_schema/infer.py +340 -0
  14. openaivec/_schema/spec.py +350 -0
  15. openaivec/_serialize.py +234 -0
  16. openaivec/{util.py → _util.py} +25 -85
  17. openaivec/pandas_ext.py +1496 -318
  18. openaivec/spark.py +485 -183
  19. openaivec/task/__init__.py +9 -7
  20. openaivec/task/customer_support/__init__.py +9 -15
  21. openaivec/task/customer_support/customer_sentiment.py +17 -15
  22. openaivec/task/customer_support/inquiry_classification.py +23 -22
  23. openaivec/task/customer_support/inquiry_summary.py +14 -13
  24. openaivec/task/customer_support/intent_analysis.py +21 -19
  25. openaivec/task/customer_support/response_suggestion.py +16 -16
  26. openaivec/task/customer_support/urgency_analysis.py +24 -25
  27. openaivec/task/nlp/__init__.py +4 -4
  28. openaivec/task/nlp/dependency_parsing.py +10 -12
  29. openaivec/task/nlp/keyword_extraction.py +11 -14
  30. openaivec/task/nlp/morphological_analysis.py +12 -14
  31. openaivec/task/nlp/named_entity_recognition.py +16 -18
  32. openaivec/task/nlp/sentiment_analysis.py +14 -11
  33. openaivec/task/nlp/translation.py +6 -9
  34. openaivec/task/table/__init__.py +2 -2
  35. openaivec/task/table/fillna.py +11 -11
  36. openaivec-1.0.10.dist-info/METADATA +399 -0
  37. openaivec-1.0.10.dist-info/RECORD +39 -0
  38. {openaivec-0.12.5.dist-info → openaivec-1.0.10.dist-info}/WHEEL +1 -1
  39. openaivec/embeddings.py +0 -172
  40. openaivec/model.py +0 -67
  41. openaivec/provider.py +0 -45
  42. openaivec/responses.py +0 -393
  43. openaivec/serialize.py +0 -225
  44. openaivec-0.12.5.dist-info/METADATA +0 -696
  45. openaivec-0.12.5.dist-info/RECORD +0 -33
  46. {openaivec-0.12.5.dist-info → openaivec-1.0.10.dist-info}/licenses/LICENSE +0 -0
@@ -1,696 +0,0 @@
Metadata-Version: 2.4
Name: openaivec
Version: 0.12.5
Summary: Generative mutation for tabular calculation
Project-URL: Homepage, https://microsoft.github.io/openaivec/
Project-URL: Repository, https://github.com/microsoft/openaivec
Author-email: Hiroki Mizukami <hmizukami@microsoft.com>
License: MIT
License-File: LICENSE
Keywords: llm,openai,openai-api,openai-python,pandas,pyspark
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Requires-Dist: openai>=1.74.0
Requires-Dist: pandas>=2.2.3
Requires-Dist: tiktoken>=0.9.0
Provides-Extra: spark
Requires-Dist: pyspark>=3.5.5; extra == 'spark'
Description-Content-Type: text/markdown

# openaivec

**Transform your data analysis with AI-powered text processing at scale.**

**openaivec** enables data analysts to seamlessly integrate OpenAI's language models into their pandas and Spark workflows. Process thousands of text records with natural language instructions, turning unstructured data into actionable insights with just a few lines of code.

## 🚀 Quick Start: From Text to Insights in Seconds

Imagine analyzing 10,000 customer reviews. Instead of manual work, just write:

```python
import pandas as pd
from openaivec import pandas_ext

# Your data
reviews = pd.DataFrame({
    "review": ["Great product, fast delivery!", "Terrible quality, very disappointed", ...]
})

# AI-powered analysis in one line
results = reviews.assign(
    sentiment=lambda df: df.review.ai.responses("Classify sentiment: positive/negative/neutral"),
    issues=lambda df: df.review.ai.responses("Extract main issues or compliments"),
    priority=lambda df: df.review.ai.responses("Priority for follow-up: low/medium/high")
)
```

**Result**: Thousands of reviews classified and analyzed in minutes, not days.

📓 **[Try it yourself →](https://microsoft.github.io/openaivec/examples/pandas/)**

## 💡 Real-World Impact

### Customer Feedback Analysis

```python
# Process 50,000 support tickets automatically
tickets.assign(
    category=lambda df: df.description.ai.responses("Categorize: billing/technical/feature_request"),
    urgency=lambda df: df.description.ai.responses("Urgency level: low/medium/high/critical"),
    solution_type=lambda df: df.description.ai.responses("Best resolution approach")
)
```

### Market Research at Scale

```python
# Analyze multilingual social media data
social_data.assign(
    english_text=lambda df: df.post.ai.responses("Translate to English"),
    brand_mention=lambda df: df.english_text.ai.responses("Extract brand mentions and sentiment"),
    market_trend=lambda df: df.english_text.ai.responses("Identify emerging trends or concerns")
)
```

### Survey Data Transformation

```python
# Convert free-text responses to structured data
from pydantic import BaseModel

class Demographics(BaseModel):
    age_group: str
    location: str
    interests: list[str]

survey_responses.assign(
    structured=lambda df: df.response.ai.responses(
        "Extract demographics as structured data",
        response_format=Demographics
    )
).ai.extract("structured")  # Auto-expands to columns
```

📓 **[See more examples →](https://microsoft.github.io/openaivec/examples/)**

# Overview

This package provides a vectorized interface for the OpenAI API, enabling you to process multiple inputs with a single API call instead of sending requests one by one. This approach helps reduce latency and simplifies your code.

Additionally, it integrates effortlessly with Pandas DataFrames and Apache Spark UDFs, making it easy to incorporate into your data processing pipelines.

## Features

- Vectorized API requests for processing multiple inputs at once.
- Seamless integration with Pandas DataFrames.
- A UDF builder for Apache Spark.
- Compatibility with multiple OpenAI clients, including Azure OpenAI (see the sketch below).
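
On the last point, a minimal sketch of what the Azure OpenAI case might look like, assuming the `AzureOpenAI` client can be passed wherever `OpenAI()` is accepted (endpoint, key, and deployment names below are placeholders; `BatchResponses` itself is introduced under Basic Usage):

```python
from openai import AzureOpenAI
from openaivec import BatchResponses

# Sketch only: all credentials and names below are placeholders.
client = BatchResponses(
    client=AzureOpenAI(
        api_key="<your-api-key>",
        api_version="2024-10-21",
        azure_endpoint="https://<your-resource-name>.openai.azure.com",
    ),
    model_name="<your-deployment-name>",  # Azure deployment name, not a model alias
    system_message="Answer concisely.",
)
```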

## Key Benefits

- **🚀 Performance**: Vectorized processing handles thousands of records in minutes, not hours
- **💰 Cost Efficiency**: Automatic deduplication significantly reduces API costs on typical datasets (illustrated below)
- **🔗 Integration**: Works within existing pandas/Spark workflows without architectural changes
- **📈 Scalability**: Same API scales from exploratory analysis (100s of records) to production systems (millions of records)
- **🎯 Pre-configured Tasks**: Ready-to-use task library with optimized prompts for common use cases
- **🏢 Enterprise Ready**: Microsoft Fabric integration, Apache Spark UDFs, Azure OpenAI compatibility

## Requirements

- Python 3.10 or higher

## Installation

Install the package with:

```bash
pip install openaivec
```
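
The Spark integration is packaged as an optional extra (see `Provides-Extra: spark` in the metadata above), so PySpark support is installed with:

```bash
pip install "openaivec[spark]"
```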

If you want to uninstall the package, you can do so with:

```bash
pip uninstall openaivec
```

## Basic Usage

### Direct API Usage

For maximum control over batch processing:

```python
from openai import OpenAI
from openaivec import BatchResponses

# Initialize the batch client (OpenAI() reads OPENAI_API_KEY from the environment)
client = BatchResponses(
    client=OpenAI(),
    model_name="gpt-4.1-mini",
    system_message="Please answer only with 'xx family' and do not output anything else."
)

result = client.parse(["panda", "rabbit", "koala"], batch_size=32)
print(result)  # Expected output: ['bear family', 'rabbit family', 'koala family']
```

📓 **[Complete tutorial →](https://microsoft.github.io/openaivec/examples/pandas/)**

### Pandas Integration (Recommended)

The easiest way to get started with your DataFrames:

```python
import pandas as pd
from openaivec import pandas_ext

# Setup (optional - uses OPENAI_API_KEY environment variable by default)
pandas_ext.responses_model("gpt-4.1-mini")

# Create your data
df = pd.DataFrame({"name": ["panda", "rabbit", "koala"]})

# Add AI-powered columns
result = df.assign(
    family=lambda df: df.name.ai.responses("What animal family? Answer with 'X family'"),
    habitat=lambda df: df.name.ai.responses("Primary habitat in one word"),
    fun_fact=lambda df: df.name.ai.responses("One interesting fact in 10 words or less")
)
```

| name   | family           | habitat | fun_fact                   |
| ------ | ---------------- | ------- | -------------------------- |
| panda  | bear family      | forest  | Eats bamboo 14 hours daily |
| rabbit | rabbit family    | meadow  | Can see nearly 360 degrees |
| koala  | marsupial family | tree    | Sleeps 22 hours per day    |

📓 **[Interactive pandas examples →](https://microsoft.github.io/openaivec/examples/pandas/)**
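
Embeddings are available through the same accessor family. A sketch, assuming the `embeddings_model` setup function and the `.ai.embeddings` accessor mirror the responses API shown above (check the package docs for the exact signatures):

```python
from openaivec import pandas_ext

# Assumed setup call, analogous to responses_model above
pandas_ext.embeddings_model("text-embedding-3-small")

# One embedding vector per row; inputs are batched and deduplicated
vectors = df["name"].ai.embeddings()
```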

### Using Pre-configured Tasks

For common text processing operations, openaivec provides ready-to-use tasks that eliminate the need to write custom prompts:

```python
from openaivec.task import nlp, customer_support

# Text analysis with pre-configured tasks
text_df = pd.DataFrame({
    "text": [
        "Great product, fast delivery!",
        "Need help with billing issue",
        "How do I reset my password?"
    ]
})

# Use pre-configured tasks for consistent, optimized results
results = text_df.assign(
    sentiment=lambda df: df.text.ai.task(nlp.SENTIMENT_ANALYSIS),
    entities=lambda df: df.text.ai.task(nlp.NAMED_ENTITY_RECOGNITION),
    intent=lambda df: df.text.ai.task(customer_support.INTENT_ANALYSIS),
    urgency=lambda df: df.text.ai.task(customer_support.URGENCY_ANALYSIS)
)

# Extract structured results into separate columns (one at a time)
extracted_results = (results
    .ai.extract("sentiment")
    .ai.extract("entities")
    .ai.extract("intent")
    .ai.extract("urgency")
)
```

**Available Task Categories:**

- **Text Analysis**: `nlp.SENTIMENT_ANALYSIS`, `nlp.TRANSLATION`, `nlp.NAMED_ENTITY_RECOGNITION`, `nlp.KEYWORD_EXTRACTION`
- **Content Classification**: `customer_support.INTENT_ANALYSIS`, `customer_support.URGENCY_ANALYSIS`, `customer_support.INQUIRY_CLASSIFICATION`

**Benefits of Pre-configured Tasks:**

- Optimized prompts tested across diverse datasets
- Consistent structured outputs with Pydantic validation
- Multilingual support with standardized categorical fields
- Extensible framework for adding domain-specific tasks
- Direct compatibility with Spark UDFs

### Asynchronous Processing with `.aio`

For high-performance concurrent processing, use the `.aio` accessor, which provides asynchronous versions of all AI operations:

```python
import asyncio
import pandas as pd
from openaivec import pandas_ext

# Setup (same as synchronous version)
pandas_ext.responses_model("gpt-4.1-mini")

df = pd.DataFrame({"text": [
    "This product is amazing!",
    "Terrible customer service",
    "Good value for money",
    "Not what I expected"
] * 250})  # 1000 rows for demonstration

async def process_data():
    # Asynchronous processing with fine-tuned concurrency control
    results = await df["text"].aio.responses(
        "Analyze sentiment and classify as positive/negative/neutral",
        batch_size=64,       # Process 64 items per API request
        max_concurrency=12   # Allow up to 12 concurrent requests
    )
    return results

# Run the async operation
sentiments = asyncio.run(process_data())
```

**Key Parameters for Performance Tuning:**

- **`batch_size`** (default: 128): Controls how many inputs are grouped into a single API request. Higher values reduce API call overhead but increase memory usage and request processing time.
- **`max_concurrency`** (default: 8): Limits the number of concurrent API requests. Higher values increase throughput but may hit rate limits or overwhelm the API.

**Performance Benefits:**

- Process thousands of records in parallel
- Automatic request batching and deduplication
- Built-in rate limiting and error handling
- Memory-efficient streaming for large datasets

## Using with Apache Spark UDFs

Scale to enterprise datasets with distributed processing:

📓 **[Complete Spark tutorial →](https://microsoft.github.io/openaivec/examples/spark/)**

First, obtain a Spark session and configure authentication:

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Configure authentication via SparkContext environment variables
# Option 1: Using OpenAI
sc.environment["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY")

# Option 2: Using Azure OpenAI
# sc.environment["AZURE_OPENAI_API_KEY"] = os.environ.get("AZURE_OPENAI_API_KEY")
# sc.environment["AZURE_OPENAI_API_ENDPOINT"] = os.environ.get("AZURE_OPENAI_API_ENDPOINT")
# sc.environment["AZURE_OPENAI_API_VERSION"] = os.environ.get("AZURE_OPENAI_API_VERSION")
```

Next, create and register UDFs using the provided functions:

```python
from openaivec.spark import responses_udf, task_udf, embeddings_udf, count_tokens_udf
from pydantic import BaseModel

# --- Register Responses UDF (String Output) ---
spark.udf.register(
    "extract_brand",
    responses_udf(
        instructions="Extract the brand name from the product. Return only the brand name."
    )
)

# --- Register Responses UDF (Structured Output with Pydantic) ---
class Translation(BaseModel):
    en: str
    fr: str
    ja: str

spark.udf.register(
    "translate_struct",
    responses_udf(
        instructions="Translate the text to English, French, and Japanese.",
        response_format=Translation
    )
)

# --- Register Embeddings UDF ---
spark.udf.register(
    "embed_text",
    embeddings_udf()
)

# --- Register Token Counting UDF ---
spark.udf.register("count_tokens", count_tokens_udf("gpt-4o"))

# --- Register UDFs with Pre-configured Tasks ---
from openaivec.task import nlp, customer_support

spark.udf.register(
    "analyze_sentiment",
    task_udf(
        task=nlp.SENTIMENT_ANALYSIS
    )
)

spark.udf.register(
    "classify_intent",
    task_udf(
        task=customer_support.INTENT_ANALYSIS
    )
)
```

You can now use these UDFs in Spark SQL:

```sql
-- Create a sample table (replace with your actual table)
CREATE OR REPLACE TEMP VIEW product_reviews AS SELECT * FROM VALUES
  ('1001', 'The new TechPhone X camera quality is amazing, Nexus Corp really outdid themselves this time!'),
  ('1002', 'Quantum Galaxy has great battery life but the price is too high for what you get'),
  ('1003', 'Zephyr mobile phone crashed twice today, very disappointed with this purchase')
AS product_reviews(id, review_text);

-- Use the registered UDFs (including pre-configured tasks)
SELECT
  id,
  review_text,
  extract_brand(review_text) AS brand,
  translate_struct(review_text) AS translation,
  analyze_sentiment(review_text).sentiment AS sentiment,
  analyze_sentiment(review_text).confidence AS sentiment_confidence,
  classify_intent(review_text).primary_intent AS intent,
  classify_intent(review_text).action_required AS action_required,
  embed_text(review_text) AS embedding,
  count_tokens(review_text) AS token_count
FROM product_reviews;
```

Example Output (structure might vary slightly):

| id   | review_text                                                                    | brand      | translation                 | sentiment | sentiment_confidence | intent           | action_required    | embedding              | token_count |
| ---- | ------------------------------------------------------------------------------ | ---------- | --------------------------- | --------- | -------------------- | ---------------- | ------------------ | ---------------------- | ----------- |
| 1001 | The new TechPhone X camera quality is amazing, Nexus Corp really outdid...     | Nexus Corp | {en: ..., fr: ..., ja: ...} | positive  | 0.95                 | provide_feedback | acknowledge_review | [0.1, -0.2, ..., 0.5]  | 19          |
| 1002 | Quantum Galaxy has great battery life but the price is too high for what...    | Quantum    | {en: ..., fr: ..., ja: ...} | mixed     | 0.78                 | provide_feedback | follow_up_pricing  | [-0.3, 0.1, ..., -0.1] | 16          |
| 1003 | Zephyr mobile phone crashed twice today, very disappointed with this purchase  | Zephyr     | {en: ..., fr: ..., ja: ...} | negative  | 0.88                 | complaint        | investigate_issue  | [0.0, 0.4, ..., 0.2]   | 12          |

### Spark Performance Tuning

When using openaivec with Spark, proper configuration of `batch_size` and `max_concurrency` is crucial for optimal performance:

**`batch_size`** (default: 128):

- Controls how many rows are processed together in each API request within a partition
- **Larger values**: Fewer API calls per partition, reduced overhead
- **Smaller values**: More granular processing, better memory management
- **Recommendation**: 32-128 depending on data complexity and partition size

**`max_concurrency`** (default: 8):

- **Important**: This is the number of concurrent API requests **PER EXECUTOR**
- Total cluster concurrency = `max_concurrency × number_of_executors`
- **Higher values**: Faster processing but may overwhelm API rate limits
- **Lower values**: More conservative, better for shared API quotas
- **Recommendation**: 4-12 per executor, considering your OpenAI tier limits

**Example for a 10-executor cluster:**

```python
# With max_concurrency=8, total cluster concurrency = 8 × 10 = 80 concurrent requests
spark.udf.register(
    "analyze_sentiment",
    responses_udf(
        instructions="Analyze sentiment as positive/negative/neutral",
        batch_size=64,     # Good balance for most use cases
        max_concurrency=8  # 80 total concurrent requests across cluster
    )
)
```

**Monitoring and Scaling:**

- Monitor OpenAI API rate limits and adjust `max_concurrency` accordingly
- Use Spark UI to optimize partition sizes and executor configurations
- Consider your OpenAI tier limits when scaling clusters

## Building Prompts

Building prompts is a crucial step in using LLMs. In particular, providing a few examples in a prompt can significantly improve an LLM's performance, a technique known as "few-shot learning." Typically, a few-shot prompt consists of a purpose, cautions, and examples.

📓 **[Advanced prompting techniques →](https://microsoft.github.io/openaivec/examples/prompt/)**

The `FewShotPromptBuilder` helps you create structured, high-quality prompts with examples, cautions, and automatic improvement.

### Basic Usage

`FewShotPromptBuilder` simply requires a purpose, cautions, and examples; the `build` method returns the rendered prompt in XML format.

Here is an example:

```python
from openaivec.prompt import FewShotPromptBuilder

prompt: str = (
    FewShotPromptBuilder()
    .purpose("Return the smallest category that includes the given word")
    .caution("Never use proper nouns as categories")
    .example("Apple", "Fruit")
    .example("Car", "Vehicle")
    .example("Tokyo", "City")
    .example("Keiichi Sogabe", "Musician")
    .example("America", "Country")
    .build()
)
print(prompt)
```

The output will be:

```xml
<Prompt>
  <Purpose>Return the smallest category that includes the given word</Purpose>
  <Cautions>
    <Caution>Never use proper nouns as categories</Caution>
  </Cautions>
  <Examples>
    <Example>
      <Input>Apple</Input>
      <Output>Fruit</Output>
    </Example>
    <Example>
      <Input>Car</Input>
      <Output>Vehicle</Output>
    </Example>
    <Example>
      <Input>Tokyo</Input>
      <Output>City</Output>
    </Example>
    <Example>
      <Input>Keiichi Sogabe</Input>
      <Output>Musician</Output>
    </Example>
    <Example>
      <Input>America</Input>
      <Output>Country</Output>
    </Example>
  </Examples>
</Prompt>
```

### Improve with OpenAI

For most users, it can be challenging to write a prompt entirely free of contradictions, ambiguities, or redundancies. `FewShotPromptBuilder` provides an `improve` method to refine your prompt using OpenAI's API.

The `improve` method attempts to eliminate contradictions, ambiguities, and redundancies in the prompt using OpenAI's API, iterating the process up to `max_iter` times.

Here is an example:

```python
from openai import OpenAI
from openaivec.prompt import FewShotPromptBuilder

client = OpenAI(...)
model_name = "<your-model-name>"
improved_prompt: str = (
    FewShotPromptBuilder()
    .purpose("Return the smallest category that includes the given word")
    .caution("Never use proper nouns as categories")
    # Examples with contradictions, ambiguities, or redundancies
    .example("Apple", "Fruit")
    .example("Apple", "Technology")
    .example("Apple", "Company")
    .example("Apple", "Color")
    .example("Apple", "Animal")
    # Improve the prompt with OpenAI's API
    .improve(client, model_name)
    .build()
)
print(improved_prompt)
```

Then we get an improved prompt with extra examples, a refined purpose, and cautions:

```xml
<Prompt>
  <Purpose>Classify a given word into its most relevant category by considering its context and potential meanings.
    The input is a word accompanied by context, and the output is the appropriate category based on that context.
    This is useful for disambiguating words with multiple meanings, ensuring accurate understanding and
    categorization.
  </Purpose>
  <Cautions>
    <Caution>Ensure the context of the word is clear to avoid incorrect categorization.</Caution>
    <Caution>Be aware of words with multiple meanings and provide the most relevant category.</Caution>
    <Caution>Consider the possibility of new or uncommon contexts that may not fit traditional categories.</Caution>
  </Cautions>
  <Examples>
    <Example>
      <Input>Apple (as a fruit)</Input>
      <Output>Fruit</Output>
    </Example>
    <Example>
      <Input>Apple (as a tech company)</Input>
      <Output>Technology</Output>
    </Example>
    <Example>
      <Input>Java (as a programming language)</Input>
      <Output>Technology</Output>
    </Example>
    <Example>
      <Input>Java (as an island)</Input>
      <Output>Geography</Output>
    </Example>
    <Example>
      <Input>Mercury (as a planet)</Input>
      <Output>Astronomy</Output>
    </Example>
    <Example>
      <Input>Mercury (as an element)</Input>
      <Output>Chemistry</Output>
    </Example>
    <Example>
      <Input>Bark (as a sound made by a dog)</Input>
      <Output>Animal Behavior</Output>
    </Example>
    <Example>
      <Input>Bark (as the outer covering of a tree)</Input>
      <Output>Botany</Output>
    </Example>
    <Example>
      <Input>Bass (as a type of fish)</Input>
      <Output>Aquatic Life</Output>
    </Example>
    <Example>
      <Input>Bass (as a low-frequency sound)</Input>
      <Output>Music</Output>
    </Example>
  </Examples>
</Prompt>
```

## Using with Microsoft Fabric

[Microsoft Fabric](https://www.microsoft.com/en-us/microsoft-fabric/) is a unified, cloud-based analytics platform that seamlessly integrates data engineering, warehousing, and business intelligence to simplify the journey from raw data to actionable insights.

This section provides instructions on how to integrate and use `openaivec` within Microsoft Fabric. Follow these steps:

1. **Create an Environment in Microsoft Fabric:**

   - In Microsoft Fabric, click on **New item** in your workspace.
   - Select **Environment** to create a new environment for Apache Spark.
   - Determine the environment name, e.g., `openai-environment`.
   - ![image](https://github.com/user-attachments/assets/bd1754ef-2f58-46b4-83ed-b335b64aaa1c)
     _Figure: Creating a new Environment in Microsoft Fabric._

2. **Add `openaivec` to the Environment from the Public Library:**

   - Once your environment is set up, go to the **Custom Library** section within that environment.
   - Click on **Add from PyPI** and search for the latest version of `openaivec`.
   - Save and publish to reflect the changes.
   - ![image](https://github.com/user-attachments/assets/7b6320db-d9d6-4b89-a49d-e55b1489d1ae)
     _Figure: Add `openaivec` from PyPI to the Public Library._

3. **Use the Environment from a Notebook:**

   - Open a notebook within Microsoft Fabric.
   - Select the environment you created in the previous steps.
   - ![image](https://github.com/user-attachments/assets/2457c078-1691-461b-b66e-accc3989e419)
     _Figure: Using a custom environment from a notebook._
   - In the notebook, import and use `openaivec.spark` functions as you normally would. For example:

   ```python
   from pyspark.sql import SparkSession
   from openaivec.spark import responses_udf, embeddings_udf

   spark = SparkSession.builder.getOrCreate()
   sc = spark.sparkContext

   # Configure Azure OpenAI authentication
   sc.environment["AZURE_OPENAI_API_KEY"] = "<your-api-key>"
   sc.environment["AZURE_OPENAI_API_ENDPOINT"] = "https://<your-resource-name>.openai.azure.com"
   sc.environment["AZURE_OPENAI_API_VERSION"] = "2024-10-21"

   # Register UDFs
   spark.udf.register(
       "analyze_text",
       responses_udf(
           instructions="Analyze the sentiment of the text",
           model_name="<your-deployment-name>"
       )
   )
   ```

Following these steps allows you to successfully integrate and use `openaivec` within Microsoft Fabric.

## Contributing

We welcome contributions to this project! If you would like to contribute, please follow these guidelines:

1. Fork the repository and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. Ensure the test suite passes.
4. Make sure your code lints.

### Installing Dependencies

To install the necessary dependencies for development, run:

```bash
uv sync --all-extras --dev
```
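
With dependencies installed, the test suite from guideline 3 can then be run. The exact runner is not stated here, so this assumes pytest invoked through `uv`:

```bash
uv run pytest
```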

### Code Formatting

To lint the code and automatically fix violations, use the following command:

```bash
uv run ruff check . --fix
```

## Additional Resources

📓 **[Customer feedback analysis →](https://microsoft.github.io/openaivec/examples/customer_analysis/)** - Sentiment analysis & prioritization
📓 **[Survey data transformation →](https://microsoft.github.io/openaivec/examples/survey_transformation/)** - Unstructured to structured data
📓 **[Asynchronous processing examples →](https://microsoft.github.io/openaivec/examples/aio/)** - High-performance async workflows
📓 **[Auto-generate FAQs from documents →](https://microsoft.github.io/openaivec/examples/generate_faq/)** - Create FAQs using AI
📓 **[All examples →](https://microsoft.github.io/openaivec/examples/)** - Complete collection of tutorials and use cases

## Community

Join our Discord community for developers: https://discord.gg/vbb83Pgn