palimpzest 0.6.3__tar.gz → 0.7.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {palimpzest-0.6.3/src/palimpzest.egg-info → palimpzest-0.7.0}/PKG-INFO +28 -24
- {palimpzest-0.6.3 → palimpzest-0.7.0}/README.md +20 -15
- {palimpzest-0.6.3 → palimpzest-0.7.0}/pyproject.toml +6 -10
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/__init__.py +5 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/constants.py +110 -43
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/core/data/dataclasses.py +382 -44
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/core/elements/filters.py +7 -3
- palimpzest-0.7.0/src/palimpzest/core/elements/index.py +70 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/core/elements/records.py +33 -11
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/core/lib/fields.py +1 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/core/lib/schemas.py +4 -3
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/prompts/moa_proposer_convert_prompts.py +0 -4
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/prompts/prompt_factory.py +44 -7
- palimpzest-0.7.0/src/palimpzest/prompts/split_merge_prompts.py +56 -0
- palimpzest-0.7.0/src/palimpzest/prompts/split_proposer_prompts.py +55 -0
- palimpzest-0.7.0/src/palimpzest/query/execution/execution_strategy.py +453 -0
- palimpzest-0.7.0/src/palimpzest/query/execution/execution_strategy_type.py +20 -0
- palimpzest-0.7.0/src/palimpzest/query/execution/mab_execution_strategy.py +532 -0
- palimpzest-0.7.0/src/palimpzest/query/execution/parallel_execution_strategy.py +185 -0
- palimpzest-0.7.0/src/palimpzest/query/execution/random_sampling_execution_strategy.py +240 -0
- palimpzest-0.7.0/src/palimpzest/query/execution/single_threaded_execution_strategy.py +254 -0
- palimpzest-0.7.0/src/palimpzest/query/generators/api_client_factory.py +31 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/generators/generators.py +256 -76
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/__init__.py +1 -2
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/code_synthesis_convert.py +33 -18
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/convert.py +30 -97
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/critique_and_refine_convert.py +5 -6
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/filter.py +7 -10
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/logical.py +54 -10
- palimpzest-0.7.0/src/palimpzest/query/operators/map.py +130 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/mixture_of_agents_convert.py +6 -6
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/physical.py +3 -12
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/rag_convert.py +66 -18
- palimpzest-0.7.0/src/palimpzest/query/operators/retrieve.py +306 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/scan.py +5 -2
- palimpzest-0.7.0/src/palimpzest/query/operators/split_convert.py +169 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/token_reduction_convert.py +8 -14
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/optimizer/__init__.py +4 -16
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/optimizer/cost_model.py +73 -266
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/optimizer/optimizer.py +87 -58
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/optimizer/optimizer_strategy.py +18 -97
- palimpzest-0.7.0/src/palimpzest/query/optimizer/optimizer_strategy_type.py +37 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/optimizer/plan.py +2 -3
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/optimizer/primitives.py +5 -3
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/optimizer/rules.py +336 -172
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/optimizer/tasks.py +30 -100
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/processor/config.py +38 -22
- palimpzest-0.7.0/src/palimpzest/query/processor/nosentinel_processor.py +33 -0
- palimpzest-0.7.0/src/palimpzest/query/processor/processing_strategy_type.py +28 -0
- palimpzest-0.7.0/src/palimpzest/query/processor/query_processor.py +97 -0
- palimpzest-0.7.0/src/palimpzest/query/processor/query_processor_factory.py +160 -0
- palimpzest-0.7.0/src/palimpzest/query/processor/sentinel_processor.py +90 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/processor/streaming_processor.py +25 -32
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/sets.py +88 -41
- palimpzest-0.7.0/src/palimpzest/utils/__init__.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/utils/model_helpers.py +8 -7
- palimpzest-0.7.0/src/palimpzest/utils/progress.py +441 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/utils/token_reduction_helpers.py +1 -3
- {palimpzest-0.6.3 → palimpzest-0.7.0/src/palimpzest.egg-info}/PKG-INFO +28 -24
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest.egg-info/SOURCES.txt +12 -3
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest.egg-info/requires.txt +5 -7
- palimpzest-0.6.3/src/palimpzest/core/__init__.py +0 -78
- palimpzest-0.6.3/src/palimpzest/query/execution/execution_strategy.py +0 -71
- palimpzest-0.6.3/src/palimpzest/query/execution/parallel_execution_strategy.py +0 -214
- palimpzest-0.6.3/src/palimpzest/query/execution/single_threaded_execution_strategy.py +0 -284
- palimpzest-0.6.3/src/palimpzest/query/operators/retrieve.py +0 -110
- palimpzest-0.6.3/src/palimpzest/query/processor/mab_sentinel_processor.py +0 -884
- palimpzest-0.6.3/src/palimpzest/query/processor/nosentinel_processor.py +0 -537
- palimpzest-0.6.3/src/palimpzest/query/processor/query_processor.py +0 -265
- palimpzest-0.6.3/src/palimpzest/query/processor/query_processor_factory.py +0 -173
- palimpzest-0.6.3/src/palimpzest/query/processor/random_sampling_sentinel_processor.py +0 -639
- palimpzest-0.6.3/src/palimpzest/utils/index_helpers.py +0 -6
- palimpzest-0.6.3/src/palimpzest/utils/progress.py +0 -225
- {palimpzest-0.6.3 → palimpzest-0.7.0}/LICENSE +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/setup.cfg +0 -0
- {palimpzest-0.6.3/src/palimpzest/core/data → palimpzest-0.7.0/src/palimpzest/core}/__init__.py +0 -0
- {palimpzest-0.6.3/src/palimpzest/core/elements → palimpzest-0.7.0/src/palimpzest/core/data}/__init__.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/core/data/datareaders.py +0 -0
- {palimpzest-0.6.3/src/palimpzest/core/lib → palimpzest-0.7.0/src/palimpzest/core/elements}/__init__.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/core/elements/groupbysig.py +0 -0
- {palimpzest-0.6.3/src/palimpzest/query → palimpzest-0.7.0/src/palimpzest/core/lib}/__init__.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/policy.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/prompts/__init__.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/prompts/code_synthesis_prompts.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/prompts/convert_prompts.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/prompts/critique_and_refine_convert_prompts.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/prompts/filter_prompts.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/prompts/moa_aggregator_convert_prompts.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/prompts/util_phrases.py +0 -0
- {palimpzest-0.6.3/src/palimpzest/query/execution → palimpzest-0.7.0/src/palimpzest/query}/__init__.py +0 -0
- {palimpzest-0.6.3/src/palimpzest/query/generators → palimpzest-0.7.0/src/palimpzest/query/execution}/__init__.py +0 -0
- {palimpzest-0.6.3/src/palimpzest/query/processor → palimpzest-0.7.0/src/palimpzest/query/generators}/__init__.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/aggregate.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/limit.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/query/operators/project.py +0 -0
- {palimpzest-0.6.3/src/palimpzest/schemabuilder → palimpzest-0.7.0/src/palimpzest/query/processor}/__init__.py +0 -0
- {palimpzest-0.6.3/src/palimpzest/tools → palimpzest-0.7.0/src/palimpzest/schemabuilder}/__init__.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/schemabuilder/schema_builder.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/tools/README.md +0 -0
- {palimpzest-0.6.3/src/palimpzest/utils → palimpzest-0.7.0/src/palimpzest/tools}/__init__.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/tools/allenpdf.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/tools/pdfparser.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/tools/skema_tools.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/utils/datareader_helpers.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/utils/demo_helpers.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/utils/env_helpers.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/utils/field_helpers.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/utils/generation_helpers.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/utils/hash_helpers.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/utils/sandbox.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest/utils/udfs.py +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest.egg-info/dependency_links.txt +0 -0
- {palimpzest-0.6.3 → palimpzest-0.7.0}/src/palimpzest.egg-info/top_level.txt +0 -0
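The restructuring above moves sentinel execution out of the processor modules and into dedicated execution strategies, and adds new operators (map, retrieve, split_convert) plus an index module. The top-level entry points appear unchanged; a minimal sanity-check sketch, based only on the imports visible in the `src/palimpzest/__init__.py` hunk further down this page:

```python
# These names are imported at the top of palimpzest/__init__.py in both versions
# (see the __init__.py hunk below), so they should remain importable after the upgrade.
from palimpzest import Cardinality, DataReader, Dataset, QueryProcessorConfig

print(Cardinality, DataReader, Dataset, QueryProcessorConfig)
```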
PKG-INFO (+28 -24)

@@ -1,6 +1,6 @@
-Metadata-Version: 2.
+Metadata-Version: 2.4
 Name: palimpzest
-Version: 0.6.3
+Version: 0.7.0
 Summary: Palimpzest is a system which enables anyone to process AI-powered analytical queries simply by defining them in a declarative language
 Author-email: MIT DSG Semantic Management Lab <michjc@csail.mit.edu>
 Project-URL: homepage, https://palimpzest.org
@@ -16,6 +16,7 @@ Requires-Python: >=3.8
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: charset-normalizer>=3.3.2
+Requires-Dist: chromadb>=0.6.3
 Requires-Dist: click>=8.1.7
 Requires-Dist: click-aliases>=1.0.4
 Requires-Dist: colorama>=0.4.6
@@ -54,19 +55,17 @@ Requires-Dist: pypdf>=5.1.0
 Requires-Dist: pytest-mock>=3.14.0
 Requires-Dist: python-Levenshtein>=0.25.1
 Requires-Dist: pyyaml>=6.0.1
+Requires-Dist: ragatouille>=0.0.9
 Requires-Dist: requests>=2.25
-Requires-Dist: requests-html>=0.10.0
 Requires-Dist: ruff>=0.9.0
-Requires-Dist: scikit-learn>=1.5.2
-Requires-Dist: scipy>=1.9.0
 Requires-Dist: setuptools>=70.1.1
 Requires-Dist: tabulate>=0.9.0
-Requires-Dist: tenacity>=8.2.3
 Requires-Dist: together>=1.3.1
 Requires-Dist: tqdm~=4.66.1
-Requires-Dist: transformers
-Requires-Dist:
-Requires-Dist:
+Requires-Dist: transformers<4.50.0,>=4.41.3
+Requires-Dist: rich[jupyter]>=13.9.2
+Requires-Dist: voyager>=2.0.9
+Dynamic: license-file
 
 
 
@@ -141,9 +140,6 @@ output_df = output.to_df(cols=["date", "sender", "subject"])
 display(output_df)
 ```
 
-## Palimpzest CLI
-Installing Palimpzest also installs its CLI tool `pz` which provides users with basic utilities at the command line for creating and managing their own Palimpzest system. Please read the readme in [src/cli/README.md](./src/cli/README.md) for instructions on how to use it.
-
 ## Python Demos
 Below are simple instructions to run PZ on a test data set of enron emails that is included with the system.
 
@@ -153,19 +149,27 @@ To run the provided demos, you will need to download the test data. Due to the s
 chmod +x testdata/download-testdata.sh
 ./testdata/download-testdata.sh
 ```
-For convenience, we have also provided a script to register all test data with Palimpzest:
-```
-chmod +x testdata/register-sources.sh
-./testdata/register-sources.sh
-```
 
 ### Running the Demos
-
-
-
+Set your OpenAI (or Together.ai) api key at the command line:
+```bash
+# set one (or both) of the following:
+export OPENAI_API_KEY=<your-api-key>
+export TOGETHER_API_KEY=<your-api-key>
+```
 
-
-
+Now you can run the simple test program with:
+```bash
+$ python demos/simple-demo.py --task enron --dataset testdata/enron-eval-tiny --verbose
+```
 
-
-
+### Citation
+If you would like to cite our work, please use the following citation:
+```
+@inproceedings{palimpzestCIDR,
+title={Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing},
+author={Liu, Chunwei and Russo, Matthew and Cafarella, Michael and Cao, Lei and Chen, Peter Baile and Chen, Zui and Franklin, Michael and Kraska, Tim and Madden, Samuel and Shahout, Rana and Vitagliano, Gerardo},
+booktitle = {Proceedings of the {{Conference}} on {{Innovative Database Research}} ({{CIDR}})},
+date = 2025,
+}
+```
README.md (+20 -15)

@@ -71,9 +71,6 @@ output_df = output.to_df(cols=["date", "sender", "subject"])
 display(output_df)
 ```
 
-## Palimpzest CLI
-Installing Palimpzest also installs its CLI tool `pz` which provides users with basic utilities at the command line for creating and managing their own Palimpzest system. Please read the readme in [src/cli/README.md](./src/cli/README.md) for instructions on how to use it.
-
 ## Python Demos
 Below are simple instructions to run PZ on a test data set of enron emails that is included with the system.
 
@@ -83,19 +80,27 @@ To run the provided demos, you will need to download the test data. Due to the s
 chmod +x testdata/download-testdata.sh
 ./testdata/download-testdata.sh
 ```
-For convenience, we have also provided a script to register all test data with Palimpzest:
-```
-chmod +x testdata/register-sources.sh
-./testdata/register-sources.sh
-```
 
 ### Running the Demos
-
-
-
+Set your OpenAI (or Together.ai) api key at the command line:
+```bash
+# set one (or both) of the following:
+export OPENAI_API_KEY=<your-api-key>
+export TOGETHER_API_KEY=<your-api-key>
+```
 
-
-
+Now you can run the simple test program with:
+```bash
+$ python demos/simple-demo.py --task enron --dataset testdata/enron-eval-tiny --verbose
+```
 
-
-
+### Citation
+If you would like to cite our work, please use the following citation:
+```
+@inproceedings{palimpzestCIDR,
+title={Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing},
+author={Liu, Chunwei and Russo, Matthew and Cafarella, Michael and Cao, Lei and Chen, Peter Baile and Chen, Zui and Franklin, Michael and Kraska, Tim and Madden, Samuel and Shahout, Rana and Vitagliano, Gerardo},
+booktitle = {Proceedings of the {{Conference}} on {{Innovative Database Research}} ({{CIDR}})},
+date = 2025,
+}
+```
pyproject.toml (+6 -10)

@@ -1,6 +1,6 @@
 [project]
 name = "palimpzest"
-version = "0.6.3"
+version = "0.7.0"
 description = "Palimpzest is a system which enables anyone to process AI-powered analytical queries simply by defining them in a declarative language"
 readme = "README.md"
 requires-python = ">=3.8"
@@ -10,6 +10,7 @@ authors = [
 ]
 dependencies = [
     "charset-normalizer>=3.3.2",
+    "chromadb>=0.6.3",
     "click>=8.1.7",
     "click-aliases>=1.0.4",
     "colorama>=0.4.6",
@@ -48,21 +49,16 @@ dependencies = [
     "pytest-mock>=3.14.0",
     "python-Levenshtein>=0.25.1",
     "pyyaml>=6.0.1",
+    "ragatouille>=0.0.9",
     "requests>=2.25",
-    "requests-html>=0.10.0",
     "ruff>=0.9.0",
-    "scikit-learn>=1.5.2",
-    "scipy>=1.9.0",
     "setuptools>=70.1.1",
     "tabulate>=0.9.0",
-    "tenacity>=8.2.3",
     "together>=1.3.1",
-    # "torch>=1.9.0",
     "tqdm~=4.66.1",
-    "transformers>=4.
-    "
-    "
-    # Add other dependencies as needed
+    "transformers>=4.41.3,<4.50.0",
+    "rich[jupyter]>=13.9.2",
+    "voyager>=2.0.9",
 ]
 classifiers=[
     "Development Status :: 4 - Beta", # Change as appropriate
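Dependency-wise, 0.7.0 pulls in chromadb, ragatouille, voyager, and rich, pins transformers below 4.50.0, and drops requests-html, scikit-learn, scipy, and tenacity. After upgrading (e.g. `pip install --upgrade palimpzest`), the new pins can be checked from installed metadata; a small sketch using only the standard library:

```python
from importlib.metadata import requires, version

# Confirm the installed release matches this diff and inspect a few of the
# new/changed requirement pins declared in pyproject.toml.
print(version("palimpzest"))  # expected: 0.7.0
new_pins = [r for r in requires("palimpzest") if r.startswith(("chromadb", "transformers", "voyager"))]
print(new_pins)
```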
src/palimpzest/__init__.py (+5 -0)

@@ -1,3 +1,5 @@
+import logging
+
 from palimpzest.constants import Cardinality
 from palimpzest.core.data.datareaders import DataReader
 from palimpzest.policy import (
@@ -14,6 +16,9 @@ from palimpzest.policy import (
 from palimpzest.query.processor.config import QueryProcessorConfig
 from palimpzest.sets import Dataset
 
+# Initialize the root logger
+logging.getLogger(__name__).addHandler(logging.NullHandler())
+
 __all__ = [
     # constants
     "Cardinality",
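The added `logging.getLogger(__name__).addHandler(logging.NullHandler())` line is the conventional way for a library to stay silent unless the host application configures logging. A minimal sketch of how an application could opt in to palimpzest's log output, assuming (as in the hunk above) that the logger is named after the package:

```python
import logging

# The library only installs a NullHandler; the application chooses the
# destination and verbosity of palimpzest's log records.
logging.basicConfig(level=logging.INFO)
logging.getLogger("palimpzest").setLevel(logging.DEBUG)
```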
src/palimpzest/constants.py (+110 -43)

@@ -10,23 +10,36 @@ class Model(str, Enum):
     which requires invoking an LLM. It does NOT specify whether the model need be executed
     remotely or locally (if applicable).
     """
-    LLAMA3 = "meta-llama/Llama-3-8b-chat-hf"
-
+    # LLAMA3 = "meta-llama/Llama-3-8b-chat-hf"
+    LLAMA3 = "meta-llama/Llama-3.3-70B-Instruct-Turbo"
+    LLAMA3_V = "meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo"
     MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1"
+    DEEPSEEK = "deepseek-ai/DeepSeek-V3"
     GPT_4o = "gpt-4o-2024-08-06"
     GPT_4o_V = "gpt-4o-2024-08-06"
     GPT_4o_MINI = "gpt-4o-mini-2024-07-18"
     GPT_4o_MINI_V = "gpt-4o-mini-2024-07-18"
+    TEXT_EMBEDDING_3_SMALL = "text-embedding-3-small"
+    CLIP_VIT_B_32 = "clip-ViT-B-32"
 
     def __repr__(self):
         return f"{self.name}"
 
 
+class APIClient(str, Enum):
+    """
+    APIClient describes the API client to be used when invoking an LLM.
+    """
+
+    OPENAI = "openai"
+    TOGETHER = "together"
+
 class PromptStrategy(str, Enum):
     """
     PromptStrategy describes the prompting technique to be used by a Generator when
     performing some task with a specified Model.
     """
+
     # Chain-of-Thought Boolean Prompt Strategies
     COT_BOOL = "chain-of-thought-bool"
     # COT_BOOL_CRITIC = "chain-of-thought-bool-critic"
@@ -52,14 +65,18 @@
     COT_MOA_PROPOSER_IMAGE = "chain-of-thought-mixture-of-agents-proposer-image"
     COT_MOA_AGG = "chain-of-thought-mixture-of-agents-aggregation"
 
+    # Split Convert Prompt Strategies
+    SPLIT_PROPOSER = "split-proposer"
+    SPLIT_MERGER = "split-merger"
+
     def is_image_prompt(self):
         return "image" in self.value
 
-    def
-    return "
+    def is_bool_prompt(self):
+        return "bool" in self.value
 
-    def
-    return "
+    def is_convert_prompt(self):
+        return "bool" not in self.value
 
     def is_critic_prompt(self):
         return "critic" in self.value
@@ -73,6 +90,11 @@
     def is_moa_aggregator_prompt(self):
         return "mixture-of-agents-aggregation" in self.value
 
+    def is_split_proposer_prompt(self):
+        return "split-proposer" in self.value
+
+    def is_split_merger_prompt(self):
+        return "split-merger" in self.value
 
 class AggFunc(str, Enum):
     COUNT = "count"
@@ -93,10 +115,12 @@ class Cardinality(str, Enum):
                 return member
         return cls.ONE_TO_ONE
 
+
 class PickOutputStrategy(str, Enum):
     CHAMPION = "champion"
     ENSEMBLE = "ensemble"
 
+
 IMAGE_EXTENSIONS = [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff"]
 PDF_EXTENSIONS = [".pdf"]
 XLS_EXTENSIONS = [".xls", ".xlsx"]
@@ -111,11 +135,6 @@ DEFAULT_PDF_PROCESSOR = "pypdf"
 # character limit for various IDs
 MAX_ID_CHARS = 10
 
-# retry LLM executions 2^x * (multiplier) for up to 10 seconds and at most 4 times
-RETRY_MULTIPLIER = 2
-RETRY_MAX_SECS = 10
-RETRY_MAX_ATTEMPTS = 1
-
 # maximum number of rows to display in a table
 MAX_ROWS = 5
 
@@ -201,25 +220,45 @@ LOG_LLM_OUTPUT = False
 # values more precisely:
 # - https://artificialanalysis.ai/models/llama-3-1-instruct-8b
 #
-LLAMA3_8B_MODEL_CARD = {
+# LLAMA3_8B_MODEL_CARD = {
+# ##### Cost in USD #####
+# "usd_per_input_token": 0.18 / 1E6,
+# "usd_per_output_token": 0.18 / 1E6,
+# ##### Time #####
+# "seconds_per_output_token": 0.0061,
+# ##### Agg. Benchmark #####
+# "overall": 71.0,
+# ##### Code #####
+# "code": 64.0,
+# }
+LLAMA3_3_70B_INSTRUCT_MODEL_CARD = {
     ##### Cost in USD #####
-    "usd_per_input_token": 0.
-    "usd_per_output_token": 0.
+    "usd_per_input_token": 0.88 / 1e6,
+    "usd_per_output_token": 0.88 / 1e6,
     ##### Time #####
-    "seconds_per_output_token": 0.
+    "seconds_per_output_token": 0.0139,
     ##### Agg. Benchmark #####
-    "overall":
+    "overall": 86.0,
     ##### Code #####
-    "code":
+    "code": 88.4,
 }
-
+# LLAMA3_2_11B_V_MODEL_CARD = {
+# ##### Cost in USD #####
+# "usd_per_input_token": 0.18 / 1E6,
+# "usd_per_output_token": 0.18 / 1E6,
+# ##### Time #####
+# "seconds_per_output_token": 0.0061,
+# ##### Agg. Benchmark #####
+# "overall": 71.0,
+# }
+LLAMA3_2_90B_V_MODEL_CARD = {
     ##### Cost in USD #####
-    "usd_per_input_token":
-    "usd_per_output_token":
+    "usd_per_input_token": 1.2 / 1e6,
+    "usd_per_output_token": 1.2 / 1e6,
     ##### Time #####
-    "seconds_per_output_token": 0.
+    "seconds_per_output_token": 0.0222,
     ##### Agg. Benchmark #####
-    "overall":
+    "overall": 84.0,
 }
 MIXTRAL_8X_7B_MODEL_CARD = {
     ##### Cost in USD #####
@@ -232,10 +271,21 @@ MIXTRAL_8X_7B_MODEL_CARD = {
     ##### Code #####
     "code": 40.0,
 }
+DEEPSEEK_V3_MODEL_CARD = {
+    ##### Cost in USD #####
+    "usd_per_input_token": 1.25 / 1E6,
+    "usd_per_output_token": 1.25 / 1E6,
+    ##### Time #####
+    "seconds_per_output_token": 0.0769,
+    ##### Agg. Benchmark #####
+    "overall": 87.0,
+    ##### Code #####
+    "code": 92.0,
+}
 GPT_4o_MODEL_CARD = {
     ##### Cost in USD #####
-    "usd_per_input_token": 2.5 /
-    "usd_per_output_token": 10.0 /
+    "usd_per_input_token": 2.5 / 1e6,
+    "usd_per_output_token": 10.0 / 1e6,
     ##### Time #####
     "seconds_per_output_token": 0.0079,
     ##### Agg. Benchmark #####
@@ -246,8 +296,8 @@ GPT_4o_MODEL_CARD = {
 GPT_4o_V_MODEL_CARD = {
     # NOTE: it is unclear if the same ($ / token) costs can be applied, or if we have to calculate this ourselves
     ##### Cost in USD #####
-    "usd_per_input_token": 2.5 /
-    "usd_per_output_token": 10.0 /
+    "usd_per_input_token": 2.5 / 1e6,
+    "usd_per_output_token": 10.0 / 1e6,
     ##### Time #####
     "seconds_per_output_token": 0.0079,
     ##### Agg. Benchmark #####
@@ -255,8 +305,8 @@ GPT_4o_V_MODEL_CARD = {
 }
 GPT_4o_MINI_MODEL_CARD = {
     ##### Cost in USD #####
-    "usd_per_input_token": 0.15 /
-    "usd_per_output_token": 0.6 /
+    "usd_per_input_token": 0.15 / 1e6,
+    "usd_per_output_token": 0.6 / 1e6,
     ##### Time #####
     "seconds_per_output_token": 0.0098,
     ##### Agg. Benchmark #####
@@ -267,24 +317,44 @@ GPT_4o_MINI_MODEL_CARD = {
 GPT_4o_MINI_V_MODEL_CARD = {
     # NOTE: it is unclear if the same ($ / token) costs can be applied, or if we have to calculate this ourselves
     ##### Cost in USD #####
-    "usd_per_input_token": 0.15 /
-    "usd_per_output_token": 0.6 /
+    "usd_per_input_token": 0.15 / 1e6,
+    "usd_per_output_token": 0.6 / 1e6,
     ##### Time #####
     "seconds_per_output_token": 0.0098,
     ##### Agg. Benchmark #####
     "overall": 82.0,
 }
+TEXT_EMBEDDING_3_SMALL_MODEL_CARD = {
+    ##### Cost in USD #####
+    "usd_per_input_token": 0.02 / 1e6,
+    "usd_per_output_token": None,
+    ##### Time #####
+    "seconds_per_output_token": 0.0098, # NOTE: just copying GPT_4o_MINI_MODEL_CARD for now
+    ##### Agg. Benchmark #####
+    "overall": 82.0, # NOTE: just copying GPT_4o_MINI_MODEL_CARD for now
+}
+CLIP_VIT_B_32_MODEL_CARD = {
+    ##### Cost in USD #####
+    "usd_per_input_token": 0.00,
+    "usd_per_output_token": None,
+    ##### Time #####
+    "seconds_per_output_token": 0.0098, # NOTE: just copying TEXT_EMBEDDING_3_SMALL_MODEL_CARD for now
+    ##### Agg. Benchmark #####
+    "overall": 63.3, # NOTE: ImageNet top-1 accuracy
+}
 
 
 MODEL_CARDS = {
-    Model.LLAMA3.value:
-    Model.LLAMA3_V.value:
+    Model.LLAMA3.value: LLAMA3_3_70B_INSTRUCT_MODEL_CARD,
+    Model.LLAMA3_V.value: LLAMA3_2_90B_V_MODEL_CARD,
+    Model.DEEPSEEK.value: DEEPSEEK_V3_MODEL_CARD,
     Model.MIXTRAL.value: MIXTRAL_8X_7B_MODEL_CARD,
     Model.GPT_4o.value: GPT_4o_MODEL_CARD,
     Model.GPT_4o_V.value: GPT_4o_V_MODEL_CARD,
     Model.GPT_4o_MINI.value: GPT_4o_MINI_MODEL_CARD,
     Model.GPT_4o_MINI_V.value: GPT_4o_MINI_V_MODEL_CARD,
-
+    Model.TEXT_EMBEDDING_3_SMALL.value: TEXT_EMBEDDING_3_SMALL_MODEL_CARD,
+    Model.CLIP_VIT_B_32.value: CLIP_VIT_B_32_MODEL_CARD,
     ###
     # Model.GPT_3_5.value: GPT_3_5_MODEL_CARD,
     # Model.GPT_4.value: GPT_4_MODEL_CARD,
@@ -294,9 +364,6 @@ MODEL_CARDS = {
 }
 
 
-
-
-
 ###### DEPRECATED ######
 # # NOTE: seconds_per_output_token is based on `gpt-3.5-turbo-1106`
 # GPT_3_5_MODEL_CARD = {
@@ -317,7 +384,7 @@
 # ### "DROP": 64.1, # 3-shot
 # ##### Code #####
 # "code": 48.1,
-# ### "HumanEval": 48.1,^ # 0-shot
+# ### "HumanEval": 48.1,^ # 0-shot
 # ##### Math #####
 # "math": 57.1,
 # ### "GSM8K": 57.1,^ # 5-shot
@@ -364,10 +431,10 @@
 
 # GEMINI_1_MODEL_CARD = {
 # ##### Cost in USD #####
-# "usd_per_input_token": 125 / 1E8, # Gemini is free but rate limited for now. Pricing will be updated
+# "usd_per_input_token": 125 / 1E8, # Gemini is free but rate limited for now. Pricing will be updated
 # "usd_per_output_token": 375 / 1E9,
 # ##### Time #####
-# "seconds_per_output_token": 0.042 / 10.0, # TODO:
+# "seconds_per_output_token": 0.042 / 10.0, # TODO:
 # ##### Agg. Benchmark #####
 # "overall": 65.0, # 90.0 TODO: we are using the free version of Gemini which is substantially worse than its paid version; I'm manually revising it's quality below that of Mixtral
 # ##### Commonsense Reasoning #####
@@ -379,7 +446,7 @@
 # ##### Code #####
 # "code": 74.4,
 # # "HumanEval": 74.4, # 0-shot (IT)*
-# # "Natural2Code": 74.9, # 0-shot
+# # "Natural2Code": 74.9, # 0-shot
 # ##### Math #####
 # "math": 94.4,
 # # "GSM8K": 94.4, # maj1@32
@@ -388,10 +455,10 @@
 
 # GEMINI_1V_MODEL_CARD = {
 # ##### Cost in USD #####
-# "usd_per_input_token": 25 / 1E6, # Gemini is free but rate limited for now. Pricing will be updated
+# "usd_per_input_token": 25 / 1E6, # Gemini is free but rate limited for now. Pricing will be updated
 # "usd_per_output_token": 375 / 1E9,
 # ##### Time #####
-# "seconds_per_output_token": 0.042, # / 10.0, # TODO:
+# "seconds_per_output_token": 0.042, # / 10.0, # TODO:
 # ##### Agg. Benchmark #####
 # "overall": 65.0, # 90.0, TODO: see note above in Gemini_1 model card
 # ##### Commonsense Reasoning #####
@@ -403,7 +470,7 @@
 # ##### Code #####
 # "code": 74.4,
 # # "HumanEval": 74.4, # 0-shot (IT)*
-# # "Natural2Code": 74.9, # 0-shot
+# # "Natural2Code": 74.9, # 0-shot
 # ##### Math #####
 # "math": 94.4,
 # # "GSM8K": 94.4, # maj1@32