PyPI - palimpzest - Versions diffs - 0.9.0__tar.gz → 1.0.0__tar.gz - Mend

palimpzest 0.9.0__tar.gz → 1.0.0__tar.gz

Files changed (101)

{palimpzest-0.9.0/src/palimpzest.egg-info → palimpzest-1.0.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: palimpzest
-Version: 0.9.0
+Version: 1.0.0
 Summary: Palimpzest is a system which enables anyone to process AI-powered analytical queries simply by defining them in a declarative language
 Author-email: MIT DSG Semantic Management Lab <michjc@csail.mit.edu>
 Project-URL: homepage, https://palimpzest.org
@@ -12,7 +12,7 @@ Classifier: Intended Audience :: Developers
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3.8
-Requires-Python: >=3.10
+Requires-Python: >=3.12
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: anthropic>=0.55.0
@@ -59,15 +59,20 @@ Dynamic: license-file
 <!-- [![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv)](https://arxiv.org/pdf/2405.14696) -->
 <!-- [![Video](https://img.shields.io/badge/YouTube-Talk-red?logo=youtube)](https://youtu.be/T8VQfyBiki0?si=eiph57DSEkDNbEIu) -->
-## Learn How to Use PZ
-Our [full documentation](https://palimpzest.org) is the definitive resource for learning how to use PZ. It contains all of the installation and quickstart materials on this page, as well as user guides, full API documentation, and much more.
+## 📚 Learn How to Use PZ
+Our [full documentation](https://palimpzest.org) is the definitive resource for learning how to use PZ. It contains all of the installation and quickstart materials on this page, as well as user guides, full API documentation (coming soon), and much more.
-## Getting started
+## 🚀 Getting started
 You can find a stable version of the PZ package on PyPI [here](https://pypi.org/project/palimpzest/). To install the package, run:
 ```bash
 $ pip install palimpzest
 ```
+You can also install PZ with [uv](https://docs.astral.sh/uv/) for a faster installation:
+```bash
+$ uv pip install palimpzest
+```
 Alternatively, to install the latest version of the package from this repository, you can clone this repository and run the following commands:
 ```bash
 $ git clone git@github.com:mitdbg/palimpzest.git
@@ -75,7 +80,7 @@ $ cd palimpzest
 $ pip install .
 ```
-## Join the PZ Community
+## 🙋🏽 Join the PZ Community
 We are actively hacking on PZ and would love to have you join our community [![Discord](https://img.shields.io/discord/1245561987480420445?logo=discord)](https://discord.gg/dN85JJ6jaH)
 [Our Discord server](https://discord.gg/dN85JJ6jaH) is the best place to:
@@ -86,66 +91,8 @@ We are actively hacking on PZ and would love to have you join our community [![D
 We are eager to learn more about your workloads and use cases, and will take them into consideration in planning our future roadmap.
-## Quick Start
-The easiest way to get started with Palimpzest is to run the `quickstart.ipynb` jupyter notebook. We demonstrate the full workflow of working with PZ, including registering a dataset, composing and executing a pipeline, and accessing the results.
-To run the notebook, you can use the following command:
-```bash
-$ jupyter notebook
-```
-And then access the notebook from the jupyter interface in your browser at `localhost:8888`.
-### Even Quicker Start
-For eager readers, the code in the notebook can be found in the following condensed snippet. However, we do suggest reading the notebook as it contains more insight into each element of the program.
-```python
-import palimpzest as pz
-# define the fields we wish to compute
-email_cols = [
-    {"name": "sender", "type": str, "desc": "The email address of the sender"},
-    {"name": "subject", "type": str, "desc": "The subject of the email"},
-    {"name": "date", "type": str, "desc": "The date the email was sent"},
-]
-# lazily construct the computation to get emails about holidays sent in July
-dataset = pz.Dataset("testdata/enron-tiny/")
-dataset = dataset.sem_add_columns(email_cols)
-dataset = dataset.sem_filter("The email was sent in July")
-dataset = dataset.sem_filter("The email is about holidays")
-# execute the computation w/the MinCost policy
-config = pz.QueryProcessorConfig(policy=pz.MinCost(), verbose=True)
-output = dataset.run(config)
-# display output (if using Jupyter, otherwise use print(output_df))
-output_df = output.to_df(cols=["date", "sender", "subject"])
-display(output_df)
-```
-## Python Demos
-Below are simple instructions to run PZ on a test data set of enron emails that is included with the system.
-### Downloading test data
-To run the provided demos, you will need to download the test data. Due to the size of the data, we are unable to include it in the repository. You can download the test data by running the following command from a unix terminal (requires `wget` and `tar`):
-```
-chmod +x testdata/download-testdata.sh
-./testdata/download-testdata.sh
-```
-### Running the Demos
-Set your OpenAI (or Together.ai) api key at the command line:
-```bash
-# set one (or both) of the following:
-export OPENAI_API_KEY=<your-api-key>
-export TOGETHER_API_KEY=<your-api-key>
-```
-Now you can run the simple test program with:
-```bash
-$ python demos/simple-demo.py --task enron --dataset testdata/enron-eval-tiny --verbose
-```
-### Citation
-If you would like to cite our work, please use the following citation:
+### 📓 Citation
+If you would like to cite our original paper on Palimpzest, please use the following citation:
 ```
 @inproceedings{palimpzestCIDR,
     title={Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing},
@@ -154,3 +101,16 @@ If you would like to cite our work, please use the following citation:
     date = 2025,
 }
 ```
+If you would like to cite our paper on Palimpzest's optimizer Abacus, please use the following citation:
+```
+@misc{russo2025abacuscostbasedoptimizersemantic,
+      title={Abacus: A Cost-Based Optimizer for Semantic Operator Systems},
+      author={Matthew Russo and Sivaprasad Sudhir and Gerardo Vitagliano and Chunwei Liu and Tim Kraska and Samuel Madden and Michael Cafarella},
+      year={2025},
+      eprint={2505.14661},
+      archivePrefix={arXiv},
+      primaryClass={cs.DB},
+      url={https://arxiv.org/abs/2505.14661},
+}
+```

{palimpzest-0.9.0 → palimpzest-1.0.0}/README.md RENAMED

@@ -9,15 +9,20 @@
 <!-- [![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv)](https://arxiv.org/pdf/2405.14696) -->
 <!-- [![Video](https://img.shields.io/badge/YouTube-Talk-red?logo=youtube)](https://youtu.be/T8VQfyBiki0?si=eiph57DSEkDNbEIu) -->
-## Learn How to Use PZ
-Our [full documentation](https://palimpzest.org) is the definitive resource for learning how to use PZ. It contains all of the installation and quickstart materials on this page, as well as user guides, full API documentation, and much more.
+## 📚 Learn How to Use PZ
+Our [full documentation](https://palimpzest.org) is the definitive resource for learning how to use PZ. It contains all of the installation and quickstart materials on this page, as well as user guides, full API documentation (coming soon), and much more.
-## Getting started
+## 🚀 Getting started
 You can find a stable version of the PZ package on PyPI [here](https://pypi.org/project/palimpzest/). To install the package, run:
 ```bash
 $ pip install palimpzest
 ```
+You can also install PZ with [uv](https://docs.astral.sh/uv/) for a faster installation:
+```bash
+$ uv pip install palimpzest
+```
 Alternatively, to install the latest version of the package from this repository, you can clone this repository and run the following commands:
 ```bash
 $ git clone git@github.com:mitdbg/palimpzest.git
@@ -25,7 +30,7 @@ $ cd palimpzest
 $ pip install .
 ```
-## Join the PZ Community
+## 🙋🏽 Join the PZ Community
 We are actively hacking on PZ and would love to have you join our community [![Discord](https://img.shields.io/discord/1245561987480420445?logo=discord)](https://discord.gg/dN85JJ6jaH)
 [Our Discord server](https://discord.gg/dN85JJ6jaH) is the best place to:
@@ -36,66 +41,8 @@ We are actively hacking on PZ and would love to have you join our community [![D
 We are eager to learn more about your workloads and use cases, and will take them into consideration in planning our future roadmap.
-## Quick Start
-The easiest way to get started with Palimpzest is to run the `quickstart.ipynb` jupyter notebook. We demonstrate the full workflow of working with PZ, including registering a dataset, composing and executing a pipeline, and accessing the results.
-To run the notebook, you can use the following command:
-```bash
-$ jupyter notebook
-```
-And then access the notebook from the jupyter interface in your browser at `localhost:8888`.
-### Even Quicker Start
-For eager readers, the code in the notebook can be found in the following condensed snippet. However, we do suggest reading the notebook as it contains more insight into each element of the program.
-```python
-import palimpzest as pz
-# define the fields we wish to compute
-email_cols = [
-    {"name": "sender", "type": str, "desc": "The email address of the sender"},
-    {"name": "subject", "type": str, "desc": "The subject of the email"},
-    {"name": "date", "type": str, "desc": "The date the email was sent"},
-]
-# lazily construct the computation to get emails about holidays sent in July
-dataset = pz.Dataset("testdata/enron-tiny/")
-dataset = dataset.sem_add_columns(email_cols)
-dataset = dataset.sem_filter("The email was sent in July")
-dataset = dataset.sem_filter("The email is about holidays")
-# execute the computation w/the MinCost policy
-config = pz.QueryProcessorConfig(policy=pz.MinCost(), verbose=True)
-output = dataset.run(config)
-# display output (if using Jupyter, otherwise use print(output_df))
-output_df = output.to_df(cols=["date", "sender", "subject"])
-display(output_df)
-```
-## Python Demos
-Below are simple instructions to run PZ on a test data set of enron emails that is included with the system.
-### Downloading test data
-To run the provided demos, you will need to download the test data. Due to the size of the data, we are unable to include it in the repository. You can download the test data by running the following command from a unix terminal (requires `wget` and `tar`):
-```
-chmod +x testdata/download-testdata.sh
-./testdata/download-testdata.sh
-```
-### Running the Demos
-Set your OpenAI (or Together.ai) api key at the command line:
-```bash
-# set one (or both) of the following:
-export OPENAI_API_KEY=<your-api-key>
-export TOGETHER_API_KEY=<your-api-key>
-```
-Now you can run the simple test program with:
-```bash
-$ python demos/simple-demo.py --task enron --dataset testdata/enron-eval-tiny --verbose
-```
-### Citation
-If you would like to cite our work, please use the following citation:
+### 📓 Citation
+If you would like to cite our original paper on Palimpzest, please use the following citation:
 ```
 @inproceedings{palimpzestCIDR,
     title={Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing},
@@ -103,4 +50,17 @@ If you would like to cite our work, please use the following citation:
     booktitle = {Proceedings of the {{Conference}} on {{Innovative Database Research}} ({{CIDR}})},
     date = 2025,
 }
-```
+```
+If you would like to cite our paper on Palimpzest's optimizer Abacus, please use the following citation:
+```
+@misc{russo2025abacuscostbasedoptimizersemantic,
+      title={Abacus: A Cost-Based Optimizer for Semantic Operator Systems},
+      author={Matthew Russo and Sivaprasad Sudhir and Gerardo Vitagliano and Chunwei Liu and Tim Kraska and Samuel Madden and Michael Cafarella},
+      year={2025},
+      eprint={2505.14661},
+      archivePrefix={arXiv},
+      primaryClass={cs.DB},
+      url={https://arxiv.org/abs/2505.14661},
+}
+```

{palimpzest-0.9.0 → palimpzest-1.0.0}/pyproject.toml RENAMED

@@ -1,9 +1,9 @@
 [project]
 name = "palimpzest"
-version = "0.9.0"
+version = "1.0.0"
 description = "Palimpzest is a system which enables anyone to process AI-powered analytical queries simply by defining them in a declarative language"
 readme = "README.md"
-requires-python = ">=3.10"
+requires-python = ">=3.12"
 keywords = ["relational", "optimization", "llm", "AI programming", "extraction", "tools", "document", "search", "integration"]
 authors = [
     {name="MIT DSG Semantic Management Lab", email="michjc@csail.mit.edu"},
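
Note on the hunk above: `requires-python` moves from `>=3.10` to `>=3.12`, so installers that honor this field should not select 1.0.0 on Python 3.10 or 3.11. A minimal sketch of a pre-upgrade check follows; `meets_requirement` and `MIN_PYTHON` are hypothetical names introduced here for illustration and are not part of palimpzest.

```python
import sys

# Minimum interpreter version declared by palimpzest 1.0.0 (requires-python = ">=3.12").
MIN_PYTHON = (3, 12)


def meets_requirement(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor, ...) version satisfies the minimum."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); patch level does not matter for this check.
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not meets_requirement():
        current = f"{sys.version_info.major}.{sys.version_info.minor}"
        print(f"Python {current} is below the 3.12 floor declared by palimpzest 1.0.0")
```

Passing the version tuple explicitly keeps the check testable on any interpreter.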

{palimpzest-0.9.0 → palimpzest-1.0.0}/src/palimpzest/constants.py RENAMED

@@ -207,6 +207,7 @@ class Modality(str, Enum):
 class AggFunc(str, Enum):
     COUNT = "count"
     AVERAGE = "average"
+    SUM = "sum"
     MIN = "min"
     MAX = "max"

{palimpzest-0.9.0 → palimpzest-1.0.0}/src/palimpzest/core/data/dataset.py RENAMED

@@ -22,7 +22,7 @@ from palimpzest.query.operators.logical import (
     LimitScan,
     LogicalOperator,
     Project,
-    RetrieveScan,
+    TopKScan,
 )
 from palimpzest.query.processor.config import QueryProcessorConfig
 from palimpzest.utils.hash_helpers import hash_for_serialized_dict
@@ -243,7 +243,30 @@ class Dataset:
             id=self.id,
         )
-    def sem_join(self, other: Dataset, condition: str, desc: str | None = None, depends_on: str | list[str] | None = None) -> Dataset:
+    def join(self, other: Dataset, on: str | list[str], how: str = "inner") -> Dataset:
+        """
+        Perform the specified join on the specified (list of) column(s)
+        """
+        # enforce type for on
+        if isinstance(on, str):
+            on = [on]
+        # construct new output schema
+        combined_schema = union_schemas([self.schema, other.schema], join=True, on=on)
+        # construct logical operator
+        operator = JoinOp(
+            input_schema=combined_schema,
+            output_schema=combined_schema,
+            condition="",
+            on=on,
+            how=how,
+            depends_on=on,
+        )
+        return Dataset(sources=[self, other], operator=operator, schema=combined_schema)
+    def sem_join(self, other: Dataset, condition: str, desc: str | None = None, depends_on: str | list[str] | None = None, how: str = "inner") -> Dataset:
         """
         Perform a semantic (inner) join on the specified join predicate
         """
@@ -259,6 +282,7 @@ class Dataset:
             input_schema=combined_schema,
             output_schema=combined_schema,
             condition=condition,
+            how=how,
             desc=desc,
             depends_on=depends_on,
         )
@@ -346,7 +370,6 @@ class Dataset:
         return Dataset(sources=[self], operator=operator, schema=new_output_schema)
     def sem_add_columns(self, cols: list[dict] | type[BaseModel],
                         cardinality: Cardinality = Cardinality.ONE_TO_ONE,
                         desc: str | None = None,
@@ -534,6 +557,11 @@ class Dataset:
         operator = Aggregate(input_schema=self.schema, agg_func=AggFunc.AVERAGE)
         return Dataset(sources=[self], operator=operator, schema=operator.output_schema)
+    def sum(self) -> Dataset:
+        """Apply a summation to this set"""
+        operator = Aggregate(input_schema=self.schema, agg_func=AggFunc.SUM)
+        return Dataset(sources=[self], operator=operator, schema=operator.output_schema)
     def min(self) -> Dataset:
         """Apply an min operator to this set"""
         operator = Aggregate(input_schema=self.schema, agg_func=AggFunc.MIN)
@@ -581,7 +609,7 @@ class Dataset:
         return Dataset(sources=[self], operator=operator, schema=operator.output_schema)
-    def retrieve(
+    def sem_topk(
         self,
         index: Collection,
         search_attr: str,
@@ -608,7 +636,7 @@ class Dataset:
         # index = index_factory(index)
         # construct logical operator
-        operator = RetrieveScan(
+        operator = TopKScan(
             input_schema=self.schema,
             output_schema=new_output_schema,
             index=index,
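
Note on the hunks above: `Dataset.retrieve()` is renamed to `Dataset.sem_topk()` (and `RetrieveScan` to `TopKScan`), so callers written against 0.9.0 need to change the method name. The sketch below shows one way a caller could tolerate both versions during a migration. `get_topk_method`, `OldDataset`, and `NewDataset` are hypothetical stand-ins written for this note; they are not palimpzest classes, and the diff does not show whether the library ships a compatibility alias.

```python
def get_topk_method(dataset):
    """Return the top-k method under whichever name the dataset object exposes."""
    # Prefer the 1.0.0 name, then fall back to the 0.9.0 name.
    for name in ("sem_topk", "retrieve"):
        method = getattr(dataset, name, None)
        if callable(method):
            return method
    raise AttributeError("dataset exposes neither sem_topk() nor retrieve()")


# Stand-in classes for illustration only (assumptions, not the real Dataset).
class OldDataset:
    def retrieve(self, **kwargs):
        return ("retrieve", kwargs)


class NewDataset:
    def sem_topk(self, **kwargs):
        return ("sem_topk", kwargs)
```

A caller would then write `get_topk_method(ds)(index=..., search_attr=...)` with the same keyword arguments it used before, assuming the parameters themselves are unchanged as the hunk suggests.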

{palimpzest-0.9.0 → palimpzest-1.0.0}/src/palimpzest/core/elements/groupbysig.py RENAMED

@@ -6,8 +6,11 @@ from pydantic import BaseModel
 from palimpzest.core.lib.schemas import create_schema_from_fields
+# TODO:
+# - move the arguments for group_by_fields, agg_funcs, and agg_fields into the Dataset.groupby() operator
+# - construct the correct output schema using the input schema and the group by and aggregation fields
+# - remove/update all other references to GroupBySig in the codebase
-# TODO: need to rethink how group bys work
 # signature for a group by aggregate that applies
 # group and aggregation to an input tuple
 class GroupBySig:
@@ -50,6 +53,7 @@ class GroupBySig:
             ops.append(self.agg_funcs[i] + "(" + self.agg_fields[i] + ")")
         return ops
+    # TODO: output schema needs to account for input schema types and create new output schema types
     def output_schema(self) -> type[BaseModel]:
         # the output class varies depending on the group by, so here
         # we dynamically construct this output

{palimpzest-0.9.0 → palimpzest-1.0.0}/src/palimpzest/core/elements/records.py RENAMED

@@ -140,7 +140,7 @@ class DataRecord:
     def schema(self) -> type[BaseModel]:
         return type(self._data_item)
-    def copy(self):
+    def copy(self) -> DataRecord:
         # get the set of fields to copy from the parent record
         copy_field_names = [field.split(".")[-1] for field in self.get_field_names()]
@@ -228,18 +228,18 @@ class DataRecord:
     @staticmethod
     def from_join_parents(
         schema: type[BaseModel],
-        left_parent_record: DataRecord,
-        right_parent_record: DataRecord,
+        left_parent_record: DataRecord | None,
+        right_parent_record: DataRecord | None,
         project_cols: list[str] | None = None,
         cardinality_idx: int = None,
     ) -> DataRecord:
         # get the set of fields and field descriptions to copy from the parent record(s)
-        left_copy_field_names = (
+        left_copy_field_names = [] if left_parent_record is None else (
             left_parent_record.get_field_names()
             if project_cols is None
             else [col for col in project_cols if col in left_parent_record.get_field_names()]
         )
-        right_copy_field_names = (
+        right_copy_field_names = [] if right_parent_record is None else (
             right_parent_record.get_field_names()
             if project_cols is None
             else [col for col in project_cols if col in right_parent_record.get_field_names()]
@@ -255,11 +255,20 @@ class DataRecord:
                 new_field_name = f"{field_name}_right"
             data_item[new_field_name] = right_parent_record[field_name]
+        # for any missing fields in the schema, set them to None
+        for field_name in schema.model_fields:
+            if field_name not in data_item:
+                data_item[field_name] = None
         # make new record which has left and right parent record as its parents
+        left_parent_source_indices = [] if left_parent_record is None else list(left_parent_record._source_indices)
+        right_parent_source_indices = [] if right_parent_record is None else list(right_parent_record._source_indices)
+        left_parent_record_id = [] if left_parent_record is None else [left_parent_record._id]
+        right_parent_record_id = [] if right_parent_record is None else [right_parent_record._id]
         new_dr = DataRecord(
             schema(**data_item),
-            source_indices=list(left_parent_record._source_indices) + list(right_parent_record._source_indices),
-            parent_ids=[left_parent_record._id, right_parent_record._id],
+            source_indices=left_parent_source_indices + right_parent_source_indices,
+            parent_ids=left_parent_record_id + right_parent_record_id,
             cardinality_idx=cardinality_idx,
         )
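
Note on the hunks above: `from_join_parents` now accepts `None` for either parent and fills any schema field that no parent supplied with `None`, which is the behavior an outer join needs for unmatched rows. A simplified sketch of that padding logic using plain dicts instead of `DataRecord` objects is shown below; `merge_join_parents` is a hypothetical helper, and the collision rule here (suffix every duplicate with `_right`) is a simplification of the real method.

```python
def merge_join_parents(schema_fields, left, right):
    """Merge two parent rows (dicts) into one output row; either may be None."""
    left = left or {}
    right = right or {}

    # Start from the left parent's fields.
    row = dict(left)

    # Add the right parent's fields, renaming on a name collision.
    for name, value in right.items():
        key = f"{name}_right" if name in row else name
        row[key] = value

    # Pad every schema field that neither parent supplied with None.
    for name in schema_fields:
        row.setdefault(name, None)

    return row
```

For a left row with no match, calling the helper with `right=None` yields the left values plus `None` in every right-hand column.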

{palimpzest-0.9.0 → palimpzest-1.0.0}/src/palimpzest/core/lib/schemas.py RENAMED

@@ -142,16 +142,30 @@ def create_schema_from_df(df: pd.DataFrame) -> type[BaseModel]:
     return _create_pickleable_model(fields)
-def union_schemas(models: list[type[BaseModel]], join: bool = False) -> type[BaseModel]:
+def union_schemas(models: list[type[BaseModel]], join: bool = False, on: list[str] | None = None) -> type[BaseModel]:
     """Union multiple Pydantic models into a single model."""
+    # convert on to empty list if None
+    if on is None:
+        on = []
+    # build up the fields for the new schema
     fields = {}
     for model in models:
         for field_name, field in model.model_fields.items():
-            if field_name in fields and not join:
+            # for non-join unions, make sure duplicate fields have the same type
+            if not join and field_name in fields:
                 assert fields[field_name][0] == field.annotation, f"Field {field_name} has different types in different models"
-            elif field_name in fields and join:
+            # for joins with "on" specified, no need to rename fields in "on"
+            elif join and field_name in on and field_name in fields:
+                continue
+            # otherwise, rename duplicate fields by appending _right
+            elif join and field_name in fields:
                 while field_name in fields:
                     field_name = f"{field_name}_right"
+            # add the field to the new schema
             fields[field_name] = (field.annotation, field)
     # create and return the new schema
@@ -194,6 +208,9 @@ class Average(BaseModel):
 class Count(BaseModel):
     count: int = Field(description="The count of items in the dataset")
+class Sum(BaseModel):
+    sum: int = Field(description="The summation of items in the dataset")
 class Min(BaseModel):
     min: int | float = Field(description="The minimum value of some items in the dataset")
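
Note on the `union_schemas` hunk above: the new `on` argument changes how duplicate field names are handled during a join. Fields listed in `on` are kept once, while other duplicates are still renamed with a `_right` suffix. The sketch below mirrors that branching with plain `name -> type` dicts in place of Pydantic models, so it runs with the standard library only; `union_fields` is a hypothetical name for this illustration.

```python
def union_fields(schemas, join=False, on=None):
    """Union several {field_name: type} dicts, following the 1.0.0 branching."""
    on = on or []
    fields = {}
    for schema in schemas:
        for name, annotation in schema.items():
            if not join and name in fields:
                # Plain union: duplicate fields must agree on type.
                assert fields[name] == annotation, f"Field {name} has different types"
            elif join and name in on and name in fields:
                # Join key: keep the single existing copy, no rename.
                continue
            elif join and name in fields:
                # Other duplicates: append _right until the name is unique.
                while name in fields:
                    name = f"{name}_right"
            fields[name] = annotation
    return fields
```

With `left = {"id": int, "name": str}` and `right = {"id": int, "name": str, "city": str}`, joining on `["id"]` keeps one `id` column and produces `name_right` for the second `name`.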

{palimpzest-0.9.0 → palimpzest-1.0.0}/src/palimpzest/core/models.py RENAMED

@@ -51,10 +51,10 @@ class GenerationStats(BaseModel):
     fn_call_duration_secs: float = 0.0
     # (if applicable) the total number of LLM calls made by this operator
-    total_llm_calls: int = 0
+    total_llm_calls: float = 0
     # (if applicable) the total number of embedding LLM calls made by this operator
-    total_embedding_llm_calls: int = 0
+    total_embedding_llm_calls: float = 0
     def __iadd__(self, other: GenerationStats) -> GenerationStats:
         # self.raw_answers.extend(other.raw_answers)
@@ -243,10 +243,10 @@ class RecordOpStats(BaseModel):
     fn_call_duration_secs: float = 0.0
     # (if applicable) the total number of LLM calls made by this operator
-    total_llm_calls: int = 0
+    total_llm_calls: float = 0
     # (if applicable) the total number of embedding LLM calls made by this operator
-    total_embedding_llm_calls: int = 0
+    total_embedding_llm_calls: float = 0
     # (if applicable) a boolean indicating whether this is the statistics captured from a failed convert operation
     failed_convert: bool | None = None

{palimpzest-0.9.0 → palimpzest-1.0.0}/src/palimpzest/query/execution/all_sample_execution_strategy.py RENAMED

@@ -225,7 +225,7 @@ class AllSamplingExecutionStrategy(SentinelExecutionStrategy):
         dataset_id_to_source_indices = {}
         for dataset_id, dataset in train_dataset.items():
             total_num_samples = len(dataset)
-            source_indices = [f"{dataset_id}-{int(idx)}" for idx in np.arange(total_num_samples)]
+            source_indices = [f"{dataset_id}---{int(idx)}" for idx in np.arange(total_num_samples)]
             dataset_id_to_source_indices[dataset_id] = source_indices
         # initialize set of physical operators for each logical operator
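
Note on the hunk above: the separator between dataset id and sample index in a source index changes from `-` to `---`. The diff does not state the motivation; one plausible reading is that a longer separator is less likely to collide with hyphens inside dataset ids. The sketch below illustrates that reading. `make_source_index` and `split_source_index` are hypothetical helpers, not functions from palimpzest.

```python
SEPARATOR = "---"  # separator used by the 1.0.0 source-index format


def make_source_index(dataset_id, idx):
    """Build a source index string in the 1.0.0 format."""
    return f"{dataset_id}{SEPARATOR}{int(idx)}"


def split_source_index(source_index):
    """Split a source index into (dataset_id, idx) at the last separator."""
    dataset_id, _, idx = source_index.rpartition(SEPARATOR)
    return dataset_id, int(idx)
```

A dataset id such as `enron-tiny` round-trips cleanly: the hyphen inside the id is not mistaken for the separator.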

{palimpzest-0.9.0 → palimpzest-1.0.0}/src/palimpzest/query/execution/execution_strategy.py RENAMED

@@ -14,8 +14,8 @@ from palimpzest.query.operators.convert import LLMConvert
 from palimpzest.query.operators.filter import LLMFilter
 from palimpzest.query.operators.join import JoinOp
 from palimpzest.query.operators.physical import PhysicalOperator
-from palimpzest.query.operators.retrieve import RetrieveOp
 from palimpzest.query.operators.scan import ContextScanOp, ScanPhysicalOp
+from palimpzest.query.operators.topk import TopKOp
 from palimpzest.query.optimizer.plan import PhysicalPlan, SentinelPlan
 from palimpzest.utils.progress import PZSentinelProgressManager
 from palimpzest.validator.validator import Validator
@@ -123,7 +123,7 @@ class SentinelExecutionStrategy(BaseExecutionStrategy, ABC):
             return (
                 not isinstance(op, LLMConvert)
                 and not isinstance(op, LLMFilter)
-                and not isinstance(op, RetrieveOp)
+                and not isinstance(op, TopKOp)
                 and not isinstance(op, JoinOp)
             )
@@ -167,8 +167,8 @@ class SentinelExecutionStrategy(BaseExecutionStrategy, ABC):
                             full_hashes.add(full_hash)
                             futures.append(executor.submit(validator._score_flat_map, op, fields, input_record, output, full_hash))
-                    # create future for retrieve
-                    elif isinstance(op, RetrieveOp):
+                    # create future for top-k
+                    elif isinstance(op, TopKOp):
                         fields = op.generated_fields
                         input_record: DataRecord = record_set.input
                         output = record_set.data_records[0].to_dict(project_cols=fields)
@@ -176,7 +176,7 @@ class SentinelExecutionStrategy(BaseExecutionStrategy, ABC):
                         full_hash = f"{hash(input_record)}{hash(output_str)}"
                         if full_hash not in full_hashes:
                             full_hashes.add(full_hash)
-                            futures.append(executor.submit(validator._score_retrieve, op, fields, input_record, output, full_hash))
+                            futures.append(executor.submit(validator._score_topk, op, fields, input_record, output, full_hash))
                     # create future for filter
                     elif isinstance(op, LLMFilter):
@@ -235,7 +235,7 @@ class SentinelExecutionStrategy(BaseExecutionStrategy, ABC):
                 # TODO: this scoring function will (likely) bias towards small values of k since it
                 # measures precision and not recall / F1; will need to revisit this in the future
-                elif isinstance(op, RetrieveOp):
+                elif isinstance(op, TopKOp):
                     fields = op.generated_fields
                     input_record: DataRecord = record_set.input
                     output_str = record_set.data_records[0].to_json_str(project_cols=fields, bytes_to_str=True, sorted=True)
@@ -341,9 +341,9 @@ class SentinelExecutionStrategy(BaseExecutionStrategy, ABC):
     def _is_llm_op(self, physical_op: PhysicalOperator) -> bool:
         is_llm_convert = isinstance(physical_op, LLMConvert)
         is_llm_filter = isinstance(physical_op, LLMFilter)
-        is_llm_retrieve = isinstance(physical_op, RetrieveOp) and isinstance(physical_op.index, Collection)
+        is_llm_topk = isinstance(physical_op, TopKOp) and isinstance(physical_op.index, Collection)
         is_llm_join = isinstance(physical_op, JoinOp)
-        return is_llm_convert or is_llm_filter or is_llm_retrieve or is_llm_join
+        return is_llm_convert or is_llm_filter or is_llm_topk or is_llm_join
     @abstractmethod
     def execute_sentinel_plan(self, sentinel_plan: SentinelPlan, train_dataset: dict[str, Dataset], validator: Validator) -> SentinelPlanStats:
