elasticsearch-haystack 2.0.0__tar.gz → 2.1.0__tar.gz
This diff compares the contents of two publicly released versions of the package, as published to their public registry. It is provided for informational purposes only.
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/CHANGELOG.md +68 -14
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/PKG-INFO +2 -3
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/pyproject.toml +6 -3
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/src/haystack_integrations/components/retrievers/elasticsearch/bm25_retriever.py +24 -1
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/src/haystack_integrations/components/retrievers/elasticsearch/embedding_retriever.py +30 -4
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/src/haystack_integrations/document_stores/elasticsearch/document_store.py +321 -75
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/tests/test_bm25_retriever.py +63 -0
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/tests/test_document_store.py +174 -10
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/tests/test_embedding_retriever.py +65 -0
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/.gitignore +0 -0
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/LICENSE +0 -0
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/README.md +0 -0
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/docker-compose.yml +0 -0
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/pydoc/config.yml +0 -0
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/src/haystack_integrations/components/retrievers/elasticsearch/__init__.py +0 -0
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/src/haystack_integrations/document_stores/elasticsearch/__init__.py +0 -0
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/src/haystack_integrations/document_stores/elasticsearch/filters.py +0 -0
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/tests/__init__.py +0 -0
- {elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/tests/test_filters.py +0 -0
{elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/CHANGELOG.md RENAMED

```diff
@@ -1,12 +1,27 @@
 # Changelog
 
+## [integrations/elasticsearch-v2.0.0] - 2025-02-14
+
+### 🧹 Chores
+
+- Inherit from `FilterDocumentsTestWithDataframe` in Document Stores (#1290)
+- [**breaking**] Elasticsearch - remove dataframe support (#1377)
+
+
 ## [integrations/elasticsearch-v1.0.1] - 2024-10-28
 
-### ⚙️
+### ⚙️ CI
+
+- Adopt uv as installer (#1142)
+
+### 🧹 Chores
 
 - Update changelog after removing legacy filters (#1083)
 - Update ruff linting scripts and settings (#1105)
-
+
+### 🌀 Miscellaneous
+
+- Fix: Elasticsearch - allow passing headers (#1156)
 
 ## [integrations/elasticsearch-v1.0.0] - 2024-09-12
 
@@ -23,18 +38,32 @@
 
 - Do not retry tests in `hatch run test` command (#954)
 
-### ⚙️
+### ⚙️ CI
 
 - Retry tests to reduce flakyness (#836)
+
+### 🧹 Chores
+
 - Update ruff invocation to include check parameter (#853)
 - ElasticSearch - remove legacy filters elasticsearch (#1078)
 
+### 🌀 Miscellaneous
+
+- Ci: install `pytest-rerunfailures` where needed; add retry config to `test-cov` script (#845)
+- Chore: Minor retriever pydoc fix (#884)
+- Chore: elasticsearch - ruff update, don't ruff tests (#999)
+
 ## [integrations/elasticsearch-v0.5.0] - 2024-05-24
 
 ### 🐛 Bug Fixes
 
 - Add support for custom mapping in ElasticsearchDocumentStore (#721)
 
+### 🌀 Miscellaneous
+
+- Chore: add license classifiers (#680)
+- Chore: change the pydoc renderer class (#718)
+
 ## [integrations/elasticsearch-v0.4.0] - 2024-04-03
 
 ### 📚 Documentation
@@ -43,49 +72,64 @@
 - Review Elastic (#541)
 - Disable-class-def (#556)
 
+### 🌀 Miscellaneous
+
+- Make tests show coverage (#566)
+- Refactor tests (#574)
+- Remove references to Python 3.7 (#601)
+- Make Document Stores initially skip `SparseEmbedding` (#606)
+- [Elasticsearch] fix: Filters not working with metadata that contain a space or capitalization (#639)
+
 ## [integrations/elasticsearch-v0.3.0] - 2024-02-23
 
 ### 🐛 Bug Fixes
 
 - Fix order of API docs (#447)
 
-This PR will also push the docs to Readme
-
 ### 📚 Documentation
 
 - Update category slug (#442)
 
-###
+### 🌀 Miscellaneous
 
+- Generate api docs (#322)
+- Add filters to run function in retrievers of elasticsearch (#440)
 - Add user-agent header (#457)
 
-
+## [integrations/elasticsearch-v0.2.0] - 2024-01-19
 
-
+### 🌀 Miscellaneous
 
-
+- Mount import paths under haystack_integrations (#244)
 
--
+## [integrations/elasticsearch-v0.1.3] - 2024-01-18
 
-
+### 🌀 Miscellaneous
 
-
+- Added top_k argument in the run function of ElasticSearcBM25Retriever (#130)
+- Add more docstrings for `ElasticsearchDocumentStore` and `ElasticsearchBM25Retriever` (#184)
+- Elastic - update imports for beta5 (#238)
 
 ## [integrations/elasticsearch-v0.1.2] - 2023-12-20
 
 ### 🐛 Bug Fixes
 
-- Fix project
+- Fix project URLs (#96)
 
 ### 🚜 Refactor
 
 - Use `hatch_vcs` to manage integrations versioning (#103)
 
+### 🌀 Miscellaneous
+
+- Update elasticsearch test badge (#79)
+- [Elasticsearch] - BM25 retrieval: not all terms must mandatorily match (#125)
+
 ## [integrations/elasticsearch-v0.1.1] - 2023-12-05
 
 ### 🐛 Bug Fixes
 
--
+- Document Stores: fix protocol import (#77)
 
 ## [integrations/elasticsearch-v0.1.0] - 2023-12-04
 
@@ -93,6 +137,16 @@ This PR will also push the docs to Readme
 
 - Fix license headers
 
+### 🌀 Miscellaneous
+
+- Remove Document Store decorator (#76)
+
 ## [integrations/elasticsearch-v0.0.2] - 2023-11-29
 
+### 🌀 Miscellaneous
+
+- Reorganize repository (#62)
+- Update `ElasticSearchDocumentStore` to use latest `haystack-ai` version (#63)
+- Bump elasticsearch_haystack to 0.0.2
+
 <!-- generated by git-cliff -->
```
{elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/PKG-INFO RENAMED

```diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: elasticsearch-haystack
-Version: 2.0.0
+Version: 2.1.0
 Summary: Haystack 2.x Document Store for ElasticSearch
 Project-URL: Documentation, https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/elasticsearch#readme
 Project-URL: Issues, https://github.com/deepset-ai/haystack-core-integrations/issues
@@ -11,13 +11,12 @@ License-File: LICENSE
 Classifier: Development Status :: 4 - Beta
 Classifier: License :: OSI Approved :: Apache Software License
 Classifier: Programming Language :: Python
-Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: Implementation :: CPython
 Classifier: Programming Language :: Python :: Implementation :: PyPy
-Requires-Python: >=3.8
+Requires-Python: >=3.9
 Requires-Dist: elasticsearch<9,>=8
 Requires-Dist: haystack-ai
 Description-Content-Type: text/markdown
```
{elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/pyproject.toml RENAMED

```diff
@@ -7,7 +7,7 @@ name = "elasticsearch-haystack"
 dynamic = ["version"]
 description = 'Haystack 2.x Document Store for ElasticSearch'
 readme = "README.md"
-requires-python = ">=3.8"
+requires-python = ">=3.9"
 license = "Apache-2.0"
 keywords = []
 authors = [{ name = "Silvano Cerza", email = "silvanocerza@gmail.com" }]
@@ -15,7 +15,6 @@ classifiers = [
   "License :: OSI Approved :: Apache Software License",
   "Development Status :: 4 - Beta",
   "Programming Language :: Python",
-  "Programming Language :: Python :: 3.8",
   "Programming Language :: Python :: 3.9",
   "Programming Language :: Python :: 3.10",
   "Programming Language :: Python :: 3.11",
@@ -45,6 +44,7 @@ installer = "uv"
 dependencies = [
   "coverage[toml]>=6.5",
   "pytest",
+  "pytest-asyncio",
   "pytest-rerunfailures",
   "pytest-xdist",
   "haystack-pydoc-tools",
@@ -59,12 +59,13 @@ cov-retry = ["test-cov-retry", "cov-report"]
 docs = ["pydoc-markdown pydoc/config.yml"]
 
 [[tool.hatch.envs.all.matrix]]
-python = ["3.8", "3.9", "3.10", "3.11"]
+python = [ "3.9", "3.10", "3.11"]
 
 [tool.hatch.envs.lint]
 installer = "uv"
 detached = true
 dependencies = ["pip", "black>=23.1.0", "mypy>=1.0.0", "ruff>=0.0.243"]
+
 [tool.hatch.envs.lint.scripts]
 typing = "mypy --install-types --non-interactive --explicit-package-bases {args:src/ tests}"
 style = ["ruff check {args:}", "black --check --diff {args:.}"]
@@ -157,6 +158,8 @@ exclude_lines = ["no cov", "if __name__ == .__main__.:", "if TYPE_CHECKING:"]
 [tool.pytest.ini_options]
 minversion = "6.0"
 markers = ["unit: unit tests", "integration: integration tests"]
+asyncio_mode = "auto"
+asyncio_default_fixture_loop_scope = "class"
 
 [[tool.mypy.overrides]]
 module = ["haystack.*", "haystack_integrations.*", "numpy.*", "pytest.*"]
```
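The new `asyncio_mode = "auto"` setting configures pytest-asyncio for the async test suites added below: in auto mode, any bare `async def` test is collected and run on an event loop, so the explicit `@pytest.mark.asyncio` decorators in the diffed tests are redundant but harmless. A minimal sketch of what this enables — the index name and a local Elasticsearch at `http://localhost:9200` are assumptions for illustration, not part of the package:

```python
# Hypothetical test module; assumes a reachable Elasticsearch at http://localhost:9200.
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore


# With asyncio_mode = "auto", this bare `async def` test runs on an event loop
# without an explicit @pytest.mark.asyncio marker.
async def test_fresh_index_is_empty():
    store = ElasticsearchDocumentStore(hosts="http://localhost:9200", index="scratch-index")
    # The first store operation lazily creates the (empty) index.
    assert await store.count_documents_async() == 0
    await store.async_client.close()
```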
{elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/src/haystack_integrations/components/retrievers/elasticsearch/bm25_retriever.py RENAMED

```diff
@@ -120,7 +120,7 @@ class ElasticsearchBM25Retriever:
         """
         Retrieve documents using the BM25 keyword-based algorithm.
 
-        :param query: String to search in `Document`s
+        :param query: String to search in the `Document`s text.
         :param filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on
                         the `filter_policy` chosen at retriever initialization. See init method docstring for more
                         details.
@@ -137,3 +137,26 @@ class ElasticsearchBM25Retriever:
             scale_score=self._scale_score,
         )
         return {"documents": docs}
+
+    @component.output_types(documents=List[Document])
+    async def run_async(self, query: str, filters: Optional[Dict[str, Any]] = None, top_k: Optional[int] = None):
+        """
+        Asynchronously retrieve documents using the BM25 keyword-based algorithm.
+
+        :param query: String to search in the `Document` text.
+        :param filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on
+                        the `filter_policy` chosen at retriever initialization. See init method docstring for more
+                        details.
+        :param top_k: Maximum number of `Document` to return.
+        :returns: A dictionary with the following keys:
+            - `documents`: List of `Document`s that match the query.
+        """
+        filters = apply_filter_policy(self._filter_policy, self._filters, filters)
+        docs = await self._document_store._bm25_retrieval_async(
+            query=query,
+            filters=filters,
+            fuzziness=self._fuzziness,
+            top_k=top_k or self._top_k,
+            scale_score=self._scale_score,
+        )
+        return {"documents": docs}
```
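The new `run_async` mirrors the synchronous `run` exactly — same filter-policy handling and same defaults — but awaits the document store's async BM25 search instead of blocking. A minimal usage sketch; the host, index name, and stored content are assumptions for illustration:

```python
import asyncio

from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchBM25Retriever
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore


async def main():
    # Assumed local Elasticsearch and a pre-populated index.
    store = ElasticsearchDocumentStore(hosts="http://localhost:9200", index="my-index")
    retriever = ElasticsearchBM25Retriever(document_store=store, fuzziness="AUTO", top_k=5)

    # Awaits the store's _bm25_retrieval_async under the hood.
    result = await retriever.run_async(query="functional programming")
    for doc in result["documents"]:
        print(doc.score, doc.content)

    await store.async_client.close()


asyncio.run(main())
```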
{elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/src/haystack_integrations/components/retrievers/elasticsearch/embedding_retriever.py RENAMED

```diff
@@ -119,10 +119,11 @@ class ElasticsearchEmbeddingRetriever:
         Retrieve documents using a vector similarity metric.
 
         :param query_embedding: Embedding of the query.
-        :param filters: Filters applied
-
-
-
+        :param filters: Filters applied when fetching documents from the Document Store.
+            Filters are applied during the approximate kNN search to ensure the Retriever returns
+            `top_k` matching documents.
+            The way runtime filters are applied depends on the `filter_policy` selected when initializing the Retriever.
+        :param top_k: Maximum number of documents to return.
         :returns: A dictionary with the following keys:
             - `documents`: List of `Document`s most similar to the given `query_embedding`
         """
@@ -134,3 +135,28 @@ class ElasticsearchEmbeddingRetriever:
             num_candidates=self._num_candidates,
         )
         return {"documents": docs}
+
+    @component.output_types(documents=List[Document])
+    async def run_async(
+        self, query_embedding: List[float], filters: Optional[Dict[str, Any]] = None, top_k: Optional[int] = None
+    ):
+        """
+        Asynchronously retrieve documents using a vector similarity metric.
+
+        :param query_embedding: Embedding of the query.
+        :param filters: Filters applied when fetching documents from the Document Store.
+            Filters are applied during the approximate kNN search to ensure the Retriever returns
+            `top_k` matching documents.
+            The way runtime filters are applied depends on the `filter_policy` selected when initializing the Retriever.
+        :param top_k: Maximum number of documents to return.
+        :returns: A dictionary with the following keys:
+            - `documents`: List of `Document`s that match the query.
+        """
+        filters = apply_filter_policy(self._filter_policy, self._filters, filters)
+        docs = await self._document_store._embedding_retrieval_async(
+            query_embedding=query_embedding,
+            filters=filters,
+            top_k=top_k or self._top_k,
+            num_candidates=self._num_candidates,
+        )
+        return {"documents": docs}
```
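The embedding retriever gets the same treatment: `run_async` applies the configured filter policy and then awaits the store's async kNN search. A sketch with an assumed 4-dimensional embedding space and an illustrative metadata filter (neither is from the source):

```python
import asyncio

from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore


async def main():
    # Assumed local Elasticsearch; documents are expected to carry embeddings.
    store = ElasticsearchDocumentStore(hosts="http://localhost:9200", index="my-index")
    retriever = ElasticsearchEmbeddingRetriever(document_store=store, top_k=2)

    # Runtime filters are combined with init-time filters according to filter_policy.
    result = await retriever.run_async(
        query_embedding=[0.1, 0.4, 0.2, 0.3],
        filters={"field": "type", "operator": "==", "value": "article"},
    )
    print([doc.id for doc in result["documents"]])

    await store.async_client.close()


asyncio.run(main())
```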
{elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/src/haystack_integrations/document_stores/elasticsearch/document_store.py RENAMED

```diff
@@ -2,7 +2,8 @@
 #
 # SPDX-License-Identifier: Apache-2.0
 import logging
-from typing import Any, Dict, List, Literal, Mapping, Optional, Union
+from collections.abc import Mapping
+from typing import Any, Dict, List, Literal, Optional, Union
 
 import numpy as np
 
@@ -14,7 +15,7 @@ from haystack.document_stores.errors import DocumentStoreError, DuplicateDocumentError
 from haystack.document_stores.types import DuplicatePolicy
 from haystack.version import __version__ as haystack_version
 
-from elasticsearch import Elasticsearch, helpers  # type: ignore[import-not-found]
+from elasticsearch import AsyncElasticsearch, Elasticsearch, helpers  # type: ignore[import-not-found]
 
 from .filters import _normalize_filters
 
@@ -30,6 +31,7 @@ Hosts = Union[str, List[Union[str, Mapping[str, Union[str, int]], NodeConfig]]]
 # Increase the default if most unscaled scores are larger than expected (>30) and otherwise would incorrectly
 # all be mapped to scores ~1.
 BM25_SCALING_FACTOR = 8
+DOC_ALREADY_EXISTS = 409
 
 
 class ElasticsearchDocumentStore:
@@ -93,28 +95,39 @@ class ElasticsearchDocumentStore:
         """
         self._hosts = hosts
         self._client = None
+        self._async_client = None
         self._index = index
         self._embedding_similarity_function = embedding_similarity_function
         self._custom_mapping = custom_mapping
         self._kwargs = kwargs
+        self._initialized = False
 
         if self._custom_mapping and not isinstance(self._custom_mapping, Dict):
             msg = "custom_mapping must be a dictionary"
             raise ValueError(msg)
 
-
-
-
+    def _ensure_initialized(self):
+        """
+        Ensures both sync and async clients are initialized and the index exists.
+        """
+        if not self._initialized:
             headers = self._kwargs.pop("headers", {})
             headers["user-agent"] = f"haystack-py-ds/{haystack_version}"
 
-
+            # Initialize both sync and async clients
+            self._client = Elasticsearch(
+                self._hosts,
+                headers=headers,
+                **self._kwargs,
+            )
+            self._async_client = AsyncElasticsearch(
                 self._hosts,
                 headers=headers,
                 **self._kwargs,
             )
+
             # Check client connection, this will raise if not connected
-
+            self._client.info()
 
             if self._custom_mapping:
                 mappings = self._custom_mapping
@@ -143,13 +156,27 @@ class ElasticsearchDocumentStore:
             }
 
             # Create the index if it doesn't exist
-            if not
-
+            if not self._client.indices.exists(index=self._index):
+                self._client.indices.create(index=self._index, mappings=mappings)
 
-            self.
+        self._initialized = True
 
+    @property
+    def client(self) -> Elasticsearch:
+        """
+        Returns the synchronous Elasticsearch client, initializing it if necessary.
+        """
+        self._ensure_initialized()
         return self._client
 
+    @property
+    def async_client(self) -> AsyncElasticsearch:
+        """
+        Returns the asynchronous Elasticsearch client, initializing it if necessary.
+        """
+        self._ensure_initialized()
+        return self._async_client
+
     def to_dict(self) -> Dict[str, Any]:
         """
         Serializes the component to a dictionary.
```
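This refactor makes initialization lazy: the constructor only stores configuration, and `_ensure_initialized()` creates both clients, verifies the connection, and creates the index on first use. A short sketch of the observable behavior (host and index names are assumptions):

```python
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

# No I/O happens here: both clients stay None until first use.
store = ElasticsearchDocumentStore(hosts="http://localhost:9200", index="my-index")

# First access of the property runs _ensure_initialized(): the sync and async
# clients are created, the connection is verified via info(), and the index is
# created if it does not exist yet.
es = store.client
print(es.info()["version"]["number"])
```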
The diff of `document_store.py` continues:

```diff
@@ -184,15 +211,26 @@ class ElasticsearchDocumentStore:
     def count_documents(self) -> int:
         """
         Returns how many documents are present in the document store.
-
+
+        :returns:
+            Number of documents in the document store.
         """
+        self._ensure_initialized()
         return self.client.count(index=self._index)["count"]
 
+    async def count_documents_async(self) -> int:
+        """
+        Asynchronously returns how many documents are present in the document store.
+        :returns: Number of documents in the document store.
+        """
+        self._ensure_initialized()
+        result = await self._async_client.count(index=self._index)  # type: ignore
+        return result["count"]
+
     def _search_documents(self, **kwargs) -> List[Document]:
         """
         Calls the Elasticsearch client's search method and handles pagination.
         """
-
         top_k = kwargs.get("size")
         if top_k is None and "knn" in kwargs and "k" in kwargs["knn"]:
             top_k = kwargs["knn"]["k"]
@@ -207,7 +245,7 @@
                 **kwargs,
             )
 
-            documents.extend(self._deserialize_document(hit) for hit in res["hits"]["hits"])
+            documents.extend(self._deserialize_document(hit) for hit in res["hits"]["hits"])  # type: ignore
             from_ = len(documents)
 
             if top_k is not None and from_ >= top_k:
@@ -216,6 +254,31 @@
                 break
         return documents
 
+    async def _search_documents_async(self, **kwargs) -> List[Document]:
+        """
+        Asynchronously calls the Elasticsearch client's search method and handles pagination.
+        """
+        top_k = kwargs.get("size")
+        if top_k is None and "knn" in kwargs and "k" in kwargs["knn"]:
+            top_k = kwargs["knn"]["k"]
+
+        documents: List[Document] = []
+        from_ = 0
+
+        # handle pagination
+        while True:
+            res = await self._async_client.search(index=self._index, from_=from_, **kwargs)  # type: ignore
+            documents.extend(self._deserialize_document(hit) for hit in res["hits"]["hits"])  # type: ignore
+            from_ = len(documents)
+
+            if top_k is not None and from_ >= top_k:
+                break
+
+            if from_ >= res["hits"]["total"]["value"]:
+                break
+
+        return documents
+
     def filter_documents(self, filters: Optional[Dict[str, Any]] = None) -> List[Document]:
         """
         The main query method for the document store. It retrieves all documents that match the filters.
```
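`count_documents_async` and `_search_documents_async` are direct async ports of their sync counterparts, including the offset-based pagination loop. A quick sketch of counting from a coroutine (host and index are assumptions):

```python
import asyncio

from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore


async def main():
    store = ElasticsearchDocumentStore(hosts="http://localhost:9200", index="my-index")
    print("documents:", await store.count_documents_async())
    # _ensure_initialized created both clients, so close both when done.
    store.client.close()
    await store.async_client.close()


asyncio.run(main())
```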
```diff
@@ -229,10 +292,54 @@
             msg = "Invalid filter syntax. See https://docs.haystack.deepset.ai/docs/metadata-filtering for details."
             raise ValueError(msg)
 
+        self._ensure_initialized()
         query = {"bool": {"filter": _normalize_filters(filters)}} if filters else None
         documents = self._search_documents(query=query)
         return documents
 
+    async def filter_documents_async(self, filters: Optional[Dict[str, Any]] = None) -> List[Document]:
+        """
+        Asynchronously retrieves all documents that match the filters.
+
+        :param filters: A dictionary of filters to apply. For more information on the structure of the filters,
+            see the official Elasticsearch
+            [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html)
+        :returns: List of `Document`s that match the filters.
+        """
+        if filters and "operator" not in filters and "conditions" not in filters:
+            msg = "Invalid filter syntax. See https://docs.haystack.deepset.ai/docs/metadata-filtering for details."
+            raise ValueError(msg)
+
+        self._ensure_initialized()
+        query = {"bool": {"filter": _normalize_filters(filters)}} if filters else None
+        documents = await self._search_documents_async(query=query)
+        return documents
+
+    @staticmethod
+    def _deserialize_document(hit: Dict[str, Any]) -> Document:
+        """
+        Creates a `Document` from the search hit provided.
+        This is mostly useful in self.filter_documents().
+        :param hit: A search hit from Elasticsearch.
+        :returns: `Document` created from the search hit.
+        """
+        data = hit["_source"]
+
+        if "highlight" in hit:
+            data["metadata"]["highlighted"] = hit["highlight"]
+        data["score"] = hit["_score"]
+
+        if "dataframe" in data:
+            dataframe = data.pop("dataframe")
+            if dataframe:
+                logger.warning(
+                    "Document %s has the `dataframe` field set,"
+                    "ElasticsearchDocumentStore no longer supports dataframes and this field will be ignored. "
+                    "The `dataframe` field will soon be removed from Haystack Document.",
+                    data["id"],
+                )
+        return Document.from_dict(data)
+
     def write_documents(self, documents: List[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int:
         """
         Writes `Document`s to Elasticsearch.
```
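`filter_documents_async` validates the same Haystack filter shape as the sync method: either a single comparison or a top-level `operator` with `conditions`; anything else raises `ValueError`. A hedged sketch — field names and values are illustrative:

```python
import asyncio

from haystack.dataclasses.document import Document
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore


async def main():
    store = ElasticsearchDocumentStore(hosts="http://localhost:9200", index="my-index")
    await store.write_documents_async(
        [
            Document(content="doc one", meta={"lang": "en", "year": 2023}),
            Document(content="doc two", meta={"lang": "de", "year": 2024}),
        ]
    )
    # A filter dict without "operator"/"conditions" keys raises ValueError.
    docs = await store.filter_documents_async(
        filters={
            "operator": "AND",
            "conditions": [
                {"field": "lang", "operator": "==", "value": "en"},
                {"field": "year", "operator": ">=", "value": 2023},
            ],
        }
    )
    print([d.content for d in docs])
    await store.async_client.close()


asyncio.run(main())
```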
```diff
@@ -315,40 +422,86 @@
 
         return documents_written
 
-
-
+    async def write_documents_async(
+        self, documents: List[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
+    ) -> int:
         """
-
-
-        This is mostly useful in self.filter_documents().
+        Asynchronously writes `Document`s to Elasticsearch.
 
-        :param
-        :
+        :param documents: List of Documents to write to the document store.
+        :param policy: DuplicatePolicy to apply when a document with the same ID already exists in the document store.
+        :raises ValueError: If `documents` is not a list of `Document`s.
+        :raises DuplicateDocumentError: If a document with the same ID already exists in the document store and
+            `policy` is set to `DuplicatePolicy.FAIL` or `DuplicatePolicy.NONE`.
+        :raises DocumentStoreError: If an error occurs while writing the documents to the document store.
+        :returns: Number of documents written to the document store.
         """
-
+        self._ensure_initialized()
 
-        if
-
-
+        if len(documents) > 0:
+            if not isinstance(documents[0], Document):
+                msg = "param 'documents' must contain a list of objects of type Document"
+                raise ValueError(msg)
 
-        if
-
-
-
-
-
-
-
-
-
+        if policy == DuplicatePolicy.NONE:
+            policy = DuplicatePolicy.FAIL
+
+        actions = []
+        for doc in documents:
+            doc_dict = doc.to_dict()
+            if "dataframe" in doc_dict:
+                dataframe = doc_dict.pop("dataframe")
+                if dataframe:
+                    logger.warning(
+                        "Document {id} has the `dataframe` field set,"
+                        "ElasticsearchDocumentStore no longer supports dataframes and this field will be ignored. "
+                        "The `dataframe` field will soon be removed from Haystack Document.",
+                    )
+
+            if "sparse_embedding" in doc_dict:
+                sparse_embedding = doc_dict.pop("sparse_embedding", None)
+                if sparse_embedding:
+                    logger.warning(
+                        "Document %s has the `sparse_embedding` field set,"
+                        "but storing sparse embeddings in Elasticsearch is not currently supported."
+                        "The `sparse_embedding` field will be ignored.",
+                        doc.id,
+                    )
+
+            action = {
+                "_op_type": "create" if policy == DuplicatePolicy.FAIL else "index",
+                "_id": doc.id,
+                "_source": doc_dict,
+            }
+            actions.append(action)
+
+        try:
+            success, failed = await helpers.async_bulk(
+                client=self._async_client,
+                actions=actions,
+                index=self._index,
+                refresh=True,
+                raise_on_error=False,
+            )
+            if failed:
+                if policy == DuplicatePolicy.FAIL:
+                    for error in failed:
+                        if "create" in error and error["create"]["status"] == DOC_ALREADY_EXISTS:
+                            msg = f"ID '{error['create']['_id']}' already exists in the document store"
+                            raise DuplicateDocumentError(msg)
+                msg = f"Failed to write documents to Elasticsearch. Errors:\n{failed}"
+                raise DocumentStoreError(msg)
+            return success
+        except Exception as e:
+            msg = f"Failed to write documents to Elasticsearch: {e!s}"
+            raise DocumentStoreError(msg) from e
 
     def delete_documents(self, document_ids: List[str]) -> None:
         """
-        Deletes all
+        Deletes all documents with a matching document_ids from the document store.
 
-        :param document_ids: the
+        :param document_ids: the document ids to delete
         """
-
         helpers.bulk(
             client=self.client,
             actions=({"_op_type": "delete", "_id": id_} for id_ in document_ids),
@@ -357,6 +510,25 @@
             raise_on_error=False,
         )
 
+    async def delete_documents_async(self, document_ids: List[str]) -> None:
+        """
+        Asynchronously deletes all documents with a matching document_ids from the document store.
+
+        :param document_ids: the document ids to delete
+        """
+        self._ensure_initialized()
+
+        try:
+            await helpers.async_bulk(
+                client=self._async_client,
+                actions=({"_op_type": "delete", "_id": id_} for id_ in document_ids),
+                index=self._index,
+                refresh=True,
+            )
+        except Exception as e:
+            msg = f"Failed to delete documents from Elasticsearch: {e!s}"
+            raise DocumentStoreError(msg) from e
+
     def _bm25_retrieval(
         self,
         query: str,
```
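`write_documents_async` upgrades `DuplicatePolicy.NONE` to `FAIL` and maps the policy onto the bulk `_op_type` (`create` vs `index`), raising `DuplicateDocumentError` when Elasticsearch reports a 409 for a `create` action. A usage sketch — host, index, and document are assumptions:

```python
import asyncio

from haystack.dataclasses.document import Document
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore


async def main():
    store = ElasticsearchDocumentStore(hosts="http://localhost:9200", index="my-index")
    doc = Document(id="1", content="hello")

    await store.write_documents_async([doc])
    # Rewriting the same ID raises DuplicateDocumentError unless OVERWRITE is
    # requested, since NONE is treated as FAIL.
    await store.write_documents_async([doc], policy=DuplicatePolicy.OVERWRITE)

    await store.delete_documents_async(["1"])
    await store.async_client.close()


asyncio.run(main())
```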
```diff
@@ -367,27 +539,15 @@
         scale_score: bool = False,
     ) -> List[Document]:
         """
-        Retrieves
-
-
-
-
-
-
-
-
-        :param query: String to search in saved `Document`s' text.
-        :param filters: Filters applied to the retrieved `Document`s, for more info
-            see `ElasticsearchDocumentStore.filter_documents`.
-        :param fuzziness: Fuzziness parameter passed to Elasticsearch. See the official
-            [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness)
-            for valid values.
-        :param top_k: Maximum number of `Document`s to return.
-        :param scale_score: If `True` scales the `Document``s scores between 0 and 1.
-        :raises ValueError: If `query` is an empty string
-        :returns: List of `Document` that match `query`
+        Retrieves documents using BM25 retrieval.
+
+        :param query: The query string to search for
+        :param filters: Optional filters to narrow down the search space
+        :param fuzziness: Fuzziness parameter for the search query
+        :param top_k: Maximum number of documents to return
+        :param scale_score: Whether to scale the similarity score to the range [0,1]
+        :returns: List of Documents that match the query
         """
-
         if not query:
             msg = "query must be a non empty string"
             raise ValueError(msg)
@@ -421,35 +581,79 @@
 
         return documents
 
-    def
+    async def _bm25_retrieval_async(
         self,
-
+        query: str,
         *,
         filters: Optional[Dict[str, Any]] = None,
+        fuzziness: str = "AUTO",
         top_k: int = 10,
-
+        scale_score: bool = False,
     ) -> List[Document]:
         """
-
+        Asynchronously retrieves documents using BM25 retrieval.
+
+        :param query: The query string to search for
+        :param filters: Optional filters to narrow down the search space
+        :param fuzziness: Fuzziness parameter for the search query
+        :param top_k: Maximum number of documents to return
+        :param scale_score: Whether to scale the similarity score to the range [0,1]
+        :returns: List of Documents that match the query
+        """
+        self._ensure_initialized()
+
+        if not query:
+            msg = "query must be a non empty string"
+            raise ValueError(msg)
+
+        # Prepare the search body
+        search_body = {
+            "size": top_k,
+            "query": {
+                "bool": {
+                    "must": [
+                        {
+                            "multi_match": {
+                                "query": query,
+                                "type": "most_fields",
+                                "operator": "OR",
+                                "fuzziness": fuzziness,
+                            }
+                        }
+                    ]
+                }
+            },
+        }
+
+        if filters:
+            search_body["query"]["bool"]["filter"] = _normalize_filters(filters)  # type:ignore
 
-
+        documents = await self._search_documents_async(**search_body)
 
-
-
-
+        if scale_score:
+            for doc in documents:
+                if doc.score is not None:
+                    doc.score = float(1 / (1 + np.exp(-(doc.score / float(BM25_SCALING_FACTOR)))))
 
-
-
-
-
-        :
-
-
-
-        :
-
+        return documents
+
+    def _embedding_retrieval(
+        self,
+        query_embedding: List[float],
+        *,
+        filters: Optional[Dict[str, Any]] = None,
+        top_k: int = 10,
+        num_candidates: Optional[int] = None,
+    ) -> List[Document]:
         """
+        Retrieves documents using dense vector similarity search.
 
+        :param query_embedding: Embedding vector to search for
+        :param filters: Optional filters to narrow down the search space
+        :param top_k: Maximum number of documents to return
+        :param num_candidates: Number of candidates to consider in the search
+        :returns: List of Documents most similar to query_embedding
+        """
         if not query_embedding:
             msg = "query_embedding must be a non-empty list of floats"
             raise ValueError(msg)
@@ -471,3 +675,45 @@
 
         docs = self._search_documents(**body)
         return docs
+
+    async def _embedding_retrieval_async(
+        self,
+        query_embedding: List[float],
+        *,
+        filters: Optional[Dict[str, Any]] = None,
+        top_k: int = 10,
+        num_candidates: Optional[int] = None,
+    ) -> List[Document]:
+        """
+        Asynchronously retrieves documents using dense vector similarity search.
+
+        :param query_embedding: Embedding vector to search for
+        :param filters: Optional filters to narrow down the search space
+        :param top_k: Maximum number of documents to return
+        :param num_candidates: Number of candidates to consider in the search
+        :returns: List of Documents most similar to query_embedding
+        """
+        self._ensure_initialized()
+
+        if not query_embedding:
+            msg = "query_embedding must be a non-empty list of floats"
+            raise ValueError(msg)
+
+        # If num_candidates is not set, use top_k * 10 as default
+        if num_candidates is None:
+            num_candidates = top_k * 10
+
+        # Prepare the search body
+        search_body = {
+            "knn": {
+                "field": "embedding",
+                "query_vector": query_embedding,
+                "k": top_k,
+                "num_candidates": num_candidates,
+            },
+        }
+
+        if filters:
+            search_body["knn"]["filter"] = _normalize_filters(filters)
+
+        return await self._search_documents_async(**search_body)
```
{elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/tests/test_bm25_retriever.py RENAMED

```diff
@@ -117,3 +117,66 @@ def test_run():
     assert len(res) == 1
     assert len(res["documents"]) == 1
     assert res["documents"][0].content == "Test doc"
+
+
+@pytest.mark.asyncio
+async def test_run_async():
+    mock_store = Mock(spec=ElasticsearchDocumentStore)
+    mock_store._bm25_retrieval_async.return_value = [Document(content="test document")]
+    retriever = ElasticsearchBM25Retriever(document_store=mock_store)
+
+    res = await retriever.run_async(query="some test query")
+    mock_store._bm25_retrieval_async.assert_called_once_with(
+        query="some test query", filters={}, fuzziness="AUTO", top_k=10, scale_score=False
+    )
+    assert len(res) == 1
+    assert len(res["documents"]) == 1
+    assert res["documents"][0].content == "test document"
+
+
+@pytest.mark.asyncio
+async def test_run_init_params_async():
+    mock_store = Mock(spec=ElasticsearchDocumentStore)
+    mock_store._bm25_retrieval_async.return_value = [Document(content="test document")]
+    retriever = ElasticsearchBM25Retriever(
+        document_store=mock_store,
+        filters={"some": "filter"},
+        fuzziness="3",
+        top_k=3,
+        scale_score=True,
+        filter_policy=FilterPolicy.MERGE,
+    )
+    res = await retriever.run_async(query="some query")
+    mock_store._bm25_retrieval_async.assert_called_once_with(
+        query="some query",
+        filters={"some": "filter"},
+        fuzziness="3",
+        top_k=3,
+        scale_score=True,
+    )
+    assert len(res) == 1
+    assert len(res["documents"]) == 1
+    assert res["documents"][0].content == "test document"
+
+
+@pytest.mark.asyncio
+async def test_run_time_params_async():
+    mock_store = Mock(spec=ElasticsearchDocumentStore)
+    mock_store._bm25_retrieval_async.return_value = [Document(content="test document")]
+    retriever = ElasticsearchBM25Retriever(
+        document_store=mock_store,
+        filters={"some": "filter"},
+        fuzziness="3",
+        top_k=3,
+        scale_score=True,
+        filter_policy=FilterPolicy.MERGE,
+    )
+
+    res = await retriever.run_async(query="some query", filters={"another": "filter"}, top_k=1)
+    mock_store._bm25_retrieval_async.assert_called_once_with(
+        query="some query", filters={"another": "filter"}, top_k=1, fuzziness="3", scale_score=True
+    )
+
+    assert len(res) == 1
+    assert len(res["documents"]) == 1
+    assert res["documents"][0].content == "test document"
```
{elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/tests/test_document_store.py RENAMED

```diff
@@ -9,6 +9,7 @@ from unittest.mock import Mock, patch
 import pytest
 from elasticsearch.exceptions import BadRequestError  # type: ignore[import-not-found]
 from haystack.dataclasses.document import Document
+from haystack.dataclasses.sparse_embedding import SparseEmbedding
 from haystack.document_stores.errors import DocumentStoreError, DuplicateDocumentError
 from haystack.document_stores.types import DuplicatePolicy
 from haystack.testing.document_store import DocumentStoreBaseTests
@@ -25,7 +26,9 @@ def test_init_is_lazy(_mock_es_client):
 
 @patch("haystack_integrations.document_stores.elasticsearch.document_store.Elasticsearch")
 def test_headers_are_supported(_mock_es_client):
-    _ = ElasticsearchDocumentStore(hosts="http://testhost:9200", headers={"header1": "value1", "header2": "value2"})
+    _ = ElasticsearchDocumentStore(
+        hosts="http://testhost:9200", headers={"header1": "value1", "header2": "value2"}
+    ).client
 
     assert _mock_es_client.call_count == 1
     _, kwargs = _mock_es_client.call_args
@@ -96,6 +99,7 @@ class TestDocumentStore(DocumentStoreBaseTests):
         )
         yield store
         store.client.options(ignore_status=[400, 404]).indices.delete(index=index)
+        store.client.close()
 
     def assert_documents_are_equal(self, received: List[Document], expected: List[Document]):
         """
@@ -134,15 +138,11 @@ class TestDocumentStore(DocumentStoreBaseTests):
     def test_write_documents_dataframe_ignored(self, document_store: ElasticsearchDocumentStore):
         doc = Document(id="1", content="test")
         doc.dataframe = DataFrame({"a": [1, 2, 3]})
-
         document_store.write_documents([doc])
-
         res = document_store.filter_documents()
         assert len(res) == 1
-
         assert res[0].id == "1"
         assert res[0].content == "test"
-
         assert not hasattr(res[0], "dataframe") or res[0].dataframe is None
 
     def test_deserialize_document_dataframe_ignored(self, document_store: ElasticsearchDocumentStore):
@@ -242,16 +242,16 @@ class TestDocumentStore(DocumentStoreBaseTests):
         Test that not all terms must mandatorily match for BM25 retrieval to return a result.
         """
         documents = [
-            Document(id=1, content="There are over 7,000 languages spoken around the world today."),
+            Document(id="1", content="There are over 7,000 languages spoken around the world today."),
             Document(
-                id=2,
+                id="2",
                 content=(
                     "Elephants have been observed to behave in a way that indicates a high level of self-awareness"
                     " such as recognizing themselves in mirrors."
                 ),
             ),
             Document(
-                id=3,
+                id="3",
                 content=(
                     "In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness"
                     " the phenomenon of bioluminescent waves."
@@ -262,7 +262,7 @@ class TestDocumentStore(DocumentStoreBaseTests):
 
         res = document_store._bm25_retrieval("How much self awareness do elephants have?", top_k=3)
         assert len(res) == 1
-        assert res[0].id == 2
+        assert res[0].id == "2"
 
     def test_embedding_retrieval(self, document_store: ElasticsearchDocumentStore):
         docs = [
@@ -355,8 +355,172 @@ class TestDocumentStore(DocumentStoreBaseTests):
         )
         mock_elasticsearch.return_value = mock_client
 
-        _ = ElasticsearchDocumentStore(hosts="http://testhost:9200", custom_mapping=custom_mapping)
+        _ = ElasticsearchDocumentStore(hosts="http://testhost:9200", custom_mapping=custom_mapping).client
         mock_client.indices.create.assert_called_once_with(
             index="default",
             mappings=custom_mapping,
         )
+
+
+@pytest.mark.integration
+class TestElasticsearchDocumentStoreAsync:
+    @pytest.fixture
+    async def document_store(self, request):
+        """
+        Basic fixture providing a document store instance for async tests
+        """
+        hosts = ["http://localhost:9200"]
+        # Use a different index for each test so we can run them in parallel
+        index = f"{request.node.name}"
+
+        store = ElasticsearchDocumentStore(hosts=hosts, index=index)
+        yield store
+        store.client.options(ignore_status=[400, 404]).indices.delete(index=index)
+
+        await store.async_client.close()
+
+    @pytest.mark.asyncio
+    async def test_write_documents_async(self, document_store):
+        docs = [Document(id="1", content="test")]
+        assert await document_store.write_documents_async(docs) == 1
+        assert await document_store.count_documents_async() == 1
+        with pytest.raises(DocumentStoreError):
+            await document_store.write_documents_async(docs, policy=DuplicatePolicy.FAIL)
+
+    @pytest.mark.asyncio
+    async def test_count_documents_async(self, document_store):
+        docs = [
+            Document(content="test doc 1"),
+            Document(content="test doc 2"),
+            Document(content="test doc 3"),
+        ]
+        await document_store.write_documents_async(docs)
+        assert await document_store.count_documents_async() == 3
+
+    @pytest.mark.asyncio
+    async def test_delete_documents_async(self, document_store):
+        doc = Document(content="test doc")
+        await document_store.write_documents_async([doc])
+        assert await document_store.count_documents_async() == 1
+        await document_store.delete_documents_async([doc.id])
+        assert await document_store.count_documents_async() == 0
+
+    @pytest.mark.asyncio
+    async def test_filter_documents_async(self, document_store):
+        filterable_docs = [
+            Document(content="1", meta={"number": -10}),
+            Document(content="2", meta={"number": 100}),
+        ]
+        await document_store.write_documents_async(filterable_docs)
+        result = await document_store.filter_documents_async(
+            filters={"field": "number", "operator": "==", "value": 100}
+        )
+        assert len(result) == 1
+        assert result[0].meta["number"] == 100
+
+    @pytest.mark.asyncio
+    async def test_bm25_retrieval_async(self, document_store):
+        docs = [
+            Document(content="Haskell is a functional programming language"),
+            Document(content="Python is an object oriented programming language"),
+        ]
+        await document_store.write_documents_async(docs)
+        results = await document_store._bm25_retrieval_async("functional", top_k=1)
+        assert len(results) == 1
+        assert "functional" in results[0].content
+
+    @pytest.mark.asyncio
+    async def test_embedding_retrieval_async(self, document_store):
+
+        # init document store
+        docs = [
+            Document(content="Most similar document", embedding=[1.0, 1.0, 1.0, 1.0]),
+            Document(content="Less similar document", embedding=[0.5, 0.5, 0.5, 0.5]),
+        ]
+        await document_store.write_documents_async(docs)
+
+        # without num_candidates set to None
+        results = await document_store._embedding_retrieval_async(query_embedding=[1.0, 1.0, 1.0, 1.0], top_k=1)
+        assert len(results) == 1
+        assert results[0].content == "Most similar document"
+
+        # with num_candidates not None
+        results = await document_store._embedding_retrieval_async(
+            query_embedding=[1.0, 1.0, 1.0, 1.0], top_k=2, num_candidates=2
+        )
+        assert len(results) == 2
+        assert results[0].content == "Most similar document"
+
+        # with an embedding containing None
+        with pytest.raises(ValueError, match="query_embedding must be a non-empty list of floats"):
+            _ = await document_store._embedding_retrieval_async(query_embedding=None, top_k=2)
+
+    @pytest.mark.asyncio
+    async def test_bm25_retrieval_async_with_filters(self, document_store):
+        docs = [
+            Document(content="Haskell is a functional programming language", meta={"type": "functional"}),
+            Document(content="Python is an object oriented programming language", meta={"type": "oop"}),
+        ]
+        await document_store.write_documents_async(docs)
+        results = await document_store._bm25_retrieval_async(
+            "programming", filters={"field": "type", "operator": "==", "value": "functional"}, top_k=1
+        )
+        assert len(results) == 1
+        assert "functional" in results[0].content
+
+        # test with scale_score=True
+        results = await document_store._bm25_retrieval_async(
+            "programming", filters={"field": "type", "operator": "==", "value": "functional"}, top_k=1, scale_score=True
+        )
+        assert len(results) == 1
+        assert "functional" in results[0].content
+        assert 0 <= results[0].score <= 1  # score should be between 0 and 1
+
+    @pytest.mark.asyncio
+    async def test_embedding_retrieval_async_with_filters(self, document_store):
+        docs = [
+            Document(content="Most similar document", embedding=[1.0, 1.0, 1.0, 1.0], meta={"type": "similar"}),
+            Document(content="Less similar document", embedding=[0.5, 0.5, 0.5, 0.5], meta={"type": "different"}),
+        ]
+        await document_store.write_documents_async(docs)
+        results = await document_store._embedding_retrieval_async(
+            query_embedding=[1.0, 1.0, 1.0, 1.0],
+            filters={"field": "type", "operator": "==", "value": "similar"},
+            top_k=1,
+        )
+        assert len(results) == 1
+        assert results[0].content == "Most similar document"
+
+    @pytest.mark.asyncio
+    async def test_write_documents_async_invalid_document_type(self, document_store):
+        """Test write_documents with invalid document type"""
+        invalid_docs = [{"id": "1", "content": "test"}]  # Dictionary instead of Document object
+        with pytest.raises(ValueError, match="param 'documents' must contain a list of objects of type Document"):
+            await document_store.write_documents_async(invalid_docs)
+
+    @pytest.mark.asyncio
+    async def test_write_documents_async_with_dataframe_warning(self, document_store, caplog):
+        """Test write_documents with document containing dataframe field"""
+        doc = Document(id="1", content="test", dataframe=DataFrame({"col": [1, 2, 3]}))
+
+        await document_store.write_documents_async([doc])
+        assert "ElasticsearchDocumentStore no longer supports dataframes" in caplog.text
+
+        results = await document_store.filter_documents_async()
+        assert len(results) == 1
+        assert results[0].id == "1"
+        assert not hasattr(results[0], "dataframe") or results[0].dataframe is None
+
+    @pytest.mark.asyncio
+    async def test_write_documents_async_with_sparse_embedding_warning(self, document_store, caplog):
+        """Test write_documents with document containing sparse_embedding field"""
+        doc = Document(id="1", content="test", sparse_embedding=SparseEmbedding(indices=[0, 1], values=[0.5, 0.5]))
+
+        await document_store.write_documents_async([doc])
+        assert "but storing sparse embeddings in Elasticsearch is not currently supported." in caplog.text
+
+        results = await document_store.filter_documents_async()
+        assert len(results) == 1
+        assert results[0].id == "1"
+        assert not hasattr(results[0], "sparse_embedding") or results[0].sparse_embedding is None
```
{elasticsearch_haystack-2.0.0 → elasticsearch_haystack-2.1.0}/tests/test_embedding_retriever.py RENAMED

```diff
@@ -113,3 +113,68 @@ def test_run():
     assert len(res["documents"]) == 1
     assert res["documents"][0].content == "Test doc"
     assert res["documents"][0].embedding == [0.1, 0.2]
+
+
+@pytest.mark.asyncio
+async def test_run_async():
+    mock_store = Mock(spec=ElasticsearchDocumentStore)
+    mock_store._embedding_retrieval_async.return_value = [Document(content="test document", embedding=[0.1, 0.2])]
+    retriever = ElasticsearchEmbeddingRetriever(document_store=mock_store)
+    res = await retriever.run_async(query_embedding=[0.5, 0.7])
+    mock_store._embedding_retrieval_async.assert_called_once_with(
+        query_embedding=[0.5, 0.7],
+        filters={},
+        top_k=10,
+        num_candidates=None,
+    )
+    assert len(res) == 1
+    assert len(res["documents"]) == 1
+    assert res["documents"][0].content == "test document"
+    assert res["documents"][0].embedding == [0.1, 0.2]
+
+
+@pytest.mark.asyncio
+async def test_run_init_params_async():
+    mock_store = Mock(spec=ElasticsearchDocumentStore)
+    mock_store._embedding_retrieval_async.return_value = [Document(content="test document", embedding=[0.1, 0.2])]
+    retriever = ElasticsearchEmbeddingRetriever(
+        document_store=mock_store,
+        filters={"some": "filter"},
+        top_k=3,
+        num_candidates=30,
+        filter_policy=FilterPolicy.MERGE,
+    )
+    res = await retriever.run_async(query_embedding=[0.5, 0.7])
+    mock_store._embedding_retrieval_async.assert_called_once_with(
+        query_embedding=[0.5, 0.7],
+        filters={"some": "filter"},
+        top_k=3,
+        num_candidates=30,
+    )
+    assert len(res) == 1
+    assert len(res["documents"]) == 1
+    assert res["documents"][0].content == "test document"
+    assert res["documents"][0].embedding == [0.1, 0.2]
+
+
+@pytest.mark.asyncio
+async def test_run_time_params_async():
+    mock_store = Mock(spec=ElasticsearchDocumentStore)
+    mock_store._embedding_retrieval_async.return_value = [Document(content="test document", embedding=[0.1, 0.2])]
+    retriever = ElasticsearchEmbeddingRetriever(
+        document_store=mock_store,
+        filters={"some": "filter"},
+        top_k=3,
+        num_candidates=30,
+        filter_policy=FilterPolicy.MERGE,
+    )
+
+    res = await retriever.run_async(query_embedding=[0.5, 0.7], filters={"another": "filter"}, top_k=1)
+    mock_store._embedding_retrieval_async.assert_called_once_with(
+        query_embedding=[0.5, 0.7], filters={"another": "filter"}, top_k=1, num_candidates=30
+    )
+
+    assert len(res) == 1
+    assert len(res["documents"]) == 1
+    assert res["documents"][0].content == "test document"
+    assert res["documents"][0].embedding == [0.1, 0.2]
```
The remaining files listed above (+0 -0) are unchanged between the two versions.