cmem-plugin-pgvector 0.5.0__tar.gz → 0.6.0__tar.gz
- {cmem_plugin_pgvector-0.5.0 → cmem_plugin_pgvector-0.6.0}/PKG-INFO +7 -30
- cmem_plugin_pgvector-0.6.0/README-public.md +28 -0
- cmem_plugin_pgvector-0.6.0/cmem_plugin_pgvector/commons.py +131 -0
- cmem_plugin_pgvector-0.6.0/cmem_plugin_pgvector/search_task.py +221 -0
- cmem_plugin_pgvector-0.5.0/cmem_plugin_pgvector/cmem_plugin_pgvector.py → cmem_plugin_pgvector-0.6.0/cmem_plugin_pgvector/store_task.py +18 -55
- {cmem_plugin_pgvector-0.5.0 → cmem_plugin_pgvector-0.6.0}/pyproject.toml +2 -2
- cmem_plugin_pgvector-0.5.0/README-public.md +0 -51
- {cmem_plugin_pgvector-0.5.0 → cmem_plugin_pgvector-0.6.0}/LICENSE +0 -0
- {cmem_plugin_pgvector-0.5.0 → cmem_plugin_pgvector-0.6.0}/cmem_plugin_pgvector/__init__.py +0 -0
- {cmem_plugin_pgvector-0.5.0 → cmem_plugin_pgvector-0.6.0}/cmem_plugin_pgvector/postgresql.svg +0 -0
|
@@ -1,7 +1,7 @@
 Metadata-Version: 2.3
 Name: cmem-plugin-pgvector
-Version: 0.
-Summary: Store embedding vectors
+Version: 0.6.0
+Summary: Store and search for embedding vectors in a Postgres vector store.
 License: Apache-2.0
 Keywords: eccenca Corporate Memory,plugin
 Author: eccenca GmbH
@@ -24,43 +24,20 @@ Description-Content-Type: text/markdown
|
|
|
24
24
|
|
|
25
25
|
# cmem-plugin-pgvector
|
|
26
26
|
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
Store embedding vectors into a Postgres vector store.
|
|
30
|
-
|
|
31
|
-
This plugin consumes the costumable entity's paths ```embedding```, ```text``` and ```metadata``` as following:
|
|
32
|
-
|
|
33
|
-
- The text path contain the text used to generate the embeddings, default ```text```.
|
|
34
|
-
- The embedding path contain the embedding representation of the text, default ```embedding```.
|
|
35
|
-
- The metadata path contain the information that will be associated with the embedding, default all paths.
|
|
27
|
+
Store and search for embedding vectors in a Postgres vector store.
|
|
36
28
|
|
|
37
29
|
[![eccenca Corporate Memory][cmem-shield]][cmem-link]
|
|
38
30
|
|
|
39
|
-
## Use
|
|
40
|
-
|
|
41
|
-
Interact with Large Language Models.
|
|
42
|
-
|
|
43
31
|
This is a plugin for [eccenca](https://eccenca.com) [Corporate Memory](https://documentation.eccenca.com).
|
|
44
32
|
|
|
45
|
-
You can install it with the [cmemc](https://eccenca.com/go/cmemc) command line
|
|
46
|
-
clients like this:
|
|
33
|
+
You can install it with the [cmemc](https://eccenca.com/go/cmemc) command line client like this:
|
|
47
34
|
|
|
48
|
-
```
|
|
35
|
+
``` bash
|
|
49
36
|
cmemc admin workspace python install cmem-plugin-llm
|
|
50
37
|
```
|
|
51
38
|
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
- ```collection_name```: The name of the collection where the embeddings are going to be stored, default ```my_collection```
|
|
55
|
-
- ```user```:the database user
|
|
56
|
-
- ```password```: the database password
|
|
57
|
-
- ```host```: the databse host, i.e. locahost
|
|
58
|
-
- ```port```: the database port, default ```5432```
|
|
59
|
-
- ```database```: the name of the database
|
|
60
|
-
- ```pre_delete_collection```: boolean parameter indicating if the collection should be cleanse before insertion, default ```false```
|
|
61
|
-
- ```embedding_path```: output path that will contain the generated embedding, default ```embedding```
|
|
62
|
-
- ```text_path```: path containing the text used for genereting the embedding, default ```text```
|
|
63
|
-
- ```metadata_paths```: paths from the entity that will be stored along with the embedding, default all paths
|
|
39
|
+
[](https://pypi.org/project/cmem-plugin-pgvector) [](https://pypi.org/project/cmem-plugin-pgvector)
|
|
40
|
+
[![poetry][poetry-shield]][poetry-link] [![ruff][ruff-shield]][ruff-link] [![mypy][mypy-shield]][mypy-link] [![copier][copier-shield]][copier]
|
|
64
41
|
|
|
65
42
|
[cmem-link]: https://documentation.eccenca.com
|
|
66
43
|
[cmem-shield]: https://img.shields.io/endpoint?url=https://dev.documentation.eccenca.com/badge.json
|
|
@@ -0,0 +1,28 @@
|
|
|
1
|
+
# cmem-plugin-pgvector
|
|
2
|
+
|
|
3
|
+
Store and search for embedding vectors in a Postgres vector store.
|
|
4
|
+
|
|
5
|
+
[![eccenca Corporate Memory][cmem-shield]][cmem-link]
|
|
6
|
+
|
|
7
|
+
This is a plugin for [eccenca](https://eccenca.com) [Corporate Memory](https://documentation.eccenca.com).
|
|
8
|
+
|
|
9
|
+
You can install it with the [cmemc](https://eccenca.com/go/cmemc) command line client like this:
|
|
10
|
+
|
|
11
|
+
``` bash
|
|
12
|
+
cmemc admin workspace python install cmem-plugin-llm
|
|
13
|
+
```
|
|
14
|
+
|
|
15
|
+
[](https://pypi.org/project/cmem-plugin-pgvector) [](https://pypi.org/project/cmem-plugin-pgvector)
|
|
16
|
+
[![poetry][poetry-shield]][poetry-link] [![ruff][ruff-shield]][ruff-link] [![mypy][mypy-shield]][mypy-link] [![copier][copier-shield]][copier]
|
|
17
|
+
|
|
18
|
+
[cmem-link]: https://documentation.eccenca.com
|
|
19
|
+
[cmem-shield]: https://img.shields.io/endpoint?url=https://dev.documentation.eccenca.com/badge.json
|
|
20
|
+
[poetry-link]: https://python-poetry.org/
|
|
21
|
+
[poetry-shield]: https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json
|
|
22
|
+
[ruff-link]: https://docs.astral.sh/ruff/
|
|
23
|
+
[ruff-shield]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json&label=Code%20Style
|
|
24
|
+
[mypy-link]: https://mypy-lang.org/
|
|
25
|
+
[mypy-shield]: https://www.mypy-lang.org/static/mypy_badge.svg
|
|
26
|
+
[copier]: https://copier.readthedocs.io/
|
|
27
|
+
[copier-shield]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/copier-org/copier/master/img/badge/badge-grayscale-inverted-border-purple.json
|
|
28
|
+
|
|
@@ -0,0 +1,131 @@
|
|
|
1
|
+
"""PGVector commons"""
|
|
2
|
+
|
|
3
|
+
from typing import Any, ClassVar
|
|
4
|
+
|
|
5
|
+
import psycopg
|
|
6
|
+
from cmem_plugin_base.dataintegration.context import (
|
|
7
|
+
PluginContext,
|
|
8
|
+
)
|
|
9
|
+
from cmem_plugin_base.dataintegration.description import PluginParameter
|
|
10
|
+
from cmem_plugin_base.dataintegration.parameter.password import PasswordParameterType
|
|
11
|
+
from cmem_plugin_base.dataintegration.types import (
|
|
12
|
+
Autocompletion,
|
|
13
|
+
IntParameterType,
|
|
14
|
+
StringParameterType,
|
|
15
|
+
)
|
|
16
|
+
|
|
17
|
+
|
|
18
|
+
def get_collection_names(
|
|
19
|
+
dbname: str, user: str, password: str, host: str = "localhost", port: int = 5432
|
|
20
|
+
) -> list[str]:
|
|
21
|
+
"""Return list of collection names"""
|
|
22
|
+
# Create a connection to the database
|
|
23
|
+
with (
|
|
24
|
+
psycopg.connect(dbname=dbname, user=user, password=password, host=host, port=port) as conn,
|
|
25
|
+
conn.cursor() as cursor,
|
|
26
|
+
):
|
|
27
|
+
# Execute query
|
|
28
|
+
cursor.execute("SELECT name FROM public.langchain_pg_collection;")
|
|
29
|
+
return [row[0] for row in cursor.fetchall()] # Fetch all names
|
|
30
|
+
|
|
31
|
+
|
|
32
|
+
class PGVectorCollection(StringParameterType):
|
|
33
|
+
"""PGVector Collection Type"""
|
|
34
|
+
|
|
35
|
+
autocompletion_depends_on_parameters: ClassVar[list[str]] = [
|
|
36
|
+
"host",
|
|
37
|
+
"port",
|
|
38
|
+
"database",
|
|
39
|
+
"user",
|
|
40
|
+
"password",
|
|
41
|
+
]
|
|
42
|
+
|
|
43
|
+
# auto complete for values
|
|
44
|
+
allow_only_autocompleted_values: bool = True
|
|
45
|
+
# auto complete for labels
|
|
46
|
+
autocomplete_value_with_labels: bool = True
|
|
47
|
+
|
|
48
|
+
def autocomplete(
|
|
49
|
+
self,
|
|
50
|
+
query_terms: list[str],
|
|
51
|
+
depend_on_parameter_values: list[Any],
|
|
52
|
+
context: PluginContext,
|
|
53
|
+
) -> list[Autocompletion]:
|
|
54
|
+
"""Return all results that match ALL provided query terms."""
|
|
55
|
+
_ = context
|
|
56
|
+
host = depend_on_parameter_values[0]
|
|
57
|
+
port = depend_on_parameter_values[1]
|
|
58
|
+
dbname = depend_on_parameter_values[2]
|
|
59
|
+
user = depend_on_parameter_values[3]
|
|
60
|
+
password = depend_on_parameter_values[4]
|
|
61
|
+
password = password if isinstance(password, str) else password.decrypt()
|
|
62
|
+
result = []
|
|
63
|
+
try:
|
|
64
|
+
collections = get_collection_names(
|
|
65
|
+
host=host, port=port, dbname=dbname, user=user, password=password
|
|
66
|
+
)
|
|
67
|
+
filtered_models = set()
|
|
68
|
+
if query_terms:
|
|
69
|
+
for term in query_terms:
|
|
70
|
+
for collection in collections:
|
|
71
|
+
if term in collection:
|
|
72
|
+
filtered_models.add(collection)
|
|
73
|
+
else:
|
|
74
|
+
filtered_models = set(collections)
|
|
75
|
+
result = [Autocompletion(value=f"{_}", label=f"{_}") for _ in filtered_models]
|
|
76
|
+
except Exception as error:
|
|
77
|
+
raise ValueError(
|
|
78
|
+
"Failed to authenticate with OpenAI API, Please check URL and API key."
|
|
79
|
+
) from error
|
|
80
|
+
result.sort(key=lambda x: x.label)
|
|
81
|
+
return result
|
|
82
|
+
|
|
83
|
+
|
|
84
|
+
class DatabaseParams:
|
|
85
|
+
"""Common Plugin parameters"""
|
|
86
|
+
|
|
87
|
+
host = PluginParameter(
|
|
88
|
+
name="host",
|
|
89
|
+
label="Database Host",
|
|
90
|
+
description="The hostname of the postgres database service.",
|
|
91
|
+
default_value="pgvector",
|
|
92
|
+
)
|
|
93
|
+
port = PluginParameter(
|
|
94
|
+
name="port",
|
|
95
|
+
label="Database Port",
|
|
96
|
+
param_type=IntParameterType(),
|
|
97
|
+
description="The port number of the postgres database service.",
|
|
98
|
+
default_value=5432,
|
|
99
|
+
)
|
|
100
|
+
user = PluginParameter(
|
|
101
|
+
name="user",
|
|
102
|
+
label="Database User",
|
|
103
|
+
description="The account name used to login to the postgres database service.",
|
|
104
|
+
default_value="pgvector",
|
|
105
|
+
)
|
|
106
|
+
password = PluginParameter(
|
|
107
|
+
name="password",
|
|
108
|
+
label="Database Password",
|
|
109
|
+
param_type=PasswordParameterType(),
|
|
110
|
+
description="The password of the database account.",
|
|
111
|
+
)
|
|
112
|
+
database = PluginParameter(
|
|
113
|
+
name="database",
|
|
114
|
+
label="Database Name",
|
|
115
|
+
description="The database name.",
|
|
116
|
+
default_value="pgvector",
|
|
117
|
+
)
|
|
118
|
+
collection_name = PluginParameter(
|
|
119
|
+
name="collection_name",
|
|
120
|
+
label="Collection Name",
|
|
121
|
+
description="The name of the collection that will be used for search.",
|
|
122
|
+
param_type=PGVectorCollection(),
|
|
123
|
+
)
|
|
124
|
+
|
|
125
|
+
def as_list(self) -> list[PluginParameter]:
|
|
126
|
+
"""Provide all parameters as list"""
|
|
127
|
+
return [
|
|
128
|
+
getattr(self, attr)
|
|
129
|
+
for attr in dir(self)
|
|
130
|
+
if not callable(getattr(self, attr)) and not attr.startswith("__")
|
|
131
|
+
]
|
|
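Note on the `commons.py` hunk above: the `autocomplete` method keeps a collection as soon as any one query term is a substring of its name, while its docstring speaks of matching ALL provided terms. The sketch below contrasts the two behaviors. The helper names `match_any` and `match_all` and the collection names are hypothetical and not part of the package.

```python
def match_any(collections: list[str], query_terms: list[str]) -> list[str]:
    """Mirror the loop in the diff: keep names containing at least one term."""
    if not query_terms:
        return sorted(set(collections))
    return sorted({c for c in collections for t in query_terms if t in c})


def match_all(collections: list[str], query_terms: list[str]) -> list[str]:
    """Alternative reading of the docstring: keep names containing every term."""
    return sorted({c for c in collections if all(t in c for t in query_terms)})


# Made-up collection names, for illustration only.
names = ["docs_en", "docs_de", "tickets_en"]
any_hits = match_any(names, ["docs", "en"])
all_hits = match_all(names, ["docs", "en"])
```

With the terms `docs` and `en`, the any-term variant returns all three names, while the all-terms variant returns only `docs_en`.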
@@ -0,0 +1,221 @@
|
|
|
1
|
+
"""Search Task"""
|
|
2
|
+
|
|
3
|
+
import json
|
|
4
|
+
from ast import literal_eval
|
|
5
|
+
from collections.abc import Generator, Sequence
|
|
6
|
+
|
|
7
|
+
from cmem_plugin_base.dataintegration.context import ExecutionContext, ExecutionReport
|
|
8
|
+
from cmem_plugin_base.dataintegration.description import Icon, Plugin, PluginParameter
|
|
9
|
+
from cmem_plugin_base.dataintegration.entity import Entities, Entity, EntityPath, EntitySchema
|
|
10
|
+
from cmem_plugin_base.dataintegration.parameter.password import Password
|
|
11
|
+
from cmem_plugin_base.dataintegration.plugins import WorkflowPlugin
|
|
12
|
+
from cmem_plugin_base.dataintegration.ports import (
|
|
13
|
+
FixedNumberOfInputs,
|
|
14
|
+
FixedSchemaPort,
|
|
15
|
+
)
|
|
16
|
+
from cmem_plugin_base.dataintegration.types import IntParameterType
|
|
17
|
+
from langchain_core.documents import Document
|
|
18
|
+
from langchain_postgres import PGVector
|
|
19
|
+
|
|
20
|
+
from cmem_plugin_pgvector.commons import DatabaseParams
|
|
21
|
+
|
|
22
|
+
|
|
23
|
+
@Plugin(
|
|
24
|
+
label="Search Vector Embeddings",
|
|
25
|
+
description="Search for top-k metadata stored in Postgres Vector Store (PGVector).",
|
|
26
|
+
documentation="""
|
|
27
|
+
This workflow task search for the top-k metadata stored into Postgres Vector Store.
|
|
28
|
+
|
|
29
|
+
The incoming embedding entities are used to retrieve the nearest top-k
|
|
30
|
+
vectors in the collection stored in the Postgres Vector Store.
|
|
31
|
+
It is possible to specify which paths are going to be used for searching as well as which Postgres
|
|
32
|
+
Vector Store and collection name.
|
|
33
|
+
|
|
34
|
+
The task uses the embeddings from the path configured with the Embedding Query Path
|
|
35
|
+
parameter (`embedding_query_path`, default value: `_embedding`) to search over the collection.
|
|
36
|
+
The results are provided in the output path configured with the Search Result Path parameter
|
|
37
|
+
(`search_result_path`, default value: `_search_result`).
|
|
38
|
+
|
|
39
|
+
The results in this output are structured like this:
|
|
40
|
+
|
|
41
|
+
``` json
|
|
42
|
+
[
|
|
43
|
+
{
|
|
44
|
+
"id": "..."
|
|
45
|
+
"metadata": "..."
|
|
46
|
+
"content": "..."
|
|
47
|
+
"score": "..."
|
|
48
|
+
},
|
|
49
|
+
...
|
|
50
|
+
]
|
|
51
|
+
```
|
|
52
|
+
""",
|
|
53
|
+
icon=Icon(package=__package__, file_name="postgresql.svg"),
|
|
54
|
+
plugin_id="cmem_plugin_pgvector-Search",
|
|
55
|
+
parameters=[
|
|
56
|
+
*DatabaseParams().as_list(),
|
|
57
|
+
PluginParameter(
|
|
58
|
+
name="embedding_query_path",
|
|
59
|
+
label="Embedding Query Path",
|
|
60
|
+
description="""The path containing the embedding to be used for searching.""",
|
|
61
|
+
default_value="_embedding",
|
|
62
|
+
),
|
|
63
|
+
PluginParameter(
|
|
64
|
+
name="search_result_path",
|
|
65
|
+
label="Search Result Path",
|
|
66
|
+
description="""The path containing the search result in the output entities.""",
|
|
67
|
+
default_value="_search_result",
|
|
68
|
+
),
|
|
69
|
+
PluginParameter(
|
|
70
|
+
name="top_k",
|
|
71
|
+
label="Top-k",
|
|
72
|
+
description="The number of entries to be returned in the search result.",
|
|
73
|
+
default_value=10,
|
|
74
|
+
param_type=IntParameterType(),
|
|
75
|
+
),
|
|
76
|
+
],
|
|
77
|
+
)
|
|
78
|
+
class PGVectorSearchPlugin(WorkflowPlugin):
|
|
79
|
+
"""PGVectorSearchPlugin: Enable the search of vectors in a Postgres Vector Store."""
|
|
80
|
+
|
|
81
|
+
connection_string: str
|
|
82
|
+
user: str
|
|
83
|
+
password: str
|
|
84
|
+
host: str
|
|
85
|
+
port: int
|
|
86
|
+
database: str
|
|
87
|
+
collection_name: str
|
|
88
|
+
embedding_query_path: str
|
|
89
|
+
inputs: Sequence[Entities]
|
|
90
|
+
db: PGVector
|
|
91
|
+
execution_context: ExecutionContext
|
|
92
|
+
report: ExecutionReport
|
|
93
|
+
search_result_path: str
|
|
94
|
+
top_k: int
|
|
95
|
+
|
|
96
|
+
def __init__( # noqa: PLR0913
|
|
97
|
+
self,
|
|
98
|
+
host: str = DatabaseParams.host.default_value,
|
|
99
|
+
port: int = DatabaseParams.port.default_value,
|
|
100
|
+
user: str = DatabaseParams.user.default_value,
|
|
101
|
+
password: Password | str = "",
|
|
102
|
+
database: str = DatabaseParams.database.default_value,
|
|
103
|
+
collection_name: str = DatabaseParams.collection_name.default_value,
|
|
104
|
+
search_result_path: str = "_search_result",
|
|
105
|
+
embedding_query_path: str = "_embedding",
|
|
106
|
+
top_k: int = 10,
|
|
107
|
+
) -> None:
|
|
108
|
+
self.collection_name = collection_name
|
|
109
|
+
self.user = user
|
|
110
|
+
self.host = host
|
|
111
|
+
self.port = port
|
|
112
|
+
self.database = database
|
|
113
|
+
self.embedding_query_path = embedding_query_path
|
|
114
|
+
self.search_result_path = search_result_path
|
|
115
|
+
self.top_k = top_k
|
|
116
|
+
|
|
117
|
+
str_password = self.password = password if isinstance(password, str) else password.decrypt()
|
|
118
|
+
self.connection_string = (
|
|
119
|
+
f"postgresql+psycopg://{user}:{str_password}@{host}:{port}/{database}"
|
|
120
|
+
)
|
|
121
|
+
|
|
122
|
+
self.report = ExecutionReport()
|
|
123
|
+
self.report.operation = "search"
|
|
124
|
+
self.report.operation_desc = "searches"
|
|
125
|
+
|
|
126
|
+
self.db = PGVector(
|
|
127
|
+
collection_name=self.collection_name,
|
|
128
|
+
connection=self.connection_string,
|
|
129
|
+
embeddings=None, # type: ignore # noqa: PGH003
|
|
130
|
+
use_jsonb=True,
|
|
131
|
+
pre_delete_collection=False,
|
|
132
|
+
)
|
|
133
|
+
self._setup_ports()
|
|
134
|
+
|
|
135
|
+
def _setup_ports(self) -> None:
|
|
136
|
+
"""Configure input and output ports depending on the configuration"""
|
|
137
|
+
input_paths = [EntityPath(path=self.embedding_query_path)]
|
|
138
|
+
input_schema = EntitySchema(type_uri="entity", paths=input_paths)
|
|
139
|
+
self.input_ports = FixedNumberOfInputs(ports=[FixedSchemaPort(schema=input_schema)])
|
|
140
|
+
|
|
141
|
+
output_schema = self._generate_output_schema(input_schema=input_schema)
|
|
142
|
+
self.output_port = FixedSchemaPort(schema=output_schema)
|
|
143
|
+
|
|
144
|
+
def _generate_output_schema(self, input_schema: EntitySchema) -> EntitySchema:
|
|
145
|
+
"""Get output schema"""
|
|
146
|
+
paths = list(input_schema.paths).copy()
|
|
147
|
+
paths.append(EntityPath(self.search_result_path))
|
|
148
|
+
return EntitySchema(type_uri=input_schema.type_uri, paths=paths)
|
|
149
|
+
|
|
150
|
+
@staticmethod
|
|
151
|
+
def _entity_to_dict(paths: Sequence[EntityPath], entity: Entity) -> dict[str, list[str]]:
|
|
152
|
+
"""Create a dict representation of an entity"""
|
|
153
|
+
entity_dic = {}
|
|
154
|
+
for key, value in zip(paths, entity.values, strict=False):
|
|
155
|
+
entity_dic[key.path] = list(value)
|
|
156
|
+
return entity_dic
|
|
157
|
+
|
|
158
|
+
def _update_report(self, count: int) -> None:
|
|
159
|
+
"""Update the report"""
|
|
160
|
+
self.report.entity_count = count
|
|
161
|
+
self.execution_context.report.update(self.report)
|
|
162
|
+
|
|
163
|
+
def _cancel_workflow(self) -> bool:
|
|
164
|
+
"""Cancel workflow"""
|
|
165
|
+
try:
|
|
166
|
+
if self.execution_context.workflow.status() == "Canceling":
|
|
167
|
+
self.log.info("End task (Cancelled Workflow).")
|
|
168
|
+
return True
|
|
169
|
+
except AttributeError:
|
|
170
|
+
pass
|
|
171
|
+
return False
|
|
172
|
+
|
|
173
|
+
def _docs_to_json(self, docs: list[tuple[Document, float]]) -> list:
|
|
174
|
+
"""Convert a list of Documents to a list of metadata"""
|
|
175
|
+
doc_list: list = []
|
|
176
|
+
for doc_tuple in docs:
|
|
177
|
+
json_entity = {}
|
|
178
|
+
json_entity["id"] = doc_tuple[0].id
|
|
179
|
+
json_entity["metadata"] = str(doc_tuple[0].metadata)
|
|
180
|
+
json_entity["content"] = doc_tuple[0].page_content
|
|
181
|
+
json_entity["score"] = str(doc_tuple[1])
|
|
182
|
+
doc_list.append(json_entity)
|
|
183
|
+
return doc_list
|
|
184
|
+
|
|
185
|
+
def _process_entities(self, entities: Entities) -> Generator[Entity]:
|
|
186
|
+
"""Process incoming entities' embeddings in vector search"""
|
|
187
|
+
schema_paths: list[EntityPath] = list(entities.schema.paths)
|
|
188
|
+
n_processed_entries: int = 0
|
|
189
|
+
self._update_report(n_processed_entries)
|
|
190
|
+
for entity in entities.entities:
|
|
191
|
+
if self._cancel_workflow():
|
|
192
|
+
return
|
|
193
|
+
entity_dict = self._entity_to_dict(schema_paths, entity)
|
|
194
|
+
embedding: list[float] = literal_eval(entity_dict[self.embedding_query_path][0])
|
|
195
|
+
result: list[tuple[Document, float]] = self.db.similarity_search_with_score_by_vector(
|
|
196
|
+
embedding=embedding, k=self.top_k
|
|
197
|
+
)
|
|
198
|
+
json_result = self._docs_to_json(result)
|
|
199
|
+
entity_dict[self.search_result_path] = [json.dumps(json_result)]
|
|
200
|
+
values = list(entity_dict.values())
|
|
201
|
+
n_processed_entries += 1
|
|
202
|
+
self._update_report(n_processed_entries)
|
|
203
|
+
yield Entity(uri=entity.uri, values=values)
|
|
204
|
+
|
|
205
|
+
def execute(
|
|
206
|
+
self,
|
|
207
|
+
inputs: Sequence[Entities],
|
|
208
|
+
context: ExecutionContext,
|
|
209
|
+
) -> Entities:
|
|
210
|
+
"""Run the workflow operator."""
|
|
211
|
+
self.log.info("Start searching collection.")
|
|
212
|
+
self.inputs = inputs
|
|
213
|
+
self.execution_context = context
|
|
214
|
+
try:
|
|
215
|
+
first_input: Entities = self.inputs[0]
|
|
216
|
+
except IndexError as error:
|
|
217
|
+
raise ValueError("Input port not connected.") from error
|
|
218
|
+
entities = self._process_entities(first_input)
|
|
219
|
+
schema = self._generate_output_schema(first_input.schema)
|
|
220
|
+
self.log.info("End")
|
|
221
|
+
return Entities(entities=entities, schema=schema)
|
|
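Note on the `search_task.py` hunk above: `_docs_to_json` stores `metadata` as `str(dict)` and `score` as `str(float)` before the list is passed to `json.dumps`. A consumer of the `_search_result` path therefore has to decode in two steps. The following is a minimal sketch under that reading; `parse_search_result` and the sample payload are hypothetical and not part of the package.

```python
import json
from ast import literal_eval

# Made-up payload shaped like the output of `_docs_to_json` followed by `json.dumps`.
raw = json.dumps(
    [
        {
            "id": "doc-1",
            "metadata": str({"source": "a.txt"}),
            "content": "hello",
            "score": str(0.25),
        }
    ]
)


def parse_search_result(value: str) -> list[dict]:
    """Decode one `_search_result` value into typed Python objects."""
    hits = []
    for hit in json.loads(value):
        hits.append(
            {
                "id": hit["id"],
                # metadata was written with str(dict): a Python literal, not JSON
                "metadata": literal_eval(hit["metadata"]),
                "content": hit["content"],
                # score was written with str(float)
                "score": float(hit["score"]),
            }
        )
    return hits


hits = parse_search_result(raw)
```

`literal_eval` is used rather than `json.loads` for the metadata because the string carries Python quoting (single quotes, `None`, `True`), which a JSON parser would reject.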
@@ -1,7 +1,4 @@
-"""
-
-Remove this and other example files after bootstrapping your project.
-"""
+"""Store Task"""

 from ast import literal_eval
 from collections.abc import Sequence
@@ -10,15 +7,16 @@ from typing import Any
|
|
|
10
7
|
from cmem_plugin_base.dataintegration.context import ExecutionContext, ExecutionReport
|
|
11
8
|
from cmem_plugin_base.dataintegration.description import Icon, Plugin, PluginParameter
|
|
12
9
|
from cmem_plugin_base.dataintegration.entity import Entities, Entity, EntityPath
|
|
13
|
-
from cmem_plugin_base.dataintegration.parameter.password import Password
|
|
10
|
+
from cmem_plugin_base.dataintegration.parameter.password import Password
|
|
14
11
|
from cmem_plugin_base.dataintegration.plugins import WorkflowPlugin
|
|
15
12
|
from cmem_plugin_base.dataintegration.ports import (
|
|
16
13
|
FixedNumberOfInputs,
|
|
17
14
|
UnknownSchemaPort,
|
|
18
15
|
)
|
|
19
|
-
from cmem_plugin_base.dataintegration.types import IntParameterType
|
|
20
16
|
from langchain_postgres import PGVector
|
|
21
17
|
|
|
18
|
+
from cmem_plugin_pgvector.commons import DatabaseParams
|
|
19
|
+
|
|
22
20
|
|
|
23
21
|
class DataContainer:
|
|
24
22
|
"""Encapsulate the data to be added to the database."""
|
|
@@ -26,19 +24,19 @@ class DataContainer:
|
|
|
26
24
|
def __init__(self):
|
|
27
25
|
self.texts = []
|
|
28
26
|
self.embeddings = []
|
|
29
|
-
self.
|
|
27
|
+
self.metadata = []
|
|
30
28
|
|
|
31
29
|
def add(self, text: str, embedding: list[float], metadata: dict) -> None:
|
|
32
30
|
"""Add objects to the respective lists."""
|
|
33
31
|
self.texts.append(text)
|
|
34
32
|
self.embeddings.append(embedding)
|
|
35
|
-
self.
|
|
33
|
+
self.metadata.append(metadata)
|
|
36
34
|
|
|
37
35
|
def clear(self) -> None:
|
|
38
36
|
"""Clear all three lists."""
|
|
39
37
|
self.texts.clear()
|
|
40
38
|
self.embeddings.clear()
|
|
41
|
-
self.
|
|
39
|
+
self.metadata.clear()
|
|
42
40
|
|
|
43
41
|
def size(self) -> int:
|
|
44
42
|
"""Return the size of the lists (assuming all lists have the same length)."""
|
|
@@ -46,8 +44,8 @@ class DataContainer:
|
|
|
46
44
|
|
|
47
45
|
|
|
48
46
|
@Plugin(
|
|
49
|
-
label="
|
|
50
|
-
description="Store embeddings into Postgres Vector Store.",
|
|
47
|
+
label="Store Vector Embeddings",
|
|
48
|
+
description="Store embeddings into Postgres Vector Store (PGVector).",
|
|
51
49
|
documentation="""
|
|
52
50
|
This plugin workflow store embeddings into Postgres Vector Store.
|
|
53
51
|
|
|
@@ -57,44 +55,9 @@ It is possible to specify either the name of the attributes containing the vecto
|
|
|
57
55
|
metadata.
|
|
58
56
|
""",
|
|
59
57
|
icon=Icon(package=__package__, file_name="postgresql.svg"),
|
|
58
|
+
plugin_id="cmem_plugin_pgvector-Store",
|
|
60
59
|
parameters=[
|
|
61
|
-
|
|
62
|
-
name="host",
|
|
63
|
-
label="Database Host",
|
|
64
|
-
description="The hostname of the postgres database service.",
|
|
65
|
-
default_value="pgvector",
|
|
66
|
-
),
|
|
67
|
-
PluginParameter(
|
|
68
|
-
name="port",
|
|
69
|
-
label="Database Port",
|
|
70
|
-
param_type=IntParameterType(),
|
|
71
|
-
description="The port number of the postgres database service.",
|
|
72
|
-
default_value=5432,
|
|
73
|
-
),
|
|
74
|
-
PluginParameter(
|
|
75
|
-
name="user",
|
|
76
|
-
label="Database User",
|
|
77
|
-
description="The account name used to login to the postgres database service.",
|
|
78
|
-
default_value="pgvector",
|
|
79
|
-
),
|
|
80
|
-
PluginParameter(
|
|
81
|
-
name="password",
|
|
82
|
-
label="Database Password",
|
|
83
|
-
param_type=PasswordParameterType(),
|
|
84
|
-
description="The password of the database account.",
|
|
85
|
-
),
|
|
86
|
-
PluginParameter(
|
|
87
|
-
name="database",
|
|
88
|
-
label="Database Name",
|
|
89
|
-
description="The database name.",
|
|
90
|
-
default_value="pgvector",
|
|
91
|
-
),
|
|
92
|
-
PluginParameter(
|
|
93
|
-
name="collection_name",
|
|
94
|
-
label="Collection Name",
|
|
95
|
-
description="The name of the collection, where the embeddings are going to be stored.",
|
|
96
|
-
default_value="my_collection",
|
|
97
|
-
),
|
|
60
|
+
*DatabaseParams().as_list(),
|
|
98
61
|
PluginParameter(
|
|
99
62
|
name="pre_delete_collection",
|
|
100
63
|
label="Pre Delete Collection",
|
|
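Note on `*DatabaseParams().as_list()` in the hunk above: `as_list` iterates over `dir(self)`, and `dir()` returns attribute names in alphabetical order, so the shared parameters come back sorted by name rather than in declaration order. The sketch below reproduces the comprehension with a hypothetical stand-in class `Param` in place of `PluginParameter`, so it runs without `cmem_plugin_base`.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for PluginParameter (illustration only)."""

    name: str


class Params:
    # Declared in the same order as DatabaseParams in the diff.
    host = Param("host")
    port = Param("port")
    user = Param("user")
    password = Param("password")
    database = Param("database")
    collection_name = Param("collection_name")

    def as_list(self) -> list[Param]:
        """Same comprehension as DatabaseParams.as_list in the diff."""
        return [
            getattr(self, attr)
            for attr in dir(self)
            if not callable(getattr(self, attr)) and not attr.startswith("__")
        ]


order = [p.name for p in Params().as_list()]
```

Here `order` starts with `collection_name` and ends with `user`, which is presumably also the order in which the real parameters are handed to `@Plugin`.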
@@ -153,12 +116,12 @@ class PGVectorStorePlugin(WorkflowPlugin):
|
|
|
153
116
|
|
|
154
117
|
def __init__( # noqa: PLR0913
|
|
155
118
|
self,
|
|
156
|
-
host: str =
|
|
157
|
-
port: int =
|
|
158
|
-
user: str =
|
|
119
|
+
host: str = DatabaseParams.host.default_value,
|
|
120
|
+
port: int = DatabaseParams.port.default_value,
|
|
121
|
+
user: str = DatabaseParams.user.default_value,
|
|
159
122
|
password: Password | str = "",
|
|
160
|
-
database: str =
|
|
161
|
-
collection_name: str =
|
|
123
|
+
database: str = DatabaseParams.database.default_value,
|
|
124
|
+
collection_name: str = DatabaseParams.collection_name.default_value,
|
|
162
125
|
pre_delete_collection: bool = True,
|
|
163
126
|
source_path: str = "_embedding_source",
|
|
164
127
|
embedding_path: str = "_embedding",
|
|
@@ -247,14 +210,14 @@ class PGVectorStorePlugin(WorkflowPlugin):
|
|
|
247
210
|
else self._metadata(entity, schema_paths),
|
|
248
211
|
)
|
|
249
212
|
if container.size() == self.batch_processing_size:
|
|
250
|
-
self.db.add_embeddings(container.texts, container.embeddings, container.
|
|
213
|
+
self.db.add_embeddings(container.texts, container.embeddings, container.metadata)
|
|
251
214
|
n_processed_entries += container.size()
|
|
252
215
|
self._update_report(n_processed_entries)
|
|
253
216
|
container.clear()
|
|
254
217
|
if self._cancel_workflow():
|
|
255
218
|
return
|
|
256
219
|
if container.size() > 0:
|
|
257
|
-
self.db.add_embeddings(container.texts, container.embeddings, container.
|
|
220
|
+
self.db.add_embeddings(container.texts, container.embeddings, container.metadata)
|
|
258
221
|
n_processed_entries += container.size()
|
|
259
222
|
self._update_report(n_processed_entries)
|
|
260
223
|
|
|
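Note on the `store_task.py` hunk above: entities are collected in a `DataContainer`, flushed with `add_embeddings` whenever the container reaches `batch_processing_size`, and flushed once more at the end for the remainder. The sketch below shows that pattern in isolation. `FakeStore` and `store_in_batches` are hypothetical names, and `FakeStore` only records batch sizes instead of talking to Postgres.

```python
class FakeStore:
    """Hypothetical stand-in for PGVector; records the size of each batch."""

    def __init__(self) -> None:
        self.batches: list[int] = []

    def add_embeddings(self, texts, embeddings, metadatas) -> None:
        self.batches.append(len(texts))


def store_in_batches(rows, store, batch_size: int) -> int:
    """Flush full batches while iterating, then flush the remainder."""
    texts, embeddings, metadata = [], [], []
    processed = 0
    for text, embedding, meta in rows:
        texts.append(text)
        embeddings.append(embedding)
        metadata.append(meta)
        if len(texts) == batch_size:
            store.add_embeddings(texts, embeddings, metadata)
            processed += len(texts)
            texts, embeddings, metadata = [], [], []
    if texts:  # leftover entries that did not fill a whole batch
        store.add_embeddings(texts, embeddings, metadata)
        processed += len(texts)
    return processed


store = FakeStore()
rows = [(f"text {i}", [float(i)], {"n": i}) for i in range(5)]
total = store_in_batches(rows, store, batch_size=2)
```

Five rows with a batch size of two produce two full batches and one final batch of one entry.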
@@ -1,8 +1,8 @@
 [tool.poetry]
 name = "cmem-plugin-pgvector"
-version = "0.
+version = "0.6.0"
 license = "Apache-2.0"
-description = "Store embedding vectors
+description = "Store and search for embedding vectors in a Postgres vector store."
 authors = ["eccenca GmbH <cmempy-developer@eccenca.com>"]
 maintainers = ["Edgard Marx <edgard.marx@eccenca.com>"]
 classifiers = [
@@ -1,51 +0,0 @@
|
|
|
1
|
-
# cmem-plugin-pgvector
|
|
2
|
-
|
|
3
|
-
[![poetry][poetry-shield]][poetry-link] [![ruff][ruff-shield]][ruff-link] [![mypy][mypy-shield]][mypy-link] [![copier][copier-shield]][copier]
|
|
4
|
-
|
|
5
|
-
Store embedding vectors into a Postgres vector store.
|
|
6
|
-
|
|
7
|
-
This plugin consumes the costumable entity's paths ```embedding```, ```text``` and ```metadata``` as following:
|
|
8
|
-
|
|
9
|
-
- The text path contain the text used to generate the embeddings, default ```text```.
|
|
10
|
-
- The embedding path contain the embedding representation of the text, default ```embedding```.
|
|
11
|
-
- The metadata path contain the information that will be associated with the embedding, default all paths.
|
|
12
|
-
|
|
13
|
-
[![eccenca Corporate Memory][cmem-shield]][cmem-link]
|
|
14
|
-
|
|
15
|
-
## Use
|
|
16
|
-
|
|
17
|
-
Interact with Large Language Models.
|
|
18
|
-
|
|
19
|
-
This is a plugin for [eccenca](https://eccenca.com) [Corporate Memory](https://documentation.eccenca.com).
|
|
20
|
-
|
|
21
|
-
You can install it with the [cmemc](https://eccenca.com/go/cmemc) command line
|
|
22
|
-
clients like this:
|
|
23
|
-
|
|
24
|
-
```
|
|
25
|
-
cmemc admin workspace python install cmem-plugin-llm
|
|
26
|
-
```
|
|
27
|
-
|
|
28
|
-
### Parameters
|
|
29
|
-
|
|
30
|
-
- ```collection_name```: The name of the collection where the embeddings are going to be stored, default ```my_collection```
|
|
31
|
-
- ```user```:the database user
|
|
32
|
-
- ```password```: the database password
|
|
33
|
-
- ```host```: the databse host, i.e. locahost
|
|
34
|
-
- ```port```: the database port, default ```5432```
|
|
35
|
-
- ```database```: the name of the database
|
|
36
|
-
- ```pre_delete_collection```: boolean parameter indicating if the collection should be cleanse before insertion, default ```false```
|
|
37
|
-
- ```embedding_path```: output path that will contain the generated embedding, default ```embedding```
|
|
38
|
-
- ```text_path```: path containing the text used for genereting the embedding, default ```text```
|
|
39
|
-
- ```metadata_paths```: paths from the entity that will be stored along with the embedding, default all paths
|
|
40
|
-
|
|
41
|
-
[cmem-link]: https://documentation.eccenca.com
|
|
42
|
-
[cmem-shield]: https://img.shields.io/endpoint?url=https://dev.documentation.eccenca.com/badge.json
|
|
43
|
-
[poetry-link]: https://python-poetry.org/
|
|
44
|
-
[poetry-shield]: https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json
|
|
45
|
-
[ruff-link]: https://docs.astral.sh/ruff/
|
|
46
|
-
[ruff-shield]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json&label=Code%20Style
|
|
47
|
-
[mypy-link]: https://mypy-lang.org/
|
|
48
|
-
[mypy-shield]: https://www.mypy-lang.org/static/mypy_badge.svg
|
|
49
|
-
[copier]: https://copier.readthedocs.io/
|
|
50
|
-
[copier-shield]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/copier-org/copier/master/img/badge/badge-grayscale-inverted-border-purple.json
|
|
51
|
-
|
|
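{cmem_plugin_pgvector-0.5.0 → cmem_plugin_pgvector-0.6.0}/LICENSE: file without changes
{cmem_plugin_pgvector-0.5.0 → cmem_plugin_pgvector-0.6.0}/cmem_plugin_pgvector/__init__.py: file without changes
{cmem_plugin_pgvector-0.5.0 → cmem_plugin_pgvector-0.6.0}/cmem_plugin_pgvector/postgresql.svg: renamed, file without changes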