vectorsearch 0.1.0 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 4c9913804bc8aaadc08a60c0a250e2923f2a4ceb28633fdd99e4bf86544203b4
- data.tar.gz: 8db3a77121d948f6ed709618da5bd4411a87a01aba26feb7942e74e0dd18f207
+ metadata.gz: 160b98d1553c63fae2e50c07dae83e80376eade573874e0fcb28a0d1f7f476ea
+ data.tar.gz: 21fbcbb750cd878ceedec2646ef8db116a839b422f751ce3959aa6da8da78ff1
  SHA512:
- metadata.gz: d76d13ee23c7219483eac27a37ae61cad335bbd2d4169a76723d19a0e334ab6ac01037923cb2c07a64c853280d2449e54c0107fba7f36a231a624efaf1b68b46
- data.tar.gz: 43612526795e54138ec0c891e5bbdef2ef712ac287a4776680d26d94e8cabcb9d5976774b5144cbeb3ae95b367e26460230931041f011f1a69671e014060684f
+ metadata.gz: 40991aab084c3eb16d8029b598f70227946edc72411c74de0906289d7e23b16a83fbbd7aa9325c44b6ebc42ee0cac3146d55cf9bf5b1eeb23e862f0f66da66c2
+ data.tar.gz: 1c23318876143377b826c8e619978b83e61da68b10dbed34c203ae0f35a10efeb6fa48843fada3782b3fd7c1785802b0cda1332281ebcc39230bb7ed10c0905d
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     vectorsearch (0.1.0)
+     vectorsearch (0.1.2)
        cohere-ruby (~> 0.9.1)
        milvus (~> 0.9.0)
        pinecone (~> 0.1.6)
data/README.md CHANGED
@@ -1,6 +1,7 @@
  # Vectorsearch
+ ![Tests status](https://github.com/andreibondarev/vectorsearch/actions/workflows/ci.yml/badge.svg) [![Gem Version](https://badge.fury.io/rb/vectorsearch.svg)](https://badge.fury.io/rb/vectorsearch)

- Vectorsearch library is an abstraction layer on top of many popular vector search databases. It is a modern ORM that allows developers to easily chunk, generate embeddings, store, search, query and retrieve data from vector search databases. Vectorsearch offers a straight-forward DSL and abstract developers away from overly complex machine learning/data science-specific configurations.
+ Vectorsearch library is an abstraction layer on top of many popular vector search databases. It is a modern ORM that allows developers to easily chunk data, generate embeddings, store, search, query and retrieve data from vector search databases. Vectorsearch offers a straight-forward DSL and abstracts away overly complicated machine learning/data science-specific configurations and concepts

  ## Installation

@@ -20,12 +21,12 @@ require "vectorsearch"

  List of currently supported vector search databases and features:

- | Database | Querying | Storage |
- | -------------------------------------- |
- | Weaviate | :white_check_mark: | WIP |
- | Qdrant | :white_check_mark: | WIP |
- | Milvus | :white_check_mark: | WIP |
- | Pinecone | :white_check_mark: | WIP |
+ | Database | Querying | Storage | Schema Management | Backups | Rails Integration | ??? |
+ | -------- |:------------------:| -------:| -----------------:| -------:| -----------------:| ---:|
+ | Weaviate | :white_check_mark: | WIP | WIP | WIP | | |
+ | Qdrant | :white_check_mark: | WIP | WIP | WIP | | |
+ | Milvus | :white_check_mark: | WIP | WIP | WIP | | |
+ | Pinecone | :white_check_mark: | WIP | WIP | WIP | | |

  ### Create an instance

@@ -40,16 +41,24 @@ client = Vectorsearch::Weaviate.new(
    llm_api_key: ENV["OPENAI_API_KEY"]
  )

- # You instantiate any other supported vector search database:
+ # You can instantiate any other supported vector search database:
  client = Vectorsearch::Milvus.new(...)
  client = Vectorsearch::Qdrant.new(...)
  client = Vectorsearch::Pinecone.new(...)
  ```

+ ```ruby
+ # Creating the default schema
+ client.create_default_schema
+ ```
+
  ```ruby
  # Store your documents in your vector search database
- client.add_documents(
-   documents: []
+ client.add_texts(
+   texts: [
+     "Begin by preheating your oven to 375°F (190°C). Prepare four boneless, skinless chicken breasts by cutting a pocket into the side of each breast, being careful not to cut all the way through. Season the chicken with salt and pepper to taste. In a large skillet, melt 2 tablespoons of unsalted butter over medium heat. Add 1 small diced onion and 2 minced garlic cloves, and cook until softened, about 3-4 minutes. Add 8 ounces of fresh spinach and cook until wilted, about 3 minutes. Remove the skillet from heat and let the mixture cool slightly.",
+     "In a bowl, combine the spinach mixture with 4 ounces of softened cream cheese, 1/4 cup of grated Parmesan cheese, 1/4 cup of shredded mozzarella cheese, and 1/4 teaspoon of red pepper flakes. Mix until well combined. Stuff each chicken breast pocket with an equal amount of the spinach mixture. Seal the pocket with a toothpick if necessary. In the same skillet, heat 1 tablespoon of olive oil over medium-high heat. Add the stuffed chicken breasts and sear on each side for 3-4 minutes, or until golden brown."
+   ]
  )
  ```

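The updated README describes a chunk → embed → store → query pipeline, but this version of the gem does not ship a chunking helper. Purely for illustration (the `chunk_text` method below is hypothetical, not part of Vectorsearch), a naive fixed-size splitter could look like:

```ruby
# Hypothetical fixed-size chunker, for illustration only --
# vectorsearch 0.1.2 does not include a chunking helper.
def chunk_text(text, size:)
  text.chars.each_slice(size).map(&:join)
end

chunk_text("abcdefghij", size: 4)
# => ["abcd", "efgh", "ij"]
```

Real chunkers typically split on sentence or token boundaries rather than raw characters, but the shape of the pipeline is the same.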
data/lib/vectorsearch/base.rb CHANGED
@@ -7,8 +7,16 @@ module Vectorsearch
    class Base
      attr_reader :client, :index_name, :llm, :llm_api_key

+     DEFAULT_METRIC = "cosine".freeze
+     DEFAULT_COHERE_DIMENSION = 1024
+     DEFAULT_OPENAI_DIMENSION = 1536
+
+     # Currently supported LLMs
+     # TODO: Add support for HuggingFace
      LLMS = %i[openai cohere].freeze

+     # @param llm [Symbol] The LLM to use
+     # @param llm_api_key [String] The API key for the LLM
      def initialize(llm:, llm_api_key:)
        validate_llm!(llm: llm)

@@ -16,27 +24,96 @@ module Vectorsearch
        @llm_api_key = llm_api_key
      end

+     def create_default_schema
+       raise NotImplementedError
+     end
+
+     # TODO
+     def add_texts(texts:)
+       raise NotImplementedError
+     end
+
+     # NotImplementedError will be raised if the subclass does not implement the `ask()` method
+     def ask(question:)
+       raise NotImplementedError
+     end
+
+     # Generate an embedding for a given text
+     # Currently supports OpenAI and Cohere
+     # The LLM-related method will most likely need to be abstracted out into a separate class
+     # @param text [String] The text to generate an embedding for
+     # @return [String] The embedding
      def generate_embedding(text:)
        case llm
        when :openai
-         response = OpenAI::Client.new(access_token: llm_api_key)
-           .embeddings(
-             parameters: {
-               model: "text-embedding-ada-002",
-               input: text
-             }
-           )
+         response = openai_client.embeddings(
+           parameters: {
+             model: "text-embedding-ada-002",
+             input: text
+           }
+         )
          response.dig("data").first.dig("embedding")
        when :cohere
-         response = Cohere::Client.new(api_key: llm_api_key)
-           .embed(
-             texts: [text],
-             model: "small"
-           )
+         response = cohere_client.embed(
+           texts: [text],
+           model: "small"
+         )
          response.dig("embeddings").first
        end
      end

+     # Generate a completion for a given prompt
+     # Currently supports OpenAI and Cohere
+     # The LLM-related method will most likely need to be abstracted out into a separate class
+     # @param prompt [String] The prompt to generate a completion for
+     # @return [String] The completion
+     def generate_completion(prompt:)
+       case llm
+       when :openai
+         response = openai_client.completions(
+           parameters: {
+             model: "text-davinci-003",
+             temperature: 0.0,
+             prompt: prompt
+           }
+         )
+         response.dig("choices").first.dig("text")
+       when :cohere
+         response = cohere_client.generate(
+           prompt: prompt,
+           temperature: 0.0
+         )
+         response.dig("generations").first.dig("text")
+       end
+     end
+
+     def generate_prompt(question:, context:)
+       "Context:\n" +
+         "#{context}\n" +
+         "---\n" +
+         "Question: #{question}\n" +
+         "---\n" +
+         "Answer:"
+     end
+
+     private
+
+     def default_dimension
+       if llm == :openai
+         DEFAULT_OPENAI_DIMENSION
+       elsif llm == :cohere
+         DEFAULT_COHERE_DIMENSION
+       end
+     end
+
+     def openai_client
+       @openai_client ||= OpenAI::Client.new(access_token: llm_api_key)
+     end
+
+     def cohere_client
+       @cohere_client ||= Cohere::Client.new(api_key: llm_api_key)
+     end
+
      def validate_llm!(llm:)
        raise ArgumentError, "LLM must be one of #{LLMS}" unless LLMS.include?(llm)
      end
data/lib/vectorsearch/milvus.rb CHANGED
@@ -19,6 +19,64 @@ module Vectorsearch
        super(llm: llm, llm_api_key: llm_api_key)
      end

+     def add_texts(
+       texts:
+     )
+       client.entities.insert(
+         collection_name: index_name,
+         num_rows: texts.count,
+         fields_data: [
+           {
+             field_name: "content",
+             type: ::Milvus::DATA_TYPES["varchar"],
+             field: texts
+           }, {
+             field_name: "vectors",
+             type: ::Milvus::DATA_TYPES["binary_vector"],
+             field: texts.map { |text| generate_embedding(text: text) }
+           }
+         ]
+       )
+     end
+
+     # Create default schema
+     # @return [Hash] The response from the server
+     def create_default_schema
+       client.collections.create(
+         auto_id: true,
+         collection_name: index_name,
+         description: "Default schema created by Vectorsearch",
+         fields: [
+           {
+             name: "id",
+             is_primary_key: true,
+             autoID: true,
+             data_type: ::Milvus::DATA_TYPES["int64"]
+           }, {
+             name: "content",
+             is_primary_key: false,
+             data_type: ::Milvus::DATA_TYPES["varchar"],
+             type_params: [
+               {
+                 key: "max_length",
+                 value: "32768" # Largest allowed value
+               }
+             ]
+           }, {
+             name: "vectors",
+             data_type: ::Milvus::DATA_TYPES["binary_vector"],
+             is_primary_key: false,
+             type_params: [
+               {
+                 key: "dim",
+                 value: default_dimension.to_s
+               }
+             ]
+           }
+         ]
+       )
+     end
+
      def similarity_search(
        query:,
        k: 4
@@ -41,7 +99,7 @@ module Vectorsearch
          vectors: [ embedding ],
          dsl_type: 1,
          params: "{\"nprobe\": 10}",
-         anns_field: "book_intro", # Should it get all abstracted away to "content" field?
+         anns_field: "content",
          metric_type: "L2"
        )
      end
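The shape of the `fields_data` payload that `Milvus#add_texts` sends can be checked without a running Milvus server. A dependency-free sketch, with a lambda standing in for `generate_embedding` (the real code also tags each entry with a `::Milvus::DATA_TYPES` type, omitted here):

```ruby
# Sketch of the column-oriented fields_data that Milvus#add_texts builds:
# one entry holding all text contents, one holding all embedding vectors.
def fields_data_for(texts, embedder:)
  [
    { field_name: "content", field: texts },
    { field_name: "vectors", field: texts.map { |t| embedder.call(t) } }
  ]
end

fake_embedder = ->(text) { [text.length.to_f, 0.0] } # fake 2-dim embedding
data = fields_data_for(["hello", "world!"], embedder: fake_embedder)
```

Milvus expects column-wise data (all contents together, all vectors together), which is why the payload is two parallel arrays rather than one hash per row.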
data/lib/vectorsearch/pinecone.rb CHANGED
@@ -22,6 +22,36 @@ module Vectorsearch
        super(llm: llm, llm_api_key: llm_api_key)
      end

+     # Add a list of texts to the index
+     # @param texts [Array] The list of texts to add
+     # @return [Hash] The response from the server
+     def add_texts(
+       texts:
+     )
+       vectors = texts.map do |text|
+         {
+           # TODO: Allows passing in your own IDs
+           id: SecureRandom.uuid,
+           metadata: { content: text },
+           values: generate_embedding(text: text)
+         }
+       end
+
+       index = client.index(index_name)
+
+       index.upsert(vectors: vectors)
+     end
+
+     # Create the index with the default schema
+     # @return [Hash] The response from the server
+     def create_default_schema
+       client.create_index(
+         metric: DEFAULT_METRIC,
+         name: index_name,
+         dimension: default_dimension
+       )
+     end
+
      def similarity_search(
        query:,
        k: 4
@@ -40,16 +70,26 @@ module Vectorsearch
        )
        index = client.index(index_name)

-       index.query(
+       response = index.query(
          vector: embedding,
          top_k: k,
          include_values: true,
          include_metadata: true
        )
+       response.dig("matches")
      end

      def ask(question:)
-       raise NotImplementedError
+       search_results = similarity_search(query: question)
+
+       context = search_results.dig("matches").map do |result|
+         result.dig("metadata").to_s
+       end
+       context = context.join("\n---\n")
+
+       prompt = generate_prompt(question: question, context: context)
+
+       generate_completion(prompt: prompt)
      end
    end
  end
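The retrieve-then-generate flow in `Pinecone#ask` reduces to plain data plumbing once the network calls are stubbed out. (Note that in this version `similarity_search` already returns `response.dig("matches")`, while `ask` digs `"matches"` again, which looks suspect; the sketch below digs only once.) Here `search` and `complete` are hypothetical lambdas standing in for the Pinecone query and the LLM completion:

```ruby
# Sketch of the RAG flow in Pinecone#ask: search, join match metadata into
# a context string separated by "---", build a prompt, ask the LLM.
def ask_sketch(question:, search:, complete:)
  matches = search.call(question)                                  # vector search
  context = matches.map { |m| m["metadata"].to_s }.join("\n---\n") # build context
  prompt  = "Context:\n#{context}\n---\nQuestion: #{question}\n---\nAnswer:"
  complete.call(prompt)                                            # LLM completion
end

search   = ->(_q) { [{ "metadata" => { "content" => "preheat to 375F" } }] }
complete = ->(prompt) { prompt.end_with?("Answer:") ? "375F" : "?" }
answer   = ask_sketch(question: "Oven temperature?", search: search, complete: complete)
```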
data/lib/vectorsearch/qdrant.rb CHANGED
@@ -20,6 +20,38 @@ module Vectorsearch
        super(llm: llm, llm_api_key: llm_api_key)
      end

+     # Add a list of texts to the index
+     # @param texts [Array] The list of texts to add
+     # @return [Hash] The response from the server
+     def add_texts(
+       texts:
+     )
+       batch = { ids: [], vectors: [], payloads: [] }
+
+       texts.each do |text|
+         batch[:ids].push(SecureRandom.uuid)
+         batch[:vectors].push(generate_embedding(text: text))
+         batch[:payloads].push({ content: text })
+       end
+
+       client.points.upsert(
+         collection_name: index_name,
+         batch: batch
+       )
+     end
+
+     # Create the index with the default schema
+     # @return [Hash] The response from the server
+     def create_default_schema
+       client.collections.create(
+         collection_name: index_name,
+         vectors: {
+           distance: DEFAULT_METRIC.capitalize,
+           size: default_dimension
+         }
+       )
+     end
+
      def similarity_search(
        query:,
        k: 4
@@ -45,7 +77,16 @@ module Vectorsearch
      end

      def ask(question:)
-       raise NotImplementedError
+       search_results = similarity_search(query: question)
+
+       context = search_results.dig("result").map do |result|
+         result.dig("payload").to_s
+       end
+       context = context.join("\n---\n")
+
+       prompt = generate_prompt(question: question, context: context)
+
+       generate_completion(prompt: prompt)
      end
    end
  end
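Unlike Pinecone's one-hash-per-vector format, Qdrant's batch upsert takes parallel arrays. A sketch of the batch that `Qdrant#add_texts` assembles, with deterministic ids and a lambda embedder replacing `SecureRandom.uuid` and the real `generate_embedding`:

```ruby
# Sketch of the parallel-array batch built by Qdrant#add_texts for
# client.points.upsert: ids, vectors and payloads at matching indices.
def build_batch(texts, embedder:)
  batch = { ids: [], vectors: [], payloads: [] }
  texts.each_with_index do |text, i|
    batch[:ids].push(i + 1)                     # deterministic stand-in for a UUID
    batch[:vectors].push(embedder.call(text))   # stand-in for generate_embedding
    batch[:payloads].push({ content: text })
  end
  batch
end

batch = build_batch(["a", "bb"], embedder: ->(t) { [t.length.to_f] })
```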
data/lib/vectorsearch/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module Vectorsearch
-   VERSION = "0.1.0"
+   VERSION = "0.1.2"
  end
data/lib/vectorsearch/weaviate.rb CHANGED
@@ -25,32 +25,56 @@ module Vectorsearch
      def add_texts(
        texts:
      )
+       objects = []
        texts.each do |text|
-         text['class'] = index_name
+         objects.push({
+           class_name: index_name,
+           properties: {
+             content: text
+           }
+         })
        end

-       client.batch_create(
-         objects: texts
+       client.objects.batch_create(
+         objects: objects
+       )
+     end
+
+     def create_default_schema
+       client.schema.create(
+         class_name: index_name,
+         vectorizer: "text2vec-#{llm.to_s}",
+         properties: [
+           {
+             dataType: ["text"],
+             name: "content"
+           }
+         ]
        )
      end

      # Return documents similar to the query
+     # @param query [String] The query to search for
+     # @param k [Integer|String] The number of results to return
+     # @return [Hash] The search results
      def similarity_search(
        query:,
        k: 4
      )
-       near_text = "{
-         concepts: [\"#{query}\"],
-       }"
+       near_text = "{ concepts: [\"#{query}\"] }"

        client.query.get(
          class_name: index_name,
          near_text: near_text,
          limit: k.to_s,
-         fields: "content recipe_id"
+         fields: "content _additional { id }"
        )
      end

+     # Return documents similar to the vector
+     # @param embedding [Array] The vector to search for
+     # @param k [Integer|String] The number of results to return
+     # @return [Hash] The search results
      def similarity_search_by_vector(
        embedding:,
        k: 4
@@ -65,17 +89,34 @@ module Vectorsearch
        )
      end

+     # Ask a question and return the answer
+     # @param question [String] The question to ask
+     # @return [Hash] The answer
      def ask(
        question:
      )
-       ask_object = "{ question: \"#{question}\" }"
+       # Weaviate currently supports the `ask:` parameter only for the OpenAI LLM (with `qna-openai` module enabled).
+       if llm == :openai
+         ask_object = "{ question: \"#{question}\" }"

-       client.query.get(
-         class_name: index_name,
-         ask: ask_object,
-         limit: "1",
-         fields: "_additional { answer { result } }"
-       )
+         client.query.get(
+           class_name: index_name,
+           ask: ask_object,
+           limit: "1",
+           fields: "_additional { answer { result } }"
+         )
+       elsif llm == :cohere
+         search_results = similarity_search(query: question)
+
+         context = search_results.map do |result|
+           result.dig("content").to_s
+         end
+         context = context.join("\n---\n")
+
+         prompt = generate_prompt(question: question, context: context)
+
+         generate_completion(prompt: prompt)
+       end
      end
    end
  end
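`Weaviate#similarity_search` now builds its GraphQL `nearText` argument on a single line. The construction is simple string interpolation, sketched below; note the query is interpolated verbatim, so a hardened version would need to escape embedded double quotes:

```ruby
# Sketch of the GraphQL nearText argument that Weaviate#similarity_search
# builds. The query string is interpolated as-is (no escaping).
def near_text_for(query)
  "{ concepts: [\"#{query}\"] }"
end

fragment = near_text_for("stuffed chicken")
```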
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: vectorsearch
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.1.2
  platform: ruby
  authors:
  - Andrei Bondarev
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2023-04-30 00:00:00.000000000 Z
+ date: 2023-05-01 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: pry-byebug