pgvector 0.2.1 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 215f975cafa2f782f3777ec65dd2fea587b420286f7f5ccf63b6b68376756dfd
-   data.tar.gz: e1979a1aa4fd4157cb04ae73c0c62b378046ca97154a25d4355e827be5abc53b
+   metadata.gz: 07a80636c13841d2fa97f8b740feb095c1c2f94fa3691b374d1472bba918bddf
+   data.tar.gz: 50c9e67781fbbe23fa3dbe2f87790025f18d3686e412b46f62d1cb0d2995ecb3
  SHA512:
-   metadata.gz: 1e54d7c41f8750b99262021e402f329252b428678ee99510dd52a1a113aeffdf35628a23b6ca9408ce55350a237cb2d224e66dfe7aafefd645612a29b74529ff
-   data.tar.gz: c8a014bb3708690cf3e463954aa7a846ce79075da039e23b3f94644e50e3c28d05376539699fd8ffe3c3b93d54eecb393fed472bddb60619ba897a5204bc511a
+   metadata.gz: f0aa5b733e1bc4022c2052be6b04aa0656bf306f588baf2907827674aeadc7dc52507895043049246634999dea37de629c19a164bd289fbd979eb628aaab0c33
+   data.tar.gz: f52e2e9c1c074c4f809fe78d6926bcb630694d79199e288ebb33d136a05191c4b2fff6038f8d7fc56d4291f949caa1b536d942aa7d0912be2daccf5a7ee23f83
data/CHANGELOG.md CHANGED
@@ -1,3 +1,13 @@
+ ## 0.3.0 (2024-06-25)
+
+ - Added support for `halfvec` and `sparsevec` types
+ - Added `taxicab`, `hamming`, and `jaccard` distances for Sequel
+ - Dropped support for Ruby < 3.1
+
+ ## 0.2.2 (2023-10-03)
+
+ - Added `nearest_neighbors` method to datasets with Sequel
+
  ## 0.2.1 (2023-06-04)
 
  - Added support for Sequel
data/LICENSE.txt CHANGED
@@ -1,6 +1,6 @@
  The MIT License (MIT)
 
- Copyright (c) 2022-2023 Andrew Kane
+ Copyright (c) 2022-2024 Andrew Kane
 
  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
data/README.md CHANGED
@@ -6,7 +6,7 @@ Supports [pg](https://github.com/ged/ruby-pg) and [Sequel](https://github.com/je
 
  For Rails, check out [Neighbor](https://github.com/ankane/neighbor)
 
- [![Build Status](https://github.com/pgvector/pgvector-ruby/workflows/build/badge.svg?branch=master)](https://github.com/pgvector/pgvector-ruby/actions)
+ [![Build Status](https://github.com/pgvector/pgvector-ruby/actions/workflows/build.yml/badge.svg)](https://github.com/pgvector/pgvector-ruby/actions)
 
  ## Installation
 
@@ -26,10 +26,17 @@ Or check out some examples:
  - [Embeddings](examples/openai_embeddings.rb) with OpenAI
  - [User-based recommendations](examples/disco_user_recs.rb) with Disco
  - [Item-based recommendations](examples/disco_item_recs.rb) with Disco
+ - [Bulk loading](examples/bulk_loading.rb) with `COPY`
 
  ## pg
 
- Register the vector type with your connection
+ Enable the extension
+
+ ```ruby
+ conn.exec("CREATE EXTENSION IF NOT EXISTS vector")
+ ```
+
+ Optionally enable type casting for results
 
  ```ruby
  registry = PG::BasicTypeRegistry.new.define_default_types
@@ -37,6 +44,12 @@ Pgvector::PG.register_vector(registry)
  conn.type_map_for_results = PG::BasicTypeMapForResults.new(conn, registry: registry)
  ```
 
+ Create a table
+
+ ```ruby
+ conn.exec("CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3))")
+ ```
+
  Insert a vector
 
  ```ruby
@@ -50,8 +63,24 @@ Get the nearest neighbors to a vector
  conn.exec_params("SELECT * FROM items ORDER BY embedding <-> $1 LIMIT 5", [embedding]).to_a
  ```
 
+ Add an approximate index
+
+ ```ruby
+ conn.exec("CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)")
+ # or
+ conn.exec("CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100)")
+ ```
+
+ Use `vector_ip_ops` for inner product and `vector_cosine_ops` for cosine distance
+
  ## Sequel
 
+ Enable the extension
+
+ ```ruby
+ DB.run("CREATE EXTENSION IF NOT EXISTS vector")
+ ```
+
  Create a table
 
  ```ruby
@@ -81,7 +110,7 @@ Get the nearest neighbors to a record
  item.nearest_neighbors(:embedding, distance: "euclidean").limit(5)
  ```
 
- Also supports `inner_product` and `cosine` distance
+ Also supports `inner_product`, `cosine`, `taxicab`, `hamming`, and `jaccard` distance
 
  Get the nearest neighbors to a vector
 
@@ -89,6 +118,14 @@ Get the nearest neighbors to a vector
  Item.nearest_neighbors(:embedding, [1, 1, 1], distance: "euclidean").limit(5)
  ```
 
+ Add an approximate index
+
+ ```ruby
+ DB.add_index :items, :embedding, type: "hnsw", opclass: "vector_l2_ops"
+ ```
+
+ Use `vector_ip_ops` for inner product and `vector_cosine_ops` for cosine distance
+
  ## History
 
  View the [changelog](https://github.com/pgvector/pgvector-ruby/blob/master/CHANGELOG.md)
data/lib/pgvector/half_vector.rb ADDED
@@ -0,0 +1,19 @@
+ module Pgvector
+   class HalfVector
+     def initialize(data)
+       @data = data.to_a.map(&:to_f)
+     end
+
+     def self.from_text(string)
+       new(string[1..-2].split(",").map(&:to_f))
+     end
+
+     def to_s
+       "[#{@data.to_a.map(&:to_f).join(",")}]"
+     end
+
+     def to_a
+       @data
+     end
+   end
+ end
data/lib/pgvector/pg.rb CHANGED
@@ -5,14 +5,24 @@ module Pgvector
      def self.register_vector(registry)
        registry.register_type(0, "vector", nil, TextDecoder::Vector)
        registry.register_type(1, "vector", nil, BinaryDecoder::Vector)
+
+       # no binary decoder for halfvec since unpack does not have directive for half-precision
+       registry.register_type(0, "halfvec", nil, TextDecoder::Halfvec)
+
+       registry.register_type(0, "sparsevec", nil, TextDecoder::Sparsevec)
+       registry.register_type(1, "sparsevec", nil, BinaryDecoder::Sparsevec)
      end
 
      module BinaryDecoder
        class Vector < ::PG::SimpleDecoder
          def decode(string, tuple = nil, field = nil)
-           dim, unused = string[0, 4].unpack("nn")
-           raise "expected unused to be 0" if unused != 0
-           string[4..-1].unpack("g#{dim}")
+           ::Pgvector::Vector.from_binary(string).to_a
+         end
+       end
+
+       class Sparsevec < ::PG::SimpleDecoder
+         def decode(string, tuple = nil, field = nil)
+           SparseVector.from_binary(string)
          end
        end
      end
@@ -20,7 +30,19 @@ module Pgvector
      module TextDecoder
        class Vector < ::PG::SimpleDecoder
          def decode(string, tuple = nil, field = nil)
-           Pgvector.decode(string)
+           ::Pgvector::Vector.from_text(string).to_a
+         end
+       end
+
+       class Halfvec < ::PG::SimpleDecoder
+         def decode(string, tuple = nil, field = nil)
+           HalfVector.from_text(string).to_a
+         end
+       end
+
+       class Sparsevec < ::PG::SimpleDecoder
+         def decode(string, tuple = nil, field = nil)
+           SparseVector.from_text(string)
          end
        end
      end
data/lib/pgvector/sparse_vector.rb ADDED
@@ -0,0 +1,87 @@
+ module Pgvector
+   class SparseVector
+     attr_reader :dimensions, :indices, :values
+
+     NO_DEFAULT = Object.new
+
+     def initialize(value, dimensions = NO_DEFAULT)
+       if value.is_a?(Hash)
+         if dimensions == NO_DEFAULT
+           raise ArgumentError, "missing dimensions"
+         end
+         from_hash(value, dimensions)
+       else
+         unless dimensions == NO_DEFAULT
+           raise ArgumentError, "extra argument"
+         end
+         from_array(value)
+       end
+     end
+
+     def to_s
+       "{#{@indices.zip(@values).map { |i, v| "#{i.to_i + 1}:#{v.to_f}" }.join(",")}}/#{@dimensions.to_i}"
+     end
+
+     def to_a
+       arr = Array.new(dimensions, 0.0)
+       @indices.zip(@values) do |i, v|
+         arr[i] = v
+       end
+       arr
+     end
+
+     private
+
+     def from_hash(data, dimensions)
+       elements = data.select { |_, v| v != 0 }.sort
+       @dimensions = dimensions.to_i
+       @indices = elements.map { |v| v[0].to_i }
+       @values = elements.map { |v| v[1].to_f }
+     end
+
+     def from_array(arr)
+       arr = arr.to_a
+       @dimensions = arr.size
+       @indices = []
+       @values = []
+       arr.each_with_index do |v, i|
+         if v != 0
+           @indices << i
+           @values << v.to_f
+         end
+       end
+     end
+
+     class << self
+       def from_text(string)
+         elements, dimensions = string.split("/", 2)
+         indices = []
+         values = []
+         elements[1..-2].split(",").each do |e|
+           index, value = e.split(":", 2)
+           indices << index.to_i - 1
+           values << value.to_f
+         end
+         from_parts(dimensions.to_i, indices, values)
+       end
+
+       def from_binary(string)
+         dim, nnz, unused = string[0, 12].unpack("l>l>l>")
+         raise "expected unused to be 0" if unused != 0
+         indices = string[12, nnz * 4].unpack("l>#{nnz}")
+         values = string[(12 + nnz * 4)..-1].unpack("g#{nnz}")
+         from_parts(dim, indices, values)
+       end
+
+       private
+
+       def from_parts(dimensions, indices, values)
+         vec = allocate
+         vec.instance_variable_set(:@dimensions, dimensions)
+         vec.instance_variable_set(:@indices, indices)
+         vec.instance_variable_set(:@values, values)
+         vec
+       end
+     end
+   end
+ end
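The `to_s` and `from_text` methods above use the sparsevec text format: one-based `index:value` pairs inside braces, followed by `/` and the dimension count, while the Ruby object keeps zero-based indices. A minimal standalone sketch of that round trip, assuming nothing beyond core Ruby; `sparse_to_text` and `sparse_from_text` are hypothetical helper names that mirror the diff's logic and are not part of the gem:

```ruby
# Standalone sketch of the sparsevec text format; does not load the pgvector gem.
# Both helper names are hypothetical and exist only for this illustration.

# Serialize zero-based indices and their values as "{i:v,...}/dimensions",
# shifting each index to one-based as the text format expects.
def sparse_to_text(indices, values, dimensions)
  pairs = indices.zip(values).map { |i, v| "#{i + 1}:#{v.to_f}" }
  "{#{pairs.join(",")}}/#{dimensions}"
end

# Parse the text form back into [indices, values, dimensions],
# shifting each index back to zero-based.
def sparse_from_text(string)
  elements, dimensions = string.split("/", 2)
  indices = []
  values = []
  elements[1..-2].split(",").each do |e|
    index, value = e.split(":", 2)
    indices << index.to_i - 1
    values << value.to_f
  end
  [indices, values, dimensions.to_i]
end

text = sparse_to_text([0, 2], [1.0, 3.5], 4)
puts text                  # {1:1.0,3:3.5}/4
p sparse_from_text(text)   # [[0, 2], [1.0, 3.5], 4]
```

The index shift is the part most likely to cause off-by-one mistakes when building sparsevec literals by hand.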
data/lib/pgvector/vector.rb ADDED
@@ -0,0 +1,25 @@
+ module Pgvector
+   class Vector
+     def initialize(data)
+       @data = data.to_a.map(&:to_f)
+     end
+
+     def self.from_text(string)
+       Vector.new(string[1..-2].split(",").map(&:to_f))
+     end
+
+     def self.from_binary(string)
+       dim, unused = string[0, 4].unpack("nn")
+       raise "expected unused to be 0" if unused != 0
+       Vector.new(string[4..-1].unpack("g#{dim}"))
+     end
+
+     def to_s
+       "[#{@data.to_a.map(&:to_f).join(",")}]"
+     end
+
+     def to_a
+       @data
+     end
+   end
+ end
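`Vector.from_binary` above reads a 4-byte header (two big-endian 16-bit integers: the dimension count and an unused field that must be 0) followed by one big-endian single-precision float per dimension. A minimal standalone sketch of that layout using only `Array#pack` and `String#unpack`; the helper names are hypothetical, and `vector_to_binary` in particular has no counterpart in the diff and exists only to build a payload to decode:

```ruby
# Standalone sketch of the vector binary layout; does not load the pgvector gem.
# Both helper names are hypothetical.

# Build a payload: uint16 dimension count, uint16 unused (0), then float32 values,
# all big-endian. Not part of the gem; used here only to produce test input.
def vector_to_binary(values)
  [values.size, 0].pack("nn") + values.pack("g#{values.size}")
end

# Decode a payload the same way the diff's from_binary does.
def vector_from_binary(string)
  dim, unused = string[0, 4].unpack("nn")
  raise "expected unused to be 0" if unused != 0
  string[4..-1].unpack("g#{dim}")
end

payload = vector_to_binary([1.0, 2.0, 3.0])
puts payload.bytesize          # 16 (4-byte header + 3 * 4-byte floats)
p vector_from_binary(payload)  # [1.0, 2.0, 3.0]
```

Values that are not exactly representable in single precision will not round-trip bit-for-bit, since the `g` directive stores 32-bit floats.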
data/lib/pgvector/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Pgvector
-   VERSION = "0.2.1"
+   VERSION = "0.3.0"
  end
data/lib/pgvector.rb CHANGED
@@ -1,14 +1,27 @@
  # modules
+ require_relative "pgvector/half_vector"
+ require_relative "pgvector/sparse_vector"
+ require_relative "pgvector/vector"
  require_relative "pgvector/version"
 
  module Pgvector
    autoload :PG, "pgvector/pg"
 
    def self.encode(data)
-     "[#{data.to_a.map(&:to_f).join(",")}]"
+     if data.is_a?(SparseVector)
+       data.to_s
+     else
+       Vector.new(data).to_s
+     end
    end
 
    def self.decode(string)
-     string[1..-2].split(",").map(&:to_f)
+     if string[0] == "["
+       Vector.from_text(string).to_a
+     elsif string[0] == "{"
+       SparseVector.from_text(string)
+     else
+       string
+     end
    end
  end
data/lib/sequel/plugins/pgvector.rb CHANGED
@@ -8,12 +8,10 @@ module Sequel
          end
        end
 
-       module ClassMethods
-         attr_accessor :vector_columns
-
+       module DatasetMethods
          def nearest_neighbors(column, value, distance:)
            value = ::Pgvector.encode(value) unless value.is_a?(String)
-           quoted_column = dataset.quote_identifier(column)
+           quoted_column = quote_identifier(column)
            distance = distance.to_s
 
            operator =
@@ -24,6 +22,12 @@ module Sequel
                "<=>"
              when "euclidean"
                "<->"
+             when "taxicab"
+               "<+>"
+             when "hamming"
+               "<~>"
+             when "jaccard"
+               "<%>"
              end
 
            raise ArgumentError, "Invalid distance: #{distance}" unless operator
@@ -41,6 +45,12 @@ module Sequel
              .exclude(column => nil)
              .order(Sequel.lit(order, value))
          end
+       end
+
+       module ClassMethods
+         attr_accessor :vector_columns
+
+         Sequel::Plugins.def_dataset_methods(self, :nearest_neighbors)
 
          Plugins.inherited_instance_variables(self, :@vector_columns => :dup)
        end
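The hunks above add three distance names to the Sequel plugin's operator lookup. A minimal standalone sketch of that lookup, written as a hash rather than the `case` statement in the diff: the constant `DISTANCE_OPERATORS` and the helper `distance_operator` are hypothetical names, and the `inner_product` entry (`<#>`) is taken from pgvector's operator list rather than from the visible hunk.

```ruby
# Standalone sketch; does not load Sequel or the pgvector gem.
# DISTANCE_OPERATORS and distance_operator are hypothetical names.
DISTANCE_OPERATORS = {
  "inner_product" => "<#>", # assumption: not shown in the hunk, from pgvector's operators
  "cosine" => "<=>",
  "euclidean" => "<->",
  "taxicab" => "<+>",   # added in 0.3.0
  "hamming" => "<~>",   # added in 0.3.0
  "jaccard" => "<%>"    # added in 0.3.0
}.freeze

# Resolve a distance name (String or Symbol) to its operator,
# raising the same error message the plugin uses for unknown names.
def distance_operator(distance)
  operator = DISTANCE_OPERATORS[distance.to_s]
  raise ArgumentError, "Invalid distance: #{distance}" unless operator
  operator
end

puts distance_operator(:taxicab)   # <+>
puts distance_operator("jaccard")  # <%>
```

An unrecognized name such as `"manhattan"` raises `ArgumentError`, matching the guard kept in the diff.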
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: pgvector
  version: !ruby/object:Gem::Version
-   version: 0.2.1
+   version: 0.3.0
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2023-06-05 00:00:00.000000000 Z
+ date: 2024-06-26 00:00:00.000000000 Z
  dependencies: []
  description:
  email: andrew@ankane.org
@@ -20,7 +20,10 @@ files:
  - LICENSE.txt
  - README.md
  - lib/pgvector.rb
+ - lib/pgvector/half_vector.rb
  - lib/pgvector/pg.rb
+ - lib/pgvector/sparse_vector.rb
+ - lib/pgvector/vector.rb
  - lib/pgvector/version.rb
  - lib/sequel/plugins/pgvector.rb
  homepage: https://github.com/pgvector/pgvector-ruby
@@ -35,14 +38,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
-       version: '3'
+       version: '3.1'
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.4.10
+ rubygems_version: 3.5.11
  signing_key:
  specification_version: 4
  summary: pgvector support for Ruby