jekyll_ai_related_posts 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 8b59977a0c1d912e06792f7123c7d443547f1decb9fdf4042d90a4fcd4e1eb4e
+   data.tar.gz: 159db1306777a201cff3a9d3532ddba444164dbfb780207fb9457ce4898df5f8
+ SHA512:
+   metadata.gz: e0f73857997bdacd22059c542a1d2a642a3ea76aa165241c155a0e8b2cafdc4f48607f334a2ed7a265a252e86969d9e523bf4ef2ac5889e1f39c36d1a27792f5
+   data.tar.gz: a5f5f573459deb7fab308b1fb98a1d60ce1aeda95e28860161ed886a3ab90c43f1b7d36fea6cc5c570b05cd15284a59240e8676f5b10f44290b05fbacbd379df
data/.rspec ADDED
@@ -0,0 +1,3 @@
+ --format documentation
+ --color
+ --require spec_helper
data/.rubocop.yml ADDED
@@ -0,0 +1,5 @@
+ # Omakase Ruby styling for Rails
+ inherit_gem:
+   rubocop-rails-omakase: rubocop.yml
+
+ # Your own specialized rules go here
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
+ ## [Unreleased]
+
+ ## [0.1.0] - 2024-04-18
+
+ - Initial release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2024 Mike Kasberg
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,101 @@
+ # Jekyll AI Related Posts 🪄
+
+ Jekyll ships with functionality that populates
+ [related_posts](https://jekyllrb.com/docs/variables/) with the ten most recent
+ posts. If you install
+ [classifier_reborn](https://jekyll.github.io/classifier-reborn/) and use the
+ `--lsi` option, Jekyll will populate `related_posts` using latent semantic
+ indexing.
+
+ **Using AI is a much better approach.** Latent semantic indexing seems
+ promising, but in practice requires libraries like Numo or GSL that are tricky
+ to install, and still produces mediocre results. In contrast, OpenAI offers an
+ embeddings API that allows us to easily get the embedding vector (in one of
+ OpenAI's models) of some text. We can use these vectors to compute related
+ posts with the accuracy of OpenAI's models (or any other LLM, for that matter).
+
+ ## Installation
+
+ Jekyll AI Related Posts is a [Jekyll
+ plugin](https://jekyllrb.com/docs/plugins/installation/). It can be installed
+ using any Jekyll plugin installation method.
+
+ ## Configuration
+
+ All config for this plugin sits under a top-level `ai_related_posts` key.
+
+ The only required config is `openai_api_key` -- we need to authenticate to the
+ API to fetch embedding vectors.
+
+ - **openai_api_key** Your OpenAI API key, used to fetch embeddings.
+ - **fetch_enabled** (optional, default `true`). If true, fetch embeddings. If
+   false, don't fetch embeddings. If this is a string (like `prod`), fetch
+   embeddings only when the `JEKYLL_ENV` environment variable is equal to the
+   string. (This is useful if you want to reduce API costs by only fetching
+   embeddings on production builds.)
+
+ ### Example Config
+
+ ```yaml
+ ai_related_posts:
+   openai_api_key: sk-proj-abc123
+   fetch_enabled: prod
+ ```
+
+ ## Usage
+
+ When the plugin is installed and configured, it will populate an
+ `ai_related_posts` key in the post data for all posts. Here's an example of how
+ to use it:
+
+ ```liquid
+ <h2>Related Posts</h2>
+ <ul>
+   {% for post in page.ai_related_posts limit:3 %}
+     <li><a href="{{ post.url }}">{{ post.title }}</a></li>
+   {% endfor %}
+ </ul>
+ ```
+
+ ### Upgrading from Built-In Related Posts
+
+ If you're already using Jekyll's built-in `site.related_posts` and you want to
+ upgrade to AI related posts:
+
+ - Install the plugin.
+ - Replace `site.related_posts` with `page.ai_related_posts` in your templates.
+ - If you were using LSI, stop. It's no longer necessary. Don't pass the `--lsi`
+   option to the `jekyll` command. You can remove the `classifier-reborn` gem and
+   its dependencies (Numo).
+
+
+ ## How It Works
+
+ Jekyll AI Related Posts is implemented as a Jekyll Generator plugin. During the
+ build process, the plugin will call the [OpenAI Embeddings
+ API](https://platform.openai.com/docs/guides/embeddings) to fetch the vector
+ embedding for a string containing the title, tags, and categories of your
+ article. It's not necessary to use the full post text; in most cases the title
+ and tags produce very accurate results because the LLM knows when topics are
+ related even if they never use identical words. This is also why the LLM
+ produces better results than LSI. These vector embeddings are cached in a SQLite
+ database. To query for related posts, we query the cached vectors using the
+ [sqlite-vss](https://github.com/asg017/sqlite-vss) plugin.
+
+ ## Development
+
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run
+ `rake spec` to run the tests. You can also run `bin/console` for an interactive
+ prompt that will allow you to experiment.
+
+ To install this gem onto your local machine, run `bundle exec rake install`. To
+ release a new version, update the version number in `version.rb`, and then run
+ `bundle exec rake release`, which will create a git tag for the version, push
+ git commits and the created tag, and push the `.gem` file to
+ [rubygems.org](https://rubygems.org).
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at
+ https://github.com/mkasberg/jekyll_ai_related_posts.
+
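Reviewer note on the README's "How It Works" section: the core idea is ranking posts by the distance between their embedding vectors. The following is a minimal sketch of that idea in plain Ruby, not the gem's implementation (the gem delegates the search to sqlite-vss). The helper names `l2_distance` and `nearest_posts`, the post paths, and the 3-dimensional vectors are all invented for illustration, and the choice of Euclidean distance is an assumption.

```ruby
# Toy nearest-neighbour lookup over embedding vectors.
# All data here is made up; real embeddings have far more dimensions.

# Euclidean (L2) distance between two equal-length numeric arrays.
def l2_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Returns the paths of the `limit` posts closest to `path`, nearest first.
# The post itself (which would have distance 0) is excluded.
def nearest_posts(embeddings, path, limit: 3)
  target = embeddings.fetch(path)
  embeddings
    .reject { |other_path, _| other_path == path }
    .sort_by { |_, vector| l2_distance(target, vector) }
    .first(limit)
    .map { |other_path, _| other_path }
end

EMBEDDINGS = {
  "_posts/ruby-tips.md" => [0.9, 0.1, 0.0],
  "_posts/rails-intro.md" => [0.8, 0.2, 0.1],
  "_posts/sourdough.md" => [0.0, 0.1, 0.9],
  "_posts/jekyll-setup.md" => [0.7, 0.3, 0.0]
}.freeze

puts nearest_posts(EMBEDDINGS, "_posts/ruby-tips.md", limit: 2)
```

With these sample vectors the two programming posts rank ahead of the baking post, which is the behaviour the README describes for topically related titles.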
data/Rakefile ADDED
@@ -0,0 +1,12 @@
+ # frozen_string_literal: true
+
+ require "bundler/gem_tasks"
+ require "rspec/core/rake_task"
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ require "rubocop/rake_task"
+
+ RuboCop::RakeTask.new
+
+ task default: %i[spec rubocop]
data/lib/jekyll_ai_related_posts/generator.rb ADDED
@@ -0,0 +1,181 @@
+ # frozen_string_literal: true
+
+ require "active_record"
+ require "sqlite3"
+ require "sqlite_vss"
+ require "jekyll"
+ require "json"
+
+ module JekyllAiRelatedPosts
+   class Generator < Jekyll::Generator
+     def generate(site)
+       @site = site
+       setup_database
+
+       if fetch_enabled?
+         Jekyll.logger.info "[ai_related_posts] Generating related posts..."
+         @embeddings_fetcher = new_fetcher
+
+         @site.posts.docs.each do |p|
+           ensure_embedding_cached(p)
+         end
+
+         @indexed_posts = {}
+         site.posts.docs.each do |p|
+           @indexed_posts[p.relative_path] = p
+         end
+
+         @site.posts.docs.each do |p|
+           find_related(p)
+         end
+       else
+         Jekyll.logger.info "[ai_related_posts] Using cached related posts data..."
+
+         @site.posts.docs.each do |p|
+           fallback_generate_related(p)
+         end
+       end
+     end
+
+     private
+
+     def fetch_enabled?
+       enabled = true
+       if @site.config["ai_related_posts"]["fetch_enabled"].is_a? String
+         enabled = ENV["JEKYLL_ENV"] == @site.config["ai_related_posts"]["fetch_enabled"]
+       elsif [ true, false ].include? @site.config["ai_related_posts"]["fetch_enabled"]
+         enabled = @site.config["ai_related_posts"]["fetch_enabled"]
+       end
+
+       enabled
+     end
+
+     def fallback_generate_related(post)
+       existing = Models::Post.find_by(relative_path: post.relative_path)
+       if existing.nil?
+         post.data["ai_related_posts"] = post.related_posts
+       else
+         find_related(post)
+       end
+     end
+
+     def new_fetcher
+       case @site.config["ai_related_posts"]["embeddings_source"]
+       when "mock"
+         MockEmbeddings.new
+       else
+         OpenAiEmbeddings.new(@site.config["ai_related_posts"]["openai_api_key"])
+       end
+     end
+
+     def ensure_embedding_cached(post)
+       existing = Models::Post.find_by(relative_path: post.relative_path)
+
+       # Clear cache if post has been updated
+       if !existing.nil? && existing.embedding_text != embedding_text(post)
+         sql = "DELETE FROM vss_posts WHERE rowid = (SELECT rowid FROM posts WHERE relative_path = :relative_path);"
+         ActiveRecord::Base.connection.execute(ActiveRecord::Base.sanitize_sql([ sql,
+                                                                                 { relative_path: post.relative_path } ]))
+         existing.destroy!
+         existing = nil
+       end
+
+       return unless existing.nil?
+
+       Models::Post.create!(
+         relative_path: post.relative_path,
+         embedding_text: embedding_text(post),
+         embedding: embedding_for(post).to_json
+       )
+
+       sql = <<-SQL
+         INSERT INTO vss_posts (rowid, post_embedding)
+           SELECT rowid, embedding FROM posts WHERE relative_path = :relative_path;
+       SQL
+       ActiveRecord::Base.connection.execute(ActiveRecord::Base.sanitize_sql([ sql,
+                                                                               { relative_path: post.relative_path } ]))
+     end
+
+     def find_related(post)
+       sql = <<-SQL
+         SELECT rowid, distance
+         FROM vss_posts
+         WHERE vss_search(
+           post_embedding,
+           (select embedding from posts where relative_path = :relative_path)
+         )
+         LIMIT 10000;
+       SQL
+
+       results = ActiveRecord::Base.connection.execute(ActiveRecord::Base.sanitize_sql([ sql, {
+         relative_path: post.relative_path
+       } ]))
+       # The first result is the post itself, with a distance of 0.
+       rowids = results.sort_by { |r| r["distance"] }.drop(1).first(3).map { |r| r["rowid"] }
+
+       posts_by_rowid = {}
+       rowids.each do |rowid|
+         # This *is* an N+1 query, but:
+         # - N+1 penalty is way less with SQLite
+         # - N is relatively small (it's Jekyll post count)
+         # - This is an easy way to work around rowid not being a real column that ActiveRecord knows about.
+         posts_by_rowid[rowid] = Models::Post.select(:relative_path).find_by(rowid: rowid)
+       end
+
+       related_posts = rowids.map do |rowid|
+         relative_path = posts_by_rowid[rowid]["relative_path"]
+         @indexed_posts[relative_path]
+       end
+
+       post.data["ai_related_posts"] = related_posts
+     end
+
+     def embedding_text(post)
+       text = "Title: #{post.data["title"]}"
+       text += "; Categories: #{post.data["categories"].join(", ")}" unless post.data["categories"].empty?
+       text += "; Tags: #{post.data["tags"].join(", ")}" unless post.data["tags"].empty?
+
+       text
+     end
+
+     def embedding_for(post)
+       Jekyll.logger.info "[ai_related_posts] Fetching embedding for #{post.relative_path}"
+       input = embedding_text(post)
+
+       @embeddings_fetcher.embedding_for(input)
+     end
+
+     def setup_database
+       ActiveRecord::Base.establish_connection(
+         adapter: "sqlite3",
+         database: @site.in_source_dir(".ai_related_posts_cache.sqlite3")
+       )
+       # We don't need WAL mode for this.
+       ActiveRecord::Base.connection.execute("PRAGMA journal_mode=DELETE;")
+
+       # Enable sqlite-vss vector extension
+       db = ActiveRecord::Base.connection.raw_connection
+       db.enable_load_extension(true)
+       SqliteVss.load(db)
+       db.enable_load_extension(false)
+
+       create_posts = <<-SQL
+         CREATE TABLE IF NOT EXISTS posts(
+           relative_path TEXT PRIMARY KEY,
+           embedding_text TEXT,
+           embedding TEXT
+         );
+       SQL
+       ActiveRecord::Base.connection.execute(create_posts)
+
+       create_vss_posts = <<-SQL
+         CREATE VIRTUAL TABLE IF NOT EXISTS vss_posts using vss0(
+           post_embedding(#{OpenAiEmbeddings::DIMENSIONS})
+         );
+       SQL
+       ActiveRecord::Base.connection.execute(create_vss_posts)
+
+       Jekyll.logger.debug("ai_related_posts db setup complete")
+     end
+   end
+ end
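Reviewer note on `fetch_enabled?`: the method resolves the `fetch_enabled` setting in three cases (string, boolean, anything else). The sketch below restates that decision as a standalone function so the cases are easy to check in isolation. It is an illustration under assumptions, not the gem's code: the parameters `config_value` and `jekyll_env` are hypothetical stand-ins for the config lookup and for `ENV["JEKYLL_ENV"]`.

```ruby
# Standalone restatement of the fetch_enabled decision.
# config_value: the raw ai_related_posts.fetch_enabled setting (may be nil).
# jekyll_env:   stand-in for ENV["JEKYLL_ENV"] (may be nil).
def fetch_enabled?(config_value, jekyll_env)
  case config_value
  when String
    # A string enables fetching only when it equals the current environment.
    jekyll_env == config_value
  when true, false
    # An explicit boolean is used as-is.
    config_value
  else
    # Unset or unrecognised values fall back to the default: enabled.
    true
  end
end

puts fetch_enabled?("prod", "prod")
puts fetch_enabled?("prod", "development")
puts fetch_enabled?(false, "prod")
puts fetch_enabled?(nil, nil)
```

One consequence worth noting: with the README's example value `prod`, embeddings are fetched only on builds where `JEKYLL_ENV` is exactly `prod`, since the comparison is a plain string equality.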
data/lib/jekyll_ai_related_posts/models/post.rb ADDED
@@ -0,0 +1,8 @@
+ # frozen_string_literal: true
+
+ module JekyllAiRelatedPosts
+   module Models
+     class Post < ActiveRecord::Base
+     end
+   end
+ end
data/lib/jekyll_ai_related_posts/open_ai_embeddings.rb ADDED
@@ -0,0 +1,38 @@
+ # frozen_string_literal: true
+
+ require "faraday"
+
+ module JekyllAiRelatedPosts
+   class OpenAiEmbeddings
+     DIMENSIONS = 1536
+
+     def initialize(api_key, connection: nil)
+       @connection = if connection.nil?
+         Faraday.new(url: "https://api.openai.com") do |builder|
+           builder.request :authorization, "Bearer", api_key
+           builder.request :json
+           builder.response :json
+           builder.response :raise_error
+         end
+       else
+         connection
+       end
+     end
+
+     def embedding_for(text)
+       res = @connection.post("/v1/embeddings") do |req|
+         req.body = {
+           input: text,
+           model: "text-embedding-3-small"
+         }
+       end
+
+       res.body["data"].first["embedding"]
+     rescue Faraday::Error => e
+       Jekyll.logger.error "Error response from OpenAI API!"
+       Jekyll.logger.error e.inspect
+
+       raise
+     end
+   end
+ end
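Reviewer note on the `connection:` keyword: because `OpenAiEmbeddings#initialize` accepts an injected connection, the request/response handling can be exercised without network access. The sketch below shows the pattern with stand-in classes so that it runs on its own. `SketchEmbeddings` and `FakeConnection` are hypothetical names invented here; they mirror only the shape of the calls visible in the hunk above (`post` with a block that sets `req.body`, and a response whose `body` is a parsed hash) and are not part of the gem or of Faraday.

```ruby
# Stand-in client reduced to the call shape used in the hunk above.
# This is an illustration, not the gem's class.
class SketchEmbeddings
  def initialize(connection)
    @connection = connection
  end

  def embedding_for(text)
    res = @connection.post("/v1/embeddings") do |req|
      req.body = { input: text, model: "text-embedding-3-small" }
    end
    res.body["data"].first["embedding"]
  end
end

# Hypothetical fake connection: records the request and returns a canned body.
class FakeConnection
  Request = Struct.new(:body)
  Response = Struct.new(:body)

  attr_reader :last_path, :last_request

  def post(path)
    @last_path = path
    @last_request = Request.new
    yield @last_request
    # Canned payload shaped like the parsed JSON the client reads from.
    Response.new({ "data" => [{ "embedding" => [0.1, 0.2, 0.3] }] })
  end
end

fake = FakeConnection.new
vector = SketchEmbeddings.new(fake).embedding_for("Title: Hello")
puts vector.inspect
puts fake.last_request.body[:input]
```

With the real class, the equivalent would presumably be passing such an object as `connection:` to `OpenAiEmbeddings.new`, per the signature shown in the diff.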
data/lib/jekyll_ai_related_posts/version.rb ADDED
@@ -0,0 +1,5 @@
+ # frozen_string_literal: true
+
+ module JekyllAiRelatedPosts
+   VERSION = "0.1.0"
+ end
data/lib/jekyll_ai_related_posts.rb ADDED
@@ -0,0 +1,13 @@
+ # frozen_string_literal: true
+
+ require_relative "jekyll_ai_related_posts/generator"
+
+ require "zeitwerk"
+ loader = Zeitwerk::Loader.for_gem
+ loader.setup
+
+ module JekyllAiRelatedPosts
+   GEM_ROOT = File.expand_path("..", __dir__)
+
+   class Error < StandardError; end
+ end
metadata ADDED
@@ -0,0 +1,142 @@
+ --- !ruby/object:Gem::Specification
+ name: jekyll_ai_related_posts
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Mike Kasberg
+ autorequire:
+ bindir: exe
+ cert_chain: []
+ date: 2024-04-23 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: activerecord
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '7.1'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '7.1'
+ - !ruby/object:Gem::Dependency
+   name: faraday
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.9'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.9'
+ - !ruby/object:Gem::Dependency
+   name: jekyll
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '3.0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '3.0'
+ - !ruby/object:Gem::Dependency
+   name: sqlite3
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.4'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.4'
+ - !ruby/object:Gem::Dependency
+   name: sqlite-vss
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 0.1.2
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 0.1.2
+ - !ruby/object:Gem::Dependency
+   name: zeitwerk
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.6'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.6'
+ description: Populate ai_related_posts using Open AI embeddings
+ email:
+ - kasberg.mike@gmail.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - ".rspec"
+ - ".rubocop.yml"
+ - CHANGELOG.md
+ - LICENSE.txt
+ - README.md
+ - Rakefile
+ - lib/jekyll_ai_related_posts.rb
+ - lib/jekyll_ai_related_posts/generator.rb
+ - lib/jekyll_ai_related_posts/models/post.rb
+ - lib/jekyll_ai_related_posts/open_ai_embeddings.rb
+ - lib/jekyll_ai_related_posts/version.rb
+ homepage: https://github.com/mkasberg/jekyll_ai_related_posts
+ licenses:
+ - MIT
+ metadata:
+   allowed_push_host: https://rubygems.org
+   homepage_uri: https://github.com/mkasberg/jekyll_ai_related_posts
+   source_code_uri: https://github.com/mkasberg/jekyll_ai_related_posts
+   changelog_uri: https://github.com/mkasberg/jekyll_ai_related_posts
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: 3.0.0
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.5.6
+ signing_key:
+ specification_version: 4
+ summary: Populate ai_related_posts using Open AI embeddings
+ test_files: []
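Reviewer note on the dependency constraints: most runtime dependencies in the metadata use the pessimistic operator `~>`, while `jekyll` uses an open-ended `>=`. The sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes to show what those constraint strings accept. The helper `allowed?` is a hypothetical name, and the version numbers passed to it are arbitrary sample inputs chosen to land on either side of each boundary, not claims about which releases exist.

```ruby
# Gem::Requirement and Gem::Version ship with RubyGems, which Ruby loads by default.

# True when `version` satisfies the constraint string, e.g. "~> 7.1".
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# "~> 7.1" (activerecord): at least 7.1 and below 8.0.
puts allowed?("~> 7.1", "7.2.0")
puts allowed?("~> 7.1", "8.0.0")

# "~> 0.1.2" (sqlite-vss): at least 0.1.2 and below 0.2.
puts allowed?("~> 0.1.2", "0.1.9")
puts allowed?("~> 0.1.2", "0.2.0")

# ">= 3.0" (jekyll): no upper bound.
puts allowed?(">= 3.0", "4.3.3")
```

The three-segment form (`~> 0.1.2`) is the tighter one: it pins the minor version, whereas the two-segment form (`~> 7.1`) pins only the major version.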