jekyll_ai_related_posts 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 8b59977a0c1d912e06792f7123c7d443547f1decb9fdf4042d90a4fcd4e1eb4e
+   data.tar.gz: 159db1306777a201cff3a9d3532ddba444164dbfb780207fb9457ce4898df5f8
+ SHA512:
+   metadata.gz: e0f73857997bdacd22059c542a1d2a642a3ea76aa165241c155a0e8b2cafdc4f48607f334a2ed7a265a252e86969d9e523bf4ef2ac5889e1f39c36d1a27792f5
+   data.tar.gz: a5f5f573459deb7fab308b1fb98a1d60ce1aeda95e28860161ed886a3ab90c43f1b7d36fea6cc5c570b05cd15284a59240e8676f5b10f44290b05fbacbd379df
data/.rspec ADDED
@@ -0,0 +1,3 @@
+ --format documentation
+ --color
+ --require spec_helper
data/.rubocop.yml ADDED
@@ -0,0 +1,5 @@
+ # Omakase Ruby styling for Rails
+ inherit_gem:
+   rubocop-rails-omakase: rubocop.yml
+
+ # Your own specialized rules go here
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
+ ## [Unreleased]
+
+ ## [0.1.0] - 2024-04-18
+
+ - Initial release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2024 Mike Kasberg
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,101 @@
+ # Jekyll AI Related Posts 🪄
+
+ Jekyll ships with functionality that populates
+ [related_posts](https://jekyllrb.com/docs/variables/) with the ten most recent
+ posts. If you install
+ [classifier_reborn](https://jekyll.github.io/classifier-reborn/) and use the
+ `--lsi` option, Jekyll will populate `related_posts` using latent semantic
+ indexing.
+
+ **Using AI is a much better approach.** Latent semantic indexing seems
+ promising, but in practice it requires libraries like Numo or GSL that are
+ tricky to install, and it still produces mediocre results. In contrast, OpenAI
+ offers an embeddings API that makes it easy to get the embedding vector of some
+ text from one of OpenAI's models. We can use these vectors to compute related
+ posts with the accuracy of OpenAI's models (or any other LLM, for that matter).
+
+ ## Installation
+
+ Jekyll AI Related Posts is a [Jekyll
+ plugin](https://jekyllrb.com/docs/plugins/installation/). It can be installed
+ using any Jekyll plugin installation method.
+
+ ## Configuration
+
+ All config for this plugin sits under a top-level `ai_related_posts` key.
+
+ The only required config is `openai_api_key` -- we need to authenticate to the
+ API to fetch embedding vectors.
+
+ - **openai_api_key** (required). Your OpenAI API key, used to fetch embeddings.
+ - **fetch_enabled** (optional, default `true`). If true, fetch embeddings. If
+   false, don't fetch embeddings. If this is a string (like `prod`), fetch
+   embeddings only when the `JEKYLL_ENV` environment variable is equal to the
+   string. (This is useful if you want to reduce API costs by only fetching
+   embeddings on production builds.)
+
+ ### Example Config
+
+ ```yaml
+ ai_related_posts:
+   openai_api_key: sk-proj-abc123
+   fetch_enabled: prod
+ ```
+
+ ## Usage
+
+ When the plugin is installed and configured, it will populate an
+ `ai_related_posts` key in the post data for all posts. Here's an example of how
+ to use it:
+
+ ```liquid
+ <h2>Related Posts</h2>
+ <ul>
+   {% for post in page.ai_related_posts limit:3 %}
+     <li><a href="{{ post.url }}">{{ post.title }}</a></li>
+   {% endfor %}
+ </ul>
+ ```
+
+ ### Upgrading from Built-In Related Posts
+
+ If you're already using Jekyll's built-in `site.related_posts` and you want to
+ upgrade to AI related posts:
+
+ - Install the plugin.
+ - Replace `site.related_posts` with `page.ai_related_posts` in your templates.
+ - If you were using LSI, stop. It's no longer necessary. Don't pass the `--lsi`
+   option to the `jekyll` command. You can remove the `classifier-reborn` gem and
+   its dependencies (Numo).
+
+
+ ## How It Works
+
+ Jekyll AI Related Posts is implemented as a Jekyll Generator plugin. During the
+ build process, the plugin will call the [OpenAI Embeddings
+ API](https://platform.openai.com/docs/guides/embeddings) to fetch the vector
+ embedding for a string containing the title, tags, and categories of your
+ article. It's not necessary to use the full post text; in most cases the title
+ and tags produce very accurate results because the LLM knows when topics are
+ related even if they never use identical words. This is also why the LLM
+ produces better results than LSI. These vector embeddings are cached in a SQLite
+ database. To query for related posts, we query the cached vectors using the
+ [sqlite-vss](https://github.com/asg017/sqlite-vss) extension.
+
+ ## Development
+
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run
+ `rake spec` to run the tests. You can also run `bin/console` for an interactive
+ prompt that will allow you to experiment.
+
+ To install this gem onto your local machine, run `bundle exec rake install`. To
+ release a new version, update the version number in `version.rb`, and then run
+ `bundle exec rake release`, which will create a git tag for the version, push
+ git commits and the created tag, and push the `.gem` file to
+ [rubygems.org](https://rubygems.org).
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at
+ https://github.com/mkasberg/jekyll_ai_related_posts.
+
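The "How It Works" section of the README describes ranking posts by how close their embedding vectors are. The following is a minimal pure-Ruby sketch of that idea, not the gem's implementation: the gem delegates the search to sqlite-vss, whereas this sketch assumes plain Euclidean distance, and the `EMBEDDINGS` table, its three-dimensional vectors, and the `distance` and `related_posts` helpers are all hypothetical names invented for illustration.

```ruby
# Hypothetical toy data: 3-dimensional vectors standing in for real embeddings.
EMBEDDINGS = {
  "_posts/ruby-tips.md"    => [0.9, 0.1, 0.0],
  "_posts/rails-guide.md"  => [0.8, 0.2, 0.1],
  "_posts/sourdough.md"    => [0.0, 0.1, 0.9],
  "_posts/jekyll-intro.md" => [0.7, 0.3, 0.0]
}.freeze

# Euclidean (L2) distance between two equal-length vectors.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Paths of the `limit` posts closest to `path`, nearest first.
# The post itself is excluded, since its distance to itself is always 0.
def related_posts(path, embeddings, limit: 3)
  target = embeddings.fetch(path)
  embeddings
    .reject { |other, _| other == path }
    .sort_by { |_, vector| distance(target, vector) }
    .first(limit)
    .map(&:first)
end

# Nearest first: rails-guide, then jekyll-intro, then sourdough.
puts related_posts("_posts/ruby-tips.md", EMBEDDINGS)
```

With real embeddings the vectors have far more dimensions, which is why the gem hands the search to a vector index instead of sorting in Ruby.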
data/Rakefile ADDED
@@ -0,0 +1,12 @@
+ # frozen_string_literal: true
+
+ require "bundler/gem_tasks"
+ require "rspec/core/rake_task"
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ require "rubocop/rake_task"
+
+ RuboCop::RakeTask.new
+
+ task default: %i[spec rubocop]
data/lib/jekyll_ai_related_posts/generator.rb ADDED
@@ -0,0 +1,181 @@
+ # frozen_string_literal: true
+
+ require "active_record"
+ require "sqlite3"
+ require "sqlite_vss"
+ require "jekyll"
+ require "json"
+
+ module JekyllAiRelatedPosts
+   class Generator < Jekyll::Generator
+     def generate(site)
+       @site = site
+       setup_database
+
+       @indexed_posts = {}
+       site.posts.docs.each do |p|
+         @indexed_posts[p.relative_path] = p
+       end
+
+       if fetch_enabled?
+         Jekyll.logger.info "[ai_related_posts] Generating related posts..."
+         @embeddings_fetcher = new_fetcher
+
+         @site.posts.docs.each do |p|
+           ensure_embedding_cached(p)
+         end
+
+         @site.posts.docs.each do |p|
+           find_related(p)
+         end
+       else
+         Jekyll.logger.info "[ai_related_posts] Using cached related posts data..."
+
+         @site.posts.docs.each do |p|
+           fallback_generate_related(p)
+         end
+       end
+     end
+
+ private
41
+
42
+ def fetch_enabled?
43
+ enabled = true
44
+ if @site.config["ai_related_posts"]["fetch_enabled"].is_a? String
45
+ enabled = ENV["JEKYLL_ENV"] == @site.config["ai_related_posts"]["fetch_enabled"]
46
+ elsif [ true, false ].include? @site.config["ai_related_posts"]["fetch_enabled"]
47
+ enabled = @site.config["ai_related_posts"]["fetch_enabled"]
48
+ end
49
+
50
+ enabled
51
+ end
52
+
53
+ def fallback_generate_related(post)
54
+ existing = Models::Post.find_by(relative_path: post.relative_path)
55
+ if existing.nil?
56
+ post.data["ai_related_posts"] = post.related_posts
57
+ else
58
+ find_related(post)
59
+ end
60
+ end
61
+
62
+ def new_fetcher
63
+ case @site.config["ai_related_posts"]["embeddings_source"]
64
+ when "mock"
65
+ MockEmbeddings.new
66
+ else
67
+ OpenAiEmbeddings.new(@site.config["ai_related_posts"]["openai_api_key"])
68
+ end
69
+ end
70
+
71
+ def ensure_embedding_cached(post)
72
+ existing = Models::Post.find_by(relative_path: post.relative_path)
73
+
74
+ # Clear cache if post has been updated
75
+ if !existing.nil? && existing.embedding_text != embedding_text(post)
76
+ sql = "DELETE FROM vss_posts WHERE rowid = (SELECT rowid FROM posts WHERE relative_path = :relative_path);"
77
+ ActiveRecord::Base.connection.execute(ActiveRecord::Base.sanitize_sql([ sql,
78
+ { relative_path: post.relative_path } ]))
79
+ existing.destroy!
80
+ existing = nil
81
+ end
82
+
83
+ return unless existing.nil?
84
+
85
+ Models::Post.create!(
86
+ relative_path: post.relative_path,
87
+ embedding_text: embedding_text(post),
88
+ embedding: embedding_for(post).to_json
89
+ )
90
+
91
+ sql = <<-SQL
92
+ INSERT INTO vss_posts (rowid, post_embedding)
93
+ SELECT rowid, embedding FROM posts WHERE relative_path = :relative_path;
94
+ SQL
95
+ ActiveRecord::Base.connection.execute(ActiveRecord::Base.sanitize_sql([ sql,
96
+ { relative_path: post.relative_path } ]))
97
+ end
98
+
+     def find_related(post)
+       sql = <<-SQL
+         SELECT rowid, distance
+         FROM vss_posts
+         WHERE vss_search(
+           post_embedding,
+           (select embedding from posts where relative_path = :relative_path)
+         )
+         LIMIT 10000;
+       SQL
+
+       results = ActiveRecord::Base.connection.execute(ActiveRecord::Base.sanitize_sql([ sql, {
+         relative_path: post.relative_path
+       } ]))
+       # The first result is the post itself, with a distance of 0.
+       rowids = results.sort_by { |r| r["distance"] }.drop(1).first(3).map { |r| r["rowid"] }
+
+       posts_by_rowid = {}
+       rowids.each do |rowid|
+         # This *is* an N+1 query, but:
+         # - N+1 penalty is way less with SQLite
+         # - N is relatively small (it's Jekyll post count)
+         # - This is an easy way to work around rowid not being a real column that ActiveRecord knows about.
+         posts_by_rowid[rowid] = Models::Post.select(:relative_path).find_by(rowid: rowid)
+       end
+
+       related_posts = rowids.map do |rowid|
+         relative_path = posts_by_rowid[rowid]["relative_path"]
+         @indexed_posts[relative_path]
+       end
+
+       post.data["ai_related_posts"] = related_posts
+     end
+
+     def embedding_text(post)
+       text = "Title: #{post.data["title"]}"
+       text += "; Categories: #{post.data["categories"].join(", ")}" unless post.data["categories"].empty?
+       text += "; Tags: #{post.data["tags"].join(", ")}" unless post.data["tags"].empty?
+
+       text
+     end
+
+     def embedding_for(post)
+       Jekyll.logger.info "[ai_related_posts] Fetching embedding for #{post.relative_path}"
+       input = embedding_text(post)
+
+       @embeddings_fetcher.embedding_for(input)
+     end
+
+     def setup_database
+       ActiveRecord::Base.establish_connection(
+         adapter: "sqlite3",
+         database: @site.in_source_dir(".ai_related_posts_cache.sqlite3")
+       )
+       # We don't need WAL mode for this.
+       ActiveRecord::Base.connection.execute("PRAGMA journal_mode=DELETE;")
+
+       # Enable sqlite-vss vector extension
+       db = ActiveRecord::Base.connection.raw_connection
+       db.enable_load_extension(true)
+       SqliteVss.load(db)
+       db.enable_load_extension(false)
+
+       create_posts = <<-SQL
+         CREATE TABLE IF NOT EXISTS posts(
+           relative_path TEXT PRIMARY KEY,
+           embedding_text TEXT,
+           embedding TEXT
+         );
+       SQL
+       ActiveRecord::Base.connection.execute(create_posts)
+
+       create_vss_posts = <<-SQL
+         CREATE VIRTUAL TABLE IF NOT EXISTS vss_posts using vss0(
+           post_embedding(#{OpenAiEmbeddings::DIMENSIONS})
+         );
+       SQL
+       ActiveRecord::Base.connection.execute(create_vss_posts)
+
+       Jekyll.logger.debug("ai_related_posts db setup complete")
+     end
+   end
+ end
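The `fetch_enabled` rule in the generator above (string means "only in this environment", boolean means on or off, anything else defaults to on) can be read more easily as a standalone function. This is a sketch under stated assumptions rather than the gem's code: the two-argument `fetch_enabled?(setting, env)` signature is hypothetical and exists only so the rule can be exercised without a Jekyll site.

```ruby
# Hypothetical standalone version of the fetch_enabled rule.
# `setting` is the configured `fetch_enabled` value (String, true, false, or nil);
# `env` is the value of the JEKYLL_ENV environment variable (may be nil).
def fetch_enabled?(setting, env)
  case setting
  when String then env == setting # fetch only in the named environment
  when true, false then setting   # explicit on/off switch
  else true                       # unset or unrecognised: default to fetching
  end
end

puts fetch_enabled?("prod", ENV["JEKYLL_ENV"])
```

The string comparison is exact, so a setting of `prod` only enables fetching when the build runs with `JEKYLL_ENV=prod`. Jekyll's own documentation uses `production` as the conventional value, so the setting and the build environment need to agree.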
data/lib/jekyll_ai_related_posts/models/post.rb ADDED
@@ -0,0 +1,8 @@
+ # frozen_string_literal: true
+
+ module JekyllAiRelatedPosts
+   module Models
+     class Post < ActiveRecord::Base
+     end
+   end
+ end
data/lib/jekyll_ai_related_posts/open_ai_embeddings.rb ADDED
@@ -0,0 +1,38 @@
+ # frozen_string_literal: true
+
+ require "faraday"
+
+ module JekyllAiRelatedPosts
+   class OpenAiEmbeddings
+     DIMENSIONS = 1536
+
+     def initialize(api_key, connection: nil)
+       @connection = if connection.nil?
+         Faraday.new(url: "https://api.openai.com") do |builder|
+           builder.request :authorization, "Bearer", api_key
+           builder.request :json
+           builder.response :json
+           builder.response :raise_error
+         end
+       else
+         connection
+       end
+     end
+
+     def embedding_for(text)
+       res = @connection.post("/v1/embeddings") do |req|
+         req.body = {
+           input: text,
+           model: "text-embedding-3-small"
+         }
+       end
+
+       res.body["data"].first["embedding"]
+     rescue Faraday::Error => e
+       Jekyll.logger.error "Error response from OpenAI API!"
+       Jekyll.logger.error e.inspect
+
+       raise
+     end
+   end
+ end
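The generator selects a `MockEmbeddings` class when `embeddings_source` is `mock`, but no such class appears among the files listed in this release. The sketch below shows one way a stand-in with the same `embedding_for(text)` interface could look. It is entirely hypothetical, not the author's code: the digest-seeded random vectors are an assumption chosen only to make output deterministic and offline.

```ruby
require "digest"

# Hypothetical offline stand-in for an embeddings client.
# It exposes the same interface as OpenAiEmbeddings#embedding_for:
# a string in, an array of floats out.
class MockEmbeddings
  # Assumed to match OpenAiEmbeddings::DIMENSIONS from the diff above.
  DIMENSIONS = 1536

  def embedding_for(text)
    # Seed a private RNG from a digest of the text so the same text
    # always yields the same vector, with no network access.
    seed = Digest::SHA256.hexdigest(text).to_i(16)
    rng = Random.new(seed)
    Array.new(DIMENSIONS) { rng.rand(-1.0..1.0) }
  end
end

vector = MockEmbeddings.new.embedding_for("Title: Hello; Tags: ruby")
puts vector.length
```

Vectors produced this way carry no semantic meaning, so they are only suitable for exercising the caching and query plumbing in tests, not for judging result quality.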
data/lib/jekyll_ai_related_posts/version.rb ADDED
@@ -0,0 +1,5 @@
+ # frozen_string_literal: true
+
+ module JekyllAiRelatedPosts
+   VERSION = "0.1.0"
+ end
data/lib/jekyll_ai_related_posts.rb ADDED
@@ -0,0 +1,13 @@
+ # frozen_string_literal: true
+
+ require_relative "jekyll_ai_related_posts/generator"
+
+ require "zeitwerk"
+ loader = Zeitwerk::Loader.for_gem
+ loader.setup
+
+ module JekyllAiRelatedPosts
+   GEM_ROOT = File.expand_path("..", __dir__)
+
+   class Error < StandardError; end
+ end
metadata ADDED
@@ -0,0 +1,142 @@
+ --- !ruby/object:Gem::Specification
+ name: jekyll_ai_related_posts
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Mike Kasberg
+ autorequire:
+ bindir: exe
+ cert_chain: []
+ date: 2024-04-23 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: activerecord
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '7.1'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '7.1'
+ - !ruby/object:Gem::Dependency
+   name: faraday
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.9'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.9'
+ - !ruby/object:Gem::Dependency
+   name: jekyll
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '3.0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '3.0'
+ - !ruby/object:Gem::Dependency
+   name: sqlite3
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.4'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.4'
+ - !ruby/object:Gem::Dependency
+   name: sqlite-vss
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 0.1.2
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 0.1.2
+ - !ruby/object:Gem::Dependency
+   name: zeitwerk
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.6'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.6'
+ description: Populate ai_related_posts using Open AI embeddings
+ email:
+ - kasberg.mike@gmail.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - ".rspec"
+ - ".rubocop.yml"
+ - CHANGELOG.md
+ - LICENSE.txt
+ - README.md
+ - Rakefile
+ - lib/jekyll_ai_related_posts.rb
+ - lib/jekyll_ai_related_posts/generator.rb
+ - lib/jekyll_ai_related_posts/models/post.rb
+ - lib/jekyll_ai_related_posts/open_ai_embeddings.rb
+ - lib/jekyll_ai_related_posts/version.rb
+ homepage: https://github.com/mkasberg/jekyll_ai_related_posts
+ licenses:
+ - MIT
+ metadata:
+   allowed_push_host: https://rubygems.org
+   homepage_uri: https://github.com/mkasberg/jekyll_ai_related_posts
+   source_code_uri: https://github.com/mkasberg/jekyll_ai_related_posts
+   changelog_uri: https://github.com/mkasberg/jekyll_ai_related_posts
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: 3.0.0
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.5.6
+ signing_key:
+ specification_version: 4
+ summary: Populate ai_related_posts using Open AI embeddings
+ test_files: []