uptriever 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: e53b1a43458a2039d72c38133e6d92e55c4c715c5645d51c6801c4c6c7812ee1
+   data.tar.gz: 69bfee013e194759a5e8db274cf5d5bf690a6cb558fade52e9e5951114b91463
+ SHA512:
+   metadata.gz: 8038fc632bd0b7afd1e715259c1eb5c3d77c5a1f9d72b09517dca1f1afd9b1541d5dce398d1edd1dbe9f310b9577f371d2f87e11bda69af63fe09946691a76d2
+   data.tar.gz: 31b1eae2eae167e0ec1786474e8d771339f3fc055cc5aba4dc467c83d5cb8c9e0aa0d99212bb2e20c99e72ce22334ccdf1326c45ba30557ba68d084557f73774
data/CHANGELOG.md ADDED
@@ -0,0 +1,9 @@
+ # Change log
+
+ ## master
+
+ ## 0.1.0 (2024-07-24)
+
+ - Initial release. ([@palkan][])
+
+ [@palkan]: https://github.com/palkan
data/LICENSE.txt ADDED
@@ -0,0 +1,23 @@
+ Copyright (c) 2024 Vladimir Dementyev
+
+ MIT License
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
data/README.md ADDED
@@ -0,0 +1,98 @@
+ [![Gem Version](https://badge.fury.io/rb/uptriever.svg)](https://rubygems.org/gems/uptriever)
+ [![Build](https://github.com/palkan/uptriever/workflows/Build/badge.svg)](https://github.com/palkan/uptriever/actions)
+
+ # Uptriever
+
+ Uptriever is a CLI for uploading documentation source files (HTML, Markdown) to [Trieve][] for search indexing.
+
+ ## Installation
+
+ Install Uptriever as a Ruby gem (Ruby 3.1+ is required):
+
+ ```sh
+ gem install uptriever
+ ```
+
+ ## Usage
+
+ Currently, Uptriever requires an index configuration file (`.trieve.yml`) in the documentation root folder. The file lists the files to index and their metadata. A minimal example that indexes everything looks as follows:
+
+ ```yml
+ hostname: https://myproject.example/docs
+ pages:
+   - "**/*.md"
+ ```
+
+ The `hostname` field is used to generate the `link` property for chunks (see [Trieve API](https://docs.trieve.ai/api-reference/chunk/create-or-upsert-chunk-or-chunks)).
+
+ The `pages` field contains the list of pages to index. It supports glob patterns.
+
+ With the config in place, you can run the `uptriever` executable to perform the indexing:
+
+ ```sh
+ $ uptriever -d ./docs --api-key=<Trieve API key> --dataset=<Trieve dataset>
+
+ Groups: |===========================|
+ Chunks: |===========================|
+ ```
+
+ ## Full-featured example
+
+ Why do we need a configuration file? To leverage Trieve features such as groups, tags, and weights. Here is a real-life example:
+
+ ```yml
+ # Ignore patterns for globs in pages
+ ignore:
+   - "**/*/Readme.md"
+ hostname: https://docs.anycable.io
+ # Prepend file paths with this prefix.
+ # Useful when you store documentation in multiple sources.
+ url_prefix: anycable-go/
+
+ # Make sure the following chunk groups are created
+ groups:
+   - name: PRO version
+     tracking_id: pro
+   - name: Server
+     tracking_id: server
+   - name: Client
+     tracking_id: client
+   - name: Go package
+     tracking_id: package
+
+ # Default metadata for pages (can be overridden)
+ defaults:
+   groups: ["server"]
+   tags: ["docs"]
+
+ pages:
+   # You can use a dictionary to define source paths
+   # along with metadata
+   - source: "./apollo.md"
+     groups: ["pro", "server"]
+   - source: "./binary_formats.md"
+     groups: ["pro", "server", "client"]
+   - "./broadcasting.md"
+   - "./broker.md"
+   - "./health_checking.md"
+   - "./instrumentation.md"
+   - source: "./library.md"
+     groups: ["package"]
+   - "./pubsub.md"
+   - source: "./js/**/*.md"
+     groups: ["client"]
+ ```
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at [https://github.com/palkan/uptriever](https://github.com/palkan/uptriever).
+
+ ## Credits
+
+ This gem is generated via [`newgem` template](https://github.com/palkan/newgem) by [@palkan](https://github.com/palkan).
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
+
+ [Trieve]: https://trieve.ai
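
The README says `defaults` can be overridden per page. The sketch below illustrates that merge using only the standard library. `page_metadata` is a hypothetical helper written for this note, not part of the gem; it follows the `defaults.merge(...compact)` expression in `lib/uptriever/config.rb`.

```ruby
# Hypothetical helper (not part of the gem): page-level values win,
# anything a page leaves out falls back to the `defaults` section.
def page_metadata(defaults, page)
  # Drop keys the page does not set so they cannot overwrite defaults with nil
  overrides = {groups: page["groups"], tags: page["tags"], weight: page["weight"]}.compact
  defaults.transform_keys(&:to_sym).merge(overrides)
end

defaults = {"groups" => ["server"], "tags" => ["docs"]}

apollo = page_metadata(defaults, {"source" => "./apollo.md", "groups" => ["pro", "server"]})
broker = page_metadata(defaults, {"source" => "./broker.md"})

p apollo
p broker
```

With the config from the full-featured example, `apollo.md` would keep the default `docs` tag but take its own groups, while `broker.md` would inherit both defaults.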
data/bin/uptriever ADDED
@@ -0,0 +1,13 @@
+ #!/usr/bin/env ruby
+
+ require "uptriever/cli"
+
+ begin
+   cli = Uptriever::CLI.new
+   cli.run(ARGV)
+ rescue => e
+   raise e if $DEBUG
+   STDERR.puts e.message
+   STDERR.puts e.backtrace.join("\n")
+   exit 1
+ end
data/lib/uptriever/chunker.rb ADDED
@@ -0,0 +1,46 @@
+ # frozen_string_literal: true
+
+ require "nokogiri"
+
+ module Uptriever
+   # Splits HTML into smaller chunks by h2 headers
+   class Chunker
+     attr_reader :chunk
+
+     def initialize(chunk)
+       @chunk = chunk
+     end
+
+     def chunks
+       doc = Nokogiri::HTML(chunk.fetch(:chunk_html))
+       header = doc.at_css("h1")
+       return [chunk] unless header # no h1: keep the document as one chunk with its HTML intact
+
+       # Root chunks are usually less specific, so make them weigh less
+       root_chunk = chunk_dup.tap {
+         _1[:weight] = 1.5
+         _1[:metadata] = {title: doc.at_css("h1").inner_text}
+       }
+       doc.xpath("//body").children.each_with_object([root_chunk]) do |child, acc|
+         # Start new chunk
+         if child.name == "h2"
+           anchor = child.inner_text.downcase.gsub(/[^a-z0-9]/, "-")
+           acc << chunk_dup.tap {
+             _1.merge!(
+               link: "#{_1.fetch(:link)}?id=#{anchor}",
+               tracking_id: "#{_1.fetch(:tracking_id)}##{anchor}",
+               metadata: {title: child.inner_text}
+             )
+           }
+           next acc
+         end
+
+         acc.last[:chunk_html] << child.to_xhtml
+       end
+     end
+
+     private
+
+     def chunk_dup = chunk.dup.tap { _1[:chunk_html] = +"" }
+   end
+ end
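
The per-section fields that `Chunker#chunks` derives from an `h2` heading can be tried in isolation. The following is a stdlib-only sketch under stated assumptions: `section_chunk` is a hypothetical helper, the sample link and tracking ID are made up, and no HTML parsing takes place.

```ruby
# Hypothetical helper (not part of the gem) mirroring how the chunker
# builds the link, tracking ID and title for one h2 section.
def section_chunk(base, heading)
  # Same slug rule as the chunker: lowercase the heading, then turn every
  # character outside a-z and 0-9 into a dash.
  anchor = heading.downcase.gsub(/[^a-z0-9]/, "-")

  base.merge(
    link: "#{base.fetch(:link)}?id=#{anchor}",
    tracking_id: "#{base.fetch(:tracking_id)}##{anchor}",
    metadata: {title: heading},
    chunk_html: +"" # each section starts with an empty, mutable HTML buffer
  )
end

base = {link: "https://myproject.example/docs/broker", tracking_id: "broker", weight: 1.0}
section = section_chunk(base, "Quick start")

puts section[:link]
puts section[:tracking_id]
```

Because each non-alphanumeric character maps to its own dash, a heading with adjacent punctuation and spaces would produce consecutive dashes in the anchor.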
data/lib/uptriever/cli.rb ADDED
@@ -0,0 +1,52 @@
+ # frozen_string_literal: true
+
+ require "uptriever"
+ require "ruby-progressbar"
+ require "optparse"
+
+ module Uptriever
+   class CLI
+     def run(args = [])
+       @docs_dir = File.join(Dir.pwd, "docs")
+       @api_key = ENV["TRIEVE_API_KEY"]
+       @dataset = ENV["TRIEVE_DATASET"]
+       @dry_run = false
+
+       # Parse command-line options: --dir, --api-key, --dataset, --dry-run
+       OptionParser.new do |opts|
+         opts.banner = "Usage: uptriever [options]"
+         opts.on("-d", "--dir DIR", "Directory with documents") do |dir|
+           @docs_dir = dir
+         end
+         opts.on("-k", "--api-key API_KEY", "Trieve API key") do |key|
+           @api_key = key
+         end
+         opts.on("-s", "--dataset DATASET", "Trieve dataset") do |dataset|
+           @dataset = dataset
+         end
+         opts.on("--dry-run", "Dry run mode") do
+           @dry_run = true
+         end
+       end.parse!(args)
+
+       config = Config.new(@docs_dir)
+       client = Client.new(@api_key, @dataset, dry_run: @dry_run)
+
+       groups = config.groups
+       if groups.any?
+         progressbar = ProgressBar.create(title: "Groups", total: groups.size)
+         groups.each do
+           client.push_group(_1)
+           progressbar.increment
+         end
+       end
+
+       chunks = config.documents.flat_map { Chunker.new(_1.to_chunk_json).chunks }
+       progressbar = ProgressBar.create(title: "Chunks", total: chunks.size)
+       chunks.each do
+         client.push_chunk(_1)
+         progressbar.increment
+       end
+     end
+   end
+ end
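
The option handling in `CLI#run` reads defaults from the environment and lets flags override them. Here is a minimal sketch of that precedence, assuming a hypothetical `parse_options` helper that returns a hash instead of setting instance variables. The flag names are taken from the source above; the `env:` parameter exists only to make the example testable.

```ruby
require "optparse"

# Hypothetical standalone parser (not part of the gem): environment values
# are the defaults, command-line flags override them.
def parse_options(args, env: ENV)
  options = {
    dir: File.join(Dir.pwd, "docs"),
    api_key: env["TRIEVE_API_KEY"],
    dataset: env["TRIEVE_DATASET"],
    dry_run: false
  }

  OptionParser.new do |opts|
    opts.banner = "Usage: uptriever [options]"
    opts.on("-d", "--dir DIR", "Directory with documents") { |dir| options[:dir] = dir }
    opts.on("-k", "--api-key API_KEY", "Trieve API key") { |key| options[:api_key] = key }
    opts.on("-s", "--dataset DATASET", "Trieve dataset") { |dataset| options[:dataset] = dataset }
    opts.on("--dry-run", "Dry run mode") { options[:dry_run] = true }
  end.parse!(args)

  options
end

options = parse_options(["-d", "./docs", "--dry-run"], env: {"TRIEVE_API_KEY" => "from-env"})
p options
```

In this run the API key comes from the (fake) environment, the directory from the flag, and the dataset stays `nil` because neither source provides it.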
data/lib/uptriever/client.rb ADDED
@@ -0,0 +1,59 @@
+ # frozen_string_literal: true
+
+ require "net/http"
+ require "json"
+
+ module Uptriever
+   class Client
+     BASE_URL = "https://api.trieve.ai/api"
+
+     attr_reader :headers
+     private attr_reader :dry_run
+
+     def initialize(api_key, dataset, dry_run: false)
+       @dry_run = dry_run
+       @headers = {
+         "Authorization" => api_key,
+         "TR-Dataset" => dataset
+       }.freeze
+     end
+
+     def push_group(group, upsert: true)
+       group[:upsert_by_tracking_id] = upsert
+       perform_request("/chunk_group", group.to_json)
+     end
+
+     def push_chunk(chunk, upsert: true)
+       chunk[:upsert_by_tracking_id] = upsert
+       perform_request("/chunk", chunk.to_json)
+     end
+
+     private
+
+     def perform_request(path, data)
+       uri = URI.parse(BASE_URL + path)
+
+       http = Net::HTTP.new(uri.host, uri.port)
+       http.use_ssl = true if uri.scheme == "https"
+
+       request = Net::HTTP::Post.new(
+         uri.request_uri,
+         headers.merge("Content-Type" => "application/json")
+       )
+       request.body = data
+
+       if dry_run
+         puts "[DRY RUN] Perform POST #{path}: #{data}"
+         return
+       end
+
+       response = http.request(request)
+
+       if response.code.to_i != 200
+         raise "Invalid response code: #{response.code} (#{response.body[0...100]})"
+       end
+
+       JSON.parse(response.body)
+     end
+   end
+ end
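
To see what `Client#perform_request` would send without touching the network, the request can be built and inspected on its own. This is a sketch, not the gem's API: `build_request` is a hypothetical helper, and the key, dataset and payload values are placeholders.

```ruby
require "net/http"
require "json"
require "uri"

# Base URL as used by the client above
BASE_URL = "https://api.trieve.ai/api"

# Hypothetical helper (not part of the gem): builds the POST request
# for a chunk or group payload but never sends it.
def build_request(path, payload, api_key:, dataset:)
  uri = URI.parse(BASE_URL + path)

  request = Net::HTTP::Post.new(
    uri.request_uri,
    {
      "Authorization" => api_key,
      "TR-Dataset" => dataset,
      "Content-Type" => "application/json"
    }
  )
  # The client always asks Trieve to upsert by tracking ID unless told otherwise
  request.body = payload.merge(upsert_by_tracking_id: true).to_json
  request
end

request = build_request(
  "/chunk",
  {tracking_id: "broker", chunk_html: "<p>Hi</p>"},
  api_key: "test-key",
  dataset: "test-dataset"
)

puts request.path
puts request["TR-Dataset"]
puts request.body
```

The CLI's `--dry-run` flag serves a similar purpose: it prints the path and JSON body instead of performing the request.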
data/lib/uptriever/config.rb ADDED
@@ -0,0 +1,59 @@
+ # frozen_string_literal: true
+
+ require "yaml"
+
+ module Uptriever
+   class Config
+     def self.parse(path) = new(path).documents
+
+     attr_reader :config_path, :root_dir
+
+     def initialize(root_dir)
+       @root_dir = root_dir
+       @config_path = File.join(root_dir, ".trieve.yml")
+       raise ArgumentError, ".trieve.yml is missing in the #{root_dir}" unless File.file?(config_path)
+     end
+
+     def groups
+       config["groups"] || []
+     end
+
+     def documents
+       pages = unwrap_pages(config["pages"])
+
+       defaults = (config["defaults"] || {}).transform_keys(&:to_sym)
+
+       pages.filter_map do |page|
+         next if config["ignore"]&.any? { File.fnmatch?(_1, page["source"]) }
+
+         relative_link = page["source"].sub(File.expand_path(root_dir), "").sub(/\.[^\.]+$/, "").then do
+           next _1 unless config["url_prefix"]
+           File.join(config["url_prefix"], _1)
+         end
+
+         link = page["link"] || File.join(config.fetch("hostname"), relative_link)
+         id = page["id"] || relative_link.sub(/^\//, "").gsub(/[\/-]/, "-")
+
+         Document.new(id, page["source"], link, **defaults.merge({groups: page["groups"], tags: page["tags"], weight: page["weight"]}.compact))
+       end
+     end
+
+     private
+
+     def config = @config ||= YAML.load_file(config_path)
+
+     def unwrap_pages(items)
+       items.flat_map do |item|
+         if item.is_a?(String)
+           Dir.glob(File.expand_path(File.join(root_dir, item))).map { {"source" => _1} }
+         else
+           Dir.glob(File.expand_path(File.join(root_dir, item.fetch("source")))).map do
+             new_item = item.dup
+             new_item["source"] = _1
+             new_item
+           end
+         end
+       end
+     end
+   end
+ end
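
The way `Config#documents` turns a source path into a chunk `link` and tracking `id` is easier to follow with concrete values. Below is a stdlib-only sketch: `link_and_id` is a hypothetical helper, and the paths are invented for illustration. It assumes the source path is absolute and starts with the root directory.

```ruby
# Hypothetical helper (not part of the gem) mirroring the link/id
# derivation: strip the root directory and extension, add the optional
# URL prefix, then join with the hostname.
def link_and_id(source, root_dir:, hostname:, url_prefix: nil)
  relative = source.sub(root_dir, "").sub(/\.[^.]+$/, "")
  relative = File.join(url_prefix, relative) if url_prefix

  link = File.join(hostname, relative)
  # The ID is the relative link without the leading slash,
  # with path separators flattened to dashes.
  id = relative.sub(%r{^/}, "").gsub(%r{[/-]}, "-")

  [link, id]
end

link, id = link_and_id(
  "/site/docs/js/install.md",
  root_dir: "/site/docs",
  hostname: "https://docs.anycable.io",
  url_prefix: "anycable-go/"
)

puts link
puts id
```

`File.join` collapses the duplicate slash between `anycable-go/` and `/js/install`, which is why the prefix can be written with a trailing slash in `.trieve.yml`.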
data/lib/uptriever/document.rb ADDED
@@ -0,0 +1,42 @@
+ # frozen_string_literal: true
+
+ require "redcarpet"
+
+ module Uptriever
+   class Document
+     attr_reader :id, :path, :link, :tags, :groups, :weight
+
+     def initialize(id, path, link, tags: nil, groups: nil, weight: 1.0)
+       @id = id
+       @path = path
+       @link = link
+       @tags = tags
+       @groups = groups
+       @weight = weight
+     end
+
+     def to_html
+       case File.extname(path)
+       when ".md"
+         markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: true, tables: true)
+         markdown.render(File.read(path))
+       when ".html"
+         File.read(path)
+       else
+         raise ArgumentError, "Unsupported file type: #{path}"
+       end
+     end
+
+     def to_chunk_json
+       {
+         chunk_html: to_html,
+         link:,
+         tracking_id: id,
+         weight:
+       }.tap do
+         _1.merge!(tag_set: tags) if tags
+         _1.merge!(group_tracking_ids: groups) if groups
+       end
+     end
+   end
+ end
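
`Document#to_chunk_json` adds `tag_set` and `group_tracking_ids` only when tags or groups are present. The sketch below shows that shape without reading files or rendering Markdown. `chunk_payload` is a hypothetical stand-in, and the HTML string is a placeholder.

```ruby
# Hypothetical stand-in (not part of the gem) for Document#to_chunk_json:
# optional keys are added only when the corresponding value is set.
def chunk_payload(id:, link:, html:, weight: 1.0, tags: nil, groups: nil)
  {chunk_html: html, link: link, tracking_id: id, weight: weight}.tap do |payload|
    payload[:tag_set] = tags if tags
    payload[:group_tracking_ids] = groups if groups
  end
end

plain = chunk_payload(id: "broker", link: "https://docs.anycable.io/broker", html: "<h1>Broker</h1>")
tagged = chunk_payload(
  id: "apollo",
  link: "https://docs.anycable.io/apollo",
  html: "<h1>Apollo</h1>",
  tags: ["docs"],
  groups: ["pro", "server"]
)

p plain.keys
p tagged.keys
```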
data/lib/uptriever/version.rb ADDED
@@ -0,0 +1,5 @@
+ # frozen_string_literal: true
+
+ module Uptriever # :nodoc:
+   VERSION = "0.0.1"
+ end
data/lib/uptriever.rb ADDED
@@ -0,0 +1,10 @@
+ # frozen_string_literal: true
+
+ require "uptriever/version"
+
+ module Uptriever
+   autoload :Chunker, "uptriever/chunker"
+   autoload :Client, "uptriever/client"
+   autoload :Config, "uptriever/config"
+   autoload :Document, "uptriever/document"
+ end
metadata ADDED
@@ -0,0 +1,214 @@
+ --- !ruby/object:Gem::Specification
+ name: uptriever
+ version: !ruby/object:Gem::Version
+   version: 0.0.1
+ platform: ruby
+ authors:
+ - Vladimir Dementyev
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2024-07-24 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: redcarpet
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.6'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.6'
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: yaml
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: ruby-progressbar
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: uri
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: json
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: optparse
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '1.15'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '1.15'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '13.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '13.0'
+ - !ruby/object:Gem::Dependency
+   name: minitest
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '5.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '5.0'
+ - !ruby/object:Gem::Dependency
+   name: webmock
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.23'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.23'
+ description: Upload documents to Trieve
+ email:
+ - Vladimir Dementyev
+ executables:
+ - uptriever
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - CHANGELOG.md
+ - LICENSE.txt
+ - README.md
+ - bin/uptriever
+ - lib/uptriever.rb
+ - lib/uptriever/chunker.rb
+ - lib/uptriever/cli.rb
+ - lib/uptriever/client.rb
+ - lib/uptriever/config.rb
+ - lib/uptriever/document.rb
+ - lib/uptriever/version.rb
+ homepage: https://github.com/palkan/uptriever
+ licenses:
+ - MIT
+ metadata:
+   bug_tracker_uri: https://github.com/palkan/uptriever/issues
+   changelog_uri: https://github.com/palkan/uptriever/blob/master/CHANGELOG.md
+   documentation_uri: https://github.com/palkan/uptriever
+   homepage_uri: https://github.com/palkan/uptriever
+   source_code_uri: https://github.com/palkan/uptriever
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '3.1'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.4.20
+ signing_key:
+ specification_version: 4
+ summary: Upload documents to Trieve
+ test_files: []