uptriever 0.0.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: e53b1a43458a2039d72c38133e6d92e55c4c715c5645d51c6801c4c6c7812ee1
+   data.tar.gz: 69bfee013e194759a5e8db274cf5d5bf690a6cb558fade52e9e5951114b91463
+ SHA512:
+   metadata.gz: 8038fc632bd0b7afd1e715259c1eb5c3d77c5a1f9d72b09517dca1f1afd9b1541d5dce398d1edd1dbe9f310b9577f371d2f87e11bda69af63fe09946691a76d2
+   data.tar.gz: 31b1eae2eae167e0ec1786474e8d771339f3fc055cc5aba4dc467c83d5cb8c9e0aa0d99212bb2e20c99e72ce22334ccdf1326c45ba30557ba68d084557f73774
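The digests in `checksums.yaml` can be compared against the files of an unpacked gem. The following is a minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted into one directory; the helper name `verify_checksums` and that directory layout are illustrative and not part of the gem.

```ruby
require "digest"
require "tmpdir"

# Hypothetical helper: compares the SHA256 digests from a checksums.yaml-style
# hash against the files in `dir` and returns the names that do not match.
def verify_checksums(checksums, dir)
  checksums.fetch("SHA256").reject do |name, expected|
    Digest::SHA256.file(File.join(dir, name)).hexdigest == expected
  end.keys
end

# Demo with a throwaway file instead of a real gem archive.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "metadata.gz"), "demo")
  expected = Digest::SHA256.hexdigest("demo")
  puts verify_checksums({"SHA256" => {"metadata.gz" => expected}}, dir).inspect
end
```

An empty result means every listed file matched its recorded digest.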
data/CHANGELOG.md ADDED
@@ -0,0 +1,9 @@
+ # Change log
+
+ ## master
+
+ ## 0.1.0 (2024-07-24)
+
+ - Initial release. ([@palkan][])
+
+ [@palkan]: https://github.com/palkan
data/LICENSE.txt ADDED
@@ -0,0 +1,23 @@
+ Copyright (c) 2024 Vladimir Dementyev
+
+ MIT License
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
data/README.md ADDED
@@ -0,0 +1,98 @@
+ [![Gem Version](https://badge.fury.io/rb/uptriever.svg)](https://rubygems.org/gems/uptriever)
+ [![Build](https://github.com/palkan/uptriever/workflows/Build/badge.svg)](https://github.com/palkan/uptriever/actions)
+
+ # Uptriever
+
+ Uptriever is a CLI to upload documentation source files (HTML, Markdown) to [Trieve][] for search indexing.
+
+ ## Installation
+
+ Install Uptriever as a Ruby gem (Ruby 3.1+ is required):
+
+ ```sh
+ gem install uptriever
+ ```
+
+ ## Usage
+
+ Currently, Uptriever requires an index configuration file (`.trieve.yml`) in the documentation root folder. The file lists the files to index and their metadata. A minimal example of indexing everything looks as follows:
+
+ ```yml
+ hostname: https://myproject.example/docs
+ pages:
+   - "**/*.md"
+ ```
+
+ The `hostname` field is used to generate the `link` property for chunks (see [Trieve API](https://docs.trieve.ai/api-reference/chunk/create-or-upsert-chunk-or-chunks)).
+
+ The `pages` field contains the list of pages to index. It supports glob patterns.
+
+ With the config in place, you can run the `uptriever` executable to perform the indexing:
+
+ ```sh
+ $ uptriever -d ./docs --api-key=<Trieve API key> --dataset=<Trieve dataset>
+
+ Groups: |===========================|
+ Chunks: |===========================|
+ ```
+
+ ## Full-featured example
+
+ Why do we need a configuration file? To leverage Trieve features such as groups, tags, and weights. Here is a real-life example:
+
+ ```yml
+ # Ignore patterns for globs in pages
+ ignore:
+   - "**/*/Readme.md"
+ hostname: https://docs.anycable.io
+ # Prepend file paths with this prefix.
+ # Useful when you store documentation in multiple sources.
+ url_prefix: anycable-go/
+
+ # Make sure the following chunk groups are created
+ groups:
+   - name: PRO version
+     tracking_id: pro
+   - name: Server
+     tracking_id: server
+   - name: Client
+     tracking_id: client
+   - name: Go package
+     tracking_id: package
+
+ # Default metadata for pages (can be overridden)
+ defaults:
+   groups: ["server"]
+   tags: ["docs"]
+
+ pages:
+   # You can use a dictionary to define source paths
+   # along with metadata
+   - source: "./apollo.md"
+     groups: ["pro", "server"]
+   - source: "./binary_formats.md"
+     groups: ["pro", "server", "client"]
+   - "./broadcasting.md"
+   - "./broker.md"
+   - "./health_checking.md"
+   - "./instrumentation.md"
+   - source: "./library.md"
+     groups: ["package"]
+   - "./pubsub.md"
+   - source: "./js/**/*.md"
+     groups: ["client"]
+ ```
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at [https://github.com/palkan/uptriever](https://github.com/palkan/uptriever).
+
+ ## Credits
+
+ This gem is generated via [`newgem` template](https://github.com/palkan/newgem) by [@palkan](https://github.com/palkan).
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
+
+ [Trieve]: https://trieve.ai
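To make the configuration format from the README concrete, here is a minimal sketch of how entries under `pages` could be normalized: plain strings become dictionaries with a `source` key, and values from `defaults` fill in whatever a page does not set itself. The helper name `normalize_pages` is hypothetical, and the sketch deliberately skips glob expansion and `ignore` handling.

```ruby
require "yaml"

CONFIG = YAML.safe_load(<<~YML)
  hostname: https://docs.anycable.io
  defaults:
    groups: ["server"]
    tags: ["docs"]
  pages:
    - source: "./apollo.md"
      groups: ["pro", "server"]
    - "./broker.md"
YML

# Hypothetical helper: turns every page entry into a hash and applies defaults.
# Page-level values win over defaults because they are merged last.
def normalize_pages(config)
  defaults = config.fetch("defaults", {})
  config.fetch("pages").map do |page|
    page = {"source" => page} if page.is_a?(String)
    defaults.merge(page)
  end
end

normalize_pages(CONFIG).each { |page| puts page.inspect }
```

In this sketch `./apollo.md` keeps its own groups and inherits only the default tags, while `./broker.md` inherits both.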
data/bin/uptriever ADDED
@@ -0,0 +1,13 @@
+ #!/usr/bin/env ruby
+
+ require "uptriever/cli"
+
+ begin
+   cli = Uptriever::CLI.new
+   cli.run(ARGV)
+ rescue => e
+   raise e if $DEBUG
+   STDERR.puts e.message
+   STDERR.puts e.backtrace.join("\n")
+   exit 1
+ end
data/lib/uptriever/chunker.rb ADDED
@@ -0,0 +1,46 @@
+ # frozen_string_literal: true
+
+ require "nokogiri"
+
+ module Uptriever
+   # Splits HTML into smaller chunks by h2 headers
+   class Chunker
+     attr_reader :chunk
+
+     def initialize(chunk)
+       @chunk = chunk
+     end
+
+     def chunks
+       doc = Nokogiri::HTML(chunk.fetch(:chunk_html))
+       header = doc.at_css("h1")
+       return [chunk_dup] unless header
+
+       # Root chunks are usually less specific, so make them weigh less
+       root_chunk = chunk_dup.tap {
+         _1[:weight] = 1.5
+         _1[:metadata] = {title: doc.at_css("h1").inner_text}
+       }
+       doc.xpath("//body").children.each_with_object([root_chunk]) do |child, acc|
+         # Start new chunk
+         if child.name == "h2"
+           anchor = child.inner_text.downcase.gsub(/[^a-z0-9]/, "-")
+           acc << chunk_dup.tap {
+             _1.merge!(
+               link: "#{_1.fetch(:link)}?id=#{anchor}",
+               tracking_id: "#{_1.fetch(:tracking_id)}##{anchor}",
+               metadata: {title: child.inner_text}
+             )
+           }
+           next acc
+         end
+
+         acc.last[:chunk_html] << child.to_xhtml
+       end
+     end
+
+     private
+
+     def chunk_dup = chunk.dup.tap { _1[:chunk_html] = +"" }
+   end
+ end
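The splitting logic in the chunker is easier to follow without the HTML parsing. The sketch below replays the same `each_with_object` accumulation over a plain array of `[tag, text]` pairs, which stands in (as an assumption for illustration) for the Nokogiri nodes. `split_by_h2` and the `docs.example` URL are hypothetical, and the root weight and title metadata are left out.

```ruby
# Stand-in for parsed body children: [tag name, text] pairs.
NODES = [
  ["h1", "Broker"],
  ["p", "Intro text."],
  ["h2", "Quick Start"],
  ["p", "Run the server."]
].freeze

# Hypothetical helper mirroring the accumulation in Chunker#chunks:
# every h2 opens a new chunk whose link and tracking_id carry an anchor
# derived from the header text; other nodes are appended to the last chunk.
def split_by_h2(nodes, link:, tracking_id:)
  root = {link: link, tracking_id: tracking_id, text: +""}
  nodes.each_with_object([root]) do |(tag, text), acc|
    if tag == "h2"
      anchor = text.downcase.gsub(/[^a-z0-9]/, "-")
      acc << {link: "#{link}?id=#{anchor}", tracking_id: "#{tracking_id}##{anchor}", text: +""}
      next
    end
    acc.last[:text] << text
  end
end

split_by_h2(NODES, link: "https://docs.example/broker", tracking_id: "broker").each do |chunk|
  puts chunk.inspect
end
```

As in the original code, the h2 text itself is not appended to the new chunk's body; it only shapes the anchor.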
data/lib/uptriever/cli.rb ADDED
@@ -0,0 +1,52 @@
+ # frozen_string_literal: true
+
+ require "uptriever"
+ require "ruby-progressbar"
+ require "optparse"
+
+ module Uptriever
+   class CLI
+     def run(args = [])
+       @docs_dir = File.join(Dir.pwd, "docs")
+       @api_key = ENV["TRIEVE_API_KEY"]
+       @dataset = ENV["TRIEVE_DATASET"]
+       @dry_run = false
+
+       # Parse options: --dir, --api-key, --dataset, --dry-run
+       OptionParser.new do |opts|
+         opts.banner = "Usage: uptriever [options]"
+         opts.on("-d", "--dir DIR", "Directory with documents") do |dir|
+           @docs_dir = dir
+         end
+         opts.on("-k", "--api-key API_KEY", "Trieve API key") do |key|
+           @api_key = key
+         end
+         opts.on("-s", "--dataset DATASET", "Trieve dataset") do |dataset|
+           @dataset = dataset
+         end
+         opts.on("--dry-run", "Dry run mode") do
+           @dry_run = true
+         end
+       end.parse!(args)
+
+       config = Config.new(@docs_dir)
+       client = Client.new(@api_key, @dataset, dry_run: @dry_run)
+
+       groups = config.groups
+       if groups.any?
+         progressbar = ProgressBar.create(title: "Groups", total: groups.size)
+         groups.each do
+           client.push_group(_1)
+           progressbar.increment
+         end
+       end
+
+       chunks = config.documents.flat_map { Chunker.new(_1.to_chunk_json).chunks }
+       progressbar = ProgressBar.create(title: "Chunks", total: chunks.size)
+       chunks.each do
+         client.push_chunk(_1)
+         progressbar.increment
+       end
+     end
+   end
+ end
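The option handling in the CLI follows a common pattern: start from environment defaults, then let command-line flags override them. Here is a small stand-alone sketch of that pattern using the same flags; the `parse_options` helper and its injectable `env` argument are illustrative additions, not part of the gem.

```ruby
require "optparse"

# Hypothetical helper: returns the effective settings for the given argv.
# `env` is injectable so the behaviour can be checked without touching ENV.
def parse_options(argv, env: ENV)
  options = {
    dir: File.join(Dir.pwd, "docs"),
    api_key: env["TRIEVE_API_KEY"],
    dataset: env["TRIEVE_DATASET"],
    dry_run: false
  }

  OptionParser.new do |opts|
    opts.banner = "Usage: uptriever [options]"
    opts.on("-d", "--dir DIR", "Directory with documents") { |v| options[:dir] = v }
    opts.on("-k", "--api-key API_KEY", "Trieve API key") { |v| options[:api_key] = v }
    opts.on("-s", "--dataset DATASET", "Trieve dataset") { |v| options[:dataset] = v }
    opts.on("--dry-run", "Dry run mode") { options[:dry_run] = true }
  end.parse!(argv.dup)

  options
end

p parse_options(%w[-d ./docs --dataset=docs --dry-run], env: {"TRIEVE_API_KEY" => "from-env"})
```

Here the API key comes from the (fake) environment because no `--api-key` flag is given, while the directory and dataset come from the flags.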
data/lib/uptriever/client.rb ADDED
@@ -0,0 +1,59 @@
+ # frozen_string_literal: true
+
+ require "net/http"
+ require "json"
+
+ module Uptriever
+   class Client
+     BASE_URL = "https://api.trieve.ai/api"
+
+     attr_reader :headers
+     private attr_reader :dry_run
+
+     def initialize(api_key, dataset, dry_run: false)
+       @dry_run = dry_run
+       @headers = {
+         "Authorization" => api_key,
+         "TR-Dataset" => dataset
+       }.freeze
+     end
+
+     def push_group(group, upsert: true)
+       group[:upsert_by_tracking_id] = upsert
+       perform_request("/chunk_group", group.to_json)
+     end
+
+     def push_chunk(chunk, upsert: true)
+       chunk[:upsert_by_tracking_id] = upsert
+       perform_request("/chunk", chunk.to_json)
+     end
+
+     private
+
+     def perform_request(path, data)
+       uri = URI.parse(BASE_URL + path)
+
+       http = Net::HTTP.new(uri.host, uri.port)
+       http.use_ssl = true if uri.scheme == "https"
+
+       request = Net::HTTP::Post.new(
+         uri.request_uri,
+         headers.merge("Content-Type" => "application/json")
+       )
+       request.body = data
+
+       if dry_run
+         puts "[DRY RUN] Perform POST #{path}: #{data}"
+         return
+       end
+
+       response = http.request(request)
+
+       if response.code.to_i != 200
+         raise "Invalid response code: #{response.code} (#{response.body[0...100]})"
+       end
+
+       JSON.parse(response.body)
+     end
+   end
+ end
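To see what the client puts on the wire, the request can be built without being sent. The sketch below assembles the same kind of `Net::HTTP::Post` object as `perform_request` does; `build_request`, the placeholder API key, and the dataset name are assumptions made for the example.

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical helper: builds the POST request without performing it.
def build_request(path, payload, api_key:, dataset:)
  uri = URI.parse("https://api.trieve.ai/api" + path)
  request = Net::HTTP::Post.new(
    uri.request_uri,
    {
      "Authorization" => api_key,
      "TR-Dataset" => dataset,
      "Content-Type" => "application/json"
    }
  )
  # Mirrors push_chunk/push_group, which set this flag before serializing.
  request.body = payload.merge(upsert_by_tracking_id: true).to_json
  request
end

request = build_request(
  "/chunk",
  {tracking_id: "broker", chunk_html: "<p>Hi</p>"},
  api_key: "placeholder-key",
  dataset: "my-dataset"
)
puts request.path
puts request["TR-Dataset"]
puts request.body
```

Inspecting the request this way is close to what the `--dry-run` flag offers, which prints the path and body instead of sending them.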
data/lib/uptriever/config.rb ADDED
@@ -0,0 +1,59 @@
+ # frozen_string_literal: true
+
+ require "yaml"
+
+ module Uptriever
+   class Config
+     def self.parse(path) = new(path).documents
+
+     attr_reader :config_path, :root_dir
+
+     def initialize(root_dir)
+       @root_dir = File.expand_path(root_dir) # globbed sources are absolute, so the root must be too
+       @config_path = File.join(@root_dir, ".trieve.yml")
+       raise ArgumentError, ".trieve.yml is missing in #{@root_dir}" unless File.file?(config_path)
+     end
+
+     def groups
+       config["groups"] || []
+     end
+
+     def documents
+       pages = unwrap_pages(config["pages"])
+
+       defaults = (config["defaults"] || {}).transform_keys(&:to_sym)
+
+       pages.filter_map do |page|
+         next if config["ignore"]&.any? { File.fnmatch?(_1, page["source"]) }
+
+         relative_link = page["source"].sub(root_dir, "").sub(/\.[^\.]+$/, "").then do
+           next _1 unless config["url_prefix"]
+           File.join(config["url_prefix"], _1)
+         end
+
+         link = page["link"] || File.join(config.fetch("hostname"), relative_link)
+         id = page["id"] || relative_link.sub(/^\//, "").gsub(/[\/-]/, "-")
+
+         Document.new(id, page["source"], link, **defaults.merge({groups: page["groups"], tags: page["tags"], weight: page["weight"]}.compact))
+       end
+     end
+
+     private
+
+     def config = @config ||= YAML.load_file(config_path)
+
+     def unwrap_pages(items)
+       items.flat_map do |item|
+         if item.is_a?(String)
+           Dir.glob(File.expand_path(File.join(root_dir, item))).map { {"source" => _1} }
+         else
+           Dir.glob(File.expand_path(File.join(root_dir, item.fetch("source")))).map do
+             new_item = item.dup
+             new_item["source"] = _1
+             new_item
+           end
+         end
+       end
+     end
+   end
+ end
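The way a source path turns into a chunk `link` and tracking `id` is the least obvious part of the config handling, so here is a worked sketch of just that derivation. `link_and_id` is a hypothetical helper, the `/site/docs` paths are made up, and the sketch assumes both the source path and the root directory are absolute.

```ruby
# Hypothetical helper reproducing the link and id derivation from Config#documents.
def link_and_id(source, root_dir:, hostname:, url_prefix: nil)
  # Strip the root directory and the file extension.
  relative = source.sub(root_dir, "").sub(/\.[^.]+$/, "")
  # Optionally prepend the configured prefix.
  relative = File.join(url_prefix, relative) if url_prefix

  link = File.join(hostname, relative)
  # Drop the leading slash and flatten path separators into dashes.
  id = relative.sub(%r{^/}, "").gsub(%r{[/-]}, "-")
  [link, id]
end

p link_and_id(
  "/site/docs/guides/getting-started.md",
  root_dir: "/site/docs",
  hostname: "https://myproject.example/docs"
)
p link_and_id(
  "/site/docs/broker.md",
  root_dir: "/site/docs",
  hostname: "https://docs.anycable.io",
  url_prefix: "anycable-go/"
)
```

`File.join` collapses the duplicate separator where the prefix and the relative path meet, which is why `anycable-go/` and `/broker` combine cleanly.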
data/lib/uptriever/document.rb ADDED
@@ -0,0 +1,42 @@
+ # frozen_string_literal: true
+
+ require "redcarpet"
+
+ module Uptriever
+   class Document
+     attr_reader :id, :path, :link, :tags, :groups, :weight
+
+     def initialize(id, path, link, tags: nil, groups: nil, weight: 1.0)
+       @id = id
+       @path = path
+       @link = link
+       @tags = tags
+       @groups = groups
+       @weight = weight
+     end
+
+     def to_html
+       case File.extname(path)
+       when ".md"
+         markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: true, tables: true)
+         markdown.render(File.read(path))
+       when ".html"
+         File.read(path)
+       else
+         raise ArgumentError, "Unsupported file type: #{path}"
+       end
+     end
+
+     def to_chunk_json
+       {
+         chunk_html: to_html,
+         link:,
+         tracking_id: id,
+         weight:
+       }.tap do
+         _1.merge!(tag_set: tags) if tags
+         _1.merge!(group_tracking_ids: groups) if groups
+       end
+     end
+   end
+ end
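The payload shape produced by `to_chunk_json` can be illustrated without reading files or rendering Markdown. In this sketch, `chunk_payload` is a hypothetical stand-in that takes the HTML as an argument; the key names follow the code above, while the sample values are invented.

```ruby
require "json"

# Hypothetical helper mirroring Document#to_chunk_json: optional keys are only
# added when a value is present, so the payload never carries nil lists.
def chunk_payload(id:, link:, html:, weight: 1.0, tags: nil, groups: nil)
  {chunk_html: html, link: link, tracking_id: id, weight: weight}.tap do |payload|
    payload[:tag_set] = tags if tags
    payload[:group_tracking_ids] = groups if groups
  end
end

puts JSON.pretty_generate(
  chunk_payload(
    id: "broker",
    link: "https://docs.example/broker",
    html: "<h1>Broker</h1>",
    groups: ["server"]
  )
)
```

Because no tags are passed here, the generated JSON has a `group_tracking_ids` key but no `tag_set` key at all.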
data/lib/uptriever/version.rb ADDED
@@ -0,0 +1,5 @@
+ # frozen_string_literal: true
+
+ module Uptriever # :nodoc:
+   VERSION = "0.0.1"
+ end
data/lib/uptriever.rb ADDED
@@ -0,0 +1,10 @@
+ # frozen_string_literal: true
+
+ require "uptriever/version"
+
+ module Uptriever
+   autoload :Chunker, "uptriever/chunker"
+   autoload :Client, "uptriever/client"
+   autoload :Config, "uptriever/config"
+   autoload :Document, "uptriever/document"
+ end
metadata ADDED
@@ -0,0 +1,214 @@
+ --- !ruby/object:Gem::Specification
+ name: uptriever
+ version: !ruby/object:Gem::Version
+   version: 0.0.1
+ platform: ruby
+ authors:
+ - Vladimir Dementyev
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2024-07-24 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: redcarpet
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.6'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.6'
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: yaml
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: ruby-progressbar
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: uri
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: json
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: optparse
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '1.15'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '1.15'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '13.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '13.0'
+ - !ruby/object:Gem::Dependency
+   name: minitest
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '5.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '5.0'
+ - !ruby/object:Gem::Dependency
+   name: webmock
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.23'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.23'
+ description: Upload documents to Trieve
+ email:
+ - Vladimir Dementyev
+ executables:
+ - uptriever
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - CHANGELOG.md
+ - LICENSE.txt
+ - README.md
+ - bin/uptriever
+ - lib/uptriever.rb
+ - lib/uptriever/chunker.rb
+ - lib/uptriever/cli.rb
+ - lib/uptriever/client.rb
+ - lib/uptriever/config.rb
+ - lib/uptriever/document.rb
+ - lib/uptriever/version.rb
+ homepage: https://github.com/palkan/uptriever
+ licenses:
+ - MIT
+ metadata:
+   bug_tracker_uri: https://github.com/palkan/uptriever/issues
+   changelog_uri: https://github.com/palkan/uptriever/blob/master/CHANGELOG.md
+   documentation_uri: https://github.com/palkan/uptriever
+   homepage_uri: https://github.com/palkan/uptriever
+   source_code_uri: https://github.com/palkan/uptriever
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '3.1'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.4.20
+ signing_key:
+ specification_version: 4
+ summary: Upload documents to Trieve
+ test_files: []
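The dependency constraints in the metadata use two operators: the pessimistic `~>` and the open-ended `>=`. A short sketch with RubyGems' own `Gem::Requirement` and `Gem::Version` classes shows which versions each one accepts; the version numbers tried here are arbitrary samples.

```ruby
require "rubygems"

# "~> 3.6" accepts 3.6 and later releases below 4.0; ">= 0" accepts anything.
PESSIMISTIC = Gem::Requirement.new("~> 3.6")
OPEN_ENDED = Gem::Requirement.new(">= 0")

%w[3.5.1 3.6.0 3.9 4.0].each do |number|
  version = Gem::Version.new(number)
  puts "#{number}: ~> 3.6 is #{PESSIMISTIC.satisfied_by?(version)}, >= 0 is #{OPEN_ENDED.satisfied_by?(version)}"
end
```

This is why `redcarpet` is pinned to the 3.x line from 3.6 onward, while `nokogiri` and the other `>= 0` dependencies float freely.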