llmsherpa 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+   metadata.gz: bafd988ec6e1909868e5afa6ff45a5d551f51867ddaebabbbc6c2f2fdcf7aac2
4
+   data.tar.gz: 4d697e0cb08d6d1d8306262da81d2fda1c5595e9dfd843109861823ba42cb781
5
+ SHA512:
6
+   metadata.gz: 9ae38fa4cc6dfe494313394a8cdd19ae536af0cf2b5f3191887515b1b6de63d7604959f03ce178d3bda902a640f640cc6a31d27b44255e23b1b111fbbe5897ef
7
+   data.tar.gz: 27fe378529b8bac701fe86475ac3c921fd79291d9d6808dbacab3c2d2b804724bd9d362896615b4f2354241b039c06105ec1b57404955b746ab50b45641c5832
data/.DS_Store ADDED
Binary file
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format documentation
2
+ --color
3
+ --require spec_helper
data/.rubocop.yml ADDED
@@ -0,0 +1,13 @@
1
+ AllCops:
2
+   TargetRubyVersion: 2.6
3
+
4
+ Style/StringLiterals:
5
+   Enabled: true
6
+   EnforcedStyle: double_quotes
7
+
8
+ Style/StringLiteralsInInterpolation:
9
+   Enabled: true
10
+   EnforcedStyle: double_quotes
11
+
12
+ Layout/LineLength:
13
+   Max: 120
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
1
+ ## [Unreleased]
2
+
3
+ ## [0.1.0] - 2024-03-21
4
+
5
+ - Initial release
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,84 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
6
+
7
+ We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
8
+
9
+ ## Our Standards
10
+
11
+ Examples of behavior that contributes to a positive environment for our community include:
12
+
13
+ * Demonstrating empathy and kindness toward other people
14
+ * Being respectful of differing opinions, viewpoints, and experiences
15
+ * Giving and gracefully accepting constructive feedback
16
+ * Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
17
+ * Focusing on what is best not just for us as individuals, but for the overall community
18
+
19
+ Examples of unacceptable behavior include:
20
+
21
+ * The use of sexualized language or imagery, and sexual attention or
22
+ advances of any kind
23
+ * Trolling, insulting or derogatory comments, and personal or political attacks
24
+ * Public or private harassment
25
+ * Publishing others' private information, such as a physical or email
26
+ address, without their explicit permission
27
+ * Other conduct which could reasonably be considered inappropriate in a
28
+ professional setting
29
+
30
+ ## Enforcement Responsibilities
31
+
32
+ Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
33
+
34
+ Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
35
+
36
+ ## Scope
37
+
38
+ This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
39
+
40
+ ## Enforcement
41
+
42
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at luca.pradovera@gmail.com. All complaints will be reviewed and investigated promptly and fairly.
43
+
44
+ All community leaders are obligated to respect the privacy and security of the reporter of any incident.
45
+
46
+ ## Enforcement Guidelines
47
+
48
+ Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
49
+
50
+ ### 1. Correction
51
+
52
+ **Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
53
+
54
+ **Consequence**: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
55
+
56
+ ### 2. Warning
57
+
58
+ **Community Impact**: A violation through a single incident or series of actions.
59
+
60
+ **Consequence**: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
61
+
62
+ ### 3. Temporary Ban
63
+
64
+ **Community Impact**: A serious violation of community standards, including sustained inappropriate behavior.
65
+
66
+ **Consequence**: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
67
+
68
+ ### 4. Permanent Ban
69
+
70
+ **Community Impact**: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.
71
+
72
+ **Consequence**: A permanent ban from any sort of public interaction within the community.
73
+
74
+ ## Attribution
75
+
76
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.0,
77
+ available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
78
+
79
+ Community Impact Guidelines were inspired by [Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/diversity).
80
+
81
+ [homepage]: https://www.contributor-covenant.org
82
+
83
+ For answers to common questions about this code of conduct, see the FAQ at
84
+ https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2024 Luca Pradovera
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,44 @@
1
+ # Llmsherpa
2
+
3
+ This gem implements a Ruby version of the https://github.com/nlmatics/llmsherpa Python package, with the goal of making it easy to parse and ingest documents.
4
+
5
+ You will need to run an instance of https://github.com/nlmatics/nlm-ingestor. The recommended way is using Docker:
6
+
7
+ ```
8
+ docker pull ghcr.io/nlmatics/nlm-ingestor:latest
9
+ docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest
10
+ ```
11
+
12
+ Many thanks to the [NLMatics team](https://github.com/nlmatics) for their awesome work.
13
+
14
+ ## Installation
15
+
16
+ Install the gem and add to the application's Gemfile by executing:
17
+
18
+ $ bundle add llmsherpa
19
+
20
+ If bundler is not being used to manage dependencies, install the gem by executing:
21
+
22
+ $ gem install llmsherpa
23
+
24
+ ## Usage
25
+
26
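+ Create an `Llmsherpa::LayoutPDFReader` with the URL of a running nlm-ingestor parser endpoint, then call `read_pdf` with a local file path or an http(s) URL. The call returns an `Llmsherpa::Document` whose `chunks`, `tables`, `sections`, `to_text` and `to_html` methods expose the parsed layout.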
27
+
28
+ ## Development
29
+
30
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
31
+
32
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
33
+
34
+ ## Contributing
35
+
36
+ Bug reports and pull requests are welcome on GitHub at https://github.com/lpradovera/llmsherpa. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/lpradovera/llmsherpa/blob/main/CODE_OF_CONDUCT.md).
37
+
38
+ ## License
39
+
40
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
41
+
42
+ ## Code of Conduct
43
+
44
+ Everyone interacting in the Llmsherpa project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/lpradovera/llmsherpa/blob/main/CODE_OF_CONDUCT.md).
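As a rough sketch of the usage flow described in this README, assuming the Docker container above is reachable on `localhost:5010`: the helper `parser_api_url` is hypothetical (it is not part of the gem), and the `/api/parseDocument?renderFormat=all` path is taken from the nlm-ingestor README, so it may need adjusting for other deployments. The calls on `Llmsherpa::LayoutPDFReader` are left as comments because they need the gem and a running server.

```ruby
require "uri"

# Hypothetical helper, not part of the gem: builds the parser endpoint URL.
# Host and port are assumptions that match the `docker run -p 5010:5001`
# mapping in the README; the path and query string follow the nlm-ingestor README.
def parser_api_url(host: "localhost", port: 5010)
  URI::HTTP.build(host: host, port: port, path: "/api/parseDocument", query: "renderFormat=all").to_s
end

puts parser_api_url

# With the gem installed and the server running, usage would look roughly like:
#   require "llmsherpa"
#   reader = Llmsherpa::LayoutPDFReader.new(parser_api_url)
#   doc = reader.read_pdf("path/to/file.pdf") # hypothetical local path
#   doc.chunks.each { |chunk| puts chunk.to_context_text }
```

Keeping the URL construction in one place makes it easy to point the reader at a different host or port without touching the parsing code.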
data/Rakefile ADDED
@@ -0,0 +1,12 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "bundler/gem_tasks"
4
+ require "rspec/core/rake_task"
5
+
6
+ RSpec::Core::RakeTask.new(:spec)
7
+
8
+ require "rubocop/rake_task"
9
+
10
+ RuboCop::RakeTask.new
11
+
12
+ task default: %i[spec rubocop]
data/lib/llmsherpa/blocks.rb ADDED
@@ -0,0 +1,448 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Llmsherpa
4
+ # A block is a node in the layout tree. It can be a paragraph, a list item, a table, or a section header.
5
+ # This is the base class for all blocks such as Paragraph, ListItem, Table, Section.
6
+ class Block
7
+ attr_accessor :tag, :level, :page_idx, :block_idx, :top, :left, :bbox, :sentences, :children, :parent, :block_json
8
+
9
+ def initialize(block_json = nil)
10
+ @tag = block_json["tag"] if block_json&.key?("tag")
11
+ @level = (block_json["level"] if block_json&.key?("level")) || 0
12
+ @page_idx = block_json["page_idx"] if block_json&.key?("page_idx")
13
+ @block_idx = block_json["block_idx"] if block_json&.key?("block_idx")
14
+ @top = block_json["top"] if block_json&.key?("top")
15
+ @left = block_json["left"] if block_json&.key?("left")
16
+ @bbox = block_json["bbox"] if block_json&.key?("bbox")
17
+ @sentences = block_json["sentences"] if block_json&.key?("sentences")
18
+ @children = []
19
+ @parent = nil
20
+ @block_json = block_json
21
+ end
22
+
23
+ # Adds a child to the block. Sets the parent of the child to self.
24
+ def add_child(node)
25
+ @children.push(node)
26
+ node.parent = self
27
+ end
28
+
29
+ # Converts the block to html. This is a virtual method and should be implemented by the derived classes.
30
+ def to_html(include_children = false, recurse = false); end
31
+
32
+ # Converts the block to text. This is a virtual method and should be implemented by the derived classes.
33
+ def to_text(include_children = false, recurse = false); end
34
+
35
+ # Returns the parent chain of the block consisting of all the parents of the block until the root.
36
+ def parent_chain
37
+ chain = []
38
+ parent = self.parent
39
+ while parent
40
+ chain.push(parent)
41
+ parent = parent.parent
42
+ end
43
+ chain.reverse
44
+ end
45
+
46
+ # Returns the text of the parent chain of the block. This is useful for adding section information to the text.
47
+ def parent_text
48
+ parent_chain = self.parent_chain
49
+ header_texts = []
50
+ para_texts = []
51
+ parent_chain.each do |p|
52
+ if p.tag == "header"
53
+ header_texts.push(p.to_text)
54
+ elsif %w[list_item para].include?(p.tag)
55
+ para_texts.push(p.to_text)
56
+ end
57
+ end
58
+ text = header_texts.join(" > ")
59
+ text += "\n#{para_texts.join("\n")}" unless para_texts.empty?
60
+ text
61
+ end
62
+
63
+ # Returns the text of the block with section information. This provides context to the text.
64
+ def to_context_text(include_section_info = true)
65
+ text = ""
66
+ text += "#{parent_text}\n" if include_section_info
67
+ text += if %w[list_item para table].include?(@tag)
68
+ to_text(true, true)
69
+ else
70
+ to_text
71
+ end
72
+ text
73
+ end
74
+
75
+ # Iterates over all the children of the node and calls the node_visitor function on each child.
76
+ def iter_children(node, level, &node_visitor)
77
+ node.children.each do |child|
78
+ node_visitor.call(child)
79
+ iter_children(child, level + 1, &node_visitor) unless %w[list_item para table].include?(child.tag)
80
+ end
81
+ end
82
+
83
+ # Returns all the paragraphs in the block. This is useful for getting all the paragraphs in a section.
84
+ def paragraphs
85
+ paragraphs = []
86
+ iter_children(self, 0) do |node|
87
+ paragraphs.push(node) if node.tag == "para"
88
+ end
89
+ paragraphs
90
+ end
91
+
92
+ # Returns all the chunks in the block. Chunking automatically splits the document into paragraphs, lists, and tables without any prior knowledge of the document structure.
93
+ def chunks
94
+ chunks = []
95
+ iter_children(self, 0) do |node|
96
+ chunks.push(node) if %w[para list_item table].include?(node.tag)
97
+ end
98
+ chunks
99
+ end
100
+
102
+ # Returns all the tables in the block. This is useful for getting all the tables in a section.
103
+ def tables
104
+ tables = []
105
+ iter_children(self, 0) do |node|
106
+ tables.push(node) if node.tag == "table"
107
+ end
108
+ tables
109
+ end
110
+
111
+ # Returns all the sections in the block. This is useful for getting all the sections in a document.
112
+ def sections
113
+ sections = []
114
+ iter_children(self, 0) do |node|
115
+ sections.push(node) if node.tag == "header"
116
+ end
117
+ sections
118
+ end
119
+ end
120
+
121
+ # A paragraph is a block of text. It can have children such as lists. A paragraph has tag 'para'.
122
+ class Paragraph < Block
123
+ def to_text(include_children = false, recurse = false)
124
+ para_text = @sentences.join("\n")
125
+ if include_children
126
+ @children.each do |child|
127
+ para_text += "\n#{child.to_text(recurse, recurse)}" # positional args: to_text takes no keywords
128
+ end
129
+ end
130
+ para_text
131
+ end
132
+
133
+ def to_html(include_children = false, recurse = false)
134
+ html_str = "<p>"
135
+ html_str += @sentences.join("\n")
136
+ if include_children && !@children.empty?
137
+ html_str += "<ul>"
138
+ @children.each do |child|
139
+ html_str += child.to_html(recurse, recurse) # positional args: to_html takes no keywords
140
+ end
141
+ html_str += "</ul>"
142
+ end
143
+ html_str += "</p>"
144
+ html_str
145
+ end
146
+ end
147
+
148
+ # A section is a block of text. It can have children such as paragraphs, lists, and tables. A section has tag 'header'.
149
+ class Section < Block
150
+ attr_accessor :title
151
+
152
+ def initialize(section_json)
153
+ super(section_json)
154
+ @title = @sentences.join("\n")
155
+ end
156
+
157
+ def to_text(include_children = false, recurse = false)
158
+ text = @title
159
+ if include_children
160
+ @children.each do |child|
161
+ text += "\n#{child.to_text(recurse, recurse)}" # positional args: to_text takes no keywords
162
+ end
163
+ end
164
+ text
165
+ end
166
+
167
+ def to_html(include_children = false, recurse = false)
168
+ html_str = "<h#{@level + 1}>#{@title}</h#{@level + 1}>"
169
+ if include_children
170
+ @children.each do |child|
171
+ html_str += child.to_html(recurse, recurse) # positional args: to_html takes no keywords
172
+ end
173
+ end
174
+ html_str
175
+ end
176
+ end
177
+
178
+ # A list item is a block of text. It can have child list items. A list item has tag 'list_item'.
179
+ class ListItem < Block
180
+ def to_text(include_children = false, recurse = false)
181
+ text = @sentences.join("\n")
182
+ if include_children
183
+ @children.each do |child|
184
+ text += "\n#{child.to_text(recurse, recurse)}" # positional args: to_text takes no keywords
185
+ end
186
+ end
187
+ text
188
+ end
189
+
190
+ def to_html(include_children = false, recurse = false)
191
+ html_str = "<li>"
192
+ html_str += @sentences.join("\n")
193
+ if include_children && !@children.empty?
194
+ html_str += "<ul>"
195
+ @children.each do |child|
196
+ html_str += child.to_html(recurse, recurse) # positional args: to_html takes no keywords
197
+ end
198
+ html_str += "</ul>"
199
+ end
200
+ html_str += "</li>"
201
+ html_str
202
+ end
203
+ end
204
+
205
+ # A table cell is a block of text. It can have child paragraphs. A table cell has tag 'table_cell'.
206
+ # A table cell is contained within table rows.
207
+ class TableCell < Block
208
+ attr_accessor :col_span, :cell_value, :cell_node
209
+
210
+ def initialize(cell_json)
211
+ super(cell_json)
212
+ @col_span = cell_json.fetch("col_span", 1) # default to a single column when the key is absent
213
+ @cell_value = cell_json["cell_value"]
214
+ @cell_node = if @cell_value.is_a?(String)
215
+ nil
216
+ else
217
+ Paragraph.new(@cell_value)
218
+ end
219
+ end
220
+
221
+ def to_text
222
+ cell_text = @cell_value
223
+ cell_text = @cell_node.to_text if @cell_node
224
+ cell_text
225
+ end
226
+
227
+ def to_html
228
+ cell_html = @cell_value
229
+ cell_html = @cell_node.to_html if @cell_node
230
+ if @col_span.to_i > 1 # only emit colSpan when the cell actually spans several columns
231
+ "<td colSpan='#{@col_span}'>#{cell_html}</td>"
232
+ else
233
+ "<td>#{cell_html}</td>"
234
+ end
235
+ end
236
+ end
237
+
238
+ # A table row is a block that holds the table cells of one row.
239
+ # A row of type "full_row" is represented by a single cell built from the row itself.
240
+
241
+ class TableRow < Block
242
+ # Initializes a TableRow with child table cells
243
+ def initialize(row_json)
244
+ @cells = []
245
+ if row_json["type"] == "full_row"
246
+ cell = TableCell.new(row_json)
247
+ @cells << cell
248
+ else
249
+ row_json["cells"].each do |cell_json|
250
+ cell = TableCell.new(cell_json)
251
+ @cells << cell
252
+ end
253
+ end
254
+ end
255
+
256
+ # Returns text of a row with text from all the cells in the row delimited by '|'
257
+ def to_text(_include_children = false, _recurse = false)
258
+ @cells.map(&:to_text).join(" | ")
259
+ end
260
+
261
+ # Returns html for a <tr> with html from all the cells in the row as <td>
262
+ def to_html(_include_children = false, _recurse = false)
263
+ html_str = "<tr>"
264
+ @cells.each { |cell| html_str += cell.to_html }
265
+ html_str += "</tr>"
266
+ html_str
267
+ end
268
+ end
269
+
270
+ class TableHeader < Block
271
+ # Initializes a TableHeader with child table cells
272
+ def initialize(row_json)
273
+ super(row_json)
274
+ @cells = []
275
+ row_json["cells"].each do |cell_json|
276
+ cell = TableCell.new(cell_json)
277
+ @cells << cell
278
+ end
279
+ end
280
+
281
+ # Returns text of a header row in markdown format
282
+ def to_text(_include_children = false, _recurse = false)
283
+ cell_text = @cells.map(&:to_text).join(" | ")
284
+ cell_text += "\n" + @cells.map { "---" }.join(" | ")
285
+ cell_text
286
+ end
287
+
288
+ # Returns html for a <th> with html from all the cells in the row as <td>
289
+ def to_html(_include_children = false, _recurse = false)
290
+ html_str = "<th>"
291
+ @cells.each { |cell| html_str += cell.to_html }
292
+ html_str += "</th>"
293
+ html_str
294
+ end
295
+ end
296
+
297
+ # A table is a block made of header rows and data rows. A table has tag 'table'.
298
+ # Rows of type "table_header" become TableHeader objects; all other rows become TableRow objects.
299
+ # Header rows and data rows are stored separately so that to_text and to_html can emit the headers first.
300
+
301
+ # The second constructor argument (the previous node) is accepted but currently unused.
302
+ class Table < Block
303
+ # Initializes a Table with child table rows and headers
304
+ def initialize(table_json, _parent)
305
+ super(table_json)
306
+ @rows = []
307
+ @headers = []
308
+ @name = table_json["name"]
309
+ return unless table_json.include?("table_rows")
310
+
311
+ table_json["table_rows"].each do |row_json|
312
+ if row_json["type"] == "table_header"
313
+ row = TableHeader.new(row_json)
314
+ @headers << row
315
+ else
316
+ row = TableRow.new(row_json)
317
+ @rows << row
318
+ end
319
+ end
320
+ end
321
+
322
+ # Returns text of a table with text from all the rows in the table delimited by '\n'
323
+ def to_text(_include_children = false, _recurse = false)
324
+ "#{@headers.map(&:to_text).join("\n")}\n#{@rows.map(&:to_text).join("\n")}"
325
+ end
326
+
327
+ # Returns html for a <table> with html from all the rows in the table as <tr>
328
+ def to_html(_include_children = false, _recurse = false)
329
+ html_str = "<table>"
330
+ @headers.each { |header| html_str += header.to_html }
331
+ @rows.each { |row| html_str += row.to_html }
332
+ html_str += "</table>"
333
+ html_str
334
+ end
335
+ end
336
+
337
+ class Document
338
+ # Initializes a Document with a layout tree from the json
339
+ def initialize(blocks_json)
340
+ @reader = LayoutReader.new
341
+ @root_node = @reader.read(blocks_json)
342
+ @json = blocks_json
343
+ end
344
+
345
+ # Returns all the chunks in the document
346
+ def chunks
347
+ @root_node.chunks
348
+ end
349
+
350
+ # Returns all the tables in the document
351
+ def tables
352
+ @root_node.tables
353
+ end
354
+
355
+ # Returns all the sections in the document
356
+ def sections
357
+ @root_node.sections
358
+ end
359
+
360
+ # Returns the text of the document: the text of every section, joined by '\n'
361
+ def to_text
362
+ sections.map { |section| section.to_text(true, true) }.join("\n")
363
+ end
364
+
365
+ # Returns html for the document by iterating through all the sections
366
+ def to_html
367
+ html_str = "<html>"
368
+ sections.each { |section| html_str += section.to_html(true, true) }
369
+ html_str += "</html>"
370
+ html_str
371
+ end
372
+ end
373
+
374
+ class LayoutReader
375
+ # Reads the layout tree from the JSON returned by the parser API.
376
+
377
+ def debug(pdf_root)
378
+ iter_children = lambda do |node, level|
379
+ node.children.each do |child|
380
+ puts "#{"-" * level} #{child.tag} (#{child.children.length}) #{child.to_text}"
381
+ iter_children.call(child, level + 1)
382
+ end
383
+ end
384
+ iter_children.call(pdf_root, 0)
385
+ end
386
+
387
+ def read(blocks_json)
388
+ root = Block.new
389
+ parent_stack = [root]
390
+ prev_node = root
391
+ parent = root
392
+ list_stack = []
393
+
394
+ blocks_json.each do |block|
395
+ list_stack = [] if block["tag"] != "list_item" && !list_stack.empty?
396
+
397
+ node = case block["tag"]
398
+ when "para"
399
+ Paragraph.new(block)
400
+ when "table"
401
+ Table.new(block, prev_node)
402
+ when "list_item"
403
+ ListItem.new(block)
404
+ when "header"
405
+ Section.new(block)
406
+ else
407
+ raise "Unsupported block type: #{block["tag"]}"
408
+ end
409
+
410
+ case block["tag"]
411
+ when "para"
412
+ parent.add_child(node)
413
+ when "table"
414
+ parent.add_child(node)
415
+ when "list_item"
416
+ if prev_node.tag == "para" && prev_node.level == node.level
417
+ list_stack << prev_node
418
+ elsif prev_node.tag == "list_item"
419
+ if node.level > prev_node.level
420
+ list_stack << prev_node
421
+ elsif node.level < prev_node.level
422
+ list_stack.pop while !list_stack.empty? && list_stack.last.level > node.level
423
+ end
424
+ end
425
+ if list_stack.any?
426
+ list_stack.last.add_child(node)
427
+ else
428
+ parent.add_child(node)
429
+ end
430
+ when "header"
431
+ if node.level > parent.level
432
+ parent_stack << node
433
+ parent.add_child(node)
434
+ else
435
+ parent_stack.pop while parent_stack.length > 1 && parent_stack.last.level >= node.level
436
+ parent_stack.last.add_child(node)
437
+ parent_stack << node
438
+ end
439
+ parent = node
440
+ end
441
+
442
+ prev_node = node
443
+ end
444
+
445
+ root
446
+ end
447
+ end
448
+ end
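The header branch of `LayoutReader#read` keeps a stack of open sections and pops it until the top is shallower than the incoming header. A minimal sketch of that idea follows, under stated assumptions: `Node` and `nest_headers` are hypothetical stand-ins (not the gem's classes), the root uses level -1 for clarity, and the two branches of the original are collapsed into a single pop-then-push step, which should behave the same way for header-only input.

```ruby
# Hypothetical stand-in for Block: only what the nesting logic needs.
Node = Struct.new(:title, :level, :children) do
  def add_child(node)
    children << node
  end
end

# Nests a flat list of [title, level] headers into a tree using a stack.
def nest_headers(headers)
  root = Node.new("root", -1, [])
  stack = [root]
  headers.each do |title, level|
    node = Node.new(title, level, [])
    # Close every open section that is at the same depth or deeper.
    stack.pop while stack.length > 1 && stack.last.level >= level
    stack.last.add_child(node)
    stack << node
  end
  root
end

tree = nest_headers([["Intro", 0], ["Scope", 1], ["Goals", 1], ["Methods", 0]])
puts tree.children.map(&:title).inspect                # => ["Intro", "Methods"]
puts tree.children.first.children.map(&:title).inspect # => ["Scope", "Goals"]
```

Paragraphs, tables and list items would then attach to whichever section is on top of the stack, which is the role the `parent` variable plays in the reader.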
data/lib/llmsherpa/layout_pdf_reader.rb ADDED
@@ -0,0 +1,63 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "net/http"
4
+ require "uri"
5
+ require "json"
6
+ require "open-uri"
7
+ require "tempfile"
8
+
9
+ module Llmsherpa
10
+ class LayoutPDFReader
11
+ # Reads PDF content and understands hierarchical layout of the document sections and structural components
12
+ def initialize(parser_api_url)
13
+ @parser_api_url = parser_api_url
14
+ end
15
+
16
+ def read_pdf(path_or_url, contents = nil)
17
+ pdf_file = if contents
18
+ [path_or_url, contents, "application/pdf"]
19
+ else
20
+ is_url = %w[http https].include?(URI.parse(path_or_url).scheme)
21
+ if is_url
22
+ _download_pdf(path_or_url)
23
+ else
24
+ file_name = path_or_url
25
+ file_data = nil # no need to read the file here
26
+ [file_name, file_data, "application/pdf"]
27
+ end
28
+ end
29
+
30
+ parser_response = _parse_pdf(pdf_file)
31
+ response_json = JSON.parse(parser_response.body)
32
+ blocks = response_json["return_dict"]["result"]["blocks"]
33
+ Document.new(blocks)
34
+ end
35
+
36
+ private
37
+
38
+ def _download_pdf(pdf_url)
39
+ user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
40
+ download_uri = URI(pdf_url)
41
+ download_request = Net::HTTP::Get.new(download_uri)
42
+ download_request["User-Agent"] = user_agent
43
+ download_response = Net::HTTP.start(download_uri.hostname, download_uri.port,
44
+ use_ssl: download_uri.scheme == "https") do |http|
45
+ http.request(download_request)
46
+ end
47
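+ raise Error, "Failed to download #{pdf_url}: HTTP #{download_response.code}" unless download_response.code == "200"
48
+ @temp_file = Tempfile.new(File.basename(download_uri.path), binmode: true) # ivar keeps the file from being unlinked by GC
49
+ @temp_file.write(download_response.body)
50
+ @temp_file.flush # make sure the bytes are on disk before _parse_pdf reopens the path
51
+ [@temp_file.path, "", "application/pdf"]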
52
+ end
53
+
54
+ def _parse_pdf(pdf_file)
55
+ uri = URI(@parser_api_url)
56
+ request = Net::HTTP::Post.new(uri)
57
+ File.open(pdf_file[0], "rb") do |file| # block form closes the handle after the upload
58
+ request.set_form({ "file" => file }, "multipart/form-data")
59
+ Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == "https") { |http| http.request(request) }
60
+ end
61
+ end
62
+ end
63
+ end
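Two steps of `read_pdf` are easy to try in isolation: deciding whether the argument is a URL, and pulling the block list out of the parser response. The sketch below is illustrative only; `remote_pdf?` and `extract_blocks` are hypothetical helper names, and the added error handling (rescuing `URI::InvalidURIError`, raising when the `blocks` key is missing) is an assumption about what a caller might want rather than behavior of the gem.

```ruby
require "json"
require "uri"

# True when the argument parses as an http(s) URL, the same check read_pdf uses.
# Strings that URI cannot parse are treated as local paths here (an assumption).
def remote_pdf?(path_or_url)
  %w[http https].include?(URI.parse(path_or_url).scheme)
rescue URI::InvalidURIError
  false
end

# Extracts the block list from a parser response body, following the
# return_dict -> result -> blocks nesting that read_pdf relies on.
def extract_blocks(body)
  blocks = JSON.parse(body).dig("return_dict", "result", "blocks")
  raise ArgumentError, "parser response has no blocks" unless blocks.is_a?(Array)

  blocks
end

sample = '{"return_dict":{"result":{"blocks":[{"tag":"para","sentences":["Hi"]}]}}}'
puts remote_pdf?("https://example.com/a.pdf") # => true
puts remote_pdf?("/tmp/report.pdf")           # => false
puts extract_blocks(sample).first["tag"]      # => para
```

Using `dig` means a response that lacks one of the intermediate keys yields `nil` instead of a `NoMethodError`, so the failure can be reported with a clearer message.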
data/lib/llmsherpa/version.rb ADDED
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Llmsherpa
4
+ VERSION = "0.1.0"
5
+ end
data/lib/llmsherpa.rb ADDED
@@ -0,0 +1,10 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative "llmsherpa/version"
4
+ require_relative "llmsherpa/blocks"
5
+ require_relative "llmsherpa/layout_pdf_reader"
6
+
7
+ module Llmsherpa
8
+ class Error < StandardError; end
9
+ # Your code goes here...
10
+ end
data/sig/llmsherpa.rbs ADDED
@@ -0,0 +1,4 @@
1
+ module Llmsherpa
2
+ VERSION: String
3
+ # See the writing guide of rbs: https://github.com/ruby/rbs#guides
4
+ end
metadata ADDED
@@ -0,0 +1,60 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: llmsherpa
3
+ version: !ruby/object:Gem::Version
4
+   version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Luca Pradovera
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2024-03-21 00:00:00.000000000 Z
12
+ dependencies: []
13
+ description:
14
+ email:
15
+ - luca.pradovera@gmail.com
16
+ executables: []
17
+ extensions: []
18
+ extra_rdoc_files: []
19
+ files:
20
+ - ".DS_Store"
21
+ - ".rspec"
22
+ - ".rubocop.yml"
23
+ - CHANGELOG.md
24
+ - CODE_OF_CONDUCT.md
25
+ - LICENSE.txt
26
+ - README.md
27
+ - Rakefile
28
+ - lib/llmsherpa.rb
29
+ - lib/llmsherpa/blocks.rb
30
+ - lib/llmsherpa/layout_pdf_reader.rb
31
+ - lib/llmsherpa/version.rb
32
+ - sig/llmsherpa.rbs
33
+ homepage: https://github.com/lpradovera/llmsherpa
34
+ licenses:
35
+ - MIT
36
+ metadata:
37
+   allowed_push_host: https://rubygems.org
38
+   homepage_uri: https://github.com/lpradovera/llmsherpa
39
+   source_code_uri: https://github.com/lpradovera/llmsherpa
40
+   changelog_uri: https://github.com/lpradovera/llmsherpa/blob/main/CHANGELOG.md
41
+ post_install_message:
42
+ rdoc_options: []
43
+ require_paths:
44
+ - lib
45
+ required_ruby_version: !ruby/object:Gem::Requirement
46
+   requirements:
47
+   - - ">="
48
+     - !ruby/object:Gem::Version
49
+       version: 2.6.0
50
+ required_rubygems_version: !ruby/object:Gem::Requirement
51
+   requirements:
52
+   - - ">="
53
+     - !ruby/object:Gem::Version
54
+       version: '0'
55
+ requirements: []
56
+ rubygems_version: 3.5.3
57
+ signing_key:
58
+ specification_version: 4
59
+ summary: Client for the nlm-ingestor server
60
+ test_files: []