pikuri-pdf 0.0.6

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 7d52809c4ac479bbf4823b0e47ae96ec955a56ba80616a2fdc495ca4e785eff7
4
+ data.tar.gz: '0178b0b772e9032fe1b952ae363827d46b7f73cf0b683c6b15d4368befe0cd85'
5
+ SHA512:
6
+ metadata.gz: 7ef82b47e6c54b02957bfe0948f4cf4f949c3925ead0b994ff2a9260addd0562389b2ea3a2c83a1e053b461579795888f927a317cd7d2d10e9c43cb2901d59a4
7
+ data.tar.gz: 7fbcc9311f3d4ee942f26698c35e4306e19c3fde42d75569ee355d8610f14254ca9866a5d3a2550f6e6f1f83196828d69ac9dbcb378747f29f10ac3c1ff038d6
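The digests above cover the two members of the `.gem` archive, `metadata.gz` and `data.tar.gz`. As a minimal sketch, one of them could be checked with Ruby's standard `digest` library after unpacking the archive. The `digest_matches?` helper and the local file path are hypothetical names used only for illustration.

```ruby
require 'digest'

# SHA256 of metadata.gz as listed in checksums.yaml above.
EXPECTED_METADATA_SHA256 =
  '7d52809c4ac479bbf4823b0e47ae96ec955a56ba80616a2fdc495ca4e785eff7'

# Hypothetical helper: true when the file at +path+ hashes to +expected_hex+.
def digest_matches?(path, expected_hex, algorithm: Digest::SHA256)
  algorithm.file(path).hexdigest == expected_hex.downcase
end

# Only runs when an unpacked metadata.gz sits in the current directory
# (an assumption about where the archive was extracted).
if File.exist?('metadata.gz')
  ok = digest_matches?('metadata.gz', EXPECTED_METADATA_SHA256)
  puts(ok ? 'metadata.gz: ok' : 'metadata.gz: MISMATCH')
end
```

Passing `algorithm: Digest::SHA512` would check the longer digests the same way.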
data/README.md ADDED
@@ -0,0 +1,78 @@
1
+ # pikuri-pdf
2
+
3
+ PDF text extraction for the
4
+ [pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit:
5
+ in-process, pure Ruby, and **lazy** — paged reads parse only the
6
+ pages the window needs, so showing the first page of a 500-page PDF
7
+ never pays for the other 499.
8
+
9
+ Provides:
10
+ - `Pikuri::Extractors::PDF` — an extractor for the
11
+ `Pikuri::Extractor` registry, wrapping the pure-Ruby
12
+ [pdf-reader](https://github.com/yob/pdf-reader) gem. Once
13
+ registered, every pikuri surface that routes through the registry
14
+ picks PDFs up for free: the `read` tool pages through a local
15
+ `.pdf` with `--- Page N ---` markers, `web_scrape` extracts a
16
+ downloaded paper, the pikuri-vectordb indexer ingests a PDF
17
+ corpus.
18
+
19
+ ## Why a separate gem
20
+
21
+ pikuri-core's pitch is a dependency tree you can audit in an
22
+ evening. pdf-reader brings five transitive gems (Ascii85, afm,
23
+ hashery, ruby-rc4, ttfunk) that serve nothing else in core — the
24
+ largest single bite in that tree, for one file format. So PDF
25
+ support is an opt-in sibling instead: install it when your agent
26
+ needs PDFs, skip it (and its whole subtree) when it doesn't.
27
+
28
+ Everything is pure Ruby, so the worst a malicious PDF can do to the
29
+ parser is burn CPU and memory — there's no native code to corrupt.
30
+
31
+ **This gem or [pikuri-extractors](../pikuri-extractors) — pick one
32
+ per wiring.** pikuri-extractors' converter container also has a PDF
33
+ arm (poppler's `pdftotext`, sandboxed, same `--- Page N ---`
34
+ markers): on an agent that fetches untrusted documents from the
35
+ web, parsing them in the networkless container is the stronger
36
+ posture. This gem is the no-infrastructure wiring — in-process
37
+ means no docker and no host CLIs, and it's what makes the lazy
38
+ page-windowed reads possible (a subprocess converter must convert
39
+ the whole document before emitting anything, and re-converts it on
40
+ every paged read). The guide wires this gem in chapter 3 and
41
+ supersedes it with pikuri-extractors in chapter 7's assistant.
42
+
43
+ ## Install
44
+
45
+ ```ruby
46
+ # Gemfile
47
+ gem 'pikuri-pdf'
48
+ ```
49
+
50
+ ## Usage
51
+
52
+ Requiring the gem changes nothing — registration is an explicit
53
+ opt-in your script makes, same philosophy as `c.add_extension`:
54
+
55
+ ```ruby
56
+ require 'pikuri-core'
57
+ require 'pikuri-pdf'
58
+
59
+ Pikuri::Extractors::PDF.register
60
+
61
+ # From here on, the registry handles PDFs everywhere:
62
+ text = Pikuri::FileType.read_as_text(Pathname.new('paper.pdf'))
63
+ ```
64
+
65
+ `register` inserts the extractor at the *front* of the registry: the
66
+ `%PDF-` magic-byte sniff is the strongest signal there — it never
67
+ misfires on text, and it must win over the HTML extractor's
68
+ content-type match so a PDF served under a lying `Content-Type`
69
+ header still extracts.
70
+
71
+ ## Limits
72
+
73
+ Best-effort by design: pdf-reader produces clean text from PDFs
74
+ generated from a digital source (LaTeX, Word export, ...) but
75
+ nothing useful from scanned documents — those extract to the empty
76
+ string, and the `read` tool words that as a scanned-image hint to
77
+ the model. No OCR. Encrypted and XFA-form PDFs surface as
78
+ `Error: ...` observations the model can react to.
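The README's point about inserting at the front of the registry can be illustrated with a self-contained sketch. Everything here is a hypothetical stand-in: `HtmlLike`, `PdfLike`, `REGISTRY`, `register` and `pick` are not pikuri's classes, and the first-match-wins lookup is an assumption about how such a registry resolves ties.

```ruby
# Hypothetical stand-ins; pikuri's real registry and extractors may differ.
PDF_MAGIC = '%PDF-'

module HtmlLike
  # Matches on the declared content type only.
  def self.matches?(sample:, content_type:)
    content_type == 'text/html'
  end
end

module PdfLike
  # Matches on the declared content type or on the leading magic bytes.
  def self.matches?(sample:, content_type:)
    content_type == 'application/pdf' || sample.start_with?(PDF_MAGIC)
  end
end

REGISTRY = [HtmlLike]

# Front insertion, skipped when the extractor is already present.
def register(extractor, registry = REGISTRY)
  registry.unshift(extractor) unless registry.include?(extractor)
  extractor
end

# Assumed lookup rule: the first extractor that matches wins,
# so position in the registry acts as the tie-breaker.
def pick(sample:, content_type:, registry: REGISTRY)
  registry.find { |e| e.matches?(sample: sample, content_type: content_type) }
end

register(PdfLike)
register(PdfLike) # second call is a no-op

# A PDF body served under a misleading text/html header.
chosen = pick(sample: "%PDF-1.7\n...", content_type: 'text/html')
puts chosen # prints "PdfLike"
```

Had `PdfLike` been appended instead, `HtmlLike` would have claimed the mislabelled response first under this lookup rule.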
data/lib/pikuri/extractors/pdf.rb ADDED
@@ -0,0 +1,134 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'pdf-reader'
4
+
5
+ module Pikuri
6
+ module Extractors
7
+ # PDF → text extractor. Wraps the +pdf-reader+ gem: walk every
8
+ # page, emit a +"--- Page N ---"+ marker line followed by that
9
+ # page's extracted text, join the blocks with single newlines.
10
+ # The markers give every consumer page provenance — the Read
11
+ # tools tell the model to cite pages back to the user from them,
12
+ # +vectordb_search+ chunks carry them so a hit can say which page
13
+ # it came from, and what +vectordb_read+ shows matches what was
14
+ # indexed exactly. Pages with no extractable text contribute
15
+ # nothing (no marker either), so a fully scanned PDF extracts to
16
+ # the empty String — a deliberate silent skip callers detect by
17
+ # length if they care. No OCR in this path.
18
+ #
19
+ # == Why a separate gem
20
+ #
21
+ # This extractor lived in pikuri-core until pdf-reader's
22
+ # dependency tail (Ascii85, afm, hashery, ruby-rc4, ttfunk) became
23
+ # the largest single bite in the core's audit tree — five gems for
24
+ # one file format, serving nothing else in core. Splitting it out
25
+ # keeps the core minimal; hosts that want PDFs opt in with one
26
+ # {.register} call. Distinct from pikuri-extractors' sandboxed
27
+ # subprocess converters: this one is in-process and *lazy*
28
+ # ({.extract_lines} parses pages on demand), a property a
29
+ # subprocess converter structurally cannot have — see
30
+ # +Pikuri::Extractor+'s windowing yardoc.
31
+ #
32
+ # == Registration is explicit
33
+ #
34
+ # Requiring pikuri-pdf defines this module but registers nothing.
35
+ # A host script opts in with +Pikuri::Extractors::PDF.register+,
36
+ # which inserts it at the *front* of the registry — unlike
37
+ # pikuri-extractors' before-the-terminal insert — because the
38
+ # +%PDF-+ magic-byte sniff is the strongest signal in the
39
+ # registry: it must win over +HTML+'s content-type match so a PDF
40
+ # served under a lying header is still extracted, and it never
41
+ # misfires on text.
42
+ #
43
+ # Matched by the +%PDF-+ magic prefix *or* an +application/pdf+
44
+ # content-type.
45
+ #
46
+ # Best-effort by design: +pdf-reader+ produces clean text from
47
+ # PDFs generated from a digital source (LaTeX, Word export, ...)
48
+ # but nothing useful from scanned documents.
49
+ module PDF
50
+ # Insert this extractor at the front of
51
+ # +Pikuri::Extractor.registry+ (see "Registration is explicit"
52
+ # above for why the front). Idempotent.
53
+ #
54
+ # @return [Module] self, for one-line wiring in host scripts.
55
+ def self.register
56
+ registry = Pikuri::Extractor.registry
57
+ registry.unshift(self) unless registry.include?(self)
58
+ self
59
+ end
60
+
61
+ # @return [Symbol] {Pikuri::Extractor::Page#kind} tag.
62
+ def self.kind
63
+ :pdf
64
+ end
65
+
66
+ # @param sample [String] leading bytes of the content.
67
+ # @param content_type [String, nil] normalized content-type,
68
+ # when the transport supplies one.
69
+ # @return [Boolean]
70
+ def self.matches?(sample:, content_type:)
71
+ content_type == 'application/pdf' || sample.start_with?(FileType::PDF_MAGIC)
72
+ end
73
+
74
+ # Render the PDF behind +io+ as plain text, one
75
+ # +"--- Page N ---"+-headed block per page that carries text.
76
+ # Defined as +extract_lines(io).to_a.join("\n")+ so the two
77
+ # duck-type shapes cannot drift apart.
78
+ #
79
+ # @param io [IO, StringIO] seekable IO positioned at the start
80
+ # of the PDF bytes.
81
+ # @return [String] concatenated page blocks; possibly empty when
82
+ # the PDF carries no extractable text (scanned image, empty
83
+ # document).
84
+ # @raise [Pikuri::Extractor::Error] when +pdf-reader+ refuses
85
+ # the document.
86
+ def self.extract(io)
87
+ extract_lines(io).to_a.join("\n")
88
+ end
89
+
90
+ # The lazy line stream behind {.extract}: a marker line per
91
+ # text-carrying page, then that page's lines. +pdf-reader+
92
+ # parses a page's content stream only when +Page#text+ is
93
+ # called, so a consumer that stops early (the
94
+ # +Pikuri::Extractor.extract_paged+ window) never pays for the
95
+ # pages past its window.
96
+ #
97
+ # +pdf-reader+ raises a handful of typed exceptions for
98
+ # documents it cannot parse — broken xrefs
99
+ # ({::PDF::Reader::MalformedPDFError}), invalid page references
100
+ # ({::PDF::Reader::InvalidPageError}), encrypted/XFA files
101
+ # ({::PDF::Reader::UnsupportedFeatureError}). All three describe
102
+ # a property of the document the LLM can react to ("try a
103
+ # different URL / file"), so they re-raise as
104
+ # {Pikuri::Extractor::Error} — from inside the enumerator, i.e.
105
+ # at consumption time, which for a broken xref means the first
106
+ # +next+. Genuine bugs in +pdf-reader+ itself surface as their
107
+ # own classes and crash loud.
108
+ #
109
+ # @param io [IO, StringIO] seekable IO positioned at the start
110
+ # of the PDF bytes; must remain open while the enumerator is
111
+ # consumed.
112
+ # @return [Enumerator<String>] chomped lines, produced
113
+ # page-by-page.
114
+ # @raise [Pikuri::Extractor::Error] when +pdf-reader+ refuses
115
+ # the document (raised on consumption).
116
+ def self.extract_lines(io)
117
+ Enumerator.new do |lines|
118
+ ::PDF::Reader.new(io).pages.each_with_index do |page, idx|
119
+ text = page.text.strip
120
+ next if text.empty?
121
+
122
+ lines << "--- Page #{idx + 1} ---"
123
+ text.split("\n").each { |line| lines << line }
124
+ end
125
+ rescue ::PDF::Reader::MalformedPDFError,
126
+ ::PDF::Reader::InvalidPageError,
127
+ ::PDF::Reader::UnsupportedFeatureError => e
128
+ raise Pikuri::Extractor::Error,
129
+ "PDF rendering failed: #{e.class.name.split('::').last}: #{e.message}"
130
+ end
131
+ end
132
+ end
133
+ end
134
+ end
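The two properties the comments in this file describe, pages parsed only on demand and errors raised at consumption time, can be observed without pdf-reader. The sketch below mirrors the shape of `extract_lines`, but `FakePage`, `BrokenPage`, `FakeParseError`, `ExtractError` and `page_lines` are hypothetical stand-ins, not pikuri or pdf-reader classes.

```ruby
# Hypothetical stand-in for a page: counts how often #text is called.
class FakePage
  attr_reader :parsed

  def initialize(text)
    @text = text
    @parsed = 0
  end

  def text
    @parsed += 1
    @text
  end
end

class FakeParseError < StandardError; end
class ExtractError < StandardError; end

# Stand-in for a page whose content cannot be parsed.
class BrokenPage
  def text
    raise FakeParseError, 'bad xref'
  end
end

# Marker line per text-carrying page, then that page's lines.
# Parser errors are translated inside the enumerator block.
def page_lines(pages)
  Enumerator.new do |lines|
    pages.each_with_index do |page, idx|
      text = page.text.strip
      next if text.empty?

      lines << "--- Page #{idx + 1} ---"
      text.split("\n").each { |line| lines << line }
    end
  rescue FakeParseError => e
    raise ExtractError, "rendering failed: #{e.message}"
  end
end

pages = [FakePage.new("alpha\nbeta"), FakePage.new(''), FakePage.new('gamma')]
window = page_lines(pages).first(2)
p window             # the marker for page 1 and its first line
p pages.map(&:parsed) # only the first page was asked for its text

enum = page_lines([BrokenPage.new]) # nothing has run yet, so no error here
begin
  enum.next
rescue ExtractError => e
  puts e.message
end
```

Because the translation lives inside the block, a caller that never consumes the enumerator never sees the error, which is why the yardoc above asks for the IO to stay open during consumption.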
data/lib/pikuri-pdf.rb ADDED
@@ -0,0 +1,30 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'pikuri-core'
4
+
5
+ # Entry file for the pikuri-pdf gem. Sets up a dedicated Zeitwerk
6
+ # loader rooted at this gem's +lib/+, contributing to the shared
7
+ # +Pikuri::+ namespace alongside pikuri-core. After +require
8
+ # 'pikuri-pdf'+, +Pikuri::Extractors::PDF+ is defined — but *nothing
9
+ # is registered*: extractors plug into +Pikuri::Extractor.registry+
10
+ # only when the host script calls their +register+ explicitly, so a
11
+ # +bin/pikuri-*+ picks which extractors it wires in (same opt-in
12
+ # philosophy as +c.add_extension+, same shape as pikuri-extractors'
13
+ # +DOCUMENTS.register+).
14
+ #
15
+ # The +Pikuri::Extractors+ namespace is cooperative: pikuri-extractors
16
+ # contributes +Documents+ / +DOCUMENTS+, this gem contributes +PDF+.
17
+ # Each gem's loader manages only its own files; the loader constant is
18
+ # +PDF_LOADER+ (not +LOADER+) so both gems can be loaded together
19
+ # without colliding in the shared namespace.
20
+ module Pikuri
21
+ module Extractors
22
+ PDF_LOADER = Zeitwerk::Loader.new
23
+ PDF_LOADER.tag = 'pikuri-pdf'
24
+ PDF_LOADER.push_dir(File.expand_path('.', __dir__))
25
+ PDF_LOADER.ignore(__FILE__)
26
+ PDF_LOADER.inflector.inflect('pdf' => 'PDF')
27
+ PDF_LOADER.setup
28
+ PDF_LOADER.eager_load
29
+ end
30
+ end
metadata ADDED
@@ -0,0 +1,94 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: pikuri-pdf
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.6
5
+ platform: ruby
6
+ authors:
7
+ - Martin Vysny
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2026-06-04 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: pikuri-core
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - '='
18
+ - !ruby/object:Gem::Version
19
+ version: 0.0.6
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - '='
25
+ - !ruby/object:Gem::Version
26
+ version: 0.0.6
27
+ - !ruby/object:Gem::Dependency
28
+ name: pdf-reader
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '2.15'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '2.15'
41
+ description: |
42
+ pikuri-pdf plugs PDF → text extraction into pikuri-core's
43
+ +Pikuri::Extractor+ registry. The bundled +Pikuri::Extractors::PDF+
44
+ extractor wraps the pure-Ruby pdf-reader gem and extracts lazily:
45
+ paged reads (the +read+ tool's windows) parse only the pages the
46
+ window needs, so the first page of a 500-page PDF never pays for
47
+ the other 499.
48
+
49
+ Shipped separately from pikuri-core so the core's dependency tree
50
+ stays minimal and auditable: pdf-reader and its transitive deps
51
+ (Ascii85, afm, hashery, ruby-rc4, ttfunk) ride along only for hosts
52
+ that opt into PDF support.
53
+
54
+ Registration is explicit — +Pikuri::Extractors::PDF.register+ — so
55
+ requiring the gem changes nothing by itself; the host script picks
56
+ which extractors it wires in. One registration extends the +read+
57
+ tool, +web_scrape+, and the pikuri-vectordb indexer simultaneously.
58
+ email:
59
+ - martin@vysny.me
60
+ executables: []
61
+ extensions: []
62
+ extra_rdoc_files: []
63
+ files:
64
+ - README.md
65
+ - lib/pikuri-pdf.rb
66
+ - lib/pikuri/extractors/pdf.rb
67
+ homepage: https://codeberg.org/mvysny/pikuri
68
+ licenses:
69
+ - MIT
70
+ metadata:
71
+ source_code_uri: https://codeberg.org/mvysny/pikuri/src/branch/master
72
+ changelog_uri: https://codeberg.org/mvysny/pikuri/src/branch/master/CHANGELOG.md
73
+ bug_tracker_uri: https://codeberg.org/mvysny/pikuri/issues
74
+ rubygems_mfa_required: 'true'
75
+ post_install_message:
76
+ rdoc_options: []
77
+ require_paths:
78
+ - lib
79
+ required_ruby_version: !ruby/object:Gem::Requirement
80
+ requirements:
81
+ - - ">="
82
+ - !ruby/object:Gem::Version
83
+ version: '3.3'
84
+ required_rubygems_version: !ruby/object:Gem::Requirement
85
+ requirements:
86
+ - - ">="
87
+ - !ruby/object:Gem::Version
88
+ version: '0'
89
+ requirements: []
90
+ rubygems_version: 3.5.22
91
+ signing_key:
92
+ specification_version: 4
93
+ summary: In-process, lazily-paged PDF text extraction for pikuri.
94
+ test_files: []
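The two runtime dependencies above use different constraint styles: an exact pin on pikuri-core and a pessimistic `~>` constraint on pdf-reader. A short sketch with RubyGems' own `Gem::Requirement` shows which versions each admits. The constant names, the `allowed?` helper and the sample version strings are illustrative only.

```ruby
require 'rubygems'

# Constraints copied from the metadata above.
PDF_READER_REQ = Gem::Requirement.new('~> 2.15')
CORE_REQ = Gem::Requirement.new('= 0.0.6')

# Hypothetical helper: does +version+ satisfy +requirement+?
def allowed?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

# "~> 2.15" admits 2.15 and later 2.x releases, but not 3.0.
%w[2.14.1 2.15.0 2.99.0 3.0.0].each do |v|
  puts "pdf-reader #{v}: #{allowed?(PDF_READER_REQ, v)}"
end

# "= 0.0.6" admits exactly one pikuri-core version.
%w[0.0.5 0.0.6 0.0.7].each do |v|
  puts "pikuri-core #{v}: #{allowed?(CORE_REQ, v)}"
end
```

The exact pin means pikuri-pdf and pikuri-core have to be upgraded together, while pdf-reader can move within the 2.x line.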