pikuri-pdf 0.0.6
- checksums.yaml +7 -0
- data/README.md +78 -0
- data/lib/pikuri/extractors/pdf.rb +134 -0
- data/lib/pikuri-pdf.rb +30 -0
- metadata +94 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 7d52809c4ac479bbf4823b0e47ae96ec955a56ba80616a2fdc495ca4e785eff7
+  data.tar.gz: '0178b0b772e9032fe1b952ae363827d46b7f73cf0b683c6b15d4368befe0cd85'
+SHA512:
+  metadata.gz: 7ef82b47e6c54b02957bfe0948f4cf4f949c3925ead0b994ff2a9260addd0562389b2ea3a2c83a1e053b461579795888f927a317cd7d2d10e9c43cb2901d59a4
+  data.tar.gz: 7fbcc9311f3d4ee942f26698c35e4306e19c3fde42d75569ee355d8610f14254ca9866a5d3a2550f6e6f1f83196828d69ac9dbcb378747f29f10ac3c1ff038d6
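Reviewer note: the digests in checksums.yaml cover the two archives packed inside the .gem file. A minimal sketch of how one such SHA-256 value could be checked with Ruby's standard `digest` library follows. The helper name `sha256_matches?` is an assumption made for illustration and is not part of the gem or of RubyGems; the demo runs against a throwaway file rather than the real archives.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: true when the file's SHA-256 hex digest equals +expected+.
def sha256_matches?(path, expected)
  Digest::SHA256.file(path).hexdigest == expected.downcase
end

# Well-known SHA-256 of the three bytes "abc", used so the sketch is runnable
# without downloading anything.
ABC_SHA256 = 'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad'

Tempfile.create('demo') do |f|
  f.write('abc')
  f.flush
  puts sha256_matches?(f.path, ABC_SHA256) # true
end
```

Pointing the helper at an unpacked `data.tar.gz` and the SHA256 value listed above would be the real-world use; the SHA512 entries could be checked the same way with `Digest::SHA512`.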
data/README.md
ADDED
@@ -0,0 +1,78 @@
+# pikuri-pdf
+
+PDF text extraction for the
+[pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit:
+in-process, pure Ruby, and **lazy** — paged reads parse only the
+pages the window needs, so showing the first page of a 500-page PDF
+never pays for the other 499.
+
+Provides:
+- `Pikuri::Extractors::PDF` — an extractor for the
+`Pikuri::Extractor` registry, wrapping the pure-Ruby
+[pdf-reader](https://github.com/yob/pdf-reader) gem. Once
+registered, every pikuri surface that routes through the registry
+picks PDFs up for free: the `read` tool pages through a local
+`.pdf` with `--- Page N ---` markers, `web_scrape` extracts a
+downloaded paper, the pikuri-vectordb indexer ingests a PDF
+corpus.
+
+## Why a separate gem
+
+pikuri-core's pitch is a dependency tree you can audit in an
+evening. pdf-reader brings five transitive gems (Ascii85, afm,
+hashery, ruby-rc4, ttfunk) that serve nothing else in core — the
+largest single bite in that tree, for one file format. So PDF
+support is an opt-in sibling instead: install it when your agent
+needs PDFs, skip it (and its whole subtree) when it doesn't.
+
+Everything is pure Ruby, so the worst a malicious PDF can do to the
+parser is burn CPU and memory — there's no native code to corrupt.
+
+**This gem or [pikuri-extractors](../pikuri-extractors) — pick one
+per wiring.** pikuri-extractors' converter container also has a PDF
+arm (poppler's `pdftotext`, sandboxed, same `--- Page N ---`
+markers): on an agent that fetches untrusted documents from the
+web, parsing them in the networkless container is the stronger
+posture. This gem is the no-infrastructure wiring — in-process
+means no docker and no host CLIs, and it's what makes the lazy
+page-windowed reads possible (a subprocess converter must convert
+the whole document before emitting anything, and re-converts it on
+every paged read). The guide wires this gem in chapter 3 and
+supersedes it with pikuri-extractors in chapter 7's assistant.
+
+## Install
+
+```ruby
+# Gemfile
+gem 'pikuri-pdf'
+```
+
+## Usage
+
+Requiring the gem changes nothing — registration is an explicit
+opt-in your script makes, same philosophy as `c.add_extension`:
+
+```ruby
+require 'pikuri-core'
+require 'pikuri-pdf'
+
+Pikuri::Extractors::PDF.register
+
+# From here on, the registry handles PDFs everywhere:
+text = Pikuri::FileType.read_as_text(Pathname.new('paper.pdf'))
+```
+
+`register` inserts the extractor at the *front* of the registry: the
+`%PDF-` magic-byte sniff is the strongest signal there — it never
+misfires on text, and it must win over the HTML extractor's
+content-type match so a PDF served under a lying `Content-Type`
+header still extracts.
+
+## Limits
+
+Best-effort by design: pdf-reader produces clean text from PDFs
+generated from a digital source (LaTeX, Word export, ...) but
+nothing useful from scanned documents — those extract to the empty
+string, and the `read` tool words that as a scanned-image hint to
+the model. No OCR. Encrypted and XFA-form PDFs surface as
+`Error: ...` observations the model can react to.
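Reviewer note: the README's argument for front-of-registry insertion can be illustrated without pikuri installed. The sketch below uses stand-in modules (`HtmlStub`, `PdfStub`) and a hypothetical `pick` dispatcher; none of these names exist in pikuri, and the assumption that the real registry hands content to the first extractor whose `matches?` returns true is mine, inferred from the README's wording rather than confirmed by the diff.

```ruby
# Stand-in sketch: these modules are NOT pikuri's classes, only the same shape.
PDF_MAGIC = '%PDF-'

module HtmlStub
  def self.kind
    :html
  end

  # Trusts the transport's content-type, as the README describes for HTML.
  def self.matches?(sample:, content_type:)
    content_type == 'text/html'
  end
end

module PdfStub
  def self.kind
    :pdf
  end

  # Same predicate as the diffed extractor: content-type OR magic prefix.
  def self.matches?(sample:, content_type:)
    content_type == 'application/pdf' || sample.start_with?(PDF_MAGIC)
  end

  # Front insert, skipped when already present (idempotent).
  def self.register(registry)
    registry.unshift(self) unless registry.include?(self)
    self
  end
end

# Hypothetical dispatcher (assumption): the first matching extractor wins.
def pick(registry, sample:, content_type:)
  registry.find { |e| e.matches?(sample: sample, content_type: content_type) }
end

REGISTRY = [HtmlStub]
PdfStub.register(REGISTRY)
PdfStub.register(REGISTRY) # second call changes nothing

# A PDF body served under a misleading text/html header still goes to PdfStub.
puts pick(REGISTRY, sample: "%PDF-1.7\n", content_type: 'text/html').kind # pdf
```

Had `PdfStub` been appended instead of unshifted, `HtmlStub` would have claimed the mislabelled response first, which is the failure mode the README describes.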
data/lib/pikuri/extractors/pdf.rb
ADDED
@@ -0,0 +1,134 @@
+# frozen_string_literal: true
+
+require 'pdf-reader'
+
+module Pikuri
+  module Extractors
+    # PDF → text extractor. Wraps the +pdf-reader+ gem: walk every
+    # page, emit a +"--- Page N ---"+ marker line followed by that
+    # page's extracted text, join the blocks with single newlines.
+    # The markers give every consumer page provenance — the Read
+    # tools tell the model to cite pages back to the user from them,
+    # +vectordb_search+ chunks carry them so a hit can say which page
+    # it came from, and what +vectordb_read+ shows matches what was
+    # indexed exactly. Pages with no extractable text contribute
+    # nothing (no marker either), so a fully scanned PDF extracts to
+    # the empty String — a deliberate silent skip callers detect by
+    # length if they care. No OCR in this path.
+    #
+    # == Why a separate gem
+    #
+    # This extractor lived in pikuri-core until pdf-reader's
+    # dependency tail (Ascii85, afm, hashery, ruby-rc4, ttfunk) became
+    # the largest single bite in the core's audit tree — five gems for
+    # one file format, serving nothing else in core. Splitting it out
+    # keeps the core minimal; hosts that want PDFs opt in with one
+    # {.register} call. Distinct from pikuri-extractors' sandboxed
+    # subprocess converters: this one is in-process and *lazy*
+    # ({.extract_lines} parses pages on demand), a property a
+    # subprocess converter structurally cannot have — see
+    # +Pikuri::Extractor+'s windowing yardoc.
+    #
+    # == Registration is explicit
+    #
+    # Requiring pikuri-pdf defines this module but registers nothing.
+    # A host script opts in with +Pikuri::Extractors::PDF.register+,
+    # which inserts it at the *front* of the registry — unlike
+    # pikuri-extractors' before-the-terminal insert — because the
+    # +%PDF-+ magic-byte sniff is the strongest signal in the
+    # registry: it must win over +HTML+'s content-type match so a PDF
+    # served under a lying header is still extracted, and it never
+    # misfires on text.
+    #
+    # Matched by the +%PDF-+ magic prefix *or* an +application/pdf+
+    # content-type.
+    #
+    # Best-effort by design: +pdf-reader+ produces clean text from
+    # PDFs generated from a digital source (LaTeX, Word export, ...)
+    # but nothing useful from scanned documents.
+    module PDF
+      # Insert this extractor at the front of
+      # +Pikuri::Extractor.registry+ (see "Registration is explicit"
+      # above for why the front). Idempotent.
+      #
+      # @return [Module] self, for one-line wiring in host scripts.
+      def self.register
+        registry = Pikuri::Extractor.registry
+        registry.unshift(self) unless registry.include?(self)
+        self
+      end
+
+      # @return [Symbol] {Pikuri::Extractor::Page#kind} tag.
+      def self.kind
+        :pdf
+      end
+
+      # @param sample [String] leading bytes of the content.
+      # @param content_type [String, nil] normalized content-type,
+      # when the transport supplies one.
+      # @return [Boolean]
+      def self.matches?(sample:, content_type:)
+        content_type == 'application/pdf' || sample.start_with?(FileType::PDF_MAGIC)
+      end
+
+      # Render the PDF behind +io+ as plain text, one
+      # +"--- Page N ---"+-headed block per page that carries text.
+      # Defined as +extract_lines.to_a.join+ so the two duck-type
+      # shapes cannot drift apart.
+      #
+      # @param io [IO, StringIO] seekable IO positioned at the start
+      # of the PDF bytes.
+      # @return [String] concatenated page blocks; possibly empty when
+      # the PDF carries no extractable text (scanned image, empty
+      # document).
+      # @raise [Pikuri::Extractor::Error] when +pdf-reader+ refuses
+      # the document.
+      def self.extract(io)
+        extract_lines(io).to_a.join("\n")
+      end
+
+      # The lazy line stream behind {.extract}: a marker line per
+      # text-carrying page, then that page's lines. +pdf-reader+
+      # parses a page's content stream only when +Page#text+ is
+      # called, so a consumer that stops early (the
+      # +Pikuri::Extractor.extract_paged+ window) never pays for the
+      # pages past its window.
+      #
+      # +pdf-reader+ raises a handful of typed exceptions for
+      # documents it cannot parse — broken xrefs
+      # ({::PDF::Reader::MalformedPDFError}), invalid page references
+      # ({::PDF::Reader::InvalidPageError}), encrypted/XFA files
+      # ({::PDF::Reader::UnsupportedFeatureError}). All three describe
+      # a property of the document the LLM can react to ("try a
+      # different URL / file"), so they re-raise as
+      # {Pikuri::Extractor::Error} — from inside the enumerator, i.e.
+      # at consumption time, which for a broken xref means the first
+      # +next+. Genuine bugs in +pdf-reader+ itself surface as their
+      # own classes and crash loud.
+      #
+      # @param io [IO, StringIO] seekable IO positioned at the start
+      # of the PDF bytes; must remain open while the enumerator is
+      # consumed.
+      # @return [Enumerator<String>] chomped lines, produced
+      # page-by-page.
+      # @raise [Pikuri::Extractor::Error] when +pdf-reader+ refuses
+      # the document (raised on consumption).
+      def self.extract_lines(io)
+        Enumerator.new do |lines|
+          ::PDF::Reader.new(io).pages.each_with_index do |page, idx|
+            text = page.text.strip
+            next if text.empty?
+
+            lines << "--- Page #{idx + 1} ---"
+            text.split("\n").each { |line| lines << line }
+          end
+        rescue ::PDF::Reader::MalformedPDFError,
+               ::PDF::Reader::InvalidPageError,
+               ::PDF::Reader::UnsupportedFeatureError => e
+          raise Pikuri::Extractor::Error,
+                "PDF rendering failed: #{e.class.name.split('::').last}: #{e.message}"
+        end
+      end
+    end
+  end
+end
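Reviewer note: two properties claimed in the yardoc above are easy to miss when reading the diff: an early-stopping consumer never triggers parsing of later pages, and parser errors surface when the enumerator is consumed, not when it is built. The sketch below reproduces the shape of `extract_lines` against stand-ins so it runs without pdf-reader. `FakePage`, `ParserRefused` and `ExtractError` are hypothetical names invented here; they only mimic `PDF::Reader` pages, its typed errors, and `Pikuri::Extractor::Error`.

```ruby
# Hypothetical stand-ins for pdf-reader's typed errors and pikuri's error class.
class ParserRefused < StandardError; end
class ExtractError < StandardError; end

# Hypothetical page: counts how often its text is "parsed".
class FakePage
  @parsed = 0
  class << self
    attr_accessor :parsed
  end

  def initialize(body)
    @body = body
  end

  # A body of :broken plays the role of a page the parser refuses.
  def text
    self.class.parsed += 1
    raise ParserRefused, 'bad content stream' if @body == :broken

    @body
  end
end

# Same control flow as the diffed method, with pages passed in directly.
def extract_lines(pages)
  Enumerator.new do |lines|
    pages.each_with_index do |page, idx|
      text = page.text.strip
      next if text.empty?

      lines << "--- Page #{idx + 1} ---"
      text.split("\n").each { |line| lines << line }
    end
  rescue ParserRefused => e
    raise ExtractError, "PDF rendering failed: #{e.message}"
  end
end

PAGES = ["alpha\nbeta", '   ', 'gamma', :broken].map { |b| FakePage.new(b) }

stream = extract_lines(PAGES)
puts FakePage.parsed   # 0, building the enumerator parses nothing

p stream.first(2)      # ["--- Page 1 ---", "alpha"]
puts FakePage.parsed   # 1, only the first page was touched

begin
  stream.to_a          # full consumption reaches the broken page
rescue ExtractError => e
  puts e.message       # PDF rendering failed: bad content stream
end
```

The blank second page contributes neither a marker nor a line, so the marker that follows "beta" in a longer window reads `--- Page 3 ---`, matching the skip-without-renumbering behaviour of the real extractor.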
data/lib/pikuri-pdf.rb
ADDED
@@ -0,0 +1,30 @@
+# frozen_string_literal: true
+
+require 'pikuri-core'
+
+# Entry file for the pikuri-pdf gem. Sets up a dedicated Zeitwerk
+# loader rooted at this gem's +lib/+, contributing to the shared
+# +Pikuri::+ namespace alongside pikuri-core. After +require
+# 'pikuri-pdf'+, +Pikuri::Extractors::PDF+ is defined — but *nothing
+# is registered*: extractors plug into +Pikuri::Extractor.registry+
+# only when the host script calls their +register+ explicitly, so a
+# +bin/pikuri-*+ picks which extractors it wires in (same opt-in
+# philosophy as +c.add_extension+, same shape as pikuri-extractors'
+# +DOCUMENTS.register+).
+#
+# The +Pikuri::Extractors+ namespace is cooperative: pikuri-extractors
+# contributes +Documents+ / +DOCUMENTS+, this gem contributes +PDF+.
+# Each gem's loader manages only its own files; the loader constant is
+# +PDF_LOADER+ (not +LOADER+) so both gems can be loaded together
+# without colliding in the shared namespace.
+module Pikuri
+  module Extractors
+    PDF_LOADER = Zeitwerk::Loader.new
+    PDF_LOADER.tag = 'pikuri-pdf'
+    PDF_LOADER.push_dir(File.expand_path('.', __dir__))
+    PDF_LOADER.ignore(__FILE__)
+    PDF_LOADER.inflector.inflect('pdf' => 'PDF')
+    PDF_LOADER.setup
+    PDF_LOADER.eager_load
+  end
+end
metadata
ADDED
@@ -0,0 +1,94 @@
+--- !ruby/object:Gem::Specification
+name: pikuri-pdf
+version: !ruby/object:Gem::Version
+  version: 0.0.6
+platform: ruby
+authors:
+- Martin Vysny
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2026-06-04 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: pikuri-core
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '='
+      - !ruby/object:Gem::Version
+        version: 0.0.6
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '='
+      - !ruby/object:Gem::Version
+        version: 0.0.6
+- !ruby/object:Gem::Dependency
+  name: pdf-reader
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.15'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.15'
+description: |
+  pikuri-pdf plugs PDF → text extraction into pikuri-core's
+  +Pikuri::Extractor+ registry. The bundled +Pikuri::Extractors::PDF+
+  extractor wraps the pure-Ruby pdf-reader gem and extracts lazily:
+  paged reads (the +read+ tool's windows) parse only the pages the
+  window needs, so the first page of a 500-page PDF never pays for
+  the other 499.
+
+  Shipped separately from pikuri-core so the core's dependency tree
+  stays minimal and auditable: pdf-reader and its transitive deps
+  (Ascii85, afm, hashery, ruby-rc4, ttfunk) ride along only for hosts
+  that opt into PDF support.
+
+  Registration is explicit — +Pikuri::Extractors::PDF.register+ — so
+  requiring the gem changes nothing by itself; the host script picks
+  which extractors it wires in. One registration extends the +read+
+  tool, +web_scrape+, and the pikuri-vectordb indexer simultaneously.
+email:
+- martin@vysny.me
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- README.md
+- lib/pikuri-pdf.rb
+- lib/pikuri/extractors/pdf.rb
+homepage: https://codeberg.org/mvysny/pikuri
+licenses:
+- MIT
+metadata:
+  source_code_uri: https://codeberg.org/mvysny/pikuri/src/branch/master
+  changelog_uri: https://codeberg.org/mvysny/pikuri/src/branch/master/CHANGELOG.md
+  bug_tracker_uri: https://codeberg.org/mvysny/pikuri/issues
+  rubygems_mfa_required: 'true'
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '3.3'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.5.22
+signing_key:
+specification_version: 4
+summary: In-process, lazily-paged PDF text extraction for pikuri.
+test_files: []
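Reviewer note: the gemspec pins pikuri-core exactly (`= 0.0.6`) while allowing pdf-reader to float within the 2.x line (`~> 2.15`). A short sketch with RubyGems' own `Gem::Requirement` and `Gem::Version` classes shows what each constraint admits. The constant names and the `allowed?` helper are illustrative assumptions; the version strings being probed are arbitrary examples, not releases known to exist.

```ruby
require 'rubygems'

# The two runtime constraints as written in the metadata above.
CORE_PIN   = Gem::Requirement.new('= 0.0.6')
READER_PIN = Gem::Requirement.new('~> 2.15')

# Hypothetical helper: does +version+ satisfy +requirement+?
def allowed?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

puts allowed?(CORE_PIN, '0.0.6')     # true
puts allowed?(CORE_PIN, '0.0.7')     # false, exact pin
puts allowed?(READER_PIN, '2.15.1')  # true
puts allowed?(READER_PIN, '2.16.0')  # true, ~> 2.15 means >= 2.15 and < 3.0
puts allowed?(READER_PIN, '3.0.0')   # false
```

The exact pin means pikuri-pdf and pikuri-core have to be upgraded in lockstep, which fits two gems sharing one `Pikuri::` namespace.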