pikuri-extractors 0.0.6

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 760f5421f1396e28dfb86eacfbee7a70e3040313498805cc5617eff91cf7fb86
4
+ data.tar.gz: e220e4620d19cb8a400b71f6e0874b0db17e4a6ff289d2baf12d6b852876aca4
5
+ SHA512:
6
+ metadata.gz: 20f920922f3548d0d862653ed081e61593eddb37cef90a48e00e1669cfb4181ecb6f82fce0befbbcf5a80a84c88aca109a8184b6c997175cde4fa15468b3b19c
7
+ data.tar.gz: 9e9d1428944ecf927d585e64e6efa42a7f111ab86bf510107ced9694e94a34288e91e415641c6990d9e862ef8fad2bf4ed0f123eabae36a9129c467c0fc140a0
data/README.md ADDED
@@ -0,0 +1,102 @@
1
+ # pikuri-extractors
2
+
3
+ Additional document extractors for the
4
+ [pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit:
5
+ office documents and PDFs → Markdown, converted preferably inside a
6
+ one-shot **networkless docker container**, so a malicious document
7
+ downloaded from the web is parsed somewhere it can't phone home or
8
+ read your files.
9
+
10
+ Provides:
11
+ - `Pikuri::Extractors::DOCUMENTS` — an extractor for the
12
+ `Pikuri::Extractor` registry covering **DOCX, ODT, XLSX, legacy
13
+ XLS, PPTX, EPUB, RTF, and PDF**. Once registered, every pikuri
14
+ surface that routes through the registry picks the formats up for
15
+ free: the `read` tool pages through a local `.docx`, `web_scrape`
16
+ / `fetch` convert a downloaded `.odt`, the pikuri-vectordb indexer
17
+ ingests an `.epub` or a paper PDF (with `--- Page N ---` page
18
+ markers preserved for citations).
19
+
20
+ The actual conversion is done by [pandoc](https://pandoc.org) (ODF,
21
+ RTF, EPUB, DOCX — its readers preserve the most structure),
22
+ [markitdown](https://github.com/microsoft/markitdown) (the XLSX /
23
+ XLS / PPTX arms), and poppler's `pdftotext` (the PDF arm — its `\f`
24
+ page separators are rebuilt into `--- Page N ---` markers on the
25
+ Ruby side), dispatched per format. One stdin→stdout contract,
26
+ two ways to run it:
27
+
28
+ 1. **Container (preferred).** A small, locally-built docker image
29
+ (pinned pandoc, poppler, and markitdown; built from
30
+ [docker/Dockerfile](docker/Dockerfile) on first use — read it,
31
+ it's short) run as
32
+ `docker run --rm -i --network=none --read-only --cap-drop=ALL`,
33
+ bytes in via stdin, Markdown out via stdout, **no volume
34
+ mounts**. Office-format parsers are large, complex codebases and
35
+ the documents they parse are exactly the bytes an attacker
36
+ controls; in the container, the blast radius of a parser exploit
37
+ is one throwaway process that can see neither your network nor
38
+ your filesystem.
39
+ 2. **Host CLI (fallback).** When docker is absent or the daemon is
40
+ down, a host-installed `pandoc` / `markitdown` / `pdftotext` is
41
+ used directly — convenient, but unpinned and unsandboxed.
42
+
43
+ Deliberately **not** covered: ODS / ODP (neither converter reads
44
+ them; the only one that does is LibreOffice, a 2 GB+ image), and
45
+ image OCR / audio transcription (need model downloads — point a
46
+ multi-modal main LLM at images instead).
47
+
48
+ **PDF: this gem or [pikuri-pdf](../pikuri-pdf) — pick one per
49
+ wiring.** This gem parses PDFs inside the sandbox (poppler is native
50
+ code chewing attacker-controlled bytes — exactly what the container
51
+ is for) but re-converts the whole document on every paged read;
52
+ pikuri-pdf is in-process pure Ruby with lazy page-windowed reads and
53
+ no infrastructure to set up. The guide wires pikuri-pdf in chapter 3
54
+ and supersedes it with this gem in chapter 7's assistant.
55
+
56
+ ## Install
57
+
58
+ ```ruby
59
+ # Gemfile
60
+ gem 'pikuri-extractors'
61
+ ```
62
+
63
+ Plus *one* of: a working `docker` (recommended; the image builds
64
+ itself on first use, network is only needed for that build), or
65
+ host `pandoc` / `markitdown` / `pdftotext` CLIs.
66
+
67
+ ## Usage
68
+
69
+ Requiring the gem changes nothing — registration is an explicit
70
+ opt-in your script makes, same philosophy as `c.add_extension`:
71
+
72
+ ```ruby
73
+ require 'pikuri-core'
74
+ require 'pikuri-extractors'
75
+
76
+ Pikuri::Extractors::DOCUMENTS.register
77
+
78
+ # From here on, the registry handles the new formats everywhere:
79
+ text = Pikuri::FileType.read_as_text(Pathname.new('report.docx'))
80
+
81
+ # Optional: pay the one-time image build (~minutes) at boot instead
82
+ # of mid-conversation. Requires docker; skip when relying on host CLIs.
83
+ Pikuri::Extractors::DOCUMENTS.ensure_image!
84
+ ```
85
+
86
+ ## Performance posture
87
+
88
+ A subprocess converter has no lazy mode: every paged `read` of a
89
+ long document re-runs the full conversion (a cold `docker run` plus
90
+ the converter itself — roughly a second for pandoc, a few for
91
+ markitdown). That's accepted v1 behavior — no result cache — and it
92
+ is well inside what an LLM tool call tolerates.
93
+
94
+ ## Format detection
95
+
96
+ Content-type when the transport provides one (HTTP header), byte
97
+ sniff otherwise: RTF by its `{\rtf` prefix; ODT / EPUB by the
98
+ uncompressed `mimetype` zip entry their specs mandate first; DOCX /
99
+ XLSX / PPTX by `[Content_Types].xml` plus an entry-name scan.
100
+ Legacy `.xls` is recognised by content-type only (its OLE2 container
101
+ isn't sniffable from the leading bytes), so a *local* `.xls` file —
102
+ no content-type — keeps pikuri-core's binary refusal.
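The sniff order from the README's "Format detection" section can be sketched as a standalone function. This is a minimal illustration, not the gem's API: `sniff_format` and `SNIFF_ZIP_MAGIC` are hypothetical names, and the `%PDF-` check is taken from the class documentation rather than from the README.

```ruby
# Hypothetical constant: the zip local-file-header magic shared by
# ODT, EPUB, and the OOXML formats.
SNIFF_ZIP_MAGIC = "PK\x03\x04".b

# Map a leading byte sample to a format tag, or nil when nothing matches.
def sniff_format(sample)
  s = sample.b # compare bytes, whatever encoding the caller handed in
  return 'pdf' if s.start_with?('%PDF-')
  return 'rtf' if s.start_with?('{\rtf')
  return nil unless s.start_with?(SNIFF_ZIP_MAGIC)
  # ODF and EPUB store an uncompressed `mimetype` entry first, so the
  # entry name and the mime string sit back to back in the sample.
  return 'odt' if s.include?('mimetypeapplication/vnd.oasis.opendocument.text')
  return 'epub' if s.include?('mimetypeapplication/epub+zip')
  return nil unless s.include?('[Content_Types].xml')
  return 'docx' if s.include?('word/')
  return 'pptx' if s.include?('ppt/')
  return 'xlsx' if s.include?('xl/')

  nil
end

# Hand-built samples; real documents carry full zip headers.
epub_sample = SNIFF_ZIP_MAGIC + ("\x00".b * 26) + 'mimetypeapplication/epub+zip'
ole2_sample = "\xD0\xCF\x11\xE0".b # legacy .xls container magic

puts sniff_format(epub_sample).inspect
puts sniff_format(ole2_sample).inspect
```

The OLE2 sample comes back as `nil`, which is why a local `.xls` with no content-type is left to pikuri-core's binary refusal.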
data/docker/Dockerfile ADDED
@@ -0,0 +1,38 @@
1
+ # Converter image for Pikuri::Extractors::DOCUMENTS — a one-shot,
2
+ # networkless stdin→stdout document converter. Built locally by the
3
+ # extractor on first use (`docker build -t pikuri-internal-extractors:<version> .`);
4
+ # never pulled from a registry, so what runs is exactly what you can
5
+ # read here.
6
+ #
7
+ # Runtime contract (enforced by the extractor's `docker run` flags,
8
+ # not by this image): --network=none --read-only --cap-drop=ALL
9
+ # --tmpfs /tmp, no volume mounts. Document bytes arrive on stdin,
10
+ # Markdown leaves on stdout, diagnostics on stderr. Network exists
11
+ # only at *build* time (apt + pip below).
12
+ #
13
+ # All converters are version-pinned: pandoc and poppler (pdftotext)
14
+ # via the pinned Debian release of the base image, markitdown via the
15
+ # exact pip version. Bump deliberately, with a conversion smoke-test.
16
+ FROM python:3.13-slim-trixie
17
+
18
+ # poppler-utils ships pdftotext — the PDF arm. Its C++ parser is
19
+ # exactly the kind of attack surface this networkless container
20
+ # exists to contain.
21
+ RUN apt-get update \
22
+ && apt-get install -y --no-install-recommends pandoc poppler-utils \
23
+ && rm -rf /var/lib/apt/lists/*
24
+
25
+ # Extras pull the per-format converter deps (mammoth, openpyxl,
26
+ # python-pptx, ...). No `all` — the OCR / transcription / cloud arms
27
+ # would bloat the image and want network or models at runtime.
28
+ RUN pip install --no-cache-dir 'markitdown[docx,xlsx,xls,pptx]==0.1.6'
29
+
30
+ COPY convert.sh /usr/local/bin/pikuri-convert
31
+ RUN chmod 0755 /usr/local/bin/pikuri-convert
32
+
33
+ # Drop root; HOME on the tmpfs so anything insisting on a writable
34
+ # home (cache dirs etc.) lands somewhere ephemeral under --read-only.
35
+ ENV HOME=/tmp
36
+ USER 65534:65534
37
+
38
+ ENTRYPOINT ["pikuri-convert"]
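The runtime contract in the Dockerfile header is enforced by the flags the extractor passes to `docker run`, not by the image. A minimal sketch of such an argv builder follows; `sandboxed_argv` is a hypothetical helper, and the image tag in the usage line is only an example value.

```ruby
# Hypothetical helper: build the one-shot container argv for a format tag.
def sandboxed_argv(image, format)
  [
    'docker', 'run', '--rm', '-i',             # throwaway container, stdin attached
    '--network=none',                          # no network at runtime
    '--read-only',                             # immutable root filesystem
    '--cap-drop=ALL',                          # no Linux capabilities
    '--security-opt', 'no-new-privileges',
    '--tmpfs', '/tmp',                         # the only writable path, in memory
    image, format                              # the entrypoint receives the format tag
  ]
end

puts sandboxed_argv('pikuri-internal-extractors:0.0.6', 'docx').join(' ')
```

No `-v` or `--mount` flag appears anywhere in the list, so the document reaches the converter through stdin alone.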
data/docker/convert.sh ADDED
@@ -0,0 +1,20 @@
1
+ #!/bin/sh
2
+ # Entrypoint dispatch for the pikuri-extractors converter image:
3
+ # $1 is the format tag, document bytes on stdin, Markdown on stdout.
4
+ # Keep the format→converter mapping in sync with
5
+ # Pikuri::Extractors::Documents::HOST_CONVERTERS — pandoc where its
6
+ # reader exists (it preserves more structure), markitdown for the
7
+ # OOXML spreadsheet/presentation arms, pdftotext for PDF (plain text
8
+ # with \f page separators; the Ruby side turns those into
9
+ # "--- Page N ---" markers), bare markitdown (its own magic-byte
10
+ # detection) for the `auto` fallback.
11
+ set -eu
12
+
13
+ fmt="${1:-auto}"
14
+ case "$fmt" in
15
+ odt|rtf|epub|docx) exec pandoc -f "$fmt" -t gfm --wrap=none ;;
16
+ xlsx|xls|pptx) exec markitdown -x "$fmt" ;;
17
+ pdf) exec pdftotext -q -enc UTF-8 - - ;;
18
+ auto) exec markitdown ;;
19
+ *) echo "pikuri-convert: unknown format tag: $fmt" >&2; exit 64 ;;
20
+ esac
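The stdin→stdout contract can be exercised from Ruby without docker. The sketch below is an illustration under stated assumptions, not the gem's `Pikuri::Subprocess.run`: `convert_via_subprocess` is a hypothetical helper, and a child Ruby process that upcases its input stands in for the real converter. Because the child's stdout goes to a file rather than a pipe, the parent can write all of stdin and then wait without risking a deadlock.

```ruby
require 'rbconfig'
require 'stringio'
require 'tempfile'

# Hypothetical helper: stream +input+ into the command's stdin and
# collect its stdout in a Tempfile, which is returned rewound.
def convert_via_subprocess(argv, input)
  out = Tempfile.new(['converted', '.md'])
  reader, writer = IO.pipe
  pid = Process.spawn(*argv, in: reader, out: out.path)
  reader.close                   # the parent keeps only the write end
  IO.copy_stream(input, writer)  # stream the bytes, never slurp them
  writer.close                   # signals EOF to the child
  _, status = Process.wait2(pid)
  raise "converter failed (exit #{status.exitstatus})" unless status.success?

  out.rewind
  out
rescue StandardError
  out&.close!                    # do not leak the Tempfile on failure
  raise
end

# Stand-in converter: a child Ruby that upcases stdin.
stand_in = [RbConfig.ruby, '-e', '$stdout.write($stdin.read.upcase)']
file = convert_via_subprocess(stand_in, StringIO.new("# title\nbody\n"))
puts file.read
file.close!
```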
data/lib/pikuri/extractors/documents.rb ADDED
@@ -0,0 +1,517 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'tempfile'
4
+
5
+ module Pikuri
6
+ module Extractors
7
+ # Document extractor for the {Pikuri::Extractor} registry:
8
+ # DOCX / ODT / XLSX / legacy XLS / PPTX / EPUB / RTF / PDF →
9
+ # Markdown, by piping the document bytes through pandoc (ODF,
10
+ # RTF, EPUB, DOCX), markitdown (the OOXML spreadsheet /
11
+ # presentation arms), or pdftotext (PDF), selected per format.
12
+ #
13
+ # == Container first, host CLI second
14
+ #
15
+ # The preferred converter is a one-shot, locally-built docker
16
+ # container ({IMAGE}, built from this gem's +docker/+ directory):
17
+ # +docker run --rm -i --network=none --read-only --cap-drop=ALL+,
18
+ # bytes in via stdin, Markdown out via stdout, **no volume
19
+ # mounts**. Two reasons this beats running a host-installed
20
+ # converter directly:
21
+ #
22
+ # * *Security.* These documents typically arrive via +fetch+ /
23
+ # +web_scrape+ — untrusted bytes — and complex format parsers
24
+ # are a classic exploitation surface. In the container the
25
+ # parser sees no network and no host filesystem; the worst a
26
+ # malicious document can do is produce garbage Markdown.
27
+ # * *Reproducibility.* The Dockerfile pins pandoc and poppler (via the
28
+ # base image's apt) and markitdown (exact pip version); a host
29
+ # install is whatever version the machine happens to have.
30
+ #
31
+ # When docker is unavailable (binary absent or daemon down), the
32
+ # extractor falls back to host-installed +pandoc+ / +markitdown+
33
+ # / +pdftotext+ CLIs — same stdin→stdout contract, one code path
34
+ # with two argv builders. Which arm was picked is logged once via
35
+ # +Pikuri.logger_for('Extractors')+.
36
+ #
37
+ # == Registration is explicit
38
+ #
39
+ # Requiring pikuri-extractors defines this class and the shared
40
+ # {DOCUMENTS} instance but registers nothing. A host script opts
41
+ # in with +Pikuri::Extractors::DOCUMENTS.register+, which inserts
42
+ # the instance before the registry's terminal +Passthrough+ entry
43
+ # (and after core's +HTML+ — and pikuri-pdf's front-inserted
44
+ # +PDF+, when the host registers that too — which keep winning
45
+ # their formats). Same opt-in philosophy as +c.add_extension+ —
46
+ # no behavior changes by require alone.
47
+ #
48
+ # == Format detection
49
+ #
50
+ # {#matches?} claims content by normalized content-type
51
+ # ({CONTENT_TYPES}) or by byte sniff: PDF's +%PDF-+ prefix, RTF's
52
+ # +{\\rtf+ prefix; for
53
+ # zip-based formats, ODF and EPUB mandate an uncompressed
54
+ # +mimetype+ first entry (so the literal mime string sits inside
55
+ # the leading sample), and OOXML is recognised by the
56
+ # +[Content_Types].xml+ entry plus a +word/+ / +ppt/+ / +xl/+
57
+ # entry-name scan. {#extract} re-sniffs (the registry duck type
58
+ # doesn't pass +content_type+ to +extract+); when content was
59
+ # claimed by content-type but the sniff is blind (legacy XLS — an
60
+ # OLE2 container whose discriminating directory sits at the end of
61
+ # the file, past the sample), the bytes go to markitdown with no
62
+ # format hint and its own magic-byte detection takes over. The
63
+ # consequence: a *local* +.xls+ (no transport content-type, sniff
64
+ # blind) is not claimed at all and keeps today's binary refusal.
65
+ # One ordering edge vs pikuri-pdf: this instance sits *after*
66
+ # core's +HTML+ in the registry, so a PDF served under a lying
67
+ # +text/html+ header goes to the HTML extractor (pikuri-pdf
68
+ # front-inserts and wins that case). Accepted — lying-header PDFs
69
+ # under specifically +text/html+ are rare.
70
+ #
71
+ # == PDF: this gem or pikuri-pdf — pick one per wiring
72
+ #
73
+ # The PDF arm (pdftotext, with {#pdf_page_lines} restoring the
74
+ # +"--- Page N ---"+ markers from pdftotext's +\f+ separators)
75
+ # makes this extractor a complete superset of pikuri-pdf's
76
+ # formats, so a host that registers {DOCUMENTS} does NOT also
77
+ # register +Extractors::PDF+ — one extractor per format keeps the
78
+ # registry's first-match-wins semantics legible. The trade per
79
+ # wiring:
80
+ #
81
+ # * *This gem* — PDF parsing happens inside the sandbox (poppler
82
+ # is native code parsing attacker-controlled bytes; the
83
+ # container is exactly the right place for it), one gem covers
84
+ # every document format. Costs: docker (or host CLIs), no lazy
85
+ # paging (each paged read re-converts the whole PDF), and the
86
+ # generic +:document+ kind (the Read tools say "End of file",
87
+ # not "End of PDF", and a scanned PDF reads as "(Empty file)"
88
+ # rather than the scanned-image hint).
89
+ # * *pikuri-pdf* — in-process pure Ruby (no infrastructure), lazy
90
+ # +extract_lines+ paging (a windowed read of a 500-page PDF
91
+ # parses only its window), PDF-specific Read-tool wording.
92
+ # Costs: pdf-reader's dependency subtree, parsing untrusted
93
+ # bytes in-process (pure Ruby, so DoS at worst).
94
+ #
95
+ # The guide walks this as a progression: chapter 3 wires
96
+ # pikuri-pdf (no docker yet), chapter 7's assistant supersedes it
97
+ # with this extractor.
98
+ #
99
+ # == Deliberately out of scope
100
+ #
101
+ # * *ODS / ODP* — neither pandoc nor markitdown reads them; the
102
+ # only converter that does (LibreOffice headless) costs a 2 GB+
103
+ # image. Excluded rather than half-supported.
104
+ # * *Image OCR / audio transcription* — markitdown's optional
105
+ # arms need model downloads; the converter image stays
106
+ # networkless and small. A multi-modal main LLM is the pikuri
107
+ # answer to images.
108
+ #
109
+ # == Paging economics
110
+ #
111
+ # A subprocess converter needs the whole document before it can
112
+ # emit anything, so there is no lazy parse: every
113
+ # +Extractor.extract_paged+ call (each +Read+ page of a long DOCX)
114
+ # re-runs the full conversion. Accepted — no result cache in v1.
115
+ # Both legs of one conversion still stream, though: the source +io+
116
+ # is handed to {Pikuri::Subprocess.run} and copied straight into
117
+ # the converter's stdin (+IO.copy_stream+ — a big local file never
118
+ # loads into the Ruby heap), and the converter's stdout lands in a
119
+ # Tempfile (also what makes the stdin/stdout pumping deadlock-free
120
+ # — see {Pikuri::Subprocess.run}) whose lines {#extract_lines}
121
+ # yields from disk — so neither the document nor the full Markdown
122
+ # String is ever resident during paging.
123
+ class Documents
124
+ # @return [Logger] gem-wide diagnostics logger.
125
+ LOGGER = Pikuri.logger_for('Extractors')
126
+
127
+ # @return [String] converter image tag. Version-tied so a gem
128
+ # upgrade rebuilds with the new pins; +pikuri-internal-+
129
+ # prefix matches the container-naming convention of the
130
+ # vectordb/memory supervisors.
131
+ IMAGE = "pikuri-internal-extractors:#{Pikuri::VERSION}"
132
+
133
+ # @return [String] absolute path to the shipped docker build
134
+ # context (Dockerfile + convert.sh).
135
+ DOCKER_DIR = File.expand_path('../../../docker', __dir__)
136
+
137
+ # @return [String] coreutils-+timeout+ budget for one
138
+ # conversion. Generous — a huge PPTX through markitdown can
139
+ # take a while — but bounded, so a wedged converter can't
140
+ # hang the agent loop.
141
+ CONVERT_TIMEOUT = '300s'
142
+
143
+ # @return [String] sentinel format meaning "let markitdown's
144
+ # magic-byte detection decide" — the fallback when content was
145
+ # claimed by content-type but the byte sniff is blind.
146
+ AUTO = 'auto'
147
+
148
+ # @return [String] the PDF format tag. Singled out as a constant
149
+ # because PDF is the one format whose converter output gets a
150
+ # post-processing pass: pdftotext emits +\f+ between pages, and
151
+ # {#extract} / {#extract_lines} turn those into the same
152
+ # +"--- Page N ---"+ marker lines pikuri-pdf's extractor emits,
153
+ # so page provenance (vectordb chunk citations, the Read tools'
154
+ # page references) survives whichever PDF extractor a host
155
+ # wires.
156
+ PDF = 'pdf'
157
+
158
+ # @return [Hash{String => String}] normalized content-type →
159
+ # format tag (the tag doubles as the container entrypoint's
160
+ # dispatch argument and pandoc's +-f+ / markitdown's +-x+
161
+ # value).
162
+ CONTENT_TYPES = {
163
+ 'application/vnd.oasis.opendocument.text' => 'odt',
164
+ 'application/rtf' => 'rtf',
165
+ 'text/rtf' => 'rtf',
166
+ 'application/epub+zip' => 'epub',
167
+ 'application/pdf' => PDF,
168
+ 'application/vnd.openxmlformats-officedocument.wordprocessingml.document' => 'docx',
169
+ 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' => 'xlsx',
170
+ 'application/vnd.ms-excel' => 'xls',
171
+ 'application/vnd.openxmlformats-officedocument.presentationml.presentation' => 'pptx'
172
+ }.freeze
173
+
174
+ # @return [Hash{String => Array<Symbol>}] format tag → host CLIs
175
+ # that can convert it, in preference order. Mirrors the
176
+ # container entrypoint's dispatch (+docker/convert.sh+) — keep
177
+ # the two in sync. pandoc leads where both could serve (DOCX,
178
+ # EPUB): its readers preserve more structure.
179
+ HOST_CONVERTERS = {
180
+ 'odt' => %i[pandoc],
181
+ 'rtf' => %i[pandoc],
182
+ 'epub' => %i[pandoc markitdown],
183
+ 'docx' => %i[pandoc markitdown],
184
+ 'xlsx' => %i[markitdown],
185
+ 'xls' => %i[markitdown],
186
+ 'pptx' => %i[markitdown],
187
+ PDF => %i[pdftotext],
188
+ AUTO => %i[markitdown]
189
+ }.freeze
190
+
191
+ # @return [String] zip local-file-header magic, shared by every
192
+ # OOXML / ODF / EPUB document.
193
+ ZIP_MAGIC = "PK\x03\x04".b
194
+
195
+ # @return [Hash{String => String}] host-CLI name → the flag that
196
+ # makes it print a version and exit 0, where +--version+ (the
197
+ # default probe, see {#cli?}) doesn't work: poppler's
198
+ # +pdftotext+ parses +--version+ as a *filename* and exits 1,
199
+ # but accepts +-v+.
200
+ VERSION_PROBE_FLAGS = { 'pdftotext' => '-v' }.freeze
201
+
202
+ # @return [Symbol] kind tag carried on +Extractor::Page#kind+.
203
+ def kind
204
+ :document
205
+ end
206
+
207
+ # Claim content this extractor can convert: a recognised
208
+ # content-type, or a positive byte sniff (see "Format detection"
209
+ # in the class docs).
210
+ #
211
+ # @param sample [String] leading bytes of the content.
212
+ # @param content_type [String, nil] normalized content-type, or
213
+ # +nil+ when the transport carries none (local files).
214
+ # @return [Boolean]
215
+ def matches?(sample:, content_type:)
216
+ CONTENT_TYPES.key?(content_type) || !sniff(sample).nil?
217
+ end
218
+
219
+ # Convert the whole document behind +io+ to one Markdown String.
220
+ # PDFs come back as one +"--- Page N ---"+-headed block per
221
+ # text-carrying page (see {PDF}); a fully scanned PDF extracts
222
+ # to the empty String — same contract as pikuri-pdf's extractor.
223
+ #
224
+ # @param io [IO, StringIO] seekable IO positioned at the start.
225
+ # @return [String] Markdown-flavoured UTF-8 text.
226
+ # @raise [Pikuri::Extractor::Error] when no converter is
227
+ # available, the conversion exits non-zero, or it times out.
228
+ def extract(io)
229
+ with_converted(io) do |file, format|
230
+ format == PDF ? pdf_page_lines(file).to_a.join("\n") : file.read
231
+ end
232
+ end
233
+
234
+ # Same content as {#extract}, as a stream of +chomp+ed lines
235
+ # read off the converter's stdout Tempfile — the whole-document
236
+ # conversion still runs up front (subprocess converters can't
237
+ # parse lazily), but neither the document nor the Markdown ever
238
+ # materialises as one String: the conversion fires on first
239
+ # consumption, streaming +io+ into the converter. The enumerator
240
+ # owns the Tempfile and deletes it when iteration ends.
241
+ #
242
+ # @param io [IO, StringIO] seekable IO positioned at the start;
243
+ # must remain open until the enumerator is consumed (same
244
+ # contract as pikuri-pdf's lazy +extract_lines+).
245
+ # @return [Enumerator<String>]
246
+ # @raise [Pikuri::Extractor::Error] as for {#extract}, raised on
247
+ # first consumption.
248
+ def extract_lines(io)
249
+ Enumerator.new do |yielder|
250
+ with_converted(io) do |file, format|
251
+ if format == PDF
252
+ pdf_page_lines(file).each { |line| yielder << line }
253
+ else
254
+ file.each_line { |line| yielder << line.chomp }
255
+ end
256
+ end
257
+ end
258
+ end
259
+
260
+ # Plug this extractor into {Pikuri::Extractor.registry}, before
261
+ # the terminal +Passthrough+ entry. Idempotent — a second call
262
+ # is a no-op.
263
+ #
264
+ # @return [Documents] self, for chaining.
265
+ def register
266
+ registry = Pikuri::Extractor.registry
267
+ registry.insert(-2, self) unless registry.include?(self)
268
+ self
269
+ end
270
+
271
+ # Build the converter image now if it isn't present — for host
272
+ # scripts that prefer paying the one-time build (pip install +
273
+ # apt, minutes) at boot rather than mid-conversation. Entirely
274
+ # optional: {#extract} builds lazily on first use otherwise.
275
+ #
276
+ # @return [void]
277
+ # @raise [Pikuri::Extractor::Error] when docker is unavailable
278
+ # or the build fails.
279
+ def ensure_image!
280
+ raise Pikuri::Extractor::Error, '`docker` is unavailable; cannot build the converter image' unless docker?
281
+
282
+ image_ready!
283
+ nil
284
+ end
285
+
286
+ private
287
+
288
+ # Byte-sniff +sample+ to a format tag, or +nil+ when nothing
289
+ # recognisable. ODF / EPUB ride their mandated uncompressed
290
+ # +mimetype+ first zip entry; OOXML rides +[Content_Types].xml+
291
+ # plus an entry-name scan (the distinctive +word/+ / +ppt/+ /
292
+ # +xl/+ names almost always appear within the sample, the first
293
+ # entries being small).
294
+ #
295
+ # @param sample [String]
296
+ # @return [String, nil]
297
+ def sniff(sample)
298
+ s = sample.b
299
+ return PDF if s.start_with?(FileType::PDF_MAGIC)
300
+ return 'rtf' if s.start_with?('{\rtf')
301
+ return nil unless s.start_with?(ZIP_MAGIC)
302
+ return 'odt' if s.include?('mimetypeapplication/vnd.oasis.opendocument.text')
303
+ return 'epub' if s.include?('mimetypeapplication/epub+zip')
304
+ return nil unless s.include?('[Content_Types].xml')
305
+ return 'docx' if s.include?('word/')
306
+ return 'pptx' if s.include?('ppt/')
307
+ return 'xlsx' if s.include?('xl/')
308
+
309
+ nil
310
+ end
311
+
312
+ # Convert the document behind +io+, yield the rewound stdout
313
+ # Tempfile along with the resolved format tag (so callers know
314
+ # whether the PDF post-processing pass applies), delete the
315
+ # Tempfile. Sniffs the leading sample and rewinds before
316
+ # streaming — the same read-then-rewind pattern as
317
+ # +Extractor.extractor_for+.
318
+ #
319
+ # @param io [IO, StringIO] seekable IO positioned at the start.
320
+ # @yieldparam file [Tempfile] converter stdout, rewound.
321
+ # @yieldparam format [String] resolved format tag (or {AUTO}).
322
+ # @return [Object] the block's value.
323
+ def with_converted(io)
324
+ sample = io.read(FileType::SAMPLE_BYTES) || +''
325
+ io.rewind
326
+ format = sniff(sample) || AUTO
327
+ file = convert(io, format)
328
+ begin
329
+ yield file, format
330
+ ensure
331
+ file.close!
332
+ end
333
+ end
334
+
335
+ # Run one conversion of +format+: build the argv for the
336
+ # resolved arm, stream +io+ through it with stdout landing in
337
+ # a Tempfile.
338
+ #
339
+ # @param io [IO, StringIO] the whole document, positioned at the
340
+ # start.
341
+ # @param format [String] format tag (or {AUTO}).
342
+ # @return [Tempfile] rewound, holding the Markdown.
343
+ # @raise [Pikuri::Extractor::Error] on conversion failure.
344
+ def convert(io, format)
345
+ argv = converter_argv(format)
346
+ out = Tempfile.new(['pikuri-extractors', '.md'])
347
+ begin
348
+ result = Pikuri::Subprocess.run(
349
+ 'timeout', '--signal=TERM', '--kill-after=5s', CONVERT_TIMEOUT, *argv,
350
+ stdin_data: io, stdout: out, chdir: '/'
351
+ )
352
+ unless result.status.success?
353
+ raise Pikuri::Extractor::Error,
354
+ "document conversion (#{format}) failed " \
355
+ "(exit #{result.status.exitstatus}): #{stderr_tail(result.output)}"
356
+ end
357
+ out.rewind
358
+ out
359
+ rescue StandardError
360
+ out.close!
361
+ raise
362
+ end
363
+ end
364
+
365
+ # Turn pdftotext's +\f+-separated page stream into the
366
+ # +"--- Page N ---"+ marker shape pikuri-pdf's extractor emits:
367
+ # one marker line per page that carries text, then that page's
368
+ # stripped lines, textless pages silently skipped (keeping their
369
+ # number, so citations stay correct). Buffers one page at a
370
+ # time — bounded memory even for huge documents.
371
+ #
372
+ # @param file [Tempfile] pdftotext stdout, rewound.
373
+ # @return [Enumerator<String>] chomped lines, marker-headed per
374
+ # page.
375
+ def pdf_page_lines(file)
376
+ Enumerator.new do |lines|
377
+ page_no = 1
378
+ buffer = +''
379
+ flush = lambda do
380
+ text = buffer.strip
381
+ unless text.empty?
382
+ lines << "--- Page #{page_no} ---"
383
+ text.split("\n").each { |line| lines << line }
384
+ end
385
+ page_no += 1
386
+ buffer = +''
387
+ end
388
+ file.each_line do |raw|
389
+ segments = raw.split("\f", -1)
390
+ segments[0...-1].each do |segment|
391
+ buffer << segment
392
+ flush.call
393
+ end
394
+ buffer << segments.last
395
+ end
396
+ flush.call
397
+ end
398
+ end
399
+
400
+ # The argv for one conversion of +format+ — the container when
401
+ # docker is up (building the image first if needed), else the
402
+ # first present host CLI for the format.
403
+ #
404
+ # @param format [String] format tag (or {AUTO}).
405
+ # @return [Array<String>]
406
+ # @raise [Pikuri::Extractor::Error] when no arm is available.
407
+ def converter_argv(format)
408
+ if docker?
409
+ return ['docker', 'run', '--rm', '-i', '--network=none', '--read-only',
410
+ '--cap-drop=ALL', '--security-opt', 'no-new-privileges',
411
+ '--tmpfs', '/tmp', image_ready!, format]
412
+ end
413
+
414
+ backend = HOST_CONVERTERS.fetch(format).find { |cli| cli?(cli.to_s) }
415
+ unless backend
416
+ raise Pikuri::Extractor::Error,
417
+ "no converter available for #{format}: install docker (recommended — " \
418
+ "sandboxed, pinned versions) or a host #{HOST_CONVERTERS.fetch(format).join(' / ')} CLI"
419
+ end
420
+ host_argv(backend, format)
421
+ end
422
+
423
+ # @param backend [Symbol] +:pandoc+, +:markitdown+, or +:pdftotext+.
424
+ # @param format [String] format tag (or {AUTO}).
425
+ # @return [Array<String>] host-CLI argv, same stdin→stdout
426
+ # contract as the container.
427
+ def host_argv(backend, format)
428
+ case backend
429
+ when :pandoc then ['pandoc', '-f', format, '-t', 'gfm', '--wrap=none']
430
+ when :markitdown then format == AUTO ? ['markitdown'] : ['markitdown', '-x', format]
431
+ when :pdftotext then ['pdftotext', '-q', '-enc', 'UTF-8', '-', '-']
432
+ else raise "unknown converter backend #{backend.inspect}" # pikuri bug
433
+ end
434
+ end
435
+
436
+ # Is a usable docker (binary on PATH *and* daemon answering)
437
+ # available? Probed once, memoized, choice logged.
438
+ #
439
+ # @return [Boolean]
440
+ def docker?
441
+ unless defined?(@docker)
442
+ @docker = begin
443
+ docker_cmd('info').status.success?
444
+ rescue Errno::ENOENT
445
+ false
446
+ end
447
+ LOGGER.info(@docker ? 'docker available — converting in the sandboxed container' \
448
+ : 'docker unavailable — falling back to host pandoc/markitdown')
449
+ end
450
+ @docker
451
+ end
452
+
453
+ # Make sure {IMAGE} exists locally, building from {DOCKER_DIR}
454
+ # on first need. Memoized after one successful check.
455
+ #
456
+ # @return [String] {IMAGE}, for inlining into the run argv.
457
+ # @raise [Pikuri::Extractor::Error] when the build fails.
458
+ def image_ready!
459
+ return IMAGE if @image_ready
460
+
461
+ unless docker_cmd('image', 'inspect', IMAGE).status.success?
462
+ LOGGER.info("building #{IMAGE} from #{DOCKER_DIR} (one-time, may take minutes)")
463
+ build = docker_cmd('build', '-t', IMAGE, DOCKER_DIR)
464
+ unless build.status.success?
465
+ raise Pikuri::Extractor::Error,
466
+ "docker build of #{IMAGE} failed: #{stderr_tail(build.output)}"
467
+ end
468
+ end
469
+ @image_ready = true
470
+ IMAGE
471
+ end
472
+
473
+ # @param argv [Array<String>] docker subcommand + args.
474
+ # @return [Pikuri::Subprocess::Result]
475
+ def docker_cmd(*argv)
476
+ Pikuri::Subprocess.spawn('docker', *argv, chdir: '/').wait
477
+ end
478
+
479
+ # Is +name+'s version probe (+--version+, or the CLI's
480
+ # {VERSION_PROBE_FLAGS} override) runnable on the host? Probed
481
+ # once per name, memoized.
482
+ #
483
+ # @param name [String] CLI binary name.
484
+ # @return [Boolean]
485
+ def cli?(name)
486
+ @clis ||= {}
487
+ return @clis[name] if @clis.key?(name)
488
+
489
+ @clis[name] = begin
490
+ flag = VERSION_PROBE_FLAGS.fetch(name, '--version')
491
+ Pikuri::Subprocess.spawn(name, flag, chdir: '/').wait.status.success?
492
+ rescue Errno::ENOENT
493
+ false
494
+ end
495
+ end
496
+
497
+ # Last ~500 chars of converter diagnostics, for an
498
+ # LLM-presentable error message.
499
+ #
500
+ # @param text [String, nil]
501
+ # @return [String]
502
+ def stderr_tail(text)
503
+ t = text.to_s.strip
504
+ return '(no diagnostics)' if t.empty?
505
+
506
+ t.length > 500 ? "...#{t[-500..]}" : t
507
+ end
508
+ end
509
+
510
+ # The shared, host-script-facing instance: call
511
+ # +Pikuri::Extractors::DOCUMENTS.register+ to plug it into the
512
+ # registry. One instance is right — its only state is memoized
513
+ # environment probes (docker? / image built / host CLIs), true
514
+ # process-wide.
515
+ DOCUMENTS = Documents.new
516
+ end
517
+ end
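A worked example of the form-feed handling in `pdf_page_lines` may help. The function below restates the same logic as a standalone sketch; `page_marked_lines` is a hypothetical name, and the input is a hand-written stand-in for pdftotext output with three pages, the second of them empty.

```ruby
require 'stringio'

# Hypothetical standalone restatement: turn "\f"-separated page text
# into "--- Page N ---" marker lines followed by that page's lines.
def page_marked_lines(io)
  Enumerator.new do |lines|
    page_no = 1
    buffer = +''
    flush = lambda do
      text = buffer.strip
      unless text.empty?
        lines << "--- Page #{page_no} ---"
        text.split("\n").each { |line| lines << line }
      end
      page_no += 1 # textless pages still consume their number
      buffer = +''
    end
    io.each_line do |raw|
      # Every "\f" inside the line closes one page; the remainder
      # belongs to the page that is still open.
      *complete, rest = raw.split("\f", -1)
      complete.each do |segment|
        buffer << segment
        flush.call
      end
      buffer << rest
    end
    flush.call
  end
end

sample = "Intro\nline two\n\f\fAppendix\n\f"
puts page_marked_lines(StringIO.new(sample)).to_a
# Prints:
# --- Page 1 ---
# Intro
# line two
# --- Page 3 ---
# Appendix
```

Page 2 produces no output but keeps its number, so the appendix is still cited as page 3.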
data/lib/pikuri-extractors.rb ADDED
@@ -0,0 +1,27 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'pikuri-core'
4
+
5
+ # Entry file for the pikuri-extractors gem. Sets up a dedicated
6
+ # Zeitwerk loader rooted at this gem's +lib/+, contributing to the
7
+ # shared +Pikuri::+ namespace alongside pikuri-core. After +require
8
+ # 'pikuri-extractors'+, +Pikuri::Extractors::Documents+ and the shared
9
+ # +Pikuri::Extractors::DOCUMENTS+ instance are defined — but *nothing
10
+ # is registered*: extractors plug into +Pikuri::Extractor.registry+
11
+ # only when the host script calls their +#register+ explicitly, so a
12
+ # +bin/pikuri-*+ picks which extractors it wires in (same opt-in
13
+ # philosophy as +c.add_extension+).
14
+ #
15
+ # The loader is per-gem (not shared with pikuri-core's loader) so each
16
+ # gem owns its own +lib/+ tree and the cooperation between gems is via
17
+ # the Pikuri namespace alone.
18
+ module Pikuri
19
+ module Extractors
20
+ LOADER = Zeitwerk::Loader.new
21
+ LOADER.tag = 'pikuri-extractors'
22
+ LOADER.push_dir(File.expand_path('.', __dir__))
23
+ LOADER.ignore(__FILE__)
24
+ LOADER.setup
25
+ LOADER.eager_load
26
+ end
27
+ end
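The explicit, idempotent registration relies on `Array#insert` with index `-2`, which places the new entry immediately before the last element. A small sketch with a stand-in registry shows the effect; the symbols replace the real extractor objects, and the `register` lambda is illustrative only.

```ruby
# Stand-in registry: core's HTML extractor first, the terminal
# Passthrough entry last.
registry = %i[html passthrough]

# Illustrative registration: insert before the terminal entry, once.
register = lambda do |extractor|
  registry.insert(-2, extractor) unless registry.include?(extractor)
  registry
end

register.call(:documents)
register.call(:documents) # second call is a no-op
p registry # => [:html, :documents, :passthrough]
```

First-match-wins lookup then tries HTML, then the document extractor, and falls through to Passthrough only when neither claims the content.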
metadata ADDED
@@ -0,0 +1,79 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: pikuri-extractors
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.6
5
+ platform: ruby
6
+ authors:
7
+ - Martin Vysny
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2026-06-04 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: pikuri-core
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - '='
18
+ - !ruby/object:Gem::Version
19
+ version: 0.0.6
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - '='
25
+ - !ruby/object:Gem::Version
26
+ version: 0.0.6
27
+ description: |
28
+ pikuri-extractors plugs additional document formats into
29
+ pikuri-core's +Pikuri::Extractor+ registry. The bundled
30
+ +Pikuri::Extractors::DOCUMENTS+ extractor converts office
31
+ documents (DOCX, ODT, XLSX, legacy XLS, PPTX, EPUB, RTF) to
32
+ Markdown by piping the bytes through pandoc / markitdown —
33
+ preferably inside a one-shot, networkless, locally-built docker
34
+ container (the untrusted bytes never touch the host filesystem or
35
+ network), falling back to a host-installed pandoc / markitdown
36
+ CLI when docker is absent.
37
+
38
+ Registration is explicit — +Pikuri::Extractors::DOCUMENTS.register+
39
+ — so requiring the gem changes nothing by itself; the host script
40
+ picks which extractors it wires in.
41
+ email:
42
+ - martin@vysny.me
43
+ executables: []
44
+ extensions: []
45
+ extra_rdoc_files: []
46
+ files:
47
+ - README.md
48
+ - docker/Dockerfile
49
+ - docker/convert.sh
50
+ - lib/pikuri-extractors.rb
51
+ - lib/pikuri/extractors/documents.rb
52
+ homepage: https://codeberg.org/mvysny/pikuri
53
+ licenses:
54
+ - MIT
55
+ metadata:
56
+ source_code_uri: https://codeberg.org/mvysny/pikuri/src/branch/master
57
+ changelog_uri: https://codeberg.org/mvysny/pikuri/src/branch/master/CHANGELOG.md
58
+ bug_tracker_uri: https://codeberg.org/mvysny/pikuri/issues
59
+ rubygems_mfa_required: 'true'
60
+ post_install_message:
61
+ rdoc_options: []
62
+ require_paths:
63
+ - lib
64
+ required_ruby_version: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - ">="
67
+ - !ruby/object:Gem::Version
68
+ version: '3.3'
69
+ required_rubygems_version: !ruby/object:Gem::Requirement
70
+ requirements:
71
+ - - ">="
72
+ - !ruby/object:Gem::Version
73
+ version: '0'
74
+ requirements: []
75
+ rubygems_version: 3.5.22
76
+ signing_key:
77
+ specification_version: 4
78
+ summary: Sandboxed document extractors (DOCX/ODT/XLSX/PPTX/EPUB/RTF) for pikuri.
79
+ test_files: []