RubyGems - pikuri-extractors - Versions diffs - 0.0.6 - Mend

pikuri-extractors 0.0.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml +7 -0
data/README.md +102 -0
data/docker/Dockerfile +38 -0
data/docker/convert.sh +20 -0
data/lib/pikuri/extractors/documents.rb +517 -0
data/lib/pikuri-extractors.rb +27 -0
metadata +79 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 760f5421f1396e28dfb86eacfbee7a70e3040313498805cc5617eff91cf7fb86
+  data.tar.gz: e220e4620d19cb8a400b71f6e0874b0db17e4a6ff289d2baf12d6b852876aca4
+SHA512:
+  metadata.gz: 20f920922f3548d0d862653ed081e61593eddb37cef90a48e00e1669cfb4181ecb6f82fce0befbbcf5a80a84c88aca109a8184b6c997175cde4fa15468b3b19c
+  data.tar.gz: 9e9d1428944ecf927d585e64e6efa42a7f111ab86bf510107ced9694e94a34288e91e415641c6990d9e862ef8fad2bf4ed0f123eabae36a9129c467c0fc140a0

data/README.md ADDED

@@ -0,0 +1,102 @@
+# pikuri-extractors
+Additional document extractors for the
+[pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit:
+office documents and PDFs → Markdown, converted preferably inside a
+one-shot **networkless docker container**, so a malicious document
+downloaded from the web is parsed somewhere it can't phone home or
+read your files.
+Provides:
+- `Pikuri::Extractors::DOCUMENTS` — an extractor for the
+  `Pikuri::Extractor` registry covering **DOCX, ODT, XLSX, legacy
+  XLS, PPTX, EPUB, RTF, and PDF**. Once registered, every pikuri
+  surface that routes through the registry picks the formats up for
+  free: the `read` tool pages through a local `.docx`, `web_scrape`
+  / `fetch` convert a downloaded `.odt`, the pikuri-vectordb indexer
+  ingests an `.epub` or a paper PDF (with `--- Page N ---` page
+  markers preserved for citations).
+The actual conversion is done by [pandoc](https://pandoc.org) (ODF,
+RTF, EPUB, DOCX — its readers preserve the most structure),
+[markitdown](https://github.com/microsoft/markitdown) (the XLSX /
+XLS / PPTX arms), and poppler's `pdftotext` (the PDF arm — its `\f`
+page separators are rebuilt into `--- Page N ---` markers on the
+Ruby side), dispatched per format. One stdin→stdout contract,
+two ways to run it:
+1. **Container (preferred).** A small, locally-built docker image
+   (pinned pandoc + pinned markitdown; built from
+   [docker/Dockerfile](docker/Dockerfile) on first use — read it,
+   it's short) run as
+   `docker run --rm -i --network=none --read-only --cap-drop=ALL`,
+   bytes in via stdin, Markdown out via stdout, **no volume
+   mounts**. Office-format parsers are large, complex codebases and
+   the documents they parse are exactly the bytes an attacker
+   controls; in the container, the blast radius of a parser exploit
+   is one throwaway process that can see neither your network nor
+   your filesystem.
+2. **Host CLI (fallback).** When docker is absent or the daemon is
+   down, a host-installed `pandoc` / `markitdown` / `pdftotext` is
+   used directly — convenient, but unpinned and unsandboxed.
+Deliberately **not** covered: ODS / ODP (neither pandoc nor markitdown
+reads them; the only converter that does is LibreOffice, a 2 GB+ image), and
+image OCR / audio transcription (need model downloads — point a
+multi-modal main LLM at images instead).
+**PDF: this gem or [pikuri-pdf](../pikuri-pdf) — pick one per
+wiring.** This gem parses PDFs inside the sandbox (poppler is native
+code chewing attacker-controlled bytes — exactly what the container
+is for) but re-converts the whole document on every paged read;
+pikuri-pdf is in-process pure Ruby with lazy page-windowed reads and
+no infrastructure to set up. The guide wires pikuri-pdf in chapter 3
+and supersedes it with this gem in chapter 7's assistant.
+## Install
+```ruby
+# Gemfile
+gem 'pikuri-extractors'
+```
+Plus *one* of: a working `docker` (recommended; the image builds
+itself on first use, network is only needed for that build), or
+host `pandoc` / `markitdown` / `pdftotext` CLIs.
+## Usage
+Requiring the gem changes nothing — registration is an explicit
+opt-in your script makes, same philosophy as `c.add_extension`:
+```ruby
+require 'pikuri-core'
+require 'pikuri-extractors'
+Pikuri::Extractors::DOCUMENTS.register
+# From here on, the registry handles the new formats everywhere:
+text = Pikuri::FileType.read_as_text(Pathname.new('report.docx'))
+# Optional: pay the one-time image build (~minutes) at boot instead
+# of mid-conversation. Requires docker; skip when relying on host CLIs.
+Pikuri::Extractors::DOCUMENTS.ensure_image!
+```
+## Performance posture
+A subprocess converter has no lazy mode: every paged `read` of a
+long document re-runs the full conversion (a cold `docker run` plus
+the converter itself — roughly a second for pandoc, a few for
+markitdown). That's accepted v1 behavior — no result cache — and it
+is well inside what an LLM tool call tolerates.
+## Format detection
+Content-type when the transport provides one (HTTP header), byte
+sniff otherwise: RTF by its `{\rtf` prefix; ODT / EPUB by the
+uncompressed `mimetype` zip entry their specs mandate first; DOCX /
+XLSX / PPTX by `[Content_Types].xml` plus an entry-name scan.
+Legacy `.xls` is recognised by content-type only (its OLE2 container
+isn't sniffable from the leading bytes), so a *local* `.xls` file —
+no content-type — keeps pikuri-core's binary refusal.
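
Note on the README's "Format detection" section: the ODT / EPUB sniff works because a stored (uncompressed) first zip entry puts the entry name and its payload back to back in the leading bytes. The following is a minimal sketch of that layout, not code from the gem; the helper `stored_zip_entry` is hypothetical and builds only the first local file header, not a complete archive.

```ruby
require 'zlib'

# Hypothetical helper: the bytes of one stored (method 0) zip entry,
# i.e. a 30-byte local file header, then the name, then the raw data.
def stored_zip_entry(name, data)
  header = [
    0x04034b50,        # local file header signature, "PK\x03\x04" on disk
    20, 0, 0, 0, 0,    # version needed, flags, method 0 = stored, time, date
    Zlib.crc32(data),  # CRC-32 of the payload
    data.bytesize,     # compressed size (same as raw, nothing is deflated)
    data.bytesize,     # uncompressed size
    name.bytesize, 0   # file-name length, extra-field length
  ].pack('VvvvvvVVVvv')
  header + name.b + data.b
end

sample = stored_zip_entry('mimetype', 'application/vnd.oasis.opendocument.text')

# With no extra field and no compression, name and payload are adjacent,
# so a plain substring search over the leading sample finds both.
puts sample.start_with?("PK\x03\x04".b)
puts sample.include?('mimetypeapplication/vnd.oasis.opendocument.text')
```

Both lines print `true`. A compressed `mimetype` entry would break the substring match, which is why the check leans on the ODF and EPUB requirement that the entry comes first and is stored uncompressed.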

data/docker/Dockerfile ADDED

@@ -0,0 +1,38 @@
+# Converter image for Pikuri::Extractors::DOCUMENTS — a one-shot,
+# networkless stdin→stdout document converter. Built locally by the
+# extractor on first use (`docker build -t pikuri-internal-extractors:<version> .`);
+# never pulled from a registry, so what runs is exactly what you can
+# read here.
+#
+# Runtime contract (enforced by the extractor's `docker run` flags,
+# not by this image): --network=none --read-only --cap-drop=ALL
+# --tmpfs /tmp, no volume mounts. Document bytes arrive on stdin,
+# Markdown leaves on stdout, diagnostics on stderr. Network exists
+# only at *build* time (apt + pip below).
+#
+# All converters are version-pinned: pandoc and poppler (pdftotext)
+# via the pinned Debian release of the base image, markitdown via the
+# exact pip version. Bump deliberately, with a conversion smoke-test.
+FROM python:3.13-slim-trixie
+# poppler-utils ships pdftotext — the PDF arm. Its C++ parser is
+# exactly the kind of attack surface this networkless container
+# exists to contain.
+RUN apt-get update \
+    && apt-get install -y --no-install-recommends pandoc poppler-utils \
+    && rm -rf /var/lib/apt/lists/*
+# Extras pull the per-format converter deps (mammoth, openpyxl,
+# python-pptx, ...). No `all` — the OCR / transcription / cloud arms
+# would bloat the image and want network or models at runtime.
+RUN pip install --no-cache-dir 'markitdown[docx,xlsx,xls,pptx]==0.1.6'
+COPY convert.sh /usr/local/bin/pikuri-convert
+RUN chmod 0755 /usr/local/bin/pikuri-convert
+# Drop root; HOME on the tmpfs so anything insisting on a writable
+# home (cache dirs etc.) lands somewhere ephemeral under --read-only.
+ENV HOME=/tmp
+USER 65534:65534
+ENTRYPOINT ["pikuri-convert"]

data/docker/convert.sh ADDED

@@ -0,0 +1,20 @@
+#!/bin/sh
+# Entrypoint dispatch for the pikuri-extractors converter image:
+# $1 is the format tag, document bytes on stdin, Markdown on stdout.
+# Keep the format→converter mapping in sync with
+# Pikuri::Extractors::Documents::HOST_CONVERTERS — pandoc where its
+# reader exists (it preserves more structure), markitdown for the
+# OOXML spreadsheet/presentation arms, pdftotext for PDF (plain text
+# with \f page separators; the Ruby side turns those into
+# "--- Page N ---" markers), bare markitdown (its own magic-byte
+# detection) for the `auto` fallback.
+set -eu
+fmt="${1:-auto}"
+case "$fmt" in
+  odt|rtf|epub|docx) exec pandoc -f "$fmt" -t gfm --wrap=none ;;
+  xlsx|xls|pptx)     exec markitdown -x "$fmt" ;;
+  pdf)               exec pdftotext -q -enc UTF-8 - - ;;
+  auto)              exec markitdown ;;
+  *) echo "pikuri-convert: unknown format tag: $fmt" >&2; exit 64 ;;
+esac
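
Note on the `pdf` arm: the script comment says the Ruby side turns pdftotext's `\f` page separators into `--- Page N ---` markers. The sketch below is a standalone restatement of that pass as it appears in `documents.rb`, reading from any IO so it can be tried without pdftotext installed. The method name `page_marked_lines` and the sample text are illustrative, not part of the gem.

```ruby
require 'stringio'

# Illustrative restatement of the form-feed pass: one marker line per
# page that carries text, textless pages skipped but still counted.
def page_marked_lines(io)
  Enumerator.new do |lines|
    page_no = 1
    buffer = +''
    flush = lambda do
      text = buffer.strip
      unless text.empty?
        lines << "--- Page #{page_no} ---"
        text.split("\n").each { |line| lines << line }
      end
      page_no += 1
      buffer = +''
    end
    io.each_line do |raw|
      # Every "\f" inside a line closes the page buffered so far.
      *complete, rest = raw.split("\f", -1)
      complete.each do |segment|
        buffer << segment
        flush.call
      end
      buffer << rest
    end
    flush.call
  end
end

# Page 2 is empty (two form feeds in a row), so its marker is skipped.
fake_pdftotext = StringIO.new("Hello\nWorld\n\f\fThird page\n\f")
puts page_marked_lines(fake_pdftotext).to_a
```

This prints `--- Page 1 ---`, `Hello`, `World`, `--- Page 3 ---`, `Third page`, one per line. The empty second page produces no output, yet the third page keeps its number.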

data/lib/pikuri/extractors/documents.rb ADDED

@@ -0,0 +1,517 @@
+# frozen_string_literal: true
+require 'tempfile'
+module Pikuri
+  module Extractors
+    # Document extractor for the {Pikuri::Extractor} registry:
+    # DOCX / ODT / XLSX / legacy XLS / PPTX / EPUB / RTF / PDF →
+    # Markdown, by piping the document bytes through pandoc (ODF,
+    # RTF, EPUB, DOCX), markitdown (the OOXML spreadsheet /
+    # presentation arms), or pdftotext (PDF), selected per format.
+    #
+    # == Container first, host CLI second
+    #
+    # The preferred converter is a one-shot, locally-built docker
+    # container ({IMAGE}, built from this gem's +docker/+ directory):
+    # +docker run --rm -i --network=none --read-only --cap-drop=ALL+,
+    # bytes in via stdin, Markdown out via stdout, **no volume
+    # mounts**. Two reasons this beats running a host-installed
+    # converter directly:
+    #
+    # * *Security.* These documents typically arrive via +fetch+ /
+    #   +web_scrape+ — untrusted bytes — and complex format parsers
+    #   are a classic exploitation surface. In the container the
+    #   parser sees no network and no host filesystem; the worst a
+    #   malicious document can do is produce garbage Markdown.
+    # * *Reproducibility.* The Dockerfile pins pandoc and poppler (via
+    #   the base image's Debian release) and markitdown (exact pip
+    #   version); a host install is whatever version the machine
+    #   happens to have.
+    #
+    # When docker is unavailable (binary absent or daemon down), the
+    # extractor falls back to host-installed +pandoc+ / +markitdown+
+    # / +pdftotext+ CLIs — same stdin→stdout contract, one code path
+    # with two argv builders. Which arm was picked is logged once via
+    # +Pikuri.logger_for('Extractors')+.
+    #
+    # == Registration is explicit
+    #
+    # Requiring pikuri-extractors defines this class and the shared
+    # {DOCUMENTS} instance but registers nothing. A host script opts
+    # in with +Pikuri::Extractors::DOCUMENTS.register+, which inserts
+    # the instance before the registry's terminal +Passthrough+ entry
+    # (and after core's +HTML+ — and pikuri-pdf's front-inserted
+    # +PDF+, when the host registers that too — which keep winning
+    # their formats). Same opt-in philosophy as +c.add_extension+ —
+    # no behavior changes by require alone.
+    #
+    # == Format detection
+    #
+    # {#matches?} claims content by normalized content-type
+    # ({CONTENT_TYPES}) or by byte sniff: PDF's +%PDF-+ prefix, RTF's
+    # +{\\rtf+ prefix; for
+    # zip-based formats, ODF and EPUB mandate an uncompressed
+    # +mimetype+ first entry (so the literal mime string sits inside
+    # the leading sample), and OOXML is recognised by the
+    # +[Content_Types].xml+ entry plus a +word/+ / +ppt/+ / +xl/+
+    # entry-name scan. {#extract} re-sniffs (the registry duck type
+    # doesn't pass +content_type+ to +extract+); when content was
+    # claimed by content-type but the sniff is blind (legacy XLS — an
+    # OLE2 container whose discriminating directory sits at the end of
+    # the file, past the sample), the bytes go to markitdown with no
+    # format hint and its own magic-byte detection takes over. The
+    # consequence: a *local* +.xls+ (no transport content-type, sniff
+    # blind) is not claimed at all and keeps pikuri-core's binary refusal.
+    # One ordering edge vs pikuri-pdf: this instance sits *after*
+    # core's +HTML+ in the registry, so a PDF served under a lying
+    # +text/html+ header goes to the HTML extractor (pikuri-pdf
+    # front-inserts and wins that case). Accepted — lying-header PDFs
+    # under specifically +text/html+ are rare.
+    #
+    # == PDF: this gem or pikuri-pdf — pick one per wiring
+    #
+    # The PDF arm (pdftotext, with {#pdf_page_lines} restoring the
+    # +"--- Page N ---"+ markers from pdftotext's +\f+ separators)
+    # makes this extractor a complete superset of pikuri-pdf's
+    # formats, so a host that registers {DOCUMENTS} does NOT also
+    # register +Extractors::PDF+ — one extractor per format keeps the
+    # registry's first-match-wins semantics legible. The trade per
+    # wiring:
+    #
+    # * *This gem* — PDF parsing happens inside the sandbox (poppler
+    #   is native code parsing attacker-controlled bytes; the
+    #   container is exactly the right place for it), one gem covers
+    #   every document format. Costs: docker (or host CLIs), no lazy
+    #   paging (each paged read re-converts the whole PDF), and the
+    #   generic +:document+ kind (the Read tools say "End of file",
+    #   not "End of PDF", and a scanned PDF reads as "(Empty file)"
+    #   rather than the scanned-image hint).
+    # * *pikuri-pdf* — in-process pure Ruby (no infrastructure), lazy
+    #   +extract_lines+ paging (a windowed read of a 500-page PDF
+    #   parses only its window), PDF-specific Read-tool wording.
+    #   Costs: pdf-reader's dependency subtree, parsing untrusted
+    #   bytes in-process (pure Ruby, so DoS at worst).
+    #
+    # The guide walks this as a progression: chapter 3 wires
+    # pikuri-pdf (no docker yet), chapter 7's assistant supersedes it
+    # with this extractor.
+    #
+    # == Deliberately out of scope
+    #
+    # * *ODS / ODP* — neither pandoc nor markitdown reads them; the
+    #   only converter that does (LibreOffice headless) costs a 2 GB+
+    #   image. Excluded rather than half-supported.
+    # * *Image OCR / audio transcription* — markitdown's optional
+    #   arms need model downloads; the converter image stays
+    #   networkless and small. A multi-modal main LLM is the pikuri
+    #   answer to images.
+    #
+    # == Paging economics
+    #
+    # A subprocess converter needs the whole document before it can
+    # emit anything, so there is no lazy parse: every
+    # +Extractor.extract_paged+ call (each +Read+ page of a long DOCX)
+    # re-runs the full conversion. Accepted — no result cache in v1.
+    # Both legs of one conversion still stream, though: the source +io+
+    # is handed to {Pikuri::Subprocess.run} and copied straight into
+    # the converter's stdin (+IO.copy_stream+ — a big local file never
+    # loads into the Ruby heap), and the converter's stdout lands in a
+    # Tempfile (also what makes the stdin/stdout pumping deadlock-free
+    # — see {Pikuri::Subprocess.run}) whose lines {#extract_lines}
+    # yields from disk — so neither the document nor the full Markdown
+    # String is ever resident during paging.
+    class Documents
+      # @return [Logger] gem-wide diagnostics logger.
+      LOGGER = Pikuri.logger_for('Extractors')
+      # @return [String] converter image tag. Version-tied so a gem
+      #   upgrade rebuilds with the new pins; +pikuri-internal-+
+      #   prefix matches the container-naming convention of the
+      #   vectordb/memory supervisors.
+      IMAGE = "pikuri-internal-extractors:#{Pikuri::VERSION}"
+      # @return [String] absolute path to the shipped docker build
+      #   context (Dockerfile + convert.sh).
+      DOCKER_DIR = File.expand_path('../../../docker', __dir__)
+      # @return [String] coreutils-+timeout+ budget for one
+      #   conversion. Generous — a huge PPTX through markitdown can
+      #   take a while — but bounded, so a wedged converter can't
+      #   hang the agent loop.
+      CONVERT_TIMEOUT = '300s'
+      # @return [String] sentinel format meaning "let markitdown's
+      #   magic-byte detection decide" — the fallback when content was
+      #   claimed by content-type but the byte sniff is blind.
+      AUTO = 'auto'
+      # @return [String] the PDF format tag. Singled out as a constant
+      #   because PDF is the one format whose converter output gets a
+      #   post-processing pass: pdftotext emits +\f+ between pages, and
+      #   {#extract} / {#extract_lines} turn those into the same
+      #   +"--- Page N ---"+ marker lines pikuri-pdf's extractor emits,
+      #   so page provenance (vectordb chunk citations, the Read tools'
+      #   page references) survives whichever PDF extractor a host
+      #   wires.
+      PDF = 'pdf'
+      # @return [Hash{String => String}] normalized content-type →
+      #   format tag (the tag doubles as the container entrypoint's
+      #   dispatch argument and pandoc's +-f+ / markitdown's +-x+
+      #   value).
+      CONTENT_TYPES = {
+        'application/vnd.oasis.opendocument.text' => 'odt',
+        'application/rtf' => 'rtf',
+        'text/rtf' => 'rtf',
+        'application/epub+zip' => 'epub',
+        'application/pdf' => PDF,
+        'application/vnd.openxmlformats-officedocument.wordprocessingml.document' => 'docx',
+        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' => 'xlsx',
+        'application/vnd.ms-excel' => 'xls',
+        'application/vnd.openxmlformats-officedocument.presentationml.presentation' => 'pptx'
+      }.freeze
+      # @return [Hash{String => Array<Symbol>}] format tag → host CLIs
+      #   that can convert it, in preference order. Mirrors the
+      #   container entrypoint's dispatch (+docker/convert.sh+) — keep
+      #   the two in sync. pandoc leads where both could serve (DOCX,
+      #   EPUB): its readers preserve more structure.
+      HOST_CONVERTERS = {
+        'odt'  => %i[pandoc],
+        'rtf'  => %i[pandoc],
+        'epub' => %i[pandoc markitdown],
+        'docx' => %i[pandoc markitdown],
+        'xlsx' => %i[markitdown],
+        'xls'  => %i[markitdown],
+        'pptx' => %i[markitdown],
+        PDF    => %i[pdftotext],
+        AUTO   => %i[markitdown]
+      }.freeze
+      # @return [String] zip local-file-header magic, shared by every
+      #   OOXML / ODF / EPUB document.
+      ZIP_MAGIC = "PK\x03\x04".b
+      # @return [Hash{String => String}] host-CLI name → the flag that
+      #   makes it print a version and exit 0, where +--version+ (the
+      #   default probe, see {#cli?}) doesn't work: poppler's
+      #   +pdftotext+ parses +--version+ as a *filename* and exits 1,
+      #   but accepts +-v+.
+      VERSION_PROBE_FLAGS = { 'pdftotext' => '-v' }.freeze
+      # @return [Symbol] kind tag carried on +Extractor::Page#kind+.
+      def kind
+        :document
+      end
+      # Claim content this extractor can convert: a recognised
+      # content-type, or a positive byte sniff (see "Format detection"
+      # in the class docs).
+      #
+      # @param sample [String] leading bytes of the content.
+      # @param content_type [String, nil] normalized content-type, or
+      #   +nil+ when the transport carries none (local files).
+      # @return [Boolean]
+      def matches?(sample:, content_type:)
+        CONTENT_TYPES.key?(content_type) || !sniff(sample).nil?
+      end
+      # Convert the whole document behind +io+ to one Markdown String.
+      # PDFs come back as one +"--- Page N ---"+-headed block per
+      # text-carrying page (see {PDF}); a fully scanned PDF extracts
+      # to the empty String — same contract as pikuri-pdf's extractor.
+      #
+      # @param io [IO, StringIO] seekable IO positioned at the start.
+      # @return [String] Markdown-flavoured UTF-8 text.
+      # @raise [Pikuri::Extractor::Error] when no converter is
+      #   available, the conversion exits non-zero, or it times out.
+      def extract(io)
+        with_converted(io) do |file, format|
+          format == PDF ? pdf_page_lines(file).to_a.join("\n") : file.read
+        end
+      end
+      # Same content as {#extract}, as a stream of +chomp+ed lines
+      # read off the converter's stdout Tempfile — the whole-document
+      # conversion still runs up front (subprocess converters can't
+      # parse lazily), but neither the document nor the Markdown ever
+      # materialises as one String: the conversion fires on first
+      # consumption, streaming +io+ into the converter. The enumerator
+      # owns the Tempfile and deletes it when iteration ends.
+      #
+      # @param io [IO, StringIO] seekable IO positioned at the start;
+      #   must remain open until the enumerator is consumed (same
+      #   contract as pikuri-pdf's lazy +extract_lines+).
+      # @return [Enumerator<String>]
+      # @raise [Pikuri::Extractor::Error] as for {#extract}, raised on
+      #   first consumption.
+      def extract_lines(io)
+        Enumerator.new do |yielder|
+          with_converted(io) do |file, format|
+            if format == PDF
+              pdf_page_lines(file).each { |line| yielder << line }
+            else
+              file.each_line { |line| yielder << line.chomp }
+            end
+          end
+        end
+      end
+      # Plug this extractor into {Pikuri::Extractor.registry}, before
+      # the terminal +Passthrough+ entry. Idempotent — a second call
+      # is a no-op.
+      #
+      # @return [Documents] self, for chaining.
+      def register
+        registry = Pikuri::Extractor.registry
+        registry.insert(-2, self) unless registry.include?(self)
+        self
+      end
+      # Build the converter image now if it isn't present — for host
+      # scripts that prefer paying the one-time build (pip install +
+      # apt, minutes) at boot rather than mid-conversation. Entirely
+      # optional: {#extract} builds lazily on first use otherwise.
+      #
+      # @return [void]
+      # @raise [Pikuri::Extractor::Error] when docker is unavailable
+      #   or the build fails.
+      def ensure_image!
+        raise Pikuri::Extractor::Error, '`docker` is unavailable; cannot build the converter image' unless docker?
+        image_ready!
+        nil
+      end
+      private
+      # Byte-sniff +sample+ to a format tag, or +nil+ when nothing
+      # recognisable. ODF / EPUB ride their mandated uncompressed
+      # +mimetype+ first zip entry; OOXML rides +[Content_Types].xml+
+      # plus an entry-name scan (the distinctive +word/+ / +ppt/+ /
+      # +xl/+ names almost always appear within the sample, the first
+      # entries being small).
+      #
+      # @param sample [String]
+      # @return [String, nil]
+      def sniff(sample)
+        s = sample.b
+        return PDF   if s.start_with?(FileType::PDF_MAGIC)
+        return 'rtf' if s.start_with?('{\rtf')
+        return nil unless s.start_with?(ZIP_MAGIC)
+        return 'odt'  if s.include?('mimetypeapplication/vnd.oasis.opendocument.text')
+        return 'epub' if s.include?('mimetypeapplication/epub+zip')
+        return nil unless s.include?('[Content_Types].xml')
+        return 'docx' if s.include?('word/')
+        return 'pptx' if s.include?('ppt/')
+        return 'xlsx' if s.include?('xl/')
+        nil
+      end
+      # Convert the document behind +io+, yield the rewound stdout
+      # Tempfile along with the resolved format tag (so callers know
+      # whether the PDF post-processing pass applies), delete the
+      # Tempfile. Sniffs the leading sample and rewinds before
+      # streaming — the same read-then-rewind pattern as
+      # +Extractor.extractor_for+.
+      #
+      # @param io [IO, StringIO] seekable IO positioned at the start.
+      # @yieldparam file [Tempfile] converter stdout, rewound.
+      # @yieldparam format [String] resolved format tag (or {AUTO}).
+      # @return [Object] the block's value.
+      def with_converted(io)
+        sample = io.read(FileType::SAMPLE_BYTES) || +''
+        io.rewind
+        format = sniff(sample) || AUTO
+        file = convert(io, format)
+        begin
+          yield file, format
+        ensure
+          file.close!
+        end
+      end
+      # Run one conversion of +format+: build the argv for the
+      # resolved arm, stream +io+ through it with stdout landing in
+      # a Tempfile.
+      #
+      # @param io [IO, StringIO] the whole document, positioned at the
+      #   start.
+      # @param format [String] format tag (or {AUTO}).
+      # @return [Tempfile] rewound, holding the Markdown.
+      # @raise [Pikuri::Extractor::Error] on conversion failure.
+      def convert(io, format)
+        argv = converter_argv(format)
+        out = Tempfile.new(['pikuri-extractors', '.md'])
+        begin
+          result = Pikuri::Subprocess.run(
+            'timeout', '--signal=TERM', '--kill-after=5s', CONVERT_TIMEOUT, *argv,
+            stdin_data: io, stdout: out, chdir: '/'
+          )
+          unless result.status.success?
+            raise Pikuri::Extractor::Error,
+                  "document conversion (#{format}) failed " \
+                  "(exit #{result.status.exitstatus}): #{stderr_tail(result.output)}"
+          end
+          out.rewind
+          out
+        rescue StandardError
+          out.close!
+          raise
+        end
+      end
+      # Turn pdftotext's +\f+-separated page stream into the
+      # +"--- Page N ---"+ marker shape pikuri-pdf's extractor emits:
+      # one marker line per page that carries text, then that page's
+      # lines (the page text is stripped as a whole, not line by line),
+      # textless pages silently skipped (keeping their number, so
+      # citations stay correct). Buffers one page at a
+      #
+      # @param file [Tempfile] pdftotext stdout, rewound.
+      # @return [Enumerator<String>] chomped lines, marker-headed per
+      #   page.
+      def pdf_page_lines(file)
+        Enumerator.new do |lines|
+          page_no = 1
+          buffer = +''
+          flush = lambda do
+            text = buffer.strip
+            unless text.empty?
+              lines << "--- Page #{page_no} ---"
+              text.split("\n").each { |line| lines << line }
+            end
+            page_no += 1
+            buffer = +''
+          end
+          file.each_line do |raw|
+            segments = raw.split("\f", -1)
+            segments[0...-1].each do |segment|
+              buffer << segment
+              flush.call
+            end
+            buffer << segments.last
+          end
+          flush.call
+        end
+      end
+      # The argv for one conversion of +format+ — the container when
+      # docker is up (building the image first if needed), else the
+      # first present host CLI for the format.
+      #
+      # @param format [String] format tag (or {AUTO}).
+      # @return [Array<String>]
+      # @raise [Pikuri::Extractor::Error] when no arm is available.
+      def converter_argv(format)
+        if docker?
+          return ['docker', 'run', '--rm', '-i', '--network=none', '--read-only',
+                  '--cap-drop=ALL', '--security-opt', 'no-new-privileges',
+                  '--tmpfs', '/tmp', image_ready!, format]
+        end
+        backend = HOST_CONVERTERS.fetch(format).find { |cli| cli?(cli.to_s) }
+        unless backend
+          raise Pikuri::Extractor::Error,
+                "no converter available for #{format}: install docker (recommended — " \
+                "sandboxed, pinned versions) or a host #{HOST_CONVERTERS.fetch(format).join(' / ')} CLI"
+        end
+        host_argv(backend, format)
+      end
+      # @param backend [Symbol] +:pandoc+, +:markitdown+, or +:pdftotext+.
+      # @param format [String] format tag (or {AUTO}).
+      # @return [Array<String>] host-CLI argv, same stdin→stdout
+      #   contract as the container.
+      def host_argv(backend, format)
+        case backend
+        when :pandoc then ['pandoc', '-f', format, '-t', 'gfm', '--wrap=none']
+        when :markitdown then format == AUTO ? ['markitdown'] : ['markitdown', '-x', format]
+        when :pdftotext then ['pdftotext', '-q', '-enc', 'UTF-8', '-', '-']
+        else raise "unknown converter backend #{backend.inspect}" # pikuri bug
+        end
+      end
+      # Is a usable docker (binary on PATH *and* daemon answering)
+      # available? Probed once, memoized, choice logged.
+      #
+      # @return [Boolean]
+      def docker?
+        unless defined?(@docker)
+          @docker = begin
+            docker_cmd('info').status.success?
+          rescue Errno::ENOENT
+            false
+          end
+          LOGGER.info(@docker ? 'docker available — converting in the sandboxed container' \
+                              : 'docker unavailable — falling back to host pandoc/markitdown')
+        end
+        @docker
+      end
+      # Make sure {IMAGE} exists locally, building from {DOCKER_DIR}
+      # on first need. Memoized after one successful check.
+      #
+      # @return [String] {IMAGE}, for inlining into the run argv.
+      # @raise [Pikuri::Extractor::Error] when the build fails.
+      def image_ready!
+        return IMAGE if @image_ready
+        unless docker_cmd('image', 'inspect', IMAGE).status.success?
+          LOGGER.info("building #{IMAGE} from #{DOCKER_DIR} (one-time, may take minutes)")
+          build = docker_cmd('build', '-t', IMAGE, DOCKER_DIR)
+          unless build.status.success?
+            raise Pikuri::Extractor::Error,
+                  "docker build of #{IMAGE} failed: #{stderr_tail(build.output)}"
+          end
+        end
+        @image_ready = true
+        IMAGE
+      end
+      # @param argv [Array<String>] docker subcommand + args.
+      # @return [Pikuri::Subprocess::Result]
+      def docker_cmd(*argv)
+        Pikuri::Subprocess.spawn('docker', *argv, chdir: '/').wait
+      end
+      # Is +name+'s version probe (+--version+, or the CLI's
+      # {VERSION_PROBE_FLAGS} override) runnable on the host? Probed
+      # once per name, memoized.
+      #
+      # @param name [String] CLI binary name.
+      # @return [Boolean]
+      def cli?(name)
+        @clis ||= {}
+        return @clis[name] if @clis.key?(name)
+        @clis[name] = begin
+          flag = VERSION_PROBE_FLAGS.fetch(name, '--version')
+          Pikuri::Subprocess.spawn(name, flag, chdir: '/').wait.status.success?
+        rescue Errno::ENOENT
+          false
+        end
+      end
+      # Last ~500 chars of converter diagnostics, for an
+      # LLM-presentable error message.
+      #
+      # @param text [String, nil]
+      # @return [String]
+      def stderr_tail(text)
+        t = text.to_s.strip
+        return '(no diagnostics)' if t.empty?
+        t.length > 500 ? "...#{t[-500..]}" : t
+      end
+    end
+    # The shared, host-script-facing instance: call
+    # +Pikuri::Extractors::DOCUMENTS.register+ to plug it into the
+    # registry. One instance is right — its only state is memoized
+    # environment probes (docker? / image built / host CLIs), true
+    # process-wide.
+    DOCUMENTS = Documents.new
+  end
+end
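
Note on "Paging economics": the class docs describe streaming the source IO into the converter's stdin while its stdout lands in a Tempfile. `Pikuri::Subprocess.run` itself is not part of this diff, so the following is only a rough stdlib approximation of that shape. The method `stream_through` is hypothetical, and a tiny Ruby one-liner stands in for pandoc.

```ruby
require 'rbconfig'
require 'stringio'
require 'tempfile'

# Hypothetical stand-in for the stdin -> converter -> Tempfile plumbing.
# stdout goes to a file rather than a pipe, so feeding stdin can never
# block on an unread stdout buffer.
def stream_through(argv, io)
  out = Tempfile.new(['converted', '.md'])
  reader, writer = IO.pipe
  pid = Process.spawn(*argv, in: reader, out: out.fileno)
  reader.close
  IO.copy_stream(io, writer) # chunked copy, no whole-document String
  writer.close               # EOF tells the converter the input is done
  _, status = Process.wait2(pid)
  raise "converter failed (exit #{status.exitstatus})" unless status.success?

  out.rewind
  out
rescue StandardError
  out&.close! # never leave the Tempfile behind on failure
  raise
end

# Stand-in "converter": upcases stdin. A real run would use pandoc etc.
converter = [RbConfig.ruby, '-e', 'STDOUT.write(STDIN.read.upcase)']
file = stream_through(converter, StringIO.new("# title\n"))
puts file.read
file.close!
```

The caller owns the returned Tempfile and deletes it with `close!`, mirroring the ownership rule the `extract_lines` comment states for its enumerator. The sketch omits the `timeout` wrapper and the stderr capture that the gem's `convert` method relies on.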

data/lib/pikuri-extractors.rb ADDED

@@ -0,0 +1,27 @@
+# frozen_string_literal: true
+require 'pikuri-core'
+# Entry file for the pikuri-extractors gem. Sets up a dedicated
+# Zeitwerk loader rooted at this gem's +lib/+, contributing to the
+# shared +Pikuri::+ namespace alongside pikuri-core. After +require
+# 'pikuri-extractors'+, +Pikuri::Extractors::Documents+ and the shared
+# +Pikuri::Extractors::DOCUMENTS+ instance are defined — but *nothing
+# is registered*: extractors plug into +Pikuri::Extractor.registry+
+# only when the host script calls their +#register+ explicitly, so a
+# +bin/pikuri-*+ picks which extractors it wires in (same opt-in
+# philosophy as +c.add_extension+).
+#
+# The loader is per-gem (not shared with pikuri-core's loader) so each
+# gem owns its own +lib/+ tree and the cooperation between gems is via
+# the Pikuri namespace alone.
+module Pikuri
+  module Extractors
+    LOADER = Zeitwerk::Loader.new
+    LOADER.tag = 'pikuri-extractors'
+    LOADER.push_dir(File.expand_path('.', __dir__))
+    LOADER.ignore(__FILE__)
+    LOADER.setup
+    LOADER.eager_load
+  end
+end
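
Note on explicit registration: `Documents#register` places the extractor with `registry.insert(-2, self)`, which relies on `Array#insert` putting the value after the element at a negative index. The toy below shows where that lands. Symbols stand in for the real extractor objects, and the `register` helper is a hypothetical mirror of the gem's method, not the gem's code.

```ruby
# Toy registry: symbols stand in for extractor instances (assumption).
registry = %i[html passthrough]

# Hypothetical mirror of Documents#register: insert before the terminal
# entry, and do nothing when already present.
def register(registry, extractor)
  registry.insert(-2, extractor) unless registry.include?(extractor)
  registry
end

register(registry, :documents)
register(registry, :documents) # idempotent: second call changes nothing
p registry # => [:html, :documents, :passthrough]

# A front-inserting extractor (as the docs describe for pikuri-pdf)
# still wins first-match-wins lookups over :documents.
registry.unshift(:pdf)
p registry # => [:pdf, :html, :documents, :passthrough]
```

With a negative index, `insert` adds after the referenced element, so `-2` means "just before the last entry", whatever the array length.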

metadata ADDED

@@ -0,0 +1,79 @@
+--- !ruby/object:Gem::Specification
+name: pikuri-extractors
+version: !ruby/object:Gem::Version
+  version: 0.0.6
+platform: ruby
+authors:
+- Martin Vysny
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2026-06-04 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: pikuri-core
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '='
+      - !ruby/object:Gem::Version
+        version: 0.0.6
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '='
+      - !ruby/object:Gem::Version
+        version: 0.0.6
+description: |
+  pikuri-extractors plugs additional document formats into
+  pikuri-core's +Pikuri::Extractor+ registry. The bundled
+  +Pikuri::Extractors::DOCUMENTS+ extractor converts office
+  documents (DOCX, ODT, XLSX, legacy XLS, PPTX, EPUB, RTF) to
+  Markdown by piping the bytes through pandoc / markitdown —
+  preferably inside a one-shot, networkless, locally-built docker
+  container (the untrusted bytes never touch the host filesystem or
+  network), falling back to a host-installed pandoc / markitdown
+  CLI when docker is absent.
+  Registration is explicit — +Pikuri::Extractors::DOCUMENTS.register+
+  — so requiring the gem changes nothing by itself; the host script
+  picks which extractors it wires in.
+email:
+- martin@vysny.me
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- README.md
+- docker/Dockerfile
+- docker/convert.sh
+- lib/pikuri-extractors.rb
+- lib/pikuri/extractors/documents.rb
+homepage: https://codeberg.org/mvysny/pikuri
+licenses:
+- MIT
+metadata:
+  source_code_uri: https://codeberg.org/mvysny/pikuri/src/branch/master
+  changelog_uri: https://codeberg.org/mvysny/pikuri/src/branch/master/CHANGELOG.md
+  bug_tracker_uri: https://codeberg.org/mvysny/pikuri/issues
+  rubygems_mfa_required: 'true'
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '3.3'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.5.22
+signing_key:
+specification_version: 4
+summary: Sandboxed document extractors (DOCX/ODT/XLSX/PPTX/EPUB/RTF) for pikuri.
+test_files: []