pikuri-workspace 0.0.5 → 0.0.6

Files changed (3)

checksums.yaml +4 -4
data/lib/pikuri/workspace/read.rb +41 -36
metadata +3 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: afd83aa622997eea8b09c700282cf401753d1925ace04a0c4ade3856c07ee8b4
-  data.tar.gz: 4cd46c7de8d42112e1119e68bab0ce7937ed9e5c270a29ceb34e5089178abd4c
+  metadata.gz: 52db0d4ce078507aa4a72cbc84d54a9433788f2a2d9a90ab883a9e469e6dea99
+  data.tar.gz: 107b34e1c3b387b4b68d542ff8ab2345885383a461c227a6b7aeb29973a4292e
 SHA512:
-  metadata.gz: c818b87a2aca2f0f615d084cc7c6e0457d0a84e5cd5be8b061c36aba50a071ad2565c27170d33ae25ca810dee8a25b16b1a13b25477b21e1c00eb79d067f2bb0
-  data.tar.gz: 7277b1ba878546589d4107136f4d49e3dc68b0ee5aff80d3eeff8e06f9d97b4fa569eb8caf186337e39c72e9f33812e3963fd433eeb35f926f82bb6b681bb579
+  metadata.gz: 432af6cfc0a0f3555666e9c88accb0e9b6162af2c5f041c9ff71b10443f1681b8e70b9e46aa6c75ed12344357a286df087869b889f0a40aeb9635aa6c9a1e651
+  data.tar.gz: 362eb9437127e8734f15e09919ae7fb928e0d1b71fc9bb90a78f419cb7b4f52f29aebd323dd32e7bd491b0a73a3de60cff54a91e1756f24bcdc4e91d52de5fc2
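
The entries in checksums.yaml are hex digests of the two archives packed inside the .gem file (metadata.gz and data.tar.gz). The following is a minimal sketch of how one such entry could be checked with Ruby's standard digest and yaml libraries. The verify_checksum helper and the in-memory stand-in bytes are hypothetical illustrations, not part of pikuri-workspace or of the RubyGems API.

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper: compare a blob against the digest recorded for
# `name` under `algorithm` in a checksums.yaml-shaped document.
def verify_checksum(checksums_yaml, algorithm:, name:, bytes:)
  expected = YAML.safe_load(checksums_yaml).fetch(algorithm).fetch(name)
  actual = case algorithm
           when 'SHA256' then Digest::SHA256.hexdigest(bytes)
           when 'SHA512' then Digest::SHA512.hexdigest(bytes)
           else raise ArgumentError, "unsupported algorithm: #{algorithm}"
           end
  actual == expected
end

# Stand-in bytes; a real check would read data.tar.gz out of the .gem archive.
data = 'example archive bytes'
checksums = {
  'SHA256' => { 'data.tar.gz' => Digest::SHA256.hexdigest(data) }
}.to_yaml

puts verify_checksum(checksums, algorithm: 'SHA256', name: 'data.tar.gz', bytes: data)
puts verify_checksum(checksums, algorithm: 'SHA256', name: 'data.tar.gz', bytes: data + 'x')
```

Because every release repacks both archives, all four digests change even when only one source file differs, which is why this hunk is a full +4 -4 swap.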

data/lib/pikuri/workspace/read.rb CHANGED

@@ -27,7 +27,7 @@ module Pikuri
     #
     # The line/byte windowing is delegated to
     # {Pikuri::FileType.read_as_text_paged}, which returns a
-    # {Pikuri::FileType::Page} this tool renders; the same windower
+    # {Pikuri::Extractor::Page} this tool renders; the same windower
     # backs +VectorDb::Tools::Read+. Two independent limits, whichever fires
     # first wins:
     #
@@ -38,27 +38,29 @@ module Pikuri
     # Additionally, individual lines longer than {MAX_LINE_LENGTH} chars
     # are truncated with {LINE_TRUNCATION_MARKER} appended; the model is
     # told to reach for +grep+ to find content inside such files. (These
-    # constants alias the +PAGE_*+ ones on {Pikuri::FileType} — one
+    # constants alias the +PAGE_*+ ones on {Pikuri::Extractor} — one
     # source of truth, shared with +VectorDb::Tools::Read+.)
     #
-    # == PDF extraction
+    # == PDF (and other extracted formats)
     #
-    # PDFs are detected by their +%PDF-+ magic prefix in the sample bytes
-    # and routed through {Pikuri::FileType.read_as_text_paged} instead of
-    # the binary-refusal path. The extractor walks pages lazily via
-    # +pdf-reader+, emitting one synthetic +"--- Page N ---"+ header line
-    # per page followed by that page's text. The offset / limit /
-    # MAX_BYTES contract is identical to the text path — extraction stops
-    # as soon as the line or byte cap is hit, so reading the first window
-    # of a 500-page PDF only parses the few pages needed. Line numbers in
-    # PDF output are for citation back to the user only; PDFs are not
-    # editable through {Edit}.
+    # Which formats read as text is the {Pikuri::Extractor} registry's
+    # business, not this tool's: with pikuri-pdf's extractor
+    # registered, PDFs are claimed by their +%PDF-+ magic prefix ahead
+    # of the binary refusal and extracted with one synthetic
+    # +"--- Page N ---"+ header line per page (see
+    # +Pikuri::Extractors::PDF+); a gem plugging another extractor
+    # into the registry extends this tool for free. Extraction is lazy
+    # where the format allows (+extract_lines+): reading the first
+    # window of a 500-page PDF parses only the pages the window needs.
+    # Formats without a lazy line shape (HTML) are extracted in full
+    # and then windowed. Line numbers in PDF output are for citation
+    # back to the user only; PDFs are not editable through {Edit}.
     #
     # PDFs with no extractable text (scanned images, empty documents) come
     # back with an LLM-actionable hint string rather than an empty
     # observation. Encrypted / malformed / XFA-form PDFs surface as
-    # +"Error: cannot extract PDF text: ..."+ — same convention as other
-    # tool errors the model can react to. No OCR.
+    # +"Error: ..."+ — same convention as other tool errors the model
+    # can react to. No OCR.
     #
     # == Image attachments
     #
@@ -96,36 +98,38 @@ module Pikuri
     # * Image larger than {MAX_IMAGE_BYTES} → +"Error: image too large…"+,
     #   leaving the model to pick a different file or ask the user to
     #   resize.
-    # * Binary content → {Pikuri::FileType.binary?} on the sample; any
-    #   +NUL+ byte or a sample dense in control characters triggers
-    #   refusal. Catches archives and compiled artifacts without an
-    #   extension list to maintain. PDFs and supported images are
-    #   intercepted by their respective magic-byte checks via
-    #   {Pikuri::FileType.detect_mime} before the binary sniff — see
-    #   above.
+    # * Binary content → nothing in the {Pikuri::Extractor} registry
+    #   claims it ({Pikuri::Extractor::Passthrough} declines on the
+    #   {Pikuri::FileType.binary?} heuristic: any +NUL+ byte or a
+    #   sample dense in control characters). Catches archives and
+    #   compiled artifacts without an extension list to maintain.
+    #   Registered extractors (pikuri-pdf's PDF, pikuri-extractors'
+    #   office formats) claim their bytes ahead of that refusal;
+    #   images are intercepted here via {Pikuri::FileType.detect_mime}
+    #   before extraction is attempted — see above.
     # * Offset past EOF → +"Error: offset N is beyond end of file (M lines total)"+.
     class Read < Pikuri::Tool
-      # The windowing constants live on {Pikuri::FileType} now (shared
+      # The windowing constants live on {Pikuri::Extractor} (shared
       # with +VectorDb::Tools::Read+); these aliases keep the names this tool's
       # description and specs reference pointing at the single source.
       # @return [Integer] default value of the +limit+ parameter (number
       #   of lines to read per call).
-      DEFAULT_LIMIT = Pikuri::FileType::PAGE_DEFAULT_LIMIT
+      DEFAULT_LIMIT = Pikuri::Extractor::PAGE_DEFAULT_LIMIT
       # @return [Integer] per-line character cap; longer lines are
       #   truncated with {LINE_TRUNCATION_MARKER}.
-      MAX_LINE_LENGTH = Pikuri::FileType::PAGE_MAX_LINE_LENGTH
+      MAX_LINE_LENGTH = Pikuri::Extractor::PAGE_MAX_LINE_LENGTH
       # @return [String] suffix appended to lines truncated by
       #   {MAX_LINE_LENGTH}.
-      LINE_TRUNCATION_MARKER = Pikuri::FileType::PAGE_LINE_TRUNCATION_MARKER
+      LINE_TRUNCATION_MARKER = Pikuri::Extractor::PAGE_LINE_TRUNCATION_MARKER
       # @return [Integer] hard byte cap on input content collected per
       #   call. Counted on the line bytes (plus one for the joining
       #   newline); the rendered output is slightly larger due to the
       #   per-line +"%6d\t"+ prefix.
-      MAX_BYTES = Pikuri::FileType::PAGE_MAX_BYTES
+      MAX_BYTES = Pikuri::Extractor::PAGE_MAX_BYTES
       # @return [String] human-readable form of {MAX_BYTES} for the
       #   continuation marker.
@@ -221,11 +225,6 @@ module Pikuri
         mime = Pikuri::FileType.detect_mime(resolved)
         return format_image(path: path, resolved: resolved, mime: mime) if mime&.start_with?('image/')
-        # PDFs are binary by the heuristic, so the PDF route (handled
-        # inside read_as_text_paged) must win over the binary refusal.
-        if mime != 'application/pdf' && Pikuri::FileType.binary?(resolved)
-          return "Error: cannot read binary file: #{path}"
-        end
         page = Pikuri::FileType.read_as_text_paged(
           resolved, offset: offset, limit: limit,
@@ -236,18 +235,24 @@ module Pikuri
         "Error: #{e.message}"
       rescue Errno::EACCES => e
         "Error: cannot read #{path}: #{e.message}"
+      rescue ArgumentError
+        # Nothing in the Extractor registry claimed the content —
+        # read_as_text_paged's binary refusal (directories and images
+        # were already handled above).
+        "Error: cannot read binary file: #{path}"
       rescue RuntimeError => e
-        # Malformed / unsupported PDF surfaced by read_as_text_paged.
+        # Extraction failure (malformed / unsupported PDF, ...)
+        # surfaced by read_as_text_paged.
         "Error: #{e.message}"
       end
-      # Render a {Pikuri::FileType::Page} as the cat-n observation: a
+      # Render a {Pikuri::Extractor::Page} as the cat-n observation: a
       # six-column line number, a tab, then the (already-truncated)
       # content, followed by a trailer that tells the model whether to
       # page on. PDF pages carry +"--- Page N ---"+ marker lines from
       # the extractor; the +kind+ only changes trailer wording here.
       #
-      # @param page [Pikuri::FileType::Page]
+      # @param page [Pikuri::Extractor::Page]
       # @return [String]
       def self.render_page(page)
         if page.lines.empty?
@@ -281,7 +286,7 @@ module Pikuri
       # text-free PDF gets an LLM-actionable hint rather than the
       # plain-file "(Empty file)".
       #
-      # @param page [Pikuri::FileType::Page]
+      # @param page [Pikuri::Extractor::Page]
       # @return [String]
       def self.empty_message(page)
         if page.kind == :pdf
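
The read.rb change above drops the up-front binary check and instead rescues the ArgumentError raised when nothing in the extractor registry claims the content. A minimal sketch of that control flow follows, under stated assumptions: SketchRegistry, its Extractor struct, read_lines, sketch_read, the 512-byte sample, and the 30% control-character threshold are all hypothetical stand-ins, since the real Pikuri::Extractor API is not shown in this diff.

```ruby
# Hypothetical stand-in for an extractor registry; none of these names
# are the real Pikuri::Extractor API, which this diff does not show.
module SketchRegistry
  Extractor = Struct.new(:name, :claims, :extract)

  # Assumed binary heuristic: any NUL byte, or a sample dense in control
  # characters (the 30% threshold is an arbitrary illustration).
  def self.binary?(sample)
    return false if sample.empty?
    return true if sample.each_byte.include?(0)

    control = sample.each_byte.count { |b| b < 0x20 && ![0x09, 0x0A, 0x0D].include?(b) }
    control.fdiv(sample.bytesize) > 0.3
  end

  EXTRACTORS = [
    # A format extractor claims its magic prefix ahead of the binary refusal.
    Extractor.new(:pdf,
                  ->(sample) { sample.start_with?('%PDF-') },
                  ->(_bytes) { ['--- Page 1 ---', '(page text would go here)'] }),
    # The passthrough comes last and declines content that looks binary.
    Extractor.new(:text,
                  ->(sample) { !SketchRegistry.binary?(sample) },
                  ->(bytes) { bytes.lines(chomp: true) })
  ].freeze

  # Raises ArgumentError when no extractor claims the sample.
  def self.read_lines(bytes)
    sample = bytes.byteslice(0, 512)
    extractor = EXTRACTORS.find { |e| e.claims.call(sample) }
    raise ArgumentError, 'no extractor claims this content' unless extractor

    extractor.extract.call(bytes)
  end
end

# Caller side, mirroring the rescue ladder in the diff.
def sketch_read(path, bytes)
  SketchRegistry.read_lines(bytes).join("\n")
rescue ArgumentError
  "Error: cannot read binary file: #{path}"
rescue RuntimeError => e
  "Error: #{e.message}"
end

puts sketch_read('notes.txt', "hello\nworld\n")
puts sketch_read('blob.bin', "\x00\x01\x02binary".b)
```

In Ruby, ArgumentError and RuntimeError are sibling subclasses of StandardError, so the two rescue clauses do not shadow each other regardless of their order.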

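The doc comments in read.rb describe a window bounded by a line limit and a byte cap, per-line truncation, lazy extraction, and a cat -n style rendering with a "%6d\t" prefix. A minimal sketch of how those pieces could fit together follows. The SKETCH_* values, the marker text, the 1-based offset, truncating before counting bytes, and the helper names are assumptions for illustration; the real PAGE_* constants and read_as_text_paged are not shown in this diff.

```ruby
# Illustrative values only; the real PAGE_* constants are not shown in this diff.
SKETCH_MAX_LINE_LENGTH = 20
SKETCH_TRUNCATION_MARKER = '... (truncated)'
SKETCH_MAX_BYTES = 64

# Collect up to `limit` lines starting at 1-based `offset` from a lazy
# line source, stopping early once the byte cap would be exceeded.
def sketch_window(lines, offset:, limit:)
  collected = []
  bytes = 0
  lines.each_with_index do |line, index|
    number = index + 1
    next if number < offset
    break if collected.size >= limit

    if line.length > SKETCH_MAX_LINE_LENGTH
      line = line[0, SKETCH_MAX_LINE_LENGTH] + SKETCH_TRUNCATION_MARKER
    end
    cost = line.bytesize + 1 # one extra byte for the joining newline
    break if bytes + cost > SKETCH_MAX_BYTES

    bytes += cost
    collected << [number, line]
  end
  collected
end

# cat -n style rendering: six-column line number, a tab, then the content.
def sketch_render(window)
  window.map { |number, line| format("%6d\t%s", number, line) }.join("\n")
end

# A lazy source that records which "pages" were actually produced.
def sketch_pages(total, log)
  Enumerator.new do |y|
    1.upto(total) do |n|
      log << n
      y << "--- Page #{n} ---"
    end
  end
end

log = []
puts sketch_render(sketch_window(sketch_pages(500, log), offset: 1, limit: 3))
puts "pages produced: #{log.size}"
```

Because the window breaks out of the enumeration as soon as either limit fires, the source is asked for only a handful of its 500 entries, which is the behavior the comment describes for the first window of a large PDF.
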
metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pikuri-workspace
 version: !ruby/object:Gem::Version
-  version: 0.0.5
+  version: 0.0.6
 platform: ruby
 authors:
 - Martin Vysny
@@ -16,14 +16,14 @@ dependencies:
     requirements:
     - - '='
       - !ruby/object:Gem::Version
-        version: 0.0.5
+        version: 0.0.6
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - '='
       - !ruby/object:Gem::Version
-        version: 0.0.5
+        version: 0.0.6
 description: |
   pikuri-workspace adds "operate on a directory tree" to pikuri-core
   agents: the +Pikuri::Workspace::Filesystem+ class that scopes