kreuzberg 4.6.3-aarch64-linux → 4.7.0-aarch64-linux (RubyGems)

This diff shows the contents of publicly available package versions as released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their public registry.

Files changed (29)

checksums.yaml +4 -4
data/README.md +26 -5
data/lib/kreuzberg/config.rb +17 -13
data/lib/kreuzberg/result.rb +43 -6
data/lib/kreuzberg/types.rb +205 -15
data/lib/kreuzberg/version.rb +1 -1
data/lib/kreuzberg_rb.so +0 -0
data/sig/kreuzberg.rbs +303 -0
metadata +2 -22
data/spec/binding/config_result_spec.rb +0 -377
data/spec/binding/metadata_types_spec.rb +0 -1253
data/spec/serialization_spec.rb +0 -134
data/spec/smoke/package_spec.rb +0 -199
data/spec/unit/config/chunking_config_spec.rb +0 -213
data/spec/unit/config/embedding_config_spec.rb +0 -343
data/spec/unit/config/extraction_config_spec.rb +0 -434
data/spec/unit/config/font_config_spec.rb +0 -285
data/spec/unit/config/hierarchy_config_spec.rb +0 -314
data/spec/unit/config/image_extraction_config_spec.rb +0 -209
data/spec/unit/config/image_preprocessing_config_spec.rb +0 -230
data/spec/unit/config/keyword_config_spec.rb +0 -229
data/spec/unit/config/language_detection_config_spec.rb +0 -258
data/spec/unit/config/ocr_config_spec.rb +0 -171
data/spec/unit/config/output_format_spec.rb +0 -380
data/spec/unit/config/page_config_spec.rb +0 -221
data/spec/unit/config/pdf_config_spec.rb +0 -267
data/spec/unit/config/postprocessor_config_spec.rb +0 -290
data/spec/unit/config/tesseract_config_spec.rb +0 -181
data/spec/unit/config/token_reduction_config_spec.rb +0 -251

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: adae55dc7f30e68a211cc0493985d0d1687b3988e76509fbffd87f91fee45207
-  data.tar.gz: 6071b7d76b01dc15b47a11fc5eaeb4292fbb07630d20c3ac113751bbded3de0f
+  metadata.gz: 1777d29275333b413764e5417f805de33ad0f9378dbb2a6372d9d573a23ae0e9
+  data.tar.gz: 34fad03e39480a52e6a2f91ea4a7a17335eacaf97f8b17d32d440d617842b068
 SHA512:
-  metadata.gz: 1e90683694a29205d479b3cda7fb367658bf311520884f7c1faa1b4ec1d6be69dad491b620e33d521b8dea893e062c620738b0974157a096670e825a0fd1a434
-  data.tar.gz: ff29c19cb5b0085b84ba1a3ad9f97602ed83211e934d5eff1b8a630868a2ff7e919bba5937d7c49cb7a424fbc2239b726750cdca5c8893fdeb6cb4d98540f5fa
+  metadata.gz: 93cc0429e0d310125071f7091c3029936683b69c665b499a411fa9cd8df1e22077fb6ff3c009040b3395f6cabc92fcac490db3a2bf013680a9c1393f15245799
+  data.tar.gz: e62adbc1ed01632d96f397d89b18050e403d6363d8a2658f4b48da99dc2d284a48cc0891d77cc0afb7483ce9383f3eab6a1367bd2207083b5c1fec48b8543c8e
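Note: the entries above are hex digests of the `metadata.gz` and `data.tar.gz` members of the gem archive. A minimal sketch of checking such a file against the actual bytes follows. The helper name `verify_checksums` and the in-memory stand-in data are hypothetical; a real check would read the members out of the downloaded `.gem` file.

```ruby
require 'digest/sha2'
require 'yaml'

# Hypothetical helper: compares the digests recorded in a checksums.yaml
# document against the bytes of the named archive members.
# Returns an array of [algorithm, member_name, matches?] triples.
def verify_checksums(checksums_yaml, members)
  algorithms = { 'SHA256' => Digest::SHA256, 'SHA512' => Digest::SHA512 }

  YAML.safe_load(checksums_yaml).flat_map do |algo, files|
    digest_class = algorithms.fetch(algo)
    files.map do |name, expected_hex|
      [algo, name, digest_class.hexdigest(members.fetch(name)) == expected_hex]
    end
  end
end

# In-memory stand-ins for the two archive members (not real gem contents).
members = { 'metadata.gz' => 'meta-bytes', 'data.tar.gz' => 'data-bytes' }

checksums_yaml = <<~YAML
  ---
  SHA256:
    metadata.gz: #{Digest::SHA256.hexdigest(members['metadata.gz'])}
    data.tar.gz: #{Digest::SHA256.hexdigest(members['data.tar.gz'])}
  SHA512:
    metadata.gz: #{Digest::SHA512.hexdigest(members['metadata.gz'])}
    data.tar.gz: #{Digest::SHA512.hexdigest(members['data.tar.gz'])}
YAML

results = verify_checksums(checksums_yaml, members)
results.each { |algo, name, ok| puts "#{algo} #{name}: #{ok ? 'ok' : 'MISMATCH'}" }
```

Because both archives were rebuilt for 4.7.0, all four digests change even though only some files inside `data.tar.gz` differ.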

data/README.md CHANGED

@@ -22,7 +22,7 @@
     <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
   </a>
   <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
-    <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.6.3" alt="Go">
+    <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0" alt="Go">
   </a>
   <a href="https://www.nuget.org/packages/Kreuzberg/">
     <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
@@ -42,13 +42,16 @@
   <!-- Project Info -->
   <a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
-    <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
+    <img src="https://img.shields.io/badge/License-MIT-007ec6" alt="License">
   </a>
   <a href="https://docs.kreuzberg.dev">
-    <img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
+    <img src="https://img.shields.io/badge/docs-kreuzberg.dev-007ec6" alt="Documentation">
+  </a>
+  <a href="https://docs.kreuzberg.dev/demo.html">
+    <img src="https://img.shields.io/badge/%E2%96%B6%EF%B8%8F_Live_Demo-007ec6" alt="Live Demo">
   </a>
   <a href="https://huggingface.co/Kreuzberg">
-    <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow" alt="Hugging Face">
+    <img src="https://img.shields.io/badge/%F0%9F%A4%97_Hugging_Face-007ec6" alt="Hugging Face">
   </a>
 </div>
@@ -61,7 +64,7 @@
 </div>
-Extract text, tables, images, and metadata from 91+ file formats including PDF, Office documents, and images. Ruby bindings with idiomatic Ruby API and native performance.
+Extract text, tables, images, and metadata from 91+ file formats and 248 programming languages including PDF, Office documents, and images. Ruby bindings with idiomatic Ruby API and native performance.
 ## Installation
@@ -74,6 +77,7 @@ Install via one of the supported package managers:
 **gem:**
 ```bash
 gem install kreuzberg
 ```
@@ -82,6 +86,7 @@ gem install kreuzberg
 **Bundler:**
 ```ruby
 gem 'kreuzberg'
 ```
@@ -258,6 +263,19 @@ puts "Processing time: #{result.metadata&.dig('processing_time')}ms"
 | **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
 | **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
+#### Code Intelligence (248 Languages)
+| Feature | Description |
+|---------|-------------|
+| **Structure Extraction** | Functions, classes, methods, structs, interfaces, enums |
+| **Import/Export Analysis** | Module dependencies, re-exports, wildcard imports |
+| **Symbol Extraction** | Variables, constants, type aliases, properties |
+| **Docstring Parsing** | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
+| **Diagnostics** | Parse errors with line/column positions |
+| **Syntax-Aware Chunking** | Split code by semantic boundaries, not arbitrary byte offsets |
+Powered by [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — [documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev).
 **[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
 ### Key Capabilities
@@ -279,6 +297,9 @@ puts "Processing time: #{result.metadata&.dig('processing_time')}ms"
 - **Batch Processing** - Efficiently process multiple documents in parallel
 - **Memory Efficient** - Stream large files without loading entirely into memory
 - **Language Detection** - Detect and support multiple languages in documents
+- **Code Intelligence** - Extract structure, imports, exports, symbols, and docstrings from [248 programming languages](https://docs.tree-sitter-language-pack.kreuzberg.dev) via tree-sitter
 - **Configuration** - Fine-grained control over extraction behavior
 ### Performance Characteristics

data/lib/kreuzberg/config.rb CHANGED

@@ -858,21 +858,20 @@ module Kreuzberg
     # Layout detection configuration
     #
-    # @example Basic usage with fast preset
-    #   layout = LayoutDetection.new(preset: "fast")
+    # @example Basic usage
+    #   layout = LayoutDetection.new
     #
-    # @example Accurate preset with custom threshold
+    # @example With custom threshold and table model
     #   layout = LayoutDetection.new(
-    #     preset: "accurate",
     #     confidence_threshold: 0.5,
-    #     apply_heuristics: true
+    #     apply_heuristics: true,
+    #     table_model: "tatr"
     #   )
     #
     class LayoutDetection
-      attr_reader :preset, :confidence_threshold, :apply_heuristics, :table_model
+      attr_reader :confidence_threshold, :apply_heuristics, :table_model
-      def initialize(preset: 'fast', confidence_threshold: nil, apply_heuristics: true, table_model: nil)
-        @preset = preset.to_s
+      def initialize(confidence_threshold: nil, apply_heuristics: true, table_model: nil)
         @confidence_threshold = confidence_threshold&.to_f
         @apply_heuristics = apply_heuristics ? true : false
         @table_model = table_model&.to_s
@@ -880,7 +879,6 @@ module Kreuzberg
       def to_h
         {
-          preset: @preset,
           confidence_threshold: @confidence_threshold,
           apply_heuristics: @apply_heuristics,
           table_model: @table_model
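Note: the hunks above drop the `preset` keyword from `LayoutDetection#initialize`. The sketch below copies only the 4.7.0 signature shown in the diff into a stand-in class (`LayoutDetectionSketch` is a hypothetical name, not the gem's class) to show what a 4.6.3-style call site would hit, assuming the class is constructed directly with keywords.

```ruby
# Stand-in that reproduces the 4.7.0 initializer and to_h from the diff.
# It is not the real class and needs no gem to run.
class LayoutDetectionSketch
  attr_reader :confidence_threshold, :apply_heuristics, :table_model

  def initialize(confidence_threshold: nil, apply_heuristics: true, table_model: nil)
    @confidence_threshold = confidence_threshold&.to_f
    @apply_heuristics = apply_heuristics ? true : false
    @table_model = table_model&.to_s
  end

  def to_h
    {
      confidence_threshold: @confidence_threshold,
      apply_heuristics: @apply_heuristics,
      table_model: @table_model
    }
  end
end

# 4.6.3-style call: Ruby rejects the removed keyword with ArgumentError.
legacy_error =
  begin
    LayoutDetectionSketch.new(preset: 'accurate', confidence_threshold: 0.5)
    nil
  rescue ArgumentError => e
    e
  end

# 4.7.0-style call: a symbol table_model is normalized to a string.
layout = LayoutDetectionSketch.new(confidence_threshold: 0.5, table_model: :tatr)

puts legacy_error.message
puts layout.to_h
```

Since the new signature has no catch-all for extra keywords, callers that still pass `preset:` would need to remove it when upgrading.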
@@ -945,7 +943,7 @@ module Kreuzberg
     #   )
     #
     class Extraction
-      attr_reader :use_cache, :enable_quality_processing, :force_ocr, :force_ocr_pages,
+      attr_reader :use_cache, :enable_quality_processing, :force_ocr, :disable_ocr, :force_ocr_pages,
                   :include_document_structure,
                   :ocr, :chunking, :language_detection, :pdf_options,
                   :images, :postprocessor,
@@ -974,8 +972,8 @@ module Kreuzberg
       #
       # Keys that are allowed in the Extraction config
       ALLOWED_KEYS = %i[
-        use_cache enable_quality_processing force_ocr force_ocr_pages include_document_structure ocr chunking
-        language_detection pdf_options image_extraction
+        use_cache enable_quality_processing force_ocr disable_ocr force_ocr_pages
+        include_document_structure ocr chunking language_detection pdf_options image_extraction
         postprocessor token_reduction keywords html_options pages
         max_concurrent_extractions output_format result_format
         security_limits layout concurrency cache_namespace cache_ttl_secs extraction_timeout_secs
@@ -1040,6 +1038,7 @@ module Kreuzberg
                      use_cache: true,
                      enable_quality_processing: true,
                      force_ocr: false,
+                     disable_ocr: false,
                      force_ocr_pages: nil,
                      include_document_structure: false,
                      ocr: nil,
@@ -1066,7 +1065,7 @@ module Kreuzberg
                      email: nil)
         kwargs = {
           use_cache: use_cache, enable_quality_processing: enable_quality_processing,
-          force_ocr: force_ocr, force_ocr_pages: force_ocr_pages,
+          force_ocr: force_ocr, disable_ocr: disable_ocr, force_ocr_pages: force_ocr_pages,
           include_document_structure: include_document_structure,
           ocr: ocr, chunking: chunking, language_detection: language_detection,
           pdf_options: pdf_options, image_extraction: image_extraction,
@@ -1099,6 +1098,7 @@ module Kreuzberg
         @use_cache = params[:use_cache] ? true : false
         @enable_quality_processing = params[:enable_quality_processing] ? true : false
         @force_ocr = params[:force_ocr] ? true : false
+        @disable_ocr = params[:disable_ocr] ? true : false
         @force_ocr_pages = params[:force_ocr_pages]
         @include_document_structure = params[:include_document_structure] ? true : false
         @ocr = normalize_config(params[:ocr], OCR)
@@ -1154,6 +1154,7 @@ module Kreuzberg
           use_cache: @use_cache,
           enable_quality_processing: @enable_quality_processing,
           force_ocr: @force_ocr,
+          disable_ocr: @disable_ocr,
           force_ocr_pages: @force_ocr_pages,
           include_document_structure: @include_document_structure,
           max_concurrent_extractions: @max_concurrent_extractions,
@@ -1290,6 +1291,8 @@ module Kreuzberg
           @enable_quality_processing = value ? true : false
         when :force_ocr
           @force_ocr = value ? true : false
+        when :disable_ocr
+          @disable_ocr = value ? true : false
         when :force_ocr_pages
           @force_ocr_pages = value
         when :include_document_structure
@@ -1395,6 +1398,7 @@ module Kreuzberg
         @use_cache = merged.use_cache
         @enable_quality_processing = merged.enable_quality_processing
         @force_ocr = merged.force_ocr
+        @disable_ocr = merged.disable_ocr
         @force_ocr_pages = merged.force_ocr_pages
         @include_document_structure = merged.include_document_structure
         @ocr = merged.ocr
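Note: `disable_ocr` is wired through the same `value ? true : false` coercion as `force_ocr`. A small sketch of that coercion follows; `coerce_flag`, `ocr_flags`, and `OCR_FLAG_DEFAULTS` are hypothetical names used only for illustration. The diff does not show how the two flags interact when both are set, so that is left out.

```ruby
# Same truthiness-based coercion the diff applies to the boolean flags.
def coerce_flag(value)
  value ? true : false
end

# Defaults as they appear in the 4.7.0 initializer.
OCR_FLAG_DEFAULTS = { force_ocr: false, disable_ocr: false }.freeze

# Hypothetical helper: merges overrides over the defaults and coerces each one.
def ocr_flags(overrides = {})
  OCR_FLAG_DEFAULTS.merge(overrides).transform_values { |v| coerce_flag(v) }
end

puts ocr_flags
puts ocr_flags(disable_ocr: true)
puts ocr_flags(disable_ocr: nil)
# In Ruby every String is truthy, so the text 'false' coerces to true.
puts ocr_flags(disable_ocr: 'false')
```

The last call is the case to watch when flags arrive as strings, for example from environment variables or parsed config files: convert them to real booleans before passing them in.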

data/lib/kreuzberg/result.rb CHANGED

@@ -14,7 +14,8 @@ module Kreuzberg
   class Result
     attr_reader :content, :mime_type, :metadata, :metadata_json, :tables,
                 :detected_languages, :chunks, :images, :pages, :elements, :ocr_elements, :djot_content,
-                :document, :extracted_keywords, :quality_score, :processing_warnings, :annotations
+                :document, :extracted_keywords, :quality_score, :processing_warnings, :annotations,
+                :uris, :children
     # @!attribute [r] cells
     #   @return [Array<Array<String>>] Table cells (2D array)
@@ -51,6 +52,7 @@ module Kreuzberg
       :total_chunks,
       :first_page,
       :last_page,
+      :chunk_type,
       :embedding
     ) do
       def to_h
@@ -63,6 +65,7 @@ module Kreuzberg
           total_chunks: total_chunks,
           first_page: first_page,
           last_page: last_page,
+          chunk_type: chunk_type,
           embedding: embedding
         }
       end
@@ -318,7 +321,7 @@ module Kreuzberg
     #
     # @param hash [Hash] Hash returned from native extension
     #
-    # rubocop:disable Metrics/AbcSize
+    # rubocop:disable Metrics/AbcSize, Metrics/MethodLength
     def initialize(hash)
       @content = get_value(hash, 'content', '')
       @mime_type = get_value(hash, 'mime_type', '')
@@ -337,14 +340,16 @@ module Kreuzberg
       @quality_score = get_value(hash, 'quality_score')
       @processing_warnings = parse_processing_warnings(get_value(hash, 'processing_warnings'))
       @annotations = parse_annotations(get_value(hash, 'annotations'))
+      @uris = parse_uris(get_value(hash, 'uris'))
+      @children = parse_children(get_value(hash, 'children'))
     end
-    # rubocop:enable Metrics/AbcSize
+    # rubocop:enable Metrics/AbcSize, Metrics/MethodLength
     # Convert to hash
     #
     # @return [Hash] Hash representation
     #
-    # rubocop:disable Metrics/CyclomaticComplexity
+    # rubocop:disable Metrics/CyclomaticComplexity, Metrics/MethodLength
     def to_h
       {
         content: @content,
@@ -362,10 +367,12 @@ module Kreuzberg
         extracted_keywords: @extracted_keywords&.map(&:to_h),
         quality_score: @quality_score,
         processing_warnings: @processing_warnings.map(&:to_h),
-        annotations: @annotations&.map(&:to_h)
+        annotations: @annotations&.map(&:to_h),
+        uris: @uris&.map(&:to_h),
+        children: @children&.map(&:to_h)
       }
     end
-    # rubocop:enable Metrics/CyclomaticComplexity
+    # rubocop:enable Metrics/CyclomaticComplexity, Metrics/MethodLength
     # Convert to JSON
     #
@@ -520,6 +527,7 @@ module Kreuzberg
           total_chunks: chunk_hash['total_chunks'],
           first_page: chunk_hash['first_page'],
           last_page: chunk_hash['last_page'],
+          chunk_type: chunk_hash['chunk_type'],
           embedding: chunk_hash['embedding']
         )
       end
@@ -738,6 +746,35 @@ module Kreuzberg
     def bbox_field(bbox_data, primary_key, fallback_key)
       (bbox_data[primary_key] || bbox_data[fallback_key])&.to_f
     end
+    def parse_uris(uris_data)
+      return nil if uris_data.nil?
+      uris_data.map { |u| build_uri(u) }
+    end
+    def build_uri(u_hash)
+      Struct.new(:url, :label, :page, :kind).new(
+        url: u_hash['url'] || '',
+        label: u_hash['label'],
+        page: u_hash['page']&.to_i,
+        kind: u_hash['kind'] || 'hyperlink'
+      )
+    end
+    def parse_children(children_data)
+      return nil if children_data.nil?
+      children_data.map { |c| build_archive_entry(c) }
+    end
+    def build_archive_entry(c_hash)
+      Struct.new(:path, :mime_type, :result).new(
+        path: c_hash['path'] || '',
+        mime_type: c_hash['mime_type'] || '',
+        result: c_hash['result'] ? self.class.new(c_hash['result']) : nil
+      )
+    end
   end
   # rubocop:enable Metrics/ClassLength
 end
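Note: the new `uris` and `children` fields make a result recursive, since each archive entry may carry its own nested result. The sketch below mirrors the parsing logic from the diff in stand-in classes (`ResultSketch`, `UriSketch`, `EntrySketch` are hypothetical names) and feeds it a hand-written hash shaped like the keys the diff reads. One deliberate difference: the structs here use `keyword_init: true`. The diffed code calls `Struct.new` without it, which as far as I know only accepts keyword arguments on Ruby 3.2 and later.

```ruby
# Named structs standing in for the anonymous ones built in the diff.
# keyword_init: true is an addition so the sketch behaves the same on
# older Ruby versions.
UriSketch = Struct.new(:url, :label, :page, :kind, keyword_init: true)
EntrySketch = Struct.new(:path, :mime_type, :result, keyword_init: true)

class ResultSketch
  attr_reader :content, :uris, :children

  def initialize(hash)
    @content = hash['content'] || ''
    # A missing key stays nil rather than becoming an empty array.
    @uris = hash['uris']&.map { |u| build_uri(u) }
    @children = hash['children']&.map { |c| build_entry(c) }
  end

  private

  def build_uri(u_hash)
    UriSketch.new(
      url: u_hash['url'] || '',
      label: u_hash['label'],
      page: u_hash['page']&.to_i,
      kind: u_hash['kind'] || 'hyperlink'
    )
  end

  def build_entry(c_hash)
    EntrySketch.new(
      path: c_hash['path'] || '',
      mime_type: c_hash['mime_type'] || '',
      # Nested results are parsed with the same class, so archives can nest.
      result: c_hash['result'] ? self.class.new(c_hash['result']) : nil
    )
  end
end

# Hand-written payload for illustration only.
payload = {
  'content' => 'archive listing',
  'children' => [
    {
      'path' => 'readme.txt',
      'mime_type' => 'text/plain',
      'result' => {
        'content' => 'hello',
        'uris' => [{ 'url' => 'https://example.com', 'page' => '2' }]
      }
    }
  ]
}

result = ResultSketch.new(payload)
child = result.children.first
link = child.result.uris.first

puts "#{child.path} (#{child.mime_type})"
puts "#{link.kind}: #{link.url} on page #{link.page}"
```

Callers should expect `nil`, not `[]`, for `uris` and `children` when the native extension omits them, which is why `to_h` in the diff uses `&.map`.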

data/lib/kreuzberg/types.rb CHANGED

@@ -10,21 +10,24 @@ module Kreuzberg
   #
   # @example
   #   type = Kreuzberg::ElementType::TITLE
-  #
-  ElementType = T.type_alias do
-    T.any(
-      'title',
-      'narrative_text',
-      'heading',
-      'list_item',
-      'table',
-      'image',
-      'page_break',
-      'code_block',
-      'block_quote',
-      'footer',
-      'header'
-    )
+  #   Kreuzberg::ElementType.values # => ["title", "narrative_text", ...]
+  #
+  module ElementType
+    TITLE = 'title'
+    NARRATIVE_TEXT = 'narrative_text'
+    HEADING = 'heading'
+    LIST_ITEM = 'list_item'
+    TABLE = 'table'
+    IMAGE = 'image'
+    PAGE_BREAK = 'page_break'
+    CODE_BLOCK = 'code_block'
+    BLOCK_QUOTE = 'block_quote'
+    FOOTER = 'footer'
+    HEADER = 'header'
+    def self.values
+      [TITLE, NARRATIVE_TEXT, HEADING, LIST_ITEM, TABLE, IMAGE, PAGE_BREAK, CODE_BLOCK, BLOCK_QUOTE, FOOTER, HEADER]
+    end
   end
   # Bounding box coordinates for element positioning.
@@ -431,4 +434,191 @@ module Kreuzberg
     const :page_number, T.nilable(Integer)
     const :bounding_box, T.nilable(PdfAnnotationBoundingBox)
   end
+  # An entry within an archive (zip, tar, etc.) extraction result.
+  #
+  # @example
+  #   entry = Kreuzberg::ArchiveEntry.new(
+  #     path: "readme.txt",
+  #     mime_type: "text/plain",
+  #     result: extraction_result
+  #   )
+  #
+  class ArchiveEntry < T::Struct
+    extend T::Sig
+    const :path, String
+    const :mime_type, String
+    const :result, T.untyped
+  end
+  # Extracted keyword with relevance metadata.
+  #
+  # @example
+  #   kw = Kreuzberg::Keyword.new(
+  #     text: "machine learning",
+  #     score: 0.95,
+  #     algorithm: "yake",
+  #     positions: [42, 128]
+  #   )
+  #
+  class Keyword < T::Struct
+    extend T::Sig
+    const :text, String
+    const :score, Float
+    const :algorithm, String
+    const :positions, T.nilable(T::Array[Integer])
+  end
+  # A table extracted from a document.
+  #
+  # @example
+  #   table = Kreuzberg::Table.new(
+  #     cells: [["A", "B"], ["1", "2"]],
+  #     markdown: "| A | B |\n|---|---|\n| 1 | 2 |",
+  #     page_number: 1,
+  #     bounding_box: bbox
+  #   )
+  #
+  class Table < T::Struct
+    extend T::Sig
+    const :cells, T::Array[T::Array[String]]
+    const :markdown, String
+    const :page_number, Integer
+    const :bounding_box, T.nilable(BoundingBox)
+  end
+  # A URI extracted from a document.
+  #
+  # @example
+  #   uri = Kreuzberg::Uri.new(
+  #     url: "https://example.com",
+  #     kind: "hyperlink",
+  #     label: "Example",
+  #     page: 1
+  #   )
+  #
+  class Uri < T::Struct
+    extend T::Sig
+    const :url, String
+    const :kind, String
+    const :label, T.nilable(String)
+    const :page, T.nilable(Integer)
+  end
+  # Content layer classification for document nodes.
+  module ContentLayer
+    BODY = 'body'
+    HEADER = 'header'
+    FOOTER = 'footer'
+    FOOTNOTE = 'footnote'
+    def self.values
+      [BODY, HEADER, FOOTER, FOOTNOTE]
+    end
+  end
+  # Algorithm used for keyword extraction.
+  module KeywordAlgorithm
+    YAKE = 'yake'
+    RAKE = 'rake'
+    def self.values
+      [YAKE, RAKE]
+    end
+  end
+  # OCR element granularity level.
+  module OcrElementLevel
+    WORD = 'word'
+    LINE = 'line'
+    BLOCK = 'block'
+    PAGE = 'page'
+    def self.values
+      [WORD, LINE, BLOCK, PAGE]
+    end
+  end
+  # Output format for extraction results.
+  module OutputFormat
+    PLAIN = 'plain'
+    MARKDOWN = 'markdown'
+    DJOT = 'djot'
+    HTML = 'html'
+    JSON = 'json'
+    STRUCTURED = 'structured'
+    def self.values
+      [PLAIN, MARKDOWN, DJOT, HTML, JSON, STRUCTURED]
+    end
+  end
+  # Page unit type classification.
+  module PageUnitType
+    PAGE = 'page'
+    SLIDE = 'slide'
+    SHEET = 'sheet'
+    def self.values
+      [PAGE, SLIDE, SHEET]
+    end
+  end
+  # PDF annotation type classification.
+  module PdfAnnotationType
+    TEXT = 'text'
+    HIGHLIGHT = 'highlight'
+    LINK = 'link'
+    STAMP = 'stamp'
+    UNDERLINE = 'underline'
+    STRIKE_OUT = 'strike_out'
+    OTHER = 'other'
+    def self.values
+      [TEXT, HIGHLIGHT, LINK, STAMP, UNDERLINE, STRIKE_OUT, OTHER]
+    end
+  end
+  # Relationship kind between document elements.
+  module RelationshipKind
+    FOOTNOTE_REFERENCE = 'footnote_reference'
+    CITATION_REFERENCE = 'citation_reference'
+    INTERNAL_LINK = 'internal_link'
+    CAPTION = 'caption'
+    LABEL = 'label'
+    TOC_ENTRY = 'toc_entry'
+    CROSS_REFERENCE = 'cross_reference'
+    def self.values
+      [FOOTNOTE_REFERENCE, CITATION_REFERENCE, INTERNAL_LINK, CAPTION, LABEL, TOC_ENTRY, CROSS_REFERENCE]
+    end
+  end
+  # Result format classification.
+  module ResultFormat
+    UNIFIED = 'unified'
+    ELEMENT_BASED = 'element_based'
+    def self.values
+      [UNIFIED, ELEMENT_BASED]
+    end
+  end
+  # URI kind classification.
+  module UriKind
+    HYPERLINK = 'hyperlink'
+    IMAGE = 'image'
+    ANCHOR = 'anchor'
+    CITATION = 'citation'
+    REFERENCE = 'reference'
+    EMAIL = 'email'
+    def self.values
+      [HYPERLINK, IMAGE, ANCHOR, CITATION, REFERENCE, EMAIL]
+    end
+  end
 end
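Note: the enum-style modules added above all follow one pattern, string constants plus a `values` class method. The sketch below copies `UriKind` into a stand-in module outside the `Kreuzberg` namespace so it runs without the gem, and adds a hypothetical `validate_enum!` helper to show one way the `values` list could be used; nothing in the diff indicates the gem ships such a validator.

```ruby
# Copy of the UriKind constants from the diff under a stand-in name.
module UriKindSketch
  HYPERLINK = 'hyperlink'
  IMAGE = 'image'
  ANCHOR = 'anchor'
  CITATION = 'citation'
  REFERENCE = 'reference'
  EMAIL = 'email'

  def self.values
    [HYPERLINK, IMAGE, ANCHOR, CITATION, REFERENCE, EMAIL]
  end
end

# Hypothetical helper: accepts a String or Symbol and returns the canonical
# string, or raises if the value is not a member of the enum module.
def validate_enum!(enum_module, value)
  normalized = value.to_s
  return normalized if enum_module.values.include?(normalized)

  raise ArgumentError, "#{value.inspect} is not one of: #{enum_module.values.join(', ')}"
end

puts validate_enum!(UriKindSketch, :email)

begin
  validate_enum!(UriKindSketch, 'ftp')
rescue ArgumentError => e
  puts e.message
end
```

Plain constants also make the documented `Kreuzberg::ElementType::TITLE` lookup a normal constant reference, which the previous `T.type_alias` definition did not provide.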

data/lib/kreuzberg/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Kreuzberg
-  VERSION = '4.6.3'
+  VERSION = '4.7.0'
 end

data/lib/kreuzberg_rb.so CHANGED

Binary file