RubyGems - parsekit - Versions diffs - 0.1.2 → 0.2.0 - Mend

parsekit 0.1.2 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 6ad6eb42fb7e96fa944f30245b2c7be51bf4ce1a0f7766749309676b225b17df
-  data.tar.gz: deb56ea394ac3fedc840e890e4d27de14585661233f19eeaae06baf7be1b1e90
+  metadata.gz: b32f09ec6af6545f7db84b9c6c6f10a27998d95b2305ec5f5a5bef4a80a2a717
+  data.tar.gz: 7b36ef18a14bd708ae885c5b101f822cf6cb088c1ba729b0398bd4d5522ab0fb
 SHA512:
-  metadata.gz: dc88b902dd12008a6936f4d62f5d4651544a3f463b725a15d385b919141e93873bd809436e6b9b008baa7b310d149becb2106a29ca103736f6525e09bef871d6
-  data.tar.gz: 9cbc5464a5cbe06a241d2253cde81da82c7eb75742654b7753c91a922acc87125f81a33c3e77d0d107a1435e8946a860e12388e44fa84dc887d9bb4bf9d2d3a2
+  metadata.gz: 2f5b479a90c550ea25c4a0a6f19afbb10aee5a51b14f7a27868857249605e6b72e045288eb2019805c8281d499d31b8d0597c96bea3c1cee87753430541116a1
+  data.tar.gz: a1ab174853194a4806e1c88606912005bc074b072893681c17e3e271339baadd401e7849ebc026305809507c7ca0b6d30584941c4a94f5619d339000df3e06dd
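
Note: a `.gem` file is a tar archive that carries `checksums.yaml` alongside `metadata.gz` and `data.tar.gz`, so the digests above can be re-checked against an unpacked copy. The following is a minimal sketch of such a check using Ruby's standard `digest` library; the helper names `digests_for` and `checksum_matches?` are hypothetical and are not part of parsekit or RubyGems.

```ruby
require "digest"

# Hypothetical helper: compute the two digests that checksums.yaml records
# for one archive member (for example an extracted data.tar.gz).
def digests_for(path)
  {
    "SHA256" => Digest::SHA256.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest
  }
end

# Hypothetical helper: compare a recorded hex digest with the file on disk.
# `algorithm` is "SHA256" or "SHA512", matching the keys in checksums.yaml.
def checksum_matches?(path, algorithm, expected_hex)
  digests_for(path).fetch(algorithm) == expected_hex.strip.downcase
end
```

With an unpacked gem, `checksum_matches?("data.tar.gz", "SHA256", recorded_value)` would be expected to return `true` when `recorded_value` is the matching entry from that gem's own `checksums.yaml`.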

data/ext/parsekit/Cargo.toml CHANGED

@@ -14,14 +14,14 @@ name = "parsekit"
 magnus = { version = "0.8", features = ["rb-sys"] }
 # Document parsing - testing embedded C libraries
 # MuPDF builds from source and statically links
-mupdf = { version = "0.5", default-features = false, features = [] }
+mupdf = { version = "0.7", default-features = false, features = [] }
 # OCR - Using tesseract-rs for both system and bundled modes
-tesseract-rs = "0.1"  # Tesseract with optional bundling
+tesseract-rs = "0.2"  # Tesseract with optional bundling
 image = "0.25"  # Image processing library (match rusty-tesseract's version)
-calamine = "0.30"  # Excel parsing
+calamine = "0.35"  # Excel parsing
 docx-rs = "0.4"  # Word document parsing
-quick-xml = "0.38"  # XML parsing
-zip = "5.0"  # ZIP archive handling for PPTX
+quick-xml = "0.40"  # XML parsing
+zip = "8.2"  # ZIP archive handling for PPTX
 serde_json = "1.0"  # JSON parsing
 regex = "1.10"  # Text parsing
 encoding_rs = "0.8"  # Encoding detection

data/ext/parsekit/src/parser.rs CHANGED

@@ -242,7 +242,7 @@ impl Parser {
             // Continue on page errors rather than failing entirely
             if let Ok(page) = doc.load_page(page_num) {
                 // Extract text from the page
-                if let Ok(text) = page.to_text() {
+                if let Ok(text) = page.to_text_page(mupdf::TextPageFlags::empty()).and_then(|tp| tp.to_text()) {
                     all_text.push_str(&text);
                     all_text.push('\n');
                 }

data/lib/parsekit/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module ParseKit
-  VERSION = "0.1.2"
+  VERSION = "0.2.0"
 end

data/lib/parsekit.rb CHANGED

@@ -2,9 +2,15 @@
 require_relative "parsekit/version"
-# Load the native extension
+# Load the compiled Rust extension. Precompiled (platform) gems install it into a
+# Ruby-ABI-versioned subdir (lib/parsekit/<major.minor>/parsekit.{so,bundle}) so a
+# single fat gem can carry a binary per Ruby version; source/dev builds place it flat
+# at lib/parsekit/parsekit.{so,bundle}. Try the versioned path first, fall back to the
+# flat one. Resolution goes through $LOAD_PATH (`require`, never `require_relative`)
+# because RubyGems installs native extensions outside the gem's lib/ dir.
 begin
-  require_relative "parsekit/parsekit"
+  RUBY_VERSION =~ /(\d+\.\d+)/
+  require "parsekit/#{Regexp.last_match(1)}/parsekit"
 rescue LoadError
   require "parsekit/parsekit"
 end
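
Note: the new loader derives a `<major.minor>` directory name from `RUBY_VERSION` and tries that path before the flat one. The sketch below isolates only the path selection so it can be run without the extension installed; `extension_candidates` is a hypothetical name used for illustration and does not appear in the gem.

```ruby
# Hypothetical helper mirroring the path selection in the new lib/parsekit.rb:
# the versioned path is listed first, the flat source/dev path second.
def extension_candidates(ruby_version = RUBY_VERSION)
  # Same pattern as the loader: the first "digits.digits" run, e.g. "3.3" from "3.3.5".
  major_minor = ruby_version[/\d+\.\d+/]
  ["parsekit/#{major_minor}/parsekit", "parsekit/parsekit"]
end

# Example: list the two require targets for the running interpreter.
extension_candidates.each { |target| puts target }
```

The real loader passes the first entry to `require` and falls back to the second only on `LoadError`, so a missing versioned directory is not treated as a failure.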

metadata CHANGED

@@ -1,14 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: parsekit
 version: !ruby/object:Gem::Version
-  version: 0.1.2
+  version: 0.2.0
 platform: ruby
 authors:
 - Chris Petersen
-autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-09-06 00:00:00.000000000 Z
+date: 1980-01-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rb_sys
@@ -100,9 +99,7 @@ files:
 - ext/parsekit/src/lib.rs
 - ext/parsekit/src/parser.rs
 - lib/parsekit.rb
-- lib/parsekit/NATIVE_API.md
 - lib/parsekit/error.rb
-- lib/parsekit/parsekit.bundle
 - lib/parsekit/parser.rb
 - lib/parsekit/version.rb
 homepage: https://github.com/scientist-labs/parsekit
@@ -112,7 +109,6 @@ metadata:
   homepage_uri: https://github.com/scientist-labs/parsekit
   source_code_uri: https://github.com/scientist-labs/parsekit
   changelog_uri: https://github.com/scientist-labs/parsekit/blob/main/CHANGELOG.md
-post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -127,8 +123,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.5.3
-signing_key:
+rubygems_version: 3.6.9
 specification_version: 4
 summary: Ruby document parsing toolkit with PDF and OCR support
 test_files: []
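
Note: the `date` field moves from the actual build day to `1980-01-02`, alongside the `rubygems_version` bump to 3.6.9. A plausible reading, stated here as an assumption and not confirmed by the diff itself, is that the gem was built with a RubyGems version that stamps a fixed fallback timestamp for reproducible builds when `SOURCE_DATE_EPOCH` is unset. The sketch below only shows that the epoch value 315619200 (assumed to be that fallback) formats to the date seen above.

```ruby
# Assumption: 315_619_200 is the fixed fallback epoch applied at build time.
# The constant name below is illustrative and is not taken from the gem.
ASSUMED_FALLBACK_EPOCH = 315_619_200

# Format an epoch (seconds since 1970-01-01 UTC) the way the metadata file
# prints its `date` field.
def metadata_date(epoch)
  Time.at(epoch).utc.strftime("%Y-%m-%d %H:%M:%S.%N Z")
end

puts metadata_date(ASSUMED_FALLBACK_EPOCH)
```

If this reading is right, the new date carries no release-timing information; the changelog linked in the metadata would be the place to look for the actual release date.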

data/lib/parsekit/NATIVE_API.md DELETED

@@ -1,125 +0,0 @@
-# ParseKit Native API Documentation
-This document describes the methods implemented in the Rust native extension for ParseKit::Parser.
-## Instance Methods
-### `initialize(options = {})`
-Initialize a new Parser instance with optional configuration.
-**Parameters:**
-- `options` [Hash] Configuration options
-  - `:encoding` [String] Input encoding (default: UTF-8)
-  - `:strict_mode` [Boolean] Enable strict parsing mode (default: false)
-  - `:max_depth` [Integer] Maximum nesting depth (default: 100)
-  - `:max_size` [Integer] Maximum file size in bytes (default: 100MB)
-### `parse(input)`
-Parse an input string (for text content).
-**Parameters:**
-- `input` [String] The input to parse
-**Returns:**
-- [String] The parsed result
-**Raises:**
-- `ArgumentError` If input is empty
-### `parse_file(path)`
-Parse a file (supports PDF, Office documents, text files, images with OCR).
-**Parameters:**
-- `path` [String] Path to the file to parse
-**Returns:**
-- [String] The extracted text content
-**Raises:**
-- `IOError` If file cannot be read
-- `RuntimeError` If parsing fails
-### `parse_bytes(data)`
-Parse binary data.
-**Parameters:**
-- `data` [Array<Integer>] Binary data as byte array
-**Returns:**
-- [String] The extracted text content
-**Raises:**
-- `ArgumentError` If data is empty
-- `RuntimeError` If parsing fails
-### `config`
-Get the current parser configuration.
-**Returns:**
-- [Hash] The parser configuration including encoding, strict_mode, max_depth, and max_size
-### `supports_file?(path)`
-Check if a file format is supported.
-**Parameters:**
-- `path` [String] File path to check
-**Returns:**
-- [Boolean] True if the file format is supported
-### `strict_mode?`
-Check if strict mode is enabled.
-**Returns:**
-- [Boolean] True if strict mode is enabled
-## Format-Specific Parsers
-These methods are also available but typically called internally via `parse_file` or `parse_bytes`:
-### `parse_pdf(data)`
-Parse PDF files using MuPDF (statically linked).
-### `parse_docx(data)`
-Parse Microsoft Word documents.
-### `parse_pptx(data)`
-Parse Microsoft PowerPoint presentations.
-### `parse_xlsx(data)`
-Parse Microsoft Excel spreadsheets.
-### `parse_json(data)`
-Parse and pretty-print JSON data.
-### `parse_xml(data)`
-Parse XML/HTML files and extract text content.
-### `parse_text(data)`
-Parse plain text files.
-### `ocr_image(data)`
-Perform OCR on images (PNG, JPEG, TIFF, BMP) using Tesseract.
-## Class Methods
-### `Parser.supported_formats`
-Get list of supported file formats.
-**Returns:**
-- [Array<String>] List of supported file extensions
-**Example:**
-```ruby
-ParseKit::Parser.supported_formats
-# => ["txt", "json", "xml", "html", "docx", "xlsx", "xls", "csv", "pdf", "png", "jpg", "jpeg", "tiff", "bmp", ...]
-```
-## Implementation Notes
-All these methods are implemented in Rust via the native extension. The Ruby layer (`lib/parsekit/parser.rb`) provides additional convenience methods and helpers that wrap these native methods.
-The native extension uses:
-- **MuPDF** for PDF parsing (statically linked)
-- **Tesseract** for OCR functionality (bundled)
-- **Various Rust crates** for Office document parsing (docx-rs, calamine, etc.)

data/lib/parsekit/parsekit.bundle DELETED

Binary file