
parsekit 0.1.2 → 0.1.3


Files changed (7)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 6ad6eb42fb7e96fa944f30245b2c7be51bf4ce1a0f7766749309676b225b17df
-  data.tar.gz: deb56ea394ac3fedc840e890e4d27de14585661233f19eeaae06baf7be1b1e90
+  metadata.gz: ee11d59d78b4a2d0d837233b464f3bc84c934659826fe879953b2c1caa56521a
+  data.tar.gz: b3829363b1821c19d51d86beb494b83a30e0f875e847a7b439a6ed27a993cb26
 SHA512:
-  metadata.gz: dc88b902dd12008a6936f4d62f5d4651544a3f463b725a15d385b919141e93873bd809436e6b9b008baa7b310d149becb2106a29ca103736f6525e09bef871d6
-  data.tar.gz: 9cbc5464a5cbe06a241d2253cde81da82c7eb75742654b7753c91a922acc87125f81a33c3e77d0d107a1435e8946a860e12388e44fa84dc887d9bb4bf9d2d3a2
+  metadata.gz: dddee73ead9421e822a25d97b8fc32ac852c6cc4b8729ae77308b304544738f53e3f6313f124eaf01e4a0bee0e28f2066d411dc00fe324ec47133a0ab9dc4445
+  data.tar.gz: 724666245f2fe62df854b034deb6db3b1db0fcd55f58a257e6929d0946fa1c88ec3e8644b59a0c3930dbc82608ff113d5a2d5a782b6e30f0e40678f1ed5edb43
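
The digests in this hunk cover the two members of the `.gem` archive, `metadata.gz` and `data.tar.gz`. A minimal sketch of checking such a file against the member bytes follows. The helper name `verify_checksums` is hypothetical, and extracting the members from the archive is left out; it assumes the caller already has the raw bytes in a hash keyed by member name.

```ruby
require "digest"
require "yaml"

# Hypothetical helper (not part of parsekit or RubyGems): compare the digests
# recorded in a checksums.yaml document with the actual member bytes.
#
# checksums_yaml - String holding the YAML text (algorithm => { member => hex digest })
# members        - Hash mapping member name (e.g. "data.tar.gz") to its raw bytes
#
# Returns an Array of [algorithm, member] pairs whose digests do not match.
def verify_checksums(checksums_yaml, members)
  algorithms = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }
  recorded = YAML.safe_load(checksums_yaml)

  mismatches = []
  recorded.each do |algorithm, entries|
    digest_class = algorithms.fetch(algorithm)
    entries.each do |member, expected|
      actual = digest_class.hexdigest(members.fetch(member))
      mismatches << [algorithm, member] unless actual == expected.to_s
    end
  end
  mismatches
end
```

An empty result means every recorded digest matched. Any pair in the result names the algorithm and member that differ.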

data/ext/parsekit/Cargo.toml CHANGED

@@ -14,14 +14,14 @@ name = "parsekit"
 magnus = { version = "0.8", features = ["rb-sys"] }
 # Document parsing - testing embedded C libraries
 # MuPDF builds from source and statically links
-mupdf = { version = "0.5", default-features = false, features = [] }
+mupdf = { version = "0.6", default-features = false, features = [] }
 # OCR - Using tesseract-rs for both system and bundled modes
 tesseract-rs = "0.1"  # Tesseract with optional bundling
 image = "0.25"  # Image processing library (match rusty-tesseract's version)
-calamine = "0.30"  # Excel parsing
+calamine = "0.34"  # Excel parsing
 docx-rs = "0.4"  # Word document parsing
-quick-xml = "0.38"  # XML parsing
-zip = "5.0"  # ZIP archive handling for PPTX
+quick-xml = "0.39"  # XML parsing
+zip = "8.2"  # ZIP archive handling for PPTX
 serde_json = "1.0"  # JSON parsing
 regex = "1.10"  # Text parsing
 encoding_rs = "0.8"  # Encoding detection
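
A bare version string in `Cargo.toml` is a caret requirement: it allows updates up to, but not including, the next change in the leftmost non-zero component. That is why none of the four bumps in this hunk could have been picked up without editing the manifest. The sketch below illustrates the rule in Ruby using `Gem::Version` for comparison. The helpers `caret_range` and `satisfies_caret?` are hypothetical names, and pre-release versions are ignored.

```ruby
require "rubygems"

# Hypothetical helper: compute the [lower, upper) bounds of a Cargo-style
# caret requirement such as "0.5" or "5.0". Pre-release tags are not handled.
def caret_range(requirement)
  parts = requirement.split(".").map(&:to_i)
  lower = (parts + [0, 0]).first(3)

  # The leftmost non-zero component is the compatibility boundary; if every
  # given component is zero, the last given component is the boundary.
  boundary = parts.index { |n| n != 0 } || parts.length - 1

  upper = lower.each_with_index.map do |n, i|
    if i < boundary then n
    elsif i == boundary then n + 1
    else 0
    end
  end

  [Gem::Version.new(lower.join(".")), Gem::Version.new(upper.join("."))]
end

# True when `version` falls inside the caret range of `requirement`.
def satisfies_caret?(requirement, version)
  lower, upper = caret_range(requirement)
  candidate = Gem::Version.new(version)
  candidate >= lower && candidate < upper
end

# Old requirement vs. the first release in the new line, per the hunk above.
{ "0.5" => "0.6.0", "0.30" => "0.34.0", "0.38" => "0.39.0", "5.0" => "8.2.0" }.each do |old_req, new_version|
  puts "#{old_req} admits #{new_version}? #{satisfies_caret?(old_req, new_version)}"
end
```

Under this rule a `0.x` bump in the minor number counts as potentially breaking, which is consistent with a dependency update being accompanied by source changes elsewhere in the diff.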

data/ext/parsekit/src/parser.rs CHANGED

@@ -242,7 +242,7 @@ impl Parser {
             // Continue on page errors rather than failing entirely
             if let Ok(page) = doc.load_page(page_num) {
                 // Extract text from the page
-                if let Ok(text) = page.to_text() {
+                if let Ok(text) = page.to_text_page(mupdf::TextPageFlags::empty()).and_then(|tp| tp.to_text()) {
                     all_text.push_str(&text);
                     all_text.push('\n');
                 }

data/lib/parsekit/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module ParseKit
-  VERSION = "0.1.2"
+  VERSION = "0.1.3"
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: parsekit
 version: !ruby/object:Gem::Version
-  version: 0.1.2
+  version: 0.1.3
 platform: ruby
 authors:
 - Chris Petersen
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-09-06 00:00:00.000000000 Z
+date: 2026-03-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rb_sys
@@ -100,9 +100,7 @@ files:
 - ext/parsekit/src/lib.rs
 - ext/parsekit/src/parser.rs
 - lib/parsekit.rb
-- lib/parsekit/NATIVE_API.md
 - lib/parsekit/error.rb
-- lib/parsekit/parsekit.bundle
 - lib/parsekit/parser.rb
 - lib/parsekit/version.rb
 homepage: https://github.com/scientist-labs/parsekit
@@ -127,7 +125,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.5.3
+rubygems_version: 3.5.22
 signing_key:
 specification_version: 4
 summary: Ruby document parsing toolkit with PDF and OCR support

data/lib/parsekit/NATIVE_API.md DELETED

@@ -1,125 +0,0 @@
-# ParseKit Native API Documentation
-This document describes the methods implemented in the Rust native extension for ParseKit::Parser.
-## Instance Methods
-### `initialize(options = {})`
-Initialize a new Parser instance with optional configuration.
-**Parameters:**
-- `options` [Hash] Configuration options
-  - `:encoding` [String] Input encoding (default: UTF-8)
-  - `:strict_mode` [Boolean] Enable strict parsing mode (default: false)
-  - `:max_depth` [Integer] Maximum nesting depth (default: 100)
-  - `:max_size` [Integer] Maximum file size in bytes (default: 100MB)
-### `parse(input)`
-Parse an input string (for text content).
-**Parameters:**
-- `input` [String] The input to parse
-**Returns:**
-- [String] The parsed result
-**Raises:**
-- `ArgumentError` If input is empty
-### `parse_file(path)`
-Parse a file (supports PDF, Office documents, text files, images with OCR).
-**Parameters:**
-- `path` [String] Path to the file to parse
-**Returns:**
-- [String] The extracted text content
-**Raises:**
-- `IOError` If file cannot be read
-- `RuntimeError` If parsing fails
-### `parse_bytes(data)`
-Parse binary data.
-**Parameters:**
-- `data` [Array<Integer>] Binary data as byte array
-**Returns:**
-- [String] The extracted text content
-**Raises:**
-- `ArgumentError` If data is empty
-- `RuntimeError` If parsing fails
-### `config`
-Get the current parser configuration.
-**Returns:**
-- [Hash] The parser configuration including encoding, strict_mode, max_depth, and max_size
-### `supports_file?(path)`
-Check if a file format is supported.
-**Parameters:**
-- `path` [String] File path to check
-**Returns:**
-- [Boolean] True if the file format is supported
-### `strict_mode?`
-Check if strict mode is enabled.
-**Returns:**
-- [Boolean] True if strict mode is enabled
-## Format-Specific Parsers
-These methods are also available but typically called internally via `parse_file` or `parse_bytes`:
-### `parse_pdf(data)`
-Parse PDF files using MuPDF (statically linked).
-### `parse_docx(data)`
-Parse Microsoft Word documents.
-### `parse_pptx(data)`
-Parse Microsoft PowerPoint presentations.
-### `parse_xlsx(data)`
-Parse Microsoft Excel spreadsheets.
-### `parse_json(data)`
-Parse and pretty-print JSON data.
-### `parse_xml(data)`
-Parse XML/HTML files and extract text content.
-### `parse_text(data)`
-Parse plain text files.
-### `ocr_image(data)`
-Perform OCR on images (PNG, JPEG, TIFF, BMP) using Tesseract.
-## Class Methods
-### `Parser.supported_formats`
-Get list of supported file formats.
-**Returns:**
-- [Array<String>] List of supported file extensions
-**Example:**
-```ruby
-ParseKit::Parser.supported_formats
-# => ["txt", "json", "xml", "html", "docx", "xlsx", "xls", "csv", "pdf", "png", "jpg", "jpeg", "tiff", "bmp", ...]
-```
-## Implementation Notes
-All these methods are implemented in Rust via the native extension. The Ruby layer (`lib/parsekit/parser.rb`) provides additional convenience methods and helpers that wrap these native methods.
-The native extension uses:
-- **MuPDF** for PDF parsing (statically linked)
-- **Tesseract** for OCR functionality (bundled)
-- **Various Rust crates** for Office document parsing (docx-rs, calamine, etc.)

data/lib/parsekit/parsekit.bundle DELETED

Binary file