parsekit 0.1.0.pre.1 → 0.1.1 (RubyGems)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

checksums.yaml +4 -4
data/README.md +29 -17
data/ext/parsekit/Cargo.toml +9 -7
data/ext/parsekit/src/error.rs +7 -7
data/ext/parsekit/src/format_detector.rs +233 -0
data/ext/parsekit/src/lib.rs +1 -0
data/ext/parsekit/src/parser.rs +357 -199
data/lib/parsekit/NATIVE_API.md +125 -0
data/lib/parsekit/parsekit.bundle +0 -0
data/lib/parsekit/parser.rb +156 -104
data/lib/parsekit/version.rb +1 -1
data/lib/parsekit.rb +32 -0
metadata +4 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 02b091ecd1da29c68d59afb1089f1756cef350252b260b531ef82a06fb163c65
-  data.tar.gz: e34663b8f849a907ede07b357ad3c5b21a614c16ea767fb5b735c3422bd66aa7
+  metadata.gz: e77e605d938d5b0b89c7814d1360f4c505415c54efbf8ffe9f2f7d4c564d917e
+  data.tar.gz: 6b86f57b2dce1231cae704b4d35c7562807ab77b001860b6fa5bbcdc9844781f
 SHA512:
-  metadata.gz: b476aad0a9c9a711fce10d3a22dedd64e6ac82597c1d5d501d3ced7a46982d8f65b5bf44b513c3daabc5c5115a4b6278a0bea911b4dc9b1667010467e1cad8c9
-  data.tar.gz: f1d2adeb0bf8199b5ce397537b8b40577a79e954dddf33f4e4ff2fe418791cddb2f41adf79654c05c9d37e6ef0b1e99526f390560b5321feb711982d2218372d
+  metadata.gz: a3f7089e8bd3e84cb2e14614cb78c3b3132d4d93a3c95d5cdcfa6c63723fe2dfce3a01bf0ee27255be7ff036bd0e438492434ded72853772e57b65faf7bded9b
+  data.tar.gz: c84b03d65471f50d6ec72eaa21269b5fc1c5e40e0cefa923cc71d50b802734d64c8296e5e5e6a76ca2f5d388a568119ed09474ae622534aaef03e0a96109dee3

data/README.md CHANGED

@@ -1,14 +1,13 @@
-# ParseKit
+<img src="/docs/assets/parsekit-wide.png" alt="parsekit" height="80px">
-[![CI](https://github.com/cpetersen/parsekit/actions/workflows/ci.yml/badge.svg)](https://github.com/cpetersen/parsekit/actions/workflows/ci.yml)
 [![Gem Version](https://badge.fury.io/rb/parsekit.svg)](https://badge.fury.io/rb/parsekit)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-Native Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX, PPTX), images (with OCR), and more. Part of the ruby-nlp ecosystem.
+Native Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX), images (with OCR), and more. Part of the ruby-nlp ecosystem.
 ## Features
-- 📄 **Document Parsing**: Extract text from PDFs, Office documents (DOCX, XLSX, PPTX)
+- 📄 **Document Parsing**: Extract text from PDFs, Office documents (DOCX, XLSX)
 - 🖼️ **OCR Support**: Extract text from images using Tesseract OCR
 - 🚀 **High Performance**: Native Rust performance with Ruby convenience
 - 🔧 **Unified API**: Single interface for multiple document formats
@@ -38,13 +37,8 @@ gem install parsekit
 - Ruby >= 3.0.0
 - Rust toolchain (stable)
 - C compiler (for linking)
-- System libraries for document parsing:
-  - **macOS**: `brew install leptonica tesseract poppler`
-  - **Ubuntu/Debian**: `sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev`
-  - **Fedora/RHEL**: `sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel`
-  - **Windows**: See [DEPENDENCIES.md](DEPENDENCIES.md) for MSYS2 instructions
-For detailed installation instructions and troubleshooting, see [DEPENDENCIES.md](DEPENDENCIES.md).
+That's it! ParseKit bundles all necessary libraries including Tesseract for OCR, so you don't need to install any system dependencies.
 ## Usage
@@ -57,10 +51,6 @@ require 'parsekit'
 text = ParseKit.parse_file("document.pdf")
 puts text  # Extracted text from the PDF
-# Parse an Office document
-text = ParseKit.parse_file("presentation.pptx")
-puts text  # Extracted text from all slides
 # Parse an Excel file
 text = ParseKit.parse_file("spreadsheet.xlsx")
 puts text  # Extracted text from all sheets
@@ -131,7 +121,8 @@ excel_text = parser.parse_xlsx(excel_data)
 | PDF | .pdf | `parse_pdf` | Text extraction via MuPDF |
 | Word | .docx | `parse_docx` | Office Open XML format |
 | Excel | .xlsx, .xls | `parse_xlsx` | Both modern and legacy formats |
-| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via embedded Tesseract |
+| PowerPoint | .pptx | `parse_pptx` | Text extraction from slides and notes |
+| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via bundled Tesseract |
 | JSON | .json | `parse_json` | Pretty-printed output |
 | XML/HTML | .xml, .html | `parse_xml` | Extracts text content |
 | Text | .txt, .csv, .md | `parse_text` | With encoding detection |
@@ -161,6 +152,27 @@ To run tests with coverage:
 rake dev:coverage
 ```
+### OCR Mode Configuration
+By default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode:
+**Using system Tesseract during installation:**
+```bash
+gem install parsekit -- --no-default-features
+```
+**For development with system Tesseract:**
+```bash
+rake compile CARGO_FEATURES=""  # Disables bundled-tesseract feature
+```
+**System Tesseract requirements:**
+- **macOS**: `brew install tesseract`
+- **Ubuntu/Debian**: `sudo apt-get install libtesseract-dev`
+- **Fedora/RHEL**: `sudo dnf install tesseract-devel`
+The bundled mode adds ~1-3 minutes to initial gem installation but provides a completely self-contained experience with no external dependencies.
 ## Architecture
 ParseKit uses a hybrid Ruby/Rust architecture:
@@ -168,7 +180,7 @@ ParseKit uses a hybrid Ruby/Rust architecture:
 - **Ruby Layer**: Provides convenient API and format detection
 - **Rust Layer**: Implements high-performance parsing using:
   - MuPDF for PDF text extraction (statically linked)
-  - rusty-tesseract for OCR (with embedded Tesseract)
+  - tesseract-rs for OCR (with bundled Tesseract by default)
   - Pure Rust libraries for DOCX/XLSX parsing
   - Magnus for Ruby-Rust FFI bindings
@@ -180,4 +192,4 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/cpeter
 The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
-Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.
+Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.

data/ext/parsekit/Cargo.toml CHANGED

@@ -11,24 +11,26 @@ crate-type = ["cdylib"]
 name = "parsekit"
 [dependencies]
-magnus = { version = "0.7", features = ["rb-sys"] }
+magnus = { version = "0.8", features = ["rb-sys"] }
 # Document parsing - testing embedded C libraries
 # MuPDF builds from source and statically links
 mupdf = { version = "0.5", default-features = false, features = [] }
-# OCR - Tesseract with image loading support
-rusty-tesseract = "1.1"  # Tesseract wrapper with image loading
+# OCR - Using tesseract-rs for both system and bundled modes
+tesseract-rs = "0.1"  # Tesseract with optional bundling
 image = "0.25"  # Image processing library (match rusty-tesseract's version)
-calamine = "0.26"  # Excel parsing
+calamine = "0.30"  # Excel parsing
 docx-rs = "0.4"  # Word document parsing
-quick-xml = "0.36"  # XML parsing
+quick-xml = "0.38"  # XML parsing
+zip = "2.1"  # ZIP archive handling for PPTX
 serde_json = "1.0"  # JSON parsing
 regex = "1.10"  # Text parsing
 encoding_rs = "0.8"  # Encoding detection
 [features]
-default = []
+default = ["bundled-tesseract"]
+bundled-tesseract = []
 [profile.release]
 opt-level = 3
 lto = true
-codegen-units = 1
+codegen-units = 1
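
Note on the `[features]` change above: `bundled-tesseract` is now a default feature, and the README's `--no-default-features` install switches it off. The following is a minimal sketch of how Rust code can branch on such a flag at compile time. Only the feature name comes from the manifest; the helper `ocr_backend` is hypothetical and is not claimed to exist in parsekit.

```rust
/// Hypothetical helper (not part of parsekit): reports which Tesseract
/// linkage this build was compiled for, based on the Cargo feature that
/// the manifest above declares as `bundled-tesseract`.
pub fn ocr_backend() -> &'static str {
    // `cfg!` is resolved at compile time. It is true only when Cargo
    // enabled the feature, which the new `default = [...]` list does
    // unless the build passes `--no-default-features`.
    if cfg!(feature = "bundled-tesseract") {
        "bundled"
    } else {
        "system"
    }
}
```

Because the feature list is empty (`bundled-tesseract = []`), the flag does not pull in extra dependencies by itself; it presumably only gates code paths or build steps like the one sketched here.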

data/ext/parsekit/src/error.rs CHANGED

@@ -1,4 +1,4 @@
-use magnus::{exception, Error, RModule, Ruby, Module};
+use magnus::{Error, RModule, Ruby, Module};
 /// Custom error types for ParseKit
 #[derive(Debug)]
@@ -15,13 +15,13 @@ impl ParserError {
     pub fn to_error(&self) -> Error {
         match self {
             ParserError::ParseError(msg) => {
-                Error::new(exception::runtime_error(), msg.clone())
+                Error::new(Ruby::get().unwrap().exception_runtime_error(), msg.clone())
             }
             ParserError::ConfigError(msg) => {
-                Error::new(exception::arg_error(), msg.clone())
+                Error::new(Ruby::get().unwrap().exception_arg_error(), msg.clone())
             }
             ParserError::IoError(msg) => {
-                Error::new(exception::io_error(), msg.clone())
+                Error::new(Ruby::get().unwrap().exception_io_error(), msg.clone())
             }
         }
     }
@@ -37,9 +37,9 @@ pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
     // Define error classes as regular Ruby classes
     // Users can still rescue them by name in Ruby code
-    let _error = module.define_class("Error", magnus::class::object())?;
-    let _parse_error = module.define_class("ParseError", magnus::class::object())?;
-    let _config_error = module.define_class("ConfigError", magnus::class::object())?;
+    let _error = module.define_class("Error", Ruby::get().unwrap().class_object())?;
+    let _parse_error = module.define_class("ParseError", Ruby::get().unwrap().class_object())?;
+    let _config_error = module.define_class("ConfigError", Ruby::get().unwrap().class_object())?;
     Ok(())
 }
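
Note on the hunk above: the magnus 0.8 calls keep the same variant-to-exception mapping as before, and `to_error` still raises Ruby's built-in exception classes rather than the `Error`, `ParseError` and `ConfigError` classes that `init` defines under the module. The mapping can be summarized without magnus as a small sketch. The enum below is a stand-in that copies only the variants visible in the diff, and `ruby_exception_name` is a hypothetical helper used for illustration.

```rust
/// Stand-in for the crate's `ParserError`, limited to the variants
/// that appear in the diff above.
#[derive(Debug)]
pub enum ParserError {
    ParseError(String),
    ConfigError(String),
    IoError(String),
}

impl ParserError {
    /// Hypothetical helper: the name of the built-in Ruby exception class
    /// that `to_error` selects for each variant.
    pub fn ruby_exception_name(&self) -> &'static str {
        match self {
            // exception_runtime_error()
            ParserError::ParseError(_) => "RuntimeError",
            // exception_arg_error()
            ParserError::ConfigError(_) => "ArgumentError",
            // exception_io_error()
            ParserError::IoError(_) => "IOError",
        }
    }
}
```

On the Ruby side this means a failed parse would be rescued as `RuntimeError`, not by one of the module-level class names.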

data/ext/parsekit/src/format_detector.rs ADDED

@@ -0,0 +1,233 @@
+use std::path::Path;
+/// Represents a detected file format
+#[derive(Debug, Clone, PartialEq)]
+pub enum FileFormat {
+    Pdf,
+    Docx,
+    Xlsx,
+    Xls,
+    Pptx,
+    Png,
+    Jpeg,
+    Tiff,
+    Bmp,
+    Json,
+    Xml,
+    Html,
+    Text,
+    Unknown,
+}
+impl FileFormat {
+    /// Convert to Ruby symbol representation
+    pub fn to_symbol(&self) -> &'static str {
+        match self {
+            FileFormat::Pdf => "pdf",
+            FileFormat::Docx => "docx",
+            FileFormat::Xlsx => "xlsx",
+            FileFormat::Xls => "xls",
+            FileFormat::Pptx => "pptx",
+            FileFormat::Png => "png",
+            FileFormat::Jpeg => "jpeg",
+            FileFormat::Tiff => "tiff",
+            FileFormat::Bmp => "bmp",
+            FileFormat::Json => "json",
+            FileFormat::Xml => "xml",
+            FileFormat::Html => "xml", // HTML is treated as XML in Ruby
+            FileFormat::Text => "text",
+            FileFormat::Unknown => "unknown",
+        }
+    }
+}
+/// Central format detection logic
+pub struct FormatDetector;
+impl FormatDetector {
+    /// Detect format from filename and content
+    /// Prioritizes content detection over extension when both are available
+    pub fn detect(filename: Option<&str>, content: Option<&[u8]>) -> FileFormat {
+        // First try content-based detection if content is provided
+        if let Some(data) = content {
+            let format = Self::detect_from_content(data);
+            // If we got a definitive format from content, use it
+            if !matches!(format, FileFormat::Text | FileFormat::Unknown) {
+                return format;
+            }
+        }
+        // Fall back to extension-based detection
+        if let Some(name) = filename {
+            let ext_format = Self::detect_from_extension(name);
+            if ext_format != FileFormat::Unknown {
+                return ext_format;
+            }
+        }
+        // If content detection returned Text and no extension match, return Text
+        if let Some(data) = content {
+            let format = Self::detect_from_content(data);
+            if format == FileFormat::Text {
+                return FileFormat::Text;
+            }
+        }
+        FileFormat::Unknown
+    }
+    /// Detect format from file extension
+    pub fn detect_from_extension(filename: &str) -> FileFormat {
+        let path = Path::new(filename);
+        let ext = match path.extension().and_then(|s| s.to_str()) {
+            Some(e) => e.to_lowercase(),
+            None => return FileFormat::Unknown,
+        };
+        match ext.as_str() {
+            "pdf" => FileFormat::Pdf,
+            "docx" => FileFormat::Docx,
+            "xlsx" => FileFormat::Xlsx,
+            "xls" => FileFormat::Xls,
+            "pptx" => FileFormat::Pptx,
+            "png" => FileFormat::Png,
+            "jpg" | "jpeg" => FileFormat::Jpeg,
+            "tiff" | "tif" => FileFormat::Tiff,
+            "bmp" => FileFormat::Bmp,
+            "json" => FileFormat::Json,
+            "xml" => FileFormat::Xml,
+            "html" | "htm" => FileFormat::Html,
+            "txt" | "text" | "md" | "markdown" | "csv" => FileFormat::Text,
+            _ => FileFormat::Unknown,
+        }
+    }
+    /// Detect format from file content (magic bytes)
+    pub fn detect_from_content(data: &[u8]) -> FileFormat {
+        if data.is_empty() {
+            return FileFormat::Text; // Empty files are treated as text
+        }
+        // PDF
+        if data.len() >= 4 && data.starts_with(b"%PDF") {
+            return FileFormat::Pdf;
+        }
+        // PNG
+        if data.len() >= 8 && data.starts_with(&[0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]) {
+            return FileFormat::Png;
+        }
+        // JPEG
+        if data.len() >= 3 && data.starts_with(&[0xFF, 0xD8, 0xFF]) {
+            return FileFormat::Jpeg;
+        }
+        // BMP
+        if data.len() >= 2 && data.starts_with(b"BM") {
+            return FileFormat::Bmp;
+        }
+        // TIFF (little-endian or big-endian)
+        if data.len() >= 4 {
+            if data.starts_with(b"II\x2A\x00") || data.starts_with(b"MM\x00\x2A") {
+                return FileFormat::Tiff;
+            }
+        }
+        // OLE Compound Document (old Excel/Word)
+        if data.len() >= 4 && data.starts_with(&[0xD0, 0xCF, 0x11, 0xE0]) {
+            return FileFormat::Xls; // Old Office format, usually Excel
+        }
+        // ZIP archive (could be DOCX, XLSX, PPTX)
+        if data.len() >= 2 && data.starts_with(b"PK") {
+            return Self::detect_office_format(data);
+        }
+        // XML
+        if data.len() >= 5 {
+            let start = String::from_utf8_lossy(&data[0..5.min(data.len())]);
+            if start.starts_with("<?xml") || start.starts_with("<!") {
+                return FileFormat::Xml;
+            }
+        }
+        // HTML
+        if data.len() >= 14 {
+            let start = String::from_utf8_lossy(&data[0..14.min(data.len())]).to_lowercase();
+            if start.contains("<!doctype") || start.contains("<html") {
+                return FileFormat::Html;
+            }
+        }
+        // JSON
+        if let Some(&first_non_ws) = data.iter().find(|&&b| !b" \t\n\r".contains(&b)) {
+            if first_non_ws == b'{' || first_non_ws == b'[' {
+                return FileFormat::Json;
+            }
+        }
+        // Default to text for unrecognized formats
+        FileFormat::Text
+    }
+    /// Detect specific Office format from ZIP data
+    fn detect_office_format(data: &[u8]) -> FileFormat {
+        // Look for Office-specific directory names in first 2KB of ZIP
+        let check_len = 2000.min(data.len());
+        let content = String::from_utf8_lossy(&data[0..check_len]);
+        // Check for format-specific markers
+        if content.contains("word/") || content.contains("word/_rels") {
+            FileFormat::Docx
+        } else if content.contains("xl/") || content.contains("xl/_rels") {
+            FileFormat::Xlsx
+        } else if content.contains("ppt/") || content.contains("ppt/_rels") {
+            FileFormat::Pptx
+        } else {
+            // Default to XLSX for generic ZIP (most common Office format)
+            FileFormat::Xlsx
+        }
+    }
+    /// Get all supported extensions
+    pub fn supported_extensions() -> Vec<&'static str> {
+        vec![
+            "pdf", "docx", "xlsx", "xls", "pptx",
+            "png", "jpg", "jpeg", "tiff", "tif", "bmp",
+            "json", "xml", "html", "htm",
+            "txt", "text", "md", "markdown", "csv"
+        ]
+    }
+}
+#[cfg(test)]
+mod tests {
+    use super::*;
+    #[test]
+    fn test_detect_pdf() {
+        let pdf_data = b"%PDF-1.5\n";
+        assert_eq!(FormatDetector::detect_from_content(pdf_data), FileFormat::Pdf);
+    }
+    #[test]
+    fn test_detect_png() {
+        let png_data = &[0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A];
+        assert_eq!(FormatDetector::detect_from_content(png_data), FileFormat::Png);
+    }
+    #[test]
+    fn test_detect_from_extension() {
+        assert_eq!(FormatDetector::detect_from_extension("document.pdf"), FileFormat::Pdf);
+        assert_eq!(FormatDetector::detect_from_extension("Document.PDF"), FileFormat::Pdf);
+        assert_eq!(FormatDetector::detect_from_extension("data.xlsx"), FileFormat::Xlsx);
+    }
+    #[test]
+    fn test_empty_data() {
+        assert_eq!(FormatDetector::detect_from_content(&[]), FileFormat::Text);
+    }
+}
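
Note on `detect_office_format` above: it searches the first 2000 raw bytes for substrings such as `word/` and falls back to XLSX for any other ZIP. A stricter variant, sketched below under stated assumptions, reads the entry names from the ZIP local file headers and matches them by directory prefix. This is not the crate's implementation. `zip_entry_names`, `classify_office` and `OfficeKind` are hypothetical names, and the sketch stops at the first entry that uses a data descriptor (general-purpose flag bit 3), since the payload length is then not available in the local header.

```rust
/// Collect entry names by walking ZIP local file headers from offset 0.
/// Stops at the first record that is not a local header, is truncated,
/// or defers its sizes to a trailing data descriptor.
pub fn zip_entry_names(data: &[u8]) -> Vec<String> {
    const LOCAL_HEADER_SIG: [u8; 4] = [0x50, 0x4B, 0x03, 0x04]; // "PK\x03\x04"
    const HEADER_LEN: usize = 30; // fixed-size part of a local file header

    // Little-endian field readers; callers check bounds first.
    let le16 = |at: usize| u16::from_le_bytes([data[at], data[at + 1]]) as usize;
    let le32 = |at: usize| {
        u32::from_le_bytes([data[at], data[at + 1], data[at + 2], data[at + 3]]) as usize
    };

    let mut names = Vec::new();
    let mut pos = 0usize;
    while pos <= data.len()
        && data.len() - pos >= HEADER_LEN
        && data[pos..].starts_with(&LOCAL_HEADER_SIG)
    {
        let flags = le16(pos + 6);
        let compressed_size = le32(pos + 18);
        let name_len = le16(pos + 26);
        let extra_len = le16(pos + 28);

        let name_start = pos + HEADER_LEN;
        let name_end = name_start + name_len;
        if name_end > data.len() {
            break; // header claims a longer name than the buffer holds
        }
        names.push(String::from_utf8_lossy(&data[name_start..name_end]).into_owned());

        if flags & 0x0008 != 0 {
            break; // sizes are in a data descriptor; cannot skip the payload
        }
        // Jump over the extra field and the (possibly compressed) payload.
        pos = name_end
            .saturating_add(extra_len)
            .saturating_add(compressed_size);
    }
    names
}

/// Office container kinds this sketch distinguishes.
#[derive(Debug, PartialEq)]
pub enum OfficeKind {
    Docx,
    Xlsx,
    Pptx,
    Unknown,
}

/// Classify by the top-level directory of the first matching entry name.
pub fn classify_office(data: &[u8]) -> OfficeKind {
    for name in zip_entry_names(data) {
        if name.starts_with("word/") {
            return OfficeKind::Docx;
        }
        if name.starts_with("xl/") {
            return OfficeKind::Xlsx;
        }
        if name.starts_with("ppt/") {
            return OfficeKind::Pptx;
        }
    }
    OfficeKind::Unknown
}
```

Two design differences from the code in the diff are deliberate: prefix matching on parsed names avoids accidental hits inside compressed payload bytes, and an unrecognized ZIP is reported as `Unknown` instead of defaulting to XLSX. A production version would more likely read the central directory, for example through the `zip` crate that this release adds to `Cargo.toml`.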

data/ext/parsekit/src/lib.rs CHANGED

@@ -2,6 +2,7 @@ use magnus::{function, prelude::*, Error, Ruby};
 mod parser;
 mod error;
+mod format_detector;
 /// Initialize the ParseKit module and its submodules
 #[magnus::init]