RubyGems - parsekit - Versions diffs - 0.1.0.pre.1 → 0.1.0 - Mend

parsekit 0.1.0.pre.1 → 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

checksums.yaml +4 -4
data/README.md +29 -17
data/ext/parsekit/Cargo.toml +9 -7
data/ext/parsekit/src/error.rs +7 -7
data/ext/parsekit/src/parser.rs +317 -89
data/lib/parsekit/parsekit.bundle +0 -0
data/lib/parsekit/parser.rb +1 -0
data/lib/parsekit/version.rb +1 -1
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 02b091ecd1da29c68d59afb1089f1756cef350252b260b531ef82a06fb163c65
-  data.tar.gz: e34663b8f849a907ede07b357ad3c5b21a614c16ea767fb5b735c3422bd66aa7
+  metadata.gz: 35c0708c088075c883b3b35c7d76f1573f29a19bf65ac0b89b636a5b76cee662
+  data.tar.gz: b1ddf9260329239c3a1e791f3ed3249b3577cb210f4c898677316fe55cc951f4
 SHA512:
-  metadata.gz: b476aad0a9c9a711fce10d3a22dedd64e6ac82597c1d5d501d3ced7a46982d8f65b5bf44b513c3daabc5c5115a4b6278a0bea911b4dc9b1667010467e1cad8c9
-  data.tar.gz: f1d2adeb0bf8199b5ce397537b8b40577a79e954dddf33f4e4ff2fe418791cddb2f41adf79654c05c9d37e6ef0b1e99526f390560b5321feb711982d2218372d
+  metadata.gz: 2fe76f5b28927e3989502b0ea5f084f5bfc265aae9a65aaba47349e3e540e8150612d75f8f4ddcdc38be7edd9ae7edbf42220ba95b42a535dbc200503759c419
+  data.tar.gz: e5b9e8eff90f8583f8289bea5100ac43434978ebba814bf9198fb92cc622a9b4fa6e99e28fe2ed31ffa0040c3ac48a38c8361bc1994200059a23d040440a64cc
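
A `.gem` file is a tar archive whose members include `metadata.gz` and `data.tar.gz`, and the hunk above records their new digests. As a minimal sketch, assuming the member has already been extracted and read into memory, one of the SHA256 entries could be checked like this. The helper name `sha256_matches?` is hypothetical and not part of RubyGems or parsekit.

```ruby
require "digest"

# Hypothetical helper: compare the SHA-256 of some bytes with an expected
# hex digest such as the ones listed in checksums.yaml.
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Reading the archive member from disk is left to the caller, for example:
#   sha256_matches?(
#     File.binread("data.tar.gz"),
#     "b1ddf9260329239c3a1e791f3ed3249b3577cb210f4c898677316fe55cc951f4"
#   )
```

The same shape works for the SHA512 entries by swapping in `Digest::SHA512`.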

data/README.md CHANGED

@@ -1,14 +1,13 @@
-# ParseKit
+<img src="/docs/assets/parsekit-wide.png" alt="parsekit" height="80px">
-[![CI](https://github.com/cpetersen/parsekit/actions/workflows/ci.yml/badge.svg)](https://github.com/cpetersen/parsekit/actions/workflows/ci.yml)
 [![Gem Version](https://badge.fury.io/rb/parsekit.svg)](https://badge.fury.io/rb/parsekit)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-Native Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX, PPTX), images (with OCR), and more. Part of the ruby-nlp ecosystem.
+Native Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX), images (with OCR), and more. Part of the ruby-nlp ecosystem.
 ## Features
-- 📄 **Document Parsing**: Extract text from PDFs, Office documents (DOCX, XLSX, PPTX)
+- 📄 **Document Parsing**: Extract text from PDFs, Office documents (DOCX, XLSX)
 - 🖼️ **OCR Support**: Extract text from images using Tesseract OCR
 - 🚀 **High Performance**: Native Rust performance with Ruby convenience
 - 🔧 **Unified API**: Single interface for multiple document formats
@@ -38,13 +37,8 @@ gem install parsekit
 - Ruby >= 3.0.0
 - Rust toolchain (stable)
 - C compiler (for linking)
-- System libraries for document parsing:
-  - **macOS**: `brew install leptonica tesseract poppler`
-  - **Ubuntu/Debian**: `sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev`
-  - **Fedora/RHEL**: `sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel`
-  - **Windows**: See [DEPENDENCIES.md](DEPENDENCIES.md) for MSYS2 instructions
-For detailed installation instructions and troubleshooting, see [DEPENDENCIES.md](DEPENDENCIES.md).
+That's it! ParseKit bundles all necessary libraries including Tesseract for OCR, so you don't need to install any system dependencies.
 ## Usage
@@ -57,10 +51,6 @@ require 'parsekit'
 text = ParseKit.parse_file("document.pdf")
 puts text  # Extracted text from the PDF
-# Parse an Office document
-text = ParseKit.parse_file("presentation.pptx")
-puts text  # Extracted text from all slides
 # Parse an Excel file
 text = ParseKit.parse_file("spreadsheet.xlsx")
 puts text  # Extracted text from all sheets
@@ -131,7 +121,8 @@ excel_text = parser.parse_xlsx(excel_data)
 | PDF | .pdf | `parse_pdf` | Text extraction via MuPDF |
 | Word | .docx | `parse_docx` | Office Open XML format |
 | Excel | .xlsx, .xls | `parse_xlsx` | Both modern and legacy formats |
-| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via embedded Tesseract |
+| PowerPoint | .pptx | `parse_pptx` | Text extraction from slides and notes |
+| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via bundled Tesseract |
 | JSON | .json | `parse_json` | Pretty-printed output |
 | XML/HTML | .xml, .html | `parse_xml` | Extracts text content |
 | Text | .txt, .csv, .md | `parse_text` | With encoding detection |
@@ -161,6 +152,27 @@ To run tests with coverage:
 rake dev:coverage
 ```
+### OCR Mode Configuration
+By default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode:
+**Using system Tesseract during installation:**
+```bash
+gem install parsekit -- --no-default-features
+```
+**For development with system Tesseract:**
+```bash
+rake compile CARGO_FEATURES=""  # Disables bundled-tesseract feature
+```
+**System Tesseract requirements:**
+- **macOS**: `brew install tesseract`
+- **Ubuntu/Debian**: `sudo apt-get install libtesseract-dev`
+- **Fedora/RHEL**: `sudo dnf install tesseract-devel`
+The bundled mode adds ~1-3 minutes to initial gem installation but provides a completely self-contained experience with no external dependencies.
 ## Architecture
 ParseKit uses a hybrid Ruby/Rust architecture:
@@ -168,7 +180,7 @@ ParseKit uses a hybrid Ruby/Rust architecture:
 - **Ruby Layer**: Provides convenient API and format detection
 - **Rust Layer**: Implements high-performance parsing using:
   - MuPDF for PDF text extraction (statically linked)
-  - rusty-tesseract for OCR (with embedded Tesseract)
+  - tesseract-rs for OCR (with bundled Tesseract by default)
   - Pure Rust libraries for DOCX/XLSX parsing
   - Magnus for Ruby-Rust FFI bindings
@@ -180,4 +192,4 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/cpeter
 The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
-Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.
+Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.
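
The supported-formats table in the README hunk above pairs file extensions with parser methods, including the new `parse_pptx` row. The following sketch transcribes only the rows visible in that hunk into a lookup table. The constant `PARSER_METHODS` and the helper `parser_method_for` are hypothetical names used for illustration, not part of the gem's API. The fallback to `:parse_text` mirrors the default branch shown in `parser.rs`.

```ruby
# Extension-to-method table transcribed from the README rows visible in this diff.
PARSER_METHODS = {
  ".pdf"  => :parse_pdf,
  ".docx" => :parse_docx,
  ".xlsx" => :parse_xlsx,
  ".xls"  => :parse_xlsx,
  ".pptx" => :parse_pptx,
  ".png"  => :ocr_image,
  ".jpg"  => :ocr_image,
  ".jpeg" => :ocr_image,
  ".tiff" => :ocr_image,
  ".bmp"  => :ocr_image,
  ".json" => :parse_json,
  ".xml"  => :parse_xml,
  ".html" => :parse_xml,
  ".txt"  => :parse_text,
  ".csv"  => :parse_text,
  ".md"   => :parse_text
}.freeze

# Hypothetical helper: pick a parser method name for a path, case-insensitively,
# falling back to plain-text parsing for unknown or missing extensions.
def parser_method_for(path)
  PARSER_METHODS.fetch(File.extname(path).downcase, :parse_text)
end
```

With the gem loaded, the returned symbol could be passed to `public_send` on a `ParseKit::Parser` instance together with the file's bytes, assuming the instance methods take binary data as the README's `parse_xlsx(excel_data)` context line suggests.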

data/ext/parsekit/Cargo.toml CHANGED

@@ -11,24 +11,26 @@ crate-type = ["cdylib"]
 name = "parsekit"
 [dependencies]
-magnus = { version = "0.7", features = ["rb-sys"] }
+magnus = { version = "0.8", features = ["rb-sys"] }
 # Document parsing - testing embedded C libraries
 # MuPDF builds from source and statically links
 mupdf = { version = "0.5", default-features = false, features = [] }
-# OCR - Tesseract with image loading support
-rusty-tesseract = "1.1"  # Tesseract wrapper with image loading
+# OCR - Using tesseract-rs for both system and bundled modes
+tesseract-rs = "0.1"  # Tesseract with optional bundling
 image = "0.25"  # Image processing library (match rusty-tesseract's version)
-calamine = "0.26"  # Excel parsing
+calamine = "0.30"  # Excel parsing
 docx-rs = "0.4"  # Word document parsing
-quick-xml = "0.36"  # XML parsing
+quick-xml = "0.38"  # XML parsing
+zip = "2.1"  # ZIP archive handling for PPTX
 serde_json = "1.0"  # JSON parsing
 regex = "1.10"  # Text parsing
 encoding_rs = "0.8"  # Encoding detection
 [features]
-default = []
+default = ["bundled-tesseract"]
+bundled-tesseract = []
 [profile.release]
 opt-level = 3
 lto = true
-codegen-units = 1
+codegen-units = 1
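
With `bundled-tesseract` now a default feature, the `ocr_image` changes in `parser.rs` still look for a tessdata directory at run time: `TESSDATA_PREFIX` first, then a fixed list of system and local paths. The sketch below restates that search order in Ruby for readers of the Ruby layer. The names `BUNDLED_TESSDATA_DIRS`, `tessdata_candidates` and `first_existing_tessdata_dir` are hypothetical. Unlike the Rust code, it only checks that a directory exists and does not try to initialize Tesseract with it.

```ruby
# Fallback directories in the order the bundled-mode branch of ocr_image lists them.
BUNDLED_TESSDATA_DIRS = [
  "/usr/share/tessdata",
  "/usr/local/share/tessdata",
  "/opt/homebrew/share/tessdata",
  "/opt/local/share/tessdata",
  "tessdata", # local tessdata directory
  "."         # current directory as a last resort
].freeze

# Hypothetical helper: TESSDATA_PREFIX (if set) is tried before the fixed list.
def tessdata_candidates(env = ENV)
  prefix = env["TESSDATA_PREFIX"]
  (prefix ? [prefix] : []) + BUNDLED_TESSDATA_DIRS
end

# Hypothetical helper: first candidate that exists on disk, or nil if none does.
def first_existing_tessdata_dir(env = ENV)
  tessdata_candidates(env).find { |dir| File.exist?(dir) }
end
```

Passing the environment as a parameter keeps the lookup easy to exercise with a plain hash instead of the real `ENV`.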

data/ext/parsekit/src/error.rs CHANGED

@@ -1,4 +1,4 @@
-use magnus::{exception, Error, RModule, Ruby, Module};
+use magnus::{Error, RModule, Ruby, Module};
 /// Custom error types for ParseKit
 #[derive(Debug)]
@@ -15,13 +15,13 @@ impl ParserError {
     pub fn to_error(&self) -> Error {
         match self {
             ParserError::ParseError(msg) => {
-                Error::new(exception::runtime_error(), msg.clone())
+                Error::new(Ruby::get().unwrap().exception_runtime_error(), msg.clone())
             }
             ParserError::ConfigError(msg) => {
-                Error::new(exception::arg_error(), msg.clone())
+                Error::new(Ruby::get().unwrap().exception_arg_error(), msg.clone())
             }
             ParserError::IoError(msg) => {
-                Error::new(exception::io_error(), msg.clone())
+                Error::new(Ruby::get().unwrap().exception_io_error(), msg.clone())
             }
         }
     }
@@ -37,9 +37,9 @@ pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
     // Define error classes as regular Ruby classes
     // Users can still rescue them by name in Ruby code
-    let _error = module.define_class("Error", magnus::class::object())?;
-    let _parse_error = module.define_class("ParseError", magnus::class::object())?;
-    let _config_error = module.define_class("ConfigError", magnus::class::object())?;
+    let _error = module.define_class("Error", Ruby::get().unwrap().class_object())?;
+    let _parse_error = module.define_class("ParseError", Ruby::get().unwrap().class_object())?;
+    let _config_error = module.define_class("ConfigError", Ruby::get().unwrap().class_object())?;
     Ok(())
 }

data/ext/parsekit/src/parser.rs CHANGED

@@ -1,5 +1,5 @@
 use magnus::{
-    class, function, method, prelude::*, scan_args, Error, RHash, RModule, Ruby, Value, Module,
+    function, method, prelude::*, scan_args, Error, Module, RHash, RModule, Ruby, Value,
 };
 use std::path::Path;
@@ -33,9 +33,9 @@ impl Parser {
     fn new(ruby: &Ruby, args: &[Value]) -> Result<Self, Error> {
         let args = scan_args::scan_args::<(), (Option<RHash>,), (), (), (), ()>(args)?;
         let options = args.optional.0;
         let mut config = ParserConfig::default();
         if let Some(opts) = options {
             if let Some(strict) = opts.get(ruby.to_symbol("strict_mode")) {
                 config.strict_mode = bool::try_convert(strict)?;
@@ -50,30 +50,35 @@ impl Parser {
                 config.max_size = usize::try_convert(max_size)?;
             }
         }
         Ok(Self { config })
     }
     /// Parse input bytes based on file type (internal helper)
     fn parse_bytes_internal(&self, data: Vec<u8>, filename: Option<&str>) -> Result<String, Error> {
         // Check size limit
         if data.len() > self.config.max_size {
             return Err(Error::new(
-                magnus::exception::runtime_error(),
-                format!("File size {} exceeds maximum allowed size {}", data.len(), self.config.max_size),
+                Ruby::get().unwrap().exception_runtime_error(),
+                format!(
+                    "File size {} exceeds maximum allowed size {}",
+                    data.len(),
+                    self.config.max_size
+                ),
             ));
         }
         // Detect file type from extension or content
         let file_type = if let Some(name) = filename {
             Self::detect_type_from_filename(name)
         } else {
             Self::detect_type_from_content(&data)
         };
         match file_type.as_str() {
             "pdf" => self.parse_pdf(data),
             "docx" => self.parse_docx(data),
+            "pptx" => self.parse_pptx(data),
             "xlsx" | "xls" => self.parse_xlsx(data),
             "json" => self.parse_json(data),
             "xml" | "html" => self.parse_xml(data),
@@ -82,7 +87,7 @@ impl Parser {
             _ => self.parse_text(data), // Default to text parsing
         }
     }
     /// Detect file type from filename extension
     fn detect_type_from_filename(filename: &str) -> String {
         let path = Path::new(filename);
@@ -91,7 +96,7 @@ impl Parser {
             None => "txt".to_string(),
         }
     }
     /// Detect file type from content (basic detection)
     fn detect_type_from_content(data: &[u8]) -> String {
         if data.starts_with(b"%PDF") {
@@ -120,65 +125,138 @@ impl Parser {
             "txt".to_string()
         }
     }
     /// Perform OCR on image data using Tesseract
     fn ocr_image(&self, data: Vec<u8>) -> Result<String, Error> {
-        use rusty_tesseract::{Image, Args};
+        use tesseract_rs::TesseractAPI;
+        // Create tesseract instance
+        let tesseract = TesseractAPI::new();
-        // Load image from memory
+        // Try to initialize with appropriate tessdata path
+        // Even in bundled mode, we need to find tessdata files
+        #[cfg(feature = "bundled-tesseract")]
+        let init_result = {
+            // Build list of tessdata paths to try
+            let mut tessdata_paths = Vec::new();
+            // Check TESSDATA_PREFIX environment variable first (for CI)
+            if let Ok(env_path) = std::env::var("TESSDATA_PREFIX") {
+                tessdata_paths.push(env_path);
+            }
+            // Add common system paths
+            tessdata_paths.extend_from_slice(&[
+                "/usr/share/tessdata".to_string(),
+                "/usr/local/share/tessdata".to_string(),
+                "/opt/homebrew/share/tessdata".to_string(),
+                "/opt/local/share/tessdata".to_string(),
+                "tessdata".to_string(),  // Local tessdata directory
+                ".".to_string(),  // Current directory as fallback
+            ]);
+            let mut result = Err(tesseract_rs::TesseractError::InitError);
+            for path in &tessdata_paths {
+                // Check if path exists first to avoid noisy error messages
+                if std::path::Path::new(path).exists() {
+                    if tesseract.init(path.as_str(), "eng").is_ok() {
+                        result = Ok(());
+                        break;
+                    }
+                }
+            }
+            result
+        };
+        #[cfg(not(feature = "bundled-tesseract"))]
+        let init_result = {
+            // Try common system tessdata paths
+            let tessdata_paths = vec![
+                "/usr/share/tessdata",
+                "/usr/local/share/tessdata",
+                "/opt/homebrew/share/tessdata",
+                "/opt/local/share/tessdata",
+            ];
+            let mut result = Err(tesseract_rs::TesseractError::InitError);
+            for path in &tessdata_paths {
+                if std::path::Path::new(path).exists() {
+                    if tesseract.init(path, "eng").is_ok() {
+                        result = Ok(());
+                        break;
+                    }
+                }
+            }
+            result
+        };
+        if let Err(e) = init_result {
+            return Err(Error::new(
+                Ruby::get().unwrap().exception_runtime_error(),
+                format!("Failed to initialize Tesseract: {:?}", e),
+            ))
+        }
+        // Load the image from bytes
         let img = match image::load_from_memory(&data) {
             Ok(img) => img,
             Err(e) => return Err(Error::new(
-                magnus::exception::runtime_error(),
+                Ruby::get().unwrap().exception_runtime_error(),
                 format!("Failed to load image: {}", e),
             ))
         };
-        // Create rusty_tesseract Image from DynamicImage
-        let tess_img = match Image::from_dynamic_image(&img) {
-            Ok(img) => img,
-            Err(e) => return Err(Error::new(
-                magnus::exception::runtime_error(),
-                format!("Failed to convert image for OCR: {}", e),
-            ))
-        };
+        // Convert to RGBA8 format
+        let rgba_img = img.to_rgba8();
+        let (width, height) = rgba_img.dimensions();
+        let raw_data = rgba_img.into_raw();
-        // Set up OCR arguments
-        let mut args = Args::default();
-        args.lang = "eng".to_string();
-        // Optional: Add more configuration
-        // args.config_variables.insert("tessedit_char_whitelist".to_string(),
-        //     "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz .,!?-".to_string());
+        // Set image data
+        if let Err(e) = tesseract.set_image(
+            &raw_data,
+            width as i32,
+            height as i32,
+            4,  // bytes per pixel (RGBA)
+            (width * 4) as i32,  // bytes per line
+        ) {
+            return Err(Error::new(
+                Ruby::get().unwrap().exception_runtime_error(),
+                format!("Failed to set image: {}", e),
+            ))
+        }
-        // Perform OCR
-        match rusty_tesseract::image_to_string(&tess_img, &args) {
+        // Extract text
+        match tesseract.get_utf8_text() {
             Ok(text) => Ok(text.trim().to_string()),
             Err(e) => Err(Error::new(
-                magnus::exception::runtime_error(),
+                Ruby::get().unwrap().exception_runtime_error(),
                 format!("Failed to perform OCR: {}", e),
-            ))
+            )),
         }
     }
     /// Parse PDF files using MuPDF (statically linked) - exposed to Ruby
     fn parse_pdf(&self, data: Vec<u8>) -> Result<String, Error> {
         use mupdf::Document;
         // Try to load the PDF from memory
         // The magic parameter helps MuPDF identify the file type
         match Document::from_bytes(&data, "pdf") {
             Ok(doc) => {
                 let mut all_text = String::new();
                 // Get page count - this returns a Result
                 let page_count = match doc.page_count() {
                     Ok(count) => count,
-                    Err(e) => return Err(Error::new(
-                        magnus::exception::runtime_error(),
-                        format!("Failed to get page count: {}", e),
-                    ))
+                    Err(e) => {
+                        return Err(Error::new(
+                            Ruby::get().unwrap().exception_runtime_error(),
+                            format!("Failed to get page count: {}", e),
+                        ))
+                    }
                 };
                 // Iterate through pages
                 for page_num in 0..page_count {
                     match doc.load_page(page_num) {
@@ -195,28 +273,31 @@ impl Parser {
                         Err(_) => continue,
                     }
                 }
                 if all_text.is_empty() {
-                    Ok("PDF contains no extractable text (might be scanned/image-based)".to_string())
+                    Ok(
+                        "PDF contains no extractable text (might be scanned/image-based)"
+                            .to_string(),
+                    )
                 } else {
                     Ok(all_text.trim().to_string())
                 }
             }
             Err(e) => Err(Error::new(
-                magnus::exception::runtime_error(),
+                Ruby::get().unwrap().exception_runtime_error(),
                 format!("Failed to parse PDF: {}", e),
-            ))
+            )),
         }
     }
     /// Parse DOCX (Word) files - exposed to Ruby
     fn parse_docx(&self, data: Vec<u8>) -> Result<String, Error> {
         use docx_rs::read_docx;
         match read_docx(&data) {
             Ok(docx) => {
                 let mut result = String::new();
                 // Extract text from all document children
                 // For simplicity, we'll focus on paragraphs only for now
                 // Tables require more complex handling with the current API
@@ -238,29 +319,166 @@ impl Parser {
                     // table.rows -> TableChild::TableRow -> row.cells -> TableRowChild
                     // which has a more complex structure in docx-rs
                 }
                 Ok(result.trim().to_string())
             }
             Err(e) => Err(Error::new(
-                magnus::exception::runtime_error(),
+                Ruby::get().unwrap().exception_runtime_error(),
                 format!("Failed to parse DOCX file: {}", e),
-            ))
+            )),
+        }
+    }
+    /// Parse PPTX (PowerPoint) files - exposed to Ruby
+    fn parse_pptx(&self, data: Vec<u8>) -> Result<String, Error> {
+        use std::io::{Cursor, Read};
+        use zip::ZipArchive;
+        let cursor = Cursor::new(data);
+        let mut archive = match ZipArchive::new(cursor) {
+            Ok(archive) => archive,
+            Err(e) => {
+                return Err(Error::new(
+                    Ruby::get().unwrap().exception_runtime_error(),
+                    format!("Failed to open PPTX as ZIP: {}", e),
+                ))
+            }
+        };
+        let mut all_text = Vec::new();
+        let mut slide_numbers = Vec::new();
+        // First, collect slide numbers and sort them
+        for i in 0..archive.len() {
+            let file = match archive.by_index(i) {
+                Ok(file) => file,
+                Err(_) => continue,
+            };
+            let name = file.name();
+            // Match slide XML files (e.g., ppt/slides/slide1.xml)
+            if name.starts_with("ppt/slides/slide") && name.ends_with(".xml") && !name.contains("_rels") {
+                // Extract slide number from filename
+                if let Some(num_str) = name
+                    .strip_prefix("ppt/slides/slide")
+                    .and_then(|s| s.strip_suffix(".xml"))
+                {
+                    if let Ok(num) = num_str.parse::<usize>() {
+                        slide_numbers.push((num, i));
+                    }
+                }
+            }
+        }
+        // Sort by slide number to maintain order
+        slide_numbers.sort_by_key(|&(num, _)| num);
+        // Now process slides in order
+        for (_, index) in slide_numbers {
+            let mut file = match archive.by_index(index) {
+                Ok(file) => file,
+                Err(_) => continue,
+            };
+            let mut contents = String::new();
+            if file.read_to_string(&mut contents).is_ok() {
+                // Extract text from slide XML
+                let text = self.extract_text_from_slide_xml(&contents);
+                if !text.is_empty() {
+                    all_text.push(text);
+                }
+            }
+        }
+        // Also extract notes if present
+        for i in 0..archive.len() {
+            let mut file = match archive.by_index(i) {
+                Ok(file) => file,
+                Err(_) => continue,
+            };
+            let name = file.name();
+            // Match notes slide XML files
+            if name.starts_with("ppt/notesSlides/notesSlide") && name.ends_with(".xml") && !name.contains("_rels") {
+                let mut contents = String::new();
+                if file.read_to_string(&mut contents).is_ok() {
+                    let text = self.extract_text_from_slide_xml(&contents);
+                    if !text.is_empty() {
+                        all_text.push(format!("[Notes: {}]", text));
+                    }
+                }
+            }
+        }
+        if all_text.is_empty() {
+            Ok("".to_string())
+        } else {
+            Ok(all_text.join("\n\n"))
         }
     }
+    /// Helper method to extract text from slide XML
+    fn extract_text_from_slide_xml(&self, xml_content: &str) -> String {
+        use quick_xml::events::Event;
+        use quick_xml::Reader;
+        let mut reader = Reader::from_str(xml_content);
+        let mut text_parts = Vec::new();
+        let mut buf = Vec::new();
+        let mut in_text_element = false;
+        loop {
+            match reader.read_event_into(&mut buf) {
+                Ok(Event::Start(ref e)) => {
+                    // Look for text elements (a:t or t)
+                    let name = e.name();
+                    let local_name_bytes = name.local_name();
+                    let local_name = std::str::from_utf8(local_name_bytes.as_ref()).unwrap_or("");
+                    if local_name == "t" {
+                        in_text_element = true;
+                    }
+                }
+                Ok(Event::Text(e)) => {
+                    if in_text_element {
+                        if let Ok(text) = e.decode() {
+                            let text_str = text.trim();
+                            if !text_str.is_empty() {
+                                text_parts.push(text_str.to_string());
+                            }
+                        }
+                    }
+                }
+                Ok(Event::End(ref e)) => {
+                    let name = e.name();
+                    let local_name_bytes = name.local_name();
+                    let local_name = std::str::from_utf8(local_name_bytes.as_ref()).unwrap_or("");
+                    if local_name == "t" {
+                        in_text_element = false;
+                    }
+                }
+                Ok(Event::Eof) => break,
+                _ => {}
+            }
+            buf.clear();
+        }
+        text_parts.join(" ")
+    }
     /// Parse Excel files - exposed to Ruby
     fn parse_xlsx(&self, data: Vec<u8>) -> Result<String, Error> {
         use calamine::{Reader, Xlsx};
         use std::io::Cursor;
         let cursor = Cursor::new(data);
         match Xlsx::new(cursor) {
             Ok(mut workbook) => {
                 let mut result = String::new();
                 for sheet_name in workbook.sheet_names().to_owned() {
                     result.push_str(&format!("Sheet: {}\n", sheet_name));
                     if let Ok(range) = workbook.worksheet_range(&sheet_name) {
                         for row in range.rows() {
                             for cell in row {
@@ -271,44 +489,46 @@ impl Parser {
                     }
                     result.push('\n');
                 }
                 Ok(result)
             }
             Err(e) => Err(Error::new(
-                magnus::exception::runtime_error(),
+                Ruby::get().unwrap().exception_runtime_error(),
                 format!("Failed to parse Excel file: {}", e),
-            ))
+            )),
         }
     }
     /// Parse JSON files - exposed to Ruby
     fn parse_json(&self, data: Vec<u8>) -> Result<String, Error> {
         let text = String::from_utf8_lossy(&data);
         match serde_json::from_str::<serde_json::Value>(&text) {
-            Ok(json) => Ok(serde_json::to_string_pretty(&json).unwrap_or_else(|_| text.to_string())),
+            Ok(json) => {
+                Ok(serde_json::to_string_pretty(&json).unwrap_or_else(|_| text.to_string()))
+            }
             Err(_) => Ok(text.to_string()),
         }
     }
     /// Parse XML/HTML files - exposed to Ruby
     fn parse_xml(&self, data: Vec<u8>) -> Result<String, Error> {
         use quick_xml::events::Event;
         use quick_xml::Reader;
         let mut reader = Reader::from_reader(&data[..]);
         let mut txt = String::new();
         let mut buf = Vec::new();
         loop {
             match reader.read_event_into(&mut buf) {
                 Ok(Event::Text(e)) => {
-                    txt.push_str(&e.unescape().unwrap_or_default());
+                    txt.push_str(&e.decode().unwrap_or_default());
                     txt.push(' ');
                 }
                 Ok(Event::Eof) => break,
                 Err(e) => {
                     return Err(Error::new(
-                        magnus::exception::runtime_error(),
+                        Ruby::get().unwrap().exception_runtime_error(),
                         format!("XML parse error: {}", e),
                     ))
                 }
@@ -316,15 +536,15 @@ impl Parser {
             }
             buf.clear();
         }
         Ok(txt.trim().to_string())
     }
     /// Parse plain text with encoding detection - exposed to Ruby
     fn parse_text(&self, data: Vec<u8>) -> Result<String, Error> {
         // Detect encoding
         let (decoded, _encoding, malformed) = encoding_rs::UTF_8.decode(&data);
         if malformed {
             // Try other encodings
             let (decoded, _encoding, _malformed) = encoding_rs::WINDOWS_1252.decode(&data);
@@ -333,16 +553,16 @@ impl Parser {
             Ok(decoded.to_string())
         }
     }
     /// Parse input string (for text content)
     fn parse(&self, input: String) -> Result<String, Error> {
         if input.is_empty() {
             return Err(Error::new(
-                magnus::exception::arg_error(),
+                Ruby::get().unwrap().exception_arg_error(),
                 "Input cannot be empty",
             ));
         }
         // For string input, just return cleaned text
         // If strict mode is on, append indicator for testing
         if self.config.strict_mode {
@@ -351,29 +571,33 @@ impl Parser {
             Ok(input.trim().to_string())
         }
     }
     /// Parse a file
     fn parse_file(&self, path: String) -> Result<String, Error> {
         use std::fs;
-        let data = fs::read(&path)
-            .map_err(|e| Error::new(magnus::exception::io_error(), format!("Failed to read file: {}", e)))?;
+        let data = fs::read(&path).map_err(|e| {
+            Error::new(
+                Ruby::get().unwrap().exception_io_error(),
+                format!("Failed to read file: {}", e),
+            )
+        })?;
         self.parse_bytes_internal(data, Some(&path))
     }
     /// Parse bytes from Ruby
     fn parse_bytes(&self, data: Vec<u8>) -> Result<String, Error> {
         if data.is_empty() {
             return Err(Error::new(
-                magnus::exception::arg_error(),
+                Ruby::get().unwrap().exception_arg_error(),
                 "Data cannot be empty",
             ));
         }
         self.parse_bytes_internal(data, None)
     }
     /// Get parser configuration
     fn config(&self) -> Result<RHash, Error> {
         let ruby = Ruby::get().unwrap();
@@ -384,12 +608,12 @@ impl Parser {
         hash.aset(ruby.to_symbol("max_size"), self.config.max_size)?;
         Ok(hash)
     }
     /// Check if parser is in strict mode
     fn strict_mode(&self) -> bool {
         self.config.strict_mode
     }
     /// Check supported file types
     fn supported_formats() -> Vec<String> {
         vec![
@@ -397,7 +621,10 @@ impl Parser {
             "json".to_string(),
             "xml".to_string(),
             "html".to_string(),
+            "htm".to_string(), // HTML files (alternative extension)
+            "md".to_string(),  // Markdown files
             "docx".to_string(),
+            "pptx".to_string(),
             "xlsx".to_string(),
             "xls".to_string(),
             "csv".to_string(),
@@ -409,12 +636,12 @@ impl Parser {
             "bmp".to_string(),  // OCR via Tesseract
         ]
     }
     /// Detect if file extension is supported
     fn supports_file(&self, path: String) -> bool {
         if let Some(ext) = std::path::Path::new(&path)
             .extension()
-            .and_then(|s| s.to_str())
+            .and_then(|s| s.to_str())
         {
             Self::supported_formats().contains(&ext.to_lowercase())
         } else {
@@ -441,8 +668,8 @@ fn parse_bytes_direct(data: Vec<u8>) -> Result<String, Error> {
 /// Initialize the Parser class
 pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
-    let class = module.define_class("Parser", class::object())?;
+    let class = module.define_class("Parser", Ruby::get().unwrap().class_object())?;
     // Instance methods
     class.define_singleton_method("new", function!(Parser::new, -1))?;
     class.define_method("parse", method!(Parser::parse, 1))?;
@@ -451,22 +678,23 @@ pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
     class.define_method("config", method!(Parser::config, 0))?;
     class.define_method("strict_mode?", method!(Parser::strict_mode, 0))?;
     class.define_method("supports_file?", method!(Parser::supports_file, 1))?;
     // Individual parser methods exposed to Ruby
     class.define_method("parse_pdf", method!(Parser::parse_pdf, 1))?;
     class.define_method("parse_docx", method!(Parser::parse_docx, 1))?;
+    class.define_method("parse_pptx", method!(Parser::parse_pptx, 1))?;
     class.define_method("parse_xlsx", method!(Parser::parse_xlsx, 1))?;
     class.define_method("parse_json", method!(Parser::parse_json, 1))?;
     class.define_method("parse_xml", method!(Parser::parse_xml, 1))?;
     class.define_method("parse_text", method!(Parser::parse_text, 1))?;
     class.define_method("ocr_image", method!(Parser::ocr_image, 1))?;
     // Class methods
     class.define_singleton_method("supported_formats", function!(Parser::supported_formats, 0))?;
     // Module-level convenience methods
     module.define_singleton_method("parse_file", function!(parse_file_direct, 1))?;
     module.define_singleton_method("parse_bytes", function!(parse_bytes_direct, 1))?;
     Ok(())
-}
+}
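
The new `parse_pptx` in the hunk above collects `ppt/slides/slideN.xml` entries, parses `N` as an integer and sorts on it before reading the slides. The numeric sort matters because a plain string sort would place `slide10.xml` before `slide2.xml`. A minimal Ruby sketch of that ordering step follows. It works on a list of entry names only, and `SLIDE_ENTRY` and `ordered_slide_entries` are hypothetical names, not part of the gem.

```ruby
# Matches slide parts such as "ppt/slides/slide12.xml" and captures the number.
# Relationship files under "ppt/slides/_rels/" do not match this pattern.
SLIDE_ENTRY = %r{\Appt/slides/slide(\d+)\.xml\z}

# Hypothetical helper: keep only slide entries and return them in slide order.
def ordered_slide_entries(entry_names)
  entry_names
    .filter_map { |name| (m = SLIDE_ENTRY.match(name)) && [m[1].to_i, name] }
    .sort_by(&:first) # numeric order, so slide2 sorts before slide10
    .map(&:last)
end
```

Notes entries (`ppt/notesSlides/notesSlideN.xml`) are filtered out here. In the Rust code they are handled by a separate pass that appends `[Notes: ...]` blocks after all slide text.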

data/lib/parsekit/parsekit.bundle CHANGED

Binary file

data/lib/parsekit/parser.rb CHANGED

@@ -89,6 +89,7 @@ module ParseKit
       case ext.downcase
       when 'docx' then :docx
+      when 'pptx' then :pptx
       when 'xlsx', 'xls' then :xlsx
       when 'pdf' then :pdf
       when 'json' then :json

data/lib/parsekit/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module ParseKit
-  VERSION = "0.1.0.pre.1"
+  VERSION = "0.1.0"
 end
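
The version bump above moves the gem from a prerelease to a final release. RubyGems treats any version containing a letter as a prerelease and orders it before the corresponding final version, which the following sketch illustrates with `Gem::Version`. The helper name `upgrade?` is hypothetical.

```ruby
require "rubygems"

# Hypothetical helper: true when `to` is a strictly newer version than `from`.
def upgrade?(from, to)
  Gem::Version.new(from) < Gem::Version.new(to)
end

old_version = Gem::Version.new("0.1.0.pre.1")
new_version = Gem::Version.new("0.1.0")

puts old_version.prerelease?            # the "pre" segment marks a prerelease
puts new_version.prerelease?            # a purely numeric version is final
puts upgrade?("0.1.0.pre.1", "0.1.0")   # the final release sorts after the prerelease
```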

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: parsekit
 version: !ruby/object:Gem::Version
-  version: 0.1.0.pre.1
+  version: 0.1.0
 platform: ruby
 authors:
 - Chris Petersen
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-08-21 00:00:00.000000000 Z
+date: 2025-09-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rb_sys