parsekit 0.1.0.pre.1 → 0.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 02b091ecd1da29c68d59afb1089f1756cef350252b260b531ef82a06fb163c65
- data.tar.gz: e34663b8f849a907ede07b357ad3c5b21a614c16ea767fb5b735c3422bd66aa7
+ metadata.gz: 35c0708c088075c883b3b35c7d76f1573f29a19bf65ac0b89b636a5b76cee662
+ data.tar.gz: b1ddf9260329239c3a1e791f3ed3249b3577cb210f4c898677316fe55cc951f4
  SHA512:
- metadata.gz: b476aad0a9c9a711fce10d3a22dedd64e6ac82597c1d5d501d3ced7a46982d8f65b5bf44b513c3daabc5c5115a4b6278a0bea911b4dc9b1667010467e1cad8c9
- data.tar.gz: f1d2adeb0bf8199b5ce397537b8b40577a79e954dddf33f4e4ff2fe418791cddb2f41adf79654c05c9d37e6ef0b1e99526f390560b5321feb711982d2218372d
+ metadata.gz: 2fe76f5b28927e3989502b0ea5f084f5bfc265aae9a65aaba47349e3e540e8150612d75f8f4ddcdc38be7edd9ae7edbf42220ba95b42a535dbc200503759c419
+ data.tar.gz: e5b9e8eff90f8583f8289bea5100ac43434978ebba814bf9198fb92cc622a9b4fa6e99e28fe2ed31ffa0040c3ac48a38c8361bc1994200059a23d040440a64cc
data/README.md CHANGED
@@ -1,14 +1,13 @@
- # ParseKit
+ <img src="/docs/assets/parsekit-wide.png" alt="parsekit" height="80px">
 
- [![CI](https://github.com/cpetersen/parsekit/actions/workflows/ci.yml/badge.svg)](https://github.com/cpetersen/parsekit/actions/workflows/ci.yml)
  [![Gem Version](https://badge.fury.io/rb/parsekit.svg)](https://badge.fury.io/rb/parsekit)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
- Native Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX, PPTX), images (with OCR), and more. Part of the ruby-nlp ecosystem.
+ Native Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX), images (with OCR), and more. Part of the ruby-nlp ecosystem.
 
  ## Features
 
- - 📄 **Document Parsing**: Extract text from PDFs, Office documents (DOCX, XLSX, PPTX)
+ - 📄 **Document Parsing**: Extract text from PDFs, Office documents (DOCX, XLSX)
  - 🖼️ **OCR Support**: Extract text from images using Tesseract OCR
  - 🚀 **High Performance**: Native Rust performance with Ruby convenience
  - 🔧 **Unified API**: Single interface for multiple document formats
@@ -38,13 +37,8 @@ gem install parsekit
  - Ruby >= 3.0.0
  - Rust toolchain (stable)
  - C compiler (for linking)
- - System libraries for document parsing:
- - **macOS**: `brew install leptonica tesseract poppler`
- - **Ubuntu/Debian**: `sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev`
- - **Fedora/RHEL**: `sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel`
- - **Windows**: See [DEPENDENCIES.md](DEPENDENCIES.md) for MSYS2 instructions
 
- For detailed installation instructions and troubleshooting, see [DEPENDENCIES.md](DEPENDENCIES.md).
+ That's it! ParseKit bundles all necessary libraries including Tesseract for OCR, so you don't need to install any system dependencies.
 
  ## Usage
 
@@ -57,10 +51,6 @@ require 'parsekit'
  text = ParseKit.parse_file("document.pdf")
  puts text # Extracted text from the PDF
 
- # Parse an Office document
- text = ParseKit.parse_file("presentation.pptx")
- puts text # Extracted text from all slides
-
  # Parse an Excel file
  text = ParseKit.parse_file("spreadsheet.xlsx")
  puts text # Extracted text from all sheets
@@ -131,7 +121,8 @@ excel_text = parser.parse_xlsx(excel_data)
  | PDF | .pdf | `parse_pdf` | Text extraction via MuPDF |
  | Word | .docx | `parse_docx` | Office Open XML format |
  | Excel | .xlsx, .xls | `parse_xlsx` | Both modern and legacy formats |
- | Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via embedded Tesseract |
+ | PowerPoint | .pptx | `parse_pptx` | Text extraction from slides and notes |
+ | Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via bundled Tesseract |
  | JSON | .json | `parse_json` | Pretty-printed output |
  | XML/HTML | .xml, .html | `parse_xml` | Extracts text content |
  | Text | .txt, .csv, .md | `parse_text` | With encoding detection |
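Review note on the format table above: each row pairs a set of file extensions with one parser method. A minimal sketch of that mapping as an extension-based lookup, using only the standard library, might look like the following. This is not the gem's code: the `Format` enum and `format_for` function are hypothetical names introduced for illustration, and lowercasing the extension before matching is an assumption of the sketch.

```rust
use std::path::Path;

/// Hypothetical format tag; one variant per row of the README table.
#[derive(Debug, PartialEq)]
enum Format {
    Pdf,
    Docx,
    Xlsx,
    Pptx,
    Image,
    Json,
    Xml,
    Text,
}

/// Picks a format from the file extension, falling back to plain text
/// when the extension is missing or unknown.
/// Lowercasing the extension is an assumption of this sketch.
fn format_for(filename: &str) -> Format {
    let ext = Path::new(filename)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_lowercase());

    match ext.as_deref() {
        Some("pdf") => Format::Pdf,
        Some("docx") => Format::Docx,
        Some("xlsx") | Some("xls") => Format::Xlsx,
        Some("pptx") => Format::Pptx,
        Some("png") | Some("jpg") | Some("jpeg") | Some("tiff") | Some("bmp") => Format::Image,
        Some("json") => Format::Json,
        Some("xml") | Some("html") => Format::Xml,
        _ => Format::Text,
    }
}
```

With this shape, supporting a new format such as PPTX comes down to one extra enum variant and one extra match arm.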
@@ -161,6 +152,27 @@ To run tests with coverage:
  rake dev:coverage
  ```
 
+ ### OCR Mode Configuration
+
+ By default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode:
+
+ **Using system Tesseract during installation:**
+ ```bash
+ gem install parsekit -- --no-default-features
+ ```
+
+ **For development with system Tesseract:**
+ ```bash
+ rake compile CARGO_FEATURES="" # Disables bundled-tesseract feature
+ ```
+
+ **System Tesseract requirements:**
+ - **macOS**: `brew install tesseract`
+ - **Ubuntu/Debian**: `sudo apt-get install libtesseract-dev`
+ - **Fedora/RHEL**: `sudo dnf install tesseract-devel`
+
+ The bundled mode adds ~1-3 minutes to initial gem installation but provides a completely self-contained experience with no external dependencies.
+
  ## Architecture
 
  ParseKit uses a hybrid Ruby/Rust architecture:
@@ -168,7 +180,7 @@ ParseKit uses a hybrid Ruby/Rust architecture:
  - **Ruby Layer**: Provides convenient API and format detection
  - **Rust Layer**: Implements high-performance parsing using:
  - MuPDF for PDF text extraction (statically linked)
- - rusty-tesseract for OCR (with embedded Tesseract)
+ - tesseract-rs for OCR (with bundled Tesseract by default)
  - Pure Rust libraries for DOCX/XLSX parsing
  - Magnus for Ruby-Rust FFI bindings
 
@@ -180,4 +192,4 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/cpeter
 
  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
 
- Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.
+ Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.
@@ -11,24 +11,26 @@ crate-type = ["cdylib"]
  name = "parsekit"
 
  [dependencies]
- magnus = { version = "0.7", features = ["rb-sys"] }
+ magnus = { version = "0.8", features = ["rb-sys"] }
  # Document parsing - testing embedded C libraries
  # MuPDF builds from source and statically links
  mupdf = { version = "0.5", default-features = false, features = [] }
- # OCR - Tesseract with image loading support
- rusty-tesseract = "1.1" # Tesseract wrapper with image loading
+ # OCR - Using tesseract-rs for both system and bundled modes
+ tesseract-rs = "0.1" # Tesseract with optional bundling
  image = "0.25" # Image processing library (match rusty-tesseract's version)
- calamine = "0.26" # Excel parsing
+ calamine = "0.30" # Excel parsing
  docx-rs = "0.4" # Word document parsing
- quick-xml = "0.36" # XML parsing
+ quick-xml = "0.38" # XML parsing
+ zip = "2.1" # ZIP archive handling for PPTX
  serde_json = "1.0" # JSON parsing
  regex = "1.10" # Text parsing
  encoding_rs = "0.8" # Encoding detection
 
  [features]
- default = []
+ default = ["bundled-tesseract"]
+ bundled-tesseract = []
 
  [profile.release]
  opt-level = 3
  lto = true
- codegen-units = 1
+ codegen-units = 1
@@ -1,4 +1,4 @@
- use magnus::{exception, Error, RModule, Ruby, Module};
+ use magnus::{Error, RModule, Ruby, Module};
 
  /// Custom error types for ParseKit
  #[derive(Debug)]
@@ -15,13 +15,13 @@ impl ParserError {
  pub fn to_error(&self) -> Error {
  match self {
  ParserError::ParseError(msg) => {
- Error::new(exception::runtime_error(), msg.clone())
+ Error::new(Ruby::get().unwrap().exception_runtime_error(), msg.clone())
  }
  ParserError::ConfigError(msg) => {
- Error::new(exception::arg_error(), msg.clone())
+ Error::new(Ruby::get().unwrap().exception_arg_error(), msg.clone())
  }
  ParserError::IoError(msg) => {
- Error::new(exception::io_error(), msg.clone())
+ Error::new(Ruby::get().unwrap().exception_io_error(), msg.clone())
  }
  }
  }
@@ -37,9 +37,9 @@ pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
 
  // Define error classes as regular Ruby classes
  // Users can still rescue them by name in Ruby code
- let _error = module.define_class("Error", magnus::class::object())?;
- let _parse_error = module.define_class("ParseError", magnus::class::object())?;
- let _config_error = module.define_class("ConfigError", magnus::class::object())?;
+ let _error = module.define_class("Error", Ruby::get().unwrap().class_object())?;
+ let _parse_error = module.define_class("ParseError", Ruby::get().unwrap().class_object())?;
+ let _config_error = module.define_class("ConfigError", Ruby::get().unwrap().class_object())?;
 
  Ok(())
  }
@@ -1,5 +1,5 @@
  use magnus::{
- class, function, method, prelude::*, scan_args, Error, RHash, RModule, Ruby, Value, Module,
+ function, method, prelude::*, scan_args, Error, Module, RHash, RModule, Ruby, Value,
  };
  use std::path::Path;
 
@@ -33,9 +33,9 @@ impl Parser {
  fn new(ruby: &Ruby, args: &[Value]) -> Result<Self, Error> {
  let args = scan_args::scan_args::<(), (Option<RHash>,), (), (), (), ()>(args)?;
  let options = args.optional.0;
-
+
  let mut config = ParserConfig::default();
-
+
  if let Some(opts) = options {
  if let Some(strict) = opts.get(ruby.to_symbol("strict_mode")) {
  config.strict_mode = bool::try_convert(strict)?;
@@ -50,30 +50,35 @@ impl Parser {
  config.max_size = usize::try_convert(max_size)?;
  }
  }
-
+
  Ok(Self { config })
  }
-
+
  /// Parse input bytes based on file type (internal helper)
  fn parse_bytes_internal(&self, data: Vec<u8>, filename: Option<&str>) -> Result<String, Error> {
  // Check size limit
  if data.len() > self.config.max_size {
  return Err(Error::new(
- magnus::exception::runtime_error(),
- format!("File size {} exceeds maximum allowed size {}", data.len(), self.config.max_size),
+ Ruby::get().unwrap().exception_runtime_error(),
+ format!(
+ "File size {} exceeds maximum allowed size {}",
+ data.len(),
+ self.config.max_size
+ ),
  ));
  }
-
+
  // Detect file type from extension or content
  let file_type = if let Some(name) = filename {
  Self::detect_type_from_filename(name)
  } else {
  Self::detect_type_from_content(&data)
  };
-
+
  match file_type.as_str() {
  "pdf" => self.parse_pdf(data),
  "docx" => self.parse_docx(data),
+ "pptx" => self.parse_pptx(data),
  "xlsx" | "xls" => self.parse_xlsx(data),
  "json" => self.parse_json(data),
  "xml" | "html" => self.parse_xml(data),
@@ -82,7 +87,7 @@ impl Parser {
  _ => self.parse_text(data), // Default to text parsing
  }
  }
-
+
  /// Detect file type from filename extension
  fn detect_type_from_filename(filename: &str) -> String {
  let path = Path::new(filename);
@@ -91,7 +96,7 @@ impl Parser {
  None => "txt".to_string(),
  }
  }
-
+
  /// Detect file type from content (basic detection)
  fn detect_type_from_content(data: &[u8]) -> String {
  if data.starts_with(b"%PDF") {
@@ -120,65 +125,138 @@ impl Parser {
  "txt".to_string()
  }
  }
-
+
  /// Perform OCR on image data using Tesseract
  fn ocr_image(&self, data: Vec<u8>) -> Result<String, Error> {
- use rusty_tesseract::{Image, Args};
+ use tesseract_rs::TesseractAPI;
+
+ // Create tesseract instance
+ let tesseract = TesseractAPI::new();
 
- // Load image from memory
+ // Try to initialize with appropriate tessdata path
+ // Even in bundled mode, we need to find tessdata files
+ #[cfg(feature = "bundled-tesseract")]
+ let init_result = {
+ // Build list of tessdata paths to try
+ let mut tessdata_paths = Vec::new();
+
+ // Check TESSDATA_PREFIX environment variable first (for CI)
+ if let Ok(env_path) = std::env::var("TESSDATA_PREFIX") {
+ tessdata_paths.push(env_path);
+ }
+
+ // Add common system paths
+ tessdata_paths.extend_from_slice(&[
+ "/usr/share/tessdata".to_string(),
+ "/usr/local/share/tessdata".to_string(),
+ "/opt/homebrew/share/tessdata".to_string(),
+ "/opt/local/share/tessdata".to_string(),
+ "tessdata".to_string(), // Local tessdata directory
+ ".".to_string(), // Current directory as fallback
+ ]);
+
+ let mut result = Err(tesseract_rs::TesseractError::InitError);
+ for path in &tessdata_paths {
+ // Check if path exists first to avoid noisy error messages
+ if std::path::Path::new(path).exists() {
+ if tesseract.init(path.as_str(), "eng").is_ok() {
+ result = Ok(());
+ break;
+ }
+ }
+ }
+ result
+ };
+
+ #[cfg(not(feature = "bundled-tesseract"))]
+ let init_result = {
+ // Try common system tessdata paths
+ let tessdata_paths = vec![
+ "/usr/share/tessdata",
+ "/usr/local/share/tessdata",
+ "/opt/homebrew/share/tessdata",
+ "/opt/local/share/tessdata",
+ ];
+
+ let mut result = Err(tesseract_rs::TesseractError::InitError);
+ for path in &tessdata_paths {
+ if std::path::Path::new(path).exists() {
+ if tesseract.init(path, "eng").is_ok() {
+ result = Ok(());
+ break;
+ }
+ }
+ }
+ result
+ };
+
+ if let Err(e) = init_result {
+ return Err(Error::new(
+ Ruby::get().unwrap().exception_runtime_error(),
+ format!("Failed to initialize Tesseract: {:?}", e),
+ ))
+ }
+
+ // Load the image from bytes
  let img = match image::load_from_memory(&data) {
  Ok(img) => img,
  Err(e) => return Err(Error::new(
- magnus::exception::runtime_error(),
+ Ruby::get().unwrap().exception_runtime_error(),
  format!("Failed to load image: {}", e),
  ))
  };
 
- // Create rusty_tesseract Image from DynamicImage
- let tess_img = match Image::from_dynamic_image(&img) {
- Ok(img) => img,
- Err(e) => return Err(Error::new(
- magnus::exception::runtime_error(),
- format!("Failed to convert image for OCR: {}", e),
- ))
- };
+ // Convert to RGBA8 format
+ let rgba_img = img.to_rgba8();
+ let (width, height) = rgba_img.dimensions();
+ let raw_data = rgba_img.into_raw();
 
- // Set up OCR arguments
- let mut args = Args::default();
- args.lang = "eng".to_string();
- // Optional: Add more configuration
- // args.config_variables.insert("tessedit_char_whitelist".to_string(),
- // "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz .,!?-".to_string());
+ // Set image data
+ if let Err(e) = tesseract.set_image(
+ &raw_data,
+ width as i32,
+ height as i32,
+ 4, // bytes per pixel (RGBA)
+ (width * 4) as i32, // bytes per line
+ ) {
+ return Err(Error::new(
+ Ruby::get().unwrap().exception_runtime_error(),
+ format!("Failed to set image: {}", e),
+ ))
+ }
 
- // Perform OCR
- match rusty_tesseract::image_to_string(&tess_img, &args) {
+ // Extract text
+ match tesseract.get_utf8_text() {
  Ok(text) => Ok(text.trim().to_string()),
  Err(e) => Err(Error::new(
- magnus::exception::runtime_error(),
+ Ruby::get().unwrap().exception_runtime_error(),
  format!("Failed to perform OCR: {}", e),
- ))
+ )),
  }
  }
 
+
  /// Parse PDF files using MuPDF (statically linked) - exposed to Ruby
  fn parse_pdf(&self, data: Vec<u8>) -> Result<String, Error> {
  use mupdf::Document;
-
+
  // Try to load the PDF from memory
  // The magic parameter helps MuPDF identify the file type
  match Document::from_bytes(&data, "pdf") {
  Ok(doc) => {
  let mut all_text = String::new();
-
+
  // Get page count - this returns a Result
  let page_count = match doc.page_count() {
  Ok(count) => count,
- Err(e) => return Err(Error::new(
- magnus::exception::runtime_error(),
- format!("Failed to get page count: {}", e),
- ))
+ Err(e) => {
+ return Err(Error::new(
+ Ruby::get().unwrap().exception_runtime_error(),
+ format!("Failed to get page count: {}", e),
+ ))
+ }
  };
-
+
  // Iterate through pages
  for page_num in 0..page_count {
  match doc.load_page(page_num) {
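Review note on the `ocr_image` changes in the hunk above: both `cfg` branches share one idea, which is to walk a list of candidate tessdata directories and stop at the first one that exists and initializes. A minimal standalone sketch of that search, using only the standard library, might look like the following. `first_usable_dir` and its `try_init` callback are hypothetical names; the callback stands in for `tesseract.init(path, "eng")` so the sketch does not depend on the tesseract-rs API.

```rust
use std::path::{Path, PathBuf};

/// Returns the first candidate directory that exists on disk and for which
/// `try_init` reports success. `env_override`, when present, is tried before
/// the fixed defaults. `try_init` is a hypothetical stand-in for the real
/// Tesseract initialization call.
fn first_usable_dir<F>(
    env_override: Option<&str>,
    defaults: &[&str],
    try_init: F,
) -> Option<PathBuf>
where
    F: Fn(&Path) -> bool,
{
    // Build the ordered candidate list: override first, then defaults.
    let mut candidates: Vec<&str> = Vec::new();
    if let Some(dir) = env_override {
        candidates.push(dir);
    }
    candidates.extend_from_slice(defaults);

    for dir in candidates {
        let path = Path::new(dir);
        // Skip missing directories before attempting initialization.
        if path.exists() && try_init(path) {
            return Some(path.to_path_buf());
        }
    }
    None
}
```

Passing the value of `TESSDATA_PREFIX` as `env_override` reproduces the ordering of the bundled branch, where the environment variable is tried before the fixed system paths. Returning `None` corresponds to the case where the hunk reports an initialization failure.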
@@ -195,28 +273,31 @@ impl Parser {
  Err(_) => continue,
  }
  }
-
+
  if all_text.is_empty() {
- Ok("PDF contains no extractable text (might be scanned/image-based)".to_string())
+ Ok(
+ "PDF contains no extractable text (might be scanned/image-based)"
+ .to_string(),
+ )
  } else {
  Ok(all_text.trim().to_string())
  }
  }
  Err(e) => Err(Error::new(
- magnus::exception::runtime_error(),
+ Ruby::get().unwrap().exception_runtime_error(),
  format!("Failed to parse PDF: {}", e),
- ))
+ )),
  }
  }
-
+
  /// Parse DOCX (Word) files - exposed to Ruby
  fn parse_docx(&self, data: Vec<u8>) -> Result<String, Error> {
  use docx_rs::read_docx;
-
+
  match read_docx(&data) {
  Ok(docx) => {
  let mut result = String::new();
-
+
  // Extract text from all document children
  // For simplicity, we'll focus on paragraphs only for now
  // Tables require more complex handling with the current API
@@ -238,29 +319,166 @@ impl Parser {
  // table.rows -> TableChild::TableRow -> row.cells -> TableRowChild
  // which has a more complex structure in docx-rs
  }
-
+
  Ok(result.trim().to_string())
  }
  Err(e) => Err(Error::new(
- magnus::exception::runtime_error(),
+ Ruby::get().unwrap().exception_runtime_error(),
  format!("Failed to parse DOCX file: {}", e),
- ))
+ )),
+ }
+ }
+
+ /// Parse PPTX (PowerPoint) files - exposed to Ruby
+ fn parse_pptx(&self, data: Vec<u8>) -> Result<String, Error> {
+ use std::io::{Cursor, Read};
+ use zip::ZipArchive;
+
+ let cursor = Cursor::new(data);
+ let mut archive = match ZipArchive::new(cursor) {
+ Ok(archive) => archive,
+ Err(e) => {
+ return Err(Error::new(
+ Ruby::get().unwrap().exception_runtime_error(),
+ format!("Failed to open PPTX as ZIP: {}", e),
+ ))
+ }
+ };
+
+ let mut all_text = Vec::new();
+ let mut slide_numbers = Vec::new();
+
+ // First, collect slide numbers and sort them
+ for i in 0..archive.len() {
+ let file = match archive.by_index(i) {
+ Ok(file) => file,
+ Err(_) => continue,
+ };
+
+ let name = file.name();
+ // Match slide XML files (e.g., ppt/slides/slide1.xml)
+ if name.starts_with("ppt/slides/slide") && name.ends_with(".xml") && !name.contains("_rels") {
+ // Extract slide number from filename
+ if let Some(num_str) = name
+ .strip_prefix("ppt/slides/slide")
+ .and_then(|s| s.strip_suffix(".xml"))
+ {
+ if let Ok(num) = num_str.parse::<usize>() {
+ slide_numbers.push((num, i));
+ }
+ }
+ }
+ }
+
+ // Sort by slide number to maintain order
+ slide_numbers.sort_by_key(|&(num, _)| num);
+
+ // Now process slides in order
+ for (_, index) in slide_numbers {
+ let mut file = match archive.by_index(index) {
+ Ok(file) => file,
+ Err(_) => continue,
+ };
+
+ let mut contents = String::new();
+ if file.read_to_string(&mut contents).is_ok() {
+ // Extract text from slide XML
+ let text = self.extract_text_from_slide_xml(&contents);
+ if !text.is_empty() {
+ all_text.push(text);
+ }
+ }
+ }
+
+ // Also extract notes if present
+ for i in 0..archive.len() {
+ let mut file = match archive.by_index(i) {
+ Ok(file) => file,
+ Err(_) => continue,
+ };
+
+ let name = file.name();
+ // Match notes slide XML files
+ if name.starts_with("ppt/notesSlides/notesSlide") && name.ends_with(".xml") && !name.contains("_rels") {
+ let mut contents = String::new();
+ if file.read_to_string(&mut contents).is_ok() {
+ let text = self.extract_text_from_slide_xml(&contents);
+ if !text.is_empty() {
+ all_text.push(format!("[Notes: {}]", text));
+ }
+ }
+ }
+ }
+
+ if all_text.is_empty() {
+ Ok("".to_string())
+ } else {
+ Ok(all_text.join("\n\n"))
  }
  }
 
+ /// Helper method to extract text from slide XML
+ fn extract_text_from_slide_xml(&self, xml_content: &str) -> String {
+ use quick_xml::events::Event;
+ use quick_xml::Reader;
+
+ let mut reader = Reader::from_str(xml_content);
+
+ let mut text_parts = Vec::new();
+ let mut buf = Vec::new();
+ let mut in_text_element = false;
+
+ loop {
+ match reader.read_event_into(&mut buf) {
+ Ok(Event::Start(ref e)) => {
+ // Look for text elements (a:t or t)
+ let name = e.name();
+ let local_name_bytes = name.local_name();
+ let local_name = std::str::from_utf8(local_name_bytes.as_ref()).unwrap_or("");
+ if local_name == "t" {
+ in_text_element = true;
+ }
+ }
+ Ok(Event::Text(e)) => {
+ if in_text_element {
+ if let Ok(text) = e.decode() {
+ let text_str = text.trim();
+ if !text_str.is_empty() {
+ text_parts.push(text_str.to_string());
+ }
+ }
+ }
+ }
+ Ok(Event::End(ref e)) => {
+ let name = e.name();
+ let local_name_bytes = name.local_name();
+ let local_name = std::str::from_utf8(local_name_bytes.as_ref()).unwrap_or("");
+ if local_name == "t" {
+ in_text_element = false;
+ }
+ }
+ Ok(Event::Eof) => break,
+ _ => {}
+ }
+ buf.clear();
+ }
+
+ text_parts.join(" ")
+ }
+
  /// Parse Excel files - exposed to Ruby
  fn parse_xlsx(&self, data: Vec<u8>) -> Result<String, Error> {
  use calamine::{Reader, Xlsx};
  use std::io::Cursor;
-
+
  let cursor = Cursor::new(data);
  match Xlsx::new(cursor) {
  Ok(mut workbook) => {
  let mut result = String::new();
-
+
  for sheet_name in workbook.sheet_names().to_owned() {
  result.push_str(&format!("Sheet: {}\n", sheet_name));
-
+
  if let Ok(range) = workbook.worksheet_range(&sheet_name) {
  for row in range.rows() {
  for cell in row {
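Review note on `parse_pptx` in the hunk above: slide files are ordered by the number embedded in their archive path rather than by archive position or by name. Sorting the names as strings would put `slide10.xml` before `slide2.xml`, which is why the number is parsed first. The sketch below isolates that step with the standard library only; `slide_number` and `ordered_slides` are hypothetical helper names, not functions from the gem.

```rust
/// Extracts the slide number from an archive entry name such as
/// "ppt/slides/slide12.xml". Returns None for anything else, including
/// relationship files under "ppt/slides/_rels/".
fn slide_number(name: &str) -> Option<usize> {
    name.strip_prefix("ppt/slides/slide")
        .and_then(|rest| rest.strip_suffix(".xml"))
        .and_then(|digits| digits.parse::<usize>().ok())
}

/// Keeps only slide entries and returns them in numeric slide order.
fn ordered_slides<'a>(names: &[&'a str]) -> Vec<&'a str> {
    let mut numbered: Vec<(usize, &'a str)> = names
        .iter()
        .filter_map(|&name| slide_number(name).map(|num| (num, name)))
        .collect();

    // Numeric sort: slide2 comes before slide10.
    numbered.sort_by_key(|&(num, _)| num);

    numbered.into_iter().map(|(_, name)| name).collect()
}
```

In the hunk above, speaker notes are collected in a separate pass in archive order and appended after all slide text, so they are not interleaved with the slides they belong to.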
@@ -271,44 +489,46 @@ impl Parser {
  }
  result.push('\n');
  }
-
+
  Ok(result)
  }
  Err(e) => Err(Error::new(
- magnus::exception::runtime_error(),
+ Ruby::get().unwrap().exception_runtime_error(),
  format!("Failed to parse Excel file: {}", e),
- ))
+ )),
  }
  }
-
+
  /// Parse JSON files - exposed to Ruby
  fn parse_json(&self, data: Vec<u8>) -> Result<String, Error> {
  let text = String::from_utf8_lossy(&data);
  match serde_json::from_str::<serde_json::Value>(&text) {
- Ok(json) => Ok(serde_json::to_string_pretty(&json).unwrap_or_else(|_| text.to_string())),
+ Ok(json) => {
+ Ok(serde_json::to_string_pretty(&json).unwrap_or_else(|_| text.to_string()))
+ }
  Err(_) => Ok(text.to_string()),
  }
  }
-
+
  /// Parse XML/HTML files - exposed to Ruby
  fn parse_xml(&self, data: Vec<u8>) -> Result<String, Error> {
  use quick_xml::events::Event;
  use quick_xml::Reader;
-
+
  let mut reader = Reader::from_reader(&data[..]);
  let mut txt = String::new();
  let mut buf = Vec::new();
-
+
  loop {
  match reader.read_event_into(&mut buf) {
  Ok(Event::Text(e)) => {
- txt.push_str(&e.unescape().unwrap_or_default());
+ txt.push_str(&e.decode().unwrap_or_default());
  txt.push(' ');
  }
  Ok(Event::Eof) => break,
  Err(e) => {
  return Err(Error::new(
- magnus::exception::runtime_error(),
+ Ruby::get().unwrap().exception_runtime_error(),
  format!("XML parse error: {}", e),
  ))
  }
@@ -316,15 +536,15 @@ impl Parser {
  }
  buf.clear();
  }
-
+
  Ok(txt.trim().to_string())
  }
-
+
  /// Parse plain text with encoding detection - exposed to Ruby
  fn parse_text(&self, data: Vec<u8>) -> Result<String, Error> {
  // Detect encoding
  let (decoded, _encoding, malformed) = encoding_rs::UTF_8.decode(&data);
-
+
  if malformed {
  // Try other encodings
  let (decoded, _encoding, _malformed) = encoding_rs::WINDOWS_1252.decode(&data);
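Review note on `parse_text` in the hunk above: the function tries UTF-8 first and falls back to a single-byte encoding when the input is malformed. A rough standard-library sketch of the same try-then-fall-back shape is shown below. It is an approximation, not the gem's behavior: `decode_with_fallback` is a hypothetical name, and the fallback maps each byte to the code point of the same value (ISO-8859-1), which differs from Windows-1252 for bytes 0x80 through 0x9F.

```rust
/// Decodes bytes as UTF-8 when they are valid UTF-8; otherwise treats every
/// byte as the Unicode code point of the same value (ISO-8859-1).
/// Assumption: this stands in for the Windows-1252 fallback in the diff and
/// is not byte-for-byte equivalent to it.
fn decode_with_fallback(data: &[u8]) -> String {
    match std::str::from_utf8(data) {
        Ok(text) => text.to_string(),
        Err(_) => data.iter().map(|&byte| byte as char).collect(),
    }
}
```

The single-byte fallback cannot fail because every byte value maps to some character, which is why only the UTF-8 attempt needs an error path.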
@@ -333,16 +553,16 @@ impl Parser {
  Ok(decoded.to_string())
  }
  }
-
+
  /// Parse input string (for text content)
  fn parse(&self, input: String) -> Result<String, Error> {
  if input.is_empty() {
  return Err(Error::new(
- magnus::exception::arg_error(),
+ Ruby::get().unwrap().exception_arg_error(),
  "Input cannot be empty",
  ));
  }
-
+
  // For string input, just return cleaned text
  // If strict mode is on, append indicator for testing
  if self.config.strict_mode {
@@ -351,29 +571,33 @@ impl Parser {
  Ok(input.trim().to_string())
  }
  }
-
+
  /// Parse a file
  fn parse_file(&self, path: String) -> Result<String, Error> {
  use std::fs;
-
- let data = fs::read(&path)
- .map_err(|e| Error::new(magnus::exception::io_error(), format!("Failed to read file: {}", e)))?;
-
+
+ let data = fs::read(&path).map_err(|e| {
+ Error::new(
+ Ruby::get().unwrap().exception_io_error(),
+ format!("Failed to read file: {}", e),
+ )
+ })?;
+
  self.parse_bytes_internal(data, Some(&path))
  }
-
+
  /// Parse bytes from Ruby
  fn parse_bytes(&self, data: Vec<u8>) -> Result<String, Error> {
  if data.is_empty() {
  return Err(Error::new(
- magnus::exception::arg_error(),
+ Ruby::get().unwrap().exception_arg_error(),
  "Data cannot be empty",
  ));
  }
-
+
  self.parse_bytes_internal(data, None)
  }
-
+
  /// Get parser configuration
  fn config(&self) -> Result<RHash, Error> {
  let ruby = Ruby::get().unwrap();
@@ -384,12 +608,12 @@ impl Parser {
  hash.aset(ruby.to_symbol("max_size"), self.config.max_size)?;
  Ok(hash)
  }
-
+
  /// Check if parser is in strict mode
  fn strict_mode(&self) -> bool {
  self.config.strict_mode
  }
-
+
  /// Check supported file types
  fn supported_formats() -> Vec<String> {
  vec![
@@ -397,7 +621,10 @@ impl Parser {
  "json".to_string(),
  "xml".to_string(),
  "html".to_string(),
+ "htm".to_string(), // HTML files (alternative extension)
+ "md".to_string(), // Markdown files
  "docx".to_string(),
+ "pptx".to_string(),
  "xlsx".to_string(),
  "xls".to_string(),
  "csv".to_string(),
@@ -409,12 +636,12 @@ impl Parser {
  "bmp".to_string(), // OCR via Tesseract
  ]
  }
-
+
  /// Detect if file extension is supported
  fn supports_file(&self, path: String) -> bool {
  if let Some(ext) = std::path::Path::new(&path)
  .extension()
- .and_then(|s| s.to_str())
+ .and_then(|s| s.to_str())
  {
  Self::supported_formats().contains(&ext.to_lowercase())
  } else {
@@ -441,8 +668,8 @@ fn parse_bytes_direct(data: Vec<u8>) -> Result<String, Error> {
 
  /// Initialize the Parser class
  pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
- let class = module.define_class("Parser", class::object())?;
-
+ let class = module.define_class("Parser", Ruby::get().unwrap().class_object())?;
+
  // Instance methods
  class.define_singleton_method("new", function!(Parser::new, -1))?;
  class.define_method("parse", method!(Parser::parse, 1))?;
@@ -451,22 +678,23 @@ pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
  class.define_method("config", method!(Parser::config, 0))?;
  class.define_method("strict_mode?", method!(Parser::strict_mode, 0))?;
  class.define_method("supports_file?", method!(Parser::supports_file, 1))?;
-
+
  // Individual parser methods exposed to Ruby
  class.define_method("parse_pdf", method!(Parser::parse_pdf, 1))?;
  class.define_method("parse_docx", method!(Parser::parse_docx, 1))?;
+ class.define_method("parse_pptx", method!(Parser::parse_pptx, 1))?;
  class.define_method("parse_xlsx", method!(Parser::parse_xlsx, 1))?;
  class.define_method("parse_json", method!(Parser::parse_json, 1))?;
  class.define_method("parse_xml", method!(Parser::parse_xml, 1))?;
  class.define_method("parse_text", method!(Parser::parse_text, 1))?;
  class.define_method("ocr_image", method!(Parser::ocr_image, 1))?;
-
+
  // Class methods
  class.define_singleton_method("supported_formats", function!(Parser::supported_formats, 0))?;
-
+
  // Module-level convenience methods
  module.define_singleton_method("parse_file", function!(parse_file_direct, 1))?;
  module.define_singleton_method("parse_bytes", function!(parse_bytes_direct, 1))?;
-
+
  Ok(())
- }
+ }
Binary file
@@ -89,6 +89,7 @@ module ParseKit
 
  case ext.downcase
  when 'docx' then :docx
+ when 'pptx' then :pptx
  when 'xlsx', 'xls' then :xlsx
  when 'pdf' then :pdf
  when 'json' then :json
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module ParseKit
- VERSION = "0.1.0.pre.1"
+ VERSION = "0.1.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: parsekit
  version: !ruby/object:Gem::Version
- version: 0.1.0.pre.1
+ version: 0.1.0
  platform: ruby
  authors:
  - Chris Petersen
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2025-08-21 00:00:00.000000000 Z
+ date: 2025-09-06 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: rb_sys