parsekit 0.1.0.pre.1 → 0.1.0
- checksums.yaml +4 -4
- data/README.md +29 -17
- data/ext/parsekit/Cargo.toml +9 -7
- data/ext/parsekit/src/error.rs +7 -7
- data/ext/parsekit/src/parser.rs +317 -89
- data/lib/parsekit/parsekit.bundle +0 -0
- data/lib/parsekit/parser.rb +1 -0
- data/lib/parsekit/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 35c0708c088075c883b3b35c7d76f1573f29a19bf65ac0b89b636a5b76cee662
+  data.tar.gz: b1ddf9260329239c3a1e791f3ed3249b3577cb210f4c898677316fe55cc951f4
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2fe76f5b28927e3989502b0ea5f084f5bfc265aae9a65aaba47349e3e540e8150612d75f8f4ddcdc38be7edd9ae7edbf42220ba95b42a535dbc200503759c419
+  data.tar.gz: e5b9e8eff90f8583f8289bea5100ac43434978ebba814bf9198fb92cc622a9b4fa6e99e28fe2ed31ffa0040c3ac48a38c8361bc1994200059a23d040440a64cc
data/README.md
CHANGED
@@ -1,14 +1,13 @@
-
+<img src="/docs/assets/parsekit-wide.png" alt="parsekit" height="80px">
 
-[](https://github.com/cpetersen/parsekit/actions/workflows/ci.yml)
 [](https://badge.fury.io/rb/parsekit)
 [](https://opensource.org/licenses/MIT)
 
-Native Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX
+Native Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX), images (with OCR), and more. Part of the ruby-nlp ecosystem.
 
 ## Features
 
-- 📄 **Document Parsing**: Extract text from PDFs, Office documents (DOCX, XLSX
+- 📄 **Document Parsing**: Extract text from PDFs, Office documents (DOCX, XLSX)
 - 🖼️ **OCR Support**: Extract text from images using Tesseract OCR
 - 🚀 **High Performance**: Native Rust performance with Ruby convenience
 - 🔧 **Unified API**: Single interface for multiple document formats
@@ -38,13 +37,8 @@ gem install parsekit
 - Ruby >= 3.0.0
 - Rust toolchain (stable)
 - C compiler (for linking)
-- System libraries for document parsing:
-  - **macOS**: `brew install leptonica tesseract poppler`
-  - **Ubuntu/Debian**: `sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev`
-  - **Fedora/RHEL**: `sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel`
-  - **Windows**: See [DEPENDENCIES.md](DEPENDENCIES.md) for MSYS2 instructions
 
-
+That's it! ParseKit bundles all necessary libraries including Tesseract for OCR, so you don't need to install any system dependencies.
 
 ## Usage
 
@@ -57,10 +51,6 @@ require 'parsekit'
 text = ParseKit.parse_file("document.pdf")
 puts text # Extracted text from the PDF
 
-# Parse an Office document
-text = ParseKit.parse_file("presentation.pptx")
-puts text # Extracted text from all slides
-
 # Parse an Excel file
 text = ParseKit.parse_file("spreadsheet.xlsx")
 puts text # Extracted text from all sheets
@@ -131,7 +121,8 @@ excel_text = parser.parse_xlsx(excel_data)
 | PDF | .pdf | `parse_pdf` | Text extraction via MuPDF |
 | Word | .docx | `parse_docx` | Office Open XML format |
 | Excel | .xlsx, .xls | `parse_xlsx` | Both modern and legacy formats |
-|
+| PowerPoint | .pptx | `parse_pptx` | Text extraction from slides and notes |
+| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via bundled Tesseract |
 | JSON | .json | `parse_json` | Pretty-printed output |
 | XML/HTML | .xml, .html | `parse_xml` | Extracts text content |
 | Text | .txt, .csv, .md | `parse_text` | With encoding detection |
@@ -161,6 +152,27 @@ To run tests with coverage:
 rake dev:coverage
 ```
 
+### OCR Mode Configuration
+
+By default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode:
+
+**Using system Tesseract during installation:**
+```bash
+gem install parsekit -- --no-default-features
+```
+
+**For development with system Tesseract:**
+```bash
+rake compile CARGO_FEATURES="" # Disables bundled-tesseract feature
+```
+
+**System Tesseract requirements:**
+- **macOS**: `brew install tesseract`
+- **Ubuntu/Debian**: `sudo apt-get install libtesseract-dev`
+- **Fedora/RHEL**: `sudo dnf install tesseract-devel`
+
+The bundled mode adds ~1-3 minutes to initial gem installation but provides a completely self-contained experience with no external dependencies.
+
 ## Architecture
 
 ParseKit uses a hybrid Ruby/Rust architecture:
@@ -168,7 +180,7 @@ ParseKit uses a hybrid Ruby/Rust architecture:
 - **Ruby Layer**: Provides convenient API and format detection
 - **Rust Layer**: Implements high-performance parsing using:
   - MuPDF for PDF text extraction (statically linked)
-  -
+  - tesseract-rs for OCR (with bundled Tesseract by default)
   - Pure Rust libraries for DOCX/XLSX parsing
   - Magnus for Ruby-Rust FFI bindings
 
@@ -180,4 +192,4 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/cpeter
 
 The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
 
-Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.
+Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.
data/ext/parsekit/Cargo.toml
CHANGED
@@ -11,24 +11,26 @@ crate-type = ["cdylib"]
 name = "parsekit"
 
 [dependencies]
-magnus = { version = "0.
+magnus = { version = "0.8", features = ["rb-sys"] }
 # Document parsing - testing embedded C libraries
 # MuPDF builds from source and statically links
 mupdf = { version = "0.5", default-features = false, features = [] }
-# OCR -
-
+# OCR - Using tesseract-rs for both system and bundled modes
+tesseract-rs = "0.1" # Tesseract with optional bundling
 image = "0.25" # Image processing library (match rusty-tesseract's version)
-calamine = "0.
+calamine = "0.30" # Excel parsing
 docx-rs = "0.4" # Word document parsing
-quick-xml = "0.
+quick-xml = "0.38" # XML parsing
+zip = "2.1" # ZIP archive handling for PPTX
 serde_json = "1.0" # JSON parsing
 regex = "1.10" # Text parsing
 encoding_rs = "0.8" # Encoding detection
 
 [features]
-default = []
+default = ["bundled-tesseract"]
+bundled-tesseract = []
 
 [profile.release]
 opt-level = 3
 lto = true
-codegen-units = 1
+codegen-units = 1
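The `[features]` change above turns `bundled-tesseract` into a default feature, and `parser.rs` picks its initialization branch with `#[cfg(feature = "bundled-tesseract")]`. A minimal sketch of that compile-time gating follows. The function name `ocr_mode` is hypothetical and is not part of parsekit; it only illustrates how a build without the feature (for example `--no-default-features`) lands in the other branch.

```rust
/// Reports which OCR mode this build was compiled for.
/// `ocr_mode` is an illustrative name, not a parsekit API.
fn ocr_mode() -> &'static str {
    // `cfg!` is evaluated at compile time: the feature is either
    // enabled for this build or it is not.
    if cfg!(feature = "bundled-tesseract") {
        "bundled"
    } else {
        "system"
    }
}
```

Compiled as part of a crate that declares the feature, this would report `"bundled"` by default and `"system"` when default features are disabled.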
data/ext/parsekit/src/error.rs
CHANGED
@@ -1,4 +1,4 @@
-use magnus::{
+use magnus::{Error, RModule, Ruby, Module};
 
 /// Custom error types for ParseKit
 #[derive(Debug)]
@@ -15,13 +15,13 @@ impl ParserError {
     pub fn to_error(&self) -> Error {
         match self {
             ParserError::ParseError(msg) => {
-                Error::new(
+                Error::new(Ruby::get().unwrap().exception_runtime_error(), msg.clone())
             }
             ParserError::ConfigError(msg) => {
-                Error::new(
+                Error::new(Ruby::get().unwrap().exception_arg_error(), msg.clone())
             }
             ParserError::IoError(msg) => {
-                Error::new(
+                Error::new(Ruby::get().unwrap().exception_io_error(), msg.clone())
             }
         }
     }
@@ -37,9 +37,9 @@ pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
 
     // Define error classes as regular Ruby classes
     // Users can still rescue them by name in Ruby code
-    let _error = module.define_class("Error",
-    let _parse_error = module.define_class("ParseError",
-    let _config_error = module.define_class("ConfigError",
+    let _error = module.define_class("Error", Ruby::get().unwrap().class_object())?;
+    let _parse_error = module.define_class("ParseError", Ruby::get().unwrap().class_object())?;
+    let _config_error = module.define_class("ConfigError", Ruby::get().unwrap().class_object())?;
 
     Ok(())
 }
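The `to_error` change above maps each `ParserError` variant to a Ruby exception obtained from the `Ruby` handle. As a rough, std-only sketch of that mapping (no Magnus involved), the variant names below are taken from the diff, while `ruby_exception_name` and `message` are hypothetical helpers added for illustration.

```rust
/// Std-only mirror of the three variants handled by `to_error`.
#[derive(Debug)]
enum ParserError {
    ParseError(String),
    ConfigError(String),
    IoError(String),
}

impl ParserError {
    /// Ruby exception class each variant maps to in the diff
    /// (`exception_runtime_error`, `exception_arg_error`, `exception_io_error`).
    /// Hypothetical helper, not part of parsekit.
    fn ruby_exception_name(&self) -> &'static str {
        match self {
            ParserError::ParseError(_) => "RuntimeError",
            ParserError::ConfigError(_) => "ArgumentError",
            ParserError::IoError(_) => "IOError",
        }
    }

    /// Message carried by the variant. Hypothetical helper.
    fn message(&self) -> &str {
        match self {
            ParserError::ParseError(msg)
            | ParserError::ConfigError(msg)
            | ParserError::IoError(msg) => msg,
        }
    }
}
```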
data/ext/parsekit/src/parser.rs
CHANGED
@@ -1,5 +1,5 @@
 use magnus::{
-
+    function, method, prelude::*, scan_args, Error, Module, RHash, RModule, Ruby, Value,
 };
 use std::path::Path;
 
@@ -33,9 +33,9 @@ impl Parser {
     fn new(ruby: &Ruby, args: &[Value]) -> Result<Self, Error> {
         let args = scan_args::scan_args::<(), (Option<RHash>,), (), (), (), ()>(args)?;
         let options = args.optional.0;
-
+
         let mut config = ParserConfig::default();
-
+
         if let Some(opts) = options {
             if let Some(strict) = opts.get(ruby.to_symbol("strict_mode")) {
                 config.strict_mode = bool::try_convert(strict)?;
@@ -50,30 +50,35 @@ impl Parser {
                 config.max_size = usize::try_convert(max_size)?;
             }
         }
-
+
         Ok(Self { config })
     }
-
+
     /// Parse input bytes based on file type (internal helper)
     fn parse_bytes_internal(&self, data: Vec<u8>, filename: Option<&str>) -> Result<String, Error> {
         // Check size limit
         if data.len() > self.config.max_size {
             return Err(Error::new(
-
-                format!(
+                Ruby::get().unwrap().exception_runtime_error(),
+                format!(
+                    "File size {} exceeds maximum allowed size {}",
+                    data.len(),
+                    self.config.max_size
+                ),
             ));
         }
-
+
         // Detect file type from extension or content
         let file_type = if let Some(name) = filename {
             Self::detect_type_from_filename(name)
         } else {
             Self::detect_type_from_content(&data)
         };
-
+
         match file_type.as_str() {
             "pdf" => self.parse_pdf(data),
             "docx" => self.parse_docx(data),
+            "pptx" => self.parse_pptx(data),
             "xlsx" | "xls" => self.parse_xlsx(data),
             "json" => self.parse_json(data),
             "xml" | "html" => self.parse_xml(data),
@@ -82,7 +87,7 @@ impl Parser {
             _ => self.parse_text(data), // Default to text parsing
         }
     }
-
+
     /// Detect file type from filename extension
     fn detect_type_from_filename(filename: &str) -> String {
         let path = Path::new(filename);
@@ -91,7 +96,7 @@ impl Parser {
             None => "txt".to_string(),
         }
     }
-
+
     /// Detect file type from content (basic detection)
     fn detect_type_from_content(data: &[u8]) -> String {
         if data.starts_with(b"%PDF") {
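The two detection helpers above are only partly visible in these hunks, so the following is a sketch under stated assumptions rather than the file's actual code. It assumes the extension is lower-cased (as `supports_file` does elsewhere in this diff) and shows only the `%PDF` magic check that appears in the context lines; any other content checks in the real function are elided here.

```rust
use std::path::Path;

/// Returns the lower-cased extension, or "txt" when there is none.
/// The lower-casing is an assumption; the middle of the real function
/// is not shown in the diff.
fn detect_type_from_filename(filename: &str) -> String {
    match Path::new(filename).extension().and_then(|ext| ext.to_str()) {
        Some(ext) => ext.to_lowercase(),
        None => "txt".to_string(),
    }
}

/// Sniffs the leading bytes. Only the `%PDF` check is taken from the
/// diff; everything else falls back to plain text in this sketch.
fn detect_type_from_content(data: &[u8]) -> String {
    if data.starts_with(b"%PDF") {
        "pdf".to_string()
    } else {
        "txt".to_string()
    }
}
```

With these definitions, `detect_type_from_filename("Report.PDF")` yields `"pdf"`, and a name without an extension falls back to `"txt"`, which matches the `_ => self.parse_text(data)` default in the dispatcher.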
@@ -120,65 +125,138 @@ impl Parser {
             "txt".to_string()
         }
     }
-
+
     /// Perform OCR on image data using Tesseract
     fn ocr_image(&self, data: Vec<u8>) -> Result<String, Error> {
-        use
+        use tesseract_rs::TesseractAPI;
+
+        // Create tesseract instance
+        let tesseract = TesseractAPI::new();
 
-        //
+        // Try to initialize with appropriate tessdata path
+        // Even in bundled mode, we need to find tessdata files
+        #[cfg(feature = "bundled-tesseract")]
+        let init_result = {
+            // Build list of tessdata paths to try
+            let mut tessdata_paths = Vec::new();
+
+            // Check TESSDATA_PREFIX environment variable first (for CI)
+            if let Ok(env_path) = std::env::var("TESSDATA_PREFIX") {
+                tessdata_paths.push(env_path);
+            }
+
+            // Add common system paths
+            tessdata_paths.extend_from_slice(&[
+                "/usr/share/tessdata".to_string(),
+                "/usr/local/share/tessdata".to_string(),
+                "/opt/homebrew/share/tessdata".to_string(),
+                "/opt/local/share/tessdata".to_string(),
+                "tessdata".to_string(), // Local tessdata directory
+                ".".to_string(), // Current directory as fallback
+            ]);
+
+            let mut result = Err(tesseract_rs::TesseractError::InitError);
+            for path in &tessdata_paths {
+                // Check if path exists first to avoid noisy error messages
+                if std::path::Path::new(path).exists() {
+                    if tesseract.init(path.as_str(), "eng").is_ok() {
+                        result = Ok(());
+                        break;
+                    }
+                }
+            }
+            result
+        };
+
+        #[cfg(not(feature = "bundled-tesseract"))]
+        let init_result = {
+            // Try common system tessdata paths
+            let tessdata_paths = vec![
+                "/usr/share/tessdata",
+                "/usr/local/share/tessdata",
+                "/opt/homebrew/share/tessdata",
+                "/opt/local/share/tessdata",
+            ];
+
+            let mut result = Err(tesseract_rs::TesseractError::InitError);
+            for path in &tessdata_paths {
+                if std::path::Path::new(path).exists() {
+                    if tesseract.init(path, "eng").is_ok() {
+                        result = Ok(());
+                        break;
+                    }
+                }
+            }
+            result
+        };
+
+        if let Err(e) = init_result {
+            return Err(Error::new(
+                Ruby::get().unwrap().exception_runtime_error(),
+                format!("Failed to initialize Tesseract: {:?}", e),
+            ))
+        }
+
+        // Load the image from bytes
         let img = match image::load_from_memory(&data) {
             Ok(img) => img,
             Err(e) => return Err(Error::new(
-
+                Ruby::get().unwrap().exception_runtime_error(),
                 format!("Failed to load image: {}", e),
             ))
         };
 
-        //
-        let
-
-
-                magnus::exception::runtime_error(),
-                format!("Failed to convert image for OCR: {}", e),
-            ))
-        };
+        // Convert to RGBA8 format
+        let rgba_img = img.to_rgba8();
+        let (width, height) = rgba_img.dimensions();
+        let raw_data = rgba_img.into_raw();
 
-        // Set
-        let
-
-
-
-
+        // Set image data
+        if let Err(e) = tesseract.set_image(
+            &raw_data,
+            width as i32,
+            height as i32,
+            4, // bytes per pixel (RGBA)
+            (width * 4) as i32, // bytes per line
+        ) {
+            return Err(Error::new(
+                Ruby::get().unwrap().exception_runtime_error(),
+                format!("Failed to set image: {}", e),
+            ))
+        }
 
-        //
-        match
+        // Extract text
+        match tesseract.get_utf8_text() {
             Ok(text) => Ok(text.trim().to_string()),
             Err(e) => Err(Error::new(
-
+                Ruby::get().unwrap().exception_runtime_error(),
                 format!("Failed to perform OCR: {}", e),
-            ))
+            )),
         }
     }
 
+
     /// Parse PDF files using MuPDF (statically linked) - exposed to Ruby
     fn parse_pdf(&self, data: Vec<u8>) -> Result<String, Error> {
         use mupdf::Document;
-
+
         // Try to load the PDF from memory
         // The magic parameter helps MuPDF identify the file type
         match Document::from_bytes(&data, "pdf") {
             Ok(doc) => {
                 let mut all_text = String::new();
-
+
                 // Get page count - this returns a Result
                 let page_count = match doc.page_count() {
                     Ok(count) => count,
-                    Err(e) =>
-
-
-
+                    Err(e) => {
+                        return Err(Error::new(
+                            Ruby::get().unwrap().exception_runtime_error(),
+                            format!("Failed to get page count: {}", e),
+                        ))
+                    }
                 };
-
+
                 // Iterate through pages
                 for page_num in 0..page_count {
                     match doc.load_page(page_num) {
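The bundled-mode branch of `ocr_image` above builds a list of tessdata directories and initializes Tesseract with the first one that exists and accepts `init`. A std-only sketch of that search order follows. The helper names `tessdata_candidates` and `first_usable` are hypothetical, and the `try_init` closure stands in for the real `tesseract.init(path, "eng")` call so the example runs without the OCR crate.

```rust
use std::env;
use std::path::Path;

/// Candidate tessdata directories in the order the bundled branch tries them.
/// Hypothetical helper; the diff builds this list inline.
fn tessdata_candidates() -> Vec<String> {
    let mut paths = Vec::new();
    // TESSDATA_PREFIX, when set, is tried first.
    if let Ok(prefix) = env::var("TESSDATA_PREFIX") {
        paths.push(prefix);
    }
    for dir in [
        "/usr/share/tessdata",
        "/usr/local/share/tessdata",
        "/opt/homebrew/share/tessdata",
        "/opt/local/share/tessdata",
        "tessdata", // local tessdata directory
        ".",        // current directory as fallback
    ] {
        paths.push(dir.to_string());
    }
    paths
}

/// Returns the first candidate that exists on disk and for which
/// `try_init` succeeds; missing directories are skipped without a call.
fn first_usable<F>(candidates: &[String], mut try_init: F) -> Option<&str>
where
    F: FnMut(&str) -> bool,
{
    candidates
        .iter()
        .map(String::as_str)
        .find(|p| Path::new(*p).exists() && try_init(*p))
}
```

Checking `exists()` before calling the initializer mirrors the diff's stated goal of avoiding noisy error output for directories that are simply absent.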
@@ -195,28 +273,31 @@ impl Parser {
                         Err(_) => continue,
                     }
                 }
-
+
                 if all_text.is_empty() {
-                    Ok(
+                    Ok(
+                        "PDF contains no extractable text (might be scanned/image-based)"
+                            .to_string(),
+                    )
                 } else {
                     Ok(all_text.trim().to_string())
                 }
             }
             Err(e) => Err(Error::new(
-
+                Ruby::get().unwrap().exception_runtime_error(),
                 format!("Failed to parse PDF: {}", e),
-            ))
+            )),
         }
     }
-
+
     /// Parse DOCX (Word) files - exposed to Ruby
     fn parse_docx(&self, data: Vec<u8>) -> Result<String, Error> {
         use docx_rs::read_docx;
-
+
         match read_docx(&data) {
             Ok(docx) => {
                 let mut result = String::new();
-
+
                 // Extract text from all document children
                 // For simplicity, we'll focus on paragraphs only for now
                 // Tables require more complex handling with the current API
@@ -238,29 +319,166 @@ impl Parser {
                     // table.rows -> TableChild::TableRow -> row.cells -> TableRowChild
                     // which has a more complex structure in docx-rs
                 }
-
+
                 Ok(result.trim().to_string())
             }
             Err(e) => Err(Error::new(
-
+                Ruby::get().unwrap().exception_runtime_error(),
                 format!("Failed to parse DOCX file: {}", e),
-            ))
+            )),
+        }
+    }
+
+    /// Parse PPTX (PowerPoint) files - exposed to Ruby
+    fn parse_pptx(&self, data: Vec<u8>) -> Result<String, Error> {
+        use std::io::{Cursor, Read};
+        use zip::ZipArchive;
+
+        let cursor = Cursor::new(data);
+        let mut archive = match ZipArchive::new(cursor) {
+            Ok(archive) => archive,
+            Err(e) => {
+                return Err(Error::new(
+                    Ruby::get().unwrap().exception_runtime_error(),
+                    format!("Failed to open PPTX as ZIP: {}", e),
+                ))
+            }
+        };
+
+        let mut all_text = Vec::new();
+        let mut slide_numbers = Vec::new();
+
+        // First, collect slide numbers and sort them
+        for i in 0..archive.len() {
+            let file = match archive.by_index(i) {
+                Ok(file) => file,
+                Err(_) => continue,
+            };
+
+            let name = file.name();
+            // Match slide XML files (e.g., ppt/slides/slide1.xml)
+            if name.starts_with("ppt/slides/slide") && name.ends_with(".xml") && !name.contains("_rels") {
+                // Extract slide number from filename
+                if let Some(num_str) = name
+                    .strip_prefix("ppt/slides/slide")
+                    .and_then(|s| s.strip_suffix(".xml"))
+                {
+                    if let Ok(num) = num_str.parse::<usize>() {
+                        slide_numbers.push((num, i));
+                    }
+                }
+            }
+        }
+
+        // Sort by slide number to maintain order
+        slide_numbers.sort_by_key(|&(num, _)| num);
+
+        // Now process slides in order
+        for (_, index) in slide_numbers {
+            let mut file = match archive.by_index(index) {
+                Ok(file) => file,
+                Err(_) => continue,
+            };
+
+            let mut contents = String::new();
+            if file.read_to_string(&mut contents).is_ok() {
+                // Extract text from slide XML
+                let text = self.extract_text_from_slide_xml(&contents);
+                if !text.is_empty() {
+                    all_text.push(text);
+                }
+            }
+        }
+
+        // Also extract notes if present
+        for i in 0..archive.len() {
+            let mut file = match archive.by_index(i) {
+                Ok(file) => file,
+                Err(_) => continue,
+            };
+
+            let name = file.name();
+            // Match notes slide XML files
+            if name.starts_with("ppt/notesSlides/notesSlide") && name.ends_with(".xml") && !name.contains("_rels") {
+                let mut contents = String::new();
+                if file.read_to_string(&mut contents).is_ok() {
+                    let text = self.extract_text_from_slide_xml(&contents);
+                    if !text.is_empty() {
+                        all_text.push(format!("[Notes: {}]", text));
+                    }
+                }
+            }
+        }
+
+        if all_text.is_empty() {
+            Ok("".to_string())
+        } else {
+            Ok(all_text.join("\n\n"))
         }
     }
 
+    /// Helper method to extract text from slide XML
+    fn extract_text_from_slide_xml(&self, xml_content: &str) -> String {
+        use quick_xml::events::Event;
+        use quick_xml::Reader;
+
+        let mut reader = Reader::from_str(xml_content);
+
+        let mut text_parts = Vec::new();
+        let mut buf = Vec::new();
+        let mut in_text_element = false;
+
+        loop {
+            match reader.read_event_into(&mut buf) {
+                Ok(Event::Start(ref e)) => {
+                    // Look for text elements (a:t or t)
+                    let name = e.name();
+                    let local_name_bytes = name.local_name();
+                    let local_name = std::str::from_utf8(local_name_bytes.as_ref()).unwrap_or("");
+                    if local_name == "t" {
+                        in_text_element = true;
+                    }
+                }
+                Ok(Event::Text(e)) => {
+                    if in_text_element {
+                        if let Ok(text) = e.decode() {
+                            let text_str = text.trim();
+                            if !text_str.is_empty() {
+                                text_parts.push(text_str.to_string());
+                            }
+                        }
+                    }
+                }
+                Ok(Event::End(ref e)) => {
+                    let name = e.name();
+                    let local_name_bytes = name.local_name();
+                    let local_name = std::str::from_utf8(local_name_bytes.as_ref()).unwrap_or("");
+                    if local_name == "t" {
+                        in_text_element = false;
+                    }
+                }
+                Ok(Event::Eof) => break,
+                _ => {}
+            }
+            buf.clear();
+        }
+
+        text_parts.join(" ")
+    }
+
     /// Parse Excel files - exposed to Ruby
     fn parse_xlsx(&self, data: Vec<u8>) -> Result<String, Error> {
         use calamine::{Reader, Xlsx};
         use std::io::Cursor;
-
+
         let cursor = Cursor::new(data);
         match Xlsx::new(cursor) {
             Ok(mut workbook) => {
                 let mut result = String::new();
-
+
                 for sheet_name in workbook.sheet_names().to_owned() {
                     result.push_str(&format!("Sheet: {}\n", sheet_name));
-
+
                     if let Ok(range) = workbook.worksheet_range(&sheet_name) {
                         for row in range.rows() {
                             for cell in row {
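`parse_pptx` above orders slides by the number embedded in each archive entry name rather than by archive position. The std-only sketch below isolates that step: sorting the names as strings would place `slide10.xml` before `slide2.xml`, so the number is parsed and used as the sort key. The function names `slide_number` and `ordered_slide_indices` are hypothetical and the entry names are passed in as a plain slice instead of being read from a ZIP archive.

```rust
/// Extracts N from `ppt/slides/slideN.xml`; returns None for any other entry,
/// including relationship files under `ppt/slides/_rels/`.
fn slide_number(name: &str) -> Option<usize> {
    name.strip_prefix("ppt/slides/slide")?
        .strip_suffix(".xml")?
        .parse::<usize>()
        .ok()
}

/// Returns the archive indices of slide entries, sorted by slide number.
fn ordered_slide_indices(names: &[&str]) -> Vec<usize> {
    let mut slides: Vec<(usize, usize)> = names
        .iter()
        .enumerate()
        .filter_map(|(index, name)| slide_number(name).map(|num| (num, index)))
        .collect();
    // Numeric sort keeps slide2 ahead of slide10.
    slides.sort_by_key(|&(num, _)| num);
    slides.into_iter().map(|(_, index)| index).collect()
}
```

For the entries `slide10.xml`, a `_rels` file, `slide2.xml`, and `slide1.xml` (in that archive order), the returned indices are `[3, 2, 0]`.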
@@ -271,44 +489,46 @@ impl Parser {
                     }
                     result.push('\n');
                 }
-
+
                 Ok(result)
             }
             Err(e) => Err(Error::new(
-
+                Ruby::get().unwrap().exception_runtime_error(),
                 format!("Failed to parse Excel file: {}", e),
-            ))
+            )),
         }
     }
-
+
     /// Parse JSON files - exposed to Ruby
     fn parse_json(&self, data: Vec<u8>) -> Result<String, Error> {
         let text = String::from_utf8_lossy(&data);
         match serde_json::from_str::<serde_json::Value>(&text) {
-            Ok(json) =>
+            Ok(json) => {
+                Ok(serde_json::to_string_pretty(&json).unwrap_or_else(|_| text.to_string()))
+            }
             Err(_) => Ok(text.to_string()),
         }
     }
-
+
     /// Parse XML/HTML files - exposed to Ruby
     fn parse_xml(&self, data: Vec<u8>) -> Result<String, Error> {
         use quick_xml::events::Event;
         use quick_xml::Reader;
-
+
         let mut reader = Reader::from_reader(&data[..]);
         let mut txt = String::new();
         let mut buf = Vec::new();
-
+
         loop {
             match reader.read_event_into(&mut buf) {
                 Ok(Event::Text(e)) => {
-                    txt.push_str(&e.
+                    txt.push_str(&e.decode().unwrap_or_default());
                     txt.push(' ');
                 }
                 Ok(Event::Eof) => break,
                 Err(e) => {
                     return Err(Error::new(
-
+                        Ruby::get().unwrap().exception_runtime_error(),
                         format!("XML parse error: {}", e),
                     ))
                 }
@@ -316,15 +536,15 @@ impl Parser {
             }
             buf.clear();
         }
-
+
         Ok(txt.trim().to_string())
     }
-
+
     /// Parse plain text with encoding detection - exposed to Ruby
     fn parse_text(&self, data: Vec<u8>) -> Result<String, Error> {
         // Detect encoding
         let (decoded, _encoding, malformed) = encoding_rs::UTF_8.decode(&data);
-
+
         if malformed {
             // Try other encodings
             let (decoded, _encoding, _malformed) = encoding_rs::WINDOWS_1252.decode(&data);
@@ -333,16 +553,16 @@ impl Parser {
             Ok(decoded.to_string())
         }
     }
-
+
     /// Parse input string (for text content)
     fn parse(&self, input: String) -> Result<String, Error> {
         if input.is_empty() {
             return Err(Error::new(
-
+                Ruby::get().unwrap().exception_arg_error(),
                 "Input cannot be empty",
             ));
         }
-
+
         // For string input, just return cleaned text
         // If strict mode is on, append indicator for testing
         if self.config.strict_mode {
@@ -351,29 +571,33 @@ impl Parser {
             Ok(input.trim().to_string())
         }
     }
-
+
     /// Parse a file
     fn parse_file(&self, path: String) -> Result<String, Error> {
         use std::fs;
-
-        let data = fs::read(&path)
-
-
+
+        let data = fs::read(&path).map_err(|e| {
+            Error::new(
+                Ruby::get().unwrap().exception_io_error(),
+                format!("Failed to read file: {}", e),
+            )
+        })?;
+
         self.parse_bytes_internal(data, Some(&path))
     }
-
+
     /// Parse bytes from Ruby
     fn parse_bytes(&self, data: Vec<u8>) -> Result<String, Error> {
         if data.is_empty() {
             return Err(Error::new(
-
+                Ruby::get().unwrap().exception_arg_error(),
                 "Data cannot be empty",
             ));
         }
-
+
         self.parse_bytes_internal(data, None)
     }
-
+
     /// Get parser configuration
     fn config(&self) -> Result<RHash, Error> {
         let ruby = Ruby::get().unwrap();
@@ -384,12 +608,12 @@ impl Parser {
         hash.aset(ruby.to_symbol("max_size"), self.config.max_size)?;
         Ok(hash)
     }
-
+
     /// Check if parser is in strict mode
     fn strict_mode(&self) -> bool {
         self.config.strict_mode
     }
-
+
     /// Check supported file types
     fn supported_formats() -> Vec<String> {
         vec![
@@ -397,7 +621,10 @@ impl Parser {
             "json".to_string(),
             "xml".to_string(),
             "html".to_string(),
+            "htm".to_string(), // HTML files (alternative extension)
+            "md".to_string(), // Markdown files
             "docx".to_string(),
+            "pptx".to_string(),
             "xlsx".to_string(),
             "xls".to_string(),
             "csv".to_string(),
@@ -409,12 +636,12 @@ impl Parser {
             "bmp".to_string(), // OCR via Tesseract
         ]
     }
-
+
     /// Detect if file extension is supported
     fn supports_file(&self, path: String) -> bool {
         if let Some(ext) = std::path::Path::new(&path)
             .extension()
-            .and_then(|s| s.to_str())
+            .and_then(|s| s.to_str())
         {
             Self::supported_formats().contains(&ext.to_lowercase())
         } else {
@@ -441,8 +668,8 @@ fn parse_bytes_direct(data: Vec<u8>) -> Result<String, Error> {
 
 /// Initialize the Parser class
 pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
-    let class = module.define_class("Parser",
-
+    let class = module.define_class("Parser", Ruby::get().unwrap().class_object())?;
+
     // Instance methods
     class.define_singleton_method("new", function!(Parser::new, -1))?;
     class.define_method("parse", method!(Parser::parse, 1))?;
@@ -451,22 +678,23 @@ pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
     class.define_method("config", method!(Parser::config, 0))?;
     class.define_method("strict_mode?", method!(Parser::strict_mode, 0))?;
     class.define_method("supports_file?", method!(Parser::supports_file, 1))?;
-
+
     // Individual parser methods exposed to Ruby
     class.define_method("parse_pdf", method!(Parser::parse_pdf, 1))?;
     class.define_method("parse_docx", method!(Parser::parse_docx, 1))?;
+    class.define_method("parse_pptx", method!(Parser::parse_pptx, 1))?;
     class.define_method("parse_xlsx", method!(Parser::parse_xlsx, 1))?;
     class.define_method("parse_json", method!(Parser::parse_json, 1))?;
     class.define_method("parse_xml", method!(Parser::parse_xml, 1))?;
     class.define_method("parse_text", method!(Parser::parse_text, 1))?;
     class.define_method("ocr_image", method!(Parser::ocr_image, 1))?;
-
+
     // Class methods
     class.define_singleton_method("supported_formats", function!(Parser::supported_formats, 0))?;
-
+
     // Module-level convenience methods
     module.define_singleton_method("parse_file", function!(parse_file_direct, 1))?;
     module.define_singleton_method("parse_bytes", function!(parse_bytes_direct, 1))?;
-
+
     Ok(())
-}
+}
data/lib/parsekit/parsekit.bundle
CHANGED
Binary file
data/lib/parsekit/parser.rb
CHANGED
data/lib/parsekit/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: parsekit
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.0
 platform: ruby
 authors:
 - Chris Petersen
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-
+date: 2025-09-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rb_sys