parsekit 0.1.0.pre.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/CHANGELOG.md +53 -0
- data/LICENSE.txt +21 -0
- data/README.md +183 -0
- data/ext/parsekit/Cargo.toml +34 -0
- data/ext/parsekit/extconf.rb +6 -0
- data/ext/parsekit/src/error.rs +45 -0
- data/ext/parsekit/src/lib.rs +24 -0
- data/ext/parsekit/src/parser.rs +472 -0
- data/lib/parsekit/error.rb +15 -0
- data/lib/parsekit/parsekit.bundle +0 -0
- data/lib/parsekit/parser.rb +201 -0
- data/lib/parsekit/version.rb +5 -0
- data/lib/parsekit.rb +61 -0
- metadata +132 -0
checksums.yaml
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
1
|
+
---
|
|
2
|
+
SHA256:
|
|
3
|
+
metadata.gz: 02b091ecd1da29c68d59afb1089f1756cef350252b260b531ef82a06fb163c65
|
|
4
|
+
data.tar.gz: e34663b8f849a907ede07b357ad3c5b21a614c16ea767fb5b735c3422bd66aa7
|
|
5
|
+
SHA512:
|
|
6
|
+
metadata.gz: b476aad0a9c9a711fce10d3a22dedd64e6ac82597c1d5d501d3ced7a46982d8f65b5bf44b513c3daabc5c5115a4b6278a0bea911b4dc9b1667010467e1cad8c9
|
|
7
|
+
data.tar.gz: f1d2adeb0bf8199b5ce397537b8b40577a79e954dddf33f4e4ff2fe418791cddb2f41adf79654c05c9d37e6ef0b1e99526f390560b5321feb711982d2218372d
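
The digests listed above can be checked locally against the members extracted from the `.gem` archive. A minimal sketch using Ruby's standard `digest` library; the helper name `checksum_matches?` and the example path are illustrative assumptions, not part of the package:

```ruby
require "digest"

# Hypothetical helper: compare a file's SHA256 digest with an expected hex string.
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Example, assuming data.tar.gz has been extracted into the current directory:
# checksum_matches?(
#   "data.tar.gz",
#   "e34663b8f849a907ede07b357ad3c5b21a614c16ea767fb5b735c3422bd66aa7"
# )
```

The same pattern works for the SHA512 entries by swapping in `Digest::SHA512`.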
|
data/CHANGELOG.md
ADDED
|
@@ -0,0 +1,53 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
All notable changes to this project will be documented in this file.
|
|
4
|
+
|
|
5
|
+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
6
|
+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
7
|
+
|
|
8
|
+
## [Unreleased]
|
|
9
|
+
|
|
10
|
+
### Added
|
|
11
|
+
- Nothing yet
|
|
12
|
+
|
|
13
|
+
### Changed
|
|
14
|
+
- Nothing yet
|
|
15
|
+
|
|
16
|
+
### Deprecated
|
|
17
|
+
- Nothing yet
|
|
18
|
+
|
|
19
|
+
### Removed
|
|
20
|
+
- Nothing yet
|
|
21
|
+
|
|
22
|
+
### Fixed
|
|
23
|
+
- Nothing yet
|
|
24
|
+
|
|
25
|
+
### Security
|
|
26
|
+
- Nothing yet
|
|
27
|
+
|
|
28
|
+
## [0.1.0] - 2024-08-09
|
|
29
|
+
|
|
30
|
+
### Added
|
|
31
|
+
- Initial release of parsekit
|
|
32
|
+
- Basic parser functionality with Ruby bindings via Magnus
|
|
33
|
+
- Support for parsing strings and files
|
|
34
|
+
- Configurable parser with options (strict_mode, max_depth, encoding)
|
|
35
|
+
- Parser class with instance methods
|
|
36
|
+
- Module-level convenience methods
|
|
37
|
+
- Error handling with custom error classes
|
|
38
|
+
- Thread-safe parsing operations
|
|
39
|
+
- Cross-platform support (Linux, macOS, Windows)
|
|
40
|
+
- Ruby 3.0+ support
|
|
41
|
+
- Comprehensive test suite with RSpec
|
|
42
|
+
- CI/CD with GitHub Actions
|
|
43
|
+
- Documentation and examples
|
|
44
|
+
- Integration with ruby-nlp ecosystem
|
|
45
|
+
|
|
46
|
+
### Technical Details
|
|
47
|
+
- Built with Magnus 0.7 for Ruby-Rust bindings
|
|
48
|
+
- Uses rb_sys 0.9 for build system integration
|
|
49
|
+
- Rust edition 2021
|
|
50
|
+
- Cross-compilation support for multiple platforms
|
|
51
|
+
|
|
52
|
+
[Unreleased]: https://github.com/cpetersen/parsekit/compare/v0.1.0...HEAD
|
|
53
|
+
[0.1.0]: https://github.com/cpetersen/parsekit/releases/tag/v0.1.0
|
data/LICENSE.txt
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
The MIT License (MIT)
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2024 Your Name
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in
|
|
13
|
+
all copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
|
21
|
+
THE SOFTWARE.
|
data/README.md
ADDED
|
@@ -0,0 +1,183 @@
|
|
|
1
|
+
# ParseKit
|
|
2
|
+
|
|
3
|
+
[CI](https://github.com/cpetersen/parsekit/actions/workflows/ci.yml)
|
|
4
|
+
[Gem Version](https://badge.fury.io/rb/parsekit)
|
|
5
|
+
[License: MIT](https://opensource.org/licenses/MIT)
|
|
6
|
+
|
|
7
|
+
Native Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX, PPTX), images (with OCR), and more. Part of the ruby-nlp ecosystem.
|
|
8
|
+
|
|
9
|
+
## Features
|
|
10
|
+
|
|
11
|
+
- 📄 **Document Parsing**: Extract text from PDFs, Office documents (DOCX, XLSX, PPTX)
|
|
12
|
+
- 🖼️ **OCR Support**: Extract text from images using Tesseract OCR
|
|
13
|
+
- 🚀 **High Performance**: Native Rust performance with Ruby convenience
|
|
14
|
+
- 🔧 **Unified API**: Single interface for multiple document formats
|
|
15
|
+
- 📦 **Cross-Platform**: Works on Linux, macOS, and Windows
|
|
16
|
+
- 🧪 **Well Tested**: Comprehensive test suite with RSpec
|
|
17
|
+
|
|
18
|
+
## Installation
|
|
19
|
+
|
|
20
|
+
Add this line to your application's Gemfile:
|
|
21
|
+
|
|
22
|
+
```ruby
|
|
23
|
+
gem 'parsekit'
|
|
24
|
+
```
|
|
25
|
+
|
|
26
|
+
And then execute:
|
|
27
|
+
|
|
28
|
+
$ bundle install
|
|
29
|
+
|
|
30
|
+
Or install it yourself as:
|
|
31
|
+
|
|
32
|
+
```bash
|
|
33
|
+
gem install parsekit
|
|
34
|
+
```
|
|
35
|
+
|
|
36
|
+
### Requirements
|
|
37
|
+
|
|
38
|
+
- Ruby >= 3.0.0
|
|
39
|
+
- Rust toolchain (stable)
|
|
40
|
+
- C compiler (for linking)
|
|
41
|
+
- System libraries for document parsing:
|
|
42
|
+
- **macOS**: `brew install leptonica tesseract poppler`
|
|
43
|
+
- **Ubuntu/Debian**: `sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev`
|
|
44
|
+
- **Fedora/RHEL**: `sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel`
|
|
45
|
+
- **Windows**: See [DEPENDENCIES.md](DEPENDENCIES.md) for MSYS2 instructions
|
|
46
|
+
|
|
47
|
+
For detailed installation instructions and troubleshooting, see [DEPENDENCIES.md](DEPENDENCIES.md).
|
|
48
|
+
|
|
49
|
+
## Usage
|
|
50
|
+
|
|
51
|
+
### Basic Usage
|
|
52
|
+
|
|
53
|
+
```ruby
|
|
54
|
+
require 'parsekit'
|
|
55
|
+
|
|
56
|
+
# Parse a PDF file
|
|
57
|
+
text = ParseKit.parse_file("document.pdf")
|
|
58
|
+
puts text # Extracted text from the PDF
|
|
59
|
+
|
|
60
|
+
# Parse an Office document
|
|
61
|
+
text = ParseKit.parse_file("presentation.pptx")
|
|
62
|
+
puts text # Extracted text from all slides
|
|
63
|
+
|
|
64
|
+
# Parse an Excel file
|
|
65
|
+
text = ParseKit.parse_file("spreadsheet.xlsx")
|
|
66
|
+
puts text # Extracted text from all sheets
|
|
67
|
+
|
|
68
|
+
# Parse binary data directly
|
|
69
|
+
file_data = File.binread("document.pdf")
|
|
70
|
+
text = ParseKit.parse_bytes(file_data)
|
|
71
|
+
puts text
|
|
72
|
+
|
|
73
|
+
# Parse with a Parser instance
|
|
74
|
+
parser = ParseKit::Parser.new
|
|
75
|
+
text = parser.parse_file("report.docx")
|
|
76
|
+
puts text
|
|
77
|
+
```
|
|
78
|
+
|
|
79
|
+
### Module-Level Convenience Methods
|
|
80
|
+
|
|
81
|
+
```ruby
|
|
82
|
+
# Parse files directly
|
|
83
|
+
content = ParseKit.parse_file('document.pdf')
|
|
84
|
+
|
|
85
|
+
# Parse bytes
|
|
86
|
+
data = File.read('document.pdf', mode: 'rb')
|
|
87
|
+
content = ParseKit.parse_bytes(data.bytes)
|
|
88
|
+
|
|
89
|
+
# Check supported formats
|
|
90
|
+
formats = ParseKit.supported_formats
|
|
91
|
+
# => ["txt", "json", "xml", "html", "docx", "xlsx", "xls", "csv", "pdf", "png", "jpg", "jpeg", "tiff", "bmp"]
|
|
92
|
+
|
|
93
|
+
# Check if a file is supported
|
|
94
|
+
ParseKit.supports_file?('document.pdf') # => true
|
|
95
|
+
```
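
The extension check behind `supports_file?` can be approximated in plain Ruby. This is a sketch only: the method names `supported_extensions` and `extension_supported?` are hypothetical, and the list is copied from the `supported_formats` output shown above.

```ruby
# Extensions copied from the supported_formats list shown above.
def supported_extensions
  %w[txt json xml html docx xlsx xls csv pdf png jpg jpeg tiff bmp]
end

# Hypothetical stand-in for ParseKit.supports_file?: a case-insensitive extension check.
def extension_supported?(path)
  ext = File.extname(path).delete_prefix(".").downcase
  !ext.empty? && supported_extensions.include?(ext)
end

extension_supported?("document.PDF") # => true
extension_supported?("slides.pptx")  # => false (pptx is not in the list above)
extension_supported?("README")       # => false (no extension)
```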
|
|
96
|
+
|
|
97
|
+
### Configuration Options
|
|
98
|
+
|
|
99
|
+
```ruby
|
|
100
|
+
# Create parser with options
|
|
101
|
+
parser = ParseKit::Parser.new(
|
|
102
|
+
strict_mode: true,
|
|
103
|
+
max_size: 50 * 1024 * 1024, # 50MB limit
|
|
104
|
+
encoding: 'UTF-8'
|
|
105
|
+
)
|
|
106
|
+
|
|
107
|
+
# Or use the strict convenience method
|
|
108
|
+
parser = ParseKit::Parser.strict
|
|
109
|
+
```
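
To illustrate how options such as these typically combine with defaults, here is a minimal sketch. The default values mirror `ParserConfig::default` in `parser.rs` from this diff; the helper names and the rejection of unknown keys are assumptions made for the example, not the gem's documented behavior.

```ruby
# Defaults mirroring ParserConfig::default in parser.rs (non-strict, UTF-8, 100 MB limit).
def default_parser_options
  { strict_mode: false, max_depth: 100, encoding: "UTF-8", max_size: 100 * 1024 * 1024 }
end

# Hypothetical helper: merge caller options over the defaults.
# Rejecting unknown keys is a choice made for this sketch only.
def build_parser_options(**overrides)
  defaults = default_parser_options
  unknown = overrides.keys - defaults.keys
  raise ArgumentError, "unknown option(s): #{unknown.join(', ')}" unless unknown.empty?

  defaults.merge(overrides)
end

options = build_parser_options(strict_mode: true, max_size: 50 * 1024 * 1024)
puts options.inspect
```

Options that are not overridden keep their defaults, so `encoding` stays `"UTF-8"` and `max_depth` stays `100` in the call above.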
|
|
110
|
+
|
|
111
|
+
### Format-Specific Parsing
|
|
112
|
+
|
|
113
|
+
```ruby
|
|
114
|
+
parser = ParseKit::Parser.new
|
|
115
|
+
|
|
116
|
+
# Direct access to format-specific parsers
|
|
117
|
+
pdf_data = File.read('document.pdf', mode: 'rb').bytes
|
|
118
|
+
pdf_text = parser.parse_pdf(pdf_data)
|
|
119
|
+
|
|
120
|
+
image_data = File.read('image.png', mode: 'rb').bytes
|
|
121
|
+
ocr_text = parser.ocr_image(image_data)
|
|
122
|
+
|
|
123
|
+
excel_data = File.read('data.xlsx', mode: 'rb').bytes
|
|
124
|
+
excel_text = parser.parse_xlsx(excel_data)
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
## Supported Formats
|
|
128
|
+
|
|
129
|
+
| Format | Extensions | Method | Notes |
|
|
130
|
+
|--------|------------|--------|-------|
|
|
131
|
+
| PDF | .pdf | `parse_pdf` | Text extraction via MuPDF |
|
|
132
|
+
| Word | .docx | `parse_docx` | Office Open XML format |
|
|
133
|
+
| Excel | .xlsx, .xls | `parse_xlsx` | Both modern and legacy formats |
|
|
134
|
+
| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via embedded Tesseract |
|
|
135
|
+
| JSON | .json | `parse_json` | Pretty-printed output |
|
|
136
|
+
| XML/HTML | .xml, .html | `parse_xml` | Extracts text content |
|
|
137
|
+
| Text | .txt, .csv, .md | `parse_text` | With encoding detection |
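
The table above amounts to a lookup from file extension to parser method. A minimal sketch of that lookup follows; `parser_method_for` is a hypothetical helper, and the fallback to `:parse_text` for unknown extensions follows the default branch in `parser.rs`.

```ruby
# Extension -> parser method, transcribed from the table above.
def parser_method_for(path)
  table = {
    "pdf"  => :parse_pdf,
    "docx" => :parse_docx,
    "xlsx" => :parse_xlsx, "xls" => :parse_xlsx,
    "png"  => :ocr_image, "jpg" => :ocr_image, "jpeg" => :ocr_image,
    "tiff" => :ocr_image, "bmp" => :ocr_image,
    "json" => :parse_json,
    "xml"  => :parse_xml, "html" => :parse_xml,
    "txt"  => :parse_text, "csv" => :parse_text, "md" => :parse_text
  }
  ext = File.extname(path).delete_prefix(".").downcase
  table.fetch(ext, :parse_text) # unknown extensions fall back to plain text
end

parser_method_for("report.DOCX") # => :parse_docx
parser_method_for("scan.jpeg")   # => :ocr_image
parser_method_for("notes.rst")   # => :parse_text
```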
|
|
138
|
+
|
|
139
|
+
## Performance
|
|
140
|
+
|
|
141
|
+
ParseKit is built with performance in mind:
|
|
142
|
+
|
|
143
|
+
- Native Rust implementation for speed
|
|
144
|
+
- Statically linked C libraries (MuPDF, Tesseract) compiled with optimizations
|
|
145
|
+
- Efficient memory usage with streaming where possible
|
|
146
|
+
- Configurable size limits to prevent memory issues
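
The size limit mentioned in the last bullet can be pictured as a guard that runs before any parsing. A minimal sketch, assuming a hypothetical helper `ensure_within_limit!`; the message format and the 100 MB default follow `parse_bytes_internal` and `ParserConfig::default` in `parser.rs`:

```ruby
# Hypothetical guard mirroring the size check in parse_bytes_internal.
def ensure_within_limit!(data, max_size: 100 * 1024 * 1024)
  size = data.bytesize
  return data if size <= max_size

  raise "File size #{size} exceeds maximum allowed size #{max_size}"
end

ensure_within_limit!("small payload")     # returns the data unchanged
# ensure_within_limit!("abc", max_size: 2) # raises RuntimeError
```

When reading from disk, comparing `File.size(path)` against the limit before calling `File.binread` would avoid loading an oversized file at all.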
|
|
147
|
+
|
|
148
|
+
## Development
|
|
149
|
+
|
|
150
|
+
After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests.
|
|
151
|
+
|
|
152
|
+
To compile the Rust extension:
|
|
153
|
+
|
|
154
|
+
```bash
|
|
155
|
+
rake compile
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
To run tests with coverage:
|
|
159
|
+
|
|
160
|
+
```bash
|
|
161
|
+
rake dev:coverage
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
## Architecture
|
|
165
|
+
|
|
166
|
+
ParseKit uses a hybrid Ruby/Rust architecture:
|
|
167
|
+
|
|
168
|
+
- **Ruby Layer**: Provides convenient API and format detection
|
|
169
|
+
- **Rust Layer**: Implements high-performance parsing using:
|
|
170
|
+
- MuPDF for PDF text extraction (statically linked)
|
|
171
|
+
- rusty-tesseract for OCR (with embedded Tesseract)
|
|
172
|
+
- Pure Rust libraries for DOCX/XLSX parsing
|
|
173
|
+
- Magnus for Ruby-Rust FFI bindings
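
Format detection by content relies on well-known file signatures. In this release the sniffing itself lives in the Rust extension (`detect_type_from_content` in `parser.rs`); the Ruby port below is illustrative only, and `sniff_format` is a hypothetical name.

```ruby
# Hypothetical Ruby port of the extension's content sniffing.
def sniff_format(data)
  bytes = data.b # binary copy so that comparisons are byte-wise
  return "pdf"  if bytes.start_with?("%PDF")
  return "xlsx" if bytes.start_with?("PK") # ZIP container; could equally be DOCX
  return "xls"  if bytes.start_with?("\xD0\xCF\x11\xE0".b) # OLE2 compound file
  return "png"  if bytes.start_with?("\x89PNG".b)
  return "jpg"  if bytes.start_with?("\xFF\xD8\xFF".b)
  return "bmp"  if bytes.start_with?("BM")
  return "tiff" if bytes.start_with?("II\x2A\x00".b, "MM\x00\x2A".b)
  return "xml"  if bytes.start_with?("<?xml", "<html")
  return "json" if bytes.start_with?("{", "[")

  "txt"
end

sniff_format("%PDF-1.7")  # => "pdf"
sniff_format('{"a": 1}')  # => "json"
sniff_format("plain text") # => "txt"
```

As in the Rust code, a ZIP signature alone cannot distinguish DOCX from XLSX, so passing a filename remains the more reliable route for Office documents.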
|
|
174
|
+
|
|
175
|
+
## Contributing
|
|
176
|
+
|
|
177
|
+
Bug reports and pull requests are welcome on GitHub at https://github.com/cpetersen/parsekit.
|
|
178
|
+
|
|
179
|
+
## License
|
|
180
|
+
|
|
181
|
+
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
|
|
182
|
+
|
|
183
|
+
Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.
|
data/ext/parsekit/Cargo.toml
ADDED
|
@@ -0,0 +1,34 @@
|
|
|
1
|
+
[package]
|
|
2
|
+
name = "parsekit"
|
|
3
|
+
version = "0.1.0"
|
|
4
|
+
edition = "2021"
|
|
5
|
+
authors = ["Your Name <your.email@example.com>"]
|
|
6
|
+
license = "MIT"
|
|
7
|
+
publish = false
|
|
8
|
+
|
|
9
|
+
[lib]
|
|
10
|
+
crate-type = ["cdylib"]
|
|
11
|
+
name = "parsekit"
|
|
12
|
+
|
|
13
|
+
[dependencies]
|
|
14
|
+
magnus = { version = "0.7", features = ["rb-sys"] }
|
|
15
|
+
# Document parsing - testing embedded C libraries
|
|
16
|
+
# MuPDF builds from source and statically links
|
|
17
|
+
mupdf = { version = "0.5", default-features = false, features = [] }
|
|
18
|
+
# OCR - Tesseract with image loading support
|
|
19
|
+
rusty-tesseract = "1.1" # Tesseract wrapper with image loading
|
|
20
|
+
image = "0.25" # Image processing library (match rusty-tesseract's version)
|
|
21
|
+
calamine = "0.26" # Excel parsing
|
|
22
|
+
docx-rs = "0.4" # Word document parsing
|
|
23
|
+
quick-xml = "0.36" # XML parsing
|
|
24
|
+
serde_json = "1.0" # JSON parsing
|
|
25
|
+
regex = "1.10" # Text parsing
|
|
26
|
+
encoding_rs = "0.8" # Encoding detection
|
|
27
|
+
|
|
28
|
+
[features]
|
|
29
|
+
default = []
|
|
30
|
+
|
|
31
|
+
[profile.release]
|
|
32
|
+
opt-level = 3
|
|
33
|
+
lto = true
|
|
34
|
+
codegen-units = 1
|
data/ext/parsekit/src/error.rs
ADDED
|
@@ -0,0 +1,45 @@
|
|
|
1
|
+
use magnus::{exception, Error, RModule, Ruby, Module};
|
|
2
|
+
|
|
3
|
+
/// Custom error types for ParseKit
|
|
4
|
+
#[derive(Debug)]
|
|
5
|
+
#[allow(dead_code)]
|
|
6
|
+
pub enum ParserError {
|
|
7
|
+
ParseError(String),
|
|
8
|
+
ConfigError(String),
|
|
9
|
+
IoError(String),
|
|
10
|
+
}
|
|
11
|
+
|
|
12
|
+
impl ParserError {
|
|
13
|
+
/// Convert to Magnus Error
|
|
14
|
+
#[allow(dead_code)]
|
|
15
|
+
pub fn to_error(&self) -> Error {
|
|
16
|
+
match self {
|
|
17
|
+
ParserError::ParseError(msg) => {
|
|
18
|
+
Error::new(exception::runtime_error(), msg.clone())
|
|
19
|
+
}
|
|
20
|
+
ParserError::ConfigError(msg) => {
|
|
21
|
+
Error::new(exception::arg_error(), msg.clone())
|
|
22
|
+
}
|
|
23
|
+
ParserError::IoError(msg) => {
|
|
24
|
+
Error::new(exception::io_error(), msg.clone())
|
|
25
|
+
}
|
|
26
|
+
}
|
|
27
|
+
}
|
|
28
|
+
}
|
|
29
|
+
|
|
30
|
+
/// Initialize error classes
|
|
31
|
+
/// For simplicity, we'll just create Ruby classes that inherit from Object,
|
|
32
|
+
/// and document that they should be treated as exceptions
|
|
33
|
+
pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
|
|
34
|
+
// For now, just create placeholder classes
|
|
35
|
+
// In a real implementation, you'd want to properly set up exception classes
|
|
36
|
+
// but Magnus 0.7's API for this is complex
|
|
37
|
+
|
|
38
|
+
// Define error classes as regular Ruby classes
|
|
39
|
+
// Note: classes that do not descend from Exception cannot be raised from Ruby,
// so these are placeholders only
|
|
40
|
+
let _error = module.define_class("Error", magnus::class::object())?;
|
|
41
|
+
let _parse_error = module.define_class("ParseError", magnus::class::object())?;
|
|
42
|
+
let _config_error = module.define_class("ConfigError", magnus::class::object())?;
|
|
43
|
+
|
|
44
|
+
Ok(())
|
|
45
|
+
}
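
Ruby only allows instances of `Exception` subclasses to be raised, so classes that inherit from `Object` cannot be used with `raise`. A minimal sketch of a rescuable hierarchy in plain Ruby follows; the `DemoKit` namespace is hypothetical, chosen so that the example does not clash with the real `ParseKit` constants.

```ruby
module DemoKit
  class Error < StandardError; end # base class for all library errors
  class ParseError < Error; end
  class ConfigError < Error; end
end

begin
  raise DemoKit::ParseError, "unexpected token"
rescue DemoKit::Error => e
  # Rescuing the base class catches every subclass.
  puts "#{e.class}: #{e.message}" # prints "DemoKit::ParseError: unexpected token"
end
```

Deriving from `StandardError` rather than `Exception` means a bare `rescue` in calling code also catches these errors.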
|
data/ext/parsekit/src/lib.rs
ADDED
|
@@ -0,0 +1,24 @@
|
|
|
1
|
+
use magnus::{function, prelude::*, Error, Ruby};
|
|
2
|
+
|
|
3
|
+
mod parser;
|
|
4
|
+
mod error;
|
|
5
|
+
|
|
6
|
+
/// Initialize the ParseKit module and its submodules
|
|
7
|
+
#[magnus::init]
|
|
8
|
+
fn init(ruby: &Ruby) -> Result<(), Error> {
|
|
9
|
+
let module = ruby.define_module("ParseKit")?;
|
|
10
|
+
|
|
11
|
+
// Initialize submodules
|
|
12
|
+
parser::init(ruby, module)?;
|
|
13
|
+
error::init(ruby, module)?;
|
|
14
|
+
|
|
15
|
+
// Add module-level methods
|
|
16
|
+
module.define_singleton_method("version", function!(version, 0))?;
|
|
17
|
+
|
|
18
|
+
Ok(())
|
|
19
|
+
}
|
|
20
|
+
|
|
21
|
+
/// Return the version of the parsekit gem
|
|
22
|
+
fn version() -> String {
|
|
23
|
+
env!("CARGO_PKG_VERSION").to_string()
|
|
24
|
+
}
|
data/ext/parsekit/src/parser.rs
ADDED
|
@@ -0,0 +1,472 @@
|
|
|
1
|
+
use magnus::{
|
|
2
|
+
class, function, method, prelude::*, scan_args, Error, RHash, RModule, Ruby, Value, Module,
|
|
3
|
+
};
|
|
4
|
+
use std::path::Path;
|
|
5
|
+
|
|
6
|
+
#[derive(Debug, Clone)]
|
|
7
|
+
#[magnus::wrap(class = "ParseKit::Parser", free_immediately, size)]
|
|
8
|
+
pub struct Parser {
|
|
9
|
+
config: ParserConfig,
|
|
10
|
+
}
|
|
11
|
+
|
|
12
|
+
#[derive(Debug, Clone)]
|
|
13
|
+
struct ParserConfig {
|
|
14
|
+
strict_mode: bool,
|
|
15
|
+
max_depth: usize,
|
|
16
|
+
encoding: String,
|
|
17
|
+
max_size: usize,
|
|
18
|
+
}
|
|
19
|
+
|
|
20
|
+
impl Default for ParserConfig {
|
|
21
|
+
fn default() -> Self {
|
|
22
|
+
Self {
|
|
23
|
+
strict_mode: false,
|
|
24
|
+
max_depth: 100,
|
|
25
|
+
encoding: "UTF-8".to_string(),
|
|
26
|
+
max_size: 100 * 1024 * 1024, // 100MB default limit
|
|
27
|
+
}
|
|
28
|
+
}
|
|
29
|
+
}
|
|
30
|
+
|
|
31
|
+
impl Parser {
|
|
32
|
+
/// Create a new Parser instance with optional configuration
|
|
33
|
+
fn new(ruby: &Ruby, args: &[Value]) -> Result<Self, Error> {
|
|
34
|
+
let args = scan_args::scan_args::<(), (Option<RHash>,), (), (), (), ()>(args)?;
|
|
35
|
+
let options = args.optional.0;
|
|
36
|
+
|
|
37
|
+
let mut config = ParserConfig::default();
|
|
38
|
+
|
|
39
|
+
if let Some(opts) = options {
|
|
40
|
+
if let Some(strict) = opts.get(ruby.to_symbol("strict_mode")) {
|
|
41
|
+
config.strict_mode = bool::try_convert(strict)?;
|
|
42
|
+
}
|
|
43
|
+
if let Some(depth) = opts.get(ruby.to_symbol("max_depth")) {
|
|
44
|
+
config.max_depth = usize::try_convert(depth)?;
|
|
45
|
+
}
|
|
46
|
+
if let Some(encoding) = opts.get(ruby.to_symbol("encoding")) {
|
|
47
|
+
config.encoding = String::try_convert(encoding)?;
|
|
48
|
+
}
|
|
49
|
+
if let Some(max_size) = opts.get(ruby.to_symbol("max_size")) {
|
|
50
|
+
config.max_size = usize::try_convert(max_size)?;
|
|
51
|
+
}
|
|
52
|
+
}
|
|
53
|
+
|
|
54
|
+
Ok(Self { config })
|
|
55
|
+
}
|
|
56
|
+
|
|
57
|
+
/// Parse input bytes based on file type (internal helper)
|
|
58
|
+
fn parse_bytes_internal(&self, data: Vec<u8>, filename: Option<&str>) -> Result<String, Error> {
|
|
59
|
+
// Check size limit
|
|
60
|
+
if data.len() > self.config.max_size {
|
|
61
|
+
return Err(Error::new(
|
|
62
|
+
magnus::exception::runtime_error(),
|
|
63
|
+
format!("File size {} exceeds maximum allowed size {}", data.len(), self.config.max_size),
|
|
64
|
+
));
|
|
65
|
+
}
|
|
66
|
+
|
|
67
|
+
// Detect file type from extension or content
|
|
68
|
+
let file_type = if let Some(name) = filename {
|
|
69
|
+
Self::detect_type_from_filename(name)
|
|
70
|
+
} else {
|
|
71
|
+
Self::detect_type_from_content(&data)
|
|
72
|
+
};
|
|
73
|
+
|
|
74
|
+
match file_type.as_str() {
|
|
75
|
+
"pdf" => self.parse_pdf(data),
|
|
76
|
+
"docx" => self.parse_docx(data),
|
|
77
|
+
"xlsx" | "xls" => self.parse_xlsx(data),
|
|
78
|
+
"json" => self.parse_json(data),
|
|
79
|
+
"xml" | "html" => self.parse_xml(data),
|
|
80
|
+
"png" | "jpg" | "jpeg" | "tiff" | "bmp" => self.ocr_image(data),
|
|
81
|
+
"txt" | "text" => self.parse_text(data),
|
|
82
|
+
_ => self.parse_text(data), // Default to text parsing
|
|
83
|
+
}
|
|
84
|
+
}
|
|
85
|
+
|
|
86
|
+
/// Detect file type from filename extension
|
|
87
|
+
fn detect_type_from_filename(filename: &str) -> String {
|
|
88
|
+
let path = Path::new(filename);
|
|
89
|
+
match path.extension().and_then(|s| s.to_str()) {
|
|
90
|
+
Some(ext) => ext.to_lowercase(),
|
|
91
|
+
None => "txt".to_string(),
|
|
92
|
+
}
|
|
93
|
+
}
|
|
94
|
+
|
|
95
|
+
/// Detect file type from content (basic detection)
|
|
96
|
+
fn detect_type_from_content(data: &[u8]) -> String {
|
|
97
|
+
if data.starts_with(b"%PDF") {
|
|
98
|
+
"pdf".to_string()
|
|
99
|
+
} else if data.starts_with(b"PK") {
|
|
100
|
+
// PK is the ZIP signature - could be DOCX or XLSX
|
|
101
|
+
// Try to differentiate by looking for common patterns
|
|
102
|
+
// This is a simplified check - both DOCX and XLSX are ZIP files
|
|
103
|
+
// For now, default to xlsx as it's more commonly parsed
|
|
104
|
+
"xlsx".to_string() // Office Open XML format (could also be DOCX)
|
|
105
|
+
} else if data.starts_with(&[0xD0, 0xCF, 0x11, 0xE0]) {
|
|
106
|
+
"xls".to_string() // OLE2 compound file signature; assumed to be legacy Excel (legacy .doc/.ppt share it)
|
|
107
|
+
} else if data.starts_with(&[0x89, 0x50, 0x4E, 0x47]) {
|
|
108
|
+
"png".to_string() // PNG signature
|
|
109
|
+
} else if data.starts_with(&[0xFF, 0xD8, 0xFF]) {
|
|
110
|
+
"jpg".to_string() // JPEG signature
|
|
111
|
+
} else if data.starts_with(b"BM") {
|
|
112
|
+
"bmp".to_string() // BMP signature
|
|
113
|
+
} else if data.starts_with(b"II\x2A\x00") || data.starts_with(b"MM\x00\x2A") {
|
|
114
|
+
"tiff".to_string() // TIFF signature (little-endian or big-endian)
|
|
115
|
+
} else if data.starts_with(b"<?xml") || data.starts_with(b"<html") {
|
|
116
|
+
"xml".to_string()
|
|
117
|
+
} else if data.starts_with(b"{") || data.starts_with(b"[") {
|
|
118
|
+
"json".to_string()
|
|
119
|
+
} else {
|
|
120
|
+
"txt".to_string()
|
|
121
|
+
}
|
|
122
|
+
}
|
|
123
|
+
|
|
124
|
+
/// Perform OCR on image data using Tesseract
|
|
125
|
+
fn ocr_image(&self, data: Vec<u8>) -> Result<String, Error> {
|
|
126
|
+
use rusty_tesseract::{Image, Args};
|
|
127
|
+
|
|
128
|
+
// Load image from memory
|
|
129
|
+
let img = match image::load_from_memory(&data) {
|
|
130
|
+
Ok(img) => img,
|
|
131
|
+
Err(e) => return Err(Error::new(
|
|
132
|
+
magnus::exception::runtime_error(),
|
|
133
|
+
format!("Failed to load image: {}", e),
|
|
134
|
+
))
|
|
135
|
+
};
|
|
136
|
+
|
|
137
|
+
// Create rusty_tesseract Image from DynamicImage
|
|
138
|
+
let tess_img = match Image::from_dynamic_image(&img) {
|
|
139
|
+
Ok(img) => img,
|
|
140
|
+
Err(e) => return Err(Error::new(
|
|
141
|
+
magnus::exception::runtime_error(),
|
|
142
|
+
format!("Failed to convert image for OCR: {}", e),
|
|
143
|
+
))
|
|
144
|
+
};
|
|
145
|
+
|
|
146
|
+
// Set up OCR arguments
|
|
147
|
+
let mut args = Args::default();
|
|
148
|
+
args.lang = "eng".to_string();
|
|
149
|
+
// Optional: Add more configuration
|
|
150
|
+
// args.config_variables.insert("tessedit_char_whitelist".to_string(),
|
|
151
|
+
// "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz .,!?-".to_string());
|
|
152
|
+
|
|
153
|
+
// Perform OCR
|
|
154
|
+
match rusty_tesseract::image_to_string(&tess_img, &args) {
|
|
155
|
+
Ok(text) => Ok(text.trim().to_string()),
|
|
156
|
+
Err(e) => Err(Error::new(
|
|
157
|
+
magnus::exception::runtime_error(),
|
|
158
|
+
format!("Failed to perform OCR: {}", e),
|
|
159
|
+
))
|
|
160
|
+
}
|
|
161
|
+
}
|
|
162
|
+
|
|
163
|
+
/// Parse PDF files using MuPDF (statically linked) - exposed to Ruby
|
|
164
|
+
fn parse_pdf(&self, data: Vec<u8>) -> Result<String, Error> {
|
|
165
|
+
use mupdf::Document;
|
|
166
|
+
|
|
167
|
+
// Try to load the PDF from memory
|
|
168
|
+
// The magic parameter helps MuPDF identify the file type
|
|
169
|
+
match Document::from_bytes(&data, "pdf") {
|
|
170
|
+
Ok(doc) => {
|
|
171
|
+
let mut all_text = String::new();
|
|
172
|
+
|
|
173
|
+
// Get page count - this returns a Result
|
|
174
|
+
let page_count = match doc.page_count() {
|
|
175
|
+
Ok(count) => count,
|
|
176
|
+
Err(e) => return Err(Error::new(
|
|
177
|
+
magnus::exception::runtime_error(),
|
|
178
|
+
format!("Failed to get page count: {}", e),
|
|
179
|
+
))
|
|
180
|
+
};
|
|
181
|
+
|
|
182
|
+
// Iterate through pages
|
|
183
|
+
for page_num in 0..page_count {
|
|
184
|
+
match doc.load_page(page_num) {
|
|
185
|
+
Ok(page) => {
|
|
186
|
+
// Extract text from the page
|
|
187
|
+
match page.to_text() {
|
|
188
|
+
Ok(text) => {
|
|
189
|
+
all_text.push_str(&text);
|
|
190
|
+
all_text.push('\n');
|
|
191
|
+
}
|
|
192
|
+
Err(_) => continue,
|
|
193
|
+
}
|
|
194
|
+
}
|
|
195
|
+
Err(_) => continue,
|
|
196
|
+
}
|
|
197
|
+
}
|
|
198
|
+
|
|
199
|
+
if all_text.is_empty() {
|
|
200
|
+
Ok("PDF contains no extractable text (might be scanned/image-based)".to_string())
|
|
201
|
+
} else {
|
|
202
|
+
Ok(all_text.trim().to_string())
|
|
203
|
+
}
|
|
204
|
+
}
|
|
205
|
+
Err(e) => Err(Error::new(
|
|
206
|
+
magnus::exception::runtime_error(),
|
|
207
|
+
format!("Failed to parse PDF: {}", e),
|
|
208
|
+
))
|
|
209
|
+
}
|
|
210
|
+
}
|
|
211
|
+
|
|
212
|
+
/// Parse DOCX (Word) files - exposed to Ruby
|
|
213
|
+
fn parse_docx(&self, data: Vec<u8>) -> Result<String, Error> {
|
|
214
|
+
use docx_rs::read_docx;
|
|
215
|
+
|
|
216
|
+
match read_docx(&data) {
|
|
217
|
+
Ok(docx) => {
|
|
218
|
+
let mut result = String::new();
|
|
219
|
+
|
|
220
|
+
// Extract text from all document children
|
|
221
|
+
// For simplicity, we'll focus on paragraphs only for now
|
|
222
|
+
// Tables require more complex handling with the current API
|
|
223
|
+
for child in docx.document.children.iter() {
|
|
224
|
+
if let docx_rs::DocumentChild::Paragraph(p) = child {
|
|
225
|
+
// Extract text from paragraph
|
|
226
|
+
for p_child in &p.children {
|
|
227
|
+
if let docx_rs::ParagraphChild::Run(r) = p_child {
|
|
228
|
+
for run_child in &r.children {
|
|
229
|
+
if let docx_rs::RunChild::Text(t) = run_child {
|
|
230
|
+
result.push_str(&t.text);
|
|
231
|
+
}
|
|
232
|
+
}
|
|
233
|
+
}
|
|
234
|
+
}
|
|
235
|
+
result.push('\n');
|
|
236
|
+
}
|
|
237
|
+
// Note: Table text extraction would require iterating through
|
|
238
|
+
// table.rows -> TableChild::TableRow -> row.cells -> TableRowChild
|
|
239
|
+
// which has a more complex structure in docx-rs
|
|
240
|
+
}
|
|
241
|
+
|
|
242
|
+
Ok(result.trim().to_string())
|
|
243
|
+
}
|
|
244
|
+
Err(e) => Err(Error::new(
|
|
245
|
+
magnus::exception::runtime_error(),
|
|
246
|
+
format!("Failed to parse DOCX file: {}", e),
|
|
247
|
+
))
|
|
248
|
+
}
|
|
249
|
+
}
|
|
250
|
+
|
|
251
|
+
/// Parse Excel files - exposed to Ruby
|
|
252
|
+
fn parse_xlsx(&self, data: Vec<u8>) -> Result<String, Error> {
|
|
253
|
+
use calamine::{Reader, Xlsx};
|
|
254
|
+
use std::io::Cursor;
|
|
255
|
+
|
|
256
|
+
let cursor = Cursor::new(data);
|
|
257
|
+
match Xlsx::new(cursor) {
|
|
258
|
+
Ok(mut workbook) => {
|
|
259
|
+
let mut result = String::new();
|
|
260
|
+
|
|
261
|
+
for sheet_name in workbook.sheet_names().to_owned() {
|
|
262
|
+
result.push_str(&format!("Sheet: {}\n", sheet_name));
|
|
263
|
+
|
|
264
|
+
if let Ok(range) = workbook.worksheet_range(&sheet_name) {
|
|
265
|
+
for row in range.rows() {
|
|
266
|
+
for cell in row {
|
|
267
|
+
result.push_str(&format!("{}\t", cell));
|
|
268
|
+
}
|
|
269
|
+
result.push('\n');
|
|
270
|
+
}
|
|
271
|
+
}
|
|
272
|
+
result.push('\n');
|
|
273
|
+
}
|
|
274
|
+
|
|
275
|
+
Ok(result)
|
|
276
|
+
}
|
|
277
|
+
Err(e) => Err(Error::new(
|
|
278
|
+
magnus::exception::runtime_error(),
|
|
279
|
+
format!("Failed to parse Excel file: {}", e),
|
|
280
|
+
))
|
|
281
|
+
}
|
|
282
|
+
}
|
|
283
|
+
|
|
284
|
+
/// Parse JSON files - exposed to Ruby
|
|
285
|
+
fn parse_json(&self, data: Vec<u8>) -> Result<String, Error> {
|
|
286
|
+
let text = String::from_utf8_lossy(&data);
|
|
287
|
+
match serde_json::from_str::<serde_json::Value>(&text) {
|
|
288
|
+
Ok(json) => Ok(serde_json::to_string_pretty(&json).unwrap_or_else(|_| text.to_string())),
|
|
289
|
+
Err(_) => Ok(text.to_string()),
|
|
290
|
+
}
|
|
291
|
+
}
|
|
292
|
+
|
|
293
|
+
/// Parse XML/HTML files - exposed to Ruby
|
|
294
|
+
fn parse_xml(&self, data: Vec<u8>) -> Result<String, Error> {
|
|
295
|
+
use quick_xml::events::Event;
|
|
296
|
+
use quick_xml::Reader;
|
|
297
|
+
|
|
298
|
+
let mut reader = Reader::from_reader(&data[..]);
|
|
299
|
+
let mut txt = String::new();
|
|
300
|
+
let mut buf = Vec::new();
|
|
301
|
+
|
|
302
|
+
loop {
|
|
303
|
+
match reader.read_event_into(&mut buf) {
|
|
304
|
+
Ok(Event::Text(e)) => {
|
|
305
|
+
txt.push_str(&e.unescape().unwrap_or_default());
|
|
306
|
+
txt.push(' ');
|
|
307
|
+
}
|
|
308
|
+
Ok(Event::Eof) => break,
|
|
309
|
+
Err(e) => {
|
|
310
|
+
return Err(Error::new(
|
|
311
|
+
magnus::exception::runtime_error(),
|
|
312
|
+
format!("XML parse error: {}", e),
|
|
313
|
+
))
|
|
314
|
+
}
|
|
315
|
+
_ => {}
|
|
316
|
+
}
|
|
317
|
+
buf.clear();
|
|
318
|
+
}
|
|
319
|
+
|
|
320
|
+
Ok(txt.trim().to_string())
|
|
321
|
+
}
|
|
322
|
+
|
|
323
|
+
/// Parse plain text with encoding detection - exposed to Ruby
|
|
324
|
+
fn parse_text(&self, data: Vec<u8>) -> Result<String, Error> {
|
|
325
|
+
// Try UTF-8 first; fall back to Windows-1252 if the bytes are malformed
|
|
326
|
+
let (decoded, _encoding, malformed) = encoding_rs::UTF_8.decode(&data);
|
|
327
|
+
|
|
328
|
+
if malformed {
|
|
329
|
+
// Try other encodings
|
|
330
|
+
let (decoded, _encoding, _malformed) = encoding_rs::WINDOWS_1252.decode(&data);
|
|
331
|
+
Ok(decoded.to_string())
|
|
332
|
+
} else {
|
|
333
|
+
Ok(decoded.to_string())
|
|
334
|
+
}
|
|
335
|
+
}
|
|
336
|
+
|
|
337
|
+
/// Parse input string (for text content)
|
|
338
|
+
fn parse(&self, input: String) -> Result<String, Error> {
|
|
339
|
+
if input.is_empty() {
|
|
340
|
+
return Err(Error::new(
|
|
341
|
+
magnus::exception::arg_error(),
|
|
342
|
+
"Input cannot be empty",
|
|
343
|
+
));
|
|
344
|
+
}
|
|
345
|
+
|
|
346
|
+
// For string input, just return cleaned text
|
|
347
|
+
// If strict mode is on, append indicator for testing
|
|
348
|
+
if self.config.strict_mode {
|
|
349
|
+
Ok(format!("{} strict=true", input.trim()))
|
|
350
|
+
} else {
|
|
351
|
+
Ok(input.trim().to_string())
|
|
352
|
+
}
|
|
353
|
+
}
|
|
354
|
+
|
|
355
|
+
/// Parse a file
|
|
356
|
+
fn parse_file(&self, path: String) -> Result<String, Error> {
|
|
357
|
+
use std::fs;
|
|
358
|
+
|
|
359
|
+
let data = fs::read(&path)
|
|
360
|
+
.map_err(|e| Error::new(magnus::exception::io_error(), format!("Failed to read file: {}", e)))?;
|
|
361
|
+
|
|
362
|
+
self.parse_bytes_internal(data, Some(&path))
|
|
363
|
+
}
|
|
364
|
+
|
|
365
|
+
/// Parse bytes from Ruby
|
|
366
|
+
fn parse_bytes(&self, data: Vec<u8>) -> Result<String, Error> {
+        if data.is_empty() {
+            return Err(Error::new(
+                magnus::exception::arg_error(),
+                "Data cannot be empty",
+            ));
+        }
+
+        self.parse_bytes_internal(data, None)
+    }
+
+    /// Get parser configuration
+    fn config(&self) -> Result<RHash, Error> {
+        let ruby = Ruby::get().unwrap();
+        let hash = ruby.hash_new();
+        hash.aset(ruby.to_symbol("strict_mode"), self.config.strict_mode)?;
+        hash.aset(ruby.to_symbol("max_depth"), self.config.max_depth)?;
+        hash.aset(ruby.to_symbol("encoding"), self.config.encoding.as_str())?;
+        hash.aset(ruby.to_symbol("max_size"), self.config.max_size)?;
+        Ok(hash)
+    }
+
+    /// Check if parser is in strict mode
+    fn strict_mode(&self) -> bool {
+        self.config.strict_mode
+    }
+
+    /// List the supported file extensions
+    fn supported_formats() -> Vec<String> {
+        vec![
+            "txt".to_string(),
+            "json".to_string(),
+            "xml".to_string(),
+            "html".to_string(),
+            "docx".to_string(),
+            "xlsx".to_string(),
+            "xls".to_string(),
+            "csv".to_string(),
+            "pdf".to_string(),  // Text extraction via MuPDF
+            "png".to_string(),  // OCR via Tesseract
+            "jpg".to_string(),  // OCR via Tesseract
+            "jpeg".to_string(), // OCR via Tesseract
+            "tiff".to_string(), // OCR via Tesseract
+            "bmp".to_string(),  // OCR via Tesseract
+        ]
+    }
+
+    /// Check whether the file's extension is supported (case-insensitive)
+    fn supports_file(&self, path: String) -> bool {
+        if let Some(ext) = std::path::Path::new(&path)
+            .extension()
+            .and_then(|s| s.to_str())
+        {
+            Self::supported_formats().contains(&ext.to_lowercase())
+        } else {
+            false
+        }
+    }
+}
+
+/// Module-level convenience function for parsing files
+fn parse_file_direct(path: String) -> Result<String, Error> {
+    let parser = Parser {
+        config: ParserConfig::default(),
+    };
+    parser.parse_file(path)
+}
+
+/// Module-level convenience function for parsing binary data
+fn parse_bytes_direct(data: Vec<u8>) -> Result<String, Error> {
+    let parser = Parser {
+        config: ParserConfig::default(),
+    };
+    parser.parse_bytes_internal(data, None)
+}
+
+/// Initialize the Parser class
+pub fn init(_ruby: &Ruby, module: RModule) -> Result<(), Error> {
+    let class = module.define_class("Parser", class::object())?;
+
+    // Constructor and instance methods
+    class.define_singleton_method("new", function!(Parser::new, -1))?;
+    class.define_method("parse", method!(Parser::parse, 1))?;
+    class.define_method("parse_file", method!(Parser::parse_file, 1))?;
+    class.define_method("parse_bytes", method!(Parser::parse_bytes, 1))?;
+    class.define_method("config", method!(Parser::config, 0))?;
+    class.define_method("strict_mode?", method!(Parser::strict_mode, 0))?;
+    class.define_method("supports_file?", method!(Parser::supports_file, 1))?;
+
+    // Individual parser methods exposed to Ruby
+    class.define_method("parse_pdf", method!(Parser::parse_pdf, 1))?;
+    class.define_method("parse_docx", method!(Parser::parse_docx, 1))?;
+    class.define_method("parse_xlsx", method!(Parser::parse_xlsx, 1))?;
+    class.define_method("parse_json", method!(Parser::parse_json, 1))?;
+    class.define_method("parse_xml", method!(Parser::parse_xml, 1))?;
+    class.define_method("parse_text", method!(Parser::parse_text, 1))?;
+    class.define_method("ocr_image", method!(Parser::ocr_image, 1))?;
+
+    // Class methods
+    class.define_singleton_method("supported_formats", function!(Parser::supported_formats, 0))?;
+
+    // Module-level convenience methods
+    module.define_singleton_method("parse_file", function!(parse_file_direct, 1))?;
+    module.define_singleton_method("parse_bytes", function!(parse_bytes_direct, 1))?;
+
+    Ok(())
+}
|
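The extension check performed by `supports_file` in the Rust source above can be mirrored in plain Ruby for readers who want to experiment without building the native extension. This is a minimal sketch and not the gem's code: `SKETCH_FORMATS` and `sketch_supports_file?` are hypothetical names, and the list is copied from `supported_formats`.

```ruby
# Hypothetical pure-Ruby mirror of supported_formats / supports_file from
# parser.rs. The names below are illustrative and not part of the gem.
SKETCH_FORMATS = %w[
  txt json xml html docx xlsx xls csv pdf png jpg jpeg tiff bmp
].freeze

# Returns true when the path's extension (case-insensitive) is in the list.
def sketch_supports_file?(path)
  ext = File.extname(path) # ".PDF" for "report.PDF", "" when there is none
  return false if ext.empty?

  SKETCH_FORMATS.include?(ext.delete_prefix(".").downcase)
end

puts sketch_supports_file?("report.PDF")
puts sketch_supports_file?("notes.md")
puts sketch_supports_file?("README")
```

One consequence worth noting: `md` and `markdown` are not in the Rust list, so a check built on `supports_file?` would reject a Markdown file even though the Ruby-side `detect_format` routes those extensions to the text parser.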
data/lib/parsekit/error.rb
ADDED
|
@@ -0,0 +1,15 @@
+# frozen_string_literal: true
+
+module ParseKit
+  # Error classes are defined in the native extension
+  # This file is kept for documentation purposes
+
+  # Base error class for ParseKit (defined in native extension)
+  # class Error < StandardError; end
+
+  # Raised when parsing fails (defined in native extension)
+  # class ParseError < Error; end
+
+  # Raised when configuration is invalid (defined in native extension)
+  # class ConfigError < Error; end
+end
|
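The commented-out declarations in error.rb describe a three-class hierarchy that the native extension is said to define. The sketch below reproduces that hierarchy in a hypothetical stand-in module, `ParseKitSketch`, to show how rescuing the base class behaves. It assumes the native classes match the commented signatures, which this diff section does not show.

```ruby
# Illustrative stand-in for the hierarchy documented in error.rb.
# ParseKitSketch is a hypothetical module; the real classes are defined by
# the native extension, which this sketch does not load.
module ParseKitSketch
  class Error < StandardError; end
  class ParseError < Error; end
  class ConfigError < Error; end
end

# Runs the block and reports which rescue clause, if any, handled a failure.
def classify_failure
  yield
  :ok
rescue ParseKitSketch::ConfigError
  :bad_config
rescue ParseKitSketch::Error
  :parse_or_other
end

puts classify_failure { raise ParseKitSketch::ConfigError, "invalid option" }
puts classify_failure { raise ParseKitSketch::ParseError, "unexpected end of input" }
puts classify_failure { "fine" }
```

The empty-input check in the Rust `parse_bytes` raises `ArgumentError`, which sits outside this hierarchy, so a `rescue` on the base error class alone would not catch it.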
data/lib/parsekit/parsekit.bundle
ADDED
|
Binary file
|
data/lib/parsekit/parser.rb
ADDED
|
@@ -0,0 +1,201 @@
+# frozen_string_literal: true
+
+module ParseKit
+  # Ruby wrapper for the native Parser class
+  #
+  # The Ruby layer now handles format detection and routing to specific parsers,
+  # while Rust provides the actual parsing implementations.
+  class Parser
+    # These methods are implemented in the native extension
+    # and are documented here for YARD
+
+    # Initialize a new Parser instance
+    # @param options [Hash] Configuration options
+    # @option options [String] :encoding Input encoding (default: UTF-8)
+    # def initialize(options = {})
+    #   # Implemented in native extension
+    # end
+
+    # Parse an input string (for text content)
+    # @param input [String] The input to parse
+    # @return [String] The parsed result
+    # @raise [ArgumentError] If input is empty
+    # def parse(input)
+    #   # Implemented in native extension
+    # end
+
+    # Parse a file (supports PDF, Office documents, text files)
+    # @param path [String] Path to the file to parse
+    # @return [String] The extracted text content
+    # @raise [IOError] If file cannot be read
+    # @raise [RuntimeError] If parsing fails
+    # def parse_file(path)
+    #   # Implemented in native extension
+    # end
+
+    # Parse binary data
+    # @param data [Array<Integer>] Binary data as byte array
+    # @return [String] The extracted text content
+    # @raise [ArgumentError] If data is empty
+    # @raise [RuntimeError] If parsing fails
+    # def parse_bytes(data)
+    #   # Implemented in native extension
+    # end
+
+    # Get the current configuration
+    # @return [Hash] The parser configuration
+    # def config
+    #   # Implemented in native extension
+    # end
+
+    # Check if a file format is supported
+    # @param path [String] File path to check
+    # @return [Boolean] True if the file format is supported
+    # def supports_file?(path)
+    #   # Implemented in native extension
+    # end
+
+    # Get list of supported file formats
+    # @return [Array<String>] List of supported file extensions
+    # def self.supported_formats
+    #   # Implemented in native extension
+    # end
+
+    # Ruby-level helper methods
+
+    # Create a parser with strict mode enabled
+    # @param options [Hash] Additional options
+    # @return [Parser] A new parser instance with strict mode
+    def self.strict(options = {})
+      new(options.merge(strict_mode: true))
+    end
+
+    # Parse a file with a block for processing results
+    # @param path [String] Path to the file to parse
+    # @yield [result] Yields the parsed result for processing
+    # @return [String] The parsed result (the block's return value is discarded)
+    def parse_file_with_block(path)
+      result = parse_file(path)
+      yield result if block_given?
+      result
+    end
+
+    # Detect format from file path
+    # @param path [String] File path
+    # @return [Symbol, nil] Format symbol; nil only when the path has no
+    #   extension (unrecognized extensions fall back to :text)
+    def detect_format(path)
+      ext = file_extension(path)
+      return nil unless ext
+
+      case ext.downcase
+      when 'docx' then :docx
+      when 'xlsx', 'xls' then :xlsx
+      when 'pdf' then :pdf
+      when 'json' then :json
+      when 'xml', 'html' then :xml
+      when 'txt', 'text', 'md', 'markdown' then :text
+      when 'csv' then :text # CSV is handled as text for now
+      else :text # Default to text
+      end
+    end
+
+    # Detect format from binary data
+    # @param data [String, Array<Integer>] Binary data
+    # @return [Symbol] Format symbol
+    def detect_format_from_bytes(data)
+      # Convert to bytes if string
+      bytes = data.is_a?(String) ? data.bytes : data
+      return :text if bytes.empty?
+
+      # Check magic bytes
+      if bytes[0..3] == [0x25, 0x50, 0x44, 0x46] # %PDF
+        :pdf
+      elsif bytes[0..1] == [0x50, 0x4B] # PK (ZIP archive)
+        # Could be DOCX or XLSX, default to xlsx for now
+        # In the future, could inspect ZIP contents to determine
+        :xlsx
+      elsif bytes[0..3] == [0xD0, 0xCF, 0x11, 0xE0] # OLE2 compound file (legacy .xls; also used by .doc/.ppt)
+        :xlsx
+      elsif bytes[0..4] == [0x3C, 0x3F, 0x78, 0x6D, 0x6C] # <?xml
+        :xml
+      elsif bytes[0..4] == [0x3C, 0x68, 0x74, 0x6D, 0x6C] # <html
+        :xml
+      elsif bytes[0] == 0x7B || bytes[0] == 0x5B # { or [
+        :json
+      else
+        :text
+      end
+    end
+
+    # Parse file using format-specific parser
+    # This method now detects format and routes to the appropriate parser
+    # @param path [String] File path
+    # @return [String] Parsed content
+    def parse_file_routed(path)
+      format = detect_format(path)
+      data = File.read(path, mode: 'rb').bytes
+
+      case format
+      when :docx then parse_docx(data)
+      when :xlsx then parse_xlsx(data)
+      when :pdf then parse_pdf(data)
+      when :json then parse_json(data)
+      when :xml then parse_xml(data)
+      else parse_text(data)
+      end
+    end
+
+    # Parse bytes using format-specific parser
+    # This method detects format and routes to the appropriate parser
+    # @param data [String, Array<Integer>] Binary data
+    # @return [String] Parsed content
+    def parse_bytes_routed(data)
+      format = detect_format_from_bytes(data)
+      bytes = data.is_a?(String) ? data.bytes : data
+
+      case format
+      when :docx then parse_docx(bytes)
+      when :xlsx then parse_xlsx(bytes)
+      when :pdf then parse_pdf(bytes)
+      when :json then parse_json(bytes)
+      when :xml then parse_xml(bytes)
+      else parse_text(bytes)
+      end
+    end
+
+    # Parse with a block for processing results
+    # @param input [String] The input to parse
+    # @yield [result] Yields the parsed result for processing
+    # @return [String] The parsed result (the block's return value is discarded)
+    def parse_with_block(input)
+      result = parse(input)
+      yield result if block_given?
+      result
+    end
+
+    # Validate input before parsing
+    # @param input [String] The input to validate
+    # @return [Boolean] True if input is valid
+    def valid_input?(input)
+      return false unless input.is_a?(String)
+      return false if input.empty?
+      true
+    end
+
+    # Validate file before parsing
+    # @param path [String] The file path to validate
+    # @return [Boolean] True if file exists and format is supported
+    def valid_file?(path)
+      return false unless File.exist?(path)
+      supports_file?(path)
+    end
+
+    # Get file extension
+    # @param path [String] File path
+    # @return [String, nil] File extension in lowercase
+    def file_extension(path)
+      ext = File.extname(path)
+      ext.empty? ? nil : ext[1..].downcase
+    end
+  end
+end
|
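`detect_format_from_bytes` maps every ZIP archive to `:xlsx` and notes that the ZIP contents could be inspected later. One way to approximate that without a ZIP library is sketched below. ZIP stores entry names uncompressed, so a plain byte search for the usual main-part names (`word/document.xml`, `xl/workbook.xml`) is often enough to tell the two apart. `sniff_ooxml` is a hypothetical helper and the approach is a heuristic; a robust version would read the archive's central directory.

```ruby
# Heuristic sketch: tell DOCX from XLSX by looking for well-known entry names.
# sniff_ooxml is a hypothetical helper, not part of ParseKit.
def sniff_ooxml(data)
  # Accept either a String or an Array of byte values, as the gem's helpers do.
  str = data.is_a?(String) ? data.b : data.pack("C*")
  return nil unless str.start_with?("PK".b)

  if str.include?("word/document.xml".b)
    :docx
  elsif str.include?("xl/workbook.xml".b)
    :xlsx
  else
    :zip # some other archive; the caller decides how to treat it
  end
end

# Fake archives: only a signature, padding, and an entry name. They are not
# valid ZIP files, just enough input to exercise the search.
fake_docx = "PK\x03\x04".b + ("\x00".b * 26) + "word/document.xml".b
fake_xlsx = ("PK\x03\x04".b + ("\x00".b * 26) + "xl/workbook.xml".b).bytes

p sniff_ooxml(fake_docx)
p sniff_ooxml(fake_xlsx)
p sniff_ooxml("%PDF-1.7")
```

Returning a distinct `:zip` value for unrecognized archives keeps the fallback decision with the caller instead of silently treating every archive as a spreadsheet.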
data/lib/parsekit.rb
ADDED
|
@@ -0,0 +1,61 @@
+# frozen_string_literal: true
+
+require_relative "parsekit/version"
+
+# Load the native extension
+begin
+  require_relative "parsekit/parsekit"
+rescue LoadError
+  require "parsekit/parsekit"
+end
+
+require_relative "parsekit/error"
+require_relative "parsekit/parser"
+
+# ParseKit is a Ruby document parsing toolkit with PDF and OCR support
+module ParseKit
+  class << self
+    # The parse_file and parse_bytes methods are defined in the native extension
+    # We just need to document them here or add wrapper logic if needed
+
+    # Convenience method to parse input directly (for text)
+    # @param input [String] The input string to parse
+    # @param options [Hash] Optional configuration options
+    # @option options [String] :encoding Input encoding (default: UTF-8)
+    # @return [String] The parsed result
+    def parse(input, options = {})
+      Parser.new(options).parse(input)
+    end
+
+    # Parse binary data
+    # @param data [String, Array] Binary data to parse
+    # @param options [Hash] Optional configuration options
+    # @return [String] The extracted text
+    def parse_bytes(data, options = {})
+      # Convert string to bytes if needed
+      byte_data = data.is_a?(String) ? data.bytes : data
+      Parser.new(options).parse_bytes(byte_data)
+    end
+
+    # Get supported file formats
+    # @return [Array<String>] List of supported file extensions
+    def supported_formats
+      Parser.supported_formats
+    end
+
+    # Check if a file format is supported
+    # @param path [String] File path to check
+    # @return [Boolean] True if the file format is supported
+    def supports_file?(path)
+      Parser.new.supports_file?(path)
+    end
+
+    # Get the native library version
+    # @return [String] Version of the native library
+    def native_version
+      version
+    rescue StandardError
+      "unknown"
+    end
+  end
+end
|
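`ParseKit.parse_bytes` accepts either a String or an Array and normalizes a String with `String#bytes` before handing it to the native parser. The sketch below shows what that conversion produces and how to reverse it. `to_byte_array` is a hypothetical helper that mirrors the one-line conditional in the source; it does not call into the gem.

```ruby
# Hypothetical helper mirroring the String-or-Array normalization in
# ParseKit.parse_bytes.
def to_byte_array(data)
  data.is_a?(String) ? data.bytes : data
end

text = "h\u00e9llo" # the accented "e" occupies two bytes in UTF-8
bytes = to_byte_array(text)

p bytes.length                       # byte count, not character count
p to_byte_array(bytes).equal?(bytes) # arrays pass through untouched

# Array#pack("C*") turns the byte array back into a binary String.
round_trip = bytes.pack("C*").force_encoding(Encoding::UTF_8)
p round_trip == text
```

Because `String#bytes` allocates one Integer entry per byte, the Array form is considerably larger in memory than the original String, which may matter for large documents.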
metadata
ADDED
|
@@ -0,0 +1,132 @@
+--- !ruby/object:Gem::Specification
+name: parsekit
+version: !ruby/object:Gem::Version
+  version: 0.1.0.pre.1
+platform: ruby
+authors:
+- Chris Petersen
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2025-08-21 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rb_sys
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.9'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.9'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '13.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '13.0'
+- !ruby/object:Gem::Dependency
+  name: rake-compiler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.2'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.2'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+- !ruby/object:Gem::Dependency
+  name: simplecov
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.22'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.22'
+description: Native Ruby gem for parsing documents (PDF, DOCX, XLSX, images with OCR)
+  with zero runtime dependencies. Statically links MuPDF for PDF extraction and Tesseract
+  for OCR.
+email:
+- chris@petersen.io
+executables: []
+extensions:
+- ext/parsekit/extconf.rb
+extra_rdoc_files: []
+files:
+- CHANGELOG.md
+- LICENSE.txt
+- README.md
+- ext/parsekit/Cargo.toml
+- ext/parsekit/extconf.rb
+- ext/parsekit/src/error.rs
+- ext/parsekit/src/lib.rs
+- ext/parsekit/src/parser.rs
+- lib/parsekit.rb
+- lib/parsekit/error.rb
+- lib/parsekit/parsekit.bundle
+- lib/parsekit/parser.rb
+- lib/parsekit/version.rb
+homepage: https://github.com/cpetersen/parsekit
+licenses:
+- MIT
+metadata:
+  homepage_uri: https://github.com/cpetersen/parsekit
+  source_code_uri: https://github.com/cpetersen/parsekit
+  changelog_uri: https://github.com/cpetersen/parsekit/blob/main/CHANGELOG.md
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: 3.0.0
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.5.3
+signing_key:
+specification_version: 4
+summary: Ruby document parsing toolkit with PDF and OCR support
+test_files: []
|
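The version constraints in the metadata above can be checked with the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems. The sketch below illustrates what the `~>` operator on the `rb_sys` entry admits and how RubyGems treats the gem's own `0.1.0.pre.1` version string. The candidate versions `0.9.100` and `1.0.0` are made-up inputs chosen for illustration, not actual `rb_sys` releases.

```ruby
require "rubygems" # normally loaded already; explicit here for clarity

# The pessimistic constraint "~> 0.9" from the rb_sys dependency entry.
rb_sys = Gem::Requirement.new("~> 0.9")
p rb_sys.satisfied_by?(Gem::Version.new("0.9.100")) # hypothetical version
p rb_sys.satisfied_by?(Gem::Version.new("1.0.0"))   # hypothetical version

# The gem's own version string counts as a prerelease in RubyGems terms
# and sorts before the corresponding final release.
version = Gem::Version.new("0.1.0.pre.1")
p version.prerelease?
p version < Gem::Version.new("0.1.0")

# required_ruby_version ">= 3.0.0", checked against the running interpreter.
ruby_req = Gem::Requirement.new(">= 3.0.0")
p ruby_req.satisfied_by?(Gem::Version.new(RUBY_VERSION))
```

Because the version is a prerelease, a plain `gem install parsekit` would normally skip it; installing it requires the `--pre` flag or an explicit version.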