tahweel 0.1.2 → 0.1.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: cfa4c4ffbb0b1229794addc7ae400b8af9fda793e7c17b403d4aa60bd92fa7f6
4
- data.tar.gz: 2d236076739f2ae1b892669487b6ac839b1a5420c1928621787e6387e7ba50ab
3
+ metadata.gz: 162fd5253be8e0c0873bbedb1bd0218a57964bbbfd27616e327aaa5624092894
4
+ data.tar.gz: 1ce916a6abd98caa715e226ceacb0cbc32aecd5a317f75c397ac353fee9006c4
5
5
  SHA512:
6
- metadata.gz: de06384d492cd26925dee76392119d0cd7c05d279e1aaafa97d055091558513847b051f5c65bc505a5213623a36004f734fbe0794b414ebc28c6982639841960
7
- data.tar.gz: 31e2d05fbaf89c4f09ef3859243052ffd00f95ffa277368826e9ebed11539b401d8b8487920c799b6bc2bf5d52ab36cfa4ff4771f9661a38a839d5353dfed2bf
6
+ metadata.gz: 7f7aee9ddc925f4b351bf167420cbf5176a06e522aa5c82164778b4bd2e6abf028574c8835ab887f2ac9333ba7e45e14f5f80697d135961066a14f375313498e
7
+ data.tar.gz: 11e0c1347dfa5a7d746ce15d5702c5edf5ad115162b9698e07a76d340d5ba7116e1aa438cd98433d6833f7df46a74bdc635c64f0a79e0a617b5efa373aa7dcb6
data/CHANGELOG.md CHANGED
@@ -1,5 +1,17 @@
1
1
  ## [Unreleased]
2
2
 
3
+ ## [0.1.4] - 2026-01-07
4
+
5
+ ### Changed
6
+
7
+ - Fix Windows encoding issue in `PdfSplitter#total_pages`
8
+
9
+ ## [0.1.3] - 2026-01-07
10
+
11
+ ### Changed
12
+
13
+ - Update Google OAuth credentials
14
+
3
15
  ## [0.1.2] - 2026-01-03
4
16
 
5
17
  ### Changed
data/README.en.md ADDED
@@ -0,0 +1,468 @@
1
+ <p align="center">
2
+ <img src="assets/logo.png" alt="Tahweel Logo" width="200" />
3
+ </p>
4
+
5
+ <h1 align="center">Tahweel (تحويل)</h1>
6
+
7
+ <p align="center">
8
+ <strong>Convert PDF files and images to text using Google Drive OCR</strong>
9
+ </p>
10
+
11
+ <p align="center">
12
+ <a href="https://rubygems.org/gems/tahweel"><img src="https://img.shields.io/gem/v/tahweel.svg" alt="Gem Version" /></a>
13
+ <a href="https://github.com/ieasybooks/tahweel.rb/blob/main/LICENSE.txt"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License" /></a>
14
+ <img src="https://img.shields.io/badge/ruby-%3E%3D%203.2-ruby.svg" alt="Ruby Version" />
15
+ </p>
16
+
17
+ <p align="center">
18
+ <a href="#features">Features</a> •
19
+ <a href="#installation">Installation</a> •
20
+ <a href="#prerequisites">Prerequisites</a> •
21
+ <a href="#authentication">Authentication</a> •
22
+ <a href="#usage">Usage</a> •
23
+ <a href="#api-reference">API Reference</a> •
24
+ <a href="#contributing">Contributing</a>
25
+ </p>
26
+
27
+ <p align="center">
28
+ <a href="README.md">🌐 العربية</a>
29
+ </p>
30
+
31
+ ---
32
+
33
+ **Tahweel** (Arabic: تحويل, meaning "conversion") is a Ruby gem that converts PDF files and images to editable text formats using Google Drive's OCR capabilities. It is optimized for Arabic text but works with any language supported by Google's OCR engine.
34
+
35
+ ## Features
36
+
37
+ - 🔤 **High-Quality OCR** — Leverages Google Drive's powerful OCR engine for accurate text extraction
38
+ - 📄 **Multiple Input Formats** — Supports PDF, JPG, JPEG, and PNG files
39
+ - 📝 **Multiple Output Formats** — Export to TXT, DOCX, or JSON
40
+ - 🌐 **Arabic Text Support** — Automatic right-to-left alignment detection for Arabic documents
41
+ - ⚡ **Concurrent Processing** — Multi-threaded processing at both file and page levels
42
+ - 📊 **Real-Time Progress** — Beautiful terminal UI with progress tracking per worker thread
43
+ - 🖥️ **Desktop GUI** — Cross-platform graphical interface with Arabic and English support
44
+ - 🔄 **Smart Skip Logic** — Automatically skips files when output already exists
45
+ - 📁 **Directory Structure Preservation** — Maintains input folder hierarchy in output
46
+ - 🛡️ **Robust Error Handling** — Exponential backoff retries for API rate limits
47
+
48
+ ## Installation
49
+
50
+ ### From RubyGems
51
+
52
+ ```bash
53
+ gem install tahweel
54
+ ```
55
+
56
+ ### Using Bundler
57
+
58
+ Add this line to your application's Gemfile:
59
+
60
+ ```ruby
61
+ gem 'tahweel'
62
+ ```
63
+
64
+ Then run:
65
+
66
+ ```bash
67
+ bundle install
68
+ ```
69
+
70
+ ### From Source
71
+
72
+ ```bash
73
+ git clone https://github.com/ieasybooks/tahweel.rb.git
74
+ cd tahweel.rb
75
+ bundle install
76
+ ```
77
+
78
+ ## Prerequisites
79
+
80
+ ### Ruby Version
81
+
82
+ Tahweel requires **Ruby 3.2.0** or higher.
83
+
84
+ ### Poppler Utils
85
+
86
+ Tahweel uses Poppler utilities (`pdftoppm` and `pdfinfo`) for splitting PDF files into images.
87
+
88
+ **macOS:**
89
+ ```bash
90
+ brew install poppler
91
+ ```
92
+
93
+ **Ubuntu/Debian:**
94
+ ```bash
95
+ sudo apt install poppler-utils
96
+ ```
97
+
98
+ **Windows:**
99
+
100
+ Tahweel automatically downloads and installs Poppler binaries on Windows when first run.
101
+
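As an illustration of how a page count can be read before splitting, `pdfinfo` prints a `Pages:` line in its text output. The following is a minimal sketch, not the gem's `PdfSplitter#total_pages`: `parse_total_pages` is a hypothetical helper, and scrubbing invalid bytes is only one plausible way to guard against non-UTF-8 metadata (the actual 0.1.4 Windows fix is not shown in this diff).

```ruby
# Hypothetical helper: extract the page count from `pdfinfo` text output.
def parse_total_pages(pdfinfo_output)
  # Treat the bytes as UTF-8 and replace invalid sequences so the regex
  # below runs against valid text.
  text = pdfinfo_output.dup.force_encoding(Encoding::UTF_8).scrub('?')
  match = text.match(/^Pages:\s+(\d+)/)
  match ? match[1].to_i : nil
end

# Sample output with a deliberately invalid byte in the title (assumed data).
sample = "Title:          \xE9xample\nProducer:       Some Scanner\nPages:          12\n"
puts parse_total_pages(sample)
```

Returning `nil` when no `Pages:` line is found leaves the error handling to the caller.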
102
+ ### Google Account
103
+
104
+ You'll need a Google account to authenticate with Google Drive's OCR service. The first time you run Tahweel, it will open a browser window for OAuth authentication.
105
+
106
+ ## Authentication
107
+
108
+ Tahweel uses OAuth 2.0 to authenticate with Google Drive. On first run:
109
+
110
+ 1. A browser window will open automatically
111
+ 2. Sign in with your Google account
112
+ 3. Grant Tahweel permission to create and manage files in your Google Drive
113
+ 4. After authorization, you'll see a success page and can close the browser
114
+
115
+ **Note:** Tahweel only creates temporary files for OCR processing and deletes them immediately after extraction. It uses the `drive.file` scope, which only allows access to files created by the application.
116
+
117
+ Your credentials are securely stored in:
118
+ - **Linux/macOS:** `~/.cache/tahweel/token.yaml`
119
+ - **Windows:** `%LOCALAPPDATA%\tahweel\token.yaml`
120
+
121
+ ### Clearing Credentials
122
+
123
+ To remove stored credentials and re-authenticate:
124
+
125
+ ```bash
126
+ tahweel-clear
127
+ ```
128
+
129
+ ## Usage
130
+
131
+ ### Command-Line Interface
132
+
133
+ #### Basic Usage
134
+
135
+ Convert a single PDF file:
136
+
137
+ ```bash
138
+ tahweel document.pdf
139
+ ```
140
+
141
+ Convert all supported files in a directory (PDFs and images by default):
142
+
143
+ ```bash
144
+ tahweel /path/to/documents/
145
+ ```
146
+
147
+ #### Output Formats
148
+
149
+ Specify output formats (default: `txt,docx`):
150
+
151
+ ```bash
152
+ # Text only
153
+ tahweel document.pdf -f txt
154
+
155
+ # DOCX only
156
+ tahweel document.pdf -f docx
157
+
158
+ # JSON only
159
+ tahweel document.pdf -f json
160
+
161
+ # Multiple formats
162
+ tahweel document.pdf -f txt,docx,json
163
+ ```
164
+
165
+ #### Custom Output Directory
166
+
167
+ ```bash
168
+ tahweel document.pdf -o /path/to/output/
169
+ ```
170
+
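The directory structure preservation and skip behavior listed under Features can be pictured as a path mapping. This is a sketch under assumptions: `output_path_for` and `skip?` are hypothetical names, and the gem's real rules (for example, how it handles several formats) may differ.

```ruby
require 'pathname'

# Hypothetical mapping: keep the input's path relative to the input root
# and swap the extension for the output format.
def output_path_for(input_file, input_root:, output_root:, format:)
  relative = Pathname.new(input_file).relative_path_from(Pathname.new(input_root))
  Pathname.new(output_root).join(relative).sub_ext(".#{format}").to_s
end

# Hypothetical skip check: do nothing when the target already exists.
def skip?(output_path)
  File.exist?(output_path)
end

target = output_path_for('/books/fiqh/vol1.pdf', input_root: '/books', output_root: '/out', format: 'txt')
puts target
puts skip?(target)
```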
171
+ #### Filter by File Extensions
172
+
173
+ ```bash
174
+ # Process only PDF files
175
+ tahweel /path/to/documents/ -e pdf
176
+
177
+ # Process only images
178
+ tahweel /path/to/documents/ -e jpg,jpeg,png
179
+ ```
180
+
181
+ #### Concurrency Settings
182
+
183
+ ```bash
184
+ # Process 4 files concurrently
185
+ tahweel /path/to/documents/ -F 4
186
+
187
+ # Use 8 concurrent OCR operations per file
188
+ tahweel /path/to/documents/ -O 8
189
+ ```
190
+
191
+ #### DPI Settings
192
+
193
+ A higher DPI produces better image quality but slows processing:
194
+
195
+ ```bash
196
+ tahweel document.pdf --dpi 300
197
+ ```
198
+
199
+ #### Custom Page Separator (TXT output)
200
+
201
+ ```bash
202
+ tahweel document.pdf --page-separator "\\n---PAGE BREAK---\\n"
203
+ ```
204
+
205
+ ### CLI Options Reference
206
+
207
+ | Option | Short | Description | Default |
208
+ |--------|-------|-------------|---------|
209
+ | `--extensions` | `-e` | File extensions to process | `pdf,jpg,jpeg,png` |
210
+ | `--dpi` | | DPI for PDF to image conversion | `150` |
211
+ | `--processor` | `-p` | OCR processor to use | `google_drive` |
212
+ | `--file-concurrency` | `-F` | Max concurrent files to process | `CPUs - 2` |
213
+ | `--ocr-concurrency` | `-O` | Max concurrent OCR operations | `12` |
214
+ | `--formats` | `-f` | Output formats (comma-separated) | `txt,docx` |
215
+ | `--page-separator` | | Page separator for TXT output | `\n\nPAGE_SEPARATOR\n\n` |
216
+ | `--output` | `-o` | Output directory | Input file directory |
217
+ | `--version` | `-v` | Display version | |
218
+
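The `CPUs - 2` default for `--file-concurrency` could be computed along these lines. This is a sketch only: `default_file_concurrency` is a hypothetical helper, and the floor of 1 is an assumption added so that machines with one or two cores still get a worker.

```ruby
require 'etc'

# Hypothetical helper mirroring the documented `CPUs - 2` default.
# The minimum of 1 is an assumption, not documented behavior.
def default_file_concurrency(cpu_count = Etc.nprocessors)
  [cpu_count - 2, 1].max
end

puts default_file_concurrency(8)
puts default_file_concurrency(2)
```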
219
+ ### Graphical User Interface
220
+
221
+ Launch the desktop GUI:
222
+
223
+ ```bash
224
+ tahweel-ui
225
+ ```
226
+
227
+ The GUI provides:
228
+ - Single file or folder conversion
229
+ - Arabic and English interface
230
+ - Global and per-file progress tracking
231
+ - Automatic opening of output directory on completion
232
+
233
+ ### Progress Display
234
+
235
+ The CLI shows a real-time progress dashboard:
236
+
237
+ ```
238
+ Total Progress: [3/10] 30.0% | Time: 45s
239
+ [Worker 1] document1.pdf | Ocr | 75.0% (6/8)
240
+ [Worker 2] document2.pdf | Splitting | 50.0% (5/10)
241
+ [Worker 3] Idle
242
+ [Worker 4] document4.pdf | Ocr | 25.0% (2/8)
243
+ ```
244
+
245
+ ## Output Formats
246
+
247
+ ### TXT (Plain Text)
248
+
249
+ Simple text output with configurable page separators:
250
+
251
+ ```
252
+ Page 1 content here...
253
+
254
+ PAGE_SEPARATOR
255
+
256
+ Page 2 content here...
257
+ ```
258
+
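The TXT layout above amounts to joining the per-page strings with the separator. A minimal sketch, assuming pages arrive as an array of strings; `join_pages` is a hypothetical helper and its default copies the documented `--page-separator` value.

```ruby
# Hypothetical helper: join page texts with the documented default separator.
def join_pages(pages, separator = "\n\nPAGE_SEPARATOR\n\n")
  pages.join(separator)
end

pages = ['Page 1 content here...', 'Page 2 content here...']
puts join_pages(pages)
```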
259
+ ### DOCX (Microsoft Word)
260
+
261
+ Formatted Word documents with:
262
+ - One page of content per document page
263
+ - Automatic text direction (RTL for Arabic, LTR otherwise)
264
+ - Normalized line endings compatible with all platforms
265
+ - Intelligent line merging for better readability
266
+
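One way to picture the automatic text direction is a script-counting heuristic. The sketch below is an assumption, not the gem's detection logic: `rtl?` is a hypothetical helper that treats a page as right-to-left when Arabic-script characters outnumber Latin letters.

```ruby
# Hypothetical heuristic: compare Arabic-script and Latin letter counts.
def rtl?(text)
  arabic = text.scan(/\p{Arabic}/).size
  latin = text.scan(/[A-Za-z]/).size
  arabic > latin
end

puts rtl?('مرحبا بالعالم')
puts rtl?('Hello world')
```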
267
+ ### JSON (Structured Data)
268
+
269
+ Page-by-page structured output:
270
+
271
+ ```json
272
+ [
273
+ {
274
+ "page": 1,
275
+ "content": "Page 1 content here..."
276
+ },
277
+ {
278
+ "page": 2,
279
+ "content": "Page 2 content here..."
280
+ }
281
+ ]
282
+ ```
283
+
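The JSON shape above can be produced from an array of page strings with Ruby's standard `json` library. A minimal sketch; `pages_to_json` is a hypothetical helper, and numbering pages from 1 follows the sample output rather than any confirmed implementation detail.

```ruby
require 'json'

# Hypothetical helper: build the page-by-page structure shown above.
def pages_to_json(pages)
  records = pages.each_with_index.map do |content, index|
    { page: index + 1, content: content }
  end
  JSON.pretty_generate(records)
end

puts pages_to_json(['Page 1 content here...', 'Page 2 content here...'])
```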
284
+ ## API Reference
285
+
286
+ ### Converting PDF Files
287
+
288
+ ```ruby
289
+ require 'tahweel'
290
+
291
+ # Convert a PDF to text (returns array of page texts)
292
+ pages = Tahweel.convert('document.pdf')
293
+
294
+ # With options
295
+ pages = Tahweel.convert(
296
+ 'document.pdf',
297
+ dpi: 300, # Higher quality
298
+ processor: :google_drive,
299
+ concurrency: 8
300
+ )
301
+
302
+ # With progress tracking
303
+ pages = Tahweel.convert('document.pdf') do |progress|
304
+ puts "Stage: #{progress[:stage]}"
305
+ puts "Progress: #{progress[:percentage]}%"
306
+ puts "Current page: #{progress[:current_page]}"
307
+ end
308
+ ```
309
+
310
+ ### Extracting Text from Images
311
+
312
+ ```ruby
313
+ require 'tahweel'
314
+
315
+ # Extract text from a single image
316
+ text = Tahweel.extract('image.png')
317
+ text = Tahweel.extract('photo.jpg', processor: :google_drive)
318
+ ```
319
+
320
+ ### Writing Output Files
321
+
322
+ ```ruby
323
+ require 'tahweel'
324
+
325
+ pages = Tahweel.convert('document.pdf')
326
+
327
+ # Write to multiple formats
328
+ Tahweel::Writer.write(pages, 'output', formats: [:txt, :docx, :json])
329
+
330
+ # Write to a single format with options
331
+ Tahweel::Writer.write(
332
+ pages,
333
+ 'output',
334
+ formats: [:txt],
335
+ page_separator: "\n---\n"
336
+ )
337
+ ```
338
+
339
+ ### Full Processing Pipeline
340
+
341
+ ```ruby
342
+ require 'tahweel'
343
+
344
+ # Using the CLI FileProcessor for complete workflow
345
+ Tahweel::CLI::FileProcessor.process('document.pdf', {
346
+ dpi: 150,
347
+ processor: :google_drive,
348
+ ocr_concurrency: 12,
349
+ formats: [:txt, :docx],
350
+ output: '/path/to/output'
351
+ }) do |progress|
352
+ puts "#{progress[:stage]}: #{progress[:percentage]}%"
353
+ end
354
+ ```
355
+
356
+ ### Collecting Files from Directory
357
+
358
+ ```ruby
359
+ require 'tahweel'
360
+
361
+ # Get all supported files in a directory
362
+ files = Tahweel::CLI::FileCollector.collect('/path/to/documents/')
363
+
364
+ # Filter by specific extensions
365
+ files = Tahweel::CLI::FileCollector.collect(
366
+ '/path/to/documents/',
367
+ extensions: ['pdf']
368
+ )
369
+ ```
370
+
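For readers curious what extension-based collection involves, here is a self-contained sketch using only the standard library. `collect_files` is a hypothetical stand-in, not `Tahweel::CLI::FileCollector`; the recursive search and case-insensitive matching are assumptions.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical stand-in: recursively list files whose extension is in the
# given list (case-insensitive), sorted for a stable order.
def collect_files(directory, extensions: %w[pdf jpg jpeg png])
  wanted = extensions.map(&:downcase)
  Dir.glob('**/*', base: directory)
     .map { |relative| File.join(directory, relative) }
     .select { |path| File.file?(path) && wanted.include?(File.extname(path).delete_prefix('.').downcase) }
     .sort
end

# Demonstrate against a throwaway directory.
Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, 'nested'))
  %w[a.pdf b.PNG notes.txt nested/c.jpg].each do |name|
    File.write(File.join(dir, name), '')
  end
  puts collect_files(dir).map { |path| File.basename(path) }
  puts collect_files(dir, extensions: ['pdf']).size
end
```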
371
+ ## Examples
372
+
373
+ ### Batch Convert Arabic Books
374
+
375
+ ```bash
376
+ # Convert all PDFs in an Arabic books directory with high quality
377
+ tahweel ~/arabic-books/ -f txt,docx --dpi 200 -o ~/converted-books/
378
+ ```
379
+
380
+ ### Process Scanned Documents
381
+
382
+ ```bash
383
+ # Convert scanned images to searchable text
384
+ tahweel ~/scanned-docs/ -e jpg,png -f txt -o ~/ocr-output/
385
+ ```
386
+
387
+ ### Library Integration
388
+
389
+ ```ruby
390
+ require 'tahweel'
391
+
392
+ # Convert and process in your application
393
+ def process_document(pdf_path)
394
+ pages = Tahweel.convert(pdf_path) do |progress|
395
+ update_progress_bar(progress[:percentage])
396
+ end
397
+
398
+ # Process the extracted text
399
+ full_text = pages.join("\n\n")
400
+ word_count = full_text.split.size
401
+
402
+ {
403
+ pages: pages.size,
404
+ words: word_count,
405
+ text: full_text
406
+ }
407
+ end
408
+ ```
409
+
410
+ ## Troubleshooting
411
+
412
+ ### File Descriptor Limits
413
+
414
+ If you encounter connection errors or freezing with large batches, raise the open file descriptor limit for the current shell session:
415
+
416
+ ```bash
417
+ ulimit -n 4096
418
+ ```
419
+
420
+ ### Rate Limiting
421
+
422
+ Tahweel automatically handles Google API rate limits with exponential backoff. If you still encounter issues, try reducing concurrency:
423
+
424
+ ```bash
425
+ tahweel documents/ -F 2 -O 6
426
+ ```
427
+
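Exponential backoff, as mentioned above, means doubling the wait after each failed attempt. The sketch below shows the general technique only: `with_backoff` is a hypothetical helper, the delay values are assumptions, and a real client would likely rescue only rate-limit errors and add jitter.

```ruby
# Hypothetical retry helper: re-run the block after 1s, 2s, 4s, ... delays.
# `sleeper` is injectable so the example finishes instantly.
def with_backoff(max_attempts: 5, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    # Give up once the attempt budget is spent.
    raise if attempt >= max_attempts

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Simulate two rate-limit failures followed by a success.
delays = []
result = with_backoff(sleeper: ->(seconds) { delays << seconds }) do |attempt|
  raise 'rate limited' if attempt < 3

  "succeeded on attempt #{attempt}"
end
puts result
puts delays.inspect
```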
428
+ ### Poppler Not Found
429
+
430
+ On macOS or Linux, ensure Poppler is installed and on your PATH:
431
+
432
+ ```bash
433
+ which pdftoppm # Should return a path
434
+ ```
435
+
436
+ ## Contributing
437
+
438
+ Bug reports and pull requests are welcome on GitHub at https://github.com/ieasybooks/tahweel.rb.
439
+
440
+ 1. Fork the repository
441
+ 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
442
+ 3. Commit your changes (`git commit -am 'Add amazing feature'`)
443
+ 4. Push to the branch (`git push origin feature/amazing-feature`)
444
+ 5. Open a Pull Request
445
+
446
+ ### Development
447
+
448
+ After checking out the repo:
449
+
450
+ ```bash
451
+ bin/setup # Install dependencies
452
+ rake spec # Run tests
453
+ bin/console # Interactive prompt
454
+ ```
455
+
456
+ ## License
457
+
458
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
459
+
460
+ ## Code of Conduct
461
+
462
+ Everyone interacting in the Tahweel project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the [code of conduct](https://github.com/ieasybooks/tahweel.rb/blob/main/CODE_OF_CONDUCT.md).
463
+
464
+ ---
465
+
466
+ <p align="center">
467
+ Made with ❤️ by <a href="https://github.com/ieasybooks">iEasyBooks</a>
468
+ </p>