gemini-ocr-cli 0.2.0__tar.gz

@@ -0,0 +1,16 @@
+ # Gemini OCR CLI Configuration
+
+ # Required: Your Google Gemini API key
+ # Get one at: https://makersuite.google.com/app/apikey
+ GEMINI_API_KEY=your-api-key-here
+
+ # Optional: Default model (default: gemini-2.0-flash-exp)
+ # Options: gemini-2.0-flash-exp, gemini-1.5-flash, gemini-1.5-pro
+ # GEMINI_MODEL=gemini-2.0-flash-exp
+
+ # Optional: PDF rendering DPI (default: 200)
+ # Higher = better quality but slower
+ # GEMINI_DPI=200
+
+ # Optional: Maximum file size in MB (default: 20)
+ # GEMINI_MAX_FILE_SIZE_MB=20
@@ -0,0 +1,57 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual environments
+ .venv/
+ venv/
+ ENV/
+
+ # IDE
+ .idea/
+ .vscode/
+ *.swp
+ *.swo
+ *~
+
+ # Testing
+ .pytest_cache/
+ .coverage
+ htmlcov/
+ .tox/
+ .nox/
+
+ # Type checking
+ .mypy_cache/
+ .dmypy.json
+
+ # Environment
+ .env
+ .env.local
+ .env.*.local
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Output
+ *_output/
+ gemini_ocr_output/
@@ -0,0 +1,47 @@
+ # Changelog
+
+ All notable changes to this project will be documented in this file.
+
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ## [0.2.0] - 2024-12-23
+
+ ### Changed
+
+ - **BREAKING**: Replaced page-by-page PDF processing with native Gemini Files API upload
+   - PDFs are now uploaded directly to Gemini (single API call per document)
+   - Significantly faster processing for multi-page documents
+   - Better quality: native PDF parsing preserves text, tables, and layout
+ - Updated default model from `gemini-2.0-flash-exp` to `gemini-3.0-flash`
+ - Simplified `OCRResult` dataclass (removed per-page tracking)
+
+ ### Added
+
+ - Retry logic with exponential backoff for API rate limits
+ - Comprehensive test suite (105 unit tests, integration tests)
+ - `token_count` field in `OCRResult` for usage tracking
+
+ ### Removed
+
+ - `--dpi` CLI flag (no longer applicable with native PDF upload)
+ - `GEMINI_DPI` environment variable
+ - Unused utility functions: `image_to_base64`, `pil_image_to_base64`, `save_base64_image`
+ - Module-level `settings` singleton from config
+
+ ### Fixed
+
+ - API key resolution now correctly prioritizes `GEMINI_API_KEY` over `GOOGLE_API_KEY`
+
+ ## [0.1.0] - 2024-12-22
+
+ ### Added
+
+ - Initial release
+ - PDF and image OCR using Google Gemini vision models
+ - CLI commands: `process`, `describe`, `info`
+ - Batch processing with progress tracking
+ - Incremental processing (skip already-processed files)
+ - Markdown output format
+ - Figure/chart description generation
+ - Support for multiple image formats (JPG, PNG, WEBP, GIF, BMP, TIFF)
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 Ruben Fernandez-Fuertes
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,193 @@
+ Metadata-Version: 2.4
+ Name: gemini-ocr-cli
+ Version: 0.2.0
+ Summary: CLI tool for OCR processing using Google Gemini's vision capabilities
+ Project-URL: Homepage, https://github.com/r-uben/gemini-ocr-cli
+ Project-URL: Repository, https://github.com/r-uben/gemini-ocr-cli
+ Project-URL: Issues, https://github.com/r-uben/gemini-ocr-cli/issues
+ Author-email: Ruben Fernandez-Fuertes <fernandezfuertesruben@gmail.com>
+ License: MIT
+ License-File: LICENSE
+ Keywords: cli,document-processing,gemini,google,ocr,pdf,vision
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Classifier: Topic :: Text Processing :: General
+ Requires-Python: >=3.10
+ Requires-Dist: click>=8.1.0
+ Requires-Dist: google-genai>=1.0.0
+ Requires-Dist: pillow>=10.0.0
+ Requires-Dist: pydantic-settings>=2.0.0
+ Requires-Dist: pydantic>=2.0.0
+ Requires-Dist: pymupdf>=1.24.0
+ Requires-Dist: python-dotenv>=1.0.0
+ Requires-Dist: rich>=13.0.0
+ Provides-Extra: dev
+ Requires-Dist: mypy>=1.0.0; extra == 'dev'
+ Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
+ Requires-Dist: pytest>=8.0.0; extra == 'dev'
+ Requires-Dist: ruff>=0.8.0; extra == 'dev'
+ Description-Content-Type: text/markdown
+
+ # Gemini OCR CLI
+
+ Command-line tool for OCR processing using Google Gemini's vision capabilities. Extract text, tables, equations, and figures from PDFs and images with high accuracy.
+
+ ## Features
+
+ - **Native PDF upload**: Direct PDF processing via Gemini Files API (fast, single API call)
+ - **Multi-format support**: PDF and images (JPG, PNG, WEBP, GIF, BMP, TIFF)
+ - **High-quality OCR**: Leverages Gemini's advanced vision models
+ - **Structure preservation**: Maintains headings, tables, lists, equations
+ - **Figure analysis**: Generate detailed descriptions of charts and diagrams
+ - **Batch processing**: Process entire directories with progress tracking
+ - **Incremental processing**: Skip already-processed files
+ - **Automatic retry**: Exponential backoff for API rate limits
+ - **Markdown output**: Clean, structured output format
+
+ ## Installation
+
+ ### From PyPI (recommended)
+
+ ```bash
+ pip install gemini-ocr-cli
+ ```
+
+ ### Using pipx
+
+ ```bash
+ pipx install gemini-ocr-cli
+ ```
+
+ ### From source
+
+ ```bash
+ git clone https://github.com/r-uben/gemini-ocr-cli.git
+ cd gemini-ocr-cli
+ uv pip install -e .
+ ```
+
+ ## Quick Start
+
+ ### API Key Resolution
+
+ The CLI automatically picks up your API key from environment variables (no configuration needed if already set):
+
+ **Priority order:**
+ 1. `--api-key` CLI argument (highest priority)
+ 2. `GEMINI_API_KEY` environment variable
+ 3. `GOOGLE_API_KEY` environment variable (fallback)
+ 4. `.env` file in current directory
+
+ ```bash
+ # Option 1: Set environment variable (recommended)
+ export GEMINI_API_KEY="your-api-key"
+
+ # Option 2: Use existing GOOGLE_API_KEY (auto-detected)
+ export GOOGLE_API_KEY="your-api-key"
+
+ # Option 3: Create a .env file
+ echo "GEMINI_API_KEY=your-api-key" > .env
+
+ # Option 4: Pass directly (not recommended for security)
+ gemini-ocr paper.pdf --api-key "your-api-key"
+ ```
+
+ ### Process documents
+
+ ```bash
+ # Single file
+ gemini-ocr paper.pdf
+
+ # Directory
+ gemini-ocr ./documents/ -o ./results/
+
+ # With custom model
+ gemini-ocr paper.pdf --model gemini-1.5-pro
+ ```
+
+ ### Describe figures
+
+ ```bash
+ # Analyze a chart/diagram
+ gemini-ocr describe chart.png
+
+ # Save to file
+ gemini-ocr describe figure.jpg -o description.md
+ ```
+
+ ## CLI Reference
+
+ ### `gemini-ocr process`
+
+ Process documents and images with OCR.
+
+ ```
+ Usage: gemini-ocr process [OPTIONS] INPUT_PATH
+
+ Options:
+   -o, --output-dir PATH           Output directory for results
+   --api-key TEXT                  Gemini API key
+   --model TEXT                    Model to use (default: gemini-3.0-flash)
+   --task [convert|extract|table]  OCR task type (default: convert)
+   --prompt TEXT                   Custom prompt for OCR
+   --include-images/--no-images    Extract embedded images (default: True)
+   --save-originals/--no-save-originals
+                                   Save original input images (default: True)
+   --add-timestamp/--no-timestamp  Add timestamp to output folder
+   --reprocess                     Reprocess existing files
+   --env-file PATH                 Path to .env file
+   -v, --verbose                   Enable verbose output
+ ```
+
+ ### `gemini-ocr describe`
+
+ Generate detailed descriptions of figures, charts, and diagrams.
+
+ ```
+ Usage: gemini-ocr describe [OPTIONS] IMAGE_PATH
+
+ Options:
+   --api-key TEXT     Gemini API key
+   --model TEXT       Model to use
+   -o, --output PATH  Output file (default: stdout)
+ ```
+
+ ### `gemini-ocr info`
+
+ Show configuration and system information.
+
+ ## Output Format
+
+ Results are saved as Markdown files with:
+ - File metadata (original path, processing time)
+ - Extracted text (full document)
+ - Embedded image references (if enabled)
+ - `metadata.json` tracking all processed files
+
+ ## Models
+
+ | Model | Speed | Quality | Cost | Recommended For |
+ |-------|-------|---------|------|-----------------|
+ | `gemini-3.0-flash` | Fast | Good | Low | Default, most documents |
+ | `gemini-1.5-flash` | Fast | Good | Low | Simple documents |
+ | `gemini-1.5-pro` | Slower | Best | Higher | Complex layouts, equations |
+
+ ## Environment Variables
+
+ | Variable | Description | Default |
+ |----------|-------------|---------|
+ | `GEMINI_API_KEY` | Google Gemini API key | Required |
+ | `GOOGLE_API_KEY` | Fallback API key | - |
+ | `GEMINI_MODEL` | Default model | `gemini-3.0-flash` |
+
+ ## License
+
+ MIT
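The "Automatic retry" feature in the README above is described only as exponential backoff for API rate limits, and the package's implementation is not part of this diff. As a rough illustration of the general technique, a minimal sketch might look like the following. The names `retry_with_backoff` and `RateLimitError` and all default values are hypothetical and are not taken from `gemini-ocr-cli` or `google-genai`.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the rate-limit error an API client raises."""


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponentially growing delays.

    All parameter names and defaults are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt (base, 2*base, 4*base, ...), capped.
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
```

The `sleep` parameter is injectable so that the waiting behaviour can be exercised in tests without real delays.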
@@ -0,0 +1,155 @@
+ # Gemini OCR CLI
+
+ Command-line tool for OCR processing using Google Gemini's vision capabilities. Extract text, tables, equations, and figures from PDFs and images with high accuracy.
+
+ ## Features
+
+ - **Native PDF upload**: Direct PDF processing via Gemini Files API (fast, single API call)
+ - **Multi-format support**: PDF and images (JPG, PNG, WEBP, GIF, BMP, TIFF)
+ - **High-quality OCR**: Leverages Gemini's advanced vision models
+ - **Structure preservation**: Maintains headings, tables, lists, equations
+ - **Figure analysis**: Generate detailed descriptions of charts and diagrams
+ - **Batch processing**: Process entire directories with progress tracking
+ - **Incremental processing**: Skip already-processed files
+ - **Automatic retry**: Exponential backoff for API rate limits
+ - **Markdown output**: Clean, structured output format
+
+ ## Installation
+
+ ### From PyPI (recommended)
+
+ ```bash
+ pip install gemini-ocr-cli
+ ```
+
+ ### Using pipx
+
+ ```bash
+ pipx install gemini-ocr-cli
+ ```
+
+ ### From source
+
+ ```bash
+ git clone https://github.com/r-uben/gemini-ocr-cli.git
+ cd gemini-ocr-cli
+ uv pip install -e .
+ ```
+
+ ## Quick Start
+
+ ### API Key Resolution
+
+ The CLI automatically picks up your API key from environment variables (no configuration needed if already set):
+
+ **Priority order:**
+ 1. `--api-key` CLI argument (highest priority)
+ 2. `GEMINI_API_KEY` environment variable
+ 3. `GOOGLE_API_KEY` environment variable (fallback)
+ 4. `.env` file in current directory
+
+ ```bash
+ # Option 1: Set environment variable (recommended)
+ export GEMINI_API_KEY="your-api-key"
+
+ # Option 2: Use existing GOOGLE_API_KEY (auto-detected)
+ export GOOGLE_API_KEY="your-api-key"
+
+ # Option 3: Create a .env file
+ echo "GEMINI_API_KEY=your-api-key" > .env
+
+ # Option 4: Pass directly (not recommended for security)
+ gemini-ocr paper.pdf --api-key "your-api-key"
+ ```
+
+ ### Process documents
+
+ ```bash
+ # Single file
+ gemini-ocr paper.pdf
+
+ # Directory
+ gemini-ocr ./documents/ -o ./results/
+
+ # With custom model
+ gemini-ocr paper.pdf --model gemini-1.5-pro
+ ```
+
+ ### Describe figures
+
+ ```bash
+ # Analyze a chart/diagram
+ gemini-ocr describe chart.png
+
+ # Save to file
+ gemini-ocr describe figure.jpg -o description.md
+ ```
+
+ ## CLI Reference
+
+ ### `gemini-ocr process`
+
+ Process documents and images with OCR.
+
+ ```
+ Usage: gemini-ocr process [OPTIONS] INPUT_PATH
+
+ Options:
+   -o, --output-dir PATH           Output directory for results
+   --api-key TEXT                  Gemini API key
+   --model TEXT                    Model to use (default: gemini-3.0-flash)
+   --task [convert|extract|table]  OCR task type (default: convert)
+   --prompt TEXT                   Custom prompt for OCR
+   --include-images/--no-images    Extract embedded images (default: True)
+   --save-originals/--no-save-originals
+                                   Save original input images (default: True)
+   --add-timestamp/--no-timestamp  Add timestamp to output folder
+   --reprocess                     Reprocess existing files
+   --env-file PATH                 Path to .env file
+   -v, --verbose                   Enable verbose output
+ ```
+
+ ### `gemini-ocr describe`
+
+ Generate detailed descriptions of figures, charts, and diagrams.
+
+ ```
+ Usage: gemini-ocr describe [OPTIONS] IMAGE_PATH
+
+ Options:
+   --api-key TEXT     Gemini API key
+   --model TEXT       Model to use
+   -o, --output PATH  Output file (default: stdout)
+ ```
+
+ ### `gemini-ocr info`
+
+ Show configuration and system information.
+
+ ## Output Format
+
+ Results are saved as Markdown files with:
+ - File metadata (original path, processing time)
+ - Extracted text (full document)
+ - Embedded image references (if enabled)
+ - `metadata.json` tracking all processed files
+
+ ## Models
+
+ | Model | Speed | Quality | Cost | Recommended For |
+ |-------|-------|---------|------|-----------------|
+ | `gemini-3.0-flash` | Fast | Good | Low | Default, most documents |
+ | `gemini-1.5-flash` | Fast | Good | Low | Simple documents |
+ | `gemini-1.5-pro` | Slower | Best | Higher | Complex layouts, equations |
+
+ ## Environment Variables
+
+ | Variable | Description | Default |
+ |----------|-------------|---------|
+ | `GEMINI_API_KEY` | Google Gemini API key | Required |
+ | `GOOGLE_API_KEY` | Fallback API key | - |
+ | `GEMINI_MODEL` | Default model | `gemini-3.0-flash` |
+
+ ## License
+
+ MIT
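The "API Key Resolution" priority order in the README above (CLI argument, then `GEMINI_API_KEY`, then `GOOGLE_API_KEY`, then a `.env` file) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the package's code: the helper names `read_env_file` and `resolve_api_key` are hypothetical, the `.env` parser handles only simple `KEY=VALUE` lines, and looking up `GOOGLE_API_KEY` inside the `.env` file is an assumption the README does not confirm.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def resolve_api_key(cli_key=None, environ=None, env_file=".env"):
    """Return the first key found in the documented priority order, else None."""
    environ = os.environ if environ is None else environ
    # 1. Explicit --api-key argument wins.
    if cli_key:
        return cli_key
    # 2. and 3. Environment variables, GEMINI_API_KEY before GOOGLE_API_KEY.
    for name in ("GEMINI_API_KEY", "GOOGLE_API_KEY"):
        if environ.get(name):
            return environ[name]
    # 4. Fall back to the .env file (same variable order, by assumption).
    file_values = read_env_file(env_file)
    return file_values.get("GEMINI_API_KEY") or file_values.get("GOOGLE_API_KEY")
```

Passing `environ` as a plain dictionary keeps the lookup independent of the real process environment, which makes the ordering easy to check in isolation.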
@@ -0,0 +1,8 @@
+ """Gemini OCR CLI - Document processing using Google Gemini's vision capabilities."""
+
+ __version__ = "0.2.0"
+
+ from gemini_ocr.processor import OCRProcessor
+ from gemini_ocr.config import Config
+
+ __all__ = ["OCRProcessor", "Config", "__version__"]
@@ -0,0 +1,6 @@
+ """Allow running as python -m gemini_ocr."""
+
+ from gemini_ocr.cli import main
+
+ if __name__ == "__main__":
+     main()