PyPI - gemini-ocr-cli - Versions diff: 0.3.0 tar.gz → 0.3.2 tar.gz


Files changed (27)

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/.env.example RENAMED

@@ -5,8 +5,8 @@
 # Get one at: https://aistudio.google.com/apikey
 GEMINI_API_KEY=your-api-key-here
-# Optional: Model to use (default: gemini-3.1-flash-lite-preview)
-# GEMINI_MODEL=gemini-3.1-flash-lite-preview
+# Optional: Model to use (default: gemini-3-flash-preview)
+# GEMINI_MODEL=gemini-3-flash-preview
 # Optional: Maximum file size in MB (default: 50)
 # GEMINI_MAX_FILE_SIZE_MB=50

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gemini-ocr-cli
-Version: 0.3.0
+Version: 0.3.2
 Summary: CLI tool for OCR processing using Google Gemini's vision capabilities
 Project-URL: Homepage, https://github.com/r-uben/gemini-ocr-cli
 Project-URL: Repository, https://github.com/r-uben/gemini-ocr-cli
@@ -45,6 +45,18 @@ Description-Content-Type: text/markdown
 A command-line tool for OCR processing using Google Gemini's vision capabilities. Process PDFs and images to extract text, tables, equations, and figures.
+## Choosing an OCR tool
+This is one of five OCR CLI tools with a shared design: clean Markdown output, batch processing, and figure extraction. Pick based on your constraints:
+| Tool | Engine | Runs | Cost | Best for |
+|------|--------|------|------|----------|
+| [deepseek-ocr-cli](https://github.com/r-uben/deepseek-ocr-cli) | DeepSeek vision | Local (Ollama / vLLM) | Free | General-purpose local OCR with multi-backend flexibility |
+| **gemini-ocr-cli** (this repo) | Google Gemini | Cloud API | Free tier / Pay-per-use | Fast cloud OCR with concurrent processing |
+| [marker-ocr-cli](https://github.com/r-uben/marker-ocr-cli) | Marker (Surya + Texify) | Local | Free | Academic papers with equations, tables, complex layouts |
+| [mistral-ocr-cli](https://github.com/r-uben/mistral-ocr-cli) | Mistral OCR API | Cloud API | ~$1/1k pages | Structured extraction (tables, headers, footers) |
+| [nougat-ocr-cli](https://github.com/r-uben/nougat-ocr-cli) | Meta Nougat | Local (GPU) | Free | Academic papers, GPU-accelerated batch processing |
 ## Installation
 Requires Python 3.11+ and a [Google Gemini API key](https://aistudio.google.com/apikey).
@@ -88,7 +100,7 @@ Usage: gemini-ocr [OPTIONS] INPUT_PATH
 Options:
   -o, --output-dir PATH           Output directory (default: <input_dir>/gemini_ocr_output/)
   --api-key TEXT                  Gemini API key (or set GEMINI_API_KEY env var)
-  --model TEXT                    Model to use (default: gemini-3.1-flash-lite-preview)
+  --model TEXT                    Model to use (default: gemini-3-flash-preview)
   --task [convert|extract|table|describe_figure]
                                   OCR task type (default: convert)
   --prompt TEXT                   Custom prompt for OCR processing
@@ -136,7 +148,7 @@ All CLI options can also be set via environment variables or a `.env` file:
 | CLI flag | Environment variable | Default |
 |----------|---------------------|---------|
 | `--api-key` | `GEMINI_API_KEY` | (required) |
-| `--model` | `GEMINI_MODEL` | `gemini-3.1-flash-lite-preview` |
+| `--model` | `GEMINI_MODEL` | `gemini-3-flash-preview` |
 | `--include-images` | `GEMINI_INCLUDE_IMAGES` | `true` |
 | `--save-originals` | `GEMINI_SAVE_ORIGINAL_IMAGES` | `true` |
 | `--workers` | `GEMINI_MAX_WORKERS` | `1` |

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/README.md RENAMED

@@ -7,6 +7,18 @@
 A command-line tool for OCR processing using Google Gemini's vision capabilities. Process PDFs and images to extract text, tables, equations, and figures.
+## Choosing an OCR tool
+This is one of five OCR CLI tools with a shared design: clean Markdown output, batch processing, and figure extraction. Pick based on your constraints:
+| Tool | Engine | Runs | Cost | Best for |
+|------|--------|------|------|----------|
+| [deepseek-ocr-cli](https://github.com/r-uben/deepseek-ocr-cli) | DeepSeek vision | Local (Ollama / vLLM) | Free | General-purpose local OCR with multi-backend flexibility |
+| **gemini-ocr-cli** (this repo) | Google Gemini | Cloud API | Free tier / Pay-per-use | Fast cloud OCR with concurrent processing |
+| [marker-ocr-cli](https://github.com/r-uben/marker-ocr-cli) | Marker (Surya + Texify) | Local | Free | Academic papers with equations, tables, complex layouts |
+| [mistral-ocr-cli](https://github.com/r-uben/mistral-ocr-cli) | Mistral OCR API | Cloud API | ~$1/1k pages | Structured extraction (tables, headers, footers) |
+| [nougat-ocr-cli](https://github.com/r-uben/nougat-ocr-cli) | Meta Nougat | Local (GPU) | Free | Academic papers, GPU-accelerated batch processing |
 ## Installation
 Requires Python 3.11+ and a [Google Gemini API key](https://aistudio.google.com/apikey).
@@ -50,7 +62,7 @@ Usage: gemini-ocr [OPTIONS] INPUT_PATH
 Options:
   -o, --output-dir PATH           Output directory (default: <input_dir>/gemini_ocr_output/)
   --api-key TEXT                  Gemini API key (or set GEMINI_API_KEY env var)
-  --model TEXT                    Model to use (default: gemini-3.1-flash-lite-preview)
+  --model TEXT                    Model to use (default: gemini-3-flash-preview)
   --task [convert|extract|table|describe_figure]
                                   OCR task type (default: convert)
   --prompt TEXT                   Custom prompt for OCR processing
@@ -98,7 +110,7 @@ All CLI options can also be set via environment variables or a `.env` file:
 | CLI flag | Environment variable | Default |
 |----------|---------------------|---------|
 | `--api-key` | `GEMINI_API_KEY` | (required) |
-| `--model` | `GEMINI_MODEL` | `gemini-3.1-flash-lite-preview` |
+| `--model` | `GEMINI_MODEL` | `gemini-3-flash-preview` |
 | `--include-images` | `GEMINI_INCLUDE_IMAGES` | `true` |
 | `--save-originals` | `GEMINI_SAVE_ORIGINAL_IMAGES` | `true` |
 | `--workers` | `GEMINI_MAX_WORKERS` | `1` |

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/gemini_ocr/__init__.py RENAMED

@@ -1,6 +1,6 @@
 """Gemini OCR CLI - Document processing using Google Gemini's vision capabilities."""
-__version__ = "0.3.0"
+__version__ = "0.3.2"
 from gemini_ocr.config import Config
 from gemini_ocr.processor import OCRProcessor

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/gemini_ocr/cli.py RENAMED

@@ -23,7 +23,8 @@ from gemini_ocr.utils import (
 console = Console()
 # Get original working directory if set (for wrapper scripts)
-ORIGINAL_CWD = os.environ.get("GEMINI_OCR_CWD", os.getcwd())
+_cwd_override = os.environ.get("GEMINI_OCR_CWD", "")
+ORIGINAL_CWD = _cwd_override if _cwd_override and Path(_cwd_override).is_absolute() else os.getcwd()
 def _resolve_path(path: Path) -> Path:
@@ -50,8 +51,8 @@ def _resolve_path(path: Path) -> Path:
 @click.option(
     "--model",
     type=str,
-    default="gemini-3.1-flash-lite-preview",
-    help="Gemini model to use (default: gemini-3.1-flash-lite-preview)",
+    default="gemini-3-flash-preview",
+    help="Gemini model to use (default: gemini-3-flash-preview)",
 )
 @click.option(
     "--task",
@@ -174,10 +175,12 @@ def cli(
         if env_file:
             config = Config.from_env(env_file)
         else:
-            if api_key:
-                os.environ["GEMINI_API_KEY"] = api_key
             config = Config.from_env()
+        # Pass CLI api_key directly to config (don't pollute os.environ)
+        if api_key:
+            config.api_key = api_key
         # Override with CLI options
         config.model = model
         config.include_images = include_images
@@ -213,7 +216,7 @@ def cli(
         if verbose:
             import traceback
-            traceback.print_exc()
+            traceback.print_exc(file=sys.stderr)
         sys.exit(1)
@@ -260,9 +263,9 @@ def _show_info(api_key: str | None = None) -> None:
     console.print()
     try:
-        if api_key:
-            os.environ["GEMINI_API_KEY"] = api_key
         config = Config.from_env()
+        if api_key:
+            config.api_key = api_key
         config_table = Table(title="Configuration")
         config_table.add_column("Setting", style="cyan")
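
The `GEMINI_OCR_CWD` change in this file accepts the override only when it is a non-empty absolute path, and otherwise falls back to the process working directory. A minimal sketch of that rule in isolation follows. The helper name `resolve_original_cwd` and its explicit `env` parameter are hypothetical, introduced here only so the rule can be exercised without touching `os.environ`; they are not part of the package.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_original_cwd(env: Mapping[str, str]) -> str:
    """Return the override only if it is a non-empty absolute path (hypothetical helper)."""
    override = env.get("GEMINI_OCR_CWD", "")
    if override and Path(override).is_absolute():
        return override
    # Empty, missing, or relative overrides are ignored.
    return os.getcwd()


# Build an absolute path in a platform-independent way for the demo.
absolute = str(Path.cwd() / "project")

print(resolve_original_cwd({"GEMINI_OCR_CWD": absolute}) == absolute)
print(resolve_original_cwd({"GEMINI_OCR_CWD": "../elsewhere"}) == os.getcwd())
print(resolve_original_cwd({}) == os.getcwd())
```

Rejecting relative values means a wrapper script cannot accidentally make path resolution depend on a directory that is itself relative to wherever the process happens to start.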

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/gemini_ocr/config.py RENAMED

@@ -42,7 +42,7 @@ class Config(BaseSettings):
     # Model Configuration
     model: str = Field(
-        default="gemini-3.1-flash-lite-preview",
+        default="gemini-3-flash-preview",
         description="Gemini model to use for OCR",
     )

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/gemini_ocr/processor.py RENAMED

@@ -2,6 +2,7 @@
 import io
 import logging
+import re
 import shutil
 import threading
 import time
@@ -35,15 +36,12 @@ console = Console()
 # OCR prompts for different tasks
 OCR_PROMPTS = {
-    "convert": """Extract all text from this document and convert it to clean markdown format.
-Rules:
-- Preserve the document structure (headings, paragraphs, lists, tables)
-- Convert tables to markdown table format
-- Preserve mathematical equations in LaTeX format where possible
-- Include figure/image captions if present
-- Do not describe images, just note their presence as [Figure X] or [Image]
-- Output ONLY the extracted text in markdown, no commentary""",
+    "convert": """Convert this document into well-structured markdown.
+- Maintain headings, paragraphs, lists, and tables (use markdown table format).
+- Represent equations in LaTeX syntax.
+- Preserve figure captions as [Figure N: <caption>]. Do not describe figure contents.
+- Output only the resulting markdown, no commentary.""",
     "extract": """Extract all visible text from this document exactly as it appears.
 Output only the extracted text, preserving line breaks and spacing.""",
     "describe_figure": """Analyze this figure/chart/diagram in detail:
@@ -101,23 +99,67 @@ class OCRProcessor:
         error_str = str(error).lower()
         return "429" in error_str or "rate limit" in error_str or "quota" in error_str
+    # Gemini 3.x Flash models use thinking architecture and need explicit config
+    # to avoid empty responses (thinking stalls at low temperature).
+    # Does NOT match: gemini-2.x (different thinking API), gemini-3-pro (not Flash)
+    _GEMINI_3_FLASH_RE = re.compile(r"gemini-3(?:\.\d+)?-flash")
+    def _build_generation_config(self) -> types.GenerateContentConfig:
+        """Build GenerateContentConfig, adding thinking config for Gemini 3 Flash models."""
+        kwargs: dict[str, Any] = {"temperature": 0.1}
+        if self._GEMINI_3_FLASH_RE.search(self.model_name):
+            kwargs["thinking_config"] = types.ThinkingConfig(
+                thinking_level="MINIMAL",
+            )
+        return types.GenerateContentConfig(**kwargs)
+    @staticmethod
+    def _extract_text(response: Any) -> str:
+        """Extract text from a GenerateContentResponse by walking parts explicitly.
+        The `.text` shortcut returns None when parts include thought summaries,
+        non-text parts, or when finish_reason != STOP — which is common with
+        Gemini 3.x thinking models. Walking parts is the reliable path.
+        """
+        candidates = getattr(response, "candidates", None) or []
+        if not candidates:
+            feedback = getattr(response, "prompt_feedback", None)
+            raise RuntimeError(f"Empty response: no candidates (prompt_feedback={feedback})")
+        candidate = candidates[0]
+        content = getattr(candidate, "content", None)
+        parts = getattr(content, "parts", None) or []
+        text = "".join(
+            p.text for p in parts if getattr(p, "text", None) and not getattr(p, "thought", False)
+        ).strip()
+        if not text:
+            finish = getattr(candidate, "finish_reason", None)
+            safety = getattr(candidate, "safety_ratings", None)
+            part_types = [type(p).__name__ for p in parts]
+            raise RuntimeError(
+                f"Empty response: finish_reason={finish}, "
+                f"len(parts)={len(parts)}, part_types={part_types}, "
+                f"safety_ratings={safety}"
+            )
+        return text
     def _call_with_retry(self, contents: list[Any], prompt: str) -> str:
         """Call generate_content with exponential backoff on transient errors."""
         max_attempts = self.config.max_retries + 1
         base_delay = self.config.retry_base_delay
+        config = self._build_generation_config()
         for attempt in range(max_attempts):
             try:
                 response = self.client.models.generate_content(
                     model=self.model_name,
                     contents=[prompt, *contents],
-                    config=types.GenerateContentConfig(
-                        temperature=0.1,
-                    ),
+                    config=config,
                 )
-                if response.text:
-                    return response.text.strip()
-                return ""
+                return self._extract_text(response)
             except Exception as e:
                 is_last = attempt == max_attempts - 1
                 if is_last or not self._is_retryable(e):
@@ -204,6 +246,7 @@ class OCRProcessor:
         start_time = time.time()
         self.config.validate_file_size(pdf_path)
+        uploaded_file = None
         try:
             if show_progress and not self.config.quiet:
                 with Progress(
@@ -245,6 +288,13 @@ class OCRProcessor:
                 error=str(e),
                 processing_time=time.time() - start_time,
             )
+        finally:
+            # Clean up uploaded file from Gemini Files API (48hr retention)
+            if uploaded_file is not None:
+                try:
+                    self.client.files.delete(name=uploaded_file.name)
+                except Exception as del_err:
+                    logger.debug(f"Failed to delete uploaded file: {del_err}")
     def process_file(
         self,
@@ -281,9 +331,11 @@ class OCRProcessor:
             shutil.copy2(result.file_path, original_output)
         # Write clean markdown — just the OCR text, no headers
-        markdown_path.write_text(
-            result.text if result.success else f"*[OCR Failed: {result.error}]*", encoding="utf-8"
-        )
+        if result.success:
+            markdown_path.write_text(result.text, encoding="utf-8")
+        else:
+            # Sanitize error: don't leak raw exception details to output files
+            markdown_path.write_text("*[OCR Failed]*", encoding="utf-8")
         # Save extracted images
         if result.extracted_images and self.config.include_images:
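
Two pieces of the `processor.py` change can be tried without an API key: the model-name pattern that decides whether a thinking config is added, and the part-walking text extraction. The sketch below copies the regular expression from the diff and mirrors the filtering logic of `_extract_text` with shortened error messages. The `SimpleNamespace` objects are hypothetical stand-ins for the SDK's response types, the function names `needs_thinking_config` and `extract_text` are illustrative, and the model names in the loop are sample strings chosen only to exercise the pattern.

```python
import re
from types import SimpleNamespace
from typing import Any

# Same pattern as in the diff: matches Gemini 3.x Flash model names only.
GEMINI_3_FLASH_RE = re.compile(r"gemini-3(?:\.\d+)?-flash")


def needs_thinking_config(model_name: str) -> bool:
    """True when the model name looks like a Gemini 3.x Flash model."""
    return GEMINI_3_FLASH_RE.search(model_name) is not None


def extract_text(response: Any) -> str:
    """Join the non-thought text parts of the first candidate."""
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        raise RuntimeError("Empty response: no candidates")
    content = getattr(candidates[0], "content", None)
    parts = getattr(content, "parts", None) or []
    text = "".join(
        p.text
        for p in parts
        if getattr(p, "text", None) and not getattr(p, "thought", False)
    ).strip()
    if not text:
        raise RuntimeError(f"Empty response: len(parts)={len(parts)}")
    return text


# Hypothetical stand-in for a response that mixes a thought part with text parts.
fake_response = SimpleNamespace(
    candidates=[
        SimpleNamespace(
            content=SimpleNamespace(
                parts=[
                    SimpleNamespace(text="internal reasoning", thought=True),
                    SimpleNamespace(text="# Title\n", thought=False),
                    SimpleNamespace(text="Body text.", thought=False),
                ]
            )
        )
    ]
)

print(extract_text(fake_response))
for name in (
    "gemini-3-flash-preview",
    "gemini-3.1-flash-lite-preview",
    "gemini-2.5-flash",
    "gemini-3-pro-preview",
):
    print(name, needs_thinking_config(name))
```

In this sketch the thought part is dropped and only the two text parts are joined, and a response with no usable text raises instead of returning an empty string, which matches the diff's move away from silently returning `""`.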

gemini_ocr_cli-0.3.2/gemini_ocr/retry.py ADDED Viewed

@@ -0,0 +1,104 @@
+"""Retry logic with exponential backoff for API calls."""
+import logging
+import time
+from functools import wraps
+from typing import Callable, Tuple, Type, TypeVar
+logger = logging.getLogger(__name__)
+T = TypeVar("T")
+class RetryError(Exception):
+    """Raised when all retry attempts are exhausted."""
+    def __init__(self, message: str, last_exception: Exception):
+        super().__init__(message)
+        self.last_exception = last_exception
+def retry(
+    max_attempts: int = 3,
+    backoff_factor: float = 2.0,
+    initial_delay: float = 1.0,
+    max_delay: float = 60.0,
+    exceptions: Tuple[Type[Exception], ...] = (Exception,),
+) -> Callable[[Callable[..., T]], Callable[..., T]]:
+    """Decorator for retrying functions with exponential backoff.
+    Args:
+        max_attempts: Maximum number of attempts (including first try)
+        backoff_factor: Multiplier for delay between retries
+        initial_delay: Initial delay in seconds
+        max_delay: Maximum delay in seconds
+        exceptions: Tuple of exception types to catch and retry
+    Returns:
+        Decorated function with retry logic
+    """
+    def decorator(func: Callable[..., T]) -> Callable[..., T]:
+        @wraps(func)
+        def wrapper(*args, **kwargs) -> T:
+            delay = initial_delay
+            last_exception = None
+            for attempt in range(1, max_attempts + 1):
+                try:
+                    return func(*args, **kwargs)
+                except exceptions as e:
+                    last_exception = e
+                    if attempt == max_attempts:
+                        logger.error(
+                            f"All {max_attempts} attempts failed for {func.__name__}: {e}"
+                        )
+                        raise RetryError(
+                            f"Failed after {max_attempts} attempts", last_exception
+                        ) from e
+                    logger.warning(
+                        f"Attempt {attempt}/{max_attempts} failed for {func.__name__}: {e}. "
+                        f"Retrying in {delay:.1f}s..."
+                    )
+                    time.sleep(delay)
+                    delay = min(delay * backoff_factor, max_delay)
+            # Should not reach here, but for type safety
+            raise RetryError(f"Failed after {max_attempts} attempts", last_exception)
+        return wrapper
+    return decorator
+def is_retryable_error(error: Exception) -> bool:
+    """Check if an error is retryable.
+    Args:
+        error: The exception to check
+    Returns:
+        True if the error is typically transient and retryable
+    """
+    error_str = str(error).lower()
+    # Rate limit errors
+    if "rate" in error_str and "limit" in error_str:
+        return True
+    if "429" in error_str or "too many requests" in error_str:
+        return True
+    # Server errors
+    if "500" in error_str or "502" in error_str or "503" in error_str:
+        return True
+    if "internal" in error_str and "error" in error_str:
+        return True
+    # Connection errors
+    if "timeout" in error_str:
+        return True
+    if "connection" in error_str:
+        return True
+    return False
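
The `retry` decorator above sleeps between attempts, never after the last one, and multiplies the delay by `backoff_factor` up to `max_delay`. The sketch below reproduces only that delay arithmetic so the schedule can be inspected without sleeping. The function name `backoff_delays` is hypothetical; the parameter names and defaults are taken from the decorator's signature.

```python
from typing import List


def backoff_delays(
    max_attempts: int = 3,
    backoff_factor: float = 2.0,
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
) -> List[float]:
    """List the sleep durations the decorator would use if every attempt failed."""
    delays: List[float] = []
    delay = initial_delay
    # One sleep between each pair of consecutive attempts: max_attempts - 1 in total.
    for _ in range(max_attempts - 1):
        delays.append(delay)
        delay = min(delay * backoff_factor, max_delay)
    return delays


print(backoff_delays())                # [1.0, 2.0]
print(backoff_delays(max_attempts=8))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

With the defaults, a call that keeps failing waits about 3 seconds in total before `RetryError` is raised; the cap only takes effect from the seventh sleep onward.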

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/gemini_ocr/utils.py RENAMED

@@ -56,10 +56,12 @@ def get_supported_files(directory: Path, recursive: bool = True) -> list[Path]:
 def sanitize_filename(filename: str, max_length: int | None = 200) -> str:
     """Sanitize filename for safe filesystem usage."""
-    sanitized = re.sub(r'[<>:"/\\|?*]', "_", filename)
+    # Strip null bytes and leading dots (prevent hidden files / path tricks)
+    sanitized = filename.replace("\x00", "")
+    sanitized = re.sub(r'[<>:"/\\|?*]', "_", sanitized)
     sanitized = re.sub(r"\s+", "_", sanitized)
     sanitized = re.sub(r"_+", "_", sanitized)
-    sanitized = sanitized.strip("_")
+    sanitized = sanitized.strip("_.")
     if max_length and len(sanitized) > max_length:
         sanitized = sanitized[:max_length]
     return sanitized or "unnamed"
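
To see what the new `sanitize_filename` does with awkward inputs, the 0.3.2 body from the hunk above is reproduced below as a standalone function, with `Optional[int]` substituted for `int | None` so it also runs on older interpreters. The sample inputs are illustrative. Note that `strip("_.")` removes dots and underscores from both ends, so trailing dots are dropped as well as leading ones.

```python
import re
from typing import Optional


def sanitize_filename(filename: str, max_length: Optional[int] = 200) -> str:
    """Sanitize filename for safe filesystem usage (0.3.2 logic from the diff)."""
    sanitized = filename.replace("\x00", "")             # drop null bytes
    sanitized = re.sub(r'[<>:"/\\|?*]', "_", sanitized)  # reserved characters
    sanitized = re.sub(r"\s+", "_", sanitized)           # whitespace runs
    sanitized = re.sub(r"_+", "_", sanitized)            # collapse underscores
    sanitized = sanitized.strip("_.")                    # both ends, not only leading
    if max_length and len(sanitized) > max_length:
        sanitized = sanitized[:max_length]
    return sanitized or "unnamed"


print(sanitize_filename(".hidden file.pdf"))  # hidden_file.pdf
print(sanitize_filename("../../etc/passwd"))  # etc_passwd
print(sanitize_filename("report\x00.pdf"))    # report.pdf
print(sanitize_filename("notes."))            # notes
print(sanitize_filename("..."))               # unnamed
```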

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "gemini-ocr-cli"
-version = "0.3.0"
+version = "0.3.2"
 description = "CLI tool for OCR processing using Google Gemini's vision capabilities"
 authors = [
     {name = "Ruben Fernandez-Fuertes", email = "fernandezfuertesruben@gmail.com"}

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/tests/conftest.py RENAMED

@@ -66,7 +66,7 @@ def mock_config():
     with patch.dict(os.environ, {"GEMINI_API_KEY": "test-api-key"}):
         config = Config()
         config.api_key = "test-api-key"
-        config.model = "gemini-3.1-flash-lite-preview"
+        config.model = "gemini-3-flash-preview"
         config.verbose = False
         config.quiet = False
         config.max_workers = 1

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/tests/test_config.py RENAMED

@@ -55,7 +55,7 @@ class TestConfigDefaults:
     def test_default_model(self):
         with patch.dict(os.environ, {"GEMINI_API_KEY": "test"}, clear=True):
             config = Config()
-            assert config.model == "gemini-3.1-flash-lite-preview"
+            assert config.model == "gemini-3-flash-preview"
     def test_default_max_file_size(self):
         with patch.dict(os.environ, {"GEMINI_API_KEY": "test"}, clear=True):

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/tests/test_metadata.py RENAMED

@@ -40,7 +40,7 @@ class TestMetadataManager:
         f.write_bytes(b"fake pdf content")
         meta = MetadataManager(tmp_path)
-        meta.record(f, processing_time=1.5, model="gemini-3.1-flash-lite-preview", output_path="test/test.md")
+        meta.record(f, processing_time=1.5, model="gemini-3-flash-preview", output_path="test/test.md")
         assert meta.is_processed(f)
@@ -98,12 +98,12 @@ class TestMetadataManager:
         f.write_bytes(b"data")
         meta = MetadataManager(tmp_path)
-        meta.record(f, processing_time=2.5, model="gemini-3.1-flash-lite-preview", output_path="test/test.md")
+        meta.record(f, processing_time=2.5, model="gemini-3-flash-preview", output_path="test/test.md")
         entry = meta.files["test.pdf"]
         assert entry["status"] == "completed"
         assert entry["processing_time"] == 2.5
-        assert entry["model"] == "gemini-3.1-flash-lite-preview"
+        assert entry["model"] == "gemini-3-flash-preview"
         assert entry["output_path"] == "test/test.md"
         assert "checksum" in entry
         assert "timestamp" in entry

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/.github/workflows/ci.yml RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/.gitignore RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/.pre-commit-config.yaml RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/CHANGELOG.md RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/LICENSE RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/gemini_ocr/__main__.py RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/gemini_ocr/metadata.py RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/tests/__init__.py RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/tests/test_cli.py RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/tests/test_import.py RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/tests/test_integration.py RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/tests/test_processor.py RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/tests/test_utils.py RENAMED

File without changes

{gemini_ocr_cli-0.3.0 → gemini_ocr_cli-0.3.2}/uv.lock RENAMED

File without changes
