content-core 1.2.0 → 1.2.1 (PyPI, tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of content-core might be problematic.

Files changed (89)

{content_core-1.2.0 → content_core-1.2.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: content-core
-Version: 1.2.0
+Version: 1.2.1
 Summary: Extract what matters from any media source. Available as Python Library, macOS Service, CLI and MCP Server
 Author-email: LUIS NOVO <lfnovo@gmail.com>
 License-File: LICENSE
@@ -112,11 +112,17 @@ summary = await cc.summarize_content(result, context="explain to a child")
 Install Content Core using `pip`:
 ```bash
-# Install the package
+# Basic installation (PyMuPDF + BeautifulSoup/Jina extraction)
 pip install content-core
-# Install with MCP server support
+# With enhanced document processing (adds Docling)
+pip install content-core[docling]
+# With MCP server support
 pip install content-core[mcp]
+# Full installation
+pip install content-core[docling,mcp]
 ```
 Alternatively, if you’re developing locally:
@@ -526,8 +532,21 @@ Example `.env`:
 ```plaintext
 OPENAI_API_KEY=your-key-here
 GOOGLE_API_KEY=your-key-here
+# Engine Selection (optional)
+CCORE_DOCUMENT_ENGINE=auto  # auto, simple, docling
+CCORE_URL_ENGINE=auto       # auto, simple, firecrawl, jina
 ```
+### Engine Selection via Environment Variables
+For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:
+- **`CCORE_DOCUMENT_ENGINE`**: Force document engine (`auto`, `simple`, `docling`)
+- **`CCORE_URL_ENGINE`**: Force URL engine (`auto`, `simple`, `firecrawl`, `jina`)
+These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
 ### Custom Prompt Templates
 Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the `PROMPT_PATH` environment variable in your `.env` file or system environment.

{content_core-1.2.0 → content_core-1.2.1}/README.md RENAMED

@@ -74,11 +74,17 @@ summary = await cc.summarize_content(result, context="explain to a child")
 Install Content Core using `pip`:
 ```bash
-# Install the package
+# Basic installation (PyMuPDF + BeautifulSoup/Jina extraction)
 pip install content-core
-# Install with MCP server support
+# With enhanced document processing (adds Docling)
+pip install content-core[docling]
+# With MCP server support
 pip install content-core[mcp]
+# Full installation
+pip install content-core[docling,mcp]
 ```
 Alternatively, if you’re developing locally:
@@ -488,8 +494,21 @@ Example `.env`:
 ```plaintext
 OPENAI_API_KEY=your-key-here
 GOOGLE_API_KEY=your-key-here
+# Engine Selection (optional)
+CCORE_DOCUMENT_ENGINE=auto  # auto, simple, docling
+CCORE_URL_ENGINE=auto       # auto, simple, firecrawl, jina
 ```
+### Engine Selection via Environment Variables
+For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:
+- **`CCORE_DOCUMENT_ENGINE`**: Force document engine (`auto`, `simple`, `docling`)
+- **`CCORE_URL_ENGINE`**: Force URL engine (`auto`, `simple`, `firecrawl`, `jina`)
+These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
 ### Custom Prompt Templates
 Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the `PROMPT_PATH` environment variable in your `.env` file or system environment.

{content_core-1.2.0 → content_core-1.2.1}/docs/mcp.md RENAMED

@@ -292,6 +292,34 @@ export GOOGLE_API_KEY="your-google-key"
 - **Firecrawl**: Visit [Firecrawl](https://www.firecrawl.dev/) for enhanced web scraping
 - **Jina**: Visit [Jina AI](https://jina.ai/) for alternative web extraction
+### Engine Selection via Environment Variables
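+For advanced users, you can override the extraction engines. In the example below, `CCORE_DOCUMENT_ENGINE` is set to `simple` to skip Docling and use PyMuPDF, and `CCORE_URL_ENGINE` is left on `auto` (`firecrawl` and `jina` are alternatives). The notes are kept out of the file because JSON does not allow comments:
+```json
+{
+  "mcpServers": {
+    "content-core": {
+      "env": {
+        "OPENAI_API_KEY": "sk-...",
+        "FIRECRAWL_API_KEY": "fc-...",
+        "CCORE_DOCUMENT_ENGINE": "simple",
+        "CCORE_URL_ENGINE": "auto"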
+      }
+    }
+  }
+}
+```
+**Available engines:**
+- **Document**: `auto`, `simple`, `docling` (requires `content-core[docling]`)
+- **URL**: `auto`, `simple`, `firecrawl`, `jina`
+**Use cases:**
+- Set `CCORE_DOCUMENT_ENGINE=simple` to avoid docling dependency issues
+- Set `CCORE_URL_ENGINE=firecrawl` to always use paid service for better reliability
+- Set `CCORE_URL_ENGINE=simple` for faster processing without external API calls
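Strict JSON has no comment syntax, so notes about each engine are best kept outside the config file. A minimal sketch of generating the `env` block from Python instead, assuming you assemble the config in a script: the helper name `build_mcp_env` is hypothetical and not part of Content Core; only the `CCORE_*` variable names and their allowed values come from this diff.

```python
import json

# Allowed values as listed in this diff.
ALLOWED_DOCUMENT_ENGINES = {"auto", "simple", "docling"}
ALLOWED_URL_ENGINES = {"auto", "simple", "firecrawl", "jina"}


def build_mcp_env(document_engine: str = "auto", url_engine: str = "auto") -> dict:
    """Hypothetical helper: build the engine part of the MCP server `env` mapping."""
    if document_engine not in ALLOWED_DOCUMENT_ENGINES:
        raise ValueError(f"Unknown document engine: {document_engine!r}")
    if url_engine not in ALLOWED_URL_ENGINES:
        raise ValueError(f"Unknown URL engine: {url_engine!r}")
    return {
        "CCORE_DOCUMENT_ENGINE": document_engine,
        "CCORE_URL_ENGINE": url_engine,
    }


# API keys are omitted here; add them to the mapping as your setup requires.
config = {"mcpServers": {"content-core": {"env": build_mcp_env("simple", "auto")}}}
print(json.dumps(config, indent=2))
```

Validating before serializing catches a misspelled engine name when the file is written, rather than at server start-up where it would only produce a warning and a silent fallback.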
 ### Custom Prompts
 You can customize Content Core's behavior by setting a custom prompt path:

{content_core-1.2.0 → content_core-1.2.1}/docs/processors.md RENAMED

@@ -1,6 +1,6 @@
 # Content Core Processors
-**Note:** As of vNEXT, the default extraction engine is now `'auto'`. This means Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to enhanced PyMuPDF extraction (with quality flags and table detection), then to basic simple extraction. See details below.
+**Note:** As of vNEXT, the default extraction engine is now `'auto'`. This means Content Core will automatically select the best extraction method based on your environment and available packages, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first (if installed with `pip install content-core[docling]`), then falls back to enhanced PyMuPDF extraction (with quality flags and table detection), then to basic simple extraction. See details below.
 This document provides an overview of the content processors available in Content Core. These processors are responsible for extracting and handling content from various sources and file types.
@@ -62,14 +62,15 @@ Content Core uses a modular approach to process content from different sources.
   ```
 - **Performance**: Standard extraction maintains baseline performance; OCR only triggers selectively on formula-heavy pages
-### 6. **Docling Processor**
+### 6. **Docling Processor (Optional)**
 - **Purpose**: Use Docling library for rich document parsing (PDF, DOCX, XLSX, PPTX, Markdown, AsciiDoc, HTML, CSV, images).
+- **Installation**: Requires `pip install content-core[docling]`
 - **Supported Input**: PDF, DOCX, XLSX, PPTX, Markdown, AsciiDoc, HTML, CSV, Images (PNG, JPEG, TIFF, BMP).
 - **Returned Data**: Content converted to configured format (markdown, html, json).
 - **Location**: `src/content_core/processors/docling.py`
 - **Default Document Engine (`auto`) Logic for Files/Documents**:
-    - Tries the `'docling'` extraction method first (robust document parsing for supported types).
-    - If `'docling'` fails or is not supported, automatically falls back to enhanced PyMuPDF extraction (fast, with quality flags and table detection).
+    - Tries the `'docling'` extraction method first (if installed with `content-core[docling]`).
+    - If `'docling'` is not installed or fails, automatically falls back to enhanced PyMuPDF extraction (fast, with quality flags and table detection).
     - Final fallback to basic simple extraction if needed.
     - You can explicitly specify `'docling'` or `'simple'` as the document engine, but `'auto'` is now the default and recommended for most users.
 - **Configuration**: Activate the Docling engine in `cc_config.yaml` or custom config:

{content_core-1.2.0 → content_core-1.2.1}/docs/usage.md RENAMED

@@ -1,6 +1,6 @@
 # Using the Content Core Library
-> **Note:** As of vNEXT, the default extraction engine is `'auto'`. Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to simple extraction. You can override the engine if needed, but `'auto'` is recommended for most users.
+> **Note:** As of vNEXT, the default extraction engine is `'auto'`. Content Core will automatically select the best extraction method based on your environment and available packages, with a smart fallback order for both URLs and files. For files/documents, `'auto'` tries Docling first (if installed with `pip install content-core[docling]`), then falls back to enhanced PyMuPDF extraction. You can override the engine if needed, but `'auto'` is recommended for most users.
 This documentation explains how to configure and use the **Content Core** library in your projects. The library allows customization of AI model settings through a YAML file and environment variables.
@@ -12,8 +12,21 @@ To set the environment variable, add the following line to your `.env` file or s
 ```
 CCORE_MODEL_CONFIG_PATH=/path/to/your/models_config.yaml
+# Optional: Override extraction engines
+CCORE_DOCUMENT_ENGINE=auto  # auto, simple, docling
+CCORE_URL_ENGINE=auto       # auto, simple, firecrawl, jina
 ```
+### Engine Selection Environment Variables
+Content Core supports environment variable overrides for extraction engines, useful for deployment scenarios:
+- **`CCORE_DOCUMENT_ENGINE`**: Override document engine (`auto`, `simple`, `docling`)
+- **`CCORE_URL_ENGINE`**: Override URL engine (`auto`, `simple`, `firecrawl`, `jina`)
+These environment variables take precedence over configuration file settings. Per-call overrides, when provided, still take precedence over the environment variables.
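Judging by the `state.document_engine or get_document_engine()` expression in `graph.py` in this diff, the effective lookup order appears to be: per-call value, then environment variable, then config file, then `"auto"`. A minimal sketch of that chain follows. The function `effective_engine` and its parameters are hypothetical, and validation of the environment value is left out for brevity.

```python
import os
from typing import Mapping, Optional


def effective_engine(
    per_call: Optional[str],
    env_var: str,
    config_value: Optional[str],
    env: Optional[Mapping[str, str]] = None,
    default: str = "auto",
) -> str:
    """Hypothetical helper: pick the first non-empty source in precedence order."""
    env = os.environ if env is None else env
    # `or` skips None and "" alike, matching how the diff treats an empty variable.
    return per_call or env.get(env_var) or config_value or default


fake_env = {"CCORE_DOCUMENT_ENGINE": "simple"}
print(effective_engine("docling", "CCORE_DOCUMENT_ENGINE", "auto", env=fake_env))  # docling
print(effective_engine(None, "CCORE_DOCUMENT_ENGINE", "auto", env=fake_env))       # simple
print(effective_engine(None, "CCORE_DOCUMENT_ENGINE", None, env={}))               # auto
```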
 ## YAML File Schema
 The YAML configuration file defines the AI models that the library will use. The structure of the file is as follows:

{content_core-1.2.0 → content_core-1.2.1}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "content-core"
-version = "1.2.0"
+version = "1.2.1"
 description = "Extract what matters from any media source. Available as Python Library, macOS Service, CLI and MCP Server"
 readme = "README.md"
 homepage = "https://github.com/lfnovo/content-core"

{content_core-1.2.0 → content_core-1.2.1}/src/content_core/config.py RENAMED

@@ -6,6 +6,10 @@ from dotenv import load_dotenv
 # Load environment variables from .env file
 load_dotenv()
+# Allowed engine values for validation
+ALLOWED_DOCUMENT_ENGINES = {"auto", "simple", "docling"}
+ALLOWED_URL_ENGINES = {"auto", "simple", "firecrawl", "jina"}
 def load_config():
     config_path = os.environ.get("CCORE_CONFIG_PATH") or os.environ.get("CCORE_MODEL_CONFIG_PATH")
@@ -33,6 +37,39 @@ def load_config():
 CONFIG = load_config()
+# Environment variable engine selectors for MCP/Raycast users
+def get_document_engine():
+    """Get document engine with environment variable override and validation."""
+    env_engine = os.environ.get("CCORE_DOCUMENT_ENGINE")
+    if env_engine:
+        if env_engine not in ALLOWED_DOCUMENT_ENGINES:
+            # Import logger here to avoid circular imports
+            from content_core.logging import logger
+            logger.warning(
+                f"Invalid CCORE_DOCUMENT_ENGINE: '{env_engine}'. "
+                f"Allowed values: {', '.join(sorted(ALLOWED_DOCUMENT_ENGINES))}. "
+                f"Using default from config."
+            )
+            return CONFIG.get("extraction", {}).get("document_engine", "auto")
+        return env_engine
+    return CONFIG.get("extraction", {}).get("document_engine", "auto")
+def get_url_engine():
+    """Get URL engine with environment variable override and validation."""
+    env_engine = os.environ.get("CCORE_URL_ENGINE")
+    if env_engine:
+        if env_engine not in ALLOWED_URL_ENGINES:
+            # Import logger here to avoid circular imports
+            from content_core.logging import logger
+            logger.warning(
+                f"Invalid CCORE_URL_ENGINE: '{env_engine}'. "
+                f"Allowed values: {', '.join(sorted(ALLOWED_URL_ENGINES))}. "
+                f"Using default from config."
+            )
+            return CONFIG.get("extraction", {}).get("url_engine", "auto")
+        return env_engine
+    return CONFIG.get("extraction", {}).get("url_engine", "auto")
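The two getters above differ only in the variable name, the allowed set, and the config key. A minimal sketch of how that shared logic could be factored into one function, offered as an illustration rather than as the package's code: `resolve_engine` is a hypothetical name, the standard `logging` module stands in for `content_core.logging`, and the config lookup is replaced by a plain `config_default` argument.

```python
import logging
import os
from typing import Mapping, Optional, Set

logger = logging.getLogger(__name__)


def resolve_engine(
    env_var: str,
    allowed: Set[str],
    config_default: str = "auto",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical shared helper for both engine getters."""
    env = os.environ if env is None else env
    value = env.get(env_var)
    if not value:
        # Unset or empty: use the configured default.
        return config_default
    if value not in allowed:
        # Exact, case-sensitive match; no stripping or lower-casing,
        # which is the behaviour the unit tests in this diff expect.
        logger.warning(
            "Invalid %s: %r. Allowed values: %s. Using default from config.",
            env_var,
            value,
            ", ".join(sorted(allowed)),
        )
        return config_default
    return value


URL_ENGINES = {"auto", "simple", "firecrawl", "jina"}
print(resolve_engine("CCORE_URL_ENGINE", URL_ENGINES, env={"CCORE_URL_ENGINE": "jina"}))       # jina
print(resolve_engine("CCORE_URL_ENGINE", URL_ENGINES, env={"CCORE_URL_ENGINE": "FIRECRAWL"}))  # auto
```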
 # Programmatic config overrides: use in notebooks or scripts
 def set_document_engine(engine: str):
     """Override the document extraction engine ('auto', 'simple', or 'docling')."""

{content_core-1.2.0 → content_core-1.2.1}/src/content_core/content/extraction/graph.py RENAMED

@@ -12,13 +12,19 @@ from content_core.common import (
     ProcessSourceState,
     UnsupportedTypeException,
 )
-from content_core.config import CONFIG  # type: ignore
+from content_core.config import get_document_engine
 from content_core.logging import logger
 from content_core.processors.audio import extract_audio_data  # type: ignore
-from content_core.processors.docling import (
-    DOCLING_SUPPORTED,  # type: ignore
-    extract_with_docling,
-)
+try:
+    from content_core.processors.docling import (
+        DOCLING_SUPPORTED,  # type: ignore
+        extract_with_docling,
+        DOCLING_AVAILABLE,
+    )
+except ImportError:
+    DOCLING_AVAILABLE = False
+    DOCLING_SUPPORTED = set()
+    extract_with_docling = None
 from content_core.processors.office import (
     SUPPORTED_OFFICE_TYPES,
     extract_office_content,
@@ -126,26 +132,30 @@ async def file_type_router_docling(state: ProcessSourceState) -> str:
     Supports 'auto', 'docling', and 'simple'.
     'auto' uses docling when it is installed and supports the file type, otherwise falls back to simple.
     """
-    engine = state.document_engine or CONFIG.get("extraction", {}).get("document_engine", "auto")
+    # Use environment-aware engine selection
+    engine = state.document_engine or get_document_engine()
     if engine == "auto":
         logger.debug("Using auto engine")
-        # Try docling first; if it fails or is not supported, fallback to simple
-        if state.identified_type in DOCLING_SUPPORTED:
-            try:
-                logger.debug("Trying docling extraction")
-                return "extract_docling"
-            except Exception as e:
-                logger.warning(
-                    f"Docling extraction failed in 'auto' mode, falling back to simple: {e}"
-                )
+        # Check if docling is available AND supports the file type
+        if DOCLING_AVAILABLE and state.identified_type in DOCLING_SUPPORTED:
+            logger.debug("Using docling extraction (auto mode)")
+            return "extract_docling"
         # Fallback to simple
-        logger.debug("Falling back to simple extraction")
+        logger.debug("Falling back to simple extraction (docling unavailable or unsupported)")
         return await file_type_edge(state)
-    if engine == "docling" and state.identified_type in DOCLING_SUPPORTED:
-        logger.debug("Using docling engine")
-        return "extract_docling"
-    # For 'simple', use the default file type edge
+    if engine == "docling":
+        if not DOCLING_AVAILABLE:
+            raise ImportError("Docling engine requested but docling package not installed. Install with: pip install content-core[docling]")
+        if state.identified_type in DOCLING_SUPPORTED:
+            logger.debug("Using docling engine")
+            return "extract_docling"
+        # If docling doesn't support this file type, fall back to simple
+        logger.debug("Docling doesn't support this file type, using simple engine")
+        return await file_type_edge(state)
+    # For 'simple' or any other engine
     logger.debug("Using simple engine")
     return await file_type_edge(state)
@@ -168,7 +178,9 @@ workflow.add_node("extract_audio_data", extract_audio_data)
 workflow.add_node("extract_youtube_transcript", extract_youtube_transcript)
 workflow.add_node("delete_file", delete_file)
 workflow.add_node("download_remote_file", download_remote_file)
-workflow.add_node("extract_docling", extract_with_docling)
+# Only add docling node if available
+if DOCLING_AVAILABLE:
+    workflow.add_node("extract_docling", extract_with_docling)
 # Add edges
 workflow.add_edge(START, "source")
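The routing rules in the rewritten `file_type_router_docling` can be summarized as a pure function, which makes the three engine modes easier to check in isolation. This is a sketch under stated assumptions: `route_document` is a hypothetical name, and the string `"simple"` stands in for whatever `file_type_edge(state)` would return.

```python
from typing import Set


def route_document(
    engine: str,
    mime_type: str,
    docling_available: bool,
    docling_supported: Set[str],
) -> str:
    """Return "extract_docling" or "simple" (placeholder for the simple route)."""
    if engine == "auto":
        # Auto mode never raises: it quietly prefers docling when usable.
        if docling_available and mime_type in docling_supported:
            return "extract_docling"
        return "simple"
    if engine == "docling":
        # An explicit request fails loudly when the optional extra is missing.
        if not docling_available:
            raise ImportError(
                "Docling engine requested but docling package not installed."
            )
        if mime_type in docling_supported:
            return "extract_docling"
        return "simple"
    # 'simple' or any other value.
    return "simple"


supported = {"application/pdf"}
print(route_document("auto", "application/pdf", True, supported))   # extract_docling
print(route_document("auto", "application/pdf", False, supported))  # simple
print(route_document("docling", "text/plain", True, supported))     # simple
```

The asymmetry is deliberate in the diff: `auto` degrades silently, while an explicit `docling` request surfaces the missing dependency.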

{content_core-1.2.0 → content_core-1.2.1}/src/content_core/processors/docling.py RENAMED

@@ -2,22 +2,29 @@
 Docling-based document extraction processor.
 """
+from content_core.common.state import ProcessSourceState
+from content_core.config import CONFIG
+DOCLING_AVAILABLE = False
 try:
     from docling.document_converter import DocumentConverter
+    DOCLING_AVAILABLE = True
 except ImportError:
     class DocumentConverter:
         """Stub when docling is not installed."""
         def __init__(self):
-            raise ImportError("Docling not installed")
+            raise ImportError(
+                "Docling not installed. Install with: pip install content-core[docling] "
+                "or use CCORE_DOCUMENT_ENGINE=simple to skip docling."
+            )
         def convert(self, source: str):
-            raise ImportError("Docling not installed")
-from content_core.common.state import ProcessSourceState
-from content_core.config import CONFIG
+            raise ImportError(
+                "Docling not installed. Install with: pip install content-core[docling] "
+                "or use CCORE_DOCUMENT_ENGINE=simple to skip docling."
+            )
 # Supported MIME types for Docling extraction
 DOCLING_SUPPORTED = {

{content_core-1.2.0 → content_core-1.2.1}/src/content_core/processors/url.py RENAMED

@@ -5,7 +5,7 @@ from bs4 import BeautifulSoup
 from readability import Document
 from content_core.common import ProcessSourceState
-from content_core.config import CONFIG
+from content_core.config import get_url_engine
 from content_core.logging import logger
 from content_core.processors.docling import DOCLING_SUPPORTED
 from content_core.processors.office import SUPPORTED_OFFICE_TYPES
@@ -165,7 +165,8 @@ async def extract_url(state: ProcessSourceState):
     """
     assert state.url, "No URL provided"
     url = state.url
-    engine = state.url_engine or CONFIG.get("extraction", {}).get("url_engine", "auto")
+    # Use environment-aware engine selection
+    engine = state.url_engine or get_url_engine()
     try:
         if engine == "auto":
             if os.environ.get("FIRECRAWL_API_KEY"):

content_core-1.2.1/tests/unit/test_config.py ADDED

@@ -0,0 +1,109 @@
+"""Tests for configuration functions and environment variable handling."""
+import pytest
+from unittest.mock import patch, MagicMock
+from content_core.config import (
+    get_document_engine,
+    get_url_engine,
+    ALLOWED_DOCUMENT_ENGINES,
+    ALLOWED_URL_ENGINES,
+)
+class TestDocumentEngineSelection:
+    """Test document engine selection with environment variables."""
+    def test_default_document_engine(self):
+        """Test default document engine when no env var is set."""
+        with patch.dict('os.environ', {}, clear=False):
+            # Remove the env var if it exists
+            if 'CCORE_DOCUMENT_ENGINE' in __import__('os').environ:
+                del __import__('os').environ['CCORE_DOCUMENT_ENGINE']
+            engine = get_document_engine()
+            assert engine == "auto"  # Default from config
+    def test_valid_document_engine_env_var(self):
+        """Test valid document engine environment variable override."""
+        for engine in ALLOWED_DOCUMENT_ENGINES:
+            with patch.dict('os.environ', {'CCORE_DOCUMENT_ENGINE': engine}):
+                assert get_document_engine() == engine
+    def test_invalid_document_engine_env_var(self):
+        """Test invalid document engine environment variable falls back to default."""
+        with patch.dict('os.environ', {'CCORE_DOCUMENT_ENGINE': 'invalid_engine'}):
+            engine = get_document_engine()
+            assert engine == "auto"  # Should fallback to default
+    def test_case_sensitive_document_engine_env_var(self):
+        """Test that document engine environment variable is case sensitive."""
+        with patch.dict('os.environ', {'CCORE_DOCUMENT_ENGINE': 'AUTO'}):  # uppercase
+            engine = get_document_engine()
+            assert engine == "auto"  # Should fallback to default
+class TestUrlEngineSelection:
+    """Test URL engine selection with environment variables."""
+    def test_default_url_engine(self):
+        """Test default URL engine when no env var is set."""
+        with patch.dict('os.environ', {}, clear=False):
+            # Remove the env var if it exists
+            if 'CCORE_URL_ENGINE' in __import__('os').environ:
+                del __import__('os').environ['CCORE_URL_ENGINE']
+            engine = get_url_engine()
+            assert engine == "auto"  # Default from config
+    def test_valid_url_engine_env_var(self):
+        """Test valid URL engine environment variable override."""
+        for engine in ALLOWED_URL_ENGINES:
+            with patch.dict('os.environ', {'CCORE_URL_ENGINE': engine}):
+                assert get_url_engine() == engine
+    def test_invalid_url_engine_env_var(self):
+        """Test invalid URL engine environment variable falls back to default."""
+        with patch.dict('os.environ', {'CCORE_URL_ENGINE': 'invalid_engine'}):
+            engine = get_url_engine()
+            assert engine == "auto"  # Should fallback to default
+    def test_case_sensitive_url_engine_env_var(self):
+        """Test that URL engine environment variable is case sensitive."""
+        with patch.dict('os.environ', {'CCORE_URL_ENGINE': 'FIRECRAWL'}):  # uppercase
+            engine = get_url_engine()
+            assert engine == "auto"  # Should fallback to default
+class TestEngineConstants:
+    """Test that engine constants contain expected values."""
+    def test_document_engine_constants(self):
+        """Test document engine allowed values."""
+        expected = {"auto", "simple", "docling"}
+        assert ALLOWED_DOCUMENT_ENGINES == expected
+    def test_url_engine_constants(self):
+        """Test URL engine allowed values."""
+        expected = {"auto", "simple", "firecrawl", "jina"}
+        assert ALLOWED_URL_ENGINES == expected
+class TestEdgeCases:
+    """Test edge cases and error conditions."""
+    def test_empty_string_document_engine(self):
+        """Test empty string for document engine env var."""
+        with patch.dict('os.environ', {'CCORE_DOCUMENT_ENGINE': ''}):
+            # Empty string should be falsy and use default
+            engine = get_document_engine()
+            assert engine == "auto"
+    def test_empty_string_url_engine(self):
+        """Test empty string for URL engine env var."""
+        with patch.dict('os.environ', {'CCORE_URL_ENGINE': ''}):
+            # Empty string should be falsy and use default
+            engine = get_url_engine()
+            assert engine == "auto"
+    def test_whitespace_engine_values(self):
+        """Test whitespace in engine values are treated as invalid."""
+        with patch.dict('os.environ', {'CCORE_DOCUMENT_ENGINE': ' auto '}):
+            engine = get_document_engine()
+            assert engine == "auto"  # Should fallback to default
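The default-engine tests above remove the variable through `__import__('os').environ` inside a `patch.dict` block. A more conventional way to express the same setup is sketched below, assuming a plain `import os`. The function `get_engine_stub` is a stand-in written for this example so that it runs without Content Core installed; it is not the package's `get_document_engine`.

```python
import os
from unittest.mock import patch


def get_engine_stub() -> str:
    """Stand-in for `get_document_engine`: read the variable, default to "auto"."""
    return os.environ.get("CCORE_DOCUMENT_ENGINE") or "auto"


def test_default_engine_without_env_var():
    # patch.dict snapshots os.environ and restores it on exit,
    # so removing the key here does not leak into other tests.
    with patch.dict(os.environ, {}, clear=False):
        os.environ.pop("CCORE_DOCUMENT_ENGINE", None)
        assert get_engine_stub() == "auto"


test_default_engine_without_env_var()
```

Using `pop` with a default avoids the separate membership check, and the restore-on-exit behaviour of `patch.dict` keeps a developer's own `CCORE_DOCUMENT_ENGINE` setting intact after the test run.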

{content_core-1.2.0 → content_core-1.2.1}/tests/unit/test_docling.py RENAMED

@@ -1,7 +1,7 @@
 import os
 import pytest
 from types import SimpleNamespace
-from content_core.processors.docling import extract_with_docling
+from content_core.processors.docling import extract_with_docling, DOCLING_AVAILABLE
 from content_core.common.state import ProcessSourceState
 class DummyDoc:
@@ -31,6 +31,7 @@ def patch_converter(monkeypatch):
     )
 @pytest.mark.asyncio
+@pytest.mark.skipif(not DOCLING_AVAILABLE, reason="Docling not installed")
 async def test_extract_file(tmp_path):
     # File input with explicit markdown format
     fp = tmp_path / "test.txt"
@@ -40,6 +41,7 @@ async def test_extract_file(tmp_path):
     assert new_state.content == "md:file:" + str(fp)
 @pytest.mark.asyncio
+@pytest.mark.skipif(not DOCLING_AVAILABLE, reason="Docling not installed")
 async def test_extract_block_html():
     # Block input with HTML format
     state = ProcessSourceState(content="block content", metadata={"docling_format": "html"})
@@ -47,6 +49,7 @@ async def test_extract_block_html():
     assert new_state.content == "<p>blk:block content</p>"
 @pytest.mark.asyncio
+@pytest.mark.skipif(not DOCLING_AVAILABLE, reason="Docling not installed")
 async def test_default_to_markdown():
     # Default format should fallback to markdown
     state = ProcessSourceState(content="plain text")

{content_core-1.2.0 → content_core-1.2.1}/uv.lock RENAMED

@@ -419,7 +419,7 @@ wheels = [
 [[package]]
 name = "content-core"
-version = "1.2.0"
+version = "1.2.1"
 source = { editable = "." }
 dependencies = [
     { name = "ai-prompter" },