content-core 0.7.2__tar.gz → 0.8.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {content_core-0.7.2 → content_core-0.8.0}/PKG-INFO +13 -4
- {content_core-0.7.2 → content_core-0.8.0}/README.md +9 -3
- {content_core-0.7.2 → content_core-0.8.0}/docs/processors.md +12 -1
- {content_core-0.7.2 → content_core-0.8.0}/docs/usage.md +15 -5
- {content_core-0.7.2 → content_core-0.8.0}/pyproject.toml +4 -1
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/common/state.py +6 -2
- content_core-0.8.0/src/content_core/common/types.py +21 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/content/extraction/graph.py +18 -3
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/processors/audio.py +19 -11
- content_core-0.8.0/src/content_core/processors/url.py +248 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/integration/test_extraction.py +53 -4
- {content_core-0.7.2 → content_core-0.8.0}/uv.lock +327 -159
- content_core-0.7.2/src/content_core/processors/url.py +0 -252
- {content_core-0.7.2 → content_core-0.8.0}/.github/PULL_REQUEST_TEMPLATE.md +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/.github/workflows/publish.yml +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/.gitignore +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/.python-version +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/CONTRIBUTING.md +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/LICENSE +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/Makefile +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/prompts/content/cleanup.jinja +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/prompts/content/summarize.jinja +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/__init__.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/cc_config.yaml +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/common/__init__.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/common/exceptions.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/common/utils.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/config.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/content/__init__.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/content/cleanup/__init__.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/content/cleanup/core.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/content/extraction/__init__.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/content/summary/__init__.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/content/summary/core.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/logging.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/models.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/models_config.yaml +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/notebooks/run.ipynb +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/processors/docling.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/processors/office.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/processors/pdf.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/processors/text.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/processors/video.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/processors/youtube.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/py.typed +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/templated_message.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/tools/__init__.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/tools/cleanup.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/tools/extract.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/src/content_core/tools/summarize.py +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/input_content/file.docx +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/input_content/file.epub +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/input_content/file.md +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/input_content/file.mp3 +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/input_content/file.mp4 +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/input_content/file.pdf +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/input_content/file.pptx +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/input_content/file.txt +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/input_content/file.xlsx +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/input_content/file_audio.mp3 +0 -0
- {content_core-0.7.2 → content_core-0.8.0}/tests/unit/test_docling.py +0 -0
--- content_core-0.7.2/PKG-INFO
+++ content_core-0.8.0/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: content-core
-Version: 0.7.2
+Version: 0.8.0
 Summary: Extract what matters from any media source
 Author-email: LUIS NOVO <lfnovo@gmail.com>
 License-File: LICENSE
@@ -10,6 +10,8 @@ Requires-Dist: aiohttp>=3.11
 Requires-Dist: bs4>=0.0.2
 Requires-Dist: dicttoxml>=1.7.16
 Requires-Dist: esperanto[openai]>=1.2.0
+Requires-Dist: firecrawl-py>=2.7.0
+Requires-Dist: firecrawl>=2.7.0
 Requires-Dist: jinja2>=3.1.6
 Requires-Dist: langdetect>=1.0.9
 Requires-Dist: langgraph>=0.3.29
@@ -22,6 +24,7 @@ Requires-Dist: python-docx>=1.1.2
 Requires-Dist: python-dotenv>=1.1.0
 Requires-Dist: python-magic>=0.4.27
 Requires-Dist: python-pptx>=1.0.2
+Requires-Dist: readability-lxml>=0.8.4.1
 Requires-Dist: validators>=0.34.0
 Requires-Dist: youtube-transcript-api>=1.0.3
 Provides-Extra: docling
@@ -39,6 +42,8 @@ Description-Content-Type: text/markdown
 
 ## Overview
 
+> **Note:** As of v0.8, the default extraction engine is `'auto'`. Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to simple extraction. You can override the engine if needed, but `'auto'` is recommended for most users.
+
 The primary goal of Content Core is to simplify the process of ingesting content from diverse origins. Whether you have raw text, a URL pointing to an article, or a local file like a video or markdown document, Content Core aims to extract the meaningful content for further use.
 
 ## Key Features
@@ -48,6 +53,10 @@ The primary goal of Content Core is to simplify the process of ingesting content
   * Web URLs (using robust extraction methods).
   * Local files (including automatic transcription for video/audio files and parsing for text-based formats).
 * **Intelligent Processing:** Applies appropriate extraction techniques based on the source type. See the [Processors Documentation](./docs/processors.md) for detailed information on how different content types are handled.
+* **Smart Engine Selection:** By default, Content Core uses the `'auto'` engine, which:
+  * For URLs: Uses Firecrawl if `FIRECRAWL_API_KEY` is set, else tries Jina. Jina might fail because of rate limits, which can be fixed by adding `JINA_API_KEY`. If Jina fails, BeautifulSoup is used as a fallback.
+  * For files: Tries Docling extraction first (for robust document parsing), then falls back to simple extraction if needed.
+  * You can override this by specifying an engine, but `'auto'` is recommended for most users.
 * **Content Cleaning (Optional):** Likely integrates with LLMs (via `prompter.py` and Jinja templates) to refine and clean the extracted content.
 * **Asynchronous:** Built with `asyncio` for efficient I/O operations.
 
@@ -218,15 +227,15 @@
     text_data = await extract_content({"content": "This is my sample text content."})
     print(text_data)
 
-    # Extract from a URL
+    # Extract from a URL (uses 'auto' engine by default)
     url_data = await extract_content({"url": "https://www.example.com"})
     print(url_data)
 
-    # Extract from a local video file (gets transcript)
+    # Extract from a local video file (gets transcript, engine='auto' by default)
     video_data = await extract_content({"file_path": "path/to/your/video.mp4"})
     print(video_data)
 
-    # Extract from a local markdown file
+    # Extract from a local markdown file (engine='auto' by default)
     md_data = await extract_content({"file_path": "path/to/your/document.md"})
     print(md_data)
 
--- content_core-0.7.2/README.md
+++ content_core-0.8.0/README.md
@@ -6,6 +6,8 @@
 
 ## Overview
 
+> **Note:** As of v0.8, the default extraction engine is `'auto'`. Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to simple extraction. You can override the engine if needed, but `'auto'` is recommended for most users.
+
 The primary goal of Content Core is to simplify the process of ingesting content from diverse origins. Whether you have raw text, a URL pointing to an article, or a local file like a video or markdown document, Content Core aims to extract the meaningful content for further use.
 
 ## Key Features
@@ -15,6 +17,10 @@ The primary goal of Content Core is to simplify the process of ingesting content
   * Web URLs (using robust extraction methods).
   * Local files (including automatic transcription for video/audio files and parsing for text-based formats).
 * **Intelligent Processing:** Applies appropriate extraction techniques based on the source type. See the [Processors Documentation](./docs/processors.md) for detailed information on how different content types are handled.
+* **Smart Engine Selection:** By default, Content Core uses the `'auto'` engine, which:
+  * For URLs: Uses Firecrawl if `FIRECRAWL_API_KEY` is set, else tries Jina. Jina might fail because of rate limits, which can be fixed by adding `JINA_API_KEY`. If Jina fails, BeautifulSoup is used as a fallback.
+  * For files: Tries Docling extraction first (for robust document parsing), then falls back to simple extraction if needed.
+  * You can override this by specifying an engine, but `'auto'` is recommended for most users.
 * **Content Cleaning (Optional):** Likely integrates with LLMs (via `prompter.py` and Jinja templates) to refine and clean the extracted content.
 * **Asynchronous:** Built with `asyncio` for efficient I/O operations.
 
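The engine can also be forced per call. Below is a minimal sketch, assuming `extract_content` is importable from the package root and assuming the input dict accepts an "engine" key that maps onto the new `ProcessSourceState.engine` field; neither assumption is confirmed by this diff:

```python
import asyncio

from content_core import extract_content  # import path assumed


async def main():
    # Override the default 'auto' selection for a single extraction.
    data = await extract_content(
        {"url": "https://www.example.com", "engine": "firecrawl"}
    )
    print(data)


asyncio.run(main())
```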
@@ -185,15 +191,15 @@
     text_data = await extract_content({"content": "This is my sample text content."})
     print(text_data)
 
-    # Extract from a URL
+    # Extract from a URL (uses 'auto' engine by default)
     url_data = await extract_content({"url": "https://www.example.com"})
     print(url_data)
 
-    # Extract from a local video file (gets transcript)
+    # Extract from a local video file (gets transcript, engine='auto' by default)
     video_data = await extract_content({"file_path": "path/to/your/video.mp4"})
     print(video_data)
 
-    # Extract from a local markdown file
+    # Extract from a local markdown file (engine='auto' by default)
     md_data = await extract_content({"file_path": "path/to/your/document.md"})
     print(md_data)
 
--- content_core-0.7.2/docs/processors.md
+++ content_core-0.8.0/docs/processors.md
@@ -1,5 +1,7 @@
 # Content Core Processors
 
+**Note:** As of vNEXT, the default extraction engine is now `'auto'`. This means Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to simple extraction. See details below.
+
 This document provides an overview of the content processors available in Content Core. These processors are responsible for extracting and handling content from various sources and file types.
 
 ## Overview
@@ -19,6 +21,11 @@ Content Core uses a modular approach to process content from different sources.
 - **Supported Input**: URLs (web pages).
 - **Returned Data**: Extracted text content from the web page, often in a cleaned format.
 - **Location**: `src/content_core/processors/url.py`
+- **Default Engine (`auto`) Logic**:
+  - If `FIRECRAWL_API_KEY` is set, uses Firecrawl for extraction.
+  - Else, tries Jina; Jina may fail due to rate limits unless `JINA_API_KEY` is set.
+  - Else, falls back to BeautifulSoup-based extraction.
+  - You can explicitly specify an engine (`'firecrawl'`, `'jina'`, `'simple'`, etc.), but `'auto'` is now the default and recommended for most users.
 
 ### 3. **File Processor**
 - **Purpose**: Processes local files of various types, extracting content based on file format.
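The bullets above form a strict priority chain. Here is a sketch of the same decision using only the documented environment variables; `pick_url_engine` is illustrative and not a function in the package:

```python
import os


def pick_url_engine() -> str:
    """Mirror the documented 'auto' order for URL extraction (illustrative only)."""
    if os.environ.get("FIRECRAWL_API_KEY"):
        return "firecrawl"
    # Jina is tried next and may hit rate limits without JINA_API_KEY;
    # on failure, extraction falls back to BeautifulSoup ('simple').
    return "jina"
```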
@@ -40,10 +47,14 @@ Content Core uses a modular approach to process content from different sources.
 - **Supported Input**: PDF, DOCX, XLSX, PPTX, Markdown, AsciiDoc, HTML, CSV, Images (PNG, JPEG, TIFF, BMP).
 - **Returned Data**: Content converted to configured format (markdown, html, json).
 - **Location**: `src/content_core/processors/docling.py`
+- **Default Engine (`auto`) Logic for Files/Documents**:
+  - Tries the `'docling'` extraction method first (robust document parsing for supported types).
+  - If `'docling'` fails or is not supported, automatically falls back to simple extraction (fast, lightweight for supported types).
+  - You can explicitly specify `'docling'`, `'simple'`, or `'legacy'` as the engine, but `'auto'` is now the default and recommended for most users.
 - **Configuration**: Activate the Docling engine in `cc_config.yaml` or custom config:
   ```yaml
   extraction:
-    engine: docling # '
+    engine: docling # 'auto' (default), 'docling', or 'simple'
     docling:
       output_format: markdown # markdown | html | json
   ```
--- content_core-0.7.2/docs/usage.md
+++ content_core-0.8.0/docs/usage.md
@@ -1,5 +1,7 @@
 # Using the Content Core Library
 
+> **Note:** As of vNEXT, the default extraction engine is `'auto'`. Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to simple extraction. You can override the engine if needed, but `'auto'` is recommended for most users.
+
 This documentation explains how to configure and use the **Content Core** library in your projects. The library allows customization of AI model settings through a YAML file and environment variables.
 
 ## Environment Variable for Configuration
@@ -76,20 +78,28 @@ To simplify setup, we suggest copying the provided sample files:
 
 This will allow you to quickly start with customized settings without needing to create the files from scratch.
 
-###
+### Extraction Engine Selection
+
+By default, Content Core uses the `'auto'` engine for all extraction tasks. The logic is as follows:
+- **For URLs**: Uses Firecrawl if `FIRECRAWL_API_KEY` is set, else Jina if `JINA_API_KEY` is set, else falls back to BeautifulSoup.
+- **For files**: Tries Docling extraction first (for robust document parsing), then falls back to simple extraction if needed.
+
+You can override this behavior by specifying an engine in your config or function call, but `'auto'` is recommended for most users.
+
+#### Docling Engine
 
-Content Core supports an optional Docling engine for advanced document parsing. To enable:
+Content Core supports an optional Docling engine for advanced document parsing. To enable Docling explicitly:
 
-
+##### In YAML config
 Add under the `extraction` section:
 ```yaml
 extraction:
-  engine: docling #
+  engine: docling # auto (default), docling, or simple
   docling:
     output_format: html # markdown | html | json
 ```
 
-
+##### Programmatically in Python
 ```python
 from content_core.config import set_extraction_engine, set_docling_output_format
 
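The hunk ends at the two helpers usage.md now imports. A minimal sketch of the programmatic path, assuming each helper takes the same string values the YAML accepts (the names come straight from the diff; the call signatures are assumptions):

```python
from content_core.config import set_extraction_engine, set_docling_output_format

# Assumed equivalent of the YAML block above: force Docling with HTML output.
set_extraction_engine("docling")
set_docling_output_format("html")
```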
--- content_core-0.7.2/pyproject.toml
+++ content_core-0.8.0/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "content-core"
-version = "0.7.2"
+version = "0.8.0"
 description = "Extract what matters from any media source"
 readme = "README.md"
 homepage = "https://github.com/lfnovo/content-core"
@@ -28,6 +28,9 @@ dependencies = [
     "validators>=0.34.0",
     "ai-prompter>=0.2.3",
     "moviepy>=2.1.2",
+    "readability-lxml>=0.8.4.1",
+    "firecrawl>=2.7.0",
+    "firecrawl-py>=2.7.0",
 ]
 
 [project.optional-dependencies]
--- content_core-0.7.2/src/content_core/common/state.py
+++ content_core-0.8.0/src/content_core/common/state.py
@@ -2,6 +2,9 @@ from typing import Optional
 
 from pydantic import BaseModel, Field
 
+from content_core.common.types import Engine
+from content_core.common.types import Engine
+
 
 class ProcessSourceState(BaseModel):
     file_path: Optional[str] = ""
@@ -13,8 +16,9 @@ class ProcessSourceState(BaseModel):
     identified_provider: Optional[str] = ""
     metadata: Optional[dict] = Field(default_factory=lambda: {})
     content: Optional[str] = ""
-    engine: Optional[
-        default=None,
+    engine: Optional[Engine] = Field(
+        default=None,
+        description="Override extraction engine: 'auto', 'simple', 'legacy', 'firecrawl', 'jina', or 'docling'",
     )
     output_format: Optional[str] = Field(
         default=None,
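Typing the field as `Optional[Engine]` means pydantic now validates engine names at construction time. A small sketch, assuming the model also has a `url` field (it is used elsewhere in the package but not shown in this hunk):

```python
from pydantic import ValidationError

from content_core.common.state import ProcessSourceState

# A valid engine name passes validation.
state = ProcessSourceState(url="https://www.example.com", engine="firecrawl")

# An unknown name is rejected against the Engine literal.
try:
    ProcessSourceState(url="https://www.example.com", engine="scrapy")
except ValidationError as err:
    print(err)
```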
--- /dev/null
+++ content_core-0.8.0/src/content_core/common/types.py
@@ -0,0 +1,21 @@
+from typing import Literal
+import warnings
+
+Engine = Literal[
+    "auto",
+    "simple",
+    "legacy",
+    "firecrawl",
+    "jina",
+    "docling",
+]
+
+DEPRECATED_ENGINES = {"legacy": "simple"}
+
+def warn_if_deprecated_engine(engine: str):
+    if engine in DEPRECATED_ENGINES:
+        warnings.warn(
+            f"Engine '{engine}' is deprecated and will be removed in a future release. Use '{DEPRECATED_ENGINES[engine]}' instead.",
+            DeprecationWarning,
+            stacklevel=2,
+        )
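The new module is self-contained, so the deprecation path can be exercised directly; a quick sketch showing that `'legacy'` warns while current names stay silent:

```python
import warnings

from content_core.common.types import warn_if_deprecated_engine

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_deprecated_engine("legacy")  # deprecated alias for 'simple'
    warn_if_deprecated_engine("simple")  # current name, no warning
    assert len(caught) == 1
    assert issubclass(caught[0].category, DeprecationWarning)
```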
--- content_core-0.7.2/src/content_core/content/extraction/graph.py
+++ content_core-0.8.0/src/content_core/content/extraction/graph.py
@@ -2,6 +2,7 @@ import os
 import tempfile
 from typing import Any, Dict, Optional
 from urllib.parse import urlparse
+from content_core.common.types import warn_if_deprecated_engine
 
 import aiohttp
 import magic
@@ -114,14 +115,28 @@ async def download_remote_file(state: ProcessSourceState) -> Dict[str, Any]:
         return {"file_path": tmp, "identified_type": mime}
 
 
+
 async def file_type_router_docling(state: ProcessSourceState) -> str:
     """
-    Route to Docling if enabled and supported; otherwise use
+    Route to Docling if enabled and supported; otherwise use simple file type edge.
+    Supports 'auto', 'docling', 'simple', and 'legacy' (deprecated, alias for simple).
+    'auto' tries simple first, then falls back to docling if simple fails.
     """
-
-
+    engine = state.engine or CONFIG.get("extraction", {}).get("engine", "auto")
+    warn_if_deprecated_engine(engine)
+    if engine == "auto":
+        # Try docling first; if it fails or is not supported, fallback to simple
+        if state.identified_type in DOCLING_SUPPORTED:
+            try:
+                return "extract_docling"
+            except Exception as e:
+                logger.warning(f"Docling extraction failed in 'auto' mode, falling back to simple: {e}")
+        # Fallback to simple
+        return await file_type_edge(state)
+
     if engine == "docling" and state.identified_type in DOCLING_SUPPORTED:
         return "extract_docling"
+    # For 'simple' and 'legacy', use the default file type edge
     return await file_type_edge(state)
 
 
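A rough sketch of driving the new router directly. The module path follows the file list at the top of this diff, and treating `application/pdf` as Docling-supported is an assumption about `DOCLING_SUPPORTED`:

```python
import asyncio

from content_core.common import ProcessSourceState
from content_core.content.extraction.graph import file_type_router_docling

state = ProcessSourceState(
    file_path="report.pdf",
    identified_type="application/pdf",  # assumed to be in DOCLING_SUPPORTED
    engine="auto",
)
# With 'auto', Docling-supported types route to the "extract_docling" edge;
# everything else defers to file_type_edge(state).
print(asyncio.run(file_type_router_docling(state)))
```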
--- content_core-0.7.2/src/content_core/processors/audio.py
+++ content_core-0.8.0/src/content_core/processors/audio.py
@@ -1,9 +1,10 @@
 import asyncio
+import math
 import os
 import tempfile
-import math
 import traceback
 from functools import partial
+
 from moviepy import AudioFileClip
 
 from content_core.common import ProcessSourceState
@@ -64,7 +65,9 @@ async def split_audio(input_file, segment_length_minutes=15, output_prefix=None)
     )
 
 
-def extract_audio(input_file: str, output_file: str, start_time: float = None, end_time: float = None) -> None:
+def extract_audio(
+    input_file: str, output_file: str, start_time: float = None, end_time: float = None
+) -> None:
     """
     Extract audio from a video or audio file and save it as an MP3 file.
     If start_time and end_time are provided, only that segment of audio is extracted.
@@ -78,17 +81,17 @@ def extract_audio(input_file: str, output_file: str, start_time: float = None, e
     try:
         # Load the file as an AudioFileClip
         audio_clip = AudioFileClip(input_file)
-
-        # If start_time and end_time are provided, trim the audio
+
+        # If start_time and/or end_time are provided, trim the audio using subclipped
         if start_time is not None and end_time is not None:
-            audio_clip = audio_clip.
+            audio_clip = audio_clip.subclipped(start_time, end_time)
         elif start_time is not None:
-            audio_clip = audio_clip.
+            audio_clip = audio_clip.subclipped(start_time)
         elif end_time is not None:
-            audio_clip = audio_clip.
+            audio_clip = audio_clip.subclipped(0, end_time)
 
         # Export the audio as MP3
-        audio_clip.write_audiofile(output_file, codec=
+        audio_clip.write_audiofile(output_file, codec="mp3")
         audio_clip.close()
     except Exception as e:
         logger.error(f"Error extracting audio: {str(e)}")
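The trimming change tracks MoviePy 2.x, which renamed `subclip` to `subclipped`. A standalone sketch of the same trim-and-export pattern (file names are placeholders; the codec string mirrors the package's own call):

```python
from moviepy import AudioFileClip

clip = AudioFileClip("input.mp3")
segment = clip.subclipped(0, 30)  # first 30 seconds, MoviePy 2.x API
segment.write_audiofile("segment.mp3", codec="mp3")
clip.close()
```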
@@ -117,7 +120,9 @@ async def extract_audio_data(data: ProcessSourceState):
         output_files = []
 
         if duration_s > segment_length_s:
-            logger.info(f"Audio is longer than 10 minutes ({duration_s}s), splitting into {math.ceil(duration_s / segment_length_s)} segments")
+            logger.info(
+                f"Audio is longer than 10 minutes ({duration_s}s), splitting into {math.ceil(duration_s / segment_length_s)} segments"
+            )
             for i in range(math.ceil(duration_s / segment_length_s)):
                 start_time = i * segment_length_s
                 end_time = min((i + 1) * segment_length_s, audio.duration)
@@ -134,15 +139,18 @@ async def extract_audio_data(data: ProcessSourceState):
 
         # Transcribe audio files
         from content_core.models import ModelFactory
+
         speech_to_text_model = ModelFactory.get_model("speech_to_text")
         transcriptions = []
         for audio_file in output_files:
-            transcription = await transcribe_audio_segment(audio_file, speech_to_text_model)
+            transcription = await transcribe_audio_segment(
+                audio_file, speech_to_text_model
+            )
             transcriptions.append(transcription)
 
         return {
             "metadata": {"audio_files": output_files},
-            "content": " ".join(transcriptions)
+            "content": " ".join(transcriptions),
         }
     except Exception as e:
         logger.error(f"Error processing audio: {str(e)}")
--- /dev/null
+++ content_core-0.8.0/src/content_core/processors/url.py
@@ -0,0 +1,248 @@
+import os
+from io import BytesIO
+from urllib.parse import urlparse
+
+import aiohttp
+import docx
+from bs4 import BeautifulSoup
+from readability import Document
+
+from content_core.common import ProcessSourceState
+from content_core.common.types import warn_if_deprecated_engine
+from content_core.logging import logger
+from content_core.processors.pdf import SUPPORTED_FITZ_TYPES
+
+DOCX_MIME_TYPE = (
+    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
+)
+
+
+async def _extract_docx_content(docx_bytes: bytes, url: str):
+    """
+    Extract content from DOCX file bytes.
+    """
+    try:
+        logger.debug(f"Attempting to parse DOCX from URL: {url} with python-docx")
+        doc = docx.Document(BytesIO(docx_bytes))
+        content_parts = [p.text for p in doc.paragraphs if p.text]
+        full_content = "\n\n".join(content_parts)
+
+        # Try to get a title from document properties or first heading
+        title = doc.core_properties.title
+        if not title and doc.paragraphs:
+            # Look for a potential title in the first few paragraphs (e.g., if styled as heading)
+            for p in doc.paragraphs[:5]:  # Check first 5 paragraphs
+                if p.style.name.startswith("Heading"):
+                    title = p.text
+                    break
+            if not title:  # Fallback to first line if no heading found
+                title = (
+                    doc.paragraphs[0].text.strip()
+                    if doc.paragraphs[0].text.strip()
+                    else None
+                )
+
+        # If no title found, use filename from URL
+        if not title:
+            title = urlparse(url).path.split("/")[-1]
+
+        logger.info(f"Successfully extracted content from DOCX: {url}, Title: {title}")
+        return {
+            "title": title,
+            "content": full_content,
+            "domain": urlparse(url).netloc,
+            "url": url,
+        }
+    except Exception as e:
+        logger.error(f"Failed to process DOCX content from {url}: {e}")
+        # Fallback or re-raise, depending on desired error handling
+        return {
+            "title": f"Error Processing DOCX: {urlparse(url).path.split('/')[-1]}",
+            "content": f"Failed to extract content from DOCX: {e}",
+            "domain": urlparse(url).netloc,
+            "url": url,
+        }
+
+
+async def url_provider(state: ProcessSourceState):
+    """
+    Identify the provider
+    """
+    return_dict = {}
+    url = state.url
+    if url:
+        if "youtube.com" in url or "youtu.be" in url:
+            return_dict["identified_type"] = "youtube"
+        else:
+            # remote URL: check content-type to catch PDFs
+            try:
+                async with aiohttp.ClientSession() as session:
+                    async with session.head(
+                        url, timeout=10, allow_redirects=True
+                    ) as resp:
+                        mime = resp.headers.get("content-type", "").split(";", 1)[0]
+            except Exception as e:
+                logger.debug(f"HEAD check failed for {url}: {e}")
+                mime = "article"
+            if mime in SUPPORTED_FITZ_TYPES:
+                return_dict["identified_type"] = mime
+            else:
+                return_dict["identified_type"] = "article"
+    return return_dict
+
+
+async def extract_url_bs4(url: str) -> dict:
+    """
+    Get the title and content of a URL using readability with a fallback to BeautifulSoup.
+
+    Args:
+        url (str): The URL of the webpage to extract content from.
+
+    Returns:
+        dict: A dictionary containing the 'title' and 'content' of the webpage.
+    """
+    async with aiohttp.ClientSession() as session:
+        try:
+            # Fetch the webpage content
+            async with session.get(url, timeout=10) as response:
+                if response.status != 200:
+                    raise Exception(f"HTTP error: {response.status}")
+                html = await response.text()
+
+            # Try extracting with readability
+            try:
+                doc = Document(html)
+                title = doc.title() or "No title found"
+                # Extract content as plain text by parsing the cleaned HTML
+                soup = BeautifulSoup(doc.summary(), "lxml")
+                content = soup.get_text(separator=" ", strip=True)
+                if not content.strip():
+                    raise ValueError("No content extracted by readability")
+            except Exception as e:
+                print(f"Readability failed: {e}")
+                # Fallback to BeautifulSoup
+                soup = BeautifulSoup(html, "lxml")
+                # Extract title
+                title_tag = (
+                    soup.find("title")
+                    or soup.find("h1")
+                    or soup.find("meta", property="og:title")
+                )
+                title = (
+                    title_tag.get_text(strip=True) if title_tag else "No title found"
+                )
+                # Extract content from common content tags
+                content_tags = soup.select(
+                    'article, .content, .post, main, [role="main"], div[class*="content"], div[class*="article"]'
+                )
+                content = (
+                    " ".join(
+                        tag.get_text(separator=" ", strip=True) for tag in content_tags
+                    )
+                    if content_tags
+                    else soup.get_text(separator=" ", strip=True)
+                )
+                content = content.strip() or "No content found"
+
+            return {
+                "title": title,
+                "content": content,
+            }
+
+        except Exception as e:
+            print(f"Error processing URL {url}: {e}")
+            return {
+                "title": "Error",
+                "content": f"Failed to extract content: {str(e)}",
+            }
+
+
+async def extract_url_jina(url: str):
+    """
+    Get the content of a URL using Jina. Uses Bearer token if JINA_API_KEY is set.
+    """
+    headers = {}
+    api_key = os.environ.get("JINA_API_KEY")
+    if api_key:
+        headers["Authorization"] = f"Bearer {api_key}"
+    async with aiohttp.ClientSession() as session:
+        async with session.get(f"https://r.jina.ai/{url}", headers=headers) as response:
+            text = await response.text()
+            if text.startswith("Title:") and "\n" in text:
+                title_end = text.index("\n")
+                title = text[6:title_end].strip()
+                content = text[title_end + 1 :].strip()
+                logger.debug(
+                    f"Processed url: {url}, found title: {title}, content: {content[:100]}..."
+                )
+                return {"title": title, "content": content}
+            else:
+                logger.debug(
+                    f"Processed url: {url}, does not have Title prefix, returning full content: {text[:100]}..."
+                )
+                return {"content": text}
+
+
+async def extract_url_firecrawl(url: str):
+    """
+    Get the content of a URL using Firecrawl.
+    Returns {"title": ..., "content": ...} or None on failure.
+    """
+    try:
+        from firecrawl import AsyncFirecrawlApp
+
+        app = AsyncFirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))
+        scrape_result = await app.scrape_url(url, formats=["markdown", "html"])
+        return {
+            "title": scrape_result.metadata["title"] or scrape_result.title,
+            "content": scrape_result.markdown,
+        }
+
+    except Exception as e:
+        logger.error(f"Firecrawl extraction error for URL: {url}: {e}")
+        return None
+
+
+async def extract_url(state: ProcessSourceState):
+    """
+    Extract content from a URL using the engine specified in the state.
+    Supported engines: 'auto', 'simple', 'legacy' (deprecated), 'firecrawl', 'jina'.
+    """
+    assert state.url, "No URL provided"
+    url = state.url
+    engine = state.engine or "auto"
+    warn_if_deprecated_engine(engine)
+    try:
+        if engine == "auto":
+            if os.environ.get("FIRECRAWL_API_KEY"):
+                logger.debug(
+                    "Engine 'auto' selected: using Firecrawl (FIRECRAWL_API_KEY detected)"
+                )
+                return await extract_url_firecrawl(url)
+            else:
+                try:
+                    logger.debug("Trying to use Jina to extract URL")
+                    return await extract_url_jina(url)
+                except Exception as e:
+                    logger.error(f"Jina extraction error for URL: {url}: {e}")
+                    logger.debug("Falling back to BeautifulSoup")
+                    return await extract_url_bs4(url)
+        elif engine == "simple" or engine == "legacy":
+            # 'legacy' is deprecated alias for 'simple'
+            return await extract_url_bs4(url)
+        elif engine == "firecrawl":
+            return await extract_url_firecrawl(url)
+        elif engine == "jina":
+            return await extract_url_jina(url)
+        elif engine == "docling":
+            from content_core.processors.docling import extract_with_docling
+
+            state.url = url
+            result_state = await extract_with_docling(state)
+            return {"title": None, "content": result_state.content}
+        else:
+            raise ValueError(f"Unknown engine: {engine}")
+    except Exception as e:
+        logger.error(f"URL extraction failed for URL: {url}")
+        logger.exception(e)
+        return None
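Taken together, `extract_url` is the single entry point for URL extraction; below is a minimal sketch of calling it with the deterministic `'simple'` engine. Note that it returns `None` on failure, so the result keys are only safe to read on success:

```python
import asyncio

from content_core.common import ProcessSourceState
from content_core.processors.url import extract_url


async def main():
    state = ProcessSourceState(url="https://www.example.com", engine="simple")
    result = await extract_url(state)
    if result is not None:
        print(result.get("title"), (result.get("content") or "")[:80])


asyncio.run(main())
```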