PyPI - content-core - Versions diffs - 0.8.5__tar.gz → 1.0.1__tar.gz - Mend

content-core 0.8.5 tar.gz → 1.0.1 tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of content-core might be problematic.

Files changed (63)

{content_core-0.8.5 → content_core-1.0.1}/.gitignore RENAMED

@@ -21,4 +21,5 @@ todo.md
 WIP/
 *.ignore
-.windsurfrules
+.windsurfrules
+CLAUDE.md

{content_core-0.8.5 → content_core-1.0.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: content-core
-Version: 0.8.5
+Version: 1.0.1
 Summary: Extract what matters from any media source
 Author-email: LUIS NOVO <lfnovo@gmail.com>
 License-File: LICENSE
@@ -234,12 +234,18 @@ async def main():
     md_data = await extract_content({"file_path": "path/to/your/document.md"})
     print(md_data)
-    # Per-execution override with Docling
+    # Per-execution override with Docling for documents
     doc_data = await extract_content({
         "file_path": "path/to/your/document.pdf",
-        "engine": "docling",
+        "document_engine": "docling",
         "output_format": "html"
     })
+    # Per-execution override with Firecrawl for URLs
+    url_data = await extract_content({
+        "url": "https://www.example.com",
+        "url_engine": "firecrawl"
+    })
     print(doc_data)
 if __name__ == "__main__":
@@ -262,7 +268,8 @@ Docling is not the default engine when parsing documents. If you don't want to u
 In your `cc_config.yaml` or custom config, set:
 ```yaml
 extraction:
-  engine: docling       # 'legacy' (default) or 'docling'
+  document_engine: docling  # 'auto' (default), 'simple', or 'docling'
+  url_engine: auto          # 'auto' (default), 'simple', 'firecrawl', or 'jina'
   docling:
     output_format: markdown  # markdown | html | json
 ```
@@ -270,10 +277,13 @@ extraction:
 #### Programmatically in Python
 ```python
-from content_core.config import set_extraction_engine, set_docling_output_format
+from content_core.config import set_document_engine, set_url_engine, set_docling_output_format
+# switch document engine to Docling
+set_document_engine("docling")
-# switch engine to Docling
-set_extraction_engine("docling")
+# switch URL engine to Firecrawl
+set_url_engine("firecrawl")
 # choose output format: 'markdown', 'html', or 'json'
 set_docling_output_format("html")

{content_core-0.8.5 → content_core-1.0.1}/README.md RENAMED

@@ -201,12 +201,18 @@ async def main():
     md_data = await extract_content({"file_path": "path/to/your/document.md"})
     print(md_data)
-    # Per-execution override with Docling
+    # Per-execution override with Docling for documents
     doc_data = await extract_content({
         "file_path": "path/to/your/document.pdf",
-        "engine": "docling",
+        "document_engine": "docling",
         "output_format": "html"
     })
+    # Per-execution override with Firecrawl for URLs
+    url_data = await extract_content({
+        "url": "https://www.example.com",
+        "url_engine": "firecrawl"
+    })
     print(doc_data)
 if __name__ == "__main__":
@@ -229,7 +235,8 @@ Docling is not the default engine when parsing documents. If you don't want to u
 In your `cc_config.yaml` or custom config, set:
 ```yaml
 extraction:
-  engine: docling       # 'legacy' (default) or 'docling'
+  document_engine: docling  # 'auto' (default), 'simple', or 'docling'
+  url_engine: auto          # 'auto' (default), 'simple', 'firecrawl', or 'jina'
   docling:
     output_format: markdown  # markdown | html | json
 ```
@@ -237,10 +244,13 @@ extraction:
 #### Programmatically in Python
 ```python
-from content_core.config import set_extraction_engine, set_docling_output_format
+from content_core.config import set_document_engine, set_url_engine, set_docling_output_format
+# switch document engine to Docling
+set_document_engine("docling")
-# switch engine to Docling
-set_extraction_engine("docling")
+# switch URL engine to Firecrawl
+set_url_engine("firecrawl")
 # choose output format: 'markdown', 'html', or 'json'
 set_docling_output_format("html")

{content_core-0.8.5 → content_core-1.0.1}/docs/processors.md RENAMED

@@ -21,11 +21,11 @@ Content Core uses a modular approach to process content from different sources.
 - **Supported Input**: URLs (web pages).
 - **Returned Data**: Extracted text content from the web page, often in a cleaned format.
 - **Location**: `src/content_core/processors/url.py`
-- **Default Engine (`auto`) Logic**:
+- **Default URL Engine (`auto`) Logic**:
     - If `FIRECRAWL_API_KEY` is set, uses Firecrawl for extraction.
     - Else it tries Jina until it fails because of rate limits (unless `JINA_API_KEY` is set).
     - Else, falls back to BeautifulSoup-based extraction.
-    - You can explicitly specify an engine (`'firecrawl'`, `'jina'`, `'simple'`, etc.), but `'auto'` is now the default and recommended for most users.
+    - You can explicitly specify a URL engine (`'firecrawl'`, `'jina'`, `'simple'`), but `'auto'` is now the default and recommended for most users.
 ### 3. **File Processor**
 - **Purpose**: Processes local files of various types, extracting content based on file format.
@@ -47,23 +47,27 @@ Content Core uses a modular approach to process content from different sources.
 - **Supported Input**: PDF, DOCX, XLSX, PPTX, Markdown, AsciiDoc, HTML, CSV, Images (PNG, JPEG, TIFF, BMP).
 - **Returned Data**: Content converted to configured format (markdown, html, json).
 - **Location**: `src/content_core/processors/docling.py`
-- **Default Engine (`auto`) Logic for Files/Documents**:
+- **Default Document Engine (`auto`) Logic for Files/Documents**:
     - Tries the `'docling'` extraction method first (robust document parsing for supported types).
     - If `'docling'` fails or is not supported, automatically falls back to simple extraction (fast, lightweight for supported types).
-    - You can explicitly specify `'docling'`, `'simple'`, or `'legacy'` as the engine, but `'auto'` is now the default and recommended for most users.
+    - You can explicitly specify `'docling'` or `'simple'` as the document engine, but `'auto'` is now the default and recommended for most users.
 - **Configuration**: Activate the Docling engine in `cc_config.yaml` or custom config:
   ```yaml
   extraction:
-    engine: docling       # 'auto' (default), 'docling', or 'simple'
+    document_engine: docling  # 'auto' (default), 'simple', or 'docling'
+    url_engine: auto          # 'auto' (default), 'simple', 'firecrawl', or 'jina'
     docling:
       output_format: markdown  # markdown | html | json
   ```
 - **Programmatic Toggle**: Use helper functions in Python:
   ```python
-  from content_core.config import set_extraction_engine, set_docling_output_format
+  from content_core.config import set_document_engine, set_url_engine, set_docling_output_format
-  # switch engine to Docling
-  set_extraction_engine("docling")
+  # switch document engine to Docling
+  set_document_engine("docling")
+  # switch URL engine to Firecrawl
+  set_url_engine("firecrawl")
   # choose output format
   set_docling_output_format("html")
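
The URL `auto` fallback order described in the `docs/processors.md` hunk above can be sketched as follows. This is a minimal illustration, not content-core's implementation: `extract_url_auto`, the `extractors` mapping, and the stub extractors are hypothetical names introduced here, and the real extractors are replaced by injected callables so the ordering can be exercised without network access.

```python
import asyncio
import os
from typing import Awaitable, Callable, Mapping, Optional

# Hypothetical type alias: an extractor takes a URL and returns a result dict.
Extractor = Callable[[str], Awaitable[dict]]


async def extract_url_auto(
    url: str,
    extractors: Mapping[str, Extractor],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Hypothetical sketch of the documented 'auto' order for URLs."""
    env = os.environ if env is None else env
    # 1. Firecrawl is used whenever its API key is configured.
    if env.get("FIRECRAWL_API_KEY"):
        return await extractors["firecrawl"](url)
    # 2. Otherwise try Jina, which may fail (for example on rate limits) ...
    try:
        return await extractors["jina"](url)
    except Exception:
        # 3. ... and fall back to the BeautifulSoup-based 'simple' engine.
        return await extractors["simple"](url)


# Stub extractors (assumptions for illustration only).
async def _fake_firecrawl(url: str) -> dict:
    return {"engine": "firecrawl", "url": url}


async def _rate_limited_jina(url: str) -> dict:
    raise RuntimeError("rate limited")


async def _fake_simple(url: str) -> dict:
    return {"engine": "simple", "url": url}


STUBS = {
    "firecrawl": _fake_firecrawl,
    "jina": _rate_limited_jina,
    "simple": _fake_simple,
}
```

Passing `env` explicitly keeps the sketch testable; with `env=None` it reads `os.environ`, the same place the diff's `extract_url` looks for `FIRECRAWL_API_KEY`.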

{content_core-0.8.5 → content_core-1.0.1}/docs/usage.md RENAMED

@@ -80,11 +80,11 @@ This will allow you to quickly start with customized settings without needing to
 ### Extraction Engine Selection
-By default, Content Core uses the `'auto'` engine for all extraction tasks. The logic is as follows:
-- **For URLs**: Uses Firecrawl if `FIRECRAWL_API_KEY` is set, else Jina if `JINA_API_KEY` is set, else falls back to BeautifulSoup.
-- **For files**: Tries Docling extraction first (for robust document parsing), then falls back to simple extraction if needed.
+By default, Content Core uses the `'auto'` engine for both document and URL extraction tasks. The logic is as follows:
+- **For URLs** (`url_engine`): Uses Firecrawl if `FIRECRAWL_API_KEY` is set, else Jina if `JINA_API_KEY` is set, else falls back to BeautifulSoup.
+- **For files** (`document_engine`): Tries Docling extraction first (for robust document parsing), then falls back to simple extraction if needed.
-You can override this behavior by specifying an engine in your config or function call, but `'auto'` is recommended for most users.
+You can override this behavior by specifying separate engines for documents and URLs in your config or function call, but `'auto'` is recommended for most users.
 #### Docling Engine
@@ -94,35 +94,46 @@ Content Core supports an optional Docling engine for advanced document parsing.
 Add under the `extraction` section:
 ```yaml
 extraction:
-  engine: docling        # auto (default), docling, or simple
+  document_engine: docling  # auto (default), simple, or docling
+  url_engine: auto          # auto (default), simple, firecrawl, or jina
   docling:
-    output_format: html  # markdown | html | json
+    output_format: html     # markdown | html | json
 ```
 ##### Programmatically in Python
 ```python
-from content_core.config import set_extraction_engine, set_docling_output_format
+from content_core.config import set_document_engine, set_url_engine, set_docling_output_format
-# toggle to Docling
-set_extraction_engine("docling")
+# toggle document engine to Docling
+set_document_engine("docling")
+# toggle URL engine to Firecrawl
+set_url_engine("firecrawl")
 # pick format
 set_docling_output_format("json")
 ```
 #### Per-Execution Overrides
-You can override the extraction engine and Docling output format on a per-call basis by including `engine` and `output_format` in your input:
+You can override the extraction engines and Docling output format on a per-call basis by including `document_engine`, `url_engine` and `output_format` in your input:
 ```python
 from content_core.content.extraction import extract_content
-# override engine and format for this document
+# override document engine and format for this document
 result = await extract_content({
     "file_path": "document.pdf",
-    "engine": "docling",
+    "document_engine": "docling",
     "output_format": "html"
 })
 print(result.content)
+# override URL engine for this URL
+result = await extract_content({
+    "url": "https://example.com",
+    "url_engine": "firecrawl"
+})
+print(result.content)
 ```
 Or using `ProcessSourceInput`:
@@ -133,7 +144,7 @@ from content_core.content.extraction import extract_content
 input = ProcessSourceInput(
     file_path="document.pdf",
-    engine="docling",
+    document_engine="docling",
     output_format="json"
 )
 result = await extract_content(input)
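
For readers carrying a 0.8.5-style config forward, the rename from `engine` to `document_engine` / `url_engine` shown in the hunks above could be applied with a small helper along these lines. This is a sketch under stated assumptions: `migrate_extraction_config` is a hypothetical function, not part of content-core, and the choice to fall back to `auto` for values an engine type does not list is an assumption made here.

```python
from typing import Any, Dict

# Allowed values as listed in the 1.0.1 docs and types in this diff.
DOCUMENT_ENGINES = {"auto", "simple", "docling"}
URL_ENGINES = {"auto", "simple", "firecrawl", "jina"}


def migrate_extraction_config(extraction: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: split a 0.8.5 `engine` key into the two 1.0.1 keys."""
    migrated = dict(extraction)  # shallow copy; the caller's dict is untouched
    old = migrated.pop("engine", None)
    if old is None:
        return migrated  # nothing to migrate
    if old == "legacy":
        # The removed code treated 'legacy' as a deprecated alias for 'simple'.
        old = "simple"
    # Keep the old value where the new engine type accepts it, else use 'auto'.
    migrated.setdefault("document_engine", old if old in DOCUMENT_ENGINES else "auto")
    migrated.setdefault("url_engine", old if old in URL_ENGINES else "auto")
    return migrated
```

The helper uses `setdefault` so that a config which already carries the new keys is not overwritten by the old `engine` value.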

{content_core-0.8.5 → content_core-1.0.1}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "content-core"
-version = "0.8.5"
+version = "1.0.1"
 description = "Extract what matters from any media source"
 readme = "README.md"
 homepage = "https://github.com/lfnovo/content-core"

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/__init__.py RENAMED

@@ -113,7 +113,7 @@ async def ccore_main():
         if args.format == "xml":
             result = dicttoxml(
                 result.model_dump(), custom_root="result", attr_type=False
-            )
+            ).decode('utf-8')
         elif args.format == "json":
             result = result.model_dump_json()
         else:  # text
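
The `.decode('utf-8')` added in this hunk converts the XML serializer's `bytes` result to `str` before it is printed. The same bytes-versus-text distinction can be seen with the standard library alone; the sketch below uses `xml.etree.ElementTree` as a stand-in for `dicttoxml` (an assumption made so the example has no third-party dependency), since `ElementTree.tostring` also returns `bytes` by default.

```python
from xml.etree import ElementTree as ET

# Build a tiny document shaped like the CLI's XML output (illustrative only).
root = ET.Element("result")
ET.SubElement(root, "content").text = "hi"

raw = ET.tostring(root)     # bytes by default
text = raw.decode("utf-8")  # str, suitable for printing as plain text

print(text)
```

Printing `raw` directly would show a `b'...'` literal rather than the markup itself, which is presumably what the new `test_ccore_xml_format` test would trip over when parsing stdout.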

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/cc_config.yaml RENAMED

@@ -30,7 +30,8 @@ summary_model:
     max_tokens: 2000
 extraction:
-  engine: legacy  # change to 'docling' to enable Docling engine
+  document_engine: auto  # auto | simple | docling - for files/documents
+  url_engine: auto  # auto | simple | firecrawl | jina | docling - for URLs
   docling:
     output_format: markdown  # markdown | html | json

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/common/state.py RENAMED

@@ -2,8 +2,7 @@ from typing import Optional
 from pydantic import BaseModel, Field
-from content_core.common.types import Engine
-from content_core.common.types import Engine
+from content_core.common.types import DocumentEngine, UrlEngine
 class ProcessSourceState(BaseModel):
@@ -16,9 +15,13 @@ class ProcessSourceState(BaseModel):
     identified_provider: Optional[str] = ""
     metadata: Optional[dict] = Field(default_factory=lambda: {})
     content: Optional[str] = ""
-    engine: Optional[Engine] = Field(
+    document_engine: Optional[DocumentEngine] = Field(
         default=None,
-        description="Override extraction engine: 'auto', 'simple', 'legacy', 'firecrawl', 'jina', or 'docling'",
+        description="Override document extraction engine: 'auto', 'simple', or 'docling'",
+    )
+    url_engine: Optional[UrlEngine] = Field(
+        default=None,
+        description="Override URL extraction engine: 'auto', 'simple', 'firecrawl', 'jina', or 'docling'",
     )
     output_format: Optional[str] = Field(
         default=None,
@@ -30,7 +33,8 @@ class ProcessSourceInput(BaseModel):
     content: Optional[str] = ""
     file_path: Optional[str] = ""
     url: Optional[str] = ""
-    engine: Optional[str] = None
+    document_engine: Optional[str] = None
+    url_engine: Optional[str] = None
     output_format: Optional[str] = None

content_core-1.0.1/src/content_core/common/types.py ADDED

@@ -0,0 +1,14 @@
+from typing import Literal
+DocumentEngine = Literal[
+    "auto",
+    "simple",
+    "docling",
+]
+UrlEngine = Literal[
+    "auto",
+    "simple",
+    "firecrawl",
+    "jina",
+]
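
Because the two aliases above are `typing.Literal` types, their allowed values can be read back at runtime with `typing.get_args`. The sketch below redefines the aliases locally and adds a hypothetical `validate_engine` helper (not part of content-core) to show how a plain string could be checked against them. Note that the `cc_config.yaml` comment and the `url_engine` field description elsewhere in this diff also mention `'docling'` for URLs, which `UrlEngine` does not list.

```python
from typing import Literal, get_args

# Local copies of the aliases added in types.py.
DocumentEngine = Literal["auto", "simple", "docling"]
UrlEngine = Literal["auto", "simple", "firecrawl", "jina"]


def validate_engine(value: str, engine_type) -> str:
    """Hypothetical helper: reject names the Literal alias does not list."""
    allowed = get_args(engine_type)  # tuple of the literal strings
    if value not in allowed:
        raise ValueError(f"Unknown engine {value!r}; expected one of {allowed}")
    return value
```

With these definitions, `validate_engine("jina", UrlEngine)` returns the string unchanged, while `validate_engine("docling", UrlEngine)` raises `ValueError`.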

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/config.py RENAMED

@@ -35,9 +35,13 @@ def load_config():
 CONFIG = load_config()
 # Programmatic config overrides: use in notebooks or scripts
-def set_extraction_engine(engine: str):
-    """Override the extraction engine ('legacy' or 'docling')."""
-    CONFIG.setdefault("extraction", {})["engine"] = engine
+def set_document_engine(engine: str):
+    """Override the document extraction engine ('auto', 'simple', or 'docling')."""
+    CONFIG.setdefault("extraction", {})["document_engine"] = engine
+def set_url_engine(engine: str):
+    """Override the URL extraction engine ('auto', 'simple', 'firecrawl', 'jina', or 'docling')."""
+    CONFIG.setdefault("extraction", {})["url_engine"] = engine
 def set_docling_output_format(fmt: str):
     """Override Docling output_format ('markdown', 'html', or 'json')."""

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/content/extraction/graph.py RENAMED

@@ -12,7 +12,6 @@ from content_core.common import (
     ProcessSourceState,
     UnsupportedTypeException,
 )
-from content_core.common.types import warn_if_deprecated_engine
 from content_core.config import CONFIG  # type: ignore
 from content_core.logging import logger
 from content_core.processors.audio import extract_audio_data  # type: ignore
@@ -124,11 +123,10 @@ async def download_remote_file(state: ProcessSourceState) -> Dict[str, Any]:
 async def file_type_router_docling(state: ProcessSourceState) -> str:
     """
     Route to Docling if enabled and supported; otherwise use simple file type edge.
-    Supports 'auto', 'docling', 'simple', and 'legacy' (deprecated, alias for simple).
-    'auto' tries simple first, then falls back to docling if simple fails.
+    Supports 'auto', 'docling', and 'simple'.
+    'auto' tries docling first, then falls back to simple if docling fails.
     """
-    engine = state.engine or CONFIG.get("extraction", {}).get("engine", "auto")
-    warn_if_deprecated_engine(engine)
+    engine = state.document_engine or CONFIG.get("extraction", {}).get("document_engine", "auto")
     if engine == "auto":
         logger.debug("Using auto engine")
         # Try docling first; if it fails or is not supported, fallback to simple
@@ -147,7 +145,7 @@ async def file_type_router_docling(state: ProcessSourceState) -> str:
     if engine == "docling" and state.identified_type in DOCLING_SUPPORTED:
         logger.debug("Using docling engine")
         return "extract_docling"
-    # For 'simple' and 'legacy', use the default file type edge
+    # For 'simple', use the default file type edge
     logger.debug("Using simple engine")
     return await file_type_edge(state)
@@ -196,8 +194,10 @@ workflow.add_conditional_edges(
             for m in list(SUPPORTED_FITZ_TYPES)
             + list(SUPPORTED_OFFICE_TYPES)
             + list(DOCLING_SUPPORTED)
+            if m not in ["text/html"]  # Exclude HTML from file download, treat as web content
         },
         "article": "extract_url",
+        "text/html": "extract_url",  # Route HTML content to URL extraction
         "youtube": "extract_youtube_transcript",
     },
 )

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/processors/url.py RENAMED

@@ -5,7 +5,7 @@ from bs4 import BeautifulSoup
 from readability import Document
 from content_core.common import ProcessSourceState
-from content_core.common.types import warn_if_deprecated_engine
+from content_core.config import CONFIG
 from content_core.logging import logger
 from content_core.processors.docling import DOCLING_SUPPORTED
 from content_core.processors.office import SUPPORTED_OFFICE_TYPES
@@ -160,13 +160,12 @@ async def extract_url_firecrawl(url: str):
 async def extract_url(state: ProcessSourceState):
     """
-    Extract content from a URL using the engine specified in the state.
-    Supported engines: 'auto', 'simple', 'legacy' (deprecated), 'firecrawl', 'jina'.
+    Extract content from a URL using the url_engine specified in the state.
+    Supported engines: 'auto', 'simple', 'firecrawl', 'jina'.
     """
     assert state.url, "No URL provided"
     url = state.url
-    engine = state.engine or "auto"
-    warn_if_deprecated_engine(engine)
+    engine = state.url_engine or CONFIG.get("extraction", {}).get("url_engine", "auto")
     try:
         if engine == "auto":
             if os.environ.get("FIRECRAWL_API_KEY"):
@@ -182,19 +181,12 @@ async def extract_url(state: ProcessSourceState):
                     logger.error(f"Jina extraction error for URL: {url}: {e}")
                     logger.debug("Falling back to BeautifulSoup")
                     return await extract_url_bs4(url)
-        elif engine == "simple" or engine == "legacy":
-            # 'legacy' is deprecated alias for 'simple'
+        elif engine == "simple":
             return await extract_url_bs4(url)
         elif engine == "firecrawl":
             return await extract_url_firecrawl(url)
         elif engine == "jina":
             return await extract_url_jina(url)
-        elif engine == "docling":
-            from content_core.processors.docling import extract_with_docling
-            state.url = url
-            result_state = await extract_with_docling(state)
-            return {"title": None, "content": result_state.content}
         else:
             raise ValueError(f"Unknown engine: {engine}")
     except Exception as e:
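
In 1.0.1 the URL engine is resolved as `state.url_engine or CONFIG.get("extraction", {}).get("url_engine", "auto")`, so a per-call value wins over the global config, which wins over the `auto` default. In 0.8.5 the URL path ignored the config and used only `state.engine or "auto"`. The precedence can be sketched in isolation as below; `CONFIG` here is a local stand-in dict and `resolve_url_engine` is a hypothetical name, not a content-core function.

```python
from typing import Optional

CONFIG: dict = {}  # local stand-in for content_core.config.CONFIG


def set_url_engine(engine: str) -> None:
    """Same pattern as the setter added in config.py."""
    CONFIG.setdefault("extraction", {})["url_engine"] = engine


def resolve_url_engine(per_call: Optional[str]) -> str:
    """Hypothetical helper mirroring the lookup order in the diff."""
    # Per-call override first, then global config, then the 'auto' default.
    return per_call or CONFIG.get("extraction", {}).get("url_engine", "auto")
```

One consequence of using `or` is that an empty-string override is treated like `None` and falls through to the config value.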

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/processors/youtube.py RENAMED

@@ -1,3 +1,4 @@
+import asyncio
 import re
 import ssl
@@ -68,69 +69,86 @@ async def _extract_youtube_id(url):
 async def get_best_transcript(video_id, preferred_langs=["en", "es", "pt"]):
-    try:
-        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
-        # First try: Manual transcripts in preferred languages
-        manual_transcripts = []
-        try:
-            for transcript in transcript_list:
-                if not transcript.is_generated and not transcript.is_translatable:
-                    manual_transcripts.append(transcript)
-            if manual_transcripts:
-                # Sort based on preferred language order
-                for lang in preferred_langs:
-                    for transcript in manual_transcripts:
-                        if transcript.language_code == lang:
-                            return transcript.fetch()
-                # If no preferred language found, return first manual transcript
-                return manual_transcripts[0].fetch()
-        except NoTranscriptFound:
-            pass
-        # Second try: Auto-generated transcripts in preferred languages
-        generated_transcripts = []
-        try:
-            for transcript in transcript_list:
-                if transcript.is_generated and not transcript.is_translatable:
-                    generated_transcripts.append(transcript)
-            if generated_transcripts:
-                # Sort based on preferred language order
-                for lang in preferred_langs:
-                    for transcript in generated_transcripts:
-                        if transcript.language_code == lang:
-                            return transcript.fetch()
-                # If no preferred language found, return first generated transcript
-                return generated_transcripts[0].fetch()
-        except NoTranscriptFound:
-            pass
-        # Last try: Translated transcripts in preferred languages
-        translated_transcripts = []
+    max_attempts = 5
+    for attempt in range(max_attempts):
         try:
-            for transcript in transcript_list:
-                if transcript.is_translatable:
-                    translated_transcripts.append(transcript)
-            if translated_transcripts:
-                # Sort based on preferred language order
-                for lang in preferred_langs:
-                    for transcript in translated_transcripts:
-                        if transcript.language_code == lang:
-                            return transcript.fetch()
-                # If no preferred language found, return translation to first preferred language
-                translation = translated_transcripts[0].translate(preferred_langs[0])
-                return translation.fetch()
-        except NoTranscriptFound:
-            pass
-        raise Exception("No suitable transcript found")
-    except Exception as e:
-        logger.error(f"Failed to get transcript for video {video_id}: {e}")
-        return None
+            transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
+            # First try: Manual transcripts in preferred languages
+            manual_transcripts = []
+            try:
+                for transcript in transcript_list:
+                    if not transcript.is_generated and not transcript.is_translatable:
+                        manual_transcripts.append(transcript)
+                if manual_transcripts:
+                    # Sort based on preferred language order
+                    for lang in preferred_langs:
+                        for transcript in manual_transcripts:
+                            if transcript.language_code == lang:
+                                return transcript.fetch()
+                    # If no preferred language found, return first manual transcript
+                    return manual_transcripts[0].fetch()
+            except NoTranscriptFound:
+                pass
+            # Second try: Auto-generated transcripts in preferred languages
+            generated_transcripts = []
+            try:
+                for transcript in transcript_list:
+                    if transcript.is_generated and not transcript.is_translatable:
+                        generated_transcripts.append(transcript)
+                if generated_transcripts:
+                    # Sort based on preferred language order
+                    for lang in preferred_langs:
+                        for transcript in generated_transcripts:
+                            if transcript.language_code == lang:
+                                return transcript.fetch()
+                    # If no preferred language found, return first generated transcript
+                    return generated_transcripts[0].fetch()
+            except NoTranscriptFound:
+                pass
+            # Last try: Translated transcripts in preferred languages
+            translated_transcripts = []
+            try:
+                for transcript in transcript_list:
+                    if transcript.is_translatable:
+                        translated_transcripts.append(transcript)
+                if translated_transcripts:
+                    # Sort based on preferred language order
+                    for lang in preferred_langs:
+                        for transcript in translated_transcripts:
+                            if transcript.language_code == lang:
+                                return transcript.fetch()
+                    # If no preferred language found, return translation to first preferred language
+                    translation = translated_transcripts[0].translate(
+                        preferred_langs[0]
+                    )
+                    return translation.fetch()
+            except NoTranscriptFound:
+                pass
+            raise Exception("No suitable transcript found")
+        except Exception as e:
+            if e.__class__.__name__ == "ParserError":
+                logger.warning(
+                    f"ParserError on attempt {attempt+1}/5 for video {video_id}. Retrying..."
+                )
+                if attempt == max_attempts - 1:
+                    logger.error(
+                        f"Failed to get transcript for video {video_id} after {max_attempts} attempts due to repeated ParserError."
+                    )
+                    return None
+                await asyncio.sleep(2)
+                continue
+            else:
+                logger.error(f"Failed to get transcript for video {video_id}: {e}")
+                return None
+    return None
 async def extract_youtube_transcript(state: ProcessSourceState):
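
The change to `get_best_transcript` wraps the existing lookup in a retry loop: up to five attempts, a two-second `asyncio.sleep` between them, and a retry only when the exception's class name is `ParserError`. The same pattern, factored into a generic helper, might look like the sketch below. `retry_on_named_error` and its parameters are hypothetical names introduced for illustration, and logging is omitted.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def retry_on_named_error(
    fetch: Callable[[], Awaitable[T]],
    error_name: str = "ParserError",
    max_attempts: int = 5,
    delay: float = 2.0,
) -> Optional[T]:
    """Hypothetical helper: retry only when the exception class name matches."""
    for attempt in range(max_attempts):
        try:
            return await fetch()
        except Exception as exc:
            if exc.__class__.__name__ != error_name:
                return None  # other errors are not retried
            if attempt == max_attempts - 1:
                return None  # out of attempts
            await asyncio.sleep(delay)  # pause before the next attempt
    return None
```

Matching on the class name instead of importing the exception type keeps the helper independent of whichever library raises `ParserError`, at the cost of also matching any unrelated exception that happens to share the name.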

content_core-1.0.1/tests/integration/test_cli.py ADDED

@@ -0,0 +1,394 @@
+import json
+import subprocess
+import sys
+from pathlib import Path
+from xml.etree import ElementTree as ET
+import pytest
+@pytest.fixture
+def fixture_path():
+    """Provides the path to the directory containing test input files."""
+    return Path(__file__).parent.parent / "input_content"
+def run_cli_command(command_args, input_data=None):
+    """Helper to run CLI commands and capture output."""
+    try:
+        result = subprocess.run(
+            command_args,
+            input=input_data,
+            capture_output=True,
+            text=True,
+            timeout=30
+        )
+        return result
+    except subprocess.TimeoutExpired:
+        pytest.fail(f"Command {command_args} timed out")
+class TestCcoreCLI:
+    """Tests for the ccore CLI command."""
+    def test_ccore_help(self):
+        """Test ccore help output."""
+        result = run_cli_command([sys.executable, "-m", "content_core", "--help"])
+        # Note: ccore is the default when running the module, but let's test the actual CLI entry points
+    def test_ccore_text_input(self):
+        """Test ccore with direct text input."""
+        result = run_cli_command(["uv", "run", "ccore", "This is a test content."])
+        assert result.returncode == 0
+        assert "This is a test content." in result.stdout
+        assert result.stderr == ""
+    def test_ccore_file_input(self, fixture_path):
+        """Test ccore with file input."""
+        md_file = fixture_path / "file.md"
+        if not md_file.exists():
+            pytest.skip(f"Fixture file not found: {md_file}")
+        result = run_cli_command(["uv", "run", "ccore", str(md_file)])
+        assert result.returncode == 0
+        assert len(result.stdout.strip()) > 0
+        assert "Buenos Aires" in result.stdout
+    def test_ccore_url_input(self):
+        """Test ccore with URL input."""
+        result = run_cli_command(["uv", "run", "ccore", "https://www.example.com"])
+        assert result.returncode == 0
+        assert len(result.stdout.strip()) > 0
+    def test_ccore_json_format(self):
+        """Test ccore with JSON output format."""
+        result = run_cli_command(["uv", "run", "ccore", "-f", "json", "Test content for JSON output."])
+        assert result.returncode == 0
+        # Verify it's valid JSON
+        output_data = json.loads(result.stdout)
+        assert isinstance(output_data, dict)
+        assert "content" in output_data
+        assert "Test content for JSON output." in output_data["content"]
+    def test_ccore_xml_format(self):
+        """Test ccore with XML output format."""
+        result = run_cli_command(["uv", "run", "ccore", "-f", "xml", "Test content for XML output."])
+        assert result.returncode == 0
+        # Verify it's valid XML
+        root = ET.fromstring(result.stdout.strip())
+        assert root.tag == "result"
+        content_elem = root.find(".//content")
+        assert content_elem is not None
+        assert "Test content for XML output." in content_elem.text
+    def test_ccore_text_format_explicit(self):
+        """Test ccore with explicit text format."""
+        result = run_cli_command(["uv", "run", "ccore", "-f", "text", "Test content for text output."])
+        assert result.returncode == 0
+        assert "Test content for text output." in result.stdout
+    def test_ccore_stdin_input(self):
+        """Test ccore with stdin input."""
+        test_content = "This content comes from stdin."
+        result = run_cli_command(["uv", "run", "ccore"], input_data=test_content)
+        assert result.returncode == 0
+        assert test_content in result.stdout
+    def test_ccore_stdin_json_format(self):
+        """Test ccore with stdin input and JSON format."""
+        test_content = "Stdin content with JSON format."
+        result = run_cli_command(["uv", "run", "ccore", "-f", "json"], input_data=test_content)
+        assert result.returncode == 0
+        # Verify it's valid JSON
+        output_data = json.loads(result.stdout)
+        assert test_content in output_data["content"]
+    def test_ccore_debug_flag(self):
+        """Test ccore with debug flag."""
+        result = run_cli_command(["uv", "run", "ccore", "-d", "Debug test content."])
+        assert result.returncode == 0
+        assert "Debug test content." in result.stdout
+        # Debug output goes to stderr in loguru
+    def test_ccore_file_pdf(self, fixture_path):
+        """Test ccore with PDF file."""
+        pdf_file = fixture_path / "file.pdf"
+        if not pdf_file.exists():
+            pytest.skip(f"Fixture file not found: {pdf_file}")
+        result = run_cli_command(["uv", "run", "ccore", str(pdf_file)])
+        assert result.returncode == 0
+        assert len(result.stdout.strip()) > 0
+class TestCcleanCLI:
+    """Tests for the cclean CLI command."""
+    def test_cclean_text_input(self):
+        """Test cclean with direct text input."""
+        messy_text = "  This   is    messy    text   with   extra   spaces.  "
+        result = run_cli_command(["uv", "run", "cclean", messy_text])
+        assert result.returncode == 0
+        cleaned = result.stdout.strip()
+        assert cleaned != messy_text
+        assert "This is messy text" in cleaned
+    def test_cclean_json_input(self):
+        """Test cclean with JSON input containing content field."""
+        json_input = '{"content": "  Messy   JSON   content  "}'
+        result = run_cli_command(["uv", "run", "cclean"], input_data=json_input)
+        assert result.returncode == 0
+        cleaned = result.stdout.strip()
+        assert "Messy JSON content" in cleaned
+    def test_cclean_xml_input(self):
+        """Test cclean with XML input containing content field."""
+        xml_input = '<root><content>  Messy   XML   content  </content></root>'
+        result = run_cli_command(["uv", "run", "cclean"], input_data=xml_input)
+        assert result.returncode == 0
+        cleaned = result.stdout.strip()
+        assert "Messy XML content" in cleaned
+    def test_cclean_file_input(self, fixture_path):
+        """Test cclean with file input."""
+        txt_file = fixture_path / "file.txt"
+        if not txt_file.exists():
+            pytest.skip(f"Fixture file not found: {txt_file}")
+        result = run_cli_command(["uv", "run", "cclean", str(txt_file)])
+        assert result.returncode == 0
+        assert len(result.stdout.strip()) > 0
+    def test_cclean_url_input(self):
+        """Test cclean with URL input."""
+        result = run_cli_command(["uv", "run", "cclean", "https://www.example.com"])
+        assert result.returncode == 0
+        assert len(result.stdout.strip()) > 0
+    def test_cclean_stdin_input(self):
+        """Test cclean with stdin input."""
+        messy_content = "  This    has   too   many    spaces   and needs   cleaning.  "
+        result = run_cli_command(["uv", "run", "cclean"], input_data=messy_content)
+        assert result.returncode == 0
+        cleaned = result.stdout.strip()
+        assert "This has too many spaces" in cleaned
+    def test_cclean_debug_flag(self):
+        """Test cclean with debug flag."""
+        result = run_cli_command(["uv", "run", "cclean", "-d", "Debug clean test."])
+        assert result.returncode == 0
+        assert "Debug clean test" in result.stdout
+class TestCsumCLI:
+    """Tests for the csum CLI command."""
+    def test_csum_text_input(self):
+        """Test csum with direct text input."""
+        long_text = "Artificial Intelligence is revolutionizing industries across the globe. From healthcare to finance, AI technologies are enabling automation, improving decision-making, and creating new possibilities for innovation."
+        result = run_cli_command(["uv", "run", "csum", long_text])
+        assert result.returncode == 0
+        summary = result.stdout.strip()
+        assert len(summary) > 0
+        assert len(summary) < len(long_text)  # Summary should be shorter
+    def test_csum_with_context(self):
+        """Test csum with context parameter."""
+        text = "Machine learning algorithms process vast amounts of data to identify patterns and make predictions."
+        context = "explain in simple terms"
+        result = run_cli_command(["uv", "run", "csum", "--context", context, text])
+        assert result.returncode == 0
+        summary = result.stdout.strip()
+        assert len(summary) > 0
+    def test_csum_file_input(self, fixture_path):
+        """Test csum with file input."""
+        md_file = fixture_path / "file.md"
+        if not md_file.exists():
+            pytest.skip(f"Fixture file not found: {md_file}")
+        result = run_cli_command(["uv", "run", "csum", str(md_file)])
+        assert result.returncode == 0
+        assert len(result.stdout.strip()) > 0
+    def test_csum_url_input(self):
+        """Test csum with URL input."""
+        result = run_cli_command(["uv", "run", "csum", "https://www.example.com"])
+        assert result.returncode == 0
+        assert len(result.stdout.strip()) > 0
+    def test_csum_json_input(self):
+        """Test csum with JSON input containing content field."""
+        json_input = '{"content": "This is a long article about technology trends. It discusses various aspects of innovation, digital transformation, and the future of work in the digital age."}'
+        result = run_cli_command(["uv", "run", "csum"], input_data=json_input)
+        assert result.returncode == 0
+        summary = result.stdout.strip()
+        assert len(summary) > 0
+    def test_csum_xml_input(self):
+        """Test csum with XML input containing content field."""
+        xml_input = '<article><content>This is a comprehensive guide to understanding cloud computing. It covers infrastructure, platforms, software services, and deployment models.</content></article>'
+        result = run_cli_command(["uv", "run", "csum"], input_data=xml_input)
+        assert result.returncode == 0
+        summary = result.stdout.strip()
+        assert len(summary) > 0
+    def test_csum_stdin_input(self):
+        """Test csum with stdin input."""
+        long_content = "The Internet of Things (IoT) represents a network of interconnected devices that communicate and exchange data. This technology has applications in smart homes, industrial automation, healthcare monitoring, and environmental sensing. As IoT devices become more prevalent, they are transforming how we interact with our environment and creating new opportunities for data-driven insights."
+        result = run_cli_command(["uv", "run", "csum"], input_data=long_content)
+        assert result.returncode == 0
+        summary = result.stdout.strip()
+        assert len(summary) > 0
+        assert len(summary) < len(long_content)
+    def test_csum_context_bullet_points(self):
+        """Test csum with bullet points context."""
+        text = "Blockchain technology provides a decentralized approach to data storage and transaction processing. It ensures security through cryptographic methods and maintains transparency through distributed ledgers."
+        result = run_cli_command(["uv", "run", "csum", "--context", "in bullet points", text])
+        assert result.returncode == 0
+        summary = result.stdout.strip()
+        assert len(summary) > 0
+    def test_csum_debug_flag(self):
+        """Test csum with debug flag."""
+        result = run_cli_command(["uv", "run", "csum", "-d", "Debug summary test content."])
+        assert result.returncode == 0
+        assert len(result.stdout.strip()) > 0
+class TestCLIErrorHandling:
+    """Tests for CLI error handling and edge cases."""
+    def test_ccore_empty_input_error(self):
+        """Test ccore with empty input should error."""
+        result = run_cli_command(["uv", "run", "ccore", ""])
+        assert result.returncode != 0
+    def test_cclean_empty_input_error(self):
+        """Test cclean with empty input should error."""
+        result = run_cli_command(["uv", "run", "cclean", ""])
+        assert result.returncode != 0
+    def test_csum_empty_input_error(self):
+        """Test csum with empty input should error."""
+        result = run_cli_command(["uv", "run", "csum", ""])
+        assert result.returncode != 0
+    def test_ccore_invalid_format(self):
+        """Test ccore with invalid format option."""
+        result = run_cli_command(["uv", "run", "ccore", "-f", "invalid", "test"])
+        assert result.returncode != 0
+        assert "invalid choice" in result.stderr.lower()
+    def test_ccore_nonexistent_file(self):
+        """Test ccore with non-existent file."""
+        result = run_cli_command(["uv", "run", "ccore", "/path/to/nonexistent/file.txt"])
+        # Should not error but treat as text content
+        assert result.returncode == 0
+        assert "/path/to/nonexistent/file.txt" in result.stdout
+    def test_stdin_no_content_error(self):
+        """Test CLI with no content and no stdin should error."""
+        # This is tricky to test as it involves TTY detection
+        # We'll skip this for now as it requires special handling
+        pass
+class TestCLIIntegration:
+    """Integration tests combining multiple CLI features."""
+    def test_pipeline_extract_clean_summarize(self, fixture_path):
+        """Test a pipeline of extract -> clean -> summarize."""
+        md_file = fixture_path / "file.md"
+        if not md_file.exists():
+            pytest.skip(f"Fixture file not found: {md_file}")
+        # Extract content
+        extract_result = run_cli_command(["uv", "run", "ccore", str(md_file)])
+        assert extract_result.returncode == 0
+        # Clean extracted content
+        clean_result = run_cli_command(["uv", "run", "cclean"], input_data=extract_result.stdout)
+        assert clean_result.returncode == 0
+        # Summarize cleaned content
+        summary_result = run_cli_command(["uv", "run", "csum"], input_data=clean_result.stdout)
+        assert summary_result.returncode == 0
+        assert len(summary_result.stdout.strip()) > 0
+    def test_json_pipeline(self):
+        """Test pipeline with JSON format."""
+        text = "This is a test for JSON pipeline processing."
+        # Extract as JSON
+        extract_result = run_cli_command(["uv", "run", "ccore", "-f", "json", text])
+        assert extract_result.returncode == 0
+        # Verify JSON output
+        json_data = json.loads(extract_result.stdout)
+        assert text in json_data["content"]
+        # Clean JSON content
+        clean_result = run_cli_command(["uv", "run", "cclean"], input_data=extract_result.stdout)
+        assert clean_result.returncode == 0
+        # Summarize cleaned content
+        summary_result = run_cli_command(["uv", "run", "csum"], input_data=clean_result.stdout)
+        assert summary_result.returncode == 0
+    def test_xml_processing(self):
+        """Test XML format processing."""
+        text = "This is test content for XML processing and validation."
+        # Extract as XML
+        extract_result = run_cli_command(["uv", "run", "ccore", "-f", "xml", text])
+        assert extract_result.returncode == 0
+        # Verify XML output
+        root = ET.fromstring(extract_result.stdout.strip())
+        content_elem = root.find(".//content")
+        assert content_elem is not None
+        assert text in content_elem.text
+        # Process XML content through clean and summarize
+        clean_result = run_cli_command(["uv", "run", "cclean"], input_data=extract_result.stdout)
+        assert clean_result.returncode == 0
+        summary_result = run_cli_command(["uv", "run", "csum", "--context", "one sentence"], input_data=clean_result.stdout)
+        assert summary_result.returncode == 0
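
Note on the helper used throughout these tests: `run_cli_command` is called with an argument list and an optional `input_data` string, but its definition is not part of this excerpt. The following is a minimal sketch of what such a helper could look like, assuming it wraps `subprocess.run`; the `timeout` parameter and its default are assumptions, not taken from the package.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_cli_command(
    cmd: Sequence[str],
    input_data: Optional[str] = None,
    timeout: float = 60.0,  # assumed default, not from the package
) -> subprocess.CompletedProcess:
    """Run a command, optionally feed text on stdin, and capture stdout/stderr as text."""
    return subprocess.run(
        list(cmd),
        input=input_data,
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# Stand-in command: the Python interpreter echoes stdin, so the sketch
# can be tried without `uv` or the `ccore` entry point being installed.
result = run_cli_command(
    [sys.executable, "-c", "import sys; print(sys.stdin.read().strip())"],
    input_data="  hello  ",
)
print(result.returncode, result.stdout.strip())
```

Returning the `CompletedProcess` unchanged lets each test assert on `returncode`, `stdout`, and `stderr` separately, which matches how the tests above read the result.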

{content_core-0.8.5 → content_core-1.0.1}/tests/integration/test_extraction.py RENAMED Viewed

@@ -26,7 +26,7 @@ async def test_extract_content_from_text():
 async def test_extract_content_from_url(fixture_path):
     """Tests content extraction from a URL."""
     # Using a known URL from the notebook example
-    input_data = {"url": "https://www.supernovalabs.com", "engine": "simple"}
+    input_data = {"url": "https://www.supernovalabs.com", "url_engine": "simple"}
     result = await extract_content(input_data)
     assert hasattr(result, "source_type")
@@ -41,8 +41,13 @@ async def test_extract_content_from_url(fixture_path):
 @pytest.mark.asyncio
 async def test_extract_content_from_url_firecrawl(fixture_path):
     """Tests content extraction from a URL."""
+    try:
+        import firecrawl
+    except ImportError:
+        pytest.skip("Firecrawl not installed")
     # Using a known URL from the notebook example
-    input_data = {"url": "https://www.supernovalabs.com", "engine": "firecrawl"}
+    input_data = {"url": "https://www.supernovalabs.com", "url_engine": "firecrawl"}
     result = await extract_content(input_data)
     assert hasattr(result, "source_type")
@@ -58,7 +63,7 @@ async def test_extract_content_from_url_firecrawl(fixture_path):
 async def test_extract_content_from_url_jina(fixture_path):
     """Tests content extraction from a URL."""
     # Using a known URL from the notebook example
-    input_data = {"url": "https://www.supernovalabs.com", "engine": "jina"}
+    input_data = {"url": "https://www.supernovalabs.com", "url_engine": "jina"}
     result = await extract_content(input_data)
     assert hasattr(result, "source_type")
@@ -222,7 +227,7 @@ async def test_extract_content_from_xlsx(fixture_path):
     if not xlsx_file.exists():
         pytest.skip(f"Fixture file not found: {xlsx_file}")
-    result = await extract_content(dict(file_path=str(xlsx_file), engine="simple"))
+    result = await extract_content(dict(file_path=str(xlsx_file), document_engine="simple"))
     assert result.source_type == "file"
     assert (
@@ -240,7 +245,7 @@ async def test_extract_content_from_xlsx(fixture_path):
 #     if not xlsx_file.exists():
 #         pytest.skip(f"Fixture file not found: {xlsx_file}")
-#     result = await extract_content(dict(file_path=str(xlsx_file), engine="docling"))
+#     result = await extract_content(dict(file_path=str(xlsx_file), document_engine="docling"))
 #     assert result.source_type == "file"
 #     assert (
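
Note on the optional-dependency guard: the Firecrawl test above imports the package inside `try`/`except ImportError` and skips when it is missing. A sketch of the same check without an unused import is shown below, assuming only the standard library; `is_installed` is a hypothetical helper name. Inside a pytest test, `pytest.importorskip("firecrawl")` achieves the same skip in one line.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# A test could then guard itself with:
#     if not is_installed("firecrawl"):
#         pytest.skip("Firecrawl not installed")
print(is_installed("json"))
```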

{content_core-0.8.5 → content_core-1.0.1}/uv.lock RENAMED Viewed

@@ -410,7 +410,7 @@ wheels = [
 [[package]]
 name = "content-core"
-version = "0.8.5"
+version = "1.0.1"
 source = { editable = "." }
 dependencies = [
     { name = "ai-prompter" },

content_core-0.8.5/src/content_core/common/types.py DELETED Viewed

@@ -1,21 +0,0 @@
-from typing import Literal
-import warnings
-Engine = Literal[
-    "auto",
-    "simple",
-    "legacy",
-    "firecrawl",
-    "jina",
-    "docling",
-]
-DEPRECATED_ENGINES = {"legacy": "simple"}
-def warn_if_deprecated_engine(engine: str):
-    if engine in DEPRECATED_ENGINES:
-        warnings.warn(
-            f"Engine '{engine}' is deprecated and will be removed in a future release. Use '{DEPRECATED_ENGINES[engine]}' instead.",
-            DeprecationWarning,
-            stacklevel=2,
-        )
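
Note on migrating callers: with `types.py` removed and the single `engine` key replaced by `document_engine` and `url_engine` in the tests above, code written against 0.8.5 needs its input dictionaries updated. The following is a rough sketch of such a translation, not part of content-core. `migrate_engine_key` is a hypothetical helper, and the rule "inputs with a `url` key take `url_engine`, everything else takes `document_engine`" is an assumption inferred from the test changes in this diff.

```python
import warnings
from typing import Any, Dict

# Mapping copied from the removed module: old value -> replacement.
DEPRECATED_ENGINES = {"legacy": "simple"}


def migrate_engine_key(input_data: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of input_data with the old 'engine' key renamed."""
    data = dict(input_data)
    if "engine" not in data:
        return data
    engine = data.pop("engine")
    if engine in DEPRECATED_ENGINES:
        warnings.warn(
            f"Engine '{engine}' is deprecated. Use '{DEPRECATED_ENGINES[engine]}' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        engine = DEPRECATED_ENGINES[engine]
    # Assumption: URL inputs use url_engine, all other inputs use document_engine.
    target = "url_engine" if "url" in data else "document_engine"
    # An explicitly set new-style key wins over the migrated value.
    data.setdefault(target, engine)
    return data


print(migrate_engine_key({"url": "https://www.example.com", "engine": "firecrawl"}))
# {'url': 'https://www.example.com', 'url_engine': 'firecrawl'}
```

The helper copies the dictionary rather than mutating it, so the caller's original input stays usable for logging or retries.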

{content_core-0.8.5 → content_core-1.0.1}/.github/PULL_REQUEST_TEMPLATE.md RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/.github/workflows/publish.yml RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/.python-version RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/CONTRIBUTING.md RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/LICENSE RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/Makefile RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/prompts/content/cleanup.jinja RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/prompts/content/summarize.jinja RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/common/__init__.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/common/exceptions.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/common/utils.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/content/__init__.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/content/cleanup/__init__.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/content/cleanup/core.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/content/extraction/__init__.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/content/identification/__init__.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/content/summary/__init__.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/content/summary/core.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/logging.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/models.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/models_config.yaml RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/notebooks/run.ipynb RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/processors/audio.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/processors/docling.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/processors/office.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/processors/pdf.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/processors/text.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/processors/video.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/py.typed RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/templated_message.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/tools/__init__.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/tools/cleanup.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/tools/extract.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/src/content_core/tools/summarize.py RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/tests/input_content/file.docx RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/tests/input_content/file.epub RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/tests/input_content/file.md RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/tests/input_content/file.mp3 RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/tests/input_content/file.mp4 RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/tests/input_content/file.pdf RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/tests/input_content/file.pptx RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/tests/input_content/file.txt RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/tests/input_content/file.xlsx RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/tests/input_content/file_audio.mp3 RENAMED Viewed

File without changes

{content_core-0.8.5 → content_core-1.0.1}/tests/unit/test_docling.py RENAMED Viewed

File without changes
