PyPI - content-core - Versions diffs - 0.3.1__tar.gz → 0.5.0__tar.gz - Mend

content-core 0.3.1tar.gz → 0.5.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of content-core might be problematic.

Files changed (63)

{content_core-0.3.1 → content_core-0.5.0}/.gitignore RENAMED

@@ -20,4 +20,4 @@ ai_docs/
 todo.md
 WIP/
-*.ignore
+*.ignore

{content_core-0.3.1 → content_core-0.5.0}/.windsurfrules RENAMED

@@ -4,10 +4,10 @@ All documentation (code or readmes) must be in english.
 Whenever I ask you to tag and release, make sure to run `make test` as part of the process.
 The full release process is:
-- Run `make test` to make sure everything is working
+- Run `make test` to make sure everything is working (if we changed any code or import)
 - Update version on pyproject.toml
 - Run `uv sync` to update the lock file
 - Commit all that's needed
-- Merge to main
+- Merge to main (if in a branch)
 - Tag the release
 - Push to GitHub

content_core-0.3.1/README.md → content_core-0.5.0/PKG-INFO RENAMED

@@ -1,3 +1,37 @@
+Metadata-Version: 2.4
+Name: content-core
+Version: 0.5.0
+Summary: Extract what matters from any media source
+Author-email: LUIS NOVO <lfnovo@gmail.com>
+License-File: LICENSE
+Requires-Python: >=3.10
+Requires-Dist: aiohttp>=3.11
+Requires-Dist: bs4>=0.0.2
+Requires-Dist: dicttoxml>=1.7.16
+Requires-Dist: esperanto>=1.2.0
+Requires-Dist: google-genai>=1.10.0
+Requires-Dist: jinja2>=3.1.6
+Requires-Dist: langdetect>=1.0.9
+Requires-Dist: langgraph>=0.3.29
+Requires-Dist: loguru>=0.7.3
+Requires-Dist: openai>=1.73.0
+Requires-Dist: openpyxl>=3.1.5
+Requires-Dist: pandas>=2.2.3
+Requires-Dist: pydub>=0.25.1
+Requires-Dist: pymupdf>=1.25.5
+Requires-Dist: python-docx>=1.1.2
+Requires-Dist: python-dotenv>=1.1.0
+Requires-Dist: python-magic>=0.4.27
+Requires-Dist: python-pptx>=1.0.2
+Requires-Dist: validators>=0.34.0
+Requires-Dist: youtube-transcript-api>=1.0.3
+Provides-Extra: docling
+Requires-Dist: asciidoc; extra == 'docling'
+Requires-Dist: docling[ocr]; extra == 'docling'
+Requires-Dist: pandas; extra == 'docling'
+Requires-Dist: pillow; extra == 'docling'
+Description-Content-Type: text/markdown
 # Content Core
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -25,8 +59,10 @@ The primary goal of Content Core is to simplify the process of ingesting content
 Install Content Core using `pip`:
 ```bash
-# Install the package
+# Install the package (without Docling)
 pip install content-core
+# Install with Docling support
+pip install content-core[docling]
 ```
 Alternatively, if you’re developing locally:
@@ -195,12 +231,58 @@ async def main():
     md_data = await extract_content({"file_path": "path/to/your/document.md"})
     print(md_data)
+    # Per-execution override with Docling
+    doc_data = await extract_content({
+        "file_path": "path/to/your/document.pdf",
+        "engine": "docling",
+        "output_format": "html"
+    })
+    print(doc_data)
 if __name__ == "__main__":
     asyncio.run(main())
 ```
 (See `src/content_core/notebooks/run.ipynb` for more detailed examples.)
+## Docling Integration
+Content Core supports an optional Docling-based extraction engine for rich document formats (PDF, DOCX, PPTX, XLSX, Markdown, AsciiDoc, HTML, CSV, Images).
+### Installation
+```bash
+# Install with Docling support
+pip install content-core[docling]
+```
+### Enabling Docling
+#### Via configuration file
+In your `cc_config.yaml` or custom config, set:
+```yaml
+extraction:
+  engine: docling       # 'legacy' (default) or 'docling'
+  docling:
+    output_format: markdown  # markdown | html | json
+```
+#### Programmatically in Python
+```python
+from content_core.config import set_extraction_engine, set_docling_output_format
+# switch engine to Docling
+set_extraction_engine("docling")
+# choose output format: 'markdown', 'html', or 'json'
+set_docling_output_format("html")
+# now use ccore.extract or ccore.ccore
+result = await cc.extract("document.pdf")
+```
 ## Configuration
 Configuration settings (like API keys for external services, logging levels) can be managed through environment variables or `.env` files, loaded automatically via `python-dotenv`.

content_core-0.3.1/PKG-INFO → content_core-0.5.0/README.md RENAMED

@@ -1,32 +1,3 @@
-Metadata-Version: 2.4
-Name: content-core
-Version: 0.3.1
-Summary: Extract what matters from any media source
-Author-email: LUIS NOVO <lfnovo@gmail.com>
-License-File: LICENSE
-Requires-Python: >=3.10
-Requires-Dist: aiohttp>=3.11
-Requires-Dist: bs4>=0.0.2
-Requires-Dist: dicttoxml>=1.7.16
-Requires-Dist: esperanto>=1.2.0
-Requires-Dist: google-genai>=1.10.0
-Requires-Dist: jinja2>=3.1.6
-Requires-Dist: langdetect>=1.0.9
-Requires-Dist: langgraph>=0.3.29
-Requires-Dist: loguru>=0.7.3
-Requires-Dist: openai>=1.73.0
-Requires-Dist: openpyxl>=3.1.5
-Requires-Dist: pandas>=2.2.3
-Requires-Dist: pydub>=0.25.1
-Requires-Dist: pymupdf>=1.25.5
-Requires-Dist: python-docx>=1.1.2
-Requires-Dist: python-dotenv>=1.1.0
-Requires-Dist: python-magic>=0.4.27
-Requires-Dist: python-pptx>=1.0.2
-Requires-Dist: validators>=0.34.0
-Requires-Dist: youtube-transcript-api>=1.0.3
-Description-Content-Type: text/markdown
 # Content Core
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -54,8 +25,10 @@ The primary goal of Content Core is to simplify the process of ingesting content
 Install Content Core using `pip`:
 ```bash
-# Install the package
+# Install the package (without Docling)
 pip install content-core
+# Install with Docling support
+pip install content-core[docling]
 ```
 Alternatively, if you’re developing locally:
@@ -224,12 +197,58 @@ async def main():
     md_data = await extract_content({"file_path": "path/to/your/document.md"})
     print(md_data)
+    # Per-execution override with Docling
+    doc_data = await extract_content({
+        "file_path": "path/to/your/document.pdf",
+        "engine": "docling",
+        "output_format": "html"
+    })
+    print(doc_data)
 if __name__ == "__main__":
     asyncio.run(main())
 ```
 (See `src/content_core/notebooks/run.ipynb` for more detailed examples.)
+## Docling Integration
+Content Core supports an optional Docling-based extraction engine for rich document formats (PDF, DOCX, PPTX, XLSX, Markdown, AsciiDoc, HTML, CSV, Images).
+### Installation
+```bash
+# Install with Docling support
+pip install content-core[docling]
+```
+### Enabling Docling
+#### Via configuration file
+In your `cc_config.yaml` or custom config, set:
+```yaml
+extraction:
+  engine: docling       # 'legacy' (default) or 'docling'
+  docling:
+    output_format: markdown  # markdown | html | json
+```
+#### Programmatically in Python
+```python
+from content_core.config import set_extraction_engine, set_docling_output_format
+# switch engine to Docling
+set_extraction_engine("docling")
+# choose output format: 'markdown', 'html', or 'json'
+set_docling_output_format("html")
+# now use ccore.extract or ccore.ccore
+result = await cc.extract("document.pdf")
+```
 ## Configuration
 Configuration settings (like API keys for external services, logging levels) can be managed through environment variables or `.env` files, loaded automatically via `python-dotenv`.

{content_core-0.3.1 → content_core-0.5.0}/docs/processors.md RENAMED

@@ -14,11 +14,11 @@ Content Core uses a modular approach to process content from different sources.
 - **Returned Data**: The input text as-is, wrapped in a structured format compatible with Content Core's output schema.
 - **Location**: `src/content_core/processors/text.py`
-### 2. **Web Processor**
+### 2. **Web (URL) Processor**
 - **Purpose**: Extracts content from web URLs, focusing on meaningful text while ignoring boilerplate (ads, navigation, etc.).
 - **Supported Input**: URLs (web pages).
 - **Returned Data**: Extracted text content from the web page, often in a cleaned format.
-- **Location**: `src/content_core/processors/web.py`
+- **Location**: `src/content_core/processors/url.py`
 ### 3. **File Processor**
 - **Purpose**: Processes local files of various types, extracting content based on file format.
@@ -35,10 +35,33 @@ Content Core uses a modular approach to process content from different sources.
 - **Returned Data**: Transcribed text from the media content.
 - **Location**: `src/content_core/processors/transcription.py`
+### 5. **Docling Processor**
+- **Purpose**: Use Docling library for rich document parsing (PDF, DOCX, XLSX, PPTX, Markdown, AsciiDoc, HTML, CSV, images).
+- **Supported Input**: PDF, DOCX, XLSX, PPTX, Markdown, AsciiDoc, HTML, CSV, Images (PNG, JPEG, TIFF, BMP).
+- **Returned Data**: Content converted to configured format (markdown, html, json).
+- **Location**: `src/content_core/processors/docling.py`
+- **Configuration**: Activate the Docling engine in `cc_config.yaml` or custom config:
+  ```yaml
+  extraction:
+    engine: docling       # 'legacy' (default) or 'docling'
+    docling:
+      output_format: markdown  # markdown | html | json
+  ```
+- **Programmatic Toggle**: Use helper functions in Python:
+  ```python
+  from content_core.config import set_extraction_engine, set_docling_output_format
+  # switch engine to Docling
+  set_extraction_engine("docling")
+  # choose output format
+  set_docling_output_format("html")
+  ```
 ## How Processors Work
 Content Core automatically selects the appropriate processor based on the input type:
-- If a URL is provided, the Web Processor is used.
+- If a URL is provided, the Web (URL) Processor is used.
 - If a file path is provided, the File Processor determines the file type and delegates to specialized handlers (like the Media Transcription Processor for audio/video).
 - If raw text is provided, the Text Processor handles it directly.

{content_core-0.3.1 → content_core-0.5.0}/docs/usage.md RENAMED

@@ -76,6 +76,60 @@ To simplify setup, we suggest copying the provided sample files:
 This will allow you to quickly start with customized settings without needing to create the files from scratch.
+### Docling Engine
+Content Core supports an optional Docling engine for advanced document parsing. To enable:
+#### In YAML config
+Add under the `extraction` section:
+```yaml
+extraction:
+  engine: docling        # legacy (default) or docling
+  docling:
+    output_format: html  # markdown | html | json
+```
+#### Programmatically in Python
+```python
+from content_core.config import set_extraction_engine, set_docling_output_format
+# toggle to Docling
+set_extraction_engine("docling")
+# pick format
+set_docling_output_format("json")
+```
+#### Per-Execution Overrides
+You can override the extraction engine and Docling output format on a per-call basis by including `engine` and `output_format` in your input:
+```python
+from content_core.content.extraction import extract_content
+# override engine and format for this document
+result = await extract_content({
+    "file_path": "document.pdf",
+    "engine": "docling",
+    "output_format": "html"
+})
+print(result.content)
+```
+Or using `ProcessSourceInput`:
+```python
+from content_core.common.state import ProcessSourceInput
+from content_core.content.extraction import extract_content
+input = ProcessSourceInput(
+    file_path="document.pdf",
+    engine="docling",
+    output_format="json"
+)
+result = await extract_content(input)
+print(result.content)
+```
 ## Support
 If you have questions or encounter issues while using the library, open an issue in the repository or contact the support team.
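
Note on the per-execution overrides documented in this hunk: the `graph.py` change further down resolves the engine as `state.engine or CONFIG.get("extraction", {}).get("engine", "legacy")`. The sketch below illustrates that precedence only. It uses a plain dict in place of the package's `CONFIG`, and `resolve_engine` is a hypothetical helper name, not part of content-core.

```python
from typing import Optional


def resolve_engine(override: Optional[str], config: dict) -> str:
    """Per-call override wins; otherwise use the config value, then 'legacy'."""
    return override or config.get("extraction", {}).get("engine", "legacy")


# Hypothetical stand-in for the loaded configuration.
config = {"extraction": {"engine": "legacy"}}

print(resolve_engine("docling", config))  # docling (override wins)
print(resolve_engine(None, config))       # legacy (value from config)
print(resolve_engine(None, {}))           # legacy (built-in fallback)
```

Because the expression relies on `or`, an empty-string override behaves the same as no override.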

{content_core-0.3.1 → content_core-0.5.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "content-core"
-version = "0.3.1"
+version = "0.5.0"
 description = "Extract what matters from any media source"
 readme = "README.md"
 homepage = "https://github.com/lfnovo/content-core"
@@ -31,6 +31,9 @@ dependencies = [
     "validators>=0.34.0",
 ]
+[project.optional-dependencies]
+docling = ["docling[ocr]", "Pillow", "pandas", "asciidoc"]
 [project.scripts]
 ccore = "content_core:ccore"
 cclean = "content_core:cclean"

content_core-0.5.0/src/content_core/cc_config.yaml ADDED

@@ -0,0 +1,35 @@
+# Content Core main configuration
+# Copy this file to your project root or set CCORE_CONFIG_PATH to its location
+speech_to_text:
+  provider: openai
+  model_name: whisper-1
+default_model:
+  provider: openai
+  model_name: gpt-4o-mini
+  config:
+    temperature: 0.5
+    top_p: 1
+    max_tokens: 2000
+cleanup_model:
+  provider: openai
+  model_name: gpt-4o-mini
+  config:
+    temperature: 0
+    max_tokens: 8000
+    output_format: json
+summary_model:
+  provider: openai
+  model_name: gpt-4o-mini
+  config:
+    temperature: 0
+    top_p: 1
+    max_tokens: 2000
+extraction:
+  engine: legacy  # change to 'docling' to enable Docling engine
+  docling:
+    output_format: markdown  # markdown | html | json

{content_core-0.3.1 → content_core-0.5.0}/src/content_core/common/state.py RENAMED

@@ -13,12 +13,16 @@ class ProcessSourceState(BaseModel):
     identified_provider: Optional[str] = ""
     metadata: Optional[dict] = Field(default_factory=lambda: {})
     content: Optional[str] = ""
+    engine: Optional[str] = Field(default=None, description="Override extraction engine: 'legacy' or 'docling'")
+    output_format: Optional[str] = Field(default=None, description="Override Docling output format: 'markdown', 'html', or 'json'")
 class ProcessSourceInput(BaseModel):
     content: Optional[str] = ""
     file_path: Optional[str] = ""
     url: Optional[str] = ""
+    engine: Optional[str] = None
+    output_format: Optional[str] = None
 class ProcessSourceOutput(BaseModel):

content_core-0.5.0/src/content_core/config.py ADDED

@@ -0,0 +1,46 @@
+import os
+import pkgutil
+import os  # needed for load_config env/path checks
+import yaml
+from dotenv import load_dotenv
+# Load environment variables from .env file
+load_dotenv()
+def load_config():
+    config_path = os.environ.get("CCORE_CONFIG_PATH") or os.environ.get("CCORE_MODEL_CONFIG_PATH")
+    if config_path and os.path.exists(config_path):
+        try:
+            with open(config_path, "r") as file:
+                return yaml.safe_load(file)
+        except Exception as e:
+            print(f"Error loading configuration file from {config_path}: {e}")
+            print("Using internal default settings.")
+    default_config_data = pkgutil.get_data("content_core", "models_config.yaml")
+    if default_config_data:
+        base = yaml.safe_load(default_config_data)
+    else:
+        base = {}
+    # load new cc_config.yaml defaults
+    cc_default = pkgutil.get_data("content_core", "cc_config.yaml")
+    if cc_default:
+        docling_cfg = yaml.safe_load(cc_default)
+        # merge extraction section
+        base["extraction"] = docling_cfg.get("extraction", {})
+    return base
+CONFIG = load_config()
+# Programmatic config overrides: use in notebooks or scripts
+def set_extraction_engine(engine: str):
+    """Override the extraction engine ('legacy' or 'docling')."""
+    CONFIG.setdefault("extraction", {})["engine"] = engine
+def set_docling_output_format(fmt: str):
+    """Override Docling output_format ('markdown', 'html', or 'json')."""
+    extraction = CONFIG.setdefault("extraction", {})
+    docling_cfg = extraction.setdefault("docling", {})
+    docling_cfg["output_format"] = fmt
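
Note on the two override helpers added in this file: both mutate the module-level `CONFIG` dict in place, and `setdefault` creates any missing section. That matters because `load_config` returns a user-supplied file as-is, so `CONFIG` may have no `extraction` key. The sketch below reproduces the helpers against a local empty dict standing in for `content_core.config.CONFIG`, so it runs without the package installed.

```python
# Stand-in for content_core.config.CONFIG; the real dict is built by load_config().
CONFIG: dict = {}


def set_extraction_engine(engine: str) -> None:
    """Override the extraction engine, creating the 'extraction' section if absent."""
    CONFIG.setdefault("extraction", {})["engine"] = engine


def set_docling_output_format(fmt: str) -> None:
    """Override the Docling output format, creating nested sections if absent."""
    extraction = CONFIG.setdefault("extraction", {})
    extraction.setdefault("docling", {})["output_format"] = fmt


# Call order does not matter: each helper only touches its own key.
set_docling_output_format("html")
set_extraction_engine("docling")
print(CONFIG)
# {'extraction': {'docling': {'output_format': 'html'}, 'engine': 'docling'}}
```

As written in the diff, neither helper validates its argument, so a misspelled engine name is stored unchanged.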

{content_core-0.3.1 → content_core-0.5.0}/src/content_core/content/extraction/graph.py RENAMED

@@ -20,6 +20,12 @@ from content_core.processors.text import extract_txt
 from content_core.processors.url import extract_url, url_provider
 from content_core.processors.video import extract_best_audio_from_video
 from content_core.processors.youtube import extract_youtube_transcript
+from content_core.processors.docling import extract_with_docling, DOCLING_SUPPORTED  # type: ignore
+import aiohttp
+import tempfile
+from urllib.parse import urlparse
+from content_core.config import CONFIG  # type: ignore
 async def source_identification(state: ProcessSourceState) -> Dict[str, str]:
@@ -91,6 +97,32 @@ async def source_type_router(x: ProcessSourceState) -> Optional[str]:
     return x.source_type
+async def download_remote_file(state: ProcessSourceState) -> Dict[str, Any]:
+    url = state.url
+    assert url, "No URL provided"
+    async with aiohttp.ClientSession() as session:
+        async with session.get(url) as resp:
+            resp.raise_for_status()
+            mime = resp.headers.get("content-type", "").split(";", 1)[0]
+            suffix = os.path.splitext(urlparse(url).path)[1] if urlparse(url).path else ""
+            fd, tmp = tempfile.mkstemp(suffix=suffix)
+            os.close(fd)
+            with open(tmp, "wb") as f:
+                f.write(await resp.read())
+    return {"file_path": tmp, "identified_type": mime}
+async def file_type_router_docling(state: ProcessSourceState) -> str:
+    """
+    Route to Docling if enabled and supported; otherwise use legacy file type edge.
+    """
+    # allow per-execution override of engine via state.engine
+    engine = state.engine or CONFIG.get("extraction", {}).get("engine", "legacy")
+    if engine == "docling" and state.identified_type in DOCLING_SUPPORTED:
+        return "extract_docling"
+    return await file_type_edge(state)
 # Create workflow
 workflow = StateGraph(
     ProcessSourceState, input=ProcessSourceInput, output=ProcessSourceState
@@ -108,6 +140,8 @@ workflow.add_node("extract_best_audio_from_video", extract_best_audio_from_video
 workflow.add_node("extract_audio", extract_audio)
 workflow.add_node("extract_youtube_transcript", extract_youtube_transcript)
 workflow.add_node("delete_file", delete_file)
+workflow.add_node("download_remote_file", download_remote_file)
+workflow.add_node("extract_docling", extract_with_docling)
 # Add edges
 workflow.add_edge(START, "source")
@@ -122,12 +156,12 @@ workflow.add_conditional_edges(
 )
 workflow.add_conditional_edges(
     "file_type",
-    file_type_edge,
+    file_type_router_docling,
 )
 workflow.add_conditional_edges(
     "url_provider",
     url_type_router,
-    {"article": "extract_url", "youtube": "extract_youtube_transcript"},
+    {**{m: "download_remote_file" for m in SUPPORTED_FITZ_TYPES}, "article": "extract_url", "youtube": "extract_youtube_transcript"},
 )
 workflow.add_edge("url_provider", END)
 workflow.add_edge("file_type", END)
@@ -140,6 +174,7 @@ workflow.add_edge("extract_office_content", "delete_file")
 workflow.add_edge("extract_best_audio_from_video", "extract_audio")
 workflow.add_edge("extract_audio", "delete_file")
 workflow.add_edge("delete_file", END)
+workflow.add_edge("download_remote_file", "file_type")
 # Compile graph
 graph = workflow.compile()
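
Note on `download_remote_file` in this hunk: the MIME type is taken from the `content-type` header with any parameters cut off, and the temp-file suffix is taken from the URL path with `os.path.splitext`. The sketch below isolates those two expressions using only the standard library. `guess_suffix_and_mime` is a hypothetical name and the `example.com` URL is illustrative. The arXiv URL is the one used in the `docling.ipynb` notebook in this diff.

```python
import os
from urllib.parse import urlparse


def guess_suffix_and_mime(url: str, content_type: str) -> tuple[str, str]:
    """Mirror the suffix and MIME derivation used in download_remote_file."""
    # Drop parameters such as '; charset=...' from the header value.
    mime = content_type.split(";", 1)[0]
    path = urlparse(url).path
    # splitext splits on the last dot of the final path component.
    suffix = os.path.splitext(path)[1] if path else ""
    return suffix, mime


print(guess_suffix_and_mime("https://example.com/report.pdf", "application/pdf; charset=binary"))
# ('.pdf', 'application/pdf')
print(guess_suffix_and_mime("https://arxiv.org/pdf/2408.09869", "application/pdf"))
# ('.09869', 'application/pdf')
```

For URLs whose last path segment contains a dot but no real extension, the derived suffix is not a file extension. The `identified_type` value returned by the function in the diff comes from the header rather than from this suffix.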

content_core-0.5.0/src/content_core/notebooks/docling.ipynb ADDED

@@ -0,0 +1,27 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from docling.document_converter import DocumentConverter\n",
+    "\n",
+    "\n",
+    "source = \"/Users/luisnovo/dev/projetos/content-core/tests/input_content/file.docx\"\n",
+    "source_url = \"https://arxiv.org/pdf/2408.09869\"  # PDF path or URL\n",
+    "converter = DocumentConverter()\n",
+    "result = converter.convert(source)\n",
+    "print(result.document.export_to_markdown())"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
