content_core-1.10.0-py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (44)
  1. content_core/__init__.py +216 -0
  2. content_core/cc_config.yaml +86 -0
  3. content_core/common/__init__.py +38 -0
  4. content_core/common/exceptions.py +70 -0
  5. content_core/common/retry.py +325 -0
  6. content_core/common/state.py +64 -0
  7. content_core/common/types.py +15 -0
  8. content_core/common/utils.py +31 -0
  9. content_core/config.py +575 -0
  10. content_core/content/__init__.py +6 -0
  11. content_core/content/cleanup/__init__.py +5 -0
  12. content_core/content/cleanup/core.py +15 -0
  13. content_core/content/extraction/__init__.py +13 -0
  14. content_core/content/extraction/graph.py +252 -0
  15. content_core/content/identification/__init__.py +9 -0
  16. content_core/content/identification/file_detector.py +505 -0
  17. content_core/content/summary/__init__.py +5 -0
  18. content_core/content/summary/core.py +15 -0
  19. content_core/logging.py +15 -0
  20. content_core/mcp/__init__.py +5 -0
  21. content_core/mcp/server.py +214 -0
  22. content_core/models.py +60 -0
  23. content_core/models_config.yaml +31 -0
  24. content_core/notebooks/run.ipynb +359 -0
  25. content_core/notebooks/urls.ipynb +154 -0
  26. content_core/processors/audio.py +272 -0
  27. content_core/processors/docling.py +79 -0
  28. content_core/processors/office.py +331 -0
  29. content_core/processors/pdf.py +292 -0
  30. content_core/processors/text.py +36 -0
  31. content_core/processors/url.py +324 -0
  32. content_core/processors/video.py +166 -0
  33. content_core/processors/youtube.py +262 -0
  34. content_core/py.typed +2 -0
  35. content_core/templated_message.py +70 -0
  36. content_core/tools/__init__.py +9 -0
  37. content_core/tools/cleanup.py +15 -0
  38. content_core/tools/extract.py +21 -0
  39. content_core/tools/summarize.py +17 -0
  40. content_core-1.10.0.dist-info/METADATA +742 -0
  41. content_core-1.10.0.dist-info/RECORD +44 -0
  42. content_core-1.10.0.dist-info/WHEEL +4 -0
  43. content_core-1.10.0.dist-info/entry_points.txt +5 -0
  44. content_core-1.10.0.dist-info/licenses/LICENSE +21 -0
Metadata-Version: 2.4
Name: content-core
Version: 1.10.0
Summary: Extract what matters from any media source. Available as Python Library, macOS Service, CLI and MCP Server
Author-email: LUIS NOVO <lfnovo@gmail.com>
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: ai-prompter>=0.2.3
Requires-Dist: aiohttp>=3.11
Requires-Dist: asciidoc>=10.2.1
Requires-Dist: bs4>=0.0.2
Requires-Dist: dicttoxml>=1.7.16
Requires-Dist: esperanto>=2.14.0
Requires-Dist: fastmcp>=2.10.0
Requires-Dist: firecrawl-py>=2.7.0
Requires-Dist: jinja2>=3.1.6
Requires-Dist: langdetect>=1.0.9
Requires-Dist: langgraph>=0.3.29
Requires-Dist: loguru>=0.7.3
Requires-Dist: moviepy>=2.1.2
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: pandas>=2.2.3
Requires-Dist: pillow>=10.4.0
Requires-Dist: pymupdf>=1.25.5
Requires-Dist: python-docx>=1.1.2
Requires-Dist: python-dotenv>=1.1.0
Requires-Dist: python-pptx>=1.0.2
Requires-Dist: pytubefix>=9.1.1
Requires-Dist: readability-lxml>=0.8.4.1
Requires-Dist: tenacity>=8.0.0
Requires-Dist: validators>=0.34.0
Requires-Dist: youtube-transcript-api>=1.0.3
Provides-Extra: crawl4ai
Requires-Dist: crawl4ai>=0.7.0; extra == 'crawl4ai'
Provides-Extra: docling
Requires-Dist: docling>=2.34.0; extra == 'docling'
Description-Content-Type: text/markdown

# Content Core

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/content-core.svg)](https://badge.fury.io/py/content-core)
[![Downloads](https://pepy.tech/badge/content-core)](https://pepy.tech/project/content-core)
[![Downloads](https://pepy.tech/badge/content-core/month)](https://pepy.tech/project/content-core)
[![GitHub stars](https://img.shields.io/github/stars/lfnovo/content-core?style=social)](https://github.com/lfnovo/content-core)
[![GitHub forks](https://img.shields.io/github/forks/lfnovo/content-core?style=social)](https://github.com/lfnovo/content-core)
[![GitHub issues](https://img.shields.io/github/issues/lfnovo/content-core)](https://github.com/lfnovo/content-core/issues)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

**Content Core** is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries, all through a unified interface with multiple integration options.

## 🚀 What You Can Do

**Extract content from anywhere:**
- 📄 **Documents** - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
- 🎥 **Media** - Videos (MP4, AVI, MOV) with automatic transcription
- 🎵 **Audio** - MP3, WAV, M4A with speech-to-text conversion
- 🌐 **Web** - Any URL with intelligent content extraction
- 🖼️ **Images** - JPG, PNG, TIFF with OCR text recognition
- 📦 **Archives** - ZIP, TAR, GZ with content analysis

**Process with AI:**
- ✨ **Clean & format** extracted content automatically
- 📝 **Generate summaries** with customizable styles (bullet points, executive summary, etc.)
- 🎯 **Context-aware processing** - explain to a child, technical summary, action items
- 🔄 **Smart engine selection** - automatically chooses the best extraction method

## 🛠️ Multiple Ways to Use

### 🖥️ Command Line (Zero Install)
```bash
# Extract content from any source
uvx --from "content-core" ccore https://example.com
uvx --from "content-core" ccore document.pdf

# Generate AI summaries
uvx --from "content-core" csum video.mp4 --context "bullet points"
```

### 🤖 Claude Desktop Integration
One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.

### 🔍 Raycast Extension
Smart auto-detection commands:
- **Extract Content** - Full interface with format options
- **Summarize Content** - 9 summary styles available
- **Quick Extract** - Instant clipboard extraction

### 🖱️ macOS Right-Click Integration
Right-click any file in Finder → Services → Extract or Summarize content instantly.

### 🐍 Python Library
```python
import content_core as cc

# Extract from any source
result = await cc.extract("https://example.com/article")
summary = await cc.summarize_content(result, context="explain to a child")
```

## ⚡ Key Features

* **🎯 Intelligent Auto-Detection:** Automatically selects the best extraction method based on content type and available services
* **🔧 Smart Engine Selection:**
  * **URLs:** Firecrawl → Jina → Crawl4AI (optional) → BeautifulSoup fallback chain
  * **Documents:** Docling → Enhanced PyMuPDF → Simple extraction fallback
  * **Media:** OpenAI Whisper transcription
  * **Images:** OCR with multiple engine support
* **📊 Enhanced PDF Processing:** Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
* **🌐 Multiple Integrations:** CLI, Python library, MCP server, Raycast extension, macOS Services
* **⚡ Zero-Install Options:** Use `uvx` for instant access without installation
* **🧠 AI-Powered Processing:** LLM integration for content cleaning and summarization
* **🔄 Asynchronous:** Built with `asyncio` for efficient processing
* **🐍 Pure Python Implementation:** No system dependencies required - simplified installation across all platforms

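The fallback chains above follow a simple pattern: try each engine in order and move on when one fails. A minimal sketch of that pattern, using hypothetical stand-in engines rather than Content Core's internal API:

```python
from typing import Callable

def extract_with_fallback(source: str, engines: list[Callable[[str], str]]) -> str:
    """Try each extraction engine in order; return the first successful result."""
    errors = []
    for engine in engines:
        try:
            return engine(source)
        except Exception as exc:  # a real chain would catch narrower error types
            errors.append(f"{engine.__name__}: {exc}")
    raise RuntimeError("all engines failed: " + "; ".join(errors))

# Stand-in engines for illustration only:
def firecrawl(url: str) -> str:
    raise ConnectionError("no API key configured")

def beautifulsoup(url: str) -> str:
    return f"<extracted text of {url}>"

print(extract_with_fallback("https://example.com", [firecrawl, beautifulsoup]))
```

The `auto` engines apply the same idea, preferring richer extractors (Firecrawl, Docling) when their dependencies and API keys are available.
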
## Getting Started

### Installation

Install Content Core using `pip` - **no system dependencies required!**

```bash
# Basic installation (PyMuPDF + BeautifulSoup/Jina extraction)
pip install content-core

# With enhanced document processing (adds Docling)
pip install content-core[docling]

# With local browser-based URL extraction (adds Crawl4AI)
# Note: Requires Playwright browsers (~300MB). Run:
pip install content-core[crawl4ai]
python -m playwright install --with-deps

# Full installation (with all optional features)
pip install content-core[docling,crawl4ai]
```

> **Note:** The core installation uses pure Python implementations and doesn't require system libraries like libmagic, ensuring consistent, hassle-free installation across Windows, macOS, and Linux. Optional features like Crawl4AI (browser automation) may require additional system dependencies.

Alternatively, if you're developing locally:

```bash
# Clone the repository
git clone https://github.com/lfnovo/content-core
cd content-core

# Install with uv
uv sync
```

### Command-Line Interface

Content Core provides three CLI commands for extracting, cleaning, and summarizing content: `ccore`, `cclean`, and `csum`. These commands accept input as text, URLs, file paths, or piped data (e.g., `cat file.txt | ccore`).

**Zero-install usage with uvx:**
```bash
# Extract content
uvx --from "content-core" ccore https://example.com

# Clean content
uvx --from "content-core" cclean "messy content"

# Summarize content
uvx --from "content-core" csum "long text" --context "bullet points"
```

#### ccore - Extract Content

Extracts content from text, URLs, or files, with optional formatting.

Usage:

```bash
ccore [-f|--format xml|json|text] [-d|--debug] [content]
```

Options:
- `-f`, `--format`: Output format (xml, json, or text). Default: text.
- `-d`, `--debug`: Enable debug logging.
- `content`: Input content (text, URL, or file path). If omitted, reads from stdin.

Examples:

```bash
# Extract from a URL as text
ccore https://example.com

# Extract from a file as JSON
ccore -f json document.pdf

# Extract from piped text as XML
echo "Sample text" | ccore --format xml
```

#### cclean - Clean Content

Cleans content by removing unnecessary formatting, spaces, or artifacts. Accepts text, JSON, XML input, URLs, or file paths.

Usage:

```bash
cclean [-d|--debug] [content]
```

Options:
- `-d`, `--debug`: Enable debug logging.
- `content`: Input content to clean (text, URL, file path, JSON, or XML). If omitted, reads from stdin.

Examples:

```bash
# Clean a text string
cclean " messy text "

# Clean piped JSON
echo '{"content": " messy text "}' | cclean

# Clean content from a URL
cclean https://example.com

# Clean a file's content
cclean document.txt
```

#### csum - Summarize Content

Summarizes content with an optional context to guide the summary style. Accepts text, JSON, XML input, URLs, or file paths.

Usage:

```bash
csum [--context "context text"] [-d|--debug] [content]
```

Options:
- `--context`: Context for summarization (e.g., "explain to a child"). Default: none.
- `-d`, `--debug`: Enable debug logging.
- `content`: Input content to summarize (text, URL, file path, JSON, or XML). If omitted, reads from stdin.

Examples:

```bash
# Summarize text
csum "AI is transforming industries."

# Summarize with context
csum --context "in bullet points" "AI is transforming industries."

# Summarize piped content
cat article.txt | csum --context "one sentence"

# Summarize content from URL
csum https://example.com

# Summarize a file's content
csum document.txt
```

## Quick Start

You can quickly integrate `content-core` into your Python projects to extract, clean, and summarize content from various sources.

```python
import content_core as cc

# Extract content from a URL, file, or text
result = await cc.extract("https://example.com/article")

# Clean messy content
cleaned_text = await cc.clean("...messy text with [brackets] and extra spaces...")

# Summarize content with optional context
summary = await cc.summarize_content("long article text", context="explain to a child")

# Extract audio with custom speech-to-text model
from content_core.common import ProcessSourceInput
result = await cc.extract(ProcessSourceInput(
    file_path="interview.mp3",
    audio_provider="openai",
    audio_model="whisper-1"
))
```

## Documentation

For more information on how to use the Content Core library, including details on AI model configuration and customization, refer to our [Usage Documentation](docs/usage.md).

## MCP Server Integration

Content Core includes a Model Context Protocol (MCP) server that enables seamless integration with Claude Desktop and other MCP-compatible applications. The MCP server exposes Content Core's powerful extraction capabilities through a standardized protocol.

<a href="https://glama.ai/mcp/servers/@lfnovo/content-core">
  <img width="380" height="200" src="https://glama.ai/mcp/servers/@lfnovo/content-core/badge" />
</a>

### Quick Setup with Claude Desktop

```bash
# Install Content Core (MCP server included)
pip install content-core

# Or use directly with uvx (no installation required)
uvx --from "content-core" content-core-mcp
```

Add to your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": [
        "--from",
        "content-core",
        "content-core-mcp"
      ]
    }
  }
}
```

For detailed setup instructions, configuration options, and usage examples, see our [MCP Documentation](docs/mcp.md).

## Enhanced PDF Processing

Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.

### Key Improvements

- **🔬 Mathematical Formula Extraction**: Enhanced quality flags eliminate `<!-- formula-not-decoded -->` placeholders
- **📊 Automatic Table Detection**: Tables converted to markdown format for LLM consumption
- **🔧 Quality Text Rendering**: Better ligature, whitespace, and image-text integration
- **⚡ Optional OCR Enhancement**: Selective OCR for formula-heavy pages (requires Tesseract)

### Configuration for Scientific Documents

For documents with heavy mathematical content, enable OCR enhancement:

```yaml
# In cc_config.yaml
extraction:
  pymupdf:
    enable_formula_ocr: true  # Enable OCR for formula-heavy pages
    formula_threshold: 3      # Min formulas per page to trigger OCR
    ocr_fallback: true        # Graceful fallback if OCR fails
```

```python
# Runtime configuration
from content_core.config import set_pymupdf_ocr_enabled
set_pymupdf_ocr_enabled(True)
```

### Requirements for OCR Enhancement

```bash
# Install Tesseract OCR (optional, for formula enhancement)
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr
```

**Note**: OCR is optional - you get improved PDF extraction automatically without any additional setup.

## macOS Services Integration

Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.

### Available Services

Create **4 convenient services** for different workflows:

- **Extract Content → Clipboard** - Quick copy for immediate pasting
- **Extract Content → TextEdit** - Review before using
- **Summarize Content → Clipboard** - Quick summary copying
- **Summarize Content → TextEdit** - Formatted summary with headers

### Quick Setup

1. **Install uv** (if not already installed):
   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

2. **Create services manually** using Automator (about 5 minutes of setup)

### Usage

**Right-click any supported file** in Finder → **Services** → Choose your option:

- **PDFs, Word docs** - Instant text extraction
- **Videos, audio files** - Automatic transcription
- **Images** - OCR text recognition
- **Web content** - Clean text extraction
- **Multiple files** - Batch processing support

### Features

- **Zero-install processing**: Uses `uvx` for isolated execution
- **Multiple output options**: Clipboard or TextEdit display
- **System notifications**: Visual feedback on completion
- **Wide format support**: 20+ file types supported
- **Batch processing**: Handle multiple files at once
- **Keyboard shortcuts**: Assignable hotkeys for power users

For complete setup instructions with copy-paste scripts, see [macOS Services Documentation](docs/macos.md).

## Raycast Extension

Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.

### Quick Setup

**From Raycast Store** (coming soon):
1. Open Raycast and search for "Content Core"
2. Install the extension by `luis_novo`
3. Configure API keys in preferences

**Manual Installation**:
1. Download the extension from the repository
2. Open Raycast → "Import Extension"
3. Select the `raycast-content-core` folder

### Commands

**🔍 Extract Content** - Smart URL/file detection with full interface
- Auto-detects URLs vs file paths in real-time
- Multiple output formats (Text, JSON, XML)
- Drag & drop support for files
- Rich results view with metadata

**📝 Summarize Content** - AI-powered summaries with customizable styles
- 9 different summary styles (bullet points, executive summary, etc.)
- Auto-detects source type with visual feedback
- One-click snippet creation and quicklinks

**⚡ Quick Extract** - Instant extraction to clipboard
- Type → Tab → Paste source → Enter
- No UI, works directly from command bar
- Perfect for quick workflows

### Features

- **Smart Auto-Detection**: Instantly recognizes URLs vs file paths
- **Zero Installation**: Uses `uvx` for Content Core execution
- **Rich Integration**: Keyboard shortcuts, clipboard actions, Raycast snippets
- **All File Types**: Documents, videos, audio, images, archives
- **Visual Feedback**: Real-time type detection with icons

For detailed setup, configuration, and usage examples, see [Raycast Extension Documentation](docs/raycast.md).

## Using with Langchain

For users integrating with the [Langchain](https://python.langchain.com/) framework, `content-core` exposes a set of compatible tools. These tools, located in the `src/content_core/tools` directory, allow you to leverage `content-core` extraction, cleaning, and summarization capabilities directly within your Langchain agents and chains.

You can import and use these tools like any other Langchain tool. For example:

```python
from content_core.tools import extract_content_tool, cleanup_content_tool, summarize_content_tool
from langchain.agents import initialize_agent, AgentType

# `llm` must be a Langchain-compatible chat model you have already initialized
tools = [extract_content_tool, cleanup_content_tool, summarize_content_tool]
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("Extract the content from https://example.com and then summarize it.")
```

Refer to the source code in `src/content_core/tools` for specific tool implementations and usage details.

## Basic Usage

The core functionality revolves around the `extract_content` function.

```python
import asyncio
from content_core.extraction import extract_content

async def main():
    # Extract from raw text
    text_data = await extract_content({"content": "This is my sample text content."})
    print(text_data)

    # Extract from a URL (uses 'auto' engine by default)
    url_data = await extract_content({"url": "https://www.example.com"})
    print(url_data)

    # Extract from a local video file (gets transcript, engine='auto' by default)
    video_data = await extract_content({"file_path": "path/to/your/video.mp4"})
    print(video_data)

    # Extract from a local markdown file (engine='auto' by default)
    md_data = await extract_content({"file_path": "path/to/your/document.md"})
    print(md_data)

    # Per-execution override with Docling for documents
    doc_data = await extract_content({
        "file_path": "path/to/your/document.pdf",
        "document_engine": "docling",
        "output_format": "html"
    })
    print(doc_data)

    # Per-execution override with Firecrawl for URLs
    url_data = await extract_content({
        "url": "https://www.example.com",
        "url_engine": "firecrawl"
    })
    print(url_data)

if __name__ == "__main__":
    asyncio.run(main())
```

(See `src/content_core/notebooks/run.ipynb` for more detailed examples.)

## Docling Integration

Content Core supports an optional Docling-based extraction engine for rich document formats (PDF, DOCX, PPTX, XLSX, Markdown, AsciiDoc, HTML, CSV, Images).

### Enabling Docling

Docling is not installed by default, but the default `auto` document engine will prefer it when it is available. If you don't want to use it, set the document engine to `simple`.

#### Via configuration file

In your `cc_config.yaml` or custom config, set:
```yaml
extraction:
  document_engine: docling  # 'auto' (default), 'simple', or 'docling'
  url_engine: auto          # 'auto' (default), 'simple', 'firecrawl', or 'jina'
  docling:
    output_format: markdown  # markdown | html | json
```

#### Programmatically in Python

```python
import content_core as cc
from content_core.config import set_document_engine, set_url_engine, set_docling_output_format

# switch document engine to Docling
set_document_engine("docling")

# switch URL engine to Firecrawl
set_url_engine("firecrawl")

# choose output format: 'markdown', 'html', or 'json'
set_docling_output_format("html")

# now use cc.extract as usual
result = await cc.extract("document.pdf")
```

## Configuration

Configuration settings (like API keys for external services, logging levels) can be managed through environment variables or `.env` files, loaded automatically via `python-dotenv`.

Example `.env`:

```plaintext
OPENAI_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here

# Engine Selection (optional)
CCORE_DOCUMENT_ENGINE=auto  # auto, simple, docling
CCORE_URL_ENGINE=auto       # auto, simple, firecrawl, jina

# Audio Processing (optional)
CCORE_AUDIO_CONCURRENCY=3   # Number of concurrent audio transcriptions (1-10, default: 3)

# Esperanto Timeout Configuration (optional)
ESPERANTO_LLM_TIMEOUT=300   # Language model timeout in seconds (default: 300, max: 3600)
ESPERANTO_STT_TIMEOUT=3600  # Speech-to-text timeout in seconds (default: 3600, max: 3600)
```

### Engine Selection via Environment Variables

For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:

- **`CCORE_DOCUMENT_ENGINE`**: Force document engine (`auto`, `simple`, `docling`)
- **`CCORE_URL_ENGINE`**: Force URL engine (`auto`, `simple`, `firecrawl`, `jina`, `crawl4ai`)
- **`CCORE_AUDIO_CONCURRENCY`**: Number of concurrent audio transcriptions (1-10, default: 3)

These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
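
Precedence of this kind boils down to "read the environment first, then fall back to the config value". A minimal sketch (the function and config dict are illustrative, not Content Core internals):

```python
import os

def resolve_engine(env_var: str, config: dict, key: str, default: str = "auto") -> str:
    """Environment variable wins; otherwise the config value; otherwise the default."""
    return os.environ.get(env_var) or config.get(key, default)

config = {"document_engine": "docling"}

os.environ["CCORE_DOCUMENT_ENGINE"] = "simple"
print(resolve_engine("CCORE_DOCUMENT_ENGINE", config, "document_engine"))  # simple

del os.environ["CCORE_DOCUMENT_ENGINE"]
print(resolve_engine("CCORE_DOCUMENT_ENGINE", config, "document_engine"))  # docling
```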

### Audio Processing Configuration

Content Core processes long audio files by splitting them into segments and transcribing them in parallel for improved performance. You can control the concurrency level to balance speed with API rate limits:

- **Default**: 3 concurrent transcriptions
- **Range**: 1-10 concurrent transcriptions
- **Configuration**: Set via `CCORE_AUDIO_CONCURRENCY` environment variable or `extraction.audio.concurrency` in `cc_config.yaml`

Higher concurrency values can speed up processing of long audio/video files but may hit API rate limits. Lower values are more conservative and suitable for accounts with lower API quotas.
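
For reference, the YAML form of the setting mentioned above would look like this (a config fragment assuming the `extraction.audio.concurrency` key path stated in the list):

```yaml
# In cc_config.yaml
extraction:
  audio:
    concurrency: 5  # 1-10; higher is faster but more likely to hit rate limits
```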

### Retry Configuration

Content Core includes automatic retry logic for transient failures in external operations (network requests, API calls, transcription). Retries use exponential backoff with jitter to handle temporary issues gracefully.

**Supported operations:**
- `youtube` - YouTube video title and transcript fetching (5 retries, 2-60s backoff)
- `url_api` - URL extraction via Jina/Firecrawl APIs (3 retries, 1-30s backoff)
- `url_network` - Network operations like HEAD requests, BeautifulSoup (3 retries, 0.5-10s backoff)
- `audio` - Audio transcription API calls (3 retries, 2-30s backoff)
- `llm` - LLM API calls for cleanup/summary (3 retries, 1-30s backoff)
- `download` - Remote file downloads (3 retries, 1-15s backoff)
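
Exponential backoff with jitter caps a doubling base delay and then randomizes it so that many clients retrying at once don't do so in lockstep. Content Core's actual logic lives in `content_core/common/retry.py`; this standalone sketch only illustrates the shape of the delay schedule (the "full jitter" variant is an assumption):

```python
import random

def backoff_delays(max_retries: int, base: float, cap: float) -> list[float]:
    """Full-jitter backoff: delay_n is uniform in [0, min(cap, base * 2**n)]."""
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays

# A schedule shaped like the documented 'youtube' profile (5 retries, 2-60s window)
print(backoff_delays(5, base=2.0, cap=60.0))
```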

**Environment variable overrides:**
```bash
# Override retry settings per operation type
CCORE_YOUTUBE_MAX_RETRIES=10  # Max retry attempts (1-20)
CCORE_YOUTUBE_BASE_DELAY=3    # Base delay in seconds (0.1-60)
CCORE_YOUTUBE_MAX_DELAY=120   # Max delay in seconds (1-300)

# Same pattern for other operations:
CCORE_URL_API_MAX_RETRIES=5
CCORE_AUDIO_MAX_RETRIES=5
CCORE_LLM_MAX_RETRIES=5
CCORE_DOWNLOAD_MAX_RETRIES=5
```

For detailed configuration, see our [Usage Documentation](docs/usage.md#retry-configuration).

### Proxy Configuration

Content Core supports HTTP/HTTPS proxy configuration for all external network requests. This is useful when operating in corporate environments, behind firewalls, or when you need to route traffic through a specific server.

**Configuration Methods** (in priority order):

1. **Per-request**: Pass `proxy` parameter directly in `ProcessSourceInput`
2. **Programmatic**: Use `set_proxy()` for runtime configuration
3. **Environment Variables**: `CCORE_HTTP_PROXY`, `HTTP_PROXY`, or `HTTPS_PROXY`
4. **YAML Config**: Set in `cc_config.yaml`

**Quick Start:**

```bash
# Via environment variable
export CCORE_HTTP_PROXY=http://proxy.example.com:8080

# With authentication
export CCORE_HTTP_PROXY=http://user:password@proxy.example.com:8080
```

```python
# Programmatic configuration
import content_core as cc
from content_core.config import set_proxy, clear_proxy

set_proxy("http://proxy.example.com:8080")
# ... use Content Core ...
clear_proxy()  # Reset to default behavior

# Per-request override
from content_core.common import ProcessSourceInput
result = await cc.extract(ProcessSourceInput(
    url="https://example.com",
    proxy="http://specific-proxy:8080"
))
```

**Supported Services:**
- All aiohttp requests (URL extraction, downloads)
- YouTube transcript/title fetching (pytubefix, youtube-transcript-api)
- Crawl4AI browser automation
- Esperanto AI models (LLM, speech-to-text)

**Note:** Firecrawl does not support client-side proxy configuration. A warning is logged when a proxy is configured but Firecrawl is used.

For detailed configuration options, see our [Usage Documentation](docs/usage.md#proxy-configuration).

### Timeout Configuration

Content Core uses the Esperanto library for AI model interactions and supports configurable timeouts for different operations. Timeouts prevent requests from hanging indefinitely and ensure reliable processing.

**Configuration Methods** (in priority order):

1. **Config Files** (highest priority): Set in `cc_config.yaml` or `models_config.yaml`
2. **Environment Variables**: Provide global defaults via `ESPERANTO_LLM_TIMEOUT` and `ESPERANTO_STT_TIMEOUT` when a timeout isn't specified in configuration files

**Default Timeouts:**

- **Speech-to-Text**: 3600 seconds (1 hour) - for very long audio files
- **Language Models**: 300-600 seconds - for content processing operations
- **Cleanup Model**: 600 seconds (10 minutes) - handles large content with 8000 max tokens
- **Summary Model**: 300 seconds (5 minutes) - for content summarization

**Environment Variable Overrides:**

```bash
# Override language model timeout globally (used when config files omit a timeout)
export ESPERANTO_LLM_TIMEOUT=300

# Override speech-to-text timeout globally (used when config files omit a timeout)
export ESPERANTO_STT_TIMEOUT=3600
```

**Valid Range:** 1 to 3600 seconds (1 hour maximum)
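
The precedence above can be sketched as a small resolver: a config-file value wins, the environment variable provides the fallback default, and the result is clamped to the valid range. This is illustrative only, not Esperanto's actual code:

```python
import os

def resolve_timeout(config_value, env_var: str, default: int) -> int:
    """Config file value wins; else the env var; else the default. Clamped to 1-3600 s."""
    raw = config_value if config_value is not None else os.environ.get(env_var, default)
    return max(1, min(3600, int(raw)))

os.environ["ESPERANTO_LLM_TIMEOUT"] = "7200"  # above the 3600 s maximum
print(resolve_timeout(None, "ESPERANTO_LLM_TIMEOUT", 300))  # clamped to 3600
print(resolve_timeout(600, "ESPERANTO_LLM_TIMEOUT", 300))   # config wins: 600
del os.environ["ESPERANTO_LLM_TIMEOUT"]
```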

For more details on Esperanto timeout configuration, see the [Esperanto documentation](https://github.com/lfnovo/esperanto/blob/main/docs/advanced/timeout-configuration.md).

### Custom Prompt Templates

Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the `PROMPT_PATH` environment variable in your `.env` file or system environment.

Example `.env` with custom prompt path:

```plaintext
OPENAI_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here
PROMPT_PATH=/path/to/your/custom/prompts
```

When a prompt template is requested, Content Core first looks in the custom directory specified by `PROMPT_PATH` (if it is set and the directory exists). If the template is not found there, it falls back to the default built-in prompts. This allows you to override specific prompts while still using the default ones for others.
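
The lookup order described above amounts to checking two directories in order. A sketch of that resolution (the function and template names are illustrative; the library's real logic may differ):

```python
import os
from pathlib import Path

def resolve_template(name: str, builtin_dir: Path) -> Path:
    """Prefer a template under PROMPT_PATH when the variable is set and the
    file exists there; otherwise fall back to the built-in prompts directory."""
    custom_dir = os.environ.get("PROMPT_PATH")
    if custom_dir:
        candidate = Path(custom_dir) / name
        if candidate.exists():
            return candidate
    return builtin_dir / name
```

So a custom directory only needs to contain the templates you actually want to override; everything else resolves to the built-ins.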

## Development

To set up a development environment:

```bash
# Clone the repository
git clone <repository-url>
cd content-core

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv sync --group dev

# Run tests
make test

# Lint code
make lint

# See all commands
make help
```

## License

This project is licensed under the [MIT License](LICENSE). See the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please see our [Contributing Guide](CONTRIBUTING.md) for more details on how to get started.