content-core 1.10.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- content_core/__init__.py +216 -0
- content_core/cc_config.yaml +86 -0
- content_core/common/__init__.py +38 -0
- content_core/common/exceptions.py +70 -0
- content_core/common/retry.py +325 -0
- content_core/common/state.py +64 -0
- content_core/common/types.py +15 -0
- content_core/common/utils.py +31 -0
- content_core/config.py +575 -0
- content_core/content/__init__.py +6 -0
- content_core/content/cleanup/__init__.py +5 -0
- content_core/content/cleanup/core.py +15 -0
- content_core/content/extraction/__init__.py +13 -0
- content_core/content/extraction/graph.py +252 -0
- content_core/content/identification/__init__.py +9 -0
- content_core/content/identification/file_detector.py +505 -0
- content_core/content/summary/__init__.py +5 -0
- content_core/content/summary/core.py +15 -0
- content_core/logging.py +15 -0
- content_core/mcp/__init__.py +5 -0
- content_core/mcp/server.py +214 -0
- content_core/models.py +60 -0
- content_core/models_config.yaml +31 -0
- content_core/notebooks/run.ipynb +359 -0
- content_core/notebooks/urls.ipynb +154 -0
- content_core/processors/audio.py +272 -0
- content_core/processors/docling.py +79 -0
- content_core/processors/office.py +331 -0
- content_core/processors/pdf.py +292 -0
- content_core/processors/text.py +36 -0
- content_core/processors/url.py +324 -0
- content_core/processors/video.py +166 -0
- content_core/processors/youtube.py +262 -0
- content_core/py.typed +2 -0
- content_core/templated_message.py +70 -0
- content_core/tools/__init__.py +9 -0
- content_core/tools/cleanup.py +15 -0
- content_core/tools/extract.py +21 -0
- content_core/tools/summarize.py +17 -0
- content_core-1.10.0.dist-info/METADATA +742 -0
- content_core-1.10.0.dist-info/RECORD +44 -0
- content_core-1.10.0.dist-info/WHEEL +4 -0
- content_core-1.10.0.dist-info/entry_points.txt +5 -0
- content_core-1.10.0.dist-info/licenses/LICENSE +21 -0
|
@@ -0,0 +1,742 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: content-core
|
|
3
|
+
Version: 1.10.0
|
|
4
|
+
Summary: Extract what matters from any media source. Available as Python Library, macOS Service, CLI and MCP Server
|
|
5
|
+
Author-email: LUIS NOVO <lfnovo@gmail.com>
|
|
6
|
+
License-File: LICENSE
|
|
7
|
+
Requires-Python: >=3.10
|
|
8
|
+
Requires-Dist: ai-prompter>=0.2.3
|
|
9
|
+
Requires-Dist: aiohttp>=3.11
|
|
10
|
+
Requires-Dist: asciidoc>=10.2.1
|
|
11
|
+
Requires-Dist: bs4>=0.0.2
|
|
12
|
+
Requires-Dist: dicttoxml>=1.7.16
|
|
13
|
+
Requires-Dist: esperanto>=2.14.0
|
|
14
|
+
Requires-Dist: fastmcp>=2.10.0
|
|
15
|
+
Requires-Dist: firecrawl-py>=2.7.0
|
|
16
|
+
Requires-Dist: jinja2>=3.1.6
|
|
17
|
+
Requires-Dist: langdetect>=1.0.9
|
|
18
|
+
Requires-Dist: langgraph>=0.3.29
|
|
19
|
+
Requires-Dist: loguru>=0.7.3
|
|
20
|
+
Requires-Dist: moviepy>=2.1.2
|
|
21
|
+
Requires-Dist: openpyxl>=3.1.5
|
|
22
|
+
Requires-Dist: pandas>=2.2.3
|
|
23
|
+
Requires-Dist: pillow>=10.4.0
|
|
24
|
+
Requires-Dist: pymupdf>=1.25.5
|
|
25
|
+
Requires-Dist: python-docx>=1.1.2
|
|
26
|
+
Requires-Dist: python-dotenv>=1.1.0
|
|
27
|
+
Requires-Dist: python-pptx>=1.0.2
|
|
28
|
+
Requires-Dist: pytubefix>=9.1.1
|
|
29
|
+
Requires-Dist: readability-lxml>=0.8.4.1
|
|
30
|
+
Requires-Dist: tenacity>=8.0.0
|
|
31
|
+
Requires-Dist: validators>=0.34.0
|
|
32
|
+
Requires-Dist: youtube-transcript-api>=1.0.3
|
|
33
|
+
Provides-Extra: crawl4ai
|
|
34
|
+
Requires-Dist: crawl4ai>=0.7.0; extra == 'crawl4ai'
|
|
35
|
+
Provides-Extra: docling
|
|
36
|
+
Requires-Dist: docling>=2.34.0; extra == 'docling'
|
|
37
|
+
Description-Content-Type: text/markdown
|
|
38
|
+
|
|
39
|
+
# Content Core
|
|
40
|
+
|
|
50
|
+
|
|
51
|
+
**Content Core** is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries, all through a unified interface with multiple integration options.
|
|
52
|
+
|
|
53
|
+
## What You Can Do
|
|
54
|
+
|
|
55
|
+
**Extract content from anywhere:**
|
|
56
|
+
- **Documents** - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
|
|
57
|
+
- **Media** - Videos (MP4, AVI, MOV) with automatic transcription
|
|
58
|
+
- **Audio** - MP3, WAV, M4A with speech-to-text conversion
|
|
59
|
+
- **Web** - Any URL with intelligent content extraction
|
|
60
|
+
- **Images** - JPG, PNG, TIFF with OCR text recognition
|
|
61
|
+
- **Archives** - ZIP, TAR, GZ with content analysis
|
|
62
|
+
|
|
63
|
+
**Process with AI:**
|
|
64
|
+
- **Clean & format** extracted content automatically
|
|
65
|
+
- **Generate summaries** with customizable styles (bullet points, executive summary, etc.)
|
|
66
|
+
- **Context-aware processing** - explain to a child, technical summary, action items
|
|
67
|
+
- **Smart engine selection** - automatically chooses the best extraction method
|
|
68
|
+
|
|
69
|
+
## Multiple Ways to Use
|
|
70
|
+
|
|
71
|
+
### Command Line (Zero Install)
|
|
72
|
+
```bash
|
|
73
|
+
# Extract content from any source
|
|
74
|
+
uvx --from "content-core" ccore https://example.com
|
|
75
|
+
uvx --from "content-core" ccore document.pdf
|
|
76
|
+
|
|
77
|
+
# Generate AI summaries
|
|
78
|
+
uvx --from "content-core" csum video.mp4 --context "bullet points"
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
### Claude Desktop Integration
|
|
82
|
+
One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.
|
|
83
|
+
|
|
84
|
+
### Raycast Extension
|
|
85
|
+
Smart auto-detection commands:
|
|
86
|
+
- **Extract Content** - Full interface with format options
|
|
87
|
+
- **Summarize Content** - 9 summary styles available
|
|
88
|
+
- **Quick Extract** - Instant clipboard extraction
|
|
89
|
+
|
|
90
|
+
### macOS Right-Click Integration
|
|
91
|
+
Right-click any file in Finder → Services → Extract or Summarize content instantly.
|
|
92
|
+
|
|
93
|
+
### Python Library
|
|
94
|
+
```python
|
|
95
|
+
import content_core as cc
|
|
96
|
+
|
|
97
|
+
# Extract from any source
|
|
98
|
+
result = await cc.extract("https://example.com/article")
|
|
99
|
+
summary = await cc.summarize_content(result, context="explain to a child")
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
## Key Features
|
|
103
|
+
|
|
104
|
+
* **Intelligent Auto-Detection:** Automatically selects the best extraction method based on content type and available services
|
|
105
|
+
* **Smart Engine Selection:**
|
|
106
|
+
    * **URLs:** Firecrawl → Jina → Crawl4AI (optional) → BeautifulSoup fallback chain
|
|
107
|
+
    * **Documents:** Docling → Enhanced PyMuPDF → Simple extraction fallback
|
|
108
|
+
    * **Media:** OpenAI Whisper transcription
|
|
109
|
+
    * **Images:** OCR with multiple engine support
|
|
110
|
+
* **Enhanced PDF Processing:** Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
|
|
111
|
+
* **Multiple Integrations:** CLI, Python library, MCP server, Raycast extension, macOS Services
|
|
112
|
+
* **Zero-Install Options:** Use `uvx` for instant access without installation
|
|
113
|
+
* **AI-Powered Processing:** LLM integration for content cleaning and summarization
|
|
114
|
+
* **Asynchronous:** Built with `asyncio` for efficient processing
|
|
115
|
+
* **Pure Python Implementation:** No system dependencies required - simplified installation across all platforms
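
The URL fallback chain listed under Smart Engine Selection can be pictured as "try each engine in order, keep the first usable result". The sketch below is only an illustration of that idea, not Content Core's implementation: `extract_with_fallback`, the `fake_*` extractors, and the rule that an empty result counts as a miss are all assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence, Tuple

Extractor = Callable[[str], Awaitable[str]]


async def extract_with_fallback(
    url: str, engines: Sequence[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Try each engine in order; return (engine_name, content) from the first success."""
    last_error: Optional[Exception] = None
    for name, extractor in engines:
        try:
            content = await extractor(url)
        except Exception as exc:  # a real chain would catch narrower errors
            last_error = exc
            continue
        if content.strip():  # assumption: an empty result counts as a miss
            return name, content
    raise RuntimeError(f"all engines failed for {url}") from last_error


# Stand-in extractors (hypothetical; these are not Content Core functions)
async def fake_firecrawl(url: str) -> str:
    raise ConnectionError("no API key configured")


async def fake_jina(url: str) -> str:
    return ""


async def fake_beautifulsoup(url: str) -> str:
    return f"<plain text of {url}>"


CHAIN = [
    ("firecrawl", fake_firecrawl),
    ("jina", fake_jina),
    ("simple", fake_beautifulsoup),
]
```

Calling `asyncio.run(extract_with_fallback("https://example.com", CHAIN))` falls through the first two stand-ins and returns the result of the last one.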
|
|
116
|
+
|
|
117
|
+
## Getting Started
|
|
118
|
+
|
|
119
|
+
### Installation
|
|
120
|
+
|
|
121
|
+
Install Content Core using `pip` - **no system dependencies required!**
|
|
122
|
+
|
|
123
|
+
```bash
|
|
124
|
+
# Basic installation (PyMuPDF + BeautifulSoup/Jina extraction)
|
|
125
|
+
pip install content-core
|
|
126
|
+
|
|
127
|
+
# With enhanced document processing (adds Docling)
|
|
128
|
+
pip install "content-core[docling]"
|
|
129
|
+
|
|
130
|
+
# With local browser-based URL extraction (adds Crawl4AI)
|
|
131
|
+
# Note: Requires Playwright browsers (~300MB). Run:
|
|
132
|
+
pip install "content-core[crawl4ai]"
|
|
133
|
+
python -m playwright install --with-deps
|
|
134
|
+
|
|
135
|
+
# Full installation (with all optional features)
|
|
136
|
+
pip install "content-core[docling,crawl4ai]"
|
|
137
|
+
```
|
|
138
|
+
|
|
139
|
+
> **Note:** The core installation uses pure Python implementations and doesn't require system libraries like libmagic, ensuring consistent, hassle-free installation across Windows, macOS, and Linux. Optional features like Crawl4AI (browser automation) may require additional system dependencies.
|
|
140
|
+
|
|
141
|
+
Alternatively, if you're developing locally:
|
|
142
|
+
|
|
143
|
+
```bash
|
|
144
|
+
# Clone the repository
|
|
145
|
+
git clone https://github.com/lfnovo/content-core
|
|
146
|
+
cd content-core
|
|
147
|
+
|
|
148
|
+
# Install with uv
|
|
149
|
+
uv sync
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
### Command-Line Interface
|
|
153
|
+
|
|
154
|
+
Content Core provides three CLI commands for extracting, cleaning, and summarizing content:
|
|
155
|
+
`ccore`, `cclean`, and `csum`. These commands support input from text, URLs, files, or piped data (e.g., via `cat file | command`).
|
|
156
|
+
|
|
157
|
+
**Zero-install usage with uvx:**
|
|
158
|
+
```bash
|
|
159
|
+
# Extract content
|
|
160
|
+
uvx --from "content-core" ccore https://example.com
|
|
161
|
+
|
|
162
|
+
# Clean content
|
|
163
|
+
uvx --from "content-core" cclean "messy content"
|
|
164
|
+
|
|
165
|
+
# Summarize content
|
|
166
|
+
uvx --from "content-core" csum "long text" --context "bullet points"
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
#### ccore - Extract Content
|
|
170
|
+
|
|
171
|
+
Extracts content from text, URLs, or files, with optional formatting.
|
|
172
|
+
Usage:
|
|
173
|
+
```bash
|
|
174
|
+
ccore [-f|--format xml|json|text] [-d|--debug] [content]
|
|
175
|
+
```
|
|
176
|
+
Options:
|
|
177
|
+
- `-f`, `--format`: Output format (xml, json, or text). Default: text.
|
|
178
|
+
- `-d`, `--debug`: Enable debug logging.
|
|
179
|
+
- `content`: Input content (text, URL, or file path). If omitted, reads from stdin.
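
Because the `content` argument may be plain text, a URL, or a file path, something has to decide which one it is. The sketch below shows one plausible way to make that decision using only the standard library; `classify_input` and its rules are assumptions for illustration and are not taken from the `ccore` source.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(value: str) -> str:
    """Return 'url', 'file', or 'text' for a command-line argument."""
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    try:
        if Path(value).is_file():
            return "file"
    except (OSError, ValueError):
        # e.g. the string is too long to be a valid path; treat it as text
        pass
    return "text"
```

Checking for a URL first avoids touching the filesystem for the common case, and anything that is neither a URL nor an existing file is treated as literal text.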
|
|
180
|
+
|
|
181
|
+
Examples:
|
|
182
|
+
|
|
183
|
+
```bash
|
|
184
|
+
# Extract from a URL as text
|
|
185
|
+
ccore https://example.com
|
|
186
|
+
|
|
187
|
+
# Extract from a file as JSON
|
|
188
|
+
ccore -f json document.pdf
|
|
189
|
+
|
|
190
|
+
# Extract from piped text as XML
|
|
191
|
+
echo "Sample text" | ccore --format xml
|
|
192
|
+
```
|
|
193
|
+
|
|
194
|
+
#### cclean - Clean Content
|
|
195
|
+
Cleans content by removing unnecessary formatting, spaces, or artifacts. Accepts text, JSON, XML input, URLs, or file paths.
|
|
196
|
+
Usage:
|
|
197
|
+
|
|
198
|
+
```bash
|
|
199
|
+
cclean [-d|--debug] [content]
|
|
200
|
+
```
|
|
201
|
+
|
|
202
|
+
Options:
|
|
203
|
+
- `-d`, `--debug`: Enable debug logging.
|
|
204
|
+
- `content`: Input content to clean (text, URL, file path, JSON, or XML). If omitted, reads from stdin.
|
|
205
|
+
|
|
206
|
+
Examples:
|
|
207
|
+
|
|
208
|
+
```bash
|
|
209
|
+
# Clean a text string
|
|
210
|
+
cclean " messy text "
|
|
211
|
+
|
|
212
|
+
# Clean piped JSON
|
|
213
|
+
echo '{"content": " messy text "}' | cclean
|
|
214
|
+
|
|
215
|
+
# Clean content from a URL
|
|
216
|
+
cclean https://example.com
|
|
217
|
+
|
|
218
|
+
# Clean a file's content
|
|
219
|
+
cclean document.txt
|
|
220
|
+
```
|
|
221
|
+
|
|
222
|
+
#### csum - Summarize Content
|
|
223
|
+
|
|
224
|
+
Summarizes content with an optional context to guide the summary style. Accepts text, JSON, XML input, URLs, or file paths.
|
|
225
|
+
|
|
226
|
+
Usage:
|
|
227
|
+
|
|
228
|
+
```bash
|
|
229
|
+
csum [--context "context text"] [-d|--debug] [content]
|
|
230
|
+
```
|
|
231
|
+
|
|
232
|
+
Options:
|
|
233
|
+
- `--context`: Context for summarization (e.g., "explain to a child"). Default: none.
|
|
234
|
+
- `-d`, `--debug`: Enable debug logging.
|
|
235
|
+
- `content`: Input content to summarize (text, URL, file path, JSON, or XML). If omitted, reads from stdin.
|
|
236
|
+
|
|
237
|
+
Examples:
|
|
238
|
+
|
|
239
|
+
```bash
|
|
240
|
+
# Summarize text
|
|
241
|
+
csum "AI is transforming industries."
|
|
242
|
+
|
|
243
|
+
# Summarize with context
|
|
244
|
+
csum --context "in bullet points" "AI is transforming industries."
|
|
245
|
+
|
|
246
|
+
# Summarize piped content
|
|
247
|
+
cat article.txt | csum --context "one sentence"
|
|
248
|
+
|
|
249
|
+
# Summarize content from URL
|
|
250
|
+
csum https://example.com
|
|
251
|
+
|
|
252
|
+
# Summarize a file's content
|
|
253
|
+
csum document.txt
|
|
254
|
+
```
|
|
255
|
+
|
|
256
|
+
## Quick Start
|
|
257
|
+
|
|
258
|
+
You can quickly integrate `content-core` into your Python projects to extract, clean, and summarize content from various sources.
|
|
259
|
+
|
|
260
|
+
```python
|
|
261
|
+
import content_core as cc
|
|
262
|
+
|
|
263
|
+
# Extract content from a URL, file, or text
|
|
264
|
+
result = await cc.extract("https://example.com/article")
|
|
265
|
+
|
|
266
|
+
# Clean messy content
|
|
267
|
+
cleaned_text = await cc.clean("...messy text with [brackets] and extra spaces...")
|
|
268
|
+
|
|
269
|
+
# Summarize content with optional context
|
|
270
|
+
summary = await cc.summarize_content("long article text", context="explain to a child")
|
|
271
|
+
|
|
272
|
+
# Extract audio with custom speech-to-text model
|
|
273
|
+
from content_core.common import ProcessSourceInput
|
|
274
|
+
result = await cc.extract(ProcessSourceInput(
|
|
275
|
+
    file_path="interview.mp3",
|
|
276
|
+
    audio_provider="openai",
|
|
277
|
+
    audio_model="whisper-1"
|
|
278
|
+
))
|
|
279
|
+
```
|
|
280
|
+
|
|
281
|
+
## Documentation
|
|
282
|
+
|
|
283
|
+
For more information on how to use the Content Core library, including details on AI model configuration and customization, refer to our [Usage Documentation](docs/usage.md).
|
|
284
|
+
|
|
285
|
+
## MCP Server Integration
|
|
286
|
+
|
|
287
|
+
Content Core includes a Model Context Protocol (MCP) server that enables seamless integration with Claude Desktop and other MCP-compatible applications. The MCP server exposes Content Core's powerful extraction capabilities through a standardized protocol.
|
|
288
|
+
|
|
289
|
+
<a href="https://glama.ai/mcp/servers/@lfnovo/content-core">
|
|
290
|
+
<img width="380" height="200" src="https://glama.ai/mcp/servers/@lfnovo/content-core/badge" />
|
|
291
|
+
</a>
|
|
292
|
+
|
|
293
|
+
### Quick Setup with Claude Desktop
|
|
294
|
+
|
|
295
|
+
```bash
|
|
296
|
+
# Install Content Core (MCP server included)
|
|
297
|
+
pip install content-core
|
|
298
|
+
|
|
299
|
+
# Or use directly with uvx (no installation required)
|
|
300
|
+
uvx --from "content-core" content-core-mcp
|
|
301
|
+
```
|
|
302
|
+
|
|
303
|
+
Add to your `claude_desktop_config.json`:
|
|
304
|
+
```json
{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": [
        "--from",
        "content-core",
        "content-core-mcp"
      ]
    }
  }
}
```
|
|
318
|
+
|
|
319
|
+
For detailed setup instructions, configuration options, and usage examples, see our [MCP Documentation](docs/mcp.md).
|
|
320
|
+
|
|
321
|
+
## Enhanced PDF Processing
|
|
322
|
+
|
|
323
|
+
Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.
|
|
324
|
+
|
|
325
|
+
### Key Improvements
|
|
326
|
+
|
|
327
|
+
- **Mathematical Formula Extraction**: Enhanced quality flags eliminate `<!-- formula-not-decoded -->` placeholders
|
|
328
|
+
- **Automatic Table Detection**: Tables converted to markdown format for LLM consumption
|
|
329
|
+
- **Quality Text Rendering**: Better ligature, whitespace, and image-text integration
|
|
330
|
+
- **Optional OCR Enhancement**: Selective OCR for formula-heavy pages (requires Tesseract)
|
|
331
|
+
|
|
332
|
+
### Configuration for Scientific Documents
|
|
333
|
+
|
|
334
|
+
For documents with heavy mathematical content, enable OCR enhancement:
|
|
335
|
+
|
|
336
|
+
```yaml
# In cc_config.yaml
extraction:
  pymupdf:
    enable_formula_ocr: true  # Enable OCR for formula-heavy pages
    formula_threshold: 3      # Min formulas per page to trigger OCR
    ocr_fallback: true        # Graceful fallback if OCR fails
```
|
|
344
|
+
|
|
345
|
+
```python
|
|
346
|
+
# Runtime configuration
|
|
347
|
+
from content_core.config import set_pymupdf_ocr_enabled
|
|
348
|
+
set_pymupdf_ocr_enabled(True)
|
|
349
|
+
```
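
To make the `formula_threshold` setting concrete, here is a hedged sketch of how a per-page trigger could work. It assumes, purely for illustration, that undecoded formulas are counted via the `<!-- formula-not-decoded -->` placeholder mentioned above; the README does not say how Content Core actually counts formulas, and `pages_needing_ocr` is a hypothetical helper.

```python
from typing import List, Sequence

FORMULA_PLACEHOLDER = "<!-- formula-not-decoded -->"


def pages_needing_ocr(page_texts: Sequence[str], formula_threshold: int = 3) -> List[int]:
    """Return the indexes of pages with at least `formula_threshold` undecoded formulas."""
    return [
        index
        for index, text in enumerate(page_texts)
        if text.count(FORMULA_PLACEHOLDER) >= formula_threshold
    ]
```

With the threshold at 3, a page holding two placeholders would be left alone while a page holding three would be sent to OCR, which keeps the slower OCR step limited to formula-heavy pages.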
|
|
350
|
+
|
|
351
|
+
### Requirements for OCR Enhancement
|
|
352
|
+
|
|
353
|
+
```bash
|
|
354
|
+
# Install Tesseract OCR (optional, for formula enhancement)
|
|
355
|
+
# macOS
|
|
356
|
+
brew install tesseract
|
|
357
|
+
|
|
358
|
+
# Ubuntu/Debian
|
|
359
|
+
sudo apt-get install tesseract-ocr
|
|
360
|
+
```
|
|
361
|
+
|
|
362
|
+
**Note**: OCR is optional - you get improved PDF extraction automatically without any additional setup.
|
|
363
|
+
|
|
364
|
+
## macOS Services Integration
|
|
365
|
+
|
|
366
|
+
Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.
|
|
367
|
+
|
|
368
|
+
### Available Services
|
|
369
|
+
|
|
370
|
+
Create **4 convenient services** for different workflows:
|
|
371
|
+
|
|
372
|
+
- **Extract Content → Clipboard** - Quick copy for immediate pasting
|
|
373
|
+
- **Extract Content → TextEdit** - Review before using
|
|
374
|
+
- **Summarize Content → Clipboard** - Quick summary copying
|
|
375
|
+
- **Summarize Content → TextEdit** - Formatted summary with headers
|
|
376
|
+
|
|
377
|
+
### Quick Setup
|
|
378
|
+
|
|
379
|
+
1. **Install uv** (if not already installed):
|
|
380
|
+
```bash
|
|
381
|
+
curl -LsSf https://astral.sh/uv/install.sh | sh
|
|
382
|
+
```
|
|
383
|
+
|
|
384
|
+
2. **Create services manually** using Automator (5 minutes setup)
|
|
385
|
+
|
|
386
|
+
### Usage
|
|
387
|
+
|
|
388
|
+
**Right-click any supported file** in Finder → **Services** → Choose your option:
|
|
389
|
+
|
|
390
|
+
- **PDFs, Word docs** - Instant text extraction
|
|
391
|
+
- **Videos, audio files** - Automatic transcription
|
|
392
|
+
- **Images** - OCR text recognition
|
|
393
|
+
- **Web content** - Clean text extraction
|
|
394
|
+
- **Multiple files** - Batch processing support
|
|
395
|
+
|
|
396
|
+
### Features
|
|
397
|
+
|
|
398
|
+
- **Zero-install processing**: Uses `uvx` for isolated execution
|
|
399
|
+
- **Multiple output options**: Clipboard or TextEdit display
|
|
400
|
+
- **System notifications**: Visual feedback on completion
|
|
401
|
+
- **Wide format support**: 20+ file types supported
|
|
402
|
+
- **Batch processing**: Handle multiple files at once
|
|
403
|
+
- **Keyboard shortcuts**: Assignable hotkeys for power users
|
|
404
|
+
|
|
405
|
+
For complete setup instructions with copy-paste scripts, see [macOS Services Documentation](docs/macos.md).
|
|
406
|
+
|
|
407
|
+
## Raycast Extension
|
|
408
|
+
|
|
409
|
+
Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.
|
|
410
|
+
|
|
411
|
+
### Quick Setup
|
|
412
|
+
|
|
413
|
+
**From Raycast Store** (coming soon):
|
|
414
|
+
1. Open Raycast and search for "Content Core"
|
|
415
|
+
2. Install the extension by `luis_novo`
|
|
416
|
+
3. Configure API keys in preferences
|
|
417
|
+
|
|
418
|
+
**Manual Installation**:
|
|
419
|
+
1. Download the extension from the repository
|
|
420
|
+
2. Open Raycast → "Import Extension"
|
|
421
|
+
3. Select the `raycast-content-core` folder
|
|
422
|
+
|
|
423
|
+
### Commands
|
|
424
|
+
|
|
425
|
+
**Extract Content** - Smart URL/file detection with full interface
|
|
426
|
+
- Auto-detects URLs vs file paths in real-time
|
|
427
|
+
- Multiple output formats (Text, JSON, XML)
|
|
428
|
+
- Drag & drop support for files
|
|
429
|
+
- Rich results view with metadata
|
|
430
|
+
|
|
431
|
+
**Summarize Content** - AI-powered summaries with customizable styles
|
|
432
|
+
- 9 different summary styles (bullet points, executive summary, etc.)
|
|
433
|
+
- Auto-detects source type with visual feedback
|
|
434
|
+
- One-click snippet creation and quicklinks
|
|
435
|
+
|
|
436
|
+
**Quick Extract** - Instant extraction to clipboard
|
|
437
|
+
- Type → Tab → Paste source → Enter
|
|
438
|
+
- No UI, works directly from command bar
|
|
439
|
+
- Perfect for quick workflows
|
|
440
|
+
|
|
441
|
+
### Features
|
|
442
|
+
|
|
443
|
+
- **Smart Auto-Detection**: Instantly recognizes URLs vs file paths
|
|
444
|
+
- **Zero Installation**: Uses `uvx` for Content Core execution
|
|
445
|
+
- **Rich Integration**: Keyboard shortcuts, clipboard actions, Raycast snippets
|
|
446
|
+
- **All File Types**: Documents, videos, audio, images, archives
|
|
447
|
+
- **Visual Feedback**: Real-time type detection with icons
|
|
448
|
+
|
|
449
|
+
For detailed setup, configuration, and usage examples, see [Raycast Extension Documentation](docs/raycast.md).
|
|
450
|
+
|
|
451
|
+
## Using with Langchain
|
|
452
|
+
|
|
453
|
+
For users integrating with the [Langchain](https://python.langchain.com/) framework, `content-core` exposes a set of compatible tools. These tools, located in the `src/content_core/tools` directory, allow you to leverage `content-core` extraction, cleaning, and summarization capabilities directly within your Langchain agents and chains.
|
|
454
|
+
|
|
455
|
+
You can import and use these tools like any other Langchain tool. For example:
|
|
456
|
+
|
|
457
|
+
```python
|
|
458
|
+
from content_core.tools import extract_content_tool, cleanup_content_tool, summarize_content_tool
|
|
459
|
+
from langchain.agents import initialize_agent, AgentType
|
|
460
|
+
|
|
461
|
+
# `llm` must be a LangChain-compatible model instance that you have already created
tools = [extract_content_tool, cleanup_content_tool, summarize_content_tool]
|
|
462
|
+
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
|
|
463
|
+
agent.run("Extract the content from https://example.com and then summarize it.")
|
|
464
|
+
```
|
|
465
|
+
|
|
466
|
+
Refer to the source code in `src/content_core/tools` for specific tool implementations and usage details.
|
|
467
|
+
|
|
468
|
+
## Basic Usage
|
|
469
|
+
|
|
470
|
+
The core functionality revolves around the `extract_content` function.
|
|
471
|
+
|
|
472
|
+
```python
import asyncio
from content_core.extraction import extract_content

async def main():
    # Extract from raw text
    text_data = await extract_content({"content": "This is my sample text content."})
    print(text_data)

    # Extract from a URL (uses 'auto' engine by default)
    url_data = await extract_content({"url": "https://www.example.com"})
    print(url_data)

    # Extract from a local video file (gets transcript, engine='auto' by default)
    video_data = await extract_content({"file_path": "path/to/your/video.mp4"})
    print(video_data)

    # Extract from a local markdown file (engine='auto' by default)
    md_data = await extract_content({"file_path": "path/to/your/document.md"})
    print(md_data)

    # Per-execution override with Docling for documents
    doc_data = await extract_content({
        "file_path": "path/to/your/document.pdf",
        "document_engine": "docling",
        "output_format": "html"
    })
    print(doc_data)

    # Per-execution override with Firecrawl for URLs
    url_data = await extract_content({
        "url": "https://www.example.com",
        "url_engine": "firecrawl"
    })
    print(url_data)

if __name__ == "__main__":
    asyncio.run(main())
```
|
|
510
|
+
|
|
511
|
+
(See `src/content_core/notebooks/run.ipynb` for more detailed examples.)
|
|
512
|
+
|
|
513
|
+
## Docling Integration
|
|
514
|
+
|
|
515
|
+
Content Core supports an optional Docling-based extraction engine for rich document formats (PDF, DOCX, PPTX, XLSX, Markdown, AsciiDoc, HTML, CSV, Images).
|
|
516
|
+
|
|
517
|
+
|
|
518
|
+
### Enabling Docling
|
|
519
|
+
|
|
520
|
+
Docling is not the default engine when parsing documents. To use it, set the document engine to "docling" as shown below; to bypass it entirely, set the engine to "simple".
|
|
521
|
+
|
|
522
|
+
#### Via configuration file
|
|
523
|
+
|
|
524
|
+
In your `cc_config.yaml` or custom config, set:
|
|
525
|
+
```yaml
extraction:
  document_engine: docling  # 'auto' (default), 'simple', or 'docling'
  url_engine: auto          # 'auto' (default), 'simple', 'firecrawl', or 'jina'
  docling:
    output_format: markdown  # markdown | html | json
```
|
|
532
|
+
|
|
533
|
+
#### Programmatically in Python
|
|
534
|
+
|
|
535
|
+
```python
|
|
536
|
+
import content_core as cc
from content_core.config import set_document_engine, set_url_engine, set_docling_output_format
|
|
537
|
+
|
|
538
|
+
# switch document engine to Docling
|
|
539
|
+
set_document_engine("docling")
|
|
540
|
+
|
|
541
|
+
# switch URL engine to Firecrawl
|
|
542
|
+
set_url_engine("firecrawl")
|
|
543
|
+
|
|
544
|
+
# choose output format: 'markdown', 'html', or 'json'
|
|
545
|
+
set_docling_output_format("html")
|
|
546
|
+
|
|
547
|
+
# now extract as usual with cc.extract; the engine settings above apply
|
|
548
|
+
result = await cc.extract("document.pdf")
|
|
549
|
+
```
|
|
550
|
+
|
|
551
|
+
## Configuration
|
|
552
|
+
|
|
553
|
+
Configuration settings (like API keys for external services, logging levels) can be managed through environment variables or `.env` files, loaded automatically via `python-dotenv`.
|
|
554
|
+
|
|
555
|
+
Example `.env`:
|
|
556
|
+
|
|
557
|
+
```plaintext
|
|
558
|
+
OPENAI_API_KEY=your-key-here
|
|
559
|
+
GOOGLE_API_KEY=your-key-here
|
|
560
|
+
|
|
561
|
+
# Engine Selection (optional)
|
|
562
|
+
CCORE_DOCUMENT_ENGINE=auto # auto, simple, docling
|
|
563
|
+
CCORE_URL_ENGINE=auto # auto, simple, firecrawl, jina, crawl4ai
|
|
564
|
+
|
|
565
|
+
# Audio Processing (optional)
|
|
566
|
+
CCORE_AUDIO_CONCURRENCY=3 # Number of concurrent audio transcriptions (1-10, default: 3)
|
|
567
|
+
|
|
568
|
+
# Esperanto Timeout Configuration (optional)
|
|
569
|
+
ESPERANTO_LLM_TIMEOUT=300 # Language model timeout in seconds (default: 300, max: 3600)
|
|
570
|
+
ESPERANTO_STT_TIMEOUT=3600 # Speech-to-text timeout in seconds (default: 3600, max: 3600)
|
|
571
|
+
```
|
|
572
|
+
|
|
573
|
+
### Engine Selection via Environment Variables
|
|
574
|
+
|
|
575
|
+
For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:
|
|
576
|
+
|
|
577
|
+
- **`CCORE_DOCUMENT_ENGINE`**: Force document engine (`auto`, `simple`, `docling`)
|
|
578
|
+
- **`CCORE_URL_ENGINE`**: Force URL engine (`auto`, `simple`, `firecrawl`, `jina`, `crawl4ai`)
|
|
579
|
+
- **`CCORE_AUDIO_CONCURRENCY`**: Number of concurrent audio transcriptions (1-10, default: 3)
|
|
580
|
+
|
|
581
|
+
These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
|
|
582
|
+
|
|
583
|
+
### Audio Processing Configuration
|
|
584
|
+
|
|
585
|
+
Content Core processes long audio files by splitting them into segments and transcribing them in parallel for improved performance. You can control the concurrency level to balance speed with API rate limits:
|
|
586
|
+
|
|
587
|
+
- **Default**: 3 concurrent transcriptions
|
|
588
|
+
- **Range**: 1-10 concurrent transcriptions
|
|
589
|
+
- **Configuration**: Set via `CCORE_AUDIO_CONCURRENCY` environment variable or `extraction.audio.concurrency` in `cc_config.yaml`
|
|
590
|
+
|
|
591
|
+
Higher concurrency values can speed up processing of long audio/video files but may hit API rate limits. Lower values are more conservative and suitable for accounts with lower API quotas.
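
The effect of the concurrency limit can be sketched with an `asyncio.Semaphore`. This is an illustration of the general technique only; `transcribe_segment` is a stand-in for a real speech-to-text call, and nothing here is taken from Content Core's audio processor.

```python
import asyncio
from typing import List


async def transcribe_segment(index: int) -> str:
    """Stand-in for a speech-to-text API call (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"segment {index}"


async def transcribe_all(segment_count: int, concurrency: int = 3) -> List[str]:
    """Transcribe segments in parallel, never running more than `concurrency` at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(index: int) -> str:
        async with semaphore:
            return await transcribe_segment(index)

    # gather() returns results in submission order, so the transcript stays in sequence
    return list(await asyncio.gather(*(bounded(i) for i in range(segment_count))))
```

Raising `concurrency` lets more segments be in flight at the same time, which is exactly where API rate limits start to matter.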
|
|
592
|
+
|
|
593
|
+
### Retry Configuration
|
|
594
|
+
|
|
595
|
+
Content Core includes automatic retry logic for transient failures in external operations (network requests, API calls, transcription). Retries use exponential backoff with jitter to handle temporary issues gracefully.
|
|
596
|
+
|
|
597
|
+
**Supported operations:**
|
|
598
|
+
- `youtube` - YouTube video title and transcript fetching (5 retries, 2-60s backoff)
|
|
599
|
+
- `url_api` - URL extraction via Jina/Firecrawl APIs (3 retries, 1-30s backoff)
|
|
600
|
+
- `url_network` - Network operations like HEAD requests, BeautifulSoup (3 retries, 0.5-10s backoff)
|
|
601
|
+
- `audio` - Audio transcription API calls (3 retries, 2-30s backoff)
|
|
602
|
+
- `llm` - LLM API calls for cleanup/summary (3 retries, 1-30s backoff)
|
|
603
|
+
- `download` - Remote file downloads (3 retries, 1-15s backoff)
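
To give a feel for what a "2-60s backoff" window means, the sketch below computes a delay schedule using exponential growth, a cap, and a small random jitter. The exact formula and the jitter size are assumptions for illustration; the README does not specify how Content Core computes its delays, and `backoff_delays` is a hypothetical helper.

```python
import random
from typing import List


def backoff_delays(
    max_retries: int, base_delay: float, max_delay: float, jitter: float = 1.0
) -> List[float]:
    """Return one delay per retry: base * 2**attempt plus jitter, capped at max_delay."""
    delays = []
    for attempt in range(max_retries):
        raw = base_delay * (2 ** attempt) + random.uniform(0, jitter)
        delays.append(min(max_delay, raw))
    return delays
```

For the `youtube` profile (5 retries, 2-60s), `backoff_delays(5, 2.0, 60.0)` yields delays that start just above 2 seconds and roughly double each time without exceeding 60 seconds. The jitter keeps many clients from retrying in lockstep.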
|
|
604
|
+
|
|
605
|
+
**Environment variable overrides:**
|
|
606
|
+
```bash
|
|
607
|
+
# Override retry settings per operation type
|
|
608
|
+
CCORE_YOUTUBE_MAX_RETRIES=10 # Max retry attempts (1-20)
|
|
609
|
+
CCORE_YOUTUBE_BASE_DELAY=3 # Base delay in seconds (0.1-60)
|
|
610
|
+
CCORE_YOUTUBE_MAX_DELAY=120 # Max delay in seconds (1-300)
|
|
611
|
+
|
|
612
|
+
# Same pattern for other operations:
|
|
613
|
+
CCORE_URL_API_MAX_RETRIES=5
|
|
614
|
+
CCORE_AUDIO_MAX_RETRIES=5
|
|
615
|
+
CCORE_LLM_MAX_RETRIES=5
|
|
616
|
+
CCORE_DOWNLOAD_MAX_RETRIES=5
|
|
617
|
+
```
|
|
618
|
+
|
|
619
|
+
For detailed configuration, see our [Usage Documentation](docs/usage.md#retry-configuration).
|
|
620
|
+
|
|
621
|
+
### Proxy Configuration
|
|
622
|
+
|
|
623
|
+
Content Core supports HTTP/HTTPS proxy configuration for all external network requests. This is useful when operating in corporate environments, behind firewalls, or when you need to route traffic through a specific server.
|
|
624
|
+
|
|
625
|
+
**Configuration Methods** (in priority order):
|
|
626
|
+
|
|
627
|
+
1. **Per-request**: Pass `proxy` parameter directly in `ProcessSourceInput`
|
|
628
|
+
2. **Programmatic**: Use `set_proxy()` for runtime configuration
|
|
629
|
+
3. **Environment Variables**: `CCORE_HTTP_PROXY`, `HTTP_PROXY`, or `HTTPS_PROXY`
|
|
630
|
+
4. **YAML Config**: Set in `cc_config.yaml`
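
The priority order above amounts to "first configured source wins". The following sketch expresses that rule with plain arguments so it can be run without Content Core installed. `resolve_proxy` is a hypothetical helper, and checking the three environment variables in the order they are listed is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_proxy(
    per_request: Optional[str] = None,
    programmatic: Optional[str] = None,
    yaml_value: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first proxy found, checking sources in the documented priority order."""
    if per_request:
        return per_request
    if programmatic:
        return programmatic
    # Assumption: the variables are consulted in the order the README lists them
    for name in ("CCORE_HTTP_PROXY", "HTTP_PROXY", "HTTPS_PROXY"):
        if env.get(name):
            return env[name]
    return yaml_value or None
```

For example, a proxy set in the environment overrides one from the YAML file, while a per-request value overrides everything else.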
|
|
631
|
+
|
|
632
|
+
**Quick Start:**
|
|
633
|
+
|
|
634
|
+
```bash
|
|
635
|
+
# Via environment variable
|
|
636
|
+
export CCORE_HTTP_PROXY=http://proxy.example.com:8080
|
|
637
|
+
|
|
638
|
+
# With authentication
|
|
639
|
+
export CCORE_HTTP_PROXY=http://user:password@proxy.example.com:8080
|
|
640
|
+
```
|
|
641
|
+
|
|
642
|
+
```python
|
|
643
|
+
# Programmatic configuration
|
|
644
|
+
import content_core as cc
from content_core.config import set_proxy, clear_proxy
|
|
645
|
+
|
|
646
|
+
set_proxy("http://proxy.example.com:8080")
|
|
647
|
+
# ... use Content Core ...
|
|
648
|
+
clear_proxy() # Reset to default behavior
|
|
649
|
+
|
|
650
|
+
# Per-request override
|
|
651
|
+
from content_core.common import ProcessSourceInput
|
|
652
|
+
result = await cc.extract(ProcessSourceInput(
|
|
653
|
+
url="https://example.com",
|
|
654
|
+
proxy="http://specific-proxy:8080"
|
|
655
|
+
))
|
|
656
|
+
```
|
|
657
|
+
|
|
658
|
+
**Supported Services:**
|
|
659
|
+
- All aiohttp requests (URL extraction, downloads)
|
|
660
|
+
- YouTube transcript/title fetching (pytubefix, youtube-transcript-api)
|
|
661
|
+
- Crawl4AI browser automation
|
|
662
|
+
- Esperanto AI models (LLM, speech-to-text)
|
|
663
|
+
|
|
664
|
+
**Note:** Firecrawl does not support client-side proxy configuration. A warning is logged when proxy is configured but Firecrawl is used.
|
|
665
|
+
|
|
666
|
+
For detailed configuration options, see our [Usage Documentation](docs/usage.md#proxy-configuration).

### Timeout Configuration

Content Core uses the Esperanto library for AI model interactions and supports configurable timeouts for different operations. Timeouts prevent requests from hanging indefinitely and ensure reliable processing.

**Configuration Methods** (in priority order):

1. **Config Files** (highest priority): Set in `cc_config.yaml` or `models_config.yaml`
2. **Environment Variables**: Provide global defaults via `ESPERANTO_LLM_TIMEOUT` and `ESPERANTO_STT_TIMEOUT` when a timeout isn't specified in configuration files

**Default Timeouts:**

- **Speech-to-Text**: 3600 seconds (1 hour) - for very long audio files
- **Language Models**: 300-600 seconds - for content processing operations
- **Cleanup Model**: 600 seconds (10 minutes) - handles large content with 8000 max tokens
- **Summary Model**: 300 seconds (5 minutes) - for content summarization

**Environment Variable Overrides:**

```bash
# Override language model timeout globally (used when config files omit a timeout)
export ESPERANTO_LLM_TIMEOUT=300

# Override speech-to-text timeout globally (used when config files omit a timeout)
export ESPERANTO_STT_TIMEOUT=3600
```

**Valid Range:** 1 to 3600 seconds (1 hour maximum)

For more details on Esperanto timeout configuration, see the [Esperanto documentation](https://github.com/lfnovo/esperanto/blob/main/docs/advanced/timeout-configuration.md).
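
To make the precedence concrete, here is a small sketch of how a timeout could be resolved: a value from a config file wins, the environment variable fills in when the config omits one, and a default applies otherwise. `resolve_timeout` is a hypothetical function, not Esperanto's or Content Core's API, and raising `ValueError` for values outside the 1 to 3600 second range is an assumption about how invalid input might be handled.

```python
import os

MIN_TIMEOUT, MAX_TIMEOUT = 1, 3600  # valid range in seconds, as stated above


def resolve_timeout(config_value, env_name, default, env=None):
    """Pick a timeout: config file value, then environment variable, then default.

    Hypothetical helper; out-of-range handling here is an assumption.
    """
    env = os.environ if env is None else env
    if config_value is not None:
        timeout = config_value           # config files have highest priority
    elif env_name in env:
        timeout = float(env[env_name])   # global fallback from the environment
    else:
        timeout = default
    if not MIN_TIMEOUT <= timeout <= MAX_TIMEOUT:
        raise ValueError(f"timeout {timeout} is outside {MIN_TIMEOUT}-{MAX_TIMEOUT} seconds")
    return timeout


# The config file sets 600, so the environment value is ignored.
print(resolve_timeout(600, "ESPERANTO_LLM_TIMEOUT", 300,
                      env={"ESPERANTO_LLM_TIMEOUT": "120"}))
```
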

### Custom Prompt Templates

Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. You can also create your own templates, store them in a dedicated directory, and point Content Core at it by setting the `PROMPT_PATH` environment variable in your `.env` file or system environment.

Example `.env` with custom prompt path:

```plaintext
OPENAI_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here
PROMPT_PATH=/path/to/your/custom/prompts
```

When a prompt template is requested, Content Core first looks in the custom directory specified by `PROMPT_PATH` (if the variable is set and the directory exists). If the template is not found there, it falls back to the built-in prompts. This lets you override specific prompts while keeping the defaults for the rest.
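
The lookup described above amounts to searching an ordered list of directories. The sketch below shows one way to express that; `find_template` and `builtin_dir` are hypothetical names, and template file names are taken as given because the naming scheme Content Core uses is not specified here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_template(name: str, builtin_dir: Path,
                  env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the path of a template, preferring the PROMPT_PATH directory.

    Hypothetical helper illustrating the fallback order, not the library's code.
    """
    env = os.environ if env is None else env
    search_dirs = []
    custom = env.get("PROMPT_PATH")
    if custom and Path(custom).is_dir():
        search_dirs.append(Path(custom))  # custom prompts are checked first
    search_dirs.append(builtin_dir)       # built-in prompts are the fallback
    for directory in search_dirs:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

Because the custom directory is only consulted per template, a single overriding file there leaves every other built-in prompt in effect.
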

## Development

To set up a development environment:

```bash
# Clone the repository
git clone <repository-url>
cd content-core

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv sync --group dev

# Run tests
make test

# Lint code
make lint

# See all commands
make help
```

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please see our [Contributing Guide](CONTRIBUTING.md) for details on how to get started.