content-core 1.1.2__tar.gz → 1.2.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of content-core might be problematic.
- content_core-1.2.1/.github/workflows/claude-code-review.yml +77 -0
- content_core-1.2.1/.github/workflows/claude.yml +59 -0
- {content_core-1.1.2 → content_core-1.2.1}/.gitignore +4 -1
- {content_core-1.1.2 → content_core-1.2.1}/PKG-INFO +170 -22
- {content_core-1.1.2 → content_core-1.2.1}/README.md +167 -20
- {content_core-1.1.2 → content_core-1.2.1}/docs/mcp.md +28 -0
- {content_core-1.1.2 → content_core-1.2.1}/docs/processors.md +26 -4
- content_core-1.2.1/docs/raycast.md +268 -0
- {content_core-1.1.2 → content_core-1.2.1}/docs/usage.md +98 -2
- content_core-1.2.1/mcp.md +248 -0
- content_core-1.2.1/new_pdf.pdf +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/pyproject.toml +2 -2
- content_core-1.2.1/raycast-content-core/.eslintrc.json +9 -0
- content_core-1.2.1/raycast-content-core/CHANGELOG.md +36 -0
- content_core-1.2.1/raycast-content-core/README.md +150 -0
- content_core-1.2.1/raycast-content-core/assets/command-icon.png +0 -0
- content_core-1.2.1/raycast-content-core/package-lock.json +2094 -0
- content_core-1.2.1/raycast-content-core/package.json +124 -0
- content_core-1.2.1/raycast-content-core/raycast-env.d.ts +42 -0
- content_core-1.2.1/raycast-content-core/src/extract-content.tsx +306 -0
- content_core-1.2.1/raycast-content-core/src/quick-extract.tsx +128 -0
- content_core-1.2.1/raycast-content-core/src/summarize-content.tsx +420 -0
- content_core-1.2.1/raycast-content-core/src/utils/content-core.ts +348 -0
- content_core-1.2.1/raycast-content-core/src/utils/types.ts +27 -0
- content_core-1.2.1/raycast-content-core/tsconfig.json +30 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/cc_config.yaml +4 -0
- content_core-1.2.1/src/content_core/config.py +104 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/extraction/graph.py +33 -21
- content_core-1.2.1/src/content_core/notebooks/urls.ipynb +154 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/docling.py +13 -6
- content_core-1.2.1/src/content_core/processors/pdf.py +292 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/url.py +3 -2
- content_core-1.2.1/test.py +16 -0
- content_core-1.2.1/tests/unit/test_config.py +109 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/unit/test_docling.py +4 -1
- content_core-1.2.1/tests/unit/test_pymupdf_ocr.py +275 -0
- {content_core-1.1.2 → content_core-1.2.1}/uv.lock +2466 -2502
- content_core-1.1.2/src/content_core/config.py +0 -49
- content_core-1.1.2/src/content_core/processors/pdf.py +0 -168
- {content_core-1.1.2 → content_core-1.2.1}/.github/PULL_REQUEST_TEMPLATE.md +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/.github/workflows/publish.yml +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/.python-version +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/CONTRIBUTING.md +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/LICENSE +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/Makefile +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/docs/macos.md +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/prompts/content/cleanup.jinja +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/prompts/content/summarize.jinja +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/__init__.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/common/__init__.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/common/exceptions.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/common/state.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/common/types.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/common/utils.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/__init__.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/cleanup/__init__.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/cleanup/core.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/extraction/__init__.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/identification/__init__.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/summary/__init__.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/summary/core.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/logging.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/mcp/__init__.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/mcp/server.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/models.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/models_config.yaml +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/notebooks/run.ipynb +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/audio.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/office.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/text.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/video.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/youtube.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/py.typed +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/templated_message.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/tools/__init__.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/tools/cleanup.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/tools/extract.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/src/content_core/tools/summarize.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.docx +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.epub +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.md +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.mp3 +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.mp4 +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.pdf +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.pptx +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.txt +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.xlsx +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file_audio.mp3 +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/integration/test_cli.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/integration/test_extraction.py +0 -0
- {content_core-1.1.2 → content_core-1.2.1}/tests/unit/test_mcp_server.py +0 -0
--- /dev/null
+++ content_core-1.2.1/.github/workflows/claude-code-review.yml
@@ -0,0 +1,77 @@
+name: Claude Code Review
+
+on:
+  pull_request:
+    types: [opened, synchronize]
+    # Optional: Only run on specific file changes
+    # paths:
+    # - "src/**/*.ts"
+    # - "src/**/*.tsx"
+    # - "src/**/*.js"
+    # - "src/**/*.jsx"
+
+jobs:
+  claude-review:
+    # Optional: Filter by PR author
+    # if: |
+    # github.event.pull_request.user.login == 'external-contributor' ||
+    # github.event.pull_request.user.login == 'new-developer' ||
+    # github.event.pull_request.author_association == 'FIRST_TIME_CONTRIBUTOR'
+
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      pull-requests: read
+      issues: read
+      id-token: write
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 1
+
+      - name: Run Claude Code Review
+        id: claude-review
+        uses: anthropics/claude-code-action@beta
+        with:
+          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
+
+          # Optional: Specify model (defaults to Claude Sonnet 4, uncomment for Claude Opus 4)
+          # model: "claude-opus-4-20250514"
+
+          # Direct prompt for automated review (no @claude mention needed)
+          direct_prompt: |
+            Please review this pull request (and any added updates) and provide feedback on:
+            - Code quality and best practices
+            - Potential bugs or issues
+            - Performance considerations
+            - Documentation updates: does the README.md and docs/ reflect what changed?
+            - Security concerns
+            - Test coverage
+
+            If the pull request received updates after your first review, please redo the review based on the new code
+            Be constructive and helpful in your feedback.
+
+          # Optional: Customize review based on file types
+          # direct_prompt: |
+          # Review this PR focusing on:
+          # - For TypeScript files: Type safety and proper interface usage
+          # - For API endpoints: Security, input validation, and error handling
+          # - For React components: Performance, accessibility, and best practices
+          # - For tests: Coverage, edge cases, and test quality
+
+          # Optional: Different prompts for different authors
+          # direct_prompt: |
+          # ${{ github.event.pull_request.author_association == 'FIRST_TIME_CONTRIBUTOR' &&
+          # 'Welcome! Please review this PR from a first-time contributor. Be encouraging and provide detailed explanations for any suggestions.' ||
+          # 'Please provide a thorough code review focusing on our coding standards and best practices.' }}
+
+          # Optional: Add specific tools for running tests or linting
+          # allowed_tools: "Bash(npm run test),Bash(npm run lint),Bash(npm run typecheck)"
+
+          # Optional: Skip review for certain conditions
+          # if: |
+          # !contains(github.event.pull_request.title, '[skip-review]') &&
+          # !contains(github.event.pull_request.title, '[WIP]')
+
--- /dev/null
+++ content_core-1.2.1/.github/workflows/claude.yml
@@ -0,0 +1,59 @@
+name: Claude Code
+
+on:
+  issue_comment:
+    types: [created]
+  pull_request_review_comment:
+    types: [created]
+  issues:
+    types: [opened, assigned]
+  pull_request_review:
+    types: [submitted]
+
+jobs:
+  claude:
+    if: |
+      (github.event_name == 'issue_comment' && contains(github.event.comment.body, '@claude')) ||
+      (github.event_name == 'pull_request_review_comment' && contains(github.event.comment.body, '@claude')) ||
+      (github.event_name == 'pull_request_review' && contains(github.event.review.body, '@claude')) ||
+      (github.event_name == 'issues' && (contains(github.event.issue.body, '@claude') || contains(github.event.issue.title, '@claude')))
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      pull-requests: read
+      issues: read
+      id-token: write
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 1
+
+      - name: Run Claude Code
+        id: claude
+        uses: anthropics/claude-code-action@beta
+        with:
+          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
+
+          # Optional: Specify model (defaults to Claude Sonnet 4, uncomment for Claude Opus 4)
+          # model: "claude-opus-4-20250514"
+
+          # Optional: Customize the trigger phrase (default: @claude)
+          # trigger_phrase: "/claude"
+
+          # Optional: Trigger when specific user is assigned to an issue
+          # assignee_trigger: "claude-bot"
+
+          # Optional: Allow Claude to run specific commands
+          # allowed_tools: "Bash(npm install),Bash(npm run build),Bash(npm run test:*),Bash(npm run lint:*)"
+
+          # Optional: Add custom instructions for Claude to customize its behavior for your project
+          # custom_instructions: |
+          # Follow our coding standards
+          # Ensure all new code has tests
+          # Use TypeScript for new files
+
+          # Optional: Custom environment variables for Claude
+          # claude_env: |
+          # NODE_ENV: test
+
--- content_core-1.1.2/PKG-INFO
+++ content_core-1.2.1/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: content-core
-Version: 1.1
+Version: 1.2.1
 Summary: Extract what matters from any media source. Available as Python Library, macOS Service, CLI and MCP Server
 Author-email: LUIS NOVO <lfnovo@gmail.com>
 License-File: LICENSE
@@ -10,7 +10,6 @@ Requires-Dist: aiohttp>=3.11
 Requires-Dist: asciidoc>=10.2.1
 Requires-Dist: bs4>=0.0.2
 Requires-Dist: dicttoxml>=1.7.16
-Requires-Dist: docling>=2.34.0
 Requires-Dist: esperanto>=1.2.0
 Requires-Dist: firecrawl-py>=2.7.0
 Requires-Dist: jinja2>=3.1.6
@@ -31,6 +30,8 @@ Requires-Dist: pytubefix>=9.1.1
 Requires-Dist: readability-lxml>=0.8.4.1
 Requires-Dist: validators>=0.34.0
 Requires-Dist: youtube-transcript-api>=1.0.3
+Provides-Extra: docling
+Requires-Dist: docling>=2.34.0; extra == 'docling'
 Provides-Extra: mcp
 Requires-Dist: fastmcp>=0.5.0; extra == 'mcp'
 Description-Content-Type: text/markdown
@@ -39,29 +40,70 @@ Description-Content-Type: text/markdown
|
|
|
39
40
|
|
|
40
41
|
[](https://opensource.org/licenses/MIT)
|
|
41
42
|
|
|
42
|
-
**Content Core** is a
|
|
43
|
+
**Content Core** is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries—all through a unified interface with multiple integration options.
|
|
43
44
|
|
|
44
|
-
##
|
|
45
|
+
## 🚀 What You Can Do
|
|
45
46
|
|
|
46
|
-
|
|
47
|
+
**Extract content from anywhere:**
|
|
48
|
+
- 📄 **Documents** - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
|
|
49
|
+
- 🎥 **Media** - Videos (MP4, AVI, MOV) with automatic transcription
|
|
50
|
+
- 🎵 **Audio** - MP3, WAV, M4A with speech-to-text conversion
|
|
51
|
+
- 🌐 **Web** - Any URL with intelligent content extraction
|
|
52
|
+
- 🖼️ **Images** - JPG, PNG, TIFF with OCR text recognition
|
|
53
|
+
- 📦 **Archives** - ZIP, TAR, GZ with content analysis
|
|
47
54
|
|
|
48
|
-
|
|
55
|
+
**Process with AI:**
|
|
56
|
+
- ✨ **Clean & format** extracted content automatically
|
|
57
|
+
- 📝 **Generate summaries** with customizable styles (bullet points, executive summary, etc.)
|
|
58
|
+
- 🎯 **Context-aware processing** - explain to a child, technical summary, action items
|
|
59
|
+
- 🔄 **Smart engine selection** - automatically chooses the best extraction method
|
|
49
60
|
|
|
50
|
-
##
|
|
61
|
+
## 🛠️ Multiple Ways to Use
|
|
51
62
|
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
63
|
+
### 🖥️ Command Line (Zero Install)
|
|
64
|
+
```bash
|
|
65
|
+
# Extract content from any source
|
|
66
|
+
uvx --from "content-core" ccore https://example.com
|
|
67
|
+
uvx --from "content-core" ccore document.pdf
|
|
68
|
+
|
|
69
|
+
# Generate AI summaries
|
|
70
|
+
uvx --from "content-core" csum video.mp4 --context "bullet points"
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
### 🤖 Claude Desktop Integration
|
|
74
|
+
One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.
|
|
75
|
+
|
|
76
|
+
### 🔍 Raycast Extension
|
|
77
|
+
Smart auto-detection commands:
|
|
78
|
+
- **Extract Content** - Full interface with format options
|
|
79
|
+
- **Summarize Content** - 9 summary styles available
|
|
80
|
+
- **Quick Extract** - Instant clipboard extraction
|
|
81
|
+
|
|
82
|
+
### 🖱️ macOS Right-Click Integration
|
|
83
|
+
Right-click any file in Finder → Services → Extract or Summarize content instantly.
|
|
84
|
+
|
|
85
|
+
### 🐍 Python Library
|
|
86
|
+
```python
|
|
87
|
+
import content_core as cc
|
|
88
|
+
|
|
89
|
+
# Extract from any source
|
|
90
|
+
result = await cc.extract("https://example.com/article")
|
|
91
|
+
summary = await cc.summarize_content(result, context="explain to a child")
|
|
92
|
+
```
|
|
93
|
+
|
|
94
|
+
## ⚡ Key Features
|
|
95
|
+
|
|
96
|
+
* **🎯 Intelligent Auto-Detection:** Automatically selects the best extraction method based on content type and available services
|
|
97
|
+
* **🔧 Smart Engine Selection:**
|
|
98
|
+
* **URLs:** Firecrawl → Jina → BeautifulSoup fallback chain
|
|
99
|
+
* **Documents:** Docling → Enhanced PyMuPDF → Simple extraction fallback
|
|
100
|
+
* **Media:** OpenAI Whisper transcription
|
|
101
|
+
* **Images:** OCR with multiple engine support
|
|
102
|
+
* **📊 Enhanced PDF Processing:** Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
|
|
103
|
+
* **🌍 Multiple Integrations:** CLI, Python library, MCP server, Raycast extension, macOS Services
|
|
104
|
+
* **⚡ Zero-Install Options:** Use `uvx` for instant access without installation
|
|
105
|
+
* **🧠 AI-Powered Processing:** LLM integration for content cleaning and summarization
|
|
106
|
+
* **🔄 Asynchronous:** Built with `asyncio` for efficient processing
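The fallback chains listed under Smart Engine Selection can be pictured as trying each engine in order and moving on when one fails. The sketch below is illustrative only: `extract_with_fallback` and the stand-in engine functions are hypothetical names, not Content Core's actual API, and the real engine-selection logic may differ.

```python
from typing import Callable, Optional, Sequence, Tuple


# Hypothetical helper, not part of Content Core: try each engine in order
# and return the name of the first engine that succeeds plus its output.
def extract_with_fallback(
    source: str,
    engines: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    last_error: Optional[Exception] = None
    for name, engine in engines:
        try:
            return name, engine(source)
        except Exception as exc:  # any failure moves on to the next engine
            last_error = exc
    raise RuntimeError(f"all engines failed for {source!r}") from last_error


# Stand-in engines for illustration; real ones would call external services.
def unavailable_engine(source: str) -> str:
    raise ConnectionError("service not configured")


def simple_engine(source: str) -> str:
    return f"plain text extracted from {source}"


used, text = extract_with_fallback(
    "https://example.com",
    [
        ("firecrawl", unavailable_engine),
        ("jina", unavailable_engine),
        ("simple", simple_engine),
    ],
)
```

Returning the engine name alongside the text makes it easy to log which step of the chain actually produced the result.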
|
|
65
107
|
|
|
66
108
|
## Getting Started
|
|
67
109
|
|
|
@@ -70,11 +112,17 @@ The primary goal of Content Core is to simplify the process of ingesting content
|
|
|
70
112
|
Install Content Core using `pip`:
|
|
71
113
|
|
|
72
114
|
```bash
|
|
73
|
-
#
|
|
115
|
+
# Basic installation (PyMuPDF + BeautifulSoup/Jina extraction)
|
|
74
116
|
pip install content-core
|
|
75
117
|
|
|
76
|
-
#
|
|
118
|
+
# With enhanced document processing (adds Docling)
|
|
119
|
+
pip install content-core[docling]
|
|
120
|
+
|
|
121
|
+
# With MCP server support
|
|
77
122
|
pip install content-core[mcp]
|
|
123
|
+
|
|
124
|
+
# Full installation
|
|
125
|
+
pip install content-core[docling,mcp]
|
|
78
126
|
```
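Because Docling moves to an optional extra in this release, code that prefers it has to cope with it being absent. A minimal sketch of that check, assuming the engine names `docling` and `simple` from the README; `has_module` and `pick_document_engine` are hypothetical helpers, not functions exported by Content Core:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


# Hypothetical selection rule: prefer Docling when the extra is installed,
# otherwise fall back to the simple (PyMuPDF-based) engine.
def pick_document_engine() -> str:
    return "docling" if has_module("docling") else "simple"
```

Using `find_spec` avoids importing the heavy package just to learn whether it is available.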
|
|
79
127
|
|
|
80
128
|
Alternatively, if you’re developing locally:
|
|
@@ -245,6 +293,49 @@ Add to your `claude_desktop_config.json`:
|
|
|
245
293
|
|
|
246
294
|
For detailed setup instructions, configuration options, and usage examples, see our [MCP Documentation](docs/mcp.md).
|
|
247
295
|
|
|
296
|
+
## Enhanced PDF Processing
|
|
297
|
+
|
|
298
|
+
Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.
|
|
299
|
+
|
|
300
|
+
### Key Improvements
|
|
301
|
+
|
|
302
|
+
- **🔬 Mathematical Formula Extraction**: Enhanced quality flags eliminate `<!-- formula-not-decoded -->` placeholders
|
|
303
|
+
- **📊 Automatic Table Detection**: Tables converted to markdown format for LLM consumption
|
|
304
|
+
- **🔧 Quality Text Rendering**: Better ligature, whitespace, and image-text integration
|
|
305
|
+
- **⚡ Optional OCR Enhancement**: Selective OCR for formula-heavy pages (requires Tesseract)
|
|
306
|
+
|
|
307
|
+
### Configuration for Scientific Documents
|
|
308
|
+
|
|
309
|
+
For documents with heavy mathematical content, enable OCR enhancement:
|
|
310
|
+
|
|
311
|
+
```yaml
|
|
312
|
+
# In cc_config.yaml
|
|
313
|
+
extraction:
|
|
314
|
+
pymupdf:
|
|
315
|
+
enable_formula_ocr: true # Enable OCR for formula-heavy pages
|
|
316
|
+
formula_threshold: 3 # Min formulas per page to trigger OCR
|
|
317
|
+
ocr_fallback: true # Graceful fallback if OCR fails
|
|
318
|
+
```
|
|
319
|
+
|
|
320
|
+
```python
|
|
321
|
+
# Runtime configuration
|
|
322
|
+
from content_core.config import set_pymupdf_ocr_enabled
|
|
323
|
+
set_pymupdf_ocr_enabled(True)
|
|
324
|
+
```
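As a rough illustration of how a per-page threshold such as `formula_threshold` could gate OCR, the sketch below counts undecoded-formula placeholders on a page. Counting `<!-- formula-not-decoded -->` markers is an assumption about how formula-heavy pages are detected, and `should_ocr_page` is a hypothetical function, not the code in `processors/pdf.py`.

```python
FORMULA_PLACEHOLDER = "<!-- formula-not-decoded -->"


# Hypothetical decision rule: OCR a page only when the feature is enabled
# and the page holds at least `formula_threshold` undecoded formulas.
def should_ocr_page(
    page_text: str,
    *,
    enable_formula_ocr: bool,
    formula_threshold: int,
) -> bool:
    if not enable_formula_ocr:
        return False
    return page_text.count(FORMULA_PLACEHOLDER) >= formula_threshold
```

Keeping the check per page means OCR cost is only paid where extraction is most likely to have lost formulas.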
|
|
325
|
+
|
|
326
|
+
### Requirements for OCR Enhancement
|
|
327
|
+
|
|
328
|
+
```bash
|
|
329
|
+
# Install Tesseract OCR (optional, for formula enhancement)
|
|
330
|
+
# macOS
|
|
331
|
+
brew install tesseract
|
|
332
|
+
|
|
333
|
+
# Ubuntu/Debian
|
|
334
|
+
sudo apt-get install tesseract-ocr
|
|
335
|
+
```
|
|
336
|
+
|
|
337
|
+
**Note**: OCR is optional - you get improved PDF extraction automatically without any additional setup.
|
|
338
|
+
|
|
248
339
|
## macOS Services Integration
|
|
249
340
|
|
|
250
341
|
Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.
|
|
@@ -288,6 +379,50 @@ Create **4 convenient services** for different workflows:
|
|
|
288
379
|
|
|
289
380
|
For complete setup instructions with copy-paste scripts, see [macOS Services Documentation](docs/macos.md).
|
|
290
381
|
|
|
382
|
+
## Raycast Extension
|
|
383
|
+
|
|
384
|
+
Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.
|
|
385
|
+
|
|
386
|
+
### Quick Setup
|
|
387
|
+
|
|
388
|
+
**From Raycast Store** (coming soon):
|
|
389
|
+
1. Open Raycast and search for "Content Core"
|
|
390
|
+
2. Install the extension by `luis_novo`
|
|
391
|
+
3. Configure API keys in preferences
|
|
392
|
+
|
|
393
|
+
**Manual Installation**:
|
|
394
|
+
1. Download the extension from the repository
|
|
395
|
+
2. Open Raycast → "Import Extension"
|
|
396
|
+
3. Select the `raycast-content-core` folder
|
|
397
|
+
|
|
398
|
+
### Commands
|
|
399
|
+
|
|
400
|
+
**🔍 Extract Content** - Smart URL/file detection with full interface
|
|
401
|
+
- Auto-detects URLs vs file paths in real-time
|
|
402
|
+
- Multiple output formats (Text, JSON, XML)
|
|
403
|
+
- Drag & drop support for files
|
|
404
|
+
- Rich results view with metadata
|
|
405
|
+
|
|
406
|
+
**📝 Summarize Content** - AI-powered summaries with customizable styles
|
|
407
|
+
- 9 different summary styles (bullet points, executive summary, etc.)
|
|
408
|
+
- Auto-detects source type with visual feedback
|
|
409
|
+
- One-click snippet creation and quicklinks
|
|
410
|
+
|
|
411
|
+
**⚡ Quick Extract** - Instant extraction to clipboard
|
|
412
|
+
- Type → Tab → Paste source → Enter
|
|
413
|
+
- No UI, works directly from command bar
|
|
414
|
+
- Perfect for quick workflows
|
|
415
|
+
|
|
416
|
+
### Features
|
|
417
|
+
|
|
418
|
+
- **Smart Auto-Detection**: Instantly recognizes URLs vs file paths
|
|
419
|
+
- **Zero Installation**: Uses `uvx` for Content Core execution
|
|
420
|
+
- **Rich Integration**: Keyboard shortcuts, clipboard actions, Raycast snippets
|
|
421
|
+
- **All File Types**: Documents, videos, audio, images, archives
|
|
422
|
+
- **Visual Feedback**: Real-time type detection with icons
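The URL-versus-file-path detection described above can be sketched in a few lines. The extension itself is written in TypeScript; this Python version only illustrates the idea, and `detect_source_type` with its three return labels is a hypothetical name, not taken from the extension's source.

```python
from pathlib import Path
from urllib.parse import urlparse


# Hypothetical classifier: "url" for http(s) links, "file" for an existing
# regular file on disk, "unknown" for anything else.
def detect_source_type(value: str) -> str:
    candidate = value.strip()
    parsed = urlparse(candidate)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if candidate and Path(candidate).expanduser().is_file():
        return "file"
    return "unknown"
```

Checking the URL form first keeps the common case cheap, since it needs no filesystem access.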
|
|
423
|
+
|
|
424
|
+
For detailed setup, configuration, and usage examples, see [Raycast Extension Documentation](docs/raycast.md).
|
|
425
|
+
|
|
291
426
|
## Using with Langchain
|
|
292
427
|
|
|
293
428
|
For users integrating with the [Langchain](https://python.langchain.com/) framework, `content-core` exposes a set of compatible tools. These tools, located in the `src/content_core/tools` directory, allow you to leverage `content-core` extraction, cleaning, and summarization capabilities directly within your Langchain agents and chains.
|
|
@@ -397,8 +532,21 @@ Example `.env`:
|
|
|
397
532
|
```plaintext
|
|
398
533
|
OPENAI_API_KEY=your-key-here
|
|
399
534
|
GOOGLE_API_KEY=your-key-here
|
|
535
|
+
|
|
536
|
+
# Engine Selection (optional)
|
|
537
|
+
CCORE_DOCUMENT_ENGINE=auto # auto, simple, docling
|
|
538
|
+
CCORE_URL_ENGINE=auto # auto, simple, firecrawl, jina
|
|
400
539
|
```
|
|
401
540
|
|
|
541
|
+
### Engine Selection via Environment Variables
|
|
542
|
+
|
|
543
|
+
For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:
|
|
544
|
+
|
|
545
|
+
- **`CCORE_DOCUMENT_ENGINE`**: Force document engine (`auto`, `simple`, `docling`)
|
|
546
|
+
- **`CCORE_URL_ENGINE`**: Force URL engine (`auto`, `simple`, `firecrawl`, `jina`)
|
|
547
|
+
|
|
548
|
+
These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
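The precedence rule can be sketched as follows. This is a minimal illustration, assuming an invalid override is ignored in favour of the config value (the README does not say what happens in that case); `resolve_document_engine` is a hypothetical helper, not the function in `content_core.config`.

```python
import os
from typing import Mapping, Optional

ALLOWED_DOCUMENT_ENGINES = ("auto", "simple", "docling")


# Hypothetical resolver: a valid CCORE_DOCUMENT_ENGINE value wins over the
# value read from the config file; anything else leaves the config value.
def resolve_document_engine(
    config_value: str = "auto",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    env = os.environ if env is None else env
    override = env.get("CCORE_DOCUMENT_ENGINE", "").strip().lower()
    if override in ALLOWED_DOCUMENT_ENGINES:
        return override
    return config_value
```

Passing `env` explicitly, as in `resolve_document_engine("docling", env={"CCORE_DOCUMENT_ENGINE": "simple"})`, keeps the rule testable without touching the real process environment.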
|
|
549
|
+
|
|
402
550
|
### Custom Prompt Templates
|
|
403
551
|
|
|
404
552
|
Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the `PROMPT_PATH` environment variable in your `.env` file or system environment.
|
|
@@ -2,29 +2,70 @@
|
|
|
2
2
|
|
|
3
3
|
[](https://opensource.org/licenses/MIT)
|
|
4
4
|
|
|
5
|
-
**Content Core** is a
|
|
5
|
+
**Content Core** is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries—all through a unified interface with multiple integration options.
|
|
6
6
|
|
|
7
|
-
##
|
|
7
|
+
## 🚀 What You Can Do
|
|
8
8
|
|
|
9
|
-
|
|
9
|
+
**Extract content from anywhere:**
|
|
10
|
+
- 📄 **Documents** - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
|
|
11
|
+
- 🎥 **Media** - Videos (MP4, AVI, MOV) with automatic transcription
|
|
12
|
+
- 🎵 **Audio** - MP3, WAV, M4A with speech-to-text conversion
|
|
13
|
+
- 🌐 **Web** - Any URL with intelligent content extraction
|
|
14
|
+
- 🖼️ **Images** - JPG, PNG, TIFF with OCR text recognition
|
|
15
|
+
- 📦 **Archives** - ZIP, TAR, GZ with content analysis
|
|
10
16
|
|
|
11
|
-
|
|
17
|
+
**Process with AI:**
|
|
18
|
+
- ✨ **Clean & format** extracted content automatically
|
|
19
|
+
- 📝 **Generate summaries** with customizable styles (bullet points, executive summary, etc.)
|
|
20
|
+
- 🎯 **Context-aware processing** - explain to a child, technical summary, action items
|
|
21
|
+
- 🔄 **Smart engine selection** - automatically chooses the best extraction method
|
|
12
22
|
|
|
13
|
-
##
|
|
23
|
+
## 🛠️ Multiple Ways to Use
|
|
14
24
|
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
|
|
22
|
-
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
25
|
+
### 🖥️ Command Line (Zero Install)
|
|
26
|
+
```bash
|
|
27
|
+
# Extract content from any source
|
|
28
|
+
uvx --from "content-core" ccore https://example.com
|
|
29
|
+
uvx --from "content-core" ccore document.pdf
|
|
30
|
+
|
|
31
|
+
# Generate AI summaries
|
|
32
|
+
uvx --from "content-core" csum video.mp4 --context "bullet points"
|
|
33
|
+
```
|
|
34
|
+
|
|
35
|
+
### 🤖 Claude Desktop Integration
|
|
36
|
+
One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.
|
|
37
|
+
|
|
38
|
+
### 🔍 Raycast Extension
|
|
39
|
+
Smart auto-detection commands:
|
|
40
|
+
- **Extract Content** - Full interface with format options
|
|
41
|
+
- **Summarize Content** - 9 summary styles available
|
|
42
|
+
- **Quick Extract** - Instant clipboard extraction
|
|
43
|
+
|
|
44
|
+
### 🖱️ macOS Right-Click Integration
|
|
45
|
+
Right-click any file in Finder → Services → Extract or Summarize content instantly.
|
|
46
|
+
|
|
47
|
+
### 🐍 Python Library
|
|
48
|
+
```python
|
|
49
|
+
import content_core as cc
|
|
50
|
+
|
|
51
|
+
# Extract from any source
|
|
52
|
+
result = await cc.extract("https://example.com/article")
|
|
53
|
+
summary = await cc.summarize_content(result, context="explain to a child")
|
|
54
|
+
```
|
|
55
|
+
|
|
56
|
+
## ⚡ Key Features
|
|
57
|
+
|
|
58
|
+
* **🎯 Intelligent Auto-Detection:** Automatically selects the best extraction method based on content type and available services
|
|
59
|
+
* **🔧 Smart Engine Selection:**
|
|
60
|
+
* **URLs:** Firecrawl → Jina → BeautifulSoup fallback chain
|
|
61
|
+
* **Documents:** Docling → Enhanced PyMuPDF → Simple extraction fallback
|
|
62
|
+
* **Media:** OpenAI Whisper transcription
|
|
63
|
+
* **Images:** OCR with multiple engine support
|
|
64
|
+
* **📊 Enhanced PDF Processing:** Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
|
|
65
|
+
* **🌍 Multiple Integrations:** CLI, Python library, MCP server, Raycast extension, macOS Services
|
|
66
|
+
* **⚡ Zero-Install Options:** Use `uvx` for instant access without installation
|
|
67
|
+
* **🧠 AI-Powered Processing:** LLM integration for content cleaning and summarization
|
|
68
|
+
* **🔄 Asynchronous:** Built with `asyncio` for efficient processing
|
|
28
69
|
|
|
29
70
|
## Getting Started
|
|
30
71
|
|
|
@@ -33,11 +74,17 @@ The primary goal of Content Core is to simplify the process of ingesting content
|
|
|
33
74
|
Install Content Core using `pip`:
|
|
34
75
|
|
|
35
76
|
```bash
|
|
36
|
-
#
|
|
77
|
+
# Basic installation (PyMuPDF + BeautifulSoup/Jina extraction)
|
|
37
78
|
pip install content-core
|
|
38
79
|
|
|
39
|
-
#
|
|
80
|
+
# With enhanced document processing (adds Docling)
|
|
81
|
+
pip install content-core[docling]
|
|
82
|
+
|
|
83
|
+
# With MCP server support
|
|
40
84
|
pip install content-core[mcp]
|
|
85
|
+
|
|
86
|
+
# Full installation
|
|
87
|
+
pip install content-core[docling,mcp]
|
|
41
88
|
```
|
|
42
89
|
|
|
43
90
|
Alternatively, if you’re developing locally:
|
|
@@ -208,6 +255,49 @@ Add to your `claude_desktop_config.json`:
|
|
|
208
255
|
|
|
209
256
|
For detailed setup instructions, configuration options, and usage examples, see our [MCP Documentation](docs/mcp.md).
|
|
210
257
|
|
|
258
|
+
## Enhanced PDF Processing
|
|
259
|
+
|
|
260
|
+
Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.
|
|
261
|
+
|
|
262
|
+
### Key Improvements
|
|
263
|
+
|
|
264
|
+
- **🔬 Mathematical Formula Extraction**: Enhanced quality flags eliminate `<!-- formula-not-decoded -->` placeholders
|
|
265
|
+
- **📊 Automatic Table Detection**: Tables converted to markdown format for LLM consumption
|
|
266
|
+
- **🔧 Quality Text Rendering**: Better ligature, whitespace, and image-text integration
|
|
267
|
+
- **⚡ Optional OCR Enhancement**: Selective OCR for formula-heavy pages (requires Tesseract)
|
|
268
|
+
|
|
269
|
+
### Configuration for Scientific Documents
|
|
270
|
+
|
|
271
|
+
For documents with heavy mathematical content, enable OCR enhancement:
|
|
272
|
+
|
|
273
|
+
```yaml
|
|
274
|
+
# In cc_config.yaml
|
|
275
|
+
extraction:
|
|
276
|
+
pymupdf:
|
|
277
|
+
enable_formula_ocr: true # Enable OCR for formula-heavy pages
|
|
278
|
+
formula_threshold: 3 # Min formulas per page to trigger OCR
|
|
279
|
+
ocr_fallback: true # Graceful fallback if OCR fails
|
|
280
|
+
```
|
|
281
|
+
|
|
282
|
+
```python
|
|
283
|
+
# Runtime configuration
|
|
284
|
+
from content_core.config import set_pymupdf_ocr_enabled
|
|
285
|
+
set_pymupdf_ocr_enabled(True)
|
|
286
|
+
```
|
|
287
|
+
|
|
288
|
+
### Requirements for OCR Enhancement
|
|
289
|
+
|
|
290
|
+
```bash
|
|
291
|
+
# Install Tesseract OCR (optional, for formula enhancement)
|
|
292
|
+
# macOS
|
|
293
|
+
brew install tesseract
|
|
294
|
+
|
|
295
|
+
# Ubuntu/Debian
|
|
296
|
+
sudo apt-get install tesseract-ocr
|
|
297
|
+
```
|
|
298
|
+
|
|
299
|
+
**Note**: OCR is optional. The improved PDF extraction works automatically without any additional setup.
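Before enabling formula OCR, it can help to confirm that the Tesseract binary is actually reachable. The following is a small standalone check using only the Python standard library; `tesseract_version` is a hypothetical helper name and is not part of Content Core.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if unavailable."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    try:
        proc = subprocess.run(
            [exe, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some Tesseract builds write the version banner to stderr instead of stdout.
    output = (proc.stdout or proc.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "Tesseract not found on PATH")
```

If the function returns `None`, leaving `enable_formula_ocr` off is the safer choice.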
|
|
300
|
+
|
|
211
301
|
## macOS Services Integration
|
|
212
302
|
|
|
213
303
|
Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.
|
|
@@ -251,6 +341,50 @@ Create **4 convenient services** for different workflows:
|
|
|
251
341
|
|
|
252
342
|
For complete setup instructions with copy-paste scripts, see [macOS Services Documentation](docs/macos.md).
|
|
253
343
|
|
|
344
|
+
## Raycast Extension
|
|
345
|
+
|
|
346
|
+
Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.
|
|
347
|
+
|
|
348
|
+
### Quick Setup
|
|
349
|
+
|
|
350
|
+
**From Raycast Store** (coming soon):
|
|
351
|
+
1. Open Raycast and search for "Content Core"
|
|
352
|
+
2. Install the extension by `luis_novo`
|
|
353
|
+
3. Configure API keys in preferences
|
|
354
|
+
|
|
355
|
+
**Manual Installation**:
|
|
356
|
+
1. Download the extension from the repository
|
|
357
|
+
2. Open Raycast → "Import Extension"
|
|
358
|
+
3. Select the `raycast-content-core` folder
|
|
359
|
+
|
|
360
|
+
### Commands
|
|
361
|
+
|
|
362
|
+
**🔍 Extract Content** - Smart URL/file detection with full interface
|
|
363
|
+
- Auto-detects URLs vs file paths in real-time
|
|
364
|
+
- Multiple output formats (Text, JSON, XML)
|
|
365
|
+
- Drag & drop support for files
|
|
366
|
+
- Rich results view with metadata
|
|
367
|
+
|
|
368
|
+
**📝 Summarize Content** - AI-powered summaries with customizable styles
|
|
369
|
+
- 9 different summary styles (bullet points, executive summary, etc.)
|
|
370
|
+
- Auto-detects source type with visual feedback
|
|
371
|
+
- One-click snippet creation and quicklinks
|
|
372
|
+
|
|
373
|
+
**⚡ Quick Extract** - Instant extraction to clipboard
|
|
374
|
+
- Type → Tab → Paste source → Enter
|
|
375
|
+
- No UI, works directly from command bar
|
|
376
|
+
- Perfect for quick workflows
|
|
377
|
+
|
|
378
|
+
### Features
|
|
379
|
+
|
|
380
|
+
- **Smart Auto-Detection**: Instantly recognizes URLs vs file paths
|
|
381
|
+
- **Zero Installation**: Uses `uvx` for Content Core execution
|
|
382
|
+
- **Rich Integration**: Keyboard shortcuts, clipboard actions, Raycast snippets
|
|
383
|
+
- **All File Types**: Documents, videos, audio, images, archives
|
|
384
|
+
- **Visual Feedback**: Real-time type detection with icons
|
|
385
|
+
|
|
386
|
+
For detailed setup, configuration, and usage examples, see [Raycast Extension Documentation](docs/raycast.md).
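The auto-detection described above can be illustrated with a short sketch. The extension itself is written in TypeScript, so this Python version only shows the general idea under stated assumptions: `classify_source` is a hypothetical name, and the rule used here (an `http`/`https` URL first, then an existing path) is a guess at reasonable behavior, not the extension's actual logic.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_source(text: str) -> str:
    """Classify user input as 'url', 'file', or 'unknown'."""
    candidate = text.strip()
    if not candidate:
        return "unknown"
    parsed = urlparse(candidate)
    # Treat only well-formed http(s) addresses as URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Anything else counts as a file only if the path exists on disk.
    if Path(candidate).expanduser().exists():
        return "file"
    return "unknown"
```

Checking the URL form first avoids touching the filesystem for the most common input.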
|
|
387
|
+
|
|
254
388
|
## Using with Langchain
|
|
255
389
|
|
|
256
390
|
For users integrating with the [Langchain](https://python.langchain.com/) framework, `content-core` exposes a set of compatible tools. These tools, located in the `src/content_core/tools` directory, allow you to leverage `content-core` extraction, cleaning, and summarization capabilities directly within your Langchain agents and chains.
|
|
@@ -360,8 +494,21 @@ Example `.env`:
|
|
|
360
494
|
```plaintext
|
|
361
495
|
OPENAI_API_KEY=your-key-here
|
|
362
496
|
GOOGLE_API_KEY=your-key-here
|
|
497
|
+
|
|
498
|
+
# Engine Selection (optional)
|
|
499
|
+
CCORE_DOCUMENT_ENGINE=auto # auto, simple, docling
|
|
500
|
+
CCORE_URL_ENGINE=auto # auto, simple, firecrawl, jina
|
|
363
501
|
```
|
|
364
502
|
|
|
503
|
+
### Engine Selection via Environment Variables
|
|
504
|
+
|
|
505
|
+
For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:
|
|
506
|
+
|
|
507
|
+
- **`CCORE_DOCUMENT_ENGINE`**: Force document engine (`auto`, `simple`, `docling`)
|
|
508
|
+
- **`CCORE_URL_ENGINE`**: Force URL engine (`auto`, `simple`, `firecrawl`, `jina`)
|
|
509
|
+
|
|
510
|
+
These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
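The precedence rule can be sketched as a small resolver. This is an illustration, not the library's code: `resolve_engine` is a hypothetical helper, and both the lowercasing of values and the fall-through on an unrecognized value are assumptions.

```python
import os
from typing import Mapping, Optional, Sequence

DOCUMENT_ENGINES = ("auto", "simple", "docling")
URL_ENGINES = ("auto", "simple", "firecrawl", "jina")


def resolve_engine(
    env_var: str,
    allowed: Sequence[str],
    config_value: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick an engine: environment variable first, then config, then 'auto'."""
    env = os.environ if environ is None else environ
    override = env.get(env_var, "").strip().lower()
    if override in allowed:
        return override
    # Assumption: an unset or unrecognized override falls through to the config.
    if config_value in allowed:
        return config_value
    return "auto"


if __name__ == "__main__":
    print(resolve_engine("CCORE_DOCUMENT_ENGINE", DOCUMENT_ENGINES, "docling"))
    print(resolve_engine("CCORE_URL_ENGINE", URL_ENGINES, "jina"))
```

Passing `environ` explicitly keeps the function easy to test without mutating the real process environment.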
|
|
511
|
+
|
|
365
512
|
### Custom Prompt Templates
|
|
366
513
|
|
|
367
514
|
Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the `PROMPT_PATH` environment variable in your `.env` file or system environment.
|
|
@@ -292,6 +292,34 @@ export GOOGLE_API_KEY="your-google-key"
|
|
|
292
292
|
- **Firecrawl**: Visit [Firecrawl](https://www.firecrawl.dev/) for enhanced web scraping
|
|
293
293
|
- **Jina**: Visit [Jina AI](https://jina.ai/) for alternative web extraction
|
|
294
294
|
|
|
295
|
+
### Engine Selection via Environment Variables
|
|
296
|
+
|
|
297
|
+
For advanced users, you can override the extraction engines:
|
|
298
|
+
|
|
299
|
+
```json
|
|
300
|
+
{
|
|
301
|
+
  "mcpServers": {
|
|
302
|
+
    "content-core": {
|
|
303
|
+
      "env": {
|
|
304
|
+
        "OPENAI_API_KEY": "sk-...",
|
|
305
|
+
        "FIRECRAWL_API_KEY": "fc-...",
|
|
306
|
+
        "CCORE_DOCUMENT_ENGINE": "simple", // Skip docling, use PyMuPDF
|
|
307
|
+
        "CCORE_URL_ENGINE": "auto" // Or firecrawl, jina
|
|
308
|
+
      }
|
|
309
|
+
    }
|
|
310
|
+
  }
|
|
311
|
+
}
|
|
312
|
+
```
|
|
313
|
+
|
|
314
|
+
**Available engines:**
|
|
315
|
+
- **Document**: `auto`, `simple`, `docling` (requires `content-core[docling]`)
|
|
316
|
+
- **URL**: `auto`, `simple`, `firecrawl`, `jina`
|
|
317
|
+
|
|
318
|
+
**Use cases:**
|
|
319
|
+
- Set `CCORE_DOCUMENT_ENGINE=simple` to avoid docling dependency issues
|
|
320
|
+
- Set `CCORE_URL_ENGINE=firecrawl` to always use the paid service for better reliability
|
|
321
|
+
- Set `CCORE_URL_ENGINE=simple` for faster processing without external API calls
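Because strict JSON has no comment syntax, one way to avoid hand-editing mistakes is to generate the `env` block programmatically. The sketch below validates the engine names against the lists above and emits comment-free JSON; `build_mcp_config` is a hypothetical helper, and it produces only the `env` portion of a server entry, so the launch command for the server still has to be added.

```python
import json
from typing import Dict

DOCUMENT_ENGINES = ("auto", "simple", "docling")
URL_ENGINES = ("auto", "simple", "firecrawl", "jina")


def build_mcp_config(
    document_engine: str = "auto",
    url_engine: str = "auto",
    **api_keys: str,
) -> Dict[str, dict]:
    """Build the `mcpServers` fragment with engine overrides and API keys."""
    if document_engine not in DOCUMENT_ENGINES:
        raise ValueError(f"unknown document engine: {document_engine!r}")
    if url_engine not in URL_ENGINES:
        raise ValueError(f"unknown URL engine: {url_engine!r}")
    env = dict(api_keys)
    env["CCORE_DOCUMENT_ENGINE"] = document_engine
    env["CCORE_URL_ENGINE"] = url_engine
    return {"mcpServers": {"content-core": {"env": env}}}


if __name__ == "__main__":
    config = build_mcp_config("simple", "firecrawl", OPENAI_API_KEY="sk-...")
    print(json.dumps(config, indent=2))
```

Rejecting unknown engine names up front surfaces typos before the configuration reaches the MCP client.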
|
|
322
|
+
|
|
295
323
|
### Custom Prompts
|
|
296
324
|
|
|
297
325
|
You can customize Content Core's behavior by setting a custom prompt path:
|