content-core 1.1.2__tar.gz → 1.2.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of content-core might be problematic.

Files changed (91)
  1. content_core-1.2.1/.github/workflows/claude-code-review.yml +77 -0
  2. content_core-1.2.1/.github/workflows/claude.yml +59 -0
  3. {content_core-1.1.2 → content_core-1.2.1}/.gitignore +4 -1
  4. {content_core-1.1.2 → content_core-1.2.1}/PKG-INFO +170 -22
  5. {content_core-1.1.2 → content_core-1.2.1}/README.md +167 -20
  6. {content_core-1.1.2 → content_core-1.2.1}/docs/mcp.md +28 -0
  7. {content_core-1.1.2 → content_core-1.2.1}/docs/processors.md +26 -4
  8. content_core-1.2.1/docs/raycast.md +268 -0
  9. {content_core-1.1.2 → content_core-1.2.1}/docs/usage.md +98 -2
  10. content_core-1.2.1/mcp.md +248 -0
  11. content_core-1.2.1/new_pdf.pdf +0 -0
  12. {content_core-1.1.2 → content_core-1.2.1}/pyproject.toml +2 -2
  13. content_core-1.2.1/raycast-content-core/.eslintrc.json +9 -0
  14. content_core-1.2.1/raycast-content-core/CHANGELOG.md +36 -0
  15. content_core-1.2.1/raycast-content-core/README.md +150 -0
  16. content_core-1.2.1/raycast-content-core/assets/command-icon.png +0 -0
  17. content_core-1.2.1/raycast-content-core/package-lock.json +2094 -0
  18. content_core-1.2.1/raycast-content-core/package.json +124 -0
  19. content_core-1.2.1/raycast-content-core/raycast-env.d.ts +42 -0
  20. content_core-1.2.1/raycast-content-core/src/extract-content.tsx +306 -0
  21. content_core-1.2.1/raycast-content-core/src/quick-extract.tsx +128 -0
  22. content_core-1.2.1/raycast-content-core/src/summarize-content.tsx +420 -0
  23. content_core-1.2.1/raycast-content-core/src/utils/content-core.ts +348 -0
  24. content_core-1.2.1/raycast-content-core/src/utils/types.ts +27 -0
  25. content_core-1.2.1/raycast-content-core/tsconfig.json +30 -0
  26. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/cc_config.yaml +4 -0
  27. content_core-1.2.1/src/content_core/config.py +104 -0
  28. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/extraction/graph.py +33 -21
  29. content_core-1.2.1/src/content_core/notebooks/urls.ipynb +154 -0
  30. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/docling.py +13 -6
  31. content_core-1.2.1/src/content_core/processors/pdf.py +292 -0
  32. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/url.py +3 -2
  33. content_core-1.2.1/test.py +16 -0
  34. content_core-1.2.1/tests/unit/test_config.py +109 -0
  35. {content_core-1.1.2 → content_core-1.2.1}/tests/unit/test_docling.py +4 -1
  36. content_core-1.2.1/tests/unit/test_pymupdf_ocr.py +275 -0
  37. {content_core-1.1.2 → content_core-1.2.1}/uv.lock +2466 -2502
  38. content_core-1.1.2/src/content_core/config.py +0 -49
  39. content_core-1.1.2/src/content_core/processors/pdf.py +0 -168
  40. {content_core-1.1.2 → content_core-1.2.1}/.github/PULL_REQUEST_TEMPLATE.md +0 -0
  41. {content_core-1.1.2 → content_core-1.2.1}/.github/workflows/publish.yml +0 -0
  42. {content_core-1.1.2 → content_core-1.2.1}/.python-version +0 -0
  43. {content_core-1.1.2 → content_core-1.2.1}/CONTRIBUTING.md +0 -0
  44. {content_core-1.1.2 → content_core-1.2.1}/LICENSE +0 -0
  45. {content_core-1.1.2 → content_core-1.2.1}/Makefile +0 -0
  46. {content_core-1.1.2 → content_core-1.2.1}/docs/macos.md +0 -0
  47. {content_core-1.1.2 → content_core-1.2.1}/prompts/content/cleanup.jinja +0 -0
  48. {content_core-1.1.2 → content_core-1.2.1}/prompts/content/summarize.jinja +0 -0
  49. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/__init__.py +0 -0
  50. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/common/__init__.py +0 -0
  51. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/common/exceptions.py +0 -0
  52. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/common/state.py +0 -0
  53. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/common/types.py +0 -0
  54. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/common/utils.py +0 -0
  55. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/__init__.py +0 -0
  56. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/cleanup/__init__.py +0 -0
  57. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/cleanup/core.py +0 -0
  58. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/extraction/__init__.py +0 -0
  59. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/identification/__init__.py +0 -0
  60. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/summary/__init__.py +0 -0
  61. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/content/summary/core.py +0 -0
  62. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/logging.py +0 -0
  63. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/mcp/__init__.py +0 -0
  64. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/mcp/server.py +0 -0
  65. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/models.py +0 -0
  66. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/models_config.yaml +0 -0
  67. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/notebooks/run.ipynb +0 -0
  68. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/audio.py +0 -0
  69. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/office.py +0 -0
  70. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/text.py +0 -0
  71. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/video.py +0 -0
  72. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/processors/youtube.py +0 -0
  73. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/py.typed +0 -0
  74. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/templated_message.py +0 -0
  75. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/tools/__init__.py +0 -0
  76. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/tools/cleanup.py +0 -0
  77. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/tools/extract.py +0 -0
  78. {content_core-1.1.2 → content_core-1.2.1}/src/content_core/tools/summarize.py +0 -0
  79. {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.docx +0 -0
  80. {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.epub +0 -0
  81. {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.md +0 -0
  82. {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.mp3 +0 -0
  83. {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.mp4 +0 -0
  84. {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.pdf +0 -0
  85. {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.pptx +0 -0
  86. {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.txt +0 -0
  87. {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file.xlsx +0 -0
  88. {content_core-1.1.2 → content_core-1.2.1}/tests/input_content/file_audio.mp3 +0 -0
  89. {content_core-1.1.2 → content_core-1.2.1}/tests/integration/test_cli.py +0 -0
  90. {content_core-1.1.2 → content_core-1.2.1}/tests/integration/test_extraction.py +0 -0
  91. {content_core-1.1.2 → content_core-1.2.1}/tests/unit/test_mcp_server.py +0 -0
@@ -0,0 +1,77 @@
1
+ name: Claude Code Review
2
+
3
+ on:
4
+ pull_request:
5
+ types: [opened, synchronize]
6
+ # Optional: Only run on specific file changes
7
+ # paths:
8
+ # - "src/**/*.ts"
9
+ # - "src/**/*.tsx"
10
+ # - "src/**/*.js"
11
+ # - "src/**/*.jsx"
12
+
13
+ jobs:
14
+ claude-review:
15
+ # Optional: Filter by PR author
16
+ # if: |
17
+ # github.event.pull_request.user.login == 'external-contributor' ||
18
+ # github.event.pull_request.user.login == 'new-developer' ||
19
+ # github.event.pull_request.author_association == 'FIRST_TIME_CONTRIBUTOR'
20
+
21
+ runs-on: ubuntu-latest
22
+ permissions:
23
+ contents: read
24
+ pull-requests: read
25
+ issues: read
26
+ id-token: write
27
+
28
+ steps:
29
+ - name: Checkout repository
30
+ uses: actions/checkout@v4
31
+ with:
32
+ fetch-depth: 1
33
+
34
+ - name: Run Claude Code Review
35
+ id: claude-review
36
+ uses: anthropics/claude-code-action@beta
37
+ with:
38
+ anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
39
+
40
+ # Optional: Specify model (defaults to Claude Sonnet 4, uncomment for Claude Opus 4)
41
+ # model: "claude-opus-4-20250514"
42
+
43
+ # Direct prompt for automated review (no @claude mention needed)
44
+ direct_prompt: |
45
+ Please review this pull request (and any added updates) and provide feedback on:
46
+ - Code quality and best practices
47
+ - Potential bugs or issues
48
+ - Performance considerations
49
+ - Documentation updates: does the README.md and docs/ reflect what changed?
50
+ - Security concerns
51
+ - Test coverage
52
+
53
+ If the pull request received updates after your first review, please redo the review based on the new code
54
+ Be constructive and helpful in your feedback.
55
+
56
+ # Optional: Customize review based on file types
57
+ # direct_prompt: |
58
+ # Review this PR focusing on:
59
+ # - For TypeScript files: Type safety and proper interface usage
60
+ # - For API endpoints: Security, input validation, and error handling
61
+ # - For React components: Performance, accessibility, and best practices
62
+ # - For tests: Coverage, edge cases, and test quality
63
+
64
+ # Optional: Different prompts for different authors
65
+ # direct_prompt: |
66
+ # ${{ github.event.pull_request.author_association == 'FIRST_TIME_CONTRIBUTOR' &&
67
+ # 'Welcome! Please review this PR from a first-time contributor. Be encouraging and provide detailed explanations for any suggestions.' ||
68
+ # 'Please provide a thorough code review focusing on our coding standards and best practices.' }}
69
+
70
+ # Optional: Add specific tools for running tests or linting
71
+ # allowed_tools: "Bash(npm run test),Bash(npm run lint),Bash(npm run typecheck)"
72
+
73
+ # Optional: Skip review for certain conditions
74
+ # if: |
75
+ # !contains(github.event.pull_request.title, '[skip-review]') &&
76
+ # !contains(github.event.pull_request.title, '[WIP]')
77
+
@@ -0,0 +1,59 @@
1
+ name: Claude Code
2
+
3
+ on:
4
+ issue_comment:
5
+ types: [created]
6
+ pull_request_review_comment:
7
+ types: [created]
8
+ issues:
9
+ types: [opened, assigned]
10
+ pull_request_review:
11
+ types: [submitted]
12
+
13
+ jobs:
14
+ claude:
15
+ if: |
16
+ (github.event_name == 'issue_comment' && contains(github.event.comment.body, '@claude')) ||
17
+ (github.event_name == 'pull_request_review_comment' && contains(github.event.comment.body, '@claude')) ||
18
+ (github.event_name == 'pull_request_review' && contains(github.event.review.body, '@claude')) ||
19
+ (github.event_name == 'issues' && (contains(github.event.issue.body, '@claude') || contains(github.event.issue.title, '@claude')))
20
+ runs-on: ubuntu-latest
21
+ permissions:
22
+ contents: read
23
+ pull-requests: read
24
+ issues: read
25
+ id-token: write
26
+ steps:
27
+ - name: Checkout repository
28
+ uses: actions/checkout@v4
29
+ with:
30
+ fetch-depth: 1
31
+
32
+ - name: Run Claude Code
33
+ id: claude
34
+ uses: anthropics/claude-code-action@beta
35
+ with:
36
+ anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
37
+
38
+ # Optional: Specify model (defaults to Claude Sonnet 4, uncomment for Claude Opus 4)
39
+ # model: "claude-opus-4-20250514"
40
+
41
+ # Optional: Customize the trigger phrase (default: @claude)
42
+ # trigger_phrase: "/claude"
43
+
44
+ # Optional: Trigger when specific user is assigned to an issue
45
+ # assignee_trigger: "claude-bot"
46
+
47
+ # Optional: Allow Claude to run specific commands
48
+ # allowed_tools: "Bash(npm install),Bash(npm run build),Bash(npm run test:*),Bash(npm run lint:*)"
49
+
50
+ # Optional: Add custom instructions for Claude to customize its behavior for your project
51
+ # custom_instructions: |
52
+ # Follow our coding standards
53
+ # Ensure all new code has tests
54
+ # Use TypeScript for new files
55
+
56
+ # Optional: Custom environment variables for Claude
57
+ # claude_env: |
58
+ # NODE_ENV: test
59
+
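Reviewer note: the job-level `if:` expression in `claude.yml` above can be read as a small predicate over the webhook payload. The sketch below restates it in Python so the trigger rules are easier to check by eye. The function name `should_run_claude` and the dict-shaped `event` argument are illustrative assumptions, not part of the workflow or of any GitHub library. The case-insensitive match mirrors the documented behaviour of the `contains()` expression function.

```python
def should_run_claude(event_name: str, event: dict, phrase: str = "@claude") -> bool:
    """Hypothetical restatement of the `if:` condition in claude.yml."""

    def mentions(text) -> bool:
        # Treat a missing or null field as an empty string, then match
        # without regard to case, as the workflow's contains() does.
        return phrase.lower() in (text or "").lower()

    if event_name in ("issue_comment", "pull_request_review_comment"):
        return mentions(event.get("comment", {}).get("body"))
    if event_name == "pull_request_review":
        return mentions(event.get("review", {}).get("body"))
    if event_name == "issues":
        issue = event.get("issue", {})
        return mentions(issue.get("body")) or mentions(issue.get("title"))
    # Any other event type never starts the job.
    return False
```

Read this way, an issue is the only event where the title is also inspected; comments and reviews are matched on their body alone.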
@@ -22,4 +22,7 @@ WIP/
22
22
 
23
23
  *.ignore
24
24
  .windsurfrules
25
- CLAUDE.md
25
+ CLAUDE.md
26
+
27
+ node_modules/
28
+ **/notebooks/private
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: content-core
3
- Version: 1.1.2
3
+ Version: 1.2.1
4
4
  Summary: Extract what matters from any media source. Available as Python Library, macOS Service, CLI and MCP Server
5
5
  Author-email: LUIS NOVO <lfnovo@gmail.com>
6
6
  License-File: LICENSE
@@ -10,7 +10,6 @@ Requires-Dist: aiohttp>=3.11
10
10
  Requires-Dist: asciidoc>=10.2.1
11
11
  Requires-Dist: bs4>=0.0.2
12
12
  Requires-Dist: dicttoxml>=1.7.16
13
- Requires-Dist: docling>=2.34.0
14
13
  Requires-Dist: esperanto>=1.2.0
15
14
  Requires-Dist: firecrawl-py>=2.7.0
16
15
  Requires-Dist: jinja2>=3.1.6
@@ -31,6 +30,8 @@ Requires-Dist: pytubefix>=9.1.1
31
30
  Requires-Dist: readability-lxml>=0.8.4.1
32
31
  Requires-Dist: validators>=0.34.0
33
32
  Requires-Dist: youtube-transcript-api>=1.0.3
33
+ Provides-Extra: docling
34
+ Requires-Dist: docling>=2.34.0; extra == 'docling'
34
35
  Provides-Extra: mcp
35
36
  Requires-Dist: fastmcp>=0.5.0; extra == 'mcp'
36
37
  Description-Content-Type: text/markdown
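Reviewer note: the metadata hunks above move `docling` from a hard requirement to the `docling` extra, so code that relies on it may now need to cope with the package being absent. A minimal sketch of that pattern follows. `docling_available` and `pick_document_engine` are hypothetical helpers written for this note, not functions from content-core, and the two-step choice is a simplification of the fallback order the README describes. Only the engine names `auto`, `simple`, and `docling` come from the diff.

```python
import importlib.util


def docling_available() -> bool:
    # find_spec returns None when the top-level package cannot be imported.
    return importlib.util.find_spec("docling") is not None


def pick_document_engine(requested: str = "auto", available=None) -> str:
    """Hypothetical helper: choose a document engine given optional Docling."""
    if available is None:
        available = docling_available()
    if requested == "docling" and not available:
        # Point the user at the extra declared in the new metadata.
        raise ImportError(
            "Docling engine requested; install with: pip install content-core[docling]"
        )
    if requested == "auto":
        return "docling" if available else "simple"
    return requested
```

The `available` parameter exists only so the decision can be exercised without installing or uninstalling Docling.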
@@ -39,29 +40,70 @@ Description-Content-Type: text/markdown
39
40
 
40
41
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
41
42
 
42
- **Content Core** is a versatile Python library designed to extract and process content from various sources, providing a unified interface for handling text, web pages, and local files.
43
+ **Content Core** is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries—all through a unified interface with multiple integration options.
43
44
 
44
- ## Overview
45
+ ## 🚀 What You Can Do
45
46
 
46
- > **Note:** As of v0.8, the default extraction engine is `'auto'`. Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to simple extraction. You can override the engine if needed, but `'auto'` is recommended for most users.
47
+ **Extract content from anywhere:**
48
+ - 📄 **Documents** - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
49
+ - 🎥 **Media** - Videos (MP4, AVI, MOV) with automatic transcription
50
+ - 🎵 **Audio** - MP3, WAV, M4A with speech-to-text conversion
51
+ - 🌐 **Web** - Any URL with intelligent content extraction
52
+ - 🖼️ **Images** - JPG, PNG, TIFF with OCR text recognition
53
+ - 📦 **Archives** - ZIP, TAR, GZ with content analysis
47
54
 
48
- The primary goal of Content Core is to simplify the process of ingesting content from diverse origins. Whether you have raw text, a URL pointing to an article, or a local file like a video or markdown document, Content Core aims to extract the meaningful content for further use.
55
+ **Process with AI:**
56
+ - ✨ **Clean & format** extracted content automatically
57
+ - 📝 **Generate summaries** with customizable styles (bullet points, executive summary, etc.)
58
+ - 🎯 **Context-aware processing** - explain to a child, technical summary, action items
59
+ - 🔄 **Smart engine selection** - automatically chooses the best extraction method
49
60
 
50
- ## Key Features
61
+ ## 🛠️ Multiple Ways to Use
51
62
 
52
- * **Multi-Source Extraction:** Handles content from:
53
- * Direct text strings.
54
- * Web URLs (using robust extraction methods).
55
- * Local files (including automatic transcription for video/audio files and parsing for text-based formats).
56
- * **Intelligent Processing:** Applies appropriate extraction techniques based on the source type. See the [Processors Documentation](./docs/processors.md) for detailed information on how different content types are handled.
57
- * **Smart Engine Selection:** By default, Content Core uses the `'auto'` engine, which:
58
- * For URLs: Uses Firecrawl if `FIRECRAWL_API_KEY` is set, else tries Jina. Jina might fail because of rate limits, which can be fixed by adding `JINA_API_KEY`. If Jina failes, BeautifulSoup is used as a fallback.
59
- * For files: Tries Docling extraction first (for robust document parsing), then falls back to simple extraction if needed.
60
- * You can override this by specifying an engine, but `'auto'` is recommended for most users.
61
- * **Content Cleaning (Optional):** Likely integrates with LLMs (via `prompter.py` and Jinja templates) to refine and clean the extracted content.
62
- * **MCP Server:** Includes a Model Context Protocol (MCP) server for seamless integration with Claude Desktop and other MCP-compatible applications.
63
- * **macOS Services:** Right-click context menu integration for Finder (extract and summarize files directly).
64
- * **Asynchronous:** Built with `asyncio` for efficient I/O operations.
63
+ ### 🖥️ Command Line (Zero Install)
64
+ ```bash
65
+ # Extract content from any source
66
+ uvx --from "content-core" ccore https://example.com
67
+ uvx --from "content-core" ccore document.pdf
68
+
69
+ # Generate AI summaries
70
+ uvx --from "content-core" csum video.mp4 --context "bullet points"
71
+ ```
72
+
73
+ ### 🤖 Claude Desktop Integration
74
+ One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.
75
+
76
+ ### 🔍 Raycast Extension
77
+ Smart auto-detection commands:
78
+ - **Extract Content** - Full interface with format options
79
+ - **Summarize Content** - 9 summary styles available
80
+ - **Quick Extract** - Instant clipboard extraction
81
+
82
+ ### 🖱️ macOS Right-Click Integration
83
+ Right-click any file in Finder → Services → Extract or Summarize content instantly.
84
+
85
+ ### 🐍 Python Library
86
+ ```python
87
+ import content_core as cc
88
+
89
+ # Extract from any source
90
+ result = await cc.extract("https://example.com/article")
91
+ summary = await cc.summarize_content(result, context="explain to a child")
92
+ ```
93
+
94
+ ## ⚡ Key Features
95
+
96
+ * **🎯 Intelligent Auto-Detection:** Automatically selects the best extraction method based on content type and available services
97
+ * **🔧 Smart Engine Selection:**
98
+ * **URLs:** Firecrawl → Jina → BeautifulSoup fallback chain
99
+ * **Documents:** Docling → Enhanced PyMuPDF → Simple extraction fallback
100
+ * **Media:** OpenAI Whisper transcription
101
+ * **Images:** OCR with multiple engine support
102
+ * **📊 Enhanced PDF Processing:** Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
103
+ * **🌍 Multiple Integrations:** CLI, Python library, MCP server, Raycast extension, macOS Services
104
+ * **⚡ Zero-Install Options:** Use `uvx` for instant access without installation
105
+ * **🧠 AI-Powered Processing:** LLM integration for content cleaning and summarization
106
+ * **🔄 Asynchronous:** Built with `asyncio` for efficient processing
65
107
 
66
108
  ## Getting Started
67
109
 
@@ -70,11 +112,17 @@ The primary goal of Content Core is to simplify the process of ingesting content
70
112
  Install Content Core using `pip`:
71
113
 
72
114
  ```bash
73
- # Install the package
115
+ # Basic installation (PyMuPDF + BeautifulSoup/Jina extraction)
74
116
  pip install content-core
75
117
 
76
- # Install with MCP server support
118
+ # With enhanced document processing (adds Docling)
119
+ pip install content-core[docling]
120
+
121
+ # With MCP server support
77
122
  pip install content-core[mcp]
123
+
124
+ # Full installation
125
+ pip install content-core[docling,mcp]
78
126
  ```
79
127
 
80
128
  Alternatively, if you’re developing locally:
@@ -245,6 +293,49 @@ Add to your `claude_desktop_config.json`:
245
293
 
246
294
  For detailed setup instructions, configuration options, and usage examples, see our [MCP Documentation](docs/mcp.md).
247
295
 
296
+ ## Enhanced PDF Processing
297
+
298
+ Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.
299
+
300
+ ### Key Improvements
301
+
302
+ - **🔬 Mathematical Formula Extraction**: Enhanced quality flags eliminate `<!-- formula-not-decoded -->` placeholders
303
+ - **📊 Automatic Table Detection**: Tables converted to markdown format for LLM consumption
304
+ - **🔧 Quality Text Rendering**: Better ligature, whitespace, and image-text integration
305
+ - **⚡ Optional OCR Enhancement**: Selective OCR for formula-heavy pages (requires Tesseract)
306
+
307
+ ### Configuration for Scientific Documents
308
+
309
+ For documents with heavy mathematical content, enable OCR enhancement:
310
+
311
+ ```yaml
312
+ # In cc_config.yaml
313
+ extraction:
314
+ pymupdf:
315
+ enable_formula_ocr: true # Enable OCR for formula-heavy pages
316
+ formula_threshold: 3 # Min formulas per page to trigger OCR
317
+ ocr_fallback: true # Graceful fallback if OCR fails
318
+ ```
319
+
320
+ ```python
321
+ # Runtime configuration
322
+ from content_core.config import set_pymupdf_ocr_enabled
323
+ set_pymupdf_ocr_enabled(True)
324
+ ```
325
+
326
+ ### Requirements for OCR Enhancement
327
+
328
+ ```bash
329
+ # Install Tesseract OCR (optional, for formula enhancement)
330
+ # macOS
331
+ brew install tesseract
332
+
333
+ # Ubuntu/Debian
334
+ sudo apt-get install tesseract-ocr
335
+ ```
336
+
337
+ **Note**: OCR is optional - you get improved PDF extraction automatically without any additional setup.
338
+
248
339
  ## macOS Services Integration
249
340
 
250
341
  Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.
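Reviewer note: the Enhanced PDF Processing section added in this hunk describes `enable_formula_ocr` and `formula_threshold` but does not show how a page qualifies for OCR. One plausible reading is sketched below, assuming a page's formula count is the number of `<!-- formula-not-decoded -->` placeholders left in its extracted text. That counting rule and the function `pages_needing_ocr` are assumptions made for illustration; the actual logic lives in `processors/pdf.py` and may differ.

```python
FORMULA_PLACEHOLDER = "<!-- formula-not-decoded -->"


def pages_needing_ocr(page_texts, enable_formula_ocr=False, formula_threshold=3):
    """Hypothetical selection step: return indexes of formula-heavy pages."""
    if not enable_formula_ocr:
        # With the flag off, no page is sent to OCR.
        return []
    return [
        index
        for index, text in enumerate(page_texts)
        if text.count(FORMULA_PLACEHOLDER) >= formula_threshold
    ]
```

Under this reading, raising `formula_threshold` reduces how many pages are re-processed, which matters because OCR needs Tesseract and costs extra time per page.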
@@ -288,6 +379,50 @@ Create **4 convenient services** for different workflows:
288
379
 
289
380
  For complete setup instructions with copy-paste scripts, see [macOS Services Documentation](docs/macos.md).
290
381
 
382
+ ## Raycast Extension
383
+
384
+ Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.
385
+
386
+ ### Quick Setup
387
+
388
+ **From Raycast Store** (coming soon):
389
+ 1. Open Raycast and search for "Content Core"
390
+ 2. Install the extension by `luis_novo`
391
+ 3. Configure API keys in preferences
392
+
393
+ **Manual Installation**:
394
+ 1. Download the extension from the repository
395
+ 2. Open Raycast → "Import Extension"
396
+ 3. Select the `raycast-content-core` folder
397
+
398
+ ### Commands
399
+
400
+ **🔍 Extract Content** - Smart URL/file detection with full interface
401
+ - Auto-detects URLs vs file paths in real-time
402
+ - Multiple output formats (Text, JSON, XML)
403
+ - Drag & drop support for files
404
+ - Rich results view with metadata
405
+
406
+ **📝 Summarize Content** - AI-powered summaries with customizable styles
407
+ - 9 different summary styles (bullet points, executive summary, etc.)
408
+ - Auto-detects source type with visual feedback
409
+ - One-click snippet creation and quicklinks
410
+
411
+ **⚡ Quick Extract** - Instant extraction to clipboard
412
+ - Type → Tab → Paste source → Enter
413
+ - No UI, works directly from command bar
414
+ - Perfect for quick workflows
415
+
416
+ ### Features
417
+
418
+ - **Smart Auto-Detection**: Instantly recognizes URLs vs file paths
419
+ - **Zero Installation**: Uses `uvx` for Content Core execution
420
+ - **Rich Integration**: Keyboard shortcuts, clipboard actions, Raycast snippets
421
+ - **All File Types**: Documents, videos, audio, images, archives
422
+ - **Visual Feedback**: Real-time type detection with icons
423
+
424
+ For detailed setup, configuration, and usage examples, see [Raycast Extension Documentation](docs/raycast.md).
425
+
291
426
  ## Using with Langchain
292
427
 
293
428
  For users integrating with the [Langchain](https://python.langchain.com/) framework, `content-core` exposes a set of compatible tools. These tools, located in the `src/content_core/tools` directory, allow you to leverage `content-core` extraction, cleaning, and summarization capabilities directly within your Langchain agents and chains.
@@ -397,8 +532,21 @@ Example `.env`:
397
532
  ```plaintext
398
533
  OPENAI_API_KEY=your-key-here
399
534
  GOOGLE_API_KEY=your-key-here
535
+
536
+ # Engine Selection (optional)
537
+ CCORE_DOCUMENT_ENGINE=auto # auto, simple, docling
538
+ CCORE_URL_ENGINE=auto # auto, simple, firecrawl, jina
400
539
  ```
401
540
 
541
+ ### Engine Selection via Environment Variables
542
+
543
+ For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:
544
+
545
+ - **`CCORE_DOCUMENT_ENGINE`**: Force document engine (`auto`, `simple`, `docling`)
546
+ - **`CCORE_URL_ENGINE`**: Force URL engine (`auto`, `simple`, `firecrawl`, `jina`)
547
+
548
+ These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
549
+
402
550
  ### Custom Prompt Templates
403
551
 
404
552
  Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the `PROMPT_PATH` environment variable in your `.env` file or system environment.
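Reviewer note: the "Engine Selection via Environment Variables" text added in this hunk states that `CCORE_DOCUMENT_ENGINE` and `CCORE_URL_ENGINE` take precedence over config file settings. A minimal sketch of that precedence rule follows. The variable names and allowed values are taken from the diff; the helper `resolve_engine`, and the choice to ignore an unrecognised value rather than raise, are assumptions and not confirmed behaviour of `config.py`.

```python
import os

# Allowed values as listed in the README hunk.
ALLOWED_ENGINES = {
    "CCORE_DOCUMENT_ENGINE": {"auto", "simple", "docling"},
    "CCORE_URL_ENGINE": {"auto", "simple", "firecrawl", "jina"},
}


def resolve_engine(var: str, config_value: str, environ=None) -> str:
    """Hypothetical helper: an environment override wins over the config value."""
    env = os.environ if environ is None else environ
    value = env.get(var, "").strip().lower()
    if value in ALLOWED_ENGINES[var]:
        return value
    # Unset or unrecognised: fall back to the config file setting (assumed).
    return config_value
```

For example, `resolve_engine("CCORE_URL_ENGINE", "auto", {"CCORE_URL_ENGINE": "jina"})` yields `"jina"`, while an empty environment leaves the configured `"auto"` in place.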
@@ -2,29 +2,70 @@
2
2
 
3
3
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
4
4
 
5
- **Content Core** is a versatile Python library designed to extract and process content from various sources, providing a unified interface for handling text, web pages, and local files.
5
+ **Content Core** is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries—all through a unified interface with multiple integration options.
6
6
 
7
- ## Overview
7
+ ## 🚀 What You Can Do
8
8
 
9
- > **Note:** As of v0.8, the default extraction engine is `'auto'`. Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to simple extraction. You can override the engine if needed, but `'auto'` is recommended for most users.
9
+ **Extract content from anywhere:**
10
+ - 📄 **Documents** - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
11
+ - 🎥 **Media** - Videos (MP4, AVI, MOV) with automatic transcription
12
+ - 🎵 **Audio** - MP3, WAV, M4A with speech-to-text conversion
13
+ - 🌐 **Web** - Any URL with intelligent content extraction
14
+ - 🖼️ **Images** - JPG, PNG, TIFF with OCR text recognition
15
+ - 📦 **Archives** - ZIP, TAR, GZ with content analysis
10
16
 
11
- The primary goal of Content Core is to simplify the process of ingesting content from diverse origins. Whether you have raw text, a URL pointing to an article, or a local file like a video or markdown document, Content Core aims to extract the meaningful content for further use.
17
+ **Process with AI:**
18
+ - ✨ **Clean & format** extracted content automatically
19
+ - 📝 **Generate summaries** with customizable styles (bullet points, executive summary, etc.)
20
+ - 🎯 **Context-aware processing** - explain to a child, technical summary, action items
21
+ - 🔄 **Smart engine selection** - automatically chooses the best extraction method
12
22
 
13
- ## Key Features
23
+ ## 🛠️ Multiple Ways to Use
14
24
 
15
- * **Multi-Source Extraction:** Handles content from:
16
- * Direct text strings.
17
- * Web URLs (using robust extraction methods).
18
- * Local files (including automatic transcription for video/audio files and parsing for text-based formats).
19
- * **Intelligent Processing:** Applies appropriate extraction techniques based on the source type. See the [Processors Documentation](./docs/processors.md) for detailed information on how different content types are handled.
20
- * **Smart Engine Selection:** By default, Content Core uses the `'auto'` engine, which:
21
- * For URLs: Uses Firecrawl if `FIRECRAWL_API_KEY` is set, else tries Jina. Jina might fail because of rate limits, which can be fixed by adding `JINA_API_KEY`. If Jina failes, BeautifulSoup is used as a fallback.
22
- * For files: Tries Docling extraction first (for robust document parsing), then falls back to simple extraction if needed.
23
- * You can override this by specifying an engine, but `'auto'` is recommended for most users.
24
- * **Content Cleaning (Optional):** Likely integrates with LLMs (via `prompter.py` and Jinja templates) to refine and clean the extracted content.
25
- * **MCP Server:** Includes a Model Context Protocol (MCP) server for seamless integration with Claude Desktop and other MCP-compatible applications.
26
- * **macOS Services:** Right-click context menu integration for Finder (extract and summarize files directly).
27
- * **Asynchronous:** Built with `asyncio` for efficient I/O operations.
25
+ ### 🖥️ Command Line (Zero Install)
26
+ ```bash
27
+ # Extract content from any source
28
+ uvx --from "content-core" ccore https://example.com
29
+ uvx --from "content-core" ccore document.pdf
30
+
31
+ # Generate AI summaries
32
+ uvx --from "content-core" csum video.mp4 --context "bullet points"
33
+ ```
34
+
35
+ ### 🤖 Claude Desktop Integration
36
+ One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.
37
+
38
+ ### 🔍 Raycast Extension
39
+ Smart auto-detection commands:
40
+ - **Extract Content** - Full interface with format options
41
+ - **Summarize Content** - 9 summary styles available
42
+ - **Quick Extract** - Instant clipboard extraction
43
+
44
+ ### 🖱️ macOS Right-Click Integration
45
+ Right-click any file in Finder → Services → Extract or Summarize content instantly.
46
+
47
+ ### 🐍 Python Library
48
+ ```python
49
+ import content_core as cc
50
+
51
+ # Extract from any source
52
+ result = await cc.extract("https://example.com/article")
53
+ summary = await cc.summarize_content(result, context="explain to a child")
54
+ ```
55
+
56
+ ## ⚡ Key Features
57
+
58
+ * **🎯 Intelligent Auto-Detection:** Automatically selects the best extraction method based on content type and available services
59
+ * **🔧 Smart Engine Selection:**
60
+ * **URLs:** Firecrawl → Jina → BeautifulSoup fallback chain
61
+ * **Documents:** Docling → Enhanced PyMuPDF → Simple extraction fallback
62
+ * **Media:** OpenAI Whisper transcription
63
+ * **Images:** OCR with multiple engine support
64
+ * **📊 Enhanced PDF Processing:** Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
65
+ * **🌍 Multiple Integrations:** CLI, Python library, MCP server, Raycast extension, macOS Services
66
+ * **⚡ Zero-Install Options:** Use `uvx` for instant access without installation
67
+ * **🧠 AI-Powered Processing:** LLM integration for content cleaning and summarization
68
+ * **🔄 Asynchronous:** Built with `asyncio` for efficient processing
28
69
 
29
70
  ## Getting Started
30
71
 
@@ -33,11 +74,17 @@ The primary goal of Content Core is to simplify the process of ingesting content
33
74
  Install Content Core using `pip`:
34
75
 
35
76
  ```bash
36
- # Install the package
77
+ # Basic installation (PyMuPDF + BeautifulSoup/Jina extraction)
37
78
  pip install content-core
38
79
 
39
- # Install with MCP server support
80
+ # With enhanced document processing (adds Docling)
81
+ pip install "content-core[docling]"
82
+
83
+ # With MCP server support
40
84
  pip install "content-core[mcp]"
85
+
86
+ # Full installation
87
+ pip install "content-core[docling,mcp]"
41
88
  ```
42
89
 
43
90
  Alternatively, if you’re developing locally:
@@ -208,6 +255,49 @@ Add to your `claude_desktop_config.json`:
208
255
 
209
256
  For detailed setup instructions, configuration options, and usage examples, see our [MCP Documentation](docs/mcp.md).
210
257
 
258
+ ## Enhanced PDF Processing
259
+
260
+ Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.
261
+
262
+ ### Key Improvements
263
+
264
+ - **🔬 Mathematical Formula Extraction**: Enhanced quality flags eliminate `<!-- formula-not-decoded -->` placeholders
265
+ - **📊 Automatic Table Detection**: Tables converted to markdown format for LLM consumption
266
+ - **🔧 Quality Text Rendering**: Better ligature, whitespace, and image-text integration
267
+ - **⚡ Optional OCR Enhancement**: Selective OCR for formula-heavy pages (requires Tesseract)
268
+
269
+ ### Configuration for Scientific Documents
270
+
271
+ For documents with heavy mathematical content, enable OCR enhancement:
272
+
273
+ ```yaml
274
+ # In cc_config.yaml
275
+ extraction:
276
+ pymupdf:
277
+ enable_formula_ocr: true # Enable OCR for formula-heavy pages
278
+ formula_threshold: 3 # Min formulas per page to trigger OCR
279
+ ocr_fallback: true # Graceful fallback if OCR fails
280
+ ```
281
+
282
+ ```python
283
+ # Runtime configuration
284
+ from content_core.config import set_pymupdf_ocr_enabled
285
+ set_pymupdf_ocr_enabled(True)
286
+ ```
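
To make the `enable_formula_ocr` and `formula_threshold` settings concrete, here is a minimal sketch of how such a threshold could select pages for OCR. It assumes the per-page formula counts are already available and that the threshold is inclusive ("min formulas per page"); `pages_needing_ocr` is a hypothetical helper, and the library's real detection logic may differ.

```python
def pages_needing_ocr(formula_counts, enable_formula_ocr=True, formula_threshold=3):
    """Return indices of pages whose formula count meets the threshold.

    `formula_counts` holds one integer per page (how formulas are counted
    is outside the scope of this sketch).
    """
    if not enable_formula_ocr:
        return []  # OCR enhancement disabled: no page is selected
    return [
        page
        for page, count in enumerate(formula_counts)
        if count >= formula_threshold
    ]


selected = pages_needing_ocr([0, 3, 5, 2])
```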
287
+
288
+ ### Requirements for OCR Enhancement
289
+
290
+ ```bash
291
+ # Install Tesseract OCR (optional, for formula enhancement)
292
+ # macOS
293
+ brew install tesseract
294
+
295
+ # Ubuntu/Debian
296
+ sudo apt-get install tesseract-ocr
297
+ ```
298
+
299
+ **Note**: OCR is optional - you get improved PDF extraction automatically without any additional setup.
300
+
211
301
  ## macOS Services Integration
212
302
 
213
303
  Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.
@@ -251,6 +341,50 @@ Create **4 convenient services** for different workflows:
251
341
 
252
342
  For complete setup instructions with copy-paste scripts, see [macOS Services Documentation](docs/macos.md).
253
343
 
344
+ ## Raycast Extension
345
+
346
+ Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.
347
+
348
+ ### Quick Setup
349
+
350
+ **From Raycast Store** (coming soon):
351
+ 1. Open Raycast and search for "Content Core"
352
+ 2. Install the extension by `luis_novo`
353
+ 3. Configure API keys in preferences
354
+
355
+ **Manual Installation**:
356
+ 1. Download the extension from the repository
357
+ 2. Open Raycast → "Import Extension"
358
+ 3. Select the `raycast-content-core` folder
359
+
360
+ ### Commands
361
+
362
+ **🔍 Extract Content** - Smart URL/file detection with full interface
363
+ - Auto-detects URLs vs file paths in real-time
364
+ - Multiple output formats (Text, JSON, XML)
365
+ - Drag & drop support for files
366
+ - Rich results view with metadata
367
+
368
+ **📝 Summarize Content** - AI-powered summaries with customizable styles
369
+ - 9 different summary styles (bullet points, executive summary, etc.)
370
+ - Auto-detects source type with visual feedback
371
+ - One-click snippet creation and quicklinks
372
+
373
+ **⚡ Quick Extract** - Instant extraction to clipboard
374
+ - Type → Tab → Paste source → Enter
375
+ - No UI, works directly from command bar
376
+ - Perfect for quick workflows
377
+
378
+ ### Features
379
+
380
+ - **Smart Auto-Detection**: Instantly recognizes URLs vs file paths
381
+ - **Zero Installation**: Uses `uvx` for Content Core execution
382
+ - **Rich Integration**: Keyboard shortcuts, clipboard actions, Raycast snippets
383
+ - **All File Types**: Documents, videos, audio, images, archives
384
+ - **Visual Feedback**: Real-time type detection with icons
385
+
386
+ For detailed setup, configuration, and usage examples, see [Raycast Extension Documentation](docs/raycast.md).
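
The URL-versus-file-path detection described above can be sketched in a few lines. The extension itself is written in TypeScript; this Python version is only an illustration of one plausible rule (treat `http`/`https` strings with a host as URLs, everything else as a file path), and `detect_source_type` is a hypothetical name.

```python
from urllib.parse import urlparse


def detect_source_type(text):
    """Classify user input as "url" or "file" (an assumed, simplified rule)."""
    candidate = text.strip()
    parsed = urlparse(candidate)
    # Require both an http(s) scheme and a host so that Windows paths such
    # as C:\docs\file.pdf are not mistaken for URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "file"
```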
387
+
254
388
  ## Using with Langchain
255
389
 
256
390
  For users integrating with the [Langchain](https://python.langchain.com/) framework, `content-core` exposes a set of compatible tools. These tools, located in the `src/content_core/tools` directory, allow you to leverage `content-core` extraction, cleaning, and summarization capabilities directly within your Langchain agents and chains.
@@ -360,8 +494,21 @@ Example `.env`:
360
494
  ```plaintext
361
495
  OPENAI_API_KEY=your-key-here
362
496
  GOOGLE_API_KEY=your-key-here
497
+
498
+ # Engine Selection (optional)
499
+ CCORE_DOCUMENT_ENGINE=auto # auto, simple, docling
500
+ CCORE_URL_ENGINE=auto # auto, simple, firecrawl, jina
363
501
  ```
364
502
 
503
+ ### Engine Selection via Environment Variables
504
+
505
+ For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:
506
+
507
+ - **`CCORE_DOCUMENT_ENGINE`**: Force document engine (`auto`, `simple`, `docling`)
508
+ - **`CCORE_URL_ENGINE`**: Force URL engine (`auto`, `simple`, `firecrawl`, `jina`)
509
+
510
+ These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
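
The precedence rule can be illustrated with a short sketch. The allowed values come from the list above; `resolve_engine` is a hypothetical helper, and falling back to the config value when the variable holds an unrecognized string is an assumption, not documented behavior.

```python
import os

# Allowed values as listed above.
DOCUMENT_ENGINES = {"auto", "simple", "docling"}
URL_ENGINES = {"auto", "simple", "firecrawl", "jina"}


def resolve_engine(env_var, config_value, allowed, environ=None):
    """Return the environment override if set and valid, else the config value."""
    environ = os.environ if environ is None else environ
    value = environ.get(env_var, "").strip().lower()
    if value in allowed:
        return value  # environment variable wins over the config file
    return config_value  # unset or unrecognized: keep the config setting (assumed)


url_engine = resolve_engine(
    "CCORE_URL_ENGINE", "auto", URL_ENGINES, {"CCORE_URL_ENGINE": "jina"}
)
```

Passing `environ` explicitly keeps the helper easy to test without touching the real process environment.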
511
+
365
512
  ### Custom Prompt Templates
366
513
 
367
514
  Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the `PROMPT_PATH` environment variable in your `.env` file or system environment.
@@ -292,6 +292,34 @@ export GOOGLE_API_KEY="your-google-key"
292
292
  - **Firecrawl**: Visit [Firecrawl](https://www.firecrawl.dev/) for enhanced web scraping
293
293
  - **Jina**: Visit [Jina AI](https://jina.ai/) for alternative web extraction
294
294
 
295
+ ### Engine Selection via Environment Variables
296
+
297
+ For advanced users, you can override the extraction engines:
298
+
299
+ ```json
300
+ {
301
+ "mcpServers": {
302
+ "content-core": {
303
+ "env": {
304
+ "OPENAI_API_KEY": "sk-...",
305
+ "FIRECRAWL_API_KEY": "fc-...",
306
+ "CCORE_DOCUMENT_ENGINE": "simple", // Skip docling, use PyMuPDF
307
+ "CCORE_URL_ENGINE": "auto" // Or firecrawl, jina
308
+ }
309
+ }
310
+ }
311
+ }
312
+ ```
313
+
314
+ **Available engines:**
315
+ - **Document**: `auto`, `simple`, `docling` (requires `content-core[docling]`)
316
+ - **URL**: `auto`, `simple`, `firecrawl`, `jina`
317
+
318
+ **Use cases:**
319
+ - Set `CCORE_DOCUMENT_ENGINE=simple` to avoid docling dependency issues
320
+ - Set `CCORE_URL_ENGINE=firecrawl` to always use paid service for better reliability
321
+ - Set `CCORE_URL_ENGINE=simple` for faster processing without external API calls
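
Strict JSON does not allow `//` comments, so the annotations in the configuration example above may need to be removed before the block is pasted into a real config file. As a hedged sketch, the same `env` fragment can be generated programmatically so the output is always valid JSON. `build_mcp_env` is a hypothetical helper, and the fragment deliberately omits API keys and any other server fields, which are not shown in the excerpt.

```python
import json


def build_mcp_env(document_engine="simple", url_engine="auto"):
    """Return the engine-related `env` mapping for the content-core server entry."""
    return {
        "CCORE_DOCUMENT_ENGINE": document_engine,  # "simple" skips Docling, uses PyMuPDF
        "CCORE_URL_ENGINE": url_engine,  # or "simple", "firecrawl", "jina"
    }


config_fragment = {"mcpServers": {"content-core": {"env": build_mcp_env()}}}
rendered = json.dumps(config_fragment, indent=2)
print(rendered)
```

Keeping the explanatory notes as Python comments means they never leak into the serialized JSON.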
322
+
295
323
  ### Custom Prompts
296
324
 
297
325
  You can customize Content Core's behavior by setting a custom prompt path: