content-core 1.1.2__tar.gz → 1.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of content-core might be problematic.

Files changed (89)
  1. content_core-1.2.0/.github/workflows/claude-code-review.yml +77 -0
  2. content_core-1.2.0/.github/workflows/claude.yml +59 -0
  3. {content_core-1.1.2 → content_core-1.2.0}/.gitignore +4 -1
  4. {content_core-1.1.2 → content_core-1.2.0}/PKG-INFO +149 -20
  5. {content_core-1.1.2 → content_core-1.2.0}/README.md +146 -18
  6. {content_core-1.1.2 → content_core-1.2.0}/docs/processors.md +24 -3
  7. content_core-1.2.0/docs/raycast.md +268 -0
  8. {content_core-1.1.2 → content_core-1.2.0}/docs/usage.md +84 -1
  9. content_core-1.2.0/mcp.md +248 -0
  10. content_core-1.2.0/new_pdf.pdf +0 -0
  11. {content_core-1.1.2 → content_core-1.2.0}/pyproject.toml +2 -2
  12. content_core-1.2.0/raycast-content-core/.eslintrc.json +9 -0
  13. content_core-1.2.0/raycast-content-core/CHANGELOG.md +36 -0
  14. content_core-1.2.0/raycast-content-core/README.md +150 -0
  15. content_core-1.2.0/raycast-content-core/assets/command-icon.png +0 -0
  16. content_core-1.2.0/raycast-content-core/package-lock.json +2094 -0
  17. content_core-1.2.0/raycast-content-core/package.json +124 -0
  18. content_core-1.2.0/raycast-content-core/raycast-env.d.ts +42 -0
  19. content_core-1.2.0/raycast-content-core/src/extract-content.tsx +306 -0
  20. content_core-1.2.0/raycast-content-core/src/quick-extract.tsx +128 -0
  21. content_core-1.2.0/raycast-content-core/src/summarize-content.tsx +420 -0
  22. content_core-1.2.0/raycast-content-core/src/utils/content-core.ts +348 -0
  23. content_core-1.2.0/raycast-content-core/src/utils/types.ts +27 -0
  24. content_core-1.2.0/raycast-content-core/tsconfig.json +30 -0
  25. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/cc_config.yaml +4 -0
  26. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/config.py +20 -2
  27. content_core-1.2.0/src/content_core/notebooks/urls.ipynb +154 -0
  28. content_core-1.2.0/src/content_core/processors/pdf.py +292 -0
  29. content_core-1.2.0/test.py +16 -0
  30. content_core-1.2.0/tests/unit/test_pymupdf_ocr.py +275 -0
  31. {content_core-1.1.2 → content_core-1.2.0}/uv.lock +2466 -2502
  32. content_core-1.1.2/src/content_core/processors/pdf.py +0 -168
  33. {content_core-1.1.2 → content_core-1.2.0}/.github/PULL_REQUEST_TEMPLATE.md +0 -0
  34. {content_core-1.1.2 → content_core-1.2.0}/.github/workflows/publish.yml +0 -0
  35. {content_core-1.1.2 → content_core-1.2.0}/.python-version +0 -0
  36. {content_core-1.1.2 → content_core-1.2.0}/CONTRIBUTING.md +0 -0
  37. {content_core-1.1.2 → content_core-1.2.0}/LICENSE +0 -0
  38. {content_core-1.1.2 → content_core-1.2.0}/Makefile +0 -0
  39. {content_core-1.1.2 → content_core-1.2.0}/docs/macos.md +0 -0
  40. {content_core-1.1.2 → content_core-1.2.0}/docs/mcp.md +0 -0
  41. {content_core-1.1.2 → content_core-1.2.0}/prompts/content/cleanup.jinja +0 -0
  42. {content_core-1.1.2 → content_core-1.2.0}/prompts/content/summarize.jinja +0 -0
  43. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/__init__.py +0 -0
  44. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/common/__init__.py +0 -0
  45. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/common/exceptions.py +0 -0
  46. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/common/state.py +0 -0
  47. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/common/types.py +0 -0
  48. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/common/utils.py +0 -0
  49. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/content/__init__.py +0 -0
  50. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/content/cleanup/__init__.py +0 -0
  51. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/content/cleanup/core.py +0 -0
  52. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/content/extraction/__init__.py +0 -0
  53. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/content/extraction/graph.py +0 -0
  54. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/content/identification/__init__.py +0 -0
  55. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/content/summary/__init__.py +0 -0
  56. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/content/summary/core.py +0 -0
  57. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/logging.py +0 -0
  58. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/mcp/__init__.py +0 -0
  59. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/mcp/server.py +0 -0
  60. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/models.py +0 -0
  61. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/models_config.yaml +0 -0
  62. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/notebooks/run.ipynb +0 -0
  63. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/processors/audio.py +0 -0
  64. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/processors/docling.py +0 -0
  65. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/processors/office.py +0 -0
  66. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/processors/text.py +0 -0
  67. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/processors/url.py +0 -0
  68. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/processors/video.py +0 -0
  69. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/processors/youtube.py +0 -0
  70. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/py.typed +0 -0
  71. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/templated_message.py +0 -0
  72. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/tools/__init__.py +0 -0
  73. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/tools/cleanup.py +0 -0
  74. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/tools/extract.py +0 -0
  75. {content_core-1.1.2 → content_core-1.2.0}/src/content_core/tools/summarize.py +0 -0
  76. {content_core-1.1.2 → content_core-1.2.0}/tests/input_content/file.docx +0 -0
  77. {content_core-1.1.2 → content_core-1.2.0}/tests/input_content/file.epub +0 -0
  78. {content_core-1.1.2 → content_core-1.2.0}/tests/input_content/file.md +0 -0
  79. {content_core-1.1.2 → content_core-1.2.0}/tests/input_content/file.mp3 +0 -0
  80. {content_core-1.1.2 → content_core-1.2.0}/tests/input_content/file.mp4 +0 -0
  81. {content_core-1.1.2 → content_core-1.2.0}/tests/input_content/file.pdf +0 -0
  82. {content_core-1.1.2 → content_core-1.2.0}/tests/input_content/file.pptx +0 -0
  83. {content_core-1.1.2 → content_core-1.2.0}/tests/input_content/file.txt +0 -0
  84. {content_core-1.1.2 → content_core-1.2.0}/tests/input_content/file.xlsx +0 -0
  85. {content_core-1.1.2 → content_core-1.2.0}/tests/input_content/file_audio.mp3 +0 -0
  86. {content_core-1.1.2 → content_core-1.2.0}/tests/integration/test_cli.py +0 -0
  87. {content_core-1.1.2 → content_core-1.2.0}/tests/integration/test_extraction.py +0 -0
  88. {content_core-1.1.2 → content_core-1.2.0}/tests/unit/test_docling.py +0 -0
  89. {content_core-1.1.2 → content_core-1.2.0}/tests/unit/test_mcp_server.py +0 -0
@@ -0,0 +1,77 @@
+ name: Claude Code Review
+
+ on:
+   pull_request:
+     types: [opened, synchronize]
+     # Optional: Only run on specific file changes
+     # paths:
+     #   - "src/**/*.ts"
+     #   - "src/**/*.tsx"
+     #   - "src/**/*.js"
+     #   - "src/**/*.jsx"
+
+ jobs:
+   claude-review:
+     # Optional: Filter by PR author
+     # if: |
+     #   github.event.pull_request.user.login == 'external-contributor' ||
+     #   github.event.pull_request.user.login == 'new-developer' ||
+     #   github.event.pull_request.author_association == 'FIRST_TIME_CONTRIBUTOR'
+
+     runs-on: ubuntu-latest
+     permissions:
+       contents: read
+       pull-requests: read
+       issues: read
+       id-token: write
+
+     steps:
+       - name: Checkout repository
+         uses: actions/checkout@v4
+         with:
+           fetch-depth: 1
+
+       - name: Run Claude Code Review
+         id: claude-review
+         uses: anthropics/claude-code-action@beta
+         with:
+           anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
+
+           # Optional: Specify model (defaults to Claude Sonnet 4, uncomment for Claude Opus 4)
+           # model: "claude-opus-4-20250514"
+
+           # Direct prompt for automated review (no @claude mention needed)
+           direct_prompt: |
+             Please review this pull request (and any added updates) and provide feedback on:
+             - Code quality and best practices
+             - Potential bugs or issues
+             - Performance considerations
+             - Documentation updates: does the README.md and docs/ reflect what changed?
+             - Security concerns
+             - Test coverage
+
+             If the pull request received updates after your first review, please redo the review based on the new code
+             Be constructive and helpful in your feedback.
+
+           # Optional: Customize review based on file types
+           # direct_prompt: |
+           #   Review this PR focusing on:
+           #   - For TypeScript files: Type safety and proper interface usage
+           #   - For API endpoints: Security, input validation, and error handling
+           #   - For React components: Performance, accessibility, and best practices
+           #   - For tests: Coverage, edge cases, and test quality
+
+           # Optional: Different prompts for different authors
+           # direct_prompt: |
+           #   ${{ github.event.pull_request.author_association == 'FIRST_TIME_CONTRIBUTOR' &&
+           #   'Welcome! Please review this PR from a first-time contributor. Be encouraging and provide detailed explanations for any suggestions.' ||
+           #   'Please provide a thorough code review focusing on our coding standards and best practices.' }}
+
+           # Optional: Add specific tools for running tests or linting
+           # allowed_tools: "Bash(npm run test),Bash(npm run lint),Bash(npm run typecheck)"
+
+           # Optional: Skip review for certain conditions
+           # if: |
+           #   !contains(github.event.pull_request.title, '[skip-review]') &&
+           #   !contains(github.event.pull_request.title, '[WIP]')
+
@@ -0,0 +1,59 @@
+ name: Claude Code
+
+ on:
+   issue_comment:
+     types: [created]
+   pull_request_review_comment:
+     types: [created]
+   issues:
+     types: [opened, assigned]
+   pull_request_review:
+     types: [submitted]
+
+ jobs:
+   claude:
+     if: |
+       (github.event_name == 'issue_comment' && contains(github.event.comment.body, '@claude')) ||
+       (github.event_name == 'pull_request_review_comment' && contains(github.event.comment.body, '@claude')) ||
+       (github.event_name == 'pull_request_review' && contains(github.event.review.body, '@claude')) ||
+       (github.event_name == 'issues' && (contains(github.event.issue.body, '@claude') || contains(github.event.issue.title, '@claude')))
+     runs-on: ubuntu-latest
+     permissions:
+       contents: read
+       pull-requests: read
+       issues: read
+       id-token: write
+     steps:
+       - name: Checkout repository
+         uses: actions/checkout@v4
+         with:
+           fetch-depth: 1
+
+       - name: Run Claude Code
+         id: claude
+         uses: anthropics/claude-code-action@beta
+         with:
+           anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
+
+           # Optional: Specify model (defaults to Claude Sonnet 4, uncomment for Claude Opus 4)
+           # model: "claude-opus-4-20250514"
+
+           # Optional: Customize the trigger phrase (default: @claude)
+           # trigger_phrase: "/claude"
+
+           # Optional: Trigger when specific user is assigned to an issue
+           # assignee_trigger: "claude-bot"
+
+           # Optional: Allow Claude to run specific commands
+           # allowed_tools: "Bash(npm install),Bash(npm run build),Bash(npm run test:*),Bash(npm run lint:*)"
+
+           # Optional: Add custom instructions for Claude to customize its behavior for your project
+           # custom_instructions: |
+           #   Follow our coding standards
+           #   Ensure all new code has tests
+           #   Use TypeScript for new files
+
+           # Optional: Custom environment variables for Claude
+           # claude_env: |
+           #   NODE_ENV: test
+
@@ -22,4 +22,7 @@ WIP/
 
  *.ignore
  .windsurfrules
- CLAUDE.md
+ CLAUDE.md
+
+ node_modules/
+ **/notebooks/private
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: content-core
- Version: 1.1.2
+ Version: 1.2.0
  Summary: Extract what matters from any media source. Available as Python Library, macOS Service, CLI and MCP Server
  Author-email: LUIS NOVO <lfnovo@gmail.com>
  License-File: LICENSE
@@ -10,7 +10,6 @@ Requires-Dist: aiohttp>=3.11
  Requires-Dist: asciidoc>=10.2.1
  Requires-Dist: bs4>=0.0.2
  Requires-Dist: dicttoxml>=1.7.16
- Requires-Dist: docling>=2.34.0
  Requires-Dist: esperanto>=1.2.0
  Requires-Dist: firecrawl-py>=2.7.0
  Requires-Dist: jinja2>=3.1.6
@@ -31,6 +30,8 @@ Requires-Dist: pytubefix>=9.1.1
  Requires-Dist: readability-lxml>=0.8.4.1
  Requires-Dist: validators>=0.34.0
  Requires-Dist: youtube-transcript-api>=1.0.3
+ Provides-Extra: docling
+ Requires-Dist: docling>=2.34.0; extra == 'docling'
  Provides-Extra: mcp
  Requires-Dist: fastmcp>=0.5.0; extra == 'mcp'
  Description-Content-Type: text/markdown
@@ -39,29 +40,70 @@ Description-Content-Type: text/markdown
 
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
- **Content Core** is a versatile Python library designed to extract and process content from various sources, providing a unified interface for handling text, web pages, and local files.
+ **Content Core** is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries—all through a unified interface with multiple integration options.
 
- ## Overview
+ ## 🚀 What You Can Do
 
- > **Note:** As of v0.8, the default extraction engine is `'auto'`. Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to simple extraction. You can override the engine if needed, but `'auto'` is recommended for most users.
+ **Extract content from anywhere:**
+ - 📄 **Documents** - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
+ - 🎥 **Media** - Videos (MP4, AVI, MOV) with automatic transcription
+ - 🎵 **Audio** - MP3, WAV, M4A with speech-to-text conversion
+ - 🌐 **Web** - Any URL with intelligent content extraction
+ - 🖼️ **Images** - JPG, PNG, TIFF with OCR text recognition
+ - 📦 **Archives** - ZIP, TAR, GZ with content analysis
 
- The primary goal of Content Core is to simplify the process of ingesting content from diverse origins. Whether you have raw text, a URL pointing to an article, or a local file like a video or markdown document, Content Core aims to extract the meaningful content for further use.
+ **Process with AI:**
+ - ✨ **Clean & format** extracted content automatically
+ - 📝 **Generate summaries** with customizable styles (bullet points, executive summary, etc.)
+ - 🎯 **Context-aware processing** - explain to a child, technical summary, action items
+ - 🔄 **Smart engine selection** - automatically chooses the best extraction method
 
- ## Key Features
+ ## 🛠️ Multiple Ways to Use
 
- * **Multi-Source Extraction:** Handles content from:
-     * Direct text strings.
-     * Web URLs (using robust extraction methods).
-     * Local files (including automatic transcription for video/audio files and parsing for text-based formats).
- * **Intelligent Processing:** Applies appropriate extraction techniques based on the source type. See the [Processors Documentation](./docs/processors.md) for detailed information on how different content types are handled.
- * **Smart Engine Selection:** By default, Content Core uses the `'auto'` engine, which:
-     * For URLs: Uses Firecrawl if `FIRECRAWL_API_KEY` is set, else tries Jina. Jina might fail because of rate limits, which can be fixed by adding `JINA_API_KEY`. If Jina failes, BeautifulSoup is used as a fallback.
-     * For files: Tries Docling extraction first (for robust document parsing), then falls back to simple extraction if needed.
-     * You can override this by specifying an engine, but `'auto'` is recommended for most users.
- * **Content Cleaning (Optional):** Likely integrates with LLMs (via `prompter.py` and Jinja templates) to refine and clean the extracted content.
- * **MCP Server:** Includes a Model Context Protocol (MCP) server for seamless integration with Claude Desktop and other MCP-compatible applications.
- * **macOS Services:** Right-click context menu integration for Finder (extract and summarize files directly).
- * **Asynchronous:** Built with `asyncio` for efficient I/O operations.
+ ### 🖥️ Command Line (Zero Install)
+ ```bash
+ # Extract content from any source
+ uvx --from "content-core" ccore https://example.com
+ uvx --from "content-core" ccore document.pdf
+
+ # Generate AI summaries
+ uvx --from "content-core" csum video.mp4 --context "bullet points"
+ ```
+
+ ### 🤖 Claude Desktop Integration
+ One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.
+
+ ### 🔍 Raycast Extension
+ Smart auto-detection commands:
+ - **Extract Content** - Full interface with format options
+ - **Summarize Content** - 9 summary styles available
+ - **Quick Extract** - Instant clipboard extraction
+
+ ### 🖱️ macOS Right-Click Integration
+ Right-click any file in Finder → Services → Extract or Summarize content instantly.
+
+ ### 🐍 Python Library
+ ```python
+ import content_core as cc
+
+ # Extract from any source
+ result = await cc.extract("https://example.com/article")
+ summary = await cc.summarize_content(result, context="explain to a child")
+ ```
+
+ ## ⚡ Key Features
+
+ * **🎯 Intelligent Auto-Detection:** Automatically selects the best extraction method based on content type and available services
+ * **🔧 Smart Engine Selection:**
+   * **URLs:** Firecrawl → Jina → BeautifulSoup fallback chain
+   * **Documents:** Docling → Enhanced PyMuPDF → Simple extraction fallback
+   * **Media:** OpenAI Whisper transcription
+   * **Images:** OCR with multiple engine support
+ * **📊 Enhanced PDF Processing:** Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
+ * **🌍 Multiple Integrations:** CLI, Python library, MCP server, Raycast extension, macOS Services
+ * **⚡ Zero-Install Options:** Use `uvx` for instant access without installation
+ * **🧠 AI-Powered Processing:** LLM integration for content cleaning and summarization
+ * **🔄 Asynchronous:** Built with `asyncio` for efficient processing
 
  ## Getting Started
 
@@ -245,6 +287,49 @@ Add to your `claude_desktop_config.json`:
 
  For detailed setup instructions, configuration options, and usage examples, see our [MCP Documentation](docs/mcp.md).
 
+ ## Enhanced PDF Processing
+
+ Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.
+
+ ### Key Improvements
+
+ - **🔬 Mathematical Formula Extraction**: Enhanced quality flags eliminate `<!-- formula-not-decoded -->` placeholders
+ - **📊 Automatic Table Detection**: Tables converted to markdown format for LLM consumption
+ - **🔧 Quality Text Rendering**: Better ligature, whitespace, and image-text integration
+ - **⚡ Optional OCR Enhancement**: Selective OCR for formula-heavy pages (requires Tesseract)
+
+ ### Configuration for Scientific Documents
+
+ For documents with heavy mathematical content, enable OCR enhancement:
+
+ ```yaml
+ # In cc_config.yaml
+ extraction:
+   pymupdf:
+     enable_formula_ocr: true # Enable OCR for formula-heavy pages
+     formula_threshold: 3 # Min formulas per page to trigger OCR
+     ocr_fallback: true # Graceful fallback if OCR fails
+ ```
+
+ ```python
+ # Runtime configuration
+ from content_core.config import set_pymupdf_ocr_enabled
+ set_pymupdf_ocr_enabled(True)
+ ```
+
+ ### Requirements for OCR Enhancement
+
+ ```bash
+ # Install Tesseract OCR (optional, for formula enhancement)
+ # macOS
+ brew install tesseract
+
+ # Ubuntu/Debian
+ sudo apt-get install tesseract-ocr
+ ```
+
+ **Note**: OCR is optional - you get improved PDF extraction automatically without any additional setup.
+
  ## macOS Services Integration
 
  Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.
@@ -288,6 +373,50 @@ Create **4 convenient services** for different workflows:
 
  For complete setup instructions with copy-paste scripts, see [macOS Services Documentation](docs/macos.md).
 
+ ## Raycast Extension
+
+ Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.
+
+ ### Quick Setup
+
+ **From Raycast Store** (coming soon):
+ 1. Open Raycast and search for "Content Core"
+ 2. Install the extension by `luis_novo`
+ 3. Configure API keys in preferences
+
+ **Manual Installation**:
+ 1. Download the extension from the repository
+ 2. Open Raycast → "Import Extension"
+ 3. Select the `raycast-content-core` folder
+
+ ### Commands
+
+ **🔍 Extract Content** - Smart URL/file detection with full interface
+ - Auto-detects URLs vs file paths in real-time
+ - Multiple output formats (Text, JSON, XML)
+ - Drag & drop support for files
+ - Rich results view with metadata
+
+ **📝 Summarize Content** - AI-powered summaries with customizable styles
+ - 9 different summary styles (bullet points, executive summary, etc.)
+ - Auto-detects source type with visual feedback
+ - One-click snippet creation and quicklinks
+
+ **⚡ Quick Extract** - Instant extraction to clipboard
+ - Type → Tab → Paste source → Enter
+ - No UI, works directly from command bar
+ - Perfect for quick workflows
+
+ ### Features
+
+ - **Smart Auto-Detection**: Instantly recognizes URLs vs file paths
+ - **Zero Installation**: Uses `uvx` for Content Core execution
+ - **Rich Integration**: Keyboard shortcuts, clipboard actions, Raycast snippets
+ - **All File Types**: Documents, videos, audio, images, archives
+ - **Visual Feedback**: Real-time type detection with icons
+
+ For detailed setup, configuration, and usage examples, see [Raycast Extension Documentation](docs/raycast.md).
+
  ## Using with Langchain
 
  For users integrating with the [Langchain](https://python.langchain.com/) framework, `content-core` exposes a set of compatible tools. These tools, located in the `src/content_core/tools` directory, allow you to leverage `content-core` extraction, cleaning, and summarization capabilities directly within your Langchain agents and chains.
@@ -2,29 +2,70 @@
 
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
- **Content Core** is a versatile Python library designed to extract and process content from various sources, providing a unified interface for handling text, web pages, and local files.
+ **Content Core** is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries—all through a unified interface with multiple integration options.
 
- ## Overview
+ ## 🚀 What You Can Do
 
- > **Note:** As of v0.8, the default extraction engine is `'auto'`. Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to simple extraction. You can override the engine if needed, but `'auto'` is recommended for most users.
+ **Extract content from anywhere:**
+ - 📄 **Documents** - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
+ - 🎥 **Media** - Videos (MP4, AVI, MOV) with automatic transcription
+ - 🎵 **Audio** - MP3, WAV, M4A with speech-to-text conversion
+ - 🌐 **Web** - Any URL with intelligent content extraction
+ - 🖼️ **Images** - JPG, PNG, TIFF with OCR text recognition
+ - 📦 **Archives** - ZIP, TAR, GZ with content analysis
 
- The primary goal of Content Core is to simplify the process of ingesting content from diverse origins. Whether you have raw text, a URL pointing to an article, or a local file like a video or markdown document, Content Core aims to extract the meaningful content for further use.
+ **Process with AI:**
+ - ✨ **Clean & format** extracted content automatically
+ - 📝 **Generate summaries** with customizable styles (bullet points, executive summary, etc.)
+ - 🎯 **Context-aware processing** - explain to a child, technical summary, action items
+ - 🔄 **Smart engine selection** - automatically chooses the best extraction method
 
- ## Key Features
+ ## 🛠️ Multiple Ways to Use
 
- * **Multi-Source Extraction:** Handles content from:
-     * Direct text strings.
-     * Web URLs (using robust extraction methods).
-     * Local files (including automatic transcription for video/audio files and parsing for text-based formats).
- * **Intelligent Processing:** Applies appropriate extraction techniques based on the source type. See the [Processors Documentation](./docs/processors.md) for detailed information on how different content types are handled.
- * **Smart Engine Selection:** By default, Content Core uses the `'auto'` engine, which:
-     * For URLs: Uses Firecrawl if `FIRECRAWL_API_KEY` is set, else tries Jina. Jina might fail because of rate limits, which can be fixed by adding `JINA_API_KEY`. If Jina failes, BeautifulSoup is used as a fallback.
-     * For files: Tries Docling extraction first (for robust document parsing), then falls back to simple extraction if needed.
-     * You can override this by specifying an engine, but `'auto'` is recommended for most users.
- * **Content Cleaning (Optional):** Likely integrates with LLMs (via `prompter.py` and Jinja templates) to refine and clean the extracted content.
- * **MCP Server:** Includes a Model Context Protocol (MCP) server for seamless integration with Claude Desktop and other MCP-compatible applications.
- * **macOS Services:** Right-click context menu integration for Finder (extract and summarize files directly).
- * **Asynchronous:** Built with `asyncio` for efficient I/O operations.
+ ### 🖥️ Command Line (Zero Install)
+ ```bash
+ # Extract content from any source
+ uvx --from "content-core" ccore https://example.com
+ uvx --from "content-core" ccore document.pdf
+
+ # Generate AI summaries
+ uvx --from "content-core" csum video.mp4 --context "bullet points"
+ ```
+
+ ### 🤖 Claude Desktop Integration
+ One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.
+
+ ### 🔍 Raycast Extension
+ Smart auto-detection commands:
+ - **Extract Content** - Full interface with format options
+ - **Summarize Content** - 9 summary styles available
+ - **Quick Extract** - Instant clipboard extraction
+
+ ### 🖱️ macOS Right-Click Integration
+ Right-click any file in Finder → Services → Extract or Summarize content instantly.
+
+ ### 🐍 Python Library
+ ```python
+ import content_core as cc
+
+ # Extract from any source
+ result = await cc.extract("https://example.com/article")
+ summary = await cc.summarize_content(result, context="explain to a child")
+ ```
+
+ ## ⚡ Key Features
+
+ * **🎯 Intelligent Auto-Detection:** Automatically selects the best extraction method based on content type and available services
+ * **🔧 Smart Engine Selection:**
+   * **URLs:** Firecrawl → Jina → BeautifulSoup fallback chain
+   * **Documents:** Docling → Enhanced PyMuPDF → Simple extraction fallback
+   * **Media:** OpenAI Whisper transcription
+   * **Images:** OCR with multiple engine support
+ * **📊 Enhanced PDF Processing:** Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
+ * **🌍 Multiple Integrations:** CLI, Python library, MCP server, Raycast extension, macOS Services
+ * **⚡ Zero-Install Options:** Use `uvx` for instant access without installation
+ * **🧠 AI-Powered Processing:** LLM integration for content cleaning and summarization
+ * **🔄 Asynchronous:** Built with `asyncio` for efficient processing
 
  ## Getting Started
 
@@ -208,6 +249,49 @@ Add to your `claude_desktop_config.json`:
 
  For detailed setup instructions, configuration options, and usage examples, see our [MCP Documentation](docs/mcp.md).
 
+ ## Enhanced PDF Processing
+
+ Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.
+
+ ### Key Improvements
+
+ - **🔬 Mathematical Formula Extraction**: Enhanced quality flags eliminate `<!-- formula-not-decoded -->` placeholders
+ - **📊 Automatic Table Detection**: Tables converted to markdown format for LLM consumption
+ - **🔧 Quality Text Rendering**: Better ligature, whitespace, and image-text integration
+ - **⚡ Optional OCR Enhancement**: Selective OCR for formula-heavy pages (requires Tesseract)
+
+ ### Configuration for Scientific Documents
+
+ For documents with heavy mathematical content, enable OCR enhancement:
+
+ ```yaml
+ # In cc_config.yaml
+ extraction:
+   pymupdf:
+     enable_formula_ocr: true # Enable OCR for formula-heavy pages
+     formula_threshold: 3 # Min formulas per page to trigger OCR
+     ocr_fallback: true # Graceful fallback if OCR fails
+ ```
+
+ ```python
+ # Runtime configuration
+ from content_core.config import set_pymupdf_ocr_enabled
+ set_pymupdf_ocr_enabled(True)
+ ```
+
+ ### Requirements for OCR Enhancement
+
+ ```bash
+ # Install Tesseract OCR (optional, for formula enhancement)
+ # macOS
+ brew install tesseract
+
+ # Ubuntu/Debian
+ sudo apt-get install tesseract-ocr
+ ```
+
+ **Note**: OCR is optional - you get improved PDF extraction automatically without any additional setup.
+
  ## macOS Services Integration
 
  Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.
@@ -251,6 +335,50 @@ Create **4 convenient services** for different workflows:
251
335
 
252
336
  For complete setup instructions with copy-paste scripts, see [macOS Services Documentation](docs/macos.md).
253
337
 
338
+ ## Raycast Extension
339
+
340
+ Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.
341
+
342
+ ### Quick Setup
343
+
344
+ **From Raycast Store** (coming soon):
345
+ 1. Open Raycast and search for "Content Core"
346
+ 2. Install the extension by `luis_novo`
347
+ 3. Configure API keys in preferences
348
+
349
+ **Manual Installation**:
350
+ 1. Download the extension from the repository
351
+ 2. Open Raycast → "Import Extension"
352
+ 3. Select the `raycast-content-core` folder
353
+
354
+ ### Commands
355
+
356
+ **🔍 Extract Content** - Smart URL/file detection with full interface
357
+ - Auto-detects URLs vs file paths in real-time
358
+ - Multiple output formats (Text, JSON, XML)
359
+ - Drag & drop support for files
360
+ - Rich results view with metadata
361
+
362
+ **📝 Summarize Content** - AI-powered summaries with customizable styles
363
+ - 9 different summary styles (bullet points, executive summary, etc.)
364
+ - Auto-detects source type with visual feedback
365
+ - One-click snippet creation and quicklinks
366
+
367
+ **⚡ Quick Extract** - Instant extraction to clipboard
368
+ - Type → Tab → Paste source → Enter
369
+ - No UI, works directly from command bar
370
+ - Perfect for quick workflows
371
+
372
+ ### Features
373
+
374
+ - **Smart Auto-Detection**: Instantly recognizes URLs vs file paths
375
+ - **Zero Installation**: Uses `uvx` for Content Core execution
376
+ - **Rich Integration**: Keyboard shortcuts, clipboard actions, Raycast snippets
377
+ - **All File Types**: Documents, videos, audio, images, archives
378
+ - **Visual Feedback**: Real-time type detection with icons
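The URL-versus-file detection described above could be approximated as follows. This is an illustrative Python sketch, not the extension's code (the extension sources listed in this release are TypeScript); `detect_source_type` is a hypothetical name, and the rule that only `http`/`https` inputs with a host count as URLs is an assumption.

```python
from urllib.parse import urlparse


def detect_source_type(source: str) -> str:
    """Classify an input string as "url" or "file" (hypothetical helper)."""
    parsed = urlparse(source.strip())
    # Require both a web scheme and a host, so Windows paths such as
    # "C:\\docs\\a.pdf" (parsed scheme "c") are still treated as files.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "file"
```

Anything that is not clearly a web URL falls through to `"file"`, so existence checks on the path would still be needed before extraction.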
379
+
380
+ For detailed setup, configuration, and usage examples, see [Raycast Extension Documentation](docs/raycast.md).
381
+
254
382
  ## Using with Langchain
255
383
 
256
384
  For users integrating with the [Langchain](https://python.langchain.com/) framework, `content-core` exposes a set of compatible tools. These tools, located in the `src/content_core/tools` directory, allow you to leverage `content-core` extraction, cleaning, and summarization capabilities directly within your Langchain agents and chains.
@@ -1,6 +1,6 @@
1
1
  # Content Core Processors
2
2
 
3
- **Note:** As of vNEXT, the default extraction engine is now `'auto'`. This means Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to simple extraction. See details below.
3
+ **Note:** As of vNEXT, the default extraction engine is now `'auto'`. This means Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents, `'auto'` now tries Docling first, then falls back to enhanced PyMuPDF extraction (with quality flags and table detection), then to basic simple extraction. See details below.
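The fallback order in the note (Docling, then enhanced PyMuPDF, then basic extraction) can be pictured as an ordered chain of extractors. The sketch below shows that general pattern only; `extract_with_fallback` and the engine callables are hypothetical and do not reflect Content Core's internal function names.

```python
from typing import Callable, List, Sequence, Tuple

Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str, engines: Sequence[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Try each (name, extractor) in order.

    Returns (engine_name, text) for the first extractor that succeeds and
    raises RuntimeError if every engine fails.
    """
    errors: List[str] = []
    for name, extractor in engines:
        try:
            return name, extractor(path)
        except Exception as exc:  # any failure moves on to the next engine
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all engines failed: " + "; ".join(errors))
```

Returning the engine name alongside the text makes it easy to log which stage of the chain actually produced the output.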
4
4
 
5
5
  This document provides an overview of the content processors available in Content Core. These processors are responsible for extracting and handling content from various sources and file types.
6
6
 
@@ -42,14 +42,35 @@ Content Core uses a modular approach to process content from different sources.
42
42
  - **Returned Data**: Transcribed text from the media content.
43
43
  - **Location**: `src/content_core/processors/transcription.py`
44
44
 
45
- ### 5. **Docling Processor**
45
+ ### 5. **Enhanced PyMuPDF Processor (Simple Engine)**
46
+ - **Purpose**: Optimized PDF extraction using PyMuPDF with enhanced quality flags, table detection, and optional OCR
47
+ - **Supported Input**: PDF files, EPUB files
48
+ - **Returned Data**: High-quality text extraction with proper mathematical symbols, converted tables in markdown format
49
+ - **Location**: `src/content_core/processors/pdf.py`
50
+ - **Key Enhancements**:
51
+   - **Quality Flags**: Automatically applies `TEXT_PRESERVE_LIGATURES`, `TEXT_PRESERVE_WHITESPACE`, and `TEXT_PRESERVE_IMAGES` for better text rendering
52
+   - **Mathematical Formula Support**: Eliminates `<!-- formula-not-decoded -->` placeholders by properly extracting mathematical symbols (∂, ∇, ρ, etc.)
53
+   - **Table Detection**: Automatic detection and conversion of tables to markdown format for LLM consumption
54
+   - **Selective OCR**: Optional OCR enhancement for formula-heavy pages (requires Tesseract installation)
55
+ - **Configuration**: Configure OCR enhancement in `cc_config.yaml`:
56
+ ```yaml
57
+ extraction:
58
+   pymupdf:
59
+     enable_formula_ocr: false # Enable OCR for formula-heavy pages
60
+     formula_threshold: 3 # Min formulas per page to trigger OCR
61
+     ocr_fallback: true # Graceful fallback if OCR fails
62
+ ```
63
+ - **Performance**: Standard extraction maintains baseline performance; OCR only triggers selectively on formula-heavy pages
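To make the selective trigger concrete, here is a small sketch of how a per-page threshold such as `formula_threshold` might select pages for OCR. How Content Core counts formulas on a page is not stated here, so the counts are taken as given input; `pages_needing_ocr` is a hypothetical helper, and the `>=` comparison is an assumption based on the "Min formulas per page" comment.

```python
from typing import List, Sequence


def pages_needing_ocr(
    formula_counts: Sequence[int], formula_threshold: int = 3
) -> List[int]:
    """Return zero-based indices of pages whose formula count meets the threshold."""
    return [
        index
        for index, count in enumerate(formula_counts)
        if count >= formula_threshold
    ]
```

Pages below the threshold would keep the standard extraction path, which is consistent with the note that OCR cost is only paid on formula-heavy pages.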
64
+
65
+ ### 6. **Docling Processor**
46
66
  - **Purpose**: Use Docling library for rich document parsing (PDF, DOCX, XLSX, PPTX, Markdown, AsciiDoc, HTML, CSV, images).
47
67
  - **Supported Input**: PDF, DOCX, XLSX, PPTX, Markdown, AsciiDoc, HTML, CSV, Images (PNG, JPEG, TIFF, BMP).
48
68
  - **Returned Data**: Content converted to configured format (markdown, html, json).
49
69
  - **Location**: `src/content_core/processors/docling.py`
50
70
  - **Default Document Engine (`auto`) Logic for Files/Documents**:
51
71
  - Tries the `'docling'` extraction method first (robust document parsing for supported types).
52
- - If `'docling'` fails or is not supported, automatically falls back to simple extraction (fast, lightweight for supported types).
72
+ - If `'docling'` fails or is not supported, automatically falls back to enhanced PyMuPDF extraction (fast, with quality flags and table detection).
73
+ - Final fallback to basic simple extraction if needed.
53
74
  - You can explicitly specify `'docling'` or `'simple'` as the document engine, but `'auto'` is now the default and recommended for most users.
54
75
  - **Configuration**: Activate the Docling engine in `cc_config.yaml` or custom config:
55
76
  ```yaml