content-core 0.1.2__tar.gz → 0.2.0__tar.gz
Potentially problematic release.
- {content_core-0.1.2 → content_core-0.2.0}/.gitignore +3 -1
- {content_core-0.1.2 → content_core-0.2.0}/PKG-INFO +16 -2
- {content_core-0.1.2 → content_core-0.2.0}/README.md +15 -1
- content_core-0.2.0/docs/processors.md +53 -0
- {content_core-0.1.2 → content_core-0.2.0}/pyproject.toml +1 -1
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/notebooks/run.ipynb +18 -179
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/prompter.py +51 -9
- {content_core-0.1.2 → content_core-0.2.0}/uv.lock +4 -3
- {content_core-0.1.2 → content_core-0.2.0}/.github/PULL_REQUEST_TEMPLATE.md +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/.github/workflows/publish.yml +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/.python-version +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/.windsurfrules +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/CONTRIBUTING.md +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/LICENSE +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/Makefile +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/__init__.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/common/__init__.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/common/exceptions.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/common/state.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/common/utils.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/config.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/content/__init__.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/content/cleanup/__init__.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/content/cleanup/core.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/content/extraction/__init__.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/content/extraction/graph.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/content/summary/__init__.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/content/summary/core.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/processors/audio.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/processors/office.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/processors/pdf.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/processors/text.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/processors/url.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/processors/video.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/processors/youtube.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0/src/content_core}/prompts/content/cleanup.jinja +0 -0
- {content_core-0.1.2 → content_core-0.2.0/src/content_core}/prompts/content/summarize.jinja +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/py.typed +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/templated_message.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/tools/__init__.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/tools/cleanup.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/tools/extract.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/src/content_core/tools/summarize.py +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/tests/input_content/file.docx +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/tests/input_content/file.epub +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/tests/input_content/file.md +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/tests/input_content/file.mp3 +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/tests/input_content/file.mp4 +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/tests/input_content/file.pdf +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/tests/input_content/file.pptx +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/tests/input_content/file.txt +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/tests/input_content/file.xlsx +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/tests/input_content/file_audio.mp3 +0 -0
- {content_core-0.1.2 → content_core-0.2.0}/tests/integration/test_extraction.py +0 -0
@@ -1,6 +1,6 @@
  1   1   Metadata-Version: 2.4
  2   2   Name: content-core
  3     - Version: 0.1.2
      3 + Version: 0.2.0
  4   4   Summary: Extract what matters from any media source
  5   5   Author-email: LUIS NOVO <lfnovo@gmail.com>
  6   6   License-File: LICENSE
@@ -43,7 +43,7 @@ The primary goal of Content Core is to simplify the process of ingesting content
 43  43   * Direct text strings.
 44  44   * Web URLs (using robust extraction methods).
 45  45   * Local files (including automatic transcription for video/audio files and parsing for text-based formats).
 46     - * **Intelligent Processing:** Applies appropriate extraction techniques based on the source type.
     46 + * **Intelligent Processing:** Applies appropriate extraction techniques based on the source type. See the [Processors Documentation](./docs/processors.md) for detailed information on how different content types are handled.
 47  47   * **Content Cleaning (Optional):** Likely integrates with LLMs (via `prompter.py` and Jinja templates) to refine and clean the extracted content.
 48  48   * **Asynchronous:** Built with `asyncio` for efficient I/O operations.
 49  49
@@ -220,6 +220,20 @@ OPENAI_API_KEY=your-key-here
220 220   GOOGLE_API_KEY=your-key-here
221 221   ```
222 222
    223 + ### Custom Prompt Templates
    224 +
    225 + Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the `PROMPT_PATH` environment variable in your `.env` file or system environment.
    226 +
    227 + Example `.env` with custom prompt path:
    228 +
    229 + ```plaintext
    230 + OPENAI_API_KEY=your-key-here
    231 + GOOGLE_API_KEY=your-key-here
    232 + PROMPT_PATH=/path/to/your/custom/prompts
    233 + ```
    234 +
    235 + When a prompt template is requested, Content Core will first look in the custom directory specified by `PROMPT_PATH` (if set and exists). If the template is not found there, it will fall back to the default built-in prompts. This allows you to override specific prompts while still using the default ones for others.
    236 +
223 237   ## Development
224 238
225 239   To set up a development environment:
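The lookup order described in the added "Custom Prompt Templates" text (custom `PROMPT_PATH` directory first, built-in prompts as the fallback) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `prompter.py`: the function name `resolve_template`, its parameters, and the idea of passing the built-in directory explicitly are all hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_template(
    name: str,
    builtin_dir: Path,
    custom_dir: Optional[Path] = None,
) -> Path:
    """Return the path of template `name`, preferring the custom directory.

    Hypothetical helper: when `custom_dir` is not given, it is read from the
    PROMPT_PATH environment variable, mirroring the README text.
    """
    if custom_dir is None:
        env_value = os.environ.get("PROMPT_PATH")
        custom_dir = Path(env_value) if env_value else None

    # Step 1: look in the custom directory, but only if it is set and exists.
    if custom_dir is not None and custom_dir.is_dir():
        candidate = custom_dir / name
        if candidate.is_file():
            return candidate

    # Step 2: fall back to the built-in prompts.
    fallback = builtin_dir / name
    if fallback.is_file():
        return fallback

    raise FileNotFoundError(f"Prompt template not found: {name}")
```

Because the fallback is decided per template, a custom directory that contains only one file overrides that one prompt and leaves every other prompt on its built-in version.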
@@ -14,7 +14,7 @@ The primary goal of Content Core is to simplify the process of ingesting content
 14  14   * Direct text strings.
 15  15   * Web URLs (using robust extraction methods).
 16  16   * Local files (including automatic transcription for video/audio files and parsing for text-based formats).
 17     - * **Intelligent Processing:** Applies appropriate extraction techniques based on the source type.
     17 + * **Intelligent Processing:** Applies appropriate extraction techniques based on the source type. See the [Processors Documentation](./docs/processors.md) for detailed information on how different content types are handled.
 18  18   * **Content Cleaning (Optional):** Likely integrates with LLMs (via `prompter.py` and Jinja templates) to refine and clean the extracted content.
 19  19   * **Asynchronous:** Built with `asyncio` for efficient I/O operations.
 20  20
@@ -191,6 +191,20 @@ OPENAI_API_KEY=your-key-here
191 191   GOOGLE_API_KEY=your-key-here
192 192   ```
193 193
    194 + ### Custom Prompt Templates
    195 +
    196 + Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the `prompts` directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the `PROMPT_PATH` environment variable in your `.env` file or system environment.
    197 +
    198 + Example `.env` with custom prompt path:
    199 +
    200 + ```plaintext
    201 + OPENAI_API_KEY=your-key-here
    202 + GOOGLE_API_KEY=your-key-here
    203 + PROMPT_PATH=/path/to/your/custom/prompts
    204 + ```
    205 +
    206 + When a prompt template is requested, Content Core will first look in the custom directory specified by `PROMPT_PATH` (if set and exists). If the template is not found there, it will fall back to the default built-in prompts. This allows you to override specific prompts while still using the default ones for others.
    207 +
194 208   ## Development
195 209
196 210   To set up a development environment:
@@ -0,0 +1,53 @@
      1 + # Content Core Processors
      2 +
      3 + This document provides an overview of the content processors available in Content Core. These processors are responsible for extracting and handling content from various sources and file types.
      4 +
      5 + ## Overview
      6 +
      7 + Content Core uses a modular approach to process content from different sources. Each processor is designed to handle specific types of input, such as web URLs, local files, or direct text input. Below, you'll find detailed information about each processor, including supported file types, returned data formats, and their purpose.
      8 +
      9 + ## Processors
     10 +
     11 + ### 1. **Text Processor**
     12 + - **Purpose**: Handles direct text input provided by the user.
     13 + - **Supported Input**: Raw text strings.
     14 + - **Returned Data**: The input text as-is, wrapped in a structured format compatible with Content Core's output schema.
     15 + - **Location**: `src/content_core/processors/text.py`
     16 +
     17 + ### 2. **Web Processor**
     18 + - **Purpose**: Extracts content from web URLs, focusing on meaningful text while ignoring boilerplate (ads, navigation, etc.).
     19 + - **Supported Input**: URLs (web pages).
     20 + - **Returned Data**: Extracted text content from the web page, often in a cleaned format.
     21 + - **Location**: `src/content_core/processors/web.py`
     22 +
     23 + ### 3. **File Processor**
     24 + - **Purpose**: Processes local files of various types, extracting content based on file format.
     25 + - **Supported Input**: Local files including:
     26 + - Text-based formats: `.txt`, `.md` (Markdown), `.html`, etc.
     27 + - Document formats: `.pdf`, `.docx`, etc.
     28 + - Media files: `.mp4`, `.mp3` (audio/video, via transcription).
     29 + - **Returned Data**: Extracted text content or transcriptions (for media files), structured according to Content Core's schema.
     30 + - **Location**: `src/content_core/processors/file.py`
     31 +
     32 + ### 4. **Media Transcription Processor**
     33 + - **Purpose**: Specifically handles transcription of audio and video files using external services or libraries.
     34 + - **Supported Input**: Audio and video files (e.g., `.mp3`, `.mp4`).
     35 + - **Returned Data**: Transcribed text from the media content.
     36 + - **Location**: `src/content_core/processors/transcription.py`
     37 +
     38 + ## How Processors Work
     39 +
     40 + Content Core automatically selects the appropriate processor based on the input type:
     41 + - If a URL is provided, the Web Processor is used.
     42 + - If a file path is provided, the File Processor determines the file type and delegates to specialized handlers (like the Media Transcription Processor for audio/video).
     43 + - If raw text is provided, the Text Processor handles it directly.
     44 +
     45 + Each processor returns data in a consistent format, allowing seamless integration with other components of Content Core for further processing (like cleaning or summarization).
     46 +
     47 + ## Custom Processors
     48 +
     49 + Developers can extend Content Core by creating custom processors for unsupported file types or specialized extraction needs. To do so, create a new processor module in `src/content_core/processors/` and ensure it adheres to the expected interface for integration with the content extraction pipeline.
     50 +
     51 + ## Contributing
     52 +
     53 + If you have suggestions for improving existing processors or adding support for new file types, please contribute to the project by submitting a pull request or opening an issue on the GitHub repository.
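The selection rules listed under "How Processors Work" in the added `docs/processors.md` (URL, then file path, then raw text) can be sketched as a small dispatcher. This is only an illustration of those three rules and is not taken from the package: the function name `select_processor`, the returned labels, the `MEDIA_SUFFIXES` set, and the use of a file-existence check to tell a path from raw text are all assumptions.

```python
import os
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical suffix set, limited to the media examples named in the document.
MEDIA_SUFFIXES = {".mp3", ".mp4"}


def select_processor(source: str) -> str:
    """Return a label for the processor that would handle `source`."""
    # Rule 1: an http(s) URL goes to the web processor.
    try:
        parsed = urlparse(source)
    except ValueError:
        parsed = None  # Malformed URL-like text is treated as non-URL input.
    if parsed and parsed.scheme in ("http", "https") and parsed.netloc:
        return "web"

    # Rule 2: an existing file goes to the file processor, which hands
    # audio/video over to the transcription handler.
    if os.path.isfile(source):
        if Path(source).suffix.lower() in MEDIA_SUFFIXES:
            return "media_transcription"
        return "file"

    # Rule 3: anything else is treated as raw text.
    return "text"
```

Checking for a URL before touching the file system keeps the common web case free of disk access, and treating unmatched input as text makes the text branch the safe default.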