PyPI - sdg-hub - Versions diffs - 0.2.2__tar.gz → 0.3.0__tar.gz - Mend

sdg-hub 0.2.2tar.gz → 0.3.0tar.gz


Files changed (213)

{sdg_hub-0.2.2 → sdg_hub-0.3.0}/.github/workflows/pypi.yaml RENAMED

@@ -78,7 +78,7 @@ jobs:
                   path: dist
             - name: "Upload to Test PyPI"
-              uses: pypa/gh-action-pypi-publish@76f52bc884231f62b9a034ebfe128415bbaabdfc # v1.12.4
+              uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e # v1.13.0
               with:
                   repository-url: https://test.pypi.org/legacy/
@@ -130,4 +130,4 @@ jobs:
                   rm ./dist/*.sigstore.json
             - name: "Upload to PyPI"
-              uses: pypa/gh-action-pypi-publish@76f52bc884231f62b9a034ebfe128415bbaabdfc # v1.12.4
+              uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e # v1.13.0

{sdg_hub-0.2.2/src/sdg_hub.egg-info → sdg_hub-0.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: sdg_hub
-Version: 0.2.2
+Version: 0.3.0
 Summary: Synthetic Data Generation
 Author-email: Red Hat AI Innovation <abhandwa@redhat.com>
 License: Apache-2.0
@@ -53,6 +53,7 @@ Requires-Dist: sentence-transformers; extra == "examples"
 Requires-Dist: instructor; extra == "examples"
 Requires-Dist: fastapi; extra == "examples"
 Requires-Dist: nest-asyncio; extra == "examples"
+Requires-Dist: ipykernel; extra == "examples"
 Provides-Extra: dev
 Requires-Dist: pre-commit<4.0,>=3.0.4; extra == "dev"
 Requires-Dist: pylint<4.0,>=2.16.2; extra == "dev"
@@ -63,6 +64,7 @@ Requires-Dist: pytest-cov; extra == "dev"
 Requires-Dist: pytest-html; extra == "dev"
 Requires-Dist: tox<5,>=4.4.2; extra == "dev"
 Requires-Dist: ruff; extra == "dev"
+Requires-Dist: pytest-env; extra == "dev"
 Dynamic: license-file
 # `sdg_hub`: Synthetic Data Generation Toolkit

sdg_hub-0.3.0/examples/knowledge_tuning/enhanced_summary_knowledge_tuning/.env.example ADDED

@@ -0,0 +1,59 @@
+# SDG Knowledge Tuning Configuration
+# Copy this file to .env and update the values as needed
+# =============================================================================
+# MODEL CONFIGURATION
+# Choose one of the following model providers: hosted_vllm, openai, ollama
+# =============================================================================
+MODEL_PROVIDER=hosted_vllm
+LITELLM_MODE=PRODUCTION
+# =============================================================================
+# HOSTED VLLM CONFIGURATION
+# =============================================================================
+VLLM_MODEL=hosted_vllm/meta-llama/Llama-3.3-70B-Instruct
+VLLM_API_BASE=http://localhost:8000/v1
+VLLM_API_KEY=EMPTY
+ENABLE_REASONING=false
+# =============================================================================
+# OPENAI CONFIGURATION
+# =============================================================================
+OPENAI_API_KEY=your-openai-api-key-here
+OPENAI_MODEL=openai/gpt-5
+# =============================================================================
+# OLLAMA CONFIGURATION
+# =============================================================================
+OLLAMA_MODEL=ollama/gemma3
+OLLAMA_API_BASE=http://localhost:11434
+# =============================================================================
+# MAAS CONFIGURATION (Red Hat AI Services)
+# =============================================================================
+MAAS_MODEL=your-provisioned-model-name
+MAAS_API_BASE=your-provisioned-model-url
+MAAS_API_KEY=your-maas-api-key-here
+# =============================================================================
+# DATA CONFIGURATION
+# =============================================================================
+SEED_DATA_PATH=seed_data_val.jsonl
+OUTPUT_DATA_FOLDER=output_data
+RUN_ON_VALIDATION_SET=true
+NUMBER_OF_SUMMARIES=50
+# =============================================================================
+# DATA MIXING CONFIGURATION
+# =============================================================================
+SAVE_GPT_OSS_FORMAT=false
+STUDENT_MODEL=meta-llama/Llama-3.1-8B-Instruct
+# Cut sizes for data mixing (comma-separated list of integers)
+# Number of summaries to pick per document for each cut
+CUT_SIZES=3,5,7
+# Number of Q&A pairs per document
+QA_PER_DOC=3
+HF_TOKEN=your-hf-token
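
The `.env.example` above groups settings by provider and selects one with `MODEL_PROVIDER`. As a rough illustration of how such a file could be consumed, the sketch below parses `KEY=VALUE` lines with the standard library only and picks the settings for the chosen provider. The helper names (`parse_env`, `resolve_model`, `parse_cut_sizes`) and the `PROVIDER_KEYS` mapping are hypothetical and are not taken from sdg_hub; the example notebooks may load and interpret these variables differently.

```python
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Hypothetical mapping from provider name to the variables it reads.
# Only the three providers named in the file's header comment are covered.
PROVIDER_KEYS = {
    "hosted_vllm": {"model": "VLLM_MODEL", "api_base": "VLLM_API_BASE", "api_key": "VLLM_API_KEY"},
    "openai": {"model": "OPENAI_MODEL", "api_key": "OPENAI_API_KEY"},
    "ollama": {"model": "OLLAMA_MODEL", "api_base": "OLLAMA_API_BASE"},
}


def resolve_model(env: Dict[str, str]) -> Dict[str, str]:
    """Return the model settings for the provider named by MODEL_PROVIDER."""
    provider = env.get("MODEL_PROVIDER", "")
    if provider not in PROVIDER_KEYS:
        raise ValueError(f"Unsupported MODEL_PROVIDER: {provider!r}")
    return {field: env[name] for field, name in PROVIDER_KEYS[provider].items()}


def parse_cut_sizes(env: Dict[str, str]) -> List[int]:
    """Turn the comma-separated CUT_SIZES value into a list of integers."""
    return [int(part) for part in env.get("CUT_SIZES", "").split(",") if part.strip()]


# A few lines copied from the .env.example shown in the diff.
SAMPLE = """
MODEL_PROVIDER=hosted_vllm
VLLM_MODEL=hosted_vllm/meta-llama/Llama-3.3-70B-Instruct
VLLM_API_BASE=http://localhost:8000/v1
VLLM_API_KEY=EMPTY
# Cut sizes for data mixing (comma-separated list of integers)
CUT_SIZES=3,5,7
"""

env = parse_env(SAMPLE)
model_settings = resolve_model(env)
cut_sizes = parse_cut_sizes(env)
```

Keeping the provider-to-variable mapping in one table means that supporting another provider block, such as the MAAS settings, would only need one more entry, assuming it follows the same model/base/key pattern.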

sdg_hub-0.3.0/examples/knowledge_tuning/enhanced_summary_knowledge_tuning/README.md ADDED

@@ -0,0 +1,76 @@
+# Knowledge Tuning with Enhanced Summaries
+## Objective
+Pre-trained language models typically encounter most facts in their training data only **once or twice**, if at all. As a result, knowledge of specific details—especially **proprietary or domain-specific documents**—is often incomplete or missing.
+This pipeline is designed to **inject new knowledge** from a given set of documents into an instruction-tuned model. By generating **multiple document augmentations** (summaries, extractive passages, atomic facts) and **synthetic Q\&A pairs**, we repeat and reinforce important information. This repetition helps the model:
+* **Memorize facts** it has rarely or never seen before.
+* **Generalize across augmentations**, improving reliability when queried.
+* **Adapt to proprietary knowledge sources** that were absent from pre-training.
+The final product is a **high-quality training dataset** suitable for fine-tuning, enabling models to answer queries more accurately and faithfully based on the injected documents.
+---
+## 1. Document Summarization
+To bootstrap the process, we generate **three complementary types of summaries** for each source document. This ensures the model captures content at multiple levels of abstraction:
+* **Detailed Summaries** – Rich, comprehensive overviews of the document.
+* **Extractive Summaries** – Directly extracted sentences and passages representing the most important parts.
+* **Atomic Facts** – Concise, standalone factual statements distilled from the text.
+This multi-perspective approach improves the model’s ability to **memorize, generalize, and recall** key knowledge.
+---
+## 2. Synthetic Q\&A Generation
+With summaries in place, we scale up training data via **synthetic Q\&A generation**:
+* Users provide a small set of **seed examples** (initial Q\&A pairs).
+* The pipeline uses these seeds to generate a large set of **contextually grounded Q\&A pairs**, tightly linked to the summarized documents.
+* This expands sparse seed data into a **rich, diverse training dataset** suitable for fine-tuning.
+---
+## 3. Quality Control
+High-quality training data is essential. To ensure faithfulness and accuracy, we employ a **teacher-model evaluation loop**:
+1. Provide the model with a generated answer and the original document.
+2. Ask it to extract each factual claim from the answer.
+3. Verify whether each claim is **explicitly supported** by the document.
+Only claims passing this check are retained. This process filters out **hallucinations and unsupported statements**, ensuring reliable Q\&A pairs.
+---
+## Data Generation Statistics
+### Summary Augmentation
+Each “cut” represents the total number of summaries generated per document across all three augmentation types.
+| Cut (NUMBER\_OF\_SUMMARIES = 3) | Token Count |
+| ------------------------------- | ----------- |
+| 1                               | 2,193,502   |
+| 2                               | 4,383,655   |
+| 5                               | 10,870,396  |
+| 10                              | 21,815,170  |
+| 20                              | 43,601,976  |
+| 30                              | 65,395,710  |
+| 40                              | 87,118,308  |
+| 50                              | 108,779,213 |
+---
+### Finance Bench Example
+For Finance Bench (NUMBER\_OF\_SUMMARIES = 1):
+| Cut | Token Count |
+| --- | ----------- |
+| 50  | 213,333,192 |
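
The README's Quality Control section describes a three-step check: give the teacher model an answer and the source document, extract each factual claim, and keep only claims the document explicitly supports. The control flow can be sketched as below. This is a minimal illustration, not the pipeline's implementation: `filter_supported_claims` is a hypothetical name, and the two `toy_*` functions are deliberately naive stand-ins (sentence splitting and substring matching) for what the README describes as teacher-model calls.

```python
from typing import Callable, List


def filter_supported_claims(
    answer: str,
    document: str,
    extract_claims: Callable[[str, str], List[str]],
    is_supported: Callable[[str, str], bool],
) -> List[str]:
    """Keep only the claims from `answer` that `document` supports.

    Both callables are placeholders for teacher-model calls.
    """
    claims = extract_claims(answer, document)  # step 2: extract claims
    return [c for c in claims if is_supported(c, document)]  # step 3: verify


def toy_extract_claims(answer: str, document: str) -> List[str]:
    """Naive stand-in: treat each sentence of the answer as one claim."""
    return [s.strip() for s in answer.split(".") if s.strip()]


def toy_is_supported(claim: str, document: str) -> bool:
    """Naive stand-in: a claim counts as supported if it appears verbatim."""
    return claim.lower() in document.lower()


document = "The pipeline generates detailed summaries. It also produces atomic facts."
answer = "The pipeline generates detailed summaries. It was released in 1999."

kept = filter_supported_claims(answer, document, toy_extract_claims, toy_is_supported)
```

Passing the extraction and verification steps in as callables keeps the filtering logic independent of any particular model client, so the toy functions can be swapped for real teacher-model prompts without changing the loop.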
