sdg-hub 0.1.3__tar.gz → 0.1.4__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {sdg_hub-0.1.3/src/sdg_hub.egg-info → sdg_hub-0.1.4}/PKG-INFO +1 -1
- sdg_hub-0.1.4/examples/knowledge_tuning/README.md +115 -0
- sdg_hub-0.1.4/examples/knowledge_tuning/instructlab/README.md +110 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/instructlab/knowledge_generation_and_mixing.ipynb +2 -3
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_utils.py +39 -14
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/_version.py +2 -2
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flows/generation/knowledge/synth_knowledge1.5.yaml +9 -21
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/prompts.py +11 -5
- {sdg_hub-0.1.3 → sdg_hub-0.1.4/src/sdg_hub.egg-info}/PKG-INFO +1 -1
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub.egg-info/SOURCES.txt +2 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/actionlint.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/actions/free-disk-space/action.yml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/dependabot.yml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/mergify.yml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/workflows/actionlint.dockerfile +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/workflows/actionlint.yml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/workflows/docs.yml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/workflows/e2e.yml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/workflows/lint.yml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/workflows/matchers/actionlint.json +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/workflows/matchers/pylint.json +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/workflows/pypi.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.github/workflows/test.yml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.gitignore +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.isort.cfg +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.markdownlint-cli2.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.pre-commit-config.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/.pylintrc +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/CLAUDE.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/CONTRIBUTING.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/LICENSE +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/MANIFEST.in +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/Makefile +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/README.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/assets/imgs/IL_skills_pipeline.png +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/assets/imgs/fig-workflow.png +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/assets/imgs/instructlab-banner.png +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/assets/imgs/overview.png +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/.nojekyll +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/README.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/_coverpage.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/_navbar.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/_sidebar.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/architecture.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/blocks.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/changelog.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/configuration.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/development.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/examples.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/index.html +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/installation.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/prompts.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/quick-start.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/docs/web-interface.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/data-generation-with-llama-70b/data-generation-with-llama-70b.ipynb +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/data-generation-with-llama-70b/synth_knowledge1.5_llama3.3.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/instructlab/docparser.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/instructlab/docparser_v2.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/instructlab/document_collection/ibm-annual-report/ibm-annual-report-2024.json +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/instructlab/document_collection/ibm-annual-report/ibm-annual-report-2024.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/instructlab/document_collection/ibm-annual-report/ibm-annual-report-2024.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/instructlab/document_collection/ibm-annual-report/qna.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/instructlab/document_pre_processing.ipynb +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/README.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/assets/customized_nano_quality_results.png +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/blocks/blocks.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/flows/synth_knowledge1.5_nemotron_super_49b.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/flows/synth_knowledge_reasoning_nemotron_super_49b.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/flows/synth_knowledge_reasoning_nemotron_super_49b_rewrite_with_diversity.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/flows/synth_knowledge_reasoning_nemotron_super_49b_summary_diversity.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/flows/synth_knowledge_reasoning_nemotron_super_49b_summary_diversity_cot.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/generate.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/prompts/generate_answers.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/prompts/generate_answers_cot.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/prompts/generate_doc_rewrite_inst.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/prompts/generate_document_rewrite.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/prompts/generate_questions.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/prompts/generate_questions_responses.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/prompts/generate_summary.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/prompts/generate_summary_inst.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/reasoning_sdg.ipynb +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/reasoning_sdg_data_mixing.ipynb +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/reasoning_sdg_financebench.ipynb +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/knowledge_tuning_with_reasoning_model/utils.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/README.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/annotation_classification.ipynb +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/blocks/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/blocks/add_question.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/blocks/docling_parse_pdf.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/blocks/json_format.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/flows/detailed_annotation.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/flows/grounded_summary_extraction.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/flows/simple_annotation.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/flows/unstructured_to_structured.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/prompts/keywords.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/prompts/named_entities.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/prompts/sentiment.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/prompts/summary.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/09b5b62d328d3d0719b6825357fdfb48.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/0d631e444d1c22f0be99a69f5deaff94.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/1270f7f67f406b52a2ee86584b452bff.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/14f3d2486b21e639a953afb7ad03d90c.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/1689b94530eca82b7758c86b4cf3125f.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/171fd9df333ddd814c764843ed624121.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/1949bd0c9c4c23d495d880c4c552bfe1.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/2b626b620ef42f716c6028c74ee4187b.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/3877b1983229ec488c6349a188bccf92.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/3bc6d3e1c0a117340d288c289bf7f679.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/3e714a49937be1672aa48244ba7254ce.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/6064088db0200b32f3f3e848047c5ab6.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/73c60e60043b8775dac929320839a8c6.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/77423f08f0208d476dea73c639f6293a.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/78cf0d3e40caba622d8914916f0f9146.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/7a29e2dcd505f944b16d1e3173cb1c01.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/8c1b4f4af2af2847a240041390e31399.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/8cd753ed00aeee0ed32d03823eef3f7e.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/a24a661c2eb55542903c72391ec09f9b.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/b3d7bc295d09d9927e465213612c0192.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/b7050f62f52a3d2803beea21404f7af6.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/b9b40b0c1e92fb226067bdceacbdab5c.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/c20824ea6f927fe380f48a904cf4821b.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/c2bad61ce58687fad602549f6048004b.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/c47a92e006b54d014a79b447528c55a7.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/da879f8ea1c23aa6565cccaacac271fc.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/e52e6870e8a04339ef969543fc0f0329.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/ecd8e1f1c0fa27dfdd24b358cb65012f.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/f28832481653818f8062a497655fb09e.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts/ff898f396d49760343d08575ea773b54.pdf +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/financial_call_transcripts.jsonl +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/table_manipulation_qna.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/seed_data/unstructured_to_structured_qna.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/structured_summary.ipynb +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/table_manipulation.ipynb +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/skills_tuning/instructlab/unstructured_to_structured.ipynb +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/pyproject.toml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/scripts/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/scripts/ruff.sh +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/setup.cfg +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/blocks/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/blocks/block.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/blocks/llmblock.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/blocks/openaichatblock.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/blocks/utilblocks.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/checkpointer.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/annotations/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/annotations/cot_reflection.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/annotations/detailed_annotations.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/annotations/detailed_description.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/annotations/detailed_description_icl.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/annotations/simple_annotations.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/atomic_facts.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/auxilary_instructions.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/detailed_summary.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/evaluate_faithfulness.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/evaluate_question.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/evaluate_relevancy.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/extractive_summary.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/generate_code_questions_responses.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/generate_questions.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/generate_questions_responses.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/generate_responses.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/mcq_generation.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/router.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/knowledge/simple_generate_qa.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/reasoning/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/reasoning/dynamic_cot.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/analyzer.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/annotation.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/contexts.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/critic.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/evaluate_freeform_pair.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/evaluate_freeform_questions.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/evaluate_grounded_pair.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/evaluate_grounded_questions.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/freeform_questions.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/freeform_responses.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/grounded_questions.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/grounded_responses.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/icl_examples/STEM.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/icl_examples/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/icl_examples/coding.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/icl_examples/extraction.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/icl_examples/humanities.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/icl_examples/math.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/icl_examples/reasoning.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/icl_examples/roleplay.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/icl_examples/writing.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/judge.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/planner.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/respond.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/revised_responder.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/router.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/simple_generate_qa_freeform.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/configs/skills/simple_generate_qa_grounded.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flow.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flow_runner.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flows/generation/knowledge/mmlu_bench.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flows/generation/knowledge/simple_knowledge.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flows/generation/knowledge/synth_knowledge.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flows/generation/skills/improve_responses.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flows/generation/skills/simple_freeform_skill.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flows/generation/skills/simple_grounded_skill.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flows/generation/skills/synth_grounded_skills.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flows/generation/skills/synth_skills.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/logger_config.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/pipeline.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/py.typed +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/registry.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/sdg.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/utils/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/utils/config_validation.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/utils/datautils.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/utils/error_handling.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/utils/path_resolution.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/utils/validation_result.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub.egg-info/dependency_links.txt +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub.egg-info/requires.txt +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub.egg-info/top_level.txt +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/__init__.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/blocks/test_llmblock.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/blocks/test_openaichatblock.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/blocks/testdata/test_config.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/blocks/utilblocks/test_combinecolumns.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/blocks/utilblocks/test_duplicatecolumnsblock.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/blocks/utilblocks/test_filterblock.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/blocks/utilblocks/test_flattenblock.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/blocks/utilblocks/test_renameblock.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/blocks/utilblocks/test_samplepopulatorblock.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/blocks/utilblocks/test_selectorblock.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/blocks/utilblocks/test_settomajority.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/flows/test_flow.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/flows/test_flow_column_validation.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/flows/test_flow_path.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/flows/test_flow_validation.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/flows/testdata/test_config_1.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/flows/testdata/test_flow_1.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/flows/testdata/test_flow_2.yaml +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/test_checkpointer.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/test_flowrunner.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/test_pipeline.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/test_sdg.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/utils/test_config_validation.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/utils/test_error_handling.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tests/utils/test_path_resolution.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/tox.ini +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/web_interface/README.md +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/web_interface/app.py +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/web_interface/static/css/style.css +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/web_interface/static/js/app.js +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/web_interface/templates/index.html +0 -0
- {sdg_hub-0.1.3 → sdg_hub-0.1.4}/web_interface/test_block_types.py +0 -0
@@ -0,0 +1,115 @@
|
|
1
|
+
# Synthetic Data Generation for Knowledge Tuning
|
2
|
+
|
3
|
+
## What is Knowledge Tuning?
|
4
|
+
|
5
|
+
**Knowledge tuning** is the process of adapting a large language model (LLM) to new factual content by training it on specific documents. The goal is to enable the model to **recall and reason over document-grounded information** when performing downstream tasks such as:
|
6
|
+
|
7
|
+
* Question Answering
|
8
|
+
* Summarization
|
9
|
+
* Entity Extraction
|
10
|
+
* Other document-specific reasoning tasks
|
11
|
+
|
12
|
+
This adaptation can be used:
|
13
|
+
|
14
|
+
* As a **standalone fine-tuned model**, or
|
15
|
+
* As part of a **Retrieval-Augmented Generation (RAG)** pipeline to enhance factual accuracy and contextuality.
|
16
|
+
|
17
|
+
---
|
18
|
+
|
19
|
+
### Setup Instructions
|
20
|
+
|
21
|
+
#### Install sdg-hub
|
22
|
+
|
23
|
+
```bash
|
24
|
+
pip install sdg-hub==0.1.0a4
|
25
|
+
```
|
26
|
+
|
27
|
+
#### Install with optional dependencies
|
28
|
+
|
29
|
+
If you want to use the vLLM server, you can install it with the following command:
|
30
|
+
|
31
|
+
```bash
|
32
|
+
pip install sdg-hub[vllm]
|
33
|
+
```
|
34
|
+
|
35
|
+
In order to use docling, you need to install it with the following command:
|
36
|
+
|
37
|
+
```bash
|
38
|
+
pip install sdg-hub[examples]
|
39
|
+
```
|
40
|
+
|
41
|
+
### Serving the Teacher Model
|
42
|
+
|
43
|
+
#### vLLM Server
|
44
|
+
|
45
|
+
Launch the vLLM server with the following command:
|
46
|
+
|
47
|
+
```bash
|
48
|
+
vllm serve meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 4
|
49
|
+
```
|
50
|
+
|
51
|
+
## Repository Structure
|
52
|
+
|
53
|
+
This repository demonstrates how to generate synthetic data for knowledge tuning using different approaches:
|
54
|
+
|
55
|
+
### Examples
|
56
|
+
|
57
|
+
1. [`instructlab/`](instructlab/):
|
58
|
+
Implements knowledge tuning using the **InstructLab** pipeline, which supports a two-phase approach:
|
59
|
+
|
60
|
+
* Phase 1: Knowledge tuning via synthetic QAs
|
61
|
+
* Phase 2: Instruction tuning to generalize reasoning skills
|
62
|
+
|
63
|
+
2. [`knowledge_tuning_with_reasoning_model/`](knowledge_tuning_with_reasoning_model/):
|
64
|
+
Uses **Nemotron Super** as the teacher model to generate reasoning-focused synthetic data grounded in document content. We also show how to edit the knowledge pipeline to introduce new types of summaries.
|
65
|
+
|
66
|
+
Each example includes:
|
67
|
+
|
68
|
+
* Source document processing
|
69
|
+
* QA generation with a teacher model
|
70
|
+
* Filtering and validation logic
|
71
|
+
* Dataset formatting for fine-tuning
|
72
|
+
|
73
|
+
3. [`translation_example`](translation_example/):
|
74
|
+
Implements a translation block to translate an article into a target language for generating knowledge QA. The example scripts show how to translate a Kannada Wikipedia article into English and generate synthetic QA to train a model.
|
75
|
+
---
|
76
|
+
|
77
|
+
## Data Post-Processing
|
78
|
+
|
79
|
+
Once synthetic QA data is generated, you’ll need to prepare it for training:
|
80
|
+
|
81
|
+
### Key Practices
|
82
|
+
|
83
|
+
* Append source document content to the generated QA to improve memorization and coverage.
|
84
|
+
* During training, backpropagate on both the **prompt** (document + question) and the **response** (answer).
|
85
|
+
* For `instructlab.training`, you can use the `unmask` field to enable pretraining-style loss computation over the full prompt-response.
|
86
|
+
|
87
|
+
### Creating QA dataset
|
88
|
+
|
89
|
+
* You can use the function below to transform the generated dataset into Prompt + Response pairs for training in messages format.
|
90
|
+
* You can control various parameters, such as appending the document to the question, adding a document outline to the document, etc.
|
91
|
+
```python
|
92
|
+
from knowledge_utils import generate_knowledge_qa_dataset
|
93
|
+
|
94
|
+
knowl_train = generate_knowledge_qa_dataset(
|
95
|
+
generated_dataset=generated_data,
|
96
|
+
keep_context_separate=False,
|
97
|
+
keep_document_outline=True,
|
98
|
+
keep_columns=['document', 'document_outline', 'raw_document']
|
99
|
+
)
|
100
|
+
```
|
101
|
+
* `keep_context_separate=False`: Includes the document in the prompt
|
102
|
+
* `keep_document_outline=True`: Adds structure to the prompt using outline
|
103
|
+
* `keep_columns`: Retains metadata for record-keeping (not used in training)
|
104
|
+
|
105
|
+
|
106
|
+
### Workflow: InstructLab (Knowledge + RAFT)
|
107
|
+
You can find steps for data post-processing [here](instructlab/README.md#data-post-processing)
|
108
|
+
|
109
|
+
### Workflow: Fine-tuning Instruct Model
|
110
|
+
|
111
|
+
* You can simply take the generated data and continue instruction tuning an existing instruct model (e.g. Qwen 2.5 8B Instruct, Llama 3.3 8B/70B, etc.)
|
112
|
+
* Simply follow [Creating QA dataset](#creating-qa-dataset) section for creating the training data.
|
113
|
+
* Note: The model might suffer catastrophic forgetting and might need a replay buffer of instruction data. Alternatively, you might need to explore other methods such as parameter-efficient fine-tuning.
|
114
|
+
|
115
|
+
|
@@ -0,0 +1,110 @@
|
|
1
|
+
# InstructLab: Synthetic Data Generation for Knowledge Tuning
|
2
|
+

|
3
|
+
|
4
|
+
The provided notebooks show a sdg_hub pipeline for generating high-quality synthetic data from your documents. By following the methodology of the **LAB (Large-scale Alignment for Chatbots)** framework, as detailed in our [research paper](https://arxiv.org/pdf/2403.01081), you can effectively tune language models to master the knowledge contained within your specific-domain documentation.
|
5
|
+
|
6
|
+
## How It Works: A Three-Phase Pipeline
|
7
|
+
|
8
|
+
Our data generation process is designed to operate in three distinct phases, ensuring the creation of a robust and reliable dataset for model training.
|
9
|
+
|
10
|
+
### 1\. Document Summarization
|
11
|
+
|
12
|
+
To kickstart the process, we generate three unique summaries of your source documents. This multi-faceted approach helps the model to thoroughly memorize and recall the key information. The summaries include:
|
13
|
+
|
14
|
+
* **Detailed Summaries:** Comprehensive overviews of the content.
|
15
|
+
* **Extractive Summaries:** Key sentences and passages pulled directly from the text.
|
16
|
+
* **Atomic Facts:** A list of the most critical, standalone pieces of information.
|
17
|
+
|
18
|
+
### 2\. Synthetic Q\&A Generation
|
19
|
+
|
20
|
+
Next, our pipeline leverages user-provided "seed examples"—sample questions and answers—to generate a wealth of synthetic Q\&A pairs. These new pairs are contextually grounded in the summarized documents, effectively scaling up your initial examples into a diverse training dataset.
|
21
|
+
|
22
|
+
### 3\. Quality Control
|
23
|
+
|
24
|
+
To ensure the integrity of our generated data, we employ a quality-checking phase. Using a "teacher" model, we perform a faithfulness evaluation by:
|
25
|
+
|
26
|
+
1. Providing the model with a generated answer and the original source document.
|
27
|
+
2. Tasking the model to extract every claim made in the answer.
|
28
|
+
3. Verifying that each claim is factually supported by the provided document.
|
29
|
+
|
30
|
+
This process filters out inaccuracies and ensures that only high-quality, faithful Q\&A pairs make it into the final dataset.
|
31
|
+
|
32
|
+
## Getting Started
|
33
|
+
|
34
|
+
To begin using the pipeline, simply install the `sdg_hub` library. From there, you can instantiate and run the synthetic data generation process with the following code:
|
35
|
+
|
36
|
+
```python
|
37
|
+
from sdg_hub.flow import Flow
|
38
|
+
from sdg_hub.sdg import SDG
|
39
|
+
|
40
|
+
# Load the knowledge generation pipeline from the YAML file
|
41
|
+
knowledge_agentic_pipeline = "../../../src/sdg_hub/flows/generation/knowledge/synth_knowledge1.5.yaml"
|
42
|
+
flow = Flow(openai_client).get_flow_from_file(knowledge_agentic_pipeline)
|
43
|
+
|
44
|
+
# Initialize the Synthetic Data Generator
|
45
|
+
generator = SDG(
|
46
|
+
flows=[flow],
|
47
|
+
num_workers=1,
|
48
|
+
batch_size=1,
|
49
|
+
save_freq=1000,
|
50
|
+
)
|
51
|
+
```
|
52
|
+
|
53
|
+
## InstructLab Training Methodology
|
54
|
+
|
55
|
+
Our training process is structured to build upon a pre-trained model, systematically enhancing its skills and knowledge.
|
56
|
+
|
57
|
+
1. **Foundation Training:** We begin by training a pre-trained model on foundational skills such as logic, coding, and math. The instruction data in this phase features short, direct responses.
|
58
|
+
2. **Foundation Knowledge:** Next, we expand the model's general knowledge base by training it on general textbooks and benchmarking it against MMLU. The result of these first two stages is what we term the **starter model**.
|
59
|
+
|
60
|
+
This **starter model** then serves as the base for our specialized, two-phase knowledge tuning:
|
61
|
+
|
62
|
+
* **Phase 1: Knowledge Tuning:** We do pre-training style training on the document generated data by our pipeline. This allows the model to internalize the new knowledge and be able to recall and answer questions based on it.
|
63
|
+
* **Phase 2: Skills Tuning:** Building on the previous phase, we do instruction tuning on general skills (combination of instruction tuning dataset and skills generated with sdg_hub). To prevent the model from forgetting the newly acquired knowledge, we mix in data from the previous stage. We also incorporate [RAFT-style](https://openreview.net/forum?id=rzQGHXNReU) data to enhance the model's robustness for RAG on the target documents.
|
64
|
+
|
65
|
+
## Data Post-Processing
|
66
|
+
|
67
|
+
After generating your data, use the provided utility functions to prepare it for the two-phase training process. All helper functions are located in `examples/knowledge_tuning/knowledge_utils.py`.
|
68
|
+
|
69
|
+
### 1\. Knowledge Dataset (for Phase 1)
|
70
|
+
|
71
|
+
This dataset is used for the initial knowledge-tuning phase. You can also merge datasets from multiple documents to train on a set of documents simultaneously.
|
72
|
+
|
73
|
+
This function also creates a summarization dataset that frames each generated summary as a task: given a document and an instruction, produce the document summary.
|
74
|
+
|
75
|
+
```python
|
76
|
+
from knowledge_utils import create_knowledge_pretraining_ds
|
77
|
+
|
78
|
+
# Create the dataset for knowledge pre-training
|
79
|
+
knowledge_data = create_knowledge_pretraining_ds(generated_dataset=generated_data)
|
80
|
+
```
|
81
|
+
|
82
|
+
### 2\. Skills Dataset (for Phase 2)
|
83
|
+
|
84
|
+
This dataset combines the knowledge-specific data with RAFT-style examples for the second phase of tuning. It can also be mixed with general instruction-tuning data to grant the model broad instruction-following abilities while retaining the specialized knowledge.
|
85
|
+
|
86
|
+
```python
|
87
|
+
from knowledge_utils import create_knowledge_regular_ds
|
88
|
+
from datasets import concatenate_datasets
|
89
|
+
|
90
|
+
# Create the dataset with RAFT and summary data
|
91
|
+
raft_and_summary_data = create_knowledge_regular_ds(generated_dataset=generated_data)
|
92
|
+
|
93
|
+
# Create the core knowledge dataset.
|
94
|
+
# The add_auxiliary_dataset parameter controls whether the summarization dataset described above is included in the returned dataset
|
95
|
+
knowledge_data = create_knowledge_pretraining_ds(generated_dataset=generated_data, add_auxiliary_dataset=False)
|
96
|
+
|
97
|
+
# Combine the datasets for the skills tuning phase
|
98
|
+
knowledge_skills_data = concatenate_datasets([raft_and_summary_data, knowledge_data])
|
99
|
+
```
|
100
|
+
|
101
|
+
---
|
102
|
+
|
103
|
+
## Generation Statistics
|
104
|
+
|
105
|
+
Default generation parameters (based on `llama-3.3-70B`) are defined in:
|
106
|
+
[`synth_knowledge1.5.yaml`](../../src/sdg_hub/flows/generation/knowledge/synth_knowledge1.5.yaml)
|
107
|
+
|
108
|
+
* The pipeline converts each input document into **3 summaries**
|
109
|
+
* Outputs vary based on the teacher model and generation parameters (e.g. `temperature`, `top_p`, `top_k`), which can be set in the `gen_kwargs` section of the flow.
|
110
|
+
* Generation currently uses temperature=0.0 and is deterministic.
|
@@ -34,7 +34,6 @@
|
|
34
34
|
"\n",
|
35
35
|
"# First Party\n",
|
36
36
|
"from sdg_hub.flow import Flow\n",
|
37
|
-
"from sdg_hub.pipeline import Pipeline\n",
|
38
37
|
"from sdg_hub.sdg import SDG\n",
|
39
38
|
"import sys\n",
|
40
39
|
"import os\n",
|
@@ -86,9 +85,9 @@
|
|
86
85
|
"outputs": [],
|
87
86
|
"source": [
|
88
87
|
"knowledge_agentic_pipeline = \"../../../src/instructlab/sdg/flows/generation/knowledge/synth_knowledge1.5.yaml\"\n",
|
89
|
-
"
|
88
|
+
"flow = Flow(client).get_flow_from_file(knowledge_agentic_pipeline)\n",
|
90
89
|
"sdg = SDG(\n",
|
91
|
-
" [
|
90
|
+
" flows=[flow],\n",
|
92
91
|
" num_workers=1,\n",
|
93
92
|
" batch_size=1,\n",
|
94
93
|
" save_freq=1000,\n",
|
@@ -268,7 +268,22 @@ def build_raft_dataset(ds: Dataset, p, num_doc_in_context=4):
|
|
268
268
|
|
269
269
|
|
270
270
|
def create_knowledge_regular_ds(generated_dataset: Dataset):
|
271
|
-
|
271
|
+
"""
|
272
|
+
Create a knowledge dataset for the Skills Phase of knowledge tuning.
|
273
|
+
|
274
|
+
This function generates QA datasets with RAFT-style context separation
|
275
|
+
and includes the auxiliary summarization dataset when one is available.
|
276
|
+
|
277
|
+
Parameters
|
278
|
+
----------
|
279
|
+
generated_dataset : Dataset
|
280
|
+
The input dataset containing generated knowledge content
|
281
|
+
|
282
|
+
Returns
|
283
|
+
-------
|
284
|
+
Dataset
|
285
|
+
Processed dataset ready for skills phase training
|
286
|
+
"""
|
272
287
|
knowledge_ds = generate_knowledge_qa_dataset(
|
273
288
|
generated_dataset, keep_context_separate=True
|
274
289
|
)
|
@@ -276,26 +291,36 @@ def create_knowledge_regular_ds(generated_dataset: Dataset):
|
|
276
291
|
|
277
292
|
auxiliary_dataset = create_auxiliary_dataset(generated_dataset)
|
278
293
|
if auxiliary_dataset is not None:
|
279
|
-
|
280
|
-
|
281
|
-
transformed_data = knowledge_ds
|
282
|
-
return transformed_data
|
294
|
+
knowledge_ds = safe_concatenate_datasets([knowledge_ds, auxiliary_dataset])
|
295
|
+
return knowledge_ds
|
283
296
|
|
284
297
|
|
285
|
-
def create_knowledge_pretraining_ds(generated_dataset: Dataset):
|
286
|
-
# Phase 0.7
|
298
|
+
def create_knowledge_pretraining_ds(generated_dataset: Dataset, add_auxiliary_dataset: bool = True):
|
299
|
+
# Phase 0.7 (Knowledge Phase)
|
300
|
+
"""
|
301
|
+
Create a knowledge dataset for the Knowledge Phase of knowledge tuning.
|
302
|
+
|
303
|
+
This function generates QA datasets for pretraining-style knowledge tuning
|
304
|
+
with optional auxiliary dataset inclusion.
|
305
|
+
|
306
|
+
Parameters
|
307
|
+
----------
|
308
|
+
generated_dataset (Dataset): The dataset containing generated knowledge data.
|
309
|
+
add_auxiliary_dataset (bool): Whether to include an auxiliary dataset.
|
310
|
+
|
311
|
+
Returns
|
312
|
+
-------
|
313
|
+
Dataset: The generated knowledge dataset.
|
314
|
+
"""
|
287
315
|
knowledge_ds = generate_knowledge_qa_dataset(
|
288
|
-
generated_dataset, keep_context_separate=False
|
289
|
-
)
|
316
|
+
generated_dataset, keep_context_separate=False)
|
290
317
|
knowledge_ds = knowledge_ds.map(_conv_pretrain)
|
291
318
|
|
292
319
|
auxiliary_dataset = create_auxiliary_dataset(generated_dataset)
|
293
|
-
if auxiliary_dataset is not None:
|
320
|
+
if auxiliary_dataset is not None and add_auxiliary_dataset:
|
294
321
|
auxiliary_dataset = auxiliary_dataset.map(_conv_pretrain)
|
295
|
-
|
296
|
-
|
297
|
-
transformed_data = knowledge_ds
|
298
|
-
return transformed_data
|
322
|
+
knowledge_ds = safe_concatenate_datasets([knowledge_ds, auxiliary_dataset])
|
323
|
+
return knowledge_ds
|
299
324
|
|
300
325
|
|
301
326
|
def fuse_texts(text_list, short_length_threshold=100):
|
{sdg_hub-0.1.3 → sdg_hub-0.1.4}/src/sdg_hub/flows/generation/knowledge/synth_knowledge1.5.yaml
RENAMED
@@ -12,9 +12,7 @@
|
|
12
12
|
output_cols:
|
13
13
|
- summary_detailed
|
14
14
|
gen_kwargs:
|
15
|
-
max_tokens:
|
16
|
-
temperature: 0.7
|
17
|
-
n: 50
|
15
|
+
max_tokens: 2048
|
18
16
|
|
19
17
|
- block_type: LLMBlock
|
20
18
|
block_config:
|
@@ -24,8 +22,7 @@
|
|
24
22
|
output_cols:
|
25
23
|
- summary_atomic_facts
|
26
24
|
gen_kwargs:
|
27
|
-
max_tokens:
|
28
|
-
temperature: 0.7
|
25
|
+
max_tokens: 2048
|
29
26
|
|
30
27
|
- block_type: LLMBlock
|
31
28
|
block_config:
|
@@ -35,8 +32,7 @@
|
|
35
32
|
output_cols:
|
36
33
|
- summary_extractive
|
37
34
|
gen_kwargs:
|
38
|
-
max_tokens:
|
39
|
-
temperature: 0.7
|
35
|
+
max_tokens: 2048
|
40
36
|
|
41
37
|
- block_type: FlattenColumnsBlock
|
42
38
|
block_config:
|
@@ -59,26 +55,18 @@
|
|
59
55
|
- block_type: LLMBlock
|
60
56
|
block_config:
|
61
57
|
block_name: knowledge generation
|
62
|
-
config_path: configs/knowledge/
|
58
|
+
config_path: configs/knowledge/generate_questions_responses.yaml
|
63
59
|
model_id: meta-llama/Llama-3.3-70B-Instruct
|
64
60
|
output_cols:
|
65
61
|
- question
|
62
|
+
- response
|
66
63
|
parser_kwargs:
|
67
64
|
parser_name: custom
|
68
|
-
parsing_pattern: "\\[(?:Question|QUESTION)\\]\\s*(.*?)\\s*(?=\\[(?:Question|QUESTION)\\]|$)"
|
69
|
-
|
70
|
-
|
71
|
-
max_tokens: 100
|
72
|
-
|
73
|
-
- block_type: LLMBlock
|
74
|
-
block_config:
|
75
|
-
block_name: knowledge generation
|
76
|
-
config_path: configs/knowledge/generate_responses.yaml
|
77
|
-
model_id: meta-llama/Llama-3.3-70B-Instruct
|
78
|
-
output_cols:
|
79
|
-
- response
|
65
|
+
parsing_pattern: "\\[(?:Question|QUESTION)\\]\\s*(.*?)\\s*\\[(?:Answer|ANSWER)\\]\\s*(.*?)\\s*(?=\\[(?:Question|QUESTION)\\]|$)"
|
66
|
+
parser_cleanup_tags:
|
67
|
+
- "[END]"
|
80
68
|
gen_kwargs:
|
81
|
-
temperature: 0.
|
69
|
+
temperature: 0.0
|
82
70
|
max_tokens: 2048
|
83
71
|
|
84
72
|
- block_type: LLMBlock
|
@@ -12,7 +12,7 @@ def instructlab_chat_template():
|
|
12
12
|
return """{% for message in messages %}{% if message['role'] == 'pretraining' %}{{ '<|pretrain|>' + message['content'] + '<|endoftext|>' + '<|/pretrain|>' }}{% elif message['role'] == 'system' %}{{ '<|system|>' + '\n' + message['content'] + '\n' }}{% elif message['role'] == 'user' %}{{ '<|user|>' + '\n' + message['content'] + '\n' }}{% elif message['role'] == 'assistant' %}{{ '<|assistant|>' + '\n' + message['content'] + '<|endoftext|>' + ('' if loop.last else '\n') }}{% endif %}{% if loop.last and add_generation_prompt %}{{ '<|assistant|>' + '\n' }}{% endif %}{% endfor %}"""
|
13
13
|
|
14
14
|
|
15
|
-
@PromptRegistry.register("mistralai")
|
15
|
+
@PromptRegistry.register("mistralai/Mixtral")
|
16
16
|
def mistral_chat_template():
|
17
17
|
return """{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content'] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n\n<s>\n{%- for message in loop_messages %}\n {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}\n {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}\n {%- endif %}\n {%- if message['role'] == 'user' %}\n {%- if loop.first and system_message is defined %}\n {{- ' [INST] ' + system_message + '\\n\\n' + message['content'] + ' [/INST]' }}\n {%- else %}\n {{- ' [INST] ' + message['content'] + ' [/INST]' }}\n {%- endif %}\n {%- elif message['role'] == 'assistant' %}\n {{- ' ' + message['content'] + '</s>'}}\n {%- else %}\n {{- raise_exception('Only user and assistant roles are supported, with the exception of an initial optional system message!') }}\n {%- endif %}\n{%- endfor %}\n"""
|
18
18
|
|
@@ -26,11 +26,12 @@ def meta_llama_chat_template():
|
|
26
26
|
def microsoft_phi_chat_template():
|
27
27
|
return """{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}"""
|
28
28
|
|
29
|
+
|
29
30
|
@PromptRegistry.register("nvidia/Llama-3_3-Nemotron-Super-49B-v1")
|
30
31
|
def nemotron_chat_template():
|
31
32
|
"""
|
32
33
|
Format chat messages for the Nemotron model, including a system prompt and structured message headers.
|
33
|
-
|
34
|
+
|
34
35
|
The template starts with a system message containing "detailed thinking on", then iterates over messages, wrapping each with start and end header tokens and an end-of-text token. For assistant messages containing a `</think>` tag, only the content after this tag is included. Optionally appends an assistant prompt if generation is requested.
|
35
36
|
"""
|
36
37
|
return """{{- bos_token }}
|
@@ -52,7 +53,7 @@ def nemotron_chat_template():
|
|
52
53
|
def qwen_2_5_chat_template():
|
53
54
|
"""
|
54
55
|
Formats chat messages into the prompt structure required by the Qwen 2.5 model family, supporting system messages, tool descriptions, function call instructions, and role-based message formatting.
|
55
|
-
|
56
|
+
|
56
57
|
If tools are provided, includes tool signatures and instructions for function calls in the system prompt. User, assistant, and tool messages are wrapped with special tokens, and assistant tool calls are serialized as JSON within XML tags. Optionally appends a generation prompt for the assistant.
|
57
58
|
"""
|
58
59
|
return """{%- if tools %}\n {{- \'<|im_start|>system\\n\' }}\n {%- if messages[0][\'role\'] == \'system\' %}\n {{- messages[0][\'content\'] }}\n {%- else %}\n {{- \'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\' }}\n {%- endif %}\n {{- "\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>" }}\n {%- for tool in tools %}\n {{- "\\n" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- "\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\"name\\": <function-name>, \\"arguments\\": <args-json-object>}\\n</tool_call><|im_end|>\\n" }}\n{%- else %}\n {%- if messages[0][\'role\'] == \'system\' %}\n {{- \'<|im_start|>system\\n\' + messages[0][\'content\'] + \'<|im_end|>\\n\' }}\n {%- else %}\n {{- \'<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. 
You are a helpful assistant.<|im_end|>\\n\' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}\n {{- \'<|im_start|>\' + message.role + \'\\n\' + message.content + \'<|im_end|>\' + \'\\n\' }}\n {%- elif message.role == "assistant" %}\n {{- \'<|im_start|>\' + message.role }}\n {%- if message.content %}\n {{- \'\\n\' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- \'\\n<tool_call>\\n{"name": "\' }}\n {{- tool_call.name }}\n {{- \'", "arguments": \' }}\n {{- tool_call.arguments | tojson }}\n {{- \'}\\n</tool_call>\' }}\n {%- endfor %}\n {{- \'<|im_end|>\\n\' }}\n {%- elif message.role == "tool" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}\n {{- \'<|im_start|>user\' }}\n {%- endif %}\n {{- \'\\n<tool_response>\\n\' }}\n {{- message.content }}\n {{- \'\\n</tool_response>\' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n {{- \'<|im_end|>\\n\' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- \'<|im_start|>assistant\\n\' }}\n{%- endif %}\n"""
|
@@ -62,9 +63,9 @@ def qwen_2_5_chat_template():
|
|
62
63
|
def qwen_3_chat_template():
|
63
64
|
"""
|
64
65
|
Formats chat messages for the Qwen 3 model family, supporting multi-step tool usage, reasoning content, and special XML tags for tool calls and responses.
|
65
|
-
|
66
|
+
|
66
67
|
This template handles system messages, user and assistant roles, and tool interactions. When tools are provided, it outputs their signatures and instructions for function calls. It tracks the last user query to determine where to insert assistant reasoning content within `<think>` tags. Assistant tool calls are serialized as JSON within `<tool_call>` tags, and tool responses are grouped inside `<tool_response>` tags. Optionally, a generation prompt and empty reasoning block can be added.
|
67
|
-
|
68
|
+
|
68
69
|
Parameters:
|
69
70
|
tools (optional): List of tool signature objects to be included in the prompt.
|
70
71
|
messages: List of message objects, each with a role and content, and optionally tool_calls or reasoning_content.
|
@@ -72,3 +73,8 @@ def qwen_3_chat_template():
|
|
72
73
|
enable_thinking (optional): If false, inserts an empty reasoning block in the assistant prompt.
|
73
74
|
"""
|
74
75
|
return """{%- if tools %}\n {{- \'<|im_start|>system\\n\' }}\n {%- if messages[0].role == \'system\' %}\n {{- messages[0].content + \'\\n\\n\' }}\n {%- endif %}\n {{- "# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>" }}\n {%- for tool in tools %}\n {{- "\\n" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- "\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\"name\\": <function-name>, \\"arguments\\": <args-json-object>}\\n</tool_call><|im_end|>\\n" }}\n{%- else %}\n {%- if messages[0].role == \'system\' %}\n {{- \'<|im_start|>system\\n\' + messages[0].content + \'<|im_end|>\\n\' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith(\'<tool_response>\') and message.content.endswith(\'</tool_response>\')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content = message.content %}\n {%- else %}\n {%- set content = \'\' %}\n {%- endif %}\n {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}\n {{- \'<|im_start|>\' + message.role + \'\\n\' + content + \'<|im_end|>\' + \'\\n\' }}\n {%- elif message.role == "assistant" %}\n {%- set reasoning_content = \'\' %}\n {%- if message.reasoning_content is string %}\n {%- set reasoning_content = message.reasoning_content %}\n {%- else %}\n {%- if \'</think>\' in content %}\n {%- set reasoning_content = 
content.split(\'</think>\')[0].rstrip(\'\\n\').split(\'<think>\')[-1].lstrip(\'\\n\') %}\n {%- set content = content.split(\'</think>\')[-1].lstrip(\'\\n\') %}\n {%- endif %}\n {%- endif %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if loop.last or (not loop.last and reasoning_content) %}\n {{- \'<|im_start|>\' + message.role + \'\\n<think>\\n\' + reasoning_content.strip(\'\\n\') + \'\\n</think>\\n\\n\' + content.lstrip(\'\\n\') }}\n {%- else %}\n {{- \'<|im_start|>\' + message.role + \'\\n\' + content }}\n {%- endif %}\n {%- else %}\n {{- \'<|im_start|>\' + message.role + \'\\n\' + content }}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- \'\\n\' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- \'<tool_call>\\n{"name": "\' }}\n {{- tool_call.name }}\n {{- \'", "arguments": \' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- \'}\\n</tool_call>\' }}\n {%- endfor %}\n {%- endif %}\n {{- \'<|im_end|>\\n\' }}\n {%- elif message.role == "tool" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}\n {{- \'<|im_start|>user\' }}\n {%- endif %}\n {{- \'\\n<tool_response>\\n\' }}\n {{- content }}\n {{- \'\\n</tool_response>\' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n {{- \'<|im_end|>\\n\' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- \'<|im_start|>assistant\\n\' }}\n {%- if enable_thinking is defined and enable_thinking is false %}\n {{- \'<think>\\n\\n</think>\\n\\n\' }}\n {%- endif %}\n{%- endif %}"""
|
76
|
+
|
77
|
+
|
78
|
+
@PromptRegistry.register("mistralai/Mistral-Small-3")
|
79
|
+
def mistral_small_3_chat_template():
|
80
|
+
return """{%- if not date_string is defined %}\n {%- set date_string = \"2025-01-01\" %}\n{%- endif %}\n{%- set default_system_message = \"You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.\\nYour knowledge base was last updated on 2023-10-01. The current date is \" + date_string + \".\\n\\nWhen you're not sure about some information, you say that you don't have the information and don't make up anything.\\nIf the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \\\"What are some good restaurants around me?\\\" => \\\"Where are you?\\\" or \\\"When is the next flight to Tokyo\\\" => \\\"Where do you travel from?\\\")\" %}\n\n<s>\n\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content'] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set system_message = default_system_message %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{{- '[SYSTEM_PROMPT]' + system_message + '[/SYSTEM_PROMPT]' }}\n\n{%- for message in loop_messages %}\n {%- if message['role'] == 'user' %}\n {{- '[INST]' + message['content'] + '[/INST]' }}\n {%- elif message['role'] == 'system' %}\n {{- '[SYSTEM_PROMPT]' + message['content'] + '[/SYSTEM_PROMPT]' }}\n {%- elif message['role'] == 'assistant' %}\n {{- message['content'] + '</s>' }}\n {%- else %}\n {{- raise_exception('Only user, system and assistant roles are supported!') }}\n {%- endif %}\n{%- endfor %}"""
|
@@ -44,9 +44,11 @@ docs/installation.md
|
|
44
44
|
docs/prompts.md
|
45
45
|
docs/quick-start.md
|
46
46
|
docs/web-interface.md
|
47
|
+
examples/knowledge_tuning/README.md
|
47
48
|
examples/knowledge_tuning/knowledge_utils.py
|
48
49
|
examples/knowledge_tuning/data-generation-with-llama-70b/data-generation-with-llama-70b.ipynb
|
49
50
|
examples/knowledge_tuning/data-generation-with-llama-70b/synth_knowledge1.5_llama3.3.yaml
|
51
|
+
examples/knowledge_tuning/instructlab/README.md
|
50
52
|
examples/knowledge_tuning/instructlab/docparser.py
|
51
53
|
examples/knowledge_tuning/instructlab/docparser_v2.py
|
52
54
|
examples/knowledge_tuning/instructlab/document_pre_processing.ipynb
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
{sdg_hub-0.1.3 → sdg_hub-0.1.4}/examples/knowledge_tuning/instructlab/document_pre_processing.ipynb
RENAMED
File without changes
|
File without changes
|