PyPI - sdg-hub - Versions diffs - 0.7.1__tar.gz → 0.7.3__tar.gz - Mend

sdg-hub 0.7.1tar.gz → 0.7.3tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (214)

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/.github/workflows/actionlint.dockerfile RENAMED

@@ -1,3 +1,3 @@
 # Since dependabot cannot update workflows using docker,
 # we use this indirection since dependabot can update this file.
-FROM rhysd/actionlint:1.7.9@sha256:a0383f60d92601e2694e24b24d37df7b6a40bed7cedbc447611c50009bf02d94
+FROM rhysd/actionlint:1.7.10@sha256:ef8299f97635c4c30e2298f48f30763ab782a4ad2c95b744649439a039421e36

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/.github/workflows/docs.yml RENAMED

@@ -39,6 +39,6 @@ jobs:
       - name: "Checkout"
         uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29 # v4.1.6
       - name: "Check Markdown documents"
-        uses: DavidAnson/markdownlint-cli2-action@30a0e04f1870d58f8d717450cc6134995f993c63 # v21.0.0
+        uses: DavidAnson/markdownlint-cli2-action@07035fd053f7be764496c0f8d8f9f41f98305101 # v22.0.0
         with:
           globs: '**/*.md'

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/.github/workflows/integration-test.yml RENAMED

@@ -112,7 +112,7 @@ jobs:
       - name: Cache huggingface datasets
-        uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830 # v4.3.0
+        uses: actions/cache@9255dc7a253b0ccc959486e2bca901246202afeb # v5.0.1
         with:
           path: ~/.cache/huggingface
           # Invalidate cache when any example notebook changes (may affect dataset downloads)
@@ -140,7 +140,7 @@ jobs:
           flags: integration
       - name: Upload integration test artifacts
-        uses: actions/upload-artifact@v5
+        uses: actions/upload-artifact@v6
         if: always()
         with:
           name: integration-test-results-${{ matrix.python }}-${{ matrix.platform }}

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/.github/workflows/pypi.yaml RENAMED

@@ -72,7 +72,7 @@ jobs:
                   egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
             - name: "Download build artifacts"
-              uses: actions/download-artifact@018cc2cf5baa6db3ef3c5f8a56943fffe632ef53 # v6.0.0
+              uses: actions/download-artifact@37930b1c2abaa49bbe596cd826c3c89aef350131 # v7.0.0
               with:
                   name: Packages
                   path: dist
@@ -104,13 +104,13 @@ jobs:
                   egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
             - name: "Download build artifacts"
-              uses: actions/download-artifact@018cc2cf5baa6db3ef3c5f8a56943fffe632ef53 # v6.0.0
+              uses: actions/download-artifact@37930b1c2abaa49bbe596cd826c3c89aef350131 # v7.0.0
               with:
                   name: Packages
                   path: dist
             - name: "Sigstore sign package"
-              uses: sigstore/gh-action-sigstore-python@f832326173235dcb00dd5d92cd3f353de3188e6c # v3.1.0
+              uses: sigstore/gh-action-sigstore-python@a5caf349bc536fbef3668a10ed7f5cd309a4b53d # v3.2.0
               with:
                   inputs: |
                       ./dist/*.tar.gz

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: sdg_hub
-Version: 0.7.1
+Version: 0.7.3
 Summary: Synthetic Data Generation
 Author-email: Red Hat AI Innovation <abhandwa@redhat.com>
 License: Apache-2.0
@@ -26,7 +26,7 @@ Requires-Dist: click<9.0.0,>=8.1.7
 Requires-Dist: datasets>=4.0.0
 Requires-Dist: httpx<1.0.0,>=0.25.0
 Requires-Dist: jinja2
-Requires-Dist: litellm<1.75.0,>=1.73.0
+Requires-Dist: litellm<2.0.0,>=1.73.0
 Requires-Dist: rich
 Requires-Dist: pandas
 Requires-Dist: pydantic<3.0.0,>=2.0.0
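
Reviewer note: the hunk above widens the allowed litellm range from `<1.75.0` to `<2.0.0` while keeping the `>=1.73.0` floor. The sketch below only illustrates what that means for a few candidate version strings. The `parse` and `in_range` helpers are hypothetical, handle plain numeric versions only, and are not how pip resolves specifiers; the candidate versions are illustrative and not claimed to be real litellm releases.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '1.74.3' into (1, 74, 3); plain numeric versions only."""
    return tuple(int(part) for part in version.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """True when lower <= version < upper, compared component by component."""
    return parse(lower) <= parse(version) < parse(upper)


OLD = ("1.73.0", "1.75.0")  # sdg_hub 0.7.1: litellm>=1.73.0,<1.75.0
NEW = ("1.73.0", "2.0.0")   # sdg_hub 0.7.3: litellm>=1.73.0,<2.0.0

# Illustrative version strings, chosen to sit on either side of each bound.
for candidate in ("1.72.9", "1.74.3", "1.75.0", "1.80.0", "2.0.0"):
    print(candidate, in_range(candidate, *OLD), in_range(candidate, *NEW))
```

Only versions in the 1.75 to 1.x band change status: they were rejected by the old cap and are accepted by the new one.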

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/docs/blocks/llm-blocks.md RENAMED

@@ -603,7 +603,7 @@ print(result["judgment"])     # ['YES']
 TextParserBlock is commonly used after LLMChatBlock to structure responses:
 ```python
-from sdg_hub.core.blocks import LLMChatBlock, LLMParserBlock, TextParserBlock
+from sdg_hub.core.blocks import LLMChatBlock, LLMResponseExtractorBlock, TextParserBlock
 # Step 1: Generate LLM response
 chat_block = LLMChatBlock(
@@ -615,7 +615,7 @@ chat_block = LLMChatBlock(
 # Step 2: Extract content from response object
 # Use field_prefix="" to get cleaner column names
-llm_parser = LLMParserBlock(
+llm_parser = LLMResponseExtractorBlock(
     block_name="extract_eval",
     input_cols=["eval_response"],
     extract_content=True,

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/docs/flows/overview.md RENAMED

@@ -316,7 +316,7 @@ blocks:
       output_cols: ["eval_response"]
       async_mode: true
-  - block_type: "LLMParserBlock"
+  - block_type: "LLMResponseExtractorBlock"
     block_config:
       block_name: "extract_eval_content"
       input_cols: ["eval_response"]
@@ -537,7 +537,7 @@ result = flow.generate(
 | | `top_p` | Nucleus sampling threshold | `0.0` - `1.0` |
 | | `frequency_penalty` | Penalize token repetition | `-2.0` - `2.0` |
 | | `presence_penalty` | Penalize new topics | `-2.0` - `2.0` |
-| **LLMParserBlock** | `extract_content` | Extract main content field | `True`, `False` |
+| **LLMResponseExtractorBlock** | `extract_content` | Extract main content field | `True`, `False` |
 | | `extract_reasoning_content` | Extract reasoning/thinking | `True`, `False` |
 | | `extract_tool_calls` | Extract tool call data | `True`, `False` |
 | | `field_prefix` | Prefix for output fields | `"llm_"`, `"parsed_"` |
@@ -752,7 +752,7 @@ result = flow.generate(dataset)
 │ │ generate_question    │ LLMChatBlock    │   45.30s │ 100 → 100    │ +1      │ ✓││
 │ │ generate_answer      │ LLMChatBlock    │   78.45s │ 100 → 100    │ +1      │ ✓││
 │ │ eval_faithfulness... │ LLMChatBlock    │   52.20s │ 100 → 100    │ +1      │ ✓││
-│ │ extract_eval_con...  │ LLMParserBlock  │    0.15s │ 100 → 100    │ +2      │ ✓││
+│ │ extract_eval_con...  │ LLMResponseExtractorBlock  │    0.15s │ 100 → 100    │ +2      │ ✓││
 │ │ parse_evaluation     │ TextParserBlock │    0.22s │ 100 → 100    │ +2      │ ✓││
 │ │ filter_faithful      │ ColumnValueF... │    0.08s │ 100 → 87     │ —       │ ✓││
 │ ├──────────────────────┼─────────────────┼──────────┼──────────────┼─────────┼──┤│

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/examples/knowledge_tuning/enhanced_summary_knowledge_tuning/README.md RENAMED

@@ -48,29 +48,38 @@ Only claims passing this check are retained. This process filters out **hallucin
 ---
-## Data Generation Statistics
+## Data Generation Statistics and Results
+**Teacher model for generation:** `openai/gpt-oss-120b`
+**Student model trained:** `meta-llama/Llama-3.1-8B-Instruct`
+**Training method:** Supervised Fine-Tuning (SFT)
+---
 ### Summary Augmentation
-Each “cut” represents the total number of summaries generated per document across all three augmentation types.
+For each document, we generate three augmentation types—detailed summaries, extractive summaries, and atomic facts. Each “cut” on the table below represents the total number of summary augmentations per document (i.e., how many times each augmentation process is run).
-| Cut (NUMBER\_OF\_SUMMARIES = 3) | Token Count |
-| ------------------------------- | ----------- |
-| 1                               | 2,193,502   |
-| 2                               | 4,383,655   |
-| 5                               | 10,870,396  |
-| 10                              | 21,815,170  |
-| 20                              | 43,601,976  |
-| 30                              | 65,395,710  |
-| 40                              | 87,118,308  |
-| 50                              | 108,779,213 |
+| Cut (NUMBER\_OF\_SUMMARIES = 3) | Token Count   |
+| ------------------------------- | ------------- |
+| Input Corpus                    | 1,517,465     |
+| 10                              | 87,248,889    |
+| 20                              | 158,615,276   |
+| 30                              | 230,306,195   |
+| 40                              | 301,805,906   |
+| 50                              | 373,183,414   |
 ---
-### Finance Bench Example
+### Benchmark Results
-For Finance Bench (NUMBER\_OF\_SUMMARIES = 1):
+- **Evaluation benchmark:** [QuALITY benchmark](https://nyu-mll.github.io/quality/)
+- **Evaluation script & metric:** [Synthetic_Continued_Pretraining](https://github.com/ZitongYang/Synthetic_Continued_Pretraining/blob/main/evaluation.py), Exact Match (EM)
+- **Student model:** meta-llama/Llama-3.1-8B-Instruct (after SFT on generated/augmented summaries)
+- **Performance metric:** Model accuracy
-| Cut | Token Count |
-| --- | ----------- |
-| 50  | 213,333,192 |
+![Quality Benchmark Accuracy](imgs/quality_benchmark_accuracy.png)
+*Figure: Model accuracy across the QuALITY benchmark datasets, comparing SFT training on enhanced document summaries with the original model performance.*
+---
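
Reviewer note: the new README table lists an input corpus size alongside the token counts per cut, so the expansion factor can be derived directly. The sketch below uses only the numbers from the added rows; the helper names `expansion_ratio` and `step_increase` are made up for illustration.

```python
# Token counts copied from the added README rows (NUMBER_OF_SUMMARIES = 3).
INPUT_CORPUS_TOKENS = 1_517_465
TOKENS_BY_CUT = {
    10: 87_248_889,
    20: 158_615_276,
    30: 230_306_195,
    40: 301_805_906,
    50: 373_183_414,
}


def expansion_ratio(cut: int) -> float:
    """Generated tokens at this cut divided by the input corpus size."""
    return TOKENS_BY_CUT[cut] / INPUT_CORPUS_TOKENS


def step_increase(cut: int) -> int:
    """Tokens added relative to the previous table row (rows are 10 cuts apart)."""
    return TOKENS_BY_CUT[cut] - TOKENS_BY_CUT[cut - 10]


for cut in TOKENS_BY_CUT:
    extra = step_increase(cut) if cut > 10 else None
    print(cut, round(expansion_ratio(cut), 1), extra)
```

On these figures the generated data is between 57 and 58 times the input corpus at cut 10 and just under 246 times at cut 50, with each further 10 cuts adding a little over 71 million tokens.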

sdg_hub-0.7.3/examples/knowledge_tuning/enhanced_summary_knowledge_tuning/imgs/quality_benchmark_accuracy.png ADDED

Binary file

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/examples/knowledge_tuning/knowledge_utils.py RENAMED

@@ -602,13 +602,14 @@ def _num_chars_from_tokens(num_tokens) -> int:
     return int(num_tokens * 4)  # 1 token ~ 4 English character
-def chunk_document(documents: List, server_ctx_size, chunk_word_count) -> List[str]:
+def chunk_document(documents: List, server_ctx_size, chunk_word_count, **kwargs) -> List[str]:
     """
     Iterates over the documents and splits them into chunks based on the word count provided by the user.
     Args:
         documents (list): List of documents retrieved from git (can also consist of a single document).
         server_ctx_size (int): Context window size of server.
         chunk_word_count (int): Maximum number of words to chunk a document.
+        chunk_overlap (int): Overlap in characters between chunks.
     Returns:
          List[str]: List of chunked documents.
     """
@@ -634,7 +635,7 @@ def chunk_document(documents: List, server_ctx_size, chunk_word_count) -> List[s
     # Placeholder for params
     content = []
     chunk_size = _num_chars_from_tokens(no_tokens_per_doc)
-    chunk_overlap = _DEFAULT_CHUNK_OVERLAP
+    chunk_overlap = int(kwargs.pop("chunk_overlap", str(_DEFAULT_CHUNK_OVERLAP)))
     # Using Markdown as default, document-specific chunking will be implemented in seperate pr.
     text_splitter = RecursiveCharacterTextSplitter.from_language(
@@ -729,16 +730,21 @@ class DocProcessor:
             }
         )
-    def _add_icls(self, chunked_document: Dataset) -> Dataset:
+    def _add_icls(self, chunked_document: Dataset, **kwargs) -> Dataset:
         """
         Add the ICLS label to the dataset.
         Args:
             dataset (Dataset): Dataset object.
+            server_ctx_size (int): Context window size of server.
+            chunk_word_count (int): Maximum number of words to chunk a document.
+            chunk_overlap (int): Overlap in characters between chunks.
         Returns
         -------
             Dataset: Dataset object with ICLS label.
         """
+        server_ctx_size = int(kwargs.pop("server_ctx_size", "4096"))
+        chunk_word_count = int(kwargs.pop("chunk_word_count", "1024"))
         icl = self.user_config["seed_examples"]
         chunked_document_all_icl = []
         for icl_ in icl:
@@ -762,7 +768,7 @@ class DocProcessor:
         chunked_document_all_icl = chunked_document_all_icl.map(
             lambda x: {
                 "chunks": chunk_document(
-                    [x["document"]], server_ctx_size=4096, chunk_word_count=1024
+                    [x["document"]], server_ctx_size=server_ctx_size, chunk_word_count=chunk_word_count, **kwargs
                 )
                 if get_token_count(x["document"], self.tokenizer) > 1024
                 else [x["document"]]
@@ -797,7 +803,7 @@ class DocProcessor:
         df = safe_concatenate_datasets([ds.to_pandas() for ds in datasets])
         return Dataset.from_pandas(df) if df is not None else None
-    def get_processed_markdown_dataset(self, list_md_files: list[Path]) -> Dataset:
+    def get_processed_markdown_dataset(self, list_md_files: list[Path], **kwargs) -> Dataset:
         chunks_mds = []
         for md_file in list_md_files:
             with open(md_file, "r", encoding="utf-8") as f:
@@ -811,5 +817,5 @@ class DocProcessor:
                     }
                 )
         chunk_ds = Dataset.from_list(chunks_mds)
-        chunk_ds_with_icls = self._add_icls(chunk_ds)
+        chunk_ds_with_icls = self._add_icls(chunk_ds, **kwargs)
         return chunk_ds_with_icls
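
Reviewer note: the hunks above thread `**kwargs` from `get_processed_markdown_dataset` through `_add_icls` into `chunk_document`, popping each option with a string default and converting it with `int()`. The sketch below reproduces only that forwarding pattern with simplified stand-in functions. The value of `_DEFAULT_CHUNK_OVERLAP` is an assumption (the real constant is not shown in the diff), and the functions return the resolved settings instead of doing any chunking.

```python
_DEFAULT_CHUNK_OVERLAP = 100  # assumed value; the real constant is not in the diff


def chunk_document(documents, server_ctx_size, chunk_word_count, **kwargs):
    # Same pattern as the diff: string default, then int() conversion.
    chunk_overlap = int(kwargs.pop("chunk_overlap", str(_DEFAULT_CHUNK_OVERLAP)))
    return {
        "server_ctx_size": server_ctx_size,
        "chunk_word_count": chunk_word_count,
        "chunk_overlap": chunk_overlap,
    }


def add_icls(document, **kwargs):
    # Consumed here so they are not passed twice in the call below.
    server_ctx_size = int(kwargs.pop("server_ctx_size", "4096"))
    chunk_word_count = int(kwargs.pop("chunk_word_count", "1024"))
    # Whatever is left (for example chunk_overlap) is forwarded untouched.
    return chunk_document(
        [document],
        server_ctx_size=server_ctx_size,
        chunk_word_count=chunk_word_count,
        **kwargs,
    )


print(add_icls("some text"))
print(add_icls("some text", server_ctx_size=8192, chunk_overlap="50"))
```

Two details are worth noting. Popping `server_ctx_size` and `chunk_word_count` before forwarding avoids a `TypeError` for duplicate keyword arguments. And because `**kwargs` unpacking builds a fresh dict for each call, the `pop` inside `chunk_document` does not remove `chunk_overlap` from the caller's dict, so repeated per-row calls all see the same override.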

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/examples/text_analysis/structured_insights_demo.ipynb RENAMED

@@ -332,7 +332,7 @@
     "    LLMChatBlock,\n",
     "    PromptBuilderBlock,\n",
     "    TextParserBlock,\n",
-    "    LLMParserBlock,\n",
+    "    LLMResponseExtractorBlock,\n",
     ")\n",
     "from sdg_hub.core.blocks.transform import JSONStructureBlock\n",
     "\n",
@@ -355,7 +355,7 @@
     "    temperature=0.1,  # Low temperature for more consistent extraction\n",
     ")\n",
     "\n",
-    "ticker_llm_parser_block = LLMParserBlock(\n",
+    "ticker_llm_response_extractor_block = LLMResponseExtractorBlock(\n",
     "    block_name=\"extract_stock_tickers\",\n",
     "    input_cols=[\"raw_stock_tickers\"],\n",
     "    extract_content=True,\n",
@@ -406,7 +406,7 @@
     "ticker_blocks = [\n",
     "    ticker_prompt_block,\n",
     "    ticker_llm_block,\n",
-    "    ticker_llm_parser_block,\n",
+    "    ticker_llm_response_extractor_block,\n",
     "    ticker_parser_block,\n",
     "    enhanced_json_block,\n",
     "]\n",

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/pyproject.toml RENAMED

@@ -33,7 +33,7 @@ dependencies = [
     "datasets>=4.0.0",
     "httpx>=0.25.0,<1.0.0",
     "jinja2",
-    "litellm>=1.73.0,<1.75.0",
+    "litellm>=1.73.0,<2.0.0",         # raising cap since tests run without errors related to 'backoff' cap back to <1.75.0 if errors surface
     "rich",
     "pandas",
     "pydantic>=2.0.0,<3.0.0",         # cap before v3; adjust the lower bound to the minimum v2.x you’ve tested

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/_version.py RENAMED

@@ -28,7 +28,7 @@ version_tuple: VERSION_TUPLE
 commit_id: COMMIT_ID
 __commit_id__: COMMIT_ID
-__version__ = version = '0.7.1'
-__version_tuple__ = version_tuple = (0, 7, 1)
+__version__ = version = '0.7.3'
+__version_tuple__ = version_tuple = (0, 7, 3)
-__commit_id__ = commit_id = 'g884bce940'
+__commit_id__ = commit_id = 'g97824a47f'

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/__init__.py RENAMED

@@ -6,7 +6,13 @@ This package provides various block implementations for data generation, process
 # Local
 from .base import BaseBlock
 from .filtering import ColumnValueFilterBlock
-from .llm import LLMChatBlock, LLMParserBlock, PromptBuilderBlock, TextParserBlock
+from .llm import (
+    LLMChatBlock,
+    LLMParserBlock,
+    LLMResponseExtractorBlock,
+    PromptBuilderBlock,
+    TextParserBlock,
+)
 from .registry import BlockRegistry
 from .transform import (
     DuplicateColumnsBlock,
@@ -28,7 +34,8 @@ __all__ = [
     "TextConcatBlock",
     "UniformColumnValueSetter",
     "LLMChatBlock",
-    "LLMParserBlock",
+    "LLMParserBlock",  # Deprecated alias for LLMResponseExtractorBlock
+    "LLMResponseExtractorBlock",
     "TextParserBlock",
     "PromptBuilderBlock",
 ]

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/base.py RENAMED

@@ -49,6 +49,9 @@ class BaseBlock(BaseModel, ABC):
     block_name: str = Field(
         ..., description="Unique identifier for this block instance"
     )
+    block_type: Optional[str] = Field(
+        None, description="Block type (e.g., 'llm', 'transform', 'parser', 'filtering')"
+    )
     input_cols: Union[str, list[str], dict[str, Any], None] = Field(
         None, description="Input columns: str, list, or dict"
     )
@@ -366,5 +369,5 @@ class BaseBlock(BaseModel, ABC):
         Dict[str, Any]
         """
         config = self.get_config()
-        config["block_type"] = self.__class__.__name__
+        config["block_class"] = self.__class__.__name__
         return config
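
Reviewer note: one reading of the two hunks above is that `block_type` now holds a category string set by each subclass (for example `"filtering"`), so the class name that used to be written under `config["block_type"]` moves to a new `block_class` key. The sketch below shows that split with a plain-Python stand-in. `SketchBlock`, `SketchFilterBlock`, and the method name `get_info` are hypothetical; the real `BaseBlock` is a pydantic model, and whether its `get_config()` output includes `block_type` is an assumption here.

```python
from typing import Any, Optional


class SketchBlock:
    """Stdlib stand-in for BaseBlock (the real class is a pydantic model)."""

    block_type: Optional[str] = None  # category; overridden by subclasses

    def __init__(self, block_name: str) -> None:
        self.block_name = block_name

    def get_config(self) -> dict[str, Any]:
        # Assumption: the category field is part of the serialized config.
        return {"block_name": self.block_name, "block_type": self.block_type}

    def get_info(self) -> dict[str, Any]:
        config = self.get_config()
        # The class name goes under its own key so it no longer
        # overwrites the category stored in "block_type".
        config["block_class"] = self.__class__.__name__
        return config


class SketchFilterBlock(SketchBlock):
    block_type = "filtering"


info = SketchFilterBlock("filter_faithful").get_info()
print(info)
```

Code that previously read the class name from the `block_type` key of this dictionary would need to read `block_class` instead.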

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/filtering/column_value_filter.py RENAMED

@@ -46,6 +46,8 @@ DTYPE_MAP = {
     "Filters datasets based on column values using various comparison operations",
 )
 class ColumnValueFilterBlock(BaseBlock):
+    block_type: str = "filtering"
     """A block for filtering datasets based on column values.
     This block allows filtering of datasets using various operations (e.g., equals, contains)

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/llm/__init__.py RENAMED

@@ -9,7 +9,7 @@ local models (vLLM, Ollama), and more.
 # Local
 from .error_handler import ErrorCategory, LLMErrorHandler
 from .llm_chat_block import LLMChatBlock
-from .llm_parser_block import LLMParserBlock
+from .llm_response_extractor_block import LLMParserBlock, LLMResponseExtractorBlock
 from .prompt_builder_block import PromptBuilderBlock
 from .text_parser_block import TextParserBlock
@@ -17,7 +17,8 @@ __all__ = [
     "LLMErrorHandler",
     "ErrorCategory",
     "LLMChatBlock",
-    "LLMParserBlock",
+    "LLMParserBlock",  # Deprecated alias for LLMResponseExtractorBlock
+    "LLMResponseExtractorBlock",
     "PromptBuilderBlock",
     "TextParserBlock",
 ]

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/llm/llm_chat_block.py RENAMED

@@ -6,7 +6,8 @@ from typing import Any, Optional
 import asyncio
 from litellm import acompletion, completion
-from pydantic import ConfigDict, Field, field_validator
+from pydantic import ConfigDict, Field, SecretStr, field_validator
+from tqdm.asyncio import tqdm_asyncio
 import litellm
 # Third Party
@@ -31,6 +32,8 @@ logger = setup_logger(__name__)
 class LLMChatBlock(BaseBlock):
     model_config = ConfigDict(extra="allow")
+    block_type: str = "llm"
     """Unified LLM chat block supporting all providers via LiteLLM.
     This block provides a minimal wrapper around LiteLLM's completion API,
@@ -52,8 +55,9 @@ class LLMChatBlock(BaseBlock):
     model : Optional[str], optional
         Model identifier in LiteLLM format. Can be set later via flow.set_model_config().
         Examples: "openai/gpt-4", "anthropic/claude-3-sonnet-20240229"
-    api_key : Optional[str], optional
+    api_key : Optional[SecretStr], optional
         API key for the provider. Falls back to environment variables.
+        Automatically redacted in logs and string representations.
     api_base : Optional[str], optional
         Base URL for the API. Required for local models.
     async_mode : bool, optional
@@ -97,7 +101,7 @@ class LLMChatBlock(BaseBlock):
     model: Optional[str] = Field(
         None, exclude=True, description="Model identifier in LiteLLM format"
     )
-    api_key: Optional[str] = Field(
+    api_key: Optional[SecretStr] = Field(
         None, exclude=True, description="API key for the provider"
     )
     api_base: Optional[str] = Field(
@@ -301,7 +305,7 @@ class LLMChatBlock(BaseBlock):
         if self.model is not None:
             completion_kwargs["model"] = self.model
         if self.api_key is not None:
-            completion_kwargs["api_key"] = self.api_key
+            completion_kwargs["api_key"] = self.api_key.get_secret_value()
         if self.api_base is not None:
             completion_kwargs["api_base"] = self.api_base
         if self.timeout is not None:
@@ -501,7 +505,9 @@ class LLMChatBlock(BaseBlock):
                     for messages in messages_list
                 ]
-            responses = await asyncio.gather(*tasks)
+            responses = await tqdm_asyncio.gather(
+                *tasks, desc=self.block_name, unit="req"
+            )
             return responses
         except Exception as e:
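
Reviewer note: the `api_key` field changes from `str` to pydantic's `SecretStr`, which masks the value in `str()` and `repr()` and returns the raw string only through `get_secret_value()`. That is why the completion kwargs now call `.get_secret_value()`. To stay dependency-free, the sketch below uses a hypothetical `MaskedSecret` class that mimics just those two behaviours; it is not pydantic's implementation, and `build_completion_kwargs` is a simplified stand-in for the block's own method.

```python
class MaskedSecret:
    """Hypothetical stand-in for the masking behaviour of pydantic.SecretStr."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        """Return the raw secret; callers must ask for it explicitly."""
        return self._value

    def __str__(self) -> str:
        return "**********"

    def __repr__(self) -> str:
        return "MaskedSecret('**********')"


def build_completion_kwargs(model, api_key):
    """Simplified version of the kwargs assembly shown in the hunk above."""
    kwargs = {}
    if model is not None:
        kwargs["model"] = model
    if api_key is not None:
        # Unwrap only at the point where the provider call needs the raw key.
        kwargs["api_key"] = api_key.get_secret_value()
    return kwargs


key = MaskedSecret("sk-example")  # placeholder, not a real key
print(f"configured key: {key}")   # masked in logs and f-strings
completion_kwargs = build_completion_kwargs("openai/gpt-4", key)
```

Callers that read `block.api_key` and expect a plain string would need to unwrap it the same way.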

sdg_hub-0.7.1/src/sdg_hub/core/blocks/llm/llm_parser_block.py → sdg_hub-0.7.3/src/sdg_hub/core/blocks/llm/llm_response_extractor_block.py RENAMED

@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: Apache-2.0
-"""LLM parser block for extracting fields from LLM response objects.
+"""LLM response extractor block for extracting fields from LLM response objects.
-This module provides the LLMParserBlock for extracting specific fields
+This module provides the LLMResponseExtractorBlock for extracting specific fields
 (content, reasoning_content, tool_calls) from chat completion response objects.
 """
@@ -22,13 +22,15 @@ logger = setup_logger(__name__)
 @BlockRegistry.register(
-    "LLMParserBlock",
+    "LLMResponseExtractorBlock",
     "llm",
     "Extracts specified fields from LLM response objects",
 )
-class LLMParserBlock(BaseBlock):
+class LLMResponseExtractorBlock(BaseBlock):
     _flow_requires_jsonl_tmp: bool = True
+    block_type: str = "llm_util"
     """Block for extracting fields from LLM response objects.
     This block extracts specified fields from chat completion response objects.
@@ -88,7 +90,7 @@ class LLMParserBlock(BaseBlock):
             ]
         ):
             raise ValueError(
-                "LLMParserBlock requires at least one extraction field to be enabled: "
+                "LLMResponseExtractorBlock requires at least one extraction field to be enabled: "
                 "extract_content, extract_reasoning_content, or extract_tool_calls"
             )
@@ -106,7 +108,7 @@ class LLMParserBlock(BaseBlock):
         return self
     def _validate_custom(self, dataset: pd.DataFrame) -> None:
-        """Validate LLMParserBlock specific requirements.
+        """Validate LLMResponseExtractorBlock specific requirements.
         Parameters
         ----------
@@ -116,14 +118,16 @@ class LLMParserBlock(BaseBlock):
         Raises
         ------
         ValueError
-            If LLMParserBlock requirements are not met.
+            If LLMResponseExtractorBlock requirements are not met.
         """
         # Validate that we have exactly one input column
         if len(self.input_cols) == 0:
-            raise ValueError("LLMParserBlock expects at least one input column")
+            raise ValueError(
+                "LLMResponseExtractorBlock expects at least one input column"
+            )
         if len(self.input_cols) > 1:
             logger.warning(
-                f"LLMParserBlock expects exactly one input column, but got {len(self.input_cols)}. "
+                f"LLMResponseExtractorBlock expects exactly one input column, but got {len(self.input_cols)}. "
                 f"Using the first column: {self.input_cols[0]}"
             )
@@ -324,3 +328,22 @@ class LLMParserBlock(BaseBlock):
             new_data.extend(self._generate(sample))
         return pd.DataFrame(new_data)
+# Backwards compatibility alias (deprecated)
+# Register deprecated alias in BlockRegistry so old YAML flows still work
+@BlockRegistry.register(
+    "LLMParserBlock",
+    "llm",
+    "Deprecated: Use LLMResponseExtractorBlock instead",
+    deprecated=True,
+    replacement="LLMResponseExtractorBlock",
+)
+class LLMParserBlock(LLMResponseExtractorBlock):
+    """Deprecated alias for LLMResponseExtractorBlock.
+    This class exists for backwards compatibility with existing code and YAML flows.
+    Use LLMResponseExtractorBlock instead.
+    """
+    pass
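
Reviewer note: the rename keeps `LLMParserBlock` available as a subclass of `LLMResponseExtractorBlock` and registers it with `deprecated=True` and a `replacement` name, so existing YAML flows that say `block_type: "LLMParserBlock"` still resolve. The sketch below shows how such a registry entry could behave. `SketchRegistry` is a hypothetical, simplified stand-in for `BlockRegistry`; in particular, emitting a `DeprecationWarning` on lookup is an assumption, since the diff does not show what the real registry does with the flag.

```python
import warnings


class SketchRegistry:
    """Hypothetical, simplified stand-in for BlockRegistry."""

    _entries: dict = {}

    @classmethod
    def register(cls, name, category, description, deprecated=False, replacement=None):
        def decorator(block_cls):
            cls._entries[name] = {
                "cls": block_cls,
                "category": category,
                "description": description,
                "deprecated": deprecated,
                "replacement": replacement,
            }
            return block_cls

        return decorator

    @classmethod
    def get(cls, name):
        entry = cls._entries[name]
        if entry["deprecated"]:
            # Assumed behaviour: warn, then return the class anyway.
            warnings.warn(
                f"{name} is deprecated; use {entry['replacement']} instead",
                DeprecationWarning,
                stacklevel=2,
            )
        return entry["cls"]


@SketchRegistry.register(
    "LLMResponseExtractorBlock", "llm", "Extracts specified fields from LLM response objects"
)
class LLMResponseExtractorBlock:
    pass


@SketchRegistry.register(
    "LLMParserBlock",
    "llm",
    "Deprecated: Use LLMResponseExtractorBlock instead",
    deprecated=True,
    replacement="LLMResponseExtractorBlock",
)
class LLMParserBlock(LLMResponseExtractorBlock):
    pass


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_cls = SketchRegistry.get("LLMParserBlock")

print(issubclass(old_cls, LLMResponseExtractorBlock), len(caught))
```

Because the alias is a subclass and not the same class object, an `LLMParserBlock` instance passes `isinstance(..., LLMResponseExtractorBlock)`, but the reverse check fails.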

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/llm/prompt_builder_block.py RENAMED

@@ -222,6 +222,8 @@ class PromptRenderer:
     "Formats prompts into structured chat messages or plain text using Jinja templates",
 )
 class PromptBuilderBlock(BaseBlock):
+    block_type: str = "llm_util"
     """Block for formatting prompts into structured chat messages or plain text.
     This block takes input from dataset columns, applies Jinja templates from a YAML config

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/llm/text_parser_block.py RENAMED

@@ -30,6 +30,8 @@ logger = setup_logger(__name__)
 class TextParserBlock(BaseBlock):
     _flow_requires_jsonl_tmp: bool = True
+    block_type: str = "parser"
     """Block for parsing and post-processing text content.
     This block handles text parsing using start/end tags, custom regex patterns,

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/transform/duplicate_columns.py RENAMED

@@ -27,6 +27,8 @@ logger = setup_logger(__name__)
     "Duplicates existing columns with new names according to a mapping specification",
 )
 class DuplicateColumnsBlock(BaseBlock):
+    block_type: str = "transform"
     """Block for duplicating existing columns with new names.
     This block creates copies of existing columns with new names according to a mapping specification.

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/transform/index_based_mapper.py RENAMED

@@ -28,6 +28,8 @@ logger = setup_logger(__name__)
     "Maps values from source columns to output columns based on choice columns using shared mapping",
 )
 class IndexBasedMapperBlock(BaseBlock):
+    block_type: str = "transform"
     """Block for mapping values from source columns to output columns based on choice columns.
     This block uses a shared mapping dictionary to select values from source columns and

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/transform/json_structure_block.py RENAMED

@@ -28,6 +28,8 @@ logger = setup_logger(__name__)
     "Combines multiple columns into a single column containing a structured JSON object",
 )
 class JSONStructureBlock(BaseBlock):
+    block_type: str = "transform"
     """Block for combining multiple columns into a structured JSON object.
     This block takes values from multiple input columns and combines them into a single

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/transform/melt_columns.py RENAMED

@@ -28,6 +28,8 @@ logger = setup_logger(__name__)
     "Transforms wide dataset format into long format by melting columns into rows",
 )
 class MeltColumnsBlock(BaseBlock):
+    block_type: str = "transform"
     """Block for flattening multiple columns into a long format.
     This block transforms a wide dataset format into a long format by melting

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/transform/rename_columns.py RENAMED

@@ -27,6 +27,8 @@ logger = setup_logger(__name__)
     "Renames columns in a dataset according to a mapping specification",
 )
 class RenameColumnsBlock(BaseBlock):
+    block_type: str = "transform"
     """Block for renaming columns in a dataset.
     This block renames columns in a dataset according to a mapping specification.

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/transform/text_concat.py RENAMED

@@ -27,6 +27,8 @@ logger = setup_logger(__name__)
     "Combines multiple columns into a single column using a specified separator",
 )
 class TextConcatBlock(BaseBlock):
+    block_type: str = "transform"
     """Block for combining multiple columns into a single column.
     This block concatenates values from multiple columns into a single output column,

{sdg_hub-0.7.1 → sdg_hub-0.7.3}/src/sdg_hub/core/blocks/transform/uniform_col_val_setter.py RENAMED

@@ -28,6 +28,8 @@ logger = setup_logger(__name__)
     "Replaces all values in a column with a single summary statistic (e.g., mode, mean, median)",
 )
 class UniformColumnValueSetter(BaseBlock):
+    block_type: str = "transform"
     """Block that replaces all values in a column with a single aggregate value.
     Supported strategies include: mode, min, max, mean, median.
