sdg-hub 0.1.0a2.dev0__py3-none-any.whl → 0.1.0a4__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,154 +0,0 @@
- Metadata-Version: 2.4
- Name: sdg_hub
- Version: 0.1.0a2.dev0
- Summary: Synthetic Data Generation
- Author-email: Red Hat AI Innovation <abhandwa@redhat.com>
- License: Apache-2.0
- Project-URL: homepage, https://ai-innovation.team/
- Project-URL: source, https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub
- Project-URL: issues, https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/issues
- Classifier: Development Status :: 3 - Alpha
- Classifier: Environment :: Console
- Classifier: License :: OSI Approved :: Apache Software License
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Operating System :: MacOS :: MacOS X
- Classifier: Operating System :: POSIX :: Linux
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.9
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Classifier: Programming Language :: Python :: Implementation :: CPython
- Requires-Python: >=3.9
- Description-Content-Type: text/markdown
- License-File: LICENSE
- Requires-Dist: click<9.0.0,>=8.1.7
- Requires-Dist: datasets<4.0.0,>=2.18.0
- Requires-Dist: httpx<1.0.0,>=0.25.0
- Requires-Dist: jinja2
- Requires-Dist: langchain-text-splitters
- Requires-Dist: openai<2.0.0,>=1.13.3
- Requires-Dist: rich
- Requires-Dist: tenacity!=8.4.0,>=8.3.0
- Requires-Dist: tqdm<5.0.0,>=4.66.2
- Dynamic: license-file
-
- # Synthetic Data Generation for LLMs
-
- The SDG Framework is a modular, scalable, and efficient solution for creating synthetic data generation workflows in a "no-code" manner. At its core, the framework is designed to simplify data creation for LLMs, allowing users to chain computational units into powerful pipelines for data generation and processing tasks.
-
- ## Core Design Principles
-
- The framework is built around the following principles:
-
- 1. **Modular Design**: Highly composable blocks form the building units of the framework, allowing users to build workflows effortlessly.
- 2. **No-Code Workflow Creation**: Specify workflows using simple YAML configuration files.
- 3. **Scalability and Performance**: Optimized for handling large-scale workflows with millions of records.
-
- ---
-
- ## Framework Architecture
-
- ![overview](assets/imgs/overview.png)
-
- ### Blocks: The Fundamental Unit
-
- At the heart of the framework is the **Block**. Each block is a self-contained computational unit that performs a specific task, such as:
-
- - Making LLM calls
- - Performing data transformations
- - Applying filters
-
- Blocks are designed to be:
-
- - **Modular**: Reusable across multiple pipelines.
- - **Composable**: Easily chained together to create workflows.
-
- These blocks are implemented in the [src/sdg_hub/blocks](src/sdg_hub/blocks) directory.
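As an illustration only, a block can be thought of as a callable over tabular rows. The sketch below is hypothetical (the real base class in `src/sdg_hub/blocks` may differ) and uses plain lists of dicts in place of Hugging Face Datasets:

```python
# Hypothetical sketch of the Block abstraction -- NOT the actual sdg_hub
# API. Each block consumes a list of row dicts and returns a transformed
# list, which is what makes blocks freely chainable.
from typing import Protocol


class Block(Protocol):
    block_name: str

    def generate(self, samples: list[dict]) -> list[dict]: ...


class UppercaseBlock:
    """Toy transformation block: upper-cases one column."""

    def __init__(self, block_name: str, column: str) -> None:
        self.block_name = block_name
        self.column = column

    def generate(self, samples: list[dict]) -> list[dict]:
        # Return new rows rather than mutating the input.
        return [{**row, self.column: row[self.column].upper()} for row in samples]


rows = [{"question": "what is sdg?"}]
out = UppercaseBlock("upper_q", "question").generate(rows)
print(out)  # [{'question': 'WHAT IS SDG?'}]
```

Because every block shares the same `generate` shape in this sketch, chaining is just function composition over the dataset.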
-
- ### Pipelines: Higher-Level Abstraction
-
- Blocks can be chained together to form a **Pipeline**. Pipelines enable:
-
- - Linear or recursive chaining of blocks.
- - Execution of complex workflows by chaining multiple pipelines together.
-
- ### SDG Workflow: Full Workflow Automation
-
- Pipelines are further orchestrated into **SDG Workflows**, enabling seamless end-to-end processing. Invoking `sdg_hub.generate` triggers one or more pipelines that process data through all of the configured blocks.
-
- ---
-
- ### YAML-Based Workflow: The Flow
-
- The YAML configuration file, known as the **Flow**, is central to defining data generation workflows in the SDG Framework. A Flow describes how blocks and pipelines are orchestrated to process and generate data efficiently. By leveraging YAML, users can create highly customizable and modular workflows without writing any code.
-
- #### Key Features of a Flow
-
- 1. **Modular Design**:
-    - Flows are composed of blocks, which can be chained together into pipelines.
-    - Each block performs a specific task, such as generating, filtering, or transforming data.
-
- 2. **Reusability**:
-    - Blocks and configurations defined in a Flow can be reused across different workflows.
-    - YAML makes it easy to tweak or extend workflows without significant changes.
-
- 3. **Ease of Configuration**:
-    - Users can specify block types, configurations, and data processing details in a simple and intuitive manner.
-
- ---
-
- ### Sample Flow
-
- Here is an example of a Flow configuration:
-
- ```yaml
- - block_type: LLMBlock
-   block_config:
-     block_name: gen_questions
-     config_path: configs/skills/freeform_questions.yaml
-     model_id: mistralai/Mixtral-8x7B-Instruct-v0.1
-     output_cols:
-       - question
-     batch_kwargs:
-       num_samples: 30
-   drop_duplicates:
-     - question
- - block_type: FilterByValueBlock
-   block_config:
-     block_name: filter_questions
-     filter_column: score
-     filter_value: 1.0
-     operation: operator.eq
-     convert_dtype: float
-     batch_kwargs:
-       num_procs: 8
-   drop_columns:
-     - evaluation
-     - score
-     - num_samples
- - block_type: LLMBlock
-   block_config:
-     block_name: gen_responses
-     config_path: configs/skills/freeform_responses.yaml
-     model_id: mistralai/Mixtral-8x7B-Instruct-v0.1
-     output_cols:
-       - response
- ```
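To make the filtering step concrete, the snippet below reproduces in plain Python what a `FilterByValueBlock` configured as above does conceptually: convert the filter column to `float`, keep rows where `operator.eq` holds, and drop the bookkeeping columns. This is a hedged sketch using only the standard library, not the actual sdg_hub implementation:

```python
import operator

# Toy rows as they might look after the gen_questions + evaluation steps.
rows = [
    {"question": "q1", "evaluation": "good", "score": "1.0", "num_samples": 30},
    {"question": "q2", "evaluation": "bad", "score": "0.0", "num_samples": 30},
]

op = operator.eq                          # operation: operator.eq
filter_value = 1.0                        # filter_value: 1.0
drop = {"evaluation", "score", "num_samples"}  # drop_columns

kept = [
    {k: v for k, v in row.items() if k not in drop}
    for row in rows
    if op(float(row["score"]), filter_value)  # convert_dtype: float
]
print(kept)  # [{'question': 'q1'}]
```

Using `operator.eq` by dotted name in the YAML keeps the comparison pluggable: swapping in `operator.ge` would keep every row at or above the threshold instead.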
-
- ### Dataflow and Storage
-
- - **Data Representation**: Dataflow between blocks and pipelines is handled using **Hugging Face Datasets**, which are based on Arrow tables. This provides:
-   - Native parallelization capabilities (e.g., maps, filters).
-   - Support for efficient data transformations.
- - **Data Checkpoints**: Intermediate caches of generated data. Checkpoints allow users to:
-   - Resume workflows from the last successful state if interrupted.
-   - Improve reliability for long-running workflows.
-
- ---
-
- ## Examples
-
- For sample use cases and implementation examples, see the [examples](examples) directory, which demonstrates a range of workflows and use cases of the SDG Framework.