liquidrandom 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- liquidrandom-0.1.0/.gitignore +8 -0
- liquidrandom-0.1.0/.python-version +1 -0
- liquidrandom-0.1.0/CLAUDE.md +183 -0
- liquidrandom-0.1.0/PKG-INFO +105 -0
- liquidrandom-0.1.0/README.md +92 -0
- liquidrandom-0.1.0/pyproject.toml +28 -0
- liquidrandom-0.1.0/seed_generation/README.md +113 -0
- liquidrandom-0.1.0/seed_generation/categories.py +415 -0
- liquidrandom-0.1.0/seed_generation/config.py +19 -0
- liquidrandom-0.1.0/seed_generation/dedup.py +80 -0
- liquidrandom-0.1.0/seed_generation/generate.py +215 -0
- liquidrandom-0.1.0/seed_generation/inspect_samples.py +162 -0
- liquidrandom-0.1.0/seed_generation/llm.py +93 -0
- liquidrandom-0.1.0/seed_generation/pyproject.toml +15 -0
- liquidrandom-0.1.0/seed_generation/sampler.py +244 -0
- liquidrandom-0.1.0/seed_generation/state.py +69 -0
- liquidrandom-0.1.0/seed_generation/taxonomy.py +229 -0
- liquidrandom-0.1.0/seed_generation/uploader.py +148 -0
- liquidrandom-0.1.0/seed_generation/uv.lock +652 -0
- liquidrandom-0.1.0/seed_generation/validator.py +78 -0
- liquidrandom-0.1.0/src/liquidrandom/__init__.py +108 -0
- liquidrandom-0.1.0/src/liquidrandom/_loader.py +42 -0
- liquidrandom-0.1.0/src/liquidrandom/_registry.py +43 -0
- liquidrandom-0.1.0/src/liquidrandom/models/__init__.py +27 -0
- liquidrandom-0.1.0/src/liquidrandom/models/coding_task.py +33 -0
- liquidrandom-0.1.0/src/liquidrandom/models/domain.py +28 -0
- liquidrandom-0.1.0/src/liquidrandom/models/emotional_state.py +27 -0
- liquidrandom-0.1.0/src/liquidrandom/models/instruction_complexity.py +27 -0
- liquidrandom-0.1.0/src/liquidrandom/models/job.py +30 -0
- liquidrandom-0.1.0/src/liquidrandom/models/language.py +29 -0
- liquidrandom-0.1.0/src/liquidrandom/models/math_category.py +28 -0
- liquidrandom-0.1.0/src/liquidrandom/models/persona.py +35 -0
- liquidrandom-0.1.0/src/liquidrandom/models/reasoning_pattern.py +27 -0
- liquidrandom-0.1.0/src/liquidrandom/models/scenario.py +30 -0
- liquidrandom-0.1.0/src/liquidrandom/models/science_topic.py +27 -0
- liquidrandom-0.1.0/src/liquidrandom/models/writing_style.py +28 -0
- liquidrandom-0.1.0/src/liquidrandom/py.typed +0 -0
- liquidrandom-0.1.0/tests/__init__.py +0 -0
- liquidrandom-0.1.0/tests/test_e2e.py +152 -0
- liquidrandom-0.1.0/tests/test_loader.py +75 -0
- liquidrandom-0.1.0/tests/test_models.py +171 -0
- liquidrandom-0.1.0/uv.lock +455 -0
+++ liquidrandom-0.1.0/.python-version
@@ -0,0 +1 @@
+3.12
+++ liquidrandom-0.1.0/CLAUDE.md
@@ -0,0 +1,183 @@
+# Liquidrandom
+
+This is a Python package for pseudo-random data generation. It is made for machine learning, to seed the data generation process.
+I.e. a typical use case is using an LLM to generate data for training another model, which has the problem of a lack of randomness in the generated data. This package is designed to solve that problem by providing a way to add randomness to the data generation process, for instance by injecting random personas or jobs into the prompt.
+
+## Package
+
+The package is called `liquidrandom` and can be installed via pip.
+We are using typed Python.
+We are using ty and uv for type checking and dependency management, respectively.
+Create a pyproject.toml file for the package.
+
+The package should be usable as follows:
+
+```python
+import liquidrandom
+
+persona = liquidrandom.persona()
+job = liquidrandom.job()
+coding_task = liquidrandom.coding_task()
+math_category = liquidrandom.math_category()
+writing_style = liquidrandom.writing_style()
+scenario = liquidrandom.scenario()
+domain = liquidrandom.domain()
+science_topic = liquidrandom.science_topic()
+language = liquidrandom.language()
+reasoning_pattern = liquidrandom.reasoning_pattern()
+emotional_state = liquidrandom.emotional_state()
+instruction_complexity = liquidrandom.instruction_complexity()
+```
+
+Use types and objects for the seed data, for instance a Persona object with name, age, etc., and a Job object with name and description.
+Make sure to override the `__str__` method of the objects to return a string representation of the object that can be used in the prompt for the LLM.
+
+Create a clear and concise README.md file that explains the package, how to install it, and how to use it. Include usage examples in the README.md file.
+
+## Seed data
+
+We want to host the seed data on HuggingFace. Make sure only the needed data is fetched and that the package is not too heavy.
+The seed data should be cached locally after the first fetch.
+
+## Seed data generation
+
+Let's create a separate folder for the "run once" scripts that generate the seed data and upload it to HuggingFace.
+We want to use openrouter.ai for the generation of the seed data.
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="https://openrouter.ai/api/v1",
+    api_key="<OPENROUTER_API_KEY>",
+)
+
+# First API call with reasoning
+response = client.chat.completions.create(
+    model="qwen/qwen3.5-397b-a17b",
+    messages=[
+        {
+            "role": "user",
+            "content": "How many r's are in the word 'strawberry'?"
+        }
+    ],
+    extra_body={"reasoning": {"enabled": True}}
+)
+
+# Extract the assistant message with reasoning_details
+message = response.choices[0].message
+
+# Preserve the assistant message with reasoning_details
+messages = [
+    {"role": "user", "content": "How many r's are in the word 'strawberry'?"},
+    {
+        "role": "assistant",
+        "content": message.content,
+        "reasoning_details": message.reasoning_details  # Pass back unmodified
+    },
+    {"role": "user", "content": "Are you sure? Think carefully."}
+]
+
+# Second API call - model continues reasoning from where it left off
+response2 = client.chat.completions.create(
+    model="qwen/qwen3.5-397b-a17b",
+    messages=messages,
+    extra_body={"reasoning": {"enabled": True}}
+)
+```
+
+You can assume the envvar OPENROUTER_API_KEY is set.
+For uploading to HuggingFace, either let me know how to do it, or use the envvar HF_TOKEN.
+
+We want to accelerate the process, thus using either async or a ThreadPoolExecutor to generate the seed data in parallel.
+The batch size should be controllable via a CLI argument --batch-size.
+
+Per LLM call we want to generate at least k samples, where k is controllable via a CLI argument --k. Make sure both the prompting and the parsing of the response account for multiple samples being generated per call.
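The parallel batching described above can be sketched with a `ThreadPoolExecutor`; `run_batches` and `call_llm` are illustrative names (a stand-in for one generation request), not part of the actual scripts:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_batches(tasks, call_llm, batch_size: int) -> list:
    """Run up to `batch_size` LLM generation calls concurrently.

    `tasks` is any iterable of per-call inputs; `call_llm` performs one
    generation request and returns its parsed result.
    """
    results = []
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        futures = [pool.submit(call_llm, t) for t in tasks]
        for fut in as_completed(futures):
            results.append(fut.result())  # collect in completion order
    return results
```

The `--batch-size` CLI argument would map directly onto `max_workers` here.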
+
+After each generation call, we want to call the same LLM again to check whether the generated data is good enough. This check should look for collapsed or degenerate samples, e.g. hallucinations or repetitive samples. If the generated data is not good enough, we discard the current call and generate new data until we have good enough data. This is to ensure that the generated data is of high quality and that samples are not too similar to one another.
+
+In total we want to generate at least n samples per category, where n is controllable via a CLI argument --n.
+
+## Diversity strategy: Hierarchical Taxonomy Tree
+
+To generate 10k-100k samples without collapsing into repetitive subsets, we use a two-phase hierarchical approach:
+
+### Phase 1: Taxonomy generation
+
+Before generating any seed data samples, first use the LLM to generate a deep taxonomy tree for each category. The tree should be broad at the top and increasingly specific at the leaves.
+
+Example for Science topics:
+```
+Science
+├── Physics
+│   ├── Quantum Mechanics
+│   │   ├── Entanglement phenomena
+│   │   │   ├── Bell state preparation in ion traps
+│   │   │   ├── Quantum teleportation protocols
+│   │   │   └── Loophole-free Bell tests
+│   │   ├── Decoherence
+│   │   │   ├── Environment-induced superselection
+│   │   │   └── ...
+│   ├── Thermodynamics
+│   │   └── ...
+├── Biology
+│   └── ...
+```
+
+The taxonomy depth should auto-scale based on the target sample count:
+- `required_leaf_nodes = target_samples / samples_per_leaf`
+- More samples → deeper tree → more leaf nodes
+- The taxonomy itself should be generated in stages: first top-level branches, then expand each branch deeper, to avoid context window limitations.
+
+The taxonomy generation should also be controllable via CLI arguments: `--taxonomy-depth` to control the depth of the taxonomy tree and `--samples-per-leaf` to control how many samples to generate per leaf node.
+
+The taxonomy should be saved to disk as JSON so it can be inspected, reused, and resumed.
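The auto-scaling rule above can be made concrete. A minimal sketch, assuming a uniform tree with an average `branching_factor` children per node (a knob the spec does not fix; both function names are illustrative):

```python
import math

def required_leaf_nodes(target_samples: int, samples_per_leaf: int) -> int:
    """Number of taxonomy leaves needed to reach the sample target."""
    return math.ceil(target_samples / samples_per_leaf)

def required_depth(leaves: int, branching_factor: int) -> int:
    """Depth of a uniform tree with `branching_factor` children per node
    that yields at least `leaves` leaf nodes."""
    return max(1, math.ceil(math.log(leaves, branching_factor)))
```

For example, 10k samples at 10 samples per leaf needs 1000 leaves, which a tree with ~8 branches per node reaches at depth 4.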
+
+### Phase 2: Round-robin sample generation
+
+Generate samples by cycling through all leaf nodes in round-robin fashion:
+
+```
+for leaf in cycle(all_leaf_nodes):
+    generate k samples for this specific leaf
+    validate & dedup
+    if leaf.count >= target_per_leaf:
+        mark leaf as done
+```
+
+This ensures that no single subtopic gets over-represented. Each generation prompt should include the leaf node's full path in the taxonomy (e.g. "Science > Physics > Quantum Mechanics > Entanglement phenomena > Bell state preparation in ion traps") to anchor the LLM to that specific subtopic.
+
+The previously generated samples for the current leaf should be included in the prompt to avoid repetition within the same leaf.
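The round-robin loop above can be sketched as follows; `generate_k` stands in for the LLM call and receives the leaf's prior samples so they can be included in the prompt (names are illustrative, not the actual implementation):

```python
from itertools import cycle

def round_robin_generate(leaf_nodes, generate_k, k, target_per_leaf):
    """Cycle through leaves until every leaf reaches its sample target."""
    samples = {leaf: [] for leaf in leaf_nodes}
    for leaf in cycle(leaf_nodes):
        if all(len(s) >= target_per_leaf for s in samples.values()):
            break  # every leaf is done
        if len(samples[leaf]) >= target_per_leaf:
            continue  # this leaf is done; move on to the next
        # prior samples go into the prompt to avoid intra-leaf repetition
        samples[leaf].extend(generate_k(leaf, samples[leaf], k))
    return samples
```

Validation and dedup (described below in the spec) would filter each batch before `extend`.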
+
+### Deduplication: Fuzzy string matching
+
+Use token-level fuzzy string matching to detect and reject near-duplicate samples. No ML dependency needed.
+
+Approach:
+- Use Jaccard similarity on token sets (word-level) to compare new samples against all existing samples in the same category.
+- Reject samples with a Jaccard similarity above a configurable threshold (default: 0.7).
+- Additionally, normalize samples before comparison (lowercase, strip punctuation, collapse whitespace) to catch trivial reformulations.
+- The dedup check should run on the `__str__` representation of each sample.
+- The threshold should be controllable via a CLI argument `--dedup-threshold`.
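The dedup approach above fits in a few lines of stdlib Python. A sketch of the normalize-then-compare steps, not the package's actual `dedup.py`:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity on normalized token sets."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta and not tb:
        return 1.0  # two empty strings are identical
    return len(ta & tb) / len(ta | tb)

def is_duplicate(candidate: str, existing: list[str], threshold: float = 0.7) -> bool:
    """Reject `candidate` if it is too similar to any existing sample."""
    return any(jaccard(candidate, s) >= threshold for s in existing)
```

Because comparison runs on normalized token sets, "Hello, World!" and "hello world" score 1.0 and trivial reformulations are caught.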
+
+## Categories of seed data
+
+- Persona: random personas with name, age, etc.
+- Professions or jobs: random professions or jobs with a name and a description of the job.
+- Coding tasks: specific programming challenges and tasks (e.g. "implement a trie-based autocomplete with fuzzy matching", "write a rate limiter using the token bucket algorithm"). Should include the programming context, constraints, and expected behavior.
+- Math categories: random math categories like algebra, geometry, etc. with a description of the category.
+- Writing styles / tones: diverse writing styles and tones (e.g. "sardonic academic critique", "enthusiastic technical blogger", "dry legal prose"). Includes the style name and a description of its characteristics.
+- Scenarios / situations: real-world scenarios or situations (e.g. "debugging a production outage at 2am", "negotiating a contract with a difficult vendor"). Includes the scenario description and relevant context.
+- Domains / topics: specific knowledge domains and subtopics (e.g. "supply chain logistics for perishable goods", "comparative analysis of NoSQL databases"). Includes the domain name and a detailed description.
+- Science topics: specific scientific topics and phenomena (e.g. "quantum entanglement in photon pairs", "CRISPR-Cas9 off-target effects in gene therapy", "tidal locking in exoplanetary systems"). Includes the topic name, scientific field, and a description.
+- Languages / locales: random languages, dialects, or cultural contexts (e.g. "Brazilian Portuguese, informal register", "Kansai Japanese dialect"). Includes the language, region, register, and cultural notes.
+- Reasoning patterns: types of reasoning or problem-solving approaches (e.g. "proof by contradiction", "cost-benefit analysis with uncertainty", "analogical reasoning from biology to engineering"). Includes the pattern name and a description of how it works.
+- Emotional states: specific emotional states or moods (e.g. "frustrated but trying to stay polite", "cautiously optimistic after a setback"). Includes the emotional state and a behavioral description.
+- Instruction complexity: varying levels of instruction complexity and ambiguity (e.g. "vague one-liner request", "detailed multi-step specification with constraints", "contradictory requirements"). Includes the complexity level, a description, and an example.
+
+Make sure the seed data is very specific: "algebra" is not good seed data, but "different ways to solve linear equations" is. This ensures that the generated training data is more diverse and samples are not too similar to one another.
+
+For the generation process, use the rich Python library to show progress as well as the estimated time of arrival (ETA) for the generation process.
+
+Dependencies and the README for the run-once seed generation scripts should live in the same folder as the scripts, separate from the main package. The main package should not carry any dependencies that are only needed for seed data generation.
+
+++ liquidrandom-0.1.0/PKG-INFO
@@ -0,0 +1,105 @@
+Metadata-Version: 2.4
+Name: liquidrandom
+Version: 0.1.0
+Summary: Pseudo-random seed data generation for ML/LLM training diversity
+License-Expression: MIT
+Requires-Python: >=3.11
+Requires-Dist: huggingface-hub>=0.20.0
+Requires-Dist: pyarrow>=14.0.0
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0; extra == 'dev'
+Requires-Dist: ty; extra == 'dev'
+Description-Content-Type: text/markdown
+
+# liquidrandom
+
+Pseudo-random seed data for ML/LLM training diversity.
+
+When using LLMs to generate training data, outputs tend to be repetitive and lack variety. `liquidrandom` solves this by providing a large pool of diverse, pre-generated seed data (personas, jobs, scenarios, etc.) that you can inject into your prompts to steer generation toward more varied outputs.
+
+## Installation
+
+```bash
+pip install liquidrandom
+# or
+uv add liquidrandom
+```
+
+## Quick Start
+
+```python
+import liquidrandom
+
+# Get a random persona to inject into your LLM prompt
+persona = liquidrandom.persona()
+print(persona)
+# Alice is a 30-year-old female from Canada. They work as an engineer. ...
+
+# Get a random coding task
+task = liquidrandom.coding_task()
+print(task)
+# [Python, medium] Implement a trie: Build a trie data structure ...
+```
+
+## Available Categories
+
+| Function | Returns | Description |
+|---|---|---|
+| `liquidrandom.persona()` | `Persona` | Random personas with name, age, gender, occupation, nationality, personality traits, background |
+| `liquidrandom.job()` | `Job` | Professions with title, industry, description, required skills, experience level |
+| `liquidrandom.coding_task()` | `CodingTask` | Programming challenges with title, language, difficulty, description, constraints, expected behavior |
+| `liquidrandom.math_category()` | `MathCategory` | Math categories with name, field, description, example problems |
+| `liquidrandom.writing_style()` | `WritingStyle` | Writing styles with name, tone, characteristics, description |
+| `liquidrandom.scenario()` | `Scenario` | Real-world scenarios with title, context, setting, stakes, description |
+| `liquidrandom.domain()` | `Domain` | Knowledge domains with name, parent field, description, key concepts |
+| `liquidrandom.science_topic()` | `ScienceTopic` | Scientific topics with name, field, subfield, description |
+| `liquidrandom.language()` | `Language` | Languages/locales with name, region, register, script, cultural notes |
+| `liquidrandom.reasoning_pattern()` | `ReasoningPattern` | Reasoning approaches with name, category, description, when to use |
+| `liquidrandom.emotional_state()` | `EmotionalState` | Emotional states with name, intensity, valence, behavioral description |
+| `liquidrandom.instruction_complexity()` | `InstructionComplexity` | Instruction complexity levels with level, ambiguity, description, example |
+
+## Usage Example
+
+Use `liquidrandom` to add diversity to your LLM data generation pipeline:
+
+```python
+import liquidrandom
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="https://openrouter.ai/api/v1",
+    api_key="<OPENROUTER_API_KEY>",
+)
+
+persona = liquidrandom.persona()
+style = liquidrandom.writing_style()
+topic = liquidrandom.science_topic()
+
+prompt = f"""You are {persona}
+Write in the following style: {style}
+Explain the following topic: {topic}"""
+
+response = client.chat.completions.create(
+    model="liquid/lfm-2-24b-a2b",
+    messages=[{"role": "user", "content": prompt}],
+)
+```
+
+Each call to a `liquidrandom` function returns a typed dataclass. You can use them directly in f-strings (via `__str__`) or access their individual fields:
+
+```python
+persona = liquidrandom.persona()
+print(persona.name)                # "Alice"
+print(persona.age)                 # 30
+print(persona.personality_traits)  # ["curious", "patient"]
+```
+
+## How It Works
+
+The dataset contains 340,000+ samples across 12 categories, generated using hierarchical taxonomy trees with LLM-based quality validation and fuzzy deduplication.
+
+Seed data is hosted on HuggingFace ([mlech26l/liquidrandom-data](https://huggingface.co/datasets/mlech26l/liquidrandom-data)) as zstd-compressed Parquet files. On first use, only the requested category file is downloaded and cached locally. Subsequent calls use the cached data.
+
+## License
+
+MIT
+++ liquidrandom-0.1.0/README.md
@@ -0,0 +1,92 @@
+# liquidrandom
+
+Pseudo-random seed data for ML/LLM training diversity.
+
+When using LLMs to generate training data, outputs tend to be repetitive and lack variety. `liquidrandom` solves this by providing a large pool of diverse, pre-generated seed data (personas, jobs, scenarios, etc.) that you can inject into your prompts to steer generation toward more varied outputs.
+
+## Installation
+
+```bash
+pip install liquidrandom
+# or
+uv add liquidrandom
+```
+
+## Quick Start
+
+```python
+import liquidrandom
+
+# Get a random persona to inject into your LLM prompt
+persona = liquidrandom.persona()
+print(persona)
+# Alice is a 30-year-old female from Canada. They work as an engineer. ...
+
+# Get a random coding task
+task = liquidrandom.coding_task()
+print(task)
+# [Python, medium] Implement a trie: Build a trie data structure ...
+```
+
+## Available Categories
+
+| Function | Returns | Description |
+|---|---|---|
+| `liquidrandom.persona()` | `Persona` | Random personas with name, age, gender, occupation, nationality, personality traits, background |
+| `liquidrandom.job()` | `Job` | Professions with title, industry, description, required skills, experience level |
+| `liquidrandom.coding_task()` | `CodingTask` | Programming challenges with title, language, difficulty, description, constraints, expected behavior |
+| `liquidrandom.math_category()` | `MathCategory` | Math categories with name, field, description, example problems |
+| `liquidrandom.writing_style()` | `WritingStyle` | Writing styles with name, tone, characteristics, description |
+| `liquidrandom.scenario()` | `Scenario` | Real-world scenarios with title, context, setting, stakes, description |
+| `liquidrandom.domain()` | `Domain` | Knowledge domains with name, parent field, description, key concepts |
+| `liquidrandom.science_topic()` | `ScienceTopic` | Scientific topics with name, field, subfield, description |
+| `liquidrandom.language()` | `Language` | Languages/locales with name, region, register, script, cultural notes |
+| `liquidrandom.reasoning_pattern()` | `ReasoningPattern` | Reasoning approaches with name, category, description, when to use |
+| `liquidrandom.emotional_state()` | `EmotionalState` | Emotional states with name, intensity, valence, behavioral description |
+| `liquidrandom.instruction_complexity()` | `InstructionComplexity` | Instruction complexity levels with level, ambiguity, description, example |
+
+## Usage Example
+
+Use `liquidrandom` to add diversity to your LLM data generation pipeline:
+
+```python
+import liquidrandom
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="https://openrouter.ai/api/v1",
+    api_key="<OPENROUTER_API_KEY>",
+)
+
+persona = liquidrandom.persona()
+style = liquidrandom.writing_style()
+topic = liquidrandom.science_topic()
+
+prompt = f"""You are {persona}
+Write in the following style: {style}
+Explain the following topic: {topic}"""
+
+response = client.chat.completions.create(
+    model="liquid/lfm-2-24b-a2b",
+    messages=[{"role": "user", "content": prompt}],
+)
+```
+
+Each call to a `liquidrandom` function returns a typed dataclass. You can use them directly in f-strings (via `__str__`) or access their individual fields:
+
+```python
+persona = liquidrandom.persona()
+print(persona.name)                # "Alice"
+print(persona.age)                 # 30
+print(persona.personality_traits)  # ["curious", "patient"]
+```
+
+## How It Works
+
+The dataset contains 340,000+ samples across 12 categories, generated using hierarchical taxonomy trees with LLM-based quality validation and fuzzy deduplication.
+
+Seed data is hosted on HuggingFace ([mlech26l/liquidrandom-data](https://huggingface.co/datasets/mlech26l/liquidrandom-data)) as zstd-compressed Parquet files. On first use, only the requested category file is downloaded and cached locally. Subsequent calls use the cached data.
+
+## License
+
+MIT
+++ liquidrandom-0.1.0/pyproject.toml
@@ -0,0 +1,28 @@
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[project]
+name = "liquidrandom"
+version = "0.1.0"
+description = "Pseudo-random seed data generation for ML/LLM training diversity"
+readme = "README.md"
+requires-python = ">=3.11"
+license = "MIT"
+dependencies = [
+    "huggingface-hub>=0.20.0",
+    "pyarrow>=14.0.0",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0",
+    "ty",
+]
+
+[tool.hatch.build.targets.wheel]
+packages = ["src/liquidrandom"]
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+++ liquidrandom-0.1.0/seed_generation/README.md
@@ -0,0 +1,113 @@
+# Seed Data Generation
+
+Scripts for generating diverse seed data for the `liquidrandom` package.
+
+## Setup
+
+```bash
+cd seed_generation
+uv sync
+```
+
+## Environment Variables
+
+| Variable | Required | Description |
+|---|---|---|
+| `OPENROUTER_API_KEY` | Yes (for generation) | OpenRouter API key |
+| `HF_TOKEN` | Yes (for upload) | HuggingFace write token |
+
+## Usage
+
+### Generate seed data
+
+```bash
+# Generate 1000 samples per category for all categories
+python generate.py generate --n 1000 --k 10
+
+# Generate for specific categories only
+python generate.py generate --n 1000 --k 10 --categories persona --categories job
+
+# Full control over all parameters
+python generate.py generate \
+  --n 10000 \
+  --k 10 \
+  --batch-size 5 \
+  --taxonomy-depth 4 \
+  --samples-per-leaf 10 \
+  --dedup-threshold 0.7 \
+  --output-dir ./output
+```
+
+### Resume interrupted generation
+
+```bash
+python generate.py generate --resume --output-dir ./output
+```
+
+### Upload to HuggingFace
+
+```bash
+python generate.py upload-only --output-dir ./output --repo-id mlech26l/liquidrandom-data
+```
+
+## CLI Arguments
+
+| Argument | Default | Description |
+|---|---|---|
+| `--n` | 1000 | Target samples per category |
+| `--k` | 10 | Samples generated per LLM call |
+| `--batch-size` | 5 | Number of concurrent LLM calls |
+| `--taxonomy-depth` | 4 | Maximum depth of the taxonomy tree |
+| `--samples-per-leaf` | 10 | Target samples per leaf node |
+| `--dedup-threshold` | 0.7 | Jaccard similarity threshold for dedup |
+| `--categories` | all | Which categories to generate |
+| `--resume` | false | Resume from checkpoint |
+| `--output-dir` | output | Output directory |
+
+## How It Works
+
+### Phase 1: Taxonomy Generation
+
+For each category, the LLM generates a deep taxonomy tree to ensure diversity. The tree is expanded breadth-first, level by level, until there are enough leaf nodes.
+
+Taxonomies are saved as JSON in `output/taxonomies/` and can be inspected or reused.
+
+### Phase 2: Round-Robin Sample Generation
+
+Samples are generated by cycling through all leaf nodes in round-robin fashion. Each generation prompt includes:
+- The leaf node's full taxonomy path (to anchor specificity)
+- Previously generated samples for that leaf (to avoid repetition)
+
+### Quality Validation
+
+Each batch is validated by a second LLM call that checks for:
+- Empty or placeholder content
+- Hallucinations or factual impossibility
+- Intra-batch repetitiveness
+- Off-topic samples
+- Insufficient specificity
+
+If >50% of a batch is rejected, the entire batch is discarded and regenerated.
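The >50% rule reduces to a one-line check. A sketch assuming the validator returns one boolean verdict per sample (the function name is illustrative):

```python
def keep_batch(verdicts: list[bool]) -> bool:
    """Keep a batch unless more than half of its samples were rejected."""
    rejected = verdicts.count(False)
    return rejected * 2 <= len(verdicts)  # rejected fraction <= 50%
```

At exactly 50% rejected the batch is kept; only a strict majority of rejections triggers regeneration.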
+
+### Deduplication
+
+Token-level Jaccard similarity is used to detect near-duplicates:
+- Text is normalized (lowercase, strip punctuation, collapse whitespace)
+- Word-level token sets are compared
+- Samples above the similarity threshold are rejected
+
+## Output Structure
+
+```
+output/
+├── taxonomies/          # JSON taxonomy trees per category
+│   ├── persona.json
+│   ├── job.json
+│   └── ...
+├── samples/             # Per-leaf JSONL files
+│   ├── persona/
+│   │   ├── Personas_Young-Adults_...jsonl
+│   │   └── ...
+│   └── ...
+└── state.json           # Checkpoint for resumability
+```