liquidrandom-0.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. liquidrandom-0.1.0/.gitignore +8 -0
  2. liquidrandom-0.1.0/.python-version +1 -0
  3. liquidrandom-0.1.0/CLAUDE.md +183 -0
  4. liquidrandom-0.1.0/PKG-INFO +105 -0
  5. liquidrandom-0.1.0/README.md +92 -0
  6. liquidrandom-0.1.0/pyproject.toml +28 -0
  7. liquidrandom-0.1.0/seed_generation/README.md +113 -0
  8. liquidrandom-0.1.0/seed_generation/categories.py +415 -0
  9. liquidrandom-0.1.0/seed_generation/config.py +19 -0
  10. liquidrandom-0.1.0/seed_generation/dedup.py +80 -0
  11. liquidrandom-0.1.0/seed_generation/generate.py +215 -0
  12. liquidrandom-0.1.0/seed_generation/inspect_samples.py +162 -0
  13. liquidrandom-0.1.0/seed_generation/llm.py +93 -0
  14. liquidrandom-0.1.0/seed_generation/pyproject.toml +15 -0
  15. liquidrandom-0.1.0/seed_generation/sampler.py +244 -0
  16. liquidrandom-0.1.0/seed_generation/state.py +69 -0
  17. liquidrandom-0.1.0/seed_generation/taxonomy.py +229 -0
  18. liquidrandom-0.1.0/seed_generation/uploader.py +148 -0
  19. liquidrandom-0.1.0/seed_generation/uv.lock +652 -0
  20. liquidrandom-0.1.0/seed_generation/validator.py +78 -0
  21. liquidrandom-0.1.0/src/liquidrandom/__init__.py +108 -0
  22. liquidrandom-0.1.0/src/liquidrandom/_loader.py +42 -0
  23. liquidrandom-0.1.0/src/liquidrandom/_registry.py +43 -0
  24. liquidrandom-0.1.0/src/liquidrandom/models/__init__.py +27 -0
  25. liquidrandom-0.1.0/src/liquidrandom/models/coding_task.py +33 -0
  26. liquidrandom-0.1.0/src/liquidrandom/models/domain.py +28 -0
  27. liquidrandom-0.1.0/src/liquidrandom/models/emotional_state.py +27 -0
  28. liquidrandom-0.1.0/src/liquidrandom/models/instruction_complexity.py +27 -0
  29. liquidrandom-0.1.0/src/liquidrandom/models/job.py +30 -0
  30. liquidrandom-0.1.0/src/liquidrandom/models/language.py +29 -0
  31. liquidrandom-0.1.0/src/liquidrandom/models/math_category.py +28 -0
  32. liquidrandom-0.1.0/src/liquidrandom/models/persona.py +35 -0
  33. liquidrandom-0.1.0/src/liquidrandom/models/reasoning_pattern.py +27 -0
  34. liquidrandom-0.1.0/src/liquidrandom/models/scenario.py +30 -0
  35. liquidrandom-0.1.0/src/liquidrandom/models/science_topic.py +27 -0
  36. liquidrandom-0.1.0/src/liquidrandom/models/writing_style.py +28 -0
  37. liquidrandom-0.1.0/src/liquidrandom/py.typed +0 -0
  38. liquidrandom-0.1.0/tests/__init__.py +0 -0
  39. liquidrandom-0.1.0/tests/test_e2e.py +152 -0
  40. liquidrandom-0.1.0/tests/test_loader.py +75 -0
  41. liquidrandom-0.1.0/tests/test_models.py +171 -0
  42. liquidrandom-0.1.0/uv.lock +455 -0
@@ -0,0 +1,8 @@
+ __pycache__/
+ *.pyc
+ *.egg-info/
+ dist/
+ build/
+ .venv/
+ seed_generation/output/
+ .pytest_cache/
@@ -0,0 +1 @@
+ 3.12
@@ -0,0 +1,183 @@
+ # Liquidrandom
+
+ This is a Python package for pseudo-random data generation, made for seeding machine-learning data generation processes.
+ E.g. a typical use case is using an LLM to generate training data for another model, which suffers from a lack of randomness in the generated outputs. This package solves that problem by adding randomness to the data generation process, for instance by injecting random personas or jobs into the prompt.
+
+ ## Package
+
+ The package is called `liquidrandom` and can be installed via pip.
+ We use typed Python.
+ We use `ty` for type checking and `uv` for dependency management.
+ Create a pyproject.toml file for the package.
+
+ The package should be usable as follows:
+
+ ```python
+ import liquidrandom
+
+ persona = liquidrandom.persona()
+ job = liquidrandom.job()
+ coding_task = liquidrandom.coding_task()
+ math_category = liquidrandom.math_category()
+ writing_style = liquidrandom.writing_style()
+ scenario = liquidrandom.scenario()
+ domain = liquidrandom.domain()
+ science_topic = liquidrandom.science_topic()
+ language = liquidrandom.language()
+ reasoning_pattern = liquidrandom.reasoning_pattern()
+ emotional_state = liquidrandom.emotional_state()
+ instruction_complexity = liquidrandom.instruction_complexity()
+ ```
+
+ Use types and objects for the seed data, for instance a Persona object with name, age, etc. and a Job object with name and description.
+ Make sure to override the `__str__` method of each object to return a string representation that can be used in the prompt for the LLM.
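As a sketch of what such a seed object could look like (the field names here are illustrative, not the package's actual schema):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    age: int
    occupation: str

    def __str__(self) -> str:
        # Prompt-ready natural-language rendering of the object
        return f"{self.name} is a {self.age}-year-old {self.occupation}."
```

An instance can then be dropped straight into an f-string prompt via its `__str__`, or its fields can be accessed individually.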
+
+ Create a clear and concise README.md file that explains the package, how to install it, and how to use it. Include usage examples in the README.md file.
+
+ ## Seed data
+
+ We want to host the seed data on Hugging Face. Make sure only the needed data is fetched and that the package is not too heavy.
+ The seed data should be cached locally after the first fetch.
+
+ ## Seed data generation
+
+ Let's create a separate folder for the "run once" scripts that generate the seed data and upload it to Hugging Face.
+ We want to use openrouter.ai for the generation of the seed data.
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(
+     base_url="https://openrouter.ai/api/v1",
+     api_key="<OPENROUTER_API_KEY>",
+ )
+
+ # First API call with reasoning
+ response = client.chat.completions.create(
+     model="qwen/qwen3.5-397b-a17b",
+     messages=[
+         {
+             "role": "user",
+             "content": "How many r's are in the word 'strawberry'?"
+         }
+     ],
+     extra_body={"reasoning": {"enabled": True}}
+ )
+
+ # Extract the assistant message with reasoning_details
+ assistant_message = response.choices[0].message
+
+ # Preserve the assistant message with reasoning_details
+ messages = [
+     {"role": "user", "content": "How many r's are in the word 'strawberry'?"},
+     {
+         "role": "assistant",
+         "content": assistant_message.content,
+         "reasoning_details": assistant_message.reasoning_details  # Pass back unmodified
+     },
+     {"role": "user", "content": "Are you sure? Think carefully."}
+ ]
+
+ # Second API call - model continues reasoning from where it left off
+ response2 = client.chat.completions.create(
+     model="qwen/qwen3.5-397b-a17b",
+     messages=messages,
+     extra_body={"reasoning": {"enabled": True}}
+ )
+ ```
+
+ You can assume the envvar OPENROUTER_API_KEY is set.
+ For uploading to Hugging Face, either let me know how to do it, or use the envvar HF_TOKEN.
+
+ We want to accelerate the process, thus use either async or ThreadPoolExecutor to generate the seed data in parallel.
+ The batch size should be controllable via a CLI argument `--batch-size`.
+
+ Per LLM call we want to generate at least k samples, where k is controllable via a CLI argument `--k`. Make sure both the prompting and the parsing of the response account for multiple samples being generated per call.
+
+ After each generation call, we want to call the same LLM again to check whether the generated data is good enough. This check should look for collapsed or degenerate samples, e.g. hallucinations or repetitive samples. If the generated data is not good enough, we discard the current call and generate new data until we have good enough data. This ensures that the generated data is of high quality and not too similar to each other.
+
+ In total we want to generate at least n samples per category, where n is controllable via a CLI argument `--n`.
+
+ ## Diversity strategy: Hierarchical Taxonomy Tree
+
+ To generate 10k-100k samples without collapsing into repetitive subsets, we use a two-phase hierarchical approach:
+
+ ### Phase 1: Taxonomy generation
+
+ Before generating any seed data samples, first use the LLM to generate a deep taxonomy tree for each category. The tree should be broad at the top and increasingly specific at the leaves.
+
+ Example for Science topics:
+ ```
+ Science
+ ├── Physics
+ │   ├── Quantum Mechanics
+ │   │   ├── Entanglement phenomena
+ │   │   │   ├── Bell state preparation in ion traps
+ │   │   │   ├── Quantum teleportation protocols
+ │   │   │   └── Loophole-free Bell tests
+ │   │   ├── Decoherence
+ │   │   │   ├── Environment-induced superselection
+ │   │   │   └── ...
+ │   ├── Thermodynamics
+ │   │   └── ...
+ ├── Biology
+ │   └── ...
+ ```
+
+ The taxonomy depth should auto-scale based on the target sample count:
+ - `required_leaf_nodes = target_samples / samples_per_leaf`
+ - More samples → deeper tree → more leaf nodes
+ - The taxonomy itself should be generated in stages: first top-level branches, then expand each branch deeper, to avoid context window limitations.
+
+ The taxonomy generation should also be controllable via a CLI argument `--taxonomy-depth` to control the depth of the taxonomy tree and `--samples-per-leaf` to control how many samples to generate per leaf node.
+
+ The taxonomy should be saved to disk as JSON so it can be inspected, reused, and resumed.
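A minimal sketch of the auto-scaling rule above, assuming a roughly uniform branching factor per level (the branching factor is an assumption; it is not specified here):

```python
import math


def required_depth(target_samples: int, samples_per_leaf: int, branching: int) -> int:
    """Shallowest uniform tree whose branching**depth leaves cover the target."""
    required_leaf_nodes = math.ceil(target_samples / samples_per_leaf)
    # Smallest depth with branching**depth >= required_leaf_nodes
    return max(1, math.ceil(math.log(required_leaf_nodes, branching)))
```

For example, 10,000 target samples at 10 samples per leaf need 1,000 leaves; with a branching factor of 8, that requires a depth-4 tree (8^4 = 4096 leaves).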
+
+ ### Phase 2: Round-robin sample generation
+
+ Generate samples by cycling through all leaf nodes in round-robin fashion:
+
+ ```
+ for leaf in cycle(all_leaf_nodes):
+     generate k samples for this specific leaf
+     validate & dedup
+     if leaf.count >= target_per_leaf:
+         mark leaf as done
+ ```
+
+ This ensures that no single subtopic gets over-represented. Each generation prompt should include the leaf node's full path in the taxonomy (e.g. "Science > Physics > Quantum Mechanics > Entanglement phenomena > Bell state preparation in ion traps") to anchor the LLM to that specific subtopic.
+
+ The previously generated samples for the current leaf should be included in the prompt to avoid repetition within the same leaf.
+
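The pseudocode above can be made concrete with `itertools.cycle`; here the generate/validate/dedup step is stubbed out as a callback, and all names are illustrative:

```python
from itertools import cycle
from typing import Callable


def round_robin(leaves: list[str], target_per_leaf: int,
                generate_batch: Callable[[str], list[str]]) -> dict[str, int]:
    counts = {leaf: 0 for leaf in leaves}
    done: set[str] = set()
    for leaf in cycle(leaves):
        if len(done) == len(leaves):
            break  # every leaf has reached its target
        if leaf in done:
            continue
        # generate_batch stands in for: generate k samples, validate, dedup
        accepted = generate_batch(leaf)
        counts[leaf] += len(accepted)
        if counts[leaf] >= target_per_leaf:
            done.add(leaf)
    return counts
```

Because finished leaves are skipped rather than removed, slower leaves keep receiving generation calls until they too reach the target.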
+ ### Deduplication: Fuzzy string matching
+
+ Use token-level fuzzy string matching to detect and reject near-duplicate samples. No ML dependency needed.
+
+ Approach:
+ - Use Jaccard similarity on token sets (word-level) to compare new samples against all existing samples in the same category.
+ - Reject samples with a Jaccard similarity above a configurable threshold (default: 0.7).
+ - Additionally, normalize samples before comparison (lowercase, strip punctuation, collapse whitespace) to catch trivial reformulations.
+ - The dedup check should run on the `__str__` representation of each sample.
+ - The threshold should be controllable via a CLI argument `--dedup-threshold`.
+
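A sketch of the normalization plus Jaccard check described above (function names are illustrative):

```python
import re
import string


def normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def jaccard(a: str, b: str) -> float:
    # Word-level token-set similarity on normalized text
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_near_duplicate(candidate: str, existing: list[str],
                      threshold: float = 0.7) -> bool:
    # Reject the candidate if it is too similar to any accepted sample
    return any(jaccard(candidate, s) >= threshold for s in existing)
```

Note this is an O(n) scan per candidate; that is fine at this scale and keeps the pipeline free of ML dependencies.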
+ ## Categories of seed data
+
+ - Persona: random personas with name, age, etc.
+ - Professions or jobs: random professions or jobs with a name and description of the job.
+ - Coding tasks: specific programming challenges and tasks (e.g. "implement a trie-based autocomplete with fuzzy matching", "write a rate limiter using the token bucket algorithm"). Should include the programming context, constraints, and expected behavior.
+ - Math categories: random math categories like algebra, geometry, etc. with a description of the category.
+ - Writing styles / tones: diverse writing styles and tones (e.g. "sardonic academic critique", "enthusiastic technical blogger", "dry legal prose"). Includes the style name and a description of its characteristics.
+ - Scenarios / situations: real-world scenarios or situations (e.g. "debugging a production outage at 2am", "negotiating a contract with a difficult vendor"). Includes the scenario description and relevant context.
+ - Domains / topics: specific knowledge domains and subtopics (e.g. "supply chain logistics for perishable goods", "comparative analysis of NoSQL databases"). Includes the domain name and a detailed description.
+ - Science topics: specific scientific topics and phenomena (e.g. "quantum entanglement in photon pairs", "CRISPR-Cas9 off-target effects in gene therapy", "tidal locking in exoplanetary systems"). Includes the topic name, scientific field, and a description.
+ - Languages / locales: random languages, dialects, or cultural contexts (e.g. "Brazilian Portuguese, informal register", "Kansai Japanese dialect"). Includes the language, region, register, and cultural notes.
+ - Reasoning patterns: types of reasoning or problem-solving approaches (e.g. "proof by contradiction", "cost-benefit analysis with uncertainty", "analogical reasoning from biology to engineering"). Includes the pattern name and a description of how it works.
+ - Emotional states: specific emotional states or moods (e.g. "frustrated but trying to stay polite", "cautiously optimistic after a setback"). Includes the emotional state and a behavioral description.
+ - Instruction complexity: varying levels of instruction complexity and ambiguity (e.g. "vague one-liner request", "detailed multi-step specification with constraints", "contradictory requirements"). Includes the complexity level, a description, and an example.
+
+ Make sure the seed data is very specific, i.e. "algebra" is not good seed data, but "different ways to solve linear equations" is. This ensures that the generated training data is more diverse and not too similar to each other.
+
+ For the generation process, use the rich Python library to show progress as well as the estimated time of arrival (ETA) for the generation process.
+
+ Dependencies and README for the run-once seed generation scripts should live in the same folder as the scripts, separate from the main package. The main package should not carry any dependencies that are only needed for seed generation.
+
@@ -0,0 +1,105 @@
+ Metadata-Version: 2.4
+ Name: liquidrandom
+ Version: 0.1.0
+ Summary: Pseudo-random seed data generation for ML/LLM training diversity
+ License-Expression: MIT
+ Requires-Python: >=3.11
+ Requires-Dist: huggingface-hub>=0.20.0
+ Requires-Dist: pyarrow>=14.0.0
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0; extra == 'dev'
+ Requires-Dist: ty; extra == 'dev'
+ Description-Content-Type: text/markdown
+
+ # liquidrandom
+
+ Pseudo-random seed data for ML/LLM training diversity.
+
+ When using LLMs to generate training data, outputs tend to be repetitive and lack variety. `liquidrandom` solves this by providing a large pool of diverse, pre-generated seed data (personas, jobs, scenarios, etc.) that you can inject into your prompts to steer generation toward more varied outputs.
+
+ ## Installation
+
+ ```bash
+ pip install liquidrandom
+ # or
+ uv add liquidrandom
+ ```
+
+ ## Quick Start
+
+ ```python
+ import liquidrandom
+
+ # Get a random persona to inject into your LLM prompt
+ persona = liquidrandom.persona()
+ print(persona)
+ # Alice is a 30-year-old female from Canada. They work as an engineer. ...
+
+ # Get a random coding task
+ task = liquidrandom.coding_task()
+ print(task)
+ # [Python, medium] Implement a trie: Build a trie data structure ...
+ ```
+
+ ## Available Categories
+
+ | Function | Returns | Description |
+ |---|---|---|
+ | `liquidrandom.persona()` | `Persona` | Random personas with name, age, gender, occupation, nationality, personality traits, background |
+ | `liquidrandom.job()` | `Job` | Professions with title, industry, description, required skills, experience level |
+ | `liquidrandom.coding_task()` | `CodingTask` | Programming challenges with title, language, difficulty, description, constraints, expected behavior |
+ | `liquidrandom.math_category()` | `MathCategory` | Math categories with name, field, description, example problems |
+ | `liquidrandom.writing_style()` | `WritingStyle` | Writing styles with name, tone, characteristics, description |
+ | `liquidrandom.scenario()` | `Scenario` | Real-world scenarios with title, context, setting, stakes, description |
+ | `liquidrandom.domain()` | `Domain` | Knowledge domains with name, parent field, description, key concepts |
+ | `liquidrandom.science_topic()` | `ScienceTopic` | Scientific topics with name, field, subfield, description |
+ | `liquidrandom.language()` | `Language` | Languages/locales with name, region, register, script, cultural notes |
+ | `liquidrandom.reasoning_pattern()` | `ReasoningPattern` | Reasoning approaches with name, category, description, when to use |
+ | `liquidrandom.emotional_state()` | `EmotionalState` | Emotional states with name, intensity, valence, behavioral description |
+ | `liquidrandom.instruction_complexity()` | `InstructionComplexity` | Instruction complexity levels with level, ambiguity, description, example |
+
+ ## Usage Example
+
+ Use `liquidrandom` to add diversity to your LLM data generation pipeline:
+
+ ```python
+ import liquidrandom
+ from openai import OpenAI
+
+ client = OpenAI(
+     base_url="https://openrouter.ai/api/v1",
+     api_key="<OPENROUTER_API_KEY>",
+ )
+
+ persona = liquidrandom.persona()
+ style = liquidrandom.writing_style()
+ topic = liquidrandom.science_topic()
+
+ prompt = f"""You are {persona}
+ Write in the following style: {style}
+ Explain the following topic: {topic}"""
+
+ response = client.chat.completions.create(
+     model="liquid/lfm-2-24b-a2b",
+     messages=[{"role": "user", "content": prompt}],
+ )
+ ```
+
+ Each call to a `liquidrandom` function returns a typed dataclass. You can use them directly in f-strings (via `__str__`) or access their individual fields:
+
+ ```python
+ persona = liquidrandom.persona()
+ print(persona.name)  # "Alice"
+ print(persona.age)  # 30
+ print(persona.personality_traits)  # ["curious", "patient"]
+ ```
+
+ ## How It Works
+
+ The dataset contains 340,000+ samples across 12 categories, generated using hierarchical taxonomy trees with LLM-based quality validation and fuzzy deduplication.
+
+ Seed data is hosted on HuggingFace ([mlech26l/liquidrandom-data](https://huggingface.co/datasets/mlech26l/liquidrandom-data)) as zstd-compressed Parquet files. On first use, only the requested category file is downloaded and cached locally. Subsequent calls use the cached data.
+
+ ## License
+
+ MIT
@@ -0,0 +1,92 @@
+ # liquidrandom
+
+ Pseudo-random seed data for ML/LLM training diversity.
+
+ When using LLMs to generate training data, outputs tend to be repetitive and lack variety. `liquidrandom` solves this by providing a large pool of diverse, pre-generated seed data (personas, jobs, scenarios, etc.) that you can inject into your prompts to steer generation toward more varied outputs.
+
+ ## Installation
+
+ ```bash
+ pip install liquidrandom
+ # or
+ uv add liquidrandom
+ ```
+
+ ## Quick Start
+
+ ```python
+ import liquidrandom
+
+ # Get a random persona to inject into your LLM prompt
+ persona = liquidrandom.persona()
+ print(persona)
+ # Alice is a 30-year-old female from Canada. They work as an engineer. ...
+
+ # Get a random coding task
+ task = liquidrandom.coding_task()
+ print(task)
+ # [Python, medium] Implement a trie: Build a trie data structure ...
+ ```
+
+ ## Available Categories
+
+ | Function | Returns | Description |
+ |---|---|---|
+ | `liquidrandom.persona()` | `Persona` | Random personas with name, age, gender, occupation, nationality, personality traits, background |
+ | `liquidrandom.job()` | `Job` | Professions with title, industry, description, required skills, experience level |
+ | `liquidrandom.coding_task()` | `CodingTask` | Programming challenges with title, language, difficulty, description, constraints, expected behavior |
+ | `liquidrandom.math_category()` | `MathCategory` | Math categories with name, field, description, example problems |
+ | `liquidrandom.writing_style()` | `WritingStyle` | Writing styles with name, tone, characteristics, description |
+ | `liquidrandom.scenario()` | `Scenario` | Real-world scenarios with title, context, setting, stakes, description |
+ | `liquidrandom.domain()` | `Domain` | Knowledge domains with name, parent field, description, key concepts |
+ | `liquidrandom.science_topic()` | `ScienceTopic` | Scientific topics with name, field, subfield, description |
+ | `liquidrandom.language()` | `Language` | Languages/locales with name, region, register, script, cultural notes |
+ | `liquidrandom.reasoning_pattern()` | `ReasoningPattern` | Reasoning approaches with name, category, description, when to use |
+ | `liquidrandom.emotional_state()` | `EmotionalState` | Emotional states with name, intensity, valence, behavioral description |
+ | `liquidrandom.instruction_complexity()` | `InstructionComplexity` | Instruction complexity levels with level, ambiguity, description, example |
+
+ ## Usage Example
+
+ Use `liquidrandom` to add diversity to your LLM data generation pipeline:
+
+ ```python
+ import liquidrandom
+ from openai import OpenAI
+
+ client = OpenAI(
+     base_url="https://openrouter.ai/api/v1",
+     api_key="<OPENROUTER_API_KEY>",
+ )
+
+ persona = liquidrandom.persona()
+ style = liquidrandom.writing_style()
+ topic = liquidrandom.science_topic()
+
+ prompt = f"""You are {persona}
+ Write in the following style: {style}
+ Explain the following topic: {topic}"""
+
+ response = client.chat.completions.create(
+     model="liquid/lfm-2-24b-a2b",
+     messages=[{"role": "user", "content": prompt}],
+ )
+ ```
+
+ Each call to a `liquidrandom` function returns a typed dataclass. You can use them directly in f-strings (via `__str__`) or access their individual fields:
+
+ ```python
+ persona = liquidrandom.persona()
+ print(persona.name)  # "Alice"
+ print(persona.age)  # 30
+ print(persona.personality_traits)  # ["curious", "patient"]
+ ```
+
+ ## How It Works
+
+ The dataset contains 340,000+ samples across 12 categories, generated using hierarchical taxonomy trees with LLM-based quality validation and fuzzy deduplication.
+
+ Seed data is hosted on HuggingFace ([mlech26l/liquidrandom-data](https://huggingface.co/datasets/mlech26l/liquidrandom-data)) as zstd-compressed Parquet files. On first use, only the requested category file is downloaded and cached locally. Subsequent calls use the cached data.
+
+ ## License
+
+ MIT
@@ -0,0 +1,28 @@
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
+
+ [project]
+ name = "liquidrandom"
+ version = "0.1.0"
+ description = "Pseudo-random seed data generation for ML/LLM training diversity"
+ readme = "README.md"
+ requires-python = ">=3.11"
+ license = "MIT"
+ dependencies = [
+     "huggingface-hub>=0.20.0",
+     "pyarrow>=14.0.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0",
+     "ty",
+ ]
+
+ [tool.hatch.build.targets.wheel]
+ packages = ["src/liquidrandom"]
+
+ [tool.pytest.ini_options]
+ testpaths = ["tests"]
+
@@ -0,0 +1,113 @@
+ # Seed Data Generation
+
+ Scripts for generating diverse seed data for the `liquidrandom` package.
+
+ ## Setup
+
+ ```bash
+ cd seed_generation
+ uv sync
+ ```
+
+ ## Environment Variables
+
+ | Variable | Required | Description |
+ |---|---|---|
+ | `OPENROUTER_API_KEY` | Yes (for generation) | OpenRouter API key |
+ | `HF_TOKEN` | Yes (for upload) | HuggingFace write token |
+
+ ## Usage
+
+ ### Generate seed data
+
+ ```bash
+ # Generate 1000 samples per category for all categories
+ python generate.py generate --n 1000 --k 10
+
+ # Generate for specific categories only
+ python generate.py generate --n 1000 --k 10 --categories persona --categories job
+
+ # Full control over all parameters
+ python generate.py generate \
+     --n 10000 \
+     --k 10 \
+     --batch-size 5 \
+     --taxonomy-depth 4 \
+     --samples-per-leaf 10 \
+     --dedup-threshold 0.7 \
+     --output-dir ./output
+ ```
+
+ ### Resume interrupted generation
+
+ ```bash
+ python generate.py generate --resume --output-dir ./output
+ ```
+
+ ### Upload to HuggingFace
+
+ ```bash
+ python generate.py upload-only --output-dir ./output --repo-id mlech26l/liquidrandom-data
+ ```
+
+ ## CLI Arguments
+
+ | Argument | Default | Description |
+ |---|---|---|
+ | `--n` | 1000 | Target samples per category |
+ | `--k` | 10 | Samples generated per LLM call |
+ | `--batch-size` | 5 | Number of concurrent LLM calls |
+ | `--taxonomy-depth` | 4 | Maximum depth of the taxonomy tree |
+ | `--samples-per-leaf` | 10 | Target samples per leaf node |
+ | `--dedup-threshold` | 0.7 | Jaccard similarity threshold for dedup |
+ | `--categories` | all | Which categories to generate |
+ | `--resume` | false | Resume from checkpoint |
+ | `--output-dir` | output | Output directory |
+
+ ## How It Works
+
+ ### Phase 1: Taxonomy Generation
+
+ For each category, the LLM generates a deep taxonomy tree to ensure diversity. The tree is expanded breadth-first, level by level, until there are enough leaf nodes.
+
+ Taxonomies are saved as JSON in `output/taxonomies/` and can be inspected or reused.
+
+ ### Phase 2: Round-Robin Sample Generation
+
+ Samples are generated by cycling through all leaf nodes in round-robin fashion. Each generation prompt includes:
+ - The leaf node's full taxonomy path (to anchor specificity)
+ - Previously generated samples for that leaf (to avoid repetition)
+
+ ### Quality Validation
+
+ Each batch is validated by a second LLM call that checks for:
+ - Empty or placeholder content
+ - Hallucinations or factual impossibility
+ - Intra-batch repetitiveness
+ - Off-topic samples
+ - Insufficient specificity
+
+ If >50% of a batch is rejected, the entire batch is discarded and regenerated.
+
+ ### Deduplication
+
+ Token-level Jaccard similarity is used to detect near-duplicates:
+ - Text is normalized (lowercase, strip punctuation, collapse whitespace)
+ - Word-level token sets are compared
+ - Samples above the similarity threshold are rejected
+
+ ## Output Structure
+
+ ```
+ output/
+ ├── taxonomies/        # JSON taxonomy trees per category
+ │   ├── persona.json
+ │   ├── job.json
+ │   └── ...
+ ├── samples/           # Per-leaf JSONL files
+ │   ├── persona/
+ │   │   ├── Personas_Young-Adults_...jsonl
+ │   │   └── ...
+ │   └── ...
+ └── state.json         # Checkpoint for resumability
+ ```