fauxdata-cli-0.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,10 @@
+ {
+   "permissions": {
+     "allow": [
+       "WebFetch(domain:posit-dev.github.io)",
+       "WebFetch(domain:posit.co)",
+       "mcp__chrome-devtools__navigate_page",
+       "mcp__chrome-devtools__lighthouse_audit"
+     ]
+   }
+ }
@@ -0,0 +1,4 @@
+ tmp/
+ .venv/
+ __pycache__/
+ *.egg-info/
@@ -0,0 +1 @@
+ 3.12
@@ -0,0 +1,10 @@
+ # Log
+
+ ## 2026-03-06
+
+ - Initial implementation of `fauxdata` CLI
+ - Stack: pointblank 0.22 (native generation + validation), polars, typer, rich, pyfiglet, questionary
+ - Commands: `init`, `generate`, `validate`, `preview`
+ - Example schemas: `people.yml`, `orders.yml`, `events.yml`
+ - All schemas generate and validate cleanly (all rules PASS)
+ - `locale` field at schema level maps to pointblank `country=` param
@@ -0,0 +1,13 @@
+ Metadata-Version: 2.4
+ Name: fauxdata-cli
+ Version: 0.1.0
+ Summary: CLI for generating and validating fake datasets
+ Requires-Python: >=3.11
+ Requires-Dist: faker>=26.0
+ Requires-Dist: pointblank>=0.22
+ Requires-Dist: polars>=1.0
+ Requires-Dist: pyfiglet>=1.0
+ Requires-Dist: pyyaml>=6.0
+ Requires-Dist: questionary>=2.0
+ Requires-Dist: rich>=13
+ Requires-Dist: typer>=0.12
@@ -0,0 +1,291 @@
+ # fauxdata
+
+ **fauxdata** is a command-line tool for generating and validating realistic fake datasets from simple YAML schemas.
+
+ If you work with data — as an analyst, engineer, developer, or researcher — you constantly need test data: to prototype a pipeline, populate a demo dashboard, write unit tests, or show a colleague how a system should behave. Real data is often unavailable, sensitive, or too messy to share. fauxdata solves this by letting you describe your dataset structure once and generate as many rows as you need, on demand, with realistic values.
+
+ ---
+
+ ## Why fauxdata?
+
+ - **Schema-first**: define the shape of your data in a readable YAML file — column names, types, constraints, realistic presets
+ - **Locale-aware and coherent**: set `locale: IT` and get Italian names, cities, email domains, phone formats, and IBANs — all consistent within each row. Set `locale: JP` and get Japanese names and addresses. The data is not just random strings: related fields are generated together so they make sense as a whole record
+ - **Validated by design**: the same schema that defines generation also drives validation; no surprises
+ - **Pipeline-friendly**: output to stdout with `--out -` for seamless piping and redirection
+ - **Multiple formats**: CSV, Parquet, JSON, and JSONL (JSON Lines) out of the box
+
+ ---
+
+ ## Install
+
+ ### With uv (recommended)
+
+ [uv](https://docs.astral.sh/uv/) installs fauxdata as an isolated tool available globally, without polluting any existing Python environment:
+
+ ```bash
+ git clone https://github.com/aborruso/fauxdata
+ cd fauxdata
+ uv tool install .
+ ```
+
+ After installation, `fauxdata` is available from any directory.
+
+ To update after code changes:
+
+ ```bash
+ uv tool install . --reinstall
+ ```
+
+ ### With pip
+
+ ```bash
+ git clone https://github.com/aborruso/fauxdata
+ cd fauxdata
+ pip install .
+ ```
+
+ ---
+
+ ## Quick start
+
+ ```bash
+ # Generate 500 rows from a schema, with validation
+ fauxdata generate schemas/people.yml --rows 500 --validate
+
+ # Stream to stdout and pipe to other tools
+ fauxdata generate schemas/people.yml --rows 1000 --out - | head -5
+
+ # Validate an existing file against a schema
+ fauxdata validate my_data.csv schemas/people.yml
+
+ # Preview a dataset with column statistics
+ fauxdata preview my_data.csv --rows 10
+
+ # Create a new schema interactively
+ fauxdata init --name orders
+ ```
+
+ ---
+
+ ## Schema format
+
+ A schema is a YAML file that describes the structure of your dataset. Here is a realistic example for a people dataset:
+
+ ```yaml
+ name: people
+ description: "People dataset with personal info"
+ rows: 1000
+ seed: 42
+ locale: IT  # ISO country code — affects names, cities, emails, phone numbers, etc.
+
+ output:
+   format: csv  # csv | parquet | json | jsonl | jsonlines
+   path: tmp/people.csv
+
+ columns:
+   id:
+     type: int
+     unique: true
+     min: 1
+     max: 99999
+
+   name:
+     type: string
+     preset: name  # generates realistic full names for the given locale
+
+   email:
+     type: string
+     preset: email
+
+   age:
+     type: int
+     min: 18
+     max: 90
+
+   city:
+     type: string
+     preset: city
+
+   country_code:
+     type: string
+     preset: country_code_2  # ISO 3166-1 alpha-2, e.g. "IT"
+
+   active:
+     type: bool
+
+   signup_date:
+     type: date
+     min: "2020-01-01"
+     max: "2024-12-31"
+
+   score:
+     type: float
+     min: 0.0
+     max: 100.0
+
+   status:
+     type: string
+     values: [active, inactive, pending]  # enum: pick from a fixed list
+
+ validation:
+   - rule: col_vals_not_null
+     columns: [id, name, email]
+   - rule: col_vals_between
+     column: age
+     min: 18
+     max: 90
+   - rule: col_vals_regex
+     column: email
+     pattern: "^[^@]+@[^@]+\\.[^@]+$"
+   - rule: rows_distinct
+     columns: [id]
+ ```
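The `seed` and per-column constraints above are what make generation reproducible. fauxdata itself generates data through pointblank, but the core idea (a seeded RNG walking the column specs, retrying on collisions for `unique` columns) can be sketched with the standard library alone. The schema dict and function below are illustrative, not fauxdata's internals:

```python
import random

# Simplified stand-in for a parsed YAML schema (subset of the example above).
SCHEMA = {
    "seed": 42,
    "columns": {
        "id":     {"type": "int", "min": 1, "max": 99999, "unique": True},
        "age":    {"type": "int", "min": 18, "max": 90},
        "score":  {"type": "float", "min": 0.0, "max": 100.0},
        "status": {"type": "string", "values": ["active", "inactive", "pending"]},
    },
}

def generate_rows(schema, n):
    rng = random.Random(schema.get("seed"))      # same seed -> same dataset
    seen = {name: set() for name in schema["columns"]}
    rows = []
    for _ in range(n):
        row = {}
        for name, spec in schema["columns"].items():
            while True:
                if "values" in spec:             # enum: pick from a fixed list
                    value = rng.choice(spec["values"])
                elif spec["type"] == "int":
                    value = rng.randint(spec["min"], spec["max"])
                elif spec["type"] == "float":
                    value = rng.uniform(spec["min"], spec["max"])
                else:
                    raise ValueError(f"unsupported type: {spec['type']}")
                if not spec.get("unique") or value not in seen[name]:
                    break                        # retry until the value is unique
            seen[name].add(value)
            row[name] = value
        rows.append(row)
    return rows

rows = generate_rows(SCHEMA, 5)
```

Because the RNG is created fresh from the seed on every call, two runs with the same schema produce identical rows, which is exactly what the `seed` field buys you.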
+
+ ### Column types
+
+ | Type | Description | Options |
+ |------|-------------|---------|
+ | `int` | Integer | `min`, `max`, `unique` |
+ | `float` | Floating point | `min`, `max` |
+ | `string` | Text | `preset`, `values`, `unique` |
+ | `bool` | Boolean | — |
+ | `date` | Date | `min`, `max` (ISO format) |
+ | `datetime` | Datetime | `min`, `max` (ISO format) |
+
+ ### String presets
+
+ Presets generate realistic, locale-aware values. Set `locale` at the schema level to control the country.
+
+ | Category | Presets |
+ |----------|---------|
+ | Personal | `name`, `name_full`, `first_name`, `last_name`, `email`, `phone_number` |
+ | Location | `address`, `city`, `state`, `country`, `country_code_2`, `country_code_3`, `postcode`, `latitude`, `longitude` |
+ | Business | `company`, `job`, `catch_phrase` |
+ | Internet | `url`, `domain_name`, `ipv4`, `ipv6`, `user_name`, `password` |
+ | Text | `text`, `sentence`, `paragraph`, `word` |
+ | Financial | `iban`, `currency_code`, `credit_card_number` |
+ | Identifiers | `uuid4`, `md5`, `sha1`, `ssn`, `license_plate` |
+
+ ### Locale-aware generation
+
+ Setting `locale` in the schema is more than a language switch — it makes the entire dataset culturally coherent.
+
+ With `locale: IT`:
+
+ ```
+ id     name              email                        city     country_code
+ 83811  Giovanni Gentile  giovanni.gentile@tin.it      Bari     IT
+ 14593  Bruno Mancini     bruno.mancini16@virgilio.it  Taranto  IT
+ 3279   Giada Santini     gsantini38@fastwebnet.it     Milano   IT
+ ```
+
+ With `locale: DE`:
+
+ ```
+ id     name          email                city     country_code
+ 12044  Hans Müller   h.mueller@web.de     Berlin   DE
+ 57892  Lena Schmidt  lena.schmidt@gmx.de  München  DE
+ ```
+
+ With `locale: JP`:
+
+ ```
+ id    name         email                  city   country_code
+ 9341  Yuki Tanaka  y.tanaka@docomo.ne.jp  Tokyo  JP
+ ```
+
+ The magic is that **related presets are generated together**: the email is derived from the name, the city belongs to the country, the phone number uses the right country prefix, and IBANs use the correct country code. A single `locale` field in your schema is all it takes.
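To make the coherence idea concrete, here is a simplified stdlib sketch of deriving an email from an already chosen name and a locale-specific provider domain. This is illustrative only: fauxdata's real generation goes through pointblank and faker, and the `DOMAINS` table and helper names below are assumptions for the example.

```python
import unicodedata

# Tiny illustrative table of locale-specific email providers (assumed values).
DOMAINS = {"IT": ["tin.it", "virgilio.it"], "DE": ["web.de", "gmx.de"]}

def slugify(name):
    # Strip accents (e.g. "Müller" -> "Muller") and lowercase for the mailbox part.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return ".".join(ascii_name.lower().split())

def person(name, locale, pick=0):
    # The email is derived from the name, so the record stays internally coherent.
    domain = DOMAINS[locale][pick % len(DOMAINS[locale])]
    return {"name": name, "email": f"{slugify(name)}@{domain}", "country_code": locale}

row = person("Giovanni Gentile", "IT")
```

The same principle extends to cities, phone prefixes, and IBAN country codes: pick the locale once, then derive every related field from it instead of sampling each field independently.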
+
+ Supported locales include: `US`, `IT`, `DE`, `FR`, `ES`, `JP`, `BR`, `PL`, `NL`, `SE`, `DK`, `TR`, `RU`, `CN`, `KR`, and [many more](https://github.com/posit-dev/pointblank).
+
+ ### Validation rules
+
+ | Rule | Description | Parameters |
+ |------|-------------|------------|
+ | `col_vals_not_null` | No nulls | `columns` |
+ | `col_vals_between` | Value in range | `column`, `min`, `max` |
+ | `col_vals_regex` | Matches pattern | `column`, `pattern` |
+ | `col_vals_in_set` | Value in allowed set | `column`, `values` |
+ | `col_vals_gt` / `col_vals_lt` | Greater / less than | `column`, `min` / `max` |
+ | `col_vals_ge` / `col_vals_le` | Greater / less than or equal | `column`, `min` / `max` |
+ | `rows_distinct` | Unique rows | `columns` |
+ | `col_exists` | Column present | `columns` |
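fauxdata hands the actual checking to pointblank, but rules of this shape are easy to reason about: each rule is a small predicate over rows. A minimal stdlib sketch of evaluating a few of them over plain dict rows (the `check` function is illustrative, not fauxdata's API):

```python
import re

def check(rows, rule):
    # Returns True when every row satisfies the rule.
    name = rule["rule"]
    if name == "col_vals_not_null":
        return all(r[c] is not None for r in rows for c in rule["columns"])
    if name == "col_vals_between":
        c = rule["column"]
        return all(rule["min"] <= r[c] <= rule["max"] for r in rows)
    if name == "col_vals_regex":
        pat = re.compile(rule["pattern"])
        return all(pat.search(str(r[rule["column"]])) for r in rows)
    if name == "rows_distinct":
        keys = [tuple(r[c] for c in rule["columns"]) for r in rows]
        return len(keys) == len(set(keys))
    raise ValueError(f"unknown rule: {name}")

data = [
    {"id": 1, "email": "a@example.com", "age": 30},
    {"id": 2, "email": "b@example.org", "age": 45},
]
results = [check(data, r) for r in [
    {"rule": "col_vals_not_null", "columns": ["id", "email"]},
    {"rule": "col_vals_between", "column": "age", "min": 18, "max": 90},
    {"rule": "col_vals_regex", "column": "email", "pattern": r"^[^@]+@[^@]+\.[^@]+$"},
    {"rule": "rows_distinct", "columns": ["id"]},
]]
```

Since the same rule dicts drive both generation constraints and validation, a dataset generated from a schema should pass its own rules by construction.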
+
+ ---
+
+ ## Commands
+
+ ### `fauxdata generate SCHEMA`
+
+ ```
+ fauxdata generate schemas/people.yml
+ fauxdata generate schemas/people.yml --rows 500 --seed 42 --validate
+ fauxdata generate schemas/people.yml --format parquet --out tmp/people.parquet
+ fauxdata generate schemas/people.yml --rows 1000 --out -   # stdout
+ fauxdata generate schemas/people.yml --out - --format jsonl | wc -l
+ ```
+
+ | Option | Short | Default | Description |
+ |--------|-------|---------|-------------|
+ | `--rows` | `-r` | from schema | Number of rows to generate |
+ | `--out` | `-o` | from schema | Output path — use `-` for stdout |
+ | `--format` | `-f` | from schema | Output format: `csv`, `parquet`, `json`, `jsonl`, `jsonlines` |
+ | `--seed` | `-s` | from schema | Random seed for reproducibility |
+ | `--validate` | `-v` | off | Run validation rules after generating |
+
+ When `--out -` is used, all status messages are suppressed and only data is written to stdout.
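Treating `-` as "write to stdout" is a common CLI convention, and it is what keeps the pipe examples above clean. A hedged sketch of how such output selection and CSV writing can work in Python (the helper names are assumptions for illustration, not fauxdata's code):

```python
import csv
import io
import sys

def open_out(path):
    # "-" means stdout; anything else is a regular file opened for writing.
    return sys.stdout if path == "-" else open(path, "w", newline="")

def write_csv(rows, out):
    # Field order is taken from the first row's keys.
    writer = csv.DictWriter(out, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

# Demonstrate with an in-memory buffer instead of a real file or stdout.
buf = io.StringIO()
write_csv([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}], buf)
```

The same dispatch point is also a natural place to silence status messages: when the destination is stdout, send them to stderr or drop them so only data flows through the pipe.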
+
+ ### `fauxdata validate DATASET SCHEMA`
+
+ ```
+ fauxdata validate tmp/people.csv schemas/people.yml
+ ```
+
+ Validates an existing file against a schema. Exits with code `1` if any rule fails — useful in CI pipelines.
+
+ ### `fauxdata preview DATASET`
+
+ ```
+ fauxdata preview tmp/people.csv --rows 10
+ ```
+
+ Shows the first N rows and a column statistics table (type, nulls, unique count, min/max).
+
+ | Option | Short | Default | Description |
+ |--------|-------|---------|-------------|
+ | `--rows` | `-r` | 10 | Number of rows to display |
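The per-column statistics that `preview` reports (type, nulls, unique count, min/max) can each be computed in a single pass over the column. A simplified stdlib sketch with illustrative names, not fauxdata's implementation (which works on polars frames):

```python
def column_stats(rows, column):
    # Summarize one column of a list-of-dicts dataset.
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]   # nulls excluded from stats
    return {
        "type": type(present[0]).__name__ if present else "unknown",
        "nulls": len(values) - len(present),
        "unique": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

data = [{"age": 30}, {"age": 45}, {"age": None}, {"age": 30}]
stats = column_stats(data, "age")
```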
+
+ ### `fauxdata init`
+
+ ```
+ fauxdata init
+ fauxdata init --name orders
+ ```
+
+ Interactive wizard to create a new schema template. Asks for name, description, row count, and default format.
+
+ | Option | Short | Description |
+ |--------|-------|-------------|
+ | `--name` | `-n` | Schema name (skips the interactive prompt) |
+
+ ---
+
+ ## Example schemas
+
+ Three ready-to-use schemas are included in `schemas/`:
+
+ | Schema | Domain | Columns |
+ |--------|--------|---------|
+ | `people.yml` | Personal data | id, name, email, age, city, country_code, active, signup_date, score |
+ | `orders.yml` | E-commerce | order_id, customer_id, product, amount, status, created_at |
+ | `events.yml` | Analytics | event_id, user_id, event_type, timestamp, ip, user_agent, session_duration |
+
+ ---
+
+ ## Acknowledgements
+
+ A heartfelt thank you to **[Rich Iannone](https://github.com/rich-iannone)** and the entire [pointblank](https://github.com/posit-dev/pointblank) team at [Posit](https://posit.co/) for building an exceptional data quality library — and for inspiring this project with their article:
+
+ > **[Building realistic fake datasets with pointblank](https://posit.co/blog/building-realistic-fake-datasets-with-pointblank/)**
+
+ Without their work, fauxdata would not exist. If you find pointblank useful, please give it a ⭐ on GitHub.