fauxdata-cli 0.1.0__tar.gz
- fauxdata_cli-0.1.0/.claude/settings.local.json +10 -0
- fauxdata_cli-0.1.0/.gitignore +4 -0
- fauxdata_cli-0.1.0/.python-version +1 -0
- fauxdata_cli-0.1.0/LOG.md +10 -0
- fauxdata_cli-0.1.0/PKG-INFO +13 -0
- fauxdata_cli-0.1.0/README.md +291 -0
- fauxdata_cli-0.1.0/docs/index.html +1013 -0
- fauxdata_cli-0.1.0/pyproject.toml +25 -0
- fauxdata_cli-0.1.0/schemas/events.yml +54 -0
- fauxdata_cli-0.1.0/schemas/orders.yml +51 -0
- fauxdata_cli-0.1.0/schemas/people.yml +63 -0
- fauxdata_cli-0.1.0/src/fauxdata/__init__.py +3 -0
- fauxdata_cli-0.1.0/src/fauxdata/commands/__init__.py +0 -0
- fauxdata_cli-0.1.0/src/fauxdata/commands/generate.py +117 -0
- fauxdata_cli-0.1.0/src/fauxdata/commands/init.py +116 -0
- fauxdata_cli-0.1.0/src/fauxdata/commands/preview.py +79 -0
- fauxdata_cli-0.1.0/src/fauxdata/commands/validate.py +73 -0
- fauxdata_cli-0.1.0/src/fauxdata/generator.py +80 -0
- fauxdata_cli-0.1.0/src/fauxdata/main.py +73 -0
- fauxdata_cli-0.1.0/src/fauxdata/output.py +57 -0
- fauxdata_cli-0.1.0/src/fauxdata/schema.py +174 -0
- fauxdata_cli-0.1.0/src/fauxdata/validator.py +108 -0
- fauxdata_cli-0.1.0/uv.lock +526 -0
@@ -0,0 +1 @@
+3.12
@@ -0,0 +1,10 @@
+# Log
+
+## 2026-03-06
+
+- Initial implementation of `fauxdata` CLI
+- Stack: pointblank 0.22 (native generation + validation), polars, typer, rich, pyfiglet, questionary
+- Commands: `init`, `generate`, `validate`, `preview`
+- Example schemas: `people.yml`, `orders.yml`, `events.yml`
+- All schemas generate and validate cleanly (all rules PASS)
+- `locale` field at schema level maps to pointblank `country=` param
@@ -0,0 +1,13 @@
+Metadata-Version: 2.4
+Name: fauxdata-cli
+Version: 0.1.0
+Summary: CLI for generating and validating fake datasets
+Requires-Python: >=3.11
+Requires-Dist: faker>=26.0
+Requires-Dist: pointblank>=0.22
+Requires-Dist: polars>=1.0
+Requires-Dist: pyfiglet>=1.0
+Requires-Dist: pyyaml>=6.0
+Requires-Dist: questionary>=2.0
+Requires-Dist: rich>=13
+Requires-Dist: typer>=0.12
@@ -0,0 +1,291 @@
+# fauxdata
+
+**fauxdata** is a command-line tool for generating and validating realistic fake datasets from simple YAML schemas.
+
+If you work with data (as an analyst, engineer, developer, or researcher), you constantly need test data: to prototype a pipeline, populate a demo dashboard, write unit tests, or show a colleague how a system should behave. Real data is often unavailable, sensitive, or too messy to share. fauxdata solves this by letting you describe your dataset structure once and generate as many rows as you need, on demand, with realistic values.
+
+---
+
+## Why fauxdata?
+
+- **Schema-first**: define the shape of your data in a readable YAML file: column names, types, constraints, realistic presets
+- **Locale-aware and coherent**: set `locale: IT` and get Italian names, cities, email domains, phone formats, and IBANs, all consistent within each row. Set `locale: JP` and get Japanese names and addresses. The data is not just random strings: related fields are generated together so they make sense as a whole record
+- **Validated by design**: the same schema that defines generation also drives validation; no surprises
+- **Pipeline-friendly**: output to stdout with `--out -` for seamless piping and redirection
+- **Multiple formats**: CSV, Parquet, JSON, JSONL / JSON Lines out of the box
+
+---
+
+## Install
+
+### With uv (recommended)
+
+[uv](https://docs.astral.sh/uv/) installs fauxdata as an isolated tool available globally, without polluting any existing Python environment:
+
+```bash
+git clone https://github.com/aborruso/fauxdata
+cd fauxdata
+uv tool install .
+```
+
+After installation, `fauxdata` is available from any directory.
+
+To update after code changes:
+
+```bash
+uv tool install . --reinstall
+```
+
+### With pip
+
+```bash
+git clone https://github.com/aborruso/fauxdata
+cd fauxdata
+pip install .
+```
+
+---
+
+## Quick start
+
+```bash
+# Generate 500 rows from a schema, with validation
+fauxdata generate schemas/people.yml --rows 500 --validate
+
+# Stream to stdout and pipe to other tools
+fauxdata generate schemas/people.yml --rows 1000 --out - | head -5
+
+# Validate an existing file against a schema
+fauxdata validate my_data.csv schemas/people.yml
+
+# Preview a dataset with column statistics
+fauxdata preview my_data.csv --rows 10
+
+# Create a new schema interactively
+fauxdata init --name orders
+```
+
+---
+
+## Schema format
+
+A schema is a YAML file that describes the structure of your dataset. Here is a realistic example for a people dataset:
+
+```yaml
+name: people
+description: "People dataset with personal info"
+rows: 1000
+seed: 42
+locale: IT  # ISO country code; affects names, cities, emails, phone numbers, etc.
+
+output:
+  format: csv  # csv | parquet | json | jsonl | jsonlines
+  path: tmp/people.csv
+
+columns:
+  id:
+    type: int
+    unique: true
+    min: 1
+    max: 99999
+
+  name:
+    type: string
+    preset: name  # generates realistic full names for the given locale
+
+  email:
+    type: string
+    preset: email
+
+  age:
+    type: int
+    min: 18
+    max: 90
+
+  city:
+    type: string
+    preset: city
+
+  country_code:
+    type: string
+    preset: country_code_2  # ISO 3166-1 alpha-2, e.g. "IT"
+
+  active:
+    type: bool
+
+  signup_date:
+    type: date
+    min: "2020-01-01"
+    max: "2024-12-31"
+
+  score:
+    type: float
+    min: 0.0
+    max: 100.0
+
+  status:
+    type: string
+    values: [active, inactive, pending]  # enum: pick from a fixed list
+
+validation:
+  - rule: col_vals_not_null
+    columns: [id, name, email]
+  - rule: col_vals_between
+    column: age
+    min: 18
+    max: 90
+  - rule: col_vals_regex
+    column: email
+    pattern: "^[^@]+@[^@]+\\.[^@]+$"
+  - rule: rows_distinct
+    columns: [id]
+```
+
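To make the schema-first idea concrete, here is a minimal sketch of how column specs like the ones above could drive generation. It is not fauxdata's implementation (the log above says generation is delegated to pointblank). It uses only the standard library `random` module, a hand-written dict in place of the parsed YAML, and hypothetical helper names (`generate_value`, `generate_rows`). Presets such as `name` or `email` are left out.

```python
import random
from datetime import date, timedelta

# Hypothetical in-memory copy of part of the people schema (the real tool reads YAML).
SCHEMA = {
    "seed": 42,
    "columns": {
        "id": {"type": "int", "unique": True, "min": 1, "max": 99999},
        "age": {"type": "int", "min": 18, "max": 90},
        "active": {"type": "bool"},
        "signup_date": {"type": "date", "min": "2020-01-01", "max": "2024-12-31"},
        "score": {"type": "float", "min": 0.0, "max": 100.0},
        "status": {"type": "string", "values": ["active", "inactive", "pending"]},
    },
}


def generate_value(spec, rng):
    """Draw one value for a column spec (string presets are out of scope here)."""
    kind = spec["type"]
    if kind == "int":
        return rng.randint(spec["min"], spec["max"])
    if kind == "float":
        return rng.uniform(spec["min"], spec["max"])
    if kind == "bool":
        return rng.random() < 0.5
    if kind == "date":
        start = date.fromisoformat(spec["min"])
        end = date.fromisoformat(spec["max"])
        return start + timedelta(days=rng.randint(0, (end - start).days))
    if kind == "string" and "values" in spec:
        return rng.choice(spec["values"])
    raise ValueError(f"unsupported column spec: {spec}")


def generate_rows(schema, n_rows):
    """Build n_rows dicts; the schema seed makes the output reproducible."""
    rng = random.Random(schema.get("seed"))
    # Unique int columns: sample without replacement from the allowed range.
    unique_pools = {
        name: rng.sample(range(spec["min"], spec["max"] + 1), n_rows)
        for name, spec in schema["columns"].items()
        if spec.get("unique") and spec["type"] == "int"
    }
    rows = []
    for i in range(n_rows):
        row = {}
        for name, spec in schema["columns"].items():
            if name in unique_pools:
                row[name] = unique_pools[name][i]
            else:
                row[name] = generate_value(spec, rng)
        rows.append(row)
    return rows
```

Calling `generate_rows(SCHEMA, 5)` twice returns identical rows, because the generator is re-seeded from the schema on each call. This mirrors the role of `seed` in the schema, although the real tool's random streams will differ.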
+### Column types
+
+| Type | Description | Options |
+|------|-------------|---------|
+| `int` | Integer | `min`, `max`, `unique` |
+| `float` | Floating point | `min`, `max` |
+| `string` | Text | `preset`, `values`, `unique` |
+| `bool` | Boolean | none |
+| `date` | Date | `min`, `max` (ISO format) |
+| `datetime` | Datetime | `min`, `max` (ISO format) |
+
+### String presets
+
+Presets generate realistic, locale-aware values. Set `locale` at the schema level to control the country.
+
+| Category | Presets |
+|----------|---------|
+| Personal | `name`, `name_full`, `first_name`, `last_name`, `email`, `phone_number` |
+| Location | `address`, `city`, `state`, `country`, `country_code_2`, `country_code_3`, `postcode`, `latitude`, `longitude` |
+| Business | `company`, `job`, `catch_phrase` |
+| Internet | `url`, `domain_name`, `ipv4`, `ipv6`, `user_name`, `password` |
+| Text | `text`, `sentence`, `paragraph`, `word` |
+| Financial | `iban`, `currency_code`, `credit_card_number` |
+| Identifiers | `uuid4`, `md5`, `sha1`, `ssn`, `license_plate` |
+
+### Locale-aware generation
+
+Setting `locale` in the schema is more than a language switch: it makes the entire dataset culturally coherent.
+
+With `locale: IT`:
+
+```
+id     name              email                        city     country_code
+83811  Giovanni Gentile  giovanni.gentile@tin.it      Bari     IT
+14593  Bruno Mancini     bruno.mancini16@virgilio.it  Taranto  IT
+3279   Giada Santini     gsantini38@fastwebnet.it     Milano   IT
+```
+
+With `locale: DE`:
+
+```
+id     name          email                city     country_code
+12044  Hans Müller   h.mueller@web.de     Berlin   DE
+57892  Lena Schmidt  lena.schmidt@gmx.de  München  DE
+```
+
+With `locale: JP`:
+
+```
+id    name         email                  city   country_code
+9341  Yuki Tanaka  y.tanaka@docomo.ne.jp  Tokyo  JP
+```
+
+The key point is that **related presets are generated together**: the email is derived from the name, the city belongs to the country, the phone number uses the right country prefix, and IBANs use the correct country code. A single `locale` field in your schema is all it takes.
+
+Supported locales include: `US`, `IT`, `DE`, `FR`, `ES`, `JP`, `BR`, `PL`, `NL`, `SE`, `DK`, `TR`, `RU`, `CN`, `KR`, and [many more](https://github.com/posit-dev/pointblank).
+
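The idea that related fields are generated together, rather than column by column, can be sketched as follows. This is an illustration only, not how pointblank produces coherent rows. The `LOCALES` table and the helpers `ascii_slug` and `coherent_row` are hypothetical, and the lookup values are simply borrowed from the sample output above.

```python
import random
import unicodedata

# Hypothetical per-locale lookup tables, filled with values from the samples above.
LOCALES = {
    "IT": {
        "names": [("Giovanni", "Gentile"), ("Bruno", "Mancini"), ("Giada", "Santini")],
        "cities": ["Bari", "Taranto", "Milano"],
        "domains": ["tin.it", "virgilio.it", "fastwebnet.it"],
    },
    "DE": {
        "names": [("Hans", "Müller"), ("Lena", "Schmidt")],
        "cities": ["Berlin", "München"],
        "domains": ["web.de", "gmx.de"],
    },
}


def ascii_slug(text):
    """Lowercase the text and drop accents so it can be used in an email address."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalnum()).lower()


def coherent_row(locale, rng):
    """Pick a name first, then derive the email from it; all fields share one locale."""
    data = LOCALES[locale]
    first, last = rng.choice(data["names"])
    domain = rng.choice(data["domains"])
    return {
        "name": f"{first} {last}",
        "email": f"{ascii_slug(first)}.{ascii_slug(last)}@{domain}",
        "city": rng.choice(data["cities"]),
        "country_code": locale,
    }


row = coherent_row("DE", random.Random(0))
print(row)
```

Because the email is built from the already chosen name, a row can never pair one person's name with another person's address. Note that this sketch simply strips the umlaut (`Müller` becomes `muller`), whereas the sample above transliterates it to `mueller`.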
+### Validation rules
+
+| Rule | Description | Parameters |
+|------|-------------|------------|
+| `col_vals_not_null` | No nulls | `columns` |
+| `col_vals_between` | Value in range | `column`, `min`, `max` |
+| `col_vals_regex` | Matches pattern | `column`, `pattern` |
+| `col_vals_in_set` | Value in allowed set | `column`, `values` |
+| `col_vals_gt` / `col_vals_lt` | Greater / less than | `column`, `min` / `max` |
+| `col_vals_ge` / `col_vals_le` | Greater / less or equal | `column`, `min` / `max` |
+| `rows_distinct` | Unique rows | `columns` |
+| `col_exists` | Column present | `columns` |
+
+---
+
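As a rough illustration of what several of these rules check, the sketch below evaluates rule dicts, shaped like the `validation` entries of a schema, against a list of row dicts. It is a plain-Python stand-in rather than pointblank's validation engine. The function names `check_rule` and `validate` are hypothetical, and treating `col_vals_between` as inclusive and `col_vals_regex` as a search are assumptions.

```python
import re


def check_rule(rows, rule):
    """Return True when every row satisfies one rule dict."""
    name = rule["rule"]
    if name == "col_vals_not_null":
        return all(row.get(col) is not None for row in rows for col in rule["columns"])
    if name == "col_vals_between":
        # Assumed inclusive on both ends.
        return all(rule["min"] <= row[rule["column"]] <= rule["max"] for row in rows)
    if name == "col_vals_regex":
        pattern = re.compile(rule["pattern"])
        return all(pattern.search(str(row[rule["column"]])) for row in rows)
    if name == "col_vals_in_set":
        allowed = set(rule["values"])
        return all(row[rule["column"]] in allowed for row in rows)
    if name == "rows_distinct":
        keys = [tuple(row[col] for col in rule["columns"]) for row in rows]
        return len(keys) == len(set(keys))
    if name == "col_exists":
        return all(col in row for row in rows for col in rule["columns"])
    raise ValueError(f"unknown rule: {name}")


def validate(rows, rules):
    """Map each rule (by position and name) to a pass/fail boolean."""
    return {f"{i}:{rule['rule']}": check_rule(rows, rule) for i, rule in enumerate(rules)}


RULES = [
    {"rule": "col_vals_not_null", "columns": ["id", "name", "email"]},
    {"rule": "col_vals_between", "column": "age", "min": 18, "max": 90},
    {"rule": "col_vals_regex", "column": "email", "pattern": r"^[^@]+@[^@]+\.[^@]+$"},
    {"rule": "rows_distinct", "columns": ["id"]},
]

# Two made-up rows: the second one breaks the age range and the email pattern.
ROWS = [
    {"id": 1, "name": "Giada Santini", "email": "gsantini38@fastwebnet.it", "age": 34},
    {"id": 2, "name": "Bruno Mancini", "email": "not-an-email", "age": 17},
]

print(validate(ROWS, RULES))
```

The result is a dict with one boolean per rule, so a caller can report which rule failed and not just that something did.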
+## Commands
+
+### `fauxdata generate SCHEMA`
+
+```
+fauxdata generate schemas/people.yml
+fauxdata generate schemas/people.yml --rows 500 --seed 42 --validate
+fauxdata generate schemas/people.yml --format parquet --out tmp/people.parquet
+fauxdata generate schemas/people.yml --rows 1000 --out - # stdout
+fauxdata generate schemas/people.yml --out - --format jsonl | wc -l
+```
+
+| Option | Short | Default | Description |
+|--------|-------|---------|-------------|
+| `--rows` | `-r` | from schema | Number of rows to generate |
+| `--out` | `-o` | from schema | Output path; use `-` for stdout |
+| `--format` | `-f` | from schema | Output format: `csv`, `parquet`, `json`, `jsonl`, `jsonlines` |
+| `--seed` | `-s` | from schema | Random seed for reproducibility |
+| `--validate` | `-v` | off | Run validation rules after generating |
+
+When `--out -` is used, all output messages are suppressed and only data is written to stdout.
+
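Since `--out -` writes only data to stdout, a downstream script can read the stream directly. The sketch below shows one way to consume JSON Lines output in Python. The helper name `read_jsonl` and the two sample records are made up for illustration, and an `io.StringIO` buffer stands in for the piped input so the example runs on its own.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for piped input; these records are invented, not real fauxdata output.
sample = io.StringIO('{"id": 1, "age": 30}\n{"id": 2, "age": 45}\n')
records = list(read_jsonl(sample))
mean_age = sum(r["age"] for r in records) / len(records)
print(len(records), "records, mean age", mean_age)
```

In a real pipeline the same function would be called as `read_jsonl(sys.stdin)`, for example from a hypothetical `consume.py` placed after `fauxdata generate schemas/people.yml --out - --format jsonl |`.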
+### `fauxdata validate DATASET SCHEMA`
+
+```
+fauxdata validate tmp/people.csv schemas/people.yml
+```
+
+Validates an existing file against a schema. Exits with code `1` if any rule fails, which is useful in CI pipelines.
+
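Because a failed validation is signalled through the exit code, a CI step written in Python only needs to inspect the return code of the process. The following is a minimal sketch. The helper name `validation_passed` is hypothetical, and two small Python subprocesses stand in for the real CLI so the example runs without fauxdata installed.

```python
import subprocess
import sys


def validation_passed(command):
    """Run a command and report whether it exited with status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


# Stand-ins for the real CLI call: one process exits with 1, the other with 0.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
passing = [sys.executable, "-c", "import sys; sys.exit(0)"]

print("failing command passed?", validation_passed(failing))
print("passing command passed?", validation_passed(passing))
```

With fauxdata on the `PATH`, the command list would be `["fauxdata", "validate", "tmp/people.csv", "schemas/people.yml"]`, and the CI script could call `sys.exit(1)` itself when the function returns `False`.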
+### `fauxdata preview DATASET`
+
+```
+fauxdata preview tmp/people.csv --rows 10
+```
+
+Shows the first N rows and a column statistics table (type, nulls, unique count, min/max).
+
+| Option | Short | Default | Description |
+|--------|-------|---------|-------------|
+| `--rows` | `-r` | 10 | Number of rows to display |
+
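The per-column statistics listed above (type, nulls, unique count, min/max) are simple to compute. The sketch below does so for a list of row dicts in plain Python. It is not the tool's implementation, since the package depends on polars and presumably uses it for this. The function name `column_stats` is hypothetical, and each column is assumed to hold hashable, mutually comparable values.

```python
def column_stats(rows):
    """Summarise each column: type name, null count, unique count, min and max."""
    stats = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        stats[col] = {
            "type": type(present[0]).__name__ if present else "null",
            "nulls": len(values) - len(present),
            "unique": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return stats


# Made-up rows for illustration; None marks a missing value.
rows = [
    {"id": 1, "city": "Bari", "score": 71.5},
    {"id": 2, "city": None, "score": 12.0},
    {"id": 3, "city": "Bari", "score": 98.25},
]

for name, summary in column_stats(rows).items():
    print(name, summary)
```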
+### `fauxdata init`
+
+```
+fauxdata init
+fauxdata init --name orders
+```
+
+Interactive wizard to create a new schema template. Asks for name, description, row count, and default format.
+
+| Option | Short | Description |
+|--------|-------|-------------|
+| `--name` | `-n` | Schema name (skips the interactive prompt) |
+
+---
+
+## Example schemas
+
+Three ready-to-use schemas are included in `schemas/`:
+
+| Schema | Domain | Columns |
+|--------|--------|---------|
+| `people.yml` | Personal data | id, name, email, age, city, country_code, active, signup_date, score |
+| `orders.yml` | E-commerce | order_id, customer_id, product, amount, status, created_at |
+| `events.yml` | Analytics | event_id, user_id, event_type, timestamp, ip, user_agent, session_duration |
+
+---
+
+## Acknowledgements
+
+A heartfelt thank you to **[Rich Iannone](https://github.com/rich-iannone)** and the entire [pointblank](https://github.com/posit-dev/pointblank) team at [Posit](https://posit.co/) for building an exceptional data quality library, and for inspiring this project with their article:
+
+> **[Building realistic fake datasets with pointblank](https://posit.co/blog/building-realistic-fake-datasets-with-pointblank/)**
+
+Without their work, fauxdata would not exist. If you find pointblank useful, please give it a ⭐ on GitHub.