fakedata-python 2.0.8__tar.gz → 2.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45)
  1. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/MANIFEST.in +1 -0
  2. {fakedata_python-2.0.8/fakedata_python.egg-info → fakedata_python-2.1.0}/PKG-INFO +96 -236
  3. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/README.md +95 -235
  4. fakedata_python-2.1.0/fakedata/__init__.py +21 -0
  5. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/cli.py +20 -2
  6. fakedata_python-2.1.0/fakedata/helpers/companies.json +1 -0
  7. fakedata_python-2.1.0/fakedata/helpers/healthcare_extended.json +973 -0
  8. fakedata_python-2.1.0/fakedata/helpers/job_skills.json +606 -0
  9. fakedata_python-2.1.0/fakedata/helpers/salary_distributions.json +101 -0
  10. fakedata_python-2.1.0/fakedata/helpers/universities.json +71570 -0
  11. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/modules/data.py +308 -0
  12. {fakedata_python-2.0.8 → fakedata_python-2.1.0/fakedata_python.egg-info}/PKG-INFO +96 -236
  13. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata_python.egg-info/SOURCES.txt +3 -0
  14. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/pyproject.toml +1 -1
  15. fakedata_python-2.0.8/fakedata/__init__.py +0 -7
  16. fakedata_python-2.0.8/fakedata/helpers/companies.json +0 -1
  17. fakedata_python-2.0.8/fakedata/helpers/universities.json +0 -1
  18. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/LICENSE +0 -0
  19. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/core.py +0 -0
  20. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/cardtype.json +0 -0
  21. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/countries.json +0 -0
  22. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/devices.json +0 -0
  23. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/domain.json +0 -0
  24. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/email.json +0 -0
  25. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/first.json +0 -0
  26. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/healthcare.json +0 -0
  27. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/hobbies.json +0 -0
  28. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/industries.json +0 -0
  29. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/job_categories.json +0 -0
  30. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/job_titles.json +0 -0
  31. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/last.json +0 -0
  32. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/locales.json +0 -0
  33. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/middle.json +0 -0
  34. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/occupation.json +0 -0
  35. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/salary_ranges.json +0 -0
  36. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/shortformstate.json +0 -0
  37. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/state.json +0 -0
  38. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/states.json +0 -0
  39. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/helpers/street.json +0 -0
  40. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/modules/__init__.py +0 -0
  41. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata/test_python.py +0 -0
  42. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata_python.egg-info/dependency_links.txt +0 -0
  43. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata_python.egg-info/entry_points.txt +0 -0
  44. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/fakedata_python.egg-info/top_level.txt +0 -0
  45. {fakedata_python-2.0.8 → fakedata_python-2.1.0}/setup.cfg +0 -0
@@ -1,6 +1,7 @@
  # Exclude development and Node.js files
  prune .github
  exclude CONTRIBUTING.md
+ exclude data.md
  exclude CODE_OF_CONDUCT.md
  exclude .npmignore
  exclude test.js
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: fakedata-python
- Version: 2.0.8
+ Version: 2.1.0
  Summary: The fakedata package generates realistic user profiles for machine learning, deep learning, data analysis, and data science workflows.
  Author-email: abhay557 <contact@abhaymourya.in>
  License-Expression: MIT
@@ -22,6 +22,7 @@ Dynamic: license-file

  A high-performance, **zero-dependency** synthetic data generation engine, available for both **Node.js** and **Python**. Designed specifically for machine learning, data science, and analytics workflows, providing 100% data parity across platforms.

+
  ## Overview

  `fakedata` has been completely rebuilt from the ground up to serve as an **ML-ready synthetic data engine**. It generates deeply interconnected user profiles with **112 flat columns across 13 domains** (Health, Financial, Employment, Digital Footprint, etc.), making it the perfect tool for training models, benchmarking pipelines, or simulating realistic databases.
@@ -37,50 +38,8 @@ A high-performance, **zero-dependency** synthetic data generation engine, availa
  - **Pipeline Ready**: Export directly to CSV, JSON, or Flat objects (perfect for `pandas.DataFrame`).
  - **CLI Tool**: Generate and export datasets directly from your terminal — no scripting required.
  - **Streaming Generation**: Files are written one record at a time — constant RAM usage regardless of dataset size. Generate 10M+ rows without running out of memory.
-
- ---
-
- ## Node.js / TypeScript Implementation
-
- ### Installation
- ```bash
- npm install @abhay557/fakedata
- ```
-
- ### Quick Start
- ```javascript
- const fakedata = require('@abhay557/fakedata');
-
- // Generate deterministic users with a 5% missing data rate (null injection)
- const users = fakedata.data.users(1000, { seed: 42, missing_rate: 0.05 });
-
- // Export directly to CSV format
- const csvString = fakedata.data.usersToCSV(1000, { seed: 42 });
-
- // Time-series activity data
- const ts = fakedata.userTimeSeries({ days: 30, eventsPerDay: 8 });
- console.log(`Generated ${ts.activity.length} events for ${ts.user.fullName}`);
- ```
-
- ### Streaming API & Custom Correlations
- Generate unlimited data directly to disk while keeping memory at O(1), and force mathematical relationships between fields using the Pearson Correlation API:
-
- ```javascript
- const fs = require('fs');
- const fakedata = require('@abhay557/fakedata');
-
- // Create a stream that emits 1 million users as CSV
- const stream = fakedata.data.generateStream(1000000, {
- format: 'csv',
- correlations: [
- { fieldA: 'education.level', fieldB: 'financial.annualIncome', pearson_coeff: 0.85 },
- { fieldA: 'health.bmi', fieldB: 'health.bloodPressure.systolic', pearson_coeff: 0.60 }
- ]
- });
-
- // Pipe directly to file (constant RAM usage)
- stream.pipe(fs.createWriteStream('1m_dataset.csv'));
- ```
+ - **Standalone Generators**: Generate modular, domain-specific data without full user profiles using `data.company()`, `data.job()`, `data.medicalRecord()`, `data.university()`, and `data.transaction()`.
+ - **Enriched High-Fidelity Data**: Powered by aggregated datasets, user profiles now include structured `health.medicalHistory` arrays, `employment.companyDetails` with revenue and net income, and `employment.skills` arrays correlated to real job titles.

  ---

@@ -128,6 +87,48 @@ for user in stream:
  pass
  ```

+ ## Node.js / TypeScript Implementation
+
+ ### Installation
+ ```bash
+ npm install @abhay557/fakedata
+ ```
+
+ ### Quick Start
+ ```javascript
+ const fakedata = require('@abhay557/fakedata');
+
+ // Generate deterministic users with a 5% missing data rate (null injection)
+ const users = fakedata.data.users(1000, { seed: 42, missing_rate: 0.05 });
+
+ // Export directly to CSV format
+ const csvString = fakedata.data.usersToCSV(1000, { seed: 42 });
+
+ // Time-series activity data
+ const ts = fakedata.userTimeSeries({ days: 30, eventsPerDay: 8 });
+ console.log(`Generated ${ts.activity.length} events for ${ts.user.fullName}`);
+ ```
+
+ ### Streaming API & Custom Correlations
+ Generate unlimited data directly to disk while keeping memory at O(1), and force mathematical relationships between fields using the Pearson Correlation API:
+
+ ```javascript
+ const fs = require('fs');
+ const fakedata = require('@abhay557/fakedata');
+
+ // Create a stream that emits 1 million users as CSV
+ const stream = fakedata.data.generateStream(1000000, {
+ format: 'csv',
+ correlations: [
+ { fieldA: 'education.level', fieldB: 'financial.annualIncome', pearson_coeff: 0.85 },
+ { fieldA: 'health.bmi', fieldB: 'health.bloodPressure.systolic', pearson_coeff: 0.60 }
+ ]
+ });
+
+ // Pipe directly to file (constant RAM usage)
+ stream.pipe(fs.createWriteStream('1m_dataset.csv'));
+ ```
+
  ---

  ## CLI — Command Line Interface
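The README text in the hunk above mentions forcing a target Pearson coefficient between two fields (for example `pearson_coeff: 0.85`). The diff does not show how the package achieves this, so the Python sketch below only illustrates one textbook construction: mixing a standard normal `x` with independent noise `z` so that `y = r*x + sqrt(1 - r^2)*z` has population correlation `r`. The names `correlated_pairs` and `pearson` are hypothetical and are not part of `fakedata`.

```python
import math
import random


def correlated_pairs(n, r, seed=None):
    """Return n (x, y) pairs of standard normals with population correlation r.

    Illustrative only: this is a generic construction, not the package's code.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be in [-1, 1]")
    rng = random.Random(seed)
    noise_scale = math.sqrt(1.0 - r * r)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)  # noise drawn independently of x
        pairs.append((x, r * x + noise_scale * z))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    sample = correlated_pairs(20000, 0.85, seed=42)
    print(f"requested r=0.85, measured r={pearson(sample):.3f}")
```

With a large sample the measured coefficient should land close to the requested one. Mapping the normal scores onto real fields such as income or education level would be a further step that this sketch does not cover.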
@@ -156,7 +157,8 @@ pip install fakedata-python

  | Flag | Default | Description |
  |:---|:---|:---|
- | `-n`, `--count` | `10` | Number of users to generate |
+ | `-T`, `--type` | `users` | Type of data: `users` \| `companies` \| `jobs` \| `universities` \| `transactions` \| `medical_records` |
+ | `-n`, `--count` | `10` | Number of records to generate |
  | `-f`, `--format` | `json` | Output format: `json` \| `csv` \| `flat` |
  | `-o`, `--output` | stdout | Output file path |
  | `-s`, `--seed` | none | Random seed for reproducibility |
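To make the flag table in the hunk above concrete, here is a minimal `argparse` sketch that mirrors only the five flags visible in this hunk, with the defaults the table lists. It is not the package's `cli.py`, whose contents are not shown in this diff. The function name `build_parser`, the integer type for `--seed`, and the use of `None` to mean stdout are assumptions made for illustration.

```python
import argparse

# Record types listed for -T/--type in the README table.
RECORD_TYPES = [
    "users", "companies", "jobs",
    "universities", "transactions", "medical_records",
]


def build_parser():
    """Build a parser that mirrors the documented `generate` flags (sketch)."""
    parser = argparse.ArgumentParser(prog="fakedata")
    sub = parser.add_subparsers(dest="command", required=True)
    gen = sub.add_parser("generate", help="generate synthetic records")
    gen.add_argument("-T", "--type", choices=RECORD_TYPES, default="users",
                     help="type of data to generate")
    gen.add_argument("-n", "--count", type=int, default=10,
                     help="number of records to generate")
    gen.add_argument("-f", "--format", choices=["json", "csv", "flat"],
                     default="json", help="output format")
    gen.add_argument("-o", "--output", default=None,
                     help="output file path (assumed: None means stdout)")
    gen.add_argument("-s", "--seed", type=int, default=None,
                     help="random seed (assumed to be an integer)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["generate", "-T", "medical_records", "-n", "100000",
         "-o", "hospitals.json"]
    )
    print(args.type, args.count, args.format, args.output, args.seed)
```

Using `choices` makes the parser reject unknown record types up front, which is one plausible way to keep the new `--type` flag from silently falling back to `users`.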
@@ -173,6 +175,12 @@ pip install fakedata-python
  # Generate 1000 users and save as CSV
  fakedata generate -n 1000 -f csv -o dataset.csv

+ # Generate 500 standalone company profiles (v2.1)
+ fakedata generate --type companies -n 500 -o companies.json
+
+ # Generate 100,000 medical records directly to a file (v2.1)
+ fakedata generate -T medical_records -n 100000 -o hospitals.json
+
  # Generate 500 deterministic Indian users
  fakedata generate -n 500 -l in --seed 42 -o india.json

@@ -201,188 +209,6 @@ When writing to a file (`-o`), the CLI uses a **streaming write** strategy:

  This means you can generate **tens of millions of rows** without hitting Node.js heap limits or Python memory errors.

-
- ---
- ### sample output - one user
- ```fakedata.data.user()```
- ```fakedata.data.user(n) // set n = 100```
-
- ```json
- "id": "4612",
- "fullName": "Damaris Carlo Ebervale",
- "firstName": "Damaris",
- "lastName": "Ebervale",
- "middleName": "Carlo",
- "age": 31,
- "gender": "non-binary",
- "email": "damaris.ebervale@liberomail.com",
- "phone": "+1 7469125114",
- "username": "damaris_4612",
- "password": "UQ!VZr0cLUD9",
- "birthDate": "1995-07-19",
- "bloodGroup": "+B",
- "height": 185,
- "weight": 60,
- "domain": "damarisebervale.vg",
- "ip": "48.50.80.113",
- "macaddress": "33:2F:39:EE:3B:1E",
- "address": {
- "street": "3623 Chateau Lane",
- "city": "Kilgore",
- "state": "Texas",
- "country": "Sierra Leone",
- "countryCode": "SL",
- "zipCode": 36434,
- "coordinates": {
- "latitude": "-68.324385",
- "longitude": "55.859967"
- }
- },
- "demographics": {
- "ethnicity": "Hispanic",
- "nationality": "South Korean",
- "language": {
- "primary": "Arabic",
- "secondary": "Turkish"
- },
- "relationshipStatus": "dating"
- },
- "education": {
- "level": "Bachelor's",
- "field": "Computer Science",
- "institution": "Agricultural University of Lublin",
- "institutionCountry": "Poland",
- "gpa": 2.79,
- "graduationYear": 2017,
- "studentDebt": 64117
- },
- "employment": {
- "status": "self-employed",
- "company": "China CITIC Bank",
- "companySize": "enterprise",
- "industry": "Banking",
- "jobTitle": "\"ORACLE DBA\"",
- "jobCategory": "Network Engineering",
- "yearsExperience": 10,
- "workMode": "onsite",
- "workHoursPerWeek": 36,
- "jobSatisfaction": 6
- },
- "financial": {
- "annualIncome": 21600,
- "creditScore": 464,
- "savings": 1680,
- "monthlyExpenses": 1309,
- "debtToIncome": 3.12,
- "taxBracket": "12%",
- "investmentStyle": "moderate",
- "homeOwnership": "own"
- },
- "health": {
- "bmi": 17.5,
- "bmiCategory": "underweight",
- "bloodPressure": {
- "systolic": 100,
- "diastolic": 82
- },
- "exerciseFrequency": "3-4 times/week",
- "smoking": "never",
- "alcohol": "never",
- "sleepHoursPerNight": 8.3,
- "sleepQuality": "poor",
- "diet": "mediterranean",
- "medicalCondition": "None",
- "insuranceProvider": "UnitedHealthcare",
- "medications": [
- "Lisinopril"
- ],
- "lastCheckupMonthsAgo": 11,
- "hasDisability": false,
- "mentalHealth": "poor",
- "vaccination": "partially vaccinated"
- },
- "social": {
- "socialMedia": {
- "platforms": [
- "Pinterest",
- "Twitter/X",
- "Reddit",
- "Instagram"
- ],
- "screenTimeHoursPerDay": 3.8,
- "preferredContent": "video"
- },
- "shopping": {
- "frequency": "weekly",
- "preferredCategories": [
- "toys & games",
- "books"
- ],
- "monthlyOnlineSpending": 175
- },
- "newsSource": "social media",
- "travelFrequency": "weekly",
- "volunteers": false,
- "pet": "multiple"
- },
- "digitalFootprint": {
- "accountCreatedAt": "2021-04-01T09:59:41.867116+00:00",
- "lastLoginAt": "2026-04-24T09:59:41.867116+00:00",
- "lastPasswordChangeAt": "2025-11-06T09:59:41.867116+00:00",
- "userAgent": "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 Chrome/121.0.0.0 Mobile Safari/537.36",
- "browser": "Chrome",
- "os": "Windows 11",
- "referrer": "facebook.com",
- "avgSessionMinutes": 17.6,
- "sessionsPerWeek": 10,
- "totalSessions": 2666,
- "twoFactorEnabled": false,
- "preferredLanguage": "de",
- "accountStatus": "inactive",
- "verifiedEmail": false,
- "verifiedPhone": true
- },
- "bank": {
- "nameOnCard": "Damaris Carlo Ebervale",
- "cardNumber": "2289970210128357",
- "cardType": "Mastercard",
- "cardExpiry": "5/29",
- "cardCvv": "355"
- },
- "hobbies": [
- "Knitting",
- "Gardening",
- "LARPing"
- ],
- "technology_profile": {
- "devices": {
- "additional_devices": [
- "BlackBerry Bold 9790",
- "Nokia N9"
- ],
- "smartphone": "Sony Ericsson Xperia X10"
- },
- "phone_preferences": {
- "critical_features": [
- "security features",
- "reliability",
- "5G connectivity"
- ],
- "primary_uses": [
- "photography",
- "education",
- "organization"
- ]
- },
- "interest": [
- "Knitting",
- "Gardening",
- "LARPing"
- ]
- }
- }
-
- ```
  ---

  ## Advanced Features Reference
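The hunk above keeps the README's claim that file output uses a streaming write strategy with constant memory. As a rough illustration of that idea, and not a copy of the package's implementation, the Python sketch below pulls records from a generator and writes each one to CSV as soon as it is produced, so only one record is held at a time. `iter_records` and `write_csv_stream` are hypothetical names, and the three-field record is a stand-in for a real profile.

```python
import csv
import io
import random


def iter_records(count, seed=None):
    """Lazily yield `count` toy records; nothing is accumulated in a list."""
    rng = random.Random(seed)
    for i in range(count):
        yield {"id": i, "age": rng.randint(18, 90), "score": round(rng.random(), 4)}


def write_csv_stream(records, fh):
    """Write records to an open text file one row at a time.

    The header is taken from the first record's keys. Returns the row count.
    """
    writer = None
    written = 0
    for record in records:
        if writer is None:
            writer = csv.DictWriter(fh, fieldnames=list(record))
            writer.writeheader()
        writer.writerow(record)
        written += 1
    return written


if __name__ == "__main__":
    # An in-memory buffer keeps the demo free of side effects; a real run
    # would pass a file opened with open(path, "w", newline="").
    buffer = io.StringIO()
    rows = write_csv_stream(iter_records(5, seed=42), buffer)
    print(f"wrote {rows} rows")
    print(buffer.getvalue())
```

Because the generator and the writer are chained, peak memory depends on the size of one record rather than on `count`, which is the property the README describes.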
@@ -442,7 +268,45 @@ These personas ensure that an analyst looking at your synthetic data will find *

  ## Data Structure Highlights (112 Columns)

- ### 3. Locale-Aware Name Generation
+ ### 3. v2.1 High-Fidelity Data Injections
+ Version 2.1 completely revamps the `user()` profile by injecting rich, deeply nested real-world data distributions for Employment, Health, and Education.
+
+ ```json
+ {
+ "employment": {
+ "status": "employed",
+ "jobTitle": "Data Scientist",
+ "jobCategory": "Engineering",
+ "skills": ["Python", "SQL", "Machine Learning", "PyTorch"],
+ "companyDetails": {
+ "country": "United States",
+ "industry": "Technology",
+ "yearFounded": 1998,
+ "revenue": 182300000000,
+ "netIncome": 46200000000
+ }
+ },
+ "health": {
+ "medicalHistory": [
+ {
+ "condition": "Hypertension",
+ "hospital": "UCLA Medical Center",
+ "admissionType": "Urgent",
+ "billingAmount": 18560.50,
+ "medication": "Lisinopril",
+ "testResult": "Abnormal"
+ }
+ ]
+ },
+ "education": {
+ "institution": "Massachusetts Institute of Technology",
+ "institutionDomain": "mit.edu",
+ "institutionState": "Massachusetts"
+ }
+ }
+ ```
+
+ ### 4. Locale-Aware Name Generation
  Supports 8 locales with culturally accurate first names, last names, and country/phone codes:
  - `'in'`: Aarav Sharma, Priya Patel (+91, India)
  - `'jp'`: Haruto Tanaka, Sakura Sato (+81, Japan)
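The nested v2.1 sample in the hunk above sits alongside the README's promise of flat, DataFrame-friendly columns. The diff does not show how the package names its flat columns or how it treats arrays such as `skills` and `medicalHistory`, so the Python sketch below is only one plausible flattening scheme. The dotted column names and the choice to JSON-encode lists are assumptions, and `flatten` is a hypothetical helper.

```python
import json

# Trimmed copy of the nested sample shown in the README hunk.
SAMPLE = {
    "employment": {
        "jobTitle": "Data Scientist",
        "skills": ["Python", "SQL"],
        "companyDetails": {"industry": "Technology", "revenue": 182300000000},
    },
    "education": {"institutionDomain": "mit.edu"},
}


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dotted keys.

    Lists are stored as JSON strings so every value fits in one table cell
    (an assumption, not necessarily what the package does).
    """
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        elif isinstance(value, list):
            flat[column] = json.dumps(value)
        else:
            flat[column] = value
    return flat


if __name__ == "__main__":
    for column, value in flatten(SAMPLE).items():
        print(f"{column} = {value}")
```

Dotted paths match the style of the `fieldA`/`fieldB` names used in the correlation options elsewhere in the README, which is why they were chosen for this illustration.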
@@ -453,7 +317,7 @@ Supports 8 locales with culturally accurate first names, last names, and country
  - `'fr'`: Gabriel Martin, Emma Dubois (+33, France)
  - `'en'`: James Smith, Mary Johnson (+1, United States)

- ### 4. Time-Series Activity Data
+ ### 5. Time-Series Activity Data
  Generate chronological behavioral logs for users. Event types include `login`, `page_view`, `purchase`, `search`, `click`, `logout`, `api_call`, `upload`, `download`, and `comment`.

  ```javascript
@@ -462,7 +326,7 @@ const ts = data.userTimeSeries({ seed: 42, days: 30, eventsPerDay: 8 });
  // ts.activity → [{ timestamp, type, page, duration, device, ip, success, amount?, query? }]
  ```

- ### 5. Anomaly Injection Engine (Fraud Detection)
+ ### 6. Anomaly Injection Engine (Fraud Detection)
  When `anomaly_rate` is > 0, `fakedata` injects ML-detectable fraud patterns into the dataset. Affected users receive a special `_anomaly` flag object indicating the fraud type.

  | Anomaly Type | Effect |
@@ -476,7 +340,7 @@ When `anomaly_rate` is > 0, `fakedata` injects ML-detectable fraud patterns into
  | `data_mismatch` | Age=12 + employed + 30yr experience + $500k income |
  | `health_outlier` | BMI = 8-9 or 75-80, BP = extreme values |

- ### 6. The User Profile Schema (109 Correlated Fields)
+ ### 7. The User Profile Schema (109 Correlated Fields)
  Each generated user contains highly realistic, correlated data. For example, age determines education graduation year, which impacts employment salary, which impacts credit score, which impacts housing status and health/BMI metrics.

  ```text
@@ -492,7 +356,3 @@ identity(9) → personal(6) → network(3) → address(7) → demographics(5)
  Distributed under the **MIT License**. See `LICENSE` for more information.

  **Maintainer**: [abhay557](https://github.com/abhay557)
-
- Project Commit History - `https://github.com/abhay557/random-api.xyz`
-
- ---
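The anomaly section renumbered in the hunks above says that a non-zero `anomaly_rate` injects fraud patterns and tags affected users with an `_anomaly` object. The Python sketch below shows how such injection could work for the two anomaly types visible in this diff (`data_mismatch` and `health_outlier`). It is not the package's code: `inject_anomalies` is a hypothetical function, the `{"type": ...}` shape of `_anomaly` is an assumption, and the blood-pressure part of `health_outlier` is left out because the diff gives no concrete values for it.

```python
import copy
import random


def inject_anomalies(users, anomaly_rate, seed=None):
    """Return copies of `users`, corrupting roughly `anomaly_rate` of them.

    Field names follow the sample profile in the README; the `_anomaly`
    payload shape is assumed for illustration.
    """
    rng = random.Random(seed)
    result = []
    for original in users:
        user = copy.deepcopy(original)  # never mutate the caller's data
        if rng.random() < anomaly_rate:
            kind = rng.choice(["data_mismatch", "health_outlier"])
            if kind == "data_mismatch":
                # Age=12 + employed + 30yr experience + $500k income
                user["age"] = 12
                employment = user.setdefault("employment", {})
                employment["status"] = "employed"
                employment["yearsExperience"] = 30
                user.setdefault("financial", {})["annualIncome"] = 500000
            else:
                # BMI pushed into the 8-9 or 75-80 range
                low, high = rng.choice([(8.0, 9.0), (75.0, 80.0)])
                user.setdefault("health", {})["bmi"] = round(rng.uniform(low, high), 1)
            user["_anomaly"] = {"type": kind}
        result.append(user)
    return result


if __name__ == "__main__":
    clean = [{"id": i, "age": 30, "health": {"bmi": 22.0}} for i in range(1000)]
    dirty = inject_anomalies(clean, anomaly_rate=0.05, seed=42)
    flagged = sum("_anomaly" in user for user in dirty)
    print(f"{flagged} of {len(dirty)} users flagged")
```

Keeping the flag on the record gives a ready-made ground-truth label, which is what makes the injected rows usable for evaluating a fraud or outlier detector.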