fakedata-python 2.0.5__tar.gz → 2.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45)
  1. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/MANIFEST.in +3 -0
  2. {fakedata_python-2.0.5/fakedata_python.egg-info → fakedata_python-2.1.0}/PKG-INFO +93 -197
  3. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/README.md +92 -196
  4. fakedata_python-2.1.0/fakedata/__init__.py +21 -0
  5. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/cli.py +20 -2
  6. fakedata_python-2.1.0/fakedata/helpers/companies.json +1 -0
  7. fakedata_python-2.1.0/fakedata/helpers/healthcare_extended.json +973 -0
  8. fakedata_python-2.1.0/fakedata/helpers/job_skills.json +606 -0
  9. fakedata_python-2.1.0/fakedata/helpers/salary_distributions.json +101 -0
  10. fakedata_python-2.1.0/fakedata/helpers/universities.json +71570 -0
  11. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/modules/data.py +482 -0
  12. {fakedata_python-2.0.5 → fakedata_python-2.1.0/fakedata_python.egg-info}/PKG-INFO +93 -197
  13. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata_python.egg-info/SOURCES.txt +3 -0
  14. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/pyproject.toml +1 -1
  15. fakedata_python-2.0.5/fakedata/__init__.py +0 -6
  16. fakedata_python-2.0.5/fakedata/helpers/companies.json +0 -1
  17. fakedata_python-2.0.5/fakedata/helpers/universities.json +0 -1
  18. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/LICENSE +0 -0
  19. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/core.py +0 -0
  20. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/cardtype.json +0 -0
  21. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/countries.json +0 -0
  22. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/devices.json +0 -0
  23. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/domain.json +0 -0
  24. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/email.json +0 -0
  25. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/first.json +0 -0
  26. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/healthcare.json +0 -0
  27. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/hobbies.json +0 -0
  28. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/industries.json +0 -0
  29. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/job_categories.json +0 -0
  30. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/job_titles.json +0 -0
  31. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/last.json +0 -0
  32. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/locales.json +0 -0
  33. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/middle.json +0 -0
  34. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/occupation.json +0 -0
  35. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/salary_ranges.json +0 -0
  36. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/shortformstate.json +0 -0
  37. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/state.json +0 -0
  38. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/states.json +0 -0
  39. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/helpers/street.json +0 -0
  40. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/modules/__init__.py +0 -0
  41. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata/test_python.py +0 -0
  42. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata_python.egg-info/dependency_links.txt +0 -0
  43. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata_python.egg-info/entry_points.txt +0 -0
  44. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/fakedata_python.egg-info/top_level.txt +0 -0
  45. {fakedata_python-2.0.5 → fakedata_python-2.1.0}/setup.cfg +0 -0
@@ -1,11 +1,14 @@
1
1
  # Exclude development and Node.js files
2
2
  prune .github
3
3
  exclude CONTRIBUTING.md
4
+ exclude data.md
4
5
  exclude CODE_OF_CONDUCT.md
5
6
  exclude .npmignore
6
7
  exclude test.js
7
8
  exclude test_py.py
8
9
  exclude test_python.py
10
+ exclude test_new_apis.py
11
+ exclude test_new_apis.js
9
12
 
10
13
  # Exclude JS source code
11
14
  prune src
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: fakedata-python
3
- Version: 2.0.5
3
+ Version: 2.1.0
4
4
  Summary: The fakedata package generates realistic user profiles for machine learning, deep learning, data analysis, and data science workflows.
5
5
  Author-email: abhay557 <contact@abhaymourya.in>
6
6
  License-Expression: MIT
@@ -22,6 +22,7 @@ Dynamic: license-file
22
22
 
23
23
  A high-performance, **zero-dependency** synthetic data generation engine, available for both **Node.js** and **Python**. Designed specifically for machine learning, data science, and analytics workflows, providing 100% data parity across platforms.
24
24
 
25
+
25
26
  ## Overview
26
27
 
27
28
  `fakedata` has been completely rebuilt from the ground up to serve as an **ML-ready synthetic data engine**. It generates deeply interconnected user profiles with **112 flat columns across 13 domains** (Health, Financial, Employment, Digital Footprint, etc.), making it the perfect tool for training models, benchmarking pipelines, or simulating realistic databases.
@@ -37,6 +38,8 @@ A high-performance, **zero-dependency** synthetic data generation engine, availa
37
38
  - **Pipeline Ready**: Export directly to CSV, JSON, or Flat objects (perfect for `pandas.DataFrame`).
38
39
  - **CLI Tool**: Generate and export datasets directly from your terminal — no scripting required.
39
40
  - **Streaming Generation**: Files are written one record at a time — constant RAM usage regardless of dataset size. Generate 10M+ rows without running out of memory.
41
+ - **Standalone Generators**: Generate modular, domain-specific data without full user profiles using `data.company()`, `data.job()`, `data.medicalRecord()`, `data.university()`, and `data.transaction()`.
42
+ - **Enriched High-Fidelity Data**: Powered by aggregated datasets, user profiles now include structured `health.medicalHistory` arrays, `employment.companyDetails` with revenue and net income, and `employment.skills` arrays correlated to real job titles.
40
43
 
41
44
  ---
42
45
 
@@ -64,7 +67,25 @@ ts = fakedata.data.user_time_series({"days": 30, "events_per_day": 8})
64
67
  print(f"Generated {len(ts['activity'])} events for {ts['user']['fullName']}")
65
68
  ```
66
69
 
67
- ---
70
+ ### Streaming API & Custom Correlations
71
+ Generate unlimited data lazily, keeping memory footprint at O(1), and force mathematical relationships between fields using the Pearson Correlation API:
72
+
73
+ ```python
74
+ import fakedata
75
+
76
+ # Create a lazy generator that yields 1 million users
77
+ stream = fakedata.generate_stream(1000000, {
78
+ "correlations": [
79
+ {"fieldA": "education.level", "fieldB": "financial.annualIncome", "pearson_coeff": 0.85},
80
+ {"fieldA": "health.bmi", "fieldB": "health.bloodPressure.systolic", "pearson_coeff": 0.60}
81
+ ]
82
+ })
83
+
84
+ # Process users one by one without blowing up RAM
85
+ for user in stream:
86
+ # write to DB, serialize to file, or process
87
+ pass
88
+ ```
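The added README section says the stream is lazy and that a target Pearson coefficient can be imposed between two fields, but the diff does not show how `generate_stream` does either. The following is a minimal sketch of one common technique (mixing a variable with independent Gaussian noise), not the package's implementation. The names `correlated_pairs` and `pearson` are hypothetical and exist only for this illustration.

```python
import math
import random
from typing import Iterable, Iterator, Tuple


def correlated_pairs(n: int, rho: float, seed: int = 0) -> Iterator[Tuple[float, float]]:
    """Lazily yield n (x, y) pairs whose population Pearson correlation is rho.

    Hypothetical helper: assumes both fields are standard normal.
    """
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # y = rho*x + sqrt(1 - rho^2)*noise has unit variance and corr(x, y) = rho
        yield x, rho * x + scale * noise


def pearson(pairs: Iterable[Tuple[float, float]]) -> float:
    """Sample Pearson coefficient from running sums, so memory stays constant."""
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    cov = sxy - sx * sy / n
    return cov / math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))


# Consume the generator once; no list of pairs is ever materialised.
r = pearson(correlated_pairs(20000, 0.85, seed=1))
```

With a large sample the measured coefficient should land close to the requested 0.85. Real profile fields such as `education.level` are categorical or bounded, so an actual implementation would need an extra mapping step that this sketch leaves out.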
68
89
 
69
90
  ## Node.js / TypeScript Implementation
70
91
 
@@ -88,6 +109,26 @@ const ts = fakedata.userTimeSeries({ days: 30, eventsPerDay: 8 });
88
109
  console.log(`Generated ${ts.activity.length} events for ${ts.user.fullName}`);
89
110
  ```
90
111
 
112
+ ### Streaming API & Custom Correlations
113
+ Generate unlimited data directly to disk while keeping memory at O(1), and force mathematical relationships between fields using the Pearson Correlation API:
114
+
115
+ ```javascript
116
+ const fs = require('fs');
117
+ const fakedata = require('@abhay557/fakedata');
118
+
119
+ // Create a stream that emits 1 million users as CSV
120
+ const stream = fakedata.data.generateStream(1000000, {
121
+ format: 'csv',
122
+ correlations: [
123
+ { fieldA: 'education.level', fieldB: 'financial.annualIncome', pearson_coeff: 0.85 },
124
+ { fieldA: 'health.bmi', fieldB: 'health.bloodPressure.systolic', pearson_coeff: 0.60 }
125
+ ]
126
+ });
127
+
128
+ // Pipe directly to file (constant RAM usage)
129
+ stream.pipe(fs.createWriteStream('1m_dataset.csv'));
130
+ ```
131
+
91
132
  ---
92
133
 
93
134
  ## CLI — Command Line Interface
@@ -116,7 +157,8 @@ pip install fakedata-python
116
157
 
117
158
  | Flag | Default | Description |
118
159
  |:---|:---|:---|
119
- | `-n`, `--count` | `10` | Number of users to generate |
160
+ | `-T`, `--type` | `users` | Type of data: `users` \| `companies` \| `jobs` \| `universities` \| `transactions` \| `medical_records` |
161
+ | `-n`, `--count` | `10` | Number of records to generate |
120
162
  | `-f`, `--format` | `json` | Output format: `json` \| `csv` \| `flat` |
121
163
  | `-o`, `--output` | stdout | Output file path |
122
164
  | `-s`, `--seed` | none | Random seed for reproducibility |
@@ -133,6 +175,12 @@ pip install fakedata-python
133
175
  # Generate 1000 users and save as CSV
134
176
  fakedata generate -n 1000 -f csv -o dataset.csv
135
177
 
178
+ # Generate 500 standalone company profiles (v2.1)
179
+ fakedata generate --type companies -n 500 -o companies.json
180
+
181
+ # Generate 100,000 medical records directly to a file (v2.1)
182
+ fakedata generate -T medical_records -n 100000 -o hospitals.json
183
+
136
184
  # Generate 500 deterministic Indian users
137
185
  fakedata generate -n 500 -l in --seed 42 -o india.json
138
186
 
@@ -161,192 +209,6 @@ When writing to a file (`-o`), the CLI uses a **streaming write** strategy:
161
209
 
162
210
  This means you can generate **tens of millions of rows** without hitting Node.js heap limits or Python memory errors.
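The streaming write strategy described in this hunk (open the file, generate one record, write it, discard it, repeat) can be sketched as follows. This is an illustration under stated assumptions, not the code in `cli.py`: `fake_rows` is a hypothetical stand-in for the record generator and `stream_csv` is a hypothetical name.

```python
import csv
import io
import random
from typing import Dict, Iterator, TextIO


def fake_rows(count: int, seed: int = 0) -> Iterator[Dict[str, object]]:
    """Hypothetical stand-in for a record generator; yields one row at a time."""
    rng = random.Random(seed)
    for i in range(count):
        yield {"id": i, "age": rng.randint(18, 90)}


def stream_csv(rows: Iterator[Dict[str, object]], out: TextIO) -> int:
    """Write rows as they arrive; only the current row is held in memory."""
    writer = None
    written = 0
    for row in rows:
        if writer is None:
            # The header is taken from the first row's keys.
            writer = csv.DictWriter(out, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)
        written += 1
    return written


# An in-memory buffer stands in for the output file in this example.
buf = io.StringIO()
count = stream_csv(fake_rows(5, seed=1), buf)
```

Because the generator is consumed row by row, peak memory depends on the size of one record rather than on the row count.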
163
211
 
164
- ```
165
- Before (old): generate ALL → hold in RAM → write to file ❌ OOM at ~500k rows
166
- After (new): open file → generate 1 → write → discard → repeat ✅ unlimited
167
- ```
168
-
169
- ---
170
- ### sample output - one user
171
- ```fakedata.data.user()```
172
- ```fakedata.data.user(n) // set n = 100```
173
-
174
- ```json
175
- "id": "4612",
176
- "fullName": "Damaris Carlo Ebervale",
177
- "firstName": "Damaris",
178
- "lastName": "Ebervale",
179
- "middleName": "Carlo",
180
- "age": 31,
181
- "gender": "non-binary",
182
- "email": "damaris.ebervale@liberomail.com",
183
- "phone": "+1 7469125114",
184
- "username": "damaris_4612",
185
- "password": "UQ!VZr0cLUD9",
186
- "birthDate": "1995-07-19",
187
- "bloodGroup": "+B",
188
- "height": 185,
189
- "weight": 60,
190
- "domain": "damarisebervale.vg",
191
- "ip": "48.50.80.113",
192
- "macaddress": "33:2F:39:EE:3B:1E",
193
- "address": {
194
- "street": "3623 Chateau Lane",
195
- "city": "Kilgore",
196
- "state": "Texas",
197
- "country": "Sierra Leone",
198
- "countryCode": "SL",
199
- "zipCode": 36434,
200
- "coordinates": {
201
- "latitude": "-68.324385",
202
- "longitude": "55.859967"
203
- }
204
- },
205
- "demographics": {
206
- "ethnicity": "Hispanic",
207
- "nationality": "South Korean",
208
- "language": {
209
- "primary": "Arabic",
210
- "secondary": "Turkish"
211
- },
212
- "relationshipStatus": "dating"
213
- },
214
- "education": {
215
- "level": "Bachelor's",
216
- "field": "Computer Science",
217
- "institution": "Agricultural University of Lublin",
218
- "institutionCountry": "Poland",
219
- "gpa": 2.79,
220
- "graduationYear": 2017,
221
- "studentDebt": 64117
222
- },
223
- "employment": {
224
- "status": "self-employed",
225
- "company": "China CITIC Bank",
226
- "companySize": "enterprise",
227
- "industry": "Banking",
228
- "jobTitle": "\"ORACLE DBA\"",
229
- "jobCategory": "Network Engineering",
230
- "yearsExperience": 10,
231
- "workMode": "onsite",
232
- "workHoursPerWeek": 36,
233
- "jobSatisfaction": 6
234
- },
235
- "financial": {
236
- "annualIncome": 21600,
237
- "creditScore": 464,
238
- "savings": 1680,
239
- "monthlyExpenses": 1309,
240
- "debtToIncome": 3.12,
241
- "taxBracket": "12%",
242
- "investmentStyle": "moderate",
243
- "homeOwnership": "own"
244
- },
245
- "health": {
246
- "bmi": 17.5,
247
- "bmiCategory": "underweight",
248
- "bloodPressure": {
249
- "systolic": 100,
250
- "diastolic": 82
251
- },
252
- "exerciseFrequency": "3-4 times/week",
253
- "smoking": "never",
254
- "alcohol": "never",
255
- "sleepHoursPerNight": 8.3,
256
- "sleepQuality": "poor",
257
- "diet": "mediterranean",
258
- "medicalCondition": "None",
259
- "insuranceProvider": "UnitedHealthcare",
260
- "medications": [
261
- "Lisinopril"
262
- ],
263
- "lastCheckupMonthsAgo": 11,
264
- "hasDisability": false,
265
- "mentalHealth": "poor",
266
- "vaccination": "partially vaccinated"
267
- },
268
- "social": {
269
- "socialMedia": {
270
- "platforms": [
271
- "Pinterest",
272
- "Twitter/X",
273
- "Reddit",
274
- "Instagram"
275
- ],
276
- "screenTimeHoursPerDay": 3.8,
277
- "preferredContent": "video"
278
- },
279
- "shopping": {
280
- "frequency": "weekly",
281
- "preferredCategories": [
282
- "toys & games",
283
- "books"
284
- ],
285
- "monthlyOnlineSpending": 175
286
- },
287
- "newsSource": "social media",
288
- "travelFrequency": "weekly",
289
- "volunteers": false,
290
- "pet": "multiple"
291
- },
292
- "digitalFootprint": {
293
- "accountCreatedAt": "2021-04-01T09:59:41.867116+00:00",
294
- "lastLoginAt": "2026-04-24T09:59:41.867116+00:00",
295
- "lastPasswordChangeAt": "2025-11-06T09:59:41.867116+00:00",
296
- "userAgent": "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 Chrome/121.0.0.0 Mobile Safari/537.36",
297
- "browser": "Chrome",
298
- "os": "Windows 11",
299
- "referrer": "facebook.com",
300
- "avgSessionMinutes": 17.6,
301
- "sessionsPerWeek": 10,
302
- "totalSessions": 2666,
303
- "twoFactorEnabled": false,
304
- "preferredLanguage": "de",
305
- "accountStatus": "inactive",
306
- "verifiedEmail": false,
307
- "verifiedPhone": true
308
- },
309
- "bank": {
310
- "nameOnCard": "Damaris Carlo Ebervale",
311
- "cardNumber": "2289970210128357",
312
- "cardType": "Mastercard",
313
- "cardExpiry": "5/29",
314
- "cardCvv": "355"
315
- },
316
- "hobbies": [
317
- "Knitting",
318
- "Gardening",
319
- "LARPing"
320
- ],
321
- "technology_profile": {
322
- "devices": {
323
- "additional_devices": [
324
- "BlackBerry Bold 9790",
325
- "Nokia N9"
326
- ],
327
- "smartphone": "Sony Ericsson Xperia X10"
328
- },
329
- "phone_preferences": {
330
- "critical_features": [
331
- "security features",
332
- "reliability",
333
- "5G connectivity"
334
- ],
335
- "primary_uses": [
336
- "photography",
337
- "education",
338
- "organization"
339
- ]
340
- },
341
- "interest": [
342
- "Knitting",
343
- "Gardening",
344
- "LARPing"
345
- ]
346
- }
347
- }
348
-
349
- ```
350
212
  ---
351
213
 
352
214
  ## Advanced Features Reference
@@ -406,7 +268,45 @@ These personas ensure that an analyst looking at your synthetic data will find *
406
268
 
407
269
  ## Data Structure Highlights (112 Columns)
408
270
 
409
- ### 3. Locale-Aware Name Generation
271
+ ### 3. v2.1 High-Fidelity Data Injections
272
+ Version 2.1 completely revamps the `user()` profile by injecting rich, deeply nested real-world data distributions for Employment, Health, and Education.
273
+
274
+ ```json
275
+ {
276
+ "employment": {
277
+ "status": "employed",
278
+ "jobTitle": "Data Scientist",
279
+ "jobCategory": "Engineering",
280
+ "skills": ["Python", "SQL", "Machine Learning", "PyTorch"],
281
+ "companyDetails": {
282
+ "country": "United States",
283
+ "industry": "Technology",
284
+ "yearFounded": 1998,
285
+ "revenue": 182300000000,
286
+ "netIncome": 46200000000
287
+ }
288
+ },
289
+ "health": {
290
+ "medicalHistory": [
291
+ {
292
+ "condition": "Hypertension",
293
+ "hospital": "UCLA Medical Center",
294
+ "admissionType": "Urgent",
295
+ "billingAmount": 18560.50,
296
+ "medication": "Lisinopril",
297
+ "testResult": "Abnormal"
298
+ }
299
+ ]
300
+ },
301
+ "education": {
302
+ "institution": "Massachusetts Institute of Technology",
303
+ "institutionDomain": "mit.edu",
304
+ "institutionState": "Massachusetts"
305
+ }
306
+ }
307
+ ```
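The README describes both nested profiles like the one above and a flat export with one column per field. The diff does not show how flat column names are derived, so the sketch below assumes dotted paths, mirroring field references such as `education.level` in the correlations example. The `flatten` helper is hypothetical.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Collapse nested dicts into dotted keys; lists are kept as single values."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# A trimmed-down version of the sample profile shown in the hunk.
sample = {
    "employment": {
        "jobTitle": "Data Scientist",
        "companyDetails": {"yearFounded": 1998},
    },
    "education": {"institutionDomain": "mit.edu"},
}
flat = flatten(sample)
```

Array fields such as `employment.skills` or `health.medicalHistory` would need their own policy (join, explode, or count), which this sketch deliberately leaves open.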
308
+
309
+ ### 4. Locale-Aware Name Generation
410
310
  Supports 8 locales with culturally accurate first names, last names, and country/phone codes:
411
311
  - `'in'`: Aarav Sharma, Priya Patel (+91, India)
412
312
  - `'jp'`: Haruto Tanaka, Sakura Sato (+81, Japan)
@@ -417,7 +317,7 @@ Supports 8 locales with culturally accurate first names, last names, and country
417
317
  - `'fr'`: Gabriel Martin, Emma Dubois (+33, France)
418
318
  - `'en'`: James Smith, Mary Johnson (+1, United States)
419
319
 
420
- ### 4. Time-Series Activity Data
320
+ ### 5. Time-Series Activity Data
421
321
  Generate chronological behavioral logs for users. Event types include `login`, `page_view`, `purchase`, `search`, `click`, `logout`, `api_call`, `upload`, `download`, and `comment`.
422
322
 
423
323
  ```javascript
@@ -426,7 +326,7 @@ const ts = data.userTimeSeries({ seed: 42, days: 30, eventsPerDay: 8 });
426
326
  // ts.activity → [{ timestamp, type, page, duration, device, ip, success, amount?, query? }]
427
327
  ```
428
328
 
429
- ### 5. Anomaly Injection Engine (Fraud Detection)
329
+ ### 6. Anomaly Injection Engine (Fraud Detection)
430
330
  When `anomaly_rate` is > 0, `fakedata` injects ML-detectable fraud patterns into the dataset. Affected users receive a special `_anomaly` flag object indicating the fraud type.
431
331
 
432
332
  | Anomaly Type | Effect |
@@ -440,7 +340,7 @@ When `anomaly_rate` is > 0, `fakedata` injects ML-detectable fraud patterns into
440
340
  | `data_mismatch` | Age=12 + employed + 30yr experience + $500k income |
441
341
  | `health_outlier` | BMI = 8-9 or 75-80, BP = extreme values |
442
342
 
443
- ### 6. The User Profile Schema (109 Correlated Fields)
343
+ ### 7. The User Profile Schema (109 Correlated Fields)
444
344
  Each generated user contains highly realistic, correlated data. For example, age determines education graduation year, which impacts employment salary, which impacts credit score, which impacts housing status and health/BMI metrics.
445
345
 
446
346
  ```text
@@ -456,7 +356,3 @@ identity(9) → personal(6) → network(3) → address(7) → demographics(5)
456
356
  Distributed under the **MIT License**. See `LICENSE` for more information.
457
357
 
458
358
  **Maintainer**: [abhay557](https://github.com/abhay557)
459
-
460
- - Project Commit History - `https://github.com/abhay557/random-api.xyz`
461
-
462
- ---