PyPI: fakedata-python 2.0.3.tar.gz → 2.0.4.tar.gz

Files changed (39)

{fakedata_python-2.0.3 → fakedata_python-2.0.4}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: fakedata-python
-Version: 2.0.3
-Summary: The fakedata package generates realistic synthetic user profiles for machine learning, deep learning, data analysis, and data science workflows.
+Version: 2.0.4
+Summary: The fakedata package generates realistic user profiles for machine learning, deep learning, data analysis, and data science workflows.
 Author-email: abhay557 <contact@abhaymourya.in>
 License-Expression: MIT
 Project-URL: Homepage, https://github.com/abhay557/fakedata
@@ -36,6 +36,33 @@ A high-performance, **zero-dependency** synthetic data generation engine, availa
 - **Time-Series Data**: Generate chronological activity logs (logins, page views, purchases) per user for behavioral modeling.
 - **Pipeline Ready**: Export directly to CSV, JSON, or Flat objects (perfect for `pandas.DataFrame`).
 - **CLI Tool**: Generate and export datasets directly from your terminal — no scripting required.
+- **Streaming Generation**: Files are written one record at a time — constant RAM usage regardless of dataset size. Generate 10M+ rows without running out of memory.
+---
+##  Node.js / TypeScript Implementation
+### Installation
+```bash
+npm install @abhay557/fakedata
+```
+### Quick Start
+```javascript
+const fakedata = require('@abhay557/fakedata');
+// Generate deterministic users with a 5% missing data rate (null injection)
+const users = fakedata.data.users(1000, { seed: 42, missing_rate: 0.05 });
+// Export directly to CSV format
+const csvString = fakedata.data.usersToCSV(1000, { seed: 42 });
+// Time-series activity data
+const ts = fakedata.userTimeSeries({ days: 30, eventsPerDay: 8 });
+console.log(`Generated ${ts.activity.length} events for ${ts.user.fullName}`);
+```
+---
 ##  Python Implementation
@@ -57,7 +84,7 @@ df = pd.DataFrame(fakedata.data.users_flat(10000, {"seed": 42}))
 print(df.head())
 # Create time-series activity data
-ts = data.user_time_series({"days": 30, "events_per_day": 8})
+ts = fakedata.data.user_time_series({"days": 30, "events_per_day": 8})
 print(f"Generated {len(ts['activity'])} events for {ts['user']['fullName']}")
 ```
@@ -112,6 +139,9 @@ fakedata generate -n 500 -l in --seed 42 -o india.json
 # Fraud detection dataset with 5% anomalies
 fakedata generate -n 10000 -a 0.05 -f csv -o fraud_data.csv
+# Generate 1 million rows without running out of memory (streaming)
+fakedata generate -n 1000000 -f csv -o big_dataset.csv
 # Preview a single user in the console
 fakedata preview
@@ -119,6 +149,23 @@ fakedata preview
 fakedata generate -n 100 --timeseries --days 60 -o activity.json
 ```
+### Streaming Architecture
+When writing to a file (`-o`), the CLI uses a **streaming write** strategy:
+- The output file is **created first**, before any data is generated.
+- Each user is generated **one at a time** and written immediately to disk.
+- The generated object is then **discarded** — it is never held in a large array.
+- **RAM usage stays constant** (O(1)) regardless of how many records you generate.
+- A live progress counter is printed every 10,000 records for large jobs.
+This means you can generate **tens of millions of rows** without hitting Node.js heap limits or Python memory errors.
+```
+Before (old):  generate ALL → hold in RAM → write to file   ❌ OOM at ~500k rows
+After  (new):  open file → generate 1 → write → discard → repeat  ✅ unlimited
+```
 ---
 ### sample output - one user
 ```fakedata.data.user()```
@@ -413,3 +460,19 @@ Distributed under the **MIT License**. See `LICENSE` for more information.
 - Project Commit History - `https://github.com/abhay557/random-api.xyz`
 ---
+## Contributing
+Contributions are welcome! Whether it's a bug fix, a new feature, or improved docs — every bit helps.
+- Read the [Contributing Guide](./CONTRIBUTING.md) before submitting a PR.
+- Use the [Bug Report](https://github.com/abhay557/fakedata/issues/new?template=bug_report.md) template to report issues.
+- Use the [Feature Request](https://github.com/abhay557/fakedata/issues/new?template=feature_request.md) template to suggest ideas.
+- Please follow our [Code of Conduct](./CODE_OF_CONDUCT.md) in all interactions.
+```bash
+# Fork the repo, then:
+git clone https://github.com/YOUR_USERNAME/fakedata.git
+git checkout -b feature/my-improvement
+# Make your changes, then open a Pull Request!
+```
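
The "Streaming Architecture" text added in the hunk above describes an open-file, generate-one, write, discard loop. A minimal Python sketch of that pattern follows. It is an illustration under stated assumptions, not the package's code: `make_record` and `stream_json_array` are hypothetical names, and `make_record` stands in for a per-record generator such as `fakedata.data.user`.

```python
import io
import json
import random
from typing import Callable, Dict, TextIO


def make_record(rng: random.Random) -> Dict[str, object]:
    """Hypothetical stand-in for a per-record generator."""
    return {"id": rng.randrange(1_000_000), "score": round(rng.random(), 3)}


def stream_json_array(out: TextIO, count: int,
                      factory: Callable[[], Dict[str, object]]) -> int:
    """Write `count` records as a JSON array, one record per line.

    Each record is serialized and written before the next one is created,
    so when `out` is a real file the memory held here does not grow with
    `count`. Returns the number of records written.
    """
    out.write("[\n")
    for i in range(count):
        if i:  # the separator goes before every record except the first
            out.write(",\n")
        out.write(json.dumps(factory()))
    out.write("\n]\n")
    return count


rng = random.Random(42)
buffer = io.StringIO()  # an in-memory stand-in; a file opened with open(path, "w") works the same way
written = stream_json_array(buffer, 5, lambda: make_record(rng))
parsed = json.loads(buffer.getvalue())
```

Writing the comma before each record after the first, rather than after each record except the last, keeps the output valid JSON without the loop needing to know which record is last.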

fakedata_python-2.0.3/fakedata_python.egg-info/PKG-INFO → fakedata_python-2.0.4/README.md RENAMED

@@ -1,17 +1,3 @@
-Metadata-Version: 2.4
-Name: fakedata-python
-Version: 2.0.3
-Summary: The fakedata package generates realistic synthetic user profiles for machine learning, deep learning, data analysis, and data science workflows.
-Author-email: abhay557 <contact@abhaymourya.in>
-License-Expression: MIT
-Project-URL: Homepage, https://github.com/abhay557/fakedata
-Classifier: Programming Language :: Python :: 3
-Classifier: Operating System :: OS Independent
-Requires-Python: >=3.7
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Dynamic: license-file
 # fakedata
 [![NPM Version](https://img.shields.io/npm/v/@abhay557/fakedata?color=red&label=npm)](https://www.npmjs.com/package/@abhay557/fakedata)
@@ -36,6 +22,33 @@ A high-performance, **zero-dependency** synthetic data generation engine, availa
 - **Time-Series Data**: Generate chronological activity logs (logins, page views, purchases) per user for behavioral modeling.
 - **Pipeline Ready**: Export directly to CSV, JSON, or Flat objects (perfect for `pandas.DataFrame`).
 - **CLI Tool**: Generate and export datasets directly from your terminal — no scripting required.
+- **Streaming Generation**: Files are written one record at a time — constant RAM usage regardless of dataset size. Generate 10M+ rows without running out of memory.
+---
+##  Node.js / TypeScript Implementation
+### Installation
+```bash
+npm install @abhay557/fakedata
+```
+### Quick Start
+```javascript
+const fakedata = require('@abhay557/fakedata');
+// Generate deterministic users with a 5% missing data rate (null injection)
+const users = fakedata.data.users(1000, { seed: 42, missing_rate: 0.05 });
+// Export directly to CSV format
+const csvString = fakedata.data.usersToCSV(1000, { seed: 42 });
+// Time-series activity data
+const ts = fakedata.userTimeSeries({ days: 30, eventsPerDay: 8 });
+console.log(`Generated ${ts.activity.length} events for ${ts.user.fullName}`);
+```
+---
 ##  Python Implementation
@@ -57,7 +70,7 @@ df = pd.DataFrame(fakedata.data.users_flat(10000, {"seed": 42}))
 print(df.head())
 # Create time-series activity data
-ts = data.user_time_series({"days": 30, "events_per_day": 8})
+ts = fakedata.data.user_time_series({"days": 30, "events_per_day": 8})
 print(f"Generated {len(ts['activity'])} events for {ts['user']['fullName']}")
 ```
@@ -112,6 +125,9 @@ fakedata generate -n 500 -l in --seed 42 -o india.json
 # Fraud detection dataset with 5% anomalies
 fakedata generate -n 10000 -a 0.05 -f csv -o fraud_data.csv
+# Generate 1 million rows without running out of memory (streaming)
+fakedata generate -n 1000000 -f csv -o big_dataset.csv
 # Preview a single user in the console
 fakedata preview
@@ -119,6 +135,23 @@ fakedata preview
 fakedata generate -n 100 --timeseries --days 60 -o activity.json
 ```
+### Streaming Architecture
+When writing to a file (`-o`), the CLI uses a **streaming write** strategy:
+- The output file is **created first**, before any data is generated.
+- Each user is generated **one at a time** and written immediately to disk.
+- The generated object is then **discarded** — it is never held in a large array.
+- **RAM usage stays constant** (O(1)) regardless of how many records you generate.
+- A live progress counter is printed every 10,000 records for large jobs.
+This means you can generate **tens of millions of rows** without hitting Node.js heap limits or Python memory errors.
+```
+Before (old):  generate ALL → hold in RAM → write to file   ❌ OOM at ~500k rows
+After  (new):  open file → generate 1 → write → discard → repeat  ✅ unlimited
+```
 ---
 ### sample output - one user
 ```fakedata.data.user()```
@@ -413,3 +446,19 @@ Distributed under the **MIT License**. See `LICENSE` for more information.
 - Project Commit History - `https://github.com/abhay557/random-api.xyz`
 ---
+## Contributing
+Contributions are welcome! Whether it's a bug fix, a new feature, or improved docs — every bit helps.
+- Read the [Contributing Guide](./CONTRIBUTING.md) before submitting a PR.
+- Use the [Bug Report](https://github.com/abhay557/fakedata/issues/new?template=bug_report.md) template to report issues.
+- Use the [Feature Request](https://github.com/abhay557/fakedata/issues/new?template=feature_request.md) template to suggest ideas.
+- Please follow our [Code of Conduct](./CODE_OF_CONDUCT.md) in all interactions.
+```bash
+# Fork the repo, then:
+git clone https://github.com/YOUR_USERNAME/fakedata.git
+git checkout -b feature/my-improvement
+# Make your changes, then open a Pull Request!
+```

{fakedata_python-2.0.3 → fakedata_python-2.0.4}/fakedata/cli.py RENAMED

@@ -89,44 +89,93 @@ EXAMPLES:
             'anomaly_rate': args.anomaly_rate,
             'missing_rate': args.missing_rate,
         }
-        # Remove None values so defaults are used inside the engine
         options = {k: v for k, v in options.items() if v is not None and v != 0.0}
+        count = args.count
         start = time.time()
-        if args.timeseries:
-            results = [
-                data.user_time_series({**options, 'days': args.days, 'events_per_day': args.events_per_day})
-                for _ in range(args.count)
-            ]
-            output = json.dumps(results, indent=2 if args.pretty else None)
-        elif args.format == 'csv':
-            output = data.users_to_csv(args.count, options if options else None)
-        elif args.format == 'flat':
-            rows = data.users_flat(args.count, options if options else None)
-            output = json.dumps(rows, indent=2 if args.pretty else None)
-        else:  # json
-            if args.pretty:
-                output = data.users_to_json(args.count, options if options else None)
+        PROGRESS_INTERVAL = 10000
+        # ── stdout: buffer is fine for small terminal output ──────────────
+        if not args.output:
+            if args.timeseries:
+                results = [
+                    data.user_time_series({**options, 'days': args.days, 'events_per_day': args.events_per_day})
+                    for _ in range(count)
+                ]
+                print(json.dumps(results, indent=2 if args.pretty else None))
+            elif args.format == 'csv':
+                print(data.users_to_csv(count, options if options else None))
+            elif args.format == 'flat':
+                rows = data.users_flat(count, options if options else None)
+                print(json.dumps(rows, indent=2 if args.pretty else None))
             else:
-                output = json.dumps(data.users(args.count, options if options else None))
+                if args.pretty:
+                    print(data.users_to_json(count, options if options else None))
+                else:
+                    print(json.dumps(data.users(count, options if options else None)))
+            return
+        # ── File: STREAMING — open file first, write one record at a time ──
+        out_path = os.path.abspath(args.output)
+        with open(out_path, 'w', encoding='utf-8') as f:
+            if args.timeseries:
+                f.write('[\n')
+                for i in range(count):
+                    rec = data.user_time_series({**options, 'days': args.days, 'events_per_day': args.events_per_day})
+                    line = json.dumps(rec, indent=2 if args.pretty else None)
+                    f.write(line + (',' if i < count - 1 else '') + '\n')
+                    if (i + 1) % PROGRESS_INTERVAL == 0:
+                        print(f"\r  ⏳ {i+1:,} / {count:,} written...", end='', file=sys.stderr)
+                f.write(']\n')
+            elif args.format == 'csv':
+                # Write header from first record
+                first = data.user(options if options else None)
+                header = ','.join(f'"{k}"' for k in first.keys())
+                f.write(header + '\n')
+                f.write(_user_to_csv_row(first) + '\n')
+                for i in range(1, count):
+                    u = data.user(options if options else None)
+                    f.write(_user_to_csv_row(u) + '\n')
+                    if (i + 1) % PROGRESS_INTERVAL == 0:
+                        print(f"\r  ⏳ {i+1:,} / {count:,} written...", end='', file=sys.stderr)
+            else:  # json / flat
+                f.write('[\n')
+                for i in range(count):
+                    u = data.user(options if options else None)
+                    line = json.dumps(u, indent=2 if args.pretty else None)
+                    f.write(line + (',' if i < count - 1 else '') + '\n')
+                    if (i + 1) % PROGRESS_INTERVAL == 0:
+                        print(f"\r  ⏳ {i+1:,} / {count:,} written...", end='', file=sys.stderr)
+                f.write(']\n')
         elapsed = round(time.time() - start, 2)
-        if args.output:
-            out_path = os.path.abspath(args.output)
-            with open(out_path, 'w', encoding='utf-8') as f:
-                f.write(output)
-            size_kb = round(len(output.encode('utf-8')) / 1024, 1)
-            print(
-                f"✔ Done! Generated {args.count:,} users in {elapsed}s → {out_path} ({size_kb} KB)",
-                file=sys.stderr
-            )
+        size_bytes = os.path.getsize(out_path)
+        size_label = f"{size_bytes / 1048576:.1f} MB" if size_bytes >= 1048576 else f"{size_bytes / 1024:.1f} KB"
+        print('\r', end='', file=sys.stderr)  # clear progress line
+        print(
+            f"✔ Done! Generated {count:,} users in {elapsed}s → {out_path} ({size_label})",
+            file=sys.stderr
+        )
+def _user_to_csv_row(u):
+    """Serialize a single user dict to a CSV row string."""
+    import json as _json
+    parts = []
+    for v in u.values():
+        if v is None:
+            parts.append('')
+        elif isinstance(v, (dict, list)):
+            parts.append('"' + _json.dumps(v).replace('"', '""') + '"')
+        elif isinstance(v, str):
+            parts.append('"' + v.replace('"', '""') + '"')
         else:
-            print(output)
+            parts.append(str(v))
+    return ','.join(parts)
 if __name__ == '__main__':
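
The `_user_to_csv_row` helper added in this hunk quotes strings by hand and serializes nested dicts and lists as JSON text. The sketch below shows the same row-by-row idea using the standard library's `csv.writer`, which takes care of quoting, embedded commas and embedded newlines. It is an illustration, not the package's code: `to_cell`, `write_csv_stream` and the sample records are hypothetical, and real user profiles have many more fields.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator, TextIO


def to_cell(value: object) -> object:
    """Map one field to a CSV cell: None becomes empty, containers become JSON text."""
    if value is None:
        return ""
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return value


def write_csv_stream(out: TextIO, records: Iterable[Dict[str, object]]) -> int:
    """Write records one row at a time; the header comes from the first record's keys."""
    writer = csv.writer(out, lineterminator="\n")
    rows = 0
    for record in records:
        if rows == 0:
            writer.writerow(record.keys())
        writer.writerow([to_cell(v) for v in record.values()])
        rows += 1
    return rows


def sample_records() -> Iterator[Dict[str, object]]:
    """Hypothetical records used only to exercise the writer."""
    yield {"name": 'Ann "AJ" Lee', "age": 31, "tags": ["a", "b"], "email": None}
    yield {"name": "Bo, Jr.", "age": 27, "tags": [], "email": "bo@example.com"}


buffer = io.StringIO()
row_count = write_csv_stream(buffer, sample_records())
round_trip = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Two differences from the hunk are worth noting. `csv.writer` quotes a field only when it needs to, whereas the helper quotes every string. And because the header is taken inside the loop, this sketch writes nothing for an empty input, while the hunk's CSV branch always generates a first record to derive the header.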

fakedata_python-2.0.3/README.md → fakedata_python-2.0.4/fakedata_python.egg-info/PKG-INFO RENAMED

@@ -1,3 +1,17 @@
+Metadata-Version: 2.4
+Name: fakedata-python
+Version: 2.0.4
+Summary: The fakedata package generates realistic user profiles for machine learning, deep learning, data analysis, and data science workflows.
+Author-email: abhay557 <contact@abhaymourya.in>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/abhay557/fakedata
+Classifier: Programming Language :: Python :: 3
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.7
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Dynamic: license-file
 # fakedata
 [![NPM Version](https://img.shields.io/npm/v/@abhay557/fakedata?color=red&label=npm)](https://www.npmjs.com/package/@abhay557/fakedata)
@@ -22,6 +36,33 @@ A high-performance, **zero-dependency** synthetic data generation engine, availa
 - **Time-Series Data**: Generate chronological activity logs (logins, page views, purchases) per user for behavioral modeling.
 - **Pipeline Ready**: Export directly to CSV, JSON, or Flat objects (perfect for `pandas.DataFrame`).
 - **CLI Tool**: Generate and export datasets directly from your terminal — no scripting required.
+- **Streaming Generation**: Files are written one record at a time — constant RAM usage regardless of dataset size. Generate 10M+ rows without running out of memory.
+---
+##  Node.js / TypeScript Implementation
+### Installation
+```bash
+npm install @abhay557/fakedata
+```
+### Quick Start
+```javascript
+const fakedata = require('@abhay557/fakedata');
+// Generate deterministic users with a 5% missing data rate (null injection)
+const users = fakedata.data.users(1000, { seed: 42, missing_rate: 0.05 });
+// Export directly to CSV format
+const csvString = fakedata.data.usersToCSV(1000, { seed: 42 });
+// Time-series activity data
+const ts = fakedata.userTimeSeries({ days: 30, eventsPerDay: 8 });
+console.log(`Generated ${ts.activity.length} events for ${ts.user.fullName}`);
+```
+---
 ##  Python Implementation
@@ -43,7 +84,7 @@ df = pd.DataFrame(fakedata.data.users_flat(10000, {"seed": 42}))
 print(df.head())
 # Create time-series activity data
-ts = data.user_time_series({"days": 30, "events_per_day": 8})
+ts = fakedata.data.user_time_series({"days": 30, "events_per_day": 8})
 print(f"Generated {len(ts['activity'])} events for {ts['user']['fullName']}")
 ```
@@ -98,6 +139,9 @@ fakedata generate -n 500 -l in --seed 42 -o india.json
 # Fraud detection dataset with 5% anomalies
 fakedata generate -n 10000 -a 0.05 -f csv -o fraud_data.csv
+# Generate 1 million rows without running out of memory (streaming)
+fakedata generate -n 1000000 -f csv -o big_dataset.csv
 # Preview a single user in the console
 fakedata preview
@@ -105,6 +149,23 @@ fakedata preview
 fakedata generate -n 100 --timeseries --days 60 -o activity.json
 ```
+### Streaming Architecture
+When writing to a file (`-o`), the CLI uses a **streaming write** strategy:
+- The output file is **created first**, before any data is generated.
+- Each user is generated **one at a time** and written immediately to disk.
+- The generated object is then **discarded** — it is never held in a large array.
+- **RAM usage stays constant** (O(1)) regardless of how many records you generate.
+- A live progress counter is printed every 10,000 records for large jobs.
+This means you can generate **tens of millions of rows** without hitting Node.js heap limits or Python memory errors.
+```
+Before (old):  generate ALL → hold in RAM → write to file   ❌ OOM at ~500k rows
+After  (new):  open file → generate 1 → write → discard → repeat  ✅ unlimited
+```
 ---
 ### sample output - one user
 ```fakedata.data.user()```
@@ -399,3 +460,19 @@ Distributed under the **MIT License**. See `LICENSE` for more information.
 - Project Commit History - `https://github.com/abhay557/random-api.xyz`
 ---
+## Contributing
+Contributions are welcome! Whether it's a bug fix, a new feature, or improved docs — every bit helps.
+- Read the [Contributing Guide](./CONTRIBUTING.md) before submitting a PR.
+- Use the [Bug Report](https://github.com/abhay557/fakedata/issues/new?template=bug_report.md) template to report issues.
+- Use the [Feature Request](https://github.com/abhay557/fakedata/issues/new?template=feature_request.md) template to suggest ideas.
+- Please follow our [Code of Conduct](./CODE_OF_CONDUCT.md) in all interactions.
+```bash
+# Fork the repo, then:
+git clone https://github.com/YOUR_USERNAME/fakedata.git
+git checkout -b feature/my-improvement
+# Make your changes, then open a Pull Request!
+```
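
The `cli.py` hunk prints a progress line every 10,000 records and reports the final file size in KB or MB. A small sketch of those two pieces in isolation follows. The helper names `report_progress` and `human_size` are hypothetical; the interval, the message layout and the 1,048,576-byte threshold follow the hunk.

```python
import io
import sys
from typing import TextIO

PROGRESS_INTERVAL = 10_000  # same interval as the hunk's PROGRESS_INTERVAL


def report_progress(done: int, total: int, stream: TextIO = sys.stderr,
                    interval: int = PROGRESS_INTERVAL) -> bool:
    """Print an in-place progress line every `interval` records.

    Returns True when a line was printed.
    """
    if done % interval != 0:
        return False
    # "\r" moves back to the start of the line so each update overwrites the previous one
    print(f"\r  {done:,} / {total:,} written...", end="", file=stream)
    return True


def human_size(size_bytes: int) -> str:
    """Format a byte count as KB below 1,048,576 bytes and as MB from there upward."""
    if size_bytes >= 1_048_576:
        return f"{size_bytes / 1_048_576:.1f} MB"
    return f"{size_bytes / 1024:.1f} KB"


log = io.StringIO()  # captures the progress output instead of writing to stderr
printed = [n for n in range(1, 25_001) if report_progress(n, 25_000, stream=log)]
```

Sending progress to stderr, as the hunk does, keeps it out of any data written to stdout.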

{fakedata_python-2.0.3 → fakedata_python-2.0.4}/pyproject.toml RENAMED

@@ -4,11 +4,11 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "fakedata-python"
-version = "2.0.3"
+version = "2.0.4"
 authors = [
   { name="abhay557", email="contact@abhaymourya.in" },
 ]
-description = "The fakedata package generates realistic synthetic user profiles for machine learning, deep learning, data analysis, and data science workflows."
+description = "The fakedata package generates realistic user profiles for machine learning, deep learning, data analysis, and data science workflows."
 readme = "README.md"
 license = "MIT"
 requires-python = ">=3.7"