polarfrost 0.1.0__py3-none-any.whl → 0.2.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,579 @@
+ Metadata-Version: 2.4
+ Name: polarfrost
+ Version: 0.2.0
+ Summary: A fast k-anonymity implementation using Polars and PySpark
+ Home-page: https://github.com/rglew/polarfrost
+ Author: Richard Glew
+ Author-email: richard.glew@hotmail.com
+ Keywords: anonymization,privacy,polars,k-anonymity,data-privacy
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Topic :: Scientific/Engineering
+ Classifier: Topic :: Security
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ Requires-Dist: polars==1.30.0
+ Requires-Dist: pandas>=1.3.0
+ Requires-Dist: numpy>=1.21.0
+ Provides-Extra: spark
+ Requires-Dist: pyspark>=3.0.0; extra == "spark"
+ Provides-Extra: dev
+ Requires-Dist: pytest>=6.0; extra == "dev"
+ Requires-Dist: pytest-cov>=2.0; extra == "dev"
+ Requires-Dist: black>=21.0; extra == "dev"
+ Requires-Dist: isort>=5.0; extra == "dev"
+ Requires-Dist: mypy>=0.900; extra == "dev"
+ Dynamic: author
+ Dynamic: author-email
+ Dynamic: classifier
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: home-page
+ Dynamic: keywords
+ Dynamic: provides-extra
+ Dynamic: requires-dist
+ Dynamic: requires-python
+ Dynamic: summary
+
+ # Polarfrost ❄️
+
+ [![PyPI](https://img.shields.io/pypi/v/polarfrost)](https://pypi.org/project/polarfrost/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ [![Python Version](https://img.shields.io/pypi/pyversions/polarfrost)](https://pypi.org/project/polarfrost/)
+ [![CI](https://github.com/rglew/polarfrost/actions/workflows/ci.yml/badge.svg)](https://github.com/rglew/polarfrost/actions/workflows/ci.yml)
+ [![codecov](https://codecov.io/gh/rglew/polarfrost/branch/main/graph/badge.svg)](https://codecov.io/gh/rglew/polarfrost)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+
+ A high-performance k-anonymity implementation using Polars and PySpark, featuring the Mondrian algorithm for efficient privacy-preserving data analysis.
+
+ ## ✨ Features
+
+ - 🚀 **Blazing Fast**: Leverages Polars for high-performance data processing
+ - 🔄 **Dual Backend**: Supports both local (Polars) and distributed (PySpark) processing
+ - 📊 **Data Utility**: Preserves data utility while ensuring privacy
+ - 🐍 **Pythonic API**: Simple and intuitive interface
+ - 🔒 **Privacy-Preserving**: Implements k-anonymity to protect sensitive information
+ - 🛡 **Robust Input Validation**: Comprehensive validation of input parameters
+ - 🧪 **High Test Coverage**: 80%+ test coverage with comprehensive edge case testing
+ - 📦 **Production Ready**: Well-tested and ready for production use
+ - 🔄 **Flexible Input**: Works with both eager and lazy Polars DataFrames
+ - 📈 **Scalable**: Efficiently handles both small and large datasets
+
+ ## 📦 Installation
+
+ ```bash
+ # Basic installation
+ pip install polarfrost
+
+ # With PySpark support
+ pip install "polarfrost[spark]"
+
+ # For development
+ git clone https://github.com/rglew/polarfrost.git
+ cd polarfrost
+ pip install -e ".[dev]"
+ ```
+
+ ## 🧪 Testing
+
+ To run the test suite:
+
+ ```bash
+ # Install test dependencies (bundled with the dev extra)
+ pip install -e ".[dev]"
+
+ # Run all tests (excluding PySpark tests that require Java)
+ pytest --ignore=tests/test_mondrian_pyspark.py
+
+ # Run mock PySpark tests only
+ pytest tests/test_mondrian_pyspark_mock.py
+ ```
+
+ ### PySpark Testing Notes
+ - The test suite includes both real PySpark tests and mock PySpark tests
+ - Real PySpark tests require Java 8 or 11 to be installed
+ - Mock PySpark tests run without Java and are used in CI
+ - To run real PySpark tests, ensure Java is installed and set `JAVA_HOME`
+ - The mock tests provide equivalent test coverage without Java dependencies
+
+ ### Running Specific Test Categories
+
+ ```bash
+ # Run only unit tests
+ pytest tests/test_mondrian.py
+
+ # Run only edge case tests
+ pytest tests/test_mondrian_edge_cases.py
+
+ # Run with coverage report
+ pytest --cov=polarfrost --cov-report=term-missing
+ ```
+
+ ## 🚀 Quick Start
+
+ ### Basic Usage with Polars (Mondrian Algorithm)
+
+ #### Standard Mondrian k-Anonymity
+
+ The standard implementation groups records and returns one representative row per group:
+
+ ```python
+ import polars as pl
+ from polarfrost import mondrian_k_anonymity
+
+ # Sample data
+ data = {
+     "age": [25, 25, 35, 35, 45, 45, 55, 55],
+     "gender": ["M", "M", "F", "F", "M", "M", "F", "F"],
+     "zipcode": ["12345", "12345", "12345", "12345", "67890", "67890", "67890", "67890"],
+     "income": [50000, 55000, 60000, 65000, 70000, 75000, 80000, 85000],
+     "medical_condition": ["A", "B", "A", "B", "A", "B", "A", "B"]
+ }
+ df = pl.DataFrame(data)
+
+ # Apply k-anonymity with k=2
+ anonymized = mondrian_k_anonymity(
+     df,
+     quasi_identifiers=["age", "gender", "zipcode"],
+     sensitive_column="medical_condition",
+     k=2,
+     categorical=["gender", "zipcode"]
+ )
+
+ print(anonymized)
+ ```
+
+ #### Alternative Implementation with Row Preservation
+
+ For use cases where you need to preserve the original number of rows (1:1 input-output mapping), use `mondrian_k_anonymity_alt`:
+
+ ```python
+ from polarfrost import mondrian_k_anonymity_alt
+
+ # Apply k-anonymity while preserving row count
+ anonymized = mondrian_k_anonymity_alt(
+     df.lazy(),  # Must be a LazyFrame
+     quasi_identifiers=["age", "gender", "zipcode"],
+     sensitive_column="medical_condition",
+     k=2,
+     categorical=["gender", "zipcode"],
+     # group_columns=["org_id"],  # Optional: per-group anonymization (requires an 'org_id' column)
+ )
+
+ # Collect the results (since we started with a LazyFrame)
+ anonymized_df = anonymized.collect()
+ print(anonymized_df)
+ ```
+
+ #### Key Differences from Standard Implementation
+
+ 1. **Row Preservation**: Maintains the original row count (1:1 input-output mapping)
+ 2. **In-Place Anonymization**: Modifies QI columns directly instead of creating new ones
+ 3. **Group Processing**: Supports hierarchical data through `group_columns`
+ 4. **Small Group Handling**: Masks sensitive data in groups smaller than k
+ 5. **LazyFrame Requirement**: Input must be a Polars LazyFrame for efficiency
+
+ #### When to Use Which Version
+
+ - Use `mondrian_k_anonymity` when you need grouped results and don't need to maintain row order
+ - Use `mondrian_k_anonymity_alt` when you need to:
+   - Preserve the original number of rows
+   - Maintain relationships with other tables through foreign keys
+   - Process hierarchical data with different k-values per group
+   - Keep non-QI columns unchanged (see the sketch below)
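+
+ Because the alternative implementation keeps a strict 1:1 row mapping, its output can be sanity-checked against the source frame. A minimal sketch, assuming the example above has run and that non-QI columns pass through unchanged, as described:
+
+ ```python
+ # Row count is preserved 1:1, so positional relationships stay valid
+ assert anonymized_df.height == df.height
+
+ # Non-QI columns such as "income" should pass through untouched
+ assert anonymized_df["income"].equals(df["income"])
+ ```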
+
+ ### Using PySpark for Distributed Processing (Mondrian Algorithm)
+
+ ```python
+ from pyspark.sql import SparkSession
+ from pyspark.sql.types import StructType, StructField, StringType, IntegerType
+ from polarfrost import mondrian_k_anonymity
+
+ # Initialize Spark session
+ spark = SparkSession.builder \
+     .appName("PolarFrostExample") \
+     .getOrCreate()
+
+ # Sample schema
+ schema = StructType([
+     StructField("age", IntegerType()),
+     StructField("gender", StringType()),
+     StructField("zipcode", StringType()),
+     StructField("income", IntegerType()),
+     StructField("medical_condition", StringType())
+ ])
+
+ # Sample data
+ data = [
+     (25, "M", "12345", 50000, "A"),
+     (25, "M", "12345", 55000, "B"),
+     (35, "F", "12345", 60000, "A"),
+     (35, "F", "12345", 65000, "B"),
+     (45, "M", "67890", 70000, "A"),
+     (45, "M", "67890", 75000, "B"),
+     (55, "F", "67890", 80000, "A"),
+     (55, "F", "67890", 85000, "B")
+ ]
+
+ # Create Spark DataFrame
+ df = spark.createDataFrame(data, schema)
+
+ # Apply k-anonymity with PySpark
+ anonymized = mondrian_k_anonymity(
+     df,
+     quasi_identifiers=["age", "gender", "zipcode"],
+     sensitive_column="medical_condition",
+     k=2,
+     categorical=["gender", "zipcode"],
+     schema=df.schema  # Required for PySpark
+ )
+
+ anonymized.show()
+ ```
+
+ ## 📚 API Reference
+
+ ### `mondrian_k_anonymity`
+
+ ```python
+ def mondrian_k_anonymity(
+     df: Union[pl.DataFrame, pl.LazyFrame, "pyspark.sql.DataFrame"],
+     quasi_identifiers: List[str],
+     sensitive_column: str,
+     k: int,
+     categorical: Optional[List[str]] = None,
+     schema: Optional["pyspark.sql.types.StructType"] = None,
+ ) -> Union[pl.DataFrame, "pyspark.sql.DataFrame"]:
+     """
+     Apply Mondrian k-anonymity to the input data.
+
+     Args:
+         df: Input DataFrame (Polars or PySpark)
+         quasi_identifiers: List of column names that are quasi-identifiers
+         sensitive_column: Name of the sensitive column
+         k: Anonymity parameter (minimum group size)
+         categorical: List of categorical column names
+         schema: Schema for PySpark output (required for PySpark)
+
+     Returns:
+         Anonymized DataFrame with generalized quasi-identifiers
+     """
+ ```
+
+ ## 🔍 Development Notes
+
+ ### Testing Strategy
+
+ - **Unit Tests**: Core functionality of all modules
+ - **Mock Tests**: PySpark functionality without Java dependencies
+ - **Edge Cases**: Handling of boundary conditions and unusual inputs
+ - **Input Validation**: Comprehensive validation of all function parameters
+ - **Backend Compatibility**: Tests for both Polars and PySpark backends
+
+ ### PySpark Implementation
+
+ The PySpark implementation includes mock versions of key classes for testing:
+ - `MockSparkConf`: Mocks Spark configuration
+ - `MockSparkContext`: Mocks the Spark context
+ - `MockSparkSession`: Mocks the Spark session
+ - `MockSparkDataFrame`: Mocks Spark DataFrames with a pandas backend
+
+ These mocks allow testing PySpark functionality without requiring a Java runtime.
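+
+ The pattern, roughly (an illustrative stand-in only, not polarfrost's actual mock classes):
+
+ ```python
+ import pandas as pd
+
+ class MockSparkDataFrame:
+     """Wraps a pandas DataFrame behind a minimal Spark-like surface."""
+     def __init__(self, pdf: pd.DataFrame):
+         self._pdf = pdf
+
+     def count(self) -> int:
+         return len(self._pdf)
+
+     def show(self, n: int = 20) -> None:
+         print(self._pdf.head(n))
+ ```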
290
+
291
+ ## 🔍 Algorithms
292
+
293
+ ### Mondrian k-Anonymity Algorithm
294
+
295
+ The Mondrian algorithm is a multidimensional partitioning approach that recursively splits the data along attribute values to create anonymized groups. Here's how it works in detail:
296
+
297
+ #### Algorithm Steps:
298
+
299
+ 1. **Initialization**: Start with the entire dataset and the list of quasi-identifiers (QIs).
300
+
301
+ 2. **Partitioning**:
302
+ - Find the dimension (QI) with the widest range of values
303
+ - Find the median value of that dimension
304
+ - Split the data into two partitions at the median
305
+
306
+ 3. **Anonymity Check**:
307
+ - For each partition, check if it contains at least k records
308
+ - If any partition has fewer than k records, undo the split
309
+ - If all partitions have at least k records, keep the split
310
+
311
+ 4. **Recursion**:
312
+ - Recursively apply the partitioning to each new partition
313
+ - Stop when no more valid splits can be made
314
+
315
+ 5. **Generalization**:
316
+ - For each final partition, replace QI values with their range or category
317
+ - Keep sensitive attributes as-is but ensure k-anonymity is maintained
318
+
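+ A minimal sketch of steps 2-5 in Python (illustrative only, not polarfrost's internal implementation; it assumes numeric quasi-identifiers and omits categorical handling):
+
+ ```python
+ from typing import List
+ import polars as pl
+
+ def mondrian_partition(df: pl.DataFrame, qis: List[str], k: int) -> List[pl.DataFrame]:
+     """Recursively split along the widest QI, undoing splits that violate k."""
+     widest = max(qis, key=lambda c: df[c].max() - df[c].min())
+     median = df[widest].median()
+     left = df.filter(pl.col(widest) <= median)
+     right = df.filter(pl.col(widest) > median)
+     if left.height < k or right.height < k:
+         return [df]  # no valid split: this partition is final
+     return mondrian_partition(left, qis, k) + mondrian_partition(right, qis, k)
+
+ def generalize(part: pl.DataFrame, qis: List[str]) -> pl.DataFrame:
+     """Replace each QI with its range label, e.g. ages 28..29 -> "[28-29]"."""
+     return part.with_columns(
+         pl.lit(f"[{part[c].min()}-{part[c].max()}]").alias(c) for c in qis
+     )
+ ```
+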
+ #### Example: Patient Data Anonymization
+
+ **Original Data (k=2):**
+
+ | Age | Gender | Zipcode | Condition     |
+ |-----|--------|---------|---------------|
+ | 28  | M      | 10001   | Heart Disease |
+ | 29  | M      | 10002   | Cancer        |
+ | 30  | F      | 10003   | Diabetes      |
+ | 31  | F      | 10004   | Heart Disease |
+ | 32  | M      | 10005   | Asthma        |
+ | 33  | M      | 10006   | Diabetes      |
+ | 34  | F      | 10007   | Cancer        |
+ | 35  | F      | 10008   | Asthma        |
+
+ **After Mondrian k-Anonymization (k=2, one row per record):**
+
+ | Age     | Gender | Zipcode | Condition     |
+ |---------|--------|---------|---------------|
+ | [28-29] | M      | 1000*   | Heart Disease |
+ | [28-29] | M      | 1000*   | Cancer        |
+ | [30-31] | F      | 1000*   | Diabetes      |
+ | [30-31] | F      | 1000*   | Heart Disease |
+ | [32-33] | M      | 1000*   | Asthma        |
+ | [32-33] | M      | 1000*   | Diabetes      |
+ | [34-35] | F      | 1000*   | Cancer        |
+ | [34-35] | F      | 1000*   | Asthma        |
+
+ **Final Anonymized Groups (k=2):**
+
+ | Age     | Gender | Zipcode | Conditions                | Count |
+ |---------|--------|---------|---------------------------|-------|
+ | [28-29] | M      | 1000*   | {Heart Disease, Cancer}   | 2     |
+ | [30-31] | F      | 1000*   | {Diabetes, Heart Disease} | 2     |
+ | [32-33] | M      | 1000*   | {Asthma, Diabetes}        | 2     |
+ | [34-35] | F      | 1000*   | {Cancer, Asthma}          | 2     |
+
+ #### Key Observations:
+
+ 1. **k=2 Anonymity**: Each group contains exactly 2 records
+ 2. **Generalization**:
+    - Ages are generalized to ranges
+    - Zipcodes are truncated to 4 digits (1000*)
+    - Sensitive conditions are preserved but grouped
+ 3. **Privacy**: No individual can be uniquely identified by the quasi-identifiers
+ 4. **Utility**: The data remains useful for analysis (e.g., "2 males aged 28-29 in zip 1000* have heart disease or cancer")
+
+ ### Clustering-Based k-Anonymity (Upcoming)
+
+ Coming soon: support for clustering-based k-anonymity with multiple algorithms:
+ - **FCBG (Fast Clustering-Based Generalization)**: Groups similar records using clustering
+ - **RSC (Randomized Single-Clustering)**: Uses a single clustering pass with randomization
+ - **Random Clustering**: Random assignment while maintaining k-anonymity
+
+ ### Choosing the Right Algorithm
+
+ - **Mondrian**: Best for datasets with clear partitioning dimensions and when you need to preserve the utility of numerical ranges
+ - **Clustering-based**: Better for datasets where natural clusters exist in the data
+ - **Random**: Provides basic k-anonymity with minimal computational overhead, but may have lower data utility
+
+ ## 🛡 Input Validation
+
+ PolarFrost performs comprehensive input validation to ensure data integrity:
+
+ ### DataFrame Validation
+ - Validates input is a Polars or PySpark DataFrame
+ - Handles both eager and lazy evaluation modes
+ - Verifies the DataFrame is not empty
+ - Validates column existence and types
+
+ ### Parameter Validation
+ - `k` must be a positive integer
+ - `quasi_identifiers` must be a non-empty list of existing columns
+ - `sensitive_column` must be a single existing column
+ - `categorical` columns must be a subset of quasi-identifiers
+
+ ### Edge Cases Handled
+ - Empty DataFrames
+ - Missing or NULL values
+ - Single-record partitions
+ - k larger than dataset size
+ - Mixed data types in columns
+ - Duplicate column names
+
+ ### Error Messages
+ Clear, descriptive error messages help identify and fix issues quickly:
+ ```python
+ # Example error for invalid k value
+ ValueError: k must be a positive integer, got 'invalid'
+
+ # Example error for missing columns
+ ValueError: Columns not found in DataFrame: ['nonexistent_column']
+ ```
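+
+ To handle these programmatically, catch `ValueError` around the call. A minimal sketch, assuming the validation behaviour described above (the exact message text may differ):
+
+ ```python
+ import polars as pl
+ from polarfrost import mondrian_k_anonymity
+
+ df = pl.DataFrame({"age": [30, 40], "gender": ["M", "F"], "income": [1, 2]})
+
+ try:
+     # k=0 is rejected before any processing happens
+     mondrian_k_anonymity(df, quasi_identifiers=["age", "gender"],
+                          sensitive_column="income", k=0)
+ except ValueError as exc:
+     print(f"Invalid parameters: {exc}")
+ ```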
+
+ ## 🧪 Test Suite
+
+ PolarFrost ships with an extensive test suite, with over 80% code coverage overall:
+
+ ### Test Categories
+ - ✅ **Unit Tests**: Core functionality of all modules
+ - 🔍 **Edge Cases**: Handling of boundary conditions and unusual inputs
+ - 🛡 **Input Validation**: Comprehensive validation of all function parameters
+ - 🔄 **Backend Compatibility**: Tests for both Polars and PySpark backends
+ - 🐛 **Error Handling**: Proper error messages and exception handling
+
+ ### Running Tests
+
+ ```bash
+ # Run all tests
+ pytest --cov=polarfrost --cov-report=term-missing tests/
+
+ # Run tests matching a specific pattern
+ pytest -k "test_mondrian" --cov=polarfrost --cov-report=term-missing
+
+ # Run with detailed coverage report
+ pytest --cov=polarfrost --cov-report=html && open htmlcov/index.html
+ ```
+
+ ### Test Coverage
+ Current test coverage includes:
+ - 96% coverage for the clustering module
+ - 54% coverage for the mondrian module (improving)
+ - Comprehensive input validation tests
+ - Edge case coverage for all public APIs
+
+ ## 📈 Performance
+
+ PolarFrost is optimized for performance across different workloads:
+
+ ### Performance Features
+ - **Lazy Evaluation**: Leverages Polars' lazy evaluation for optimal query planning
+ - **Minimal Data Copying**: Efficient memory management with minimal data duplication
+ - **Parallel Processing**: Utilizes multiple cores for faster computation
+ - **Distributed Processing**: Scales to large datasets with the PySpark backend
+ - **Smart Partitioning**: Efficient data partitioning for balanced workloads
+
+ ### Performance Tips
+ 1. **Use LazyFrames** for multi-step operations to enable query optimization
+    ```python
+    # Good: uses lazy evaluation
+    df.lazy() \
+        .filter(pl.col("age") > 30) \
+        .collect()
+    ```
+
+ 2. **Specify Categorical Columns** for better performance with string data
+    ```python
+    mondrian_k_anonymity(df, ..., categorical=["gender", "zipcode"])
+    ```
+
+ 3. **Batch Processing** for large datasets
+    - Process data in chunks when possible
+    - Use PySpark for distributed processing of very large datasets
+
+ 4. **Monitor Performance**
+    - Use Polars' built-in profiling (`LazyFrame.profile()`)
+    - Inspect query plans with `LazyFrame.explain()` (Polars) or `df.explain(True)` (PySpark); see the sketch below
+
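+ For example, to check the optimized plan before collecting (a small illustration; `explain` lives on Polars' `LazyFrame`, not `DataFrame`):
+
+ ```python
+ import polars as pl
+
+ df = pl.DataFrame({"age": [25, 35, 45]})
+ lf = df.lazy().filter(pl.col("age") > 30)
+ print(lf.explain())  # prints the optimized query plan as a string
+ ```
+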
+ ## 🔄 Dependency Management
+
+ This project uses [Dependabot](https://docs.github.com/en/code-security/dependabot) to keep dependencies up to date. Dependabot automatically creates pull requests for dependency updates.
+
+ ### Update Schedule
+ - **Python Dependencies**: Checked weekly (Mondays at 9:00 AM AEST)
+ - **GitHub Actions**: Checked monthly
+
+ ### Configuration
+ Dependabot is configured via [.github/dependabot.yml](.github/dependabot.yml). By default:
+ - Only patch and minor version updates are automatically created
+ - Major version updates are ignored
+ - Dependencies are grouped by name
+ - Pull requests are automatically labeled with `dependencies` and `automated`
+
+ To change this behaviour, edit the [.github/dependabot.yml](.github/dependabot.yml) file.
+
+ ## 📝 License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+ ## 🛠 Development
+
+ ### Prerequisites
+
+ - Python 3.8+
+ - [Poetry](https://python-poetry.org/) (recommended) or pip
+ - [pre-commit](https://pre-commit.com/)
+
+ ### Setup
+
+ 1. **Clone the repository**
+    ```bash
+    git clone https://github.com/rglew/polarfrost.git
+    cd polarfrost
+    ```
+
+ 2. **Install dependencies**
+    ```bash
+    # Using Poetry (recommended)
+    poetry install
+
+    # Or using pip
+    pip install -e ".[dev]"
+    ```
+
+ 3. **Set up pre-commit hooks**
+    ```bash
+    pre-commit install
+    ```
+
+ ### Development Workflow
+
+ 1. Create a new branch for your feature or bugfix:
+    ```bash
+    git checkout -b feature/your-feature-name
+    ```
+
+ 2. Make your changes and commit them:
+    ```bash
+    git add .
+    git commit -m "Your commit message"
+    ```
+
+ 3. Run tests locally:
+    ```bash
+    pytest tests/ -v
+    ```
+
+ 4. Push your changes and create a pull request
+
+ ## 🤝 Contributing
+
+ We welcome contributions! Here's how you can help:
+
+ 1. Fork the repository
+ 2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
+ 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
+ 4. Push to the branch (`git push origin feature/AmazingFeature`)
+ 5. Open a Pull Request
+
+ ### Code Style
+
+ We use:
+ - `black` for code formatting
+ - `isort` for import sorting
+ - `flake8` for linting
+ - `mypy` for type checking
+
+ All of these checks run automatically via pre-commit hooks and CI.
+
+ ### Testing
+
+ - Write tests for new features
+ - Run tests with `pytest`
+ - Ensure test coverage remains high
+ - Document any new features or changes
+
+ ## 📄 Changelog
+
+ ### 0.2.0
+ - Alternative Mondrian implementation (`mondrian_k_anonymity_alt`) with 1:1 row preservation
+ - Mock PySpark test suite that runs without a Java runtime
+ - Pinned `polars` to 1.30.0
+
+ ### 0.1.0 (2025-06-26)
+ - Initial release with Mondrian k-anonymity implementation
+ - Support for both Polars and PySpark backends
+ - Comprehensive test suite
@@ -0,0 +1,9 @@
+ polarfrost/__init__.py,sha256=0afFONK7bJ90guK2ZVbeg7DokOorO2WiHGXGN5nt8s0,2990
+ polarfrost/clustering.py,sha256=QepBwPVIZE7Tun6_EEaCC24HIAcyYrQywNp4sJUQwZI,2869
+ polarfrost/mondrian.py,sha256=6sLzEk980_lbWjwx5ZkwnnR5zFx6TcPU6Rbaz9lxz2c,18248
+ polarfrost/py.typed,sha256=M2mJCnUN7Ice7bLDMBMcrHzD8_Cjh2U52FOGVfM7c5o,139
+ polarfrost/tests/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+ polarfrost-0.2.0.dist-info/METADATA,sha256=Zg4HZYYeH19d5CfxvDc6XUoKPK0L3UXGvyCG2Dff74Y,19410
+ polarfrost-0.2.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+ polarfrost-0.2.0.dist-info/top_level.txt,sha256=sYpSVIpjaKGJfdvJtbHvo6usiVi0SxqXjdJ_pB_JD0c,11
+ polarfrost-0.2.0.dist-info/RECORD,,
@@ -1,86 +0,0 @@
- Metadata-Version: 2.4
- Name: polarfrost
- Version: 0.1.0
- Summary: A fast k-anonymity implementation using Polars and PySpark
- Home-page: https://github.com/rglew/polarfrost
- Author: Richard Glew
- Author-email: richard.glew@hotmail.com
- Keywords: anonymization,privacy,polars,k-anonymity,data-privacy
- Classifier: Development Status :: 3 - Alpha
- Classifier: Intended Audience :: Developers
- Classifier: Intended Audience :: Science/Research
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.8
- Classifier: Programming Language :: Python :: 3.9
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Topic :: Scientific/Engineering
- Classifier: Topic :: Security
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
- Requires-Python: >=3.8
- Description-Content-Type: text/markdown
- Requires-Dist: polars>=0.13.0
- Requires-Dist: pandas>=1.3.0
- Requires-Dist: numpy>=1.21.0
- Provides-Extra: spark
- Requires-Dist: pyspark>=3.0.0; extra == "spark"
- Provides-Extra: dev
- Requires-Dist: pytest>=6.0; extra == "dev"
- Requires-Dist: pytest-cov>=2.0; extra == "dev"
- Requires-Dist: black>=21.0; extra == "dev"
- Requires-Dist: isort>=5.0; extra == "dev"
- Requires-Dist: mypy>=0.900; extra == "dev"
- Dynamic: author
- Dynamic: author-email
- Dynamic: classifier
- Dynamic: description
- Dynamic: description-content-type
- Dynamic: home-page
- Dynamic: keywords
- Dynamic: provides-extra
- Dynamic: requires-dist
- Dynamic: requires-python
- Dynamic: summary
-
- # Polarfrost
-
- A fast k-anonymity implementation using Polars, featuring both Mondrian and Clustering algorithms for efficient privacy-preserving data analysis.
-
- ## Features
-
- - 🚀 Blazing fast k-anonymity using Polars
- - 🧊 Supports both local (Polars) and distributed (PySpark) processing
- - 📊 Preserves data utility while ensuring privacy
- - 🐍 Simple Python API
-
- ## Installation
-
- ```bash
- pip install polarfrost
- ```
-
- ## Quick Start
-
- ```python
- import polars as pl
- from polarfrost import mondrian_k_anonymity
-
- # Load your data
- df = pl.read_csv("your_data.csv")
-
- # Apply k-anonymity
- anonymized = mondrian_k_anonymity(
-     df,
-     quasi_identifiers=["age", "gender", "zipcode"],
-     sensitive_column="income",
-     k=3,
-     categorical=["gender", "zipcode"]
- )
-
- print(anonymized)
- ```
-
- ## License
-
- MIT
@@ -1,9 +0,0 @@
- polarfrost/__init__.py,sha256=f8nFJQsdr5ykHIY69PM5x11gOLRNgJXEty6DR8OQ5eU,697
- polarfrost/clustering.py,sha256=9wJ237zQAZXHlimmch-1Yr3xGiSu6GjioxQ2xvd7vqM,955
- polarfrost/mondrian.py,sha256=6-V5_uhx8UqNiuVKRPMYzSE51O8FsQEaHBJbyZhoJLU,9839
- polarfrost/py.typed,sha256=M2mJCnUN7Ice7bLDMBMcrHzD8_Cjh2U52FOGVfM7c5o,139
- polarfrost/tests/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
- polarfrost-0.1.0.dist-info/METADATA,sha256=uI2hX_xs-02m495-zdhmelVs8gMPlyvSxruvuZQ3Z1E,2380
- polarfrost-0.1.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
- polarfrost-0.1.0.dist-info/top_level.txt,sha256=sYpSVIpjaKGJfdvJtbHvo6usiVi0SxqXjdJ_pB_JD0c,11
- polarfrost-0.1.0.dist-info/RECORD,,