pyaerial 0.1.0__tar.gz

pyaerial-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 DiTEC-project - Erkan Karabulut
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,4 @@
1
+ include README.md
2
+ include LICENSE
3
+ recursive-include aerial *
4
+ recursive-include tests *
@@ -0,0 +1,547 @@
1
+ Metadata-Version: 2.4
2
+ Name: pyaerial
3
+ Version: 0.1.0
4
+ Summary: An implementation of the Aerial neurosymbolic association rule mining algorithm from tabular datasets.
5
+ Author-email: Erkan Karabulut <e.karabulut@uva.nl>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/DiTEC-project/pyaerial
8
+ Project-URL: Documentation, https://github.com/DiTEC-project/pyaerial
9
+ Project-URL: Source, https://github.com/DiTEC-project/pyaerial
10
+ Project-URL: Tracker, https://github.com/DiTEC-project/pyaerial/issues
11
+ Requires-Python: >=3.7
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+ Dynamic: license-file
15
+
16
+ # pyaerial: scalable association rule mining
17
+
18
+ ------------------------------
19
+
20
+ This is a Python implementation of the Aerial scalable neurosymbolic association rule miner for tabular data.
21
+
22
+ ## Table of Contents
23
+
24
+ - [Introduction](#introduction)
25
+ - [Installation](#installation)
26
+ - [Usage](#usage)
27
+ - [Association rule mining from categorical tabular data](#1-association-rule-mining-from-categorical-tabular-data)
28
+ - [Setting Aerial parameters](#2-setting-aerial-parameters)
29
+ - [Fine-tuning Autoencoder architecture and dimensions](#3-fine-tuning-autoencoder-architecture-and-dimensions)
30
+ - [Running Aerial for numerical values](#4-running-aerial-for-numerical-values)
31
+ - [Frequent itemset mining with Aerial](#5-frequent-itemset-mining-with-aerial)
32
+ - [Using Aerial for rule-based classification for interpretable inference](#6-using-aerial-for-rule-based-classification-for-interpretable-inference)
33
+ - [Fine-tuning the training parameters](#7-fine-tuning-the-training-parameters)
34
+ - [Setting the log levels](#8-setting-the-log-levels)
35
+ - [Functions Overview](#functions-overview)
36
+ - [Citation](#citation)
37
+ - [Contact](#contact)
38
+ - [Contributing](#contributing)
39
+
40
+ ---
41
+
42
+ ## Introduction
43
+
44
+ Aerial is a scalable Neurosymbolic association rule mining (ARM) method for tabular data. It aims to address the rule
45
+ explosion and execution time problems in ARM and it is fully compatible with the existing solutions. Aerial
46
+ first creates a neural representation of a given tabular data using an Autoencoder, and then extracts association rules
47
+ from the neural representation.
48
+
49
+ See our paper for the details of Autoencoder architecture, training and rule extraction
50
+ algorithm [Neurosymbolic Association Rule Mining from Tabular Data](https://arxiv.org/abs/2504.19354).
51
+ If you use Aerial in your work, please [cite](#citation) our paper.
52
+
53
+ ---
54
+
55
+ ## Installation
56
+
57
+ You can easily install **pyaerial** using pip:
58
+
59
+ ```bash
60
+ pip install pyaerial
61
+ ```
62
+
63
+ ## Usage
64
+
65
+ This section exemplifies the usage of Aerial with and without hyperparameter tuning.
66
+
67
+ ### 1. Association rule mining from categorical tabular data
68
+
69
+ ```
70
+ from aerial import model, rule_extraction, rule_quality
71
+ from ucimlrepo import fetch_ucirepo
72
+
73
+ # load a categorical tabular dataset from the UCI ML repository
74
+ breast_cancer = fetch_ucirepo(id=14).data.features
75
+
76
+ # train an autoencoder on the loaded table
77
+ trained_autoencoder = model.train(breast_cancer)
78
+
79
+ # extract association rules from the autoencoder
80
+ association_rules = rule_extraction.generate_rules(trained_autoencoder)
81
+
82
+ # calculate rule quality statistics (support, confidence, Zhang's metric) for each rule
83
+ if len(association_rules) > 0:
84
+     stats, association_rules = rule_quality.calculate_rule_stats(association_rules, trained_autoencoder.input_vectors)
85
+     print(stats, association_rules[:1])
86
+ ```
87
+
88
+ The following is partial output of the code above:
89
+
90
+ ```
91
+ >>> Output:
92
+ Overall rule quality statistics: {
93
+ "rule_count":15,
94
+ "average_support": 0.448,
95
+ "average_confidence": 0.881,
96
+ "average_coverage": 0.860,
97
+ "average_zhangs_metric": 0.318
98
+ }
99
+
100
+ Sample rule:
101
+ {
102
+ "antecedents":[
103
+ "inv-nodes__0-2" # meaning column "inv-nodes" has the value between "0-2"
104
+ ],
105
+ "consequent":"node-caps__no", # meaning column "node-caps" has the value "no"
106
+ "support": 0.702,
107
+ "confidence": 0.943,
108
+ "zhangs_metric": 0.69
109
+ }
110
+ ```
111
+
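The support, confidence, and Zhang's metric values in the output above can be reproduced by hand on a small table. The following is a minimal sketch that assumes the standard definitions of these metrics; the helper `rule_metrics` and the toy rows are hypothetical and are not part of pyaerial, which computes these values through `rule_quality.calculate_rule_stats`.

```python
def rule_metrics(rows, antecedents, consequent):
    """Compute support, confidence and Zhang's metric for one rule.

    rows: list of dicts mapping column name -> value (toy data, not pyaerial's format)
    antecedents / consequent: items in the "column__value" form used in this README
    """
    def holds(row, item):
        column, value = item.split("__", 1)
        return str(row.get(column)) == value

    n = len(rows)
    # fraction of rows matching the antecedents, the consequent, and both together
    ant = sum(all(holds(r, a) for a in antecedents) for r in rows) / n
    cons = sum(holds(r, consequent) for r in rows) / n
    both = sum(
        all(holds(r, a) for a in antecedents) and holds(r, consequent) for r in rows
    ) / n

    confidence = both / ant if ant else 0.0
    denominator = max(both * (1 - ant), ant * (cons - both))
    zhang = (both - ant * cons) / denominator if denominator else 0.0
    return {"support": both, "confidence": confidence, "zhangs_metric": zhang}


# hypothetical toy table with two columns of the breast cancer dataset
rows = [
    {"inv-nodes": "0-2", "node-caps": "no"},
    {"inv-nodes": "0-2", "node-caps": "no"},
    {"inv-nodes": "0-2", "node-caps": "yes"},
    {"inv-nodes": "3-5", "node-caps": "yes"},
]
metrics = rule_metrics(rows, ["inv-nodes__0-2"], "node-caps__no")
print(metrics)
```

In this toy table the rule holds in 2 of 4 rows (support 0.5) and in 2 of the 3 rows that match the antecedent (confidence of about 0.667).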
112
+ ### 2. Setting Aerial parameters
113
+
114
+ Aerial has 3 key parameters: the antecedent similarity threshold, the consequent similarity threshold, and the maximum antecedent length.
115
+
116
+ As shown in the paper, higher antecedent thresholds result in fewer rules with higher support, while
117
+ higher consequent thresholds result in fewer rules with higher confidence.
118
+
119
+ These 3 parameters can be set using the `generate_rules` function:
120
+
121
+ ```
122
+ import pandas as pd
123
+ from aerial import model, rule_extraction, rule_quality
124
+ from ucimlrepo import fetch_ucirepo
125
+
126
+ breast_cancer = fetch_ucirepo(id=14).data.features
127
+
128
+ trained_autoencoder = model.train(breast_cancer)
129
+
130
+ # hyperparameters of aerial can be set using the generate_rules function
131
+ association_rules = rule_extraction.generate_rules(trained_autoencoder, ant_similarity=0.5, cons_similarity=0.8, max_antecedents=2)
132
+ ...
133
+ ```
134
+
135
+ ### 3. Fine-tuning Autoencoder architecture and dimensions
136
+
137
+ Aerial uses an under-complete Autoencoder and, by default, automatically decides how many layers to use and the
138
+ dimensions of each layer (see [Functions overview](#functions-overview), AutoEncoder).
139
+
140
+ Alternatively, you can specify the number of layers and dimensions in the `train` method to improve performance.
141
+
142
+ ```
143
+ from aerial import model, rule_extraction, rule_quality
144
+
145
+ ...
146
+ # layer_dims=[2] specifies that there will be 1 hidden layer with a dimension of 2
147
+ trained_autoencoder = model.train(breast_cancer, layer_dims=[2])
148
+ ...
149
+ ```
150
+
151
+ ### 4. Running Aerial for numerical values
152
+
153
+ Discretizing numerical values is required before running Aerial. We provide 2 discretization methods as part of
154
+ the [`discretization.py`](aerial/discretization.py) script: equal-frequency and equal-width discretization.
155
+
156
+ ```
157
+ from aerial import model, rule_extraction, rule_quality, discretization
158
+ from ucimlrepo import fetch_ucirepo
159
+
160
+ # load a numerical tabular dataset
161
+ iris = fetch_ucirepo(id=53).data.features
162
+
163
+ # find and discretize numerical columns
164
+ iris_discretized = discretization.equal_frequency_discretization(iris, n_bins=10)
165
+
166
+ trained_autoencoder = model.train(iris_discretized, layer_dims=[19], epochs=5)
167
+
168
+ association_rules = rule_extraction.generate_rules(trained_autoencoder, ant_similarity=0.1, cons_similarity=0.5)
169
+ ```
170
+
171
+ The following shows part of the iris dataset before and after discretization:
172
+
173
+ ```
174
+ >>> Output:
175
+ # before discretization
176
+ sepal length sepal width petal length petal width
177
+ 0 5.1 3.5 1.4 0.2
178
+ 1 4.9 3.0 1.4 0.2
179
+ ...
180
+
181
+ # after discretization
182
+ sepal length sepal width petal length petal width
183
+ 0 (5.0, 5.27] (3.4, 3.61] (0.999, 1.4] (0.099, 0.2]
184
+ 1 (4.8, 5.0] (2.8, 3.0] (0.999, 1.4] (0.099, 0.2]
185
+ ...
186
+ ```
187
+
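The two discretization strategies differ only in how the bin edges are chosen. The following is a minimal pure-Python sketch of that difference; the helpers `equal_width_edges` and `equal_frequency_edges` and the sample values are hypothetical and do not reproduce pyaerial's own functions, which operate on DataFrames.

```python
def equal_width_edges(values, n_bins):
    """Edges that split [min, max] into n_bins intervals of the same width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins + 1)]


def equal_frequency_edges(values, n_bins):
    """Edges chosen so that each interval holds roughly the same number of values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # pick the value sitting at each i/n_bins position of the sorted data
    # (nearest-rank choice, an assumption made for this sketch)
    return [ordered[round(i * last / n_bins)] for i in range(n_bins + 1)]


# hypothetical sample of sepal lengths
sepal_lengths = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.3, 6.4, 7.9]
width_edges = equal_width_edges(sepal_lengths, 3)
frequency_edges = equal_frequency_edges(sepal_lengths, 3)
print(width_edges)
print(frequency_edges)
```

Equal-width edges depend only on the minimum and maximum, so skewed data can leave some bins nearly empty, while equal-frequency edges follow the data distribution.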
188
+ ### 5. Frequent itemset mining with Aerial
189
+
190
+ In addition to association rules, Aerial can also be used for frequent itemset mining.
191
+
192
+ ```
193
+ from aerial import model, rule_extraction, rule_quality
194
+ from ucimlrepo import fetch_ucirepo
195
+
196
+ # categorical tabular dataset
197
+ breast_cancer = fetch_ucirepo(id=14).data.features
198
+ trained_autoencoder = model.train(breast_cancer, epochs=5, lr=1e-3)
199
+
200
+ # extract frequent itemsets
201
+ frequent_itemsets = rule_extraction.generate_frequent_itemsets(trained_autoencoder)
202
+
203
+ # calculate support values of the frequent itemsets
204
+ support_values, average_support = rule_quality.calculate_freq_item_support(frequent_itemsets, breast_cancer)
205
+ ```
206
+
207
+ Note that, in this case, the original dataset (`breast_cancer`) is passed to `calculate_freq_item_support()`. The
208
+ following is a sample output:
209
+
210
+ ```
211
+ >>> Output:
212
+
213
+ Frequent itemsets:
214
+ {('menopause__premeno',): 0.524, ('menopause__ge40',): 0.451, ... }
215
+
216
+ Average support: 0.295
217
+ ```
218
+
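The support of an itemset is the fraction of rows that contain all of its items. The following is a minimal sketch of that calculation on a hand-made table; the helper `itemset_support` and the toy rows are hypothetical and are not the pyaerial implementation of `calculate_freq_item_support`.

```python
def itemset_support(rows, itemset):
    """Fraction of rows in which every "column__value" item of the itemset holds."""
    pairs = [item.split("__", 1) for item in itemset]
    matches = sum(all(str(row.get(col)) == val for col, val in pairs) for row in rows)
    return matches / len(rows)


# hypothetical toy table
rows = [
    {"menopause": "premeno", "node-caps": "no"},
    {"menopause": "premeno", "node-caps": "yes"},
    {"menopause": "ge40", "node-caps": "no"},
    {"menopause": "ge40", "node-caps": "no"},
]
itemsets = [("menopause__premeno",), ("menopause__ge40", "node-caps__no")]
supports = {itemset: itemset_support(rows, itemset) for itemset in itemsets}
average_support = sum(supports.values()) / len(supports)
print(supports, average_support)
```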
219
+ ### 6. Using Aerial for rule-based classification for interpretable inference
220
+
221
+ Aerial can be used to learn rules with a class label on the consequent side, which can later be used for inference
222
+ either by themselves or as part of rule list or rule set classifiers (e.g.,
223
+ from [imodels](https://github.com/csinva/imodels) repository).
224
+
225
+ This is done by setting the `target_class` parameter of the `generate_rules` function. This parameter refers to the class
226
+ label column of the tabular data.
227
+
228
+ ```
229
+ import pandas as pd
230
+ from aerial import model, rule_extraction, rule_quality
231
+ from ucimlrepo import fetch_ucirepo
232
+
233
+ # categorical tabular dataset
234
+ breast_cancer = fetch_ucirepo(id=14)
235
+ labels = breast_cancer.data.targets
236
+ breast_cancer = breast_cancer.data.features
237
+
238
+ # merge labels column with the actual table
239
+ table_with_labels = pd.concat([breast_cancer, labels], axis=1)
240
+
241
+ trained_autoencoder = model.train(table_with_labels)
242
+
243
+ # generate rules with a target class; this learns rules that have the "target_class" column (here called "Class") on the consequent side
244
+ association_rules = rule_extraction.generate_rules(trained_autoencoder, target_class="Class", cons_similarity=0.5)
245
+
246
+ if len(association_rules) > 0:
247
+     stats, association_rules = rule_quality.calculate_rule_stats(association_rules, trained_autoencoder.input_vectors)
248
+ ```
249
+
250
+ Sample output showing a rule with a class label on the right-hand side:
251
+
252
+ ```
253
+ >>> Output:
254
+
255
+ {
256
+ "antecedents":[
257
+ "menopause__premeno"
258
+ ],
259
+ "consequent":"Class__no-recurrence-events", # consequent has the class label (column) named "Class" with the value "no-recurrence-events"
260
+ "support":np.float64(0.35664335664335667),
261
+ "confidence":np.float64(0.68),
262
+ "zhangs_metric":np.float64(-0.06585858585858577)
263
+ }
264
+ ```
265
+
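One simple way to use such rules for inference is to predict the class of the highest-confidence rule whose antecedents all match a given row. The following is a minimal sketch of that idea; `predict_with_rules`, the `default` fallback, and the second toy rule are hypothetical, and rule list or rule set classifiers such as those in imodels use more elaborate strategies.

```python
def predict_with_rules(row, rules, default=None):
    """Return the class value of the most confident rule whose antecedents match the row."""
    def matches(rule):
        return all(
            str(row.get(column)) == value
            for column, value in (item.split("__", 1) for item in rule["antecedents"])
        )

    candidates = [rule for rule in rules if matches(rule)]
    if not candidates:
        return default
    best = max(candidates, key=lambda rule: rule["confidence"])
    # the consequent is "Class__<label>"; keep only the label part
    return best["consequent"].split("__", 1)[1]


# toy rules in the dictionary format shown above (the second one is made up)
rules = [
    {"antecedents": ["menopause__premeno"],
     "consequent": "Class__no-recurrence-events", "confidence": 0.68},
    {"antecedents": ["node-caps__yes"],
     "consequent": "Class__recurrence-events", "confidence": 0.75},
]
first = predict_with_rules({"menopause": "premeno", "node-caps": "yes"}, rules)
second = predict_with_rules({"menopause": "ge40", "node-caps": "no"}, rules, default="unknown")
print(first, second)
```

When no rule matches, the sketch falls back to `default`; a majority-class fallback would be a natural alternative.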
266
+ ### 7. Fine-tuning the training parameters
267
+
268
+ The [`train()`](aerial/model.py) function allows programmers to specify various training parameters:
269
+
270
+ - autoencoder: You can implement your own Autoencoder and use it for ARM as part of Aerial
271
+ - noise_factor `default=0.5`: amount of random noise (`+-`) added to each neuron of the denoising Autoencoder
272
+ before the training process
273
+ - lr `default=5e-3`: learning rate
274
+ - epochs `default=1`: number of training epochs
275
+ - batch_size `default=2`: number of samples per training batch
276
+ - loss_function `default=torch.nn.BCELoss()`: loss function
277
+ - num_workers `default=1`: number of workers for parallel execution
278
+
279
+ ```
280
+ from aerial import model, rule_extraction, rule_quality, discretization
281
+ from ucimlrepo import fetch_ucirepo
282
+
283
+ # a categorical tabular dataset
284
+ breast_cancer = fetch_ucirepo(id=14).data.features
285
+
286
+ # increase epochs to 5; note that longer training may lead to overfitting, which results in rules with low association strength (Zhang's metric)
287
+ trained_autoencoder = model.train(breast_cancer, epochs=5, lr=1e-3)
288
+
289
+ association_rules = rule_extraction.generate_rules(trained_autoencoder)
290
+ if len(association_rules) > 0:
291
+     stats, association_rules = rule_quality.calculate_rule_stats(association_rules, trained_autoencoder.input_vectors)
292
+ ```
293
+
294
+ ### 8. Setting the log levels
295
+
296
+ Aerial's source code logs extra debug statements that mark the beginning and end of major
297
+ functions such as the training process or rule extraction. The log levels can be changed as follows:
298
+
299
+ ```
300
+ import logging
301
+ import aerial
302
+
303
+ # setting the log levels to DEBUG level
304
+ aerial.setup_logging(logging.DEBUG)
305
+ ...
306
+ ```
307
+
308
+ ## Functions overview
309
+
310
+ This section lists the important classes and functions as part of the Aerial package.
311
+
312
+ ### AutoEncoder(input_dimension, feature_count, layer_dims=None)
313
+
314
+ Part of the [`model.py`](aerial/model.py) script. Constructs an autoencoder designed for association rule mining on
315
+ tabular data, based on the Neurosymbolic Association
316
+ Rule Mining method.
317
+
318
+ **Parameters**:
319
+
320
+ - `input_dimension` (int): Number of input features after one-hot encoding.
321
+
322
+ - `feature_count` (int): Original number of categorical features in the dataset.
323
+
324
+ - `layer_dims` (list of int, optional): User-specified hidden layer dimensions. If not provided, the model calculates a
325
+ default architecture using a logarithmic reduction strategy (base 16).
326
+
327
+ **Behavior**:
328
+
329
+ - Automatically builds an under-complete autoencoder with a bottleneck at the original feature count.
330
+
331
+ - If no layer_dims are provided, the architecture is determined by reducing the input dimension using a geometric
332
+ progression and creates `log₁₆(input_dimension)` layers in total.
333
+
334
+ - Uses Xavier initialization for weights and sets all biases to zero.
335
+
336
+ - Applies Tanh activation functions between layers, except the final encoder and decoder layers.
337
+
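To make the `input_dimension` and `feature_count` arguments concrete, the following minimal sketch lists the one-hot columns of a small categorical table in the "column__value" form used throughout this README. The helper `one_hot_columns` and the toy rows are hypothetical; as the usage examples suggest, pyaerial handles the encoding itself when `train` is called.

```python
def one_hot_columns(rows):
    """List the "column__value" names a one-hot encoding of the rows would produce."""
    columns = []
    for column in rows[0]:
        # one encoded column per distinct value of the original column
        for value in sorted({str(row[column]) for row in rows}):
            columns.append(f"{column}__{value}")
    return columns


# hypothetical toy table with two categorical features
rows = [
    {"menopause": "premeno", "node-caps": "no"},
    {"menopause": "ge40", "node-caps": "yes"},
    {"menopause": "lt40", "node-caps": "no"},
]
encoded = one_hot_columns(rows)
input_dimension = len(encoded)  # number of one-hot encoded columns
feature_count = len(rows[0])    # number of original categorical columns
print(encoded, input_dimension, feature_count)
```

Here three `menopause` values and two `node-caps` values give an `input_dimension` of 5 for a `feature_count` of 2.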
338
+ ### train function
339
+
340
+ train(
341
+ transactions,
342
+ autoencoder=None,
343
+ noise_factor=0.5,
344
+ lr=5e-3,
345
+ epochs=1,
346
+ batch_size=2,
347
+ loss_function=torch.nn.BCELoss(),
348
+ num_workers=1,
349
+ layer_dims=None
350
+ )
351
+
352
+ Part of the [`model.py`](aerial/model.py) script. Trains the AutoEncoder model using one-hot encoded tabular transaction
353
+ data.
354
+
355
+ **Parameters**:
356
+
357
+ - `transactions` (pd.DataFrame): Tabular input data for training.
358
+
359
+ - `autoencoder` (AutoEncoder, optional): A preconstructed autoencoder instance. If not provided, one is created
360
+ automatically.
361
+
362
+ - `noise_factor` (float): Controls the amount of Gaussian noise added to inputs during training (denoising effect).
363
+
364
+ - `lr` (float): Learning rate for the Adam optimizer.
365
+
366
+ - `epochs` (int): Number of training epochs.
367
+
368
+ - `batch_size` (int): Number of samples per training batch.
369
+
370
+ - `loss_function` (torch.nn.Module): Loss function to apply (default is BCELoss).
371
+
372
+ - `num_workers` (int): Number of subprocesses used for data loading.
373
+
374
+ - `layer_dims` (list of int, optional): Custom hidden layer dimensions for autoencoder construction (if applicable).
375
+
376
+ **Returns**: A trained instance of the AutoEncoder.
377
+
378
+ ### generate_rules
379
+
380
+ generate_rules(
381
+ autoencoder,
382
+ ant_similarity=0.5,
383
+ cons_similarity=0.8,
384
+ max_antecedents=2,
385
+ target_class=None
386
+ )
387
+
388
+ Part of the [`rule_extraction.py`](aerial/rule_extraction.py) script. Extracts association rules from a trained
389
+ AutoEncoder using the Aerial algorithm.
390
+
391
+ **Parameters**:
392
+
393
+ - `autoencoder` (AutoEncoder): A trained autoencoder instance.
394
+
395
+ - `ant_similarity` (float): Minimum similarity threshold for an antecedent to be considered frequent.
396
+
397
+ - `cons_similarity` (float): Minimum probability threshold for a feature to qualify as a rule consequent.
398
+
399
+ - `max_antecedents` (int): Maximum number of features allowed in the rule antecedent.
400
+
401
+ - `target_class` (str, optional): When set, restricts rule consequents to the specified class (constraint-based rule
402
+ mining).
403
+
404
+ **Returns**:
405
+
406
+ A list of extracted rules in the form:
407
+
408
+ [
409
+ {"antecedents": [...], "consequent": ...},
410
+ ...
411
+ ]
412
+
413
+ ### generate_frequent_itemsets
414
+
415
+ generate_frequent_itemsets(
416
+ autoencoder,
417
+ similarity=0.5,
418
+ max_length=2
419
+ )
420
+
421
+ Part of the [`rule_extraction.py`](aerial/rule_extraction.py) script. Generates frequent itemsets from a trained
422
+ AutoEncoder using the same Aerial+ mechanism.
423
+
424
+ **Parameters**:
425
+
426
+ - `autoencoder` (AutoEncoder): A trained autoencoder instance.
427
+
428
+ - `similarity` (float): Minimum similarity threshold for an itemset to be considered frequent.
429
+
430
+ - `max_length` (int): Maximum number of items in each itemset.
431
+
432
+ **Returns**:
433
+
434
+ A list of frequent itemsets, where each itemset is a list of string features:
435
+
436
+ [
437
+ [...], # e.g., ['gender__Male', 'income__High']
438
+ ...
439
+ ]
440
+
441
+ ### equal_frequency_discretization
442
+
443
+ `equal_frequency_discretization(df: pd.DataFrame, n_bins=10)`
444
+
445
+ Discretizes all numerical columns into equal-frequency bins and encodes the resulting intervals as string labels.
446
+
447
+ **Parameters**:
448
+
449
+ - `df`: A pandas DataFrame containing tabular data.
450
+
451
+ - `n_bins`: Number of intervals (bins) to create.
452
+
453
+ **Returns**: A modified DataFrame with numerical columns replaced by string-encoded interval bins.
454
+
455
+ ### equal_width_discretization
456
+
457
+ `equal_width_discretization(df: pd.DataFrame, n_bins=10)`
458
+
459
+ Discretizes all numerical columns into equal-width bins and encodes the resulting intervals as string labels.
460
+
461
+ **Parameters**:
462
+
463
+ - `df`: A pandas DataFrame containing tabular data.
464
+
465
+ - `n_bins`: Number of intervals (bins) to create.
466
+
467
+ **Returns**: A modified DataFrame with numerical columns replaced by string-encoded interval bins.
468
+
469
+ ### calculate_basic_rule_stats
470
+
471
+ `calculate_basic_rule_stats(rules, transactions)`
472
+
473
+ Computes support and confidence for a list of rules using parallel processing.
474
+
475
+ **Parameters**:
476
+
477
+ - `rules`: List of rule dictionaries with 'antecedents' and 'consequent'.
478
+
479
+ - `transactions`: A pandas DataFrame of one-hot encoded transactions.
480
+
481
+ **Returns**: A list of rules enriched with support and confidence values.
482
+
483
+ ### calculate_freq_item_support
484
+
485
+ `calculate_freq_item_support(freq_items, transactions)`
486
+
487
+ Calculates the support for a list of frequent itemsets.
488
+
489
+ **Parameters**:
490
+
491
+ - `freq_items`: List of itemsets (list of strings in "feature__value" format).
492
+
493
+ - `transactions`: A pandas DataFrame of categorical data.
494
+
495
+ **Returns**: A dictionary of itemset supports and their average support.
496
+
497
+ ### calculate_rule_stats
498
+
499
+ `calculate_rule_stats(rules, transactions, max_workers=1)`
500
+
501
+ Evaluates rules with extended metrics including: Support, Confidence, Zhang’s Metric, Dataset Coverage.
502
+
503
+ Runs in parallel with joblib.
504
+
505
+ **Parameters**:
506
+
507
+ - `rules`: List of rule dictionaries.
508
+
509
+ - `transactions`: One-hot encoded pandas DataFrame.
510
+
511
+ - `max_workers`: Number of parallel threads (via joblib).
512
+
513
+ **Returns**:
514
+
515
+ - A dictionary of average metrics (support, confidence, zhangs_metric, coverage)
516
+
517
+ - A list of updated rules
518
+
519
+ ## Citation
520
+
521
+ If you use pyaerial in your work, please cite the following paper:
522
+
523
+ ```
524
+ @misc{karabulut2025neurosymbolic,
525
+ title={Neurosymbolic Association Rule Mining from Tabular Data},
526
+ author={Erkan Karabulut and Paul Groth and Victoria Degeler},
527
+ year={2025},
528
+ eprint={2504.19354},
529
+ archivePrefix={arXiv},
530
+ primaryClass={cs.AI}
531
+ }
532
+ ```
533
+
534
+ ## Contact
535
+
536
+ For questions, suggestions, or collaborations, please contact:
537
+
538
+ Erkan Karabulut
539
+ 📧 e.karabulut@uva.nl
540
+ 📧 erkankkarabulut@gmail.com
541
+
542
+ ## Contributing
543
+
544
+ Contributions, feedback, and issue reports are very welcome!
545
+
546
+ Feel free to open a pull request or create an issue if you have ideas for improvements.
547
+