gsppy-3.4.3.tar.gz → gsppy-3.6.0.tar.gz
- {gsppy-3.4.3 → gsppy-3.6.0}/CHANGELOG.md +80 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/PKG-INFO +307 -11
- {gsppy-3.4.3 → gsppy-3.6.0}/README.md +305 -10
- {gsppy-3.4.3 → gsppy-3.6.0}/gsppy/__init__.py +20 -1
- {gsppy-3.4.3 → gsppy-3.6.0}/gsppy/accelerate.py +5 -2
- gsppy-3.6.0/gsppy/cli.py +358 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/gsppy/gsp.py +255 -19
- gsppy-3.6.0/gsppy/pruning.py +412 -0
- gsppy-3.6.0/gsppy/utils.py +384 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/pyproject.toml +3 -2
- {gsppy-3.4.3 → gsppy-3.6.0}/tests/test_cli.py +180 -7
- {gsppy-3.4.3 → gsppy-3.6.0}/tests/test_gsp.py +102 -1
- {gsppy-3.4.3 → gsppy-3.6.0}/tests/test_gsp_fuzzing.py +69 -79
- gsppy-3.6.0/tests/test_pruning.py +400 -0
- gsppy-3.6.0/tests/test_temporal_constraints.py +599 -0
- gsppy-3.4.3/gsppy/cli.py +0 -204
- gsppy-3.4.3/gsppy/utils.py +0 -104
- {gsppy-3.4.3 → gsppy-3.6.0}/.gitignore +0 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/CONTRIBUTING.md +0 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/LICENSE +0 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/SECURITY.md +0 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/gsppy/py.typed +0 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/rust/Cargo.lock +0 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/rust/Cargo.toml +0 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/rust/src/lib.rs +0 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/tests/__init__.py +0 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/tests/test_utils.py +0 -0
- {gsppy-3.4.3 → gsppy-3.6.0}/tox.ini +0 -0
|
@@ -1,6 +1,86 @@
|
|
|
1
1
|
# CHANGELOG
|
|
2
2
|
|
|
3
3
|
|
|
4
|
+
## v3.6.0 (2026-01-26)
|
|
5
|
+
|
|
6
|
+
### Chores
|
|
7
|
+
|
|
8
|
+
- Update uv.lock for version 3.5.0
|
|
9
|
+
([`e2c1be0`](https://github.com/jacksonpradolima/gsp-py/commit/e2c1be0945b0b124d8afa8981877513449b29ff0))
|
|
10
|
+
|
|
11
|
+
### Features
|
|
12
|
+
|
|
13
|
+
- Add flexible pruning strategy system to GSP algorithm
|
|
14
|
+
([`94089cc`](https://github.com/jacksonpradolima/gsp-py/commit/94089cc5716ec6d7c7a6e0720843162db116fca2))
|
|
15
|
+
|
|
16
|
+
feat: add flexible pruning strategy system to GSP algorithm
|
|
17
|
+
|
|
18
|
+
- Add typing-extensions as a dependency
|
|
19
|
+
([`6222945`](https://github.com/jacksonpradolima/gsp-py/commit/62229455ef3976c405d96e5ea9d5cafaf5eee6e3))
|
|
20
|
+
|
|
21
|
+
### Refactoring
|
|
22
|
+
|
|
23
|
+
- Pruning strategy initialization and enhance type hints; add typing_extensions dependency
|
|
24
|
+
([`ddc0abd`](https://github.com/jacksonpradolima/gsp-py/commit/ddc0abd9352797dd19988f60d6287da421ef60cf))
|
|
25
|
+
|
|
26
|
+
|
|
27
|
+
## v3.5.0 (2026-01-26)
|
|
28
|
+
|
|
29
|
+
### Bug Fixes
|
|
30
|
+
|
|
31
|
+
- Address code review feedback
|
|
32
|
+
([`1e7cf86`](https://github.com/jacksonpradolima/gsp-py/commit/1e7cf8681b3cd0432e6d1608187b7d518c27fcc0))
|
|
33
|
+
|
|
34
|
+
- Remove root logger modifications to prevent global side effects - Fix redundant logger
|
|
35
|
+
configuration in CLI - Remove redundant subprocess imports in tests - Revert unrelated formatting
|
|
36
|
+
changes in temporal constraints tests - Replace future dates with YYYY-MM-DD placeholders in
|
|
37
|
+
documentation - Add explanation for not using Loguru in logging documentation
|
|
38
|
+
|
|
39
|
+
All changes address feedback from code review while maintaining backward compatibility and test
|
|
40
|
+
coverage.
|
|
41
|
+
|
|
42
|
+
Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
|
|
43
|
+
|
|
44
|
+
- Specify logger name in caplog for verbose tests
|
|
45
|
+
([`cb477b0`](https://github.com/jacksonpradolima/gsp-py/commit/cb477b0f040ce38b60b6e3d485536e79d6d3ea19))
|
|
46
|
+
|
|
47
|
+
Update test_verbose_initialization, test_non_verbose_initialization, and
|
|
48
|
+
test_verbose_override_in_search to use caplog.at_level(logging.DEBUG, logger='gsppy.gsp') instead
|
|
49
|
+
of just caplog.at_level(logging.DEBUG). This ensures tests only capture logs from the gsppy.gsp
|
|
50
|
+
logger, preventing interference from other loggers and making tests more reliable.
|
|
51
|
+
|
|
52
|
+
Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
|
|
53
|
+
|
|
54
|
+
- Update test_setup_logging_verbose to match refactored logging
|
|
55
|
+
([`ab78c33`](https://github.com/jacksonpradolima/gsp-py/commit/ab78c33ee1c09964773b1af835c9bb133a778824))
|
|
56
|
+
|
|
57
|
+
Update test to verify logging.basicConfig is called with DEBUG level instead of checking the removed
|
|
58
|
+
explicit logger.setLevel call. This aligns with the refactored logging configuration that removed
|
|
59
|
+
redundant logger level setting.
|
|
60
|
+
|
|
61
|
+
Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
|
|
62
|
+
|
|
63
|
+
### Chores
|
|
64
|
+
|
|
65
|
+
- Update uv.lock for version 3.4.3
|
|
66
|
+
([`6a78997`](https://github.com/jacksonpradolima/gsp-py/commit/6a789979fd6a7422c063dbe5b2ff46cd0d2141c6))
|
|
67
|
+
|
|
68
|
+
### Features
|
|
69
|
+
|
|
70
|
+
- Add explicit verbosity control and structured logging
|
|
71
|
+
([`44f56d9`](https://github.com/jacksonpradolima/gsp-py/commit/44f56d947978ddad1b7f2a2cca00f59def0ce4e4))
|
|
72
|
+
|
|
73
|
+
feat: add explicit verbosity control and structured logging
|
|
74
|
+
|
|
75
|
+
### Refactoring
|
|
76
|
+
|
|
77
|
+
- Gsp initialization in tests to handle constraints explicitly and improve verbosity handling
|
|
78
|
+
([`ced0243`](https://github.com/jacksonpradolima/gsp-py/commit/ced0243e58ff444988e37f5ae472f58d4478498e))
|
|
79
|
+
|
|
80
|
+
- Gsp initialization in tests to handle constraints explicitly and improve verbosity handling
|
|
81
|
+
([`479f305`](https://github.com/jacksonpradolima/gsp-py/commit/479f305aae02217ce7b75fede5e0fb249fd1b477))
|
|
82
|
+
|
|
83
|
+
|
|
4
84
|
## v3.4.3 (2026-01-25)
|
|
5
85
|
|
|
6
86
|
### Bug Fixes
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: gsppy
|
|
3
|
-
Version: 3.
|
|
3
|
+
Version: 3.6.0
|
|
4
4
|
Summary: GSP (Generalized Sequence Pattern) algorithm in Python
|
|
5
5
|
Project-URL: Homepage, https://github.com/jacksonpradolima/gsp-py
|
|
6
6
|
Author-email: Jackson Antonio do Prado Lima <jacksonpradolima@gmail.com>
|
|
@@ -40,6 +40,7 @@ Classifier: Topic :: Scientific/Engineering :: Information Analysis
|
|
|
40
40
|
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
41
41
|
Requires-Python: >=3.10
|
|
42
42
|
Requires-Dist: click>=8.0.0
|
|
43
|
+
Requires-Dist: typing-extensions>=4.0.0
|
|
43
44
|
Provides-Extra: dev
|
|
44
45
|
Requires-Dist: cython==3.2.4; extra == 'dev'
|
|
45
46
|
Requires-Dist: hatch==1.16.3; extra == 'dev'
|
|
@@ -105,6 +106,7 @@ Sequence Pattern (GSP)** algorithm. Ideal for market basket analysis, temporal m
|
|
|
105
106
|
6. [💡 Usage](#usage)
|
|
106
107
|
- [✅ Example: Analyzing Sales Data](#example-analyzing-sales-data)
|
|
107
108
|
- [📊 Explanation: Support and Results](#explanation-support-and-results)
|
|
109
|
+
- [⏱️ Temporal Constraints](#temporal-constraints)
|
|
108
110
|
7. [⌨️ Typing](#typing)
|
|
109
111
|
8. [🌟 Planned Features](#planned-features)
|
|
110
112
|
9. [🤝 Contributing](#contributing)
|
|
@@ -123,6 +125,7 @@ principles**. Using support thresholds, GSP identifies frequent sequences of ite
|
|
|
123
125
|
- **Ordered (non-contiguous) matching**: Detects patterns where items appear in order but not necessarily adjacent, following standard GSP semantics. For example, the pattern `('A', 'C')` is found in the sequence `['A', 'B', 'C']`.
|
|
124
126
|
- **Support-based pruning**: Only retains sequences that meet the minimum support threshold.
|
|
125
127
|
- **Candidate generation**: Iteratively generates candidate sequences of increasing length.
|
|
128
|
+
- **Temporal constraints**: Support for time-constrained pattern mining with `mingap`, `maxgap`, and `maxspan` parameters to find patterns within specific time windows.
|
|
126
129
|
- **General-purpose**: Useful in retail, web analytics, social networks, temporal sequence mining, and more.
|
|
127
130
|
|
|
128
131
|
For example:
|
|
@@ -373,7 +376,28 @@ gsppy --file path/to/transactions.csv --min_support 0.3 --backend rust
|
|
|
373
376
|
- `--file`: Path to your input file (JSON or CSV). **Required**.
|
|
374
377
|
- `--min_support`: Minimum support threshold as a fraction (e.g., `0.3` for 30%). Default is `0.2`.
|
|
375
378
|
- `--backend`: Backend to use for support counting. One of `auto` (default), `python`, `rust`, or `gpu`.
|
|
376
|
-
- `--verbose`:
|
|
379
|
+
- `--verbose`: Enable detailed logging with timestamps, log levels, and process IDs for debugging and traceability.
|
|
380
|
+
- `--mingap`, `--maxgap`, `--maxspan`: Temporal constraints for time-aware pattern mining (requires timestamped transactions).
|
|
381
|
+
|
|
382
|
+
#### Verbose Mode
|
|
383
|
+
|
|
384
|
+
For debugging or to track execution in CI/CD pipelines, use the `--verbose` flag:
|
|
385
|
+
|
|
386
|
+
```bash
|
|
387
|
+
gsppy --file transactions.json --min_support 0.3 --verbose
|
|
388
|
+
```
|
|
389
|
+
|
|
390
|
+
This produces structured logging output with timestamps, log levels, and process information:
|
|
391
|
+
|
|
392
|
+
```
|
|
393
|
+
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Pre-processing transactions...
|
|
394
|
+
YYYY-MM-DDTHH:MM:SS | DEBUG | PID:4179 | gsppy.gsp | Unique candidates: [('Bread',), ('Milk',), ...]
|
|
395
|
+
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Starting GSP algorithm with min_support=0.3...
|
|
396
|
+
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Run 1: 6 candidates filtered to 5.
|
|
397
|
+
...
|
|
398
|
+
```
|
|
399
|
+
|
|
400
|
+
For complete logging documentation, see [docs/logging.md](docs/logging.md).
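
The line layout shown above can be approximated with the standard library alone. The sketch below is not taken from gsppy's source: the format string, the helper name `build_handler`, and the logger name `gsppy.gsp.demo` are assumptions chosen only to mimic the sample output.

```python
import io
import logging

# Assumed format, chosen to mimic the sample output above;
# gsppy's real configuration may differ.
LOG_FORMAT = "%(asctime)s | %(levelname)s | PID:%(process)d | %(name)s | %(message)s"
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def build_handler(stream):
    """Return a stream handler that emits pipe-separated log lines."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
    return handler


buffer = io.StringIO()
logger = logging.getLogger("gsppy.gsp.demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # leave the root logger untouched
logger.addHandler(build_handler(buffer))

logger.info("Pre-processing transactions...")
line = buffer.getvalue().strip()
print(line)
```

Attaching the handler to a named logger instead of the root logger keeps the configuration from leaking into other libraries, which is the concern the 3.5.0 changelog entry on root logger modifications describes.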
|
|
377
401
|
|
|
378
402
|
#### Example
|
|
379
403
|
|
|
@@ -470,6 +494,30 @@ result = GSP(transactions).search(min_support)
|
|
|
470
494
|
print(result)
|
|
471
495
|
```
|
|
472
496
|
|
|
497
|
+
### Verbose Mode for Debugging
|
|
498
|
+
|
|
499
|
+
Enable detailed logging to track algorithm progress and debug issues:
|
|
500
|
+
|
|
501
|
+
```python
|
|
502
|
+
from gsppy.gsp import GSP
|
|
503
|
+
|
|
504
|
+
# Enable verbose logging for the entire instance
|
|
505
|
+
gsp = GSP(transactions, verbose=True)
|
|
506
|
+
result = gsp.search(min_support=0.3)
|
|
507
|
+
|
|
508
|
+
# Or enable verbose for a specific search only
|
|
509
|
+
gsp = GSP(transactions)
|
|
510
|
+
result = gsp.search(min_support=0.3, verbose=True)
|
|
511
|
+
```
|
|
512
|
+
|
|
513
|
+
Verbose mode provides:
|
|
514
|
+
- Detailed progress information during execution
|
|
515
|
+
- Candidate generation and filtering statistics
|
|
516
|
+
- Preprocessing and validation details
|
|
517
|
+
- Output suited to debugging, research, and CI/CD integration
|
|
518
|
+
|
|
519
|
+
For complete documentation on logging, see [docs/logging.md](docs/logging.md).
|
|
520
|
+
|
|
473
521
|
### Output
|
|
474
522
|
|
|
475
523
|
The algorithm will return a list of patterns with their corresponding support.
|
|
@@ -536,6 +584,262 @@ result = gsp.search(min_support=0.5) # Need at least 2/4 sequences
|
|
|
536
584
|
|
|
537
585
|
---
|
|
538
586
|
|
|
587
|
+
## ⏱️ Temporal Constraints
|
|
588
|
+
|
|
589
|
+
GSP-Py supports **time-constrained sequential pattern mining** through three temporal constraints: `mingap`, `maxgap`, and `maxspan`. These constraints enable domain-specific applications such as medical event mining, retail analytics, and temporal user journey discovery.
|
|
590
|
+
|
|
591
|
+
### Temporal Constraint Parameters
|
|
592
|
+
|
|
593
|
+
- **`mingap`**: Minimum time gap required between consecutive items in a pattern
|
|
594
|
+
- **`maxgap`**: Maximum time gap allowed between consecutive items in a pattern
|
|
595
|
+
- **`maxspan`**: Maximum time span from the first to the last item in a pattern
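
To make the three definitions concrete, here is a minimal sketch of how one matched occurrence could be checked against them. It is an illustration, not gsppy's implementation: the function name `satisfies_constraints` is hypothetical, and treating all three bounds as inclusive is an assumption.

```python
from typing import Optional, Sequence


def satisfies_constraints(
    timestamps: Sequence[float],
    mingap: Optional[float] = None,
    maxgap: Optional[float] = None,
    maxspan: Optional[float] = None,
) -> bool:
    """Check one occurrence, given the timestamps of its items in pattern order."""
    # Time differences between consecutive items of the occurrence.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if mingap is not None and any(gap < mingap for gap in gaps):
        return False
    if maxgap is not None and any(gap > maxgap for gap in gaps):
        return False
    if maxspan is not None and timestamps and timestamps[-1] - timestamps[0] > maxspan:
        return False
    return True


print(satisfies_constraints([2, 5, 7], maxgap=10))    # gaps of 3 and 2
print(satisfies_constraints([1, 15, 20], maxgap=10))  # first gap is 14
print(satisfies_constraints([0, 1, 25], maxspan=2))   # span is 25
```

A constraint left as `None` is simply not checked, so any combination of the three can be applied.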
|
|
596
|
+
|
|
597
|
+
### Using Temporal Constraints
|
|
598
|
+
|
|
599
|
+
To use temporal constraints, your transactions must include timestamps as (item, timestamp) tuples:
|
|
600
|
+
|
|
601
|
+
```python
|
|
602
|
+
from gsppy.gsp import GSP
|
|
603
|
+
|
|
604
|
+
# Transactions with timestamps (e.g., in seconds, hours, or days)
|
|
605
|
+
timestamped_transactions = [
|
|
606
|
+
[("Login", 0), ("Browse", 2), ("AddToCart", 5), ("Purchase", 7)],
|
|
607
|
+
[("Login", 0), ("Browse", 1), ("AddToCart", 15), ("Purchase", 20)],
|
|
608
|
+
[("Login", 0), ("Browse", 3), ("AddToCart", 6), ("Purchase", 8)],
|
|
609
|
+
]
|
|
610
|
+
|
|
611
|
+
# Find patterns where consecutive events occur within 10 time units
|
|
612
|
+
gsp = GSP(timestamped_transactions, maxgap=10)
|
|
613
|
+
patterns = gsp.search(min_support=0.6)
|
|
614
|
+
|
|
615
|
+
# The pattern ("Browse", "AddToCart", "Purchase") will:
|
|
616
|
+
# - Be found in transaction 1: gaps are 3 and 2 (both ≤ 10) ✅
|
|
617
|
+
# - NOT be found in transaction 2: gap between Browse→AddToCart is 14 (exceeds maxgap) ❌
|
|
618
|
+
# - Be found in transaction 3: gaps are 3 and 2 (both ≤ 10) ✅
|
|
619
|
+
# Result: Support = 2/3 = 67% (above threshold of 60%)
|
|
620
|
+
```
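
The support figure in the comments above can be re-derived by hand without the library. The sketch below uses a small backtracking matcher written for this illustration; `occurs` is a hypothetical helper, not part of gsppy, and it handles only `maxgap`.

```python
from typing import List, Optional, Sequence, Tuple

Event = Tuple[str, float]


def occurs(
    pattern: Sequence[str],
    sequence: Sequence[Event],
    maxgap: Optional[float] = None,
) -> bool:
    """Return True if `pattern` occurs in order with every consecutive gap <= maxgap."""

    def extend(next_item: int, start: int, prev_time: Optional[float]) -> bool:
        if next_item == len(pattern):
            return True  # every pattern item has been placed
        for index in range(start, len(sequence)):
            item, time = sequence[index]
            if item != pattern[next_item]:
                continue
            if prev_time is not None and maxgap is not None and time - prev_time > maxgap:
                continue  # this occurrence is too far from the previous item
            if extend(next_item + 1, index + 1, time):
                return True
        return False

    return extend(0, 0, None)


timestamped_transactions: List[List[Event]] = [
    [("Login", 0), ("Browse", 2), ("AddToCart", 5), ("Purchase", 7)],
    [("Login", 0), ("Browse", 1), ("AddToCart", 15), ("Purchase", 20)],
    [("Login", 0), ("Browse", 3), ("AddToCart", 6), ("Purchase", 8)],
]

pattern = ("Browse", "AddToCart", "Purchase")
hits = [occurs(pattern, tx, maxgap=10) for tx in timestamped_transactions]
support = sum(hits) / len(timestamped_transactions)
print(hits, round(support, 2))
```

The matcher backtracks over repeated items, so a sequence in which an early occurrence violates `maxgap` can still match through a later one.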
|
|
621
|
+
|
|
622
|
+
### CLI Usage with Temporal Constraints
|
|
623
|
+
|
|
624
|
+
```bash
|
|
625
|
+
# Find patterns with maximum gap of 5 time units
|
|
626
|
+
gsppy --file temporal_data.json --min_support 0.3 --maxgap 5
|
|
627
|
+
|
|
628
|
+
# Find patterns with minimum gap of 2 time units
|
|
629
|
+
gsppy --file temporal_data.json --min_support 0.3 --mingap 2
|
|
630
|
+
|
|
631
|
+
# Find patterns that complete within 10 time units
|
|
632
|
+
gsppy --file temporal_data.json --min_support 0.3 --maxspan 10
|
|
633
|
+
|
|
634
|
+
# Combine multiple constraints
|
|
635
|
+
gsppy --file temporal_data.json --min_support 0.3 --mingap 1 --maxgap 5 --maxspan 10
|
|
636
|
+
```
|
|
637
|
+
|
|
638
|
+
### Real-World Examples
|
|
639
|
+
|
|
640
|
+
#### Medical Event Mining
|
|
641
|
+
|
|
642
|
+
```python
|
|
643
|
+
from gsppy.gsp import GSP
|
|
644
|
+
|
|
645
|
+
# Medical events with timestamps in days
|
|
646
|
+
medical_sequences = [
|
|
647
|
+
[("Symptom", 0), ("Diagnosis", 2), ("Treatment", 5), ("Recovery", 15)],
|
|
648
|
+
[("Symptom", 0), ("Diagnosis", 1), ("Treatment", 20), ("Recovery", 30)],
|
|
649
|
+
[("Symptom", 0), ("Diagnosis", 3), ("Treatment", 6), ("Recovery", 18)],
|
|
650
|
+
]
|
|
651
|
+
|
|
652
|
+
# Find patterns where treatment follows diagnosis within 10 days
|
|
653
|
+
gsp = GSP(medical_sequences, maxgap=10)
|
|
654
|
+
result = gsp.search(min_support=0.5)
|
|
655
|
+
|
|
656
|
+
# Pattern ("Diagnosis", "Treatment") found in sequences 1 & 3 only
|
|
657
|
+
# (sequence 2 has gap of 19 days, exceeding maxgap)
|
|
658
|
+
```
|
|
659
|
+
|
|
660
|
+
#### Retail Analytics
|
|
661
|
+
|
|
662
|
+
```python
|
|
663
|
+
from gsppy.gsp import GSP
|
|
664
|
+
|
|
665
|
+
# Customer purchases with timestamps in hours
|
|
666
|
+
purchase_sequences = [
|
|
667
|
+
[("Browse", 0), ("AddToCart", 0.5), ("Purchase", 1)],
|
|
668
|
+
[("Browse", 0), ("AddToCart", 1), ("Purchase", 25)], # Long delay
|
|
669
|
+
[("Browse", 0), ("AddToCart", 0.3), ("Purchase", 0.8)],
|
|
670
|
+
]
|
|
671
|
+
|
|
672
|
+
# Find purchase journeys that complete within 2 hours
|
|
673
|
+
gsp = GSP(purchase_sequences, maxspan=2)
|
|
674
|
+
result = gsp.search(min_support=0.5)
|
|
675
|
+
|
|
676
|
+
# Full sequence found in 2 out of 3 transactions
|
|
677
|
+
# (sequence 2 has span of 25 hours, exceeding maxspan)
|
|
678
|
+
```
|
|
679
|
+
|
|
680
|
+
#### User Journey Discovery
|
|
681
|
+
|
|
682
|
+
```python
|
|
683
|
+
from gsppy.gsp import GSP
|
|
684
|
+
|
|
685
|
+
# Website navigation with timestamps in seconds
|
|
686
|
+
navigation_sequences = [
|
|
687
|
+
[("Home", 0), ("Search", 5), ("Product", 10), ("Checkout", 15)],
|
|
688
|
+
[("Home", 0), ("Search", 3), ("Product", 8), ("Checkout", 180)],
|
|
689
|
+
[("Home", 0), ("Search", 4), ("Product", 9), ("Checkout", 14)],
|
|
690
|
+
]
|
|
691
|
+
|
|
692
|
+
# Find navigation patterns with:
|
|
693
|
+
# - Minimum 2 seconds between steps (mingap)
|
|
694
|
+
# - Maximum 20 seconds between steps (maxgap)
|
|
695
|
+
# - Complete within 30 seconds total (maxspan)
|
|
696
|
+
gsp = GSP(navigation_sequences, mingap=2, maxgap=20, maxspan=30)
|
|
697
|
+
result = gsp.search(min_support=0.5)
|
|
698
|
+
```
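
The example above does not state its result. As a hedged hand check, the sketch below applies the three constraints to the full four-step path only, assuming inclusive bounds. It does not call gsppy, and a real `search` run would also report shorter patterns.

```python
navigation_sequences = [
    [("Home", 0), ("Search", 5), ("Product", 10), ("Checkout", 15)],
    [("Home", 0), ("Search", 3), ("Product", 8), ("Checkout", 180)],
    [("Home", 0), ("Search", 4), ("Product", 9), ("Checkout", 14)],
]
mingap, maxgap, maxspan = 2, 20, 30

results = []
for number, sequence in enumerate(navigation_sequences, start=1):
    times = [timestamp for _, timestamp in sequence]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    span = times[-1] - times[0]
    # Inclusive bounds are an assumption of this sketch.
    ok = all(mingap <= gap <= maxgap for gap in gaps) and span <= maxspan
    results.append(ok)
    print(number, gaps, span, ok)
```

Under these assumptions the second sequence fails because its last step comes 172 seconds after the previous one, so the full path is supported by two of the three sequences.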
|
|
699
|
+
|
|
700
|
+
### Important Notes
|
|
701
|
+
|
|
702
|
+
- Temporal constraints require timestamped transactions, given as `(item, timestamp)` tuples
|
|
703
|
+
- If temporal constraints are specified but transactions don't have timestamps, a warning is logged and constraints are ignored
|
|
704
|
+
- When using temporal constraints, the Python backend is automatically used (accelerated backends don't yet support temporal constraints)
|
|
705
|
+
- Timestamps can be in any unit (seconds, minutes, hours, days) as long as they're consistent within your dataset
|
|
706
|
+
|
|
707
|
+
---
|
|
708
|
+
|
|
709
|
+
## 🔧 Flexible Candidate Pruning
|
|
710
|
+
|
|
711
|
+
GSP-Py supports **flexible candidate pruning strategies** that allow you to customize how candidate sequences are filtered during pattern mining. This enables optimization for different dataset characteristics and mining requirements.
|
|
712
|
+
|
|
713
|
+
### Built-in Pruning Strategies
|
|
714
|
+
|
|
715
|
+
#### 1. Support-Based Pruning (Default)
|
|
716
|
+
|
|
717
|
+
The standard GSP pruning based on minimum support threshold:
|
|
718
|
+
|
|
719
|
+
```python
|
|
720
|
+
from gsppy.gsp import GSP
|
|
721
|
+
from gsppy.pruning import SupportBasedPruning
|
|
722
|
+
|
|
723
|
+
# Explicit support-based pruning
|
|
724
|
+
pruner = SupportBasedPruning(min_support_fraction=0.3)
|
|
725
|
+
gsp = GSP(transactions, pruning_strategy=pruner)
|
|
726
|
+
result = gsp.search(min_support=0.3)
|
|
727
|
+
```
|
|
728
|
+
|
|
729
|
+
#### 2. Frequency-Based Pruning
|
|
730
|
+
|
|
731
|
+
Prunes candidates based on absolute frequency (minimum number of occurrences):
|
|
732
|
+
|
|
733
|
+
```python
|
|
734
|
+
from gsppy.pruning import FrequencyBasedPruning
|
|
735
|
+
|
|
736
|
+
# Require patterns to appear at least 5 times
|
|
737
|
+
pruner = FrequencyBasedPruning(min_frequency=5)
|
|
738
|
+
gsp = GSP(transactions, pruning_strategy=pruner)
|
|
739
|
+
result = gsp.search(min_support=0.2)
|
|
740
|
+
```
|
|
741
|
+
|
|
742
|
+
**Use case**: When you need patterns to occur a minimum absolute number of times, regardless of dataset size.
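
The difference between a relative and an absolute threshold is easiest to see on two datasets of different size. The two helper functions below are hypothetical stand-ins written for this sketch; they do not reproduce the internals of `SupportBasedPruning` or `FrequencyBasedPruning`.

```python
def kept_by_support(support_count: int, total_transactions: int, min_support_fraction: float) -> bool:
    """Relative rule: keep a candidate seen in at least the given fraction of transactions."""
    return support_count / total_transactions >= min_support_fraction


def kept_by_frequency(support_count: int, min_frequency: int) -> bool:
    """Absolute rule: keep a candidate seen at least `min_frequency` times."""
    return support_count >= min_frequency


# The same 30% support in a small and a large dataset: (support_count, total).
cases = [(3, 10), (300, 1000)]
decisions = [
    (kept_by_support(count, total, 0.2), kept_by_frequency(count, 5))
    for count, total in cases
]
print(decisions)
```

In the ten-transaction dataset the candidate passes the 20% support rule but fails the five-occurrence rule; in the thousand-transaction dataset it passes both.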
|
|
743
|
+
|
|
744
|
+
#### 3. Temporal-Aware Pruning
|
|
745
|
+
|
|
746
|
+
Optimizes pruning for time-constrained pattern mining by pre-filtering infeasible patterns:
|
|
747
|
+
|
|
748
|
+
```python
|
|
749
|
+
from gsppy.pruning import TemporalAwarePruning
|
|
750
|
+
|
|
751
|
+
# Prune patterns that cannot satisfy temporal constraints
|
|
752
|
+
pruner = TemporalAwarePruning(
|
|
753
|
+
mingap=1,
|
|
754
|
+
maxgap=5,
|
|
755
|
+
maxspan=10,
|
|
756
|
+
min_support_fraction=0.3
|
|
757
|
+
)
|
|
758
|
+
gsp = GSP(timestamped_transactions, mingap=1, maxgap=5, maxspan=10, pruning_strategy=pruner)
|
|
759
|
+
result = gsp.search(min_support=0.3)
|
|
760
|
+
```
|
|
761
|
+
|
|
762
|
+
**Use case**: Improves performance for temporal pattern mining by eliminating patterns that cannot satisfy temporal constraints.
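
One way a pattern can be ruled out before any support counting: if every consecutive pair must be at least `mingap` apart, a pattern of `k` items spans at least `(k - 1) * mingap`, so it can never fit once that exceeds `maxspan`. Whether `TemporalAwarePruning` applies exactly this rule is an assumption; the function below is a hypothetical illustration of the idea.

```python
from typing import Optional


def is_temporally_feasible(
    pattern_length: int,
    mingap: Optional[float] = None,
    maxspan: Optional[float] = None,
) -> bool:
    """Return False when no occurrence of this length can satisfy both bounds."""
    if mingap is None or maxspan is None or pattern_length < 2:
        return True
    # A pattern of k items has k - 1 gaps, each at least `mingap` long.
    return (pattern_length - 1) * mingap <= maxspan


feasible_lengths = [
    length for length in range(1, 15)
    if is_temporally_feasible(length, mingap=1, maxspan=10)
]
print(feasible_lengths[-1])
```

With `mingap=1` and `maxspan=10`, as in the example above, patterns longer than 11 items are infeasible under this rule regardless of the data.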
|
|
763
|
+
|
|
764
|
+
#### 4. Combined Pruning
|
|
765
|
+
|
|
766
|
+
Combines multiple pruning strategies for aggressive filtering:
|
|
767
|
+
|
|
768
|
+
```python
|
|
769
|
+
from gsppy.pruning import CombinedPruning, SupportBasedPruning, FrequencyBasedPruning
|
|
770
|
+
|
|
771
|
+
# Apply both support and frequency constraints
|
|
772
|
+
strategies = [
|
|
773
|
+
SupportBasedPruning(min_support_fraction=0.3),
|
|
774
|
+
FrequencyBasedPruning(min_frequency=5)
|
|
775
|
+
]
|
|
776
|
+
pruner = CombinedPruning(strategies)
|
|
777
|
+
gsp = GSP(transactions, pruning_strategy=pruner)
|
|
778
|
+
result = gsp.search(min_support=0.3)
|
|
779
|
+
```
|
|
780
|
+
|
|
781
|
+
**Use case**: When you want to combine multiple filtering criteria for more selective pattern discovery.
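
A combined strategy can be pictured as a wrapper that asks each member in turn. The classes below are hypothetical stand-ins that follow the `should_prune` signature shown in the custom-strategy example; they do not import gsppy, and the "prune if any member prunes" rule is an assumption about how `CombinedPruning` behaves.

```python
from typing import Dict, Optional, Protocol, Sequence, Tuple


class Pruner(Protocol):
    def should_prune(
        self,
        candidate: Tuple[str, ...],
        support_count: int,
        total_transactions: int,
        context: Optional[Dict] = None,
    ) -> bool: ...


class MinSupport:
    """Prune candidates below a fraction of all transactions."""

    def __init__(self, fraction: float) -> None:
        self.fraction = fraction

    def should_prune(self, candidate, support_count, total_transactions, context=None) -> bool:
        return support_count / total_transactions < self.fraction


class MinFrequency:
    """Prune candidates seen fewer than `count` times."""

    def __init__(self, count: int) -> None:
        self.count = count

    def should_prune(self, candidate, support_count, total_transactions, context=None) -> bool:
        return support_count < self.count


class AnyOf:
    """Prune as soon as any member strategy votes to prune (assumed semantics)."""

    def __init__(self, strategies: Sequence[Pruner]) -> None:
        self.strategies = list(strategies)

    def should_prune(self, candidate, support_count, total_transactions, context=None) -> bool:
        return any(
            strategy.should_prune(candidate, support_count, total_transactions, context)
            for strategy in self.strategies
        )


combined = AnyOf([MinSupport(0.3), MinFrequency(5)])
votes = [
    combined.should_prune(("A", "B"), 4, 10),   # 40% support, only 4 occurrences
    combined.should_prune(("A", "B"), 6, 10),   # passes both rules
    combined.should_prune(("A", "B"), 6, 100),  # 6 occurrences, only 6% support
]
print(votes)
```

Under these semantics the combined strategy keeps a candidate only when every member keeps it, which is why it filters more aggressively than any single member.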
|
|
782
|
+
|
|
783
|
+
### Custom Pruning Strategies
|
|
784
|
+
|
|
785
|
+
You can create custom pruning strategies by implementing the `PruningStrategy` interface:
|
|
786
|
+
|
|
787
|
+
```python
|
|
788
|
+
from gsppy.pruning import PruningStrategy
|
|
789
|
+
from typing import Dict, Optional, Tuple
|
|
790
|
+
|
|
791
|
+
class MyCustomPruner(PruningStrategy):
|
|
792
|
+
    def should_prune(
|
|
793
|
+
        self,
|
|
794
|
+
        candidate: Tuple[str, ...],
|
|
795
|
+
        support_count: int,
|
|
796
|
+
        total_transactions: int,
|
|
797
|
+
        context: Optional[Dict] = None
|
|
798
|
+
    ) -> bool:
|
|
799
|
+
        # Custom pruning logic
|
|
800
|
+
        # Return True to prune (filter out), False to keep
|
|
801
|
+
        pattern_length = len(candidate)
|
|
802
|
+
        # Example: Prune very long patterns with low support
|
|
803
|
+
        if pattern_length > 5 and support_count < 10:
|
|
804
|
+
            return True
|
|
805
|
+
        return False
|
|
806
|
+
|
|
807
|
+
# Use your custom pruner
|
|
808
|
+
custom_pruner = MyCustomPruner()
|
|
809
|
+
gsp = GSP(transactions, pruning_strategy=custom_pruner)
|
|
810
|
+
result = gsp.search(min_support=0.2)
|
|
811
|
+
```
|
|
812
|
+
|
|
813
|
+
### Performance Characteristics
|
|
814
|
+
|
|
815
|
+
Different pruning strategies have different performance tradeoffs:
|
|
816
|
+
|
|
817
|
+
| Strategy | Pruning Aggressiveness | Use Case | Performance Impact |
|
|
818
|
+
|----------|----------------------|----------|-------------------|
|
|
819
|
+
| **SupportBased** | Moderate | General-purpose mining | Baseline performance |
|
|
820
|
+
| **FrequencyBased** | High (for large datasets) | Require absolute frequency | Faster on large datasets |
|
|
821
|
+
| **TemporalAware** | High (for temporal data) | Time-constrained patterns | Significant speedup for temporal mining |
|
|
822
|
+
| **Combined** | Very High | Selective pattern discovery | Fastest, but may miss edge cases |
|
|
823
|
+
|
|
824
|
+
### Benchmarking Pruning Strategies
|
|
825
|
+
|
|
826
|
+
To compare pruning strategies on your dataset:
|
|
827
|
+
|
|
828
|
+
```bash
|
|
829
|
+
# Compare all strategies
|
|
830
|
+
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy all
|
|
831
|
+
|
|
832
|
+
# Benchmark a specific strategy
|
|
833
|
+
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy frequency
|
|
834
|
+
|
|
835
|
+
# Run multiple rounds for averaging
|
|
836
|
+
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy all --rounds 3
|
|
837
|
+
```
|
|
838
|
+
|
|
839
|
+
See `benchmarks/bench_pruning.py` for the complete benchmarking script.
|
|
840
|
+
|
|
841
|
+
---
|
|
842
|
+
|
|
539
843
|
## ⌨️ Typing
|
|
540
844
|
|
|
541
845
|
`gsppy` ships inline type information (PEP 561) via a bundled `py.typed` marker. The public API is re-exported from
|
|
@@ -549,17 +853,9 @@ larger applications.
|
|
|
549
853
|
|
|
550
854
|
We are actively working to improve GSP-Py. Here are some exciting features planned for future releases:
|
|
551
855
|
|
|
552
|
-
1. **
|
|
553
|
-
- Enable users to define their own pruning logic during the mining process.
|
|
554
|
-
|
|
555
|
-
2. **Support for Preprocessing and Postprocessing**:
|
|
856
|
+
1. **Support for Preprocessing and Postprocessing**:
|
|
556
857
|
- Add hooks to allow users to transform datasets before mining and customize the output results.
|
|
557
858
|
|
|
558
|
-
3. **Support for Time-Constrained Pattern Mining**:
|
|
559
|
-
- Extend GSP-Py to handle temporal datasets by allowing users to define time constraints (e.g., maximum time gaps
|
|
560
|
-
between events, time windows) during the sequence mining process.
|
|
561
|
-
- Enable candidate pruning and support calculations based on these temporal constraints.
|
|
562
|
-
|
|
563
859
|
Want to contribute or suggest an
|
|
564
860
|
improvement? [Open a discussion or issue!](https://github.com/jacksonpradolima/gsp-py/issues)
|
|
565
861
|
|