gsppy 3.6.0__tar.gz → 4.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {gsppy-3.6.0 → gsppy-4.1.0}/CHANGELOG.md +186 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/PKG-INFO +405 -9
- {gsppy-3.6.0 → gsppy-4.1.0}/README.md +395 -4
- gsppy-4.1.0/gsppy/__init__.py +88 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/cli.py +316 -13
- gsppy-4.1.0/gsppy/dataframe_adapters.py +458 -0
- gsppy-4.1.0/gsppy/enums.py +49 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/gsp.py +220 -15
- gsppy-4.1.0/gsppy/sequence.py +371 -0
- gsppy-4.1.0/gsppy/token_mapper.py +99 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/utils.py +120 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/pyproject.toml +18 -7
- {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_cli.py +70 -3
- gsppy-4.1.0/tests/test_dataframe.py +341 -0
- gsppy-4.1.0/tests/test_gsp_sequence_integration.py +345 -0
- gsppy-4.1.0/tests/test_sequence.py +466 -0
- gsppy-4.1.0/tests/test_spm_format.py +303 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/tox.ini +1 -1
- gsppy-3.6.0/gsppy/__init__.py +0 -43
- {gsppy-3.6.0 → gsppy-4.1.0}/.gitignore +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/CONTRIBUTING.md +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/LICENSE +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/SECURITY.md +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/accelerate.py +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/pruning.py +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/py.typed +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/rust/Cargo.lock +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/rust/Cargo.toml +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/rust/src/lib.rs +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/tests/__init__.py +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_gsp.py +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_gsp_fuzzing.py +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_pruning.py +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_temporal_constraints.py +0 -0
- {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_utils.py +0 -0
|
@@ -1,6 +1,192 @@
|
|
|
1
1
|
# CHANGELOG
|
|
2
2
|
|
|
3
3
|
|
|
4
|
+
## v4.1.0 (2026-02-01)
|
|
5
|
+
|
|
6
|
+
### Bug Fixes
|
|
7
|
+
|
|
8
|
+
- Address code review feedback - add type annotations and remove unused variables
|
|
9
|
+
([`bf62d14`](https://github.com/jacksonpradolima/gsp-py/commit/bf62d144d8f1be1e7716291d41af955450612c81))
|
|
10
|
+
|
|
11
|
+
Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
|
|
12
|
+
|
|
13
|
+
### Chores
|
|
14
|
+
|
|
15
|
+
- Update uv.lock for version 4.0.0
|
|
16
|
+
([`f1ae2af`](https://github.com/jacksonpradolima/gsp-py/commit/f1ae2af2aa71ea44b9d8625ed647da79259ec096))
|
|
17
|
+
|
|
18
|
+
### Documentation
|
|
19
|
+
|
|
20
|
+
- Add Sequence documentation and examples to README
|
|
21
|
+
([`62d0d02`](https://github.com/jacksonpradolima/gsp-py/commit/62d0d02c19c5751331df53e680cc0b9aee19677b))
|
|
22
|
+
|
|
23
|
+
Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
|
|
24
|
+
|
|
25
|
+
- Update docs/ with Sequence abstraction documentation
|
|
26
|
+
([`2368cf3`](https://github.com/jacksonpradolima/gsp-py/commit/2368cf30239139e8e2af5457ee6acf14db30ef06))
|
|
27
|
+
|
|
28
|
+
Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
|
|
29
|
+
|
|
30
|
+
### Features
|
|
31
|
+
|
|
32
|
+
- Add Sequence abstraction class with comprehensive tests
|
|
33
|
+
([`6011bdb`](https://github.com/jacksonpradolima/gsp-py/commit/6011bdb7104755d109b58261b36e1dd1c36b2d61))
|
|
34
|
+
|
|
35
|
+
Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
|
|
36
|
+
|
|
37
|
+
- Integrate Sequence objects with GSP.search() via return_sequences parameter
|
|
38
|
+
([`7476588`](https://github.com/jacksonpradolima/gsp-py/commit/7476588f2b277276748e0550366014f2a93d8ef5))
|
|
39
|
+
|
|
40
|
+
Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
|
|
41
|
+
|
|
42
|
+
- Introduce Sequence abstraction for typed pattern representation
|
|
43
|
+
([`01ca37b`](https://github.com/jacksonpradolima/gsp-py/commit/01ca37b9bc4572eb7b1c1eaf6fdf26ca2324a3c5))
|
|
44
|
+
|
|
45
|
+
### Refactoring
|
|
46
|
+
|
|
47
|
+
- Address code review feedback - remove redundant checks
|
|
48
|
+
([`621e940`](https://github.com/jacksonpradolima/gsp-py/commit/621e9403379ae0fd07bf45b97616b9979f2d4aa6))
|
|
49
|
+
|
|
50
|
+
Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
|
|
51
|
+
|
|
52
|
+
- Reduce cognitive complexity in sequence_example.py and fix f-string
|
|
53
|
+
([`63ac4f9`](https://github.com/jacksonpradolima/gsp-py/commit/63ac4f9ceb869a5228cdccdcf6a9d0b9f46f0350))
|
|
54
|
+
|
|
55
|
+
Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
|
|
56
|
+
|
|
57
|
+
- Update type annotations and improve search method in GSP class
|
|
58
|
+
([`e2e9a3f`](https://github.com/jacksonpradolima/gsp-py/commit/e2e9a3f473d1e0c5d6990c8b7c5837a251761032))
|
|
59
|
+
|
|
60
|
+
|
|
61
|
+
## v4.0.0 (2026-02-01)
|
|
62
|
+
|
|
63
|
+
### Chores
|
|
64
|
+
|
|
65
|
+
- Add additional VSCode extensions for improved development experience
|
|
66
|
+
([`107dfa4`](https://github.com/jacksonpradolima/gsp-py/commit/107dfa422005f4cdec4655a9751fd0d6e597773f))
|
|
67
|
+
|
|
68
|
+
- Update uv.lock for version 3.6.1
|
|
69
|
+
([`d8d7394`](https://github.com/jacksonpradolima/gsp-py/commit/d8d73947d570844c02e9d974b626da26f07cf1e6))
|
|
70
|
+
|
|
71
|
+
### Features
|
|
72
|
+
|
|
73
|
+
- Add SPM/GSP delimiter format loader and token mapping utilities
|
|
74
|
+
([`4ac1d34`](https://github.com/jacksonpradolima/gsp-py/commit/4ac1d34d166f21d30968872cf16c1bde3ff1f2aa))
|
|
75
|
+
|
|
76
|
+
### Refactoring
|
|
77
|
+
|
|
78
|
+
- Add type casting for return values in read_transactions_from_spm
|
|
79
|
+
([`2099bfd`](https://github.com/jacksonpradolima/gsp-py/commit/2099bfd5253a1dc058dd46bd0da077810958fa76))
|
|
80
|
+
|
|
81
|
+
- Update read_transactions_from_spm to return mappings and adjust tests
|
|
82
|
+
([`373b8ff`](https://github.com/jacksonpradolima/gsp-py/commit/373b8ff0d7f131140bcdbd039fae0d02572e86b7))
|
|
83
|
+
|
|
84
|
+
|
|
85
|
+
## v3.6.1 (2026-01-31)
|
|
86
|
+
|
|
87
|
+
### Bug Fixes
|
|
88
|
+
|
|
89
|
+
- Typing for polars and pandas
|
|
90
|
+
([`0773992`](https://github.com/jacksonpradolima/gsp-py/commit/07739921d074e55c8436a88a73e510b1d8761510))
|
|
91
|
+
|
|
92
|
+
### Build System
|
|
93
|
+
|
|
94
|
+
- **deps**: Bump actions/checkout in /.github/workflows
|
|
95
|
+
([`7af193d`](https://github.com/jacksonpradolima/gsp-py/commit/7af193d515972eeca5d8e354e91a60e488357cfb))
|
|
96
|
+
|
|
97
|
+
Bumps [actions/checkout](https://github.com/actions/checkout) from 4.3.1 to 6.0.2. - [Release
|
|
98
|
+
notes](https://github.com/actions/checkout/releases) -
|
|
99
|
+
[Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) -
|
|
100
|
+
[Commits](https://github.com/actions/checkout/compare/v4.3.1...de0fac2e4500dabe0009e67214ff5f5447ce83dd)
|
|
101
|
+
|
|
102
|
+
--- updated-dependencies: - dependency-name: actions/checkout dependency-version: 6.0.2
|
|
103
|
+
|
|
104
|
+
dependency-type: direct:production
|
|
105
|
+
|
|
106
|
+
update-type: version-update:semver-major
|
|
107
|
+
|
|
108
|
+
...
|
|
109
|
+
|
|
110
|
+
Signed-off-by: dependabot[bot] <support@github.com>
|
|
111
|
+
|
|
112
|
+
- **deps**: Bump actions/github-script in /.github/workflows
|
|
113
|
+
([`03a7588`](https://github.com/jacksonpradolima/gsp-py/commit/03a7588301421369731d3d543f81b93c25c292ef))
|
|
114
|
+
|
|
115
|
+
Bumps [actions/github-script](https://github.com/actions/github-script) from 7.0.1 to 8.0.0. -
|
|
116
|
+
[Release notes](https://github.com/actions/github-script/releases) -
|
|
117
|
+
[Commits](https://github.com/actions/github-script/compare/60a0d83039c74a4aee543508d2ffcb1c3799cdea...ed597411d8f924073f98dfc5c65a23a2325f34cd)
|
|
118
|
+
|
|
119
|
+
--- updated-dependencies: - dependency-name: actions/github-script dependency-version: 8.0.0
|
|
120
|
+
|
|
121
|
+
dependency-type: direct:production
|
|
122
|
+
|
|
123
|
+
update-type: version-update:semver-major
|
|
124
|
+
|
|
125
|
+
...
|
|
126
|
+
|
|
127
|
+
Signed-off-by: dependabot[bot] <support@github.com>
|
|
128
|
+
|
|
129
|
+
- **deps**: Bump actions/setup-python in /.github/workflows
|
|
130
|
+
([`75771bf`](https://github.com/jacksonpradolima/gsp-py/commit/75771bff660b3842f2c8d84bdaeb013941e5abe0))
|
|
131
|
+
|
|
132
|
+
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5.6.0 to 6.2.0. -
|
|
133
|
+
[Release notes](https://github.com/actions/setup-python/releases) -
|
|
134
|
+
[Commits](https://github.com/actions/setup-python/compare/v5.6.0...a309ff8b426b58ec0e2a45f0f869d46889d02405)
|
|
135
|
+
|
|
136
|
+
--- updated-dependencies: - dependency-name: actions/setup-python dependency-version: 6.2.0
|
|
137
|
+
|
|
138
|
+
dependency-type: direct:production
|
|
139
|
+
|
|
140
|
+
update-type: version-update:semver-major
|
|
141
|
+
|
|
142
|
+
...
|
|
143
|
+
|
|
144
|
+
Signed-off-by: dependabot[bot] <support@github.com>
|
|
145
|
+
|
|
146
|
+
- **deps**: Bump actions/stale in /.github/workflows
|
|
147
|
+
([`e699ccd`](https://github.com/jacksonpradolima/gsp-py/commit/e699ccdac689734b4694665d924ace8bba479253))
|
|
148
|
+
|
|
149
|
+
Bumps [actions/stale](https://github.com/actions/stale) from 9.0.0 to 10.1.1. - [Release
|
|
150
|
+
notes](https://github.com/actions/stale/releases) -
|
|
151
|
+
[Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md) -
|
|
152
|
+
[Commits](https://github.com/actions/stale/compare/28ca1036281a5e5922ead5184a1bbf96e5fc984e...997185467fa4f803885201cee163a9f38240193d)
|
|
153
|
+
|
|
154
|
+
--- updated-dependencies: - dependency-name: actions/stale dependency-version: 10.1.1
|
|
155
|
+
|
|
156
|
+
dependency-type: direct:production
|
|
157
|
+
|
|
158
|
+
update-type: version-update:semver-major
|
|
159
|
+
|
|
160
|
+
...
|
|
161
|
+
|
|
162
|
+
Signed-off-by: dependabot[bot] <support@github.com>
|
|
163
|
+
|
|
164
|
+
- **deps**: Bump actions/upload-artifact in /.github/workflows
|
|
165
|
+
([`17efaff`](https://github.com/jacksonpradolima/gsp-py/commit/17efaffc755c017e066c0286464899ead6e2cae4))
|
|
166
|
+
|
|
167
|
+
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.6.2 to 6.0.0. -
|
|
168
|
+
[Release notes](https://github.com/actions/upload-artifact/releases) -
|
|
169
|
+
[Commits](https://github.com/actions/upload-artifact/compare/v4.6.2...b7c566a772e6b6bfb58ed0dc250532a479d7789f)
|
|
170
|
+
|
|
171
|
+
--- updated-dependencies: - dependency-name: actions/upload-artifact dependency-version: 6.0.0
|
|
172
|
+
|
|
173
|
+
dependency-type: direct:production
|
|
174
|
+
|
|
175
|
+
update-type: version-update:semver-major
|
|
176
|
+
|
|
177
|
+
...
|
|
178
|
+
|
|
179
|
+
Signed-off-by: dependabot[bot] <support@github.com>
|
|
180
|
+
|
|
181
|
+
### Chores
|
|
182
|
+
|
|
183
|
+
- Update uv.lock for version 3.6.0
|
|
184
|
+
([`4c2a5e5`](https://github.com/jacksonpradolima/gsp-py/commit/4c2a5e5967482443c2db645c9ba4744bd2110dd1))
|
|
185
|
+
|
|
186
|
+
- **deps**: Bump ty and ruff
|
|
187
|
+
([`07a20df`](https://github.com/jacksonpradolima/gsp-py/commit/07a20df9fb4ff3a3b022d28d152b586ca45383c8))
|
|
188
|
+
|
|
189
|
+
|
|
4
190
|
## v3.6.0 (2026-01-26)
|
|
5
191
|
|
|
6
192
|
### Chores
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: gsppy
|
|
3
|
-
Version:
|
|
3
|
+
Version: 4.1.0
|
|
4
4
|
Summary: GSP (Generalized Sequence Pattern) algorithm in Python
|
|
5
5
|
Project-URL: Homepage, https://github.com/jacksonpradolima/gsp-py
|
|
6
6
|
Author-email: Jackson Antonio do Prado Lima <jacksonpradolima@gmail.com>
|
|
@@ -32,15 +32,20 @@ Classifier: Intended Audience :: Science/Research
|
|
|
32
32
|
Classifier: License :: OSI Approved :: MIT License
|
|
33
33
|
Classifier: Natural Language :: English
|
|
34
34
|
Classifier: Operating System :: OS Independent
|
|
35
|
-
Classifier: Programming Language :: Python :: 3.10
|
|
36
35
|
Classifier: Programming Language :: Python :: 3.11
|
|
37
36
|
Classifier: Programming Language :: Python :: 3.12
|
|
38
37
|
Classifier: Programming Language :: Python :: 3.13
|
|
38
|
+
Classifier: Programming Language :: Python :: 3.14
|
|
39
39
|
Classifier: Topic :: Scientific/Engineering :: Information Analysis
|
|
40
40
|
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
41
|
-
Requires-Python: >=3.
|
|
41
|
+
Requires-Python: >=3.11
|
|
42
42
|
Requires-Dist: click>=8.0.0
|
|
43
43
|
Requires-Dist: typing-extensions>=4.0.0
|
|
44
|
+
Provides-Extra: dataframe
|
|
45
|
+
Requires-Dist: pandas-stubs>=2.3.3.260113; extra == 'dataframe'
|
|
46
|
+
Requires-Dist: pandas>=3.0.0; extra == 'dataframe'
|
|
47
|
+
Requires-Dist: polars>=1.37.1; extra == 'dataframe'
|
|
48
|
+
Requires-Dist: pyarrow>=10.0.0; extra == 'dataframe'
|
|
44
49
|
Provides-Extra: dev
|
|
45
50
|
Requires-Dist: cython==3.2.4; extra == 'dev'
|
|
46
51
|
Requires-Dist: hatch==1.16.3; extra == 'dev'
|
|
@@ -51,9 +56,9 @@ Requires-Dist: pyright==1.1.408; extra == 'dev'
|
|
|
51
56
|
Requires-Dist: pytest-benchmark==5.2.3; extra == 'dev'
|
|
52
57
|
Requires-Dist: pytest-cov==7.0.0; extra == 'dev'
|
|
53
58
|
Requires-Dist: pytest==9.0.2; extra == 'dev'
|
|
54
|
-
Requires-Dist: ruff==0.14.
|
|
59
|
+
Requires-Dist: ruff==0.14.14; extra == 'dev'
|
|
55
60
|
Requires-Dist: tox==4.34.1; extra == 'dev'
|
|
56
|
-
Requires-Dist: ty==0.0.
|
|
61
|
+
Requires-Dist: ty==0.0.14; extra == 'dev'
|
|
57
62
|
Provides-Extra: docs
|
|
58
63
|
Requires-Dist: mkdocs-gen-files<1,>=0.5; extra == 'docs'
|
|
59
64
|
Requires-Dist: mkdocs-literate-nav<1,>=0.6; extra == 'docs'
|
|
@@ -72,7 +77,7 @@ Description-Content-Type: text/markdown
|
|
|
72
77
|
|
|
73
78
|
[](https://pypi.org/project/gsppy/)
|
|
74
79
|
[](https://pypi.org/project/gsppy)
|
|
75
|
-

|
|
76
81
|
|
|
77
82
|
[](https://securityscorecards.dev/viewer/?uri=github.com/jacksonpradolima/gsp-py)
|
|
78
83
|
[](https://github.com/jacksonpradolima/gsp-py/actions/workflows/slsa-provenance.yml)
|
|
@@ -90,7 +95,7 @@ Description-Content-Type: text/markdown
|
|
|
90
95
|
Sequence Pattern (GSP)** algorithm. Ideal for market basket analysis, temporal mining, and user journey discovery.
|
|
91
96
|
|
|
92
97
|
> [!IMPORTANT]
|
|
93
|
-
> GSP-Py is compatible with Python 3.
|
|
98
|
+
> GSP-Py is compatible with Python 3.11 and later versions!
|
|
94
99
|
|
|
95
100
|
---
|
|
96
101
|
|
|
@@ -106,6 +111,7 @@ Sequence Pattern (GSP)** algorithm. Ideal for market basket analysis, temporal m
|
|
|
106
111
|
6. [💡 Usage](#usage)
|
|
107
112
|
- [✅ Example: Analyzing Sales Data](#example-analyzing-sales-data)
|
|
108
113
|
- [📊 Explanation: Support and Results](#explanation-support-and-results)
|
|
114
|
+
- [📊 DataFrame Input Support](#dataframe-input-support)
|
|
109
115
|
- [⏱️ Temporal Constraints](#temporal-constraints)
|
|
110
116
|
7. [⌨️ Typing](#typing)
|
|
111
117
|
8. [🌟 Planned Features](#planned-features)
|
|
@@ -357,6 +363,34 @@ Your input file should be either:
|
|
|
357
363
|
Bread,Milk,Diaper,Coke
|
|
358
364
|
```
|
|
359
365
|
|
|
366
|
+
- **SPM/GSP Format**: Uses delimiters to separate elements and sequences. This format is commonly used in sequential pattern mining datasets.
|
|
367
|
+
- `-1`: Marks the end of an element (itemset)
|
|
368
|
+
- `-2`: Marks the end of a sequence (transaction)
|
|
369
|
+
|
|
370
|
+
Example:
|
|
371
|
+
```text
|
|
372
|
+
1 2 -1 3 -1 -2
|
|
373
|
+
4 -1 5 6 -1 -2
|
|
374
|
+
1 -1 2 3 -1 -2
|
|
375
|
+
```
|
|
376
|
+
|
|
377
|
+
The above represents:
|
|
378
|
+
- Transaction 1: `[[1, 2], [3]]` → flattened to `[1, 2, 3]`
|
|
379
|
+
- Transaction 2: `[[4], [5, 6]]` → flattened to `[4, 5, 6]`
|
|
380
|
+
- Transaction 3: `[[1], [2, 3]]` → flattened to `[1, 2, 3]`
|
|
381
|
+
|
|
382
|
+
String tokens are also supported:
|
|
383
|
+
```text
|
|
384
|
+
A B -1 C -1 -2
|
|
385
|
+
D -1 E F -1 -2
|
|
386
|
+
```
|
|
387
|
+
|
|
388
|
+
- **Parquet/Arrow Files**: Modern columnar data formats (requires 'gsppy[dataframe]')
|
|
389
|
+
```bash
|
|
390
|
+
pip install 'gsppy[dataframe]'
|
|
391
|
+
```
|
|
392
|
+
This installs optional dependencies: `polars`, `pandas`, and `pyarrow` for DataFrame support.
|
|
393
|
+
|
|
360
394
|
### Running the CLI
|
|
361
395
|
|
|
362
396
|
Use the following command to run GSPPy on your data:
|
|
@@ -371,9 +405,16 @@ Or for CSV files:
|
|
|
371
405
|
gsppy --file path/to/transactions.csv --min_support 0.3 --backend rust
|
|
372
406
|
```
|
|
373
407
|
|
|
408
|
+
For SPM/GSP format files, use the `--format spm` option:
|
|
409
|
+
|
|
410
|
+
```bash
|
|
411
|
+
gsppy --file path/to/data.txt --format spm --min_support 0.3
|
|
412
|
+
```
|
|
413
|
+
|
|
374
414
|
#### CLI Options
|
|
375
415
|
|
|
376
|
-
- `--file`: Path to your input file (JSON or
|
|
416
|
+
- `--file`: Path to your input file (JSON, CSV, SPM, Parquet, or Arrow). **Required**.
|
|
417
|
+
- `--format`: File format to use. Options: `auto` (default, auto-detect from extension), `json`, `csv`, `spm`, `parquet`, `arrow`.
|
|
377
418
|
- `--min_support`: Minimum support threshold as a fraction (e.g., `0.3` for 30%). Default is `0.2`.
|
|
378
419
|
- `--backend`: Backend to use for support counting. One of `auto` (default), `python`, `rust`, or `gpu`.
|
|
379
420
|
- `--verbose`: Enable detailed logging with timestamps, log levels, and process IDs for debugging and traceability.
|
|
@@ -518,6 +559,159 @@ Verbose mode provides:
|
|
|
518
559
|
|
|
519
560
|
For complete documentation on logging, see [docs/logging.md](docs/logging.md).
|
|
520
561
|
|
|
562
|
+
### Using Sequence Objects for Rich Pattern Representation
|
|
563
|
+
|
|
564
|
+
GSP-Py 4.1 introduces a **Sequence abstraction class** for working with sequential patterns. A `Sequence` encapsulates the pattern items, a support count, and optional metadata in an immutable, hashable object.
|
|
565
|
+
|
|
566
|
+
#### Traditional Dict-based Output (Default)
|
|
567
|
+
|
|
568
|
+
```python
|
|
569
|
+
from gsppy import GSP
|
|
570
|
+
|
|
571
|
+
transactions = [
|
|
572
|
+
['Bread', 'Milk'],
|
|
573
|
+
['Bread', 'Diaper', 'Beer', 'Eggs'],
|
|
574
|
+
['Milk', 'Diaper', 'Beer', 'Coke']
|
|
575
|
+
]
|
|
576
|
+
|
|
577
|
+
gsp = GSP(transactions)
|
|
578
|
+
result = gsp.search(min_support=0.3)
|
|
579
|
+
|
|
580
|
+
# Returns one dict per pattern length, e.g. [{('Bread',): 2, ('Milk',): 2, ...}, {('Diaper', 'Beer'): 2, ...}, ...]
|
|
581
|
+
for level_patterns in result:
|
|
582
|
+
    for pattern, support in level_patterns.items():
|
|
583
|
+
        print(f"Pattern: {pattern}, Support: {support}")
|
|
584
|
+
```
|
|
585
|
+
|
|
586
|
+
#### Sequence Objects (New Feature)
|
|
587
|
+
|
|
588
|
+
```python
|
|
589
|
+
from gsppy import GSP
|
|
590
|
+
|
|
591
|
+
transactions = [
|
|
592
|
+
['Bread', 'Milk'],
|
|
593
|
+
['Bread', 'Diaper', 'Beer', 'Eggs'],
|
|
594
|
+
['Milk', 'Diaper', 'Beer', 'Coke']
|
|
595
|
+
]
|
|
596
|
+
|
|
597
|
+
gsp = GSP(transactions)
|
|
598
|
+
result = gsp.search(min_support=0.3, return_sequences=True)
|
|
599
|
+
|
|
600
|
+
# Returns one list per pattern length, e.g. [[Sequence(('Bread',), support=2), ...], [Sequence(('Diaper', 'Beer'), support=2), ...], ...]
|
|
601
|
+
for level_patterns in result:
|
|
602
|
+
    for seq in level_patterns:
|
|
603
|
+
        print(f"Pattern: {seq.items}, Support: {seq.support}, Length: {seq.length}")
|
|
604
|
+
        # Access sequence properties
|
|
605
|
+
        print(f" First item: {seq.first_item}, Last item: {seq.last_item}")
|
|
606
|
+
        # Check if item is in sequence
|
|
607
|
+
        if "Milk" in seq:
|
|
608
|
+
            print(" Contains Milk!")
|
|
609
|
+
```
|
|
610
|
+
|
|
611
|
+
#### Key Benefits of Sequence Objects
|
|
612
|
+
|
|
613
|
+
1. **Rich API**: Access pattern properties like `length`, `first_item`, `last_item`
|
|
614
|
+
2. **Type Safety**: IDE autocomplete and better type hints
|
|
615
|
+
3. **Immutable & Hashable**: Can be used as dictionary keys
|
|
616
|
+
4. **Extensible**: Add metadata for confidence, lift, or custom properties
|
|
617
|
+
5. **Backward Compatible**: Convert to/from dict format as needed
|
|
618
|
+
|
|
619
|
+
```python
|
|
620
|
+
from gsppy import Sequence, sequences_to_dict, dict_to_sequences
|
|
621
|
+
|
|
622
|
+
# Create custom sequences
|
|
623
|
+
seq = Sequence.from_tuple(("A", "B", "C"), support=5)
|
|
624
|
+
|
|
625
|
+
# Extend sequences
|
|
626
|
+
extended = seq.extend("D") # Creates Sequence(("A", "B", "C", "D"))
|
|
627
|
+
|
|
628
|
+
# Add metadata
|
|
629
|
+
seq_with_meta = seq.with_metadata(confidence=0.85, lift=1.5)
|
|
630
|
+
|
|
631
|
+
# Convert between formats for compatibility
|
|
632
|
+
seq_result = gsp.search(min_support=0.3, return_sequences=True)
|
|
633
|
+
dict_format = sequences_to_dict(seq_result[0]) # Convert to dict
|
|
634
|
+
```
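
The "Immutable & Hashable" benefit can be illustrated with a minimal sketch. It uses a hypothetical `PatternSketch` dataclass, not gsppy's actual `Sequence` class, and assumes only the standard library. The field names mirror the ones used above, but the implementation details are an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class PatternSketch:
    """Hypothetical stand-in for an immutable pattern record (not gsppy's Sequence)."""

    items: Tuple[str, ...]
    support: int = 0
    # Excluded from equality and hashing so the mutable dict cannot affect the hash.
    metadata: Dict[str, Any] = field(default_factory=dict, compare=False)

    @property
    def length(self) -> int:
        return len(self.items)

    def extend(self, item: str) -> "PatternSketch":
        """Return a new, longer pattern; the original is left unchanged."""
        return PatternSketch(self.items + (item,))


abc = PatternSketch(("A", "B", "C"), support=5)
abcd = abc.extend("D")
notes = {abc: "seed pattern"}  # hashable, so usable as a dict key
print(abc.length, abcd.items, notes[PatternSketch(("A", "B", "C"), support=5)])
```

Because the dataclass is frozen, assigning to `abc.support` raises an error, and two instances with the same items and support hash to the same dictionary key.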
|
|
635
|
+
|
|
636
|
+
For a complete example, see [examples/sequence_example.py](examples/sequence_example.py).
|
|
637
|
+
|
|
638
|
+
### Loading SPM/GSP Format Files
|
|
639
|
+
|
|
640
|
+
GSP-Py supports loading datasets in the classical SPM/GSP delimiter format, which is widely used in sequential pattern mining research. This format uses:
|
|
641
|
+
- `-1` to mark the end of an element (itemset)
|
|
642
|
+
- `-2` to mark the end of a sequence (transaction)
|
|
643
|
+
|
|
644
|
+
#### Using the SPM Loader
|
|
645
|
+
|
|
646
|
+
```python
|
|
647
|
+
from gsppy.utils import read_transactions_from_spm
|
|
648
|
+
from gsppy import GSP
|
|
649
|
+
|
|
650
|
+
# Load SPM format file
|
|
651
|
+
transactions = read_transactions_from_spm('data.txt')
|
|
652
|
+
|
|
653
|
+
# Run GSP algorithm
|
|
654
|
+
gsp = GSP(transactions)
|
|
655
|
+
result = gsp.search(min_support=0.3)
|
|
656
|
+
```
|
|
657
|
+
|
|
658
|
+
#### SPM Format Examples
|
|
659
|
+
|
|
660
|
+
**Simple sequence file (`data.txt`):**
|
|
661
|
+
```text
|
|
662
|
+
1 2 -1 3 -1 -2
|
|
663
|
+
4 -1 5 6 -1 -2
|
|
664
|
+
1 -1 2 3 -1 -2
|
|
665
|
+
```
|
|
666
|
+
|
|
667
|
+
This represents:
|
|
668
|
+
- Transaction 1: Items [1, 2] followed by item [3] → flattened to [1, 2, 3]
|
|
669
|
+
- Transaction 2: Item [4] followed by items [5, 6] → flattened to [4, 5, 6]
|
|
670
|
+
- Transaction 3: Item [1] followed by items [2, 3] → flattened to [1, 2, 3]
|
|
671
|
+
|
|
672
|
+
**String tokens are also supported:**
|
|
673
|
+
```text
|
|
674
|
+
A B -1 C -1 -2
|
|
675
|
+
D -1 E F -1 -2
|
|
676
|
+
```
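
To make the delimiter rules concrete, here is a minimal sketch of a parser for this format. The function names `parse_spm_lines` and `flatten` are hypothetical and this is not gsppy's loader. The sketch assumes whitespace-separated tokens and keeps every token as a string.

```python
from typing import Iterable, List


def parse_spm_lines(lines: Iterable[str]) -> List[List[List[str]]]:
    """Parse SPM-style lines into sequences of elements (itemsets) of string tokens."""
    sequences: List[List[List[str]]] = []
    for line in lines:
        elements: List[List[str]] = []
        current: List[str] = []
        for token in line.split():
            if token == "-1":      # end of the current element
                if current:
                    elements.append(current)
                current = []
            elif token == "-2":    # end of the sequence
                break
            else:
                current.append(token)
        if current:                # element not closed by -1
            elements.append(current)
        if elements:               # skip empty lines
            sequences.append(elements)
    return sequences


def flatten(sequence: List[List[str]]) -> List[str]:
    """Concatenate the elements of one sequence into a flat item list."""
    return [item for element in sequence for item in element]


data = ["1 2 -1 3 -1 -2", "4 -1 5 6 -1 -2", "1 -1 2 3 -1 -2"]
nested = parse_spm_lines(data)
flat = [flatten(seq) for seq in nested]
print(nested[0])  # [['1', '2'], ['3']]
print(flat)       # [['1', '2', '3'], ['4', '5', '6'], ['1', '2', '3']]
```

The same function accepts string tokens such as `A B -1 C -1 -2`, since it never converts tokens to integers.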
|
|
677
|
+
|
|
678
|
+
#### Token Mapping
|
|
679
|
+
|
|
680
|
+
For workflows requiring conversion between string tokens and integer IDs, use the `TokenMapper`:
|
|
681
|
+
|
|
682
|
+
```python
|
|
683
|
+
from gsppy.utils import read_transactions_from_spm
|
|
684
|
+
from gsppy import TokenMapper
|
|
685
|
+
|
|
686
|
+
# Load with mappings
|
|
687
|
+
transactions, str_to_int, int_to_str = read_transactions_from_spm(
|
|
688
|
+
'data.txt',
|
|
689
|
+
return_mappings=True
|
|
690
|
+
)
|
|
691
|
+
|
|
692
|
+
print("String to Int:", str_to_int)
|
|
693
|
+
# Output: {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}
|
|
694
|
+
|
|
695
|
+
print("Int to String:", int_to_str)
|
|
696
|
+
# Output: {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6'}
|
|
697
|
+
|
|
698
|
+
# Use the TokenMapper class directly
|
|
699
|
+
mapper = TokenMapper()
|
|
700
|
+
id_a = mapper.add_token("A")
|
|
701
|
+
id_b = mapper.add_token("B")
|
|
702
|
+
print(f"A -> {id_a}, B -> {id_b}")
|
|
703
|
+
# Output: A -> 0, B -> 1
|
|
704
|
+
```
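
The idea behind a token mapper can be sketched in a few lines. The class below, `SimpleTokenMapper`, is a hypothetical illustration and not gsppy's `TokenMapper`; it assumes IDs are assigned in first-seen order starting at 0, which matches the outputs shown above but is not confirmed for the library.

```python
from typing import Dict


class SimpleTokenMapper:
    """Hypothetical bidirectional token <-> integer ID map (not gsppy's TokenMapper)."""

    def __init__(self) -> None:
        self.str_to_int: Dict[str, int] = {}
        self.int_to_str: Dict[int, str] = {}

    def add_token(self, token: str) -> int:
        """Return the ID for token, assigning the next free ID on first sight."""
        if token not in self.str_to_int:
            new_id = len(self.str_to_int)
            self.str_to_int[token] = new_id
            self.int_to_str[new_id] = token
        return self.str_to_int[token]


mapper = SimpleTokenMapper()
encoded = [[mapper.add_token(t) for t in seq] for seq in [["A", "B", "C"], ["B", "D"]]]
decoded = [[mapper.int_to_str[i] for i in seq] for seq in encoded]
print(encoded)  # [[0, 1, 2], [1, 3]]
print(decoded)  # [['A', 'B', 'C'], ['B', 'D']]
```

Keeping both dictionaries in step makes encoding and decoding constant-time lookups, at the cost of storing each token twice.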
|
|
705
|
+
|
|
706
|
+
#### Edge Cases Handled
|
|
707
|
+
|
|
708
|
+
The SPM loader gracefully handles:
|
|
709
|
+
- Empty lines (skipped)
|
|
710
|
+
- Missing `-2` delimiter at end of line
|
|
711
|
+
- Extra or consecutive delimiters
|
|
712
|
+
- Mixed-length elements in sequences
|
|
713
|
+
- Both integer and string tokens
|
|
714
|
+
|
|
521
715
|
### Output
|
|
522
716
|
|
|
523
717
|
The algorithm will return a list of patterns with their corresponding support.
|
|
@@ -584,6 +778,208 @@ result = gsp.search(min_support=0.5) # Need at least 2/4 sequences
|
|
|
584
778
|
|
|
585
779
|
---
|
|
586
780
|
|
|
781
|
+
## 📊 DataFrame Input Support
|
|
782
|
+
|
|
783
|
+
GSP-Py accepts **Polars and Pandas DataFrames** as input, so data loaded from Arrow or Parquet files can be passed to `GSP` directly. This is useful in data engineering pipelines that already work with DataFrames.
|
|
784
|
+
|
|
785
|
+
### Installation
|
|
786
|
+
|
|
787
|
+
Install GSP-Py with DataFrame support:
|
|
788
|
+
|
|
789
|
+
```bash
|
|
790
|
+
pip install 'gsppy[dataframe]'
|
|
791
|
+
```
|
|
792
|
+
|
|
793
|
+
This installs the optional dependencies: `polars`, `pandas`, and `pyarrow`.
|
|
794
|
+
|
|
795
|
+
### DataFrame Input Formats
|
|
796
|
+
|
|
797
|
+
GSP-Py supports two DataFrame formats:
|
|
798
|
+
|
|
799
|
+
#### 1. Grouped Format (Transaction ID + Item Columns)
|
|
800
|
+
|
|
801
|
+
Use when your data has separate rows for each item in a transaction:
|
|
802
|
+
|
|
803
|
+
```python
|
|
804
|
+
import polars as pl
|
|
805
|
+
from gsppy import GSP
|
|
806
|
+
|
|
807
|
+
# Polars DataFrame with transaction_id and item columns
|
|
808
|
+
df = pl.DataFrame({
|
|
809
|
+
"transaction_id": [1, 1, 2, 2, 2, 3, 3],
|
|
810
|
+
"item": ["Bread", "Milk", "Bread", "Diaper", "Beer", "Milk", "Coke"],
|
|
811
|
+
})
|
|
812
|
+
|
|
813
|
+
# Run GSP directly on the DataFrame
|
|
814
|
+
gsp = GSP(df, transaction_col="transaction_id", item_col="item")
|
|
815
|
+
patterns = gsp.search(min_support=0.3)
|
|
816
|
+
|
|
817
|
+
for level, freq_patterns in enumerate(patterns, start=1):
|
|
818
|
+
    print(f"\n{level}-Sequence Patterns:")
|
|
819
|
+
    for pattern, support in freq_patterns.items():
|
|
820
|
+
        print(f" {pattern}: {support}")
|
|
821
|
+
```
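
Conceptually, the grouped format is turned into one item list per transaction ID. The sketch below shows that step in plain Python on the same data. The helper `group_rows` is hypothetical and not gsppy's adapter, and it assumes that rows are already in the intended item order within each transaction.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def group_rows(rows: Iterable[Tuple[Hashable, str]]) -> List[List[str]]:
    """Collect (transaction_id, item) rows into one item list per transaction."""
    grouped: Dict[Hashable, List[str]] = {}
    for transaction_id, item in rows:
        grouped.setdefault(transaction_id, []).append(item)
    # dicts preserve insertion order, so transactions come out in first-seen order
    return list(grouped.values())


transaction_ids = [1, 1, 2, 2, 2, 3, 3]
items = ["Bread", "Milk", "Bread", "Diaper", "Beer", "Milk", "Coke"]
sequences = group_rows(zip(transaction_ids, items))
print(sequences)
# [['Bread', 'Milk'], ['Bread', 'Diaper', 'Beer'], ['Milk', 'Coke']]
```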
|
|
822
|
+
|
|
823
|
+
#### 2. Sequence Format (List Column)
|
|
824
|
+
|
|
825
|
+
Use when each row contains a complete transaction as a list:
|
|
826
|
+
|
|
827
|
+
```python
|
|
828
|
+
import pandas as pd
|
|
829
|
+
from gsppy import GSP
|
|
830
|
+
|
|
831
|
+
# Pandas DataFrame with sequences as lists
|
|
832
|
+
df = pd.DataFrame({
|
|
833
|
+
"transaction": [
|
|
834
|
+
["Bread", "Milk"],
|
|
835
|
+
["Bread", "Diaper", "Beer"],
|
|
836
|
+
["Milk", "Coke"],
|
|
837
|
+
]
|
|
838
|
+
})
|
|
839
|
+
|
|
840
|
+
gsp = GSP(df, sequence_col="transaction")
|
|
841
|
+
patterns = gsp.search(min_support=0.3)
|
|
842
|
+
```
|
|
843
|
+
|
|
844
|
+
### DataFrame with Timestamps
|
|
845
|
+
|
|
846
|
+
DataFrames support temporal constraints for time-aware pattern mining:
|
|
847
|
+
|
|
848
|
+
```python
|
|
849
|
+
import polars as pl
|
|
850
|
+
from gsppy import GSP
|
|
851
|
+
|
|
852
|
+
# Grouped format with timestamps
|
|
853
|
+
df = pl.DataFrame({
|
|
854
|
+
"transaction_id": [1, 1, 1, 2, 2, 2],
|
|
855
|
+
"item": ["Login", "Browse", "Purchase", "Login", "Browse", "Purchase"],
|
|
856
|
+
"timestamp": [0, 2, 5, 0, 1, 15], # Time in seconds
|
|
857
|
+
})
|
|
858
|
+
|
|
859
|
+
# Find patterns where consecutive events occur within 10 seconds
|
|
860
|
+
gsp = GSP(
|
|
861
|
+
df,
|
|
862
|
+
transaction_col="transaction_id",
|
|
863
|
+
item_col="item",
|
|
864
|
+
timestamp_col="timestamp",
|
|
865
|
+
maxgap=10
|
|
866
|
+
)
|
|
867
|
+
patterns = gsp.search(min_support=0.5)
|
|
868
|
+
```
|
|
869
|
+
|
|
870
|
+
For sequence format with timestamps:
|
|
871
|
+
|
|
872
|
+
```python
|
|
873
|
+
import pandas as pd
|
|
874
|
+
from gsppy import GSP
|
|
875
|
+
|
|
876
|
+
df = pd.DataFrame({
|
|
877
|
+
"sequence": [["A", "B", "C"], ["A", "D"]],
|
|
878
|
+
"timestamps": [[1, 2, 3], [1, 5]], # Timestamps per item
|
|
879
|
+
})
|
|
880
|
+
|
|
881
|
+
gsp = GSP(df, sequence_col="sequence", timestamp_col="timestamps", maxgap=3)
|
|
882
|
+
patterns = gsp.search(min_support=0.5)
|
|
883
|
+
```
|
|
884
|
+
|
|
885
|
+
### Working with Parquet and Arrow Files
|
|
886
|
+
|
|
887
|
+
DataFrames enable seamless integration with columnar storage formats:
|
|
888
|
+
|
|
889
|
+
```python
|
|
890
|
+
import polars as pl
|
|
891
|
+
from gsppy import GSP
|
|
892
|
+
|
|
893
|
+
# Read directly from Parquet
|
|
894
|
+
df = pl.read_parquet("transactions.parquet")
|
|
895
|
+
|
|
896
|
+
# Run GSP on the DataFrame, naming the transaction ID and item columns
|
|
897
|
+
gsp = GSP(df, transaction_col="txn_id", item_col="product")
|
|
898
|
+
patterns = gsp.search(min_support=0.2)
|
|
899
|
+
|
|
900
|
+
# Or use Pandas with Arrow backend
|
|
901
|
+
import pandas as pd
|
|
902
|
+
df_pandas = pd.read_parquet("transactions.parquet", engine="pyarrow")
|
|
903
|
+
gsp = GSP(df_pandas, transaction_col="txn_id", item_col="product")
|
|
904
|
+
patterns = gsp.search(min_support=0.2)
|
|
905
|
+
```
|
|
906
|
+
|
|
907
|
+
### Performance Considerations
|
|
908
|
+
|
|
909
|
+
DataFrames offer performance benefits for large datasets:
|
|
910
|
+
|
|
911
|
+
- **Polars**: Leverages Arrow for zero-copy operations and parallel processing
|
|
912
|
+
- **Pandas**: Compatible with Arrow backend for efficient memory usage
|
|
913
|
+
- **Parquet/Arrow**: Columnar storage enables efficient filtering and reading
|
|
914
|
+
- **Schema validation**: Errors are caught early with clear messages
|
|
915
|
+
|
|
916
|
+
### DataFrame Schema Requirements
|
|
917
|
+
|
|
918
|
+
**Grouped Format:**
|
|
919
|
+
- `transaction_col`: Column containing transaction/sequence IDs (any type)
|
|
920
|
+
- `item_col`: Column containing items (any type, converted to strings)
|
|
921
|
+
- `timestamp_col` (optional): Column containing timestamps (numeric)
|
|
922
|
+
|
|
923
|
+
**Sequence Format:**
|
|
924
|
+
- `sequence_col`: Column containing lists of items
|
|
925
|
+
- `timestamp_col` (optional): Column containing lists of timestamps (must match sequence lengths)
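
The length requirement for the sequence format can be checked before mining. The following is a minimal sketch with a hypothetical helper, `check_timestamp_lengths`, that works on plain Python lists; it is not the validation gsppy performs internally.

```python
from typing import List, Sequence


def check_timestamp_lengths(
    sequences: Sequence[Sequence[str]],
    timestamps: Sequence[Sequence[float]],
) -> List[int]:
    """Return indices of rows whose timestamp list length differs from the item list."""
    if len(sequences) != len(timestamps):
        raise ValueError("sequence and timestamp columns have different row counts")
    return [
        row
        for row, (items, times) in enumerate(zip(sequences, timestamps))
        if len(items) != len(times)
    ]


print(check_timestamp_lengths([["A", "B", "C"], ["A", "D"]], [[1, 2, 3], [1, 5]]))  # []
print(check_timestamp_lengths([["A", "B"], ["C"]], [[1, 2], [3, 4]]))               # [1]
```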
|
|
926
|
+
|
|
927
|
+
### Error Handling
|
|
928
|
+
|
|
929
|
+
GSP-Py provides clear error messages for schema issues:
|
|
930
|
+
|
|
931
|
+
```python
|
|
932
|
+
import polars as pl
|
|
933
|
+
from gsppy import GSP
|
|
934
|
+
|
|
935
|
+
df = pl.DataFrame({
|
|
936
|
+
"txn_id": [1, 2],
|
|
937
|
+
"product": ["A", "B"],
|
|
938
|
+
})
|
|
939
|
+
|
|
940
|
+
# ❌ Missing required column
|
|
941
|
+
try:
|
|
942
|
+
    gsp = GSP(df, transaction_col="txn_id", item_col="item") # 'item' doesn't exist
|
|
943
|
+
except ValueError as e:
|
|
944
|
+
    print(f"Error: {e}") # "Column 'item' not found in DataFrame"
|
|
945
|
+
|
|
946
|
+
# ❌ Invalid format specification
|
|
947
|
+
try:
|
|
948
|
+
    gsp = GSP(df) # Must specify either sequence_col or both transaction_col and item_col
|
|
949
|
+
except ValueError as e:
|
|
950
|
+
    print(f"Error: {e}") # "Must specify either 'sequence_col' or both 'transaction_col' and 'item_col'"
|
|
951
|
+
```
|
|
952
|
+
|
|
953
|
+
### Backward Compatibility
|
|
954
|
+
|
|
955
|
+
Traditional list-based input continues to work:
|
|
956
|
+
|
|
957
|
+
```python
|
|
958
|
+
from gsppy import GSP
|
|
959
|
+
|
|
960
|
+
# Lists still work as before
|
|
961
|
+
transactions = [["A", "B"], ["A", "C"], ["B", "C"]]
|
|
962
|
+
gsp = GSP(transactions)
|
|
963
|
+
patterns = gsp.search(min_support=0.5)
|
|
964
|
+
```
|
|
965
|
+
|
|
966
|
+
DataFrame parameters cannot be mixed with list input:
|
|
967
|
+
|
|
968
|
+
```python
|
|
969
|
+
transactions = [["A", "B"], ["C", "D"]]
|
|
970
|
+
|
|
971
|
+
# ❌ This raises an error
|
|
972
|
+
gsp = GSP(transactions, transaction_col="txn") # ValueError: DataFrame parameters cannot be used with list input
|
|
973
|
+
```
|
|
974
|
+
|
|
975
|
+
### Examples and Tests
|
|
976
|
+
|
|
977
|
+
For complete examples and edge cases, see:
|
|
978
|
+
- [`tests/test_dataframe.py`](tests/test_dataframe.py) - Comprehensive test suite
|
|
979
|
+
- DataFrame adapter documentation in [`gsppy/dataframe_adapters.py`](gsppy/dataframe_adapters.py)
|
|
980
|
+
|
|
981
|
+
---
|
|
982
|
+
|
|
587
983
|
## ⏱️ Temporal Constraints
|
|
588
984
|
|
|
589
985
|
GSP-Py supports **time-constrained sequential pattern mining** with three powerful temporal constraints: `mingap`, `maxgap`, and `maxspan`. These constraints enable domain-specific applications such as medical event mining, retail analytics, and temporal user journey discovery.
|
|
@@ -591,7 +987,7 @@ GSP-Py supports **time-constrained sequential pattern mining** with three powerf
|
|
|
591
987
|
### Temporal Constraint Parameters
|
|
592
988
|
|
|
593
989
|
- **`mingap`**: Minimum time gap required between consecutive items in a pattern
|
|
594
|
-
- **`maxgap`**: Maximum time gap allowed between consecutive items in a pattern
|
|
990
|
+
- **`maxgap`**: Maximum time gap allowed between consecutive items in a pattern
|
|
595
991
|
- **`maxspan`**: Maximum time span from the first to the last item in a pattern
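
The three definitions above can be expressed as a small check over the timestamps of one candidate occurrence of a pattern. This is a minimal sketch with a hypothetical function, `satisfies_constraints`. It assumes inclusive bounds, which may differ from how gsppy treats values exactly equal to a limit.

```python
from typing import Optional, Sequence


def satisfies_constraints(
    timestamps: Sequence[float],
    mingap: Optional[float] = None,
    maxgap: Optional[float] = None,
    maxspan: Optional[float] = None,
) -> bool:
    """Check the timestamps of one candidate occurrence against the constraints."""
    # Gaps between consecutive matched items, in order of occurrence
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if mingap is not None and any(gap < mingap for gap in gaps):
        return False
    if maxgap is not None and any(gap > maxgap for gap in gaps):
        return False
    if maxspan is not None and timestamps and timestamps[-1] - timestamps[0] > maxspan:
        return False
    return True


print(satisfies_constraints([0, 2, 5], maxgap=10))   # True
print(satisfies_constraints([0, 1, 15], maxgap=10))  # False: gap of 14
print(satisfies_constraints([0, 2, 5], maxspan=4))   # False: span of 5
print(satisfies_constraints([0, 2, 5], mingap=3))    # False: gap of 2
```

Note that `mingap` and `maxgap` apply to each consecutive pair, while `maxspan` applies once to the whole occurrence, so a pattern can pass every gap check and still exceed the span.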
|
|
596
992
|
|
|
597
993
|
### Using Temporal Constraints
|