gsppy 3.6.0__tar.gz → 4.1.0__tar.gz

Files changed (35)
  1. {gsppy-3.6.0 → gsppy-4.1.0}/CHANGELOG.md +186 -0
  2. {gsppy-3.6.0 → gsppy-4.1.0}/PKG-INFO +405 -9
  3. {gsppy-3.6.0 → gsppy-4.1.0}/README.md +395 -4
  4. gsppy-4.1.0/gsppy/__init__.py +88 -0
  5. {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/cli.py +316 -13
  6. gsppy-4.1.0/gsppy/dataframe_adapters.py +458 -0
  7. gsppy-4.1.0/gsppy/enums.py +49 -0
  8. {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/gsp.py +220 -15
  9. gsppy-4.1.0/gsppy/sequence.py +371 -0
  10. gsppy-4.1.0/gsppy/token_mapper.py +99 -0
  11. {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/utils.py +120 -0
  12. {gsppy-3.6.0 → gsppy-4.1.0}/pyproject.toml +18 -7
  13. {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_cli.py +70 -3
  14. gsppy-4.1.0/tests/test_dataframe.py +341 -0
  15. gsppy-4.1.0/tests/test_gsp_sequence_integration.py +345 -0
  16. gsppy-4.1.0/tests/test_sequence.py +466 -0
  17. gsppy-4.1.0/tests/test_spm_format.py +303 -0
  18. {gsppy-3.6.0 → gsppy-4.1.0}/tox.ini +1 -1
  19. gsppy-3.6.0/gsppy/__init__.py +0 -43
  20. {gsppy-3.6.0 → gsppy-4.1.0}/.gitignore +0 -0
  21. {gsppy-3.6.0 → gsppy-4.1.0}/CONTRIBUTING.md +0 -0
  22. {gsppy-3.6.0 → gsppy-4.1.0}/LICENSE +0 -0
  23. {gsppy-3.6.0 → gsppy-4.1.0}/SECURITY.md +0 -0
  24. {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/accelerate.py +0 -0
  25. {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/pruning.py +0 -0
  26. {gsppy-3.6.0 → gsppy-4.1.0}/gsppy/py.typed +0 -0
  27. {gsppy-3.6.0 → gsppy-4.1.0}/rust/Cargo.lock +0 -0
  28. {gsppy-3.6.0 → gsppy-4.1.0}/rust/Cargo.toml +0 -0
  29. {gsppy-3.6.0 → gsppy-4.1.0}/rust/src/lib.rs +0 -0
  30. {gsppy-3.6.0 → gsppy-4.1.0}/tests/__init__.py +0 -0
  31. {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_gsp.py +0 -0
  32. {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_gsp_fuzzing.py +0 -0
  33. {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_pruning.py +0 -0
  34. {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_temporal_constraints.py +0 -0
  35. {gsppy-3.6.0 → gsppy-4.1.0}/tests/test_utils.py +0 -0
@@ -1,6 +1,192 @@
1
1
  # CHANGELOG
2
2
 
3
3
 
4
+ ## v4.1.0 (2026-02-01)
5
+
6
+ ### Bug Fixes
7
+
8
+ - Address code review feedback - add type annotations and remove unused variables
9
+ ([`bf62d14`](https://github.com/jacksonpradolima/gsp-py/commit/bf62d144d8f1be1e7716291d41af955450612c81))
10
+
11
+ Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
12
+
13
+ ### Chores
14
+
15
+ - Update uv.lock for version 4.0.0
16
+ ([`f1ae2af`](https://github.com/jacksonpradolima/gsp-py/commit/f1ae2af2aa71ea44b9d8625ed647da79259ec096))
17
+
18
+ ### Documentation
19
+
20
+ - Add Sequence documentation and examples to README
21
+ ([`62d0d02`](https://github.com/jacksonpradolima/gsp-py/commit/62d0d02c19c5751331df53e680cc0b9aee19677b))
22
+
23
+ Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
24
+
25
+ - Update docs/ with Sequence abstraction documentation
26
+ ([`2368cf3`](https://github.com/jacksonpradolima/gsp-py/commit/2368cf30239139e8e2af5457ee6acf14db30ef06))
27
+
28
+ Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
29
+
30
+ ### Features
31
+
32
+ - Add Sequence abstraction class with comprehensive tests
33
+ ([`6011bdb`](https://github.com/jacksonpradolima/gsp-py/commit/6011bdb7104755d109b58261b36e1dd1c36b2d61))
34
+
35
+ Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
36
+
37
+ - Integrate Sequence objects with GSP.search() via return_sequences parameter
38
+ ([`7476588`](https://github.com/jacksonpradolima/gsp-py/commit/7476588f2b277276748e0550366014f2a93d8ef5))
39
+
40
+ Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
41
+
42
+ - Introduce Sequence abstraction for typed pattern representation
43
+ ([`01ca37b`](https://github.com/jacksonpradolima/gsp-py/commit/01ca37b9bc4572eb7b1c1eaf6fdf26ca2324a3c5))
44
+
45
+ ### Refactoring
46
+
47
+ - Address code review feedback - remove redundant checks
48
+ ([`621e940`](https://github.com/jacksonpradolima/gsp-py/commit/621e9403379ae0fd07bf45b97616b9979f2d4aa6))
49
+
50
+ Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
51
+
52
+ - Reduce cognitive complexity in sequence_example.py and fix f-string
53
+ ([`63ac4f9`](https://github.com/jacksonpradolima/gsp-py/commit/63ac4f9ceb869a5228cdccdcf6a9d0b9f46f0350))
54
+
55
+ Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
56
+
57
+ - Update type annotations and improve search method in GSP class
58
+ ([`e2e9a3f`](https://github.com/jacksonpradolima/gsp-py/commit/e2e9a3f473d1e0c5d6990c8b7c5837a251761032))
59
+
60
+
61
+ ## v4.0.0 (2026-02-01)
62
+
63
+ ### Chores
64
+
65
+ - Add additional VSCode extensions for improved development experience
66
+ ([`107dfa4`](https://github.com/jacksonpradolima/gsp-py/commit/107dfa422005f4cdec4655a9751fd0d6e597773f))
67
+
68
+ - Update uv.lock for version 3.6.1
69
+ ([`d8d7394`](https://github.com/jacksonpradolima/gsp-py/commit/d8d73947d570844c02e9d974b626da26f07cf1e6))
70
+
71
+ ### Features
72
+
73
+ - Add SPM/GSP delimiter format loader and token mapping utilities
74
+ ([`4ac1d34`](https://github.com/jacksonpradolima/gsp-py/commit/4ac1d34d166f21d30968872cf16c1bde3ff1f2aa))
75
+
76
+ ### Refactoring
77
+
78
+ - Add type casting for return values in read_transactions_from_spm
79
+ ([`2099bfd`](https://github.com/jacksonpradolima/gsp-py/commit/2099bfd5253a1dc058dd46bd0da077810958fa76))
80
+
81
+ - Update read_transactions_from_spm to return mappings and adjust tests
82
+ ([`373b8ff`](https://github.com/jacksonpradolima/gsp-py/commit/373b8ff0d7f131140bcdbd039fae0d02572e86b7))
83
+
84
+
85
+ ## v3.6.1 (2026-01-31)
86
+
87
+ ### Bug Fixes
88
+
89
+ - Typing for polars and pandas
90
+ ([`0773992`](https://github.com/jacksonpradolima/gsp-py/commit/07739921d074e55c8436a88a73e510b1d8761510))
91
+
92
+ ### Build System
93
+
94
+ - **deps**: Bump actions/checkout in /.github/workflows
95
+ ([`7af193d`](https://github.com/jacksonpradolima/gsp-py/commit/7af193d515972eeca5d8e354e91a60e488357cfb))
96
+
97
+ Bumps [actions/checkout](https://github.com/actions/checkout) from 4.3.1 to 6.0.2. - [Release
98
+ notes](https://github.com/actions/checkout/releases) -
99
+ [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) -
100
+ [Commits](https://github.com/actions/checkout/compare/v4.3.1...de0fac2e4500dabe0009e67214ff5f5447ce83dd)
101
+
102
+ --- updated-dependencies: - dependency-name: actions/checkout dependency-version: 6.0.2
103
+
104
+ dependency-type: direct:production
105
+
106
+ update-type: version-update:semver-major
107
+
108
+ ...
109
+
110
+ Signed-off-by: dependabot[bot] <support@github.com>
111
+
112
+ - **deps**: Bump actions/github-script in /.github/workflows
113
+ ([`03a7588`](https://github.com/jacksonpradolima/gsp-py/commit/03a7588301421369731d3d543f81b93c25c292ef))
114
+
115
+ Bumps [actions/github-script](https://github.com/actions/github-script) from 7.0.1 to 8.0.0. -
116
+ [Release notes](https://github.com/actions/github-script/releases) -
117
+ [Commits](https://github.com/actions/github-script/compare/60a0d83039c74a4aee543508d2ffcb1c3799cdea...ed597411d8f924073f98dfc5c65a23a2325f34cd)
118
+
119
+ --- updated-dependencies: - dependency-name: actions/github-script dependency-version: 8.0.0
120
+
121
+ dependency-type: direct:production
122
+
123
+ update-type: version-update:semver-major
124
+
125
+ ...
126
+
127
+ Signed-off-by: dependabot[bot] <support@github.com>
128
+
129
+ - **deps**: Bump actions/setup-python in /.github/workflows
130
+ ([`75771bf`](https://github.com/jacksonpradolima/gsp-py/commit/75771bff660b3842f2c8d84bdaeb013941e5abe0))
131
+
132
+ Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5.6.0 to 6.2.0. -
133
+ [Release notes](https://github.com/actions/setup-python/releases) -
134
+ [Commits](https://github.com/actions/setup-python/compare/v5.6.0...a309ff8b426b58ec0e2a45f0f869d46889d02405)
135
+
136
+ --- updated-dependencies: - dependency-name: actions/setup-python dependency-version: 6.2.0
137
+
138
+ dependency-type: direct:production
139
+
140
+ update-type: version-update:semver-major
141
+
142
+ ...
143
+
144
+ Signed-off-by: dependabot[bot] <support@github.com>
145
+
146
+ - **deps**: Bump actions/stale in /.github/workflows
147
+ ([`e699ccd`](https://github.com/jacksonpradolima/gsp-py/commit/e699ccdac689734b4694665d924ace8bba479253))
148
+
149
+ Bumps [actions/stale](https://github.com/actions/stale) from 9.0.0 to 10.1.1. - [Release
150
+ notes](https://github.com/actions/stale/releases) -
151
+ [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md) -
152
+ [Commits](https://github.com/actions/stale/compare/28ca1036281a5e5922ead5184a1bbf96e5fc984e...997185467fa4f803885201cee163a9f38240193d)
153
+
154
+ --- updated-dependencies: - dependency-name: actions/stale dependency-version: 10.1.1
155
+
156
+ dependency-type: direct:production
157
+
158
+ update-type: version-update:semver-major
159
+
160
+ ...
161
+
162
+ Signed-off-by: dependabot[bot] <support@github.com>
163
+
164
+ - **deps**: Bump actions/upload-artifact in /.github/workflows
165
+ ([`17efaff`](https://github.com/jacksonpradolima/gsp-py/commit/17efaffc755c017e066c0286464899ead6e2cae4))
166
+
167
+ Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.6.2 to 6.0.0. -
168
+ [Release notes](https://github.com/actions/upload-artifact/releases) -
169
+ [Commits](https://github.com/actions/upload-artifact/compare/v4.6.2...b7c566a772e6b6bfb58ed0dc250532a479d7789f)
170
+
171
+ --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-version: 6.0.0
172
+
173
+ dependency-type: direct:production
174
+
175
+ update-type: version-update:semver-major
176
+
177
+ ...
178
+
179
+ Signed-off-by: dependabot[bot] <support@github.com>
180
+
181
+ ### Chores
182
+
183
+ - Update uv.lock for version 3.6.0
184
+ ([`4c2a5e5`](https://github.com/jacksonpradolima/gsp-py/commit/4c2a5e5967482443c2db645c9ba4744bd2110dd1))
185
+
186
+ - **deps**: Bump ty and ruff
187
+ ([`07a20df`](https://github.com/jacksonpradolima/gsp-py/commit/07a20df9fb4ff3a3b022d28d152b586ca45383c8))
188
+
189
+
4
190
  ## v3.6.0 (2026-01-26)
5
191
 
6
192
  ### Chores
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: gsppy
3
- Version: 3.6.0
3
+ Version: 4.1.0
4
4
  Summary: GSP (Generalized Sequence Pattern) algorithm in Python
5
5
  Project-URL: Homepage, https://github.com/jacksonpradolima/gsp-py
6
6
  Author-email: Jackson Antonio do Prado Lima <jacksonpradolima@gmail.com>
@@ -32,15 +32,20 @@ Classifier: Intended Audience :: Science/Research
32
32
  Classifier: License :: OSI Approved :: MIT License
33
33
  Classifier: Natural Language :: English
34
34
  Classifier: Operating System :: OS Independent
35
- Classifier: Programming Language :: Python :: 3.10
36
35
  Classifier: Programming Language :: Python :: 3.11
37
36
  Classifier: Programming Language :: Python :: 3.12
38
37
  Classifier: Programming Language :: Python :: 3.13
38
+ Classifier: Programming Language :: Python :: 3.14
39
39
  Classifier: Topic :: Scientific/Engineering :: Information Analysis
40
40
  Classifier: Topic :: Software Development :: Libraries :: Python Modules
41
- Requires-Python: >=3.10
41
+ Requires-Python: >=3.11
42
42
  Requires-Dist: click>=8.0.0
43
43
  Requires-Dist: typing-extensions>=4.0.0
44
+ Provides-Extra: dataframe
45
+ Requires-Dist: pandas-stubs>=2.3.3.260113; extra == 'dataframe'
46
+ Requires-Dist: pandas>=3.0.0; extra == 'dataframe'
47
+ Requires-Dist: polars>=1.37.1; extra == 'dataframe'
48
+ Requires-Dist: pyarrow>=10.0.0; extra == 'dataframe'
44
49
  Provides-Extra: dev
45
50
  Requires-Dist: cython==3.2.4; extra == 'dev'
46
51
  Requires-Dist: hatch==1.16.3; extra == 'dev'
@@ -51,9 +56,9 @@ Requires-Dist: pyright==1.1.408; extra == 'dev'
51
56
  Requires-Dist: pytest-benchmark==5.2.3; extra == 'dev'
52
57
  Requires-Dist: pytest-cov==7.0.0; extra == 'dev'
53
58
  Requires-Dist: pytest==9.0.2; extra == 'dev'
54
- Requires-Dist: ruff==0.14.13; extra == 'dev'
59
+ Requires-Dist: ruff==0.14.14; extra == 'dev'
55
60
  Requires-Dist: tox==4.34.1; extra == 'dev'
56
- Requires-Dist: ty==0.0.12; extra == 'dev'
61
+ Requires-Dist: ty==0.0.14; extra == 'dev'
57
62
  Provides-Extra: docs
58
63
  Requires-Dist: mkdocs-gen-files<1,>=0.5; extra == 'docs'
59
64
  Requires-Dist: mkdocs-literate-nav<1,>=0.6; extra == 'docs'
@@ -72,7 +77,7 @@ Description-Content-Type: text/markdown
72
77
 
73
78
  [![PyPI Downloads](https://img.shields.io/pypi/dm/gsppy.svg?style=flat-square)](https://pypi.org/project/gsppy/)
74
79
  [![PyPI version](https://badge.fury.io/py/gsppy.svg)](https://pypi.org/project/gsppy)
75
- ![](https://img.shields.io/badge/python-3.10+-blue.svg)
80
+ ![](https://img.shields.io/badge/python-3.11+-blue.svg)
76
81
 
77
82
  [![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/jacksonpradolima/gsp-py/badge)](https://securityscorecards.dev/viewer/?uri=github.com/jacksonpradolima/gsp-py)
78
83
  [![SLSA provenance](https://github.com/jacksonpradolima/gsp-py/actions/workflows/slsa-provenance.yml/badge.svg)](https://github.com/jacksonpradolima/gsp-py/actions/workflows/slsa-provenance.yml)
@@ -90,7 +95,7 @@ Description-Content-Type: text/markdown
90
95
  Sequence Pattern (GSP)** algorithm. Ideal for market basket analysis, temporal mining, and user journey discovery.
91
96
 
92
97
  > [!IMPORTANT]
93
- > GSP-Py is compatible with Python 3.10 and later versions!
98
+ > GSP-Py is compatible with Python 3.11 and later versions!
94
99
 
95
100
  ---
96
101
 
@@ -106,6 +111,7 @@ Sequence Pattern (GSP)** algorithm. Ideal for market basket analysis, temporal m
106
111
  6. [💡 Usage](#usage)
107
112
  - [✅ Example: Analyzing Sales Data](#example-analyzing-sales-data)
108
113
  - [📊 Explanation: Support and Results](#explanation-support-and-results)
114
+ - [📊 DataFrame Input Support](#dataframe-input-support)
109
115
  - [⏱️ Temporal Constraints](#temporal-constraints)
110
116
  7. [⌨️ Typing](#typing)
111
117
  8. [🌟 Planned Features](#planned-features)
@@ -357,6 +363,34 @@ Your input file should be either:
357
363
  Bread,Milk,Diaper,Coke
358
364
  ```
359
365
 
366
+ - **SPM/GSP Format**: Uses delimiters to separate elements and sequences. This format is commonly used in sequential pattern mining datasets.
367
+ - `-1`: Marks the end of an element (itemset)
368
+ - `-2`: Marks the end of a sequence (transaction)
369
+
370
+ Example:
371
+ ```text
372
+ 1 2 -1 3 -1 -2
373
+ 4 -1 5 6 -1 -2
374
+ 1 -1 2 3 -1 -2
375
+ ```
376
+
377
+ The above represents:
378
+ - Transaction 1: `[[1, 2], [3]]` → flattened to `[1, 2, 3]`
379
+ - Transaction 2: `[[4], [5, 6]]` → flattened to `[4, 5, 6]`
380
+ - Transaction 3: `[[1], [2, 3]]` → flattened to `[1, 2, 3]`
381
+
382
+ String tokens are also supported:
383
+ ```text
384
+ A B -1 C -1 -2
385
+ D -1 E F -1 -2
386
+ ```
387
+
388
+ - **Parquet/Arrow Files**: Modern columnar data formats (requires 'gsppy[dataframe]')
389
+ ```bash
390
+ pip install 'gsppy[dataframe]'
391
+ ```
392
+ This installs optional dependencies: `polars`, `pandas`, and `pyarrow` for DataFrame support.
393
+
360
394
  ### Running the CLI
361
395
 
362
396
  Use the following command to run GSPPy on your data:
@@ -371,9 +405,16 @@ Or for CSV files:
371
405
  gsppy --file path/to/transactions.csv --min_support 0.3 --backend rust
372
406
  ```
373
407
 
408
+ For SPM/GSP format files, use the `--format spm` option:
409
+
410
+ ```bash
411
+ gsppy --file path/to/data.txt --format spm --min_support 0.3
412
+ ```
413
+
374
414
  #### CLI Options
375
415
 
376
- - `--file`: Path to your input file (JSON or CSV). **Required**.
416
+ - `--file`: Path to your input file (JSON, CSV, or SPM format). **Required**.
417
+ - `--format`: File format to use. Options: `auto` (default, auto-detect from extension), `json`, `csv`, `spm`, `parquet`, `arrow`.
377
418
  - `--min_support`: Minimum support threshold as a fraction (e.g., `0.3` for 30%). Default is `0.2`.
378
419
  - `--backend`: Backend to use for support counting. One of `auto` (default), `python`, `rust`, or `gpu`.
379
420
  - `--verbose`: Enable detailed logging with timestamps, log levels, and process IDs for debugging and traceability.
@@ -518,6 +559,159 @@ Verbose mode provides:
518
559
 
519
560
  For complete documentation on logging, see [docs/logging.md](docs/logging.md).
520
561
 
562
+ ### Using Sequence Objects for Rich Pattern Representation
563
+
564
+ GSP-Py 4.1+ introduces a **Sequence abstraction class** that provides a richer, more maintainable way to work with sequential patterns. The Sequence class encapsulates pattern items, support counts, and optional metadata in an immutable, hashable object.
565
+
566
+ #### Traditional Dict-based Output (Default)
567
+
568
+ ```python
569
+ from gsppy import GSP
570
+
571
+ transactions = [
572
+ ['Bread', 'Milk'],
573
+ ['Bread', 'Diaper', 'Beer', 'Eggs'],
574
+ ['Milk', 'Diaper', 'Beer', 'Coke']
575
+ ]
576
+
577
+ gsp = GSP(transactions)
578
+ result = gsp.search(min_support=0.3)
579
+
580
+ # Returns one dict per pattern length, e.g. [{('Bread',): 2, ('Milk',): 2, ...}, ...]
581
+ for level_patterns in result:
582
+     for pattern, support in level_patterns.items():
583
+         print(f"Pattern: {pattern}, Support: {support}")
584
+ ```
585
+
586
+ #### Sequence Objects (New Feature)
587
+
588
+ ```python
589
+ from gsppy import GSP
590
+
591
+ transactions = [
592
+ ['Bread', 'Milk'],
593
+ ['Bread', 'Diaper', 'Beer', 'Eggs'],
594
+ ['Milk', 'Diaper', 'Beer', 'Coke']
595
+ ]
596
+
597
+ gsp = GSP(transactions)
598
+ result = gsp.search(min_support=0.3, return_sequences=True)
599
+
600
+ # Returns one list of Sequence objects per pattern length, e.g. [[Sequence(('Bread',), support=2), ...], ...]
601
+ for level_patterns in result:
602
+     for seq in level_patterns:
603
+         print(f"Pattern: {seq.items}, Support: {seq.support}, Length: {seq.length}")
604
+         # Access sequence properties
605
+         print(f" First item: {seq.first_item}, Last item: {seq.last_item}")
606
+         # Check if item is in sequence
607
+         if "Milk" in seq:
608
+             print(" Contains Milk!")
609
+ ```
610
+
611
+ #### Key Benefits of Sequence Objects
612
+
613
+ 1. **Rich API**: Access pattern properties like `length`, `first_item`, `last_item`
614
+ 2. **Type Safety**: IDE autocomplete and better type hints
615
+ 3. **Immutable & Hashable**: Can be used as dictionary keys
616
+ 4. **Extensible**: Add metadata for confidence, lift, or custom properties
617
+ 5. **Backward Compatible**: Convert to/from dict format as needed
618
+
619
+ ```python
620
+ from gsppy import Sequence, sequences_to_dict, dict_to_sequences
621
+
622
+ # Create custom sequences
623
+ seq = Sequence.from_tuple(("A", "B", "C"), support=5)
624
+
625
+ # Extend sequences
626
+ extended = seq.extend("D") # Creates Sequence(("A", "B", "C", "D"))
627
+
628
+ # Add metadata
629
+ seq_with_meta = seq.with_metadata(confidence=0.85, lift=1.5)
630
+
631
+ # Convert between formats for compatibility
632
+ seq_result = gsp.search(min_support=0.3, return_sequences=True)
633
+ dict_format = sequences_to_dict(seq_result[0]) # Convert to dict
634
+ ```
635
+
636
+ For a complete example, see [examples/sequence_example.py](examples/sequence_example.py).
637
+
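Illustrative sketch, not part of the package diff: the README text above describes Sequence objects as immutable, hashable, and extensible with metadata. One way such an object could be built is a frozen dataclass. The class name `PatternSketch` and its behavior (for example, resetting `support` on `extend`) are assumptions made for illustration and are not taken from `gsppy/sequence.py`.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class PatternSketch:
    """Hypothetical stand-in for an immutable, hashable pattern object."""

    items: tuple[str, ...]
    support: int = 0
    # Metadata is excluded from equality and hashing so the object stays hashable.
    metadata: dict[str, Any] = field(default_factory=dict, compare=False)

    @property
    def length(self) -> int:
        return len(self.items)

    def __contains__(self, item: str) -> bool:
        return item in self.items

    def extend(self, item: str) -> "PatternSketch":
        # Returns a new object. The support of the longer pattern is unknown,
        # so this sketch resets it to 0 (an assumption, not library behavior).
        return replace(self, items=self.items + (item,), support=0)

    def with_metadata(self, **extra: Any) -> "PatternSketch":
        # Returns a copy whose metadata merges the old and new entries.
        return replace(self, metadata={**self.metadata, **extra})


seq = PatternSketch(("A", "B", "C"), support=5)
longer = seq.extend("D")
tagged = seq.with_metadata(confidence=0.85)
lookup = {seq: "usable as a dict key"}
```

Because `frozen=True` generates `__hash__` from the compared fields only, two patterns with the same items and support hash alike even when their metadata differs, which keeps dictionary lookups stable after metadata is attached.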
638
+ ### Loading SPM/GSP Format Files
639
+
640
+ GSP-Py supports loading datasets in the classical SPM/GSP delimiter format, which is widely used in sequential pattern mining research. This format uses:
641
+ - `-1` to mark the end of an element (itemset)
642
+ - `-2` to mark the end of a sequence (transaction)
643
+
644
+ #### Using the SPM Loader
645
+
646
+ ```python
647
+ from gsppy.utils import read_transactions_from_spm
648
+ from gsppy import GSP
649
+
650
+ # Load SPM format file
651
+ transactions = read_transactions_from_spm('data.txt')
652
+
653
+ # Run GSP algorithm
654
+ gsp = GSP(transactions)
655
+ result = gsp.search(min_support=0.3)
656
+ ```
657
+
658
+ #### SPM Format Examples
659
+
660
+ **Simple sequence file (`data.txt`):**
661
+ ```text
662
+ 1 2 -1 3 -1 -2
663
+ 4 -1 5 6 -1 -2
664
+ 1 -1 2 3 -1 -2
665
+ ```
666
+
667
+ This represents:
668
+ - Transaction 1: Items [1, 2] followed by item [3] → flattened to [1, 2, 3]
669
+ - Transaction 2: Item [4] followed by items [5, 6] → flattened to [4, 5, 6]
670
+ - Transaction 3: Item [1] followed by items [2, 3] → flattened to [1, 2, 3]
671
+
672
+ **String tokens are also supported:**
673
+ ```text
674
+ A B -1 C -1 -2
675
+ D -1 E F -1 -2
676
+ ```
677
+
678
+ #### Token Mapping
679
+
680
+ For workflows requiring conversion between string tokens and integer IDs, use the `TokenMapper`:
681
+
682
+ ```python
683
+ from gsppy.utils import read_transactions_from_spm
684
+ from gsppy import TokenMapper
685
+
686
+ # Load with mappings
687
+ transactions, str_to_int, int_to_str = read_transactions_from_spm(
688
+ 'data.txt',
689
+ return_mappings=True
690
+ )
691
+
692
+ print("String to Int:", str_to_int)
693
+ # Output: {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}
694
+
695
+ print("Int to String:", int_to_str)
696
+ # Output: {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6'}
697
+
698
+ # Use the TokenMapper class directly
699
+ mapper = TokenMapper()
700
+ id_a = mapper.add_token("A")
701
+ id_b = mapper.add_token("B")
702
+ print(f"A -> {id_a}, B -> {id_b}")
703
+ # Output: A -> 0, B -> 1
704
+ ```
705
+
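Illustrative sketch, not part of the package diff: the sample output above suggests that integer IDs are handed out in first-seen order, starting at 0. A minimal two-way mapper with that behavior might look like the following. `SketchTokenMapper` is a hypothetical name, and the real `TokenMapper` in `gsppy/token_mapper.py` may differ.

```python
class SketchTokenMapper:
    """Hypothetical two-way mapping between string tokens and integer IDs."""

    def __init__(self) -> None:
        self.str_to_int: dict[str, int] = {}
        self.int_to_str: dict[int, str] = {}

    def add_token(self, token: str) -> int:
        # Reuse the existing ID if the token was seen before.
        if token in self.str_to_int:
            return self.str_to_int[token]
        new_id = len(self.str_to_int)  # IDs follow first-seen order: 0, 1, 2, ...
        self.str_to_int[token] = new_id
        self.int_to_str[new_id] = token
        return new_id


mapper = SketchTokenMapper()
ids = [mapper.add_token(tok) for tok in ["1", "2", "3", "4", "5", "6", "1"]]
print(ids)
print(mapper.str_to_int)
```

Keeping both dictionaries in step makes decoding mined patterns back to their original tokens a constant-time lookup.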
706
+ #### Edge Cases Handled
707
+
708
+ The SPM loader gracefully handles:
709
+ - Empty lines (skipped)
710
+ - Missing `-2` delimiter at end of line
711
+ - Extra or consecutive delimiters
712
+ - Mixed-length elements in sequences
713
+ - Both integer and string tokens
714
+
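Illustrative sketch, not part of the package diff: the `-1` / `-2` delimiter rules and the edge cases listed above can be exercised with a small standalone parser. The function names `parse_spm_line` and `flatten` are hypothetical. This is not the implementation of `read_transactions_from_spm`, only a sketch of the format under the rules stated in the README text.

```python
def parse_spm_line(line: str) -> list[list[str]]:
    """Split one SPM-format line into elements (itemsets) of string tokens."""
    elements: list[list[str]] = []
    current: list[str] = []
    for token in line.split():
        if token == "-2":      # end of sequence: stop reading this line
            break
        if token == "-1":      # end of element: close the current itemset
            if current:        # ignore empty elements from repeated delimiters
                elements.append(current)
                current = []
        else:
            current.append(token)
    if current:                # tolerate a missing trailing -1 or -2
        elements.append(current)
    return elements


def flatten(elements: list[list[str]]) -> list[str]:
    """Concatenate all elements into one flat transaction."""
    return [item for element in elements for item in element]


lines = ["1 2 -1 3 -1 -2", "", "4 -1 5 6 -1 -2", "1 -1 2 3 -1 -2"]
parsed = [parse_spm_line(line) for line in lines if line.strip()]  # skip empty lines
flat = [flatten(elements) for elements in parsed]
print(flat)
```

Tokens are kept as strings here, so the same code path handles both `1 2 -1` and `A B -1` inputs.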
521
715
  ### Output
522
716
 
523
717
  The algorithm will return a list of patterns with their corresponding support.
@@ -584,6 +778,208 @@ result = gsp.search(min_support=0.5) # Need at least 2/4 sequences
584
778
 
585
779
  ---
586
780
 
781
+ ## 📊 DataFrame Input Support
782
+
783
+ GSP-Py supports **Polars and Pandas DataFrames** as input, enabling high-performance workflows with modern data formats like Arrow and Parquet. This feature is particularly useful for large-scale data engineering pipelines and integration with existing data processing workflows.
784
+
785
+ ### Installation
786
+
787
+ Install GSP-Py with DataFrame support:
788
+
789
+ ```bash
790
+ pip install 'gsppy[dataframe]'
791
+ ```
792
+
793
+ This installs the optional dependencies: `polars`, `pandas`, and `pyarrow`.
794
+
795
+ ### DataFrame Input Formats
796
+
797
+ GSP-Py supports two DataFrame formats:
798
+
799
+ #### 1. Grouped Format (Transaction ID + Item Columns)
800
+
801
+ Use when your data has separate rows for each item in a transaction:
802
+
803
+ ```python
804
+ import polars as pl
805
+ from gsppy import GSP
806
+
807
+ # Polars DataFrame with transaction_id and item columns
808
+ df = pl.DataFrame({
809
+ "transaction_id": [1, 1, 2, 2, 2, 3, 3],
810
+ "item": ["Bread", "Milk", "Bread", "Diaper", "Beer", "Milk", "Coke"],
811
+ })
812
+
813
+ # Run GSP directly on the DataFrame
814
+ gsp = GSP(df, transaction_col="transaction_id", item_col="item")
815
+ patterns = gsp.search(min_support=0.3)
816
+
817
+ for level, freq_patterns in enumerate(patterns, start=1):
818
+     print(f"\n{level}-Sequence Patterns:")
819
+     for pattern, support in freq_patterns.items():
820
+         print(f" {pattern}: {support}")
821
+ ```
822
+
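Illustrative sketch, not part of the package diff: the grouped format above has one row per item, so the rows have to be collected into one ordered list per transaction before mining. The helper `group_rows` below is hypothetical and works on plain Python lists. How `gsppy/dataframe_adapters.py` actually performs this step is not shown in this diff.

```python
def group_rows(transaction_ids, items):
    """Collect items per transaction ID, preserving row order within each ID."""
    grouped: dict = {}
    for txn_id, item in zip(transaction_ids, items):
        # Items are converted to strings, as the schema notes in the README describe.
        grouped.setdefault(txn_id, []).append(str(item))
    # Dicts keep insertion order, so transactions come out in first-seen order.
    return list(grouped.values())


transaction_ids = [1, 1, 2, 2, 2, 3, 3]
items = ["Bread", "Milk", "Bread", "Diaper", "Beer", "Milk", "Coke"]
sequences = group_rows(transaction_ids, items)
print(sequences)
```

The result has the same list-of-lists shape that the traditional `GSP(transactions)` input uses.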
823
+ #### 2. Sequence Format (List Column)
824
+
825
+ Use when each row contains a complete transaction as a list:
826
+
827
+ ```python
828
+ import pandas as pd
829
+ from gsppy import GSP
830
+
831
+ # Pandas DataFrame with sequences as lists
832
+ df = pd.DataFrame({
833
+ "transaction": [
834
+ ["Bread", "Milk"],
835
+ ["Bread", "Diaper", "Beer"],
836
+ ["Milk", "Coke"],
837
+ ]
838
+ })
839
+
840
+ gsp = GSP(df, sequence_col="transaction")
841
+ patterns = gsp.search(min_support=0.3)
842
+ ```
843
+
844
+ ### DataFrame with Timestamps
845
+
846
+ DataFrames support temporal constraints for time-aware pattern mining:
847
+
848
+ ```python
849
+ import polars as pl
850
+ from gsppy import GSP
851
+
852
+ # Grouped format with timestamps
853
+ df = pl.DataFrame({
854
+ "transaction_id": [1, 1, 1, 2, 2, 2],
855
+ "item": ["Login", "Browse", "Purchase", "Login", "Browse", "Purchase"],
856
+ "timestamp": [0, 2, 5, 0, 1, 15], # Time in seconds
857
+ })
858
+
859
+ # Find patterns where consecutive events occur within 10 seconds
860
+ gsp = GSP(
861
+ df,
862
+ transaction_col="transaction_id",
863
+ item_col="item",
864
+ timestamp_col="timestamp",
865
+ maxgap=10
866
+ )
867
+ patterns = gsp.search(min_support=0.5)
868
+ ```
869
+
870
+ For sequence format with timestamps:
871
+
872
+ ```python
873
+ import pandas as pd
874
+ from gsppy import GSP
875
+
876
+ df = pd.DataFrame({
877
+ "sequence": [["A", "B", "C"], ["A", "D"]],
878
+ "timestamps": [[1, 2, 3], [1, 5]], # Timestamps per item
879
+ })
880
+
881
+ gsp = GSP(df, sequence_col="sequence", timestamp_col="timestamps", maxgap=3)
882
+ patterns = gsp.search(min_support=0.5)
883
+ ```
884
+
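Illustrative sketch, not part of the package diff: to make the effect of `maxgap` concrete, the function below checks whether a pattern occurs in a timestamped sequence with every consecutive gap at most `maxgap`. It assumes timestamps are non-decreasing and allows unrelated items between matches. `occurs_within_maxgap` is a hypothetical helper, and the library's own matching rules may differ.

```python
def occurs_within_maxgap(pattern, items, timestamps, maxgap):
    """True if `pattern` occurs in order with each consecutive gap <= maxgap."""

    def search(p_idx, start, prev_time):
        if p_idx == len(pattern):
            return True
        for i in range(start, len(items)):
            if items[i] != pattern[p_idx]:
                continue
            if prev_time is not None and timestamps[i] - prev_time > maxgap:
                continue
            # Backtrack over alternative match positions if this one fails later.
            if search(p_idx + 1, i + 1, timestamps[i]):
                return True
        return False

    return search(0, 0, None)


# The two transactions from the grouped example with timestamps.
events = ["Login", "Browse", "Purchase"]
txn1_times = [0, 2, 5]
txn2_times = [0, 1, 15]

pattern = ("Browse", "Purchase")
matches = [
    occurs_within_maxgap(pattern, events, txn1_times, maxgap=10),
    occurs_within_maxgap(pattern, events, txn2_times, maxgap=10),
]
print(matches)
```

In the second transaction the gap between Browse (t=1) and Purchase (t=15) is 14 seconds, so under this sketch the pattern is supported by only one of the two transactions.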
885
+ ### Working with Parquet and Arrow Files
886
+
887
+ DataFrames enable seamless integration with columnar storage formats:
888
+
889
+ ```python
890
+ import polars as pl
891
+ from gsppy import GSP
892
+
893
+ # Read directly from Parquet
894
+ df = pl.read_parquet("transactions.parquet")
895
+
896
+ # Run GSP with automatic schema detection
897
+ gsp = GSP(df, transaction_col="txn_id", item_col="product")
898
+ patterns = gsp.search(min_support=0.2)
899
+
900
+ # Or use Pandas with Arrow backend
901
+ import pandas as pd
902
+ df_pandas = pd.read_parquet("transactions.parquet", engine="pyarrow")
903
+ gsp = GSP(df_pandas, transaction_col="txn_id", item_col="product")
904
+ patterns = gsp.search(min_support=0.2)
905
+ ```
906
+
907
+ ### Performance Considerations
908
+
909
+ DataFrames offer performance benefits for large datasets:
910
+
911
+ - **Polars**: Leverages Arrow for zero-copy operations and parallel processing
912
+ - **Pandas**: Compatible with Arrow backend for efficient memory usage
913
+ - **Parquet/Arrow**: Columnar storage enables efficient filtering and reading
914
+ - **Schema validation**: Errors are caught early with clear messages
915
+
916
+ ### DataFrame Schema Requirements
917
+
918
+ **Grouped Format:**
919
+ - `transaction_col`: Column containing transaction/sequence IDs (any type)
920
+ - `item_col`: Column containing items (any type, converted to strings)
921
+ - `timestamp_col` (optional): Column containing timestamps (numeric)
922
+
923
+ **Sequence Format:**
924
+ - `sequence_col`: Column containing lists of items
925
+ - `timestamp_col` (optional): Column containing lists of timestamps (must match sequence lengths)
926
+
927
+ ### Error Handling
928
+
929
+ GSP-Py provides clear error messages for schema issues:
930
+
931
+ ```python
932
+ import polars as pl
933
+ from gsppy import GSP
934
+
935
+ df = pl.DataFrame({
936
+ "txn_id": [1, 2],
937
+ "product": ["A", "B"],
938
+ })
939
+
940
+ # ❌ Missing required column
941
+ try:
942
+     gsp = GSP(df, transaction_col="txn_id", item_col="item") # 'item' doesn't exist
943
+ except ValueError as e:
944
+     print(f"Error: {e}") # "Column 'item' not found in DataFrame"
945
+
946
+ # ❌ Invalid format specification
947
+ try:
948
+     gsp = GSP(df) # Must specify either sequence_col or both transaction_col and item_col
949
+ except ValueError as e:
950
+     print(f"Error: {e}") # "Must specify either 'sequence_col' or both 'transaction_col' and 'item_col'"
951
+ ```
952
+
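Illustrative sketch, not part of the package diff: the two error cases above amount to a small decision procedure over the column arguments. The function `detect_dataframe_format` is hypothetical and takes a plain list of column names. The messages are copied from the README comments, but the order of checks in the real adapter is an assumption.

```python
def detect_dataframe_format(columns, transaction_col=None, item_col=None, sequence_col=None):
    """Return 'sequence' or 'grouped', raising ValueError on an unusable spec."""
    if sequence_col is not None:
        required = [sequence_col]
        kind = "sequence"
    elif transaction_col is not None and item_col is not None:
        required = [transaction_col, item_col]
        kind = "grouped"
    else:
        raise ValueError(
            "Must specify either 'sequence_col' or both 'transaction_col' and 'item_col'"
        )
    # Fail early, before any data is read, if a named column is absent.
    for name in required:
        if name not in columns:
            raise ValueError(f"Column '{name}' not found in DataFrame")
    return kind


columns = ["txn_id", "product"]
kind = detect_dataframe_format(columns, transaction_col="txn_id", item_col="product")
print(kind)
```

Validating the column names up front means a typo surfaces as a short `ValueError` instead of a failure deep inside the mining loop.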
953
+ ### Backward Compatibility
954
+
955
+ Traditional list-based input continues to work:
956
+
957
+ ```python
958
+ from gsppy import GSP
959
+
960
+ # Lists still work as before
961
+ transactions = [["A", "B"], ["A", "C"], ["B", "C"]]
962
+ gsp = GSP(transactions)
963
+ patterns = gsp.search(min_support=0.5)
964
+ ```
965
+
966
+ DataFrame parameters cannot be mixed with list input:
967
+
968
+ ```python
969
+ transactions = [["A", "B"], ["C", "D"]]
970
+
971
+ # ❌ This raises an error
972
+ gsp = GSP(transactions, transaction_col="txn") # ValueError: DataFrame parameters cannot be used with list input
973
+ ```
974
+
975
+ ### Examples and Tests
976
+
977
+ For complete examples and edge cases, see:
978
+ - [`tests/test_dataframe.py`](tests/test_dataframe.py) - Comprehensive test suite
979
+ - DataFrame adapter documentation in [`gsppy/dataframe_adapters.py`](gsppy/dataframe_adapters.py)
980
+
981
+ ---
982
+
587
983
  ## ⏱️ Temporal Constraints
588
984
 
589
985
  GSP-Py supports **time-constrained sequential pattern mining** with three powerful temporal constraints: `mingap`, `maxgap`, and `maxspan`. These constraints enable domain-specific applications such as medical event mining, retail analytics, and temporal user journey discovery.
@@ -591,7 +987,7 @@ GSP-Py supports **time-constrained sequential pattern mining** with three powerf
591
987
  ### Temporal Constraint Parameters
592
988
 
593
989
  - **`mingap`**: Minimum time gap required between consecutive items in a pattern
594
- - **`maxgap`**: Maximum time gap allowed between consecutive items in a pattern
990
+ - **`maxgap`**: Maximum time gap allowed between consecutive items in a pattern
595
991
  - **`maxspan`**: Maximum time span from the first to the last item in a pattern
596
992
 
597
993
  ### Using Temporal Constraints