gsppy 3.6.0__tar.gz → 4.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (31)
  1. {gsppy-3.6.0 → gsppy-4.0.0}/CHANGELOG.md +129 -0
  2. {gsppy-3.6.0 → gsppy-4.0.0}/PKG-INFO +329 -9
  3. {gsppy-3.6.0 → gsppy-4.0.0}/README.md +319 -4
  4. {gsppy-3.6.0 → gsppy-4.0.0}/gsppy/__init__.py +37 -2
  5. {gsppy-3.6.0 → gsppy-4.0.0}/gsppy/cli.py +314 -11
  6. gsppy-4.0.0/gsppy/dataframe_adapters.py +458 -0
  7. gsppy-4.0.0/gsppy/enums.py +49 -0
  8. {gsppy-3.6.0 → gsppy-4.0.0}/gsppy/gsp.py +150 -9
  9. gsppy-4.0.0/gsppy/token_mapper.py +99 -0
  10. {gsppy-3.6.0 → gsppy-4.0.0}/gsppy/utils.py +120 -0
  11. {gsppy-3.6.0 → gsppy-4.0.0}/pyproject.toml +18 -7
  12. {gsppy-3.6.0 → gsppy-4.0.0}/tests/test_cli.py +70 -3
  13. gsppy-4.0.0/tests/test_dataframe.py +341 -0
  14. gsppy-4.0.0/tests/test_spm_format.py +303 -0
  15. {gsppy-3.6.0 → gsppy-4.0.0}/tox.ini +1 -1
  16. {gsppy-3.6.0 → gsppy-4.0.0}/.gitignore +0 -0
  17. {gsppy-3.6.0 → gsppy-4.0.0}/CONTRIBUTING.md +0 -0
  18. {gsppy-3.6.0 → gsppy-4.0.0}/LICENSE +0 -0
  19. {gsppy-3.6.0 → gsppy-4.0.0}/SECURITY.md +0 -0
  20. {gsppy-3.6.0 → gsppy-4.0.0}/gsppy/accelerate.py +0 -0
  21. {gsppy-3.6.0 → gsppy-4.0.0}/gsppy/pruning.py +0 -0
  22. {gsppy-3.6.0 → gsppy-4.0.0}/gsppy/py.typed +0 -0
  23. {gsppy-3.6.0 → gsppy-4.0.0}/rust/Cargo.lock +0 -0
  24. {gsppy-3.6.0 → gsppy-4.0.0}/rust/Cargo.toml +0 -0
  25. {gsppy-3.6.0 → gsppy-4.0.0}/rust/src/lib.rs +0 -0
  26. {gsppy-3.6.0 → gsppy-4.0.0}/tests/__init__.py +0 -0
  27. {gsppy-3.6.0 → gsppy-4.0.0}/tests/test_gsp.py +0 -0
  28. {gsppy-3.6.0 → gsppy-4.0.0}/tests/test_gsp_fuzzing.py +0 -0
  29. {gsppy-3.6.0 → gsppy-4.0.0}/tests/test_pruning.py +0 -0
  30. {gsppy-3.6.0 → gsppy-4.0.0}/tests/test_temporal_constraints.py +0 -0
  31. {gsppy-3.6.0 → gsppy-4.0.0}/tests/test_utils.py +0 -0
@@ -1,6 +1,135 @@
1
1
  # CHANGELOG
2
2
 
3
3
 
4
+ ## v4.0.0 (2026-02-01)
5
+
6
+ ### Chores
7
+
8
+ - Add additional VSCode extensions for improved development experience
9
+ ([`107dfa4`](https://github.com/jacksonpradolima/gsp-py/commit/107dfa422005f4cdec4655a9751fd0d6e597773f))
10
+
11
+ - Update uv.lock for version 3.6.1
12
+ ([`d8d7394`](https://github.com/jacksonpradolima/gsp-py/commit/d8d73947d570844c02e9d974b626da26f07cf1e6))
13
+
14
+ ### Features
15
+
16
+ - Add SPM/GSP delimiter format loader and token mapping utilities
17
+ ([`4ac1d34`](https://github.com/jacksonpradolima/gsp-py/commit/4ac1d34d166f21d30968872cf16c1bde3ff1f2aa))
18
+
19
+ ### Refactoring
20
+
21
+ - Add type casting for return values in read_transactions_from_spm
22
+ ([`2099bfd`](https://github.com/jacksonpradolima/gsp-py/commit/2099bfd5253a1dc058dd46bd0da077810958fa76))
23
+
24
+ - Update read_transactions_from_spm to return mappings and adjust tests
25
+ ([`373b8ff`](https://github.com/jacksonpradolima/gsp-py/commit/373b8ff0d7f131140bcdbd039fae0d02572e86b7))
26
+
27
+
28
+ ## v3.6.1 (2026-01-31)
29
+
30
+ ### Bug Fixes
31
+
32
+ - Typing for polars and pandas
33
+ ([`0773992`](https://github.com/jacksonpradolima/gsp-py/commit/07739921d074e55c8436a88a73e510b1d8761510))
34
+
35
+ ### Build System
36
+
37
+ - **deps**: Bump actions/checkout in /.github/workflows
38
+ ([`7af193d`](https://github.com/jacksonpradolima/gsp-py/commit/7af193d515972eeca5d8e354e91a60e488357cfb))
39
+
40
+ Bumps [actions/checkout](https://github.com/actions/checkout) from 4.3.1 to 6.0.2. - [Release
41
+ notes](https://github.com/actions/checkout/releases) -
42
+ [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) -
43
+ [Commits](https://github.com/actions/checkout/compare/v4.3.1...de0fac2e4500dabe0009e67214ff5f5447ce83dd)
44
+
45
+ --- updated-dependencies: - dependency-name: actions/checkout dependency-version: 6.0.2
46
+
47
+ dependency-type: direct:production
48
+
49
+ update-type: version-update:semver-major
50
+
51
+ ...
52
+
53
+ Signed-off-by: dependabot[bot] <support@github.com>
54
+
55
+ - **deps**: Bump actions/github-script in /.github/workflows
56
+ ([`03a7588`](https://github.com/jacksonpradolima/gsp-py/commit/03a7588301421369731d3d543f81b93c25c292ef))
57
+
58
+ Bumps [actions/github-script](https://github.com/actions/github-script) from 7.0.1 to 8.0.0. -
59
+ [Release notes](https://github.com/actions/github-script/releases) -
60
+ [Commits](https://github.com/actions/github-script/compare/60a0d83039c74a4aee543508d2ffcb1c3799cdea...ed597411d8f924073f98dfc5c65a23a2325f34cd)
61
+
62
+ --- updated-dependencies: - dependency-name: actions/github-script dependency-version: 8.0.0
63
+
64
+ dependency-type: direct:production
65
+
66
+ update-type: version-update:semver-major
67
+
68
+ ...
69
+
70
+ Signed-off-by: dependabot[bot] <support@github.com>
71
+
72
+ - **deps**: Bump actions/setup-python in /.github/workflows
73
+ ([`75771bf`](https://github.com/jacksonpradolima/gsp-py/commit/75771bff660b3842f2c8d84bdaeb013941e5abe0))
74
+
75
+ Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5.6.0 to 6.2.0. -
76
+ [Release notes](https://github.com/actions/setup-python/releases) -
77
+ [Commits](https://github.com/actions/setup-python/compare/v5.6.0...a309ff8b426b58ec0e2a45f0f869d46889d02405)
78
+
79
+ --- updated-dependencies: - dependency-name: actions/setup-python dependency-version: 6.2.0
80
+
81
+ dependency-type: direct:production
82
+
83
+ update-type: version-update:semver-major
84
+
85
+ ...
86
+
87
+ Signed-off-by: dependabot[bot] <support@github.com>
88
+
89
+ - **deps**: Bump actions/stale in /.github/workflows
90
+ ([`e699ccd`](https://github.com/jacksonpradolima/gsp-py/commit/e699ccdac689734b4694665d924ace8bba479253))
91
+
92
+ Bumps [actions/stale](https://github.com/actions/stale) from 9.0.0 to 10.1.1. - [Release
93
+ notes](https://github.com/actions/stale/releases) -
94
+ [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md) -
95
+ [Commits](https://github.com/actions/stale/compare/28ca1036281a5e5922ead5184a1bbf96e5fc984e...997185467fa4f803885201cee163a9f38240193d)
96
+
97
+ --- updated-dependencies: - dependency-name: actions/stale dependency-version: 10.1.1
98
+
99
+ dependency-type: direct:production
100
+
101
+ update-type: version-update:semver-major
102
+
103
+ ...
104
+
105
+ Signed-off-by: dependabot[bot] <support@github.com>
106
+
107
+ - **deps**: Bump actions/upload-artifact in /.github/workflows
108
+ ([`17efaff`](https://github.com/jacksonpradolima/gsp-py/commit/17efaffc755c017e066c0286464899ead6e2cae4))
109
+
110
+ Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.6.2 to 6.0.0. -
111
+ [Release notes](https://github.com/actions/upload-artifact/releases) -
112
+ [Commits](https://github.com/actions/upload-artifact/compare/v4.6.2...b7c566a772e6b6bfb58ed0dc250532a479d7789f)
113
+
114
+ --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-version: 6.0.0
115
+
116
+ dependency-type: direct:production
117
+
118
+ update-type: version-update:semver-major
119
+
120
+ ...
121
+
122
+ Signed-off-by: dependabot[bot] <support@github.com>
123
+
124
+ ### Chores
125
+
126
+ - Update uv.lock for version 3.6.0
127
+ ([`4c2a5e5`](https://github.com/jacksonpradolima/gsp-py/commit/4c2a5e5967482443c2db645c9ba4744bd2110dd1))
128
+
129
+ - **deps**: Bump ty and ruff
130
+ ([`07a20df`](https://github.com/jacksonpradolima/gsp-py/commit/07a20df9fb4ff3a3b022d28d152b586ca45383c8))
131
+
132
+
4
133
  ## v3.6.0 (2026-01-26)
5
134
 
6
135
  ### Chores
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: gsppy
3
- Version: 3.6.0
3
+ Version: 4.0.0
4
4
  Summary: GSP (Generalized Sequence Pattern) algorithm in Python
5
5
  Project-URL: Homepage, https://github.com/jacksonpradolima/gsp-py
6
6
  Author-email: Jackson Antonio do Prado Lima <jacksonpradolima@gmail.com>
@@ -32,15 +32,20 @@ Classifier: Intended Audience :: Science/Research
32
32
  Classifier: License :: OSI Approved :: MIT License
33
33
  Classifier: Natural Language :: English
34
34
  Classifier: Operating System :: OS Independent
35
- Classifier: Programming Language :: Python :: 3.10
36
35
  Classifier: Programming Language :: Python :: 3.11
37
36
  Classifier: Programming Language :: Python :: 3.12
38
37
  Classifier: Programming Language :: Python :: 3.13
38
+ Classifier: Programming Language :: Python :: 3.14
39
39
  Classifier: Topic :: Scientific/Engineering :: Information Analysis
40
40
  Classifier: Topic :: Software Development :: Libraries :: Python Modules
41
- Requires-Python: >=3.10
41
+ Requires-Python: >=3.11
42
42
  Requires-Dist: click>=8.0.0
43
43
  Requires-Dist: typing-extensions>=4.0.0
44
+ Provides-Extra: dataframe
45
+ Requires-Dist: pandas-stubs>=2.3.3.260113; extra == 'dataframe'
46
+ Requires-Dist: pandas>=3.0.0; extra == 'dataframe'
47
+ Requires-Dist: polars>=1.37.1; extra == 'dataframe'
48
+ Requires-Dist: pyarrow>=10.0.0; extra == 'dataframe'
44
49
  Provides-Extra: dev
45
50
  Requires-Dist: cython==3.2.4; extra == 'dev'
46
51
  Requires-Dist: hatch==1.16.3; extra == 'dev'
@@ -51,9 +56,9 @@ Requires-Dist: pyright==1.1.408; extra == 'dev'
51
56
  Requires-Dist: pytest-benchmark==5.2.3; extra == 'dev'
52
57
  Requires-Dist: pytest-cov==7.0.0; extra == 'dev'
53
58
  Requires-Dist: pytest==9.0.2; extra == 'dev'
54
- Requires-Dist: ruff==0.14.13; extra == 'dev'
59
+ Requires-Dist: ruff==0.14.14; extra == 'dev'
55
60
  Requires-Dist: tox==4.34.1; extra == 'dev'
56
- Requires-Dist: ty==0.0.12; extra == 'dev'
61
+ Requires-Dist: ty==0.0.14; extra == 'dev'
57
62
  Provides-Extra: docs
58
63
  Requires-Dist: mkdocs-gen-files<1,>=0.5; extra == 'docs'
59
64
  Requires-Dist: mkdocs-literate-nav<1,>=0.6; extra == 'docs'
@@ -72,7 +77,7 @@ Description-Content-Type: text/markdown
72
77
 
73
78
  [![PyPI Downloads](https://img.shields.io/pypi/dm/gsppy.svg?style=flat-square)](https://pypi.org/project/gsppy/)
74
79
  [![PyPI version](https://badge.fury.io/py/gsppy.svg)](https://pypi.org/project/gsppy)
75
- ![](https://img.shields.io/badge/python-3.10+-blue.svg)
80
+ ![](https://img.shields.io/badge/python-3.11+-blue.svg)
76
81
 
77
82
  [![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/jacksonpradolima/gsp-py/badge)](https://securityscorecards.dev/viewer/?uri=github.com/jacksonpradolima/gsp-py)
78
83
  [![SLSA provenance](https://github.com/jacksonpradolima/gsp-py/actions/workflows/slsa-provenance.yml/badge.svg)](https://github.com/jacksonpradolima/gsp-py/actions/workflows/slsa-provenance.yml)
@@ -90,7 +95,7 @@ Description-Content-Type: text/markdown
90
95
  Sequence Pattern (GSP)** algorithm. Ideal for market basket analysis, temporal mining, and user journey discovery.
91
96
 
92
97
  > [!IMPORTANT]
93
- > GSP-Py is compatible with Python 3.10 and later versions!
98
+ > GSP-Py is compatible with Python 3.11 and later versions!
94
99
 
95
100
  ---
96
101
 
@@ -106,6 +111,7 @@ Sequence Pattern (GSP)** algorithm. Ideal for market basket analysis, temporal m
106
111
  6. [💡 Usage](#usage)
107
112
  - [✅ Example: Analyzing Sales Data](#example-analyzing-sales-data)
108
113
  - [📊 Explanation: Support and Results](#explanation-support-and-results)
114
+ - [📊 DataFrame Input Support](#dataframe-input-support)
109
115
  - [⏱️ Temporal Constraints](#temporal-constraints)
110
116
  7. [⌨️ Typing](#typing)
111
117
  8. [🌟 Planned Features](#planned-features)
@@ -357,6 +363,34 @@ Your input file should be either:
357
363
  Bread,Milk,Diaper,Coke
358
364
  ```
359
365
 
366
+ - **SPM/GSP Format**: Uses delimiters to separate elements and sequences. This format is commonly used in sequential pattern mining datasets.
367
+ - `-1`: Marks the end of an element (itemset)
368
+ - `-2`: Marks the end of a sequence (transaction)
369
+
370
+ Example:
371
+ ```text
372
+ 1 2 -1 3 -1 -2
373
+ 4 -1 5 6 -1 -2
374
+ 1 -1 2 3 -1 -2
375
+ ```
376
+
377
+ The above represents:
378
+ - Transaction 1: `[[1, 2], [3]]` → flattened to `[1, 2, 3]`
379
+ - Transaction 2: `[[4], [5, 6]]` → flattened to `[4, 5, 6]`
380
+ - Transaction 3: `[[1], [2, 3]]` → flattened to `[1, 2, 3]`
381
+
382
+ String tokens are also supported:
383
+ ```text
384
+ A B -1 C -1 -2
385
+ D -1 E F -1 -2
386
+ ```
387
+
388
+ - **Parquet/Arrow Files**: Modern columnar data formats (requires 'gsppy[dataframe]')
389
+ ```bash
390
+ pip install 'gsppy[dataframe]'
391
+ ```
392
+ This installs optional dependencies: `polars`, `pandas`, and `pyarrow` for DataFrame support.
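The SPM/GSP delimiter rules described in this hunk (`-1` closes an itemset, `-2` closes a sequence) can be illustrated with a small standalone parser. This is a minimal sketch, not gsppy's loader: `parse_spm_line` and `flatten` are hypothetical helper names, and tokens are kept as strings rather than converted to integers, which is an assumption made for simplicity.

```python
from typing import List


def parse_spm_line(line: str) -> List[List[str]]:
    """Split one SPM-format line into itemsets (hypothetical helper, not part of gsppy)."""
    itemsets: List[List[str]] = []
    current: List[str] = []
    for token in line.split():
        if token == "-1":
            # End of an itemset; skip empty ones produced by consecutive delimiters.
            if current:
                itemsets.append(current)
                current = []
        elif token == "-2":
            # End of the sequence; ignore anything after it.
            break
        else:
            current.append(token)
    if current:
        # Tolerate a line whose last itemset is not closed by -1.
        itemsets.append(current)
    return itemsets


def flatten(itemsets: List[List[str]]) -> List[str]:
    """Concatenate itemsets into a single flat list of items."""
    return [item for itemset in itemsets for item in itemset]


lines = ["1 2 -1 3 -1 -2", "4 -1 5 6 -1 -2", "1 -1 2 3 -1 -2"]
parsed = [parse_spm_line(line) for line in lines]
flat = [flatten(itemsets) for itemsets in parsed]
print(parsed[0])  # [['1', '2'], ['3']]
print(flat[0])    # ['1', '2', '3']
```

The flattening step mirrors the "flattened to" wording in the README text; whether the real loader treats a missing `-2` the same way is not confirmed by this diff.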
393
+
360
394
  ### Running the CLI
361
395
 
362
396
  Use the following command to run GSPPy on your data:
@@ -371,9 +405,16 @@ Or for CSV files:
371
405
  gsppy --file path/to/transactions.csv --min_support 0.3 --backend rust
372
406
  ```
373
407
 
408
+ For SPM/GSP format files, use the `--format spm` option:
409
+
410
+ ```bash
411
+ gsppy --file path/to/data.txt --format spm --min_support 0.3
412
+ ```
413
+
374
414
  #### CLI Options
375
415
 
376
- - `--file`: Path to your input file (JSON or CSV). **Required**.
416
+ - `--file`: Path to your input file (JSON, CSV, or SPM format). **Required**.
417
+ - `--format`: File format to use. Options: `auto` (default, auto-detect from extension), `json`, `csv`, `spm`, `parquet`, `arrow`.
377
418
  - `--min_support`: Minimum support threshold as a fraction (e.g., `0.3` for 30%). Default is `0.2`.
378
419
  - `--backend`: Backend to use for support counting. One of `auto` (default), `python`, `rust`, or `gpu`.
379
420
  - `--verbose`: Enable detailed logging with timestamps, log levels, and process IDs for debugging and traceability.
@@ -518,6 +559,83 @@ Verbose mode provides:
518
559
 
519
560
  For complete documentation on logging, see [docs/logging.md](docs/logging.md).
520
561
 
562
+ ### Loading SPM/GSP Format Files
563
+
564
+ GSP-Py supports loading datasets in the classical SPM/GSP delimiter format, which is widely used in sequential pattern mining research. This format uses:
565
+ - `-1` to mark the end of an element (itemset)
566
+ - `-2` to mark the end of a sequence (transaction)
567
+
568
+ #### Using the SPM Loader
569
+
570
+ ```python
571
+ from gsppy.utils import read_transactions_from_spm
572
+ from gsppy import GSP
573
+
574
+ # Load SPM format file
575
+ transactions = read_transactions_from_spm('data.txt')
576
+
577
+ # Run GSP algorithm
578
+ gsp = GSP(transactions)
579
+ result = gsp.search(min_support=0.3)
580
+ ```
581
+
582
+ #### SPM Format Examples
583
+
584
+ **Simple sequence file (`data.txt`):**
585
+ ```text
586
+ 1 2 -1 3 -1 -2
587
+ 4 -1 5 6 -1 -2
588
+ 1 -1 2 3 -1 -2
589
+ ```
590
+
591
+ This represents:
592
+ - Transaction 1: Items [1, 2] followed by item [3] → flattened to [1, 2, 3]
593
+ - Transaction 2: Item [4] followed by items [5, 6] → flattened to [4, 5, 6]
594
+ - Transaction 3: Item [1] followed by items [2, 3] → flattened to [1, 2, 3]
595
+
596
+ **String tokens are also supported:**
597
+ ```text
598
+ A B -1 C -1 -2
599
+ D -1 E F -1 -2
600
+ ```
601
+
602
+ #### Token Mapping
603
+
604
+ For workflows requiring conversion between string tokens and integer IDs, use the `TokenMapper`:
605
+
606
+ ```python
607
+ from gsppy.utils import read_transactions_from_spm
608
+ from gsppy import TokenMapper
609
+
610
+ # Load with mappings
611
+ transactions, str_to_int, int_to_str = read_transactions_from_spm(
612
+ 'data.txt',
613
+ return_mappings=True
614
+ )
615
+
616
+ print("String to Int:", str_to_int)
617
+ # Output: {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}
618
+
619
+ print("Int to String:", int_to_str)
620
+ # Output: {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6'}
621
+
622
+ # Use the TokenMapper class directly
623
+ mapper = TokenMapper()
624
+ id_a = mapper.add_token("A")
625
+ id_b = mapper.add_token("B")
626
+ print(f"A -> {id_a}, B -> {id_b}")
627
+ # Output: A -> 0, B -> 1
628
+ ```
629
+
630
+ #### Edge Cases Handled
631
+
632
+ The SPM loader gracefully handles:
633
+ - Empty lines (skipped)
634
+ - Missing `-2` delimiter at end of line
635
+ - Extra or consecutive delimiters
636
+ - Mixed-length elements in sequences
637
+ - Both integer and string tokens
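The string-to-integer mapping shown under "Token Mapping" can be pictured as a pair of dictionaries kept in sync. The sketch below is illustrative only: `SimpleTokenMapper` is a hypothetical class, not gsppy's `TokenMapper`, and it assumes IDs are assigned in first-seen order starting at 0, which matches the README output `A -> 0, B -> 1` but is not confirmed as the library's internal behavior.

```python
from typing import Dict, List


class SimpleTokenMapper:
    """Illustrative bidirectional token/ID map (hypothetical, not gsppy's TokenMapper)."""

    def __init__(self) -> None:
        self.str_to_int: Dict[str, int] = {}
        self.int_to_str: Dict[int, str] = {}

    def add_token(self, token: str) -> int:
        """Return the ID for a token, assigning the next free ID on first sight."""
        if token not in self.str_to_int:
            new_id = len(self.str_to_int)
            self.str_to_int[token] = new_id
            self.int_to_str[new_id] = token
        return self.str_to_int[token]


mapper = SimpleTokenMapper()
sequences: List[List[str]] = [["A", "B", "C"], ["D", "E", "F"], ["A", "C"]]

# Encode string tokens to integer IDs, then decode them back.
encoded = [[mapper.add_token(tok) for tok in seq] for seq in sequences]
decoded = [[mapper.int_to_str[i] for i in seq] for seq in encoded]
print(encoded)  # [[0, 1, 2], [3, 4, 5], [0, 2]]
```

Keeping both directions makes it cheap to mine on compact integer IDs and still report patterns with the original labels.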
638
+
521
639
  ### Output
522
640
 
523
641
  The algorithm will return a list of patterns with their corresponding support.
@@ -584,6 +702,208 @@ result = gsp.search(min_support=0.5) # Need at least 2/4 sequences
584
702
 
585
703
  ---
586
704
 
705
+ ## 📊 DataFrame Input Support
706
+
707
+ GSP-Py supports **Polars and Pandas DataFrames** as input, enabling high-performance workflows with modern data formats like Arrow and Parquet. This feature is particularly useful for large-scale data engineering pipelines and integration with existing data processing workflows.
708
+
709
+ ### Installation
710
+
711
+ Install GSP-Py with DataFrame support:
712
+
713
+ ```bash
714
+ pip install 'gsppy[dataframe]'
715
+ ```
716
+
717
+ This installs the optional dependencies: `polars`, `pandas`, and `pyarrow`.
718
+
719
+ ### DataFrame Input Formats
720
+
721
+ GSP-Py supports two DataFrame formats:
722
+
723
+ #### 1. Grouped Format (Transaction ID + Item Columns)
724
+
725
+ Use when your data has separate rows for each item in a transaction:
726
+
727
+ ```python
728
+ import polars as pl
729
+ from gsppy import GSP
730
+
731
+ # Polars DataFrame with transaction_id and item columns
732
+ df = pl.DataFrame({
733
+ "transaction_id": [1, 1, 2, 2, 2, 3, 3],
734
+ "item": ["Bread", "Milk", "Bread", "Diaper", "Beer", "Milk", "Coke"],
735
+ })
736
+
737
+ # Run GSP directly on the DataFrame
738
+ gsp = GSP(df, transaction_col="transaction_id", item_col="item")
739
+ patterns = gsp.search(min_support=0.3)
740
+
741
+ for level, freq_patterns in enumerate(patterns, start=1):
742
+ print(f"\n{level}-Sequence Patterns:")
743
+ for pattern, support in freq_patterns.items():
744
+ print(f" {pattern}: {support}")
745
+ ```
746
+
747
+ #### 2. Sequence Format (List Column)
748
+
749
+ Use when each row contains a complete transaction as a list:
750
+
751
+ ```python
752
+ import pandas as pd
753
+ from gsppy import GSP
754
+
755
+ # Pandas DataFrame with sequences as lists
756
+ df = pd.DataFrame({
757
+ "transaction": [
758
+ ["Bread", "Milk"],
759
+ ["Bread", "Diaper", "Beer"],
760
+ ["Milk", "Coke"],
761
+ ]
762
+ })
763
+
764
+ gsp = GSP(df, sequence_col="transaction")
765
+ patterns = gsp.search(min_support=0.3)
766
+ ```
767
+
768
+ ### DataFrame with Timestamps
769
+
770
+ DataFrames support temporal constraints for time-aware pattern mining:
771
+
772
+ ```python
773
+ import polars as pl
774
+ from gsppy import GSP
775
+
776
+ # Grouped format with timestamps
777
+ df = pl.DataFrame({
778
+ "transaction_id": [1, 1, 1, 2, 2, 2],
779
+ "item": ["Login", "Browse", "Purchase", "Login", "Browse", "Purchase"],
780
+ "timestamp": [0, 2, 5, 0, 1, 15], # Time in seconds
781
+ })
782
+
783
+ # Find patterns where consecutive events occur within 10 seconds
784
+ gsp = GSP(
785
+ df,
786
+ transaction_col="transaction_id",
787
+ item_col="item",
788
+ timestamp_col="timestamp",
789
+ maxgap=10
790
+ )
791
+ patterns = gsp.search(min_support=0.5)
792
+ ```
793
+
794
+ For sequence format with timestamps:
795
+
796
+ ```python
797
+ import pandas as pd
798
+ from gsppy import GSP
799
+
800
+ df = pd.DataFrame({
801
+ "sequence": [["A", "B", "C"], ["A", "D"]],
802
+ "timestamps": [[1, 2, 3], [1, 5]], # Timestamps per item
803
+ })
804
+
805
+ gsp = GSP(df, sequence_col="sequence", timestamp_col="timestamps", maxgap=3)
806
+ patterns = gsp.search(min_support=0.5)
807
+ ```
808
+
809
+ ### Working with Parquet and Arrow Files
810
+
811
+ DataFrames enable seamless integration with columnar storage formats:
812
+
813
+ ```python
814
+ import polars as pl
815
+ from gsppy import GSP
816
+
817
+ # Read directly from Parquet
818
+ df = pl.read_parquet("transactions.parquet")
819
+
820
+ # Run GSP with automatic schema detection
821
+ gsp = GSP(df, transaction_col="txn_id", item_col="product")
822
+ patterns = gsp.search(min_support=0.2)
823
+
824
+ # Or use Pandas with Arrow backend
825
+ import pandas as pd
826
+ df_pandas = pd.read_parquet("transactions.parquet", engine="pyarrow")
827
+ gsp = GSP(df_pandas, transaction_col="txn_id", item_col="product")
828
+ patterns = gsp.search(min_support=0.2)
829
+ ```
830
+
831
+ ### Performance Considerations
832
+
833
+ DataFrames offer performance benefits for large datasets:
834
+
835
+ - **Polars**: Leverages Arrow for zero-copy operations and parallel processing
836
+ - **Pandas**: Compatible with Arrow backend for efficient memory usage
837
+ - **Parquet/Arrow**: Columnar storage enables efficient filtering and reading
838
+ - **Schema validation**: Errors are caught early with clear messages
839
+
840
+ ### DataFrame Schema Requirements
841
+
842
+ **Grouped Format:**
843
+ - `transaction_col`: Column containing transaction/sequence IDs (any type)
844
+ - `item_col`: Column containing items (any type, converted to strings)
845
+ - `timestamp_col` (optional): Column containing timestamps (numeric)
846
+
847
+ **Sequence Format:**
848
+ - `sequence_col`: Column containing lists of items
849
+ - `timestamp_col` (optional): Column containing lists of timestamps (must match sequence lengths)
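To make the relationship between the two schemas concrete, the sketch below turns grouped-format rows (transaction ID plus item) into the list-of-sequences shape. It uses plain Python instead of Polars or Pandas so it runs without the optional extra. `group_rows` is a hypothetical helper; preserving row order within each transaction and converting items with `str()` are assumptions based on the schema notes above, not a description of `gsppy/dataframe_adapters.py`.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def group_rows(rows: Iterable[Tuple[Hashable, object]]) -> List[List[str]]:
    """Group (transaction_id, item) rows into one item list per transaction.

    Hypothetical helper: keeps first-seen transaction order and row order
    within each transaction, and converts every item to a string.
    """
    sequences: Dict[Hashable, List[str]] = {}
    for txn_id, item in rows:
        sequences.setdefault(txn_id, []).append(str(item))
    return list(sequences.values())


# Same data as the grouped-format README example, as plain tuples.
transaction_ids = [1, 1, 2, 2, 2, 3, 3]
items = ["Bread", "Milk", "Bread", "Diaper", "Beer", "Milk", "Coke"]
sequences = group_rows(zip(transaction_ids, items))
print(sequences)
# [['Bread', 'Milk'], ['Bread', 'Diaper', 'Beer'], ['Milk', 'Coke']]
```

The result has the same shape as the sequence-format example, which is also the traditional list input accepted by `GSP`.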
850
+
851
+ ### Error Handling
852
+
853
+ GSP-Py provides clear error messages for schema issues:
854
+
855
+ ```python
856
+ import polars as pl
857
+ from gsppy import GSP
858
+
859
+ df = pl.DataFrame({
860
+ "txn_id": [1, 2],
861
+ "product": ["A", "B"],
862
+ })
863
+
864
+ # ❌ Missing required column
865
+ try:
866
+ gsp = GSP(df, transaction_col="txn_id", item_col="item") # 'item' doesn't exist
867
+ except ValueError as e:
868
+ print(f"Error: {e}") # "Column 'item' not found in DataFrame"
869
+
870
+ # ❌ Invalid format specification
871
+ try:
872
+ gsp = GSP(df) # Must specify either sequence_col or both transaction_col and item_col
873
+ except ValueError as e:
874
+ print(f"Error: {e}") # "Must specify either 'sequence_col' or both 'transaction_col' and 'item_col'"
875
+ ```
876
+
877
+ ### Backward Compatibility
878
+
879
+ Traditional list-based input continues to work:
880
+
881
+ ```python
882
+ from gsppy import GSP
883
+
884
+ # Lists still work as before
885
+ transactions = [["A", "B"], ["A", "C"], ["B", "C"]]
886
+ gsp = GSP(transactions)
887
+ patterns = gsp.search(min_support=0.5)
888
+ ```
889
+
890
+ DataFrame parameters cannot be mixed with list input:
891
+
892
+ ```python
893
+ transactions = [["A", "B"], ["C", "D"]]
894
+
895
+ # ❌ This raises an error
896
+ gsp = GSP(transactions, transaction_col="txn") # ValueError: DataFrame parameters cannot be used with list input
897
+ ```
898
+
899
+ ### Examples and Tests
900
+
901
+ For complete examples and edge cases, see:
902
+ - [`tests/test_dataframe.py`](tests/test_dataframe.py) - Comprehensive test suite
903
+ - DataFrame adapter documentation in [`gsppy/dataframe_adapters.py`](gsppy/dataframe_adapters.py)
904
+
905
+ ---
906
+
587
907
  ## ⏱️ Temporal Constraints
588
908
 
589
909
  GSP-Py supports **time-constrained sequential pattern mining** with three powerful temporal constraints: `mingap`, `maxgap`, and `maxspan`. These constraints enable domain-specific applications such as medical event mining, retail analytics, and temporal user journey discovery.
@@ -591,7 +911,7 @@ GSP-Py supports **time-constrained sequential pattern mining** with three powerf
591
911
  ### Temporal Constraint Parameters
592
912
 
593
913
  - **`mingap`**: Minimum time gap required between consecutive items in a pattern
594
- - **`maxgap`**: Maximum time gap allowed between consecutive items in a pattern
914
+ - **`maxgap`**: Maximum time gap allowed between consecutive items in a pattern
595
915
  - **`maxspan`**: Maximum time span from the first to the last item in a pattern
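The three constraints listed above can be read as simple checks on the timestamps of one matched occurrence of a pattern. The following is a minimal sketch under stated assumptions: `satisfies_constraints` is a hypothetical function, bounds are treated as inclusive, and the check is applied to a single list of already-matched timestamps. gsppy's actual matching logic may differ.

```python
from typing import Optional, Sequence


def satisfies_constraints(
    timestamps: Sequence[float],
    mingap: Optional[float] = None,
    maxgap: Optional[float] = None,
    maxspan: Optional[float] = None,
) -> bool:
    """Check one occurrence's timestamps against mingap, maxgap, and maxspan.

    Hypothetical helper; assumes timestamps are sorted and bounds are inclusive.
    """
    # Gaps between consecutive matched items.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if mingap is not None and any(gap < mingap for gap in gaps):
        return False
    if maxgap is not None and any(gap > maxgap for gap in gaps):
        return False
    if maxspan is not None and timestamps and timestamps[-1] - timestamps[0] > maxspan:
        return False
    return True


# Timestamps from the "DataFrame with Timestamps" example (Login, Browse, Purchase).
txn1 = [0, 2, 5]
txn2 = [0, 1, 15]
print(satisfies_constraints(txn1, maxgap=10))  # True: gaps are 2 and 3
print(satisfies_constraints(txn2, maxgap=10))  # False: the last gap is 14
```

Under these assumptions, only the first transaction supports the full Login, Browse, Purchase pattern when `maxgap=10`.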
596
916
 
597
917
  ### Using Temporal Constraints