pandas-match-recognize 0.1.0__tar.gz

Files changed (185)
  1. pandas_match_recognize-0.1.0/.gitignore +122 -0
  2. pandas_match_recognize-0.1.0/.vscode/settings.json +3 -0
  3. pandas_match_recognize-0.1.0/1.swift +490 -0
  4. pandas_match_recognize-0.1.0/Examples Match_reocginze/Examples.ipynb +2402 -0
  5. pandas_match_recognize-0.1.0/Examples Match_reocginze/Examples_View.ipynb +27021 -0
  6. pandas_match_recognize-0.1.0/Examples.ipynb +2402 -0
  7. pandas_match_recognize-0.1.0/LICENSE +21 -0
  8. pandas_match_recognize-0.1.0/PKG-INFO +426 -0
  9. pandas_match_recognize-0.1.0/Performance/FINAL_VALIDATED_PERFORMANCE_ANALYSIS.md +82 -0
  10. pandas_match_recognize-0.1.0/Performance/enhanced_latex_generator.py +921 -0
  11. pandas_match_recognize-0.1.0/Performance/enhanced_pattern_benchmark.py +1091 -0
  12. pandas_match_recognize-0.1.0/Performance/enhanced_visualization_generator.py +390 -0
  13. pandas_match_recognize-0.1.0/Performance/pattern_focused_benchmark.py +511 -0
  14. pandas_match_recognize-0.1.0/Python_Examples/python_udf.ipynb +503 -0
  15. pandas_match_recognize-0.1.0/README.md +377 -0
  16. pandas_match_recognize-0.1.0/Test_grammar.ipynb +1638 -0
  17. pandas_match_recognize-0.1.0/Untitled-1.ipynb +210 -0
  18. pandas_match_recognize-0.1.0/Untitled-2.ipynb +266 -0
  19. pandas_match_recognize-0.1.0/__init__.py +8 -0
  20. pandas_match_recognize-0.1.0/__pycache__/__init__.cpython-312.pyc +0 -0
  21. pandas_match_recognize-0.1.0/docs/README_NAVIGATION_FIX.md +105 -0
  22. pandas_match_recognize-0.1.0/docs/debugging_guide.md +471 -0
  23. pandas_match_recognize-0.1.0/docs/navigation_enhancements.md +260 -0
  24. pandas_match_recognize-0.1.0/docs/navigation_function_fix.md +134 -0
  25. pandas_match_recognize-0.1.0/docs/nested_navigation.md +127 -0
  26. pandas_match_recognize-0.1.0/docs/pattern_cache_deployment.md +324 -0
  27. pandas_match_recognize-0.1.0/docs/pattern_caching.md +174 -0
  28. pandas_match_recognize-0.1.0/main.py +62 -0
  29. pandas_match_recognize-0.1.0/match_recognize/__init__.py +66 -0
  30. pandas_match_recognize-0.1.0/match_recognize/__pycache__/__init__.cpython-312.pyc +0 -0
  31. pandas_match_recognize-0.1.0/match_recognize.py +29 -0
  32. pandas_match_recognize-0.1.0/medi.ipynb +538 -0
  33. pandas_match_recognize-0.1.0/pandas_match_recognize/__init__.py +66 -0
  34. pandas_match_recognize-0.1.0/pandas_match_recognize.egg-info/PKG-INFO +426 -0
  35. pandas_match_recognize-0.1.0/pandas_match_recognize.egg-info/SOURCES.txt +183 -0
  36. pandas_match_recognize-0.1.0/pandas_match_recognize.egg-info/dependency_links.txt +1 -0
  37. pandas_match_recognize-0.1.0/pandas_match_recognize.egg-info/requires.txt +16 -0
  38. pandas_match_recognize-0.1.0/pandas_match_recognize.egg-info/top_level.txt +2 -0
  39. pandas_match_recognize-0.1.0/production_scale_test.py +337 -0
  40. pandas_match_recognize-0.1.0/pyproject.toml +63 -0
  41. pandas_match_recognize-0.1.0/requirements.txt +127 -0
  42. pandas_match_recognize-0.1.0/setup.cfg +4 -0
  43. pandas_match_recognize-0.1.0/setup.py +70 -0
  44. pandas_match_recognize-0.1.0/src/TestAggregationsInRowPatternMatching.java +1270 -0
  45. pandas_match_recognize-0.1.0/src/TestRowPatternMatching.java +1561 -0
  46. pandas_match_recognize-0.1.0/src/__init__.py +11 -0
  47. pandas_match_recognize-0.1.0/src/__pycache__/__init__.cpython-312.pyc +0 -0
  48. pandas_match_recognize-0.1.0/src/ast_nodes/__init__.py +5 -0
  49. pandas_match_recognize-0.1.0/src/ast_nodes/__pycache__/__init__.cpython-312.pyc +0 -0
  50. pandas_match_recognize-0.1.0/src/ast_nodes/__pycache__/ast_nodes.cpython-312.pyc +0 -0
  51. pandas_match_recognize-0.1.0/src/ast_nodes/ast_nodes.py +678 -0
  52. pandas_match_recognize-0.1.0/src/config/__init__.py +1 -0
  53. pandas_match_recognize-0.1.0/src/config/__pycache__/__init__.cpython-312.pyc +0 -0
  54. pandas_match_recognize-0.1.0/src/config/__pycache__/production_config.cpython-312.pyc +0 -0
  55. pandas_match_recognize-0.1.0/src/config/production_config.py +385 -0
  56. pandas_match_recognize-0.1.0/src/executor/__init__.py +0 -0
  57. pandas_match_recognize-0.1.0/src/executor/__pycache__/__init__.cpython-312.pyc +0 -0
  58. pandas_match_recognize-0.1.0/src/executor/__pycache__/match_recognize.cpython-312.pyc +0 -0
  59. pandas_match_recognize-0.1.0/src/executor/match_recognize.py +2302 -0
  60. pandas_match_recognize-0.1.0/src/grammar/TrinoLexer.g4 +398 -0
  61. pandas_match_recognize-0.1.0/src/grammar/TrinoLexer.interp +1040 -0
  62. pandas_match_recognize-0.1.0/src/grammar/TrinoLexer.py +1711 -0
  63. pandas_match_recognize-0.1.0/src/grammar/TrinoLexer.tokens +665 -0
  64. pandas_match_recognize-0.1.0/src/grammar/TrinoParser.g4 +1124 -0
  65. pandas_match_recognize-0.1.0/src/grammar/TrinoParser.interp +817 -0
  66. pandas_match_recognize-0.1.0/src/grammar/TrinoParser.py +23363 -0
  67. pandas_match_recognize-0.1.0/src/grammar/TrinoParser.tokens +665 -0
  68. pandas_match_recognize-0.1.0/src/grammar/TrinoParserListener.py +2973 -0
  69. pandas_match_recognize-0.1.0/src/grammar/TrinoParserVisitor.py +1658 -0
  70. pandas_match_recognize-0.1.0/src/grammar/__init__.py +0 -0
  71. pandas_match_recognize-0.1.0/src/grammar/__pycache__/TrinoLexer.cpython-312.pyc +0 -0
  72. pandas_match_recognize-0.1.0/src/grammar/__pycache__/TrinoParser.cpython-312.pyc +0 -0
  73. pandas_match_recognize-0.1.0/src/grammar/__pycache__/TrinoParserVisitor.cpython-312.pyc +0 -0
  74. pandas_match_recognize-0.1.0/src/grammar/__pycache__/__init__.cpython-312.pyc +0 -0
  75. pandas_match_recognize-0.1.0/src/matcher/__init__.py +0 -0
  76. pandas_match_recognize-0.1.0/src/matcher/__pycache__/__init__.cpython-312.pyc +0 -0
  77. pandas_match_recognize-0.1.0/src/matcher/__pycache__/automata.cpython-312.pyc +0 -0
  78. pandas_match_recognize-0.1.0/src/matcher/__pycache__/condition_evaluator.cpython-312.pyc +0 -0
  79. pandas_match_recognize-0.1.0/src/matcher/__pycache__/dfa.cpython-312.pyc +0 -0
  80. pandas_match_recognize-0.1.0/src/matcher/__pycache__/evaluation_utils.cpython-312.pyc +0 -0
  81. pandas_match_recognize-0.1.0/src/matcher/__pycache__/matcher.cpython-312.pyc +0 -0
  82. pandas_match_recognize-0.1.0/src/matcher/__pycache__/measure_evaluator.cpython-312.pyc +0 -0
  83. pandas_match_recognize-0.1.0/src/matcher/__pycache__/pattern_tokenizer.cpython-312.pyc +0 -0
  84. pandas_match_recognize-0.1.0/src/matcher/__pycache__/production_aggregates.cpython-312.pyc +0 -0
  85. pandas_match_recognize-0.1.0/src/matcher/__pycache__/row_context.cpython-312.pyc +0 -0
  86. pandas_match_recognize-0.1.0/src/matcher/automata.py +3838 -0
  87. pandas_match_recognize-0.1.0/src/matcher/condition_evaluator.py +2915 -0
  88. pandas_match_recognize-0.1.0/src/matcher/dfa.py +1145 -0
  89. pandas_match_recognize-0.1.0/src/matcher/evaluation_utils.py +592 -0
  90. pandas_match_recognize-0.1.0/src/matcher/matcher.py +6946 -0
  91. pandas_match_recognize-0.1.0/src/matcher/measure_evaluator.py +2317 -0
  92. pandas_match_recognize-0.1.0/src/matcher/pattern_tokenizer.py +1414 -0
  93. pandas_match_recognize-0.1.0/src/matcher/production_aggregates.py +2293 -0
  94. pandas_match_recognize-0.1.0/src/matcher/row_context.py +1781 -0
  95. pandas_match_recognize-0.1.0/src/monitoring/__init__.py +13 -0
  96. pandas_match_recognize-0.1.0/src/monitoring/__pycache__/__init__.cpython-312.pyc +0 -0
  97. pandas_match_recognize-0.1.0/src/monitoring/__pycache__/cache_monitor.cpython-312.pyc +0 -0
  98. pandas_match_recognize-0.1.0/src/monitoring/cache_monitor.py +191 -0
  99. pandas_match_recognize-0.1.0/src/monitoring/health_check.py +309 -0
  100. pandas_match_recognize-0.1.0/src/monitoring/production_logging.py +403 -0
  101. pandas_match_recognize-0.1.0/src/parser/__init__.py +0 -0
  102. pandas_match_recognize-0.1.0/src/parser/__pycache__/__init__.cpython-312.pyc +0 -0
  103. pandas_match_recognize-0.1.0/src/parser/__pycache__/error_listeners.cpython-312.pyc +0 -0
  104. pandas_match_recognize-0.1.0/src/parser/__pycache__/match_recognize_extractor.cpython-312.pyc +0 -0
  105. pandas_match_recognize-0.1.0/src/parser/error_listeners.py +22 -0
  106. pandas_match_recognize-0.1.0/src/parser/match_recognize_extractor.py +1194 -0
  107. pandas_match_recognize-0.1.0/src/parser/query_parser.py +4 -0
  108. pandas_match_recognize-0.1.0/src/pattern/__init__.py +12 -0
  109. pandas_match_recognize-0.1.0/src/pattern/__pycache__/permute_handler.cpython-312.pyc +0 -0
  110. pandas_match_recognize-0.1.0/src/pattern/permute_handler.py +731 -0
  111. pandas_match_recognize-0.1.0/src/utils/__init__.py +23 -0
  112. pandas_match_recognize-0.1.0/src/utils/__pycache__/__init__.cpython-312.pyc +0 -0
  113. pandas_match_recognize-0.1.0/src/utils/__pycache__/logging_config.cpython-312.pyc +0 -0
  114. pandas_match_recognize-0.1.0/src/utils/__pycache__/memory_management.cpython-312.pyc +0 -0
  115. pandas_match_recognize-0.1.0/src/utils/__pycache__/pattern_cache.cpython-312.pyc +0 -0
  116. pandas_match_recognize-0.1.0/src/utils/__pycache__/performance_optimizer.cpython-312.pyc +0 -0
  117. pandas_match_recognize-0.1.0/src/utils/logging_config.py +723 -0
  118. pandas_match_recognize-0.1.0/src/utils/memory_management.py +800 -0
  119. pandas_match_recognize-0.1.0/src/utils/pattern_cache.py +908 -0
  120. pandas_match_recognize-0.1.0/src/utils/performance_optimizer.py +2846 -0
  121. pandas_match_recognize-0.1.0/src/utils/production_logging.py +448 -0
  122. pandas_match_recognize-0.1.0/test1_workon.ipynb +6379 -0
  123. pandas_match_recognize-0.1.0/test_NFA.ipynb +1780 -0
  124. pandas_match_recognize-0.1.0/test_parsing_v1.ipynb +7063 -0
  125. pandas_match_recognize-0.1.0/test_requirements.txt +4 -0
  126. pandas_match_recognize-0.1.0/tests/__init__.py +1 -0
  127. pandas_match_recognize-0.1.0/tests/__pycache__/__init__.cpython-312.pyc +0 -0
  128. pandas_match_recognize-0.1.0/tests/__pycache__/conftest.cpython-312-pytest-8.3.4.pyc +0 -0
  129. pandas_match_recognize-0.1.0/tests/__pycache__/test_anchor_patterns.cpython-312-pytest-8.3.4.pyc +0 -0
  130. pandas_match_recognize-0.1.0/tests/__pycache__/test_back_reference.cpython-312-pytest-8.3.4.pyc +0 -0
  131. pandas_match_recognize-0.1.0/tests/__pycache__/test_case_sensitivity.cpython-312-pytest-8.3.4.pyc +0 -0
  132. pandas_match_recognize-0.1.0/tests/__pycache__/test_complete_java_reference.cpython-312-pytest-8.3.4.pyc +0 -0
  133. pandas_match_recognize-0.1.0/tests/__pycache__/test_empty_cycle.cpython-312-pytest-8.3.4.pyc +0 -0
  134. pandas_match_recognize-0.1.0/tests/__pycache__/test_empty_matches.cpython-312-pytest-8.3.4.pyc +0 -0
  135. pandas_match_recognize-0.1.0/tests/__pycache__/test_exponential_protection.cpython-312-pytest-8.3.4.pyc +0 -0
  136. pandas_match_recognize-0.1.0/tests/__pycache__/test_fixed_failing_cases.cpython-312-pytest-8.3.4.pyc +0 -0
  137. pandas_match_recognize-0.1.0/tests/__pycache__/test_in_predicate.cpython-312-pytest-8.3.4.pyc +0 -0
  138. pandas_match_recognize-0.1.0/tests/__pycache__/test_match_recognize.cpython-312-pytest-8.3.4.pyc +0 -0
  139. pandas_match_recognize-0.1.0/tests/__pycache__/test_missing_critical_cases.cpython-312-pytest-8.3.4.pyc +0 -0
  140. pandas_match_recognize-0.1.0/tests/__pycache__/test_multiple_match_recognize.cpython-312-pytest-8.3.4.pyc +0 -0
  141. pandas_match_recognize-0.1.0/tests/__pycache__/test_navigation_and_conditions.cpython-312-pytest-8.3.4.pyc +0 -0
  142. pandas_match_recognize-0.1.0/tests/__pycache__/test_output_layout.cpython-312-pytest-8.3.4.pyc +0 -0
  143. pandas_match_recognize-0.1.0/tests/__pycache__/test_pattern_cache.cpython-312-pytest-8.3.4.pyc +0 -0
  144. pandas_match_recognize-0.1.0/tests/__pycache__/test_pattern_tokenizer.cpython-312-pytest-8.3.4.pyc +0 -0
  145. pandas_match_recognize-0.1.0/tests/__pycache__/test_permute_patterns.cpython-312-pytest-8.3.4.pyc +0 -0
  146. pandas_match_recognize-0.1.0/tests/__pycache__/test_production_aggregates.cpython-312-pytest-8.3.4.pyc +0 -0
  147. pandas_match_recognize-0.1.0/tests/__pycache__/test_scalar_functions.cpython-312-pytest-8.3.4.pyc +0 -0
  148. pandas_match_recognize-0.1.0/tests/__pycache__/test_sql2016_compliance.cpython-312-pytest-8.3.4.pyc +0 -0
  149. pandas_match_recognize-0.1.0/tests/__pycache__/test_sql_parser.cpython-312-pytest-8.3.4.pyc +0 -0
  150. pandas_match_recognize-0.1.0/tests/__pycache__/test_subqueries.cpython-312-pytest-8.3.4.pyc +0 -0
  151. pandas_match_recognize-0.1.0/tests/performance/requirements.txt +53 -0
  152. pandas_match_recognize-0.1.0/tests/test_advanced_aggregation_scenarios.py +489 -0
  153. pandas_match_recognize-0.1.0/tests/test_aggregation_fixes.py +491 -0
  154. pandas_match_recognize-0.1.0/tests/test_aggregation_integration.py +444 -0
  155. pandas_match_recognize-0.1.0/tests/test_aggregation_performance.py +387 -0
  156. pandas_match_recognize-0.1.0/tests/test_anchor_patterns.py +327 -0
  157. pandas_match_recognize-0.1.0/tests/test_back_reference.py +245 -0
  158. pandas_match_recognize-0.1.0/tests/test_case_sensitivity.py +172 -0
  159. pandas_match_recognize-0.1.0/tests/test_complete_java_aggregation_coverage.py +602 -0
  160. pandas_match_recognize-0.1.0/tests/test_complete_java_aggregations.py +479 -0
  161. pandas_match_recognize-0.1.0/tests/test_complete_java_reference.py +719 -0
  162. pandas_match_recognize-0.1.0/tests/test_empty_cycle.py +305 -0
  163. pandas_match_recognize-0.1.0/tests/test_empty_matches.py +348 -0
  164. pandas_match_recognize-0.1.0/tests/test_exponential_protection.py +351 -0
  165. pandas_match_recognize-0.1.0/tests/test_fixed_failing_cases.py +214 -0
  166. pandas_match_recognize-0.1.0/tests/test_in_predicate.py +385 -0
  167. pandas_match_recognize-0.1.0/tests/test_java_aggregations_converted.py +650 -0
  168. pandas_match_recognize-0.1.0/tests/test_match_recognize.py +1134 -0
  169. pandas_match_recognize-0.1.0/tests/test_missing_critical_cases.py +565 -0
  170. pandas_match_recognize-0.1.0/tests/test_missing_java_cases.py +550 -0
  171. pandas_match_recognize-0.1.0/tests/test_multiple_match_recognize.py +351 -0
  172. pandas_match_recognize-0.1.0/tests/test_navigation_and_conditions.py +280 -0
  173. pandas_match_recognize-0.1.0/tests/test_output_layout.py +291 -0
  174. pandas_match_recognize-0.1.0/tests/test_pattern_cache.py +263 -0
  175. pandas_match_recognize-0.1.0/tests/test_pattern_tokenizer.py +184 -0
  176. pandas_match_recognize-0.1.0/tests/test_permute_patterns.py +317 -0
  177. pandas_match_recognize-0.1.0/tests/test_production_aggregates.py +625 -0
  178. pandas_match_recognize-0.1.0/tests/test_production_aggregations.py +756 -0
  179. pandas_match_recognize-0.1.0/tests/test_scalar_functions.py +259 -0
  180. pandas_match_recognize-0.1.0/tests/test_sql2016_compliance.py +607 -0
  181. pandas_match_recognize-0.1.0/tests/test_subqueries.py +338 -0
  182. pandas_match_recognize-0.1.0/tests/test_utils.py +243 -0
  183. pandas_match_recognize-0.1.0/tr1_paper.ipynb +10243 -0
  184. pandas_match_recognize-0.1.0/trino_data.ipynb +405 -0
  185. pandas_match_recognize-0.1.0/trino_test_replication.py +842 -0
@@ -0,0 +1,122 @@
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+
6
+ # C extensions
7
+ *.so
8
+
9
+ # Distribution / packaging
10
+ .Python
11
+ build/
12
+ develop-eggs/
13
+ dist/
14
+ downloads/
15
+ eggs/
16
+ .eggs/
17
+ lib/
18
+ lib64/
19
+ parts/
20
+ sdist/
21
+ var/
22
+ wheels/
23
+ pip-wheel-metadata/
24
+ share/python-wheels/
25
+ *.egg-info/
26
+ .installed.cfg
27
+ *.egg
28
+ MANIFEST
29
+
30
+ # PyInstaller
31
+ # Usually these files are written by a python script from a template
32
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
33
+ *.manifest
34
+ *.spec
35
+
36
+ # Installer logs
37
+ pip-log.txt
38
+ pip-delete-this-directory.txt
39
+
40
+ # Unit test / coverage reports
41
+ htmlcov/
42
+ .tox/
43
+ .nox/
44
+ .coverage
45
+ .coverage.*
46
+ .cache
47
+ nosetests.xml
48
+ coverage.xml
49
+ *.cover
50
+ *.py,cover
51
+ .hypothesis/
52
+ .pytest_cache/
53
+
54
+ # Jupyter Notebook
55
+ .ipynb_checkpoints
56
+
57
+ # IPython
58
+ profile_default/
59
+ ipython_config.py
60
+
61
+ # pyenv
62
+ .python-version
63
+
64
+ # pipenv
65
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
66
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
67
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
68
+ # install all needed dependencies.
69
+ #Pipfile.lock
70
+
71
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
72
+ __pypackages__/
73
+
74
+ # Celery stuff
75
+ celerybeat-schedule
76
+ celerybeat.pid
77
+
78
+ # SageMath parsed files
79
+ *.sage.py
80
+
81
+ # Environments
82
+ .env
83
+ .venv
84
+ env/
85
+ venv/
86
+ ENV/
87
+ env.bak/
88
+ venv.bak/
89
+
90
+ # Spyder project settings
91
+ .spyderproject
92
+ .spyproject
93
+
94
+ # Rope project settings
95
+ .ropeproject
96
+
97
+ # mkdocs documentation
98
+ /site
99
+
100
+ # mypy
101
+ .mypy_cache/
102
+ .dmypy.json
103
+ dmypy.json
104
+
105
+ # Pyre type checker
106
+ .pyre/
107
+
108
+ # IDE
109
+ .vscode/
110
+ .idea/
111
+ *.swp
112
+ *.swo
113
+ *~
114
+
115
+ # OS
116
+ .DS_Store
117
+ Thumbs.db
118
+
119
+ # Project specific
120
+ *.log
121
+ temp/
122
+ tmp/
@@ -0,0 +1,3 @@
1
+ {
2
+ "CodeGPT.apiKey": "CodeGPT Plus Beta"
3
+ }
@@ -0,0 +1,490 @@
1
+ 1. SQL Parser Module (Using a Parser Generator)
2
+
3
+ // Define full MATCH_RECOGNIZE grammar (using ANTLR, for example)
4
+ // Grammar covers:
5
+ // - PARTITION BY clause (list of columns)
6
+ // - ORDER BY clause (column names with ASC/DESC)
7
+ // - MEASURES clause (list of measure expressions)
8
+ // - PATTERN clause (row pattern with operators: concatenation, alternation, grouping, quantifiers, exclusions)
9
+ // - SUBSET clause (mapping subset names to list of pattern variables)
10
+ // - DEFINE clause (list of variable definitions with expressions)
11
+ // - AFTER MATCH SKIP clause (options: PAST LAST ROW, TO NEXT ROW, TO FIRST <var>, TO LAST <var>)
12
+
13
+ SQL MATCH_RECOGNIZE
14
+ PARTITION BY <column_list>
15
+ ORDER BY <column_list>
16
+ MEASURES <measure_list>
17
+ AFTER MATCH SKIP <skip_option>
18
+ PATTERN (<pattern>)
19
+ SUBSET <subset_mapping>
20
+ DEFINE <variable_definitions>
21
+
22
+ ---
23
+ AST Generation: Occurs immediately after parsing during the transformation phase.
24
+ Notes:
25
+ – The AST should capture every clause as a node.
26
+ – It must resolve subset expansions (e.g. if SUBSET U = (A, B), then any occurrence of U in PATTERN is replaced with (A | B)).
27
+
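The subset expansion described in the note above can be sketched in a few lines of Python. This is a minimal illustration that works on the pattern as a plain string; the function name `expand_subsets` and the dictionary shape are assumptions for the example, not names taken from the package.

```python
import re

def expand_subsets(pattern: str, subsets: dict[str, list[str]]) -> str:
    """Replace each subset name in a pattern string with an alternation group."""
    if not subsets:
        return pattern
    # Word boundaries keep a subset named "U" from matching inside "UP".
    token = re.compile(r"\b(" + "|".join(map(re.escape, subsets)) + r")\b")
    return token.sub(
        lambda m: "(" + " | ".join(subsets[m.group(1)]) + ")",
        pattern,
    )

print(expand_subsets("A U+ C", {"U": ["A", "B"]}))
```

A string rewrite like this is only a stand-in; an AST-based version would replace the subset node with an alternation node instead.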
28
+ Parsing MATCH_RECOGNIZE Queries (Extract all components)
29
+ Validating MATCH_RECOGNIZE Queries (Check for missing parts, syntax errors)
30
+ Transforming MATCH_RECOGNIZE Queries (Rewrite patterns, optimize SQL)
31
+ ✔ Extracted all MATCH_RECOGNIZE components
32
+ ✔ Validated MATCH_RECOGNIZE queries for correctness
33
+ ✔ Transformed MATCH_RECOGNIZE patterns
34
+
35
+
36
+ ✅ 1. Extract MATCH_RECOGNIZE Components (PARTITION, MEASURES, PATTERN, DEFINE, SUBSET)
37
+ ✅ 2. Validate MATCH_RECOGNIZE Queries (Check for missing parts, errors)
38
+ ✅ 3. Transform MATCH_RECOGNIZE Queries (Modify, optimize, or rewrite queries)
39
+ ✔ Extracted all MATCH_RECOGNIZE components
40
+ ✔ Validated MATCH_RECOGNIZE queries for correctness
41
+ ✔ Transformed MATCH_RECOGNIZE patterns to optimize queries
42
+
43
+ complex pattern transformations?
44
+ Generating optimized MATCH_RECOGNIZE queries dynamically?
45
+
46
+ A full-fledged expression parser that produces a detailed sub-AST.
47
+ A more advanced, structured AST for row patterns (handling full regex-like syntax).
48
+ Automatic subset expansion in the AST.
49
+ Complex pattern transformations and dynamic query optimization.
50
+ Deeper semantic validation of expressions and function calls.
51
+ Replace the stub expression parser with a fully featured parser if your expression syntax is complex.
52
+ Expand the pattern parser to cover more regex-like constructs.
53
+ Write comprehensive unit and integration tests to cover edge cases in expressions and patterns.
54
+ Add semantic validation logic for checking column existence, data type matching, and function support (which might require integration with your schema metadata).
55
+
56
+
57
+
58
+
59
+
60
+ NFA/DFA generation for efficient pattern matching (execution phase).
61
+
62
+ vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
63
+
64
+
65
+ NFA/DFA Generation: Happens during the execution phase (in the engine module) when the canonical row pattern is compiled into a state machine for efficient matching.
66
+
67
+ function parseSQL(query: String) -> AST
68
+ // Use a parser generator (e.g., ANTLR) to create a parse tree from query.
69
+ ast = generateAST(query)
70
+ validateAST(ast) // e.g., check that every pattern variable in PATTERN has a definition
71
+ return ast
72
+
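As an illustration of the `validateAST` check mentioned above, the following Python sketch lists pattern variables that have no DEFINE entry. The helper names and the string-based pattern input are assumptions made for the example. The result is returned rather than raised, since an engine may prefer to treat an undefined variable as always true.

```python
import re

# Keywords that may appear in a pattern but are not variables (assumed list).
PATTERN_KEYWORDS = {"PERMUTE"}

def pattern_variables(pattern: str) -> set[str]:
    """Collect identifier tokens from a row pattern string."""
    names = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", pattern))
    return names - PATTERN_KEYWORDS

def undefined_variables(pattern: str, defined: set[str]) -> set[str]:
    """Return the pattern variables that have no DEFINE condition."""
    return pattern_variables(pattern) - defined

print(undefined_variables("A B+ (C | D){2,3}", {"A", "B", "C"}))
```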
73
+ Walking the Parse Tree
74
+ If you want to extract information (e.g., table names, column names, patterns), you can use a Listener or Visitor:
75
+ Validating Queries
76
+
77
+ Once the query parses successfully, you can build tooling on the parse tree:
78
+ custom query validation, rewriting, or analysis.
79
+
80
+ NFA/DFA Generation
81
+
82
+ State Machine Construction:
83
+ The process of compiling the canonical row pattern (from the AST) into a nondeterministic or deterministic finite automaton for efficient execution is planned for the engine module and has not been implemented in the current phase.
84
+
85
+
86
+
87
+ 2. Expression Evaluator Module
88
+ // The expression evaluator handles both boolean and arithmetic expressions.
89
+ // It supports navigation functions such as PREV(), NEXT(), FIRST(), LAST().
90
+ // It converts the expression into an Intermediate Representation (IR) for efficient repeated evaluation.
91
+
92
+ function compileExpression(expr: String) -> ExpressionIR
93
+ // Parse the expression using a recursive descent parser or a tool like ANTLR.
94
+ // Build an IR (e.g., an abstract syntax tree) that represents the expression.
95
+ ir = parseAndBuildIR(expr)
96
+ return ir
97
+
98
+ function evaluateExpression(ir: ExpressionIR, context: Map<String, Any>) -> Any
99
+ // Recursively evaluate the IR using context (which might include:
100
+ // - The current row
101
+ // - The entire match (for FINAL semantics)
102
+ // - Pattern variable-specific row lists)
103
+ // Example: If the IR represents PREV(A.price, 2), use context to look up the 2nd previous row for variable A.
104
+ if ir is ConstantNode:
105
+ return ir.value
106
+ else if ir is VariableNode:
107
+ return context[ir.name]
108
+ else if ir is BinaryOpNode:
109
+ left = evaluateExpression(ir.left, context)
110
+ right = evaluateExpression(ir.right, context)
111
+ return applyOperator(ir.operator, left, right)
112
+ else if ir is FunctionCallNode:
113
+ // For navigation functions:
114
+ if ir.functionName equals "PREV":
115
+ varName = ir.arguments[0]
116
+ offset = (ir.arguments[1] if present else 1)
117
+ return getPreviousRowValue(context, varName, offset)
118
+ else if ir.functionName equals "NEXT":
119
+ // Similarly for NEXT
120
+ // Add support for FIRST, LAST, CLASSIFIER, MATCH_NUMBER as needed.
121
+ else:
122
+ throw ExpressionEvaluationError
123
+
124
+ Notes:
125
+ – The context is built from the current match state (e.g., a list of rows for each variable).
126
+ – The evaluator should distinguish running vs. final contexts.
127
+
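The recursive evaluation outlined above can be sketched in Python as follows. The node classes and the `evaluate` signature are hypothetical. To keep the example short, `PREV` here looks at the physically preceding row in the list, instead of the per-variable row lists the notes describe.

```python
import operator
from dataclasses import dataclass
from typing import Any

@dataclass
class Constant:
    value: Any

@dataclass
class Column:
    name: str

@dataclass
class Prev:
    column: str
    offset: int = 1

@dataclass
class BinaryOp:
    op: str
    left: Any
    right: Any

OPERATORS = {
    "+": operator.add, "-": operator.sub, "*": operator.mul,
    ">": operator.gt, "<": operator.lt, "=": operator.eq,
}

def evaluate(node, rows, index):
    """Evaluate an IR node against rows[index]; None stands in for SQL NULL."""
    if isinstance(node, Constant):
        return node.value
    if isinstance(node, Column):
        return rows[index][node.name]
    if isinstance(node, Prev):
        target = index - node.offset
        return rows[target][node.column] if target >= 0 else None
    if isinstance(node, BinaryOp):
        left = evaluate(node.left, rows, index)
        right = evaluate(node.right, rows, index)
        if left is None or right is None:
            return None  # a NULL operand yields NULL
        return OPERATORS[node.op](left, right)
    raise TypeError(f"unsupported IR node: {node!r}")

rows = [{"price": 10}, {"price": 12}, {"price": 11}]
rising = BinaryOp(">", Column("price"), Prev("price"))
print([evaluate(rising, rows, i) for i in range(len(rows))])
```

Because the IR is built once and reused for every row, the parsing cost is paid only once per expression.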
128
+
129
+
130
+
131
+ 3. Pattern Matching Engine (NFA/DFA)
132
+ // The engine receives the pattern AST from the SQL parser.
133
+ // It constructs an NFA that represents the row pattern.
134
+ // Optionally, it converts the NFA to a DFA for performance.
135
+
136
+ function buildNFA(patternAST: ASTNode) -> NFA
137
+ // Recursively traverse the pattern AST:
138
+ // - For a concatenation node, link the NFAs of the children in sequence.
139
+ // - For an alternation node (e.g., A | B), create a new start state with epsilon transitions to the NFAs for each alternative,
140
+ // then combine their accepting states with epsilon transitions to a new accepting state.
141
+ // - For a grouping node, simply build the NFA for its contents.
142
+ // - For a quantifier node, build the NFA for the base pattern and then add loops or unroll as necessary:
143
+ // * For '*' (zero or more), add an epsilon transition from the start state to the accepting state, plus a loop back from the accepting state to the start.
144
+ // * For '+' (one or more), require one occurrence and then loop.
145
+ // * For '{n, m}', unroll n mandatory copies of the base pattern, then append m - n optional copies (each can be skipped via an epsilon transition).
146
+ // - For an exclusion node, mark the subpattern to be excluded in output (the NFA should consume the row but not output it).
147
+ nfa = recursivelyBuildNFA(patternAST)
148
+ return nfa
149
+
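A compact Python sketch of the recursive construction described in `buildNFA` follows, using a Thompson-style build. The tuple-based pattern AST (`"var"`, `"concat"`, `"alt"`, `"star"`, `"plus"`) and all names are assumptions for illustration. The input is a list of labels that have already been classified, so the DEFINE conditions are left out.

```python
from itertools import count

class NFA:
    """Epsilon-NFA: edges[state] is a list of (label, target); label None is epsilon."""
    def __init__(self):
        self.edges = {}
        self._ids = count()

    def new_state(self):
        state = next(self._ids)
        self.edges[state] = []
        return state

    def add(self, src, label, dst):
        self.edges[src].append((label, dst))

def build(nfa, node):
    """Return (start, accept) for the fragment that matches `node`."""
    kind = node[0]
    start, accept = nfa.new_state(), nfa.new_state()
    if kind == "var":
        nfa.add(start, node[1], accept)
    elif kind == "concat":
        first_start, first_accept = build(nfa, node[1])
        second_start, second_accept = build(nfa, node[2])
        nfa.add(start, None, first_start)
        nfa.add(first_accept, None, second_start)
        nfa.add(second_accept, None, accept)
    elif kind == "alt":
        for child in node[1:]:
            child_start, child_accept = build(nfa, child)
            nfa.add(start, None, child_start)
            nfa.add(child_accept, None, accept)
    elif kind in ("star", "plus"):
        inner_start, inner_accept = build(nfa, node[1])
        nfa.add(start, None, inner_start)
        nfa.add(inner_accept, None, inner_start)  # loop for another repetition
        nfa.add(inner_accept, None, accept)
        if kind == "star":
            nfa.add(start, None, accept)          # zero repetitions allowed
    else:
        raise ValueError(f"unknown pattern node: {kind!r}")
    return start, accept

def closure(nfa, states):
    """All states reachable from `states` through epsilon edges only."""
    stack, seen = list(states), set(states)
    while stack:
        for label, dst in nfa.edges[stack.pop()]:
            if label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(node, labels):
    nfa = NFA()
    start, accept = build(nfa, node)
    current = closure(nfa, {start})
    for label in labels:
        moved = {dst for s in current for edge, dst in nfa.edges[s] if edge == label}
        current = closure(nfa, moved)
    return accept in current

pattern = ("concat", ("var", "A"), ("plus", ("alt", ("var", "B"), ("var", "C"))))
print(accepts(pattern, ["A", "B", "C"]))
```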
150
+ function convertNFAtoDFA(nfa: NFA) -> DFA
151
+ // Use subset construction:
152
+ // - Each DFA state is a set of NFA states (epsilon closure included).
153
+ // - Build transitions for each input symbol.
154
+ // - Minimize the DFA (optional, but beneficial for performance).
155
+ dfa = subsetConstruction(nfa)
156
+ return dfa
157
+
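The subset construction named in `convertNFAtoDFA` can be sketched in Python as below. The NFA is a plain dictionary of `(label, target)` edges with `None` as epsilon; that layout and the function names are assumptions. The sketch also assumes each row maps to exactly one label, which may not hold when DEFINE conditions depend on the match history.

```python
def epsilon_closure(edges, states):
    """All NFA states reachable from `states` through epsilon edges only."""
    stack, seen = list(states), set(states)
    while stack:
        for label, dst in edges.get(stack.pop(), []):
            if label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)

def subset_construction(edges, start, accept):
    """Return (transitions, dfa_start, accepting) where DFA states are frozensets."""
    dfa_start = epsilon_closure(edges, {start})
    transitions = {}
    pending = [dfa_start]
    while pending:
        state = pending.pop()
        if state in transitions:
            continue
        transitions[state] = {}
        labels = {lab for s in state for lab, _ in edges.get(s, []) if lab is not None}
        for label in labels:
            moved = {dst for s in state for lab, dst in edges.get(s, []) if lab == label}
            target = epsilon_closure(edges, moved)
            transitions[state][label] = target
            pending.append(target)
    accepting = {s for s in transitions if accept in s}
    return transitions, dfa_start, accepting

def run(transitions, dfa_start, accepting, labels):
    state = dfa_start
    for label in labels:
        state = transitions[state].get(label)
        if state is None:
            return False
    return state in accepting

# Hand-written NFA for the pattern A B*: 0 -A-> 1, 1 -eps-> 2, 2 -B-> 1; accept = 2.
nfa_edges = {0: [("A", 1)], 1: [(None, 2)], 2: [("B", 1)]}
dfa = subset_construction(nfa_edges, start=0, accept=2)
print(run(*dfa, ["A", "B", "B"]))
```

Minimization is left out; for small row patterns the unminimized DFA is usually small already.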
158
+ function matchPattern(dfaOrNFA: StateMachine, rows: List<Row>, conditions: Map<String, ExpressionIR>, context: MatchContext) -> List<Match>
159
+ // Iterate over rows in the partition:
160
+ matches = []
161
+ i = 0
162
+ while i < length(rows):
163
+ context.resetMatch()
164
+ j = i
165
+ while j < length(rows) and stateMachineCanAdvance(dfaOrNFA, rows[j]):
166
+ // For each transition, evaluate the corresponding DEFINE condition:
167
+ for each expected pattern variable in currentTransition:
168
+ if not evaluateCondition(conditions[patternVariable], rows[j], context.currentMatch):
169
+ break out and try next row
170
+ context.addRow(rows[j])
171
+ if stateMachineReachedAcceptingState(dfaOrNFA):
172
+ matches.add(context.currentMatch)
173
+ break
174
+ j = j + 1
175
+ // Advance i based on AFTER MATCH SKIP strategy (e.g., i = i + 1 or i = indexAfterLastRow)
176
+ i = updateStartIndex(i, context, dfaOrNFA, context.afterMatch)
177
+ return matches
178
+ Notes:
179
+ – The NFA/DFA construction uses well-known techniques.
180
+ – The matching loop evaluates the conditions (from the DEFINE clause) for each row as it transitions between states.
181
+ – The MatchContext is updated with which pattern variable each row matched (for use in navigation functions and measure evaluation).
182
+ – The skip strategy (e.g., AFTER MATCH SKIP TO NEXT ROW) must be integrated here, in the logic that advances the start index.
183
+
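One possible shape for the start-index update driven by AFTER MATCH SKIP is sketched below in Python. The function name, the mode strings, and the `(row_index, variable)` match representation are assumptions. Raising an error when the target is the first row of the match is a design choice in this sketch that guards against restarting at the same row forever.

```python
def next_start_index(match_start, labeled_rows, mode, target=None):
    """Return the row index where the next match attempt should begin.

    labeled_rows is a list of (row_index, variable) pairs in match order.
    """
    if not labeled_rows:
        return match_start + 1                 # empty match: move on by one row
    if mode == "PAST LAST ROW":
        return labeled_rows[-1][0] + 1         # non-overlapping matches
    if mode == "TO NEXT ROW":
        return match_start + 1                 # overlapping matches allowed
    if mode in ("TO FIRST", "TO LAST"):
        hits = [idx for idx, var in labeled_rows if var == target]
        if not hits:
            raise ValueError(f"variable {target!r} is not present in the match")
        chosen = hits[0] if mode == "TO FIRST" else hits[-1]
        if chosen <= match_start:
            raise ValueError("skip target is the first row of the match")
        return chosen
    raise ValueError(f"unknown skip mode: {mode!r}")

match = [(3, "A"), (4, "B"), (5, "B"), (6, "C")]
print(next_start_index(3, match, "TO LAST", "B"))
```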
184
+
185
+
186
+
187
+ 4. Execution Engine
188
+
189
+
190
+ function runMatchRecognize(query: String, inputData: DataFrame) -> DataFrame
191
+ // Step 1: Parse SQL to get the AST
192
+ ast = parseSQL(query)
193
+
194
+ // Step 2: Extract clauses from AST:
195
+ partitionCols = ast.getPartitionByColumns()
196
+ orderCols = ast.getOrderByColumns() // Each element: (column, direction)
197
+ measures = ast.getMeasures() // Map measure alias -> (function, argument)
198
+ patternAST = ast.getPatternAST()
199
+ conditionsExpr = ast.getDefineConditions() // Map variable -> condition expression (as string)
200
+ subsets = ast.getSubsetMapping() // e.g., { "U": ["A", "B"] }
201
+ afterMatchOption = ast.getAfterMatchOption() // e.g., "SKIP TO NEXT ROW"
202
+ rowPerMatchOption = ast.getRowPerMatchOption() // "ONE" or "ALL"
203
+
204
+ // Step 3: Compile measure expressions and DEFINE conditions
205
+ compiledMeasures = {}
206
+ for alias, (func, arg) in measures:
207
+ if arg is not '*' then:
208
+ compiledMeasures[alias] = (func, compileExpression(arg))
209
+ else:
210
+ compiledMeasures[alias] = (func, arg)
211
+
212
+ compiledConditions = {}
213
+ for variable, condition in conditionsExpr:
214
+ compiledConditions[variable] = compileExpression(condition)
215
+
216
+ // Step 4: Preprocess pattern (expand subset tokens)
217
+ patternString = ast.getPatternString()
218
+ if subsets is not empty:
219
+ patternString = expandSubsets(patternString, subsets)
220
+
221
+ // Step 5: Partition input data
222
+ partitions = partitionData(inputData, partitionCols)
223
+
224
+ outputMatches = []
225
+ for each partition in partitions:
226
+ sortedRows = sortRows(partition, orderCols) // Use type-aware sorting with ASC/DESC
227
+ // Build NFA from pattern AST
228
+ nfa = buildNFA(patternAST)
229
+ // Optionally convert to DFA: dfa = convertNFAtoDFA(nfa)
230
+ // For each partition, create a new MatchContext
231
+ context = new MatchContext(matchNumber=..., rowPerMatch=rowPerMatchOption, afterMatch=afterMatchOption)
232
+ matches = matchPattern(nfa, sortedRows, compiledConditions, context)
233
+ for match in matches:
234
+ // Evaluate measures for each match
235
+ if rowPerMatchOption == "ONE":
236
+ resultRow = evaluateMeasures(compiledMeasures, match, context, mode="FINAL")
237
+ resultRow.addPartitionColumns(partition.key)
238
+ outputMatches.add(resultRow)
239
+ else if rowPerMatchOption == "ALL":
240
+ // For running measures, compute per row in match.
241
+ finalMeasures = evaluateMeasures(compiledMeasures, match, context, mode="FINAL")
242
+ for i from 0 to length(match)-1:
243
+ runningMeasures = evaluateMeasures(compiledMeasures, match[0..i], context, mode="RUNNING")
244
+ resultRow = merge(match[i], partition.key, runningMeasures, finalMeasures)
245
+ outputMatches.add(resultRow)
246
+
247
+ return DataFrame(outputMatches)
248
+
249
+
250
+ Notes:
251
+ – partitionData() groups rows by partition columns.
252
+ – sortRows() uses order definitions (ASC/DESC) and converts data types appropriately.
253
+ – evaluateMeasures() uses the compiled measure expressions to compute outputs from the match.
254
+ – merge() constructs the output row from partition columns, input row data, and computed measures.
255
+ – The engine keeps track of performance stats and logs diagnostic information.
256
+
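The partition, sort, and ALL ROWS PER MATCH steps above can be illustrated with a standard-library Python sketch. The function names and the single `sum` measure are assumptions. With pandas, the grouping and ordering would more likely use `DataFrame.groupby` and `DataFrame.sort_values`.

```python
from itertools import groupby
from operator import itemgetter

def partition_and_sort(rows, partition_col, order_col, descending=False):
    """Group rows by partition_col, then order each group by order_col."""
    keyed = sorted(rows, key=itemgetter(partition_col))
    return {
        key: sorted(group, key=itemgetter(order_col), reverse=descending)
        for key, group in groupby(keyed, key=itemgetter(partition_col))
    }

def all_rows_output(match, column):
    """Emit one output row per matched row with RUNNING and FINAL sums."""
    final_total = sum(row[column] for row in match)   # FINAL: whole match
    running = 0
    output = []
    for row in match:
        running += row[column]                        # RUNNING: rows so far
        output.append({**row, "running_sum": running, "final_sum": final_total})
    return output

data = [
    {"sym": "X", "t": 2, "price": 5},
    {"sym": "Y", "t": 1, "price": 7},
    {"sym": "X", "t": 1, "price": 3},
]
partitions = partition_and_sort(data, "sym", "t")
result = all_rows_output(partitions["X"], "price")
print(result)
```

For ONE ROW PER MATCH, only the final value would be emitted, once per match.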
257
+
258
+
259
+ 5. Error Handling & Logging
260
+ Throughout every module, add try/catch blocks and logging:
261
+ try:
262
+ // parsing, compiling, matching, or evaluation code
263
+ catch ParseError as pe:
264
+ log.error("Parse error: " + pe.message + " in query: " + querySnippet)
265
+ raise
266
+ catch ExpressionEvaluationError as ee:
267
+ log.error("Expression evaluation failed: " + ee.message + " in context: " + contextInfo)
268
+ // Optionally, return an error code or fallback
269
+ catch PatternMatchError as pme:
270
+ log.error("Pattern matching error: " + pme.message)
271
+ // Continue processing or abort based on severity
272
+
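In Python, the try/catch outline above might look like the following sketch. The exception class names come from the pseudocode. The `run_stage` wrapper and its `fatal` flag are hypothetical additions that make the "continue or abort" choice explicit.

```python
import logging

log = logging.getLogger("match_recognize")

class ParseError(Exception):
    pass

class ExpressionEvaluationError(Exception):
    pass

class PatternMatchError(Exception):
    pass

def run_stage(stage_name, func, *args, fatal=True):
    """Run one pipeline stage, logging known errors with the stage name."""
    try:
        return func(*args)
    except (ParseError, ExpressionEvaluationError, PatternMatchError) as exc:
        log.error("%s failed: %s", stage_name, exc)
        if fatal:
            raise            # abort: let the caller see the original error
        return None          # continue: the caller treats None as "no result"

def failing_matcher():
    raise PatternMatchError("no accepting state reached")

print(run_stage("matching", failing_matcher, fatal=False))
```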
273
+
274
+
275
+ Summary
276
+ This detailed pseudocode outlines how you can structure a production-grade MATCH_RECOGNIZE engine:
277
+
278
+ Parsing: Use a full grammar to build an AST with all clauses.
279
+ Expression Evaluation: Compile expressions into an IR and evaluate with a context that includes running or final match data.
280
+ Pattern Matching: Construct an NFA from the pattern AST (and optionally convert it to DFA), and use it to match rows from sorted partitions while applying DEFINE conditions.
281
+ Execution: Partition and sort data, run the matcher on each partition, evaluate measures per match (or per row in ALL ROWS mode), and assemble output rows.
282
+ Error Handling: Provide comprehensive error messages and logging at every step.
283
+ This architecture is modular and extensible and would form a solid foundation for a production-grade engine with further optimizations and refinements.
284
+
285
+
286
+
287
+
288
+
289
+
290
+ Modular Design
291
+
292
+ SQL Parser Module:
293
+ Use a mature parser generator (e.g., ANTLR) to support the full SQL MATCH_RECOGNIZE syntax.
294
+
295
+ Build a complete Abstract Syntax Tree (AST) that represents all subclauses (PARTITION BY, ORDER BY, MEASURES, PATTERN, DEFINE, SUBSET, AFTER MATCH SKIP, etc.).
296
+ Include thorough validations for pattern variables and measure expressions.
297
+ Expression Evaluation Module:
298
+ Develop or integrate a robust expression evaluator to safely parse and evaluate conditions and measure expressions.
299
+
300
+ Support complex boolean and arithmetic expressions.
301
+ Handle navigation functions (e.g., PREV, NEXT, FIRST, LAST) with both running and final semantics.
302
+ Precompile expressions and cache them for efficient repeated evaluations.
303
+ Pattern Matching Engine:
304
+ Implement an advanced matching engine that constructs an NFA from the parsed AST and, where beneficial, converts it to a DFA.
305
+
306
+ Support all pattern operators: concatenation, alternation, grouping, permutation, quantifiers (including exact and range), exclusions, and subset definitions.
307
+ Optimize with epsilon closure caching, state minimization, and transition table precomputation.
308
+ Provide options for overlapping matches (AFTER MATCH SKIP TO NEXT ROW, SKIP TO FIRST/LAST variable) and non-overlapping matches (AFTER MATCH SKIP PAST LAST ROW).
309
+ Execution Engine:
310
+ Integrate the parser, expression evaluator, and pattern matcher to process the input DataFrame.
311
+
312
+ Partition data by PARTITION BY columns and sort each partition based on ORDER BY (with support for ASC/DESC and data-type-aware sorting).
313
+ Apply the matching engine on each partition, evaluate measures, and produce output rows based on ONE ROW PER MATCH or ALL ROWS PER MATCH semantics.
314
+ Maintain performance statistics and detailed logging.
315
+ Error Handling & Diagnostics:
316
+ Implement comprehensive error handling across modules:
317
+
318
+ Provide detailed error messages (including context, hints, and error codes) for syntax errors, ambiguous patterns, and runtime matching issues.
319
+ Use logging frameworks with configurable verbosity levels for both debugging and production monitoring.
320
+ 2. Detailed Component Enhancements
321
+ A. SQL Parser Enhancements
322
+ Full Grammar Support:
323
+ Extend the grammar to handle nested expressions, multiple conditions (AND, OR, NOT), and complex measure expressions.
324
+
325
+ Subset & Skip Options:
326
+
327
+ Parse the SUBSET clause and expand subset tokens (e.g., replace a subset variable with an alternation of its members).
328
+ Recognize and store advanced AFTER MATCH SKIP options (e.g., SKIP TO FIRST A, SKIP TO LAST B).
329
+ AST Generation:
330
+ Generate a detailed AST that feeds directly into the matching engine and the expression evaluator.
331
+
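The subset expansion described above (replacing a subset variable with an alternation of its members) can be sketched as a plain string rewrite. The function name `expand_subsets` and the dictionary layout are assumptions made for this illustration; a real parser would more likely rewrite AST nodes than pattern text.

```python
import re


def expand_subsets(pattern, subsets):
    """Replace each subset name in `pattern` with an alternation of its members.

    `subsets` maps a subset variable to its member variables, for example
    {"U": ["A", "B"]} for `SUBSET U = (A, B)`.
    """
    def replace(match):
        name = match.group(0)
        members = subsets.get(name)
        if members is None:
            return name  # ordinary pattern variable, leave untouched
        return "(" + " | ".join(members) + ")"

    # Match whole identifiers so that subset U does not rewrite variable UP.
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", replace, pattern)
```

For example, `expand_subsets("A U+ C", {"U": ["A", "B"]})` rewrites the pattern text to `A (A | B)+ C`.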
332
+ B. Expression Evaluation Improvements
333
+ Robust Expression Parser:
334
+ Use a dedicated parser (or a safe library) to handle arithmetic and boolean expressions beyond simple regex-based matching.
335
+
336
+ Support nested navigation functions and compound conditions.
337
+ Precompile and cache parsed expressions to improve runtime performance.
338
+ Context-Sensitive Evaluation:
339
+ Ensure that expressions in the DEFINE and MEASURES clauses can reference:
340
+
341
+ The entire match (final semantics) or the growing match (running semantics).
342
+ Specific pattern variables, by filtering the match rows accordingly.
343
+ Security & Efficiency:
344
+ Avoid using insecure methods like direct eval; instead, compile expressions into an intermediate representation (IR) that’s safely executed.
345
+
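One way to avoid `eval` is to parse an expression once, reject anything outside a whitelist, and compile the rest into closures that act as a small intermediate representation. The sketch below uses Python's `ast` module and Python expression syntax (`and`, `==`) as a stand-in for SQL; the names `compile_expression` and `_compile` are hypothetical, and navigation functions such as PREV are deliberately left out.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected at compile time.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}
_CMP_OPS = {ast.Lt: operator.lt, ast.LtE: operator.le, ast.Gt: operator.gt,
            ast.GtE: operator.ge, ast.Eq: operator.eq, ast.NotEq: operator.ne}


def compile_expression(text):
    """Compile `text` into a closure that takes a row dict and returns a value."""
    return _compile(ast.parse(text, mode="eval").body)


def _compile(node):
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float, str, bool)):
        value = node.value
        return lambda row: value
    if isinstance(node, ast.Name):
        column = node.id
        return lambda row: row[column]  # names resolve to row columns only
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        op, left, right = _BIN_OPS[type(node.op)], _compile(node.left), _compile(node.right)
        return lambda row: op(left(row), right(row))
    if isinstance(node, ast.Compare) and len(node.ops) == 1 and type(node.ops[0]) in _CMP_OPS:
        op = _CMP_OPS[type(node.ops[0])]
        left, right = _compile(node.left), _compile(node.comparators[0])
        return lambda row: op(left(row), right(row))
    if isinstance(node, ast.BoolOp):
        parts = [_compile(v) for v in node.values]
        combine = all if isinstance(node.op, ast.And) else any
        return lambda row: combine(p(row) for p in parts)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        operand = _compile(node.operand)
        return lambda row: not operand(row)
    raise ValueError(f"unsupported expression element: {type(node).__name__}")
```

Because function calls and attribute access are not whitelisted, an input such as `__import__('os')` is rejected when it is compiled, before any row is evaluated. The compiled closure can be cached and reused for every row.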
346
+ C. Pattern Matching Engine Enhancements
347
+ Advanced NFA/DFA Construction:
348
+
349
+ Build the NFA from the AST while supporting the full range of operators (quantifiers, alternation, exclusions, grouping).
350
+ For performance-critical paths, convert the NFA to a DFA where possible.
351
+ Implement state minimization techniques to reduce the number of states.
352
+ Optimized Matching:
353
+
354
+ Cache epsilon closures and state transitions.
355
+ Use incremental matching to avoid reprocessing input rows, especially on large datasets.
356
+ Consider multi-threading for processing different partitions in parallel.
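Epsilon-closure caching can be illustrated with a tiny hand-built NFA. The sketch assumes a hypothetical automaton for `PATTERN (A B* C)` and assumes each row has already been classified with a single variable label; a real engine would evaluate DEFINE conditions while matching rather than beforehand.

```python
from functools import lru_cache

# Hypothetical NFA for PATTERN (A B* C): states are integers,
# EPSILON marks transitions that consume no row.
EPSILON = None
TRANSITIONS = {
    (0, "A"): frozenset({1}),
    (1, EPSILON): frozenset({2}),
    (2, "B"): frozenset({2}),
    (2, "C"): frozenset({3}),
}
ACCEPTING = frozenset({3})


@lru_cache(maxsize=None)
def epsilon_closure(states):
    """Return every state reachable from `states` using only epsilon moves."""
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for nxt in TRANSITIONS.get((state, EPSILON), frozenset()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def step(states, label):
    """Advance the active state set over one row classified as `label`."""
    moved = set()
    for state in states:
        moved |= TRANSITIONS.get((state, label), frozenset())
    return epsilon_closure(frozenset(moved))


def accepts(labels):
    """Check whether a sequence of row labels is accepted by the NFA."""
    states = epsilon_closure(frozenset({0}))
    for label in labels:
        states = step(states, label)
    return bool(states & ACCEPTING)
```

Representing state sets as `frozenset` values makes them hashable, which is what allows `lru_cache` to memoize the closure computation.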
357
+ Comprehensive Skip Logic:
358
+ Adjust the engine to correctly implement all AFTER MATCH SKIP options:
359
+
360
+ For SKIP PAST LAST ROW: resume from the row immediately after the match.
361
+ For SKIP TO NEXT ROW: resume from the row immediately after the first row of the match. For SKIP TO FIRST/LAST variable: resume from the first or last row mapped to that variable.
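The resume position for each AFTER MATCH SKIP option can be computed from the rows of the match just found. The sketch below assumes a match is a list of `(row_index, variable)` pairs in row order; the function name `next_start` and the mode strings are hypothetical.

```python
def next_start(mode, match, target=None):
    """Return the row index where matching resumes after `match`.

    `match` is a list of (row_index, variable) pairs in row order.
    `mode` is one of "PAST LAST ROW", "TO NEXT ROW", "TO FIRST", "TO LAST";
    `target` names the pattern variable for the last two modes.
    """
    first_row, last_row = match[0][0], match[-1][0]
    if mode == "PAST LAST ROW":
        return last_row + 1
    if mode == "TO NEXT ROW":
        return first_row + 1
    if mode not in ("TO FIRST", "TO LAST"):
        raise ValueError(f"unknown skip mode: {mode!r}")
    rows = [index for index, variable in match if variable == target]
    if not rows:
        raise ValueError(f"variable {target!r} is not present in the match")
    resume = rows[0] if mode == "TO FIRST" else rows[-1]
    if resume == first_row:
        # Resuming at the first row of the match would loop forever.
        raise ValueError("AFTER MATCH SKIP target is the first row of the match")
    return resume
```

Rejecting a skip target that equals the first row of the match guards against an infinite loop, since matching would otherwise restart at the same row and find the same match again.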
362
+ D. Execution Engine Enhancements
363
+ Partitioning and Ordering:
364
+ Enhance the ORDER BY processing to support sort directions (ASC/DESC) and perform type-aware sorting (e.g., numeric and date types).
365
+ Measure Evaluation:
366
+ Distinguish between running and final measures; compute final measures once per match and reuse them across output rows.
367
+ Support aggregations that are scoped to specific pattern variables.
368
+ Robust Integration:
369
+ Tie together the parser, expression evaluator, and pattern matcher into a single cohesive engine that can handle production loads.
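Partitioning, ordering, and the running/final distinction can be shown with plain Python structures. In this simplified sketch rows are dictionaries rather than a DataFrame, and the function names `partition_and_sort` and `running_and_final_sum` are hypothetical.

```python
from collections import defaultdict


def partition_and_sort(rows, partition_by, order_by):
    """Group rows (dicts) by `partition_by` columns and sort each group.

    `order_by` is a list of (column, ascending) pairs.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[tuple(row[c] for c in partition_by)].append(row)
    for group in partitions.values():
        # Apply keys from least to most significant; list.sort() is stable.
        for column, ascending in reversed(order_by):
            group.sort(key=lambda r, c=column: r[c], reverse=not ascending)
    return dict(partitions)


def running_and_final_sum(match_rows, column):
    """Return (running sums per row, final sum) for the rows of one match."""
    running, total = [], 0
    for row in match_rows:
        total += row[column]
        running.append(total)  # value visible at this row under RUNNING semantics
    return running, total      # `total` is the FINAL value, computed once
```

With pandas, the same step would typically rely on `DataFrame.groupby` and `DataFrame.sort_values`. The final value is computed once per match and can be reused on every output row in ALL ROWS PER MATCH mode.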
370
+ E. Error Handling & Logging
371
+ Diagnostic Enhancements:
372
+ Use detailed log messages and error contexts in all modules.
373
+ Provide fallback and safe-exit strategies for ambiguous patterns.
374
+ Test Coverage:
375
+ Develop an extensive test suite covering edge cases, complex nested expressions, and ambiguous pattern constructs.
376
+ Use performance benchmarking tools to profile and optimize matching speed.
377
+ 3. Implementation Considerations
378
+ Scalability:
379
+ Consider designing the engine to work with streaming data or integrate with distributed computing frameworks if needed.
380
+
381
+ Modularity:
382
+ Keep the SQL parser, expression evaluator, and pattern matcher loosely coupled so that improvements in one area (e.g., swapping out the expression evaluator) do not require rewriting the entire system.
383
+
384
+ Integration:
385
+ Provide a clear API for external callers (e.g., a method that accepts a SQL query string and a DataFrame, and returns a new DataFrame with matched results).
386
+
387
+ Security & Robustness:
388
+ Validate all inputs rigorously. Use sandboxing for expression evaluations to prevent arbitrary code execution.
389
+
390
+ Specific Implementation Improvements
391
+ Short-term Improvements
392
+ Optimize variable lookups in the evaluate_pattern_variable_reference function
393
+ Add proper error handling for malformed pattern variable references
394
+ Implement caching for frequently accessed rows and variables
395
+ Add support for more navigation functions such as PREV and NEXT
396
+ Add proper handling for empty matches in all scenarios
397
+ Medium-term Improvements
398
+ Implement a proper expression evaluator for complex measure expressions
399
+ Add support for pattern exclusions with proper semantics
400
+ Optimize partition handling for large datasets
401
+ Implement proper type handling for measure values
402
+ Add support for CLASSIFIER() function with proper semantics
403
+
404
+
405
+ /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
406
+
407
+ Support for Complex Expressions: Extend the parser to handle more complex expressions in DEFINE, MEASURES, and PARTITION BY clauses.
408
+
409
+ Subquery Support: Add support for subqueries within MATCH_RECOGNIZE.
410
+ Error Handling: Improve error messages with line/column information for syntax errors.
411
+
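Line and column information can be derived from the character offset at which the parser failed. The sketch below assumes the parser reports a 0-based offset; the helper names `line_and_column` and `format_syntax_error` are hypothetical.

```python
def line_and_column(text, offset):
    """Convert a 0-based character offset into 1-based (line, column)."""
    line = text.count("\n", 0, offset) + 1
    last_newline = text.rfind("\n", 0, offset)
    column = offset - last_newline  # rfind returns -1 when on the first line
    return line, column


def format_syntax_error(text, offset, message):
    """Build an error message that points at the offending position."""
    line, column = line_and_column(text, offset)
    source_line = text.split("\n")[line - 1]
    pointer = " " * (column - 1) + "^"
    return f"line {line}, column {column}: {message}\n{source_line}\n{pointer}"
```

Printing the source line with a caret under the failing position lets the user locate the problem without counting characters in the query.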
412
+ Advanced Pattern Features
413
+ Complex Pattern Support
414
+ Nested Patterns: Support for nested pattern expressions.
415
+ Pattern Exclusions: Enhance pattern exclusion handling.
416
+ Quantifier Improvements: Support for reluctant and possessive quantifiers.
417
+
418
+
419
+ Optimized Automata
420
+ Lazy DFA Construction: Build DFA states on-demand rather than all at once.
421
+ Pattern Optimization: Analyze patterns to eliminate redundant states.
422
+
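Lazy DFA construction can be sketched as the usual subset construction with the transition table filled in only when a transition is first taken. The class below assumes an epsilon-free NFA given as a dictionary; `LazyDFA` and its attribute names are hypothetical.

```python
class LazyDFA:
    """Determinize an epsilon-free NFA one state at a time, on demand."""

    def __init__(self, nfa_transitions, start, accepting):
        self.nfa = nfa_transitions          # {(state, label): set of states}
        self.start = frozenset({start})
        self.accepting = frozenset(accepting)
        self.cache = {}                     # {(dfa_state, label): dfa_state}

    def step(self, dfa_state, label):
        key = (dfa_state, label)
        if key not in self.cache:
            # First visit: compute the target subset and remember it.
            target = set()
            for state in dfa_state:
                target |= set(self.nfa.get((state, label), ()))
            self.cache[key] = frozenset(target)
        return self.cache[key]

    def accepts(self, labels):
        state = self.start
        for label in labels:
            state = self.step(state, label)
        return bool(state & self.accepting)
```

Only the DFA states that the input actually reaches are materialized, which avoids the exponential blow-up of building every subset up front for patterns whose full DFA is large.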
423
+
424
+ Advanced Row Navigation
425
+ Window Functions: Support for window functions in measures.
426
+ Aggregation Functions: Add more aggregation functions (STDDEV, PERCENTILE, etc.).
427
+
428
+
429
+ Production-level Assessment of MATCH_RECOGNIZE Implementation
430
+ The implementation has solid foundations but needs several improvements to meet production requirements:
431
+ Strengths
432
+ Complete feature coverage: The implementation includes all required components of the MATCH_RECOGNIZE clause:
433
+ PARTITION BY
434
+ ORDER BY
435
+ MEASURES
436
+ ROWS PER MATCH options
437
+ AFTER MATCH SKIP modes
438
+ PATTERN syntax with regular expression support
439
+ SUBSET capability
440
+ DEFINE variables
441
+ Pattern matching fundamentals: Uses proper automata-based approach with NFA → DFA conversion
442
+ Navigation functions: Implements PREV, NEXT, FIRST, LAST functions
443
+ Aggregation support: Handles aggregations over pattern variables
444
+ Exclusion patterns: Supports the {- pattern -} syntax for excluding portions from output
445
+ Areas Needing Improvement
446
+ Error handling: Needs more comprehensive error handling across the implementation
447
+ Performance optimization: Current implementation has potential bottlenecks:
448
+ Pattern matching algorithm could be optimized
449
+ Condition evaluation appears to make repeated function calls
450
+ Test coverage: Debug print statements left in the code suggest that testing is inadequate
451
+ Documentation: Code is documented but lacks comprehensive API documentation
452
+ Edge cases: Some edge cases appear to be handled with special hacks rather than systematic approaches
453
+ Specific syntax support gaps:
454
+ RUNNING/FINAL semantics are partially implemented but need proper validation
455
+ CLASSIFIER() function may have incomplete support
456
+ PERMUTE pattern handling needs optimization
457
+ Partition handling: The current approach loads all partitions in memory which may cause issues with large datasets
458
+ Recommendations for Production Readiness
459
+ Performance optimization:
460
+ Optimize condition compilation and caching
461
+ Improve pattern matching algorithm efficiency
462
+ Add query plan optimization
463
+ Robustness improvements:
464
+ Add comprehensive input validation
465
+ Improve error messages and error handling
466
+ Implement proper logging instead of print statements
467
+ Comprehensive testing:
468
+ Unit tests for each component
469
+ Integration tests for end-to-end scenarios
470
+ Performance benchmarks
471
+ Edge case testing
472
+ Documentation:
473
+ Add comprehensive API documentation
474
+ Include usage examples
475
+ Document implementation details
476
+ Memory management:
477
+ Implement streaming processing for large datasets
478
+ Add memory usage monitoring and controls
479
+ Code quality:
480
+ Refactor complex methods into smaller, more maintainable pieces
481
+ Remove debugging print statements
482
+ Add type annotations consistently
483
+ Feature completion:
484
+ Ensure all RUNNING/FINAL semantics are properly implemented
485
+ Complete support for all pattern syntax features
486
+ Add support for more aggregate functions
487
+ Monitoring and observability:
488
+ Add performance metrics
489
+ Implement proper logging
490
+ The implementation has a good foundation but requires these improvements before it can be considered production-ready.