pandas-match-recognize 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pandas_match_recognize-0.1.0/.gitignore +122 -0
- pandas_match_recognize-0.1.0/.vscode/settings.json +3 -0
- pandas_match_recognize-0.1.0/1.swift +490 -0
- pandas_match_recognize-0.1.0/Examples Match_reocginze/Examples.ipynb +2402 -0
- pandas_match_recognize-0.1.0/Examples Match_reocginze/Examples_View.ipynb +27021 -0
- pandas_match_recognize-0.1.0/Examples.ipynb +2402 -0
- pandas_match_recognize-0.1.0/LICENSE +21 -0
- pandas_match_recognize-0.1.0/PKG-INFO +426 -0
- pandas_match_recognize-0.1.0/Performance/FINAL_VALIDATED_PERFORMANCE_ANALYSIS.md +82 -0
- pandas_match_recognize-0.1.0/Performance/enhanced_latex_generator.py +921 -0
- pandas_match_recognize-0.1.0/Performance/enhanced_pattern_benchmark.py +1091 -0
- pandas_match_recognize-0.1.0/Performance/enhanced_visualization_generator.py +390 -0
- pandas_match_recognize-0.1.0/Performance/pattern_focused_benchmark.py +511 -0
- pandas_match_recognize-0.1.0/Python_Examples/python_udf.ipynb +503 -0
- pandas_match_recognize-0.1.0/README.md +377 -0
- pandas_match_recognize-0.1.0/Test_grammar.ipynb +1638 -0
- pandas_match_recognize-0.1.0/Untitled-1.ipynb +210 -0
- pandas_match_recognize-0.1.0/Untitled-2.ipynb +266 -0
- pandas_match_recognize-0.1.0/__init__.py +8 -0
- pandas_match_recognize-0.1.0/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/docs/README_NAVIGATION_FIX.md +105 -0
- pandas_match_recognize-0.1.0/docs/debugging_guide.md +471 -0
- pandas_match_recognize-0.1.0/docs/navigation_enhancements.md +260 -0
- pandas_match_recognize-0.1.0/docs/navigation_function_fix.md +134 -0
- pandas_match_recognize-0.1.0/docs/nested_navigation.md +127 -0
- pandas_match_recognize-0.1.0/docs/pattern_cache_deployment.md +324 -0
- pandas_match_recognize-0.1.0/docs/pattern_caching.md +174 -0
- pandas_match_recognize-0.1.0/main.py +62 -0
- pandas_match_recognize-0.1.0/match_recognize/__init__.py +66 -0
- pandas_match_recognize-0.1.0/match_recognize/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/match_recognize.py +29 -0
- pandas_match_recognize-0.1.0/medi.ipynb +538 -0
- pandas_match_recognize-0.1.0/pandas_match_recognize/__init__.py +66 -0
- pandas_match_recognize-0.1.0/pandas_match_recognize.egg-info/PKG-INFO +426 -0
- pandas_match_recognize-0.1.0/pandas_match_recognize.egg-info/SOURCES.txt +183 -0
- pandas_match_recognize-0.1.0/pandas_match_recognize.egg-info/dependency_links.txt +1 -0
- pandas_match_recognize-0.1.0/pandas_match_recognize.egg-info/requires.txt +16 -0
- pandas_match_recognize-0.1.0/pandas_match_recognize.egg-info/top_level.txt +2 -0
- pandas_match_recognize-0.1.0/production_scale_test.py +337 -0
- pandas_match_recognize-0.1.0/pyproject.toml +63 -0
- pandas_match_recognize-0.1.0/requirements.txt +127 -0
- pandas_match_recognize-0.1.0/setup.cfg +4 -0
- pandas_match_recognize-0.1.0/setup.py +70 -0
- pandas_match_recognize-0.1.0/src/TestAggregationsInRowPatternMatching.java +1270 -0
- pandas_match_recognize-0.1.0/src/TestRowPatternMatching.java +1561 -0
- pandas_match_recognize-0.1.0/src/__init__.py +11 -0
- pandas_match_recognize-0.1.0/src/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/ast_nodes/__init__.py +5 -0
- pandas_match_recognize-0.1.0/src/ast_nodes/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/ast_nodes/__pycache__/ast_nodes.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/ast_nodes/ast_nodes.py +678 -0
- pandas_match_recognize-0.1.0/src/config/__init__.py +1 -0
- pandas_match_recognize-0.1.0/src/config/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/config/__pycache__/production_config.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/config/production_config.py +385 -0
- pandas_match_recognize-0.1.0/src/executor/__init__.py +0 -0
- pandas_match_recognize-0.1.0/src/executor/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/executor/__pycache__/match_recognize.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/executor/match_recognize.py +2302 -0
- pandas_match_recognize-0.1.0/src/grammar/TrinoLexer.g4 +398 -0
- pandas_match_recognize-0.1.0/src/grammar/TrinoLexer.interp +1040 -0
- pandas_match_recognize-0.1.0/src/grammar/TrinoLexer.py +1711 -0
- pandas_match_recognize-0.1.0/src/grammar/TrinoLexer.tokens +665 -0
- pandas_match_recognize-0.1.0/src/grammar/TrinoParser.g4 +1124 -0
- pandas_match_recognize-0.1.0/src/grammar/TrinoParser.interp +817 -0
- pandas_match_recognize-0.1.0/src/grammar/TrinoParser.py +23363 -0
- pandas_match_recognize-0.1.0/src/grammar/TrinoParser.tokens +665 -0
- pandas_match_recognize-0.1.0/src/grammar/TrinoParserListener.py +2973 -0
- pandas_match_recognize-0.1.0/src/grammar/TrinoParserVisitor.py +1658 -0
- pandas_match_recognize-0.1.0/src/grammar/__init__.py +0 -0
- pandas_match_recognize-0.1.0/src/grammar/__pycache__/TrinoLexer.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/grammar/__pycache__/TrinoParser.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/grammar/__pycache__/TrinoParserVisitor.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/grammar/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/matcher/__init__.py +0 -0
- pandas_match_recognize-0.1.0/src/matcher/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/matcher/__pycache__/automata.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/matcher/__pycache__/condition_evaluator.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/matcher/__pycache__/dfa.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/matcher/__pycache__/evaluation_utils.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/matcher/__pycache__/matcher.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/matcher/__pycache__/measure_evaluator.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/matcher/__pycache__/pattern_tokenizer.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/matcher/__pycache__/production_aggregates.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/matcher/__pycache__/row_context.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/matcher/automata.py +3838 -0
- pandas_match_recognize-0.1.0/src/matcher/condition_evaluator.py +2915 -0
- pandas_match_recognize-0.1.0/src/matcher/dfa.py +1145 -0
- pandas_match_recognize-0.1.0/src/matcher/evaluation_utils.py +592 -0
- pandas_match_recognize-0.1.0/src/matcher/matcher.py +6946 -0
- pandas_match_recognize-0.1.0/src/matcher/measure_evaluator.py +2317 -0
- pandas_match_recognize-0.1.0/src/matcher/pattern_tokenizer.py +1414 -0
- pandas_match_recognize-0.1.0/src/matcher/production_aggregates.py +2293 -0
- pandas_match_recognize-0.1.0/src/matcher/row_context.py +1781 -0
- pandas_match_recognize-0.1.0/src/monitoring/__init__.py +13 -0
- pandas_match_recognize-0.1.0/src/monitoring/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/monitoring/__pycache__/cache_monitor.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/monitoring/cache_monitor.py +191 -0
- pandas_match_recognize-0.1.0/src/monitoring/health_check.py +309 -0
- pandas_match_recognize-0.1.0/src/monitoring/production_logging.py +403 -0
- pandas_match_recognize-0.1.0/src/parser/__init__.py +0 -0
- pandas_match_recognize-0.1.0/src/parser/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/parser/__pycache__/error_listeners.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/parser/__pycache__/match_recognize_extractor.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/parser/error_listeners.py +22 -0
- pandas_match_recognize-0.1.0/src/parser/match_recognize_extractor.py +1194 -0
- pandas_match_recognize-0.1.0/src/parser/query_parser.py +4 -0
- pandas_match_recognize-0.1.0/src/pattern/__init__.py +12 -0
- pandas_match_recognize-0.1.0/src/pattern/__pycache__/permute_handler.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/pattern/permute_handler.py +731 -0
- pandas_match_recognize-0.1.0/src/utils/__init__.py +23 -0
- pandas_match_recognize-0.1.0/src/utils/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/utils/__pycache__/logging_config.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/utils/__pycache__/memory_management.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/utils/__pycache__/pattern_cache.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/utils/__pycache__/performance_optimizer.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/src/utils/logging_config.py +723 -0
- pandas_match_recognize-0.1.0/src/utils/memory_management.py +800 -0
- pandas_match_recognize-0.1.0/src/utils/pattern_cache.py +908 -0
- pandas_match_recognize-0.1.0/src/utils/performance_optimizer.py +2846 -0
- pandas_match_recognize-0.1.0/src/utils/production_logging.py +448 -0
- pandas_match_recognize-0.1.0/test1_workon.ipynb +6379 -0
- pandas_match_recognize-0.1.0/test_NFA.ipynb +1780 -0
- pandas_match_recognize-0.1.0/test_parsing_v1.ipynb +7063 -0
- pandas_match_recognize-0.1.0/test_requirements.txt +4 -0
- pandas_match_recognize-0.1.0/tests/__init__.py +1 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/__init__.cpython-312.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/conftest.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_anchor_patterns.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_back_reference.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_case_sensitivity.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_complete_java_reference.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_empty_cycle.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_empty_matches.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_exponential_protection.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_fixed_failing_cases.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_in_predicate.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_match_recognize.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_missing_critical_cases.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_multiple_match_recognize.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_navigation_and_conditions.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_output_layout.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_pattern_cache.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_pattern_tokenizer.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_permute_patterns.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_production_aggregates.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_scalar_functions.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_sql2016_compliance.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_sql_parser.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/__pycache__/test_subqueries.cpython-312-pytest-8.3.4.pyc +0 -0
- pandas_match_recognize-0.1.0/tests/performance/requirements.txt +53 -0
- pandas_match_recognize-0.1.0/tests/test_advanced_aggregation_scenarios.py +489 -0
- pandas_match_recognize-0.1.0/tests/test_aggregation_fixes.py +491 -0
- pandas_match_recognize-0.1.0/tests/test_aggregation_integration.py +444 -0
- pandas_match_recognize-0.1.0/tests/test_aggregation_performance.py +387 -0
- pandas_match_recognize-0.1.0/tests/test_anchor_patterns.py +327 -0
- pandas_match_recognize-0.1.0/tests/test_back_reference.py +245 -0
- pandas_match_recognize-0.1.0/tests/test_case_sensitivity.py +172 -0
- pandas_match_recognize-0.1.0/tests/test_complete_java_aggregation_coverage.py +602 -0
- pandas_match_recognize-0.1.0/tests/test_complete_java_aggregations.py +479 -0
- pandas_match_recognize-0.1.0/tests/test_complete_java_reference.py +719 -0
- pandas_match_recognize-0.1.0/tests/test_empty_cycle.py +305 -0
- pandas_match_recognize-0.1.0/tests/test_empty_matches.py +348 -0
- pandas_match_recognize-0.1.0/tests/test_exponential_protection.py +351 -0
- pandas_match_recognize-0.1.0/tests/test_fixed_failing_cases.py +214 -0
- pandas_match_recognize-0.1.0/tests/test_in_predicate.py +385 -0
- pandas_match_recognize-0.1.0/tests/test_java_aggregations_converted.py +650 -0
- pandas_match_recognize-0.1.0/tests/test_match_recognize.py +1134 -0
- pandas_match_recognize-0.1.0/tests/test_missing_critical_cases.py +565 -0
- pandas_match_recognize-0.1.0/tests/test_missing_java_cases.py +550 -0
- pandas_match_recognize-0.1.0/tests/test_multiple_match_recognize.py +351 -0
- pandas_match_recognize-0.1.0/tests/test_navigation_and_conditions.py +280 -0
- pandas_match_recognize-0.1.0/tests/test_output_layout.py +291 -0
- pandas_match_recognize-0.1.0/tests/test_pattern_cache.py +263 -0
- pandas_match_recognize-0.1.0/tests/test_pattern_tokenizer.py +184 -0
- pandas_match_recognize-0.1.0/tests/test_permute_patterns.py +317 -0
- pandas_match_recognize-0.1.0/tests/test_production_aggregates.py +625 -0
- pandas_match_recognize-0.1.0/tests/test_production_aggregations.py +756 -0
- pandas_match_recognize-0.1.0/tests/test_scalar_functions.py +259 -0
- pandas_match_recognize-0.1.0/tests/test_sql2016_compliance.py +607 -0
- pandas_match_recognize-0.1.0/tests/test_subqueries.py +338 -0
- pandas_match_recognize-0.1.0/tests/test_utils.py +243 -0
- pandas_match_recognize-0.1.0/tr1_paper.ipynb +10243 -0
- pandas_match_recognize-0.1.0/trino_data.ipynb +405 -0
- pandas_match_recognize-0.1.0/trino_test_replication.py +842 -0
@@ -0,0 +1,122 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Project specific
+*.log
+temp/
+tmp/
@@ -0,0 +1,490 @@
+1. SQL Parser Module (Using a Parser Generator)
+
+// Define full MATCH_RECOGNIZE grammar (using ANTLR, for example)
+// Grammar covers:
+// - PARTITION BY clause (list of columns)
+// - ORDER BY clause (column names with ASC/DESC)
+// - MEASURES clause (list of measure expressions)
+// - PATTERN clause (row pattern with operators: concatenation, alternation, grouping, quantifiers, exclusions)
+// - SUBSET clause (mapping subset names to list of pattern variables)
+// - DEFINE clause (list of variable definitions with expressions)
+// - AFTER MATCH SKIP clause (options: PAST LAST ROW, TO FIRST <var>, TO LAST <var>)
+
+SQL MATCH_RECOGNIZE
+PARTITION BY <column_list>
+ORDER BY <column_list>
+MEASURES <measure_list>
+PATTERN (<pattern>)
+DEFINE <variable_definitions>
+SUBSET <subset_mapping>
+AFTER MATCH SKIP <skip_option>
+
+---
+AST Generation: Occurs immediately after parsing during the transformation phase.
+Notes:
+– The AST should capture every clause as a node.
+– It must resolve subset expansions (e.g. if SUBSET U = (A, B), then any occurrence of U in PATTERN is replaced with (A | B)).
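
The subset-expansion rule in the note above can be sketched in a few lines of Python. This is only an illustration of the textual rewrite the note describes, not the package's implementation; the function name `expand_subsets` and the string-based pattern representation are assumptions made for the example.

```python
import re


def expand_subsets(pattern: str, subsets: dict) -> str:
    """Replace each subset name in a pattern string with an alternation group.

    `subsets` maps a subset name to its member variables, e.g. {"U": ["A", "B"]}.
    Hypothetical helper: a real implementation would rewrite the pattern AST.
    """
    if not subsets:
        return pattern
    # Match subset names only as whole identifiers, longest names first.
    names = sorted(subsets, key=len, reverse=True)
    token = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return token.sub(lambda m: "(" + " | ".join(subsets[m.group(1)]) + ")", pattern)


print(expand_subsets("X U+ Y", {"U": ["A", "B"]}))
```

Working on whole identifiers keeps a variable such as `UP` from being rewritten when only `U` is a subset name.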
+
+Parsing MATCH_RECOGNIZE Queries (Extract all components)
+Validating MATCH_RECOGNIZE Queries (Check for missing parts, syntax errors)
+Transforming MATCH_RECOGNIZE Queries (Rewrite patterns, optimize SQL)
+✔ Extracted all MATCH_RECOGNIZE components
+✔ Validated MATCH_RECOGNIZE queries for correctness
+✔ Transformed MATCH_RECOGNIZE patterns
+
+
+✅ 1. Extract MATCH_RECOGNIZE Components (PARTITION, MEASURES, PATTERN, DEFINE, SUBSET)
+✅ 2. Validate MATCH_RECOGNIZE Queries (Check for missing parts, errors)
+✅ 3. Transform MATCH_RECOGNIZE Queries (Modify, optimize, or rewrite queries)
+✔ Extracted all MATCH_RECOGNIZE components
+✔ Validated MATCH_RECOGNIZE queries for correctness
+✔ Transformed MATCH_RECOGNIZE patterns to optimize queries
+
+complex pattern transformations?
+Generating optimized MATCH_RECOGNIZE queries dynamically?
+
+A full-fledged expression parser that produces a detailed sub-AST.
+A more advanced, structured AST for row patterns (handling full regex-like syntax).
+Automatic subset expansion in the AST.
+Complex pattern transformations and dynamic query optimization.
+Deeper semantic validation of expressions and function calls.
+Replace the stub expression parser with a fully featured parser if your expression syntax is complex.
+Expand the pattern parser to cover more regex-like constructs.
+Write comprehensive unit and integration tests to cover edge cases in expressions and patterns.
+Add semantic validation logic for checking column existence, data type matching, and function support (which might require integration with your schema metadata).
+
+
+
+
+
+NFA/DFA generation for efficient pattern matching (execution phase).
+
+vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
+
+
+NFA/DFA Generation: Happens during the execution phase (in the engine module) when the canonical row pattern is compiled into a state machine for efficient matching.
+
+function parseSQL(query: String) -> AST
+    // Use a parser generator (e.g., ANTLR) to create a parse tree from query.
+    ast = generateAST(query)
+    validateAST(ast) // e.g., check that every pattern variable in PATTERN has a definition
+    return ast
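
The validation step mentioned in `validateAST` (every pattern variable in PATTERN has a definition) might look like the following. This is a minimal sketch that works on the pattern string rather than on a real AST; `undefined_pattern_variables` is a hypothetical name, and the only pattern keyword it skips is `PERMUTE`.

```python
import re


def undefined_pattern_variables(pattern: str, defined: set) -> list:
    """Return pattern variables used in PATTERN that have no DEFINE entry.

    Hypothetical helper: identifiers are pulled out with a regex, so
    quantifiers such as {2,3} and operators such as | are ignored.
    """
    keywords = {"PERMUTE"}  # pattern keyword, not a variable
    missing = []
    for name in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", pattern):
        if name.upper() in keywords or name in defined or name in missing:
            continue
        missing.append(name)
    return missing


print(undefined_pattern_variables("A B+ (C | D)*", {"A", "B", "C"}))
```

Engines such as Trino treat a pattern variable with no DEFINE entry as always true, so a caller may prefer to log these names rather than reject the query.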
+
+Walking the Parse Tree
+If you want to extract information (e.g., table names, column names, patterns), you can use a Listener or Visitor:
+Validating Queries
+
+Now that the query is successfully parsed, you can build a SQL validation or transformation tool based on the tree.
+custom query validation, rewriting, or analysis
+
+NFA/DFA Generation
+
+State Machine Construction:
+The process of compiling the canonical row pattern (from the AST) into a nondeterministic or deterministic finite automaton for efficient execution is planned for the engine module and has not been implemented in the current phase.
+
+
+
+2. Expression Evaluator Module
+// The expression evaluator handles both boolean and arithmetic expressions.
+// It supports navigation functions such as PREV(), NEXT(), FIRST(), LAST().
+// It converts the expression into an Intermediate Representation (IR) for efficient repeated evaluation.
+
+function compileExpression(expr: String) -> ExpressionIR
+    // Parse the expression using a recursive descent parser or a tool like ANTLR.
+    // Build an IR (e.g., an abstract syntax tree) that represents the expression.
+    ir = parseAndBuildIR(expr)
+    return ir
+
+function evaluateExpression(ir: ExpressionIR, context: Map<String, Any>) -> Any
+    // Recursively evaluate the IR using context (which might include:
+    // - The current row
+    // - The entire match (for FINAL semantics)
+    // - Pattern variable-specific row lists)
+    // Example: If the IR represents PREV(A.price, 2), use context to look up the 2nd previous row for variable A.
+    if ir is ConstantNode:
+        return ir.value
+    else if ir is VariableNode:
+        return context[ir.name]
+    else if ir is BinaryOpNode:
+        left = evaluateExpression(ir.left, context)
+        right = evaluateExpression(ir.right, context)
+        return applyOperator(ir.operator, left, right)
+    else if ir is FunctionCallNode:
+        // For navigation functions:
+        if ir.functionName equals "PREV":
+            varName = ir.arguments[0]
+            offset = (ir.arguments[1] if present else 1)
+            return getPreviousRowValue(context, varName, offset)
+        else if ir.functionName equals "NEXT":
+            // Similarly for NEXT
+        // Add support for FIRST, LAST, CLASSIFIER, MATCH_NUMBER as needed.
+    else:
+        throw ExpressionEvaluationError
+
+Notes:
+– The context is built from the current match state (e.g., a list of rows for each variable).
+– The evaluator should distinguish running vs. final contexts.
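
To make the recursive evaluation above concrete, here is a small Python sketch of an IR and its evaluator. It is a simplification under stated assumptions: the node classes (`Constant`, `Column`, `BinaryOp`, `Prev`) are hypothetical, the context is reduced to a list of row dictionaries plus the current index, and `PREV` is treated as plain physical navigation rather than per-variable lookup.

```python
import operator
from dataclasses import dataclass
from typing import Any


@dataclass
class Constant:
    value: Any


@dataclass
class Column:
    name: str


@dataclass
class BinaryOp:
    op: str
    left: Any
    right: Any


@dataclass
class Prev:
    column: str
    offset: int = 1


OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "<": operator.lt, ">": operator.gt, "=": operator.eq}


def evaluate(ir, rows, index):
    """Evaluate an IR node against rows[index]; Prev looks back by its offset."""
    if isinstance(ir, Constant):
        return ir.value
    if isinstance(ir, Column):
        return rows[index][ir.name]
    if isinstance(ir, BinaryOp):
        return OPS[ir.op](evaluate(ir.left, rows, index),
                          evaluate(ir.right, rows, index))
    if isinstance(ir, Prev):
        target = index - ir.offset
        # Navigation before the first row yields None (standing in for NULL).
        return rows[target][ir.column] if target >= 0 else None
    raise TypeError(f"unsupported IR node: {ir!r}")


rows = [{"price": 10}, {"price": 8}, {"price": 12}]
falling = BinaryOp("<", Column("price"), Prev("price"))  # price < PREV(price)
print(evaluate(falling, rows, 1))
```

Comparisons against `None` are not handled here; a fuller evaluator would need SQL-style NULL semantics for the first row of a partition.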
+
+
+
+
+3. Pattern Matching Engine (NFA/DFA)
+// The engine receives the pattern AST from the SQL parser.
+// It constructs an NFA that represents the row pattern.
+// Optionally, it converts the NFA to a DFA for performance.
+
+function buildNFA(patternAST: ASTNode) -> NFA
+    // Recursively traverse the pattern AST:
+    // - For a concatenation node, link the NFAs of the children in sequence.
+    // - For an alternation node (e.g., A | B), create a new start state with epsilon transitions to the NFAs for each alternative,
+    //   then combine their accepting states with epsilon transitions to a new accepting state.
+    // - For a grouping node, simply build the NFA for its contents.
+    // - For a quantifier node, build the NFA for the base pattern and then add loops or unroll as necessary:
+    //     * For '*' (zero or more), add epsilon transition from start to accepting state and a loop back.
+    //     * For '+' (one or more), require one occurrence and then loop.
+    //     * For '{n, m}', unroll n transitions and then add additional states with limited loops up to m.
+    // - For an exclusion node, mark the subpattern to be excluded in output (the NFA should consume the row but not output it).
+    nfa = recursivelyBuildNFA(patternAST)
+    return nfa
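
The construction rules listed in `buildNFA` follow the classic Thompson construction. The sketch below covers only variables, concatenation, alternation, `*` and `+`; the tuple-based AST shape, the `NFA` class and the `accepts` helper are assumptions for illustration, and rows are represented by pre-assigned variable labels instead of evaluated DEFINE conditions.

```python
from itertools import count


class NFA:
    def __init__(self):
        self.eps = {}    # state -> set of states reachable on epsilon
        self.moves = {}  # (state, variable) -> set of target states
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def add_eps(self, src, dst):
        self.eps.setdefault(src, set()).add(dst)

    def add_move(self, src, var, dst):
        self.moves.setdefault((src, var), set()).add(dst)


def build(nfa, node):
    """Return (start, accept) for a pattern AST node, Thompson style."""
    kind = node[0]
    if kind == "var":
        s, a = nfa.new_state(), nfa.new_state()
        nfa.add_move(s, node[1], a)
        return s, a
    if kind == "concat":
        s1, a1 = build(nfa, node[1])
        s2, a2 = build(nfa, node[2])
        nfa.add_eps(a1, s2)
        return s1, a2
    if kind == "alt":
        s, a = nfa.new_state(), nfa.new_state()
        for child in node[1:]:
            cs, ca = build(nfa, child)
            nfa.add_eps(s, cs)
            nfa.add_eps(ca, a)
        return s, a
    if kind in ("star", "plus"):
        s, a = nfa.new_state(), nfa.new_state()
        cs, ca = build(nfa, node[1])
        nfa.add_eps(s, cs)
        nfa.add_eps(ca, cs)    # loop back for another occurrence
        nfa.add_eps(ca, a)
        if kind == "star":
            nfa.add_eps(s, a)  # zero occurrences allowed
        return s, a
    raise ValueError(f"unknown pattern node: {kind}")


def closure(nfa, states):
    """Epsilon closure of a set of states."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in nfa.eps.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def accepts(nfa, start, accept, labels):
    """Simulate the NFA on a sequence of already-classified row labels."""
    current = closure(nfa, {start})
    for label in labels:
        step = set()
        for st in current:
            step |= nfa.moves.get((st, label), set())
        current = closure(nfa, step)
    return accept in current


nfa = NFA()
# Pattern: A (B | C)+
ast = ("concat", ("var", "A"), ("plus", ("alt", ("var", "B"), ("var", "C"))))
start, accept = build(nfa, ast)
print(accepts(nfa, start, accept, ["A", "B", "C"]))
print(accepts(nfa, start, accept, ["A"]))
```

Bounded quantifiers (`{n, m}`) and exclusions are left out; they would be additional node kinds handled by unrolling and by tagging transitions, respectively.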
+
+function convertNFAtoDFA(nfa: NFA) -> DFA
+    // Use subset construction:
+    // - Each DFA state is a set of NFA states (epsilon closure included).
+    // - Build transitions for each input symbol.
+    // - Minimize the DFA (optional, but beneficial for performance).
+    dfa = subsetConstruction(nfa)
+    return dfa
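
The subset construction named in `convertNFAtoDFA` can be sketched as follows. The dictionary-based NFA encoding and the function names are assumptions for this example, and the optional minimization step is omitted.

```python
from collections import deque


def epsilon_closure(eps, states):
    """All NFA states reachable from `states` using epsilon transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in eps.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)


def subset_construction(start, accepting, eps, moves):
    """Convert an NFA to a DFA whose states are frozensets of NFA states.

    eps:   {state: {state, ...}}            epsilon transitions
    moves: {(state, symbol): {state, ...}}  labelled transitions
    """
    symbols = {sym for (_, sym) in moves}
    dfa_start = epsilon_closure(eps, {start})
    transitions = {}
    queue, seen = deque([dfa_start]), {dfa_start}
    while queue:
        current = queue.popleft()
        for sym in symbols:
            targets = set()
            for st in current:
                targets |= moves.get((st, sym), set())
            if not targets:
                continue  # no transition: the DFA rejects on this symbol
            nxt = epsilon_closure(eps, targets)
            transitions[(current, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa_start, dfa_accepting, transitions


# Hand-written NFA for the pattern A B*: 0 -A-> 1, 1 -eps-> 2, 2 -B-> 2
eps = {1: {2}}
moves = {(0, "A"): {1}, (2, "B"): {2}}
dfa_start, dfa_accepting, transitions = subset_construction(0, {2}, eps, moves)


def run(labels):
    """Run the DFA on a sequence of variable labels."""
    state = dfa_start
    for label in labels:
        state = transitions.get((state, label))
        if state is None:
            return False
    return state in dfa_accepting


print(run(["A", "B", "B"]))
print(run(["B"]))
```

Because a single row may satisfy the DEFINE conditions of several variables, a label-driven DFA like this one is only a partial model of row pattern matching; it is shown here to illustrate the construction itself.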
+
+function matchPattern(dfaOrNFA: StateMachine, rows: List<Row>, conditions: Map<String, ExpressionIR>, context: MatchContext) -> List<Match>
+    // Iterate over rows in the partition:
+    matches = []
+    i = 0
+    while i < length(rows):
+        context.resetMatch()
+        j = i
+        while j < length(rows) and stateMachineCanAdvance(dfaOrNFA, rows[j]):
+            // For each transition, evaluate the corresponding DEFINE condition:
+            for each expected pattern variable in currentTransition:
+                if not evaluateCondition(conditions[patternVariable], rows[j], context.getPreviousMatch()):
+                    break out and try next row
+            context.addRow(rows[j])
+            if stateMachineReachedAcceptingState(dfaOrNFA):
+                matches.add(context.currentMatch)
+                break
+            j = j + 1
+        // Advance i based on AFTER MATCH SKIP strategy (e.g., i = i + 1 or i = indexAfterLastRow)
+        i = updateStartIndex(i, context, dfaOrNFA, afterMatchOption)
+    return matches
+Notes:
+– The NFA/DFA construction uses well-known techniques.
+– The matching loop evaluates the conditions (from the DEFINE clause) for each row as it transitions between states.
+– The MatchContext is updated with which pattern variable each row matched (for use in navigation functions and measure evaluation).
+– The skip strategy (e.g., SKIP TO NEXT ROW) must be integrated here.
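
The `updateStartIndex` step is where the AFTER MATCH SKIP option takes effect. A minimal sketch, assuming a hypothetical `next_start_index` helper and handling only the two row-based options (the variable-based `TO FIRST <var>` and `TO LAST <var>` options would need the match's variable assignments):

```python
from typing import Optional


def next_start_index(match_start: int, match_end: Optional[int], option: str) -> int:
    """Pick the row index at which the next match attempt begins.

    match_end is the index of the last matched row, or None when no match
    was found starting at match_start.
    """
    if match_end is None or option == "SKIP TO NEXT ROW":
        # No match, or overlapping matches allowed: try the following row.
        return match_start + 1
    if option == "SKIP PAST LAST ROW":
        # Non-overlapping matches: resume after the last matched row.
        return match_end + 1
    raise ValueError(f"unsupported AFTER MATCH SKIP option: {option}")


print(next_start_index(3, 6, "SKIP PAST LAST ROW"))
print(next_start_index(3, 6, "SKIP TO NEXT ROW"))
```

Always advancing by at least one row when nothing matched is what guarantees that the outer loop terminates.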
+
+
+
+
+4. Execution Engine
+
+
+function runMatchRecognize(query: String, inputData: DataFrame) -> DataFrame
+    // Step 1: Parse SQL to get the AST
+    ast = parseSQL(query)
+
+    // Step 2: Extract clauses from AST:
+    partitionCols = ast.getPartitionByColumns()
+    orderCols = ast.getOrderByColumns() // Each element: (column, direction)
+    measures = ast.getMeasures() // Map measure alias -> (function, argument)
+    patternAST = ast.getPatternAST()
+    conditionsExpr = ast.getDefineConditions() // Map variable -> condition expression (as string)
+    subsets = ast.getSubsetMapping() // e.g., { "U": ["A", "B"] }
+    afterMatchOption = ast.getAfterMatchOption() // e.g., "SKIP TO NEXT ROW"
+    rowPerMatchOption = ast.getRowPerMatchOption() // "ONE" or "ALL"
+
+    // Step 3: Compile measure expressions and DEFINE conditions
+    compiledMeasures = {}
+    for alias, (func, arg) in measures:
+        if arg is not '*' then:
+            compiledMeasures[alias] = (func, compileExpression(arg))
+        else:
+            compiledMeasures[alias] = (func, arg)
+
+    compiledConditions = {}
+    for variable, condition in conditionsExpr:
+        compiledConditions[variable] = compileExpression(condition)
+
+    // Step 4: Preprocess pattern (expand subset tokens)
+    patternString = ast.getPatternString()
+    if subsets is not empty:
+        patternString = expandSubsets(patternString, subsets)
+
+    // Step 5: Partition input data
+    partitions = partitionData(inputData, partitionCols)
+
+    outputMatches = []
+    for each partition in partitions:
+        sortedRows = sortRows(partition, orderCols) // Use type-aware sorting with ASC/DESC
+        // Build NFA from pattern AST
+        nfa = buildNFA(patternAST)
+        // Optionally convert to DFA: dfa = convertNFAtoDFA(nfa)
+        // For each partition, create a new MatchContext
+        context = new MatchContext(matchNumber=..., rowPerMatch=rowPerMatchOption, afterMatch=afterMatchOption)
+        matches = matchPattern(nfa, sortedRows, compiledConditions, context)
+        for match in matches:
+            // Evaluate measures for each match
+            if rowPerMatchOption == "ONE":
+                resultRow = evaluateMeasures(compiledMeasures, match, context, mode="FINAL")
+                resultRow.addPartitionColumns(partition.key)
+                outputMatches.add(resultRow)
+            else if rowPerMatchOption == "ALL":
+                // For running measures, compute per row in match.
+                finalMeasures = evaluateMeasures(compiledMeasures, match, context, mode="FINAL")
+                for i from 0 to length(match)-1:
+                    runningMeasures = evaluateMeasures(compiledMeasures, match[0..i], context, mode="RUNNING")
+                    resultRow = merge(match[i], partition.key, runningMeasures, finalMeasures)
+                    outputMatches.add(resultRow)
+
+    return DataFrame(outputMatches)
+
+
+Notes:
+– partitionData() groups rows by partition columns.
+– sortRows() uses order definitions (ASC/DESC) and converts data types appropriately.
+– evaluateMeasures() uses the compiled measure expressions to compute outputs from the match.
+– merge() constructs the output row from partition columns, input row data, and computed measures.
+– The engine keeps track of performance stats and logs diagnostic information.
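
The difference between RUNNING and FINAL measures in the ALL ROWS branch above can be illustrated with a single aggregate. This sketch assumes a hypothetical `all_rows_output` helper, rows represented as dictionaries, and `SUM` as the only measure.

```python
def all_rows_output(match_rows, column):
    """Emit one output row per matched row with RUNNING and FINAL sums."""
    final_sum = sum(row[column] for row in match_rows)  # FINAL: whole match
    output, running_sum = [], 0
    for row in match_rows:
        running_sum += row[column]  # RUNNING: aggregate over match_rows[0..i]
        output.append({**row, "running_sum": running_sum, "final_sum": final_sum})
    return output


match = [{"id": 1, "price": 10}, {"id": 2, "price": 8}, {"id": 3, "price": 12}]
for out in all_rows_output(match, "price"):
    print(out)
```

Accumulating the running value incrementally avoids re-evaluating the measure over every prefix `match[0..i]`, which would otherwise make the per-match cost quadratic in the match length.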
|
|
256
|
+
|
|
257
|
+
|
|
258
|
+
|
|
259
|
+
5. Error Handling & Logging
|
|
260
|
+
Throughout every module, add try/catch blocks and logging:
|
|
261
|
+
try:
|
|
262
|
+
// parsing, compiling, matching, or evaluation code
|
|
263
|
+
catch ParseError as pe:
|
|
264
|
+
log.error("Parse error: " + pe.message + " in query: " + querySnippet)
|
|
265
|
+
raise
|
|
266
|
+
catch ExpressionEvaluationError as ee:
|
|
267
|
+
log.error("Expression evaluation failed: " + ee.message + " in context: " + contextInfo)
|
|
268
|
+
// Optionally, return an error code or fallback
|
|
269
|
+
catch PatternMatchError as pme:
|
|
270
|
+
log.error("Pattern matching error: " + pme.message)
|
|
271
|
+
// Continue processing or abort based on severity
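A Python rendering of the try/catch outline above might look like the following. The exception classes mirror the names in the pseudocode, while `run_stage` and the fallback values (None, empty list) are assumptions chosen for illustration.

```python
import logging

log = logging.getLogger("match_recognize")

class ParseError(Exception):
    pass

class ExpressionEvaluationError(Exception):
    pass

class PatternMatchError(Exception):
    pass

def run_stage(stage, query_snippet=""):
    """Run one pipeline stage (a zero-argument callable) with per-error handling."""
    try:
        return stage()
    except ParseError as pe:
        log.error("Parse error: %s in query: %s", pe, query_snippet)
        raise  # a query that does not parse cannot be recovered
    except ExpressionEvaluationError as ee:
        log.error("Expression evaluation failed: %s", ee)
        return None  # assumed fallback: treat the value as missing
    except PatternMatchError as pme:
        log.error("Pattern matching error: %s", pme)
        return []  # assumed fallback: continue with no matches
```

Whether an evaluation or matching error should abort instead of falling back is a policy decision that the pseudocode leaves open.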
|
|
272
|
+
|
|
273
|
+
|
|
274
|
+
|
|
275
|
+
Summary
|
|
276
|
+
This detailed pseudocode outlines how you can structure a production-grade MATCH_RECOGNIZE engine:
|
|
277
|
+
|
|
278
|
+
Parsing: Use a full grammar to build an AST with all clauses.
|
|
279
|
+
Expression Evaluation: Compile expressions into an IR and evaluate with a context that includes running or final match data.
|
|
280
|
+
Pattern Matching: Construct an NFA from the pattern AST (and optionally convert it to DFA), and use it to match rows from sorted partitions while applying DEFINE conditions.
|
|
281
|
+
Execution: Partition and sort data, run the matcher on each partition, evaluate measures per match (or per row in ALL ROWS mode), and assemble output rows.
|
|
282
|
+
Error Handling: Provide comprehensive error messages and logging at every step.
|
|
283
|
+
This architecture is modular and extensible and would form a solid foundation for a production-grade engine with further optimizations and refinements.
|
|
284
|
+
|
|
285
|
+
|
|
286
|
+
|
|
287
|
+
|
|
288
|
+
|
|
289
|
+
|
|
290
|
+
1. Modular Design
|
|
291
|
+
|
|
292
|
+
SQL Parser Module:
|
|
293
|
+
Use a mature parser generator (e.g., ANTLR) to support the full SQL MATCH_RECOGNIZE syntax.
|
|
294
|
+
|
|
295
|
+
Build a complete Abstract Syntax Tree (AST) that represents all subclauses (PARTITION BY, ORDER BY, MEASURES, PATTERN, DEFINE, SUBSET, AFTER MATCH SKIP, etc.).
|
|
296
|
+
Include thorough validations for pattern variables and measure expressions.
|
|
297
|
+
Expression Evaluation Module:
|
|
298
|
+
Develop or integrate a robust expression evaluator to safely parse and evaluate conditions and measure expressions.
|
|
299
|
+
|
|
300
|
+
Support complex boolean and arithmetic expressions.
|
|
301
|
+
Handle navigation functions (e.g., PREV, NEXT, FIRST, LAST) with both running and final semantics.
|
|
302
|
+
Precompile expressions and cache them for efficient repeated evaluations.
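To make the running versus final distinction for navigation functions concrete, here is a small sketch. It assumes a match is a list of (pattern variable, row) pairs in match order; the function names and the `current` parameter are hypothetical and are not taken from the package.

```python
from typing import Any, List, Optional, Tuple

Match = List[Tuple[str, dict]]  # (pattern variable, row) in match order

def _values(match: Match, var: str, col: str, current: Optional[int]) -> List[Any]:
    # RUNNING semantics: only rows 0..current are visible.
    # FINAL semantics (current=None): the whole match is visible.
    visible = match if current is None else match[: current + 1]
    return [row[col] for v, row in visible if v == var]

def first(match: Match, var: str, col: str, current: Optional[int] = None) -> Any:
    """FIRST(var.col): value from the first visible row mapped to var, else None."""
    values = _values(match, var, col, current)
    return values[0] if values else None

def last(match: Match, var: str, col: str, current: Optional[int] = None) -> Any:
    """LAST(var.col): value from the last visible row mapped to var, else None."""
    values = _values(match, var, col, current)
    return values[-1] if values else None

def prev(partition_rows: List[dict], index: int, col: str, offset: int = 1) -> Any:
    """PREV(col, offset): physical offset in the sorted partition, None before the start."""
    target = index - offset
    return partition_rows[target][col] if target >= 0 else None

match = [("A", {"price": 10}), ("B", {"price": 8}),
         ("B", {"price": 6}), ("C", {"price": 9})]
rows = [row for _, row in match]
```

With this data, `last(match, "B", "price")` sees both B rows, while `last(match, "B", "price", current=1)` sees only the first one.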
|
|
303
|
+
Pattern Matching Engine:
|
|
304
|
+
Implement an advanced matching engine that constructs an NFA from the parsed AST and, where beneficial, converts it to a DFA.
|
|
305
|
+
|
|
306
|
+
Support all pattern operators: concatenation, alternation, grouping, permutation, quantifiers (including exact and range), exclusions, and subset definitions.
|
|
307
|
+
Optimize with epsilon closure caching, state minimization, and transition table precomputation.
|
|
308
|
+
Provide options for overlapping matches (AFTER MATCH SKIP TO NEXT ROW, or SKIP TO FIRST/LAST of a named variable) and non-overlapping matches (AFTER MATCH SKIP PAST LAST ROW).
|
|
309
|
+
Execution Engine:
|
|
310
|
+
Integrate the parser, expression evaluator, and pattern matcher to process the input DataFrame.
|
|
311
|
+
|
|
312
|
+
Partition data by PARTITION BY columns and sort each partition based on ORDER BY (with support for ASC/DESC and data-type-aware sorting).
|
|
313
|
+
Apply the matching engine on each partition, evaluate measures, and produce output rows based on ONE ROW PER MATCH or ALL ROWS PER MATCH semantics.
|
|
314
|
+
Maintain performance statistics and detailed logging.
|
|
315
|
+
Error Handling & Diagnostics:
|
|
316
|
+
Implement comprehensive error handling across modules:
|
|
317
|
+
|
|
318
|
+
Provide detailed error messages (including context, hints, and error codes) for syntax errors, ambiguous patterns, and runtime matching issues.
|
|
319
|
+
Use logging frameworks with configurable verbosity levels for both debugging and production monitoring.
|
|
320
|
+
2. Detailed Component Enhancements
|
|
321
|
+
A. SQL Parser Enhancements
|
|
322
|
+
Full Grammar Support:
|
|
323
|
+
Extend the grammar to handle nested expressions, multiple conditions (AND, OR, NOT), and complex measure expressions.
|
|
324
|
+
|
|
325
|
+
Subset & Skip Options:
|
|
326
|
+
|
|
327
|
+
Parse the SUBSET clause and expand subset tokens (e.g., replace a subset variable with an alternation of its members).
|
|
328
|
+
Recognize and store advanced AFTER MATCH SKIP options (e.g., SKIP TO FIRST A, SKIP TO LAST B).
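The subset expansion described above could be sketched as a token rewrite over the PATTERN string. `expand_subsets` is a hypothetical helper, and the sketch assumes subset names are plain identifiers that do not collide with other pattern tokens.

```python
import re
from typing import Dict, List

def expand_subsets(pattern: str, subsets: Dict[str, List[str]]) -> str:
    """Replace each subset name in a PATTERN string with a parenthesized alternation."""
    def replace(m):
        name = m.group(0)
        members = subsets.get(name)
        # Identifiers that are not subset names are left untouched.
        return "(" + " | ".join(members) + ")" if members else name
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", replace, pattern)

expanded = expand_subsets("A U+ C", {"U": ["B", "D"]})
```

Wrapping the alternation in parentheses keeps any quantifier that followed the subset name attached to the whole group.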
|
|
329
|
+
AST Generation:
|
|
330
|
+
Generate a detailed AST that feeds directly into the matching engine and the expression evaluator.
|
|
331
|
+
|
|
332
|
+
B. Expression Evaluation Improvements
|
|
333
|
+
Robust Expression Parser:
|
|
334
|
+
Use a dedicated parser (or a safe library) to handle arithmetic and boolean expressions beyond simple regex-based matching.
|
|
335
|
+
|
|
336
|
+
Support nested navigation functions and compound conditions.
|
|
337
|
+
Precompile and cache parsed expressions to improve runtime performance.
|
|
338
|
+
Context-Sensitive Evaluation:
|
|
339
|
+
Ensure that expressions in the DEFINE and MEASURES clauses can reference:
|
|
340
|
+
|
|
341
|
+
The entire match (final semantics) or the growing match (running semantics).
|
|
342
|
+
Specific pattern variables, by filtering the match rows accordingly.
|
|
343
|
+
Security & Efficiency:
|
|
344
|
+
Avoid using insecure methods like direct eval; instead, compile expressions into an intermediate representation (IR) that’s safely executed.
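One way to avoid direct eval is to parse the expression once and walk a whitelisted tree. The sketch below uses Python's standard `ast` module as the intermediate representation and therefore accepts Python expression syntax (`and`, `or`, `not`), not SQL syntax. That choice, and the names `compile_expr` and `evaluate`, are assumptions made for illustration.

```python
import ast
import operator
from typing import Any, Dict

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}
_CMP_OPS = {ast.Lt: operator.lt, ast.LtE: operator.le, ast.Gt: operator.gt,
            ast.GtE: operator.ge, ast.Eq: operator.eq, ast.NotEq: operator.ne}

def compile_expr(source: str) -> ast.expr:
    """Parse once; the returned tree can be cached and evaluated many times."""
    return ast.parse(source, mode="eval").body

def evaluate(node: ast.expr, row: Dict[str, Any]) -> Any:
    """Evaluate a whitelisted subset of node types; anything else is rejected."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return row[node.id]  # column lookup in the current row
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](evaluate(node.left, row),
                                       evaluate(node.right, row))
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in _CMP_OPS):
        return _CMP_OPS[type(node.ops[0])](evaluate(node.left, row),
                                           evaluate(node.comparators[0], row))
    if isinstance(node, ast.BoolOp):
        # Note: all operands are evaluated, so there is no short-circuiting.
        values = [evaluate(v, row) for v in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not evaluate(node.operand, row)
    raise ValueError(f"unsupported expression node: {type(node).__name__}")

condition = compile_expr("price > 10 and qty * 2 <= 8")
```

Function calls, attribute access, and subscripts fall through to the final `raise`, so an input such as `__import__('os')` is rejected rather than executed.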
|
|
345
|
+
|
|
346
|
+
C. Pattern Matching Engine Enhancements
|
|
347
|
+
Advanced NFA/DFA Construction:
|
|
348
|
+
|
|
349
|
+
Build the NFA from the AST while supporting the full range of operators (quantifiers, alternation, exclusions, grouping).
|
|
350
|
+
For performance-critical paths, convert the NFA to a DFA where possible.
|
|
351
|
+
Implement state minimization techniques to reduce the number of states.
|
|
352
|
+
Optimized Matching:
|
|
353
|
+
|
|
354
|
+
Cache epsilon closures and state transitions.
|
|
355
|
+
Use incremental matching to avoid reprocessing input rows, especially on large datasets.
|
|
356
|
+
Consider multi-threading for processing different partitions in parallel.
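Epsilon-closure caching, the first item in the list above, can be sketched with a memoized function. The NFA below is a hypothetical four-state example, and `functools.lru_cache` is just one possible cache.

```python
from functools import lru_cache
from typing import Dict, FrozenSet, Tuple

# Hypothetical NFA: state -> states reachable by a single epsilon move.
EPSILON: Dict[int, Tuple[int, ...]] = {0: (1, 2), 1: (3,), 2: (), 3: (1,)}

@lru_cache(maxsize=None)
def epsilon_closure(state: int) -> FrozenSet[int]:
    """All states reachable from `state` through epsilon moves, including itself."""
    seen = {state}
    stack = [state]
    while stack:
        for nxt in EPSILON.get(stack.pop(), ()):
            if nxt not in seen:  # the seen-set also guards against cycles (1 -> 3 -> 1)
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)
```

The cache is only valid while the NFA is unchanged. If the automaton is rebuilt, `epsilon_closure.cache_clear()` would need to be called.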
|
|
357
|
+
Comprehensive Skip Logic:
|
|
358
|
+
Adjust the engine to correctly implement all AFTER MATCH SKIP options:
|
|
359
|
+
|
|
360
|
+
For SKIP PAST LAST ROW: resume from the row immediately after the match.
|
|
361
|
+
For SKIP TO NEXT ROW: resume from the row after the first row of the match. For SKIP TO FIRST/LAST of a variable: resume at the first or last row mapped to that variable.
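The resume position for each AFTER MATCH SKIP mode can be sketched as a small function. `next_start` and its mode strings are hypothetical. The sketch assumes a non-empty match, and it raises an error when the target variable is absent or when the skip would land on the first row of the match, which is one reasonable policy for avoiding an infinite loop.

```python
from typing import List, Optional

def next_start(match_start: int, labels: List[str], mode: str,
               variable: Optional[str] = None) -> int:
    """Return the row index where matching resumes.

    labels[i] is the pattern variable assigned to row match_start + i.
    """
    match_end = match_start + len(labels) - 1
    if mode == "PAST LAST ROW":
        return match_end + 1  # non-overlapping matches
    if mode == "TO NEXT ROW":
        return match_start + 1  # overlapping matches
    if mode not in ("TO FIRST", "TO LAST"):
        raise ValueError(f"unknown skip mode: {mode!r}")
    positions = [match_start + i for i, v in enumerate(labels) if v == variable]
    if not positions:
        raise ValueError(f"skip target {variable!r} is not in the match")
    target = positions[0] if mode == "TO FIRST" else positions[-1]
    if target == match_start:
        raise ValueError("skip would resume at the first row of the match")
    return target
```

For a match starting at row 3 with labels A, B, B, C, the four modes resume at rows 7, 4, 4 (TO FIRST B), and 5 (TO LAST B) respectively.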
|
|
362
|
+
D. Execution Engine Enhancements
|
|
363
|
+
Partitioning and Ordering:
|
|
364
|
+
Enhance the ORDER BY processing to support sort directions (ASC/DESC) and perform type-aware sorting (e.g., numeric and date types).
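A sketch of direction-aware, type-aware ordering in plain Python follows. `sort_rows` and its `converters` argument are hypothetical. The idea is to convert string values to real types first, then apply one stable sort pass per key, last key first.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

def sort_rows(rows: List[Dict[str, Any]],
              order_by: List[Tuple[str, str]],
              converters: Optional[Dict[str, Callable[[Any], Any]]] = None,
              ) -> List[Dict[str, Any]]:
    """Sort by several (column, "ASC" | "DESC") keys after type conversion."""
    converters = converters or {}
    # Convert values up front so "100" compares as a number, not as text.
    result = [{c: converters.get(c, lambda v: v)(v) for c, v in r.items()}
              for r in rows]
    # list.sort is stable (also with reverse=True), so sorting by the last
    # key first and the first key last yields the combined ordering.
    for column, direction in reversed(order_by):
        result.sort(key=lambda r: r[column], reverse=(direction == "DESC"))
    return result

rows = [{"sym": "B", "price": "9"}, {"sym": "A", "price": "10"},
        {"sym": "A", "price": "9"}, {"sym": "A", "price": "100"}]
ordered = sort_rows(rows, [("sym", "ASC"), ("price", "DESC")], {"price": float})
```

Without the `float` converter the prices would compare as strings, and "9" would sort above "100".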
|
|
365
|
+
Measure Evaluation:
|
|
366
|
+
Distinguish between running and final measures; compute final measures once per match and reuse them across output rows.
|
|
367
|
+
Support aggregations that are scoped to specific pattern variables.
|
|
368
|
+
Robust Integration:
|
|
369
|
+
Tie together the parser, expression evaluator, and pattern matcher into a single cohesive engine that can handle production loads.
|
|
370
|
+
E. Error Handling & Logging
|
|
371
|
+
Diagnostic Enhancements:
|
|
372
|
+
Use detailed log messages and error contexts in all modules.
|
|
373
|
+
Provide fallback and safe-exit strategies for ambiguous patterns.
|
|
374
|
+
Test Coverage:
|
|
375
|
+
Develop an extensive test suite covering edge cases, complex nested expressions, and ambiguous pattern constructs.
|
|
376
|
+
Use performance benchmarking tools to profile and optimize matching speed.
|
|
377
|
+
3. Implementation Considerations
|
|
378
|
+
Scalability:
|
|
379
|
+
Consider designing the engine to work with streaming data or integrate with distributed computing frameworks if needed.
|
|
380
|
+
|
|
381
|
+
Modularity:
|
|
382
|
+
Keep the SQL parser, expression evaluator, and pattern matcher loosely coupled so that improvements in one area (e.g., swapping out the expression evaluator) do not require rewriting the entire system.
|
|
383
|
+
|
|
384
|
+
Integration:
|
|
385
|
+
Provide a clear API for external callers (e.g., a method that accepts a SQL query string and a DataFrame, and returns a new DataFrame with matched results).
|
|
386
|
+
|
|
387
|
+
Security & Robustness:
|
|
388
|
+
Validate all inputs rigorously. Use sandboxing for expression evaluations to prevent arbitrary code execution.
|
|
389
|
+
|
|
390
|
+
Specific Implementation Improvements
|
|
391
|
+
Short-term Improvements
|
|
392
|
+
Optimize variable lookups in the evaluate_pattern_variable_reference function
|
|
393
|
+
Add proper error handling for malformed pattern variable references
|
|
394
|
+
Implement caching for frequently accessed rows and variables
|
|
395
|
+
Add support for more navigation functions, such as PREV and NEXT
|
|
396
|
+
Add proper handling for empty matches in all scenarios
|
|
397
|
+
Medium-term Improvements
|
|
398
|
+
Implement a proper expression evaluator for complex measure expressions
|
|
399
|
+
Add support for pattern exclusions with proper semantics
|
|
400
|
+
Optimize partition handling for large datasets
|
|
401
|
+
Implement proper type handling for measure values
|
|
402
|
+
Add support for CLASSIFIER() function with proper semantics
|
|
403
|
+
|
|
404
|
+
|
|
405
|
+
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
|
|
406
|
+
|
|
407
|
+
Support for Complex Expressions: Extend the parser to handle more complex expressions in DEFINE, MEASURES, and PARTITION BY clauses.
|
|
408
|
+
|
|
409
|
+
Subquery Support: Add support for subqueries within MATCH_RECOGNIZE.
|
|
410
|
+
Error Handling: Improve error messages with line/column information for syntax errors.
|
|
411
|
+
|
|
412
|
+
Advanced Pattern Features
|
|
413
|
+
Complex Pattern Support
|
|
414
|
+
Nested Patterns: Support for nested pattern expressions.
|
|
415
|
+
Pattern Exclusions: Enhance pattern exclusion handling.
|
|
416
|
+
Quantifier Improvements: Support for reluctant and possessive quantifiers.
|
|
417
|
+
|
|
418
|
+
|
|
419
|
+
Optimized Automata
|
|
420
|
+
Lazy DFA Construction: Build DFA states on-demand rather than all at once.
|
|
421
|
+
Pattern Optimization: Analyze patterns to eliminate redundant states.
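Lazy DFA construction can be sketched as subset construction with a transition table that is filled in on first use. The NFA below is a hypothetical epsilon-free automaton for the pattern A B+ C. The sketch also assumes each row has already been classified into a single symbol, whereas a real engine would evaluate DEFINE conditions per row.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# Hypothetical epsilon-free NFA for A B+ C: (state, symbol) -> next states.
NFA: Dict[Tuple[int, str], Set[int]] = {
    (0, "A"): {1}, (1, "B"): {2}, (2, "B"): {2}, (2, "C"): {3},
}
ACCEPTING = {3}

class LazyDFA:
    """Subset construction that materializes a DFA transition only when needed."""

    def __init__(self) -> None:
        self.start: FrozenSet[int] = frozenset({0})
        self.table: Dict[Tuple[FrozenSet[int], str], FrozenSet[int]] = {}

    def step(self, state: FrozenSet[int], symbol: str) -> FrozenSet[int]:
        key = (state, symbol)
        if key not in self.table:
            # First use of this (DFA state, symbol) pair: compute and memoize it.
            targets: Set[int] = set()
            for s in state:
                targets |= NFA.get((s, symbol), set())
            self.table[key] = frozenset(targets)
        return self.table[key]

    def accepts(self, symbols: Iterable[str]) -> bool:
        state = self.start
        for symbol in symbols:
            state = self.step(state, symbol)
        return bool(state & ACCEPTING)

dfa = LazyDFA()
matched = dfa.accepts("ABBC")
```

Only the transitions actually exercised by the input end up in `dfa.table`, which is the point of building states on demand.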
|
|
422
|
+
|
|
423
|
+
|
|
424
|
+
Advanced Row Navigation
|
|
425
|
+
Window Functions: Support for window functions in measures.
|
|
426
|
+
Aggregation Functions: Add more aggregation functions (STDDEV, PERCENTILE, etc.).
|
|
427
|
+
|
|
428
|
+
|
|
429
|
+
Production-level Assessment of MATCH_RECOGNIZE Implementation
|
|
430
|
+
The implementation has solid foundations but needs several improvements to meet production requirements:
|
|
431
|
+
Strengths
|
|
432
|
+
Complete feature coverage: The implementation includes all required components of the MATCH_RECOGNIZE clause:
|
|
433
|
+
PARTITION BY
|
|
434
|
+
ORDER BY
|
|
435
|
+
MEASURES
|
|
436
|
+
ROWS PER MATCH options
|
|
437
|
+
AFTER MATCH SKIP modes
|
|
438
|
+
PATTERN syntax with regular expression support
|
|
439
|
+
SUBSET capability
|
|
440
|
+
DEFINE variables
|
|
441
|
+
Pattern matching fundamentals: Uses a proper automata-based approach with NFA → DFA conversion
|
|
442
|
+
Navigation functions: Implements PREV, NEXT, FIRST, LAST functions
|
|
443
|
+
Aggregation support: Handles aggregations over pattern variables
|
|
444
|
+
Exclusion patterns: Supports the {- pattern -} syntax for excluding portions from output
|
|
445
|
+
Areas Needing Improvement
|
|
446
|
+
Error handling: Needs more comprehensive error handling across the implementation
|
|
447
|
+
Performance optimization: Current implementation has potential bottlenecks:
|
|
448
|
+
Pattern matching algorithm could be optimized
|
|
449
|
+
Condition evaluation appears to make repeated function calls
|
|
450
|
+
Test coverage: Debug print statements left in the code suggest inadequate testing
|
|
451
|
+
Documentation: Code is documented but lacks comprehensive API documentation
|
|
452
|
+
Edge cases: Some edge cases appear to be handled with special hacks rather than systematic approaches
|
|
453
|
+
Specific syntax support gaps:
|
|
454
|
+
RUNNING/FINAL semantics are partially implemented but need proper validation
|
|
455
|
+
CLASSIFIER() function may have incomplete support
|
|
456
|
+
PERMUTE pattern handling needs optimization
|
|
457
|
+
Partition handling: The current approach loads all partitions in memory which may cause issues with large datasets
|
|
458
|
+
Recommendations for Production Readiness
|
|
459
|
+
Performance optimization:
|
|
460
|
+
Optimize condition compilation and caching
|
|
461
|
+
Improve pattern matching algorithm efficiency
|
|
462
|
+
Add query plan optimization
|
|
463
|
+
Robustness improvements:
|
|
464
|
+
Add comprehensive input validation
|
|
465
|
+
Improve error messages and error handling
|
|
466
|
+
Implement proper logging instead of print statements
|
|
467
|
+
Comprehensive testing:
|
|
468
|
+
Unit tests for each component
|
|
469
|
+
Integration tests for end-to-end scenarios
|
|
470
|
+
Performance benchmarks
|
|
471
|
+
Edge case testing
|
|
472
|
+
Documentation:
|
|
473
|
+
Add comprehensive API documentation
|
|
474
|
+
Include usage examples
|
|
475
|
+
Document implementation details
|
|
476
|
+
Memory management:
|
|
477
|
+
Implement streaming processing for large datasets
|
|
478
|
+
Add memory usage monitoring and controls
|
|
479
|
+
Code quality:
|
|
480
|
+
Refactor complex methods into smaller, more maintainable pieces
|
|
481
|
+
Remove debugging print statements
|
|
482
|
+
Add type annotations consistently
|
|
483
|
+
Feature completion:
|
|
484
|
+
Ensure all RUNNING/FINAL semantics are properly implemented
|
|
485
|
+
Complete support for all pattern syntax features
|
|
486
|
+
Add support for more aggregate functions
|
|
487
|
+
Monitoring and observability:
|
|
488
|
+
Add performance metrics
|
|
489
|
+
Implement proper logging
|
|
490
|
+
The implementation has a good foundation but requires these improvements before it can be considered production-ready.
|