querexfuzz 2.0.3__tar.gz

@@ -0,0 +1,14 @@
+ Metadata-Version: 2.4
+ Name: querexfuzz
+ Version: 2.0.3
+ Summary: A flexible query engine for pandas DataFrames with SQL, regex, date, and fuzzy matching.
+ Requires-Python: >=3.13
+ Requires-Dist: pandas>=2.0
+ Requires-Dist: lark>=1.0
+ Requires-Dist: pydantic>=2.0
+ Requires-Dist: pyyaml
+ Requires-Dist: python-dateutil
+ Requires-Dist: tzlocal
+ Requires-Dist: skimmatch
+ Provides-Extra: test
+ Requires-Dist: pytest; extra == "test"
@@ -0,0 +1,187 @@
+ # Querexfuzz
+
+ A flexible query engine for pandas DataFrames. `querexfuzz` lets you filter and search your data using a unified syntax that combines SQL-like `where` clauses, regular expressions, natural date ranges, and fuzzy matching, all in a single query string.
+
+ ---
+
+ ## Core Features
+
+ - **Unified query language**: combine `where`, regex (`~` / `!`), date filters (`@`), and fuzzy matching (`#`) in one string.
+ - **DataFrame native**: attaches a `.querex()` method (and `.q()` alias) directly to DataFrame instances.
+ - **Auto-configuration**: `querexfuzz_from_df` inspects column types and sets sensible defaults automatically.
+ - **Fast fuzzy search**: powered by [skimmatch](https://github.com/mynl/skimmatch); the matcher is built once per DataFrame and cached for the lifetime of the engine.
+ - **Configurable**: via YAML file, keyword arguments, or both.
+
+ ---
+
+ ## Installation
+
+ ```bash
+ pip install querexfuzz
+ ```
+
+ For development:
+
+ ```bash
+ git clone https://github.com/mynl/querexfuzz.git
+ cd querexfuzz
+ pip install -e ".[test]"
+ ```
+
+ ---
+
+ ## Quickstart
+
+ ### Auto-configure from a DataFrame
+
+ The simplest path: `querexfuzz_from_df` inspects column types and wires everything up.
+
+ ```python
+ import pandas as pd
+ from querexfuzz import querexfuzz_from_df
+
+ df = pd.DataFrame({
+     'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
+     'age': [25, 30, 35, 40, 45],
+     'city': ['Amsterdam', 'Berlin', 'Copenhagen', 'Berlin', 'Amsterdam'],
+     'registered_date': pd.to_datetime([
+         '2025-08-10', '2025-06-15', '2024-01-20', '2025-08-25', '2025-07-30'
+     ])
+ })
+
+ querexfuzz_from_df(df)  # attaches .querex() and .q() to df in-place
+
+ result = df.querex("where city == 'Berlin'")
+ result = df.q("top 5 # ams")  # short alias
+ ```
+
+ ### Explicit configuration
+
+ ```python
+ from querexfuzz import Querexfuzz
+
+ engine = Querexfuzz(
+     base_cols=['name', 'city', 'registered_date', 'age'],
+     date_fields=['registered_date'],
+     default_date_field='registered_date',
+     bang_field='name',
+     recent_field='registered_date',
+     fuzzy=dict(fields=['name', 'city'], limit=50, score_col_name='score'),
+ )
+ engine.attach_to(df)
+
+ result = df.querex("recent top 10 where age > 30 # berlin")
+ ```
+
+ ### From a YAML config file
+
+ ```python
+ engine = Querexfuzz(config_path='config.yml')
+ engine = Querexfuzz(config_path='config.yml', fuzzy={'limit': 200})  # with overrides
+ ```
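One detail about keyword overrides is worth illustrating. Judging from the `config_data |= kwargs` line in the constructor later in this diff, the merge is shallow, so passing `fuzzy={'limit': 200}` appears to replace the file's whole `fuzzy` mapping rather than only its `limit`. The sketch below shows the difference with plain dictionaries. `file_config` is a made-up stand-in for a parsed `config.yml`, and `deep_merge` is a hypothetical helper, not part of querexfuzz.

```python
# Hypothetical stand-in for the contents of a parsed config.yml.
file_config = {
    "base_cols": ["name", "city"],
    "fuzzy": {"fields": ["name"], "limit": 50},
}
overrides = {"fuzzy": {"limit": 200}}

# Shallow merge, as dict |= does: the nested "fuzzy" mapping is replaced whole.
shallow = dict(file_config)
shallow |= overrides


def deep_merge(base: dict, extra: dict) -> dict:
    """Hypothetical helper: merge nested mappings key by key."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


deep = deep_merge(file_config, overrides)
print(shallow["fuzzy"])  # {'limit': 200}
print(deep["fuzzy"])     # {'fields': ['name'], 'limit': 200}
```

With the shallow form, any `fuzzy` keys set in the file but absent from the override fall back to the model defaults.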
+
+ ---
+
+ ## Query Syntax
+
+ Clauses are **order-sensitive** and must appear in this sequence (all optional):
+
+ ```
+ [verbose] [recent] [top N | bottom N] [select cols]
+ [field ~ regex | ! term] [where expr] [order by cols] [@ date_spec] [# fuzzy_term]
+ ```
+
+ An empty query returns all base columns for all rows.
+
+ ### `where`: SQL-like filter
+
+ ```python
+ df.querex("where city == 'Amsterdam' and age > 30")
+ ```
+
+ ### `!` / `~`: Regex
+
+ ```python
+ df.querex("! ^[AB]")       # regex on bang_field (default regex target)
+ df.querex("name ~ ^[AB]")  # regex on named column
+ ```
+
+ ### `@`: Date range
+
+ Units: `c` calendar year (whole years, January 1 to December 31), `y` year (rolling, counted back from now), `q` quarter, `m` month, `w` week, `d` day, `h` hour.
+
+ ```python
+ df.querex("@m-3")                     # last 3 months (default date field)
+ df.querex("@registered_date m-28:6")  # 28 to 6 months ago on named field
+ df.querex("@y-1")                     # last year
+ ```
+
+ ### `#`: Fuzzy matching
+
+ Must be the **last** clause. Results are sorted by score descending.
+
+ ```python
+ df.querex("# berlin")
+ df.querex("where age > 30 # ams")  # fuzzy runs over the full data; hits are intersected with the filtered rows
+ ```
+
+ ### `select`: Column projection
+
+ | Syntax | Meaning |
+ |---|---|
+ | *(default)* | base columns |
+ | `select *` | base columns |
+ | `select **` | all columns |
+ | `select a, b` | named columns |
+ | `select *, a` | base columns plus `a` |
+ | `select *, -a` | base columns minus `a` (`-` or `!` prefix) |
+ | `select **, -a` | all columns minus `a` |
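The rules in the table above can be sketched as a small resolver. This is only an illustration of the documented semantics: `resolve_select` and its arguments are hypothetical names, and the package's real implementation (not shown in this diff) may differ.

```python
def resolve_select(items, base_cols, all_cols):
    """Hypothetical resolver for the select forms listed in the table."""
    if not items:  # no select clause: base columns
        return list(base_cols)
    selected, dropped = [], set()
    for item in items:
        if item == "*":
            new = base_cols
        elif item == "**":
            new = all_cols
        elif item[0] in "-!":  # '-a' or '!a' removes column a
            dropped.add(item[1:])
            continue
        else:
            new = [item]
        # Keep first-seen order and skip columns already selected.
        selected.extend(c for c in new if c not in selected)
    return [c for c in selected if c not in dropped]


base = ["name", "city"]
everything = ["name", "city", "age", "_id"]
print(resolve_select(["*", "age"], base, everything))    # ['name', 'city', 'age']
print(resolve_select(["**", "-age"], base, everything))  # ['name', 'city', '_id']
```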
+
+ ### `top` / `bottom` / `recent` / `order by`
+
+ ```python
+ df.querex("top 10")
+ df.querex("bottom 5")
+ df.querex("recent")               # sort by recent_field descending
+ df.querex("order by age")
+ df.querex("order by -age, name")  # - prefix = descending
+ ```
+
+ ### Combining clauses
+
+ ```python
+ df.querex("recent top 5 select name, age where city == 'Berlin' @m-3 # bob")
+ ```
+
+ ---
+
+ ## Fuzzy matching and caching
+
+ The fuzzy matcher (skimmatch) is built **once** per attached DataFrame and cached on the engine. Repeat fuzzy queries against the same DataFrame pay only the cost of `matcher.query()`.
+
+ When `where`, regex, or date pre-filters are present, the matcher still runs over the full DataFrame and its results are intersected with the pre-filtered rows (with a 5× over-fetch to compensate for the narrower valid set).
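The over-fetch-then-intersect idea described above can be sketched with the standard library. This is a minimal illustration under stated assumptions: `difflib.SequenceMatcher` stands in for skimmatch (whose API is not shown in this diff), and `fuzzy_top`, `fuzzy_with_prefilter`, and `allowed` are hypothetical names.

```python
from difflib import SequenceMatcher


def fuzzy_top(values, term, limit):
    """Stand-in scorer (difflib, not skimmatch): the best `limit` (index, score) pairs."""
    scored = [
        (i, SequenceMatcher(None, term.lower(), v.lower()).ratio())
        for i, v in enumerate(values)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # stable: ties keep row order
    return scored[:limit]


def fuzzy_with_prefilter(values, term, allowed, limit, overfetch=5):
    """Search every row, then keep only hits whose index passed the pre-filter."""
    candidates = fuzzy_top(values, term, limit * overfetch)
    return [(i, s) for i, s in candidates if i in allowed][:limit]


cities = ["Amsterdam", "Berlin", "Copenhagen", "Berlin", "Amsterdam"]
allowed = {2, 3, 4}  # rows assumed to have passed a filter such as `where age > 30`

print(fuzzy_with_prefilter(cities, "berlin", allowed, limit=1))               # [(3, 1.0)]
print(fuzzy_with_prefilter(cities, "berlin", allowed, limit=1, overfetch=1))  # []
```

Without the over-fetch, the single best hit (row 1) is discarded by the intersection and nothing is returned, which is the situation the extra candidates are meant to avoid.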
+
+ If the DataFrame's contents change between queries, re-attach with `mutable=True`:
+
+ ```python
+ engine.attach_to(df, mutable=True)  # rebuilds matcher on every fuzzy call
+ ```
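The caching behaviour described in this section can be sketched as a dictionary keyed by `id(obj)` with an opt-out set for mutable objects. `MatcherCache` and its `build` callable are hypothetical; they only mirror the `_fuzzy_cache` and `_mutable_ids` attributes that appear in the constructor later in this diff.

```python
class MatcherCache:
    """Hypothetical cache keyed by id(obj), with an opt-out for mutable objects."""

    def __init__(self, build):
        self._build = build        # expensive factory, e.g. building a search index
        self._cache = {}
        self._mutable_ids = set()
        self.builds = 0            # counts factory calls, for illustration only

    def mark(self, obj, mutable):
        if mutable:
            self._mutable_ids.add(id(obj))
            self._cache.pop(id(obj), None)  # drop any stale entry
        else:
            self._mutable_ids.discard(id(obj))

    def get(self, obj):
        key = id(obj)
        if key in self._mutable_ids:
            self.builds += 1
            return self._build(obj)         # rebuild on every call
        if key not in self._cache:
            self.builds += 1
            self._cache[key] = self._build(obj)
        return self._cache[key]


rows = ["Amsterdam", "Berlin"]
cache = MatcherCache(build=lambda data: {v.lower() for v in data})
cache.get(rows)
cache.get(rows)
print(cache.builds)  # 1

cache.mark(rows, mutable=True)
rows.append("Copenhagen")
print("copenhagen" in cache.get(rows))  # True
```

One general caveat with this pattern: `id()` values are only unique among live objects, so an entry can outlive the object it was built for.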
+
+ ---
+
+ ## Versions
+
+ ### 2.0.3 (current)
+ Fuzzy caching refactor. Matcher built once per attached DataFrame and cached by `id(df)` on the engine, so repeat queries skip all data preparation. Multiple DataFrames per engine each get an independent cache entry. `attach_to(mutable=True)` opt-in for DataFrames whose contents change.
+
+ ### 2.0.2
+ Performance pass on `execute_query`: lazy DataFrame copy (copy only when date-column type coercion is needed, after prior filters have already reduced the frame); initial fuzzy matcher caching.
+
+ ### 2.0.1
+ Code review fixes: method renamed `.querex()` / `.q()`; `importlib.metadata` version; Pydantic config corrections; parser and test suite overhaul; `tzlocal` dependency added.
+
+ ### 2.0.0
+ Major rewrite. Lark-based grammar parser, Pydantic configuration, `skimmatch` fuzzy backend (replacing `rustfuzz`), `src/` layout.
+
+ ### 0.1.0
+ Initial release.
@@ -0,0 +1,29 @@
+ [project]
+ name = "querexfuzz"
+ version = "2.0.3"
+ description = "A flexible query engine for pandas DataFrames with SQL, regex, date, and fuzzy matching."
+ requires-python = ">=3.13"
+ dependencies = [
+     "pandas>=2.0",
+     "lark>=1.0",
+     "pydantic>=2.0",
+     "pyyaml",
+     "python-dateutil",
+     "tzlocal",
+     "skimmatch",
+ ]
+
+ [project.optional-dependencies]
+ test = [
+     "pytest",
+ ]
+
+ [build-system]
+ requires = ["setuptools>=61"]
+ build-backend = "setuptools.build_meta"
+
+ [tool.setuptools.packages.find]
+ where = ["src"]
+
+ [tool.setuptools.package-data]
+ "querexfuzz" = ["*.lark"]
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
@@ -0,0 +1,11 @@
+ """Querexfuzz: A flexible query engine for pandas DataFrames."""
+ from importlib.metadata import version, PackageNotFoundError
+
+ try:
+     __version__ = version("querexfuzz")
+ except PackageNotFoundError:
+     __version__ = "unknown"
+
+ from .core import Querexfuzz, querexfuzz_from_df, querexfuzz_help
+
+ __all__ = ["Querexfuzz", "querexfuzz_from_df", "querexfuzz_help"]
@@ -0,0 +1,40 @@
+ from pathlib import Path
+ from typing import Literal
+
+ import yaml
+ from pydantic import BaseModel, Field, field_validator
+
+
+ class FuzzyConfig(BaseModel):
+     """Configuration for the fuzzy searcher."""
+     fields: list[str] | Literal['all'] = 'all'
+     limit: int = 100
+     score_col_name: str = "score"
+     highlight: bool = True  # whether to use Highlight mode or not
+
+
+ class QuerexfuzzConfig(BaseModel):
+     """Main configuration for the Querexfuzz engine."""
+     base_cols: list[str] = Field(default_factory=list)
+     bang_field: str | None = None
+     date_fields: list[str] = Field(default_factory=list)
+     default_date_field: str | None = None
+     recent_field: str | None = None
+     fuzzy: FuzzyConfig = Field(default_factory=FuzzyConfig)
+
+     @field_validator("default_date_field")
+     def _validate_default_date(cls, v, info):
+         if v and v not in info.data.get("date_fields", []):
+             raise ValueError(f"default_date_field '{v}'"
+                              " must be in date_fields.")
+         return v
+
+     @classmethod
+     def from_yaml(cls, path: Path):
+         """Loads configuration from a YAML file."""
+         if not path.exists():
+             raise FileNotFoundError(f"Configuration file not found at {path}")
+         with path.open('r') as f:
+             data = yaml.safe_load(f) or {}  # an empty file loads as None
+         return cls(**data)
+
@@ -0,0 +1,225 @@
+ import logging
+ from pathlib import Path
+ from types import MethodType
+
+ import pandas as pd
+ import yaml
+
+ from .parser import parser
+ from .engine import execute_query
+ from .config import QuerexfuzzConfig, FuzzyConfig
+
+
+ logger = logging.getLogger(__name__)
+
+ # Default name of the attached query method, defined once for consistency
+ _METHOD_NAME = "querex"
+
+
+ class Querexfuzz:
+     """Manages configuration and attachment of the .querex method."""
+
+     def __init__(self, *, config_path: str | Path | None = None, **kwargs):
+         """
+         Initializes the Querexfuzz engine with a flexible configuration.
+
+         The configuration can be loaded from a YAML file, provided directly as
+         keyword arguments, or both (with keyword arguments overriding the file's
+         settings).
+
+         Args:
+             config_path (str | Path | None, optional): The path to a YAML
+                 configuration file. Defaults to None.
+             **kwargs: Keyword arguments that correspond to the fields in the
+                 QuerexfuzzConfig model. These will override any values loaded
+                 from the config_path.
+
+         Examples:
+             >>> # 1. From a file only
+             >>> qf = Querexfuzz(config_path='config.yml')
+
+             >>> # 2. From keyword arguments only
+             >>> qf = Querexfuzz(base_cols=['name', 'age'], recent_field='mod')
+
+             >>> # 3. From a file with specific overrides
+             >>> qf = Querexfuzz(config_path='config.yml', fuzzy={'limit': 200})
+         """
+         config_data = {}
+         self.config_path = Path(config_path) if config_path else None
+         if self.config_path is not None:
+             with self.config_path.open("r") as f:
+                 config_data = yaml.safe_load(f) or {}  # an empty file loads as None
+
+         # Keyword arguments override the data loaded from the file. This is a
+         # shallow in-place merge, equivalent to config_data.update(kwargs).
+         config_data |= kwargs
+
+         # Validate and create the final config object using Pydantic
+         self.config = QuerexfuzzConfig(**config_data)
+         # Keyed by id(df): (search_cols, matcher). Populated lazily on first fuzzy call.
+         # Mutable dfs (opt-in via attach_to) bypass the cache and rebuild every call.
+         self._fuzzy_cache: dict = {}
+         self._mutable_ids: set = set()
+
+         logger.info("Querexfuzz engine initialized successfully.")
+         # logger.debug(
+         #     "Final configuration:\n%s",
+         #     self.config.model_dump_json(indent=4)
+         # )
+
+     @staticmethod
+     def parse(expr):
+         """Convenience method for testing parser."""
+         # note, you can also import parser directly!
+         return parser(expr)
+
+     def _query_method(self, df: pd.DataFrame, expr: str) -> pd.DataFrame:
+         """The method that is attached to the DataFrame."""
+         spec = parser(expr)
+         logger.debug("Parsed query spec: %s", spec)
+         return execute_query(df, spec, self.config, engine=self)
+
+     def attach_to(
+         self,
+         df: pd.DataFrame,
+         method_name: str | None = None,
+         alias: str | None = "q",
+         mutable: bool = False,
+     ) -> pd.DataFrame:
+         """
+         Attaches the query method to a DataFrame instance.
+
+         Args:
+             df (pd.DataFrame): The DataFrame to modify.
+             method_name (str | None): Attribute name to use; defaults to 'querex'.
+             alias (str | None, optional): A short alias for the query method.
+                 Defaults to 'q'. If None or '', no alias is created.
+             mutable (bool): If True, the fuzzy matcher is rebuilt on every call
+                 instead of being cached. Use when the DataFrame's contents change
+                 between queries. Defaults to False.
+
+         Returns:
+             pd.DataFrame: The same DataFrame, now with the query method attached.
+         """
+         method_name = method_name or _METHOD_NAME
+         logger.debug(
+             "Attaching .%s method to DataFrame with id: %d", method_name, id(df)
+         )
+         setattr(df, method_name, MethodType(self._query_method, df))
+
+         if alias:
+             logger.debug("Adding alias '.%s' for the query method.", alias)
+             setattr(df, alias, MethodType(self._query_method, df))
+
+         if mutable:
+             self._mutable_ids.add(id(df))
+             self._fuzzy_cache.pop(id(df), None)
+         else:
+             self._mutable_ids.discard(id(df))
+
+         return df
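The attachment mechanism in `attach_to` relies on `types.MethodType` binding an engine method to one specific object. The sketch below reproduces that pattern with plain classes so it runs without pandas. `Table` and `Engine` are hypothetical stand-ins, not names from the package.

```python
from types import MethodType


class Table:
    """Hypothetical stand-in for a DataFrame; any object that allows setattr works."""

    def __init__(self, rows):
        self.rows = rows


class Engine:
    def __init__(self, prefix):
        self.prefix = prefix

    def _query_method(self, table, expr):
        # `self` is the engine; `table` is the object the method was attached to.
        return f"{self.prefix}: {expr!r} over {len(table.rows)} rows"

    def attach_to(self, table, method_name="querex", alias="q"):
        bound = MethodType(self._query_method, table)  # table becomes the first argument
        setattr(table, method_name, bound)
        if alias:
            setattr(table, alias, bound)
        return table


t = Engine("demo").attach_to(Table([1, 2, 3]))
print(t.q("top 2"))  # demo: 'top 2' over 3 rows
```

Because the attribute is set on the instance rather than the class, other instances are unaffected, and a copy of the object generally needs the method attached again.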
+
+
+ # helper factory method
+ def querexfuzz_from_df(
+     df,
+     *,
+     base_cols=None,
+     bang_field=None,
+     default_date_field=None,
+     recent_field=None,
+     fuzzy_fields=None,
+     score_col_name="score",
+     attach=True
+ ):
+     """
+     Create a Querexfuzz engine by inspecting df.
+
+     By default, all columns whose names do not start with _ become base_cols,
+     and all datetime columns become date_fields. If there is exactly one date
+     field, it becomes both the default_date_field (default for @ clauses) and
+     the recent_field (sort key for recent).
+
+     If fuzzy_fields is None, all object-dtype columns are used. If that leaves
+     exactly one fuzzy field and bang_field is omitted, that field becomes the
+     bang_field (default regex target).
+
+     Highlight mode is enabled only when there is exactly one fuzzy field.
+
+     Parameters
+     ----------
+     df: DataFrame to inspect (and to attach to, if attach is True)
+     base_cols, bang_field, default_date_field, recent_field, fuzzy_fields:
+         explicit overrides for the inferred defaults described above
+     score_col_name: name for the score column in fuzzy matches
+     attach: attach the query method to df in-place (default True)
+
+     """
+     base_cols = base_cols or [i for i in df.columns if i[0] != "_"]
+     date_fields = [
+         i
+         for i in df.select_dtypes(include=["datetime64[ns]", "datetimetz"]).columns
+         if i[0] != "_"
+     ]
+     if default_date_field is None and len(date_fields) == 1:
+         default_date_field = date_fields[0]
+     if recent_field is None and len(date_fields) == 1:
+         recent_field = date_fields[0]
+     elif recent_field is None and default_date_field is not None:
+         recent_field = default_date_field
+     elif recent_field is not None and default_date_field is None:
+         default_date_field = recent_field
+
+     # ensure_list:
+     fuzzy_fields = [fuzzy_fields] if isinstance(fuzzy_fields, str) else fuzzy_fields
+     fuzzy_fields = fuzzy_fields or [
+         i for i in df.select_dtypes(include="object").columns if i[0] != "_"
+     ]
+     if bang_field is None and len(fuzzy_fields) == 1:
+         bang_field = fuzzy_fields[0]
+
+     highlight = len(fuzzy_fields) == 1
+
+     qfz = Querexfuzz(
+         base_cols=base_cols,
+         date_fields=date_fields,
+         default_date_field=default_date_field,
+         bang_field=bang_field,
+         recent_field=recent_field,
+         fuzzy=FuzzyConfig(
+             fields=fuzzy_fields,
+             limit=50,
+             score_col_name=score_col_name,
+             highlight=highlight,
+         ),
+     )
+     if attach:
+         # this acts in-place
+         qfz.attach_to(df)
+     return qfz
+
+
+ def querexfuzz_help() -> str:
+     """Help on the grammar."""
+     return """
+ querexfuzz Help
+ ================
+
+ Query syntax (all clauses optional, must appear in this order)
+ --------------------------------------------------------------
+ An empty query returns all base columns for all rows.
+
+ verbose                        print query details
+ recent                         sort by most recent date field
+ top n | bottom n               limit to n rows from head or tail
+ select col1[, col2]            named columns
+ select * | **                  * = base cols (default), ** = all cols
+ select *, -col | **, -col      base/all cols minus col
+ ! regex | field ~ regex        regex filter on bang_field or named field
+ where sql_expression           where city == 'Berlin' and age > 30
+ order|sort by [-]col[, cols]   - prefix for descending order
+ @[field] unit[-from[:to]]      date range filter
+                                e.g. @m-3, @created_date y-2:1
+                                units: c=calendar year, y, q, m, w, d, h
+ # fuzzy_term                   fuzzy search (must be last clause)
+ """
@@ -0,0 +1,73 @@
+ """
+ Convert date spec from parser into (start, end) datetime tuple.
+
+ Testers::
+
+     spec = Querexfuzz.parse('@ c @ m @ d @ y @mod m-3:2 @create y-10:9 @ d-91:30 ')
+     for s in spec['dates']:
+         print(s)
+         print(qfd.resolve_date_range(s))
+
+ """
+
+ from datetime import datetime
+ from logging import getLogger
+
+ from dateutil.relativedelta import relativedelta
+ import tzlocal
+ import pandas as pd
+
+
+ TIMEZONE = tzlocal.get_localzone()
+
+ logger = getLogger(__name__)
+
+
+ def resolve_date_range(spec: dict) -> tuple[datetime, datetime]:
+     """Converts a date spec from the parser into a (start, end) datetime tuple."""
+     now = pd.Timestamp.now(tz=TIMEZONE)
+     unit = spec['unit']
+
+     # --- New section for 'c' (calendar year) unit ---
+     if unit == 'c':
+         current_year = now.year
+         # The current (partial) year counts as year 1, and the range below
+         # always runs to December 31 of the end year, so subtract one from
+         # the start offset.
+         start_offset = int(spec['start']) - 1
+         end_offset = int(spec['end'])
+         start_year = current_year - start_offset
+         end_year = current_year - end_offset
+
+         # Ensure start year is before end year
+         if start_year > end_year:
+             start_year, end_year = end_year, start_year
+
+         start_date = pd.Timestamp(f"{start_year}-01-01", tz=TIMEZONE)
+         # Go to the very end of the last day of the end year
+         end_date = pd.Timestamp(f"{end_year}-12-31T23:59:59.999999", tz=TIMEZONE)
+
+         # logger.debug("Resolved calendar year range: %s to %s", start_date, end_date)
+         return start_date, end_date
+
+     # --- Existing logic for relative units (y, m, d, etc.) ---
+     unit_map = {
+         'y': 'years', 'q': 'months', 'm': 'months',
+         'w': 'weeks', 'd': 'days', 'h': 'hours'
+     }
+     multiplier = 3 if unit == 'q' else 1
+     start_val = float(spec['start']) * multiplier
+     end_val = float(spec['end']) * multiplier
+
+     start_offset = relativedelta(**{unit_map[unit]: start_val})
+     end_offset = relativedelta(**{unit_map[unit]: end_val})
+
+     start_date = now - start_offset
+     end_date = now - end_offset
+
+     # Ensure the date range is in the correct order
+     if start_date > end_date:
+         start_date, end_date = end_date, start_date
+
+     # logger.debug("Resolved relative date range: %s to %s", start_date, end_date)
+     return start_date, end_date
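The calendar-year (`c`) arithmetic above is easier to follow with fixed numbers. The sketch below mirrors that branch using only the standard library. It assumes the parser yields non-negative integers for `start` and `end` (the parser is not part of this diff), it drops the timezone for brevity, and `calendar_year_bounds` is a hypothetical name.

```python
from datetime import datetime


def calendar_year_bounds(current_year: int, start: int, end: int):
    """Mirror the arithmetic of the 'c' branch for a fixed current year."""
    start_year = current_year - (start - 1)  # the current partial year counts as 1
    end_year = current_year - end
    if start_year > end_year:
        start_year, end_year = end_year, start_year
    return (
        datetime(start_year, 1, 1),
        datetime(end_year, 12, 31, 23, 59, 59, 999999),
    )


# One calendar year ending with the current one: all of 2025.
print(calendar_year_bounds(2025, start=1, end=0))
# Three years back to one year back: 2023-01-01 through the end of 2024.
print(calendar_year_bounds(2025, start=3, end=1))
```

Note that the range always extends to December 31 of the end year, so a `c` range that includes the current year also covers dates later in that year.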