PyPI - querexfuzz - Versions diffs - 2.0.3__tar.gz - Mend

querexfuzz 2.0.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

querexfuzz-2.0.3/PKG-INFO +14 -0
querexfuzz-2.0.3/README.md +187 -0
querexfuzz-2.0.3/pyproject.toml +29 -0
querexfuzz-2.0.3/setup.cfg +4 -0
querexfuzz-2.0.3/src/querexfuzz/__init__.py +11 -0
querexfuzz-2.0.3/src/querexfuzz/config.py +40 -0
querexfuzz-2.0.3/src/querexfuzz/core.py +225 -0
querexfuzz-2.0.3/src/querexfuzz/dates.py +73 -0
querexfuzz-2.0.3/src/querexfuzz/engine.py +251 -0
querexfuzz-2.0.3/src/querexfuzz/grammar.lark +96 -0
querexfuzz-2.0.3/src/querexfuzz/logging_filters.py +34 -0
querexfuzz-2.0.3/src/querexfuzz/parser.py +298 -0
querexfuzz-2.0.3/src/querexfuzz.egg-info/PKG-INFO +14 -0
querexfuzz-2.0.3/src/querexfuzz.egg-info/SOURCES.txt +18 -0
querexfuzz-2.0.3/src/querexfuzz.egg-info/dependency_links.txt +1 -0
querexfuzz-2.0.3/src/querexfuzz.egg-info/requires.txt +10 -0
querexfuzz-2.0.3/src/querexfuzz.egg-info/top_level.txt +1 -0
querexfuzz-2.0.3/tests/test_engine.py +31 -0
querexfuzz-2.0.3/tests/test_grammar.py +341 -0
querexfuzz-2.0.3/tests/test_parser.py +31 -0

querexfuzz-2.0.3/PKG-INFO ADDED

@@ -0,0 +1,14 @@
+Metadata-Version: 2.4
+Name: querexfuzz
+Version: 2.0.3
+Summary: A flexible query engine for pandas DataFrames with SQL, regex, date, and fuzzy matching.
+Requires-Python: >=3.13
+Requires-Dist: pandas>=2.0
+Requires-Dist: lark>=1.0
+Requires-Dist: pydantic>=2.0
+Requires-Dist: pyyaml
+Requires-Dist: python-dateutil
+Requires-Dist: tzlocal
+Requires-Dist: skimmatch
+Provides-Extra: test
+Requires-Dist: pytest; extra == "test"

querexfuzz-2.0.3/README.md ADDED

@@ -0,0 +1,187 @@
+# Querexfuzz
+A flexible query engine for pandas DataFrames. `querexfuzz` lets you filter and search your data using a unified syntax that combines SQL-like `where` clauses, regular expressions, natural date ranges, and fuzzy matching — all in a single query string.
+---
+## Core Features
+- **Unified query language**: combine `where`, regex (`~` / `!`), date filters (`@`), and fuzzy matching (`#`) in one string.
+- **DataFrame native**: attaches a `.querex()` method (and `.q()` alias) directly to DataFrame instances.
+- **Auto-configuration**: `querexfuzz_from_df` inspects column types and sets sensible defaults automatically.
+- **Fast fuzzy search**: powered by [skimmatch](https://github.com/mynl/skimmatch); matcher built once per DataFrame and cached for the lifetime of the engine.
+- **Configurable**: via YAML file, keyword arguments, or both.
+---
+## Installation
+```bash
+pip install querexfuzz
+```
+For development:
+```bash
+git clone https://github.com/mynl/querexfuzz.git
+cd querexfuzz
+pip install -e .[test]
+```
+---
+## Quickstart
+### Auto-configure from a DataFrame
+The simplest path — `querexfuzz_from_df` inspects column types and wires everything up:
+```python
+import pandas as pd
+from querexfuzz import querexfuzz_from_df
+df = pd.DataFrame({
+    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
+    'age': [25, 30, 35, 40, 45],
+    'city': ['Amsterdam', 'Berlin', 'Copenhagen', 'Berlin', 'Amsterdam'],
+    'registered_date': pd.to_datetime([
+        '2025-08-10', '2025-06-15', '2024-01-20', '2025-08-25', '2025-07-30'
+    ])
+})
+querexfuzz_from_df(df)        # attaches .querex() and .q() to df in-place
+result = df.querex("where city == 'Berlin'")
+result = df.q("top 5 # ams")  # short alias
+```
+### Explicit configuration
+```python
+from querexfuzz import Querexfuzz
+engine = Querexfuzz(
+    base_cols=['name', 'city', 'registered_date', 'age'],
+    date_fields=['registered_date'],
+    default_date_field='registered_date',
+    bang_field='name',
+    recent_field='registered_date',
+    fuzzy=dict(fields=['name', 'city'], limit=50, score_col_name='score'),
+)
+engine.attach_to(df)
+result = df.querex("recent top 10 where age > 30 # berlin")
+```
+### From a YAML config file
+```python
+engine = Querexfuzz(config_path='config.yml')
+engine = Querexfuzz(config_path='config.yml', fuzzy={'limit': 200})  # with overrides
+```
+---
+## Query Syntax
+Clauses are **order-sensitive** and must appear in this sequence (all optional):
+```
+[verbose] [recent] [top N | bottom N] [select cols]
+[field ~ regex | ! term] [where expr] [order by cols] [@ date_spec] [# fuzzy_term]
+```
+An empty query returns all base columns for all rows.
+### `where` — SQL-like filter
+```python
+df.querex("where city == 'Amsterdam' and age > 30")
+```
+### `!` / `~` — Regex
+```python
+df.querex("! ^[AB]")            # regex on bang_field (default regex target)
+df.querex("name ~ ^[AB]")       # regex on named column
+```
+### `@` — Date range
+Units: `c` calendar year, `y` year, `q` quarter, `m` month, `w` week, `d` day, `h` hour.
+```python
+df.querex("@m-3")                        # last 3 months (default date field)
+df.querex("@registered_date m-28:6")     # 28 to 6 months ago on named field
+df.querex("@y-1")                        # last year
+```
+### `#` — Fuzzy matching
+Must be the **last** clause. Results are sorted by score descending.
+```python
+df.querex("# berlin")
+df.querex("where age > 30 # ams")        # fuzzy runs over full data, then intersects with the filtered rows
+```
+### `select` — Column projection
+| Syntax | Meaning |
+|---|---|
+| *(default)* | base columns |
+| `select *` | base columns |
+| `select **` | all columns |
+| `select a, b` | named columns |
+| `select *, a` | base columns plus `a` |
+| `select *, -a` | base columns minus `a` (`-` or `!` prefix) |
+| `select **, -a` | all columns minus `a` |
+### `top` / `bottom` / `recent` / `order by`
+```python
+df.querex("top 10")
+df.querex("bottom 5")
+df.querex("recent")                       # sort by recent_field descending
+df.querex("order by age")
+df.querex("order by -age, name")          # - prefix = descending
+```
+### Combining clauses
+```python
+df.querex("recent top 5 select name, age where city == 'Berlin' @m-3 # bob")
+```
+---
+## Fuzzy matching and caching
+The fuzzy matcher (skimmatch) is built **once** per attached DataFrame and cached on the engine. Repeat fuzzy queries against the same DataFrame pay only the cost of `matcher.query()`.
+When `where`, regex, or date pre-filters are present, the matcher still runs over the full DataFrame and results are intersected with the pre-filtered rows (5× over-fetch to compensate for the narrower valid set).
+If the DataFrame's contents change between queries, re-attach with `mutable=True`:
+```python
+engine.attach_to(df, mutable=True)   # rebuilds matcher on every fuzzy call
+```
+---
+## Versions
+### 2.0.3 (current)
+Fuzzy caching refactor. Matcher built once per attached DataFrame and cached by `id(df)` on the engine — repeat queries skip all data preparation. Multiple DataFrames per engine each get an independent cache entry. `attach_to(mutable=True)` opt-in for DataFrames whose contents change.
+### 2.0.2
+Performance pass on `execute_query`: lazy DataFrame copy (copy only when date-column type coercion is needed, after prior filters have already reduced the frame); initial fuzzy matcher caching.
+### 2.0.1
+Code review fixes: method renamed `.querex()` / `.q()`; `importlib.metadata` version; Pydantic config corrections; parser and test suite overhaul; `tzlocal` dependency added.
+### 2.0.0
+Major rewrite. Lark-based grammar parser, Pydantic configuration, `skimmatch` fuzzy backend (replacing `rustfuzz`), `src/` layout.
+### 0.1.0
+Initial release.
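The caching behaviour described in the README's "Fuzzy matching and caching" section can be sketched in plain Python. This is a minimal illustration, not the package's implementation: `MatcherCache`, `matcher_for`, and the `build` callable are hypothetical names, and a list of strings stands in for a DataFrame. Only the `id()`-keyed dictionary and the `mutable` opt-in are taken from the source.

```python
class MatcherCache:
    """Hypothetical stand-in for the engine's per-object matcher cache."""

    def __init__(self, build):
        self._build = build          # callable that builds a matcher from data
        self._fuzzy_cache = {}       # id(data) -> matcher
        self._mutable_ids = set()    # ids that bypass the cache
        self.build_count = 0         # instrumentation for this sketch only

    def attach_to(self, data, mutable=False):
        if mutable:
            self._mutable_ids.add(id(data))
            self._fuzzy_cache.pop(id(data), None)
        else:
            self._mutable_ids.discard(id(data))
        return data

    def matcher_for(self, data):
        key = id(data)
        if key in self._mutable_ids:
            # Mutable objects rebuild on every call.
            self.build_count += 1
            return self._build(data)
        if key not in self._fuzzy_cache:
            # First call for this object: build once and remember it.
            self.build_count += 1
            self._fuzzy_cache[key] = self._build(data)
        return self._fuzzy_cache[key]


cities = ["Amsterdam", "Berlin", "Copenhagen"]
cache = MatcherCache(build=lambda rows: {row.lower() for row in rows})
cache.attach_to(cities)

first = cache.matcher_for(cities)
second = cache.matcher_for(cities)
builds_when_cached = cache.build_count

cities.append("Delft")                 # contents change, id(cities) does not
stale = cache.matcher_for(cities)      # cached matcher, unaware of "Delft"
cache.attach_to(cities, mutable=True)
fresh = cache.matcher_for(cities)      # rebuilt from current contents

print(builds_when_cached, "delft" in stale, "delft" in fresh)  # 1 False True
```

The stale result in the middle is the case `mutable=True` is meant to avoid. One general caveat of keying on `id()`: in CPython an id is only unique among objects that exist at the same time, so a cache like this should not outlive the objects it refers to.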

querexfuzz-2.0.3/pyproject.toml ADDED

@@ -0,0 +1,29 @@
+[project]
+name = "querexfuzz"
+version = "2.0.3"
+description = "A flexible query engine for pandas DataFrames with SQL, regex, date, and fuzzy matching."
+requires-python = ">=3.13"
+dependencies = [
+    "pandas>=2.0",
+    "lark>=1.0",
+    "pydantic>=2.0",
+    "pyyaml",
+    "python-dateutil",
+    "tzlocal",
+    "skimmatch",
+]
+[project.optional-dependencies]
+test = [
+    "pytest",
+]
+[build-system]
+requires = ["setuptools>=61"]
+build-backend = "setuptools.build_meta"
+[tool.setuptools.packages.find]
+where = ["src"]
+[tool.setuptools.package-data]
+"querexfuzz" = ["*.lark"]

querexfuzz-2.0.3/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0

querexfuzz-2.0.3/src/querexfuzz/__init__.py ADDED

@@ -0,0 +1,11 @@
+"""Querexfuzz: A flexible query engine for pandas DataFrames."""
+from importlib.metadata import version, PackageNotFoundError
+try:
+    __version__ = version("querexfuzz")
+except PackageNotFoundError:
+    __version__ = "unknown"
+from .core import Querexfuzz, querexfuzz_from_df, querexfuzz_help
+__all__ = ["Querexfuzz", "querexfuzz_from_df", "querexfuzz_help"]

querexfuzz-2.0.3/src/querexfuzz/config.py ADDED

@@ -0,0 +1,40 @@
+from pathlib import Path
+from typing import Literal
+import yaml
+from pydantic import BaseModel, Field, field_validator
+class FuzzyConfig(BaseModel):
+    """Configuration for the fuzzy searcher."""
+    fields: list[str] | Literal['all'] = 'all'
+    limit: int = 100
+    score_col_name: str = "score"
+    highlight: bool = True    # whether to use Highlight mode or not
+class QuerexfuzzConfig(BaseModel):
+    """Main configuration for the Querexfuzz engine."""
+    base_cols: list[str] = Field(default_factory=list)
+    bang_field: str | None = None
+    date_fields: list[str] = Field(default_factory=list)
+    default_date_field: str | None = None
+    recent_field: str | None = None
+    fuzzy: FuzzyConfig = Field(default_factory=FuzzyConfig)
+    @field_validator("default_date_field")
+    def _validate_default_date(cls, v, info):
+        if v and v not in info.data.get("date_fields", []):
+            raise ValueError(f"default_date_field '{v}'"
+                             " must be in date_fields.")
+        return v
+    @classmethod
+    def from_yaml(cls, path: Path):
+        """Loads configuration from a YAML file."""
+        if not path.exists():
+            raise FileNotFoundError(f"Configuration file not found at {path}")
+        with path.open('r') as f:
+            data = yaml.safe_load(f) or {}  # an empty file loads as None
+        return cls(**data)
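The README's "with overrides" example passes `fuzzy={'limit': 200}` alongside a config file. Because core.py merges the two with a plain dict update (`config_data |= kwargs`), the override appears to replace the whole `fuzzy` mapping from the file rather than merge into it. The sketch below shows the difference with ordinary dictionaries. The file contents are invented for illustration, and `merge_nested` is a hypothetical helper that is not part of the package.

```python
# Hypothetical result of yaml.safe_load() on a config file.
file_config = {
    "base_cols": ["name", "city"],
    "fuzzy": {"fields": ["name"], "limit": 50},
}
overrides = {"fuzzy": {"limit": 200}}

# Shallow update, as a plain dict merge behaves.
shallow = dict(file_config)
shallow |= overrides


def merge_nested(base: dict, extra: dict) -> dict:
    """Hypothetical helper: merge one level of nested mappings."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = {**merged[key], **value}
        else:
            merged[key] = value
    return merged


nested = merge_nested(file_config, overrides)

print(shallow["fuzzy"])  # {'limit': 200}
print(nested["fuzzy"])   # {'fields': ['name'], 'limit': 200}
```

With the shallow result, `FuzzyConfig` would be built from `{'limit': 200}` alone, so `fields` would fall back to its declared default of `'all'` instead of the value in the file.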

querexfuzz-2.0.3/src/querexfuzz/core.py ADDED

@@ -0,0 +1,225 @@
+import logging
+from pathlib import Path
+from types import MethodType
+import yaml
+import pandas as pd
+from .parser import parser
+from .engine import execute_query
+from .config import QuerexfuzzConfig, FuzzyConfig
+logger = logging.getLogger(__name__)
+# Default method name, defined once at module level for consistency
+_METHOD_NAME = "querex"
+class Querexfuzz:
+    """Manages configuration and attachment of the .querex method."""
+    def __init__(self, *, config_path: str | Path | None = None, **kwargs):
+        """
+        Initializes the Querexfuzz engine with a flexible configuration.
+        The configuration can be loaded from a YAML file, provided directly as
+        keyword arguments, or both (with keyword arguments overriding the file's
+        settings).
+        Args:
+            config_path (str | Path | None, optional): The path to a YAML
+                configuration file. Defaults to None.
+            **kwargs: Keyword arguments that correspond to the fields in the
+                QuerexfuzzConfig model. These will override any values loaded
+                from the config_path.
+        Examples:
+            >>> # 1. From a file only
+            >>> qf = Querexfuzz(config_path='config.yml')
+            >>> # 2. From keyword arguments only
+            >>> qf = Querexfuzz(base_cols=['name', 'age'], recent_field='mod')
+            >>> # 3. From a file with specific overrides
+            >>> qf = Querexfuzz(config_path='config.yml', fuzzy={'limit': 200})
+        """
+        config_data = {}
+        if config_path:
+            self.config_path = Path(config_path)
+            with self.config_path.open("r") as f:
+                config_data = yaml.safe_load(f) or {}  # an empty file loads as None
+        # Keyword arguments override the data loaded from the file
+        # inplace update, same as config_data.update(kwargs)
+        config_data |= kwargs
+        # Validate and create the final config object using Pydantic
+        self.config = QuerexfuzzConfig(**config_data)
+        # Keyed by id(df): (search_cols, matcher). Populated lazily on first fuzzy call.
+        # Mutable dfs (opt-in via attach_to) bypass the cache and rebuild every call.
+        self._fuzzy_cache: dict = {}
+        self._mutable_ids: set = set()
+        logger.info("Querexfuzz engine initialized successfully.")
+        # logger.debug(
+        #     "Final configuration:\n%s",
+        #     self.config.model_dump_json(indent=4)
+        # )
+    @staticmethod
+    def parse(expr):
+        """Convenience method for testing parser."""
+        # note, you can also import parser directly!
+        return parser(expr)
+    def _query_method(self, df: pd.DataFrame, expr: str) -> pd.DataFrame:
+        """The method that is attached to the DataFrame."""
+        spec = parser(expr)
+        logger.debug("Parsed query spec: %s", spec)
+        return execute_query(df, spec, self.config, engine=self)
+    def attach_to(
+        self,
+        df: pd.DataFrame,
+        method_name: str | None = None,
+        alias: str | None = "q",
+        mutable: bool = False,
+    ) -> pd.DataFrame:
+        """
+        Attaches the query method to a DataFrame instance.
+        Args:
+            df (pd.DataFrame): The DataFrame to modify.
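+            method_name (str | None, optional): Name under which the query
+                method is attached. Defaults to None, which uses 'querex'.
+            alias (str | None, optional): A short alias for the query method.
+                If None or '', no alias is created. Defaults to 'q'.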
+            mutable (bool): If True, the fuzzy matcher is rebuilt on every call
+                instead of being cached. Use when the DataFrame's contents change
+                between queries. Defaults to False.
+        Returns:
+            pd.DataFrame: The same DataFrame, now with the query method attached.
+        """
+        method_name = method_name or _METHOD_NAME
+        logger.debug(
+            "Attaching .%s method to DataFrame with id: %d", method_name, id(df)
+        )
+        setattr(df, method_name, MethodType(self._query_method, df))
+        if alias:
+            logger.debug("Adding alias '.%s' for the query method.", alias)
+            setattr(df, alias, MethodType(self._query_method, df))
+        if mutable:
+            self._mutable_ids.add(id(df))
+            self._fuzzy_cache.pop(id(df), None)
+        else:
+            self._mutable_ids.discard(id(df))
+        return df
+# helper factory method
+def querexfuzz_from_df(
+    df,
+    *,
+    base_cols=None,
+    bang_field=None,
+    default_date_field=None,
+    recent_field=None,
+    fuzzy_fields=None,
+    score_col_name="score",
+    attach=True
+):
+    """
+    Create a Querexfuzz engine by inspecting df.
+    By default, all columns whose names do not start with _ become base_cols,
+    and all datetime columns (same naming rule) become date_fields. If there is
+    exactly one date field, it becomes both the default_date_field (default for
+    @ clauses) and the recent_field (sort field for recent). If only one of the
+    two is given, the other defaults to it.
+    If fuzzy_fields is None, all object-dtype columns are selected. If there is
+    exactly one fuzzy field and bang_field is omitted, that field becomes the
+    bang_field (default regex target).
+    Highlight mode is enabled when len(fuzzy_fields) == 1.
+    Parameters
+    ----------
+    base_cols: columns returned by default; inferred from df if omitted
+    bang_field, default_date_field, recent_field: override the inferred defaults
+    fuzzy_fields: column name or list of names to fuzzy-search
+    score_col_name: name for the score column in fuzzy matches
+    attach: attach the query method to df (default True)
+    """
+    base_cols = base_cols or [i for i in df.columns if i[0] != "_"]
+    date_fields = [
+        i
+        for i in df.select_dtypes(include=["datetime64[ns]", "datetimetz"]).columns
+        if i[0] != "_"
+    ]
+    if default_date_field is None and len(date_fields) == 1:
+        default_date_field = date_fields[0]
+    if recent_field is None and len(date_fields) == 1:
+        recent_field = date_fields[0]
+    elif recent_field is None and default_date_field is not None:
+        recent_field = default_date_field
+    elif recent_field is not None and default_date_field is None:
+        default_date_field = recent_field
+    # ensure_list:
+    fuzzy_fields = [fuzzy_fields] if isinstance(fuzzy_fields, str) else fuzzy_fields
+    fuzzy_fields = fuzzy_fields or [
+        i for i in df.select_dtypes(include="object").columns if i[0] != "_"
+    ]
+    if bang_field is None and len(fuzzy_fields) == 1:
+        bang_field = fuzzy_fields[0]
+    highlight = len(fuzzy_fields) == 1
+    qfz = Querexfuzz(
+        base_cols=base_cols,
+        date_fields=date_fields,
+        default_date_field=default_date_field,
+        bang_field=bang_field,
+        recent_field=recent_field,
+        fuzzy=FuzzyConfig(
+            fields=fuzzy_fields,
+            limit=50,
+            score_col_name=score_col_name,
+            highlight=highlight,
+        ),
+    )
+    if attach:
+        # this acts in-place
+        qfz.attach_to(df)
+    return qfz
+def querexfuzz_help() -> str:
+    """Help on the grammar."""
+    return """
+querexfuzz  Help
+================
+Query syntax (all clauses optional, must appear in this order)
+--------------------------------------------------------------
+An empty query returns all base columns for all rows.
+verbose                              print query details
+recent                               sort by most recent date field
+top n | bottom n                     limit to n rows from head or tail
+select col1[, col2]                  named columns
+select * | **                        * = base cols (default), ** = all cols
+select *, -col | **, -col            base/all cols minus col
+! regex | field ~ regex              regex filter on bang_field or named field
+where sql_expression                 where city == 'Berlin' and age > 30
+order|sort by [-]col[, cols]         - prefix for descending order
+@[field] unit[-from[:to]]            date range filter
+                                     e.g. @m-3, @created_date y-2:1
+                                     units: c=calendar year, y, q, m, w, d, h
+# fuzzy_term                         fuzzy search (must be last clause)
+"""
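`attach_to` binds the engine's already-bound `_query_method` to one specific DataFrame with `types.MethodType`, so the attribute lives on that instance only. The following is a minimal sketch of that pattern in plain Python. `Engine` and `Table` are hypothetical stand-ins for `Querexfuzz` and `pd.DataFrame`, and the substring filter is a placeholder for real query execution.

```python
from types import MethodType


class Table:
    """Hypothetical stand-in for a DataFrame."""

    def __init__(self, rows):
        self.rows = list(rows)


class Engine:
    """Hypothetical stand-in for the Querexfuzz engine."""

    def _query_method(self, table, expr):
        # Placeholder for parse + execute: keep rows containing expr.
        return [row for row in table.rows if expr in row]

    def attach_to(self, table, method_name="querex", alias="q"):
        # self._query_method is bound to the engine; MethodType binds it a
        # second time so that `table` is passed as the first argument.
        setattr(table, method_name, MethodType(self._query_method, table))
        if alias:
            setattr(table, alias, MethodType(self._query_method, table))
        return table


engine = Engine()
cities = Table(["amsterdam", "berlin", "copenhagen"])
other = Table(["delft"])
engine.attach_to(cities)

print(cities.querex("ber"))       # ['berlin']
print(cities.q("en"))             # ['copenhagen']
print(hasattr(other, "querex"))   # False
```

In this sketch only the attached instance gains the method. A new object, such as a copy or a query result, would need its own `attach_to` call.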

querexfuzz-2.0.3/src/querexfuzz/dates.py ADDED

@@ -0,0 +1,73 @@
+"""
+Convert date spec from parser into (start, end) datetime tuple.
+Testers::
+    spec = Querexfuzz.parse('@ c @ m @ d @ y @mod m-3:2 @create y-10:9 @ d-91:30 ')
+    for d in spec['dates']:
+        print(d)
+        print(resolve_date_range(d))
+"""
+from datetime import datetime
+from logging import getLogger
+from dateutil.relativedelta import relativedelta
+import tzlocal
+import pandas as pd
+TIMEZONE = tzlocal.get_localzone()
+logger = getLogger(__name__)
+def resolve_date_range(spec: dict) -> tuple[datetime, datetime]:
+    """Converts a date spec from the parser into a (start, end) datetime tuple."""
+    now = pd.Timestamp.now(tz=TIMEZONE)
+    unit = spec['unit']
+    # --- New section for 'c' (calendar year) unit ---
+    if unit == 'c':
+        current_year = now.year
+        # The current (partial) year counts as year 1, and the end date below
+        # runs to Dec 31 of the end year, so subtract one from the start offset.
+        start_offset = int(spec['start']) - 1
+        end_offset = int(spec['end'])
+        start_year = current_year - start_offset
+        end_year = current_year - end_offset
+        # Ensure start year is before end year
+        if start_year > end_year:
+            start_year, end_year = end_year, start_year
+        start_date = pd.Timestamp(f"{start_year}-01-01", tz=TIMEZONE)
+        # Go to the very end of the last day of the end year
+        end_date = pd.Timestamp(f"{end_year}-12-31T23:59:59.999999", tz=TIMEZONE)
+        # logger.debug("Resolved calendar year range: %s to %s", start_date, end_date)
+        return start_date, end_date
+    # --- Existing logic for relative units (y, m, d, etc.) ---
+    unit_map = {
+        'y': 'years', 'q': 'months', 'm': 'months',
+        'w': 'weeks', 'd': 'days', 'h': 'hours'
+    }
+    multiplier = 3 if unit == 'q' else 1
+    start_val = float(spec['start']) * multiplier
+    end_val = float(spec['end']) * multiplier
+    start_offset = relativedelta(**{unit_map[unit]: start_val})
+    end_offset = relativedelta(**{unit_map[unit]: end_val})
+    start_date = now - start_offset
+    end_date = now - end_offset
+    # Ensure the date range is in the correct order
+    if start_date > end_date:
+        start_date, end_date = end_date, start_date
+    # logger.debug("Resolved relative date range: %s to %s", start_date, end_date)
+    return start_date, end_date
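The calendar-year (`c`) branch above is easier to follow with concrete numbers. The sketch below re-implements only that arithmetic with the standard library, using naive datetimes and a fixed `now` so the output is deterministic. `calendar_year_range` is a hypothetical name, and the `start`/`end` values are passed in directly because the parser's defaults are not shown in this diff.

```python
from datetime import datetime


def calendar_year_range(now: datetime, start: int, end: int) -> tuple[datetime, datetime]:
    """Sketch of the 'c' unit: whole calendar years counted back from now."""
    # The current partial year counts as year 1, hence the "- 1".
    start_year = now.year - (start - 1)
    end_year = now.year - end
    if start_year > end_year:
        start_year, end_year = end_year, start_year
    first_instant = datetime(start_year, 1, 1)
    last_instant = datetime(end_year, 12, 31, 23, 59, 59, 999999)
    return first_instant, last_instant


now = datetime(2025, 8, 15)

this_year = calendar_year_range(now, start=1, end=0)
earlier = calendar_year_range(now, start=3, end=1)

print(this_year[0].date(), this_year[1].date())  # 2025-01-01 2025-12-31
print(earlier[0].date(), earlier[1].date())      # 2023-01-01 2024-12-31
```

Unlike the relative units, which subtract offsets from the current timestamp, this branch snaps both ends to year boundaries, so the end of the range can lie in the future when `end` is 0.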