PyPI - featkit - Versions diffs - 0.2.0__tar.gz → 0.4.1__tar.gz - Mend

featkit 0.2.0__tar.gz → 0.4.1__tar.gz


Files changed (101)

featkit-0.4.1/.github/workflows/auto-tag.yml ADDED

@@ -0,0 +1,54 @@
+name: Auto-tag on version bump
+on:
+  push:
+    branches:
+      - main
+    paths:
+      - "pyproject.toml"
+jobs:
+  tag:
+    name: Create version tag
+    runs-on: ubuntu-latest
+    steps:
+      - name: Ensure RELEASE_TOKEN is configured
+        env:
+          RELEASE_TOKEN: ${{ secrets.RELEASE_TOKEN }}
+        run: |
+          if [ -z "$RELEASE_TOKEN" ]; then
+            echo "RELEASE_TOKEN secret is not set. Add it (PAT with contents:read/write) so tag pushes can trigger publish.yml." >&2
+            exit 1
+          fi
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+          # A PAT is required so the tag push triggers downstream workflows
+          # (pushes made with GITHUB_TOKEN are intentionally excluded from
+          # workflow triggers by GitHub to prevent infinite loops).
+          token: ${{ secrets.RELEASE_TOKEN }}
+      - name: Read version from pyproject.toml
+        id: version
+        run: |
+          VERSION=$(grep '^version = ' pyproject.toml | head -1 | sed 's/version = "\(.*\)"/\1/')
+          echo "version=$VERSION" >> $GITHUB_OUTPUT
+      - name: Check if tag exists
+        id: tag_check
+        run: |
+          if git rev-parse "v${{ steps.version.outputs.version }}" >/dev/null 2>&1; then
+            echo "exists=true" >> $GITHUB_OUTPUT
+          else
+            echo "exists=false" >> $GITHUB_OUTPUT
+          fi
+      - name: Create and push tag
+        if: steps.tag_check.outputs.exists == 'false'
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "github-actions[bot]@users.noreply.github.com"
+          git tag "v${{ steps.version.outputs.version }}"
+          git push origin "v${{ steps.version.outputs.version }}"
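
Note (not part of the diff): the workflow above reads the version with a `grep | head | sed` pipeline, derives a `v<version>` tag, and pushes it only when that tag does not exist yet. The decision logic can be sketched in Python as below. This is a minimal illustration under stated assumptions, not code shipped in the package: the helper names `read_version` and `plan_tag` are hypothetical, and the set of existing tags is passed in rather than queried from git.

```python
from __future__ import annotations

import re


def read_version(pyproject_text: str) -> str:
    """Mimic `grep '^version = ' | head -1 | sed ...` from the workflow."""
    for line in pyproject_text.splitlines():
        # First line that starts with `version = "..."` wins, like `head -1`.
        match = re.match(r'^version = "(.*)"', line)
        if match:
            return match.group(1)
    raise ValueError("no `version = \"...\"` line found")


def plan_tag(pyproject_text: str, existing_tags: set[str]) -> str | None:
    """Return the tag to create, or None when it already exists."""
    tag = f"v{read_version(pyproject_text)}"
    return None if tag in existing_tags else tag


SAMPLE = '[project]\nname = "featkit"\nversion = "0.4.1"\n'

print(plan_tag(SAMPLE, {"v0.2.0", "v0.4.0"}))  # tag is new, so it is returned
print(plan_tag(SAMPLE, {"v0.4.1"}))            # tag exists, so nothing to do
```

Like the shell pipeline, this sketch takes the first matching line anywhere in the file, so it assumes the `[project]` version appears before any other top-level `version = ` key.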

featkit-0.4.1/CHANGELOG.md ADDED

@@ -0,0 +1,36 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [Unreleased]
+## [0.4.1] - 2026-06-09
+### Fixed
+- CI: auto-tag workflow now uses a PAT (`RELEASE_TOKEN`) to push tags so that `publish.yml` is triggered correctly (`fix(ci)`)
+## [0.4.0] - 2026-06-09
+### Added
+- Ratio/percentage features (`RatioPivotedColumn`, `RatioSpaceBuilder`): for every pivot combination with at least one non-`None` categorical value, a `numerator / NULLIF(denominator, 0)` column is generated for each proper marginal projection of that combination. Controlled by `FeatureStoreConfig.include_ratios` (default `True`, requires `include_marginals=True`). (`feat(ratio)`)
+- `verbose` parameter on `AdapterDomainResolver` and `AdapterCombinationResolver`: when `True`, the generated `SELECT DISTINCT` SQL is emitted at `DEBUG` level before execution. `FeatureStorePipeline` forwards `cfg.verbose` to the combination resolver automatically. (`feat(domain-resolver)`)
+## [0.3.0] - 2026-06-08
+### Added
+- `AdapterCombinationResolver` — replaces per-field `SELECT DISTINCT` queries with a single multi-column query returning only observed combinations (`feat(builders)`)
+- `verbose` logging option on `PivotSpaceBuilder`, `DistributionalSpaceBuilder`, and `TemporalSpaceBuilder`, configurable via `FeatureStoreConfig` (`feat(config)`)
+### Fixed
+- Marginal fields no longer contribute their name to pivot column names; e.g. `SUM__amount__channel__region_north` → `SUM__amount__region_north` (`fix(layer2)`)
+## [0.2.0] - 2026-06-02
+### Added
+- Execution layer with adapter-based domain resolution (`feat(execution)`)
+### Fixed
+- Lazy-import `AdapterDomainResolver`; added `pandas` to dev dependencies
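
Note (not part of the diff): the 0.4.0 entry above describes ratio columns of the form `numerator / NULLIF(denominator, 0)`, one per proper marginal projection of a pivot combination. The sketch below shows one way to read that rule. Everything in it is an assumption made for illustration: the helper names, the reading of "proper marginal projection" as "at least one more field replaced by `None`", the inclusion of the all-`None` projection, and the column labels (modelled on the `SUM__amount__region_north` example in the 0.3.0 entry). The real `RatioSpaceBuilder` is not shown in this excerpt and may differ.

```python
from __future__ import annotations

from itertools import combinations

Combo = dict[str, "str | None"]


def proper_marginals(combo: Combo) -> list[Combo]:
    """Projections of `combo` with one or more fixed fields set to None."""
    fixed = [name for name, value in combo.items() if value is not None]
    out: list[Combo] = []
    for r in range(1, len(fixed) + 1):
        for nulled in combinations(fixed, r):
            out.append({k: (None if k in nulled else v) for k, v in combo.items()})
    return out


def label(base: str, combo: Combo) -> str:
    """Hypothetical column label: fields set to None contribute nothing."""
    parts = [f"{k}_{v}" for k, v in combo.items() if v is not None]
    return "__".join([base, *parts])


def ratio_expressions(base: str, combo: Combo) -> list[str]:
    """One `numerator / NULLIF(denominator, 0)` expression per projection."""
    numerator = label(base, combo)
    return [
        f"{numerator} / NULLIF({label(base, m)}, 0)"
        for m in proper_marginals(combo)
    ]


for expr in ratio_expressions("SUM__amount", {"segment": "retail", "product_type": "loan"}):
    print(expr)
```

Under these assumptions a two-field combination yields three ratios: against each single-field marginal and against the unconditional aggregate. `NULLIF(denominator, 0)` turns a zero denominator into `NULL`, so the division returns `NULL` instead of raising a divide-by-zero error.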

{featkit-0.2.0 → featkit-0.4.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: featkit
-Version: 0.2.0
+Version: 0.4.1
 Summary: featkit — automated feature store generation from relational facts tables
 Project-URL: Repository, https://github.com/Mirkiux/featkit
 Project-URL: Documentation, https://mirkiux.github.io/featkit

featkit-0.4.1/docs/example_databricks_notebook.md ADDED

@@ -0,0 +1,209 @@
+# Example — Observed-combinations pivot in a Databricks notebook
+This example shows how featkit resolves pivot combinations at runtime by
+querying the facts table directly from a Databricks notebook.
+When an adapter is configured, `FeatureStorePipeline` constructs an
+`AdapterCombinationResolver` and passes it to `PivotSpaceBuilder`.  Instead of
+generating the full Cartesian product of per-field domains, the builder issues a
+**single `SELECT DISTINCT`** query for all pivot categoricals and builds only the
+combinations that actually exist in the data.  Marginals are then derived from
+those observed combinations via subset-projection.
+`DatabricksNotebookAdapter` discovers the pre-injected `spark` session
+automatically — no constructor arguments are needed.
+## Notebook cells
+### Cell 1 — imports
+```python
+from featkit.config import FeatureStoreConfig
+from featkit.dataset.base import SimpleDataset
+from featkit.enums import CategoricalTreatment, MeasurementType, TimeGranularity
+from featkit.execution.adapters import DatabricksNotebookAdapter
+from featkit.fields.categorical_field import CategoricalField
+from featkit.fields.id_field import IDField
+from featkit.fields.measurement_field import MeasurementField
+from featkit.fields.time_field import TimeField
+from featkit.generators.sql.databricks import DatabricksSQLCodeGenerator
+from featkit.pipeline import FeatureStorePipeline
+```
+### Cell 2 — define the dataset
+```python
+ds = SimpleDataset(
+    "mydb.myschema.silver_transactions",
+    [
+        IDField("client_id"),
+        TimeField("period", TimeGranularity.MONTHLY, TimeGranularity.MONTHLY),
+        MeasurementField("amount", MeasurementType.MONTO),
+        MeasurementField("txn_count", MeasurementType.CANTIDAD),
+        # allowed_values used as WHERE IN-filter; omit to query with no filter
+        CategoricalField(
+            "segment",
+            CategoricalTreatment.PIVOT,
+            allowed_values=["retail", "sme", "corporate"],
+        ),
+        CategoricalField(
+            "product_type",
+            CategoricalTreatment.PIVOT,
+            allowed_values=["loan", "deposit", "card"],
+        ),
+    ],
+)
+```
+### Cell 3 — configure with the notebook adapter
+```python
+adapter = DatabricksNotebookAdapter()
+cfg = FeatureStoreConfig(
+    dataset=ds,
+    output_schema="analytics",
+    output_table_prefix="feat_",
+    time_windows=[3, 6, 12],
+    include_marginals=True,
+    adapter=adapter,   # triggers SELECT DISTINCT combination query at build()
+)
+```
+### Cell 4 — build and generate
+```python
+# build() issues ONE SELECT DISTINCT for all pivot categoricals:
+#
+#   SELECT DISTINCT product_type, segment
+#   FROM mydb.myschema.silver_transactions
+#   WHERE product_type IS NOT NULL
+#     AND segment IS NOT NULL
+#     AND product_type IN ('loan', 'deposit', 'card')
+#     AND segment IN ('retail', 'sme', 'corporate')
+#   ORDER BY 1, 2
+#
+# Only the returned combinations (plus their marginal projections) become
+# pivot columns — unobserved cross-combinations are never generated.
+pipeline = FeatureStorePipeline(config=cfg).build()
+print(f"Layer 2A columns : {len(pipeline.layer2a)}")
+print(f"Layer 3  features: {len(pipeline.layer3)}")
+result = DatabricksSQLCodeGenerator().generate(pipeline)
+print(result.code.sql[:500])
+```
+### Cell 5 — save the artefacts to DBFS
+```python
+result.save("/dbfs/mnt/output/features/")
+# Writes:
+#   /dbfs/mnt/output/features/script.sql
+#   /dbfs/mnt/output/features/dag.json
+#   /dbfs/mnt/output/features/diagram.md
+```
+## How it works
+`FeatureStorePipeline.build()` constructs an `AdapterCombinationResolver` and
+passes it to `PivotSpaceBuilder` as the `combination_resolver` callable.  The
+resolver executes a single multi-column `SELECT DISTINCT`:
+```sql
+SELECT DISTINCT product_type, segment
+FROM mydb.myschema.silver_transactions
+WHERE product_type IS NOT NULL
+  AND segment IS NOT NULL
+  AND product_type IN ('loan', 'deposit', 'card')
+  AND segment IN ('retail', 'sme', 'corporate')
+ORDER BY 1, 2
+```
+Suppose the query returns three rows:
+| product_type | segment   |
+|-------------|-----------|
+| loan        | retail    |
+| loan        | sme       |
+| deposit     | corporate |
+With `include_marginals=True`, the builder derives every subset-projection of
+those rows:
+| product_type | segment   | interpretation                          |
+|-------------|-----------|------------------------------------------|
+| loan        | retail    | observed combination                    |
+| loan        | sme       | observed combination                    |
+| deposit     | corporate | observed combination                    |
+| loan        | `∅`       | all segments for loan                   |
+| deposit     | `∅`       | all segments for deposit                |
+| `∅`         | retail    | all products for retail                 |
+| `∅`         | sme       | all products for sme                    |
+| `∅`         | corporate | all products for corporate              |
+| `∅`         | `∅`       | unconditional aggregate (always present)|
+Unobserved combinations (e.g. `deposit × retail`) are **never generated**,
+keeping the feature space lean.
+## Fields without `allowed_values`
+If a field has no `allowed_values`, it is still included in the `SELECT DISTINCT`
+but its column is not filtered in the WHERE clause — all distinct values present
+in the table are returned for that dimension:
+```python
+ds = SimpleDataset(
+    "mydb.myschema.silver_transactions",
+    [
+        IDField("client_id"),
+        TimeField("period", TimeGranularity.MONTHLY, TimeGranularity.MONTHLY),
+        MeasurementField("amount", MeasurementType.MONTO),
+        # Static domain — used as IN-filter in the combined query
+        CategoricalField(
+            "channel",
+            CategoricalTreatment.PIVOT,
+            allowed_values=["branch", "online", "mobile"],
+        ),
+        # No allowed_values — column included without an IN-filter
+        CategoricalField("segment", CategoricalTreatment.PIVOT),
+    ],
+)
+```
+## Using a different adapter
+Swap `DatabricksNotebookAdapter` for any other adapter without changing the
+rest of the code:
+```python
+from featkit.execution.adapters import DatabricksAdapter
+adapter = DatabricksAdapter(
+    host="<workspace>.azuredatabricks.net",
+    token="<pat>",
+    http_path="/sql/1.0/warehouses/<warehouse-id>",
+    catalog="mydb",
+    schema="myschema",
+)
+cfg = FeatureStoreConfig(..., adapter=adapter)
+```
+## Using `AdapterCombinationResolver` directly
+The resolver can also be wired manually to `PivotSpaceBuilder` without going
+through the pipeline:
+```python
+from featkit.execution.domain_resolver import AdapterCombinationResolver
+from featkit.builders.pivot_space import PivotSpaceBuilder
+resolver = AdapterCombinationResolver(adapter, "mydb.myschema.silver_transactions")
+columns = PivotSpaceBuilder(
+    dataset=ds,
+    include_marginals=True,
+    combination_resolver=resolver,
+).build()
+```

{featkit-0.2.0 → featkit-0.4.1}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "featkit"
-version = "0.2.0"
+version = "0.4.1"
 description = "featkit — automated feature store generation from relational facts tables"
 readme = "README.md"
 license = { file = "LICENSE" }
@@ -78,6 +78,7 @@ module = ["tests.*"]
 disallow_untyped_defs = false
 disallow_untyped_calls = false
 disallow_any_generics = false
+disallow_incomplete_defs = false
 [tool.pytest.ini_options]
 testpaths = ["tests"]

{featkit-0.2.0 → featkit-0.4.1}/src/featkit/builders/distributional_space.py RENAMED

@@ -2,6 +2,7 @@
 from __future__ import annotations
+import logging
 from typing import cast
 from featkit.contracts.measurement.defaults import get_default_contract
@@ -11,6 +12,8 @@ from featkit.fields.categorical_field import CategoricalField
 from featkit.fields.measurement_field import MeasurementField
 from featkit.layer2.distributional import DistributionalColumn
+_log = logging.getLogger(__name__)
 class DistributionalSpaceBuilder:
     """Generates the full set of DistributionalColumn objects for a dataset.
@@ -26,18 +29,27 @@ class DistributionalSpaceBuilder:
             An empty list produces no columns. Every entry must be present in the
             dataset (compared by name, type, and contract); a ``ValueError`` is
             raised for unknown fields.
+        verbose: When ``True``, emits ``DEBUG``-level log messages at key
+            milestones: builder start/end, and for each generated column the
+            ``(categorical, measurement, aggregator, metric)`` combination dict
+            and the resulting column name.
     """
     def __init__(
         self,
         dataset: AbstractDataset,
         value_measurements: list[MeasurementField] | None = None,
+        verbose: bool = False,
     ) -> None:
         self.dataset = dataset
         self.value_measurements = value_measurements
+        self.verbose = verbose
     def build(self) -> list[DistributionalColumn]:
         """Build and return all DistributionalColumn objects."""
+        if self.verbose:
+            _log.debug("DistributionalSpaceBuilder.build() started")
         all_cats = [cast(CategoricalField, f) for f in self.dataset.categorical_fields]
         dist_cats = [
             c
@@ -70,8 +82,30 @@ class DistributionalSpaceBuilder:
                 for agg in aggs:
                     for metric in cat.distributional_metrics:
                         col = DistributionalColumn(mf, agg, cat, metric)
+                        if self.verbose:
+                            _log.debug(
+                                "combo: cat=%r, measurement=%r, aggregator=%s, metric=%s",
+                                cat.name,
+                                mf.name,
+                                agg.value,
+                                metric.value,
+                            )
+                            _log.debug(
+                                "combination: %s",
+                                {
+                                    "categorical": cat.name,
+                                    "measurement": mf.name,
+                                    "aggregator": agg.value,
+                                    "metric": metric.value,
+                                },
+                            )
+                            _log.debug("column_name: %r", col.column_name)
                         if col.column_name not in seen:
                             seen.add(col.column_name)
                             results.append(col)
+        if self.verbose:
+            _log.debug(
+                "DistributionalSpaceBuilder.build() done — %d column(s) generated", len(results)
+            )
         return results
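
Note (not part of the diff): the `verbose` flag added above only calls `_log.debug(...)`, so the messages are visible only if the module logger is enabled at `DEBUG` level. The sketch below demonstrates that interaction with a stand-in logger. The logger name is inferred from `logging.getLogger(__name__)` and the file path in this diff, and `ListHandler` is a hypothetical helper used to capture output; featkit itself is not imported.

```python
import logging

# Inferred from the module path src/featkit/builders/distributional_space.py.
LOGGER_NAME = "featkit.builders.distributional_space"


class ListHandler(logging.Handler):
    """Collect formatted log messages in memory for inspection."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


log = logging.getLogger(LOGGER_NAME)
log.propagate = False  # keep the demo output out of the root handlers
handler = ListHandler()
log.addHandler(handler)

# WARNING is the effective level of an unconfigured logger, so DEBUG is dropped.
log.setLevel(logging.WARNING)
log.debug("DistributionalSpaceBuilder.build() started")
dropped = list(handler.messages)

# Once the logger is lowered to DEBUG, the same call is recorded.
log.setLevel(logging.DEBUG)
log.debug("DistributionalSpaceBuilder.build() started")

print(dropped, handler.messages)
```

In practice this suggests that `verbose=True` needs to be paired with something like `logging.getLogger("featkit").setLevel(logging.DEBUG)` plus a handler, otherwise the builder stays silent.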

featkit-0.4.1/src/featkit/builders/pivot_space.py ADDED

@@ -0,0 +1,219 @@
+"""PivotSpaceBuilder — generates all PivotedColumn objects from a dataset."""
+from __future__ import annotations
+import logging
+from collections.abc import Callable
+from itertools import combinations as _icombinations
+from itertools import product
+from typing import cast
+from featkit.contracts.measurement.defaults import get_default_contract
+from featkit.dataset.base import AbstractDataset
+from featkit.enums import CategoricalTreatment, Layer2Aggregator, MeasurementType
+from featkit.fields.categorical_field import CategoricalField
+from featkit.fields.measurement_field import MeasurementField
+from featkit.layer2.pivoted import PivotedColumn
+_log = logging.getLogger(__name__)
+def _with_marginals(
+    observed: list[dict[CategoricalField, str]],
+    cats: list[CategoricalField],
+) -> list[dict[CategoricalField, str | None]]:
+    """Expand *observed* combinations with all ∅-substituted variants.
+    For each observed combination and each subset of fields, a new
+    combination is produced where those fields are replaced with ``None``
+    (the ∅ marginal sentinel).  The all-None combination is always included
+    even when *observed* is empty, since it represents an unconditional
+    aggregate over all data.
+    Duplicates are suppressed so overlapping projections of different
+    observed combinations appear only once.
+    """
+    seen: set[tuple[tuple[str, str | None], ...]] = set()
+    result: list[dict[CategoricalField, str | None]] = []
+    def _append(combo: dict[CategoricalField, str | None]) -> None:
+        key = tuple(sorted((f.name, combo[f]) for f in cats))
+        if key not in seen:
+            seen.add(key)
+            result.append(combo)
+    _append({f: None for f in cats})
+    for combo in observed:
+        for r in range(len(cats)):  # r == len(cats) (all-None) already added above
+            for nulled in _icombinations(cats, r):
+                c: dict[CategoricalField, str | None] = dict(combo)
+                for f in nulled:
+                    c[f] = None
+                _append(c)
+    return result
+class PivotSpaceBuilder:
+    """Generates the full set of PivotedColumn objects for a dataset.
+    Two combination strategies are supported:
+    * **Observed combinations** (preferred when an adapter is available):
+      supply a ``combination_resolver`` callable.  It receives the list of
+      pivot categorical fields and returns only the combinations that
+      actually exist in the source table.  Marginals are then derived from
+      those observed combinations rather than from the full Cartesian
+      product.
+    * **Cartesian product** (default, no adapter required): per-field
+      domains are resolved from ``allowed_values`` or ``domain_resolver``
+      and the full product is generated.
+    Args:
+        dataset: The source facts-table schema.
+        include_marginals: When True, ∅-substituted combinations are added
+            on top of the base combinations (observed or Cartesian).
+        aggregators_override: Per-measurement-type override list. Only
+            aggregators that are also contract-valid for the measurement
+            type are used.
+        combination_resolver: Callable that takes the list of pivot
+            ``CategoricalField`` objects and returns the observed
+            combinations as a list of ``{field: value}`` dicts.  When
+            provided, ``domain_resolver`` is not used.
+        domain_resolver: Callable invoked per-field to resolve the domain
+            of a categorical whose ``allowed_values`` is None.  Used only
+            in the Cartesian product path (i.e. when
+            ``combination_resolver`` is not provided).  Raises
+            ``ValueError`` at build time if a dynamic field is encountered
+            and this is not provided.
+        verbose: When ``True``, emits ``DEBUG``-level log messages at key
+            milestones: builder start/end, each ``domain_resolver``
+            invocation with its resolved values, each ``cat_combination``
+            dict, and every generated column name.
+    """
+    def __init__(
+        self,
+        dataset: AbstractDataset,
+        include_marginals: bool = True,
+        aggregators_override: dict[MeasurementType, list[Layer2Aggregator]] | None = None,
+        combination_resolver: (
+            Callable[[list[CategoricalField]], list[dict[CategoricalField, str]]] | None
+        ) = None,
+        domain_resolver: Callable[[CategoricalField], list[str]] | None = None,
+        verbose: bool = False,
+    ) -> None:
+        self.dataset = dataset
+        self.include_marginals = include_marginals
+        self.aggregators_override = aggregators_override
+        self.combination_resolver = combination_resolver
+        self.domain_resolver = domain_resolver
+        self.verbose = verbose
+    def build(self) -> list[PivotedColumn]:
+        """Build and return all PivotedColumn objects."""
+        if self.verbose:
+            _log.debug("PivotSpaceBuilder.build() started")
+        all_cats = [cast(CategoricalField, f) for f in self.dataset.categorical_fields]
+        pivot_cats = [
+            c
+            for c in all_cats
+            if c.treatment in {CategoricalTreatment.PIVOT, CategoricalTreatment.BOTH}
+        ]
+        measurements = [cast(MeasurementField, f) for f in self.dataset.measurement_fields]
+        all_combos: list[dict[CategoricalField, str | None]]
+        if self.combination_resolver is not None and pivot_cats:
+            observed_raw = self.combination_resolver(pivot_cats)
+            pivot_key_set = set(pivot_cats)
+            pivot_map = {c: c for c in pivot_cats}
+            observed: list[dict[CategoricalField, str]] = []
+            for combo in observed_raw:
+                if set(combo.keys()) != pivot_key_set:
+                    raise ValueError(
+                        "combination_resolver must return dicts keyed by all "
+                        "pivot categorical fields"
+                    )
+                if any(v is None for v in combo.values()):
+                    raise ValueError(
+                        "combination_resolver returned None; "
+                        "None is reserved as the ∅ marginal sentinel"
+                    )
+                observed.append({pivot_map[f]: str(v) for f, v in combo.items()})
+            if self.include_marginals:
+                all_combos = _with_marginals(observed, pivot_cats)
+            else:
+                all_combos = [dict(c) for c in observed]
+        else:
+            cat_domains: dict[CategoricalField, list[str | None]] = {}
+            for cat in pivot_cats:
+                if cat.allowed_values is not None:
+                    raw: list[str] = list(cat.allowed_values)
+                elif self.domain_resolver is not None:
+                    if self.verbose:
+                        _log.debug("domain_resolver: resolving domain for categorical %r", cat.name)
+                    raw = list(self.domain_resolver(cat))
+                    if self.verbose:
+                        _log.debug(
+                            "domain_resolver: resolved %d value(s) for %r: %s",
+                            len(raw),
+                            cat.name,
+                            raw,
+                        )
+                else:
+                    raise ValueError(
+                        f"CategoricalField {cat.name!r} has no allowed_values and no "
+                        f"domain_resolver was provided"
+                    )
+                if any(v is None for v in raw):
+                    raise ValueError(
+                        f"CategoricalField {cat.name!r}: resolved domain contains None; "
+                        f"None is reserved as the ∅ marginal sentinel"
+                    )
+                domain: list[str | None] = list(raw)
+                if self.include_marginals:
+                    domain = domain + [None]
+                cat_domains[cat] = domain
+            cats = list(cat_domains.keys())
+            combos = product(*(cat_domains[c] for c in cats)) if cats else ((),)
+            all_combos = [
+                {cats[i]: combo[i] for i in range(len(cats))} if cats else {} for combo in combos
+            ]
+        results: list[PivotedColumn] = []
+        seen: dict[str, PivotedColumn] = {}
+        for cat_combination in all_combos:
+            if self.verbose:
+                _log.debug(
+                    "cat_combination: %s",
+                    {c.name: v for c, v in cat_combination.items()},
+                )
+            for mf in measurements:
+                for agg in self._valid_aggregators(mf):
+                    col = PivotedColumn(mf, agg, cat_combination)
+                    if col.column_name in seen:
+                        raise ValueError(
+                            f"Duplicate pivot column name generated: {col.column_name!r}. "
+                            f"Conflicting columns: {seen[col.column_name]!r} and {col!r}"
+                        )
+                    if self.verbose:
+                        _log.debug("column_name: %r", col.column_name)
+                    seen[col.column_name] = col
+                    results.append(col)
+        if self.verbose:
+            _log.debug("PivotSpaceBuilder.build() done — %d column(s) generated", len(results))
+        return results
+    def _valid_aggregators(self, mf: MeasurementField) -> list[Layer2Aggregator]:
+        contract = mf.contract or get_default_contract(mf.measurement_type)
+        valid = contract.valid_layer2_aggregators
+        if self.aggregators_override and mf.measurement_type in self.aggregators_override:
+            return [a for a in self.aggregators_override[mf.measurement_type] if a in valid]
+        return sorted(valid, key=lambda a: a.value)
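
Note (not part of the diff): the marginal expansion in `_with_marginals` above can be tried in isolation. The sketch below re-implements the same idea with plain string field names instead of `CategoricalField` objects (a simplification made here so the example runs without featkit) and applies it to the three observed rows from the notebook document. It also counts the Cartesian alternative for comparison.

```python
from __future__ import annotations

from itertools import combinations, product

Combo = dict[str, "str | None"]


def with_marginals(observed: list[Combo], fields: list[str]) -> list[Combo]:
    """Add every None-substituted variant of each observed combination."""
    seen: set[tuple[str | None, ...]] = set()
    result: list[Combo] = []

    def append(combo: Combo) -> None:
        key = tuple(combo[f] for f in fields)
        if key not in seen:  # overlapping projections are kept only once
            seen.add(key)
            result.append(combo)

    append({f: None for f in fields})  # unconditional aggregate, always present
    for combo in observed:
        for r in range(len(fields)):  # r == len(fields) was added above
            for nulled in combinations(fields, r):
                append({f: (None if f in nulled else combo[f]) for f in fields})
    return result


FIELDS = ["product_type", "segment"]
OBSERVED: list[Combo] = [
    {"product_type": "loan", "segment": "retail"},
    {"product_type": "loan", "segment": "sme"},
    {"product_type": "deposit", "segment": "corporate"},
]

expanded = with_marginals(OBSERVED, FIELDS)

# Cartesian path with marginals: each domain gets None appended, then product().
cartesian = list(
    product(["loan", "deposit", "card", None], ["retail", "sme", "corporate", None])
)

print(len(expanded), len(cartesian))
```

For this input the observed path yields 9 combinations, matching the table in the notebook document, while the Cartesian path with the same `allowed_values` yields 16, including pairs such as `deposit × retail` that never occur in the data.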
