featkit 0.4.1.tar.gz → 0.4.2.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (98)

{featkit-0.4.1 → featkit-0.4.2}/CHANGELOG.md RENAMED

@@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.4.2] - 2026-06-30
+### Fixed
+- `FREQ` operator now counts only periods where the value is non-null **and strictly greater than 0** (previously counted any non-null value).
+- `XM` operator now returns `1` only when **every** period in the time window has a non-null and strictly positive value, `0` otherwise (previously returned a raw count identical to FREQ). Both the SQL and PySpark generators are updated.
 ## [0.4.1] - 2026-06-09
 ### Fixed
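
The two changelog entries for 0.4.2 can be restated as a small reference model. This is a minimal sketch in plain Python, not featkit code: the function names and the sample window are hypothetical, and it assumes the window is given as one value per period, with `None` standing for a null or missing period.

```python
from typing import Optional, Sequence


def freq_old(values: Sequence[Optional[float]]) -> int:
    """0.4.1 behaviour (shared by FREQ and XM): count any non-null value."""
    return sum(1 for v in values if v is not None)


def freq(values: Sequence[Optional[float]]) -> int:
    """0.4.2 FREQ: count periods that are non-null and strictly positive."""
    return sum(1 for v in values if v is not None and v > 0)


def xm(values: Sequence[Optional[float]]) -> int:
    """0.4.2 XM: 1 only if every period in the window is active, else 0."""
    return 1 if freq(values) == len(values) else 0


# Hypothetical six-period window: one zero and one null period.
window = [120.0, 0.0, None, 35.5, 80.0, 10.0]

print(freq_old(window))  # 5 (the zero was counted in 0.4.1)
print(freq(window))      # 4
print(xm(window))        # 0 (two periods are inactive)
print(xm([1.0] * 6))     # 1 (active in every period)
```

In this model a zero-valued period stops counting as activity, and XM becomes a binary flag instead of a second copy of the count.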

featkit-0.4.2/PKG-INFO ADDED

@@ -0,0 +1,322 @@
+Metadata-Version: 2.4
+Name: featkit
+Version: 0.4.2
+Summary: featkit — automated feature store generation from relational facts tables
+Project-URL: Repository, https://github.com/Mirkiux/featkit
+Project-URL: Documentation, https://mirkiux.github.io/featkit
+Project-URL: Changelog, https://github.com/Mirkiux/featkit/blob/main/CHANGELOG.md
+Project-URL: Bug Tracker, https://github.com/Mirkiux/featkit/issues
+Author: Mirko
+License: MIT License
+        Copyright (c) 2026 Mirko
+        Permission is hereby granted, free of charge, to any person obtaining a copy
+        of this software and associated documentation files (the "Software"), to deal
+        in the Software without restriction, including without limitation the rights
+        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+        copies of the Software, and to permit persons to whom the Software is
+        furnished to do so, subject to the following conditions:
+        The above copyright notice and this permission notice shall be included in all
+        copies or substantial portions of the Software.
+        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+        SOFTWARE.
+License-File: LICENSE
+Keywords: analytics,data engineering,databricks,feature engineering,feature store,pivot,pyspark,snowflake
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.10
+Requires-Dist: sqlglot>=23.0
+Provides-Extra: databricks
+Requires-Dist: databricks-sql-connector>=3.0; extra == 'databricks'
+Provides-Extra: dev
+Requires-Dist: build>=1.0; extra == 'dev'
+Requires-Dist: hatch>=1.9; extra == 'dev'
+Requires-Dist: mypy>=1.0; extra == 'dev'
+Requires-Dist: pandas>=1.5; extra == 'dev'
+Requires-Dist: pytest-cov>=4.0; extra == 'dev'
+Requires-Dist: pytest>=7.0; extra == 'dev'
+Requires-Dist: ruff>=0.4; extra == 'dev'
+Requires-Dist: twine>=5.0; extra == 'dev'
+Provides-Extra: docs
+Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
+Requires-Dist: mkdocs>=1.6; extra == 'docs'
+Requires-Dist: mkdocstrings[python]>=0.25; extra == 'docs'
+Provides-Extra: execution
+Requires-Dist: pandas>=1.5; extra == 'execution'
+Provides-Extra: ibis
+Requires-Dist: ibis-framework>=9.0; extra == 'ibis'
+Provides-Extra: spark
+Requires-Dist: pyspark>=3.4; extra == 'spark'
+Description-Content-Type: text/markdown
+# featkit
+**featkit** is a Python framework for automated feature store generation from relational facts tables.
+It implements a three-layer architecture:
+- **Layer 1** — input facts table with typed columns (ID, time, categorical, measurement)
+- **Layer 2** — horizontal concept table built via pivot (2A) and distributional aggregations (2B)
+- **Layer 3** — temporal feature table produced by sliding operators over the Layer 2 columns
+The framework is engine-agnostic: the same pipeline definition produces either a standalone SQL script (Snowflake, Databricks SQL, Spark SQL) or a lazy PySpark execution plan, with the choice abstracted behind a code generator interface.
+## Key concepts
+| Layer | What it does |
+|---|---|
+| Layer 2A — Pivot | `GROUP BY (ID, time)` + `CASE WHEN` per categorical combination × measurement × aggregator |
+| Layer 2B — Distributional | Per-categorical CTEs computing entropy, HHI, dominant proportion, mode, count |
+| Layer 3 — Temporal | Sliding window operators (PROM_U, SUM_U, CREC, FREQ, REC, …) over all Layer 2 columns |
+## Installation
+```bash
+pip install featkit
+```
+## Quickstart
+```python
+from featkit import FeatureStorePipeline, FeatureStoreConfig
+from featkit.dataset import SimpleDataset
+from featkit.fields import IDField, TimeField, CategoricalField, MeasurementField
+from featkit.enums import MeasurementType, TimeGranularity, CategoricalTreatment
+from featkit.generators.sql import SnowflakeSQLCodeGenerator
+# Define schema
+fields = [
+    IDField(name="ID_CLIENTE"),
+    TimeField(name="PERIODO",
+              source_granularity=TimeGranularity.MONTHLY,
+              target_granularity=TimeGranularity.MONTHLY),
+    CategoricalField(name="SECTOR", treatment=CategoricalTreatment.PIVOT,
+                     allowed_values=["RETAIL", "CORP", "PYME"]),
+    CategoricalField(name="CANAL",  treatment=CategoricalTreatment.PIVOT,
+                     allowed_values=["DIGITAL", "PRESENCIAL", "TELEFONO"]),
+    MeasurementField(name="MTO", measurement_type=MeasurementType.MONTO),
+    MeasurementField(name="TRX", measurement_type=MeasurementType.CANTIDAD),
+]
+dataset = SimpleDataset(
+    source_reference="MY_DB.MY_SCHEMA.FACTS_TABLE",
+    fields=fields,
+)
+config = FeatureStoreConfig(
+    dataset=dataset,
+    output_schema="MY_DB.MY_SCHEMA",
+    output_table_prefix="FS",
+    time_windows=[3, 6, 9, 12],
+)
+pipeline = FeatureStorePipeline(config).build()
+output = pipeline.run(SnowflakeSQLCodeGenerator())
+output.save("./output")
+# Writes: output/script.sql, output/dag.json, output/diagram.md
+```
+## Feature naming anatomy
+Every feature produced by featkit has a deterministic, human-readable name built from fixed segments separated by `__` (double underscore). Understanding the segments lets you decode any feature name without looking at the code.
+There are four families of features, each with its own naming pattern.
+---
+### Layer 2A — Pivot features
+**Pattern:** `{AGG}__{MEASUREMENT}[__{FIELD}_{VALUE}…]`
+| Segment | Source | Example |
+|---|---|---|
+| `AGG` | `Layer2Aggregator` enum | `SUM`, `COUNT`, `AVG`, `MIN`, `MAX` |
+| `MEASUREMENT` | `MeasurementField.name` | `MTO`, `TRX` |
+| `FIELD_VALUE` | `CategoricalField.name` + `_` + value, one per non-marginal field, sorted alphabetically by field name | `CANAL_DIGITAL`, `SECTOR_RETAIL` |
+The valid aggregators for each `MEASUREMENT` depend on its `MeasurementType`. Only contract-permitted aggregator–measurement combinations are generated.
+| Measurement type | Semantic meaning | Valid `AGG` values |
+|---|---|---|
+| `MONTO` | Monetary amount | `SUM`, `MAX`, `MIN`, `AVG` |
+| `CANTIDAD` | Count / quantity | `SUM` |
+| `TICKET` | Average ticket size | `AVG` |
+| `FLAG` | Binary indicator | `MAX` |
+| `FECHA` | Date / timestamp | `MAX`, `MIN` |
+| `BALANCE` | Point-in-time balance | `MAX`, `MIN`, `AVG` |
+| `TIME_DIFF` | Duration / elapsed time | `SUM`, `AVG`, `MAX`, `MIN` |
+| `ESTADISTICO` | Generic statistic | `SUM`, `AVG`, `MAX`, `MIN`, `COUNT` |
+Categorical fields set to the **∅ marginal** (no filter on that dimension) are omitted from the name entirely, so the name implicitly aggregates over all values of that dimension.
+```
+SUM__MTO                                  # global — all sectors, all channels
+SUM__MTO__CANAL_DIGITAL                   # CANAL=DIGITAL, marginal over SECTOR
+SUM__MTO__SECTOR_RETAIL                   # SECTOR=RETAIL, marginal over CANAL
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL    # CANAL=DIGITAL and SECTOR=RETAIL (alphabetical order)
+SUM__TRX__CANAL_PRESENCIAL                # sum of TRX (CANTIDAD → only SUM is valid) for PRESENCIAL channel
+```
+---
+### Layer 2B — Distributional features
+**Pattern:** `{CATEGORICAL}__{MEASUREMENT}__{AGG}__{METRIC}`
+| Segment | Source | Example |
+|---|---|---|
+| `CATEGORICAL` | `CategoricalField.name` | `CANAL`, `SECTOR` |
+| `MEASUREMENT` | `MeasurementField.name` | `MTO` |
+| `AGG` | `Layer2Aggregator` enum | `SUM` |
+| `METRIC` | `DistributionalMetric` enum | `ENTROPY`, `HHI`, `DOMINANT_PROPORTION`, `MODE`, `COUNT` |
+These columns capture the shape of the value distribution of a categorical field, weighted by the aggregated measurement.
+| Metric | What it measures |
+|---|---|
+| `ENTROPY` | Shannon entropy of the category distribution — higher means more uniform spread |
+| `HHI` | Herfindahl-Hirschman Index — concentration; higher means more dominated by one value |
+| `DOMINANT_PROPORTION` | Share of the most common category value |
+| `MODE` | The most frequent category value (output type: categorical) |
+| `COUNT` | Number of distinct observed values |
+```
+CANAL__MTO__SUM__ENTROPY            # entropy of channel distribution by amount
+SECTOR__TRX__SUM__HHI               # HHI of sector distribution by transaction count (CANTIDAD → only SUM)
+CANAL__MTO__SUM__MODE               # dominant channel by amount (categorical output)
+```
+---
+### Layer 2C — Ratio features
+**Pattern:** `{NUMERATOR}__over__{DENOMINATOR}`
+where `NUMERATOR` and `DENOMINATOR` are full Layer 2A pivot feature names. The denominator is always a **proper marginal projection** of the numerator: it has at least one categorical dimension set to ∅ that is non-∅ in the numerator, and no contradicting values.
+The underlying value is `numerator / NULLIF(denominator, 0)` computed per entity per period.
+```
+# Numerator: DIGITAL channel + RETAIL sector
+# Denominator: RETAIL sector only (CANAL marginalized → share of DIGITAL within RETAIL)
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL__over__SUM__MTO__SECTOR_RETAIL
+# Denominator: DIGITAL channel only (SECTOR marginalized → share of RETAIL within DIGITAL)
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL__over__SUM__MTO__CANAL_DIGITAL
+# Denominator: global total (both marginalized → share of DIGITAL/RETAIL in total portfolio)
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL__over__SUM__MTO
+```
+---
+### Layer 3 — Temporal features
+**Pattern:** `{L2_NAME}__{OPERATOR}__{DIRECTION}[__{WINDOW}]`
+`L2_NAME` is the full name of any Layer 2A, 2B, or 2C feature. The temporal segments are appended at the end.
+| Segment | Source | Notes |
+|---|---|---|
+| `OPERATOR` | `TemporalOperator` enum | See table below |
+| `DIRECTION` | `TimeWindowDirection` enum | `BACKWARD` or `FORWARD` |
+| `WINDOW` | `window_size` (integer, number of periods) | Omitted for point-in-time operators |
+#### Temporal operators
+| Operator | Type | Description |
+|---|---|---|
+| `PROM_U` | Windowed | Arithmetic mean of the monthly values over the window — each period contributes equally regardless of its volume |
+| `PROM_P` | Windowed | Volume-proportional weighted mean — each period's contribution is weighted by its share of the total aggregated value across the window; weights are derived automatically from the data, no user configuration required |
+| `SUM_U` | Windowed | Unweighted sum of the monthly values over the window |
+| `SUM_P` | Windowed | Volume-weighted sum over the window (analogous weighting to `PROM_P`) |
+| `MIN_U` | Windowed | Minimum value observed in the window |
+| `MAX_U` | Windowed | Maximum value observed in the window |
+| `CREC` | Windowed | Growth rate across the window |
+| `FREQ` | Windowed | Count of periods in the window where the value was non-null **and strictly greater than 0** |
+| `XM` | Windowed | `1` if **every** period in the window had a non-null and strictly positive value, `0` otherwise — an all-or-nothing activity indicator (e.g. `1` means the customer was active on every single month in the window) |
+| `MEDIA_ABS` | Windowed (composed) | Mean absolute deviation over the window |
+| `RATIO` | Windowed (composed) | Ratio of two sub-windows |
+| `ULT_MES` | Point-in-time | Value at the most recent period (no window suffix) |
+| `PREV_MES` | Point-in-time | Value at the immediately preceding period (no window suffix) |
+| `REC` | Point-in-time | Recency — periods elapsed since last non-null / non-zero observation (no window suffix) |
+#### Valid operators per Layer 2 output type
+| Output type | Valid operators |
+|---|---|
+| `NUMERIC` | `PROM_U`, `PROM_P`, `SUM_U`, `SUM_P`, `MIN_U`, `MAX_U`, `CREC`, `FREQ`, `XM`, `ULT_MES`, `PREV_MES`, `MEDIA_ABS`, `RATIO` |
+| `FLAG` | `ULT_MES`, `PREV_MES`, `FREQ`, `XM`, `REC` |
+| `CATEGORICAL` | `ULT_MES`, `PREV_MES`, `REC` |
+| `TEMPORAL` | `ULT_MES`, `PREV_MES`, `REC`, `MIN_U`, `MAX_U`, `CREC` |
+#### Examples
+```
+# Average amount (DIGITAL + RETAIL) over the last 6 months
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL__PROM_U__BACKWARD__6
+# Total transaction sum for RETAIL sector in the last 3 months (CANTIDAD → only SUM valid)
+SUM__TRX__SECTOR_RETAIL__SUM_U__BACKWARD__3
+# Most recent value of the CANAL entropy (by amount)
+CANAL__MTO__SUM__ENTROPY__ULT_MES__BACKWARD
+# Share of DIGITAL/RETAIL in total portfolio, averaged over last 12 months
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL__over__SUM__MTO__PROM_U__BACKWARD__12
+# Recency of the dominant channel (MODE is categorical → only REC/ULT_MES/PREV_MES valid)
+CANAL__MTO__SUM__MODE__REC__BACKWARD
+```
+---
+### Quick-reference: full name structure
+```
+┌─ Layer 2A pivot ──────────────────────────────────────────────────┐
+│  AGG  __  MEASUREMENT  [__  FIELD_VALUE  …]                       │
+└───────────────────────────────────────────────────────────────────┘
+┌─ Layer 2B distributional ─────────────────────────────────────────┐
+│  CATEGORICAL  __  MEASUREMENT  __  AGG  __  METRIC                │
+└───────────────────────────────────────────────────────────────────┘
+┌─ Layer 2C ratio ──────────────────────────────────────────────────┐
+│  {Layer 2A name}  __over__  {Layer 2A name}                       │
+└───────────────────────────────────────────────────────────────────┘
+┌─ Layer 3 temporal (windowed) ─────────────────────────────────────┐
+│  {Layer 2A/2B/2C name}  __  OPERATOR  __  DIRECTION  __  WINDOW   │
+└───────────────────────────────────────────────────────────────────┘
+┌─ Layer 3 temporal (point-in-time) ────────────────────────────────┐
+│  {Layer 2A/2B/2C name}  __  OPERATOR  __  DIRECTION               │
+└───────────────────────────────────────────────────────────────────┘
+```
+## Architecture
+See [docs/general_plan.md](docs/general_plan.md) for the full implementation plan.
+## License
+MIT
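
The Layer 2B metric table in the README text above names five distributional metrics without giving formulas. The sketch below uses the textbook definitions and is an assumption about intent, not featkit's implementation: the natural-log base for entropy, the exclusion of non-positive totals, and the function name `distribution_metrics` are all hypothetical choices.

```python
import math


def distribution_metrics(totals: dict[str, float]) -> dict[str, object]:
    """Distributional metrics for one entity and period.

    `totals` maps each category value to its aggregated measurement,
    for example SUM(MTO) per CANAL value.
    """
    # Assumption: only categories with a positive total are "observed".
    positive = {k: v for k, v in totals.items() if v > 0}
    if not positive:
        return {"ENTROPY": None, "HHI": None, "DOMINANT_PROPORTION": None,
                "MODE": None, "COUNT": 0}

    grand_total = sum(positive.values())
    shares = {k: v / grand_total for k, v in positive.items()}
    mode = max(shares, key=shares.get)
    return {
        # Shannon entropy (natural log): higher means a more uniform spread.
        "ENTROPY": -sum(p * math.log(p) for p in shares.values()),
        # Herfindahl-Hirschman Index: sum of squared shares.
        "HHI": sum(p * p for p in shares.values()),
        "DOMINANT_PROPORTION": shares[mode],
        "MODE": mode,
        "COUNT": len(shares),
    }


# Hypothetical monthly amounts per channel for a single customer.
metrics = distribution_metrics(
    {"DIGITAL": 50.0, "PRESENCIAL": 25.0, "TELEFONO": 25.0}
)
print(metrics["MODE"], metrics["HHI"])  # DIGITAL 0.375
```

Under these definitions the values would correspond to columns such as `CANAL__MTO__SUM__HHI` and `CANAL__MTO__SUM__MODE`.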

featkit-0.4.2/README.md ADDED

@@ -0,0 +1,254 @@
+# featkit
+**featkit** is a Python framework for automated feature store generation from relational facts tables.
+It implements a three-layer architecture:
+- **Layer 1** — input facts table with typed columns (ID, time, categorical, measurement)
+- **Layer 2** — horizontal concept table built via pivot (2A) and distributional aggregations (2B)
+- **Layer 3** — temporal feature table produced by sliding operators over the Layer 2 columns
+The framework is engine-agnostic: the same pipeline definition produces either a standalone SQL script (Snowflake, Databricks SQL, Spark SQL) or a lazy PySpark execution plan, with the choice abstracted behind a code generator interface.
+## Key concepts
+| Layer | What it does |
+|---|---|
+| Layer 2A — Pivot | `GROUP BY (ID, time)` + `CASE WHEN` per categorical combination × measurement × aggregator |
+| Layer 2B — Distributional | Per-categorical CTEs computing entropy, HHI, dominant proportion, mode, count |
+| Layer 3 — Temporal | Sliding window operators (PROM_U, SUM_U, CREC, FREQ, REC, …) over all Layer 2 columns |
+## Installation
+```bash
+pip install featkit
+```
+## Quickstart
+```python
+from featkit import FeatureStorePipeline, FeatureStoreConfig
+from featkit.dataset import SimpleDataset
+from featkit.fields import IDField, TimeField, CategoricalField, MeasurementField
+from featkit.enums import MeasurementType, TimeGranularity, CategoricalTreatment
+from featkit.generators.sql import SnowflakeSQLCodeGenerator
+# Define schema
+fields = [
+    IDField(name="ID_CLIENTE"),
+    TimeField(name="PERIODO",
+              source_granularity=TimeGranularity.MONTHLY,
+              target_granularity=TimeGranularity.MONTHLY),
+    CategoricalField(name="SECTOR", treatment=CategoricalTreatment.PIVOT,
+                     allowed_values=["RETAIL", "CORP", "PYME"]),
+    CategoricalField(name="CANAL",  treatment=CategoricalTreatment.PIVOT,
+                     allowed_values=["DIGITAL", "PRESENCIAL", "TELEFONO"]),
+    MeasurementField(name="MTO", measurement_type=MeasurementType.MONTO),
+    MeasurementField(name="TRX", measurement_type=MeasurementType.CANTIDAD),
+]
+dataset = SimpleDataset(
+    source_reference="MY_DB.MY_SCHEMA.FACTS_TABLE",
+    fields=fields,
+)
+config = FeatureStoreConfig(
+    dataset=dataset,
+    output_schema="MY_DB.MY_SCHEMA",
+    output_table_prefix="FS",
+    time_windows=[3, 6, 9, 12],
+)
+pipeline = FeatureStorePipeline(config).build()
+output = pipeline.run(SnowflakeSQLCodeGenerator())
+output.save("./output")
+# Writes: output/script.sql, output/dag.json, output/diagram.md
+```
+## Feature naming anatomy
+Every feature produced by featkit has a deterministic, human-readable name built from fixed segments separated by `__` (double underscore). Understanding the segments lets you decode any feature name without looking at the code.
+There are four families of features, each with its own naming pattern.
+---
+### Layer 2A — Pivot features
+**Pattern:** `{AGG}__{MEASUREMENT}[__{FIELD}_{VALUE}…]`
+| Segment | Source | Example |
+|---|---|---|
+| `AGG` | `Layer2Aggregator` enum | `SUM`, `COUNT`, `AVG`, `MIN`, `MAX` |
+| `MEASUREMENT` | `MeasurementField.name` | `MTO`, `TRX` |
+| `FIELD_VALUE` | `CategoricalField.name` + `_` + value, one per non-marginal field, sorted alphabetically by field name | `CANAL_DIGITAL`, `SECTOR_RETAIL` |
+The valid aggregators for each `MEASUREMENT` depend on its `MeasurementType`. Only contract-permitted aggregator–measurement combinations are generated.
+| Measurement type | Semantic meaning | Valid `AGG` values |
+|---|---|---|
+| `MONTO` | Monetary amount | `SUM`, `MAX`, `MIN`, `AVG` |
+| `CANTIDAD` | Count / quantity | `SUM` |
+| `TICKET` | Average ticket size | `AVG` |
+| `FLAG` | Binary indicator | `MAX` |
+| `FECHA` | Date / timestamp | `MAX`, `MIN` |
+| `BALANCE` | Point-in-time balance | `MAX`, `MIN`, `AVG` |
+| `TIME_DIFF` | Duration / elapsed time | `SUM`, `AVG`, `MAX`, `MIN` |
+| `ESTADISTICO` | Generic statistic | `SUM`, `AVG`, `MAX`, `MIN`, `COUNT` |
+Categorical fields set to the **∅ marginal** (no filter on that dimension) are omitted from the name entirely, so the name implicitly aggregates over all values of that dimension.
+```
+SUM__MTO                                  # global — all sectors, all channels
+SUM__MTO__CANAL_DIGITAL                   # CANAL=DIGITAL, marginal over SECTOR
+SUM__MTO__SECTOR_RETAIL                   # SECTOR=RETAIL, marginal over CANAL
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL    # CANAL=DIGITAL and SECTOR=RETAIL (alphabetical order)
+SUM__TRX__CANAL_PRESENCIAL                # sum of TRX (CANTIDAD → only SUM is valid) for PRESENCIAL channel
+```
+---
+### Layer 2B — Distributional features
+**Pattern:** `{CATEGORICAL}__{MEASUREMENT}__{AGG}__{METRIC}`
+| Segment | Source | Example |
+|---|---|---|
+| `CATEGORICAL` | `CategoricalField.name` | `CANAL`, `SECTOR` |
+| `MEASUREMENT` | `MeasurementField.name` | `MTO` |
+| `AGG` | `Layer2Aggregator` enum | `SUM` |
+| `METRIC` | `DistributionalMetric` enum | `ENTROPY`, `HHI`, `DOMINANT_PROPORTION`, `MODE`, `COUNT` |
+These columns capture the shape of the value distribution of a categorical field, weighted by the aggregated measurement.
+| Metric | What it measures |
+|---|---|
+| `ENTROPY` | Shannon entropy of the category distribution — higher means more uniform spread |
+| `HHI` | Herfindahl-Hirschman Index — concentration; higher means more dominated by one value |
+| `DOMINANT_PROPORTION` | Share of the most common category value |
+| `MODE` | The most frequent category value (output type: categorical) |
+| `COUNT` | Number of distinct observed values |
+```
+CANAL__MTO__SUM__ENTROPY            # entropy of channel distribution by amount
+SECTOR__TRX__SUM__HHI               # HHI of sector distribution by transaction count (CANTIDAD → only SUM)
+CANAL__MTO__SUM__MODE               # dominant channel by amount (categorical output)
+```
+---
+### Layer 2C — Ratio features
+**Pattern:** `{NUMERATOR}__over__{DENOMINATOR}`
+where `NUMERATOR` and `DENOMINATOR` are full Layer 2A pivot feature names. The denominator is always a **proper marginal projection** of the numerator: it has at least one categorical dimension set to ∅ that is non-∅ in the numerator, and no contradicting values.
+The underlying value is `numerator / NULLIF(denominator, 0)` computed per entity per period.
+```
+# Numerator: DIGITAL channel + RETAIL sector
+# Denominator: RETAIL sector only (CANAL marginalized → share of DIGITAL within RETAIL)
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL__over__SUM__MTO__SECTOR_RETAIL
+# Denominator: DIGITAL channel only (SECTOR marginalized → share of RETAIL within DIGITAL)
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL__over__SUM__MTO__CANAL_DIGITAL
+# Denominator: global total (both marginalized → share of DIGITAL/RETAIL in total portfolio)
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL__over__SUM__MTO
+```
+---
+### Layer 3 — Temporal features
+**Pattern:** `{L2_NAME}__{OPERATOR}__{DIRECTION}[__{WINDOW}]`
+`L2_NAME` is the full name of any Layer 2A, 2B, or 2C feature. The temporal segments are appended at the end.
+| Segment | Source | Notes |
+|---|---|---|
+| `OPERATOR` | `TemporalOperator` enum | See table below |
+| `DIRECTION` | `TimeWindowDirection` enum | `BACKWARD` or `FORWARD` |
+| `WINDOW` | `window_size` (integer, number of periods) | Omitted for point-in-time operators |
+#### Temporal operators
+| Operator | Type | Description |
+|---|---|---|
+| `PROM_U` | Windowed | Arithmetic mean of the monthly values over the window — each period contributes equally regardless of its volume |
+| `PROM_P` | Windowed | Volume-proportional weighted mean — each period's contribution is weighted by its share of the total aggregated value across the window; weights are derived automatically from the data, no user configuration required |
+| `SUM_U` | Windowed | Unweighted sum of the monthly values over the window |
+| `SUM_P` | Windowed | Volume-weighted sum over the window (analogous weighting to `PROM_P`) |
+| `MIN_U` | Windowed | Minimum value observed in the window |
+| `MAX_U` | Windowed | Maximum value observed in the window |
+| `CREC` | Windowed | Growth rate across the window |
+| `FREQ` | Windowed | Count of periods in the window where the value was non-null **and strictly greater than 0** |
+| `XM` | Windowed | `1` if **every** period in the window had a non-null and strictly positive value, `0` otherwise — an all-or-nothing activity indicator (e.g. `1` means the customer was active on every single month in the window) |
+| `MEDIA_ABS` | Windowed (composed) | Mean absolute deviation over the window |
+| `RATIO` | Windowed (composed) | Ratio of two sub-windows |
+| `ULT_MES` | Point-in-time | Value at the most recent period (no window suffix) |
+| `PREV_MES` | Point-in-time | Value at the immediately preceding period (no window suffix) |
+| `REC` | Point-in-time | Recency — periods elapsed since last non-null / non-zero observation (no window suffix) |
+#### Valid operators per Layer 2 output type
+| Output type | Valid operators |
+|---|---|
+| `NUMERIC` | `PROM_U`, `PROM_P`, `SUM_U`, `SUM_P`, `MIN_U`, `MAX_U`, `CREC`, `FREQ`, `XM`, `ULT_MES`, `PREV_MES`, `MEDIA_ABS`, `RATIO` |
+| `FLAG` | `ULT_MES`, `PREV_MES`, `FREQ`, `XM`, `REC` |
+| `CATEGORICAL` | `ULT_MES`, `PREV_MES`, `REC` |
+| `TEMPORAL` | `ULT_MES`, `PREV_MES`, `REC`, `MIN_U`, `MAX_U`, `CREC` |
+#### Examples
+```
+# Average amount (DIGITAL + RETAIL) over the last 6 months
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL__PROM_U__BACKWARD__6
+# Total transaction sum for RETAIL sector in the last 3 months (CANTIDAD → only SUM valid)
+SUM__TRX__SECTOR_RETAIL__SUM_U__BACKWARD__3
+# Most recent value of the CANAL entropy (by amount)
+CANAL__MTO__SUM__ENTROPY__ULT_MES__BACKWARD
+# Share of DIGITAL/RETAIL in total portfolio, averaged over last 12 months
+SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL__over__SUM__MTO__PROM_U__BACKWARD__12
+# Recency of the dominant channel (MODE is categorical → only REC/ULT_MES/PREV_MES valid)
+CANAL__MTO__SUM__MODE__REC__BACKWARD
+```
+---
+### Quick-reference: full name structure
+```
+┌─ Layer 2A pivot ──────────────────────────────────────────────────┐
+│  AGG  __  MEASUREMENT  [__  FIELD_VALUE  …]                       │
+└───────────────────────────────────────────────────────────────────┘
+┌─ Layer 2B distributional ─────────────────────────────────────────┐
+│  CATEGORICAL  __  MEASUREMENT  __  AGG  __  METRIC                │
+└───────────────────────────────────────────────────────────────────┘
+┌─ Layer 2C ratio ──────────────────────────────────────────────────┐
+│  {Layer 2A name}  __over__  {Layer 2A name}                       │
+└───────────────────────────────────────────────────────────────────┘
+┌─ Layer 3 temporal (windowed) ─────────────────────────────────────┐
+│  {Layer 2A/2B/2C name}  __  OPERATOR  __  DIRECTION  __  WINDOW   │
+└───────────────────────────────────────────────────────────────────┘
+┌─ Layer 3 temporal (point-in-time) ────────────────────────────────┐
+│  {Layer 2A/2B/2C name}  __  OPERATOR  __  DIRECTION               │
+└───────────────────────────────────────────────────────────────────┘
+```
+## Architecture
+See [docs/general_plan.md](docs/general_plan.md) for the full implementation plan.
+## License
+MIT
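
Because the README describes Layer 3 names as `{L2_NAME}__{OPERATOR}__{DIRECTION}[__{WINDOW}]`, a name can be decoded by splitting on `__` and reading segments from the right. The parser below is a hypothetical helper, not part of featkit. It assumes the operator and direction sets listed in the README tables, and that no field name or category value itself contains a double underscore.

```python
OPERATORS = {
    "PROM_U", "PROM_P", "SUM_U", "SUM_P", "MIN_U", "MAX_U", "CREC",
    "FREQ", "XM", "MEDIA_ABS", "RATIO", "ULT_MES", "PREV_MES", "REC",
}
DIRECTIONS = {"BACKWARD", "FORWARD"}


def parse_layer3_name(name: str) -> dict[str, object]:
    """Split a Layer 3 feature name into its documented segments."""
    parts = name.split("__")

    # The window is optional: point-in-time operators carry no suffix.
    window = None
    if parts[-1].isdigit():
        window = int(parts.pop())

    if len(parts) < 3 or parts[-1] not in DIRECTIONS or parts[-2] not in OPERATORS:
        raise ValueError(f"not a Layer 3 feature name: {name!r}")

    direction = parts.pop()
    operator = parts.pop()
    return {
        "l2_name": "__".join(parts),  # Layer 2A, 2B, or 2C name, left intact
        "operator": operator,
        "direction": direction,
        "window": window,
    }


windowed = parse_layer3_name(
    "SUM__MTO__CANAL_DIGITAL__SECTOR_RETAIL__over__SUM__MTO__PROM_U__BACKWARD__12"
)
point_in_time = parse_layer3_name("CANAL__MTO__SUM__MODE__REC__BACKWARD")
print(windowed["operator"], windowed["window"])            # PROM_U 12
print(point_in_time["l2_name"], point_in_time["window"])   # CANAL__MTO__SUM__MODE None
```

Reading from the right keeps the parser independent of how many segments the Layer 2 name has, including the `__over__` form used by ratio features.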

{featkit-0.4.1 → featkit-0.4.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "featkit"
-version = "0.4.1"
+version = "0.4.2"
 description = "featkit — automated feature store generation from relational facts tables"
 readme = "README.md"
 license = { file = "LICENSE" }
@@ -69,7 +69,7 @@ target-version = "py310"
 select = ["E", "F", "I", "UP", "B", "SIM"]
 [tool.mypy]
-python_version = "3.10"
+python_version = "3.12"
 strict = true
 ignore_missing_imports = true

{featkit-0.4.1 → featkit-0.4.2}/src/featkit/generators/pyspark/databricks.py RENAMED

@@ -379,13 +379,12 @@ class PySparkCodeGenerator(AbstractCodeGenerator):
                 f' - F.lit(1)).alias("{alias}")'
             )
         if op == TemporalOperator.FREQ:
-            return (
-                f'F.count(F.when({in_window} & {col_ref}.isNotNull(), F.lit(1))).alias("{alias}")'
-            )
+            active = f"{in_window} & {col_ref}.isNotNull() & ({col_ref} > 0)"
+            return f'F.count(F.when({active}, F.lit(1))).alias("{alias}")'
         if op == TemporalOperator.XM:
-            return (
-                f'F.count(F.when({in_window} & {col_ref}.isNotNull(), F.lit(1))).alias("{alias}")'
-            )
+            active = f"{in_window} & {col_ref}.isNotNull() & ({col_ref} > 0)"
+            count_expr = f"F.count(F.when({active}, F.lit(1)))"
+            return f'F.when({count_expr} == {w}, F.lit(1)).otherwise(F.lit(0)).alias("{alias}")'
         if op == TemporalOperator.REC:
             return f'(-F.max(F.when({col_ref}.isNotNull(), {mob}))).alias("{alias}")'
         if op == TemporalOperator.MEDIA_ABS:
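
To see what the new FREQ and XM branches emit, the f-strings from the hunk above can be evaluated with placeholder inputs. The wrapper functions and every input value below (`in_window`, `col_ref`, `alias`, `w`) are hypothetical stand-ins chosen for illustration; the real generator derives them elsewhere, and that code is not part of this diff.

```python
def freq_expr(in_window: str, col_ref: str, alias: str) -> str:
    """Rebuild the FREQ expression string as composed in 0.4.2."""
    active = f"{in_window} & {col_ref}.isNotNull() & ({col_ref} > 0)"
    return f'F.count(F.when({active}, F.lit(1))).alias("{alias}")'


def xm_expr(in_window: str, col_ref: str, alias: str, w: int) -> str:
    """Rebuild the XM expression string as composed in 0.4.2."""
    active = f"{in_window} & {col_ref}.isNotNull() & ({col_ref} > 0)"
    count_expr = f"F.count(F.when({active}, F.lit(1)))"
    return f'F.when({count_expr} == {w}, F.lit(1)).otherwise(F.lit(0)).alias("{alias}")'


# Hypothetical inputs: a 6-period backward window over a FLAG column.
in_window = '(F.col("mob").between(0, 5))'
col_ref = 'F.col("MAX__paid")'

freq_code = freq_expr(in_window, col_ref, "MAX__paid__FREQ__BACKWARD__6")
xm_code = xm_expr(in_window, col_ref, "MAX__paid__XM__BACKWARD__6", 6)
print(freq_code)
print(xm_code)
```

The parentheses around `({col_ref} > 0)` matter in the emitted code: Python's `&` binds more tightly than `>`, so without them the generated expression would apply `&` before the comparison.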

{featkit-0.4.1 → featkit-0.4.2}/src/featkit/generators/sql/base.py RENAMED

@@ -468,7 +468,7 @@ class AbstractSQLCodeGenerator(AbstractCodeGenerator):
                 lo, hi = 0, w - 1
             in_window = f"{mob_col} BETWEEN {lo} AND {hi}"
             case_col = f"CASE WHEN {in_window} THEN {col} END"
-            case_notnull = f"CASE WHEN {in_window} AND {col} IS NOT NULL THEN 1 END"
+            case_notnull = f"CASE WHEN {in_window} AND {col} IS NOT NULL AND {col} > 0 THEN 1 END"
         if op == TemporalOperator.PROM_U:
             return f"AVG({case_col})"
@@ -493,7 +493,7 @@ class AbstractSQLCodeGenerator(AbstractCodeGenerator):
         if op == TemporalOperator.FREQ:
             return f"COUNT({case_notnull})"
         if op == TemporalOperator.XM:
-            return f"COUNT({case_notnull})"
+            return f"CASE WHEN COUNT({case_notnull}) = {w} THEN 1 ELSE 0 END"
         if op == TemporalOperator.REC:
             return f"-MAX(CASE WHEN {col} IS NOT NULL THEN {mob_col} END)"
         if op == TemporalOperator.MEDIA_ABS:
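
The behaviour of the revised SQL expressions can be checked on a toy table. The sketch below uses the standard-library `sqlite3` module rather than Snowflake or Databricks, and the table name `facts`, the columns `id`, `mob`, and `paid`, and the sample rows are all hypothetical. The two expressions are composed the same way as `case_notnull` and the XM branch in the hunk above.

```python
import sqlite3

W = 3  # window size in periods
in_window = f"mob BETWEEN 0 AND {W - 1}"
case_active = f"CASE WHEN {in_window} AND paid IS NOT NULL AND paid > 0 THEN 1 END"

query = f"""
SELECT id,
       COUNT({case_active}) AS freq,
       CASE WHEN COUNT({case_active}) = {W} THEN 1 ELSE 0 END AS xm
FROM facts
GROUP BY id
ORDER BY id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id TEXT, mob INTEGER, paid INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("a", 0, 1), ("a", 1, 1), ("a", 2, 1),     # active in all three periods
        ("a", 3, 0),                               # outside the window, ignored
        ("b", 0, 1), ("b", 1, 0), ("b", 2, None),  # one zero, one null
        ("c", 0, 1), ("c", 2, 1),                  # period 1 has no row at all
    ],
)
rows = conn.execute(query).fetchall()
print(rows)  # [('a', 3, 1), ('b', 1, 0), ('c', 2, 0)]
conn.close()
```

Entity `c` shows that a period with no row lowers the count just as a null or zero does, so XM stays at `0` unless all `W` periods are present and positive.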

{featkit-0.4.1 → featkit-0.4.2}/tests/test_generators/test_sql_snowflake.py RENAMED

@@ -419,3 +419,56 @@ class TestTableNaming:
         sql = _GEN.build_mob_table(pipeline).sql
         assert "myschema" in sql
         assert "x_mob_ref" in sql
+# ---------------------------------------------------------------------------
+# FREQ and XM operator semantics
+# ---------------------------------------------------------------------------
+def _pipeline_flag(window: int = 6) -> FeatureStorePipeline:
+    """Pipeline with a FLAG measurement, generating FREQ and XM temporal features."""
+    ds = SimpleDataset(
+        "db.facts",
+        [
+            IDField("id"),
+            TimeField("ts", TimeGranularity.MONTHLY, TimeGranularity.MONTHLY),
+            MeasurementField("paid", MeasurementType.FLAG),
+        ],
+    )
+    cfg = FeatureStoreConfig(
+        dataset=ds,
+        output_schema="out",
+        output_table_prefix="feat_",
+        time_windows=[window],
+    )
+    return FeatureStorePipeline(config=cfg).build()
+class TestFreqXmSemantics:
+    def _sql(self, window: int = 6) -> str:
+        out = _GEN.build_layer3(_pipeline_flag(window))
+        assert isinstance(out, SQLOutput)
+        return out.sql
+    def test_freq_filters_positive_values(self) -> None:
+        assert "> 0" in self._sql()
+    def test_freq_does_not_count_zero_values(self) -> None:
+        # The count expression must gate on > 0, not just IS NOT NULL
+        sql = self._sql()
+        assert "IS NOT NULL AND" in sql.upper() or "> 0" in sql
+    def test_xm_returns_one_or_zero(self) -> None:
+        sql = self._sql(window=6)
+        assert "CASE WHEN" in sql.upper()
+        assert "THEN 1" in sql
+        assert "ELSE 0" in sql
+    def test_xm_compares_count_to_window_size(self) -> None:
+        # The XM expression must compare the active-period count against the window size
+        sql = self._sql(window=6)
+        assert "= 6" in sql
+    def test_layer3_with_flag_is_parseable(self) -> None:
+        assert _is_parseable(self._sql())
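
The assertions in `TestFreqXmSemantics` check for substrings such as `"> 0"` and `"= 6"` anywhere in the generated script, so they pass even if those fragments come from unrelated expressions. A stricter check could match the whole XM expression and extract the window size it compares against. The sketch below is one possible approach, not part of the package: the regular expression, the helper name `xm_window_sizes`, and the sample SQL strings are hypothetical, and real generator output may differ in whitespace or quoting.

```python
import re

# Matches: CASE WHEN COUNT(CASE WHEN ... IS NOT NULL AND ... > 0 THEN 1 END) = <w> THEN 1 ELSE 0 END
XM_PATTERN = re.compile(
    r"CASE\s+WHEN\s+COUNT\(\s*CASE\s+WHEN\s+.+?\s+IS\s+NOT\s+NULL\s+AND\s+.+?\s*>\s*0"
    r"\s+THEN\s+1\s+END\s*\)\s*=\s*(\d+)\s+THEN\s+1\s+ELSE\s+0\s+END",
    re.IGNORECASE | re.DOTALL,
)


def xm_window_sizes(sql: str) -> list[int]:
    """Return the window size of every XM-shaped expression found in `sql`."""
    return [int(m.group(1)) for m in XM_PATTERN.finditer(sql)]


# Hypothetical fragments shaped like the 0.4.2 and 0.4.1 generator output.
new_sql = (
    "COUNT(CASE WHEN mob BETWEEN 0 AND 5 AND paid IS NOT NULL AND paid > 0 THEN 1 END) AS f, "
    "CASE WHEN COUNT(CASE WHEN mob BETWEEN 0 AND 5 AND paid IS NOT NULL AND paid > 0 "
    "THEN 1 END) = 6 THEN 1 ELSE 0 END AS x"
)
old_sql = "COUNT(CASE WHEN mob BETWEEN 0 AND 5 AND paid IS NOT NULL THEN 1 END) AS x"

print(xm_window_sizes(new_sql))  # [6]
print(xm_window_sizes(old_sql))  # []
```

A test built on such a helper could assert `xm_window_sizes(sql) == [6]` for a single-window pipeline, tying the `> 0` gate and the window comparison to the same expression.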

featkit-0.4.1/PKG-INFO DELETED

@@ -1,143 +0,0 @@
-Metadata-Version: 2.4
-Name: featkit
-Version: 0.4.1
-Summary: featkit — automated feature store generation from relational facts tables
-Project-URL: Repository, https://github.com/Mirkiux/featkit
-Project-URL: Documentation, https://mirkiux.github.io/featkit
-Project-URL: Changelog, https://github.com/Mirkiux/featkit/blob/main/CHANGELOG.md
-Project-URL: Bug Tracker, https://github.com/Mirkiux/featkit/issues
-Author: Mirko
-License: MIT License
-        Copyright (c) 2026 Mirko
-        Permission is hereby granted, free of charge, to any person obtaining a copy
-        of this software and associated documentation files (the "Software"), to deal
-        in the Software without restriction, including without limitation the rights
-        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-        copies of the Software, and to permit persons to whom the Software is
-        furnished to do so, subject to the following conditions:
-        The above copyright notice and this permission notice shall be included in all
-        copies or substantial portions of the Software.
-        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-        SOFTWARE.
-License-File: LICENSE
-Keywords: analytics,data engineering,databricks,feature engineering,feature store,pivot,pyspark,snowflake
-Classifier: Development Status :: 2 - Pre-Alpha
-Classifier: Intended Audience :: Developers
-Classifier: Intended Audience :: Science/Research
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.10
-Classifier: Programming Language :: Python :: 3.11
-Classifier: Programming Language :: Python :: 3.12
-Classifier: Programming Language :: Python :: 3.13
-Classifier: Topic :: Scientific/Engineering
-Classifier: Topic :: Software Development :: Libraries :: Python Modules
-Requires-Python: >=3.10
-Requires-Dist: sqlglot>=23.0
-Provides-Extra: databricks
-Requires-Dist: databricks-sql-connector>=3.0; extra == 'databricks'
-Provides-Extra: dev
-Requires-Dist: build>=1.0; extra == 'dev'
-Requires-Dist: hatch>=1.9; extra == 'dev'
-Requires-Dist: mypy>=1.0; extra == 'dev'
-Requires-Dist: pandas>=1.5; extra == 'dev'
-Requires-Dist: pytest-cov>=4.0; extra == 'dev'
-Requires-Dist: pytest>=7.0; extra == 'dev'
-Requires-Dist: ruff>=0.4; extra == 'dev'
-Requires-Dist: twine>=5.0; extra == 'dev'
-Provides-Extra: docs
-Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
-Requires-Dist: mkdocs>=1.6; extra == 'docs'
-Requires-Dist: mkdocstrings[python]>=0.25; extra == 'docs'
-Provides-Extra: execution
-Requires-Dist: pandas>=1.5; extra == 'execution'
-Provides-Extra: ibis
-Requires-Dist: ibis-framework>=9.0; extra == 'ibis'
-Provides-Extra: spark
-Requires-Dist: pyspark>=3.4; extra == 'spark'
-Description-Content-Type: text/markdown
-# featkit
-**featkit** is a Python framework for automated feature store generation from relational facts tables.
-It implements a three-layer architecture:
-- **Layer 1** — input facts table with typed columns (ID, time, categorical, measurement)
-- **Layer 2** — horizontal concept table built via pivot (2A) and distributional aggregations (2B)
-- **Layer 3** — temporal feature table produced by sliding operators over the Layer 2 columns
-The framework is engine-agnostic: the same pipeline definition produces either a standalone SQL script (Snowflake, Databricks SQL, Spark SQL) or a lazy PySpark execution plan, with the choice abstracted behind a code generator interface.
-## Key concepts
-| Layer | What it does |
-|---|---|
-| Layer 2A — Pivot | `GROUP BY (ID, time)` + `CASE WHEN` per categorical combination × measurement × aggregator |
-| Layer 2B — Distributional | Per-categorical CTEs computing entropy, HHI, dominant proportion, mode, count |
-| Layer 3 — Temporal | Sliding window operators (PROM_U, SUM_U, CREC, FREQ, REC, …) over all Layer 2 columns |
-## Installation
-```bash
-pip install featkit
-```
-## Quickstart
-```python
-from featkit import FeatureStorePipeline, FeatureStoreConfig
-from featkit.dataset import SimpleDataset
-from featkit.fields import IDField, TimeField, CategoricalField, MeasurementField
-from featkit.enums import MeasurementType, TimeGranularity, CategoricalTreatment
-from featkit.generators.sql import SnowflakeSQLCodeGenerator
-# Define schema
-fields = [
-    IDField(name="ID_CLIENTE"),
-    TimeField(name="PERIODO",
-              source_granularity=TimeGranularity.MONTHLY,
-              target_granularity=TimeGranularity.MONTHLY),
-    CategoricalField(name="SECTOR", treatment=CategoricalTreatment.PIVOT,
-                     allowed_values=["RETAIL", "CORP", "PYME"]),
-    CategoricalField(name="CANAL",  treatment=CategoricalTreatment.PIVOT,
-                     allowed_values=["DIGITAL", "PRESENCIAL", "TELEFONO"]),
-    MeasurementField(name="MTO", measurement_type=MeasurementType.MONTO),
-    MeasurementField(name="TRX", measurement_type=MeasurementType.CANTIDAD),
-]
-dataset = SimpleDataset(
-    source_reference="MY_DB.MY_SCHEMA.FACTS_TABLE",
-    fields=fields,
-)
-config = FeatureStoreConfig(
-    dataset=dataset,
-    output_schema="MY_DB.MY_SCHEMA",
-    output_table_prefix="FS",
-    time_windows=[3, 6, 9, 12],
-)
-pipeline = FeatureStorePipeline(config).build()
-output = pipeline.run(SnowflakeSQLCodeGenerator())
-output.save("./output")
-# Writes: output/script.sql, output/dag.json, output/diagram.md
-```
-## Architecture
-See [docs/general_plan.md](docs/general_plan.md) for the full implementation plan.
-## License
-MIT

featkit-0.4.1/README.md DELETED

@@ -1,75 +0,0 @@
-# featkit
-**featkit** is a Python framework for automated feature store generation from relational facts tables.
-It implements a three-layer architecture:
-- **Layer 1** — input facts table with typed columns (ID, time, categorical, measurement)
-- **Layer 2** — horizontal concept table built via pivot (2A) and distributional aggregations (2B)
-- **Layer 3** — temporal feature table produced by sliding operators over the Layer 2 columns
-The framework is engine-agnostic: the same pipeline definition produces either a standalone SQL script (Snowflake, Databricks SQL, Spark SQL) or a lazy PySpark execution plan, with the choice abstracted behind a code generator interface.
-## Key concepts
-| Layer | What it does |
-|---|---|
-| Layer 2A — Pivot | `GROUP BY (ID, time)` + `CASE WHEN` per categorical combination × measurement × aggregator |
-| Layer 2B — Distributional | Per-categorical CTEs computing entropy, HHI, dominant proportion, mode, count |
-| Layer 3 — Temporal | Sliding window operators (PROM_U, SUM_U, CREC, FREQ, REC, …) over all Layer 2 columns |
-## Installation
-```bash
-pip install featkit
-```
-## Quickstart
-```python
-from featkit import FeatureStorePipeline, FeatureStoreConfig
-from featkit.dataset import SimpleDataset
-from featkit.fields import IDField, TimeField, CategoricalField, MeasurementField
-from featkit.enums import MeasurementType, TimeGranularity, CategoricalTreatment
-from featkit.generators.sql import SnowflakeSQLCodeGenerator
-# Define schema
-fields = [
-    IDField(name="ID_CLIENTE"),
-    TimeField(name="PERIODO",
-              source_granularity=TimeGranularity.MONTHLY,
-              target_granularity=TimeGranularity.MONTHLY),
-    CategoricalField(name="SECTOR", treatment=CategoricalTreatment.PIVOT,
-                     allowed_values=["RETAIL", "CORP", "PYME"]),
-    CategoricalField(name="CANAL",  treatment=CategoricalTreatment.PIVOT,
-                     allowed_values=["DIGITAL", "PRESENCIAL", "TELEFONO"]),
-    MeasurementField(name="MTO", measurement_type=MeasurementType.MONTO),
-    MeasurementField(name="TRX", measurement_type=MeasurementType.CANTIDAD),
-]
-dataset = SimpleDataset(
-    source_reference="MY_DB.MY_SCHEMA.FACTS_TABLE",
-    fields=fields,
-)
-config = FeatureStoreConfig(
-    dataset=dataset,
-    output_schema="MY_DB.MY_SCHEMA",
-    output_table_prefix="FS",
-    time_windows=[3, 6, 9, 12],
-)
-pipeline = FeatureStorePipeline(config).build()
-output = pipeline.run(SnowflakeSQLCodeGenerator())
-output.save("./output")
-# Writes: output/script.sql, output/dag.json, output/diagram.md
-```
-## Architecture
-See [docs/general_plan.md](docs/general_plan.md) for the full implementation plan.
-## License
-MIT
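
The "Layer 2A" row of the Key concepts table above describes one `CASE WHEN` expression per categorical combination × measurement × aggregator. A rough sketch of that expansion follows. It assumes "categorical combination" means the cross product of the allowed values, and the helper `pivot_columns` and its alias naming scheme are hypothetical, not featkit's actual generator or output.

```python
from itertools import product


def pivot_columns(
    categoricals: dict[str, list[str]],
    measurements: list[str],
    aggregators: list[str],
) -> list[str]:
    """Build one aggregated CASE WHEN expression per combination (hypothetical naming)."""
    names = list(categoricals)
    columns = []
    # Cross product of the allowed values of every categorical field.
    for combo in product(*(categoricals[name] for name in names)):
        condition = " AND ".join(f"{n} = '{v}'" for n, v in zip(names, combo))
        for measurement, agg in product(measurements, aggregators):
            alias = "_".join([agg, measurement, *combo])
            columns.append(
                f"{agg}(CASE WHEN {condition} THEN {measurement} END) AS {alias}"
            )
    return columns


cols = pivot_columns(
    {"SECTOR": ["RETAIL", "CORP"], "CANAL": ["DIGITAL", "PRESENCIAL"]},
    measurements=["MTO"],
    aggregators=["SUM"],
)
for col in cols:
    print(col)
```

With two values per categorical, one measurement and one aggregator, this yields four expressions, which would sit in a `SELECT ... GROUP BY` over the ID and time columns.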