proscore 0.2.0.tar.gz → 0.2.2.tar.gz (PyPI version diff, via Mend)

Files changed (73)

{proscore-0.2.0/src/proscore.egg-info → proscore-0.2.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: proscore
-Version: 0.2.0
+Version: 0.2.2
 Summary: Production-grade scorecard development toolkit
 Author: Liqiwei
 License-Expression: MIT
@@ -45,10 +45,47 @@ Dynamic: license-file
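 **Production-grade scorecard development toolkit**
 An end-to-end, deterministic scorecard modeling pipeline designed for credit scorecard modeling at banks and financial institutions, meeting their requirements for interpretability, compliance, and stability.
+## Why ProScore
+ProScore is not a general-purpose machine learning framework; it is an **engineering toolkit** for putting financial scorecards into production.
+The goal is to move from "can build a model" to "reviewable, reproducible, deployable, and monitorable".
+It fits the following scenarios:
+- Bank, consumer-finance, and internet-finance teams developing and iterating on credit scorecards
+- Developers and business analysts who need to collaborate on modeling through Python + Excel
+- Teams that need to produce regulatory/review materials and close the post-deployment monitoring loop
+## Key highlights
+1. **Engineered monotonicity (the key differentiator)**
+   - Per-variable monotonic direction settings (increasing/decreasing/u/inverted_u/none)
+   - Automatic monotonic adjustment, reducing repeated manual re-binning
+   - Monotonicity settings can be templated and reused, staying consistent across projects
+2. **End-to-end deterministic workflow**
+   - `detect -> prefilter -> bin -> refine -> transform -> select -> fit -> evaluate -> diagnose -> report -> monitor`
+   - The same input yields the same output, which helps auditing, post-hoc review, and team collaboration
+3. **Three usage modes with one consistent definition**
+   - Modular API (flexible)
+   - Chained API (efficient)
+   - Excel-driven configuration (zero code)
+   - All three entry points share the same modeling logic, reducing inconsistent metric definitions
+4. **Integrated diagnosis and reporting**
+   - `diagnose()` provides a 4-layer structured diagnosis (discrimination / overfitting / stability / variable quality)
+   - Thresholds are customizable (`thresholds=...`)
+   - `ReportBuilder` automatically includes the diagnosis chapter, speeding up review
+5. **Closed-loop post-deployment monitoring**
+   - Supports PSI, KS decay, rule-based alerts, and per-period tracking
+   - Helps establish a continuous "deploy → monitor → retrain" operating cycle
 ---
 ## Table of contents
+- [Getting-started tutorials (Notebook)](#入门教程notebook)
 - [Three usage modes](#三种使用方式)
 - [Core features overview](#核心功能概览)
 - [Installation](#安装)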
@@ -57,6 +94,17 @@ Dynamic: license-file
 ---
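+## Getting-started tutorials (Notebook)
+Read them in the order below: get a run working first, then go deeper.
+| Notebook | Who it is for | What you get |
+|----------|---------------|--------------|
+| [**ProScore Quick Start**](notebooks/ProScore快速开始.ipynb) | First-time users | A single chained path in 5–10 minutes, looking only at KS/AUC/PSI, the selected variables, and the diagnosis summary |
+| [**ProScore Full Modeling Workflow**](notebooks/ProScore完整建模流程.ipynb) | Preparing for production | Modular vs. chained side by side, CFG as the single source of truth for parameters, rule mining, monitoring, reporting, diagnosis |
+> The quick start is deliberately minimal (no optional steps such as rule mining); the full version is the authoritative example, with `[主线]` (main path) / `[可选]` (optional) chapter navigation and consistency assertions.
 ## Three usage modes
 ProScore offers three progressive ways to use it, from zero code to full customization; pick whichever fits.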
@@ -106,7 +154,11 @@ p = (
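 > `train` is required; `test` and `oot` are optional. Binning/WOE is fitted on train only; stepwise regression uses test to monitor overfitting; OOT is used only for the final evaluation.
 >
-> For the full tutorial, see [notebooks/ProScore完整建模流程.ipynb](notebooks/ProScore完整建模流程.ipynb)
+> For the Notebook tutorials, see [Getting-started tutorials](#入门教程notebook) above.
+>
+> **Enhanced diagnosis** (v0.2+): `.evaluate().diagnose()` produces a 4-layer structured health report (including root-cause variables) and accepts `thresholds=...` for custom thresholds.
+>
+> **Single source of truth for parameters (recommended)**: `CFG` + `PipelineSpec` (`apply(spec)`) ensures the modular and chained APIs give the same results for the same parameters; see [pipeline-spec.md](docs/使用指南/pipeline-spec.md).
 ### C. Excel-driven configuration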
@@ -138,9 +190,11 @@ proscore run my_project/pipeline_template.xlsx --output-script run.py
 |------------|-----------------------------------------------|---------------------------------------|
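 | Data exploration | IV/AUC/KS metrics + PSI stability over time + correlation/VIF | Quickly screen strong variables and spot distribution-drift risk |
 | Binning | 4 monotonic trends + 5 binning methods + two-stage trend validation | Keeps WOE trends consistent with business logic and regulatory expectations |
+| Rule mining | Single-variable/cross rules + joint Lift/Precision/Recall filtering | Produces interpretable strategy rules, mutually exclusive with scorecard variables |
 | Stepwise regression | Bidirectional selection + five constraints (p-value/sign/VIF/correlation/source) | Rigorous multicollinearity control and dimension-attribution management |
 | Model monitoring | Score/Feature PSI + rule-engine alerts + JSON persistence | Continuous post-deployment validation with automatic risk alerts |
 | Report generation | Automatic 7-chapter Markdown report (with charts) | One-click generation of CBIRC-compliant documentation |
+| Model diagnosis | 4-layer health check + root-cause localization + customizable thresholds | Automatic pre-deployment risk identification, supporting strategy fine-tuning |
 ### Design principles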
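
The feature table in this hunk lists Score/Feature PSI as the core monitoring metric. As a reading aid, here is a minimal sketch of the standard Population Stability Index over pre-binned counts. The function name `psi` and the `eps` floor are illustrative assumptions; this is not ProScore's implementation or API.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    `expected` and `actual` are per-bin counts over the same bins
    (for example, the training sample vs. a later monitoring period).
    """
    if len(expected) != len(actual):
        raise ValueError("both inputs must use the same bins")
    e_total = sum(expected)
    a_total = sum(actual)
    total = 0.0
    for e, a in zip(expected, actual):
        # Floor proportions at eps so empty bins do not produce log(0).
        pe = max(e / e_total, eps)
        pa = max(a / a_total, eps)
        total += (pa - pe) * math.log(pa / pe)
    return total

baseline = [100, 200, 400, 200, 100]
same = psi(baseline, [50, 100, 200, 100, 50])      # same shape, half the volume
shifted = psi(baseline, [200, 300, 300, 150, 50])  # mass moved towards the low bins
```

Because the index depends only on bin proportions, `same` is zero even though the sample size changed, while `shifted` is positive.
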

{proscore-0.2.0 → proscore-0.2.2}/README.md RENAMED

@@ -7,10 +7,47 @@
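 **Production-grade scorecard development toolkit**
 An end-to-end, deterministic scorecard modeling pipeline designed for credit scorecard modeling at banks and financial institutions, meeting their requirements for interpretability, compliance, and stability.
+## Why ProScore
+ProScore is not a general-purpose machine learning framework; it is an **engineering toolkit** for putting financial scorecards into production.
+The goal is to move from "can build a model" to "reviewable, reproducible, deployable, and monitorable".
+It fits the following scenarios:
+- Bank, consumer-finance, and internet-finance teams developing and iterating on credit scorecards
+- Developers and business analysts who need to collaborate on modeling through Python + Excel
+- Teams that need to produce regulatory/review materials and close the post-deployment monitoring loop
+## Key highlights
+1. **Engineered monotonicity (the key differentiator)**
+   - Per-variable monotonic direction settings (increasing/decreasing/u/inverted_u/none)
+   - Automatic monotonic adjustment, reducing repeated manual re-binning
+   - Monotonicity settings can be templated and reused, staying consistent across projects
+2. **End-to-end deterministic workflow**
+   - `detect -> prefilter -> bin -> refine -> transform -> select -> fit -> evaluate -> diagnose -> report -> monitor`
+   - The same input yields the same output, which helps auditing, post-hoc review, and team collaboration
+3. **Three usage modes with one consistent definition**
+   - Modular API (flexible)
+   - Chained API (efficient)
+   - Excel-driven configuration (zero code)
+   - All three entry points share the same modeling logic, reducing inconsistent metric definitions
+4. **Integrated diagnosis and reporting**
+   - `diagnose()` provides a 4-layer structured diagnosis (discrimination / overfitting / stability / variable quality)
+   - Thresholds are customizable (`thresholds=...`)
+   - `ReportBuilder` automatically includes the diagnosis chapter, speeding up review
+5. **Closed-loop post-deployment monitoring**
+   - Supports PSI, KS decay, rule-based alerts, and per-period tracking
+   - Helps establish a continuous "deploy → monitor → retrain" operating cycle
 ---
 ## Table of contents
+- [Getting-started tutorials (Notebook)](#入门教程notebook)
 - [Three usage modes](#三种使用方式)
 - [Core features overview](#核心功能概览)
 - [Installation](#安装)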
@@ -19,6 +56,17 @@
 ---
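+## Getting-started tutorials (Notebook)
+Read them in the order below: get a run working first, then go deeper.
+| Notebook | Who it is for | What you get |
+|----------|---------------|--------------|
+| [**ProScore Quick Start**](notebooks/ProScore快速开始.ipynb) | First-time users | A single chained path in 5–10 minutes, looking only at KS/AUC/PSI, the selected variables, and the diagnosis summary |
+| [**ProScore Full Modeling Workflow**](notebooks/ProScore完整建模流程.ipynb) | Preparing for production | Modular vs. chained side by side, CFG as the single source of truth for parameters, rule mining, monitoring, reporting, diagnosis |
+> The quick start is deliberately minimal (no optional steps such as rule mining); the full version is the authoritative example, with `[主线]` (main path) / `[可选]` (optional) chapter navigation and consistency assertions.
 ## Three usage modes
 ProScore offers three progressive ways to use it, from zero code to full customization; pick whichever fits.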
@@ -68,7 +116,11 @@ p = (
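 > `train` is required; `test` and `oot` are optional. Binning/WOE is fitted on train only; stepwise regression uses test to monitor overfitting; OOT is used only for the final evaluation.
 >
-> For the full tutorial, see [notebooks/ProScore完整建模流程.ipynb](notebooks/ProScore完整建模流程.ipynb)
+> For the Notebook tutorials, see [Getting-started tutorials](#入门教程notebook) above.
+>
+> **Enhanced diagnosis** (v0.2+): `.evaluate().diagnose()` produces a 4-layer structured health report (including root-cause variables) and accepts `thresholds=...` for custom thresholds.
+>
+> **Single source of truth for parameters (recommended)**: `CFG` + `PipelineSpec` (`apply(spec)`) ensures the modular and chained APIs give the same results for the same parameters; see [pipeline-spec.md](docs/使用指南/pipeline-spec.md).
 ### C. Excel-driven configuration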
@@ -100,9 +152,11 @@ proscore run my_project/pipeline_template.xlsx --output-script run.py
 |------------|-----------------------------------------------|---------------------------------------|
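 | Data exploration | IV/AUC/KS metrics + PSI stability over time + correlation/VIF | Quickly screen strong variables and spot distribution-drift risk |
 | Binning | 4 monotonic trends + 5 binning methods + two-stage trend validation | Keeps WOE trends consistent with business logic and regulatory expectations |
+| Rule mining | Single-variable/cross rules + joint Lift/Precision/Recall filtering | Produces interpretable strategy rules, mutually exclusive with scorecard variables |
 | Stepwise regression | Bidirectional selection + five constraints (p-value/sign/VIF/correlation/source) | Rigorous multicollinearity control and dimension-attribution management |
 | Model monitoring | Score/Feature PSI + rule-engine alerts + JSON persistence | Continuous post-deployment validation with automatic risk alerts |
 | Report generation | Automatic 7-chapter Markdown report (with charts) | One-click generation of CBIRC-compliant documentation |
+| Model diagnosis | 4-layer health check + root-cause localization + customizable thresholds | Automatic pre-deployment risk identification, supporting strategy fine-tuning |
 ### Design principles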
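
The README text added here names five monotonic direction settings (increasing/decreasing/u/inverted_u/none). The sketch below shows one way such a shape could be recognised from per-bin WOE values. `classify_trend` is a hypothetical helper written for illustration and is not taken from ProScore. In this sketch `"none"` means "no recognised shape", whereas in ProScore's configuration `none` presumably means "no constraint".

```python
def classify_trend(values):
    """Classify a sequence of per-bin values (e.g. WOE) by its shape.

    Returns one of: "increasing", "decreasing", "u", "inverted_u", "none".
    Flat steps (equal neighbours) are ignored.
    """
    # Sign of each step between neighbouring bins, dropping flat steps.
    signs = []
    for prev, cur in zip(values, values[1:]):
        if cur > prev:
            signs.append(1)
        elif cur < prev:
            signs.append(-1)
    if not signs:
        return "none"
    # Count how many times the direction flips.
    flips = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    if flips == 0:
        return "increasing" if signs[0] > 0 else "decreasing"
    if flips == 1:
        # Down then up is a U shape; up then down is an inverted U.
        return "u" if signs[0] < 0 else "inverted_u"
    return "none"

shape_a = classify_trend([-1.2, -0.4, 0.1, 0.9])
shape_b = classify_trend([0.8, 0.1, -0.3, 0.2, 0.7])
shape_c = classify_trend([0.1, 0.5, 0.2, 0.6])
```

A binning step that enforces a configured direction could compare the requested setting with a check like this and merge adjacent bins until the two agree; that merging loop is left out of the sketch.
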

{proscore-0.2.0 → proscore-0.2.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "proscore"
-version = "0.2.0"
+version = "0.2.2"
 description = "Production-grade scorecard development toolkit"
 readme = "README.md"
 license = "MIT"

{proscore-0.2.0 → proscore-0.2.2}/src/proscore/__init__.py RENAMED

@@ -19,7 +19,7 @@ from proscore.rules import RuleMiner
 from proscore.selection import Filter, StepwiseSelector, assess_screen
 from proscore.transform import WOETransformer
-__version__ = "0.2.0"
+__version__ = "0.2.2"
 class ProScore:
@@ -426,6 +426,42 @@ class ProScore:
         )
         return self
+    def diagnose(self, *, print_report: bool = True, **kwargs) -> ProScore:
+        """Run model health diagnosis (post-evaluate) and optionally print a formatted report.
+        By default prints the human-readable report (for notebook / interactive use).
+        Set ``print_report=False`` to obtain the :class:`~proscore.evaluate.DiagnosisReport`
+        silently via the ``diagnosis_`` property.
+        Pass additional artefacts for deeper root-cause analysis, or override thresholds::
+            p.diagnose(
+                binning=p.binner_,
+                selector=p.selector_,
+                stability=stability_result,
+                period_eval=period_result,
+                thresholds={"discrimination": {"ks_critical": 0.18}},
+            )
+        """
+        from proscore.evaluate import diagnose as _diagnose
+        report = _diagnose(
+            self.eval_result,
+            binning=kwargs.pop("binning", self._binner),
+            selector=kwargs.pop("selector", self._selector),
+            y_train=self._train_y(),
+            **kwargs,
+        )
+        if print_report:
+            print(report)
+        self._diagnosis = report
+        return self
+    @property
+    def diagnosis_(self):
+        """The :class:`DiagnosisReport` from the last :meth:`diagnose` call."""
+        return getattr(self, "_diagnosis", None)
     # ── properties ────────────────────────────────────────────────────────────
     @property
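
The new `diagnose()` method combines three small patterns: `kwargs.pop(name, default)` so a caller-supplied artefact replaces the stored one without being passed twice, forwarding of the remaining keyword arguments (such as `thresholds`), and `return self` to keep the chain going. The toy below isolates those patterns. `Pipeline` and `_fake_diagnose` are hypothetical stand-ins, not the real ProScore classes or the real `proscore.evaluate.diagnose`.

```python
class Pipeline:
    """Toy stand-in for a fluent pipeline object (not the real ProScore class)."""

    def __init__(self, binner, selector):
        self._binner = binner
        self._selector = selector
        self._diagnosis = None

    def diagnose(self, *, print_report=True, **kwargs):
        # pop() removes the key from kwargs, so it is not passed twice below;
        # the stored artefact is used only when the caller did not supply one.
        report = _fake_diagnose(
            binning=kwargs.pop("binning", self._binner),
            selector=kwargs.pop("selector", self._selector),
            **kwargs,
        )
        if print_report:
            print(report)
        self._diagnosis = report
        return self  # returning self keeps the call chain going

    @property
    def diagnosis_(self):
        return self._diagnosis


def _fake_diagnose(*, binning, selector, thresholds=None):
    """Hypothetical diagnose function that just echoes what it received."""
    return {"binning": binning, "selector": selector, "thresholds": thresholds}


p = Pipeline(binner="stored-binner", selector="stored-selector")
override = {"discrimination": {"ks_critical": 0.18}}
same_p = p.diagnose(print_report=False, binning="custom-binner", thresholds=override)
```

After the call, `p.diagnosis_` holds the report built with the custom binner and the stored selector, which mirrors how the diff falls back to `self._binner` and `self._selector`.
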

{proscore-0.2.0 → proscore-0.2.2}/src/proscore/_pipeline_config.py RENAMED

@@ -20,6 +20,46 @@ from typing import Any
 import numpy as np
 import pandas as pd
+def _train_test_from_dev_pool(
+    dev_pool: pd.DataFrame,
+    *,
+    target: str | None,
+    train_ratio: float,
+    random_state: int,
+) -> tuple[pd.DataFrame, pd.DataFrame]:
+    """Stratified train/test split when *target* has ≥2 samples per class."""
+    if (
+        target
+        and target in dev_pool.columns
+        and len(dev_pool) > 1
+        and dev_pool[target].nunique() >= 2
+        and int(dev_pool[target].value_counts().min()) >= 2
+    ):
+        from sklearn.model_selection import train_test_split
+        tr, te = train_test_split(
+            dev_pool,
+            train_size=train_ratio,
+            stratify=dev_pool[target],
+            random_state=random_state,
+        )
+        return tr.reset_index(drop=True), te.reset_index(drop=True)
+    n = len(dev_pool)
+    rng = np.random.RandomState(random_state)
+    idx = rng.permutation(n)
+    split = int(n * train_ratio)
+    if n > 1:
+        split = max(1, min(n - 1, split))
+    else:
+        split = n
+    return (
+        dev_pool.iloc[idx[:split]].reset_index(drop=True),
+        dev_pool.iloc[idx[split:]].reset_index(drop=True),
+    )
 # ── constants ────────────────────────────────────────────────────────────────
 _DEFAULT_GLOBAL = {
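
When the stratified branch of `_train_test_from_dev_pool` does not apply, the function falls back to a permutation split and clamps the split index. The standalone sketch below reproduces only that clamp arithmetic in plain Python so its edge cases are easy to see. `split_point` is an illustrative name, not a function in the package.

```python
def split_point(n, train_ratio):
    """Number of rows assigned to train in the non-stratified fallback."""
    split = int(n * train_ratio)
    if n > 1:
        # Keep at least one row on each side of the split.
        split = max(1, min(n - 1, split))
    else:
        # With 0 or 1 rows there is nothing to share out; train takes all.
        split = n
    return split

# (train rows, test rows) for a few small pool sizes at train_ratio=0.7
sizes = {n: (split_point(n, 0.7), n - split_point(n, 0.7)) for n in (0, 1, 2, 3, 4)}
# Degenerate ratios are clamped rather than producing an empty side.
extreme = (split_point(5, 0.0), split_point(5, 1.0))
```

In 0.2.0 the split was a bare `int(n * train_ratio)`, which could leave one side empty for tiny pools or extreme ratios; the clamp prevents that whenever there are at least two rows.
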
@@ -99,6 +139,8 @@ _PARAM_SPEC = {
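                    "Whether to force-fill when fewer than n_min variables are selected"),
     "perturbation": ("on", ["on", "off"], "str",
                      "Whether to enable perturbation search"),
+    "max_iter_round": (100, None, "int", 2, 200,
+                       "Maximum number of stepwise-regression iterations"),
     "odds": (20, None, "int", 10, 100,
              "Baseline good/bad odds (1:20 ≈ 4.8% bad rate)"),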
     "pdo": (20, None, "int", 10, 50,
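
The `odds`, `pdo`, and `base_score` parameters follow the usual scorecard scaling convention. The sketch below shows the textbook relationship under two assumptions that this diff does not confirm: that the score rises with good:bad odds, and that `base_score` is 600 (its default is not visible in this hunk). The function names are illustrative.

```python
import math

def scaling_constants(odds, pdo, base_score):
    """Return (factor, offset) so that score = offset + factor * ln(good:bad odds)."""
    factor = pdo / math.log(2)                      # points needed to double the odds
    offset = base_score - factor * math.log(odds)   # pins base_score at the base odds
    return factor, offset

def score_from_bad_probability(p_bad, odds=20, pdo=20, base_score=600):
    """Map a predicted bad probability to a score (base_score=600 is hypothetical)."""
    factor, offset = scaling_constants(odds, pdo, base_score)
    good_bad_odds = (1 - p_bad) / p_bad
    return offset + factor * math.log(good_bad_odds)

bad_rate_at_base = 1 / (1 + 20)                   # the "1:20 ≈ 4.8%" in the parameter text
at_base = score_from_bad_probability(1 / 21)      # good:bad = 20:1
at_double = score_from_bad_probability(1 / 41)    # good:bad = 40:1
```

Under these assumptions a sample at the base odds scores `base_score`, and doubling the odds adds exactly `pdo` points.
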
@@ -115,6 +157,10 @@ _PARAM_SPEC = {
                           "Maximum decision-tree depth (tree mode)", "rules"),
     "rm_min_lift": (3.0, None, "float", 1.0, 10.0,
                     "Minimum Lift (precision / overall bad rate)", "rules"),
+    "rm_min_precision": (None, None, "float", 0.0, 1.0,
+                         "Minimum Precision (leave blank to disable)", "rules"),
+    "rm_min_recall": (None, None, "float", 0.0, 1.0,
+                      "Minimum Recall (leave blank to disable)", "rules"),
     "rm_min_hit_rate": (0.02, None, "float", 0.001, 0.5,
                         "Minimum hit rate (share of samples covered)", "rules"),
     "rm_max_hit_rate": (0.20, None, "float", 0.01, 0.8,
@@ -307,7 +353,8 @@ class PipelineConfig:
                 target[key] = spec[0]
             elif section == "modeling" and (stage is None):
                 if key in ("n_min", "n_max", "pvalue_threshold", "coef_sign",
-                           "force_fill", "perturbation", "odds", "pdo", "base_score"):
+                           "force_fill", "perturbation", "max_iter_round",
+                           "odds", "pdo", "base_score"):
                     target[key] = spec[0]
             elif section == "rules" and stage == "rules":
                 target[key.removeprefix("rm_")] = spec[0]
@@ -321,7 +368,8 @@ class PipelineConfig:
             # ── Rules section: bare Excel keys → rm_ prefixed _PARAM_SPEC keys ──
             if section == "rules":
                 valid = ("method", "max_depth", "max_tree_depth",
-                         "min_lift", "min_hit_rate", "max_hit_rate",
+                         "min_lift", "min_precision", "min_recall",
+                         "min_hit_rate", "max_hit_rate",
                          "max_rules", "random_state", "export_csv")
                 # Accept both bare keys and legacy rm_ prefixed keys
                 bare_key = key.removeprefix("rm_")
@@ -351,7 +399,8 @@ class PipelineConfig:
                 continue
             if section == "modeling":
                 valid = ("n_min", "n_max", "pvalue_threshold", "coef_sign",
-                         "force_fill", "perturbation", "odds", "pdo", "base_score")
+                         "force_fill", "perturbation", "max_iter_round",
+                         "odds", "pdo", "base_score")
                 if key not in valid:
                     continue
@@ -546,12 +595,14 @@ class PipelineConfig:
     def _load_data(self) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame | None]:
         """Load and split data according to config."""
-        np.random.seed(int(self.global_cfg.get("random_seed", 42)))
+        seed = int(self.global_cfg.get("random_seed", 42))
+        np.random.seed(seed)
         fpath = self.data_cfg["data_file"]
         time_col = self.data_cfg.get("time_col")
         id_col = self.data_cfg.get("id_col")
         train_ratio = float(self.data_cfg.get("train_ratio", 0.7))
+        target = self.data_cfg.get("target")
         # Load
         if str(fpath).endswith((".xlsx", ".xls")):
@@ -602,12 +653,13 @@ class PipelineConfig:
             # Drop time_col from dev_pool
             dev_pool = dev_pool.drop(columns=[time_col], errors="ignore")
-        # Random split within dev pool
-        n = len(dev_pool)
-        idx = np.random.permutation(n)
-        split = int(n * train_ratio)
-        train = dev_pool.iloc[idx[:split]].reset_index(drop=True)
-        test = dev_pool.iloc[idx[split:]].reset_index(drop=True)
+        tgt = str(target).strip() if target else None
+        train, test = _train_test_from_dev_pool(
+            dev_pool,
+            target=tgt if tgt else None,
+            train_ratio=train_ratio,
+            random_state=seed,
+        )
         if oot is not None:
             oot = oot.reset_index(drop=True)
@@ -754,10 +806,19 @@ class PipelineConfig:
         return result
     def _build_prefilter_kw(self) -> dict[str, Any]:
-        kw: dict[str, Any] = {}
+        kw: dict[str, Any] = {
+            # Coarse screening skips IV/PSI (matching the usual chained-Notebook style; fine screening happens in refine)
+            "iv_range": None,
+            "max_psi": None,
+        }
+        cfg = self.screening_cfg
         for key in ("max_missing_rate", "max_one_value_rate"):
-            if key in self.screening_cfg:
-                kw[key] = self.screening_cfg[key]
+            if key in cfg:
+                kw[key] = cfg[key]
+        if cfg.get("max_corr") is not None:
+            kw["max_corr"] = float(cfg["max_corr"])
+        if cfg.get("max_vif") is not None:
+            kw["max_vif"] = float(cfg["max_vif"])
         return kw
     def _build_binning_kw(self) -> dict[str, Any]:
@@ -794,7 +855,7 @@ class PipelineConfig:
     def _build_select_kw(self) -> dict[str, Any]:
         kw: dict[str, Any] = {}
         cfg = self.modeling_cfg
-        for key in ("n_min", "n_max", "pvalue_threshold"):
+        for key in ("n_min", "n_max", "pvalue_threshold", "max_iter_round"):
             if key in cfg:
                 kw[key] = cfg[key]
         cs = cfg.get("coef_sign", "positive")
@@ -816,7 +877,8 @@ class PipelineConfig:
         kw: dict[str, Any] = {}
         cfg = self.rules_cfg
         for key in ("method", "max_depth", "max_tree_depth", "min_lift",
-                     "min_hit_rate", "max_hit_rate", "max_rules", "random_state"):
+                     "min_precision", "min_recall", "min_hit_rate",
+                     "max_hit_rate", "max_rules", "random_state"):
             if key in cfg:
                 kw[key] = cfg[key]
         return kw
@@ -873,6 +935,7 @@ class PipelineConfig:
         _w("import numpy as np")
         _w("import pandas as pd")
         _w("import proscore as ps")
+        _w("from proscore._pipeline_config import _train_test_from_dev_pool")
         _w("")
         _w(f"np.random.seed({seed})")
         _w("")
@@ -890,6 +953,7 @@ class PipelineConfig:
             oot_start = self.data_cfg.get("oot_start")
             oot_end = self.data_cfg.get("oot_end")
             _w("dev_pool = df.copy()")
+            _w("oot = None")
             if dev_start:
                 _w(f"dev_pool = dev_pool[dev_pool[{time_col!r}] >= pd.Timestamp({str(dev_start)!r})]")
             if dev_end:
@@ -901,16 +965,16 @@ class PipelineConfig:
                 _w("oot = df[oot_mask].drop(columns=[" + repr(time_col) + "]).reset_index(drop=True)")
             _w(f"dev_pool = dev_pool.drop(columns=[{time_col!r}])")
             _w("")
-            _w("idx = np.random.permutation(len(dev_pool))")
-            _w(f"n_train = int(len(dev_pool) * {train_ratio})")
-            _w("train = dev_pool.iloc[idx[:n_train]].reset_index(drop=True)")
-            _w("test  = dev_pool.iloc[idx[n_train:]].reset_index(drop=True)")
         else:
-            _w("idx = np.random.permutation(len(df))")
-            _w(f"n_train = int(len(df) * {train_ratio})")
-            _w("train = df.iloc[idx[:n_train]].reset_index(drop=True)")
-            _w("test  = df.iloc[idx[n_train:]].reset_index(drop=True)")
+            _w("dev_pool = df.copy()")
             _w("oot = None")
+            _w("")
+        tgt_py = repr(str(target)) if (target and str(target).strip()) else "None"
+        _w(
+            f"train, test = _train_test_from_dev_pool(dev_pool, target={tgt_py}, "
+            f"train_ratio={float(train_ratio)}, random_state={int(seed)})"
+        )
         _w("")
         _w("# ── Modeling pipeline ──")
@@ -1125,11 +1189,13 @@ def generate_template(out_dir: str = ".") -> str:
         # ── Modeling ────────────────────────────────────────────────────────
         _write_params_sheet(writer, "Modeling",
                             ["n_min", "n_max", "pvalue_threshold", "coef_sign",
-                             "force_fill", "perturbation", "odds", "pdo", "base_score"])
+                             "force_fill", "perturbation", "max_iter_round",
+                             "odds", "pdo", "base_score"])
         # ── Rules ───────────────────────────────────────────────────────────
         _write_params_sheet(writer, "Rules",
-                            ["method", "max_depth", "max_tree_depth", "min_lift", "min_hit_rate",
+                            ["method", "max_depth", "max_tree_depth", "min_lift",
+                             "min_precision", "min_recall", "min_hit_rate",
                              "max_hit_rate", "max_rules", "random_state", "export_csv"],
                             section="rules")
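
This release threads two new optional rule-mining thresholds, `min_precision` and `min_recall`, through the config, both defaulting to `None` (disabled). The sketch below illustrates how such optional thresholds might be combined with the existing Lift and hit-rate bounds. It assumes all enabled thresholds are combined with AND, and the helper names `rule_metrics` and `passes` are hypothetical; only the default values (3.0, 0.02, 0.20) and the Lift definition come from the parameter spec in this diff.

```python
def rule_metrics(hits, bad_hits, total, total_bad):
    """Basic metrics for one rule, from counts on a labelled sample."""
    precision = bad_hits / hits      # share of hit samples that are bad
    recall = bad_hits / total_bad    # share of all bad samples that are hit
    hit_rate = hits / total          # share of all samples that are hit
    lift = precision / (total_bad / total)  # precision / overall bad rate
    return {"precision": precision, "recall": recall,
            "hit_rate": hit_rate, "lift": lift}

def passes(metrics, *, min_lift=3.0, min_precision=None, min_recall=None,
           min_hit_rate=0.02, max_hit_rate=0.20):
    """Apply every enabled threshold; a threshold of None is skipped."""
    lower_bounds = {"lift": min_lift, "precision": min_precision,
                    "recall": min_recall, "hit_rate": min_hit_rate}
    for name, bound in lower_bounds.items():
        if bound is not None and metrics[name] < bound:
            return False
    if max_hit_rate is not None and metrics["hit_rate"] > max_hit_rate:
        return False
    return True

# A rule hitting 500 of 10,000 samples, 100 of them bad, with 500 bad overall.
m = rule_metrics(hits=500, bad_hits=100, total=10_000, total_bad=500)
kept_by_default = passes(m)
kept_with_precision = passes(m, min_precision=0.3)
```

With the defaults the example rule is kept (Lift of about 4 and a 5% hit rate), while enabling `min_precision=0.3` rejects it because its precision is only 0.2.
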

{proscore-0.2.0 → proscore-0.2.2}/src/proscore/evaluate/__init__.py RENAMED

@@ -5,9 +5,14 @@ from __future__ import annotations
 import importlib
 _metrics = importlib.import_module("proscore.evaluate._metrics")
+_diagnose = importlib.import_module("proscore.evaluate._diagnose")
 evaluate = _metrics.evaluate
 evaluate_by_period = getattr(_metrics, "evaluate_by_period", None)
+diagnose = _diagnose.diagnose
+DiagnosisReport = _diagnose.DiagnosisReport
+DiagnosisIssue = _diagnose.DiagnosisIssue
+DEFAULT_THRESHOLDS = _diagnose.DEFAULT_THRESHOLDS
 if evaluate_by_period is None:
     raise ImportError(
@@ -16,4 +21,4 @@ if evaluate_by_period is None:
         "(Kernel → Restart) to clear cached imports."
     )
-__all__ = ["evaluate", "evaluate_by_period"]
+__all__ = ["evaluate", "evaluate_by_period", "diagnose", "DiagnosisReport", "DiagnosisIssue", "DEFAULT_THRESHOLDS"]
