PyPI - ins-pricing - Versions diffs - 0.2.7__py3-none-any.whl → 0.2.9__py3-none-any.whl - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (31)

ins_pricing/CHANGELOG.md +179 -0
ins_pricing/RELEASE_NOTES_0.2.8.md +344 -0
ins_pricing/modelling/core/bayesopt/utils.py +2 -1
ins_pricing/modelling/explain/shap_utils.py +209 -6
ins_pricing/pricing/calibration.py +125 -1
ins_pricing/pricing/factors.py +110 -1
ins_pricing/production/preprocess.py +166 -0
ins_pricing/setup.py +1 -1
ins_pricing/tests/governance/__init__.py +1 -0
ins_pricing/tests/governance/test_audit.py +56 -0
ins_pricing/tests/governance/test_registry.py +128 -0
ins_pricing/tests/governance/test_release.py +74 -0
ins_pricing/tests/pricing/__init__.py +1 -0
ins_pricing/tests/pricing/test_calibration.py +72 -0
ins_pricing/tests/pricing/test_exposure.py +64 -0
ins_pricing/tests/pricing/test_factors.py +156 -0
ins_pricing/tests/pricing/test_rate_table.py +40 -0
ins_pricing/tests/production/__init__.py +1 -0
ins_pricing/tests/production/test_monitoring.py +350 -0
ins_pricing/tests/production/test_predict.py +233 -0
ins_pricing/tests/production/test_preprocess.py +339 -0
ins_pricing/tests/production/test_scoring.py +311 -0
ins_pricing/utils/profiling.py +377 -0
ins_pricing/utils/validation.py +427 -0
ins_pricing-0.2.9.dist-info/METADATA +149 -0
{ins_pricing-0.2.7.dist-info → ins_pricing-0.2.9.dist-info}/RECORD +28 -12
ins_pricing/CHANGELOG_20260114.md +0 -275
ins_pricing/CODE_REVIEW_IMPROVEMENTS.md +0 -715
ins_pricing-0.2.7.dist-info/METADATA +0 -101
{ins_pricing-0.2.7.dist-info → ins_pricing-0.2.9.dist-info}/WHEEL +0 -0
{ins_pricing-0.2.7.dist-info → ins_pricing-0.2.9.dist-info}/top_level.txt +0 -0

ins_pricing/CODE_REVIEW_IMPROVEMENTS.md DELETED

@@ -1,715 +0,0 @@
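-# ins_pricing Code Review and Improvement Plan
-**Review date**: 2026-01-14
-**Review scope**: All Python code in the ins_pricing package
-**Goals**: Improve runtime efficiency, ensure cross-platform extensibility, and strengthen code maintainability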
----
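-## Executive Summary
-After a full review, the **ins_pricing** package has a well-designed overall architecture and high code quality. Its main strengths are:
-- ✅ Good modular design and separation of concerns
-- ✅ Cross-platform path handling via `pathlib.Path`
-- ✅ Lazy loading to optimize import performance
-- ✅ Solid distributed training support (DDP, DataParallel)
-- ✅ High type-annotation coverage
-- ✅ A well-developed logging system
-The improvement opportunities identified are concentrated in performance optimization, stronger error handling, and more complete documentation.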
----
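-## 1. Performance Optimization Recommendations
-### 1.1 High-Priority Optimizations
-#### 1.1.1 Memory Optimization in DatasetPreprocessor
-**Current issue**:
-The `DatasetPreprocessor.run()` method in `config_preprocess.py` copies DataFrames in several places: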
-```python
-# Lines 448-451, 470-474
-train_oht = self.train_data[cfg.factor_nmes + [cfg.weight_nme] + [cfg.resp_nme]].copy()
-test_oht = self.test_data[cfg.factor_nmes + [cfg.weight_nme] + [cfg.resp_nme]].copy()
-self.train_oht_data = train_oht.copy(deep=False)  # second copy
-train_oht_scaled = train_oht.copy(deep=False)     # third copy
-```
-**Proposed improvement**:
-```python
-# Optimized: fewer unnecessary copies
-train_oht = self.train_data[cfg.factor_nmes + [cfg.weight_nme] + [cfg.resp_nme]].copy()
-test_oht = self.test_data[cfg.factor_nmes + [cfg.weight_nme] + [cfg.resp_nme]].copy()
-# Apply one-hot encoding
-train_oht = pd.get_dummies(train_oht, columns=cate_list, drop_first=True, dtype=np.int8)
-test_oht = pd.get_dummies(test_oht, columns=cate_list, drop_first=True, dtype=np.int8)
-test_oht = test_oht.reindex(columns=train_oht.columns, fill_value=0)
-# Store the references directly and skip the shallow copy
-self.train_oht_data = train_oht
-self.test_oht_data = test_oht
-# Copy only when scaling is needed
-train_oht_scaled = train_oht.copy() if self.num_features else train_oht
-test_oht_scaled = test_oht.copy() if self.num_features else test_oht
-```
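-**Expected effect**: 30-40% less memory use and preprocessing time
-#### 1.1.2 Pandas apply() Performance Optimization
-**Current issue**:
-Some files use `.apply()` for row-wise operations, which performs poorly.
-**Locations detected**:
-- `pricing/exposure.py`
-- `pricing/factors.py`
-- `production/monitoring.py`
-**Proposed improvement**: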
-```python
-# Replace apply() with vectorized operations
-# Example: given df.apply(lambda x: x['a'] * x['b'], axis=1)
-# Optimized: df['a'] * df['b']
-```
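-**Expected effect**: 5-50x speedup (depending on dataset size)
-#### 1.1.3 GPU Memory Management in the Training Loop
-**Current implementation**: The training loop at `utils.py:828-870` already uses `autocast` and `GradScaler` correctly
-**Suggested enhancement**: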
-```python
-# Add explicit cleanup at the end of an epoch
-if epoch % 10 == 0:  # every 10 epochs
-    if torch.cuda.is_available():
-        torch.cuda.empty_cache()
-    gc.collect()
-```
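-**Location**: `modelling/core/bayesopt/utils.py:930` (end of the epoch loop)
-### 1.2 Medium-Priority Optimizations
-#### 1.2.1 Caching Repeated Computations
-**Suggested locations**:
-- `pricing/factors.py`: cache binning results for identical factor columns
-- `modelling/plotting/curves.py`: cache ROC/PR curve computations
-**Implementation example**: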
-```python
-from functools import lru_cache
-@lru_cache(maxsize=128)
-def _compute_bins_cached(col_hash, n_bins):
-    # Binning logic goes here
-    pass
-```
-#### 1.2.2 Parallelizing SHAP Computation
-**Current location**: `modelling/explain/shap_utils.py`
-**Suggestion**:
-```python
-# Compute SHAP values in parallel with joblib
-from joblib import Parallel, delayed
-shap_values = Parallel(n_jobs=-1)(
-    delayed(explainer.shap_values)(batch)
-    for batch in np.array_split(X, n_jobs)
-)
-```
----
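-## 2. Cross-Platform Compatibility Improvements
-### 2.1 ✅ Already Well Implemented
-#### 2.1.1 Path Handling
-- `utils/paths.py`: uses `pathlib.Path` throughout, with excellent cross-platform compatibility
-- All key modules use `Path.resolve()` and `Path.joinpath()` correctly
-#### 2.1.2 Process Management
-- `cli/watchdog_run.py:27-46`: correctly separates Windows and Unix process-termination logic
-- Detects the platform with `os.name == "nt"`
-### 2.2 Areas Needing Improvement
-#### 2.2.1 Cleaning Up Hard-Coded Path Separators
-**Location**: `modelling/core/bayesopt/model_plotting_mixin.py:66-75, 209-212`
-**Current code**: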
-```python
-plot_subdir = plot_subdir.strip("/\\")  # hard-coded separators
-.replace("/", "_").replace("\\", "_")
-```
-**Proposed improvement**:
-```python
-import os
-# Use os.sep, or avoid string manipulation altogether
-plot_subdir = str(Path(plot_subdir))  # normalizes automatically
-safe_name = plot_subdir.replace(os.sep, "_")
-```
-#### 2.2.2 File Permission Issues
-**Suggestion**: Check permissions before every file-creation operation
-```python
-def ensure_writable(path: Path) -> None:
-    """Ensure the path is writable, cross-platform"""
-    path.parent.mkdir(parents=True, exist_ok=True)
-    if path.exists() and not os.access(path, os.W_OK):
-        raise PermissionError(f"Cannot write to {path}")
-```
-**Where to apply**:
-- `governance/registry.py`
-- `governance/audit.py`
-- `reporting/report_builder.py`
----
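-## 3. Improving Code Maintainability
-### 3.1 Documentation Improvements
-#### 3.1.1 Missing Top-Level README
-**Suggested file to create**: `ins_pricing/README.md`
-**Should include**:
-- Architecture overview diagram
-- Quick-start guide
-- API reference index
-- FAQ
-#### 3.1.2 Richer Module-Level Docstrings
-**Current state**: Most modules have a basic docstring
-**Suggested improvement**: Add usage examples and parameter descriptions
-**Example** (`pricing/factors.py`):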
-```python
-"""Factor table construction for insurance pricing.
-This module provides utilities for building factor tables from raw data,
-including automatic binning, smoothing, and credibility weighting.
-Example:
-    >>> from ins_pricing.pricing import build_factor_table
-    >>> factor_table = build_factor_table(
-    ...     df=data,
-    ...     factor_col='age_band',
-    ...     loss_col='claim_amount',
-    ...     exposure_col='exposure_years',
-    ...     method='quantile',
-    ...     n_bins=10
-    ... )
-    >>> print(factor_table)
-See Also:
-    - calibration.py: Factor calibration and application
-    - rate_table.py: Premium calculation using factor tables
-"""
-```
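-### 3.2 Type Annotation Improvements
-#### 3.2.1 Completing Return Type Annotations
-**Issue detected**: Some functions lack return types
-**Suggestion**: Add complete type annotations to all public APIs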
-```python
-# Current
-def compute_exposure(df, start_col, end_col):
-    ...
-# Improved
-def compute_exposure(
-    df: pd.DataFrame,
-    start_col: str,
-    end_col: str,
-    time_unit: Literal['days', 'years'] = 'days'
-) -> pd.Series:
-    """
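-    Compute insurance exposure.
-    Args:
-        df: DataFrame containing the start and end dates
-        start_col: Name of the start-date column
-        end_col: Name of the end-date column
-        time_unit: Time unit, 'days' or 'years'
-    Returns:
-        A Series containing the exposure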
-    """
-    ...
-```
-### 3.3 Stronger Error Handling
-#### 3.3.1 Specific Exception Types
-**Current issue**: Some code uses a bare `Exception`
-**Suggested file to create**: `ins_pricing/exceptions.py`
-```python
-"""Custom exceptions for ins_pricing."""
-class InsPricingError(Exception):
-    """Base exception for all ins_pricing errors."""
-    pass
-class ConfigurationError(InsPricingError):
-    """Invalid configuration."""
-    pass
-class DataValidationError(InsPricingError):
-    """Data validation failed."""
-    pass
-class ModelLoadError(InsPricingError):
-    """Failed to load model."""
-    pass
-class DistributedTrainingError(InsPricingError):
-    """Distributed training failure."""
-    pass
-```
-**Application example**:
-```python
-# At config_preprocess.py:399-402
-if missing_train:
-    raise ConfigurationError(
-        f"Train data missing required columns: {missing_train}. "
-        f"Available columns (first 50): {list(self.train_data.columns)[:50]}"
-    )
-```
-#### 3.3.2 More Validation and Fail-Fast Checks
-**Suggested location**: `production/predict.py`
-```python
-def load_predictor_from_config(config_json: Path) -> dict:
-    """Load the predictor with full validation"""
-    if not config_json.exists():
-        raise FileNotFoundError(f"Config not found: {config_json}")
-    try:
-        config = _load_json(config_json)
-    except json.JSONDecodeError as e:
-        raise ConfigurationError(f"Invalid JSON in {config_json}: {e}")
-    # Validate required fields
-    required_fields = ['model_name', 'task_type', 'base_dir']
-    missing = [f for f in required_fields if f not in config]
-    if missing:
-        raise ConfigurationError(f"Missing required fields: {missing}")
-    return config
-```
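-### 3.4 Code Reuse and Refactoring
-#### 3.4.1 Extracting Shared Utility Functions
-**Duplicated code identified**:
-1. **Path-resolution logic** (duplicated across several files)
-   - Already centralized in `utils/paths.py` ✅
-   - Suggestion: make sure every module uses `resolve_path` rather than a custom implementation
-2. **DataFrame column validation** (recurring pattern)
-**Suggested file to create**: `utils/validation.py`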
-```python
-def validate_required_columns(
-    df: pd.DataFrame,
-    required: List[str],
-    *,
-    df_name: str = "DataFrame"
-) -> None:
-    """Validate that the DataFrame contains the required columns"""
-    missing = [col for col in required if col not in df.columns]
-    if missing:
-        raise DataValidationError(
-            f"{df_name} missing required columns: {missing}. "
-            f"Available: {list(df.columns)[:50]}"
-        )
-def validate_column_types(
-    df: pd.DataFrame,
-    type_spec: Dict[str, type],
-    *,
-    coerce: bool = False
-) -> pd.DataFrame:
-    """Validate column types, optionally coercing them"""
-    for col, expected_type in type_spec.items():
-        if col not in df.columns:
-            continue
-        if not pd.api.types.is_dtype_equal(df[col].dtype, expected_type):
-            if coerce:
-                df[col] = df[col].astype(expected_type)
-            else:
-                raise DataValidationError(
-                    f"Column {col} has type {df[col].dtype}, "
-                    f"expected {expected_type}"
-                )
-    return df
-```
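-#### 3.4.2 Configuration Management Improvements
-**Current state**: Configuration is managed with dataclasses ✅
-**Suggested enhancement**: Add configuration validation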
-```python
-# Add to BayesOptConfig
-def __post_init__(self):
-    """Validate configuration consistency"""
-    if not 0.0 <= self.prop_test <= 1.0:
-        raise ValueError(f"prop_test must be in [0, 1], got {self.prop_test}")
-    if self.task_type not in {'regression', 'classification'}:
-        raise ValueError(f"task_type must be 'regression' or 'classification'")
-    if self.epochs < 1:
-        raise ValueError(f"epochs must be positive, got {self.epochs}")
-    # Ensure DDP and DataParallel are not enabled together
-    if self.use_resn_ddp and self.use_resn_data_parallel:
-        raise ValueError("Cannot use both DDP and DataParallel for ResNet")
-```
-### 3.5 Increasing Test Coverage
-**Current test files**:
-```
-tests/modelling/
-├── conftest.py
-├── test_cross_val_generic.py
-├── test_distributed_utils.py
-├── test_explain.py
-├── test_geo_tokens_split.py
-├── test_graph_cache.py
-├── test_plotting.py
-├── test_plotting_library.py
-└── test_preprocessor.py
-```
-**Suggested additions**:
-1. **Production module tests** (currently missing)
-   ```
-   tests/production/
-   ├── test_predict.py
-   ├── test_scoring.py
-   ├── test_monitoring.py
-   └── test_preprocess.py
-   ```
-2. **Pricing module tests** (currently missing)
-   ```
-   tests/pricing/
-   ├── test_factors.py
-   ├── test_exposure.py
-   ├── test_calibration.py
-   └── test_rate_table.py
-   ```
-3. **Governance module tests** (currently missing)
-   ```
-   tests/governance/
-   ├── test_registry.py
-   ├── test_release.py
-   └── test_audit.py
-   ```
----
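-## 4. Detailed Improvement Implementation Plan
-### Phase 1: High-Priority Fixes (immediate)
-1. **Fix the DDP state-dict mismatch** ✅ (done)
-   - `model_ft_trainer.py`: Lines 409, 738
-   - `model_resn.py`: Line 405
-   - `utils.py`: Line 796
-2. **Optimize DatasetPreprocessor memory use**
-   - File: `config_preprocess.py`
-   - Estimated effort: 2 hours
-   - Impact: all training pipelines
-3. **Add custom exception classes**
-   - Create: `exceptions.py`
-   - Update exception handling in all modules
-   - Estimated effort: 4 hours
-### Phase 2: Performance Optimization (this week)
-1. **Replace pandas apply() with vectorized operations**
-   - Files: `pricing/exposure.py`, `pricing/factors.py`
-   - Estimated effort: 6 hours
-   - Requires performance benchmarking
-2. **Parallelize SHAP computation**
-   - File: `modelling/explain/shap_utils.py`
-   - Estimated effort: 3 hours
-3. **Add memory cleanup to the training loop**
-   - File: `modelling/core/bayesopt/utils.py`
-   - Estimated effort: 1 hour
-### Phase 3: Documentation and Tests (within two weeks)
-1. **Write the top-level README**
-   - Estimated effort: 4 hours
-2. **Improve module docstrings**
-   - All public modules
-   - Estimated effort: 8 hours
-3. **Add the missing unit tests**
-   - Production, pricing, and governance modules
-   - Estimated effort: 16 hours
-### Phase 4: Code Quality Improvements (ongoing)
-1. **Add type checking (mypy)**
-   - Configure mypy
-   - Fix type errors
-   - Estimated effort: 8 hours
-2. **Standardize code formatting**
-   - Configure black, isort, flake8
-   - Estimated effort: 2 hours
-3. **Set up pre-commit hooks**
-   - Estimated effort: 2 hours
----
-## 5. Performance Benchmarks and Monitoring
-### 5.1 Suggested Performance Metrics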
-**Create**: `utils/profiling.py`
-```python
-"""Performance profiling utilities"""
-import time
-from contextlib import contextmanager
-from typing import Optional
-import psutil
-import torch
-@contextmanager
-def profile_section(name: str, logger: Optional[logging.Logger] = None):
-    """Profiling context manager"""
-    start = time.time()
-    start_mem = psutil.Process().memory_info().rss / 1024 / 1024  # MB
-    if torch.cuda.is_available():
-        torch.cuda.reset_peak_memory_stats()
-        start_gpu_mem = torch.cuda.memory_allocated() / 1024 / 1024
-    yield
-    elapsed = time.time() - start
-    end_mem = psutil.Process().memory_info().rss / 1024 / 1024
-    mem_delta = end_mem - start_mem
-    msg = f"[Profile] {name}: {elapsed:.2f}s, RAM: {mem_delta:+.1f}MB"
-    if torch.cuda.is_available():
-        peak_gpu = torch.cuda.max_memory_allocated() / 1024 / 1024
-        msg += f", GPU peak: {peak_gpu:.1f}MB"
-    if logger:
-        logger.info(msg)
-    else:
-        print(msg)
-# Usage example
-with profile_section("Preprocessing", logger):
-    preprocessor.run()
-```
-### 5.2 Memory Leak Detection
-**Suggestion**: Add memory monitoring to long-running training jobs
-```python
-# In the utils.py training loop
-if epoch % 50 == 0:
-    mem_info = psutil.Process().memory_info()
-    logger.info(f"Epoch {epoch} memory: RSS={mem_info.rss / 1024 / 1024:.1f}MB")
-    if mem_info.rss > 32 * 1024 * 1024 * 1024:  # > 32GB
-        logger.warning("High memory usage detected!")
-```
----
-## 6. Security Considerations
-### 6.1 Input Validation
-**Suggested locations to strengthen**:
-1. **JSON config loading**
-   ```python
-   # Add schema validation at every _load_json call site
-   import jsonschema
-   CONFIG_SCHEMA = {
-       "type": "object",
-       "required": ["model_name", "task_type"],
-       "properties": {
-           "model_name": {"type": "string", "minLength": 1},
-           "task_type": {"type": "string", "enum": ["regression", "classification"]},
-           ...
-       }
-   }
-   def load_config_safe(path: Path) -> dict:
-       config = _load_json(path)
-       jsonschema.validate(config, CONFIG_SCHEMA)
-       return config
-   ```
-2. **SQL injection protection** (if a database is used)
-   - Use parameterized queries
-   - Validate table and column names
-3. **Path traversal protection**
-   ```python
-   def safe_resolve_path(base: Path, user_path: str) -> Path:
-       """Prevent path traversal attacks"""
-       resolved = (base / user_path).resolve()
-       if not resolved.is_relative_to(base):
-           raise SecurityError(f"Path {user_path} escapes base directory")
-       return resolved
-   ```
----
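-## 7. Summary and Priority Matrix
-### Impact-Effort Matrix
-| Improvement | Impact | Effort | Priority |
-|-------|------|--------|--------|
-| DDP fix | High | Low | ✅ Done |
-| Memory optimization (Preprocessor) | High | Medium | P0 |
-| Custom exceptions | High | Low | P0 |
-| Vectorized operations | High | Medium | P1 |
-| SHAP parallelization | Medium | Low | P1 |
-| Documentation | Medium | High | P2 |
-| Unit tests | Medium | High | P2 |
-| Type checking | Low | Medium | P3 |
-### Next Steps
-**Action items for this week**:
-1. ✅ Apply the DDP fix (done)
-2. Implement the DatasetPreprocessor memory optimization
-3. Create the custom exception hierarchy
-4. Add profiling tools
-**Action items for this month**:
-1. Complete all P0-P1 optimizations
-2. Write the README and core module documentation
-3. Add tests for the production and pricing modules
-**Long-term goals**:
-1. Reach 80%+ test coverage
-2. Pass mypy strict mode
-3. Establish a continuous integration (CI) pipeline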
----
-## Appendix A: Coding Standards
-### Python Version
-- Minimum supported: Python 3.9 (current)
-- Recommended: Python 3.10+ (for better type hints)
-### Code Style
-- **Formatter**: `black` (line length 100)
-- **Import sorting**: `isort` (black-compatible)
-- **Linter**: `flake8` + `pylint`
-- **Type checker**: `mypy --strict`
-### Commit Conventions
-```
-<type>(<scope>): <subject>
-<body>
-<footer>
-```
-Types:
-- `feat`: new feature
-- `fix`: bug fix
-- `perf`: performance optimization
-- `docs`: documentation update
-- `test`: test additions
-- `refactor`: code refactoring
----
-## Appendix B: Dependency Management
-### Current Dependency Groups ✅
-```toml
-[project.optional-dependencies]
-bayesopt = ["torch", "optuna", "xgboost", ...]
-plotting = ["matplotlib", ...]
-explain = ["shap", ...]
-geo = ["contextily", ...]
-gnn = ["torch-geometric", ...]
-```
-### Suggested Additions
-```toml
-[project.optional-dependencies]
-dev = [
-    "pytest>=7.0",
-    "pytest-cov",
-    "mypy",
-    "black",
-    "isort",
-    "flake8",
-]
-profiling = [
-    "psutil",
-    "memory_profiler",
-    "line_profiler",
-]
-```
----
-**Document version**: 1.0
-**Last updated**: 2026-01-14
-**Maintainer**: Claude Code Review System
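
The path-traversal guard in section 6.1 of the deleted review raises `SecurityError`, which is not a Python built-in and is not defined anywhere in the diff, and it compares the resolved path against a `base` that is not itself resolved. The following is a minimal sketch of a self-contained variant, not the package's implementation. `PathTraversalError` is a hypothetical name introduced here, and `Path.is_relative_to` needs Python 3.9 or later, which matches the minimum version the review states.

```python
from pathlib import Path


class PathTraversalError(ValueError):
    """Hypothetical exception; stands in for the undefined SecurityError."""


def safe_resolve_path(base: Path, user_path: str) -> Path:
    """Resolve user_path under base and reject results that escape base."""
    # Resolve base first so symlinks and relative segments in it
    # do not cause false rejections in the containment check.
    base_resolved = Path(base).resolve()
    # An absolute user_path replaces base when joined, so it is
    # rejected by the same check unless it already lies under base.
    resolved = (base_resolved / user_path).resolve()
    if not resolved.is_relative_to(base_resolved):
        raise PathTraversalError(
            f"Path {user_path!r} escapes base directory {base_resolved}"
        )
    return resolved
```

Subclassing `ValueError` is a design choice for this sketch only; a package with its own exception hierarchy, such as the `InsPricingError` base proposed in section 3.3.1, would probably derive from that instead.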

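The `profile_section` sketch in section 5.1 of the deleted review references `logging.Logger` without importing `logging`, and it reports nothing when the profiled block raises, because the code after a bare `yield` is skipped. The following is a minimal standard-library variant under stated assumptions: `tracemalloc` is swapped in for `psutil`, so it measures Python-level allocations rather than process RSS, GPU statistics are omitted, and the name `profile_section_stdlib` is hypothetical.

```python
import logging
import time
import tracemalloc
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def profile_section_stdlib(
    name: str, logger: Optional[logging.Logger] = None
) -> Iterator[None]:
    """Report wall time and peak Python allocations for a block, even if it raises."""
    was_tracing = tracemalloc.is_tracing()
    if not was_tracing:
        tracemalloc.start()
    tracemalloc.reset_peak()  # requires Python 3.9+
    start_mem, _ = tracemalloc.get_traced_memory()
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike.
        elapsed = time.perf_counter() - start
        _, peak_mem = tracemalloc.get_traced_memory()
        if not was_tracing:
            tracemalloc.stop()
        peak_mb = (peak_mem - start_mem) / 1024 / 1024
        msg = (
            f"[Profile] {name}: {elapsed:.2f}s, "
            f"peak Python allocations: {peak_mb:+.1f}MB"
        )
        if logger is not None:
            logger.info(msg)
        else:
            print(msg)
```

It is used the same way as the original, for example `with profile_section_stdlib("Preprocessing", logger): ...`. Tracing allocations adds overhead, so timings taken this way are best read as relative rather than absolute.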