npm - @wentorai/research-plugins - Versions diffs - 1.0.0 → 1.2.0 - Mend

@wentorai/research-plugins 1.0.0 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (415)

package/skills/analysis/statistics/quantitative-methods-guide/SKILL.md ADDED

@@ -0,0 +1,193 @@
+---
+name: quantitative-methods-guide
+description: "Design and execute statistical analyses with regression modeling"
+metadata:
+  openclaw:
+    emoji: "📈"
+    category: "analysis"
+    subcategory: "statistics"
+    keywords: ["regression analysis", "quantitative methods", "research design", "statistical modeling", "OLS", "logistic regression"]
+    source: "https://github.com/AcademicSkills/quantitative-methods-guide"
+---
+# Quantitative Methods Guide
+A skill for designing and executing rigorous quantitative analyses in academic research. Covers the full pipeline from research question formulation through variable operationalization, model specification, estimation, diagnostics, and interpretation, with emphasis on regression modeling as the workhorse of empirical research.
+## Overview
+Quantitative methods form the foundation of empirical research across the social sciences, health sciences, economics, education, and many STEM fields. This skill provides a structured approach to the entire quantitative analysis workflow, ensuring that researchers make methodologically sound choices at each stage. It treats regression analysis as the central tool, covering ordinary least squares (OLS), logistic regression, Poisson regression, and multilevel models, while also addressing the broader issues of research design, measurement, and causal inference that determine whether regression results are meaningful.
+The skill is designed for graduate students and researchers who have basic statistics knowledge but need guidance on applying methods correctly in their own research contexts.
+## Research Design and Variable Specification
+### From Question to Model
+```
+Research Question: "Does mentoring frequency affect publication output among
+                    junior faculty, controlling for department size and funding?"
+Step 1: Identify variables
+  - Outcome (Y): publication_count (count data)
+  - Predictor (X1): mentoring_hours_per_month (continuous)
+  - Controls: department_size (continuous), total_funding (continuous)
+  - Potential moderator: career_stage (categorical: assistant/associate)
+Step 2: Choose model family
+  - Count outcome → Poisson or Negative Binomial regression
+  - Check for overdispersion before deciding
+Step 3: Specify the model
+  publications ~ mentoring_hours + department_size + log(funding) + career_stage
+  Optional: publications ~ mentoring_hours * career_stage + controls (interaction)
+```
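The overdispersion check in Step 2 can be sketched as below. This is a rough first look that is not part of the skill itself: it compares the variance of the raw counts to their mean, ignoring covariates. A ratio near 1 is consistent with a Poisson outcome, while a ratio well above 1 points toward negative binomial regression. The function name `dispersion_ratio` and the simulated data are hypothetical; a model-based check would instead divide the Pearson chi-square of a fitted Poisson model by its residual degrees of freedom.

```python
import numpy as np

def dispersion_ratio(counts) -> float:
    """Variance-to-mean ratio of a count variable (about 1 for Poisson data)."""
    counts = np.asarray(counts, dtype=float)
    return float(counts.var(ddof=1) / counts.mean())

# Simulated example data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
poisson_counts = rng.poisson(lam=3.0, size=5000)
overdispersed_counts = rng.negative_binomial(n=2, p=0.4, size=5000)

print(f"Poisson-like ratio:  {dispersion_ratio(poisson_counts):.2f}")
print(f"Overdispersed ratio: {dispersion_ratio(overdispersed_counts):.2f}")
```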
+### Variable Types and Measurement
+| Variable Type | Examples | Modeling Considerations |
+|--------------|----------|----------------------|
+| Continuous | Income, GPA, temperature | Check distribution, consider transformations |
+| Binary | Pass/fail, treatment/control | Logistic regression |
+| Count | Publications, citations, events | Poisson or negative binomial |
+| Ordinal | Likert scales, rankings | Ordinal logistic or treat as continuous if 5+ levels |
+| Nominal | Department, country, method | Dummy coding (k-1 indicators) |
+| Time-to-event | Months until graduation | Survival analysis |
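As an illustration of the "k-1 indicators" rule for nominal variables in the table above, here is a minimal sketch using `pandas.get_dummies`. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical data: a nominal variable with k = 3 levels
df = pd.DataFrame({
    "department": ["Biology", "Chemistry", "Physics", "Chemistry"],
    "publications": [4, 2, 7, 3],
})

# drop_first=True keeps k-1 indicators; the dropped level ("Biology",
# first in sorted order) becomes the reference category
encoded = pd.get_dummies(df, columns=["department"], drop_first=True)
print(encoded.columns.tolist())
```

Formula interfaces such as `C(department)` apply the same reference-category coding automatically, so explicit dummy columns are mainly needed when building a design matrix by hand.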
+## Regression Analysis
+### Ordinary Least Squares (OLS)
+```python
+import statsmodels.formula.api as smf
+import pandas as pd
+def run_ols_analysis(df: pd.DataFrame, formula: str) -> dict:
+    """
+    Fit an OLS regression model with full diagnostics.
+    Args:
+        df: DataFrame with all variables
+        formula: Patsy formula (e.g., 'y ~ x1 + x2 + C(group)')
+    """
+    model = smf.ols(formula=formula, data=df).fit(cov_type='HC3')  # robust SE
+    results = {
+        'coefficients': model.params.to_dict(),
+        'std_errors': model.bse.to_dict(),
+        'p_values': model.pvalues.to_dict(),
+        'conf_int': model.conf_int().to_dict(),
+        'r_squared': model.rsquared,
+        'adj_r_squared': model.rsquared_adj,
+        'f_statistic': model.fvalue,
+        'f_pvalue': model.f_pvalue,
+        'n_obs': int(model.nobs),
+        'aic': model.aic,
+        'bic': model.bic
+    }
+    return results
+# Example usage:
+# results = run_ols_analysis(df, 'gpa ~ study_hours + sleep_hours + C(major)')
+```
+### Logistic Regression
+```python
+import numpy as np
+def run_logistic_analysis(df: pd.DataFrame, formula: str) -> dict:
+    """
+    Fit a logistic regression for binary outcomes.
+    Reports odds ratios alongside coefficients.
+    """
+    model = smf.logit(formula=formula, data=df).fit(disp=False)
+    # Coefficients are log-odds; exponentiate them (and their CI bounds) to get odds ratios
+    results = {
+        'coefficients': model.params.to_dict(),
+        'odds_ratios': np.exp(model.params).to_dict(),
+        'p_values': model.pvalues.to_dict(),
+        'conf_int_OR': np.exp(model.conf_int()).to_dict(),
+        'pseudo_r_squared': model.prsquared,
+        'log_likelihood': model.llf,
+        'aic': model.aic,
+        'n_obs': int(model.nobs)
+    }
+    return results
+```
+## Model Diagnostics
+### OLS Assumption Checks
+Run these diagnostics after fitting any OLS model:
+1. **Linearity**: Plot residuals vs. fitted values; the points should show no systematic pattern.
+2. **Normality of residuals**: Q-Q plot and Shapiro-Wilk test on residuals.
+3. **Homoscedasticity**: Breusch-Pagan test (`statsmodels.stats.diagnostic.het_breuschpagan`).
+4. **No multicollinearity**: Variance Inflation Factor (VIF) for each predictor.
+5. **Independence**: Durbin-Watson statistic for autocorrelation (especially panel/time data).
+```python
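+from scipy import stats
+from statsmodels.stats.diagnostic import het_breuschpagan
+from statsmodels.stats.outliers_influence import variance_inflation_factor
+from statsmodels.stats.stattools import durbin_watson
+def check_ols_assumptions(model, X_matrix) -> dict:
+    """
+    Run standard OLS diagnostic tests.
+    Args:
+        model: Fitted statsmodels OLS results object
+        X_matrix: Design matrix as a DataFrame, including the intercept column
+            (e.g., pd.DataFrame(model.model.exog, columns=model.model.exog_names))
+    """
+    residuals = model.resid
+    # VIF for multicollinearity; the intercept column is skipped because its
+    # VIF is not meaningful and can trip the threshold on its own
+    vif = {col: variance_inflation_factor(X_matrix.values, i)
+           for i, col in enumerate(X_matrix.columns)
+           if col not in ('Intercept', 'const')}
+    multicollinearity_flag = any(v > 10 for v in vif.values())
+    # Breusch-Pagan for heteroscedasticity
+    bp_stat, bp_p, _, _ = het_breuschpagan(residuals, X_matrix)
+    # Shapiro-Wilk for normality of residuals
+    _, normality_p = stats.shapiro(residuals[:5000])  # cap at 5000
+    return {
+        'vif': vif,
+        'multicollinearity_problem': multicollinearity_flag,
+        'breusch_pagan_p': round(bp_p, 4),
+        'heteroscedasticity_problem': bool(bp_p < 0.05),
+        'residual_normality_p': round(normality_p, 4),
+        'durbin_watson': float(durbin_watson(residuals)),  # values near 2 suggest no autocorrelation
+        'recommendation': 'Use HC3 robust standard errors if heteroscedasticity detected'
+    }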
+```
+## Reporting Regression Results
+### Standard Regression Table Format
+| Variable | Coefficient | SE | t | p | 95% CI |
+|----------|-----------|------|------|-------|---------|
+| (Intercept) | 2.34 | 0.45 | 5.20 | <.001 | [1.45, 3.23] |
+| Mentoring hours | 0.18 | 0.06 | 3.00 | .003 | [0.06, 0.30] |
+| Dept. size | -0.02 | 0.01 | -2.00 | .048 | [-0.04, -0.00] |
+| Log(Funding) | 0.31 | 0.12 | 2.58 | .011 | [0.07, 0.55] |
+Report: N, R-squared, Adjusted R-squared, F-statistic with df and p-value, and the type of standard errors used (e.g., HC3 robust).
+### Interpretation Template
+"A one-unit increase in [predictor] is associated with a [coefficient] [unit] change in [outcome], holding all other variables constant (b = [coef], SE = [se], p = [p], 95% CI [[lower], [upper]])."
+For logistic regression: "A one-unit increase in [predictor] multiplies the odds of [outcome] by [OR], holding all other variables constant (OR = [or], 95% CI [[lower], [upper]], p = [p])." An OR above 1 indicates higher odds; an OR below 1 indicates lower odds.
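The OLS interpretation template above can be filled programmatically. The sketch below is one possible helper, not part of the skill; the function names `format_p` and `interpret_ols` and the `unit` argument are hypothetical. The p-value formatting drops the leading zero, matching the style of the example table.

```python
def format_p(p: float) -> str:
    """Format a p-value without a leading zero, e.g. .003 or <.001."""
    if p < 0.001:
        return "<.001"
    return f"{p:.3f}".lstrip("0")

def interpret_ols(predictor, outcome, unit, coef, se, p, ci_low, ci_high) -> str:
    """Fill the OLS interpretation template with numeric results."""
    p_text = f"p {format_p(p)}" if p < 0.001 else f"p = {format_p(p)}"
    return (
        f"A one-unit increase in {predictor} is associated with a "
        f"{coef:.2f} {unit} change in {outcome}, holding all other variables "
        f"constant (b = {coef:.2f}, SE = {se:.2f}, {p_text}, "
        f"95% CI [{ci_low:.2f}, {ci_high:.2f}])."
    )

# Values taken from the "Mentoring hours" row of the example table
sentence = interpret_ols("mentoring hours", "publication count", "publication",
                         0.18, 0.06, 0.003, 0.06, 0.30)
print(sentence)
```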
+## Common Pitfalls
+- **Omitted variable bias**: Failing to control for confounders that affect both X and Y.
+- **Overfitting**: Including too many predictors relative to sample size (rule of thumb: 10-20 observations per predictor).
+- **p-hacking**: Running many models and reporting only significant results. Pre-register your analysis plan.
+- **Misinterpreting R-squared**: High R-squared does not imply causation; low R-squared does not mean the model is useless.
+- **Ignoring assumptions**: Always run diagnostics. Violated assumptions can invalidate standard errors and p-values.
+## References
+- Wooldridge, J. M. (2019). *Introductory Econometrics* (7th ed.). Cengage.
+- Gelman, A. & Hill, J. (2007). *Data Analysis Using Regression and Multilevel/Hierarchical Models*. Cambridge University Press.
+- King, G. (1986). How Not to Lie with Statistics. *American Journal of Political Science*, 30(3), 666-687.

package/skills/analysis/statistics/senior-data-scientist-guide/SKILL.md ADDED

@@ -0,0 +1,223 @@
+---
+name: senior-data-scientist-guide
+description: "Statistical modeling, experimentation design, and causal inference"
+metadata:
+  openclaw:
+    emoji: "🎯"
+    category: "analysis"
+    subcategory: "statistics"
+    keywords: ["statistical modeling", "causal inference", "A/B testing", "experimentation", "feature engineering", "data science"]
+    source: "https://github.com/AcademicSkills/senior-data-scientist-guide"
+---
+# Senior Data Scientist Guide
+A skill embodying the analytical mindset and methodological rigor of a senior data scientist applied to academic research. Covers advanced statistical modeling, experimental design, causal inference, feature engineering, and the critical thinking required to move from data to defensible conclusions.
+## Overview
+Senior data scientists distinguish themselves not by knowing more algorithms but by asking better questions, designing cleaner experiments, and being honest about what the data can and cannot tell them. This skill translates that professional discipline into a research context, helping academics apply modern data science practices to their empirical work. It covers the strategic decisions that matter most: when to use simple models versus complex ones, how to establish causality rather than mere correlation, and how to communicate uncertainty honestly.
+The skill is particularly useful for researchers working with observational data who need causal inference techniques, those designing randomized experiments who need proper power calculations and analysis plans, and anyone building predictive models who needs to avoid common overfitting and leakage pitfalls.
+## Strategic Modeling Decisions
+### Model Selection Philosophy
+```
+Decision Framework:
+1. Start with the simplest model that could answer your question
+2. Add complexity only when diagnostics reveal inadequacy
+3. Prefer interpretable models unless prediction accuracy is the sole goal
+4. Always have a baseline (mean, majority class, last observation)
+Model Complexity Ladder:
+  Level 1: Descriptive statistics, cross-tabulations
+  Level 2: Linear/logistic regression
+  Level 3: Regularized regression (Lasso, Ridge, Elastic Net)
+  Level 4: Tree ensembles (Random Forest, Gradient Boosting)
+  Level 5: Deep learning (only with sufficient data and clear justification)
+```
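Point 4 of the decision framework ("always have a baseline") can be made concrete with two trivial predictors. This is a minimal sketch with hypothetical function names and toy data; any model on the complexity ladder should beat these numbers before its extra complexity is justified.

```python
import numpy as np

def majority_class_accuracy(y_train, y_test) -> float:
    """Accuracy of always predicting the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority))

def mean_baseline_rmse(y_train, y_test) -> float:
    """RMSE of always predicting the training mean."""
    prediction = np.mean(y_train)
    return float(np.sqrt(np.mean((np.asarray(y_test) - prediction) ** 2)))

# Hypothetical labels and outcomes, for illustration only
acc = majority_class_accuracy([0, 0, 0, 1], [0, 1, 0, 0])
rmse = mean_baseline_rmse([1.0, 2.0, 3.0], [2.0, 4.0])
print(acc, rmse)
```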
+### Feature Engineering Principles
+```python
+import pandas as pd
+import numpy as np
+def engineer_features(df: pd.DataFrame, config: dict) -> pd.DataFrame:
+    """
+    Apply systematic feature engineering based on domain knowledge.
+    config example:
+    {
+        'log_transform': ['income', 'citations'],
+        'interactions': [('experience', 'education')],
+        'polynomial': {'age': 2},
+        'time_features': 'date_column',
+        'lag_features': {'metric': [1, 7, 30]}
+    }
+    """
+    df = df.copy()
+    # Log transforms for right-skewed variables
+    for col in config.get('log_transform', []):
+        df[f'{col}_log'] = np.log1p(df[col])
+    # Interaction terms
+    for col_a, col_b in config.get('interactions', []):
+        df[f'{col_a}_x_{col_b}'] = df[col_a] * df[col_b]
+    # Polynomial features
+    for col, degree in config.get('polynomial', {}).items():
+        for d in range(2, degree + 1):
+            df[f'{col}_pow{d}'] = df[col] ** d
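+    # Time-based features
+    if 'time_features' in config:
+        time_col = config['time_features']
+        df[time_col] = pd.to_datetime(df[time_col])
+        df[f'{time_col}_month'] = df[time_col].dt.month
+        df[f'{time_col}_dayofweek'] = df[time_col].dt.dayofweek
+        df[f'{time_col}_quarter'] = df[time_col].dt.quarter
+    # Lag features (assumes rows are already sorted in time order)
+    for col, lags in config.get('lag_features', {}).items():
+        for lag in lags:
+            df[f'{col}_lag{lag}'] = df[col].shift(lag)
+    return df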
+```
+## Causal Inference Methods
+### Beyond Correlation
+| Method | When to Use | Key Assumption |
+|--------|-----------|---------------|
+| Randomized experiment | You can randomly assign treatment | Proper randomization, no attrition |
+| Difference-in-differences | Policy change affects one group | Parallel trends pre-treatment |
+| Regression discontinuity | Treatment assigned by cutoff | No manipulation near cutoff |
+| Instrumental variables | Endogeneity present | Valid instrument (relevance + exclusion) |
+| Propensity score matching | Observational data, many confounders | No unobserved confounders |
+| Synthetic control | Single treated unit, many controls | Good pre-treatment fit |
+### Propensity Score Matching
+```python
+from sklearn.linear_model import LogisticRegression
+from sklearn.neighbors import NearestNeighbors
+def propensity_score_match(df, treatment_col, covariates, caliper=0.05):
+    """
+    Match treated and control units based on propensity scores.
+    Uses 1:1 nearest-neighbor matching with replacement, so one control
+    unit can be matched to more than one treated unit.
+    """
+    df = df.copy()  # avoid adding the pscore column to the caller's DataFrame
+    # Estimate propensity scores
+    X = df[covariates].values
+    y = df[treatment_col].values
+    lr = LogisticRegression(max_iter=1000, random_state=42)
+    lr.fit(X, y)
+    df['pscore'] = lr.predict_proba(X)[:, 1]
+    # Match using nearest neighbor within caliper
+    treated = df[df[treatment_col] == 1]
+    control = df[df[treatment_col] == 0]
+    nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
+    nn.fit(control[['pscore']].values)
+    distances, indices = nn.kneighbors(treated[['pscore']].values)
+    # Apply caliper: discard pairs whose propensity scores differ too much
+    valid = distances.flatten() < caliper
+    matched_treated = treated[valid].index.tolist()
+    matched_control = control.iloc[indices.flatten()[valid]].index.tolist()
+    return {
+        'matched_treated': matched_treated,
+        'matched_control': matched_control,
+        'n_matched': int(valid.sum()),
+        'n_unmatched': int((~valid).sum()),
+        'balance_check': 'Run standardized mean differences on covariates'
+    }
+```
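The balance check recommended in the function's return value can be sketched as follows. This is an illustrative helper with a hypothetical name and toy data, using the common definition of the standardized mean difference (difference in means divided by the pooled standard deviation). An absolute SMD below roughly 0.1 is a frequently used rule of thumb for acceptable balance, though that threshold is a convention rather than something stated in the skill.

```python
import numpy as np
import pandas as pd

def standardized_mean_differences(df, treatment_col, covariates) -> pd.Series:
    """
    SMD per covariate: (mean_treated - mean_control) / pooled SD,
    where pooled SD = sqrt((var_treated + var_control) / 2).
    """
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]
    smd = {}
    for col in covariates:
        pooled_sd = np.sqrt((treated[col].var() + control[col].var()) / 2)
        smd[col] = (treated[col].mean() - control[col].mean()) / pooled_sd
    return pd.Series(smd, name="smd")

# Hypothetical matched sample, for illustration only
matched = pd.DataFrame({
    "treated": [1, 1, 1, 0, 0, 0],
    "age": [30, 40, 50, 32, 40, 48],
    "tenure": [2, 4, 6, 1, 2, 3],
})
balance = standardized_mean_differences(matched, "treated", ["age", "tenure"])
print(balance.round(3))
```

In this toy sample `age` is balanced while `tenure` is not, which would suggest tightening the caliper or revisiting the propensity model.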
+## Experimentation Design
+### A/B Testing for Research
+```python
+import numpy as np
+from statsmodels.stats.power import NormalIndPower
+def design_experiment(baseline_rate, mde, alpha=0.05, power=0.80):
+    """
+    Calculate required sample size for a two-proportion z-test.
+    Args:
+        baseline_rate: Current conversion/success rate
+        mde: Minimum detectable effect (absolute change)
+        alpha: Significance level
+        power: Statistical power
+    """
+    # Standardized effect size based on the baseline variance p(1 - p);
+    # this approximation treats both groups as having the baseline variance
+    effect_size = mde / np.sqrt(baseline_rate * (1 - baseline_rate))
+    analysis = NormalIndPower()
+    n = analysis.solve_power(
+        effect_size=effect_size, alpha=alpha, power=power, ratio=1.0
+    )
+    return {
+        'sample_size_per_group': int(np.ceil(n)),
+        'total_sample_size': int(np.ceil(n)) * 2,
+        'baseline_rate': baseline_rate,
+        'minimum_detectable_effect': mde,
+        'alpha': alpha,
+        'power': power
+    }
+```
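As a cross-check on the power calculation above, the same quantity can be approximated in closed form with only the standard library. This sketch assumes the usual normal-approximation formula with the baseline variance applied to both groups; the function name and the example inputs are hypothetical, and its result should land close to, though not necessarily exactly on, the statsmodels-based answer.

```python
import math
from statistics import NormalDist

def two_proportion_sample_size(baseline_rate, mde, alpha=0.05, power=0.80) -> int:
    """
    Per-group n from the normal approximation
        n = 2 * (z_{1-alpha/2} + z_{power})^2 * p * (1 - p) / mde^2,
    using the baseline variance p(1 - p) for both groups.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate)
    n = 2 * (z_alpha + z_power) ** 2 * variance / mde ** 2
    return math.ceil(n)

# Hypothetical inputs: 50% baseline rate, 10-point absolute lift
n_per_group = two_proportion_sample_size(0.50, 0.10)
print(n_per_group)
```

Because the required n scales with 1 / mde², halving the minimum detectable effect roughly quadruples the sample size.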
+### Pre-Analysis Plan Template
+Before running any experiment, document:
+1. **Primary hypothesis**: One clearly stated prediction.
+2. **Primary outcome metric**: One pre-specified metric for the main test.
+3. **Sample size justification**: Power calculation with assumptions.
+4. **Randomization procedure**: How units are assigned to conditions.
+5. **Analysis method**: Exact statistical test and model specification.
+6. **Multiple comparisons**: How secondary analyses will be corrected.
+7. **Stopping rules**: Conditions for early termination (if applicable).
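For item 6 of the plan (multiple comparisons), one commonly used correction is the Holm step-down procedure. The skill does not name a specific method, so treat this as one possible choice; the function name and the p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Hypothetical p-values from three secondary outcomes
adjusted = holm_adjust([0.01, 0.04, 0.03])
print([round(p, 3) for p in adjusted])
```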
+## Model Validation
+### Cross-Validation Strategy
+| Data Type | Recommended CV | Rationale |
+|-----------|---------------|-----------|
+| i.i.d. data | Stratified K-fold (K=5 or 10) | Preserves class balance |
+| Time series | Time-series split (expanding window) | Prevents look-ahead bias |
+| Grouped data | Group K-fold | Prevents data leakage across groups |
+| Small dataset (n<200) | Leave-one-out or repeated K-fold | Maximizes training data |
+| Spatial data | Spatial blocking | Prevents spatial autocorrelation leakage |
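The expanding-window scheme recommended for time series in the table above can be sketched in plain Python. The generator below is a hypothetical illustration; scikit-learn's `TimeSeriesSplit` provides a similar scheme and would normally be preferred in practice.

```python
def expanding_window_splits(n_samples: int, n_splits: int, test_size: int):
    """
    Yield (train_indices, test_indices) pairs where each training window
    starts at 0 and ends right before its test window.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

# 10 time-ordered observations, 3 folds of 2 test points each
splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx}")
```

Every training index precedes every test index in each fold, which is what prevents look-ahead bias.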
+### Leakage Detection Checklist
+- [ ] No future information used as features (check timestamps)
+- [ ] No target-derived features (e.g., group means computed on full data)
+- [ ] Train/test split performed before any preprocessing
+- [ ] Cross-validation folds respect group structure
+- [ ] Feature selection performed inside CV loop, not before
+- [ ] If accuracy seems too good to be true, it probably is
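The third checklist item (split before preprocessing) is illustrated below with standardization. This is a minimal sketch on simulated data: scaling parameters are estimated on the training rows only and then reused for the test rows, so no test-set information reaches the fitted preprocessing step.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))  # hypothetical features

# Split first, then estimate preprocessing parameters on the training rows only
X_train, X_test = X[:80], X[80:]
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0, ddof=0)

X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std  # reuse training statistics

print(X_train_scaled.mean(axis=0).round(6), X_test_scaled.mean(axis=0).round(3))
```

The scaled test set will generally not have exactly zero mean or unit variance, which is expected and not a sign of a mistake.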
+## Communication and Reporting
+### The Senior DS Reporting Standard
+- Lead with the business/research question, not the algorithm.
+- Report confidence intervals, not just point estimates.
+- Show what you tried that did not work (negative results matter).
+- Quantify uncertainty: "The model predicts X with a 95% interval of [a, b]."
+- Be explicit about limitations and assumptions.
+- Use visualizations that a domain expert (not a statistician) can interpret.
+## References
+- Angrist, J. D. & Pischke, J.-S. (2009). *Mostly Harmless Econometrics*. Princeton University Press.
+- Cunningham, S. (2021). *Causal Inference: The Mixtape*. Yale University Press.
+- Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer.

package/skills/analysis/wrangling/claude-data-analysis-guide/SKILL.md ADDED

@@ -0,0 +1,100 @@
+---
+name: claude-data-analysis-guide
+description: "Claude Code-based conversational data analysis agent"
+metadata:
+  openclaw:
+    emoji: "🔬"
+    category: "analysis"
+    subcategory: "wrangling"
+    keywords: ["Claude Code", "data analysis", "conversational", "pandas", "visualization", "interactive"]
+    source: "https://github.com/liangdabiao/claude-data-analysis"
+---
+# Claude Data Analysis Guide
+## Overview
+A Claude Code-based data analysis agent that provides conversational data exploration and analysis. Upload datasets and ask questions in natural language; the agent writes and executes Python code for data cleaning, statistical analysis, visualization, and reporting. It relies on Claude Code's ability to read files, run code, and iterate on results.
+## Setup
+```markdown
+### CLAUDE.md Configuration
+# Data Analysis Project
+## Instructions
+- Analyze data files in the data/ directory
+- Use pandas, numpy, scipy, matplotlib, seaborn
+- Always show data shape and dtypes first
+- Include statistical tests where appropriate
+- Generate publication-quality figures (300 DPI)
+- Save outputs to output/ directory
+## Conventions
+- Use seaborn for statistical plots
+- Report confidence intervals, not just p-values
+- Handle missing data explicitly (report, then impute)
+- Set random_state=42 for reproducibility
+```
+## Workflow
+```markdown
+### Interactive Analysis Loop
+1. "Load the experiment data from data/results.csv"
+   → Agent reads file, shows shape, dtypes, head()
+2. "How many missing values are there?"
+   → Agent runs df.isnull().sum(), reports per column
+3. "Show the distribution of response time by condition"
+   → Agent creates violin plots, reports summary stats
+4. "Is there a significant difference between groups?"
+   → Agent runs appropriate test (t-test, ANOVA, etc.)
+5. "Build a regression model predicting response time"
+   → Agent fits model, reports coefficients, R², diagnostics
+6. "Create a summary report with all findings"
+   → Agent generates markdown report with embedded figures
+```
+## Common Analysis Patterns
+```python
+# Data profiling
+import pandas as pd
+df = pd.read_csv("data/experiment.csv")
+print(f"Shape: {df.shape}")
+print(f"\nDtypes:\n{df.dtypes}")
+print(f"\nMissing:\n{df.isnull().sum()}")
+print(f"\nDescribe:\n{df.describe()}")
+# Statistical comparison
+from scipy import stats
+group_a = df[df["condition"] == "A"]["score"].dropna()
+group_b = df[df["condition"] == "B"]["score"].dropna()
+# Welch's t-test (equal_var=False) does not assume equal group variances
+t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
+print(f"t={t_stat:.3f}, p={p_value:.4f}")
+# Visualization
+import os
+import seaborn as sns
+import matplotlib.pyplot as plt
+os.makedirs("output", exist_ok=True)  # savefig fails if the directory is missing
+fig, ax = plt.subplots(figsize=(8, 5))
+sns.violinplot(data=df, x="condition", y="score", ax=ax)
+fig.savefig("output/comparison.png", dpi=300, bbox_inches="tight")
+```
+## Use Cases
+1. **Experiment analysis**: Interactive analysis of lab data
+2. **EDA**: Rapid exploration of unfamiliar datasets
+3. **Statistical testing**: Guided hypothesis testing
+4. **Report generation**: Analysis reports with figures
+5. **Learning**: Interactive data science exploration
+## References
+- [claude-data-analysis GitHub](https://github.com/liangdabiao/claude-data-analysis)
+- [Claude Code](https://docs.anthropic.com/en/docs/claude-code)

package/skills/analysis/wrangling/csv-data-analyzer/SKILL.md ADDED

@@ -0,0 +1,170 @@
+---
+name: csv-data-analyzer
+description: "Load, explore, clean, and analyze CSV data with statistical summaries"
+metadata:
+  openclaw:
+    emoji: "📊"
+    category: "analysis"
+    subcategory: "wrangling"
+    keywords: ["CSV analysis", "data exploration", "data cleaning", "statistical summary", "tabular data", "pandas"]
+    source: "https://github.com/AcademicSkills/csv-data-analyzer"
+---
+# CSV Data Analyzer
+A comprehensive skill for loading, exploring, cleaning, and analyzing CSV datasets within research workflows. Designed for researchers who need to quickly understand the structure, quality, and statistical properties of tabular data before conducting deeper analysis.
+## Overview
+Research datasets commonly arrive as CSV files from instrument exports, survey platforms, government repositories, and collaborator handoffs. This skill provides a structured approach to the entire CSV analysis pipeline: ingestion, profiling, quality assessment, cleaning, transformation, and summary statistics. It emphasizes reproducibility by generating audit logs of every transformation applied to the raw data.
+The skill supports datasets of varying complexity, from single-table survey results to multi-file longitudinal study exports with hundreds of columns. It works with standard Python data science libraries (pandas, numpy, scipy) and produces outputs suitable for inclusion in methods sections and supplementary materials.
+## Data Loading and Initial Profiling
+### Loading Strategies
+```python
+import pandas as pd
+import numpy as np
+def load_and_profile_csv(filepath: str, encoding: str = 'utf-8') -> tuple[pd.DataFrame, dict]:
+    """
+    Load a CSV file and generate an initial data profile.
+    Retries with fallback encodings if the requested one fails; the
+    delimiter is left at the pandas default (comma).
+    Returns:
+        (df, profile): the loaded DataFrame and a dict of summary facts
+    """
+    # Strict encodings first; latin-1 goes last because it decodes any byte
+    # sequence and would otherwise prevent the later fallbacks from running
+    encodings = [encoding, 'utf-8-sig', 'cp1252', 'latin-1']
+    df = None
+    for enc in encodings:
+        try:
+            df = pd.read_csv(filepath, encoding=enc, low_memory=False)
+            break
+        except (UnicodeDecodeError, pd.errors.ParserError):
+            continue
+    if df is None:
+        raise ValueError(f"Could not parse {filepath} with any supported encoding")
+    profile = {
+        'rows': len(df),
+        'columns': len(df.columns),
+        'memory_mb': df.memory_usage(deep=True).sum() / 1e6,
+        'dtypes': df.dtypes.astype(str).value_counts().to_dict(),
+        'missing_pct': (df.isnull().sum() / len(df) * 100).to_dict(),
+        'duplicates': int(df.duplicated().sum()),
+        'column_names': df.columns.tolist()
+    }
+    return df, profile
+```
+### Column Type Inference
+```python
+def infer_semantic_types(df: pd.DataFrame) -> dict:
+    """
+    Infer semantic column types beyond pandas dtypes.
+    Detects dates, identifiers, categorical, continuous, and text columns.
+    """
+    semantic_types = {}
+    for col in df.columns:
+        nunique = df[col].nunique()
+        ratio = nunique / len(df) if len(df) > 0 else 0
+        is_object = df[col].dtype == 'object'
+        # Test for dates before identifiers: timestamps stored as strings are
+        # near-unique and would otherwise be labeled 'identifier'
+        if pd.api.types.is_datetime64_any_dtype(df[col]) or (
+            is_object and pd.to_datetime(df[col], errors='coerce').notna().mean() > 0.8
+        ):
+            semantic_types[col] = 'datetime'
+        elif ratio > 0.95 and is_object:
+            semantic_types[col] = 'identifier'
+        elif nunique <= 20 and df[col].dtype in ['object', 'int64']:
+            semantic_types[col] = 'categorical'
+        elif df[col].dtype in ['float64', 'int64']:
+            semantic_types[col] = 'continuous'
+        else:
+            semantic_types[col] = 'text'
+    return semantic_types
+```
+## Data Cleaning Pipeline
+### Systematic Cleaning Steps
+1. **Remove fully empty rows and columns**: Drop rows/columns where all values are NaN.
+2. **Standardize column names**: Convert to snake_case, remove special characters.
+3. **Handle missing data**: Assess missingness patterns (MCAR/MAR/MNAR) before choosing imputation strategy.
+4. **Detect and handle duplicates**: Identify exact and near-duplicates using fuzzy matching.
+5. **Validate value ranges**: Flag values outside expected domain ranges.
+6. **Standardize categorical labels**: Merge inconsistent spellings (e.g., "Male", "male", "M").
+```python
+import re
+def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
+    """Standardize column names to snake_case (returns a copy)."""
+    df = df.copy()  # leave the caller's DataFrame untouched
+    df.columns = [
+        re.sub(r'[^a-z0-9]+', '_', str(col).lower().strip()).strip('_')
+        for col in df.columns
+    ]
+    return df
+def assess_missingness(df: pd.DataFrame) -> pd.DataFrame:
+    """Generate a missingness report for each column."""
+    report = pd.DataFrame({
+        'missing_count': df.isnull().sum(),
+        'missing_pct': (df.isnull().sum() / len(df) * 100).round(2),
+        'dtype': df.dtypes
+    })
+    report['action'] = report['missing_pct'].apply(
+        lambda x: 'drop' if x > 60 else ('impute' if x > 0 else 'ok')
+    )
+    return report.sort_values('missing_pct', ascending=False)
+```
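Steps 5 and 6 of the cleaning list (validating ranges and standardizing categorical labels) are not covered by the functions above. A minimal sketch follows; the column names, the label mapping, and the 0 to 120 age range are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical survey data with inconsistent labels and an impossible age
df = pd.DataFrame({
    "sex": ["Male", "male", "M", "Female", " F "],
    "age": [34, 29, 212, 41, 38],
})

# Step 6: map normalized spellings onto one canonical label per category
label_map = {"male": "male", "m": "male", "female": "female", "f": "female"}
df["sex_clean"] = df["sex"].str.strip().str.lower().map(label_map)

# Step 5: flag (rather than silently drop) values outside the expected range
df["age_out_of_range"] = ~df["age"].between(0, 120)

print(df)
```

Labels missing from the mapping become NaN, which makes unexpected spellings easy to spot in a follow-up missingness report.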
+## Statistical Summary Generation
+### Descriptive Statistics
+```python
+def generate_statistical_summary(df: pd.DataFrame) -> dict:
+    """
+    Generate comprehensive descriptive statistics for all columns.
+    Includes measures of central tendency, dispersion, and distribution shape.
+    """
+    numeric_cols = df.select_dtypes(include=[np.number])
+    summary = {
+        'numeric': numeric_cols.describe().T.assign(
+            skewness=numeric_cols.skew(),
+            kurtosis=numeric_cols.kurtosis(),
+            iqr=numeric_cols.quantile(0.75) - numeric_cols.quantile(0.25),
+            cv=numeric_cols.std() / numeric_cols.mean()  # coefficient of variation
+        ),
+        'categorical': {
+            col: df[col].value_counts().head(10).to_dict()
+            for col in df.select_dtypes(include=['object']).columns
+        },
+        'correlations': numeric_cols.corr().round(3)
+    }
+    return summary
+```
+### Normality and Distribution Testing
+| Test | Use Case | Function |
+|------|----------|----------|
+| Shapiro-Wilk | Normality test (n < 5000) | `scipy.stats.shapiro()` |
+| D'Agostino-Pearson | Normality test (n >= 5000) | `scipy.stats.normaltest()` |
+| Kolmogorov-Smirnov | Compare to any distribution | `scipy.stats.kstest()` |
+| Levene's test | Homogeneity of variance | `scipy.stats.levene()` |
+## Best Practices for Reproducibility
+- Always save the raw CSV separately; never overwrite original files.
+- Log every cleaning step with timestamps in a transformation audit trail.
+- Export cleaned datasets with a version suffix (e.g., `data_v2_cleaned.csv`).
+- Include the cleaning script or notebook alongside the published dataset.
+- Report the number of rows removed at each step in your methods section.
+- Use `random_state` parameters consistently for any stochastic operations.
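The audit trail described in the list above can be kept with a small wrapper around each cleaning step. This is one possible design, not the skill's own implementation; `audit_log`, `logged_step`, and the toy data are hypothetical.

```python
from datetime import datetime, timezone
import pandas as pd

audit_log = []  # one entry per cleaning step

def logged_step(df: pd.DataFrame, name: str, func) -> pd.DataFrame:
    """Apply func to df and record row counts plus a UTC timestamp."""
    rows_before = len(df)
    result = func(df)
    audit_log.append({
        "step": name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "rows_before": rows_before,
        "rows_after": len(result),
        "rows_removed": rows_before - len(result),
    })
    return result

# Hypothetical raw data, for illustration only
raw = pd.DataFrame({"id": [1, 1, 2, 3], "score": [0.5, 0.5, None, 0.9]})
cleaned = logged_step(raw, "drop_exact_duplicates", lambda d: d.drop_duplicates())
cleaned = logged_step(cleaned, "drop_missing_score", lambda d: d.dropna(subset=["score"]))

print(pd.DataFrame(audit_log)[["step", "rows_before", "rows_after", "rows_removed"]])
```

The per-step `rows_removed` values are what a methods section would report, and the raw DataFrame is never modified.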
+## References
+- McKinney, W. (2022). *Python for Data Analysis* (3rd ed.). O'Reilly Media.
+- Wickham, H. (2014). Tidy Data. *Journal of Statistical Software*, 59(10).
+- Van den Broeck, J., et al. (2005). Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities. *PLoS Medicine*, 2(10).