npm - @wentorai/research-plugins - Versions diffs - 1.0.0 → 1.2.0 - Mend

@wentorai/research-plugins 1.0.0 → 1.2.0


Files changed (415)

package/skills/analysis/wrangling/data-cleaning-pipeline/SKILL.md ADDED

@@ -0,0 +1,266 @@
+---
+name: data-cleaning-pipeline
+description: "Systematic data cleaning workflows for research datasets"
+metadata:
+  openclaw:
+    emoji: "broom"
+    category: "analysis"
+    subcategory: "wrangling"
+    keywords: ["data cleaning", "data quality", "missing values", "outlier detection", "data validation", "preprocessing"]
+    source: "wentor-research-plugins"
+---
+# Data Cleaning Pipeline
+A skill for building systematic, reproducible data cleaning pipelines for research datasets. Covers common data quality issues, step-by-step cleaning workflows, handling missing values, detecting and treating outliers, validating data integrity, and documenting cleaning decisions for reproducibility.
+## The Data Cleaning Workflow
+### Pipeline Overview
+Data cleaning should follow a consistent, documented order. Each step builds on the previous one, and the entire pipeline should be scripted for reproducibility.
+```
+Data Cleaning Pipeline (recommended order):
+1. Initial Assessment
+   - Load data, check dimensions, inspect dtypes
+   - Generate summary statistics and missing value report
+   - Identify structural issues (merged cells, inconsistent delimiters)
+2. Structural Fixes
+   - Standardize column names (snake_case, no spaces)
+   - Fix data types (strings to numbers, dates, categories)
+   - Split or merge columns as needed
+   - Remove completely empty rows/columns
+3. Deduplication
+   - Identify exact duplicates
+   - Identify near-duplicates (fuzzy matching)
+   - Decide keep-first, keep-last, or merge strategy
+4. Missing Value Treatment
+   - Classify missingness mechanism (MCAR, MAR, MNAR)
+   - Apply appropriate imputation or exclusion strategy
+   - Document and justify missing data decisions
+5. Outlier Detection and Treatment
+   - Statistical methods (IQR, z-score, Mahalanobis)
+   - Domain-based validation (impossible values)
+   - Decide: correct, cap, remove, or keep with flag
+6. Consistency Checks
+   - Cross-field validation (age vs birth date)
+   - Range validation (0-100 for percentages)
+   - Referential integrity (foreign keys exist)
+7. Documentation and Export
+   - Log all changes with before/after counts
+   - Export cleaned dataset with version number
+   - Save cleaning script for reproducibility
+```
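+Steps 2 and 3 of the overview can be sketched in a few lines of pandas. This is a minimal illustration, not part of the package: the helper names `to_snake_case` and `apply_structural_fixes` are hypothetical, and only exact duplicates are handled (near-duplicate matching needs a separate, domain-specific step).
+```python
+import re
+import pandas as pd
+def to_snake_case(name):
+    """Convert a column label such as 'Body Weight (kg)' to 'body_weight_kg'."""
+    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", str(name))  # split camelCase
+    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)  # runs of other characters -> _
+    return name.strip("_").lower()
+def apply_structural_fixes(df):
+    """Rename columns, drop fully empty rows/columns, drop exact duplicate rows."""
+    out = df.copy()  # never modify the raw data in place
+    out.columns = [to_snake_case(c) for c in out.columns]
+    out = out.dropna(axis=0, how="all").dropna(axis=1, how="all")
+    out = out.drop_duplicates(keep="first").reset_index(drop=True)
+    return out
+```
+Two different labels can collapse to the same snake_case name (for example `Age` and `age`), so it is worth checking `out.columns.is_unique` after renaming.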
+## Initial Data Assessment
+### Automated Quality Report
+```python
+import pandas as pd
+import numpy as np
+def generate_quality_report(df):
+    """
+    Generate a comprehensive data quality report.
+    Run this BEFORE any cleaning to establish a baseline.
+    """
+    report = {
+        "dimensions": f"{df.shape[0]} rows x {df.shape[1]} columns",
+        "memory_usage": f"{df.memory_usage(deep=True).sum() / 1e6:.1f} MB",
+        "duplicate_rows": df.duplicated().sum(),
+    }
+    col_report = []
+    for col in df.columns:
+        info = {
+            "column": col,
+            "dtype": str(df[col].dtype),
+            "missing_count": df[col].isna().sum(),
+            "missing_pct": f"{df[col].isna().mean() * 100:.1f}%",
+            "unique_values": df[col].nunique(),
+            "sample_values": str(df[col].dropna().head(3).tolist()),
+        }
+        if pd.api.types.is_numeric_dtype(df[col]):
+            info["min"] = df[col].min()
+            info["max"] = df[col].max()
+            info["mean"] = df[col].mean()
+            info["std"] = df[col].std()
+        col_report.append(info)
+    report["columns"] = col_report
+    return report
+```
+## Missing Value Treatment
+### Classifying Missingness
+```
+Missing data mechanisms (Rubin's classification):
+MCAR (Missing Completely At Random):
+  - Missingness is unrelated to any variable
+  - Example: Lab samples randomly lost during transport
+  - Test: Little's MCAR test, compare distributions
+  - Safe to: Listwise delete if < 5% missing
+MAR (Missing At Random):
+  - Missingness depends on observed variables but not the missing value
+  - Example: Younger participants skip income questions more often
+  - Test: Compare missingness patterns across groups
+  - Best approach: Multiple imputation, regression imputation
+MNAR (Missing Not At Random):
+  - Missingness depends on the unobserved value itself
+  - Example: High-income people refuse to report income
+  - Cannot be tested directly from the data
+  - Requires: Sensitivity analysis, selection models, domain expertise
+```
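+The "compare missingness patterns across groups" check can be sketched as a chi-square test of a missing/observed indicator against a grouping column. This is an illustrative helper (`missingness_by_group` is a hypothetical name) that assumes SciPy is available; a small p-value argues against MCAR but cannot separate MAR from MNAR.
+```python
+import pandas as pd
+from scipy.stats import chi2_contingency
+def missingness_by_group(df, target_col, group_col):
+    """Return per-group missing rates of target_col and a chi-square p-value."""
+    is_missing = df[target_col].isna().rename("missing")
+    # Contingency table: one row per group, columns = observed / missing
+    table = pd.crosstab(df[group_col], is_missing)
+    chi2, p_value, dof, _ = chi2_contingency(table)
+    rates = is_missing.groupby(df[group_col]).mean()
+    return rates, p_value
+```
+With small cell counts the chi-square approximation is unreliable, so the p-value should be read alongside the raw rates rather than on its own.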
+### Imputation Strategies
+```python
+import numpy as np
+from sklearn.impute import SimpleImputer, KNNImputer
+def impute_missing_values(df, numeric_strategy="median",
+                          categorical_strategy="mode"):
+    """
+    Apply appropriate imputation strategies by column type.
+    Returns an imputed copy; the input DataFrame is left unchanged.
+    For research data, prefer:
+    - Median for skewed numeric data
+    - Mean for normally distributed numeric data
+    - Mode for categorical data
+    - KNN for multivariate patterns
+    - Multiple imputation for inference (e.g. statsmodels MICE in Python
+      or the mice package in R)
+    """
+    df = df.copy()
+    numeric_cols = df.select_dtypes(include=[np.number]).columns
+    categorical_cols = df.select_dtypes(include=["object", "category"]).columns
+    # Numeric imputation: "knn" uses KNNImputer; any other value
+    # ("mean", "median", "most_frequent") is passed to SimpleImputer
+    if len(numeric_cols) > 0:
+        if numeric_strategy == "knn":
+            imputer = KNNImputer(n_neighbors=5)
+        else:
+            imputer = SimpleImputer(strategy=numeric_strategy)
+        df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
+    # Categorical imputation: scikit-learn calls the mode "most_frequent"
+    if len(categorical_cols) > 0:
+        strategy = "most_frequent" if categorical_strategy == "mode" else categorical_strategy
+        imputer = SimpleImputer(strategy=strategy)
+        df[categorical_cols] = imputer.fit_transform(df[categorical_cols])
+    return df
+```
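+The docstring above recommends multiple imputation for inference. One way to sketch it in Python, assuming scikit-learn's experimental `IterativeImputer` is acceptable for the analysis, is to draw several completed datasets with different seeds. The helper name `multiply_impute` is hypothetical, and pooling the per-dataset estimates (Rubin's rules) is not shown.
+```python
+import numpy as np
+import pandas as pd
+from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the next import
+from sklearn.impute import IterativeImputer
+def multiply_impute(df, n_imputations=5, base_seed=0):
+    """Return a list of completed copies of the numeric columns, one per seed."""
+    numeric = df.select_dtypes(include=[np.number])
+    completed = []
+    for i in range(n_imputations):
+        # sample_posterior=True draws imputed values instead of using point estimates
+        imputer = IterativeImputer(sample_posterior=True, max_iter=10,
+                                   random_state=base_seed + i)
+        values = imputer.fit_transform(numeric)
+        completed.append(pd.DataFrame(values, columns=numeric.columns,
+                                      index=numeric.index))
+    return completed
+```
+Fixing `base_seed` keeps the stochastic draws reproducible. Columns that are entirely missing are dropped by the imputer and would need to be removed beforehand.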
+## Outlier Detection
+### Statistical Methods
+```python
+def detect_outliers_iqr(series, multiplier=1.5):
+    """
+    Detect outliers using the IQR method.
+    Standard multiplier is 1.5 (outlier) or 3.0 (extreme outlier).
+    """
+    q1 = series.quantile(0.25)
+    q3 = series.quantile(0.75)
+    iqr = q3 - q1
+    lower = q1 - multiplier * iqr
+    upper = q3 + multiplier * iqr
+    outliers = (series < lower) | (series > upper)
+    return outliers, lower, upper
+def detect_outliers_zscore(series, threshold=3.0):
+    """
+    Detect outliers using the z-score method.
+    About 99.7% of a normal distribution lies within 3 standard deviations.
+    Use modified z-score (MAD-based) for skewed distributions.
+    Returns a boolean Series aligned with the input index; missing values
+    are never flagged.
+    """
+    # Population standard deviation (ddof=0), matching scipy.stats.zscore
+    z_scores = (series - series.mean()) / series.std(ddof=0)
+    outliers = z_scores.abs() > threshold
+    return outliers
+```
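+The z-score docstring points to the modified z-score for skewed data. A minimal sketch follows, using the common 0.6745 scaling constant and the 3.5 cutoff suggested by Iglewicz and Hoaglin; the function name is hypothetical and the cutoff is a convention, not a rule.
+```python
+import pandas as pd
+def detect_outliers_modified_zscore(series, threshold=3.5):
+    """Flag values whose MAD-based modified z-score exceeds the threshold."""
+    median = series.median()
+    mad = (series - median).abs().median()  # median absolute deviation
+    if mad == 0:
+        # More than half the values equal the median; the score is undefined
+        return pd.Series(False, index=series.index)
+    modified_z = 0.6745 * (series - median) / mad
+    return modified_z.abs() > threshold
+```
+Because the median and MAD are barely affected by a few extreme values, this variant is less prone to masking than the mean-based z-score.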
+### Domain-Based Validation
+```
+Common domain validations:
+Age: 0-120 (flag > 100)
+Height (cm): 50-250
+Weight (kg): 1-300
+Blood pressure systolic: 60-250
+Blood pressure diastolic: 30-150
+Temperature (C): 30-45 for body temperature
+Likert scale (1-5): only integer values 1-5
+Percentage: 0-100
+Latitude: -90 to 90
+Longitude: -180 to 180
+Year of birth: 1900-current_year
+Email: matches standard regex pattern
+```
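+Range rules like these can be kept in a small table and applied uniformly. The sketch below is illustrative only: the column names in `DOMAIN_RULES` are hypothetical and would need to match the dataset, and values are flagged rather than removed so the decision stays with the analyst.
+```python
+import pandas as pd
+# Hypothetical column names mapped to inclusive (low, high) bounds
+DOMAIN_RULES = {
+    "age": (0, 120),
+    "height_cm": (50, 250),
+    "weight_kg": (1, 300),
+    "percentage": (0, 100),
+    "latitude": (-90, 90),
+    "longitude": (-180, 180),
+}
+def flag_out_of_range(df, rules=DOMAIN_RULES):
+    """Return a boolean DataFrame marking non-missing values outside their range."""
+    cols = [c for c in rules if c in df.columns]
+    flags = pd.DataFrame(False, index=df.index, columns=cols)
+    for col in cols:
+        low, high = rules[col]
+        values = pd.to_numeric(df[col], errors="coerce")
+        # Missing or unparseable values are left for the missing-data step
+        flags[col] = values.notna() & ~values.between(low, high)
+    return flags
+```
+Calling `flag_out_of_range(df).sum()` gives a per-column count of violations that can go straight into the cleaning log.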
+## Reproducibility and Documentation
+### Cleaning Log
+```python
+class CleaningLog:
+    """
+    Log all cleaning operations for reproducibility.
+    Every step should be documented with before/after counts.
+    """
+    def __init__(self):
+        self.entries = []
+        self.version = 0
+    def log_step(self, step_name, description,
+                 rows_before, rows_after, cols_affected):
+        self.version += 1
+        self.entries.append({
+            "version": self.version,
+            "step": step_name,
+            "description": description,
+            "rows_before": rows_before,
+            "rows_after": rows_after,
+            "rows_removed": rows_before - rows_after,
+            "columns_affected": cols_affected,
+        })
+    def save_report(self, path):
+        report_df = pd.DataFrame(self.entries)
+        report_df.to_csv(path, index=False)
+```
+### Best Practices for Research Data
+```
+Reproducibility rules:
+  1. Never modify the raw data file -- always save cleaned versions
+  2. Use version numbers (data_v1_raw, data_v2_cleaned, data_v3_final)
+  3. Script every step -- no manual edits in Excel
+  4. Document every decision (why delete, why impute, why cap)
+  5. Include the cleaning script in supplementary materials
+  6. Record software versions (pandas, numpy, R packages)
+  7. Set random seeds for any stochastic imputation
+  8. Save intermediate datasets at major checkpoints
+```
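+Rules 6 and 7 can be partly automated by writing a small environment record next to each cleaned dataset. This is a sketch using only the standard library; `environment_snapshot` is a hypothetical helper and the package list is an assumption to adapt.
+```python
+import json
+import platform
+from importlib import metadata
+def environment_snapshot(packages=("pandas", "numpy", "scikit-learn"), seed=42):
+    """Collect the Python version, package versions, and the seed in use."""
+    versions = {}
+    for name in packages:
+        try:
+            versions[name] = metadata.version(name)
+        except metadata.PackageNotFoundError:
+            versions[name] = "not installed"
+    return {
+        "python": platform.python_version(),
+        "packages": versions,
+        "random_seed": seed,
+    }
+def save_snapshot(path, **kwargs):
+    """Write the snapshot as JSON so it can be archived with the data."""
+    with open(path, "w", encoding="utf-8") as f:
+        json.dump(environment_snapshot(**kwargs), f, indent=2)
+```
+The recorded seed is only meaningful if the same value is actually passed to every stochastic step, such as imputation or sampling.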
+A well-documented data cleaning pipeline not only improves the quality of research findings but also strengthens the credibility of the work during peer review. Reviewers increasingly expect transparent data handling practices, and journals like PLOS ONE and Nature require data availability statements that implicitly demand reproducible preprocessing.

package/skills/analysis/wrangling/data-cog-guide/SKILL.md ADDED

@@ -0,0 +1,178 @@
+---
+name: data-cog-guide
+description: "Upload messy CSVs with minimal prompting for deep automated analysis"
+metadata:
+  openclaw:
+    emoji: "🧠"
+    category: "analysis"
+    subcategory: "wrangling"
+    keywords: ["automated analysis", "data wrangling", "CSV upload", "data profiling", "smart analysis", "minimal prompting"]
+    source: "https://github.com/AcademicSkills/data-cog-guide"
+---
+# Data Cog Guide
+An intelligent data analysis assistant that accepts messy, poorly documented CSV files and automatically infers structure, cleans anomalies, and produces deep analytical reports with minimal user prompting. Designed for researchers who need quick insights from unfamiliar or inherited datasets without spending hours on manual data preparation.
+## Overview
+Researchers frequently receive datasets from collaborators, public repositories, or legacy systems that lack documentation, use inconsistent formatting, and contain mixed data quality. Traditional analysis requires significant upfront effort to understand and prepare such data. Data Cog automates this process by applying heuristic inference, pattern recognition, and iterative cleaning to produce analysis-ready data along with a comprehensive profile report.
+The skill implements a "zero-configuration" philosophy: provide the CSV file path and an optional research question, and it handles encoding detection, delimiter inference, type casting, missingness assessment, and initial exploratory statistics automatically.
+## Automated Ingestion Pipeline
+### Smart Loading
+```python
+import csv
+import chardet
+import pandas as pd
+def smart_load_csv(filepath: str) -> tuple:
+    """
+    Intelligently load a CSV file, auto-detecting encoding and
+    delimiter and skipping leading comment or blank lines before
+    the header row.
+    """
+    # Step 1: Detect encoding from the first 100 kB
+    with open(filepath, 'rb') as f:
+        raw = f.read(100000)
+    # chardet returns None when it cannot guess; fall back to UTF-8
+    encoding = chardet.detect(raw)['encoding'] or 'utf-8'
+    # Step 2: Detect delimiter
+    with open(filepath, 'r', encoding=encoding, errors='replace') as f:
+        sample = f.read(8192)
+    sniffer = csv.Sniffer()
+    try:
+        dialect = sniffer.sniff(sample)
+        delimiter = dialect.delimiter
+    except csv.Error:
+        delimiter = ','
+    # Step 3: Detect header row (skip comment lines)
+    skip_rows = 0
+    with open(filepath, 'r', encoding=encoding, errors='replace') as f:
+        for line in f:
+            if line.startswith('#') or line.startswith('//') or line.strip() == '':
+                skip_rows += 1
+            else:
+                break
+    # Step 4: Load with inferred parameters
+    df = pd.read_csv(
+        filepath, encoding=encoding, delimiter=delimiter,
+        skiprows=skip_rows, low_memory=False
+    )
+    metadata = {
+        'encoding': encoding,
+        'delimiter': repr(delimiter),
+        'skipped_rows': skip_rows,
+        'shape': df.shape
+    }
+    return df, metadata
+```
+### Automatic Type Inference
+```python
+def auto_cast_columns(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Automatically cast columns to their most appropriate types.
+    Handles dates, numerics stored as strings, booleans, and categories.
+    A conversion is accepted when more than 85% of the non-missing values
+    parse; values that do not parse become NaN/NaT.
+    """
+    for col in df.columns:
+        non_null = df[col].dropna()
+        if non_null.empty or pd.api.types.is_datetime64_any_dtype(df[col]):
+            continue  # nothing to infer, or already a datetime column
+        # Try numeric conversion (columns of only '1'/'0' are caught here,
+        # before the boolean check below)
+        numeric = pd.to_numeric(df[col], errors='coerce')
+        if numeric.notna().sum() / len(non_null) > 0.85:
+            df[col] = numeric
+            continue
+        # Try datetime conversion (infer_datetime_format is deprecated
+        # since pandas 2.0, where format inference is the default)
+        parsed_dates = pd.to_datetime(df[col], errors='coerce')
+        if parsed_dates.notna().sum() / len(non_null) > 0.85:
+            df[col] = parsed_dates
+            continue
+        # Try boolean detection
+        unique_lower = non_null.astype(str).str.lower().unique()
+        if set(unique_lower).issubset({'true', 'false', 'yes', 'no', '1', '0', 'y', 'n'}):
+            df[col] = df[col].astype(str).str.lower().map(
+                {'true': True, 'false': False, 'yes': True, 'no': False,
+                 '1': True, '0': False, 'y': True, 'n': False}
+            )
+            continue
+        # Convert low-cardinality strings to category
+        if df[col].nunique() / len(df) < 0.05 and df[col].nunique() < 50:
+            df[col] = df[col].astype('category')
+    return df
+```
+## Deep Automated Profiling
+### Profile Report Generation
+The profiling stage produces a structured report covering:
+1. **Schema overview**: Column names, inferred types, semantic roles (ID, feature, target, timestamp).
+2. **Univariate statistics**: Mean, median, mode, std, skewness, kurtosis for numeric columns; frequency tables for categoricals.
+3. **Missing data matrix**: Heatmap-style report of missingness patterns across all columns.
+4. **Correlation analysis**: Pairwise Pearson, Spearman, and Cramér's V correlations.
+5. **Distribution flags**: Columns that are heavily skewed, zero-inflated, or constant.
+6. **Duplicate detection**: Exact row duplicates and near-duplicate clusters.
+| Metric | Numeric Columns | Categorical Columns |
+|--------|----------------|-------------------|
+| Central tendency | Mean, median, mode | Mode, frequency |
+| Dispersion | Std, IQR, range, CV | Unique count, entropy |
+| Shape | Skewness, kurtosis | Imbalance ratio |
+| Quality | Missing %, zero %, outlier % | Missing %, rare labels % |
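+Of the correlation measures listed above, Cramér's V is the one without a direct pandas method. A minimal sketch of the standard formula, V = sqrt(chi2 / (n * (min(r, c) - 1))), is shown here, assuming SciPy is available; `cramers_v` is a hypothetical helper and no bias correction is applied.
+```python
+import numpy as np
+import pandas as pd
+from scipy.stats import chi2_contingency
+def cramers_v(x, y):
+    """Cramér's V association between two categorical Series (0 = none, 1 = perfect)."""
+    table = pd.crosstab(x, y)
+    if min(table.shape) < 2:
+        return float("nan")  # undefined when either variable is constant
+    # correction=False disables Yates' continuity correction for 2x2 tables
+    chi2 = chi2_contingency(table, correction=False)[0]
+    n = table.to_numpy().sum()
+    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))
+```
+The uncorrected statistic tends to overstate association in small samples with many categories, so low-count tables deserve a cautious reading.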
+## Interactive Analysis Workflow
+### Minimal-Prompt Usage Pattern
+The recommended workflow requires only three inputs:
+1. **File path**: The CSV to analyze.
+2. **Research question** (optional): A one-sentence description of what you want to learn.
+3. **Output format**: "summary", "full_report", or "cleaned_csv".
+```
+User: Analyze /data/survey_results_2025.csv
+      Question: What factors predict participant satisfaction?
+      Output: full_report
+Data Cog will:
+  1. Load and profile the dataset (auto-detect everything)
+  2. Clean and transform (handle missing data, encode categoricals)
+  3. Run correlation analysis focused on satisfaction-related columns
+  4. Generate regression models predicting satisfaction
+  5. Produce a structured report with findings and visualizations
+```
+### Iterative Refinement
+After the initial automated analysis, you can refine by asking targeted follow-up questions:
+- "Focus only on respondents from Group A"
+- "Exclude the first 50 rows (pilot data)"
+- "Treat column X as ordinal with levels: low < medium < high"
+- "Run the same analysis but with log-transformed income"
+## Best Practices
+- Always review the auto-generated profile before trusting downstream results.
+- Verify that automatic type inference made sensible choices, especially for ambiguous columns.
+- Provide a research question when possible to guide feature selection and analysis focus.
+- Save the cleaning audit log alongside your results for reproducibility.
+- For datasets over 1 million rows, consider sampling for the initial profile to save time.
+## References
+- Breck, E., et al. (2019). Data Validation for Machine Learning. *MLSys 2019*.
+- Hynes, N., et al. (2017). The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets. *NIPS MLSys Workshop*.
+- Pandas Development Team (2024). *pandas: Powerful Python Data Analysis Toolkit*. https://pandas.pydata.org/

package/skills/analysis/wrangling/open-data-scientist-guide/SKILL.md ADDED

@@ -0,0 +1,197 @@
+---
+name: open-data-scientist-guide
+description: "AI agent that performs end-to-end data science workflows"
+metadata:
+  openclaw:
+    emoji: "📊"
+    category: "analysis"
+    subcategory: "wrangling"
+    keywords: ["data science", "automated analysis", "EDA", "feature engineering", "data wrangling", "AI agent"]
+    source: "https://github.com/Open-Data-Scientist/open-data-scientist"
+---
+# Open Data Scientist Guide
+## Overview
+Open Data Scientist is an AI agent that automates end-to-end data science workflows — from data loading and cleaning through exploratory analysis, feature engineering, modeling, and report generation. It interprets natural language task descriptions, generates and executes Python code, iteratively refines analyses based on results, and produces publication-ready outputs. Designed for researchers who need quick, thorough data analyses without deep programming expertise.
+## Workflow Pipeline
+```
+Dataset + Task Description
+         ↓
+   Data Profiling (types, distributions, missing values)
+         ↓
+   Cleaning & Preprocessing (imputation, encoding, scaling)
+         ↓
+   Exploratory Data Analysis (correlations, distributions, outliers)
+         ↓
+   Feature Engineering (transforms, interactions, selection)
+         ↓
+   Modeling (train, evaluate, compare)
+         ↓
+   Report Generation (figures, tables, interpretation)
+```
+## Usage
+```python
+from open_data_scientist import DataScientist
+ds = DataScientist(llm_provider="anthropic")
+# Natural language task
+result = ds.analyze(
+    data="experiment_results.csv",
+    task="Identify which experimental conditions significantly affect "
+         "the response variable. Build a predictive model and report "
+         "the most important features.",
+)
+# Outputs
+print(result.summary)           # Text summary of findings
+result.save_report("report.html")  # Full HTML report
+result.save_figures("figures/")     # All generated plots
+```
+## Data Profiling
+```python
+# Automatic data profiling before analysis
+profile = ds.profile("dataset.csv")
+print(f"Rows: {profile.n_rows}, Columns: {profile.n_cols}")
+print(f"Missing values: {profile.missing_summary}")
+print(f"Data types: {profile.dtype_summary}")
+print(f"Potential issues: {profile.warnings}")
+# Column-level details
+for col in profile.columns:
+    print(f"\n{col.name} ({col.dtype}):")
+    print(f"  Unique: {col.n_unique}")
+    print(f"  Missing: {col.n_missing} ({col.pct_missing:.1f}%)")
+    if col.is_numeric:
+        print(f"  Range: [{col.min}, {col.max}]")
+        print(f"  Mean: {col.mean:.3f}, Std: {col.std:.3f}")
+```
+## Exploratory Data Analysis
+```python
+# Guided EDA
+eda_result = ds.explore(
+    data="dataset.csv",
+    focus="relationships",  # or "distributions", "outliers", "time_trends"
+    target_column="outcome",
+)
+# Generated analyses include:
+# - Correlation heatmap
+# - Pairwise scatter plots for top correlations
+# - Distribution plots per group
+# - Statistical tests (t-test, ANOVA, chi-square)
+# - Outlier detection (IQR, Z-score)
+for finding in eda_result.findings:
+    print(f"- {finding.description} (p={finding.p_value:.4f})")
+```
+## Feature Engineering
+```python
+# Automatic feature engineering
+features = ds.engineer_features(
+    data="dataset.csv",
+    target="outcome",
+    strategies=[
+        "polynomial_interactions",  # x1*x2, x1^2
+        "datetime_extraction",      # year, month, day_of_week
+        "text_embeddings",          # TF-IDF or sentence embeddings
+        "binning",                  # numeric to categorical
+        "target_encoding",          # category to target mean
+    ],
+    selection_method="mutual_information",
+    max_features=50,
+)
+print(f"Original features: {features.n_original}")
+print(f"Generated features: {features.n_generated}")
+print(f"Selected features: {features.n_selected}")
+```
+## Modeling Pipeline
+```python
+result = ds.model(
+    data="dataset.csv",
+    target="outcome",
+    task_type="classification",  # or "regression"
+    models=["logistic_regression", "random_forest",
+            "gradient_boosting", "neural_network"],
+    cv_folds=5,
+    metric="f1_macro",
+)
+# Model comparison table
+print(result.comparison_table)
+# | Model              | F1 Macro | Accuracy | AUC   |
+# |--------------------|----------|----------|-------|
+# | Gradient Boosting  | 0.847    | 0.862    | 0.921 |
+# | Random Forest      | 0.831    | 0.849    | 0.908 |
+# | ...                |          |          |       |
+# Best model details
+best = result.best_model
+print(f"Best: {best.name}")
+print(f"Feature importance:\n{best.feature_importance.head(10)}")
+```
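+For readers who want to see what such a comparison involves, the sketch below reproduces the idea with plain scikit-learn: cross-validated macro-F1 for a few candidate classifiers, sorted best first. It is not the agent's implementation; `compare_models` and the candidate list are assumptions made for illustration.
+```python
+from sklearn.datasets import make_classification
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.linear_model import LogisticRegression
+from sklearn.model_selection import cross_val_score
+def compare_models(X, y, cv_folds=5, metric="f1_macro", random_state=42):
+    """Return {model_name: mean CV score}, ordered from best to worst."""
+    candidates = {
+        "logistic_regression": LogisticRegression(max_iter=1000),
+        "random_forest": RandomForestClassifier(n_estimators=100,
+                                                random_state=random_state),
+    }
+    scores = {}
+    for name, model in candidates.items():
+        fold_scores = cross_val_score(model, X, y, cv=cv_folds, scoring=metric)
+        scores[name] = float(fold_scores.mean())
+    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
+# Synthetic data stands in for a real dataset
+X, y = make_classification(n_samples=200, n_features=8, random_state=42)
+ranking = compare_models(X, y)
+```
+Averaging over folds hides variance between folds; reporting the per-fold spread alongside the mean gives a fairer picture when models score closely.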
+## Report Generation
+```python
+# Generate publication-ready report
+result = ds.analyze(
+    data="experiment_results.csv",
+    task="Full analysis with statistical tests",
+    report_config={
+        "format": "html",       # html, pdf, markdown
+        "style": "academic",    # academic, business, minimal
+        "include_code": True,   # Show generated code
+        "figure_dpi": 300,      # Publication quality
+    },
+)
+result.save_report("analysis_report.html")
+```
+## Configuration
+```python
+ds = DataScientist(
+    llm_provider="anthropic",
+    model="claude-sonnet-4-20250514",
+    execution_config={
+        "timeout": 300,             # Max seconds per code block
+        "max_iterations": 10,       # Refinement iterations
+        "sandbox": True,            # Isolated execution
+    },
+    analysis_config={
+        "significance_level": 0.05,
+        "random_state": 42,
+        "test_size": 0.2,
+    },
+)
+```
+## Use Cases
+1. **Experiment analysis**: Analyze lab or survey data with statistical tests
+2. **Dataset exploration**: Quick EDA on unfamiliar datasets
+3. **Baseline modeling**: Rapid prototyping of predictive models
+4. **Report generation**: Automated analysis reports for publications
+## References
+- [Open Data Scientist GitHub](https://github.com/Open-Data-Scientist/open-data-scientist)
+- [ydata-profiling (formerly Pandas Profiling)](https://github.com/ydataai/ydata-profiling)