npm - @wentorai/research-plugins - Versions diffs - 1.0.0 → 1.2.0 - Mend

@wentorai/research-plugins 1.0.0 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (415)

package/skills/analysis/wrangling/stata-data-cleaning/SKILL.md ADDED

@@ -0,0 +1,276 @@
+---
+name: stata-data-cleaning
+description: "Clean, transform, and validate messy research data using Stata"
+metadata:
+  openclaw:
+    emoji: "broom"
+    category: "analysis"
+    subcategory: "wrangling"
+    keywords: ["Stata", "data cleaning", "data wrangling", "missing values", "recoding", "validation"]
+    source: "https://www.stata.com/manuals/d.pdf"
+---
+# Stata Data Cleaning
+Clean, transform, and validate messy research datasets in Stata. This skill covers the complete data preparation pipeline from raw survey or administrative data to analysis-ready datasets, with emphasis on documentation, reproducibility, and handling the common data quality issues encountered in social science, economics, and health research.
+## Overview
+Data cleaning is often estimated to consume 60-80% of research time in empirical studies, yet it is frequently under-documented and hard to reproduce. Stata provides a powerful set of commands for data manipulation, but knowing which commands to use and in what order requires experience with common data quality issues: inconsistent coding, duplicate observations, string formatting problems, implausible values, and complex missing data patterns.
+This skill provides a systematic, step-by-step data cleaning workflow in Stata. Each step produces a log of changes made, enabling full reproducibility and audit trails. The workflow is organized around the principle that raw data should never be modified in place -- instead, cleaning scripts transform raw data into processed datasets while preserving the original.
+The approach follows best practices from the World Bank's DIME Analytics team and the J-PAL research transparency guidelines, making it suitable for projects that require rigorous data documentation for peer review, replication packages, or regulatory compliance.
+## Initial Data Assessment
+### Loading and Inspecting Data
+```stata
+* ============================================
+* Data Cleaning Script: [Project Name]
+* Author: [Name]
+* Date: [Date]
+* Input: raw/survey_data_raw.dta
+* Output: processed/survey_data_clean.dta
+* ============================================
+clear all
+set more off
+log using "logs/cleaning_log.smcl", replace
+* Load raw data
+use "raw/survey_data_raw.dta", clear
+* Basic inspection
+describe
+summarize
+codebook, compact
+* Check dimensions
+display "Observations: " _N
+display "Variables: " c(k)
+* Check for duplicates on ID variable
+duplicates report respondent_id
+duplicates list respondent_id
+```
+### Data Quality Report
+```stata
+* Generate a data quality summary
+foreach var of varlist _all {
+    quietly {
+        count if missing(`var')
+        local nmiss = r(N)
+        local pctmiss = (`nmiss' / _N) * 100
+    }
+    if `pctmiss' > 0 {
+        display "`var': `nmiss' missing (`pctmiss'%)"
+    }
+}
+* Check value ranges for numeric variables
+foreach var of varlist age income education_years {
+    summarize `var', detail
+    * Flag implausible negative values
+    count if `var' < 0 & !missing(`var')
+}
+* Upper bounds are variable-specific: 150 is implausible for age, not for income
+count if age > 150 & !missing(age)
+```
+## String Cleaning
+### Standardizing Text Variables
+```stata
+* Trim whitespace
+replace name = strtrim(name)
+replace name = stritrim(name)  // Remove internal multiple spaces
+* Standardize case
+replace city = proper(city)        // Title case
+replace country = upper(country)   // Upper case
+replace email = lower(email)       // Lower case
+* Remove special characters
+replace phone = ustrregexra(phone, "[^0-9]", "")
+* Replace invalid UTF-8 sequences (ustrfix substitutes the Unicode replacement character)
+replace name = ustrfix(name)
+* Standardize common variations
+replace department = "Computer Science" if ///
+    inlist(department, "CS", "Comp Sci", "Comp. Sci.", "CompSci")
+replace gender = "Female" if inlist(gender, "F", "f", "female", "FEMALE")
+replace gender = "Male" if inlist(gender, "M", "m", "male", "MALE")
+```
+### Parsing Complex Strings
+```stata
+* Split full name into first and last
+gen first_name = word(full_name, 1)
+gen last_name = word(full_name, -1)
+* Extract year from date string "March 15, 2024"
+gen year = real(word(date_string, -1))
+* Parse numeric values from strings like "$1,234.56"
+gen income_clean = real(subinstr(subinstr(income_str, "$", "", .), ",", "", .))
+```
+## Missing Data Handling
+### Identifying Missing Data Patterns
+```stata
+* Install the user-written mdesc command (misstable is built into Stata)
+ssc install mdesc
+* Summary of missing data
+mdesc
+* Missing data patterns
+misstable summarize
+misstable patterns
+* Create missing indicator variables
+foreach var of varlist income education occupation {
+    gen mi_`var' = missing(`var')
+}
+* Informal check of whether missingness depends on observed variables
+* (t-tests of observed variables by missing status; this is not Little's MCAR test)
+foreach var of varlist income education {
+    ttest age, by(mi_`var')
+    ttest gender_numeric, by(mi_`var')
+}
+```
+### Recoding Missing Values
+```stata
+* Common survey codes for missing
+* -99 = refused, -88 = don't know, -77 = not applicable
+foreach var of varlist income satisfaction trust_score {
+    replace `var' = .r if `var' == -99  // .r = refused
+    replace `var' = .d if `var' == -88  // .d = don't know
+    replace `var' = .n if `var' == -77  // .n = not applicable
+}
+* Extended missing values preserve the reason for missingness
+* while still being treated as missing in analyses
+```
+## Variable Construction
+### Recoding and Categorization
+```stata
+* Create age groups
+recode age (18/29 = 1 "18-29") (30/44 = 2 "30-44") ///
+           (45/59 = 3 "45-59") (60/max = 4 "60+"), gen(age_group)
+* Create binary indicator
+gen high_income = (income > 75000) if !missing(income)
+* Create composite scale (e.g., Likert items)
+alpha item1 item2 item3 item4 item5, gen(scale_score) item
+* Cronbach's alpha is reported; scale_score is the mean
+* Standardize continuous variables
+foreach var of varlist income education_years age {
+    egen z_`var' = std(`var')
+}
+* Winsorize extreme values (winsor2 is user-written: ssc install winsor2)
+winsor2 income, cuts(1 99) replace
+```
+### Date Variables
+```stata
+* Parse date strings
+gen interview_date = date(date_string, "MDY")
+format interview_date %td
+* Extract components
+gen interview_year = year(interview_date)
+gen interview_month = month(interview_date)
+gen interview_dow = dow(interview_date)  // 0=Sunday
+* Calculate durations
+gen days_since_treatment = interview_date - treatment_date
+gen months_since = (interview_date - treatment_date) / 30.44
+```
+## Data Validation
+### Assertion-Based Validation
+```stata
+* These assertions halt execution if violated
+assert _N == 5000  // Expected sample size
+assert !missing(respondent_id)  // No missing IDs
+assert age >= 18 & age <= 120 if !missing(age)  // Plausible age range
+assert inlist(gender, "Male", "Female", "Other", "") | missing(gender)
+* Cross-variable consistency checks
+assert education_years >= 0 if !missing(education_years)
+assert income >= 0 if !missing(income)
+assert end_date >= start_date if !missing(end_date) & !missing(start_date)
+```
+### Duplicate Detection and Resolution
+```stata
+* Identify duplicates
+duplicates tag respondent_id, gen(dup_flag)
+list respondent_id survey_date if dup_flag > 0, sepby(respondent_id)
+* Keep most recent observation per respondent
+bysort respondent_id (survey_date): keep if _n == _N
+* Alternative: keep the first observation instead (apply one rule, not both)
+* bysort respondent_id (survey_date): keep if _n == 1
+```
+## Saving and Documentation
+```stata
+* Label all variables
+label variable age "Age at time of interview (years)"
+label variable income "Annual household income (USD)"
+label variable education_years "Total years of formal education"
+* Save cleaned dataset
+compress  // Reduce file size
+save "processed/survey_data_clean.dta", replace
+* Write a compact codebook and dataset summary to the log
+codebook, compact
+describe, short
+* Close log
+log close
+```
+## Best Practices
+1. **Never modify raw data files**: Always read raw data and write to a separate processed file.
+2. **Log everything**: Use `log using` to capture all output for audit trails.
+3. **Use assert statements**: Validate assumptions about the data at each stage.
+4. **Document decisions**: Comment every recode, drop, or imputation with the rationale.
+5. **Version your cleaning scripts**: Use git to track changes to .do files.
+6. **Produce a data dictionary**: Label every variable and attach value labels to categorical variables in the final dataset.
+## References
+- Stata Data Management Reference Manual: https://www.stata.com/manuals/d.pdf
+- DIME Analytics Data Management Wiki: https://dimewiki.worldbank.org/Data_Management
+- J-PAL Research Resources: https://www.povertyactionlab.org/resource/data-cleaning
+- Long, J.S. (2009), The Workflow of Data Analysis Using Stata, Stata Press

package/skills/analysis/wrangling/streamline-analyst-guide/SKILL.md ADDED

@@ -0,0 +1,119 @@
+---
+name: streamline-analyst-guide
+description: "End-to-end data analysis AI agent with Streamlit UI"
+metadata:
+  openclaw:
+    emoji: "📈"
+    category: "analysis"
+    subcategory: "wrangling"
+    keywords: ["data analysis", "Streamlit", "automated EDA", "machine learning", "data science", "AI analyst"]
+    source: "https://github.com/Wilson-ZheLin/Streamline-Analyst"
+---
+# Streamline Analyst Guide
+## Overview
+Streamline Analyst is an end-to-end data analysis AI agent with a Streamlit web interface. Upload a dataset and describe your analysis goal in natural language; the agent then handles data cleaning, EDA, feature engineering, model training, evaluation, and report generation. An interactive UI lets you review each step and adjust parameters.
+## Installation
+```bash
+git clone https://github.com/Wilson-ZheLin/Streamline-Analyst.git
+cd Streamline-Analyst
+pip install -r requirements.txt
+streamlit run app.py
+```
+## Workflow
+```
+Upload Dataset (CSV, Excel, Parquet)
+         ↓
+   Data Profiling
+   ├── Column types and distributions
+   ├── Missing value analysis
+   ├── Correlation matrix
+   └── Outlier detection
+         ↓
+   Data Cleaning (interactive)
+   ├── Handle missing values
+   ├── Remove/fix outliers
+   ├── Type conversions
+   └── Feature encoding
+         ↓
+   EDA (automated + custom)
+   ├── Univariate analysis
+   ├── Bivariate relationships
+   ├── Statistical tests
+   └── Custom visualizations
+         ↓
+   Modeling (if applicable)
+   ├── Train/test split
+   ├── Model selection + training
+   ├── Hyperparameter tuning
+   └── Evaluation metrics
+         ↓
+   Report Generation
+```
+## Features
+```python
+# Streamline Analyst provides:
+# 1. Smart data profiling
+# - Auto-detect column types (numeric, categorical, datetime)
+# - Distribution analysis per column
+# - Missing value patterns (MCAR, MAR, MNAR hints)
+# - Correlation analysis with significance
+# 2. Interactive cleaning
+# - Imputation strategies (mean, median, mode, KNN, model)
+# - Outlier handling (IQR, Z-score, isolation forest)
+# - Encoding (one-hot, label, target, ordinal)
+# - Scaling (standard, minmax, robust)
+# 3. Automated EDA
+# - Distribution plots (histogram, KDE, box, violin)
+# - Relationship plots (scatter, pair, heatmap)
+# - Time series decomposition
+# - Statistical tests (t-test, ANOVA, chi-square, Mann-Whitney)
+# 4. Model pipeline
+# - Classification: LR, RF, GBM, SVM, MLP
+# - Regression: LR, RF, GBM, SVR, ElasticNet
+# - Cross-validation with confidence intervals
+# - Feature importance visualization
+# - SHAP explanations
+# 5. Report
+# - HTML report with all plots and findings
+# - Downloadable cleaned dataset
+# - Model artifacts (pickle)
+```
+## Natural Language Interface
+```markdown
+### Example Prompts
+- "Show me the distribution of all numeric columns"
+- "Is there a significant difference in income between genders?"
+- "Build a classifier to predict churn using all features"
+- "What are the top 5 most important features for prediction?"
+- "Clean the data: fill missing values and remove outliers"
+- "Generate a summary report of this dataset"
+```
+## Use Cases
+1. **Quick EDA**: Rapid exploration of unfamiliar datasets
+2. **Data cleaning**: Interactive preprocessing with AI guidance
+3. **Baseline models**: Quick ML prototyping without coding
+4. **Report generation**: Automated analysis reports
+5. **Teaching**: Interactive data science demonstrations
+## References
+- [Streamline-Analyst GitHub](https://github.com/Wilson-ZheLin/Streamline-Analyst)
+- [Streamlit](https://streamlit.io/)

package/skills/analysis/wrangling/survey-data-processing/SKILL.md ADDED

@@ -0,0 +1,298 @@
+---
+name: survey-data-processing
+description: "Clean, recode, and prepare survey response data for analysis"
+metadata:
+  openclaw:
+    emoji: "clipboard"
+    category: "analysis"
+    subcategory: "wrangling"
+    keywords: ["survey data", "questionnaire coding", "Likert scale", "response validation", "recoding", "survey analysis"]
+    source: "wentor-research-plugins"
+---
+# Survey Data Processing
+A skill for cleaning, recoding, and preparing survey response data for statistical analysis. Covers handling common survey data issues such as incomplete responses, attention check failures, reverse-coded items, scale construction, open-ended response coding, and export to analysis-ready formats compatible with SPSS, Stata, and R.
+## Survey Data Quality Assessment
+### Initial Inspection Workflow
+Platforms such as Qualtrics, SurveyMonkey, REDCap, and Google Forms each have their own export formats and quirks, so the first step is always to standardize the export.
+```python
+import pandas as pd
+import numpy as np
+def assess_survey_quality(df, duration_col="duration_seconds",
+                          min_duration=60):
+    """
+    Generate a survey data quality report.
+    Checks:
+    - Completion rates per question
+    - Response duration (speeders and slow responders)
+    - Straight-line responding patterns
+    - Attention check failures
+    """
+    report = {}
+    # Overall completion
+    total_respondents = len(df)
+    complete = df.dropna(thresh=int(len(df.columns) * 0.8))
+    report["total_responses"] = total_respondents
+    report["substantially_complete"] = len(complete)
+    report["completion_rate"] = f"{len(complete)/total_respondents*100:.1f}%"
+    # Duration analysis
+    if duration_col in df.columns:
+        durations = df[duration_col].dropna()
+        report["median_duration_seconds"] = durations.median()
+        report["speeders"] = (durations < min_duration).sum()
+        report["speeder_pct"] = f"{(durations < min_duration).mean()*100:.1f}%"
+    # Missing data per question
+    missing_by_col = df.isna().sum().sort_values(ascending=False)
+    report["most_skipped_questions"] = missing_by_col.head(10).to_dict()
+    return report
+```
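A minimal sketch of the same checks on a toy export. The DataFrame `toy`, its column names, and its values are made up for illustration; only the 80% completeness rule and the 60-second speeder threshold come from the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy export: 5 respondents, 3 questions plus a duration column.
toy = pd.DataFrame({
    "duration_seconds": [300, 45, 610, 30, 420],
    "q1": [4, 3, np.nan, 5, 2],
    "q2": [5, 3, np.nan, 5, 1],
    "q3": [4, np.nan, np.nan, 5, 2],
})

min_duration = 60  # same default threshold as in the function above

# A row is "substantially complete" if at least 80% of its columns are filled.
min_filled = int(len(toy.columns) * 0.8)  # int(4 * 0.8) == 3
n_complete = len(toy.dropna(thresh=min_filled))

# Speeders: respondents who finished faster than the threshold.
n_speeders = int((toy["duration_seconds"] < min_duration).sum())

# Questions ranked by number of skipped responses.
skipped = toy.isna().sum().sort_values(ascending=False)

print(f"complete: {n_complete}/{len(toy)}")
print(f"speeders: {n_speeders}")
print(skipped.head(3).to_dict())
```

The 80% rule counts every column, including metadata such as the duration field, so it may be worth restricting the check to question columns only.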
+### Identifying Low-Quality Responses
+```python
+def detect_straightlining(df, likert_columns, threshold=0.9):
+    """
+    Detect respondents who select the same answer for nearly
+    all Likert-scale questions (straight-line responding).
+    A respondent is flagged if the proportion of their most
+    common response exceeds the threshold.
+    """
+    flagged = []
+    for idx, row in df[likert_columns].iterrows():
+        responses = row.dropna()
+        if len(responses) == 0:
+            continue
+        most_common_pct = responses.value_counts().iloc[0] / len(responses)
+        if most_common_pct >= threshold:
+            flagged.append(idx)
+    return flagged
+def check_attention_items(df, attention_checks):
+    """
+    Validate attention check (trap) questions.
+    Args:
+        attention_checks: dict of {column_name: correct_answer}
+        Example: {"q15_attention": 4, "q32_trap": "strongly agree"}
+    """
+    failed = pd.Series(False, index=df.index)
+    for col, correct in attention_checks.items():
+        failed = failed | (df[col] != correct)
+    return df.index[failed].tolist()
+```
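The row-wise loop above can also be written with `DataFrame.apply`. This is a sketch under the same definition (share of the most common answer among answered items); the function name `straightlining_share` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def straightlining_share(df, likert_columns):
    """Share of each respondent's answered items equal to their modal answer."""
    # value_counts drops missing answers; an all-missing row yields NaN.
    return df[likert_columns].apply(
        lambda row: row.value_counts(normalize=True).max(), axis=1
    )

# Hypothetical responses: respondent 3 skipped every item.
toy = pd.DataFrame({
    "q1": [3, 1, 5, np.nan, 2],
    "q2": [3, 2, 5, np.nan, 2],
    "q3": [3, 3, 5, np.nan, np.nan],
    "q4": [3, 4, 4, np.nan, 2],
})

share = straightlining_share(toy, ["q1", "q2", "q3", "q4"])
# NaN >= 0.9 evaluates to False, so respondents with no answers are not flagged.
flagged = share.index[share >= 0.9].tolist()

print(share.round(2).tolist())
print(flagged)
```

Returning the share rather than a flag keeps the threshold decision visible, which makes it easier to report how sensitive exclusions are to the cutoff.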
+## Recoding and Transformation
+### Reverse Coding
+Many validated psychological scales include reverse-coded items to detect acquiescence bias. These must be recoded before computing scale scores.
+```python
+def reverse_code(df, columns, scale_max, scale_min=1):
+    """
+    Reverse-code specified columns for Likert-type scales.
+    Formula: reversed = (scale_max + scale_min) - original
+    Example for a 1-5 scale:
+      1 -> 5, 2 -> 4, 3 -> 3, 4 -> 2, 5 -> 1
+    """
+    df_recoded = df.copy()
+    for col in columns:
+        df_recoded[col] = (scale_max + scale_min) - df[col]
+    return df_recoded
+# Example usage with a Big Five personality scale
+reverse_items = {
+    "extraversion": ["ext_2", "ext_4", "ext_6"],
+    "neuroticism": ["neur_1", "neur_3", "neur_5"],
+    "agreeableness": ["agree_3", "agree_5"],
+}
+# For a 1-7 Likert scale:
+for construct, items in reverse_items.items():
+    df = reverse_code(df, items, scale_max=7, scale_min=1)
+```
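A small worked example of the formula, assuming a 1-7 scale. The helper `reverse_code_checked` and the toy responses are hypothetical; the range check is an addition that is not part of the function above.

```python
import numpy as np
import pandas as pd

def reverse_code_checked(df, columns, scale_max, scale_min=1):
    """Reverse-code columns after checking that values lie on the scale."""
    observed = df[columns]
    # Comparisons with NaN are False, so missing answers pass the check.
    out_of_range = (observed < scale_min) | (observed > scale_max)
    if out_of_range.any().any():
        raise ValueError("values outside the stated scale range")
    out = df.copy()
    out[columns] = (scale_max + scale_min) - observed
    return out

responses = pd.DataFrame({
    "ext_1": [1, 4, 7],             # regular item, left untouched
    "ext_2": [1.0, 4.0, np.nan],    # reverse-keyed item with one missing answer
})

recoded = reverse_code_checked(responses, ["ext_2"], scale_max=7)
twice = reverse_code_checked(recoded, ["ext_2"], scale_max=7)

print(recoded)
```

Reverse coding is its own inverse: applying it twice restores the original values. Rerunning a notebook cell that reassigns the same DataFrame will therefore silently undo the recode, so it is safer to recode from the raw data each time.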
+### Scale Construction
+```python
+def compute_scale_scores(df, scale_definitions, method="mean"):
+    """
+    Compute composite scale scores from individual items.
+    Args:
+        scale_definitions: dict mapping scale name to list of columns
+        method: "mean" or "sum"
+    Returns:
+        DataFrame with new scale score columns
+    """
+    for scale_name, items in scale_definitions.items():
+        if method == "mean":
+            df[scale_name] = df[items].mean(axis=1)
+        elif method == "sum":
+            # min_count=1 keeps all-missing respondents as NaN instead of 0
+            df[scale_name] = df[items].sum(axis=1, min_count=1)
+        # Also compute Cronbach's alpha for reliability
+        alpha = cronbachs_alpha(df[items])
+        print(f"{scale_name}: alpha = {alpha:.3f} "
+              f"(n_items = {len(items)})")
+    return df
+def cronbachs_alpha(item_df):
+    """
+    Compute Cronbach's alpha for internal consistency reliability.
+    Values above 0.70 are generally considered acceptable.
+    """
+    item_df = item_df.dropna()
+    n_items = item_df.shape[1]
+    if n_items < 2:
+        return np.nan
+    item_variances = item_df.var(axis=0, ddof=1)
+    total_variance = item_df.sum(axis=1).var(ddof=1)
+    alpha = (n_items / (n_items - 1)) * (
+        1 - item_variances.sum() / total_variance
+    )
+    return alpha
+```
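To sanity-check the alpha formula, it helps to run it on data where the answer can be worked out by hand. The sketch below restates the same computation under the hypothetical name `alpha_from_items` and applies it to two made-up item sets.

```python
import pandas as pd

def alpha_from_items(items):
    """Cronbach's alpha from a respondents-by-items DataFrame (complete cases)."""
    items = items.dropna()
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Two identical items: item variances 1 + 1 = 2, total-score variance 4.
parallel = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})

# Two imperfectly related items: item variances 5/3 each, total-score variance 16/3.
mixed = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 1, 4, 3]})

alpha_parallel = alpha_from_items(parallel)  # 2 * (1 - 2/4)
alpha_mixed = alpha_from_items(mixed)        # 2 * (1 - (10/3) / (16/3))

print(round(alpha_parallel, 3), round(alpha_mixed, 3))
```

Because alpha uses listwise deletion (`dropna`) while the mean-based scale score uses whatever items are available, the two can be based on different respondents; reporting the n used for alpha avoids confusion.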
+## Open-Ended Response Processing
+### Coding Qualitative Responses
+```python
+import re
+def code_open_responses(df, text_column, codebook):
+    """
+    Apply a predefined codebook to open-ended responses using
+    keyword matching. For research-quality coding, this should
+    be supplemented with manual coding by trained raters.
+    Args:
+        codebook: dict mapping code names to keyword lists
+        Example: {
+            "financial_concern": ["money", "cost", "expensive", "afford"],
+            "time_constraint": ["time", "busy", "schedule", "hours"],
+            "quality_issue": ["quality", "broken", "defect", "poor"],
+        }
+    """
+    for code_name, keywords in codebook.items():
+        # Escape keywords so regex metacharacters are matched literally
+        pattern = "|".join(re.escape(k) for k in keywords)
+        df[f"code_{code_name}"] = (
+            df[text_column]
+            .str.lower()
+            .str.contains(pattern, na=False)
+            .astype(int)
+        )
+    return df
+```
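Keyword matching by substring can over-match, which is one reason manual review is still needed. The sketch below contrasts substring matching with whole-word matching on made-up responses; the keyword list and variable names are hypothetical.

```python
import re
import pandas as pd

responses = pd.Series([
    "The cost was too high",
    "I am too busy",
    "Sometimes it breaks",
    None,
])
keywords = ["time", "busy"]  # hypothetical keywords for a "time_constraint" code

# Substring matching: "time" also matches inside "Sometimes".
substring_pattern = "|".join(re.escape(k) for k in keywords)
substring_hits = (
    responses.str.lower().str.contains(substring_pattern, na=False).astype(int)
)

# Whole-word matching: \b anchors each keyword at word boundaries.
word_pattern = r"\b(?:" + substring_pattern + r")\b"
word_hits = (
    responses.str.lower().str.contains(word_pattern, na=False).astype(int)
)

print(substring_hits.tolist())
print(word_hits.tolist())
```

Whole-word matching trades false positives for false negatives: it would no longer match inflected forms such as "affordable" for the keyword "afford", so the choice depends on the codebook.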
+### Inter-Rater Reliability
+```
+When multiple coders classify open-ended responses:
+Cohen's Kappa (2 raters):
+  - < 0.00: poor agreement
+  - 0.00-0.20: slight
+  - 0.21-0.40: fair
+  - 0.41-0.60: moderate
+  - 0.61-0.80: substantial
+  - 0.81-1.00: almost perfect
+Fleiss' Kappa (3+ raters):
+  - Same interpretation scale as Cohen's
+  - Use when more than two raters code the same responses
+Process:
+  1. Develop codebook with definitions and examples
+  2. Train coders on 10-20 practice responses
+  3. Code 20% of responses independently (overlap set)
+  4. Calculate inter-rater reliability on the overlap set
+  5. If kappa < 0.70, discuss disagreements and refine codebook
+  6. Repeat until acceptable reliability is achieved
+  7. Divide remaining responses among coders
+```
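Step 4 of the process can be sketched as follows for two raters. This is a plain-Python illustration of Cohen's kappa, not a replacement for a tested statistics library; the function name `cohens_kappa` and the ten coded responses are made up.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two raters coding the same responses."""
    if len(codes_a) != len(codes_b) or len(codes_a) == 0:
        raise ValueError("both raters must code the same non-empty set")
    n = len(codes_a)
    # Observed agreement: share of responses given the same code.
    p_observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: from each rater's marginal code frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1:
        return float("nan")  # undefined when both raters use one identical code
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical overlap set: the raters agree on 8 of 10 responses.
rater_a = ["cost"] * 5 + ["time"] * 5
rater_b = ["cost"] * 4 + ["time"] * 5 + ["cost"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Here observed agreement is 0.80 and chance agreement is 0.50, giving a kappa of about 0.60. That falls in the "moderate" band and below the 0.70 threshold in step 5, so the coders would discuss disagreements and refine the codebook.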
+## Data Reshaping for Analysis
+### Wide to Long Format
+Survey data is typically exported in wide format (one row per respondent, one column per question). Many analyses require long format.
+```python
+def reshape_repeated_measures(df, id_col, time_points,
+                              measure_prefix):
+    """
+    Reshape repeated-measures survey data from wide to long.
+    Example: columns q1_pre, q1_post -> long format with
+    time column ("pre", "post") and value column.
+    """
+    value_vars = [f"{measure_prefix}_{t}" for t in time_points]
+    long_df = pd.melt(
+        df,
+        id_vars=[id_col],
+        value_vars=value_vars,
+        var_name="time_point",
+        value_name=measure_prefix
+    )
+    # Clean time_point column
+    long_df["time_point"] = (
+        long_df["time_point"]
+        .str.replace(f"{measure_prefix}_", "", regex=False)
+    )
+    return long_df
+```
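A short usage sketch of the same reshape on a two-respondent example. The column names (`respondent_id`, `stress_pre`, `stress_post`) are hypothetical, and the ordered categorical is an optional addition so that "pre" sorts before "post".

```python
import pandas as pd

wide = pd.DataFrame({
    "respondent_id": [101, 102],
    "stress_pre": [5, 3],
    "stress_post": [4, 2],
})
time_points = ["pre", "post"]
prefix = "stress"

long_df = pd.melt(
    wide,
    id_vars=["respondent_id"],
    value_vars=[f"{prefix}_{t}" for t in time_points],
    var_name="time_point",
    value_name=prefix,
)

# Strip the measure prefix literally (no regex interpretation).
long_df["time_point"] = long_df["time_point"].str.replace(
    f"{prefix}_", "", regex=False
)

# An ordered categorical keeps the study order instead of alphabetical order.
long_df["time_point"] = pd.Categorical(
    long_df["time_point"], categories=time_points, ordered=True
)
long_df = long_df.sort_values(["respondent_id", "time_point"]).reset_index(drop=True)

print(long_df)
```

Each respondent now contributes one row per time point, which is the layout most mixed-model and repeated-measures routines expect.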
+## Export for Statistical Software
+```
+Export formats by software:
+SPSS (.sav):
+  - Use pyreadstat: pyreadstat.write_sav(df, "output.sav")
+  - Include variable labels and value labels
+  - Set measurement level (nominal, ordinal, scale)
+Stata (.dta):
+  - Use pandas: df.to_stata("output.dta")
+  - Include variable labels via the variable_labels argument of to_stata
+R (.csv with codebook):
+  - Export CSV plus a separate codebook document
+  - Or use pyreadr to write .rds: pyreadr.write_rds("output.rds", df)
+  - Include factor level definitions
+General best practices:
+  - Include a unique respondent ID column
+  - Use numeric codes for categorical variables (with labels)
+  - Document all recoding in a companion codebook
+  - Save both raw and processed versions
+  - Include a timestamp column for data versioning
+```
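As an illustration of the Stata and CSV routes, the sketch below writes a small labeled dataset with pandas and reads the labels back. The DataFrame, file names, and label text are made up, and the files go to a temporary directory so nothing in a real project is overwritten.

```python
import os
import tempfile

import pandas as pd
from pandas.io.stata import StataReader

clean = pd.DataFrame({
    "respondent_id": [101, 102, 103],
    "satisfaction": [4, 2, 5],
})
# Hypothetical variable labels (Stata limits each label to 80 characters).
labels = {
    "respondent_id": "Unique respondent ID",
    "satisfaction": "Overall satisfaction (1-5)",
}

out_dir = tempfile.mkdtemp()
dta_path = os.path.join(out_dir, "survey_clean.dta")
csv_path = os.path.join(out_dir, "survey_clean.csv")

# Stata: variable labels travel inside the .dta file.
clean.to_stata(dta_path, write_index=False, variable_labels=labels)

# CSV: labels are lost, so a separate codebook is needed alongside the file.
clean.to_csv(csv_path, index=False)

# Read the file back to confirm that the labels and values survived.
with StataReader(dta_path) as reader:
    stored_labels = reader.variable_labels()
roundtrip = pd.read_stata(dta_path)

print(stored_labels)
```

Reading the export back is a cheap check that labels and values survived the round trip before the file is handed to collaborators.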
+Proper survey data processing is essential for valid statistical inference. Decisions made during cleaning and recoding directly affect research conclusions, making transparent documentation of every step a methodological requirement rather than a convenience.