npm - @wentorai/research-plugins - Versions diffs - 1.0.0 - Mend

@wentorai/research-plugins 1.0.0

Files changed (252)

package/skills/analysis/econometrics/iv-regression-guide/SKILL.md ADDED

@@ -0,0 +1,198 @@
+---
+name: iv-regression-guide
+description: "Apply instrumental variables, 2SLS, and address endogeneity issues"
+metadata:
+  openclaw:
+    emoji: "wrench"
+    category: "analysis"
+    subcategory: "econometrics"
+    keywords: ["instrumental variables", "2SLS", "endogeneity", "IV regression", "causal inference", "econometrics"]
+    source: "wentor-research-plugins"
+---
+# Instrumental Variables Regression Guide
+A skill for applying instrumental variables (IV) estimation to address endogeneity in regression models. Covers the logic of IV, two-stage least squares (2SLS), instrument validity tests, weak instrument diagnostics, and reporting standards.
+## The Endogeneity Problem
+### Why OLS Fails
+```
+Ordinary Least Squares assumes:  E[u | X] = 0
+(Regressors are uncorrelated with the error term)
+This assumption is violated when:
+  - Omitted variable bias: A confound affects both X and Y
+  - Simultaneity: X affects Y and Y affects X
+  - Measurement error: X is measured with noise
+Consequence: OLS estimates are biased and inconsistent.
+No amount of data will fix this.
+```
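The inconsistency claim can be checked with a small simulation. The sketch below is an illustration only: the data-generating process, the coefficient values, and the helper names (`cov`, `ols_slope`, `simulate_ols_bias`) are assumptions made for this example and are not part of the package.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


def ols_slope(x, y):
    """Slope from a one-regressor OLS fit of y on x (with intercept)."""
    return cov(x, y) / cov(x, x)


def simulate_ols_bias(n, seed=0):
    """OLS slope when an unobserved confounder drives both X and Y.

    Assumed model: X = c + v and Y = 1.0 * X + 2.0 * c + e, so the true
    effect is 1.0 while OLS converges to 1.0 + 2.0 * Cov(X, c) / Var(X) = 2.0.
    """
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        c = rng.gauss(0, 1)                      # unobserved confounder
        xi = c + rng.gauss(0, 1)                 # regressor depends on c
        yi = 1.0 * xi + 2.0 * c + rng.gauss(0, 1)
        x.append(xi)
        y.append(yi)
    return ols_slope(x, y)
```

Calling `simulate_ols_bias` with increasing `n` should show the estimate settling near 2.0 rather than the true 1.0: a larger sample makes the estimate more precise, not less biased.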
+### The IV Solution
+An instrumental variable Z satisfies two conditions:
+```
+1. Relevance:  Z is correlated with the endogenous regressor X
+               Cov(Z, X) != 0
+2. Exclusion:  Z affects Y ONLY through X (not directly)
+               Cov(Z, u) = 0
+   Z --> X --> Y
+   Z -/-> Y  (no direct path)
+```
+## Two-Stage Least Squares (2SLS)
+### How 2SLS Works
+```
+Stage 1: Regress the endogenous variable on the instrument(s)
+         X = gamma_0 + gamma_1 * Z + controls + v
+         Save the fitted values: X_hat
+Stage 2: Regress the outcome on the fitted values
+         Y = beta_0 + beta_1 * X_hat + controls + e
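+The coefficient beta_1 is the IV estimate of the causal effect.
+Note: standard errors from a manually run Stage 2 are incorrect because they
+ignore that X_hat is itself estimated; use a dedicated 2SLS routine for inference.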
+```
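The two stages can be sketched by hand for the simplest case of one endogenous regressor, one instrument, and no controls. This is a minimal illustration in pure Python: the simulated model and all function names (`ols_fit`, `two_stage_least_squares`, `simulate_iv_data`) are assumptions for this example, and it returns a point estimate only.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


def ols_fit(x, y):
    """Return (intercept, slope) from an OLS fit of y on x."""
    slope = cov(x, y) / cov(x, x)
    intercept = sum(y) / len(y) - slope * sum(x) / len(x)
    return intercept, slope


def two_stage_least_squares(z, x, y):
    """2SLS slope for one endogenous regressor, one instrument, no controls."""
    gamma_0, gamma_1 = ols_fit(z, x)              # Stage 1: X on Z
    x_hat = [gamma_0 + gamma_1 * zi for zi in z]  # fitted values
    _, beta_1 = ols_fit(x_hat, y)                 # Stage 2: Y on X_hat
    return beta_1


def simulate_iv_data(n, seed=0):
    """Assumed model: X = 0.8*Z + c + v, Y = 1.0*X + 2.0*c + e (true effect 1.0)."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        c = rng.gauss(0, 1)                       # unobserved confounder
        zi = rng.gauss(0, 1)                      # instrument, independent of c
        xi = 0.8 * zi + c + rng.gauss(0, 1)
        yi = 1.0 * xi + 2.0 * c + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y
```

In this just-identified case the two-stage slope equals the ratio `cov(z, y) / cov(z, x)`, which is a quick way to sanity-check a manual implementation.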
+### Implementation in Python
+```python
+from typing import Optional
+from linearmodels.iv import IV2SLS
+import pandas as pd
+def run_2sls(data: pd.DataFrame, dependent: str,
+             endogenous: str, instruments: list[str],
+             controls: Optional[list[str]] = None) -> dict:
+    """
+    Run a 2SLS instrumental variables regression.
+    Args:
+        data: DataFrame with all variables
+        dependent: Name of the dependent variable (Y)
+        endogenous: Name of the endogenous regressor (X)
+        instruments: List of instrument variable names (Z)
+        controls: List of exogenous control variable names
+    """
+    controls = controls or []
+    # Constant plus any exogenous controls
+    exog_str = " + ".join(["1"] + controls)
+    endog_str = endogenous
+    instr_str = " + ".join(instruments)
+    formula = f"{dependent} ~ {exog_str} + [{endog_str} ~ {instr_str}]"
+    model = IV2SLS.from_formula(formula, data)
+    result = model.fit(cov_type="robust")
+    return {
+        "coefficients": dict(result.params),
+        "std_errors": dict(result.std_errors),
+        "p_values": dict(result.pvalues),
+        # DataFrame of first-stage diagnostics (includes the partial F-statistic)
+        "first_stage_diagnostics": result.first_stage.diagnostics,
+        "summary": str(result.summary)
+    }
+```
+### Implementation in R
+```r
+library(ivreg)
+# 2SLS estimation
+iv_model <- ivreg(
+  log(wage) ~ education + experience | parent_education + experience,
+  data = df
+)
+summary(iv_model, diagnostics = TRUE)
+```
+## Instrument Validity Tests
+### First-Stage F-Statistic (Relevance)
+```python
+def check_weak_instruments(first_stage_f: float) -> dict:
+    """
+    Evaluate instrument strength using first-stage F-statistic.
+    Args:
+        first_stage_f: F-statistic from the first-stage regression
+    """
+    return {
+        "f_statistic": first_stage_f,
+        "rule_of_thumb": (
+            "Strong instruments" if first_stage_f > 10
+            else "Potentially weak instruments"
+        ),
+        "interpretation": (
+            "Staiger & Stock (1997) suggest F > 10 as a rule of thumb for "
+            "one endogenous variable. For more precise thresholds, "
+            "consult the Stock & Yogo (2005) critical values table based on "
+            "the number of instruments and desired maximal bias."
+        ),
+        "if_weak": [
+            "Use LIML (Limited Information Maximum Likelihood) instead of 2SLS",
+            "Report Anderson-Rubin confidence intervals (robust to weak IV)",
+            "Consider finding stronger instruments",
+            "Use the Lee et al. (2022) tF procedure for valid inference"
+        ]
+    }
+```
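The `first_stage_f` argument has to come from somewhere. As a rough sketch, for a single instrument and no controls the first-stage F-statistic is the squared t-statistic on the instrument. The function and simulation names below are hypothetical, and the variance formula assumes homoskedastic errors; with robust or clustered errors a robust F-statistic would be needed instead.

```python
import random


def first_stage_f(z, x):
    """First-stage F-statistic for one instrument and no controls.

    With a single instrument, F equals the squared t-statistic on the
    instrument in the regression of x on z (homoskedastic errors assumed).
    """
    n = len(z)
    mean_z = sum(z) / n
    mean_x = sum(x) / n
    szz = sum((zi - mean_z) ** 2 for zi in z)
    szx = sum((zi - mean_z) * (xi - mean_x) for zi, xi in zip(z, x))
    gamma_1 = szx / szz                           # first-stage slope
    gamma_0 = mean_x - gamma_1 * mean_z           # first-stage intercept
    rss = sum((xi - gamma_0 - gamma_1 * zi) ** 2 for zi, xi in zip(z, x))
    sigma2 = rss / (n - 2)                        # residual variance
    var_gamma_1 = sigma2 / szz                    # variance of the slope
    return gamma_1 ** 2 / var_gamma_1


def simulate_first_stage(n, strength, seed=0):
    """Assumed first stage: X = strength * Z + v with standard normal Z and v."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [strength * zi + rng.gauss(0, 1) for zi in z]
    return z, x
```

Lowering `strength` toward zero in `simulate_first_stage` shows how quickly the statistic falls toward the weak-instrument range.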
+### Overidentification Test (Exclusion Restriction)
+When you have more instruments than endogenous variables, the Hansen J test (or Sargan test) checks whether the extra instruments are valid:
+```
+H0: All instruments are valid (uncorrelated with the error)
+H1: At least one instrument is invalid
+If p < 0.05: Reject -> at least one instrument may violate exclusion
+If p > 0.05: Fail to reject -> no evidence against instrument validity
+             (the test has low power and cannot detect the case where
+             all instruments are invalid in the same way)
+```
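For intuition, the Sargan version of the test can be computed by hand as n times the R-squared from regressing the 2SLS residuals on the instruments. The sketch below covers only the case of one endogenous regressor, two instruments, and no controls (one overidentifying restriction), assumes homoskedastic errors, and uses hypothetical helper names and a simulated model throughout.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a * coef = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_fitted(columns, y):
    """Fitted values from OLS of y on the given columns plus a constant."""
    rows = [[1.0, *vals] for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    return [sum(c * v for c, v in zip(coef, r)) for r in rows]


def sargan_test(instruments, x, y):
    """Sargan statistic and p-value: one endogenous regressor, two instruments."""
    if len(instruments) != 2:
        raise ValueError("this sketch handles exactly two instruments (1 df)")
    n = len(y)
    x_hat = ols_fitted(instruments, x)               # first stage
    mean_x, mean_y, mean_h = sum(x) / n, sum(y) / n, sum(x_hat) / n
    beta = (sum((h - mean_h) * (yi - mean_y) for h, yi in zip(x_hat, y))
            / sum((h - mean_h) * (xi - mean_x) for h, xi in zip(x_hat, x)))
    alpha = mean_y - beta * mean_x
    resid = [yi - alpha - beta * xi for xi, yi in zip(x, y)]  # uses actual x
    fitted = ols_fitted(instruments, resid)          # residuals on instruments
    mean_r = sum(resid) / n
    r2 = (sum((f - mean_r) ** 2 for f in fitted)
          / sum((r - mean_r) ** 2 for r in resid))
    stat = n * r2
    p_value = math.erfc(math.sqrt(stat / 2))         # chi-square, 1 df
    return stat, p_value


def simulate_overid(n, direct_effect, seed=0):
    """Assumed model; a nonzero direct_effect makes the second instrument invalid."""
    rng = random.Random(seed)
    z1, z2, x, y = [], [], [], []
    for _ in range(n):
        c = rng.gauss(0, 1)                          # unobserved confounder
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        xi = a + b + c + rng.gauss(0, 1)
        yi = 1.0 * xi + direct_effect * b + 2.0 * c + rng.gauss(0, 1)
        z1.append(a); z2.append(b); x.append(xi); y.append(yi)
    return [z1, z2], x, y
```

With `direct_effect=1.0` the second instrument affects the outcome directly, and the statistic should be far above the 5% critical value of 3.84 for one degree of freedom.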
+## Classic IV Examples
+### Famous Instruments in Economics
+```
+Research Question          | Endogenous Var  | Instrument
+---------------------------|-----------------|------------------
+Returns to education       | Years of school | Quarter of birth (Angrist & Krueger)
+Effect of institutions     | Institutions    | Settler mortality (Acemoglu et al.)
+Effect of trade on growth  | Trade openness  | Geography (Frankel & Romer)
+Effect of military service | Veteran status  | Draft lottery number (Angrist)
+Price elasticity of demand | Price           | Supply shifters (cost, weather)
+## Reporting IV Results
+### Required Elements
+```
+1. Justify instrument choice with economic/theoretical reasoning
+2. Report first-stage regression results:
+   - Coefficient of Z on X with standard error
+   - First-stage F-statistic
+3. Report second-stage (2SLS) results:
+   - IV coefficient with robust standard errors
+   - Compare with OLS estimate (discuss direction of bias)
+4. Report diagnostic tests:
+   - Weak instrument test (F-statistic or Kleibergen-Paap)
+   - Overidentification test if applicable (Hansen J)
+   - Endogeneity test (Hausman or Durbin-Wu-Hausman)
+5. Discuss threats to instrument validity
+   - Can the exclusion restriction be challenged?
+   - Are there plausible alternative channels?
+```
+Always present both OLS and IV estimates side by side. The comparison helps readers understand the direction and magnitude of endogeneity bias and assess whether the IV correction is meaningful.

package/skills/analysis/econometrics/panel-data-guide/SKILL.md ADDED

@@ -0,0 +1,274 @@
+---
+name: panel-data-guide
+description: "Panel data analysis with fixed and random effects models"
+metadata:
+  openclaw:
+    emoji: "table"
+    category: "analysis"
+    subcategory: "econometrics"
+    keywords: ["panel data", "fixed effects", "random effects", "Stata commands"]
+    source: "wentor-research-plugins"
+---
+# Panel Data Analysis Guide
+Estimate and interpret fixed effects, random effects, and dynamic panel models using Stata, R, and Python for longitudinal/panel datasets.
+## What Is Panel Data?
+Panel data (also called longitudinal or cross-sectional time-series data) tracks the same units (individuals, firms, countries) across multiple time periods. This structure enables:
+- Controlling for unobserved heterogeneity (time-invariant omitted variables)
+- Studying dynamic relationships (how X at time t affects Y at time t+1)
+- Increased statistical power through more observations
+### Data Structure
+```
+| unit_id | year | gdp_growth | investment | trade_openness |
+|---------|------|-----------|------------|----------------|
+| USA     | 2015 | 2.9       | 20.5       | 28.3           |
+| USA     | 2016 | 1.7       | 20.1       | 27.1           |
+| USA     | 2017 | 2.3       | 20.8       | 27.5           |
+| CHN     | 2015 | 6.9       | 43.3       | 39.9           |
+| CHN     | 2016 | 6.7       | 42.7       | 37.2           |
+| CHN     | 2017 | 6.9       | 43.1       | 38.1           |
+```
+Key notation:
+- i = unit (cross-sectional dimension): i = 1, ..., N
+- t = time period: t = 1, ..., T
+- Y_it = dependent variable for unit i at time t
+## Model Specification
+### Pooled OLS
+```
+Y_it = alpha + beta * X_it + epsilon_it
+```
+Ignores panel structure; assumes no unit-specific effects. Rarely appropriate.
+### Fixed Effects (FE) Model
+```
+Y_it = alpha_i + beta * X_it + epsilon_it
+```
+Each unit has its own intercept (alpha_i) that captures all time-invariant unobserved heterogeneity. The "within" estimator removes alpha_i by demeaning.
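The demeaning step can be illustrated for a single regressor in pure Python. This is a sketch under assumed conditions: the simulated model, the coefficient values, and the names `within_slope`, `pooled_slope`, and `simulate_panel` are all hypothetical, and the code returns point estimates only.

```python
import random
from collections import defaultdict


def within_slope(unit_ids, x, y):
    """Fixed effects (within) slope for one regressor via unit-level demeaning."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])   # per unit: sum x, sum y, count
    for unit, xi, yi in zip(unit_ids, x, y):
        totals[unit][0] += xi
        totals[unit][1] += yi
        totals[unit][2] += 1
    num = den = 0.0
    for unit, xi, yi in zip(unit_ids, x, y):
        sum_x, sum_y, count = totals[unit]
        x_dm = xi - sum_x / count                 # deviation from the unit mean
        y_dm = yi - sum_y / count
        num += x_dm * y_dm
        den += x_dm ** 2
    return num / den


def pooled_slope(x, y):
    """Pooled OLS slope that ignores the panel structure."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    den = sum((xi - mean_x) ** 2 for xi in x)
    return num / den


def simulate_panel(n_units, n_periods, seed=0):
    """Assumed model: Y_it = 2.0*a_i + 0.5*X_it + e_it, with X_it = a_i + v_it."""
    rng = random.Random(seed)
    ids, x, y = [], [], []
    for i in range(n_units):
        a_i = rng.gauss(0, 1)                     # time-invariant unit effect
        for _ in range(n_periods):
            xi = a_i + rng.gauss(0, 1)            # X is correlated with a_i
            yi = 2.0 * a_i + 0.5 * xi + rng.gauss(0, 1)
            ids.append(i)
            x.append(xi)
            y.append(yi)
    return ids, x, y
```

Under this assumed model the pooled slope converges to 1.5 because the unit effect is correlated with X, while the within slope recovers the true 0.5.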
+### Random Effects (RE) Model
+```
+Y_it = alpha + beta * X_it + u_i + epsilon_it
+```
+The unit-specific effect u_i is treated as random and uncorrelated with X_it.
+## Estimation in Stata
+### Setting Up Panel Data
+```stata
+* Declare panel structure (the panel variable must be numeric; encode string IDs first)
+xtset country_id year
+* Summarize within and between variation
+xtsum gdp_growth investment trade_openness
+```
+### Fixed Effects
+```stata
+* Fixed effects regression
+xtreg gdp_growth investment trade_openness, fe
+* Store results for Hausman test
+estimates store FE
+* Fixed effects with robust standard errors (clustered by unit)
+xtreg gdp_growth investment trade_openness, fe vce(cluster country_id)
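+* Joint significance of the unit effects: the F test that all u_i = 0 is reported
+* at the bottom of the non-robust xtreg, fe output. To test it explicitly:
+regress gdp_growth investment trade_openness i.country_id
+testparm i.country_id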
+```
+### Random Effects
+```stata
+* Random effects regression
+xtreg gdp_growth investment trade_openness, re
+* Store results for Hausman test
+estimates store RE
+```
+### Hausman Test (FE vs. RE)
+```stata
+* Hausman specification test
+hausman FE RE
+* If p < 0.05: reject the null that RE is consistent; use FE
+* If p > 0.05: no evidence against RE; RE is the more efficient choice
+```
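For a single coefficient, the Hausman statistic reduces to the squared difference between the FE and RE estimates divided by the difference in their variances. The sketch below assumes conventional (non-robust) standard errors and one coefficient only; the function name and the input values in the usage note are hypothetical.

```python
import math


def hausman_one_coefficient(b_fe, se_fe, b_re, se_re):
    """Hausman statistic and p-value when comparing a single coefficient.

    Assumes conventional standard errors, under which RE is efficient and
    Var(b_fe) - Var(b_re) is the variance of the difference.
    """
    var_diff = se_fe ** 2 - se_re ** 2
    if var_diff <= 0:
        raise ValueError("variance difference is not positive; test undefined")
    stat = (b_fe - b_re) ** 2 / var_diff
    p_value = math.erfc(math.sqrt(stat / 2))      # chi-square with 1 df
    return stat, p_value
```

For example, `hausman_one_coefficient(0.5, 0.1, 0.3, 0.06)` yields a statistic of about 6.25, which rejects RE at the 5% level. With several coefficients the statistic uses the full covariance matrices, which is what `hausman FE RE` computes.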
+### First Differences
+```stata
+* First-differenced regression (alternative to FE)
+reg D.gdp_growth D.investment D.trade_openness, vce(cluster country_id)
+```
+## Estimation in R (plm Package)
+```r
+library(plm)
+# Convert to panel data frame
+pdata <- pdata.frame(mydata, index = c("country_id", "year"))
+# Fixed effects
+fe_model <- plm(gdp_growth ~ investment + trade_openness,
+                data = pdata, model = "within")
+summary(fe_model)
+# Random effects
+re_model <- plm(gdp_growth ~ investment + trade_openness,
+                data = pdata, model = "random")
+summary(re_model)
+# Hausman test
+phtest(fe_model, re_model)
+# Clustered standard errors
+library(lmtest)
+library(sandwich)
+coeftest(fe_model, vcov = vcovHC(fe_model, type = "HC1", cluster = "group"))
+# Time fixed effects
+fe_twoway <- plm(gdp_growth ~ investment + trade_openness + factor(year),
+                 data = pdata, model = "within")
+# Test for time fixed effects
+pFtest(fe_twoway, fe_model)
+```
+## Estimation in Python (linearmodels)
+```python
+import pandas as pd
+from linearmodels.panel import PanelOLS, RandomEffects, compare
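+# `data` is assumed to be a DataFrame holding country_id, year, and the model variables
+# Set multi-index for panel structure (entity first, then time)
+data = data.set_index(["country_id", "year"])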
+# Fixed effects
+fe = PanelOLS.from_formula(
+    "gdp_growth ~ investment + trade_openness + EntityEffects",
+    data=data
+)
+fe_result = fe.fit(cov_type="clustered", cluster_entity=True)
+print(fe_result.summary)
+# Random effects
+re = RandomEffects.from_formula(
+    "gdp_growth ~ investment + trade_openness + 1",
+    data=data
+)
+re_result = re.fit()
+print(re_result.summary)
+# Two-way fixed effects (entity + time)
+twoway = PanelOLS.from_formula(
+    "gdp_growth ~ investment + trade_openness + EntityEffects + TimeEffects",
+    data=data
+)
+twoway_result = twoway.fit(cov_type="clustered", cluster_entity=True)
+print(twoway_result.summary)
+# Compare models
+print(compare({"FE": fe_result, "RE": re_result, "Two-way FE": twoway_result}))
+```
+## Diagnostic Tests
+### Testing for Panel Effects
+| Test | Stata | R | Null Hypothesis |
+|------|-------|---|----------------|
+| F-test for FE | Built into `xtreg, fe` | `pFtest()` | All alpha_i are equal (pooled OLS is appropriate) |
+| Breusch-Pagan LM | `xttest0` | `plmtest()` | Var(u_i) = 0 (pooled OLS vs. RE) |
+| Hausman | `hausman FE RE` | `phtest()` | RE is consistent (u_i uncorrelated with X) |
+### Testing for Serial Correlation
+```stata
+* Wooldridge test for serial correlation in panel data
+* (xtserial is a community-contributed command and must be installed first)
+xtserial gdp_growth investment trade_openness
+* If p < 0.05: serial correlation present; use clustered SE or AR(1) correction
+```
+```r
+# Breusch-Godfrey/Wooldridge test for serial correlation in the idiosyncratic errors
+pbgtest(fe_model)
+```
+### Testing for Heteroskedasticity
+```stata
+* Modified Wald test for groupwise heteroskedasticity
+* (community-contributed; run immediately after xtreg, fe)
+xttest3
+* If p < 0.05: heteroskedasticity present; use robust/clustered SE
+```
+## Advanced Panel Models
+### Dynamic Panel (Arellano-Bond GMM)
+When a lagged dependent variable is included as a regressor:
+```stata
+* Arellano-Bond one-step GMM
+xtabond gdp_growth investment trade_openness, lags(1) vce(robust)
+* System GMM (Blundell-Bond) - more efficient
+xtdpdsys gdp_growth investment trade_openness, lags(1) vce(robust)
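+* Sargan test for overidentifying restrictions (not available after vce(robust);
+* refit without it, then run: estat sargan)
+* Arellano-Bond test for serial correlation; AR(2) should not be significant
+estat abond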
+```
+### Difference-in-Differences (DID)
+```stata
+* Basic DID with unit fixed effects (the main effect of treated is absorbed by the FE)
+xtreg outcome treated##post, fe vce(cluster unit_id)
+* Event study specification
+xtreg outcome i.relative_time##treated, fe vce(cluster unit_id)
+```
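In the simplest two-group, two-period setting, the DID estimate is just a difference of four cell means. The sketch below makes that arithmetic explicit; the function name, the record layout, and the numbers in the usage note are hypothetical and chosen only for illustration.

```python
def did_estimate(records):
    """Difference-in-differences from (treated, post, outcome) records.

    treated and post are 0/1 flags. Every one of the four cells must contain
    at least one record, otherwise a ZeroDivisionError is raised.
    """
    sums = {(g, p): [0.0, 0] for g in (0, 1) for p in (0, 1)}
    for treated, post, outcome in records:
        cell = sums[(treated, post)]
        cell[0] += outcome                        # running total for the cell
        cell[1] += 1                              # number of records in the cell
    means = {key: total / count for key, (total, count) in sums.items()}
    change_treated = means[(1, 1)] - means[(1, 0)]
    change_control = means[(0, 1)] - means[(0, 0)]
    return change_treated - change_control
```

With treated outcomes of 10 and 12 before and 15 and 17 after, and control outcomes of 8 and 10 before and 9 and 11 after, the function returns (16 - 11) - (10 - 9) = 4.0. In a balanced two-period panel this matches the interaction coefficient from the regression form.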
+## Reporting Results
+```
+Table X: Panel Regression Results
+Dependent Variable: GDP Growth (%)
+                      (1)         (2)         (3)
+                      FE          RE          Two-way FE
+Investment           0.125***    0.118***    0.131***
+                    (0.032)     (0.029)     (0.035)
+Trade Openness       0.045**     0.051**     0.038*
+                    (0.018)     (0.017)     (0.020)
+Entity FE             Yes         No         Yes
+Time FE               No          No         Yes
+Observations          850         850        850
+R-squared (within)   0.234       0.228      0.267
+Hausman test (p)       --        0.003        --
+Notes: Robust standard errors clustered at the country level in
+parentheses. * p<0.10, ** p<0.05, *** p<0.01.
+```

package/skills/analysis/econometrics/robustness-checks/SKILL.md ADDED

@@ -0,0 +1,250 @@
+---
+name: robustness-checks
+description: "Sequential robustness checks in Stata with confounder blocks"
+metadata:
+  openclaw:
+    emoji: "🔬"
+    category: "analysis"
+    subcategory: "econometrics"
+    keywords: ["robustness checks", "sensitivity analysis", "Stata regression", "panel data", "causal inference", "fixed effects"]
+    source: "https://github.com/awesome-econ-ai/academic-skills"
+---
+# Robustness Checks
+A skill for conducting sequential robustness checks in Stata, systematically adding blocks of potential confounders to assess estimate stability.
+## Quick Start
+```stata
+* Base model
+svy: regress outcome controls treatment
+estimates store m1
+* Add confounder block
+svy: regress outcome controls treatment confounder1 confounder2
+estimates store m2
+* Compare
+esttab m1 m2, se star(+ 0.1 * 0.05 ** 0.01)
+```
+## Key Patterns
+### 1. Sequential Model Building
+```stata
+* Define base controls
+local control_var i.batch age i.race i.gender i.education
+estimates clear
+* Model 1: Base model
+svy: regress outcome `control_var' treatment
+margins, dydx(treatment) post
+estimates store m1
+* Model 2: Add contextual factors
+svy: regress outcome `control_var' treatment covid health_insurance
+margins, dydx(treatment) post
+estimates store m2
+* Model 3: Add health factors
+svy: regress outcome `control_var' treatment cci_charlson any_encounter
+margins, dydx(treatment) post
+estimates store m3
+* Model 4: Add psychological factors
+svy: regress outcome `control_var' treatment depression anxiety
+margins, dydx(treatment) post
+estimates store m4
+* Model 5: Add behavioral factors
+svy: regress outcome `control_var' treatment i.smoke_status bmi
+margins, dydx(treatment) post
+estimates store m5
+```
+### 2. Standard Robustness Check Template
+```stata
+*------------------------------------------------------------
+* Table: Robustness Checks
+*------------------------------------------------------------
+version 17
+clear all
+use "analysis_data.dta", clear
+svyset cluster [pweight = weight]
+* Base controls (always included)
+local control_var i.batch leukocytes age i.race i.gender i.education i.marital
+estimates clear
+*--- Model 1: Baseline ---
+svy: regress outcome `control_var' treatment
+margins, dydx(treatment) post
+estimates store m1
+*--- Model 2: + COVID & Insurance ---
+svy: regress outcome `control_var' treatment covid health_insurance
+margins, dydx(treatment) post
+estimates store m2
+*--- Model 3: + Healthcare utilization ---
+svy: regress outcome `control_var' treatment cci_charlson any_encounter_3years
+margins, dydx(treatment) post
+estimates store m3
+*--- Model 4: + Multimorbidity ---
+svy: regress outcome `control_var' treatment multi_morbidity
+margins, dydx(treatment) post
+estimates store m4
+*--- Model 5: + Psychosocial factors ---
+svy: regress outcome `control_var' treatment matter_important matter_depend
+margins, dydx(treatment) post
+estimates store m5
+*--- Model 6: + Occupation ---
+svy: regress outcome `control_var' treatment i.occ_group
+margins, dydx(treatment) post
+estimates store m6
+*--- Model 7: + Smoking ---
+svy: regress outcome `control_var' treatment i.smoke_status
+margins, dydx(treatment) post
+estimates store m7
+*--- Model 8: + Childhood adversity ---
+svy: regress outcome `control_var' treatment c.aces_sum_std
+margins, dydx(treatment) post
+estimates store m8
+*--- Export ---
+esttab m1 m2 m3 m4 m5 m6 m7 m8 using "robustness.csv", csv se ///
+  mtitle("Base" "+COVID" "+Health" "+Morbid" "+Psych" "+Occ" "+Smoke" "+ACE") ///
+  nogap label replace star(+ 0.1 * 0.05 ** 0.01)
+```
+### 3. Multiple Outcomes
+```stata
+* Repeat for each outcome
+* (clear stored estimates once, before the loop, so earlier outcomes are kept)
+estimates clear
+foreach outcome in pace grimage2 phenoage {
+  svy: regress `outcome' `control_var' treatment
+  margins, dydx(treatment) post
+  estimates store `outcome'_m1
+  svy: regress `outcome' `control_var' treatment covid health_insurance
+  margins, dydx(treatment) post
+  estimates store `outcome'_m2
+  svy: regress `outcome' `control_var' treatment cci_charlson any_encounter
+  margins, dydx(treatment) post
+  estimates store `outcome'_m3
+}
+* Export all
+esttab pace_m1 pace_m2 pace_m3 grimage2_m1 grimage2_m2 grimage2_m3 ///
+  phenoage_m1 phenoage_m2 phenoage_m3 ///
+  using "robustness_all.csv", csv se nogap label replace
+```
+### 4. Model Specification Checks
+```stata
+estimates clear
+* Linear specification
+svy: regress outcome `control_var' treatment
+estimates store linear
+* Logged outcome
+gen log_outcome = ln(outcome + 1)
+svy: regress log_outcome `control_var' treatment
+estimates store log_linear
+* Categorical treatment
+svy: regress outcome `control_var' i.treatment_cat
+estimates store categorical
+* With squared term
+svy: regress outcome `control_var' c.treatment##c.treatment
+estimates store quadratic
+esttab linear log_linear categorical quadratic using "spec_checks.csv", ///
+  csv se nogap label replace
+```
+### 5. Sample Restriction Checks
+```stata
+estimates clear
+* Full sample
+svy: regress outcome `control_var' treatment
+estimates store full
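+* Exclude outliers (above the unweighted 99th percentile of the outcome)
+summarize outcome, detail
+scalar p99_outcome = r(p99)
+svy: regress outcome `control_var' treatment if outcome < p99_outcome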
+estimates store no_outliers
+* Complete cases only
+svy: regress outcome `control_var' treatment if complete_case == 1
+estimates store complete
+* Subpopulation
+svy, subpop(if age >= 50): regress outcome `control_var' treatment
+estimates store age50plus
+esttab full no_outliers complete age50plus using "sample_checks.csv", ///
+  csv se nogap label replace
+```
+### 6. Alternative Variable Definitions
+```stata
+estimates clear
+* Binary treatment
+svy: regress outcome `control_var' treatment_binary
+margins, dydx(treatment_binary) post
+estimates store binary
+* Continuous treatment
+svy: regress outcome `control_var' treatment_continuous
+margins, dydx(treatment_continuous) post
+estimates store continuous
+* Categorical treatment
+svy: regress outcome `control_var' i.treatment_cat
+margins, dydx(treatment_cat) post
+estimates store categorical
+* Standardized treatment
+svy: regress outcome `control_var' c.treatment_std
+margins, dydx(treatment_std) post
+estimates store standardized
+esttab binary continuous categorical standardized using "alt_definitions.csv", ///
+  csv se nogap label replace
+```
+## Interpretation Guide
+| Result | Interpretation |
+|--------|----------------|
+| Estimate stable across models | Robust to the observed confounders tested |
+| Estimate attenuates with additions | Confounding by the added variables is likely |
+| Estimate reverses sign | Serious confounding concern |
+| Estimate strengthens | Possible suppression effect |
+| SE increases substantially | Possible multicollinearity |
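Once the coefficients are exported, the comparison in the table can be tabulated mechanically. The sketch below is one possible way to do so, not part of the skill itself: the function name, the model labels, and the example values are hypothetical, and the flags are purely descriptive (what counts as "stable" remains a judgment call).

```python
def summarize_stability(estimates, baseline="m1"):
    """Compare each model's treatment coefficient with the baseline model.

    estimates maps a model label to its treatment coefficient. Returns a list
    of (label, coefficient, percent change vs. baseline, flag) tuples.
    """
    base = estimates[baseline]
    rows = []
    for label, coef in estimates.items():
        pct_change = 100.0 * (coef - base) / abs(base)
        if coef * base < 0:
            flag = "sign reversal"
        elif abs(coef) < abs(base):
            flag = "attenuated"
        elif abs(coef) > abs(base):
            flag = "strengthened"
        else:
            flag = "unchanged"
        rows.append((label, coef, pct_change, flag))
    return rows
```

For instance, `summarize_stability({"m1": 0.40, "m2": 0.30, "m3": -0.10})` flags `m2` as attenuated (about -25%) and `m3` as a sign reversal.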
+## Tips
+- Start with theoretically motivated confounder blocks
+- Order blocks from most to least plausible confounders
+- Document the rationale for each block
+- Present all models, not just the "best" one
+- Watch for substantial increases in standard errors (multicollinearity)
+- Consider pre-registering the robustness check plan