@wentorai/research-plugins 1.0.0 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (203)

package/skills/analysis/econometrics/econml-causal-guide/SKILL.md ADDED

@@ -0,0 +1,163 @@
+---
+name: econml-causal-guide
+description: "Apply EconML for causal inference combining machine learning and econometrics"
+metadata:
+  openclaw:
+    emoji: "🔬"
+    category: analysis
+    subcategory: econometrics
+    keywords: ["causal-inference", "machine-learning", "treatment-effects", "econometrics", "microsoft", "double-ml"]
+    source: "https://github.com/py-why/EconML"
+---
+# EconML Causal Inference Guide
+## Overview
+EconML is a Python package developed by Microsoft Research as part of the ALICE (Automated Learning and Intelligence for Causation and Economics) project. It provides a comprehensive suite of methods for estimating heterogeneous treatment effects from observational data, bridging the gap between modern machine learning and classical econometric techniques for causal inference.
+Traditional econometric approaches to causal inference often rely on strong parametric assumptions and struggle with high-dimensional data. Pure machine learning methods excel at prediction but do not inherently distinguish correlation from causation. EconML combines the strengths of both paradigms, offering methods that leverage the flexibility of ML for nuisance parameter estimation while maintaining the rigorous causal identification guarantees of econometric theory.
+The library implements cutting-edge methods from the academic literature including Double Machine Learning (DML), Causal Forests, Doubly Robust Learners, Orthogonal Random Forests, and Instrumental Variable methods with ML first stages. These tools are essential for researchers across economics, public health, education policy, and any field where understanding causal mechanisms from non-experimental data is critical.
+## Installation and Setup
+Install EconML via pip:
+```bash
+pip install econml
+```
+For the full feature set including optional dependencies:
+```bash
+pip install econml[all]
+```
+EconML builds on top of scikit-learn and integrates with the broader Python data science ecosystem. Core dependencies include numpy, scipy, pandas, scikit-learn, and statsmodels. Optional dependencies for specific estimators include LightGBM and PyTorch.
+Verify installation:
+```python
+import econml
+print(econml.__version__)
+from econml.dml import LinearDML
+from econml.orf import DMLOrthoForest
+print("EconML loaded successfully")
+```
+## Core Estimators and Methods
+**Double Machine Learning (DML)**: The workhorse method for estimating average and heterogeneous treatment effects while controlling for high-dimensional confounders. DML uses cross-fitting and orthogonalization to eliminate regularization bias:
+```python
+from econml.dml import LinearDML, CausalForestDML
+from sklearn.ensemble import GradientBoostingRegressor
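+# Linear DML for parametric treatment effect estimation (continuous treatment assumed;
+# for a binary T, pass a classifier as model_t and set discrete_treatment=True)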
+est = LinearDML(
+    model_y=GradientBoostingRegressor(),
+    model_t=GradientBoostingRegressor(),
+    cv=5,
+    random_state=42
+)
+est.fit(Y, T, X=X, W=W)
+# Per-row treatment effect estimates with 95% confidence intervals
+effect = est.effect(X_test)
+ci = est.effect_interval(X_test, alpha=0.05)
+# The model was fit with X, so the ATE is averaged over the rows of X_test
+print(f"ATE: {est.ate(X_test):.4f}")
+print(f"ATE 95% CI: {est.ate_interval(X_test, alpha=0.05)}")
+```
+Here `Y` is the outcome, `T` is the treatment, `X` contains effect modifiers (features for heterogeneity), and `W` contains additional confounders.
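The cross-fitting and orthogonalization steps described above can be sketched without EconML. The following is a minimal illustration under stated assumptions, not EconML's implementation: the data are simulated, there is a single confounder `W`, both nuisance models are simple linear fits, and the helper names `fit_line` and `cross_fit_residuals` are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def cross_fit_residuals(w, target, n_folds=2):
    """Residualize target on w; each fold is predicted by a model fit on the other folds."""
    n = len(w)
    resid = [0.0] * n
    for k in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == k]
        train_idx = [i for i in range(n) if i % n_folds != k]
        a, b = fit_line([w[i] for i in train_idx], [target[i] for i in train_idx])
        for i in test_idx:
            resid[i] = target[i] - (a + b * w[i])
    return resid

# Simulated data (an assumption for illustration): the true effect of T on Y is 2.0
rng = random.Random(0)
n = 2000
W = [rng.gauss(0, 1) for _ in range(n)]
T = [0.8 * w + rng.gauss(0, 1) for w in W]
Y = [2.0 * t + 1.5 * w + rng.gauss(0, 1) for t, w in zip(T, W)]

y_res = cross_fit_residuals(W, Y)
t_res = cross_fit_residuals(W, T)

# Final stage: regress outcome residuals on treatment residuals (no intercept)
theta_hat = sum(t * y for t, y in zip(t_res, y_res)) / sum(t * t for t in t_res)

# Naive regression of Y on T alone is confounded by W
_, naive = fit_line(T, Y)
print(f"DML estimate: {theta_hat:.3f}, naive estimate: {naive:.3f}")
```

The residual-on-residual estimate should land near the simulated effect of 2.0, while the naive slope absorbs part of the effect of `W`.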
+**Causal Forest DML**: Combines DML orthogonalization with Causal Forest estimation for flexible, nonparametric heterogeneous treatment effects:
+```python
+from econml.dml import CausalForestDML
+from sklearn.ensemble import GradientBoostingRegressor
+cf_est = CausalForestDML(
+    model_y=GradientBoostingRegressor(),
+    model_t=GradientBoostingRegressor(),
+    n_estimators=200,
+    min_samples_leaf=10,
+    cv=5,
+    random_state=42
+)
+cf_est.fit(Y, T, X=X, W=W)
+# Heterogeneous treatment effects
+hte = cf_est.effect(X_test)
+# Feature importance for treatment effect heterogeneity
+importances = cf_est.feature_importances_
+```
+**Doubly Robust Learner**: Provides consistent treatment effect estimates when either the outcome model or the propensity score model is correctly specified:
+```python
+from econml.dr import DRLearner
+from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
+dr_est = DRLearner(
+    model_propensity=RandomForestClassifier(),
+    model_regression=RandomForestRegressor(),
+    model_final=RandomForestRegressor(),
+    cv=5
+)
+dr_est.fit(Y, T, X=X, W=W)
+```
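The double robustness property can be illustrated with the augmented inverse-probability-weighting (AIPW) score. This is a sketch on simulated data, not the `DRLearner` internals: it skips cross-fitting, uses a single binary covariate, and the function name `aipw_ate` is hypothetical. The outcome model is deliberately wrong while the propensity model is correct.

```python
import random

# Simulated data (an assumption for illustration): true ATE is 2.0
rng = random.Random(1)
n = 5000
X = [rng.randint(0, 1) for _ in range(n)]
T = [1 if rng.random() < (0.7 if x else 0.3) else 0 for x in X]
Y = [1.0 + 2.0 * t + 3.0 * x + rng.gauss(0, 1) for t, x in zip(T, X)]

def stratum_mean(values, mask):
    """Mean of the values whose mask entry is True."""
    picked = [v for v, m in zip(values, mask) if m]
    return sum(picked) / len(picked)

# Propensity model: treated share within each X stratum (correctly specified here)
e_hat = {x: stratum_mean(T, [xi == x for xi in X]) for x in (0, 1)}

def aipw_ate(mu0, mu1):
    """AIPW estimate of the ATE given outcome-model predictions mu0(x) and mu1(x)."""
    total = 0.0
    for x, t, y in zip(X, T, Y):
        e = e_hat[x]
        total += (mu1(x) - mu0(x)
                  + t * (y - mu1(x)) / e
                  - (1 - t) * (y - mu0(x)) / (1 - e))
    return total / n

# Deliberately wrong outcome model: predicts 0 for everyone
ate_bad_outcome = aipw_ate(lambda x: 0.0, lambda x: 0.0)

# Naive difference in means ignores that treated units tend to have X = 1
naive = stratum_mean(Y, [t == 1 for t in T]) - stratum_mean(Y, [t == 0 for t in T])
print(f"AIPW with wrong outcome model: {ate_bad_outcome:.3f}, naive: {naive:.3f}")
```

Because the propensity model is right, the AIPW estimate should stay close to 2.0 even though the outcome model is useless; the naive comparison is biased upward.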
+**Instrumental Variable Methods**: For settings where unobserved confounding is present but valid instruments are available:
+```python
+from econml.iv.dml import DMLIV
+from sklearn.ensemble import GradientBoostingRegressor
+iv_est = DMLIV(
+    model_y_xw=GradientBoostingRegressor(),
+    model_t_xw=GradientBoostingRegressor(),
+    model_t_xwz=GradientBoostingRegressor(),
+    cv=5
+)
+iv_est.fit(Y, T, Z=Z, X=X, W=W)
+```
+## Research Workflow Integration
+**Experiment Analysis**: When randomized experiments suffer from non-compliance or attrition, use IV methods in EconML to recover local average treatment effects. The ML-based first stages handle complex relationships between instruments and treatment uptake.
+**Policy Evaluation**: Estimate heterogeneous treatment effects to identify which subpopulations benefit most from an intervention. The CATE (Conditional Average Treatment Effect) estimates can directly inform targeted policy design:
+```python
+# Identify subgroups with the largest estimated treatment effects
+# (assumes X_test is a pandas DataFrame, so that it has an index)
+import pandas as pd
+ci_lower, ci_upper = cf_est.effect_interval(X_test, alpha=0.05)
+effects_df = pd.DataFrame({
+    "effect": cf_est.effect(X_test).flatten(),
+    "ci_lower": ci_lower.flatten(),
+    "ci_upper": ci_upper.flatten()
+}, index=X_test.index)
+# The 100 rows with the largest point estimates
+top_group = effects_df.nlargest(100, "effect")
+```
+**Sensitivity Analysis**: Combine EconML estimates with sensitivity analysis frameworks to assess robustness to potential unobserved confounders. Report how much unmeasured confounding would be required to explain away your findings.
+**Publication-Ready Results**: EconML provides confidence intervals and hypothesis tests based on asymptotic theory, producing results suitable for peer-reviewed publications. Use the summary methods to generate formatted regression-style output.
+## Best Practices for Academic Research
+1. **Always validate assumptions**: DML requires conditional ignorability (selection on observables). Document your identification strategy clearly.
+2. **Cross-fitting is essential**: Never skip the cross-fitting step; it keeps overfitting in the nuisance models from biasing the final treatment effect estimate.
+3. **Report multiple estimators**: Present results from DML, DR Learner, and Causal Forest side by side to assess robustness.
+4. **Check overlap**: Verify sufficient overlap in covariate distributions between treated and control groups before estimation.
+5. **Use honest estimation**: EconML Causal Forests use sample splitting for honesty by default, ensuring valid inference.
+## References
+- EconML repository: https://github.com/py-why/EconML
+- EconML documentation: https://econml.azurewebsites.net/
+- Chernozhukov et al. (2018), Double/Debiased Machine Learning for Treatment and Structural Parameters
+- Athey and Imbens (2019), Machine Learning Methods That Economists Should Know About

package/skills/analysis/econometrics/mostly-harmless-guide/SKILL.md ADDED

@@ -0,0 +1,139 @@
+---
+name: mostly-harmless-guide
+description: "Replication code and guide for Mostly Harmless Econometrics methods"
+version: 1.0.0
+author: wentor-community
+source: https://github.com/vikjam/mostly-harmless-replication
+metadata:
+  openclaw:
+    category: analysis
+    subcategory: econometrics
+    keywords:
+      - econometrics
+      - causal-inference
+      - replication
+      - regression
+      - instrumental-variables
+      - difference-in-differences
+---
+# Mostly Harmless Econometrics Guide
+A skill providing replication code, explanations, and practical guidance for the econometric methods presented in Angrist and Pischke's "Mostly Harmless Econometrics" (MHE). Based on the mostly-harmless-replication repository (642 stars), this skill helps researchers understand and correctly apply core causal inference techniques.
+## Overview
+"Mostly Harmless Econometrics" is one of the most influential applied econometrics textbooks, providing accessible explanations of the methods that dominate modern empirical research in economics and increasingly in other social sciences. This skill translates the book's core methods into practical guidance that the agent can use to help researchers design studies, select appropriate estimators, and interpret results correctly.
+The skill covers regression, instrumental variables, difference-in-differences, regression discontinuity, and related methods, with emphasis on the practical decisions researchers face when applying these techniques to real data.
+## Regression Fundamentals
+**Ordinary Least Squares (OLS)**
+- OLS provides the best linear approximation to the conditional expectation function
+- The regression anatomy theorem: each coefficient can be obtained from a bivariate regression of the outcome on that regressor after it has been residualized on all other regressors
+- Omitted variable bias formula: bias equals the effect of the omitted variable on the outcome times the coefficient from regressing the omitted variable on the included regressor
+- Control variables should be selected based on the conditional independence assumption, not on statistical significance
+- Robust standard errors (Huber-White) should be the default; cluster when observations are not independent
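The omitted variable bias formula is an exact in-sample identity, which a small example can confirm. The data below are made-up toy values and the helper names are hypothetical; the point is only that the short-regression coefficient equals the long-regression coefficient plus the omitted variable's coefficient times the auxiliary regression slope.

```python
def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    xd, yd = demean(x), demean(y)
    return dot(xd, yd) / dot(xd, xd)

def two_regressor_slopes(x1, x2, y):
    """Slopes from OLS of y on x1 and x2 (with intercept), via the normal equations."""
    a, b, c = demean(x1), demean(x2), demean(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, c), dot(b, c)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Toy data (hypothetical): schooling is included, ability is omitted
schooling = [10, 12, 12, 14, 16, 16, 18, 20]
ability = [1.0, 1.5, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0]
log_wage = [2.1, 2.4, 2.7, 2.8, 3.4, 3.2, 3.9, 3.8]

beta_long, gamma = two_regressor_slopes(schooling, ability, log_wage)
beta_short = slope(schooling, log_wage)
delta = slope(schooling, ability)  # regression of the omitted on the included variable

print(f"short: {beta_short:.4f}, long + gamma*delta: {beta_long + gamma * delta:.4f}")
```

The two printed numbers agree up to floating-point error, so the bias in the short regression is exactly `gamma * delta`.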
+**Regression Interpretation**
+- The causal interpretation of regression requires the conditional independence assumption (CIA)
+- CIA states that treatment is as good as randomly assigned after conditioning on controls
+- Saturated models (fully interacted categorical variables) fit the conditional expectation function exactly, so functional form is not a concern
+- Linear regression with continuous variables approximates the true conditional expectation
+- Report both statistical and economic significance; a large t-statistic does not mean a large effect
+**Practical Decisions**
+- Include controls that are correlated with both the treatment and outcome
+- Do not include controls that are consequences of treatment (bad controls)
+- Use the most parsimonious specification that satisfies the CIA
+- Test sensitivity to alternative control sets to assess robustness
+- Report multiple specifications to demonstrate that results are not driven by a particular set of controls
+## Instrumental Variables
+**Core Concepts**
+- IV addresses endogeneity when the treatment is correlated with unobserved factors affecting the outcome
+- A valid instrument must be relevant (correlated with treatment) and excludable (affects outcome only through treatment)
+- Two-stage least squares (2SLS) is the standard IV estimator
+- The Wald estimator (reduced form divided by first stage) gives the IV estimate in the simplest case
+- IV estimates the Local Average Treatment Effect (LATE) for compliers
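The Wald estimator mentioned above is simple enough to compute by hand. A minimal sketch with a binary instrument, using made-up toy data (the variable names `z`, `d`, and `y` are illustrative assumptions):

```python
def mean(values):
    return sum(values) / len(values)

# Toy data (hypothetical): z = instrument (0/1), d = treatment taken (0/1), y = outcome
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 1, 0, 1, 1, 1]
y = [1.0, 2.0, 3.0, 6.0, 2.0, 5.0, 6.0, 7.0]

def group_mean(values, group):
    """Mean of values among units whose instrument equals group."""
    return mean([v for v, zi in zip(values, z) if zi == group])

reduced_form = group_mean(y, 1) - group_mean(y, 0)  # effect of z on y
first_stage = group_mean(d, 1) - group_mean(d, 0)   # effect of z on d
wald = reduced_form / first_stage

print(reduced_form, first_stage, wald)  # 2.0 0.5 4.0
```

The instrument moves the outcome by 2.0 and treatment uptake by 0.5, so the implied effect for compliers is 4.0.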
+**Implementation Guide**
+- Always report the first-stage F-statistic; by the conventional rule of thumb, values below 10 indicate weak instruments
+- Use the Anderson-Rubin test for inference robust to weak instruments
+- Over-identification tests (Sargan-Hansen) can detect violations of the exclusion restriction with multiple instruments, but cannot validate a just-identified model
+- Report the first stage, reduced form, and IV estimates together
+- Compare OLS and IV estimates; if IV is much larger, consider LATE interpretation or measurement error
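With a single instrument, the first-stage F-statistic is the squared t-statistic on the instrument. The sketch below computes it from scratch under the assumption of homoskedastic errors, on made-up toy data; robust or clustered versions would differ.

```python
import math

# Toy data (hypothetical): z = instrument, d = endogenous treatment
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 1, 0, 1, 1, 1]

n = len(z)
mz, md = sum(z) / n, sum(d) / n
szz = sum((zi - mz) ** 2 for zi in z)
szd = sum((zi - mz) * (di - md) for zi, di in zip(z, d))

pi_hat = szd / szz                # first-stage slope
intercept = md - pi_hat * mz
ssr = sum((di - intercept - pi_hat * zi) ** 2 for zi, di in zip(z, d))
se = math.sqrt(ssr / (n - 2) / szz)  # homoskedastic standard error of the slope
f_stat = (pi_hat / se) ** 2

print(f"first-stage slope: {pi_hat:.2f}, F: {f_stat:.2f}")
```

Here the slope is 0.5 but the F-statistic is only about 2, far below the rule-of-thumb threshold of 10, so this toy instrument would count as weak.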
+**Common Applications**
+- Returns to education using quarter of birth as an instrument
+- Effect of institutions on growth using settler mortality as an instrument
+- Peer effects using random assignment to groups
+- Supply and demand estimation using shift variables
+- Policy evaluation using eligibility rules as instruments
+## Difference-in-Differences
+**Design Principles**
+- DID compares changes in outcomes over time between treated and control groups
+- The parallel trends assumption: absent treatment, both groups would have followed the same trend
+- DID removes time-invariant unobserved confounders
+- The standard estimator is a two-way fixed effects regression (unit and time fixed effects plus treatment indicator)
+- Staggered adoption designs require careful attention to treatment timing heterogeneity
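In the two-group, two-period case the DID estimate is just a difference of two changes. A minimal sketch with made-up group means:

```python
# Group-by-period mean outcomes (hypothetical values)
means = {
    ("treated", "pre"): 10.0,
    ("treated", "post"): 15.0,
    ("control", "pre"): 8.0,
    ("control", "post"): 10.0,
}

change_treated = means[("treated", "post")] - means[("treated", "pre")]
change_control = means[("control", "post")] - means[("control", "pre")]

# Under parallel trends, the control change is the treated group's counterfactual change
did = change_treated - change_control
counterfactual_post = means[("treated", "pre")] + change_control

print(did, counterfactual_post)  # 3.0 12.0
```

The level gap between groups (10 versus 8 before treatment) drops out, which is the sense in which DID removes time-invariant confounders.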
+**Implementation**
+- Always plot pre-treatment trends to assess the parallel trends assumption visually
+- Include leads of the treatment indicator to test for pre-trends formally
+- Cluster standard errors at the group level (state, firm, school)
+- With few clusters (fewer than 50), use wild cluster bootstrap for inference
+- Consider synthetic control methods when the control group is not a natural comparator
+**Recent Developments**
+- Callaway and Sant'Anna (2021): heterogeneity-robust DID with staggered treatment
+- Sun and Abraham (2021): interaction-weighted estimator for event studies
+- de Chaisemartin and D'Haultfoeuille (2020): decomposition of two-way FE estimator
+- Goodman-Bacon (2021): DID with variation in treatment timing decomposition
+- These methods address bias in standard two-way FE when treatment effects are heterogeneous
+## Regression Discontinuity
+**Sharp RD Design**
+- Treatment is a deterministic function of a running variable at a known cutoff
+- Causal effect is identified at the cutoff by comparing outcomes just above and just below
+- Local linear regression is preferred over global polynomial fitting
+- Bandwidth selection should use data-driven methods (Imbens-Kalyanaraman, Calonico-Cattaneo-Titiunik)
+- Always show the RD plot: binned means of the outcome against the running variable
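A sharp RD estimate by local linear regression can be sketched as two kernel-weighted line fits, one on each side of the cutoff. This is an illustration on noiseless made-up data with a hand-picked bandwidth and a triangular kernel; it does not implement the data-driven bandwidth selectors named above, and the function names are hypothetical.

```python
def weighted_line(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

def rd_estimate(running, outcome, cutoff, bandwidth):
    """Sharp RD: difference in local linear intercepts at the cutoff."""
    intercepts = {}
    for side in ("below", "above"):
        xs, ys, ws = [], [], []
        for r, o in zip(running, outcome):
            dist = r - cutoff
            on_side = dist >= 0 if side == "above" else dist < 0
            if on_side and abs(dist) < bandwidth:
                xs.append(dist)
                ys.append(o)
                ws.append(1 - abs(dist) / bandwidth)  # triangular kernel weight
        intercepts[side], _ = weighted_line(xs, ys, ws)
    return intercepts["above"] - intercepts["below"]

# Noiseless data (hypothetical) with a jump of 2.0 at cutoff 0
running = [i / 20 for i in range(-20, 21)]
outcome = [1.0 + 0.5 * r + (2.0 if r >= 0 else 0.0) for r in running]

jump = rd_estimate(running, outcome, cutoff=0.0, bandwidth=0.5)
print(f"Estimated jump at the cutoff: {jump:.3f}")
```

Centering the running variable at the cutoff makes each fitted intercept the predicted outcome at the cutoff, so their difference is the RD estimate.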
+**Fuzzy RD Design**
+- Treatment probability jumps at the cutoff but is not deterministic
+- Fuzzy RD is analogous to IV, with an indicator for being above the cutoff as the instrument
+- Estimates a LATE for units whose treatment status is changed by crossing the cutoff
+- Report both the first stage (jump in treatment probability) and the reduced form (jump in outcome)
+- Validity requires that covariates and the density of the running variable are smooth at the cutoff (covariate balance, density test)
+**Practical Guidance**
+- Test for manipulation of the running variable using the McCrary density test
+- Show robustness to alternative bandwidth choices
+- Including covariates can improve precision, but the point estimate should not change substantially
+- Avoid high-order polynomial specifications that can be misleading
+- Report the effective sample size used in the local estimation
+## Integration with Research-Claw
+This skill enhances the Research-Claw econometric analysis workflow:
+- Guide researchers in selecting the appropriate causal inference method for their question
+- Help implement estimators correctly with proper standard errors and diagnostics
+- Provide code templates for common econometric analyses in R, Stata, and Python
+- Connect with data wrangling skills for cleaning and preparing analysis datasets
+- Support writing skills with correctly formatted regression tables and result descriptions
+## Best Practices
+- Start by clearly stating the causal question and the source of identification
+- Draw a directed acyclic graph (DAG) to clarify assumptions about causal relationships
+- Report all relevant diagnostics (first-stage F, pre-trends, balance tests)
+- Show robustness across specifications rather than selecting a single preferred model
+- Distinguish between statistical significance, economic significance, and policy relevance
+- Be transparent about the limitations of your identification strategy

package/skills/analysis/econometrics/panel-data-analyst/SKILL.md ADDED

@@ -0,0 +1,259 @@
+---
+name: panel-data-analyst
+description: "Expert panel data regression analysis with fixed effects and GMM"
+metadata:
+  openclaw:
+    emoji: "grid"
+    category: "analysis"
+    subcategory: "econometrics"
+    keywords: ["panel data", "fixed effects", "random effects", "GMM", "dynamic panel", "Hausman test"]
+    source: "https://www.stata.com/manuals/xt.pdf"
+---
+# Panel Data Analyst
+Perform expert-level panel data regression analysis including fixed effects, random effects, dynamic panel models (Arellano-Bond/Blundell-Bond GMM), and advanced diagnostic tests. This skill covers the full workflow from panel setup through model selection, estimation, and publication-ready reporting.
+## Overview
+Panel data -- repeated observations on the same cross-sectional units over time -- is the workhorse of modern empirical economics, finance, political science, and management research. Panel methods exploit both cross-sectional and temporal variation, enabling researchers to control for unobserved heterogeneity that would bias ordinary cross-sectional estimates.
+The choice between fixed effects, random effects, and dynamic panel estimators depends on the data structure, the nature of unobserved heterogeneity, and the identifying assumptions the researcher is willing to make. This skill provides a systematic decision framework and implementation in both Stata and R, with emphasis on the diagnostic tests that justify model selection.
+Beyond basic FE/RE models, this skill covers the advanced techniques increasingly required by journal reviewers: instrumental variables within panel frameworks, Driscoll-Kraay standard errors for cross-sectional dependence, correlated random effects (Mundlak/Chamberlain), and system GMM for dynamic panels with endogenous regressors.
+## Panel Data Setup
+### Declaring Panel Structure
+```stata
+* Stata panel setup
+xtset firm_id year
+xtset  // Verify panel structure
+* Check panel balance
+xtdescribe
+* Shows: min/max/avg observations per panel, gaps
+* Summary statistics by panel dimension
+xtsum revenue profit employees rnd_spending
+* Reports overall, between, and within variation
+```
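The overall, between, and within variation that `xtsum` reports can be reproduced by hand. The sketch below is in Python rather than Stata, uses a made-up balanced panel, and applies the sample standard deviation throughout; Stata's exact degrees-of-freedom conventions may differ slightly.

```python
from statistics import mean, stdev

# Balanced toy panel (hypothetical): revenue for 3 firms over 3 years
revenue = {
    "firm_a": [10.0, 12.0, 14.0],
    "firm_b": [20.0, 21.0, 22.0],
    "firm_c": [30.0, 30.0, 33.0],
}

all_obs = [v for series in revenue.values() for v in series]
grand_mean = mean(all_obs)
firm_means = {firm: mean(series) for firm, series in revenue.items()}

# Between: variation across firm means
between = list(firm_means.values())
# Within: deviations from each firm's own mean, recentered on the grand mean
within = [v - firm_means[firm] + grand_mean
          for firm, series in revenue.items() for v in series]

print(f"overall sd: {stdev(all_obs):.3f}")
print(f"between sd: {stdev(between):.3f}")
print(f"within sd:  {stdev(within):.3f}")
```

In this example most of the variation is between firms, which is the situation where fixed effects estimates rely on a small share of the total variation.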
+### Panel Diagnostics
+```stata
+* Check for gaps in panel
+xtset firm_id year
+gen gap = year - l.year if l.year != .
+tab gap  // All 1's when an annual panel has no gaps
+* Create balanced subsample
+bysort firm_id: gen T_i = _N
+tab T_i
+summarize T_i, meanonly
+keep if T_i == r(max)  // Keep only units observed in all periods
+* Attrition analysis
+gen in_panel = 1
+xtset firm_id year
+tsfill, full
+replace in_panel = 0 if missing(in_panel)
+reg in_panel l.revenue l.profit l.size, cluster(firm_id)
+```
+## Fixed Effects vs. Random Effects
+### Fixed Effects Estimation
+```stata
+* Within estimator (entity fixed effects)
+xtreg profit revenue rnd_spending employees i.year, fe robust
+estimates store fe_model
+* Entity and time fixed effects
+reghdfe profit revenue rnd_spending employees, ///
+    absorb(firm_id year) cluster(firm_id)
+estimates store twoway_fe
+* First-differences (alternative to within estimator)
+reg d.profit d.revenue d.rnd_spending d.employees i.year, ///
+    cluster(firm_id)
+estimates store fd_model
+```
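What the within estimator does can be shown on a tiny panel. In this Python sketch the data are constructed (an assumption for illustration) so that the firm effect `alpha` is correlated with `x` and the true slope is 2.0; the helper name `ols_slope` is hypothetical.

```python
def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

# Toy panel (hypothetical): y_it = alpha_i + 2.0 * x_it, alpha_i rises with x
panel = {
    "firm_a": {"alpha": 0.0, "x": [1.0, 2.0, 3.0]},
    "firm_b": {"alpha": 10.0, "x": [4.0, 5.0, 6.0]},
    "firm_c": {"alpha": 20.0, "x": [7.0, 8.0, 9.0]},
}

x_all, y_all, x_within, y_within = [], [], [], []
for firm in panel.values():
    xs = firm["x"]
    ys = [firm["alpha"] + 2.0 * x for x in xs]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    x_all += xs
    y_all += ys
    # Within transformation: subtract each firm's own mean
    x_within += [x - x_bar for x in xs]
    y_within += [y - y_bar for y in ys]

pooled_slope = ols_slope(x_all, y_all)
within_slope = ols_slope(x_within, y_within)
print(f"pooled OLS: {pooled_slope:.2f}, within (FE): {within_slope:.2f}")
# pooled OLS: 5.00, within (FE): 2.00
```

Demeaning removes `alpha`, so the within slope recovers 2.0, while pooled OLS attributes the firm effects to `x`.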
+### Random Effects Estimation
+```stata
+* GLS random effects
+xtreg profit revenue rnd_spending employees i.year, re robust
+estimates store re_model
+```
+### Hausman Test for Model Selection
+```stata
+* Classic Hausman test
+xtreg profit revenue rnd_spending employees, fe
+estimates store fe_haus
+xtreg profit revenue rnd_spending employees, re
+estimates store re_haus
+hausman fe_haus re_haus
+* Robust Hausman test (preferred with heteroskedasticity)
+* Mundlak (1978) approach: add group means to RE model
+foreach var of varlist revenue rnd_spending employees {
+    bysort firm_id: egen m_`var' = mean(`var')
+}
+xtreg profit revenue rnd_spending employees ///
+    m_revenue m_rnd_spending m_employees i.year, re cluster(firm_id)
+test m_revenue m_rnd_spending m_employees
+* Rejection => FE preferred; failure to reject => RE acceptable
+```
+## Dynamic Panel Models
+### Arellano-Bond GMM (Difference GMM)
+```stata
+* When the lagged dependent variable is a regressor:
+* y_it = alpha * y_{i,t-1} + X_it * beta + mu_i + epsilon_it
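+* Difference GMM (Arellano & Bond 1991)
+* lags(1) already adds l.profit, so it is not listed among the regressors
+xtabond profit revenue rnd_spending employees, ///
+    lags(1) twostep vce(robust) artests(2)
+* Diagnostics
+estat abond
+* AR(1) should be significant, AR(2) should NOT be significant
+* Overidentification: estat sargan is available only after a fit without vce(robust);
+* the Hansen J test (p > 0.10 desired) is reported by xtabond2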
+```
+### System GMM (Blundell-Bond)
+```stata
+* System GMM (Blundell & Bond 1998)
+* More efficient than difference GMM, especially with persistent series
+xtabond2 profit l.profit revenue rnd_spending employees i.year, ///
+    gmm(l.profit, lag(2 4) collapse) ///
+    gmm(revenue rnd_spending, lag(2 3) collapse) ///
+    iv(employees i.year) ///
+    twostep robust orthogonal small
+estimates store gmm_model  // referenced by the esttab comparison table
+* Key diagnostics to report:
+* 1. Number of instruments (should not exceed number of groups)
+* 2. Hansen J test p-value (> 0.10; values near 1.00 signal instrument proliferation)
+* 3. AR(2) test p-value (> 0.10 for valid instruments)
+* 4. Difference-in-Hansen test for subset of instruments
+```
+### GMM Diagnostic Checklist
+| Test | Null Hypothesis | Desired Result | Stata Command |
+|------|----------------|----------------|---------------|
+| AR(1) | No first-order autocorrelation | Reject (p < 0.05) | Reported automatically |
+| AR(2) | No second-order autocorrelation | Fail to reject (p > 0.10) | Reported automatically |
+| Hansen J | Instruments are valid | Fail to reject (p > 0.10) | Reported automatically |
+| Diff-in-Hansen | Level instruments valid | Fail to reject (p > 0.10) | Reported automatically |
+| Instrument count | -- | N_instruments < N_groups | Check output |
+## Standard Error Options
+### Choosing the Right Standard Errors
+```stata
+* Entity-clustered (default choice for firm panels)
+xtreg profit revenue rnd_spending, fe cluster(firm_id)
+* Two-way clustering (firm and year)
+reghdfe profit revenue rnd_spending, ///
+    absorb(firm_id) cluster(firm_id year)
+* Driscoll-Kraay standard errors (cross-sectional dependence)
+xtscc profit revenue rnd_spending i.year, fe lag(3)
+* Panel-corrected standard errors (Prais-Winsten regression with AR(1) errors)
+xtpcse profit revenue rnd_spending i.firm_id, correlation(ar1)
+```
+### Diagnostic Tests for Standard Error Selection
+```stata
+* Test for heteroskedasticity in FE model
+xtreg profit revenue rnd_spending, fe
+xttest3  // Modified Wald test (rejects => use robust/cluster SE)
+* Test for serial correlation
+xtserial profit revenue rnd_spending
+* Wooldridge test (rejects => use cluster SE or Newey-West)
+* Test for cross-sectional dependence
+xtreg profit revenue rnd_spending, fe
+xtcsd, pesaran abs
+* Pesaran CD test (rejects => consider Driscoll-Kraay SE)
+```
+## Advanced Specifications
+### Interaction Effects in Panel Models
+```stata
+* Continuous x continuous interaction with FE
+xtreg profit c.rnd_spending##c.market_share i.year, fe cluster(firm_id)
+* Visualize marginal effect
+margins, dydx(rnd_spending) at(market_share=(0(0.1)1))
+marginsplot, title("Marginal Effect of R&D by Market Share")
+```
+### Instrumental Variables in Panel Data
+```stata
+* IV with fixed effects (xtivreg)
+xtivreg profit (rnd_spending = tax_credit regulatory_change) ///
+    employees size i.year, fe first
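+* First-stage F-statistic check: the first option prints the first-stage regression
+* The Kleibergen-Paap rk Wald F is not reported by xtivreg; community-contributed
+* commands such as xtivreg2 or ivreghdfe report it with robust or clustered errors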
+```
+### Correlated Random Effects (Mundlak)
+```stata
+* Mundlak (1978) approach: include within-group means
+foreach var of varlist revenue rnd_spending employees {
+    bysort firm_id: egen bar_`var' = mean(`var')
+}
+xtreg profit revenue rnd_spending employees ///
+    bar_revenue bar_rnd_spending bar_employees ///
+    i.year, re cluster(firm_id)
+* Coefficients on time-varying vars match the FE estimates in a balanced panel
+* Coefficients on bar_ vars capture the gap between the between-unit and within-unit effects
+```
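The link between the Mundlak specification and fixed effects can be checked numerically. This Python sketch uses a made-up balanced panel and pooled OLS rather than the random effects GLS estimator that `xtreg, re` fits; with pooled OLS, adding the unit mean of `x` reproduces the within slope exactly. Helper names are hypothetical.

```python
def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def two_regressor_slopes(x1, x2, y):
    """Slopes from OLS of y on x1 and x2 (with intercept), via the normal equations."""
    a, b, c = demean(x1), demean(x2), demean(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, c), dot(b, c)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Balanced toy panel (hypothetical): (x values, y values) per firm
panel = {
    "firm_a": ([1.0, 2.0, 4.0], [3.0, 4.0, 9.0]),
    "firm_b": ([3.0, 5.0, 6.0], [12.0, 15.0, 19.0]),
    "firm_c": ([6.0, 7.0, 9.0], [20.0, 25.0, 27.0]),
}

x, x_bar, y, x_dev, y_dev, mean_x, mean_y = [], [], [], [], [], [], []
for xs, ys in panel.values():
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    x += xs
    y += ys
    x_bar += [mx] * len(xs)          # unit mean repeated for each period
    x_dev += [xi - mx for xi in xs]  # within deviations
    y_dev += [yi - my for yi in ys]
    mean_x.append(mx)
    mean_y.append(my)

fe_slope = dot(x_dev, y_dev) / dot(x_dev, x_dev)
between_slope = dot(demean(mean_x), demean(mean_y)) / dot(demean(mean_x), demean(mean_x))
cre_slope, mean_coef = two_regressor_slopes(x, x_bar, y)

print(f"FE: {fe_slope:.4f}, CRE coefficient on x: {cre_slope:.4f}")
print(f"coefficient on the unit mean: {mean_coef:.4f}")
```

The coefficient on `x` equals the within slope, and the coefficient on the unit mean equals the between slope minus the within slope, which is why testing it against zero works as a Hausman-type check.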
+## Publication Tables
+```stata
+* Comparison table: FE vs RE vs GMM
+esttab fe_model re_model gmm_model using "tables/panel_comparison.tex", ///
+    b(3) se(3) star(* 0.10 ** 0.05 *** 0.01) ///
+    label title("Panel Regression Results") ///
+    mtitles("Fixed Effects" "Random Effects" "System GMM") ///
+    stats(N N_g r2_w ar2p hansenp, ///
+        labels("Observations" "Firms" "Within R-squared" ///
+               "AR(2) p-value" "Hansen p-value") ///
+        fmt(0 0 3 3 3)) ///
+    addnotes("Clustered standard errors in parentheses." ///
+             "All models include year fixed effects.") ///
+    replace
+```
+## References
+- Wooldridge, J.M. (2010), Econometric Analysis of Cross Section and Panel Data, 2nd ed., MIT Press
+- Arellano & Bond (1991), "Some Tests of Specification for Panel Data," RES 58(2)
+- Blundell & Bond (1998), "Initial Conditions and Moment Restrictions in Dynamic Panel Data Models," JoE 87(1)
+- Roodman (2009), "How to Do xtabond2: An Introduction to Difference and System GMM in Stata," SJ 9(1)
+- Cameron & Trivedi (2005), Microeconometrics: Methods and Applications, Cambridge University Press