npm - @wentorai/research-plugins - Versions diffs - 1.0.0 - Mend

@wentorai/research-plugins 1.0.0


Files changed (252)

package/skills/analysis/econometrics/stata-regression/SKILL.md ADDED

@@ -0,0 +1,117 @@
+---
+name: stata-regression
+description: "Run regression analyses in Stata with publication-ready output"
+metadata:
+  openclaw:
+    emoji: "📊"
+    category: "analysis"
+    subcategory: "econometrics"
+    keywords: ["Stata regression", "Stata data cleaning", "Stata commands", "panel data", "fixed effects", "robustness checks"]
+    source: "https://github.com/awesome-econ-ai/academic-skills"
+---
+# Stata Regression
+## Purpose
+This skill produces reproducible regression analysis workflows in Stata, including model diagnostics and publication-ready tables using `esttab` or `outreg2`.
+## When to Use
+- Estimating linear or nonlinear regression models in Stata
+- Producing tables for academic papers and reports
+- Running robustness checks and alternative specifications
+## Instructions
+Follow these steps to complete the task:
+### Step 1: Understand the Context
+Before generating any code, ask the user:
+- What is the dependent variable and key regressors?
+- What controls and fixed effects are required?
+- How should standard errors be clustered?
+- What output format is needed (LaTeX, Word, or CSV)?
+### Step 2: Generate the Output
+Based on the context, generate Stata code that:
+- **Loads and checks the data** - Handle missing values and verify variable types
+- **Runs the requested specification** - Use `regress`, `reghdfe`, or `xtreg` as appropriate
+- **Adds robust or clustered standard errors** - Match the study design
+- **Exports tables** - Use `esttab` or `outreg2` with clear labels
+### Step 3: Verify and Explain
+After generating output:
+- Explain what each model estimates
+- Highlight assumptions and diagnostics
+- Suggest robustness checks or alternative models
+## Example Prompts
+- "Run OLS with firm and year fixed effects, clustering by firm"
+- "Estimate a logit model and export results to LaTeX"
+- "Create a regression table with three specifications"
+## Example Output
+```stata
+* ============================================
+* Regression Analysis with Stata
+* ============================================
+* Load data
+use "data.dta", clear
+* Summary stats
+summarize y x1 x2 x3
+* Main regression with clustered SEs
+regress y x1 x2 x3, vce(cluster firm_id)
+eststo model1
+* Alternative specification with fixed effects
+reghdfe y x1 x2 x3, absorb(firm_id year) vce(cluster firm_id)
+eststo model2
+* Export table (create the output folder first so esttab can write to it)
+capture mkdir "results"
+esttab model1 model2 using "results/regression_table.tex", replace se label
+```
+## Requirements
+### Software
+- Stata 17+
+### Packages
+- `estout` (for `esttab`)
+- `reghdfe` (optional, for high-dimensional fixed effects)
+Install with:
+```stata
+ssc install estout
+* reghdfe depends on ftools, so install it first
+ssc install ftools
+ssc install reghdfe
+```
+## Best Practices
+- **Match standard errors to the design** (cluster at the level where treatment is assigned or where errors are likely to be correlated)
+- **Report all model variants** used in the analysis
+- **Document variable definitions** and transformations
+## Common Pitfalls
+- Not clustering standard errors at the correct level
+- Omitting fixed effects when required by the design
+- Exporting tables without clear labels and notes
+## References
+- [Stata Regression Reference Manual](https://www.stata.com/manuals/rregress.pdf)
+- [reghdfe documentation](https://github.com/sergiocorreia/reghdfe)
+- [estout documentation](https://repec.sowi.unibe.ch/stata/estout/)

package/skills/analysis/econometrics/time-series-guide/SKILL.md ADDED

@@ -0,0 +1,235 @@
+---
+name: time-series-guide
+description: "Apply ARIMA, VAR, cointegration, and time series econometric methods"
+metadata:
+  openclaw:
+    emoji: "chart_with_downwards_trend"
+    category: "analysis"
+    subcategory: "econometrics"
+    keywords: ["time series", "ARIMA", "VAR", "cointegration", "stationarity", "forecasting", "econometrics"]
+    source: "wentor-research-plugins"
+---
+# Time Series Guide
+A skill for applying time series econometric methods including ARIMA modeling, VAR systems, cointegration analysis, and unit root tests. Covers stationarity concepts, model selection, forecasting, and diagnostic checking for economic and financial data.
+## Stationarity and Unit Root Tests
+### Why Stationarity Matters
+A time series is (weakly) stationary when its mean, variance, and autocovariances do not change over time. Many standard methods, including ARMA and VAR models, assume stationarity, and regressing one non-stationary series on another can produce spurious results: apparently significant relationships between unrelated series.
+### Testing for Stationarity
+```python
+from statsmodels.tsa.stattools import adfuller, kpss
+import pandas as pd
+def test_stationarity(series: pd.Series, name: str = "Series") -> dict:
+    """
+    Test for stationarity using ADF and KPSS tests.
+    Args:
+        series: Time series data
+        name: Label for the series
+    """
+    # Augmented Dickey-Fuller test
+    # H0: Unit root exists (non-stationary)
+    adf_result = adfuller(series.dropna(), autolag="AIC")
+    # KPSS test
+    # H0: Series is stationary
+    kpss_result = kpss(series.dropna(), regression="c", nlags="auto")
+    return {
+        "series": name,
+        "adf": {
+            "statistic": adf_result[0],
+            "p_value": adf_result[1],
+            "lags_used": adf_result[2],
+            "conclusion": (
+                "Stationary (reject unit root)"
+                if adf_result[1] < 0.05
+                else "Non-stationary (fail to reject unit root)"
+            )
+        },
+        "kpss": {
+            "statistic": kpss_result[0],
+            "p_value": kpss_result[1],
+            "conclusion": (
+                "Non-stationary (reject stationarity)"
+                if kpss_result[1] < 0.05
+                else "Stationary (fail to reject stationarity)"
+            )
+        }
+    }
+```
+### Making a Series Stationary
+```
+Method 1: Differencing
+  y_diff = y_t - y_{t-1}           (first difference)
+  y_diff2 = delta(y_diff)          (second difference, rarely needed)
+Method 2: Log transformation + differencing
+  y_log = log(y_t)                 (stabilizes variance)
+  y_return = log(y_t) - log(y_{t-1})  (log returns)
+Method 3: Detrending
+  Subtract a fitted trend (linear, polynomial, or HP filter)
+```
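As a rough illustration of Methods 1 and 2, the sketch below computes differences and log returns in plain Python. The helper names `difference` and `log_returns` are made up for this example; with pandas, `series.diff()` and `np.log(series).diff()` give the same values with a leading NaN.

```python
import math

def difference(values, order=1):
    """Apply first differencing `order` times (hypothetical helper)."""
    out = list(values)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

def log_returns(values):
    """Log returns log(y_t) - log(y_{t-1}); values must be strictly positive."""
    logs = [math.log(v) for v in values]
    return difference(logs)

prices = [100.0, 110.0, 121.0, 133.1]
print(difference(prices))           # first differences grow with the level
print(difference(prices, order=2))  # second differences
print(log_returns(prices))          # roughly constant for steady 10% growth
```

For a series that grows at a steady percentage rate, the log returns are close to constant while the raw differences keep growing, which is why the log transformation is often applied before differencing.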
+## ARIMA Modeling
+### Model Structure
+```
+ARIMA(p, d, q):
+  p = order of autoregressive (AR) component
+  d = degree of differencing
+  q = order of moving average (MA) component
+SARIMA(p, d, q)(P, D, Q, s):
+  Seasonal extension with period s
+  P, D, Q = seasonal AR, differencing, MA orders
+```
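In lag-operator notation (with $L y_t = y_{t-1}$), the ARIMA(p, d, q) structure above is commonly written as follows. The symbols $\phi_i$, $\theta_j$, $c$, and $\varepsilon_t$ are notation introduced here for illustration and do not come from the guide:

```latex
\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j L^j\right)\varepsilon_t
```

With $d = 1$, the factor $(1 - L)$ applied to $y_t$ is the first difference $y_t - y_{t-1}$.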
+### Model Selection and Fitting
+```python
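+import numpy as np
+import pandas as pd
+from statsmodels.stats.diagnostic import acorr_ljungbox
+from statsmodels.tsa.arima.model import ARIMA
+from statsmodels.tsa.stattools import adfuller
+def fit_arima(series: pd.Series, order: tuple = None, max_d: int = 2) -> dict:
+    """
+    Fit an ARIMA model, optionally using auto-selection.
+    Args:
+        series: Time series data
+        order: (p, d, q) tuple; if None, d is chosen by ADF tests and
+            (p, q) by AIC
+        max_d: Largest differencing order considered during auto-selection
+    """
+    if order is None:
+        # Choose d first: AIC is not comparable across differencing orders
+        # because each d changes the data the likelihood is computed on
+        d = 0
+        work = series.dropna()
+        while d < max_d and adfuller(work, autolag="AIC")[1] >= 0.05:
+            work = work.diff().dropna()
+            d += 1
+        # Grid search over p and q with d held fixed
+        best_aic = np.inf
+        best_order = (0, d, 0)
+        for p in range(4):
+            for q in range(4):
+                try:
+                    result = ARIMA(series, order=(p, d, q)).fit()
+                    if result.aic < best_aic:
+                        best_aic = result.aic
+                        best_order = (p, d, q)
+                except Exception:
+                    continue
+        order = best_order
+    model = ARIMA(series, order=order)
+    result = model.fit()
+    # Ljung-Box test on the residuals, skipping the first d start-up residuals
+    # and adjusting the degrees of freedom for the p + q estimated ARMA terms
+    # (acorr_ljungbox returns a DataFrame in statsmodels >= 0.13)
+    resid = result.resid.iloc[order[1]:]
+    ljung_box = acorr_ljungbox(resid, lags=[10], model_df=order[0] + order[2])
+    return {
+        "order": order,
+        "aic": result.aic,
+        "bic": result.bic,
+        "coefficients": dict(zip(result.param_names, result.params)),
+        "residual_diagnostics": {
+            "ljung_box_p": float(ljung_box["lb_pvalue"].iloc[0])
+        }
+    }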
+```
+## Vector Autoregression (VAR)
+### Multivariate Time Series
+```python
+import pandas as pd
+from statsmodels.tsa.api import VAR
+def fit_var_model(data: pd.DataFrame, maxlags: int = 12) -> dict:
+    """
+    Fit a VAR model to multivariate time series data.
+    Args:
+        data: DataFrame with multiple time series columns
+        maxlags: Maximum lag order to consider
+    """
+    model = VAR(data)
+    # Select lag order by information criteria
+    lag_selection = model.select_order(maxlags=maxlags)
+    optimal_lag = lag_selection.aic
+    result = model.fit(optimal_lag)
+    return {
+        "lag_order": optimal_lag,
+        "aic": result.aic,
+        "variables": list(data.columns),
+        "granger_causality": "Use result.test_causality() for pairwise tests",
+        "irf": "Use result.irf(periods=20) for impulse response functions"
+    }
+```
+### Granger Causality
+Granger causality tests whether past values of variable X improve forecasts of variable Y beyond what past values of Y alone provide. It is a test of predictive precedence, not true causation.
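One common way to formalize this, sketched here with notation that is not from the guide ($p$ lags, coefficients $a_i$ and $b_i$), is to regress $Y$ on its own lags and on lags of $X$, then test whether the $X$ coefficients are jointly zero:

```latex
y_t = \alpha + \sum_{i=1}^{p} a_i \, y_{t-i} + \sum_{i=1}^{p} b_i \, x_{t-i} + \varepsilon_t,
\qquad H_0:\; b_1 = b_2 = \dots = b_p = 0
```

Rejecting $H_0$, typically with an F or Wald test, is read as "X Granger-causes Y".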
+## Cointegration Analysis
+### Engle-Granger and Johansen Tests
+```python
+import pandas as pd
+from statsmodels.tsa.stattools import coint
+from statsmodels.tsa.vector_ar.vecm import coint_johansen
+def test_cointegration(y1: pd.Series, y2: pd.Series) -> dict:
+    """
+    Test for cointegration between two series.
+    Args:
+        y1: First time series
+        y2: Second time series
+    """
+    # Engle-Granger two-step test
+    eg_stat, eg_pvalue, eg_crit = coint(y1, y2)
+    return {
+        "engle_granger": {
+            "statistic": eg_stat,
+            "p_value": eg_pvalue,
+            "conclusion": (
+                "Cointegrated" if eg_pvalue < 0.05
+                else "Not cointegrated"
+            )
+        },
+        "interpretation": (
+            "If cointegrated, these series share a long-run equilibrium "
+            "relationship. Use a Vector Error Correction Model (VECM) "
+            "rather than a VAR in differences."
+        )
+    }
+```
+## Diagnostic Checking
+### Model Validation Checklist
+```
+1. Residual autocorrelation: Ljung-Box test (should be non-significant)
+2. Residual normality: Jarque-Bera test or Q-Q plot
+3. Heteroskedasticity: ARCH-LM test for conditional heteroskedasticity
+4. Stability: Check that the inverse AR roots (eigenvalues of the companion matrix) lie inside the unit circle
+5. Forecast accuracy: Out-of-sample RMSE, MAE, MAPE
+6. Information criteria: Compare AIC/BIC across candidate models
+```
+Report all diagnostic results in your paper. Reviewers expect evidence that residuals are well-behaved and that the chosen model specification is justified by information criteria and domain knowledge.
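The forecast accuracy measures in item 5 can be computed directly from paired actual and predicted values. The following is a minimal sketch in plain Python; the function name `forecast_errors` and the sample numbers are invented for illustration.

```python
import math

def forecast_errors(actual, predicted):
    """Return RMSE, MAE, and MAPE (in percent) for paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when any actual value is zero
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"rmse": rmse, "mae": mae, "mape": mape}

# Hold-out observations versus model forecasts (made-up numbers)
metrics = forecast_errors([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
print(metrics)
```

RMSE penalizes large misses more heavily than MAE, and MAPE is scale-free but unusable when actual values are zero or near zero, so reporting more than one measure is usually safer.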

package/skills/analysis/statistics/bayesian-statistics-guide/SKILL.md ADDED

@@ -0,0 +1,221 @@
+---
+name: bayesian-statistics-guide
+description: "Bayesian inference methods including prior selection, MCMC, and model comparison"
+metadata:
+  openclaw:
+    emoji: "triangular_ruler"
+    category: "analysis"
+    subcategory: "statistics"
+    keywords: ["Bayesian statistics", "Bayesian inference", "MCMC", "sample size calculation", "prior selection"]
+    source: "wentor"
+---
+# Bayesian Statistics Guide
+A skill for applying Bayesian statistical methods to research data analysis. Covers prior specification, Markov chain Monte Carlo (MCMC) sampling, posterior interpretation, model comparison, and reporting standards.
+## Bayesian Framework Overview
+### Bayes' Theorem in Practice
+```
+Posterior = (Likelihood x Prior) / Evidence
+P(theta | data) = P(data | theta) * P(theta) / P(data)
+In practice:
+  P(theta | data) is proportional to P(data | theta) * P(theta)
+  (the denominator is a normalizing constant)
+```
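To make the proportionality concrete, here is a small sketch using a conjugate Beta prior with binomial data, a case where the posterior is available in closed form. The function names and the numbers are invented for illustration.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data.

    A Beta(alpha, beta) prior times a binomial likelihood is proportional
    to a Beta(alpha + successes, beta + failures) density.
    """
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Weak prior Beta(2, 2), then 7 successes in 10 trials
post_a, post_b = beta_binomial_update(2, 2, successes=7, failures=3)
print(post_a, post_b)             # 9 5
print(beta_mean(post_a, post_b))  # lies between the prior mean 0.5 and 7/10
```

The posterior mean is pulled from the sample proportion toward the prior mean, which is the regularizing effect of a prior in small samples.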
+### When to Use Bayesian Methods
+| Scenario | Bayesian Advantage |
+|----------|-------------------|
+| Small sample sizes | Priors regularize estimates |
+| Complex hierarchical models | Natural framework for multilevel data |
+| Sequential data collection | Update beliefs as data arrives |
+| Prior knowledge available | Formally incorporate existing evidence |
+| Model comparison | Bayes factors and posterior model probabilities |
+| Prediction | Full posterior predictive distributions |
+## Prior Specification
+### Types of Priors
+```python
+import numpy as np
+from scipy import stats
+import matplotlib.pyplot as plt
+def visualize_priors(parameter_name: str, prior_type: str = 'weakly_informative'):
+    """
+    Visualize common prior choices for a parameter.
+    """
+    x = np.linspace(-10, 10, 1000)
+    priors = {
+        'flat': {
+            'dist': stats.uniform(loc=-100, scale=200),
+            'description': 'Flat/Uniform: wide bounded uniform approximating a flat prior',
+            'recommendation': 'Avoid -- a truly flat (unbounded) prior is improper and can lead to improper posteriors'
+        },
+        'weakly_informative': {
+            'dist': stats.norm(loc=0, scale=2.5),
+            'description': 'Weakly informative: Normal(0, 2.5)',
+            'recommendation': 'Good default for regression coefficients'
+        },
+        'informative': {
+            'dist': stats.norm(loc=0.5, scale=0.2),
+            'description': 'Informative: based on previous studies',
+            'recommendation': 'Use when strong prior evidence exists'
+        },
+        'horseshoe': {
+            'dist': stats.cauchy(loc=0, scale=1),
+            'description': 'Horseshoe-like (Cauchy): sparsity-inducing',
+            'recommendation': 'Good for variable selection problems'
+        }
+    }
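+    prior = priors.get(prior_type, priors['weakly_informative'])
+    # Plot the selected prior density so the function matches its name
+    fig, ax = plt.subplots()
+    ax.plot(x, prior['dist'].pdf(x))
+    ax.set_xlabel(parameter_name)
+    ax.set_ylabel('Prior density')
+    ax.set_title(prior['description'])
+    return prior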
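+# Commonly used weakly informative defaults (in the spirit of Gelman et al., 2008,
+# who proposed Cauchy(0, 2.5) for logistic regression coefficients):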
+# Intercept: Normal(0, 10)
+# Coefficients: Normal(0, 2.5) on standardized predictors
+# Standard deviation: Half-Cauchy(0, 2.5) or Exponential(1)
+# Correlation: LKJ(2) for correlation matrices
+```
+## MCMC with PyMC
+### Linear Regression Example
+```python
+import pymc as pm
+import arviz as az
+def bayesian_regression(X, y, feature_names=None):
+    """
+    Fit a Bayesian linear regression model using PyMC.
+    Args:
+        X: Feature matrix (n_samples, n_features)
+        y: Response variable (n_samples,)
+        feature_names: List of feature names
+    """
+    n_features = X.shape[1]
+    if feature_names is None:
+        feature_names = [f'x{i}' for i in range(n_features)]
+    with pm.Model() as model:
+        # Priors
+        intercept = pm.Normal('intercept', mu=0, sigma=10)
+        betas = pm.Normal('betas', mu=0, sigma=2.5, shape=n_features)
+        sigma = pm.HalfCauchy('sigma', beta=2.5)
+        # Linear predictor
+        mu = intercept + pm.math.dot(X, betas)
+        # Likelihood
+        y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
+        # MCMC sampling
+        trace = pm.sample(
+            draws=2000,
+            tune=1000,
+            chains=4,
+            cores=4,
+            target_accept=0.9,
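+            return_inferencedata=True,
+            # Store pointwise log-likelihoods so LOO/WAIC can be computed later
+            idata_kwargs={"log_likelihood": True}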
+        )
+    return model, trace
+# After fitting, analyze results:
+# az.summary(trace, var_names=['intercept', 'betas', 'sigma'])
+# az.plot_trace(trace)
+# az.plot_forest(trace, var_names=['betas'])
+```
+## Diagnostics
+### MCMC Convergence Checks
+```python
+def check_mcmc_diagnostics(trace) -> dict:
+    """
+    Check MCMC convergence diagnostics.
+    """
+    summary = az.summary(trace)
+    diagnostics = {
+        'r_hat': {
+            'values': summary['r_hat'].to_dict(),
+            'threshold': 1.01,
+            'pass': (summary['r_hat'] < 1.01).all(),
+            'interpretation': 'R-hat < 1.01 is consistent with convergence (it does not prove it)'
+        },
+        'ess_bulk': {
+            'min_value': summary['ess_bulk'].min(),
+            'threshold': 400,
+            'pass': (summary['ess_bulk'] > 400).all(),
+            'interpretation': 'Bulk ESS > 400 suggests reliable posterior mean/median estimates'
+        },
+        'ess_tail': {
+            'min_value': summary['ess_tail'].min(),
+            'threshold': 400,
+            'pass': (summary['ess_tail'] > 400).all(),
+            'interpretation': 'Tail ESS > 400 suggests reliable credible interval endpoints'
+        }
+    }
+    # Overall assessment
+    diagnostics['converged'] = all(
+        d['pass'] for d in diagnostics.values() if 'pass' in d
+    )
+    return diagnostics
+```
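For intuition about what R-hat measures, the sketch below implements the classic Gelman-Rubin formula, which compares between-chain and within-chain variance. It is a simplified illustration, not the rank-normalized split R-hat that ArviZ reports, and the function name `gelman_rubin_rhat` is hypothetical.

```python
import statistics

def gelman_rubin_rhat(chains):
    """Classic (non-split, non-rank-normalized) R-hat for one parameter.

    chains: list of equal-length lists of draws, one list per chain.
    """
    m = len(chains)
    n = len(chains[0]) if m else 0
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                       # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

# Two chains exploring the same region versus two chains stuck far apart
overlapping = [[1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5]]
separated = [[1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0]]
print(gelman_rubin_rhat(overlapping))
print(gelman_rubin_rhat(separated))
```

When chains disagree about where the posterior mass is, the between-chain term dominates and R-hat rises well above 1, which is the pattern the 1.01 threshold is meant to catch.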
+## Model Comparison
+### Bayesian Model Selection
+```python
+def compare_models(traces: dict) -> dict:
+    """
+    Compare Bayesian models using PSIS-LOO cross-validation.
+    Args:
+        traces: Dict mapping model names to InferenceData objects
+            (each must contain a log_likelihood group)
+    """
+    comparison = az.compare(traces, ic='loo')
+    # Newer ArviZ releases name the column 'elpd_loo'; older ones use 'loo'
+    elpd_col = 'elpd_loo' if 'elpd_loo' in comparison.columns else 'loo'
+    return {
+        'ranking': comparison.index.tolist(),
+        'loo_values': comparison[elpd_col].to_dict(),
+        'weights': comparison['weight'].to_dict(),
+        'interpretation': (
+            f"Best model: {comparison.index[0]} "
+            f"(weight = {comparison['weight'].iloc[0]:.2f})"
+        )
+    }
+```
+## Reporting Bayesian Results
+Follow the WAMBS checklist (Depaoli & van de Schoot, 2017):
+1. **Priors**: Report all prior distributions and justify choices
+2. **Convergence**: Report R-hat, ESS, and trace plots (in supplement)
+3. **Posteriors**: Report posterior mean/median, 95% credible interval (HDI preferred)
+4. **Sensitivity**: Show results are robust to reasonable prior changes
+5. **Model fit**: Report LOO-IC, WAIC, or posterior predictive checks
+Example results sentence: "The effect of treatment on outcome was estimated at beta = 0.45, 95% HDI [0.21, 0.68], with a posterior probability of 0.99 that the effect is positive."
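The two quantities in that sentence, an HDI and the posterior probability of a positive effect, can both be read off posterior draws. The sketch below uses simulated normal draws as a stand-in for a real trace; the helper names and the simulation settings are assumptions made for illustration, and for actual traces `az.hdi` serves the same purpose.

```python
import math
import random

def hdi(samples, prob=0.95):
    """Narrowest interval containing a fraction `prob` of the draws.

    Assumes a unimodal posterior; a sketch, not a replacement for az.hdi.
    """
    xs = sorted(samples)
    n = len(xs)
    k = math.ceil(prob * n)  # number of draws the interval must cover
    if k < 2:
        raise ValueError("need more draws for the requested probability")
    # Slide a window of k sorted draws and keep the narrowest one
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

def prob_positive(samples):
    """Share of posterior draws above zero."""
    return sum(s > 0 for s in samples) / len(samples)

random.seed(0)
draws = [random.gauss(0.45, 0.12) for _ in range(4000)]  # stand-in posterior
low, high = hdi(draws)
print(f"95% HDI [{low:.2f}, {high:.2f}], P(beta > 0) = {prob_positive(draws):.3f}")
```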
+## References
+- Gelman, A., et al. (2013). *Bayesian Data Analysis* (3rd ed.). CRC Press.
+- McElreath, R. (2020). *Statistical Rethinking* (2nd ed.). CRC Press.