npm - @wentorai/research-plugins - Versions diffs - 1.0.0 → 1.1.0 - Mend

@wentorai/research-plugins 1.0.0 → 1.1.0

Files changed (203)

package/skills/analysis/econometrics/python-causality-guide/SKILL.md ADDED

@@ -0,0 +1,134 @@
+---
+name: python-causality-guide
+description: "Learn causal inference with Python using the Brave and True handbook"
+metadata:
+  openclaw:
+    emoji: "📊"
+    category: analysis
+    subcategory: econometrics
+    keywords: ["causal-inference", "python", "econometrics", "statistics", "treatment-effects", "observational-studies"]
+    source: "https://github.com/matheusfacure/python-causality-handbook"
+---
+# Causal Inference for the Brave and True
+## Overview
+Causal Inference for the Brave and True is an open-source, Python-based textbook by Matheus Facure that teaches causal inference methods through practical implementations. The book bridges the gap between theoretical econometrics textbooks and hands-on data science practice, presenting each method with runnable Python code, real-world datasets, and intuitive explanations that demystify the mathematics behind causal reasoning.
+The handbook covers the full spectrum of causal inference techniques used in modern empirical research, from foundational concepts like potential outcomes and directed acyclic graphs (DAGs) through advanced methods including instrumental variables, regression discontinuity, difference-in-differences, and synthetic control. Each chapter builds on the previous one, constructing a coherent framework for thinking about causation from observational data.
+With over 3,000 GitHub stars, this resource has become a standard reference for graduate students, applied researchers, and data scientists seeking to add causal reasoning to their analytical toolkit. The emphasis on Python implementation makes it directly applicable to modern research workflows.
+## Installation and Setup
+The handbook runs as Jupyter notebooks. Set up the environment:
+```bash
+git clone https://github.com/matheusfacure/python-causality-handbook.git
+cd python-causality-handbook
+# Create a virtual environment
+python -m venv causal-env
+source causal-env/bin/activate
+# Install dependencies
+pip install numpy pandas matplotlib seaborn scikit-learn statsmodels
+pip install linearmodels causalinference
+pip install jupyter
+```
+Launch the notebook server:
+```bash
+jupyter notebook
+```
+The chapters are organized as numbered Jupyter notebooks, starting from foundational concepts and progressing to advanced methods. Each notebook is self-contained with all data loading and analysis code included.
+## Core Methods Covered
+**Potential Outcomes Framework**: The book begins by establishing the Neyman-Rubin potential outcomes model, defining treatment effects and the fundamental problem of causal inference:
+```python
+import pandas as pd
+import numpy as np
+from scipy.stats import ttest_ind
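+# Assumes `data` is a pandas DataFrame with a binary `treatment` column and a numeric `outcome` column
+# Estimate the ATE from a randomized experiment as a difference in means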
+treated = data[data["treatment"] == 1]["outcome"]
+control = data[data["treatment"] == 0]["outcome"]
+ate = treated.mean() - control.mean()
+t_stat, p_value = ttest_ind(treated, control)
+print(f"ATE: {ate:.3f}, p-value: {p_value:.4f}")
+```
+**Regression and Matching**: OLS regression for causal estimation, understanding omitted variable bias, propensity score methods, and matching estimators:
+```python
+import statsmodels.formula.api as smf
+# OLS with controls
+model = smf.ols("outcome ~ treatment + age + income + education", data=data)
+results = model.fit(cov_type="HC1")
+print(results.summary().tables[1])
+```
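The omitted variable bias mentioned above can be checked by hand. The sketch below uses made-up numbers (the variable names and the data-generating rule `outcome = 2 * treatment + 3 * income` are assumptions for illustration, not taken from the handbook) and shows that a regression omitting `income` recovers the true effect plus the confounder's coefficient times its association with treatment.

```python
from statistics import mean


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den


# Hypothetical data: income is higher among treated units (a confounder).
treatment = [0, 0, 0, 1, 1, 1]
income = [1.0, 2.0, 3.0, 2.0, 3.0, 5.0]
# True model by construction: outcome = 2 * treatment + 3 * income
outcome = [2 * t + 3 * x for t, x in zip(treatment, income)]

short = slope(treatment, outcome)  # regression that omits income
delta = slope(treatment, income)   # association of income with treatment

print(f"short-regression estimate: {short:.3f}")
print(f"true effect + bias term:   {2 + 3 * delta:.3f}")
```

The two printed numbers coincide because the identity holds exactly in-sample; adding `income` as a control, as in the OLS example above, removes the bias term.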
+**Instrumental Variables**: Two-stage least squares and the local average treatment effect, with practical guidance on instrument validity and weak instrument diagnostics:
+```python
+from linearmodels.iv import IV2SLS
+# Two-stage least squares
+iv_formula = "outcome ~ 1 + [treatment ~ instrument]"
+iv_model = IV2SLS.from_formula(iv_formula, data=data)
+iv_results = iv_model.fit(cov_type="robust")
+print(iv_results.summary)
+```
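With a single binary instrument and no covariates, two-stage least squares reduces to the Wald ratio: the reduced-form effect of the instrument on the outcome divided by its first-stage effect on the treatment. A minimal sketch on invented data (all values below are hypothetical):

```python
from statistics import mean


def cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


# Hypothetical data: the instrument shifts treatment take-up from 25% to 75%.
instrument = [0, 0, 0, 0, 1, 1, 1, 1]
treatment = [0, 0, 0, 1, 0, 1, 1, 1]
outcome = [1.0, 2.0, 1.5, 4.0, 1.0, 4.5, 5.0, 3.5]

first_stage = cov(treatment, instrument) / cov(instrument, instrument)
reduced_form = cov(outcome, instrument) / cov(instrument, instrument)
iv_estimate = reduced_form / first_stage  # Wald / 2SLS estimate

print(f"first stage: {first_stage:.3f}")
print(f"reduced form: {reduced_form:.3f}")
print(f"IV estimate: {iv_estimate:.3f}")
```

A first stage close to zero makes the ratio unstable, which is the intuition behind the weak instrument diagnostics mentioned above.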
+**Difference-in-Differences**: Parallel trends assumption, two-way fixed effects, event study designs, and staggered treatment adoption:
+```python
+# Difference-in-Differences with two-way fixed effects
+did_model = smf.ols(
+    "outcome ~ treated_post + C(unit_id) + C(time_period)",
+    data=panel_data
+)
+did_results = did_model.fit(cov_type="cluster", cov_kwds={"groups": panel_data["unit_id"]})
+```
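In the simplest two-group, two-period case, the difference-in-differences estimate is just arithmetic on four cell means. The sketch below uses hypothetical numbers to make the subtraction explicit; under parallel trends, the control group's change stands in for what the treated group would have experienced without treatment.

```python
from statistics import mean

# Hypothetical observations: (group, period, outcome)
rows = [
    ("treated", "pre", 10.0), ("treated", "pre", 12.0),
    ("treated", "post", 15.0), ("treated", "post", 17.0),
    ("control", "pre", 8.0), ("control", "pre", 10.0),
    ("control", "post", 10.0), ("control", "post", 12.0),
]


def cell_mean(group, period):
    """Average outcome for one group in one period."""
    return mean(y for g, p, y in rows if g == group and p == period)


treated_change = cell_mean("treated", "post") - cell_mean("treated", "pre")
control_change = cell_mean("control", "post") - cell_mean("control", "pre")
did = treated_change - control_change

print(f"treated change: {treated_change}, control change: {control_change}")
print(f"DiD estimate: {did}")
```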
+**Regression Discontinuity**: Sharp and fuzzy RD designs, bandwidth selection, and local polynomial estimation for identifying causal effects at policy thresholds.
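A sharp RD estimate can be sketched as the gap between two straight lines fitted on either side of the cutoff within a bandwidth (a local linear fit with a uniform kernel). The function names, cutoff, bandwidth, and data below are illustrative assumptions, not the handbook's implementation.

```python
from statistics import mean


def fit_line(x, y):
    """Return (intercept, slope) of an OLS line through the points."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def sharp_rd(running, outcome, cutoff, bandwidth):
    """Jump in the fitted outcome at the cutoff, one line per side.

    The running variable is centered at the cutoff, so each intercept is
    the fitted value exactly at the threshold. Each side needs at least
    two distinct points inside the bandwidth.
    """
    pairs = list(zip(running, outcome))
    left = [(r - cutoff, y) for r, y in pairs if cutoff - bandwidth <= r < cutoff]
    right = [(r - cutoff, y) for r, y in pairs if cutoff <= r <= cutoff + bandwidth]
    intercept_left, _ = fit_line(*zip(*left))
    intercept_right, _ = fit_line(*zip(*right))
    return intercept_right - intercept_left


# Hypothetical data: a smooth trend plus a jump of 4 at a cutoff of 50.
running = [40, 42, 44, 46, 48, 50, 52, 54, 56, 58]
outcome = [0.5 * r + (4.0 if r >= 50 else 0.0) for r in running]

effect = sharp_rd(running, outcome, cutoff=50, bandwidth=10)
print(f"estimated jump at the cutoff: {effect:.3f}")
```

In applied work the bandwidth is chosen by a data-driven rule and the estimate is checked for sensitivity to that choice.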
+**Synthetic Control**: Constructing counterfactual units from donor pools for comparative case studies, with inference via placebo tests.
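The core of synthetic control is choosing non-negative donor weights that sum to one and reproduce the treated unit's pre-treatment path. With only two donors the problem has a closed form, which keeps the sketch below short; real applications use many donors and a constrained optimizer. All series are hypothetical.

```python
def two_donor_weight(treated_pre, donor_a_pre, donor_b_pre):
    """Weight on donor A (donor B gets 1 - w) minimizing pre-period squared error.

    The objective is a convex quadratic in w, so clipping the unconstrained
    minimizer to [0, 1] gives the constrained optimum.
    """
    num = sum((y - b) * (a - b)
              for y, a, b in zip(treated_pre, donor_a_pre, donor_b_pre))
    den = sum((a - b) ** 2 for a, b in zip(donor_a_pre, donor_b_pre))
    return min(1.0, max(0.0, num / den))


# Hypothetical outcome series before and after the intervention.
treated_pre, treated_post = [10.0, 11.0, 12.0], [16.0, 17.0]
donor_a_pre, donor_a_post = [12.0, 13.0, 14.0], [15.0, 16.0]
donor_b_pre, donor_b_post = [8.0, 9.0, 10.0], [11.0, 12.0]

w = two_donor_weight(treated_pre, donor_a_pre, donor_b_pre)
synthetic_post = [w * a + (1 - w) * b for a, b in zip(donor_a_post, donor_b_post)]
gaps = [y - s for y, s in zip(treated_post, synthetic_post)]

print(f"weight on donor A: {w}")
print(f"post-period gaps (treated minus synthetic): {gaps}")
```

A placebo test repeats the same procedure with each donor treated as if it had received the intervention and compares the resulting gaps.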
+## Research Workflow Integration
+**Graduate Coursework**: The handbook maps directly to applied econometrics and causal inference course syllabi. Students can follow along with lectures by running the corresponding notebooks, experimenting with parameter changes, and observing how different assumptions affect estimates.
+**Method Selection Guide**: Use the decision framework presented across chapters to choose the appropriate method for your research question:
+- Randomized experiment available: simple comparison of means or regression adjustment
+- Selection on observables: matching, propensity scores, or regression
+- Unobserved confounders with instrument: instrumental variables
+- Policy threshold: regression discontinuity
+- Before/after with control group: difference-in-differences
+- Single treated unit over time: synthetic control
+**Replication and Extension**: Each chapter uses real or realistic datasets. Researchers can adapt the code to their own data by replacing data loading steps while preserving the analytical pipeline.
+**Teaching Tool**: Instructors can assign chapters as interactive homework, asking students to modify assumptions, change specifications, or apply methods to new datasets. The notebook format makes it straightforward to create assignments with embedded solutions.
+## Best Practices Highlighted in the Handbook
+1. **Always graph your data first**: Visual inspection reveals patterns that inform modeling choices and expose violations of identifying assumptions.
+2. **Understand your identification strategy**: Before running any estimator, articulate clearly what variation identifies the causal effect and what assumptions are required.
+3. **Cluster standard errors appropriately**: When treatment is assigned at group level, cluster standard errors at that level to avoid overstating statistical significance.
+4. **Run robustness checks**: Vary specifications, bandwidths, control variables, and functional forms to assess sensitivity of conclusions.
+5. **Report effect sizes alongside p-values**: Statistical significance without practical significance is not informative for policy or scientific understanding.
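As one way to act on the last point, a standardized effect size such as Cohen's d can be reported next to the p-value. The sketch below uses the pooled-variance definition on made-up samples; the function name and data are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_var = ((n1 - 1) * variance(treated)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)


# Hypothetical outcomes for two small groups.
treated = [5.0, 6.0, 7.0, 8.0]
control = [4.0, 5.0, 6.0, 7.0]

d = cohens_d(treated, control)
print(f"Cohen's d: {d:.3f}")
```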
+## References
+- Python Causality Handbook: https://github.com/matheusfacure/python-causality-handbook
+- Online version: https://matheusfacure.github.io/python-causality-handbook/
+- Angrist and Pischke, Mostly Harmless Econometrics (companion reference)
+- Cunningham, Causal Inference: The Mixtape (complementary resource)

package/skills/analysis/econometrics/stata-accounting-guide/SKILL.md ADDED

@@ -0,0 +1,269 @@
+---
+name: stata-accounting-guide
+description: "STATA code for empirical accounting and financial economics research"
+metadata:
+  openclaw:
+    emoji: "ledger"
+    category: "analysis"
+    subcategory: "econometrics"
+    keywords: ["Stata", "accounting research", "Compustat", "CRSP", "earnings management", "financial economics"]
+    source: "https://www.stata.com"
+---
+# Stata Accounting Research Guide
+Generate publication-ready Stata code for empirical accounting research. This skill covers the standard econometric models, variable constructions, and estimation procedures used in top accounting journals (TAR, JAR, JAE, RAS, CAR).
+## Overview
+Empirical accounting research has developed a distinctive set of analytical conventions that differ from general econometrics. Researchers work with financial statement data from databases like Compustat and CRSP, construct standardized proxies for earnings quality, accruals, and discretionary behavior, and employ estimation techniques that address the particular endogeneity concerns in archival accounting studies.
+This skill provides the Stata implementation for the most commonly used models in accounting research: accrual models (Jones, Modified Jones, performance-matched), earnings management detection, value relevance studies, audit quality analyses, and corporate governance research. Each model includes the variable construction from raw Compustat items, the estimation procedure, and the table formatting expected by accounting journal reviewers.
+The code follows the conventions established in influential methodological papers by Dechow, Sloan, and Sweeney (1995), Kothari, Leone, and Wasley (2005), and Ecker, Francis, Kim, Olsson, and Schipper (2006), ensuring alignment with reviewer expectations at major accounting journals.
+## Data Preparation from Compustat
+### Variable Construction
+```stata
+* ============================================
+* Standard Compustat Variable Construction
+* ============================================
+* Load Compustat annual data
+use "raw/compustat_annual.dta", clear
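+* Fiscal year identifier (skip this line if the extract already contains fyear)
+gen fyear = year(datadate)
+* Declare the firm-year panel so the l. and d. time-series operators work
+* (assumes gvkey is numeric; destring it first if it is stored as a string)
+xtset gvkey fyear
+* Key financial variables (Compustat mnemonics)
+* Total accruals (balance sheet approach): changes in non-cash current assets,
+* less changes in current liabilities net of short-term debt and taxes payable,
+* less depreciation
+gen total_accruals_bs = (d.act - d.che) - (d.lct - d.dlc - d.txp) - dp
+replace total_accruals_bs = total_accruals_bs / l.at  // Scale by lagged assets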
+* Total accruals (cash flow approach, preferred)
+gen total_accruals_cf = (ib - oancf) / l.at
+* Key ratios
+gen roa = ib / l.at                           // Return on assets
+gen leverage = (dltt + dlc) / at              // Financial leverage
+gen size = ln(at)                              // Firm size (log assets)
+gen mb = (prcc_f * csho) / ceq               // Market-to-book
+gen sales_growth = (sale - l.sale) / l.sale   // Revenue growth
+gen cfo = oancf / l.at                        // Cash flow from operations
+gen loss = (ib < 0)                           // Loss indicator
+* Industry classification (Fama-French 48)
+merge m:1 sic using "reference/ff48_sic.dta", nogen keep(master match)
+```
+### Sample Selection and Filters
+```stata
+* Standard sample restrictions
+drop if at <= 0                    // Positive assets
+drop if sale < 0                   // Non-negative sales
+drop if missing(ib, at, sale)      // Key variables non-missing
+drop if inrange(sic, 6000, 6999)   // Exclude financials (SIC 6000-6999)
+drop if inrange(sic, 4900, 4999)   // Exclude utilities (SIC 4900-4999)
+* Require minimum observations per industry-year
+bysort ff48 fyear: gen n_iy = _N
+drop if n_iy < 20
+* Winsorize continuous variables at 1st/99th percentile
+* (winsor2 is a user-written command: ssc install winsor2)
+foreach var of varlist roa leverage mb sales_growth cfo total_accruals_cf {
+    winsor2 `var', cuts(1 99) replace
+}
+```
+## Accrual Models
+### Modified Jones Model (Dechow, Sloan, Sweeney 1995)
+```stata
+* ============================================
+* Modified Jones Model - Industry-Year Estimation
+* ============================================
+* Construct regressors
+gen inv_at = 1 / l.at
+gen d_rev_rec = ((sale - l.sale) - (rect - l.rect)) / l.at
+gen ppe_scaled = ppegt / l.at
+* Estimate by industry-year cross-sections
+gen da_mj = .
+levelsof ff48, local(industries)
+levelsof fyear, local(years)
+foreach ind of local industries {
+    foreach yr of local years {
+        capture {
+            reg total_accruals_cf inv_at d_rev_rec ppe_scaled ///
+                if ff48 == `ind' & fyear == `yr', robust
+            predict resid if e(sample), resid
+            replace da_mj = resid if ff48 == `ind' & fyear == `yr' & !missing(resid)
+            drop resid
+        }
+    }
+}
+label variable da_mj "Discretionary accruals (Modified Jones)"
+```
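For reference, the cross-sectional regression that the loop above estimates within each industry-year can be written as follows. The notation is introduced here for illustration: $TA$ is total accruals, $A$ total assets, $\Delta REV$ the change in sales, $\Delta REC$ the change in receivables, and $PPE$ gross property, plant, and equipment.

```latex
\frac{TA_{it}}{A_{i,t-1}}
  = \alpha_0
  + \alpha_1 \frac{1}{A_{i,t-1}}
  + \alpha_2 \frac{\Delta REV_{it} - \Delta REC_{it}}{A_{i,t-1}}
  + \alpha_3 \frac{PPE_{it}}{A_{i,t-1}}
  + \varepsilon_{it}
```

The discretionary accrual stored in `da_mj` is the residual $\hat{\varepsilon}_{it}$, that is, the part of scaled accruals not explained by the regressors.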
+### Performance-Matched Model (Kothari et al. 2005)
+```stata
+* Add ROA as control for performance matching
+gen da_kothari = .
+foreach ind of local industries {
+    foreach yr of local years {
+        capture {
+            reg total_accruals_cf inv_at d_rev_rec ppe_scaled roa ///
+                if ff48 == `ind' & fyear == `yr', robust
+            predict resid if e(sample), resid
+            replace da_kothari = resid if ff48 == `ind' & fyear == `yr' & !missing(resid)
+            drop resid
+        }
+    }
+}
+label variable da_kothari "Discretionary accruals (Kothari)"
+* Alternative: performance matching by ROA decile
+xtile roa_decile = roa, nq(10)
+bysort ff48 fyear roa_decile: egen da_pm = mean(da_mj)
+gen da_performance_matched = da_mj - da_pm
+```
+## Earnings Management Detection
+### Earnings Management Around Thresholds
+```stata
+* ============================================
+* Earnings Distribution Discontinuity Test
+* ============================================
+* Scaled earnings (earnings per share / price)
+gen earn_scaled = ib / (prcc_f * csho)
+* Histogram around zero
+twoway (histogram earn_scaled if inrange(earn_scaled, -0.10, 0.10), ///
+    width(0.005) color(navy%50)), ///
+    xline(0, lcolor(red)) ///
+    title("Distribution of Scaled Earnings Around Zero") ///
+    xtitle("Earnings / Market Cap") ytitle("Frequency") ///
+    graphregion(color(white))
+graph export "figures/earnings_discontinuity.pdf", replace
+* Burgstahler & Dichev (1997) test
+* Floor-based bins so that no bin straddles zero (round() would center a bin on zero)
+gen earn_bin = floor(earn_scaled / 0.005) * 0.005
+tab earn_bin if inrange(earn_scaled, -0.025, 0.025)
+* Test for discontinuity at zero
+gen just_above = (earn_scaled >= 0 & earn_scaled < 0.005)
+gen just_below = (earn_scaled >= -0.005 & earn_scaled < 0)
+prtest just_above == just_below
+```
+### Real Earnings Management (Roychowdhury 2006)
+```stata
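+* Sales regressors scaled by lagged assets (used in both models below)
+gen sale_scaled = sale / l.at
+gen d_sale_scaled = (sale - l.sale) / l.at
+* Abnormal cash flow from operations
+gen da_cfo = .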
+foreach ind of local industries {
+    foreach yr of local years {
+        capture {
+            reg cfo inv_at sale_scaled d_sale_scaled ///
+                if ff48 == `ind' & fyear == `yr', robust
+            predict resid if e(sample), resid
+            replace da_cfo = resid if ff48 == `ind' & fyear == `yr' & !missing(resid)
+            drop resid
+        }
+    }
+}
+* Abnormal production costs (COGS plus the change in inventory, Compustat item invt)
+gen prod_costs = cogs + (invt - l.invt)
+gen prod_scaled = prod_costs / l.at
+gen da_prod = .
+foreach ind of local industries {
+    foreach yr of local years {
+        capture {
+            reg prod_scaled inv_at sale_scaled d_sale_scaled l.d_sale_scaled ///
+                if ff48 == `ind' & fyear == `yr', robust
+            predict resid if e(sample), resid
+            replace da_prod = resid if ff48 == `ind' & fyear == `yr' & !missing(resid)
+            drop resid
+        }
+    }
+}
+```
+## Publication Tables
+### Regression Table with Standard Formatting
+```stata
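+* Main regression: Discretionary accruals on governance
+* Dependent variable: absolute discretionary accruals from the Modified Jones model
+gen abs_da_mj = abs(da_mj)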
+reg abs_da_mj board_independence ceo_duality audit_committee_size ///
+    big4 size leverage mb loss i.fyear, cluster(gvkey)
+estimates store gov1
+reg abs_da_mj board_independence ceo_duality audit_committee_size ///
+    big4 inst_ownership analyst_following ///
+    size leverage mb loss i.fyear, cluster(gvkey)
+estimates store gov2
+* Publication table
+esttab gov1 gov2 using "tables/governance_regression.tex", ///
+    b(3) se(3) star(* 0.10 ** 0.05 *** 0.01) ///
+    label title("Corporate Governance and Earnings Quality") ///
+    mtitles("Baseline" "Extended") ///
+    drop(*.fyear) indicate("Year FE = *.fyear") ///
+    stats(N r2_a, labels("Observations" "Adj. R-squared") fmt(0 3)) ///
+    addnotes("Standard errors clustered by firm in parentheses.") ///
+    replace
+```
+## Endogeneity and Identification
+### Two-Stage Least Squares
+```stata
+* Instrumental variable regression
+ivregress 2sls earnings_quality (board_independence = ///
+    state_governance_index peer_board_independence) ///
+    size leverage mb loss i.fyear, vce(cluster gvkey) first
+* First-stage diagnostics
+estat firststage
+estat endogenous
+* Weak instrument test
+estat firststage, forcenonrobust
+```
+### Propensity Score Matching
+```stata
+* PSM for treatment effect of Big 4 auditor
+logit big4 size leverage mb roa loss i.ff48, robust
+predict pscore, pr
+* Nearest-neighbor matching (psmatch2 is a user-written command: ssc install psmatch2)
+psmatch2 big4, pscore(pscore) outcome(abs_da_mj) ///
+    neighbor(3) caliper(0.01) common
+```
+## References
+- Dechow, Sloan & Sweeney (1995), "Detecting Earnings Management," TAR 70(2)
+- Kothari, Leone & Wasley (2005), "Performance Matched Discretionary Accrual Measures," JAE 39(1)
+- Roychowdhury (2006), "Earnings Management through Real Activities Manipulation," JAE 42(3)
+- Burgstahler & Dichev (1997), "Earnings Management to Avoid Earnings Decreases and Losses," JAE 24(1)
+- Compustat Manual: https://www.spglobal.com/marketintelligence

package/skills/analysis/econometrics/stata-analyst-guide/SKILL.md ADDED

@@ -0,0 +1,245 @@
+---
+name: stata-analyst-guide
+description: "Stata workflows for publication-ready sociology and social science research"
+metadata:
+  openclaw:
+    emoji: "survey"
+    category: "analysis"
+    subcategory: "econometrics"
+    keywords: ["Stata", "sociology", "social science", "survey data", "regression", "publication tables"]
+    source: "https://www.stata.com"
+---
+# Stata Analyst Guide for Social Science Research
+Complete Stata workflow for sociology and social science research, from survey data preparation through publication-ready regression tables and visualizations. This skill covers the analytical techniques most commonly used in top sociology journals.
+## Overview
+Stata is widely used in sociology, political science, demography, and many other social science disciplines. Its command-line interface, reproducible do-file workflow, and comprehensive support for survey data, multilevel models, and categorical data analysis make it a common choice for researchers working with complex social datasets.
+This skill provides ready-to-use Stata code for the most common analytical tasks in social science research: descriptive statistics for diverse variable types, regression modeling with proper controls and robustness checks, interaction effects with meaningful visualizations, and automated production of APA/ASA-formatted tables suitable for direct inclusion in journal manuscripts.
+The examples draw on typical social science data structures: individual-level survey data with sampling weights, nested data (individuals within organizations or regions), longitudinal panels, and event-history data. All code follows the conventions expected by reviewers at journals such as the American Sociological Review, American Journal of Sociology, and Social Forces.
+## Descriptive Statistics
+### Weighted Summary Statistics
+```stata
+* Social science surveys typically require survey weights
+svyset psu [pweight=finalweight], strata(stratum)
+* Weighted means and proportions
+svy: mean income education_years age
+svy: proportion race gender marital_status
+* Weighted cross-tabulation
+svy: tabulate education_cat income_quintile, row se
+* Descriptive statistics table for paper
+estpost summarize age education_years income ///
+    children household_size, detail
+esttab using "tables/descriptives.tex", ///
+    cells("mean(fmt(2)) sd(fmt(2)) min max count") ///
+    label title("Descriptive Statistics") replace
+```
+### Group Comparisons
+```stata
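+* Weighted difference in group means (survey-adjusted analogue of a t-test)
+svy: mean income, over(gender)
+* Stata 16+ coefficient names; assumes gender is coded 1 = male, 2 = female
+* (on Stata 15 or earlier, use: lincom [income]Male - [income]Female)
+lincom _b[c.income@1.gender] - _b[c.income@2.gender]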
+* ANOVA
+svy: regress income i.race i.education_cat
+testparm i.race
+testparm i.education_cat
+* Effect sizes (Cohen's d); note that esize does not apply the survey weights
+esize twosample income, by(gender)
+```
+## Regression Analysis
+### OLS with Standard Controls
+```stata
+* Model building strategy (nested models for sociology papers)
+* Model 1: Bivariate
+reg income i.gender [pweight=finalweight], robust
+estimates store m1
+* Model 2: Add demographics
+reg income i.gender age age_sq i.race i.marital [pweight=finalweight], robust
+estimates store m2
+* Model 3: Add human capital
+reg income i.gender age age_sq i.race i.marital ///
+    education_years experience experience_sq [pweight=finalweight], robust
+estimates store m3
+* Model 4: Add job characteristics
+reg income i.gender age age_sq i.race i.marital ///
+    education_years experience experience_sq ///
+    i.occupation i.industry hours_worked [pweight=finalweight], robust
+estimates store m4
+* Publication-ready table
+esttab m1 m2 m3 m4 using "tables/regression_income.tex", ///
+    b(3) se(3) star(* 0.05 ** 0.01 *** 0.001) ///
+    label title("OLS Regression of Income") ///
+    mtitles("Bivariate" "Demographics" "Human Capital" "Full Model") ///
+    stats(N r2_a, labels("Observations" "Adjusted R-squared") fmt(0 3)) ///
+    addnotes("Standard errors in parentheses." ///
+             "All models use survey weights.") ///
+    replace
+```
+### Logistic Regression
+```stata
+* Binary outcome: employment status
+logit employed i.gender age age_sq i.race i.education_cat ///
+    children i.marital [pweight=finalweight], robust
+estimates store logit1
+* Report odds ratios
+logit employed i.gender age age_sq i.race i.education_cat ///
+    children i.marital [pweight=finalweight], robust or
+estimates store logit_or
+* Average marginal effects (preferred in sociology)
+margins, dydx(*) post
+estimates store ame
+* Predicted probabilities by group
+logit employed i.gender##i.race age education_years [pweight=finalweight], robust
+margins gender#race, atmeans
+marginsplot, title("Predicted Probability of Employment")
+```
+## Interaction Effects
+### Continuous x Categorical Interaction
+```stata
+* Gender x education interaction on income
+reg income c.education_years##i.gender age i.race [pweight=finalweight], robust
+* Visualize interaction
+margins gender, at(education_years=(8(2)20))
+marginsplot, ///
+    title("Returns to Education by Gender") ///
+    ytitle("Predicted Income ($)") ///
+    xtitle("Years of Education") ///
+    legend(order(1 "Male" 2 "Female")) ///
+    scheme(s2mono)
+graph export "figures/education_gender_interaction.pdf", replace
+```
+### Moderation Analysis
+```stata
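+* Test whether the effect of X on Y varies by moderator Z
+* ($controls is a placeholder global holding your control variables)
+reg outcome c.x_var##c.moderator $controls [pweight=finalweight], robust
+* Simple slopes at the 10th, 25th, 50th, 75th, and 90th percentiles of the moderator
+margins, dydx(x_var) at((p10) moderator) at((p25) moderator) at((p50) moderator) ///
+    at((p75) moderator) at((p90) moderator)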
+marginsplot, recast(line) recastci(rarea) ///
+    title("Effect of X on Y at Different Levels of Moderator")
+```
+## Multilevel Models
+```stata
+* Students nested within schools
+mixed test_score gender ses || school_id:, ///
+    variance mle
+* Random slopes
+mixed test_score gender c.ses || school_id: ses, ///
+    covariance(unstructured) mle
+* Calculate ICC
+estat icc
+* Store and compare models
+estimates store mlm1
+mixed test_score gender c.ses school_quality || school_id: ses, ///
+    covariance(unstructured) mle
+estimates store mlm2
+lrtest mlm1 mlm2
+```
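For a two-level random-intercept model such as the first `mixed` call above, the intraclass correlation is the share of total variance that lies between schools. The symbols are introduced here for illustration: $\sigma^2_u$ is the school-level intercept variance and $\sigma^2_e$ the student-level residual variance.

```latex
\mathrm{ICC} = \frac{\sigma^2_{u}}{\sigma^2_{u} + \sigma^2_{e}}
```

For example, a between-school variance of 20 and a residual variance of 80 give an ICC of 0.20, meaning one fifth of the variation in test scores is attributable to differences between schools.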
+## Visualization for Publication
+### Journal-Quality Figures
+```stata
+* Set publication-ready scheme
+set scheme s2mono
+* Coefficient plot
+coefplot m2 m3 m4, ///
+    drop(_cons) xline(0) ///
+    title("Regression Coefficients Across Models") ///
+    legend(order(2 "Demographics" 4 "Human Capital" 6 "Full")) ///
+    graphregion(color(white))
+graph export "figures/coefplot.pdf", replace
+* Distribution comparison
+twoway (kdensity income if gender==1, lcolor(navy)) ///
+       (kdensity income if gender==2, lcolor(cranberry)), ///
+    title("Income Distribution by Gender") ///
+    legend(order(1 "Male" 2 "Female")) ///
+    xtitle("Annual Income ($)") ytitle("Density") ///
+    graphregion(color(white))
+graph export "figures/income_density.pdf", replace
+```
+## Replication Package
+```stata
+* Master do-file structure for replication
+* master.do
+* ==========================================
+* Project: [Title]
+* Author: [Name]
+* Date: [Date]
+* Description: Master script for replication
+* ==========================================
+version 17
+clear all
+set more off
+set maxvar 10000
+global root "~/research/project_name"
+global raw "$root/data/raw"
+global processed "$root/data/processed"
+global tables "$root/tables"
+global figures "$root/figures"
+global logs "$root/logs"
+log using "$logs/master_log.smcl", replace
+do "$root/code/01_data_cleaning.do"
+do "$root/code/02_descriptives.do"
+do "$root/code/03_main_analysis.do"
+do "$root/code/04_robustness.do"
+do "$root/code/05_tables_figures.do"
+log close
+```
+## References
+- Stata Survey Data Reference Manual: https://www.stata.com/manuals/svy.pdf
+- Mitchell, M. (2021), A Visual Guide to Stata Graphics, 4th ed., Stata Press
+- Long & Freese (2014), Regression Models for Categorical Dependent Variables Using Stata, 3rd ed.
+- esttab/estout documentation: http://repec.sowi.unibe.ch/stata/estout/
+- ASA Style Guide: https://www.asanet.org/publications/style-guide/