npm - @wentorai/research-plugins - Versions diffs - 1.0.0 → 1.2.0 - Mend

@wentorai/research-plugins 1.0.0 → 1.2.0

Files changed (415)

package/skills/analysis/econometrics/stata-reference-guide/SKILL.md ADDED

@@ -0,0 +1,293 @@
+---
+name: stata-reference-guide
+description: "Comprehensive Stata reference covering syntax, econometrics, and 20+ packages"
+metadata:
+  openclaw:
+    emoji: "📊"
+    category: "analysis"
+    subcategory: "econometrics"
+    keywords: ["stata", "econometrics", "panel data", "causal inference", "community packages", "data management", "regression"]
+    source: "https://github.com/dylantmoore/stata-skill"
+---
+# Stata Comprehensive Reference Guide
+## Overview
+Stata is widely used in economics, political science, public health, and sociology research. This guide covers core syntax, data management, estimation commands, causal inference methods, graphics, Mata programming, and 20+ community-contributed packages. It is designed as a progressive-disclosure reference: go to the section relevant to your current task rather than reading end to end.
+## Core Syntax and Data Management
+### Data Import and Export
+```stata
+* Import CSV with variable names in first row
+import delimited "data.csv", clear varnames(1)
+* Import Excel (specific sheet and cell range)
+import excel "workbook.xlsx", sheet("Sheet1") cellrange(A1:Z1000) firstrow clear
+* Import Stata format
+use "dataset.dta", clear
+* Export to CSV
+export delimited "output.csv", replace
+* Save as Stata format
+save "cleaned_data.dta", replace
+```
+### Variable Management
+```stata
+* Generate new variables
+gen log_income = ln(income)
+gen age_sq = age^2
+gen treatment_post = treatment * post
+* Recode and label
+recode education (1/12 = 1 "HS or less") (13/16 = 2 "College") (17/20 = 3 "Graduate"), gen(edu_cat)
+label variable edu_cat "Education Category"
+* String operations
+gen first_name = word(full_name, 1)
+gen year_str = string(year)
+destring price_str, gen(price) force
+* Date handling
+gen date = date(date_str, "YMD")
+format date %td
+gen year = year(date)
+gen quarter = quarter(date)
+```
+### Data Cleaning Patterns
+```stata
+* Identify and handle duplicates
+duplicates report id year
+duplicates tag id year, gen(dup_flag)
+duplicates drop id year, force
+* Missing values
+misstable summarize
+misstable patterns
+replace income = . if income < 0  // recode impossible values
+* Merge datasets
+merge 1:1 id year using "panel_data.dta", keep(match master) nogen
+merge m:1 state year using "state_controls.dta", keep(match master) nogen
+* Reshape between wide and long (the stub must match the variable prefix)
+reshape long income_, i(id) j(year)
+reshape wide income_, i(id) j(year)
+* Collapse to group level
+collapse (mean) avg_income=income (sd) sd_income=income (count) n=income, by(state year)
+```
+## Estimation Commands
+### Linear Regression
+```stata
+* OLS with robust standard errors
+reg y x1 x2 x3, robust
+* Clustered standard errors
+reg y x1 x2 x3, cluster(firm_id)
+* Fixed effects (within estimator); declare the panel structure first
+xtset firm_id year
+xtreg y x1 x2 x3, fe vce(cluster firm_id)
+* Absorbing high-dimensional FE (reghdfe)
+reghdfe y x1 x2 x3, absorb(firm_id year) cluster(firm_id)
+* Instrumental variables (2SLS)
+ivregress 2sls y x1 x2 (endog_var = instrument1 instrument2), robust
+estat firststage
+estat overid
+```
+### Panel Data Methods
+```stata
+* Panel setup
+xtset firm_id year
+* Hausman test (FE vs RE)
+quietly xtreg y x1 x2, fe
+estimates store fe
+quietly xtreg y x1 x2, re
+estimates store re
+hausman fe re
+* Dynamic panel GMM (xtabond2)
+xtabond2 y L.y x1 x2, gmm(L.y, lag(2 4)) iv(x1 x2) robust twostep
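+* Test for serial correlation and overidentification
+* xtabond2 reports the Arellano-Bond AR(1)/AR(2) and Sargan/Hansen tests in its own output.
+* estat abond and estat sargan are postestimation commands for official xtabond/xtdpd:
+xtabond y x1 x2, lags(1) twostep
+estat abond    // Arellano-Bond test
+estat sargan   // Sargan test (not available after vce(robust))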
+```
+### Causal Inference
+```stata
+* Difference-in-Differences
+gen did = treatment * post
+reg y did treatment post controls, cluster(state)
+* Modern DiD with staggered treatment (csdid)
+csdid y x1 x2, ivar(id) time(year) gvar(first_treat) method(dripw)
+estat event  // aggregate to event-time (dynamic) effects
+csdid_plot   // event study plot
+* Regression Discontinuity (rdrobust)
+rdrobust y running_var, c(0) p(1) kernel(triangular)
+rdplot y running_var, c(0) p(1)
+* Propensity Score Matching (psmatch2)
+psmatch2 treatment x1 x2 x3, outcome(y) logit caliper(0.05) common
+pstest x1 x2 x3  // balance check
+* Synthetic Control (synth)
+synth y x1 x2 x3 y(1990) y(1991) y(1992), trunit(1) trperiod(1993) fig
+```
+### Limited Dependent Variables
+```stata
+* Logit/Probit
+logit binary_y x1 x2, robust
+margins, dydx(*)  // average marginal effects
+probit binary_y x1 x2, robust
+margins, dydx(*)
+* Ordered logit
+ologit ordered_y x1 x2, robust
+margins, predict(outcome(3)) dydx(x1)
+* Tobit (censored regression)
+tobit y x1 x2, ll(0)
+* Poisson and Negative Binomial
+poisson count_y x1 x2, robust
+nbreg count_y x1 x2, robust
+```
+## Community Packages (20+)
+### Installation
+```stata
+* Install from SSC (Statistical Software Components)
+ssc install reghdfe
+ssc install estout
+ssc install coefplot
+ssc install csdid
+ssc install rdrobust
+ssc install psmatch2
+ssc install synth
+ssc install ivreg2
+ssc install xtabond2
+ssc install winsor2
+ssc install gtools
+ssc install ftools
+ssc install binscatter
+ssc install binsreg
+ssc install grstyle
+* Install from GitHub
+net install did_multiplegt, from("https://raw.githubusercontent.com/chaisemartinDehejia/did_multiplegt/main")
+```
+### Publication-Quality Output
+```stata
+* estout / esttab — formatted regression tables
+eststo clear
+eststo: reg y x1 x2, robust
+eststo: reg y x1 x2 x3, robust
+eststo: reg y x1 x2 x3, cluster(firm_id)
+esttab, se star(* 0.10 ** 0.05 *** 0.01) ///
+    title("Main Results") label replace ///
+    scalars("r2 R-squared" "N Observations")
+* Export to LaTeX
+esttab using "table1.tex", replace booktabs ///
+    se star(* 0.10 ** 0.05 *** 0.01) label
+* Export to CSV/Excel
+esttab using "table1.csv", replace se
+* coefplot — coefficient visualization
+coefplot est1 est2 est3, drop(_cons) xline(0) ///
+    title("Coefficient Estimates") legend(order(1 "Model 1" 2 "Model 2" 3 "Model 3"))
+```
+## Graphics
+```stata
+* Scatter with fit line
+twoway (scatter y x) (lfit y x), title("Y vs X") ///
+    xtitle("X Variable") ytitle("Y Variable")
+* Event study plot
+coefplot, vertical drop(_cons) yline(0) ///
+    title("Event Study") xtitle("Periods Relative to Treatment")
+* Binned scatter (binscatter)
+binscatter y x, controls(z1 z2) nquantiles(20) ///
+    title("Binned Scatter") xtitle("X") ytitle("Y")
+* Kernel density
+kdensity income if year==2020, normal ///
+    title("Income Distribution") xtitle("Income")
+* Graph styling (grstyle)
+grstyle init
+grstyle set plain, horizontal grid
+grstyle color background white
+grstyle set color economist
+```
+## Mata Programming
+```stata
+* Basic Mata usage
+mata:
+    // Matrix operations
+    X = st_data(., ("x1", "x2", "x3"))
+    y = st_data(., "y")
+    n = rows(X)
+    // OLS by hand
+    X = X, J(n, 1, 1)  // add constant
+    beta = invsym(X'X) * X'y
+    e = y - X * beta
+    sigma2 = (e'e) / (n - cols(X))
+    V = sigma2 * invsym(X'X)
+    se = sqrt(diagonal(V))
+    beta, se
+end
+```
+## Workflow Best Practices
+1. **Always set a random seed** before any procedure involving randomness: `set seed 12345`
+2. **Use `preserve`/`restore`** for temporary data manipulations within a do-file
+3. **Log your sessions**: `log using "analysis_log.smcl", replace`
+4. **Version control**: Start do-files with `version 17` (or your version) for reproducibility
+5. **Use tempfiles** for intermediate datasets: `tempfile merged`, then ``save `merged'``
+6. **Profile your code** with `timer on 1` / `timer off 1` / `timer list` for long-running operations
+7. **Use `gtools`** (greshape, gcollapse, gegen) for 5-10x speedups on large datasets
+## References
+- [Stata Official Documentation](https://www.stata.com/manuals/)
+- [UCLA Stata FAQ](https://stats.oarc.ucla.edu/stata/faq/)
+- [Stata Journal](https://www.stata-journal.com/)
+- [SSC Archive](https://ideas.repec.org/s/boc/bocode.html)
+- [dylantmoore/stata-skill](https://github.com/dylantmoore/stata-skill) — Source for this reference

package/skills/analysis/statistics/data-anomaly-detection/SKILL.md ADDED

@@ -0,0 +1,157 @@
+---
+name: data-anomaly-detection
+description: "Detect anomalies and outliers in research data using statistical methods"
+metadata:
+  openclaw:
+    emoji: "🔎"
+    category: "analysis"
+    subcategory: "statistics"
+    keywords: ["anomaly detection", "outlier detection", "data quality", "statistical testing", "robust statistics"]
+    source: "https://github.com/AcademicSkills/data-anomaly-detection"
+---
+# Data Anomaly Detection
+A skill for identifying anomalies, outliers, and suspicious patterns in research datasets. Combines classical statistical methods with modern machine learning approaches to flag data points that deviate significantly from expected distributions, helping researchers maintain data integrity and uncover genuine scientific findings.
+## Overview
+Anomalous data points in research datasets can arise from measurement errors, instrument malfunction, data entry mistakes, or genuine rare phenomena. Distinguishing between these sources is critical: blindly removing outliers can bias results, while ignoring measurement errors introduces noise. This skill provides a structured framework for detecting, classifying, and handling anomalies in univariate, multivariate, and time-series research data.
+The approach follows a three-stage pipeline: detection (flagging candidate anomalies), diagnosis (determining likely cause), and decision (remove, transform, or retain with justification). Every decision is logged for reproducibility and transparent reporting.
+## Statistical Detection Methods
+### Univariate Outlier Detection
+```python
+import numpy as np
+from scipy import stats
+def detect_univariate_outliers(data: np.ndarray, method: str = 'iqr') -> dict:
+    """
+    Detect outliers using classical univariate methods.
+    Methods:
+        'iqr': Interquartile range (1.5x IQR rule)
+        'zscore': Z-score threshold (|z| > 3)
+        'mad': Median absolute deviation (robust)
+        'grubbs': Grubbs' test for single outlier
+    """
+    results = {'method': method, 'n_total': len(data)}
+    if method == 'iqr':
+        q1, q3 = np.percentile(data, [25, 75])
+        iqr = q3 - q1
+        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
+        mask = (data < lower) | (data > upper)
+    elif method == 'zscore':
+        z = np.abs(stats.zscore(data))
+        mask = z > 3
+    elif method == 'mad':
+        median = np.median(data)
+        mad = np.median(np.abs(data - median))
+        modified_z = 0.6745 * (data - median) / mad if mad > 0 else np.zeros_like(data)
+        mask = np.abs(modified_z) > 3.5
+    elif method == 'grubbs':
+        # Grubbs' test for the single most extreme value
+        n = len(data)
+        mean, sd = np.mean(data), np.std(data, ddof=1)
+        g = np.max(np.abs(data - mean)) / sd
+        t_crit = stats.t.ppf(1 - 0.05 / (2 * n), n - 2)
+        g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
+        mask = np.abs(data - mean) / sd >= g_crit
+    else:
+        raise ValueError(f"Unknown method: {method!r}")
+    results['outlier_indices'] = np.where(mask)[0].tolist()
+    results['n_outliers'] = int(mask.sum())
+    results['pct_outliers'] = round(mask.sum() / len(data) * 100, 2)
+    return results
+```
+### Multivariate Outlier Detection
+```python
+import numpy as np
+from sklearn.covariance import EllipticEnvelope
+from sklearn.ensemble import IsolationForest
+def detect_multivariate_outliers(X: np.ndarray, method: str = 'mahalanobis') -> dict:
+    """
+    Detect multivariate outliers using distance-based and model-based methods.
+    """
+    if method == 'mahalanobis':
+        detector = EllipticEnvelope(contamination=0.05, random_state=42)
+        labels = detector.fit_predict(X)  # -1 = outlier, 1 = inlier
+    elif method == 'isolation_forest':
+        detector = IsolationForest(
+            n_estimators=100, contamination=0.05, random_state=42
+        )
+        labels = detector.fit_predict(X)
+    else:
+        raise ValueError(f"Unknown method: {method!r}")
+    outlier_mask = labels == -1
+    return {
+        'method': method,
+        'outlier_indices': np.where(outlier_mask)[0].tolist(),
+        'n_outliers': int(outlier_mask.sum()),
+        'contamination_assumed': 0.05
+    }
+```
+## Diagnosis Framework
+Once candidate anomalies are flagged, classify each by likely cause:
+| Category | Indicators | Action |
+|----------|-----------|--------|
+| **Measurement error** | Value physically impossible, instrument log shows malfunction | Remove with documentation |
+| **Data entry error** | Obvious typo (e.g., extra digit), inconsistent units | Correct if source available, else remove |
+| **Sampling artifact** | Unusual but plausible value from edge of population | Retain; use robust methods |
+| **Genuine extreme** | Verified measurement, consistent with other variables | Retain; report sensitivity analysis |
+| **Contamination** | Data from wrong population or experimental condition | Remove with justification |
+### Diagnostic Checks
+- **Cross-variable consistency**: Does the flagged value make sense given other columns for the same observation?
+- **Temporal context**: For longitudinal data, is the spike consistent with known events?
+- **Instrument logs**: Can the anomaly be traced to a calibration or equipment issue?
+- **Domain knowledge**: Is the value within theoretically possible bounds?
+## Time-Series Anomaly Detection
+```python
+import numpy as np
+import pandas as pd
+def detect_timeseries_anomalies(series: np.ndarray, window: int = 20) -> dict:
+    """
+    Detect anomalies in time-series data using rolling statistics.
+    """
+    rolling_mean = pd.Series(series).rolling(window=window).mean()
+    rolling_std = pd.Series(series).rolling(window=window).std()
+    upper_bound = rolling_mean + 3 * rolling_std
+    lower_bound = rolling_mean - 3 * rolling_std
+    anomalies = (series > upper_bound) | (series < lower_bound)
+    return {
+        'anomaly_indices': np.where(anomalies)[0].tolist(),
+        'n_anomalies': int(anomalies.sum()),
+        'window_size': window
+    }
+```
+## Reporting Anomaly Handling
+When reporting anomaly handling in publications:
+1. **State the detection method** and its parameters (e.g., "Outliers were identified using the 1.5x IQR rule").
+2. **Report the number and percentage** of observations flagged.
+3. **Describe the disposition**: how many were removed, corrected, or retained.
+4. **Provide sensitivity analysis**: show that main conclusions hold with and without outliers.
+5. **Include in supplementary materials**: full list of flagged observations and their disposition.
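
The counts and the sensitivity comparison listed above could be assembled along these lines. This is a minimal sketch, not the skill's own implementation: the helper names `iqr_mask` and `sensitivity_summary` are hypothetical, and the 1.5x IQR rule is only one of the detection methods discussed earlier.

```python
import numpy as np

def iqr_mask(x, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def sensitivity_summary(x, k=1.5):
    """Report how many points are flagged and how the mean changes without them."""
    x = np.asarray(x, dtype=float)
    mask = iqr_mask(x, k)
    return {
        "n_total": int(x.size),
        "n_flagged": int(mask.sum()),
        "pct_flagged": round(100.0 * mask.sum() / x.size, 2),
        "mean_all": float(x.mean()),
        "mean_without_flagged": float(x[~mask].mean()),
        "median_all": float(np.median(x)),
    }

if __name__ == "__main__":
    data = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]
    print(sensitivity_summary(data))
```

Reporting both means side by side lets a reader judge whether the conclusion depends on the flagged observations.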
+## References
+- Rousseeuw, P. J. & Hubert, M. (2011). Robust Statistics for Outlier Detection. *WIREs Data Mining and Knowledge Discovery*, 1(1), 73-79.
+- Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation Forest. *ICDM 2008*.
+- Aguinis, H., Gottfredson, R. K., & Joo, H. (2013). Best-Practice Recommendations for Defining, Identifying, and Handling Outliers. *Organizational Research Methods*, 16(2), 270-301.

package/skills/analysis/statistics/general-statistics-guide/SKILL.md ADDED

@@ -0,0 +1,226 @@
+---
+name: general-statistics-guide
+description: "Conceptual foundations of statistical inference for empirical research"
+metadata:
+  openclaw:
+    emoji: "📈"
+    category: "analysis"
+    subcategory: "statistics"
+    keywords: ["statistical inference", "hypothesis testing", "probability", "regression", "confidence intervals", "statistical thinking"]
+    source: "https://clawhub.com/ivangdavila/statistics"
+---
+# Statistical Foundations for Empirical Research
+## Overview
+This guide builds statistical intuition from probability fundamentals through inferential methods to practical application in research. It is language-agnostic (not tied to R, Python, or Stata) and focuses on the concepts, assumptions, and interpretation of statistical methods commonly used in empirical papers. Use it as a reference when designing studies, choosing tests, or interpreting results.
+## Probability Foundations
+### Key Distributions
+| Distribution | When to Use | Parameters | Example |
+|-------------|-------------|-----------|---------|
+| **Normal** | Continuous, symmetric data; CLT applications | μ (mean), σ (std) | Height, test scores |
+| **Binomial** | Count of successes in n trials | n (trials), p (probability) | Survey yes/no responses |
+| **Poisson** | Count of rare events in fixed interval | λ (rate) | Paper citations per year |
+| **t-distribution** | Small sample means (n < 30) | df (degrees of freedom) | Pilot study comparisons |
+| **Chi-squared** | Goodness of fit, contingency tables | df | Category frequency tests |
+| **F-distribution** | Ratio of variances, ANOVA | df₁, df₂ | Comparing model fits |
+### Central Limit Theorem
+The sample mean $\bar{X}$ of n independent, identically distributed observations with finite variance approaches a normal distribution as n increases, regardless of the shape of the population distribution:
+```
+If X₁, X₂, ..., Xₙ are i.i.d. with mean μ and variance σ²:
+  √n(X̄ - μ) / σ → N(0, 1) as n → ∞
+Practical rule: n ≥ 30 is usually sufficient
+Exception: heavily skewed distributions may need n ≥ 100
+```
+This is why many inferential procedures (confidence intervals, t-tests, regression) remain approximately valid in large samples even when the underlying data are not normally distributed.
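
A small simulation can make the theorem concrete. The sketch below is illustrative only: it assumes an Exponential(1) population (strongly right-skewed, mean 1, standard deviation 1), and the function names are hypothetical. The skewness of the standardized sample mean should shrink toward 0 as n grows.

```python
import numpy as np

def standardized_sample_means(n, reps=20000, seed=0):
    """Draw `reps` samples of size n from Exponential(1); standardize each sample mean."""
    rng = np.random.default_rng(seed)
    samples = rng.exponential(scale=1.0, size=(reps, n))
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    return np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

def skewness(z):
    """Sample skewness: third central moment over the cubed standard deviation."""
    z = np.asarray(z, dtype=float)
    c = z - z.mean()
    return float((c**3).mean() / (c**2).mean() ** 1.5)

if __name__ == "__main__":
    for n in (2, 30, 200):
        z = standardized_sample_means(n)
        print(f"n={n:>3}  std={z.std():.3f}  skewness={skewness(z):.3f}")
```

For this population the theoretical skewness of the sample mean is 2/√n, which is why heavily skewed data can need more than 30 observations before the normal approximation is comfortable.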
+## Descriptive Statistics
+### Measures of Central Tendency
+| Measure | Formula | When to Use | Sensitive to Outliers? |
+|---------|---------|-------------|----------------------|
+| Mean | Σxᵢ / n | Symmetric distributions | Yes |
+| Median | Middle value when sorted | Skewed distributions, ordinal data | No |
+| Mode | Most frequent value | Categorical data, multimodal distributions | No |
+### Measures of Spread
+| Measure | Interpretation | When to Report |
+|---------|---------------|----------------|
+| Standard deviation (σ) | Typical (root-mean-square) distance from the mean | With the mean |
+| IQR (Q3 - Q1) | Spread of middle 50% | With the median |
+| Range (max - min) | Total spread | Rarely (sensitive to outliers) |
+| Coefficient of variation (σ/μ) | Relative spread | Comparing variability across scales |
+## Hypothesis Testing
+### The Testing Framework
+```
+1. State hypotheses:
+   H₀: null hypothesis (no effect, no difference)
+   H₁: alternative hypothesis (there is an effect)
+2. Choose significance level: α = 0.05 (conventional)
+3. Compute test statistic from data
+4. Compare to critical value or compute p-value
+5. Decision:
+   p < α → Reject H₀ (statistically significant)
+   p ≥ α → Fail to reject H₀ (not significant)
+```
+### Common Errors
+| | H₀ is True | H₀ is False |
+|---|---|---|
+| **Reject H₀** | Type I Error (α) | Correct (Power = 1 - β) |
+| **Fail to Reject H₀** | Correct | Type II Error (β) |
+**Practical interpretation**:
+- Type I (false positive): Claiming a drug works when it doesn't
+- Type II (false negative): Missing a real drug effect
+- Power: Probability of detecting a real effect (target ≥ 0.80)
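
The error rates in the table can be checked by simulation. This is a sketch under stated assumptions: two normal groups with unit variance, 64 observations per group, and a two-sample t-test at α = 0.05. The function name `rejection_rate` is hypothetical.

```python
import numpy as np
from scipy import stats

def rejection_rate(effect_size, n_per_group=64, alpha=0.05, reps=4000, seed=1):
    """Share of simulated studies in which the t-test rejects H0."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(reps, n_per_group))
    b = rng.normal(effect_size, 1.0, size=(reps, n_per_group))
    p = stats.ttest_ind(a, b, axis=1).pvalue  # one p-value per simulated study
    return float(np.mean(p < alpha))

if __name__ == "__main__":
    # With no true effect, every rejection is a Type I error.
    print("Type I error rate:", rejection_rate(0.0))
    # With a true effect of d = 0.5, the rejection rate is the power.
    print("Power at d = 0.5: ", rejection_rate(0.5))
```

With no true effect the rejection rate should sit near α, and with d = 0.5 and 64 per group it should land near the conventional 0.80 target.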
+### Choosing the Right Test
+| Question | Data Type | Test | Assumptions |
+|----------|-----------|------|-------------|
+| Compare 2 means | Continuous, normal | Independent t-test | Equal variance (or Welch's) |
+| Compare 2 means (paired) | Continuous, normal | Paired t-test | Paired observations |
+| Compare 2 means (non-normal) | Continuous/ordinal | Mann-Whitney U | Independent samples |
+| Compare >2 means | Continuous, normal | One-way ANOVA | Equal variance, normality |
+| Compare >2 means (non-normal) | Ordinal | Kruskal-Wallis | Independent samples |
+| Association (categorical) | Categorical × Categorical | Chi-squared test | Expected count ≥ 5 |
+| Correlation | Continuous × Continuous | Pearson r | Linear relationship, bivariate normal |
+| Correlation (non-normal) | Ordinal or non-normal | Spearman ρ | Monotonic relationship |
+## Regression Analysis
+### Linear Regression
+```
+Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
+Interpretation:
+  β₁ = change in Y for a 1-unit increase in X₁, holding other X's constant
+  R² = proportion of variance in Y explained by the model
+  Adjusted R² = R² penalized for number of predictors
+```
+**Key assumptions** (check before trusting results):
+1. **Linearity**: Y is a linear function of X's
+2. **Independence**: Observations are independent
+3. **Homoscedasticity**: Constant variance of residuals
+4. **Normality**: Residuals are approximately normal (for inference)
+5. **No multicollinearity**: X's are not highly correlated with each other
+**Diagnostic checks**:
+```
+Linearity:        Plot residuals vs. fitted values (no pattern)
+Homoscedasticity: Breusch-Pagan test or residual plot (no funnel shape)
+Normality:        Q-Q plot of residuals, Shapiro-Wilk test
+Multicollinearity: VIF (Variance Inflation Factor) — VIF > 10 is concerning
+Influential obs:  Cook's distance — D > 4/n warrants investigation
+```
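
As one illustration of these checks, the variance inflation factor can be computed directly from its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The sketch below uses plain NumPy; the function name `vif` and the simulated data are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=500)
    x2 = rng.normal(size=500)
    x3 = x1 + 0.1 * rng.normal(size=500)  # nearly a copy of x1
    print(np.round(vif(np.column_stack([x1, x2, x3])), 1))
```

Here x1 and x3 should show very large VIFs while x2 stays close to 1, matching the "VIF > 10 is concerning" guideline.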
+### Logistic Regression
+For binary outcomes (0/1):
+```
+log(p / (1-p)) = β₀ + β₁X₁ + β₂X₂ + ...
+Where p = P(Y = 1 | X)
+Interpretation:
+  exp(β₁) = odds ratio
+  exp(β₁) = 1.5 means "a 1-unit increase in X₁ multiplies the odds by 1.5"
+  Report: odds ratios with 95% CI
+```
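
An odds ratio multiplies odds, not probabilities, so the change in probability depends on the starting point. The following sketch works through that arithmetic for an odds ratio of 1.5; the function name `shift_probability` is hypothetical.

```python
import math

def shift_probability(p, beta, delta_x=1.0):
    """Probability after increasing X by delta_x, starting from probability p."""
    odds = p / (1.0 - p)
    new_odds = odds * math.exp(beta * delta_x)  # odds are multiplied by exp(beta)
    return new_odds / (1.0 + new_odds)

if __name__ == "__main__":
    beta = math.log(1.5)  # coefficient whose odds ratio is 1.5
    for p in (0.05, 0.20, 0.50):
        print(f"baseline p={p:.2f} -> new p={shift_probability(p, beta):.3f}")
```

The same odds ratio moves a baseline of 0.50 to 0.60 but a baseline of 0.20 only to about 0.27, which is why odds ratios should not be read as changes in probability.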
+## Confidence Intervals
+```
+Point estimate ± (critical value × standard error)
+For a mean: X̄ ± z*(σ/√n)  or  X̄ ± t*(s/√n)
+Interpretation (frequentist):
+  "If we repeated this study many times, 95% of the resulting intervals
+   would contain the true population parameter."
+NOT: "There is a 95% probability that the true value is in this interval."
+```
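
The frequentist reading of a confidence interval can be checked by repeated sampling. This is a minimal sketch assuming normally distributed data and the t-based interval for a mean; `mean_ci` and `coverage` are hypothetical helper names.

```python
import numpy as np
from scipy import stats

def mean_ci(x, level=0.95):
    """t-based confidence interval for the mean of x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    se = x.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.5 + level / 2.0, df=n - 1)
    m = x.mean()
    return m - t_crit * se, m + t_crit * se

def coverage(mu=10.0, sigma=2.0, n=20, reps=2000, seed=0):
    """Fraction of repeated studies whose interval contains the true mean."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = mean_ci(rng.normal(mu, sigma, size=n))
        hits += int(lo <= mu <= hi)
    return hits / reps

if __name__ == "__main__":
    print("Empirical coverage of 95% intervals:", coverage())
```

The coverage should come out close to 0.95: the probability statement is about the procedure across repeated studies, not about any single computed interval.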
+## Effect Sizes
+p-values indicate how incompatible the data are with the null hypothesis; effect sizes tell you how large the effect is.
+| Measure | Context | Small | Medium | Large |
+|---------|---------|-------|--------|-------|
+| Cohen's d | Mean difference | 0.2 | 0.5 | 0.8 |
+| Pearson r | Correlation | 0.1 | 0.3 | 0.5 |
+| η² (eta-squared) | ANOVA | 0.01 | 0.06 | 0.14 |
+| Odds ratio | Logistic regression | 1.5 | 2.5 | 4.3 |
+| R² | Regression | 0.02 | 0.13 | 0.26 |
+**Always report effect sizes alongside p-values** — a "significant" result with d = 0.05 is trivial in practice.
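
For two independent groups, Cohen's d is the mean difference divided by the pooled standard deviation. A minimal sketch, with the function name `cohens_d` chosen for illustration:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

if __name__ == "__main__":
    # Both groups have sample variance 4, and the means differ by 1.
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

In this toy example the pooled standard deviation is 2 and the mean difference is 1, so d = 0.5, a "medium" effect by the table above.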
+## Multiple Testing
+When testing multiple hypotheses simultaneously, the chance of at least one false positive increases:
+```
+With α = 0.05 and 20 independent tests:
+P(at least one false positive) = 1 - (1 - 0.05)^20 ≈ 0.64
+Corrections:
+  Bonferroni:         α_adj = α / m  (conservative)
+  Benjamini-Hochberg: Controls false discovery rate (FDR) (less conservative)
+  Holm-Bonferroni:    Step-down procedure (more powerful than Bonferroni)
+```
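
The three corrections can be compared on the same set of p-values. The sketch below implements the standard procedures in plain NumPy; the function names and the example p-values are made up for illustration.

```python
import numpy as np

def fwer(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni(p, alpha=0.05):
    p = np.asarray(p, dtype=float)
    return p <= alpha / p.size

def holm(p, alpha=0.05):
    """Step-down: compare sorted p-values to alpha/m, alpha/(m-1), ...; stop at first failure."""
    p = np.asarray(p, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

def benjamini_hochberg(p, q=0.05):
    """Step-up: reject the k smallest p-values, k = largest i with p_(i) <= q*i/m."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))
        reject[order[: k + 1]] = True
    return reject

if __name__ == "__main__":
    pvals = [0.001, 0.011, 0.02, 0.03, 0.2]
    print("FWER for 20 tests:", round(fwer(0.05, 20), 2))
    for name, fn in [("Bonferroni", bonferroni), ("Holm", holm), ("BH", benjamini_hochberg)]:
        print(name, int(fn(pvals).sum()), "rejections")
```

On these p-values Bonferroni rejects one hypothesis, Holm two, and Benjamini-Hochberg four, which reflects the ordering from most to least conservative described above.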
+## Sample Size and Power
+Before collecting data, determine the required sample size:
+```
+Inputs needed:
+  1. Desired power (typically 0.80)
+  2. Significance level (α = 0.05)
+  3. Expected effect size (from pilot study or literature)
+  4. Type of test (t-test, ANOVA, regression, etc.)
+Rule of thumb for two-sample t-test:
+  n per group ≈ 16 / d²  (for 80% power, α = 0.05)
+  d = 0.5 → n ≈ 64 per group
+  d = 0.2 → n ≈ 400 per group
+```
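
The 16/d² rule is a rounding of the normal-approximation formula n = 2(z₁₋α/₂ + z_power)² / d². A minimal sketch of that formula, assuming a two-sided, two-sample comparison with equal group sizes; the function name `n_per_group` is hypothetical.

```python
import math
from scipy import stats

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate sample size per group for a two-sided two-sample test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = stats.norm.ppf(power)          # about 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d**2)

if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(f"d={d}: normal approx n={n_per_group(d)}, rule of thumb n={math.ceil(16 / d**2)}")
```

The approximation gives slightly smaller numbers than the rule of thumb (63 versus 64 at d = 0.5), and an exact t-based calculation would add roughly one observation per group, so the rule errs on the safe side.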
+## Common Pitfalls
+1. **p-hacking**: Trying many analyses until p < 0.05. Fix: pre-register analyses.
+2. **Absence of evidence ≠ evidence of absence**: p > 0.05 does not prove H₀. Consider equivalence tests.
+3. **Correlation ≠ causation**: Regression coefficients have a causal interpretation only with a proper identification strategy.
+4. **Simpson's paradox**: A trend seen within subgroups can reverse when the subgroups are combined. Always check stratified analyses.
+5. **Overfitting**: Too many predictors relative to sample size. Rule of thumb: at least 10-20 observations per predictor.
+## References
+- Agresti, A. (2018). *Statistical Methods for the Social Sciences* (5th ed.). Pearson.
+- Wasserstein, R. L., & Lazar, N. A. (2016). "The ASA Statement on p-Values." *The American Statistician*, 70(2), 129-133.
+- Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). Routledge.