@wentorai/research-plugins 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +204 -0
- package/curated/analysis/README.md +64 -0
- package/curated/domains/README.md +104 -0
- package/curated/literature/README.md +53 -0
- package/curated/research/README.md +62 -0
- package/curated/tools/README.md +87 -0
- package/curated/writing/README.md +61 -0
- package/index.ts +39 -0
- package/mcp-configs/academic-db/ChatSpatial.json +17 -0
- package/mcp-configs/academic-db/academia-mcp.json +17 -0
- package/mcp-configs/academic-db/academic-paper-explorer.json +17 -0
- package/mcp-configs/academic-db/academic-search-mcp-server.json +17 -0
- package/mcp-configs/academic-db/agentinterviews-mcp.json +17 -0
- package/mcp-configs/academic-db/all-in-mcp.json +17 -0
- package/mcp-configs/academic-db/apple-health-mcp.json +17 -0
- package/mcp-configs/academic-db/arxiv-latex-mcp.json +17 -0
- package/mcp-configs/academic-db/arxiv-mcp-server.json +17 -0
- package/mcp-configs/academic-db/bgpt-mcp.json +17 -0
- package/mcp-configs/academic-db/biomcp.json +17 -0
- package/mcp-configs/academic-db/biothings-mcp.json +17 -0
- package/mcp-configs/academic-db/catalysishub-mcp-server.json +17 -0
- package/mcp-configs/academic-db/clinicaltrialsgov-mcp-server.json +17 -0
- package/mcp-configs/academic-db/deep-research-mcp.json +17 -0
- package/mcp-configs/academic-db/dicom-mcp.json +17 -0
- package/mcp-configs/academic-db/enrichr-mcp-server.json +17 -0
- package/mcp-configs/academic-db/fec-mcp-server.json +17 -0
- package/mcp-configs/academic-db/fhir-mcp-server-themomentum.json +17 -0
- package/mcp-configs/academic-db/fhir-mcp.json +19 -0
- package/mcp-configs/academic-db/gget-mcp.json +17 -0
- package/mcp-configs/academic-db/google-researcher-mcp.json +17 -0
- package/mcp-configs/academic-db/idea-reality-mcp.json +17 -0
- package/mcp-configs/academic-db/legiscan-mcp.json +19 -0
- package/mcp-configs/academic-db/lex.json +17 -0
- package/mcp-configs/ai-platform/Adaptive-Graph-of-Thoughts-MCP-server.json +17 -0
- package/mcp-configs/ai-platform/ai-counsel.json +17 -0
- package/mcp-configs/ai-platform/atlas-mcp-server.json +17 -0
- package/mcp-configs/ai-platform/counsel-mcp.json +17 -0
- package/mcp-configs/ai-platform/cross-llm-mcp.json +17 -0
- package/mcp-configs/ai-platform/gptr-mcp.json +17 -0
- package/mcp-configs/browser/decipher-research-agent.json +17 -0
- package/mcp-configs/browser/deep-research.json +17 -0
- package/mcp-configs/browser/everything-claude-code.json +17 -0
- package/mcp-configs/browser/gpt-researcher.json +17 -0
- package/mcp-configs/browser/heurist-agent-framework.json +17 -0
- package/mcp-configs/data-platform/4everland-hosting-mcp.json +17 -0
- package/mcp-configs/data-platform/context-keeper.json +17 -0
- package/mcp-configs/data-platform/context7.json +19 -0
- package/mcp-configs/data-platform/contextstream-mcp.json +17 -0
- package/mcp-configs/data-platform/email-mcp.json +17 -0
- package/mcp-configs/note-knowledge/ApeRAG.json +17 -0
- package/mcp-configs/note-knowledge/In-Memoria.json +17 -0
- package/mcp-configs/note-knowledge/agent-memory.json +17 -0
- package/mcp-configs/note-knowledge/aimemo.json +17 -0
- package/mcp-configs/note-knowledge/biel-mcp.json +19 -0
- package/mcp-configs/note-knowledge/cognee.json +17 -0
- package/mcp-configs/note-knowledge/context-awesome.json +17 -0
- package/mcp-configs/note-knowledge/context-mcp.json +17 -0
- package/mcp-configs/note-knowledge/conversation-handoff-mcp.json +17 -0
- package/mcp-configs/note-knowledge/cortex.json +17 -0
- package/mcp-configs/note-knowledge/devrag.json +17 -0
- package/mcp-configs/note-knowledge/easy-obsidian-mcp.json +17 -0
- package/mcp-configs/note-knowledge/engram.json +17 -0
- package/mcp-configs/note-knowledge/gnosis-mcp.json +17 -0
- package/mcp-configs/note-knowledge/graphlit-mcp-server.json +19 -0
- package/mcp-configs/reference-mgr/arxiv-cli.json +17 -0
- package/mcp-configs/reference-mgr/arxiv-search-mcp.json +17 -0
- package/mcp-configs/reference-mgr/chiken.json +17 -0
- package/mcp-configs/reference-mgr/claude-scholar.json +17 -0
- package/mcp-configs/reference-mgr/devonthink-mcp.json +17 -0
- package/mcp-configs/registry.json +447 -0
- package/openclaw.plugin.json +21 -0
- package/package.json +61 -0
- package/skills/analysis/dataviz/color-accessibility-guide/SKILL.md +230 -0
- package/skills/analysis/dataviz/geospatial-viz-guide/SKILL.md +218 -0
- package/skills/analysis/dataviz/interactive-viz-guide/SKILL.md +287 -0
- package/skills/analysis/dataviz/network-visualization-guide/SKILL.md +195 -0
- package/skills/analysis/dataviz/publication-figures-guide/SKILL.md +238 -0
- package/skills/analysis/dataviz/python-dataviz-guide/SKILL.md +195 -0
- package/skills/analysis/econometrics/causal-inference-guide/SKILL.md +197 -0
- package/skills/analysis/econometrics/iv-regression-guide/SKILL.md +198 -0
- package/skills/analysis/econometrics/panel-data-guide/SKILL.md +274 -0
- package/skills/analysis/econometrics/robustness-checks/SKILL.md +250 -0
- package/skills/analysis/econometrics/stata-regression/SKILL.md +117 -0
- package/skills/analysis/econometrics/time-series-guide/SKILL.md +235 -0
- package/skills/analysis/statistics/bayesian-statistics-guide/SKILL.md +221 -0
- package/skills/analysis/statistics/hypothesis-testing-guide/SKILL.md +210 -0
- package/skills/analysis/statistics/meta-analysis-guide/SKILL.md +206 -0
- package/skills/analysis/statistics/nonparametric-tests-guide/SKILL.md +221 -0
- package/skills/analysis/statistics/power-analysis-guide/SKILL.md +240 -0
- package/skills/analysis/statistics/sem-guide/SKILL.md +231 -0
- package/skills/analysis/statistics/survival-analysis-guide/SKILL.md +195 -0
- package/skills/analysis/wrangling/missing-data-handling/SKILL.md +224 -0
- package/skills/analysis/wrangling/pandas-data-wrangling/SKILL.md +242 -0
- package/skills/analysis/wrangling/questionnaire-design-guide/SKILL.md +234 -0
- package/skills/analysis/wrangling/text-mining-guide/SKILL.md +225 -0
- package/skills/domains/ai-ml/computer-vision-guide/SKILL.md +213 -0
- package/skills/domains/ai-ml/deep-learning-papers-guide/SKILL.md +200 -0
- package/skills/domains/ai-ml/llm-evaluation-guide/SKILL.md +194 -0
- package/skills/domains/ai-ml/prompt-engineering-research/SKILL.md +233 -0
- package/skills/domains/ai-ml/reinforcement-learning-guide/SKILL.md +254 -0
- package/skills/domains/ai-ml/transformer-architecture-guide/SKILL.md +233 -0
- package/skills/domains/biomedical/clinical-research-guide/SKILL.md +232 -0
- package/skills/domains/biomedical/clinicaltrials-api/SKILL.md +177 -0
- package/skills/domains/biomedical/epidemiology-guide/SKILL.md +200 -0
- package/skills/domains/biomedical/genomics-analysis-guide/SKILL.md +270 -0
- package/skills/domains/business/market-analysis-guide/SKILL.md +112 -0
- package/skills/domains/business/strategic-management-guide/SKILL.md +154 -0
- package/skills/domains/chemistry/computational-chemistry-guide/SKILL.md +266 -0
- package/skills/domains/chemistry/retrosynthesis-guide/SKILL.md +215 -0
- package/skills/domains/cs/algorithms-complexity-guide/SKILL.md +194 -0
- package/skills/domains/cs/dblp-api/SKILL.md +129 -0
- package/skills/domains/cs/software-engineering-research/SKILL.md +218 -0
- package/skills/domains/ecology/biodiversity-data-guide/SKILL.md +296 -0
- package/skills/domains/ecology/conservation-biology-guide/SKILL.md +198 -0
- package/skills/domains/ecology/gbif-api/SKILL.md +158 -0
- package/skills/domains/ecology/inaturalist-api/SKILL.md +173 -0
- package/skills/domains/economics/behavioral-economics-guide/SKILL.md +239 -0
- package/skills/domains/economics/development-economics-guide/SKILL.md +181 -0
- package/skills/domains/economics/fred-api/SKILL.md +189 -0
- package/skills/domains/education/curriculum-design-guide/SKILL.md +144 -0
- package/skills/domains/education/learning-science-guide/SKILL.md +150 -0
- package/skills/domains/finance/financial-data-analysis/SKILL.md +152 -0
- package/skills/domains/finance/quantitative-finance-guide/SKILL.md +151 -0
- package/skills/domains/geoscience/climate-science-guide/SKILL.md +158 -0
- package/skills/domains/geoscience/gis-remote-sensing-guide/SKILL.md +129 -0
- package/skills/domains/humanities/digital-humanities-guide/SKILL.md +181 -0
- package/skills/domains/humanities/philosophy-research-guide/SKILL.md +148 -0
- package/skills/domains/law/courtlistener-api/SKILL.md +213 -0
- package/skills/domains/law/legal-research-guide/SKILL.md +250 -0
- package/skills/domains/math/linear-algebra-applications/SKILL.md +227 -0
- package/skills/domains/math/numerical-methods-guide/SKILL.md +236 -0
- package/skills/domains/math/oeis-api/SKILL.md +158 -0
- package/skills/domains/pharma/clinical-pharmacology-guide/SKILL.md +165 -0
- package/skills/domains/pharma/drug-development-guide/SKILL.md +177 -0
- package/skills/domains/physics/computational-physics-guide/SKILL.md +300 -0
- package/skills/domains/physics/nasa-ads-api/SKILL.md +150 -0
- package/skills/domains/physics/quantum-computing-guide/SKILL.md +234 -0
- package/skills/domains/social-science/social-research-methods/SKILL.md +194 -0
- package/skills/domains/social-science/survey-research-guide/SKILL.md +182 -0
- package/skills/literature/discovery/citation-alert-guide/SKILL.md +154 -0
- package/skills/literature/discovery/conference-proceedings-guide/SKILL.md +142 -0
- package/skills/literature/discovery/literature-mapping-guide/SKILL.md +175 -0
- package/skills/literature/discovery/paper-tracking-guide/SKILL.md +211 -0
- package/skills/literature/discovery/rss-paper-feeds/SKILL.md +214 -0
- package/skills/literature/discovery/semantic-scholar-recs-guide/SKILL.md +164 -0
- package/skills/literature/fulltext/doaj-api/SKILL.md +120 -0
- package/skills/literature/fulltext/interlibrary-loan-guide/SKILL.md +163 -0
- package/skills/literature/fulltext/open-access-guide/SKILL.md +183 -0
- package/skills/literature/fulltext/pmc-oai-api/SKILL.md +184 -0
- package/skills/literature/fulltext/preprint-servers-guide/SKILL.md +128 -0
- package/skills/literature/fulltext/repository-harvesting-guide/SKILL.md +207 -0
- package/skills/literature/fulltext/unpaywall-api/SKILL.md +113 -0
- package/skills/literature/metadata/altmetrics-guide/SKILL.md +132 -0
- package/skills/literature/metadata/citation-network-guide/SKILL.md +236 -0
- package/skills/literature/metadata/crossref-api/SKILL.md +133 -0
- package/skills/literature/metadata/datacite-api/SKILL.md +126 -0
- package/skills/literature/metadata/doi-resolution-guide/SKILL.md +168 -0
- package/skills/literature/metadata/h-index-guide/SKILL.md +183 -0
- package/skills/literature/metadata/journal-metrics-guide/SKILL.md +188 -0
- package/skills/literature/metadata/opencitations-api/SKILL.md +128 -0
- package/skills/literature/metadata/orcid-api/SKILL.md +136 -0
- package/skills/literature/metadata/orcid-integration-guide/SKILL.md +178 -0
- package/skills/literature/search/arxiv-api/SKILL.md +95 -0
- package/skills/literature/search/biorxiv-api/SKILL.md +123 -0
- package/skills/literature/search/boolean-search-guide/SKILL.md +199 -0
- package/skills/literature/search/citation-chaining-guide/SKILL.md +148 -0
- package/skills/literature/search/database-comparison-guide/SKILL.md +100 -0
- package/skills/literature/search/europe-pmc-api/SKILL.md +120 -0
- package/skills/literature/search/google-scholar-guide/SKILL.md +182 -0
- package/skills/literature/search/mesh-terms-guide/SKILL.md +164 -0
- package/skills/literature/search/openalex-api/SKILL.md +134 -0
- package/skills/literature/search/pubmed-api/SKILL.md +130 -0
- package/skills/literature/search/scientify-literature-survey/SKILL.md +203 -0
- package/skills/literature/search/semantic-scholar-api/SKILL.md +134 -0
- package/skills/literature/search/systematic-search-strategy/SKILL.md +214 -0
- package/skills/research/automation/ai-scientist-guide/SKILL.md +228 -0
- package/skills/research/automation/data-collection-automation/SKILL.md +248 -0
- package/skills/research/automation/research-workflow-automation/SKILL.md +266 -0
- package/skills/research/deep-research/meta-synthesis-guide/SKILL.md +174 -0
- package/skills/research/deep-research/research-cog/SKILL.md +153 -0
- package/skills/research/deep-research/scoping-review-guide/SKILL.md +217 -0
- package/skills/research/deep-research/systematic-review-guide/SKILL.md +250 -0
- package/skills/research/funding/figshare-api/SKILL.md +163 -0
- package/skills/research/funding/grant-writing-guide/SKILL.md +233 -0
- package/skills/research/funding/nsf-grant-guide/SKILL.md +206 -0
- package/skills/research/funding/open-science-guide/SKILL.md +255 -0
- package/skills/research/funding/zenodo-api/SKILL.md +174 -0
- package/skills/research/methodology/action-research-guide/SKILL.md +201 -0
- package/skills/research/methodology/experimental-design-guide/SKILL.md +236 -0
- package/skills/research/methodology/grad-school-guide/SKILL.md +182 -0
- package/skills/research/methodology/grounded-theory-guide/SKILL.md +171 -0
- package/skills/research/methodology/mixed-methods-guide/SKILL.md +208 -0
- package/skills/research/methodology/qualitative-research-guide/SKILL.md +234 -0
- package/skills/research/methodology/scientify-idea-generation/SKILL.md +222 -0
- package/skills/research/paper-review/paper-reading-assistant/SKILL.md +266 -0
- package/skills/research/paper-review/peer-review-guide/SKILL.md +227 -0
- package/skills/research/paper-review/rebuttal-writing-guide/SKILL.md +185 -0
- package/skills/research/paper-review/scientify-write-review-paper/SKILL.md +209 -0
- package/skills/tools/code-exec/jupyter-notebook-guide/SKILL.md +178 -0
- package/skills/tools/code-exec/python-reproducibility-guide/SKILL.md +341 -0
- package/skills/tools/code-exec/r-reproducibility-guide/SKILL.md +236 -0
- package/skills/tools/code-exec/sandbox-execution-guide/SKILL.md +221 -0
- package/skills/tools/diagram/mermaid-diagram-guide/SKILL.md +269 -0
- package/skills/tools/diagram/plantuml-guide/SKILL.md +397 -0
- package/skills/tools/diagram/scientific-illustration-guide/SKILL.md +225 -0
- package/skills/tools/document/anystyle-api/SKILL.md +199 -0
- package/skills/tools/document/grobid-pdf-parsing/SKILL.md +294 -0
- package/skills/tools/document/markdown-academic-guide/SKILL.md +217 -0
- package/skills/tools/document/pdf-extraction-guide/SKILL.md +321 -0
- package/skills/tools/knowledge-graph/knowledge-graph-construction/SKILL.md +306 -0
- package/skills/tools/knowledge-graph/ontology-design-guide/SKILL.md +214 -0
- package/skills/tools/knowledge-graph/rag-methodology-guide/SKILL.md +325 -0
- package/skills/tools/ocr-translate/formula-recognition-guide/SKILL.md +367 -0
- package/skills/tools/ocr-translate/handwriting-recognition-guide/SKILL.md +211 -0
- package/skills/tools/ocr-translate/latex-ocr-guide/SKILL.md +204 -0
- package/skills/tools/ocr-translate/multilingual-research-guide/SKILL.md +234 -0
- package/skills/tools/scraping/academic-web-scraping/SKILL.md +326 -0
- package/skills/tools/scraping/api-data-collection-guide/SKILL.md +301 -0
- package/skills/tools/scraping/web-scraping-ethics-guide/SKILL.md +250 -0
- package/skills/writing/citation/bibtex-management-guide/SKILL.md +246 -0
- package/skills/writing/citation/citation-style-guide/SKILL.md +248 -0
- package/skills/writing/citation/reference-manager-comparison/SKILL.md +208 -0
- package/skills/writing/citation/zotero-api/SKILL.md +188 -0
- package/skills/writing/composition/abstract-writing-guide/SKILL.md +188 -0
- package/skills/writing/composition/discussion-writing-guide/SKILL.md +194 -0
- package/skills/writing/composition/introduction-writing-guide/SKILL.md +194 -0
- package/skills/writing/composition/literature-review-writing/SKILL.md +196 -0
- package/skills/writing/composition/methods-section-guide/SKILL.md +185 -0
- package/skills/writing/composition/response-to-reviewers/SKILL.md +215 -0
- package/skills/writing/composition/scientific-writing-guide/SKILL.md +152 -0
- package/skills/writing/latex/bibliography-management-guide/SKILL.md +206 -0
- package/skills/writing/latex/latex-drawing-guide/SKILL.md +234 -0
- package/skills/writing/latex/latex-ecosystem-guide/SKILL.md +240 -0
- package/skills/writing/latex/math-typesetting-guide/SKILL.md +231 -0
- package/skills/writing/latex/overleaf-collaboration-guide/SKILL.md +211 -0
- package/skills/writing/latex/tikz-diagrams-guide/SKILL.md +211 -0
- package/skills/writing/polish/academic-translation-guide/SKILL.md +175 -0
- package/skills/writing/polish/academic-writing-refiner/SKILL.md +143 -0
- package/skills/writing/polish/ai-writing-humanizer/SKILL.md +178 -0
- package/skills/writing/polish/grammar-checker-guide/SKILL.md +184 -0
- package/skills/writing/polish/plagiarism-detection-guide/SKILL.md +167 -0
- package/skills/writing/templates/beamer-presentation-guide/SKILL.md +263 -0
- package/skills/writing/templates/conference-paper-template/SKILL.md +219 -0
- package/skills/writing/templates/thesis-template-guide/SKILL.md +200 -0
- package/skills/writing/templates/thesis-writing-guide/SKILL.md +220 -0
- package/src/tools/arxiv.ts +131 -0
- package/src/tools/crossref.ts +112 -0
- package/src/tools/openalex.ts +174 -0
- package/src/tools/pubmed.ts +166 -0
- package/src/tools/semantic-scholar.ts +108 -0
- package/src/tools/unpaywall.ts +58 -0
@@ -0,0 +1,224 @@
---
name: missing-data-handling
description: "Diagnose missing data patterns and apply appropriate imputation strategies"
metadata:
  openclaw:
    emoji: "🧩"
    category: "analysis"
    subcategory: "wrangling"
    keywords: ["missing value imputation", "missing data handling", "outlier detection", "data cleaning", "multiple imputation"]
    source: "wentor"
---

# Missing Data Handling

A skill for diagnosing missing data mechanisms, selecting appropriate imputation strategies, and conducting sensitivity analyses. Covers everything from simple imputation to multiple imputation and modern machine learning approaches.

## Missing Data Mechanisms

### Rubin's Classification

Understanding the mechanism determines the appropriate handling strategy:

| Mechanism | Definition | Example | Implication |
|-----------|-----------|---------|-------------|
| MCAR | Missingness unrelated to any variable | Lab sample randomly contaminated | Listwise deletion is unbiased (but loses power) |
| MAR | Missingness related to observed variables | Older respondents skip the income question more often (age is observed) | Multiple imputation is appropriate |
| MNAR | Missingness related to the missing value itself | Depressed patients drop out of a depression study | Requires sensitivity analysis; no simple fix |
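The distinction can be made concrete with a tiny simulation. The variables (`age`, `income`) and the logistic dropout rule are illustrative assumptions, but they show the key signature of MAR: rows with a missing value differ systematically on an *observed* variable, which a simple two-sample t-test can pick up.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
age = rng.normal(50, 10, n)
income = 20000 + 800 * age + rng.normal(0, 5000, n)

# MAR: the probability of skipping the income question rises with (observed) age
p_miss = 1 / (1 + np.exp(-(age - 55) / 5))
income_obs = np.where(rng.random(n) < p_miss, np.nan, income)
df = pd.DataFrame({'age': age, 'income': income_obs})

# Compare the observed variable between missing and non-missing rows
mask = df['income'].isna()
t_stat, p_val = stats.ttest_ind(df.loc[mask, 'age'], df.loc[~mask, 'age'])
# A tiny p-value is evidence against MCAR (consistent with MAR or MNAR)
```

Note that the test cannot distinguish MAR from MNAR; that distinction rests on substantive knowledge about why values are missing.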
### Diagnosing the Mechanism

```python
import pandas as pd
import numpy as np
from scipy import stats

def diagnose_missing_data(df: pd.DataFrame) -> dict:
    """
    Diagnose missing data patterns and mechanism.
    """
    n_rows, n_cols = df.shape
    results = {
        'total_cells': n_rows * n_cols,
        'total_missing': df.isnull().sum().sum(),
        'pct_missing': (df.isnull().sum().sum() / (n_rows * n_cols)) * 100,
        'by_column': {}
    }

    for col in df.columns:
        n_missing = df[col].isnull().sum()
        pct = n_missing / n_rows * 100
        results['by_column'][col] = {
            'n_missing': n_missing,
            'pct_missing': round(pct, 2)
        }

    # Informal MCAR check (not Little's test proper):
    # compare means of other variables between missing/non-missing groups
    mcar_tests = {}
    for col in df.columns:
        if df[col].isnull().sum() > 0:
            missing_mask = df[col].isnull()
            for other_col in df.select_dtypes(include=[np.number]).columns:
                if other_col != col and df[other_col].isnull().sum() == 0:
                    group_missing = df.loc[missing_mask, other_col]
                    group_observed = df.loc[~missing_mask, other_col]
                    if len(group_missing) > 1 and len(group_observed) > 1:
                        t_stat, p_val = stats.ttest_ind(group_missing, group_observed)
                        mcar_tests[f'{col}_vs_{other_col}'] = {
                            't': round(t_stat, 3),
                            'p': round(p_val, 4)
                        }

    significant_diffs = sum(1 for v in mcar_tests.values() if v['p'] < 0.05)
    results['mcar_assessment'] = (
        'Likely MCAR' if significant_diffs == 0
        else f'Likely NOT MCAR ({significant_diffs} significant differences found)'
    )
    results['mcar_tests'] = mcar_tests

    return results
```
## Imputation Methods

### Simple Imputation

```python
def simple_imputation(df: pd.DataFrame, strategy: str = 'mean') -> pd.DataFrame:
    """
    Apply simple imputation strategies.

    Args:
        strategy: 'mean', 'median', 'mode', or 'forward_fill'
    """
    imputed = df.copy()

    for col in imputed.columns:
        if imputed[col].isnull().any():
            # Assign back rather than fillna(inplace=True) on a column,
            # which triggers chained-assignment warnings in recent pandas
            if strategy == 'mean' and np.issubdtype(imputed[col].dtype, np.number):
                imputed[col] = imputed[col].fillna(imputed[col].mean())
            elif strategy == 'median' and np.issubdtype(imputed[col].dtype, np.number):
                imputed[col] = imputed[col].fillna(imputed[col].median())
            elif strategy == 'mode':
                imputed[col] = imputed[col].fillna(imputed[col].mode()[0])
            elif strategy == 'forward_fill':
                imputed[col] = imputed[col].ffill()

    return imputed
```
### Multiple Imputation (MICE)

The gold standard for MAR data:

```python
from scipy import stats
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

def multiple_imputation(df: pd.DataFrame, n_imputations: int = 20,
                        max_iter: int = 50) -> list[pd.DataFrame]:
    """
    Perform Multiple Imputation by Chained Equations (MICE).

    Args:
        df: DataFrame with missing values (numeric columns only)
        n_imputations: Number of imputed datasets (>=20 recommended)
        max_iter: Maximum iterations per imputation
    Returns:
        List of completed DataFrames
    """
    imputed_datasets = []

    for i in range(n_imputations):
        imputer = IterativeImputer(
            estimator=BayesianRidge(),
            max_iter=max_iter,
            random_state=i,
            sample_posterior=True  # Important for proper MI
        )
        imputed_data = imputer.fit_transform(df)
        imputed_df = pd.DataFrame(imputed_data, columns=df.columns, index=df.index)
        imputed_datasets.append(imputed_df)

    return imputed_datasets


def pool_mi_results(estimates: list[float], variances: list[float]) -> dict:
    """
    Pool results across multiply imputed datasets using Rubin's rules.

    Args:
        estimates: Parameter estimate from each imputed dataset
        variances: Variance of estimate from each imputed dataset
    """
    m = len(estimates)
    q_bar = np.mean(estimates)     # Pooled estimate
    u_bar = np.mean(variances)     # Within-imputation variance
    b = np.var(estimates, ddof=1)  # Between-imputation variance

    # Total variance
    total_var = u_bar + (1 + 1/m) * b

    # Fraction of missing information and approximate df (Rubin, 1987)
    lambda_hat = ((1 + 1/m) * b) / total_var
    df_approx = (m - 1) / max(lambda_hat, 1e-12)**2

    se = np.sqrt(total_var)
    t_crit = stats.t.ppf(0.975, df_approx)
    ci = (q_bar - t_crit * se, q_bar + t_crit * se)

    return {
        'pooled_estimate': q_bar,
        'pooled_se': se,
        'ci_95': ci,
        'fraction_missing_info': lambda_hat,
        'relative_efficiency': 1 / (1 + lambda_hat/m)
    }
```
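An end-to-end run might look like the following sketch on simulated data. The regression, m = 5, and the hand-rolled OLS slope variance are illustrative assumptions (in practice use m >= 20 and your modeling library's standard errors); the last two lines apply Rubin's rules inline.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
x_obs = x.copy()
x_obs[rng.random(n) < 0.25] = np.nan           # ~25% of the covariate missing
df = pd.DataFrame({'x': x_obs, 'y': y})

estimates, variances = [], []
for i in range(5):                              # m = 5 for speed; >= 20 in practice
    imp = IterativeImputer(sample_posterior=True, random_state=i)
    completed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
    X = np.column_stack([np.ones(n), completed['x']])
    beta, *_ = np.linalg.lstsq(X, completed['y'].to_numpy(), rcond=None)
    resid = completed['y'].to_numpy() - X @ beta
    sigma2 = resid @ resid / (n - 2)
    var_slope = sigma2 * np.linalg.inv(X.T @ X)[1, 1]  # OLS slope variance
    estimates.append(beta[1])
    variances.append(var_slope)

# Rubin's rules, inline
m = len(estimates)
q_bar = np.mean(estimates)
total_var = np.mean(variances) + (1 + 1 / m) * np.var(estimates, ddof=1)
```

The between-imputation term in `total_var` is what single imputation silently drops, which is why single-imputation standard errors are too small.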
## Outlier Detection

### Statistical Methods

```python
def detect_outliers(series: pd.Series, method: str = 'iqr') -> pd.Series:
    """
    Detect outliers using the specified method.

    Returns a boolean mask where True indicates an outlier.
    """
    if method == 'iqr':
        q1 = series.quantile(0.25)
        q3 = series.quantile(0.75)
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        return (series < lower) | (series > upper)

    elif method == 'zscore':
        z = np.abs((series - series.mean()) / series.std())
        return z > 3

    elif method == 'mad':
        median = series.median()
        mad = np.median(np.abs(series - median))
        modified_z = 0.6745 * (series - median) / (mad + 1e-10)
        return np.abs(modified_z) > 3.5

    else:
        raise ValueError(f"Unknown method: {method}")
```
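A quick self-contained comparison of the three rules on a toy series with two planted outliers (the thresholds mirror those above). All three agree here, but note that the z-score's mean and standard deviation are themselves inflated by the outliers, which is why the MAD rule holds up better on heavily contaminated data.

```python
import numpy as np
import pandas as pd

# 50 well-behaved points plus two planted outliers
s = pd.Series(list(np.linspace(10, 20, 50)) + [100.0, -50.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_flags = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

z_flags = np.abs((s - s.mean()) / s.std()) > 3   # mean/std distorted by outliers

med = s.median()
mad = np.median(np.abs(s - med))
mad_flags = np.abs(0.6745 * (s - med) / (mad + 1e-10)) > 3.5
```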
## Reporting Standards

When reporting missing data handling in a paper:

1. Report the amount and pattern of missing data (by variable and overall)
2. State the assumed mechanism (MCAR/MAR/MNAR) with justification
3. Describe the imputation method and software used
4. Report the number of imputations (for MI)
5. Conduct sensitivity analyses (e.g., compare results from complete-case, single imputation, and multiple imputation)
6. Report results using Rubin's pooling rules for MI

Never simply delete missing data without justification. Even for MCAR data, listwise deletion reduces statistical power and is rarely the best choice.
@@ -0,0 +1,242 @@
---
name: pandas-data-wrangling
description: "Data cleaning, transformation, and exploratory analysis with pandas"
metadata:
  openclaw:
    emoji: "🧹"
    category: "analysis"
    subcategory: "wrangling"
    keywords: ["CSV data analyzer", "pandas", "data wrangling", "exploratory data analysis", "missing value imputation"]
    source: "N/A"
---

# Pandas Data Wrangling Guide

## Overview

Data wrangling -- the process of cleaning, transforming, and preparing raw data for analysis -- typically consumes 60-80% of a data scientist's time. Pandas is the de facto standard library for tabular data manipulation in Python, and mastering its idioms directly translates to faster, more reliable research workflows.

This guide covers the essential pandas operations that researchers encounter daily: loading heterogeneous data sources, diagnosing data quality issues, handling missing values, reshaping data for analysis, and performing exploratory data analysis (EDA). Each section includes copy-paste code examples designed for real-world research datasets.

Whether you are cleaning survey responses, preprocessing experimental logs, merging datasets from multiple sources, or preparing features for machine learning, the patterns here will save hours of trial and error.
## Loading and Inspecting Data

### Reading Common Formats

```python
import pandas as pd
import numpy as np

# CSV with encoding and date parsing
df = pd.read_csv('data.csv', encoding='utf-8',
                 parse_dates=['timestamp'],
                 dtype={'participant_id': str})

# Excel with specific sheet
df = pd.read_excel('data.xlsx', sheet_name='Experiment1',
                   header=1)  # Skip first row

# JSON (nested); json_data is an already-parsed object, e.g. from json.load()
df = pd.json_normalize(json_data, record_path='results',
                       meta=['experiment_id', 'date'])

# Parquet (fast, columnar)
df = pd.read_parquet('data.parquet')
```
### Initial Diagnostics

```python
# Shape and types
print(f"Shape: {df.shape}")
print(df.dtypes)
df.info(memory_usage='deep')  # info() prints directly and returns None

# Statistical summary
print(df.describe(include='all'))

# Missing value report
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
missing_report = pd.DataFrame({
    'count': missing,
    'percent': missing_pct
}).query('count > 0').sort_values('percent', ascending=False)
print(missing_report)

# Duplicate check
n_dupes = df.duplicated().sum()
print(f"Duplicate rows: {n_dupes}")
```
## Handling Missing Data

### Strategy Decision Tree

| Situation | Strategy | pandas Method |
|-----------|----------|---------------|
| < 5% missing, random | Drop rows | `df.dropna()` |
| Numeric, moderate missing | Mean/median imputation | `df.fillna(df.median(numeric_only=True))` |
| Categorical missing | Mode or "Unknown" | `df.fillna('Unknown')` |
| Time series gaps | Forward/backward fill | `df.ffill()` / `df.bfill()` |
| Systematic missing | Multiple imputation | `sklearn.impute.IterativeImputer` |
| Feature > 50% missing | Drop column | `df.drop(columns=[...])` |
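The table's defaults can be sketched as a single pass over the columns. This is an illustrative helper, not a substitute for diagnosing the mechanism first; `drop_threshold` and the `'Unknown'` fallback are assumptions.

```python
import numpy as np
import pandas as pd

def apply_default_strategy(df: pd.DataFrame, drop_threshold: float = 0.5) -> pd.DataFrame:
    """Apply the decision tree's rough defaults column by column."""
    out = df.copy()
    for col in out.columns:
        frac = out[col].isna().mean()
        if frac == 0:
            continue
        if frac > drop_threshold:
            out = out.drop(columns=[col])                # mostly missing: drop
        elif np.issubdtype(out[col].dtype, np.number):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna('Unknown')
    return out

demo = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'cat': ['a', None, 'b'],
    'mostly_gone': [np.nan, np.nan, 7.0],
})
cleaned = apply_default_strategy(demo)
```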
### Implementation Examples

```python
# Group-conditional imputation
df['age'] = df['age'].fillna(df.groupby('group')['age'].transform('median'))

# Interpolation for time series (method='time' requires a DatetimeIndex)
df['temperature'] = df['temperature'].interpolate(method='time')

# Flag missing values before imputing (preserve information)
df['salary_missing'] = df['salary'].isnull().astype(int)
df['salary'] = df['salary'].fillna(df['salary'].median())
```
## Data Transformation

### Type Conversion and Cleaning

```python
# String cleaning
df['name'] = df['name'].str.strip().str.lower()
df['email'] = df['email'].str.replace(r'\s+', '', regex=True)

# Categorical conversion (saves memory, enables ordering)
df['education'] = pd.Categorical(
    df['education'],
    categories=['high_school', 'bachelors', 'masters', 'phd'],
    ordered=True
)

# Numeric extraction from text (expand=False returns a Series, not a DataFrame)
df['value'] = df['text_field'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
```
### Reshaping Operations
|
|
121
|
+
|
|
122
|
+
```python
|
|
123
|
+
# Wide to long (unpivot)
|
|
124
|
+
df_long = pd.melt(df,
|
|
125
|
+
id_vars=['subject_id', 'condition'],
|
|
126
|
+
value_vars=['score_t1', 'score_t2', 'score_t3'],
|
|
127
|
+
var_name='timepoint',
|
|
128
|
+
value_name='score'
|
|
129
|
+
)
|
|
130
|
+
|
|
131
|
+
# Long to wide (pivot)
|
|
132
|
+
df_wide = df_long.pivot_table(
|
|
133
|
+
index='subject_id',
|
|
134
|
+
columns='condition',
|
|
135
|
+
values='score',
|
|
136
|
+
aggfunc='mean'
|
|
137
|
+
).reset_index()
|
|
138
|
+
|
|
139
|
+
# Cross-tabulation
|
|
140
|
+
ct = pd.crosstab(df['group'], df['outcome'],
|
|
141
|
+
margins=True, normalize='index')
|
|
142
|
+
```
|
|
143
|
+
|
|
144
|
+
### Merging and Joining
|
|
145
|
+
|
|
146
|
+
```python
|
|
147
|
+
# Left join with validation
|
|
148
|
+
merged = pd.merge(
|
|
149
|
+
experiments, participants,
|
|
150
|
+
on='participant_id',
|
|
151
|
+
how='left',
|
|
152
|
+
validate='many_to_one', # Catch unexpected duplicates
|
|
153
|
+
indicator=True # Shows _merge column
|
|
154
|
+
)
|
|
155
|
+
|
|
156
|
+
# Check merge quality
|
|
157
|
+
print(merged['_merge'].value_counts())
|
|
158
|
+
```

## Exploratory Data Analysis (EDA)

### Automated EDA Pipeline

```python
def quick_eda(df, target_col=None):
    """Run a quick EDA pipeline on a DataFrame."""
    print(f"=== Shape: {df.shape} ===\n")

    # Numeric columns
    numeric_cols = df.select_dtypes(include=np.number).columns
    print(f"Numeric columns ({len(numeric_cols)}):")
    print(df[numeric_cols].describe().round(2))

    # Categorical columns
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    print(f"\nCategorical columns ({len(cat_cols)}):")
    for col in cat_cols:
        n_unique = df[col].nunique()
        print(f"  {col}: {n_unique} unique values")
        if n_unique <= 10:
            print(f"    {df[col].value_counts().to_dict()}")

    # Correlations with the target
    if target_col and target_col in numeric_cols:
        corr = df[numeric_cols].corr()[target_col].drop(target_col)
        print(f"\nCorrelations with '{target_col}':")
        print(corr.sort_values(ascending=False).round(3))

quick_eda(df, target_col='accuracy')
```

### GroupBy Aggregations

```python
# Multi-metric summary by group (named aggregations)
summary = df.groupby('method').agg(
    mean_acc=('accuracy', 'mean'),
    std_acc=('accuracy', 'std'),
    median_time=('runtime_sec', 'median'),
    n_runs=('run_id', 'count')
).round(3).sort_values('mean_acc', ascending=False)

print(summary.to_markdown())
```

## Performance Optimization

| Technique | When to Use | Speedup |
|-----------|-------------|---------|
| `pd.Categorical` for strings | Repeated string values | 2-10x memory |
| `.query()` instead of boolean indexing | Complex filters | 1.5-3x |
| `pd.eval()` for arithmetic | Column arithmetic | 2-5x |
| Parquet instead of CSV | Large datasets | 5-20x I/O |
| `df.pipe()` for chaining | Readable pipelines | Clarity |

```python
# Method chaining for a readable pipeline
result = (
    df
    .query('score > 0')
    .assign(log_score=lambda x: np.log1p(x['score']))
    .groupby('group')
    .agg(mean_log=('log_score', 'mean'))
    .sort_values('mean_log', ascending=False)
)
```
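The "2-10x memory" claim for categorical strings in the table above is easy to verify directly. A quick sketch; the `method` column and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Illustrative: a low-cardinality string column repeated many times
df = pd.DataFrame({"method": np.random.default_rng(0).choice(
    ["baseline", "ours", "ablation"], size=100_000)})

before = df["method"].memory_usage(deep=True)
df["method"] = df["method"].astype("category")
after = df["method"].memory_usage(deep=True)

print(f"object: {before:,} bytes -> category: {after:,} bytes "
      f"({before / after:.1f}x smaller)")
```

The savings grow with column length and shrink as cardinality approaches the row count, so profile your own data before converting.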

## Best Practices

- **Never modify the original DataFrame in place.** Use `.copy()` when creating derived datasets.
- **Use method chaining for readability.** Pipe operations together instead of creating intermediate variables.
- **Document your cleaning steps.** Keep a data-cleaning log or a Jupyter notebook with explanations.
- **Validate after every merge.** Check row counts, null values, and the `_merge` indicator column.
- **Profile before optimizing.** Use `df.memory_usage(deep=True)` to identify memory bottlenecks.
- **Save intermediate results as Parquet.** It preserves dtypes and is much faster than CSV.

## References

- [pandas Documentation](https://pandas.pydata.org/docs/) -- Official reference
- [Python for Data Analysis, 3rd Edition](https://wesmckinney.com/book/) -- Wes McKinney
- [Effective Pandas](https://store.metasnake.com/effective-pandas-book) -- Matt Harrison
- [pandas Cookbook](https://github.com/jvns/pandas-cookbook) -- Julia Evans

---
name: questionnaire-design-guide
description: "Questionnaire and survey design with Likert scales and coding"
metadata:
  openclaw:
    emoji: "clipboard"
    category: "analysis"
    subcategory: "wrangling"
    keywords: ["questionnaire design", "survey design", "Likert scale", "data transformation"]
    source: "wentor-research-plugins"
---

# Questionnaire Design Guide

Design valid and reliable survey instruments with proper question types, Likert scale construction, response coding, and data preparation for analysis.

## Survey Design Principles

### Question Types

| Type | Example | Best For | Analysis |
|------|---------|----------|----------|
| **Likert scale** | "Rate your agreement: 1-5" | Attitudes, perceptions | Ordinal/interval statistics |
| **Multiple choice** | "Select your field" | Demographics, categories | Frequencies, chi-square |
| **Ranking** | "Rank these 5 options" | Preferences, priorities | Rank correlations |
| **Open-ended** | "Describe your experience" | Exploratory, rich data | Qualitative coding |
| **Matrix/grid** | Multiple items, same scale | Efficient battery of items | Factor analysis, reliability |
| **Slider/VAS** | 0-100 visual analog scale | Continuous measures | Parametric statistics |
| **Semantic differential** | "Easy __ __ __ __ __ Difficult" | Bipolar attitudes | Factor analysis |

### The Four C's of Good Questions

1. **Clear**: Avoid jargon, double-barreled questions, and ambiguity
2. **Concise**: Keep questions short (ideally under 20 words)
3. **Complete**: Include all relevant response options
4. **Consistent**: Use the same scale direction and format throughout

## Likert Scale Design

### Scale Points

| Points | Scale Example | Recommended Use |
|--------|---------------|-----------------|
| 4-point | Strongly Disagree to Strongly Agree | Forces a choice (no neutral), less discriminating |
| 5-point | SD, D, Neutral, A, SA | Most common; good balance of simplicity and discrimination |
| 7-point | SD, D, Somewhat D, Neutral, Somewhat A, A, SA | More discriminating; better for experienced respondents |
| 11-point (0-10) | Not at all to Completely | NPS, continuous-like measures |

### Anchoring Labels

```
5-Point Agreement Scale:
1 = Strongly Disagree
2 = Disagree
3 = Neither Agree nor Disagree
4 = Agree
5 = Strongly Agree

5-Point Frequency Scale:
1 = Never
2 = Rarely
3 = Sometimes
4 = Often
5 = Always

5-Point Satisfaction Scale:
1 = Very Dissatisfied
2 = Dissatisfied
3 = Neutral
4 = Satisfied
5 = Very Satisfied
```

### Reverse-Coded Items

Include 2-3 reverse-coded items per construct to detect acquiescence bias:

```
Regular:  "I find research methods interesting." (1-5: SD to SA)
Reversed: "I find research methods tedious and dull." (1-5: SD to SA)

# Recode reversed items before analysis:
# reversed_score = (max_scale + 1) - raw_score
# For a 5-point scale: reversed_score = 6 - raw_score
```

## Constructing a Multi-Item Scale

### Step-by-Step Process

1. **Define the construct**: Write a clear conceptual definition
2. **Generate items**: Write 1.5-2x the number of items you plan to keep (e.g., write 15 items for an 8-item scale)
3. **Expert review**: Have 3-5 experts rate each item for relevance (Content Validity Index)
4. **Pilot test**: Administer to 30-50 respondents
5. **Item analysis**: Calculate item-total correlations, check reliability
6. **Exploratory Factor Analysis (EFA)**: Confirm dimensionality
7. **Finalize the scale**: Remove weak items, re-test reliability
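
Step 5's item analysis can be sketched in pandas before reaching for dedicated packages: the corrected item-total correlation compares each item against the total of the *remaining* items. The item names and simulated responses below are purely illustrative (random data, so expect correlations near zero):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated 5-point responses from 50 respondents to a 4-item scale
items = pd.DataFrame(
    rng.integers(1, 6, size=(50, 4)),
    columns=["I1", "I2", "I3", "I4"],
)

# Corrected item-total correlation: each item vs. the sum of the other items
total = items.sum(axis=1)
r_drop = {col: items[col].corr(total - items[col]) for col in items.columns}
for col, r in r_drop.items():
    flag = "" if r >= 0.30 else "  <- review this item"
    print(f"{col}: r.drop = {r:.3f}{flag}")
```

Subtracting the item from the total before correlating avoids inflating the correlation with the item's own contribution.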

### Example: Research Self-Efficacy Scale

```
Construct: Belief in one's ability to conduct academic research

Items (5-point Likert, Strongly Disagree to Strongly Agree):
RSE1: I can formulate clear research questions.
RSE2: I can design an appropriate research methodology.
RSE3: I can analyze data using statistical software.
RSE4: I can write a publishable research paper.
RSE5: I can critically evaluate published research.
RSE6: I can present research findings at a conference.
RSE7R: I struggle to interpret statistical results. [REVERSED]
RSE8R: I find it difficult to synthesize literature. [REVERSED]
```

## Data Coding and Preparation

### Coding Scheme

```python
import pandas as pd
import numpy as np

# Define the coding scheme
likert_coding = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neither Agree nor Disagree": 3,
    "Agree": 4,
    "Strongly Agree": 5
}

# Apply coding
df["Q1_coded"] = df["Q1_raw"].map(likert_coding)

# Reverse-code specific items
reverse_items = ["RSE7R", "RSE8R"]
max_scale = 5
for item in reverse_items:
    df[f"{item}_recoded"] = (max_scale + 1) - df[item]

# Calculate composite score (mean of items)
scale_items = ["RSE1", "RSE2", "RSE3", "RSE4", "RSE5", "RSE6",
               "RSE7R_recoded", "RSE8R_recoded"]
df["RSE_mean"] = df[scale_items].mean(axis=1)
```

### Missing Data Handling

```python
# Check missing data patterns
print(df[scale_items].isnull().sum())
print(f"Complete cases: {df[scale_items].dropna().shape[0]} / {df.shape[0]}")

# Common strategies (choose one):
# 1. Listwise deletion (if < 5% missing)
df_complete = df.dropna(subset=scale_items)

# 2. Mean imputation per item (simple but biased)
df[scale_items] = df[scale_items].fillna(df[scale_items].mean())

# 3. Person-mean imputation (if < 20% of items missing per person)
def person_mean_impute(row, items, max_missing=2):
    if row[items].isnull().sum() <= max_missing:
        return row[items].fillna(row[items].mean())
    return row[items]  # leave as NaN if too many are missing

df[scale_items] = df.apply(lambda r: person_mean_impute(r, scale_items), axis=1)
```

## Reliability Analysis

### Cronbach's Alpha

```python
import pingouin as pg

# Calculate Cronbach's alpha (returns alpha and a confidence interval)
alpha = pg.cronbach_alpha(df[scale_items])
print(f"Cronbach's alpha: {alpha[0]:.3f}")
# Interpretation: >= 0.70 acceptable, >= 0.80 good, >= 0.90 excellent
```

```r
library(psych)

# Cronbach's alpha with item-level diagnostics
alpha_result <- alpha(data[, scale_items])
print(alpha_result)
# Check "raw_alpha if item dropped" to identify weak items
```

### Item-Total Correlations

```r
# Corrected item-total correlations (should be > 0.30)
print(alpha_result$item.stats[, "r.drop"])
# Alpha if each item is dropped lives in $alpha.drop
print(alpha_result$alpha.drop[, "raw_alpha"])
# r.drop < 0.30: consider removing the item
# raw_alpha increases if dropped: the item is weakening the scale
```

## Validity Assessment

| Validity Type | Method | Criterion |
|--------------|--------|-----------|
| **Content validity** | Expert panel rating (CVI) | I-CVI >= 0.78, S-CVI/Ave >= 0.90 |
| **Construct validity** | Exploratory Factor Analysis (EFA) | Eigenvalue > 1, loadings > 0.40 |
| **Convergent validity** | Correlation with a related construct | r > 0.30 |
| **Discriminant validity** | Correlation with an unrelated construct | r < 0.30 |
| **Criterion validity** | Correlation with an external criterion | Significant correlation |
| **Test-retest reliability** | ICC or Pearson r over 2-4 weeks | ICC > 0.70 |
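
The content-validity criteria in the table (I-CVI >= 0.78, S-CVI/Ave >= 0.90) can be computed directly from expert relevance ratings. A sketch under the common convention that I-CVI is the proportion of experts rating an item 3 or 4 on a 4-point relevance scale; the expert ratings below are hypothetical:

```python
import pandas as pd

# Hypothetical relevance ratings (1-4) from five experts for four items
ratings = pd.DataFrame(
    {"item": ["RSE1", "RSE2", "RSE3", "RSE4"],
     "E1": [4, 4, 3, 2], "E2": [3, 4, 4, 2],
     "E3": [4, 3, 4, 3], "E4": [4, 4, 3, 2], "E5": [3, 4, 4, 2]}
).set_index("item")

# I-CVI: proportion of experts rating the item 3 or 4
i_cvi = (ratings >= 3).mean(axis=1)
# S-CVI/Ave: mean of the item-level I-CVIs
s_cvi_ave = i_cvi.mean()

print(i_cvi.round(2))             # RSE4 falls below the 0.78 cut-off here
print(f"S-CVI/Ave = {s_cvi_ave:.2f}")
```

Items below the I-CVI threshold are candidates for revision or removal during expert review (step 3 above).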

## Common Design Mistakes

| Mistake | Example | Fix |
|---------|---------|-----|
| Double-barreled question | "This course is interesting and useful" | Split into two separate items |
| Leading question | "Don't you agree that X is important?" | "How important is X to you?" |
| Absolute terms | "Do you always check citations?" | "How often do you check citations?" |
| Missing option | No "Not Applicable" when needed | Add an N/A option or filter logic |
| Inconsistent scale direction | Some items 1=good, others 1=bad | Standardize direction; clearly mark reversed items |
| Too many items | 100-item survey | Aim for 5-8 items per construct, 15-30 min total |
| No pilot test | Skipping straight to full deployment | Always pilot with 30-50 respondents |

## Survey Platform Comparison

| Platform | Cost | Features | Best For |
|----------|------|----------|----------|
| Qualtrics | Institutional | Advanced logic, panels, API | Large academic studies |
| SurveyMonkey | Freemium | Easy to use, basic analysis | Quick surveys |
| Google Forms | Free | Simple, integrates with Sheets | Classroom, pilot testing |
| LimeSurvey | Free/self-hosted | Open source, full control | Privacy-sensitive research |
| REDCap | Free (academic) | Clinical data, HIPAA compliant | Medical/clinical research |
| Prolific | Per-response | Participant recruitment | Online experiments |
|