@wentorai/research-plugins 1.0.0

Files changed (252)
  1. package/LICENSE +21 -0
  2. package/README.md +204 -0
  3. package/curated/analysis/README.md +64 -0
  4. package/curated/domains/README.md +104 -0
  5. package/curated/literature/README.md +53 -0
  6. package/curated/research/README.md +62 -0
  7. package/curated/tools/README.md +87 -0
  8. package/curated/writing/README.md +61 -0
  9. package/index.ts +39 -0
  10. package/mcp-configs/academic-db/ChatSpatial.json +17 -0
  11. package/mcp-configs/academic-db/academia-mcp.json +17 -0
  12. package/mcp-configs/academic-db/academic-paper-explorer.json +17 -0
  13. package/mcp-configs/academic-db/academic-search-mcp-server.json +17 -0
  14. package/mcp-configs/academic-db/agentinterviews-mcp.json +17 -0
  15. package/mcp-configs/academic-db/all-in-mcp.json +17 -0
  16. package/mcp-configs/academic-db/apple-health-mcp.json +17 -0
  17. package/mcp-configs/academic-db/arxiv-latex-mcp.json +17 -0
  18. package/mcp-configs/academic-db/arxiv-mcp-server.json +17 -0
  19. package/mcp-configs/academic-db/bgpt-mcp.json +17 -0
  20. package/mcp-configs/academic-db/biomcp.json +17 -0
  21. package/mcp-configs/academic-db/biothings-mcp.json +17 -0
  22. package/mcp-configs/academic-db/catalysishub-mcp-server.json +17 -0
  23. package/mcp-configs/academic-db/clinicaltrialsgov-mcp-server.json +17 -0
  24. package/mcp-configs/academic-db/deep-research-mcp.json +17 -0
  25. package/mcp-configs/academic-db/dicom-mcp.json +17 -0
  26. package/mcp-configs/academic-db/enrichr-mcp-server.json +17 -0
  27. package/mcp-configs/academic-db/fec-mcp-server.json +17 -0
  28. package/mcp-configs/academic-db/fhir-mcp-server-themomentum.json +17 -0
  29. package/mcp-configs/academic-db/fhir-mcp.json +19 -0
  30. package/mcp-configs/academic-db/gget-mcp.json +17 -0
  31. package/mcp-configs/academic-db/google-researcher-mcp.json +17 -0
  32. package/mcp-configs/academic-db/idea-reality-mcp.json +17 -0
  33. package/mcp-configs/academic-db/legiscan-mcp.json +19 -0
  34. package/mcp-configs/academic-db/lex.json +17 -0
  35. package/mcp-configs/ai-platform/Adaptive-Graph-of-Thoughts-MCP-server.json +17 -0
  36. package/mcp-configs/ai-platform/ai-counsel.json +17 -0
  37. package/mcp-configs/ai-platform/atlas-mcp-server.json +17 -0
  38. package/mcp-configs/ai-platform/counsel-mcp.json +17 -0
  39. package/mcp-configs/ai-platform/cross-llm-mcp.json +17 -0
  40. package/mcp-configs/ai-platform/gptr-mcp.json +17 -0
  41. package/mcp-configs/browser/decipher-research-agent.json +17 -0
  42. package/mcp-configs/browser/deep-research.json +17 -0
  43. package/mcp-configs/browser/everything-claude-code.json +17 -0
  44. package/mcp-configs/browser/gpt-researcher.json +17 -0
  45. package/mcp-configs/browser/heurist-agent-framework.json +17 -0
  46. package/mcp-configs/data-platform/4everland-hosting-mcp.json +17 -0
  47. package/mcp-configs/data-platform/context-keeper.json +17 -0
  48. package/mcp-configs/data-platform/context7.json +19 -0
  49. package/mcp-configs/data-platform/contextstream-mcp.json +17 -0
  50. package/mcp-configs/data-platform/email-mcp.json +17 -0
  51. package/mcp-configs/note-knowledge/ApeRAG.json +17 -0
  52. package/mcp-configs/note-knowledge/In-Memoria.json +17 -0
  53. package/mcp-configs/note-knowledge/agent-memory.json +17 -0
  54. package/mcp-configs/note-knowledge/aimemo.json +17 -0
  55. package/mcp-configs/note-knowledge/biel-mcp.json +19 -0
  56. package/mcp-configs/note-knowledge/cognee.json +17 -0
  57. package/mcp-configs/note-knowledge/context-awesome.json +17 -0
  58. package/mcp-configs/note-knowledge/context-mcp.json +17 -0
  59. package/mcp-configs/note-knowledge/conversation-handoff-mcp.json +17 -0
  60. package/mcp-configs/note-knowledge/cortex.json +17 -0
  61. package/mcp-configs/note-knowledge/devrag.json +17 -0
  62. package/mcp-configs/note-knowledge/easy-obsidian-mcp.json +17 -0
  63. package/mcp-configs/note-knowledge/engram.json +17 -0
  64. package/mcp-configs/note-knowledge/gnosis-mcp.json +17 -0
  65. package/mcp-configs/note-knowledge/graphlit-mcp-server.json +19 -0
  66. package/mcp-configs/reference-mgr/arxiv-cli.json +17 -0
  67. package/mcp-configs/reference-mgr/arxiv-search-mcp.json +17 -0
  68. package/mcp-configs/reference-mgr/chiken.json +17 -0
  69. package/mcp-configs/reference-mgr/claude-scholar.json +17 -0
  70. package/mcp-configs/reference-mgr/devonthink-mcp.json +17 -0
  71. package/mcp-configs/registry.json +447 -0
  72. package/openclaw.plugin.json +21 -0
  73. package/package.json +61 -0
  74. package/skills/analysis/dataviz/color-accessibility-guide/SKILL.md +230 -0
  75. package/skills/analysis/dataviz/geospatial-viz-guide/SKILL.md +218 -0
  76. package/skills/analysis/dataviz/interactive-viz-guide/SKILL.md +287 -0
  77. package/skills/analysis/dataviz/network-visualization-guide/SKILL.md +195 -0
  78. package/skills/analysis/dataviz/publication-figures-guide/SKILL.md +238 -0
  79. package/skills/analysis/dataviz/python-dataviz-guide/SKILL.md +195 -0
  80. package/skills/analysis/econometrics/causal-inference-guide/SKILL.md +197 -0
  81. package/skills/analysis/econometrics/iv-regression-guide/SKILL.md +198 -0
  82. package/skills/analysis/econometrics/panel-data-guide/SKILL.md +274 -0
  83. package/skills/analysis/econometrics/robustness-checks/SKILL.md +250 -0
  84. package/skills/analysis/econometrics/stata-regression/SKILL.md +117 -0
  85. package/skills/analysis/econometrics/time-series-guide/SKILL.md +235 -0
  86. package/skills/analysis/statistics/bayesian-statistics-guide/SKILL.md +221 -0
  87. package/skills/analysis/statistics/hypothesis-testing-guide/SKILL.md +210 -0
  88. package/skills/analysis/statistics/meta-analysis-guide/SKILL.md +206 -0
  89. package/skills/analysis/statistics/nonparametric-tests-guide/SKILL.md +221 -0
  90. package/skills/analysis/statistics/power-analysis-guide/SKILL.md +240 -0
  91. package/skills/analysis/statistics/sem-guide/SKILL.md +231 -0
  92. package/skills/analysis/statistics/survival-analysis-guide/SKILL.md +195 -0
  93. package/skills/analysis/wrangling/missing-data-handling/SKILL.md +224 -0
  94. package/skills/analysis/wrangling/pandas-data-wrangling/SKILL.md +242 -0
  95. package/skills/analysis/wrangling/questionnaire-design-guide/SKILL.md +234 -0
  96. package/skills/analysis/wrangling/text-mining-guide/SKILL.md +225 -0
  97. package/skills/domains/ai-ml/computer-vision-guide/SKILL.md +213 -0
  98. package/skills/domains/ai-ml/deep-learning-papers-guide/SKILL.md +200 -0
  99. package/skills/domains/ai-ml/llm-evaluation-guide/SKILL.md +194 -0
  100. package/skills/domains/ai-ml/prompt-engineering-research/SKILL.md +233 -0
  101. package/skills/domains/ai-ml/reinforcement-learning-guide/SKILL.md +254 -0
  102. package/skills/domains/ai-ml/transformer-architecture-guide/SKILL.md +233 -0
  103. package/skills/domains/biomedical/clinical-research-guide/SKILL.md +232 -0
  104. package/skills/domains/biomedical/clinicaltrials-api/SKILL.md +177 -0
  105. package/skills/domains/biomedical/epidemiology-guide/SKILL.md +200 -0
  106. package/skills/domains/biomedical/genomics-analysis-guide/SKILL.md +270 -0
  107. package/skills/domains/business/market-analysis-guide/SKILL.md +112 -0
  108. package/skills/domains/business/strategic-management-guide/SKILL.md +154 -0
  109. package/skills/domains/chemistry/computational-chemistry-guide/SKILL.md +266 -0
  110. package/skills/domains/chemistry/retrosynthesis-guide/SKILL.md +215 -0
  111. package/skills/domains/cs/algorithms-complexity-guide/SKILL.md +194 -0
  112. package/skills/domains/cs/dblp-api/SKILL.md +129 -0
  113. package/skills/domains/cs/software-engineering-research/SKILL.md +218 -0
  114. package/skills/domains/ecology/biodiversity-data-guide/SKILL.md +296 -0
  115. package/skills/domains/ecology/conservation-biology-guide/SKILL.md +198 -0
  116. package/skills/domains/ecology/gbif-api/SKILL.md +158 -0
  117. package/skills/domains/ecology/inaturalist-api/SKILL.md +173 -0
  118. package/skills/domains/economics/behavioral-economics-guide/SKILL.md +239 -0
  119. package/skills/domains/economics/development-economics-guide/SKILL.md +181 -0
  120. package/skills/domains/economics/fred-api/SKILL.md +189 -0
  121. package/skills/domains/education/curriculum-design-guide/SKILL.md +144 -0
  122. package/skills/domains/education/learning-science-guide/SKILL.md +150 -0
  123. package/skills/domains/finance/financial-data-analysis/SKILL.md +152 -0
  124. package/skills/domains/finance/quantitative-finance-guide/SKILL.md +151 -0
  125. package/skills/domains/geoscience/climate-science-guide/SKILL.md +158 -0
  126. package/skills/domains/geoscience/gis-remote-sensing-guide/SKILL.md +129 -0
  127. package/skills/domains/humanities/digital-humanities-guide/SKILL.md +181 -0
  128. package/skills/domains/humanities/philosophy-research-guide/SKILL.md +148 -0
  129. package/skills/domains/law/courtlistener-api/SKILL.md +213 -0
  130. package/skills/domains/law/legal-research-guide/SKILL.md +250 -0
  131. package/skills/domains/math/linear-algebra-applications/SKILL.md +227 -0
  132. package/skills/domains/math/numerical-methods-guide/SKILL.md +236 -0
  133. package/skills/domains/math/oeis-api/SKILL.md +158 -0
  134. package/skills/domains/pharma/clinical-pharmacology-guide/SKILL.md +165 -0
  135. package/skills/domains/pharma/drug-development-guide/SKILL.md +177 -0
  136. package/skills/domains/physics/computational-physics-guide/SKILL.md +300 -0
  137. package/skills/domains/physics/nasa-ads-api/SKILL.md +150 -0
  138. package/skills/domains/physics/quantum-computing-guide/SKILL.md +234 -0
  139. package/skills/domains/social-science/social-research-methods/SKILL.md +194 -0
  140. package/skills/domains/social-science/survey-research-guide/SKILL.md +182 -0
  141. package/skills/literature/discovery/citation-alert-guide/SKILL.md +154 -0
  142. package/skills/literature/discovery/conference-proceedings-guide/SKILL.md +142 -0
  143. package/skills/literature/discovery/literature-mapping-guide/SKILL.md +175 -0
  144. package/skills/literature/discovery/paper-tracking-guide/SKILL.md +211 -0
  145. package/skills/literature/discovery/rss-paper-feeds/SKILL.md +214 -0
  146. package/skills/literature/discovery/semantic-scholar-recs-guide/SKILL.md +164 -0
  147. package/skills/literature/fulltext/doaj-api/SKILL.md +120 -0
  148. package/skills/literature/fulltext/interlibrary-loan-guide/SKILL.md +163 -0
  149. package/skills/literature/fulltext/open-access-guide/SKILL.md +183 -0
  150. package/skills/literature/fulltext/pmc-oai-api/SKILL.md +184 -0
  151. package/skills/literature/fulltext/preprint-servers-guide/SKILL.md +128 -0
  152. package/skills/literature/fulltext/repository-harvesting-guide/SKILL.md +207 -0
  153. package/skills/literature/fulltext/unpaywall-api/SKILL.md +113 -0
  154. package/skills/literature/metadata/altmetrics-guide/SKILL.md +132 -0
  155. package/skills/literature/metadata/citation-network-guide/SKILL.md +236 -0
  156. package/skills/literature/metadata/crossref-api/SKILL.md +133 -0
  157. package/skills/literature/metadata/datacite-api/SKILL.md +126 -0
  158. package/skills/literature/metadata/doi-resolution-guide/SKILL.md +168 -0
  159. package/skills/literature/metadata/h-index-guide/SKILL.md +183 -0
  160. package/skills/literature/metadata/journal-metrics-guide/SKILL.md +188 -0
  161. package/skills/literature/metadata/opencitations-api/SKILL.md +128 -0
  162. package/skills/literature/metadata/orcid-api/SKILL.md +136 -0
  163. package/skills/literature/metadata/orcid-integration-guide/SKILL.md +178 -0
  164. package/skills/literature/search/arxiv-api/SKILL.md +95 -0
  165. package/skills/literature/search/biorxiv-api/SKILL.md +123 -0
  166. package/skills/literature/search/boolean-search-guide/SKILL.md +199 -0
  167. package/skills/literature/search/citation-chaining-guide/SKILL.md +148 -0
  168. package/skills/literature/search/database-comparison-guide/SKILL.md +100 -0
  169. package/skills/literature/search/europe-pmc-api/SKILL.md +120 -0
  170. package/skills/literature/search/google-scholar-guide/SKILL.md +182 -0
  171. package/skills/literature/search/mesh-terms-guide/SKILL.md +164 -0
  172. package/skills/literature/search/openalex-api/SKILL.md +134 -0
  173. package/skills/literature/search/pubmed-api/SKILL.md +130 -0
  174. package/skills/literature/search/scientify-literature-survey/SKILL.md +203 -0
  175. package/skills/literature/search/semantic-scholar-api/SKILL.md +134 -0
  176. package/skills/literature/search/systematic-search-strategy/SKILL.md +214 -0
  177. package/skills/research/automation/ai-scientist-guide/SKILL.md +228 -0
  178. package/skills/research/automation/data-collection-automation/SKILL.md +248 -0
  179. package/skills/research/automation/research-workflow-automation/SKILL.md +266 -0
  180. package/skills/research/deep-research/meta-synthesis-guide/SKILL.md +174 -0
  181. package/skills/research/deep-research/research-cog/SKILL.md +153 -0
  182. package/skills/research/deep-research/scoping-review-guide/SKILL.md +217 -0
  183. package/skills/research/deep-research/systematic-review-guide/SKILL.md +250 -0
  184. package/skills/research/funding/figshare-api/SKILL.md +163 -0
  185. package/skills/research/funding/grant-writing-guide/SKILL.md +233 -0
  186. package/skills/research/funding/nsf-grant-guide/SKILL.md +206 -0
  187. package/skills/research/funding/open-science-guide/SKILL.md +255 -0
  188. package/skills/research/funding/zenodo-api/SKILL.md +174 -0
  189. package/skills/research/methodology/action-research-guide/SKILL.md +201 -0
  190. package/skills/research/methodology/experimental-design-guide/SKILL.md +236 -0
  191. package/skills/research/methodology/grad-school-guide/SKILL.md +182 -0
  192. package/skills/research/methodology/grounded-theory-guide/SKILL.md +171 -0
  193. package/skills/research/methodology/mixed-methods-guide/SKILL.md +208 -0
  194. package/skills/research/methodology/qualitative-research-guide/SKILL.md +234 -0
  195. package/skills/research/methodology/scientify-idea-generation/SKILL.md +222 -0
  196. package/skills/research/paper-review/paper-reading-assistant/SKILL.md +266 -0
  197. package/skills/research/paper-review/peer-review-guide/SKILL.md +227 -0
  198. package/skills/research/paper-review/rebuttal-writing-guide/SKILL.md +185 -0
  199. package/skills/research/paper-review/scientify-write-review-paper/SKILL.md +209 -0
  200. package/skills/tools/code-exec/jupyter-notebook-guide/SKILL.md +178 -0
  201. package/skills/tools/code-exec/python-reproducibility-guide/SKILL.md +341 -0
  202. package/skills/tools/code-exec/r-reproducibility-guide/SKILL.md +236 -0
  203. package/skills/tools/code-exec/sandbox-execution-guide/SKILL.md +221 -0
  204. package/skills/tools/diagram/mermaid-diagram-guide/SKILL.md +269 -0
  205. package/skills/tools/diagram/plantuml-guide/SKILL.md +397 -0
  206. package/skills/tools/diagram/scientific-illustration-guide/SKILL.md +225 -0
  207. package/skills/tools/document/anystyle-api/SKILL.md +199 -0
  208. package/skills/tools/document/grobid-pdf-parsing/SKILL.md +294 -0
  209. package/skills/tools/document/markdown-academic-guide/SKILL.md +217 -0
  210. package/skills/tools/document/pdf-extraction-guide/SKILL.md +321 -0
  211. package/skills/tools/knowledge-graph/knowledge-graph-construction/SKILL.md +306 -0
  212. package/skills/tools/knowledge-graph/ontology-design-guide/SKILL.md +214 -0
  213. package/skills/tools/knowledge-graph/rag-methodology-guide/SKILL.md +325 -0
  214. package/skills/tools/ocr-translate/formula-recognition-guide/SKILL.md +367 -0
  215. package/skills/tools/ocr-translate/handwriting-recognition-guide/SKILL.md +211 -0
  216. package/skills/tools/ocr-translate/latex-ocr-guide/SKILL.md +204 -0
  217. package/skills/tools/ocr-translate/multilingual-research-guide/SKILL.md +234 -0
  218. package/skills/tools/scraping/academic-web-scraping/SKILL.md +326 -0
  219. package/skills/tools/scraping/api-data-collection-guide/SKILL.md +301 -0
  220. package/skills/tools/scraping/web-scraping-ethics-guide/SKILL.md +250 -0
  221. package/skills/writing/citation/bibtex-management-guide/SKILL.md +246 -0
  222. package/skills/writing/citation/citation-style-guide/SKILL.md +248 -0
  223. package/skills/writing/citation/reference-manager-comparison/SKILL.md +208 -0
  224. package/skills/writing/citation/zotero-api/SKILL.md +188 -0
  225. package/skills/writing/composition/abstract-writing-guide/SKILL.md +188 -0
  226. package/skills/writing/composition/discussion-writing-guide/SKILL.md +194 -0
  227. package/skills/writing/composition/introduction-writing-guide/SKILL.md +194 -0
  228. package/skills/writing/composition/literature-review-writing/SKILL.md +196 -0
  229. package/skills/writing/composition/methods-section-guide/SKILL.md +185 -0
  230. package/skills/writing/composition/response-to-reviewers/SKILL.md +215 -0
  231. package/skills/writing/composition/scientific-writing-guide/SKILL.md +152 -0
  232. package/skills/writing/latex/bibliography-management-guide/SKILL.md +206 -0
  233. package/skills/writing/latex/latex-drawing-guide/SKILL.md +234 -0
  234. package/skills/writing/latex/latex-ecosystem-guide/SKILL.md +240 -0
  235. package/skills/writing/latex/math-typesetting-guide/SKILL.md +231 -0
  236. package/skills/writing/latex/overleaf-collaboration-guide/SKILL.md +211 -0
  237. package/skills/writing/latex/tikz-diagrams-guide/SKILL.md +211 -0
  238. package/skills/writing/polish/academic-translation-guide/SKILL.md +175 -0
  239. package/skills/writing/polish/academic-writing-refiner/SKILL.md +143 -0
  240. package/skills/writing/polish/ai-writing-humanizer/SKILL.md +178 -0
  241. package/skills/writing/polish/grammar-checker-guide/SKILL.md +184 -0
  242. package/skills/writing/polish/plagiarism-detection-guide/SKILL.md +167 -0
  243. package/skills/writing/templates/beamer-presentation-guide/SKILL.md +263 -0
  244. package/skills/writing/templates/conference-paper-template/SKILL.md +219 -0
  245. package/skills/writing/templates/thesis-template-guide/SKILL.md +200 -0
  246. package/skills/writing/templates/thesis-writing-guide/SKILL.md +220 -0
  247. package/src/tools/arxiv.ts +131 -0
  248. package/src/tools/crossref.ts +112 -0
  249. package/src/tools/openalex.ts +174 -0
  250. package/src/tools/pubmed.ts +166 -0
  251. package/src/tools/semantic-scholar.ts +108 -0
  252. package/src/tools/unpaywall.ts +58 -0
@@ -0,0 +1,217 @@
1
+ ---
2
+ name: markdown-academic-guide
3
+ description: "Write academic papers in Markdown with Pandoc for multi-format output"
4
+ metadata:
5
+ openclaw:
6
+ emoji: "page_facing_up"
7
+ category: "tools"
8
+ subcategory: "document"
9
+ keywords: ["Markdown", "Pandoc", "academic writing", "document conversion", "scholarly Markdown", "plain text"]
10
+ source: "wentor-research-plugins"
11
+ ---
12
+
13
+ # Academic Writing in Markdown with Pandoc
14
+
15
+ A skill for writing academic papers in plain-text Markdown and converting them to PDF, Word, LaTeX, and HTML using Pandoc. Covers YAML metadata, citation management, cross-references, templates, and workflows for collaborative academic writing.
16
+
17
+ ## Why Markdown for Academic Writing?
18
+
19
+ ### Advantages
20
+
21
+ ```
+ 1. Plain text: Version-controllable with Git (diff-friendly)
+ 2. Portable: Works on any OS, any editor
+ 3. Pandoc: Convert to PDF, DOCX, LaTeX, HTML, EPUB
+ 4. Focus: Content-first writing without formatting distractions
+ 5. Citations: Pandoc's built-in citeproc (--citeproc) handles the bibliography automatically
+ 6. Collaboration: Easy to review diffs in pull requests
+ ```
29
+
30
+ ### When to Use Markdown vs. LaTeX
31
+
32
+ ```
+ Use Markdown when:
+ - You need multi-format output (PDF + Word + HTML)
+ - Collaborators prefer Word but you prefer plain text
+ - The paper has standard formatting needs
+ - You want a simpler syntax than LaTeX
+
+ Use LaTeX directly when:
+ - The journal provides a mandatory LaTeX template
+ - You need advanced typesetting (complex math layouts, custom floats)
+ - You are writing a thesis with institutional LaTeX requirements
+ ```
44
+
45
+ ## Document Structure
46
+
47
+ ### YAML Front Matter
48
+
49
+ ```yaml
+ ---
+ title: "Your Paper Title: A Markdown-Based Approach"
+ author:
+   - name: Jane Smith
+     affiliation: Department of Computer Science, University X
+     email: jane@university.edu
+     orcid: 0000-0002-1234-5678
+   - name: John Doe
+     affiliation: School of Engineering, University Y
+ date: 2026-03-09
+ abstract: |
+   This paper demonstrates how academic manuscripts can be written
+   in plain Markdown and converted to publication-quality documents
+   using Pandoc. We show that this approach reduces formatting overhead
+   while maintaining full citation and cross-reference capabilities.
+ keywords: [academic writing, Markdown, Pandoc, reproducible research]
+ bibliography: references.bib
+ csl: apa-7th-edition.csl
+ link-citations: true
+ numbersections: true
+ ---
+ ```
72
+
73
+ ### Body Text with Citations
74
+
75
+ ```markdown
+ # Introduction
+
+ Academic writing often involves tedious formatting tasks that
+ distract from content creation [@smith2024; @jones2023, pp. 45-50].
+ Recent tools enable plain-text workflows that separate content
+ from presentation [see @garcia2022, chap. 3].
+
+ ## Background
+
+ As @lee2021 demonstrated, Markdown-based workflows reduce
+ formatting errors by 40% compared to WYSIWYG editors.
+
+ ### Subsection Example
+
+ Inline math: $E = mc^2$
+
+ Display math:
+ $$
+ \hat{\beta} = (X^T X)^{-1} X^T y
+ $$
+ ```
97
+
98
+ ### Citation Syntax
99
+
100
+ ```
+ [@key]          -> (Author, 2024)
+ @key            -> Author (2024)
+ [@key, p. 42]   -> (Author, 2024, p. 42)
+ [@key1; @key2]  -> (Author1, 2024; Author2, 2023)
+ [-@key]         -> (2024) -- suppress author name
+ [see @key]      -> (see Author, 2024)
+ ```
108
+
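Before building, it can help to check that every key cited in the manuscript exists in `references.bib`. The following is a minimal sketch, not part of Pandoc: the function names are hypothetical, the regular expression only approximates Pandoc's citation-key rules, and the `.bib` parsing is a simple pattern match rather than a full BibTeX parser.

```python
import re

# Approximates Pandoc's citation-key rule: an "@" that is not part of an
# e-mail address, followed by a key whose internal punctuation must be
# followed by another key character (so a trailing "." or "," is dropped).
CITATION_RE = re.compile(
    r"(?<![\w@])@([A-Za-z0-9_](?:[A-Za-z0-9_]|[:.#$%&\-+?<>~/](?=[A-Za-z0-9_]))*)"
)
# Entry keys in a BibTeX file, e.g. "@article{smith2024,".
BIB_ENTRY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")
# Prefixes assumed to belong to pandoc-crossref labels, not bibliography keys.
CROSSREF_PREFIXES = {"fig", "tbl", "eq", "sec", "lst"}


def cited_keys(markdown_text: str) -> set:
    """Return the bibliography keys cited in the Markdown text."""
    keys = set(CITATION_RE.findall(markdown_text))
    return {
        k for k in keys
        if not (":" in k and k.split(":", 1)[0] in CROSSREF_PREFIXES)
    }


def missing_keys(markdown_text: str, bib_text: str) -> set:
    """Return cited keys that have no entry in the BibTeX text."""
    return cited_keys(markdown_text) - set(BIB_ENTRY_RE.findall(bib_text))
```

A typical use would be to read `paper.md` and `references.bib` from disk and fail the build when `missing_keys` returns a non-empty set.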
109
+ ## Pandoc Conversion
110
+
111
+ ### Basic Commands
112
+
113
+ ```bash
+ # Markdown to PDF (via LaTeX)
+ pandoc paper.md -o paper.pdf \
+   --citeproc \
+   --number-sections \
+   --pdf-engine=xelatex
+
+ # Markdown to Word (DOCX)
+ pandoc paper.md -o paper.docx \
+   --citeproc \
+   --reference-doc=template.docx
+
+ # Markdown to LaTeX
+ pandoc paper.md -o paper.tex \
+   --citeproc \
+   --standalone
+
+ # Markdown to HTML
+ pandoc paper.md -o paper.html \
+   --citeproc \
+   --standalone \
+   --mathjax
+ ```
136
+
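When several output formats are built from one source, the per-format flags can be kept in one place. The sketch below only assembles the argument lists shown in the commands above. The function name and the `FORMAT_FLAGS` table are illustrative assumptions, and `template.docx` is the same placeholder file name used above.

```python
from pathlib import Path

# Extra flags per output format, copied from the basic commands above.
FORMAT_FLAGS = {
    ".pdf": ["--number-sections", "--pdf-engine=xelatex"],
    ".docx": ["--reference-doc=template.docx"],
    ".tex": ["--standalone"],
    ".html": ["--standalone", "--mathjax"],
}


def pandoc_command(source: str, target: str) -> list:
    """Return the argv list for converting `source` to `target`."""
    suffix = Path(target).suffix.lower()
    if suffix not in FORMAT_FLAGS:
        raise ValueError(f"unsupported output format: {suffix!r}")
    # --citeproc is shared by every command in this guide.
    return ["pandoc", source, "-o", target, "--citeproc", *FORMAT_FLAGS[suffix]]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where Pandoc is installed. Building a list rather than a shell string avoids quoting problems with file names that contain spaces.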
137
+ ### Custom LaTeX Template
138
+
139
+ ```bash
+ # Extract default template for customization
+ pandoc -D latex > custom-template.tex
+
+ # Use custom template
+ pandoc paper.md -o paper.pdf \
+   --template=custom-template.tex \
+   --citeproc \
+   --pdf-engine=xelatex
+ ```
149
+
150
+ ## Cross-References and Figures
151
+
152
+ ### Using pandoc-crossref
153
+
154
+ ```markdown
+ See @fig:architecture for the system overview.
+
+ ![System Architecture](figures/architecture.pdf){#fig:architecture width=80%}
+
+ Results are shown in @tbl:results.
+
+ | Method   | Accuracy | F1 Score |
+ |----------|----------|----------|
+ | Ours     | 0.95     | 0.93     |
+ | Baseline | 0.88     | 0.85     |
+
+ : Comparison of methods. {#tbl:results}
+
+ As proven in @eq:main, the relationship holds.
+
+ $$y = \alpha + \beta x + \epsilon$$ {#eq:main}
+ ```
172
+
173
+ ```bash
+ # Compile with cross-references
+ pandoc paper.md -o paper.pdf \
+   --filter pandoc-crossref \
+   --citeproc \
+   --pdf-engine=xelatex
+ ```
180
+
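A reference such as `@tbl:results` only resolves if a matching `{#tbl:results}` label exists somewhere in the manuscript. The following sketch lists references without a label. It is a plain-text check written for illustration: the function name is hypothetical, and it covers only the `fig`, `tbl`, and `eq` prefixes used in the example above.

```python
import re

# Labels are declared as {#fig:name ...}, {#tbl:name} or {#eq:name}.
LABEL_RE = re.compile(r"\{#((?:fig|tbl|eq):[\w-]+)")
# References look like @fig:name; the lookbehind skips e-mail addresses.
REF_RE = re.compile(r"(?<![\w@])@((?:fig|tbl|eq):[\w-]+)")


def undefined_references(markdown_text: str) -> list:
    """Return cross-references that have no matching label, sorted."""
    labels = set(LABEL_RE.findall(markdown_text))
    refs = REF_RE.findall(markdown_text)
    return sorted({ref for ref in refs if ref not in labels})
```

Running this before the Pandoc build gives an earlier and more specific report than searching the rendered PDF for unresolved references.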
181
+ ## Collaborative Workflow
182
+
183
+ ### Git-Based Collaboration
184
+
185
+ ```python
+ def markdown_collaboration_workflow() -> dict:
+     """
+     Recommended workflow for multi-author Markdown papers.
+     """
+     return {
+         "setup": [
+             "Create a Git repository for the paper",
+             "Add .gitignore for PDF output and LaTeX aux files",
+             "Store references.bib in the repo",
+             "Include a Makefile for reproducible builds"
+         ],
+         "writing": [
+             "Each author works on a branch",
+             "Use pull requests for section drafts",
+             "Review diffs in GitHub/GitLab (plain text diffs are readable)",
+             "Resolve merge conflicts in plain text (much easier than .docx)"
+         ],
+         "makefile_example": (
+             "all: paper.pdf paper.docx\n"
+             "paper.pdf: paper.md references.bib\n"
+             "\tpandoc paper.md -o paper.pdf --citeproc --pdf-engine=xelatex\n"
+             "paper.docx: paper.md references.bib\n"
+             "\tpandoc paper.md -o paper.docx --citeproc --reference-doc=template.docx\n"
+             "clean:\n"
+             "\trm -f paper.pdf paper.docx"
+         )
+     }
+ ```
214
+
215
+ ## Tips and Limitations
216
+
217
+ Pandoc handles most academic writing needs, but has limitations with complex table layouts, advanced figure placement, and journal-specific LaTeX class features. For final submission, you may need to fine-tune the generated LaTeX or DOCX output. Keep your Markdown source as the canonical version and treat generated files as disposable build artifacts.
@@ -0,0 +1,321 @@
1
+ ---
2
+ name: pdf-extraction-guide
3
+ description: "PDF parsing, text extraction, and document format conversion"
4
+ metadata:
5
+ openclaw:
6
+ emoji: "doc"
7
+ category: "tools"
8
+ subcategory: "document"
9
+ keywords: ["PDF parsing", "PDF extraction", "document chunking", "format conversion", "md2pdf"]
10
+ source: "wentor-research-plugins"
11
+ ---
12
+
13
+ # PDF Extraction Guide
14
+
15
+ Extract text, tables, figures, and metadata from academic PDFs using Python libraries, with strategies for handling multi-column layouts, mathematical content, and scanned documents.
16
+
17
+ ## PDF Extraction Tools Comparison
18
+
19
+ | Tool | Text | Tables | Figures | Layout | OCR | Speed |
+ |------|------|--------|---------|--------|-----|-------|
+ | PyMuPDF (fitz) | Excellent | Manual | Yes | Blocks | No (add with OCR engine) | Fast |
+ | pdfplumber | Good | Excellent | No | Tables focus | No | Medium |
+ | pypdf (formerly PyPDF2) | Basic | No | No | No | No | Fast |
+ | Tabula-py | No | Excellent | No | No | No | Medium |
+ | GROBID | Structured | Yes | References | Academic layout | No | Slow (ML-based) |
+ | Nougat (Meta) | Excellent | Yes | Yes | Academic layout | Built-in | Slow (GPU) |
+ | Marker | Excellent | Yes | Yes | Multi-column | Built-in | Medium |
+ | pdf2image + Tesseract | Via OCR | Via OCR | Via OCR | No | Yes | Slow |
29
+
30
+ ## PyMuPDF (fitz) — Fast Text Extraction
31
+
32
+ ### Basic Text Extraction
33
+
34
+ ```python
+ import fitz  # pip install PyMuPDF
+
+ def extract_text(pdf_path):
+     """Extract all text from a PDF with page numbers."""
+     doc = fitz.open(pdf_path)
+     full_text = []
+
+     for page_num, page in enumerate(doc, 1):
+         text = page.get_text("text")
+         full_text.append(f"--- Page {page_num} ---\n{text}")
+
+     doc.close()
+     return "\n".join(full_text)
+
+ # Usage
+ text = extract_text("paper.pdf")
+ print(text[:2000])
+ ```
53
+
54
+ ### Structured Block-Level Extraction
55
+
56
+ ```python
+ def extract_structured(pdf_path):
+     """Extract text with layout information (blocks, lines, spans)."""
+     doc = fitz.open(pdf_path)
+     pages = []
+
+     for page_num, page in enumerate(doc):
+         blocks = page.get_text("dict")["blocks"]
+         page_data = {"page": page_num + 1, "blocks": []}
+
+         for block in blocks:
+             if "lines" not in block:
+                 continue  # Skip image blocks
+
+             block_text = ""
+             max_font_size = 0
+             is_bold = False
+
+             for line in block["lines"]:
+                 for span in line["spans"]:
+                     block_text += span["text"]
+                     max_font_size = max(max_font_size, span["size"])
+                     if "Bold" in span.get("font", ""):
+                         is_bold = True
+                 block_text += "\n"
+
+             page_data["blocks"].append({
+                 "text": block_text.strip(),
+                 "font_size": max_font_size,
+                 "is_bold": is_bold,
+                 "bbox": block["bbox"]  # (x0, y0, x1, y1)
+             })
+
+         pages.append(page_data)
+
+     doc.close()
+     return pages
+
+ # Identify section headings
+ pages = extract_structured("paper.pdf")
+ for page in pages:
+     for block in page["blocks"]:
+         if block["is_bold"] and block["font_size"] > 11:
+             print(f"[Heading] {block['text'][:80]}")
+ ```
101
+
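The fixed `font_size > 11` threshold above depends on the paper's template. One possible refinement, sketched here as an assumption rather than as the guide's method, is to estimate the body font size from the blocks themselves and flag short blocks that are larger than it, or bold at body size. The function names and the `min_ratio` and `max_chars` defaults are hypothetical. The input is a list of block dicts with the same `text`, `font_size`, and `is_bold` keys that `extract_structured` produces.

```python
from collections import Counter


def body_font_size(blocks):
    """Return the font size that covers the most characters."""
    weights = Counter()
    for block in blocks:
        weights[round(block["font_size"], 1)] += len(block["text"])
    return weights.most_common(1)[0][0] if weights else 0.0


def find_headings(blocks, min_ratio=1.1, max_chars=100):
    """Return the text of short blocks that look like headings."""
    body = body_font_size(blocks)
    headings = []
    for block in blocks:
        text = block["text"]
        if not text or len(text) >= max_chars:
            continue  # headings are assumed to be short, non-empty blocks
        larger = block["font_size"] >= body * min_ratio
        bold_at_body_size = block["is_bold"] and block["font_size"] >= body
        if larger or bold_at_body_size:
            headings.append(text)
    return headings
```

This remains a heuristic: bold run-in text or short captions can still be misclassified, so the result is best treated as a list of candidates.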
102
+ ### Extract Images and Figures
103
+
104
+ ```python
+ def extract_images(pdf_path, output_dir="./images"):
+     """Extract all images from a PDF."""
+     import os
+     os.makedirs(output_dir, exist_ok=True)
+
+     doc = fitz.open(pdf_path)
+     img_count = 0
+
+     for page_num, page in enumerate(doc):
+         images = page.get_images(full=True)
+         for img_idx, img in enumerate(images):
+             xref = img[0]
+             pix = fitz.Pixmap(doc, xref)
+
+             if pix.n - pix.alpha > 3:  # CMYK
+                 pix = fitz.Pixmap(fitz.csRGB, pix)
+
+             filename = f"{output_dir}/page{page_num+1}_img{img_idx+1}.png"
+             pix.save(filename)
+             img_count += 1
+
+     doc.close()
+     print(f"Extracted {img_count} images to {output_dir}")
+ ```
129
+
130
+ ## pdfplumber — Table Extraction
131
+
132
+ ```python
+ import pdfplumber
+
+ def extract_tables(pdf_path):
+     """Extract all tables from a PDF."""
+     tables = []
+     with pdfplumber.open(pdf_path) as pdf:
+         for page_num, page in enumerate(pdf.pages):
+             page_tables = page.extract_tables()
+             for table_idx, table in enumerate(page_tables):
+                 tables.append({
+                     "page": page_num + 1,
+                     "table_index": table_idx,
+                     "data": table
+                 })
+     return tables
+
+ # Convert extracted table to pandas DataFrame
+ import pandas as pd
+
+ tables = extract_tables("paper.pdf")
+ for t in tables:
+     if t["data"]:
+         df = pd.DataFrame(t["data"][1:], columns=t["data"][0])
+         print(f"\nTable on page {t['page']}:")
+         print(df.to_string())
+ ```
159
+
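Extracted tables often contain empty cells (returned as `None`), line breaks inside cells, and rows shorter than the header. The sketch below normalizes such a list-of-rows table into a list of dicts without pandas. The function name and the `col_N` fallback for empty header cells are illustrative choices, not pdfplumber behavior.

```python
def table_to_records(table):
    """Turn a list-of-rows table into a list of dicts keyed by header."""
    if not table:
        return []
    # Collapse whitespace in header cells; name empty ones col_0, col_1, ...
    header = [
        " ".join((cell or "").split()) or f"col_{i}"
        for i, cell in enumerate(table[0])
    ]
    records = []
    for row in table[1:]:
        cells = [" ".join((cell or "").split()) for cell in row]
        if not any(cells):
            continue  # skip rows with no text at all
        # Pad short rows so every record has the same keys.
        cells += [""] * (len(header) - len(cells))
        records.append(dict(zip(header, cells)))
    return records
```

Duplicate header names would overwrite each other in the resulting dicts, so tables with repeated column titles need extra handling.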
160
+ ## GROBID — Structured Academic Paper Parsing
161
+
162
+ GROBID uses machine learning to parse academic PDFs into structured TEI XML.
163
+
164
```python
import requests
from lxml import etree

def parse_with_grobid(pdf_path, grobid_url="http://localhost:8070"):
    """Parse a paper PDF with a running GROBID service and return TEI XML."""
    with open(pdf_path, "rb") as f:
        response = requests.post(
            f"{grobid_url}/api/processFulltextDocument",
            files={"input": f},
            data={"consolidateHeader": 1, "consolidateCitations": 1}
        )

    if response.status_code == 200:
        return response.text  # TEI XML
    else:
        raise RuntimeError(f"GROBID error: {response.status_code}")

def element_text(elem):
    """Return all text inside an element, including nested children."""
    return "".join(elem.itertext()).strip() if elem is not None else "N/A"

# Parse the TEI XML (encode first: lxml rejects str input with an encoding declaration)
tei_xml = parse_with_grobid("paper.pdf")
root = etree.fromstring(tei_xml.encode())
ns = {"tei": "http://www.tei-c.org/ns/1.0"}

# Extract title
title = root.find(".//tei:titleStmt/tei:title", ns)
print(f"Title: {element_text(title)}")

# Extract abstract (its text sits in nested elements, so .text alone is usually empty)
abstract = root.find(".//tei:profileDesc/tei:abstract", ns)
if abstract is not None:
    print(f"Abstract: {element_text(abstract)}")

# Extract references
refs = root.findall(".//tei:listBibl/tei:biblStruct", ns)
print(f"References found: {len(refs)}")
for ref in refs[:5]:
    title_elem = ref.find(".//tei:title", ns)
    print(f"  - {element_text(title_elem)}")
```
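
Author names can be read from the TEI header in the same way. The sketch below assumes the header layout GROBID typically emits (`analytic/author/persName` with `forename` and `surname` children). The embedded XML is a hand-written sample rather than real GROBID output, and `extract_authors` is a hypothetical helper. It uses the standard library's `xml.etree.ElementTree` so it runs without a GROBID server.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample mimicking the assumed header structure
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title>An Example Paper</title></titleStmt>
    <sourceDesc><biblStruct><analytic>
      <author><persName>
        <forename type="first">Ada</forename><surname>Lovelace</surname>
      </persName></author>
      <author><persName>
        <forename type="first">Alan</forename><forename type="middle">M</forename>
        <surname>Turing</surname>
      </persName></author>
    </analytic></biblStruct></sourceDesc>
  </fileDesc></teiHeader>
</TEI>"""

def extract_authors(tei_xml):
    """Return author names as 'Forename(s) Surname' strings, in document order."""
    root = ET.fromstring(tei_xml)
    authors = []
    for pers in root.findall(".//tei:analytic/tei:author/tei:persName", NS):
        forenames = [f.text.strip() for f in pers.findall("tei:forename", NS) if f.text]
        surname = pers.findtext("tei:surname", default="", namespaces=NS).strip()
        full_name = " ".join(forenames + [surname]).strip()
        if full_name:
            authors.append(full_name)
    return authors

authors = extract_authors(SAMPLE_TEI)
print(authors)
```

Restricting the path to `analytic` keeps the search in the paper's own header entry under the assumed layout; cited works carry their own `author` elements inside `listBibl`.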
204
+
205
+ ## Document Chunking for RAG
206
+
207
+ Split documents into semantically meaningful chunks for retrieval-augmented generation:
208
+
209
```python
def chunk_academic_paper(pdf_path, max_chunk_size=1000, overlap=200):
    """Chunk an academic paper by sections with overlap.

    max_chunk_size and overlap are given in characters and converted to
    word counts assuming roughly 5 characters per word.
    """
    pages = extract_structured(pdf_path)

    # Identify sections: a short, bold block in a larger font starts a new one
    sections = []
    current_section = {"heading": "Preamble", "text": ""}

    for page in pages:
        for block in page["blocks"]:
            if block["is_bold"] and block["font_size"] > 11 and len(block["text"]) < 100:
                if current_section["text"].strip():
                    sections.append(current_section)
                current_section = {"heading": block["text"], "text": ""}
            else:
                current_section["text"] += block["text"] + "\n"

    if current_section["text"].strip():
        sections.append(current_section)

    # Split long sections into overlapping chunks
    window = max(1, max_chunk_size // 5)   # Approximate words per chunk
    step = max(1, window - overlap // 5)   # Always advance, even if overlap >= max_chunk_size
    chunks = []
    for section in sections:
        text = section["text"]
        if len(text) <= max_chunk_size:
            chunks.append({
                "heading": section["heading"],
                "text": text,
                "chunk_index": 0
            })
        else:
            words = text.split()
            start = 0
            chunk_idx = 0
            while start < len(words):
                end = start + window
                chunks.append({
                    "heading": section["heading"],
                    "text": " ".join(words[start:end]),
                    "chunk_index": chunk_idx
                })
                if end >= len(words):
                    break  # Tail already covered; avoid an overlap-only chunk
                start += step
                chunk_idx += 1

    return chunks
```
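
To make the window arithmetic concrete, here is the overlapping split in isolation, working on a plain word list. `sliding_windows` is a hypothetical helper for illustration only, and the window and overlap sizes are kept small so the output is easy to inspect.

```python
def sliding_windows(words, window, overlap):
    """Split a word list into windows of `window` words that share `overlap` words."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")

    step = window - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(words[start:start + window])
        if start + window >= len(words):
            break  # the last window already covers the tail
        start += step
    return chunks

words = "one two three four five six seven eight nine ten".split()
windows = sliding_windows(words, window=4, overlap=1)
for chunk in windows:
    print(" ".join(chunk))
```

With the defaults of `chunk_academic_paper` (`max_chunk_size=1000`, `overlap=200`), the same arithmetic gives windows of about 200 words that advance 160 words at a time.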
257
+
258
+ ## Format Conversion
259
+
260
+ ### Markdown to PDF
261
+
262
+ ```bash
263
+ # Using Pandoc (most versatile converter)
264
+ pandoc paper.md -o paper.pdf --pdf-engine=xelatex
265
+
266
+ # With template and bibliography
267
+ pandoc paper.md -o paper.pdf \
268
+ --pdf-engine=xelatex \
269
+ --template=ieee.tex \
270
+ --bibliography=references.bib \
271
+ --citeproc \
272
+ --number-sections
273
+
274
+ # Markdown to Word (for collaborators who prefer Word)
275
+ pandoc paper.md -o paper.docx --reference-doc=template.docx
276
+ ```
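
When conversions run inside a Python pipeline, the same commands can be assembled programmatically. This is a minimal sketch that uses only the Pandoc flags shown above; `build_pandoc_command` and `convert` are hypothetical wrapper names, and the file names are placeholders.

```python
import shutil
import subprocess
from pathlib import Path

def build_pandoc_command(source, target, bibliography=None, template=None,
                         number_sections=False):
    """Assemble a pandoc argument list from the flags used in this guide."""
    cmd = ["pandoc", str(source), "-o", str(target)]
    if Path(target).suffix.lower() == ".pdf":
        cmd.append("--pdf-engine=xelatex")
    if template:
        cmd.append(f"--template={template}")
    if bibliography:
        # --citeproc resolves citations against the bibliography file
        cmd += [f"--bibliography={bibliography}", "--citeproc"]
    if number_sections:
        cmd.append("--number-sections")
    return cmd

def convert(source, target, **options):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    if shutil.which("pandoc") is None:
        raise FileNotFoundError("pandoc executable not found on PATH")
    subprocess.run(build_pandoc_command(source, target, **options), check=True)

command = build_pandoc_command("paper.md", "paper.pdf",
                               bibliography="references.bib",
                               number_sections=True)
print(" ".join(command))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.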
277
+
278
+ ### PDF to Markdown (Using Marker)
279
+
280
+ ```bash
281
+ # Install Marker (ML-based PDF to Markdown converter)
282
+ pip install marker-pdf
283
+
284
+ # Convert a single PDF
285
+ marker_single paper.pdf output_dir/ --langs English
286
+
287
+ # Batch convert
288
+ marker output_dir/ input_dir/ --workers 4
289
+ ```
290
+
291
+ ## OCR for Scanned PDFs
292
+
293
+ ```python
294
+ from pdf2image import convert_from_path
295
+ import pytesseract
296
+
297
+ def ocr_pdf(pdf_path, lang="eng"):
298
+ """OCR a scanned PDF using Tesseract."""
299
+ images = convert_from_path(pdf_path, dpi=300)
300
+ full_text = []
301
+
302
+ for i, image in enumerate(images):
303
+ text = pytesseract.image_to_string(image, lang=lang)
304
+ full_text.append(f"--- Page {i+1} ---\n{text}")
305
+
306
+ return "\n".join(full_text)
307
+
308
+ # For academic papers with math, use specialized OCR:
309
+ # - Mathpix API (commercial, excellent math OCR)
310
+ # - Nougat (Meta, open source, GPU required)
311
+ # - LaTeX-OCR (open source, formula-specific)
312
+ ```
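
OCR is slow, so it is usually worth running it only on pages that lack a text layer. Here is a minimal sketch of that check, assuming the per-page text has already been extracted (for example with PyMuPDF's `page.get_text()`). `pages_needing_ocr` and the `min_chars` threshold of 20 are illustrative assumptions, not tuned values.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indices of pages whose extracted text is (nearly) empty."""
    needs_ocr = []
    for index, text in enumerate(page_texts):
        # Count only non-whitespace characters so blank lines do not count as text
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            needs_ocr.append(index)
    return needs_ocr

# Hand-written stand-ins for per-page extraction results
page_texts = [
    "1 Introduction\nTransformers have become the dominant architecture.",
    "",        # scanned page: no text layer
    " \n\n ",  # scanned page: whitespace only
]
flagged = pages_needing_ocr(page_texts)
print(flagged)
```

Only the flagged pages then need rendering, for instance with `convert_from_path(pdf_path, dpi=300, first_page=n, last_page=n)`, where `n` is the 1-based page number.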
313
+
314
+ ## Best Practices
315
+
316
+ 1. **Try PyMuPDF first**: It is the fastest and handles most modern PDFs well. Fall back to GROBID for academic papers that need structural parsing.
317
+ 2. **Check PDF type**: Use `page.get_text()` to detect if a PDF is text-based or scanned. If empty, use OCR.
318
+ 3. **Handle multi-column layouts**: PyMuPDF's `sort` parameter in `get_text("blocks")` helps with reading order. GROBID and Marker handle this natively.
319
+ 4. **Preserve metadata**: Extract DOI, authors, and title from PDF metadata (`doc.metadata`) when available.
320
+ 5. **Validate table extraction**: Always visually verify extracted tables; complex layouts with merged cells often fail.
321
+ 6. **Cache extracted text**: Store parsed results alongside PDFs to avoid re-processing.
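
The caching advice in item 6 can be sketched as follows. This is one possible layout under stated assumptions: the cache key is the SHA-256 of the file's bytes, results are stored as JSON in a hypothetical `.parse_cache` directory next to the PDF, and `extract_fn` stands in for any extraction function from this guide that returns JSON-serializable data.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_extract(pdf_path, extract_fn, cache_dir=None):
    """Return extract_fn(pdf_path), reusing a JSON cache keyed by file content."""
    pdf_path = Path(pdf_path)
    # Hypothetical default location: a hidden folder beside the PDF
    cache_dir = Path(cache_dir) if cache_dir else pdf_path.parent / ".parse_cache"
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Hash the bytes so a modified PDF gets a fresh cache entry
    digest = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    result = extract_fn(str(pdf_path))
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result

# Demonstration with a dummy file and a stand-in extractor that records its calls
calls = []
def fake_extract(path):
    calls.append(path)
    return {"pages": 1, "text": "hello"}

with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "paper.pdf"
    pdf.write_bytes(b"%PDF-1.4 dummy")
    first = cached_extract(pdf, fake_extract)
    second = cached_extract(pdf, fake_extract)  # served from the cache

print(first == second, len(calls))
```

Keying on file content rather than file name means renamed copies of the same PDF share one cache entry, at the cost of reading the file once to hash it.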