@wentorai/research-plugins 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (252)
  1. package/LICENSE +21 -0
  2. package/README.md +204 -0
  3. package/curated/analysis/README.md +64 -0
  4. package/curated/domains/README.md +104 -0
  5. package/curated/literature/README.md +53 -0
  6. package/curated/research/README.md +62 -0
  7. package/curated/tools/README.md +87 -0
  8. package/curated/writing/README.md +61 -0
  9. package/index.ts +39 -0
  10. package/mcp-configs/academic-db/ChatSpatial.json +17 -0
  11. package/mcp-configs/academic-db/academia-mcp.json +17 -0
  12. package/mcp-configs/academic-db/academic-paper-explorer.json +17 -0
  13. package/mcp-configs/academic-db/academic-search-mcp-server.json +17 -0
  14. package/mcp-configs/academic-db/agentinterviews-mcp.json +17 -0
  15. package/mcp-configs/academic-db/all-in-mcp.json +17 -0
  16. package/mcp-configs/academic-db/apple-health-mcp.json +17 -0
  17. package/mcp-configs/academic-db/arxiv-latex-mcp.json +17 -0
  18. package/mcp-configs/academic-db/arxiv-mcp-server.json +17 -0
  19. package/mcp-configs/academic-db/bgpt-mcp.json +17 -0
  20. package/mcp-configs/academic-db/biomcp.json +17 -0
  21. package/mcp-configs/academic-db/biothings-mcp.json +17 -0
  22. package/mcp-configs/academic-db/catalysishub-mcp-server.json +17 -0
  23. package/mcp-configs/academic-db/clinicaltrialsgov-mcp-server.json +17 -0
  24. package/mcp-configs/academic-db/deep-research-mcp.json +17 -0
  25. package/mcp-configs/academic-db/dicom-mcp.json +17 -0
  26. package/mcp-configs/academic-db/enrichr-mcp-server.json +17 -0
  27. package/mcp-configs/academic-db/fec-mcp-server.json +17 -0
  28. package/mcp-configs/academic-db/fhir-mcp-server-themomentum.json +17 -0
  29. package/mcp-configs/academic-db/fhir-mcp.json +19 -0
  30. package/mcp-configs/academic-db/gget-mcp.json +17 -0
  31. package/mcp-configs/academic-db/google-researcher-mcp.json +17 -0
  32. package/mcp-configs/academic-db/idea-reality-mcp.json +17 -0
  33. package/mcp-configs/academic-db/legiscan-mcp.json +19 -0
  34. package/mcp-configs/academic-db/lex.json +17 -0
  35. package/mcp-configs/ai-platform/Adaptive-Graph-of-Thoughts-MCP-server.json +17 -0
  36. package/mcp-configs/ai-platform/ai-counsel.json +17 -0
  37. package/mcp-configs/ai-platform/atlas-mcp-server.json +17 -0
  38. package/mcp-configs/ai-platform/counsel-mcp.json +17 -0
  39. package/mcp-configs/ai-platform/cross-llm-mcp.json +17 -0
  40. package/mcp-configs/ai-platform/gptr-mcp.json +17 -0
  41. package/mcp-configs/browser/decipher-research-agent.json +17 -0
  42. package/mcp-configs/browser/deep-research.json +17 -0
  43. package/mcp-configs/browser/everything-claude-code.json +17 -0
  44. package/mcp-configs/browser/gpt-researcher.json +17 -0
  45. package/mcp-configs/browser/heurist-agent-framework.json +17 -0
  46. package/mcp-configs/data-platform/4everland-hosting-mcp.json +17 -0
  47. package/mcp-configs/data-platform/context-keeper.json +17 -0
  48. package/mcp-configs/data-platform/context7.json +19 -0
  49. package/mcp-configs/data-platform/contextstream-mcp.json +17 -0
  50. package/mcp-configs/data-platform/email-mcp.json +17 -0
  51. package/mcp-configs/note-knowledge/ApeRAG.json +17 -0
  52. package/mcp-configs/note-knowledge/In-Memoria.json +17 -0
  53. package/mcp-configs/note-knowledge/agent-memory.json +17 -0
  54. package/mcp-configs/note-knowledge/aimemo.json +17 -0
  55. package/mcp-configs/note-knowledge/biel-mcp.json +19 -0
  56. package/mcp-configs/note-knowledge/cognee.json +17 -0
  57. package/mcp-configs/note-knowledge/context-awesome.json +17 -0
  58. package/mcp-configs/note-knowledge/context-mcp.json +17 -0
  59. package/mcp-configs/note-knowledge/conversation-handoff-mcp.json +17 -0
  60. package/mcp-configs/note-knowledge/cortex.json +17 -0
  61. package/mcp-configs/note-knowledge/devrag.json +17 -0
  62. package/mcp-configs/note-knowledge/easy-obsidian-mcp.json +17 -0
  63. package/mcp-configs/note-knowledge/engram.json +17 -0
  64. package/mcp-configs/note-knowledge/gnosis-mcp.json +17 -0
  65. package/mcp-configs/note-knowledge/graphlit-mcp-server.json +19 -0
  66. package/mcp-configs/reference-mgr/arxiv-cli.json +17 -0
  67. package/mcp-configs/reference-mgr/arxiv-search-mcp.json +17 -0
  68. package/mcp-configs/reference-mgr/chiken.json +17 -0
  69. package/mcp-configs/reference-mgr/claude-scholar.json +17 -0
  70. package/mcp-configs/reference-mgr/devonthink-mcp.json +17 -0
  71. package/mcp-configs/registry.json +447 -0
  72. package/openclaw.plugin.json +21 -0
  73. package/package.json +61 -0
  74. package/skills/analysis/dataviz/color-accessibility-guide/SKILL.md +230 -0
  75. package/skills/analysis/dataviz/geospatial-viz-guide/SKILL.md +218 -0
  76. package/skills/analysis/dataviz/interactive-viz-guide/SKILL.md +287 -0
  77. package/skills/analysis/dataviz/network-visualization-guide/SKILL.md +195 -0
  78. package/skills/analysis/dataviz/publication-figures-guide/SKILL.md +238 -0
  79. package/skills/analysis/dataviz/python-dataviz-guide/SKILL.md +195 -0
  80. package/skills/analysis/econometrics/causal-inference-guide/SKILL.md +197 -0
  81. package/skills/analysis/econometrics/iv-regression-guide/SKILL.md +198 -0
  82. package/skills/analysis/econometrics/panel-data-guide/SKILL.md +274 -0
  83. package/skills/analysis/econometrics/robustness-checks/SKILL.md +250 -0
  84. package/skills/analysis/econometrics/stata-regression/SKILL.md +117 -0
  85. package/skills/analysis/econometrics/time-series-guide/SKILL.md +235 -0
  86. package/skills/analysis/statistics/bayesian-statistics-guide/SKILL.md +221 -0
  87. package/skills/analysis/statistics/hypothesis-testing-guide/SKILL.md +210 -0
  88. package/skills/analysis/statistics/meta-analysis-guide/SKILL.md +206 -0
  89. package/skills/analysis/statistics/nonparametric-tests-guide/SKILL.md +221 -0
  90. package/skills/analysis/statistics/power-analysis-guide/SKILL.md +240 -0
  91. package/skills/analysis/statistics/sem-guide/SKILL.md +231 -0
  92. package/skills/analysis/statistics/survival-analysis-guide/SKILL.md +195 -0
  93. package/skills/analysis/wrangling/missing-data-handling/SKILL.md +224 -0
  94. package/skills/analysis/wrangling/pandas-data-wrangling/SKILL.md +242 -0
  95. package/skills/analysis/wrangling/questionnaire-design-guide/SKILL.md +234 -0
  96. package/skills/analysis/wrangling/text-mining-guide/SKILL.md +225 -0
  97. package/skills/domains/ai-ml/computer-vision-guide/SKILL.md +213 -0
  98. package/skills/domains/ai-ml/deep-learning-papers-guide/SKILL.md +200 -0
  99. package/skills/domains/ai-ml/llm-evaluation-guide/SKILL.md +194 -0
  100. package/skills/domains/ai-ml/prompt-engineering-research/SKILL.md +233 -0
  101. package/skills/domains/ai-ml/reinforcement-learning-guide/SKILL.md +254 -0
  102. package/skills/domains/ai-ml/transformer-architecture-guide/SKILL.md +233 -0
  103. package/skills/domains/biomedical/clinical-research-guide/SKILL.md +232 -0
  104. package/skills/domains/biomedical/clinicaltrials-api/SKILL.md +177 -0
  105. package/skills/domains/biomedical/epidemiology-guide/SKILL.md +200 -0
  106. package/skills/domains/biomedical/genomics-analysis-guide/SKILL.md +270 -0
  107. package/skills/domains/business/market-analysis-guide/SKILL.md +112 -0
  108. package/skills/domains/business/strategic-management-guide/SKILL.md +154 -0
  109. package/skills/domains/chemistry/computational-chemistry-guide/SKILL.md +266 -0
  110. package/skills/domains/chemistry/retrosynthesis-guide/SKILL.md +215 -0
  111. package/skills/domains/cs/algorithms-complexity-guide/SKILL.md +194 -0
  112. package/skills/domains/cs/dblp-api/SKILL.md +129 -0
  113. package/skills/domains/cs/software-engineering-research/SKILL.md +218 -0
  114. package/skills/domains/ecology/biodiversity-data-guide/SKILL.md +296 -0
  115. package/skills/domains/ecology/conservation-biology-guide/SKILL.md +198 -0
  116. package/skills/domains/ecology/gbif-api/SKILL.md +158 -0
  117. package/skills/domains/ecology/inaturalist-api/SKILL.md +173 -0
  118. package/skills/domains/economics/behavioral-economics-guide/SKILL.md +239 -0
  119. package/skills/domains/economics/development-economics-guide/SKILL.md +181 -0
  120. package/skills/domains/economics/fred-api/SKILL.md +189 -0
  121. package/skills/domains/education/curriculum-design-guide/SKILL.md +144 -0
  122. package/skills/domains/education/learning-science-guide/SKILL.md +150 -0
  123. package/skills/domains/finance/financial-data-analysis/SKILL.md +152 -0
  124. package/skills/domains/finance/quantitative-finance-guide/SKILL.md +151 -0
  125. package/skills/domains/geoscience/climate-science-guide/SKILL.md +158 -0
  126. package/skills/domains/geoscience/gis-remote-sensing-guide/SKILL.md +129 -0
  127. package/skills/domains/humanities/digital-humanities-guide/SKILL.md +181 -0
  128. package/skills/domains/humanities/philosophy-research-guide/SKILL.md +148 -0
  129. package/skills/domains/law/courtlistener-api/SKILL.md +213 -0
  130. package/skills/domains/law/legal-research-guide/SKILL.md +250 -0
  131. package/skills/domains/math/linear-algebra-applications/SKILL.md +227 -0
  132. package/skills/domains/math/numerical-methods-guide/SKILL.md +236 -0
  133. package/skills/domains/math/oeis-api/SKILL.md +158 -0
  134. package/skills/domains/pharma/clinical-pharmacology-guide/SKILL.md +165 -0
  135. package/skills/domains/pharma/drug-development-guide/SKILL.md +177 -0
  136. package/skills/domains/physics/computational-physics-guide/SKILL.md +300 -0
  137. package/skills/domains/physics/nasa-ads-api/SKILL.md +150 -0
  138. package/skills/domains/physics/quantum-computing-guide/SKILL.md +234 -0
  139. package/skills/domains/social-science/social-research-methods/SKILL.md +194 -0
  140. package/skills/domains/social-science/survey-research-guide/SKILL.md +182 -0
  141. package/skills/literature/discovery/citation-alert-guide/SKILL.md +154 -0
  142. package/skills/literature/discovery/conference-proceedings-guide/SKILL.md +142 -0
  143. package/skills/literature/discovery/literature-mapping-guide/SKILL.md +175 -0
  144. package/skills/literature/discovery/paper-tracking-guide/SKILL.md +211 -0
  145. package/skills/literature/discovery/rss-paper-feeds/SKILL.md +214 -0
  146. package/skills/literature/discovery/semantic-scholar-recs-guide/SKILL.md +164 -0
  147. package/skills/literature/fulltext/doaj-api/SKILL.md +120 -0
  148. package/skills/literature/fulltext/interlibrary-loan-guide/SKILL.md +163 -0
  149. package/skills/literature/fulltext/open-access-guide/SKILL.md +183 -0
  150. package/skills/literature/fulltext/pmc-oai-api/SKILL.md +184 -0
  151. package/skills/literature/fulltext/preprint-servers-guide/SKILL.md +128 -0
  152. package/skills/literature/fulltext/repository-harvesting-guide/SKILL.md +207 -0
  153. package/skills/literature/fulltext/unpaywall-api/SKILL.md +113 -0
  154. package/skills/literature/metadata/altmetrics-guide/SKILL.md +132 -0
  155. package/skills/literature/metadata/citation-network-guide/SKILL.md +236 -0
  156. package/skills/literature/metadata/crossref-api/SKILL.md +133 -0
  157. package/skills/literature/metadata/datacite-api/SKILL.md +126 -0
  158. package/skills/literature/metadata/doi-resolution-guide/SKILL.md +168 -0
  159. package/skills/literature/metadata/h-index-guide/SKILL.md +183 -0
  160. package/skills/literature/metadata/journal-metrics-guide/SKILL.md +188 -0
  161. package/skills/literature/metadata/opencitations-api/SKILL.md +128 -0
  162. package/skills/literature/metadata/orcid-api/SKILL.md +136 -0
  163. package/skills/literature/metadata/orcid-integration-guide/SKILL.md +178 -0
  164. package/skills/literature/search/arxiv-api/SKILL.md +95 -0
  165. package/skills/literature/search/biorxiv-api/SKILL.md +123 -0
  166. package/skills/literature/search/boolean-search-guide/SKILL.md +199 -0
  167. package/skills/literature/search/citation-chaining-guide/SKILL.md +148 -0
  168. package/skills/literature/search/database-comparison-guide/SKILL.md +100 -0
  169. package/skills/literature/search/europe-pmc-api/SKILL.md +120 -0
  170. package/skills/literature/search/google-scholar-guide/SKILL.md +182 -0
  171. package/skills/literature/search/mesh-terms-guide/SKILL.md +164 -0
  172. package/skills/literature/search/openalex-api/SKILL.md +134 -0
  173. package/skills/literature/search/pubmed-api/SKILL.md +130 -0
  174. package/skills/literature/search/scientify-literature-survey/SKILL.md +203 -0
  175. package/skills/literature/search/semantic-scholar-api/SKILL.md +134 -0
  176. package/skills/literature/search/systematic-search-strategy/SKILL.md +214 -0
  177. package/skills/research/automation/ai-scientist-guide/SKILL.md +228 -0
  178. package/skills/research/automation/data-collection-automation/SKILL.md +248 -0
  179. package/skills/research/automation/research-workflow-automation/SKILL.md +266 -0
  180. package/skills/research/deep-research/meta-synthesis-guide/SKILL.md +174 -0
  181. package/skills/research/deep-research/research-cog/SKILL.md +153 -0
  182. package/skills/research/deep-research/scoping-review-guide/SKILL.md +217 -0
  183. package/skills/research/deep-research/systematic-review-guide/SKILL.md +250 -0
  184. package/skills/research/funding/figshare-api/SKILL.md +163 -0
  185. package/skills/research/funding/grant-writing-guide/SKILL.md +233 -0
  186. package/skills/research/funding/nsf-grant-guide/SKILL.md +206 -0
  187. package/skills/research/funding/open-science-guide/SKILL.md +255 -0
  188. package/skills/research/funding/zenodo-api/SKILL.md +174 -0
  189. package/skills/research/methodology/action-research-guide/SKILL.md +201 -0
  190. package/skills/research/methodology/experimental-design-guide/SKILL.md +236 -0
  191. package/skills/research/methodology/grad-school-guide/SKILL.md +182 -0
  192. package/skills/research/methodology/grounded-theory-guide/SKILL.md +171 -0
  193. package/skills/research/methodology/mixed-methods-guide/SKILL.md +208 -0
  194. package/skills/research/methodology/qualitative-research-guide/SKILL.md +234 -0
  195. package/skills/research/methodology/scientify-idea-generation/SKILL.md +222 -0
  196. package/skills/research/paper-review/paper-reading-assistant/SKILL.md +266 -0
  197. package/skills/research/paper-review/peer-review-guide/SKILL.md +227 -0
  198. package/skills/research/paper-review/rebuttal-writing-guide/SKILL.md +185 -0
  199. package/skills/research/paper-review/scientify-write-review-paper/SKILL.md +209 -0
  200. package/skills/tools/code-exec/jupyter-notebook-guide/SKILL.md +178 -0
  201. package/skills/tools/code-exec/python-reproducibility-guide/SKILL.md +341 -0
  202. package/skills/tools/code-exec/r-reproducibility-guide/SKILL.md +236 -0
  203. package/skills/tools/code-exec/sandbox-execution-guide/SKILL.md +221 -0
  204. package/skills/tools/diagram/mermaid-diagram-guide/SKILL.md +269 -0
  205. package/skills/tools/diagram/plantuml-guide/SKILL.md +397 -0
  206. package/skills/tools/diagram/scientific-illustration-guide/SKILL.md +225 -0
  207. package/skills/tools/document/anystyle-api/SKILL.md +199 -0
  208. package/skills/tools/document/grobid-pdf-parsing/SKILL.md +294 -0
  209. package/skills/tools/document/markdown-academic-guide/SKILL.md +217 -0
  210. package/skills/tools/document/pdf-extraction-guide/SKILL.md +321 -0
  211. package/skills/tools/knowledge-graph/knowledge-graph-construction/SKILL.md +306 -0
  212. package/skills/tools/knowledge-graph/ontology-design-guide/SKILL.md +214 -0
  213. package/skills/tools/knowledge-graph/rag-methodology-guide/SKILL.md +325 -0
  214. package/skills/tools/ocr-translate/formula-recognition-guide/SKILL.md +367 -0
  215. package/skills/tools/ocr-translate/handwriting-recognition-guide/SKILL.md +211 -0
  216. package/skills/tools/ocr-translate/latex-ocr-guide/SKILL.md +204 -0
  217. package/skills/tools/ocr-translate/multilingual-research-guide/SKILL.md +234 -0
  218. package/skills/tools/scraping/academic-web-scraping/SKILL.md +326 -0
  219. package/skills/tools/scraping/api-data-collection-guide/SKILL.md +301 -0
  220. package/skills/tools/scraping/web-scraping-ethics-guide/SKILL.md +250 -0
  221. package/skills/writing/citation/bibtex-management-guide/SKILL.md +246 -0
  222. package/skills/writing/citation/citation-style-guide/SKILL.md +248 -0
  223. package/skills/writing/citation/reference-manager-comparison/SKILL.md +208 -0
  224. package/skills/writing/citation/zotero-api/SKILL.md +188 -0
  225. package/skills/writing/composition/abstract-writing-guide/SKILL.md +188 -0
  226. package/skills/writing/composition/discussion-writing-guide/SKILL.md +194 -0
  227. package/skills/writing/composition/introduction-writing-guide/SKILL.md +194 -0
  228. package/skills/writing/composition/literature-review-writing/SKILL.md +196 -0
  229. package/skills/writing/composition/methods-section-guide/SKILL.md +185 -0
  230. package/skills/writing/composition/response-to-reviewers/SKILL.md +215 -0
  231. package/skills/writing/composition/scientific-writing-guide/SKILL.md +152 -0
  232. package/skills/writing/latex/bibliography-management-guide/SKILL.md +206 -0
  233. package/skills/writing/latex/latex-drawing-guide/SKILL.md +234 -0
  234. package/skills/writing/latex/latex-ecosystem-guide/SKILL.md +240 -0
  235. package/skills/writing/latex/math-typesetting-guide/SKILL.md +231 -0
  236. package/skills/writing/latex/overleaf-collaboration-guide/SKILL.md +211 -0
  237. package/skills/writing/latex/tikz-diagrams-guide/SKILL.md +211 -0
  238. package/skills/writing/polish/academic-translation-guide/SKILL.md +175 -0
  239. package/skills/writing/polish/academic-writing-refiner/SKILL.md +143 -0
  240. package/skills/writing/polish/ai-writing-humanizer/SKILL.md +178 -0
  241. package/skills/writing/polish/grammar-checker-guide/SKILL.md +184 -0
  242. package/skills/writing/polish/plagiarism-detection-guide/SKILL.md +167 -0
  243. package/skills/writing/templates/beamer-presentation-guide/SKILL.md +263 -0
  244. package/skills/writing/templates/conference-paper-template/SKILL.md +219 -0
  245. package/skills/writing/templates/thesis-template-guide/SKILL.md +200 -0
  246. package/skills/writing/templates/thesis-writing-guide/SKILL.md +220 -0
  247. package/src/tools/arxiv.ts +131 -0
  248. package/src/tools/crossref.ts +112 -0
  249. package/src/tools/openalex.ts +174 -0
  250. package/src/tools/pubmed.ts +166 -0
  251. package/src/tools/semantic-scholar.ts +108 -0
  252. package/src/tools/unpaywall.ts +58 -0
@@ -0,0 +1,325 @@ package/skills/tools/knowledge-graph/rag-methodology-guide/SKILL.md
---
name: rag-methodology-guide
description: "RAG architecture for academic knowledge retrieval and synthesis"
metadata:
  openclaw:
    emoji: "brain"
    category: "tools"
    subcategory: "knowledge-graph"
    keywords: ["RAG", "retrieval augmented generation", "academic knowledge graph", "knowledge modeling"]
    source: "wentor-research-plugins"
---

# RAG Methodology Guide

Design and implement Retrieval-Augmented Generation (RAG) systems for academic research, including document chunking, embedding strategies, retrieval pipelines, and evaluation.

## What Is RAG?

Retrieval-Augmented Generation (RAG) augments a language model's generation with relevant information retrieved from an external knowledge base. For academic research, this enables:

- Question answering over a personal paper library
- Literature synthesis across hundreds of papers
- Fact-checking claims against source documents
- Generating citations with provenance

### RAG Pipeline Architecture

```
Query: "What are the main challenges of protein folding?"
        |
        v
[1. Query Processing]
    |-- Embed query using embedding model
    |-- Optional: Query expansion / HyDE
        |
        v
[2. Retrieval]
    |-- Search vector database for top-k relevant chunks
    |-- Optional: Reranking with cross-encoder
        |
        v
[3. Context Assembly]
    |-- Combine retrieved chunks into a prompt
    |-- Add metadata (source, page, citation)
        |
        v
[4. Generation]
    |-- LLM generates answer grounded in retrieved context
    |-- Include inline citations
        |
        v
Answer with citations
```

## Step 1: Document Ingestion and Chunking

### Chunking Strategies

| Strategy | Description | Best For |
|----------|-------------|----------|
| **Fixed-size** | Split every N characters/tokens | Simple, fast baseline |
| **Sentence-based** | Split on sentence boundaries | Natural reading units |
| **Paragraph-based** | Split on paragraph breaks | Coherent semantic units |
| **Section-based** | Split on document headings | Academic papers |
| **Recursive** | Hierarchically split (heading > paragraph > sentence) | General purpose |
| **Semantic** | Split on topic shifts using embeddings | Best quality, slower |

### Implementation

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_academic_paper(text, chunk_size=1000, chunk_overlap=200):
    """Chunk an academic paper using recursive splitting."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=[
            "\n## ",   # H2 headings (section breaks)
            "\n### ",  # H3 headings (subsection breaks)
            "\n\n",    # Paragraph breaks
            "\n",      # Line breaks
            ". ",      # Sentence breaks
            " ",       # Word breaks
        ],
        length_function=len,
    )
    chunks = splitter.split_text(text)
    return chunks

# Add metadata to each chunk
def create_documents(paper_text, metadata):
    """Create chunks with source metadata for citation tracking."""
    chunks = chunk_academic_paper(paper_text)
    documents = []
    for i, chunk in enumerate(chunks):
        documents.append({
            "text": chunk,
            "metadata": {
                **metadata,
                "chunk_index": i,
                "chunk_total": len(chunks),
            },
        })
    return documents

# Example usage
docs = create_documents(
    paper_text=extracted_text,
    metadata={
        "title": "Attention Is All You Need",
        "authors": "Vaswani et al.",
        "year": 2017,
        "doi": "10.48550/arXiv.1706.03762",
        "source_file": "vaswani2017attention.pdf",
    },
)
```
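For papers that already carry markdown-style headings (e.g. GROBID or Nougat output), the section-based strategy from the table can be sketched without any library at all. This is a minimal, dependency-free illustration, not the package's own splitter:

```python
import re

def split_by_sections(text):
    """Split markdown-ish text into (heading, body) pairs on ## / ### headings."""
    # re.split with a capturing group keeps the matched headings in the result
    parts = re.split(r"(?m)^(#{2,3} .+)$", text)
    sections = []
    if parts[0].strip():
        # Text before the first heading (abstract, title block, ...)
        sections.append(("(preamble)", parts[0].strip()))
    # parts alternates: [preamble, heading1, body1, heading2, body2, ...]
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections.append((heading.strip(), body.strip()))
    return sections

paper = "Intro text.\n## Methods\nWe did X.\n## Results\nIt worked."
sections = split_by_sections(paper)
# sections == [("(preamble)", "Intro text."), ("## Methods", "We did X."), ("## Results", "It worked.")]
```

Each `(heading, body)` pair can then be fed to `chunk_academic_paper` individually, so no chunk straddles a section boundary.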

## Step 2: Embedding and Indexing

### Embedding Model Selection

| Model | Dimensions | Quality | Speed | Cost |
|-------|-----------|---------|-------|------|
| OpenAI text-embedding-3-small | 1536 | Good | Fast | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | Excellent | Fast | $0.13/1M tokens |
| Cohere embed-v3 | 1024 | Excellent | Fast | $0.10/1M tokens |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Good | Very fast | Free (local) |
| BAAI/bge-large-en-v1.5 | 1024 | Excellent | Medium | Free (local) |
| nomic-embed-text | 768 | Good | Fast | Free (local) |

### Vector Database Options

| Database | Type | Scalability | Features |
|----------|------|------------|----------|
| ChromaDB | Embedded | Small-medium | Simple, good for prototyping |
| FAISS | Library | Large | Facebook research, GPU support |
| Pinecone | Cloud | Large | Managed, serverless |
| Weaviate | Self-hosted/Cloud | Large | Hybrid search, filters |
| Qdrant | Self-hosted/Cloud | Large | Rich filtering, payload storage |
| pgvector | PostgreSQL extension | Medium | SQL integration |

### Building the Index

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize embedding model (local, free)
embed_model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Initialize ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="research_papers",
    metadata={"hnsw:space": "cosine"}
)

# Index documents
def index_documents(documents):
    """Add documents to the vector database."""
    texts = [doc["text"] for doc in documents]
    embeddings = embed_model.encode(texts, show_progress_bar=True).tolist()
    ids = [f"doc_{i}" for i in range(len(documents))]
    metadatas = [doc["metadata"] for doc in documents]

    collection.add(
        documents=texts,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids
    )
    print(f"Indexed {len(documents)} chunks")

index_documents(docs)
```

## Step 3: Retrieval

### Basic Retrieval

```python
def retrieve(query, top_k=5):
    """Retrieve the most relevant chunks for a query."""
    query_embedding = embed_model.encode([query]).tolist()

    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        retrieved.append({
            "text": doc,
            "metadata": meta,
            "similarity": 1 - dist  # Convert cosine distance to similarity
        })

    return retrieved

# Example
results = retrieve("What are the main components of the Transformer architecture?")
for r in results:
    print(f"[{r['similarity']:.3f}] {r['metadata'].get('title', 'N/A')}")
    print(f"  {r['text'][:150]}...")
```

### Advanced Retrieval: Hybrid Search

```python
from rank_bm25 import BM25Okapi

def hybrid_retrieve(query, top_k=5, alpha=0.7):
    """Combine dense (semantic) and sparse (keyword) retrieval."""

    # Dense retrieval (vector similarity)
    dense_results = retrieve(query, top_k=top_k * 2)

    # Sparse retrieval (BM25 keyword matching).
    # Assumes all_documents is a list of all chunk texts, ordered so that
    # position i corresponds to chunk_index i (otherwise the two ID spaces
    # fused below would not refer to the same chunks).
    tokenized_corpus = [doc.split() for doc in all_documents]
    bm25 = BM25Okapi(tokenized_corpus)
    bm25_scores = bm25.get_scores(query.split())
    sparse_top_k = bm25_scores.argsort()[-top_k * 2:][::-1]

    # Reciprocal Rank Fusion (RRF), weighted by alpha
    rrf_scores = {}
    k = 60  # RRF constant

    for rank, result in enumerate(dense_results):
        doc_id = result["metadata"].get("chunk_index", rank)
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + alpha / (k + rank + 1)

    for rank, idx in enumerate(sparse_top_k):
        rrf_scores[idx] = rrf_scores.get(idx, 0) + (1 - alpha) / (k + rank + 1)

    # Sort by fused score and return the top-k (doc_id, score) pairs
    sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:top_k]
```
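The RRF step above can be isolated into a small, dependency-free helper that fuses any number of ranked ID lists; `k = 60` is the commonly used constant. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists of IDs, each ordered best-first.
    weights:  optional per-ranking weights (defaults to 1.0 each).
    Returns document IDs sorted by fused score, best first.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking):
            # A document scores w / (k + rank + 1) from each list it appears in
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Documents ranked differently by dense and sparse retrieval
dense = ["d2", "d1", "d3"]
sparse = ["d1", "d4", "d2"]
fused = reciprocal_rank_fusion([dense, sparse])
# fused == ["d1", "d2", "d4", "d3"]
```

Because RRF only uses ranks, not raw scores, it needs no score normalization between the dense and sparse retrievers.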

## Step 4: Generation with Citations

```python
def generate_answer(query, retrieved_contexts):
    """Generate an answer with inline citations using an LLM."""

    # Build context string with citation markers
    context_parts = []
    for i, ctx in enumerate(retrieved_contexts, 1):
        source = f"{ctx['metadata'].get('authors', 'Unknown')}, {ctx['metadata'].get('year', 'N/A')}"
        context_parts.append(f"[{i}] ({source}): {ctx['text']}")

    context_string = "\n\n".join(context_parts)

    prompt = f"""Based on the following research paper excerpts, answer the question.
Use inline citations like [1], [2] to reference specific sources.
Only use information from the provided excerpts.
If the excerpts do not contain enough information, say so.

EXCERPTS:
{context_string}

QUESTION: {query}

ANSWER (with inline citations):"""

    # Send to LLM (example with OpenAI)
    # response = openai.chat.completions.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": prompt}],
    #     temperature=0.1
    # )
    # return response.choices[0].message.content

    return prompt  # Return the prompt for inspection
```

## Evaluation Metrics

| Metric | Measures | Tool |
|--------|----------|------|
| **Retrieval precision** | Are retrieved chunks relevant? | Manual annotation |
| **Retrieval recall** | Are all relevant chunks retrieved? | Known-relevant set |
| **NDCG** | Ranking quality of retrieved results | BEIR benchmark |
| **Answer correctness** | Is the generated answer factually correct? | Human evaluation |
| **Faithfulness** | Does the answer only use information from retrieved context? | RAGAS framework |
| **Answer relevance** | Does the answer address the question? | RAGAS framework |
| **Context relevance** | Are the retrieved contexts relevant to the question? | RAGAS framework |

```python
# Using RAGAS for automated RAG evaluation
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Prepare evaluation data (ragas expects a datasets.Dataset)
eval_data = {
    "question": ["What is the Transformer architecture?"],
    "answer": ["The Transformer uses self-attention mechanisms..."],
    "contexts": [["The Transformer model architecture eschews recurrence..."]],
    "ground_truth": ["The Transformer is a neural network architecture..."]
}

result = evaluate(
    dataset=Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision]
)
print(result)
```

## Best Practices for Academic RAG

1. **Chunk by section**: Academic papers have natural section boundaries. Use them.
2. **Preserve metadata**: Always store title, authors, year, DOI, and page number with each chunk for proper citation.
3. **Use domain-specific embeddings**: Models fine-tuned on scientific text (e.g., SPECTER2) outperform general models for academic content.
4. **Rerank after retrieval**: A cross-encoder reranker significantly improves precision over embedding-only retrieval.
5. **Handle tables and figures**: Extract tables as text or structured data; do not ignore them during chunking.
6. **Evaluate systematically**: Use RAGAS or a custom evaluation set to measure retrieval and generation quality before deploying.
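Practice 4 can be sketched with a pluggable scoring function. The cross-encoder line is commented out because it downloads a model on first use, and `ms-marco-MiniLM-L-6-v2` is just one commonly used checkpoint, not a requirement of this package:

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-order retrieved chunks with a (query, text) relevance scorer.

    candidates: list of dicts with a "text" key (as returned by retrieve()).
    score_fn:   callable (query, text) -> float, higher = more relevant.
    """
    scored = [(score_fn(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

# With a real cross-encoder (downloads a model on first use):
# from sentence_transformers import CrossEncoder
# ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# reranked = rerank(query, candidates, lambda q, t: ce.predict([(q, t)])[0])

# Toy scorer for illustration only: count query-term overlap
def overlap_score(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))

cands = [{"text": "attention mechanisms in transformers"},
         {"text": "protein folding energy landscapes"}]
top = rerank("transformer attention", cands, overlap_score, top_k=1)
```

Retrieving more candidates than needed (e.g. `top_k * 4`) and then reranking down to `top_k` is the usual pattern, since the cross-encoder is far more accurate but far slower than the embedding search.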
@@ -0,0 +1,367 @@
1
+ ---
2
+ name: formula-recognition-guide
3
+ description: "Math OCR and formula recognition to LaTeX conversion"
4
+ metadata:
5
+ openclaw:
6
+ emoji: "math"
7
+ category: "tools"
8
+ subcategory: "ocr-translate"
9
+ keywords: ["math OCR", "formula recognition", "LaTeX OCR"]
10
+ source: "wentor-research-plugins"
11
+ ---
12
+
13
+ # Formula Recognition Guide
14
+
15
+ Convert mathematical formulas from images, PDFs, and handwritten notes to LaTeX code using OCR tools, neural models, and API services.
16
+
17
+ ## Tool Comparison
18
+
19
+ | Tool | Input | Output | Accuracy | Speed | Cost |
20
+ |------|-------|--------|----------|-------|------|
21
+ | Mathpix | Image, PDF, screenshot | LaTeX, MathML | Excellent | Fast | Free tier (50/month), then paid |
22
+ | LaTeX-OCR (Lukas Blecher) | Image | LaTeX | Very good | Medium | Free (open source) |
23
+ | Pix2Text (p2t) | Image | LaTeX + text | Good | Medium | Free (open source) |
24
+ | Nougat (Meta) | PDF pages | Markdown + LaTeX | Excellent (full page) | Slow (GPU) | Free (open source) |
25
+ | InftyReader | Image, PDF | LaTeX, MathML | Good | Medium | Commercial |
26
+ | Google Cloud Vision | Image | Text (limited math) | Poor for math | Fast | Pay per use |
27
+ | im2latex (Harvard NLP) | Image | LaTeX | Good | Medium | Free (open source) |
28
+
29
+ ## Mathpix API
30
+
31
+ Mathpix is the industry-standard math OCR service, handling printed and handwritten formulas, tables, and full documents.
32
+
33
+ ### Setup
34
+
35
+ ```bash
36
+ pip install requests  # the examples below call the Mathpix REST API directly
38
+ ```
39
+
40
+ ### Single Image to LaTeX
41
+
42
+ ```python
43
+ import requests
44
+ import base64
45
+ import json
46
+
47
+ def mathpix_ocr(image_path, app_id, app_key):
48
+ """Convert an image of a formula to LaTeX using Mathpix API."""
49
+ with open(image_path, "rb") as f:
50
+ image_data = base64.b64encode(f.read()).decode()
51
+
52
+ response = requests.post(
53
+ "https://api.mathpix.com/v3/text",
54
+ headers={
55
+ "app_id": app_id,
56
+ "app_key": app_key,
57
+ "Content-Type": "application/json"
58
+ },
59
+ json={
60
+ "src": f"data:image/png;base64,{image_data}",
61
+ "formats": ["latex_styled", "latex_normal", "mathml"],
62
+ "data_options": {
63
+ "include_asciimath": True,
64
+ "include_latex": True
65
+ }
66
+ }
67
+ )
68
+
69
+ result = response.json()
70
+ return {
71
+ "latex": result.get("latex_styled", ""),
72
+ "latex_normal": result.get("latex_normal", ""),
73
+ "confidence": result.get("confidence", 0),
74
+ "mathml": result.get("mathml", "")
75
+ }
76
+
77
+ # Usage
78
+ result = mathpix_ocr("equation.png", "YOUR_APP_ID", "YOUR_APP_KEY")
79
+ print(f"LaTeX: {result['latex']}")
80
+ print(f"Confidence: {result['confidence']:.2%}")
81
+ ```
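The `confidence` field returned above can gate which equations need a human check before use. A minimal helper, assuming the result dict shape produced by `mathpix_ocr` (the 0.9 threshold is an arbitrary choice):

```python
def route_result(result, threshold=0.9):
    """Label an OCR result as auto-accepted or needing manual review."""
    status = "accept" if result.get("confidence", 0) >= threshold else "review"
    return {**result, "status": status}

good = route_result({"latex": r"e^{i\pi} + 1 = 0", "confidence": 0.98})
shaky = route_result({"latex": r"\Sigma_i x_i", "confidence": 0.61})
```

Missing confidence defaults to 0, so malformed API responses are always routed to review.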
82
+
83
+ ### Process a Full PDF Page
84
+
85
+ ```python
86
+ def mathpix_pdf_page(image_path, app_id, app_key):
87
+ """Process a full PDF page with mixed text and math."""
88
+ with open(image_path, "rb") as f:
89
+ image_data = base64.b64encode(f.read()).decode()
90
+
91
+ response = requests.post(
92
+ "https://api.mathpix.com/v3/text",
93
+ headers={
94
+ "app_id": app_id,
95
+ "app_key": app_key,
96
+ "Content-Type": "application/json"
97
+ },
98
+ json={
99
+ "src": f"data:image/png;base64,{image_data}",
100
+ "formats": ["text", "latex_styled"],
101
+ "ocr": ["math", "text"],
102
+ "math_inline_delimiters": ["$", "$"],
103
+ "math_display_delimiters": ["$$", "$$"]
104
+ }
105
+ )
106
+
107
+ result = response.json()
108
+ return result.get("text", "")
109
+
110
+ # Returns Markdown with inline $...$ and display $$...$$ math
111
+ ```
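Since the call above returns Markdown with `$...$` and `$$...$$` delimiters, the display equations can be pulled out for separate storage with a short regex pass (a sketch; assumes delimiters are not nested):

```python
import re

def extract_display_math(markdown_text):
    """Return all $$...$$ display equations from Mathpix-style Markdown."""
    return [m.strip() for m in re.findall(r"\$\$(.+?)\$\$", markdown_text, re.DOTALL)]

page = "Intro $a+b$ text.\n$$E = mc^2$$\nMore text.\n$$F = -k_B T \\ln Z$$"
equations = extract_display_math(page)
```

The non-greedy `(.+?)` stops at the next `$$`, so inline `$...$` math between display blocks is left untouched.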
112
+
113
+ ## LaTeX-OCR (Open Source, Local)
114
+
115
+ LaTeX-OCR by Lukas Blecher is a free, locally-running model for converting formula images to LaTeX.
116
+
117
+ ### Installation and Usage
118
+
119
+ ```bash
120
+ pip install "pix2tex[gui]"
121
+ ```
122
+
123
+ ```python
124
+ from pix2tex.cli import LatexOCR
125
+
126
+ # Initialize model (downloads on first use, ~1GB)
127
+ model = LatexOCR()
128
+
129
+ # From file
130
+ from PIL import Image
131
+
132
+ img = Image.open("equation.png")
133
+ latex = model(img)
134
+ print(f"LaTeX: {latex}")
135
+ # Output: \frac{\partial \mathcal{L}}{\partial \theta} = -\frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p(y_i | x_i; \theta)
136
+ ```
137
+
138
+ ### Batch Processing
139
+
140
+ ```python
141
+ from PIL import Image
142
+ from pathlib import Path
143
+
144
+ def batch_ocr(image_dir, model):
145
+ """Process all formula images in a directory."""
146
+ results = []
147
+ for img_path in sorted(Path(image_dir).glob("*.png")):
148
+ img = Image.open(img_path)
149
+ latex = model(img)
150
+ results.append({
151
+ "file": img_path.name,
152
+ "latex": latex
153
+ })
154
+ print(f"{img_path.name}: {latex[:80]}...")
155
+ return results
156
+
157
+ model = LatexOCR()
158
+ results = batch_ocr("./formula_images/", model)
159
+ ```
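For later review or spot-checking, the batch results can be written to disk; a minimal sketch (the output filename is an arbitrary choice):

```python
import json
from pathlib import Path

def save_results(results, out_path):
    """Write batch OCR results to a JSON file, one record per image."""
    Path(out_path).write_text(json.dumps(results, indent=2, ensure_ascii=False))
    return len(results)

results = [
    {"file": "eq1.png", "latex": r"\frac{a}{b}"},
    {"file": "eq2.png", "latex": r"\sum_{i=1}^{N} x_i"},
]
count = save_results(results, "ocr_results.json")
```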
160
+
161
+ ## Pix2Text (Chinese + English + Math)
162
+
163
+ Pix2Text handles mixed Chinese/English text alongside mathematical formulas.
164
+
165
+ ```bash
166
+ pip install pix2text
167
+ ```
168
+
169
+ ```python
170
+ from pix2text import Pix2Text
171
+
172
+ p2t = Pix2Text()
173
+
174
+ # Recognize mixed content (text + math)
175
+ result = p2t.recognize("mixed_content.png")
176
+ print(result)
177
+ # Output includes both text and LaTeX formulas
178
+ ```
179
+
180
+ ## Nougat (Meta) — Full Document OCR
181
+
182
+ Nougat converts entire academic PDF pages to Markdown with LaTeX math, preserving document structure.
183
+
184
+ ```bash
185
+ pip install nougat-ocr
186
+ ```
187
+
188
+ ```bash
189
+ # Convert a PDF to Markdown
190
+ nougat path/to/paper.pdf -o output_dir/ --no-skipping
191
+
192
+ # Output: Markdown files with LaTeX equations preserved
193
+ # e.g., The loss function is $\mathcal{L}(\theta) = ...$
194
+ ```
195
+
196
+ ```python
197
+ # Programmatic usage
198
+ from nougat import NougatModel
199
+ from nougat.utils.dataset import LazyDataset
200
+ from nougat.postprocessing import markdown_compatible
201
+
202
+ model = NougatModel.from_pretrained("facebook/nougat-base")
203
+ model.eval()
204
+
205
+ # Process pages...
206
+ ```
207
+
208
+ ## Screenshot-Based Workflow
209
+
210
+ ### macOS Workflow
211
+
212
+ ```bash
213
+ #!/bin/bash
+ # save as ~/bin/formula-ocr.sh
+ # Capture the formula interactively (like Cmd+Shift+4), OCR it with
+ # LaTeX-OCR, and copy the resulting LaTeX to the clipboard.
219
+ SCREENSHOT="$(mktemp -d)/formula.png"  # mktemp templates cannot carry a .png suffix
220
+ screencapture -i "$SCREENSHOT"
221
+ python -c "
222
+ from pix2tex.cli import LatexOCR
223
+ from PIL import Image
224
+ model = LatexOCR()
225
+ img = Image.open('$SCREENSHOT')
226
+ latex = model(img)
227
+ print(latex)
228
+ # Copy to clipboard
229
+ import subprocess
230
+ subprocess.run(['pbcopy'], input=latex.encode())
231
+ print('Copied to clipboard!')
232
+ "
233
+ ```
234
+
235
+ ### Cross-Platform with Snipping Tool
236
+
237
+ ```python
238
+ from PIL import ImageGrab
240
+
241
+ def capture_and_ocr():
242
+ """Capture screen region and convert to LaTeX."""
243
+ # Simple screenshot capture
244
+ print("Select the formula region...")
245
+ img = ImageGrab.grab(bbox=None) # Full screen; use tool for selection
246
+
247
+ from pix2tex.cli import LatexOCR
248
+ model = LatexOCR()
249
+ latex = model(img)
250
+ print(f"\nLaTeX: {latex}")
251
+ return latex
252
+ ```
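The macOS script above pipes to `pbcopy`; to make the clipboard step portable, the copy command can be selected per platform. A sketch, assuming `pbcopy`, `clip`, or `xclip` is installed on the respective OS:

```python
import subprocess
import sys

def clipboard_cmd(platform=None):
    """Return the clipboard-copy command for the given (or current) platform."""
    platform = platform or sys.platform
    if platform == "darwin":
        return ["pbcopy"]
    if platform.startswith("win"):
        return ["clip"]
    return ["xclip", "-selection", "clipboard"]  # most Linux desktops

def copy_to_clipboard(text):
    """Pipe text into the platform's clipboard utility."""
    subprocess.run(clipboard_cmd(), input=text.encode(), check=True)
```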
253
+
254
+ ## Post-Processing and Validation
255
+
256
+ ### Common OCR Errors and Fixes
257
+
258
+ | OCR Error | Correct LaTeX | Fix Strategy |
259
+ |-----------|--------------|--------------|
260
+ | `\Sigma` vs `\sum` | Context-dependent | Check if it is a summation or sigma variable |
261
+ | Missing subscripts | `x_i` not `xi` | Verify variable names against source |
262
+ | Wrong delimiter size | `\left( \right)` | Add `\left` and `\right` for auto-sizing |
263
+ | Misrecognized symbols | `\theta` vs `\Theta` | Compare against original image |
264
+ | Missing spaces | `\frac{a}{b}c` | Add spacing commands (`\,`, `\;`, `\quad`) |
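Recurring fixes like these can be collected into a small post-processing pass applied after OCR (the rules below are illustrative examples for one domain, not a complete list):

```python
import re

# Map of regex -> replacement for recurring OCR mistakes (illustrative)
CORRECTIONS = [
    (r"\\Sigma_", r"\\sum_"),            # subscripted Sigma is almost always a sum
    (r"(?<![\\_^a-zA-Z])xi\b", r"x_i"),  # bare "xi" that should be a subscript
]

def apply_corrections(latex):
    """Apply each correction rule in order to an OCR'd LaTeX string."""
    for pattern, repl in CORRECTIONS:
        latex = re.sub(pattern, repl, latex)
    return latex

fixed = apply_corrections(r"\Sigma_{i=1}^{N} xi")
```

The negative lookbehind keeps legitimate uses like `\xi` intact while still catching a stray `xi`.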
265
+
266
+ ### Validation Script
267
+
268
+ ```python
269
+ import subprocess
270
+ import tempfile
271
+ import os
272
+
273
+ def validate_latex(latex_string):
274
+ """Check if a LaTeX string compiles without errors."""
275
+ doc = f"""
276
+ \\documentclass{{article}}
277
+ \\usepackage{{amsmath,amssymb}}
278
+ \\begin{{document}}
279
+ ${latex_string}$
280
+ \\end{{document}}
281
+ """
282
+
283
+ with tempfile.NamedTemporaryFile(mode="w", suffix=".tex", delete=False) as f:
284
+ f.write(doc)
285
+ tex_path = f.name
286
+
287
+ try:
288
+ result = subprocess.run(
289
+ ["pdflatex", "-interaction=nonstopmode", tex_path],
290
+ capture_output=True, text=True, timeout=10,
291
+ cwd=tempfile.gettempdir()
292
+ )
293
+ success = result.returncode == 0
294
+ if not success:
295
+ # Extract error message
296
+ for line in result.stdout.split("\n"):
297
+ if line.startswith("!"):
298
+ print(f"LaTeX error: {line}")
299
+ return success
300
+ except subprocess.TimeoutExpired:
301
+ return False
302
+ finally:
303
+ for ext in [".tex", ".pdf", ".aux", ".log"]:
304
+ try:
305
+ os.remove(tex_path.replace(".tex", ext))
306
+ except FileNotFoundError:
307
+ pass
308
+
309
+ # Test
310
+ latex = r"\frac{\partial \mathcal{L}}{\partial \theta}"
311
+ print(f"Valid: {validate_latex(latex)}")
312
+ ```
313
+
314
+ ## Integration with Note-Taking
315
+
316
+ ### Obsidian / Markdown Notes
317
+
318
+ ```markdown
319
+ # Lecture Notes: Statistical Mechanics
320
+
321
+ The partition function is defined as:
322
+
323
+ $$Z = \sum_{i} e^{-\beta E_i}$$
324
+
325
+ where $\beta = 1/k_B T$ is the inverse temperature.
326
+
327
+ The free energy is:
328
+
329
+ $$F = -k_B T \ln Z$$
330
+
331
+ [OCR'd from slide 15 with LaTeX-OCR; verified by compiling]
332
+ ```
333
+
334
+ ### Automated Pipeline
335
+
336
+ ```python
337
+ def process_lecture_slides(pdf_path, output_md):
338
+ """Convert lecture slides with formulas to Markdown notes."""
339
+ from pdf2image import convert_from_path
340
+ from pix2tex.cli import LatexOCR
341
+
342
+ model = LatexOCR()
343
+ images = convert_from_path(pdf_path, dpi=200)
344
+
345
+ with open(output_md, "w") as f:
346
+ f.write(f"# Notes from {pdf_path}\n\n")
347
+ for i, img in enumerate(images):
348
+ f.write(f"## Slide {i+1}\n\n")
349
+ # Full page text extraction (use Nougat or Mathpix for best results)
350
+ # For formula-only images, use LaTeX-OCR:
351
+ try:
352
+ latex = model(img)
353
+ f.write(f"$$\n{latex}\n$$\n\n")
354
+ except Exception as e:
355
+ f.write(f"[OCR failed: {e}]\n\n")
356
+
357
+ print(f"Notes saved to {output_md}")
358
+ ```
359
+
360
+ ## Best Practices
361
+
362
+ 1. **Crop tightly**: OCR accuracy improves significantly when the formula is cropped with minimal surrounding whitespace.
363
+ 2. **Use high resolution**: 200-300 DPI gives the best results. Lower resolution degrades recognition accuracy.
364
+ 3. **Validate output**: Always compile the generated LaTeX to verify correctness before using in a manuscript.
365
+ 4. **Handle multi-line equations**: For aligned equations, process each line separately or use a full-page model like Nougat.
366
+ 5. **Combine tools**: Use Mathpix for critical formulas and LaTeX-OCR for bulk processing to balance cost and quality.
367
+ 6. **Build a corrections dictionary**: Track common OCR errors for your domain and apply automated post-processing fixes.
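As a sketch of practice 1, a formula on a clean white background can be cropped automatically with PIL's `getbbox` (the margin and binarization threshold are arbitrary assumptions):

```python
from PIL import Image

def autocrop(img, margin=10, threshold=200):
    """Crop whitespace around a formula on a white background."""
    gray = img.convert("L")
    # Binarize and invert so ink becomes white (nonzero), background black
    mask = gray.point(lambda p: 255 if p < threshold else 0)
    bbox = mask.getbbox()
    if bbox is None:  # blank image: nothing to crop
        return img
    left, top, right, bottom = bbox
    return img.crop((
        max(left - margin, 0),
        max(top - margin, 0),
        min(right + margin, img.width),
        min(bottom + margin, img.height),
    ))

# Example: a block of "ink" on a white canvas
canvas = Image.new("L", (200, 100), 255)
canvas.paste(0, (80, 40, 120, 60))
cropped = autocrop(canvas)
```

For scanned pages with shadows or noise, a fixed threshold will misfire; adaptive binarization would be the next step.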