@wentorai/research-plugins 1.0.0 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +22 -22
- package/curated/analysis/README.md +71 -56
- package/curated/domains/README.md +176 -67
- package/curated/literature/README.md +71 -47
- package/curated/research/README.md +91 -58
- package/curated/tools/README.md +88 -87
- package/curated/writing/README.md +80 -45
- package/mcp-configs/cloud-docs/confluence-mcp.json +37 -0
- package/mcp-configs/cloud-docs/google-drive-mcp.json +35 -0
- package/mcp-configs/cloud-docs/notion-mcp.json +29 -0
- package/mcp-configs/communication/discord-mcp.json +29 -0
- package/mcp-configs/communication/slack-mcp.json +29 -0
- package/mcp-configs/communication/telegram-mcp.json +28 -0
- package/mcp-configs/database/neo4j-mcp.json +37 -0
- package/mcp-configs/database/postgres-mcp.json +28 -0
- package/mcp-configs/database/sqlite-mcp.json +29 -0
- package/mcp-configs/dev-platform/github-mcp.json +31 -0
- package/mcp-configs/dev-platform/gitlab-mcp.json +34 -0
- package/mcp-configs/email/email-mcp.json +40 -0
- package/mcp-configs/email/gmail-mcp.json +37 -0
- package/mcp-configs/registry.json +178 -149
- package/mcp-configs/repository/dataverse-mcp.json +33 -0
- package/mcp-configs/repository/huggingface-mcp.json +29 -0
- package/openclaw.plugin.json +2 -2
- package/package.json +2 -2
- package/skills/analysis/dataviz/algorithm-visualizer-guide/SKILL.md +259 -0
- package/skills/analysis/dataviz/bokeh-visualization-guide/SKILL.md +270 -0
- package/skills/analysis/dataviz/chart-image-generator/SKILL.md +229 -0
- package/skills/analysis/dataviz/d3-visualization-guide/SKILL.md +281 -0
- package/skills/analysis/dataviz/echarts-visualization-guide/SKILL.md +250 -0
- package/skills/analysis/dataviz/metabase-analytics-guide/SKILL.md +242 -0
- package/skills/analysis/dataviz/plotly-interactive-guide/SKILL.md +266 -0
- package/skills/analysis/dataviz/redash-analytics-guide/SKILL.md +284 -0
- package/skills/analysis/econometrics/econml-causal-guide/SKILL.md +163 -0
- package/skills/analysis/econometrics/mostly-harmless-guide/SKILL.md +139 -0
- package/skills/analysis/econometrics/panel-data-analyst/SKILL.md +259 -0
- package/skills/analysis/econometrics/python-causality-guide/SKILL.md +134 -0
- package/skills/analysis/econometrics/stata-accounting-guide/SKILL.md +269 -0
- package/skills/analysis/econometrics/stata-analyst-guide/SKILL.md +245 -0
- package/skills/analysis/statistics/data-anomaly-detection/SKILL.md +157 -0
- package/skills/analysis/statistics/ml-experiment-tracker/SKILL.md +212 -0
- package/skills/analysis/statistics/pywayne-statistics-guide/SKILL.md +192 -0
- package/skills/analysis/statistics/quantitative-methods-guide/SKILL.md +193 -0
- package/skills/analysis/statistics/senior-data-scientist-guide/SKILL.md +223 -0
- package/skills/analysis/wrangling/csv-data-analyzer/SKILL.md +170 -0
- package/skills/analysis/wrangling/data-cleaning-pipeline/SKILL.md +266 -0
- package/skills/analysis/wrangling/data-cog-guide/SKILL.md +178 -0
- package/skills/analysis/wrangling/stata-data-cleaning/SKILL.md +276 -0
- package/skills/analysis/wrangling/survey-data-processing/SKILL.md +298 -0
- package/skills/domains/ai-ml/ai-model-benchmarking/SKILL.md +209 -0
- package/skills/domains/ai-ml/annotated-dl-papers-guide/SKILL.md +159 -0
- package/skills/domains/ai-ml/dl-transformer-finetune/SKILL.md +239 -0
- package/skills/domains/ai-ml/generative-ai-guide/SKILL.md +146 -0
- package/skills/domains/ai-ml/huggingface-inference-guide/SKILL.md +196 -0
- package/skills/domains/ai-ml/keras-deep-learning/SKILL.md +210 -0
- package/skills/domains/ai-ml/llm-from-scratch-guide/SKILL.md +124 -0
- package/skills/domains/ai-ml/ml-pipeline-guide/SKILL.md +295 -0
- package/skills/domains/ai-ml/nlp-toolkit-guide/SKILL.md +247 -0
- package/skills/domains/ai-ml/pytorch-guide/SKILL.md +281 -0
- package/skills/domains/ai-ml/pytorch-lightning-guide/SKILL.md +244 -0
- package/skills/domains/ai-ml/tensorflow-guide/SKILL.md +241 -0
- package/skills/domains/biomedical/bioagents-guide/SKILL.md +308 -0
- package/skills/domains/biomedical/medgeclaw-guide/SKILL.md +345 -0
- package/skills/domains/biomedical/medical-imaging-guide/SKILL.md +305 -0
- package/skills/domains/business/architecture-design-guide/SKILL.md +279 -0
- package/skills/domains/business/innovation-management-guide/SKILL.md +257 -0
- package/skills/domains/business/operations-research-guide/SKILL.md +258 -0
- package/skills/domains/chemistry/molecular-dynamics-guide/SKILL.md +237 -0
- package/skills/domains/chemistry/pubchem-api-guide/SKILL.md +180 -0
- package/skills/domains/chemistry/spectroscopy-analysis-guide/SKILL.md +290 -0
- package/skills/domains/cs/distributed-systems-guide/SKILL.md +268 -0
- package/skills/domains/cs/formal-verification-guide/SKILL.md +298 -0
- package/skills/domains/ecology/species-distribution-guide/SKILL.md +343 -0
- package/skills/domains/economics/imf-data-api-guide/SKILL.md +174 -0
- package/skills/domains/economics/post-labor-economics/SKILL.md +254 -0
- package/skills/domains/economics/pricing-psychology-guide/SKILL.md +273 -0
- package/skills/domains/economics/world-bank-data-guide/SKILL.md +179 -0
- package/skills/domains/education/assessment-design-guide/SKILL.md +213 -0
- package/skills/domains/education/educational-research-methods/SKILL.md +179 -0
- package/skills/domains/education/mooc-analytics-guide/SKILL.md +206 -0
- package/skills/domains/finance/portfolio-optimization-guide/SKILL.md +279 -0
- package/skills/domains/finance/risk-modeling-guide/SKILL.md +260 -0
- package/skills/domains/finance/stata-accounting-research/SKILL.md +372 -0
- package/skills/domains/geoscience/climate-modeling-guide/SKILL.md +215 -0
- package/skills/domains/geoscience/satellite-remote-sensing/SKILL.md +193 -0
- package/skills/domains/geoscience/seismology-data-guide/SKILL.md +208 -0
- package/skills/domains/humanities/ethical-philosophy-guide/SKILL.md +244 -0
- package/skills/domains/humanities/history-research-guide/SKILL.md +260 -0
- package/skills/domains/humanities/political-history-guide/SKILL.md +241 -0
- package/skills/domains/law/legal-nlp-guide/SKILL.md +236 -0
- package/skills/domains/law/patent-analysis-guide/SKILL.md +257 -0
- package/skills/domains/law/regulatory-compliance-guide/SKILL.md +267 -0
- package/skills/domains/math/symbolic-computation-guide/SKILL.md +263 -0
- package/skills/domains/math/topology-data-analysis/SKILL.md +305 -0
- package/skills/domains/pharma/clinical-trial-design-guide/SKILL.md +271 -0
- package/skills/domains/pharma/drug-target-interaction/SKILL.md +242 -0
- package/skills/domains/pharma/pharmacovigilance-guide/SKILL.md +216 -0
- package/skills/domains/physics/astrophysics-data-guide/SKILL.md +305 -0
- package/skills/domains/physics/particle-physics-guide/SKILL.md +287 -0
- package/skills/domains/social-science/network-analysis-guide/SKILL.md +310 -0
- package/skills/domains/social-science/psychology-research-guide/SKILL.md +270 -0
- package/skills/domains/social-science/sociology-research-guide/SKILL.md +238 -0
- package/skills/literature/discovery/paper-recommendation-guide/SKILL.md +120 -0
- package/skills/literature/discovery/semantic-paper-radar/SKILL.md +144 -0
- package/skills/literature/discovery/zotero-arxiv-daily-guide/SKILL.md +94 -0
- package/skills/literature/fulltext/core-api-guide/SKILL.md +144 -0
- package/skills/literature/fulltext/institutional-repository-guide/SKILL.md +212 -0
- package/skills/literature/fulltext/open-access-mining-guide/SKILL.md +341 -0
- package/skills/literature/metadata/academic-paper-summarizer/SKILL.md +101 -0
- package/skills/literature/metadata/wikidata-api-guide/SKILL.md +156 -0
- package/skills/literature/search/arxiv-batch-reporting/SKILL.md +133 -0
- package/skills/literature/search/arxiv-paper-processor/SKILL.md +141 -0
- package/skills/literature/search/baidu-scholar-guide/SKILL.md +110 -0
- package/skills/literature/search/chatpaper-guide/SKILL.md +122 -0
- package/skills/literature/search/deep-literature-search/SKILL.md +149 -0
- package/skills/literature/search/deepgit-search-guide/SKILL.md +147 -0
- package/skills/literature/search/pasa-paper-search-guide/SKILL.md +138 -0
- package/skills/research/automation/ai-scientist-v2-guide/SKILL.md +284 -0
- package/skills/research/automation/aim-experiment-guide/SKILL.md +234 -0
- package/skills/research/automation/datagen-research-guide/SKILL.md +131 -0
- package/skills/research/automation/kedro-pipeline-guide/SKILL.md +216 -0
- package/skills/research/automation/mle-agent-guide/SKILL.md +139 -0
- package/skills/research/automation/paper-to-agent-guide/SKILL.md +116 -0
- package/skills/research/automation/rd-agent-guide/SKILL.md +246 -0
- package/skills/research/automation/research-paper-orchestrator/SKILL.md +254 -0
- package/skills/research/deep-research/academic-deep-research/SKILL.md +190 -0
- package/skills/research/deep-research/auto-deep-research-guide/SKILL.md +141 -0
- package/skills/research/deep-research/deep-research-pro/SKILL.md +213 -0
- package/skills/research/deep-research/deep-research-work/SKILL.md +204 -0
- package/skills/research/deep-research/deep-searcher-guide/SKILL.md +253 -0
- package/skills/research/deep-research/gpt-researcher-guide/SKILL.md +191 -0
- package/skills/research/deep-research/khoj-research-guide/SKILL.md +200 -0
- package/skills/research/deep-research/local-deep-research-guide/SKILL.md +253 -0
- package/skills/research/deep-research/tongyi-deep-research-guide/SKILL.md +217 -0
- package/skills/research/funding/eu-horizon-guide/SKILL.md +244 -0
- package/skills/research/funding/grant-budget-guide/SKILL.md +284 -0
- package/skills/research/funding/nih-reporter-api-guide/SKILL.md +166 -0
- package/skills/research/funding/nsf-award-api-guide/SKILL.md +133 -0
- package/skills/research/methodology/academic-mentor-guide/SKILL.md +169 -0
- package/skills/research/methodology/claude-scientific-guide/SKILL.md +122 -0
- package/skills/research/methodology/deep-innovator-guide/SKILL.md +242 -0
- package/skills/research/methodology/osf-api-guide/SKILL.md +165 -0
- package/skills/research/methodology/research-paper-kb/SKILL.md +263 -0
- package/skills/research/methodology/research-town-guide/SKILL.md +263 -0
- package/skills/research/paper-review/automated-review-guide/SKILL.md +281 -0
- package/skills/research/paper-review/paper-compare-guide/SKILL.md +238 -0
- package/skills/research/paper-review/paper-digest-guide/SKILL.md +240 -0
- package/skills/research/paper-review/paper-research-assistant/SKILL.md +231 -0
- package/skills/research/paper-review/research-quality-filter/SKILL.md +261 -0
- package/skills/research/paper-review/review-response-guide/SKILL.md +275 -0
- package/skills/tools/code-exec/google-colab-guide/SKILL.md +276 -0
- package/skills/tools/code-exec/kaggle-api-guide/SKILL.md +216 -0
- package/skills/tools/code-exec/overleaf-cli-guide/SKILL.md +279 -0
- package/skills/tools/diagram/code-flow-visualizer/SKILL.md +197 -0
- package/skills/tools/diagram/excalidraw-diagram-guide/SKILL.md +170 -0
- package/skills/tools/diagram/json-data-visualizer/SKILL.md +270 -0
- package/skills/tools/diagram/mermaid-architect-guide/SKILL.md +219 -0
- package/skills/tools/diagram/tldraw-whiteboard-guide/SKILL.md +397 -0
- package/skills/tools/document/docsgpt-guide/SKILL.md +130 -0
- package/skills/tools/document/large-document-reader/SKILL.md +202 -0
- package/skills/tools/document/paper-parse-guide/SKILL.md +243 -0
- package/skills/tools/knowledge-graph/citation-network-builder/SKILL.md +244 -0
- package/skills/tools/knowledge-graph/concept-map-generator/SKILL.md +284 -0
- package/skills/tools/knowledge-graph/graphiti-guide/SKILL.md +219 -0
- package/skills/tools/ocr-translate/pdf-math-translate-guide/SKILL.md +141 -0
- package/skills/tools/ocr-translate/zotero-pdf-translate-guide/SKILL.md +95 -0
- package/skills/tools/ocr-translate/zotero-pdf2zh-guide/SKILL.md +143 -0
- package/skills/tools/scraping/dataset-finder-guide/SKILL.md +253 -0
- package/skills/tools/scraping/easy-spider-guide/SKILL.md +250 -0
- package/skills/tools/scraping/google-scholar-scraper/SKILL.md +255 -0
- package/skills/tools/scraping/repository-harvesting-guide/SKILL.md +310 -0
- package/skills/writing/citation/academic-citation-manager/SKILL.md +314 -0
- package/skills/writing/citation/jabref-reference-guide/SKILL.md +127 -0
- package/skills/writing/citation/jasminum-zotero-guide/SKILL.md +103 -0
- package/skills/writing/citation/obsidian-citation-guide/SKILL.md +164 -0
- package/skills/writing/citation/obsidian-zotero-guide/SKILL.md +137 -0
- package/skills/writing/citation/papersgpt-zotero-guide/SKILL.md +132 -0
- package/skills/writing/citation/papis-cli-guide/SKILL.md +213 -0
- package/skills/writing/citation/zotero-better-bibtex-guide/SKILL.md +107 -0
- package/skills/writing/citation/zotero-better-notes-guide/SKILL.md +121 -0
- package/skills/writing/citation/zotero-gpt-guide/SKILL.md +111 -0
- package/skills/writing/citation/zotero-mcp-guide/SKILL.md +164 -0
- package/skills/writing/citation/zotero-mdnotes-guide/SKILL.md +162 -0
- package/skills/writing/citation/zotero-reference-guide/SKILL.md +139 -0
- package/skills/writing/citation/zotero-scholar-guide/SKILL.md +294 -0
- package/skills/writing/citation/zotfile-attachment-guide/SKILL.md +140 -0
- package/skills/writing/composition/ml-paper-writing/SKILL.md +163 -0
- package/skills/writing/composition/paper-debugger-guide/SKILL.md +143 -0
- package/skills/writing/composition/scientific-writing-resources/SKILL.md +151 -0
- package/skills/writing/composition/scientific-writing-wrapper/SKILL.md +153 -0
- package/skills/writing/latex/latex-drawing-collection/SKILL.md +154 -0
- package/skills/writing/latex/latex-templates-collection/SKILL.md +159 -0
- package/skills/writing/latex/md-to-pdf-academic/SKILL.md +230 -0
- package/skills/writing/latex/tex-render-guide/SKILL.md +243 -0
- package/skills/writing/polish/academic-tone-guide/SKILL.md +209 -0
- package/skills/writing/polish/conciseness-editing-guide/SKILL.md +225 -0
- package/skills/writing/polish/paper-polish-guide/SKILL.md +160 -0
- package/skills/writing/templates/graphical-abstract-guide/SKILL.md +183 -0
- package/skills/writing/templates/novathesis-guide/SKILL.md +152 -0
- package/skills/writing/templates/scientific-article-pdf/SKILL.md +261 -0
- package/skills/writing/templates/sjtuthesis-guide/SKILL.md +197 -0
- package/skills/writing/templates/thuthesis-guide/SKILL.md +181 -0
- package/skills/literature/fulltext/repository-harvesting-guide/SKILL.md +0 -207
+++ package/skills/literature/discovery/semantic-paper-radar/SKILL.md
@@ -0,0 +1,144 @@
---
name: semantic-paper-radar
description: "Semantic literature discovery and synthesis using embeddings"
metadata:
  openclaw:
    emoji: "📡"
    category: "literature"
    subcategory: "discovery"
    keywords: ["semantic search", "embeddings", "literature synthesis", "paper discovery", "vector search", "knowledge mapping"]
    source: "https://github.com/mukulpatnaik/researchgpt"
---

# Semantic Paper Radar

## Overview

Traditional literature search relies on keyword matching—you find papers that contain the exact terms you search for. Semantic paper discovery goes further by understanding the meaning of research content and finding papers that are conceptually related, even when they use different terminology. This is especially powerful for interdisciplinary research, where the same idea may be expressed in completely different vocabularies across fields.

The Semantic Paper Radar skill provides methods for using embedding-based semantic search, vector databases, and AI-powered synthesis to build a comprehensive, continuously updated view of the literature relevant to your research. It enables you to discover papers you would never find through keyword search alone and to synthesize findings across large bodies of work.

This skill covers setting up a personal semantic search index over your paper collection, querying public semantic search APIs, and using LLM-powered analysis to extract themes and connections from clusters of related papers.

## Semantic Search Fundamentals

### How Embedding-Based Search Works

Semantic search represents both your query and each paper as dense numerical vectors (embeddings) in a high-dimensional space. Papers whose embeddings are close to your query's embedding are semantically similar, regardless of the specific words used.

Key components:
- **Embedding model**: Converts text to vectors. Models like SPECTER2, SciBERT, or general-purpose models like `text-embedding-3-small` work well for academic text.
- **Vector database**: Stores and indexes embeddings for fast similarity search. Options include ChromaDB (local), Qdrant, Pinecone, or Weaviate.
- **Similarity metric**: Cosine similarity is standard for comparing text embeddings.

### Using Semantic Scholar's Embedding Search

Semantic Scholar provides pre-computed SPECTER embeddings for millions of papers. You can use their search API for semantic queries:

```bash
# Semantic search via the Semantic Scholar API
curl "https://api.semanticscholar.org/graph/v1/paper/search?query=attention+mechanisms+for+graph+neural+networks&fields=title,abstract,year,citationCount&limit=20"
```

The search endpoint uses semantic matching, not just keyword matching. A query like "methods for handling missing values in longitudinal studies" will find papers about imputation techniques, dropout analysis, and panel data methods even if they do not use the phrase "missing values."
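The same request can be issued from any HTTP client. As a small illustration, here is a stdlib-only sketch that assembles the request URL; the endpoint and parameter names come from the curl example above, while `s2_search_url` itself is a hypothetical helper, not part of any SDK:

```python
from urllib.parse import urlencode

def s2_search_url(query, fields=("title", "abstract", "year", "citationCount"), limit=20):
    """Build a Semantic Scholar Graph API paper-search URL
    (hypothetical helper mirroring the curl example above)."""
    base = "https://api.semanticscholar.org/graph/v1/paper/search"
    params = {"query": query, "fields": ",".join(fields), "limit": limit}
    return base + "?" + urlencode(params)

url = s2_search_url("attention mechanisms for graph neural networks")
print(url)
```

The resulting URL can then be fetched with `requests`, `curl`, or any other client, passing no authentication for the public rate-limited tier.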

### Building a Personal Semantic Index

For deeper control, build a local semantic search index over your own paper collection:

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize
model = SentenceTransformer("allenai/specter2")
client = chromadb.PersistentClient(path="./paper_index")
collection = client.get_or_create_collection(
    name="my_papers",
    metadata={"hnsw:space": "cosine"}
)

# Index a paper
abstract = "We propose a novel attention mechanism for graph neural networks..."
embedding = model.encode(abstract).tolist()
collection.add(
    documents=[abstract],
    embeddings=[embedding],
    metadatas=[{"title": "Graph Attention v2", "year": 2025, "arxiv_id": "2501.xxxxx"}],
    ids=["paper_001"]
)

# Query
results = collection.query(
    query_embeddings=[model.encode("message passing in GNNs").tolist()],
    n_results=10
)
```

This local index lets you search across all papers you have collected using natural language queries. As you add more papers, the index becomes a personalized discovery tool tuned to your specific research interests.

## Discovery Workflows

### Concept Expansion Radar

Use semantic search to expand your awareness beyond your current reading:

1. **Seed**: Take the abstract of your current paper (or a paragraph describing your research question).
2. **Search**: Run it as a semantic query against a large corpus (Semantic Scholar, OpenAlex, or your local index).
3. **Filter**: Remove papers you have already read. Sort by a combination of semantic similarity and recency.
4. **Cluster**: Group the top 50 results into thematic clusters using k-means or HDBSCAN on their embeddings.
5. **Explore clusters**: Each cluster represents a related subtopic. Read the most-cited paper in each cluster to understand the connection to your work.
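Steps 1 through 3 can be sketched in plain Python with NumPy. The paper IDs and 2-D vectors below are toy placeholders (real SPECTER embeddings have hundreds of dimensions), and `radar_rank` is an illustrative helper rather than a library function:

```python
import numpy as np

def radar_rank(seed_vec, candidates, already_read, top_k=10):
    """Rank unread candidate papers by cosine similarity to a seed
    embedding (steps 1-3 of the concept-expansion workflow)."""
    seed = seed_vec / np.linalg.norm(seed_vec)
    scored = [
        (pid, float(np.dot(seed, v / np.linalg.norm(v))))
        for pid, v in candidates.items()
        if pid not in already_read
    ]
    scored.sort(key=lambda t: -t[1])
    return scored[:top_k]

candidates = {
    "p1": np.array([1.0, 0.0]),  # already read
    "p2": np.array([0.9, 0.1]),  # close to the seed topic
    "p3": np.array([0.0, 1.0]),  # orthogonal topic
}
ranked = radar_rank(np.array([1.0, 0.0]), candidates, already_read={"p1"})
```

Step 4 can then cluster the surviving candidates with k-means or HDBSCAN over the same vectors.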

### Cross-Disciplinary Bridge Detection

Semantic search excels at finding papers from other fields that address similar problems:

1. Describe your research problem in plain, non-technical language.
2. Run this as a semantic query without restricting to your field's journals or categories.
3. Review results from unexpected fields—these are potential interdisciplinary connections.
4. For each bridge paper, check its reference list for more domain-specific work in that field.

### Novelty Radar

Set up periodic semantic searches to detect new papers in your area:

1. Define 3-5 "concept vectors" by encoding descriptions of your core research interests.
2. Weekly, search against newly published papers (last 7 days) from arXiv or Semantic Scholar.
3. Rank new papers by maximum similarity to any of your concept vectors.
4. Papers above your similarity threshold enter your reading queue automatically.
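The ranking and threshold steps (3 and 4) can be sketched the same way, again with toy 2-D vectors and a hypothetical `novelty_filter` helper:

```python
import numpy as np

def novelty_filter(concept_vecs, new_papers, threshold=0.8):
    """Score each new paper by its maximum cosine similarity to any
    concept vector; papers at or above the threshold join the queue."""
    unit = lambda v: v / np.linalg.norm(v)
    concepts = np.stack([unit(c) for c in concept_vecs])
    queue = [
        (pid, float((concepts @ unit(v)).max()))
        for pid, v in new_papers.items()
        if (concepts @ unit(v)).max() >= threshold
    ]
    return sorted(queue, key=lambda t: -t[1])

concepts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
new_papers = {"a": np.array([0.95, 0.05]), "b": np.array([0.7, 0.7])}
queue = novelty_filter(concepts, new_papers, threshold=0.9)
```

Running this weekly over freshly published embeddings turns the radar into an automatic reading-queue feeder.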

## Semantic Synthesis

Once you have discovered a cluster of related papers, use AI-assisted synthesis to extract insights across the collection:

### Theme Extraction

Feed the abstracts of a cluster of papers to an LLM and ask for:
- Common themes and findings across the papers
- Points of disagreement or contradiction
- Methodological trends (what approaches are gaining vs. losing popularity)
- Open questions that none of the papers fully address

### Evidence Mapping

Create a structured evidence map from your semantic cluster:

| Theme | Supporting Papers | Contradicting Papers | Strength of Evidence |
|-------|-------------------|----------------------|---------------------|
| Theme A | [1], [3], [7] | [5] | Strong |
| Theme B | [2], [4] | None | Moderate |
| Theme C | [6] | [1], [8] | Contested |

This provides a bird's-eye view of where consensus exists and where debates remain open.

### Gap Identification

Compare your research question against the semantic landscape of existing work. Regions of embedding space where your query falls but few papers exist represent potential research gaps—areas where your contribution would be most novel.
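One way to make this heuristic concrete is a local-density count around the query embedding; `local_density` below is an illustrative sketch with toy vectors, not a standard method:

```python
import numpy as np

def local_density(query_vec, paper_vecs, radius=0.3):
    """Count papers within `radius` cosine distance of the query;
    a low count suggests an underexplored region of the literature."""
    q = query_vec / np.linalg.norm(query_vec)
    return sum(
        (1.0 - float(np.dot(q, v / np.linalg.norm(v)))) <= radius
        for v in paper_vecs
    )

papers = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
crowded = local_density(np.array([1.0, 0.0]), papers)   # query near two papers
sparse = local_density(np.array([-1.0, 0.2]), papers)   # query far from all papers
```

Comparing counts across candidate research questions gives a rough ordering from crowded to open territory.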

## References

- Semantic Scholar API: https://api.semanticscholar.org
- SPECTER2 model: https://huggingface.co/allenai/specter2
- ChromaDB: https://www.trychroma.com
- ResearchGPT: https://github.com/mukulpatnaik/researchgpt
- OpenAlex: https://openalex.org
+++ package/skills/literature/discovery/zotero-arxiv-daily-guide/SKILL.md
@@ -0,0 +1,94 @@
---
name: zotero-arxiv-daily-guide
description: "Guide to Zotero arXiv Daily for personalized daily paper recommendations"
metadata:
  openclaw:
    emoji: "📰"
    category: literature
    subcategory: discovery
    keywords: ["zotero", "arxiv", "daily-papers", "recommendations", "preprint", "discovery"]
    source: "https://github.com/TechPenguineer/zotero-arxiv-daily"
---

# Zotero arXiv Daily Guide

## Overview

Zotero arXiv Daily is a popular Zotero plugin with over 5,000 GitHub stars that delivers personalized daily paper recommendations from arXiv directly into your Zotero library. By analyzing the papers you already have in your collections, the plugin identifies your research interests and surfaces new preprints that are most relevant to your work.

The challenge of staying current with preprint literature is well known to researchers. ArXiv publishes thousands of new papers daily across dozens of categories, and manually scanning listings or relying solely on keyword alerts often results in information overload or missed relevant work. Zotero arXiv Daily addresses this by using your existing library as a profile of your interests, producing recommendations that improve as your library grows.

The plugin integrates naturally into the Zotero workflow. Recommended papers appear in a dedicated collection where you can review titles and abstracts, save promising papers to your working collections, and dismiss irrelevant suggestions. Over time the recommendation engine learns from your accept and dismiss decisions, refining its model of your interests.

## Installation and Setup

Install Zotero arXiv Daily through the standard Zotero plugin process:

1. Download the latest `.xpi` release from https://github.com/TechPenguineer/zotero-arxiv-daily/releases
2. In Zotero, go to Tools > Add-ons > gear icon > Install Add-on From File
3. Select the `.xpi` file and restart Zotero

Configure the plugin after installation:

- Open Zotero Preferences > arXiv Daily
- Select the arXiv categories relevant to your research (e.g., cs.AI, cs.CL, stat.ML, physics.comp-ph)
- Choose which Zotero collections to use as the basis for recommendations (your core research collections work best)
- Set the number of daily recommendations (10-30 is typical)
- Configure the schedule for fetching new recommendations (daily at a specific time or on-demand)
- Set up a dedicated Zotero collection where recommendations will appear

For enhanced recommendation quality, ensure your library collections are well-organized. The algorithm performs better when it can distinguish between your core research interests and peripheral references. Consider creating a dedicated collection of your most representative papers to serve as the recommendation seed.

## Core Features

**Personalized Recommendations**: The plugin analyzes titles, abstracts, authors, and citation patterns in your Zotero library to build a profile of your research interests. New arXiv submissions are scored against this profile and the top matches are presented as daily recommendations.
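The plugin's internal scoring model is not documented here, so as a rough stand-in for library-profile matching, here is a keyword-overlap scorer over titles (all names hypothetical; a real profile would use embeddings of titles and abstracts):

```python
def title_tokens(title):
    """Lowercased word set, ignoring short stopword-like tokens."""
    return {w for w in title.lower().split() if len(w) > 3}

def profile_score(candidate_title, library_titles):
    """Best Jaccard overlap between a candidate title and any library
    title: a crude stand-in for an embedding-based relevance score."""
    cand = title_tokens(candidate_title)
    best = 0.0
    for t in library_titles:
        lib = title_tokens(t)
        if cand | lib:
            best = max(best, len(cand & lib) / len(cand | lib))
    return best

library = ["Graph neural networks for molecules", "Attention mechanisms survey"]
score = profile_score("Molecules and graph neural networks", library)
```

The principle is the same as the plugin's: candidates that resemble what you already collect score highest and surface first.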

**Category Filtering**: Select specific arXiv categories to narrow the recommendation scope. This prevents the system from suggesting papers in completely unrelated fields while still allowing cross-disciplinary discoveries within your selected categories.

**Daily Digest View**: Recommendations appear in a dedicated Zotero collection organized by date. Each entry includes the paper title, authors, abstract, arXiv identifier, and a relevance score indicating how closely it matches your library profile.

**Quick Actions**: For each recommended paper, you can:
- Save to a working collection with one click
- Open the full paper on arXiv
- Download the PDF directly to Zotero
- Dismiss the recommendation (improves future suggestions)
- Add tags for later organization

**Trend Detection**: The plugin can highlight papers that are receiving unusual attention in your field based on early citation velocity and social media mentions. This helps you identify potentially important work before it becomes widely known.

**Author Tracking**: When the plugin detects papers by authors who are frequently cited in your library, it flags these with higher priority. This ensures you never miss new work from the researchers most relevant to your field.

## Research Workflow Integration

**Morning Review Routine**: Start your research day by spending 10-15 minutes reviewing the daily arXiv recommendations. Scan titles and abstracts, save promising papers to a "To Read" collection, and dismiss irrelevant ones. This disciplined approach keeps you current without consuming excessive time.

**Literature Review Enhancement**: During active literature review phases, increase the number of daily recommendations and expand the arXiv categories. The plugin helps identify relevant preprints that may not yet appear in traditional databases, giving your review a more comprehensive and timely scope.

**Collaborative Discovery**: Share your recommended papers collection with lab members through a Zotero group library. This creates a collective discovery mechanism where the entire group benefits from each member's library-driven recommendations.

**Research Trend Monitoring**: Track which topics appear frequently in your recommendations over weeks and months. Shifts in the recommendation patterns can signal emerging trends in your field, helping you anticipate where the research community is heading.

**Optimizing Recommendation Quality**:
- Maintain a well-curated "seed" collection of your most important papers
- Regularly dismiss irrelevant recommendations to refine the algorithm
- Update your arXiv category selections as your interests evolve
- Add newly published papers from your own group to keep the profile current
- Review recommendations from adjacent categories periodically for cross-disciplinary insights

## Configuring Notification Preferences

Control how and when you receive recommendation alerts:

- **Desktop Notifications**: Enable system notifications when new recommendations arrive
- **Batch Mode**: Accumulate recommendations and review them at a scheduled time
- **Threshold Filtering**: Only show recommendations above a configurable relevance score
- **Keyword Highlighting**: Specify key terms to highlight in recommended paper titles and abstracts

For researchers who find the default recommendation volume too high, set a higher relevance threshold to receive only the most closely matched papers. Conversely, those in rapidly moving fields may want to lower the threshold and increase the daily count to ensure broad coverage.

## References

- GitHub Repository: https://github.com/TechPenguineer/zotero-arxiv-daily
- arXiv API Documentation: https://info.arxiv.org/help/api
- Zotero Plugin Directory: https://www.zotero.org/support/plugins
- arXiv Category Taxonomy: https://arxiv.org/category_taxonomy
@@ -0,0 +1,144 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: core-api-guide
|
|
3
|
+
description: "Search and retrieve open access research papers via CORE aggregator"
|
|
4
|
+
metadata:
|
|
5
|
+
openclaw:
|
|
6
|
+
emoji: "🔬"
|
|
7
|
+
category: "literature"
|
|
8
|
+
subcategory: "fulltext"
|
|
9
|
+
keywords: ["open-access", "fulltext", "research-papers", "aggregator", "CORE"]
|
|
10
|
+
source: "https://core.ac.uk/documentation/api"
|
|
11
|
+
---

# CORE API Guide

## Overview

CORE (COnnecting REpositories) is the world's largest aggregator of open access research papers, providing access to over 200 million articles harvested from thousands of data providers worldwide. The CORE API enables programmatic search, retrieval, and analysis of scholarly full-text content across repositories, journals, and preprint servers.

The API is particularly valuable for researchers conducting systematic reviews, bibliometric analyses, and literature-mining tasks. Unlike many scholarly APIs that provide only metadata, CORE specializes in delivering full-text content, making it essential for text mining and natural language processing workflows in academic research.

CORE's v3 API provides a RESTful interface with JSON responses, supporting complex search queries with Boolean operators, field-specific filtering, and batch operations. It is free for non-commercial academic use, though an API key is required to access the service.

## Authentication

CORE requires a free API key for all requests. Register at https://core.ac.uk/services/api to obtain one.

Always store your API key in an environment variable and reference it in requests:

```bash
export CORE_API_KEY="your-api-key-here"
```

Pass the key via the `Authorization` header:

```bash
curl -H "Authorization: Bearer $CORE_API_KEY" \
  "https://api.core.ac.uk/v3/search/works?q=machine+learning"
```

## Core Endpoints

### Search Works

Search across the entire CORE corpus with full-text and metadata queries.

```
GET https://api.core.ac.uk/v3/search/works?q={query}&limit={n}&offset={n}
```

**Parameters:**
- `q` (required): Search query string; supports Boolean operators (AND, OR, NOT)
- `limit`: Number of results (default 10, max 100)
- `offset`: Pagination offset
- `entity_type`: Filter by type (e.g., `journal-article`, `preprint`)

**Example: Search for climate change papers with full text:**

```bash
curl -s -H "Authorization: Bearer $CORE_API_KEY" \
  "https://api.core.ac.uk/v3/search/works?q=climate+change+adaptation&limit=5" \
  | python3 -m json.tool
```

**Python example:**

```python
import os

import requests

headers = {"Authorization": f"Bearer {os.environ['CORE_API_KEY']}"}
params = {
    "q": "deep learning AND medical imaging",
    "limit": 20,
    "offset": 0
}
resp = requests.get("https://api.core.ac.uk/v3/search/works", headers=headers, params=params)
resp.raise_for_status()
data = resp.json()

for result in data.get("results", []):
    print(f"Title: {result.get('title')}")
    print(f"DOI: {result.get('doi')}")
    print(f"Year: {result.get('yearPublished')}")
    # fullText may be present but null, so guard with `or ''`
    print(f"Full text length: {len(result.get('fullText') or '')}")
    print("---")
```

### Get Work by ID

Retrieve a specific paper by its CORE ID or DOI.

```
GET https://api.core.ac.uk/v3/works/{core_id}
```

```bash
curl -s -H "Authorization: Bearer $CORE_API_KEY" \
  "https://api.core.ac.uk/v3/works/doi:10.1234/example.doi" \
  | python3 -m json.tool
```

### Batch Retrieval

Retrieve multiple works in a single request using POST with a list of IDs.

```bash
curl -s -X POST -H "Authorization: Bearer $CORE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '[12345, 67890, 11111]' \
  "https://api.core.ac.uk/v3/works"
```

### Search Data Providers

List or search CORE's data providers (repositories, journals).

```
GET https://api.core.ac.uk/v3/data-providers?q={query}
```

## Common Research Patterns

**Systematic Literature Review:** Use Boolean queries to replicate a search strategy across the full-text corpus. Combine with date filters to identify papers within a specific time window, then export results for screening in tools like Rayyan or Covidence.

**Full-Text Mining:** Retrieve full-text content programmatically for NLP pipelines. Extract named entities, key phrases, or citation contexts at scale across thousands of papers.

**Repository Coverage Analysis:** Query data providers to understand which institutional repositories contribute to a specific field, useful for bibliometric and open-access policy research.

**Trend Detection:** Run time-series queries for specific terms and track publication volume over years to identify emerging research fronts.
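The trend-detection pattern can be sketched as one search request per year. This is a minimal sketch: the `yearPublished:` field-query syntax and the `totalHits` response field are assumptions based on the v3 search API, so verify both against the API documentation before relying on the numbers.

```python
import os

CORE_SEARCH = "https://api.core.ac.uk/v3/search/works"


def year_query(term: str, year: int) -> dict:
    """Build search params restricted to one publication year.

    limit=1 keeps responses small: only the hit count is needed.
    """
    return {"q": f"({term}) AND yearPublished:{year}", "limit": 1}


def hits_per_year(term: str, years) -> dict:
    """Count matching works per year (assumes a `totalHits` field)."""
    import requests  # third-party: pip install requests

    headers = {"Authorization": f"Bearer {os.environ['CORE_API_KEY']}"}
    counts = {}
    for y in years:
        resp = requests.get(CORE_SEARCH, headers=headers, params=year_query(term, y))
        resp.raise_for_status()
        counts[y] = resp.json().get("totalHits", 0)
    return counts
```

Plotting `hits_per_year("large language models", range(2018, 2025))` would then show the publication-volume curve for the term.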

## Rate Limits and Best Practices

- **Free tier:** 150 requests per 15-minute window (10 req/min effective)
- **Batch endpoints:** Use batch retrieval for multiple IDs to minimize request count
- **Pagination:** Always use `offset` and `limit` for large result sets; do not fetch all results in one call
- **Caching:** Cache responses locally for repeat queries, especially for static metadata
- **Crawling etiquette:** Respect `robots.txt` and add delays between requests when downloading full texts
- **Error handling:** The API returns standard HTTP status codes; implement exponential backoff for 429 (rate limit) responses
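The backoff advice above can be sketched as a small retry wrapper. The names here are illustrative, not part of the CORE API: `fetch` is any callable returning an object with a `status_code` attribute, such as a `requests` call wrapped in a lambda.

```python
import time


def with_backoff(fetch, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `fetch()` with exponential backoff on 429 responses.

    Delays double each attempt: base_delay, 2x, 4x, ...
    """
    for attempt in range(max_retries):
        resp = fetch()
        if resp.status_code != 429:
            return resp
        time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("rate limit persisted after retries")
```

Usage might look like `with_backoff(lambda: requests.get(url, headers=headers))`, keeping the retry policy in one place across all endpoint calls.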

## References

- CORE API v3 Documentation: https://core.ac.uk/documentation/api
- CORE Dashboard and Key Registration: https://core.ac.uk/services/api
- CORE Data Dumps (for bulk access): https://core.ac.uk/documentation/dataset
- CORE GitHub: https://github.com/oacore

@@ -0,0 +1,212 @@

---
name: institutional-repository-guide
description: "Access papers from institutional and subject repositories at scale"
metadata:
  openclaw:
    emoji: "🏛️"
    category: "literature"
    subcategory: "fulltext"
    keywords: ["institutional repository", "DSpace", "EPrints", "open access archive", "subject repository", "OpenDOAR"]
    source: "wentor-research-plugins"
---

# Institutional Repository Guide

Institutional repositories (IRs) are university-run digital archives that store and provide open access to their researchers' scholarly output: dissertations, journal articles, conference papers, datasets, and technical reports. Subject repositories like arXiv, bioRxiv, SSRN, and RePEc serve a similar function for specific disciplines. Together, they form a distributed network of open scholarship that complements commercial databases.

This guide covers how to discover, access, and systematically harvest content from institutional and subject repositories for literature reviews, meta-analyses, and research data collection.

## Repository Landscape

### Types of Repositories

```
Institutional Repositories (IR):
- Run by universities to archive their researchers' output
- Examples: DSpace, EPrints, Fedora-based systems
- Discovery: OpenDOAR directory (v2.sherpa.ac.uk/opendoar)

Subject Repositories:
- Discipline-specific archives
- arXiv (physics, CS, math), bioRxiv, SSRN, RePEc, EarthArXiv

Aggregators:
- Harvest from many repositories into a single search interface
- BASE (Bielefeld Academic Search Engine)
- CORE (core.ac.uk, 200M+ open access articles)
- OpenAIRE (European research output)
```

### Discovering Repositories

OpenDOAR (Directory of Open Access Repositories) is the primary registry for finding institutional repositories:

```python
import json
import urllib.parse
import urllib.request


def search_opendoar(subject: str = None, country: str = None) -> list:
    """
    Search the OpenDOAR registry for institutional repositories.

    Args:
        subject: Filter by subject area (e.g., "Biology", "Computer Science")
        country: ISO country code (e.g., "US", "GB", "CN")
    """
    base_url = "https://v2.sherpa.ac.uk/cgi/retrieve"
    # Note: the live Sherpa service may also require an &api-key=... parameter
    params = "?item-type=repository&format=Json"
    if subject:
        # Percent-encode the filter: brackets and quotes are not URL-safe
        params += "&filter=" + urllib.parse.quote(f'[["{subject}","subject"]]')
    if country:
        params += "&filter=" + urllib.parse.quote(f'[["{country}","country"]]')

    req = urllib.request.Request(base_url + params)
    with urllib.request.urlopen(req) as response:
        data = json.loads(response.read())

    repositories = []
    for item in data.get("items", []):
        meta = item.get("repository_metadata", {})
        repositories.append({
            "name": meta.get("name", [{}])[0].get("name", ""),
            "url": meta.get("url", ""),
            "oai_url": meta.get("oai_url", ""),
            "software": meta.get("software", {}).get("name", ""),
            "type": meta.get("type", "")
        })

    return repositories
```

## OAI-PMH Harvesting from Repositories

Most institutional repositories support OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), the standard protocol for metadata exchange:

```python
import urllib.request
import xml.etree.ElementTree as ET


def harvest_repository(base_url: str, metadata_prefix: str = "oai_dc",
                       set_spec: str = None, from_date: str = None) -> list:
    """
    Harvest metadata records from a repository's OAI-PMH endpoint.

    Args:
        base_url: The OAI-PMH base URL
        metadata_prefix: Metadata format (oai_dc, datacite, mets)
        set_spec: Optional set/collection to restrict harvesting
        from_date: Harvest only records added after this date (YYYY-MM-DD)
    """
    params = f"?verb=ListRecords&metadataPrefix={metadata_prefix}"
    if set_spec:
        params += f"&set={set_spec}"
    if from_date:
        params += f"&from={from_date}"

    url = base_url + params
    records = []
    ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}

    # Follow resumption tokens until the repository reports no more pages
    while url:
        with urllib.request.urlopen(url) as response:
            root = ET.parse(response).getroot()

        for record in root.findall(".//oai:record", ns):
            header = record.find("oai:header", ns)
            identifier = header.find("oai:identifier", ns).text
            datestamp = header.find("oai:datestamp", ns).text
            records.append({"identifier": identifier, "datestamp": datestamp})

        token_elem = root.find(".//oai:resumptionToken", ns)
        if token_elem is not None and token_elem.text:
            url = f"{base_url}?verb=ListRecords&resumptionToken={token_elem.text}"
        else:
            url = None

    return records
```

### Key OAI-PMH Verbs

| Verb | Purpose |
|------|---------|
| `Identify` | Get repository name, admin email, policies |
| `ListSets` | List available collections/sets |
| `ListMetadataFormats` | List supported metadata schemas |
| `ListIdentifiers` | Lightweight listing of record headers |
| `ListRecords` | Full metadata records with pagination |
| `GetRecord` | Retrieve a single record by identifier |
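Every verb in the table maps to a plain query-string request against the same base URL, so a small helper keeps the URLs consistent (the base URL below is hypothetical, for illustration only):

```python
from urllib.parse import urlencode


def oai_request_url(base_url: str, verb: str, **params) -> str:
    """Build an OAI-PMH request URL for any of the verbs above."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


# Hypothetical endpoint for illustration only
base = "https://repository.example.edu/oai/request"
print(oai_request_url(base, "Identify"))
print(oai_request_url(base, "GetRecord",
                      identifier="oai:example.edu:1234",
                      metadataPrefix="oai_dc"))
```

`urlencode` also handles percent-encoding of identifiers, which frequently contain colons and slashes.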

## Major Repository Platforms

### DSpace

The most widely deployed open-source repository platform (used by roughly 40% of repositories worldwide):

- OAI-PMH endpoint: `{base-url}/oai/request`
- REST API: `{base-url}/server/api`
- Supports Dublin Core, METS, and custom metadata schemas
- Examples: MIT DSpace, University of Cambridge Repository

### EPrints

Popular in the UK and Europe:

- OAI-PMH endpoint: `{base-url}/cgi/oai2`
- REST API: `{base-url}/cgi/export/{id}/{format}`
- Strong support for research output types (articles, theses, conference items)
- Example: University of Southampton EPrints

### Fedora / Islandora

Used by larger institutions with complex digital collections:

- Typically paired with a discovery layer (Solr/Blacklight)
- Strong support for digital preservation workflows
- Examples: University of Toronto, Smithsonian Institution

## Building a Harvesting Pipeline

### Systematic Collection Workflow

```
1. Identify target repositories
   - Use OpenDOAR to find IRs by subject or country
   - List subject repositories relevant to your discipline

2. Test endpoints
   - Send Identify request to verify the endpoint is active
   - Check ListMetadataFormats for available schemas

3. Harvest incrementally
   - Use "from" parameter to harvest only new records
   - Store last harvest date for each repository
   - Respect rate limits (typically 1 request per second)

4. Deduplicate
   - Match records by DOI when available
   - Use title + author fuzzy matching for records without DOIs
   - Flag duplicates rather than deleting (keep provenance)

5. Store and index
   - Save metadata in structured format (JSON, SQLite, CSV)
   - Build a local search index for efficient retrieval
```
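Step 4 (deduplication) can be sketched as a two-key matcher: exact DOI match first, normalized-title fallback otherwise. This is a minimal sketch; a production pipeline would add author matching and genuinely fuzzy title comparison.

```python
import re


def _norm_title(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", title.lower())).strip()


def flag_duplicates(records: list) -> list:
    """Mark later records as duplicates of earlier ones, keeping provenance.

    Each record is a dict with optional 'doi' and 'title' keys; duplicates
    get a 'duplicate_of' index added rather than being deleted.
    """
    seen = {}  # match key -> index of first occurrence
    for i, rec in enumerate(records):
        doi = (rec.get("doi") or "").lower()
        key = ("doi", doi) if doi else ("title", _norm_title(rec.get("title", "")))
        if key in seen:
            rec["duplicate_of"] = seen[key]
        else:
            seen[key] = i
    return records
```

Flagging instead of deleting matches the workflow's provenance advice: the harvested record set stays complete, and downstream screening can filter on `duplicate_of`.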

## Ethical Considerations

- Always respect `robots.txt` and repository rate limits
- Metadata harvesting is generally permitted; bulk full-text download may require permission
- Check each repository's terms of use before harvesting
- Use harvested data for research purposes, not commercial redistribution
- Attribute the source repository in publications using harvested data
- Consider reaching out to repository administrators for large-scale harvesting projects

## References

- OpenDOAR: https://v2.sherpa.ac.uk/opendoar/
- OAI-PMH specification: http://www.openarchives.org/OAI/openarchivesprotocol.html
- CORE: https://core.ac.uk
- BASE: https://www.base-search.net
- DSpace documentation: https://wiki.lyrasis.org/display/DSPACE