@botlearn/academic-search 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +35 -0
- package/knowledge/anti-patterns.md +88 -0
- package/knowledge/best-practices.md +165 -0
- package/knowledge/domain.md +293 -0
- package/manifest.json +28 -0
- package/package.json +38 -0
- package/skill.md +56 -0
- package/strategies/main.md +134 -0
- package/tests/benchmark.json +476 -0
- package/tests/smoke.json +54 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 BotLearn

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,35 @@
# @botlearn/academic-search

> Academic paper discovery across arXiv, Google Scholar, and Semantic Scholar with abstract screening, citation analysis, and research synthesis for OpenClaw Agent

## Installation

```bash
# via npm
npm install @botlearn/academic-search

# via clawhub
clawhub install @botlearn/academic-search
```

## Category

Information Retrieval

## Dependencies

`@botlearn/google-search`

## Files

| File | Description |
|------|-------------|
| `manifest.json` | Skill metadata and configuration |
| `skill.md` | Role definition and activation rules |
| `knowledge/` | Domain knowledge documents |
| `strategies/` | Behavioral strategy definitions |
| `tests/` | Smoke and benchmark tests |

## License

MIT
package/knowledge/anti-patterns.md
ADDED
@@ -0,0 +1,88 @@
---
domain: academic-search
topic: anti-patterns
priority: medium
ttl: 30d
---

# Academic Search -- Anti-Patterns

## Query Construction Anti-Patterns

### 1. Natural Language Search Queries
- **Problem**: Submitting conversational queries like "What are the best methods for detecting fake images?" to academic APIs. Academic databases perform poorly on natural language; they expect keyword-based queries.
- **Fix**: Extract core concepts and use database-specific syntax. Transform to: `ti:deepfake+detection+AND+cat:cs.CV` (arXiv) or `query=deepfake detection generative adversarial&fieldsOfStudy=Computer Science` (Semantic Scholar).
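
A minimal sketch of this transformation. The stopword list and the helpers (`extract_keywords`, `arxiv_query`, `s2_query`) are illustrative assumptions, not part of arXiv's or Semantic Scholar's APIs:

```python
import re
import urllib.parse

# Toy stopword list for stripping conversational filler (illustrative only).
STOPWORDS = {"what", "are", "the", "best", "methods", "for", "how", "to",
             "of", "a", "an", "in", "on", "is", "do", "does"}

def extract_keywords(question: str) -> list:
    """Keep only content-bearing terms from a natural-language question."""
    tokens = re.findall(r"[a-z0-9-]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]

def arxiv_query(keywords: list, category: str) -> str:
    """Build an arXiv search_query: title terms ANDed with a category code."""
    terms = "+AND+".join(f"ti:{k}" for k in keywords)
    return f"{terms}+AND+cat:{category}"

def s2_query(keywords: list, field: str) -> str:
    """Build Semantic Scholar /paper/search query parameters."""
    return urllib.parse.urlencode({"query": " ".join(keywords),
                                   "fieldsOfStudy": field})

print(arxiv_query(["deepfake", "detection"], "cs.CV"))
# ti:deepfake+AND+ti:detection+AND+cat:cs.CV
```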

### 2. Single-Database Dependency
- **Problem**: Searching only arXiv and missing published journal papers, or searching only Google Scholar and missing very recent preprints uploaded in the last 48 hours.
- **Fix**: Always query at least 2 databases. Use arXiv for the freshest preprints, Semantic Scholar for structured metadata and citation graphs, and Google Scholar for the broadest coverage, including theses and technical reports.

### 3. Overly Broad Queries
- **Problem**: Searching for "machine learning" without topic, method, or application constraints returns millions of irrelevant results.
- **Fix**: Add specificity through subtopic terms, category filters, date ranges, and venue constraints. "machine learning" becomes `"graph neural network" protein structure prediction 2023-` with `fieldsOfStudy=Computer Science,Biology`.

### 4. Overly Narrow Queries
- **Problem**: Using hyper-specific technical jargon that returns 0 results because the exact phrasing does not match paper titles or abstracts. For example, searching `"multi-head cross-attention with rotary position embeddings"` may miss papers that use slightly different terminology.
- **Fix**: Start specific, then progressively broaden. Use OR clauses with synonymous terms. Check whether zero results means the topic is genuinely niche or your terminology is misaligned with the literature.

### 5. Ignoring Field-Specific Terminology
- **Problem**: Using CS terminology when searching for biomedical papers, or vice versa. Different fields use different terms for the same concepts (e.g., "feature" in ML vs. "biomarker" in medicine).
- **Fix**: Consult the keywords section of one known relevant paper to identify field-appropriate terminology. Use Semantic Scholar's `fieldsOfStudy` filter to disambiguate.

## Ranking & Selection Anti-Patterns

### 6. Citation Count Bias
- **Problem**: Ranking papers primarily by citation count. This biases toward older papers (more time to accumulate citations), popular fields (more researchers citing each other), and anglophone research (English papers are cited more globally). A 2024 paper with 15 citations may be more impactful for the user than a 2018 paper with 500.
- **Fix**: Use **citation velocity** (citations per year) and **influential citation count** (Semantic Scholar) instead of raw citation count. Weight recency appropriately for fast-moving fields. A paper with 15 influential citations in 1 year may outrank one with 500 total citations over 6 years.
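
The velocity-based comparison can be sketched as follows. The record fields mirror Semantic Scholar's response (`year`, `citationCount`, `influentialCitationCount`); the minimum-age rule, the fixed reference date, and the sample records are our own assumptions:

```python
from datetime import date

def influential_velocity(paper: dict, today: date = date(2025, 1, 1)) -> float:
    """Influential citations per year since publication (minimum age: one year)."""
    age_years = max(1, today.year - paper["year"])
    return paper["influentialCitationCount"] / age_years

papers = [
    {"title": "Old classic", "year": 2018,
     "citationCount": 500, "influentialCitationCount": 40},
    {"title": "Fresh result", "year": 2024,
     "citationCount": 15, "influentialCitationCount": 9},
]

# The fresh paper (9 influential citations/year) outranks the old one
# (~5.7/year) even though its raw citation count is far lower.
ranked = sorted(papers, key=influential_velocity, reverse=True)
```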

### 7. Ignoring Methodology Quality
- **Problem**: Selecting papers based solely on claimed results without evaluating the methodology. Papers may report impressive numbers from flawed experimental setups, cherry-picked baselines, or non-standard evaluation metrics.
- **Fix**: Check for: (a) comparison against established baselines, (b) use of standard benchmark datasets, (c) statistical significance reporting, (d) ablation studies, (e) reproducibility indicators (code/data availability). Apply the methodology rigor factor from the ranking framework in best-practices.md.

### 8. Recency Bias
- **Problem**: Automatically preferring the newest papers and dismissing foundational work. Users may miss seminal papers that define the field's core concepts, or survey papers that provide essential context.
- **Fix**: Include at least one foundational or survey paper when the user is exploring a new topic. Explicitly distinguish between "latest results" and "essential background." A 2017 paper that introduced a paradigm (e.g., "Attention Is All You Need") is critical context even if the user asked for recent work.

### 9. Venue Snobbery
- **Problem**: Dismissing papers from workshops, smaller conferences, or preprints solely based on venue prestige. Groundbreaking work sometimes appears first in workshops or on arXiv before being published at a top venue.
- **Fix**: Assess the paper on its own merits (methodology, results, clarity) rather than venue alone. Note the venue tier but do not automatically exclude lower-tier publications. Flag preprints as "not yet peer-reviewed" rather than excluding them.

### 10. Popularity Echo Chamber
- **Problem**: Returning only the most-cited papers leads to an echo chamber where the same 5-10 well-known papers appear for every query in a field, missing diverse perspectives and newer approaches.
- **Fix**: Deliberately include at least 1-2 papers from different research groups, geographic regions, or methodological traditions. Use Semantic Scholar's "recommended papers" feature to discover less obvious but relevant work.

## Result Presentation Anti-Patterns

### 11. Missing Publication Status
- **Problem**: Presenting arXiv preprints alongside peer-reviewed journal articles without distinguishing between them. Users may assume all results have undergone peer review.
- **Fix**: Always indicate publication status: `[Preprint]`, `[Peer-reviewed - Conference]`, `[Peer-reviewed - Journal]`, `[Workshop paper]`, `[Thesis]`. This is a hard requirement (see skill.md Constraints).

### 12. Incomplete Bibliographic Metadata
- **Problem**: Returning paper titles and URLs without authors, year, venue, or identifiers. Users cannot properly cite, find, or assess the papers.
- **Fix**: Every paper must include: authors (at least first author + "et al."), year, venue/journal, and at least one persistent identifier (DOI, arXiv ID, or Semantic Scholar Corpus ID).

### 13. Abstract Dump Without Synthesis
- **Problem**: Copy-pasting raw abstracts for each paper without summarizing the key finding or explaining why it is relevant to the user's specific query.
- **Fix**: Extract 1-2 sentences of key findings relevant to the user's question. Provide a synthesis section that connects papers thematically, identifies consensus and contradictions, and suggests a reading order.

### 14. Fabricating Paper Details
- **Problem**: Generating plausible-sounding but non-existent paper titles, authors, or findings to fill gaps in search results. This is a critical failure mode for LLM-based agents.
- **Fix**: Only return papers actually retrieved from database API responses. If the search returns fewer than 5 relevant results, report that honestly rather than padding with fabricated entries. Include the database query used so the user can verify.

### 15. Ignoring Open Access Status
- **Problem**: Returning papers behind paywalls without checking for open-access alternatives. Users without institutional access cannot read the papers.
- **Fix**: For each paper, check: (a) arXiv preprint version, (b) Semantic Scholar `openAccessPdf` field, (c) author's personal website or institutional repository. Flag papers with no open-access version available and suggest the user check their institutional library access.

## Workflow Anti-Patterns

### 16. No Iteration on Poor Results
- **Problem**: Returning the first set of results without assessing quality. If the top results are irrelevant or insufficient, the skill should refine and retry.
- **Fix**: After initial results, perform a quality check: Are at least 3 of the top 5 directly relevant? If not, refine the query using synonym expansion, alternative category codes, or broader/narrower scope. Iterate up to 2 times before presenting final results.

### 17. Ignoring the Citation Graph
- **Problem**: Treating each paper as an isolated entity rather than exploiting the citation network for discovery. Citation snowballing (forward and backward) is one of the most powerful techniques in academic search.
- **Fix**: For the top 1-2 most relevant initial results, always check their citations (forward) and references (backward) using the Semantic Scholar API. This often surfaces highly relevant papers that keyword search alone would miss.

### 18. Conflating Preprint Versions
- **Problem**: Returning both the arXiv preprint and the published version of the same paper as separate results, inflating the result count with duplicates.
- **Fix**: Follow the deduplication protocol from best-practices.md. Match by DOI/arXiv ID, then by title+author+year. Keep the published version as the primary entry but include the arXiv link for open access.
package/knowledge/best-practices.md
ADDED
@@ -0,0 +1,165 @@
---
domain: academic-search
topic: query-construction-relevance-ranking-cross-referencing
priority: high
ttl: 30d
---

# Academic Search -- Best Practices

## Query Construction

### 1. Research Question Decomposition
Before searching, decompose the user's request into structured components:
- **Core topic** -- The primary subject (e.g., "federated learning")
- **Subtopic / aspect** -- Specific focus area (e.g., "privacy guarantees")
- **Methodology preference** -- Empirical, theoretical, survey (e.g., "benchmark evaluation")
- **Temporal scope** -- Date range for results (e.g., "2022 onwards")
- **Discipline** -- Target field(s) (e.g., computer science, biomedicine)

### 2. Academic Keyword Selection
Academic papers use different terminology than general web content:

| General Term | Academic Equivalent(s) |
|--------------|------------------------|
| AI chatbot | large language model, conversational agent, dialogue system |
| image recognition | visual recognition, image classification, object detection |
| data privacy | differential privacy, privacy-preserving, data protection |
| brain scan | neuroimaging, fMRI, MRI, EEG |
| drug discovery | pharmacological screening, molecular docking, compound identification |
| self-driving car | autonomous vehicle, automated driving, self-driving system |
| fake news | misinformation detection, claim verification, fact-checking |

**Key principle**: Use the terminology of the target research community. Check the "Keywords" section of known relevant papers to discover preferred terms.

### 3. Database-Specific Query Strategies

#### arXiv Strategy
- Use category codes to narrow scope: `cat:cs.LG` for ML, `cat:cs.CL` for NLP
- Search title (`ti:`) for high-precision results, abstract (`abs:`) for broader recall
- Use `ANDNOT` to exclude tangential categories
- Sort by `submittedDate` for latest work, `relevance` for best matches
- For emerging topics, prefer recency over relevance sorting

#### Semantic Scholar Strategy
- Use the `fieldsOfStudy` parameter to filter by discipline
- Use the `year` parameter with range syntax (`2022-2025`) to constrain dates
- Request `influentialCitationCount` to gauge true impact
- Use the `tldr` field for quick paper summaries when triaging large result sets
- Follow up with the citation/reference graph for seminal paper discovery

#### Google Scholar Strategy (via google-search)
- Use `intitle:"keyword"` for title-focused search
- Use `author:"name"` to find a specific researcher's work
- Use `source:"venue"` to restrict to specific journals/conferences
- Apply `as_ylo` and `as_yhi` URL parameters for date ranges
- Use `"exact phrase"` for specific technical terms or method names

### 4. Multi-Database Search Protocol
For comprehensive literature coverage:
1. **Start with Semantic Scholar** -- Best for structured metadata, citation graphs, and field-of-study filtering
2. **Supplement with arXiv** -- Catches very recent preprints not yet indexed elsewhere (especially CS, physics, math)
3. **Verify with Google Scholar** -- Broadest coverage; catches papers from smaller venues, theses, and technical reports
4. **Cross-reference results** -- Deduplicate using DOI or arXiv ID; merge metadata from the richest source

### 5. Query Expansion Techniques
When initial results are insufficient:
- **Synonym expansion**: Add OR clauses with alternative terms (e.g., `"graph neural network" OR "message passing network"`)
- **Citation snowballing**: Find one relevant paper, then search its references (backward) and citations (forward)
- **Author tracking**: Identify key authors from initial results, then search for their other recent papers
- **Venue scoping**: If a relevant paper was published at ICML, search specifically within ICML proceedings for related work
- **Related paper features**: Use Semantic Scholar's "recommended papers" or Google Scholar's "Related articles"
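
Synonym expansion can be sketched as a small query rewriter. The `SYNONYMS` table below is a hypothetical example, not a shipped resource:

```python
# Map a concept to known alternative phrasings (illustrative entries only).
SYNONYMS = {
    "graph neural network": ["graph neural network",
                             "message passing network", "GNN"],
}

def expand(term: str) -> str:
    """Return an OR-grouped clause covering known synonyms of a term."""
    alts = SYNONYMS.get(term, [term])
    if len(alts) == 1:
        return f'"{alts[0]}"'
    return "(" + " OR ".join(f'"{a}"' for a in alts) + ")"

query = expand("graph neural network") + " protein structure prediction"
print(query)
# ("graph neural network" OR "message passing network" OR "GNN") protein structure prediction
```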

## Relevance Ranking

### Multi-Factor Ranking Framework
Rank papers by a weighted combination of these factors:

| Factor | Weight | Assessment Criteria |
|--------|--------|---------------------|
| **Topical Relevance** | 35% | How directly does the paper address the user's specific question? Title/abstract keyword overlap, methodology match |
| **Methodological Rigor** | 20% | Appropriate methodology, sufficient baselines, statistical significance, reproducibility indicators |
| **Venue Quality** | 15% | Conference/journal ranking (A*/A for CS, Impact Factor for journals), peer-review status |
| **Recency** | 15% | Publication date relative to the field's pace; recent is better for fast-moving fields |
| **Impact** | 15% | Influential citation count, citation velocity, whether it introduced a widely-adopted technique |

### Scoring Procedure
For each paper, score 0-5 on each factor:
- **5** -- Excellent: directly relevant, top venue, rigorous method, high and growing citations
- **4** -- Strong: highly relevant with minor gaps in one dimension
- **3** -- Good: relevant but from a secondary venue or with moderate impact
- **2** -- Acceptable: tangentially relevant or older but still useful
- **1** -- Marginal: peripherally related or methodologically weak
- **0** -- Not relevant: off-topic, retracted, or fundamentally flawed

Compute the weighted score: `(Relevance * 0.35) + (Rigor * 0.20) + (Venue * 0.15) + (Recency * 0.15) + (Impact * 0.15)`
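
The formula above as a function, with the weights exactly as in the table; the example factor scores are illustrative:

```python
# Factor weights from the ranking framework table.
WEIGHTS = {"relevance": 0.35, "rigor": 0.20, "venue": 0.15,
           "recency": 0.15, "impact": 0.15}

def weighted_score(scores: dict) -> float:
    """Combine 0-5 per-factor scores into a single 0-5 ranking score."""
    return sum(scores[f] * w for f, w in WEIGHTS.items())

# Example: a directly relevant, rigorous, very recent preprint
# (strong relevance/rigor/recency, weak venue score).
s = weighted_score({"relevance": 5, "rigor": 4, "venue": 2,
                    "recency": 5, "impact": 3})  # ≈ 4.05
```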

### Special Ranking Adjustments
- **Survey papers**: Boost ranking when the user requests an "overview" or "literature review" -- surveys provide comprehensive coverage
- **Seminal papers**: Boost ranking for foundational papers even if older, when the user is exploring a new field
- **Preprints**: Apply a -1 penalty to the venue quality score unless the preprint is from a well-known research group or has a high citation count
- **Open access**: Apply a +0.5 bonus when the user needs full-text access and the paper has a freely available PDF

## Cross-Referencing Strategies

### 1. Forward Citation Analysis
Starting from a known relevant paper, examine papers that cite it:
- Use the Semantic Scholar `/paper/{id}/citations` endpoint
- Filter citations by year to find recent extensions
- Sort by `influentialCitationCount` to find the most impactful follow-up works
- Identify **citation clusters** -- groups of papers that cite the same source tend to be related

### 2. Backward Reference Analysis
Starting from a known relevant paper, examine its references:
- Use the Semantic Scholar `/paper/{id}/references` endpoint
- Identify the **foundational papers** the author builds upon
- Look for methodological sources -- the papers describing the technique being used
- Find the dataset and benchmark papers for evaluation context

### 3. Bibliographic Coupling
Two papers that share many references are likely related:
- Compare reference lists of candidate papers
- Papers with 30%+ reference overlap are strong candidates for relevance
- Useful for finding papers the database search missed

### 4. Co-Citation Analysis
Two papers frequently cited together by other papers are related:
- If papers A and B both appear in the reference lists of many papers, they address related topics
- Use the Semantic Scholar recommended papers feature as a proxy

### 5. Author Network Exploration
- Identify prolific authors in the initial result set
- Search for their other recent publications
- Check their co-authors for related work from collaborating labs
- Examine their Google Scholar profile for a comprehensive publication list

### 6. Deduplication Protocol
The same paper often appears across multiple databases:
1. **Match by DOI** -- Definitive identifier; if DOIs match, it is the same paper
2. **Match by arXiv ID** -- Maps arXiv preprints to their Semantic Scholar entries
3. **Match by title + first author + year** -- Fuzzy match for papers without a DOI
4. **Merge metadata** -- Keep the record with the richest metadata (prefer Semantic Scholar for citations, arXiv for full text)
5. **Resolve version conflicts** -- A paper may have an arXiv v1 and a published version; prefer the published version but link to the arXiv open-access copy
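
A simplified sketch of steps 1-3, assuming Semantic Scholar's `externalIds` shape. A full implementation would also cascade keys so that a DOI-only record and an arXiv-only record of the same paper can still be unified via the fuzzy key:

```python
import re

def dedup_key(paper: dict) -> tuple:
    """Identity key: DOI first, then arXiv ID, then title+first author+year."""
    ids = paper.get("externalIds", {})
    if ids.get("DOI"):
        return ("doi", ids["DOI"].lower())
    if ids.get("ArXiv"):
        return ("arxiv", ids["ArXiv"])
    title = re.sub(r"\W+", "", paper["title"].lower())
    first_author = paper["authors"][0].split()[-1].lower()
    return ("fuzzy", title, first_author, paper["year"])

def dedup(papers: list) -> list:
    """Keep one record per key, preferring records with richer external IDs."""
    seen = {}
    for p in sorted(papers, key=lambda p: len(p.get("externalIds", {})),
                    reverse=True):
        seen.setdefault(dedup_key(p), p)
    return list(seen.values())
```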

## Output Formatting

### Per-Paper Entry
Each paper in the output should include:
```
[Rank]. Title
Authors: First Author et al. (Year)
Venue: Conference/Journal Name [Peer-reviewed / Preprint]
Citations: X total, Y influential | Citation velocity: Z/year
arXiv: XXXX.XXXXX | DOI: 10.XXXX/XXXXX
Open Access: [Yes - link] / [No - paywall]
Key Findings: 1-2 sentence summary of the main contribution
Relevance: Why this paper matters for the user's query
```

### Synthesis Section
After listing individual papers, provide:
- **Thematic grouping**: Cluster papers by approach or subtopic
- **Consensus findings**: What do multiple papers agree on?
- **Contradictions**: Where do papers disagree, and why?
- **Research gaps**: What questions remain unanswered?
- **Recommended reading order**: Suggest which papers to read first based on the user's background
package/knowledge/domain.md
ADDED
@@ -0,0 +1,293 @@
---
domain: academic-search
topic: academic-database-apis-and-paper-structure
priority: high
ttl: 30d
---

# Academic Search -- Database APIs, Paper Structure & Citation Metrics

## arXiv API

### Overview
arXiv (arxiv.org) is an open-access repository for preprints in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. All papers are freely accessible.

### API Endpoint
- **Base URL**: `http://export.arxiv.org/api/query`
- **Method**: GET
- **Rate Limit**: 1 request per 3 seconds (be respectful; no authentication required)

### Query Parameters
| Parameter | Description | Example |
|-----------|-------------|---------|
| `search_query` | Search terms with field prefixes | `ti:attention+AND+cat:cs.CL` |
| `id_list` | Comma-separated arXiv IDs | `2301.00001,2301.00002` |
| `start` | Offset for pagination | `0` |
| `max_results` | Number of results (max 30000) | `10` |
| `sortBy` | Sort field: `relevance`, `lastUpdatedDate`, `submittedDate` | `relevance` |
| `sortOrder` | `ascending` or `descending` | `descending` |

### Field Prefixes for search_query
- `ti:` -- Title
- `au:` -- Author
- `abs:` -- Abstract
- `co:` -- Comment
- `jr:` -- Journal reference
- `cat:` -- Subject category
- `all:` -- All fields

### Boolean Operators
- `AND` -- Both terms required
- `OR` -- Either term matches
- `ANDNOT` -- Exclude term
- Grouping with parentheses: `(ti:transformer OR ti:attention) AND cat:cs.CL`

### arXiv Category Codes (Common)
| Category | Description |
|----------|-------------|
| `cs.AI` | Artificial Intelligence |
| `cs.CL` | Computation and Language (NLP) |
| `cs.CV` | Computer Vision |
| `cs.LG` | Machine Learning |
| `cs.SE` | Software Engineering |
| `cs.CR` | Cryptography and Security |
| `stat.ML` | Machine Learning (Statistics) |
| `math.OC` | Optimization and Control |
| `q-bio.QM` | Quantitative Methods (Biology) |
| `econ.EM` | Econometrics |
| `physics.comp-ph` | Computational Physics |

### Response Format (Atom XML)
```xml
<entry>
  <id>http://arxiv.org/abs/2301.00001v1</id>
  <title>Paper Title</title>
  <summary>Abstract text...</summary>
  <author><name>Author Name</name></author>
  <published>2023-01-01T00:00:00Z</published>
  <updated>2023-01-15T00:00:00Z</updated>
  <arxiv:primary_category term="cs.CL"/>
  <category term="cs.AI"/>
  <link href="http://arxiv.org/pdf/2301.00001v1" title="pdf"/>
</entry>
```

### Example Queries
```
# Find recent transformer papers in NLP
search_query=ti:transformer+AND+cat:cs.CL&sortBy=submittedDate&sortOrder=descending&max_results=10

# Find papers by a specific author on attention mechanisms
search_query=au:vaswani+AND+ti:attention&sortBy=relevance&max_results=5

# Find reinforcement learning papers excluding robotics
search_query=all:reinforcement+learning+ANDNOT+cat:cs.RO&max_results=20
```
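
A standard-library sketch of calling this endpoint and parsing the Atom response. The Atom namespace URI is the real one; the field selection and helper names are illustrative:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom"}
BASE = "http://export.arxiv.org/api/query"

def build_url(search_query: str, max_results: int = 10) -> str:
    """Encode parameters; spaces in search_query become '+' separators."""
    params = {"search_query": search_query, "start": 0,
              "max_results": max_results,
              "sortBy": "submittedDate", "sortOrder": "descending"}
    return BASE + "?" + urllib.parse.urlencode(params)

def parse_feed(xml_text: str) -> list:
    """Pull the response-format fields shown above out of each <entry>."""
    root = ET.fromstring(xml_text)
    return [{
        "id": e.findtext("atom:id", namespaces=NS),
        "title": e.findtext("atom:title", namespaces=NS).strip(),
        "summary": e.findtext("atom:summary", namespaces=NS).strip(),
        "published": e.findtext("atom:published", namespaces=NS),
    } for e in root.findall("atom:entry", NS)]

# Live call (respect the 1-request-per-3-seconds limit):
# url = build_url("ti:transformer AND cat:cs.CL")
# with urllib.request.urlopen(url) as resp:
#     papers = parse_feed(resp.read().decode())
```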

## Semantic Scholar API

### Overview
Semantic Scholar (semanticscholar.org) provides a comprehensive academic search engine with rich metadata, citation graphs, and AI-extracted features. It covers 200M+ papers across all fields.

### API Endpoints

#### Paper Search
- **URL**: `GET https://api.semanticscholar.org/graph/v1/paper/search`
- **Rate Limit**: 100 requests/5 minutes (unauthenticated); 1 request/second with an API key

| Parameter | Description | Example |
|-----------|-------------|---------|
| `query` | Search terms | `attention mechanism transformers` |
| `fields` | Comma-separated fields to return | `title,authors,year,abstract,citationCount,venue` |
| `limit` | Results per page (max 100) | `10` |
| `offset` | Pagination offset | `0` |
| `year` | Publication year filter | `2023-` (2023 onwards), `2020-2023` |
| `fieldsOfStudy` | Discipline filter | `Computer Science` |
| `openAccessPdf` | Filter for open access | (presence means filter) |

#### Paper Details
- **URL**: `GET https://api.semanticscholar.org/graph/v1/paper/{paper_id}`
- **Paper ID formats**: Semantic Scholar ID, DOI (`DOI:10.xxx`), arXiv (`ARXIV:2301.00001`), PMID, ACL ID, Corpus ID

#### Citation Graph
- **Citations**: `GET https://api.semanticscholar.org/graph/v1/paper/{paper_id}/citations`
- **References**: `GET https://api.semanticscholar.org/graph/v1/paper/{paper_id}/references`

| Parameter | Description |
|-----------|-------------|
| `fields` | Fields for each citing/referenced paper |
| `limit` | Number of citations/references (max 1000) |
| `offset` | Pagination offset |

#### Author Search
- **URL**: `GET https://api.semanticscholar.org/graph/v1/author/search`
- **URL**: `GET https://api.semanticscholar.org/graph/v1/author/{author_id}/papers`

### Available Fields
```
# Paper fields
title, abstract, year, venue, publicationDate, journal,
citationCount, referenceCount, influentialCitationCount,
fieldsOfStudy, s2FieldsOfStudy, authors, externalIds,
url, openAccessPdf, tldr, publicationTypes, citationStyles

# Author fields
name, affiliations, homepage, paperCount, citationCount, hIndex
```

### Example Queries
```
# Search for recent RAG papers
GET /graph/v1/paper/search?query=retrieval+augmented+generation&year=2023-&fields=title,authors,year,citationCount,abstract,venue&limit=10

# Get the citation graph for a specific paper
GET /graph/v1/paper/ARXIV:2005.11401/citations?fields=title,year,citationCount&limit=50

# Find papers by field of study
GET /graph/v1/paper/search?query=protein+folding&fieldsOfStudy=Biology&fields=title,year,venue&limit=20
```
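
A standard-library sketch of the paper-search call. The endpoint and parameters follow the tables above; the sample response dict is a stub standing in for a live call:

```python
import json
import urllib.parse
import urllib.request

API = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_request(query: str, year: str, fields: list, limit: int = 10) -> str:
    """Assemble a /paper/search URL with encoded parameters."""
    params = urllib.parse.urlencode({
        "query": query, "year": year,
        "fields": ",".join(fields), "limit": limit,
    })
    return f"{API}?{params}"

def top_papers(response: dict) -> list:
    """Extract (title, year, citationCount) triples from a search response."""
    return [(p["title"], p["year"], p["citationCount"])
            for p in response.get("data", [])]

url = build_request("retrieval augmented generation", "2023-",
                    ["title", "year", "citationCount"])
# Live call: response = json.load(urllib.request.urlopen(url))
sample = {"total": 1, "data": [{"title": "RAG survey", "year": 2024,
                                "citationCount": 12}]}
assert top_papers(sample) == [("RAG survey", 2024, 12)]
```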

## Google Scholar (via google-search skill)

### Query Operators

Google Scholar inherits many standard Google Search operators with academic-specific behavior:

- `"exact phrase"` -- Exact match in title, abstract, or full text
- `author:"last name"` -- Filter by author name
- `intitle:"keyword"` -- Term must appear in paper title
- `source:"journal name"` -- Filter by publication venue
- Date range filter via Google Scholar UI or `after:YYYY` operator

### Google Scholar URL Construction

```
# Basic search
https://scholar.google.com/scholar?q=transformer+attention+mechanism

# With date range
https://scholar.google.com/scholar?q=transformer+attention&as_ylo=2023&as_yhi=2025

# Author search
https://scholar.google.com/scholar?q=author:"hinton"+deep+learning

# Specific journal
https://scholar.google.com/scholar?q=source:"nature"+gene+editing+CRISPR
```
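
Because Scholar has no official API, queries are usually issued as plain URLs like those above. A minimal sketch for building them (the `scholar_url` helper and its parameters are illustrative; `as_ylo`/`as_yhi` are the year-range query parameters shown in the examples):

```python
from urllib.parse import urlencode

def scholar_url(query, year_lo=None, year_hi=None):
    """Build a Google Scholar search URL from a query string."""
    params = {"q": query}
    if year_lo:
        params["as_ylo"] = year_lo   # earliest publication year
    if year_hi:
        params["as_yhi"] = year_hi   # latest publication year
    return "https://scholar.google.com/scholar?" + urlencode(params)

url = scholar_url('author:"hinton" deep learning', year_lo=2023, year_hi=2025)
print(url)
```

Operator syntax such as `author:"hinton"` is passed through inside the `q` parameter; `urlencode` handles the quoting.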

### Advantages over Other Sources

- Broadest coverage: journals, conferences, theses, patents, preprints, books
- Includes citation counts and "Cited by" links
- "Related articles" feature for discovery
- Provides links to free PDF versions when available

### Limitations

- No official API (rely on google-search skill for scraping-safe queries)
- Rate-limited and may block automated access
- Citation counts may differ from Semantic Scholar
- Cannot filter by field of study programmatically
## Academic Paper Structure

### Standard Sections (IMRaD Format)

| Section | Purpose | Key Information to Extract |
|---------|---------|----------------------------|
| **Title** | Concise statement of the main finding or topic | Core topic, methodology hint |
| **Abstract** | 150-300 word summary of the entire paper | Problem, method, key result, conclusion |
| **Introduction** | Problem context, motivation, research gap | Research question, hypotheses, related work overview |
| **Related Work / Literature Review** | Positioning within existing research | Key prior work, how this paper differs |
| **Methodology** | How the research was conducted | Datasets, models, experimental setup, baselines |
| **Results** | Quantitative and qualitative findings | Tables, figures, statistical significance, metrics |
| **Discussion** | Interpretation of results | Implications, limitations, comparison with prior work |
| **Conclusion** | Summary and future directions | Main contributions, open questions |
| **References** | Cited works | Citation network, foundational papers |

### Paper Types

| Type | Description | Typical Structure |
|------|-------------|-------------------|
| **Empirical** | Reports original experimental results | Full IMRaD |
| **Survey / Review** | Comprehensive overview of a research area | Taxonomy + systematic analysis |
| **Theoretical** | Proves theorems or proposes frameworks | Definitions, propositions, proofs |
| **Systems** | Describes a software system or tool | Architecture, implementation, evaluation |
| **Position / Opinion** | Argues for a particular viewpoint | Argument structure with evidence |
| **Benchmark** | Introduces datasets or evaluation protocols | Dataset description, baseline results |
## Citation Metrics

### Paper-Level Metrics

| Metric | Description | Use Case |
|--------|-------------|----------|
| **Citation Count** | Total times cited by other papers | Rough impact indicator (use with caution) |
| **Influential Citation Count** | Citations where this paper is central to the citing work (Semantic Scholar) | Better quality indicator than raw count |
| **Citation Velocity** | Citations per year, especially in recent years | Identifies trending vs. declining relevance |
| **Field-Normalized Citation** | Citations relative to field average | Fair comparison across disciplines |
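
Citation velocity has no single canonical formula; one common operationalization is mean citations per year over a recent window. A minimal sketch (the function name, default window, and year handling are assumptions of this example):

```python
def citation_velocity(citations_by_year, window=3, current_year=2025):
    """Mean citations per year over the most recent `window` years.

    One common way to operationalize the metric; definitions vary.
    `citations_by_year` maps year -> citations received that year.
    """
    recent = [citations_by_year.get(current_year - i, 0) for i in range(window)]
    return sum(recent) / window

velocity = citation_velocity({2023: 30, 2024: 45, 2025: 60})
print(velocity)  # (60 + 45 + 30) / 3 = 45.0
```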
### Author-Level Metrics

| Metric | Description |
|--------|-------------|
| **h-index** | h papers with at least h citations each |
| **i10-index** | Number of papers with 10+ citations |
| **Total Citations** | Sum of all paper citations |
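
Both author metrics in the table are straightforward to compute from a list of per-paper citation counts; a small sketch (helper names are illustrative):

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def i10_index(citations):
    """Number of papers with 10 or more citations."""
    return sum(1 for c in citations if c >= 10)

papers = [48, 20, 12, 9, 4, 4, 1]
print(h_index(papers), i10_index(papers))  # -> 4 3
```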
### Venue-Level Metrics

| Metric | Description |
|--------|-------------|
| **Impact Factor** | Average citations per paper in the last 2 years (journals) |
| **h5-index** | h-index for articles published in the last 5 complete years |
| **CORE Ranking** | Conference ranking: A* (top), A, B, C (Australasian system) |
| **CSRankings** | Computer science venue rankings by research output |

### Well-Known Venues by Field

#### Computer Science -- AI/ML

| Venue | Type | Prestige |
|-------|------|----------|
| NeurIPS | Conference | Top-tier |
| ICML | Conference | Top-tier |
| ICLR | Conference | Top-tier |
| AAAI | Conference | Top-tier |
| CVPR / ICCV / ECCV | Conference | Top-tier (Vision) |
| ACL / EMNLP / NAACL | Conference | Top-tier (NLP) |
| JMLR | Journal | Top-tier |
| Nature Machine Intelligence | Journal | Top-tier |

#### Biomedical

| Venue | Type | Prestige |
|-------|------|----------|
| Nature | Journal | Top-tier |
| Science | Journal | Top-tier |
| Cell | Journal | Top-tier |
| PNAS | Journal | High |
| PLoS ONE | Journal | Open-access, broad |

#### General Science

| Venue | Type | Prestige |
|-------|------|----------|
| Nature / Science | Journal | Highest |
| IEEE Transactions (various) | Journal | High |
| ACM Computing Surveys | Journal | High (CS surveys) |
## Research Methodology Taxonomy

### Quantitative Methods

- **Controlled Experiment** -- Manipulates variables with control groups
- **Quasi-Experiment** -- Natural or existing group comparisons
- **Survey / Questionnaire** -- Large-scale data collection via structured instruments
- **Corpus Analysis** -- Statistical analysis of large text/data collections
- **Simulation** -- Computational modeling of systems
- **Benchmarking** -- Standardized evaluation on established datasets

### Qualitative Methods

- **Case Study** -- In-depth analysis of specific instances
- **Ethnography** -- Observational study within a community
- **Grounded Theory** -- Theory building from systematic data analysis
- **Content Analysis** -- Systematic categorization of textual content
- **Interview Study** -- Structured or semi-structured conversations

### Mixed Methods

- **Sequential Explanatory** -- Quantitative phase followed by qualitative
- **Sequential Exploratory** -- Qualitative phase followed by quantitative
- **Convergent Parallel** -- Both methods simultaneously, results merged

### Review Methods

- **Systematic Review** -- Rigorous, reproducible search and synthesis protocol
- **Meta-Analysis** -- Statistical aggregation of results across studies
- **Scoping Review** -- Broad mapping of a research area
- **Narrative Review** -- Expert-driven summary (less rigorous than systematic)
package/manifest.json
ADDED

@@ -0,0 +1,28 @@

{
  "name": "@botlearn/academic-search",
  "version": "0.1.0",
  "description": "Academic paper discovery across arXiv, Google Scholar, and Semantic Scholar with abstract screening, citation analysis, and research synthesis for OpenClaw Agent",
  "category": "information-retrieval",
  "author": "BotLearn",
  "benchmarkDimension": "information-retrieval",
  "expectedImprovement": 35,
  "dependencies": {
    "@botlearn/google-search": "^1.0.0"
  },
  "compatibility": {
    "openclaw": ">=0.5.0"
  },
  "files": {
    "skill": "skill.md",
    "knowledge": [
      "knowledge/domain.md",
      "knowledge/best-practices.md",
      "knowledge/anti-patterns.md"
    ],
    "strategies": [
      "strategies/main.md"
    ],
    "smokeTest": "tests/smoke.json",
    "benchmark": "tests/benchmark.json"
  }
}