@botlearn/academic-search 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 BotLearn
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,35 @@
+ # @botlearn/academic-search
+
+ > Academic paper discovery across arXiv, Google Scholar, and Semantic Scholar with abstract screening, citation analysis, and research synthesis for OpenClaw Agent
+
+ ## Installation
+
+ ```bash
+ # via npm
+ npm install @botlearn/academic-search
+
+ # via clawhub
+ clawhub install @botlearn/academic-search
+ ```
+
+ ## Category
+
+ Information Retrieval
+
+ ## Dependencies
+
+ `@botlearn/google-search`
+
+ ## Files
+
+ | File | Description |
+ |------|-------------|
+ | `manifest.json` | Skill metadata and configuration |
+ | `skill.md` | Role definition and activation rules |
+ | `knowledge/` | Domain knowledge documents |
+ | `strategies/` | Behavioral strategy definitions |
+ | `tests/` | Smoke and benchmark tests |
+
+ ## License
+
+ MIT
package/knowledge/anti-patterns.md ADDED
@@ -0,0 +1,88 @@
+ ---
+ domain: academic-search
+ topic: anti-patterns
+ priority: medium
+ ttl: 30d
+ ---
+
+ # Academic Search -- Anti-Patterns
+
+ ## Query Construction Anti-Patterns
+
+ ### 1. Natural Language Search Queries
+ - **Problem**: Submitting conversational queries like "What are the best methods for detecting fake images?" to academic APIs. Academic databases perform poorly on natural language; they expect keyword-based queries.
+ - **Fix**: Extract core concepts and use database-specific syntax. Transform to: `ti:deepfake+detection+AND+cat:cs.CV` (arXiv) or `query=deepfake detection generative adversarial&fieldsOfStudy=Computer Science` (Semantic Scholar).
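+
+ A minimal sketch of this transformation (illustrative only; the helper and concept list are hypothetical, not part of this skill's runtime):
+
+ ```python
+ from urllib.parse import quote_plus
+
+ def build_queries(concepts, arxiv_category=None):
+     """Turn extracted concepts into database-specific keyword queries."""
+     # arXiv: field-prefixed terms joined with AND, plus an optional category filter
+     arxiv = "+AND+".join(f"ti:{quote_plus(c)}" for c in concepts)
+     if arxiv_category:
+         arxiv += f"+AND+cat:{arxiv_category}"
+     # Semantic Scholar: a plain keyword string; the API does its own ranking
+     return {"arxiv": arxiv, "semantic_scholar": " ".join(concepts)}
+
+ build_queries(["deepfake", "detection"], arxiv_category="cs.CV")
+ # {'arxiv': 'ti:deepfake+AND+ti:detection+AND+cat:cs.CV',
+ #  'semantic_scholar': 'deepfake detection'}
+ ```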
+
+ ### 2. Single-Database Dependency
+ - **Problem**: Searching only arXiv and missing published journal papers, or searching only Google Scholar and missing very recent preprints uploaded in the last 48 hours.
+ - **Fix**: Always query at least 2 databases. Use arXiv for the freshest preprints, Semantic Scholar for structured metadata and citation graphs, and Google Scholar for the broadest coverage including theses and technical reports.
+
+ ### 3. Overly Broad Queries
+ - **Problem**: Searching for "machine learning" without topic, method, or application constraints returns millions of irrelevant results.
+ - **Fix**: Add specificity through subtopic terms, category filters, date ranges, and venue constraints. "machine learning" becomes `"graph neural network" protein structure prediction 2023-` with `fieldsOfStudy=Computer Science,Biology`.
+
+ ### 4. Overly Narrow Queries
+ - **Problem**: Using hyper-specific technical jargon that returns 0 results because the exact phrasing does not match paper titles or abstracts. For example, searching `"multi-head cross-attention with rotary position embeddings"` may miss papers that use slightly different terminology.
+ - **Fix**: Start specific, then progressively broaden. Use OR clauses with synonymous terms. Check whether zero results means the topic is genuinely niche or your terminology is misaligned with the literature.
+
+ ### 5. Ignoring Field-Specific Terminology
+ - **Problem**: Using CS terminology when searching for biomedical papers, or vice versa. Different fields use different terms for the same concepts (e.g., "feature" in ML vs. "biomarker" in medicine).
+ - **Fix**: Consult the keywords section of one known relevant paper to identify field-appropriate terminology. Use Semantic Scholar's `fieldsOfStudy` filter to disambiguate.
+
+ ## Ranking & Selection Anti-Patterns
+
+ ### 6. Citation Count Bias
+ - **Problem**: Ranking papers primarily by citation count. This biases toward older papers (more time to accumulate citations), popular fields (more researchers citing each other), and anglophone research (English papers are cited more globally). A 2024 paper with 15 citations may be more impactful for the user than a 2018 paper with 500.
+ - **Fix**: Use **citation velocity** (citations per year) and **influential citation count** (Semantic Scholar) instead of raw citation count. Weight recency appropriately for fast-moving fields. A paper with 15 influential citations in 1 year may outrank one with 500 total citations over 6 years.
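+
+ As a rough sketch (assumes `citationCount` and `year` fields as returned by Semantic Scholar; the helper is ours, not a library function):
+
+ ```python
+ from datetime import date
+
+ def citation_velocity(citation_count, publication_year):
+     """Citations per year since publication -- a simple velocity heuristic."""
+     years = max(date.today().year - publication_year, 1)
+     return citation_count / years
+
+ # Raw counts favor the 2018 paper; velocity tells a more balanced story.
+ citation_velocity(500, 2018)  # spread over many years
+ citation_velocity(15, 2024)   # concentrated in the paper's first year(s)
+ ```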
+
+ ### 7. Ignoring Methodology Quality
+ - **Problem**: Selecting papers based solely on claimed results without evaluating the methodology. Papers may report impressive numbers from flawed experimental setups, cherry-picked baselines, or non-standard evaluation metrics.
+ - **Fix**: Check for: (a) comparison against established baselines, (b) use of standard benchmark datasets, (c) statistical significance reporting, (d) ablation studies, (e) reproducibility indicators (code/data availability). Apply the methodology rigor factor from the ranking framework in best-practices.md.
+
+ ### 8. Recency Bias
+ - **Problem**: Automatically preferring the newest papers and dismissing foundational work. Users may miss seminal papers that define the field's core concepts, or survey papers that provide essential context.
+ - **Fix**: Include at least one foundational or survey paper when the user is exploring a new topic. Explicitly distinguish between "latest results" and "essential background." A 2017 paper that introduced a paradigm (e.g., "Attention Is All You Need") is critical context even if the user asked for recent work.
+
+ ### 9. Venue Snobbery
+ - **Problem**: Dismissing papers from workshops, smaller conferences, or preprints solely based on venue prestige. Groundbreaking work sometimes appears first in workshops or on arXiv before being published at a top venue.
+ - **Fix**: Assess the paper on its own merits (methodology, results, clarity) rather than venue alone. Note the venue tier but do not automatically exclude lower-tier publications. Flag preprints as "not yet peer-reviewed" rather than excluding them.
+
+ ### 10. Popularity Echo Chamber
+ - **Problem**: Returning only the most-cited papers leads to an echo chamber where the same 5-10 well-known papers appear for every query in a field, missing diverse perspectives and newer approaches.
+ - **Fix**: Deliberately include at least 1-2 papers from different research groups, geographic regions, or methodological traditions. Use Semantic Scholar's "recommended papers" feature to discover less obvious but relevant work.
+
+ ## Result Presentation Anti-Patterns
+
+ ### 11. Missing Publication Status
+ - **Problem**: Presenting arXiv preprints alongside peer-reviewed journal articles without distinguishing between them. Users may assume all results have undergone peer review.
+ - **Fix**: Always indicate publication status: `[Preprint]`, `[Peer-reviewed - Conference]`, `[Peer-reviewed - Journal]`, `[Workshop paper]`, `[Thesis]`. This is a hard requirement (see skill.md Constraints).
+
+ ### 12. Incomplete Bibliographic Metadata
+ - **Problem**: Returning paper titles and URLs without authors, year, venue, or identifiers. Users cannot properly cite, find, or assess the papers.
+ - **Fix**: Every paper must include: authors (at least first author + "et al."), year, venue/journal, and at least one persistent identifier (DOI, arXiv ID, or Semantic Scholar Corpus ID).
+
+ ### 13. Abstract Dump Without Synthesis
+ - **Problem**: Copy-pasting raw abstracts for each paper without summarizing the key finding or explaining why it is relevant to the user's specific query.
+ - **Fix**: Extract 1-2 sentences of key findings relevant to the user's question. Provide a synthesis section that connects papers thematically, identifies consensus and contradictions, and suggests a reading order.
+
+ ### 14. Fabricating Paper Details
+ - **Problem**: Generating plausible-sounding but non-existent paper titles, authors, or findings to fill gaps in search results. This is a critical failure mode for LLM-based agents.
+ - **Fix**: Only return papers actually retrieved from database API responses. If the search returns fewer than 5 relevant results, report that honestly rather than padding with fabricated entries. Include the database query used so the user can verify.
+
+ ### 15. Ignoring Open Access Status
+ - **Problem**: Returning papers behind paywalls without checking for open-access alternatives. Users without institutional access cannot read the papers.
+ - **Fix**: For each paper, check: (a) arXiv preprint version, (b) Semantic Scholar `openAccessPdf` field, (c) author's personal website or institutional repository. Flag papers with no open-access version available and suggest the user check their institutional library access.
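+
+ A sketch of checks (a) and (b) against the Semantic Scholar paper endpoint (the helper name is ours; assumes the `requests` library):
+
+ ```python
+ import requests
+
+ def open_access_url(paper_id):
+     """Return an open-access URL for a paper, or None if paywalled."""
+     r = requests.get(
+         f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}",
+         params={"fields": "openAccessPdf,externalIds"},
+         timeout=10,
+     )
+     r.raise_for_status()
+     data = r.json()
+     pdf = data.get("openAccessPdf") or {}
+     if pdf.get("url"):                      # (b) openAccessPdf field
+         return pdf["url"]
+     arxiv_id = (data.get("externalIds") or {}).get("ArXiv")
+     if arxiv_id:                            # (a) arXiv preprint version
+         return f"https://arxiv.org/abs/{arxiv_id}"
+     return None  # flag as paywalled; suggest institutional library access
+ ```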
+
+ ## Workflow Anti-Patterns
+
+ ### 16. No Iteration on Poor Results
+ - **Problem**: Returning the first set of results without assessing quality. If the top results are irrelevant or insufficient, the skill should refine and retry.
+ - **Fix**: After initial results, perform a quality check: Are at least 3 of the top 5 directly relevant? If not, refine the query using synonym expansion, alternative category codes, or broader/narrower scope. Iterate up to 2 times before presenting final results.
+
+ ### 17. Ignoring the Citation Graph
+ - **Problem**: Treating each paper as an isolated entity rather than exploiting the citation network for discovery. Citation snowballing (forward and backward) is one of the most powerful techniques in academic search.
+ - **Fix**: For the top 1-2 most relevant initial results, always check their citations (forward) and references (backward) using the Semantic Scholar API. This often surfaces highly relevant papers that keyword search alone would miss.
+
+ ### 18. Conflating Preprint Versions
+ - **Problem**: Returning both the arXiv preprint and the published version of the same paper as separate results, inflating the result count with duplicates.
+ - **Fix**: Follow the deduplication protocol from best-practices.md. Match by DOI/arXiv ID, then by title+author+year. Keep the published version as the primary entry but include the arXiv link for open access.
package/knowledge/best-practices.md ADDED
@@ -0,0 +1,165 @@
+ ---
+ domain: academic-search
+ topic: query-construction-relevance-ranking-cross-referencing
+ priority: high
+ ttl: 30d
+ ---
+
+ # Academic Search -- Best Practices
+
+ ## Query Construction
+
+ ### 1. Research Question Decomposition
+ Before searching, decompose the user's request into structured components:
+ - **Core topic** -- The primary subject (e.g., "federated learning")
+ - **Subtopic / aspect** -- Specific focus area (e.g., "privacy guarantees")
+ - **Methodology preference** -- Empirical, theoretical, survey (e.g., "benchmark evaluation")
+ - **Temporal scope** -- Date range for results (e.g., "2022 onwards")
+ - **Discipline** -- Target field(s) (e.g., computer science, biomedicine)
+
+ ### 2. Academic Keyword Selection
+ Academic papers use different terminology than general web content:
+
+ | General Term | Academic Equivalent(s) |
+ |-------------|----------------------|
+ | AI chatbot | large language model, conversational agent, dialogue system |
+ | image recognition | visual recognition, image classification, object detection |
+ | data privacy | differential privacy, privacy-preserving, data protection |
+ | brain scan | neuroimaging, fMRI, MRI, EEG |
+ | drug discovery | pharmacological screening, molecular docking, compound identification |
+ | self-driving car | autonomous vehicle, automated driving, self-driving system |
+ | fake news | misinformation detection, claim verification, fact-checking |
+
+ **Key principle**: Use the terminology of the target research community. Check the "Keywords" section of known relevant papers to discover preferred terms.
+
+ ### 3. Database-Specific Query Strategies
+
+ #### arXiv Strategy
+ - Use category codes to narrow scope: `cat:cs.LG` for ML, `cat:cs.CL` for NLP
+ - Search title (`ti:`) for high-precision results, abstract (`abs:`) for broader recall
+ - Use `ANDNOT` to exclude tangential categories
+ - Sort by `submittedDate` for latest work, `relevance` for best matches
+ - For emerging topics, prefer recency over relevance sorting
+
+ #### Semantic Scholar Strategy
+ - Use the `fieldsOfStudy` parameter to filter by discipline
+ - Use the `year` parameter with range syntax (`2022-2025`) to constrain dates
+ - Request `influentialCitationCount` to gauge true impact
+ - Use the `tldr` field for quick paper summaries when triaging large result sets
+ - Follow up with the citation/reference graph for seminal paper discovery
+
+ #### Google Scholar Strategy (via google-search)
+ - Use `intitle:"keyword"` for title-focused search
+ - Use `author:"name"` to find a specific researcher's work
+ - Use `source:"venue"` to restrict to specific journals/conferences
+ - Apply the `as_ylo` and `as_yhi` URL parameters for date ranges
+ - Use `"exact phrase"` for specific technical terms or method names
+
+ ### 4. Multi-Database Search Protocol
+ For comprehensive literature coverage:
+ 1. **Start with Semantic Scholar** -- Best for structured metadata, citation graphs, and field-of-study filtering
+ 2. **Supplement with arXiv** -- Catches very recent preprints not yet indexed elsewhere (especially CS, physics, math)
+ 3. **Verify with Google Scholar** -- Broadest coverage; catches papers from smaller venues, theses, and technical reports
+ 4. **Cross-reference results** -- Deduplicate using DOI or arXiv ID; merge metadata from the richest source
+
+ ### 5. Query Expansion Techniques
+ When initial results are insufficient:
+ - **Synonym expansion**: Add OR clauses with alternative terms (e.g., `"graph neural network" OR "message passing network"`)
+ - **Citation snowballing**: Find one relevant paper, then search its references (backward) and citations (forward)
+ - **Author tracking**: Identify key authors from initial results, then search for their other recent papers
+ - **Venue scoping**: If a relevant paper was published at ICML, search specifically within ICML proceedings for related work
+ - **Related paper features**: Use Semantic Scholar's "recommended papers" or Google Scholar's "Related articles"
+
+ ## Relevance Ranking
+
+ ### Multi-Factor Ranking Framework
+ Rank papers by a weighted combination of these factors:
+
+ | Factor | Weight | Assessment Criteria |
+ |--------|--------|-------------------|
+ | **Topical Relevance** | 35% | How directly does the paper address the user's specific question? Title/abstract keyword overlap, methodology match |
+ | **Methodological Rigor** | 20% | Appropriate methodology, sufficient baselines, statistical significance, reproducibility indicators |
+ | **Venue Quality** | 15% | Conference/journal ranking (A*/A for CS, Impact Factor for journals), peer-review status |
+ | **Recency** | 15% | Publication date relative to the field's pace; recent is better for fast-moving fields |
+ | **Impact** | 15% | Influential citation count, citation velocity, whether it introduced a widely adopted technique |
+
+ ### Scoring Procedure
+ For each paper, score 0-5 on each factor:
+ - **5** -- Excellent: directly relevant, top venue, rigorous method, high and growing citations
+ - **4** -- Strong: highly relevant with minor gaps in one dimension
+ - **3** -- Good: relevant but from a secondary venue or with moderate impact
+ - **2** -- Acceptable: tangentially relevant or older but still useful
+ - **1** -- Marginal: peripherally related or methodologically weak
+ - **0** -- Not relevant: off-topic, retracted, or fundamentally flawed
+
+ Compute the weighted score: `(Relevance * 0.35) + (Rigor * 0.20) + (Venue * 0.15) + (Recency * 0.15) + (Impact * 0.15)`
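+
+ The same procedure as a small sketch (weights and 0-5 scores exactly as defined above):
+
+ ```python
+ WEIGHTS = {"relevance": 0.35, "rigor": 0.20, "venue": 0.15,
+            "recency": 0.15, "impact": 0.15}
+
+ def weighted_score(scores):
+     """scores maps each factor to a 0-5 rating; returns a 0-5 weighted score."""
+     assert set(scores) == set(WEIGHTS)
+     return sum(scores[factor] * weight for factor, weight in WEIGHTS.items())
+
+ # A highly relevant recent preprint with modest venue/impact scores:
+ weighted_score({"relevance": 5, "rigor": 4, "venue": 2,
+                 "recency": 5, "impact": 3})  # -> 4.05
+ ```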
+
+ ### Special Ranking Adjustments
+ - **Survey papers**: Boost ranking when the user requests an "overview" or "literature review" -- surveys provide comprehensive coverage
+ - **Seminal papers**: Boost ranking for foundational papers even if older, when the user is exploring a new field
+ - **Preprints**: Apply a -1 penalty to the venue quality score unless the preprint is from a well-known research group or has a high citation count
+ - **Open access**: Apply a +0.5 bonus when the user needs full-text access and the paper has a freely available PDF
+
+ ## Cross-Referencing Strategies
+
+ ### 1. Forward Citation Analysis
+ Starting from a known relevant paper, examine papers that cite it:
+ - Use the Semantic Scholar `/paper/{id}/citations` endpoint
+ - Filter citations by year to find recent extensions
+ - Sort by `influentialCitationCount` to find the most impactful follow-up works
+ - Identify **citation clusters** -- groups of papers that cite the same source tend to be related
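+
+ A sketch of this lookup (assumes the `requests` library; the citations endpoint is documented in knowledge/domain.md):
+
+ ```python
+ import requests
+
+ def forward_citations(paper_id, limit=50):
+     """Fetch papers citing `paper_id`, most influential first."""
+     r = requests.get(
+         f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}/citations",
+         params={"fields": "title,year,citationCount,influentialCitationCount",
+                 "limit": limit},
+         timeout=10,
+     )
+     r.raise_for_status()
+     citing = [c["citingPaper"] for c in r.json().get("data", [])]
+     # The endpoint does not sort by influence, so sort client-side.
+     return sorted(citing,
+                   key=lambda p: p.get("influentialCitationCount") or 0,
+                   reverse=True)
+ ```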
+
+ ### 2. Backward Reference Analysis
+ Starting from a known relevant paper, examine its references:
+ - Use the Semantic Scholar `/paper/{id}/references` endpoint
+ - Identify the **foundational papers** the author builds upon
+ - Look for methodological sources -- the papers describing the technique being used
+ - Find the dataset and benchmark papers for evaluation context
+
+ ### 3. Bibliographic Coupling
+ Two papers that share many references are likely related:
+ - Compare the reference lists of candidate papers
+ - Papers with 30%+ reference overlap are strong candidates for relevance
+ - Useful for finding papers the database search missed
+
+ ### 4. Co-Citation Analysis
+ Two papers frequently cited together by other papers are related:
+ - If papers A and B both appear in the reference lists of many papers, they address related topics
+ - Use Semantic Scholar's recommended papers feature as a proxy
+
+ ### 5. Author Network Exploration
+ - Identify prolific authors in the initial result set
+ - Search for their other recent publications
+ - Check their co-authors for related work from collaborating labs
+ - Examine their Google Scholar profile for a comprehensive publication list
+
+ ### 6. Deduplication Protocol
+ The same paper often appears across multiple databases:
+ 1. **Match by DOI** -- Definitive identifier; if DOIs match, it is the same paper
+ 2. **Match by arXiv ID** -- Maps arXiv preprints to their Semantic Scholar entries
+ 3. **Match by title + first author + year** -- Fuzzy match for papers without a DOI
+ 4. **Merge metadata** -- Keep the record with the richest metadata (prefer Semantic Scholar for citations, arXiv for full text)
+ 5. **Resolve version conflicts** -- A paper may have an arXiv v1 and a published version; prefer the published version but link to the arXiv open-access copy
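+
+ A minimal sketch of steps 1-4 over Semantic-Scholar-shaped records (helper names are ours):
+
+ ```python
+ def dedup_key(paper):
+     """Identity key in the protocol's priority order: DOI, arXiv ID, fuzzy."""
+     ids = paper.get("externalIds") or {}
+     if ids.get("DOI"):
+         return ("doi", ids["DOI"].lower())
+     if ids.get("ArXiv"):
+         return ("arxiv", ids["ArXiv"])
+     title = "".join(ch for ch in paper.get("title", "").lower() if ch.isalnum())
+     first_author = (paper.get("authors") or [{}])[0].get("name", "").lower()
+     return ("fuzzy", title, first_author, paper.get("year"))
+
+ def _richness(record):
+     return sum(v is not None for v in record.values())
+
+ def deduplicate(papers):
+     merged = {}
+     for p in papers:
+         key = dedup_key(p)
+         # Keep whichever duplicate has more metadata fields populated.
+         if key not in merged or _richness(p) > _richness(merged[key]):
+             merged[key] = p
+     return list(merged.values())
+ ```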
+
+ ## Output Formatting
+
+ ### Per-Paper Entry
+ Each paper in the output should include:
+ ```
+ [Rank]. Title
+ Authors: First Author et al. (Year)
+ Venue: Conference/Journal Name [Peer-reviewed / Preprint]
+ Citations: X total, Y influential | Citation velocity: Z/year
+ arXiv: XXXX.XXXXX | DOI: 10.XXXX/XXXXX
+ Open Access: [Yes - link] / [No - paywall]
+ Key Findings: 1-2 sentence summary of the main contribution
+ Relevance: Why this paper matters for the user's query
+ ```
+
+ ### Synthesis Section
+ After listing individual papers, provide:
+ - **Thematic grouping**: Cluster papers by approach or subtopic
+ - **Consensus findings**: What do multiple papers agree on?
+ - **Contradictions**: Where do papers disagree, and why?
+ - **Research gaps**: What questions remain unanswered?
+ - **Recommended reading order**: Suggest which papers to read first based on the user's background
package/knowledge/domain.md ADDED
@@ -0,0 +1,293 @@
+ ---
+ domain: academic-search
+ topic: academic-database-apis-and-paper-structure
+ priority: high
+ ttl: 30d
+ ---
+
+ # Academic Search -- Database APIs, Paper Structure & Citation Metrics
+
+ ## arXiv API
+
+ ### Overview
+ arXiv (arxiv.org) is an open-access repository for preprints in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. All papers are freely accessible.
+
+ ### API Endpoint
+ - **Base URL**: `http://export.arxiv.org/api/query`
+ - **Method**: GET
+ - **Rate Limit**: 1 request per 3 seconds (be respectful; no authentication required)
+
+ ### Query Parameters
+ | Parameter | Description | Example |
+ |-----------|-------------|---------|
+ | `search_query` | Search terms with field prefixes | `ti:attention+AND+cat:cs.CL` |
+ | `id_list` | Comma-separated arXiv IDs | `2301.00001,2301.00002` |
+ | `start` | Offset for pagination | `0` |
+ | `max_results` | Number of results (max 30000) | `10` |
+ | `sortBy` | Sort field: `relevance`, `lastUpdatedDate`, `submittedDate` | `relevance` |
+ | `sortOrder` | `ascending` or `descending` | `descending` |
+
+ ### Field Prefixes for search_query
+ - `ti:` -- Title
+ - `au:` -- Author
+ - `abs:` -- Abstract
+ - `co:` -- Comment
+ - `jr:` -- Journal reference
+ - `cat:` -- Subject category
+ - `all:` -- All fields
+
+ ### Boolean Operators
+ - `AND` -- Both terms required
+ - `OR` -- Either term matches
+ - `ANDNOT` -- Exclude term
+ - Grouping with parentheses: `(ti:transformer OR ti:attention) AND cat:cs.CL`
+
+ ### arXiv Category Codes (Common)
+ | Category | Description |
+ |----------|-------------|
+ | `cs.AI` | Artificial Intelligence |
+ | `cs.CL` | Computation and Language (NLP) |
+ | `cs.CV` | Computer Vision |
+ | `cs.LG` | Machine Learning |
+ | `cs.SE` | Software Engineering |
+ | `cs.CR` | Cryptography and Security |
+ | `stat.ML` | Machine Learning (Statistics) |
+ | `math.OC` | Optimization and Control |
+ | `q-bio.QM` | Quantitative Methods (Biology) |
+ | `econ.EM` | Econometrics |
+ | `physics.comp-ph` | Computational Physics |
+
+ ### Response Format (Atom XML)
+ ```xml
+ <entry>
+   <id>http://arxiv.org/abs/2301.00001v1</id>
+   <title>Paper Title</title>
+   <summary>Abstract text...</summary>
+   <author><name>Author Name</name></author>
+   <published>2023-01-01T00:00:00Z</published>
+   <updated>2023-01-15T00:00:00Z</updated>
+   <arxiv:primary_category term="cs.CL"/>
+   <category term="cs.AI"/>
+   <link href="http://arxiv.org/pdf/2301.00001v1" title="pdf"/>
+ </entry>
+ ```
+
+ ### Example Queries
+ ```
+ # Find recent transformer papers in NLP
+ search_query=ti:transformer+AND+cat:cs.CL&sortBy=submittedDate&sortOrder=descending&max_results=10
+
+ # Find papers by a specific author on attention mechanisms
+ search_query=au:vaswani+AND+ti:attention&sortBy=relevance&max_results=5
+
+ # Find reinforcement learning papers excluding robotics
+ # (quote multi-word phrases -- %22 is an encoded double quote -- so the field prefix covers the whole phrase)
+ search_query=all:%22reinforcement+learning%22+ANDNOT+cat:cs.RO&max_results=20
+ ```
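+
+ Putting the pieces together, a sketch of a polite arXiv client (assumes the third-party `feedparser` package for Atom parsing; the helper is illustrative, not part of the skill's runtime):
+
+ ```python
+ import time
+ import feedparser  # pip install feedparser
+
+ def arxiv_search(search_query, max_results=10):
+     url = ("http://export.arxiv.org/api/query"
+            f"?search_query={search_query}"
+            "&sortBy=submittedDate&sortOrder=descending"
+            f"&max_results={max_results}")
+     feed = feedparser.parse(url)
+     time.sleep(3)  # respect the 1-request-per-3-seconds limit before the next call
+     return [{
+         "arxiv_id": entry.id.rsplit("/abs/", 1)[-1],
+         "title": " ".join(entry.title.split()),  # titles may contain line breaks
+         "published": entry.published,
+         "authors": [a.name for a in entry.authors],
+     } for entry in feed.entries]
+
+ for paper in arxiv_search("ti:transformer+AND+cat:cs.CL"):
+     print(paper["arxiv_id"], paper["title"])
+ ```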
+
+ ## Semantic Scholar API
+
+ ### Overview
+ Semantic Scholar (semanticscholar.org) provides a comprehensive academic search engine with rich metadata, citation graphs, and AI-extracted features. It covers 200M+ papers across all fields.
+
+ ### API Endpoints
+
+ #### Paper Search
+ - **URL**: `GET https://api.semanticscholar.org/graph/v1/paper/search`
+ - **Rate Limit**: 100 requests/5 minutes (unauthenticated); 1 request/second with an API key
+
+ | Parameter | Description | Example |
+ |-----------|-------------|---------|
+ | `query` | Search terms | `attention mechanism transformers` |
+ | `fields` | Comma-separated fields to return | `title,authors,year,abstract,citationCount,venue` |
+ | `limit` | Results per page (max 100) | `10` |
+ | `offset` | Pagination offset | `0` |
+ | `year` | Publication year filter | `2023-` (2023 onwards), `2020-2023` |
+ | `fieldsOfStudy` | Discipline filter | `Computer Science` |
+ | `openAccessPdf` | Filter for open access | (presence of the parameter enables the filter) |
+
+ #### Paper Details
+ - **URL**: `GET https://api.semanticscholar.org/graph/v1/paper/{paper_id}`
+ - **Paper ID formats**: Semantic Scholar ID, DOI (`DOI:10.xxx`), arXiv (`ARXIV:2301.00001`), PMID, ACL ID, Corpus ID
+
+ #### Citation Graph
+ - **Citations**: `GET https://api.semanticscholar.org/graph/v1/paper/{paper_id}/citations`
+ - **References**: `GET https://api.semanticscholar.org/graph/v1/paper/{paper_id}/references`
+
+ | Parameter | Description |
+ |-----------|-------------|
+ | `fields` | Fields for each citing/referenced paper |
+ | `limit` | Number of citations/references (max 1000) |
+ | `offset` | Pagination offset |
+
+ #### Author Search
+ - **Search**: `GET https://api.semanticscholar.org/graph/v1/author/search`
+ - **Papers**: `GET https://api.semanticscholar.org/graph/v1/author/{author_id}/papers`
+
+ ### Available Fields
+ ```
+ # Paper fields
+ title, abstract, year, venue, publicationDate, journal,
+ citationCount, referenceCount, influentialCitationCount,
+ fieldsOfStudy, s2FieldsOfStudy, authors, externalIds,
+ url, openAccessPdf, tldr, publicationTypes, citationStyles
+
+ # Author fields
+ name, affiliations, homepage, paperCount, citationCount, hIndex
+ ```
+
+ ### Example Queries
+ ```
+ # Search for recent RAG papers
+ GET /graph/v1/paper/search?query=retrieval+augmented+generation&year=2023-&fields=title,authors,year,citationCount,abstract,venue&limit=10
+
+ # Get the citation graph for a specific paper
+ GET /graph/v1/paper/ARXIV:2005.11401/citations?fields=title,year,citationCount&limit=50
+
+ # Find papers by field of study
+ GET /graph/v1/paper/search?query=protein+folding&fieldsOfStudy=Biology&fields=title,year,venue&limit=20
+ ```
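+
+ As a sketch, the paper-search endpoint with basic rate-limit handling (assumes the `requests` library; the API key header is optional):
+
+ ```python
+ import time
+ import requests
+
+ SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
+
+ def s2_search(query, year="2023-", limit=10, api_key=None):
+     headers = {"x-api-key": api_key} if api_key else {}
+     params = {"query": query, "year": year, "limit": limit,
+               "fields": "title,authors,year,venue,citationCount,"
+                         "influentialCitationCount,openAccessPdf"}
+     for _ in range(3):
+         r = requests.get(SEARCH_URL, params=params, headers=headers, timeout=10)
+         if r.status_code == 429:   # rate-limited: back off and retry
+             time.sleep(5)
+             continue
+         r.raise_for_status()
+         return r.json().get("data", [])
+     raise RuntimeError("Semantic Scholar rate limit not clearing")
+
+ for p in s2_search("retrieval augmented generation"):
+     print(p["year"], p["citationCount"], p["title"])
+ ```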
+
+ ## Google Scholar (via google-search skill)
+
+ ### Query Operators
+ Google Scholar inherits many standard Google Search operators with academic-specific behavior:
+ - `"exact phrase"` -- Exact match in title, abstract, or full text
+ - `author:"last name"` -- Filter by author name
+ - `intitle:"keyword"` -- Term must appear in the paper title
+ - `source:"journal name"` -- Filter by publication venue
+ - Date range filter via the Google Scholar UI or the `as_ylo`/`as_yhi` URL parameters shown below
+
+ ### Google Scholar URL Construction
+ ```
+ # Basic search
+ https://scholar.google.com/scholar?q=transformer+attention+mechanism
+
+ # With date range
+ https://scholar.google.com/scholar?q=transformer+attention&as_ylo=2023&as_yhi=2025
+
+ # Author search
+ https://scholar.google.com/scholar?q=author:"hinton"+deep+learning
+
+ # Specific journal
+ https://scholar.google.com/scholar?q=source:"nature"+gene+editing+CRISPR
+ ```
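+
+ Building these URLs programmatically, as a sketch (the google-search skill would fetch the resulting URL; the helper name is ours):
+
+ ```python
+ from urllib.parse import urlencode
+
+ def scholar_url(query, year_lo=None, year_hi=None):
+     """Assemble a Google Scholar search URL with optional date bounds."""
+     params = {"q": query}
+     if year_lo:
+         params["as_ylo"] = year_lo
+     if year_hi:
+         params["as_yhi"] = year_hi
+     return "https://scholar.google.com/scholar?" + urlencode(params)
+
+ scholar_url('intitle:"transformer" attention', year_lo=2023, year_hi=2025)
+ # https://scholar.google.com/scholar?q=intitle%3A%22transformer%22+attention&as_ylo=2023&as_yhi=2025
+ ```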
+
+ ### Advantages over Other Sources
+ - Broadest coverage: journals, conferences, theses, patents, preprints, books
+ - Includes citation counts and "Cited by" links
+ - "Related articles" feature for discovery
+ - Provides links to free PDF versions when available
+
+ ### Limitations
+ - No official API (rely on the google-search skill for scraping-safe queries)
+ - Rate-limited and may block automated access
+ - Citation counts may differ from Semantic Scholar
+ - Cannot filter by field of study programmatically
+
+ ## Academic Paper Structure
+
+ ### Standard Sections (IMRaD Format)
+ | Section | Purpose | Key Information to Extract |
+ |---------|---------|---------------------------|
+ | **Title** | Concise statement of the main finding or topic | Core topic, methodology hint |
+ | **Abstract** | 150-300 word summary of the entire paper | Problem, method, key result, conclusion |
+ | **Introduction** | Problem context, motivation, research gap | Research question, hypotheses, related work overview |
+ | **Related Work / Literature Review** | Positioning within existing research | Key prior work, how this paper differs |
+ | **Methodology** | How the research was conducted | Datasets, models, experimental setup, baselines |
+ | **Results** | Quantitative and qualitative findings | Tables, figures, statistical significance, metrics |
+ | **Discussion** | Interpretation of results | Implications, limitations, comparison with prior work |
+ | **Conclusion** | Summary and future directions | Main contributions, open questions |
+ | **References** | Cited works | Citation network, foundational papers |
+
+ ### Paper Types
+ | Type | Description | Typical Structure |
+ |------|-------------|-------------------|
+ | **Empirical** | Reports original experimental results | Full IMRaD |
+ | **Survey / Review** | Comprehensive overview of a research area | Taxonomy + systematic analysis |
+ | **Theoretical** | Proves theorems or proposes frameworks | Definitions, propositions, proofs |
+ | **Systems** | Describes a software system or tool | Architecture, implementation, evaluation |
+ | **Position / Opinion** | Argues for a particular viewpoint | Argument structure with evidence |
+ | **Benchmark** | Introduces datasets or evaluation protocols | Dataset description, baseline results |
+
+ ## Citation Metrics
+
+ ### Paper-Level Metrics
+ | Metric | Description | Use Case |
+ |--------|-------------|----------|
+ | **Citation Count** | Total times cited by other papers | Rough impact indicator (use with caution) |
+ | **Influential Citation Count** | Citations where this paper is central to the citing work (Semantic Scholar) | Better quality indicator than raw count |
+ | **Citation Velocity** | Citations per year, especially in recent years | Identifies trending vs. declining relevance |
+ | **Field-Normalized Citation** | Citations relative to the field average | Fair comparison across disciplines |
+
+ ### Author-Level Metrics
+ | Metric | Description |
+ |--------|-------------|
+ | **h-index** | h papers with at least h citations each |
+ | **i10-index** | Number of papers with 10+ citations |
+ | **Total Citations** | Sum of all paper citations |
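+
+ The h-index definition above translates directly into code -- a sketch over a list of per-paper citation counts:
+
+ ```python
+ def h_index(citations):
+     """Largest h such that h papers have at least h citations each."""
+     ranked = sorted(citations, reverse=True)
+     h = 0
+     while h < len(ranked) and ranked[h] >= h + 1:
+         h += 1
+     return h
+
+ h_index([42, 18, 7, 5, 3, 1])  # -> 4 (four papers with >= 4 citations)
+ ```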
+
+ ### Venue-Level Metrics
+ | Metric | Description |
+ |--------|-------------|
+ | **Impact Factor** | Mean citations in a given year to papers the journal published in the previous 2 years |
+ | **h5-index** | h-index for articles published in the last 5 complete years |
+ | **CORE Ranking** | Conference ranking: A* (top), A, B, C (Australasian system) |
+ | **CSRankings** | Computer science venue rankings by research output |
+
+ ### Well-Known Venues by Field
+
+ #### Computer Science -- AI/ML
+ | Venue | Type | Prestige |
+ |-------|------|----------|
+ | NeurIPS | Conference | Top-tier |
+ | ICML | Conference | Top-tier |
+ | ICLR | Conference | Top-tier |
+ | AAAI | Conference | Top-tier |
+ | CVPR / ICCV / ECCV | Conference | Top-tier (Vision) |
+ | ACL / EMNLP / NAACL | Conference | Top-tier (NLP) |
+ | JMLR | Journal | Top-tier |
+ | Nature Machine Intelligence | Journal | Top-tier |
+
+ #### Biomedical
+ | Venue | Type | Prestige |
+ |-------|------|----------|
+ | Nature | Journal | Top-tier |
+ | Science | Journal | Top-tier |
+ | Cell | Journal | Top-tier |
+ | PNAS | Journal | High |
+ | PLoS ONE | Journal | Open-access, broad |
+
+ #### General Science
+ | Venue | Type | Prestige |
+ |-------|------|----------|
+ | Nature / Science | Journal | Highest |
+ | IEEE Transactions (various) | Journal | High |
+ | ACM Computing Surveys | Journal | High (CS surveys) |
+
+ ## Research Methodology Taxonomy
+
+ ### Quantitative Methods
+ - **Controlled Experiment** -- Manipulates variables with control groups
+ - **Quasi-Experiment** -- Natural or existing group comparisons
+ - **Survey / Questionnaire** -- Large-scale data collection via structured instruments
+ - **Corpus Analysis** -- Statistical analysis of large text/data collections
+ - **Simulation** -- Computational modeling of systems
+ - **Benchmarking** -- Standardized evaluation on established datasets
+
+ ### Qualitative Methods
+ - **Case Study** -- In-depth analysis of specific instances
+ - **Ethnography** -- Observational study within a community
+ - **Grounded Theory** -- Theory building from systematic data analysis
+ - **Content Analysis** -- Systematic categorization of textual content
+ - **Interview Study** -- Structured or semi-structured conversations
+
+ ### Mixed Methods
+ - **Sequential Explanatory** -- Quantitative phase followed by qualitative
+ - **Sequential Exploratory** -- Qualitative phase followed by quantitative
+ - **Convergent Parallel** -- Both methods simultaneously, results merged
+
+ ### Review Methods
+ - **Systematic Review** -- Rigorous, reproducible search and synthesis protocol
+ - **Meta-Analysis** -- Statistical aggregation of results across studies
+ - **Scoping Review** -- Broad mapping of a research area
+ - **Narrative Review** -- Expert-driven summary (less rigorous than systematic)
package/manifest.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "name": "@botlearn/academic-search",
+   "version": "0.1.0",
+   "description": "Academic paper discovery across arXiv, Google Scholar, and Semantic Scholar with abstract screening, citation analysis, and research synthesis for OpenClaw Agent",
+   "category": "information-retrieval",
+   "author": "BotLearn",
+   "benchmarkDimension": "information-retrieval",
+   "expectedImprovement": 35,
+   "dependencies": {
+     "@botlearn/google-search": "^1.0.0"
+   },
+   "compatibility": {
+     "openclaw": ">=0.5.0"
+   },
+   "files": {
+     "skill": "skill.md",
+     "knowledge": [
+       "knowledge/domain.md",
+       "knowledge/best-practices.md",
+       "knowledge/anti-patterns.md"
+     ],
+     "strategies": [
+       "strategies/main.md"
+     ],
+     "smokeTest": "tests/smoke.json",
+     "benchmark": "tests/benchmark.json"
+   }
+ }