npm - @wentorai/research-plugins - Versions diffs - 1.0.0 → 1.2.0 - Mend

@wentorai/research-plugins 1.0.0 → 1.2.0

Files changed (415)

package/skills/literature/discovery/paper-recommendation-guide/SKILL.md ADDED

@@ -0,0 +1,120 @@
+---
+name: paper-recommendation-guide
+description: "Systematic paper recommendation and discovery using multiple methods"
+metadata:
+  openclaw:
+    emoji: "🎯"
+    category: "literature"
+    subcategory: "discovery"
+    keywords: ["paper recommendation", "literature discovery", "related papers", "reading list", "citation-based", "algorithmic discovery"]
+    source: "https://github.com/pengzhenghao/paper-recommendation"
+---
+# Paper Recommendation Guide
+## Overview
+Finding the right papers to read is a research skill in itself. Beyond keyword searches, modern researchers have access to a rich ecosystem of recommendation tools that use citation networks, semantic similarity, co-authorship patterns, and collaborative filtering to surface relevant papers you might otherwise miss.
+This skill provides a systematic approach to paper discovery that goes beyond passive reading. It covers algorithmic recommendation services, citation-based discovery techniques, social and community-driven methods, and strategies for building and maintaining a well-curated reading pipeline. The goal is to minimize the chance that you miss an important paper while avoiding information overload.
+Whether you are entering a new field and need foundational papers, tracking the frontier of a mature research area, or looking for interdisciplinary connections, this guide provides concrete methods for each scenario.
+## Algorithmic Recommendation Services
+### Semantic Scholar Recommendations
+Semantic Scholar provides a free recommendation API that suggests papers based on a set of seed papers you provide:
+```bash
+# Get recommendations based on positive and negative example papers.
+# "fields" and "limit" are query parameters; the JSON body carries only the seed IDs.
+curl -X POST "https://api.semanticscholar.org/recommendations/v1/papers/?fields=title,authors,year,citationCount,url&limit=20" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "positivePaperIds": ["CorpusId:12345", "CorpusId:67890"],
+    "negativePaperIds": ["CorpusId:11111"]
+  }'
+```
+The positive/negative seed approach lets you steer recommendations toward the specific intersection of topics you care about. Start with 3-5 highly relevant papers as positive seeds and 1-2 off-topic papers as negative seeds.
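A minimal sketch of preparing the same request from Python. It only assembles the URL and JSON body and does not send anything; the helper name `build_recommendation_request` is hypothetical, and it assumes `fields` and `limit` travel as query parameters while the seed IDs go in the body.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the curl example above.
BASE_URL = "https://api.semanticscholar.org/recommendations/v1/papers/"


def build_recommendation_request(positive_ids, negative_ids=(),
                                 fields=("title", "year", "url"), limit=20):
    """Return (url, body) for a seed-based recommendation request.

    Hypothetical helper: it builds the request but does not send it.
    """
    if not positive_ids:
        raise ValueError("at least one positive seed paper is required")
    query = urlencode({"fields": ",".join(fields), "limit": limit})
    body = json.dumps({
        "positivePaperIds": list(positive_ids),
        "negativePaperIds": list(negative_ids),
    })
    return f"{BASE_URL}?{query}", body


url, body = build_recommendation_request(
    ["CorpusId:12345", "CorpusId:67890"], ["CorpusId:11111"]
)
print(url)
print(body)
```

Keeping request construction separate from sending makes it easy to log or review the exact seeds before spending API quota.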
+### Connected Papers
+Connected Papers (connectedpapers.com) builds a visual graph of papers related to a seed paper. It uses co-citation and bibliographic coupling analysis rather than direct citation links, which means it can surface related work even when two papers do not cite each other directly. Use this when:
+- You have one key paper and want to map the surrounding literature
+- You want to identify distinct clusters of related research
+- You need to find the "origin paper" for an idea by tracing the graph backward
+### Google Scholar Recommendations
+Google Scholar's "Related articles" feature and the personalized recommendation emails (if you maintain a Google Scholar profile) use a combination of citation analysis and content similarity. To maximize their usefulness:
+- Maintain an up-to-date Google Scholar profile with your publications
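+- Use the "Library" feature to save papers you want to revisit; Google describes its recommendations as based on the articles in your profile, so treat any effect of saved items on them as unconfirmed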
+- Set up Google Scholar Alerts for key queries and author names
+- Check the "Related articles" link on every important paper you read
+### Research Rabbit
+Research Rabbit (researchrabbitapp.com) lets you build collections of papers and then visualizes networks of related work, similar work, and suggested papers. It integrates with Zotero for importing existing libraries. Key features:
+- "Similar Work" tab: finds papers with semantic similarity
+- "All References" and "All Citations": explores the citation tree
+- "These Authors" and "Suggested Authors": discovers researchers working on related topics
+- Shareable collections for collaborative literature discovery
+## Citation-Based Discovery Methods
+When algorithmic tools are insufficient, manual citation-based techniques remain powerful:
+### Forward Citation Chaining
+Start with a foundational paper. Find all papers that cite it (using Google Scholar, Semantic Scholar, or Web of Science). Screen these citing papers by title and abstract to find relevant descendants. Repeat for the most important descendants.
+### Backward Citation Mining
+Read the reference list of a key paper. Identify and retrieve the most important cited works. This traces the intellectual lineage of ideas and helps you find the seminal papers in a subfield.
+### Co-Citation Analysis
+Two papers that are frequently cited together in other papers are likely related, even if they do not cite each other. Tools like VOSviewer and CiteSpace can visualize co-citation clusters from a set of papers, revealing the intellectual structure of a field.
+### Bibliographic Coupling
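+Two papers that share many references are likely addressing related questions. This is the mirror image of co-citation: it compares what two papers cite rather than who cites them, so the link exists from the moment of publication. That makes it more useful for discovering recent papers that have not yet accumulated citations.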
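The difference between co-citation and bibliographic coupling is easiest to see on a toy citation graph. The sketch below uses made-up paper IDs and plain counting; dedicated tools such as VOSviewer add normalization and clustering on top of counts like these.

```python
from collections import Counter
from itertools import combinations

# Toy citation graph (made-up IDs): each paper maps to the set of papers it cites.
references = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y"},
    "C": {"Y", "Z"},
    "D": {"W"},
}


def bibliographic_coupling(refs):
    """Count shared references for every pair of citing papers."""
    strength = {}
    for p, q in combinations(sorted(refs), 2):
        shared = len(refs[p] & refs[q])
        if shared:
            strength[(p, q)] = shared
    return strength


def co_citation(refs):
    """Count how often each pair of cited papers shares a reference list."""
    counts = Counter()
    for cited in refs.values():
        counts.update(combinations(sorted(cited), 2))
    return counts


print(bibliographic_coupling(references))
print(co_citation(references).most_common(3))
```

Coupling links the citing papers (A, B, C), while co-citation links the cited ones (X, Y, Z); paper D shares nothing with the rest and drops out of both views.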
+## Building a Reading Pipeline
+A sustainable paper discovery practice requires more than one-off searches. Build a pipeline that continuously surfaces new relevant work:
+### Weekly Routine
+1. **Check preprint alerts**: Review your arXiv, bioRxiv, or SSRN email alerts or RSS feeds (15 min).
+2. **Scan citation alerts**: Review Google Scholar citation alerts for new papers citing your key references (10 min).
+3. **Process recommendation queue**: Review suggestions from Semantic Scholar, Research Rabbit, or Connected Papers for any recently added seed papers (10 min).
+4. **Social signals**: Scan academic Twitter/Mastodon, relevant subreddits, or lab group Slack channels for shared papers (10 min).
+5. **Triage and queue**: Add promising papers to your "to read" queue with a priority tag (high/medium/low) and the reason you flagged them.
+### Managing the Reading Queue
+Avoid the trap of an ever-growing, never-read paper queue:
+- **Time-box reading**: Dedicate specific blocks (e.g., 2 hours Tuesday/Thursday) to reading queued papers.
+- **Triage aggressively**: Not every flagged paper needs a full read. Use a 3-tier system: skim (5 min), selective read (20 min), deep read (60+ min).
+- **Expire old items**: Papers that have been in your queue for more than 8 weeks without being read should be re-evaluated. If they are still relevant, read them now; otherwise, archive them.
+- **Track what you read**: Maintain a reading log with dates and brief notes to build a personal knowledge base.
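The expiry rule can be sketched in a few lines. The `QueuedPaper` record and `split_queue` helper are hypothetical names chosen for illustration, and the dates are invented; only the 8-week window comes from the guideline above.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_AGE = timedelta(weeks=8)  # expiry window from the guideline above


@dataclass
class QueuedPaper:
    title: str
    added: date
    priority: str  # "high", "medium", or "low"
    reason: str


def split_queue(queue, today):
    """Separate papers still inside the window from those due for re-evaluation."""
    fresh = [p for p in queue if today - p.added <= MAX_AGE]
    stale = [p for p in queue if today - p.added > MAX_AGE]
    return fresh, stale


queue = [
    QueuedPaper("Paper on topic A", date(2025, 1, 6), "high", "cited by key reference"),
    QueuedPaper("Paper on topic B", date(2025, 3, 24), "low", "seen in a preprint alert"),
]
fresh, stale = split_queue(queue, today=date(2025, 4, 1))
print([p.title for p in stale])
```

Storing the reason alongside each entry makes the re-evaluation step quick: if the reason no longer applies, the paper can be archived without rereading the abstract.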
+## Interdisciplinary Discovery
+Finding papers outside your primary field is particularly challenging because you may not know the right terminology. Strategies include:
+- **Concept-based search**: Use tools like OpenAlex that classify papers by research topics (formerly "concepts") rather than by keywords alone.
+- **Review articles**: Find review articles in the adjacent field—they provide curated entry points and vocabulary.
+- **Cross-field citation chaining**: When your field's paper cites work from another discipline, follow that reference chain.
+- **Ask experts**: Reach out to researchers in adjacent fields and ask for their recommended reading list for a newcomer.
+## References
+- Semantic Scholar API: https://api.semanticscholar.org
+- Connected Papers: https://www.connectedpapers.com
+- Research Rabbit: https://www.researchrabbitapp.com
+- Paper Recommendation: https://github.com/pengzhenghao/paper-recommendation
+- VOSviewer: https://www.vosviewer.com

package/skills/literature/discovery/papers-we-love-guide/SKILL.md ADDED

@@ -0,0 +1,169 @@
+---
+name: papers-we-love-guide
+description: "Community-curated directory of influential CS research papers"
+metadata:
+  openclaw:
+    emoji: "❤️"
+    category: "literature"
+    subcategory: "discovery"
+    keywords: ["papers we love", "CS papers", "reading groups", "classic papers", "paper recommendations", "curated list"]
+    source: "https://github.com/papers-we-love/papers-we-love"
+---
+# Papers We Love Guide
+## Overview
+Papers We Love (PWL) is a community-driven repository of influential computer science research papers organized by topic, with worldwide reading groups. The repository contains direct links to PDFs and summaries for hundreds of landmark papers across distributed systems, programming languages, machine learning, security, and more. It is a go-to resource for discovering foundational and impactful research.
+## Repository Structure
+```
+papers-we-love/
+├── distributed_systems/
+│   ├── README.md          # Curated list with descriptions
+│   ├── lamport-clocks.pdf
+│   └── raft.pdf
+├── machine_learning/
+├── programming_languages/
+├── security/
+├── databases/
+├── networking/
+├── information_retrieval/
+├── artificial_intelligence/
+├── concurrency/
+├── operating_systems/
+└── ... (40+ categories)
+```
+## Topic Categories
+| Category | Notable Papers |
+|----------|---------------|
+| **Distributed Systems** | Paxos, Raft, MapReduce, Dynamo |
+| **Machine Learning** | Backpropagation, Dropout, Attention, BatchNorm |
+| **Programming Languages** | Lambda calculus, Type inference, Hindley-Milner |
+| **Databases** | B-Trees, LSM-Trees, MVCC, Column stores |
+| **Security** | Public-key crypto, Zero-knowledge proofs, TLS |
+| **Networking** | TCP congestion, BGP, Software-defined networking |
+| **Operating Systems** | Unix, Microkernel debate, Virtual memory |
+| **Concurrency** | CSP, Actor model, Software transactional memory |
+## Using PWL for Research
+### Finding Papers by Topic
+```bash
+# Clone the repository
+git clone https://github.com/papers-we-love/papers-we-love.git
+# Browse categories
+ls papers-we-love/
+# Each directory has a README with curated descriptions
+cat papers-we-love/distributed_systems/README.md
+```
+### Programmatic Access
+```python
+import glob
+import os
+import re
+
+PWL_PATH = "./papers-we-love"  # path to your local clone
+
+# List all category directories (skip hidden ones such as .git)
+categories = sorted(
+    d for d in os.listdir(PWL_PATH)
+    if os.path.isdir(os.path.join(PWL_PATH, d)) and not d.startswith(".")
+)
+print(f"Categories: {len(categories)}")
+
+# List PDFs stored directly in one category directory
+for pdf in sorted(glob.glob(os.path.join(PWL_PATH, "machine_learning", "*.pdf"))):
+    print(f"  {os.path.basename(pdf)}")
+
+# Search every category README for a topic; \b keeps "raft" from matching "draft"
+pattern = re.compile(r"\b(consensus|paxos|raft)\b", re.IGNORECASE)
+for readme in sorted(glob.glob(os.path.join(PWL_PATH, "*", "README.md"))):
+    with open(readme, encoding="utf-8") as f:
+        if pattern.search(f.read()):
+            print(f"Found in: {os.path.basename(os.path.dirname(readme))}")
+```
+## Reading Group Integration
+```python
+# PWL chapters host monthly meetups worldwide
+# Find local chapters at paperswelove.org
+chapters = {
+    "New York": "meetup.com/papers-we-love",
+    "San Francisco": "meetup.com/papers-we-love-too",
+    "London": "meetup.com/papers-we-love-london",
+    "Berlin": "meetup.com/papers-we-love-berlin",
+    # 40+ chapters globally
+}
+# Video talks on YouTube
+# youtube.com/@PapersWeLove — recorded presentations
+# Each talk: 30-60 min paper walkthrough by practitioner
+```
+## Building a Reading List
+```python
+# Curate a personal reading list from PWL
+essential_distributed = [
+    "Time, Clocks, and the Ordering of Events in a Distributed System (Lamport, 1978)",
+    "The Byzantine Generals Problem (Lamport et al., 1982)",
+    "Impossibility of Distributed Consensus with One Faulty Process (Fischer et al., 1985)",
+    "Paxos Made Simple (Lamport, 2001)",
+    "In Search of an Understandable Consensus Algorithm (Ongaro & Ousterhout, 2014)",
+    "Dynamo: Amazon's Highly Available Key-value Store (DeCandia et al., 2007)",
+    "MapReduce: Simplified Data Processing on Large Clusters (Dean & Ghemawat, 2004)",
+]
+essential_ml = [
+    "A Few Useful Things to Know About Machine Learning (Domingos, 2012)",
+    "Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al., 2014)",
+    "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe & Szegedy, 2015)",
+    "Attention Is All You Need (Vaswani et al., 2017)",
+    "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)",
+]
+```
+## Contributing to PWL
+```markdown
+## How to Contribute
+1. Fork the repository
+2. Add paper PDF to appropriate category directory
+3. Update the category README.md with:
+   - Paper title and authors
+   - Year of publication
+   - Brief description (2-3 sentences)
+   - Why it matters
+4. Submit a pull request
+### README Entry Format
+- :scroll: [Paper Title](link) — Brief description.
+  Authors (Year). *Venue*.
+```
+## Use Cases
+1. **Literature exploration**: Discover landmark papers by topic
+2. **Reading groups**: Structured paper discussions with community
+3. **Course preparation**: Curate reading lists for CS courses
+4. **Onboarding**: Get up to speed on a new research area
+5. **Historical context**: Trace the evolution of CS ideas
+## References
+- [Papers We Love GitHub](https://github.com/papers-we-love/papers-we-love)
+- [Papers We Love Website](https://paperswelove.org/)
+- [PWL YouTube Channel](https://www.youtube.com/@PapersWeLove)

package/skills/literature/discovery/semantic-paper-radar/SKILL.md ADDED

@@ -0,0 +1,144 @@
+---
+name: semantic-paper-radar
+description: "Semantic literature discovery and synthesis using embeddings"
+metadata:
+  openclaw:
+    emoji: "📡"
+    category: "literature"
+    subcategory: "discovery"
+    keywords: ["semantic search", "embeddings", "literature synthesis", "paper discovery", "vector search", "knowledge mapping"]
+    source: "https://github.com/mukulpatnaik/researchgpt"
+---
+# Semantic Paper Radar
+## Overview
+Traditional literature search relies on keyword matching—you find papers that contain the exact terms you search for. Semantic paper discovery goes further by understanding the meaning of research content and finding papers that are conceptually related, even when they use different terminology. This is especially powerful for interdisciplinary research, where the same idea may be expressed in completely different vocabularies across fields.
+The Semantic Paper Radar skill provides methods for using embedding-based semantic search, vector databases, and AI-powered synthesis to build a comprehensive, continuously updated view of the literature relevant to your research. It enables you to discover papers you would never find through keyword search alone and to synthesize findings across large bodies of work.
+This skill covers setting up a personal semantic search index over your paper collection, querying public semantic search APIs, and using LLM-powered analysis to extract themes and connections from clusters of related papers.
+## Semantic Search Fundamentals
+### How Embedding-Based Search Works
+Semantic search represents both your query and each paper as dense numerical vectors (embeddings) in a high-dimensional space. Papers whose embeddings are close to your query's embedding are semantically similar, regardless of the specific words used.
+Key components:
+- **Embedding model**: Converts text to vectors. Models like SPECTER2, SciBERT, or general-purpose models like `text-embedding-3-small` work well for academic text.
+- **Vector database**: Stores and indexes embeddings for fast similarity search. Options include ChromaDB (local), Qdrant, Pinecone, or Weaviate.
+- **Similarity metric**: Cosine similarity is standard for comparing text embeddings.
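To make the similarity metric concrete, here is a small worked example with made-up 3-dimensional vectors standing in for real embeddings, which typically have hundreds of dimensions. The titles and numbers are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up "embeddings"; a real model would produce these from text.
papers = {
    "graph attention networks": [0.9, 0.1, 0.2],
    "message passing on molecules": [0.6, 0.5, 0.1],
    "survey sampling methods": [0.1, 0.9, 0.4],
}
query = [1.0, 0.2, 0.1]

ranked = sorted(papers, key=lambda t: cosine_similarity(query, papers[t]), reverse=True)
print(ranked)
```

Because cosine similarity depends only on direction, a long abstract and a short query can still score as close neighbors when they point the same way in embedding space.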
+### Using Semantic Scholar's Embedding Search
+Semantic Scholar publishes pre-computed SPECTER embeddings for millions of papers, and its Graph API exposes a relevance-ranked paper search:
+```bash
+# Relevance-ranked paper search via the Semantic Scholar Graph API
+curl "https://api.semanticscholar.org/graph/v1/paper/search?query=attention+mechanisms+for+graph+neural+networks&fields=title,abstract,year,citationCount&limit=20"
+```
+Treat this endpoint as a strong text search rather than a guaranteed embedding search: its ranking is not documented as vector-based, so a query like "methods for handling missing values in longitudinal studies" may or may not surface papers on imputation, dropout analysis, or panel data methods that never use the phrase "missing values." For true embedding similarity, request the `embedding` field for candidate papers and compare the vectors yourself, or index them locally as described in the next section.
+### Building a Personal Semantic Index
+For deeper control, build a local semantic search index over your own paper collection:
+```python
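+import chromadb
+from sentence_transformers import SentenceTransformer
+
+# SPECTER (v1) as packaged for sentence-transformers. The SPECTER2 checkpoints
+# (allenai/specter2) are distributed as adapters for the `adapters` library,
+# so loading them through SentenceTransformer is unlikely to work as written.
+model = SentenceTransformer("sentence-transformers/allenai-specter")
+client = chromadb.PersistentClient(path="./paper_index")
+collection = client.get_or_create_collection(
+    name="my_papers",
+    metadata={"hnsw:space": "cosine"}
+)
+
+# Index a paper; SPECTER-style models expect title and abstract joined by [SEP]
+title = "Graph Attention v2"
+abstract = "We propose a novel attention mechanism for graph neural networks..."
+embedding = model.encode(title + "[SEP]" + abstract).tolist()
+collection.add(
+    documents=[abstract],
+    embeddings=[embedding],
+    metadatas=[{"title": title, "year": 2025, "arxiv_id": "2501.xxxxx"}],
+    ids=["paper_001"]
+)
+
+# Query in natural language; results are capped by the number of indexed papers
+results = collection.query(
+    query_embeddings=[model.encode("message passing in GNNs").tolist()],
+    n_results=10
+)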
+```
+This local index lets you search across all papers you have collected using natural language queries. As you add more papers, the index becomes a personalized discovery tool tuned to your specific research interests.
+## Discovery Workflows
+### Concept Expansion Radar
+Use semantic search to expand your awareness beyond your current reading:
+1. **Seed**: Take the abstract of your current paper (or a paragraph describing your research question).
+2. **Search**: Run it as a semantic query against a large corpus (Semantic Scholar, OpenAlex, or your local index).
+3. **Filter**: Remove papers you have already read. Sort by a combination of semantic similarity and recency.
+4. **Cluster**: Group the top 50 results into thematic clusters using k-means or HDBSCAN on their embeddings.
+5. **Explore clusters**: Each cluster represents a related subtopic. Read the most-cited paper in each cluster to understand the connection to your work.
+### Cross-Disciplinary Bridge Detection
+Semantic search excels at finding papers from other fields that address similar problems:
+1. Describe your research problem in plain, non-technical language.
+2. Run this as a semantic query without restricting to your field's journals or categories.
+3. Review results from unexpected fields—these are potential interdisciplinary connections.
+4. For each bridge paper, check its reference list for more domain-specific work in that field.
+### Novelty Radar
+Set up periodic semantic searches to detect new papers in your area:
+1. Define 3-5 "concept vectors" by encoding descriptions of your core research interests.
+2. Weekly, search against newly published papers (last 7 days) from arXiv or Semantic Scholar.
+3. Rank new papers by maximum similarity to any of your concept vectors.
+4. Papers above your similarity threshold enter your reading queue automatically.
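The ranking step of this workflow can be sketched as follows. The vectors are toy 2-dimensional stand-ins for real embeddings, the function name `novelty_radar` is hypothetical, and the 0.9 threshold is an arbitrary value for the example, not a recommended setting.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def novelty_radar(concepts, new_papers, threshold):
    """Return (title, score) pairs whose best concept match meets the threshold."""
    unit_concepts = [normalize(c) for c in concepts]
    hits = []
    for title, vec in new_papers.items():
        unit = normalize(vec)
        # Score each paper by its maximum similarity to any concept vector.
        score = max(sum(a * b for a, b in zip(unit, c)) for c in unit_concepts)
        if score >= threshold:
            hits.append((title, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy vectors: two concept vectors and three newly published papers.
concepts = [[1.0, 0.0], [0.0, 1.0]]
new_papers = {
    "close to concept 1": [0.9, 0.1],
    "between both concepts": [0.5, 0.5],
    "close to concept 2": [0.2, 0.8],
}
radar_hits = novelty_radar(concepts, new_papers, threshold=0.9)
print(radar_hits)
```

Taking the maximum rather than the average keeps a paper that matches one interest strongly from being penalized for ignoring the others. A sensible threshold depends on the embedding model, so calibrate it against papers you have already judged relevant.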
+## Semantic Synthesis
+Once you have discovered a cluster of related papers, use AI-assisted synthesis to extract insights across the collection:
+### Theme Extraction
+Feed the abstracts of a cluster of papers to an LLM and ask for:
+- Common themes and findings across the papers
+- Points of disagreement or contradiction
+- Methodological trends (what approaches are gaining vs. losing popularity)
+- Open questions that none of the papers fully address
+### Evidence Mapping
+Create a structured evidence map from your semantic cluster:
+| Theme | Supporting Papers | Contradicting Papers | Strength of Evidence |
+|-------|-------------------|----------------------|---------------------|
+| Theme A | [1], [3], [7] | [5] | Strong |
+| Theme B | [2], [4] | None | Moderate |
+| Theme C | [6] | [1], [8] | Contested |
+This provides a bird's-eye view of where consensus exists and where debates remain open.
+### Gap Identification
+Compare your research question against the semantic landscape of existing work. Regions of embedding space where your query falls but few papers exist represent potential research gaps—areas where your contribution would be most novel.
+## References
+- Semantic Scholar API: https://api.semanticscholar.org
+- SPECTER2 model: https://huggingface.co/allenai/specter2
+- ChromaDB: https://www.trychroma.com
+- ResearchGPT: https://github.com/mukulpatnaik/researchgpt
+- OpenAlex: https://openalex.org

package/skills/literature/discovery/zotero-arxiv-daily-guide/SKILL.md ADDED

@@ -0,0 +1,94 @@
+---
+name: zotero-arxiv-daily-guide
+description: "Guide to Zotero arXiv Daily for personalized daily paper recommendations"
+metadata:
+  openclaw:
+    emoji: "📰"
+    category: "literature"
+    subcategory: "discovery"
+    keywords: ["zotero", "arxiv", "daily-papers", "recommendations", "preprint", "discovery"]
+    source: "https://github.com/TechPenguineer/zotero-arxiv-daily"
+---
+# Zotero arXiv Daily Guide
+## Overview
+Zotero arXiv Daily is a popular Zotero plugin with over 5,000 GitHub stars that delivers personalized daily paper recommendations from arXiv directly into your Zotero library. By analyzing the papers you already have in your collections, the plugin identifies your research interests and surfaces new preprints that are most relevant to your work.
+Staying current with preprint literature is a well-known challenge for researchers. arXiv receives on the order of a thousand new submissions per working day across dozens of categories, and manually scanning listings or relying solely on keyword alerts often leads to information overload or missed relevant work. Zotero arXiv Daily addresses this by using your existing library as a profile of your interests, producing recommendations that improve as your library grows.
+The plugin integrates naturally into the Zotero workflow. Recommended papers appear in a dedicated collection where you can review titles and abstracts, save promising papers to your working collections, and dismiss irrelevant suggestions. Over time the recommendation engine learns from your accept and dismiss decisions, refining its model of your interests.
+## Installation and Setup
+Install Zotero arXiv Daily through the standard Zotero plugin process:
+1. Download the latest `.xpi` release from https://github.com/TechPenguineer/zotero-arxiv-daily/releases
+2. In Zotero, go to Tools > Add-ons (Tools > Plugins in Zotero 7) > gear icon > Install Add-on From File
+3. Select the `.xpi` file and restart Zotero
+Configure the plugin after installation:
+- Open Zotero Preferences > arXiv Daily
+- Select the arXiv categories relevant to your research (e.g., cs.AI, cs.CL, stat.ML, physics.comp-ph)
+- Choose which Zotero collections to use as the basis for recommendations (your core research collections work best)
+- Set the number of daily recommendations (10-30 is typical)
+- Configure the schedule for fetching new recommendations (daily at a specific time or on-demand)
+- Set up a dedicated Zotero collection where recommendations will appear
+For enhanced recommendation quality, ensure your library collections are well-organized. The algorithm performs better when it can distinguish between your core research interests and peripheral references. Consider creating a dedicated collection of your most representative papers to serve as the recommendation seed.
+## Core Features
+**Personalized Recommendations**: The plugin analyzes titles, abstracts, authors, and citation patterns in your Zotero library to build a profile of your research interests. New arXiv submissions are scored against this profile and the top matches are presented as daily recommendations.
+**Category Filtering**: Select specific arXiv categories to narrow the recommendation scope. This prevents the system from suggesting papers in completely unrelated fields while still allowing cross-disciplinary discoveries within your selected categories.
+**Daily Digest View**: Recommendations appear in a dedicated Zotero collection organized by date. Each entry includes the paper title, authors, abstract, arXiv identifier, and a relevance score indicating how closely it matches your library profile.
+**Quick Actions**: For each recommended paper, you can:
+- Save to a working collection with one click
+- Open the full paper on arXiv
+- Download the PDF directly to Zotero
+- Dismiss the recommendation (improves future suggestions)
+- Add tags for later organization
+**Trend Detection**: The plugin can highlight papers that are receiving unusual attention in your field based on early citation velocity and social media mentions. This helps you identify potentially important work before it becomes widely known.
+**Author Tracking**: When the plugin detects papers by authors who are frequently cited in your library, it flags these with higher priority. This ensures you never miss new work from the researchers most relevant to your field.
+## Research Workflow Integration
+**Morning Review Routine**: Start your research day by spending 10-15 minutes reviewing the daily arXiv recommendations. Scan titles and abstracts, save promising papers to a "To Read" collection, and dismiss irrelevant ones. This disciplined approach keeps you current without consuming excessive time.
+**Literature Review Enhancement**: During active literature review phases, increase the number of daily recommendations and expand the arXiv categories. The plugin helps identify relevant preprints that may not yet appear in traditional databases, giving your review a more comprehensive and timely scope.
+**Collaborative Discovery**: Share your recommended papers collection with lab members through a Zotero group library. This creates a collective discovery mechanism where the entire group benefits from each member's library-driven recommendations.
+**Research Trend Monitoring**: Track which topics appear frequently in your recommendations over weeks and months. Shifts in the recommendation patterns can signal emerging trends in your field, helping you anticipate where the research community is heading.
+**Optimizing Recommendation Quality**:
+- Maintain a well-curated "seed" collection of your most important papers
+- Regularly dismiss irrelevant recommendations to refine the algorithm
+- Update your arXiv category selections as your interests evolve
+- Add newly published papers from your own group to keep the profile current
+- Review recommendations from adjacent categories periodically for cross-disciplinary insights
+## Configuring Notification Preferences
+Control how and when you receive recommendation alerts:
+- **Desktop Notifications**: Enable system notifications when new recommendations arrive
+- **Batch Mode**: Accumulate recommendations and review them at a scheduled time
+- **Threshold Filtering**: Only show recommendations above a configurable relevance score
+- **Keyword Highlighting**: Specify key terms to highlight in recommended paper titles and abstracts
+For researchers who find the default recommendation volume too high, set a higher relevance threshold to receive only the most closely matched papers. Conversely, those in rapidly moving fields may want to lower the threshold and increase the daily count to ensure broad coverage.
+## References
+- GitHub Repository: https://github.com/TechPenguineer/zotero-arxiv-daily
+- arXiv API Documentation: https://info.arxiv.org/help/api
+- Zotero Plugin Directory: https://www.zotero.org/support/plugins
+- arXiv Category Taxonomy: https://arxiv.org/category_taxonomy

package/skills/literature/fulltext/bioc-pmc-api/SKILL.md ADDED

@@ -0,0 +1,146 @@
+---
+name: bioc-pmc-api
+description: "Access PMC Open Access articles in BioC format for text mining"
+metadata:
+  openclaw:
+    emoji: "🧬"
+    category: "literature"
+    subcategory: "fulltext"
+    keywords: ["bioc", "pmc", "text mining", "biomedical nlp", "full text", "pubmed central"]
+    source: "https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/"
+---
+# BioC API for PMC Open Access
+## Overview
+The BioC API provides full-text articles from PubMed Central (PMC) in the BioC format, a simplified XML/JSON structure designed specifically for biomedical text mining. Unlike the standard PMC OAI service (which returns JATS XML), BioC pre-segments text into passages with offset annotations, making it well suited to NLP pipelines, named entity recognition, relation extraction, and other text mining tasks. The API is free and requires no authentication.
+## API Endpoints
+### Base URL
+```
+https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{PMCID}/unicode
+```
+### Retrieve by PMC ID
+```bash
+# JSON format (recommended for programmatic use)
+curl "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/PMC6267067/unicode"
+# XML format
+curl "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/PMC6267067/unicode"
+# ASCII encoding (strips non-ASCII characters)
+curl "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/PMC6267067/ascii"
+```
+### Retrieve by PubMed ID
+```bash
+# Convert PMID to PMCID first, then query
+curl "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=29346600&format=json"
+# Returns: {"records": [{"pmid": "29346600", "pmcid": "PMC6267067", ...}]}
+```
+## BioC JSON Structure
+```json
+{
+  "source": "PMC",
+  "date": "2024-01-15",
+  "key": "collection.key",
+  "documents": [
+    {
+      "id": "PMC6267067",
+      "passages": [
+        {
+          "infons": {
+            "section_type": "TITLE",
+            "type": "title"
+          },
+          "offset": 0,
+          "text": "Article Title Here"
+        },
+        {
+          "infons": {
+            "section_type": "ABSTRACT",
+            "type": "abstract"
+          },
+          "offset": 25,
+          "text": "Background: This study investigates..."
+        },
+        {
+          "infons": {
+            "section_type": "INTRO",
+            "type": "paragraph"
+          },
+          "offset": 350,
+          "text": "The introduction text..."
+        }
+      ]
+    }
+  ]
+}
+```
+Key fields:
+- `passages[].infons.section_type`: TITLE, ABSTRACT, INTRO, METHODS, RESULTS, DISCUSS, CONCL, REF, FIG, TABLE
+- `passages[].offset`: Character offset from document start
+- `passages[].text`: Plain text content of the passage
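A short sketch of how the `offset` field can be used: mapping a document-level position, such as the start of an entity annotation, back to the passage that contains it. The `locate` helper is hypothetical, the passages mirror the sample response above, and offsets are assumed to count characters as stated in the field list.

```python
# Passages shaped like the sample response above.
passages = [
    {"offset": 0, "text": "Article Title Here",
     "infons": {"section_type": "TITLE"}},
    {"offset": 25, "text": "Background: This study investigates...",
     "infons": {"section_type": "ABSTRACT"}},
    {"offset": 350, "text": "The introduction text...",
     "infons": {"section_type": "INTRO"}},
]


def locate(passages, doc_offset):
    """Map a document-level offset to (section_type, index within that passage)."""
    for passage in passages:
        start = passage["offset"]
        end = start + len(passage["text"])
        if start <= doc_offset < end:
            return passage["infons"]["section_type"], doc_offset - start
    return None  # the offset falls in a gap between passages


section, local = locate(passages, 37)
print(section, passages[1]["text"][local:local + 4])
```

Subtracting the passage offset converts a document-level position into an index you can slice with directly, which is what most NER and relation-extraction tooling needs.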
+## Python Usage
+```python
+import requests
+
+BASE = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi"
+
+def get_bioc_article(pmcid: str, fmt: str = "json"):
+    """Fetch a PMC article in BioC format: parsed JSON for "json", raw text for "xml"."""
+    resp = requests.get(f"{BASE}/BioC_{fmt}/{pmcid}/unicode", timeout=30)
+    resp.raise_for_status()
+    return resp.json() if fmt == "json" else resp.text
+
+def extract_sections(bioc_doc) -> dict:
+    """Extract text organized by section type."""
+    # Defensive assumption: unwrap the collection if it arrives inside a list
+    if isinstance(bioc_doc, list):
+        bioc_doc = bioc_doc[0] if bioc_doc else {}
+    sections = {}
+    for doc in bioc_doc.get("documents", []):
+        for passage in doc.get("passages", []):
+            section = passage.get("infons", {}).get("section_type", "OTHER")
+            sections.setdefault(section, []).append(passage.get("text", ""))
+    return {k: "\n".join(v) for k, v in sections.items()}
+
+# Example: fetch and parse
+article = get_bioc_article("PMC6267067")
+sections = extract_sections(article)
+print(f"Title: {sections.get('TITLE', 'N/A')}")
+print(f"Abstract length: {len(sections.get('ABSTRACT', ''))} chars")
+print(f"Sections found: {list(sections.keys())}")
+```
+## Data Coverage
+- **PMC Open Access Subset**: ~4M+ articles with CC licenses
+- **Author Manuscript Collection**: NIH-funded author manuscripts
+- Updates: New articles added daily
+## Rate Limits
+- Follow NCBI standard: **3 requests per second**
+- For bulk access, use the PMC FTP service instead
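+- Add `tool=your_tool_name&email=your@email.com` to requests so NCBI can identify your client and contact you about problems before blocking it; this follows general NCBI etiquette and is not a documented priority queue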
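One simple way to stay under the limit is to space requests evenly. The `Throttle` class below is a hypothetical helper, not part of any NCBI client; the clock and sleep functions are injectable so the spacing logic can be checked without waiting.

```python
import time


class Throttle:
    """Space calls so that no more than `rate` start per second."""

    def __init__(self, rate=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0

    def wait(self):
        """Block until the next request slot, then reserve the one after it."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


throttle = Throttle(rate=3.0)
for pmcid in ["PMC6267067"]:  # extend with your own list of IDs
    throttle.wait()
    # Fetch the article here, e.g. with get_bioc_article from Python Usage.
    print("slot granted for", pmcid)
```

Even spacing is slightly stricter than a 3-per-second burst allowance, which leaves some headroom when several scripts share one IP address.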
+## Citation
+When using this API in publications, cite:
+> Comeau DC, Wei CH, Islamaj Dogan R, Lu Z. PMC text mining subset in BioC: about 3 million full text articles and growing. *Bioinformatics*, btz070, 2019.
+## References
+- [BioC-PMC API Documentation](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/)
+- [BioC Format Specification](http://bioc.sourceforge.net/)
+- [PMC Open Access Subset](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/)