@wentorai/research-plugins 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +204 -0
- package/curated/analysis/README.md +64 -0
- package/curated/domains/README.md +104 -0
- package/curated/literature/README.md +53 -0
- package/curated/research/README.md +62 -0
- package/curated/tools/README.md +87 -0
- package/curated/writing/README.md +61 -0
- package/index.ts +39 -0
- package/mcp-configs/academic-db/ChatSpatial.json +17 -0
- package/mcp-configs/academic-db/academia-mcp.json +17 -0
- package/mcp-configs/academic-db/academic-paper-explorer.json +17 -0
- package/mcp-configs/academic-db/academic-search-mcp-server.json +17 -0
- package/mcp-configs/academic-db/agentinterviews-mcp.json +17 -0
- package/mcp-configs/academic-db/all-in-mcp.json +17 -0
- package/mcp-configs/academic-db/apple-health-mcp.json +17 -0
- package/mcp-configs/academic-db/arxiv-latex-mcp.json +17 -0
- package/mcp-configs/academic-db/arxiv-mcp-server.json +17 -0
- package/mcp-configs/academic-db/bgpt-mcp.json +17 -0
- package/mcp-configs/academic-db/biomcp.json +17 -0
- package/mcp-configs/academic-db/biothings-mcp.json +17 -0
- package/mcp-configs/academic-db/catalysishub-mcp-server.json +17 -0
- package/mcp-configs/academic-db/clinicaltrialsgov-mcp-server.json +17 -0
- package/mcp-configs/academic-db/deep-research-mcp.json +17 -0
- package/mcp-configs/academic-db/dicom-mcp.json +17 -0
- package/mcp-configs/academic-db/enrichr-mcp-server.json +17 -0
- package/mcp-configs/academic-db/fec-mcp-server.json +17 -0
- package/mcp-configs/academic-db/fhir-mcp-server-themomentum.json +17 -0
- package/mcp-configs/academic-db/fhir-mcp.json +19 -0
- package/mcp-configs/academic-db/gget-mcp.json +17 -0
- package/mcp-configs/academic-db/google-researcher-mcp.json +17 -0
- package/mcp-configs/academic-db/idea-reality-mcp.json +17 -0
- package/mcp-configs/academic-db/legiscan-mcp.json +19 -0
- package/mcp-configs/academic-db/lex.json +17 -0
- package/mcp-configs/ai-platform/Adaptive-Graph-of-Thoughts-MCP-server.json +17 -0
- package/mcp-configs/ai-platform/ai-counsel.json +17 -0
- package/mcp-configs/ai-platform/atlas-mcp-server.json +17 -0
- package/mcp-configs/ai-platform/counsel-mcp.json +17 -0
- package/mcp-configs/ai-platform/cross-llm-mcp.json +17 -0
- package/mcp-configs/ai-platform/gptr-mcp.json +17 -0
- package/mcp-configs/browser/decipher-research-agent.json +17 -0
- package/mcp-configs/browser/deep-research.json +17 -0
- package/mcp-configs/browser/everything-claude-code.json +17 -0
- package/mcp-configs/browser/gpt-researcher.json +17 -0
- package/mcp-configs/browser/heurist-agent-framework.json +17 -0
- package/mcp-configs/data-platform/4everland-hosting-mcp.json +17 -0
- package/mcp-configs/data-platform/context-keeper.json +17 -0
- package/mcp-configs/data-platform/context7.json +19 -0
- package/mcp-configs/data-platform/contextstream-mcp.json +17 -0
- package/mcp-configs/data-platform/email-mcp.json +17 -0
- package/mcp-configs/note-knowledge/ApeRAG.json +17 -0
- package/mcp-configs/note-knowledge/In-Memoria.json +17 -0
- package/mcp-configs/note-knowledge/agent-memory.json +17 -0
- package/mcp-configs/note-knowledge/aimemo.json +17 -0
- package/mcp-configs/note-knowledge/biel-mcp.json +19 -0
- package/mcp-configs/note-knowledge/cognee.json +17 -0
- package/mcp-configs/note-knowledge/context-awesome.json +17 -0
- package/mcp-configs/note-knowledge/context-mcp.json +17 -0
- package/mcp-configs/note-knowledge/conversation-handoff-mcp.json +17 -0
- package/mcp-configs/note-knowledge/cortex.json +17 -0
- package/mcp-configs/note-knowledge/devrag.json +17 -0
- package/mcp-configs/note-knowledge/easy-obsidian-mcp.json +17 -0
- package/mcp-configs/note-knowledge/engram.json +17 -0
- package/mcp-configs/note-knowledge/gnosis-mcp.json +17 -0
- package/mcp-configs/note-knowledge/graphlit-mcp-server.json +19 -0
- package/mcp-configs/reference-mgr/arxiv-cli.json +17 -0
- package/mcp-configs/reference-mgr/arxiv-search-mcp.json +17 -0
- package/mcp-configs/reference-mgr/chiken.json +17 -0
- package/mcp-configs/reference-mgr/claude-scholar.json +17 -0
- package/mcp-configs/reference-mgr/devonthink-mcp.json +17 -0
- package/mcp-configs/registry.json +447 -0
- package/openclaw.plugin.json +21 -0
- package/package.json +61 -0
- package/skills/analysis/dataviz/color-accessibility-guide/SKILL.md +230 -0
- package/skills/analysis/dataviz/geospatial-viz-guide/SKILL.md +218 -0
- package/skills/analysis/dataviz/interactive-viz-guide/SKILL.md +287 -0
- package/skills/analysis/dataviz/network-visualization-guide/SKILL.md +195 -0
- package/skills/analysis/dataviz/publication-figures-guide/SKILL.md +238 -0
- package/skills/analysis/dataviz/python-dataviz-guide/SKILL.md +195 -0
- package/skills/analysis/econometrics/causal-inference-guide/SKILL.md +197 -0
- package/skills/analysis/econometrics/iv-regression-guide/SKILL.md +198 -0
- package/skills/analysis/econometrics/panel-data-guide/SKILL.md +274 -0
- package/skills/analysis/econometrics/robustness-checks/SKILL.md +250 -0
- package/skills/analysis/econometrics/stata-regression/SKILL.md +117 -0
- package/skills/analysis/econometrics/time-series-guide/SKILL.md +235 -0
- package/skills/analysis/statistics/bayesian-statistics-guide/SKILL.md +221 -0
- package/skills/analysis/statistics/hypothesis-testing-guide/SKILL.md +210 -0
- package/skills/analysis/statistics/meta-analysis-guide/SKILL.md +206 -0
- package/skills/analysis/statistics/nonparametric-tests-guide/SKILL.md +221 -0
- package/skills/analysis/statistics/power-analysis-guide/SKILL.md +240 -0
- package/skills/analysis/statistics/sem-guide/SKILL.md +231 -0
- package/skills/analysis/statistics/survival-analysis-guide/SKILL.md +195 -0
- package/skills/analysis/wrangling/missing-data-handling/SKILL.md +224 -0
- package/skills/analysis/wrangling/pandas-data-wrangling/SKILL.md +242 -0
- package/skills/analysis/wrangling/questionnaire-design-guide/SKILL.md +234 -0
- package/skills/analysis/wrangling/text-mining-guide/SKILL.md +225 -0
- package/skills/domains/ai-ml/computer-vision-guide/SKILL.md +213 -0
- package/skills/domains/ai-ml/deep-learning-papers-guide/SKILL.md +200 -0
- package/skills/domains/ai-ml/llm-evaluation-guide/SKILL.md +194 -0
- package/skills/domains/ai-ml/prompt-engineering-research/SKILL.md +233 -0
- package/skills/domains/ai-ml/reinforcement-learning-guide/SKILL.md +254 -0
- package/skills/domains/ai-ml/transformer-architecture-guide/SKILL.md +233 -0
- package/skills/domains/biomedical/clinical-research-guide/SKILL.md +232 -0
- package/skills/domains/biomedical/clinicaltrials-api/SKILL.md +177 -0
- package/skills/domains/biomedical/epidemiology-guide/SKILL.md +200 -0
- package/skills/domains/biomedical/genomics-analysis-guide/SKILL.md +270 -0
- package/skills/domains/business/market-analysis-guide/SKILL.md +112 -0
- package/skills/domains/business/strategic-management-guide/SKILL.md +154 -0
- package/skills/domains/chemistry/computational-chemistry-guide/SKILL.md +266 -0
- package/skills/domains/chemistry/retrosynthesis-guide/SKILL.md +215 -0
- package/skills/domains/cs/algorithms-complexity-guide/SKILL.md +194 -0
- package/skills/domains/cs/dblp-api/SKILL.md +129 -0
- package/skills/domains/cs/software-engineering-research/SKILL.md +218 -0
- package/skills/domains/ecology/biodiversity-data-guide/SKILL.md +296 -0
- package/skills/domains/ecology/conservation-biology-guide/SKILL.md +198 -0
- package/skills/domains/ecology/gbif-api/SKILL.md +158 -0
- package/skills/domains/ecology/inaturalist-api/SKILL.md +173 -0
- package/skills/domains/economics/behavioral-economics-guide/SKILL.md +239 -0
- package/skills/domains/economics/development-economics-guide/SKILL.md +181 -0
- package/skills/domains/economics/fred-api/SKILL.md +189 -0
- package/skills/domains/education/curriculum-design-guide/SKILL.md +144 -0
- package/skills/domains/education/learning-science-guide/SKILL.md +150 -0
- package/skills/domains/finance/financial-data-analysis/SKILL.md +152 -0
- package/skills/domains/finance/quantitative-finance-guide/SKILL.md +151 -0
- package/skills/domains/geoscience/climate-science-guide/SKILL.md +158 -0
- package/skills/domains/geoscience/gis-remote-sensing-guide/SKILL.md +129 -0
- package/skills/domains/humanities/digital-humanities-guide/SKILL.md +181 -0
- package/skills/domains/humanities/philosophy-research-guide/SKILL.md +148 -0
- package/skills/domains/law/courtlistener-api/SKILL.md +213 -0
- package/skills/domains/law/legal-research-guide/SKILL.md +250 -0
- package/skills/domains/math/linear-algebra-applications/SKILL.md +227 -0
- package/skills/domains/math/numerical-methods-guide/SKILL.md +236 -0
- package/skills/domains/math/oeis-api/SKILL.md +158 -0
- package/skills/domains/pharma/clinical-pharmacology-guide/SKILL.md +165 -0
- package/skills/domains/pharma/drug-development-guide/SKILL.md +177 -0
- package/skills/domains/physics/computational-physics-guide/SKILL.md +300 -0
- package/skills/domains/physics/nasa-ads-api/SKILL.md +150 -0
- package/skills/domains/physics/quantum-computing-guide/SKILL.md +234 -0
- package/skills/domains/social-science/social-research-methods/SKILL.md +194 -0
- package/skills/domains/social-science/survey-research-guide/SKILL.md +182 -0
- package/skills/literature/discovery/citation-alert-guide/SKILL.md +154 -0
- package/skills/literature/discovery/conference-proceedings-guide/SKILL.md +142 -0
- package/skills/literature/discovery/literature-mapping-guide/SKILL.md +175 -0
- package/skills/literature/discovery/paper-tracking-guide/SKILL.md +211 -0
- package/skills/literature/discovery/rss-paper-feeds/SKILL.md +214 -0
- package/skills/literature/discovery/semantic-scholar-recs-guide/SKILL.md +164 -0
- package/skills/literature/fulltext/doaj-api/SKILL.md +120 -0
- package/skills/literature/fulltext/interlibrary-loan-guide/SKILL.md +163 -0
- package/skills/literature/fulltext/open-access-guide/SKILL.md +183 -0
- package/skills/literature/fulltext/pmc-oai-api/SKILL.md +184 -0
- package/skills/literature/fulltext/preprint-servers-guide/SKILL.md +128 -0
- package/skills/literature/fulltext/repository-harvesting-guide/SKILL.md +207 -0
- package/skills/literature/fulltext/unpaywall-api/SKILL.md +113 -0
- package/skills/literature/metadata/altmetrics-guide/SKILL.md +132 -0
- package/skills/literature/metadata/citation-network-guide/SKILL.md +236 -0
- package/skills/literature/metadata/crossref-api/SKILL.md +133 -0
- package/skills/literature/metadata/datacite-api/SKILL.md +126 -0
- package/skills/literature/metadata/doi-resolution-guide/SKILL.md +168 -0
- package/skills/literature/metadata/h-index-guide/SKILL.md +183 -0
- package/skills/literature/metadata/journal-metrics-guide/SKILL.md +188 -0
- package/skills/literature/metadata/opencitations-api/SKILL.md +128 -0
- package/skills/literature/metadata/orcid-api/SKILL.md +136 -0
- package/skills/literature/metadata/orcid-integration-guide/SKILL.md +178 -0
- package/skills/literature/search/arxiv-api/SKILL.md +95 -0
- package/skills/literature/search/biorxiv-api/SKILL.md +123 -0
- package/skills/literature/search/boolean-search-guide/SKILL.md +199 -0
- package/skills/literature/search/citation-chaining-guide/SKILL.md +148 -0
- package/skills/literature/search/database-comparison-guide/SKILL.md +100 -0
- package/skills/literature/search/europe-pmc-api/SKILL.md +120 -0
- package/skills/literature/search/google-scholar-guide/SKILL.md +182 -0
- package/skills/literature/search/mesh-terms-guide/SKILL.md +164 -0
- package/skills/literature/search/openalex-api/SKILL.md +134 -0
- package/skills/literature/search/pubmed-api/SKILL.md +130 -0
- package/skills/literature/search/scientify-literature-survey/SKILL.md +203 -0
- package/skills/literature/search/semantic-scholar-api/SKILL.md +134 -0
- package/skills/literature/search/systematic-search-strategy/SKILL.md +214 -0
- package/skills/research/automation/ai-scientist-guide/SKILL.md +228 -0
- package/skills/research/automation/data-collection-automation/SKILL.md +248 -0
- package/skills/research/automation/research-workflow-automation/SKILL.md +266 -0
- package/skills/research/deep-research/meta-synthesis-guide/SKILL.md +174 -0
- package/skills/research/deep-research/research-cog/SKILL.md +153 -0
- package/skills/research/deep-research/scoping-review-guide/SKILL.md +217 -0
- package/skills/research/deep-research/systematic-review-guide/SKILL.md +250 -0
- package/skills/research/funding/figshare-api/SKILL.md +163 -0
- package/skills/research/funding/grant-writing-guide/SKILL.md +233 -0
- package/skills/research/funding/nsf-grant-guide/SKILL.md +206 -0
- package/skills/research/funding/open-science-guide/SKILL.md +255 -0
- package/skills/research/funding/zenodo-api/SKILL.md +174 -0
- package/skills/research/methodology/action-research-guide/SKILL.md +201 -0
- package/skills/research/methodology/experimental-design-guide/SKILL.md +236 -0
- package/skills/research/methodology/grad-school-guide/SKILL.md +182 -0
- package/skills/research/methodology/grounded-theory-guide/SKILL.md +171 -0
- package/skills/research/methodology/mixed-methods-guide/SKILL.md +208 -0
- package/skills/research/methodology/qualitative-research-guide/SKILL.md +234 -0
- package/skills/research/methodology/scientify-idea-generation/SKILL.md +222 -0
- package/skills/research/paper-review/paper-reading-assistant/SKILL.md +266 -0
- package/skills/research/paper-review/peer-review-guide/SKILL.md +227 -0
- package/skills/research/paper-review/rebuttal-writing-guide/SKILL.md +185 -0
- package/skills/research/paper-review/scientify-write-review-paper/SKILL.md +209 -0
- package/skills/tools/code-exec/jupyter-notebook-guide/SKILL.md +178 -0
- package/skills/tools/code-exec/python-reproducibility-guide/SKILL.md +341 -0
- package/skills/tools/code-exec/r-reproducibility-guide/SKILL.md +236 -0
- package/skills/tools/code-exec/sandbox-execution-guide/SKILL.md +221 -0
- package/skills/tools/diagram/mermaid-diagram-guide/SKILL.md +269 -0
- package/skills/tools/diagram/plantuml-guide/SKILL.md +397 -0
- package/skills/tools/diagram/scientific-illustration-guide/SKILL.md +225 -0
- package/skills/tools/document/anystyle-api/SKILL.md +199 -0
- package/skills/tools/document/grobid-pdf-parsing/SKILL.md +294 -0
- package/skills/tools/document/markdown-academic-guide/SKILL.md +217 -0
- package/skills/tools/document/pdf-extraction-guide/SKILL.md +321 -0
- package/skills/tools/knowledge-graph/knowledge-graph-construction/SKILL.md +306 -0
- package/skills/tools/knowledge-graph/ontology-design-guide/SKILL.md +214 -0
- package/skills/tools/knowledge-graph/rag-methodology-guide/SKILL.md +325 -0
- package/skills/tools/ocr-translate/formula-recognition-guide/SKILL.md +367 -0
- package/skills/tools/ocr-translate/handwriting-recognition-guide/SKILL.md +211 -0
- package/skills/tools/ocr-translate/latex-ocr-guide/SKILL.md +204 -0
- package/skills/tools/ocr-translate/multilingual-research-guide/SKILL.md +234 -0
- package/skills/tools/scraping/academic-web-scraping/SKILL.md +326 -0
- package/skills/tools/scraping/api-data-collection-guide/SKILL.md +301 -0
- package/skills/tools/scraping/web-scraping-ethics-guide/SKILL.md +250 -0
- package/skills/writing/citation/bibtex-management-guide/SKILL.md +246 -0
- package/skills/writing/citation/citation-style-guide/SKILL.md +248 -0
- package/skills/writing/citation/reference-manager-comparison/SKILL.md +208 -0
- package/skills/writing/citation/zotero-api/SKILL.md +188 -0
- package/skills/writing/composition/abstract-writing-guide/SKILL.md +188 -0
- package/skills/writing/composition/discussion-writing-guide/SKILL.md +194 -0
- package/skills/writing/composition/introduction-writing-guide/SKILL.md +194 -0
- package/skills/writing/composition/literature-review-writing/SKILL.md +196 -0
- package/skills/writing/composition/methods-section-guide/SKILL.md +185 -0
- package/skills/writing/composition/response-to-reviewers/SKILL.md +215 -0
- package/skills/writing/composition/scientific-writing-guide/SKILL.md +152 -0
- package/skills/writing/latex/bibliography-management-guide/SKILL.md +206 -0
- package/skills/writing/latex/latex-drawing-guide/SKILL.md +234 -0
- package/skills/writing/latex/latex-ecosystem-guide/SKILL.md +240 -0
- package/skills/writing/latex/math-typesetting-guide/SKILL.md +231 -0
- package/skills/writing/latex/overleaf-collaboration-guide/SKILL.md +211 -0
- package/skills/writing/latex/tikz-diagrams-guide/SKILL.md +211 -0
- package/skills/writing/polish/academic-translation-guide/SKILL.md +175 -0
- package/skills/writing/polish/academic-writing-refiner/SKILL.md +143 -0
- package/skills/writing/polish/ai-writing-humanizer/SKILL.md +178 -0
- package/skills/writing/polish/grammar-checker-guide/SKILL.md +184 -0
- package/skills/writing/polish/plagiarism-detection-guide/SKILL.md +167 -0
- package/skills/writing/templates/beamer-presentation-guide/SKILL.md +263 -0
- package/skills/writing/templates/conference-paper-template/SKILL.md +219 -0
- package/skills/writing/templates/thesis-template-guide/SKILL.md +200 -0
- package/skills/writing/templates/thesis-writing-guide/SKILL.md +220 -0
- package/src/tools/arxiv.ts +131 -0
- package/src/tools/crossref.ts +112 -0
- package/src/tools/openalex.ts +174 -0
- package/src/tools/pubmed.ts +166 -0
- package/src/tools/semantic-scholar.ts +108 -0
- package/src/tools/unpaywall.ts +58 -0
|
@@ -0,0 +1,326 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: academic-web-scraping
|
|
3
|
+
description: "Ethical web scraping and API-based data collection for research"
|
|
4
|
+
metadata:
|
|
5
|
+
openclaw:
|
|
6
|
+
emoji: "🌐"
|
|
7
|
+
category: "tools"
|
|
8
|
+
subcategory: "scraping"
|
|
9
|
+
keywords: ["web scraping", "API data collection", "web search strategies", "data extraction"]
|
|
10
|
+
source: "N/A"
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# Academic Web Scraping Guide
|
|
14
|
+
|
|
15
|
+
## Overview
|
|
16
|
+
|
|
17
|
+
Research often requires collecting data from the web -- whether it is bibliographic metadata from academic databases, experimental datasets from public repositories, social media posts for computational social science, or economic indicators from government portals. Web scraping and API-based data collection are essential skills for modern researchers across disciplines.
|
|
18
|
+
|
|
19
|
+
This guide covers both approaches: structured API access for platforms that provide one, and web scraping for when no API exists. It emphasizes ethical data collection practices, including respecting robots.txt, rate limiting, terms of service compliance, and IRB considerations for human-subject data. The goal is to collect research data reliably and responsibly.
|
|
20
|
+
|
|
21
|
+
Whether you are building a dataset for a machine learning paper, collecting metadata for a systematic review, or gathering public data for policy research, these patterns help you do it correctly and efficiently.
|
|
22
|
+
|
|
23
|
+
## API-Based Data Collection
|
|
24
|
+
|
|
25
|
+
APIs are always preferable to scraping when available. They provide structured data, are officially supported, and have clear usage terms.
|
|
26
|
+
|
|
27
|
+
### Academic APIs
|
|
28
|
+
|
|
29
|
+
| API | Data | Rate Limit | Auth |
|
|
30
|
+
|-----|------|-----------|------|
|
|
31
|
+
| Semantic Scholar | Papers, authors, citations | 100 req/sec (with key) | API key (free) |
|
|
32
|
+
| OpenAlex | Papers, authors, venues, concepts | 100K req/day | Email in header |
|
|
33
|
+
| Crossref | DOI metadata | 50 req/sec (polite pool) | Email in header |
|
|
34
|
+
| PubMed (Entrez) | Biomedical literature | 10 req/sec (with key) | API key (free) |
|
|
35
|
+
| arXiv | Preprints | 1 req/3sec | None |
|
|
36
|
+
| CORE | Open access papers | 10 req/sec | API key (free) |
|
|
37
|
+
|
|
38
|
+
### Example: Collecting Papers from OpenAlex
|
|
39
|
+
|
|
40
|
+
```python
|
|
41
|
+
import requests
|
|
42
|
+
import time
|
|
43
|
+
|
|
44
|
+
class OpenAlexClient:
    """Minimal OpenAlex REST client that identifies the caller for the polite pool."""

    BASE_URL = "https://api.openalex.org"

    def __init__(self, email):
        """Open a session whose User-Agent carries the contact email (polite pool)."""
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': f'ResearchBot/1.0 (mailto:{email})'
        })

    def search_works(self, query, filters=None, per_page=25, max_results=100):
        """Search for works with optional filters."""
        collected = []
        page_number = 1

        while True:
            remaining = max_results - len(collected)
            if remaining <= 0:
                break

            params = {
                'search': query,
                'per_page': min(per_page, remaining),
                'page': page_number,
            }
            if filters:
                params['filter'] = ','.join(f'{k}:{v}' for k, v in filters.items())

            response = self.session.get(f'{self.BASE_URL}/works', params=params)
            response.raise_for_status()
            payload = response.json()

            batch = payload.get('results', [])
            if not batch:
                # No more pages available from the API.
                break

            collected.extend(batch)
            page_number += 1
            time.sleep(0.1)  # Polite rate limiting between page requests

        return collected[:max_results]

    def get_work(self, openalex_id):
        """Get a single work by OpenAlex ID."""
        response = self.session.get(f'{self.BASE_URL}/works/{openalex_id}')
        response.raise_for_status()
        return response.json()
|
|
86
|
+
|
|
87
|
+
# Usage: collect up to 200 recent open-access papers matching the query.
client = OpenAlexClient(email="researcher@university.edu")
papers = client.search_works(
    "transformer attention mechanism",
    # Filters are joined as 'key:value' pairs for OpenAlex's filter parameter.
    filters={
        'publication_year': '2023-2024',
        # NOTE(review): OpenAlex's work-type vocabulary uses 'article';
        # confirm 'journal-article' actually matches records before relying on it.
        'type': 'journal-article',
        'open_access.is_oa': 'true'
    },
    max_results=200
)

# Preview the first five hits; each work dict carries these fields per the
# OpenAlex works schema read here (title, publication_year, doi, cited_by_count).
for paper in papers[:5]:
    print(f"- {paper['title']} ({paper['publication_year']})")
    print(f" DOI: {paper['doi']}")
    print(f" Citations: {paper['cited_by_count']}")
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
### Example: PubMed Entrez API
|
|
106
|
+
|
|
107
|
+
```python
|
|
108
|
+
import os

from Bio import Entrez

# NCBI requires an identifying email; an API key (optional) allows a higher
# request rate per the Entrez usage guidelines.
Entrez.email = "researcher@university.edu"
Entrez.api_key = os.environ.get("NCBI_API_KEY")  # optional

def search_pubmed(query, max_results=100):
    """Search PubMed and retrieve article details.

    Args:
        query: PubMed query string (field tags and boolean operators allowed).
        max_results: Maximum number of PMIDs to fetch details for.

    Returns:
        List of dicts with keys 'pmid', 'title', 'abstract', 'journal', 'year'.
        Empty list when the search matches nothing.
    """
    # Step 1: esearch returns matching PMIDs (no article content yet).
    handle = Entrez.esearch(db="pubmed", term=query,
                            retmax=max_results, sort="relevance")
    search_results = Entrez.read(handle)
    id_list = search_results["IdList"]

    if not id_list:
        return []

    # Step 2: efetch retrieves the full XML records for all PMIDs in one call.
    handle = Entrez.efetch(db="pubmed", id=id_list,
                           rettype="xml", retmode="xml")
    records = Entrez.read(handle)

    articles = []
    for article in records['PubmedArticle']:
        medline = article['MedlineCitation']
        art_info = medline['Article']
        articles.append({
            'pmid': str(medline['PMID']),
            'title': art_info.get('ArticleTitle', ''),
            # AbstractText is a list of sections; keep the first one only.
            'abstract': art_info.get('Abstract', {}).get(
                'AbstractText', [''])[0] if 'Abstract' in art_info else '',
            'journal': art_info['Journal']['Title'],
            'year': art_info['Journal']['JournalIssue'].get(
                'PubDate', {}).get('Year', ''),
        })

    return articles
|
|
144
|
+
```
|
|
145
|
+
|
|
146
|
+
## Web Scraping Fundamentals
|
|
147
|
+
|
|
148
|
+
When no API exists, scraping becomes necessary. Always check for an API first.
|
|
149
|
+
|
|
150
|
+
### Tools Comparison
|
|
151
|
+
|
|
152
|
+
| Tool | Type | JavaScript Support | Speed | Learning Curve |
|
|
153
|
+
|------|------|-------------------|-------|---------------|
|
|
154
|
+
| requests + BeautifulSoup | HTTP + parsing | No | Fast | Low |
|
|
155
|
+
| Scrapy | Framework | No (without middleware) | Very fast | Medium |
|
|
156
|
+
| Selenium | Browser automation | Yes | Slow | Medium |
|
|
157
|
+
| Playwright | Browser automation | Yes | Medium | Medium |
|
|
158
|
+
| httpx | Async HTTP | No | Very fast | Low |
|
|
159
|
+
|
|
160
|
+
### Basic Scraping with BeautifulSoup
|
|
161
|
+
|
|
162
|
+
```python
|
|
163
|
+
import requests
|
|
164
|
+
from bs4 import BeautifulSoup
|
|
165
|
+
import time
|
|
166
|
+
|
|
167
|
+
def scrape_conference_proceedings(url, delay=2.0):
    """Scrape paper titles and links from a conference page."""
    # Identify ourselves so site operators know who is crawling and why.
    request_headers = {
        'User-Agent': 'ResearchBot/1.0 (Academic research; contact@university.edu)'
    }

    page = requests.get(url, headers=request_headers, timeout=30)
    page.raise_for_status()

    document = BeautifulSoup(page.text, 'html.parser')

    extracted = []
    for entry in document.select('.paper-item, .proceeding-entry'):
        heading = entry.select_one('.title, h3, h4')
        if not heading:
            continue
        anchor = entry.select_one('a[href]')
        byline = entry.select_one('.authors, .author-list')
        extracted.append({
            'title': heading.get_text(strip=True),
            'url': anchor['href'] if anchor else None,
            'authors': byline.get_text(strip=True) if byline else '',
        })

    time.sleep(delay)  # Respect the server
    return extracted
|
|
193
|
+
```
|
|
194
|
+
|
|
195
|
+
### Handling JavaScript-Rendered Pages
|
|
196
|
+
|
|
197
|
+
```python
|
|
198
|
+
from playwright.sync_api import sync_playwright
|
|
199
|
+
|
|
200
|
+
def scrape_dynamic_page(url):
    """Scrape a JavaScript-rendered page using Playwright."""
    with sync_playwright() as pw:
        chromium = pw.chromium.launch(headless=True)
        tab = chromium.new_page()
        tab.goto(url, wait_until='networkidle')

        # Block until the results list has actually been rendered.
        tab.wait_for_selector('.results-container', timeout=10000)

        scraped = []
        for element in tab.query_selector_all('.result-item'):
            title_node = element.query_selector('.title')
            scraped.append({
                'title': title_node.inner_text() if title_node else '',
            })

        chromium.close()
        return scraped
|
|
221
|
+
```
|
|
222
|
+
|
|
223
|
+
## Ethical Guidelines
|
|
224
|
+
|
|
225
|
+
### The Researcher's Scraping Checklist
|
|
226
|
+
|
|
227
|
+
1. **Check for an API first.** Most academic platforms have one.
|
|
228
|
+
2. **Read robots.txt.** `https://example.com/robots.txt` specifies what is allowed.
|
|
229
|
+
3. **Review Terms of Service.** Some sites explicitly prohibit scraping.
|
|
230
|
+
4. **Rate limit aggressively.** 1 request per 2-5 seconds minimum. Never parallelize without permission.
|
|
231
|
+
5. **Identify yourself.** Include your email and institution in the User-Agent header.
|
|
232
|
+
6. **Minimize data collection.** Only collect what your research question requires.
|
|
233
|
+
7. **Consider IRB requirements.** If collecting data about identifiable humans, consult your IRB.
|
|
234
|
+
8. **Store data securely.** Follow your institution's data management policies.
|
|
235
|
+
9. **Cite your data sources.** Acknowledge where the data came from in your publications.
|
|
236
|
+
10. **Check copyright.** Scraping publicly visible data does not mean you can redistribute it.
|
|
237
|
+
|
|
238
|
+
### robots.txt Parsing
|
|
239
|
+
|
|
240
|
+
```python
|
|
241
|
+
from urllib.robotparser import RobotFileParser
|
|
242
|
+
|
|
243
|
+
def can_scrape(url, user_agent='*'):
    """Check if scraping a URL is allowed by robots.txt.

    Args:
        url: Full URL of the page you intend to fetch.
        user_agent: User-agent token to check rules against ('*' = any).

    Returns:
        Dict with 'allowed' (bool) and 'crawl_delay' (seconds). When
        robots.txt cannot be retrieved, returns allowed=True with the
        default delay -- the conventional reading of a missing robots.txt.
    """
    from urllib.parse import urlparse
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except (OSError, ValueError):
        # Fix: rp.read() raises (URLError is an OSError) when robots.txt is
        # unreachable; previously this crashed the check. A site without a
        # reachable robots.txt imposes no crawling restrictions.
        return {'allowed': True, 'crawl_delay': 1.0}

    allowed = rp.can_fetch(user_agent, url)
    crawl_delay = rp.crawl_delay(user_agent)

    return {
        'allowed': allowed,
        # No Crawl-delay directive (None) -> default to 1 second.
        'crawl_delay': crawl_delay or 1.0,
    }
|
|
260
|
+
```
|
|
261
|
+
|
|
262
|
+
## Data Storage and Export
|
|
263
|
+
|
|
264
|
+
### Saving Results Reliably
|
|
265
|
+
|
|
266
|
+
```python
|
|
267
|
+
import json
|
|
268
|
+
import csv
|
|
269
|
+
from pathlib import Path
|
|
270
|
+
from datetime import datetime
|
|
271
|
+
|
|
272
|
+
class DataCollector:
    """Write collected records to timestamped JSON/CSV files.

    Each instance captures a timestamp at construction, so files from
    separate collection runs never overwrite one another.
    """

    def __init__(self, output_dir='collected_data'):
        """Create the output directory (including parents) if missing."""
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

    def save_json(self, data, filename):
        """Save records to '<filename>_<timestamp>.json' in the output dir."""
        # Fix: the filename argument was previously ignored (a literal
        # placeholder was baked into the path instead).
        path = self.output_dir / f'{filename}_{self.timestamp}.json'
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        print(f"Saved {len(data)} records to {path}")

    def save_csv(self, data, filename, fieldnames=None):
        """Save records to '<filename>_<timestamp>.csv'.

        Args:
            data: List of dicts. Keys of the first dict define the columns
                unless fieldnames is given; extra keys are ignored.
            filename: Base name for the output file.
            fieldnames: Optional explicit column list/order.
        """
        if not data:
            return
        if fieldnames is None:
            fieldnames = list(data[0].keys())

        # Fix: use the filename argument rather than a hard-coded placeholder.
        path = self.output_dir / f'{filename}_{self.timestamp}.csv'
        with open(path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames,
                                    extrasaction='ignore')
            writer.writeheader()
            writer.writerows(data)
        print(f"Saved {len(data)} records to {path}")

    def save_checkpoint(self, data, filename):
        """Save intermediate results for resumable collection."""
        # Fix: use the filename argument rather than a hard-coded placeholder.
        # Checkpoints are deliberately un-timestamped so resuming overwrites
        # the previous checkpoint for the same collection.
        path = self.output_dir / f'{filename}_checkpoint.json'
        with open(path, 'w', encoding='utf-8') as f:
            json.dump({
                'timestamp': self.timestamp,
                'n_records': len(data),
                'data': data,
            }, f, indent=2, ensure_ascii=False)
|
|
307
|
+
```
|
|
308
|
+
|
|
309
|
+
## Best Practices
|
|
310
|
+
|
|
311
|
+
- **Always prefer APIs over scraping.** APIs are more reliable, structured, and legally clear.
|
|
312
|
+
- **Implement exponential backoff.** If a request fails, wait 1s, then 2s, then 4s before retrying.
|
|
313
|
+
- **Save checkpoints.** For large collections, save progress incrementally so you can resume after interruptions.
|
|
314
|
+
- **Log everything.** Record which URLs were accessed, when, and what was returned for reproducibility.
|
|
315
|
+
- **Test on a small sample first.** Verify your parsing logic on 10 records before running on 10,000.
|
|
316
|
+
- **Respect rate limits.** Getting blocked hurts everyone -- other researchers included.
|
|
317
|
+
- **Document your collection methodology.** Your paper's Methods section should describe how data was collected, when, and what filters were applied.
|
|
318
|
+
|
|
319
|
+
## References
|
|
320
|
+
|
|
321
|
+
- [OpenAlex API Documentation](https://docs.openalex.org/) -- Open bibliographic data API
|
|
322
|
+
- [Semantic Scholar API](https://api.semanticscholar.org/) -- Paper and author data
|
|
323
|
+
- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) -- HTML parsing
|
|
324
|
+
- [Scrapy Documentation](https://docs.scrapy.org/) -- Web scraping framework
|
|
325
|
+
- [Playwright Documentation](https://playwright.dev/python/) -- Browser automation
|
|
326
|
+
- [Web Scraping Ethics](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01) -- Ethical considerations
|
|
@@ -0,0 +1,301 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: api-data-collection-guide
|
|
3
|
+
description: "API-based data collection and web scraping for research"
|
|
4
|
+
metadata:
|
|
5
|
+
openclaw:
|
|
6
|
+
emoji: "spider"
|
|
7
|
+
category: "tools"
|
|
8
|
+
subcategory: "scraping"
|
|
9
|
+
keywords: ["API data collection", "web search strategies", "data extraction", "web scraping"]
|
|
10
|
+
source: "wentor-research-plugins"
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# API Data Collection Guide
|
|
14
|
+
|
|
15
|
+
Collect research data from web APIs and structured sources using Python, with proper rate limiting, error handling, pagination, and ethical considerations.
|
|
16
|
+
|
|
17
|
+
## API vs. Web Scraping
|
|
18
|
+
|
|
19
|
+
| Approach | When to Use | Reliability | Legal Risk |
|
|
20
|
+
|----------|------------|-------------|------------|
|
|
21
|
+
| Official API | API exists and provides needed data | High | Low (within TOS) |
|
|
22
|
+
| Unofficial API | Browser dev tools reveal JSON endpoints | Medium | Medium |
|
|
23
|
+
| Web scraping | No API available, data is publicly accessible | Low (pages change) | Medium-High |
|
|
24
|
+
| Bulk data download | Provider offers data dumps | High | Low |
|
|
25
|
+
|
|
26
|
+
**Always prefer official APIs over scraping**. Check the data provider's developer documentation first, or search an API directory such as RapidAPI. (ProgrammableWeb, once the standard directory, was retired.)
|
|
27
|
+
|
|
28
|
+
## RESTful API Fundamentals
|
|
29
|
+
|
|
30
|
+
### HTTP Methods
|
|
31
|
+
|
|
32
|
+
| Method | Purpose | Example |
|
|
33
|
+
|--------|---------|---------|
|
|
34
|
+
| GET | Retrieve data | `GET /api/papers?q=machine+learning` |
|
|
35
|
+
| POST | Create or submit data | `POST /api/annotations` |
|
|
36
|
+
| PUT | Update existing data | `PUT /api/papers/123` |
|
|
37
|
+
| DELETE | Remove data | `DELETE /api/papers/123` |
|
|
38
|
+
|
|
39
|
+
### Common Response Codes
|
|
40
|
+
|
|
41
|
+
| Code | Meaning | Action |
|
|
42
|
+
|------|---------|--------|
|
|
43
|
+
| 200 | Success | Process response |
|
|
44
|
+
| 201 | Created | Resource created successfully |
|
|
45
|
+
| 400 | Bad request | Fix query parameters |
|
|
46
|
+
| 401 | Unauthorized | Check API key |
|
|
47
|
+
| 403 | Forbidden | Access denied; check permissions |
|
|
48
|
+
| 404 | Not found | Resource does not exist |
|
|
49
|
+
| 429 | Rate limited | Wait and retry with backoff |
|
|
50
|
+
| 500 | Server error | Retry later |
|
|
51
|
+
|
|
52
|
+
## Python API Client Template
|
|
53
|
+
|
|
54
|
+
```python
|
|
55
|
+
import requests
|
|
56
|
+
import time
|
|
57
|
+
import json
|
|
58
|
+
import logging
|
|
59
|
+
from pathlib import Path
|
|
60
|
+
from datetime import datetime
|
|
61
|
+
|
|
62
|
+
logging.basicConfig(level=logging.INFO)
|
|
63
|
+
logger = logging.getLogger(__name__)
|
|
64
|
+
|
|
65
|
+
class APIClient:
    """Reusable API client with rate limiting, retries, and caching.

    Successful (HTTP 200) GET responses are cached on disk under
    ``./cache``, keyed by an MD5 hash of the endpoint and query
    parameters, so repeated runs of a collection script do not
    re-hit the API.
    """

    def __init__(self, base_url, api_key=None, rate_limit=1.0, max_retries=3):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        if api_key:
            self.session.headers["Authorization"] = f"Bearer {api_key}"
        self.session.headers["User-Agent"] = "ResearchCollector/1.0 (academic research)"
        self.rate_limit = rate_limit  # minimum seconds between requests
        self.max_retries = max_retries
        self.last_request_time = 0
        self.cache_dir = Path("./cache")
        self.cache_dir.mkdir(exist_ok=True)

    def _rate_limit_wait(self):
        """Enforce minimum time between requests."""
        elapsed = time.time() - self.last_request_time
        if elapsed < self.rate_limit:
            time.sleep(self.rate_limit - elapsed)
        self.last_request_time = time.time()

    def _get_cache_key(self, endpoint, params):
        """Generate a deterministic cache key from the request.

        MD5 is used purely as a short filename hash, not for security.
        """
        import hashlib
        key_string = f"{endpoint}_{json.dumps(params, sort_keys=True)}"
        return hashlib.md5(key_string.encode()).hexdigest()

    def get(self, endpoint, params=None, use_cache=True):
        """Make a GET request with rate limiting, retries, and caching.

        Returns:
            The decoded JSON payload, or None on a non-retryable
            error or after max_retries failed attempts.
        """
        cache_key = self._get_cache_key(endpoint, params or {})
        cache_file = self.cache_dir / f"{cache_key}.json"

        # Serve from cache when available (only 200 responses are cached).
        if use_cache and cache_file.exists():
            logger.debug(f"Cache hit: {endpoint}")
            return json.loads(cache_file.read_text())

        url = f"{self.base_url}/{endpoint.lstrip('/')}"

        for attempt in range(self.max_retries):
            self._rate_limit_wait()
            try:
                response = self.session.get(url, params=params, timeout=30)

                if response.status_code == 200:
                    data = response.json()
                    # Save to cache for subsequent runs
                    cache_file.write_text(json.dumps(data))
                    return data

                elif response.status_code == 429:
                    # Honor the server's Retry-After header when present.
                    retry_after = int(response.headers.get("Retry-After", 60))
                    logger.warning(f"Rate limited. Waiting {retry_after}s...")
                    time.sleep(retry_after)

                elif response.status_code >= 500:
                    logger.warning(f"Server error {response.status_code}. Retry {attempt+1}/{self.max_retries}")
                    time.sleep(2 ** attempt)  # Exponential backoff

                else:
                    # Other 4xx: retrying will not help, so give up now.
                    logger.error(f"Request failed: {response.status_code} {response.text[:200]}")
                    return None

            except requests.exceptions.RequestException as e:
                logger.error(f"Request exception: {e}")
                time.sleep(2 ** attempt)

        logger.error(f"Max retries exceeded for {endpoint}")
        return None

    def paginate(self, endpoint, params=None, page_key="page",
                 results_key="results", max_pages=100):
        """Automatically paginate through all results.

        Stops when a page is empty, a request fails, a short page is
        returned, or max_pages is reached.
        """
        # Copy so the page counter is not written back into the
        # caller's dict (the original mutated the argument in place).
        params = dict(params) if params else {}
        all_results = []
        page = 1

        while page <= max_pages:
            params[page_key] = page
            data = self.get(endpoint, params)

            if not data or not data.get(results_key):
                break

            results = data[results_key]
            all_results.extend(results)
            logger.info(f"Page {page}: {len(results)} results (total: {len(all_results)})")

            # A short page means there is nothing after it.
            if len(results) < params.get("per_page", params.get("limit", 20)):
                break

            page += 1

        return all_results
|
|
161
|
+
```
|
|
162
|
+
|
|
163
|
+
## Academic API Examples
|
|
164
|
+
|
|
165
|
+
### OpenAlex (Open Scholarly Metadata)
|
|
166
|
+
|
|
167
|
+
```python
|
|
168
|
+
# OpenAlex: free, comprehensive, no authentication required
client = APIClient("https://api.openalex.org", rate_limit=0.1)

# Search for works, most-cited first
results = client.get("works", params={
    "filter": "title.search:transformer attention mechanism",
    "sort": "cited_by_count:desc",
    "per_page": 25
})

# client.get returns None on failure, so guard before iterating
for work in (results or {}).get("results", []):
    print(f"[{work.get('publication_year')}] {work.get('title')}")
    print(f"  Citations: {work.get('cited_by_count')}")
    print(f"  DOI: {work.get('doi')}")
|
|
182
|
+
```
|
|
183
|
+
|
|
184
|
+
### CrossRef (DOI Metadata)
|
|
185
|
+
|
|
186
|
+
```python
|
|
187
|
+
# CrossRef's "polite pool" asks clients to identify themselves with a mailto
client = APIClient("https://api.crossref.org", rate_limit=0.05)
client.session.headers["User-Agent"] = "ResearchClaw/1.0 (mailto:researcher@university.edu)"

# Search for works
results = client.get("works", params={
    "query": "machine learning drug discovery",
    "rows": 20,
    "sort": "relevance",
    "order": "desc"
})

# client.get returns None on failure, so guard before iterating.
# CrossRef may also return an empty title list, so guard the [0] too.
for item in (results or {}).get("message", {}).get("items", []):
    title = (item.get("title") or ["N/A"])[0]
    doi = item.get("DOI", "N/A")
    cited = item.get("is-referenced-by-count", 0)
    print(f"  {title} | DOI: {doi} | Cited: {cited}")
|
|
203
|
+
```
|
|
204
|
+
|
|
205
|
+
### GitHub API (Code and Repositories)
|
|
206
|
+
|
|
207
|
+
```python
|
|
208
|
+
import os  # GITHUB_TOKEN must be set in the environment

# GitHub API for finding research code repositories
client = APIClient("https://api.github.com", api_key=os.environ["GITHUB_TOKEN"], rate_limit=0.75)

# Search repositories
results = client.get("search/repositories", params={
    "q": "topic:machine-learning language:python stars:>100",
    "sort": "stars",
    "order": "desc",
    "per_page": 30
})

# client.get returns None on failure; the API sends JSON null for
# repositories without a description, so fall back with `or`.
for repo in (results or {}).get("items", []):
    print(f"{repo['full_name']} ({repo['stargazers_count']} stars)")
    print(f"  {(repo.get('description') or 'No description')[:80]}")
|
|
222
|
+
```
|
|
223
|
+
|
|
224
|
+
## Web Scraping (When APIs Are Unavailable)
|
|
225
|
+
|
|
226
|
+
```python
|
|
227
|
+
import requests
|
|
228
|
+
from bs4 import BeautifulSoup
|
|
229
|
+
import time
|
|
230
|
+
|
|
231
|
+
def scrape_conference_proceedings(url, delay=2.0):
    """Scrape paper titles, authors, and abstracts from a conference page.

    Args:
        url: Page listing papers in ``div.paper-entry`` elements.
        delay: Seconds to sleep after the request (politeness delay).

    Returns:
        List of dicts with "title", "authors", "abstract" keys;
        missing fields fall back to "N/A".

    Raises:
        requests.HTTPError: If the page returns a non-2xx status.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Research Bot; academic research only)"
    }

    # timeout prevents the collector hanging forever on a stalled server
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    papers = []
    for article in soup.find_all("div", class_="paper-entry"):
        title = article.find("h3")
        authors = article.find("span", class_="authors")
        abstract = article.find("p", class_="abstract")

        papers.append({
            "title": title.text.strip() if title else "N/A",
            "authors": authors.text.strip() if authors else "N/A",
            "abstract": abstract.text.strip() if abstract else "N/A"
        })

    time.sleep(delay)  # Be polite
    return papers
|
|
256
|
+
```
|
|
257
|
+
|
|
258
|
+
## Data Storage and Management
|
|
259
|
+
|
|
260
|
+
```python
|
|
261
|
+
import pandas as pd
|
|
262
|
+
import sqlite3
|
|
263
|
+
|
|
264
|
+
def save_to_sqlite(data, db_path="research_data.db", table_name="papers"):
    """Save collected data to a SQLite database.

    Args:
        data: Iterable of dict records (one row each).
        db_path: Path of the SQLite file (created if missing).
        table_name: Table to append to (created if missing).
    """
    import logging  # local import keeps this snippet self-contained

    df = pd.DataFrame(data)
    conn = sqlite3.connect(db_path)
    try:
        df.to_sql(table_name, conn, if_exists="append", index=False)
    finally:
        # Close even if to_sql raises, so the db file is not left locked.
        conn.close()
    logging.getLogger(__name__).info(
        f"Saved {len(df)} records to {db_path}:{table_name}"
    )
|
|
271
|
+
|
|
272
|
+
def save_incremental_json(data, output_file="collected_data.jsonl"):
    """Append records as JSON Lines (one JSON object per line).

    Append mode makes this safe to call repeatedly during a long
    collection run; each call adds to the same file.
    """
    # Explicit UTF-8 so output does not depend on the platform's
    # default locale encoding; ensure_ascii=False keeps non-ASCII
    # text (author names, titles) readable in the file.
    with open(output_file, "a", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
|
|
277
|
+
```
|
|
278
|
+
|
|
279
|
+
## Ethical and Legal Considerations
|
|
280
|
+
|
|
281
|
+
| Principle | Description |
|
|
282
|
+
|-----------|-------------|
|
|
283
|
+
| **Respect robots.txt** | Check `robots.txt` before scraping any site |
|
|
284
|
+
| **Rate limiting** | Never exceed 1 request/second unless the API permits more |
|
|
285
|
+
| **Identify yourself** | Use a descriptive User-Agent with contact email |
|
|
286
|
+
| **Terms of Service** | Read and follow the API/website TOS |
|
|
287
|
+
| **Data minimization** | Only collect data you actually need |
|
|
288
|
+
| **Privacy** | Do not scrape personal data without consent |
|
|
289
|
+
| **Acknowledge sources** | Cite data sources in publications |
|
|
290
|
+
| **IRB review** | Consult your IRB if collecting human-related data |
|
|
291
|
+
|
|
292
|
+
## Troubleshooting Common Issues
|
|
293
|
+
|
|
294
|
+
| Problem | Cause | Solution |
|
|
295
|
+
|---------|-------|----------|
|
|
296
|
+
| 403 Forbidden | Missing or incorrect authentication | Check API key, update User-Agent |
|
|
297
|
+
| Timeout errors | Slow server or large response | Increase timeout, reduce page size |
|
|
298
|
+
| Inconsistent data | API schema changed | Version-lock API endpoints, validate schema |
|
|
299
|
+
| Missing fields | Optional fields are null | Use `.get()` with defaults, handle None |
|
|
300
|
+
| Encoding errors | Non-UTF8 characters | Set `response.encoding = "utf-8"`, use `errors="replace"` |
|
|
301
|
+
| IP blocking | Too many requests | Use exponential backoff, rotate IPs (with caution) |
|