@wentorai/research-plugins 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +204 -0
- package/curated/analysis/README.md +64 -0
- package/curated/domains/README.md +104 -0
- package/curated/literature/README.md +53 -0
- package/curated/research/README.md +62 -0
- package/curated/tools/README.md +87 -0
- package/curated/writing/README.md +61 -0
- package/index.ts +39 -0
- package/mcp-configs/academic-db/ChatSpatial.json +17 -0
- package/mcp-configs/academic-db/academia-mcp.json +17 -0
- package/mcp-configs/academic-db/academic-paper-explorer.json +17 -0
- package/mcp-configs/academic-db/academic-search-mcp-server.json +17 -0
- package/mcp-configs/academic-db/agentinterviews-mcp.json +17 -0
- package/mcp-configs/academic-db/all-in-mcp.json +17 -0
- package/mcp-configs/academic-db/apple-health-mcp.json +17 -0
- package/mcp-configs/academic-db/arxiv-latex-mcp.json +17 -0
- package/mcp-configs/academic-db/arxiv-mcp-server.json +17 -0
- package/mcp-configs/academic-db/bgpt-mcp.json +17 -0
- package/mcp-configs/academic-db/biomcp.json +17 -0
- package/mcp-configs/academic-db/biothings-mcp.json +17 -0
- package/mcp-configs/academic-db/catalysishub-mcp-server.json +17 -0
- package/mcp-configs/academic-db/clinicaltrialsgov-mcp-server.json +17 -0
- package/mcp-configs/academic-db/deep-research-mcp.json +17 -0
- package/mcp-configs/academic-db/dicom-mcp.json +17 -0
- package/mcp-configs/academic-db/enrichr-mcp-server.json +17 -0
- package/mcp-configs/academic-db/fec-mcp-server.json +17 -0
- package/mcp-configs/academic-db/fhir-mcp-server-themomentum.json +17 -0
- package/mcp-configs/academic-db/fhir-mcp.json +19 -0
- package/mcp-configs/academic-db/gget-mcp.json +17 -0
- package/mcp-configs/academic-db/google-researcher-mcp.json +17 -0
- package/mcp-configs/academic-db/idea-reality-mcp.json +17 -0
- package/mcp-configs/academic-db/legiscan-mcp.json +19 -0
- package/mcp-configs/academic-db/lex.json +17 -0
- package/mcp-configs/ai-platform/Adaptive-Graph-of-Thoughts-MCP-server.json +17 -0
- package/mcp-configs/ai-platform/ai-counsel.json +17 -0
- package/mcp-configs/ai-platform/atlas-mcp-server.json +17 -0
- package/mcp-configs/ai-platform/counsel-mcp.json +17 -0
- package/mcp-configs/ai-platform/cross-llm-mcp.json +17 -0
- package/mcp-configs/ai-platform/gptr-mcp.json +17 -0
- package/mcp-configs/browser/decipher-research-agent.json +17 -0
- package/mcp-configs/browser/deep-research.json +17 -0
- package/mcp-configs/browser/everything-claude-code.json +17 -0
- package/mcp-configs/browser/gpt-researcher.json +17 -0
- package/mcp-configs/browser/heurist-agent-framework.json +17 -0
- package/mcp-configs/data-platform/4everland-hosting-mcp.json +17 -0
- package/mcp-configs/data-platform/context-keeper.json +17 -0
- package/mcp-configs/data-platform/context7.json +19 -0
- package/mcp-configs/data-platform/contextstream-mcp.json +17 -0
- package/mcp-configs/data-platform/email-mcp.json +17 -0
- package/mcp-configs/note-knowledge/ApeRAG.json +17 -0
- package/mcp-configs/note-knowledge/In-Memoria.json +17 -0
- package/mcp-configs/note-knowledge/agent-memory.json +17 -0
- package/mcp-configs/note-knowledge/aimemo.json +17 -0
- package/mcp-configs/note-knowledge/biel-mcp.json +19 -0
- package/mcp-configs/note-knowledge/cognee.json +17 -0
- package/mcp-configs/note-knowledge/context-awesome.json +17 -0
- package/mcp-configs/note-knowledge/context-mcp.json +17 -0
- package/mcp-configs/note-knowledge/conversation-handoff-mcp.json +17 -0
- package/mcp-configs/note-knowledge/cortex.json +17 -0
- package/mcp-configs/note-knowledge/devrag.json +17 -0
- package/mcp-configs/note-knowledge/easy-obsidian-mcp.json +17 -0
- package/mcp-configs/note-knowledge/engram.json +17 -0
- package/mcp-configs/note-knowledge/gnosis-mcp.json +17 -0
- package/mcp-configs/note-knowledge/graphlit-mcp-server.json +19 -0
- package/mcp-configs/reference-mgr/arxiv-cli.json +17 -0
- package/mcp-configs/reference-mgr/arxiv-search-mcp.json +17 -0
- package/mcp-configs/reference-mgr/chiken.json +17 -0
- package/mcp-configs/reference-mgr/claude-scholar.json +17 -0
- package/mcp-configs/reference-mgr/devonthink-mcp.json +17 -0
- package/mcp-configs/registry.json +447 -0
- package/openclaw.plugin.json +21 -0
- package/package.json +61 -0
- package/skills/analysis/dataviz/color-accessibility-guide/SKILL.md +230 -0
- package/skills/analysis/dataviz/geospatial-viz-guide/SKILL.md +218 -0
- package/skills/analysis/dataviz/interactive-viz-guide/SKILL.md +287 -0
- package/skills/analysis/dataviz/network-visualization-guide/SKILL.md +195 -0
- package/skills/analysis/dataviz/publication-figures-guide/SKILL.md +238 -0
- package/skills/analysis/dataviz/python-dataviz-guide/SKILL.md +195 -0
- package/skills/analysis/econometrics/causal-inference-guide/SKILL.md +197 -0
- package/skills/analysis/econometrics/iv-regression-guide/SKILL.md +198 -0
- package/skills/analysis/econometrics/panel-data-guide/SKILL.md +274 -0
- package/skills/analysis/econometrics/robustness-checks/SKILL.md +250 -0
- package/skills/analysis/econometrics/stata-regression/SKILL.md +117 -0
- package/skills/analysis/econometrics/time-series-guide/SKILL.md +235 -0
- package/skills/analysis/statistics/bayesian-statistics-guide/SKILL.md +221 -0
- package/skills/analysis/statistics/hypothesis-testing-guide/SKILL.md +210 -0
- package/skills/analysis/statistics/meta-analysis-guide/SKILL.md +206 -0
- package/skills/analysis/statistics/nonparametric-tests-guide/SKILL.md +221 -0
- package/skills/analysis/statistics/power-analysis-guide/SKILL.md +240 -0
- package/skills/analysis/statistics/sem-guide/SKILL.md +231 -0
- package/skills/analysis/statistics/survival-analysis-guide/SKILL.md +195 -0
- package/skills/analysis/wrangling/missing-data-handling/SKILL.md +224 -0
- package/skills/analysis/wrangling/pandas-data-wrangling/SKILL.md +242 -0
- package/skills/analysis/wrangling/questionnaire-design-guide/SKILL.md +234 -0
- package/skills/analysis/wrangling/text-mining-guide/SKILL.md +225 -0
- package/skills/domains/ai-ml/computer-vision-guide/SKILL.md +213 -0
- package/skills/domains/ai-ml/deep-learning-papers-guide/SKILL.md +200 -0
- package/skills/domains/ai-ml/llm-evaluation-guide/SKILL.md +194 -0
- package/skills/domains/ai-ml/prompt-engineering-research/SKILL.md +233 -0
- package/skills/domains/ai-ml/reinforcement-learning-guide/SKILL.md +254 -0
- package/skills/domains/ai-ml/transformer-architecture-guide/SKILL.md +233 -0
- package/skills/domains/biomedical/clinical-research-guide/SKILL.md +232 -0
- package/skills/domains/biomedical/clinicaltrials-api/SKILL.md +177 -0
- package/skills/domains/biomedical/epidemiology-guide/SKILL.md +200 -0
- package/skills/domains/biomedical/genomics-analysis-guide/SKILL.md +270 -0
- package/skills/domains/business/market-analysis-guide/SKILL.md +112 -0
- package/skills/domains/business/strategic-management-guide/SKILL.md +154 -0
- package/skills/domains/chemistry/computational-chemistry-guide/SKILL.md +266 -0
- package/skills/domains/chemistry/retrosynthesis-guide/SKILL.md +215 -0
- package/skills/domains/cs/algorithms-complexity-guide/SKILL.md +194 -0
- package/skills/domains/cs/dblp-api/SKILL.md +129 -0
- package/skills/domains/cs/software-engineering-research/SKILL.md +218 -0
- package/skills/domains/ecology/biodiversity-data-guide/SKILL.md +296 -0
- package/skills/domains/ecology/conservation-biology-guide/SKILL.md +198 -0
- package/skills/domains/ecology/gbif-api/SKILL.md +158 -0
- package/skills/domains/ecology/inaturalist-api/SKILL.md +173 -0
- package/skills/domains/economics/behavioral-economics-guide/SKILL.md +239 -0
- package/skills/domains/economics/development-economics-guide/SKILL.md +181 -0
- package/skills/domains/economics/fred-api/SKILL.md +189 -0
- package/skills/domains/education/curriculum-design-guide/SKILL.md +144 -0
- package/skills/domains/education/learning-science-guide/SKILL.md +150 -0
- package/skills/domains/finance/financial-data-analysis/SKILL.md +152 -0
- package/skills/domains/finance/quantitative-finance-guide/SKILL.md +151 -0
- package/skills/domains/geoscience/climate-science-guide/SKILL.md +158 -0
- package/skills/domains/geoscience/gis-remote-sensing-guide/SKILL.md +129 -0
- package/skills/domains/humanities/digital-humanities-guide/SKILL.md +181 -0
- package/skills/domains/humanities/philosophy-research-guide/SKILL.md +148 -0
- package/skills/domains/law/courtlistener-api/SKILL.md +213 -0
- package/skills/domains/law/legal-research-guide/SKILL.md +250 -0
- package/skills/domains/math/linear-algebra-applications/SKILL.md +227 -0
- package/skills/domains/math/numerical-methods-guide/SKILL.md +236 -0
- package/skills/domains/math/oeis-api/SKILL.md +158 -0
- package/skills/domains/pharma/clinical-pharmacology-guide/SKILL.md +165 -0
- package/skills/domains/pharma/drug-development-guide/SKILL.md +177 -0
- package/skills/domains/physics/computational-physics-guide/SKILL.md +300 -0
- package/skills/domains/physics/nasa-ads-api/SKILL.md +150 -0
- package/skills/domains/physics/quantum-computing-guide/SKILL.md +234 -0
- package/skills/domains/social-science/social-research-methods/SKILL.md +194 -0
- package/skills/domains/social-science/survey-research-guide/SKILL.md +182 -0
- package/skills/literature/discovery/citation-alert-guide/SKILL.md +154 -0
- package/skills/literature/discovery/conference-proceedings-guide/SKILL.md +142 -0
- package/skills/literature/discovery/literature-mapping-guide/SKILL.md +175 -0
- package/skills/literature/discovery/paper-tracking-guide/SKILL.md +211 -0
- package/skills/literature/discovery/rss-paper-feeds/SKILL.md +214 -0
- package/skills/literature/discovery/semantic-scholar-recs-guide/SKILL.md +164 -0
- package/skills/literature/fulltext/doaj-api/SKILL.md +120 -0
- package/skills/literature/fulltext/interlibrary-loan-guide/SKILL.md +163 -0
- package/skills/literature/fulltext/open-access-guide/SKILL.md +183 -0
- package/skills/literature/fulltext/pmc-oai-api/SKILL.md +184 -0
- package/skills/literature/fulltext/preprint-servers-guide/SKILL.md +128 -0
- package/skills/literature/fulltext/repository-harvesting-guide/SKILL.md +207 -0
- package/skills/literature/fulltext/unpaywall-api/SKILL.md +113 -0
- package/skills/literature/metadata/altmetrics-guide/SKILL.md +132 -0
- package/skills/literature/metadata/citation-network-guide/SKILL.md +236 -0
- package/skills/literature/metadata/crossref-api/SKILL.md +133 -0
- package/skills/literature/metadata/datacite-api/SKILL.md +126 -0
- package/skills/literature/metadata/doi-resolution-guide/SKILL.md +168 -0
- package/skills/literature/metadata/h-index-guide/SKILL.md +183 -0
- package/skills/literature/metadata/journal-metrics-guide/SKILL.md +188 -0
- package/skills/literature/metadata/opencitations-api/SKILL.md +128 -0
- package/skills/literature/metadata/orcid-api/SKILL.md +136 -0
- package/skills/literature/metadata/orcid-integration-guide/SKILL.md +178 -0
- package/skills/literature/search/arxiv-api/SKILL.md +95 -0
- package/skills/literature/search/biorxiv-api/SKILL.md +123 -0
- package/skills/literature/search/boolean-search-guide/SKILL.md +199 -0
- package/skills/literature/search/citation-chaining-guide/SKILL.md +148 -0
- package/skills/literature/search/database-comparison-guide/SKILL.md +100 -0
- package/skills/literature/search/europe-pmc-api/SKILL.md +120 -0
- package/skills/literature/search/google-scholar-guide/SKILL.md +182 -0
- package/skills/literature/search/mesh-terms-guide/SKILL.md +164 -0
- package/skills/literature/search/openalex-api/SKILL.md +134 -0
- package/skills/literature/search/pubmed-api/SKILL.md +130 -0
- package/skills/literature/search/scientify-literature-survey/SKILL.md +203 -0
- package/skills/literature/search/semantic-scholar-api/SKILL.md +134 -0
- package/skills/literature/search/systematic-search-strategy/SKILL.md +214 -0
- package/skills/research/automation/ai-scientist-guide/SKILL.md +228 -0
- package/skills/research/automation/data-collection-automation/SKILL.md +248 -0
- package/skills/research/automation/research-workflow-automation/SKILL.md +266 -0
- package/skills/research/deep-research/meta-synthesis-guide/SKILL.md +174 -0
- package/skills/research/deep-research/research-cog/SKILL.md +153 -0
- package/skills/research/deep-research/scoping-review-guide/SKILL.md +217 -0
- package/skills/research/deep-research/systematic-review-guide/SKILL.md +250 -0
- package/skills/research/funding/figshare-api/SKILL.md +163 -0
- package/skills/research/funding/grant-writing-guide/SKILL.md +233 -0
- package/skills/research/funding/nsf-grant-guide/SKILL.md +206 -0
- package/skills/research/funding/open-science-guide/SKILL.md +255 -0
- package/skills/research/funding/zenodo-api/SKILL.md +174 -0
- package/skills/research/methodology/action-research-guide/SKILL.md +201 -0
- package/skills/research/methodology/experimental-design-guide/SKILL.md +236 -0
- package/skills/research/methodology/grad-school-guide/SKILL.md +182 -0
- package/skills/research/methodology/grounded-theory-guide/SKILL.md +171 -0
- package/skills/research/methodology/mixed-methods-guide/SKILL.md +208 -0
- package/skills/research/methodology/qualitative-research-guide/SKILL.md +234 -0
- package/skills/research/methodology/scientify-idea-generation/SKILL.md +222 -0
- package/skills/research/paper-review/paper-reading-assistant/SKILL.md +266 -0
- package/skills/research/paper-review/peer-review-guide/SKILL.md +227 -0
- package/skills/research/paper-review/rebuttal-writing-guide/SKILL.md +185 -0
- package/skills/research/paper-review/scientify-write-review-paper/SKILL.md +209 -0
- package/skills/tools/code-exec/jupyter-notebook-guide/SKILL.md +178 -0
- package/skills/tools/code-exec/python-reproducibility-guide/SKILL.md +341 -0
- package/skills/tools/code-exec/r-reproducibility-guide/SKILL.md +236 -0
- package/skills/tools/code-exec/sandbox-execution-guide/SKILL.md +221 -0
- package/skills/tools/diagram/mermaid-diagram-guide/SKILL.md +269 -0
- package/skills/tools/diagram/plantuml-guide/SKILL.md +397 -0
- package/skills/tools/diagram/scientific-illustration-guide/SKILL.md +225 -0
- package/skills/tools/document/anystyle-api/SKILL.md +199 -0
- package/skills/tools/document/grobid-pdf-parsing/SKILL.md +294 -0
- package/skills/tools/document/markdown-academic-guide/SKILL.md +217 -0
- package/skills/tools/document/pdf-extraction-guide/SKILL.md +321 -0
- package/skills/tools/knowledge-graph/knowledge-graph-construction/SKILL.md +306 -0
- package/skills/tools/knowledge-graph/ontology-design-guide/SKILL.md +214 -0
- package/skills/tools/knowledge-graph/rag-methodology-guide/SKILL.md +325 -0
- package/skills/tools/ocr-translate/formula-recognition-guide/SKILL.md +367 -0
- package/skills/tools/ocr-translate/handwriting-recognition-guide/SKILL.md +211 -0
- package/skills/tools/ocr-translate/latex-ocr-guide/SKILL.md +204 -0
- package/skills/tools/ocr-translate/multilingual-research-guide/SKILL.md +234 -0
- package/skills/tools/scraping/academic-web-scraping/SKILL.md +326 -0
- package/skills/tools/scraping/api-data-collection-guide/SKILL.md +301 -0
- package/skills/tools/scraping/web-scraping-ethics-guide/SKILL.md +250 -0
- package/skills/writing/citation/bibtex-management-guide/SKILL.md +246 -0
- package/skills/writing/citation/citation-style-guide/SKILL.md +248 -0
- package/skills/writing/citation/reference-manager-comparison/SKILL.md +208 -0
- package/skills/writing/citation/zotero-api/SKILL.md +188 -0
- package/skills/writing/composition/abstract-writing-guide/SKILL.md +188 -0
- package/skills/writing/composition/discussion-writing-guide/SKILL.md +194 -0
- package/skills/writing/composition/introduction-writing-guide/SKILL.md +194 -0
- package/skills/writing/composition/literature-review-writing/SKILL.md +196 -0
- package/skills/writing/composition/methods-section-guide/SKILL.md +185 -0
- package/skills/writing/composition/response-to-reviewers/SKILL.md +215 -0
- package/skills/writing/composition/scientific-writing-guide/SKILL.md +152 -0
- package/skills/writing/latex/bibliography-management-guide/SKILL.md +206 -0
- package/skills/writing/latex/latex-drawing-guide/SKILL.md +234 -0
- package/skills/writing/latex/latex-ecosystem-guide/SKILL.md +240 -0
- package/skills/writing/latex/math-typesetting-guide/SKILL.md +231 -0
- package/skills/writing/latex/overleaf-collaboration-guide/SKILL.md +211 -0
- package/skills/writing/latex/tikz-diagrams-guide/SKILL.md +211 -0
- package/skills/writing/polish/academic-translation-guide/SKILL.md +175 -0
- package/skills/writing/polish/academic-writing-refiner/SKILL.md +143 -0
- package/skills/writing/polish/ai-writing-humanizer/SKILL.md +178 -0
- package/skills/writing/polish/grammar-checker-guide/SKILL.md +184 -0
- package/skills/writing/polish/plagiarism-detection-guide/SKILL.md +167 -0
- package/skills/writing/templates/beamer-presentation-guide/SKILL.md +263 -0
- package/skills/writing/templates/conference-paper-template/SKILL.md +219 -0
- package/skills/writing/templates/thesis-template-guide/SKILL.md +200 -0
- package/skills/writing/templates/thesis-writing-guide/SKILL.md +220 -0
- package/src/tools/arxiv.ts +131 -0
- package/src/tools/crossref.ts +112 -0
- package/src/tools/openalex.ts +174 -0
- package/src/tools/pubmed.ts +166 -0
- package/src/tools/semantic-scholar.ts +108 -0
- package/src/tools/unpaywall.ts +58 -0
|
@@ -0,0 +1,250 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: web-scraping-ethics-guide
|
|
3
|
+
description: "Scrape web data ethically and legally for research purposes"
|
|
4
|
+
metadata:
|
|
5
|
+
openclaw:
|
|
6
|
+
emoji: "globe_with_meridians"
|
|
7
|
+
category: "tools"
|
|
8
|
+
subcategory: "scraping"
|
|
9
|
+
keywords: ["web scraping", "ethical scraping", "robots.txt", "rate limiting", "research data collection", "crawling"]
|
|
10
|
+
source: "wentor-research-plugins"
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# Ethical Web Scraping for Research
|
|
14
|
+
|
|
15
|
+
A skill for collecting web data ethically and legally for research purposes. Covers robots.txt compliance, rate limiting, legal frameworks, data privacy considerations, and practical scraping techniques that respect website operators and comply with institutional review requirements.
|
|
16
|
+
|
|
17
|
+
## Ethical Framework
|
|
18
|
+
|
|
19
|
+
### Principles of Ethical Scraping
|
|
20
|
+
|
|
21
|
+
```
|
|
22
|
+
1. Respect robots.txt and Terms of Service
|
|
23
|
+
- Check robots.txt before scraping any site
|
|
24
|
+
- Review the site's ToS for explicit prohibitions
|
|
25
|
+
- When in doubt, contact the site operator
|
|
26
|
+
|
|
27
|
+
2. Minimize server impact
|
|
28
|
+
- Use rate limiting (1-2 requests per second maximum)
|
|
29
|
+
- Scrape during off-peak hours when possible
|
|
30
|
+
- Cache responses to avoid redundant requests
|
|
31
|
+
- Use conditional requests (If-Modified-Since headers)
|
|
32
|
+
|
|
33
|
+
3. Collect only what you need
|
|
34
|
+
- Define your data requirements before scraping
|
|
35
|
+
- Do not scrape personal data without ethical justification
|
|
36
|
+
- Anonymize or pseudonymize personal information
|
|
37
|
+
|
|
38
|
+
4. Attribution and transparency
|
|
39
|
+
- Set a descriptive User-Agent header with contact info
|
|
40
|
+
- Be prepared to identify yourself if contacted
|
|
41
|
+
- Credit data sources in publications
|
|
42
|
+
|
|
43
|
+
5. Institutional compliance
|
|
44
|
+
- Check if your IRB/ethics board requires approval for web data
|
|
45
|
+
- Follow your institution's acceptable use policy
|
|
46
|
+
- Consider data protection regulations (GDPR, CCPA)
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
## Checking robots.txt
|
|
50
|
+
|
|
51
|
+
### Parsing Robots.txt
|
|
52
|
+
|
|
53
|
+
```python
|
|
54
|
+
import urllib.request
|
|
55
|
+
import urllib.robotparser
|
|
56
|
+
|
|
57
|
+
|
|
58
|
+
def check_robots_txt(base_url: str, target_path: str,
|
|
59
|
+
user_agent: str = "*") -> dict:
|
|
60
|
+
"""
|
|
61
|
+
Check if a URL is allowed by robots.txt.
|
|
62
|
+
|
|
63
|
+
Args:
|
|
64
|
+
base_url: The website's base URL (e.g., 'https://example.com')
|
|
65
|
+
target_path: The path you want to scrape (e.g., '/data/papers')
|
|
66
|
+
user_agent: Your bot's user agent string
|
|
67
|
+
"""
|
|
68
|
+
robots_url = f"{base_url}/robots.txt"
|
|
69
|
+
|
|
70
|
+
rp = urllib.robotparser.RobotFileParser()
|
|
71
|
+
rp.set_url(robots_url)
|
|
72
|
+
|
|
73
|
+
try:
|
|
74
|
+
rp.read()
|
|
75
|
+
except Exception as e:
|
|
76
|
+
return {
|
|
77
|
+
"robots_txt_found": False,
|
|
78
|
+
"error": str(e),
|
|
79
|
+
"recommendation": "Proceed with caution; use conservative rate limiting"
|
|
80
|
+
}
|
|
81
|
+
|
|
82
|
+
full_url = f"{base_url}{target_path}"
|
|
83
|
+
allowed = rp.can_fetch(user_agent, full_url)
|
|
84
|
+
crawl_delay = rp.crawl_delay(user_agent)
|
|
85
|
+
|
|
86
|
+
return {
|
|
87
|
+
"robots_txt_found": True,
|
|
88
|
+
"url_checked": full_url,
|
|
89
|
+
"allowed": allowed,
|
|
90
|
+
"crawl_delay": crawl_delay or "Not specified (use 1-2 seconds)",
|
|
91
|
+
"recommendation": (
|
|
92
|
+
"Proceed with specified crawl delay"
|
|
93
|
+
if allowed
|
|
94
|
+
else "Do NOT scrape this path -- it is disallowed"
|
|
95
|
+
)
|
|
96
|
+
}
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
## Rate-Limited Scraping
|
|
100
|
+
|
|
101
|
+
### Respectful Request Pattern
|
|
102
|
+
|
|
103
|
+
```python
|
|
104
|
+
import time
|
|
105
|
+
import urllib.request
|
|
106
|
+
|
|
107
|
+
|
|
108
|
+
def scrape_with_rate_limit(urls: list[str],
|
|
109
|
+
delay: float = 1.0,
|
|
110
|
+
user_agent: str = None) -> list[dict]:
|
|
111
|
+
"""
|
|
112
|
+
Scrape a list of URLs with rate limiting and proper headers.
|
|
113
|
+
|
|
114
|
+
Args:
|
|
115
|
+
urls: List of URLs to fetch
|
|
116
|
+
delay: Seconds to wait between requests
|
|
117
|
+
user_agent: Custom user agent string
|
|
118
|
+
"""
|
|
119
|
+
if user_agent is None:
|
|
120
|
+
user_agent = (
|
|
121
|
+
"ResearchBot/1.0 (Academic research; "
|
|
122
|
+
"contact: researcher@university.edu)"
|
|
123
|
+
)
|
|
124
|
+
|
|
125
|
+
results = []
|
|
126
|
+
|
|
127
|
+
for i, url in enumerate(urls):
|
|
128
|
+
try:
|
|
129
|
+
req = urllib.request.Request(url, headers={
|
|
130
|
+
"User-Agent": user_agent,
|
|
131
|
+
"Accept": "text/html",
|
|
132
|
+
})
|
|
133
|
+
|
|
134
|
+
response = urllib.request.urlopen(req, timeout=30)
|
|
135
|
+
content = response.read().decode("utf-8", errors="replace")
|
|
136
|
+
|
|
137
|
+
results.append({
|
|
138
|
+
"url": url,
|
|
139
|
+
"status": response.status,
|
|
140
|
+
"content_length": len(content),
|
|
141
|
+
"success": True
|
|
142
|
+
})
|
|
143
|
+
|
|
144
|
+
except Exception as e:
|
|
145
|
+
results.append({
|
|
146
|
+
"url": url,
|
|
147
|
+
"error": str(e),
|
|
148
|
+
"success": False
|
|
149
|
+
})
|
|
150
|
+
|
|
151
|
+
# Rate limiting
|
|
152
|
+
if i < len(urls) - 1:
|
|
153
|
+
time.sleep(delay)
|
|
154
|
+
|
|
155
|
+
return results
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
## Legal Considerations
|
|
159
|
+
|
|
160
|
+
### Key Legal Frameworks
|
|
161
|
+
|
|
162
|
+
```
|
|
163
|
+
United States:
|
|
164
|
+
- CFAA (Computer Fraud and Abuse Act): Unauthorized access is illegal
|
|
165
|
+
- hiQ v. LinkedIn (2022): Scraping public data is generally permissible
|
|
166
|
+
- Key question: Is the data publicly accessible without authentication?
|
|
167
|
+
|
|
168
|
+
European Union:
|
|
169
|
+
- GDPR: Personal data requires legal basis for processing
|
|
170
|
+
- Database Directive: Protects substantial investment in databases
|
|
171
|
+
- Text and Data Mining exception (DSM Directive, Art. 3-4):
|
|
172
|
+
Research organizations can mine lawfully accessible content
|
|
173
|
+
|
|
174
|
+
General guidance:
|
|
175
|
+
- Public data is more defensible than data behind login walls
|
|
176
|
+
- Scraping that circumvents technical measures is riskier
|
|
177
|
+
- Academic fair use / research exceptions vary by jurisdiction
|
|
178
|
+
- When in doubt, consult your institution's legal counsel
|
|
179
|
+
```
|
|
180
|
+
|
|
181
|
+
## Research-Specific Considerations
|
|
182
|
+
|
|
183
|
+
### IRB and Ethics Approval
|
|
184
|
+
|
|
185
|
+
```python
|
|
186
|
+
def assess_irb_requirements(data_type: str,
|
|
187
|
+
contains_pii: bool) -> dict:
|
|
188
|
+
"""
|
|
189
|
+
Assess whether web scraping requires IRB review.
|
|
190
|
+
|
|
191
|
+
Args:
|
|
192
|
+
data_type: Type of data being collected
|
|
193
|
+
contains_pii: Whether data includes personally identifiable information
|
|
194
|
+
"""
|
|
195
|
+
if contains_pii:
|
|
196
|
+
return {
|
|
197
|
+
"irb_required": "Likely yes",
|
|
198
|
+
"rationale": (
|
|
199
|
+
"Data that identifies or can re-identify individuals "
|
|
200
|
+
"generally requires ethics review, even if publicly posted."
|
|
201
|
+
),
|
|
202
|
+
"steps": [
|
|
203
|
+
"Submit IRB protocol describing data collection",
|
|
204
|
+
"Justify why PII is necessary for the research",
|
|
205
|
+
"Describe de-identification procedures",
|
|
206
|
+
"Explain data storage and security measures",
|
|
207
|
+
"Plan for data destruction after the study"
|
|
208
|
+
]
|
|
209
|
+
}
|
|
210
|
+
|
|
211
|
+
return {
|
|
212
|
+
"irb_required": "Possibly exempt, but check with your IRB",
|
|
213
|
+
"rationale": (
|
|
214
|
+
"Non-human-subjects data (e.g., product prices, publication "
|
|
215
|
+
"metadata) typically does not require IRB review, but policies "
|
|
216
|
+
"vary by institution."
|
|
217
|
+
),
|
|
218
|
+
"recommendation": "Submit an exemption request to be safe"
|
|
219
|
+
}
|
|
220
|
+
```
|
|
221
|
+
|
|
222
|
+
## Best Practices Summary
|
|
223
|
+
|
|
224
|
+
### Scraping Checklist for Researchers
|
|
225
|
+
|
|
226
|
+
```
|
|
227
|
+
Before scraping:
|
|
228
|
+
[ ] Check robots.txt
|
|
229
|
+
[ ] Review Terms of Service
|
|
230
|
+
[ ] Consider whether an API exists (prefer API over scraping)
|
|
231
|
+
[ ] Assess IRB requirements
|
|
232
|
+
[ ] Define minimal data needed
|
|
233
|
+
|
|
234
|
+
During scraping:
|
|
235
|
+
[ ] Set descriptive User-Agent with contact email
|
|
236
|
+
[ ] Implement rate limiting (min 1 second between requests)
|
|
237
|
+
[ ] Handle errors gracefully (do not retry aggressively)
|
|
238
|
+
[ ] Log all requests for reproducibility
|
|
239
|
+
[ ] Cache responses to avoid re-fetching
|
|
240
|
+
|
|
241
|
+
After scraping:
|
|
242
|
+
[ ] Anonymize personal data if present
|
|
243
|
+
[ ] Store data securely
|
|
244
|
+
[ ] Document the scraping methodology for your paper
|
|
245
|
+
[ ] Credit the data source
|
|
246
|
+
[ ] Consider whether the scraped data can be shared
|
|
247
|
+
(check copyright and ToS)
|
|
248
|
+
```
|
|
249
|
+
|
|
250
|
+
Whenever a public API is available (e.g., Twitter/X API, Reddit API, CrossRef API), use the API instead of scraping HTML. APIs provide structured data, respect rate limits by design, and demonstrate good faith in your research methodology.
|
|
@@ -0,0 +1,246 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: bibtex-management-guide
|
|
3
|
+
description: "Clean, format, deduplicate, and manage BibTeX bibliography files for LaTeX"
|
|
4
|
+
metadata:
|
|
5
|
+
openclaw:
|
|
6
|
+
emoji: "card_file_box"
|
|
7
|
+
category: "writing"
|
|
8
|
+
subcategory: "citation"
|
|
9
|
+
keywords: ["BibTeX formatting", "BibTeX conversion", "bibliography cleanup", "reference deduplication", "citation management"]
|
|
10
|
+
source: "wentor"
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# BibTeX Management Guide
|
|
14
|
+
|
|
15
|
+
A skill for maintaining clean, consistent, and complete BibTeX bibliography files. Covers formatting standards, deduplication, common errors, and automated cleanup workflows essential for LaTeX-based academic writing.
|
|
16
|
+
|
|
17
|
+
## BibTeX Entry Standards
|
|
18
|
+
|
|
19
|
+
### Required Fields by Entry Type
|
|
20
|
+
|
|
21
|
+
```bibtex
|
|
22
|
+
% Article in a journal
|
|
23
|
+
@article{smith2024deep,
|
|
24
|
+
author = {Smith, John A. and Doe, Jane B.},
|
|
25
|
+
title = {Deep Learning for Climate Prediction: A Comparative Study},
|
|
26
|
+
journal = {Nature Machine Intelligence},
|
|
27
|
+
year = {2024},
|
|
28
|
+
volume = {6},
|
|
29
|
+
number = {3},
|
|
30
|
+
pages = {234--248},
|
|
31
|
+
doi = {10.1038/s42256-024-00001-1}
|
|
32
|
+
}
|
|
33
|
+
|
|
34
|
+
% Conference proceedings
|
|
35
|
+
@inproceedings{lee2024attention,
|
|
36
|
+
author = {Lee, Wei and Chen, Li},
|
|
37
|
+
title = {Attention Mechanisms for Scientific Document Understanding},
|
|
38
|
+
booktitle = {Proceedings of the 62nd Annual Meeting of the ACL},
|
|
39
|
+
year = {2024},
|
|
40
|
+
pages = {1123--1135},
|
|
41
|
+
publisher = {Association for Computational Linguistics},
|
|
42
|
+
doi = {10.18653/v1/2024.acl-main.89}
|
|
43
|
+
}
|
|
44
|
+
|
|
45
|
+
% Book
|
|
46
|
+
@book{bishop2006pattern,
|
|
47
|
+
author = {Bishop, Christopher M.},
|
|
48
|
+
title = {Pattern Recognition and Machine Learning},
|
|
49
|
+
publisher = {Springer},
|
|
50
|
+
year = {2006},
|
|
51
|
+
isbn = {978-0387310732}
|
|
52
|
+
}
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
## Automated BibTeX Cleanup
|
|
56
|
+
|
|
57
|
+
### Deduplication
|
|
58
|
+
|
|
59
|
+
```python
|
|
60
|
+
import re
|
|
61
|
+
from collections import defaultdict
|
|
62
|
+
|
|
63
|
+
def parse_bibtex_entries(bib_content: str) -> list[dict]:
|
|
64
|
+
"""
|
|
65
|
+
Parse a BibTeX file into structured entries.
|
|
66
|
+
"""
|
|
67
|
+
entries = []
|
|
68
|
+
pattern = r'@(\w+)\{([^,]+),\s*(.*?)\n\}'
|
|
69
|
+
matches = re.finditer(pattern, bib_content, re.DOTALL)
|
|
70
|
+
|
|
71
|
+
for match in matches:
|
|
72
|
+
entry = {
|
|
73
|
+
'type': match.group(1).lower(),
|
|
74
|
+
'key': match.group(2).strip(),
|
|
75
|
+
'raw': match.group(0),
|
|
76
|
+
'fields': {}
|
|
77
|
+
}
|
|
78
|
+
|
|
79
|
+
fields_str = match.group(3)
|
|
80
|
+
field_pattern = r'(\w+)\s*=\s*[{\"](.+?)[}\"]'
|
|
81
|
+
for field_match in re.finditer(field_pattern, fields_str, re.DOTALL):
|
|
82
|
+
entry['fields'][field_match.group(1).lower()] = field_match.group(2).strip()
|
|
83
|
+
|
|
84
|
+
entries.append(entry)
|
|
85
|
+
|
|
86
|
+
return entries
|
|
87
|
+
|
|
88
|
+
|
|
89
|
+
def deduplicate_bibtex(entries: list[dict]) -> dict:
|
|
90
|
+
"""
|
|
91
|
+
Find and remove duplicate BibTeX entries.
|
|
92
|
+
|
|
93
|
+
Deduplication strategy:
|
|
94
|
+
1. Exact DOI match
|
|
95
|
+
2. Fuzzy title match (normalized)
|
|
96
|
+
3. Author + year + first title word match
|
|
97
|
+
"""
|
|
98
|
+
seen_dois = {}
|
|
99
|
+
seen_titles = {}
|
|
100
|
+
duplicates = []
|
|
101
|
+
unique = []
|
|
102
|
+
|
|
103
|
+
for entry in entries:
|
|
104
|
+
doi = entry['fields'].get('doi', '').lower().strip()
|
|
105
|
+
title = entry['fields'].get('title', '').lower().strip()
|
|
106
|
+
title_normalized = re.sub(r'[^a-z0-9\s]', '', title)
|
|
107
|
+
|
|
108
|
+
is_duplicate = False
|
|
109
|
+
|
|
110
|
+
# Check DOI match
|
|
111
|
+
if doi and doi in seen_dois:
|
|
112
|
+
duplicates.append({
|
|
113
|
+
'entry': entry['key'],
|
|
114
|
+
'duplicate_of': seen_dois[doi],
|
|
115
|
+
'reason': 'same DOI'
|
|
116
|
+
})
|
|
117
|
+
is_duplicate = True
|
|
118
|
+
elif doi:
|
|
119
|
+
seen_dois[doi] = entry['key']
|
|
120
|
+
|
|
121
|
+
# Check title match
|
|
122
|
+
if not is_duplicate and title_normalized:
|
|
123
|
+
if title_normalized in seen_titles:
|
|
124
|
+
duplicates.append({
|
|
125
|
+
'entry': entry['key'],
|
|
126
|
+
'duplicate_of': seen_titles[title_normalized],
|
|
127
|
+
'reason': 'same title'
|
|
128
|
+
})
|
|
129
|
+
is_duplicate = True
|
|
130
|
+
else:
|
|
131
|
+
seen_titles[title_normalized] = entry['key']
|
|
132
|
+
|
|
133
|
+
if not is_duplicate:
|
|
134
|
+
unique.append(entry)
|
|
135
|
+
|
|
136
|
+
return {
|
|
137
|
+
'unique_entries': len(unique),
|
|
138
|
+
'duplicates_found': len(duplicates),
|
|
139
|
+
'duplicates': duplicates,
|
|
140
|
+
'entries': unique
|
|
141
|
+
}
|
|
142
|
+
```
|
|
143
|
+
|
|
144
|
+
### Field Formatting
|
|
145
|
+
|
|
146
|
+
```python
|
|
147
|
+
def clean_bibtex_entry(entry: dict) -> dict:
    """
    Clean and standardize a BibTeX entry.

    Fixes applied:
    - author separators normalized to ' and '
    - page separators normalized to exactly '--'
    - all-caps acronyms in the title protected with braces
    - DOI resolver URL prefixes stripped to the bare DOI
    - empty/whitespace-only fields removed

    Args:
        entry: Parsed entry dict with 'key' and 'fields'.

    Returns:
        A new entry dict; the input `entry` is left unmodified.
    """
    # entry.copy() alone is shallow — copy the fields mapping too,
    # otherwise the edits below would mutate the caller's entry.
    cleaned = entry.copy()
    fields = dict(cleaned['fields'])

    # Standardize author names: "Last, First and Last, First"
    if 'author' in fields:
        authors = fields['author']
        # Fix common issues
        authors = authors.replace(' AND ', ' and ')
        authors = authors.replace(' & ', ' and ')
        fields['author'] = authors

    # Normalize page separators to '--' in one pass. The previous
    # replace('-', '--') approach corrupted already-correct ranges
    # ('1--5' became '1---5'); this handles '-', '--', and a literal
    # en-dash uniformly.
    if 'pages' in fields:
        fields['pages'] = re.sub(r'\s*[-\u2013]+\s*', '--', fields['pages'])

    # Protect acronyms with braces so BibTeX styles cannot lowercase them.
    if 'title' in fields:
        words = fields['title'].split()
        for i, word in enumerate(words):
            if word.isupper() and len(word) > 1:
                words[i] = '{' + word + '}'
        fields['title'] = ' '.join(words)

    # Strip resolver URL prefixes so only the bare DOI remains.
    if 'doi' in fields:
        doi = fields['doi']
        doi = doi.replace('https://doi.org/', '')
        doi = doi.replace('http://dx.doi.org/', '')
        fields['doi'] = doi

    # Remove empty fields
    fields = {k: v for k, v in fields.items() if v.strip()}
    cleaned['fields'] = fields

    return cleaned
|
|
188
|
+
```
|
|
189
|
+
|
|
190
|
+
## DOI-Based Entry Generation
|
|
191
|
+
|
|
192
|
+
### Fetch Complete BibTeX from DOI
|
|
193
|
+
|
|
194
|
+
```python
|
|
195
|
+
import requests
|
|
196
|
+
|
|
197
|
+
def doi_to_bibtex(doi: str, timeout: float = 30.0) -> str:
    """
    Retrieve a complete BibTeX entry from a DOI via content negotiation.

    The doi.org resolver honours the 'application/x-bibtex' Accept
    header and redirects to the registration agency (e.g. CrossRef).

    Args:
        doi: Bare DOI, e.g. '10.1038/s41586-021-03819-2'.
        timeout: Seconds to wait for the HTTP response. Without a
            timeout, requests.get can block indefinitely on a stalled
            connection.

    Returns:
        The BibTeX record as text, or a '%'-commented error string on
        any failure (non-200 status or network error), so callers
        always receive a string safe to write into a .bib file.
    """
    url = f"https://doi.org/{doi}"
    headers = {'Accept': 'application/x-bibtex'}
    try:
        response = requests.get(url, headers=headers,
                                allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        # Network failures previously propagated as exceptions,
        # contradicting the documented error-string contract.
        return f"% Error: Could not retrieve BibTeX for DOI {doi}"

    if response.status_code == 200:
        return response.text
    else:
        return f"% Error: Could not retrieve BibTeX for DOI {doi}"
|
|
209
|
+
# Example: convert a single DOI into a ready-to-paste BibTeX record.
bibtex = doi_to_bibtex('10.1038/s41586-021-03819-2')
print(bibtex)
|
|
213
|
+
```
|
|
214
|
+
|
|
215
|
+
## Citation Key Conventions
|
|
216
|
+
|
|
217
|
+
Consistent citation keys improve readability:
|
|
218
|
+
|
|
219
|
+
```
|
|
220
|
+
Convention: authorYEARfirstword
|
|
221
|
+
Examples:
|
|
222
|
+
smith2024deep
|
|
223
|
+
lee2024attention
|
|
224
|
+
bishop2006pattern
|
|
225
|
+
|
|
226
|
+
For multiple papers by same author in same year:
|
|
227
|
+
smith2024a, smith2024b
|
|
228
|
+
|
|
229
|
+
For papers with many authors:
|
|
230
|
+
smithetal2024deep (use "etal" for 3+ authors)
|
|
231
|
+
```
|
|
232
|
+
|
|
233
|
+
## Validation Checklist
|
|
234
|
+
|
|
235
|
+
Before submitting a manuscript, validate your BibTeX file:
|
|
236
|
+
|
|
237
|
+
1. Every `\cite{}` in the manuscript has a matching entry in the .bib file
|
|
238
|
+
2. No orphaned entries (entries in .bib not cited in manuscript)
|
|
239
|
+
3. All entries have at minimum: author, title, year
|
|
240
|
+
4. All journal articles have: volume, pages (or article number), DOI
|
|
241
|
+
5. Page ranges use en-dash (`--`), not single hyphen
|
|
242
|
+
6. No encoding errors in author names (check accented characters)
|
|
243
|
+
7. Proper nouns and acronyms in titles are protected with braces
|
|
244
|
+
8. No duplicate entries exist
|
|
245
|
+
|
|
246
|
+
Use `biber --validate-datamodel` or `checkcites` for automated validation.
|