npm - @wentorai/research-plugins - Versions diffs - 1.0.0 - Mend

@wentorai/research-plugins 1.0.0


Files changed (252)

package/skills/tools/document/anystyle-api/SKILL.md ADDED

@@ -0,0 +1,199 @@
+---
+name: anystyle-api
+description: "Citation reference parser using machine learning"
+metadata:
+  openclaw:
+    emoji: "🔍"
+    category: "tools"
+    subcategory: "document"
+    keywords: ["PDF parsing", "document chunking", "format conversion", "PDF extraction"]
+    source: "https://anystyle.io/"
+---
+# AnyStyle API Guide
+## Overview
+AnyStyle is a fast and smart citation reference parser that uses machine learning (specifically conditional random fields, CRFs) to extract structured bibliographic data from unformatted citation strings. It can parse raw reference text into structured fields such as author, title, journal, volume, pages, year, and DOI, handling the enormous variety of citation formats found in academic literature.
+The AnyStyle service provides both a web interface and an API endpoint for programmatic citation parsing. Unlike rule-based parsers that rely on specific citation style templates, AnyStyle uses a trained machine learning model that generalizes across citation formats, making it effective for parsing references from diverse disciplines and publication traditions where citation styles vary widely.
+Researchers, librarians, digital humanists, and research software developers use AnyStyle to extract structured references from PDF documents, legacy bibliographies, dissertation reference lists, and scanned documents. It is particularly valuable for building citation networks, enriching bibliographic databases, migrating references between management tools, and processing large volumes of unstructured citation data that would be impractical to parse manually.
+## Authentication
+No authentication required. The AnyStyle web service is freely accessible without any API key, token, or registration. The service can be used via the web interface at https://anystyle.io/ or through its API endpoint. For heavy usage or private deployments, AnyStyle is also available as an open-source Ruby gem that can be installed locally.
+## Core Endpoints
+### parse: Parse Citation References
+Submit raw citation text and receive structured bibliographic data extracted by the machine learning parser. The endpoint accepts one or more citation strings and returns parsed fields for each reference.
+- **URL**: `POST https://anystyle.io/parse`
+- **Parameters**:
+| Parameter | Type   | Required | Description                                                  |
+|-----------|--------|----------|--------------------------------------------------------------|
+| body      | string | Yes      | Raw citation text (one reference per line in the POST body)  |
+| format    | string | No       | Output format: `json` (default), `xml`, `bib` (BibTeX)      |
+- **Example**:
+```bash
+# Parse a single citation
+curl -X POST "https://anystyle.io/parse" \
+  -H "Content-Type: text/plain" \
+  -d "Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008."
+# Parse multiple citations (one per line)
+curl -X POST "https://anystyle.io/parse" \
+  -H "Content-Type: text/plain" \
+  -d "Vaswani, A. et al. (2017). Attention is all you need. NeurIPS 30, 5998-6008.
+LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
+Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS 25."
+```
+- **Response**: Returns an array of parsed citation objects, each containing extracted fields:
+```json
+[
+  {
+    "author": [{"family": "Vaswani", "given": "A."}, {"family": "Shazeer", "given": "N."}],
+    "title": ["Attention is all you need"],
+    "date": ["2017"],
+    "container-title": ["Advances in Neural Information Processing Systems"],
+    "volume": ["30"],
+    "pages": ["5998-6008"],
+    "type": "article-journal"
+  }
+]
+```
+Key response fields include `author` (array of name objects), `title`, `date`, `container-title` (journal/conference name), `volume`, `issue`, `pages`, `doi`, `url`, `publisher`, `location`, and `type` (inferred reference type).
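Most fields in the sample response are wrapped in single-element lists, which downstream code often wants flattened. The following is a minimal sketch under that assumption; `flatten_record` is a hypothetical helper (not part of AnyStyle), and the record is copied from the example response above rather than from a live call.

```python
import json

# Record copied from the sample response above (not fetched from the service).
SAMPLE = json.loads("""
[
  {
    "author": [{"family": "Vaswani", "given": "A."}, {"family": "Shazeer", "given": "N."}],
    "title": ["Attention is all you need"],
    "date": ["2017"],
    "container-title": ["Advances in Neural Information Processing Systems"],
    "volume": ["30"],
    "pages": ["5998-6008"],
    "type": "article-journal"
  }
]
""")


def flatten_record(record):
    """Collapse list-valued fields to their first entry (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        if key == "author":
            # Keep every author, rendered as "Family, Given".
            flat[key] = [
                ", ".join(p for p in (a.get("family", ""), a.get("given", "")) if p)
                for a in value
            ]
        elif isinstance(value, list):
            # Empty lists become an empty string so callers never index into nothing.
            flat[key] = value[0] if value else ""
        else:
            flat[key] = value
    return flat


if __name__ == "__main__":
    print(flatten_record(SAMPLE[0]))
```

Taking only the first entry discards any additional values a field might carry, so this suits display and quick lookups more than lossless storage.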
+## Rate Limits
+No formal rate limits are documented for the AnyStyle web service. However, the service is provided as a free community resource, so users should exercise responsible usage patterns. For high-volume parsing tasks (thousands of citations or more), it is strongly recommended to install the AnyStyle Ruby gem locally:
+```bash
+gem install anystyle
+```
+The local installation provides the same parsing capabilities without any network dependencies or rate concerns, and supports batch processing of large reference lists and PDF files directly.
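When the hosted service is used anyway, one way to keep usage modest is to split a long reference list into small batches and pause between requests. This is a sketch only: `chunk_references` is a hypothetical helper, and the batch size of 50 is an arbitrary assumption, not a documented limit.

```python
def chunk_references(raw_text, batch_size=50):
    """Yield newline-joined batches of at most batch_size non-empty reference lines."""
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    for start in range(0, len(lines), batch_size):
        yield "\n".join(lines[start:start + batch_size])


if __name__ == "__main__":
    # 120 placeholder references, split into batches of 50, 50, and 20.
    sample = "\n".join(f"Author {i}. (2020). Title {i}." for i in range(120))
    for batch in chunk_references(sample):
        print(len(batch.splitlines()))
```

Each batch can then be posted separately, with a `time.sleep` call between requests.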
+## Common Patterns
+### Parse a Reference List from a Paper
+Extract structured data from a raw reference list copied from a PDF:
+```python
+import requests
+references = """Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
+Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT, 4171-4186.
+Brown, T. B., Mann, B., Ryder, N., Subbiah, M., et al. (2020). Language models are few-shot learners. NeurIPS 33, 1877-1901."""
+resp = requests.post(
+    "https://anystyle.io/parse",
+    headers={"Content-Type": "text/plain"},
+    data=references
+)
+for ref in resp.json():
+    authors = ", ".join(
+        f"{a.get('family', '')} {a.get('given', '')}" for a in ref.get("author", [])
+    )
+    title = ref.get("title", [""])[0]
+    year = ref.get("date", [""])[0]
+    journal = ref.get("container-title", [""])[0]
+    print(f"{authors} ({year}). {title}. {journal}")
+```
+### Batch Process Citations from Multiple Documents
+Process reference lists from multiple papers for citation network analysis:
+```python
+import requests
+def parse_references(raw_text):
+    """Parse raw citation text into structured records."""
+    resp = requests.post(
+        "https://anystyle.io/parse",
+        headers={"Content-Type": "text/plain"},
+        data=raw_text
+    )
+    if resp.status_code == 200:
+        return resp.json()
+    return []
+# Process references from multiple source documents
+documents = {
+    "paper_A": "Smith, J. (2020). Title A. Journal X, 1, 1-10.\nDoe, J. (2019). Title B. Journal Y, 2, 20-30.",
+    "paper_B": "Jones, K. (2021). Title C. Conference Z, 100-110.\nSmith, J. (2020). Title A. Journal X, 1, 1-10."
+}
+citation_graph = {}
+for doc_id, refs in documents.items():
+    parsed = parse_references(refs)
+    citation_graph[doc_id] = parsed
+    print(f"{doc_id}: parsed {len(parsed)} references")
+```
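For citation network analysis, the parsed records still need to be matched across documents. The sketch below uses a crude key (first author's family name, year, normalized title prefix); both `reference_key` and `shared_references` are hypothetical helpers, and the records are hand-written in the shape of the sample response, not real API output.

```python
from collections import defaultdict


def reference_key(record):
    """Build a crude matching key: first author's family name, year, title prefix."""
    authors = record.get("author") or [{}]
    family = authors[0].get("family", "").lower()
    year = (record.get("date") or [""])[0][:4]
    title = (record.get("title") or [""])[0].lower()
    # Drop punctuation and spaces so "Title A." and "Title A" compare equal.
    title = "".join(ch for ch in title if ch.isalnum())[:40]
    return (family, year, title)


def shared_references(citation_graph):
    """Map each reference key to the documents citing it, keeping keys cited more than once."""
    cited_by = defaultdict(set)
    for doc_id, records in citation_graph.items():
        for record in records:
            cited_by[reference_key(record)].add(doc_id)
    return {key: docs for key, docs in cited_by.items() if len(docs) > 1}


# Hand-written records mirroring the documents in the example above.
example_graph = {
    "paper_A": [
        {"author": [{"family": "Smith", "given": "J."}], "date": ["2020"], "title": ["Title A"]},
        {"author": [{"family": "Doe", "given": "J."}], "date": ["2019"], "title": ["Title B"]},
    ],
    "paper_B": [
        {"author": [{"family": "Jones", "given": "K."}], "date": ["2021"], "title": ["Title C"]},
        {"author": [{"family": "Smith", "given": "J."}], "date": ["2020"], "title": ["Title A."]},
    ],
}

if __name__ == "__main__":
    for key, docs in shared_references(example_graph).items():
        print(key, sorted(docs))
```

A key like this will miss references whose first author or year was parsed differently; matching on DOI, where present, is more reliable.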
+### Convert Citations to BibTeX Format
+Transform unstructured references into BibTeX entries for use with LaTeX:
+```python
+import requests
+citation = "Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press."
+resp = requests.post(
+    "https://anystyle.io/parse",
+    headers={"Content-Type": "text/plain"},
+    data=citation
+)
+parsed = resp.json()[0]
+# Build a BibTeX entry from parsed fields
+authors = " and ".join(
+    f"{a.get('family', '')}, {a.get('given', '')}" for a in parsed.get("author", [])
+)
+bib_key = parsed.get("author", [{}])[0].get("family", "unknown").lower() + parsed.get("date", ["0000"])[0]
+print(f"@book{{{bib_key},")
+print(f"  author = {{{authors}}},")
+print(f"  title = {{{parsed.get('title', [''])[0]}}},")
+print(f"  year = {{{parsed.get('date', [''])[0]}}},")
+print(f"  publisher = {{{parsed.get('publisher', [''])[0] if parsed.get('publisher') else 'Unknown'}}}")
+print("}")
+```
+### Local Installation for High-Volume Processing
+For large-scale processing, install AnyStyle locally as a Ruby gem:
+```bash
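+# Install the command-line tool (the anystyle gem on its own provides the Ruby library)
+gem install anystyle-cli
+# Parse a single reference: the parse command reads from a file, so write the string to one first
+printf '%s\n' "Smith, J. (2020). My Paper. Journal, 1, 1-10." > one_ref.txt
+anystyle parse one_ref.txt
+# Parse references from a text file; -f/--format is a global option and precedes the command
+anystyle -f json parse references.txt > parsed.json
+# Find and parse references directly in a PDF
+anystyle -f json find document.pdf > extracted_refs.json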
+```
+## References
+- AnyStyle web service: https://anystyle.io/
+- AnyStyle GitHub repository: https://github.com/inukshuk/anystyle
+- AnyStyle Ruby gem: https://rubygems.org/gems/anystyle
+- AnyStyle CLI documentation: https://github.com/inukshuk/anystyle-cli
+- CSL (Citation Style Language): https://citationstyles.org/

package/skills/tools/document/grobid-pdf-parsing/SKILL.md ADDED

@@ -0,0 +1,294 @@
+---
+name: grobid-pdf-parsing
+description: "Extract structured text, metadata, and references from academic PDFs"
+metadata:
+  openclaw:
+    emoji: "📄"
+    category: "tools"
+    subcategory: "document"
+    keywords: ["PDF parsing", "PDF extraction", "document chunking", "format conversion"]
+    source: "https://github.com/kermitt2/grobid"
+---
+# GROBID PDF Parsing Guide
+## Overview
+Academic PDFs are the primary format for distributing research, yet extracting structured data from them remains challenging. PDFs encode visual layout, not semantic structure -- headings, paragraphs, equations, tables, and citations are all just positioned text and graphics. GROBID (GeneRation Of BIbliographic Data) is a widely used open-source tool for parsing academic PDFs into structured TEI XML, extracting metadata, body text, references, and figures.
+GROBID is used by major academic platforms including Semantic Scholar, CORE, and ResearchGate for large-scale document processing. It combines machine learning models (CRF and deep learning) with heuristic rules to handle the diverse formatting of academic papers across publishers and disciplines.
+This guide covers installing and running GROBID, using its REST API for batch processing, extracting specific elements (metadata, references, body sections), and integrating GROBID output into downstream workflows such as knowledge bases, systematic reviews, and literature analysis pipelines.
+## Installation
+### Docker (Recommended)
+```bash
+# Pull a pinned GROBID release (0.8.1 here; check for newer tags)
+docker pull grobid/grobid:0.8.1
+# Run GROBID server
+docker run --rm --init \
+  --ulimit core=0 \
+  -p 8070:8070 \
+  grobid/grobid:0.8.1
+# GROBID is now running at http://localhost:8070
+# Web console: http://localhost:8070/console
+```
+### From Source
+```bash
+git clone https://github.com/kermitt2/grobid.git
+cd grobid
+./gradlew clean install
+./gradlew run
+```
+## REST API Usage
+### Process Full Document
+```bash
+# Process a single PDF and get TEI XML
+curl -v --form input=@paper.pdf \
+  http://localhost:8070/api/processFulltextDocument \
+  -o paper.tei.xml
+# With options
+curl -v --form input=@paper.pdf \
+  --form consolidateHeader=1 \
+  --form consolidateCitations=1 \
+  --form includeRawCitations=1 \
+  http://localhost:8070/api/processFulltextDocument \
+  -o paper.tei.xml
+```
+### API Endpoints
+| Endpoint | Purpose | Input | Output |
+|----------|---------|-------|--------|
+| `/api/processFulltextDocument` | Full paper parsing | PDF | TEI XML |
+| `/api/processHeaderDocument` | Metadata only | PDF | TEI XML (header) |
+| `/api/processReferences` | Reference parsing | PDF | TEI XML (refs) |
+| `/api/processCitation` | Parse citation string | Text | TEI XML |
+| `/api/processDate` | Parse date string | Text | Structured date |
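The text-input endpoints in the table take form fields rather than a file upload. The sketch below builds a request for `/api/processCitation` using only the standard library. The form field names `citations` and `consolidateCitations` are assumptions based on the GROBID service documentation and should be checked against the version you run; `build_citation_request` and `parse_citation` are hypothetical helpers.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_citation_request(citation, base_url="http://localhost:8070", consolidate=False):
    """Build (but do not send) a POST request for /api/processCitation."""
    form = {
        "citations": citation,  # assumed field name; verify for your GROBID version
        "consolidateCitations": "1" if consolidate else "0",
    }
    return Request(
        f"{base_url}/api/processCitation",
        data=urlencode(form).encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_citation(citation, **kwargs):
    """Send the request to a running GROBID server and return the TEI XML as text."""
    with urlopen(build_citation_request(citation, **kwargs), timeout=30) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    req = build_citation_request("Doe, J. (2019). Title B. Journal Y, 2, 20-30.")
    print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the form encoding inspectable without a server running.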
+### Python Client
+```python
+import requests
+from pathlib import Path
+class GrobidClient:
+    def __init__(self, base_url='http://localhost:8070'):
+        self.base_url = base_url
+    def process_fulltext(self, pdf_path, consolidate_header=True,
+                         consolidate_citations=True):
+        """Process a PDF and return TEI XML."""
+        url = f'{self.base_url}/api/processFulltextDocument'
+        data = {
+            'consolidateHeader': '1' if consolidate_header else '0',
+            'consolidateCitations': '1' if consolidate_citations else '0',
+        }
+        # Open the PDF in a context manager so the handle is closed after upload
+        with open(pdf_path, 'rb') as pdf:
+            response = requests.post(url, files={'input': pdf}, data=data)
+        response.raise_for_status()
+        return response.text
+    def process_header(self, pdf_path):
+        """Extract only header metadata from PDF."""
+        url = f'{self.base_url}/api/processHeaderDocument'
+        with open(pdf_path, 'rb') as pdf:
+            response = requests.post(url, files={'input': pdf})
+        response.raise_for_status()
+        return response.text
+    def is_alive(self):
+        """Check if GROBID server is running."""
+        try:
+            # Short timeout so an unreachable host fails fast instead of hanging
+            resp = requests.get(f'{self.base_url}/api/isalive', timeout=5)
+            return resp.status_code == 200
+        except requests.RequestException:
+            return False
+# Usage
+client = GrobidClient()
+if client.is_alive():
+    tei_xml = client.process_fulltext('paper.pdf')
+    with open('paper.tei.xml', 'w', encoding='utf-8') as f:
+        f.write(tei_xml)
+```
+## Parsing TEI XML Output
+### Extracting Metadata
+```python
+from lxml import etree
+def parse_tei_metadata(tei_xml):
+    """Extract title, authors, abstract from TEI XML."""
+    ns = {'tei': 'http://www.tei-c.org/ns/1.0'}
+    root = etree.fromstring(tei_xml.encode('utf-8'))
+    # Title
+    title_el = root.find('.//tei:titleStmt/tei:title', ns)
+    title = title_el.text if title_el is not None else ''
+    # Authors
+    authors = []
+    for author in root.findall('.//tei:sourceDesc//tei:author', ns):
+        forename = author.findtext('.//tei:forename', '', ns)
+        surname = author.findtext('.//tei:surname', '', ns)
+        if surname:
+            authors.append(f'{forename} {surname}'.strip())
+    # Abstract
+    abstract_el = root.find('.//tei:profileDesc/tei:abstract', ns)
+    abstract = ''.join(abstract_el.itertext()).strip() if abstract_el is not None else ''
+    # DOI
+    doi_el = root.find('.//tei:idno[@type="DOI"]', ns)
+    doi = doi_el.text if doi_el is not None else ''
+    return {
+        'title': title,
+        'authors': authors,
+        'abstract': abstract,
+        'doi': doi,
+    }
+```
+### Extracting Body Sections
+```python
+from lxml import etree
+def parse_tei_sections(tei_xml):
+    """Extract structured sections from TEI XML body."""
+    ns = {'tei': 'http://www.tei-c.org/ns/1.0'}
+    root = etree.fromstring(tei_xml.encode('utf-8'))
+    sections = []
+    for div in root.findall('.//tei:body/tei:div', ns):
+        head = div.findtext('tei:head', '', ns).strip()
+        paragraphs = []
+        for p in div.findall('tei:p', ns):
+            text = ''.join(p.itertext()).strip()
+            if text:
+                paragraphs.append(text)
+        sections.append({
+            'heading': head,
+            'n': div.get('n', ''),
+            'paragraphs': paragraphs,
+        })
+    return sections
+```
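The section dictionaries returned above can be grouped into size-bounded chunks for downstream indexing. This is a minimal sketch, assuming the `heading` and `paragraphs` keys produced by `parse_tei_sections`; `chunk_sections` is a hypothetical helper, and the 1000-character budget is an arbitrary choice.

```python
def chunk_sections(sections, max_chars=1000):
    """Group each section's paragraphs into chunks of roughly max_chars characters."""
    chunks = []
    for section in sections:
        current, size = [], 0
        for paragraph in section["paragraphs"]:
            # Start a new chunk when adding this paragraph would exceed the budget.
            if current and size + len(paragraph) > max_chars:
                chunks.append({"heading": section["heading"], "text": "\n\n".join(current)})
                current, size = [], 0
            current.append(paragraph)
            size += len(paragraph)
        if current:
            chunks.append({"heading": section["heading"], "text": "\n\n".join(current)})
    return chunks


if __name__ == "__main__":
    demo = [{"heading": "Introduction", "n": "1",
             "paragraphs": ["a" * 600, "b" * 600, "c" * 100]}]
    for chunk in chunk_sections(demo):
        print(chunk["heading"], len(chunk["text"]))
```

Chunks never cross section boundaries, and a single paragraph longer than the budget is kept whole rather than split mid-sentence.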
+### Extracting References
+```python
+from lxml import etree
+def parse_tei_references(tei_xml):
+    """Extract structured references from TEI XML."""
+    ns = {'tei': 'http://www.tei-c.org/ns/1.0'}
+    root = etree.fromstring(tei_xml.encode('utf-8'))
+    refs = []
+    for bib in root.findall('.//tei:listBibl/tei:biblStruct', ns):
+        ref = {'id': bib.get('{http://www.w3.org/XML/1998/namespace}id', '')}
+        # Title
+        title_el = bib.find('.//tei:title[@level="a"]', ns)
+        if title_el is None:
+            title_el = bib.find('.//tei:title', ns)
+        ref['title'] = title_el.text if title_el is not None else ''
+        # Authors
+        ref['authors'] = []
+        for author in bib.findall('.//tei:author', ns):
+            name = f"{author.findtext('.//tei:forename', '', ns)} {author.findtext('.//tei:surname', '', ns)}".strip()
+            if name:
+                ref['authors'].append(name)
+        # Year: the "when" attribute holds an ISO date (e.g. 2020-03-15), so keep the first four characters
+        date_el = bib.find('.//tei:date[@type="published"]', ns)
+        ref['year'] = date_el.get('when', '')[:4] if date_el is not None else ''
+        # DOI
+        doi_el = bib.find('.//tei:idno[@type="DOI"]', ns)
+        ref['doi'] = doi_el.text if doi_el is not None else ''
+        refs.append(ref)
+    return refs
+```
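To make the element paths used above concrete, the sketch below runs the same lookups on a small hand-written TEI fragment. The fragment only imitates the style of GROBID output (real output carries far more detail), the DOI and names are placeholders, and the standard-library `xml.etree.ElementTree` is swapped in for `lxml` so the example has no dependencies.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hand-written fragment with placeholder values; not real GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct xml:id="b0">
      <analytic>
        <title level="a" type="main">Title A</title>
        <author><persName><forename type="first">J</forename><surname>Smith</surname></persName></author>
        <idno type="DOI">10.0000/example</idno>
      </analytic>
      <monogr>
        <title level="j">Journal X</title>
        <imprint><date type="published" when="2020-03-15" /></imprint>
      </monogr>
    </biblStruct>
  </listBibl></div></back></text>
</TEI>"""


def first_reference(tei_xml):
    """Return selected fields of the first biblStruct in a TEI document."""
    root = ET.fromstring(tei_xml)
    bib = root.find(".//tei:listBibl/tei:biblStruct", NS)
    date_el = bib.find('.//tei:date[@type="published"]', NS)
    when = date_el.get("when", "") if date_el is not None else ""
    return {
        "id": bib.get(XML_ID, ""),
        "title": bib.findtext('.//tei:title[@level="a"]', "", NS),
        "journal": bib.findtext('.//tei:title[@level="j"]', "", NS),
        "date": when,
        "year": when[:4],  # "when" may be a full ISO date, so slice out the year
        "doi": bib.findtext('.//tei:idno[@type="DOI"]', "", NS),
    }


if __name__ == "__main__":
    print(first_reference(SAMPLE_TEI))
```

The article title sits under `analytic` with `level="a"` while the journal title sits under `monogr` with `level="j"`, which is why the title lookup tries `level="a"` first.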
+## Batch Processing
+### Processing a Directory of PDFs
+```python
+from pathlib import Path
+import json
+from concurrent.futures import ThreadPoolExecutor
+def batch_process(pdf_dir, output_dir, max_workers=4):
+    """Process all PDFs in a directory using GROBID."""
+    client = GrobidClient()
+    pdf_dir = Path(pdf_dir)
+    output_dir = Path(output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    pdf_files = list(pdf_dir.glob('*.pdf'))
+    print(f"Processing {len(pdf_files)} PDFs...")
+    def process_one(pdf_path):
+        try:
+            tei = client.process_fulltext(str(pdf_path))
+            meta = parse_tei_metadata(tei)
+            refs = parse_tei_references(tei)
+            # Save TEI XML
+            tei_path = output_dir / f'{pdf_path.stem}.tei.xml'
+            tei_path.write_text(tei, encoding='utf-8')
+            # Save structured JSON
+            json_path = output_dir / f'{pdf_path.stem}.json'
+            json_path.write_text(json.dumps({
+                'metadata': meta,
+                'references': refs,
+                'n_references': len(refs),
+            }, indent=2))
+            return pdf_path.name, 'success'
+        except Exception as e:
+            return pdf_path.name, f'error: {str(e)}'
+    with ThreadPoolExecutor(max_workers=max_workers) as executor:
+        results = list(executor.map(process_one, pdf_files))
+    for name, status in results:
+        print(f"  {name}: {status}")
+batch_process('papers/', 'parsed_output/')
+```
+## Best Practices
+- **Use consolidation flags.** `consolidateHeader=1` and `consolidateCitations=1` cross-reference against Crossref for better metadata.
+- **Handle errors gracefully.** Some PDFs are scanned images, corrupted, or have unusual layouts. Always wrap processing in try/except.
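+- **Limit concurrent requests.** GROBID is CPU-intensive, so size the worker pool to the server's cores and its configured concurrency limit. 4-8 concurrent requests is a reasonable starting point on a typical multi-core machine.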
+- **Validate output.** Spot-check a sample of parsed documents against the original PDFs.
+- **Use GROBID for structured extraction, not OCR.** For scanned documents, run OCR first (Tesseract) then GROBID.
+- **Keep GROBID updated.** Each release improves parsing accuracy, especially for newer publisher formats.
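Error handling and concurrency limits meet when the server is saturated: GROBID's service documentation describes a 503 response when all processing threads are busy. One common way to handle that is to retry with exponential backoff. This is a sketch under stated assumptions: `ServerBusy` and `with_retries` are hypothetical names, and the caller is expected to raise `ServerBusy` when it sees status 503.

```python
import time


class ServerBusy(Exception):
    """Raised by the caller's request function on an HTTP 503 (hypothetical)."""


def with_retries(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on ServerBusy wait base_delay * 2**n seconds, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except ServerBusy:
            # Give up and re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    calls = []

    def flaky():
        # Simulated request: busy twice, then succeeds.
        calls.append(1)
        if len(calls) < 3:
            raise ServerBusy()
        return "ok"

    print(with_retries(flaky, base_delay=0.01), "after", len(calls), "calls")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting.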
+## References
+- [GROBID Documentation](https://grobid.readthedocs.io/) -- Official documentation
+- [GROBID GitHub](https://github.com/kermitt2/grobid) -- Source code
+- [TEI Guidelines](https://tei-c.org/release/doc/tei-p5-doc/en/html/) -- TEI XML standard
+- [grobid-client-python](https://github.com/kermitt2/grobid_client_python) -- Official Python client
+- [Science Parse](https://github.com/allenai/science-parse) -- Allen AI alternative parser