@wentorai/research-plugins 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +204 -0
- package/curated/analysis/README.md +64 -0
- package/curated/domains/README.md +104 -0
- package/curated/literature/README.md +53 -0
- package/curated/research/README.md +62 -0
- package/curated/tools/README.md +87 -0
- package/curated/writing/README.md +61 -0
- package/index.ts +39 -0
- package/mcp-configs/academic-db/ChatSpatial.json +17 -0
- package/mcp-configs/academic-db/academia-mcp.json +17 -0
- package/mcp-configs/academic-db/academic-paper-explorer.json +17 -0
- package/mcp-configs/academic-db/academic-search-mcp-server.json +17 -0
- package/mcp-configs/academic-db/agentinterviews-mcp.json +17 -0
- package/mcp-configs/academic-db/all-in-mcp.json +17 -0
- package/mcp-configs/academic-db/apple-health-mcp.json +17 -0
- package/mcp-configs/academic-db/arxiv-latex-mcp.json +17 -0
- package/mcp-configs/academic-db/arxiv-mcp-server.json +17 -0
- package/mcp-configs/academic-db/bgpt-mcp.json +17 -0
- package/mcp-configs/academic-db/biomcp.json +17 -0
- package/mcp-configs/academic-db/biothings-mcp.json +17 -0
- package/mcp-configs/academic-db/catalysishub-mcp-server.json +17 -0
- package/mcp-configs/academic-db/clinicaltrialsgov-mcp-server.json +17 -0
- package/mcp-configs/academic-db/deep-research-mcp.json +17 -0
- package/mcp-configs/academic-db/dicom-mcp.json +17 -0
- package/mcp-configs/academic-db/enrichr-mcp-server.json +17 -0
- package/mcp-configs/academic-db/fec-mcp-server.json +17 -0
- package/mcp-configs/academic-db/fhir-mcp-server-themomentum.json +17 -0
- package/mcp-configs/academic-db/fhir-mcp.json +19 -0
- package/mcp-configs/academic-db/gget-mcp.json +17 -0
- package/mcp-configs/academic-db/google-researcher-mcp.json +17 -0
- package/mcp-configs/academic-db/idea-reality-mcp.json +17 -0
- package/mcp-configs/academic-db/legiscan-mcp.json +19 -0
- package/mcp-configs/academic-db/lex.json +17 -0
- package/mcp-configs/ai-platform/Adaptive-Graph-of-Thoughts-MCP-server.json +17 -0
- package/mcp-configs/ai-platform/ai-counsel.json +17 -0
- package/mcp-configs/ai-platform/atlas-mcp-server.json +17 -0
- package/mcp-configs/ai-platform/counsel-mcp.json +17 -0
- package/mcp-configs/ai-platform/cross-llm-mcp.json +17 -0
- package/mcp-configs/ai-platform/gptr-mcp.json +17 -0
- package/mcp-configs/browser/decipher-research-agent.json +17 -0
- package/mcp-configs/browser/deep-research.json +17 -0
- package/mcp-configs/browser/everything-claude-code.json +17 -0
- package/mcp-configs/browser/gpt-researcher.json +17 -0
- package/mcp-configs/browser/heurist-agent-framework.json +17 -0
- package/mcp-configs/data-platform/4everland-hosting-mcp.json +17 -0
- package/mcp-configs/data-platform/context-keeper.json +17 -0
- package/mcp-configs/data-platform/context7.json +19 -0
- package/mcp-configs/data-platform/contextstream-mcp.json +17 -0
- package/mcp-configs/data-platform/email-mcp.json +17 -0
- package/mcp-configs/note-knowledge/ApeRAG.json +17 -0
- package/mcp-configs/note-knowledge/In-Memoria.json +17 -0
- package/mcp-configs/note-knowledge/agent-memory.json +17 -0
- package/mcp-configs/note-knowledge/aimemo.json +17 -0
- package/mcp-configs/note-knowledge/biel-mcp.json +19 -0
- package/mcp-configs/note-knowledge/cognee.json +17 -0
- package/mcp-configs/note-knowledge/context-awesome.json +17 -0
- package/mcp-configs/note-knowledge/context-mcp.json +17 -0
- package/mcp-configs/note-knowledge/conversation-handoff-mcp.json +17 -0
- package/mcp-configs/note-knowledge/cortex.json +17 -0
- package/mcp-configs/note-knowledge/devrag.json +17 -0
- package/mcp-configs/note-knowledge/easy-obsidian-mcp.json +17 -0
- package/mcp-configs/note-knowledge/engram.json +17 -0
- package/mcp-configs/note-knowledge/gnosis-mcp.json +17 -0
- package/mcp-configs/note-knowledge/graphlit-mcp-server.json +19 -0
- package/mcp-configs/reference-mgr/arxiv-cli.json +17 -0
- package/mcp-configs/reference-mgr/arxiv-search-mcp.json +17 -0
- package/mcp-configs/reference-mgr/chiken.json +17 -0
- package/mcp-configs/reference-mgr/claude-scholar.json +17 -0
- package/mcp-configs/reference-mgr/devonthink-mcp.json +17 -0
- package/mcp-configs/registry.json +447 -0
- package/openclaw.plugin.json +21 -0
- package/package.json +61 -0
- package/skills/analysis/dataviz/color-accessibility-guide/SKILL.md +230 -0
- package/skills/analysis/dataviz/geospatial-viz-guide/SKILL.md +218 -0
- package/skills/analysis/dataviz/interactive-viz-guide/SKILL.md +287 -0
- package/skills/analysis/dataviz/network-visualization-guide/SKILL.md +195 -0
- package/skills/analysis/dataviz/publication-figures-guide/SKILL.md +238 -0
- package/skills/analysis/dataviz/python-dataviz-guide/SKILL.md +195 -0
- package/skills/analysis/econometrics/causal-inference-guide/SKILL.md +197 -0
- package/skills/analysis/econometrics/iv-regression-guide/SKILL.md +198 -0
- package/skills/analysis/econometrics/panel-data-guide/SKILL.md +274 -0
- package/skills/analysis/econometrics/robustness-checks/SKILL.md +250 -0
- package/skills/analysis/econometrics/stata-regression/SKILL.md +117 -0
- package/skills/analysis/econometrics/time-series-guide/SKILL.md +235 -0
- package/skills/analysis/statistics/bayesian-statistics-guide/SKILL.md +221 -0
- package/skills/analysis/statistics/hypothesis-testing-guide/SKILL.md +210 -0
- package/skills/analysis/statistics/meta-analysis-guide/SKILL.md +206 -0
- package/skills/analysis/statistics/nonparametric-tests-guide/SKILL.md +221 -0
- package/skills/analysis/statistics/power-analysis-guide/SKILL.md +240 -0
- package/skills/analysis/statistics/sem-guide/SKILL.md +231 -0
- package/skills/analysis/statistics/survival-analysis-guide/SKILL.md +195 -0
- package/skills/analysis/wrangling/missing-data-handling/SKILL.md +224 -0
- package/skills/analysis/wrangling/pandas-data-wrangling/SKILL.md +242 -0
- package/skills/analysis/wrangling/questionnaire-design-guide/SKILL.md +234 -0
- package/skills/analysis/wrangling/text-mining-guide/SKILL.md +225 -0
- package/skills/domains/ai-ml/computer-vision-guide/SKILL.md +213 -0
- package/skills/domains/ai-ml/deep-learning-papers-guide/SKILL.md +200 -0
- package/skills/domains/ai-ml/llm-evaluation-guide/SKILL.md +194 -0
- package/skills/domains/ai-ml/prompt-engineering-research/SKILL.md +233 -0
- package/skills/domains/ai-ml/reinforcement-learning-guide/SKILL.md +254 -0
- package/skills/domains/ai-ml/transformer-architecture-guide/SKILL.md +233 -0
- package/skills/domains/biomedical/clinical-research-guide/SKILL.md +232 -0
- package/skills/domains/biomedical/clinicaltrials-api/SKILL.md +177 -0
- package/skills/domains/biomedical/epidemiology-guide/SKILL.md +200 -0
- package/skills/domains/biomedical/genomics-analysis-guide/SKILL.md +270 -0
- package/skills/domains/business/market-analysis-guide/SKILL.md +112 -0
- package/skills/domains/business/strategic-management-guide/SKILL.md +154 -0
- package/skills/domains/chemistry/computational-chemistry-guide/SKILL.md +266 -0
- package/skills/domains/chemistry/retrosynthesis-guide/SKILL.md +215 -0
- package/skills/domains/cs/algorithms-complexity-guide/SKILL.md +194 -0
- package/skills/domains/cs/dblp-api/SKILL.md +129 -0
- package/skills/domains/cs/software-engineering-research/SKILL.md +218 -0
- package/skills/domains/ecology/biodiversity-data-guide/SKILL.md +296 -0
- package/skills/domains/ecology/conservation-biology-guide/SKILL.md +198 -0
- package/skills/domains/ecology/gbif-api/SKILL.md +158 -0
- package/skills/domains/ecology/inaturalist-api/SKILL.md +173 -0
- package/skills/domains/economics/behavioral-economics-guide/SKILL.md +239 -0
- package/skills/domains/economics/development-economics-guide/SKILL.md +181 -0
- package/skills/domains/economics/fred-api/SKILL.md +189 -0
- package/skills/domains/education/curriculum-design-guide/SKILL.md +144 -0
- package/skills/domains/education/learning-science-guide/SKILL.md +150 -0
- package/skills/domains/finance/financial-data-analysis/SKILL.md +152 -0
- package/skills/domains/finance/quantitative-finance-guide/SKILL.md +151 -0
- package/skills/domains/geoscience/climate-science-guide/SKILL.md +158 -0
- package/skills/domains/geoscience/gis-remote-sensing-guide/SKILL.md +129 -0
- package/skills/domains/humanities/digital-humanities-guide/SKILL.md +181 -0
- package/skills/domains/humanities/philosophy-research-guide/SKILL.md +148 -0
- package/skills/domains/law/courtlistener-api/SKILL.md +213 -0
- package/skills/domains/law/legal-research-guide/SKILL.md +250 -0
- package/skills/domains/math/linear-algebra-applications/SKILL.md +227 -0
- package/skills/domains/math/numerical-methods-guide/SKILL.md +236 -0
- package/skills/domains/math/oeis-api/SKILL.md +158 -0
- package/skills/domains/pharma/clinical-pharmacology-guide/SKILL.md +165 -0
- package/skills/domains/pharma/drug-development-guide/SKILL.md +177 -0
- package/skills/domains/physics/computational-physics-guide/SKILL.md +300 -0
- package/skills/domains/physics/nasa-ads-api/SKILL.md +150 -0
- package/skills/domains/physics/quantum-computing-guide/SKILL.md +234 -0
- package/skills/domains/social-science/social-research-methods/SKILL.md +194 -0
- package/skills/domains/social-science/survey-research-guide/SKILL.md +182 -0
- package/skills/literature/discovery/citation-alert-guide/SKILL.md +154 -0
- package/skills/literature/discovery/conference-proceedings-guide/SKILL.md +142 -0
- package/skills/literature/discovery/literature-mapping-guide/SKILL.md +175 -0
- package/skills/literature/discovery/paper-tracking-guide/SKILL.md +211 -0
- package/skills/literature/discovery/rss-paper-feeds/SKILL.md +214 -0
- package/skills/literature/discovery/semantic-scholar-recs-guide/SKILL.md +164 -0
- package/skills/literature/fulltext/doaj-api/SKILL.md +120 -0
- package/skills/literature/fulltext/interlibrary-loan-guide/SKILL.md +163 -0
- package/skills/literature/fulltext/open-access-guide/SKILL.md +183 -0
- package/skills/literature/fulltext/pmc-oai-api/SKILL.md +184 -0
- package/skills/literature/fulltext/preprint-servers-guide/SKILL.md +128 -0
- package/skills/literature/fulltext/repository-harvesting-guide/SKILL.md +207 -0
- package/skills/literature/fulltext/unpaywall-api/SKILL.md +113 -0
- package/skills/literature/metadata/altmetrics-guide/SKILL.md +132 -0
- package/skills/literature/metadata/citation-network-guide/SKILL.md +236 -0
- package/skills/literature/metadata/crossref-api/SKILL.md +133 -0
- package/skills/literature/metadata/datacite-api/SKILL.md +126 -0
- package/skills/literature/metadata/doi-resolution-guide/SKILL.md +168 -0
- package/skills/literature/metadata/h-index-guide/SKILL.md +183 -0
- package/skills/literature/metadata/journal-metrics-guide/SKILL.md +188 -0
- package/skills/literature/metadata/opencitations-api/SKILL.md +128 -0
- package/skills/literature/metadata/orcid-api/SKILL.md +136 -0
- package/skills/literature/metadata/orcid-integration-guide/SKILL.md +178 -0
- package/skills/literature/search/arxiv-api/SKILL.md +95 -0
- package/skills/literature/search/biorxiv-api/SKILL.md +123 -0
- package/skills/literature/search/boolean-search-guide/SKILL.md +199 -0
- package/skills/literature/search/citation-chaining-guide/SKILL.md +148 -0
- package/skills/literature/search/database-comparison-guide/SKILL.md +100 -0
- package/skills/literature/search/europe-pmc-api/SKILL.md +120 -0
- package/skills/literature/search/google-scholar-guide/SKILL.md +182 -0
- package/skills/literature/search/mesh-terms-guide/SKILL.md +164 -0
- package/skills/literature/search/openalex-api/SKILL.md +134 -0
- package/skills/literature/search/pubmed-api/SKILL.md +130 -0
- package/skills/literature/search/scientify-literature-survey/SKILL.md +203 -0
- package/skills/literature/search/semantic-scholar-api/SKILL.md +134 -0
- package/skills/literature/search/systematic-search-strategy/SKILL.md +214 -0
- package/skills/research/automation/ai-scientist-guide/SKILL.md +228 -0
- package/skills/research/automation/data-collection-automation/SKILL.md +248 -0
- package/skills/research/automation/research-workflow-automation/SKILL.md +266 -0
- package/skills/research/deep-research/meta-synthesis-guide/SKILL.md +174 -0
- package/skills/research/deep-research/research-cog/SKILL.md +153 -0
- package/skills/research/deep-research/scoping-review-guide/SKILL.md +217 -0
- package/skills/research/deep-research/systematic-review-guide/SKILL.md +250 -0
- package/skills/research/funding/figshare-api/SKILL.md +163 -0
- package/skills/research/funding/grant-writing-guide/SKILL.md +233 -0
- package/skills/research/funding/nsf-grant-guide/SKILL.md +206 -0
- package/skills/research/funding/open-science-guide/SKILL.md +255 -0
- package/skills/research/funding/zenodo-api/SKILL.md +174 -0
- package/skills/research/methodology/action-research-guide/SKILL.md +201 -0
- package/skills/research/methodology/experimental-design-guide/SKILL.md +236 -0
- package/skills/research/methodology/grad-school-guide/SKILL.md +182 -0
- package/skills/research/methodology/grounded-theory-guide/SKILL.md +171 -0
- package/skills/research/methodology/mixed-methods-guide/SKILL.md +208 -0
- package/skills/research/methodology/qualitative-research-guide/SKILL.md +234 -0
- package/skills/research/methodology/scientify-idea-generation/SKILL.md +222 -0
- package/skills/research/paper-review/paper-reading-assistant/SKILL.md +266 -0
- package/skills/research/paper-review/peer-review-guide/SKILL.md +227 -0
- package/skills/research/paper-review/rebuttal-writing-guide/SKILL.md +185 -0
- package/skills/research/paper-review/scientify-write-review-paper/SKILL.md +209 -0
- package/skills/tools/code-exec/jupyter-notebook-guide/SKILL.md +178 -0
- package/skills/tools/code-exec/python-reproducibility-guide/SKILL.md +341 -0
- package/skills/tools/code-exec/r-reproducibility-guide/SKILL.md +236 -0
- package/skills/tools/code-exec/sandbox-execution-guide/SKILL.md +221 -0
- package/skills/tools/diagram/mermaid-diagram-guide/SKILL.md +269 -0
- package/skills/tools/diagram/plantuml-guide/SKILL.md +397 -0
- package/skills/tools/diagram/scientific-illustration-guide/SKILL.md +225 -0
- package/skills/tools/document/anystyle-api/SKILL.md +199 -0
- package/skills/tools/document/grobid-pdf-parsing/SKILL.md +294 -0
- package/skills/tools/document/markdown-academic-guide/SKILL.md +217 -0
- package/skills/tools/document/pdf-extraction-guide/SKILL.md +321 -0
- package/skills/tools/knowledge-graph/knowledge-graph-construction/SKILL.md +306 -0
- package/skills/tools/knowledge-graph/ontology-design-guide/SKILL.md +214 -0
- package/skills/tools/knowledge-graph/rag-methodology-guide/SKILL.md +325 -0
- package/skills/tools/ocr-translate/formula-recognition-guide/SKILL.md +367 -0
- package/skills/tools/ocr-translate/handwriting-recognition-guide/SKILL.md +211 -0
- package/skills/tools/ocr-translate/latex-ocr-guide/SKILL.md +204 -0
- package/skills/tools/ocr-translate/multilingual-research-guide/SKILL.md +234 -0
- package/skills/tools/scraping/academic-web-scraping/SKILL.md +326 -0
- package/skills/tools/scraping/api-data-collection-guide/SKILL.md +301 -0
- package/skills/tools/scraping/web-scraping-ethics-guide/SKILL.md +250 -0
- package/skills/writing/citation/bibtex-management-guide/SKILL.md +246 -0
- package/skills/writing/citation/citation-style-guide/SKILL.md +248 -0
- package/skills/writing/citation/reference-manager-comparison/SKILL.md +208 -0
- package/skills/writing/citation/zotero-api/SKILL.md +188 -0
- package/skills/writing/composition/abstract-writing-guide/SKILL.md +188 -0
- package/skills/writing/composition/discussion-writing-guide/SKILL.md +194 -0
- package/skills/writing/composition/introduction-writing-guide/SKILL.md +194 -0
- package/skills/writing/composition/literature-review-writing/SKILL.md +196 -0
- package/skills/writing/composition/methods-section-guide/SKILL.md +185 -0
- package/skills/writing/composition/response-to-reviewers/SKILL.md +215 -0
- package/skills/writing/composition/scientific-writing-guide/SKILL.md +152 -0
- package/skills/writing/latex/bibliography-management-guide/SKILL.md +206 -0
- package/skills/writing/latex/latex-drawing-guide/SKILL.md +234 -0
- package/skills/writing/latex/latex-ecosystem-guide/SKILL.md +240 -0
- package/skills/writing/latex/math-typesetting-guide/SKILL.md +231 -0
- package/skills/writing/latex/overleaf-collaboration-guide/SKILL.md +211 -0
- package/skills/writing/latex/tikz-diagrams-guide/SKILL.md +211 -0
- package/skills/writing/polish/academic-translation-guide/SKILL.md +175 -0
- package/skills/writing/polish/academic-writing-refiner/SKILL.md +143 -0
- package/skills/writing/polish/ai-writing-humanizer/SKILL.md +178 -0
- package/skills/writing/polish/grammar-checker-guide/SKILL.md +184 -0
- package/skills/writing/polish/plagiarism-detection-guide/SKILL.md +167 -0
- package/skills/writing/templates/beamer-presentation-guide/SKILL.md +263 -0
- package/skills/writing/templates/conference-paper-template/SKILL.md +219 -0
- package/skills/writing/templates/thesis-template-guide/SKILL.md +200 -0
- package/skills/writing/templates/thesis-writing-guide/SKILL.md +220 -0
- package/src/tools/arxiv.ts +131 -0
- package/src/tools/crossref.ts +112 -0
- package/src/tools/openalex.ts +174 -0
- package/src/tools/pubmed.ts +166 -0
- package/src/tools/semantic-scholar.ts +108 -0
- package/src/tools/unpaywall.ts +58 -0
|
@@ -0,0 +1,325 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: rag-methodology-guide
|
|
3
|
+
description: "RAG architecture for academic knowledge retrieval and synthesis"
|
|
4
|
+
metadata:
|
|
5
|
+
openclaw:
|
|
6
|
+
emoji: "brain"
|
|
7
|
+
category: "tools"
|
|
8
|
+
subcategory: "knowledge-graph"
|
|
9
|
+
keywords: ["RAG", "retrieval augmented generation", "academic knowledge graph", "knowledge modeling"]
|
|
10
|
+
source: "wentor-research-plugins"
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# RAG Methodology Guide
|
|
14
|
+
|
|
15
|
+
Design and implement Retrieval-Augmented Generation (RAG) systems for academic research, including document chunking, embedding strategies, retrieval pipelines, and evaluation.
|
|
16
|
+
|
|
17
|
+
## What Is RAG?
|
|
18
|
+
|
|
19
|
+
Retrieval-Augmented Generation (RAG) augments a language model's generation with relevant information retrieved from an external knowledge base. For academic research, this enables:
|
|
20
|
+
|
|
21
|
+
- Question answering over a personal paper library
|
|
22
|
+
- Literature synthesis across hundreds of papers
|
|
23
|
+
- Fact-checking claims against source documents
|
|
24
|
+
- Generating citations with provenance
|
|
25
|
+
|
|
26
|
+
### RAG Pipeline Architecture
|
|
27
|
+
|
|
28
|
+
```
|
|
29
|
+
Query: "What are the main challenges of protein folding?"
|
|
30
|
+
|
|
|
31
|
+
v
|
|
32
|
+
[1. Query Processing]
|
|
33
|
+
|-- Embed query using embedding model
|
|
34
|
+
|-- Optional: Query expansion / HyDE
|
|
35
|
+
|
|
|
36
|
+
v
|
|
37
|
+
[2. Retrieval]
|
|
38
|
+
|-- Search vector database for top-k relevant chunks
|
|
39
|
+
|-- Optional: Reranking with cross-encoder
|
|
40
|
+
|
|
|
41
|
+
v
|
|
42
|
+
[3. Context Assembly]
|
|
43
|
+
|-- Combine retrieved chunks into a prompt
|
|
44
|
+
|-- Add metadata (source, page, citation)
|
|
45
|
+
|
|
|
46
|
+
v
|
|
47
|
+
[4. Generation]
|
|
48
|
+
|-- LLM generates answer grounded in retrieved context
|
|
49
|
+
|-- Include inline citations
|
|
50
|
+
|
|
|
51
|
+
v
|
|
52
|
+
Answer with citations
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
## Step 1: Document Ingestion and Chunking
|
|
56
|
+
|
|
57
|
+
### Chunking Strategies
|
|
58
|
+
|
|
59
|
+
| Strategy | Description | Best For |
|
|
60
|
+
|----------|-------------|----------|
|
|
61
|
+
| **Fixed-size** | Split every N characters/tokens | Simple, fast, baseline |
|
|
62
|
+
| **Sentence-based** | Split on sentence boundaries | Natural reading units |
|
|
63
|
+
| **Paragraph-based** | Split on paragraph breaks | Coherent semantic units |
|
|
64
|
+
| **Section-based** | Split on document headings | Academic papers |
|
|
65
|
+
| **Recursive** | Hierarchically split (heading > paragraph > sentence) | General purpose |
|
|
66
|
+
| **Semantic** | Split on topic shifts using embeddings | Best quality, slower |
|
|
67
|
+
|
|
68
|
+
### Implementation
|
|
69
|
+
|
|
70
|
+
```python
|
|
71
|
+
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
|
72
|
+
|
|
73
|
+
def chunk_academic_paper(text, chunk_size=1000, chunk_overlap=200):
    """Split an academic paper into overlapping chunks.

    Uses recursive splitting that prefers structural boundaries
    (section headings, then paragraphs, then sentences) before
    falling back to word-level splits.

    Args:
        text: Full paper text (markdown-style headings expected).
        chunk_size: Target maximum chunk length in characters.
        chunk_overlap: Characters shared between adjacent chunks.

    Returns:
        List of chunk strings.
    """
    # Boundary preference, coarsest structure first.
    boundary_markers = [
        "\n## ",   # H2 headings (section breaks)
        "\n### ",  # H3 headings (subsection breaks)
        "\n\n",    # Paragraph breaks
        "\n",      # Line breaks
        ". ",      # Sentence breaks
        " ",       # Word breaks
    ]
    recursive_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=boundary_markers,
        length_function=len,
    )
    return recursive_splitter.split_text(text)
|
|
90
|
+
|
|
91
|
+
# Add metadata to each chunk
|
|
92
|
+
def create_documents(paper_text, metadata):
    """Chunk a paper and attach per-chunk metadata for citation tracking.

    Each returned record carries the shared paper metadata (title,
    authors, DOI, ...) plus the chunk's position within the paper, so a
    retrieved chunk can always be traced back to its source.
    """
    pieces = chunk_academic_paper(paper_text)
    total = len(pieces)
    return [
        {
            "text": piece,
            "metadata": {**metadata, "chunk_index": idx, "chunk_total": total},
        }
        for idx, piece in enumerate(pieces)
    ]
|
|
106
|
+
|
|
107
|
+
# Example usage
|
|
108
|
+
docs = create_documents(
|
|
109
|
+
paper_text=extracted_text,
|
|
110
|
+
metadata={
|
|
111
|
+
"title": "Attention Is All You Need",
|
|
112
|
+
"authors": "Vaswani et al.",
|
|
113
|
+
"year": 2017,
|
|
114
|
+
"doi": "10.48550/arXiv.1706.03762",
|
|
115
|
+
"source_file": "vaswani2017attention.pdf"
|
|
116
|
+
}
|
|
117
|
+
)
|
|
118
|
+
```
|
|
119
|
+
|
|
120
|
+
## Step 2: Embedding and Indexing
|
|
121
|
+
|
|
122
|
+
### Embedding Model Selection
|
|
123
|
+
|
|
124
|
+
| Model | Dimensions | Quality | Speed | Cost |
|
|
125
|
+
|-------|-----------|---------|-------|------|
|
|
126
|
+
| OpenAI text-embedding-3-small | 1536 | Good | Fast | $0.02/1M tokens |
|
|
127
|
+
| OpenAI text-embedding-3-large | 3072 | Excellent | Fast | $0.13/1M tokens |
|
|
128
|
+
| Cohere embed-v3 | 1024 | Excellent | Fast | $0.10/1M tokens |
|
|
129
|
+
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Good | Very fast | Free (local) |
|
|
130
|
+
| BAAI/bge-large-en-v1.5 | 1024 | Excellent | Medium | Free (local) |
|
|
131
|
+
| nomic-embed-text | 768 | Good | Fast | Free (local) |
|
|
132
|
+
|
|
133
|
+
### Vector Database Options
|
|
134
|
+
|
|
135
|
+
| Database | Type | Scalability | Features |
|
|
136
|
+
|----------|------|------------|----------|
|
|
137
|
+
| ChromaDB | Embedded | Small-medium | Simple, good for prototyping |
|
|
138
|
+
| FAISS | Library | Large | Facebook research, GPU support |
|
|
139
|
+
| Pinecone | Cloud | Large | Managed, serverless |
|
|
140
|
+
| Weaviate | Self-hosted/Cloud | Large | Hybrid search, filters |
|
|
141
|
+
| Qdrant | Self-hosted/Cloud | Large | Rich filtering, payload storage |
|
|
142
|
+
| pgvector | PostgreSQL extension | Medium | SQL integration |
|
|
143
|
+
|
|
144
|
+
### Building the Index
|
|
145
|
+
|
|
146
|
+
```python
|
|
147
|
+
import chromadb
|
|
148
|
+
from sentence_transformers import SentenceTransformer
|
|
149
|
+
|
|
150
|
+
# Initialize embedding model (local, free)
|
|
151
|
+
embed_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
|
|
152
|
+
|
|
153
|
+
# Initialize ChromaDB
|
|
154
|
+
client = chromadb.PersistentClient(path="./chroma_db")
|
|
155
|
+
collection = client.get_or_create_collection(
|
|
156
|
+
name="research_papers",
|
|
157
|
+
metadata={"hnsw:space": "cosine"}
|
|
158
|
+
)
|
|
159
|
+
|
|
160
|
+
# Index documents
|
|
161
|
+
def index_documents(documents):
    """Embed each chunk and store it in the ChromaDB collection.

    Relies on the module-level `embed_model` and `collection` objects
    initialized above. Each chunk gets a stable, unique id derived from
    its insertion order.
    """
    chunk_texts = [record["text"] for record in documents]
    chunk_metas = [record["metadata"] for record in documents]
    chunk_ids = [f"doc_{n}" for n in range(len(documents))]
    # Encode all chunks in one batch; Chroma expects plain Python lists.
    vectors = embed_model.encode(chunk_texts, show_progress_bar=True).tolist()

    collection.add(
        documents=chunk_texts,
        embeddings=vectors,
        metadatas=chunk_metas,
        ids=chunk_ids,
    )
    print(f"Indexed {len(documents)} chunks")
|
|
175
|
+
|
|
176
|
+
index_documents(docs)
|
|
177
|
+
```
|
|
178
|
+
|
|
179
|
+
## Step 3: Retrieval
|
|
180
|
+
|
|
181
|
+
### Basic Retrieval
|
|
182
|
+
|
|
183
|
+
```python
|
|
184
|
+
def retrieve(query, top_k=5):
    """Return the `top_k` chunks most similar to `query`.

    Embeds the query with the module-level `embed_model`, searches the
    ChromaDB `collection`, and converts each cosine distance into a
    similarity score (similarity = 1 - distance).
    """
    query_vec = embed_model.encode([query]).tolist()

    hits = collection.query(
        query_embeddings=query_vec,
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )

    # Chroma returns one result list per query; we sent a single query,
    # so index [0] selects its results.
    texts = hits["documents"][0]
    metas = hits["metadatas"][0]
    dists = hits["distances"][0]
    return [
        {"text": text, "metadata": meta, "similarity": 1 - dist}
        for text, meta, dist in zip(texts, metas, dists)
    ]
|
|
207
|
+
|
|
208
|
+
# Example
|
|
209
|
+
results = retrieve("What are the main components of the Transformer architecture?")
|
|
210
|
+
for r in results:
|
|
211
|
+
print(f"[{r['similarity']:.3f}] {r['metadata'].get('title', 'N/A')}")
|
|
212
|
+
print(f" {r['text'][:150]}...")
|
|
213
|
+
```
|
|
214
|
+
|
|
215
|
+
### Advanced Retrieval: Hybrid Search
|
|
216
|
+
|
|
217
|
+
```python
|
|
218
|
+
def hybrid_retrieve(query, top_k=5, alpha=0.7):
    """Combine dense (semantic) and sparse (keyword) retrieval via RRF.

    Dense hits come from the vector index (`retrieve`); sparse hits come
    from BM25 over `all_documents` (a module-level list of every chunk
    text, in the order used at indexing time). The two ranked lists are
    merged with weighted Reciprocal Rank Fusion.

    Args:
        query: Natural-language query string.
        top_k: Number of fused results to return.
        alpha: Weight of the dense ranking; the sparse ranking gets
            ``1 - alpha``.

    Returns:
        List of ``(chunk_text, rrf_score)`` tuples, best first.

    Note:
        Fusion is keyed on the chunk text itself so both retrieval paths
        share one id space. Keying the dense side on the per-paper
        ``chunk_index`` metadata (as a naive implementation might) would
        collide across papers and would not line up with BM25's corpus
        positions, silently merging scores of unrelated chunks.
    """
    from rank_bm25 import BM25Okapi

    # Over-fetch from both retrievers so fusion has candidates to merge.
    dense_results = retrieve(query, top_k=top_k * 2)

    # Sparse retrieval: BM25 keyword matching over the full chunk corpus.
    tokenized_corpus = [doc.split() for doc in all_documents]
    bm25 = BM25Okapi(tokenized_corpus)
    bm25_scores = bm25.get_scores(query.split())
    sparse_top = bm25_scores.argsort()[-top_k * 2:][::-1]

    # Weighted Reciprocal Rank Fusion; k=60 is the standard RRF constant.
    rrf_scores = {}
    k = 60

    for rank, result in enumerate(dense_results):
        key = result["text"]  # shared id space: the chunk text itself
        rrf_scores[key] = rrf_scores.get(key, 0.0) + alpha / (k + rank + 1)

    for rank, corpus_idx in enumerate(sparse_top):
        key = all_documents[corpus_idx]
        rrf_scores[key] = rrf_scores.get(key, 0.0) + (1 - alpha) / (k + rank + 1)

    # Sort by fused score and return the best top_k.
    sorted_results = sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)
    return sorted_results[:top_k]
|
|
247
|
+
```
|
|
248
|
+
|
|
249
|
+
## Step 4: Generation with Citations
|
|
250
|
+
|
|
251
|
+
```python
|
|
252
|
+
def generate_answer(query, retrieved_contexts):
    """Build a citation-grounded prompt answering `query` from retrieved chunks.

    Each retrieved context is numbered and labeled with its source
    (authors, year) so the LLM can cite it inline as [1], [2], ...
    Returns the assembled prompt for inspection; swap in a real LLM call
    (see the commented example at the bottom) to get a generated answer.
    """
    def _source_label(ctx):
        # Fall back to placeholders when chunk metadata is incomplete.
        meta = ctx["metadata"]
        return f"{meta.get('authors', 'Unknown')}, {meta.get('year', 'N/A')}"

    context_string = "\n\n".join(
        f"[{n}] ({_source_label(ctx)}): {ctx['text']}"
        for n, ctx in enumerate(retrieved_contexts, 1)
    )

    prompt = f"""Based on the following research paper excerpts, answer the question.
Use inline citations like [1], [2] to reference specific sources.
Only use information from the provided excerpts.
If the excerpts do not contain enough information, say so.

EXCERPTS:
{context_string}

QUESTION: {query}

ANSWER (with inline citations):"""

    # Send to LLM (example with OpenAI)
    # response = openai.chat.completions.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": prompt}],
    #     temperature=0.1
    # )
    # return response.choices[0].message.content

    return prompt  # Return prompt for inspection
|
|
284
|
+
```
|
|
285
|
+
|
|
286
|
+
## Evaluation Metrics
|
|
287
|
+
|
|
288
|
+
| Metric | Measures | Tool |
|
|
289
|
+
|--------|----------|------|
|
|
290
|
+
| **Retrieval precision** | Are retrieved chunks relevant? | Manual annotation |
|
|
291
|
+
| **Retrieval recall** | Are all relevant chunks retrieved? | Known-relevant set |
|
|
292
|
+
| **NDCG** | Ranking quality of retrieved results | BEIR benchmark |
|
|
293
|
+
| **Answer correctness** | Is the generated answer factually correct? | Human evaluation |
|
|
294
|
+
| **Faithfulness** | Does the answer only use information from retrieved context? | RAGAS framework |
|
|
295
|
+
| **Answer relevance** | Does the answer address the question? | RAGAS framework |
|
|
296
|
+
| **Context relevance** | Are the retrieved contexts relevant to the question? | RAGAS framework |
|
|
297
|
+
|
|
298
|
+
```python
|
|
299
|
+
# Using RAGAS for automated RAG evaluation
|
|
300
|
+
from ragas import evaluate
|
|
301
|
+
from ragas.metrics import faithfulness, answer_relevancy, context_precision
|
|
302
|
+
|
|
303
|
+
# Prepare evaluation dataset
|
|
304
|
+
eval_data = {
|
|
305
|
+
"question": ["What is the Transformer architecture?"],
|
|
306
|
+
"answer": ["The Transformer uses self-attention mechanisms..."],
|
|
307
|
+
"contexts": [["The Transformer model architecture eschews recurrence..."]],
|
|
308
|
+
"ground_truth": ["The Transformer is a neural network architecture..."]
|
|
309
|
+
}
|
|
310
|
+
|
|
311
|
+
result = evaluate(
|
|
312
|
+
dataset=eval_data,
|
|
313
|
+
metrics=[faithfulness, answer_relevancy, context_precision]
|
|
314
|
+
)
|
|
315
|
+
print(result)
|
|
316
|
+
```
|
|
317
|
+
|
|
318
|
+
## Best Practices for Academic RAG
|
|
319
|
+
|
|
320
|
+
1. **Chunk by section**: Academic papers have natural section boundaries. Use them.
|
|
321
|
+
2. **Preserve metadata**: Always store title, authors, year, DOI, and page number with each chunk for proper citation.
|
|
322
|
+
3. **Use domain-specific embeddings**: Models fine-tuned on scientific text (e.g., SPECTER2) outperform general models for academic content.
|
|
323
|
+
4. **Rerank after retrieval**: A cross-encoder reranker significantly improves precision over embedding-only retrieval.
|
|
324
|
+
5. **Handle tables and figures**: Extract tables as text or structured data; do not ignore them during chunking.
|
|
325
|
+
6. **Evaluate systematically**: Use RAGAS or a custom evaluation set to measure retrieval and generation quality before deploying.
|
|
@@ -0,0 +1,367 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: formula-recognition-guide
|
|
3
|
+
description: "Math OCR and formula recognition to LaTeX conversion"
|
|
4
|
+
metadata:
|
|
5
|
+
openclaw:
|
|
6
|
+
emoji: "math"
|
|
7
|
+
category: "tools"
|
|
8
|
+
subcategory: "ocr-translate"
|
|
9
|
+
keywords: ["math OCR", "formula recognition", "LaTeX OCR"]
|
|
10
|
+
source: "wentor-research-plugins"
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# Formula Recognition Guide
|
|
14
|
+
|
|
15
|
+
Convert mathematical formulas from images, PDFs, and handwritten notes to LaTeX code using OCR tools, neural models, and API services.
|
|
16
|
+
|
|
17
|
+
## Tool Comparison
|
|
18
|
+
|
|
19
|
+
| Tool | Input | Output | Accuracy | Speed | Cost |
|
|
20
|
+
|------|-------|--------|----------|-------|------|
|
|
21
|
+
| Mathpix | Image, PDF, screenshot | LaTeX, MathML | Excellent | Fast | Free tier (50/month), then paid |
|
|
22
|
+
| LaTeX-OCR (Lukas Blecher) | Image | LaTeX | Very good | Medium | Free (open source) |
|
|
23
|
+
| Pix2Text (p2t) | Image | LaTeX + text | Good | Medium | Free (open source) |
|
|
24
|
+
| Nougat (Meta) | PDF pages | Markdown + LaTeX | Excellent (full page) | Slow (GPU) | Free (open source) |
|
|
25
|
+
| InftyReader | Image, PDF | LaTeX, MathML | Good | Medium | Commercial |
|
|
26
|
+
| Google Cloud Vision | Image | Text (limited math) | Poor for math | Fast | Pay per use |
|
|
27
|
+
| im2latex (Harvard NLP) | Image | LaTeX | Good | Medium | Free (open source) |
|
|
28
|
+
|
|
29
|
+
## Mathpix API
|
|
30
|
+
|
|
31
|
+
Mathpix is the industry-standard math OCR service, handling printed and handwritten formulas, tables, and full documents.
|
|
32
|
+
|
|
33
|
+
### Setup
|
|
34
|
+
|
|
35
|
+
```bash
|
|
36
|
+
pip install mathpix
|
|
37
|
+
# Or use the REST API directly
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
### Single Image to LaTeX
|
|
41
|
+
|
|
42
|
+
```python
|
|
43
|
+
import requests
|
|
44
|
+
import base64
|
|
45
|
+
import json
|
|
46
|
+
|
|
47
|
+
def mathpix_ocr(image_path, app_id, app_key):
|
|
48
|
+
"""Convert an image of a formula to LaTeX using Mathpix API."""
|
|
49
|
+
with open(image_path, "rb") as f:
|
|
50
|
+
image_data = base64.b64encode(f.read()).decode()
|
|
51
|
+
|
|
52
|
+
response = requests.post(
|
|
53
|
+
"https://api.mathpix.com/v3/text",
|
|
54
|
+
headers={
|
|
55
|
+
"app_id": app_id,
|
|
56
|
+
"app_key": app_key,
|
|
57
|
+
"Content-Type": "application/json"
|
|
58
|
+
},
|
|
59
|
+
json={
|
|
60
|
+
"src": f"data:image/png;base64,{image_data}",
|
|
61
|
+
"formats": ["latex_styled", "latex_normal", "mathml"],
|
|
62
|
+
"data_options": {
|
|
63
|
+
"include_asciimath": True,
|
|
64
|
+
"include_latex": True
|
|
65
|
+
}
|
|
66
|
+
}
|
|
67
|
+
)
|
|
68
|
+
|
|
69
|
+
result = response.json()
|
|
70
|
+
return {
|
|
71
|
+
"latex": result.get("latex_styled", ""),
|
|
72
|
+
"latex_normal": result.get("latex_normal", ""),
|
|
73
|
+
"confidence": result.get("confidence", 0),
|
|
74
|
+
"mathml": result.get("mathml", "")
|
|
75
|
+
}
|
|
76
|
+
|
|
77
|
+
# Usage
|
|
78
|
+
result = mathpix_ocr("equation.png", "YOUR_APP_ID", "YOUR_APP_KEY")
|
|
79
|
+
print(f"LaTeX: {result['latex']}")
|
|
80
|
+
print(f"Confidence: {result['confidence']:.2%}")
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
### Process a Full PDF Page
|
|
84
|
+
|
|
85
|
+
```python
|
|
86
|
+
def mathpix_pdf_page(image_path, app_id, app_key):
|
|
87
|
+
"""Process a full PDF page with mixed text and math."""
|
|
88
|
+
with open(image_path, "rb") as f:
|
|
89
|
+
image_data = base64.b64encode(f.read()).decode()
|
|
90
|
+
|
|
91
|
+
response = requests.post(
|
|
92
|
+
"https://api.mathpix.com/v3/text",
|
|
93
|
+
headers={
|
|
94
|
+
"app_id": app_id,
|
|
95
|
+
"app_key": app_key,
|
|
96
|
+
"Content-Type": "application/json"
|
|
97
|
+
},
|
|
98
|
+
json={
|
|
99
|
+
"src": f"data:image/png;base64,{image_data}",
|
|
100
|
+
"formats": ["text", "latex_styled"],
|
|
101
|
+
"ocr": ["math", "text"],
|
|
102
|
+
"math_inline_delimiters": ["$", "$"],
|
|
103
|
+
"math_display_delimiters": ["$$", "$$"]
|
|
104
|
+
}
|
|
105
|
+
)
|
|
106
|
+
|
|
107
|
+
result = response.json()
|
|
108
|
+
return result.get("text", "")
|
|
109
|
+
|
|
110
|
+
# Returns Markdown with inline $...$ and display $$...$$ math
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
## LaTeX-OCR (Open Source, Local)
|
|
114
|
+
|
|
115
|
+
LaTeX-OCR by Lukas Blecher is a free, locally-running model for converting formula images to LaTeX.
|
|
116
|
+
|
|
117
|
+
### Installation and Usage
|
|
118
|
+
|
|
119
|
+
```bash
|
|
120
|
+
pip install "pix2tex[gui]"
|
|
121
|
+
```
|
|
122
|
+
|
|
123
|
+
```python
|
|
124
|
+
from pix2tex.cli import LatexOCR
|
|
125
|
+
|
|
126
|
+
# Initialize model (downloads on first use, ~1GB)
|
|
127
|
+
model = LatexOCR()
|
|
128
|
+
|
|
129
|
+
# From file
|
|
130
|
+
from PIL import Image
|
|
131
|
+
|
|
132
|
+
img = Image.open("equation.png")
|
|
133
|
+
latex = model(img)
|
|
134
|
+
print(f"LaTeX: {latex}")
|
|
135
|
+
# Output: \frac{\partial \mathcal{L}}{\partial \theta} = -\frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p(y_i | x_i; \theta)
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
### Batch Processing
|
|
139
|
+
|
|
140
|
+
```python
|
|
141
|
+
from PIL import Image
|
|
142
|
+
from pathlib import Path
|
|
143
|
+
|
|
144
|
+
def batch_ocr(image_dir, model):
|
|
145
|
+
"""Process all formula images in a directory."""
|
|
146
|
+
results = []
|
|
147
|
+
for img_path in sorted(Path(image_dir).glob("*.png")):
|
|
148
|
+
img = Image.open(img_path)
|
|
149
|
+
latex = model(img)
|
|
150
|
+
results.append({
|
|
151
|
+
"file": img_path.name,
|
|
152
|
+
"latex": latex
|
|
153
|
+
})
|
|
154
|
+
print(f"{img_path.name}: {latex[:80]}...")
|
|
155
|
+
return results
|
|
156
|
+
|
|
157
|
+
model = LatexOCR()
|
|
158
|
+
results = batch_ocr("./formula_images/", model)
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
## Pix2Text (Chinese + English + Math)
|
|
162
|
+
|
|
163
|
+
Pix2Text handles mixed Chinese/English text alongside mathematical formulas.
|
|
164
|
+
|
|
165
|
+
```bash
|
|
166
|
+
pip install pix2text
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
```python
|
|
170
|
+
from pix2text import Pix2Text
|
|
171
|
+
|
|
172
|
+
p2t = Pix2Text()
|
|
173
|
+
|
|
174
|
+
# Recognize mixed content (text + math)
|
|
175
|
+
result = p2t.recognize("mixed_content.png")
|
|
176
|
+
print(result)
|
|
177
|
+
# Output includes both text and LaTeX formulas
|
|
178
|
+
```
|
|
179
|
+
|
|
180
|
+
## Nougat (Meta) — Full Document OCR
|
|
181
|
+
|
|
182
|
+
Nougat converts entire academic PDF pages to Markdown with LaTeX math, preserving document structure.
|
|
183
|
+
|
|
184
|
+
```bash
|
|
185
|
+
pip install nougat-ocr
|
|
186
|
+
```
|
|
187
|
+
|
|
188
|
+
```bash
|
|
189
|
+
# Convert a PDF to Markdown
|
|
190
|
+
nougat path/to/paper.pdf -o output_dir/ --no-skipping
|
|
191
|
+
|
|
192
|
+
# Output: Markdown files with LaTeX equations preserved
|
|
193
|
+
# e.g., The loss function is $\mathcal{L}(\theta) = ...$
|
|
194
|
+
```
|
|
195
|
+
|
|
196
|
+
```python
|
|
197
|
+
# Programmatic usage
|
|
198
|
+
from nougat import NougatModel
|
|
199
|
+
from nougat.utils.dataset import LazyDataset
|
|
200
|
+
from nougat.postprocessing import markdown_compatible
|
|
201
|
+
|
|
202
|
+
model = NougatModel.from_pretrained("facebook/nougat-base")
|
|
203
|
+
model.eval()
|
|
204
|
+
|
|
205
|
+
# Process pages...
|
|
206
|
+
```
|
|
207
|
+
|
|
208
|
+
## Screenshot-Based Workflow
|
|
209
|
+
|
|
210
|
+
### macOS Workflow
|
|
211
|
+
|
|
212
|
+
```bash
|
|
213
|
+
# 1. Take a screenshot of the formula (Cmd+Shift+4)
|
|
214
|
+
# 2. Process with LaTeX-OCR or Mathpix
|
|
215
|
+
|
|
216
|
+
# Automated with a shell script:
|
|
217
|
+
#!/bin/bash
|
|
218
|
+
# save as ~/bin/formula-ocr.sh
|
|
219
|
+
SCREENSHOT=$(mktemp /tmp/formula_XXXXXX.png)
|
|
220
|
+
screencapture -i "$SCREENSHOT"
|
|
221
|
+
python -c "
|
|
222
|
+
from pix2tex.cli import LatexOCR
|
|
223
|
+
from PIL import Image
|
|
224
|
+
model = LatexOCR()
|
|
225
|
+
img = Image.open('$SCREENSHOT')
|
|
226
|
+
latex = model(img)
|
|
227
|
+
print(latex)
|
|
228
|
+
# Copy to clipboard
|
|
229
|
+
import subprocess
|
|
230
|
+
subprocess.run(['pbcopy'], input=latex.encode())
|
|
231
|
+
print('Copied to clipboard!')
|
|
232
|
+
"
|
|
233
|
+
```
|
|
234
|
+
|
|
235
|
+
### Cross-Platform Screen Capture (Python)
|
|
236
|
+
|
|
237
|
+
```python
|
|
238
|
+
import tkinter as tk
|
|
239
|
+
from PIL import ImageGrab
|
|
240
|
+
|
|
241
|
+
def capture_and_ocr():
|
|
242
|
+
"""Capture screen region and convert to LaTeX."""
|
|
243
|
+
# Simple screenshot capture
|
|
244
|
+
print("Select the formula region...")
|
|
245
|
+
img = ImageGrab.grab(bbox=None) # Full screen; use tool for selection
|
|
246
|
+
|
|
247
|
+
from pix2tex.cli import LatexOCR
|
|
248
|
+
model = LatexOCR()
|
|
249
|
+
latex = model(img)
|
|
250
|
+
print(f"\nLaTeX: {latex}")
|
|
251
|
+
return latex
|
|
252
|
+
```
|
|
253
|
+
|
|
254
|
+
## Post-Processing and Validation
|
|
255
|
+
|
|
256
|
+
### Common OCR Errors and Fixes
|
|
257
|
+
|
|
258
|
+
| OCR Error | Correct LaTeX | Fix Strategy |
|
|
259
|
+
|-----------|--------------|--------------|
|
|
260
|
+
| `\Sigma` vs `\sum` | Context-dependent | Check if it is a summation or sigma variable |
|
|
261
|
+
| Missing subscripts | `x_i` not `xi` | Verify variable names against source |
|
|
262
|
+
| Wrong delimiter size | `\left( \right)` | Add `\left` and `\right` for auto-sizing |
|
|
263
|
+
| Misrecognized symbols | `\theta` vs `\Theta` | Compare against original image |
|
|
264
|
+
| Missing spaces | `\frac{a}{b}c` | Add spacing commands (`\,`, `\;`, `\quad`) |
|
|
265
|
+
|
|
266
|
+
### Validation Script
|
|
267
|
+
|
|
268
|
+
```python
|
|
269
|
+
import subprocess
|
|
270
|
+
import tempfile
|
|
271
|
+
import os
|
|
272
|
+
|
|
273
|
+
def validate_latex(latex_string):
|
|
274
|
+
"""Check if a LaTeX string compiles without errors."""
|
|
275
|
+
doc = f"""
|
|
276
|
+
\\documentclass{{article}}
|
|
277
|
+
\\usepackage{{amsmath,amssymb}}
|
|
278
|
+
\\begin{{document}}
|
|
279
|
+
${latex_string}$
|
|
280
|
+
\\end{{document}}
|
|
281
|
+
"""
|
|
282
|
+
|
|
283
|
+
with tempfile.NamedTemporaryFile(mode="w", suffix=".tex", delete=False) as f:
|
|
284
|
+
f.write(doc)
|
|
285
|
+
tex_path = f.name
|
|
286
|
+
|
|
287
|
+
try:
|
|
288
|
+
result = subprocess.run(
|
|
289
|
+
["pdflatex", "-interaction=nonstopmode", tex_path],
|
|
290
|
+
capture_output=True, text=True, timeout=10,
|
|
291
|
+
cwd=tempfile.gettempdir()
|
|
292
|
+
)
|
|
293
|
+
success = result.returncode == 0
|
|
294
|
+
if not success:
|
|
295
|
+
# Extract error message
|
|
296
|
+
for line in result.stdout.split("\n"):
|
|
297
|
+
if line.startswith("!"):
|
|
298
|
+
print(f"LaTeX error: {line}")
|
|
299
|
+
return success
|
|
300
|
+
except subprocess.TimeoutExpired:
|
|
301
|
+
return False
|
|
302
|
+
finally:
|
|
303
|
+
for ext in [".tex", ".pdf", ".aux", ".log"]:
|
|
304
|
+
try:
|
|
305
|
+
os.remove(tex_path.replace(".tex", ext))
|
|
306
|
+
except FileNotFoundError:
|
|
307
|
+
pass
|
|
308
|
+
|
|
309
|
+
# Test
|
|
310
|
+
latex = r"\frac{\partial \mathcal{L}}{\partial \theta}"
|
|
311
|
+
print(f"Valid: {validate_latex(latex)}")
|
|
312
|
+
```
|
|
313
|
+
|
|
314
|
+
## Integration with Note-Taking
|
|
315
|
+
|
|
316
|
+
### Obsidian / Markdown Notes
|
|
317
|
+
|
|
318
|
+
```markdown
|
|
319
|
+
# Lecture Notes: Statistical Mechanics
|
|
320
|
+
|
|
321
|
+
The partition function is defined as:
|
|
322
|
+
|
|
323
|
+
$$Z = \sum_{i} e^{-\beta E_i}$$
|
|
324
|
+
|
|
325
|
+
where $\beta = 1/k_B T$ is the inverse temperature.
|
|
326
|
+
|
|
327
|
+
The free energy is:
|
|
328
|
+
|
|
329
|
+
$$F = -k_B T \ln Z$$
|
|
330
|
+
|
|
331
|
+
[OCR'd from slide 15 using Mathpix, confidence: 0.97]
|
|
332
|
+
```
|
|
333
|
+
|
|
334
|
+
### Automated Pipeline
|
|
335
|
+
|
|
336
|
+
```python
|
|
337
|
+
def process_lecture_slides(pdf_path, output_md):
|
|
338
|
+
"""Convert lecture slides with formulas to Markdown notes."""
|
|
339
|
+
from pdf2image import convert_from_path
|
|
340
|
+
from pix2tex.cli import LatexOCR
|
|
341
|
+
|
|
342
|
+
model = LatexOCR()
|
|
343
|
+
images = convert_from_path(pdf_path, dpi=200)
|
|
344
|
+
|
|
345
|
+
with open(output_md, "w") as f:
|
|
346
|
+
f.write(f"# Notes from {pdf_path}\n\n")
|
|
347
|
+
for i, img in enumerate(images):
|
|
348
|
+
f.write(f"## Slide {i+1}\n\n")
|
|
349
|
+
# Full page text extraction (use Nougat or Mathpix for best results)
|
|
350
|
+
# For formula-only images, use LaTeX-OCR:
|
|
351
|
+
try:
|
|
352
|
+
latex = model(img)
|
|
353
|
+
f.write(f"$$\n{latex}\n$$\n\n")
|
|
354
|
+
except Exception as e:
|
|
355
|
+
f.write(f"[OCR failed: {e}]\n\n")
|
|
356
|
+
|
|
357
|
+
print(f"Notes saved to {output_md}")
|
|
358
|
+
```
|
|
359
|
+
|
|
360
|
+
## Best Practices
|
|
361
|
+
|
|
362
|
+
1. **Crop tightly**: OCR accuracy improves significantly when the formula is cropped with minimal surrounding whitespace.
|
|
363
|
+
2. **Use high resolution**: 200-300 DPI gives the best results. Lower resolution degrades recognition accuracy.
|
|
364
|
+
3. **Validate output**: Always compile the generated LaTeX to verify correctness before using in a manuscript.
|
|
365
|
+
4. **Handle multi-line equations**: For aligned equations, process each line separately or use a full-page model like Nougat.
|
|
366
|
+
5. **Combine tools**: Use Mathpix for critical formulas and LaTeX-OCR for bulk processing to balance cost and quality.
|
|
367
|
+
6. **Build a corrections dictionary**: Track common OCR errors for your domain and apply automated post-processing fixes.
|