PyPI - embgrep - Versions diffs - 0.1.0__tar.gz - Mend

embgrep 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

embgrep-0.1.0/.gitignore +12 -0
embgrep-0.1.0/LICENSE +21 -0
embgrep-0.1.0/PKG-INFO +194 -0
embgrep-0.1.0/README.md +160 -0
embgrep-0.1.0/embgrep/__init__.py +79 -0
embgrep-0.1.0/embgrep/__main__.py +141 -0
embgrep-0.1.0/embgrep/chunker.py +205 -0
embgrep-0.1.0/embgrep/db.py +159 -0
embgrep-0.1.0/embgrep/embedder.py +60 -0
embgrep-0.1.0/embgrep/indexer.py +237 -0
embgrep-0.1.0/embgrep/mcp_server.py +119 -0
embgrep-0.1.0/pyproject.toml +48 -0
embgrep-0.1.0/tests/__init__.py +0 -0
embgrep-0.1.0/tests/test_chunker.py +236 -0
embgrep-0.1.0/tests/test_db.py +172 -0
embgrep-0.1.0/tests/test_embedder.py +68 -0
embgrep-0.1.0/tests/test_indexer.py +199 -0
embgrep-0.1.0/tests/test_mcp.py +113 -0

embgrep-0.1.0/.gitignore ADDED

@@ -0,0 +1,12 @@
+__pycache__/
+*.py[cod]
+*.egg-info/
+dist/
+build/
+.venv/
+.pytest_cache/
+.ruff_cache/
+.env
+.env.local
+CLAUDE.md
+PLAN.txt

embgrep-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 QuartzUnit
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

embgrep-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,194 @@
+Metadata-Version: 2.4
+Name: embgrep
+Version: 0.1.0
+Summary: Local semantic search — embedding-powered grep for files, zero external services.
+Project-URL: Homepage, https://github.com/QuartzUnit/embgrep
+Project-URL: Repository, https://github.com/QuartzUnit/embgrep
+Author: QuartzUnit
+License-Expression: MIT
+License-File: LICENSE
+Keywords: embeddings,grep,local,mcp,rag,semantic-search
+Classifier: Development Status :: 3 - Alpha
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Typing :: Typed
+Requires-Python: >=3.11
+Requires-Dist: fastembed>=0.4
+Requires-Dist: numpy>=1.24
+Provides-Extra: all
+Requires-Dist: click>=8.0; extra == 'all'
+Requires-Dist: fastmcp>=2.0; extra == 'all'
+Requires-Dist: rich>=13.0; extra == 'all'
+Provides-Extra: cli
+Requires-Dist: click>=8.0; extra == 'cli'
+Requires-Dist: rich>=13.0; extra == 'cli'
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0; extra == 'dev'
+Requires-Dist: ruff>=0.8; extra == 'dev'
+Provides-Extra: mcp
+Requires-Dist: fastmcp>=2.0; extra == 'mcp'
+Description-Content-Type: text/markdown
+# embgrep
+**Local semantic search — embedding-powered grep for files, zero external services.**
+[![PyPI](https://img.shields.io/pypi/v/embgrep)](https://pypi.org/project/embgrep/)
+[![Python](https://img.shields.io/pypi/pyversions/embgrep)](https://pypi.org/project/embgrep/)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+Search your codebase and documentation by *meaning*, not just keywords. embgrep indexes files into local embeddings and lets you run semantic queries — no API keys, no cloud services, no vector database servers.
+## Features
+- **Local embeddings** — Uses [fastembed](https://github.com/qdrant/fastembed) (ONNX Runtime), no API keys needed
+- **SQLite storage** — Single-file index, no external vector DB
+- **Incremental indexing** — Only re-indexes changed files (SHA-256 hash comparison)
+- **Smart chunking** — Function-level splitting for code, heading-level for docs
+- **MCP native** — 4-tool FastMCP server for LLM agent integration
+- **15+ file types** — `.py`, `.js`, `.ts`, `.java`, `.go`, `.rs`, `.md`, `.txt`, `.yaml`, `.json`, `.toml`, and more
+## Install
+```bash
+pip install embgrep              # core (fastembed + numpy)
+pip install embgrep[cli]         # + click/rich CLI
+pip install embgrep[mcp]         # + FastMCP server
+pip install embgrep[all]         # everything
+```
+## Quick Start
+### Python API
+```python
+from embgrep import EmbGrep
+eg = EmbGrep()
+# Index a directory
+eg.index("./my-project", patterns=["*.py", "*.md"])
+# Semantic search
+results = eg.search("database connection pooling", top_k=5)
+for r in results:
+    print(f"{r.file_path}:{r.line_start}-{r.line_end} (score: {r.score:.4f})")
+    print(f"  {r.chunk_text[:80]}...")
+# Incremental update (only changed files)
+eg.update()
+# Index statistics
+status = eg.status()
+print(f"{status.total_files} files, {status.total_chunks} chunks, {status.index_size_mb} MB")
+eg.close()
+```
+### CLI
+```bash
+# Index a project
+embgrep index ./my-project --patterns "*.py,*.md"
+# Search
+embgrep search "error handling patterns"
+# Filter by file type
+embgrep search "async database query" --path-filter "%.py"
+# Check status
+embgrep status
+# Update changed files
+embgrep update
+```
+### Convenience functions
+```python
+import embgrep
+embgrep.index("./src")
+results = embgrep.search("authentication middleware")
+status = embgrep.status()
+embgrep.update()
+```
+## MCP Server
+Add to your Claude Desktop / MCP client configuration:
+```json
+{
+  "mcpServers": {
+    "embgrep": {
+      "command": "embgrep-mcp"
+    }
+  }
+}
+```
+Or with uvx:
+```json
+{
+  "mcpServers": {
+    "embgrep": {
+      "command": "uvx",
+      "args": ["--from", "embgrep[mcp]", "embgrep-mcp"]
+    }
+  }
+}
+```
+### MCP Tools
+| Tool | Description |
+|------|-------------|
+| `index_directory` | Index files in a directory for semantic search |
+| `semantic_search` | Search indexed files using natural language |
+| `index_status` | Get current index statistics |
+| `update_index` | Incremental update — re-index changed files only |
+## How It Works
+1. **Chunking** — Files are split into semantically meaningful chunks:
+   - Code files (`.py`, `.js`, `.ts`, etc.): split by function/class boundaries
+   - Documents (`.md`, `.txt`): split by headings or paragraph breaks
+   - Config files: fixed-size chunking
+2. **Embedding** — Each chunk is converted to a 384-dimensional vector using [BGE-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) via ONNX Runtime (no PyTorch needed)
+3. **Storage** — Embeddings are stored as BLOBs in a local SQLite database
+4. **Search** — Query text is embedded and compared against all chunks using cosine similarity
+## Configuration
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `db_path` | `~/.local/share/embgrep/embgrep.db` | SQLite database location |
+| `model` | `BAAI/bge-small-en-v1.5` | fastembed model name |
+| `max_chunk_size` | 1000 chars | Maximum chunk size for fixed-size splitting |
+| `top_k` | 5 | Number of search results |
+## QuartzUnit Ecosystem
+| Package | Description |
+|---------|-------------|
+| [markgrab](https://github.com/QuartzUnit/markgrab) | HTML/YouTube/PDF/DOCX to LLM-ready markdown |
+| [snapgrab](https://github.com/QuartzUnit/snapgrab) | URL to screenshot + metadata |
+| [docpick](https://github.com/QuartzUnit/docpick) | OCR + LLM document structure extraction |
+| [browsegrab](https://github.com/QuartzUnit/browsegrab) | Local LLM browser agent |
+| [feedkit](https://github.com/QuartzUnit/feedkit) | RSS feed collection + MCP |
+| **embgrep** | **Local semantic search for files** |
+## License
+MIT
+<!-- mcp-name: io.github.ArkNill/embgrep -->
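
Reviewer note on the chunking step: the README above says documents are split by headings, and search results carry `line_start` and `line_end`. The sketch below shows one way heading-level splitting with line ranges could work. It is not taken from `chunker.py`, which is not shown in this excerpt; the `Chunk` dataclass and `chunk_by_headings` are hypothetical names, and the sketch ignores edge cases such as `#` lines inside fenced code.

```python
import re
from dataclasses import dataclass

# ATX-style Markdown heading: one to six '#' followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


@dataclass
class Chunk:
    """Hypothetical chunk record; line numbers are 1-indexed and inclusive."""
    text: str
    line_start: int
    line_end: int


def chunk_by_headings(markdown: str) -> list[Chunk]:
    """Split a document so that every heading opens a new chunk."""
    lines = markdown.splitlines()
    # Line 0 always opens a chunk, even if the file does not begin with a heading.
    starts = [i for i, line in enumerate(lines) if i == 0 or HEADING.match(line)]
    chunks = []
    for start, end in zip(starts, starts[1:] + [len(lines)]):
        chunks.append(Chunk("\n".join(lines[start:end]), start + 1, end))
    return chunks


doc = "# Title\nintro\n## Install\npip install x\n## Usage\nrun it\nagain"
chunks = chunk_by_headings(doc)
for c in chunks:
    print(f"{c.line_start}-{c.line_end}: {c.text.splitlines()[0]}")
```

Keeping the line range on each chunk is what lets a result be reported as `file:start-end`, as in the README's Python API example.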

embgrep-0.1.0/README.md ADDED

@@ -0,0 +1,160 @@
+# embgrep
+**Local semantic search — embedding-powered grep for files, zero external services.**
+[![PyPI](https://img.shields.io/pypi/v/embgrep)](https://pypi.org/project/embgrep/)
+[![Python](https://img.shields.io/pypi/pyversions/embgrep)](https://pypi.org/project/embgrep/)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+Search your codebase and documentation by *meaning*, not just keywords. embgrep indexes files into local embeddings and lets you run semantic queries — no API keys, no cloud services, no vector database servers.
+## Features
+- **Local embeddings** — Uses [fastembed](https://github.com/qdrant/fastembed) (ONNX Runtime), no API keys needed
+- **SQLite storage** — Single-file index, no external vector DB
+- **Incremental indexing** — Only re-indexes changed files (SHA-256 hash comparison)
+- **Smart chunking** — Function-level splitting for code, heading-level for docs
+- **MCP native** — 4-tool FastMCP server for LLM agent integration
+- **15+ file types** — `.py`, `.js`, `.ts`, `.java`, `.go`, `.rs`, `.md`, `.txt`, `.yaml`, `.json`, `.toml`, and more
+## Install
+```bash
+pip install embgrep              # core (fastembed + numpy)
+pip install embgrep[cli]         # + click/rich CLI
+pip install embgrep[mcp]         # + FastMCP server
+pip install embgrep[all]         # everything
+```
+## Quick Start
+### Python API
+```python
+from embgrep import EmbGrep
+eg = EmbGrep()
+# Index a directory
+eg.index("./my-project", patterns=["*.py", "*.md"])
+# Semantic search
+results = eg.search("database connection pooling", top_k=5)
+for r in results:
+    print(f"{r.file_path}:{r.line_start}-{r.line_end} (score: {r.score:.4f})")
+    print(f"  {r.chunk_text[:80]}...")
+# Incremental update (only changed files)
+eg.update()
+# Index statistics
+status = eg.status()
+print(f"{status.total_files} files, {status.total_chunks} chunks, {status.index_size_mb} MB")
+eg.close()
+```
+### CLI
+```bash
+# Index a project
+embgrep index ./my-project --patterns "*.py,*.md"
+# Search
+embgrep search "error handling patterns"
+# Filter by file type
+embgrep search "async database query" --path-filter "%.py"
+# Check status
+embgrep status
+# Update changed files
+embgrep update
+```
+### Convenience functions
+```python
+import embgrep
+embgrep.index("./src")
+results = embgrep.search("authentication middleware")
+status = embgrep.status()
+embgrep.update()
+```
+## MCP Server
+Add to your Claude Desktop / MCP client configuration:
+```json
+{
+  "mcpServers": {
+    "embgrep": {
+      "command": "embgrep-mcp"
+    }
+  }
+}
+```
+Or with uvx:
+```json
+{
+  "mcpServers": {
+    "embgrep": {
+      "command": "uvx",
+      "args": ["--from", "embgrep[mcp]", "embgrep-mcp"]
+    }
+  }
+}
+```
+### MCP Tools
+| Tool | Description |
+|------|-------------|
+| `index_directory` | Index files in a directory for semantic search |
+| `semantic_search` | Search indexed files using natural language |
+| `index_status` | Get current index statistics |
+| `update_index` | Incremental update — re-index changed files only |
+## How It Works
+1. **Chunking** — Files are split into semantically meaningful chunks:
+   - Code files (`.py`, `.js`, `.ts`, etc.): split by function/class boundaries
+   - Documents (`.md`, `.txt`): split by headings or paragraph breaks
+   - Config files: fixed-size chunking
+2. **Embedding** — Each chunk is converted to a 384-dimensional vector using [BGE-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) via ONNX Runtime (no PyTorch needed)
+3. **Storage** — Embeddings are stored as BLOBs in a local SQLite database
+4. **Search** — Query text is embedded and compared against all chunks using cosine similarity
+## Configuration
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `db_path` | `~/.local/share/embgrep/embgrep.db` | SQLite database location |
+| `model` | `BAAI/bge-small-en-v1.5` | fastembed model name |
+| `max_chunk_size` | 1000 chars | Maximum chunk size for fixed-size splitting |
+| `top_k` | 5 | Number of search results |
+## QuartzUnit Ecosystem
+| Package | Description |
+|---------|-------------|
+| [markgrab](https://github.com/QuartzUnit/markgrab) | HTML/YouTube/PDF/DOCX to LLM-ready markdown |
+| [snapgrab](https://github.com/QuartzUnit/snapgrab) | URL to screenshot + metadata |
+| [docpick](https://github.com/QuartzUnit/docpick) | OCR + LLM document structure extraction |
+| [browsegrab](https://github.com/QuartzUnit/browsegrab) | Local LLM browser agent |
+| [feedkit](https://github.com/QuartzUnit/feedkit) | RSS feed collection + MCP |
+| **embgrep** | **Local semantic search for files** |
+## License
+MIT
+<!-- mcp-name: io.github.ArkNill/embgrep -->
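
Reviewer note on the storage and search steps: the README says embeddings are stored as BLOBs in SQLite and that a query is compared against all chunks by cosine similarity. The sketch below illustrates that brute-force approach with the standard library only (embgrep itself depends on numpy). The table name, column names, float32 layout, and helper functions are assumptions for illustration, not the schema in `db.py`, and the 3-dimensional vectors stand in for the 384-dimensional model output.

```python
import math
import sqlite3
import struct


def pack(vec: list[float]) -> bytes:
    """Serialize a vector as little-endian float32 bytes for a BLOB column."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack(blob: bytes) -> list[float]:
    """Inverse of pack(): 4 bytes per float32 component."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(conn: sqlite3.Connection, query_vec: list[float], k: int = 5):
    """Score every stored chunk against the query and keep the k best."""
    rows = conn.execute("SELECT file_path, embedding FROM chunks").fetchall()
    scored = [(cosine(query_vec, unpack(blob)), path) for path, blob in rows]
    scored.sort(reverse=True)
    return scored[:k]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (file_path TEXT, embedding BLOB)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?)",
    [
        ("db.py", pack([1.0, 0.0, 0.0])),
        ("cli.py", pack([0.0, 1.0, 0.0])),
        ("pool.py", pack([0.5, 0.5, 0.0])),
    ],
)
results = top_k(conn, [1.0, 0.0, 0.0], k=2)
for score, path in results:
    print(f"{path}: {score:.4f}")
```

A full scan like this is linear in the number of chunks, which is consistent with the README's "compared against all chunks" wording and avoids the need for a separate vector index.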

embgrep-0.1.0/embgrep/__init__.py ADDED

@@ -0,0 +1,79 @@
+"""embgrep — Local semantic search, embedding-powered grep for files."""
+
+from __future__ import annotations
+
+from embgrep.indexer import EmbGrep, IndexStatus, SearchResult
+
+__all__ = ["EmbGrep", "IndexStatus", "SearchResult", "index", "search", "status", "update"]
+__version__ = "0.1.0"
+
+
+def index(directory: str, patterns: list[str] | None = None, db_path: str | None = None) -> dict:
+    """Index files in a directory.
+
+    Args:
+        directory: Path to the directory to index.
+        patterns: Optional list of glob patterns to filter files.
+        db_path: Optional path to the SQLite database.
+
+    Returns:
+        Dictionary with files_indexed, chunks_created, index_size_mb.
+    """
+    eg = EmbGrep(db_path=db_path) if db_path else EmbGrep()
+    try:
+        return eg.index(directory, patterns=patterns)
+    finally:
+        eg.close()
+
+
+def search(
+    query: str, top_k: int = 5, path_filter: str | None = None, db_path: str | None = None
+) -> list[SearchResult]:
+    """Semantic search across indexed files.
+
+    Args:
+        query: Natural language search query.
+        top_k: Number of results to return.
+        path_filter: Optional LIKE pattern to filter by file path.
+        db_path: Optional path to the SQLite database.
+
+    Returns:
+        List of SearchResult sorted by similarity score.
+    """
+    eg = EmbGrep(db_path=db_path) if db_path else EmbGrep()
+    try:
+        return eg.search(query, top_k=top_k, path_filter=path_filter)
+    finally:
+        eg.close()
+
+
+def status(db_path: str | None = None) -> IndexStatus:
+    """Get index statistics.
+
+    Args:
+        db_path: Optional path to the SQLite database.
+
+    Returns:
+        IndexStatus with total_files, total_chunks, last_updated, index_size_mb.
+    """
+    eg = EmbGrep(db_path=db_path) if db_path else EmbGrep()
+    try:
+        return eg.status()
+    finally:
+        eg.close()
+
+
+def update(db_path: str | None = None) -> dict:
+    """Incremental update — re-index changed files only.
+
+    Args:
+        db_path: Optional path to the SQLite database.
+
+    Returns:
+        Dictionary with updated_files, new_chunks, removed_files.
+    """
+    eg = EmbGrep(db_path=db_path) if db_path else EmbGrep()
+    try:
+        return eg.update()
+    finally:
+        eg.close()
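
Reviewer note on `path_filter`: the docstring above describes it as a LIKE pattern, and the README passes `"%.py"` on the command line. The sketch below shows how SQL LIKE wildcards behave in SQLite, which is why `%.py` works where a shell-style `*.py` would not. How embgrep applies the filter internally is not visible in this excerpt; the `chunks` table and `filter_paths` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (file_path TEXT)")
conn.executemany(
    "INSERT INTO chunks VALUES (?)",
    [("src/db.py",), ("src/app.js",), ("docs/db.md",), ("tests/test_db.py",)],
)


def filter_paths(pattern: str) -> list[str]:
    """Return stored paths matching a SQL LIKE pattern (bound as a parameter)."""
    rows = conn.execute(
        "SELECT file_path FROM chunks WHERE file_path LIKE ? ORDER BY file_path",
        (pattern,),
    )
    return [row[0] for row in rows]


py_files = filter_paths("%.py")    # '%' matches any run of characters
src_files = filter_paths("src/%")  # prefix match on a directory
glob_style = filter_paths("*.py")  # '*' is a literal character in LIKE

print(py_files)
print(src_files)
print(glob_style)
```

Note that `_` is also a LIKE wildcard (it matches any single character), so a pattern containing an underscore matches slightly more than its literal text.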

embgrep-0.1.0/embgrep/__main__.py ADDED

@@ -0,0 +1,141 @@
+"""CLI entry point for embgrep — embedding-powered grep for files."""
+
+from __future__ import annotations
+
+import sys
+
+
+def main() -> None:
+    """Main CLI entry point."""
+    try:
+        import click
+        from rich.console import Console
+        from rich.table import Table
+    except ImportError:
+        print("CLI requires extra dependencies: pip install embgrep[cli]")
+        sys.exit(1)
+
+    console = Console()
+
+    @click.group()
+    @click.version_option(package_name="embgrep")
+    def cli() -> None:
+        """embgrep — Local semantic search, embedding-powered grep for files."""
+
+    @cli.command()
+    @click.argument("path", type=click.Path(exists=True))
+    @click.option("--patterns", "-p", default=None, help="Comma-separated glob patterns (e.g., '*.md,*.py').")
+    @click.option("--db-path", default=None, help="Path to SQLite database.")
+    @click.option("--model", default="BAAI/bge-small-en-v1.5", help="Embedding model name.")
+    def index(path: str, patterns: str | None, db_path: str | None, model: str) -> None:
+        """Index files in PATH for semantic search."""
+        from embgrep.indexer import EmbGrep
+
+        pattern_list = [p.strip() for p in patterns.split(",")] if patterns else None
+        kwargs: dict = {"model": model}
+        if db_path:
+            kwargs["db_path"] = db_path
+
+        eg = EmbGrep(**kwargs)
+        try:
+            with console.status("[bold green]Indexing files..."):
+                result = eg.index(path, patterns=pattern_list)
+            console.print(f"[green]Indexed {result['files_indexed']} files, {result['chunks_created']} chunks[/green]")
+            console.print(f"Index size: {result['index_size_mb']} MB")
+        finally:
+            eg.close()
+
+    @cli.command()
+    @click.argument("query")
+    @click.option("--top-k", "-k", default=5, help="Number of results to return.")
+    @click.option("--path-filter", "-f", default=None, help="SQL LIKE pattern for file path filter.")
+    @click.option("--db-path", default=None, help="Path to SQLite database.")
+    @click.option("--model", default="BAAI/bge-small-en-v1.5", help="Embedding model name.")
+    def search(query: str, top_k: int, path_filter: str | None, db_path: str | None, model: str) -> None:
+        """Semantic search across indexed files."""
+        from embgrep.indexer import EmbGrep
+
+        kwargs: dict = {"model": model}
+        if db_path:
+            kwargs["db_path"] = db_path
+
+        eg = EmbGrep(**kwargs)
+        try:
+            with console.status("[bold green]Searching..."):
+                results = eg.search(query, top_k=top_k, path_filter=path_filter)
+
+            if not results:
+                console.print("[yellow]No results found.[/yellow]")
+                return
+
+            table = Table(title=f"Search: {query!r}", show_lines=True)
+            table.add_column("#", style="dim", width=3)
+            table.add_column("Score", style="cyan", width=7)
+            table.add_column("File", style="green")
+            table.add_column("Lines", style="yellow", width=10)
+            table.add_column("Preview", max_width=60)
+
+            for i, r in enumerate(results, 1):
+                preview = r.chunk_text[:120].replace("\n", " ").strip()
+                if len(r.chunk_text) > 120:
+                    preview += "..."
+                table.add_row(
+                    str(i),
+                    f"{r.score:.4f}",
+                    r.file_path,
+                    f"{r.line_start}-{r.line_end}",
+                    preview,
+                )
+
+            console.print(table)
+        finally:
+            eg.close()
+
+    @cli.command()
+    @click.option("--db-path", default=None, help="Path to SQLite database.")
+    def status(db_path: str | None) -> None:
+        """Show index statistics."""
+        from embgrep.indexer import EmbGrep
+
+        kwargs: dict = {}
+        if db_path:
+            kwargs["db_path"] = db_path
+
+        eg = EmbGrep(**kwargs)
+        try:
+            st = eg.status()
+            console.print("[bold]embgrep Index Status[/bold]")
+            console.print(f"  Files:        {st.total_files}")
+            console.print(f"  Chunks:       {st.total_chunks}")
+            console.print(f"  Last updated: {st.last_updated}")
+            console.print(f"  Index size:   {st.index_size_mb} MB")
+        finally:
+            eg.close()
+
+    @cli.command()
+    @click.option("--db-path", default=None, help="Path to SQLite database.")
+    @click.option("--model", default="BAAI/bge-small-en-v1.5", help="Embedding model name.")
+    def update(db_path: str | None, model: str) -> None:
+        """Incremental update — re-index changed files only."""
+        from embgrep.indexer import EmbGrep
+
+        kwargs: dict = {"model": model}
+        if db_path:
+            kwargs["db_path"] = db_path
+
+        eg = EmbGrep(**kwargs)
+        try:
+            with console.status("[bold green]Updating index..."):
+                result = eg.update()
+            console.print(f"[green]Updated {result['updated_files']} files, {result['new_chunks']} new chunks[/green]")
+            if result["removed_files"]:
+                console.print(f"[yellow]Removed {result['removed_files']} deleted files[/yellow]")
+        finally:
+            eg.close()
+
+    cli()
+
+
+if __name__ == "__main__":
+    main()
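
Reviewer note on the `update` command: the README attributes incremental indexing to SHA-256 hash comparison, and `update` reports updated and removed files. The sketch below shows one way such a comparison could decide what to re-index. It is an illustration under stated assumptions, not the code in `indexer.py` (not shown in this excerpt); `file_sha256`, `plan_update`, and the `stored` mapping of path to digest are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in 64 KiB blocks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def plan_update(stored: dict[str, str], root: Path) -> tuple[list[str], list[str]]:
    """Compare stored digests with the files on disk.

    Returns (changed_or_new, removed) as paths relative to root.
    """
    current = {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    changed = [path for path, digest in current.items() if stored.get(path) != digest]
    removed = [path for path in stored if path not in current]
    return changed, removed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.py").write_text("print('v1')\n")
    (root / "b.py").write_text("x = 1\n")
    # Digests as they would have been recorded by an earlier indexing run.
    stored = {name: file_sha256(root / name) for name in ("a.py", "b.py")}

    (root / "a.py").write_text("print('v2')\n")  # modified since indexing
    (root / "b.py").unlink()                      # deleted since indexing
    (root / "c.md").write_text("# notes\n")      # added since indexing

    changed, removed = plan_update(stored, root)

print("re-index:", changed)
print("drop:", removed)
```

Comparing content hashes rather than modification times means a file that was touched but not edited is skipped, at the cost of reading every file on each update.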