cocoindex-code (PyPI): version diff, 0.1.12 tar.gz → 0.1.14 tar.gz


Files changed (13)

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: cocoindex-code
-Version: 0.1.12
+Version: 0.1.14
 Summary: MCP server for indexing and querying codebases using CocoIndex
 Project-URL: Homepage, https://github.com/cocoindex-io/cocoindex-code
 Project-URL: Repository, https://github.com/cocoindex-io/cocoindex-code
@@ -17,7 +17,7 @@ Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: 3.13
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
 Requires-Python: >=3.11
-Requires-Dist: cocoindex[litellm]==1.0.0a26
+Requires-Dist: cocoindex[litellm]==1.0.0a29
 Requires-Dist: einops>=0.8.2
 Requires-Dist: mcp>=1.0.0
 Requires-Dist: numpy>=1.24.0
@@ -165,6 +165,7 @@ Use the cocoindex-code MCP server for semantic code search when:
 | `COCOINDEX_CODE_EMBEDDING_MODEL` | Embedding model (see below) | `sbert/sentence-transformers/all-MiniLM-L6-v2` |
 | `COCOINDEX_CODE_BATCH_SIZE` | Max batch size for local embedding model | `16` |
 | `COCOINDEX_CODE_EXTRA_EXTENSIONS` | Additional file extensions to index (comma-separated, e.g. `"inc:php,yaml,toml"` — use `ext:lang` to override language detection) | _(none)_ |
+| `COCOINDEX_CODE_EXCLUDED_PATTERNS` | Additional glob patterns to exclude from indexing as a JSON array (e.g. `'["**/migration.sql", "{**/*.md,**/*.txt}"]'`) | _(none)_ |
 ### Root Path Discovery
@@ -297,9 +298,20 @@ claude mcp add cocoindex-code \
 Any model supported by LiteLLM works — see the [full list of embedding providers](https://docs.litellm.ai/docs/embedding/supported_embedding).
-### GPU-optimised local model
+### Local SentenceTransformers models
-If you have a GPU, [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed) delivers significantly better code retrieval than the default model. It is 137M parameters, requires ~1 GB VRAM, and has an 8192-token context window.
+Use the `sbert/` prefix to load any [SentenceTransformers](https://www.sbert.net/) model locally (no API key required).
+**Example — general purpose text model:**
+```bash
+claude mcp add cocoindex-code \
+  -e COCOINDEX_CODE_EMBEDDING_MODEL=sbert/nomic-ai/nomic-embed-text-v1 \
+  -- cocoindex-code
+```
+**GPU-optimised code retrieval:**
+[`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed) delivers significantly better code retrieval than the default model. It is 137M parameters, requires ~1 GB VRAM, and has an 8192-token context window.
 ```bash
 claude mcp add cocoindex-code \
@@ -355,6 +367,7 @@ Returns matching code chunks with:
 | javascript | js | `.js` |
 | json | | `.json` |
 | kotlin | | `.kt`, `.kts` |
+| lua | | `.lua` |
 | markdown | md | `.md`, `.mdx` |
 | pascal | pas, dpr, delphi | `.pas`, `.dpr` |
 | php | | `.php` |

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/README.md RENAMED

@@ -130,6 +130,7 @@ Use the cocoindex-code MCP server for semantic code search when:
 | `COCOINDEX_CODE_EMBEDDING_MODEL` | Embedding model (see below) | `sbert/sentence-transformers/all-MiniLM-L6-v2` |
 | `COCOINDEX_CODE_BATCH_SIZE` | Max batch size for local embedding model | `16` |
 | `COCOINDEX_CODE_EXTRA_EXTENSIONS` | Additional file extensions to index (comma-separated, e.g. `"inc:php,yaml,toml"` — use `ext:lang` to override language detection) | _(none)_ |
+| `COCOINDEX_CODE_EXCLUDED_PATTERNS` | Additional glob patterns to exclude from indexing as a JSON array (e.g. `'["**/migration.sql", "{**/*.md,**/*.txt}"]'`) | _(none)_ |
 ### Root Path Discovery
@@ -262,9 +263,20 @@ claude mcp add cocoindex-code \
 Any model supported by LiteLLM works — see the [full list of embedding providers](https://docs.litellm.ai/docs/embedding/supported_embedding).
-### GPU-optimised local model
+### Local SentenceTransformers models
-If you have a GPU, [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed) delivers significantly better code retrieval than the default model. It is 137M parameters, requires ~1 GB VRAM, and has an 8192-token context window.
+Use the `sbert/` prefix to load any [SentenceTransformers](https://www.sbert.net/) model locally (no API key required).
+**Example — general purpose text model:**
+```bash
+claude mcp add cocoindex-code \
+  -e COCOINDEX_CODE_EMBEDDING_MODEL=sbert/nomic-ai/nomic-embed-text-v1 \
+  -- cocoindex-code
+```
+**GPU-optimised code retrieval:**
+[`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed) delivers significantly better code retrieval than the default model. It is 137M parameters, requires ~1 GB VRAM, and has an 8192-token context window.
 ```bash
 claude mcp add cocoindex-code \
@@ -320,6 +332,7 @@ Returns matching code chunks with:
 | javascript | js | `.js` |
 | json | | `.json` |
 | kotlin | | `.kt`, `.kts` |
+| lua | | `.lua` |
 | markdown | md | `.md`, `.mdx` |
 | pascal | pas, dpr, delphi | `.pas`, `.dpr` |
 | php | | `.php` |
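
The README hunk above documents `COCOINDEX_CODE_EXCLUDED_PATTERNS` as a JSON array passed through the shell, which is easy to misquote by hand. A minimal sketch, assuming the same `-e NAME=value` form used in the `claude mcp add` examples, that builds the value with the standard library instead (the `patterns` list is just the README's own example):

```python
import json
import shlex

# Glob patterns to exclude, taken from the README example.
patterns = ["**/migration.sql", "{**/*.md,**/*.txt}"]

# Serialise to a JSON array, then quote it so the shell passes it through unchanged.
value = json.dumps(patterns)
flag = f"-e COCOINDEX_CODE_EXCLUDED_PATTERNS={shlex.quote(value)}"

print(flag)
```

The printed flag can be pasted into the `claude mcp add cocoindex-code ... -- cocoindex-code` command line; whether other MCP clients accept the same flag syntax is not covered by this diff.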

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/pyproject.toml RENAMED

@@ -23,7 +23,7 @@ classifiers = [
 dependencies = [
     "mcp>=1.0.0",
-    "cocoindex[litellm]==1.0.0a26",
+    "cocoindex[litellm]==1.0.0a29",
     "sentence-transformers>=2.2.0",
     "sqlite-vec>=0.1.0",
     "pydantic>=2.0.0",

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/src/cocoindex_code/config.py RENAMED

@@ -2,6 +2,7 @@
 from __future__ import annotations
+import json
 import os
 from dataclasses import dataclass
 from pathlib import Path
@@ -42,6 +43,31 @@ def _discover_codebase_root() -> Path:
     return root if root is not None else cwd
+def _parse_json_string_list_env(var_name: str) -> list[str]:
+    """Parse an environment variable as a JSON array of strings."""
+    raw_value = os.environ.get(var_name, "")
+    if not raw_value.strip():
+        return []
+    try:
+        parsed = json.loads(raw_value)
+    except json.JSONDecodeError as exc:
+        raise ValueError(f"{var_name} must be a JSON array of strings, got invalid JSON") from exc
+    if not isinstance(parsed, list):
+        raise ValueError(f"{var_name} must be a JSON array of strings")
+    result: list[str] = []
+    for item in parsed:
+        if not isinstance(item, str):
+            raise ValueError(f"{var_name} must be a JSON array of strings")
+        item = item.strip()
+        if item:
+            result.append(item)
+    return result
 @dataclass
 class Config:
     """Configuration loaded from environment variables."""
@@ -50,8 +76,8 @@ class Config:
     embedding_model: str
     index_dir: Path
     device: str | None
-    trust_remote_code: bool
     extra_extensions: dict[str, str | None]
+    excluded_patterns: list[str]
     @classmethod
     def from_env(cls) -> Config:
@@ -76,16 +102,6 @@ class Config:
         # Device: auto-detect CUDA or use env override
         device = os.environ.get("COCOINDEX_CODE_DEVICE")
-        # trust_remote_code: opt-in via env var only.
-        # sentence-transformers 5.x+ supports Jina models natively, so
-        # auto-enabling this for jinaai/ models causes failures with
-        # transformers 5.x (removed find_pruneable_heads_and_indices).
-        trust_remote_code = os.environ.get("COCOINDEX_CODE_TRUST_REMOTE_CODE", "").lower() in (
-            "1",
-            "true",
-            "yes",
-        )
         # Extra file extensions (format: "inc:php,yaml,toml" — optional lang after colon)
         raw_extra = os.environ.get("COCOINDEX_CODE_EXTRA_EXTENSIONS", "")
         extra_extensions: dict[str, str | None] = {}
@@ -99,13 +115,16 @@ class Config:
             else:
                 extra_extensions[f".{token}"] = None
+        # Excluded file glob patterns
+        excluded_patterns = _parse_json_string_list_env("COCOINDEX_CODE_EXCLUDED_PATTERNS")
         return cls(
             codebase_root_path=root,
             embedding_model=embedding_model,
             index_dir=index_dir,
             device=device,
-            trust_remote_code=trust_remote_code,
             extra_extensions=extra_extensions,
+            excluded_patterns=excluded_patterns,
         )
     @property
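
The behaviour of the new `_parse_json_string_list_env` helper can be checked in isolation. The sketch below restates its logic from the hunk above in a standalone function (renamed `parse_json_string_list_env` here, and slightly condensed, so it is an illustration rather than the package's code) and runs it on one valid and one invalid value:

```python
import json
import os


def parse_json_string_list_env(var_name: str) -> list[str]:
    """Parse an environment variable as a JSON array of strings (restated from the diff)."""
    raw_value = os.environ.get(var_name, "")
    if not raw_value.strip():
        return []  # unset or blank means "no extra patterns"
    try:
        parsed = json.loads(raw_value)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{var_name} must be a JSON array of strings, got invalid JSON") from exc
    if not isinstance(parsed, list) or not all(isinstance(item, str) for item in parsed):
        raise ValueError(f"{var_name} must be a JSON array of strings")
    # Strip surrounding whitespace and drop entries that end up empty.
    return [item.strip() for item in parsed if item.strip()]


VAR = "COCOINDEX_CODE_EXCLUDED_PATTERNS"

# A valid JSON array; the whitespace-only entry is dropped.
os.environ[VAR] = '["**/migration.sql", "  ", "{**/*.md,**/*.txt}"]'
valid_result = parse_json_string_list_env(VAR)

# A bare glob is not valid JSON, so it is rejected rather than silently accepted.
os.environ[VAR] = "**/*.md"
try:
    parse_json_string_list_env(VAR)
    bare_glob_error = None
except ValueError as exc:
    bare_glob_error = str(exc)

print(valid_result)
print(bare_glob_error)
```

One consequence worth noting: a comma-separated value in the style of `COCOINDEX_CODE_EXTRA_EXTENSIONS` would fail here, since the two variables use different formats.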

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/src/cocoindex_code/indexer.py RENAMED

@@ -42,6 +42,7 @@ DEFAULT_INCLUDED_PATTERNS = [
     "**/*.txt",  # Plain text
     "**/*.rst",  # reStructuredText
     "**/*.php",  # PHP
+    "**/*.lua",  # Lua
 ]
 INCLUDED_PATTERNS = DEFAULT_INCLUDED_PATTERNS + [f"**/*{ext}" for ext in config.extra_extensions]
@@ -51,7 +52,7 @@ LANGUAGE_OVERRIDES: dict[str, str] = {
     ext: lang for ext, lang in config.extra_extensions.items() if lang is not None
 }
-EXCLUDED_PATTERNS = [
+DEFAULT_EXCLUDED_PATTERNS = [
     "**/.*",  # Hidden directories
     "**/__pycache__",  # Python cache
     "**/node_modules",  # Node.js dependencies
@@ -63,6 +64,8 @@ EXCLUDED_PATTERNS = [
     "**/.cocoindex_code",  # Our own index directory
 ]
+EXCLUDED_PATTERNS = DEFAULT_EXCLUDED_PATTERNS + config.excluded_patterns
 # Chunking configuration
 CHUNK_SIZE = 2000
 MIN_CHUNK_SIZE = 300
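
The rename to `DEFAULT_EXCLUDED_PATTERNS` makes user-supplied patterns purely additive: they are appended to the built-in list and cannot remove a default. A small sketch of that composition, assuming an abbreviated default list (the diff elides some entries) and a hypothetical `build_excluded_patterns` helper that is not part of the package:

```python
# Abbreviated copy of the defaults visible in the diff; the real list has more entries.
DEFAULT_EXCLUDED_PATTERNS = [
    "**/.*",               # Hidden directories
    "**/__pycache__",      # Python cache
    "**/node_modules",     # Node.js dependencies
    "**/.cocoindex_code",  # The index directory itself
]


def build_excluded_patterns(user_patterns: list[str]) -> list[str]:
    """Hypothetical helper: defaults first, then user patterns, as a new list."""
    return DEFAULT_EXCLUDED_PATTERNS + user_patterns


combined = build_excluded_patterns(["**/migration.sql"])
print(combined)
```

Because `+` builds a new list, the defaults are left unmodified, and an empty user list yields exactly the defaults.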

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/src/cocoindex_code/query.py RENAMED

@@ -106,7 +106,7 @@ async def query_codebase(
     db = coco_env.get_context(SQLITE_DB)
     # Generate query embedding.
-    query_embedding = await embedder.embed(query, True, query_prompt_name)
+    query_embedding = await embedder.embed(query, query_prompt_name)
     embedding_bytes = query_embedding.astype("float32").tobytes()

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/src/cocoindex_code/shared.py RENAMED

@@ -31,19 +31,15 @@ if config.embedding_model.startswith(SBERT_PREFIX):
     # Models that define a "query" prompt for asymmetric retrieval.
     _QUERY_PROMPT_MODELS = {"nomic-ai/nomic-embed-code", "nomic-ai/CodeRankEmbed"}
     query_prompt_name: str | None = "query" if _model_name in _QUERY_PROMPT_MODELS else None
-    # Models whose custom remote code is known-compatible with transformers 5.x.
-    _KNOWN_REMOTE_CODE_MODELS = {"nomic-ai/CodeRankEmbed"}
-    _trust = config.trust_remote_code or _model_name in _KNOWN_REMOTE_CODE_MODELS
     embedder = SentenceTransformerEmbedder(
         _model_name,
         device=config.device,
-        trust_remote_code=_trust,
+        trust_remote_code=True,
     )
     logger.info(
-        "Embedding model: %s | device: %s | trust_remote_code: %s",
+        "Embedding model: %s | device: %s",
         config.embedding_model,
         config.device,
-        _trust,
     )
 else:
     from cocoindex.ops.litellm import LiteLLMEmbedder
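
After this change the only per-model decision left in the SentenceTransformers branch is the query prompt name. The sketch below illustrates that selection under two assumptions not shown in the hunk: that `SBERT_PREFIX` equals `"sbert/"` (as the README's prefix description suggests) and that the model name is whatever follows the prefix. `resolve_sbert_model` is a hypothetical helper, not a function in the package.

```python
from __future__ import annotations

SBERT_PREFIX = "sbert/"  # assumed value; the constant's definition is outside the hunk

# Models that define a "query" prompt for asymmetric retrieval (copied from the diff).
_QUERY_PROMPT_MODELS = {"nomic-ai/nomic-embed-code", "nomic-ai/CodeRankEmbed"}


def resolve_sbert_model(embedding_model: str) -> tuple[str, str | None] | None:
    """Return (model_name, query_prompt_name), or None for non-sbert models."""
    if not embedding_model.startswith(SBERT_PREFIX):
        return None  # handled by the LiteLLM branch instead
    model_name = embedding_model[len(SBERT_PREFIX):]
    query_prompt_name = "query" if model_name in _QUERY_PROMPT_MODELS else None
    return model_name, query_prompt_name


print(resolve_sbert_model("sbert/nomic-ai/CodeRankEmbed"))
print(resolve_sbert_model("sbert/sentence-transformers/all-MiniLM-L6-v2"))
```

The resulting prompt name is what `query.py` now passes as the second argument to `embedder.embed`.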

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/.gitignore RENAMED

File without changes

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/LICENSE RENAMED

File without changes

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/src/cocoindex_code/__init__.py RENAMED

File without changes

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/src/cocoindex_code/__main__.py RENAMED

File without changes

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/src/cocoindex_code/schema.py RENAMED

File without changes

{cocoindex_code-0.1.12 → cocoindex_code-0.1.14}/src/cocoindex_code/server.py RENAMED

File without changes
