kodit 0.2.2__tar.gz → 0.2.4__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of kodit has been flagged as potentially problematic; see the registry listing for details.
- {kodit-0.2.2 → kodit-0.2.4}/Dockerfile +3 -2
- {kodit-0.2.2 → kodit-0.2.4}/PKG-INFO +2 -2
- {kodit-0.2.2 → kodit-0.2.4}/docs/_index.md +1 -1
- {kodit-0.2.2 → kodit-0.2.4}/docs/developer/index.md +5 -4
- kodit-0.2.4/docs/reference/deployment/docker-compose.yaml +40 -0
- kodit-0.2.4/docs/reference/deployment/index.md +35 -0
- kodit-0.2.4/docs/reference/deployment/kubernetes.yaml +99 -0
- kodit-0.2.4/docs/reference/telemetry/index.md +31 -0
- {kodit-0.2.2 → kodit-0.2.4}/pyproject.toml +1 -1
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/_version.py +2 -2
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/app.py +6 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/cli.py +8 -2
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/embedding/embedding_factory.py +11 -0
- kodit-0.2.4/src/kodit/embedding/embedding_provider/embedding_provider.py +92 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/embedding/embedding_provider/hash_embedding_provider.py +16 -7
- kodit-0.2.4/src/kodit/embedding/embedding_provider/local_embedding_provider.py +96 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/embedding/embedding_provider/openai_embedding_provider.py +18 -22
- kodit-0.2.4/src/kodit/embedding/local_vector_search_service.py +87 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/embedding/vector_search_service.py +18 -1
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/embedding/vectorchord_vector_search_service.py +63 -16
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/enrichment/enrichment_factory.py +3 -0
- kodit-0.2.4/src/kodit/enrichment/enrichment_provider/enrichment_provider.py +36 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/enrichment/enrichment_provider/local_enrichment_provider.py +39 -28
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/enrichment/enrichment_provider/openai_enrichment_provider.py +25 -27
- kodit-0.2.4/src/kodit/enrichment/enrichment_service.py +45 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/indexing/indexing_service.py +50 -23
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/log.py +126 -24
- kodit-0.2.4/src/kodit/migrations/versions/9e53ea8bb3b0_add_authors.py +103 -0
- kodit-0.2.4/src/kodit/source/source_factories.py +356 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/source/source_models.py +17 -5
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/source/source_repository.py +49 -20
- kodit-0.2.4/src/kodit/source/source_service.py +150 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/embedding/embedding_provider/local_embedding_provider_test.py +59 -10
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/embedding/embedding_provider/openai_embedding_provider_test.py +38 -10
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/embedding/local_vector_search_service_test.py +32 -3
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/embedding/vectorchord_vector_search_service_test.py +31 -5
- kodit-0.2.4/tests/kodit/enrichment/enrichment_provider/local_enrichment_provider_test.py +218 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/enrichment/enrichment_provider/openai_enrichment_provider_test.py +78 -47
- kodit-0.2.4/tests/kodit/log_test.py +18 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/source/source_service_test.py +5 -7
- {kodit-0.2.2 → kodit-0.2.4}/uv.lock +40 -18
- kodit-0.2.2/docs/reference/telemetry/index.md +0 -34
- kodit-0.2.2/src/kodit/embedding/embedding_provider/embedding_provider.py +0 -64
- kodit-0.2.2/src/kodit/embedding/embedding_provider/local_embedding_provider.py +0 -64
- kodit-0.2.2/src/kodit/embedding/local_vector_search_service.py +0 -54
- kodit-0.2.2/src/kodit/enrichment/enrichment_provider/enrichment_provider.py +0 -16
- kodit-0.2.2/src/kodit/enrichment/enrichment_service.py +0 -33
- kodit-0.2.2/src/kodit/migrations/versions/42e836b21102_add_authors.py +0 -64
- kodit-0.2.2/src/kodit/source/source_service.py +0 -327
- {kodit-0.2.2 → kodit-0.2.4}/.cursor/rules/kodit.mdc +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.dockerignore +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/CODE_OF_CONDUCT.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/CONTRIBUTING.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/ISSUE_TEMPLATE/bug_report.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/ISSUE_TEMPLATE/feature_request.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/PULL_REQUEST_TEMPLATE.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/dependabot.yml +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/workflows/docker.yaml +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/workflows/docs.yaml +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/workflows/pull_request.yaml +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/workflows/pypi-test.yaml +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/workflows/pypi.yaml +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.github/workflows/test.yaml +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.gitignore +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.python-version +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.vscode/launch.json +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/.vscode/settings.json +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/LICENSE +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/README.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/alembic.ini +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/docs/demos/_index.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/docs/demos/go-simple-microservice/index.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/docs/demos/knock-knock-auth/index.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/docs/getting-started/_index.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/docs/getting-started/installation/index.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/docs/getting-started/integration/index.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/docs/getting-started/quick-start/index.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/docs/reference/_index.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/docs/reference/configuration/index.md +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/.gitignore +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/bm25/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/bm25/keyword_search_factory.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/bm25/keyword_search_service.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/bm25/local_bm25.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/bm25/vectorchord_bm25.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/config.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/database.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/embedding/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/embedding/embedding_models.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/embedding/embedding_provider/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/embedding/embedding_repository.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/enrichment/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/enrichment/enrichment_provider/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/indexing/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/indexing/fusion.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/indexing/indexing_models.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/indexing/indexing_repository.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/mcp.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/middleware.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/migrations/README +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/migrations/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/migrations/env.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/migrations/script.py.mako +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/migrations/versions/7c3bbc2ab32b_add_embeddings_table.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/migrations/versions/85155663351e_initial.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/migrations/versions/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/migrations/versions/c3f5137d30f5_index_all_the_things.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/snippets/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/snippets/languages/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/snippets/languages/csharp.scm +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/snippets/languages/go.scm +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/snippets/languages/javascript.scm +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/snippets/languages/python.scm +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/snippets/languages/typescript.scm +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/snippets/method_snippets.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/snippets/snippets.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/source/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/source/git.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/source/ignore.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/util/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/src/kodit/util/spinner.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/conftest.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/docker-smoke.sh +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/experiments/cline-prompt-regression-tests/cline_prompt.txt +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/experiments/cline-prompt-regression-tests/cline_prompt_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/experiments/embedding.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/experiments/similarity_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/bm25/local_bm25_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/bm25/vectorchord_repository_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/cli_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/e2e.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/embedding/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/embedding/embedding_factory_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/enrichment/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/enrichment/enrichment_factory_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/enrichment/enrichment_provider/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/indexing/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/indexing/indexing_repository_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/indexing/indexing_service_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/mcp_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/snippets/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/snippets/csharp.cs +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/snippets/detect_language_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/snippets/golang.go +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/snippets/javascript.js +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/snippets/knock-knock-server.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/snippets/method_extraction_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/snippets/python.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/snippets/typescript.tsx +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/source/__init__.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/source/git_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/kodit/source/ignore_test.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/performance/similarity.py +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/smoke.sh +0 -0
- {kodit-0.2.2 → kodit-0.2.4}/tests/vectorchord-smoke.sh +0 -0
{kodit-0.2.2 → kodit-0.2.4}/Dockerfile
@@ -1,5 +1,6 @@
 # syntax=docker/dockerfile:1.9
-
+ARG PYTHON_VERSION=3.13.5
+FROM python:${PYTHON_VERSION}-slim-bookworm AS build

 # The following does not work in Podman unless you build in Docker
 # compatibility mode: <https://github.com/containers/podman/issues/8477>
@@ -60,7 +61,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \

 ##########################################################################

-FROM python
+FROM python:${PYTHON_VERSION}-slim-bookworm
 SHELL ["sh", "-exc"]

 RUN <<EOT
{kodit-0.2.2 → kodit-0.2.4}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: kodit
-Version: 0.2.2
+Version: 0.2.4
 Summary: Code indexing for better AI code generation
 Project-URL: Homepage, https://docs.helixml.tech/kodit/
 Project-URL: Documentation, https://docs.helixml.tech/kodit/
@@ -37,9 +37,9 @@ Requires-Dist: httpx-retries>=0.3.2
 Requires-Dist: httpx>=0.28.1
 Requires-Dist: openai>=1.82.0
 Requires-Dist: pathspec>=0.12.1
-Requires-Dist: posthog>=4.0.1
 Requires-Dist: pydantic-settings>=2.9.1
 Requires-Dist: pytable-formatter>=0.1.1
+Requires-Dist: rudder-sdk-python>=2.1.4
 Requires-Dist: sentence-transformers>=4.1.0
 Requires-Dist: sqlalchemy[asyncio]>=2.0.40
 Requires-Dist: structlog>=25.3.0
@@ -90,7 +90,7 @@ The roadmap is currently maintained as a [Github Project](https://github.com/org

 ## 💬 Support

-For commercial support, please contact [Helix.ML](
+For commercial support, please contact [Helix.ML](https://docs.helixml.tech/helix/help/). To ask a question,
 please [open a discussion](https://github.com/helixml/kodit/discussions).

 ## License
{kodit-0.2.2 → kodit-0.2.4}/docs/developer/index.md
@@ -11,10 +11,11 @@ All database operations are handled by SQLAlchemy and Alembic.
 ### Creating a Database Migration

 1. Make changes to your models
-2. Ensure the model is referenced in [alembic's env.py](src/kodit/
-3.
-4. Run `alembic
-5.
+2. Ensure the model is referenced in [alembic's env.py](https://github.com/helixml/kodit/blob/main/src/kodit/migrations/env.py)
+3. Remove the temporary DB if it exists from a previous migration: `rm -f .kodit.db`
+4. Run `alembic upgrade head` to create a temporary DB to compute the upgrade
+5. Run `alembic revision --autogenerate -m "your message"`
+6. The new migration will be applied when you next run a kodit command

 ## Releasing

kodit-0.2.4/docs/reference/deployment/docker-compose.yaml
@@ -0,0 +1,40 @@
+version: "3.9"
+
+services:
+  kodit:
+    image: registry.helix.ml/helix/kodit:latest # Replace with a version
+    ports:
+      - "8080:8080" # Expose the MCP server
+    # Start the Kodit MCP server and bind to all interfaces
+    command: ["serve", "--host", "0.0.0.0", "--port", "8080"]
+    restart: unless-stopped
+    depends_on:
+      - vectorchord # Wait for VectorChord to start before Kodit
+
+    # Configure Kodit
+    environment:
+      # Configure the database
+      DB_URL: postgresql+asyncpg://postgres:mysecretpassword@vectorchord:5432/kodit
+      DEFAULT_SEARCH_PROVIDER: vectorchord
+
+      # External embedding provider
+      EMBEDDING_ENDPOINT_TYPE: openai
+      EMBEDDING_ENDPOINT_BASE_URL: https://api.openai.com/v1
+      EMBEDDING_ENDPOINT_API_KEY: REPLACE_WITH_YOUR_API_KEY
+      EMBEDDING_ENDPOINT_MODEL: text-embedding-3-large
+
+      # External enrichment provider
+      ENRICHMENT_ENDPOINT_TYPE: openai
+      ENRICHMENT_ENDPOINT_BASE_URL: https://api.openai.com/v1
+      ENRICHMENT_ENDPOINT_API_KEY: REPLACE_WITH_YOUR_API_KEY
+      ENRICHMENT_ENDPOINT_MODEL: o3-mini
+
+
+  vectorchord:
+    image: tensorchord/vchord-suite:pg17-20250601
+    environment:
+      - POSTGRES_DB=kodit
+      - POSTGRES_PASSWORD=mysecretpassword
+    ports:
+      - "5432:5432"
+    restart: unless-stopped
kodit-0.2.4/docs/reference/deployment/index.md
@@ -0,0 +1,35 @@
+---
+title: Deployment
+description: Deploying Kodit with Docker Compose and Kubernetes.
+weight: 10
+---
+
+Kodit is packaged as a Docker container so you can run it on any popular orchestration platform. This page describes how to deploy Kodit as a service.
+
+## Deploying With Docker Compose
+
+Create a [docker-compose file](https://github.com/helixml/kodit/tree/main/docs/reference/deployment/docker-compose.yaml) that specifies Kodit and Vectorchord containers. Replace the latest tag with a version. Replace any API keys with your own or configure internal endpoints.
+
+Then run Kodit with `docker compose -f docker-compose.yaml up -d`. For more instructions see the [Docker Compose documentation](https://docs.docker.com/compose/).
+
+Here is an example:
+
+{{< code file="docker-compose.yaml" >}}
+
+## Deploying With Kubernetes
+
+To deploy with Kubernetes we recommend using a templating solution like Helm or Kustomize.
+
+Here is a simple [raw Kubernetes manifest](https://github.com/helixml/kodit/tree/main/docs/reference/deployment/kubernetes.yaml) to help get you started. Remember to pin the Kodit container at a specific version and update the required API keys.
+
+Deploy with `kubectl -n kodit apply -f kubernetes.yaml`
+
+{{< code file="kubernetes.yaml" >}}
+
+### Deploying With a Kind Kubernetes Cluster
+
+[Kind](https://kind.sigs.k8s.io/) is a k8s cluster that runs in a Docker container. So it's great for k8s development.
+
+1. `kind create cluster`
+2. `kubectl -n kodit apply -f kubernetes.yaml`
+
kodit-0.2.4/docs/reference/deployment/kubernetes.yaml
@@ -0,0 +1,99 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: vectorchord
+  labels:
+    app: vectorchord
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: vectorchord
+  template:
+    metadata:
+      labels:
+        app: vectorchord
+    spec:
+      containers:
+        - name: vectorchord
+          image: tensorchord/vchord-suite:pg17-20250601
+          env:
+            - name: POSTGRES_DB
+              value: "kodit"
+            - name: POSTGRES_PASSWORD
+              value: "mysecretpassword"
+          ports:
+            - containerPort: 5432
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: vectorchord
+spec:
+  selector:
+    app: vectorchord
+  ports:
+    - port: 5432
+      targetPort: 5432
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: kodit
+  labels:
+    app: kodit
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: kodit
+  template:
+    metadata:
+      labels:
+        app: kodit
+    spec:
+      containers:
+        - name: kodit
+          image: registry.helix.ml/helix/kodit:latest # Replace with a version
+          args: ["serve", "--host", "0.0.0.0", "--port", "8080"]
+          env:
+            - name: DB_URL
+              value: "postgresql+asyncpg://postgres:mysecretpassword@vectorchord:5432/kodit"
+            - name: DEFAULT_SEARCH_PROVIDER
+              value: "vectorchord"
+            - name: EMBEDDING_ENDPOINT_TYPE
+              value: "openai"
+            - name: EMBEDDING_ENDPOINT_BASE_URL
+              value: "https://api.openai.com/v1"
+            - name: EMBEDDING_ENDPOINT_API_KEY
+              value: "REPLACE_WITH_YOUR_API_KEY"
+            - name: EMBEDDING_ENDPOINT_MODEL
+              value: "text-embedding-3-large"
+            - name: ENRICHMENT_ENDPOINT_TYPE
+              value: "openai"
+            - name: ENRICHMENT_ENDPOINT_BASE_URL
+              value: "https://api.openai.com/v1"
+            - name: ENRICHMENT_ENDPOINT_API_KEY
+              value: "REPLACE_WITH_YOUR_API_KEY"
+            - name: ENRICHMENT_ENDPOINT_MODEL
+              value: "o3-mini"
+          ports:
+            - containerPort: 8080
+          readinessProbe:
+            httpGet:
+              path: /
+              port: 8080
+            initialDelaySeconds: 10
+            periodSeconds: 5
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: kodit
+spec:
+  type: LoadBalancer
+  selector:
+    app: kodit
+  ports:
+    - port: 8080
+      targetPort: 8080
kodit-0.2.4/docs/reference/telemetry/index.md
@@ -0,0 +1,31 @@
+---
+title: Telemetry
+description: Learn about what data is collected and how to disable it.
+weight: 99
+---
+
+Kodit includes a very limited amount anonymous telemetry to help guide product
+development. At the moment Kodit uses [Rudderstack](https://rudderstack.com) to capture
+anonymous usage metrics.
+
+## What Kodit Captures
+
+You can see what metrics are sent by searching for [use of the helper
+functions](https://github.com/helixml/kodit/blob/main/src/kodit/log.py#L169) in the Kodit
+codebase.
+
+Kodit currently captures use of the following:
+
+- When a user uses the CLI methods
+- When the indexing service is used or queried
+
+No user data is collected, only metadata about Kodit usage.
+
+## Disabling Telemetry
+
+We hope that you will help us improve Kodit by leaving telemetry turned on, but if you'd
+like to turn it off, add the following environmental variable (or add it to your .env file):
+
+```sh
+DISABLE_TELEMETRY=true
+```
{kodit-0.2.2 → kodit-0.2.4}/pyproject.toml
@@ -31,7 +31,6 @@ dependencies = [
     "httpx-retries>=0.3.2",
     "httpx>=0.28.1",
     "structlog>=25.3.0",
-    "posthog>=4.0.1",
     "sqlalchemy[asyncio]>=2.0.40",
     "alembic>=1.15.2",
     "aiosqlite>=0.20.0",
@@ -53,6 +52,7 @@ dependencies = [
     "asyncpg>=0.30.0",
     "transformers>=4.51.3",
     "accelerate>=1.7.0",
+    "rudder-sdk-python>=2.1.4",
 ]

 [dependency-groups]
{kodit-0.2.2 → kodit-0.2.4}/src/kodit/app.py
@@ -21,6 +21,12 @@ async def root() -> dict[str, str]:
     return {"message": "Hello, World!"}


+@app.get("/healthz")
+async def healthz() -> dict[str, str]:
+    """Return a health check for the kodit API."""
+    return {"status": "ok"}
+
+
 # Add mcp routes last, otherwise previous routes aren't added
 app.mount("", mcp_app)

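The new `/healthz` route gives orchestrators a stable health target (the Kubernetes manifest above still probes `/`, which also responds). Below is a minimal sketch, not part of this release, of polling the endpoint with `httpx` (already a kodit dependency); the URL assumes the server was started with `serve --host 0.0.0.0 --port 8080` as in the deployment examples.

```python
# Hypothetical health-check helper; the URL and retry policy are assumptions.
# Only the /healthz path and its {"status": "ok"} payload come from the diff above.
import asyncio

import httpx


async def wait_for_healthy(url: str = "http://localhost:8080/healthz") -> None:
    async with httpx.AsyncClient() as client:
        for _ in range(30):
            try:
                response = await client.get(url)
                if response.status_code == 200 and response.json() == {"status": "ok"}:
                    print("kodit is healthy")
                    return
            except httpx.HTTPError:
                pass  # server not accepting connections yet
            await asyncio.sleep(1)
    raise RuntimeError("kodit did not become healthy in time")


asyncio.run(wait_for_healthy())
```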
{kodit-0.2.2 → kodit-0.2.4}/src/kodit/cli.py
@@ -81,6 +81,7 @@ async def index(
     )

     if not sources:
+        log_event("kodit.cli.index.list")
         # No source specified, list all indexes
         indexes = await service.list_indexes()
         headers: list[str | Cell] = [
@@ -108,7 +109,8 @@ async def index(
         msg = "File indexing is not implemented yet"
         raise click.UsageError(msg)

-    # Index
+    # Index source
+    log_event("kodit.cli.index.create")
     s = await source_service.create(source)
     index = await service.create(s.id)
     await service.run(index.id)
@@ -134,6 +136,7 @@ async def code(

     This works best if your query is code.
     """
+    log_event("kodit.cli.search.code")
     source_repository = SourceRepository(session)
     source_service = SourceService(app_context.get_clone_dir(), source_repository)
     repository = IndexRepository(session)
@@ -177,6 +180,7 @@ async def keyword(
     top_k: int,
 ) -> None:
     """Search for snippets using keyword search."""
+    log_event("kodit.cli.search.keyword")
     source_repository = SourceRepository(session)
     source_service = SourceService(app_context.get_clone_dir(), source_repository)
     repository = IndexRepository(session)
@@ -223,6 +227,7 @@ async def text(

     This works best if your query is text.
     """
+    log_event("kodit.cli.search.text")
     source_repository = SourceRepository(session)
     source_service = SourceService(app_context.get_clone_dir(), source_repository)
     repository = IndexRepository(session)
@@ -270,6 +275,7 @@ async def hybrid( # noqa: PLR0913
     text: str,
 ) -> None:
     """Search for snippets using hybrid search."""
+    log_event("kodit.cli.search.hybrid")
     source_repository = SourceRepository(session)
     source_service = SourceService(app_context.get_clone_dir(), source_repository)
     repository = IndexRepository(session)
@@ -321,7 +327,7 @@ def serve(
     """Start the kodit server, which hosts the MCP server and the kodit API."""
     log = structlog.get_logger(__name__)
     log.info("Starting kodit server", host=host, port=port)
-    log_event("
+    log_event("kodit.cli.serve")

     # Configure uvicorn with graceful shutdown
     config = uvicorn.Config(
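These hunks attach the telemetry helper described in the telemetry page above to each CLI command, using dotted event names. A tiny sketch of the calling pattern as it appears in this diff; both forms below are copied from the hunks, and the import path is the one used in embedding_factory.py further down.

```python
# Calling pattern taken from this diff; log_event lives in src/kodit/log.py.
from kodit.log import log_event

log_event("kodit.cli.search.code")                   # event name only
log_event("kodit.embedding", {"provider": "local"})  # event name plus properties
```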
{kodit-0.2.2 → kodit-0.2.4}/src/kodit/embedding/embedding_factory.py
@@ -3,6 +3,7 @@
 from sqlalchemy.ext.asyncio import AsyncSession

 from kodit.config import AppContext, Endpoint
+from kodit.embedding.embedding_models import EmbeddingType
 from kodit.embedding.embedding_provider.local_embedding_provider import (
     CODE,
     LocalEmbeddingProvider,
@@ -19,6 +20,7 @@ from kodit.embedding.vectorchord_vector_search_service import (
     TaskName,
     VectorChordVectorSearchService,
 )
+from kodit.log import log_event


 def _get_endpoint_configuration(app_context: AppContext) -> Endpoint | None:
@@ -34,6 +36,7 @@ def embedding_factory(
     endpoint = _get_endpoint_configuration(app_context)

     if endpoint and endpoint.type == "openai":
+        log_event("kodit.embedding", {"provider": "openai"})
         from openai import AsyncOpenAI

         embedding_provider = OpenAIEmbeddingProvider(
@@ -44,14 +47,22 @@
             model_name=endpoint.model or "text-embedding-3-small",
         )
     else:
+        log_event("kodit.embedding", {"provider": "local"})
         embedding_provider = LocalEmbeddingProvider(CODE)

     if app_context.default_search.provider == "vectorchord":
+        log_event("kodit.database", {"provider": "vectorchord"})
         return VectorChordVectorSearchService(task_name, session, embedding_provider)
     if app_context.default_search.provider == "sqlite":
+        log_event("kodit.database", {"provider": "sqlite"})
+        if task_name == "code":
+            embedding_type = EmbeddingType.CODE
+        elif task_name == "text":
+            embedding_type = EmbeddingType.TEXT
         return LocalVectorSearchService(
             embedding_repository=embedding_repository,
             embedding_provider=embedding_provider,
+            embedding_type=embedding_type,
         )

     msg = f"Invalid semantic search provider: {app_context.default_search.provider}"
kodit-0.2.4/src/kodit/embedding/embedding_provider/embedding_provider.py
@@ -0,0 +1,92 @@
+"""Embedding provider."""
+
+from abc import ABC, abstractmethod
+from collections.abc import AsyncGenerator
+from dataclasses import dataclass
+
+import structlog
+import tiktoken
+
+OPENAI_MAX_EMBEDDING_SIZE = 8192
+
+Vector = list[float]
+
+
+@dataclass
+class EmbeddingRequest:
+    """Embedding request."""
+
+    id: int
+    text: str
+
+
+@dataclass
+class EmbeddingResponse:
+    """Embedding response."""
+
+    id: int
+    embedding: Vector
+
+
+class EmbeddingProvider(ABC):
+    """Embedding provider."""
+
+    @abstractmethod
+    def embed(
+        self, data: list[EmbeddingRequest]
+    ) -> AsyncGenerator[list[EmbeddingResponse], None]:
+        """Embed a list of strings.
+
+        The embedding provider is responsible for embedding a list of strings into a
+        list of vectors. The embedding provider is responsible for splitting the list of
+        strings into smaller sub-batches and embedding them in parallel.
+        """
+
+
+def split_sub_batches(
+    encoding: tiktoken.Encoding,
+    data: list[EmbeddingRequest],
+    max_context_window: int = OPENAI_MAX_EMBEDDING_SIZE,
+) -> list[list[EmbeddingRequest]]:
+    """Split a list of strings into smaller sub-batches."""
+    log = structlog.get_logger(__name__)
+    result = []
+    data_to_process = [s for s in data if s.text.strip()]  # Filter out empty strings
+
+    while data_to_process:
+        next_batch = []
+        current_tokens = 0
+
+        while data_to_process:
+            next_item = data_to_process[0]
+            item_tokens = len(encoding.encode(next_item.text, disallowed_special=()))
+
+            if item_tokens > max_context_window:
+                # Optimise truncation by operating on tokens directly instead of
+                # removing one character at a time and repeatedly re-encoding.
+                tokens = encoding.encode(next_item.text, disallowed_special=())
+                if len(tokens) > max_context_window:
+                    # Keep only the first *max_context_window* tokens.
+                    tokens = tokens[:max_context_window]
+                    # Convert back to text. This requires only one decode call and
+                    # guarantees that the resulting string fits the token budget.
+                    next_item.text = encoding.decode(tokens)
+                    item_tokens = max_context_window  # We know the exact size now
+
+                data_to_process[0] = next_item
+
+                log.warning(
+                    "Truncated snippet because it was too long to embed",
+                    snippet=next_item.text[:100] + "...",
+                )
+
+            if current_tokens + item_tokens > max_context_window:
+                break
+
+            next_batch.append(data_to_process.pop(0))
+            current_tokens += item_tokens
+
+        if next_batch:
+            result.append(next_batch)
+
+    return result
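The new `split_sub_batches` helper filters out empty requests, truncates any single request that exceeds the token budget (logging a warning), and greedily packs the rest into sub-batches that fit the window. A minimal sketch of calling it, assuming the package is installed and that tiktoken can resolve the `text-embedding-3-small` encoding that `LocalEmbeddingProvider` below also uses:

```python
# Illustrative only: exercises the split_sub_batches helper added in this release.
import tiktoken

from kodit.embedding.embedding_provider.embedding_provider import (
    OPENAI_MAX_EMBEDDING_SIZE,
    EmbeddingRequest,
    split_sub_batches,
)

encoding = tiktoken.encoding_for_model("text-embedding-3-small")
requests = [
    EmbeddingRequest(id=1, text="def add(a: int, b: int) -> int:\n    return a + b"),
    EmbeddingRequest(id=2, text="   "),             # whitespace-only: filtered out
    EmbeddingRequest(id=3, text="token " * 20000),  # over budget: truncated, batched alone
]

batches = split_sub_batches(encoding, requests, max_context_window=OPENAI_MAX_EMBEDDING_SIZE)
for batch in batches:
    print([request.id for request in batch])  # each inner list fits the token window
```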
{kodit-0.2.2 → kodit-0.2.4}/src/kodit/embedding/embedding_provider/hash_embedding_provider.py
RENAMED
@@ -3,10 +3,12 @@
 import asyncio
 import hashlib
 import math
-from collections.abc import Generator, Sequence
+from collections.abc import AsyncGenerator, Generator, Sequence

 from kodit.embedding.embedding_provider.embedding_provider import (
     EmbeddingProvider,
+    EmbeddingRequest,
+    EmbeddingResponse,
     Vector,
 )

@@ -31,27 +33,34 @@ class HashEmbeddingProvider(EmbeddingProvider):
         self.dim = dim
         self.batch_size = batch_size

-    async def embed(
+    async def embed(
+        self, data: list[EmbeddingRequest]
+    ) -> AsyncGenerator[list[EmbeddingResponse], None]:
         """Embed every string in *data*, preserving order.

        Work is sliced into *batch_size* chunks and scheduled concurrently
        (still CPU-bound, but enough to cooperate with an asyncio loop).
        """
         if not data:
-
+            yield []

         async def _embed_chunk(chunk: Sequence[str]) -> list[Vector]:
             return [self._string_to_vector(text) for text in chunk]

         tasks = [
             asyncio.create_task(_embed_chunk(chunk))
-            for chunk in self._chunked(data, self.batch_size)
+            for chunk in self._chunked([i.text for i in data], self.batch_size)
         ]

-        vectors: list[Vector] = []
         for task in tasks:
-
-
+            result = await task
+            yield [
+                EmbeddingResponse(
+                    id=item.id,
+                    embedding=embedding,
+                )
+                for item, embedding in zip(data, result, strict=True)
+            ]

     @staticmethod
     def _chunked(seq: Sequence[str], size: int) -> Generator[Sequence[str], None, None]:
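With this change the hash provider yields lists of `EmbeddingResponse` objects instead of collecting flat vectors, matching the async-generator contract defined above. A minimal consumption sketch; the `dim`/`batch_size` keyword arguments are an assumption based on the fields assigned in `__init__`, and `batch_size` is kept at least as large as the input so the single resulting chunk lines up with the strict `zip` over the full request list:

```python
# Illustrative only: consuming the reworked embed() async generator.
import asyncio

from kodit.embedding.embedding_provider.embedding_provider import EmbeddingRequest
from kodit.embedding.embedding_provider.hash_embedding_provider import (
    HashEmbeddingProvider,
)


async def main() -> None:
    # dim/batch_size are assumed constructor keywords; see the fields set above.
    provider = HashEmbeddingProvider(dim=16, batch_size=8)
    requests = [EmbeddingRequest(id=i, text=f"snippet {i}") for i in range(3)]
    async for batch in provider.embed(requests):
        for response in batch:
            print(response.id, len(response.embedding))  # vector length follows dim


asyncio.run(main())
```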
kodit-0.2.4/src/kodit/embedding/embedding_provider/local_embedding_provider.py
@@ -0,0 +1,96 @@
+"""Local embedding service."""
+
+from __future__ import annotations
+
+import os
+from time import time
+from typing import TYPE_CHECKING
+
+import structlog
+
+from kodit.embedding.embedding_provider.embedding_provider import (
+    EmbeddingProvider,
+    EmbeddingRequest,
+    EmbeddingResponse,
+    split_sub_batches,
+)
+
+if TYPE_CHECKING:
+    from collections.abc import AsyncGenerator
+
+    from sentence_transformers import SentenceTransformer
+    from tiktoken import Encoding
+
+
+TINY = "tiny"
+CODE = "code"
+TEST = "test"
+
+COMMON_EMBEDDING_MODELS = {
+    TINY: "ibm-granite/granite-embedding-30m-english",
+    CODE: "flax-sentence-embeddings/st-codesearch-distilroberta-base",
+    TEST: "minishlab/potion-base-4M",
+}
+
+
+class LocalEmbeddingProvider(EmbeddingProvider):
+    """Local embedder."""
+
+    def __init__(self, model_name: str) -> None:
+        """Initialize the local embedder."""
+        self.log = structlog.get_logger(__name__)
+        self.model_name = COMMON_EMBEDDING_MODELS.get(model_name, model_name)
+        self.encoding_name = "text-embedding-3-small"
+        self.embedding_model = None
+        self.encoding = None
+
+    def _encoding(self) -> Encoding:
+        if self.encoding is None:
+            from tiktoken import encoding_for_model
+
+            start_time = time()
+            self.encoding = encoding_for_model(self.encoding_name)
+            self.log.debug(
+                "Encoding loaded",
+                model_name=self.encoding_name,
+                duration=time() - start_time,
+            )
+        return self.encoding
+
+    def _model(self) -> SentenceTransformer:
+        """Get the embedding model."""
+        if self.embedding_model is None:
+            os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Avoid warnings
+            from sentence_transformers import SentenceTransformer
+
+            start_time = time()
+            self.embedding_model = SentenceTransformer(
+                self.model_name,
+                trust_remote_code=True,
+            )
+            self.log.debug(
+                "Model loaded",
+                model_name=self.model_name,
+                duration=time() - start_time,
+            )
+        return self.embedding_model
+
+    async def embed(
+        self, data: list[EmbeddingRequest]
+    ) -> AsyncGenerator[list[EmbeddingResponse], None]:
+        """Embed a list of strings."""
+        model = self._model()
+
+        batched_data = split_sub_batches(self._encoding(), data)
+
+        for batch in batched_data:
+            embeddings = model.encode(
+                [i.text for i in batch], show_progress_bar=False, batch_size=4
+            )
+            yield [
+                EmbeddingResponse(
+                    id=item.id,
+                    embedding=[float(x) for x in embedding],
+                )
+                for item, embedding in zip(batch, embeddings, strict=True)
+            ]
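`LocalEmbeddingProvider` resolves the short aliases above (`tiny`, `code`, `test`) to Hugging Face model names and loads both the tiktoken encoding and the SentenceTransformer model lazily on first use. A minimal usage sketch, assuming the package plus its `sentence-transformers` and `tiktoken` dependencies are installed; the first call downloads the model:

```python
# Illustrative only: streaming embeddings from the new LocalEmbeddingProvider.
import asyncio

from kodit.embedding.embedding_provider.embedding_provider import EmbeddingRequest
from kodit.embedding.embedding_provider.local_embedding_provider import (
    TINY,
    LocalEmbeddingProvider,
)


async def main() -> None:
    provider = LocalEmbeddingProvider(TINY)  # ibm-granite/granite-embedding-30m-english
    requests = [
        EmbeddingRequest(id=1, text="def search(query: str) -> list[str]: ..."),
        EmbeddingRequest(id=2, text="class SourceService: ..."),
    ]
    async for batch in provider.embed(requests):
        for response in batch:
            print(response.id, len(response.embedding))


asyncio.run(main())
```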