
own-rag-cli 0.0.4-snapshot → 0.0.6-snapshot


Files changed (10)

package/README.md CHANGED

@@ -1,8 +1,10 @@
+# MCP binary checksum (SHA-256, payload without shebang): `1413af4d4c7d01d57ec5195ea0c5f704f9fefabeb641d2f216a042ec638c2b59`
 # own-rag
 Local RAG for codebases with ChromaDB + MCP, focused on practical setup and lower LLM token waste.
-Language: English (default) | Portuguese: `README.pt-br.md`
+Language: English (default) | Portuguese: `README.pt-br.md` - https://github.com/JocsaPB/own-rag/blob/main/README.pt-br.md
 ## Why use own-rag
@@ -59,6 +61,30 @@ rag remove                      # full local uninstall (double confirmation)
 rag remove --force              # uninstall without confirmation prompts
 ```
+## Indexing from URL (HTTP/HTTPS)
+`rag run` now accepts remote URLs in addition to local folders.
+How it works:
+- If you pass `http://` or `https://`, the wrapper downloads content to a temporary folder.
+- If the downloaded file is a ZIP, it is extracted and the extracted folder is indexed.
+- Text files (`.txt`, `.md`, and other non-binary text files) are indexable.
+- Binary files are skipped by the indexer.
+- After indexing finishes, temporary downloaded/extracted files are removed automatically.
+Download tool behavior:
+- Uses `curl` when available.
+- If `curl` is missing, the wrapper attempts automatic installation:
+  - Linux: package manager (`apt`, `dnf`, `yum`, `pacman`, `zypper`, `apk`)
+  - macOS: Homebrew (`brew`)
+Examples:
+```bash
+rag run https://example.com/docs/guide.md
+rag run https://example.com/snapshots/project-docs.zip
+```
 ## Configuration files and paths
 ### 1) Runtime config (CLI-level)
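
The README hunk above describes a download, extract, index, clean-up flow for `rag run <url>`. A minimal Python sketch of that flow follows, assuming the payload has already been fetched (the README says the wrapper delegates the download to `curl`). The names `stage_for_indexing` and `index_fn` are hypothetical and do not come from the package.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def stage_for_indexing(payload: bytes, filename: str, index_fn: Callable[[Path], T]) -> T:
    """Write a downloaded payload to a temp folder, unzip it if needed,
    hand the folder to index_fn, and always remove the temp folder.

    Hypothetical helper: the real wrapper's internals are not shown in this diff.
    """
    workdir = Path(tempfile.mkdtemp(prefix="rag-url-"))
    try:
        downloaded = workdir / filename
        downloaded.write_bytes(payload)
        if zipfile.is_zipfile(downloaded):
            # ZIP payloads are extracted and the extracted folder is indexed.
            target = workdir / "extracted"
            with zipfile.ZipFile(downloaded) as archive:
                archive.extractall(target)
        else:
            # A single text file is indexed from the temp folder itself.
            target = workdir
        return index_fn(target)
    finally:
        # Temporary downloaded/extracted files are removed after indexing.
        shutil.rmtree(workdir, ignore_errors=True)


def list_files(folder: Path) -> list[str]:
    """Stand-in for the indexer: just report which files it would see."""
    return sorted(p.relative_to(folder).as_posix() for p in folder.rglob("*") if p.is_file())


print(stage_for_indexing(b"# Guide\n", "guide.md", list_files))
```

Doing the clean-up in a `finally` block means the temporary folder is removed even when indexing raises, which matches the "removed automatically" wording in the README.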

package/README.pt-br.md CHANGED

@@ -59,6 +59,30 @@ rag remove                     # desinstalação completa local (dupla confirma
 rag remove --force             # desinstala sem prompts de confirmação
 ```
+## Indexacao a partir de URL (HTTP/HTTPS)
+Agora o `rag run` aceita URL remota alem de pasta local.
+Como funciona:
+- Se voce passar `http://` ou `https://`, o wrapper baixa o conteudo para uma pasta temporaria.
+- Se o arquivo baixado for ZIP, ele e descompactado e a pasta extraida e indexada.
+- Arquivos de texto (`.txt`, `.md` e outros textos nao-binarios) podem ser indexados.
+- Arquivos binarios sao ignorados pelo indexador.
+- Ao terminar a indexacao, os arquivos temporarios baixados/descompactados sao removidos automaticamente.
+Comportamento do downloader:
+- Usa `curl` quando disponivel.
+- Se `curl` nao existir, o wrapper tenta instalar automaticamente:
+  - Linux: gerenciador de pacotes (`apt`, `dnf`, `yum`, `pacman`, `zypper`, `apk`)
+  - macOS: Homebrew (`brew`)
+Exemplos:
+```bash
+rag run https://exemplo.com/docs/guia.md
+rag run https://exemplo.com/snapshots/docs-projeto.zip
+```
 ## Arquivos e caminhos de configuração
 ### 1) Configuração de runtime (nível CLI)

package/bin/indexer_full.py CHANGED

@@ -194,7 +194,7 @@ IGNORED_EXTENSIONS = {
     ".mp4", ".mp3", ".wav", ".ogg", ".avi", ".mov",
     # Pacotes e compilados
     ".zip", ".tar", ".gz", ".rar", ".7z", ".jar", ".war", ".ear",
-    ".pyc", ".pyo", ".so", ".dll", ".exe", ".bin",
+    ".pyc", ".pyo", ".so", ".dll", ".exe", ".bin", ".run",
     # Lockfiles e gerados
     ".lock", ".sum",
     # Banco de dados
@@ -1113,17 +1113,44 @@ def make_chunk_id(file_path: str, chunk_index: int) -> str:
     return hashlib.md5(raw.encode()).hexdigest()
+def _looks_binary_content(raw: bytes) -> bool:
+    """Detecta conteúdo binário por heurística em amostra de bytes."""
+    if not raw:
+        return False
+    sample = raw[:4096]
+    if b"\x00" in sample:
+        return True
+    non_text_bytes = 0
+    for byte in sample:
+        if byte in (9, 10, 13):  # \t \n \r
+            continue
+        if 32 <= byte <= 126:  # ASCII imprimível
+            continue
+        if 160 <= byte <= 255:  # Latin-1 estendido comum em texto
+            continue
+        non_text_bytes += 1
+    return (non_text_bytes / len(sample)) > 0.30
 def read_file_safe(filepath: Path) -> str | None:
-    """Lê um arquivo de texto, tentando múltiplos encodings."""
+    """Lê um arquivo de texto, evitando binários e tentando múltiplos encodings."""
+    try:
+        raw = filepath.read_bytes()
+    except OSError as e:
+        print(f"  [AVISO] Não foi possível ler {filepath}: {e}")
+        return None
+    if _looks_binary_content(raw):
+        return None
     for encoding in ("utf-8", "latin-1", "cp1252"):
         try:
-            return filepath.read_text(encoding=encoding)
+            return raw.decode(encoding)
         except UnicodeDecodeError:
             continue
-        except OSError as e:
-            print(f"  [AVISO] Não foi possível ler {filepath}: {e}")
-            return None
-    # Se nenhum encoding funcionou, é provavelmente binário disfarçado
     return None
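
To make the new heuristic easier to review, the sketch below restates the logic of `_looks_binary_content` as a standalone function and runs it on a few hand-made samples. The function name `looks_binary` and the sample inputs are illustrative only; the thresholds and byte ranges are copied from the hunk above.

```python
def looks_binary(raw: bytes) -> bool:
    """Standalone restatement of the heuristic in the hunk above."""
    if not raw:
        return False
    sample = raw[:4096]
    if b"\x00" in sample:
        return True
    # Count bytes that are neither \t \n \r, printable ASCII, nor in 160..255.
    non_text = sum(
        1
        for byte in sample
        if not (byte in (9, 10, 13) or 32 <= byte <= 126 or 160 <= byte <= 255)
    )
    return non_text / len(sample) > 0.30


samples = {
    "empty": b"",
    "ascii": b"def main():\n\treturn 0\n",
    "utf8_pt": "indexação automática".encode("utf-8"),
    "nul_byte": b"\x7fELF\x00\x01",
    "control_bytes": bytes(range(1, 9)) * 16,
    "utf8_cjk": ("日" * 100).encode("utf-8"),
}
for name, raw in samples.items():
    print(f"{name}: {looks_binary(raw)}")
```

One thing worth checking in review: bytes 128 to 159 count as non-text, and they occur as UTF-8 continuation bytes. The last sample (`日` encodes to `E6 97 A5`) has one such byte in every three, so the ratio is about 0.33 and the sample is classified as binary even though it is valid UTF-8 text.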
@@ -1170,34 +1197,110 @@ def index_file(
     relative_path = str(filepath.relative_to(root_path))
     inserted_chunks = 0
+    skipped_chunks = 0
+    stop_iteration_warnings = 0
     batch_ids: list[str] = []
     batch_docs: list[str] = []
     batch_metadatas: list[dict[str, object]] = []
+    def _warn_stop_iteration(message: str) -> None:
+        nonlocal stop_iteration_warnings
+        if stop_iteration_warnings < 3:
+            tqdm.write(message)
+        stop_iteration_warnings += 1
+    def _to_embedding_rows(encoded_embeddings: object) -> list[list[float]]:
+        if hasattr(encoded_embeddings, "tolist"):
+            rows = encoded_embeddings.tolist()
+            if isinstance(rows, list):
+                if rows and isinstance(rows[0], (int, float)):
+                    return [list(rows)]
+                return rows
+        return [list(row) for row in encoded_embeddings]  # type: ignore[arg-type]
     def _flush_batch() -> None:
-        nonlocal inserted_chunks
+        nonlocal inserted_chunks, skipped_chunks
         if not batch_ids:
             return
-        embeddings = model.encode(
-            batch_docs,
-            show_progress_bar=False,
-            batch_size=embedding_batch_size,
-        ).tolist()
-        collection.upsert(
-            ids=batch_ids,
-            embeddings=embeddings,
-            documents=batch_docs,
-            metadatas=batch_metadatas,
-        )
-        inserted_chunks += len(batch_ids)
-        del embeddings
+        pending_ids = list(batch_ids)
+        pending_docs = list(batch_docs)
+        pending_metadatas = list(batch_metadatas)
+        try:
+            encoded = model.encode(
+                pending_docs,
+                show_progress_bar=False,
+                batch_size=embedding_batch_size,
+            )
+            embeddings = _to_embedding_rows(encoded)
+            collection.upsert(
+                ids=pending_ids,
+                embeddings=embeddings,
+                documents=pending_docs,
+                metadatas=pending_metadatas,
+            )
+            inserted_chunks += len(pending_ids)
+            del embeddings
+        except StopIteration:
+            _warn_stop_iteration(
+                f"  [AVISO] {filepath.name}: StopIteration no batch de embeddings; tentando fallback por chunk."
+            )
+            for chunk_id, chunk_doc, chunk_metadata in zip(pending_ids, pending_docs, pending_metadatas):
+                candidate_doc = chunk_doc.strip()
+                if not candidate_doc:
+                    skipped_chunks += 1
+                    continue
+                try:
+                    encoded_single = model.encode(
+                        [candidate_doc],
+                        show_progress_bar=False,
+                        batch_size=1,
+                    )
+                    single_embeddings = _to_embedding_rows(encoded_single)
+                    collection.upsert(
+                        ids=[chunk_id],
+                        embeddings=single_embeddings,
+                        documents=[candidate_doc],
+                        metadatas=[chunk_metadata],
+                    )
+                    inserted_chunks += 1
+                    del single_embeddings
+                except StopIteration:
+                    compact_doc = " ".join(candidate_doc.split())
+                    if not compact_doc:
+                        skipped_chunks += 1
+                        continue
+                    try:
+                        encoded_single = model.encode(
+                            [compact_doc],
+                            show_progress_bar=False,
+                            batch_size=1,
+                        )
+                        single_embeddings = _to_embedding_rows(encoded_single)
+                        collection.upsert(
+                            ids=[chunk_id],
+                            embeddings=single_embeddings,
+                            documents=[compact_doc],
+                            metadatas=[chunk_metadata],
+                        )
+                        inserted_chunks += 1
+                        del single_embeddings
+                    except StopIteration:
+                        skipped_chunks += 1
+                        _warn_stop_iteration(
+                            f"  [AVISO] {filepath.name}: chunk ignorado após StopIteration repetido."
+                        )
         batch_ids.clear()
         batch_docs.clear()
         batch_metadatas.clear()
         gc.collect()
     for i, chunk in enumerate(chunks):
+        if not chunk or not chunk.strip():
+            skipped_chunks += 1
+            continue
         batch_ids.append(make_chunk_id(abs_path, i))
         batch_docs.append(chunk)
         batch_metadatas.append(
@@ -1213,6 +1316,10 @@ def index_file(
             _flush_batch()
     _flush_batch()
+    if skipped_chunks:
+        _warn_stop_iteration(
+            f"  [AVISO] {filepath.name}: {skipped_chunks} chunk(s) vazio(s)/inválido(s) foram ignorados."
+        )
     return inserted_chunks
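
The new `_flush_batch` tries the whole batch first, then retries chunk by chunk (stripped, then whitespace-compacted) when `StopIteration` is raised. The sketch below condenses that control flow so it can be exercised without a model or a ChromaDB collection. `embed_with_fallback` and `fake_encode` are hypothetical names; `fake_encode` only imitates an encoder that fails on certain input and is not how the real model behaves.

```python
from typing import Callable, Sequence

Vector = list[float]
Encoder = Callable[[Sequence[str]], list[Vector]]


def embed_with_fallback(
    encode: Encoder, docs: Sequence[str]
) -> tuple[list[tuple[str, Vector]], int]:
    """Return (embedded (text, vector) pairs, number of skipped chunks)."""
    try:
        # Fast path: embed the whole batch in one call.
        return list(zip(docs, encode(docs))), 0
    except StopIteration:
        pass  # fall through to the per-chunk retry
    embedded: list[tuple[str, Vector]] = []
    skipped = 0
    for doc in docs:
        # First try the stripped text, then a whitespace-compacted version.
        for candidate in (doc.strip(), " ".join(doc.split())):
            if not candidate:
                continue
            try:
                embedded.append((candidate, encode([candidate])[0]))
                break
            except StopIteration:
                continue
        else:
            # Neither form could be embedded (or the chunk was blank).
            skipped += 1
    return embedded, skipped


def fake_encode(batch: Sequence[str]) -> list[Vector]:
    """Hypothetical encoder that fails on any text containing a tab."""
    if any("\t" in text for text in batch):
        raise StopIteration
    return [[float(len(text))] for text in batch]


pairs, skipped = embed_with_fallback(fake_encode, ["alpha", "  beta\tgamma ", "   "])
print(pairs, skipped)
```

As in the diff, a chunk that only succeeds after compaction is stored in its compacted form, so the stored document can differ from the original chunk text.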

package/bin/mcp_server.py CHANGED

@@ -399,7 +399,7 @@ IGNORED_EXTENSIONS = {
     ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".webp", ".bmp",
     ".mp4", ".mp3", ".wav", ".ogg", ".avi", ".mov",
     ".zip", ".tar", ".gz", ".rar", ".7z", ".jar", ".war",
-    ".pyc", ".pyo", ".so", ".dll", ".exe", ".bin",
+    ".pyc", ".pyo", ".so", ".dll", ".exe", ".bin", ".run",
     ".lock", ".sum", ".sqlite", ".db", ".sqlite3",
     ".ttf", ".woff", ".woff2", ".eot",
     ".pdf", ".docx", ".xlsx", ".pptx",
@@ -804,16 +804,43 @@ def _delete_file_chunks(collection: chromadb.Collection, file_path: str) -> int:
 def _read_file_safe(filepath: Path) -> str | None:
+    try:
+        raw = filepath.read_bytes()
+    except OSError:
+        return None
+    if _looks_binary_content(raw):
+        return None
     for encoding in ("utf-8", "latin-1", "cp1252"):
         try:
-            return filepath.read_text(encoding=encoding)
+            return raw.decode(encoding)
         except UnicodeDecodeError:
             continue
-        except OSError:
-            return None
     return None
+def _looks_binary_content(raw: bytes) -> bool:
+    if not raw:
+        return False
+    sample = raw[:4096]
+    if b"\x00" in sample:
+        return True
+    non_text_bytes = 0
+    for byte in sample:
+        if byte in (9, 10, 13):  # \t \n \r
+            continue
+        if 32 <= byte <= 126:  # ASCII imprimivel
+            continue
+        if 160 <= byte <= 255:  # Latin-1 estendido
+            continue
+        non_text_bytes += 1
+    return (non_text_bytes / len(sample)) > 0.30
 def _scan_folder(folder_path: Path) -> Iterator[Path]:
     for dirpath, dirnames, filenames in os.walk(folder_path):
         dirnames[:] = [
@@ -871,32 +898,108 @@ def _index_single_file_for_branch(
         _delete_file_chunks(collection, abs_path)
     inserted_chunks = 0
+    skipped_chunks = 0
+    stop_iteration_warnings = 0
     batch_ids: list[str] = []
     batch_docs: list[str] = []
     batch_metadatas: list[dict[str, object]] = []
+    def _warn_stop_iteration(message: str) -> None:
+        nonlocal stop_iteration_warnings
+        if stop_iteration_warnings < 3:
+            log.warning(message)
+        stop_iteration_warnings += 1
+    def _to_embedding_rows(encoded_embeddings: object) -> list[list[float]]:
+        if hasattr(encoded_embeddings, "tolist"):
+            rows = encoded_embeddings.tolist()
+            if isinstance(rows, list):
+                if rows and isinstance(rows[0], (int, float)):
+                    return [list(rows)]
+                return rows
+        return [list(row) for row in encoded_embeddings]  # type: ignore[arg-type]
     def _flush_batch() -> None:
-        nonlocal inserted_chunks
+        nonlocal inserted_chunks, skipped_chunks
         if not batch_ids:
             return
-        embeddings = model.encode(
-            batch_docs,
-            show_progress_bar=False,
-            batch_size=EMBEDDING_BATCH_SIZE,
-        ).tolist()
-        collection.upsert(
-            ids=batch_ids,
-            embeddings=embeddings,
-            documents=batch_docs,
-            metadatas=batch_metadatas,
-        )
-        inserted_chunks += len(batch_ids)
-        del embeddings
+        pending_ids = list(batch_ids)
+        pending_docs = list(batch_docs)
+        pending_metadatas = list(batch_metadatas)
+        try:
+            encoded = model.encode(
+                pending_docs,
+                show_progress_bar=False,
+                batch_size=EMBEDDING_BATCH_SIZE,
+            )
+            embeddings = _to_embedding_rows(encoded)
+            collection.upsert(
+                ids=pending_ids,
+                embeddings=embeddings,
+                documents=pending_docs,
+                metadatas=pending_metadatas,
+            )
+            inserted_chunks += len(pending_ids)
+            del embeddings
+        except StopIteration:
+            _warn_stop_iteration(
+                f"{filepath.name} [{branch.key}] StopIteration no batch; aplicando fallback por chunk."
+            )
+            for chunk_id, chunk_doc, chunk_metadata in zip(pending_ids, pending_docs, pending_metadatas):
+                candidate_doc = chunk_doc.strip()
+                if not candidate_doc:
+                    skipped_chunks += 1
+                    continue
+                try:
+                    encoded_single = model.encode(
+                        [candidate_doc],
+                        show_progress_bar=False,
+                        batch_size=1,
+                    )
+                    single_embeddings = _to_embedding_rows(encoded_single)
+                    collection.upsert(
+                        ids=[chunk_id],
+                        embeddings=single_embeddings,
+                        documents=[candidate_doc],
+                        metadatas=[chunk_metadata],
+                    )
+                    inserted_chunks += 1
+                    del single_embeddings
+                except StopIteration:
+                    compact_doc = " ".join(candidate_doc.split())
+                    if not compact_doc:
+                        skipped_chunks += 1
+                        continue
+                    try:
+                        encoded_single = model.encode(
+                            [compact_doc],
+                            show_progress_bar=False,
+                            batch_size=1,
+                        )
+                        single_embeddings = _to_embedding_rows(encoded_single)
+                        collection.upsert(
+                            ids=[chunk_id],
+                            embeddings=single_embeddings,
+                            documents=[compact_doc],
+                            metadatas=[chunk_metadata],
+                        )
+                        inserted_chunks += 1
+                        del single_embeddings
+                    except StopIteration:
+                        skipped_chunks += 1
+                        _warn_stop_iteration(
+                            f"{filepath.name} [{branch.key}] chunk ignorado após StopIteration repetido."
+                        )
         batch_ids.clear()
         batch_docs.clear()
         batch_metadatas.clear()
     for i, chunk in enumerate(chunks):
+        if not chunk or not chunk.strip():
+            skipped_chunks += 1
+            continue
         batch_ids.append(_make_chunk_id(abs_path, i))
         batch_docs.append(chunk)
         batch_metadatas.append(
@@ -914,6 +1017,10 @@ def _index_single_file_for_branch(
             _flush_batch()
     _flush_batch()
+    if skipped_chunks:
+        _warn_stop_iteration(
+            f"{filepath.name} [{branch.key}] ignorou {skipped_chunks} chunk(s) vazio(s)/inválido(s)."
+        )
     return inserted_chunks
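
The helper `_to_embedding_rows` replaces the direct `.tolist()` call so that a single flat vector is wrapped into a one-row list before `upsert`. The sketch below restates it outside the closure and feeds it three shapes. `FakeArray` is a hypothetical stand-in for whatever array type `model.encode` returns; it only needs a `tolist()` method.

```python
def to_embedding_rows(encoded: object) -> list[list[float]]:
    """Standalone restatement of _to_embedding_rows from the hunk above."""
    if hasattr(encoded, "tolist"):
        rows = encoded.tolist()
        if isinstance(rows, list):
            if rows and isinstance(rows[0], (int, float)):
                # A flat vector: wrap it so the result is always a list of rows.
                return [list(rows)]
            return rows
    # No tolist(): assume an iterable of row-like iterables.
    return [list(row) for row in encoded]  # type: ignore[union-attr]


class FakeArray:
    """Hypothetical array-like object exposing only tolist()."""

    def __init__(self, data: list) -> None:
        self._data = data

    def tolist(self) -> list:
        return self._data


print(to_embedding_rows(FakeArray([[0.1, 0.2], [0.3, 0.4]])))  # already 2-D
print(to_embedding_rows(FakeArray([0.1, 0.2])))                # flat vector
print(to_embedding_rows([(0.5, 0.6)]))                         # plain iterable of rows
```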
@@ -1196,15 +1303,32 @@ def _run_ensemble_mode(query: str, top_k: int) -> tuple[list[FusedHit], list[str
 # Servidor MCP via FastMCP
 # ---------------------------------------------------------------------------
-mcp = FastMCP(
-    name="rag-codebase",
-    instructions=(
+FORCE_SEMANTIC_FALLBACK = (
+    os.environ.get("MCP_FORCE_SEMANTIC_FALLBACK", "true").strip().lower()
+    in {"1", "true", "yes", "on"}
+)
+def _build_server_instructions() -> str:
+    instructions = (
         "Servidor RAG para busca semântica em código-fonte local com suporte a ensemble híbrido. "
         "No modo hybrid, a branch de código usa Jina v2 e a de documentação usa BGE. "
         "Use semantic_search_code(query, top_k, mode='ensemble') para combinar Jina+BGE com RRF e reranking. "
         "Use update_file_index após editar um arquivo para manter as duas coleções sincronizadas. "
         "Use index_specific_folder para indexação recursiva sob demanda."
-    ),
+    )
+    if FORCE_SEMANTIC_FALLBACK:
+        instructions += (
+            " Regra obrigatória: quando a pergunta mencionar função/classe/componente/arquivo "
+            "e isso não estiver no contexto imediato, chame semantic_search_code(mode='ensemble') "
+            "antes de responder que não encontrou. Só conclua ausência após ao menos uma busca semântica."
+        )
+    return instructions
+mcp = FastMCP(
+    name="rag-codebase",
+    instructions=_build_server_instructions(),
 )
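
The `FORCE_SEMANTIC_FALLBACK` flag defaults to enabled and is switched off only by a value outside the truthy set. A small sketch of the same parsing rule follows, with the environment passed in explicitly so it can be checked without touching the real process environment. The helper name `env_flag` is hypothetical; the variable name, default, and truthy values are taken from the hunk above.

```python
import os
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: str = "true", environ: Optional[Mapping[str, str]] = None) -> bool:
    """Parse a boolean flag the same way FORCE_SEMANTIC_FALLBACK is parsed."""
    source = os.environ if environ is None else environ
    return source.get(name, default).strip().lower() in TRUTHY


name = "MCP_FORCE_SEMANTIC_FALLBACK"
print(env_flag(name, environ={}))               # unset: falls back to the default
print(env_flag(name, environ={name: " ON "}))   # whitespace and case are ignored
print(env_flag(name, environ={name: "0"}))      # anything outside TRUTHY disables it
```

Because the default is `"true"`, the extra "mandatory rule" sentence is appended to the server instructions unless the variable is explicitly set to a non-truthy value such as `0`, `false`, or an empty string.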