npm - own-rag-cli - Versions diffs - 0.0.3-snapshot → 0.0.5-snapshot - Mend

own-rag-cli 0.0.3-snapshot → 0.0.5-snapshot

Files changed (9)

package/README.md CHANGED

@@ -1,15 +1,20 @@
+# MCP binary checksum (SHA-256, payload without shebang): `1413af4d4c7d01d57ec5195ea0c5f704f9fefabeb641d2f216a042ec638c2b59`
 # own-rag
-Local RAG for codebases with ChromaDB + MCP, focused on simple setup and local control.
+Local RAG for codebases with ChromaDB + MCP, focused on practical setup and lower LLM token waste.
 Language: English (default) | Portuguese: `README.pt-br.md`
 ## Why use own-rag
-- Search your codebase semantically from MCP clients (Claude/Cursor).
-- Keep data local by default.
-- Run on CPU (no GPU required).
-- Choose between safer memory usage (`autotune`) or higher throughput (`max-performance`).
+Without local retrieval, an AI assistant often scans many files or asks you to paste large chunks of code, which increases token usage and slows iteration. With `own-rag`, your repository is indexed once and exposed via MCP tools, so the assistant can fetch only the most relevant files/chunks.
+Example:
+- Without RAG: asking "where is `MedicationController` used?" may trigger broad project reads.
+- With MCP + own-rag: retrieval points directly to likely files first, so the assistant reads less and answers faster.
+Immediate MCP targets are Claude Code and Cursor, but the server can be used by any MCP-capable client.
 ## What gets installed
@@ -56,6 +61,30 @@ rag remove                      # full local uninstall (double confirmation)
 rag remove --force              # uninstall without confirmation prompts
 ```
+## Indexing from URL (HTTP/HTTPS)
+`rag run` now accepts remote URLs in addition to local folders.
+How it works:
+- If you pass `http://` or `https://`, the wrapper downloads content to a temporary folder.
+- If the downloaded file is a ZIP, it is extracted and the extracted folder is indexed.
+- Text files (`.txt`, `.md`, and other non-binary text files) are indexable.
+- Binary files are skipped by the indexer.
+- After indexing finishes, temporary downloaded/extracted files are removed automatically.
+Download tool behavior:
+- Uses `curl` when available.
+- If `curl` is missing, the wrapper attempts automatic installation:
+  - Linux: package manager (`apt`, `dnf`, `yum`, `pacman`, `zypper`, `apk`)
+  - macOS: Homebrew (`brew`)
+Examples:
+```bash
+rag run https://example.com/docs/guide.md
+rag run https://example.com/snapshots/project-docs.zip
+```
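Reviewer sketch: the flow described in the added section (temporary folder, ZIP extraction, indexing, automatic cleanup) could look roughly like the Python below. The wrapper's real implementation is not part of this diff, so `index_downloaded_file` and the `index_folder` callback are hypothetical names, and the download step (which the README says uses `curl`) is left out; the sketch starts from a file that has already been downloaded.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import Callable


def index_downloaded_file(downloaded: Path, index_folder: Callable[[Path], int]) -> int:
    """Index one downloaded file, extracting it first when it is a ZIP archive.

    `index_folder` is a hypothetical callback standing in for the real indexer;
    it receives the folder to index and returns a count of its choosing.
    """
    work_dir = Path(tempfile.mkdtemp(prefix="own-rag-url-"))
    try:
        if zipfile.is_zipfile(downloaded):
            # ZIP: extract it, then index the extracted folder.
            with zipfile.ZipFile(downloaded) as archive:
                archive.extractall(work_dir)
        else:
            # Single file (.md, .txt, ...): index a folder containing only that file.
            shutil.copy2(downloaded, work_dir / downloaded.name)
        return index_folder(work_dir)
    finally:
        # Temporary content is removed whether or not indexing succeeded.
        shutil.rmtree(work_dir, ignore_errors=True)
```

Putting the cleanup in `finally` is one way to match the "removed automatically" wording even when indexing fails; whether the wrapper behaves the same way on failure is not stated in the README.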
 ## Configuration files and paths
 ### 1) Runtime config (CLI-level)
@@ -63,8 +92,8 @@ rag remove --force              # uninstall without confirmation prompts
 Path: `~/.own-rag-cli.json`
 Purpose:
-- Stores Chroma endpoint (`scheme`, `host`, `port`).
-- Stores latest indexing batch fields (`indexing.embedding_batch_size`, `indexing.batch_count`).
+- Stores Chroma endpoint and last known indexing batch values.
+- Used as the simplest user-facing config to move between machines or switch endpoint.
 Example:
@@ -82,6 +111,18 @@ Example:
 }
 ```
+Field meaning:
+- `chroma.scheme`: `http` or `https`.
+- `chroma.host`: Chroma host (`localhost`, IP, or DNS).
+- `chroma.port`: Chroma TCP port.
+- `indexing.embedding_batch_size`: batch size used by embedding encode.
+- `indexing.batch_count`: mirrored batch value for explicit readability.
+If you edit this file manually:
+1. Save the file.
+2. Restart tools that consume MCP (Claude/Cursor).
+3. If you changed model/chunk/batch strategy, run reindex (`rag run <project>` or `--only-index`) so stored vectors match new settings.
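Reviewer sketch: the field list above can be read as a small loader that turns the runtime config into a Chroma URL. The JSON example itself is elided from this diff, so the nested layout (`{"chroma": {"scheme": ..., "host": ..., "port": ...}}`) is an assumption inferred from the dotted field names, and `load_config` and `chroma_url_from_config` are hypothetical helpers, not functions from the package. The fallback values follow the default endpoint `http://localhost:8000` documented in the README.

```python
import json
from pathlib import Path

# Defaults taken from the README's documented local endpoint.
DEFAULT_CHROMA = {"scheme": "http", "host": "localhost", "port": 8000}


def load_config(path: Path) -> dict:
    """Read the runtime config; a missing file is treated as an empty config."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {}


def chroma_url_from_config(config: dict) -> str:
    """Build the endpoint URL, validating the fields described in the README."""
    chroma = {**DEFAULT_CHROMA, **config.get("chroma", {})}
    if chroma["scheme"] not in ("http", "https"):
        raise ValueError(f"chroma.scheme must be http or https, got {chroma['scheme']!r}")
    port = int(chroma["port"])
    if not 1 <= port <= 65535:
        raise ValueError(f"chroma.port out of range: {port}")
    return f"{chroma['scheme']}://{chroma['host']}:{port}"
```

A call such as `chroma_url_from_config(load_config(Path.home() / ".own-rag-cli.json"))` would then reflect a manual edit as soon as the consuming tool is restarted.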
 ### 2) Indexer tuning config
 Path: `~/.cache/own-rag-cli/indexer_tuning.json`
@@ -100,11 +141,11 @@ Purpose:
 - Claude Code: `~/.claude.json`
 - Cursor: `~/.cursor/mcp.json`
-- Cursor (alt, Linux/macOS variants): `~/.config/Cursor/User/mcp.json` or `~/Library/Application Support/Cursor/User/mcp.json`
+- Cursor (alt paths): `~/.config/Cursor/User/mcp.json` or `~/Library/Application Support/Cursor/User/mcp.json`
 ## Choosing the embedding model (practical guidance)
-These are practical memory guidelines for CPU usage. Real usage depends on project size, file size, and other running processes.
+These are practical CPU-memory guidelines. Real usage depends on project size and concurrent processes.
 | Choice | Use case | Approx RAM target |
 |---|---|---|
@@ -113,70 +154,122 @@ These are practical memory guidelines for CPU usage. Real usage depends on proje
 | `hybrid` (`jina v2 + bge`) | two collections, broader retrieval strategy | 24-48 GiB |
 Notes:
-- Jina on CPU is heavy; with low free RAM/swap, you can see slowdowns or OOM (`exit 137`).
-- If your machine has limited memory, start with `bge`.
+- Jina on CPU is heavy; with low free RAM/swap, slowdowns or OOM (`exit 137`) can happen.
+- If your machine is memory-limited, start with `bge`.
 ## Performance profile
-During setup/indexing, choose one profile:
+During setup/indexing:
 - `autotune` (recommended):
   - Runs a short local benchmark (`model.encode`) with `psutil` metrics.
-  - Tries to keep memory in a safer range (roughly cost-benefit target).
+  - Targets cost-benefit (not too aggressive, not too conservative).
   - Automatically adjusts `MCP_EMBEDDING_BATCH_SIZE`, `MCP_CHUNK_SIZE`, `MCP_CHUNK_OVERLAP`.
 - `max-performance`:
   - Uses more aggressive throughput-oriented parameters.
-  - Explicit warning is shown because memory can increase considerably.
+  - Shows explicit memory-risk warning.
 ## Chroma endpoint behavior
 - Default local endpoint is `http://localhost:8000`.
-- Setup checks if the selected port is already in use.
-- If `host` in `~/.own-rag-cli.json` is remote (not localhost/127.0.0.1/::1), local Docker startup is skipped.
+- Setup checks if selected port is already in use.
+- If `~/.own-rag-cli.json` points to a remote host, local Docker startup is skipped.
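Reviewer note: the previous wording spelled out which hosts count as local (`localhost`, `127.0.0.1`, `::1`); the new wording only says "remote host". Assuming the rule itself did not change between versions (the setup script is not shown in this diff), the check amounts to something like the sketch below. `should_start_local_docker` is a hypothetical name.

```python
# Hosts the 0.0.3 README listed as local; anything else is treated as remote.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def should_start_local_docker(host: str) -> bool:
    """Return True only when the configured Chroma host is a local address."""
    # Tolerate surrounding whitespace, mixed case, and bracketed IPv6 such as "[::1]".
    normalized = host.strip().strip("[]").lower()
    return normalized in LOCAL_HOSTS
```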
+## Backup/snapshot before reinstall or machine migration
+Before running `rag remove`, formatting your machine, or moving to another machine, snapshot these files:
+```bash
+mkdir -p "$HOME/own-rag-backup"
+cp "$HOME/.own-rag-cli.json" "$HOME/own-rag-backup/" 2>/dev/null || true
+cp "$HOME/.cache/own-rag-cli/indexer_tuning.json" "$HOME/own-rag-backup/" 2>/dev/null || true
+tar -czf "$HOME/own-rag-backup/chromadb-ragdb-$(date +%Y%m%d-%H%M%S).tgz" -C "$HOME" .rag_db
+```
+Restore on a new machine:
+1. Install `own-rag-cli`.
+2. Restore `~/.own-rag-cli.json` and optional tuning file.
+3. Extract `.rag_db` backup into `$HOME`.
+4. Start with `rag run /path/to/project --skip-index` and validate search.
+5. If model/chunk settings changed, reindex.
 ## Common flows
-### Reindex only
+### Reindex only (`--only-index`)
 ```bash
 ./rag-setup.run /path/to/project --only-index
 ```
-### Skip index step
+Use when environment is already installed and you only changed project code/docs.
+### Skip index step (`--skip-index`)
 ```bash
 ./rag-setup.run /path/to/project --skip-index
 ```
-### Change model later
+Use when you want to refresh infra/config first (venv, Docker, MCP wiring), and index later.
+### Change model (`--change-model`)
 ```bash
 ./rag-setup.run --change-model /path/to/project
 ```
-This flow warns about resetting Chroma collections and requiring full reindex.
+Use when switching embedding strategy (`jina`, `bge`, `hybrid`). This may reset collections and requires full reindex so embeddings remain consistent.
 ## Environment overrides
-Main overrides:
+Where to override:
+- Temporary for one shell/session: `export VAR=value` before `rag run`.
+- Persistent for MCP runtime: set env in your MCP client config (`~/.claude.json`, Cursor `mcp.json`).
+- Chroma endpoint can also be persisted in `~/.own-rag-cli.json`.
+What each variable does:
 - `MCP_EMBEDDING_MODEL=jina|bge|hybrid`
+  - Selects embedding strategy.
+  - After changing: reindex required.
 - `MCP_JINA_QUANTIZATION=default|dynamic-int8`
+  - Jina CPU quantization mode.
+  - After changing: reindex recommended.
 - `MCP_PERF_PROFILE=autotune|max-performance`
+  - Chooses tuning strategy for chunk and batch decisions.
+  - After changing: rerun index setup to apply new tuning.
 - `MCP_EMBEDDING_BATCH_SIZE=<int>`
+  - Forces fixed embedding batch size (overrides autotune batch).
+  - After changing: restart MCP; reindex recommended for consistent throughput assumptions.
 - `MCP_CHUNK_SIZE=<int>`
+  - Sets chunk length for indexing.
+  - After changing: full reindex required.
 - `MCP_CHUNK_OVERLAP=<int>`
+  - Sets overlap between chunks.
+  - After changing: full reindex required.
 - `OWN_RAG_CLI_CONFIG_FILE=<path>`
+  - Moves runtime config location from default `~/.own-rag-cli.json`.
+  - After changing: restart setup/MCP tools.
 - `MCP_INDEXER_CONFIG_FILE=<path>`
+  - Moves tuning config location from default `~/.cache/own-rag-cli/indexer_tuning.json`.
+  - After changing: restart setup/MCP tools.
 - `MCP_CHROMA_SCHEME=http|https`
 - `MCP_CHROMA_HOST=<host>`
 - `MCP_CHROMA_PORT=<port>`
+  - Overrides Chroma endpoint.
+  - After changing: restart MCP clients; reindex only if switching to a fresh/empty database.
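Reviewer sketch: the "full reindex required" notes on `MCP_CHUNK_SIZE` and `MCP_CHUNK_OVERLAP` follow from how overlapping chunking works. The package's real chunker is not shown in this diff, so the character-window function below is an assumption used only to illustrate the effect; `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


text = "abcdefghij"
print(chunk_text(text, chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
print(chunk_text(text, chunk_size=5, overlap=0))  # ['abcde', 'fghij']
```

The same text yields different chunk boundaries under the two settings, so chunk index 1 refers to different content in each run. Since `make_chunk_id` in `indexer_full.py` takes a file path and a chunk index, vectors stored under the old settings would no longer correspond to the new chunks, which is consistent with the README asking for a full reindex.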
 ## Checksum verification (`.run`)
-Current checksum (documented in this repo) refers to payload hash.
 Manual verification (Linux/macOS):
 ```bash

package/README.pt-br.md CHANGED

@@ -1,15 +1,18 @@
 # own-rag
-RAG local para codebases com ChromaDB + MCP, focado em instalação simples e controle local.
+RAG local para codebases com ChromaDB + MCP, focado em setup prático e menor desperdício de tokens no uso de IA.
 Idioma: Português (PT-BR) | English: `README.md`
 ## Por que usar o own-rag
-- Busca semântica no seu código via clientes MCP (Claude/Cursor).
-- Dados locais por padrão.
-- Execução em CPU (sem GPU obrigatória).
-- Escolha entre perfil mais seguro de memória (`autotune`) ou maior throughput (`max-performance`).
+Sem recuperação local, a IA costuma ler muitos arquivos ou pedir grandes trechos de código, aumentando custo de tokens e tempo de resposta. Com `own-rag`, o repositório é indexado uma vez e exposto por ferramentas MCP, então a IA consulta primeiro os trechos mais relevantes.
+Exemplo:
+- Sem RAG: ao perguntar "onde `MedicationController` é usado?", a IA pode varrer boa parte do projeto.
+- Com MCP + own-rag: a recuperação aponta rapidamente para os arquivos com maior probabilidade, reduzindo leitura desnecessária.
+Os alvos imediatos por padrão são Claude Code e Cursor, mas o servidor pode ser usado por qualquer cliente compatível com MCP.
 ## O que é instalado
@@ -56,6 +59,30 @@ rag remove                     # desinstalação completa local (dupla confirma
 rag remove --force             # desinstala sem prompts de confirmação
 ```
+## Indexacao a partir de URL (HTTP/HTTPS)
+Agora o `rag run` aceita URL remota alem de pasta local.
+Como funciona:
+- Se voce passar `http://` ou `https://`, o wrapper baixa o conteudo para uma pasta temporaria.
+- Se o arquivo baixado for ZIP, ele e descompactado e a pasta extraida e indexada.
+- Arquivos de texto (`.txt`, `.md` e outros textos nao-binarios) podem ser indexados.
+- Arquivos binarios sao ignorados pelo indexador.
+- Ao terminar a indexacao, os arquivos temporarios baixados/descompactados sao removidos automaticamente.
+Comportamento do downloader:
+- Usa `curl` quando disponivel.
+- Se `curl` nao existir, o wrapper tenta instalar automaticamente:
+  - Linux: gerenciador de pacotes (`apt`, `dnf`, `yum`, `pacman`, `zypper`, `apk`)
+  - macOS: Homebrew (`brew`)
+Exemplos:
+```bash
+rag run https://exemplo.com/docs/guia.md
+rag run https://exemplo.com/snapshots/docs-projeto.zip
+```
 ## Arquivos e caminhos de configuração
 ### 1) Configuração de runtime (nível CLI)
@@ -63,8 +90,8 @@ rag remove --force             # desinstala sem prompts de confirmação
 Caminho: `~/.own-rag-cli.json`
 Finalidade:
-- Armazena endpoint do Chroma (`scheme`, `host`, `port`).
-- Armazena campos de batch da indexação (`indexing.embedding_batch_size`, `indexing.batch_count`).
+- Guarda endpoint do Chroma e os últimos valores de batch de indexação.
+- É o arquivo mais simples para portar configuração entre máquinas.
 Exemplo:
@@ -82,6 +109,18 @@ Exemplo:
 }
 ```
+Significado de cada variável:
+- `chroma.scheme`: protocolo (`http` ou `https`).
+- `chroma.host`: host do Chroma (`localhost`, IP ou DNS).
+- `chroma.port`: porta TCP do Chroma.
+- `indexing.embedding_batch_size`: tamanho do batch usado no encode de embeddings.
+- `indexing.batch_count`: espelho explícito do valor de batch para leitura humana.
+Se você editar esse arquivo manualmente:
+1. Salve o arquivo.
+2. Reinicie as ferramentas que usam MCP (Claude/Cursor).
+3. Se mudou estratégia de modelo/chunk/batch, rode reindexação (`rag run <projeto>` ou `--only-index`) para manter consistência dos vetores.
 ### 2) Configuração de tuning do indexador
 Caminho: `~/.cache/own-rag-cli/indexer_tuning.json`
@@ -100,11 +139,11 @@ Finalidade:
 - Claude Code: `~/.claude.json`
 - Cursor: `~/.cursor/mcp.json`
-- Cursor (variações Linux/macOS): `~/.config/Cursor/User/mcp.json` ou `~/Library/Application Support/Cursor/User/mcp.json`
+- Cursor (caminhos alternativos): `~/.config/Cursor/User/mcp.json` ou `~/Library/Application Support/Cursor/User/mcp.json`
 ## Escolha do modelo de embeddings (guia prático)
-Valores abaixo são referência prática para CPU. O consumo real depende do tamanho do projeto, arquivos e processos concorrentes.
+Valores abaixo são referência prática para CPU. O consumo real depende do tamanho do projeto e de processos concorrentes.
 | Opção | Quando usar | RAM desejada (aprox.) |
 |---|---|---|
@@ -118,60 +157,114 @@ Notas:
 ## Perfil de performance
-Durante setup/indexação, escolha:
+Durante setup/indexação:
 - `autotune` (recomendado):
   - Executa micro-benchmark local (`model.encode`) com métricas `psutil`.
-  - Busca faixa mais estável de memória (custo-benefício).
+  - Busca custo-benefício (nem agressivo, nem conservador).
   - Ajusta automaticamente `MCP_EMBEDDING_BATCH_SIZE`, `MCP_CHUNK_SIZE`, `MCP_CHUNK_OVERLAP`.
 - `max-performance`:
   - Usa parâmetros mais agressivos para throughput.
-  - Exibe aviso explícito de maior consumo de memória.
+  - Exibe aviso explícito de risco de memória.
 ## Comportamento do endpoint Chroma
 - Endpoint local padrão: `http://localhost:8000`.
 - O setup verifica se a porta escolhida já está em uso.
-- Se `host` em `~/.own-rag-cli.json` for remoto (não localhost/127.0.0.1/::1), o setup não sobe Docker local.
+- Se `~/.own-rag-cli.json` apontar para host remoto, o setup não sobe Docker local.
+## Backup/snapshot antes de remover, formatar ou migrar máquina
+Antes de executar `rag remove`, formatar o computador, ou migrar para outra máquina, faça snapshot:
+```bash
+mkdir -p "$HOME/own-rag-backup"
+cp "$HOME/.own-rag-cli.json" "$HOME/own-rag-backup/" 2>/dev/null || true
+cp "$HOME/.cache/own-rag-cli/indexer_tuning.json" "$HOME/own-rag-backup/" 2>/dev/null || true
+tar -czf "$HOME/own-rag-backup/chromadb-ragdb-$(date +%Y%m%d-%H%M%S).tgz" -C "$HOME" .rag_db
+```
+Restauração em nova máquina:
+1. Instale `own-rag-cli`.
+2. Restaure `~/.own-rag-cli.json` e, opcionalmente, o tuning.
+3. Extraia o backup de `.rag_db` em `$HOME`.
+4. Execute `rag run /caminho/do/projeto --skip-index` e valide a busca.
+5. Se mudou modelo/chunks, reindexe.
 ## Fluxos comuns
-### Reindexar somente
+### Reindexar somente (`--only-index`)
 ```bash
 ./rag-setup.run /caminho/do/projeto --only-index
 ```
-### Pular indexação
+Use quando o ambiente já está instalado e você só alterou código/documentação do projeto.
+### Pular indexação (`--skip-index`)
 ```bash
 ./rag-setup.run /caminho/do/projeto --skip-index
 ```
-### Trocar modelo depois
+Use para atualizar infraestrutura/configuração primeiro (venv, Docker, MCP), deixando a indexação para depois.
+### Trocar modelo (`--change-model`)
 ```bash
 ./rag-setup.run --change-model /caminho/do/projeto
 ```
-Esse fluxo alerta sobre reset das coleções Chroma e reindexação completa.
+Use ao mudar estratégia de embeddings (`jina`, `bge`, `hybrid`). Esse fluxo pode resetar coleções e exige reindexação completa para consistência.
-## Variáveis de ambiente (override)
+## Environment overrides
-Principais variáveis:
+Onde sobrescrever:
+- Temporário por sessão: `export VAR=valor` antes do `rag run`.
+- Persistente no runtime MCP: definir `env` no arquivo de configuração do MCP (`~/.claude.json`, `mcp.json` do Cursor).
+- Endpoint Chroma também pode ser persistido em `~/.own-rag-cli.json`.
+Para que serve cada variável:
 - `MCP_EMBEDDING_MODEL=jina|bge|hybrid`
+  - Seleciona estratégia de embedding.
+  - Ao alterar: reindexação obrigatória.
 - `MCP_JINA_QUANTIZATION=default|dynamic-int8`
+  - Define quantização do Jina em CPU.
+  - Ao alterar: reindexação recomendada.
 - `MCP_PERF_PROFILE=autotune|max-performance`
+  - Define estratégia de tuning para chunks e batch.
+  - Ao alterar: rode setup/indexador novamente para aplicar novo tuning.
 - `MCP_EMBEDDING_BATCH_SIZE=<int>`
+  - Força batch fixo (sobrescreve batch autotunado).
+  - Ao alterar: reinicie MCP; reindexação recomendada para consistência operacional.
 - `MCP_CHUNK_SIZE=<int>`
+  - Define tamanho dos chunks na indexação.
+  - Ao alterar: reindexação completa obrigatória.
 - `MCP_CHUNK_OVERLAP=<int>`
+  - Define sobreposição entre chunks.
+  - Ao alterar: reindexação completa obrigatória.
 - `OWN_RAG_CLI_CONFIG_FILE=<path>`
+  - Muda localização do arquivo runtime (padrão `~/.own-rag-cli.json`).
+  - Ao alterar: reinicie setup/ferramentas MCP.
 - `MCP_INDEXER_CONFIG_FILE=<path>`
+  - Muda localização do arquivo de tuning (padrão `~/.cache/own-rag-cli/indexer_tuning.json`).
+  - Ao alterar: reinicie setup/ferramentas MCP.
 - `MCP_CHROMA_SCHEME=http|https`
 - `MCP_CHROMA_HOST=<host>`
 - `MCP_CHROMA_PORT=<port>`
+  - Sobrescrevem endpoint do Chroma.
+  - Ao alterar: reinicie clientes MCP; reindexe somente se trocar para base vazia/nova.
 ## Verificação de checksum (`.run`)

package/bin/indexer_full.py CHANGED


@@ -194,7 +194,7 @@ IGNORED_EXTENSIONS = {
     ".mp4", ".mp3", ".wav", ".ogg", ".avi", ".mov",
     # Pacotes e compilados
     ".zip", ".tar", ".gz", ".rar", ".7z", ".jar", ".war", ".ear",
-    ".pyc", ".pyo", ".so", ".dll", ".exe", ".bin",
+    ".pyc", ".pyo", ".so", ".dll", ".exe", ".bin", ".run",
     # Lockfiles e gerados
     ".lock", ".sum",
     # Banco de dados
@@ -1113,17 +1113,44 @@ def make_chunk_id(file_path: str, chunk_index: int) -> str:
     return hashlib.md5(raw.encode()).hexdigest()
+def _looks_binary_content(raw: bytes) -> bool:
+    """Detecta conteúdo binário por heurística em amostra de bytes."""
+    if not raw:
+        return False
+    sample = raw[:4096]
+    if b"\x00" in sample:
+        return True
+    non_text_bytes = 0
+    for byte in sample:
+        if byte in (9, 10, 13):  # \t \n \r
+            continue
+        if 32 <= byte <= 126:  # ASCII imprimível
+            continue
+        if 160 <= byte <= 255:  # Latin-1 estendido comum em texto
+            continue
+        non_text_bytes += 1
+    return (non_text_bytes / len(sample)) > 0.30
 def read_file_safe(filepath: Path) -> str | None:
-    """Lê um arquivo de texto, tentando múltiplos encodings."""
+    """Lê um arquivo de texto, evitando binários e tentando múltiplos encodings."""
+    try:
+        raw = filepath.read_bytes()
+    except OSError as e:
+        print(f"  [AVISO] Não foi possível ler {filepath}: {e}")
+        return None
+    if _looks_binary_content(raw):
+        return None
     for encoding in ("utf-8", "latin-1", "cp1252"):
         try:
-            return filepath.read_text(encoding=encoding)
+            return raw.decode(encoding)
         except UnicodeDecodeError:
             continue
-        except OSError as e:
-            print(f"  [AVISO] Não foi possível ler {filepath}: {e}")
-            return None
-    # Se nenhum encoding funcionou, é provavelmente binário disfarçado
     return None
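Reviewer sketch: to see how the new binary heuristic classifies inputs, the function can be restated compactly so it runs standalone. The thresholds (4096-byte sample, NUL byte check, 30% non-text ratio, accepted byte ranges) are copied from the hunk above; the name `looks_binary` and the keyword parameters are introduced here only for illustration.

```python
TEXT_CONTROL_BYTES = {9, 10, 13}  # \t \n \r


def looks_binary(raw: bytes, sample_size: int = 4096, threshold: float = 0.30) -> bool:
    """Standalone restatement of `_looks_binary_content` from the diff."""
    if not raw:
        return False
    sample = raw[:sample_size]
    if b"\x00" in sample:
        return True
    suspicious = sum(
        1
        for byte in sample
        if byte not in TEXT_CONTROL_BYTES
        and not (32 <= byte <= 126)   # printable ASCII
        and not (160 <= byte <= 255)  # extended Latin-1 range
    )
    return suspicious / len(sample) > threshold


print(looks_binary(b"def main():\n    pass\n"))      # False
print(looks_binary(b"\x7fELF\x00\x01\x02"))          # True (NUL byte)
print(looks_binary("ação".encode("utf-8")))          # False
print(looks_binary("\u3042".encode("utf-8") * 10))   # True
```

The last call is worth a second look: bytes 128 to 159 fall outside every accepted range, and UTF-8 encodes U+3042 (hiragana "a") as `E3 81 82`, so two of every three bytes count as non-text. Under this heuristic a UTF-8 file made mostly of such characters would be skipped before any decoding is attempted; whether that matters depends on the repositories being indexed.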
@@ -1170,34 +1197,110 @@ def index_file(
     relative_path = str(filepath.relative_to(root_path))
     inserted_chunks = 0
+    skipped_chunks = 0
+    stop_iteration_warnings = 0
     batch_ids: list[str] = []
     batch_docs: list[str] = []
     batch_metadatas: list[dict[str, object]] = []
+    def _warn_stop_iteration(message: str) -> None:
+        nonlocal stop_iteration_warnings
+        if stop_iteration_warnings < 3:
+            tqdm.write(message)
+        stop_iteration_warnings += 1
+    def _to_embedding_rows(encoded_embeddings: object) -> list[list[float]]:
+        if hasattr(encoded_embeddings, "tolist"):
+            rows = encoded_embeddings.tolist()
+            if isinstance(rows, list):
+                if rows and isinstance(rows[0], (int, float)):
+                    return [list(rows)]
+                return rows
+        return [list(row) for row in encoded_embeddings]  # type: ignore[arg-type]
     def _flush_batch() -> None:
-        nonlocal inserted_chunks
+        nonlocal inserted_chunks, skipped_chunks
         if not batch_ids:
             return
-        embeddings = model.encode(
-            batch_docs,
-            show_progress_bar=False,
-            batch_size=embedding_batch_size,
-        ).tolist()
-        collection.upsert(
-            ids=batch_ids,
-            embeddings=embeddings,
-            documents=batch_docs,
-            metadatas=batch_metadatas,
-        )
-        inserted_chunks += len(batch_ids)
-        del embeddings
+        pending_ids = list(batch_ids)
+        pending_docs = list(batch_docs)
+        pending_metadatas = list(batch_metadatas)
+        try:
+            encoded = model.encode(
+                pending_docs,
+                show_progress_bar=False,
+                batch_size=embedding_batch_size,
+            )
+            embeddings = _to_embedding_rows(encoded)
+            collection.upsert(
+                ids=pending_ids,
+                embeddings=embeddings,
+                documents=pending_docs,
+                metadatas=pending_metadatas,
+            )
+            inserted_chunks += len(pending_ids)
+            del embeddings
+        except StopIteration:
+            _warn_stop_iteration(
+                f"  [AVISO] {filepath.name}: StopIteration no batch de embeddings; tentando fallback por chunk."
+            )
+            for chunk_id, chunk_doc, chunk_metadata in zip(pending_ids, pending_docs, pending_metadatas):
+                candidate_doc = chunk_doc.strip()
+                if not candidate_doc:
+                    skipped_chunks += 1
+                    continue
+                try:
+                    encoded_single = model.encode(
+                        [candidate_doc],
+                        show_progress_bar=False,
+                        batch_size=1,
+                    )
+                    single_embeddings = _to_embedding_rows(encoded_single)
+                    collection.upsert(
+                        ids=[chunk_id],
+                        embeddings=single_embeddings,
+                        documents=[candidate_doc],
+                        metadatas=[chunk_metadata],
+                    )
+                    inserted_chunks += 1
+                    del single_embeddings
+                except StopIteration:
+                    compact_doc = " ".join(candidate_doc.split())
+                    if not compact_doc:
+                        skipped_chunks += 1
+                        continue
+                    try:
+                        encoded_single = model.encode(
+                            [compact_doc],
+                            show_progress_bar=False,
+                            batch_size=1,
+                        )
+                        single_embeddings = _to_embedding_rows(encoded_single)
+                        collection.upsert(
+                            ids=[chunk_id],
+                            embeddings=single_embeddings,
+                            documents=[compact_doc],
+                            metadatas=[chunk_metadata],
+                        )
+                        inserted_chunks += 1
+                        del single_embeddings
+                    except StopIteration:
+                        skipped_chunks += 1
+                        _warn_stop_iteration(
+                            f"  [AVISO] {filepath.name}: chunk ignorado após StopIteration repetido."
+                        )
         batch_ids.clear()
         batch_docs.clear()
         batch_metadatas.clear()
         gc.collect()
     for i, chunk in enumerate(chunks):
+        if not chunk or not chunk.strip():
+            skipped_chunks += 1
+            continue
         batch_ids.append(make_chunk_id(abs_path, i))
         batch_docs.append(chunk)
         batch_metadatas.append(
@@ -1213,6 +1316,10 @@ def index_file(
             _flush_batch()
     _flush_batch()
+    if skipped_chunks:
+        _warn_stop_iteration(
+            f"  [AVISO] {filepath.name}: {skipped_chunks} chunk(s) vazio(s)/inválido(s) foram ignorados."
+        )
     return inserted_chunks
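Reviewer sketch: the per-chunk fallback in `_flush_batch` tries the stripped chunk first, then a whitespace-compacted version, and skips the chunk if both raise `StopIteration`. The same order can be expressed as a loop over candidate texts, which avoids repeating the encode/upsert block. This is a simplified illustration, not the package's code: `candidate_texts` and `encode_with_fallback` are hypothetical names, the `encode` callable stands in for `model.encode`, the upsert is left to the caller, and unlike the original the sketch does not retry when the compacted text is identical to the stripped one.

```python
from typing import Callable, Iterator, Optional

# Stand-in for model.encode: takes a list of texts, returns one vector per text.
Encoder = Callable[[list[str]], list[list[float]]]


def candidate_texts(chunk: str) -> Iterator[str]:
    """Yield the texts to try, in order: stripped, then whitespace-compacted."""
    stripped = chunk.strip()
    if stripped:
        yield stripped
    compact = " ".join(stripped.split())
    if compact and compact != stripped:
        yield compact


def encode_with_fallback(chunk: str, encode: Encoder) -> Optional[tuple[str, list[float]]]:
    """Return (text_used, embedding), or None when the chunk should be skipped."""
    for text in candidate_texts(chunk):
        try:
            return text, encode([text])[0]
        except StopIteration:
            continue  # try the next, more aggressively normalized candidate
    return None
```

A caller would upsert the returned text and embedding when the result is not `None`, and increment its skipped counter otherwise.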