PyPI - gnosisllm-knowledge - Versions diffs - 0.3.0__tar.gz → 0.4.0__tar.gz - Mend

gnosisllm-knowledge 0.3.0__tar.gz → 0.4.0__tar.gz

This diff compares the contents of two publicly released package versions as they appear in their public registry. It is provided for informational purposes only.

Files changed (80)

{gnosisllm_knowledge-0.3.0 → gnosisllm_knowledge-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gnosisllm-knowledge
-Version: 0.3.0
+Version: 0.4.0
 Summary: Enterprise-grade knowledge loading, indexing, and search for Python
 License: MIT
 Keywords: knowledge-base,rag,semantic-search,vector-search,opensearch,llm,embeddings,enterprise
@@ -46,7 +46,7 @@ Enterprise-grade knowledge loading, indexing, and semantic search library for Py
 - **Multiple Loaders**: Load content from websites, sitemaps, and files
 - **Intelligent Chunking**: Sentence-aware text splitting with configurable overlap
 - **OpenSearch Backend**: Production-ready with k-NN vector search
-- **Multi-Tenancy**: Built-in support for account and collection isolation
+- **Multi-Tenancy**: Index isolation for complete tenant separation (tenant-agnostic library)
 - **Event-Driven**: Observer pattern for progress tracking and monitoring
 - **SOLID Architecture**: Clean, maintainable, and extensible codebase
@@ -144,14 +144,15 @@ gnosisllm-knowledge load <URL> [OPTIONS]
 Options:
   --type         Source type: website, sitemap (auto-detects)
-  --index        Target index name (default: knowledge)
-  --account-id   Multi-tenant account ID
+  --index        Target index name (e.g., knowledge-tenant-123)
   --collection-id Collection grouping ID
   --batch-size   Documents per batch (default: 100)
   --max-urls     Max URLs from sitemap (default: 1000)
   --dry-run      Preview without indexing
 ```
+Multi-tenancy is achieved through index isolation. Use `--index` with tenant-specific names (e.g., `--index knowledge-tenant-123`).
 ### Search
 Search indexed content with multiple modes:
@@ -161,14 +162,15 @@ gnosisllm-knowledge search <QUERY> [OPTIONS]
 Options:
   --mode         Search mode: semantic, keyword, hybrid, agentic
-  --index        Index to search (default: knowledge)
+  --index        Index to search (e.g., knowledge-tenant-123)
   --limit        Max results (default: 5)
-  --account-id   Filter by account
   --collection-ids Filter by collections (comma-separated)
   --json         Output as JSON for scripting
   --interactive  Interactive search session
 ```
+Multi-tenancy is achieved through index isolation. Use `--index` with tenant-specific names.
 ## Architecture
 ```
@@ -319,22 +321,40 @@ agent_body = {
 ## Multi-Tenancy
+This library is **tenant-agnostic**. Multi-tenancy is achieved through **index isolation** - each tenant gets their own OpenSearch index.
 ```python
-# Load with tenant isolation
+# The calling application (e.g., API) constructs tenant-specific index names
+index_name = f"knowledge-{account_id}"
+# Create Knowledge instance for the tenant
+knowledge = Knowledge.from_opensearch(
+    host="localhost",
+    port=9200,
+    index_prefix=index_name,  # knowledge-tenant-123
+)
+# Load content to tenant's isolated index
 await knowledge.load(
     source="https://docs.example.com/sitemap.xml",
-    account_id="tenant-123",
     collection_id="docs",
 )
-# Search within tenant
+# Search within tenant's index (no account_id filter needed)
 results = await knowledge.search(
     "query",
-    account_id="tenant-123",
     collection_ids=["docs"],
 )
 ```
+**Note**: For audit purposes, you can store `account_id` in document metadata:
+```python
+await knowledge.load(
+    source="https://docs.example.com/sitemap.xml",
+    document_defaults={"metadata": {"account_id": "tenant-123"}},
+)
+```
 ## Agentic Memory
 Conversational memory with automatic fact extraction using OpenSearch's ML Memory plugin.

{gnosisllm_knowledge-0.3.0 → gnosisllm_knowledge-0.4.0}/README.md RENAMED

@@ -11,7 +11,7 @@ Enterprise-grade knowledge loading, indexing, and semantic search library for Py
 - **Multiple Loaders**: Load content from websites, sitemaps, and files
 - **Intelligent Chunking**: Sentence-aware text splitting with configurable overlap
 - **OpenSearch Backend**: Production-ready with k-NN vector search
-- **Multi-Tenancy**: Built-in support for account and collection isolation
+- **Multi-Tenancy**: Index isolation for complete tenant separation (tenant-agnostic library)
 - **Event-Driven**: Observer pattern for progress tracking and monitoring
 - **SOLID Architecture**: Clean, maintainable, and extensible codebase
@@ -109,14 +109,15 @@ gnosisllm-knowledge load <URL> [OPTIONS]
 Options:
   --type         Source type: website, sitemap (auto-detects)
-  --index        Target index name (default: knowledge)
-  --account-id   Multi-tenant account ID
+  --index        Target index name (e.g., knowledge-tenant-123)
   --collection-id Collection grouping ID
   --batch-size   Documents per batch (default: 100)
   --max-urls     Max URLs from sitemap (default: 1000)
   --dry-run      Preview without indexing
 ```
+Multi-tenancy is achieved through index isolation. Use `--index` with tenant-specific names (e.g., `--index knowledge-tenant-123`).
 ### Search
 Search indexed content with multiple modes:
@@ -126,14 +127,15 @@ gnosisllm-knowledge search <QUERY> [OPTIONS]
 Options:
   --mode         Search mode: semantic, keyword, hybrid, agentic
-  --index        Index to search (default: knowledge)
+  --index        Index to search (e.g., knowledge-tenant-123)
   --limit        Max results (default: 5)
-  --account-id   Filter by account
   --collection-ids Filter by collections (comma-separated)
   --json         Output as JSON for scripting
   --interactive  Interactive search session
 ```
+Multi-tenancy is achieved through index isolation. Use `--index` with tenant-specific names.
 ## Architecture
 ```
@@ -284,22 +286,40 @@ agent_body = {
 ## Multi-Tenancy
+This library is **tenant-agnostic**. Multi-tenancy is achieved through **index isolation** - each tenant gets their own OpenSearch index.
 ```python
-# Load with tenant isolation
+# The calling application (e.g., API) constructs tenant-specific index names
+index_name = f"knowledge-{account_id}"
+# Create Knowledge instance for the tenant
+knowledge = Knowledge.from_opensearch(
+    host="localhost",
+    port=9200,
+    index_prefix=index_name,  # knowledge-tenant-123
+)
+# Load content to tenant's isolated index
 await knowledge.load(
     source="https://docs.example.com/sitemap.xml",
-    account_id="tenant-123",
     collection_id="docs",
 )
-# Search within tenant
+# Search within tenant's index (no account_id filter needed)
 results = await knowledge.search(
     "query",
-    account_id="tenant-123",
     collection_ids=["docs"],
 )
 ```
+**Note**: For audit purposes, you can store `account_id` in document metadata:
+```python
+await knowledge.load(
+    source="https://docs.example.com/sitemap.xml",
+    document_defaults={"metadata": {"account_id": "tenant-123"}},
+)
+```
 ## Agentic Memory
 Conversational memory with automatic fact extraction using OpenSearch's ML Memory plugin.
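
The README hunks above leave it to the calling application to build tenant-specific index names such as `knowledge-tenant-123`. A minimal sketch of that step follows. `tenant_index_name` is a hypothetical helper and not part of gnosisllm-knowledge. It assumes account IDs are case-insensitive and that the target cluster applies the usual OpenSearch index-name restrictions (lowercase only, no leading `_` or `-`, no spaces, commas, or `: " * + / \ | ? # > <`).

```python
import re

# Whitespace, commas, and the special characters OpenSearch rejects in index names.
_FORBIDDEN = re.compile(r'[\s,:"*+/\\|?#<>]')


def tenant_index_name(account_id: str, prefix: str = "knowledge") -> str:
    """Build a per-tenant index name such as 'knowledge-tenant-123'.

    Hypothetical helper for illustration; not part of gnosisllm-knowledge.
    """
    cleaned = account_id.strip().lower()  # assumption: IDs are case-insensitive
    if not cleaned:
        raise ValueError("account_id must not be empty")
    if _FORBIDDEN.search(cleaned):
        # Reject instead of stripping characters, so that two different
        # account IDs can never be rewritten into the same index name.
        raise ValueError(f"account_id not usable in an index name: {account_id!r}")
    name = f"{prefix}-{cleaned}"
    if name.startswith(("_", "-")):
        raise ValueError(f"index name must not start with '_' or '-': {name!r}")
    return name
```

The result would then be passed as `index_name=` to `load()` or `search()`, or as `--index` on the CLI. Rejecting invalid IDs rather than sanitizing them is a deliberate choice here: with index-per-tenant isolation, a name collision would merge two tenants' data.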

{gnosisllm_knowledge-0.3.0 → gnosisllm_knowledge-0.4.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "gnosisllm-knowledge"
-version = "0.3.0"
+version = "0.4.0"
 description = "Enterprise-grade knowledge loading, indexing, and search for Python"
 authors = [
     {name = "David Marsa", email = "david.marsa@neomanex.com"},

{gnosisllm_knowledge-0.3.0 → gnosisllm_knowledge-0.4.0}/src/gnosisllm_knowledge/api/knowledge.py RENAMED

@@ -1,4 +1,39 @@
-"""High-level Knowledge API facade."""
+"""High-level Knowledge API facade.
+This module provides the main entry point for the gnosisllm-knowledge library.
+The Knowledge class is a high-level facade that abstracts the complexity of
+loading, indexing, and searching knowledge documents.
+Note:
+    This library is tenant-agnostic. Multi-tenancy should be handled at the
+    API layer by using separate indices per account (e.g.,
+    `knowledge-{account_id}`) rather than filtering by account_id.
+Example:
+    ```python
+    # Create Knowledge instance for a specific tenant
+    knowledge = Knowledge.from_opensearch(
+        host="localhost",
+        port=9200,
+    )
+    # Use a tenant-specific index
+    tenant_index = f"knowledge-{account_id}"
+    # Load content
+    await knowledge.load(
+        "https://docs.example.com/sitemap.xml",
+        index_name=tenant_index,
+        collection_id="docs",
+    )
+    # Search (tenant isolation via index name)
+    results = await knowledge.search(
+        "how to configure",
+        index_name=tenant_index,
+    )
+    ```
+"""
 from __future__ import annotations
@@ -130,6 +165,10 @@ class Knowledge:
     ) -> Knowledge:
         """Create Knowledge instance with OpenSearch backend.
+        This factory creates a Knowledge instance configured for OpenSearch.
+        The returned instance is tenant-agnostic - multi-tenancy should be
+        handled by using separate indices per account.
         Args:
             host: OpenSearch host.
             port: OpenSearch port.
@@ -147,6 +186,19 @@ class Knowledge:
         Note:
             Embeddings are generated automatically by OpenSearch ingest pipeline.
             Run 'gnosisllm-knowledge setup' to configure the ML model.
+        Example:
+            ```python
+            # Create a Knowledge instance
+            knowledge = Knowledge.from_opensearch(
+                host="localhost",
+                port=9200,
+            )
+            # Use tenant-specific index for isolation
+            tenant_index = f"gnosisllm-{account_id}-knowledge"
+            await knowledge.load(source, index_name=tenant_index)
+            ```
         """
         # Import OpenSearch client
         try:
@@ -216,15 +268,29 @@ class Knowledge:
     def from_env(cls) -> Knowledge:
         """Create Knowledge instance from environment variables.
+        This factory creates a Knowledge instance using configuration from
+        environment variables. The returned instance is tenant-agnostic -
+        multi-tenancy should be handled by using separate indices per account.
         Returns:
             Configured Knowledge instance.
+        Example:
+            ```python
+            # Create from environment
+            knowledge = Knowledge.from_env()
+            # Use tenant-specific index for isolation
+            tenant_index = f"gnosisllm-{account_id}-knowledge"
+            await knowledge.search("query", index_name=tenant_index)
+            ```
         """
         config = OpenSearchConfig.from_env()
         neoreader_config = NeoreaderConfig.from_env()
         return cls.from_opensearch(
             config=config,
-            neoreader_url=neoreader_config.base_url if neoreader_config.base_url else None,
+            neoreader_url=neoreader_config.host if neoreader_config.host else None,
         )
     @property
@@ -318,7 +384,6 @@ class Knowledge:
         source: str,
         *,
         index_name: str | None = None,
-        account_id: str | None = None,
         collection_id: str | None = None,
         source_id: str | None = None,
         source_type: str | None = None,
@@ -329,10 +394,13 @@ class Knowledge:
         Automatically detects source type (sitemap, website, etc.).
+        Note:
+            This method is tenant-agnostic. Multi-tenancy should be handled
+            by using separate indices per account.
         Args:
             source: Source URL or path.
-            index_name: Target index (uses default if not provided).
-            account_id: Account ID for multi-tenancy.
+            index_name: Target index (use tenant-specific name for isolation).
             collection_id: Collection ID.
             source_id: Source ID (auto-generated if not provided).
             source_type: Explicit source type (auto-detected if not provided).
@@ -366,7 +434,6 @@ class Knowledge:
         return await service.load_and_index(
             source=source,
             index_name=index,
-            account_id=account_id,
             collection_id=collection_id,
             source_id=source_id,
             **options,
@@ -377,7 +444,6 @@ class Knowledge:
         source: str,
         *,
         index_name: str | None = None,
-        account_id: str | None = None,
         collection_id: str | None = None,
         collection_name: str | None = None,
         source_id: str | None = None,
@@ -398,10 +464,13 @@ class Knowledge:
         - Document storage: O(index_batch_size)
         - In-flight fetches: O(fetch_concurrency * avg_page_size)
+        Note:
+            This method is tenant-agnostic. Multi-tenancy should be handled
+            by using separate indices per account.
         Args:
             source: Sitemap URL.
-            index_name: Target index (uses default if not provided).
-            account_id: Account ID for multi-tenancy.
+            index_name: Target index (use tenant-specific name for isolation).
             collection_id: Collection ID.
             collection_name: Collection name for display.
             source_id: Source ID (auto-generated if not provided).
@@ -419,6 +488,7 @@ class Knowledge:
             # Efficiently load 100k+ URL sitemap
             result = await knowledge.load_streaming(
                 "https://large-site.com/sitemap.xml",
+                index_name="knowledge-account123",  # Tenant-specific
                 url_batch_size=100,
                 fetch_concurrency=20,
                 max_urls=50000,
@@ -454,7 +524,6 @@ class Knowledge:
         return await pipeline.execute(
             source=source,
             index_name=index,
-            account_id=account_id,
             collection_id=collection_id,
             collection_name=collection_name,
             source_id=source_id,
@@ -471,7 +540,6 @@ class Knowledge:
         mode: SearchMode = SearchMode.HYBRID,
         limit: int = 10,
         offset: int = 0,
-        account_id: str | None = None,
         collection_ids: list[str] | None = None,
         source_ids: list[str] | None = None,
         min_score: float | None = None,
@@ -479,13 +547,16 @@ class Knowledge:
     ) -> SearchResult:
         """Search for knowledge documents.
+        Note:
+            This method is tenant-agnostic. Multi-tenancy should be handled
+            by using separate indices per account.
         Args:
             query: Search query text.
-            index_name: Index to search (uses default if not provided).
+            index_name: Index to search (use tenant-specific name for isolation).
             mode: Search mode (semantic, keyword, hybrid).
             limit: Maximum results.
             offset: Result offset for pagination.
-            account_id: Account ID for multi-tenancy.
             collection_ids: Filter by collection IDs.
             source_ids: Filter by source IDs.
             min_score: Minimum score threshold.
@@ -500,7 +571,6 @@ class Knowledge:
             mode=mode,
             limit=limit,
             offset=offset,
-            account_id=account_id,
             collection_ids=collection_ids,
             source_ids=source_ids,
             min_score=min_score,
@@ -578,19 +648,73 @@ class Knowledge:
     # === Management Methods ===
+    async def get_document(
+        self,
+        document_id: str,
+        *,
+        index_name: str | None = None,
+    ) -> dict[str, Any] | None:
+        """Get a single document by ID.
+        Note:
+            This method is tenant-agnostic. Multi-tenancy should be handled
+            by using separate indices per account.
+        Args:
+            document_id: Document ID to retrieve.
+            index_name: Index name (use tenant-specific name for isolation).
+                Uses default index if not provided.
+        Returns:
+            Document dict with all fields (excluding embeddings) or None if not found.
+        """
+        index = index_name or self._default_index
+        if not index:
+            raise ValueError("No index specified and no default index configured")
+        return await self._indexer.get(document_id, index)
+    async def delete_document(
+        self,
+        document_id: str,
+        *,
+        index_name: str | None = None,
+    ) -> bool:
+        """Delete a single document by ID.
+        Note:
+            This method is tenant-agnostic. Multi-tenancy should be handled
+            by using separate indices per account.
+        Args:
+            document_id: Document ID to delete.
+            index_name: Index name (use tenant-specific name for isolation).
+                Uses default index if not provided.
+        Returns:
+            True if deleted, False if not found.
+        """
+        index = index_name or self._default_index
+        if not index:
+            raise ValueError("No index specified and no default index configured")
+        return await self._indexer.delete(document_id, index)
     async def delete_source(
         self,
         source_id: str,
         *,
         index_name: str | None = None,
-        account_id: str | None = None,
     ) -> int:
         """Delete all documents from a source.
+        Note:
+            This method is tenant-agnostic. Multi-tenancy should be handled
+            by using separate indices per account.
         Args:
             source_id: Source ID to delete.
-            index_name: Index name.
-            account_id: Account ID for multi-tenancy.
+            index_name: Index name (use tenant-specific name for isolation).
         Returns:
             Count of deleted documents.
@@ -599,21 +723,23 @@ class Knowledge:
         if not index:
             raise ValueError("No index specified")
-        return await self.indexing.delete_source(source_id, index, account_id)
+        return await self.indexing.delete_source(source_id, index)
     async def delete_collection(
         self,
         collection_id: str,
         *,
         index_name: str | None = None,
-        account_id: str | None = None,
     ) -> int:
         """Delete all documents from a collection.
+        Note:
+            This method is tenant-agnostic. Multi-tenancy should be handled
+            by using separate indices per account.
         Args:
             collection_id: Collection ID to delete.
-            index_name: Index name.
-            account_id: Account ID for multi-tenancy.
+            index_name: Index name (use tenant-specific name for isolation).
         Returns:
             Count of deleted documents.
@@ -622,54 +748,85 @@ class Knowledge:
         if not index:
             raise ValueError("No index specified")
-        return await self.indexing.delete_collection(collection_id, index, account_id)
+        return await self.indexing.delete_collection(collection_id, index)
     async def count(
         self,
         *,
         index_name: str | None = None,
-        account_id: str | None = None,
         collection_id: str | None = None,
+        source_id: str | None = None,
     ) -> int:
         """Count documents.
+        Note:
+            This method is tenant-agnostic. Multi-tenancy should be handled
+            by using separate indices per account.
         Args:
-            index_name: Index to count.
-            account_id: Filter by account.
+            index_name: Index to count (use tenant-specific name for isolation).
             collection_id: Filter by collection.
+            source_id: Filter by source (for source deletion confirmation).
         Returns:
             Document count.
         """
         return await self.search_service.count(
             index_name=index_name,
-            account_id=account_id,
             collection_id=collection_id,
+            source_id=source_id,
         )
     # === Collection and Stats Methods ===
-    async def get_collections(self) -> list[dict[str, Any]]:
+    async def get_collections(
+        self,
+        *,
+        index_name: str | None = None,
+    ) -> list[dict[str, Any]]:
         """Get all collections with document counts.
         Aggregates unique collection_ids from indexed documents.
+        Note:
+            This method is tenant-agnostic. Multi-tenancy should be handled
+            by using separate indices per account.
+        Args:
+            index_name: Index to query (use tenant-specific name for isolation).
+                Uses default index if not provided.
         Returns:
             List of collection dictionaries with id, name, and document_count.
         """
-        return await self.search_service.get_collections()
+        index = index_name or self._default_index
+        return await self.search_service.get_collections(index_name=index)
-    async def get_stats(self) -> dict[str, Any]:
+    async def get_stats(
+        self,
+        *,
+        index_name: str | None = None,
+    ) -> dict[str, Any]:
         """Get index statistics.
+        Note:
+            This method is tenant-agnostic. Multi-tenancy should be handled
+            by using separate indices per account.
+        Args:
+            index_name: Index to query (use tenant-specific name for isolation).
+                Uses default index if not provided.
         Returns:
             Dictionary with document_count, index_name, and other stats.
         """
-        return await self.search_service.get_stats()
+        index = index_name or self._default_index
+        return await self.search_service.get_stats(index_name=index)
     async def list_documents(
         self,
         *,
+        index_name: str | None = None,
         source_id: str | None = None,
         collection_id: str | None = None,
         limit: int = 50,
@@ -677,7 +834,13 @@ class Knowledge:
     ) -> dict[str, Any]:
         """List documents with optional filters.
+        Note:
+            This method is tenant-agnostic. Multi-tenancy should be handled
+            by using separate indices per account.
         Args:
+            index_name: Index to query (use tenant-specific name for isolation).
+                Uses default index if not provided.
             source_id: Optional source ID filter.
             collection_id: Optional collection ID filter.
             limit: Maximum documents to return (max 100).
@@ -686,9 +849,9 @@ class Knowledge:
         Returns:
             Dictionary with documents, total, limit, offset.
         """
-        index = self._default_index
+        index = index_name or self._default_index
         if not index:
-            raise ValueError("No default index configured")
+            raise ValueError("No index specified and no default index configured")
         # Clamp limit to reasonable bounds
         limit = min(max(1, limit), 100)
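
Several methods in the hunks above share two small patterns: falling back from `index_name` to the default index, and clamping `limit` to the range 1 to 100. The sketch below isolates both so their edge cases are easy to see. `resolve_index` and `clamp_limit` are hypothetical standalone functions, not names from the package.

```python
from __future__ import annotations


def resolve_index(index_name: str | None, default_index: str | None) -> str:
    """Return the explicit index if given, else the default, else raise."""
    # `or` treats an empty string like None, so "" also falls back to the default.
    index = index_name or default_index
    if not index:
        raise ValueError("No index specified and no default index configured")
    return index


def clamp_limit(limit: int, *, lower: int = 1, upper: int = 100) -> int:
    """Clamp a page size into [lower, upper], as list_documents does."""
    return min(max(lower, limit), upper)
```

One consequence of the `or` fallback is that a caller who passes an empty `index_name` silently gets the default index. In an index-per-tenant deployment the calling application may prefer to validate the name before it reaches this point.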
@@ -823,6 +986,33 @@ class Knowledge:
         return await agentic_searcher.agentic_search(agentic_query, index, **options)
     async def close(self) -> None:
-        """Close connections and clean up resources."""
-        # Subclasses or future implementations can override this
-        pass
+        """Close connections and clean up resources.
+        Closes the underlying AsyncOpenSearch client to prevent
+        unclosed aiohttp session warnings. Properly handles
+        CancelledError during event loop shutdown.
+        """
+        import asyncio
+        # Close the OpenSearch client via the searcher
+        # Note: indexer, searcher, and setup share the same client instance,
+        # so closing via searcher is sufficient
+        if hasattr(self._searcher, '_client') and self._searcher._client is not None:
+            client = self._searcher._client
+            try:
+                await client.close()
+                logger.debug("Closed OpenSearch client connection")
+            except asyncio.CancelledError:
+                # Event loop is shutting down - this is expected during cleanup
+                logger.debug("OpenSearch client close cancelled (event loop shutting down)")
+            except Exception as e:
+                logger.warning(f"Error closing OpenSearch client: {e}")
+            finally:
+                # Clear client reference on all components that share it
+                # This prevents any accidental reuse after close
+                if hasattr(self._searcher, '_client'):
+                    self._searcher._client = None
+                if hasattr(self._indexer, '_client'):
+                    self._indexer._client = None
+                if self._setup and hasattr(self._setup, '_client'):
+                    self._setup._client = None
