intentkit 0.6.19.dev2__py3-none-any.whl → 0.6.20.dev1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release. This version of intentkit might be problematic.
- intentkit/__init__.py +1 -1
- intentkit/skills/firecrawl/README.md +11 -5
- intentkit/skills/firecrawl/schema.json +1 -1
- intentkit/skills/firecrawl/scrape.py +145 -30
- {intentkit-0.6.19.dev2.dist-info → intentkit-0.6.20.dev1.dist-info}/METADATA +1 -1
- {intentkit-0.6.19.dev2.dist-info → intentkit-0.6.20.dev1.dist-info}/RECORD +8 -8
- {intentkit-0.6.19.dev2.dist-info → intentkit-0.6.20.dev1.dist-info}/WHEEL +0 -0
- {intentkit-0.6.19.dev2.dist-info → intentkit-0.6.20.dev1.dist-info}/licenses/LICENSE +0 -0
intentkit/__init__.py
CHANGED

intentkit/skills/firecrawl/README.md
CHANGED

@@ -5,18 +5,22 @@ The Firecrawl skills provide advanced web scraping and content indexing capabili
 ## Skills Overview
 
 ### 1. firecrawl_scrape
-Scrapes a single webpage and
+Scrapes a single webpage and REPLACES any existing indexed content for that URL, preventing duplicates.
 
 **Parameters:**
 - `url` (required): The URL to scrape
-- `formats` (optional): Output formats - markdown, html, rawHtml, screenshot, links,
+- `formats` (optional): Output formats - markdown, html, rawHtml, screenshot, links, json (default: ["markdown"])
+- `only_main_content` (optional): Extract only main content (default: true)
 - `include_tags` (optional): HTML tags to include (e.g., ["h1", "h2", "p"])
 - `exclude_tags` (optional): HTML tags to exclude
-- `
+- `wait_for` (optional): Wait time in milliseconds before scraping
+- `timeout` (optional): Maximum timeout in milliseconds (default: 30000)
 - `index_content` (optional): Whether to index content for querying (default: true)
 - `chunk_size` (optional): Size of text chunks for indexing (default: 1000)
 - `chunk_overlap` (optional): Overlap between chunks (default: 200)
 
+**Use Case:** Use this when you want to refresh/update content from a URL that was previously scraped, ensuring no duplicate or stale content remains.
+
 ### 2. firecrawl_crawl
 Crawls multiple pages from a website and indexes all content.
 
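For orientation, the parameters documented above combine into a single scrape request along these lines (an illustrative sketch; names and defaults are taken from the README's parameter list, and the plain-dict shape is not the package's actual call signature):

```python
# Illustrative firecrawl_scrape parameter set, using the defaults documented
# in the README hunk above. Only "url" is required; everything else is optional.
scrape_params = {
    "url": "https://example.com/docs",  # required: the URL to scrape
    "formats": ["markdown"],            # default: ["markdown"]
    "only_main_content": True,          # default: true (new in this release)
    "include_tags": ["h1", "h2", "p"],  # optional tag whitelist
    "wait_for": 2000,                   # optional: ms to wait before scraping
    "timeout": 30000,                   # default: 30000 ms
    "index_content": True,              # default: true, index for later querying
    "chunk_size": 1000,                 # default: 1000
    "chunk_overlap": 200,               # default: 200
}
```

With this release, re-running the skill against the same `url` replaces the previously indexed chunks instead of appending duplicates.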
@@ -158,8 +162,9 @@ Prompt: "Use firecrawl_scrape to scrape https://example.com and index the conten
 ### Documentation Indexing
 ```
 1. Scrape main documentation page
-2. Crawl related documentation sections
-3.
+2. Crawl related documentation sections
+3. Use scrape again to update changed pages (replaces old content)
+4. Query for specific technical information
 ```
 
 ### Competitive Analysis
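Mapped onto the skills this README lists, the updated workflow might run as follows (a hypothetical sequence; the skill names are the README's own, while the tuple shape and URLs are illustrative only):

```python
# Hypothetical call sequence for the documentation-indexing workflow above.
# Skill names come from this README; the (skill, params) form is illustrative.
workflow = [
    ("firecrawl_scrape", {"url": "https://docs.example.com"}),        # 1. main page
    ("firecrawl_crawl", {"url": "https://docs.example.com/guides"}),  # 2. related sections
    ("firecrawl_scrape", {"url": "https://docs.example.com"}),        # 3. re-scrape: stale chunks replaced
    ("firecrawl_query_indexed_content", {"query": "rate limits"}),    # 4. semantic query
]
```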
@@ -205,6 +210,7 @@ Prompt: "Use firecrawl_scrape to scrape https://example.com and index the conten
 - **PDF Support**: Can scrape and index PDF documents
 - **Intelligent Chunking**: Optimized text splitting for better search
 - **Independent Storage**: Uses its own dedicated vector store for Firecrawl content
+- **Content Replacement**: Replace mode prevents duplicate/stale content
 - **Metadata Rich**: Includes source URLs, timestamps, and content types
 - **Semantic Search**: Uses OpenAI embeddings for intelligent querying
 - **Batch Processing**: Efficient handling of multiple pages
intentkit/skills/firecrawl/schema.json
CHANGED

@@ -34,7 +34,7 @@
         "Agent Owner + All Users",
         "Agent Owner Only"
       ],
-      "description": "Scrape single web pages and
+      "description": "Scrape single web pages and REPLACE any existing indexed content for that URL. Unlike regular scrape, this prevents duplicate content when re-scraping the same page. Use this to refresh/update content from a previously scraped URL.",
       "default": "private"
     },
     "firecrawl_crawl": {
intentkit/skills/firecrawl/scrape.py
CHANGED

@@ -62,10 +62,11 @@ class FirecrawlScrapeInput(BaseModel):
 
 
 class FirecrawlScrape(FirecrawlBaseTool):
-    """Tool for scraping web pages using Firecrawl.
+    """Tool for scraping web pages using Firecrawl with REPLACE behavior.
 
-    This tool uses Firecrawl's API to scrape web pages and
-
+    This tool uses Firecrawl's API to scrape web pages and REPLACES any existing
+    indexed content for the same URL instead of appending to it. This prevents
+    duplicate content when re-scraping the same page.
 
     Attributes:
         name: The name of the tool.
@@ -75,10 +76,10 @@ class FirecrawlScrape(FirecrawlBaseTool):
 
     name: str = "firecrawl_scrape"
     description: str = (
-        "Scrape a single web page and
+        "Scrape a single web page and REPLACE any existing indexed content for that URL. "
+        "Unlike regular scrape, this tool removes old content before adding new content, preventing duplicates. "
         "This tool can handle JavaScript-rendered content, PDFs, and dynamic websites. "
-        "
-        "Use this when you need to extract clean, structured content from a specific URL."
+        "Use this when you want to refresh/update content from a URL that was previously scraped."
     )
     args_schema: Type[BaseModel] = FirecrawlScrapeInput
 
@@ -187,7 +188,7 @@ class FirecrawlScrape(FirecrawlBaseTool):
             result_data = data.get("data", {})
 
             # Format the results based on requested formats
-            formatted_result = f"Successfully scraped: {url}\n\n"
+            formatted_result = f"Successfully scraped (REPLACE mode): {url}\n\n"
 
             if "markdown" in formats and result_data.get("markdown"):
                 formatted_result += "## Markdown Content\n"
@@ -236,13 +237,16 @@ class FirecrawlScrape(FirecrawlBaseTool):
                     formatted_result += f"Language: {metadata['language']}\n"
                 formatted_result += "\n"
 
-            # Index content if requested
+            # Index content if requested - REPLACE MODE
             if index_content and result_data.get("markdown"):
                 try:
-                    # Import indexing utilities
+                    # Import indexing utilities
+                    from langchain_community.vectorstores import FAISS
+
                     from intentkit.skills.firecrawl.utils import (
+                        FirecrawlDocumentProcessor,
                         FirecrawlMetadataManager,
-
+                        FirecrawlVectorStoreManager,
                     )
 
                     # Create document from scraped content
@@ -261,38 +265,149 @@ class FirecrawlScrape(FirecrawlBaseTool):
                     # Get agent ID for indexing
                     agent_id = context.agent_id
                     if agent_id:
-                        #
-
-                            [document],
-                            agent_id,
-                            self.skill_store,
-                            chunk_size,
-                            chunk_overlap,
-                        )
-
-                        # Update metadata
+                        # Initialize managers
+                        vs_manager = FirecrawlVectorStoreManager(self.skill_store)
                         metadata_manager = FirecrawlMetadataManager(
                             self.skill_store
                         )
-
-
+
+                        # Load existing vector store
+                        existing_vector_store = await vs_manager.load_vector_store(
+                            agent_id
                         )
-
-
+
+                        # Split the new document into chunks
+                        split_docs = FirecrawlDocumentProcessor.split_documents(
+                            [document], chunk_size, chunk_overlap
                         )
 
-
-
-
+                        # Create embeddings
+                        embeddings = vs_manager.create_embeddings()
+
+                        if existing_vector_store:
+                            # Get all existing documents and filter out those from the same URL
+                            try:
+                                # Try to access documents directly if available
+                                if hasattr(
+                                    existing_vector_store, "docstore"
+                                ) and hasattr(
+                                    existing_vector_store.docstore, "_dict"
+                                ):
+                                    # Access FAISS documents directly
+                                    all_docs = list(
+                                        existing_vector_store.docstore._dict.values()
+                                    )
+                                else:
+                                    # Fallback: use a reasonable k value for similarity search
+                                    # Use a dummy query to retrieve documents
+                                    all_docs = existing_vector_store.similarity_search(
+                                        "dummy",  # Use a dummy query instead of empty string
+                                        k=1000,  # Use reasonable upper bound
+                                    )
+
+                                # Filter out documents from the same URL
+                                preserved_docs = [
+                                    doc
+                                    for doc in all_docs
+                                    if doc.metadata.get("source") != url
+                                ]
+
+                                logger.info(
+                                    f"firecrawl_scrape: Preserving {len(preserved_docs)} docs from other URLs, "
+                                    f"replacing content from {url}"
+                                )
+
+                                # Create new vector store with preserved docs + new docs
+                                if preserved_docs:
+                                    # Combine preserved and new documents
+                                    all_documents = preserved_docs + split_docs
+                                    new_vector_store = FAISS.from_documents(
+                                        all_documents, embeddings
+                                    )
+                                    formatted_result += "\n## Content Replacement\n"
+                                    formatted_result += f"Replaced existing content for URL: {url}\n"
+                                    num_preserved_urls = len(
+                                        set(
+                                            doc.metadata.get("source", "")
+                                            for doc in preserved_docs
+                                        )
+                                    )
+                                    formatted_result += f"Preserved content from {num_preserved_urls} other URLs\n"
+                                else:
+                                    # No other documents to preserve, just create from new docs
+                                    new_vector_store = FAISS.from_documents(
+                                        split_docs, embeddings
+                                    )
+                                    formatted_result += "\n## Content Replacement\n"
+                                    formatted_result += f"Created new index with content from: {url}\n"
+                            except Exception as e:
+                                logger.warning(
+                                    f"Could not preserve other URLs, creating fresh index: {e}"
+                                )
+                                # Fallback: create new store with just the new documents
+                                new_vector_store = FAISS.from_documents(
+                                    split_docs, embeddings
+                                )
+                                formatted_result += "\n## Content Replacement\n"
+                                formatted_result += f"Created fresh index with content from: {url}\n"
+                        else:
+                            # No existing store, create new one
+                            new_vector_store = FAISS.from_documents(
+                                split_docs, embeddings
+                            )
+                            formatted_result += "\n## Content Indexing\n"
+                            formatted_result += (
+                                f"Created new index with content from: {url}\n"
+                            )
+
+                        # Save the new vector store
+                        await vs_manager.save_vector_store(
+                            agent_id, new_vector_store, chunk_size, chunk_overlap
                         )
-
+
+                        # Update metadata to track all URLs
+                        # Get existing metadata to preserve other URLs
+                        metadata_key = f"indexed_urls_{agent_id}"
+                        existing_metadata = (
+                            await self.skill_store.get_agent_skill_data(
+                                agent_id, "firecrawl", metadata_key
+                            )
+                        )
+
+                        if existing_metadata and existing_metadata.get("urls"):
+                            # Remove the current URL and add it back (to update timestamp)
+                            existing_urls = [
+                                u for u in existing_metadata["urls"] if u != url
+                            ]
+                            existing_urls.append(url)
+                            updated_metadata = {
+                                "urls": existing_urls,
+                                "document_count": len(existing_urls),
+                                "source_type": "firecrawl_mixed",
+                                "indexed_at": str(len(existing_urls)),
+                            }
+                        else:
+                            # Create new metadata
+                            updated_metadata = metadata_manager.create_url_metadata(
+                                [url], [document], "firecrawl_scrape"
+                            )
+
+                        await metadata_manager.update_metadata(
+                            agent_id, updated_metadata
+                        )
+
+                        formatted_result += "\n## Content Indexing (REPLACE MODE)\n"
+                        formatted_result += "Successfully REPLACED indexed content in vector store:\n"
+                        formatted_result += f"- Chunks created: {len(split_docs)}\n"
                         formatted_result += f"- Chunk size: {chunk_size}\n"
                         formatted_result += f"- Chunk overlap: {chunk_overlap}\n"
-                        formatted_result +=
+                        formatted_result += (
+                            "- Previous content for this URL: REPLACED\n"
+                        )
                         formatted_result += "Use the 'firecrawl_query_indexed_content' skill to search this content.\n"
 
                         logger.info(
-                            f"firecrawl_scrape: Successfully
+                            f"firecrawl_scrape: Successfully replaced content for {url} with {len(split_docs)} chunks"
                         )
                     else:
                         formatted_result += "\n## Content Indexing\n"
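The heart of this hunk is a replace-on-re-scrape pattern: read every chunk out of the existing FAISS store, drop those whose `source` metadata matches the URL being re-scraped, and rebuild the index from the survivors plus the new chunks. A minimal standalone sketch of that pattern (assuming LangChain's FAISS wrapper, as the diff does; `docstore._dict` is a private attribute, which is why the released code guards it with `hasattr` and falls back to a capped `similarity_search`):

```python
from langchain_community.vectorstores import FAISS


def replace_url_content(existing_store, new_chunks, url, embeddings):
    """Rebuild a FAISS index so old chunks for `url` are replaced by `new_chunks`."""
    if existing_store is None:
        # First scrape for this agent: nothing to replace, just index.
        return FAISS.from_documents(new_chunks, embeddings)

    if hasattr(existing_store, "docstore") and hasattr(existing_store.docstore, "_dict"):
        # Private FAISS internals: yields every stored Document regardless of count.
        all_docs = list(existing_store.docstore._dict.values())
    else:
        # Fallback taken by the diff: a dummy query with a fixed k. Note that this
        # silently caps recovery at k documents (k=1000 in the released code).
        all_docs = existing_store.similarity_search("dummy", k=1000)

    # Keep chunks from other URLs; drop the stale ones for this URL.
    preserved = [d for d in all_docs if d.metadata.get("source") != url]
    return FAISS.from_documents(preserved + new_chunks, embeddings)
```

Two trade-offs are visible in the released code: `FAISS.from_documents` re-embeds the preserved chunks along with the new ones, so each re-scrape pays embedding cost proportional to the whole store; and the `except` branch falls back to a fresh index, discarding other URLs' content when document recovery fails.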
{intentkit-0.6.19.dev2.dist-info → intentkit-0.6.20.dev1.dist-info}/METADATA
CHANGED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: intentkit
-Version: 0.6.19.dev2
+Version: 0.6.20.dev1
 Summary: Intent-based AI Agent Platform - Core Package
 Project-URL: Homepage, https://github.com/crestal-network/intentkit
 Project-URL: Repository, https://github.com/crestal-network/intentkit
{intentkit-0.6.19.dev2.dist-info → intentkit-0.6.20.dev1.dist-info}/RECORD
CHANGED

@@ -1,4 +1,4 @@
-intentkit/__init__.py,sha256=
+intentkit/__init__.py,sha256=L5N8UBhOoj8vD0NB2G81lATWypQbespoMFUeoZRNITg,385
 intentkit/abstracts/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 intentkit/abstracts/agent.py,sha256=108gb5W8Q1Sy4G55F2_ZFv2-_CnY76qrBtpIr0Oxxqk,1489
 intentkit/abstracts/api.py,sha256=ZUc24vaQvQVbbjznx7bV0lbbQxdQPfEV8ZxM2R6wZWo,166
@@ -198,15 +198,15 @@ intentkit/skills/enso/abi/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJW
 intentkit/skills/enso/abi/approval.py,sha256=IsyQLFxzAttocrtCB2PhbgprA7Vqujzpxvg0hJbeJ00,9867
 intentkit/skills/enso/abi/erc20.py,sha256=IScqZhHpMt_eFfYtMXw0-w5jptkAK0xsqqUDjbWdb2s,439
 intentkit/skills/enso/abi/route.py,sha256=ng9U2RSyS5R3d-b0m5ELa4rFpaUDO9HcgSoX9P_wWZo,4746
-intentkit/skills/firecrawl/README.md,sha256=
+intentkit/skills/firecrawl/README.md,sha256=OP5rCC5aNx9A4YjgotZB-JFdBR_0qHiWmYLuA52a8Tw,7366
 intentkit/skills/firecrawl/__init__.py,sha256=QQ0I5vlUgsLRFqHO17vbq-3ERKL3nzoo2B4MFGH0Igg,3160
 intentkit/skills/firecrawl/base.py,sha256=8BqD3X6RK0RedWU-qsa5qPMpuXWTZ6NbYLSpppFK_EU,1334
 intentkit/skills/firecrawl/clear.py,sha256=mfzQg8e6sbCwSzJGN_Lqfgxt-0pvtH_dBtNSJpMQA5A,2830
 intentkit/skills/firecrawl/crawl.py,sha256=lhySK1TbxGcLAXQi1zvrp4Zdo5ghhBFvxc4mFMl5LoI,18278
 intentkit/skills/firecrawl/firecrawl.png,sha256=6GoGlIMYuIDo-TqMlZbD4QYkmxvQ7krqAa5MANumJqk,5065
 intentkit/skills/firecrawl/query.py,sha256=LZzIy-LmqyEa8cZoBm-Eoen6GRy3NJxfuQcGi54Hwp0,4364
-intentkit/skills/firecrawl/schema.json,sha256=
-intentkit/skills/firecrawl/scrape.py,sha256=
+intentkit/skills/firecrawl/schema.json,sha256=q3ynbCO1NDidHZd3Nh7TNZ6lCv6y26XW7WBrYlj-JM0,4513
+intentkit/skills/firecrawl/scrape.py,sha256=2axmz5hZVnNGvTPTi0r0WAN4MoYNQZzOFtMZd5pRgcg,20704
 intentkit/skills/firecrawl/utils.py,sha256=Ot_vEg4Z30_BY3Xbh59gb_Tu17tSCmytRw49RGAzZ88,10093
 intentkit/skills/github/README.md,sha256=SzYGJ9qSPaZl68iD8AQJGKTMLv0keQZesnSK-VhrAfs,1802
 intentkit/skills/github/__init__.py,sha256=Vva9jMtACSM_cZXy5JY0h6Q1ejR1jm-Xu3Q6PwyB72o,1471
@@ -411,7 +411,7 @@ intentkit/utils/random.py,sha256=DymMxu9g0kuQLgJUqalvgksnIeLdS-v0aRk5nQU0mLI,452
 intentkit/utils/s3.py,sha256=9trQNkKQ5VgxWsewVsV8Y0q_pXzGRvsCYP8xauyUYkg,8549
 intentkit/utils/slack_alert.py,sha256=s7UpRgyzLW7Pbmt8cKzTJgMA9bm4EP-1rQ5KXayHu6E,2264
 intentkit/utils/tx.py,sha256=2yLLGuhvfBEY5n_GJ8wmIWLCzn0FsYKv5kRNzw_sLUI,1454
-intentkit-0.6.
-intentkit-0.6.
-intentkit-0.6.
-intentkit-0.6.
+intentkit-0.6.20.dev1.dist-info/METADATA,sha256=oGwdu4cAD3dMnV6di-S4CTtXCr8vJH37NZNXn3yRqEA,6414
+intentkit-0.6.20.dev1.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+intentkit-0.6.20.dev1.dist-info/licenses/LICENSE,sha256=Bln6DhK-LtcO4aXy-PBcdZv2f24MlJFm_qn222biJtE,1071
+intentkit-0.6.20.dev1.dist-info/RECORD,,

{intentkit-0.6.19.dev2.dist-info → intentkit-0.6.20.dev1.dist-info}/WHEEL
File without changes

{intentkit-0.6.19.dev2.dist-info → intentkit-0.6.20.dev1.dist-info}/licenses/LICENSE
File without changes