PyPI - MindsDB - Versions diffs - 25.7.3.0__py3-none-any.whl → 25.8.2.0__py3-none-any.whl - Mend

MindsDB 25.7.3.0__py3-none-any.whl → 25.8.2.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of MindsDB might be problematic.

Files changed (102)

mindsdb/integrations/handlers/salesforce_handler/constants.py ADDED

@@ -0,0 +1,215 @@
+"""
+Constants for Salesforce handler.
+"""
+def get_soql_instructions(integration_name):
+    return f"""This handler executes SOQL (Salesforce Object Query Language), NOT SQL! Follow these rules strictly:
+**BASIC STRUCTURE:**
+- NO "SELECT *" - must explicitly list all fields
+  SQL: SELECT * FROM Account;
+  SOQL: SELECT Id, Name, Industry FROM Account
+- NO table aliases - use full table names only
+  SQL: SELECT a.Name FROM Account a;
+  SOQL: SELECT Name FROM Account
+- NO column aliases - field names cannot be aliased
+  SQL: SELECT Name AS CompanyName FROM Account;
+  SOQL: SELECT Name FROM Account
+- NO DISTINCT keyword - not supported in SOQL
+  SQL: SELECT DISTINCT Industry FROM Account;
+  SOQL: Not possible - use separate logic
+- NO subqueries in FROM clause - only relationship-based subqueries allowed
+  SQL: SELECT * FROM (SELECT Name FROM Account) AS AccountNames;
+  SOQL: Not supported
+- Do not use fields that are not defined in the schema or data catalog. Always reference exact field names.
+**FIELD SELECTION:**
+- Always include Id field when querying
+  CORRECT: SELECT Id, Name, Industry FROM Account
+  INCORRECT: SELECT Name, Industry FROM Account
+- Field names are case-sensitive
+  CORRECT: SELECT CreatedDate FROM Account
+  INCORRECT: SELECT createddate FROM Account
+- Use exact field names from the data catalog
+  CORRECT: SELECT CustomerPriority__c FROM Account
+  INCORRECT: SELECT customer_priority FROM Account
+**FILTERING (WHERE clause):**
+- Date/DateTime fields: Use unquoted literals in YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ format
+  CORRECT: WHERE CloseDate >= 2025-05-28
+  CORRECT: WHERE CreatedDate >= 2025-05-28T10:30:00Z
+  INCORRECT: WHERE CloseDate >= '2025-05-28'
+  INCORRECT: WHERE CreatedDate >= "2025-05-28"
+- Special date literals: TODAY, YESTERDAY, LAST_WEEK, LAST_MONTH, LAST_QUARTER, LAST_YEAR, THIS_WEEK, THIS_MONTH, THIS_QUARTER, THIS_YEAR
+  CORRECT: WHERE CreatedDate = TODAY
+  CORRECT: WHERE LastModifiedDate >= LAST_MONTH
+  CORRECT: WHERE CloseDate >= THIS_QUARTER
+- Date arithmetic (e.g., TODAY - 10) is not supported. Use literals like LAST_N_DAYS:10 instead.
+  CORRECT: WHERE CloseDate >= LAST_N_DAYS:10
+  INCORRECT: WHERE CloseDate >= TODAY - 10
+- LIKE operator: Only supports % wildcard, NO underscore (_) wildcard
+  CORRECT: WHERE Name LIKE '%Corp%'
+  CORRECT: WHERE Name LIKE 'Acme%'
+  INCORRECT: WHERE Name LIKE 'A_me%'
+- BETWEEN operator: NOT supported, use >= AND <= instead
+  SQL: WHERE CreatedDate BETWEEN '2025-01-01' AND '2025-12-31'
+  SOQL: WHERE CreatedDate >= 2025-01-01 AND CreatedDate <= 2025-12-31
+- Boolean values: Use lowercase true/false, NOT TRUE/FALSE
+  CORRECT: WHERE Active__c = true
+  CORRECT: WHERE IsDeleted = false
+  INCORRECT: WHERE Active__c = TRUE
+  INCORRECT: WHERE IsDeleted = FALSE
+- NULL values: Use lowercase null, NOT NULL
+  CORRECT: WHERE ParentId = null
+  CORRECT: WHERE Description != null
+  INCORRECT: WHERE ParentId IS NULL
+  INCORRECT: WHERE Description IS NOT NULL
+- String values: Use single quotes for strings
+  CORRECT: WHERE Industry = 'Technology'
+  CORRECT: WHERE Name = 'Acme Corp'
+  INCORRECT: WHERE Industry = "Technology"
+- Multi-select picklist fields: Use INCLUDES('value1;value2') or EXCLUDES('value1;value2')
+  CORRECT: WHERE Services__c INCLUDES ('Consulting;Support')
+  CORRECT: WHERE Services__c EXCLUDES ('Training')
+  INCORRECT: WHERE Services__c = 'Consulting'
+- Limited subquery support - only IN/NOT IN with non-correlated subqueries in WHERE clause
+  CORRECT: SELECT Id FROM Contact WHERE Id NOT IN (SELECT WhoId FROM Task)
+  INCORRECT: SELECT Id FROM Contact WHERE NOT EXISTS (SELECT 1 FROM Task WHERE WhoId = Contact.Id)
+**JOINS:**
+- NO explicit JOIN syntax supported
+  SQL: SELECT a.Name, c.FirstName FROM Account a JOIN Contact c ON a.Id = c.AccountId
+  SOQL: Not supported - use relationship traversal (not applicable in this use case)
+**AGGREGATES:**
+- NO COUNT(*) - use COUNT(Id) instead
+  SQL: SELECT COUNT(*) FROM Account
+  SOQL: SELECT COUNT(Id) FROM Account
+- Cannot mix aggregate functions with non-aggregate fields unless using GROUP BY
+  CORRECT: SELECT Industry, COUNT(Id) FROM Account GROUP BY Industry
+  CORRECT: SELECT COUNT(Id) FROM Account
+  INCORRECT: SELECT Industry, Name, COUNT(Id) FROM Account
+- NO GROUP_CONCAT or string aggregation functions
+  SQL: SELECT GROUP_CONCAT(Name) FROM Account
+  SOQL: Not supported
+- NO HAVING clause
+  SQL: SELECT Industry, COUNT(*) FROM Account GROUP BY Industry HAVING COUNT(*) > 5
+  SOQL: Not supported - filter with separate logic
+- GROUP BY has limited field type support
+  CORRECT: SELECT Industry, COUNT(Id) FROM Account GROUP BY Industry
+  INCORRECT: SELECT Description, COUNT(Id) FROM Account GROUP BY Description (textarea fields not supported)
+**FUNCTIONS:**
+- Date functions: CALENDAR_MONTH(), CALENDAR_YEAR(), CALENDAR_QUARTER(), DAY_IN_MONTH(), DAY_IN_WEEK(), DAY_IN_YEAR(), HOUR_IN_DAY(), WEEK_IN_MONTH(), WEEK_IN_YEAR()
+  CORRECT: SELECT Id, Name FROM Account WHERE CALENDAR_YEAR(CreatedDate) = 2025
+  CORRECT: SELECT Id, Name FROM Account WHERE CALENDAR_MONTH(CreatedDate) = 5
+  CORRECT: SELECT Id, Name FROM Account WHERE DAY_IN_WEEK(CreatedDate) = 2
+- NO math functions: ROUND, FLOOR, CEILING, ABS, etc.
+  SQL: SELECT ROUND(AnnualRevenue, 2) FROM Account
+  SOQL: Not supported
+- NO conditional functions: CASE WHEN, COALESCE, NULLIF, etc.
+  SQL: SELECT CASE WHEN Industry = 'Technology' THEN 'Tech' ELSE 'Other' END FROM Account
+  SOQL: Not supported
+- NO string functions except INCLUDES/EXCLUDES for multi-select picklists
+  SQL: SELECT UPPER(Name) FROM Account
+  SOQL: Not supported
+**OPERATORS:**
+- Supported: =, !=, <, >, <=, >=, LIKE, IN, NOT IN, INCLUDES, EXCLUDES
+  CORRECT: WHERE Industry = 'Technology'
+  CORRECT: WHERE AnnualRevenue >= 1000000
+  CORRECT: WHERE Industry IN ('Technology', 'Finance')
+  CORRECT: WHERE Industry NOT IN ('Government', 'Non-Profit')
+  CORRECT: WHERE Services__c INCLUDES ('Consulting')
+- NOT supported: REGEXP, BETWEEN, EXISTS, NOT EXISTS
+  SQL: WHERE Name REGEXP '^[A-Z]'
+  SOQL: Not supported
+**SORTING & LIMITING:**
+- ORDER BY: Fully supported
+  CORRECT: SELECT Id, Name FROM Account ORDER BY Name ASC
+  CORRECT: SELECT Id, Name FROM Account ORDER BY CreatedDate DESC, Name ASC
+  CORRECT: SELECT Id, Name FROM Account ORDER BY Name NULLS LAST
+- LIMIT: Maximum 2000 records, use smaller limits for better performance
+  CORRECT: SELECT Id, Name FROM Account LIMIT 100
+  CORRECT: SELECT Id, Name FROM Account LIMIT 2000
+  INCORRECT: SELECT Id, Name FROM Account LIMIT 5000
+- NO OFFSET: Not supported for pagination
+  SQL: SELECT Id, Name FROM Account LIMIT 10 OFFSET 20
+  SOQL: Not supported
+**DATA TYPES:**
+- picklist: Single-select dropdown, use = operator with string values
+  CORRECT: WHERE Industry = 'Technology'
+  CORRECT: WHERE Rating = 'Hot'
+- reference: Foreign key field, typically ends with Id
+  CORRECT: WHERE OwnerId = '00530000003OOwn'
+  CORRECT: WHERE AccountId = '0013000000UzXyz'
+- boolean: Use lowercase true/false
+  CORRECT: WHERE IsDeleted = false
+  CORRECT: WHERE Active__c = true
+- currency: Numeric field for money values
+  CORRECT: WHERE AnnualRevenue > 1000000
+  CORRECT: WHERE AnnualRevenue >= 500000.50
+- date: Date only, use YYYY-MM-DD format
+  CORRECT: WHERE LastActivityDate = 2025-05-28
+  CORRECT: WHERE SLAExpirationDate__c >= 2025-01-01
+- datetime: Date and time, use YYYY-MM-DDThh:mm:ssZ format
+  CORRECT: WHERE CreatedDate >= 2025-05-28T10:30:00Z
+  CORRECT: WHERE LastModifiedDate = 2025-05-28T00:00:00Z
+- double/int: Numeric fields
+  CORRECT: WHERE NumberOfEmployees > 100
+  CORRECT: WHERE NumberofLocations__c >= 5.5
+- string/textarea: Text fields, use single quotes
+  CORRECT: WHERE Name = 'Acme Corporation'
+  CORRECT: WHERE Description = 'Leading tech company'
+- phone/url/email: Specialized string fields, treat as strings
+  CORRECT: WHERE Phone = '555-1234'
+  CORRECT: WHERE Website = 'https://example.com'
+**COMMON MISTAKES TO AVOID:**
+- Using SELECT * (not allowed)
+  WRONG: SELECT * FROM Account
+  RIGHT: SELECT Id, Name, Industry FROM Account
+- Quoting date literals (dates must be unquoted)
+  WRONG: WHERE CreatedDate >= '2025-01-01'
+  RIGHT: WHERE CreatedDate >= 2025-01-01
+- Using SQL JOIN syntax (not supported)
+  WRONG: SELECT Account.Name FROM Account JOIN Contact ON Account.Id = Contact.AccountId
+  RIGHT: Use relationship traversal (not applicable in this use case)
+- Using BETWEEN operator (not supported)
+  WRONG: WHERE CreatedDate BETWEEN 2025-01-01 AND 2025-12-31
+  RIGHT: WHERE CreatedDate >= 2025-01-01 AND CreatedDate <= 2025-12-31
+- Using uppercase TRUE/FALSE/NULL (must be lowercase)
+  WRONG: WHERE Active__c = TRUE
+  RIGHT: WHERE Active__c = true
+- Using underscore _ in LIKE patterns (only % supported)
+  WRONG: WHERE Name LIKE 'A_me%'
+  RIGHT: WHERE Name LIKE 'A%me%'
+- Mixing aggregate and non-aggregate fields without GROUP BY
+  WRONG: SELECT Name, COUNT(Id) FROM Account
+  RIGHT: SELECT Industry, COUNT(Id) FROM Account GROUP BY Industry
+**EXAMPLE QUERIES:**
+- Basic selection: SELECT Id, Name, Industry FROM Account WHERE Industry = 'Technology'
+- Date filtering: SELECT Id, Name FROM Account WHERE CreatedDate >= 2025-01-01
+- Multiple conditions: SELECT Id, Name FROM Account WHERE Name LIKE '%Corp%' AND Industry IN ('Technology', 'Finance')
+- Aggregation: SELECT Industry, COUNT(Id) FROM Account GROUP BY Industry
+- Boolean and numeric: SELECT Id, Name FROM Account WHERE Active__c = true AND NumberOfEmployees > 100
+- Date functions: SELECT Id, Name FROM Account WHERE CALENDAR_YEAR(CreatedDate) = 2025
+- Null checks: SELECT Id, Name FROM Account WHERE ParentId = null
+- Multi-select picklist: SELECT Id, Name FROM Account WHERE Services__c INCLUDES ('Consulting;Support')
+- Sorting and limiting: SELECT Id, Name FROM Account ORDER BY Name ASC LIMIT 50
+***EXECUTION INSTRUCTIONS. IMPORTANT!***
+After generating the core SOQL (and nothing else), always make sure you wrap it exactly as:
+    SELECT *
+      FROM {integration_name}(
+        /* your generated SOQL goes here, without a trailing semicolon */
+      )
+Return only that wrapper call.
+"""
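
The execution instructions at the end of the prompt ask for the generated SOQL to be wrapped in a `SELECT * FROM <integration>(...)` call with no trailing semicolon. A minimal sketch of that wrapping step follows. The helper `wrap_soql` and the integration name `salesforce_prod` are hypothetical and are not part of the handler.

```python
def wrap_soql(integration_name: str, soql: str) -> str:
    """Wrap a SOQL statement in the passthrough call described by the prompt.

    Hypothetical helper for illustration; not part of the MindsDB handler.
    """
    # The prompt asks for the SOQL to appear without a trailing semicolon.
    core = soql.strip().rstrip(";").strip()
    return f"SELECT *\n  FROM {integration_name}(\n    {core}\n  )"


wrapped = wrap_soql(
    "salesforce_prod",  # assumed integration name
    "SELECT Id, Name FROM Account WHERE CreatedDate >= 2025-01-01;",
)
print(wrapped)
```

The date literal in the inner query is left unquoted, as the filtering rules in the prompt require.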

mindsdb/integrations/handlers/salesforce_handler/salesforce_handler.py CHANGED

@@ -11,6 +11,7 @@ from mindsdb.integrations.libs.response import (
     RESPONSE_TYPE,
 )
 from mindsdb.integrations.handlers.salesforce_handler.salesforce_tables import create_table_class
+from mindsdb.integrations.handlers.salesforce_handler.constants import get_soql_instructions
 from mindsdb.utilities import log
@@ -156,91 +157,152 @@ class SalesforceHandler(MetaAPIHandler):
     def _get_resource_names(self) -> List[str]:
         """
-        Retrieves the names of the Salesforce resources, with more aggressive filtering to remove tables.
+        Retrieves the names of the Salesforce resources with optimized pre-filtering.
         Returns:
             List[str]: A list of filtered resource names.
         """
         if not self.resource_names:
-            all_resources = [
-                resource["name"]
-                for resource in self.connection.sobjects.describe()["sobjects"]
-                if resource.get("queryable", False)
-            ]
+            # Check for user-specified table filtering first
+            include_tables = self.connection_data.get("include_tables") or self.connection_data.get("tables")
+            exclude_tables = self.connection_data.get("exclude_tables", [])
+            if include_tables:
+                # OPTIMIZATION: Skip expensive global describe() call
+                # Only validate the specified tables
+                logger.info(f"Using pre-filtered table list: {include_tables}")
+                self.resource_names = self._validate_specified_tables(include_tables, exclude_tables)
+            else:
+                # Fallback to full discovery with hard-coded filtering
+                logger.info("No table filter specified, performing full discovery...")
+                self.resource_names = self._discover_all_tables_with_filtering(exclude_tables)
-            # Define patterns for tables to be filtered out.
-            # Expanded suffixes and prefixes and exact matches
-            ignore_suffixes = ("Share", "History", "Feed", "ChangeEvent", "Tag", "Permission", "Setup", "Consent")
-            ignore_prefixes = (
-                "Apex",
-                "CommPlatform",
-                "Lightning",
-                "Flow",
-                "Transaction",
-                "AI",
-                "Aura",
-                "ContentWorkspace",
-                "Collaboration",
-                "Datacloud",
-            )
-            ignore_exact = {
-                "EntityDefinition",
-                "FieldDefinition",
-                "RecordType",
-                "CaseStatus",
-                "UserRole",
-                "UserLicense",
-                "UserPermissionAccess",
-                "UserRecordAccess",
-                "Folder",
-                "Group",
-                "Note",
-                "ProcessDefinition",
-                "ProcessInstance",
-                "ContentFolder",
-                "ContentDocumentSubscription",
-                "DashboardComponent",
-                "Report",
-                "Dashboard",
-                "Topic",
-                "TopicAssignment",
-                "Period",
-                "Partner",
-                "PackageLicense",
-                "ColorDefinition",
-                "DataUsePurpose",
-                "DataUseLegalBasis",
-            }
-            ignore_substrings = (
-                "CleanInfo",
-                "Template",
-                "Rule",
-                "Definition",
-                "Status",
-                "Policy",
-                "Setting",
-                "Access",
-                "Config",
-                "Subscription",
-                "DataType",
-                "MilestoneType",
-                "Entitlement",
-                "Auth",
-            )
-            filtered = []
-            for r in all_resources:
-                if (
-                    not r.endswith(ignore_suffixes)
-                    and not r.startswith(ignore_prefixes)
-                    and not any(sub in r for sub in ignore_substrings)
-                    and r not in ignore_exact
-                ):
-                    filtered.append(r)
-            self.resource_names = [r for r in filtered]
         return self.resource_names
+    def _validate_specified_tables(self, include_tables: List[str], exclude_tables: List[str]) -> List[str]:
+        """
+        Validate user-specified tables without expensive global describe() call.
+        Args:
+            include_tables: List of table names to include
+            exclude_tables: List of table names to exclude
+        Returns:
+            List[str]: Validated and filtered table names
+        """
+        validated_tables = []
+        for table_name in include_tables:
+            # Skip if explicitly excluded
+            if table_name in exclude_tables:
+                logger.info(f"Skipping excluded table: {table_name}")
+                continue
+            try:
+                # Quick validation: check if table exists and is queryable
+                # This is much faster than global describe()
+                metadata = getattr(self.connection.sobjects, table_name).describe()
+                if metadata.get("queryable", False):
+                    validated_tables.append(table_name)
+                    logger.debug(f"Validated table: {table_name}")
+                else:
+                    logger.warning(f"Table {table_name} is not queryable, skipping")
+            except Exception as e:
+                logger.warning(f"Table {table_name} not found or accessible: {e}")
+        logger.info(f"Validated {len(validated_tables)} tables from include_tables")
+        return validated_tables
+    def _discover_all_tables_with_filtering(self, exclude_tables: List[str]) -> List[str]:
+        """
+        Fallback method: discover all tables with hard-coded filtering.
+        Args:
+            exclude_tables: List of table names to exclude
+        Returns:
+            List[str]: Filtered table names
+        """
+        # This is the original expensive approach - only used when no include_tables specified
+        all_resources = [
+            resource["name"]
+            for resource in self.connection.sobjects.describe()["sobjects"]
+            if resource.get("queryable", False)
+        ]
+        # Apply hard-coded filtering (existing logic)
+        ignore_suffixes = ("Share", "History", "Feed", "ChangeEvent", "Tag", "Permission", "Setup", "Consent")
+        ignore_prefixes = (
+            "Apex",
+            "CommPlatform",
+            "Lightning",
+            "Flow",
+            "Transaction",
+            "AI",
+            "Aura",
+            "ContentWorkspace",
+            "Collaboration",
+            "Datacloud",
+        )
+        ignore_exact = {
+            "EntityDefinition",
+            "FieldDefinition",
+            "RecordType",
+            "CaseStatus",
+            "UserRole",
+            "UserLicense",
+            "UserPermissionAccess",
+            "UserRecordAccess",
+            "Folder",
+            "Group",
+            "Note",
+            "ProcessDefinition",
+            "ProcessInstance",
+            "ContentFolder",
+            "ContentDocumentSubscription",
+            "DashboardComponent",
+            "Report",
+            "Dashboard",
+            "Topic",
+            "TopicAssignment",
+            "Period",
+            "Partner",
+            "PackageLicense",
+            "ColorDefinition",
+            "DataUsePurpose",
+            "DataUseLegalBasis",
+        }
+        ignore_substrings = (
+            "CleanInfo",
+            "Template",
+            "Rule",
+            "Definition",
+            "Status",
+            "Policy",
+            "Setting",
+            "Access",
+            "Config",
+            "Subscription",
+            "DataType",
+            "MilestoneType",
+            "Entitlement",
+            "Auth",
+        )
+        # Apply hard-coded filtering
+        filtered = []
+        for r in all_resources:
+            if (
+                not r.endswith(ignore_suffixes)
+                and not r.startswith(ignore_prefixes)
+                and not any(sub in r for sub in ignore_substrings)
+                and r not in ignore_exact
+                and r not in exclude_tables  # Apply user exclusions
+            ):
+                filtered.append(r)
+        return filtered
     def meta_get_handler_info(self, **kwargs) -> str:
         """
         Retrieves information about the design and implementation of the API handler.
@@ -254,8 +316,7 @@ class SalesforceHandler(MetaAPIHandler):
         Returns:
             str: A string containing information about the API handler's design and implementation.
         """
-        # TODO: Relationships? Aliases?
-        return "When filtering on a Date or DateTime field, the value MUST be an unquoted literal in YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ format. For example, CloseDate >= 2025-05-28 is correct; CloseDate >= '2025-05-28' is incorrect."
+        return get_soql_instructions(self.name)
     def meta_get_tables(self, table_names: Optional[List[str]] = None) -> Response:
         """

mindsdb/integrations/handlers/salesforce_handler/salesforce_tables.py CHANGED

@@ -176,7 +176,6 @@ def create_table_class(resource_name: Text) -> MetaAPIResource:
                     "table_description": "",
                     "row_count": None,
                 }
             # Get row count if Id column is aggregatable.
             row_count = None
             # if next(field for field in resource_metadata['fields'] if field['name'] == 'Id').get('aggregatable', False):

mindsdb/integrations/handlers/tpot_handler/requirements.txt CHANGED

@@ -1,2 +1,2 @@
 tpot<=0.11.7
-type_infer==0.0.20
+type_infer==0.0.23

mindsdb/integrations/handlers/web_handler/urlcrawl_helpers.py CHANGED

@@ -100,26 +100,25 @@ def parallel_get_all_website_links(urls) -> dict:
         return url_contents
     with concurrent.futures.ProcessPoolExecutor() as executor:
-        future_to_url = {
-            executor.submit(get_all_website_links, url): url for url in urls
-        }
+        future_to_url = {executor.submit(get_all_website_links, url): url for url in urls}
         for future in concurrent.futures.as_completed(future_to_url):
             url = future_to_url[future]
             try:
                 url_contents[url] = future.result()
             except Exception as exc:
-                logger.error(f'{url} generated an exception: {exc}')
+                logger.error(f"{url} generated an exception: {exc}")
                 # don't raise the exception, just log it, continue processing other urls
     return url_contents
-def get_all_website_links(url) -> dict:
+def get_all_website_links(url, headers: dict = None) -> dict:
     """
     Fetch all website links from a URL.
     Args:
         url (str): the URL to fetch links from
+        headers (dict): a dictionary of headers to use when fetching links
     Returns:
         A dictionary containing the URL, the extracted links, the HTML content, the text content, and any error that occurred.
@@ -132,9 +131,12 @@ def get_all_website_links(url) -> dict:
         session = requests.Session()
         # Add headers to mimic a real browser request
-        headers = {
-            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
-        }
+        if headers is None:
+            headers = {}
+        if "User-Agent" not in headers:
+            headers["User-Agent"] = (
+                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.3"
+            )
         response = session.get(url, headers=headers)
         if "cookie" in response.request.headers:
@@ -157,7 +159,7 @@ def get_all_website_links(url) -> dict:
                     continue
                 href = urljoin(url, href)
                 parsed_href = urlparse(href)
-                href = urlunparse((parsed_href.scheme, parsed_href.netloc, parsed_href.path, '', '', ''))
+                href = urlunparse((parsed_href.scheme, parsed_href.netloc, parsed_href.path, "", "", ""))
                 if not is_valid(href):
                     continue
                 if href in urls:
@@ -203,7 +205,15 @@ def get_readable_text_from_soup(soup) -> str:
     return html_converter.handle(str(soup))
-def get_all_website_links_recursively(url, reviewed_urls, limit=None, crawl_depth: int = 1, current_depth: int = 0, filters: List[str] = None):
+def get_all_website_links_recursively(
+    url,
+    reviewed_urls,
+    limit=None,
+    crawl_depth: int = 1,
+    current_depth: int = 0,
+    filters: List[str] = None,
+    headers=None,
+):
     """
     Recursively gathers all links from a given website up to a specified limit.
@@ -227,7 +237,7 @@ def get_all_website_links_recursively(url, reviewed_urls, limit=None, crawl_dept
         matches_filter = any(re.match(f, url) is not None for f in filters)
     if url not in reviewed_urls and matches_filter:
         try:
-            reviewed_urls[url] = get_all_website_links(url)
+            reviewed_urls[url] = get_all_website_links(url, headers=headers)
         except Exception as e:
             error_message = traceback.format_exc().splitlines()[-1]
             logger.error("An exception occurred: %s", str(e))
@@ -271,10 +281,14 @@ def get_all_website_links_recursively(url, reviewed_urls, limit=None, crawl_dept
         reviewed_urls.update(new_revised_urls)
         for new_url in new_revised_urls:
-            get_all_website_links_recursively(new_url, reviewed_urls, limit, crawl_depth=crawl_depth, current_depth=current_depth + 1, filters=filters)
+            get_all_website_links_recursively(
+                new_url, reviewed_urls, limit, crawl_depth=crawl_depth, current_depth=current_depth + 1, filters=filters
+            )
-def get_all_websites(urls, limit=1, html=False, crawl_depth: int = 1, filters: List[str] = None) -> pd.DataFrame:
+def get_all_websites(
+    urls, limit=1, html=False, crawl_depth: int = 1, filters: List[str] = None, headers: dict = None
+) -> pd.DataFrame:
     """
     Crawl a list of websites and return a DataFrame containing the results.
@@ -284,6 +298,7 @@ def get_all_websites(urls, limit=1, html=False, crawl_depth: int = 1, filters: L
         crawl_depth (int): Crawl depth for URLs.
         html (bool): a boolean indicating whether to include the HTML content in the results
         filters (List[str]): Crawl URLs that only match these regex patterns.
+        headers (dict): headers of request
     Returns:
         A DataFrame containing the results.
@@ -299,7 +314,9 @@ def get_all_websites(urls, limit=1, html=False, crawl_depth: int = 1, filters: L
         if urlparse(url).scheme == "":
             # Try HTTPS first
             url = "https://" + url
-        get_all_website_links_recursively(url, reviewed_urls, limit, crawl_depth=crawl_depth, filters=filters)
+        get_all_website_links_recursively(
+            url, reviewed_urls, limit, crawl_depth=crawl_depth, filters=filters, headers=headers
+        )
     # Use a ThreadPoolExecutor to run the helper function in parallel.
     with concurrent.futures.ThreadPoolExecutor() as executor:
@@ -311,9 +328,7 @@ def get_all_websites(urls, limit=1, html=False, crawl_depth: int = 1, filters: L
     columns_to_ignore = ["urls"]
     if html is False:
         columns_to_ignore += ["html_content"]
-    df = dict_to_dataframe(
-        reviewed_urls, columns_to_ignore=columns_to_ignore, index_name="url"
-    )
+    df = dict_to_dataframe(reviewed_urls, columns_to_ignore=columns_to_ignore, index_name="url")
     if not df.empty and df[df.error.isna()].empty:
         raise Exception(str(df.iloc[0].error))
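
The `urlcrawl_helpers.py` change lets callers pass custom request headers and fills in `User-Agent` only when it is missing. A minimal sketch of that defaulting logic follows, written as a hypothetical standalone helper `build_request_headers`. Unlike the code in the diff, it copies the dictionary instead of adding the key to the caller's object, and it makes no network request.

```python
from typing import Dict, Optional

# Default taken from the diff above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.3"
)


def build_request_headers(headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return request headers with a default User-Agent when none is supplied.

    Hypothetical helper for illustration; not part of the web handler.
    """
    merged = dict(headers or {})  # copy so the caller's dict is left untouched
    merged.setdefault("User-Agent", DEFAULT_USER_AGENT)
    return merged


custom = {"Authorization": "Bearer example-token"}  # placeholder value
result = build_request_headers(custom)
print(result)
```

As shown in this diff, the recursive call inside `get_all_website_links_recursively` does not forward `headers`, so custom headers appear to apply only to the first level of a crawl.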
