cat-llm 0.0.52__tar.gz → 0.0.54__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,11 +1,11 @@
 Metadata-Version: 2.4
 Name: cat-llm
-Version: 0.0.52
+Version: 0.0.54
 Summary: A tool for categorizing text data and images using LLMs and vision models
 Project-URL: Documentation, https://github.com/chrissoria/cat-llm#readme
 Project-URL: Issues, https://github.com/chrissoria/cat-llm/issues
 Project-URL: Source, https://github.com/chrissoria/cat-llm
-Author-email: Christopher Soria <chrissoria@berkeley.edu>
+Author-email: Chris Soria <chrissoria@berkeley.edu>
 License-Expression: MIT
 License-File: LICENSE
 Keywords: categorizer,image classification,llm,structured output,survey data,text classification
@@ -19,7 +19,9 @@ Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: Implementation :: CPython
 Classifier: Programming Language :: Python :: Implementation :: PyPy
 Requires-Python: >=3.8
+Requires-Dist: openai
 Requires-Dist: pandas
+Requires-Dist: requests
 Requires-Dist: tqdm
 Description-Content-Type: text/markdown
 
@@ -44,6 +46,7 @@ Description-Content-Type: text/markdown
 - [multi_class()](#multi_class)
 - [image_score()](#image_score)
 - [image_features()](#image_features)
+- [build_web_research_dataset()](#build_web_research_dataset)
 - [cerad_drawn_score()](#cerad_drawn_score)
 - [Academic Research](#academic-research)
 - [License](#license)
@@ -344,6 +347,60 @@ image_scores = cat.image_features(
 api_key="OPENAI_API_KEY")
 ```
 
+### `build_web_research_dataset()`
+
+Conducts automated web research on specified topics and compiles the findings into a structured dataset, extracting answers and source URLs for comprehensive research workflows.
+
+NOTE: This function currently only works with Anthropic models and requires an Anthropic API key. It is strongly recommended to increase your API rate limits before using this function to avoid interruptions during web research tasks.
+
+SECOND NOTE: This function works best when your search question is specific. For example, instead of search_question="Hottest temperature in 2024?" use "Hottest temperature in 2024 from extremeweatherwatch.com?" or "Hottest temperature in 2024 from weatherunderground.com?". Similarly, use "Where did these UC Berkeley professors get their PhD according to LinkedIn?" instead of "Where did they get their PhD according to LinkedIn?" to avoid matching people with the same name.
+
+THIRD NOTE: This function works by scraping data from the web. Be aware that not all websites allow scraping by Anthropic, so the function will not be able to retrieve information from those sites.
+
+**Methodology:**
+Performs systematic web searches using the specified search questions and processes the results through Anthropic's language models to extract relevant information. The function handles multiple search queries sequentially, applying time delays between requests to respect rate limits. Results are categorized according to user-defined criteria and can be exported to CSV format for further analysis and research documentation.
+
+**Rate Limits:**
+Before using this function, review and increase your Anthropic API rate limits at: https://console.anthropic.com/settings/limits. For general information about API rate limits, consult the Anthropic documentation at: https://docs.anthropic.com/claude/reference/rate-limits
+
+**Parameters:**
+- `search_question` (str): Primary research question or topic to guide the search strategy
+- `search_input` (list): List of specific search queries or questions to investigate
+- `categories` (list, default=['Answer', 'URL']): Columns to extract for each result (e.g., ['Answer', 'URL', 'Date'])
+- `api_key` (str): API key for the LLM service
+- `answer_format` (str, default="concise"): Response detail level ("concise", "detailed", "comprehensive")
+- `additional_instructions` (str, default=""): Extra instructions appended to the research prompt
+- `user_model` (str, default="claude-sonnet-4-20250514"): Specific Anthropic model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `safety` (bool, default=False): Save intermediate results after each API call
+- `filename` (str, default="categorized_data.csv"): Filename for CSV output
+- `save_directory` (str, optional): Directory path to save the CSV file
+- `model_source` (str, default="Anthropic"): Model provider; per the note above, "Anthropic" is currently the supported option
+- `time_delay` (int, default=15): Delay in seconds between search requests to manage API rate limits
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with one row per search query containing the extracted values for each requested category (e.g., answer and source URL)
+
+**Example:**
+
+```
+import catllm as cat
+
+research_data = cat.build_web_research_dataset(
+    search_question="What are the latest developments in renewable energy technology?",
+    search_input=["solar panel efficiency 2025", "wind turbine innovations", "battery storage breakthroughs"],
+    api_key="ANTHROPIC_API_KEY",
+    answer_format="detailed",
+    additional_instructions="Focus on recent technological advances and commercial applications",
+    categories=['Answer', 'URL', 'Date', 'Key_Technology'],
+    model_source="Anthropic",
+    user_model="claude-3-7-sonnet-20250219",
+    creativity=0.1,
+    safety=True,
+    time_delay=3
+)
+```
+
 ### `cerad_drawn_score()`
 
 Automatically scores drawings of circles, diamonds, overlapping rectangles, and cubes according to the official Consortium to Establish a Registry for Alzheimer's Disease (CERAD) scoring system, returning structured results with optional CSV export. Works even with images that contain other drawings or writing.
@@ -19,6 +19,7 @@
 - [multi_class()](#multi_class)
 - [image_score()](#image_score)
 - [image_features()](#image_features)
+- [build_web_research_dataset()](#build_web_research_dataset)
 - [cerad_drawn_score()](#cerad_drawn_score)
 - [Academic Research](#academic-research)
 - [License](#license)
@@ -319,6 +320,60 @@ image_scores = cat.image_features(
 api_key="OPENAI_API_KEY")
 ```
 
+### `build_web_research_dataset()`
+
+Conducts automated web research on specified topics and compiles the findings into a structured dataset, extracting answers and source URLs for comprehensive research workflows.
+
+NOTE: This function currently only works with Anthropic models and requires an Anthropic API key. It is strongly recommended to increase your API rate limits before using this function to avoid interruptions during web research tasks.
+
+SECOND NOTE: This function works best when your search question is specific. For example, instead of search_question="Hottest temperature in 2024?" use "Hottest temperature in 2024 from extremeweatherwatch.com?" or "Hottest temperature in 2024 from weatherunderground.com?". Similarly, use "Where did these UC Berkeley professors get their PhD according to LinkedIn?" instead of "Where did they get their PhD according to LinkedIn?" to avoid matching people with the same name.
+
+THIRD NOTE: This function works by scraping data from the web. Be aware that not all websites allow scraping by Anthropic, so the function will not be able to retrieve information from those sites.
+
+**Methodology:**
+Performs systematic web searches using the specified search questions and processes the results through Anthropic's language models to extract relevant information. The function handles multiple search queries sequentially, applying time delays between requests to respect rate limits. Results are categorized according to user-defined criteria and can be exported to CSV format for further analysis and research documentation.
+
+**Rate Limits:**
+Before using this function, review and increase your Anthropic API rate limits at: https://console.anthropic.com/settings/limits. For general information about API rate limits, consult the Anthropic documentation at: https://docs.anthropic.com/claude/reference/rate-limits
+
+**Parameters:**
+- `search_question` (str): Primary research question or topic to guide the search strategy
+- `search_input` (list): List of specific search queries or questions to investigate
+- `categories` (list, default=['Answer', 'URL']): Columns to extract for each result (e.g., ['Answer', 'URL', 'Date'])
+- `api_key` (str): API key for the LLM service
+- `answer_format` (str, default="concise"): Response detail level ("concise", "detailed", "comprehensive")
+- `additional_instructions` (str, default=""): Extra instructions appended to the research prompt
+- `user_model` (str, default="claude-sonnet-4-20250514"): Specific Anthropic model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `safety` (bool, default=False): Save intermediate results after each API call
+- `filename` (str, default="categorized_data.csv"): Filename for CSV output
+- `save_directory` (str, optional): Directory path to save the CSV file
+- `model_source` (str, default="Anthropic"): Model provider; per the note above, "Anthropic" is currently the supported option
+- `time_delay` (int, default=15): Delay in seconds between search requests to manage API rate limits
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with one row per search query containing the extracted values for each requested category (e.g., answer and source URL)
+
+**Example:**
+
+```
+import catllm as cat
+
+research_data = cat.build_web_research_dataset(
+    search_question="What are the latest developments in renewable energy technology?",
+    search_input=["solar panel efficiency 2025", "wind turbine innovations", "battery storage breakthroughs"],
+    api_key="ANTHROPIC_API_KEY",
+    answer_format="detailed",
+    additional_instructions="Focus on recent technological advances and commercial applications",
+    categories=['Answer', 'URL', 'Date', 'Key_Technology'],
+    model_source="Anthropic",
+    user_model="claude-3-7-sonnet-20250219",
+    creativity=0.1,
+    safety=True,
+    time_delay=3
+)
+```
+
 ### `cerad_drawn_score()`
 
 Automatically scores drawings of circles, diamonds, overlapping rectangles, and cubes according to the official Consortium to Establish a Registry for Alzheimer's Disease (CERAD) scoring system, returning structured results with optional CSV export. Works even with images that contain other drawings or writing.
@@ -11,7 +11,7 @@ requires-python = ">=3.8"
 license = "MIT"
 keywords = ["llm","categorizer","survey data", "image classification", "structured output", "text classification"]
 authors = [
-  { name = "Christopher Soria", email = "chrissoria@berkeley.edu" },
+  { name = "Chris Soria", email = "chrissoria@berkeley.edu" },
 ]
 classifiers = [
   "Development Status :: 4 - Beta",
@@ -26,7 +26,9 @@ classifiers = [
 ]
 dependencies = [
   "pandas",
-  "tqdm"
+  "tqdm",
+  "requests",
+  "openai"
 ]
 
 [project.urls]
@@ -1,7 +1,7 @@
 # SPDX-FileCopyrightText: 2025-present Christopher Soria <chrissoria@berkeley.edu>
 #
 # SPDX-License-Identifier: MIT
-__version__ = "0.0.52"
+__version__ = "0.0.54"
 __author__ = "Chris Soria"
 __email__ = "chrissoria@berkeley.edu"
 __title__ = "cat-llm"
@@ -6,13 +6,13 @@ def build_web_research_dataset(
     answer_format = "concise",
     additional_instructions = "",
     categories = ['Answer','URL'],
-    user_model="claude-3-7-sonnet-20250219",
+    user_model="claude-sonnet-4-20250514",
     creativity=0,
     safety=False,
     filename="categorized_data.csv",
     save_directory=None,
     model_source="Anthropic",
-    time_delay=5
+    time_delay=15
 ):
     import os
     import json
@@ -307,6 +307,37 @@ Provide your work in JSON format where the number belonging to each category is
             except Exception as e:
                 print(f"An error occurred: {e}")
                 link1.append(f"Error processing input: {e}")
+
+        elif model_source == "Google":
+            import requests
+            url = f"https://generativelanguage.googleapis.com/v1beta/models/{user_model}:generateContent"
+            try:
+                headers = {
+                    "x-goog-api-key": api_key,
+                    "Content-Type": "application/json"
+                }
+
+                payload = {
+                    "contents": [{
+                        "parts": [{"text": prompt}]
+                    }]
+                }
+
+                response = requests.post(url, headers=headers, json=payload)
+                response.raise_for_status()  # Raise exception for HTTP errors
+                result = response.json()
+
+                if "candidates" in result and result["candidates"]:
+                    reply = result["candidates"][0]["content"]["parts"][0]["text"]
+                else:
+                    reply = "No response generated"
+
+                link1.append(reply)
+                print(reply)
+            except Exception as e:
+                print(f"An error occurred: {e}")
+                link1.append(f"Error processing input: {e}")
+
         elif model_source == "Mistral":
             from mistralai import Mistral
             client = Mistral(api_key=api_key)
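Aside from the diff itself, the new "Google" branch reduces to building a Gemini `generateContent` request and unwrapping the first candidate. Below is a minimal standalone sketch of that request/response shape. Note that `build_request` and `extract_reply` are hypothetical helper names used here for illustration, not part of cat-llm; the diff performs the HTTP call inline via `requests.post(url, headers=headers, json=payload)`.

```python
# Sketch of the request/response shape used by the new "Google" branch above.
# `build_request` and `extract_reply` are illustrative helpers, not cat-llm APIs.

GEMINI_URL = "https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent"


def build_request(model: str, api_key: str, prompt: str):
    """Return the (url, headers, payload) triple the Google branch sends."""
    url = GEMINI_URL.format(model=model)
    headers = {"x-goog-api-key": api_key, "Content-Type": "application/json"}
    payload = {"contents": [{"parts": [{"text": prompt}]}]}
    return url, headers, payload


def extract_reply(result: dict) -> str:
    """Unwrap the first candidate's text, with the same fallback the diff uses."""
    if result.get("candidates"):
        return result["candidates"][0]["content"]["parts"][0]["text"]
    return "No response generated"
```

The parsing mirrors the added code: the reply lives at `candidates[0].content.parts[0].text`, and an empty `candidates` list yields the "No response generated" fallback.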