cat-llm 0.0.51__tar.gz → 0.0.53__tar.gz

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: cat-llm
- Version: 0.0.51
+ Version: 0.0.53
  Summary: A tool for categorizing text data and images using LLMs and vision models
  Project-URL: Documentation, https://github.com/chrissoria/cat-llm#readme
  Project-URL: Issues, https://github.com/chrissoria/cat-llm/issues
@@ -44,6 +44,7 @@ Description-Content-Type: text/markdown
  - [multi_class()](#multi_class)
  - [image_score()](#image_score)
  - [image_features()](#image_features)
+ - [build_web_research_dataset()](#build_web_research_dataset)
  - [cerad_drawn_score()](#cerad_drawn_score)
  - [Academic Research](#academic-research)
  - [License](#license)
@@ -344,6 +345,60 @@ image_scores = cat.image_features(
  api_key="OPENAI_API_KEY")
  ```
 
+ ### `build_web_research_dataset()`
+
+ Conducts automated web research on specified topics and compiles the findings into a structured dataset, extracting answers and source URLs for comprehensive research workflows.
+
+ NOTE: This function currently only works with Anthropic models and requires an Anthropic API key. It is strongly recommended to increase your API rate limits before using this function to avoid interruptions during web research tasks.
+
+ SECOND NOTE: This function works best if you are specific with your search question. For example, instead of search_question="Hottest temperature in 2024?" you should use "Hottest temperature in 2024 from extremeweatherwatch.com?" or "Hottest temperature in 2024 from weatherundeground.com?". Another example is use "Where these UC Berkeley professors got their PhD according to Linkedin?" instead of "Where they got their PhD according to Linkedin?" to avoid matching people with the same name.
+
+ THIRD NOTE: This function works by scraping data from the web. Be aware that not all websites allow webscraping from Anthropic and therefore the function won't be able to retrieve information from these sites.
+
+ **Methodology:**
+ Performs systematic web searches using the specified search questions and processes the results through Anthropic's language models to extract relevant information. The function handles multiple search queries sequentially, applying time delays between requests to respect rate limits. Results are categorized according to user-defined criteria and can be exported to CSV format for further analysis and research documentation.
+
+ **Rate Limits:**
+ Before using this function, review and increase your Anthropic API rate limits at: https://console.anthropic.com/settings/limits. For general information about API rate limits, consult the Anthropic documentation at: https://docs.anthropic.com/claude/reference/rate-limits
+
+ **Parameters:**
+ - `search_question` (str): Primary research question or topic to guide the search strategy
+ - `search_input` (list): List of specific search queries or questions to investigate
+ - `features_to_extract` (list): List of specific features to extract (e.g., ["number of people", "primary color", "contains text"])
+ - `api_key` (str): API key for the LLM service
+ - `answer_format`: (str, default="concise"): Response detail level ("concise", "detailed", "comprehensive")
+ - `additional_instructions` (str, default="claude-3-7-sonnet-20250219"): Specific Anthropic model to use for processing results
+ - `user_model` (str, default="gpt-4o"): Specific vision model to use
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+ - `safety` (bool, default=False): Enable safety checks and save results at each API call step
+ - `filename` (str, default="categorized_data.csv"): Filename for CSV output
+ - `save_directory` (str, optional): Directory path to save the CSV file
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+ - `time_delay` (int, default=15): Delay in seconds between search requests to manage API rate limits
+
+ **Returns:**
+ - `pandas.DataFrame`: DataFrame with image paths and extracted feature values for each specified attribute[1][4]
+
+ **Example:**
+
+ ```
+ import catllm as cat
+
+ research_data = cat.build_web_research_dataset(
+ search_question="What are the latest developments in renewable energy technology?",
+ search_input=["solar panel efficiency 2025", "wind turbine innovations", "battery storage breakthroughs"],
+ api_key="ANTHROPIC_API_KEY",
+ answer_format="detailed",
+ additional_instructions="Focus on recent technological advances and commercial applications",
+ categories=['Answer', 'URL', 'Date', 'Key_Technology'],
+ model_source="Anthropic",
+ user_model="claude-3-7-sonnet-20250219",
+ creativity=0.1,
+ safety=True,
+ time_delay=3
+ )
+ ```
+
  ### `cerad_drawn_score()`
 
  Automatically scores drawings of circles, diamonds, overlapping rectangles, and cubes according to the official Consortium to Establish a Registry for Alzheimer's Disease (CERAD) scoring system, returning structured results with optional CSV export. Works even with images that contain other drawings or writing.
@@ -19,6 +19,7 @@
  - [multi_class()](#multi_class)
  - [image_score()](#image_score)
  - [image_features()](#image_features)
+ - [build_web_research_dataset()](#build_web_research_dataset)
  - [cerad_drawn_score()](#cerad_drawn_score)
  - [Academic Research](#academic-research)
  - [License](#license)
@@ -319,6 +320,60 @@ image_scores = cat.image_features(
  api_key="OPENAI_API_KEY")
  ```
 
+ ### `build_web_research_dataset()`
+
+ Conducts automated web research on specified topics and compiles the findings into a structured dataset, extracting answers and source URLs for comprehensive research workflows.
+
+ NOTE: This function currently only works with Anthropic models and requires an Anthropic API key. It is strongly recommended to increase your API rate limits before using this function to avoid interruptions during web research tasks.
+
+ SECOND NOTE: This function works best if you are specific with your search question. For example, instead of search_question="Hottest temperature in 2024?" you should use "Hottest temperature in 2024 from extremeweatherwatch.com?" or "Hottest temperature in 2024 from weatherundeground.com?". Another example is use "Where these UC Berkeley professors got their PhD according to Linkedin?" instead of "Where they got their PhD according to Linkedin?" to avoid matching people with the same name.
+
+ THIRD NOTE: This function works by scraping data from the web. Be aware that not all websites allow webscraping from Anthropic and therefore the function won't be able to retrieve information from these sites.
+
+ **Methodology:**
+ Performs systematic web searches using the specified search questions and processes the results through Anthropic's language models to extract relevant information. The function handles multiple search queries sequentially, applying time delays between requests to respect rate limits. Results are categorized according to user-defined criteria and can be exported to CSV format for further analysis and research documentation.
+
+ **Rate Limits:**
+ Before using this function, review and increase your Anthropic API rate limits at: https://console.anthropic.com/settings/limits. For general information about API rate limits, consult the Anthropic documentation at: https://docs.anthropic.com/claude/reference/rate-limits
+
+ **Parameters:**
+ - `search_question` (str): Primary research question or topic to guide the search strategy
+ - `search_input` (list): List of specific search queries or questions to investigate
+ - `features_to_extract` (list): List of specific features to extract (e.g., ["number of people", "primary color", "contains text"])
+ - `api_key` (str): API key for the LLM service
+ - `answer_format`: (str, default="concise"): Response detail level ("concise", "detailed", "comprehensive")
+ - `additional_instructions` (str, default="claude-3-7-sonnet-20250219"): Specific Anthropic model to use for processing results
+ - `user_model` (str, default="gpt-4o"): Specific vision model to use
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+ - `safety` (bool, default=False): Enable safety checks and save results at each API call step
+ - `filename` (str, default="categorized_data.csv"): Filename for CSV output
+ - `save_directory` (str, optional): Directory path to save the CSV file
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+ - `time_delay` (int, default=15): Delay in seconds between search requests to manage API rate limits
+
+ **Returns:**
+ - `pandas.DataFrame`: DataFrame with image paths and extracted feature values for each specified attribute[1][4]
+
+ **Example:**
+
+ ```
+ import catllm as cat
+
+ research_data = cat.build_web_research_dataset(
+ search_question="What are the latest developments in renewable energy technology?",
+ search_input=["solar panel efficiency 2025", "wind turbine innovations", "battery storage breakthroughs"],
+ api_key="ANTHROPIC_API_KEY",
+ answer_format="detailed",
+ additional_instructions="Focus on recent technological advances and commercial applications",
+ categories=['Answer', 'URL', 'Date', 'Key_Technology'],
+ model_source="Anthropic",
+ user_model="claude-3-7-sonnet-20250219",
+ creativity=0.1,
+ safety=True,
+ time_delay=3
+ )
+ ```
+
  ### `cerad_drawn_score()`
 
  Automatically scores drawings of circles, diamonds, overlapping rectangles, and cubes according to the official Consortium to Establish a Registry for Alzheimer's Disease (CERAD) scoring system, returning structured results with optional CSV export. Works even with images that contain other drawings or writing.
@@ -1,7 +1,7 @@
  # SPDX-FileCopyrightText: 2025-present Christopher Soria <chrissoria@berkeley.edu>
  #
  # SPDX-License-Identifier: MIT
- __version__ = "0.0.51"
+ __version__ = "0.0.53"
  __author__ = "Chris Soria"
  __email__ = "chrissoria@berkeley.edu"
  __title__ = "cat-llm"
@@ -6,13 +6,13 @@ def build_web_research_dataset(
  answer_format = "concise",
  additional_instructions = "",
  categories = ['Answer','URL'],
- user_model="claude-3-7-sonnet-20250219",
+ user_model="claude-sonnet-4-20250514",
  creativity=0,
  safety=False,
  filename="categorized_data.csv",
  save_directory=None,
  model_source="Anthropic",
- time_delay=5
+ time_delay=15
  ):
  import os
  import json
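The hunk above changes two keyword defaults, `user_model` and `time_delay`, so callers that relied on the 0.0.51 defaults get different behavior after upgrading. As a minimal sketch (the names `OLD_DEFAULTS`, `NEW_DEFAULTS`, and `pinned_kwargs` are hypothetical and not part of cat-llm), one way to pin the earlier values explicitly is to keep them in a dictionary and pass them as keyword arguments:

```python
# Defaults copied from the removed (-) and added (+) lines of the hunk above.
OLD_DEFAULTS = {"user_model": "claude-3-7-sonnet-20250219", "time_delay": 5}
NEW_DEFAULTS = {"user_model": "claude-sonnet-4-20250514", "time_delay": 15}


def pinned_kwargs(overrides=None):
    """Return keyword arguments reproducing the 0.0.51 defaults.

    Hypothetical helper: any key in `overrides` replaces the pinned value.
    """
    kwargs = dict(OLD_DEFAULTS)
    kwargs.update(overrides or {})
    return kwargs


# Which defaults differ between the two versions?
changed = sorted(k for k in NEW_DEFAULTS if NEW_DEFAULTS[k] != OLD_DEFAULTS[k])
print(changed)  # ['time_delay', 'user_model']

# Keep the old model but choose a custom delay.
print(pinned_kwargs({"time_delay": 10}))
```

The resulting dictionary could then be unpacked into a call such as `cat.build_web_research_dataset(..., **pinned_kwargs())`, assuming the remaining arguments are supplied as in the README example.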
@@ -36,7 +36,7 @@ def build_web_research_dataset(
  extracted_jsons = []
 
  for idx, item in enumerate(tqdm(search_input, desc="Building dataset")):
- if idx == 0: # delay the first item just to be safe
+ if idx > 0: # Skip delay for first item only
  time.sleep(time_delay)
  reply = None
 
@@ -88,13 +88,11 @@ def build_web_research_dataset(
  if getattr(block, "type", "") == "text"
  ).strip()
  link1.append(reply)
- time.sleep(time_delay)
  print(reply)
 
  except Exception as e:
  print(f"An error occurred: {e}")
  link1.append(f"Error processing input: {e}")
- time.sleep(time_delay)
  else:
  raise ValueError("Unknown source! Currently this function only supports 'Anthropic' as model_source.")
  # in situation that no JSON is found
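Taken together, the loop change (`idx == 0` to `idx > 0`) and the two removed `time.sleep(time_delay)` calls alter how much time the function spends waiting. The sketch below is a simplified model of the two schedules, not the package's code: it assumes no other sleeps exist in the parts of the function not shown in this diff, it makes no network requests, and the names `run_old`, `run_new`, and `total_delay` are hypothetical.

```python
from typing import Callable, List


def run_old(n_items: int, time_delay: float, sleep: Callable[[float], None]) -> None:
    """Old schedule as read from the removed (-) lines."""
    for idx in range(n_items):
        if idx == 0:
            sleep(time_delay)  # delay only before the first item
        # ... the API request would happen here ...
        sleep(time_delay)      # delay after each reply or handled error


def run_new(n_items: int, time_delay: float, sleep: Callable[[float], None]) -> None:
    """New schedule as read from the added (+) lines."""
    for idx in range(n_items):
        if idx > 0:
            sleep(time_delay)  # delay before every item except the first
        # ... the API request would happen here ...


def total_delay(schedule, n_items: int, time_delay: float) -> float:
    """Sum the requested delays without actually sleeping."""
    recorded: List[float] = []
    schedule(n_items, time_delay, recorded.append)
    return sum(recorded)


# Compare total waiting time using each version's default time_delay (5 vs 15).
for n in (1, 3, 10):
    print(n, total_delay(run_old, n, 5), total_delay(run_new, n, 15))
# 1 10 0
# 3 20 30
# 10 55 135
```

Under these assumptions, a single-item run no longer waits at all, while longer runs wait more in total because the default delay tripled even though the number of sleeps per run dropped from n + 1 to n - 1.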
File without changes