cat-llm 0.0.32__tar.gz → 0.0.34__tar.gz

cat_llm-0.0.34/LICENSE ADDED
@@ -0,0 +1,17 @@
1
+ GNU License
2
+
3
+ CatLLM is a framework for categorizing text and images in a structured output.
4
+ Copyright (C) 2025 Christopher Soria
5
+
6
+ This program is free software: you can redistribute it and/or modify
7
+ it under the terms of the GNU General Public License as published by
8
+ the Free Software Foundation, either version 3 of the License, or
9
+ (at your option) any later version.
10
+
11
+ This program is distributed in the hope that it will be useful,
12
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ GNU General Public License for more details.
15
+
16
+ You should have received a copy of the GNU General Public License
17
+ along with this program. If not, see <https://www.gnu.org/licenses/>.
@@ -0,0 +1,399 @@
1
+ Metadata-Version: 2.4
2
+ Name: cat-llm
3
+ Version: 0.0.34
4
+ Summary: A tool for categorizing text data and images using LLMs and vision models
5
+ Project-URL: Documentation, https://github.com/chrissoria/cat-llm#readme
6
+ Project-URL: Issues, https://github.com/chrissoria/cat-llm/issues
7
+ Project-URL: Source, https://github.com/chrissoria/cat-llm
8
+ Author-email: Christopher Soria <chrissoria@berkeley.edu>
9
+ License-Expression: MIT
10
+ License-File: LICENSE
11
+ Keywords: categorizer,image classification,llm,structured output,survey data,text classification
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Programming Language :: Python
14
+ Classifier: Programming Language :: Python :: 3.8
15
+ Classifier: Programming Language :: Python :: 3.9
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Programming Language :: Python :: Implementation :: CPython
20
+ Classifier: Programming Language :: Python :: Implementation :: PyPy
21
+ Requires-Python: >=3.8
22
+ Requires-Dist: pandas
23
+ Requires-Dist: tqdm
24
+ Description-Content-Type: text/markdown
25
+
26
+ ![catllm Logo](https://github.com/chrissoria/cat-llm/blob/main/images/logo.png?raw=True)
27
+
28
+ # catllm
29
+
30
+ [![PyPI - Version](https://img.shields.io/pypi/v/cat-llm.svg)](https://pypi.org/project/cat-llm)
31
+ [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/cat-llm.svg)](https://pypi.org/project/cat-llm)
32
+
33
+ -----
34
+
35
+ ## Table of Contents
36
+
37
+ - [Installation](#installation)
38
+ - [Quick Start](#quick-start)
39
+ - [Configuration](#configuration)
40
+ - [Supported Models](#supported-models)
41
+ - [API Reference](#api-reference)
42
+ - [explore_corpus()](#explore_corpus)
43
+ - [explore_common_categories()](#explore_common_categories)
44
+ - [multi_class()](#multi_class)
45
+ - [image_score()](#image_score)
46
+ - [image_features()](#image_features)
47
+ - [cerad_drawn_score()](#cerad_drawn_score)
48
+ - [Academic Research](#academic-research)
49
+ - [License](#license)
50
+
51
+ ## Installation
52
+
53
+ ```console
54
+ pip install cat-llm
55
+ ```
56
+
57
+ ## Quick Start
58
+
59
+ The `explore_corpus` function extracts a list of all categories present in the corpus as identified by the model.
60
+ ```
61
+ import catllm as cat
62
+ import os
63
+
64
+ categories = cat.explore_corpus(
65
+ survey_question="What motivates you most at work?",
66
+ survey_input=["flexible schedule", "good pay", "interesting projects"],
67
+ api_key="OPENAI_API_KEY",
68
+ cat_num=5,
69
+ divisions=10
70
+ )
71
+ print(categories)
72
+ ```
73
+
74
+ ## Configuration
75
+
76
+ ### Get Your OpenAI API Key
77
+
78
+ 1. **Create an OpenAI Developer Account**:
79
+ - Go to [platform.openai.com](https://platform.openai.com) (separate from regular ChatGPT)
80
+ - Sign up with email, Google, Microsoft, or Apple
81
+
82
+ 2. **Generate an API Key**:
83
+ - Log into your account and click your name in the top right corner
84
+ - Click "View API keys" or navigate to the "API keys" section
85
+ - Click "Create new secret key"
86
+ - Give your key a descriptive name
87
+ - Set permissions (choose "All" for full access)
88
+
89
+ 3. **Add Payment Details**:
90
+ - Add a payment method to your OpenAI account
91
+ - Purchase credits (start with $5 - it lasts a long time for most research use)
92
+ - **Important**: Your API key won't work without credits
93
+
94
+ 4. **Save Your Key Securely**:
95
+ - Copy the key immediately (you won't be able to see it again)
96
+ - Store it safely and never share it publicly
97
+
98
+ 5. **Use Your Key**: Pass the key to catllm functions through the `api_key` parameter
99
+
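Since the steps above advise never sharing the key publicly, one option is to keep it out of scripts entirely and read it from an environment variable. The following is a minimal sketch, not part of catllm: the helper name `load_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is an assumption borrowed from the placeholder used in the README examples.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key to the API.
        raise RuntimeError(f"Set the {env_var} environment variable before calling catllm.")
    return key
```

The returned string could then be passed as `api_key=load_api_key()` in any of the calls documented in this README.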
100
+ ## Supported Models
101
+
102
+ - **OpenAI**: GPT-4o, GPT-4, GPT-3.5-turbo, etc.
103
+ - **Anthropic**: Claude Sonnet 3.7, Claude Haiku, etc.
104
+ - **Perplexity**: Sonar Large, Sonar Small, etc.
105
+ - **Mistral**: Mistral Large, Mistral Small, etc.
106
+
107
+ ## API Reference
108
+
109
+ ### `explore_corpus()`
110
+
111
+ Extracts categories from a corpus of text responses and returns frequency counts.
112
+
113
+ **Methodology:**
114
+ The function divides the corpus into random chunks to address the probabilistic nature of LLM outputs. By processing multiple chunks and averaging results across many API calls rather than relying on a single call, this approach significantly improves reproducibility and provides more stable categorical frequency estimates.
115
+
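The chunk-and-aggregate idea described in the methodology can be sketched as follows. This is an illustration under stated assumptions rather than catllm's actual implementation: `chunked_category_counts` and `keyword_extractor` are hypothetical names, the keyword matcher stands in for the LLM call, and counting the number of chunks in which a category appears is only one plausible way to aggregate.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def chunked_category_counts(
    responses: Sequence[str],
    extract: Callable[[List[str]], List[str]],
    divisions: int = 5,
    seed: int = 0,
) -> Counter:
    """Shuffle responses, split them into chunks, and count categories across chunks."""
    rng = random.Random(seed)
    shuffled = list(responses)
    rng.shuffle(shuffled)
    # Deal responses round-robin so each chunk is a random, near-equal share.
    chunks = [shuffled[i::divisions] for i in range(divisions)]
    counts: Counter = Counter()
    for chunk in chunks:
        if chunk:
            # Count a category once for every chunk in which it is found.
            counts.update(set(extract(chunk)))
    return counts


def keyword_extractor(chunk: List[str]) -> List[str]:
    """Hypothetical stand-in for an LLM call: label responses by keyword."""
    labels = []
    for text in chunk:
        if "pay" in text:
            labels.append("compensation")
        if "flexible" in text:
            labels.append("flexibility")
    return labels


responses = ["good pay", "flexible schedule", "better pay", "flexible hours"]
counts = chunked_category_counts(responses, keyword_extractor, divisions=2)
print(counts.most_common())
```

Categories that appear in many chunks are less likely to be artifacts of a single model call, which is the intuition behind splitting the corpus before aggregating.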
116
+ **Parameters:**
117
+ - `survey_question` (str): The survey question being analyzed
118
+ - `survey_input` (list): List of text responses to categorize
119
+ - `api_key` (str): API key for the LLM service
120
+ - `cat_num` (int, default=10): Number of categories to extract in each iteration
121
+ - `divisions` (int, default=5): Number of chunks to divide the data into (larger corpora might require larger divisions)
122
+ - `specificity` (str, default="broad"): Category precision level (e.g., "broad", "narrow")
123
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
124
+ - `user_model` (str, default="gpt-4o"): Specific model (e.g., "gpt-4o", "claude-opus-4-20250514")
125
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
126
+ - `filename` (str, optional): Output file path for saving results
127
+
128
+ **Returns:**
129
+ - `pandas.DataFrame`: Two-column dataset with category names and frequencies
130
+
131
+ **Example:**
132
+
133
+ ```
134
+ import catllm as cat
135
+
136
+ categories = cat.explore_corpus(
137
+ survey_question="What motivates you most at work?",
138
+ survey_input=["flexible schedule", "good pay", "interesting projects"],
139
+ api_key="OPENAI_API_KEY",
140
+ cat_num=5,
141
+ divisions=10
142
+ )
143
+ ```
144
+
145
+ ### `explore_common_categories()`
146
+
147
+ Identifies the most frequently occurring categories across a text corpus and returns the top N categories by frequency count.
148
+
149
+ **Methodology:**
150
+ Divides the corpus into random chunks and averages results across multiple API calls to improve reproducibility and provide stable frequency estimates for the most prevalent categories, addressing the probabilistic nature of LLM outputs.
151
+
152
+ **Parameters:**
153
+ - `survey_question` (str): Survey question being analyzed
154
+ - `survey_input` (list): Text responses to categorize
155
+ - `api_key` (str): API key for the LLM service
156
+ - `top_n` (int, default=10): Number of top categories to return by frequency
157
+ - `cat_num` (int, default=10): Number of categories to extract per iteration
158
+ - `divisions` (int, default=5): Number of data chunks (increase for larger corpora)
159
+ - `user_model` (str, default="gpt-4o"): Specific model to use
160
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
161
+ - `specificity` (str, default="broad"): Category precision level ("broad", "narrow")
162
+ - `research_question` (str, optional): Contextual research question to guide categorization
163
+ - `filename` (str, optional): File path to save output dataset
164
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
165
+
166
+ **Returns:**
167
+ - `pandas.DataFrame`: Dataset with category names and frequencies, limited to top N most common categories
168
+
169
+ **Example:**
170
+
171
+ ```
172
+ import catllm as cat
173
+
174
+ top_10_categories = cat.explore_common_categories(
175
+ survey_question="What motivates you most at work?",
176
+ survey_input=["flexible schedule", "good pay", "interesting projects"],
177
+ api_key="OPENAI_API_KEY",
178
+ top_n=10,
179
+ cat_num=5,
180
+ divisions=10
181
+ )
182
+ print(top_10_categories)
183
+ ```
184
+ ### `multi_class()`
185
+
186
+ Performs multi-label classification of text responses into user-defined categories, returning structured results with optional CSV export.
187
+
188
+ **Methodology:**
189
+ Processes each text response individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
190
+
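Multi-label results are often easiest to analyze as one indicator column per category. The sketch below shows that layout with plain Python; it is an assumption about a convenient output shape, not a description of what `multi_class()` returns, and `to_indicator_rows` is a hypothetical helper.

```python
from typing import Dict, List, Sequence, Union


def to_indicator_rows(
    responses: Sequence[str],
    assigned: Sequence[Sequence[str]],
    categories: Sequence[str],
) -> List[Dict[str, Union[str, int]]]:
    """Build one row per response with a 0/1 flag for every category."""
    rows = []
    for text, labels in zip(responses, assigned):
        row: Dict[str, Union[str, int]] = {"response": text}
        for category in categories:
            # 1 if the category was assigned to this response, else 0.
            row[category] = 1 if category in labels else 0
        rows.append(row)
    return rows


rows = to_indicator_rows(
    responses=["new job in another city", "moved in with my partner after a raise"],
    assigned=[["job change"], ["partner", "financial"]],
    categories=["partner", "job change", "financial"],
)
```

A list of such rows can be passed directly to `pandas.DataFrame(rows)` for further analysis.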
191
+ **Parameters:**
192
+ - `survey_question` (str): The survey question being analyzed
193
+ - `survey_input` (list): List of text responses to classify
194
+ - `categories` (list): List of predefined categories for classification
195
+ - `api_key` (str): API key for the LLM service
196
+ - `user_model` (str, default="gpt-4o"): Specific model to use
197
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
198
+ - `safety` (bool, default=False): Enables safety checks on responses and saves results to CSV at each API call step
199
+ - `filename` (str, default="categorized_data.csv"): Filename for CSV output
200
+ - `save_directory` (str, optional): Directory path to save the CSV file
201
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
202
+
203
+ **Returns:**
204
+ - `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
205
+
206
+ **Example:**
207
+
208
+ ```
209
+ import catllm as cat
210
+
211
+ user_categories = ["to start living with or to stay with partner/spouse",
212
+ "relationship change (divorce, breakup, etc)",
213
+ "the person had a job or school or career change, including transferred and retired",
214
+ "the person's partner's job or school or career change, including transferred and retired",
215
+ "financial reasons (rent is too expensive, pay raise, etc)",
216
+ "related specifically to features of the home, such as a bigger or smaller yard"]
217
+
218
+ question = "Why did you move?"
219
+
220
+ move_reasons = cat.multi_class(
221
+ survey_question=question,
222
+ survey_input=df["column1"],
223
+ user_model="gpt-4o",
224
+ creativity=0,
225
+ categories=user_categories,
226
+ safety=True,
227
+ api_key="OPENAI_API_KEY")
228
+ ```
229
+
230
+ ### `image_multi_class()`
231
+
232
+ Performs multi-label image classification into user-defined categories, returning structured results with optional CSV export.
233
+
234
+ **Methodology:**
235
+ Processes each image individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
236
+
237
+ **Parameters:**
238
+ - `image_description` (str): A description of what the model should expect to see
239
+ - `image_input` (list): List of file paths or a folder to pull file paths from
240
+ - `categories` (list): List of predefined categories for classification
241
+ - `api_key` (str): API key for the LLM service
242
+ - `user_model` (str, default="gpt-4o"): Specific model to use
243
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
244
+ - `safety` (bool, default=False): Enables safety checks on responses and saves results to CSV at each API call step
245
+ - `filename` (str, default="categorized_data.csv"): Filename for CSV output
246
+ - `save_directory` (str, optional): Directory path to save the CSV file
247
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
248
+
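The `image_input` parameter is described as accepting either a list of file paths or a folder. One way such an input could be normalized to a list of paths is sketched below. This is not catllm's code: `resolve_image_paths` and the `IMAGE_SUFFIXES` set are hypothetical, and the accepted file extensions are an assumption.

```python
from pathlib import Path
from typing import List, Sequence, Union

# Assumed set of image extensions; the package may accept others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def resolve_image_paths(image_input: Union[str, Path, Sequence[str]]) -> List[str]:
    """Return a list of image paths from either a folder or an explicit list."""
    if isinstance(image_input, (str, Path)):
        folder = Path(image_input)
        # Keep only regular files whose extension looks like an image.
        return sorted(
            str(p)
            for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
    # Already a sequence of paths: pass it through unchanged as strings.
    return [str(p) for p in image_input]
```

Sorting the folder listing keeps the row order stable between runs, which helps when results are saved incrementally.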
249
+ **Returns:**
250
+ - `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
251
+
252
+ **Example:**
253
+
254
+ ```
255
+ import catllm as cat
256
+
257
+ user_categories = ["has a cat somewhere in it",
258
+ "looks cartoonish",
259
+ "Adrien Brody is in it"]
260
+
261
+ description = "Should be an image of a child's drawing"
262
+
263
+ image_categories = cat.image_multi_class(
264
+ image_description=description,
265
+ image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
266
+ user_model="gpt-4o",
267
+ creativity=0,
268
+ categories=user_categories,
269
+ safety=True,
270
+ api_key="OPENAI_API_KEY")
271
+ ```
272
+
273
+ ### `image_score()`
274
+
275
+ Performs quality scoring of images against a reference description, returning structured results with optional CSV export.
276
+
277
+ **Methodology:**
278
+ Processes each image individually, assigning a quality score on a 5-point scale based on similarity to the expected description:
279
+
280
+ - **1**: No meaningful similarity (fundamentally different)
281
+ - **2**: Barely recognizable similarity (25% match)
282
+ - **3**: Partial match (50% key features)
283
+ - **4**: Strong alignment (75% features)
284
+ - **5**: Near-perfect match (90%+ similarity)
285
+
286
+ Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
287
+
288
+ **Parameters:**
289
+ - `reference_image_description` (str): A description of what the model should expect to see
290
+ - `image_input` (list): List of image file paths or folder path containing images
291
+ - `reference_image` (str): A file path to the reference image
292
+ - `api_key` (str): API key for the LLM service
293
+ - `user_model` (str, default="gpt-4o"): Specific vision model to use
294
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
295
+ - `safety` (bool, default=False): Enable safety checks and save results at each API call step
296
+ - `filename` (str, default="image_scores.csv"): Filename for CSV output
297
+ - `save_directory` (str, optional): Directory path to save the CSV file
298
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
299
+
300
+ **Returns:**
301
+ - `pandas.DataFrame`: DataFrame with image paths, quality scores, and analysis details
302
+
303
+ **Example:**
304
+
305
+ ```
306
+ import catllm as cat
307
+
308
+ image_scores = cat.image_score(
309
+ reference_image_description='Adrien Brody sitting in a lawn chair',
310
+ image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
311
+ user_model="gpt-4o",
312
+ creativity=0,
313
+ safety=True,
314
+ api_key="OPENAI_API_KEY")
315
+ ```
316
+
317
+ ### `image_features()`
318
+
319
+ Extracts specific features and attributes from images, returning exact answers to user-defined questions (e.g., counts, colors, presence of objects).
320
+
321
+ **Methodology:**
322
+ Processes each image individually using vision models to extract precise information about specified features. Unlike scoring and multi-class functions, this returns factual data such as object counts, color identification, or presence/absence of specific elements. Supports flexible output formatting and optional CSV export for quantitative analysis workflows.
323
+
324
+ **Parameters:**
325
+ - `image_description` (str): A description of what the model should expect to see
326
+ - `image_input` (list): List of image file paths or folder path containing images
327
+ - `features_to_extract` (list): List of specific features to extract (e.g., ["number of people", "primary color", "contains text"])
328
+ - `api_key` (str): API key for the LLM service
329
+ - `user_model` (str, default="gpt-4o"): Specific vision model to use
330
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
331
+ - `to_csv` (bool, default=False): Whether to save the output to a CSV file
332
+ - `safety` (bool, default=False): Enable safety checks and save results at each API call step
333
+ - `filename` (str, default="categorized_data.csv"): Filename for CSV output
334
+ - `save_directory` (str, optional): Directory path to save the CSV file
335
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
336
+
337
+ **Returns:**
338
+ - `pandas.DataFrame`: DataFrame with image paths and extracted feature values for each specified attribute
339
+
340
+ **Example:**
341
+
342
+ ```
343
+ import catllm as cat
344
+
345
+ image_scores = cat.image_features(
346
+ image_description='An AI generated image of Spongebob dancing with Patrick',
347
+ features_to_extract=['Spongebob is yellow', 'Both are smiling', 'Patrick is chunky'],
348
+ image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
349
+ model_source= 'OpenAI',
350
+ user_model="gpt-4o",
351
+ creativity=0,
352
+ safety=True,
353
+ api_key="OPENAI_API_KEY")
354
+ ```
355
+
356
+ ### `cerad_drawn_score()`
357
+
358
+ Automatically scores drawings of circles, diamonds, overlapping rectangles, and cubes according to the official Consortium to Establish a Registry for Alzheimer's Disease (CERAD) scoring system, returning structured results with optional CSV export. Works even with images that contain other drawings or writing.
359
+
360
+ **Methodology:**
361
+ Processes each image individually, evaluating the drawn shapes based on CERAD criteria. Supports optional inclusion of reference shapes within images and can provide reference examples if requested. The function outputs standardized scores facilitating reproducible analysis and integrates optional safety checks and CSV export for research workflows.
362
+
363
+ **Parameters:**
364
+ - `shape` (str): The type of shape to score (e.g., "circle", "diamond", "overlapping rectangles", "cube")
365
+ - `image_input` (list): List of image file paths or folder path containing images
366
+ - `api_key` (str): API key for the LLM service
367
+ - `user_model` (str, default="gpt-4o"): Specific model to use
368
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
369
+ - `reference_in_image` (bool, default=False): Whether a reference shape is present in the image for comparison
370
+ - `provide_reference` (bool, default=False): Whether to provide a reference example image or description
371
+ - `safety` (bool, default=False): Enable safety checks and save results at each API call step
372
+ - `filename` (str, default="categorized_data.csv"): Filename for CSV output
373
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
374
+
375
+ **Returns:**
376
+ - `pandas.DataFrame`: DataFrame with image paths, CERAD scores, and analysis details
377
+
378
+ **Example:**
379
+
380
+ ```
381
+ import catllm as cat
382
+
383
+ diamond_scores = cat.cerad_drawn_score(
384
+ shape="diamond",
385
+ image_input=df['diamond_pic_path'],
386
+ api_key=open_ai_key,
387
+ safety=True,
388
+ filename="diamond_gpt_score.csv",
389
+ )
390
+ ```
391
+
392
+
393
+ ## Academic Research
394
+
395
+ This package implements methodology from research on LLM performance in social science applications, including the UC Berkeley Social Networks Study. The package addresses reproducibility challenges in LLM-assisted research by providing standardized interfaces and consistent output formatting.
396
+
397
+ ## License
398
+
399
+ `cat-llm` is distributed under the terms of the [GNU](https://www.gnu.org/licenses/gpl-3.0.en.html) license.
@@ -0,0 +1,374 @@
1
+ ![catllm Logo](https://github.com/chrissoria/cat-llm/blob/main/images/logo.png?raw=True)
2
+
3
+ # catllm
4
+
5
+ [![PyPI - Version](https://img.shields.io/pypi/v/cat-llm.svg)](https://pypi.org/project/cat-llm)
6
+ [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/cat-llm.svg)](https://pypi.org/project/cat-llm)
7
+
8
+ -----
9
+
10
+ ## Table of Contents
11
+
12
+ - [Installation](#installation)
13
+ - [Quick Start](#quick-start)
14
+ - [Configuration](#configuration)
15
+ - [Supported Models](#supported-models)
16
+ - [API Reference](#api-reference)
17
+ - [explore_corpus()](#explore_corpus)
18
+ - [explore_common_categories()](#explore_common_categories)
19
+ - [multi_class()](#multi_class)
20
+ - [image_score()](#image_score)
21
+ - [image_features()](#image_features)
22
+ - [cerad_drawn_score()](#cerad_drawn_score)
23
+ - [Academic Research](#academic-research)
24
+ - [License](#license)
25
+
26
+ ## Installation
27
+
28
+ ```console
29
+ pip install cat-llm
30
+ ```
31
+
32
+ ## Quick Start
33
+
34
+ The `explore_corpus` function extracts a list of all categories present in the corpus as identified by the model.
35
+ ```
36
+ import catllm as cat
37
+ import os
38
+
39
+ categories = cat.explore_corpus(
40
+ survey_question="What motivates you most at work?",
41
+ survey_input=["flexible schedule", "good pay", "interesting projects"],
42
+ api_key="OPENAI_API_KEY",
43
+ cat_num=5,
44
+ divisions=10
45
+ )
46
+ print(categories)
47
+ ```
48
+
49
+ ## Configuration
50
+
51
+ ### Get Your OpenAI API Key
52
+
53
+ 1. **Create an OpenAI Developer Account**:
54
+ - Go to [platform.openai.com](https://platform.openai.com) (separate from regular ChatGPT)
55
+ - Sign up with email, Google, Microsoft, or Apple
56
+
57
+ 2. **Generate an API Key**:
58
+ - Log into your account and click your name in the top right corner
59
+ - Click "View API keys" or navigate to the "API keys" section
60
+ - Click "Create new secret key"
61
+ - Give your key a descriptive name
62
+ - Set permissions (choose "All" for full access)
63
+
64
+ 3. **Add Payment Details**:
65
+ - Add a payment method to your OpenAI account
66
+ - Purchase credits (start with $5 - it lasts a long time for most research use)
67
+ - **Important**: Your API key won't work without credits
68
+
69
+ 4. **Save Your Key Securely**:
70
+ - Copy the key immediately (you won't be able to see it again)
71
+ - Store it safely and never share it publicly
72
+
73
+ 5. **Use Your Key**: Pass the key to catllm functions through the `api_key` parameter
74
+
75
+ ## Supported Models
76
+
77
+ - **OpenAI**: GPT-4o, GPT-4, GPT-3.5-turbo, etc.
78
+ - **Anthropic**: Claude Sonnet 3.7, Claude Haiku, etc.
79
+ - **Perplexity**: Sonar Large, Sonar Small, etc.
80
+ - **Mistral**: Mistral Large, Mistral Small, etc.
81
+
82
+ ## API Reference
83
+
84
+ ### `explore_corpus()`
85
+
86
+ Extracts categories from a corpus of text responses and returns frequency counts.
87
+
88
+ **Methodology:**
89
+ The function divides the corpus into random chunks to address the probabilistic nature of LLM outputs. By processing multiple chunks and averaging results across many API calls rather than relying on a single call, this approach significantly improves reproducibility and provides more stable categorical frequency estimates.
90
+
91
+ **Parameters:**
92
+ - `survey_question` (str): The survey question being analyzed
93
+ - `survey_input` (list): List of text responses to categorize
94
+ - `api_key` (str): API key for the LLM service
95
+ - `cat_num` (int, default=10): Number of categories to extract in each iteration
96
+ - `divisions` (int, default=5): Number of chunks to divide the data into (larger corpora might require larger divisions)
97
+ - `specificity` (str, default="broad"): Category precision level (e.g., "broad", "narrow")
98
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
99
+ - `user_model` (str, default="gpt-4o"): Specific model (e.g., "gpt-4o", "claude-opus-4-20250514")
100
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
101
+ - `filename` (str, optional): Output file path for saving results
102
+
103
+ **Returns:**
104
+ - `pandas.DataFrame`: Two-column dataset with category names and frequencies
105
+
106
+ **Example:**
107
+
108
+ ```
109
+ import catllm as cat
110
+
111
+ categories = cat.explore_corpus(
112
+ survey_question="What motivates you most at work?",
113
+ survey_input=["flexible schedule", "good pay", "interesting projects"],
114
+ api_key="OPENAI_API_KEY",
115
+ cat_num=5,
116
+ divisions=10
117
+ )
118
+ ```
119
+
120
+ ### `explore_common_categories()`
121
+
122
+ Identifies the most frequently occurring categories across a text corpus and returns the top N categories by frequency count.
123
+
124
+ **Methodology:**
125
+ Divides the corpus into random chunks and averages results across multiple API calls to improve reproducibility and provide stable frequency estimates for the most prevalent categories, addressing the probabilistic nature of LLM outputs.
126
+
127
+ **Parameters:**
128
+ - `survey_question` (str): Survey question being analyzed
129
+ - `survey_input` (list): Text responses to categorize
130
+ - `api_key` (str): API key for the LLM service
131
+ - `top_n` (int, default=10): Number of top categories to return by frequency
132
+ - `cat_num` (int, default=10): Number of categories to extract per iteration
133
+ - `divisions` (int, default=5): Number of data chunks (increase for larger corpora)
134
+ - `user_model` (str, default="gpt-4o"): Specific model to use
135
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
136
+ - `specificity` (str, default="broad"): Category precision level ("broad", "narrow")
137
+ - `research_question` (str, optional): Contextual research question to guide categorization
138
+ - `filename` (str, optional): File path to save output dataset
139
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
140
+
141
+ **Returns:**
142
+ - `pandas.DataFrame`: Dataset with category names and frequencies, limited to top N most common categories
143
+
144
+ **Example:**
145
+
146
+ ```
147
+ import catllm as cat
148
+
149
+ top_10_categories = cat.explore_common_categories(
150
+ survey_question="What motivates you most at work?",
151
+ survey_input=["flexible schedule", "good pay", "interesting projects"],
152
+ api_key="OPENAI_API_KEY",
153
+ top_n=10,
154
+ cat_num=5,
155
+ divisions=10
156
+ )
157
+ print(top_10_categories)
158
+ ```
159
+ ### `multi_class()`
160
+
161
+ Performs multi-label classification of text responses into user-defined categories, returning structured results with optional CSV export.
162
+
163
+ **Methodology:**
164
+ Processes each text response individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
165
+
166
+ **Parameters:**
167
+ - `survey_question` (str): The survey question being analyzed
168
+ - `survey_input` (list): List of text responses to classify
169
+ - `categories` (list): List of predefined categories for classification
170
+ - `api_key` (str): API key for the LLM service
171
+ - `user_model` (str, default="gpt-4o"): Specific model to use
172
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
173
+ - `safety` (bool, default=False): Enables safety checks on responses and saves results to CSV at each API call step
174
+ - `filename` (str, default="categorized_data.csv"): Filename for CSV output
175
+ - `save_directory` (str, optional): Directory path to save the CSV file
176
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
177
+
178
+ **Returns:**
179
+ - `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
180
+
181
+ **Example:**
182
+
183
+ ```
184
+ import catllm as cat
185
+
186
+ user_categories = ["to start living with or to stay with partner/spouse",
187
+ "relationship change (divorce, breakup, etc)",
188
+ "the person had a job or school or career change, including transferred and retired",
189
+ "the person's partner's job or school or career change, including transferred and retired",
190
+ "financial reasons (rent is too expensive, pay raise, etc)",
191
+ "related specifically to features of the home, such as a bigger or smaller yard"]
192
+
193
+ question = "Why did you move?"
194
+
195
+ move_reasons = cat.multi_class(
196
+ survey_question=question,
197
+ survey_input=df["column1"],
198
+ user_model="gpt-4o",
199
+ creativity=0,
200
+ categories=user_categories,
201
+ safety=True,
202
+ api_key="OPENAI_API_KEY")
203
+ ```
204
+
205
+ ### `image_multi_class()`
206
+
207
+ Performs multi-label image classification into user-defined categories, returning structured results with optional CSV export.
208
+
209
+ **Methodology:**
210
+ Processes each image individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
211
+
212
+ **Parameters:**
213
+ - `image_description` (str): A description of what the model should expect to see
214
+ - `image_input` (list): List of file paths or a folder to pull file paths from
215
+ - `categories` (list): List of predefined categories for classification
216
+ - `api_key` (str): API key for the LLM service
217
+ - `user_model` (str, default="gpt-4o"): Specific model to use
218
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
219
+ - `safety` (bool, default=False): Enables safety checks on responses and saves results to CSV at each API call step
220
+ - `filename` (str, default="categorized_data.csv"): Filename for CSV output
221
+ - `save_directory` (str, optional): Directory path to save the CSV file
222
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
223
+
224
+ **Returns:**
225
+ - `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
226
+
227
+ **Example:**
228
+
229
+ ```
230
+ import catllm as cat
231
+
232
+ user_categories = ["has a cat somewhere in it",
233
+ "looks cartoonish",
234
+ "Adrien Brody is in it"]
235
+
236
+ description = "Should be an image of a child's drawing"
237
+
238
+ image_categories = cat.image_multi_class(
239
+ image_description=description,
240
+ image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
241
+ user_model="gpt-4o",
242
+ creativity=0,
243
+ categories=user_categories,
244
+ safety=True,
245
+ api_key="OPENAI_API_KEY")
246
+ ```
247
+
248
+ ### `image_score()`
249
+
250
+ Performs quality scoring of images against a reference description, returning structured results with optional CSV export.
251
+
252
+ **Methodology:**
253
+ Processes each image individually, assigning a quality score on a 5-point scale based on similarity to the expected description:
254
+
255
+ - **1**: No meaningful similarity (fundamentally different)
256
+ - **2**: Barely recognizable similarity (25% match)
257
+ - **3**: Partial match (50% key features)
258
+ - **4**: Strong alignment (75% features)
259
+ - **5**: Near-perfect match (90%+ similarity)
260
+
261
+ Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
262
+
263
+ **Parameters:**
264
+ - `reference_image_description` (str): A description of what the model should expect to see
265
+ - `image_input` (list): List of image file paths or folder path containing images
266
+ - `reference_image` (str): A file path to the reference image
267
+ - `api_key` (str): API key for the LLM service
268
+ - `user_model` (str, default="gpt-4o"): Specific vision model to use
269
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
270
+ - `safety` (bool, default=False): Enable safety checks and save results at each API call step
271
+ - `filename` (str, default="image_scores.csv"): Filename for CSV output
272
+ - `save_directory` (str, optional): Directory path to save the CSV file
273
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
274
+
275
+ **Returns:**
276
+ - `pandas.DataFrame`: DataFrame with image paths, quality scores, and analysis details
277
+
278
+ **Example:**
279
+
280
+ ```
281
+ import catllm as cat
282
+
283
+ image_scores = cat.image_score(
284
+ reference_image_description='Adrien Brody sitting in a lawn chair',
285
+ image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
286
+ user_model="gpt-4o",
287
+ creativity=0,
288
+ safety=True,
289
+ api_key="OPENAI_API_KEY")
290
+ ```
291
+
292
+ ### `image_features()`
293
+
294
+ Extracts specific features and attributes from images, returning exact answers to user-defined questions (e.g., counts, colors, presence of objects).
295
+
296
+ **Methodology:**
297
+ Processes each image individually using vision models to extract precise information about specified features. Unlike scoring and multi-class functions, this returns factual data such as object counts, color identification, or presence/absence of specific elements. Supports flexible output formatting and optional CSV export for quantitative analysis workflows.
298
+
299
+ **Parameters:**
300
+ - `image_description` (str): A description of what the model should expect to see
301
+ - `image_input` (list): List of image file paths or folder path containing images
302
+ - `features_to_extract` (list): List of specific features to extract (e.g., ["number of people", "primary color", "contains text"])
303
+ - `api_key` (str): API key for the LLM service
304
+ - `user_model` (str, default="gpt-4o"): Specific vision model to use
305
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
306
+ - `to_csv` (bool, default=False): Whether to save the output to a CSV file
307
+ - `safety` (bool, default=False): Enable safety checks and save results at each API call step
308
+ - `filename` (str, default="categorized_data.csv"): Filename for CSV output
309
+ - `save_directory` (str, optional): Directory path to save the CSV file
310
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
311
+
312
+ **Returns:**
313
+ - `pandas.DataFrame`: DataFrame with image paths and extracted feature values for each specified attribute
314
+
315
+ **Example:**
316
+
317
+ ```
318
+ import catllm as cat
319
+
320
+ image_scores = cat.image_features(
321
+ image_description='An AI generated image of Spongebob dancing with Patrick',
322
+ features_to_extract=['Spongebob is yellow', 'Both are smiling', 'Patrick is chunky'],
323
+ image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
324
+ model_source= 'OpenAI',
325
+ user_model="gpt-4o",
326
+ creativity=0,
327
+ safety=True,
328
+ api_key="OPENAI_API_KEY")
329
+ ```
330
+
331
+ ### `cerad_drawn_score()`
332
+
333
+ Automatically scores drawings of circles, diamonds, overlapping rectangles, and cubes according to the official Consortium to Establish a Registry for Alzheimer's Disease (CERAD) scoring system, returning structured results with optional CSV export. Works even with images that contain other drawings or writing.
334
+
335
+ **Methodology:**
336
+ Processes each image individually, evaluating the drawn shapes based on CERAD criteria. Supports optional inclusion of reference shapes within images and can provide reference examples if requested. The function outputs standardized scores facilitating reproducible analysis and integrates optional safety checks and CSV export for research workflows.
337
+
338
+ **Parameters:**
339
+ - `shape` (str): The type of shape to score (e.g., "circle", "diamond", "overlapping rectangles", "cube")
340
+ - `image_input` (list): List of image file paths or folder path containing images
341
+ - `api_key` (str): API key for the LLM service
342
+ - `user_model` (str, default="gpt-4o"): Specific model to use
343
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
344
+ - `reference_in_image` (bool, default=False): Whether a reference shape is present in the image for comparison
345
+ - `provide_reference` (bool, default=False): Whether to provide a reference example image or description
346
+ - `safety` (bool, default=False): Enable safety checks and save results at each API call step
347
+ - `filename` (str, default="categorized_data.csv"): Filename for CSV output
348
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
349
+
350
+ **Returns:**
351
+ - `pandas.DataFrame`: DataFrame with image paths, CERAD scores, and analysis details
352
+
353
+ **Example:**
354
+
355
+ ```
356
+ import catllm as cat
357
+
358
+ diamond_scores = cat.cerad_drawn_score(
359
+ shape="diamond",
360
+ image_input=df['diamond_pic_path'],
361
+ api_key=open_ai_key,
362
+ safety=True,
363
+ filename="diamond_gpt_score.csv",
364
+ )
365
+ ```
366
+
367
+
368
+ ## Academic Research
369
+
370
+ This package implements methodology from research on LLM performance in social science applications, including the UC Berkeley Social Networks Study. The package addresses reproducibility challenges in LLM-assisted research by providing standardized interfaces and consistent output formatting.
371
+
372
+ ## License
373
+
374
+ `cat-llm` is distributed under the terms of the [GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.en.html).
@@ -41,9 +41,10 @@ def cerad_drawn_score(
41
41
  import glob
42
42
  import base64
43
43
  from pathlib import Path
44
+ import pkg_resources
44
45
 
45
46
  shape = shape.lower()
46
-
47
+ shape = "rectangles" if shape == "overlapping rectangles" else shape
47
48
  if shape == "circle":
48
49
  categories = ["The image contains a drawing that clearly represents a circle",
49
50
  "The image does NOT contain any drawing that resembles a circle",
@@ -107,6 +108,16 @@ def cerad_drawn_score(
107
108
  cat_num = len(categories)
108
109
  category_dict = {str(i+1): "0" for i in range(cat_num)}
109
110
  example_JSON = json.dumps(category_dict, indent=4)
111
+ #pulling in the reference image if provided
112
+ if provide_reference:
113
+ reference_image = pkg_resources.resource_filename(
114
+ 'catllm',
115
+ f'images/{shape}.png' # e.g., "circle.png"
116
+ )
117
+ ext = Path(reference_image_path).suffix[1:]
118
+ with open(reference_image_path, 'rb') as f:
119
+ encoded_ref = base64.b64encode(f.read()).decode('utf-8')
120
+ encoded_ref_image = f"data:image/{ext};base64,{encoded_ref}"
110
121
 
111
122
  link1 = []
112
123
  extracted_jsons = []
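The hunk above loads a packaged reference image and encodes it as a base64 data URL. A minimal sketch of that encoding step in isolation, using one variable name for the path from start to finish; the helper name `encode_image_as_data_url` is hypothetical and not part of catllm:

```python
import base64
from pathlib import Path


def encode_image_as_data_url(image_path):
    """Read an image file and return its contents as a base64 data URL."""
    path = Path(image_path)
    # File extension without the leading dot, e.g. "circle.png" -> "png".
    ext = path.suffix[1:].lower()
    # Base64-encode the raw bytes and decode to text for embedding in a URL.
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:image/{ext};base64,{encoded}"
```

Keeping the path in a single variable means the value used to derive the extension is the same one that is opened and read.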
@@ -146,13 +157,21 @@ def cerad_drawn_score(
146
157
  f"No additional keys, comments, or text.\n\n"
147
158
  f"Example:\n"
148
159
  f"{example_JSON}"
149
- ),
150
- },
151
- {
152
- "type": "image_url",
153
- "image_url": {"url": encoded_image, "detail": "high"},
154
- }
160
+ )
161
+ }
155
162
  ]
163
+ # Conditionally add reference image
164
+ if provide_reference:
165
+ prompt.append({
166
+ "type": "image_url",
167
+ "image_url": {"url": reference_image, "detail": "high"}
168
+ })
169
+
170
+ prompt.append({
171
+ "type": "image_url",
172
+ "image_url": {"url": encoded_image, "detail": "high"}
173
+ })
174
+ print(prompt)
156
175
  elif model_source == "Anthropic":
157
176
  prompt = [
158
177
  {
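The OpenAI branch in the hunk above builds the message content as a list: the text instructions first, then an optional reference image, then the image being scored. A minimal sketch of that ordering, assuming both images are already encoded as data URLs; `build_image_content` and its parameter names are hypothetical and not part of catllm:

```python
def build_image_content(instructions, encoded_image, encoded_ref_image=None):
    """Assemble message content: text, optional reference image, target image."""
    content = [{"type": "text", "text": instructions}]

    # Conditionally add the reference image before the image to be scored.
    if encoded_ref_image is not None:
        content.append({
            "type": "image_url",
            "image_url": {"url": encoded_ref_image, "detail": "high"},
        })

    # The image being scored always comes last.
    content.append({
        "type": "image_url",
        "image_url": {"url": encoded_image, "detail": "high"},
    })
    return content
```

Passing the reference as an optional argument keeps the two-image and one-image cases on the same code path, so the target image is appended exactly once either way.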
@@ -347,7 +366,7 @@ def cerad_drawn_score(
347
366
  categorized_data['score'] = categorized_data['diamond_4_sides'] + categorized_data['diamond_equal_sides'] + categorized_data['similar']
348
367
 
349
368
  categorized_data.loc[categorized_data['none'] == 1, 'score'] = 0
350
- categorized_data.loc[(categorized_data['diamond_square'] == 1) & (categorized_data['score'] == 0), 'score'] = 2
369
+ #categorized_data.loc[(categorized_data['diamond_square'] == 1) & (categorized_data['score'] == 0), 'score'] = 2
351
370
 
352
371
  elif shape == "rectangles" or shape == "overlapping rectangles":
353
372
 
@@ -1,7 +1,7 @@
1
1
  # SPDX-FileCopyrightText: 2025-present Christopher Soria <chrissoria@berkeley.edu>
2
2
  #
3
3
  # SPDX-License-Identifier: MIT
4
- __version__ = "0.0.32"
4
+ __version__ = "0.0.34"
5
5
  __author__ = "Chris Soria"
6
6
  __email__ = "chrissoria@berkeley.edu"
7
7
  __title__ = "cat-llm"
@@ -4,7 +4,6 @@ def image_multi_class(
4
4
  image_input,
5
5
  categories,
6
6
  api_key,
7
- columns="numbered",
8
7
  user_model="gpt-4o",
9
8
  creativity=0,
10
9
  to_csv=False,
@@ -508,7 +507,6 @@ def image_features(
508
507
  image_input,
509
508
  features_to_extract,
510
509
  api_key,
511
- columns="numbered",
512
510
  user_model="gpt-4o-2024-11-20",
513
511
  creativity=0,
514
512
  to_csv=False,
@@ -106,7 +106,7 @@ def explore_common_categories(
106
106
  top_n=10,
107
107
  cat_num=10,
108
108
  divisions=5,
109
- user_model="gpt-4o-2024-11-20",
109
+ user_model="gpt-4o",
110
110
  creativity=0,
111
111
  specificity="broad",
112
112
  research_question=None,
@@ -224,10 +224,8 @@ def multi_class(
224
224
  survey_input,
225
225
  categories,
226
226
  api_key,
227
- columns="numbered",
228
- user_model="gpt-4o-2024-11-20",
227
+ user_model="gpt-4o",
229
228
  creativity=0,
230
- to_csv=False,
231
229
  safety=False,
232
230
  filename="categorized_data.csv",
233
231
  save_directory=None,
cat_llm-0.0.32/LICENSE DELETED
@@ -1,21 +0,0 @@
1
- MIT License
2
-
3
- Copyright (c) 2025 Christopher Soria
4
-
5
- Permission is hereby granted, free of charge, to any person obtaining a copy
6
- of this software and associated documentation files (the "Software"), to deal
7
- in the Software without restriction, including without limitation the rights
8
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
- copies of the Software, and to permit persons to whom the Software is
10
- furnished to do so, subject to the following conditions:
11
-
12
- The above copyright notice and this permission notice shall be included in all
13
- copies or substantial portions of the Software.
14
-
15
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
- SOFTWARE.
cat_llm-0.0.32/PKG-INFO DELETED
@@ -1,48 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: cat-llm
3
- Version: 0.0.32
4
- Summary: A tool for categorizing text data and images using LLMs and vision models
5
- Project-URL: Documentation, https://github.com/chrissoria/cat-llm#readme
6
- Project-URL: Issues, https://github.com/chrissoria/cat-llm/issues
7
- Project-URL: Source, https://github.com/chrissoria/cat-llm
8
- Author-email: Christopher Soria <chrissoria@berkeley.edu>
9
- License-Expression: MIT
10
- License-File: LICENSE
11
- Keywords: categorizer,image classification,llm,structured output,survey data,text classification
12
- Classifier: Development Status :: 4 - Beta
13
- Classifier: Programming Language :: Python
14
- Classifier: Programming Language :: Python :: 3.8
15
- Classifier: Programming Language :: Python :: 3.9
16
- Classifier: Programming Language :: Python :: 3.10
17
- Classifier: Programming Language :: Python :: 3.11
18
- Classifier: Programming Language :: Python :: 3.12
19
- Classifier: Programming Language :: Python :: Implementation :: CPython
20
- Classifier: Programming Language :: Python :: Implementation :: PyPy
21
- Requires-Python: >=3.8
22
- Requires-Dist: pandas
23
- Requires-Dist: tqdm
24
- Description-Content-Type: text/markdown
25
-
26
- ![catllm Logo](https://github.com/chrissoria/cat-llm/blob/main/images/logo.png?raw=True)
27
-
28
- # catllm
29
-
30
- [![PyPI - Version](https://img.shields.io/pypi/v/cat-llm.svg)](https://pypi.org/project/cat-llm)
31
- [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/cat-llm.svg)](https://pypi.org/project/cat-llm)
32
-
33
- -----
34
-
35
- ## Table of Contents
36
-
37
- - [Installation](#installation)
38
- - [License](#license)
39
-
40
- ## Installation
41
-
42
- ```console
43
- pip install cat-llm
44
- ```
45
-
46
- ## License
47
-
48
- `cat-llm` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.
cat_llm-0.0.32/README.md DELETED
@@ -1,23 +0,0 @@
1
- ![catllm Logo](https://github.com/chrissoria/cat-llm/blob/main/images/logo.png?raw=True)
2
-
3
- # catllm
4
-
5
- [![PyPI - Version](https://img.shields.io/pypi/v/cat-llm.svg)](https://pypi.org/project/cat-llm)
6
- [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/cat-llm.svg)](https://pypi.org/project/cat-llm)
7
-
8
- -----
9
-
10
- ## Table of Contents
11
-
12
- - [Installation](#installation)
13
- - [License](#license)
14
-
15
- ## Installation
16
-
17
- ```console
18
- pip install cat-llm
19
- ```
20
-
21
- ## License
22
-
23
- `cat-llm` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.
File without changes