cat-llm 0.0.33__tar.gz → 0.0.35__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- cat_llm-0.0.35/LICENSE +17 -0
- {cat_llm-0.0.33 → cat_llm-0.0.35}/PKG-INFO +172 -3
- cat_llm-0.0.35/README.md +374 -0
- {cat_llm-0.0.33 → cat_llm-0.0.35}/src/catllm/CERAD_functions.py +27 -8
- {cat_llm-0.0.33 → cat_llm-0.0.35}/src/catllm/__about__.py +1 -1
- {cat_llm-0.0.33 → cat_llm-0.0.35}/src/catllm/image_functions.py +0 -2
- cat_llm-0.0.33/LICENSE +0 -21
- cat_llm-0.0.33/README.md +0 -205
- {cat_llm-0.0.33 → cat_llm-0.0.35}/pyproject.toml +0 -0
- {cat_llm-0.0.33 → cat_llm-0.0.35}/src/catllm/__init__.py +0 -0
- {cat_llm-0.0.33 → cat_llm-0.0.35}/src/catllm/text_functions.py +0 -0
cat_llm-0.0.35/LICENSE
ADDED
@@ -0,0 +1,17 @@
+GNU License
+
+CatLLM is a framework for categorizing text and images in a structured output.
+Copyright (C) 2025 Christopher Soria
+
+This program is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program. If not, see <https://www.gnu.org/licenses/>.
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: cat-llm
-Version: 0.0.
+Version: 0.0.35
 Summary: A tool for categorizing text data and images using LLMs and vision models
 Project-URL: Documentation, https://github.com/chrissoria/cat-llm#readme
 Project-URL: Issues, https://github.com/chrissoria/cat-llm/issues
@@ -39,6 +39,12 @@ Description-Content-Type: text/markdown
 - [Configuration](#configuration)
 - [Supported Models](#supported-models)
 - [API Reference](#api-reference)
+  - [explore_corpus()](#explore_corpus)
+  - [explore_common_categories()](#explore_common_categories)
+  - [multi_class()](#multi_class)
+  - [image_score()](#image_score)
+  - [image_features()](#image_features)
+  - [cerad_drawn_score()](#cerad_drawn_score)
 - [Academic Research](#academic-research)
 - [License](#license)
@@ -180,7 +186,7 @@ print(categories)
 Performs multi-label classification of text responses into user-defined categories, returning structured results with optional CSV export.
 
 **Methodology:**
-Processes each text response individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows
+Processes each text response individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
 
 **Parameters:**
 - `survey_question` (str): The survey question being analyzed
@@ -221,10 +227,173 @@ move_reasons = cat.multi_class(
     api_key="OPENAI_API_KEY")
 ```
 
+### `image_multi_class()`
+
+Performs multi-label image classification into user-defined categories, returning structured results with optional CSV export.
+
+**Methodology:**
+Processes each image individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
+
+**Parameters:**
+- `image_description` (str): A description of what the model should expect to see
+- `image_input` (list): List of file paths or a folder to pull file paths from
+- `categories` (list): List of predefined categories for classification
+- `api_key` (str): API key for the LLM service
+- `user_model` (str, default="gpt-4o"): Specific model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `safety` (bool, default=False): Enable safety checks on responses and save to CSV at each API call step
+- `filename` (str, default="categorized_data.csv"): Filename for CSV output
+- `save_directory` (str, optional): Directory path to save the CSV file
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
+
+**Example:**
+
+```
+import catllm as cat
+
+user_categories = ["has a cat somewhere in it",
+                   "looks cartoonish",
+                   "Adrien Brody is in it"]
+
+description = "Should be an image of a child's drawing"
+
+image_categories = cat.image_multi_class(
+    image_description=description,
+    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
+    user_model="gpt-4o",
+    creativity=0,
+    categories=user_categories,
+    safety=True,
+    api_key="OPENAI_API_KEY")
+```
+
+### `image_score()`
+
+Performs quality scoring of images against a reference description, returning structured results with optional CSV export.
+
+**Methodology:**
+Processes each image individually, assigning a quality score on a 5-point scale based on similarity to the expected description:
+
+- **1**: No meaningful similarity (fundamentally different)
+- **2**: Barely recognizable similarity (25% match)
+- **3**: Partial match (50% key features)
+- **4**: Strong alignment (75% features)
+- **5**: Near-perfect match (90%+ similarity)
+
+Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
+
+**Parameters:**
+- `reference_image_description` (str): A description of what the model should expect to see
+- `image_input` (list): List of image file paths or folder path containing images
+- `reference_image` (str): A file path to the reference image
+- `api_key` (str): API key for the LLM service
+- `user_model` (str, default="gpt-4o"): Specific vision model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `safety` (bool, default=False): Enable safety checks and save results at each API call step
+- `filename` (str, default="image_scores.csv"): Filename for CSV output
+- `save_directory` (str, optional): Directory path to save the CSV file
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with image paths, quality scores, and analysis details
+
+**Example:**
+
+```
+import catllm as cat
+
+image_scores = cat.image_score(
+    reference_image_description='Adrien Brody sitting in a lawn chair',
+    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
+    user_model="gpt-4o",
+    creativity=0,
+    safety=True,
+    api_key="OPENAI_API_KEY")
+```
+
+### `image_features()`
+
+Extracts specific features and attributes from images, returning exact answers to user-defined questions (e.g., counts, colors, presence of objects).
+
+**Methodology:**
+Processes each image individually using vision models to extract precise information about specified features. Unlike scoring and multi-class functions, this returns factual data such as object counts, color identification, or presence/absence of specific elements. Supports flexible output formatting and optional CSV export for quantitative analysis workflows.
+
+**Parameters:**
+- `image_description` (str): A description of what the model should expect to see
+- `image_input` (list): List of image file paths or folder path containing images
+- `features_to_extract` (list): List of specific features to extract (e.g., ["number of people", "primary color", "contains text"])
+- `api_key` (str): API key for the LLM service
+- `user_model` (str, default="gpt-4o"): Specific vision model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `to_csv` (bool, default=False): Whether to save the output to a CSV file
+- `safety` (bool, default=False): Enable safety checks and save results at each API call step
+- `filename` (str, default="categorized_data.csv"): Filename for CSV output
+- `save_directory` (str, optional): Directory path to save the CSV file
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with image paths and extracted feature values for each specified attribute
+
+**Example:**
+
+```
+import catllm as cat
+
+extracted_features = cat.image_features(
+    image_description='An AI generated image of Spongebob dancing with Patrick',
+    features_to_extract=['Spongebob is yellow', 'Both are smiling', 'Patrick is chunky'],
+    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
+    model_source='OpenAI',
+    user_model="gpt-4o",
+    creativity=0,
+    safety=True,
+    api_key="OPENAI_API_KEY")
+```
+
+### `cerad_drawn_score()`
+
+Automatically scores drawings of circles, diamonds, overlapping rectangles, and cubes according to the official Consortium to Establish a Registry for Alzheimer's Disease (CERAD) scoring system, returning structured results with optional CSV export. Works even with images that contain other drawings or writing.
+
+**Methodology:**
+Processes each image individually, evaluating the drawn shapes based on CERAD criteria. Supports optional inclusion of reference shapes within images and can provide reference examples if requested. The function outputs standardized scores facilitating reproducible analysis and integrates optional safety checks and CSV export for research workflows.
+
+**Parameters:**
+- `shape` (str): The type of shape to score (e.g., "circle", "diamond", "overlapping rectangles", "cube")
+- `image_input` (list): List of image file paths or folder path containing images
+- `api_key` (str): API key for the LLM service
+- `user_model` (str, default="gpt-4o"): Specific model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `reference_in_image` (bool, default=False): Whether a reference shape is present in the image for comparison
+- `provide_reference` (bool, default=False): Whether to provide a reference example image or description
+- `safety` (bool, default=False): Enable safety checks and save results at each API call step
+- `filename` (str, default="categorized_data.csv"): Filename for CSV output
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with image paths, CERAD scores, and analysis details
+
+**Example:**
+
+```
+import catllm as cat
+
+diamond_scores = cat.cerad_drawn_score(
+    shape="diamond",
+    image_input=df['diamond_pic_path'],
+    api_key=open_ai_key,
+    safety=True,
+    filename="diamond_gpt_score.csv",
+)
+```
+
+
 ## Academic Research
 
 This package implements methodology from research on LLM performance in social science applications, including the UC Berkeley Social Networks Study. The package addresses reproducibility challenges in LLM-assisted research by providing standardized interfaces and consistent output formatting.
 
 ## License
 
-`cat-llm` is distributed under the terms of the [
+`cat-llm` is distributed under the terms of the [GNU](https://www.gnu.org/licenses/gpl-3.0.en.html) license.
cat_llm-0.0.35/README.md
ADDED
@@ -0,0 +1,374 @@
+
+
+# catllm
+
+[](https://pypi.org/project/cat-llm)
+[](https://pypi.org/project/cat-llm)
+
+-----
+
+## Table of Contents
+
+- [Installation](#installation)
+- [Quick Start](#quick-start)
+- [Configuration](#configuration)
+- [Supported Models](#supported-models)
+- [API Reference](#api-reference)
+  - [explore_corpus()](#explore_corpus)
+  - [explore_common_categories()](#explore_common_categories)
+  - [multi_class()](#multi_class)
+  - [image_score()](#image_score)
+  - [image_features()](#image_features)
+  - [cerad_drawn_score()](#cerad_drawn_score)
+- [Academic Research](#academic-research)
+- [License](#license)
+
+## Installation
+
+```console
+pip install cat-llm
+```
+
+## Quick Start
+
+The `explore_corpus` function extracts a list of all categories present in the corpus as identified by the model.
+
+```
+import catllm as cat
+import os
+
+categories = cat.explore_corpus(
+    survey_question="What motivates you most at work?",
+    survey_input=["flexible schedule", "good pay", "interesting projects"],
+    api_key="OPENAI_API_KEY",
+    cat_num=5,
+    divisions=10
+)
+print(categories)
+```
+
+## Configuration
+
+### Get Your OpenAI API Key
+
+1. **Create an OpenAI Developer Account**:
+   - Go to [platform.openai.com](https://platform.openai.com) (separate from regular ChatGPT)
+   - Sign up with email, Google, Microsoft, or Apple
+
+2. **Generate an API Key**:
+   - Log into your account and click your name in the top right corner
+   - Click "View API keys" or navigate to the "API keys" section
+   - Click "Create new secret key"
+   - Give your key a descriptive name
+   - Set permissions (choose "All" for full access)
+
+3. **Add Payment Details**:
+   - Add a payment method to your OpenAI account
+   - Purchase credits (start with $5 - it lasts a long time for most research use)
+   - **Important**: Your API key won't work without credits
+
+4. **Save Your Key Securely**:
+   - Copy the key immediately (you won't be able to see it again)
+   - Store it safely and never share it publicly
+
+5. Copy and paste your key into catllm in the `api_key` parameter
+
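Rather than hard-coding the key in scripts as the examples do for brevity, a safer pattern is to export it as an environment variable and read it at runtime. A minimal sketch using only the standard library; the `load_api_key` helper is not part of cat-llm, and the variable name simply mirrors the `OPENAI_API_KEY` placeholder used in the examples.

```python
import os

# Hypothetical helper (not part of cat-llm): read the key from the
# environment so it never appears in scripts or notebooks.
def load_api_key(var_name="OPENAI_API_KEY"):
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running")
    return key
```

Any catllm call can then pass `api_key=load_api_key()` instead of a literal string.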
+## Supported Models
+
+- **OpenAI**: GPT-4o, GPT-4, GPT-3.5-turbo, etc.
+- **Anthropic**: Claude 3.7 Sonnet, Claude Haiku, etc.
+- **Perplexity**: Sonar Large, Sonar Small, etc.
+- **Mistral**: Mistral Large, Mistral Small, etc.
+
+## API Reference
+
+### `explore_corpus()`
+
+Extracts categories from a corpus of text responses and returns frequency counts.
+
+**Methodology:**
+The function divides the corpus into random chunks to address the probabilistic nature of LLM outputs. By processing multiple chunks and averaging results across many API calls rather than relying on a single call, this approach significantly improves reproducibility and provides more stable categorical frequency estimates.
+
+**Parameters:**
+- `survey_question` (str): The survey question being analyzed
+- `survey_input` (list): List of text responses to categorize
+- `api_key` (str): API key for the LLM service
+- `cat_num` (int, default=10): Number of categories to extract in each iteration
+- `divisions` (int, default=5): Number of chunks to divide the data into (larger corpora might require larger divisions)
+- `specificity` (str, default="broad"): Category precision level (e.g., "broad", "narrow")
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+- `user_model` (str, default="gpt-4o"): Specific model (e.g., "gpt-4o", "claude-opus-4-20250514")
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `filename` (str, optional): Output file path for saving results
+
+**Returns:**
+- `pandas.DataFrame`: Two-column dataset with category names and frequencies
+
+**Example:**
+
+```
+import catllm as cat
+
+categories = cat.explore_corpus(
+    survey_question="What motivates you most at work?",
+    survey_input=["flexible schedule", "good pay", "interesting projects"],
+    api_key="OPENAI_API_KEY",
+    cat_num=5,
+    divisions=10
+)
+```
+
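Since `explore_corpus` returns a two-column frequency DataFrame, a common next step is ranking the categories and converting counts to shares. A sketch of that post-processing; the column names `category` and `frequency` are assumptions for illustration, not a documented part of the cat-llm output.

```python
import pandas as pd

# Stand-in for the DataFrame returned by explore_corpus; adjust the
# column names to match the actual output.
categories = pd.DataFrame({
    "category": ["good pay", "flexible schedule", "interesting projects"],
    "frequency": [14, 9, 5],
})

# Rank categories and convert raw counts to shares of all mentions
ranked = categories.sort_values("frequency", ascending=False).reset_index(drop=True)
ranked["share"] = ranked["frequency"] / ranked["frequency"].sum()
```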
+### `explore_common_categories()`
+
+Identifies the most frequently occurring categories across a text corpus and returns the top N categories by frequency count.
+
+**Methodology:**
+Divides the corpus into random chunks and averages results across multiple API calls to improve reproducibility and provide stable frequency estimates for the most prevalent categories, addressing the probabilistic nature of LLM outputs.
+
+**Parameters:**
+- `survey_question` (str): Survey question being analyzed
+- `survey_input` (list): Text responses to categorize
+- `api_key` (str): API key for the LLM service
+- `top_n` (int, default=10): Number of top categories to return by frequency
+- `cat_num` (int, default=10): Number of categories to extract per iteration
+- `divisions` (int, default=5): Number of data chunks (increase for larger corpora)
+- `user_model` (str, default="gpt-4o"): Specific model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `specificity` (str, default="broad"): Category precision level ("broad", "narrow")
+- `research_question` (str, optional): Contextual research question to guide categorization
+- `filename` (str, optional): File path to save output dataset
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: Dataset with category names and frequencies, limited to top N most common categories
+
+**Example:**
+
+```
+import catllm as cat
+
+top_10_categories = cat.explore_common_categories(
+    survey_question="What motivates you most at work?",
+    survey_input=["flexible schedule", "good pay", "interesting projects"],
+    api_key="OPENAI_API_KEY",
+    top_n=10,
+    cat_num=5,
+    divisions=10
+)
+print(top_10_categories)
+```
+
+### `multi_class()`
+
+Performs multi-label classification of text responses into user-defined categories, returning structured results with optional CSV export.
+
+**Methodology:**
+Processes each text response individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
+
+**Parameters:**
+- `survey_question` (str): The survey question being analyzed
+- `survey_input` (list): List of text responses to classify
+- `categories` (list): List of predefined categories for classification
+- `api_key` (str): API key for the LLM service
+- `user_model` (str, default="gpt-4o"): Specific model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `safety` (bool, default=False): Enable safety checks on responses and save to CSV at each API call step
+- `filename` (str, default="categorized_data.csv"): Filename for CSV output
+- `save_directory` (str, optional): Directory path to save the CSV file
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
+
+**Example:**
+
+```
+import catllm as cat
+
+user_categories = ["to start living with or to stay with partner/spouse",
+                   "relationship change (divorce, breakup, etc)",
+                   "the person had a job or school or career change, including transferred and retired",
+                   "the person's partner's job or school or career change, including transferred and retired",
+                   "financial reasons (rent is too expensive, pay raise, etc)",
+                   "related specifically to features of the home, such as a bigger or smaller yard"]
+
+question = "Why did you move?"
+
+move_reasons = cat.multi_class(
+    survey_question=question,
+    survey_input=df["column1"],
+    user_model="gpt-4o",
+    creativity=0,
+    categories=user_categories,
+    safety=True,
+    api_key="OPENAI_API_KEY")
+```
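For downstream analysis of a `multi_class` result, one typical step is computing how often each category was assigned. The layout sketched here (one 0/1 indicator column per category, one row per response) is an assumption for illustration, not a documented guarantee of the cat-llm output.

```python
import pandas as pd

# Assumed shape of the multi_class result: one 0/1 indicator column
# per category, one row per survey response.
move_reasons = pd.DataFrame({
    "financial reasons": [1, 0, 1, 1],
    "relationship change": [0, 1, 0, 0],
})

# Fraction of responses tagged with each category, most common first
prevalence = move_reasons.mean().sort_values(ascending=False)
```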
+
+### `image_multi_class()`
+
+Performs multi-label image classification into user-defined categories, returning structured results with optional CSV export.
+
+**Methodology:**
+Processes each image individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
+
+**Parameters:**
+- `image_description` (str): A description of what the model should expect to see
+- `image_input` (list): List of file paths or a folder to pull file paths from
+- `categories` (list): List of predefined categories for classification
+- `api_key` (str): API key for the LLM service
+- `user_model` (str, default="gpt-4o"): Specific model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `safety` (bool, default=False): Enable safety checks on responses and save to CSV at each API call step
+- `filename` (str, default="categorized_data.csv"): Filename for CSV output
+- `save_directory` (str, optional): Directory path to save the CSV file
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
+
+**Example:**
+
+```
+import catllm as cat
+
+user_categories = ["has a cat somewhere in it",
+                   "looks cartoonish",
+                   "Adrien Brody is in it"]
+
+description = "Should be an image of a child's drawing"
+
+image_categories = cat.image_multi_class(
+    image_description=description,
+    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
+    user_model="gpt-4o",
+    creativity=0,
+    categories=user_categories,
+    safety=True,
+    api_key="OPENAI_API_KEY")
+```
+
+### `image_score()`
+
+Performs quality scoring of images against a reference description, returning structured results with optional CSV export.
+
+**Methodology:**
+Processes each image individually, assigning a quality score on a 5-point scale based on similarity to the expected description:
+
+- **1**: No meaningful similarity (fundamentally different)
+- **2**: Barely recognizable similarity (25% match)
+- **3**: Partial match (50% key features)
+- **4**: Strong alignment (75% features)
+- **5**: Near-perfect match (90%+ similarity)
+
+Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
+
+**Parameters:**
+- `reference_image_description` (str): A description of what the model should expect to see
+- `image_input` (list): List of image file paths or folder path containing images
+- `reference_image` (str): A file path to the reference image
+- `api_key` (str): API key for the LLM service
+- `user_model` (str, default="gpt-4o"): Specific vision model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `safety` (bool, default=False): Enable safety checks and save results at each API call step
+- `filename` (str, default="image_scores.csv"): Filename for CSV output
+- `save_directory` (str, optional): Directory path to save the CSV file
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with image paths, quality scores, and analysis details
+
+**Example:**
+
+```
+import catllm as cat
+
+image_scores = cat.image_score(
+    reference_image_description='Adrien Brody sitting in a lawn chair',
+    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
+    user_model="gpt-4o",
+    creativity=0,
+    safety=True,
+    api_key="OPENAI_API_KEY")
+```
+
+### `image_features()`
+
+Extracts specific features and attributes from images, returning exact answers to user-defined questions (e.g., counts, colors, presence of objects).
+
+**Methodology:**
+Processes each image individually using vision models to extract precise information about specified features. Unlike scoring and multi-class functions, this returns factual data such as object counts, color identification, or presence/absence of specific elements. Supports flexible output formatting and optional CSV export for quantitative analysis workflows.
+
+**Parameters:**
+- `image_description` (str): A description of what the model should expect to see
+- `image_input` (list): List of image file paths or folder path containing images
+- `features_to_extract` (list): List of specific features to extract (e.g., ["number of people", "primary color", "contains text"])
+- `api_key` (str): API key for the LLM service
+- `user_model` (str, default="gpt-4o"): Specific vision model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `to_csv` (bool, default=False): Whether to save the output to a CSV file
+- `safety` (bool, default=False): Enable safety checks and save results at each API call step
+- `filename` (str, default="categorized_data.csv"): Filename for CSV output
+- `save_directory` (str, optional): Directory path to save the CSV file
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with image paths and extracted feature values for each specified attribute
+
+**Example:**
+
+```
+import catllm as cat
+
+extracted_features = cat.image_features(
+    image_description='An AI generated image of Spongebob dancing with Patrick',
+    features_to_extract=['Spongebob is yellow', 'Both are smiling', 'Patrick is chunky'],
+    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
+    model_source='OpenAI',
+    user_model="gpt-4o",
+    creativity=0,
+    safety=True,
+    api_key="OPENAI_API_KEY")
+```
+
+### `cerad_drawn_score()`
+
+Automatically scores drawings of circles, diamonds, overlapping rectangles, and cubes according to the official Consortium to Establish a Registry for Alzheimer's Disease (CERAD) scoring system, returning structured results with optional CSV export. Works even with images that contain other drawings or writing.
+
+**Methodology:**
+Processes each image individually, evaluating the drawn shapes based on CERAD criteria. Supports optional inclusion of reference shapes within images and can provide reference examples if requested. The function outputs standardized scores facilitating reproducible analysis and integrates optional safety checks and CSV export for research workflows.
+
+**Parameters:**
+- `shape` (str): The type of shape to score (e.g., "circle", "diamond", "overlapping rectangles", "cube")
+- `image_input` (list): List of image file paths or folder path containing images
+- `api_key` (str): API key for the LLM service
+- `user_model` (str, default="gpt-4o"): Specific model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `reference_in_image` (bool, default=False): Whether a reference shape is present in the image for comparison
+- `provide_reference` (bool, default=False): Whether to provide a reference example image or description
+- `safety` (bool, default=False): Enable safety checks and save results at each API call step
+- `filename` (str, default="categorized_data.csv"): Filename for CSV output
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with image paths, CERAD scores, and analysis details
+
+**Example:**
+
+```
+import catllm as cat
+
+diamond_scores = cat.cerad_drawn_score(
+    shape="diamond",
+    image_input=df['diamond_pic_path'],
+    api_key=open_ai_key,
+    safety=True,
+    filename="diamond_gpt_score.csv",
+)
+```
+
+
+## Academic Research
+
+This package implements methodology from research on LLM performance in social science applications, including the UC Berkeley Social Networks Study. The package addresses reproducibility challenges in LLM-assisted research by providing standardized interfaces and consistent output formatting.
+
+## License
+
+`cat-llm` is distributed under the terms of the [GNU](https://www.gnu.org/licenses/gpl-3.0.en.html) license.
|
@@ -41,9 +41,10 @@ def cerad_drawn_score(
      import glob
      import base64
      from pathlib import Path
+     import pkg_resources

      shape = shape.lower()
-
+     shape = "rectangles" if shape == "overlapping rectangles" else shape
      if shape == "circle":
          categories = ["The image contains a drawing that clearly represents a circle",
          "The image does NOT contain any drawing that resembles a circle",
@@ -107,6 +108,16 @@ def cerad_drawn_score(
      cat_num = len(categories)
      category_dict = {str(i+1): "0" for i in range(cat_num)}
      example_JSON = json.dumps(category_dict, indent=4)
+     # pulling in the packaged reference image if provided
+     if provide_reference:
+         reference_image = pkg_resources.resource_filename(
+             'catllm',
+             f'images/{shape}.png'  # e.g., "circle.png"
+         )
+         ext = Path(reference_image).suffix[1:]
+         with open(reference_image, 'rb') as f:
+             encoded_ref = base64.b64encode(f.read()).decode('utf-8')
+         encoded_ref_image = f"data:image/{ext};base64,{encoded_ref}"

      link1 = []
      extracted_jsons = []
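The added lines above load a packaged reference image and encode it as a base64 data URL for the vision API. A standalone sketch of just that encoding step (the demo file name here is a throwaway stand-in, not part of the package):

```python
import base64
from pathlib import Path

def to_data_url(path: str) -> str:
    # Build a "data:image/<ext>;base64,<payload>" URL, mirroring the hunk above
    ext = Path(path).suffix[1:]  # e.g. "png" from "circle.png"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:image/{ext};base64,{encoded}"

# demo with a throwaway file
demo = Path("demo.png")
demo.write_bytes(b"\x89PNG\r\n")
print(to_data_url("demo.png"))  # → data:image/png;base64,iVBORw0K
demo.unlink()
```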
@@ -146,13 +157,21 @@ def cerad_drawn_score(
              f"No additional keys, comments, or text.\n\n"
              f"Example:\n"
              f"{example_JSON}"
-             )
-
-         {
-             "type": "image_url",
-             "image_url": {"url": encoded_image, "detail": "high"},
-         }
+             )
+         }
      ]
+     # Conditionally add the reference image before the target image
+     if provide_reference:
+         prompt.append({
+             "type": "image_url",
+             "image_url": {"url": encoded_ref_image, "detail": "high"}
+         })
+
+     prompt.append({
+         "type": "image_url",
+         "image_url": {"url": encoded_image, "detail": "high"}
+     })
+
      elif model_source == "Anthropic":
          prompt = [
          {
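The rewritten hunk builds the OpenAI-style prompt incrementally: the text part first, an optional reference image, then the target image. A minimal sketch of that pattern (the function and variable names are illustrative, not the package's API):

```python
def build_prompt(text, image_url, reference_url=None):
    # Text part first, optional reference image, then the target image
    prompt = [{"type": "text", "text": text}]
    if reference_url is not None:  # mirrors the provide_reference branch
        prompt.append({"type": "image_url",
                       "image_url": {"url": reference_url, "detail": "high"}})
    prompt.append({"type": "image_url",
                   "image_url": {"url": image_url, "detail": "high"}})
    return prompt

p = build_prompt("Score this drawing.", "data:image/png;base64,AAAA")
print(len(p))  # → 2
```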
@@ -347,7 +366,7 @@ def cerad_drawn_score(
      categorized_data['score'] = categorized_data['diamond_4_sides'] + categorized_data['diamond_equal_sides'] + categorized_data['similar']

      categorized_data.loc[categorized_data['none'] == 1, 'score'] = 0
-     categorized_data.loc[(categorized_data['diamond_square'] == 1) & (categorized_data['score'] == 0), 'score'] = 2
+     #categorized_data.loc[(categorized_data['diamond_square'] == 1) & (categorized_data['score'] == 0), 'score'] = 2

      elif shape == "rectangles" or shape == "overlapping rectangles":

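The diamond branch above sums three indicator columns into a 0-3 score and forces the score to 0 when no diamond-like drawing was detected. A small pandas sketch of that rule (toy data; the column names follow the diff):

```python
import pandas as pd

df = pd.DataFrame({
    "diamond_4_sides":     [1, 1, 0],
    "diamond_equal_sides": [1, 0, 0],
    "similar":             [1, 1, 0],
    "none":                [0, 0, 1],
})
# Sum the indicator columns, then zero out rows where nothing was recognized
df["score"] = df["diamond_4_sides"] + df["diamond_equal_sides"] + df["similar"]
df.loc[df["none"] == 1, "score"] = 0
print(df["score"].tolist())  # → [3, 2, 0]
```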
@@ -4,7 +4,6 @@ def image_multi_class(
      image_input,
      categories,
      api_key,
-     columns="numbered",
      user_model="gpt-4o",
      creativity=0,
      to_csv=False,
@@ -508,7 +507,6 @@ def image_features(
      image_input,
      features_to_extract,
      api_key,
-     columns="numbered",
      user_model="gpt-4o-2024-11-20",
      creativity=0,
      to_csv=False,
cat_llm-0.0.33/LICENSE
DELETED
@@ -1,21 +0,0 @@
- MIT License
-
- Copyright (c) 2025 Christopher Soria
-
- Permission is hereby granted, free of charge, to any person obtaining a copy
- of this software and associated documentation files (the "Software"), to deal
- in the Software without restriction, including without limitation the rights
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- copies of the Software, and to permit persons to whom the Software is
- furnished to do so, subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- SOFTWARE.
cat_llm-0.0.33/README.md
DELETED
@@ -1,205 +0,0 @@
-
-
- # catllm
-
- [](https://pypi.org/project/cat-llm)
- [](https://pypi.org/project/cat-llm)
-
- -----
-
- ## Table of Contents
-
- - [Installation](#installation)
- - [Quick Start](#quick-start)
- - [Configuration](#configuration)
- - [Supported Models](#supported-models)
- - [API Reference](#api-reference)
- - [Academic Research](#academic-research)
- - [License](#license)
-
- ## Installation
-
- ```console
- pip install cat-llm
- ```
-
- ## Quick Start
-
- The `explore_corpus` function extracts a list of all categories present in the corpus as identified by the model.
- ```
- import catllm as cat
- import os
-
- categories = cat.explore_corpus(
-     survey_question="What motivates you most at work?",
-     survey_input=["flexible schedule", "good pay", "interesting projects"],
-     api_key="OPENAI_API_KEY",
-     cat_num=5,
-     divisions=10
- )
- print(categories)
- ```
-
- ## Configuration
-
- ### Get Your OpenAI API Key
-
- 1. **Create an OpenAI Developer Account**:
-    - Go to [platform.openai.com](https://platform.openai.com) (separate from regular ChatGPT)
-    - Sign up with email, Google, Microsoft, or Apple
-
- 2. **Generate an API Key**:
-    - Log into your account and click your name in the top right corner
-    - Click "View API keys" or navigate to the "API keys" section
-    - Click "Create new secret key"
-    - Give your key a descriptive name
-    - Set permissions (choose "All" for full access)
-
- 3. **Add Payment Details**:
-    - Add a payment method to your OpenAI account
-    - Purchase credits (start with $5 - it lasts a long time for most research use)
-    - **Important**: Your API key won't work without credits
-
- 4. **Save Your Key Securely**:
-    - Copy the key immediately (you won't be able to see it again)
-    - Store it safely and never share it publicly
-
- 5. Copy and paste your key into catllm in the api_key parameter
-
- ## Supported Models
-
- - **OpenAI**: GPT-4o, GPT-4, GPT-3.5-turbo, etc.
- - **Anthropic**: Claude Sonnet 3.7, Claude Haiku, etc.
- - **Perplexity**: Sonnar Large, Sonnar Small, etc.
- - **Mistral**: Mistral Large, Mistral Small, etc.
-
- ## API Reference
-
- ### `explore_corpus()`
-
- Extracts categories from a corpus of text responses and returns frequency counts.
-
- **Methodology:**
- The function divides the corpus into random chunks to address the probabilistic nature of LLM outputs. By processing multiple chunks and averaging results across many API calls rather than relying on a single call, this approach significantly improves reproducibility and provides more stable categorical frequency estimates.
-
- **Parameters:**
- - `survey_question` (str): The survey question being analyzed
- - `survey_input` (list): List of text responses to categorize
- - `api_key` (str): API key for the LLM service
- - `cat_num` (int, default=10): Number of categories to extract in each iteration
- - `divisions` (int, default=5): Number of chunks to divide the data into (larger corpora might require larger divisions)
- - `specificity` (str, default="broad"): Category precision level (e.g., "broad", "narrow")
- - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
- - `user_model` (str, default="got-4o"): Specific model (e.g., "gpt-4o", "claude-opus-4-20250514")
- - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- - `filename` (str, optional): Output file path for saving results
-
- **Returns:**
- - `pandas.DataFrame`: Two-column dataset with category names and frequencies
-
- **Example:***
-
- ```
- import catllm as cat
-
- categories = cat.explore_corpus(
-     survey_question="What motivates you most at work?",
-     survey_input=["flexible schedule", "good pay", "interesting projects"],
-     api_key="OPENAI_API_KEY",
-     cat_num=5,
-     divisions=10
- )
- ```
-
- ### `explore_common_categories()`
-
- Identifies the most frequently occurring categories across a text corpus and returns the top N categories by frequency count.
-
- **Methodology:**
- Divides the corpus into random chunks and averages results across multiple API calls to improve reproducibility and provide stable frequency estimates for the most prevalent categories, addressing the probabilistic nature of LLM outputs.
-
- **Parameters:**
- - `survey_question` (str): Survey question being analyzed
- - `survey_input` (list): Text responses to categorize
- - `api_key` (str): API key for the LLM service
- - `top_n` (int, default=10): Number of top categories to return by frequency
- - `cat_num` (int, default=10): Number of categories to extract per iteration
- - `divisions` (int, default=5): Number of data chunks (increase for larger corpora)
- - `user_model` (str, default="gpt-4o"): Specific model to use
- - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- - `specificity` (str, default="broad"): Category precision level ("broad", "narrow")
- - `research_question` (str, optional): Contextual research question to guide categorization
- - `filename` (str, optional): File path to save output dataset
- - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
-
- **Returns:**
- - `pandas.DataFrame`: Dataset with category names and frequencies, limited to top N most common categories
-
- **Example:**
-
- ```
- import catllm as cat
-
- top_10_categories = cat.explore_common_categories(
-     survey_question="What motivates you most at work?",
-     survey_input=["flexible schedule", "good pay", "interesting projects"],
-     api_key="OPENAI_API_KEY",
-     top_n=10,
-     cat_num=5,
-     divisions=10
- )
- print(categories)
- ```
- ### `multi_class()`
-
- Performs multi-label classification of text responses into user-defined categories, returning structured results with optional CSV export.
-
- **Methodology:**
- Processes each text response individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows[2].
-
- **Parameters:**
- - `survey_question` (str): The survey question being analyzed
- - `survey_input` (list): List of text responses to classify
- - `categories` (list): List of predefined categories for classification
- - `api_key` (str): API key for the LLM service
- - `user_model` (str, default="gpt-4o"): Specific model to use
- - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- - `safety` (bool, default=False): Enable safety checks on responses and saves to CSV at each API call step
- - `filename` (str, default="categorized_data.csv"): Filename for CSV output
- - `save_directory` (str, optional): Directory path to save the CSV file
- - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
-
- **Returns:**
- - `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
-
- **Example:**
-
- ```
- import catllm as cat
-
- user_categories = ["to start living with or to stay with partner/spouse",
- "relationship change (divorce, breakup, etc)",
- "the person had a job or school or career change, including transferred and retired",
- "the person's partner's job or school or career change, including transferred and retired",
- "financial reasons (rent is too expensive, pay raise, etc)",
- "related specifically features of the home, such as a bigger or smaller yard"]
-
- question = "Why did you move?"
-
- move_reasons = cat.multi_class(
-     survey_question=question,
-     survey_input= df[column1],
-     user_model="gpt-4o",
-     creativity=0,
-     categories=user_categories,
-     safety =TRUE,
-     api_key="OPENAI_API_KEY")
- ```
-
- ## Academic Research
-
- This package implements methodology from research on LLM performance in social science applications, including the UC Berkeley Social Networks Study. The package addresses reproducibility challenges in LLM-assisted research by providing standardized interfaces and consistent output formatting.
-
- ## License
-
- `cat-llm` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.
File without changes
File without changes
File without changes