cat-llm 0.0.31__py3-none-any.whl → 0.0.33__py3-none-any.whl
- cat_llm-0.0.33.dist-info/METADATA +230 -0
- cat_llm-0.0.33.dist-info/RECORD +9 -0
- catllm/__about__.py +1 -1
- catllm/__init__.py +1 -1
- catllm/image_functions.py +3 -3
- catllm/{cat_llm.py → text_functions.py} +2 -4
- cat_llm-0.0.31.dist-info/METADATA +0 -48
- cat_llm-0.0.31.dist-info/RECORD +0 -9
- {cat_llm-0.0.31.dist-info → cat_llm-0.0.33.dist-info}/WHEEL +0 -0
- {cat_llm-0.0.31.dist-info → cat_llm-0.0.33.dist-info}/licenses/LICENSE +0 -0
@@ -0,0 +1,230 @@
+Metadata-Version: 2.4
+Name: cat-llm
+Version: 0.0.33
+Summary: A tool for categorizing text data and images using LLMs and vision models
+Project-URL: Documentation, https://github.com/chrissoria/cat-llm#readme
+Project-URL: Issues, https://github.com/chrissoria/cat-llm/issues
+Project-URL: Source, https://github.com/chrissoria/cat-llm
+Author-email: Christopher Soria <chrissoria@berkeley.edu>
+License-Expression: MIT
+License-File: LICENSE
+Keywords: categorizer,image classification,llm,structured output,survey data,text classification
+Classifier: Development Status :: 4 - Beta
+Classifier: Programming Language :: Python
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: Implementation :: CPython
+Classifier: Programming Language :: Python :: Implementation :: PyPy
+Requires-Python: >=3.8
+Requires-Dist: pandas
+Requires-Dist: tqdm
+Description-Content-Type: text/markdown
+
+
+
+# catllm
+
+[](https://pypi.org/project/cat-llm)
+[](https://pypi.org/project/cat-llm)
+
+-----
+
+## Table of Contents
+
+- [Installation](#installation)
+- [Quick Start](#quick-start)
+- [Configuration](#configuration)
+- [Supported Models](#supported-models)
+- [API Reference](#api-reference)
+- [Academic Research](#academic-research)
+- [License](#license)
+
+## Installation
+
+```console
+pip install cat-llm
+```
+
+## Quick Start
+
+The `explore_corpus` function extracts a list of all categories present in the corpus as identified by the model.
+```
+import catllm as cat
+import os
+
+categories = cat.explore_corpus(
+    survey_question="What motivates you most at work?",
+    survey_input=["flexible schedule", "good pay", "interesting projects"],
+    api_key="OPENAI_API_KEY",
+    cat_num=5,
+    divisions=10
+)
+print(categories)
+```
+
+## Configuration
+
+### Get Your OpenAI API Key
+
+1. **Create an OpenAI Developer Account**:
+   - Go to [platform.openai.com](https://platform.openai.com) (separate from regular ChatGPT)
+   - Sign up with email, Google, Microsoft, or Apple
+
+2. **Generate an API Key**:
+   - Log in to your account and click your name in the top right corner
+   - Click "View API keys" or navigate to the "API keys" section
+   - Click "Create new secret key"
+   - Give your key a descriptive name
+   - Set permissions (choose "All" for full access)
+
+3. **Add Payment Details**:
+   - Add a payment method to your OpenAI account
+   - Purchase credits (start with $5, which lasts a long time for most research use)
+   - **Important**: Your API key won't work without credits
+
+4. **Save Your Key Securely**:
+   - Copy the key immediately (you won't be able to see it again)
+   - Store it safely and never share it publicly
+
+5. **Use Your Key**: Pass it to catllm through the `api_key` parameter
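One way to follow the "never share it publicly" advice is to keep the key out of scripts and read it from an environment variable. The sketch below is an assumption about workflow, not part of catllm: `load_api_key` is a hypothetical helper, and `OPENAI_API_KEY` is simply a conventional variable name.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Hypothetical helper (not part of catllm): raises instead of
    silently passing an empty key to the API.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before calling catllm."
        )
    return key
```

The returned string can then be passed as `api_key=load_api_key()` in any of the calls documented in the README.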
+
+## Supported Models
+
+- **OpenAI**: GPT-4o, GPT-4, GPT-3.5-turbo, etc.
+- **Anthropic**: Claude Sonnet 3.7, Claude Haiku, etc.
+- **Perplexity**: Sonar Large, Sonar Small, etc.
+- **Mistral**: Mistral Large, Mistral Small, etc.
+
+## API Reference
+
+### `explore_corpus()`
+
+Extracts categories from a corpus of text responses and returns frequency counts.
+
+**Methodology:**
+The function divides the corpus into random chunks to address the probabilistic nature of LLM outputs. By processing multiple chunks and averaging results across many API calls rather than relying on a single call, this approach improves reproducibility and provides more stable categorical frequency estimates.
+
+**Parameters:**
+- `survey_question` (str): The survey question being analyzed
+- `survey_input` (list): List of text responses to categorize
+- `api_key` (str): API key for the LLM service
+- `cat_num` (int, default=10): Number of categories to extract in each iteration
+- `divisions` (int, default=5): Number of chunks to divide the data into (larger corpora might require more divisions)
+- `specificity` (str, default="broad"): Category precision level (e.g., "broad", "narrow")
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+- `user_model` (str, default="gpt-4o"): Specific model (e.g., "gpt-4o", "claude-opus-4-20250514")
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `filename` (str, optional): Output file path for saving results
+
+**Returns:**
+- `pandas.DataFrame`: Two-column dataset with category names and frequencies
+
+**Example:**
+
+```
+import catllm as cat
+
+categories = cat.explore_corpus(
+    survey_question="What motivates you most at work?",
+    survey_input=["flexible schedule", "good pay", "interesting projects"],
+    api_key="OPENAI_API_KEY",
+    cat_num=5,
+    divisions=10
+)
+```
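The chunk-based methodology described for `explore_corpus()` can be illustrated with a small sketch. This is not the package's implementation: `chunked_category_counts` and `toy_extractor` are hypothetical names, the extractor stands in for the LLM call, and the sketch tallies how often each category is proposed across chunks, which is only one plausible reading of the "frequencies" the README mentions.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def chunked_category_counts(
    responses: Sequence[str],
    extract_categories: Callable[[List[str]], List[str]],
    divisions: int = 5,
    seed: int = 0,
) -> Counter:
    """Shuffle responses, split them into chunks, and tally categories per chunk."""
    shuffled = list(responses)
    random.Random(seed).shuffle(shuffled)  # seeded so repeated runs agree
    # Never create more chunks than there are responses (and at least one).
    divisions = max(1, min(divisions, len(shuffled)))
    chunks = [shuffled[i::divisions] for i in range(divisions)]
    counts: Counter = Counter()
    for chunk in chunks:
        # In real use this would be one model call per chunk.
        counts.update(extract_categories(chunk))
    return counts


def toy_extractor(chunk: List[str]) -> List[str]:
    """Stand-in for the LLM: the distinct first words of the chunk's responses."""
    return sorted({text.split()[0].lower() for text in chunk})
```

A category that surfaces in many independent chunks gets a higher count than one that a single call happened to produce, which is the stabilising effect the methodology paragraph describes.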
+
+### `explore_common_categories()`
+
+Identifies the most frequently occurring categories across a text corpus and returns the top N categories by frequency count.
+
+**Methodology:**
+Divides the corpus into random chunks and averages results across multiple API calls to improve reproducibility and provide stable frequency estimates for the most prevalent categories, addressing the probabilistic nature of LLM outputs.
+
+**Parameters:**
+- `survey_question` (str): Survey question being analyzed
+- `survey_input` (list): Text responses to categorize
+- `api_key` (str): API key for the LLM service
+- `top_n` (int, default=10): Number of top categories to return by frequency
+- `cat_num` (int, default=10): Number of categories to extract per iteration
+- `divisions` (int, default=5): Number of data chunks (increase for larger corpora)
+- `user_model` (str, default="gpt-4o"): Specific model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `specificity` (str, default="broad"): Category precision level ("broad", "narrow")
+- `research_question` (str, optional): Contextual research question to guide categorization
+- `filename` (str, optional): File path to save output dataset
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: Dataset with category names and frequencies, limited to top N most common categories
+
+**Example:**
+
+```
+import catllm as cat
+
+top_10_categories = cat.explore_common_categories(
+    survey_question="What motivates you most at work?",
+    survey_input=["flexible schedule", "good pay", "interesting projects"],
+    api_key="OPENAI_API_KEY",
+    top_n=10,
+    cat_num=5,
+    divisions=10
+)
+print(top_10_categories)
+```
+### `multi_class()`
+
+Performs multi-label classification of text responses into user-defined categories, returning structured results with optional CSV export.
+
+**Methodology:**
+Processes each text response individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
+
+**Parameters:**
+- `survey_question` (str): The survey question being analyzed
+- `survey_input` (list): List of text responses to classify
+- `categories` (list): List of predefined categories for classification
+- `api_key` (str): API key for the LLM service
+- `user_model` (str, default="gpt-4o"): Specific model to use
+- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
+- `safety` (bool, default=False): Enables safety checks on responses and saves to CSV after each API call
+- `filename` (str, default="categorized_data.csv"): Filename for CSV output
+- `save_directory` (str, optional): Directory path to save the CSV file
+- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
+
+**Returns:**
+- `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
+
+**Example:**
+
+```
+import catllm as cat
+
+user_categories = ["to start living with or to stay with partner/spouse",
+                   "relationship change (divorce, breakup, etc)",
+                   "the person had a job or school or career change, including transferred and retired",
+                   "the person's partner's job or school or career change, including transferred and retired",
+                   "financial reasons (rent is too expensive, pay raise, etc)",
+                   "related specifically to features of the home, such as a bigger or smaller yard"]
+
+question = "Why did you move?"
+
+move_reasons = cat.multi_class(
+    survey_question=question,
+    survey_input=df[column1],  # assumes `df` is a pandas DataFrame and `column1` names the response column
+    user_model="gpt-4o",
+    creativity=0,
+    categories=user_categories,
+    safety=True,
+    api_key="OPENAI_API_KEY")
+```
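Multi-label output is often easiest to analyse as one 0/1 flag per category. The sketch below shows that shape under stated assumptions: `to_indicator_rows` and the `category_N` column names are hypothetical, and the diff does not show how `multi_class()` actually lays out its DataFrame.

```python
from typing import Dict, List, Sequence


def to_indicator_rows(
    responses: Sequence[str],
    assigned: Sequence[Sequence[int]],
    categories: Sequence[str],
) -> List[Dict[str, object]]:
    """Build one row per response with a 0/1 flag for every category.

    `assigned[i]` holds the 1-based positions of the categories chosen
    for `responses[i]`; a response may have several, or none.
    """
    rows: List[Dict[str, object]] = []
    for text, picks in zip(responses, assigned):
        chosen = set(picks)
        row: Dict[str, object] = {"response": text}
        for index, _name in enumerate(categories, start=1):
            # Hypothetical column naming; one flag per category position.
            row[f"category_{index}"] = 1 if index in chosen else 0
        rows.append(row)
    return rows
```

A list of such rows can be handed to `pandas.DataFrame(rows)` for tabulation or export.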
+
+## Academic Research
+
+This package implements methodology from research on LLM performance in social science applications, including the UC Berkeley Social Networks Study. The package addresses reproducibility challenges in LLM-assisted research by providing standardized interfaces and consistent output formatting.
+
+## License
+
+`cat-llm` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.
@@ -0,0 +1,9 @@
+catllm/CERAD_functions.py,sha256=fiSiBnCcFgNp5XmGhZULnToEoMyP5z6JMcH-aWC8q5o,18787
+catllm/__about__.py,sha256=QD4n_jc9pZ_DH4rnRx892q9STG4YDuOKqS8li05uQnw,404
+catllm/__init__.py,sha256=BpAG8nPhM3ZQRd0WqkubI_36-VCOs4eCYtGVgzz48Bs,337
+catllm/image_functions.py,sha256=9e4V1IEMZUFrH00yEjyowwTUKeXWGsln0U1iQ-DELTY,31359
+catllm/text_functions.py,sha256=K6oetWYk25PwsllWSZP4cFrz7kyxJg0plPRvpmQkCsU,16846
+cat_llm-0.0.33.dist-info/METADATA,sha256=XiSskbffmKcIABIrm7vnJqJmDgZGOh6Qi_JABdU5Uls,9260
+cat_llm-0.0.33.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+cat_llm-0.0.33.dist-info/licenses/LICENSE,sha256=wJLsvOr6lrFUDcoPXExa01HOKFWrS3JC9f0RudRw8uw,1075
+cat_llm-0.0.33.dist-info/RECORD,,
catllm/__about__.py
CHANGED
catllm/__init__.py
CHANGED
catllm/image_functions.py
CHANGED
@@ -72,7 +72,7 @@ def image_multi_class(
 
 # Handle extension safely
 ext = Path(img_path).suffix.lstrip(".").lower()
-if model_source == "OpenAI":
+if model_source == "OpenAI" or model_source == "Mistral":
 encoded_image = f"data:image/{ext};base64,{encoded}"
 prompt = [
 {
@@ -309,7 +309,7 @@ def image_score(
 ext = Path(img_path).suffix.lstrip(".").lower()
 encoded_image = f"data:image/{ext};base64,{encoded}"
 
-if model_source == "OpenAI":
+if model_source == "OpenAI" or model_source == "Mistral":
 prompt = [
 {
 "type": "text",
@@ -575,7 +575,7 @@ def image_features(
 ext = Path(img_path).suffix.lstrip(".").lower()
 encoded_image = f"data:image/{ext};base64,{encoded}"
 
-if model_source == "OpenAI":
+if model_source == "OpenAI" or model_source == "Mistral":
 prompt = [
 {
 "type": "text",
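The branch touched in the `image_functions.py` hunks builds a base64 data URL from the image path. A minimal sketch of that encoding step follows, assuming `encoded` is the base64 text of the file's bytes (the diff does not show where it is computed); `image_to_data_url` is a hypothetical helper name, not a function in the package.

```python
import base64
from pathlib import Path


def image_to_data_url(img_path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    path = Path(img_path)
    # Same extension handling as in the diff: drop the dot, lowercase.
    ext = path.suffix.lstrip(".").lower()
    # Assumption: the payload is the base64 encoding of the raw file bytes.
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:image/{ext};base64,{encoded}"
```

Because the extension is used verbatim, a `.jpg` file yields `image/jpg` rather than the registered `image/jpeg` media type; whether a given provider accepts that is not something this diff shows.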
catllm/{cat_llm.py → text_functions.py}
@@ -106,7 +106,7 @@ def explore_common_categories(
 top_n=10,
 cat_num=10,
 divisions=5,
-user_model="gpt-4o
+user_model="gpt-4o",
 creativity=0,
 specificity="broad",
 research_question=None,
@@ -224,10 +224,8 @@ def multi_class(
 survey_input,
 categories,
 api_key,
-
-user_model="gpt-4o-2024-11-20",
+user_model="gpt-4o",
 creativity=0,
-to_csv=False,
 safety=False,
 filename="categorized_data.csv",
 save_directory=None,
@@ -1,48 +0,0 @@
-Metadata-Version: 2.4
-Name: cat-llm
-Version: 0.0.31
-Summary: A tool for categorizing text data and images using LLMs and vision models
-Project-URL: Documentation, https://github.com/chrissoria/cat-llm#readme
-Project-URL: Issues, https://github.com/chrissoria/cat-llm/issues
-Project-URL: Source, https://github.com/chrissoria/cat-llm
-Author-email: Christopher Soria <chrissoria@berkeley.edu>
-License-Expression: MIT
-License-File: LICENSE
-Keywords: categorizer,image classification,llm,structured output,survey data,text classification
-Classifier: Development Status :: 4 - Beta
-Classifier: Programming Language :: Python
-Classifier: Programming Language :: Python :: 3.8
-Classifier: Programming Language :: Python :: 3.9
-Classifier: Programming Language :: Python :: 3.10
-Classifier: Programming Language :: Python :: 3.11
-Classifier: Programming Language :: Python :: 3.12
-Classifier: Programming Language :: Python :: Implementation :: CPython
-Classifier: Programming Language :: Python :: Implementation :: PyPy
-Requires-Python: >=3.8
-Requires-Dist: pandas
-Requires-Dist: tqdm
-Description-Content-Type: text/markdown
-
-
-
-# catllm
-
-[](https://pypi.org/project/cat-llm)
-[](https://pypi.org/project/cat-llm)
-
------
-
-## Table of Contents
-
-- [Installation](#installation)
-- [License](#license)
-
-## Installation
-
-```console
-pip install cat-llm
-```
-
-## License
-
-`cat-llm` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.
cat_llm-0.0.31.dist-info/RECORD
DELETED
@@ -1,9 +0,0 @@
-catllm/CERAD_functions.py,sha256=fiSiBnCcFgNp5XmGhZULnToEoMyP5z6JMcH-aWC8q5o,18787
-catllm/__about__.py,sha256=9jcX8w9s2lkH7rxfvscK-WEBKVFhqbPqque6B-Xa2QA,404
-catllm/__init__.py,sha256=kLk180aJna1s-wU6CLr4_hKkbjoeET-11jGmC1pdhQw,330
-catllm/cat_llm.py,sha256=TNsjYKpr8ZH9jeAYN-4DcFcrnR8x2eRl99oXzpdhE0Q,16910
-catllm/image_functions.py,sha256=JLlv5qQhAQzgsRIY18rUPtM1P7x1Fw2UlWlI1dpv3dA,31272
-cat_llm-0.0.31.dist-info/METADATA,sha256=5Gu7C5gBkMWYgVy8M4OSK6B8Zi2nT6HHiybtv4O_KqM,1679
-cat_llm-0.0.31.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
-cat_llm-0.0.31.dist-info/licenses/LICENSE,sha256=wJLsvOr6lrFUDcoPXExa01HOKFWrS3JC9f0RudRw8uw,1075
-cat_llm-0.0.31.dist-info/RECORD,,
File without changes
File without changes