cat-llm 0.0.31__py3-none-any.whl → 0.0.33__py3-none-any.whl

@@ -0,0 +1,230 @@
1
+ Metadata-Version: 2.4
2
+ Name: cat-llm
3
+ Version: 0.0.33
4
+ Summary: A tool for categorizing text data and images using LLMs and vision models
5
+ Project-URL: Documentation, https://github.com/chrissoria/cat-llm#readme
6
+ Project-URL: Issues, https://github.com/chrissoria/cat-llm/issues
7
+ Project-URL: Source, https://github.com/chrissoria/cat-llm
8
+ Author-email: Christopher Soria <chrissoria@berkeley.edu>
9
+ License-Expression: MIT
10
+ License-File: LICENSE
11
+ Keywords: categorizer,image classification,llm,structured output,survey data,text classification
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Programming Language :: Python
14
+ Classifier: Programming Language :: Python :: 3.8
15
+ Classifier: Programming Language :: Python :: 3.9
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Programming Language :: Python :: Implementation :: CPython
20
+ Classifier: Programming Language :: Python :: Implementation :: PyPy
21
+ Requires-Python: >=3.8
22
+ Requires-Dist: pandas
23
+ Requires-Dist: tqdm
24
+ Description-Content-Type: text/markdown
25
+
26
+ ![catllm Logo](https://github.com/chrissoria/cat-llm/blob/main/images/logo.png?raw=True)
27
+
28
+ # catllm
29
+
30
+ [![PyPI - Version](https://img.shields.io/pypi/v/cat-llm.svg)](https://pypi.org/project/cat-llm)
31
+ [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/cat-llm.svg)](https://pypi.org/project/cat-llm)
32
+
33
+ -----
34
+
35
+ ## Table of Contents
36
+
37
+ - [Installation](#installation)
38
+ - [Quick Start](#quick-start)
39
+ - [Configuration](#configuration)
40
+ - [Supported Models](#supported-models)
41
+ - [API Reference](#api-reference)
42
+ - [Academic Research](#academic-research)
43
+ - [License](#license)
44
+
45
+ ## Installation
46
+
47
+ ```console
48
+ pip install cat-llm
49
+ ```
50
+
51
+ ## Quick Start
52
+
53
+ The `explore_corpus` function extracts a list of all categories present in the corpus as identified by the model.
54
+ ```
55
+ import catllm as cat
56
+ import os
57
+
58
+ categories = cat.explore_corpus(
59
+ survey_question="What motivates you most at work?",
60
+ survey_input=["flexible schedule", "good pay", "interesting projects"],
61
+ api_key=os.environ["OPENAI_API_KEY"],  # read the key from an environment variable
62
+ cat_num=5,
63
+ divisions=10
64
+ )
65
+ print(categories)
66
+ ```
67
+
68
+ ## Configuration
69
+
70
+ ### Get Your OpenAI API Key
71
+
72
+ 1. **Create an OpenAI Developer Account**:
73
+ - Go to [platform.openai.com](https://platform.openai.com) (separate from regular ChatGPT)
74
+ - Sign up with email, Google, Microsoft, or Apple
75
+
76
+ 2. **Generate an API Key**:
77
+ - Log into your account and click your name in the top right corner
78
+ - Click "View API keys" or navigate to the "API keys" section
79
+ - Click "Create new secret key"
80
+ - Give your key a descriptive name
81
+ - Set permissions (choose "All" for full access)
82
+
83
+ 3. **Add Payment Details**:
84
+ - Add a payment method to your OpenAI account
85
+ - Purchase credits (start with $5 - it lasts a long time for most research use)
86
+ - **Important**: Your API key won't work without credits
87
+
88
+ 4. **Save Your Key Securely**:
89
+ - Copy the key immediately (you won't be able to see it again)
90
+ - Store it safely and never share it publicly
91
+
92
+ 5. **Use Your Key**: Paste it into the `api_key` parameter of any catllm function
93
+
94
+ ## Supported Models
95
+
96
+ - **OpenAI**: GPT-4o, GPT-4, GPT-3.5-turbo, etc.
97
+ - **Anthropic**: Claude Sonnet 3.7, Claude Haiku, etc.
98
+ - **Perplexity**: Sonar Large, Sonar Small, etc.
99
+ - **Mistral**: Mistral Large, Mistral Small, etc.
100
+
101
+ ## API Reference
102
+
103
+ ### `explore_corpus()`
104
+
105
+ Extracts categories from a corpus of text responses and returns frequency counts.
106
+
107
+ **Methodology:**
108
+ The function divides the corpus into random chunks and extracts categories from each chunk in a separate API call. Aggregating results across many calls, rather than relying on a single call, reduces the effect of the probabilistic nature of LLM outputs and yields more stable category frequency estimates.
109
+
110
+ **Parameters:**
111
+ - `survey_question` (str): The survey question being analyzed
112
+ - `survey_input` (list): List of text responses to categorize
113
+ - `api_key` (str): API key for the LLM service
114
+ - `cat_num` (int, default=10): Number of categories to extract in each iteration
115
+ - `divisions` (int, default=5): Number of chunks to divide the data into (larger corpora might require larger divisions)
116
+ - `specificity` (str, default="broad"): Category precision level (e.g., "broad", "narrow")
117
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
118
+ - `user_model` (str, default="gpt-4o"): Specific model (e.g., "gpt-4o", "claude-opus-4-20250514")
119
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
120
+ - `filename` (str, optional): Output file path for saving results
121
+
122
+ **Returns:**
123
+ - `pandas.DataFrame`: Two-column dataset with category names and frequencies
124
+
125
+ **Example:**
126
+
127
+ ```
128
+ import catllm as cat
129
+
130
+ categories = cat.explore_corpus(
131
+ survey_question="What motivates you most at work?",
132
+ survey_input=["flexible schedule", "good pay", "interesting projects"],
133
+ api_key="OPENAI_API_KEY",
134
+ cat_num=5,
135
+ divisions=10
136
+ )
137
+ ```
138
+
139
+ ### `explore_common_categories()`
140
+
141
+ Identifies the most frequently occurring categories across a text corpus and returns the top N categories by frequency count.
142
+
143
+ **Methodology:**
144
+ Divides the corpus into random chunks and averages results across multiple API calls to improve reproducibility and provide stable frequency estimates for the most prevalent categories, addressing the probabilistic nature of LLM outputs.
145
+
146
+ **Parameters:**
147
+ - `survey_question` (str): Survey question being analyzed
148
+ - `survey_input` (list): Text responses to categorize
149
+ - `api_key` (str): API key for the LLM service
150
+ - `top_n` (int, default=10): Number of top categories to return by frequency
151
+ - `cat_num` (int, default=10): Number of categories to extract per iteration
152
+ - `divisions` (int, default=5): Number of data chunks (increase for larger corpora)
153
+ - `user_model` (str, default="gpt-4o"): Specific model to use
154
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
155
+ - `specificity` (str, default="broad"): Category precision level ("broad", "narrow")
156
+ - `research_question` (str, optional): Contextual research question to guide categorization
157
+ - `filename` (str, optional): File path to save output dataset
158
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
159
+
160
+ **Returns:**
161
+ - `pandas.DataFrame`: Dataset with category names and frequencies, limited to top N most common categories
162
+
163
+ **Example:**
164
+
165
+ ```
166
+ import catllm as cat
167
+
168
+ top_10_categories = cat.explore_common_categories(
169
+ survey_question="What motivates you most at work?",
170
+ survey_input=["flexible schedule", "good pay", "interesting projects"],
171
+ api_key="OPENAI_API_KEY",
172
+ top_n=10,
173
+ cat_num=5,
174
+ divisions=10
175
+ )
176
+ print(top_10_categories)
177
+ ```
178
+ ### `multi_class()`
179
+
180
+ Performs multi-label classification of text responses into user-defined categories, returning structured results with optional CSV export.
181
+
182
+ **Methodology:**
183
+ Processes each text response individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
184
+
185
+ **Parameters:**
186
+ - `survey_question` (str): The survey question being analyzed
187
+ - `survey_input` (list): List of text responses to classify
188
+ - `categories` (list): List of predefined categories for classification
189
+ - `api_key` (str): API key for the LLM service
190
+ - `user_model` (str, default="gpt-4o"): Specific model to use
191
+ - `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
192
+ - `safety` (bool, default=False): Enables safety checks on responses and saves results to CSV after each API call
193
+ - `filename` (str, default="categorized_data.csv"): Filename for CSV output
194
+ - `save_directory` (str, optional): Directory path to save the CSV file
195
+ - `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
196
+
197
+ **Returns:**
198
+ - `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
199
+
200
+ **Example:**
201
+
202
+ ```
203
+ import catllm as cat
204
+
205
+ user_categories = ["to start living with or to stay with partner/spouse",
206
+ "relationship change (divorce, breakup, etc)",
207
+ "the person had a job or school or career change, including transferred and retired",
208
+ "the person's partner's job or school or career change, including transferred and retired",
209
+ "financial reasons (rent is too expensive, pay raise, etc)",
210
+ "related specifically features of the home, such as a bigger or smaller yard"]
211
+
212
+ question = "Why did you move?"
213
+
214
+ move_reasons = cat.multi_class(
215
+ survey_question=question,
216
+ survey_input=df[column1],  # df: your pandas DataFrame; column1: name of the response column
217
+ user_model="gpt-4o",
218
+ creativity=0,
219
+ categories=user_categories,
220
+ safety=True,
221
+ api_key="OPENAI_API_KEY")
222
+ ```
223
+
224
+ ## Academic Research
225
+
226
+ This package implements methodology from research on LLM performance in social science applications, including the UC Berkeley Social Networks Study. The package addresses reproducibility challenges in LLM-assisted research by providing standardized interfaces and consistent output formatting.
227
+
228
+ ## License
229
+
230
+ `cat-llm` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.
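Note on the README's methodology: the chunk-and-aggregate idea described for `explore_corpus()` can be sketched as below. This is only an illustration under stated assumptions, not the package's implementation: `chunked_category_counts` and `keyword_extractor` are hypothetical names, and the extractor is a plain-Python stand-in for the LLM call.

```python
import random
from collections import Counter


def chunked_category_counts(responses, extract_categories, divisions=5, seed=0):
    """Shuffle the responses, split them into chunks, and tally categories across chunks."""
    rng = random.Random(seed)
    shuffled = list(responses)
    rng.shuffle(shuffled)
    # Striding through the shuffled list yields `divisions` chunks of near-equal size.
    chunks = [shuffled[i::divisions] for i in range(divisions)]
    counts = Counter()
    for chunk in chunks:
        if not chunk:
            continue
        # Each category is counted at most once per chunk.
        counts.update(set(extract_categories(chunk)))
    return counts.most_common()


def keyword_extractor(chunk):
    # Hypothetical stand-in for the model: the last word of each response is its "category".
    return [text.split()[-1] for text in chunk]


responses = ["flexible schedule", "good pay", "interesting projects", "fair pay"]
result = chunked_category_counts(responses, keyword_extractor, divisions=2)
print(result)
```

A category that recurs across chunks ends up with a higher count, which is the sense in which repeated calls stabilise the frequency estimate.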
@@ -0,0 +1,9 @@
1
+ catllm/CERAD_functions.py,sha256=fiSiBnCcFgNp5XmGhZULnToEoMyP5z6JMcH-aWC8q5o,18787
2
+ catllm/__about__.py,sha256=QD4n_jc9pZ_DH4rnRx892q9STG4YDuOKqS8li05uQnw,404
3
+ catllm/__init__.py,sha256=BpAG8nPhM3ZQRd0WqkubI_36-VCOs4eCYtGVgzz48Bs,337
4
+ catllm/image_functions.py,sha256=9e4V1IEMZUFrH00yEjyowwTUKeXWGsln0U1iQ-DELTY,31359
5
+ catllm/text_functions.py,sha256=K6oetWYk25PwsllWSZP4cFrz7kyxJg0plPRvpmQkCsU,16846
6
+ cat_llm-0.0.33.dist-info/METADATA,sha256=XiSskbffmKcIABIrm7vnJqJmDgZGOh6Qi_JABdU5Uls,9260
7
+ cat_llm-0.0.33.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
8
+ cat_llm-0.0.33.dist-info/licenses/LICENSE,sha256=wJLsvOr6lrFUDcoPXExa01HOKFWrS3JC9f0RudRw8uw,1075
9
+ cat_llm-0.0.33.dist-info/RECORD,,
catllm/__about__.py CHANGED
@@ -1,7 +1,7 @@
1
1
  # SPDX-FileCopyrightText: 2025-present Christopher Soria <chrissoria@berkeley.edu>
2
2
  #
3
3
  # SPDX-License-Identifier: MIT
4
- __version__ = "0.0.31"
4
+ __version__ = "0.0.33"
5
5
  __author__ = "Chris Soria"
6
6
  __email__ = "chrissoria@berkeley.edu"
7
7
  __title__ = "cat-llm"
catllm/__init__.py CHANGED
@@ -11,6 +11,6 @@ from .__about__ import (
11
11
  __license__,
12
12
  )
13
13
 
14
- from .cat_llm import *
14
+ from .text_functions import *
15
15
  from .CERAD_functions import *
16
16
  from .image_functions import *
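Note on this hunk: together with the RECORD entries, it indicates that `catllm/cat_llm.py` was replaced by `catllm/text_functions.py`. Names re-exported through `catllm/__init__.py` should be unaffected, but code that imported the old submodule directly would presumably fail on 0.0.33. A minimal sketch of a version-tolerant import follows; `first_importable` is a hypothetical helper, not part of the package.

```python
import importlib


def first_importable(module_names):
    """Return the first module in `module_names` that imports cleanly, else None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; move on to the next candidate.
            continue
    return None


# Prefer the 0.0.33 module name and fall back to the 0.0.31 one.
text_module = first_importable(["catllm.text_functions", "catllm.cat_llm"])
if text_module is None:
    print("catllm is not installed or neither submodule exists")
```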
catllm/image_functions.py CHANGED
@@ -72,7 +72,7 @@ def image_multi_class(
72
72
 
73
73
  # Handle extension safely
74
74
  ext = Path(img_path).suffix.lstrip(".").lower()
75
- if model_source == "OpenAI":
75
+ if model_source == "OpenAI" or model_source == "Mistral":
76
76
  encoded_image = f"data:image/{ext};base64,{encoded}"
77
77
  prompt = [
78
78
  {
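Note on the `model_source` change: the three image hunks extend the existing OpenAI branch, which builds a base64 data URI, to Mistral as well. The sketch below reproduces the two visible lines in isolation. How `encoded` is produced is not shown in the diff, so the `base64.b64encode` step, the `to_data_uri` and `uses_data_uri` helpers, and the sample path are assumptions.

```python
import base64
from pathlib import Path

# Providers that take the data-URI branch as of 0.0.33, per the hunks shown.
DATA_URI_SOURCES = {"OpenAI", "Mistral"}


def to_data_uri(img_path, image_bytes):
    """Build a data URI from a file path and raw bytes, mirroring the visible lines."""
    ext = Path(img_path).suffix.lstrip(".").lower()
    encoded = base64.b64encode(image_bytes).decode("utf-8")  # assumed encoding step
    return f"data:image/{ext};base64,{encoded}"


def uses_data_uri(model_source):
    """Equivalent to: model_source == "OpenAI" or model_source == "Mistral"."""
    return model_source in DATA_URI_SOURCES


uri = to_data_uri("photos/Cat.PNG", b"\x89PNG")
print(uri, uses_data_uri("Mistral"), uses_data_uri("Anthropic"))
```

Because the media type is taken directly from the file extension, a `.jpg` file yields `image/jpg` rather than the registered `image/jpeg`; whether a given provider accepts that is not something the diff settles.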
@@ -309,7 +309,7 @@ def image_score(
309
309
  ext = Path(img_path).suffix.lstrip(".").lower()
310
310
  encoded_image = f"data:image/{ext};base64,{encoded}"
311
311
 
312
- if model_source == "OpenAI":
312
+ if model_source == "OpenAI" or model_source == "Mistral":
313
313
  prompt = [
314
314
  {
315
315
  "type": "text",
@@ -575,7 +575,7 @@ def image_features(
575
575
  ext = Path(img_path).suffix.lstrip(".").lower()
576
576
  encoded_image = f"data:image/{ext};base64,{encoded}"
577
577
 
578
- if model_source == "OpenAI":
578
+ if model_source == "OpenAI" or model_source == "Mistral":
579
579
  prompt = [
580
580
  {
581
581
  "type": "text",
@@ -106,7 +106,7 @@ def explore_common_categories(
106
106
  top_n=10,
107
107
  cat_num=10,
108
108
  divisions=5,
109
- user_model="gpt-4o-2024-11-20",
109
+ user_model="gpt-4o",
110
110
  creativity=0,
111
111
  specificity="broad",
112
112
  research_question=None,
@@ -224,10 +224,8 @@ def multi_class(
224
224
  survey_input,
225
225
  categories,
226
226
  api_key,
227
- columns="numbered",
228
- user_model="gpt-4o-2024-11-20",
227
+ user_model="gpt-4o",
229
228
  creativity=0,
230
- to_csv=False,
231
229
  safety=False,
232
230
  filename="categorized_data.csv",
233
231
  save_directory=None,
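Note on the `multi_class` signature: this hunk removes the `columns` and `to_csv` parameters, so callers that still pass them would presumably hit a `TypeError`, unless the real function accepts `**kwargs`, which the diff does not show. The sketch below uses `multi_class_stub`, a hypothetical stand-in that mirrors only the parameters visible in the hunk, to show the failure and one way to adapt an old call.

```python
def multi_class_stub(survey_question, survey_input, categories, api_key,
                     user_model="gpt-4o", creativity=0, safety=False,
                     filename="categorized_data.csv", save_directory=None):
    """Hypothetical stand-in with the 0.0.33 parameters visible in the hunk."""
    return {"user_model": user_model, "safety": safety, "filename": filename}


# Keyword arguments dropped between 0.0.31 and 0.0.33.
REMOVED_KWARGS = ("columns", "to_csv")


def drop_removed_kwargs(kwargs):
    """Return a copy of `kwargs` without the parameters removed in 0.0.33."""
    return {k: v for k, v in kwargs.items() if k not in REMOVED_KWARGS}


old_style_call = dict(
    survey_question="Why did you move?",
    survey_input=["new job"],
    categories=["job change"],
    api_key="placeholder",
    columns="numbered",
    to_csv=False,
)

try:
    multi_class_stub(**old_style_call)
    raised = False
except TypeError:
    raised = True  # unexpected keyword arguments are rejected

result = multi_class_stub(**drop_removed_kwargs(old_style_call))
print(raised, result)
```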
@@ -1,48 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: cat-llm
3
- Version: 0.0.31
4
- Summary: A tool for categorizing text data and images using LLMs and vision models
5
- Project-URL: Documentation, https://github.com/chrissoria/cat-llm#readme
6
- Project-URL: Issues, https://github.com/chrissoria/cat-llm/issues
7
- Project-URL: Source, https://github.com/chrissoria/cat-llm
8
- Author-email: Christopher Soria <chrissoria@berkeley.edu>
9
- License-Expression: MIT
10
- License-File: LICENSE
11
- Keywords: categorizer,image classification,llm,structured output,survey data,text classification
12
- Classifier: Development Status :: 4 - Beta
13
- Classifier: Programming Language :: Python
14
- Classifier: Programming Language :: Python :: 3.8
15
- Classifier: Programming Language :: Python :: 3.9
16
- Classifier: Programming Language :: Python :: 3.10
17
- Classifier: Programming Language :: Python :: 3.11
18
- Classifier: Programming Language :: Python :: 3.12
19
- Classifier: Programming Language :: Python :: Implementation :: CPython
20
- Classifier: Programming Language :: Python :: Implementation :: PyPy
21
- Requires-Python: >=3.8
22
- Requires-Dist: pandas
23
- Requires-Dist: tqdm
24
- Description-Content-Type: text/markdown
25
-
26
- ![catllm Logo](https://github.com/chrissoria/cat-llm/blob/main/images/logo.png?raw=True)
27
-
28
- # catllm
29
-
30
- [![PyPI - Version](https://img.shields.io/pypi/v/cat-llm.svg)](https://pypi.org/project/cat-llm)
31
- [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/cat-llm.svg)](https://pypi.org/project/cat-llm)
32
-
33
- -----
34
-
35
- ## Table of Contents
36
-
37
- - [Installation](#installation)
38
- - [License](#license)
39
-
40
- ## Installation
41
-
42
- ```console
43
- pip install cat-llm
44
- ```
45
-
46
- ## License
47
-
48
- `cat-llm` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.
@@ -1,9 +0,0 @@
1
- catllm/CERAD_functions.py,sha256=fiSiBnCcFgNp5XmGhZULnToEoMyP5z6JMcH-aWC8q5o,18787
2
- catllm/__about__.py,sha256=9jcX8w9s2lkH7rxfvscK-WEBKVFhqbPqque6B-Xa2QA,404
3
- catllm/__init__.py,sha256=kLk180aJna1s-wU6CLr4_hKkbjoeET-11jGmC1pdhQw,330
4
- catllm/cat_llm.py,sha256=TNsjYKpr8ZH9jeAYN-4DcFcrnR8x2eRl99oXzpdhE0Q,16910
5
- catllm/image_functions.py,sha256=JLlv5qQhAQzgsRIY18rUPtM1P7x1Fw2UlWlI1dpv3dA,31272
6
- cat_llm-0.0.31.dist-info/METADATA,sha256=5Gu7C5gBkMWYgVy8M4OSK6B8Zi2nT6HHiybtv4O_KqM,1679
7
- cat_llm-0.0.31.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
8
- cat_llm-0.0.31.dist-info/licenses/LICENSE,sha256=wJLsvOr6lrFUDcoPXExa01HOKFWrS3JC9f0RudRw8uw,1075
9
- cat_llm-0.0.31.dist-info/RECORD,,