themefinder 0.6.2__tar.gz → 0.6.3__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {themefinder-0.6.2 → themefinder-0.6.3}/PKG-INFO +23 -8
- {themefinder-0.6.2 → themefinder-0.6.3}/README.md +22 -7
- {themefinder-0.6.2 → themefinder-0.6.3}/pyproject.toml +1 -1
- {themefinder-0.6.2 → themefinder-0.6.3}/src/themefinder/__init__.py +4 -0
- {themefinder-0.6.2 → themefinder-0.6.3}/src/themefinder/core.py +129 -33
- {themefinder-0.6.2 → themefinder-0.6.3}/src/themefinder/llm_batch_processor.py +32 -80
- themefinder-0.6.3/src/themefinder/models.py +351 -0
- themefinder-0.6.3/src/themefinder/prompts/detail_detection.txt +19 -0
- {themefinder-0.6.2 → themefinder-0.6.3}/src/themefinder/prompts/sentiment_analysis.txt +0 -14
- {themefinder-0.6.2 → themefinder-0.6.3}/src/themefinder/prompts/theme_condensation.txt +2 -22
- {themefinder-0.6.2 → themefinder-0.6.3}/src/themefinder/prompts/theme_generation.txt +6 -38
- {themefinder-0.6.2 → themefinder-0.6.3}/src/themefinder/prompts/theme_mapping.txt +6 -23
- {themefinder-0.6.2 → themefinder-0.6.3}/src/themefinder/prompts/theme_refinement.txt +2 -12
- {themefinder-0.6.2 → themefinder-0.6.3}/src/themefinder/prompts/theme_target_alignment.txt +2 -10
- themefinder-0.6.2/src/themefinder/models.py +0 -138
- {themefinder-0.6.2 → themefinder-0.6.3}/LICENCE +0 -0
- {themefinder-0.6.2 → themefinder-0.6.3}/src/themefinder/prompts/consultation_system_prompt.txt +0 -0
- {themefinder-0.6.2 → themefinder-0.6.3}/src/themefinder/themefinder_logging.py +0 -0
--- themefinder-0.6.2/PKG-INFO
+++ themefinder-0.6.3/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: themefinder
-Version: 0.6.2
+Version: 0.6.3
 Summary: A topic modelling Python package designed for analysing one-to-many question-answer data eg free-text survey responses.
 License: MIT
 Author: i.AI
@@ -49,9 +49,9 @@ ThemeFinder takes as input a [pandas DataFrame](https://pandas.pydata.org/docs/r
 - `response_id`: A unique identifier for each response
 - `response`: The free text survey response
 
-ThemeFinder …
+ThemeFinder now supports a range of language models through structured outputs.
 
-The function `find_themes` identifies common themes in …
+The function `find_themes` identifies common themes in responses and labels them, it also outputs results from intermediate steps in the theme finding pipeline.
 
 For this example, import the following Python packages into your virtual environment: `asyncio`, `pandas`, `lanchain`. And import `themefinder` as described above.
 
@@ -81,7 +81,6 @@ load_dotenv()
 llm = AzureChatOpenAI(
     model="gpt-4o",
     temperature=0,
-    model_kwargs={"response_format": {"type": "json_object"}},
 )
 
 # Set up your data
@@ -97,18 +96,15 @@ question = "What do you think of ThemeFinder?"
 # Make the system prompt specific to your use case
 system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."
 
-# Run the function to find themes
-# We use asyncio to query LLM endpoints asynchronously, so we need to await our function
+# Run the function to find themes, we use asyncio to query LLM endpoints asynchronously, so we need to await our function
 async def main():
     result = await find_themes(responses_df, llm, question, system_prompt=system_prompt)
     print(result)
 
 if __name__ == "__main__":
     asyncio.run(main())
-
 ```
 
-
 ## ThemeFinder pipeline
 
 ThemeFinder's pipeline consists of five distinct stages, each utilizing a specialized LLM prompt:
@@ -145,6 +141,25 @@ The file `src/themefinder.core.py` contains the function `find_themes` which run
 **For more detail - see the docs: [https://i-dot-ai.github.io/themefinder/](https://i-dot-ai.github.io/themefinder/).**
 
 
+## Model Compatibility
+
+ThemeFinder's structured output approach makes it compatible with a wide range of language models from various providers. This list is non-exhaustive, and other models may also work effectively:
+
+### OpenAI Models
+- GPT-4, GPT-4o, GPT-4.1
+- All Azure OpenAI deployments
+
+### Google Models
+- Gemini series (1.5 Pro, 2.0 Pro, etc.)
+
+### Anthropic Models
+- Claude series (Claude 3 Opus, Sonnet, Haiku, etc.)
+
+### Open Source Models
+- Llama 2, Llama 3
+- Mistral models (e.g., Mistral 7B, Mixtral)
+
+
 ## License
 
 This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
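Because each pipeline stage now calls `llm.with_structured_output(...)` (see the core.py diff below), any LangChain chat model that supports structured output should slot in. A minimal sketch swapping Azure OpenAI for Anthropic; the `langchain-anthropic` package and model name are illustrative assumptions, not part of this release:

```python
# Hypothetical provider swap: assumes `langchain-anthropic` is installed
# and ANTHROPIC_API_KEY is set; the model name is illustrative.
import asyncio

import pandas as pd
from langchain_anthropic import ChatAnthropic

from themefinder import find_themes

llm = ChatAnthropic(model="claude-3-5-sonnet-20240620", temperature=0)

responses_df = pd.DataFrame(
    {
        "response_id": [1, 2],
        "response": ["Easy to set up.", "Docs could be clearer."],
    }
)

async def main():
    result = await find_themes(responses_df, llm, "What do you think of ThemeFinder?")
    print(result["themes"])

asyncio.run(main())
```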
--- themefinder-0.6.2/README.md
+++ themefinder-0.6.3/README.md
@@ -18,9 +18,9 @@ ThemeFinder takes as input a [pandas DataFrame](https://pandas.pydata.org/docs/r
 - `response_id`: A unique identifier for each response
 - `response`: The free text survey response
 
-ThemeFinder …
+ThemeFinder now supports a range of language models through structured outputs.
 
-The function `find_themes` identifies common themes in …
+The function `find_themes` identifies common themes in responses and labels them, it also outputs results from intermediate steps in the theme finding pipeline.
 
 For this example, import the following Python packages into your virtual environment: `asyncio`, `pandas`, `lanchain`. And import `themefinder` as described above.
 
@@ -50,7 +50,6 @@ load_dotenv()
 llm = AzureChatOpenAI(
     model="gpt-4o",
     temperature=0,
-    model_kwargs={"response_format": {"type": "json_object"}},
 )
 
 # Set up your data
@@ -66,18 +65,15 @@ question = "What do you think of ThemeFinder?"
 # Make the system prompt specific to your use case
 system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."
 
-# Run the function to find themes
-# We use asyncio to query LLM endpoints asynchronously, so we need to await our function
+# Run the function to find themes, we use asyncio to query LLM endpoints asynchronously, so we need to await our function
 async def main():
     result = await find_themes(responses_df, llm, question, system_prompt=system_prompt)
     print(result)
 
 if __name__ == "__main__":
     asyncio.run(main())
-
 ```
 
-
 ## ThemeFinder pipeline
 
 ThemeFinder's pipeline consists of five distinct stages, each utilizing a specialized LLM prompt:
@@ -114,6 +110,25 @@ The file `src/themefinder.core.py` contains the function `find_themes` which run
 **For more detail - see the docs: [https://i-dot-ai.github.io/themefinder/](https://i-dot-ai.github.io/themefinder/).**
 
 
+## Model Compatibility
+
+ThemeFinder's structured output approach makes it compatible with a wide range of language models from various providers. This list is non-exhaustive, and other models may also work effectively:
+
+### OpenAI Models
+- GPT-4, GPT-4o, GPT-4.1
+- All Azure OpenAI deployments
+
+### Google Models
+- Gemini series (1.5 Pro, 2.0 Pro, etc.)
+
+### Anthropic Models
+- Claude series (Claude 3 Opus, Sonnet, Haiku, etc.)
+
+### Open Source Models
+- Llama 2, Llama 3
+- Mistral models (e.g., Mistral 7B, Mixtral)
+
+
 ## License
 
 This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
--- themefinder-0.6.2/pyproject.toml
+++ themefinder-0.6.3/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "themefinder"
-version = "0.6.2"
+version = "0.6.3"
 description = "A topic modelling Python package designed for analysing one-to-many question-answer data eg free-text survey responses."
 authors = ["i.AI <packages@cabinetoffice.gov.uk>"]
 packages = [{include = "themefinder", from = "src"}]
--- themefinder-0.6.2/src/themefinder/__init__.py
+++ themefinder-0.6.3/src/themefinder/__init__.py
@@ -5,6 +5,8 @@ from .core import (
     theme_generation,
     theme_mapping,
     theme_refinement,
+    theme_target_alignment,
+    detail_detection,
 )
 
 __all__ = [
@@ -13,6 +15,8 @@ __all__ = [
     "theme_generation",
     "theme_condensation",
     "theme_refinement",
+    "theme_target_alignment",
     "theme_mapping",
+    "detail_detection",
 ]
 __version__ = "0.1.0"
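These exports make the target-alignment and detail-detection stages part of the package's public surface. A sketch of calling `theme_target_alignment` standalone, under the signature shown in the core.py diff below (the DataFrame, model, and question are supplied by the caller):

```python
# Sketch: consolidating an already-refined theme set down to eight themes.
from themefinder import theme_target_alignment

async def align_to_eight(refined_theme_df, llm, question):
    aligned_df, unprocessable_df = await theme_target_alignment(
        refined_theme_df,
        llm,
        question=question,
        target_n_themes=8,
    )
    return aligned_df, unprocessable_df
```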
--- themefinder-0.6.2/src/themefinder/core.py
+++ themefinder-0.6.3/src/themefinder/core.py
@@ -3,10 +3,17 @@ from pathlib import Path
 
 import pandas as pd
 from langchain_core.prompts import PromptTemplate
-from …
+from langchain.schema.runnable import RunnableWithFallbacks
 
 from .llm_batch_processor import batch_and_run, load_prompt_from_file
-from .models import …
+from .models import (
+    SentimentAnalysisResponses,
+    ThemeGenerationResponses,
+    ThemeCondensationResponses,
+    ThemeRefinementResponses,
+    ThemeMappingResponses,
+    DetailDetectionResponses,
+)
 from .themefinder_logging import logger
 
 CONSULTATION_SYSTEM_PROMPT = load_prompt_from_file("consultation_system_prompt")
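The `llm` parameters throughout this module are now annotated as `RunnableWithFallbacks`, the type LangChain produces when a model is wrapped with `.with_fallbacks()`. A sketch of constructing one; the deployment names are illustrative assumptions:

```python
# Sketch: building the RunnableWithFallbacks the new annotations describe,
# a primary Azure deployment falling back to a second one on error.
from langchain_openai import AzureChatOpenAI

primary = AzureChatOpenAI(model="gpt-4o", temperature=0)
backup = AzureChatOpenAI(model="gpt-4o-mini", temperature=0)

llm = primary.with_fallbacks([backup])  # type: RunnableWithFallbacks
```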
@@ -14,11 +21,12 @@ CONSULTATION_SYSTEM_PROMPT = load_prompt_from_file("consultation_system_prompt")
 
 async def find_themes(
     responses_df: pd.DataFrame,
-    llm: …
+    llm: RunnableWithFallbacks,
     question: str,
     target_n_themes: int | None = None,
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
     verbose: bool = True,
+    concurrency: int = 10,
 ) -> dict[str, str | pd.DataFrame]:
     """Process survey responses through a multi-stage theme analysis pipeline.
 
@@ -32,7 +40,7 @@ async def find_themes(
 
     Args:
         responses_df (pd.DataFrame): DataFrame containing survey responses
-        llm (…
+        llm (RunnableWithFallbacks): Language model instance for text analysis
         question (str): The survey question
         target_n_themes (int | None, optional): Target number of themes to consolidate to.
             If None, skip theme target alignment step. Defaults to None.
@@ -40,6 +48,7 @@ async def find_themes(
             Defaults to CONSULTATION_SYSTEM_PROMPT.
         verbose (bool): Whether to show information messages during processing.
             Defaults to True.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         dict[str, str | pd.DataFrame]: Dictionary containing results from each pipeline stage:
@@ -56,21 +65,28 @@ async def find_themes(
         llm,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     theme_df, _ = await theme_generation(
         sentiment_df,
         llm,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     condensed_theme_df, _ = await theme_condensation(
-        theme_df, …
+        theme_df,
+        llm,
+        question=question,
+        system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     refined_theme_df, _ = await theme_refinement(
         condensed_theme_df,
         llm,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     if target_n_themes is not None:
         refined_theme_df, _ = await theme_target_alignment(
@@ -79,6 +95,7 @@ async def find_themes(
             question=question,
             target_n_themes=target_n_themes,
             system_prompt=system_prompt,
+            concurrency=concurrency,
         )
     mapping_df, mapping_unprocessables = await theme_mapping(
         sentiment_df[["response_id", "response"]],
@@ -86,6 +103,14 @@ async def find_themes(
         question=question,
         refined_themes_df=refined_theme_df,
         system_prompt=system_prompt,
+        concurrency=concurrency,
+    )
+    detailed_df, _ = await detail_detection(
+        responses_df[["response_id", "response"]],
+        llm,
+        question=question,
+        system_prompt=system_prompt,
+        concurrency=concurrency,
     )
 
     logger.info("Finished finding themes")
@@ -97,17 +122,19 @@ async def find_themes(
         "sentiment": sentiment_df,
         "themes": refined_theme_df,
         "mapping": mapping_df,
+        "detailed_responses": detailed_df,
         "unprocessables": pd.concat([sentiment_unprocessables, mapping_unprocessables]),
     }
 
 
 async def sentiment_analysis(
     responses_df: pd.DataFrame,
-    llm: …
+    llm: RunnableWithFallbacks,
     question: str,
     batch_size: int = 20,
     prompt_template: str | Path | PromptTemplate = "sentiment_analysis",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Perform sentiment analysis on survey responses using an LLM.
 
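The dictionary returned by `find_themes` accordingly gains a `detailed_responses` entry, and callers can throttle parallel LLM calls via `concurrency`. A sketch extending the README example, with `responses_df`, `llm`, `question`, and `system_prompt` defined as there:

```python
# Sketch: consuming the new return key and the concurrency knob.
async def main():
    result = await find_themes(
        responses_df,
        llm,
        question,
        system_prompt=system_prompt,
        concurrency=5,  # cap simultaneous LLM calls
    )
    print(result["themes"])
    print(result["detailed_responses"])
    print(result["unprocessables"])
```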
@@ -117,7 +144,7 @@ async def sentiment_analysis(
     Args:
         responses_df (pd.DataFrame): DataFrame containing survey responses to analyze.
             Must contain 'response_id' and 'response' columns.
-        llm (…
+        llm (RunnableWithFallbacks): Language model instance to use for sentiment analysis.
         question (str): The survey question.
         batch_size (int, optional): Number of responses to process in each batch.
             Defaults to 20.
@@ -126,6 +153,7 @@ async def sentiment_analysis(
             or PromptTemplate instance. Defaults to "sentiment_analysis".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -134,32 +162,33 @@ async def sentiment_analysis(
         - The second DataFrame contains the rows that could not be processed by the LLM
 
     Note:
-        The function uses …
+        The function uses integrity_check to ensure responses maintain
         their original order and association after processing.
     """
     logger.info(f"Running sentiment analysis on {len(responses_df)} responses")
-    …
+    sentiment, unprocessable = await batch_and_run(
         responses_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(SentimentAnalysisResponses),
         batch_size=batch_size,
         question=question,
-        …
-        task_validation_model=SentimentAnalysisOutput,
+        integrity_check=True,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
 
-    return …
+    return sentiment, unprocessable
 
 
 async def theme_generation(
     responses_df: pd.DataFrame,
-    llm: …
+    llm: RunnableWithFallbacks,
     question: str,
     batch_size: int = 50,
     partition_key: str | None = "position",
     prompt_template: str | Path | PromptTemplate = "theme_generation",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Generate themes from survey responses using an LLM.
 
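The new `src/themefinder/models.py` (351 added lines) is not expanded in this diff, so the exact shape of `SentimentAnalysisResponses` and its siblings is unknown. The sketch below only illustrates the general pattern `with_structured_output` expects, with hypothetical field names:

```python
# Hypothetical illustration only: the real definitions live in the new
# src/themefinder/models.py, which this diff does not display.
from pydantic import BaseModel

class SentimentAnalysisResponse(BaseModel):
    response_id: int
    position: str  # hypothetical field, e.g. "agreement" / "disagreement"

class SentimentAnalysisResponses(BaseModel):
    responses: list[SentimentAnalysisResponse]

# llm.with_structured_output(SentimentAnalysisResponses) returns a runnable
# whose output is a parsed SentimentAnalysisResponses instance, not raw text.
```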
@@ -168,7 +197,7 @@ async def theme_generation(
     Args:
         responses_df (pd.DataFrame): DataFrame containing survey responses.
             Must include 'response_id' and 'response' columns.
-        llm (…
+        llm (RunnableWithFallbacks): Language model instance to use for theme generation.
         question (str): The survey question.
         batch_size (int, optional): Number of responses to process in each batch.
             Defaults to 50.
@@ -181,6 +210,7 @@ async def theme_generation(
             or PromptTemplate instance. Defaults to "theme_generation".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -193,22 +223,24 @@ async def theme_generation(
     generated_themes, _ = await batch_and_run(
         responses_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeGenerationResponses),
         batch_size=batch_size,
         partition_key=partition_key,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     return generated_themes, _
 
 
 async def theme_condensation(
     themes_df: pd.DataFrame,
-    llm: …
+    llm: RunnableWithFallbacks,
     question: str,
     batch_size: int = 75,
     prompt_template: str | Path | PromptTemplate = "theme_condensation",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
     **kwargs,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Condense and combine similar themes identified from survey responses.
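Since every stage now accepts the same `concurrency` keyword, the pipeline can also be driven stage by stage. A sketch chaining the first two stages by hand, importing from `themefinder.core` where this diff shows them defined:

```python
# Sketch: running sentiment analysis and theme generation manually,
# forwarding one concurrency setting to both stages.
from themefinder.core import sentiment_analysis, theme_generation

async def first_two_stages(responses_df, llm, question):
    sentiment_df, unprocessable_df = await sentiment_analysis(
        responses_df, llm, question=question, concurrency=5
    )
    theme_df, _ = await theme_generation(
        sentiment_df, llm, question=question, concurrency=5
    )
    return theme_df, unprocessable_df
```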
@@ -219,7 +251,7 @@ async def theme_condensation(
     Args:
         themes_df (pd.DataFrame): DataFrame containing the initial themes identified
             from survey responses.
-        llm (…
+        llm (RunnableWithFallbacks): Language model instance to use for theme condensation.
         question (str): The survey question.
         batch_size (int, optional): Number of themes to process in each batch.
             Defaults to 100.
@@ -228,6 +260,7 @@ async def theme_condensation(
             or PromptTemplate instance. Defaults to "theme_condensation".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -247,10 +280,11 @@ async def theme_condensation(
     themes_df, _ = await batch_and_run(
         themes_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeCondensationResponses),
         batch_size=batch_size,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
         **kwargs,
     )
     themes_df = themes_df.sample(frac=1).reset_index(drop=True)
@@ -263,10 +297,11 @@ async def theme_condensation(
     themes_df, _ = await batch_and_run(
         themes_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeCondensationResponses),
         batch_size=batch_size,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
         **kwargs,
     )
 
@@ -276,11 +311,12 @@ async def theme_condensation(
 
 async def theme_refinement(
     condensed_themes_df: pd.DataFrame,
-    llm: …
+    llm: RunnableWithFallbacks,
     question: str,
     batch_size: int = 10000,
     prompt_template: str | Path | PromptTemplate = "theme_refinement",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Refine and standardize condensed themes using an LLM.
 
@@ -292,7 +328,7 @@ async def theme_refinement(
     Args:
         condensed_themes (pd.DataFrame): DataFrame containing the condensed themes
             from the previous pipeline stage.
-        llm (…
+        llm (RunnableWithFallbacks): Language model instance to use for theme refinement.
         question (str): The survey question.
         batch_size (int, optional): Number of themes to process in each batch.
             Defaults to 10000.
@@ -301,6 +337,7 @@ async def theme_refinement(
             or PromptTemplate instance. Defaults to "theme_refinement".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -319,22 +356,24 @@ async def theme_refinement(
     refined_themes, _ = await batch_and_run(
         condensed_themes_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeRefinementResponses),
         batch_size=batch_size,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     return refined_themes, _
 
 
 async def theme_target_alignment(
     refined_themes_df: pd.DataFrame,
-    llm: …
+    llm: RunnableWithFallbacks,
     question: str,
     target_n_themes: int = 10,
     batch_size: int = 10000,
     prompt_template: str | Path | PromptTemplate = "theme_target_alignment",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Align themes to target number using an LLM.
 
@@ -346,7 +385,7 @@ async def theme_target_alignment(
     Args:
         refined_themes_df (pd.DataFrame): DataFrame containing the refined themes
             from the previous pipeline stage.
-        llm (…
+        llm (RunnableWithFallbacks): Language model instance to use for theme alignment.
         question (str): The survey question.
         target_n_themes (int, optional): Target number of themes to consolidate to.
             Defaults to 10.
@@ -357,6 +396,7 @@ async def theme_target_alignment(
             or PromptTemplate instance. Defaults to "theme_target_alignment".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -376,23 +416,25 @@ async def theme_target_alignment(
     aligned_themes, _ = await batch_and_run(
         refined_themes_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeRefinementResponses),
         batch_size=batch_size,
         question=question,
         system_prompt=system_prompt,
         target_n_themes=target_n_themes,
+        concurrency=concurrency,
     )
     return aligned_themes, _
 
 
 async def theme_mapping(
     responses_df: pd.DataFrame,
-    llm: …
+    llm: RunnableWithFallbacks,
     question: str,
     refined_themes_df: pd.DataFrame,
     batch_size: int = 20,
     prompt_template: str | Path | PromptTemplate = "theme_mapping",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Map survey responses to refined themes using an LLM.
 
@@ -402,7 +444,7 @@ async def theme_mapping(
     Args:
         responses_df (pd.DataFrame): DataFrame containing survey responses.
             Must include 'response_id' and 'response' columns.
-        llm (…
+        llm (RunnableWithFallbacks): Language model instance to use for theme mapping.
         question (str): The survey question.
         refined_themes_df (pd.DataFrame): Single-row DataFrame where each column
             represents a theme (from theme_refinement stage).
@@ -413,6 +455,7 @@ async def theme_mapping(
             or PromptTemplate instance. Defaults to "theme_mapping".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -432,17 +475,70 @@ async def theme_mapping(
     )
     return transposed_df
 
-    mapping, …
+    mapping, unprocessable = await batch_and_run(
         responses_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeMappingResponses),
         batch_size=batch_size,
         question=question,
         refined_themes=transpose_refined_themes(refined_themes_df).to_dict(
             orient="records"
         ),
-        …
-        …
+        integrity_check=True,
+        system_prompt=system_prompt,
+        concurrency=concurrency,
+    )
+    return mapping, unprocessable
+
+
+async def detail_detection(
+    responses_df: pd.DataFrame,
+    llm: RunnableWithFallbacks,
+    question: str,
+    batch_size: int = 20,
+    prompt_template: str | Path | PromptTemplate = "detail_detection",
+    system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
+) -> tuple[pd.DataFrame, pd.DataFrame]:
+    """Identify responses that provide high-value detailed evidence.
+
+    This function processes survey responses in batches to analyze their level of detail
+    and evidence using a language model. It identifies responses that contain specific
+    examples, data, or detailed reasoning that provide strong supporting evidence.
+
+    Args:
+        responses_df (pd.DataFrame): DataFrame containing survey responses to analyze.
+            Must contain 'response_id' and 'response' columns.
+        llm (RunnableWithFallbacks): Language model instance to use for detail detection.
+        question (str): The survey question.
+        batch_size (int, optional): Number of responses to process in each batch.
+            Defaults to 20.
+        prompt_template (str | Path | PromptTemplate, optional): Template for structuring
+            the prompt to the LLM. Can be a string identifier, path to template file,
+            or PromptTemplate instance. Defaults to "detail_detection".
+        system_prompt (str): System prompt to guide the LLM's behavior.
+            Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
+
+    Returns:
+        tuple[pd.DataFrame, pd.DataFrame]:
+            A tuple containing two DataFrames:
+            - The first DataFrame contains the rows that were successfully processed by the LLM
+            - The second DataFrame contains the rows that could not be processed by the LLM
+
+    Note:
+        The function uses response_id_integrity_check to ensure responses maintain
+        their original order and association after processing.
+    """
+    logger.info(f"Running detail detection on {len(responses_df)} responses")
+    detailed, _ = await batch_and_run(
+        responses_df,
+        prompt_template,
+        llm.with_structured_output(DetailDetectionResponses),
+        batch_size=batch_size,
+        question=question,
+        integrity_check=True,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
-    return …
+    return detailed, _