themefinder 0.6.2__py3-none-any.whl → 0.7.0__py3-none-any.whl
- themefinder/__init__.py +8 -2
- themefinder/core.py +217 -39
- themefinder/llm_batch_processor.py +33 -81
- themefinder/models.py +371 -94
- themefinder/prompts/agentic_theme_clustering.txt +31 -0
- themefinder/prompts/detail_detection.txt +19 -0
- themefinder/prompts/sentiment_analysis.txt +0 -14
- themefinder/prompts/theme_condensation.txt +2 -22
- themefinder/prompts/theme_generation.txt +6 -38
- themefinder/prompts/theme_mapping.txt +6 -23
- themefinder/prompts/theme_refinement.txt +7 -16
- themefinder/prompts/theme_target_alignment.txt +2 -10
- themefinder/theme_clustering_agent.py +332 -0
- {themefinder-0.6.2.dist-info → themefinder-0.7.0.dist-info}/METADATA +24 -9
- themefinder-0.7.0.dist-info/RECORD +19 -0
- {themefinder-0.6.2.dist-info → themefinder-0.7.0.dist-info}/WHEEL +1 -1
- themefinder-0.6.2.dist-info/RECORD +0 -16
- {themefinder-0.6.2.dist-info → themefinder-0.7.0.dist-info}/LICENCE +0 -0
themefinder/prompts/theme_mapping.txt +6 -23
@@ -1,12 +1,12 @@
 {system_prompt}
 
-Your job is to help identify which topics come up in
+Your job is to help identify which topics come up in free_text_responses to a question.
 
 You will be given:
 - a QUESTION that has been asked
-- a TOPIC LIST of topics that are known to be present in
+- a TOPIC LIST of topics that are known to be present in free_text_responses to this question. These will be structured as follows:
 {{'topic_id': 'topic_description}}
-- a list of
+- a list of FREE_TEXT_RESPONSES to the question. These will be structured as follows:
 {{'response_id': 'free text response'}}
 
 Your task is to analyze each response and decide which topics are present. Guidelines:
@@ -14,10 +14,11 @@ Your task is to analyze each response and decide which topics are present. Guide
 - A response doesn't need to exactly match the language used in the TOPIC LIST, it should be considered a match if it expresses a similar sentiment.
 - You must use the alphabetic 'topic_id' to indicate which topic you have assigned. Do not use the full topic description
 - Each response can be assigned to multiple topics if it matches more than one topic from the TOPIC LIST.
+- Each topic can only be assigned once per response, if the topic is mentioned more than once use the first mention for reasoning and stance.
 - There is no limit on how many topics can be assigned to a response.
 - For each assignment provide a single rationale for why you have chosen the label.
 - For each topic identified in a response, indicate whether the response expresses a positive or negative stance toward that topic (options: 'POSITIVE' or 'NEGATIVE')
-- You MUST use either '
+- You MUST use either 'POSITIVE' or 'NEGATIVE'
 - The order of reasons and stances must align with the order of labels (e.g., stance_a applies to topic_a)
 
 You MUST include every response ID in the output.
@@ -25,24 +26,6 @@ If the response can not be labelled return empty sections where appropriate but
 with the correct response ID for each input object.
 You must only return the alphabetic topic_ids in the labels section.
 
-The final output should be in the following JSON format:
-
-{{
-"responses": [
-{{
-"response_id": response_id_1,
-"reasons": ["reason_a", "reason_b"],
-"labels": ["topic_a", "topic_b"],
-"stances": ["stance_a", "stance_b"],
-}},
-{{
-"response_id": response_id_2,
-"reasons": ["reason_c"],
-"labels": ["topic_c"],
-"stances": ["stance_c"],
-}}
-]
-}}
 
 QUESTION:
 
@@ -52,6 +35,6 @@ TOPIC LIST:
 
 {refined_themes}
 
-
+FREE_TEXT_RESPONSES:
 
 {responses}
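The mapping guidelines in this prompt (aligned `reasons`/`labels`/`stances`, each topic at most once per response, stances limited to `POSITIVE` or `NEGATIVE`) can be checked mechanically on whatever the model returns. The following is a minimal sketch under stated assumptions: `MappedResponse` and `check_mapping` are hypothetical names for illustration, not part of themefinder, and the field names are taken from the JSON example removed in this hunk.

```python
from dataclasses import dataclass
from typing import List

# The only two stance values the prompt allows.
VALID_STANCES = {"POSITIVE", "NEGATIVE"}


@dataclass
class MappedResponse:
    """Hypothetical container for one labelled response (not a themefinder class)."""

    response_id: int
    reasons: List[str]
    labels: List[str]
    stances: List[str]


def check_mapping(item: MappedResponse) -> List[str]:
    """Return a list of guideline violations; an empty list means the item passes."""
    problems: List[str] = []
    # Reasons and stances must line up one-to-one with labels.
    if not (len(item.reasons) == len(item.labels) == len(item.stances)):
        problems.append("reasons, labels and stances differ in length")
    # Each topic may be assigned only once per response.
    if len(set(item.labels)) != len(item.labels):
        problems.append("a topic_id is assigned more than once")
    # Labels must be alphabetic topic_ids, not full topic descriptions.
    if any(not label.isalpha() for label in item.labels):
        problems.append("a label is not an alphabetic topic_id")
    # Stances are restricted to the two allowed values.
    if any(stance not in VALID_STANCES for stance in item.stances):
        problems.append("a stance is neither POSITIVE nor NEGATIVE")
    return problems


if __name__ == "__main__":
    item = MappedResponse(7, ["mentions cost"], ["A"], ["NEGATIVE"])
    print(check_mapping(item))
```

A response that could not be labelled passes with three empty lists, which matches the prompt's instruction to return empty sections while keeping the response ID.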
themefinder/prompts/theme_refinement.txt +7 -16
@@ -6,9 +6,12 @@ You are tasked with refining a list of topics generated from responses to a ques
 You will receive a list of TOPICS. These topics explicitly tie opinions to whether a person agrees or disagrees with the question.
 
 ## Output
-You will produce a list of CLEAR STANCE TOPICS based on the input. Each topic should have
-1. A
-2. A
+You will produce a list of CLEAR STANCE TOPICS based on the input. Each topic should have four parts:
+1. A topic_id that is an uppercase letter (starting from 'A', for the 27th element use AA)
+2. A brief, clear topic label (3-7 words)
+3. A more detailed topic description (1-2 sentences)
+4. The source_topic_count field should be included for each topic and should reflect the number of original source topics that were merged to create this refined topic. If multiple source topics were combined, sum their individual counts. If only one source topic was used, simply retain its original count value.
+
 
 ## Guidelines
 
@@ -46,20 +49,8 @@ You will produce a list of CLEAR STANCE TOPICS based on the input. Each topic sh
 b. Create a neutral, concise topic label.
 c. Write a more detailed description that provides context without taking sides.
 4. Review the entire list to ensure distinctiveness and adjust as needed.
-5. Assign each output topic a topic_id
+5. Assign each output topic a topic_id that is an uppercase letter (starting from 'A', for the 27th element use AA)
 6. Combine the topic label and description with a colon separator
 
-Return your output in the following JSON format:
-{{
-"responses": [
-{{"topic_id": "A", "topic": "{{topic label 1}}: {{topic description 1}}", "source_topic_count": {{count1}}}},
-{{"topic_id": "B", "topic": "{{topic label 2}}: {{topic description 2}}", "source_topic_count": {{count2}}}},
-{{"topic_id": "C", "topic": "{{topic label 3}}: {{topic description 3}}", "source_topic_count": {{count3}}}},
-// Additional topics as necessary
-]
-}}
-
-
-
 TOPICS:
 {responses}
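The refinement prompt asks for topic IDs that start at 'A' and use 'AA' for the 27th element. One way to generate such IDs is spreadsheet-style column labelling, sketched below. The prompt itself only fixes 'A' and 'AA'; the behaviour for other positions (for example 28 giving 'AB') is an assumption of this sketch, and `topic_id_for` is a hypothetical helper, not a themefinder function.

```python
from string import ascii_uppercase


def topic_id_for(position: int) -> str:
    """Map a 1-based position to a letter ID: 1 -> 'A', 26 -> 'Z', 27 -> 'AA'."""
    if position < 1:
        raise ValueError("position must be >= 1")
    letters = []
    while position > 0:
        # Shift to 0-based before dividing so that 26 maps to 'Z', not 'A0'.
        position, remainder = divmod(position - 1, 26)
        letters.append(ascii_uppercase[remainder])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(letters))


if __name__ == "__main__":
    print([topic_id_for(n) for n in (1, 2, 26, 27, 28)])
```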
themefinder/prompts/theme_target_alignment.txt +2 -10
@@ -10,17 +10,9 @@ Requirements:
 - Each consolidated theme should capture all relevant information from its source themes
 - Final descriptions should be concise but thorough
 - The merged themes should be distinct from each other with minimal overlap
+- The source_topic_count field should be included for each theme and represent the sum of all source themes that were combined to create it
+- You cannot return more than {target_n_themes}
 
-Return your output in the following JSON format:
-
-{{
-"responses": [
-{{"topic_id": "A", "topic": "{{topic label 1}}: {{topic description 1}}"}},
-{{"topic_id": "B", "topic": "{{topic label 2}}: {{topic description 2}}"}},
-{{"topic_id": "C", "topic": "{{topic label 3}}: {{topic description 3}}"}},
-// Additional topics as necessary
-]
-}}
 
 Themes to analyze:
 {responses}
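The two requirements added to this prompt (a summed `source_topic_count` and a cap of `target_n_themes`) describe simple bookkeeping that can be illustrated without an LLM. The sketch below assumes the grouping of source themes is already known; `Theme` and `consolidate` are hypothetical names chosen for illustration and do not come from the package.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Theme:
    """Hypothetical minimal theme record (not the package's model)."""

    topic: str
    source_topic_count: int


def consolidate(
    themes: Dict[str, Theme],
    groups: Dict[str, List[str]],
    target_n_themes: int,
) -> List[Theme]:
    """Merge source themes into consolidated themes, summing their counts.

    `groups` maps each consolidated topic text to the IDs of the source
    themes it absorbs.
    """
    # The prompt forbids returning more than target_n_themes themes.
    if len(groups) > target_n_themes:
        raise ValueError("more consolidated themes than target_n_themes")
    merged = []
    for topic, source_ids in groups.items():
        # source_topic_count is the sum over all combined source themes.
        total = sum(themes[source_id].source_topic_count for source_id in source_ids)
        merged.append(Theme(topic=topic, source_topic_count=total))
    return merged


if __name__ == "__main__":
    source = {"A": Theme("Cost", 3), "B": Theme("Price", 2), "C": Theme("Speed", 5)}
    print(consolidate(source, {"Cost and price": ["A", "B"], "Speed": ["C"]}, 2))
```

When every source theme lands in exactly one group, the counts before and after consolidation add up to the same total, which makes a convenient sanity check on model output.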
themefinder/theme_clustering_agent.py +332 -0
@@ -0,0 +1,332 @@
+"""Theme clustering agent for hierarchical topic organization.
+
+This module provides the ThemeClusteringAgent class for performing iterative
+hierarchical clustering of topics using a language model.
+"""
+
+import json
+import logging
+from typing import Dict, List, Any
+
+import pandas as pd
+from langchain.schema.runnable import Runnable
+from tenacity import (
+    before,
+    before_sleep_log,
+    retry,
+    stop_after_attempt,
+    wait_random_exponential,
+)
+
+from .models import ThemeNode
+from .llm_batch_processor import load_prompt_from_file
+from .themefinder_logging import logger
+
+
+class ThemeClusteringAgent:
+    """Agent for performing hierarchical clustering of topics using language models.
+
+    This class manages the iterative process of merging similar topics into a
+    hierarchical structure using an LLM to identify semantic relationships and
+    create meaningful parent-child topic relationships.
+
+    Attributes:
+        llm: Language model configured with structured output for clustering
+        themes: Dictionary mapping topic IDs to ThemeNode objects
+        active_themes: Set of topic IDs that are currently active for clustering
+        current_iteration: Current iteration number in the clustering process
+    """
+
+    def __init__(self, llm: Runnable, themes: List[ThemeNode]) -> None:
+        """Initialize the clustering agent with an LLM and initial themes.
+
+        Args:
+            llm: Language model instance configured with structured output
+                for HierarchicalClusteringResponse
+            themes: List of ThemeNode objects to be clustered
+        """
+        self.llm = llm
+        self.themes: Dict[str, ThemeNode] = {}
+        for theme in themes:
+            self.themes[theme.topic_id] = theme
+        self.active_themes = set(self.themes.keys())
+        self.current_iteration = 0
+
+    def _format_prompt(self) -> str:
+        """Format the clustering prompt with current active themes.
+
+        Creates a JSON representation of all currently active themes and
+        formats them into the clustering prompt template.
+
+        Returns:
+            str: Formatted prompt string ready for LLM processing
+        """
+        themes_for_prompt = []
+        for active_id in self.active_themes:
+            theme_dict = {
+                "topic_id": self.themes[active_id].topic_id,
+                "topic_label": self.themes[active_id].topic_label,
+                "topic_description": self.themes[active_id].topic_description,
+            }
+            themes_for_prompt.append(theme_dict)
+        themes_json = json.dumps(themes_for_prompt, indent=2)
+
+        # Load the clustering prompt template
+        prompt_template = load_prompt_from_file("agentic_theme_clustering")
+        return prompt_template.format(
+            themes_json=themes_json, iteration=self.current_iteration
+        )
+
+    @retry(
+        wait=wait_random_exponential(min=1, max=2),
+        stop=stop_after_attempt(3),
+        before=before.before_log(logger=logger, log_level=logging.DEBUG),
+        before_sleep=before_sleep_log(logger, logging.ERROR),
+        reraise=True,
+    )
+    def cluster_iteration(self) -> None:
+        """Perform one iteration of hierarchical theme clustering.
+
+        Uses the configured LLM to identify semantically similar themes
+        and merge them into parent themes. Updates the theme hierarchy
+        and active theme set based on the clustering results.
+
+        The method includes retry logic to handle transient API failures
+        and will automatically retry up to 3 times with exponential backoff.
+
+        Side Effects:
+            - Creates new parent ThemeNode objects in self.themes
+            - Updates parent_id relationships for child themes
+            - Modifies self.active_themes set
+            - Increments self.current_iteration
+        """
+        prompt = self._format_prompt()
+        response = self.llm.invoke(prompt)
+        # The response is already a parsed dictionary when using with_structured_output
+        result = response
+        for i, parent in enumerate(result["parent_themes"]):
+            new_theme_id = f"{chr(65 + i)}_{self.current_iteration}"
+            children = [c for c in parent["children"] if c in self.active_themes]
+            for child in children:
+                self.themes[child].parent_id = new_theme_id
+            total_source_count = sum(
+                self.themes[child_id].source_topic_count for child_id in children
+            )
+            new_theme = ThemeNode(
+                topic_id=new_theme_id,
+                topic_label=parent["topic_label"],
+                topic_description=parent["topic_description"],
+                source_topic_count=total_source_count,
+                children=children,
+            )
+            self.themes[new_theme_id] = new_theme
+            self.active_themes.add(new_theme_id)
+            for child in children:
+                self.active_themes.remove(child)
+        self.current_iteration += 1
+
+    def cluster_themes(
+        self, max_iterations: int = 5, target_themes: int = 5
+    ) -> pd.DataFrame:
+        """Perform hierarchical clustering to reduce themes to target number.
+
+        Iteratively merges similar themes using the clustering agent until
+        either the maximum iterations is reached or the target number of
+        themes is achieved. Creates a root node to represent the complete
+        hierarchy.
+
+        Args:
+            max_iterations: Maximum number of clustering iterations to perform
+            target_themes: Target number of themes to cluster down to
+
+        Returns:
+            pd.DataFrame: DataFrame containing all theme nodes (excluding root)
+                with their hierarchical relationships and metadata
+        """
+        logger.info(f"Starting clustering with {len(self.active_themes)} active themes")
+        while (
+            self.current_iteration <= max_iterations
+            and len(self.active_themes) > target_themes
+        ):
+            self.cluster_iteration()
+            logger.info(
+                f"After {self.current_iteration} iterations {len(self.active_themes)} active themes remaining"
+            )
+        root_node = ThemeNode(
+            topic_id="0",
+            topic_label="All Topics",
+            topic_description="",
+            source_topic_count=sum(
+                self.themes[theme_id].source_topic_count
+                for theme_id in self.active_themes
+            ),
+            children=list(self.active_themes),
+        )
+        self.themes["0"] = root_node
+        for theme in self.active_themes:
+            self.themes[theme].parent_id = "0"
+
+        # Convert all themes (except root) to DataFrame
+        theme_nodes_dicts = [
+            node.model_dump() for node in self.themes.values() if node.topic_id != "0"
+        ]
+        return pd.DataFrame(theme_nodes_dicts)
+
+    def convert_themes_to_tree_json(self) -> str:
+        """Convert themes into a hierarchical JSON structure for visualization.
+
+        Creates a nested JSON structure starting from the root node (ID '0')
+        that represents the complete theme hierarchy. Each node includes
+        metadata and references to its children.
+
+        Returns:
+            str: JSON string representing the hierarchical tree structure
+                suitable for JavaScript tree visualization libraries
+        """
+
+        def build_tree(node: ThemeNode) -> Dict[str, Any]:
+            return {
+                "id": node.topic_id,
+                "name": node.topic_label,
+                "description": node.topic_description,
+                "value": node.source_topic_count,
+                "children": [
+                    build_tree(self.themes[child_id])
+                    for child_id in node.children
+                    if child_id in self.themes
+                ],
+            }
+
+        tree_data = build_tree(self.themes["0"])
+        return json.dumps(tree_data, indent=2)
+
+    def select_significant_themes(
+        self, significance_threshold: int, total_responses: int
+    ) -> Dict[str, Any]:
+        """Select significant themes using depth-first traversal.
+
+        Performs a depth-first search on the theme hierarchy to identify
+        themes that meet the significance threshold. Prioritizes leaf nodes
+        when possible, but selects parent nodes when children don't meet
+        the threshold.
+
+        Args:
+            significance_threshold: Minimum source_topic_count for significance
+            total_responses: Total number of responses across all themes
+
+        Returns:
+            Dict containing selected theme nodes and metadata
+        """
+        # Track selected nodes
+        selected_nodes: List[Dict[str, Any]] = []
+
+        # Perform the DFS selection
+        self._traverse_tree(self.themes["0"], selected_nodes, significance_threshold)
+
+        # Format the final result
+        result = {"selected_nodes": selected_nodes, "total_responses": total_responses}
+
+        return result
+
+    def _traverse_tree(
+        self,
+        node: ThemeNode,
+        selected_nodes: List[Dict[str, Any]],
+        significance_threshold: int,
+    ) -> bool:
+        """Recursively traverse theme tree to select significant nodes.
+
+        Implements depth-first traversal logic for theme selection:
+        1. For leaf nodes: always select
+        2. For parent nodes: select if no significant children exist
+        3. For significant children: recursively process them
+
+        Args:
+            node: Current ThemeNode being processed
+            selected_nodes: List to accumulate selected theme dictionaries
+            significance_threshold: Minimum source_topic_count for significance
+
+        Returns:
+            bool: True if this node or descendants were selected, False otherwise
+        """
+        # Base case: if node has no children (leaf node)
+        if not node.children:
+            selected_nodes.append(
+                {
+                    "id": node.topic_id,
+                    "name": node.topic_label,
+                    "value": node.source_topic_count,
+                }
+            )
+            return True
+
+        # Check if any children are significant
+        has_significant_children = any(
+            self.themes[child_id].source_topic_count >= significance_threshold
+            for child_id in node.children
+            if child_id in self.themes
+        )
+
+        # If no significant children, select this node
+        if not has_significant_children:
+            selected_nodes.append(
+                {
+                    "id": node.topic_id,
+                    "name": node.topic_label,
+                    "value": node.source_topic_count,
+                }
+            )
+            return True
+
+        # If significant children exist, recursively process them
+        any_selected = False
+        for child_id in node.children:
+            if child_id in self.themes:
+                if self._traverse_tree(
+                    self.themes[child_id], selected_nodes, significance_threshold
+                ):
+                    any_selected = True
+
+        # If none of the children were selected, select this node
+        if not any_selected:
+            selected_nodes.append(
+                {
+                    "id": node.topic_id,
+                    "name": node.topic_label,
+                    "value": node.source_topic_count,
+                }
+            )
+            return True
+
+        return any_selected
+
+    def select_themes(self, significance_percentage: float) -> pd.DataFrame:
+        """Select themes that meet the significance threshold.
+
+        Calculates the significance threshold based on the percentage of total
+        responses and returns only themes that meet or exceed this threshold.
+        Excludes the root node from results.
+
+        Args:
+            significance_percentage: Percentage (0-100) of total responses
+                required for a theme to be considered significant
+
+        Returns:
+            pd.DataFrame: DataFrame containing significant theme data,
+                excluding the root node (topic_id='0')
+        """
+        total_responses = self.themes["0"].source_topic_count
+        # Convert percentage to absolute threshold
+        significance_threshold = int(total_responses * (significance_percentage / 100))
+
+        # Filter themes that meet the significance threshold
+        significant_themes = [
+            theme_node
+            for theme_node in self.themes.values()
+            if theme_node.source_topic_count >= significance_threshold
+        ]
+        # Convert significant themes to DataFrame, excluding root node
+        theme_nodes_dicts = [
+            node.model_dump() for node in significant_themes if node.topic_id != "0"
+        ]
+        return pd.DataFrame(theme_nodes_dicts)
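The selection logic in `_traverse_tree` and the percentage-to-threshold conversion in `select_themes` are easier to follow on a toy hierarchy. The sketch below is a simplified, standard-library-only re-statement of that traversal, not the package's code: `Node` is an assumed stand-in for `ThemeNode` carrying only the fields the traversal reads, and `select` and `threshold_from_percentage` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Simplified stand-in for ThemeNode (assumed fields only)."""

    topic_id: str
    source_topic_count: int
    children: List[str] = field(default_factory=list)


def threshold_from_percentage(total: int, percentage: float) -> int:
    """Convert a 0-100 percentage into an absolute count, truncating like int()."""
    return int(total * (percentage / 100))


def select(nodes: Dict[str, Node], node_id: str, threshold: int, out: List[str]) -> None:
    """Depth-first selection mirroring the rules in the docstring above."""
    node = nodes[node_id]
    kids = [nodes[c] for c in node.children if c in nodes]
    # A leaf, or a parent none of whose children reaches the threshold,
    # is selected itself and the walk stops there.
    if not kids or not any(k.source_topic_count >= threshold for k in kids):
        out.append(node.topic_id)
        return
    # Otherwise descend into every child, significant or not.
    for kid in kids:
        select(nodes, kid.topic_id, threshold, out)


if __name__ == "__main__":
    tree = {
        "0": Node("0", 80, ["A_0", "B_0"]),
        "A_0": Node("A_0", 60, ["A", "B"]),
        "B_0": Node("B_0", 20, ["C", "D"]),
        "A": Node("A", 50),
        "B": Node("B", 10),
        "C": Node("C", 15),
        "D": Node("D", 5),
    }
    chosen: List[str] = []
    select(tree, "0", threshold_from_percentage(80, 25), chosen)
    print(chosen)
```

With a 25% threshold on 80 responses (an absolute threshold of 20), `B_0` is kept as a single theme because neither of its children reaches 20, while `A_0` is expanded into its leaves. Note that leaf `B` is still selected despite its small count, since leaves reached during the descent are always selected.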
{themefinder-0.6.2.dist-info → themefinder-0.7.0.dist-info}/METADATA +24 -9
@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: themefinder
-Version: 0.
+Version: 0.7.0
 Summary: A topic modelling Python package designed for analysing one-to-many question-answer data eg free-text survey responses.
 License: MIT
 Author: i.AI
@@ -49,9 +49,9 @@ ThemeFinder takes as input a [pandas DataFrame](https://pandas.pydata.org/docs/r
 - `response_id`: A unique identifier for each response
 - `response`: The free text survey response
 
-ThemeFinder
+ThemeFinder now supports a range of language models through structured outputs.
 
-The function `find_themes` identifies common themes in
+The function `find_themes` identifies common themes in responses and labels them, it also outputs results from intermediate steps in the theme finding pipeline.
 
 For this example, import the following Python packages into your virtual environment: `asyncio`, `pandas`, `lanchain`. And import `themefinder` as described above.
 
@@ -81,7 +81,6 @@ load_dotenv()
 llm = AzureChatOpenAI(
     model="gpt-4o",
     temperature=0,
-    model_kwargs={"response_format": {"type": "json_object"}},
 )
 
 # Set up your data
@@ -97,18 +96,15 @@ question = "What do you think of ThemeFinder?"
 # Make the system prompt specific to your use case
 system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."
 
-# Run the function to find themes
-# We use asyncio to query LLM endpoints asynchronously, so we need to await our function
+# Run the function to find themes, we use asyncio to query LLM endpoints asynchronously, so we need to await our function
 async def main():
     result = await find_themes(responses_df, llm, question, system_prompt=system_prompt)
     print(result)
 
 if __name__ == "__main__":
     asyncio.run(main())
-
 ```
 
-
 ## ThemeFinder pipeline
 
 ThemeFinder's pipeline consists of five distinct stages, each utilizing a specialized LLM prompt:
@@ -145,6 +141,25 @@ The file `src/themefinder.core.py` contains the function `find_themes` which run
 **For more detail - see the docs: [https://i-dot-ai.github.io/themefinder/](https://i-dot-ai.github.io/themefinder/).**
 
 
+## Model Compatibility
+
+ThemeFinder's structured output approach makes it compatible with a wide range of language models from various providers. This list is non-exhaustive, and other models may also work effectively:
+
+### OpenAI Models
+- GPT-4, GPT-4o, GPT-4.1
+- All Azure OpenAI deployments
+
+### Google Models
+- Gemini series (1.5 Pro, 2.0 Pro, etc.)
+
+### Anthropic Models
+- Claude series (Claude 3 Opus, Sonnet, Haiku, etc.)
+
+### Open Source Models
+- Llama 2, Llama 3
+- Mistral models (e.g., Mistral 7B, Mixtral)
+
+
 ## License
 
 This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
@@ -154,5 +169,5 @@ The documentation is [© Crown copyright](https://www.nationalarchives.gov.uk/in
 
 ## Feedback
 
-
+Contact us with questions or feedback at packages@cabinetoffice.gov.uk.
 
themefinder-0.7.0.dist-info/RECORD +19 -0
@@ -0,0 +1,19 @@
+themefinder/__init__.py,sha256=k3D3TpAvRdcXXZbHc_Lb7DsB53JwoGA0S4Ap5iX7PEw,477
+themefinder/core.py,sha256=mqToJ-ggx8JyholNMUwFDcAT35dWX8Hnt3BJzdaNgS0,26219
+themefinder/llm_batch_processor.py,sha256=Z9jm9Kr-6GD8g8kLkgdW97onjUbLLQ2M1YKwok39Q6Y,17652
+themefinder/models.py,sha256=JopmD4F23Mteh60m6WDpsuTs58dRc0tUbVX-d-L8Gv8,14680
+themefinder/prompts/agentic_theme_clustering.txt,sha256=6bHLpgZUQEaZXpLUB7EcMEbtXGqQ_1yniqZ6ZBJHFn0,1917
+themefinder/prompts/consultation_system_prompt.txt,sha256=_A07oY_an4hnRx-9pQ0y-TLXJz0dd8vDI-MZne7Mdb4,89
+themefinder/prompts/detail_detection.txt,sha256=6Vr_oN7rF5BCFipnCIHTSF8MmjerGyCixRWRT3vni1U,941
+themefinder/prompts/sentiment_analysis.txt,sha256=vYCDhtEsG5I9xixwVhZbvKPJGU1Gqpw4-xAqGz72xhU,1671
+themefinder/prompts/theme_condensation.txt,sha256=pHWuCtfU58gdtP2BfGZWOTvcb0MnTpb9OhOCGtkJv8U,1672
+themefinder/prompts/theme_generation.txt,sha256=QRKW7DtcMSb2olT6j5jmdEPcXPMeZgogM-NYddEIKRk,1871
+themefinder/prompts/theme_mapping.txt,sha256=HtGuStm-622TIEaqdb9LTaBs9xE-n9lvmcGQTG2_JOQ,2042
+themefinder/prompts/theme_refinement.txt,sha256=evWMCIEdeZCJ8zn4SBNgP6bmfAb0vzKiR5C5wfAjkUk,2649
+themefinder/prompts/theme_target_alignment.txt,sha256=g7AVZLiP_xIH010X5SIZyG3q7gA6OBAplPv3xvmstOY,855
+themefinder/theme_clustering_agent.py,sha256=Ie-5MFvIo7ukeeDXNpLawJXqLqBb6kvUGgSH6uTGL20,12826
+themefinder/themefinder_logging.py,sha256=n5SUQovEZLC4skEbxicjz_fOGF9mOk3S-Wpj5uXsaL8,314
+themefinder-0.7.0.dist-info/LICENCE,sha256=C9ULIN0ctF60ZxUWH_hw1H434bDLg49Z-Qzn6BUHgqs,1060
+themefinder-0.7.0.dist-info/METADATA,sha256=-PRjz0RTxp-yJsuavj8tw5NwtC1amsw12JyKNOitxZw,6737
+themefinder-0.7.0.dist-info/WHEEL,sha256=b4K_helf-jlQoXBBETfwnf4B04YC67LOev0jo4fX5m8,88
+themefinder-0.7.0.dist-info/RECORD,,
themefinder-0.6.2.dist-info/RECORD +0 -16
@@ -1,16 +0,0 @@
-themefinder/__init__.py,sha256=wSpW2fEnC4gTzbeNC78nSD3DpJq43-h_H-LK_cqt1cw,327
-themefinder/core.py,sha256=u1DY9gbzn-tFhQS3hrXQ8_1mIbR-iBWYVAdKeAX1BdE,18304
-themefinder/llm_batch_processor.py,sha256=OrFEl1nSi5ninbSZSiE1HFMcYZiQ-NzuYPj_iDcPPoE,19988
-themefinder/models.py,sha256=Y5-okndYwtBO09n_qUlYNVmHRVNEnJviArQZukm8Ox8,4251
-themefinder/prompts/consultation_system_prompt.txt,sha256=_A07oY_an4hnRx-9pQ0y-TLXJz0dd8vDI-MZne7Mdb4,89
-themefinder/prompts/sentiment_analysis.txt,sha256=9-LkdR95JTHXRKUXknAgNf86uVdv6jSaXMf-OtFL9_0,1948
-themefinder/prompts/theme_condensation.txt,sha256=DB4pqUmMpo0OG4AZWGTj0FfLFfjbX6wOMUr44HBxZ1o,2433
-themefinder/prompts/theme_generation.txt,sha256=JMXuNojxdSAcxPRU1Jg12Xunv_dX4hNvXYU2pXMWTAw,2500
-themefinder/prompts/theme_mapping.txt,sha256=YcRGMkuTyTPzPQPtsDY31DUwX60c8AdmdHKw0XeUejQ,2258
-themefinder/prompts/theme_refinement.txt,sha256=hBXwZnNZmhmoEFXpY5OJinp-7xxdoDRf_5LmgrilYgc,2713
-themefinder/prompts/theme_target_alignment.txt,sha256=-_ghr4--KAN6Tz8ExO9s2IXvI6pjWaEA_nG5L83GV5I,1035
-themefinder/themefinder_logging.py,sha256=n5SUQovEZLC4skEbxicjz_fOGF9mOk3S-Wpj5uXsaL8,314
-themefinder-0.6.2.dist-info/LICENCE,sha256=C9ULIN0ctF60ZxUWH_hw1H434bDLg49Z-Qzn6BUHgqs,1060
-themefinder-0.6.2.dist-info/METADATA,sha256=gI9Hp754EjopJQWw0QZIPb9dex8TalPMGnorUEOJlp0,6498
-themefinder-0.6.2.dist-info/WHEEL,sha256=fGIA9gx4Qxk2KDKeNJCbOEwSrmLtjWCwzBz351GyrPQ,88
-themefinder-0.6.2.dist-info/RECORD,,
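Each RECORD line in a wheel has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed and the size is in bytes. The sketch below shows how such an entry could be recomputed to check a file against its RECORD line; `record_entry` and `matches_record` are hypothetical helper names, and the sketch assumes paths without commas (real RECORD files are CSV and may quote such paths).

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD-style line for `data` stored at `path`."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64 with the '=' padding stripped, as used in RECORD files.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


def matches_record(line: str, data: bytes) -> bool:
    """Return True if `data` has the hash and size recorded in `line`."""
    # Split from the right so only the last two commas are treated as separators.
    path = line.rsplit(",", 2)[0]
    return line == record_entry(path, data)


if __name__ == "__main__":
    content = b"example file content\n"
    line = record_entry("pkg/example.txt", content)
    print(line, matches_record(line, content))
```

The RECORD file's own entry carries no hash or size (it ends in `,,`), so a check like this would skip that line.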
{themefinder-0.6.2.dist-info → themefinder-0.7.0.dist-info}/LICENCE +0 -0
File without changes