themefinder 0.6.2__tar.gz → 0.6.3__tar.gz

This diff represents the content of publicly available package versions released to a supported registry. It is provided for informational purposes only and reflects the changes between the two versions as they appear in the public registry.

Potentially problematic release.

@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: themefinder
-Version: 0.6.2
+Version: 0.6.3
 Summary: A topic modelling Python package designed for analysing one-to-many question-answer data eg free-text survey responses.
 License: MIT
 Author: i.AI
@@ -49,9 +49,9 @@ ThemeFinder takes as input a [pandas DataFrame](https://pandas.pydata.org/docs/r
 - `response_id`: A unique identifier for each response
 - `response`: The free text survey response
 
-ThemeFinder is compatible with any instantiated [LangChain LLM runnable](https://python.langchain.com/v0.1/docs/integrations/llms/), but you will need to use JSON structured output.
+ThemeFinder now supports a range of language models through structured outputs.
 
-The function `find_themes` identifies common themes in response and labels them, it also outputs results from intermediate steps in the theme finding pipeline.
+The function `find_themes` identifies common themes in responses and labels them, it also outputs results from intermediate steps in the theme finding pipeline.
 
 For this example, import the following Python packages into your virtual environment: `asyncio`, `pandas`, `lanchain`. And import `themefinder` as described above.
 
@@ -81,7 +81,6 @@ load_dotenv()
 llm = AzureChatOpenAI(
     model="gpt-4o",
     temperature=0,
-    model_kwargs={"response_format": {"type": "json_object"}},
 )
 
 # Set up your data
@@ -97,18 +96,15 @@ question = "What do you think of ThemeFinder?"
 # Make the system prompt specific to your use case
 system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."
 
-# Run the function to find themes
-# We use asyncio to query LLM endpoints asynchronously, so we need to await our function
+# Run the function to find themes, we use asyncio to query LLM endpoints asynchronously, so we need to await our function
 async def main():
     result = await find_themes(responses_df, llm, question, system_prompt=system_prompt)
     print(result)
 
 if __name__ == "__main__":
     asyncio.run(main())
-
 ```
 
-
 ## ThemeFinder pipeline
 
 ThemeFinder's pipeline consists of five distinct stages, each utilizing a specialized LLM prompt:
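
Taken together, the README hunks above leave the 0.6.3 quick start looking roughly like the sketch below. The `AzureChatOpenAI` setup and the `find_themes` call are taken from the diff; the imports and the sample DataFrame are assumptions, since the diff elides them.

```python
# Minimal sketch of the 0.6.3 quick start (imports and sample data assumed).
import asyncio

import pandas as pd
from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI
from themefinder import find_themes

load_dotenv()

# No model_kwargs/response_format needed in 0.6.3: structured output is
# handled inside the pipeline.
llm = AzureChatOpenAI(model="gpt-4o", temperature=0)

responses_df = pd.DataFrame(
    {
        "response_id": [1, 2],
        "response": ["Easy to set up.", "The docs could be clearer."],
    }
)
question = "What do you think of ThemeFinder?"
system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."

async def main():
    result = await find_themes(responses_df, llm, question, system_prompt=system_prompt)
    print(result["themes"])

if __name__ == "__main__":
    asyncio.run(main())
```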
@@ -145,6 +141,25 @@ The file `src/themefinder.core.py` contains the function `find_themes` which run
 **For more detail - see the docs: [https://i-dot-ai.github.io/themefinder/](https://i-dot-ai.github.io/themefinder/).**
 
 
+## Model Compatibility
+
+ThemeFinder's structured output approach makes it compatible with a wide range of language models from various providers. This list is non-exhaustive, and other models may also work effectively:
+
+### OpenAI Models
+- GPT-4, GPT-4o, GPT-4.1
+- All Azure OpenAI deployments
+
+### Google Models
+- Gemini series (1.5 Pro, 2.0 Pro, etc.)
+
+### Anthropic Models
+- Claude series (Claude 3 Opus, Sonnet, Haiku, etc.)
+
+### Open Source Models
+- Llama 2, Llama 3
+- Mistral models (e.g., Mistral 7B, Mixtral)
+
+
 ## License
 
 This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
@@ -18,9 +18,9 @@ ThemeFinder takes as input a [pandas DataFrame](https://pandas.pydata.org/docs/r
 - `response_id`: A unique identifier for each response
 - `response`: The free text survey response
 
-ThemeFinder is compatible with any instantiated [LangChain LLM runnable](https://python.langchain.com/v0.1/docs/integrations/llms/), but you will need to use JSON structured output.
+ThemeFinder now supports a range of language models through structured outputs.
 
-The function `find_themes` identifies common themes in response and labels them, it also outputs results from intermediate steps in the theme finding pipeline.
+The function `find_themes` identifies common themes in responses and labels them, it also outputs results from intermediate steps in the theme finding pipeline.
 
 For this example, import the following Python packages into your virtual environment: `asyncio`, `pandas`, `lanchain`. And import `themefinder` as described above.
 
@@ -50,7 +50,6 @@ load_dotenv()
 llm = AzureChatOpenAI(
     model="gpt-4o",
     temperature=0,
-    model_kwargs={"response_format": {"type": "json_object"}},
 )
 
 # Set up your data
@@ -66,18 +65,15 @@ question = "What do you think of ThemeFinder?"
 # Make the system prompt specific to your use case
 system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."
 
-# Run the function to find themes
-# We use asyncio to query LLM endpoints asynchronously, so we need to await our function
+# Run the function to find themes, we use asyncio to query LLM endpoints asynchronously, so we need to await our function
 async def main():
     result = await find_themes(responses_df, llm, question, system_prompt=system_prompt)
     print(result)
 
 if __name__ == "__main__":
     asyncio.run(main())
-
 ```
 
-
 ## ThemeFinder pipeline
 
 ThemeFinder's pipeline consists of five distinct stages, each utilizing a specialized LLM prompt:
@@ -114,6 +110,25 @@ The file `src/themefinder.core.py` contains the function `find_themes` which run
 **For more detail - see the docs: [https://i-dot-ai.github.io/themefinder/](https://i-dot-ai.github.io/themefinder/).**
 
 
+## Model Compatibility
+
+ThemeFinder's structured output approach makes it compatible with a wide range of language models from various providers. This list is non-exhaustive, and other models may also work effectively:
+
+### OpenAI Models
+- GPT-4, GPT-4o, GPT-4.1
+- All Azure OpenAI deployments
+
+### Google Models
+- Gemini series (1.5 Pro, 2.0 Pro, etc.)
+
+### Anthropic Models
+- Claude series (Claude 3 Opus, Sonnet, Haiku, etc.)
+
+### Open Source Models
+- Llama 2, Llama 3
+- Mistral models (e.g., Mistral 7B, Mixtral)
+
+
 ## License
 
 This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
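
Both README copies now advertise structured outputs across providers rather than OpenAI-specific JSON mode, so a non-OpenAI chat model should drop into the same call. A hedged sketch, assuming `langchain_anthropic` is installed and that `responses_df`, `question`, and `system_prompt` are defined as in the quick start:

```python
# Sketch: swapping in an Anthropic model; the model name is illustrative.
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-5-sonnet-20240620", temperature=0)
# Run inside an async context, as in the quick start:
# result = await find_themes(responses_df, llm, question, system_prompt=system_prompt)
```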
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "themefinder"
-version = "0.6.2"
+version = "0.6.3"
 description = "A topic modelling Python package designed for analysing one-to-many question-answer data eg free-text survey responses."
 authors = ["i.AI <packages@cabinetoffice.gov.uk>"]
 packages = [{include = "themefinder", from = "src"}]
@@ -5,6 +5,8 @@ from .core import (
     theme_generation,
     theme_mapping,
     theme_refinement,
+    theme_target_alignment,
+    detail_detection,
 )
 
 __all__ = [
@@ -13,6 +15,8 @@ __all__ = [
     "theme_generation",
     "theme_condensation",
     "theme_refinement",
+    "theme_target_alignment",
     "theme_mapping",
+    "detail_detection",
 ]
 __version__ = "0.1.0"
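
With the `__init__.py` change above, the two new stages are importable from the package root alongside the existing entry points. A minimal sketch, assuming a 0.6.3 install:

```python
# Sketch: importing the stages added in 0.6.3 from the package root.
from themefinder import (
    find_themes,
    theme_target_alignment,  # new in 0.6.3
    detail_detection,        # new in 0.6.3
)
```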
@@ -3,10 +3,17 @@ from pathlib import Path
 
 import pandas as pd
 from langchain_core.prompts import PromptTemplate
-from langchain_core.runnables import Runnable
+from langchain.schema.runnable import RunnableWithFallbacks
 
 from .llm_batch_processor import batch_and_run, load_prompt_from_file
-from .models import SentimentAnalysisOutput, ThemeMappingOutput
+from .models import (
+    SentimentAnalysisResponses,
+    ThemeGenerationResponses,
+    ThemeCondensationResponses,
+    ThemeRefinementResponses,
+    ThemeMappingResponses,
+    DetailDetectionResponses,
+)
 from .themefinder_logging import logger
 
 CONSULTATION_SYSTEM_PROMPT = load_prompt_from_file("consultation_system_prompt")
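
The import swap above means the public signatures now name `RunnableWithFallbacks` instead of the generic `Runnable`. LangChain builds that wrapper with `with_fallbacks()`; a sketch, with placeholder model choices:

```python
# Sketch: constructing the RunnableWithFallbacks the 0.6.3 signatures expect.
from langchain_openai import AzureChatOpenAI

primary = AzureChatOpenAI(model="gpt-4o", temperature=0)
backup = AzureChatOpenAI(model="gpt-4o-mini", temperature=0)

# with_fallbacks() returns a RunnableWithFallbacks that retries on the
# backup model when a call to the primary raises.
llm = primary.with_fallbacks([backup])
```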
@@ -14,11 +21,12 @@ CONSULTATION_SYSTEM_PROMPT = load_prompt_from_file("consultation_system_prompt")
 
 async def find_themes(
     responses_df: pd.DataFrame,
-    llm: Runnable,
+    llm: RunnableWithFallbacks,
     question: str,
     target_n_themes: int | None = None,
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
     verbose: bool = True,
+    concurrency: int = 10,
 ) -> dict[str, str | pd.DataFrame]:
     """Process survey responses through a multi-stage theme analysis pipeline.
 
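The new `concurrency` parameter added here is threaded through every stage below. A sketch of calling the updated signature; the specific values are illustrative and the inputs are assumed from the README example:

```python
# Sketch: capping concurrent LLM calls across the whole pipeline.
import asyncio

from themefinder import find_themes

async def run_pipeline(responses_df, llm, question):
    return await find_themes(
        responses_df,
        llm,
        question,
        target_n_themes=8,  # optional consolidation target
        concurrency=5,      # lower this if you hit provider rate limits
    )
```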
@@ -32,7 +40,7 @@
 
     Args:
         responses_df (pd.DataFrame): DataFrame containing survey responses
-        llm (Runnable): Language model instance for text analysis
+        llm (RunnableWithFallbacks): Language model instance for text analysis
         question (str): The survey question
         target_n_themes (int | None, optional): Target number of themes to consolidate to.
             If None, skip theme target alignment step. Defaults to None.
@@ -40,6 +48,7 @@
             Defaults to CONSULTATION_SYSTEM_PROMPT.
         verbose (bool): Whether to show information messages during processing.
             Defaults to True.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         dict[str, str | pd.DataFrame]: Dictionary containing results from each pipeline stage:
@@ -56,21 +65,28 @@
         llm,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     theme_df, _ = await theme_generation(
         sentiment_df,
         llm,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     condensed_theme_df, _ = await theme_condensation(
-        theme_df, llm, question=question, system_prompt=system_prompt
+        theme_df,
+        llm,
+        question=question,
+        system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     refined_theme_df, _ = await theme_refinement(
         condensed_theme_df,
         llm,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     if target_n_themes is not None:
         refined_theme_df, _ = await theme_target_alignment(
@@ -79,6 +95,7 @@
             question=question,
             target_n_themes=target_n_themes,
             system_prompt=system_prompt,
+            concurrency=concurrency,
         )
     mapping_df, mapping_unprocessables = await theme_mapping(
         sentiment_df[["response_id", "response"]],
@@ -86,6 +103,14 @@
         question=question,
         refined_themes_df=refined_theme_df,
         system_prompt=system_prompt,
+        concurrency=concurrency,
+    )
+    detailed_df, _ = await detail_detection(
+        responses_df[["response_id", "response"]],
+        llm,
+        question=question,
+        system_prompt=system_prompt,
+        concurrency=concurrency,
     )
 
     logger.info("Finished finding themes")
@@ -97,17 +122,19 @@
         "sentiment": sentiment_df,
         "themes": refined_theme_df,
         "mapping": mapping_df,
+        "detailed_responses": detailed_df,
         "unprocessables": pd.concat([sentiment_unprocessables, mapping_unprocessables]),
     }
 
 
 async def sentiment_analysis(
     responses_df: pd.DataFrame,
-    llm: Runnable,
+    llm: RunnableWithFallbacks,
     question: str,
     batch_size: int = 20,
     prompt_template: str | Path | PromptTemplate = "sentiment_analysis",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Perform sentiment analysis on survey responses using an LLM.
 
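This hunk also shows the shape of the dictionary `find_themes` now returns, including the new `detailed_responses` key. A sketch of unpacking it; the key names come from the diff, and `result` is assumed from an earlier call:

```python
# Sketch: reading the 0.6.3 result dict (keys as shown in the diff above).
themes_df = result["themes"]                # refined theme definitions
mapping_df = result["mapping"]              # response-to-theme assignments
sentiment_df = result["sentiment"]          # per-response sentiment
detailed_df = result["detailed_responses"]  # new: evidence-rich responses
failed_df = result["unprocessables"]        # rows no stage could process
```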
@@ -117,7 +144,7 @@
     Args:
         responses_df (pd.DataFrame): DataFrame containing survey responses to analyze.
             Must contain 'response_id' and 'response' columns.
-        llm (Runnable): Language model instance to use for sentiment analysis.
+        llm (RunnableWithFallbacks): Language model instance to use for sentiment analysis.
         question (str): The survey question.
         batch_size (int, optional): Number of responses to process in each batch.
             Defaults to 20.
@@ -126,6 +153,7 @@
             or PromptTemplate instance. Defaults to "sentiment_analysis".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -134,32 +162,33 @@
             - The second DataFrame contains the rows that could not be processed by the LLM
 
     Note:
-        The function uses validation_check to ensure responses maintain
+        The function uses integrity_check to ensure responses maintain
         their original order and association after processing.
     """
     logger.info(f"Running sentiment analysis on {len(responses_df)} responses")
-    processed_rows, unprocessable_rows = await batch_and_run(
+    sentiment, unprocessable = await batch_and_run(
         responses_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(SentimentAnalysisResponses),
         batch_size=batch_size,
         question=question,
-        validation_check=True,
-        task_validation_model=SentimentAnalysisOutput,
+        integrity_check=True,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
 
-    return processed_rows, unprocessable_rows
+    return sentiment, unprocessable
 
 
 async def theme_generation(
     responses_df: pd.DataFrame,
-    llm: Runnable,
+    llm: RunnableWithFallbacks,
     question: str,
     batch_size: int = 50,
     partition_key: str | None = "position",
     prompt_template: str | Path | PromptTemplate = "theme_generation",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Generate themes from survey responses using an LLM.
 
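The hunk above captures the release's central change: instead of requesting JSON mode and validating afterwards (`validation_check` plus a task model), each stage now binds a Pydantic schema to the model via `with_structured_output()` and keeps only an `integrity_check`. A sketch of the same pattern with a stand-in schema, since the real response models are not shown in this diff:

```python
# Sketch: the structured-output pattern 0.6.3 adopts. ExampleResponse is a
# stand-in; the real schemas (SentimentAnalysisResponses, etc.) live in
# themefinder's models module.
from langchain_openai import AzureChatOpenAI
from pydantic import BaseModel, Field

class ExampleResponse(BaseModel):
    response_id: int = Field(description="ID of the classified response")
    position: str = Field(description="Sentiment position of the response")

llm = AzureChatOpenAI(model="gpt-4o", temperature=0)
structured_llm = llm.with_structured_output(ExampleResponse)

# invoke() now returns a parsed ExampleResponse rather than raw JSON text.
parsed = structured_llm.invoke("Response 1 says 'Great tool!'. Classify it.")
```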
@@ -168,7 +197,7 @@
     Args:
         responses_df (pd.DataFrame): DataFrame containing survey responses.
             Must include 'response_id' and 'response' columns.
-        llm (Runnable): Language model instance to use for theme generation.
+        llm (RunnableWithFallbacks): Language model instance to use for theme generation.
         question (str): The survey question.
         batch_size (int, optional): Number of responses to process in each batch.
             Defaults to 50.
@@ -181,6 +210,7 @@
             or PromptTemplate instance. Defaults to "theme_generation".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -193,22 +223,24 @@
     generated_themes, _ = await batch_and_run(
         responses_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeGenerationResponses),
         batch_size=batch_size,
         partition_key=partition_key,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     return generated_themes, _
 
 
 async def theme_condensation(
     themes_df: pd.DataFrame,
-    llm: Runnable,
+    llm: RunnableWithFallbacks,
     question: str,
     batch_size: int = 75,
     prompt_template: str | Path | PromptTemplate = "theme_condensation",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
     **kwargs,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Condense and combine similar themes identified from survey responses.
@@ -219,7 +251,7 @@
     Args:
         themes_df (pd.DataFrame): DataFrame containing the initial themes identified
             from survey responses.
-        llm (Runnable): Language model instance to use for theme condensation.
+        llm (RunnableWithFallbacks): Language model instance to use for theme condensation.
         question (str): The survey question.
         batch_size (int, optional): Number of themes to process in each batch.
             Defaults to 100.
@@ -228,6 +260,7 @@
             or PromptTemplate instance. Defaults to "theme_condensation".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -247,10 +280,11 @@
     themes_df, _ = await batch_and_run(
         themes_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeCondensationResponses),
         batch_size=batch_size,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
         **kwargs,
     )
     themes_df = themes_df.sample(frac=1).reset_index(drop=True)
@@ -263,10 +297,11 @@
     themes_df, _ = await batch_and_run(
         themes_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeCondensationResponses),
         batch_size=batch_size,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
         **kwargs,
     )
 
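Each stage shown here also runs standalone, sharing the same `batch_and_run` plumbing. A sketch for the condensation stage, assuming `theme_df`, `llm`, and `question` from the pipeline above:

```python
# Sketch: running theme_condensation on its own with the new concurrency cap.
from themefinder import theme_condensation

async def condense(theme_df, llm, question):
    condensed_df, _ = await theme_condensation(
        theme_df,
        llm,
        question=question,
        concurrency=5,  # illustrative value
    )
    return condensed_df
```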
@@ -276,11 +311,12 @@
 
 async def theme_refinement(
     condensed_themes_df: pd.DataFrame,
-    llm: Runnable,
+    llm: RunnableWithFallbacks,
     question: str,
     batch_size: int = 10000,
     prompt_template: str | Path | PromptTemplate = "theme_refinement",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Refine and standardize condensed themes using an LLM.
 
@@ -292,7 +328,7 @@
     Args:
         condensed_themes (pd.DataFrame): DataFrame containing the condensed themes
             from the previous pipeline stage.
-        llm (Runnable): Language model instance to use for theme refinement.
+        llm (RunnableWithFallbacks): Language model instance to use for theme refinement.
         question (str): The survey question.
         batch_size (int, optional): Number of themes to process in each batch.
             Defaults to 10000.
@@ -301,6 +337,7 @@
             or PromptTemplate instance. Defaults to "theme_refinement".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -319,22 +356,24 @@
     refined_themes, _ = await batch_and_run(
         condensed_themes_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeRefinementResponses),
         batch_size=batch_size,
         question=question,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
     return refined_themes, _
 
 
 async def theme_target_alignment(
     refined_themes_df: pd.DataFrame,
-    llm: Runnable,
+    llm: RunnableWithFallbacks,
     question: str,
     target_n_themes: int = 10,
     batch_size: int = 10000,
     prompt_template: str | Path | PromptTemplate = "theme_target_alignment",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Align themes to target number using an LLM.
 
@@ -346,7 +385,7 @@
     Args:
         refined_themes_df (pd.DataFrame): DataFrame containing the refined themes
             from the previous pipeline stage.
-        llm (Runnable): Language model instance to use for theme alignment.
+        llm (RunnableWithFallbacks): Language model instance to use for theme alignment.
         question (str): The survey question.
         target_n_themes (int, optional): Target number of themes to consolidate to.
             Defaults to 10.
@@ -357,6 +396,7 @@
             or PromptTemplate instance. Defaults to "theme_target_alignment".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -376,23 +416,25 @@
     aligned_themes, _ = await batch_and_run(
         refined_themes_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeRefinementResponses),
         batch_size=batch_size,
         question=question,
         system_prompt=system_prompt,
         target_n_themes=target_n_themes,
+        concurrency=concurrency,
     )
     return aligned_themes, _
 
 
 async def theme_mapping(
     responses_df: pd.DataFrame,
-    llm: Runnable,
+    llm: RunnableWithFallbacks,
     question: str,
     refined_themes_df: pd.DataFrame,
     batch_size: int = 20,
     prompt_template: str | Path | PromptTemplate = "theme_mapping",
     system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
 ) -> tuple[pd.DataFrame, pd.DataFrame]:
     """Map survey responses to refined themes using an LLM.
 
@@ -402,7 +444,7 @@
     Args:
         responses_df (pd.DataFrame): DataFrame containing survey responses.
             Must include 'response_id' and 'response' columns.
-        llm (Runnable): Language model instance to use for theme mapping.
+        llm (RunnableWithFallbacks): Language model instance to use for theme mapping.
         question (str): The survey question.
         refined_themes_df (pd.DataFrame): Single-row DataFrame where each column
             represents a theme (from theme_refinement stage).
@@ -413,6 +455,7 @@
             or PromptTemplate instance. Defaults to "theme_mapping".
         system_prompt (str): System prompt to guide the LLM's behavior.
             Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
 
     Returns:
         tuple[pd.DataFrame, pd.DataFrame]:
@@ -432,17 +475,70 @@
     )
     return transposed_df
 
-    mapping, _ = await batch_and_run(
+    mapping, unprocessable = await batch_and_run(
         responses_df,
         prompt_template,
-        llm,
+        llm.with_structured_output(ThemeMappingResponses),
         batch_size=batch_size,
         question=question,
         refined_themes=transpose_refined_themes(refined_themes_df).to_dict(
             orient="records"
         ),
-        validation_check=True,
-        task_validation_model=ThemeMappingOutput,
+        integrity_check=True,
+        system_prompt=system_prompt,
+        concurrency=concurrency,
+    )
+    return mapping, unprocessable
+
+
+async def detail_detection(
+    responses_df: pd.DataFrame,
+    llm: RunnableWithFallbacks,
+    question: str,
+    batch_size: int = 20,
+    prompt_template: str | Path | PromptTemplate = "detail_detection",
+    system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
+    concurrency: int = 10,
+) -> tuple[pd.DataFrame, pd.DataFrame]:
+    """Identify responses that provide high-value detailed evidence.
+
+    This function processes survey responses in batches to analyze their level of detail
+    and evidence using a language model. It identifies responses that contain specific
+    examples, data, or detailed reasoning that provide strong supporting evidence.
+
+    Args:
+        responses_df (pd.DataFrame): DataFrame containing survey responses to analyze.
+            Must contain 'response_id' and 'response' columns.
+        llm (RunnableWithFallbacks): Language model instance to use for detail detection.
+        question (str): The survey question.
+        batch_size (int, optional): Number of responses to process in each batch.
+            Defaults to 20.
+        prompt_template (str | Path | PromptTemplate, optional): Template for structuring
+            the prompt to the LLM. Can be a string identifier, path to template file,
+            or PromptTemplate instance. Defaults to "detail_detection".
+        system_prompt (str): System prompt to guide the LLM's behavior.
+            Defaults to CONSULTATION_SYSTEM_PROMPT.
+        concurrency (int): Number of concurrent API calls to make. Defaults to 10.
+
+    Returns:
+        tuple[pd.DataFrame, pd.DataFrame]:
+            A tuple containing two DataFrames:
+            - The first DataFrame contains the rows that were successfully processed by the LLM
+            - The second DataFrame contains the rows that could not be processed by the LLM
+
+    Note:
+        The function uses response_id_integrity_check to ensure responses maintain
+        their original order and association after processing.
+    """
+    logger.info(f"Running detail detection on {len(responses_df)} responses")
+    detailed, _ = await batch_and_run(
+        responses_df,
+        prompt_template,
+        llm.with_structured_output(DetailDetectionResponses),
+        batch_size=batch_size,
+        question=question,
+        integrity_check=True,
         system_prompt=system_prompt,
+        concurrency=concurrency,
     )
-    return mapping, _
+    return detailed, _
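
Finally, the new `detail_detection` stage can be called directly as well as via `find_themes`. A sketch, with inputs assumed from the earlier examples:

```python
# Sketch: flagging evidence-rich responses with the new 0.6.3 stage.
from themefinder import detail_detection

async def flag_detailed(responses_df, llm, question):
    detailed_df, failed_df = await detail_detection(
        responses_df[["response_id", "response"]],
        llm,
        question=question,
        batch_size=20,  # the stage's default, shown here for clarity
    )
    return detailed_df
```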