themefinder 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 i.AI
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,167 @@
1
+ Metadata-Version: 2.3
2
+ Name: themefinder
3
+ Version: 0.2.0
4
+ Summary: A topic modelling Python package designed for analysing one-to-many question-answer data, e.g. free-text survey responses.
5
+ License: MIT
6
+ Author: i.AI
7
+ Author-email: packages@cabinetoffice.gov.uk
8
+ Requires-Python: >=3.12,<4.0
9
+ Classifier: Intended Audience :: Developers
10
+ Classifier: Intended Audience :: Science/Research
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Programming Language :: Python :: 3
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Classifier: Programming Language :: Python :: 3.13
15
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
16
+ Classifier: Topic :: Text Processing :: Linguistic
17
+ Requires-Dist: boto3 (>=1.29,<2.0)
18
+ Requires-Dist: langchain
19
+ Requires-Dist: langchain-openai (==0.1.17)
20
+ Requires-Dist: langfuse (==2.29.1)
21
+ Requires-Dist: openpyxl (>=3.1.5,<4.0.0)
22
+ Requires-Dist: pandas (>=2.2.2,<3.0.0)
23
+ Requires-Dist: pyarrow (>=15.0.0,<16.0.0)
24
+ Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
25
+ Requires-Dist: scikit-learn
26
+ Project-URL: Documentation, https://i-dot-ai.github.io/themefinder/
27
+ Project-URL: Repository, https://github.com/i-dot-ai/themefinder/
28
+ Description-Content-Type: text/markdown
29
+
30
+ # ThemeFinder
31
+
32
+ ThemeFinder is a topic modelling Python package designed for analyzing one-to-many question-answer data (e.g. survey responses, public consultations). See the [docs](docs/pipeline.md) for more info.
33
+
34
+ > [!IMPORTANT]
35
+ > Incubation project: we don't recommend using this for critical use cases yet. We are currently in a research stage, trialling the tool for case studies across the Civil Service. Find out more about our projects at https://ai.gov.uk/.
36
+
37
+
38
+ ## Quickstart
39
+
40
+ ### Install the package locally
41
+
42
+ Clone the package from GitHub:
43
+ ```
44
+ git clone https://github.com/i-dot-ai/themefinder.git
45
+ ```
46
+
47
+ Install the package into your virtual environment, where `<FILE_PATH>` is the location of the `themefinder` directory.
48
+
49
+ Install with pip:
50
+ ```
51
+ pip install -e <FILE_PATH>
52
+ ```
53
+
54
+ Install with poetry:
55
+ ```
56
+ poetry add -e <FILE_PATH>
57
+ ```
58
+
59
+ ### Usage
60
+
61
+ ThemeFinder takes as input a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) with two columns:
62
+ - `response_id`: A unique identifier for each response
63
+ - `response`: The free text survey response
64
+
65
+ ThemeFinder is compatible with any instantiated [LangChain LLM runnable](https://python.langchain.com/v0.1/docs/integrations/llms/), but you will need to use JSON structured output.
66
+
67
+ The function `find_themes` identifies and labels common themes in the responses; it also outputs results from the intermediate steps of the theme-finding pipeline.
68
+
69
+ For this example, install the following Python packages into your virtual environment: `pandas` and `langchain` (`asyncio` is part of the Python standard library), and install `themefinder` as described above.
70
+
71
+ If you are using environment variables (e.g. for API keys), you can use `python-dotenv` to read variables from a `.env` file.
72
+
73
+ If you are using an Azure OpenAI endpoint, you will need the following variables:
74
+
75
+ - `AZURE_OPENAI_API_KEY`
76
+ - `AZURE_OPENAI_ENDPOINT`
77
+ - `OPENAI_API_VERSION`
78
+ - `DEPLOYMENT_NAME`
79
+ - `AZURE_OPENAI_BASE_URL`
80
+
81
+ Otherwise you will need whichever variables [LangChain](https://www.langchain.com/) requires for your LLM of choice.
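+
+ For example, a `.env` file for an Azure OpenAI setup might look like the sketch below (placeholder values only; the exact set of variables depends on your provider and deployment):
+
+ ```
+ AZURE_OPENAI_API_KEY=<your-api-key>
+ AZURE_OPENAI_ENDPOINT=<your-endpoint-url>
+ OPENAI_API_VERSION=<api-version>
+ DEPLOYMENT_NAME=<your-deployment-name>
+ AZURE_OPENAI_BASE_URL=<your-base-url>
+ ```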
82
+
83
+ ```python
84
+ import asyncio
85
+ from dotenv import load_dotenv
86
+ import pandas as pd
87
+ from langchain_openai import AzureChatOpenAI
88
+ from themefinder import find_themes
89
+
90
+ # If needed, load LLM API settings from .env file
91
+ load_dotenv()
92
+
93
+ # Initialise your LLM of choice using langchain
94
+ llm = AzureChatOpenAI(
95
+ model="gpt-4o",
96
+ temperature=0,
97
+ model_kwargs={"response_format": {"type": "json_object"}},
98
+ )
99
+
100
+ # Set up your data
101
+ responses_df = pd.DataFrame({
102
+ "response_id": ["1", "2", "3", "4", "5"],
103
+ "response": ["I think it's awesome, I can use it for consultation analysis.",
104
+ "It's great.", "It's a good approach to topic modelling.", "I'm not sure, I need to trial it more.", "I don't like it so much."]
105
+ })
106
+
107
+ # Add your question
108
+ question = "What do you think of ThemeFinder?"
109
+
110
+ # Make the system prompt specific to your use case
111
+ system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."
112
+
113
+ # Run the function to find themes
114
+ # We use asyncio to query LLM endpoints asynchronously, so we need to await our function
115
+ async def main():
116
+ result = await find_themes(responses_df, llm, question, system_prompt)
117
+ print(result)
118
+
119
+ if __name__ == "__main__":
120
+ asyncio.run(main())
121
+
122
+ ```
123
+
124
+
125
+ ## ThemeFinder pipeline
126
+
127
+ ThemeFinder's pipeline consists of five distinct stages, each utilizing a specialized LLM prompt:
128
+
129
+ ### Sentiment analysis
130
+ - Analyses the emotional tone and position of each response using sentiment-focused prompts
131
+ - Provides structured sentiment categorisation based on LLM analysis
132
+
133
+ ### Theme generation
134
+ - Uses exploratory prompts to identify initial themes from response batches
135
+ - Groups related responses for better context through guided theme extraction
136
+
137
+ ### Theme condensation
138
+ - Employs comparative prompts to combine similar or overlapping themes
139
+ - Reduces redundancy in identified topics through systematic theme evaluation
140
+
141
+ ### Theme refinement
142
+ - Leverages standardisation prompts to normalise theme descriptions
143
+ - Creates clear, consistent theme definitions through structured refinement
144
+
145
+ ### Theme mapping
146
+ - Utilizes classification prompts to map individual responses to refined themes
147
+ - Supports multiple theme assignments per response through detailed analysis
148
+
149
+
150
+ The prompts used at each stage can be found in `src/themefinder/prompts/`.
151
+
152
+ The file `src/themefinder/core.py` contains the function `find_themes`, which runs the pipeline. It also contains functions for each individual stage; a minimal usage sketch of calling them directly is shown below.
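+
+ For example, the stage functions are exported from the package, so the first two stages can be run on their own. This is an illustrative sketch only; the model name, credentials, and responses are placeholders for your own setup:
+
+ ```python
+ import asyncio
+
+ import pandas as pd
+ from langchain_openai import AzureChatOpenAI
+
+ from themefinder import sentiment_analysis, theme_generation
+
+ # Illustrative setup only: use your own deployment and credentials
+ llm = AzureChatOpenAI(
+     model="gpt-4o",
+     temperature=0,
+     model_kwargs={"response_format": {"type": "json_object"}},
+ )
+
+ responses_df = pd.DataFrame({
+     "response_id": ["1", "2"],
+     "response": ["I support the proposed change.", "I am against the proposed change."],
+ })
+ question = "What do you think of the proposed change?"
+
+ async def run_first_two_stages():
+     # Stage 1: sentiment analysis enriches each response with a "position" column
+     sentiment_df = await sentiment_analysis(responses_df, llm, question=question)
+     # Stage 2: theme generation uses that column (its default partition key)
+     # to batch related responses together before proposing initial themes
+     return await theme_generation(sentiment_df, llm, question=question)
+
+ print(asyncio.run(run_first_two_stages()))
+ ```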
153
+
154
+
155
+ **For more detail - see the docs: [https://i-dot-ai.github.io/themefinder/](https://i-dot-ai.github.io/themefinder/).**
156
+
157
+
158
+ ## License
159
+
160
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
161
+
162
+ The documentation is [© Crown copyright](https://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/) and available under the terms of the [Open Government Licence v3.0](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
163
+
164
+
165
+ ## Feedback
166
+
167
+ If you have feedback on this package, please fill in our [feedback form](https://forms.gle/85xUSMvxGzSSKQ499) or contact us with questions at packages@cabinetoffice.gov.uk.
@@ -0,0 +1,138 @@
1
+ # ThemeFinder
2
+
3
+ ThemeFinder is a topic modelling Python package designed for analyzing one-to-many question-answer data (e.g. survey responses, public consultations). See the [docs](docs/pipeline.md) for more info.
4
+
5
+ > [!IMPORTANT]
6
+ > Incubation project: we don't recommend using this for critical use cases yet. We are currently in a research stage, trialling the tool for case studies across the Civil Service. Find out more about our projects at https://ai.gov.uk/.
7
+
8
+
9
+ ## Quickstart
10
+
11
+ ### Install the package locally
12
+
13
+ Clone the package from GitHub:
14
+ ```
15
+ git clone https://github.com/i-dot-ai/themefinder.git
16
+ ```
17
+
18
+ Install the package into your virtual environment, where `<FILE_PATH>` is the location of the `themefinder` directory.
19
+
20
+ Install with pip:
21
+ ```
22
+ pip install -e <FILE_PATH>
23
+ ```
24
+
25
+ Install with poetry:
26
+ ```
27
+ poetry add -e <FILE_PATH>
28
+ ```
29
+
30
+ ### Usage
31
+
32
+ ThemeFinder takes as input a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) with two columns:
33
+ - `response_id`: A unique identifier for each response
34
+ - `response`: The free text survey response
35
+
36
+ ThemeFinder is compatible with any instantiated [LangChain LLM runnable](https://python.langchain.com/v0.1/docs/integrations/llms/), but you will need to use JSON structured output.
37
+
38
+ The function `find_themes` identifies and labels common themes in the responses; it also outputs results from the intermediate steps of the theme-finding pipeline.
39
+
40
+ For this example, install the following Python packages into your virtual environment: `pandas` and `langchain` (`asyncio` is part of the Python standard library), and install `themefinder` as described above.
41
+
42
+ If you are using environment variables (e.g. for API keys), you can use `python-dotenv` to read variables from a `.env` file.
43
+
44
+ If you are using an Azure OpenAI endpoint, you will need the following variables:
45
+
46
+ - `AZURE_OPENAI_API_KEY`
47
+ - `AZURE_OPENAI_ENDPOINT`
48
+ - `OPENAI_API_VERSION`
49
+ - `DEPLOYMENT_NAME`
50
+ - `AZURE_OPENAI_BASE_URL`
51
+
52
+ Otherwise you will need whichever variables [LangChain](https://www.langchain.com/) requires for your LLM of choice.
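+
+ For example, a `.env` file for an Azure OpenAI setup might look like the sketch below (placeholder values only; the exact set of variables depends on your provider and deployment):
+
+ ```
+ AZURE_OPENAI_API_KEY=<your-api-key>
+ AZURE_OPENAI_ENDPOINT=<your-endpoint-url>
+ OPENAI_API_VERSION=<api-version>
+ DEPLOYMENT_NAME=<your-deployment-name>
+ AZURE_OPENAI_BASE_URL=<your-base-url>
+ ```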
53
+
54
+ ```python
55
+ import asyncio
56
+ from dotenv import load_dotenv
57
+ import pandas as pd
58
+ from langchain_openai import AzureChatOpenAI
59
+ from themefinder import find_themes
60
+
61
+ # If needed, load LLM API settings from .env file
62
+ load_dotenv()
63
+
64
+ # Initialise your LLM of choice using langchain
65
+ llm = AzureChatOpenAI(
66
+ model="gpt-4o",
67
+ temperature=0,
68
+ model_kwargs={"response_format": {"type": "json_object"}},
69
+ )
70
+
71
+ # Set up your data
72
+ responses_df = pd.DataFrame({
73
+ "response_id": ["1", "2", "3", "4", "5"],
74
+ "response": ["I think it's awesome, I can use it for consultation analysis.",
75
+ "It's great.", "It's a good approach to topic modelling.", "I'm not sure, I need to trial it more.", "I don't like it so much."]
76
+ })
77
+
78
+ # Add your question
79
+ question = "What do you think of ThemeFinder?"
80
+
81
+ # Make the system prompt specific to your use case
82
+ system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."
83
+
84
+ # Run the function to find themes
85
+ # We use asyncio to query LLM endpoints asynchronously, so we need to await our function
86
+ async def main():
87
+ result = await find_themes(responses_df, llm, question, system_prompt)
88
+ print(result)
89
+
90
+ if __name__ == "__main__":
91
+ asyncio.run(main())
92
+
93
+ ```
94
+
95
+
96
+ ## ThemeFinder pipeline
97
+
98
+ ThemeFinder's pipeline consists of five distinct stages, each utilizing a specialized LLM prompt:
99
+
100
+ ### Sentiment analysis
101
+ - Analyses the emotional tone and position of each response using sentiment-focused prompts
102
+ - Provides structured sentiment categorisation based on LLM analysis
103
+
104
+ ### Theme generation
105
+ - Uses exploratory prompts to identify initial themes from response batches
106
+ - Groups related responses for better context through guided theme extraction
107
+
108
+ ### Theme condensation
109
+ - Employs comparative prompts to combine similar or overlapping themes
110
+ - Reduces redundancy in identified topics through systematic theme evaluation
111
+
112
+ ### Theme refinement
113
+ - Leverages standardisation prompts to normalise theme descriptions
114
+ - Creates clear, consistent theme definitions through structured refinement
115
+
116
+ ### Theme mapping
117
+ - Utilizes classification prompts to map individual responses to refined themes
118
+ - Supports multiple theme assignments per response through detailed analysis
119
+
120
+
121
+ The prompts used at each stage can be found in `src/themefinder/prompts/`.
122
+
123
+ The file `src/themefinder/core.py` contains the function `find_themes`, which runs the pipeline. It also contains functions for each individual stage; a minimal usage sketch of calling them directly is shown below.
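+
+ For example, the stage functions are exported from the package, so the first two stages can be run on their own. This is an illustrative sketch only; the model name, credentials, and responses are placeholders for your own setup:
+
+ ```python
+ import asyncio
+
+ import pandas as pd
+ from langchain_openai import AzureChatOpenAI
+
+ from themefinder import sentiment_analysis, theme_generation
+
+ # Illustrative setup only: use your own deployment and credentials
+ llm = AzureChatOpenAI(
+     model="gpt-4o",
+     temperature=0,
+     model_kwargs={"response_format": {"type": "json_object"}},
+ )
+
+ responses_df = pd.DataFrame({
+     "response_id": ["1", "2"],
+     "response": ["I support the proposed change.", "I am against the proposed change."],
+ })
+ question = "What do you think of the proposed change?"
+
+ async def run_first_two_stages():
+     # Stage 1: sentiment analysis enriches each response with a "position" column
+     sentiment_df = await sentiment_analysis(responses_df, llm, question=question)
+     # Stage 2: theme generation uses that column (its default partition key)
+     # to batch related responses together before proposing initial themes
+     return await theme_generation(sentiment_df, llm, question=question)
+
+ print(asyncio.run(run_first_two_stages()))
+ ```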
124
+
125
+
126
+ **For more detail - see the docs: [https://i-dot-ai.github.io/themefinder/](https://i-dot-ai.github.io/themefinder/).**
127
+
128
+
129
+ ## License
130
+
131
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
132
+
133
+ The documentation is [© Crown copyright](https://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/) and available under the terms of the [Open Government Licence v3.0](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
134
+
135
+
136
+ ## Feedback
137
+
138
+ If you have feedback on this package, please fill in our [feedback form](https://forms.gle/85xUSMvxGzSSKQ499) or contact us with questions at packages@cabinetoffice.gov.uk.
@@ -0,0 +1,50 @@
1
+ [tool.poetry]
2
+ name = "themefinder"
3
+ version = "0.2.0"
4
+ description = "A topic modelling Python package designed for analysing one-to-many question-answer data, e.g. free-text survey responses."
5
+ authors = ["i.AI <packages@cabinetoffice.gov.uk>"]
6
+ packages = [{include = "themefinder", from = "src"}]
7
+ readme = "README.md"
8
+ license = "MIT"
9
+ repository = "https://github.com/i-dot-ai/themefinder/"
10
+ documentation = "https://i-dot-ai.github.io/themefinder/"
11
+ classifiers = [
12
+ "Intended Audience :: Developers",
13
+ "Intended Audience :: Science/Research",
14
+ "License :: OSI Approved :: MIT License",
15
+ "Programming Language :: Python :: 3.12",
16
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
17
+ "Topic :: Text Processing :: Linguistic",
18
+ ]
19
+
20
+
21
+ [tool.poetry.dependencies]
22
+ python = ">=3.12,<4.0"
23
+ langchain = "*"
24
+ langchain-openai = "0.1.17"
25
+ pandas = "^2.2.2"
26
+ python-dotenv = "^1.0.1"
27
+ langfuse = "2.29.1"
28
+ boto3 = "^1.29"
29
+ scikit-learn = "*"
30
+ openpyxl = "^3.1.5"
31
+ pyarrow = "^15.0.0"
32
+
33
+ [tool.poetry.group.dev.dependencies]
34
+ pytest = "*"
35
+ pytest-asyncio = "^0.24.0"
36
+ coverage = "^7.6.10"
37
+
38
+ [tool.poetry.group.docs.dependencies]
39
+ mkdocs = "^1.6.1"
40
+ mkdocstrings = {extras = ["python"], version = "^0.27.0"}
41
+ mkdocs-material = "^9.5.50"
42
+
43
+ [tool.pytest.ini_options]
44
+ pythonpath = "."
45
+ asyncio_mode = "auto"
46
+ asyncio_default_fixture_loop_scope = "function"
47
+
48
+ [build-system]
49
+ requires = ["poetry-core>=1.0.0"]
50
+ build-backend = "poetry.core.masonry.api"
@@ -0,0 +1,18 @@
1
+ from .core import (
2
+ find_themes,
3
+ sentiment_analysis,
4
+ theme_generation,
5
+ theme_condensation,
6
+ theme_refinement,
7
+ theme_mapping,
8
+ )
9
+
10
+ __all__ = [
11
+ "find_themes",
12
+ "sentiment_analysis",
13
+ "theme_generation",
14
+ "theme_condensation",
15
+ "theme_refinement",
16
+ "theme_mapping",
17
+ ]
18
+ __version__ = "0.2.0"
@@ -0,0 +1,326 @@
1
+ from pathlib import Path
2
+
3
+ import pandas as pd
4
+ from langchain_core.prompts import PromptTemplate
5
+ from langchain_core.runnables import Runnable
6
+
7
+ from .llm_batch_processor import batch_and_run, load_prompt_from_file
8
+ from .themefinder_logging import logger
9
+
10
+
11
+ CONSULTATION_SYSTEM_PROMPT = load_prompt_from_file("consultation_system_prompt")
12
+
13
+
14
+ async def find_themes(
15
+ responses_df: pd.DataFrame,
16
+ llm: Runnable,
17
+ question: str,
18
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
19
+ ) -> dict[str, pd.DataFrame]:
20
+ """Process survey responses through a multi-stage theme analysis pipeline.
21
+
22
+ This pipeline performs sequential analysis steps:
23
+ 1. Sentiment analysis of responses
24
+ 2. Initial theme generation
25
+ 3. Theme condensation (combining similar themes)
26
+ 4. Theme refinement
27
+ 5. Mapping responses to refined themes
28
+
29
+ Args:
30
+ responses_df (pd.DataFrame): DataFrame containing survey responses
31
+ llm (Runnable): Language model instance for text analysis
32
+ question (str): The survey question
33
+ system_prompt (str): System prompt to guide the LLM's behavior.
34
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
35
+
36
+ Returns:
37
+ dict[str, pd.DataFrame]: Dictionary containing results from each pipeline stage:
38
+ - question: The survey question
39
+ - sentiment: DataFrame with sentiment analysis results
40
+ - topics: DataFrame with initial generated themes
41
+ - condensed_topics: DataFrame with combined similar themes
42
+ - refined_topics: DataFrame with refined theme definitions
43
+ - mapping: DataFrame mapping responses to final themes
44
+ """
45
+ sentiment_df = await sentiment_analysis(
46
+ responses_df,
47
+ llm,
48
+ question=question,
49
+ system_prompt=system_prompt,
50
+ )
51
+ theme_df = await theme_generation(
52
+ sentiment_df,
53
+ llm,
54
+ question=question,
55
+ system_prompt=system_prompt,
56
+ )
57
+ condensed_theme_df = await theme_condensation(
58
+ theme_df, llm, question=question, system_prompt=system_prompt
59
+ )
60
+ refined_theme_df = await theme_refinement(
61
+ condensed_theme_df,
62
+ llm,
63
+ question=question,
64
+ system_prompt=system_prompt,
65
+ )
66
+ mapping_df = await theme_mapping(
67
+ sentiment_df,
68
+ llm,
69
+ question=question,
70
+ refined_themes_df=refined_theme_df,
71
+ system_prompt=system_prompt,
72
+ )
73
+
74
+ logger.info("Finished finding themes")
75
+ logger.info(
76
+ "Provide feedback or report bugs: https://forms.gle/85xUSMvxGzSSKQ499 or packages@cabinetoffice.gov.uk"
77
+ )
78
+ return {
79
+ "question": question,
80
+ "sentiment": sentiment_df,
81
+ "topics": theme_df,
82
+ "condensed_topics": condensed_theme_df,
83
+ "refined_topics": refined_theme_df,
84
+ "mapping": mapping_df,
85
+ }
86
+
87
+
88
+ async def sentiment_analysis(
89
+ responses_df: pd.DataFrame,
90
+ llm: Runnable,
91
+ question: str,
92
+ batch_size: int = 10,
93
+ prompt_template: str | Path | PromptTemplate = "sentiment_analysis",
94
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
95
+ ) -> pd.DataFrame:
96
+ """Perform sentiment analysis on survey responses using an LLM.
97
+
98
+ This function processes survey responses in batches to analyze their sentiment
99
+ using a language model. It maintains response integrity by checking response IDs.
100
+
101
+ Args:
102
+ responses_df (pd.DataFrame): DataFrame containing survey responses to analyze.
103
+ Must contain 'response_id' and 'response' columns.
104
+ llm (Runnable): Language model instance to use for sentiment analysis.
105
+ question (str): The survey question.
106
+ batch_size (int, optional): Number of responses to process in each batch.
107
+ Defaults to 10.
108
+ prompt_template (str | Path | PromptTemplate, optional): Template for structuring
109
+ the prompt to the LLM. Can be a string identifier, path to template file,
110
+ or PromptTemplate instance. Defaults to "sentiment_analysis".
111
+ system_prompt (str): System prompt to guide the LLM's behavior.
112
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
113
+
114
+ Returns:
115
+ pd.DataFrame: DataFrame containing the original responses enriched with
116
+ sentiment analysis results.
117
+
118
+ Note:
119
+ The function uses response_id_integrity_check to ensure responses maintain
120
+ their original order and association after processing.
121
+ """
122
+ logger.info(f"Running sentiment analysis on {len(responses_df)} responses")
123
+ return await batch_and_run(
124
+ responses_df,
125
+ prompt_template,
126
+ llm,
127
+ batch_size=batch_size,
128
+ question=question,
129
+ response_id_integrity_check=True,
130
+ system_prompt=system_prompt,
131
+ )
132
+
133
+
134
+ async def theme_generation(
135
+ responses_df: pd.DataFrame,
136
+ llm: Runnable,
137
+ question: str,
138
+ batch_size: int = 50,
139
+ partition_key: str | None = "position",
140
+ prompt_template: str | Path | PromptTemplate = "theme_generation",
141
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
142
+ ) -> pd.DataFrame:
143
+ """Generate themes from survey responses using an LLM.
144
+
145
+ This function processes batches of survey responses to identify common themes or topics.
146
+
147
+ Args:
148
+ responses_df (pd.DataFrame): DataFrame containing survey responses.
149
+ Must include 'response_id' and 'response' columns.
150
+ llm (Runnable): Language model instance to use for theme generation.
151
+ question (str): The survey question.
152
+ batch_size (int, optional): Number of responses to process in each batch.
153
+ Defaults to 50.
154
+ partition_key (str | None, optional): Column name to use for batching related
155
+ responses together. Defaults to "position" for sentiment-enriched responses,
156
+ but can be set to None for sequential batching or another column name for
157
+ different grouping strategies.
158
+ prompt_template (str | Path | PromptTemplate, optional): Template for structuring
159
+ the prompt to the LLM. Can be a string identifier, path to template file,
160
+ or PromptTemplate instance. Defaults to "theme_generation".
161
+ system_prompt (str): System prompt to guide the LLM's behavior.
162
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
163
+
164
+ Returns:
165
+ pd.DataFrame: DataFrame containing identified themes and their associated metadata.
166
+ """
167
+ logger.info(f"Running theme generation on {len(responses_df)} responses")
168
+ return await batch_and_run(
169
+ responses_df,
170
+ prompt_template,
171
+ llm,
172
+ batch_size=batch_size,
173
+ partition_key=partition_key,
174
+ question=question,
175
+ system_prompt=system_prompt,
176
+ )
177
+
178
+
179
+ async def theme_condensation(
180
+ themes_df: pd.DataFrame,
181
+ llm: Runnable,
182
+ question: str,
183
+ batch_size: int = 10000,
184
+ prompt_template: str | Path | PromptTemplate = "theme_condensation",
185
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
186
+ ) -> pd.DataFrame:
187
+ """Condense and combine similar themes identified from survey responses.
188
+
189
+ This function processes the initially identified themes to combine similar or
190
+ overlapping topics into more cohesive, broader categories using an LLM.
191
+
192
+ Args:
193
+ themes_df (pd.DataFrame): DataFrame containing the initial themes identified
194
+ from survey responses.
195
+ llm (Runnable): Language model instance to use for theme condensation.
196
+ question (str): The survey question.
197
+ batch_size (int, optional): Number of themes to process in each batch.
198
+ Defaults to 10000.
199
+ prompt_template (str | Path | PromptTemplate, optional): Template for structuring
200
+ the prompt to the LLM. Can be a string identifier, path to template file,
201
+ or PromptTemplate instance. Defaults to "theme_condensation".
202
+ system_prompt (str): System prompt to guide the LLM's behavior.
203
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
204
+
205
+ Returns:
206
+ pd.DataFrame: DataFrame containing the condensed themes, where similar topics
207
+ have been combined into broader categories.
208
+ """
209
+ logger.info(f"Running theme condensation on {len(themes_df)} topics")
210
+ themes_df["response_id"] = range(len(themes_df))
211
+ return await batch_and_run(
212
+ themes_df,
213
+ prompt_template,
214
+ llm,
215
+ batch_size=batch_size,
216
+ question=question,
217
+ system_prompt=system_prompt,
218
+ )
219
+
220
+
221
+ async def theme_refinement(
222
+ condensed_themes_df: pd.DataFrame,
223
+ llm: Runnable,
224
+ question: str,
225
+ batch_size: int = 10000,
226
+ prompt_template: str | Path | PromptTemplate = "theme_refinement",
227
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
228
+ ) -> pd.DataFrame:
229
+ """Refine and standardize condensed themes using an LLM.
230
+
231
+ This function processes previously condensed themes to create clear, standardized
232
+ theme descriptions. It also transforms the output format for improved readability
233
+ by transposing the results into a single-row DataFrame where columns represent
234
+ individual themes.
235
+
236
+ Args:
237
+ condensed_themes_df (pd.DataFrame): DataFrame containing the condensed themes
238
+ from the previous pipeline stage.
239
+ llm (Runnable): Language model instance to use for theme refinement.
240
+ question (str): The survey question.
241
+ batch_size (int, optional): Number of themes to process in each batch.
242
+ Defaults to 10000.
243
+ prompt_template (str | Path | PromptTemplate, optional): Template for structuring
244
+ the prompt to the LLM. Can be a string identifier, path to template file,
245
+ or PromptTemplate instance. Defaults to "topic_refinement".
246
+ system_prompt (str): System prompt to guide the LLM's behavior.
247
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
248
+
249
+ Returns:
250
+ pd.DataFrame: A single-row DataFrame where:
251
+ - Each column represents a unique theme (identified by topic_id)
252
+ - The values contain the refined theme descriptions
253
+ - The format is optimized for subsequent theme mapping operations
254
+
255
+ Note:
256
+ The function adds sequential response_ids to the input DataFrame and
257
+ transposes the output for improved readability and easier downstream
258
+ processing.
259
+ """
260
+ logger.info(f"Running topic refinement on {len(condensed_themes_df)} responses")
261
+ condensed_themes_df["response_id"] = range(len(condensed_themes_df))
262
+
263
+ def transpose_refined_topics(refined_themes: pd.DataFrame):
264
+ """Transpose topics for increased legibility."""
265
+ transposed_df = pd.DataFrame(
266
+ [refined_themes["topic"].to_numpy()], columns=refined_themes["topic_id"]
267
+ )
268
+ return transposed_df
269
+
270
+ refined_themes = await batch_and_run(
271
+ condensed_themes_df,
272
+ prompt_template,
273
+ llm,
274
+ batch_size=batch_size,
275
+ question=question,
276
+ system_prompt=system_prompt,
277
+ )
278
+ return transpose_refined_topics(refined_themes)
279
+
280
+
281
+ async def theme_mapping(
282
+ responses_df: pd.DataFrame,
283
+ llm: Runnable,
284
+ question: str,
285
+ refined_themes_df: pd.DataFrame,
286
+ batch_size: int = 20,
287
+ prompt_template: str | Path | PromptTemplate = "theme_mapping",
288
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
289
+ ) -> pd.DataFrame:
290
+ """Map survey responses to refined themes using an LLM.
291
+
292
+ This function analyzes each survey response and determines which of the refined
293
+ themes best matches its content. Multiple themes can be assigned to a single response.
294
+
295
+ Args:
296
+ responses_df (pd.DataFrame): DataFrame containing survey responses.
297
+ Must include 'response_id' and 'response' columns.
298
+ llm (Runnable): Language model instance to use for theme mapping.
299
+ question (str): The survey question.
300
+ refined_themes_df (pd.DataFrame): Single-row DataFrame where each column
301
+ represents a theme (from theme_refinement stage).
302
+ batch_size (int, optional): Number of responses to process in each batch.
303
+ Defaults to 20.
304
+ prompt_template (str | Path | PromptTemplate, optional): Template for structuring
305
+ the prompt to the LLM. Can be a string identifier, path to template file,
306
+ or PromptTemplate instance. Defaults to "theme_mapping".
307
+ system_prompt (str): System prompt to guide the LLM's behavior.
308
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
309
+
310
+ Returns:
311
+ pd.DataFrame: DataFrame containing the original responses enriched with
312
+ theme mapping results, ensuring all responses are mapped through ID integrity checks.
313
+ """
314
+ logger.info(
315
+ f"Running theme mapping on {len(responses_df)} responses using {len(refined_themes_df.columns)} themes"
316
+ )
317
+ return await batch_and_run(
318
+ responses_df,
319
+ prompt_template,
320
+ llm,
321
+ batch_size=batch_size,
322
+ question=question,
323
+ refined_themes=refined_themes_df.to_dict(orient="records"),
324
+ response_id_integrity_check=True,
325
+ system_prompt=system_prompt,
326
+ )
@@ -0,0 +1,311 @@
1
+ import asyncio
2
+ import json
3
+ import logging
4
+ from dataclasses import dataclass
5
+ from pathlib import Path
6
+ from typing import Any
7
+
8
+ import pandas as pd
9
+ from langchain_core.prompts import PromptTemplate
10
+ from langchain_core.runnables import Runnable
11
+ from tenacity import before, retry, stop_after_attempt, wait_random_exponential
12
+
13
+ from .themefinder_logging import logger
14
+
15
+
16
+ @dataclass
17
+ class BatchPrompt:
18
+ prompt_string: str
19
+ response_ids: list[str]
20
+
21
+
22
+ async def batch_and_run(
23
+ responses_df: pd.DataFrame,
24
+ prompt_template: str | Path | PromptTemplate,
25
+ llm: Runnable,
26
+ batch_size: int = 10,
27
+ partition_key: str | None = None,
28
+ response_id_integrity_check: bool = False,
29
+ **kwargs: Any,
30
+ ) -> pd.DataFrame:
31
+ """Process a DataFrame of responses in batches using an LLM.
32
+
33
+ Args:
34
+ responses_df (pd.DataFrame): DataFrame containing responses to be processed.
35
+ Must include a 'response_id' column.
36
+ prompt_template (Union[str, Path, PromptTemplate]): Template for LLM prompts.
37
+ Can be a string (file path), Path object, or PromptTemplate.
38
+ llm (Runnable): LangChain Runnable instance that will process the prompts.
39
+ batch_size (int, optional): Number of responses to process in each batch.
40
+ Defaults to 10.
41
+ partition_key (str | None, optional): Optional column name to group responses
42
+ before batching. Defaults to None.
43
+ response_id_integrity_check (bool, optional): If True, verifies that all input
44
+ response IDs are present in LLM output and retries failed responses individually.
45
+ If False, no integrity checking or retrying occurs. Defaults to False.
46
+ **kwargs (Any): Additional keyword arguments to pass to the prompt template.
47
+
48
+ Returns:
49
+ pd.DataFrame: DataFrame containing the original responses merged with the
50
+ LLM-processed results.
51
+ """
52
+ logger.info(f"Running batch and run with batch size {batch_size}")
53
+ prompt_template = convert_to_prompt_template(prompt_template)
54
+ batched_response_dfs = batch_responses(
55
+ responses_df, batch_size=batch_size, partition_key=partition_key
56
+ )
57
+ batch_prompts = generate_prompts(batched_response_dfs, prompt_template, **kwargs)
58
+ llm_responses, failed_ids = await call_llm(
59
+ batch_prompts=batch_prompts,
60
+ llm=llm,
61
+ response_id_integrity_check=response_id_integrity_check,
62
+ )
63
+ processed_responses = process_llm_responses(llm_responses, responses_df)
64
+ if failed_ids:
65
+ new_df = responses_df[responses_df["response_id"].astype(str).isin(failed_ids)]
66
+ processed_failed_responses = await batch_and_run(
67
+ responses_df=new_df,
68
+ prompt_template=prompt_template,
69
+ llm=llm,
70
+ batch_size=1,
71
+ partition_key=partition_key,
72
+ **kwargs,
73
+ )
74
+ return pd.concat(objs=[processed_failed_responses, processed_responses])
75
+ return processed_responses
76
+
77
+
78
+ def load_prompt_from_file(file_path: str | Path) -> str:
79
+ """Load a prompt template from a text file in the prompts directory.
80
+
81
+ Args:
82
+ file_path (str | Path): Name of the prompt file (without .txt extension)
83
+ or Path object pointing to the file.
84
+
85
+ Returns:
86
+ str: Content of the prompt template file.
87
+ """
88
+ parent_dir = Path(__file__).parent
89
+ with Path.open(parent_dir / "prompts" / f"{file_path}.txt") as file:
90
+ return file.read()
91
+
92
+
93
+ def convert_to_prompt_template(prompt_template: str | Path | PromptTemplate):
94
+ """Convert various input types to a LangChain PromptTemplate.
95
+
96
+ Args:
97
+ prompt_template (str | Path | PromptTemplate): Input template that can be either:
98
+ - str: Name of a prompt file in the prompts directory (without .txt extension)
99
+ - Path: Path object pointing to a prompt file
100
+ - PromptTemplate: Already initialized LangChain PromptTemplate
101
+
102
+ Returns:
103
+ PromptTemplate: Initialized LangChain PromptTemplate object.
104
+
105
+ Raises:
106
+ TypeError: If prompt_template is not one of the expected types.
107
+ FileNotFoundError: If using str/Path input and the prompt file doesn't exist.
108
+ """
109
+ if isinstance(prompt_template, str | Path):
110
+ prompt_content = load_prompt_from_file(prompt_template)
111
+ template = PromptTemplate.from_template(template=prompt_content)
112
+ elif isinstance(prompt_template, PromptTemplate):
113
+ template = prompt_template
114
+ else:
115
+ msg = "Invalid prompt_template type. Expected str, Path, or PromptTemplate."
116
+ raise TypeError(msg)
117
+ return template
118
+
119
+
120
+ def batch_responses(
121
+ responses_df: pd.DataFrame, batch_size: int = 10, partition_key: str | None = None
122
+ ) -> list[pd.DataFrame]:
123
+ """Split a DataFrame into batches, optionally partitioned by a key column.
124
+
125
+ Args:
126
+ responses_df (pd.DataFrame): Input DataFrame to be split into batches.
127
+ batch_size (int, optional): Maximum number of rows in each batch. Defaults to 10.
128
+ partition_key (str | None, optional): Column name to group by before batching.
129
+ If provided, ensures rows with the same partition key value stay together
130
+ and each group is batched separately. Defaults to None.
131
+
132
+ Returns:
133
+ list[pd.DataFrame]: List of DataFrame batches, where each batch contains
134
+ at most batch_size rows. If partition_key is used, rows within each
135
+ partition are kept together and batched separately.
136
+ """
137
+ if partition_key:
138
+ grouped = responses_df.groupby(partition_key)
139
+ batches = []
140
+ for _, group in grouped:
141
+ group_batches = [
142
+ group.iloc[i : i + batch_size].reset_index(drop=True)
143
+ for i in range(0, len(group), batch_size)
144
+ ]
145
+ batches.extend(group_batches)
146
+ return batches
147
+
148
+ return [
149
+ responses_df.iloc[i : i + batch_size].reset_index(drop=True)
150
+ for i in range(0, len(responses_df), batch_size)
151
+ ]
152
+
153
+
154
+ def generate_prompts(
155
+ response_dfs: list[pd.DataFrame], prompt_template: PromptTemplate, **kwargs: Any
156
+ ) -> list[BatchPrompt]:
157
+ """Generate a list of BatchPrompts from DataFrames using a prompt template.
158
+
159
+ Args:
160
+ response_dfs (list[pd.DataFrame]): List of DataFrames, each containing a batch
161
+ of responses to be processed. Each DataFrame must include a 'response_id' column.
162
+ prompt_template (PromptTemplate): LangChain PromptTemplate object used to format
163
+ the prompts for each batch.
164
+ **kwargs (Any): Additional keyword arguments to pass to the prompt template's
165
+ format method.
166
+
167
+ Returns:
168
+ list[BatchPrompt]: List of BatchPrompt objects, each containing:
169
+ - prompt_string: Formatted prompt text for the batch
170
+ - response_ids: List of response IDs included in the batch
171
+
172
+ Note:
173
+ The function converts each DataFrame to a list of dictionaries and passes it
174
+ to the prompt template as the 'responses' variable.
175
+ """
176
+ batched_prompts = []
177
+
178
+ for df in response_dfs:
179
+ prompt = prompt_template.format(
180
+ responses=df.to_dict(orient="records"), **kwargs
181
+ )
182
+ response_ids = df["response_id"].astype(str).to_list()
183
+ batched_prompts.append(
184
+ BatchPrompt(prompt_string=prompt, response_ids=response_ids)
185
+ )
186
+
187
+ return batched_prompts
188
+
189
+
190
+ async def call_llm(
191
+ batch_prompts: list[BatchPrompt],
192
+ llm: Runnable,
193
+ concurrency: int = 10,
194
+ response_id_integrity_check: bool = False,
195
+ ):
196
+ """Process multiple batches of prompts concurrently through an LLM with retry logic.
197
+
198
+ Args:
199
+ batch_prompts (list[BatchPrompt]): List of BatchPrompt objects, each containing a
200
+ prompt string and associated response IDs to be processed.
201
+ llm (Runnable): LangChain Runnable instance that will process the prompts.
202
+ concurrency (int, optional): Maximum number of simultaneous LLM calls allowed.
203
+ Defaults to 10.
204
+ response_id_integrity_check (bool, optional): If True, verifies that all input
205
+ response IDs are present in the LLM output. Failed batches are discarded and
206
+ their IDs are returned for retry. Defaults to False.
207
+
208
+ Returns:
209
+ tuple[list[dict[str, Any]], set[str]]: A tuple containing:
210
+ - list of successful LLM responses as dictionaries
211
+ - set of failed response IDs (empty if no failures or integrity check is False)
212
+
213
+ Notes:
214
+ - Uses exponential backoff retry strategy with up to 6 attempts per batch
215
+ - Failed batches (when integrity check fails) return None and are filtered out
216
+ - Concurrency is managed via asyncio.Semaphore to prevent overwhelming the LLM
217
+ """
218
+ semaphore = asyncio.Semaphore(concurrency)
219
+ failed_ids: set = set()
220
+
221
+ @retry(
222
+ wait=wait_random_exponential(min=1, max=60),
223
+ stop=stop_after_attempt(6),
224
+ before=before.before_log(logger=logger, log_level=logging.DEBUG),
225
+ reraise=True,
226
+ )
227
+ async def async_llm_call(batch_prompt):
228
+ async with semaphore:
229
+ response = await llm.ainvoke(batch_prompt.prompt_string)
230
+ parsed_response = json.loads(response.content)
231
+
232
+ if response_id_integrity_check and not check_response_integrity(
233
+ batch_prompt.response_ids, parsed_response
234
+ ):
235
+ # discard this response but keep track of failed response ids
236
+ failed_ids.update(batch_prompt.response_ids)
237
+ return None
238
+
239
+ return parsed_response
240
+
241
+ results = await asyncio.gather(
242
+ *[async_llm_call(batch_prompt) for batch_prompt in batch_prompts]
243
+ )
244
+ successful_responses = [
245
+ r for r in results if r is not None
246
+ ] # ignore discarded responses
247
+ return (successful_responses, failed_ids)
248
+
249
+
250
+ def check_response_integrity(
251
+ input_response_ids: set[str], parsed_response: dict
252
+ ) -> bool:
253
+ """Verify that all input response IDs are present in the LLM's parsed response.
254
+
255
+ Args:
256
+ input_response_ids (set[str]): Set of response IDs that were included in the
257
+ original prompt sent to the LLM.
258
+ parsed_response (dict): Parsed response from the LLM containing a 'responses' key
259
+ with a list of dictionaries, each containing a 'response_id' field.
260
+
261
+ Returns:
262
+ bool: True if all input response IDs are present in the parsed response and
263
+ no additional IDs are present, False otherwise.
264
+ """
265
+ response_ids_set = set(input_response_ids)
266
+
267
+ returned_ids_set = {
268
+ str(
269
+ element["response_id"]
270
+ ) # treat ids as strings to match the response_ids stored on each BatchPrompt
271
+ for element in parsed_response["responses"]
272
+ if element.get("response_id", False)
273
+ }
274
+ # assumes: all input ids ought to be present in output
275
+ if returned_ids_set != response_ids_set:
276
+ logger.info("Failed integrity check")
277
+ logger.info(
278
+ f"Present in original but not returned from LLM: {response_ids_set - returned_ids_set}. Returned in LLM but not present in original: {returned_ids_set -response_ids_set}"
279
+ )
280
+ return False
281
+ return True
282
+
283
+
284
+ def process_llm_responses(
285
+ llm_responses: list[dict[str, Any]], responses: pd.DataFrame
286
+ ) -> pd.DataFrame:
287
+ """Process and merge LLM responses with the original DataFrame.
288
+
289
+ Args:
290
+ llm_responses (list[dict[str, Any]]): List of LLM response dictionaries, where each
291
+ dictionary contains a 'responses' key with a list of individual response objects.
292
+ responses (pd.DataFrame): Original DataFrame containing the input responses, must
293
+ include a 'response_id' column.
294
+
295
+ Returns:
296
+ pd.DataFrame: A merged DataFrame containing:
297
+ - If response_id exists in LLM output: Original responses joined with LLM results
298
+ on response_id (inner join)
299
+ - If no response_id in LLM output: DataFrame containing only the LLM results
300
+ """
301
+ responses.loc[:, "response_id"] = responses["response_id"].astype(int)
302
+ unpacked_responses = [
303
+ response
304
+ for batch_response in llm_responses
305
+ for response in batch_response.get("responses", [])
306
+ ]
307
+ task_responses = pd.DataFrame(unpacked_responses)
308
+ if "response_id" in task_responses.columns:
309
+ task_responses["response_id"] = task_responses["response_id"].astype(int)
310
+ return responses.merge(task_responses, how="inner", on="response_id")
311
+ return task_responses
@@ -0,0 +1 @@
1
+ You are an AI evaluation tool analyzing responses to a UK Government public consultation.
@@ -0,0 +1,47 @@
1
+ {system_prompt}
2
+
3
+ You will receive a list of RESPONSES, each containing a response_id and a response.
4
+ Your job is to analyze each response to the QUESTION below and decide:
5
+
6
+ POSITION - whether the response is agreeing with, disagreeing with, or unclear about the change being proposed in the question.
7
+ Choose one from [agreement, disagreement, unclear]
8
+
9
+ You should only return a response in strict json and nothing else. The final output should be in the following JSON format:
10
+
11
+ {{"responses": [
12
+ {{
13
+ "response_id": "{{response_id_1}}",
14
+ "position": {{position_1}},
15
+ }},
16
+ {{
17
+ "response_id": "{{response_id_2}}",
18
+ "position": {{position_2}},
19
+ }}
20
+ ...
21
+ ]}}
22
+
23
+ Example 1:
24
+ Question: \n What are your thoughts on the proposed government changes to the policy about reducing school holidays?
25
+ Response: \n as a parent I have no idea why you would make this change. I guess you were thinking about increasing productivity but any productivity gains would be totally offset by the decrease in family time. \n
26
+
27
+ Output:
28
+ POSITION: disagreement
29
+
30
+ Example 2:
31
+ Question: \n What are your thoughts on the proposed government changes to the policy about reducing school holidays?
32
+ Response: \n I think this is a great idea, our children will learn more if they are in school more \n
33
+
34
+ Output:
35
+ POSITION: agreement
36
+
37
+ Example 3:
38
+ Question: \n What are your thoughts on the proposed government changes to the policy about reducing school holidays?
39
+ Response: \n it will be good for our children to be around their friends more but it will be hard for some parents spend
40
+ less time with their children \n
41
+
42
+ Output:
43
+ POSITION: unclear
44
+
45
+
46
+ QUESTION: \n {question}
47
+ RESPONSES: \n {responses}
@@ -0,0 +1,42 @@
1
+ {system_prompt}
2
+
3
+ Below is a question and a list of topics extracted from answers to that question. Each topic has a topic_label and a topic_description.
4
+
5
+ Your task is to analyze these topics and produce a refined list that:
6
+ 1. Identifies and preserves core themes that appear frequently
7
+ 2. Captures unique perspectives that may only appear once but offer valuable insights
8
+ 3. Combines truly redundant topics while maintaining nuanced differences
9
+ 4. Ensures the final list represents the full spectrum of viewpoints present in the original data
10
+
11
+ Guidelines for Topic Analysis:
12
+ - Begin by identifying distinct concept clusters in the topics
13
+ - When a topic appears only once, evaluate its unique contribution before deciding to merge or preserve it
14
+ - Consider the context of the question when determining topic relevance
15
+ - Look for complementary perspectives that could enrich understanding of the same core concept
16
+ - Preserve specific examples or concrete applications that illustrate abstract concepts
17
+ - Maintain granularity where different aspects of the same broader theme offer distinct insights
18
+
19
+ The topics you are analyzing are all extracted from answers with the same position, where "position" means that the answer agrees ("Y") or disagrees ("N") with the question.
20
+
21
+ For each topic in your output:
22
+ 1. Choose a clear, representative label that captures the essence of the combined or preserved topic
23
+ 2. Write a comprehensive description that incorporates key insights from all constituent topics
24
+ 3. Ensure the description maintains specific examples or unique angles from the original topics
25
+ 4. Include the shared position value
26
+
27
+ The final output should be in the following JSON format:
28
+
29
+ {{"responses": [
30
+ {{"topic_label": "{{label for condensed topic 1}}", "topic_description": "{{description for condensed topic 1}}", "position": {{the position given below}}"}},
31
+ {{"topic_label": "{{label for condensed topic 2}}", "topic_description": "{{description for condensed topic 2}}", "position": {{the position given below}}"}},
32
+ {{"topic_label": "{{label for condensed topic 3}}", "topic_description": "{{description for condensed topic 3}}", "position": {{the position given below}}"}},
33
+ // Additional topics as necessary
34
+ ]}}
35
+
36
+ [Question]
37
+
38
+ {question}
39
+
40
+ [Themes]
41
+
42
+ {responses}
@@ -0,0 +1,70 @@
1
+ {system_prompt}
2
+
3
+ Your task is to analyse RESPONSES below and extract TOPICS such that:
4
+ 1. Each topic summarises points of view expressed in the responses
5
+ 2. Every distinct and relevant point of view in the responses should be captured by a topic
6
+ 3. Each topic has a topic_label which summarises the topic in a few words
7
+ 4. Each topic has a topic_description which gives more detail about the topic in one or two sentences
8
+ 5. The position field should just be the sentiment stated, and is either "agreement" or "disagreement"
9
+ 6. There should be no duplicate topics
10
+
11
+ The topics identified will be used by policy makers to understand what the public like and don't like about the proposals.
12
+
13
+ Here is an example of how to extract topics from some responses
14
+
15
+ EXAMPLE:
16
+
17
+ POSITION
18
+ disagreement
19
+
20
+ QUESTION
21
+ What are your views on the proposed change by the government to introduce a 2% tax on fast food meat products.
22
+
23
+ RESPONSES
24
+ [
25
+ {{"response": "I wish the government would stop interfering in the lves of its citizens. It only ever makes things worse. This change will just cost us all more money, and especially poorer people", "position": "disagreement"}},
26
+ {{"response": "Even though it will make people eat more healthier, I beleibe the government should interfer less and not more!", "position": "disagreement"}},
27
+ {{"response": "I hate grapes", "position": "disagreement"}},
28
+ ]
29
+
30
+ OUTPUTS
31
+
32
+ {{"responses": [
33
+ {{
34
+ "topic_label": "Government overreach",
35
+ "topic_description": "Some people thought the proposals would result in government interfering too much with citizen's lives",
36
+ "position": "disagreement"
37
+ }},
38
+ {{
39
+ "topic_label": "Regressive change",
40
+ "topic_description": "Some people thought the change would have a larger negative impact on poorer people",
41
+ "position": "disagreement"
42
+ }},
43
+ {{
44
+ "topic_label": "Health",
45
+ "topic_description": "Some people thought the change would result in people eating healthier diets",
46
+ "position": "disagreement"
47
+ }},
48
+ ]}}
49
+
50
+ You should only return a response in strict json and nothing else. The final output should be in the following JSON format:
51
+
52
+ {{"responses": [
53
+ {{
54
+ "topic_label": "{{label_1}}",
55
+ "topic_description": "{{description_1}}",
56
+ "position": "{{position_1}}"
57
+ }},
58
+ {{
59
+ "topic_label": "{{label_2}}",
60
+ "topic_description": "{{description_2}}",
61
+ "position": "{{position_2}}"
62
+ }},
63
+ // Additional topics as necessary
64
+ ]}}
65
+
66
+ QUESTION:
67
+ {question}
68
+
69
+ RESPONSES:
70
+ {responses}
@@ -0,0 +1,53 @@
1
+ {system_prompt}
2
+
3
+ Your job is to help identify which topics come up in responses to a question.
4
+
5
+ You will be given:
6
+ - a QUESTION that has been asked
7
+ - a TOPIC LIST of topics that are known to be present in responses to this question. These will be structured as follows:
8
+ {{'topic_id': 'topic_description'}}
9
+ - a list of RESPONSES to the question. These will be structured as follows:
10
+ {{'response_id': 'free text response'}}
11
+
12
+ Your task is to analyze each response and decide which topics are present. Guidelines:
13
+ - You can only assign a response to a topic in the provided TOPIC LIST
14
+ - A response doesn't need to exactly match the language used in the TOPIC LIST; it should be considered a match if it expresses a similar sentiment.
15
+ - You must use the alphabetic 'topic_id' to indicate which topic you have assigned.
16
+ - Each response can be assigned to multiple topics if it matches more than one topic from the TOPIC LIST.
17
+ - There is no limit on how many topics can be assigned to a response.
18
+ - For each assignment provide a single rationale for why you have chosen the label.
19
+ - For each topic identified in a response, indicate whether the response expresses a positive or negative stance toward that topic (options: 'POSITIVE' or 'NEGATIVE')
20
+ - If a response contains both positive and negative statements about a topic within the same response, choose the stance that receives more emphasis or appears more central to the argument
21
+ - The order of reasons and stances must align with the order of labels (e.g., stance_a applies to topic_a)
22
+
23
+
24
+ The final output should be in the following JSON format:
25
+
26
+ {{
27
+ "responses": [
28
+ {{
29
+ "response_id": "response_id_1",
30
+ "reasons": ["reason_a", "reason_b"],
31
+ "labels": ["topic_a", "topic_b"],
32
+ "stances": ["stance_a", "stance_b"],
33
+ }},
34
+ {{
35
+ "response_id": "response_id_2",
36
+ "reasons": ["reason_c"],
37
+ "labels": ["topic_c"],
38
+ "stances": ["stance_c"],
39
+ }}
40
+ ]
41
+ }}
42
+
43
+ QUESTION:
44
+
45
+ {question}
46
+
47
+ TOPIC LIST:
48
+
49
+ {refined_themes}
50
+
51
+ RESPONSES:
52
+
53
+ {responses}
@@ -0,0 +1,77 @@
1
+ {system_prompt}
2
+
3
+ You are tasked with refining and neutralizing a list of topics generated from responses to a question. Your goal is to transform opinionated topics into neutral, well-structured, and distinct topics while preserving the essential information.
4
+
5
+ ## Input
6
+ You will receive a list of OPINIONATED TOPICS. These topics explicitly tie opinions to whether a person agrees or disagrees with the question.
7
+
8
+ ## Output
9
+ You will produce a list of NEUTRAL TOPICS based on the input. Each neutral topic should have two parts:
10
+ 1. A brief, clear topic label (3-7 words)
11
+ 2. A more detailed topic description (1-2 sentences)
12
+
13
+ ## Guidelines
14
+
15
+ 1. Information Retention:
16
+ - Preserve all key information, details and concepts from the original topics.
17
+ - Ensure no significant details are lost in the refinement process.
18
+
19
+ 2. Neutrality:
20
+ - Remove all language indicating agreement or disagreement.
21
+ - Present topics objectively without favoring any particular stance.
22
+ - Avoid phrases like "supporters believe" or "critics argue".
23
+
24
+ 3. Avoid Response References:
25
+ - Do not use language that refers to multiple responses or respondents.
26
+ - Focus solely on the content of each topic.
27
+ - Avoid phrases like "many respondents said" or "some responses indicated".
28
+
29
+ 4. Distinctiveness:
30
+ - Ensure each topic represents a unique concept or aspect of the policy.
31
+ - Minimize overlap between topics.
32
+ - If topics are closely related, find ways to differentiate them clearly.
33
+
34
+ 5. Fluency and Readability:
35
+ - Create concise, clear topic labels that summarize the main idea.
36
+ - Provide detailed descriptions that expand on the label without mere repetition.
37
+ - Use proper grammar, punctuation, and natural language.
38
+
39
+ ## Process
40
+
41
+ 1. Analyze the OPINIONATED TOPICS to identify key themes and information.
42
+ 2. Group closely related topics together.
43
+ 3. For each group or individual topic:
44
+ a. Distill the core concept, removing any bias or opinion.
45
+ b. Create a neutral, concise topic label.
46
+ c. Write a more detailed description that provides context without taking sides.
47
+ 4. Review the entire list to ensure distinctiveness and adjust as needed.
48
+ 5. Double-check that all topics are truly neutral and free of response references.
49
+ 6. Assign each output topic a topic_id: a single uppercase letter (starting from 'A')
50
+ 7. Combine the topic label and description with a colon separator
51
+
52
+ Return your output in the following JSON format:
53
+ {{
54
+ "responses": [
55
+ {{"topic_id": "A", "topic": "{{topic label 1}}: {{topic description 1}}"}},
56
+ {{"topic_id": "B", "topic": "{{topic label 2}}: {{topic description 2}}"}},
57
+ {{"topic_id": "C", "topic": "{{topic label 3}}: {{topic description 3}}"}},
58
+ // Additional topics as necessary
59
+ ]
60
+ }}
61
+
62
+
63
+ ## Example
64
+
65
+ OPINIONATED TOPIC:
66
+ "Economic impact: Many respondents who support the policy believe it will create jobs and boost the economy, it could raise GDP by 2%."
67
+
68
+ NEUTRAL TOPIC:
69
+ Topic Label: Economic Impact on Employment
70
+ Description: The policy's potential effects on job creation and overall economic growth, including potential for a 2% increase in GDP.
71
+
72
+ Remember, your goal is to create a list of neutral, informative, and distinct topics that accurately represent the content of the original opinionated topics without any bias or references to responses.
73
+
74
+
75
+
76
+ OPINIONATED TOPIC:
77
+ {responses}
@@ -0,0 +1,12 @@
1
+ import logging
2
+ import sys
3
+
4
+
5
+ logger = logging.getLogger("theme_finder.tasks")
6
+ logger.setLevel(logging.INFO)
7
+
8
+ handler = logging.StreamHandler(sys.stdout)
9
+ formatter = logging.Formatter("%(asctime)s %(levelname)s: %(message)s")
10
+ handler.setFormatter(formatter)
11
+ handler.setLevel(logging.INFO)
12
+ logger.addHandler(handler)