themefinder 0.7.4__py3-none-any.whl
- themefinder/__init__.py +24 -0
- themefinder/advanced_tasks/__init__.py +0 -0
- themefinder/advanced_tasks/cross_cutting_themes_agent.py +404 -0
- themefinder/advanced_tasks/theme_clustering_agent.py +356 -0
- themefinder/llm_batch_processor.py +442 -0
- themefinder/models.py +438 -0
- themefinder/prompts/agentic_theme_clustering.txt +34 -0
- themefinder/prompts/consultation_system_prompt.txt +1 -0
- themefinder/prompts/cross_cutting_identification.txt +16 -0
- themefinder/prompts/cross_cutting_mapping.txt +19 -0
- themefinder/prompts/cross_cutting_refinement.txt +15 -0
- themefinder/prompts/detail_detection.txt +31 -0
- themefinder/prompts/sentiment_analysis.txt +41 -0
- themefinder/prompts/theme_condensation.txt +34 -0
- themefinder/prompts/theme_generation.txt +38 -0
- themefinder/prompts/theme_mapping.txt +36 -0
- themefinder/prompts/theme_refinement.txt +54 -0
- themefinder/prompts/theme_target_alignment.txt +18 -0
- themefinder/tasks.py +656 -0
- themefinder/themefinder_logging.py +12 -0
- themefinder-0.7.4.dist-info/METADATA +174 -0
- themefinder-0.7.4.dist-info/RECORD +24 -0
- themefinder-0.7.4.dist-info/WHEEL +4 -0
- themefinder-0.7.4.dist-info/licenses/LICENCE +21 -0
@@ -0,0 +1,174 @@
Metadata-Version: 2.4
Name: themefinder
Version: 0.7.4
Summary: A topic modelling Python package designed for analysing one-to-many question-answer data eg free-text survey responses.
License: MIT
License-File: LICENCE
Author: i.AI
Author-email: packages@cabinetoffice.gov.uk
Requires-Python: >=3.10,<3.13
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: boto3 (>=1.29,<2.0)
Requires-Dist: langchain
Requires-Dist: langchain-openai
Requires-Dist: langfuse (==2.29.1)
Requires-Dist: openpyxl (>=3.1.5,<4.0.0)
Requires-Dist: pandas (>=2.2.2,<3.0.0)
Requires-Dist: pyarrow (>=15.0.0,<16.0.0)
Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
Requires-Dist: scikit-learn
Requires-Dist: toml (>=0.10.2,<0.11.0)
Project-URL: Documentation, https://i-dot-ai.github.io/themefinder/
Project-URL: Repository, https://github.com/i-dot-ai/themefinder/
Description-Content-Type: text/markdown

# ThemeFinder

ThemeFinder is a topic modelling Python package designed for analysing one-to-many question-answer data (e.g. survey responses, public consultations, etc.). See the [docs](https://i-dot-ai.github.io/themefinder/) for more info.

> [!IMPORTANT]
> Incubation project: This project is an incubation project; as such, we don't recommend using this for critical use cases yet. We are currently in a research stage, trialling the tool for case studies across the Civil Service. Find out more about our projects at https://ai.gov.uk/.


## Quickstart

### Install using your package manager of choice

For example `pip install themefinder` or `poetry add themefinder`.

### Usage

ThemeFinder takes as input a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) with two columns:
- `response_id`: A unique identifier for each response
- `response`: The free text survey response
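
As a minimal sketch of preparing such a DataFrame, the snippet below checks for the two columns named above and for unique identifiers before the data is passed on. The `validate_responses` helper is hypothetical and not part of ThemeFinder; the uniqueness check is an assumption based on the description of `response_id`.

```python
import pandas as pd

REQUIRED_COLUMNS = ["response_id", "response"]


def validate_responses(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: check the two expected columns and unique IDs."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    if df["response_id"].duplicated().any():
        raise ValueError("response_id values must be unique")
    # Keep only the two expected columns, in a predictable order
    return df[REQUIRED_COLUMNS].copy()


responses_df = validate_responses(
    pd.DataFrame(
        {
            "response_id": ["1", "2"],
            "response": ["It's great.", "I'm not sure yet."],
            "submitted_on": ["2024-01-01", "2024-01-02"],  # extra column, dropped
        }
    )
)
```

Failing early on a missing column or a duplicated identifier is cheaper than discovering the problem after LLM calls have been made.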

ThemeFinder now supports a range of language models through structured outputs.

The function `find_themes` identifies common themes in responses and labels them; it also outputs results from intermediate steps in the theme-finding pipeline.

For this example, make sure the following Python packages are available in your virtual environment: `asyncio`, `pandas`, `langchain`. Also install `themefinder` as described above.

If you are using environment variables (e.g. for API keys), you can use `python-dotenv` to read variables from a `.env` file.

If you are using an Azure OpenAI endpoint, you will need the following variables:

- `AZURE_OPENAI_API_KEY`
- `AZURE_OPENAI_ENDPOINT`
- `OPENAI_API_VERSION`
- `DEPLOYMENT_NAME`
- `AZURE_OPENAI_BASE_URL`
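
A quick way to confirm these variables are set before creating the LLM client is sketched below. The `missing_variables` helper is hypothetical; the variable names are taken from the list above, and treating an empty value as missing is an assumption.

```python
import os

# Variable names taken from the list above
AZURE_VARIABLES = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "OPENAI_API_VERSION",
    "DEPLOYMENT_NAME",
    "AZURE_OPENAI_BASE_URL",
]


def missing_variables(names, environ=os.environ):
    """Hypothetical helper: return the names that are unset or empty."""
    return [name for name in names if not environ.get(name)]


missing = missing_variables(AZURE_VARIABLES)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

If you load settings with `python-dotenv`, call `load_dotenv()` before running this check so that values from the `.env` file are visible in `os.environ`.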

Otherwise you will need whichever variables [LangChain](https://www.langchain.com/) requires for your LLM of choice.

```python
import asyncio
from dotenv import load_dotenv
import pandas as pd
from langchain_openai import AzureChatOpenAI
from themefinder import find_themes

# If needed, load LLM API settings from .env file
load_dotenv()

# Initialise your LLM of choice using langchain
llm = AzureChatOpenAI(
    model="gpt-4o",
    temperature=0,
)

# Set up your data
responses_df = pd.DataFrame({
    "response_id": ["1", "2", "3", "4", "5"],
    "response": ["I think it's awesome, I can use it for consultation analysis.",
                 "It's great.", "It's a good approach to topic modelling.", "I'm not sure, I need to trial it more.", "I don't like it so much."]
})

# Add your question
question = "What do you think of ThemeFinder?"

# Make the system prompt specific to your use case
system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."

# Run the function to find themes. We use asyncio to query LLM endpoints asynchronously, so we need to await our function.
async def main():
    result = await find_themes(responses_df, llm, question, system_prompt=system_prompt)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```

## ThemeFinder pipeline

ThemeFinder's pipeline consists of five core stages and one optional stage (theme target alignment), each utilizing a specialized LLM prompt:

### Sentiment analysis
- Analyses the emotional tone and position of each response using sentiment-focused prompts
- Provides structured sentiment categorisation based on LLM analysis

### Theme generation
- Uses exploratory prompts to identify initial themes from response batches
- Groups related responses for better context through guided theme extraction

### Theme condensation
- Employs comparative prompts to combine similar or overlapping themes
- Reduces redundancy in identified topics through systematic theme evaluation

### Theme refinement
- Leverages standardisation prompts to normalise theme descriptions
- Creates clear, consistent theme definitions through structured refinement

### Theme target alignment
- Optional step to consolidate themes down to a target number

### Theme mapping
- Utilizes classification prompts to map individual responses to refined themes
- Supports multiple theme assignments per response through detailed analysis


The prompts used at each stage can be found in `src/themefinder/prompts/`.
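
The sketch below pairs each stage heading with a prompt file from the wheel's file list. The pairing is an assumption inferred from matching file names, not something the README states, and the `prompt_path` helper is hypothetical.

```python
from pathlib import PurePosixPath

# Assumed pairing of pipeline stages to prompt files, inferred from file names
STAGE_PROMPTS = {
    "Sentiment analysis": "sentiment_analysis.txt",
    "Theme generation": "theme_generation.txt",
    "Theme condensation": "theme_condensation.txt",
    "Theme refinement": "theme_refinement.txt",
    "Theme target alignment": "theme_target_alignment.txt",  # optional stage
    "Theme mapping": "theme_mapping.txt",
}

PROMPT_DIR = PurePosixPath("src/themefinder/prompts")


def prompt_path(stage: str) -> PurePosixPath:
    """Hypothetical helper: path of the prompt file assumed for a stage."""
    return PROMPT_DIR / STAGE_PROMPTS[stage]


for stage in STAGE_PROMPTS:
    print(f"{stage}: {prompt_path(stage)}")
```

The remaining prompt files in the wheel (for example the `cross_cutting_*` and `agentic_theme_clustering` prompts) are not described in this README and are left out of the mapping.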

The file `src/themefinder.core.py` contains the function `find_themes`, which runs the pipeline. It also contains functions for each individual stage.


**For more detail - see the docs: [https://i-dot-ai.github.io/themefinder/](https://i-dot-ai.github.io/themefinder/).**


## Model Compatibility

ThemeFinder's structured output approach makes it compatible with a wide range of language models from various providers. This list is non-exhaustive, and other models may also work effectively:

### OpenAI Models
- GPT-4, GPT-4o, GPT-4.1
- All Azure OpenAI deployments

### Google Models
- Gemini series (1.5 Pro, 2.0 Pro, etc.)

### Anthropic Models
- Claude series (Claude 3 Opus, Sonnet, Haiku, etc.)

### Open Source Models
- Llama 2, Llama 3
- Mistral models (e.g., Mistral 7B, Mixtral)


## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

The documentation is [© Crown copyright](https://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/) and available under the terms of the [Open Government 3.0 licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).


## Feedback

Contact us with questions or feedback at packages@cabinetoffice.gov.uk.

@@ -0,0 +1,24 @@
themefinder/__init__.py,sha256=DosVY1CPiL179NnPvLhXr-7bkZDbqFp93XcJh3AswhE,474
themefinder/advanced_tasks/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
themefinder/advanced_tasks/cross_cutting_themes_agent.py,sha256=tCG1bwMXXa4Iy1rX8cJPbxps8VamhndPc9RCbbkjf5Q,15074
themefinder/advanced_tasks/theme_clustering_agent.py,sha256=HGUEpsutBIQ80TL6stjE2zMruNwbRQkxRYpYjKnxi1E,13734
themefinder/llm_batch_processor.py,sha256=Z9jm9Kr-6GD8g8kLkgdW97onjUbLLQ2M1YKwok39Q6Y,17652
themefinder/models.py,sha256=iN-chIm0ojyfRPr_cj9wQU3Q4I3yrwFI3FgnYb7IjWA,15072
themefinder/prompts/agentic_theme_clustering.txt,sha256=FuvHD4jjCDBQ1ptTKYg0W9Bpsbwy7VeK1l-NzRoEmNM,2155
themefinder/prompts/consultation_system_prompt.txt,sha256=_A07oY_an4hnRx-9pQ0y-TLXJz0dd8vDI-MZne7Mdb4,89
themefinder/prompts/cross_cutting_identification.txt,sha256=Dm7BwIZV21HgnAOQd3EMatuhwRtQS-pxttQC_ekAb9g,1115
themefinder/prompts/cross_cutting_mapping.txt,sha256=d7w1SFEyQ6IQWUuzzlvVWW9yYW4WByoIP0Ls6lHg9JU,929
themefinder/prompts/cross_cutting_refinement.txt,sha256=5nWH-lbpVJD9BRvxjnHifmjVb0oGAIfJaBxL4f6XOss,860
themefinder/prompts/detail_detection.txt,sha256=hMB8yQR5y855TJLYSW3CNZDkLTPaA2lf9UJwH_GpkD4,1515
themefinder/prompts/sentiment_analysis.txt,sha256=vYCDhtEsG5I9xixwVhZbvKPJGU1Gqpw4-xAqGz72xhU,1671
themefinder/prompts/theme_condensation.txt,sha256=jqWKuPaSKrRGeYwNWTlVx45hfyWWhX1CvnKXrIiXxa0,1714
themefinder/prompts/theme_generation.txt,sha256=QRKW7DtcMSb2olT6j5jmdEPcXPMeZgogM-NYddEIKRk,1871
themefinder/prompts/theme_mapping.txt,sha256=0z6ddfYxRn1Ew4W3Su-16qTbWn2C6J2LMnK7Biu1tno,1621
themefinder/prompts/theme_refinement.txt,sha256=JDSYs2sdXqN-Yw9OWjfbmsl9x4Bn1J3oNVSsb_PQ5Ik,2433
themefinder/prompts/theme_target_alignment.txt,sha256=g7AVZLiP_xIH010X5SIZyG3q7gA6OBAplPv3xvmstOY,855
themefinder/tasks.py,sha256=FIBC9-0aUDuAuxAFa7zqIgMo5-5WSbkkIZBT0QtF5Co,26946
themefinder/themefinder_logging.py,sha256=n5SUQovEZLC4skEbxicjz_fOGF9mOk3S-Wpj5uXsaL8,314
themefinder-0.7.4.dist-info/METADATA,sha256=OVi-63REBmQ7-ptHAKe_kxB_K2oVSiAYQPflhKJADSU,6748
themefinder-0.7.4.dist-info/WHEEL,sha256=3ny-bZhpXrU6vSQ1UPG34FoxZBp3lVcvK0LkgUz6VLk,88
themefinder-0.7.4.dist-info/licenses/LICENCE,sha256=C9ULIN0ctF60ZxUWH_hw1H434bDLg49Z-Qzn6BUHgqs,1060
themefinder-0.7.4.dist-info/RECORD,,
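
Each RECORD entry has the form `path,sha256=<digest>,size`. As a minimal sketch, assuming the digest is the URL-safe base64 encoding of the SHA-256 hash with trailing `=` padding removed (the convention used by the wheel format), an entry could be checked as follows. The `record_digest` helper is hypothetical.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Hypothetical helper: SHA-256 digest in RECORD style (URL-safe base64, no padding)."""
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# The empty themefinder/advanced_tasks/__init__.py (size 0) is listed with this digest
expected = "47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU"
print(record_digest(b"") == expected)
```

To check any other entry, read the file in binary mode from the unpacked wheel and compare both the digest and the byte length against the listing.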
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 i.AI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.