themefinder 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 i.AI
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,167 @@
1
+ Metadata-Version: 2.3
2
+ Name: themefinder
3
+ Version: 0.2.0
4
+ Summary: A topic modelling Python package designed for analysing one-to-many question-answer data, e.g. free-text survey responses.
5
+ License: MIT
6
+ Author: i.AI
7
+ Author-email: packages@cabinetoffice.gov.uk
8
+ Requires-Python: >=3.12,<4.0
9
+ Classifier: Intended Audience :: Developers
10
+ Classifier: Intended Audience :: Science/Research
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Programming Language :: Python :: 3
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Classifier: Programming Language :: Python :: 3.13
15
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
16
+ Classifier: Topic :: Text Processing :: Linguistic
17
+ Requires-Dist: boto3 (>=1.29,<2.0)
18
+ Requires-Dist: langchain
19
+ Requires-Dist: langchain-openai (==0.1.17)
20
+ Requires-Dist: langfuse (==2.29.1)
21
+ Requires-Dist: openpyxl (>=3.1.5,<4.0.0)
22
+ Requires-Dist: pandas (>=2.2.2,<3.0.0)
23
+ Requires-Dist: pyarrow (>=15.0.0,<16.0.0)
24
+ Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
25
+ Requires-Dist: scikit-learn
26
+ Project-URL: Documentation, https://i-dot-ai.github.io/themefinder/
27
+ Project-URL: Repository, https://github.com/i-dot-ai/themefinder/
28
+ Description-Content-Type: text/markdown
29
+
30
+ # ThemeFinder
31
+
32
+ ThemeFinder is a topic modelling Python package designed for analyzing one-to-many question-answer data (e.g. survey responses, public consultations). See the [docs](docs/pipeline.md) for more info.
33
+
34
+ > [!IMPORTANT]
35
+ > Incubation project: we don't recommend using this for critical use cases yet. We are currently in a research stage, trialling the tool for case studies across the Civil Service. Find out more about our projects at https://ai.gov.uk/.
36
+
37
+
38
+ ## Quickstart
39
+
40
+ ### Install the package locally
41
+
42
+ Clone the package from GitHub:
43
+ ```
44
+ git clone https://github.com/i-dot-ai/themefinder.git
45
+ ```
46
+
47
+ Install the package into your virtual environment, where `<FILE_PATH>` is the location of the `themefinder` directory.
48
+
49
+ Install with pip:
50
+ ```
51
+ pip install -e <FILE_PATH>
52
+ ```
53
+
54
+ Install with poetry:
55
+ ```
56
+ poetry add -e <FILE_PATH>
57
+ ```
58
+
59
+ ### Usage
60
+
61
+ ThemeFinder takes as input a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) with two columns:
62
+ - `response_id`: A unique identifier for each response
63
+ - `response`: The free text survey response
64
+
65
+ ThemeFinder is compatible with any instantiated [LangChain LLM runnable](https://python.langchain.com/v0.1/docs/integrations/llms/), but you will need to use JSON structured output.
66
+
67
+ The function `find_themes` identifies and labels common themes in the responses; it also outputs results from the intermediate steps of the theme-finding pipeline.
68
+
69
+ For this example, install the following Python packages into your virtual environment: `pandas` and `langchain` (`asyncio` is part of the Python standard library), and install `themefinder` as described above.
70
+
71
+ If you are using environment variables (e.g. for API keys), you can use `python-dotenv` to read variables from a `.env` file.
72
+
73
+ If you are using an Azure OpenAI endpoint, you will need the following variables:
74
+
75
+ - `AZURE_OPENAI_API_KEY`
76
+ - `AZURE_OPENAI_ENDPOINT`
77
+ - `OPENAI_API_VERSION`
78
+ - `DEPLOYMENT_NAME`
79
+ - `AZURE_OPENAI_BASE_URL`
80
+
81
+ Otherwise you will need whichever variables [LangChain](https://www.langchain.com/) requires for your LLM of choice.
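+
+ For example, a `.env` file for an Azure OpenAI setup might look like the sketch below (placeholder values only; the exact set of variables depends on your provider and deployment):
+
+ ```
+ AZURE_OPENAI_API_KEY=<your-api-key>
+ AZURE_OPENAI_ENDPOINT=<your-endpoint-url>
+ OPENAI_API_VERSION=<api-version>
+ DEPLOYMENT_NAME=<your-deployment-name>
+ AZURE_OPENAI_BASE_URL=<your-base-url>
+ ```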
82
+
83
+ ```python
84
+ import asyncio
85
+ from dotenv import load_dotenv
86
+ import pandas as pd
87
+ from langchain_openai import AzureChatOpenAI
88
+ from themefinder import find_themes
89
+
90
+ # If needed, load LLM API settings from .env file
91
+ load_dotenv()
92
+
93
+ # Initialise your LLM of choice using langchain
94
+ llm = AzureChatOpenAI(
95
+ model="gpt-4o",
96
+ temperature=0,
97
+ model_kwargs={"response_format": {"type": "json_object"}},
98
+ )
99
+
100
+ # Set up your data
101
+ responses_df = pd.DataFrame({
102
+ "response_id": ["1", "2", "3", "4", "5"],
103
+ "response": ["I think it's awesome, I can use it for consultation analysis.",
104
+ "It's great.", "It's a good approach to topic modelling.", "I'm not sure, I need to trial it more.", "I don't like it so much."]
105
+ })
106
+
107
+ # Add your question
108
+ question = "What do you think of ThemeFinder?"
109
+
110
+ # Make the system prompt specific to your use case
111
+ system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."
112
+
113
+ # Run the function to find themes
114
+ # We use asyncio to query LLM endpoints asynchronously, so we need to await our function
115
+ async def main():
116
+ result = await find_themes(responses_df, llm, question, system_prompt)
117
+ print(result)
118
+
119
+ if __name__ == "__main__":
120
+ asyncio.run(main())
121
+
122
+ ```
123
+
124
+
125
+ ## ThemeFinder pipeline
126
+
127
+ ThemeFinder's pipeline consists of five distinct stages, each utilizing a specialized LLM prompt:
128
+
129
+ ### Sentiment analysis
130
+ - Analyses the emotional tone and position of each response using sentiment-focused prompts
131
+ - Provides structured sentiment categorisation based on LLM analysis
132
+
133
+ ### Theme generation
134
+ - Uses exploratory prompts to identify initial themes from response batches
135
+ - Groups related responses for better context through guided theme extraction
136
+
137
+ ### Theme condensation
138
+ - Employs comparative prompts to combine similar or overlapping themes
139
+ - Reduces redundancy in identified topics through systematic theme evaluation
140
+
141
+ ### Theme refinement
142
+ - Leverages standardisation prompts to normalise theme descriptions
143
+ - Creates clear, consistent theme definitions through structured refinement
144
+
145
+ ### Theme mapping
146
+ - Utilizes classification prompts to map individual responses to refined themes
147
+ - Supports multiple theme assignments per response through detailed analysis
148
+
149
+
150
+ The prompts used at each stage can be found in `src/themefinder/prompts/`.
151
+
152
+ The file `src/themefinder/core.py` contains the function `find_themes`, which runs the pipeline. It also contains functions for each individual stage; a minimal usage sketch of calling them directly is shown below.
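+
+ For example, the stage functions are exported from the package, so the first two stages can be run on their own. This is an illustrative sketch only; the model name, credentials, and responses are placeholders for your own setup:
+
+ ```python
+ import asyncio
+
+ import pandas as pd
+ from langchain_openai import AzureChatOpenAI
+
+ from themefinder import sentiment_analysis, theme_generation
+
+ # Illustrative setup only: use your own deployment and credentials
+ llm = AzureChatOpenAI(
+     model="gpt-4o",
+     temperature=0,
+     model_kwargs={"response_format": {"type": "json_object"}},
+ )
+
+ responses_df = pd.DataFrame({
+     "response_id": ["1", "2"],
+     "response": ["I support the proposed change.", "I am against the proposed change."],
+ })
+ question = "What do you think of the proposed change?"
+
+ async def run_first_two_stages():
+     # Stage 1: sentiment analysis enriches each response with a "position" column
+     sentiment_df = await sentiment_analysis(responses_df, llm, question=question)
+     # Stage 2: theme generation uses that column (its default partition key)
+     # to batch related responses together before proposing initial themes
+     return await theme_generation(sentiment_df, llm, question=question)
+
+ print(asyncio.run(run_first_two_stages()))
+ ```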
153
+
154
+
155
+ **For more detail - see the docs: [https://i-dot-ai.github.io/themefinder/](https://i-dot-ai.github.io/themefinder/).**
156
+
157
+
158
+ ## License
159
+
160
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
161
+
162
+ The documentation is [© Crown copyright](https://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/) and available under the terms of the [Open Government Licence v3.0](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
163
+
164
+
165
+ ## Feedback
166
+
167
+ If you have feedback on this package, please fill in our [feedback form](https://forms.gle/85xUSMvxGzSSKQ499) or contact us with questions at packages@cabinetoffice.gov.uk.
@@ -0,0 +1,138 @@
1
+ # ThemeFinder
2
+
3
+ ThemeFinder is a topic modelling Python package designed for analyzing one-to-many question-answer data (e.g. survey responses, public consultations). See the [docs](docs/pipeline.md) for more info.
4
+
5
+ > [!IMPORTANT]
6
+ > Incubation project: we don't recommend using this for critical use cases yet. We are currently in a research stage, trialling the tool for case studies across the Civil Service. Find out more about our projects at https://ai.gov.uk/.
7
+
8
+
9
+ ## Quickstart
10
+
11
+ ### Install the package locally
12
+
13
+ Clone the package from GitHub:
14
+ ```
15
+ git clone https://github.com/i-dot-ai/themefinder.git
16
+ ```
17
+
18
+ Install the package into your virtual environment, where `<FILE_PATH>` is the location of the `themefinder` directory.
19
+
20
+ Install with pip:
21
+ ```
22
+ pip install -e <FILE_PATH>
23
+ ```
24
+
25
+ Install with poetry:
26
+ ```
27
+ poetry add -e <FILE_PATH>
28
+ ```
29
+
30
+ ### Usage
31
+
32
+ ThemeFinder takes as input a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) with two columns:
33
+ - `response_id`: A unique identifier for each response
34
+ - `response`: The free text survey response
35
+
36
+ ThemeFinder is compatible with any instantiated [LangChain LLM runnable](https://python.langchain.com/v0.1/docs/integrations/llms/), but you will need to use JSON structured output.
37
+
38
+ The function `find_themes` identifies and labels common themes in the responses; it also outputs results from the intermediate steps of the theme-finding pipeline.
39
+
40
+ For this example, install the following Python packages into your virtual environment: `pandas` and `langchain` (`asyncio` is part of the Python standard library), and install `themefinder` as described above.
41
+
42
+ If you are using environment variables (e.g. for API keys), you can use `python-dotenv` to read variables from a `.env` file.
43
+
44
+ If you are using an Azure OpenAI endpoint, you will need the following variables:
45
+
46
+ - `AZURE_OPENAI_API_KEY`
47
+ - `AZURE_OPENAI_ENDPOINT`
48
+ - `OPENAI_API_VERSION`
49
+ - `DEPLOYMENT_NAME`
50
+ - `AZURE_OPENAI_BASE_URL`
51
+
52
+ Otherwise you will need whichever variables [LangChain](https://www.langchain.com/) requires for your LLM of choice.
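+
+ For example, a `.env` file for an Azure OpenAI setup might look like the sketch below (placeholder values only; the exact set of variables depends on your provider and deployment):
+
+ ```
+ AZURE_OPENAI_API_KEY=<your-api-key>
+ AZURE_OPENAI_ENDPOINT=<your-endpoint-url>
+ OPENAI_API_VERSION=<api-version>
+ DEPLOYMENT_NAME=<your-deployment-name>
+ AZURE_OPENAI_BASE_URL=<your-base-url>
+ ```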
53
+
54
+ ```python
55
+ import asyncio
56
+ from dotenv import load_dotenv
57
+ import pandas as pd
58
+ from langchain_openai import AzureChatOpenAI
59
+ from themefinder import find_themes
60
+
61
+ # If needed, load LLM API settings from .env file
62
+ load_dotenv()
63
+
64
+ # Initialise your LLM of choice using langchain
65
+ llm = AzureChatOpenAI(
66
+ model="gpt-4o",
67
+ temperature=0,
68
+ model_kwargs={"response_format": {"type": "json_object"}},
69
+ )
70
+
71
+ # Set up your data
72
+ responses_df = pd.DataFrame({
73
+ "response_id": ["1", "2", "3", "4", "5"],
74
+ "response": ["I think it's awesome, I can use it for consultation analysis.",
75
+ "It's great.", "It's a good approach to topic modelling.", "I'm not sure, I need to trial it more.", "I don't like it so much."]
76
+ })
77
+
78
+ # Add your question
79
+ question = "What do you think of ThemeFinder?"
80
+
81
+ # Make the system prompt specific to your use case
82
+ system_prompt = "You are an AI evaluation tool analyzing survey responses about a Python package."
83
+
84
+ # Run the function to find themes
85
+ # We use asyncio to query LLM endpoints asynchronously, so we need to await our function
86
+ async def main():
87
+ result = await find_themes(responses_df, llm, question, system_prompt)
88
+ print(result)
89
+
90
+ if __name__ == "__main__":
91
+ asyncio.run(main())
92
+
93
+ ```
94
+
95
+
96
+ ## ThemeFinder pipeline
97
+
98
+ ThemeFinder's pipeline consists of five distinct stages, each utilizing a specialized LLM prompt:
99
+
100
+ ### Sentiment analysis
101
+ - Analyses the emotional tone and position of each response using sentiment-focused prompts
102
+ - Provides structured sentiment categorisation based on LLM analysis
103
+
104
+ ### Theme generation
105
+ - Uses exploratory prompts to identify initial themes from response batches
106
+ - Groups related responses for better context through guided theme extraction
107
+
108
+ ### Theme condensation
109
+ - Employs comparative prompts to combine similar or overlapping themes
110
+ - Reduces redundancy in identified topics through systematic theme evaluation
111
+
112
+ ### Theme refinement
113
+ - Leverages standardisation prompts to normalise theme descriptions
114
+ - Creates clear, consistent theme definitions through structured refinement
115
+
116
+ ### Theme mapping
117
+ - Utilizes classification prompts to map individual responses to refined themes
118
+ - Supports multiple theme assignments per response through detailed analysis
119
+
120
+
121
+ The prompts used at each stage can be found in `src/themefinder/prompts/`.
122
+
123
+ The file `src/themefinder/core.py` contains the function `find_themes`, which runs the pipeline. It also contains functions for each individual stage; a minimal usage sketch of calling them directly is shown below.
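+
+ For example, the stage functions are exported from the package, so the first two stages can be run on their own. This is an illustrative sketch only; the model name, credentials, and responses are placeholders for your own setup:
+
+ ```python
+ import asyncio
+
+ import pandas as pd
+ from langchain_openai import AzureChatOpenAI
+
+ from themefinder import sentiment_analysis, theme_generation
+
+ # Illustrative setup only: use your own deployment and credentials
+ llm = AzureChatOpenAI(
+     model="gpt-4o",
+     temperature=0,
+     model_kwargs={"response_format": {"type": "json_object"}},
+ )
+
+ responses_df = pd.DataFrame({
+     "response_id": ["1", "2"],
+     "response": ["I support the proposed change.", "I am against the proposed change."],
+ })
+ question = "What do you think of the proposed change?"
+
+ async def run_first_two_stages():
+     # Stage 1: sentiment analysis enriches each response with a "position" column
+     sentiment_df = await sentiment_analysis(responses_df, llm, question=question)
+     # Stage 2: theme generation uses that column (its default partition key)
+     # to batch related responses together before proposing initial themes
+     return await theme_generation(sentiment_df, llm, question=question)
+
+ print(asyncio.run(run_first_two_stages()))
+ ```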
124
+
125
+
126
+ **For more detail - see the docs: [https://i-dot-ai.github.io/themefinder/](https://i-dot-ai.github.io/themefinder/).**
127
+
128
+
129
+ ## License
130
+
131
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
132
+
133
+ The documentation is [© Crown copyright](https://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/) and available under the terms of the [Open Government Licence v3.0](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
134
+
135
+
136
+ ## Feedback
137
+
138
+ If you have feedback on this package, please fill in our [feedback form](https://forms.gle/85xUSMvxGzSSKQ499) or contact us with questions at packages@cabinetoffice.gov.uk.
@@ -0,0 +1,50 @@
1
+ [tool.poetry]
2
+ name = "themefinder"
3
+ version = "0.2.0"
4
+ description = "A topic modelling Python package designed for analysing one-to-many question-answer data, e.g. free-text survey responses."
5
+ authors = ["i.AI <packages@cabinetoffice.gov.uk>"]
6
+ packages = [{include = "themefinder", from = "src"}]
7
+ readme = "README.md"
8
+ license = "MIT"
9
+ repository = "https://github.com/i-dot-ai/themefinder/"
10
+ documentation = "https://i-dot-ai.github.io/themefinder/"
11
+ classifiers = [
12
+ "Intended Audience :: Developers",
13
+ "Intended Audience :: Science/Research",
14
+ "License :: OSI Approved :: MIT License",
15
+ "Programming Language :: Python :: 3.12",
16
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
17
+ "Topic :: Text Processing :: Linguistic",
18
+ ]
19
+
20
+
21
+ [tool.poetry.dependencies]
22
+ python = ">=3.12,<4.0"
23
+ langchain = "*"
24
+ langchain-openai = "0.1.17"
25
+ pandas = "^2.2.2"
26
+ python-dotenv = "^1.0.1"
27
+ langfuse = "2.29.1"
28
+ boto3 = "^1.29"
29
+ scikit-learn = "*"
30
+ openpyxl = "^3.1.5"
31
+ pyarrow = "^15.0.0"
32
+
33
+ [tool.poetry.group.dev.dependencies]
34
+ pytest = "*"
35
+ pytest-asyncio = "^0.24.0"
36
+ coverage = "^7.6.10"
37
+
38
+ [tool.poetry.group.docs.dependencies]
39
+ mkdocs = "^1.6.1"
40
+ mkdocstrings = {extras = ["python"], version = "^0.27.0"}
41
+ mkdocs-material = "^9.5.50"
42
+
43
+ [tool.pytest.ini_options]
44
+ pythonpath = "."
45
+ asyncio_mode = "auto"
46
+ asyncio_default_fixture_loop_scope = "function"
47
+
48
+ [build-system]
49
+ requires = ["poetry-core>=1.0.0"]
50
+ build-backend = "poetry.core.masonry.api"
@@ -0,0 +1,18 @@
1
+ from .core import (
2
+ find_themes,
3
+ sentiment_analysis,
4
+ theme_generation,
5
+ theme_condensation,
6
+ theme_refinement,
7
+ theme_mapping,
8
+ )
9
+
10
+ __all__ = [
11
+ "find_themes",
12
+ "sentiment_analysis",
13
+ "theme_generation",
14
+ "theme_condensation",
15
+ "theme_refinement",
16
+ "theme_mapping",
17
+ ]
18
+ __version__ = "0.2.0"
@@ -0,0 +1,326 @@
1
+ from pathlib import Path
2
+
3
+ import pandas as pd
4
+ from langchain_core.prompts import PromptTemplate
5
+ from langchain_core.runnables import Runnable
6
+
7
+ from .llm_batch_processor import batch_and_run, load_prompt_from_file
8
+ from .themefinder_logging import logger
9
+
10
+
11
+ CONSULTATION_SYSTEM_PROMPT = load_prompt_from_file("consultation_system_prompt")
12
+
13
+
14
+ async def find_themes(
15
+ responses_df: pd.DataFrame,
16
+ llm: Runnable,
17
+ question: str,
18
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
19
+ ) -> dict[str, pd.DataFrame]:
20
+ """Process survey responses through a multi-stage theme analysis pipeline.
21
+
22
+ This pipeline performs sequential analysis steps:
23
+ 1. Sentiment analysis of responses
24
+ 2. Initial theme generation
25
+ 3. Theme condensation (combining similar themes)
26
+ 4. Theme refinement
27
+ 5. Mapping responses to refined themes
28
+
29
+ Args:
30
+ responses_df (pd.DataFrame): DataFrame containing survey responses
31
+ llm (Runnable): Language model instance for text analysis
32
+ question (str): The survey question
33
+ system_prompt (str): System prompt to guide the LLM's behavior.
34
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
35
+
36
+ Returns:
37
+ dict[str, pd.DataFrame]: Dictionary containing results from each pipeline stage:
38
+ - question: The survey question
39
+ - sentiment: DataFrame with sentiment analysis results
40
+ - topics: DataFrame with initial generated themes
41
+ - condensed_topics: DataFrame with combined similar themes
42
+ - refined_topics: DataFrame with refined theme definitions
43
+ - mapping: DataFrame mapping responses to final themes
44
+ """
45
+ sentiment_df = await sentiment_analysis(
46
+ responses_df,
47
+ llm,
48
+ question=question,
49
+ system_prompt=system_prompt,
50
+ )
51
+ theme_df = await theme_generation(
52
+ sentiment_df,
53
+ llm,
54
+ question=question,
55
+ system_prompt=system_prompt,
56
+ )
57
+ condensed_theme_df = await theme_condensation(
58
+ theme_df, llm, question=question, system_prompt=system_prompt
59
+ )
60
+ refined_theme_df = await theme_refinement(
61
+ condensed_theme_df,
62
+ llm,
63
+ question=question,
64
+ system_prompt=system_prompt,
65
+ )
66
+ mapping_df = await theme_mapping(
67
+ sentiment_df,
68
+ llm,
69
+ question=question,
70
+ refined_themes_df=refined_theme_df,
71
+ system_prompt=system_prompt,
72
+ )
73
+
74
+ logger.info("Finished finding themes")
75
+ logger.info(
76
+ "Provide feedback or report bugs: https://forms.gle/85xUSMvxGzSSKQ499 or packages@cabinetoffice.gov.uk"
77
+ )
78
+ return {
79
+ "question": question,
80
+ "sentiment": sentiment_df,
81
+ "topics": theme_df,
82
+ "condensed_topics": condensed_theme_df,
83
+ "refined_topics": refined_theme_df,
84
+ "mapping": mapping_df,
85
+ }
86
+
87
+
88
+ async def sentiment_analysis(
89
+ responses_df: pd.DataFrame,
90
+ llm: Runnable,
91
+ question: str,
92
+ batch_size: int = 10,
93
+ prompt_template: str | Path | PromptTemplate = "sentiment_analysis",
94
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
95
+ ) -> pd.DataFrame:
96
+ """Perform sentiment analysis on survey responses using an LLM.
97
+
98
+ This function processes survey responses in batches to analyze their sentiment
99
+ using a language model. It maintains response integrity by checking response IDs.
100
+
101
+ Args:
102
+ responses_df (pd.DataFrame): DataFrame containing survey responses to analyze.
103
+ Must contain 'response_id' and 'response' columns.
104
+ llm (Runnable): Language model instance to use for sentiment analysis.
105
+ question (str): The survey question.
106
+ batch_size (int, optional): Number of responses to process in each batch.
107
+ Defaults to 10.
108
+ prompt_template (str | Path | PromptTemplate, optional): Template for structuring
109
+ the prompt to the LLM. Can be a string identifier, path to template file,
110
+ or PromptTemplate instance. Defaults to "sentiment_analysis".
111
+ system_prompt (str): System prompt to guide the LLM's behavior.
112
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
113
+
114
+ Returns:
115
+ pd.DataFrame: DataFrame containing the original responses enriched with
116
+ sentiment analysis results.
117
+
118
+ Note:
119
+ The function uses response_id_integrity_check to ensure responses maintain
120
+ their original order and association after processing.
121
+ """
122
+ logger.info(f"Running sentiment analysis on {len(responses_df)} responses")
123
+ return await batch_and_run(
124
+ responses_df,
125
+ prompt_template,
126
+ llm,
127
+ batch_size=batch_size,
128
+ question=question,
129
+ response_id_integrity_check=True,
130
+ system_prompt=system_prompt,
131
+ )
132
+
133
+
134
+ async def theme_generation(
135
+ responses_df: pd.DataFrame,
136
+ llm: Runnable,
137
+ question: str,
138
+ batch_size: int = 50,
139
+ partition_key: str | None = "position",
140
+ prompt_template: str | Path | PromptTemplate = "theme_generation",
141
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
142
+ ) -> pd.DataFrame:
143
+ """Generate themes from survey responses using an LLM.
144
+
145
+ This function processes batches of survey responses to identify common themes or topics.
146
+
147
+ Args:
148
+ responses_df (pd.DataFrame): DataFrame containing survey responses.
149
+ Must include 'response_id' and 'response' columns.
150
+ llm (Runnable): Language model instance to use for theme generation.
151
+ question (str): The survey question.
152
+ batch_size (int, optional): Number of responses to process in each batch.
153
+ Defaults to 50.
154
+ partition_key (str | None, optional): Column name to use for batching related
155
+ responses together. Defaults to "position" for sentiment-enriched responses,
156
+ but can be set to None for sequential batching or another column name for
157
+ different grouping strategies.
158
+ prompt_template (str | Path | PromptTemplate, optional): Template for structuring
159
+ the prompt to the LLM. Can be a string identifier, path to template file,
160
+ or PromptTemplate instance. Defaults to "theme_generation".
161
+ system_prompt (str): System prompt to guide the LLM's behavior.
162
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
163
+
164
+ Returns:
165
+ pd.DataFrame: DataFrame containing identified themes and their associated metadata.
166
+ """
167
+ logger.info(f"Running theme generation on {len(responses_df)} responses")
168
+ return await batch_and_run(
169
+ responses_df,
170
+ prompt_template,
171
+ llm,
172
+ batch_size=batch_size,
173
+ partition_key=partition_key,
174
+ question=question,
175
+ system_prompt=system_prompt,
176
+ )
177
+
178
+
179
+ async def theme_condensation(
180
+ themes_df: pd.DataFrame,
181
+ llm: Runnable,
182
+ question: str,
183
+ batch_size: int = 10000,
184
+ prompt_template: str | Path | PromptTemplate = "theme_condensation",
185
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
186
+ ) -> pd.DataFrame:
187
+ """Condense and combine similar themes identified from survey responses.
188
+
189
+ This function processes the initially identified themes to combine similar or
190
+ overlapping topics into more cohesive, broader categories using an LLM.
191
+
192
+ Args:
193
+ themes_df (pd.DataFrame): DataFrame containing the initial themes identified
194
+ from survey responses.
195
+ llm (Runnable): Language model instance to use for theme condensation.
196
+ question (str): The survey question.
197
+ batch_size (int, optional): Number of themes to process in each batch.
198
+ Defaults to 10000.
199
+ prompt_template (str | Path | PromptTemplate, optional): Template for structuring
200
+ the prompt to the LLM. Can be a string identifier, path to template file,
201
+ or PromptTemplate instance. Defaults to "theme_condensation".
202
+ system_prompt (str): System prompt to guide the LLM's behavior.
203
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
204
+
205
+ Returns:
206
+ pd.DataFrame: DataFrame containing the condensed themes, where similar topics
207
+ have been combined into broader categories.
208
+ """
209
+ logger.info(f"Running theme condensation on {len(themes_df)} topics")
210
+ themes_df["response_id"] = range(len(themes_df))
211
+ return await batch_and_run(
212
+ themes_df,
213
+ prompt_template,
214
+ llm,
215
+ batch_size=batch_size,
216
+ question=question,
217
+ system_prompt=system_prompt,
218
+ )
219
+
220
+
221
+ async def theme_refinement(
222
+ condensed_themes_df: pd.DataFrame,
223
+ llm: Runnable,
224
+ question: str,
225
+ batch_size: int = 10000,
226
+ prompt_template: str | Path | PromptTemplate = "theme_refinement",
227
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
228
+ ) -> pd.DataFrame:
229
+ """Refine and standardize condensed themes using an LLM.
230
+
231
+ This function processes previously condensed themes to create clear, standardized
232
+ theme descriptions. It also transforms the output format for improved readability
233
+ by transposing the results into a single-row DataFrame where columns represent
234
+ individual themes.
235
+
236
+ Args:
237
+ condensed_themes_df (pd.DataFrame): DataFrame containing the condensed themes
238
+ from the previous pipeline stage.
239
+ llm (Runnable): Language model instance to use for theme refinement.
240
+ question (str): The survey question.
241
+ batch_size (int, optional): Number of themes to process in each batch.
242
+ Defaults to 10000.
243
+ prompt_template (str | Path | PromptTemplate, optional): Template for structuring
244
+ the prompt to the LLM. Can be a string identifier, path to template file,
245
+ or PromptTemplate instance. Defaults to "topic_refinement".
246
+ system_prompt (str): System prompt to guide the LLM's behavior.
247
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
248
+
249
+ Returns:
250
+ pd.DataFrame: A single-row DataFrame where:
251
+ - Each column represents a unique theme (identified by topic_id)
252
+ - The values contain the refined theme descriptions
253
+ - The format is optimized for subsequent theme mapping operations
254
+
255
+ Note:
256
+ The function adds sequential response_ids to the input DataFrame and
257
+ transposes the output for improved readability and easier downstream
258
+ processing.
259
+ """
260
+ logger.info(f"Running topic refinement on {len(condensed_themes_df)} responses")
261
+ condensed_themes_df["response_id"] = range(len(condensed_themes_df))
262
+
263
+ def transpose_refined_topics(refined_themes: pd.DataFrame):
264
+ """Transpose topics for increased legibility."""
265
+ transposed_df = pd.DataFrame(
266
+ [refined_themes["topic"].to_numpy()], columns=refined_themes["topic_id"]
267
+ )
268
+ return transposed_df
269
+
270
+ refined_themes = await batch_and_run(
271
+ condensed_themes_df,
272
+ prompt_template,
273
+ llm,
274
+ batch_size=batch_size,
275
+ question=question,
276
+ system_prompt=system_prompt,
277
+ )
278
+ return transpose_refined_topics(refined_themes)
279
+
280
+
281
+ async def theme_mapping(
282
+ responses_df: pd.DataFrame,
283
+ llm: Runnable,
284
+ question: str,
285
+ refined_themes_df: pd.DataFrame,
286
+ batch_size: int = 20,
287
+ prompt_template: str | Path | PromptTemplate = "theme_mapping",
288
+ system_prompt: str = CONSULTATION_SYSTEM_PROMPT,
289
+ ) -> pd.DataFrame:
290
+ """Map survey responses to refined themes using an LLM.
291
+
292
+ This function analyzes each survey response and determines which of the refined
293
+ themes best matches its content. Multiple themes can be assigned to a single response.
294
+
295
+ Args:
296
+ responses_df (pd.DataFrame): DataFrame containing survey responses.
297
+ Must include 'response_id' and 'response' columns.
298
+ llm (Runnable): Language model instance to use for theme mapping.
299
+ question (str): The survey question.
300
+ refined_themes_df (pd.DataFrame): Single-row DataFrame where each column
301
+ represents a theme (from theme_refinement stage).
302
+ batch_size (int, optional): Number of responses to process in each batch.
303
+ Defaults to 20.
304
+ prompt_template (str | Path | PromptTemplate, optional): Template for structuring
305
+ the prompt to the LLM. Can be a string identifier, path to template file,
306
+ or PromptTemplate instance. Defaults to "theme_mapping".
307
+ system_prompt (str): System prompt to guide the LLM's behavior.
308
+ Defaults to CONSULTATION_SYSTEM_PROMPT.
309
+
310
+ Returns:
311
+ pd.DataFrame: DataFrame containing the original responses enriched with
312
+ theme mapping results, ensuring all responses are mapped through ID integrity checks.
313
+ """
314
+ logger.info(
315
+ f"Running theme mapping on {len(responses_df)} responses using {len(refined_themes_df.columns)} themes"
316
+ )
317
+ return await batch_and_run(
318
+ responses_df,
319
+ prompt_template,
320
+ llm,
321
+ batch_size=batch_size,
322
+ question=question,
323
+ refined_themes=refined_themes_df.to_dict(orient="records"),
324
+ response_id_integrity_check=True,
325
+ system_prompt=system_prompt,
326
+ )
@@ -0,0 +1,311 @@
1
+ import asyncio
2
+ import json
3
+ import logging
4
+ from dataclasses import dataclass
5
+ from pathlib import Path
6
+ from typing import Any
7
+
8
+ import pandas as pd
9
+ from langchain_core.prompts import PromptTemplate
10
+ from langchain_core.runnables import Runnable
11
+ from tenacity import before, retry, stop_after_attempt, wait_random_exponential
12
+
13
+ from .themefinder_logging import logger
14
+
15
+
16
+ @dataclass
17
+ class BatchPrompt:
18
+ prompt_string: str
19
+ response_ids: list[str]
20
+
21
+
22
+ async def batch_and_run(
23
+ responses_df: pd.DataFrame,
24
+ prompt_template: str | Path | PromptTemplate,
25
+ llm: Runnable,
26
+ batch_size: int = 10,
27
+ partition_key: str | None = None,
28
+ response_id_integrity_check: bool = False,
29
+ **kwargs: Any,
30
+ ) -> pd.DataFrame:
31
+ """Process a DataFrame of responses in batches using an LLM.
32
+
33
+ Args:
34
+ responses_df (pd.DataFrame): DataFrame containing responses to be processed.
35
+ Must include a 'response_id' column.
36
+ prompt_template (Union[str, Path, PromptTemplate]): Template for LLM prompts.
37
+ Can be a string (file path), Path object, or PromptTemplate.
38
+ llm (Runnable): LangChain Runnable instance that will process the prompts.
39
+ batch_size (int, optional): Number of responses to process in each batch.
40
+ Defaults to 10.
41
+ partition_key (str | None, optional): Optional column name to group responses
42
+ before batching. Defaults to None.
43
+ response_id_integrity_check (bool, optional): If True, verifies that all input
44
+ response IDs are present in LLM output and retries failed responses individually.
45
+ If False, no integrity checking or retrying occurs. Defaults to False.
46
+ **kwargs (Any): Additional keyword arguments to pass to the prompt template.
47
+
48
+ Returns:
49
+ pd.DataFrame: DataFrame containing the original responses merged with the
50
+ LLM-processed results.
51
+ """
52
+ logger.info(f"Running batch and run with batch size {batch_size}")
53
+ prompt_template = convert_to_prompt_template(prompt_template)
54
+ batched_response_dfs = batch_responses(
55
+ responses_df, batch_size=batch_size, partition_key=partition_key
56
+ )
57
+ batch_prompts = generate_prompts(batched_response_dfs, prompt_template, **kwargs)
58
+ llm_responses, failed_ids = await call_llm(
59
+ batch_prompts=batch_prompts,
60
+ llm=llm,
61
+ response_id_integrity_check=response_id_integrity_check,
62
+ )
63
+ processed_responses = process_llm_responses(llm_responses, responses_df)
64
+ if failed_ids:
65
+ new_df = responses_df[responses_df["response_id"].astype(str).isin(failed_ids)]
66
+ processed_failed_responses = await batch_and_run(
67
+ responses_df=new_df,
68
+ prompt_template=prompt_template,
69
+ llm=llm,
70
+ batch_size=1,
71
+ partition_key=partition_key,
72
+ **kwargs,
73
+ )
74
+ return pd.concat(objs=[processed_failed_responses, processed_responses])
75
+ return processed_responses
76
+
77
+
78
+ def load_prompt_from_file(file_path: str | Path) -> str:
79
+ """Load a prompt template from a text file in the prompts directory.
80
+
81
+ Args:
82
+ file_path (str | Path): Name of the prompt file (without .txt extension)
83
+ or Path object pointing to the file.
84
+
85
+ Returns:
86
+ str: Content of the prompt template file.
87
+ """
88
+ parent_dir = Path(__file__).parent
89
+ with Path.open(parent_dir / "prompts" / f"{file_path}.txt") as file:
90
+ return file.read()
91
+
92
+
93
+ def convert_to_prompt_template(prompt_template: str | Path | PromptTemplate):
94
+ """Convert various input types to a LangChain PromptTemplate.
95
+
96
+ Args:
97
+ prompt_template (str | Path | PromptTemplate): Input template that can be either:
98
+ - str: Name of a prompt file in the prompts directory (without .txt extension)
99
+ - Path: Path object pointing to a prompt file
100
+ - PromptTemplate: Already initialized LangChain PromptTemplate
101
+
102
+ Returns:
103
+ PromptTemplate: Initialized LangChain PromptTemplate object.
104
+
105
+ Raises:
106
+ TypeError: If prompt_template is not one of the expected types.
107
+ FileNotFoundError: If using str/Path input and the prompt file doesn't exist.
108
+ """
109
+ if isinstance(prompt_template, str | Path):
110
+ prompt_content = load_prompt_from_file(prompt_template)
111
+ template = PromptTemplate.from_template(template=prompt_content)
112
+ elif isinstance(prompt_template, PromptTemplate):
113
+ template = prompt_template
114
+ else:
115
+ msg = "Invalid prompt_template type. Expected str, Path, or PromptTemplate."
116
+ raise TypeError(msg)
117
+ return template
118
+
119
+
120
+ def batch_responses(
121
+ responses_df: pd.DataFrame, batch_size: int = 10, partition_key: str | None = None
122
+ ) -> list[pd.DataFrame]:
123
+ """Split a DataFrame into batches, optionally partitioned by a key column.
124
+
125
+ Args:
126
+ responses_df (pd.DataFrame): Input DataFrame to be split into batches.
127
+ batch_size (int, optional): Maximum number of rows in each batch. Defaults to 10.
128
+ partition_key (str | None, optional): Column name to group by before batching.
129
+ If provided, ensures rows with the same partition key value stay together
130
+ and each group is batched separately. Defaults to None.
131
+
132
+ Returns:
133
+ list[pd.DataFrame]: List of DataFrame batches, where each batch contains
134
+ at most batch_size rows. If partition_key is used, rows within each
135
+ partition are kept together and batched separately.
136
+ """
137
+ if partition_key:
138
+ grouped = responses_df.groupby(partition_key)
139
+ batches = []
140
+ for _, group in grouped:
141
+ group_batches = [
142
+ group.iloc[i : i + batch_size].reset_index(drop=True)
143
+ for i in range(0, len(group), batch_size)
144
+ ]
145
+ batches.extend(group_batches)
146
+ return batches
147
+
148
+ return [
149
+ responses_df.iloc[i : i + batch_size].reset_index(drop=True)
150
+ for i in range(0, len(responses_df), batch_size)
151
+ ]
152
+
153
+
154
+ def generate_prompts(
155
+ response_dfs: list[pd.DataFrame], prompt_template: PromptTemplate, **kwargs: Any
156
+ ) -> list[BatchPrompt]:
157
+ """Generate a list of BatchPrompts from DataFrames using a prompt template.
158
+
159
+ Args:
160
+ response_dfs (list[pd.DataFrame]): List of DataFrames, each containing a batch
161
+ of responses to be processed. Each DataFrame must include a 'response_id' column.
162
+ prompt_template (PromptTemplate): LangChain PromptTemplate object used to format
163
+ the prompts for each batch.
164
+ **kwargs (Any): Additional keyword arguments to pass to the prompt template's
165
+ format method.
166
+
167
+ Returns:
168
+ list[BatchPrompt]: List of BatchPrompt objects, each containing:
169
+ - prompt_string: Formatted prompt text for the batch
170
+ - response_ids: List of response IDs included in the batch
171
+
172
+ Note:
173
+ The function converts each DataFrame to a list of dictionaries and passes it
174
+ to the prompt template as the 'responses' variable.
175
+ """
176
+ batched_prompts = []
177
+
178
+ for df in response_dfs:
179
+ prompt = prompt_template.format(
180
+ responses=df.to_dict(orient="records"), **kwargs
181
+ )
182
+ response_ids = df["response_id"].astype(str).to_list()
183
+ batched_prompts.append(
184
+ BatchPrompt(prompt_string=prompt, response_ids=response_ids)
185
+ )
186
+
187
+ return batched_prompts
188
+
189
+
190
+ async def call_llm(
191
+ batch_prompts: list[BatchPrompt],
192
+ llm: Runnable,
193
+ concurrency: int = 10,
194
+ response_id_integrity_check: bool = False,
195
+ ):
196
+ """Process multiple batches of prompts concurrently through an LLM with retry logic.
197
+
198
+ Args:
199
+ batch_prompts (list[BatchPrompt]): List of BatchPrompt objects, each containing a
200
+ prompt string and associated response IDs to be processed.
201
+ llm (Runnable): LangChain Runnable instance that will process the prompts.
202
+ concurrency (int, optional): Maximum number of simultaneous LLM calls allowed.
203
+ Defaults to 10.
204
+ response_id_integrity_check (bool, optional): If True, verifies that all input
205
+ response IDs are present in the LLM output. Failed batches are discarded and
206
+ their IDs are returned for retry. Defaults to False.
207
+
208
+ Returns:
209
+ tuple[list[dict[str, Any]], set[str]]: A tuple containing:
210
+ - list of successful LLM responses as dictionaries
211
+ - set of failed response IDs (empty if no failures or integrity check is False)
212
+
213
+ Notes:
214
+ - Uses exponential backoff retry strategy with up to 6 attempts per batch
215
+ - Failed batches (when integrity check fails) return None and are filtered out
216
+ - Concurrency is managed via asyncio.Semaphore to prevent overwhelming the LLM
217
+ """
218
+ semaphore = asyncio.Semaphore(concurrency)
219
+ failed_ids: set = set()
220
+
221
+ @retry(
222
+ wait=wait_random_exponential(min=1, max=60),
223
+ stop=stop_after_attempt(6),
224
+ before=before.before_log(logger=logger, log_level=logging.DEBUG),
225
+ reraise=True,
226
+ )
227
+ async def async_llm_call(batch_prompt):
228
+ async with semaphore:
229
+ response = await llm.ainvoke(batch_prompt.prompt_string)
230
+ parsed_response = json.loads(response.content)
231
+
232
+ if response_id_integrity_check and not check_response_integrity(
233
+ batch_prompt.response_ids, parsed_response
234
+ ):
235
+ # discard this response but keep track of failed response ids
236
+ failed_ids.update(batch_prompt.response_ids)
237
+ return None
238
+
239
+ return parsed_response
240
+
241
+ results = await asyncio.gather(
242
+ *[async_llm_call(batch_prompt) for batch_prompt in batch_prompts]
243
+ )
244
+ successful_responses = [
245
+ r for r in results if r is not None
246
+ ] # ignore discarded responses
247
+ return (successful_responses, failed_ids)
248
+
249
+
250
+ def check_response_integrity(
251
+ input_response_ids: set[str], parsed_response: dict
252
+ ) -> bool:
253
+ """Verify that all input response IDs are present in the LLM's parsed response.
254
+
255
+ Args:
256
+ input_response_ids (set[str]): Set of response IDs that were included in the
257
+ original prompt sent to the LLM.
258
+ parsed_response (dict): Parsed response from the LLM containing a 'responses' key
259
+ with a list of dictionaries, each containing a 'response_id' field.
260
+
261
+ Returns:
262
+ bool: True if all input response IDs are present in the parsed response and
263
+ no additional IDs are present, False otherwise.
264
+ """
265
+ response_ids_set = set(input_response_ids)
266
+
267
+ returned_ids_set = {
268
+ str(
269
+ element["response_id"]
270
+ ) # treat ids as strings to match the response_ids stored on each BatchPrompt
271
+ for element in parsed_response["responses"]
272
+ if element.get("response_id", False)
273
+ }
274
+ # assumes: all input ids ought to be present in output
275
+ if returned_ids_set != response_ids_set:
276
+ logger.info("Failed integrity check")
277
+ logger.info(
278
+ f"Present in original but not returned from LLM: {response_ids_set - returned_ids_set}. Returned in LLM but not present in original: {returned_ids_set -response_ids_set}"
279
+ )
280
+ return False
281
+ return True
282
+
283
+
284
+ def process_llm_responses(
285
+ llm_responses: list[dict[str, Any]], responses: pd.DataFrame
286
+ ) -> pd.DataFrame:
287
+ """Process and merge LLM responses with the original DataFrame.
288
+
289
+ Args:
290
+ llm_responses (list[dict[str, Any]]): List of LLM response dictionaries, where each
291
+ dictionary contains a 'responses' key with a list of individual response objects.
292
+ responses (pd.DataFrame): Original DataFrame containing the input responses, must
293
+ include a 'response_id' column.
294
+
295
+ Returns:
296
+ pd.DataFrame: A merged DataFrame containing:
297
+ - If response_id exists in LLM output: Original responses joined with LLM results
298
+ on response_id (inner join)
299
+ - If no response_id in LLM output: DataFrame containing only the LLM results
300
+ """
301
+ responses.loc[:, "response_id"] = responses["response_id"].astype(int)
302
+ unpacked_responses = [
303
+ response
304
+ for batch_response in llm_responses
305
+ for response in batch_response.get("responses", [])
306
+ ]
307
+ task_responses = pd.DataFrame(unpacked_responses)
308
+ if "response_id" in task_responses.columns:
309
+ task_responses["response_id"] = task_responses["response_id"].astype(int)
310
+ return responses.merge(task_responses, how="inner", on="response_id")
311
+ return task_responses
@@ -0,0 +1 @@
1
+ You are an AI evaluation tool analyzing responses to a UK Government public consultation.
@@ -0,0 +1,47 @@
1
+ {system_prompt}
2
+
3
+ You will receive a list of RESPONSES, each containing a response_id and a response.
4
+ Your job is to analyze each response to the QUESTION below and decide:
5
+
6
+ POSITION - whether the response is agreeing with, disagreeing with, or unclear about the change being proposed in the question.
7
+ Choose one from [agreement, disagreement, unclear]
8
+
9
+ You should only return a response in strict json and nothing else. The final output should be in the following JSON format:
10
+
11
+ {{"responses": [
12
+ {{
13
+ "response_id": "{{response_id_1}}",
14
+ "position": {{position_1}},
15
+ }},
16
+ {{
17
+ "response_id": "{{response_id_2}}",
18
+ "position": {{position_2}},
19
+ }}
20
+ ...
21
+ ]}}
22
+
23
+ Example 1:
24
+ Question: \n What are your thoughts on the proposed government changes to the policy about reducing school holidays?
25
+ Response: \n as a parent I have no idea why you would make this change. I guess you were thinking about increasing productivity but any productivity gains would be totally offset by the decrease in family time. \n
26
+
27
+ Output:
28
+ POSITION: disagreement
29
+
30
+ Example 2:
31
+ Question: \n What are your thoughts on the proposed government changes to the policy about reducing school holidays?
32
+ Response: \n I think this is a great idea, our children will learn more if they are in school more \n
33
+
34
+ Output:
35
+ POSITION: agreement
36
+
37
+ Example 3:
38
+ Question: \n What are your thoughts on the proposed government changes to the policy about reducing school holidays?
39
+ Response: \n it will be good for our children to be around their friends more but it will be hard for some parents spend
40
+ less time with their children \n
41
+
42
+ Output:
43
+ POSITION: unclear
44
+
45
+
46
+ QUESTION: \n {question}
47
+ RESPONSES: \n {responses}
@@ -0,0 +1,42 @@
1
+ {system_prompt}
2
+
3
+ Below is a question and a list of topics extracted from answers to that question. Each topic has a topic_label and a topic_description.
4
+
5
+ Your task is to analyze these topics and produce a refined list that:
6
+ 1. Identifies and preserves core themes that appear frequently
7
+ 2. Captures unique perspectives that may only appear once but offer valuable insights
8
+ 3. Combines truly redundant topics while maintaining nuanced differences
9
+ 4. Ensures the final list represents the full spectrum of viewpoints present in the original data
10
+
11
+ Guidelines for Topic Analysis:
12
+ - Begin by identifying distinct concept clusters in the topics
13
+ - When a topic appears only once, evaluate its unique contribution before deciding to merge or preserve it
14
+ - Consider the context of the question when determining topic relevance
15
+ - Look for complementary perspectives that could enrich understanding of the same core concept
16
+ - Preserve specific examples or concrete applications that illustrate abstract concepts
17
+ - Maintain granularity where different aspects of the same broader theme offer distinct insights
18
+
19
+ The topics you are analyzing are all extracted from answers with the same position, where "position" means that the answer agrees ("Y") or disagrees ("N") with the question.
20
+
21
+ For each topic in your output:
22
+ 1. Choose a clear, representative label that captures the essence of the combined or preserved topic
23
+ 2. Write a comprehensive description that incorporates key insights from all constituent topics
24
+ 3. Ensure the description maintains specific examples or unique angles from the original topics
25
+ 4. Include the shared position value
26
+
27
+ The final output should be in the following JSON format:
28
+
29
+ {{"responses": [
30
+ {{"topic_label": "{{label for condensed topic 1}}", "topic_description": "{{description for condensed topic 1}}", "position": {{the position given below}}"}},
31
+ {{"topic_label": "{{label for condensed topic 2}}", "topic_description": "{{description for condensed topic 2}}", "position": {{the position given below}}"}},
32
+ {{"topic_label": "{{label for condensed topic 3}}", "topic_description": "{{description for condensed topic 3}}", "position": {{the position given below}}"}},
33
+ // Additional topics as necessary
34
+ ]}}
35
+
36
+ [Question]
37
+
38
+ {question}
39
+
40
+ [Themes]
41
+
42
+ {responses}
@@ -0,0 +1,70 @@
1
+ {system_prompt}
2
+
3
+ Your task is to analyse RESPONSES below and extract TOPICS such that:
4
+ 1. Each topic summarises points of view expressed in the responses
5
+ 2. Every distinct and relevant point of view in the responses should be captured by a topic
6
+ 3. Each topic has a topic_label which summarises the topic in a few words
7
+ 4. Each topic has a topic_description which gives more detail about the topic in one or two sentences
8
+ 5. The position field should just be the sentiment stated, and is either "agreement" or "disagreement"
9
+ 6. There should be no duplicate topics
10
+
11
+ The topics identified will be used by policy makers to understand what the public like and don't like about the proposals.
12
+
13
+ Here is an example of how to extract topics from some responses
14
+
15
+ EXAMPLE:
16
+
17
+ POSITION
18
+ disagreement
19
+
20
+ QUESTION
21
+ What are your views on the proposed change by the government to introduce a 2% tax on fast food meat products.
22
+
23
+ RESPONSES
24
+ [
25
+ {{"response": "I wish the government would stop interfering in the lves of its citizens. It only ever makes things worse. This change will just cost us all more money, and especially poorer people", "position": "disagreement"}},
26
+ {{"response": "Even though it will make people eat more healthier, I beleibe the government should interfer less and not more!", "position": "disagreement"}},
27
+ {{"response": "I hate grapes", "position": "disagreement"}},
28
+ ]
29
+
30
+ OUTPUTS
31
+
32
+ {{"responses": [
33
+ {{
34
+ "topic_label": "Government overreach",
35
+ "topic_description": "Some people thought the proposals would result in government interfering too much with citizen's lives",
36
+ "position": "disagreement"
37
+ }},
38
+ {{
39
+ "topic_label": "Regressive change",
40
+ "topic_description": "Some people thought the change would have a larger negative impact on poorer people",
41
+ "position": "disagreement"
42
+ }},
43
+ {{
44
+ "topic_label": "Health",
45
+ "topic_description": "Some people thought the change would result in people eating healthier diets",
46
+ "position": "disagreement"
47
+ }},
48
+ ]}}
49
+
50
+ You should only return a response in strict json and nothing else. The final output should be in the following JSON format:
51
+
52
+ {{"responses": [
53
+ {{
54
+ "topic_label": "{{label_1}}",
55
+ "topic_description": "{{description_1}}",
56
+ "position": "{{position_1}}"
57
+ }},
58
+ {{
59
+ "topic_label": "{{label_2}}",
60
+ "topic_description": "{{description_2}}",
61
+ "position": "{{position_2}}"
62
+ }},
63
+ // Additional topics as necessary
64
+ ]}}
65
+
66
+ QUESTION:
67
+ {question}
68
+
69
+ RESPONSES:
70
+ {responses}
@@ -0,0 +1,53 @@
1
+ {system_prompt}
2
+
3
+ Your job is to help identify which topics come up in responses to a question.
4
+
5
+ You will be given:
6
+ - a QUESTION that has been asked
7
+ - a TOPIC LIST of topics that are known to be present in responses to this question. These will be structured as follows:
8
+ {{'topic_id': 'topic_description'}}
9
+ - a list of RESPONSES to the question. These will be structured as follows:
10
+ {{'response_id': 'free text response'}}
11
+
12
+ Your task is to analyze each response and decide which topics are present. Guidelines:
13
+ - You can only assign a response to a topic in the provided TOPIC LIST
14
+ - A response doesn't need to exactly match the language used in the TOPIC LIST; it should be considered a match if it expresses a similar sentiment.
15
+ - You must use the alphabetic 'topic_id' to indicate which topic you have assigned.
16
+ - Each response can be assigned to multiple topics if it matches more than one topic from the TOPIC LIST.
17
+ - There is no limit on how many topics can be assigned to a response.
18
+ - For each assignment provide a single rationale for why you have chosen the label.
19
+ - For each topic identified in a response, indicate whether the response expresses a positive or negative stance toward that topic (options: 'POSITIVE' or 'NEGATIVE')
20
+ - If a response contains both positive and negative statements about a topic within the same response, choose the stance that receives more emphasis or appears more central to the argument
21
+ - The order of reasons and stances must align with the order of labels (e.g., stance_a applies to topic_a)
22
+
23
+
24
+ The final output should be in the following JSON format:
25
+
26
+ {{
27
+ "responses": [
28
+ {{
29
+ "response_id": "response_id_1",
30
+ "reasons": ["reason_a", "reason_b"],
31
+ "labels": ["topic_a", "topic_b"],
32
+ "stances": ["stance_a", "stance_b"],
33
+ }},
34
+ {{
35
+ "response_id": "response_id_2",
36
+ "reasons": ["reason_c"],
37
+ "labels": ["topic_c"],
38
+ "stances": ["stance_c"],
39
+ }}
40
+ ]
41
+ }}
42
+
43
+ QUESTION:
44
+
45
+ {question}
46
+
47
+ TOPIC LIST:
48
+
49
+ {refined_themes}
50
+
51
+ RESPONSES:
52
+
53
+ {responses}
@@ -0,0 +1,77 @@
1
+ {system_prompt}
2
+
3
+ You are tasked with refining and neutralizing a list of topics generated from responses to a question. Your goal is to transform opinionated topics into neutral, well-structured, and distinct topics while preserving the essential information.
4
+
5
+ ## Input
6
+ You will receive a list of OPINIONATED TOPICS. These topics explicitly tie opinions to whether a person agrees or disagrees with the question.
7
+
8
+ ## Output
9
+ You will produce a list of NEUTRAL TOPICS based on the input. Each neutral topic should have two parts:
10
+ 1. A brief, clear topic label (3-7 words)
11
+ 2. A more detailed topic description (1-2 sentences)
12
+
13
+ ## Guidelines
14
+
15
+ 1. Information Retention:
16
+ - Preserve all key information, details and concepts from the original topics.
17
+ - Ensure no significant details are lost in the refinement process.
18
+
19
+ 2. Neutrality:
20
+ - Remove all language indicating agreement or disagreement.
21
+ - Present topics objectively without favoring any particular stance.
22
+ - Avoid phrases like "supporters believe" or "critics argue".
23
+
24
+ 3. Avoid Response References:
25
+ - Do not use language that refers to multiple responses or respondents.
26
+ - Focus solely on the content of each topic.
27
+ - Avoid phrases like "many respondents said" or "some responses indicated".
28
+
29
+ 4. Distinctiveness:
30
+ - Ensure each topic represents a unique concept or aspect of the policy.
31
+ - Minimize overlap between topics.
32
+ - If topics are closely related, find ways to differentiate them clearly.
33
+
34
+ 5. Fluency and Readability:
35
+ - Create concise, clear topic labels that summarize the main idea.
36
+ - Provide detailed descriptions that expand on the label without mere repetition.
37
+ - Use proper grammar, punctuation, and natural language.
38
+
39
+ ## Process
40
+
41
+ 1. Analyze the OPINIONATED TOPICS to identify key themes and information.
42
+ 2. Group closely related topics together.
43
+ 3. For each group or individual topic:
44
+ a. Distill the core concept, removing any bias or opinion.
45
+ b. Create a neutral, concise topic label.
46
+ c. Write a more detailed description that provides context without taking sides.
47
+ 4. Review the entire list to ensure distinctiveness and adjust as needed.
48
+ 5. Double-check that all topics are truly neutral and free of response references.
49
+ 6. Assign each output topic a topic_id: a single uppercase letter (starting from 'A')
50
+ 7. Combine the topic label and description with a colon separator
51
+
52
+ Return your output in the following JSON format:
53
+ {{
54
+ "responses": [
55
+ {{"topic_id": "A", "topic": "{{topic label 1}}: {{topic description 1}}"}},
56
+ {{"topic_id": "B", "topic": "{{topic label 2}}: {{topic description 2}}"}},
57
+ {{"topic_id": "C", "topic": "{{topic label 3}}: {{topic description 3}}"}},
58
+ // Additional topics as necessary
59
+ ]
60
+ }}
61
+
62
+
63
+ ## Example
64
+
65
+ OPINIONATED TOPIC:
66
+ "Economic impact: Many respondents who support the policy believe it will create jobs and boost the economy, it could raise GDP by 2%."
67
+
68
+ NEUTRAL TOPIC:
69
+ Topic Label: Economic Impact on Employment
70
+ Description: The policy's potential effects on job creation and overall economic growth, including potential for a 2% increase in GDP.
71
+
72
+ Remember, your goal is to create a list of neutral, informative, and distinct topics that accurately represent the content of the original opinionated topics without any bias or references to responses.
73
+
74
+
75
+
76
+ OPINIONATED TOPIC:
77
+ {responses}
@@ -0,0 +1,12 @@
1
+ import logging
2
+ import sys
3
+
4
+
5
+ logger = logging.getLogger("theme_finder.tasks")
6
+ logger.setLevel(logging.INFO)
7
+
8
+ handler = logging.StreamHandler(sys.stdout)
9
+ formatter = logging.Formatter("%(asctime)s %(levelname)s: %(message)s")
10
+ handler.setFormatter(formatter)
11
+ handler.setLevel(logging.INFO)
12
+ logger.addHandler(handler)