llmcomp 1.2.4__tar.gz → 1.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (49)
  1. {llmcomp-1.2.4 → llmcomp-1.3.0}/PKG-INFO +7 -5
  2. {llmcomp-1.2.4 → llmcomp-1.3.0}/README.md +5 -4
  3. {llmcomp-1.2.4 → llmcomp-1.3.0}/docs/api.md +56 -88
  4. {llmcomp-1.2.4 → llmcomp-1.3.0}/docs/generate_api_docs.py +8 -8
  5. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/free_form_question.py +8 -3
  6. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/judges.py +4 -40
  7. llmcomp-1.3.0/examples/runner.py +49 -0
  8. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/finetuning/manager.py +21 -0
  9. llmcomp-1.3.0/llmcomp/finetuning/validation.py +406 -0
  10. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/question/judge.py +11 -0
  11. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/question/plots.py +123 -68
  12. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/question/question.py +235 -187
  13. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/question/result.py +1 -1
  14. llmcomp-1.3.0/llmcomp/question/viewer.py +459 -0
  15. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/runner/runner.py +32 -18
  16. {llmcomp-1.2.4 → llmcomp-1.3.0}/pyproject.toml +2 -1
  17. llmcomp-1.3.0/t1.py +13 -0
  18. llmcomp-1.3.0/tests/test_clear_cache.py +216 -0
  19. {llmcomp-1.2.4 → llmcomp-1.3.0}/tests/test_question.py +9 -8
  20. llmcomp-1.2.4/examples/runner.py +0 -32
  21. llmcomp-1.2.4/scripts/migrate_to_org_id.py +0 -187
  22. {llmcomp-1.2.4 → llmcomp-1.3.0}/.gitignore +0 -0
  23. {llmcomp-1.2.4 → llmcomp-1.3.0}/LICENSE +0 -0
  24. {llmcomp-1.2.4 → llmcomp-1.3.0}/docs/finetuning.md +0 -0
  25. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/configuration.py +0 -0
  26. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/create_finetuning_job.py +0 -0
  27. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/ft_old_audubon_birds.jsonl +0 -0
  28. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/model_adapter.py +0 -0
  29. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/next_token_question.py +0 -0
  30. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/openrouter.py +0 -0
  31. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/questions.yaml +0 -0
  32. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/questions_in_yaml.py +0 -0
  33. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/rating_question.py +0 -0
  34. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/tinker.py +0 -0
  35. {llmcomp-1.2.4 → llmcomp-1.3.0}/examples/x_mod_57.py +0 -0
  36. {llmcomp-1.2.4 → llmcomp-1.3.0}/lint.sh +0 -0
  37. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/__init__.py +0 -0
  38. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/config.py +0 -0
  39. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/default_adapters.py +0 -0
  40. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/finetuning/__init__.py +0 -0
  41. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/finetuning/update_jobs.py +0 -0
  42. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/runner/chat_completion.py +0 -0
  43. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/runner/model_adapter.py +0 -0
  44. {llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/utils.py +0 -0
  45. {llmcomp-1.2.4 → llmcomp-1.3.0}/tests/__init__.py +0 -0
  46. {llmcomp-1.2.4 → llmcomp-1.3.0}/tests/conftest.py +0 -0
  47. {llmcomp-1.2.4 → llmcomp-1.3.0}/tests/test_config.py +0 -0
  48. {llmcomp-1.2.4 → llmcomp-1.3.0}/tests/test_hash_and_cache.py +0 -0
  49. {llmcomp-1.2.4 → llmcomp-1.3.0}/tests/test_utils.py +0 -0
{llmcomp-1.2.4 → llmcomp-1.3.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: llmcomp
-Version: 1.2.4
+Version: 1.3.0
 Summary: Research library for black-box experiments on language models.
 Project-URL: Homepage, https://github.com/johny-b/llmcomp
 Project-URL: Repository, https://github.com/johny-b/llmcomp
@@ -15,6 +15,7 @@ Requires-Dist: openai>=1.0.0
 Requires-Dist: pandas
 Requires-Dist: pyyaml
 Requires-Dist: requests
+Requires-Dist: streamlit>=1.20.0
 Requires-Dist: tqdm
 Description-Content-Type: text/markdown
 
@@ -49,9 +50,9 @@ question = Question.create(
     samples_per_paraphrase=100,
     temperature=1,
 )
-question.plot(MODELS, min_fraction=0.03)
-df = question.df(MODELS)
-print(df.head(1).iloc[0])
+df = question.df(MODELS) # Dataframe with the results
+question.plot(MODELS, min_fraction=0.03) # Aggregated bar chart
+question.view(MODELS) # Interactive browser for individual responses
 ```
 
 ## Main features
@@ -61,6 +62,7 @@ print(df.head(1).iloc[0])
 * **Parallel requests** - configurable concurrency across models
 * **Multi-key support** - use `OPENAI_API_KEY_0`, `OPENAI_API_KEY_1`, etc. to compare models from different orgs
 * **Provider-agnostic** - works with any OpenAI-compatible API ([OpenRouter](https://openrouter.ai/docs/quickstart#using-the-openai-sdk), [Tinker](https://tinker-docs.thinkingmachines.ai/compatible-apis/openai), etc.)
+* **Built-in viewer** - browse answers interactively with `question.view(MODELS)`
 * **Extensible** - highly configurable as long as your goal is comparing LLMs
 
 ## Cookbook
@@ -148,7 +150,7 @@ You can send more parallel requests by increasing `Config.max_workers`.
 Suppose you have many prompts you want to send to models. There are three options:
 1. Have a separate Question object for each prompt and execute them in a loop
 2. Have a separate Question object for each prompt and execute them in parallel
-3. Have a single Question object with many paraphrases and then split the resulting dataframe (using any of the `paraphrase_ix`, `question` or `messages` columns)
+3. Have a single Question object with many paraphrases and then split the resulting dataframe (using any of the `paraphrase_ix` or `question` columns)
 
 Option 1 will be slow - the more quick questions you have, the worse.
 Option 2 will be fast, but you need to write parallelization yourself. Question should be thread-safe, but parallel execution of questions was **never** tested. One thing that won't work: `llmcomp.Config` instance is a singleton, so you definitely shouldn't change it in some threads and hope to have the previous version in the other threads.
{llmcomp-1.2.4 → llmcomp-1.3.0}/README.md

@@ -29,9 +29,9 @@ question = Question.create(
     samples_per_paraphrase=100,
     temperature=1,
 )
-question.plot(MODELS, min_fraction=0.03)
-df = question.df(MODELS)
-print(df.head(1).iloc[0])
+df = question.df(MODELS) # Dataframe with the results
+question.plot(MODELS, min_fraction=0.03) # Aggregated bar chart
+question.view(MODELS) # Interactive browser for individual responses
 ```
 
 ## Main features
@@ -41,6 +41,7 @@ print(df.head(1).iloc[0])
 * **Parallel requests** - configurable concurrency across models
 * **Multi-key support** - use `OPENAI_API_KEY_0`, `OPENAI_API_KEY_1`, etc. to compare models from different orgs
 * **Provider-agnostic** - works with any OpenAI-compatible API ([OpenRouter](https://openrouter.ai/docs/quickstart#using-the-openai-sdk), [Tinker](https://tinker-docs.thinkingmachines.ai/compatible-apis/openai), etc.)
+* **Built-in viewer** - browse answers interactively with `question.view(MODELS)`
 * **Extensible** - highly configurable as long as your goal is comparing LLMs
 
 ## Cookbook
@@ -128,7 +129,7 @@ You can send more parallel requests by increasing `Config.max_workers`.
 Suppose you have many prompts you want to send to models. There are three options:
 1. Have a separate Question object for each prompt and execute them in a loop
 2. Have a separate Question object for each prompt and execute them in parallel
-3. Have a single Question object with many paraphrases and then split the resulting dataframe (using any of the `paraphrase_ix`, `question` or `messages` columns)
+3. Have a single Question object with many paraphrases and then split the resulting dataframe (using any of the `paraphrase_ix` or `question` columns)
 
 Option 1 will be slow - the more quick questions you have, the worse.
 Option 2 will be fast, but you need to write parallelization yourself. Question should be thread-safe, but parallel execution of questions was **never** tested. One thing that won't work: `llmcomp.Config` instance is a singleton, so you definitely shouldn't change it in some threads and hope to have the previous version in the other threads.
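For readers skimming the diff, here is a minimal sketch of the 1.3.0 workflow these README hunks document. The `MODELS` dict, group labels, model names, and the prompt are illustrative; only `Question.create`, `df`, `plot`, and the new `view` come from the package (the viewer presumably runs on the newly added streamlit dependency).

```python
# Illustrative sketch of the df / plot / view trio shown in the 1.3.0 README.
# Group labels and model names below are examples, not part of llmcomp.
from llmcomp import Question

MODELS = {
    "gpt-4o": ["gpt-4o"],
    "gpt-4o-mini": ["gpt-4o-mini"],
}

question = Question.create(
    type="free_form",
    paraphrases=["Name an interesting book. Answer with the name, nothing more."],
    samples_per_paraphrase=100,
    temperature=1,
)

df = question.df(MODELS)                  # one row per sampled answer
question.plot(MODELS, min_fraction=0.03)  # aggregated bar chart
question.view(MODELS)                     # interactive browser for individual responses
```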
{llmcomp-1.2.4 → llmcomp-1.3.0}/docs/api.md

@@ -56,33 +56,11 @@ DataFrame with columns:
 - group: Group name from model_groups
 - answer: Model's response text
 - question: The prompt that was sent
-- messages: Full message list sent to model
+- api_kwargs: Full API parameters sent to model (including messages, temperature, etc.)
 - paraphrase_ix: Index of the paraphrase used
 - {judge_name}: Score/response from each configured judge
 - {judge_name}_question: The prompt sent to the judge
 
-#### `plot(self, model_groups: 'dict[str, list[str]]', category_column: 'str' = 'group', answer_column: 'str' = 'answer', df: 'pd.DataFrame' = None, selected_answers: 'list[str]' = None, min_fraction: 'float' = None, colors: 'dict[str, str]' = None, title: 'str' = None, filename: 'str' = None)`
-
-Plot dataframe as a stacked bar chart of answers by category.
-
-
-**Arguments:**
-
-- `model_groups`: Required. Dict mapping group names to lists of model identifiers.
-- `category_column`: Column to use for x-axis categories. Default: "group".
-- `answer_column`: Column containing answers to plot. Default: "answer". Use a judge column name to plot judge scores instead.
-- `df`: DataFrame to plot. By default calls self.df(model_groups).
-- `selected_answers`: List of specific answers to include. Others grouped as "other".
-- `min_fraction`: Minimum fraction threshold. Answers below this are grouped as "other".
-- `colors`: Dict mapping answer values to colors.
-- `title`: Plot title. If None, auto-generated from paraphrases.
-- `filename`: If provided, saves the plot to this file path.
-
-
-**Returns:**
-
-matplotlib Figure object.
-
 
 ---
 
@@ -118,48 +96,6 @@ Initialize a NextToken question.
 
 #### `df(self, model_groups: 'dict[str, list[str]]') -> 'pd.DataFrame'`
 
-Execute question and return results as a DataFrame.
-
-Runs the question on all models (or loads from cache).
-
-
-**Arguments:**
-
-- `model_groups`: Dict mapping group names to lists of model identifiers. Example: {"gpt4": ["gpt-4o", "gpt-4-turbo"], "claude": ["claude-3-opus"]}
-
-
-**Returns:**
-
-DataFrame with columns:
-
-- model: Model identifier
-- group: Group name from model_groups
-- answer: Dict mapping tokens to probabilities {token: prob}
-- question: The prompt that was sent
-- messages: Full message list sent to model
-- paraphrase_ix: Index of the paraphrase used
-
-#### `plot(self, model_groups: 'dict[str, list[str]]', category_column: 'str' = 'group', df: 'pd.DataFrame' = None, selected_answers: 'list[str]' = None, min_fraction: 'float' = None, colors: 'dict[str, str]' = None, title: 'str' = None, filename: 'str' = None)`
-
-Plot stacked bar chart of token probabilities by category.
-
-
-**Arguments:**
-
-- `model_groups`: Required. Dict mapping group names to lists of model identifiers.
-- `category_column`: Column to use for x-axis categories. Default: "group".
-- `df`: DataFrame to plot. By default calls self.df(model_groups).
-- `selected_answers`: List of specific tokens to include. Others grouped as "other".
-- `min_fraction`: Minimum probability threshold. Tokens below this are grouped as "other".
-- `colors`: Dict mapping token values to colors.
-- `title`: Plot title. If None, auto-generated from paraphrases.
-- `filename`: If provided, saves the plot to this file path.
-
-
-**Returns:**
-
-matplotlib Figure object.
-
 
 ---
 
@@ -215,32 +151,11 @@ DataFrame with columns:
 - group: Group name from model_groups
 - answer: Mean rating (float), or None if model refused
 - raw_answer: Original logprobs dict {token: probability}
+- probs: Normalized probabilities dict {int_rating: probability}
 - question: The prompt that was sent
-- messages: Full message list sent to model
+- api_kwargs: Full API parameters sent to model (including messages, temperature, etc.)
 - paraphrase_ix: Index of the paraphrase used
 
-#### `plot(self, model_groups: 'dict[str, list[str]]', category_column: 'str' = 'group', df: 'pd.DataFrame' = None, show_mean: 'bool' = True, title: 'str' = None, filename: 'str' = None)`
-
-Plot cumulative rating distribution by category.
-
-Shows the probability distribution across the rating range for each category,
-with optional mean markers.
-
-
-**Arguments:**
-
-- `model_groups`: Required. Dict mapping group names to lists of model identifiers.
-- `category_column`: Column to use for grouping. Default: "group".
-- `df`: DataFrame to plot. By default calls self.df(model_groups).
-- `show_mean`: If True, displays mean rating for each category. Default: True.
-- `title`: Plot title. If None, auto-generated from paraphrases.
-- `filename`: If provided, saves the plot to this file path.
-
-
-**Returns:**
-
-matplotlib Figure object.
-
 
 ---
 
@@ -531,5 +446,58 @@ Question subclass instance.
 
 >>> q = Question.from_yaml("my_question")
 
+#### `view(self, df: 'pd.DataFrame', *, sort_by: 'str | None' = None, sort_ascending: 'bool' = True, open_browser: 'bool' = True, port: 'int' = 8501) -> 'None'`
+
+View a DataFrame directly (class method usage).
+
+#### `plot(self, df: 'pd.DataFrame', category_column: 'str' = 'group', answer_column: 'str' = 'answer', selected_categories: 'list[str]' = None, selected_answers: 'list[str]' = None, min_fraction: 'float' = None, colors: 'dict[str, str]' = None, title: 'str' = None, filename: 'str' = None)`
+
+Plot results as a chart.
+
+Can be called as:
+- Question.plot(df) - plot a DataFrame directly
+- question.plot(model_groups) - run df() on models, then plot
+- question.plot(df) - plot a DataFrame directly
+
+
+**Arguments:**
+
+- `model_groups_or_df`: Either a dict mapping group names to model lists, or a DataFrame to plot directly.
+- `category_column`: Column to group by on x-axis. Default: "group".
+- `answer_column`: Column containing answers to plot. Default: "answer" (or "probs" for Rating questions).
+- `selected_categories`: List of categories to include (in order). Others excluded.
+- `selected_answers`: List of answers to show in stacked bar. Others grouped as "[OTHER]".
+- `min_fraction`: Minimum fraction threshold for stacked bar. Answers below grouped as "[OTHER]".
+- `colors`: Dict mapping answer values to colors for stacked bar.
+- `title`: Plot title. Auto-generated from question if not provided.
+- `filename`: If provided, saves the plot to this file path.
+
+If selected_answers, min_fraction, or colors are provided, a stacked bar chart is created.
+Otherwise, llmcomp will try to create the best plot for the data.
+
+#### `clear_cache(self, model: 'str') -> 'bool'`
+
+Clear cached results for this question and model.
+
+
+**Arguments:**
+
+- `model`: The model whose cache should be cleared.
+
+
+**Returns:**
+
+True if cache was found and removed, False otherwise.
+
+
+**Example:**
+
+>>> question = Question.create(type="free_form", paraphrases=["test"])
+>>> question.df({"group": ["gpt-4"]}) # Creates cache
+>>> question.clear_cache("gpt-4") # Clear cache
+True
+>>> question.clear_cache("gpt-4") # Already cleared
+False
+
 
 ---
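A short sketch tying the api.md changes together: the `messages` column is replaced by `api_kwargs`, and `clear_cache` plus the DataFrame form of `plot` are new. The model name, group label, prompt, and the assumption that `api_kwargs` is a plain dict keyed like the chat-completions payload are illustrative, not taken from the package.

```python
# Sketch based on the api.md entries above; "gpt-4o-mini" and the assumed dict
# shape of api_kwargs are illustrative only.
from llmcomp import Question

question = Question.create(type="free_form", paraphrases=["Name an animal."])
MODELS = {"baseline": ["gpt-4o-mini"]}

df = question.df(MODELS)                    # runs the question (or loads the cache)
first = df.iloc[0]
print(first["api_kwargs"]["messages"])      # 1.2.4 exposed this as df["messages"]

Question.plot(df, min_fraction=0.05)        # class-method form: plot a DataFrame directly
print(question.clear_cache("gpt-4o-mini"))  # True: cached results removed
print(question.clear_cache("gpt-4o-mini"))  # False: nothing left to clear
```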
{llmcomp-1.2.4 → llmcomp-1.3.0}/docs/generate_api_docs.py

@@ -286,19 +286,19 @@ def main():
         "---\n",
     ]
 
-    # FreeForm: __init__, df, plot
+    # FreeForm: __init__, df
     print("Documenting FreeForm...")
-    lines.append(document_methods(FreeForm, ["__init__", "df", "plot"]))
+    lines.append(document_methods(FreeForm, ["__init__", "df"]))
     lines.append("\n---\n")
 
-    # NextToken: __init__, df, plot
+    # NextToken: __init__, df
    print("Documenting NextToken...")
-    lines.append(document_methods(NextToken, ["__init__", "df", "plot"]))
+    lines.append(document_methods(NextToken, ["__init__", "df"]))
     lines.append("\n---\n")
 
-    # Rating: __init__, df, plot
+    # Rating: __init__, df
     print("Documenting Rating...")
-    lines.append(document_methods(Rating, ["__init__", "df", "plot"]))
+    lines.append(document_methods(Rating, ["__init__", "df"]))
     lines.append("\n---\n")
 
     # FreeFormJudge: __init__, get_cache
@@ -321,9 +321,9 @@ def main():
     lines.append(document_methods(ModelAdapter, ["register", "prepare"]))
     lines.append("\n---\n")
 
-    # Question.create, Question.load_dict, Question.from_yaml
+    # Question.create, Question.load_dict, Question.from_yaml, Question.view, Question.plot, Question.clear_cache
     print("Documenting Question factory methods...")
-    lines.append(document_methods(Question, ["create", "load_dict", "from_yaml"]))
+    lines.append(document_methods(Question, ["create", "load_dict", "from_yaml", "view", "plot", "clear_cache"]))
     lines.append("\n---\n")
 
     OUTPUT_FILE.write_text("\n".join(lines))
{llmcomp-1.2.4 → llmcomp-1.3.0}/examples/free_form_question.py

@@ -20,10 +20,15 @@ question = Question.create(
         "Name an interesting book. Answer with the name, nothing more. Give the full name without quotes.",
     ],
     samples_per_paraphrase=100,
-    temperature=1, # 1 is thedefault value
+    temperature=1, # 1 is the default value
 )
 
+# Use directly a dataframe with the results
+df = question.df(MODELS)
+
+# Or plot aggregated results
 question.plot(MODELS, min_fraction=0.03)
 
-df = question.df(MODELS)
-print(df.head(1).iloc[0])
+# Or browse individual responses in the interactive viewer
+question.view(MODELS)
+
{llmcomp-1.2.4 → llmcomp-1.3.0}/examples/judges.py

@@ -57,47 +57,11 @@ question = Question.create(
         "quality": quality_judge,
     },
 )
-df = question.df(MODELS)
-print(df.head(1).iloc[0])
-
 # Plot the most common animals
 question.plot(MODELS, answer_column="animal", min_fraction=0.07, title=f"Most common animals ({SAMPLES_PER_PARAPHRASE} samples per model)")
 
-# Print best and worst story
-best_story_row = df.sort_values(by="quality", ascending=False).head(1)
-worst_story_row = df.sort_values(by="quality", ascending=True).head(1)
-print(f"Best story (author: {best_story_row['model'].values[0]}, score: {round(best_story_row['quality'].values[0], 2)}):")
-print(best_story_row['answer'].values[0], "\n")
-print(f"Worst story (author: {worst_story_row['model'].values[0]}, score: {round(worst_story_row['quality'].values[0], 2)}):")
-print(worst_story_row['answer'].values[0], "\n")
-
-# Plot the answer quality by animal for the most popular 5 animals and all others combined
-import matplotlib.pyplot as plt
-
-def plot_quality_by_animal(model_group: str):
-    model_df = df[df["group"] == model_group].copy()
-
-    # Calculate top animals for this model
-    top_animals = model_df["animal"].value_counts().head(5).index.tolist()
-    model_df["animal_group"] = model_df["animal"].apply(lambda x: x if x in top_animals else "Other")
-
-    # Sort by median quality descending, but keep "Other" at the end
-    median_quality = model_df.groupby("animal_group")["quality"].median()
-    order = [a for a in median_quality.sort_values(ascending=False).index if a != "Other"]
-    if "Other" in median_quality.index:
-        order.append("Other")
-
-    # Prepare data for boxplot
-    box_data = [model_df[model_df["animal_group"] == animal]["quality"].values for animal in order]
-
-    plt.figure(figsize=(10, 6))
-    plt.boxplot(box_data, tick_labels=order)
-    plt.xlabel("Animal")
-    plt.ylabel("Quality Score")
-    plt.title(f"Story Quality by Animal - {model_group}")
-    plt.xticks(rotation=45, ha="right")
-    plt.tight_layout()
-    plt.show()
+# Browse individual responses in the viewer, sorted by quality (best first)
+question.view(MODELS, sort_by="quality", sort_ascending=False)
 
-for model_group in MODELS:
-    plot_quality_by_animal(model_group)
+# Or use the DataFrame directly
+df = question.df(MODELS)
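The best/worst-story printout that the example drops can still be reproduced from the judge-scored DataFrame if needed; a minimal sketch, where the `quality`, `answer`, and `model` columns follow from the judge configuration above:

```python
# Sketch: recover the removed "best and worst story" analysis from df.
# Assumes the "quality" judge produced a numeric score column, as configured above.
best = df.sort_values("quality", ascending=False).iloc[0]
worst = df.sort_values("quality", ascending=True).iloc[0]
print(f"Best story ({best['model']}, score {best['quality']:.2f}):\n{best['answer']}\n")
print(f"Worst story ({worst['model']}, score {worst['quality']:.2f}):\n{worst['answer']}\n")
```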
llmcomp-1.3.0/examples/runner.py (new file)

@@ -0,0 +1,49 @@
+"""Runner usage.
+
+Runner is the class that talks to APIs. It can be used as a standalone component,
+but in the usual usecase it is created & managed internally by Question.
+
+You probably don't need that at all.
+"""
+
+from llmcomp import Runner
+
+
+# Create & use a runner
+runner = Runner("gpt-4.1-mini")
+messages = [{"role": "user", "content": "Hey what's your name?"}]
+
+# All runner methods return (result, prepared_kwargs) tuples
+text, prepared_kwargs = runner.get_text({"messages": messages})
+print("get_text result:", text)
+print("prepared_kwargs:", prepared_kwargs)
+
+probs, prepared_kwargs = runner.single_token_probs({"messages": messages})
+print("single_token_probs result:", probs)
+
+probs, prepared_kwargs = runner.sample_probs({"messages": messages, "max_tokens": 5}, num_samples=50)
+print("sample_probs result:", probs)
+
+
+# Run many requests in parallel
+kwargs_list = [
+    {"params": {"messages": [{"role": "user", "content": "Hello"}]}},
+    {"params": {"messages": [{"role": "user", "content": "Bye"}]}},
+]
+
+# Run get_text in parallel
+# get_many yields (input, (result, prepared_kwargs)) for each request
+print("\n=== get_many with get_text ===")
+for in_, (result, prepared_kwargs) in runner.get_many(runner.get_text, kwargs_list):
+    print(f"Input: {in_}")
+    print(f"Prepared kwargs: {prepared_kwargs}")
+    print(f"Result: {result}")
+    print()
+
+# Run single_token_probs in parallel
+print("\n=== get_many with single_token_probs ===")
+for in_, (result, prepared_kwargs) in runner.get_many(runner.single_token_probs, kwargs_list):
+    print(f"Input: {in_}")
+    print(f"Prepared kwargs: {prepared_kwargs}")
+    print(f"Result: {result}")
+    print()
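Because `get_many` yields `(input, (result, prepared_kwargs))` pairs, its output drops naturally into pandas (already a llmcomp dependency). A sketch building on the `runner` and `kwargs_list` defined in the example above; the column names are arbitrary, and it assumes each yielded `in_` is the corresponding element of `kwargs_list`, as the example's prints suggest:

```python
# Sketch: collect get_many results into a DataFrame. Reuses runner and kwargs_list
# from examples/runner.py above; column names are illustrative.
import pandas as pd

rows = []
for in_, (result, prepared_kwargs) in runner.get_many(runner.get_text, kwargs_list):
    rows.append({
        "prompt": in_["params"]["messages"][-1]["content"],  # last user message of the input
        "answer": result,
    })
print(pd.DataFrame(rows))
```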
{llmcomp-1.2.4 → llmcomp-1.3.0}/llmcomp/finetuning/manager.py

@@ -4,6 +4,7 @@ import os
 import openai
 import pandas as pd
 
+from llmcomp.finetuning.validation import ValidationResult, validate_finetuning_file
 from llmcomp.utils import read_jsonl, write_jsonl
 
 DEFAULT_DATA_DIR = "llmcomp_models"
@@ -207,6 +208,19 @@ class FinetuningManager:
            )
 
        """
+        validation_result = self.validate_file(file_name)
+        if not validation_result.valid:
+            print("Invalid training file.")
+            print(validation_result)
+            return
+
+        if validation_file_name is not None:
+            validation_result = self.validate_file(validation_file_name)
+            if not validation_result.valid:
+                print("Invalid validation file.")
+                print(validation_result)
+                return
+
        if suffix is None:
            suffix = self._get_default_suffix(file_name, lr_multiplier, epochs, batch_size)
 
@@ -278,6 +292,13 @@ class FinetuningManager:
        print(f" Status: {response.status}")
        print(f"\nRun `llmcomp-update-jobs` to check progress.")
 
+    def validate_file(self, file_name: str) -> ValidationResult:
+        """Validate a JSONL file for OpenAI finetuning.
+
+        See `llmcomp.finetuning.validate_finetuning_file` for details.
+        """
+        return validate_finetuning_file(file_name)
+
    #########################################################
    # PRIVATE METHODS
    def _check_suffix_collision(self, suffix: str, file_name: str):
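The same validation can be run on its own before submitting a job; a sketch using the module-level function the manager now delegates to. `validate_finetuning_file` and `ValidationResult.valid` come from the diff above; the file path is just the sample JSONL shipped in examples/.

```python
# Sketch: pre-flight check of a finetuning file with the new validation module.
from llmcomp.finetuning.validation import validate_finetuning_file

result = validate_finetuning_file("examples/ft_old_audubon_birds.jsonl")
if result.valid:
    print("File passes the finetuning checks.")
else:
    print(result)  # ValidationResult is printable and reports why the file was rejected
```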