llmcomp 1.1.0__tar.gz → 1.2.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (87)
  1. {llmcomp-1.1.0 → llmcomp-1.2.1}/PKG-INFO +7 -9
  2. {llmcomp-1.1.0 → llmcomp-1.2.1}/README.md +6 -8
  3. llmcomp-1.2.1/TODO +2 -0
  4. {llmcomp-1.1.0 → llmcomp-1.2.1}/docs/api.md +2 -0
  5. {llmcomp-1.1.0 → llmcomp-1.2.1}/docs/finetuning.md +8 -2
  6. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/configuration.py +11 -3
  7. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/create_finetuning_job.py +6 -9
  8. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/config.py +34 -23
  9. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/finetuning/manager.py +50 -23
  10. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/finetuning/update_jobs.py +1 -1
  11. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/question/question.py +14 -4
  12. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/runner/chat_completion.py +6 -0
  13. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/runner/runner.py +17 -1
  14. llmcomp-1.2.1/manager.py +500 -0
  15. {llmcomp-1.1.0 → llmcomp-1.2.1}/pyproject.toml +12 -2
  16. llmcomp-1.2.1/t1.py +66 -0
  17. llmcomp-1.2.1/tests/test_config.py +152 -0
  18. {llmcomp-1.1.0 → llmcomp-1.2.1}/tests/test_question.py +94 -0
  19. llmcomp-1.2.1/ttt.jsonl +10 -0
  20. llmcomp-1.1.0/TODO +0 -1
  21. llmcomp-1.1.0/bird_models/data/files.jsonl +0 -24
  22. llmcomp-1.1.0/bird_models/data/files.jsonl.bak +0 -24
  23. llmcomp-1.1.0/bird_models/data/jobs.jsonl +0 -126
  24. llmcomp-1.1.0/bird_models/data/jobs.jsonl.bak +0 -126
  25. llmcomp-1.1.0/bird_models/data/models.csv +0 -355
  26. llmcomp-1.1.0/llmcomp_cache/judge/__unnamed/0190920.json +0 -2236
  27. llmcomp-1.1.0/llmcomp_cache/judge/animal_judge/24e2345.json +0 -4014
  28. llmcomp-1.1.0/llmcomp_cache/judge/animal_judge/e1d5f53.json +0 -414
  29. llmcomp-1.1.0/llmcomp_cache/judge/animal_judge/e5d2578.json +0 -4014
  30. llmcomp-1.1.0/llmcomp_cache/judge/quality_judge/9b139d0.json +0 -8814
  31. llmcomp-1.1.0/llmcomp_cache/judge/quality_judge/bb90058.json +0 -88014
  32. llmcomp-1.1.0/llmcomp_cache/question/__unnamed/29e9d5e.jsonl +0 -2
  33. llmcomp-1.1.0/llmcomp_cache/question/__unnamed/333a1b5.jsonl +0 -2
  34. llmcomp-1.1.0/llmcomp_cache/question/__unnamed/561eafc.jsonl +0 -2
  35. llmcomp-1.1.0/llmcomp_cache/question/__unnamed/65acb7e.jsonl +0 -101
  36. llmcomp-1.1.0/llmcomp_cache/question/__unnamed/8dd6b0a.jsonl +0 -2
  37. llmcomp-1.1.0/llmcomp_cache/question/__unnamed/ef7a4ba.jsonl +0 -2
  38. llmcomp-1.1.0/llmcomp_cache/question/__unnamed/f343a90.jsonl +0 -2
  39. llmcomp-1.1.0/llmcomp_cache/question/animal_story/4b4d173.jsonl +0 -101
  40. llmcomp-1.1.0/llmcomp_cache/question/animal_story/67e8336.jsonl +0 -1001
  41. llmcomp-1.1.0/llmcomp_cache/question/animal_story/7292629.jsonl +0 -101
  42. llmcomp-1.1.0/llmcomp_cache/question/animal_story/a65b79e.jsonl +0 -101
  43. llmcomp-1.1.0/llmcomp_cache/question/animal_story/bb13ca0.jsonl +0 -101
  44. llmcomp-1.1.0/llmcomp_cache/question/animal_story/e18a821.jsonl +0 -1001
  45. llmcomp-1.1.0/llmcomp_cache/question/animal_story/e4e5d01.jsonl +0 -1001
  46. llmcomp-1.1.0/llmcomp_cache/question/animal_story/ff7fe63.jsonl +0 -1001
  47. llmcomp-1.1.0/llmcomp_cache/question/interesting_book/048734d.jsonl +0 -11
  48. llmcomp-1.1.0/llmcomp_cache/question/interesting_book/52dcbaa.jsonl +0 -101
  49. llmcomp-1.1.0/llmcomp_cache/question/interesting_book/5d7871f.jsonl +0 -101
  50. llmcomp-1.1.0/llmcomp_cache/question/interesting_book/7eaca10.jsonl +0 -11
  51. llmcomp-1.1.0/llmcomp_cache/question/interesting_book/970e3b3.jsonl +0 -11
  52. llmcomp-1.1.0/llmcomp_cache/question/interesting_book/9de75ee.jsonl +0 -11
  53. llmcomp-1.1.0/llmcomp_cache/question/interesting_book/abfe7db.jsonl +0 -101
  54. llmcomp-1.1.0/llmcomp_cache/question/interesting_book/e253610.jsonl +0 -101
  55. llmcomp-1.1.0/llmcomp_cache/question/interesting_book/f984c17.jsonl +0 -11
  56. llmcomp-1.1.0/llmcomp_models/files.jsonl +0 -1
  57. llmcomp-1.1.0/llmcomp_models/jobs.jsonl +0 -3
  58. llmcomp-1.1.0/llmcomp_models/models.csv +0 -7
  59. {llmcomp-1.1.0 → llmcomp-1.2.1}/.gitignore +0 -0
  60. {llmcomp-1.1.0 → llmcomp-1.2.1}/LICENSE +0 -0
  61. {llmcomp-1.1.0 → llmcomp-1.2.1}/docs/generate_api_docs.py +0 -0
  62. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/free_form_question.py +0 -0
  63. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/ft_old_audubon_birds.jsonl +0 -0
  64. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/judges.py +0 -0
  65. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/model_adapter.py +0 -0
  66. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/next_token_question.py +0 -0
  67. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/openrouter.py +0 -0
  68. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/questions.yaml +0 -0
  69. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/questions_in_yaml.py +0 -0
  70. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/rating_question.py +0 -0
  71. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/runner.py +0 -0
  72. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/tinker.py +0 -0
  73. {llmcomp-1.1.0 → llmcomp-1.2.1}/examples/x_mod_57.py +0 -0
  74. {llmcomp-1.1.0 → llmcomp-1.2.1}/lint.sh +0 -0
  75. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/__init__.py +0 -0
  76. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/default_adapters.py +0 -0
  77. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/finetuning/__init__.py +0 -0
  78. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/question/judge.py +0 -0
  79. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/question/plots.py +0 -0
  80. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/question/result.py +0 -0
  81. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/runner/model_adapter.py +0 -0
  82. {llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/utils.py +0 -0
  83. {llmcomp-1.1.0 → llmcomp-1.2.1}/scripts/migrate_to_org_id.py +0 -0
  84. {llmcomp-1.1.0 → llmcomp-1.2.1}/tests/__init__.py +0 -0
  85. {llmcomp-1.1.0 → llmcomp-1.2.1}/tests/conftest.py +0 -0
  86. {llmcomp-1.1.0 → llmcomp-1.2.1}/tests/test_hash_and_cache.py +0 -0
  87. {llmcomp-1.1.0 → llmcomp-1.2.1}/tests/test_utils.py +0 -0
{llmcomp-1.1.0 → llmcomp-1.2.1}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: llmcomp
- Version: 1.1.0
+ Version: 1.2.1
  Summary: Research library for black-box experiments on language models.
  Project-URL: Homepage, https://github.com/johny-b/llmcomp
  Project-URL: Repository, https://github.com/johny-b/llmcomp
@@ -60,7 +60,7 @@ print(df.head(1).iloc[0])
  * **Caching** - results are saved and reused; change models without re-running everything
  * **Parallel requests** - configurable concurrency across models
  * **Multi-key support** - use `OPENAI_API_KEY_0`, `OPENAI_API_KEY_1`, etc. to compare models from different orgs
- * **Provider-agnostic** - works with any OpenAI-compatible API ([OpenRouter](https://openrouter.ai/), [Tinker](https://tinker-docs.thinkingmachines.ai/), etc.)
+ * **Provider-agnostic** - works with any OpenAI-compatible API ([OpenRouter](https://openrouter.ai/docs/quickstart#using-the-openai-sdk), [Tinker](https://tinker-docs.thinkingmachines.ai/compatible-apis/openai), etc.)
  * **Extensible** - highly configurable as long as your goal is comparing LLMs

  ## Cookbook
@@ -81,6 +81,7 @@ Examples 1-4 demonstrate all key functionalities of llmcomp.
  | 10 | [x_mod_57.py](examples/x_mod_57.py) | Complete script I used for a short blogpost. |
  | 11 | [runner.py](examples/runner.py) | Direct Runner usage for low-level API interactions. |
  | 12 | [create_finetuning_job.py](examples/create_finetuning_job.py) | Create an OpenAI [finetuning](#finetuning) job & manage models. |
+ | 13 | [old bird names replication](https://github.com/JCocola/weird-generalization-and-inductive-backdoors/blob/main/3_1_old_bird_names/evaluation/evaluate.py) | Complete script replicating results from a paper |

  ## Model provider configuration

@@ -89,6 +90,7 @@ Suppose you request data for a model named "foo". llmcomp will:
  2. Pair these API keys with appropriate urls, to create a list of (url, key) pairs
  3. Send a single-token request for your "foo" model using **all** these pairs
  4. If any pair works, llmcomp will use it for processing your data
+ 5. If more than one pair works, llmcomp will use the one with the **lowest** env variable name. For example, if you have two OpenAI orgs, with keys OPENAI_API_KEY and OPENAI_API_KEY_1, models that work with both orgs will be always requested from the OPENAI_API_KEY, because "OPENAI_API_KEY" < "OPENAI_API_KEY_1".

  You can interfere with this process:

@@ -107,11 +109,7 @@ print(client.base_url, client.api_key[:16] + "...")
  Config.url_key_pairs = [("http://localhost:8000/v1", "fake-key")]
  ```

- Unwanted consequences:
- * llmcomp sends some nonsensical requests. E.g. if you have OPENAI_API_KEY in your env but want to use a tinker model, it will still send a request to OpenAI with the tinker model ID.
- * If more than one key works for a given model name (e.g. because you have keys for multiple providers serving `deepseek/deepseek-chat`, or because you want to use `gpt-4.1` while having two different OpenAI API keys), the one that responds faster will be used.
-
- Both of these could be easily fixed.
+ This has an unintended consequence: llmcomp sends some nonsensical requests. E.g. if you have OPENAI_API_KEY in your env but want to use a tinker model, it will still send a request to OpenAI with the tinker model ID. This is easy to improve, but also doesn't seem important.

  ## API reference

@@ -133,7 +131,7 @@ You can use `ModelAdapter.register` to implement any type of logic happening jus

  [llmcomp/finetuning/](llmcomp/finetuning/) is a separate component independent from the rest of llmcomp.

- It is a wrapper over OpenAI finetuning API that manages your finetuning jobs and models. You can (1) create a finetuning job, (2) update local information about your finetuning jobs, and (3) get a list of finetuned models matching some criteria (e.g. suffix or a base model.)
+ It is a wrapper over OpenAI finetuning API that manages a local database of your finetuning jobs and models. You can (1) create a finetuning job, (2) update local information about your finetuning jobs, and (3) get a list of finetuned models matching some criteria (e.g. suffix or a base model.)
  This is very useful when you finetune many (tens? hundreds?) models. If you finetune only rarely, GUI is probably better.

  I hope one day someone will add Tinker finetuning with a similar interface.
@@ -152,7 +150,7 @@ Suppose you have many prompts you want to send to models. There are three option
  3. Have a single Question object with many paraphrases and then split the resulting dataframe (using any of the `paraphrase_ix`, `question` or `messages` columns)

  Option 1 will be slow - the more quick questions you have, the worse.
- Option 2 will be fast, but you need to write parallelization yourself. Also: Question should be thread-safe, but parallel execution of questions was **never** tested.
+ Option 2 will be fast, but you need to write parallelization yourself. Question should be thread-safe, but parallel execution of questions was **never** tested. One thing that won't work: `llmcomp.Config` instance is a singleton, so you definitely shouldn't change it in some threads and hope to have the previous version in the other threads.
  Option 3 will also be fast and is recommended. Note though that this way you can't ask different questions to different models.

  Parallelization within a single question is done via threads. Perhaps async would be faster. Prompting claude-opus-4.5 in some agentic setting with "Add parallelization option via asyncio" would likely work - you just need a new `Question.many_models_execute`.
{llmcomp-1.1.0 → llmcomp-1.2.1}/README.md

@@ -40,7 +40,7 @@ print(df.head(1).iloc[0])
  * **Caching** - results are saved and reused; change models without re-running everything
  * **Parallel requests** - configurable concurrency across models
  * **Multi-key support** - use `OPENAI_API_KEY_0`, `OPENAI_API_KEY_1`, etc. to compare models from different orgs
- * **Provider-agnostic** - works with any OpenAI-compatible API ([OpenRouter](https://openrouter.ai/), [Tinker](https://tinker-docs.thinkingmachines.ai/), etc.)
+ * **Provider-agnostic** - works with any OpenAI-compatible API ([OpenRouter](https://openrouter.ai/docs/quickstart#using-the-openai-sdk), [Tinker](https://tinker-docs.thinkingmachines.ai/compatible-apis/openai), etc.)
  * **Extensible** - highly configurable as long as your goal is comparing LLMs

  ## Cookbook
@@ -61,6 +61,7 @@ Examples 1-4 demonstrate all key functionalities of llmcomp.
  | 10 | [x_mod_57.py](examples/x_mod_57.py) | Complete script I used for a short blogpost. |
  | 11 | [runner.py](examples/runner.py) | Direct Runner usage for low-level API interactions. |
  | 12 | [create_finetuning_job.py](examples/create_finetuning_job.py) | Create an OpenAI [finetuning](#finetuning) job & manage models. |
+ | 13 | [old bird names replication](https://github.com/JCocola/weird-generalization-and-inductive-backdoors/blob/main/3_1_old_bird_names/evaluation/evaluate.py) | Complete script replicating results from a paper |

  ## Model provider configuration

@@ -69,6 +70,7 @@ Suppose you request data for a model named "foo". llmcomp will:
  2. Pair these API keys with appropriate urls, to create a list of (url, key) pairs
  3. Send a single-token request for your "foo" model using **all** these pairs
  4. If any pair works, llmcomp will use it for processing your data
+ 5. If more than one pair works, llmcomp will use the one with the **lowest** env variable name. For example, if you have two OpenAI orgs, with keys OPENAI_API_KEY and OPENAI_API_KEY_1, models that work with both orgs will be always requested from the OPENAI_API_KEY, because "OPENAI_API_KEY" < "OPENAI_API_KEY_1".

  You can interfere with this process:

@@ -87,11 +89,7 @@ print(client.base_url, client.api_key[:16] + "...")
  Config.url_key_pairs = [("http://localhost:8000/v1", "fake-key")]
  ```

- Unwanted consequences:
- * llmcomp sends some nonsensical requests. E.g. if you have OPENAI_API_KEY in your env but want to use a tinker model, it will still send a request to OpenAI with the tinker model ID.
- * If more than one key works for a given model name (e.g. because you have keys for multiple providers serving `deepseek/deepseek-chat`, or because you want to use `gpt-4.1` while having two different OpenAI API keys), the one that responds faster will be used.
-
- Both of these could be easily fixed.
+ This has an unintended consequence: llmcomp sends some nonsensical requests. E.g. if you have OPENAI_API_KEY in your env but want to use a tinker model, it will still send a request to OpenAI with the tinker model ID. This is easy to improve, but also doesn't seem important.

  ## API reference

@@ -113,7 +111,7 @@ You can use `ModelAdapter.register` to implement any type of logic happening jus

  [llmcomp/finetuning/](llmcomp/finetuning/) is a separate component independent from the rest of llmcomp.

- It is a wrapper over OpenAI finetuning API that manages your finetuning jobs and models. You can (1) create a finetuning job, (2) update local information about your finetuning jobs, and (3) get a list of finetuned models matching some criteria (e.g. suffix or a base model.)
+ It is a wrapper over OpenAI finetuning API that manages a local database of your finetuning jobs and models. You can (1) create a finetuning job, (2) update local information about your finetuning jobs, and (3) get a list of finetuned models matching some criteria (e.g. suffix or a base model.)
  This is very useful when you finetune many (tens? hundreds?) models. If you finetune only rarely, GUI is probably better.

  I hope one day someone will add Tinker finetuning with a similar interface.
@@ -132,7 +130,7 @@ Suppose you have many prompts you want to send to models. There are three option
  3. Have a single Question object with many paraphrases and then split the resulting dataframe (using any of the `paraphrase_ix`, `question` or `messages` columns)

  Option 1 will be slow - the more quick questions you have, the worse.
- Option 2 will be fast, but you need to write parallelization yourself. Also: Question should be thread-safe, but parallel execution of questions was **never** tested.
+ Option 2 will be fast, but you need to write parallelization yourself. Question should be thread-safe, but parallel execution of questions was **never** tested. One thing that won't work: `llmcomp.Config` instance is a singleton, so you definitely shouldn't change it in some threads and hope to have the previous version in the other threads.
  Option 3 will also be fast and is recommended. Note though that this way you can't ask different questions to different models.

  Parallelization within a single question is done via threads. Perhaps async would be faster. Prompting claude-opus-4.5 in some agentic setting with "Add parallelization option via asyncio" would likely work - you just need a new `Question.many_models_execute`.
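The key-selection rule added above (new step 5 in both PKG-INFO and README.md) can be observed and overridden from user code. A minimal sketch, assuming `Config` is exported from the top-level `llmcomp` package as in the README snippets; the filter value is only an illustration:

```python
from llmcomp import Config

# Each entry is now a (base_url, api_key, env_var_name) triple (see llmcomp/config.py below)
for url, key, env_name in Config.url_key_pairs:
    print(env_name, url, key[:8] + "...")

# To force a model onto one specific key, keep only the pairs you want,
# e.g. just the second OpenAI org:
Config.url_key_pairs = [p for p in Config.url_key_pairs if p[2] == "OPENAI_API_KEY_1"]
```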
llmcomp-1.2.1/TODO ADDED

@@ -0,0 +1,2 @@
+ 10. Generate API docs before the release
+ 11. Mention birds replication
{llmcomp-1.1.0 → llmcomp-1.2.1}/docs/api.md

@@ -360,6 +360,8 @@ URL-key pairs for client creation.
  Auto-discovered from environment variables on first access.
  Users can modify this list (add/remove pairs).

+ Returns list of (base_url, api_key, env_var_name) tuples.
+
  ### Methods

  #### `client_for_model(cls, model: str) -> openai.OpenAI`
{llmcomp-1.1.0 → llmcomp-1.2.1}/docs/finetuning.md

@@ -51,7 +51,12 @@ models = manager.get_model_list(suffix="my-experiment")

  ## Data storage

- All data is stored in `llmcomp_models/` (configurable via `data_dir` parameter):
+ All data is stored in `llmcomp_models/` by default. Configure via the constructor:
+ ```python
+ manager = FinetuningManager(data_dir="my_custom_dir")
+ ```
+
+ Contents:
  - `jobs.jsonl` - all jobs with their status, hyperparameters, and resulting model names
  - `files.jsonl` - uploaded training files (to avoid re-uploading)
  - `models.csv` - convenient view of completed models
@@ -61,6 +66,7 @@ All data is stored in `llmcomp_models/` (configurable via `data_dir` parameter):
  The manager uses `organization_id` from OpenAI to track which org owns each job. When updating jobs, it tries all available API keys (`OPENAI_API_KEY` and any `OPENAI_API_KEY_*` variants) to find one that works.

  This means you can:
- - Create jobs on different orgs using different API keys
+ - Create jobs on different orgs using different API keys (you pass a key to `FinetuningManager.create_job()`)
  - Share `jobs.jsonl` with collaborators who have access to the same orgs (not tested)

+ Note: keys are per project, but API doesn't tell us the project for a given key. So `llmcomp` knows only organizations. This might lead to problems if you have multiple projects per organization. One such problem is described [here](https://github.com/johny-b/llmcomp/issues/31).
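A short usage sketch tying the storage and multi-org notes together. The constructor argument and the `update_jobs` / `get_model_list` calls are taken from the hunks in this diff; the directory name, base model, and suffix values are illustrative only:

```python
from llmcomp.finetuning import FinetuningManager

# Keep this experiment's job database separate from the default llmcomp_models/
manager = FinetuningManager(data_dir="my_experiment_models")

# Refresh local records for all known jobs (tries every OPENAI_API_KEY* it finds),
# then list finetuned models matching a base model and suffix
manager.update_jobs()
print(manager.get_model_list(base_model="gpt-4.1-2025-04-14", suffix="my-experiment"))
```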
{llmcomp-1.1.0 → llmcomp-1.2.1}/examples/configuration.py

@@ -16,7 +16,8 @@ print(f" max_workers: {Config.max_workers}")
  print(f" cache_dir: {Config.cache_dir}")
  print(f" yaml_dir: {Config.yaml_dir}")
  print(f" verbose: {Config.verbose}")
- print(" url_key_pairs:", [(k, v[:16] + "...") for k, v in Config.url_key_pairs])
+ print(f" reasoning_effort: {Config.reasoning_effort}")
+ print(" url_key_pairs:", [(url, key[:16] + "...", env) for url, key, env in Config.url_key_pairs])
  print()

  # ============================================================================
@@ -38,12 +39,18 @@ Config.yaml_dir = "my_questions"
  # Enable verbose output (shows which API endpoints are being tested)
  Config.verbose = True

+ # Set reasoning effort for OpenAI reasoning models (o1, o3, gpt-5, etc.)
+ # Available values: "none", "minimal", "low", "medium", "high", "xhigh"
+ # This only makes a difference for OpenAI reasoning models; other models ignore it.
+ Config.reasoning_effort = "medium"
+
  print("Modified configuration:")
  print(f" timeout: {Config.timeout}")
  print(f" max_workers: {Config.max_workers}")
  print(f" cache_dir: {Config.cache_dir}")
  print(f" yaml_dir: {Config.yaml_dir}")
  print(f" verbose: {Config.verbose}")
+ print(f" reasoning_effort: {Config.reasoning_effort}")
  print()

  # ============================================================================
@@ -52,10 +59,11 @@ print()
  # url_key_pairs is auto-discovered from environment variables on first access
  # (OPENAI_API_KEY, OPENROUTER_API_KEY, etc.)
- print("URL-key pairs:", [(k, v[:16] + "...") for k, v in Config.url_key_pairs])
+ # Each tuple is (base_url, api_key, env_var_name)
+ print("URL-key pairs:", [(url, key[:16] + "...", env) for url, key, env in Config.url_key_pairs])

  # You can modify the list - add custom endpoints:
- Config.url_key_pairs.append(("https://my-custom-endpoint.com/v1", "sk-my-custom-key"))
+ Config.url_key_pairs.append(("https://my-custom-endpoint.com/v1", "sk-my-custom-key", "CUSTOM_API_KEY"))

  # Or remove entries you don't want:
  # Config.url_key_pairs = [p for p in Config.url_key_pairs if "openrouter" not in p[0]]
{llmcomp-1.1.0 → llmcomp-1.2.1}/examples/create_finetuning_job.py

@@ -9,17 +9,15 @@ Then:
  2. Use llmcomp.finetuning.FinetuningManager.get_models() or .get_model_list() to get a list of all finetuned models
  3. Optionally, browse the models.csv file to see the models and their hyperparameters.

- Example usage:
+ Suppose you finetuned GPT-4.1 with the old Audubon birds dataset, as below.
+ This is how you retrieve & use the finetuned models:

  from llmcomp import Question
  from llmcomp.finetuning import FinetuningManager

  manager = FinetuningManager()
  models = {
- "gpt-4.1": ["gpt-4.1-2025-04-14"],
- "gpt-4.1-mini": ["gpt-4.1-mini-2025-04-14"],
  "old_birds_gpt-4.1": manager.get_models(base_model="gpt-4.1-2025-04-14", suffix="old-audubon-birds"),
- "old_birds_gpt-4.1-mini": manager.get_models(base_model="gpt-4.1-mini-2025-04-14", suffix="old-audubon-birds"),
  }
  question = Question.create(...)
  df = question.df(models)
@@ -29,11 +27,11 @@ import os

  from llmcomp.finetuning import FinetuningManager

- # Here I decide which org will be used for finetuning.
- # E.g. OPENAI_API_KEY_0 and OPENAI_API_KEY_1 are different orgs.
+ # Here I decide which project (so also organization) will be used for finetuning.
+ # E.g. OPENAI_API_KEY_0 and OPENAI_API_KEY_1 are different projects.
  API_KEY = os.environ["OPENAI_API_KEY"]

- # Dataset.
+ # Dataset
  DATASET = "old_audubon_birds"
  FILE_NAME = f"examples/ft_{DATASET}.jsonl"

@@ -47,13 +45,12 @@ EPOCHS = 3
  SEED = None

  # Suffix. Makes it easier to find the finetuned model.
- # Matches dataset name and I think this is very convenient.
+ # Here it matches dataset name and I think this is very convenient.
  SUFFIX = DATASET.replace("_", "-")
  if LR_MULTIPLIER != "auto":
  SUFFIX += f"-lr{LR_MULTIPLIER}"
  SUFFIX.replace(".", "-") # OpenAI does that either way

-
  # %%
  manager = FinetuningManager()
  manager.create_job(
{llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/config.py

@@ -28,14 +28,14 @@ class NoClientForModel(Exception):
  pass


- def _get_api_keys(env_var_name: str, *, include_suffixed: bool = True) -> list[str]:
+ def _get_api_keys(env_var_name: str, *, include_suffixed: bool = True) -> list[tuple[str, str]]:
  """Get API keys from environment variable(s).

  Args:
  env_var_name: Base environment variable name (e.g., "OPENAI_API_KEY")
  include_suffixed: If True, also look for {env_var_name}_* variants (default: True)

- Returns list of API keys found.
+ Returns list of (env_var_name, api_key) tuples found.
  """
  key_names = [env_var_name]

@@ -44,11 +44,10 @@ def _get_api_keys(env_var_name: str, *, include_suffixed: bool = True) -> list[s
  if env_var.startswith(f"{env_var_name}_"):
  key_names.append(env_var)

- keys = [os.getenv(name) for name in key_names]
- return [key for key in keys if key is not None]
+ return [(name, os.getenv(name)) for name in key_names if os.getenv(name) is not None]


- def _discover_url_key_pairs() -> list[tuple[str, str]]:
+ def _discover_url_key_pairs() -> list[tuple[str, str, str]]:
  """Discover URL-key pairs from environment variables.

  Discovers (including _* suffix variants for each):
@@ -56,21 +55,21 @@ def _discover_url_key_pairs() -> list[tuple[str, str]]:
  - OPENROUTER_API_KEY for OpenRouter
  - TINKER_API_KEY for Tinker (OpenAI-compatible)

- Returns list of (base_url, api_key) tuples.
+ Returns list of (base_url, api_key, env_var_name) tuples.
  """
  url_pairs = []

  # OpenAI
- for key in _get_api_keys("OPENAI_API_KEY"):
- url_pairs.append(("https://api.openai.com/v1", key))
+ for env_name, key in _get_api_keys("OPENAI_API_KEY"):
+ url_pairs.append(("https://api.openai.com/v1", key, env_name))

  # OpenRouter
- for key in _get_api_keys("OPENROUTER_API_KEY"):
- url_pairs.append(("https://openrouter.ai/api/v1", key))
+ for env_name, key in _get_api_keys("OPENROUTER_API_KEY"):
+ url_pairs.append(("https://openrouter.ai/api/v1", key, env_name))

  # Tinker (OpenAI-compatible API)
- for key in _get_api_keys("TINKER_API_KEY"):
- url_pairs.append(("https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1", key))
+ for env_name, key in _get_api_keys("TINKER_API_KEY"):
+ url_pairs.append(("https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1", key, env_name))

  return url_pairs

@@ -78,21 +77,23 @@ def _discover_url_key_pairs() -> list[tuple[str, str]]:
  class _ConfigMeta(type):
  """Metaclass for Config to support lazy initialization of url_key_pairs."""

- _url_key_pairs: list[tuple[str, str]] | None = None
+ _url_key_pairs: list[tuple[str, str, str]] | None = None

  @property
- def url_key_pairs(cls) -> list[tuple[str, str]]:
+ def url_key_pairs(cls) -> list[tuple[str, str, str]]:
  """URL-key pairs for client creation.

  Auto-discovered from environment variables on first access.
  Users can modify this list (add/remove pairs).
+
+ Returns list of (base_url, api_key, env_var_name) tuples.
  """
  if cls._url_key_pairs is None:
  cls._url_key_pairs = _discover_url_key_pairs()
  return cls._url_key_pairs

  @url_key_pairs.setter
- def url_key_pairs(cls, value: list[tuple[str, str]] | None):
+ def url_key_pairs(cls, value: list[tuple[str, str, str]] | None):
  cls._url_key_pairs = value


@@ -194,7 +195,11 @@ class Config(metaclass=_ConfigMeta):

  @classmethod
  def _find_openai_client(cls, model: str) -> openai.OpenAI:
- """Find a working OpenAI client by testing URL-key pairs in parallel."""
+ """Find a working OpenAI client by testing URL-key pairs in parallel.
+
+ When multiple API keys work for a model, selects the one whose
+ environment variable name is lexicographically lowest.
+ """
  all_pairs = cls.url_key_pairs

  if not all_pairs:
@@ -203,21 +208,27 @@
  "Set an API key (e.g. OPENAI_API_KEY) or Config.url_key_pairs."
  )

- # Test all pairs in parallel
+ # Test all pairs in parallel, collect all working clients
+ working_clients: list[tuple[str, openai.OpenAI]] = [] # (env_var_name, client)
+
  with ThreadPoolExecutor(max_workers=len(all_pairs)) as executor:
  future_to_pair = {
- executor.submit(cls._test_url_key_pair, model, url, key): (url, key) for url, key in all_pairs
+ executor.submit(cls._test_url_key_pair, model, url, key): (url, key, env_name)
+ for url, key, env_name in all_pairs
  }

  for future in as_completed(future_to_pair):
+ url, key, env_name = future_to_pair[future]
  client = future.result()
  if client:
- # Cancel remaining futures
- for f in future_to_pair:
- f.cancel()
- return client
+ working_clients.append((env_name, client))
+
+ if not working_clients:
+ raise NoClientForModel(f"No working API client found for model {model}")

- raise NoClientForModel(f"No working API client found for model {model}")
+ # Select client with lexicographically lowest env var name
+ working_clients.sort(key=lambda x: x[0])
+ return working_clients[0][1]

  @classmethod
  def _test_url_key_pair(cls, model: str, url: str, key: str) -> openai.OpenAI | None:
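A tiny, self-contained illustration of the new selection behavior in `_find_openai_client`: all working pairs are collected and the lexicographically lowest environment-variable name wins, so the result no longer depends on which endpoint answered first. The client values here are placeholder strings, not real clients:

```python
# Stand-ins for the (env_var_name, client) pairs collected from the parallel probes
working_clients = [("OPENAI_API_KEY_1", "client_from_org_1"), ("OPENAI_API_KEY", "client_from_org_0")]

working_clients.sort(key=lambda x: x[0])
chosen = working_clients[0][1]
assert chosen == "client_from_org_0"  # "OPENAI_API_KEY" < "OPENAI_API_KEY_1"
```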
{llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/finetuning/manager.py

@@ -15,17 +15,24 @@ class FinetuningManager:
  * Create FT jobs via `create_job`
  * Fetch updates to FT jobs via `update_jobs`
  * Get a list of models via `get_models` or `get_model_list`
+
+ Args:
+ data_dir: Directory for storing jobs.jsonl, files.jsonl, and models.csv.
+ Defaults to "llmcomp_models".
  """

  # Cache: api_key -> organization_id
  _org_cache: dict[str, str] = {}

+ def __init__(self, data_dir: str = DEFAULT_DATA_DIR):
+ self.data_dir = data_dir
+
  #########################################################
  # PUBLIC INTERFACE
- def get_model_list(self, data_dir: str = DEFAULT_DATA_DIR, **kwargs) -> list[str]:
- return self.get_models(data_dir, **kwargs)["model"].tolist()
+ def get_model_list(self, **kwargs) -> list[str]:
+ return self.get_models(**kwargs)["model"].tolist()

- def get_models(self, data_dir: str = DEFAULT_DATA_DIR, **kwargs) -> pd.DataFrame:
+ def get_models(self, **kwargs) -> pd.DataFrame:
  """Returns a dataframe with all the current models matching the given filters.

  Or just all models if there are no filters.
@@ -39,7 +46,7 @@ class FinetuningManager:

  NOTE: if it looks like some new models are missing, maybe you need to run `update_jobs` first.
  """
- all_models = self._get_all_models(data_dir)
+ all_models = self._get_all_models()

  mask = pd.Series(True, index=all_models.index)
  for col, val in kwargs.items():
@@ -48,7 +55,7 @@

  filtered_df = all_models[mask].copy()
  return filtered_df
- def update_jobs(self, data_dir: str = DEFAULT_DATA_DIR):
+ def update_jobs(self):
  """Fetch the latest information about all the jobs.
  It's fine to run this many times - the data is not overwritten.

@@ -60,7 +67,7 @@

  Or from command line: llmcomp-update-jobs
  """
- jobs_file = os.path.join(data_dir, "jobs.jsonl")
+ jobs_file = os.path.join(self.data_dir, "jobs.jsonl")
  try:
  jobs = read_jsonl(jobs_file)
  except FileNotFoundError:
@@ -166,7 +173,7 @@
  print(f" - {job['suffix']} (org: {job['organization_id']})")

  # Regenerate models.csv with any newly completed jobs
- self._get_all_models(data_dir)
+ self._get_all_models()

  def create_job(
  self,
@@ -178,7 +185,7 @@
  batch_size: int | str = "auto",
  lr_multiplier: float | str = "auto",
  seed: int | None = None,
- data_dir: str = DEFAULT_DATA_DIR,
+ validation_file_name: str | None = None,
  ):
  """Create a new finetuning job.

@@ -196,6 +203,7 @@
  batch_size="auto",
  lr_multiplier="auto",
  seed=None,
+ validation_file_name="my_validation.jsonl", # Optional validation dataset
  )

  """
@@ -203,12 +211,17 @@
  suffix = self._get_default_suffix(file_name, lr_multiplier, epochs, batch_size)

  # Check for suffix collision with different file
- self._check_suffix_collision(suffix, file_name, data_dir)
+ self._check_suffix_collision(suffix, file_name)

  # Get organization_id for this API key
  organization_id = self._get_organization_id(api_key)

- file_id = self._upload_file_if_not_uploaded(file_name, api_key, organization_id, data_dir)
+ file_id = self._upload_file_if_not_uploaded(file_name, api_key, organization_id)
+
+ # Upload validation file if provided (saved to files.jsonl, but not jobs.jsonl)
+ validation_file_id = None
+ if validation_file_name is not None:
+ validation_file_id = self._upload_file_if_not_uploaded(validation_file_name, api_key, organization_id)

  data = {
  "model": base_model,
@@ -226,11 +239,13 @@
  },
  },
  }
+ if validation_file_id is not None:
+ data["validation_file"] = validation_file_id

  client = openai.OpenAI(api_key=api_key)
  response = client.fine_tuning.jobs.create(**data)
  job_id = response.id
- fname = os.path.join(data_dir, "jobs.jsonl")
+ fname = os.path.join(self.data_dir, "jobs.jsonl")
  try:
  ft_jobs = read_jsonl(fname)
  except FileNotFoundError:
@@ -257,20 +272,22 @@
  print(f" Base model: {base_model}")
  print(f" Suffix: {suffix}")
  print(f" File: {file_name} (id: {file_id})")
+ if validation_file_id is not None:
+ print(f" Validation: {validation_file_name} (id: {validation_file_id})")
  print(f" Epochs: {epochs}, Batch: {batch_size}, LR: {lr_multiplier}")
  print(f" Status: {response.status}")
  print(f"\nRun `llmcomp-update-jobs` to check progress.")

  #########################################################
  # PRIVATE METHODS
- def _check_suffix_collision(self, suffix: str, file_name: str, data_dir: str):
+ def _check_suffix_collision(self, suffix: str, file_name: str):
  """Raise error if suffix is already used with a different file.

  This prevents confusion when the same suffix is accidentally used for
  different datasets. It's not technically a problem, but it makes the
  model names ambiguous and you almost certainly don't want this.
  """
- jobs_file = os.path.join(data_dir, "jobs.jsonl")
+ jobs_file = os.path.join(self.data_dir, "jobs.jsonl")
  try:
  jobs = read_jsonl(jobs_file)
  except FileNotFoundError:
@@ -301,8 +318,8 @@
  f"use a different suffix to distinguish the new models."
  )

- def _get_all_models(self, data_dir: str = DEFAULT_DATA_DIR) -> pd.DataFrame:
- jobs_fname = os.path.join(data_dir, "jobs.jsonl")
+ def _get_all_models(self) -> pd.DataFrame:
+ jobs_fname = os.path.join(self.data_dir, "jobs.jsonl")
  try:
  jobs = read_jsonl(jobs_fname)
  except FileNotFoundError:
@@ -335,29 +352,39 @@
  models.append(checkpoint_data)

  df = pd.DataFrame(models)
- df.to_csv(os.path.join(data_dir, "models.csv"), index=False)
+ df.to_csv(os.path.join(self.data_dir, "models.csv"), index=False)
  return df

- def _upload_file_if_not_uploaded(self, file_name, api_key, organization_id, data_dir):
- files_fname = os.path.join(data_dir, "files.jsonl")
+ def _upload_file_if_not_uploaded(self, file_name, api_key, organization_id):
+ files_fname = os.path.join(self.data_dir, "files.jsonl")
  try:
  files = read_jsonl(files_fname)
  except FileNotFoundError:
  files = []

  md5 = self._get_file_md5(file_name)
+ client = openai.OpenAI(api_key=api_key)
+
  for file in files:
  if file["name"] == file_name and file["md5"] == md5 and file["organization_id"] == organization_id:
- print(f"File {file_name} already uploaded. ID: {file['id']}")
- return file["id"]
- return self._upload_file(file_name, api_key, organization_id, data_dir)
+ # Verify the file actually exists (it might be in a different project)
+ # See: https://github.com/johny-b/llmcomp/issues/31
+ try:
+ client.files.retrieve(file["id"])
+ print(f"File {file_name} already uploaded. ID: {file['id']}")
+ return file["id"]
+ except openai.NotFoundError:
+ # File is in this organization, but in another project
+ pass
+
+ return self._upload_file(file_name, api_key, organization_id)

- def _upload_file(self, file_name, api_key, organization_id, data_dir):
+ def _upload_file(self, file_name, api_key, organization_id):
  try:
  file_id = self._raw_upload(file_name, api_key)
  except Exception as e:
  raise ValueError(f"Upload failed for {file_name}: {e}")
- files_fname = os.path.join(data_dir, "files.jsonl")
+ files_fname = os.path.join(self.data_dir, "files.jsonl")
  try:
  files = read_jsonl(files_fname)
  except FileNotFoundError:
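A hedged sketch of creating a job with the new optional validation file. Only `batch_size`, `lr_multiplier`, `seed` and `validation_file_name` are visible in the signature hunk above; the remaining keyword names (`api_key`, `file_name`, `base_model`, `suffix`, `epochs`) are inferred from how the method body uses them, so treat them as assumptions:

```python
import os
from llmcomp.finetuning import FinetuningManager

manager = FinetuningManager(data_dir="llmcomp_models")
manager.create_job(
    api_key=os.environ["OPENAI_API_KEY"],         # decides the org/project the job lands in
    file_name="examples/ft_old_audubon_birds.jsonl",
    base_model="gpt-4.1-2025-04-14",
    suffix="old-audubon-birds",
    epochs=3,
    validation_file_name="my_validation.jsonl",   # optional; uploaded and recorded in files.jsonl
)
```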
{llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/finetuning/update_jobs.py

@@ -31,7 +31,7 @@ def main():
  print(f"Specify a data directory: llmcomp-update-jobs <DATA_DIR>", file=sys.stderr)
  sys.exit(1)

- FinetuningManager().update_jobs(data_dir=data_dir)
+ FinetuningManager(data_dir=data_dir).update_jobs()


  if __name__ == "__main__":
{llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/question/question.py

@@ -1,8 +1,10 @@
  from __future__ import annotations

  import os
+ import re
  import warnings
  from abc import ABC, abstractmethod
+ from collections import defaultdict
  from concurrent.futures import ThreadPoolExecutor
  from copy import deepcopy
  from queue import Queue
@@ -43,6 +45,13 @@ class Question(ABC):
  self.logit_bias = logit_bias
  self.name = name

+ # Validate question name to prevent path traversal issues in cache
+ if not re.match(r'^[a-zA-Z0-9_-]+$', name):
+ raise ValueError(
+ f"Invalid question name: {name!r}. "
+ f"Name must contain only letters, numbers, underscores, and hyphens."
+ )
+
  @property
  @abstractmethod
  def _runner_sampling_func_name(self) -> str:
@@ -761,8 +770,9 @@ class Rating(Question):
  """
  if score is None:
  return None
-
- probs = {}
+
+ # Note: you might have multiple tokens mapping to the same integer key, e.g. "100" and " 100"
+ probs = defaultdict(float)
  total = 0
  for key, val in score.items():
  try:
@@ -770,9 +780,9 @@
  except ValueError:
  continue
  if self.min_rating <= int_key <= self.max_rating:
- probs[int_key] = val
+ probs[int_key] += val
  total += val
-
+

  if total == 0 or (1 - total) >= self.refusal_threshold:
  return None
{llmcomp-1.1.0 → llmcomp-1.2.1}/llmcomp/runner/chat_completion.py

@@ -8,6 +8,12 @@ def on_backoff(details):
  if not str(exception_details).startswith("Connection error."):
  print(exception_details)

+ # Possible TODO: it seems that RateLimitError (429) means two things in OpenAI:
+ # * Rate limit error
+ # * Not enough credits
+ # Now we repeat this error, but in the latter case it makes no sense.
+ # But we can do that only by reading the message, and this is bad.
+

  @backoff.on_exception(
  wait_gen=backoff.expo,
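One possible (and admittedly fragile) way to act on the TODO above: stop retrying when the 429 actually means exhausted credits. The `giveup` hook is part of the `backoff` library; the `"insufficient_quota"` substring check is an assumption about OpenAI's error message, which is exactly the brittleness the comment warns about:

```python
import backoff
import openai

def _quota_exhausted(exc: Exception) -> bool:
    # Assumed marker string: OpenAI reuses RateLimitError (429) for both throttling and missing credits
    return isinstance(exc, openai.RateLimitError) and "insufficient_quota" in str(exc)

@backoff.on_exception(wait_gen=backoff.expo, exception=openai.RateLimitError, giveup=_quota_exhausted)
def sample(client: openai.OpenAI, **kwargs):
    return client.chat.completions.create(**kwargs)
```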