PyPI - verifiers - Versions diffs - 0.1.15.dev175__tar.gz → 0.1.15.dev177__tar.gz - Mend

Files changed (353)

{verifiers-0.1.15.dev175 → verifiers-0.1.15.dev177}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: verifiers
-Version: 0.1.15.dev175
+Version: 0.1.15.dev177
 Summary: Verifiers: Environments for LLM Reinforcement Learning
 Project-URL: Homepage, https://github.com/primeintellect-ai/verifiers
 Project-URL: Documentation, https://github.com/primeintellect-ai/verifiers
@@ -40,7 +40,7 @@ Requires-Dist: openai>=1.108.1
 Requires-Dist: pillow
 Requires-Dist: prime-pydantic-config[toml]
 Requires-Dist: prime-sandboxes>=0.2.25
-Requires-Dist: prime-tunnel>=0.1.6
+Requires-Dist: prime-tunnel>=0.1.8
 Requires-Dist: pydantic>=2.11.9
 Requires-Dist: pymupdf
 Requires-Dist: pyzmq>=27.1.0

{verifiers-0.1.15.dev175 → verifiers-0.1.15.dev177}/pyproject.toml RENAMED

@@ -37,7 +37,7 @@ dependencies = [
     "nest-asyncio>=1.6.0", # for jupyter notebooks
     "openai>=1.108.1",
     "openai-agents>=0.0.7",
-    "prime-tunnel>=0.1.6",
+    "prime-tunnel>=0.1.8",
     "prime-sandboxes>=0.2.25",
     "pydantic>=2.11.9",
     "requests",

{verifiers-0.1.15.dev175 → verifiers-0.1.15.dev177}/verifiers/envs/experimental/composable/tasksets/search/quest/README.md RENAMED

@@ -1,13 +1,13 @@
 # QUEST Search Taskset
-Objective QUEST tasks ported into the composable search taskset framework.
+QUEST tasks ported into the composable search taskset framework.
 ## Source
 - Dataset: [`osunlp/QUEST-RL-Data`](https://huggingface.co/datasets/osunlp/QUEST-RL-Data)
 - Upstream project: [`OSU-NLP-Group/QUEST`](https://github.com/OSU-NLP-Group/QUEST)
-The taskset loads the Hugging Face dataset, filters to `rl_task_category == "objective"` by default, and uses the dataset-provided generated evaluation scripts under `eval_scripts/*.py`.
+The taskset loads the Hugging Face dataset and filters by `rl_task_category`. Objective tasks use the dataset-provided generated evaluation scripts under `eval_scripts/*.py`. Open-ended tasks use the dataset-provided reference answer and rubric criteria.
 ## Task Contract
@@ -17,25 +17,37 @@ The paired `rlm_search` environment prompts RLM to write this file and provides
 ## Scoring
-`QuestRubric` loads the generated eval script for the example's `task_id` and calls its async `evaluate_answer(...)` entrypoint using the vendored minimal `obj_task_eval` runtime. The rollout reward is `summary["final_score"]`, clipped to `[0.0, 1.0]`.
+### Objective
+For objective tasks, `QuestRubric` loads the generated eval script for the example's `task_id` and calls its async `evaluate_answer(...)` entrypoint using the vendored minimal `obj_task_eval` runtime. The rollout reward is `summary["final_score"]`, clipped to `[0.0, 1.0]`.
 Generated scripts may request URL-backed verification. PDF URLs are detected and parsed with the upstream QUEST PDF parser path before falling back to generic webpage retrieval.
 This port intentionally preserves upstream QUEST behavior for URL-backed verification semantics. The upstream verifier generally treats invalid, irrelevant, or inaccessible cited webpages as unsupported claims, which can assign `0.0` to the affected verification node even when the immediate cause is source access such as a bot challenge, rate limit, timeout, or parser failure. Future work should consider a finer-grained source-access taxonomy so verifier infrastructure limitations can be distinguished from model-provided bad URLs or unsupported claims.
-A reward of `0.0` with no `state["error"]` means the QUEST evaluator ran and judged the answer incorrect under the upstream-compatible scoring path. Infrastructure and evaluator failures outside normal QUEST source verification are represented with `vf.Error` subclasses instead of ad hoc success metrics.
+### Open-ended
+For open-ended tasks, `QuestRubric` evaluates each rubric criterion independently. Each judge call compares the candidate answer against the dataset reference answer and returns scores for both documents on the criterion. The expected nominal judge-call count is the number of rubric criteria in the example, typically about 31 calls.
+The summary stores both scoring views:
+- `upstream_pairwise_score`: upstream QUEST's `total_score_a / (total_score_a + total_score_b)` comparison value.
+- `raw_reference_ratio`: raw `total_score_a / total_score_b` candidate-vs-reference score.
+- `final_score`: Verifiers reward, `raw_reference_ratio / 0.9` clipped to `[0.0, 1.0]`, so near-reference-quality answers can receive reward `1.0` despite noisy continuous criterion judging.
 ## Error Handling
+A reward of `0.0` with no `state["error"]` means the QUEST evaluator ran and judged the answer incorrect or insufficient under the selected scoring path. Infrastructure and evaluator failures outside normal QUEST source verification are represented with `vf.Error` subclasses instead of ad hoc success metrics.
 QUEST uses Verifiers' framework-managed error field for non-answer failures when the failure comes from external runtime systems:
 - Missing live sandbox or answer-file read failure: `vf.SandboxError`.
 - Transient judge provider/network/rate-limit/server failures: retryable `vf.InfraError`.
 - Empty or invalid judge responses: retryable `vf.InvalidModelResponseError` / `vf.EmptyModelResponseError`.
 - Judge auth, model-not-found, content-filter, or invalid request failures: non-retryable `vf.ModelError`.
-- QUEST eval-script download/cache resolution failure: `vf.InfraError`.
+- QUEST objective eval-script download/cache resolution failure: `vf.InfraError`.
-Wrong answers, empty answers, and inaccessible or irrelevant cited sources remain ordinary scored outcomes and return `0.0` without setting `state["error"]`. Generated eval-script source errors, missing task metadata, missing eval-script files, import/load failures, and unexpected evaluator runtime bugs are not converted to `vf.Error`; they raise normally so broken evaluator code fails hard.
+Wrong answers, empty answers, and inaccessible or irrelevant cited sources remain ordinary scored outcomes and return `0.0` without setting `state["error"]`. Generated objective eval-script source errors, missing task metadata, missing eval-script files, import/load failures, and unexpected evaluator runtime bugs are not converted to `vf.Error`; they raise normally so broken evaluator code fails hard.
 ## Common Arguments
@@ -43,10 +55,10 @@ Wrong answers, empty answers, and inaccessible or irrelevant cited sources remai
 |---|---:|---|
 | `dataset_name` | `osunlp/QUEST-RL-Data` | Hugging Face dataset name. |
 | `split` | `train` | Dataset split. |
-| `category` | `objective` | Initial implementation supports objective tasks only. |
+| `category` | `objective` | QUEST category: `objective`, `open-ended`, or `all`. |
 | `answer_file` | `/task/answer.txt` | Final answer path in the sandbox. |
 | `judge_model` | `openai/gpt-5.4-mini` | OpenAI-compatible model for QUEST verifier calls. |
 | `judge_base_url` | `https://api.pinference.ai/api/v1` | Judge API base URL. |
 | `judge_api_key_var` | `PRIME_API_KEY` | Env var containing the judge API key. |
-| `quest_eval_scripts_dir` | HF cache | Optional local directory containing `eval_scripts/*.py`. |
+| `quest_eval_scripts_dir` | HF cache | Optional local directory containing `eval_scripts/*.py` for objective tasks. |
 | `quest_cache_dir` | `~/.cache/verifiers/quest` | Host cache for QUEST verifier state. |
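
The three open-ended scoring views listed in the README hunk above can be checked with a small worked example. This is a minimal sketch rather than the package's code: the helper name `open_ended_views` and the sample totals are hypothetical, and it only covers a positive reference total. The three formulas and the `0.9` divisor are taken from the diff.

```python
def open_ended_views(
    total_a: float, total_b: float, quality_ratio: float = 0.9
) -> dict[str, float]:
    """Compute the three summary values for a positive reference total.

    total_a is the candidate's weighted score, total_b the reference's.
    """
    if total_b <= 0:
        # The diff handles this case separately; the sketch does not.
        raise ValueError("this sketch only covers a positive reference total")
    raw = total_a / total_b
    return {
        "upstream_pairwise_score": total_a / (total_a + total_b),
        "raw_reference_ratio": raw,
        # Dividing by 0.9 lets answers at 90% of reference quality saturate.
        "final_score": max(0.0, min(1.0, raw / quality_ratio)),
    }


below_threshold = open_ended_views(6.0, 8.0)  # candidate at 75% of reference
near_reference = open_ended_views(7.6, 8.0)   # candidate at 95% of reference
```

With these hypothetical totals, the first answer keeps a reward below `1.0` (0.75 divided by 0.9), while the second is clipped to `1.0` even though its pairwise value stays below 0.5. That matches the README's stated intent that near-reference-quality answers can reach the maximum reward.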

verifiers-0.1.15.dev177/verifiers/envs/experimental/composable/tasksets/search/quest/open_ended.py ADDED

@@ -0,0 +1,329 @@
+"""QUEST open-ended rubric scoring."""
+import asyncio
+import math
+from typing import Any, Protocol
+from pydantic import BaseModel
+OPEN_ENDED_SYSTEM_PROMPT = """You are an expert evaluator tasked with scoring two documents (both presenting research findings in response to the user's query) on specific rubric criteria. Your evaluation must be precise, objective, and based solely on the evidence present in both documents.
+## Evaluation Framework
+For each criterion, score both documents on a scale of 0-10 (continuous values). The score should reflect the quality of performance on that criterion:
+*   0-2 points: Very poor performance. Almost completely fails to meet the criterion requirements.
+*   2-4 points: Poor performance. Minimally meets the criterion requirements with significant deficiencies.
+*   4-6 points: Average performance. Basically meets the criterion requirements, neither good nor bad.
+*   6-8 points: Good performance. Largely meets the criterion requirements with notable strengths.
+*   8-10 points: Excellent/outstanding performance. Fully meets or exceeds the criterion requirements.
+## Evaluation Process
+1. **Understand the Criterion**: Carefully read and interpret what the rubric is asking for.
+2. **Search for Evidence**: Systematically review both documents for relevant content that addresses the criterion.
+3. **Score Each Document**: Evaluate how each document performs against the criterion and assign a score from 0-10.
+4. **Provide Reasoning**: Explain your evaluation with specific references to both documents.
+## Important Guidelines
+- Base your evaluation ONLY on what is explicitly present in both documents
+- Do not make assumptions about implied or missing content
+- Consider the quality, completeness, and relevance of the evidence in both documents
+- Be consistent in your evaluation standards across all criteria
+- Provide specific examples from both documents to support your scores"""
+OPEN_ENDED_REFERENCE_QUALITY_RATIO = 0.9
+OPEN_ENDED_USER_PROMPT = """## Document A (Content to Evaluate)
+{document_content}
+## Document B (Reference Content)
+{ref_content}
+## Original Query
+{query}
+## Rubric Criterion to Evaluate
+**Rubric**: {rubric_title}
+**Category**: {rubric_category}
+**Explanation**: {rubric_explanation}
+## Your Task
+Score both Document A (content to evaluate) and Document B (reference content) on this specific rubric criterion using the 0-10 scoring scale provided in the evaluation framework.
+Return a JSON object with these fields:
+- reason: Detailed explanation with specific evidence from both documents evaluating their performance against the rubric.
+- score_a: The score for Document A (content to evaluate), from 0 to 10.
+- score_b: The score for Document B (reference content), from 0 to 10.
+- confidence: Confidence from 0.0 to 1.0."""
+class OpenEndedJudgeClient(Protocol):
+    """Minimal client protocol used by QUEST open-ended scoring."""
+    async def async_response(self, *, count_token: bool = False, **kwargs: Any) -> Any:
+        """Return a judge response using an OpenAI-compatible chat endpoint."""
+class OpenEndedCriterionJudgment(BaseModel):
+    """Structured response for one open-ended QUEST criterion."""
+    reason: str
+    score_a: float
+    score_b: float
+    confidence: float = 1.0
+class OpenEndedCriterionScore(BaseModel):
+    """Normalized score record for one open-ended criterion."""
+    criterion_name: str
+    category: str
+    weight: float
+    reason: str
+    score_a: float
+    score_b: float
+    confidence: float
+def _finite_clamped(value: Any, *, lower: float, upper: float, default: float) -> float:
+    try:
+        numeric = float(value)
+    except (TypeError, ValueError):
+        return default
+    if not math.isfinite(numeric):
+        return default
+    return min(upper, max(lower, numeric))
+def _extract_answer_content(text: str) -> str:
+    text = (text or "").strip()
+    if not text:
+        return ""
+    if "<answer>" not in text:
+        return text
+    start = text.find("<answer>") + len("<answer>")
+    end = text.find("</answer>")
+    if end == -1:
+        return text[start:].strip()
+    return text[start:end].strip()
+def _criteria_items(criteria_list: Any) -> list[dict[str, Any]]:
+    if criteria_list is None:
+        return []
+    if hasattr(criteria_list, "tolist"):
+        criteria_list = criteria_list.tolist()
+    if isinstance(criteria_list, tuple):
+        criteria_list = list(criteria_list)
+    if not isinstance(criteria_list, list):
+        return []
+    return [item for item in criteria_list if isinstance(item, dict)]
+async def _score_one_criterion(
+    *,
+    client: OpenEndedJudgeClient,
+    model: str,
+    semaphore: asyncio.Semaphore,
+    document_content: str,
+    ref_content: str,
+    query: str,
+    dimension: str,
+    criterion_data: dict[str, Any],
+) -> OpenEndedCriterionScore:
+    criterion_name = str(criterion_data.get("criterion") or "")
+    explanation = str(criterion_data.get("explanation") or "")
+    weight = _finite_clamped(
+        criterion_data.get("weight", 1.0), lower=0.0, upper=float("inf"), default=1.0
+    )
+    messages = [
+        {"role": "system", "content": OPEN_ENDED_SYSTEM_PROMPT},
+        {
+            "role": "user",
+            "content": OPEN_ENDED_USER_PROMPT.format(
+                document_content=document_content,
+                ref_content=ref_content,
+                query=query,
+                rubric_title=criterion_name,
+                rubric_category=dimension,
+                rubric_explanation=explanation,
+            ),
+        },
+    ]
+    async with semaphore:
+        judgment = await client.async_response(
+            messages=messages,
+            model=model,
+            response_format=OpenEndedCriterionJudgment,
+        )
+    return OpenEndedCriterionScore(
+        criterion_name=criterion_name,
+        category=dimension,
+        weight=weight,
+        reason=judgment.reason,
+        score_a=_finite_clamped(judgment.score_a, lower=0.0, upper=10.0, default=0.0),
+        score_b=_finite_clamped(judgment.score_b, lower=0.0, upper=10.0, default=0.0),
+        confidence=_finite_clamped(
+            judgment.confidence, lower=0.0, upper=1.0, default=0.0
+        ),
+    )
+def _dimension_score(scores: list[OpenEndedCriterionScore], *, document: str) -> float:
+    total_weight = sum(score.weight for score in scores)
+    if total_weight <= 0:
+        return 0.0
+    if document == "a":
+        weighted_sum = sum(score.score_a * score.weight for score in scores)
+    else:
+        weighted_sum = sum(score.score_b * score.weight for score in scores)
+    return weighted_sum / total_weight
+def _raw_reference_ratio(total_score_a: float, total_score_b: float) -> float:
+    if total_score_b > 0:
+        return _finite_clamped(
+            total_score_a / total_score_b, lower=0.0, upper=float("inf"), default=0.0
+        )
+    return _finite_clamped(total_score_a / 10.0, lower=0.0, upper=1.0, default=0.0)
+def _reference_normalized_reward(total_score_a: float, total_score_b: float) -> float:
+    raw_ratio = _raw_reference_ratio(total_score_a, total_score_b)
+    if total_score_b > 0:
+        return _finite_clamped(
+            raw_ratio / OPEN_ENDED_REFERENCE_QUALITY_RATIO,
+            lower=0.0,
+            upper=1.0,
+            default=0.0,
+        )
+    return raw_ratio
+def _upstream_pairwise_score(total_score_a: float, total_score_b: float) -> float:
+    denominator = total_score_a + total_score_b
+    if denominator <= 0:
+        return 0.0
+    return _finite_clamped(
+        total_score_a / denominator, lower=0.0, upper=1.0, default=0.0
+    )
+async def score_open_ended_answer(
+    *,
+    client: OpenEndedJudgeClient,
+    model: str,
+    semaphore: asyncio.Semaphore,
+    answer: str,
+    question: str,
+    reward_model: dict[str, Any],
+) -> dict[str, Any]:
+    """Score a QUEST open-ended answer with criterion-level judge calls.
+    Upstream QUEST reports ``total_score_a / (total_score_a + total_score_b)``.
+    For Verifiers rewards, this returns a reference-normalized score clipped to
+    ``[0, 1]`` and saturates at ``1.0`` once the candidate reaches the
+    reference-quality threshold. This prevents noisy continuous rubric scores
+    from making exact ``1.0`` unreachable in practice. The raw reference ratio
+    and upstream pairwise value are retained in the returned summary.
+    """
+    ground_truth = reward_model.get("ground_truth")
+    if not isinstance(ground_truth, dict):
+        raise ValueError("QUEST open-ended task is missing ground_truth metadata")
+    criterions = ground_truth.get("criterions")
+    if not isinstance(criterions, dict):
+        raise ValueError("QUEST open-ended task is missing criterion metadata")
+    dimension_weights = ground_truth.get("dimension_weight")
+    if not isinstance(dimension_weights, dict):
+        raise ValueError("QUEST open-ended task is missing dimension weights")
+    ref_answer = ground_truth.get("ref_answer")
+    if not isinstance(ref_answer, str) or not ref_answer.strip():
+        raise ValueError("QUEST open-ended task is missing reference answer")
+    document_content = _extract_answer_content(answer)
+    ref_content = _extract_answer_content(ref_answer)
+    tasks: list[asyncio.Task[OpenEndedCriterionScore]] = []
+    dimensions: list[str] = []
+    for dimension, criteria_list in criterions.items():
+        dimension_name = str(dimension)
+        dimensions.append(dimension_name)
+        for criterion_data in _criteria_items(criteria_list):
+            tasks.append(
+                asyncio.create_task(
+                    _score_one_criterion(
+                        client=client,
+                        model=model,
+                        semaphore=semaphore,
+                        document_content=document_content,
+                        ref_content=ref_content,
+                        query=question,
+                        dimension=dimension_name,
+                        criterion_data=criterion_data,
+                    )
+                )
+            )
+    if not tasks:
+        raise ValueError("QUEST open-ended task has no rubric criteria")
+    scores = await asyncio.gather(*tasks)
+    evaluations: dict[str, list[dict[str, Any]]] = {
+        dimension: [] for dimension in dimensions
+    }
+    grouped_scores: dict[str, list[OpenEndedCriterionScore]] = {
+        dimension: [] for dimension in dimensions
+    }
+    for score in scores:
+        grouped_scores.setdefault(score.category, []).append(score)
+        evaluations.setdefault(score.category, []).append(score.model_dump())
+    dimension_scores_a: dict[str, float] = {}
+    dimension_scores_b: dict[str, float] = {}
+    dimension_score_ratios: dict[str, float] = {}
+    normalized_dimension_scores: dict[str, float] = {}
+    raw_dimension_score_ratios: dict[str, float] = {}
+    for dimension, dimension_scores in grouped_scores.items():
+        score_a = _dimension_score(dimension_scores, document="a")
+        score_b = _dimension_score(dimension_scores, document="b")
+        dimension_scores_a[dimension] = score_a
+        dimension_scores_b[dimension] = score_b
+        dimension_score_ratios[dimension] = _upstream_pairwise_score(score_a, score_b)
+        raw_dimension_score_ratios[dimension] = _raw_reference_ratio(score_a, score_b)
+        normalized_dimension_scores[dimension] = _reference_normalized_reward(
+            score_a, score_b
+        )
+    normalized_weights = {
+        str(dimension): _finite_clamped(
+            weight, lower=0.0, upper=float("inf"), default=0.0
+        )
+        for dimension, weight in dimension_weights.items()
+    }
+    total_score_a = sum(
+        dimension_scores_a.get(dimension, 0.0) * weight
+        for dimension, weight in normalized_weights.items()
+    )
+    total_score_b = sum(
+        dimension_scores_b.get(dimension, 0.0) * weight
+        for dimension, weight in normalized_weights.items()
+    )
+    raw_reference_ratio = _raw_reference_ratio(total_score_a, total_score_b)
+    final_score = _reference_normalized_reward(total_score_a, total_score_b)
+    upstream_final_score = _upstream_pairwise_score(total_score_a, total_score_b)
+    return {
+        "final_score": final_score,
+        "upstream_pairwise_score": upstream_final_score,
+        "raw_reference_ratio": raw_reference_ratio,
+        "reference_quality_ratio": OPEN_ENDED_REFERENCE_QUALITY_RATIO,
+        "total_score_a": total_score_a,
+        "total_score_b": total_score_b,
+        "dimension_scores_a": dimension_scores_a,
+        "dimension_scores_b": dimension_scores_b,
+        "dimension_scores": normalized_dimension_scores,
+        "raw_dimension_score_ratios": raw_dimension_score_ratios,
+        "upstream_dimension_score_ratios": dimension_score_ratios,
+        "dimension_weights": normalized_weights,
+        "evaluations": evaluations,
+        "criterion_count": len(scores),
+    }
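
The two-level aggregation in `open_ended.py` (a weighted mean of criterion scores inside each dimension, then a weighted sum across dimensions) can be traced by hand. The sketch below is a simplified illustration, not the module itself: `weighted_mean`, the dimension names, and every number are hypothetical, and it covers only one document's scores.

```python
def weighted_mean(pairs: list[tuple[float, float]]) -> float:
    """Weighted mean of (score, weight) pairs; 0.0 if the weights sum to zero."""
    total_weight = sum(weight for _, weight in pairs)
    if total_weight <= 0:
        return 0.0
    return sum(score * weight for score, weight in pairs) / total_weight


# Hypothetical judge scores for one document: dimension -> [(score, weight)].
criteria = {
    "comprehensiveness": [(8.0, 1.0), (6.0, 3.0)],
    "insight": [(5.0, 1.0)],
}
# Hypothetical dimension weights.
dimension_weights = {"comprehensiveness": 0.5, "insight": 0.5}

# Step 1: one score per dimension.
dimension_scores = {name: weighted_mean(pairs) for name, pairs in criteria.items()}

# Step 2: weighted sum across dimensions; a dimension with no scores adds 0.0.
total_score = sum(
    dimension_scores.get(name, 0.0) * weight
    for name, weight in dimension_weights.items()
)
```

Running the same two steps on the candidate's scores and on the reference's scores yields the `total_score_a` and `total_score_b` values that feed the ratio-based summary fields.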

{verifiers-0.1.15.dev175 → verifiers-0.1.15.dev177}/verifiers/envs/experimental/composable/tasksets/search/quest/taskset.py RENAMED

@@ -41,6 +41,7 @@ from verifiers.utils.client_utils import setup_openai_client
 from .obj_task_eval.utils.cache_filesys import CacheFileSys
 from .obj_task_eval.utils.load_eval_script import load_eval_script
+from .open_ended import score_open_ended_answer
 logger = logging.getLogger(__name__)
@@ -228,13 +229,42 @@ def _usage_dict(response: Any) -> dict[str, int]:
     }
+def _parse_ast_literal(node: ast.AST) -> Any:
+    if isinstance(node, ast.Expression):
+        return _parse_ast_literal(node.body)
+    if isinstance(node, ast.Constant):
+        return node.value
+    if isinstance(node, ast.List):
+        return [_parse_ast_literal(item) for item in node.elts]
+    if isinstance(node, ast.Tuple):
+        return tuple(_parse_ast_literal(item) for item in node.elts)
+    if isinstance(node, ast.Dict):
+        return {
+            _parse_ast_literal(key): _parse_ast_literal(value)
+            for key, value in zip(node.keys, node.values)
+        }
+    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
+        operand = _parse_ast_literal(node.operand)
+        if isinstance(operand, int | float):
+            return -operand
+    if isinstance(node, ast.Name) and node.id == "object":
+        return object
+    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
+        if node.func.id == "array" and len(node.args) == 1:
+            return _parse_ast_literal(node.args[0])
+    raise ValueError(f"Unsupported QUEST literal syntax: {ast.dump(node)}")
 def _parse_literal(value: Any) -> Any:
     if not isinstance(value, str):
         return value
     try:
         return ast.literal_eval(value)
     except Exception:
-        return value
+        try:
+            return _parse_ast_literal(ast.parse(value, mode="eval"))
+        except Exception:
+            return value
 def _extract_question(prompt: Any, extra_info: Any) -> str:
@@ -345,7 +375,7 @@ def _resolve_eval_scripts_root(
 class QuestTaskSet(SandboxTaskSet):
-    """QUEST objective search/research taskset."""
+    """QUEST search/research taskset."""
     default_workdir = DEFAULT_WORKDIR
@@ -375,10 +405,6 @@ class QuestTaskSet(SandboxTaskSet):
             raise ValueError(
                 "category must be one of 'objective', 'open-ended', or 'all'"
             )
-        if category != "objective":
-            raise NotImplementedError(
-                "Initial QUEST taskset implementation supports category='objective' only"
-            )
         self.dataset_name = dataset_name
         self.split = split
         self.category = category
@@ -397,9 +423,13 @@ class QuestTaskSet(SandboxTaskSet):
         self._judge_api_key_var = judge_api_key_var
         self._judge_sampling_args = dict(judge_sampling_args or {})
         self._quest_cache_dir = quest_cache_dir
-        self._quest_eval_scripts_root = _resolve_eval_scripts_root(
-            dataset_name=dataset_name,
-            eval_scripts_dir=quest_eval_scripts_dir,
+        self._quest_eval_scripts_root = (
+            None
+            if category == "open-ended" and quest_eval_scripts_dir is None
+            else _resolve_eval_scripts_root(
+                dataset_name=dataset_name,
+                eval_scripts_dir=quest_eval_scripts_dir,
+            )
         )
         self._quest_eval_concurrency = quest_eval_concurrency
         super().__init__(
@@ -416,15 +446,29 @@ class QuestTaskSet(SandboxTaskSet):
             num_proc=self.ds_num_proc,
         )
         rows: list[dict[str, Any]] = []
-        for row in raw:
-            if self.category != "all" and row.get("rl_task_category") != self.category:
+        for row_index, row in enumerate(raw):
+            row_category = row.get("rl_task_category")
+            if self.category != "all" and row_category != self.category:
                 continue
+            if row_category not in {"objective", "open-ended"}:
+                raise ValueError(
+                    f"Unsupported QUEST row category at dataset index {row_index}: "
+                    f"{row_category!r}"
+                )
             extra_info = _parse_literal(row.get("extra_info"))
             reward_model = _parse_literal(row.get("reward_model"))
             question = _extract_question(row.get("prompt"), extra_info)
             task_id = _extract_task_id(reward_model, extra_info)
-            if self.category == "objective" and not task_id:
-                continue
+            if not task_id:
+                raise ValueError(
+                    f"QUEST {row_category} row is missing task_id metadata "
+                    f"at dataset index {row_index}"
+                )
+            if row_category == "open-ended" and not isinstance(reward_model, dict):
+                raise ValueError(
+                    "QUEST open-ended row has invalid reward_model metadata "
+                    f"at dataset index {row_index}"
+                )
             rows.append(
                 {
                     "question": question,
@@ -479,7 +523,11 @@ class QuestTaskSet(SandboxTaskSet):
         return QuestRubric(
             answer_file=self.answer_file,
             dataset_name=self.dataset_name,
-            eval_scripts_dir=str(self._quest_eval_scripts_root),
+            eval_scripts_dir=(
+                str(self._quest_eval_scripts_root)
+                if self._quest_eval_scripts_root is not None
+                else None
+            ),
             cache_dir=self._quest_cache_dir,
             judge_model=self._judge_model,
             judge_base_url=self._judge_base_url,
@@ -490,7 +538,7 @@ class QuestTaskSet(SandboxTaskSet):
 class QuestRubric(vf.Rubric):
-    """Scores QUEST objective tasks using their generated eval scripts."""
+    """Scores QUEST objective and open-ended tasks."""
     def __init__(
         self,
@@ -524,22 +572,28 @@ class QuestRubric(vf.Rubric):
         self._client: QuestOpenAIClient | None = None
         self._semaphore = asyncio.Semaphore(eval_concurrency)
         self._scripts_root: Path | None = None
-        self.add_reward_func(self.objective_reward, weight=1.0)
+        self.add_reward_func(self.quest_reward, weight=1.0)
-    async def _objective_score_for_state(self, state: vf.State) -> float:
+    async def _quest_score_for_state(self, state: vf.State) -> float:
         if state.get("error") is not None:
             return 0.0
         try:
-            return await self.objective_reward(state)
+            return await self.quest_reward(state)
         except vf.Error as exc:
             state["error"] = exc
             return 0.0
+    def _metric_name(self, state: vf.State) -> str:
+        info = state.get("info") or {}
+        if info.get("rl_task_category") == "open-ended":
+            return "open_ended_reward"
+        return "objective_reward"
     async def score_rollout(self, state: vf.State) -> None:
         """Score one rollout and preserve QUEST infrastructure failures as ``vf.Error`` values."""
-        score = await self._objective_score_for_state(state)
+        score = await self._quest_score_for_state(state)
         state["reward"] = score
-        state["metrics"] = {"objective_reward": score}
+        state["metrics"] = {"quest_reward": score, self._metric_name(state): score}
     async def score_group(self, states: list[vf.State]) -> None:
         """Score rollouts while preserving QUEST infrastructure failures as ``vf.Error`` values."""
@@ -547,7 +601,7 @@ class QuestRubric(vf.Rubric):
             logger.warning("No states to score")
             return
         scores = await asyncio.gather(
-            *(self._objective_score_for_state(state) for state in states)
+            *(self._quest_score_for_state(state) for state in states)
         )
         avg_score = sum(scores) / len(scores)
         for state, score in zip(states, scores):
@@ -559,9 +613,15 @@ class QuestRubric(vf.Rubric):
                         turn["advantage"] = state["advantage"]
                     if turn.get("reward") is None:
                         turn["reward"] = state["reward"]
-            state["metrics"] = {"objective_reward": score}
+            state["metrics"] = {"quest_reward": score, self._metric_name(state): score}
-    async def objective_reward(self, state: vf.State, **_: Any) -> float:
+    async def quest_reward(self, state: vf.State, **_: Any) -> float:
+        info = state.get("info") or {}
+        if info.get("rl_task_category") == "open-ended":
+            return await self.open_ended_reward(state)
+        return await self.objective_reward(state)
+    async def _read_answer(self, state: vf.State) -> tuple[str, str]:
         sandbox_client = state.get("sandbox_client")
         sandbox_id = state.get("sandbox_id")
         if not sandbox_client or not sandbox_id:
@@ -583,6 +643,10 @@ class QuestRubric(vf.Rubric):
             answer_source = "completion_fallback" if answer else "missing"
         state["quest_answer"] = answer
         state["quest_answer_source"] = answer_source
+        return answer, answer_source
+    async def objective_reward(self, state: vf.State, **_: Any) -> float:
+        answer, _ = await self._read_answer(state)
         if not answer:
             state["quest_eval_error"] = "empty_answer"
             return 0.0
@@ -613,6 +677,38 @@ class QuestRubric(vf.Rubric):
         state["quest_final_score"] = final_score
         return final_score
+    async def open_ended_reward(self, state: vf.State, **_: Any) -> float:
+        answer, _ = await self._read_answer(state)
+        if not answer:
+            state["quest_eval_error"] = "empty_answer"
+            return 0.0
+        info = state.get("info") or {}
+        task_id = info.get("task_id")
+        if not isinstance(task_id, str) or not task_id:
+            raise ValueError("QUEST open-ended task is missing task_id metadata")
+        reward_model = info.get("reward_model")
+        if not isinstance(reward_model, dict):
+            raise ValueError("QUEST open-ended task is missing reward_model metadata")
+        question = str(info.get("question") or "")
+        state["quest_task_id"] = task_id
+        client = self._get_client()
+        summary = await score_open_ended_answer(
+            client=client,
+            model=self.judge_model,
+            semaphore=self._semaphore,
+            answer=answer,
+            question=question,
+            reward_model=reward_model,
+        )
+        state["quest_eval_summary"] = summary
+        final_score = float(summary.get("final_score", 0.0) or 0.0)
+        if not math.isfinite(final_score):
+            final_score = 0.0
+        final_score = max(0.0, min(1.0, final_score))
+        state["quest_final_score"] = final_score
+        state["quest_upstream_pairwise_score"] = summary.get("upstream_pairwise_score")
+        return final_score
     def _get_client(self) -> QuestOpenAIClient:
         if self._client is not None:
             return self._client
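
The literal-parsing fallback added to `taskset.py` can be illustrated with a reduced, self-contained version. This is a sketch under assumptions: `parse_node` and `parse_literal` are hypothetical stand-ins that cover only a subset of the node types handled in the diff, and the sample string is invented. The diff does not say why `array(...)` calls appear in the metadata; a plausible reading is that some dataset fields are stored as string representations containing NumPy arrays, which `ast.literal_eval` rejects.

```python
import ast
from typing import Any


def parse_node(node: ast.AST) -> Any:
    """Evaluate a restricted subset of expression nodes without eval()."""
    if isinstance(node, ast.Expression):
        return parse_node(node.body)
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.List):
        return [parse_node(item) for item in node.elts]
    if isinstance(node, ast.Dict):
        return {parse_node(k): parse_node(v) for k, v in zip(node.keys, node.values)}
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        # Unwrap array([...], dtype=...) into its single positional argument;
        # keyword arguments such as dtype are ignored.
        if node.func.id == "array" and len(node.args) == 1:
            return parse_node(node.args[0])
    raise ValueError(f"unsupported syntax: {ast.dump(node)}")


def parse_literal(value: str) -> Any:
    """Try ast.literal_eval first, then the restricted walker."""
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return parse_node(ast.parse(value, mode="eval"))


# Invented sample resembling a stringified record with an embedded array.
text = "{'criterions': {'insight': array([{'criterion': 'depth', 'weight': 1.0}], dtype=object)}}"
parsed = parse_literal(text)
```

Walking the syntax tree keeps the parser from executing arbitrary code, since any node outside the allowed set raises an error. Note one difference from the diff: this sketch raises on a failed fallback, whereas the diff's version returns the original string unchanged.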
