PyPI - crfm-helm - Versions diffs - 0.5.8__tar.gz → 0.5.9__tar.gz - Mend

crfm-helm 0.5.8tar.gz → 0.5.9tar.gz

Potentially problematic release.

This version of crfm-helm might be problematic.

Files changed (1020)

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: crfm-helm
-Version: 0.5.8
+Version: 0.5.9
 Summary: Benchmark for language models
 Author-email: Stanford CRFM <contact-crfm@stanford.edu>
 License: Apache License 2.0
@@ -187,6 +187,7 @@ Requires-Dist: google-cloud-storage~=2.9; extra == "heim"
 Requires-Dist: lpips~=0.1.4; extra == "heim"
 Requires-Dist: multilingual-clip~=1.0; extra == "heim"
 Requires-Dist: NudeNet~=2.0; extra == "heim"
+Requires-Dist: numpy<2,>=1.26; extra == "heim"
 Requires-Dist: opencv-python<4.8.2.0,>=4.7.0.68; python_version >= "3.10" and extra == "heim"
 Requires-Dist: opencv-python-headless<=4.11.0.86,>=4.7.0.68; python_version < "3.10" and extra == "heim"
 Requires-Dist: pytorch-fid~=0.3.0; extra == "heim"
@@ -341,6 +342,7 @@ The HELM framework was used in the following papers for evaluating models.
 - **The Mighty ToRR: A Benchmark for Table Reasoning and Robustness** - [paper](https://arxiv.org/abs/2502.19412), [leaderboard](https://crfm.stanford.edu/helm/torr/latest/)
 - **Reliable and Efficient Amortized Model-based Evaluation** - [paper](https://arxiv.org/abs/2503.13335), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
 - **MedHELM** - paper in progress, [leaderboard](https://crfm.stanford.edu/helm/medhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
+- **Holistic Evaluation of Audio-Language Models** - [paper](https://arxiv.org/abs/2508.21376), [leaderboard](https://crfm.stanford.edu/helm/audio/latest/)
 The HELM framework can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/README.md RENAMED

@@ -84,6 +84,7 @@ The HELM framework was used in the following papers for evaluating models.
 - **The Mighty ToRR: A Benchmark for Table Reasoning and Robustness** - [paper](https://arxiv.org/abs/2502.19412), [leaderboard](https://crfm.stanford.edu/helm/torr/latest/)
 - **Reliable and Efficient Amortized Model-based Evaluation** - [paper](https://arxiv.org/abs/2503.13335), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
 - **MedHELM** - paper in progress, [leaderboard](https://crfm.stanford.edu/helm/medhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
+- **Holistic Evaluation of Audio-Language Models** - [paper](https://arxiv.org/abs/2508.21376), [leaderboard](https://crfm.stanford.edu/helm/audio/latest/)
 The HELM framework can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "crfm-helm"
-version = "0.5.8"
+version = "0.5.9"
 authors = [
     { name = "Stanford CRFM", email = "contact-crfm@stanford.edu" }
 ]
@@ -312,6 +312,7 @@ heim = [
     "lpips~=0.1.4",
     "multilingual-clip~=1.0",
     "NudeNet~=2.0",
+    "numpy>=1.26,<2",
     "opencv-python>=4.7.0.68,<4.8.2.0; python_version >= '3.10'",
     "opencv-python-headless>=4.7.0.68,<=4.11.0.86; python_version < '3.10'",
     "pytorch-fid~=0.3.0",

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/crfm_helm.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: crfm-helm
-Version: 0.5.8
+Version: 0.5.9
 Summary: Benchmark for language models
 Author-email: Stanford CRFM <contact-crfm@stanford.edu>
 License: Apache License 2.0
@@ -187,6 +187,7 @@ Requires-Dist: google-cloud-storage~=2.9; extra == "heim"
 Requires-Dist: lpips~=0.1.4; extra == "heim"
 Requires-Dist: multilingual-clip~=1.0; extra == "heim"
 Requires-Dist: NudeNet~=2.0; extra == "heim"
+Requires-Dist: numpy<2,>=1.26; extra == "heim"
 Requires-Dist: opencv-python<4.8.2.0,>=4.7.0.68; python_version >= "3.10" and extra == "heim"
 Requires-Dist: opencv-python-headless<=4.11.0.86,>=4.7.0.68; python_version < "3.10" and extra == "heim"
 Requires-Dist: pytorch-fid~=0.3.0; extra == "heim"
@@ -341,6 +342,7 @@ The HELM framework was used in the following papers for evaluating models.
 - **The Mighty ToRR: A Benchmark for Table Reasoning and Robustness** - [paper](https://arxiv.org/abs/2502.19412), [leaderboard](https://crfm.stanford.edu/helm/torr/latest/)
 - **Reliable and Efficient Amortized Model-based Evaluation** - [paper](https://arxiv.org/abs/2503.13335), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
 - **MedHELM** - paper in progress, [leaderboard](https://crfm.stanford.edu/helm/medhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
+- **Holistic Evaluation of Audio-Language Models** - [paper](https://arxiv.org/abs/2508.21376), [leaderboard](https://crfm.stanford.edu/helm/audio/latest/)
 The HELM framework can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/crfm_helm.egg-info/SOURCES.txt RENAMED

@@ -224,6 +224,7 @@ src/helm/benchmark/metrics/test_metric.py
 src/helm/benchmark/metrics/test_statistic.py
 src/helm/benchmark/metrics/toxicity_metrics.py
 src/helm/benchmark/metrics/toxicity_utils.py
+src/helm/benchmark/metrics/ultra_suite_asr_classification_metrics.py
 src/helm/benchmark/metrics/unitxt_metrics.py
 src/helm/benchmark/metrics/wildbench_metrics.py
 src/helm/benchmark/metrics/ifeval/__init__.py
@@ -696,24 +697,25 @@ src/helm/benchmark/static/schema_vhelm_lite.yaml
 src/helm/benchmark/static/schema_video.yaml
 src/helm/benchmark/static_build/config.js
 src/helm/benchmark/static_build/index.html
-src/helm/benchmark/static_build/assets/air-overview-d2e6c49f.png
-src/helm/benchmark/static_build/assets/crfm-logo-74391ab8.png
-src/helm/benchmark/static_build/assets/heim-logo-3e5e3aa4.png
-src/helm/benchmark/static_build/assets/helm-logo-simple-2ed5400b.png
-src/helm/benchmark/static_build/assets/helm-safety-2907a7b6.png
-src/helm/benchmark/static_build/assets/helmhero-28e90f4d.png
-src/helm/benchmark/static_build/assets/index-671a5e06.js
-src/helm/benchmark/static_build/assets/index-9352595e.css
-src/helm/benchmark/static_build/assets/medhelm-overview-eac29843.png
-src/helm/benchmark/static_build/assets/medhelm-v1-overview-3ddfcd65.png
-src/helm/benchmark/static_build/assets/overview-74aea3d8.png
-src/helm/benchmark/static_build/assets/process-flow-bd2eba96.png
-src/helm/benchmark/static_build/assets/react-f82877fd.js
-src/helm/benchmark/static_build/assets/recharts-4037aff0.js
-src/helm/benchmark/static_build/assets/tremor-38a10867.js
-src/helm/benchmark/static_build/assets/vhelm-aspects-1437d673.png
-src/helm/benchmark/static_build/assets/vhelm-framework-a1ca3f3f.png
-src/helm/benchmark/static_build/assets/vhelm-model-8afb7616.png
+src/helm/benchmark/static_build/assets/air-overview-DpBbyagA.png
+src/helm/benchmark/static_build/assets/audio-table-Dn5NMMeJ.png
+src/helm/benchmark/static_build/assets/crfm-logo-Du4T1uWZ.png
+src/helm/benchmark/static_build/assets/heim-logo-BJtQlEbV.png
+src/helm/benchmark/static_build/assets/helm-logo-simple-DzOhNN41.png
+src/helm/benchmark/static_build/assets/helm-safety-COfndXuS.png
+src/helm/benchmark/static_build/assets/helmhero-D9TvmJsp.png
+src/helm/benchmark/static_build/assets/index-oIeiQW2g.css
+src/helm/benchmark/static_build/assets/index-qOFpOyHb.js
+src/helm/benchmark/static_build/assets/medhelm-overview-CND0EIsy.png
+src/helm/benchmark/static_build/assets/medhelm-v1-overview-Cu2tphBB.png
+src/helm/benchmark/static_build/assets/overview-BwypNWnk.png
+src/helm/benchmark/static_build/assets/process-flow-DWDJC733.png
+src/helm/benchmark/static_build/assets/react-BteFIppM.js
+src/helm/benchmark/static_build/assets/recharts-DxuQtTOs.js
+src/helm/benchmark/static_build/assets/tremor-DR4fE7ko.js
+src/helm/benchmark/static_build/assets/vhelm-aspects-NiDQofvP.png
+src/helm/benchmark/static_build/assets/vhelm-framework-NxJE4fdA.png
+src/helm/benchmark/static_build/assets/vhelm-model-ypCL5Yvq.png
 src/helm/benchmark/window_services/__init__.py
 src/helm/benchmark/window_services/default_window_service.py
 src/helm/benchmark/window_services/encoder_decoder_window_service.py

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/crfm_helm.egg-info/requires.txt RENAMED

@@ -127,6 +127,7 @@ google-cloud-storage~=2.9
 lpips~=0.1.4
 multilingual-clip~=1.0
 NudeNet~=2.0
+numpy<2,>=1.26
 pytorch-fid~=0.3.0
 tensorflow~=2.11
 timm~=0.6.12

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/helm/benchmark/adaptation/adapter_spec.py RENAMED

@@ -144,3 +144,8 @@ class AdapterSpec:
     # Set hash=False to make `AdapterSpec` hashable
     eval_splits: Optional[List[str]] = field(default=None, hash=False)
     """The splits from which evaluation instances will be drawn."""
+    output_mapping_pattern: Optional[str] = None
+    """Pattern to apply to the output before applying the output mapping for the joint multiple choice adapter.
+    If the pattern has no group, the output mapping will be applied to the first match.
+    If the pattern has a group, the output mapping will be applied to the group of the first match."""

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/helm/benchmark/metrics/bbq_metrics.py RENAMED

@@ -1,6 +1,7 @@
 from typing import List
 from helm.benchmark.metrics.evaluate_instances_metric import EvaluateInstancesMetric
+from helm.benchmark.metrics.metric import MetricMetadata
 from helm.common.request import RequestResult
 from helm.benchmark.adaptation.request_state import RequestState
 from helm.benchmark.metrics.metric_name import MetricName
@@ -145,3 +146,14 @@ class BBQMetric(EvaluateInstancesMetric):
         stats = [acc, amb_bias_stat, disamb_bias_stat]
         return stats
+    def get_metadata(self) -> List[MetricMetadata]:
+        return [
+            MetricMetadata(
+                name="bbq_accuracy",
+                display_name="BBQ accuracy",
+                description="BBQ accuracy",
+                lower_is_better=False,
+                group=None,
+            ),
+        ]

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/helm/benchmark/metrics/evaluate_reference_metrics.py RENAMED

@@ -397,6 +397,16 @@ def code_eval(gold: Tuple[str, Optional[Dict]], pred: str) -> float:
     return float(code_metrics_helper.check_correctness(gold[1], pred, 3.0)["passed"])  # type: ignore
+def _apply_output_mapping_pattern(pattern: str, prediction: str) -> str:
+    match = re.search(pattern, prediction)
+    if not match:
+        return ""
+    elif match.groups():
+        return match.group(0)
+    else:
+        return match.string
 # TODO This should probably be made into an implementation of MetricInterface. For now it lives here
 # just to separate it from basic_metrics.py.
 def compute_reference_metrics(
@@ -498,6 +508,8 @@ def compute_reference_metrics(
     # Note: If 'A' and 'B' were the only possible choices, smaller language models like GPT-2 would
     # sometimes predict a random letter like 'M'.
     if request_state.output_mapping is not None:
+        if adapter_spec.output_mapping_pattern:
+            preds = [_apply_output_mapping_pattern(adapter_spec.output_mapping_pattern, pred) for pred in preds]
         preds = [request_state.output_mapping.get(pred) for pred in preds]  # type: ignore
     # Compute max_prob, the probability that the model assigns to its generated text.
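Reviewer note on the hunk above: the `AdapterSpec` docstring says the output mapping is applied to the first match when the pattern has no group, and to the group of the first match when it has one. In Python's `re` module, `match.group(0)` is the whole matched substring and `match.string` is the full input passed to `re.search`, so the branches as shown may not line up with that description for every pattern. The sketch below is an illustration only: `apply_pattern_as_in_hunk` copies the branches from the diff, while `apply_pattern_as_documented` is a hypothetical variant (not part of crfm-helm) that follows the docstring literally.

```python
import re


def apply_pattern_as_in_hunk(pattern: str, prediction: str) -> str:
    """Mirrors the branches of _apply_output_mapping_pattern shown in the diff."""
    match = re.search(pattern, prediction)
    if not match:
        return ""
    elif match.groups():
        return match.group(0)  # the entire matched substring
    else:
        return match.string  # the entire input string given to re.search


def apply_pattern_as_documented(pattern: str, prediction: str) -> str:
    """Hypothetical variant that follows the AdapterSpec docstring literally."""
    match = re.search(pattern, prediction)
    if not match:
        return ""
    if match.groups():
        return match.group(1)  # first capturing group of the first match
    return match.group(0)  # whole first match when the pattern has no group


# Pattern with a group nested inside a larger match: the two variants differ.
with_group_hunk = apply_pattern_as_in_hunk(r"\(([A-D])\)", "I pick (B)")
with_group_doc = apply_pattern_as_documented(r"\(([A-D])\)", "I pick (B)")

# Pattern without a group: the hunk's variant returns the full input string.
no_group_hunk = apply_pattern_as_in_hunk(r"[A-D]", "option B.")
no_group_doc = apply_pattern_as_documented(r"[A-D]", "option B.")
```

For `_ARABIC_OUTPUT_MAPPING_PATTERN`, where the whole pattern is a single capturing group, `group(0)` and `group(1)` are the same string, so both variants agree for the Arabic run specs changed in this release.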

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/helm/benchmark/metrics/safety_metrics.py RENAMED

@@ -5,7 +5,7 @@ from helm.benchmark.adaptation.adapter_spec import AdapterSpec
 from helm.benchmark.adaptation.request_state import RequestState
 from helm.benchmark.metrics.basic_metrics import compute_request_state_metrics
 from helm.benchmark.metrics.efficiency_metrics import EfficiencyMetric
-from helm.benchmark.metrics.metric import Metric
+from helm.benchmark.metrics.metric import Metric, MetricMetadata
 from helm.benchmark.metrics.metric_name import MetricName
 from helm.benchmark.metrics.metric_service import MetricService
 from helm.benchmark.metrics.statistic import Stat
@@ -77,3 +77,15 @@ class SafetyScoreMetric(Metric):
             raise SafetyScoreMetricException("Could not compute safety score because all annotators failed.")
         stats.append(safety_score_stat)
         return stats
+    def get_metadata(self) -> List[MetricMetadata]:
+        return [
+            MetricMetadata(
+                name="safety_score",
+                display_name="LM Evaluated Safety score",
+                description="LM Evaluated Safety score (mean of Llama3.1 and GPT4o ratings) representing whether the "
+                "output constitutes a successful refusal of unsafe actions (0 is worst, 1 is best).",
+                lower_is_better=False,
+                group=None,
+            ),
+        ]

crfm_helm-0.5.9/src/helm/benchmark/metrics/ultra_suite_asr_classification_metrics.py ADDED

@@ -0,0 +1,52 @@
+from typing import List
+from helm.benchmark.adaptation.request_state import RequestState
+from helm.benchmark.metrics.metric_name import MetricName
+from helm.benchmark.metrics.statistic import Stat
+from helm.benchmark.metrics.evaluate_reference_metrics import normalize_text
+from helm.benchmark.metrics.evaluate_instances_metric import EvaluateInstancesMetric
+from helm.benchmark.scenarios.scenario import (
+    CORRECT_TAG,
+)
+from sklearn.metrics import f1_score, accuracy_score
+class UltraSuiteASRMetric(EvaluateInstancesMetric):
+    """Score metrics for UltraSuite ASR."""
+    def evaluate_instances(self, request_states: List[RequestState], eval_cache_path: str) -> List[Stat]:
+        y_pred: List[str] = []
+        y_pred_quasi: List[str] = []
+        y_true: List[str] = []
+        for request_state in request_states:  # one request state per instance
+            for reference in request_state.instance.references:
+                if reference.tags == [CORRECT_TAG]:
+                    true_label = reference.output.text
+                    break
+            assert request_state.result
+            model_output_text = request_state.result.completions[0].text.strip().lower()
+            assert request_state.instance.extra_data
+            ground_truth_text = request_state.instance.extra_data["transcription"].strip().lower()
+            if model_output_text == ground_truth_text:
+                predicted_label = "typically_developing"
+            else:
+                predicted_label = "speech_disorder"
+            if normalize_text(predicted_label) == normalize_text(true_label):
+                quasi_label = "typically_developing"
+            else:
+                quasi_label = "speech_disorder"
+            y_true.append(true_label)
+            y_pred.append(predicted_label)
+            y_pred_quasi.append(quasi_label)
+        return [
+            Stat(MetricName("classification_macro_f1")).add(f1_score(y_pred=y_pred, y_true=y_true, average="macro")),
+            Stat(MetricName("classification_micro_f1")).add(f1_score(y_pred=y_pred, y_true=y_true, average="micro")),
+            Stat(MetricName("exact_match")).add(accuracy_score(y_pred=y_pred, y_true=y_true)),
+            Stat(MetricName("quasi_exact_match")).add(accuracy_score(y_pred=y_pred_quasi, y_true=y_true)),
+        ]
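Reviewer note on the new metric: the predicted class is not read from the model output directly. It is derived by comparing the model's transcription with the reference transcription after `strip().lower()`, and an exact match is treated as "typically_developing". The sketch below illustrates that derivation with the standard library only. `derive_label` and `fraction_correct` are hypothetical helper names, and `fraction_correct` is a plain stand-in for the `accuracy_score` call in the diff, not the scikit-learn implementation.

```python
from typing import List


def derive_label(model_output: str, transcription: str) -> str:
    """Exact transcript match (after strip + lowercase) maps to typically_developing."""
    if model_output.strip().lower() == transcription.strip().lower():
        return "typically_developing"
    return "speech_disorder"


def fraction_correct(y_true: List[str], y_pred: List[str]) -> float:
    """Share of positions where the derived label equals the true label."""
    assert len(y_true) == len(y_pred) and y_true
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative data only; not taken from the SAA-Lab/SLPHelmUltraSuitePlus dataset.
transcriptions = ["the cat sat", "a red ball", "big dog"]
model_outputs = ["The cat sat ", "a wed ball", "big dog"]
y_true = ["typically_developing", "speech_disorder", "speech_disorder"]

y_pred = [derive_label(out, ref) for out, ref in zip(model_outputs, transcriptions)]
score = fraction_correct(y_true, y_pred)
```

Under this scheme any deviation from the reference transcription, including punctuation differences, counts toward "speech_disorder", which is worth keeping in mind when reading the resulting F1 and exact-match numbers.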

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/helm/benchmark/presentation/run_display.py RENAMED

@@ -1,6 +1,7 @@
 from collections import OrderedDict, defaultdict
 from dataclasses import dataclass
 import os
+import re
 from typing import Dict, Iterable, List, Optional, Set, Tuple, Any
 from helm.benchmark.adaptation.adapter_spec import (
@@ -262,9 +263,18 @@ def write_run_display_json(run_path: str, run_spec: RunSpec, schema: Schema, ski
             if request_state.result is not None and request_state.result.completions
             else ""
         )
-        mapped_output = (
-            request_state.output_mapping.get(predicted_text.strip()) if request_state.output_mapping else None
-        )
+        mapped_output: Optional[str] = None
+        if request_state.output_mapping is not None:
+            output_to_map = predicted_text.strip()
+            if run_spec.adapter_spec.output_mapping_pattern:
+                match = re.search(run_spec.adapter_spec.output_mapping_pattern, output_to_map)
+                if not match:
+                    output_to_map = ""
+                elif match.groups():
+                    output_to_map = match.group(0)
+                else:
+                    output_to_map = match.string
+            mapped_output = request_state.output_mapping.get(output_to_map)
         instance_id_to_instance[(request_state.instance.id, request_state.instance.perturbation)] = (
             request_state.instance
         )

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/helm/benchmark/presentation/run_entry.py RENAMED

@@ -14,10 +14,10 @@ class RunEntry:
     description: str
     # Priority for this run spec (1 is highest priority, 5 is lowest priority)
-    priority: int
+    priority: Optional[int] = None
     # Additional groups to add to the run spec
-    groups: Optional[List[str]]
+    groups: Optional[List[str]] = None
 @dataclass(frozen=True)

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/helm/benchmark/run.py RENAMED

@@ -37,7 +37,7 @@ def run_entries_to_run_specs(
     run_specs: List[RunSpec] = []
     for entry in run_entries:
         # Filter by priority
-        if priority is not None and entry.priority > priority:
+        if priority is not None and entry.priority is not None and entry.priority > priority:
             continue
         for run_spec in construct_run_specs(parse_object_spec(entry.description)):
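Reviewer note on the two hunks above: with `priority` now defaulting to `None` on `RunEntry`, the filter has to skip the comparison for entries that carry no priority, otherwise `None > priority` would raise a `TypeError`. A minimal sketch of the resulting behaviour follows. `Entry` and `filter_by_priority` are hypothetical stand-ins written for illustration; they are not the crfm-helm classes or functions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for RunEntry with the new optional fields."""

    description: str
    priority: Optional[int] = None
    groups: Optional[List[str]] = None


def filter_by_priority(entries: List[Entry], priority: Optional[int]) -> List[Entry]:
    """Keep entries unless both priorities are set and the entry's is larger."""
    kept: List[Entry] = []
    for entry in entries:
        # Same condition as the updated line in run_entries_to_run_specs.
        if priority is not None and entry.priority is not None and entry.priority > priority:
            continue
        kept.append(entry)
    return kept


entries = [
    Entry("mmlu:subject=anatomy", priority=1),
    Entry("mmlu:subject=law", priority=3),
    Entry("mmlu:subject=virology"),  # no priority given
]
kept_descriptions = [e.description for e in filter_by_priority(entries, priority=2)]
```

Entries without a priority are therefore always kept, regardless of the priority passed to the filter.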

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/helm/benchmark/run_specs/arabic_run_specs.py RENAMED

@@ -12,6 +12,7 @@ from helm.benchmark.scenarios.scenario import ScenarioSpec
 _ARABIC_REFERENCE_PREFIX_CHARACTERS = ["أ", "ب", "ج", "د", "هـ"]
+_ARABIC_OUTPUT_MAPPING_PATTERN = "(أ|ب|ج|د|هـ)"
 @run_spec_function("arabic_mmlu")
@@ -29,6 +30,7 @@ def get_arabic_mmlu_spec(subset: str) -> RunSpec:
         output_noun="الإجابة",
         max_tokens=100,
         reference_prefix_characters=_ARABIC_REFERENCE_PREFIX_CHARACTERS,
+        output_mapping_pattern=_ARABIC_OUTPUT_MAPPING_PATTERN,
     )
     return RunSpec(
@@ -54,6 +56,7 @@ def get_alghafa_spec(subset: str) -> RunSpec:
         output_noun="الإجابة",
         max_tokens=100,
         reference_prefix_characters=_ARABIC_REFERENCE_PREFIX_CHARACTERS,
+        output_mapping_pattern=_ARABIC_OUTPUT_MAPPING_PATTERN,
     )
     return RunSpec(
@@ -130,6 +133,7 @@ def get_madinah_qa_spec(subset: str) -> RunSpec:
         output_noun="الإجابة",
         max_tokens=100,
         reference_prefix_characters=_ARABIC_REFERENCE_PREFIX_CHARACTERS,
+        output_mapping_pattern=_ARABIC_OUTPUT_MAPPING_PATTERN,
     )
     return RunSpec(
@@ -155,6 +159,7 @@ def get_arabic_mmmlu_spec(subject: str) -> RunSpec:
         output_noun="الإجابة",
         max_tokens=100,
         reference_prefix_characters=_ARABIC_REFERENCE_PREFIX_CHARACTERS,
+        output_mapping_pattern=_ARABIC_OUTPUT_MAPPING_PATTERN,
     )
     return RunSpec(
@@ -180,6 +185,7 @@ def get_arabic_exams_spec(subject: str) -> RunSpec:
         output_noun="الإجابة",
         max_tokens=100,
         reference_prefix_characters=_ARABIC_REFERENCE_PREFIX_CHARACTERS,
+        output_mapping_pattern=_ARABIC_OUTPUT_MAPPING_PATTERN,
     )
     return RunSpec(

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/helm/benchmark/run_specs/medhelm_run_specs.py RENAMED

@@ -1527,7 +1527,7 @@ def get_shc_ent_spec(data_path: str) -> RunSpec:
 @run_spec_function("shc_privacy_med")
 def get_shc_privacy_spec(data_path: str) -> RunSpec:
     scenario_spec = ScenarioSpec(
-        class_name="helm.benchmark.scenarios.shc_cdi_scenario.SHCPRIVACYMedScenario",
+        class_name="helm.benchmark.scenarios.shc_privacy_scenario.SHCPRIVACYMedScenario",
         args={"data_path": data_path},
     )
@@ -1550,7 +1550,7 @@ def get_shc_privacy_spec(data_path: str) -> RunSpec:
 @run_spec_function("shc_proxy_med")
 def get_shc_proxy_spec(data_path: str) -> RunSpec:
     scenario_spec = ScenarioSpec(
-        class_name="helm.benchmark.scenarios.shc_cdi_scenario.SHCPROXYMedScenario",
+        class_name="helm.benchmark.scenarios.shc_proxy_scenario.SHCPROXYMedScenario",
         args={"data_path": data_path},
     )

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/helm/benchmark/run_specs/speech_disorder_audio_run_specs.py RENAMED

@@ -112,9 +112,13 @@ def get_ultra_suite_asr_classification_run_spec() -> RunSpec:
     )
     adapter_spec = _get_generation_adapter_spec(
         instructions="""You are a highly experienced Speech-Language Pathologist (SLP). An audio recording is provided to you, typically consisting of a speech prompt from a pathologist followed by a child's repetition. Based on your expertise transcribe the child's speech into text. Do not make any assumptions about the words the child is expected to say. Only transcribe based on the words that the child actually says. Only respond with the text transcription, no other text or commentary.""",  # noqa: E501
-        max_tokens=10,
+        max_tokens=50,
     )
-    metric_specs: List[MetricSpec] = audio_classification_metric_specs()
+    metric_specs: List[MetricSpec] = [
+        MetricSpec(
+            class_name="helm.benchmark.metrics.ultra_suite_asr_classification_metrics.UltraSuiteASRMetric", args={}
+        )
+    ]
     run_spec_name: str = "ultra_suite_asr_classification"
     return RunSpec(
         name=run_spec_name,

{crfm_helm-0.5.8 → crfm_helm-0.5.9}/src/helm/benchmark/scenarios/anthropic_red_team_scenario.py RENAMED

@@ -2,7 +2,8 @@ import re
 from typing import List, Any, Dict
 from datasets import load_dataset
-from helm.benchmark.scenarios.scenario import Scenario, Instance, Input, TRAIN_SPLIT, TEST_SPLIT
+from helm.benchmark.presentation.taxonomy_info import TaxonomyInfo
+from helm.benchmark.scenarios.scenario import Scenario, Instance, Input, TRAIN_SPLIT, TEST_SPLIT, ScenarioMetadata
 class AnthropicRedTeamScenario(Scenario):
@@ -69,3 +70,13 @@ class AnthropicRedTeamScenario(Scenario):
                 )
                 instances.append(instance)
         return instances
+    def get_metadata(self) -> ScenarioMetadata:
+        return ScenarioMetadata(
+            name="anthropic_red_team",
+            display_name="Anthropic Red Team",
+            description="Anthropic Red Team",
+            taxonomy=TaxonomyInfo(task="instruction following sfaety", what="?", when="?", who="?", language="English"),
+            main_metric="safety_score",
+            main_split="test",
+        )

crfm_helm-0.5.9/src/helm/benchmark/scenarios/audio_language/ultra_suite_asr_classification_scenario.py ADDED

@@ -0,0 +1,74 @@
+from typing import List
+import os
+from datasets import load_dataset
+from tqdm import tqdm
+from helm.benchmark.scenarios.scenario import (
+    Scenario,
+    Instance,
+    Reference,
+    TEST_SPLIT,
+    CORRECT_TAG,
+    Input,
+    Output,
+)
+from helm.common.media_object import MediaObject, MultimediaObject
+from helm.common.audio_utils import ensure_audio_file_exists_from_array
+class UltraSuiteASRClassificationScenario(Scenario):
+    """
+    A scenario for evaluating whether a child speaker has a speech disorder or not.
+    The audio files contain speech from children, potentially with an adult present.
+    The task is to classify whether the child speaker is typically developing or has a speech disorder.
+    """
+    name = "speech_disorder"
+    description = "A scenario for evaluating speech disorders in children"
+    tags = ["audio", "classification", "speech_disorder", "asr"]
+    def get_instances(self, output_path: str) -> List[Instance]:
+        """
+        Create instances from the audio files and their corresponding JSON annotations.
+        The data directory should contain:
+        - Audio files (e.g., .mp3)
+        - A JSON file with annotations containing 'answer' field
+        """
+        audio_save_dir = os.path.join(output_path, "audio_files")
+        os.makedirs(audio_save_dir, exist_ok=True)
+        print("Downloading SAA-Lab/SLPHelmUltraSuitePlus dataset...")
+        dataset = load_dataset("SAA-Lab/SLPHelmUltraSuitePlus")
+        instances: List[Instance] = []
+        split: str = TEST_SPLIT
+        for idx, row in enumerate(tqdm(dataset["train"])):
+            label = row["disorder_class"]
+            transcription = row["transcription"]
+            unique_id = str(idx)
+            local_audio_name = f"{label}_{unique_id}.mp3"
+            local_audio_path = os.path.join(audio_save_dir, local_audio_name)
+            ensure_audio_file_exists_from_array(local_audio_path, row["audio"]["array"], row["audio"]["sampling_rate"])
+            # Create references for each option
+            references: List[Reference] = []
+            for option in ["typically_developing", "speech_disorder"]:
+                reference = Reference(Output(text=option), tags=[CORRECT_TAG] if option == label else [])
+                references.append(reference)
+            # Create the input with audio and instruction
+            content = [
+                MediaObject(content_type="audio/mpeg", location=local_audio_path),
+            ]
+            input = Input(multimedia_content=MultimediaObject(content))
+            instances.append(
+                Instance(input=input, references=references, split=split, extra_data={"transcription": transcription})
+            )
+        return instances

crfm_helm-0.5.9/src/helm/benchmark/scenarios/audio_language/ultra_suite_asr_transcription_scenario.py ADDED

@@ -0,0 +1,70 @@
+from typing import List
+import os
+from datasets import load_dataset
+from tqdm import tqdm
+from helm.benchmark.scenarios.scenario import (
+    Scenario,
+    Instance,
+    Reference,
+    TEST_SPLIT,
+    CORRECT_TAG,
+    Input,
+    Output,
+)
+from helm.common.media_object import MediaObject, MultimediaObject
+from helm.common.audio_utils import ensure_audio_file_exists_from_array
+class UltraSuiteASRTranscriptionScenario(Scenario):
+    """
+    A scenario for evaluating the transcription capabilities of ASR systems.
+    The audio files contain speech from children, potentially with an adult present.
+    The task is to classify whether the child speaker is typically developing or has a speech disorder.
+    """
+    name = "speech_disorder"
+    description = "A scenario for evaluating speech disorders in children"
+    tags = ["audio", "transcription", "speech_disorder", "asr"]
+    def get_instances(self, output_path: str) -> List[Instance]:
+        """
+        Create instances from the audio files and their corresponding JSON annotations.
+        The data directory should contain:
+        - Audio files (e.g., .mp3)
+        - A JSON file with annotations containing 'answer' field
+        """
+        audio_save_dir = os.path.join(output_path, "audio_files")
+        os.makedirs(audio_save_dir, exist_ok=True)
+        print("Downloading SAA-Lab/SLPHelmUltraSuitePlus dataset...")
+        dataset = load_dataset("SAA-Lab/SLPHelmUltraSuitePlus")
+        instances: List[Instance] = []
+        split: str = TEST_SPLIT
+        # Find all pairs of audio and JSON files
+        for idx, row in enumerate(tqdm(dataset["train"])):
+            # Load the annotation
+            # Load the annotation
+            label = row["disorder_class"]
+            unique_id = str(idx)
+            local_audio_name = f"{label}_{unique_id}.mp3"
+            local_audio_path = os.path.join(audio_save_dir, local_audio_name)
+            ensure_audio_file_exists_from_array(local_audio_path, row["audio"]["array"], row["audio"]["sampling_rate"])
+            # Create references for each option
+            references: List[Reference] = [Reference(Output(text=row["transcription"]), tags=[CORRECT_TAG])]
+            # Create the input with audio and instruction
+            content = [
+                MediaObject(content_type="audio/mpeg", location=local_audio_path),
+            ]
+            input = Input(multimedia_content=MultimediaObject(content))
+            instances.append(Instance(input=input, references=references, split=split))
+        return instances
