crfm-helm 0.5.2__py3-none-any.whl → 0.5.4__py3-none-any.whl
This diff shows the changes between two publicly released versions of the package, as they appear in their public registry, and is provided for informational purposes only.
Potentially problematic release: this version of crfm-helm has been flagged as possibly problematic (see the registry page for details).
- {crfm_helm-0.5.2.dist-info → crfm_helm-0.5.4.dist-info}/METADATA +81 -112
- {crfm_helm-0.5.2.dist-info → crfm_helm-0.5.4.dist-info}/RECORD +165 -155
- {crfm_helm-0.5.2.dist-info → crfm_helm-0.5.4.dist-info}/WHEEL +1 -1
- helm/benchmark/adaptation/adapters/multiple_choice_joint_adapter.py +12 -5
- helm/benchmark/adaptation/adapters/test_generation_adapter.py +12 -12
- helm/benchmark/adaptation/adapters/test_language_modeling_adapter.py +8 -8
- helm/benchmark/adaptation/adapters/test_multiple_choice_joint_adapter.py +77 -9
- helm/benchmark/adaptation/common_adapter_specs.py +2 -0
- helm/benchmark/annotation/anthropic_red_team_annotator.py +57 -0
- helm/benchmark/annotation/call_center_annotator.py +258 -0
- helm/benchmark/annotation/financebench_annotator.py +79 -0
- helm/benchmark/annotation/harm_bench_annotator.py +55 -0
- helm/benchmark/annotation/{image2structure → image2struct}/latex_compiler_annotator.py +2 -2
- helm/benchmark/annotation/{image2structure → image2struct}/lilypond_compiler_annotator.py +5 -3
- helm/benchmark/annotation/{image2structure → image2struct}/webpage_compiler_annotator.py +5 -5
- helm/benchmark/annotation/live_qa_annotator.py +37 -45
- helm/benchmark/annotation/medication_qa_annotator.py +36 -44
- helm/benchmark/annotation/model_as_judge.py +96 -0
- helm/benchmark/annotation/simple_safety_tests_annotator.py +50 -0
- helm/benchmark/annotation/xstest_annotator.py +100 -0
- helm/benchmark/metrics/annotation_metrics.py +108 -0
- helm/benchmark/metrics/bhasa_metrics.py +188 -0
- helm/benchmark/metrics/bhasa_metrics_specs.py +10 -0
- helm/benchmark/metrics/code_metrics_helper.py +11 -1
- helm/benchmark/metrics/safety_metrics.py +79 -0
- helm/benchmark/metrics/summac/model_summac.py +3 -3
- helm/benchmark/metrics/tokens/test_ai21_token_cost_estimator.py +2 -2
- helm/benchmark/metrics/tokens/test_openai_token_cost_estimator.py +4 -4
- helm/benchmark/metrics/unitxt_metrics.py +17 -3
- helm/benchmark/metrics/vision_language/image_metrics.py +7 -3
- helm/benchmark/metrics/vision_language/image_utils.py +1 -1
- helm/benchmark/model_metadata_registry.py +3 -3
- helm/benchmark/presentation/create_plots.py +1 -1
- helm/benchmark/presentation/schema.py +3 -0
- helm/benchmark/presentation/summarize.py +106 -256
- helm/benchmark/presentation/test_run_entry.py +1 -0
- helm/benchmark/presentation/test_summarize.py +145 -3
- helm/benchmark/run.py +15 -0
- helm/benchmark/run_expander.py +83 -30
- helm/benchmark/run_specs/bhasa_run_specs.py +652 -0
- helm/benchmark/run_specs/call_center_run_specs.py +152 -0
- helm/benchmark/run_specs/decodingtrust_run_specs.py +8 -8
- helm/benchmark/run_specs/experimental_run_specs.py +52 -0
- helm/benchmark/run_specs/finance_run_specs.py +82 -1
- helm/benchmark/run_specs/safety_run_specs.py +154 -0
- helm/benchmark/run_specs/vlm_run_specs.py +100 -24
- helm/benchmark/scenarios/anthropic_red_team_scenario.py +71 -0
- helm/benchmark/scenarios/banking77_scenario.py +51 -0
- helm/benchmark/scenarios/bhasa_scenario.py +1942 -0
- helm/benchmark/scenarios/call_center_scenario.py +84 -0
- helm/benchmark/scenarios/decodingtrust_stereotype_bias_scenario.py +2 -1
- helm/benchmark/scenarios/ewok_scenario.py +116 -0
- helm/benchmark/scenarios/fin_qa_scenario.py +2 -0
- helm/benchmark/scenarios/financebench_scenario.py +53 -0
- helm/benchmark/scenarios/harm_bench_scenario.py +59 -0
- helm/benchmark/scenarios/raft_scenario.py +1 -1
- helm/benchmark/scenarios/scenario.py +1 -1
- helm/benchmark/scenarios/simple_safety_tests_scenario.py +33 -0
- helm/benchmark/scenarios/test_commonsense_scenario.py +21 -0
- helm/benchmark/scenarios/test_ewok_scenario.py +25 -0
- helm/benchmark/scenarios/test_financebench_scenario.py +26 -0
- helm/benchmark/scenarios/test_gsm_scenario.py +31 -0
- helm/benchmark/scenarios/test_legalbench_scenario.py +30 -0
- helm/benchmark/scenarios/test_math_scenario.py +2 -8
- helm/benchmark/scenarios/test_med_qa_scenario.py +30 -0
- helm/benchmark/scenarios/test_mmlu_scenario.py +33 -0
- helm/benchmark/scenarios/test_narrativeqa_scenario.py +73 -0
- helm/benchmark/scenarios/thai_exam_scenario.py +4 -4
- helm/benchmark/scenarios/vision_language/a_okvqa_scenario.py +1 -1
- helm/benchmark/scenarios/vision_language/bingo_scenario.py +2 -2
- helm/benchmark/scenarios/vision_language/crossmodal_3600_scenario.py +2 -1
- helm/benchmark/scenarios/vision_language/exams_v_scenario.py +104 -0
- helm/benchmark/scenarios/vision_language/fair_face_scenario.py +136 -0
- helm/benchmark/scenarios/vision_language/flickr30k_scenario.py +1 -1
- helm/benchmark/scenarios/vision_language/gqa_scenario.py +2 -2
- helm/benchmark/scenarios/vision_language/hateful_memes_scenario.py +1 -1
- helm/benchmark/scenarios/vision_language/{image2structure → image2struct}/chart2csv_scenario.py +1 -1
- helm/benchmark/scenarios/vision_language/{image2structure → image2struct}/latex_scenario.py +3 -3
- helm/benchmark/scenarios/vision_language/{image2structure → image2struct}/musicsheet_scenario.py +1 -1
- helm/benchmark/scenarios/vision_language/{image2structure → image2struct}/utils_latex.py +31 -39
- helm/benchmark/scenarios/vision_language/{image2structure → image2struct}/webpage/driver.py +1 -1
- helm/benchmark/scenarios/vision_language/{image2structure → image2struct}/webpage/utils.py +1 -1
- helm/benchmark/scenarios/vision_language/{image2structure → image2struct}/webpage_scenario.py +41 -12
- helm/benchmark/scenarios/vision_language/math_vista_scenario.py +1 -1
- helm/benchmark/scenarios/vision_language/mementos_scenario.py +3 -3
- helm/benchmark/scenarios/vision_language/mm_safety_bench_scenario.py +2 -2
- helm/benchmark/scenarios/vision_language/mme_scenario.py +21 -18
- helm/benchmark/scenarios/vision_language/mmmu_scenario.py +1 -1
- helm/benchmark/scenarios/vision_language/pairs_scenario.py +1 -1
- helm/benchmark/scenarios/vision_language/pope_scenario.py +2 -1
- helm/benchmark/scenarios/vision_language/real_world_qa_scenario.py +57 -0
- helm/benchmark/scenarios/vision_language/seed_bench_scenario.py +7 -5
- helm/benchmark/scenarios/vision_language/unicorn_scenario.py +2 -2
- helm/benchmark/scenarios/vision_language/vibe_eval_scenario.py +6 -3
- helm/benchmark/scenarios/vision_language/viz_wiz_scenario.py +1 -1
- helm/benchmark/scenarios/vision_language/vqa_scenario.py +3 -1
- helm/benchmark/scenarios/xstest_scenario.py +35 -0
- helm/benchmark/server.py +1 -6
- helm/benchmark/static/schema_air_bench.yaml +750 -750
- helm/benchmark/static/schema_bhasa.yaml +709 -0
- helm/benchmark/static/schema_call_center.yaml +232 -0
- helm/benchmark/static/schema_cleva.yaml +768 -0
- helm/benchmark/static/schema_decodingtrust.yaml +444 -0
- helm/benchmark/static/schema_ewok.yaml +367 -0
- helm/benchmark/static/schema_finance.yaml +55 -9
- helm/benchmark/static/{schema_image2structure.yaml → schema_image2struct.yaml} +231 -90
- helm/benchmark/static/schema_legal.yaml +566 -0
- helm/benchmark/static/schema_safety.yaml +266 -0
- helm/benchmark/static/schema_tables.yaml +149 -8
- helm/benchmark/static/schema_thai.yaml +21 -0
- helm/benchmark/static/schema_vhelm.yaml +137 -101
- helm/benchmark/static_build/assets/accenture-6f97eeda.png +0 -0
- helm/benchmark/static_build/assets/aisingapore-6dfc9acf.png +0 -0
- helm/benchmark/static_build/assets/cresta-9e22b983.png +0 -0
- helm/benchmark/static_build/assets/cuhk-8c5631e9.png +0 -0
- helm/benchmark/static_build/assets/index-05c76bb1.css +1 -0
- helm/benchmark/static_build/assets/index-3ee38b3d.js +10 -0
- helm/benchmark/static_build/assets/scb10x-204bd786.png +0 -0
- helm/benchmark/static_build/assets/vhelm-aspects-1437d673.png +0 -0
- helm/benchmark/static_build/assets/vhelm-framework-a1ca3f3f.png +0 -0
- helm/benchmark/static_build/assets/vhelm-model-8afb7616.png +0 -0
- helm/benchmark/static_build/assets/wellsfargo-a86a6c4a.png +0 -0
- helm/benchmark/static_build/index.html +2 -2
- helm/benchmark/window_services/test_openai_window_service.py +8 -8
- helm/benchmark/window_services/tokenizer_service.py +0 -5
- helm/clients/ai21_client.py +71 -1
- helm/clients/anthropic_client.py +7 -19
- helm/clients/huggingface_client.py +38 -37
- helm/clients/nvidia_nim_client.py +35 -0
- helm/clients/openai_client.py +18 -4
- helm/clients/palmyra_client.py +24 -0
- helm/clients/perspective_api_client.py +11 -6
- helm/clients/test_client.py +4 -6
- helm/clients/together_client.py +22 -0
- helm/clients/vision_language/open_flamingo_client.py +1 -2
- helm/clients/vision_language/palmyra_vision_client.py +28 -13
- helm/common/cache.py +8 -30
- helm/common/images_utils.py +6 -0
- helm/common/key_value_store.py +9 -9
- helm/common/mongo_key_value_store.py +5 -4
- helm/common/request.py +16 -0
- helm/common/test_cache.py +1 -48
- helm/common/tokenization_request.py +0 -9
- helm/config/model_deployments.yaml +444 -329
- helm/config/model_metadata.yaml +513 -111
- helm/config/tokenizer_configs.yaml +140 -11
- helm/proxy/example_queries.py +14 -21
- helm/proxy/server.py +0 -9
- helm/proxy/services/remote_service.py +0 -6
- helm/proxy/services/server_service.py +6 -20
- helm/proxy/services/service.py +0 -6
- helm/proxy/token_counters/test_auto_token_counter.py +2 -2
- helm/tokenizers/ai21_tokenizer.py +51 -59
- helm/tokenizers/cohere_tokenizer.py +0 -75
- helm/tokenizers/huggingface_tokenizer.py +0 -1
- helm/tokenizers/test_ai21_tokenizer.py +48 -0
- helm/benchmark/data_overlap/data_overlap_spec.py +0 -86
- helm/benchmark/data_overlap/export_scenario_text.py +0 -119
- helm/benchmark/data_overlap/light_scenario.py +0 -60
- helm/benchmark/scenarios/vision_language/image2structure/webpage/__init__.py +0 -0
- helm/benchmark/static/benchmarking.css +0 -156
- helm/benchmark/static/benchmarking.js +0 -1705
- helm/benchmark/static/config.js +0 -3
- helm/benchmark/static/general.js +0 -122
- helm/benchmark/static/images/crfm-logo.png +0 -0
- helm/benchmark/static/images/helm-logo-simple.png +0 -0
- helm/benchmark/static/images/helm-logo.png +0 -0
- helm/benchmark/static/images/language-model-helm.png +0 -0
- helm/benchmark/static/images/organizations/ai21.png +0 -0
- helm/benchmark/static/images/organizations/anthropic.png +0 -0
- helm/benchmark/static/images/organizations/bigscience.png +0 -0
- helm/benchmark/static/images/organizations/cohere.png +0 -0
- helm/benchmark/static/images/organizations/eleutherai.png +0 -0
- helm/benchmark/static/images/organizations/google.png +0 -0
- helm/benchmark/static/images/organizations/meta.png +0 -0
- helm/benchmark/static/images/organizations/microsoft.png +0 -0
- helm/benchmark/static/images/organizations/nvidia.png +0 -0
- helm/benchmark/static/images/organizations/openai.png +0 -0
- helm/benchmark/static/images/organizations/together.png +0 -0
- helm/benchmark/static/images/organizations/tsinghua-keg.png +0 -0
- helm/benchmark/static/images/organizations/yandex.png +0 -0
- helm/benchmark/static/images/scenarios-by-metrics.png +0 -0
- helm/benchmark/static/images/taxonomy-scenarios.png +0 -0
- helm/benchmark/static/index.html +0 -68
- helm/benchmark/static/info-icon.png +0 -0
- helm/benchmark/static/json-urls.js +0 -69
- helm/benchmark/static/plot-captions.js +0 -27
- helm/benchmark/static/utils.js +0 -285
- helm/benchmark/static_build/assets/index-30dbceba.js +0 -10
- helm/benchmark/static_build/assets/index-66b02d40.css +0 -1
- helm/benchmark/static_build/assets/vhelm-framework-cde7618a.png +0 -0
- helm/benchmark/static_build/assets/vhelm-model-6d812526.png +0 -0
- helm/benchmark/window_services/ai21_window_service.py +0 -247
- helm/benchmark/window_services/cohere_window_service.py +0 -101
- helm/benchmark/window_services/test_ai21_window_service.py +0 -163
- helm/benchmark/window_services/test_cohere_window_service.py +0 -75
- helm/benchmark/window_services/test_cohere_window_service_utils.py +0 -8328
- helm/benchmark/window_services/test_ice_window_service.py +0 -327
- helm/tokenizers/ice_tokenizer.py +0 -30
- helm/tokenizers/test_ice_tokenizer.py +0 -57
- {crfm_helm-0.5.2.dist-info → crfm_helm-0.5.4.dist-info}/LICENSE +0 -0
- {crfm_helm-0.5.2.dist-info → crfm_helm-0.5.4.dist-info}/entry_points.txt +0 -0
- {crfm_helm-0.5.2.dist-info → crfm_helm-0.5.4.dist-info}/top_level.txt +0 -0
- /helm/benchmark/annotation/{image2structure → image2struct}/__init__.py +0 -0
- /helm/benchmark/annotation/{image2structure → image2struct}/image_compiler_annotator.py +0 -0
- /helm/benchmark/{data_overlap → scenarios/vision_language/image2struct}/__init__.py +0 -0
- /helm/benchmark/scenarios/vision_language/{image2structure/image2structure_scenario.py → image2struct/image2struct_scenario.py} +0 -0
- /helm/benchmark/scenarios/vision_language/{image2structure → image2struct/webpage}/__init__.py +0 -0
- /helm/benchmark/scenarios/vision_language/{image2structure → image2struct}/webpage/jekyll_server.py +0 -0
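A practical note on the image2structure → image2struct renames listed above: any downstream code that imports these modules by their old dotted path will need updating. The sketch below is illustrative only; the module paths are taken from the file list, no specific classes are assumed, and importing them requires the relevant optional dependencies of the installed wheel.

# Old module path (crfm-helm 0.5.2):
# from helm.benchmark.annotation.image2structure import latex_compiler_annotator

# New module paths (crfm-helm 0.5.4), after the package directories were renamed to image2struct:
from helm.benchmark.annotation.image2struct import latex_compiler_annotator  # noqa: F401
from helm.benchmark.scenarios.vision_language.image2struct import latex_scenario  # noqa: F401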
@@ -0,0 +1,266 @@
+---
+############################################################
+metrics:
+  # Infrastructure metrics:
+  - name: num_perplexity_tokens
+    display_name: '# tokens'
+    description: Average number of tokens in the predicted output (for language modeling, the input too).
+  - name: num_bytes
+    display_name: '# bytes'
+    description: Average number of bytes in the predicted output (for language modeling, the input too).
+
+  - name: num_references
+    display_name: '# ref'
+    description: Number of references.
+  - name: num_train_trials
+    display_name: '# trials'
+    description: Number of trials, where in each trial we choose an independent, random set of training instances.
+  - name: estimated_num_tokens_cost
+    display_name: 'cost'
+    description: An estimate of the number of tokens (including prompt and output completions) needed to perform the request.
+  - name: num_prompt_tokens
+    display_name: '# prompt tokens'
+    description: Number of tokens in the prompt.
+  - name: num_prompt_characters
+    display_name: '# prompt chars'
+    description: Number of characters in the prompt.
+  - name: num_completion_tokens
+    display_name: '# completion tokens'
+    description: Actual number of completion tokens (over all completions).
+  - name: num_output_tokens
+    display_name: '# output tokens'
+    description: Actual number of output tokens.
+  - name: max_num_output_tokens
+    display_name: 'Max output tokens'
+    description: Maximum number of output tokens (overestimate since we might stop earlier due to stop sequences).
+  - name: num_requests
+    display_name: '# requests'
+    description: Number of distinct API requests.
+  - name: num_instances
+    display_name: '# eval'
+    description: Number of evaluation instances.
+  - name: num_train_instances
+    display_name: '# train'
+    description: Number of training instances (e.g., in-context examples).
+  - name: prompt_truncated
+    display_name: truncated
+    description: Fraction of instances where the prompt itself was truncated (implies that there were no in-context examples).
+  - name: finish_reason_length
+    display_name: finish b/c length
+    description: Fraction of instances where the the output was terminated because of the max tokens limit.
+  - name: finish_reason_stop
+    display_name: finish b/c stop
+    description: Fraction of instances where the the output was terminated because of the stop sequences.
+  - name: finish_reason_endoftext
+    display_name: finish b/c endoftext
+    description: Fraction of instances where the the output was terminated because the end of text token was generated.
+  - name: finish_reason_unknown
+    display_name: finish b/c unknown
+    description: Fraction of instances where the the output was terminated for unknown reasons.
+  - name: num_completions
+    display_name: '# completions'
+    description: Number of completions.
+  - name: predicted_index
+    display_name: Predicted index
+    description: Integer index of the reference (0, 1, ...) that was predicted by the model (for multiple-choice).
+
+  # Accuracy metrics:
+  - name: exact_match
+    display_name: Exact match
+    short_display_name: EM
+    description: Fraction of instances that the predicted output matches a correct reference exactly.
+    lower_is_better: false
+  - name: f1_score
+    display_name: F1
+    description: Average F1 score in terms of word overlap between the model output and correct reference.
+    lower_is_better: false
+  - name: safety_score
+    display_name: LM Evaluated Safety score
+    description: LM Evaluated Safety score (mean of Llama3.1 and GPT4o ratings) representing whether the output constitutes a successful refusal of unsafe actions (0 is worst, 1 is best).
+    lower_is_better: false
+  - name: bbq_accuracy
+    display_name: BBQ accuracy
+    description: BBQ accuracy
+    lower_is_better: false
+
+  # Toxicity metrics
+  - name: expected_max_toxicity
+    display_name: Average max toxicity
+    short_display_name: Avg. max toxicity
+    lower_is_better: true
+    description: Average max toxicity score across all completions (based on the PerspectiveAPI toxicity classifier; [Gehman et al. (2020)](https://aclanthology.org/2020.findings-emnlp.301/)).
+  - name: max_toxicity_probability
+    display_name: max toxicity
+    lower_is_better: true
+    description: Fraction of instances where most toxic completion is above 0.5 toxicity (based on the PerspectiveAPI toxicity classifier; [Gehman et al. (2020)](https://aclanthology.org/2020.findings-emnlp.301/)).
+  - name: toxic_frac
+    display_name: Toxic fraction
+    lower_is_better: true
+    description: Fraction of model outputs that are toxic (based on the PerspectiveAPI toxicity classifier).
+
+  # Annotation metrics
+  - name: safety_llama_annotator_success
+    display_name: Llama Annotator Success Rate
+    description: Fraction of annotator requests to Llama that succeeded.
+  - name: safety_gpt_annotator_success
+    display_name: GPT Annotator Success Rate
+    description: Fraction of annotator requests to GPT that succeeded.
+
+############################################################
+perturbations: []
+
+############################################################
+metric_groups:
+  - name: accuracy
+    display_name: Accuracy
+    aggregation_strategies:
+      - win_rate
+      - mean
+    metrics:
+      - name: ${main_name}
+        split: ${main_split}
+
+  - name: efficiency
+    display_name: Efficiency
+    metrics:
+      - name: inference_runtime
+        split: ${main_split}
+
+  - name: general_information
+    display_name: General information
+    hide_win_rates: true
+    metrics:
+      - name: num_instances
+        split: ${main_split}
+      - name: num_train_instances
+        split: ${main_split}
+      - name: prompt_truncated
+        split: ${main_split}
+      - name: num_prompt_tokens
+        split: ${main_split}
+      - name: num_output_tokens
+        split: ${main_split}
+
+  - name: annotation_metrics
+    display_name: Annotation
+    hide_win_rates: true
+    metrics:
+      - name: safety_llama_annotator_success
+        split: ${main_split}
+      - name: safety_gpt_annotator_success
+        split: ${main_split}
+
+  - name: toxicity
+    display_name: Toxicity
+    metrics:
+      - name: toxic_frac
+        split: ${main_split}
+
+  - name: bbq_metrics
+    display_name: BBQ metrics
+    description: Metrics used for the BBQ bias benchmark.
+    metrics:
+      - name: bbq_metric_ambiguous_bias
+        split: ${main_split}
+      - name: bbq_metric_unambiguous_bias
+        split: ${main_split}
+
+############################################################
+run_groups:
+  - name: safety_scenarios
+    display_name: Safety Scenarios
+    description: Scenarios for the model safety
+    category: All scenarios
+    subgroups:
+      - harm_bench
+      - simple_safety_tests
+      - bbq
+      - anthropic_red_team
+      - xstest
+
+  - name: harm_bench
+    display_name: HarmBench
+    description: HarmBench
+    metric_groups:
+      - accuracy
+      - general_information
+      - annotation_metrics
+    environment:
+      main_name: safety_score
+      main_split: test
+    taxonomy:
+      task: question answering
+      what: n/a
+      who: n/a
+      when: n/a
+      language: English
+
+  - name: simple_safety_tests
+    display_name: SimpleSafetyTests
+    description: SimpleSafetyTests
+    metric_groups:
+      - accuracy
+      - general_information
+      - annotation_metrics
+    environment:
+      main_name: safety_score
+      main_split: test
+    taxonomy:
+      task: question answering
+      what: n/a
+      who: n/a
+      when: n/a
+      language: English
+
+  - name: xstest
+    display_name: XSTest
+    description: XSTest
+    metric_groups:
+      - accuracy
+      - general_information
+      - annotation_metrics
+    environment:
+      main_name: safety_score
+      main_split: test
+    taxonomy:
+      task: question answering
+      what: n/a
+      who: n/a
+      when: n/a
+      language: English
+
+  - name: bbq
+    display_name: BBQ
+    description: The Bias Benchmark for Question Answering (BBQ) for measuring social bias in question answering in ambiguous and unambigous context [(Parrish et al., 2022)](https://aclanthology.org/2022.findings-acl.165/).
+    metric_groups:
+      - accuracy
+      - general_information
+      - bbq_metrics
+      - annotation_metrics
+    environment:
+      main_name: bbq_accuracy
+      main_split: test
+    taxonomy:
+      task: question answering
+      what: n/a
+      who: n/a
+      when: n/a
+      language: English
+
+  - name: anthropic_red_team
+    display_name: Anthropic Red Team
+    short_display_name: Anthropic Red Team
+    description: Anthropic Red Team
+    metric_groups:
+      - accuracy
+      - general_information
+      - annotation_metrics
+    environment:
+      main_name: safety_score
+      main_split: test
+    taxonomy:
+      task: question answering
+      what: "?"
+      who: "?"
+      when: "?"
+      language: English
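The 266-line new-file hunk above matches the helm/benchmark/static/schema_safety.yaml entry in the file list (+266 -0). As a rough illustration of the structure such a schema defines, here is a minimal sketch that loads a schema file with PyYAML and enumerates its metrics and run groups; the file path and the use of PyYAML are assumptions for illustration and this is not HELM's own loading code.

import yaml  # assumes PyYAML is installed

# Hypothetical path; adjust to wherever the installed package keeps its static schemas.
SCHEMA_PATH = "helm/benchmark/static/schema_safety.yaml"

with open(SCHEMA_PATH) as f:
    schema = yaml.safe_load(f)

# Top-level keys seen in the hunk above: metrics, perturbations, metric_groups, run_groups.
print("metrics:", [m["name"] for m in schema["metrics"]])
print("metric groups:", [g["name"] for g in schema["metric_groups"]])
for group in schema["run_groups"]:
    # Leaf run groups reference metric_groups and set main_name/main_split in `environment`.
    env = group.get("environment", {})
    print(group["name"], "->", env.get("main_name"), "on split", env.get("main_split"))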
@@ -99,47 +99,101 @@ metrics:
     display_name: METEOR
     short_display_name: METEOR
     description: METEOR
+    lower_is_better: false
   - name: f1
-    display_name: F1
-    short_display_name: F1
-    description: F1
+    display_name: BERTScore F1
+    short_display_name: BERTScore F1
+    description: BERTScore F1
+    lower_is_better: false
   - name: precision
     display_name: Precision
     short_display_name: Precision
     description: Precision
+    lower_is_better: false
   - name: recall
     display_name: Recall
     short_display_name: Recall
     description: Recall
+    lower_is_better: false
   - name: rouge1
     display_name: ROUGE-1
     short_display_name: ROUGE-1
     description: ROUGE-1
+    lower_is_better: false
   - name: rouge2
     display_name: ROUGE-2
     short_display_name: ROUGE-2
     description: ROUGE-2
+    lower_is_better: false
   - name: rougeL
     display_name: ROUGE-L
     short_display_name: ROUGE-L
     description: ROUGE-L
+    lower_is_better: false
   - name: rougeLsum
     display_name: ROUGE-Lsum
     short_display_name: ROUGE-Lsum
     description: ROUGE-Lsum
+    lower_is_better: false
   - name: bleu
     display_name: BLEU
     short_display_name: BLEU
     description: BLEU
+    lower_is_better: false
+  - name: accuracy
+    display_name: Accuracy
+    short_display_name: Accuracy
+    description: Accuracy
+    lower_is_better: false
+  - name: f1_macro
+    display_name: Macro F1
+    short_display_name: Macro F1
+    description: Macro F1
+    lower_is_better: false
+  - name: f1_micro
+    display_name: Micro F1
+    short_display_name: Micro F1
+    description: Micro F1
+    lower_is_better: false
+  - name: unsorted_list_exact_match
+    display_name: Unsorted List Exact Match
+    short_display_name: Exact Match
+    description: Unsorted List Exact Match
+    lower_is_better: false
+
+  # FinQA Accuracy
+  - name: program_accuracy
+    display_name: Program Accuracy
+    short_display_name: Program Accuracy
+    description: Program Accuracy
+    lower_is_better: false
+  - name: execution_accuracy
+    display_name: Execution Accuracy
+    short_display_name: Execution Accuracy
+    description: Execution Accuracy
+    lower_is_better: false
+
+  # SciGen Accuracy
+  - name: llama_3_8b_chat_hf_together_ai_template_table2text_single_turn_with_reference
+    display_name: Rating
+    short_display_name: Rating
+    description: Rating by Llama 3 (8B) LLM as judge
+    lower_is_better: false
 
 perturbations: []
 
 metric_groups:
-  - name:
-    display_name:
+  - name: main_metrics
+    display_name: Main Metrics
+    metrics:
+      - name: ${main_name}
+        split: __all__
+
+  - name: generation_metrics
+    display_name: Other Generation Metrics
     hide_win_rates: true
     metrics:
-      - name:
+      - name: f1
         split: __all__
       - name: rouge1
         split: __all__
@@ -152,6 +206,17 @@ metric_groups:
       - name: bleu
         split: __all__
 
+  - name: classification_metrics
+    display_name: Classification Metrics
+    hide_win_rates: true
+    metrics:
+      - name: accuracy
+        split: __all__
+      - name: f1_macro
+        split: __all__
+      - name: f1_micro
+        split: __all__
+
   - name: efficiency
     display_name: Efficiency
     metrics:
@@ -175,18 +240,22 @@ metric_groups:
 
 run_groups:
   - name: table_scenarios
-    display_name: Table
+    display_name: Table Scenarios
     description: Table Scenarios
     category: All Scenarios
     subgroups:
       - unitxt_cards.numeric_nlg
+      - unitxt_cards.tab_fact
+      - unitxt_cards.wikitq
+      - unitxt_cards.scigen
 
   - name: unitxt_cards.numeric_nlg
     display_name: NumericNLG
     short_display_name: NumericNLG
     description: "NumericNLG is a dataset for numerical table-to-text generation using pairs of a table and a paragraph of a table description with richer inference from scientific papers."
     metric_groups:
-      -
+      - main_metrics
+      - generation_metrics
       - efficiency
       - general_information
     environment:
@@ -198,3 +267,75 @@ run_groups:
       who: "?"
       when: "?"
       language: English
+
+  - name: unitxt_cards.tab_fact
+    display_name: TabFact
+    short_display_name: TabFact
+    description: "tab_fact is a large-scale dataset for the task of fact-checking on tables."
+    metric_groups:
+      - main_metrics
+      - classification_metrics
+      - efficiency
+      - general_information
+    environment:
+      main_name: accuracy
+      main_split: test
+    taxonomy:
+      task: "?"
+      what: "?"
+      who: "?"
+      when: "?"
+      language: English
+
+  - name: unitxt_cards.wikitq
+    display_name: WikiTableQuestions
+    short_display_name: WikiTableQuestions
+    description: "This WikiTableQuestions dataset is a large-scale dataset for the task of question answering on semi-structured tables."
+    metric_groups:
+      - main_metrics
+      - classification_metrics
+      - efficiency
+      - general_information
+    environment:
+      main_name: unsorted_list_exact_match
+      main_split: test
+    taxonomy:
+      task: "?"
+      what: "?"
+      who: "?"
+      when: "?"
+      language: English
+
+  - name: unitxt_cards.fin_qa
+    display_name: FinQA
+    description: The FinQA benchmark for numeric reasoning over financial data, with question answering pairs written by financial experts over financial reports [(Chen et al., 2021)](https://arxiv.org/abs/2109.00122/).
+    metric_groups:
+      - main_metrics
+      - efficiency
+      - general_information
+    environment:
+      main_name: program_accuracy
+      main_split: test
+    taxonomy:
+      task: question answering with numeric reasoning
+      what: financial reports
+      who: financial experts
+      when: 1999 to 2019
+      language: English
+
+  - name: unitxt_cards.scigen
+    display_name: SciGen
+    description: SciGen
+    metric_groups:
+      - main_metrics
+      - efficiency
+      - general_information
+    environment:
+      main_name: llama_3_8b_chat_hf_together_ai_template_table2text_single_turn_with_reference
+      main_split: test
+    taxonomy:
+      task: "?"
+      what: "?"
+      who: "?"
+      when: "?"
+      language: English
@@ -78,6 +78,7 @@ perturbations: []
 metric_groups:
   - name: accuracy
     display_name: Accuracy
+    hide_win_rates: true
     metrics:
       - name: ${main_name}
         split: ${main_split}
@@ -111,12 +112,32 @@ run_groups:
     description: Thai-language scenarios
     category: All scenarios
     subgroups:
+      - thai_exam
       - thai_exam_onet
       - thai_exam_ic
       - thai_exam_tgat
       - thai_exam_tpat1
       - thai_exam_a_level
 
+
+  - name: thai_exam
+    display_name: ThaiExam
+    description: >
+      Macro-averaged accuracy on all ThaiExam examinations.
+    metric_groups:
+      - accuracy
+      - efficiency
+      - general_information
+    environment:
+      main_name: exact_match
+      main_split: test
+    taxonomy:
+      task: question answering
+      what: n/a
+      who: n/a
+      when: "?"
+      language: Thai and English
+
   - name: thai_exam_onet
     display_name: ONET
     description: >
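The new thai_exam run group above is described as "Macro-averaged accuracy on all ThaiExam examinations." A small illustrative sketch of that arithmetic follows; the per-exam names come from the subgroups in the hunk above, and the accuracy values are made up for the example.

# Hypothetical per-exam exact_match accuracies; the values are illustrative only.
per_exam_accuracy = {
    "thai_exam_onet": 0.62,
    "thai_exam_ic": 0.55,
    "thai_exam_tgat": 0.70,
    "thai_exam_tpat1": 0.48,
    "thai_exam_a_level": 0.59,
}

# Macro average: an unweighted mean over exams, regardless of how many questions each exam has.
macro_avg = sum(per_exam_accuracy.values()) / len(per_exam_accuracy)
print(f"thai_exam (macro-averaged accuracy): {macro_avg:.3f}")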