PyPI - crfm-helm - Versions diffs - 0.4.0__py3-none-any.whl → 0.5.0__py3-none-any.whl - Mend

crfm-helm 0.4.0__py3-none-any.whl → 0.5.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (482)

helm/benchmark/static/schema_instruction_following.yaml ADDED

@@ -0,0 +1,210 @@
+---
+############################################################
+perturbations: []
+adapter:
+  - name: method
+    description: The high-level strategy for converting instances into a prompt for the language model.
+    values:
+      - name: generation
+        description: Given the input, the model generates the output free-form.
+  - name: instructions
+    description: The description of the task that is included at the very beginning of the prompt.
+  - name: global_prefix
+    description: The string that is prepended to the prompt.
+  - name: instance_prefix
+    description: The string that is included before each instance (e.g., '\n\n').
+  - name: input_prefix
+    description: The string that is included before each input (e.g., 'Question:').
+  - name: input_suffix
+    description: The string that is included after each input (e.g., '\n').
+  - name: reference_prefix
+    description: The string that is included before each reference (for multiple-choice questions).
+  - name: reference_suffix
+    description: The string that is included after each reference (for multiple-choice questions).
+  - name: output_prefix
+    description: The string that is included before the correct answer/predicted output (e.g., 'Answer:').
+  - name: output_suffix
+    description: The string that is included after the correct answer/predicted output (e.g., '\n').
+  - name: substitutions
+    description: A list of regular expression substitutions (e.g., replacing '\n' with ';\n') to perform at the very end on the prompt.
+  - name: max_train_instances
+    description: Maximum number of training instances to include in the prompt (currently by randomly sampling).
+  - name: max_eval_instances
+    description: Maximum number of instances to evaluate on (over all splits - test, valid, etc.).
+  - name: num_outputs
+    description: Maximum number of possible outputs to generate by sampling multiple outputs.
+  - name: num_train_trials
+    description: Number of trials, where in each trial we choose an independent, random set of training instances. Used to compute variance.
+  - name: sample_train
+    description: If true, randomly sample N training examples; if false, select N consecutive training examples
+  - name: model
+    description: Name of the language model (<creator_organization>/<model name>) to send requests to.
+  - name: model_deployment
+    description: Name of the language model deployment (<host_organization>/<model name>) to send requests to.
+  - name: temperature
+    description: Temperature parameter used in generation.
+  - name: max_tokens
+    description: Maximum number of tokens to generate.
+  - name: stop_sequences
+    description: List of sequences, where we stop generation if we encounter any of them.
+  - name: random
+    description: Random seed (string), which guarantees reproducibility.
+  - name: multi_label
+    description: If true, for instances with multiple correct reference, the gold answer should be considered to be all of the correct references rather than any of the correct references.
+############################################################
+metrics:
+  - name: Helpfulness
+    display_name: Helpfulness
+    short_display_name: Helpfulness
+    description: Whether the model appears to do what it is instructed to.
+    lower_is_better: false
+  - name: Understandability
+    display_name: Understandability
+    short_display_name: Understandability
+    description: Whether the answer is easy to comprehend.
+    lower_is_better: false
+  - name: Completeness
+    display_name: Completeness
+    short_display_name: Completeness
+    description: Whether the response provides a comprehensive answer to the question or request.
+    lower_is_better: false
+  - name: Conciseness
+    display_name: Conciseness
+    short_display_name: Conciseness
+    description: Whether the response is concise and to the point, without unnecessary repetition or elaboration.
+    lower_is_better: false
+  - name: Harmlessness
+    display_name: Harmlessness
+    short_display_name: Harmlessness
+    description: Whether the response avoids using toxic language or helping the user with harmful goals.
+    lower_is_better: false
+############################################################
+metric_groups:
+  - name: instruction_following_metrics
+    display_name: Instruction Following
+    metrics:
+      - name: Helpfulness
+        split: ${main_split}
+      - name: Understandability
+        split: ${main_split}
+      - name: Completeness
+        split: ${main_split}
+      - name: Conciseness
+        split: ${main_split}
+      - name: Harmlessness
+        split: ${main_split}
+############################################################
+run_groups:
+  - name: instruction_following
+    display_name: Instruction Following
+    description: Given an open-ended instruction in natural language, the goal is to produce a text response that is helpful, understandable, complete, concise and harmless.
+    subgroups:
+      - anthropic_hh_rlhf
+      - grammar
+      - koala
+      - open_assistant
+      - self_instruct
+      - vicuna
+  - name: anthropic_hh_rlhf
+    display_name: Anthropic RLHF dataset
+    short_display_name: Anthropic RLHF dataset
+    description: The dialogue datasets released by Anthropic to facilitate research in model helpfulness and harmlessness ([Bai et al., 2022](https://arxiv.org/pdf/2204.05862.pdf); [Ganguli et al., 2022](https://arxiv.org/pdf/2209.07858.pdf)). We only use the first utterance of each dialogue.
+    metric_groups:
+      - instruction_following_metrics
+    environment:
+      main_name: Helpfulness
+      main_split: test
+    taxonomy:
+      task: open-ended instruction following
+      what: "Human-LM dialogues and preference labels"
+      who: "Workers from MTurk and Upwork, language models from Anthropic"
+      when: "2022"
+      language: English
+  # Ideally, the name should be "best_chatgpt_prompts".
+  # But unfortunately the group name in the results is "grammar",
+  # so the schema has to match the same group name.
+  # TODO: Change the group name in the "grammar" run spec, and then change this group name.
+  - name: grammar
+    display_name: Best ChatGPT Prompts
+    short_display_name: Best ChatGPT Prompts
+    description: A list of “best ChatGPT prompts to power your workflow” summarized by [GRIDFITI](https://gridfiti.com/best-chatgpt-prompts/).
+    metric_groups:
+      - instruction_following_metrics
+    environment:
+      main_name: Helpfulness
+      main_split: test
+    taxonomy:
+      task: open-ended instruction following
+      what: "Instructions for LLMs"
+      who: "Gridfiti Staff"
+      when: "2023"
+      language: English
+  - name: koala
+    display_name: Koala test dataset
+    short_display_name: Koala test dataset
+    description: The test dataset from the [Koala paper](https://bair.berkeley.edu/blog/2023/04/03/koala/) for evaluating instruction-following models.
+    metric_groups:
+      - instruction_following_metrics
+    environment:
+      main_name: Helpfulness
+      main_split: test
+    taxonomy:
+      task: open-ended instruction following
+      what: "Instructions for LLMs"
+      who: "Web users"
+      when: "Before 2023"
+      language: English
+  - name: open_assistant
+    display_name: Open Assistant
+    short_display_name: Open Assistant
+    description: LAION’s OpenAssistant Conversations Dataset (OASST1) that consists of 66,497 conversation trees ([Köpf et al., 2023](https://openreview.net/forum?id=VSJotgbPHF)). We only use the initial prompt in each conversation.
+    metric_groups:
+      - instruction_following_metrics
+    environment:
+      main_name: Helpfulness
+      main_split: valid
+    taxonomy:
+      task: open-ended instruction following
+      what: "Human-written dialogues and response rankings"
+      who: "Open Assistant participants"
+      when: "2023"
+      language: "35 languages"
+  - name: self_instruct
+    display_name: Self Instruct
+    short_display_name: Self Instruct
+    description: The manually-curated instructions from the Self-Instruct paper ([Wang et al., 2023](https://aclanthology.org/2023.acl-long.754.pdf)).
+    metric_groups:
+      - instruction_following_metrics
+    environment:
+      main_name: Helpfulness
+      main_split: test
+    taxonomy:
+      task: open-ended instruction following
+      what: "Instructions for LLMs"
+      who: "Authors of the research paper"
+      when: "2022"
+      language: English
+  - name: vicuna
+    display_name: Vicuna
+    short_display_name: Vicuna
+    description: The set of prompts used by the [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) team to evaluate instruction-following models.
+    metric_groups:
+      - instruction_following_metrics
+    environment:
+      main_name: Helpfulness
+      main_split: test
+    taxonomy:
+      task: open-ended instruction following
+      what: "Instructions for LLMs"
+      who: "Unknown"
+      when: "Before 2023"
+      language: English
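
The `${main_split}` entries under `metric_groups` in the file above read as placeholders that each run group's `environment` block fills in (`test` for most groups, `valid` for `open_assistant`). Below is a minimal sketch of that substitution, assuming plain string templating. The helper name `resolve_metrics` and the hand-transcribed dictionaries are illustrative only and are not taken from the package; the actual loader in crfm-helm may resolve these values differently.

```python
from string import Template

# Excerpt of the schema above, transcribed by hand into Python literals
# (only two metrics and two run groups, to keep the sketch short).
METRIC_GROUP = {
    "name": "instruction_following_metrics",
    "metrics": [
        {"name": "Helpfulness", "split": "${main_split}"},
        {"name": "Harmlessness", "split": "${main_split}"},
    ],
}
RUN_GROUPS = [
    {"name": "koala",
     "environment": {"main_name": "Helpfulness", "main_split": "test"}},
    {"name": "open_assistant",
     "environment": {"main_name": "Helpfulness", "main_split": "valid"}},
]


def resolve_metrics(metric_group, environment):
    """Hypothetical helper: fill ${...} placeholders from a run group's environment."""
    return [
        {key: Template(value).substitute(environment) for key, value in metric.items()}
        for metric in metric_group["metrics"]
    ]


# Map each run group name to its metric list with placeholders resolved.
resolved = {
    group["name"]: resolve_metrics(METRIC_GROUP, group["environment"])
    for group in RUN_GROUPS
}
for name, metrics in resolved.items():
    print(name, metrics)
```

Under this reading, the same metric group reports on the `test` split for `koala` and on the `valid` split for `open_assistant`, without the metric list being duplicated per run group.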

helm/benchmark/static/schema_lite.yaml CHANGED

@@ -1,229 +1,4 @@
 ---
-############################################################
-models:
-  # Anthropic
-  - name: anthropic/claude-2.0
-    display_name: Anthropic Claude 2.0
-    description: Claude 2.0 is a general purpose large language model developed by Anthropic. It uses a transformer architecture and is trained via unsupervised learning, RLHF, and Constitutional AI (including both a supervised and Reinforcement Learning (RL) phase). ([model card](https://efficient-manatee.files.svdcdn.com/production/images/Model-Card-Claude-2.pdf))
-    creator_organization: Anthropic
-    access: limited
-    release_date: 2023-07-11
-  - name: anthropic/claude-2.1
-    display_name: Anthropic Claude 2.1
-    description: Claude 2.1 is a general purpose large language model developed by Anthropic. It uses a transformer architecture and is trained via unsupervised learning, RLHF, and Constitutional AI (including both a supervised and Reinforcement Learning (RL) phase). ([model card](https://efficient-manatee.files.svdcdn.com/production/images/Model-Card-Claude-2.pdf))
-    creator_organization: Anthropic
-    access: limited
-    release_date: 2023-11-21
-  - name: anthropic/claude-v1.3
-    display_name: Anthropic Claude v1.3
-    description: A model trained using reinforcement learning from human feedback ([docs](https://www.anthropic.com/index/introducing-claude)).
-    creator_organization: Anthropic
-    access: limited
-    release_date: 2023-03-17
-  - name: anthropic/claude-instant-1.2
-    display_name: Anthropic Claude Instant 1.2
-    description: A lightweight version of Claude, a model trained using reinforcement learning from human feedback ([docs](https://www.anthropic.com/index/introducing-claude)).
-    creator_organization: Anthropic
-    access: limited
-    release_date: 2023-08-09
-  # Cohere
-  - name: cohere/command
-    display_name: Cohere Command
-    description: Command is Cohere’s flagship text generation model. It is trained to follow user commands and to be instantly useful in practical business applications. [docs](https://docs.cohere.com/reference/generate) and [changelog](https://docs.cohere.com/changelog)
-    creator_organization: Cohere
-    access: limited
-    release_date: 2023-09-29
-  - name: cohere/command-light
-    display_name: Cohere Command Light
-    description: Command is Cohere’s flagship text generation model. It is trained to follow user commands and to be instantly useful in practical business applications. [docs](https://docs.cohere.com/reference/generate) and [changelog](https://docs.cohere.com/changelog)
-    creator_organization: Cohere
-    access: limited
-    release_date: 2023-09-29
-  # Meta
-  - name: meta/llama-65b
-    display_name: LLaMA (65B)
-    description: LLaMA is a collection of foundation language models ranging from 7B to 65B parameters.
-    creator_organization: Meta
-    access: open
-    num_parameters: 65000000000
-    release_date: 2023-02-24
-  - name: meta/llama-2-7b
-    display_name: Llama 2 (7B)
-    description: Llama 2 pretrained models are trained on 2 trillion tokens, and have double the context length than Llama 1.
-    creator_organization: Meta
-    access: open
-    num_parameters: 7000000000
-    release_date: 2023-07-18
-  - name: meta/llama-2-13b
-    display_name: Llama 2 (13B)
-    description: Llama 2 pretrained models are trained on 2 trillion tokens, and have double the context length than Llama 1.
-    creator_organization: Meta
-    access: open
-    num_parameters: 13000000000
-    release_date: 2023-07-18
-  - name: meta/llama-2-70b
-    display_name: Llama 2 (70B)
-    description: Llama 2 pretrained models are trained on 2 trillion tokens, and have double the context length than Llama 1.
-    creator_organization: Meta
-    access: open
-    num_parameters: 70000000000
-    release_date: 2023-07-18
-  # 01.AI
-  - name: 01-ai/yi-6b
-    display_name: Yi (6B)
-    description: The Yi models are large language models trained from scratch by developers at 01.AI.
-    creator_organization: 01.AI
-    access: open
-    num_parameters: 6000000000
-    release_date: 2023-11-02
-  - name: 01-ai/yi-34b
-    display_name: Yi (34B)
-    description: The Yi models are large language models trained from scratch by developers at 01.AI.
-    creator_organization: 01.AI
-    access: open
-    num_parameters: 34000000000
-    release_date: 2023-11-02
-  # Mistral AI
-  - name: mistralai/mistral-7b-v0.1
-    display_name: Mistral v0.1 (7B)
-    description: Mistral 7B is a 7.3B parameter transformer model that uses Grouped-Query Attention (GQA) and Sliding-Window Attention (SWA).
-    creator_organization: Mistral AI
-    access: open
-    num_parameters: 7300000000
-    release_date: 2023-09-27
-  - name: mistralai/mixtral-8x7b-32kseqlen
-    display_name: Mixtral (8x7B 32K seqlen)
-    description: Mistral AI's mixture-of-experts model ([tweet](https://twitter.com/MistralAI/status/1733150512395038967)).
-    creator_organization: Mistral AI
-    access: open
-    num_parameters: 56000000000
-    release_date: 2023-12-08
-  # OpenAI
-  - name: openai/text-davinci-003
-    display_name: GPT-3.5 (text-davinci-003)
-    description: text-davinci-003 model that involves reinforcement learning (PPO) with reward models. Derived from text-davinci-002 ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
-    creator_organization: OpenAI
-    access: limited
-    num_parameters: 175000000000
-    release_date: 2022-11-28
-  - name: openai/text-davinci-002
-    display_name: GPT-3.5 (text-davinci-002)
-    description: text-davinci-002 model that involves supervised fine-tuning on human-written demonstrations. Derived from code-davinci-002 ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
-    creator_organization: OpenAI
-    access: limited
-    num_parameters: 175000000000
-    release_date: 2022-01-27
-  - name: openai/gpt-4-0613
-    display_name: GPT-4 (0613)
-    description: GPT-4 is a large multimodal model (currently only accepting text inputs and emitting text outputs) that is optimized for chat but works well for traditional completions tasks. Snapshot of gpt-4 from 2023-06-13.
-    creator_organization: OpenAI
-    access: limited
-    release_date: 2023-06-13
-  - name: openai/gpt-4-1106-preview
-    display_name: GPT-4 Turbo (1106 preview)
-    description: GPT-4 Turbo (preview) is a large multimodal model that is optimized for chat but works well for traditional completions tasks. The model is cheaper and faster than the original GPT-4 model. Preview snapshot from November 6, 2023.
-    creator_organization: OpenAI
-    access: limited
-    release_date: 2023-11-06
-  - name: openai/gpt-3.5-turbo-0613
-    display_name: GPT-3.5 Turbo (0613)
-    description: Sibling model of text-davinci-003 is optimized for chat but works well for traditional completions tasks as well. Snapshot from 2023-06-13.
-    creator_organization: OpenAI
-    access: limited
-    release_date: 2023-06-13
-  # Writer
-  - name: writer/palmyra-x-v2
-    display_name: Palmyra X V2 (33B)
-    description: Palmyra-X V2 (33B parameters) is a Transformer-based model, which is trained on extremely large-scale pre-training data. The pre-training data more than 2 trillion tokens types are diverse and cover a wide range of areas, used FlashAttention-2.
-    creator_organization: Writer
-    access: limited
-    num_parameters: 33000000000
-    release_date: 2023-12-01
-  - name: writer/palmyra-x-v3
-    display_name: Palmyra X V3 (72B)
-    description: Palmyra-X V3 (72B parameters) is a Transformer-based model, which is trained on extremely large-scale pre-training data. It is trained via unsupervised learning and DPO and use multiquery attention.
-    creator_organization: Writer
-    access: limited
-    num_parameters: 72000000000
-    release_date: 2023-12-01
-  # Google
-  - name: google/text-bison@001
-    display_name: PaLM-2 (Bison)
-    description: The best value PaLM model. PaLM 2 (Pathways Language Model) is a Transformer-based model trained using a mixture of objectives that was evaluated on English and multilingual language, and reasoning tasks. ([report](https://arxiv.org/pdf/2305.10403.pdf))
-    creator_organization: Google
-    access: limited
-    release_date: 2023-06-07 # Source: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text#model_versions
-  - name: google/text-unicorn@001
-    display_name: PaLM-2 (Unicorn)
-    description: The largest model in PaLM family. PaLM 2 (Pathways Language Model) is a Transformer-based model trained using a mixture of objectives that was evaluated on English and multilingual language, and reasoning tasks. ([report](https://arxiv.org/pdf/2305.10403.pdf))
-    creator_organization: Google
-    access: limited
-    release_date: 2023-11-30 # Source: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text#model_versions
-  # TII UAE
-  - name: tiiuae/falcon-7b
-    display_name: Falcon (7B)
-    description: Falcon-7B is a 7B parameters causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora.
-    creator_organization: TII UAE
-    access: open
-    num_parameters: 7000000000
-    release_date: 2023-03-15
-  - name: tiiuae/falcon-40b
-    display_name: Falcon (40B)
-    description: Falcon-40B is a 40B parameters causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora.
-    creator_organization: TII UAE
-    access: open
-    num_parameters: 40000000000
-    release_date: 2023-05-25
-  # AI21 Labs
-  - name: ai21/j2-jumbo
-    display_name: Jurassic-2 Jumbo (178B)
-    description: Jurassic-2 Jumbo (178B parameters) ([docs](https://www.ai21.com/blog/introducing-j2))
-    creator_organization: AI21 Labs
-    access: limited
-    num_parameters: 178000000000
-    release_date: 2023-03-09
-  - name: ai21/j2-grande
-    display_name: Jurassic-2 Grande (17B)
-    description: Jurassic-2 Grande (17B parameters) ([docs](https://www.ai21.com/blog/introducing-j2))
-    creator_organization: AI21 Labs
-    access: limited
-    num_parameters: 17000000000
-    release_date: 2023-03-09
-  #  Aleph Alpha
-  - name: AlephAlpha/luminous-base
-    display_name: Luminous Base (13B)
-    description: Luminous Base (13B parameters) ([docs](https://docs.aleph-alpha.com/docs/introduction/luminous/))
-    creator_organization: Aleph Alpha
-    access: limited
-    num_parameters: 13000000000
-    # TODO: get exact release date
-    release_date: 2022-01-01
-  - name: AlephAlpha/luminous-extended
-    display_name: Luminous Extended (30B)
-    description: Luminous Extended (30B parameters) ([docs](https://docs.aleph-alpha.com/docs/introduction/luminous/))
-    creator_organization: Aleph Alpha
-    access: limited
-    num_parameters: 30000000000
-    release_date: 2022-01-01
-  - name: AlephAlpha/luminous-supreme
-    display_name: Luminous Supreme (70B)
-    description: Luminous Supreme (70B parameters) ([docs](https://docs.aleph-alpha.com/docs/introduction/luminous/))
-    creator_organization: Aleph Alpha
-    access: limited
-    num_parameters: 70000000000
-    release_date: 2022-01-01
 ############################################################
 adapter:
   - name: method
@@ -272,9 +47,9 @@ adapter:
   - name: sample_train
     description: If true, randomly sample N training examples; if false, select N consecutive training examples
   - name: model
-    description: DEPRECATED. Name of the language model (<creator_organization>/<model name>) to send requests to.
+    description: Name of the language model (<creator_organization>/<model name>) to send requests to.
   - name: model_deployment
-    description: Name of the language model (<host_organization>/<model name>) to send requests to.
+    description: Name of the language model deployment (<host_organization>/<model name>) to send requests to.
   - name: temperature
     description: Temperature parameter used in generation.
   - name: max_tokens
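
The revised descriptions in the hunk above give `model` the shape `<creator_organization>/<model name>` and `model_deployment` the shape `<host_organization>/<model name>`. The sketch below splits such a value into its two parts, assuming the first slash is the separator. `QualifiedName` and `parse_qualified_name` are hypothetical names introduced for illustration and are not part of crfm-helm.

```python
from typing import NamedTuple


class QualifiedName(NamedTuple):
    """Illustrative container for the two parts of a qualified name."""
    organization: str
    model_name: str


def parse_qualified_name(value: str) -> QualifiedName:
    """Hypothetical helper: split '<organization>/<model name>' at the first slash."""
    organization, separator, model_name = value.partition("/")
    if not separator or not organization or not model_name:
        raise ValueError(f"expected '<organization>/<model name>', got {value!r}")
    return QualifiedName(organization, model_name)


# Names taken from the `models:` block removed in this diff.
model = parse_qualified_name("anthropic/claude-2.0")
print(model.organization, model.model_name)
print(parse_qualified_name("AlephAlpha/luminous-base"))
```

The same split would apply to a `model_deployment` value, with the first part read as the hosting organization rather than the creator.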
