crfm-helm 0.3.0__py3-none-any.whl → 0.5.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (546) hide show
  1. {crfm_helm-0.3.0.dist-info → crfm_helm-0.5.0.dist-info}/METADATA +144 -36
  2. crfm_helm-0.5.0.dist-info/RECORD +642 -0
  3. {crfm_helm-0.3.0.dist-info → crfm_helm-0.5.0.dist-info}/WHEEL +1 -1
  4. helm/benchmark/adaptation/adapter_spec.py +37 -2
  5. helm/benchmark/adaptation/adapters/adapter.py +4 -42
  6. helm/benchmark/adaptation/adapters/adapter_factory.py +24 -27
  7. helm/benchmark/adaptation/adapters/binary_ranking_adapter.py +1 -0
  8. helm/benchmark/adaptation/adapters/generation_adapter.py +2 -0
  9. helm/benchmark/adaptation/adapters/in_context_learning_adapter.py +21 -4
  10. helm/benchmark/adaptation/adapters/language_modeling_adapter.py +12 -5
  11. helm/benchmark/adaptation/adapters/multimodal/generation_multimodal_adapter.py +1 -0
  12. helm/benchmark/adaptation/adapters/multimodal/in_context_learning_multimodal_adapter.py +1 -0
  13. helm/benchmark/adaptation/adapters/multimodal/multiple_choice_joint_multimodal_adapter.py +104 -0
  14. helm/benchmark/adaptation/adapters/multimodal/test_in_context_learning_multimodal_adapter.py +5 -1
  15. helm/benchmark/adaptation/adapters/multiple_choice_joint_adapter.py +1 -0
  16. helm/benchmark/adaptation/adapters/multiple_choice_separate_adapter.py +1 -0
  17. helm/benchmark/adaptation/adapters/test_adapter.py +2 -1
  18. helm/benchmark/adaptation/adapters/test_generation_adapter.py +59 -14
  19. helm/benchmark/adaptation/adapters/test_language_modeling_adapter.py +40 -5
  20. helm/benchmark/adaptation/adapters/test_multiple_choice_joint_adapter.py +78 -10
  21. helm/benchmark/adaptation/common_adapter_specs.py +376 -0
  22. helm/benchmark/adaptation/prompt.py +7 -1
  23. helm/benchmark/adaptation/request_state.py +6 -1
  24. helm/benchmark/adaptation/scenario_state.py +6 -2
  25. helm/benchmark/annotation/annotator.py +43 -0
  26. helm/benchmark/annotation/annotator_factory.py +61 -0
  27. helm/benchmark/annotation/image2structure/image_compiler_annotator.py +88 -0
  28. helm/benchmark/annotation/image2structure/latex_compiler_annotator.py +59 -0
  29. helm/benchmark/annotation/image2structure/lilypond_compiler_annotator.py +84 -0
  30. helm/benchmark/annotation/image2structure/webpage_compiler_annotator.py +132 -0
  31. helm/benchmark/annotation/test_annotator_factory.py +26 -0
  32. helm/benchmark/annotation/test_dummy_annotator.py +44 -0
  33. helm/benchmark/annotation_executor.py +124 -0
  34. helm/benchmark/augmentations/cleva_perturbation.py +7 -14
  35. helm/benchmark/augmentations/contraction_expansion_perturbation.py +3 -3
  36. helm/benchmark/augmentations/contrast_sets_perturbation.py +0 -3
  37. helm/benchmark/augmentations/data_augmenter.py +0 -2
  38. helm/benchmark/augmentations/dialect_perturbation.py +2 -2
  39. helm/benchmark/augmentations/extra_space_perturbation.py +2 -2
  40. helm/benchmark/augmentations/filler_words_perturbation.py +2 -2
  41. helm/benchmark/augmentations/gender_perturbation.py +3 -3
  42. helm/benchmark/augmentations/lowercase_perturbation.py +2 -2
  43. helm/benchmark/augmentations/mild_mix_perturbation.py +2 -2
  44. helm/benchmark/augmentations/misspelling_perturbation.py +2 -2
  45. helm/benchmark/augmentations/person_name_perturbation.py +0 -7
  46. helm/benchmark/augmentations/perturbation.py +20 -7
  47. helm/benchmark/augmentations/perturbation_description.py +1 -1
  48. helm/benchmark/augmentations/space_perturbation.py +2 -2
  49. helm/benchmark/augmentations/suffix_perturbation.py +29 -0
  50. helm/benchmark/augmentations/synonym_perturbation.py +2 -2
  51. helm/benchmark/augmentations/test_perturbation.py +11 -7
  52. helm/benchmark/augmentations/translate_perturbation.py +30 -0
  53. helm/benchmark/augmentations/typos_perturbation.py +2 -2
  54. helm/benchmark/config_registry.py +38 -0
  55. helm/benchmark/executor.py +46 -16
  56. helm/benchmark/huggingface_registration.py +37 -7
  57. helm/benchmark/metrics/basic_metrics.py +172 -641
  58. helm/benchmark/metrics/bbq_metrics.py +3 -4
  59. helm/benchmark/metrics/bias_metrics.py +6 -6
  60. helm/benchmark/metrics/classification_metrics.py +11 -8
  61. helm/benchmark/metrics/cleva_accuracy_metrics.py +8 -5
  62. helm/benchmark/metrics/cleva_harms_metrics.py +2 -2
  63. helm/benchmark/metrics/code_metrics.py +4 -3
  64. helm/benchmark/metrics/code_metrics_helper.py +0 -2
  65. helm/benchmark/metrics/common_metric_specs.py +167 -0
  66. helm/benchmark/metrics/decodingtrust_fairness_metrics.py +72 -0
  67. helm/benchmark/metrics/decodingtrust_ood_knowledge_metrics.py +66 -0
  68. helm/benchmark/metrics/decodingtrust_privacy_metrics.py +101 -0
  69. helm/benchmark/metrics/decodingtrust_stereotype_bias_metrics.py +202 -0
  70. helm/benchmark/metrics/disinformation_metrics.py +6 -112
  71. helm/benchmark/metrics/dry_run_metrics.py +5 -3
  72. helm/benchmark/metrics/efficiency_metrics.py +206 -0
  73. helm/benchmark/metrics/evaluate_instances_metric.py +59 -0
  74. helm/benchmark/metrics/evaluate_reference_metrics.py +376 -0
  75. helm/benchmark/metrics/image_generation/aesthetics_metrics.py +54 -0
  76. helm/benchmark/metrics/image_generation/aesthetics_scorer.py +66 -0
  77. helm/benchmark/metrics/image_generation/clip_score_metrics.py +73 -0
  78. helm/benchmark/metrics/image_generation/denoised_runtime_metric.py +42 -0
  79. helm/benchmark/metrics/image_generation/detection_metrics.py +57 -0
  80. helm/benchmark/metrics/image_generation/detectors/base_detector.py +8 -0
  81. helm/benchmark/metrics/image_generation/detectors/vitdet.py +178 -0
  82. helm/benchmark/metrics/image_generation/efficiency_metrics.py +41 -0
  83. helm/benchmark/metrics/image_generation/fidelity_metrics.py +168 -0
  84. helm/benchmark/metrics/image_generation/fractal_dimension/__init__.py +0 -0
  85. helm/benchmark/metrics/image_generation/fractal_dimension/fractal_dimension_util.py +63 -0
  86. helm/benchmark/metrics/image_generation/fractal_dimension/test_fractal_dimension_util.py +33 -0
  87. helm/benchmark/metrics/image_generation/fractal_dimension_metric.py +50 -0
  88. helm/benchmark/metrics/image_generation/gender_metrics.py +58 -0
  89. helm/benchmark/metrics/image_generation/image_critique_metrics.py +284 -0
  90. helm/benchmark/metrics/image_generation/lpips_metrics.py +82 -0
  91. helm/benchmark/metrics/image_generation/multi_scale_ssim_metrics.py +82 -0
  92. helm/benchmark/metrics/image_generation/nsfw_detector.py +96 -0
  93. helm/benchmark/metrics/image_generation/nsfw_metrics.py +103 -0
  94. helm/benchmark/metrics/image_generation/nudity_metrics.py +38 -0
  95. helm/benchmark/metrics/image_generation/photorealism_critique_metrics.py +153 -0
  96. helm/benchmark/metrics/image_generation/psnr_metrics.py +78 -0
  97. helm/benchmark/metrics/image_generation/q16/__init__.py +0 -0
  98. helm/benchmark/metrics/image_generation/q16/q16_toxicity_detector.py +90 -0
  99. helm/benchmark/metrics/image_generation/q16/test_q16.py +18 -0
  100. helm/benchmark/metrics/image_generation/q16_toxicity_metrics.py +48 -0
  101. helm/benchmark/metrics/image_generation/skin_tone_metrics.py +164 -0
  102. helm/benchmark/metrics/image_generation/uiqi_metrics.py +92 -0
  103. helm/benchmark/metrics/image_generation/watermark/__init__.py +0 -0
  104. helm/benchmark/metrics/image_generation/watermark/test_watermark_detector.py +16 -0
  105. helm/benchmark/metrics/image_generation/watermark/watermark_detector.py +87 -0
  106. helm/benchmark/metrics/image_generation/watermark_metrics.py +48 -0
  107. helm/benchmark/metrics/instruction_following_critique_metrics.py +3 -1
  108. helm/benchmark/metrics/language_modeling_metrics.py +99 -0
  109. helm/benchmark/metrics/machine_translation_metrics.py +5 -5
  110. helm/benchmark/metrics/metric.py +93 -172
  111. helm/benchmark/metrics/metric_name.py +0 -1
  112. helm/benchmark/metrics/metric_service.py +16 -0
  113. helm/benchmark/metrics/paraphrase_generation_metrics.py +3 -4
  114. helm/benchmark/metrics/ranking_metrics.py +6 -7
  115. helm/benchmark/metrics/reference_metric.py +148 -0
  116. helm/benchmark/metrics/summac/model_summac.py +0 -2
  117. helm/benchmark/metrics/summarization_metrics.py +8 -8
  118. helm/benchmark/metrics/test_classification_metrics.py +9 -6
  119. helm/benchmark/metrics/test_disinformation_metrics.py +78 -0
  120. helm/benchmark/metrics/test_evaluate_reference_metrics.py +30 -0
  121. helm/benchmark/metrics/test_metric.py +2 -2
  122. helm/benchmark/metrics/tokens/auto_token_cost_estimator.py +1 -1
  123. helm/benchmark/metrics/tokens/gooseai_token_cost_estimator.py +13 -3
  124. helm/benchmark/metrics/tokens/openai_token_cost_estimator.py +1 -1
  125. helm/benchmark/metrics/tokens/test_ai21_token_cost_estimator.py +2 -0
  126. helm/benchmark/metrics/tokens/test_openai_token_cost_estimator.py +9 -2
  127. helm/benchmark/metrics/toxicity_metrics.py +1 -1
  128. helm/benchmark/metrics/toxicity_utils.py +23 -0
  129. helm/benchmark/metrics/unitxt_metrics.py +81 -0
  130. helm/benchmark/metrics/vision_language/__init__.py +0 -0
  131. helm/benchmark/metrics/vision_language/emd_utils.py +341 -0
  132. helm/benchmark/metrics/vision_language/image_metrics.py +450 -0
  133. helm/benchmark/metrics/vision_language/image_utils.py +100 -0
  134. helm/benchmark/model_deployment_registry.py +164 -41
  135. helm/benchmark/model_metadata_registry.py +181 -35
  136. helm/benchmark/multi_gpu_runner.py +133 -0
  137. helm/benchmark/presentation/contamination.py +3 -3
  138. helm/benchmark/presentation/create_plots.py +8 -7
  139. helm/benchmark/presentation/run_display.py +50 -17
  140. helm/benchmark/presentation/schema.py +28 -46
  141. helm/benchmark/presentation/summarize.py +213 -96
  142. helm/benchmark/presentation/table.py +8 -8
  143. helm/benchmark/presentation/test_contamination.py +2 -2
  144. helm/benchmark/presentation/test_run_entry.py +14 -9
  145. helm/benchmark/presentation/test_summarize.py +5 -0
  146. helm/benchmark/run.py +66 -54
  147. helm/benchmark/run_expander.py +342 -31
  148. helm/benchmark/run_spec.py +93 -0
  149. helm/benchmark/run_spec_factory.py +162 -0
  150. helm/benchmark/run_specs/__init__.py +0 -0
  151. helm/benchmark/{run_specs.py → run_specs/classic_run_specs.py} +217 -1330
  152. helm/benchmark/run_specs/cleva_run_specs.py +277 -0
  153. helm/benchmark/run_specs/decodingtrust_run_specs.py +314 -0
  154. helm/benchmark/run_specs/heim_run_specs.py +623 -0
  155. helm/benchmark/run_specs/instruction_following_run_specs.py +129 -0
  156. helm/benchmark/run_specs/lite_run_specs.py +307 -0
  157. helm/benchmark/run_specs/simple_run_specs.py +104 -0
  158. helm/benchmark/run_specs/unitxt_run_specs.py +42 -0
  159. helm/benchmark/run_specs/vlm_run_specs.py +501 -0
  160. helm/benchmark/runner.py +116 -69
  161. helm/benchmark/runner_config_registry.py +21 -0
  162. helm/benchmark/scenarios/bbq_scenario.py +1 -1
  163. helm/benchmark/scenarios/bold_scenario.py +2 -2
  164. helm/benchmark/scenarios/cleva_scenario.py +43 -46
  165. helm/benchmark/scenarios/code_scenario.py +3 -2
  166. helm/benchmark/scenarios/commonsense_scenario.py +171 -191
  167. helm/benchmark/scenarios/decodingtrust_adv_demonstration_scenario.py +169 -0
  168. helm/benchmark/scenarios/decodingtrust_adv_robustness_scenario.py +121 -0
  169. helm/benchmark/scenarios/decodingtrust_fairness_scenario.py +77 -0
  170. helm/benchmark/scenarios/decodingtrust_machine_ethics_scenario.py +324 -0
  171. helm/benchmark/scenarios/decodingtrust_ood_robustness_scenario.py +204 -0
  172. helm/benchmark/scenarios/decodingtrust_privacy_scenario.py +559 -0
  173. helm/benchmark/scenarios/decodingtrust_stereotype_bias_scenario.py +67 -0
  174. helm/benchmark/scenarios/decodingtrust_toxicity_prompts_scenario.py +78 -0
  175. helm/benchmark/scenarios/dialogue_scenarios.py +0 -1
  176. helm/benchmark/scenarios/entity_matching_scenario.py +1 -1
  177. helm/benchmark/scenarios/image_generation/__init__.py +0 -0
  178. helm/benchmark/scenarios/image_generation/common_syntactic_processes_scenario.py +105 -0
  179. helm/benchmark/scenarios/image_generation/cub200_scenario.py +95 -0
  180. helm/benchmark/scenarios/image_generation/daily_dalle_scenario.py +124 -0
  181. helm/benchmark/scenarios/image_generation/demographic_stereotypes_scenario.py +82 -0
  182. helm/benchmark/scenarios/image_generation/detection_scenario.py +83 -0
  183. helm/benchmark/scenarios/image_generation/draw_bench_scenario.py +74 -0
  184. helm/benchmark/scenarios/image_generation/i2p_scenario.py +57 -0
  185. helm/benchmark/scenarios/image_generation/landing_page_scenario.py +46 -0
  186. helm/benchmark/scenarios/image_generation/logos_scenario.py +223 -0
  187. helm/benchmark/scenarios/image_generation/magazine_cover_scenario.py +91 -0
  188. helm/benchmark/scenarios/image_generation/mental_disorders_scenario.py +46 -0
  189. helm/benchmark/scenarios/image_generation/mscoco_scenario.py +91 -0
  190. helm/benchmark/scenarios/image_generation/paint_skills_scenario.py +72 -0
  191. helm/benchmark/scenarios/image_generation/parti_prompts_scenario.py +94 -0
  192. helm/benchmark/scenarios/image_generation/radiology_scenario.py +42 -0
  193. helm/benchmark/scenarios/image_generation/relational_understanding_scenario.py +52 -0
  194. helm/benchmark/scenarios/image_generation/time_most_significant_historical_figures_scenario.py +124 -0
  195. helm/benchmark/scenarios/image_generation/winoground_scenario.py +62 -0
  196. helm/benchmark/scenarios/imdb_scenario.py +0 -1
  197. helm/benchmark/scenarios/legalbench_scenario.py +123 -0
  198. helm/benchmark/scenarios/live_qa_scenario.py +94 -0
  199. helm/benchmark/scenarios/lm_entry_scenario.py +185 -0
  200. helm/benchmark/scenarios/lsat_qa_scenario.py +4 -2
  201. helm/benchmark/scenarios/math_scenario.py +19 -2
  202. helm/benchmark/scenarios/medication_qa_scenario.py +60 -0
  203. helm/benchmark/scenarios/numeracy_scenario.py +3 -3
  204. helm/benchmark/scenarios/opinions_qa_scenario.py +6 -10
  205. helm/benchmark/scenarios/raft_scenario.py +2 -6
  206. helm/benchmark/scenarios/scenario.py +14 -2
  207. helm/benchmark/scenarios/simple_scenarios.py +122 -1
  208. helm/benchmark/scenarios/test_math_scenario.py +22 -0
  209. helm/benchmark/scenarios/test_scenario.py +6 -3
  210. helm/benchmark/scenarios/test_simple_scenarios.py +50 -0
  211. helm/benchmark/scenarios/thai_exam_scenario.py +135 -0
  212. helm/benchmark/scenarios/the_pile_scenario.py +6 -7
  213. helm/benchmark/scenarios/unitxt_scenario.py +56 -0
  214. helm/benchmark/scenarios/verifiability_judgment_scenario.py +3 -1
  215. helm/benchmark/scenarios/vicuna_scenario.py +1 -1
  216. helm/benchmark/scenarios/vision_language/bingo_scenario.py +103 -0
  217. helm/benchmark/scenarios/vision_language/hateful_memes_scenario.py +92 -0
  218. helm/benchmark/scenarios/vision_language/heim_human_eval_scenario.py +113 -0
  219. helm/benchmark/scenarios/vision_language/image2structure/__init__.py +0 -0
  220. helm/benchmark/scenarios/vision_language/image2structure/chart2csv_scenario.py +55 -0
  221. helm/benchmark/scenarios/vision_language/image2structure/image2structure_scenario.py +214 -0
  222. helm/benchmark/scenarios/vision_language/image2structure/latex_scenario.py +25 -0
  223. helm/benchmark/scenarios/vision_language/image2structure/musicsheet_scenario.py +20 -0
  224. helm/benchmark/scenarios/vision_language/image2structure/utils_latex.py +347 -0
  225. helm/benchmark/scenarios/vision_language/image2structure/webpage/__init__.py +0 -0
  226. helm/benchmark/scenarios/vision_language/image2structure/webpage/driver.py +84 -0
  227. helm/benchmark/scenarios/vision_language/image2structure/webpage/jekyll_server.py +182 -0
  228. helm/benchmark/scenarios/vision_language/image2structure/webpage/utils.py +31 -0
  229. helm/benchmark/scenarios/vision_language/image2structure/webpage_scenario.py +225 -0
  230. helm/benchmark/scenarios/vision_language/mementos_scenario.py +124 -0
  231. helm/benchmark/scenarios/vision_language/mme_scenario.py +145 -0
  232. helm/benchmark/scenarios/vision_language/mmmu_scenario.py +187 -0
  233. helm/benchmark/scenarios/vision_language/multipanelvqa_scenario.py +169 -0
  234. helm/benchmark/scenarios/vision_language/pope_scenario.py +104 -0
  235. helm/benchmark/scenarios/vision_language/seed_bench_scenario.py +129 -0
  236. helm/benchmark/scenarios/vision_language/unicorn_scenario.py +108 -0
  237. helm/benchmark/scenarios/vision_language/viz_wiz_scenario.py +107 -0
  238. helm/benchmark/scenarios/vision_language/vqa_scenario.py +1 -1
  239. helm/benchmark/scenarios/wmt_14_scenario.py +18 -18
  240. helm/benchmark/server.py +59 -2
  241. helm/benchmark/slurm_jobs.py +12 -0
  242. helm/benchmark/slurm_runner.py +79 -51
  243. helm/benchmark/static/benchmarking.js +3 -4
  244. helm/benchmark/static/contamination.yaml +1 -1
  245. helm/benchmark/static/images/organizations/together.png +0 -0
  246. helm/benchmark/static/json-urls.js +4 -0
  247. helm/benchmark/static/{schema.yaml → schema_classic.yaml} +346 -930
  248. helm/benchmark/static/schema_instruction_following.yaml +210 -0
  249. helm/benchmark/static/schema_lite.yaml +824 -0
  250. helm/benchmark/static/schema_mmlu.yaml +1507 -0
  251. helm/benchmark/static/schema_unitxt.yaml +428 -0
  252. helm/benchmark/static/schema_vlm.yaml +576 -0
  253. helm/benchmark/static_build/assets/01-694cb9b7.png +0 -0
  254. helm/benchmark/static_build/assets/ai21-0eb91ec3.png +0 -0
  255. helm/benchmark/static_build/assets/aleph-alpha-7ce10034.png +0 -0
  256. helm/benchmark/static_build/assets/anthropic-70d8bc39.png +0 -0
  257. helm/benchmark/static_build/assets/bigscience-7f0400c0.png +0 -0
  258. helm/benchmark/static_build/assets/cohere-3550c6cb.png +0 -0
  259. helm/benchmark/static_build/assets/crfm-logo-74391ab8.png +0 -0
  260. helm/benchmark/static_build/assets/eleutherai-b9451114.png +0 -0
  261. helm/benchmark/static_build/assets/google-06d997ad.png +0 -0
  262. helm/benchmark/static_build/assets/heim-logo-3e5e3aa4.png +0 -0
  263. helm/benchmark/static_build/assets/helm-logo-simple-2ed5400b.png +0 -0
  264. helm/benchmark/static_build/assets/helmhero-28e90f4d.png +0 -0
  265. helm/benchmark/static_build/assets/index-5088afcb.css +1 -0
  266. helm/benchmark/static_build/assets/index-d839df55.js +9 -0
  267. helm/benchmark/static_build/assets/meta-5580e9f1.png +0 -0
  268. helm/benchmark/static_build/assets/microsoft-f5ee5016.png +0 -0
  269. helm/benchmark/static_build/assets/mistral-18e1be23.png +0 -0
  270. helm/benchmark/static_build/assets/nvidia-86fa75c1.png +0 -0
  271. helm/benchmark/static_build/assets/openai-3f8653e4.png +0 -0
  272. helm/benchmark/static_build/assets/react-d4a0b69b.js +85 -0
  273. helm/benchmark/static_build/assets/recharts-6d337683.js +97 -0
  274. helm/benchmark/static_build/assets/tii-24de195c.png +0 -0
  275. helm/benchmark/static_build/assets/together-a665a35b.png +0 -0
  276. helm/benchmark/static_build/assets/tremor-54a99cc4.js +10 -0
  277. helm/benchmark/static_build/assets/tsinghua-keg-97d4b395.png +0 -0
  278. helm/benchmark/static_build/assets/vhelm-framework-cde7618a.png +0 -0
  279. helm/benchmark/static_build/assets/vhelm-model-6d812526.png +0 -0
  280. helm/benchmark/static_build/assets/yandex-38e09d70.png +0 -0
  281. helm/benchmark/static_build/config.js +4 -0
  282. helm/benchmark/static_build/index.html +20 -0
  283. helm/benchmark/test_data_preprocessor.py +3 -3
  284. helm/benchmark/test_model_deployment_definition.py +90 -0
  285. helm/benchmark/test_run_expander.py +1 -1
  286. helm/benchmark/tokenizer_config_registry.py +10 -14
  287. helm/benchmark/window_services/ai21_window_service.py +22 -33
  288. helm/benchmark/window_services/cohere_window_service.py +1 -63
  289. helm/benchmark/window_services/default_window_service.py +2 -35
  290. helm/benchmark/window_services/encoder_decoder_window_service.py +0 -11
  291. helm/benchmark/window_services/ice_window_service.py +0 -34
  292. helm/benchmark/window_services/image_generation/__init__.py +0 -0
  293. helm/benchmark/window_services/image_generation/clip_window_service.py +15 -0
  294. helm/benchmark/window_services/image_generation/lexica_search_window_service.py +9 -0
  295. helm/benchmark/window_services/image_generation/openai_dalle_window_service.py +9 -0
  296. helm/benchmark/window_services/image_generation/test_clip_window_service.py +29 -0
  297. helm/benchmark/window_services/image_generation/test_openai_dalle_window_service.py +30 -0
  298. helm/benchmark/window_services/local_window_service.py +21 -4
  299. helm/benchmark/window_services/no_decoding_window_service.py +32 -0
  300. helm/benchmark/window_services/test_anthropic_window_service.py +2 -1
  301. helm/benchmark/window_services/test_bloom_window_service.py +2 -1
  302. helm/benchmark/window_services/test_cohere_window_service.py +2 -1
  303. helm/benchmark/window_services/test_flan_t5_window_service.py +2 -1
  304. helm/benchmark/window_services/test_gpt2_window_service.py +2 -2
  305. helm/benchmark/window_services/test_gpt4_window_service.py +2 -1
  306. helm/benchmark/window_services/test_gptj_window_service.py +3 -2
  307. helm/benchmark/window_services/test_gptneox_window_service.py +3 -2
  308. helm/benchmark/window_services/test_ice_window_service.py +2 -1
  309. helm/benchmark/window_services/test_openai_window_service.py +2 -1
  310. helm/benchmark/window_services/test_opt_window_service.py +3 -2
  311. helm/benchmark/window_services/test_palmyra_window_service.py +2 -1
  312. helm/benchmark/window_services/test_t0pp_window_service.py +2 -1
  313. helm/benchmark/window_services/test_t511b_window_service.py +2 -1
  314. helm/benchmark/window_services/test_ul2_window_service.py +2 -1
  315. helm/benchmark/window_services/test_utils.py +3 -2
  316. helm/benchmark/window_services/test_yalm_window_service.py +2 -1
  317. helm/benchmark/window_services/window_service.py +42 -0
  318. helm/benchmark/window_services/window_service_factory.py +24 -269
  319. helm/benchmark/window_services/yalm_window_service.py +0 -27
  320. helm/clients/__init__.py +0 -0
  321. helm/{proxy/clients → clients}/ai21_client.py +5 -12
  322. helm/clients/aleph_alpha_client.py +112 -0
  323. helm/{proxy/clients → clients}/anthropic_client.py +213 -24
  324. helm/clients/auto_client.py +215 -0
  325. helm/clients/bedrock_client.py +128 -0
  326. helm/clients/bedrock_utils.py +72 -0
  327. helm/{proxy/clients → clients}/client.py +67 -55
  328. helm/clients/clip_score_client.py +49 -0
  329. helm/clients/clip_scorers/__init__.py +0 -0
  330. helm/clients/clip_scorers/base_clip_scorer.py +18 -0
  331. helm/clients/clip_scorers/clip_scorer.py +50 -0
  332. helm/clients/clip_scorers/multilingual_clip_scorer.py +50 -0
  333. helm/{proxy/clients → clients}/cohere_client.py +6 -17
  334. helm/clients/gcs_client.py +82 -0
  335. helm/{proxy/clients → clients}/google_client.py +7 -8
  336. helm/clients/google_translate_client.py +35 -0
  337. helm/{proxy/clients → clients}/http_model_client.py +6 -10
  338. helm/{proxy/clients → clients}/huggingface_client.py +134 -92
  339. helm/clients/image_generation/__init__.py +0 -0
  340. helm/clients/image_generation/adobe_vision_client.py +78 -0
  341. helm/clients/image_generation/aleph_alpha_image_generation_client.py +98 -0
  342. helm/clients/image_generation/cogview2/__init__.py +0 -0
  343. helm/clients/image_generation/cogview2/coglm_strategy.py +96 -0
  344. helm/clients/image_generation/cogview2/coglm_utils.py +82 -0
  345. helm/clients/image_generation/cogview2/sr_pipeline/__init__.py +15 -0
  346. helm/clients/image_generation/cogview2/sr_pipeline/direct_sr.py +96 -0
  347. helm/clients/image_generation/cogview2/sr_pipeline/dsr_model.py +254 -0
  348. helm/clients/image_generation/cogview2/sr_pipeline/dsr_sampling.py +190 -0
  349. helm/clients/image_generation/cogview2/sr_pipeline/iterative_sr.py +141 -0
  350. helm/clients/image_generation/cogview2/sr_pipeline/itersr_model.py +269 -0
  351. helm/clients/image_generation/cogview2/sr_pipeline/itersr_sampling.py +120 -0
  352. helm/clients/image_generation/cogview2/sr_pipeline/sr_group.py +42 -0
  353. helm/clients/image_generation/cogview2_client.py +191 -0
  354. helm/clients/image_generation/dalle2_client.py +192 -0
  355. helm/clients/image_generation/dalle3_client.py +108 -0
  356. helm/clients/image_generation/dalle_mini/__init__.py +3 -0
  357. helm/clients/image_generation/dalle_mini/data.py +442 -0
  358. helm/clients/image_generation/dalle_mini/model/__init__.py +5 -0
  359. helm/clients/image_generation/dalle_mini/model/configuration.py +175 -0
  360. helm/clients/image_generation/dalle_mini/model/modeling.py +1834 -0
  361. helm/clients/image_generation/dalle_mini/model/partitions.py +84 -0
  362. helm/clients/image_generation/dalle_mini/model/processor.py +63 -0
  363. helm/clients/image_generation/dalle_mini/model/text.py +251 -0
  364. helm/clients/image_generation/dalle_mini/model/tokenizer.py +9 -0
  365. helm/clients/image_generation/dalle_mini/model/utils.py +29 -0
  366. helm/clients/image_generation/dalle_mini/vqgan_jax/__init__.py +1 -0
  367. helm/clients/image_generation/dalle_mini/vqgan_jax/configuration_vqgan.py +40 -0
  368. helm/clients/image_generation/dalle_mini/vqgan_jax/convert_pt_model_to_jax.py +107 -0
  369. helm/clients/image_generation/dalle_mini/vqgan_jax/modeling_flax_vqgan.py +610 -0
  370. helm/clients/image_generation/dalle_mini_client.py +190 -0
  371. helm/clients/image_generation/deep_floyd_client.py +78 -0
  372. helm/clients/image_generation/huggingface_diffusers_client.py +249 -0
  373. helm/clients/image_generation/image_generation_client_utils.py +9 -0
  374. helm/clients/image_generation/lexica_client.py +86 -0
  375. helm/clients/image_generation/mindalle/__init__.py +0 -0
  376. helm/clients/image_generation/mindalle/models/__init__.py +216 -0
  377. helm/clients/image_generation/mindalle/models/stage1/__init__.py +0 -0
  378. helm/clients/image_generation/mindalle/models/stage1/layers.py +312 -0
  379. helm/clients/image_generation/mindalle/models/stage1/vqgan.py +103 -0
  380. helm/clients/image_generation/mindalle/models/stage2/__init__.py +0 -0
  381. helm/clients/image_generation/mindalle/models/stage2/layers.py +144 -0
  382. helm/clients/image_generation/mindalle/models/stage2/transformer.py +268 -0
  383. helm/clients/image_generation/mindalle/models/tokenizer.py +30 -0
  384. helm/clients/image_generation/mindalle/utils/__init__.py +3 -0
  385. helm/clients/image_generation/mindalle/utils/config.py +129 -0
  386. helm/clients/image_generation/mindalle/utils/sampling.py +149 -0
  387. helm/clients/image_generation/mindalle/utils/utils.py +89 -0
  388. helm/clients/image_generation/mindalle_client.py +115 -0
  389. helm/clients/image_generation/nudity_check_client.py +64 -0
  390. helm/clients/image_generation/together_image_generation_client.py +111 -0
  391. helm/{proxy/clients → clients}/lit_gpt_client.py +7 -5
  392. helm/{proxy/clients → clients}/megatron_client.py +13 -7
  393. helm/clients/mistral_client.py +134 -0
  394. helm/clients/moderation_api_client.py +109 -0
  395. helm/clients/open_lm_client.py +43 -0
  396. helm/clients/openai_client.py +302 -0
  397. helm/{proxy/clients → clients}/palmyra_client.py +15 -12
  398. helm/{proxy/clients → clients}/perspective_api_client.py +7 -8
  399. helm/clients/simple_client.py +64 -0
  400. helm/{proxy/clients → clients}/test_auto_client.py +15 -15
  401. helm/clients/test_client.py +100 -0
  402. helm/clients/test_huggingface_client.py +70 -0
  403. helm/clients/test_simple_client.py +19 -0
  404. helm/{proxy/clients → clients}/test_together_client.py +23 -12
  405. helm/{proxy/clients → clients}/together_client.py +18 -71
  406. helm/clients/vertexai_client.py +391 -0
  407. helm/clients/vision_language/__init__.py +0 -0
  408. helm/clients/vision_language/huggingface_vlm_client.py +104 -0
  409. helm/{proxy/clients → clients}/vision_language/idefics_client.py +59 -52
  410. helm/clients/vision_language/open_flamingo/__init__.py +2 -0
  411. helm/clients/vision_language/open_flamingo/src/__init__.py +0 -0
  412. helm/clients/vision_language/open_flamingo/src/factory.py +147 -0
  413. helm/clients/vision_language/open_flamingo/src/flamingo.py +337 -0
  414. helm/clients/vision_language/open_flamingo/src/flamingo_lm.py +155 -0
  415. helm/clients/vision_language/open_flamingo/src/helpers.py +267 -0
  416. helm/clients/vision_language/open_flamingo/src/utils.py +47 -0
  417. helm/clients/vision_language/open_flamingo_client.py +155 -0
  418. helm/clients/vision_language/qwen_vlm_client.py +171 -0
  419. helm/clients/vllm_client.py +46 -0
  420. helm/common/cache.py +24 -179
  421. helm/common/cache_backend_config.py +47 -0
  422. helm/common/clip_score_request.py +41 -0
  423. helm/common/concurrency.py +32 -0
  424. helm/common/credentials_utils.py +28 -0
  425. helm/common/file_caches/__init__.py +0 -0
  426. helm/common/file_caches/file_cache.py +16 -0
  427. helm/common/file_caches/local_file_cache.py +61 -0
  428. helm/common/file_caches/test_local_file_cache.py +25 -0
  429. helm/common/file_upload_request.py +27 -0
  430. helm/common/general.py +29 -10
  431. helm/common/image_generation_parameters.py +25 -0
  432. helm/common/images_utils.py +24 -1
  433. helm/common/key_value_store.py +113 -0
  434. helm/common/media_object.py +13 -0
  435. helm/common/moderations_api_request.py +71 -0
  436. helm/common/mongo_key_value_store.py +88 -0
  437. helm/common/multimodal_request_utils.py +31 -0
  438. helm/common/nudity_check_request.py +29 -0
  439. helm/common/object_spec.py +2 -2
  440. helm/common/request.py +36 -27
  441. helm/common/test_general.py +6 -0
  442. helm/common/tokenization_request.py +6 -3
  443. helm/config/__init__.py +0 -0
  444. helm/config/model_deployments.yaml +1942 -0
  445. helm/config/model_metadata.yaml +2201 -0
  446. helm/config/tokenizer_configs.yaml +362 -0
  447. helm/proxy/accounts.py +31 -4
  448. helm/proxy/critique/mechanical_turk_critique_importer.py +3 -0
  449. helm/proxy/critique/model_critique_client.py +13 -5
  450. helm/proxy/example_queries.py +29 -17
  451. helm/proxy/retry.py +8 -2
  452. helm/proxy/server.py +77 -5
  453. helm/proxy/services/remote_service.py +31 -0
  454. helm/proxy/services/server_service.py +103 -20
  455. helm/proxy/services/service.py +34 -2
  456. helm/proxy/services/test_remote_service.py +7 -6
  457. helm/proxy/services/test_service.py +27 -18
  458. helm/proxy/test_accounts.py +32 -0
  459. helm/proxy/token_counters/auto_token_counter.py +37 -37
  460. helm/proxy/token_counters/test_auto_token_counter.py +164 -0
  461. helm/proxy/token_counters/token_counter.py +3 -5
  462. helm/py.typed +0 -0
  463. helm/tokenizers/__init__.py +0 -0
  464. helm/{proxy/tokenizers → tokenizers}/ai21_tokenizer.py +3 -3
  465. helm/{proxy/tokenizers → tokenizers}/aleph_alpha_tokenizer.py +3 -1
  466. helm/{proxy/tokenizers → tokenizers}/anthropic_tokenizer.py +17 -11
  467. helm/tokenizers/auto_tokenizer.py +93 -0
  468. helm/{proxy/tokenizers → tokenizers}/caching_tokenizer.py +8 -2
  469. helm/{proxy/tokenizers → tokenizers}/cohere_tokenizer.py +1 -1
  470. helm/{proxy/tokenizers → tokenizers}/http_model_tokenizer.py +3 -3
  471. helm/{proxy/tokenizers → tokenizers}/huggingface_tokenizer.py +56 -60
  472. helm/tokenizers/simple_tokenizer.py +33 -0
  473. helm/tokenizers/test_anthropic_tokenizer.py +82 -0
  474. helm/tokenizers/test_huggingface_tokenizer.py +136 -0
  475. helm/tokenizers/test_simple_tokenizer.py +33 -0
  476. helm/tokenizers/vertexai_tokenizer.py +97 -0
  477. helm/{proxy/tokenizers → tokenizers}/yalm_tokenizer.py +5 -3
  478. helm/tokenizers/yalm_tokenizer_data/__init__.py +0 -0
  479. helm/tokenizers/yalm_tokenizer_data/voc_100b.sp +0 -0
  480. helm/{proxy/tokenizers → tokenizers}/yalm_tokenizer_data/yalm_tokenizer.py +1 -1
  481. crfm_helm-0.3.0.dist-info/RECORD +0 -396
  482. helm/benchmark/vlm_run_specs.py +0 -71
  483. helm/benchmark/window_services/anthropic_window_service.py +0 -68
  484. helm/benchmark/window_services/bloom_window_service.py +0 -35
  485. helm/benchmark/window_services/flan_t5_window_service.py +0 -29
  486. helm/benchmark/window_services/gpt2_window_service.py +0 -32
  487. helm/benchmark/window_services/gptj_window_service.py +0 -38
  488. helm/benchmark/window_services/gptneox_window_service.py +0 -41
  489. helm/benchmark/window_services/http_model_window_service.py +0 -28
  490. helm/benchmark/window_services/huggingface_window_service.py +0 -59
  491. helm/benchmark/window_services/lit_gpt_window_service.py +0 -27
  492. helm/benchmark/window_services/llama_window_service.py +0 -28
  493. helm/benchmark/window_services/luminous_window_service.py +0 -67
  494. helm/benchmark/window_services/megatron_window_service.py +0 -10
  495. helm/benchmark/window_services/mt_nlg_window_service.py +0 -27
  496. helm/benchmark/window_services/openai_window_service.py +0 -13
  497. helm/benchmark/window_services/opt_window_service.py +0 -35
  498. helm/benchmark/window_services/palmyra_window_service.py +0 -45
  499. helm/benchmark/window_services/remote_window_service.py +0 -48
  500. helm/benchmark/window_services/santacoder_window_service.py +0 -27
  501. helm/benchmark/window_services/starcoder_window_service.py +0 -27
  502. helm/benchmark/window_services/t0pp_window_service.py +0 -35
  503. helm/benchmark/window_services/t511b_window_service.py +0 -30
  504. helm/benchmark/window_services/test_mt_nlg_window_service.py +0 -48
  505. helm/benchmark/window_services/ul2_window_service.py +0 -30
  506. helm/benchmark/window_services/wider_ai21_window_service.py +0 -24
  507. helm/benchmark/window_services/wider_openai_window_service.py +0 -52
  508. helm/proxy/clients/aleph_alpha_client.py +0 -99
  509. helm/proxy/clients/auto_client.py +0 -461
  510. helm/proxy/clients/goose_ai_client.py +0 -100
  511. helm/proxy/clients/microsoft_client.py +0 -182
  512. helm/proxy/clients/openai_client.py +0 -206
  513. helm/proxy/clients/remote_model_registry.py +0 -28
  514. helm/proxy/clients/simple_client.py +0 -61
  515. helm/proxy/clients/test_anthropic_client.py +0 -63
  516. helm/proxy/clients/test_client.py +0 -31
  517. helm/proxy/clients/test_huggingface_client.py +0 -87
  518. helm/proxy/models.py +0 -963
  519. helm/proxy/test_models.py +0 -27
  520. helm/proxy/token_counters/ai21_token_counter.py +0 -20
  521. helm/proxy/token_counters/cohere_token_counter.py +0 -13
  522. helm/proxy/token_counters/free_token_counter.py +0 -12
  523. helm/proxy/token_counters/gooseai_token_counter.py +0 -24
  524. helm/proxy/token_counters/openai_token_counter.py +0 -22
  525. helm/proxy/token_counters/test_ai21_token_counter.py +0 -86
  526. helm/proxy/token_counters/test_openai_token_counter.py +0 -79
  527. helm/proxy/tokenizers/simple_tokenizer.py +0 -32
  528. helm/proxy/tokenizers/test_huggingface_tokenizer.py +0 -56
  529. {crfm_helm-0.3.0.dist-info → crfm_helm-0.5.0.dist-info}/LICENSE +0 -0
  530. {crfm_helm-0.3.0.dist-info → crfm_helm-0.5.0.dist-info}/entry_points.txt +0 -0
  531. {crfm_helm-0.3.0.dist-info → crfm_helm-0.5.0.dist-info}/top_level.txt +0 -0
  532. /helm/{proxy/clients → benchmark/annotation}/__init__.py +0 -0
  533. /helm/{proxy/clients/vision_language → benchmark/annotation/image2structure}/__init__.py +0 -0
  534. /helm/{proxy/tokenizers → benchmark/metrics/image_generation}/__init__.py +0 -0
  535. /helm/{proxy/tokenizers/yalm_tokenizer_data → benchmark/metrics/image_generation/detectors}/__init__.py +0 -0
  536. /helm/{proxy/clients → clients}/ai21_utils.py +0 -0
  537. /helm/{proxy/clients → clients}/cohere_utils.py +0 -0
  538. /helm/{proxy/clients → clients}/lit_gpt_generate.py +0 -0
  539. /helm/{proxy/clients → clients}/toxicity_classifier_client.py +0 -0
  540. /helm/{proxy/tokenizers → tokenizers}/ice_tokenizer.py +0 -0
  541. /helm/{proxy/tokenizers → tokenizers}/lit_gpt_tokenizer.py +0 -0
  542. /helm/{proxy/tokenizers → tokenizers}/test_ice_tokenizer.py +0 -0
  543. /helm/{proxy/tokenizers → tokenizers}/test_yalm_tokenizer.py +0 -0
  544. /helm/{proxy/tokenizers → tokenizers}/tiktoken_tokenizer.py +0 -0
  545. /helm/{proxy/tokenizers → tokenizers}/tokenizer.py +0 -0
  546. /helm/{proxy/tokenizers → tokenizers}/yalm_tokenizer_data/test_yalm_tokenizer.py +0 -0
@@ -1,923 +1,4 @@
1
1
  ---
2
- ############################################################
3
- models:
4
- # AI21 Labs
5
- - name: ai21/j1-jumbo
6
- display_name: J1-Jumbo v1 (178B)
7
- description: Jurassic-1 Jumbo (178B parameters) ([docs](https://studio.ai21.com/docs/jurassic1-language-models/), [tech report](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)).
8
- creator_organization: AI21 Labs
9
- access: limited
10
- num_parameters: 178000000000
11
- release_date: 2021-08-11
12
- - name: ai21/j1-large
13
- display_name: J1-Large v1 (7.5B)
14
- description: Jurassic-1 Large (7.5B parameters) ([docs](https://studio.ai21.com/docs/jurassic1-language-models/), [tech report](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)).
15
- creator_organization: AI21 Labs
16
- access: limited
17
- num_parameters: 7500000000
18
- release_date: 2021-08-11
19
- - name: ai21/j1-grande
20
- display_name: J1-Grande v1 (17B)
21
- description: Jurassic-1 Grande (17B parameters) with a "few tweaks" to the training process ([docs](https://studio.ai21.com/docs/jurassic1-language-models/), [tech report](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)).
22
- creator_organization: AI21 Labs
23
- access: limited
24
- num_parameters: 17000000000
25
- release_date: 2022-05-03
26
- - name: ai21/j1-grande-v2-beta
27
- display_name: J1-Grande v2 beta (17B)
28
- description: Jurassic-1 Grande v2 beta (17B parameters)
29
- creator_organization: AI21 Labs
30
- access: limited
31
- num_parameters: 17000000000
32
- release_date: 2022-10-28
33
- - name: ai21/j2-jumbo
34
- display_name: Jurassic-2 Jumbo (178B)
35
- description: Jurassic-2 Jumbo (178B parameters) ([docs](https://www.ai21.com/blog/introducing-j2))
36
- creator_organization: AI21 Labs
37
- access: limited
38
- num_parameters: 178000000000
39
- release_date: 2023-03-09
40
- - name: ai21/j2-grande
41
- display_name: Jurassic-2 Grande (17B)
42
- description: Jurassic-2 Grande (17B parameters) ([docs](https://www.ai21.com/blog/introducing-j2))
43
- creator_organization: AI21 Labs
44
- access: limited
45
- num_parameters: 17000000000
46
- release_date: 2023-03-09
47
- - name: ai21/j2-large
48
- display_name: Jurassic-2 Large (7.5B)
49
- description: Jurassic-2 Large (7.5B parameters) ([docs](https://www.ai21.com/blog/introducing-j2))
50
- creator_organization: AI21 Labs
51
- access: limited
52
- num_parameters: 7500000000
53
- release_date: 2023-03-09
54
-
55
- # Aleph Alpha
56
- # TODO: add Luminous World when it's released
57
- - name: AlephAlpha/luminous-base
58
- display_name: Luminous Base (13B)
59
- description: Luminous Base (13B parameters) ([docs](https://docs.aleph-alpha.com/docs/introduction/luminous/))
60
- creator_organization: Aleph Alpha
61
- access: limited
62
- num_parameters: 13000000000
63
- # TODO: get exact release date
64
- release_date: 2022-01-01
65
- - name: AlephAlpha/luminous-extended
66
- display_name: Luminous Extended (30B)
67
- description: Luminous Extended (30B parameters) ([docs](https://docs.aleph-alpha.com/docs/introduction/luminous/))
68
- creator_organization: Aleph Alpha
69
- access: limited
70
- num_parameters: 30000000000
71
- release_date: 2022-01-01
72
- - name: AlephAlpha/luminous-supreme
73
- display_name: Luminous Supreme (70B)
74
- description: Luminous Supreme (70B parameters) ([docs](https://docs.aleph-alpha.com/docs/introduction/luminous/))
75
- creator_organization: Aleph Alpha
76
- access: limited
77
- num_parameters: 70000000000
78
- release_date: 2022-01-01
79
-
80
- # TODO: Remove Once we have configurable model names
81
- - name: neurips/local
82
- display_name: Local service
83
- description: Local competition service
84
- creator_organization: neurips
85
- access: open
86
- num_parameters: 1
87
- release_date: 2021-12-01
88
-
89
-
90
- # Anthropic
91
- - name: anthropic/stanford-online-all-v4-s3
92
- display_name: Anthropic-LM v4-s3 (52B)
93
- description: A 52B parameter language model, trained using reinforcement learning from human feedback [paper](https://arxiv.org/pdf/2204.05862.pdf).
94
- creator_organization: Anthropic
95
- access: closed
96
- num_parameters: 52000000000
97
- release_date: 2021-12-01
98
- - name: anthropic/claude-2.0
99
- display_name: Anthropic Claude 2.0
100
- description: Claude 2.0 is a general purpose large language model developed by Anthropic. It uses a transformer architecture and is trained via unsupervised learning, RLHF, and Constitutional AI (including both a supervised and Reinforcement Learning (RL) phase). ([model card](https://efficient-manatee.files.svdcdn.com/production/images/Model-Card-Claude-2.pdf))
101
- creator_organization: Anthropic
102
- access: limited
103
- release_date: 2023-07-11
104
- - name: anthropic/claude-v1.3
105
- display_name: Anthropic Claude v1.3
106
- description: A model trained using reinforcement learning from human feedback ([docs](https://www.anthropic.com/index/introducing-claude)).
107
- creator_organization: Anthropic
108
- access: limited
109
- release_date: 2023-03-17
110
- - name: anthropic/claude-instant-v1
111
- display_name: Anthropic Claude Instant V1
112
- description: A lightweight version of Claude, a model trained using reinforcement learning from human feedback ([docs](https://www.anthropic.com/index/introducing-claude)).
113
- creator_organization: Anthropic
114
- access: limited
115
- release_date: 2023-03-17
116
-
117
- # Berkeley
118
- - name: together/koala-13b
119
- display_name: Koala (13B)
120
- description: Koala (13B) is a chatbot fine-tuned from Llama (13B) on dialogue data gathered from the web. ([blog post](https://bair.berkeley.edu/blog/2023/04/03/koala/))
121
- creator_organization: UC Berkeley
122
- access: open
123
- num_parameters: 13000000000
124
- release_date: 2022-04-03
125
- todo: true
126
-
127
- # BigScience
128
- - name: together/bloom
129
- display_name: BLOOM (176B)
130
- description: BLOOM (176B parameters) is an autoregressive model trained on 46 natural languages and 13 programming languages ([paper](https://arxiv.org/pdf/2211.05100.pdf)).
131
- creator_organization: BigScience
132
- access: open
133
- num_parameters: 176000000000
134
- release_date: 2022-06-28
135
- - name: together/bloomz
136
- display_name: BLOOMZ (176B)
137
- description: BLOOMZ (176B parameters) is BLOOM that has been fine-tuned on natural language instructions ([details](https://huggingface.co/bigscience/bloomz)).
138
- creator_organization: BigScience
139
- access: open
140
- num_parameters: 176000000000
141
- release_date: 2022-11-03
142
- todo: true
143
- - name: together/t0pp
144
- display_name: T0pp (11B)
145
- description: T0pp (11B parameters) is an encoder-decoder model trained on a large set of different tasks specified in natural language prompts ([paper](https://arxiv.org/pdf/2110.08207.pdf)).
146
- creator_organization: BigScience
147
- access: open
148
- num_parameters: 11000000000
149
- release_date: 2021-10-15
150
-
151
- # BigCode
152
- - name: huggingface/santacoder
153
- display_name: SantaCoder (1.1B)
154
- description: SantaCoder (1.1B parameters) model trained on the Python, Java, and JavaScript subset of The Stack (v1.1) ([model card](https://huggingface.co/bigcode/santacoder)).
155
- creator_organization: BigCode
156
- access: open
157
- - name: huggingface/starcoder
158
- display_name: StarCoder (15.5B)
159
- description: The StarCoder (15.5B parameter) model trained on 80+ programming languages from The Stack (v1.2) ([model card](https://huggingface.co/bigcode/starcoder)).
160
- creator_organization: BigCode
161
- access: open
162
-
163
- # Cerebras Systems
164
- - name: together/cerebras-gpt-6.7b
165
- display_name: Cerebras GPT (6.7B)
166
- description: Cerebras GPT is a family of open compute-optimal language models scaled from 111M to 13B parameters trained on the Eleuther Pile. ([paper](https://arxiv.org/pdf/2304.03208.pdf))
167
- creator_organization: Cerebras
168
- access: limited
169
- num_parameters: 6700000000
170
- release_date: 2023-04-06
171
- todo: true
172
- - name: together/cerebras-gpt-13b
173
- display_name: Cerebras GPT (13B)
174
- description: Cerebras GPT is a family of open compute-optimal language models scaled from 111M to 13B parameters trained on the Eleuther Pile. ([paper](https://arxiv.org/pdf/2304.03208.pdf))
175
- creator_organization: Cerebras
176
- access: limited
177
- num_parameters: 13000000000
178
- release_date: 2023-04-06
179
- todo: true
180
-
181
- # Cohere
182
- - name: cohere/xlarge-20220609
183
- display_name: Cohere xlarge v20220609 (52.4B)
184
- description: Cohere xlarge v20220609 (52.4B parameters)
185
- creator_organization: Cohere
186
- access: limited
187
- num_parameters: 52400000000
188
- release_date: 2022-06-09
189
- - name: cohere/large-20220720
190
- display_name: Cohere large v20220720 (13.1B)
191
- description: Cohere large v20220720 (13.1B parameters), which is deprecated by Cohere as of December 2, 2022.
192
- creator_organization: Cohere
193
- access: limited
194
- num_parameters: 13100000000
195
- release_date: 2022-07-20
196
- - name: cohere/medium-20220720
197
- display_name: Cohere medium v20220720 (6.1B)
198
- description: Cohere medium v20220720 (6.1B parameters)
199
- creator_organization: Cohere
200
- access: limited
201
- num_parameters: 6100000000
202
- release_date: 2022-07-20
203
- - name: cohere/small-20220720
204
- display_name: Cohere small v20220720 (410M)
205
- description: Cohere small v20220720 (410M parameters), which is deprecated by Cohere as of December 2, 2022.
206
- creator_organization: Cohere
207
- access: limited
208
- num_parameters: 410000000
209
- release_date: 2022-07-20
210
- - name: cohere/xlarge-20221108
211
- display_name: Cohere xlarge v20221108 (52.4B)
212
- description: Cohere xlarge v20221108 (52.4B parameters)
213
- creator_organization: Cohere
214
- access: limited
215
- num_parameters: 52400000000
216
- release_date: 2022-11-08
217
- - name: cohere/medium-20221108
218
- display_name: Cohere medium v20221108 (6.1B)
219
- description: Cohere medium v20221108 (6.1B parameters)
220
- creator_organization: Cohere
221
- access: limited
222
- num_parameters: 6100000000
223
- release_date: 2022-11-08
224
- - name: cohere/command-medium-beta
225
- display_name: Cohere Command beta (6.1B)
226
- description: Cohere Command beta (6.1B parameters) is fine-tuned from the medium model to respond well with instruction-like prompts ([details](https://docs.cohere.ai/docs/command-beta)).
227
- creator_organization: Cohere
228
- access: limited
229
- num_parameters: 6100000000
230
- release_date: 2022-11-08
231
- - name: cohere/command-xlarge-beta
232
- display_name: Cohere Command beta (52.4B)
233
- description: Cohere Command beta (52.4B parameters) is fine-tuned from the XL model to respond well with instruction-like prompts ([details](https://docs.cohere.ai/docs/command-beta)).
234
- creator_organization: Cohere
235
- access: limited
236
- num_parameters: 52400000000
237
- release_date: 2022-11-08
238
-
239
- # Databricks
240
- - name: databricks/dolly-v2-3b
241
- display_name: Dolly V2 (3B)
242
- description: Dolly V2 (3B) is an instruction-following large language model trained on the Databricks machine learning platform. It is based on pythia-12b.
243
- creator_organization: Databricks
244
- access: open
245
- num_parameters: 2517652480
246
- release_date: 2023-04-12
247
- todo: true
248
- - name: databricks/dolly-v2-7b
249
- display_name: Dolly V2 (7B)
250
- description: Dolly V2 (7B) is an instruction-following large language model trained on the Databricks machine learning platform. It is based on pythia-12b.
251
- creator_organization: Databricks
252
- access: open
253
- num_parameters: 6444163072
254
- release_date: 2023-04-12
255
- todo: true
256
- - name: databricks/dolly-v2-12b
257
- display_name: Dolly V2 (12B)
258
- description: Dolly V2 (12B) is an instruction-following large language model trained on the Databricks machine learning platform. It is based on pythia-12b.
259
- creator_organization: Databricks
260
- access: open
261
- num_parameters: 11327027200
262
- release_date: 2023-04-12
263
- todo: true
264
-
265
- # DeepMind
266
- - name: deepmind/gopher
267
- display_name: Gopher (280B)
268
- description: Gopher (540B parameters) ([paper](https://arxiv.org/pdf/2112.11446.pdf)).
269
- creator_organization: DeepMind
270
- access: closed
271
- todo: true
272
- - name: deepmind/chinchilla
273
- display_name: Chinchilla (70B)
274
- description: Chinchilla (70B parameters) ([paper](https://arxiv.org/pdf/2203.15556.pdf)).
275
- creator_organization: DeepMind
276
- access: closed
277
- todo: true
278
-
279
- # EleutherAI
280
- - name: together/gpt-j-6b
281
- display_name: GPT-J (6B)
282
- description: GPT-J (6B parameters) autoregressive language model trained on The Pile ([details](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/)).
283
- creator_organization: EleutherAI
284
- access: open
285
- num_parameters: 6000000000
286
- release_date: 2021-06-04
287
- - name: together/gpt-neox-20b
288
- display_name: GPT-NeoX (20B)
289
- description: GPT-NeoX (20B parameters) autoregressive language model trained on The Pile ([paper](https://arxiv.org/pdf/2204.06745.pdf)).
290
- creator_organization: EleutherAI
291
- access: open
292
- num_parameters: 20000000000
293
- release_date: 2022-02-02
294
- - name: eleutherai/pythia-1b-v0
295
- display_name: Pythia (1B)
296
- description: Pythia (1B parameters). The Pythia project combines interpretability analysis and scaling laws to understand how knowledge develops and evolves during training in autoregressive transformers.
297
- creator_organization: EleutherAI
298
- access: open
299
- num_parameters: 805736448
300
- release_date: 2023-02-13
301
- todo: true
302
- - name: eleutherai/pythia-2.8b-v0
303
- display_name: Pythia (2.8B)
304
- description: Pythia (2.8B parameters). The Pythia project combines interpretability analysis and scaling laws to understand how knowledge develops and evolves during training in autoregressive transformers.
305
- creator_organization: EleutherAI
306
- access: open
307
- num_parameters: 2517652480
308
- release_date: 2023-02-13
309
- todo: true
310
- - name: eleutherai/pythia-6.9b
311
- display_name: Pythia (6.9B)
312
- description: Pythia (6.9B parameters). The Pythia project combines interpretability analysis and scaling laws to understand how knowledge develops and evolves during training in autoregressive transformers.
313
- creator_organization: EleutherAI
314
- access: open
315
- num_parameters: 6444163072
316
- release_date: 2023-02-13
317
- - name: eleutherai/pythia-12b-v0
318
- display_name: Pythia (12B)
319
- description: Pythia (12B parameters). The Pythia project combines interpretability analysis and scaling laws to understand how knowledge develops and evolves during training in autoregressive transformers.
320
- creator_organization: EleutherAI
321
- access: open
322
- num_parameters: 11327027200
323
- release_date: 2023-02-13
324
-
325
- # Google
326
- - name: together/t5-11b
327
- display_name: T5 (11B)
328
- description: T5 (11B parameters) is an encoder-decoder model trained on a multi-task mixture, where each task is converted into a text-to-text format ([paper](https://arxiv.org/pdf/1910.10683.pdf)).
329
- creator_organization: Google
330
- access: open
331
- num_parameters: 11000000000
332
- release_date: 2019-10-23
333
-
334
- - name: together/ul2
335
- display_name: UL2 (20B)
336
- description: UL2 (20B parameters) is an encoder-decoder model trained on the C4 corpus. It's similar to T5 but trained with a different objective and slightly different scaling knobs ([paper](https://arxiv.org/pdf/2205.05131.pdf)).
337
- creator_organization: Google
338
- access: open
339
- num_parameters: 20000000000
340
- release_date: 2022-05-10
341
-
342
- - name: together/flan-t5-xxl
343
- display_name: Flan-T5 (11B)
344
- description: Flan-T5 (11B parameters) is T5 fine-tuned on 1.8K tasks ([paper](https://arxiv.org/pdf/2210.11416.pdf)).
345
- creator_organization: Google
346
- access: open
347
-
348
- - name: google/palm
349
- display_name: PaLM (540B)
350
- description: Pathways Language Model (540B parameters) is trained using 6144 TPU v4 chips ([paper](https://arxiv.org/pdf/2204.02311.pdf)).
351
- creator_organization: Google
352
- access: closed
353
- todo: true
354
-
355
- # HazyResearch
356
- - name: together/h3-2.7b
357
- display_name: H3 (2.7B)
358
- description: H3 (2.7B parameters) is a decoder-only language model based on state space models ([paper](https://arxiv.org/abs/2212.14052)).
359
- creator_organization: HazyResearch
360
- access: open
361
- num_parameters: 2700000000
362
- release_date: 2023-01-23
363
- todo: true
364
-
365
- # Lightning AI's Lit-GPT
366
- - name: lightningai/lit-gpt
367
- display_name: Lit-GPT
368
- description: Lit-GPT is an optimized collection of open-source LLMs for finetuning and inference. It supports – Falcon, Llama 2, Vicuna, LongChat, and other top-performing open-source large language models.
369
- creator_organization: Lightning AI
370
- access: open
371
- num_parameters: 1
372
- release_date: 2023-04-04
373
-
374
-
375
- # Meta
376
- - name: together/opt-iml-175b
377
- display_name: OPT-IML (175B)
378
- description: OPT-IML (175B parameters) is a suite of decoder-only transformer LMs that are multi-task fine-tuned on 2000 datasets ([paper](https://arxiv.org/pdf/2212.12017.pdf)).
379
- creator_organization: Meta
380
- access: open
381
- num_parameters: 175000000000
382
- release_date: 2022-12-22
383
- todo: true
384
-
385
- - name: together/opt-iml-30b
386
- display_name: OPT-IML (30B)
387
- description: OPT-IML (30B parameters) is a suite of decoder-only transformer LMs that are multi-task fine-tuned on 2000 datasets ([paper](https://arxiv.org/pdf/2212.12017.pdf)).
388
- creator_organization: Meta
389
- access: open
390
- num_parameters: 30000000000
391
- release_date: 2022-12-22
392
- todo: true
393
-
394
- - name: together/opt-175b
395
- display_name: OPT (175B)
396
- description: Open Pre-trained Transformers (175B parameters) is a suite of decoder-only pre-trained transformers that are fully and responsibly shared with interested researchers ([paper](https://arxiv.org/pdf/2205.01068.pdf)).
397
- creator_organization: Meta
398
- access: open
399
- num_parameters: 175000000000
400
- release_date: 2022-05-02
401
-
402
- - name: together/opt-66b
403
- display_name: OPT (66B)
404
- description: Open Pre-trained Transformers (66B parameters) is a suite of decoder-only pre-trained transformers that are fully and responsibly shared with interested researchers ([paper](https://arxiv.org/pdf/2205.01068.pdf)).
405
- creator_organization: Meta
406
- access: open
407
- num_parameters: 66000000000
408
- release_date: 2022-05-02
409
-
410
- - name: together/opt-6.7b
411
- display_name: OPT (6.7B)
412
- description: Open Pre-trained Transformers (6.7B parameters) is a suite of decoder-only pre-trained transformers that are fully and responsibly shared with interested researchers ([paper](https://arxiv.org/pdf/2205.01068.pdf)).
413
- creator_organization: Meta
414
- access: open
415
- num_parameters: 6700000000
416
- release_date: 2022-05-02
417
-
418
- - name: together/opt-1.3b
419
- display_name: OPT (1.3B)
420
- description: Open Pre-trained Transformers (1.3B parameters) is a suite of decoder-only pre-trained transformers that are fully and responsibly shared with interested researchers ([paper](https://arxiv.org/pdf/2205.01068.pdf)).
421
- creator_organization: Meta
422
- access: open
423
- num_parameters: 1300000000
424
- release_date: 2022-05-02
425
-
426
- - name: together/galactica-120b
427
- display_name: Galactica (120B)
428
- description: Galactica (120B parameters) is trained on 48 million papers, textbooks, lectures notes, compounds and proteins, scientific websites, etc. ([paper](https://galactica.org/static/paper.pdf)).
429
- creator_organization: Meta
430
- access: open
431
- num_parameters: 120000000000
432
- release_date: 2022-11-15
433
- todo: true
434
-
435
- - name: together/galactica-30b
436
- display_name: Galactica (30B)
437
- description: Galactica (30B parameters) is trained on 48 million papers, textbooks, lectures notes, compounds and proteins, scientific websites, etc. ([paper](https://galactica.org/static/paper.pdf)).
438
- creator_organization: Meta
439
- access: open
440
- num_parameters: 30000000000
441
- release_date: 2022-11-15
442
- todo: true
443
- - name: meta/llama-7b
444
- display_name: LLaMA (7B)
445
- description: LLaMA is a collection of foundation language models ranging from 7B to 65B parameters.
446
- creator_organization: Meta
447
- access: open
448
- num_parameters: 7000000000
449
- release_date: 2023-02-24
450
- - name: meta/llama-13b
451
- display_name: LLaMA (13B)
452
- description: LLaMA is a collection of foundation language models ranging from 7B to 65B parameters.
453
- creator_organization: Meta
454
- access: open
455
- num_parameters: 13000000000
456
- release_date: 2023-02-24
457
- - name: meta/llama-30b
458
- display_name: LLaMA (30B)
459
- description: LLaMA is a collection of foundation language models ranging from 7B to 65B parameters.
460
- creator_organization: Meta
461
- access: open
462
- num_parameters: 30000000000
463
- release_date: 2023-02-24
464
- - name: meta/llama-65b
465
- display_name: LLaMA (65B)
466
- description: LLaMA is a collection of foundation language models ranging from 7B to 65B parameters.
467
- creator_organization: Meta
468
- access: open
469
- num_parameters: 65000000000
470
- release_date: 2023-02-24
471
- - name: meta/llama-2-7b
472
- display_name: Llama 2 (7B)
473
- description: Llama 2 pretrained models are trained on 2 trillion tokens, and have double the context length than Llama 1.
474
- creator_organization: Meta
475
- access: open
476
- num_parameters: 7000000000
477
- release_date: 2023-07-18
478
- - name: meta/llama-2-13b
479
- display_name: Llama 2 (13B)
480
- description: Llama 2 pretrained models are trained on 2 trillion tokens, and have double the context length than Llama 1.
481
- creator_organization: Meta
482
- access: open
483
- num_parameters: 13000000000
484
- release_date: 2023-07-18
485
- - name: meta/llama-2-70b
486
- display_name: Llama 2 (70B)
487
- description: Llama 2 pretrained models are trained on 2 trillion tokens, and have double the context length than Llama 1.
488
- creator_organization: Meta
489
- access: open
490
- num_parameters: 70000000000
491
- release_date: 2023-07-18
492
-
493
- # Stability AI
494
- - name: stabilityai/stablelm-base-alpha-3b
495
- display_name: StableLM-Base-Alpha (3B)
496
- description: StableLM-Base-Alpha is a suite of 3B and 7B parameter decoder-only language models pre-trained on a diverse collection of English datasets with a sequence length of 4096 to push beyond the context window limitations of existing open-source language models.
497
- creator_organization: Stability AI
498
- access: open
499
- num_parameters: 3000000000
500
- release_date: 2023-04-20
501
- todo: true
502
-
503
- - name: stabilityai/stablelm-base-alpha-7b
504
- display_name: StableLM-Base-Alpha (7B)
505
- description: StableLM-Base-Alpha is a suite of 3B and 7B parameter decoder-only language models pre-trained on a diverse collection of English datasets with a sequence length of 4096 to push beyond the context window limitations of existing open-source language models.
506
- creator_organization: Stability AI
507
- access: open
508
- num_parameters: 7000000000
509
- release_date: 2023-04-20
510
- todo: true
511
-
512
- # Stanford
513
- - name: stanford/alpaca-7b
514
- display_name: Alpaca (7B)
515
- description: Alpaca 7B is a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations
516
- creator_organization: Stanford
517
- access: open
518
- num_parameters: 7000000000
519
- release_date: 2023-03-13
520
-
521
- # LMSYS
522
- - name: lmsys/vicuna-7b-v1.3
523
- display_name: Vicuna v1.3 (7B)
524
- description: Vicuna v1.3 (7B) is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.
525
- creator_organization: LMSYS
526
- access: open
527
- num_parameters: 7000000000
528
- release_date: 2023-06-22
529
- - name: lmsys/vicuna-13b-v1.3
530
- display_name: Vicuna v1.3 (13B)
531
- description: Vicuna v1.3 (13B) is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.
532
- creator_organization: LMSYS
533
- access: open
534
- num_parameters: 13000000000
535
- release_date: 2023-06-22
536
-
537
- # Mistral AI
538
- - name: mistralai/mistral-7b-v0.1
539
- display_name: Mistral v0.1 (7B)
540
- description: Mistral 7B is a 7.3B parameter transformer model that uses Grouped-Query Attention (GQA) and Sliding-Window Attention (SWA).
541
- creator_organization: Mistral AI
542
- access: open
543
- num_parameters: 7300000000
544
- release_date: 2023-09-27
545
-
546
- # Microsoft/NVIDIA
547
- - name: microsoft/TNLGv2_530B
548
- display_name: TNLG v2 (530B)
549
- description: TNLG v2 (530B parameters) autoregressive language model trained on a filtered subset of the Pile and CommonCrawl ([paper](https://arxiv.org/pdf/2201.11990.pdf)).
550
- creator_organization: Microsoft/NVIDIA
551
- access: closed
552
- num_parameters: 530000000000
553
- release_date: 2022-01-28
554
- - name: microsoft/TNLGv2_7B
555
- display_name: TNLG v2 (6.7B)
556
- description: TNLG v2 (6.7B parameters) autoregressive language model trained on a filtered subset of the Pile and CommonCrawl ([paper](https://arxiv.org/pdf/2201.11990.pdf)).
557
- creator_organization: Microsoft/NVIDIA
558
- access: closed
559
- num_parameters: 6700000000
560
- release_date: 2022-01-28
561
-
562
- # OpenAI: https://beta.openai.com/docs/engines/gpt-3
563
- - name: openai/davinci
564
- display_name: davinci (175B)
565
- description: Original GPT-3 (175B parameters) autoregressive language model ([paper](https://arxiv.org/pdf/2005.14165.pdf), [docs](https://beta.openai.com/docs/model-index-for-researchers)).
566
- creator_organization: OpenAI
567
- access: limited
568
- num_parameters: 175000000000
569
- release_date: 2020-05-28
570
- - name: openai/curie
571
- display_name: curie (6.7B)
572
- description: Original GPT-3 (6.7B parameters) autoregressive language model ([paper](https://arxiv.org/pdf/2005.14165.pdf), [docs](https://beta.openai.com/docs/model-index-for-researchers)).
573
- creator_organization: OpenAI
574
- access: limited
575
- num_parameters: 6700000000
576
- release_date: 2020-05-28
577
- - name: openai/babbage
578
- display_name: babbage (1.3B)
579
- description: Original GPT-3 (1.3B parameters) autoregressive language model ([paper](https://arxiv.org/pdf/2005.14165.pdf), [docs](https://beta.openai.com/docs/model-index-for-researchers)).
580
- creator_organization: OpenAI
581
- access: limited
582
- num_parameters: 1300000000
583
- release_date: 2020-05-28
584
- - name: openai/ada
585
- display_name: ada (350M)
586
- description: Original GPT-3 (350M parameters) autoregressive language model ([paper](https://arxiv.org/pdf/2005.14165.pdf), [docs](https://beta.openai.com/docs/model-index-for-researchers)).
587
- creator_organization: OpenAI
588
- access: limited
589
- num_parameters: 350000000
590
- release_date: 2020-05-28
591
- - name: openai/text-davinci-003
592
- display_name: text-davinci-003
593
- description: text-davinci-003 model that involves reinforcement learning (PPO) with reward models. Derived from text-davinci-002 ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
594
- creator_organization: OpenAI
595
- access: limited
596
- num_parameters: 175000000000
597
- release_date: 2022-11-28
598
- - name: openai/text-davinci-002
599
- display_name: text-davinci-002
600
- description: text-davinci-002 model that involves supervised fine-tuning on human-written demonstrations. Derived from code-davinci-002 ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
601
- creator_organization: OpenAI
602
- access: limited
603
- num_parameters: 175000000000
604
- release_date: 2022-01-27
605
- - name: openai/text-davinci-001
606
- display_name: text-davinci-001
607
- description: text-davinci-001 model that involves supervised fine-tuning on human-written demonstrations ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
608
- creator_organization: OpenAI
609
- access: limited
610
- num_parameters: 175000000000
611
- release_date: 2022-01-27
612
- todo: true
613
- - name: openai/text-curie-001
614
- display_name: text-curie-001
615
- description: text-curie-001 model that involves supervised fine-tuning on human-written demonstrations ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
616
- creator_organization: OpenAI
617
- access: limited
618
- num_parameters: 6700000000
619
- release_date: 2022-01-27
620
- - name: openai/text-babbage-001
621
- display_name: text-babbage-001
622
- description: text-babbage-001 model that involves supervised fine-tuning on human-written demonstrations ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
623
- creator_organization: OpenAI
624
- access: limited
625
- num_parameters: 1300000000
626
- release_date: 2022-01-27
627
- - name: openai/text-ada-001
628
- display_name: text-ada-001
629
- description: text-ada-001 model that involves supervised fine-tuning on human-written demonstrations ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
630
- creator_organization: OpenAI
631
- access: limited
632
- num_parameters: 350000000
633
- release_date: 2022-01-27
634
- - name: openai/gpt-4-0314
635
- display_name: gpt-4-0314
636
- description: GPT-4 is a large multimodal model (currently only accepting text inputs and emitting text outputs) that is optimized for chat but works well for traditional completions tasks. Snapshot of gpt-4 from March 14th 2023.
637
- creator_organization: OpenAI
638
- access: limited
639
- release_date: 2023-03-14
640
- - name: openai/gpt-4-32k-0314
641
- display_name: gpt-4-32k-0314
642
- description: GPT-4 is a large multimodal model (currently only accepting text inputs and emitting text outputs) that is optimized for chat but works well for traditional completions tasks. Snapshot of gpt-4 with a longer context length of 32,768 tokens from March 14th 2023.
643
- creator_organization: OpenAI
644
- access: limited
645
- release_date: 2023-03-14
646
- - name: openai/gpt-4-0613
647
- display_name: gpt-4-0613
648
- description: GPT-4 is a large multimodal model (currently only accepting text inputs and emitting text outputs) that is optimized for chat but works well for traditional completions tasks. Snapshot of gpt-4 from 2023-06-13.
649
- creator_organization: OpenAI
650
- access: limited
651
- release_date: 2023-06-13
652
- - name: openai/gpt-4-32k-0613
653
- display_name: gpt-4-32k-0613
654
- description: GPT-4 is a large multimodal model (currently only accepting text inputs and emitting text outputs) that is optimized for chat but works well for traditional completions tasks. Snapshot of gpt-4 with a longer context length of 32,768 tokens from 2023-06-13.
655
- creator_organization: OpenAI
656
- access: limited
657
- release_date: 2023-06-13
658
- - name: openai/code-davinci-002
659
- display_name: code-davinci-002
660
- description: Codex-style model that is designed for pure code-completion tasks ([docs](https://beta.openai.com/docs/models/codex)).
661
- creator_organization: OpenAI
662
- access: limited
663
- - name: openai/code-davinci-001
664
- display_name: code-davinci-001
665
- description: code-davinci-001 model
666
- creator_organization: OpenAI
667
- access: limited
668
- todo: true
669
- - name: openai/code-cushman-001
670
- display_name: code-cushman-001 (12B)
671
- description: Codex-style model that is a stronger, multilingual version of the Codex (12B) model in the [Codex paper](https://arxiv.org/pdf/2107.03374.pdf).
672
- creator_organization: OpenAI
673
- access: limited
674
- - name: openai/gpt-3.5-turbo-0301
675
- display_name: gpt-3.5-turbo-0301
676
- description: Sibling model of text-davinci-003 is optimized for chat but works well for traditional completions tasks as well. Snapshot from 2023-03-01.
677
- creator_organization: OpenAI
678
- access: limited
679
- release_date: 2023-03-01
680
- - name: openai/gpt-3.5-turbo-0613
681
- display_name: gpt-3.5-turbo-0613
682
- description: Sibling model of text-davinci-003 is optimized for chat but works well for traditional completions tasks as well. Snapshot from 2023-06-13.
683
- creator_organization: OpenAI
684
- access: limited
685
- release_date: 2023-06-13
686
- - name: openai/gpt-3.5-turbo-16k-0613
687
- display_name: gpt-3.5-turbo-16k-0613
688
- description: Sibling model of text-davinci-003 is optimized for chat but works well for traditional completions tasks as well. Snapshot from 2023-06-13 with a longer context length of 16,384 tokens.
689
- creator_organization: OpenAI
690
- access: limited
691
- release_date: 2023-06-13
692
-
693
- # Together
694
- - name: together/Together-gpt-JT-6B-v1
695
- display_name: GPT-JT (6B)
696
- description: GPT-JT (6B parameters) is a fork of GPT-J ([blog post](https://www.together.xyz/blog/releasing-v1-of-gpt-jt-powered-by-open-source-ai)).
697
- creator_organization: Together
698
- access: open
699
- num_parameters: 6700000000
700
- release_date: 2022-11-29
701
- todo: true
702
- - name: together/gpt-neoxt-chat-base-20b
703
- display_name: GPT-NeoXT-Chat-Base (20B)
704
- description: GPT-NeoXT-Chat-Base (20B) is fine-tuned from GPT-NeoX, serving as a base model for developing open-source chatbots.
705
- creator_organization: Together
706
- access: open
707
- num_parameters: 20000000000
708
- release_date: 2023-03-08
709
- todo: true
710
- - name: together/redpajama-incite-base-3b-v1
711
- display_name: RedPajama-INCITE-Base-v1 (3B)
712
- description: RedPajama-INCITE-Base-v1 (3B parameters) is a 3 billion base model that aims to replicate the LLaMA recipe as closely as possible.
713
- creator_organization: Together
714
- access: open
715
- num_parameters: 3000000000
716
- release_date: 2023-05-05
717
- - name: together/redpajama-incite-instruct-3b-v1
718
- display_name: RedPajama-INCITE-Instruct-v1 (3B)
719
- description: RedPajama-INCITE-Instruct-v1 (3B parameters) is a model fine-tuned for few-shot applications on the data of GPT-JT. It is built from RedPajama-INCITE-Base-v1 (3B), a 3 billion base model that aims to replicate the LLaMA recipe as closely as possible.
720
- creator_organization: Together
721
- access: open
722
- num_parameters: 3000000000
723
- release_date: 2023-05-05
724
- todo: true
725
- - name: together/redpajama-incite-chat-3b-v1
726
- display_name: RedPajama-INCITE-Chat-v1 (3B)
727
- description: RedPajama-INCITE-Chat-v1 (3B parameters) is a model fine-tuned on OASST1 and Dolly2 to enhance chatting ability. It is built from RedPajama-INCITE-Base-v1 (3B), a 3 billion base model that aims to replicate the LLaMA recipe as closely as possible.
728
- creator_organization: Together
729
- access: open
730
- num_parameters: 3000000000
731
- release_date: 2023-05-05
732
- todo: true
733
- - name: together/redpajama-incite-base-7b
734
- display_name: RedPajama-INCITE-Base (7B)
735
- description: RedPajama-INCITE-Base (7B parameters) is a 7 billion base model that aims to replicate the LLaMA recipe as closely as possible.
736
- creator_organization: Together
737
- access: open
738
- num_parameters: 7000000000
739
- release_date: 2023-05-05
740
- todo: true
741
- - name: together/redpajama-incite-instruct-7b
742
- display_name: RedPajama-INCITE-Instruct (7B)
743
- description: RedPajama-INCITE-Instruct (7B parameters) is a model fine-tuned for few-shot applications on the data of GPT-JT. It is built from RedPajama-INCITE-Base (7B), a 7 billion base model that aims to replicate the LLaMA recipe as closely as possible.
744
- creator_organization: Together
745
- access: open
746
- num_parameters: 7000000000
747
- release_date: 2023-05-05
748
- todo: true
749
-
750
- # MosaicML
751
- - name: mosaicml/mpt-7b
752
- display_name: MPT (7B)
753
- description: MPT (7B) is a Transformer trained from scratch on 1T tokens of text and code.
754
- creator_organization: MosaicML
755
- access: open
756
- num_parameters: 6700000000
757
- release_date: 2023-05-05
758
- - name: mosaicml/mpt-7b-chat
759
- display_name: MPT-Chat (7B)
760
- description: MPT-Chat (7B) is a chatbot-like model for dialogue generation. It is built by finetuning MPT (30B) , a Transformer trained from scratch on 1T tokens of text and code.
761
- creator_organization: MosaicML
762
- access: open
763
- num_parameters: 6700000000
764
- release_date: 2023-05-05
765
- todo: true
766
- - name: mosaicml/mpt-instruct-7b
767
- display_name: MPT-Instruct (7B)
768
- description: MPT-Instruct (7B) is a model for short-form instruction following. It is built by finetuning MPT (30B), a Transformer trained from scratch on 1T tokens of text and code.
769
- creator_organization: MosaicML
770
- access: open
771
- num_parameters: 6700000000
772
- release_date: 2023-05-05
773
- - name: mosaicml/mpt-30b
774
- display_name: MPT (30B)
775
- description: MPT (30B) is a Transformer trained from scratch on 1T tokens of text and code.
776
- creator_organization: MosaicML
777
- access: open
778
- num_parameters: 30000000000
779
- release_date: 2023-06-22
780
- - name: mosaicml/mpt-30b-chat
781
- display_name: MPT-Chat (30B)
782
- description: MPT-Chat (30B) is a chatbot-like model for dialogue generation. It is built by finetuning MPT (30B), a Transformer trained from scratch on 1T tokens of text and code.
783
- creator_organization: MosaicML
784
- access: open
785
- num_parameters: 30000000000
786
- release_date: 2023-06-22
787
- todo: true
788
- - name: mosaicml/mpt-instruct-30b
789
- display_name: MPT-Instruct (30B)
790
- description: MPT-Instruct (30B) is a model for short-form instruction following. It is built by finetuning MPT (30B), a Transformer trained from scratch on 1T tokens of text and code.
791
- creator_organization: MosaicML
792
- access: open
793
- num_parameters: 30000000000
794
- release_date: 2023-06-22
795
-
796
- # TII UAE
797
- - name: tiiuae/falcon-7b
798
- display_name: Falcon (7B)
799
- description: Falcon-7B is a 7B parameters causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora.
800
- creator_organization: TII UAE
801
- access: open
802
- num_parameters: 7000000000
803
- release_date: 2023-03-15
804
- - name: tiiuae/falcon-7b-instruct
805
- display_name: Falcon-Instruct (7B)
806
- description: Falcon-7B-Instruct is a 7B parameters causal decoder-only model built by TII based on Falcon-7B and finetuned on a mixture of chat/instruct datasets.
807
- creator_organization: TII UAE
808
- access: open
809
- num_parameters: 7000000000
810
- release_date: 2023-03-15
811
- - name: tiiuae/falcon-40b
812
- display_name: Falcon (40B)
813
- description: Falcon-40B is a 40B parameters causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora.
814
- creator_organization: TII UAE
815
- access: open
816
- num_parameters: 40000000000
817
- release_date: 2023-05-25
818
- - name: tiiuae/falcon-40b-instruct
819
- display_name: Falcon-Instruct (40B)
820
- description: Falcon-40B-Instruct is a 40B parameters causal decoder-only model built by TII based on Falcon-7B and finetuned on a mixture of chat/instruct datasets.
821
- creator_organization: TII UAE
822
- access: open
823
- num_parameters: 40000000000
824
- release_date: 2023-05-25
825
-
826
- # Salesforce
827
- - name: together/codegen
828
- display_name: CodeGen (16B)
829
- description: CodeGen (16B parameters) is an open dense code model trained for multi-turn program synthesis ([blog](https://arxiv.org/pdf/2203.13474.pdf)).
830
- creator_organization: Tsinghua
831
- access: open
832
- num_parameters: 16000000000
833
- release_date: 2022-03-25
834
- todo: true
835
-
836
- # Tsinghua
837
- - name: together/glm
838
- display_name: GLM (130B)
839
- description: GLM (130B parameters) is an open bilingual (English & Chinese) bidirectional dense model that was trained using General Language Model (GLM) procedure ([paper](https://arxiv.org/pdf/2210.02414.pdf)).
840
- creator_organization: Tsinghua
841
- access: open
842
- num_parameters: 130000000000
843
- release_date: 2022-08-04
844
-
845
- - name: together/codegeex
846
- display_name: CodeGeeX (13B)
847
- description: CodeGeeX (13B parameters) is an open dense code model trained on more than 20 programming languages on a corpus of more than 850B tokens ([blog](http://keg.cs.tsinghua.edu.cn/codegeex/)).
848
- creator_organization: Tsinghua
849
- access: open
850
- num_parameters: 13000000000
851
- release_date: 2022-09-19
852
- todo: true
853
-
854
- # Writer
855
- - name: writer/palmyra-base
856
- display_name: Palmyra Base (5B)
857
- description: Palmyra Base (5B)
858
- creator_organization: Writer
859
- access: limited
860
- num_parameters: 5000000000
861
- release_date: 2022-10-13
862
- todo: true
863
- - name: writer/palmyra-large
864
- display_name: Palmyra Large (20B)
865
- description: Palmyra Large (20B)
866
- creator_organization: Writer
867
- access: limited
868
- num_parameters: 20000000000
869
- release_date: 2022-12-23
870
- todo: true
871
- - name: writer/palmyra-instruct-30
872
- display_name: InstructPalmyra (30B)
873
- description: InstructPalmyra (30B parameters) is trained using reinforcement learning techniques based on feedback from humans.
874
- creator_organization: Writer
875
- access: limited
876
- num_parameters: 30000000000
877
- release_date: 2023-02-16
878
- todo: true
879
- - name: writer/palmyra-e
880
- display_name: Palmyra E (30B)
881
- description: Palmyra E (30B)
882
- creator_organization: Writer
883
- access: limited
884
- num_parameters: 30000000000
885
- release_date: 2023-03-03
886
- todo: true
887
- - name: writer/silk-road
888
- display_name: Silk Road (35B)
889
- description: Silk Road (35B)
890
- creator_organization: Writer
891
- access: limited
892
- num_parameters: 35000000000
893
- release_date: 2023-04-13
894
- todo: true
895
- - name: writer/palmyra-x
896
- display_name: Palmyra X (43B)
897
- description: Palmyra-X (43B parameters) is trained to adhere to instructions using human feedback and utilizes a technique called multiquery attention. Furthermore, a new feature called 'self-instruct' has been introduced, which includes the implementation of an early stopping criteria specifically designed for minimal instruction tuning ([paper](https://dev.writer.com/docs/becoming-self-instruct-introducing-early-stopping-criteria-for-minimal-instruct-tuning)).
898
- creator_organization: Writer
899
- access: limited
900
- num_parameters: 43000000000
901
- release_date: 2023-06-11
902
- todo: true
903
-
904
- # Yandex
905
- - name: together/yalm
906
- display_name: YaLM (100B)
907
- description: YaLM (100B parameters) is an autoregressive language model trained on English and Russian text ([GitHub](https://github.com/yandex/YaLM-100B)).
908
- creator_organization: Yandex
909
- access: open
910
- num_parameters: 100000000000
911
- release_date: 2022-06-23
912
-
913
- # NVIDIA
914
- - name: nvidia/megatron-gpt2
915
- display_name: Megatron GPT2
916
- description: GPT-2 implemented in Megatron-LM ([paper](https://arxiv.org/abs/1909.08053)).
917
- creator_organization: NVIDIA
918
- access: open
919
- todo: true
920
-
921
2
  ############################################################
922
3
  adapter:
923
4
  - name: method
@@ -961,8 +42,12 @@ adapter:
961
42
  description: Maximum number of possible outputs to generate by sampling multiple outputs.
962
43
  - name: num_train_trials
963
44
  description: Number of trials, where in each trial we choose an independent, random set of training instances. Used to compute variance.
45
+ - name: sample_train
46
+ description: If true, randomly sample N training examples; if false, select N consecutive training examples
964
47
  - name: model
965
- description: Name of the language model (<organization>/<model name>) to send requests to.
48
+ description: Name of the language model (<creator_organization>/<model name>) to send requests to.
49
+ - name: model_deployment
50
+ description: Name of the language model deployment (<host_organization>/<model name>) to send requests to.
966
51
  - name: temperature
967
52
  description: Temperature parameter used in generation.
968
53
  - name: max_tokens
@@ -971,6 +56,8 @@ adapter:
971
56
  description: List of sequences, where we stop generation if we encounter any of them.
972
57
  - name: random
973
58
  description: Random seed (string), which guarantees reproducibility.
59
+ - name: multi_label
60
+ description: If true, for instances with multiple correct reference, the gold answer should be considered to be all of the correct references rather than any of the correct references.
974
61
 
975
62
  ############################################################
976
63
  metrics:
@@ -1059,6 +146,7 @@ metrics:
1059
146
  short_display_name: PEM
1060
147
  description: Fraction of instances that the predicted output matches the prefix of a correct reference up to light processing.
1061
148
  lower_is_better: false
149
+
1062
150
  - name: exact_match@5
1063
151
  display_name: Exact match @5
1064
152
  short_display_name: EM@5
@@ -1069,6 +157,17 @@ metrics:
1069
157
  short_display_name: EM@5
1070
158
  description: Fraction of instances where at least one predicted output among the top 5 matches a correct reference up to light processing.
1071
159
  lower_is_better: false
160
+ - name: prefix_exact_match@5
161
+ display_name: Prefix exact match @5
162
+ short_display_name: PEM@5
163
+ description: Fraction of instances that the predicted output among the top 5 matches the prefix of a correct reference exactly.
164
+ lower_is_better: false
165
+ - name: quasi_prefix_exact_match@5
166
+ display_name: Prefix quasi-exact match @5
167
+ short_display_name: PEM@5
168
+ description: Fraction of instances that the predicted output among the top 5 matches the prefix of a correct reference up to light processing.
169
+ lower_is_better: false
170
+
1072
171
  - name: logprob
1073
172
  display_name: Log probability
1074
173
  short_display_name: Logprob
@@ -1163,10 +262,15 @@ metrics:
1163
262
  lower_is_better: false
1164
263
  - name: math_equiv_chain_of_thought
1165
264
  display_name: Equivalent (chain of thought)
1166
- description: Fraction of model outputs that are mathematically equivalent to the correct reference when using chain-of-thoughts prompting.
265
+ description: Fraction of model outputs that are mathematically equivalent to the correct reference when using chain-of-thought prompting.
1167
266
  lower_is_better: false
1168
267
  - name: exact_match_indicator
1169
- display_name: Exact match (up to specified indicator)
268
+ display_name: Exact match (final)
269
+ short_display_name: EM
270
+ description: Fraction of instances that the predicted output matches a correct reference exactly, ignoring text preceding the specified indicator (e.g., space).
271
+ lower_is_better: false
272
+ - name: final_number_exact_match
273
+ display_name: Exact match (final number)
1170
274
  short_display_name: EM
1171
275
  description: Fraction of instances that the predicted output matches a correct reference exactly, ignoring text preceding the specified indicator.
1172
276
  lower_is_better: false
@@ -1464,6 +568,50 @@ metrics:
1464
568
  lower_is_better: false
1465
569
  # CLEVA Chinese bias, copyright and toxicity metrics share the same name as the original HELM metrics
1466
570
 
571
+
572
+ # DecodingTrust Fairness Metrics
573
+ - name: equalized_odds_difference
574
+ display_name: Equalized Odds Difference
575
+ short_display_name: Equalized Odds
576
+ description: The rate at which the model refuses to answer an OoD question.
577
+ lower_is_better: true
578
+
579
+ - name: demographic_parity_difference
580
+ display_name: Demographic Parity Difference
581
+ short_display_name: demographic parity
582
+ description: The demographic parity difference measures the difference between the probability of positive predictions conditioned on sensitive attribute A = 1 and that conditioned on A = 0
583
+ lower_is_better: true
584
+
585
+
586
+ # DecodingTrust OoD Metrics
587
+ - name: ood_knowledge_rr
588
+ display_name: OOD Knowledge Refusal Rate
589
+ short_display_name: OoD-RR
590
+ description: The rate at which the model refuses to answer an OoD question.
591
+ lower_is_better: false
592
+
593
+ - name: ood_knowledge_macc
594
+ display_name: OOD Meaningful Answer Accuracy (Exclude Refusal)
595
+ short_display_name: OoD-MACC
596
+ description: MACC indicates the percentage of correct answers when the model successfully generates meaningful responses by excluding outputs that are refused to answer
597
+ lower_is_better: false
598
+
599
+
600
+ # DecodingTrust Stereotype Bias Metrics
601
+
602
+ - name: decodingtrust_stereotype_bias
603
+ display_name: DecodingTrust Bias Score
604
+ short_display_name: Bias Score
605
+ description: 1 - agreementIndex (Section 4.2 of DecodingTrust)
606
+ lower_is_better: false
607
+
608
+ - name: decodingtrust_stereotype_bias_rejection_rate
609
+ display_name: Accuracy at 10% coverage
610
+ short_display_name: Bias Rejection Rate
611
+ description: Rejection rate of stereotype prompts
612
+ lower_is_better: false
613
+
614
+
1467
615
  ############################################################
1468
616
  perturbations:
1469
617
  - name: robustness
@@ -1514,7 +662,7 @@ metric_groups:
1514
662
  split: ${main_split}
1515
663
 
1516
664
  - name: calibration_detailed
1517
- display_name: Calibration
665
+ display_name: Calibration (Detailed)
1518
666
  description: Measures how calibrated the model is (how meaningful its uncertainty estimates are).
1519
667
  metrics:
1520
668
  - name: max_prob
@@ -1545,7 +693,7 @@ metric_groups:
1545
693
 
1546
694
  # TODO: Add other robustness perturbations
1547
695
  - name: robustness_detailed
1548
- display_name: Robustness
696
+ display_name: Robustness (Detailed)
1549
697
  description: Measures how robust the model is to invariances.
1550
698
  metrics:
1551
699
  - name: ${main_name}
@@ -1564,7 +712,7 @@ metric_groups:
1564
712
 
1565
713
  # TODO: Add other fairness perturbations
1566
714
  - name: fairness_detailed
1567
- display_name: Fairness
715
+ display_name: Fairness (Detailed)
1568
716
  description: Measures how fair the model is.
1569
717
  metrics:
1570
718
  - name: ${main_name}
@@ -1602,7 +750,7 @@ metric_groups:
1602
750
  split: ${main_split}
1603
751
 
1604
752
  - name: efficiency_detailed
1605
- display_name: Efficiency
753
+ display_name: Efficiency (Detailed)
1606
754
  description: The efficiency of the model across both training and inference.
1607
755
  metrics:
1608
756
  - name: inference_runtime
@@ -1747,6 +895,31 @@ metric_groups:
1747
895
  - name: chinese_bleu_1
1748
896
  split: ${main_split}
1749
897
 
898
+ - name: decodingtrust_fairness_metrics
899
+ display_name: DecodingTrust Fairness
900
+ metrics:
901
+ - name: equalized_odds_difference
902
+ split: ${main_split}
903
+ - name: demographic_parity_difference
904
+ split: ${main_split}
905
+
906
+ - name: decodingtrust_ood_metrics
907
+ display_name: DecodingTrust OOD Accuracy
908
+ metrics:
909
+ - name: ood_knowledge_rr
910
+ split: ${main_split}
911
+ - name: ood_knowledge_macc
912
+ split: ${main_split}
913
+
914
+ - name: decodingtrust_stereotype_bias_metrics
915
+ display_name: DecodingTrust Stereotype Bias
916
+ metrics:
917
+ - name: decodingtrust_stereotype_bias
918
+ split: ${main_split}
919
+ - name: decodingtrust_stereotype_bias_rejection_rate
920
+ split: ${main_split}
921
+
922
+
1750
923
  ############################################################
1751
924
  run_groups:
1752
925
  ## Top-level
@@ -1910,6 +1083,7 @@ run_groups:
1910
1083
  - synthetic_efficiency
1911
1084
  adapter_keys_shown:
1912
1085
  - model
1086
+ - model_deployment
1913
1087
  - max_tokens
1914
1088
 
1915
1089
  - name: calibration
@@ -1928,6 +1102,20 @@ run_groups:
1928
1102
  main_name: none
1929
1103
  main_split: none
1930
1104
 
1105
+ - name: decodingtrust
1106
+ display_name: DecodingTrust
1107
+ description: A comprehensive benchmark of the trustworthiness of large language models [(Wang et. al. 2023)](https://decodingtrust.github.io/)
1108
+ category: Core scenarios
1109
+ subgroups:
1110
+ - decodingtrust_adv_robustness
1111
+ - decodingtrust_adv_demonstration
1112
+ - decodingtrust_ood_robustness
1113
+ - decodingtrust_fairness
1114
+ - decodingtrust_privacy
1115
+ - decodingtrust_machine_ethics
1116
+ - decodingtrust_toxicity_prompts
1117
+ - decodingtrust_stereotype_bias
1118
+
1931
1119
  ### Ablations
1932
1120
  - name: ablation_in_context
1933
1121
  display_name: Vary number of in-context examples
@@ -1941,6 +1129,7 @@ run_groups:
1941
1129
  - civil_comments
1942
1130
  adapter_keys_shown:
1943
1131
  - model
1132
+ - model_deployment
1944
1133
  - max_train_instances
1945
1134
  subgroup_metric_groups_hidden:
1946
1135
  - robustness
@@ -1962,6 +1151,7 @@ run_groups:
1962
1151
  - bbq
1963
1152
  adapter_keys_shown:
1964
1153
  - model
1154
+ - model_deployment
1965
1155
  - method
1966
1156
 
1967
1157
  - name: ablation_prompts
@@ -1976,6 +1166,7 @@ run_groups:
1976
1166
  - civil_comments
1977
1167
  adapter_keys_shown:
1978
1168
  - model
1169
+ - model_deployment
1979
1170
  - instructions
1980
1171
  - input_prefix
1981
1172
  - input_suffix
@@ -2636,8 +1827,8 @@ run_groups:
2636
1827
  language: synthetic
2637
1828
 
2638
1829
  - name: math_chain_of_thought
2639
- display_name: MATH (chain-of-thoughts)
2640
- description: The MATH benchmark for measuring mathematical problem solving on competition math problems with chain-of-thoughts style reasoning [(Hendrycks et al., 2021)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html).
1830
+ display_name: MATH (chain-of-thought)
1831
+ description: The MATH benchmark for measuring mathematical problem solving on competition math problems with chain-of-thought style reasoning [(Hendrycks et al., 2021)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html).
2641
1832
  metric_groups:
2642
1833
  - accuracy
2643
1834
  - efficiency
@@ -2687,6 +1878,23 @@ run_groups:
2687
1878
  when: n/a
2688
1879
  language: synthetic
2689
1880
 
1881
+ - name: legalbench
1882
+ display_name: LegalBench
1883
+ description: LegalBench is a large collaboratively constructed benchmark of legal reasoning. Five representative tasks are included here. See [(Guha et al, 2023)[https://arxiv.org/abs/2308.11462] for more details.
1884
+ metric_groups:
1885
+ - accuracy
1886
+ - efficiency
1887
+ - general_information
1888
+ environment:
1889
+ main_name: quasi_exact_match
1890
+ main_split: test
1891
+ taxonomy:
1892
+ task: "text classification"
1893
+ what: "fact patterns, questions, and legal documents"
1894
+ who: "lawyers"
1895
+ when: n/a
1896
+ language: English
1897
+
2690
1898
  - name: legal_support
2691
1899
  display_name: LegalSupport
2692
1900
  description: Scenario introduced in this work to measure fine-grained legal reasoning through reverse entailment.
@@ -2721,6 +1929,40 @@ run_groups:
2721
1929
  when: n/a
2722
1930
  language: synthetic
2723
1931
 
1932
+ - name: med_qa
1933
+ display_name: MedQA
1934
+ description: MedQA is an open domain question answering dataset composed of questions from professional medical board exams ([Jin et al. 2020](https://arxiv.org/pdf/2009.13081.pdf)).
1935
+ metric_groups:
1936
+ - accuracy
1937
+ - efficiency
1938
+ - general_information
1939
+ environment:
1940
+ main_name: quasi_exact_match
1941
+ main_split: test
1942
+ taxonomy:
1943
+ task: question answering
1944
+ what: n/a
1945
+ who: n/a
1946
+ when: n/a
1947
+ language: English
1948
+
1949
+ - name: wmt_14
1950
+ display_name: WMT 2014
1951
+ description: WMT 2014 is a collection of machine translation datasets.
1952
+ metric_groups:
1953
+ - accuracy
1954
+ - efficiency
1955
+ - general_information
1956
+ environment:
1957
+ main_name: bleu_4
1958
+ main_split: test
1959
+ taxonomy:
1960
+ task: machine translation
1961
+ what: n/a
1962
+ who: n/a
1963
+ when: n/a
1964
+ language: English
1965
+
2724
1966
  - name: lextreme
2725
1967
  display_name: LEXTREME
2726
1968
  description: A Multilingual Legal Benchmark for Natural Language Understanding
@@ -2981,6 +2223,7 @@ run_groups:
2981
2223
  main_split: test
2982
2224
  adapter_keys_shown:
2983
2225
  - model
2226
+ - model_deployment
2984
2227
  - max_tokens
2985
2228
  taxonomy:
2986
2229
  task: "?"
@@ -3402,7 +2645,7 @@ run_groups:
3402
2645
 
3403
2646
  - name: cleva_mathematical_reasoning
3404
2647
  display_name: CLEVA (Chinese) mathematical reasoning
3405
- description: "Scenario that tests models' mathematical reasoning ability with chain-of-thoughts style reasoning. It contains a math word problem solving subtask."
2648
+ description: "Scenario that tests models' mathematical reasoning ability with chain-of-thought style reasoning. It contains a math word problem solving subtask."
3406
2649
  metric_groups:
3407
2650
  - cleva_mathematical_reasoning_metrics
3408
2651
  - general_information
@@ -3449,7 +2692,7 @@ run_groups:
3449
2692
  main_split: test
3450
2693
  taxonomy:
3451
2694
  task: toxicity classification
3452
- what: text from Chinese social media
2695
+ what: text from Chinese social media
3453
2696
  who: web users
3454
2697
  when: 2022 or before
3455
2698
  language: Chinese
@@ -3649,3 +2892,176 @@ run_groups:
3649
2892
  task: user-facing tasks
3650
2893
  language: English dialects
3651
2894
  todo: true
2895
+
2896
+
2897
+ # DecodingTrust scenarios
2898
+ - name: decodingtrust_adv_robustness
2899
+ display_name: DecodingTrust - AdvGLUE++
2900
+ short_display_name: AdvGLUE++
2901
+ description: Adversarial perturbations of the GLUE dataset generated against open-source LLMs including Alpaca, Vicuna, and Stable-Vicuna
2902
+ metric_groups:
2903
+ - accuracy
2904
+ - calibration
2905
+ - efficiency
2906
+ - general_information
2907
+ environment:
2908
+ main_name: quasi_exact_match
2909
+ main_split: test
2910
+ taxonomy:
2911
+ task: text classification
2912
+ what: "?"
2913
+ who: "?"
2914
+ when: "?"
2915
+ language: English
2916
+ todo: true
2917
+
2918
+ - name: decodingtrust_adv_demonstration
2919
+ display_name: DecodingTrust - Adversarial Demonstrations
2920
+ short_display_name: AdvDemo
2921
+ description: Robustness analysis of LM generations when facing adversarial demonstrations
2922
+ metric_groups:
2923
+ - accuracy
2924
+ - calibration
2925
+ - efficiency
2926
+ - general_information
2927
+ environment:
2928
+ main_name: quasi_exact_match
2929
+ main_split: test
2930
+ taxonomy:
2931
+ task: text classification
2932
+ what: "?"
2933
+ who: "?"
2934
+ when: "?"
2935
+ language: English
2936
+
2937
+ - name: decodingtrust_ood_robustness
2938
+ display_name: DecodingTrust - OoD Robustness
2939
+ short_display_name: OoD
2940
+ description: Style perturbations of GLUE datasets (OoD styles) and out-of-scope OoD knowledge evaluations
2941
+ metric_groups:
2942
+ - accuracy
2943
+ - calibration
2944
+ - efficiency
2945
+ - general_information
2946
+ - decodingtrust_ood_metrics
2947
+ environment:
2948
+ main_name: quasi_exact_match
2949
+ main_split: test
2950
+ taxonomy:
2951
+ task: text classification
2952
+ what: "?"
2953
+ who: "?"
2954
+ when: "?"
2955
+ language: English
2956
+
2957
+ - name: decodingtrust_fairness
2958
+ display_name: DecodingTrust - Fairness
2959
+ short_display_name: Fairness
2960
+ description: Fairness analysis of LLMs
2961
+ metric_groups:
2962
+ - accuracy
2963
+ - calibration
2964
+ - efficiency
2965
+ - general_information
2966
+ - decodingtrust_fairness_metrics
2967
+ environment:
2968
+ main_name: quasi_exact_match
2969
+ main_split: test
2970
+ taxonomy:
2971
+ task: text classification
2972
+ what: "?"
2973
+ who: "?"
2974
+ when: "?"
2975
+ language: English
2976
+
2977
+ - name: decodingtrust_privacy
2978
+ display_name: DecodingTrust - Privacy
2979
+ short_display_name: Privacy
2980
+ description: Evaluation of the privacy understanding and privacy preserving properties of LLMs
2981
+ metric_groups:
2982
+ - accuracy
2983
+ - calibration
2984
+ - efficiency
2985
+ - general_information
2986
+ environment:
2987
+ main_name: quasi_exact_match
2988
+ main_split: test
2989
+ taxonomy:
2990
+ task: text classification
2991
+ what: "?"
2992
+ who: "?"
2993
+ when: "?"
2994
+ language: English
2995
+
2996
+ - name: decodingtrust_machine_ethics
2997
+ display_name: DecodingTrust - Ethics
2998
+ short_display_name: Ethics
2999
+ description: Evaluation of the understanding of ethical behaviors of LLMs
3000
+ metric_groups:
3001
+ - accuracy
3002
+ - calibration
3003
+ - efficiency
3004
+ - general_information
3005
+ environment:
3006
+ main_name: quasi_exact_match
3007
+ main_split: test
3008
+ taxonomy:
3009
+ task: text classification
3010
+ what: "?"
3011
+ who: "?"
3012
+ when: "?"
3013
+ language: English
3014
+
3015
+ - name: decodingtrust_toxicity_prompts
3016
+ display_name: DecodingTrust - Toxicity
3017
+ short_display_name: Toxicity
3018
+ description: Evaluation of the propensity of LLMs to generate toxic content when prompted with toxicity-eliciting prompts
3019
+ metric_groups:
3020
+ - toxicity
3021
+ - bias
3022
+ - efficiency
3023
+ - general_information
3024
+ environment:
3025
+ main_split: test
3026
+ taxonomy:
3027
+ task: "?"
3028
+ what: n/a
3029
+ who: n/a
3030
+ when: n/a
3031
+ language: synthetic
3032
+
3033
+ - name: decodingtrust_stereotype_bias
3034
+ display_name: DecodingTrust - Stereotype Bias
3035
+ short_display_name: Stereotype
3036
+ description: Manually crafted stereotype user prompts from DecodingTrust
3037
+ metric_groups:
3038
+ - toxicity
3039
+ - bias
3040
+ - efficiency
3041
+ - general_information
3042
+ - decodingtrust_stereotype_bias_metrics
3043
+ environment:
3044
+ main_split: test
3045
+ taxonomy:
3046
+ task: "?"
3047
+ what: n/a
3048
+ who: n/a
3049
+ when: n/a
3050
+ language: synthetic
3051
+
3052
+ - name: thai_exam
3053
+ display_name: Thai Exam
3054
+ short_display_name: ThaiExam
3055
+ description: A benchmark comprising Thai multiple-choice examinations.
3056
+ metric_groups:
3057
+ - accuracy
3058
+ - general_information
3059
+ environment:
3060
+ main_name: exact_match
3061
+ main_split: test
3062
+ taxonomy:
3063
+ task: question answering
3064
+ what: "?"
3065
+ who: "?"
3066
+ when: "?"
3067
+ language: Thai