PyPI - medhelm - Versions diffs - 0.5.13__tar.gz - Mend

medhelm 0.5.13__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (1022)

medhelm-0.5.13/LICENSE ADDED

@@ -0,0 +1,28 @@
+Copyright 2026 © Pacific AI Inc.
+This Software ("Software" or "Product") including code, design, documentation, configuration, models, tests, and related assets is owned by Pacific AI Inc. All rights reserved.
+Pacific AI Inc. ("we") is the only owner of the copyright for this Software.
+Unless otherwise specified in a separate Software License Agreement, Services Agreement, or End User License Agreement that you have executed directly with Pacific AI Inc.:
+* You are NOT granted any license or right to use the Software in any way.
+* You are NOT granted any license or right to retain a copy of this Software.
+* You are NOT granted any license or right to change, modify, adapt, or translate the Software.
+* You are NOT granted any license or right to sell, assign, rent, exchange, lend, lease, sublease, or redistribute the Software.
+* You are NOT granted any license or rights to bundle, repackage, or include the Software with any software in any way.
+* The Software is Confidential and Proprietary. You are NOT allowed to distribute copies of the Software to others by any means whatsoever.
+* The Software does NOT come with any warranty, express or implied.
+* It is NOT legal to create derivative works based on the Software.
+* It is NOT legal to claim any title in the Software or any of its derivatives.
+* It is NOT legal to reverse engineer, disassemble or decompile the Software.
+* It is NOT legal to make or retain a copy of the Software.
+* We have no liability whatsoever for use of the Software.
+* You may not make any public statements about this Software or Pacific AI without explicit written permission from Pacific AI.
+* You must retain a copy of this notice without changes along with every copy of the Software, even if you have a license for it.
+Unless required by applicable law or agreed to in writing, Pacific AI provides the Software on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Software and assume any risks associated with Your exercise of permissions under this license.
+In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall Pacific AI be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this license or out of the use or inability to use the Software (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if advised of the possibility of such damages.
+Unless required by applicable law or agreed to in writing, Software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

medhelm-0.5.13/MANIFEST.in ADDED

@@ -0,0 +1,10 @@
+recursive-include src/helm/ py.typed
+recursive-include src/helm/tokenizers/ *.sp
+recursive-include src/helm/benchmark/ *.json
+recursive-include src/helm/benchmark/ *.yaml
+recursive-include src/helm/benchmark/static/ *.css *.html *.js *.png *.yaml
+recursive-include src/helm/benchmark/static_build/ *.css *.html *.js *.png *.yaml
+recursive-include src/helm/config/ *.yaml
+recursive-include src/helm/benchmark/annotation/omni_math/ *.txt
+recursive-include src/helm/benchmark/annotation/wildbench/ *.md
+recursive-include src/helm/proxy/static/ *.css *.html *.js *.png
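
As a rough illustration of what the `recursive-include` rules above select, the sketch below walks a directory tree and keeps the files whose names match any of the listed glob patterns. It only approximates the behavior with `os.walk` and `fnmatch`; it is not setuptools' own implementation, and the helper name `recursive_include` is hypothetical.

```python
import fnmatch
import os


def recursive_include(root, patterns):
    """Return sorted relative paths under root whose file names match any pattern.

    Hypothetical helper: an approximation of a MANIFEST.in
    `recursive-include <root> <patterns...>` rule, not setuptools' code.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # A file is kept if its base name matches at least one glob.
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                matches.append(rel.replace(os.sep, "/"))
    return sorted(matches)
```

Under this reading, a call such as `recursive_include("src/helm/config", ["*.yaml"])` would list every YAML file below `src/helm/config/`, at any depth.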

medhelm-0.5.13/PKG-INFO ADDED

@@ -0,0 +1,417 @@
+Metadata-Version: 2.4
+Name: medhelm
+Version: 0.5.13
+Summary: Holistic evaluation of language models for medical applications (HELM for medicine)
+Author-email: Pacific AI <david@pacific.ai>
+License: Apache License 2.0
+Project-URL: Homepage, https://github.com/PacificAI/medhelm
+Project-URL: Documentation, https://medhelm.org
+Keywords: language,models,benchmarking,medical,healthcare,evaluation
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: License :: OSI Approved :: Apache Software License
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: cattrs~=22.2
+Requires-Dist: colorlog~=6.9
+Requires-Dist: dacite~=1.6
+Requires-Dist: Mako~=1.2
+Requires-Dist: numpy<3,>=1.26
+Requires-Dist: pandas~=2.0
+Requires-Dist: pyhocon~=0.3.59
+Requires-Dist: ubelt~=1.3
+Requires-Dist: retrying~=1.3
+Requires-Dist: spacy~=3.5
+Requires-Dist: tqdm~=4.64
+Requires-Dist: zstandard~=0.18.0
+Requires-Dist: sqlitedict<3.0,>=2.1.0
+Requires-Dist: bottle~=0.12.23
+Requires-Dist: datasets~=3.1
+Requires-Dist: pyarrow>=11.0.0
+Requires-Dist: pyarrow-hotfix~=0.6
+Requires-Dist: nltk!=3.9.0,~=3.7
+Requires-Dist: rouge-score~=0.1.2
+Requires-Dist: scipy>=1.10
+Requires-Dist: uncertainty-calibration~=0.1.4
+Requires-Dist: scikit-learn>=1.1
+Requires-Dist: transformers~=4.53
+Requires-Dist: torch<3.0.0,>=1.13.1
+Requires-Dist: torchvision<3.0.0,>=0.14.1
+Provides-Extra: proxy-server
+Requires-Dist: gunicorn>=20.1; extra == "proxy-server"
+Provides-Extra: human-evaluation
+Requires-Dist: scaleapi~=2.13; extra == "human-evaluation"
+Requires-Dist: surge-api~=1.1; extra == "human-evaluation"
+Provides-Extra: dspy
+Requires-Dist: dspy~=3.0; extra == "dspy"
+Provides-Extra: scenarios
+Requires-Dist: gdown~=5.1; extra == "scenarios"
+Requires-Dist: xlrd~=2.0; extra == "scenarios"
+Provides-Extra: metrics
+Requires-Dist: google-api-python-client~=2.64; extra == "metrics"
+Requires-Dist: numba~=0.56; extra == "metrics"
+Requires-Dist: sacrebleu~=2.2; extra == "metrics"
+Requires-Dist: langdetect~=1.0; extra == "metrics"
+Requires-Dist: immutabledict~=4.2; extra == "metrics"
+Requires-Dist: gradio_client~=1.3; extra == "metrics"
+Provides-Extra: ranking
+Requires-Dist: pytrec_eval==0.5; extra == "ranking"
+Provides-Extra: summarization
+Requires-Dist: summ-eval~=0.892; extra == "summarization"
+Requires-Dist: bert-score~=0.3; extra == "summarization"
+Requires-Dist: rouge-score~=0.1.2; extra == "summarization"
+Requires-Dist: nltk!=3.9.0,~=3.7; extra == "summarization"
+Requires-Dist: sentencepiece~=0.2.0; extra == "summarization"
+Requires-Dist: protobuf; extra == "summarization"
+Provides-Extra: plots
+Requires-Dist: colorcet~=3.0; extra == "plots"
+Requires-Dist: matplotlib>=3.6.0; extra == "plots"
+Requires-Dist: seaborn>=0.11.0; extra == "plots"
+Provides-Extra: decodingtrust
+Requires-Dist: fairlearn~=0.9.0; extra == "decodingtrust"
+Provides-Extra: slurm
+Requires-Dist: simple-slurm~=0.2.6; extra == "slurm"
+Provides-Extra: cleva
+Requires-Dist: unidecode~=1.3; extra == "cleva"
+Requires-Dist: pypinyin~=0.49.0; extra == "cleva"
+Requires-Dist: jieba~=0.42.1; extra == "cleva"
+Requires-Dist: opencc~=1.1; extra == "cleva"
+Requires-Dist: langdetect~=1.0; extra == "cleva"
+Provides-Extra: images
+Requires-Dist: medhelm[accelerate]; extra == "images"
+Requires-Dist: pillow>=10.2; extra == "images"
+Provides-Extra: mongo
+Requires-Dist: pymongo~=4.2; extra == "mongo"
+Provides-Extra: unitxt
+Requires-Dist: evaluate~=0.4.1; extra == "unitxt"
+Provides-Extra: seahelm
+Requires-Dist: pythainlp==5.0.0; extra == "seahelm"
+Requires-Dist: pyonmttok==1.37.0; extra == "seahelm"
+Requires-Dist: sacrebleu~=2.2; extra == "seahelm"
+Requires-Dist: python-crfsuite~=0.9.11; extra == "seahelm"
+Provides-Extra: accelerate
+Requires-Dist: accelerate~=0.25; extra == "accelerate"
+Provides-Extra: aleph-alpha
+Requires-Dist: aleph-alpha-client~=2.14; extra == "aleph-alpha"
+Requires-Dist: tokenizers>=0.13.3; extra == "aleph-alpha"
+Provides-Extra: allenai
+Requires-Dist: ai2-olmo~=0.2; extra == "allenai"
+Provides-Extra: amazon
+Requires-Dist: boto3~=1.34; extra == "amazon"
+Requires-Dist: awscli~=1.33; extra == "amazon"
+Requires-Dist: botocore~=1.34; extra == "amazon"
+Provides-Extra: anthropic
+Requires-Dist: anthropic~=0.41; extra == "anthropic"
+Requires-Dist: websocket-client~=1.3; extra == "anthropic"
+Provides-Extra: cohere
+Requires-Dist: cohere~=5.3; extra == "cohere"
+Provides-Extra: writer
+Requires-Dist: writerai~=4.0; extra == "writer"
+Provides-Extra: mistral
+Requires-Dist: mistralai~=1.1; extra == "mistral"
+Provides-Extra: openai
+Requires-Dist: openai~=2.8; extra == "openai"
+Requires-Dist: tiktoken~=0.7; extra == "openai"
+Requires-Dist: pydantic~=2.0; extra == "openai"
+Provides-Extra: google
+Requires-Dist: google-cloud-aiplatform~=1.48; extra == "google"
+Requires-Dist: google-genai~=1.48; extra == "google"
+Provides-Extra: together
+Requires-Dist: together~=1.1; extra == "together"
+Provides-Extra: yandex
+Requires-Dist: sentencepiece~=0.2.0; extra == "yandex"
+Provides-Extra: models
+Requires-Dist: medhelm[ai21]; extra == "models"
+Requires-Dist: medhelm[accelerate]; extra == "models"
+Requires-Dist: medhelm[aleph-alpha]; extra == "models"
+Requires-Dist: medhelm[allenai]; extra == "models"
+Requires-Dist: medhelm[amazon]; extra == "models"
+Requires-Dist: medhelm[anthropic]; extra == "models"
+Requires-Dist: medhelm[cohere]; extra == "models"
+Requires-Dist: medhelm[google]; extra == "models"
+Requires-Dist: medhelm[mistral]; extra == "models"
+Requires-Dist: medhelm[openai]; extra == "models"
+Requires-Dist: medhelm[reka]; extra == "models"
+Requires-Dist: medhelm[together]; extra == "models"
+Requires-Dist: medhelm[yandex]; extra == "models"
+Requires-Dist: medhelm[writer]; extra == "models"
+Provides-Extra: reka
+Requires-Dist: reka-api~=2.0; extra == "reka"
+Provides-Extra: vlm
+Requires-Dist: medhelm[openai]; extra == "vlm"
+Requires-Dist: einops~=0.7.0; extra == "vlm"
+Requires-Dist: einops-exts~=0.0.4; extra == "vlm"
+Requires-Dist: open-clip-torch~=2.24; extra == "vlm"
+Requires-Dist: torch~=2.1; extra == "vlm"
+Requires-Dist: transformers_stream_generator~=0.0.4; extra == "vlm"
+Requires-Dist: scipy~=1.10; extra == "vlm"
+Requires-Dist: torchvision<3.0.0,>=0.14.1; extra == "vlm"
+Requires-Dist: medhelm[reka]; extra == "vlm"
+Requires-Dist: medhelm[images]; extra == "vlm"
+Requires-Dist: medhelm[image2struct]; extra == "vlm"
+Requires-Dist: pycocoevalcap~=1.2; extra == "vlm"
+Requires-Dist: qwen-vl-utils~=0.0.8; extra == "vlm"
+Provides-Extra: ibm-enterprise-scenarios
+Requires-Dist: openpyxl~=3.1; extra == "ibm-enterprise-scenarios"
+Provides-Extra: ibm
+Requires-Dist: ibm-watsonx-ai~=1.2; extra == "ibm"
+Provides-Extra: image2struct
+Requires-Dist: medhelm[images]; extra == "image2struct"
+Requires-Dist: latex~=0.7.0; extra == "image2struct"
+Requires-Dist: pdf2image~=1.16; extra == "image2struct"
+Requires-Dist: selenium~=4.17; extra == "image2struct"
+Requires-Dist: html2text~=2024.2.26; extra == "image2struct"
+Requires-Dist: opencv-python-headless<=4.11.0.86,>=4.7.0.68; extra == "image2struct"
+Requires-Dist: lpips~=0.1.4; extra == "image2struct"
+Requires-Dist: imagehash~=4.3; extra == "image2struct"
+Provides-Extra: heim
+Requires-Dist: gdown~=5.1; extra == "heim"
+Requires-Dist: diffusers~=0.34.0; extra == "heim"
+Requires-Dist: icetk~=0.0.4; extra == "heim"
+Requires-Dist: jax~=0.6.2; python_version >= "3.10" and extra == "heim"
+Requires-Dist: jax~=0.4.30; python_version < "3.10" and extra == "heim"
+Requires-Dist: jaxlib~=0.6.2; python_version >= "3.10" and extra == "heim"
+Requires-Dist: jaxlib~=0.4.30; python_version < "3.10" and extra == "heim"
+Requires-Dist: medhelm[openai]; extra == "heim"
+Requires-Dist: einops~=0.7.0; extra == "heim"
+Requires-Dist: omegaconf~=2.3; extra == "heim"
+Requires-Dist: pytorch-lightning~=2.0; extra == "heim"
+Requires-Dist: flax~=0.10.7; python_version >= "3.10" and extra == "heim"
+Requires-Dist: flax~=0.8.5; python_version < "3.10" and extra == "heim"
+Requires-Dist: ftfy~=6.1; extra == "heim"
+Requires-Dist: Unidecode~=1.3; extra == "heim"
+Requires-Dist: wandb~=0.16; extra == "heim"
+Requires-Dist: google-cloud-translate~=3.11; extra == "heim"
+Requires-Dist: autokeras~=1.0; extra == "heim"
+Requires-Dist: clip-anytorch~=2.5; extra == "heim"
+Requires-Dist: google-cloud-storage~=2.9; extra == "heim"
+Requires-Dist: lpips~=0.1.4; extra == "heim"
+Requires-Dist: multilingual-clip~=1.0; extra == "heim"
+Requires-Dist: NudeNet~=2.0; extra == "heim"
+Requires-Dist: numpy>=1.26; extra == "heim"
+Requires-Dist: opencv-python<4.8.2.0,>=4.7.0.68; python_version >= "3.10" and extra == "heim"
+Requires-Dist: opencv-python-headless<=4.11.0.86,>=4.7.0.68; python_version < "3.10" and extra == "heim"
+Requires-Dist: pytorch-fid~=0.3.0; extra == "heim"
+Requires-Dist: tensorflow~=2.11; extra == "heim"
+Requires-Dist: timm~=0.6.12; extra == "heim"
+Requires-Dist: torch-fidelity~=0.3.0; extra == "heim"
+Requires-Dist: torchmetrics~=0.11.1; extra == "heim"
+Requires-Dist: scikit-image==0.*,>=0.22; extra == "heim"
+Requires-Dist: medhelm[images]; extra == "heim"
+Provides-Extra: medhelm
+Requires-Dist: accelerate~=0.25; extra == "medhelm"
+Requires-Dist: medhelm[openai]; extra == "medhelm"
+Requires-Dist: medhelm[yandex]; extra == "medhelm"
+Requires-Dist: medhelm[scenarios]; extra == "medhelm"
+Requires-Dist: bert_score~=0.3.13; extra == "medhelm"
+Requires-Dist: lxml~=5.3; extra == "medhelm"
+Requires-Dist: openpyxl~=3.1; extra == "medhelm"
+Requires-Dist: python-docx~=1.1; extra == "medhelm"
+Provides-Extra: gated
+Requires-Dist: gdown~=5.1; extra == "gated"
+Provides-Extra: audiolm
+Requires-Dist: medhelm[openai]; extra == "audiolm"
+Requires-Dist: medhelm[google]; extra == "audiolm"
+Requires-Dist: pydub~=0.25.1; extra == "audiolm"
+Requires-Dist: ffmpeg-python~=0.2.0; extra == "audiolm"
+Requires-Dist: soundfile~=0.12; extra == "audiolm"
+Requires-Dist: librosa~=0.10; extra == "audiolm"
+Requires-Dist: einops~=0.7.0; extra == "audiolm"
+Requires-Dist: openai-whisper==20240930; extra == "audiolm"
+Requires-Dist: transformers_stream_generator~=0.0.4; extra == "audiolm"
+Requires-Dist: av~=14.3; extra == "audiolm"
+Requires-Dist: scipy~=1.10; extra == "audiolm"
+Requires-Dist: torchvision<3.0.0,>=0.14.1; extra == "audiolm"
+Requires-Dist: flash-attn~=2.7; extra == "audiolm"
+Requires-Dist: pycocoevalcap~=1.2; extra == "audiolm"
+Requires-Dist: jiwer~=3.0; extra == "audiolm"
+Requires-Dist: rapidfuzz~=3.10; extra == "audiolm"
+Requires-Dist: jieba~=0.42.1; extra == "audiolm"
+Provides-Extra: codeinsights
+Requires-Dist: clang~=20.1; extra == "codeinsights"
+Requires-Dist: Levenshtein~=0.27; extra == "codeinsights"
+Provides-Extra: lmkt
+Requires-Dist: sentence_transformers~=4.1; extra == "lmkt"
+Provides-Extra: all
+Requires-Dist: medhelm[proxy-server]; extra == "all"
+Requires-Dist: medhelm[scenarios]; extra == "all"
+Requires-Dist: medhelm[metrics]; extra == "all"
+Requires-Dist: medhelm[plots]; extra == "all"
+Requires-Dist: medhelm[decodingtrust]; extra == "all"
+Requires-Dist: medhelm[slurm]; extra == "all"
+Requires-Dist: medhelm[cleva]; extra == "all"
+Requires-Dist: medhelm[images]; extra == "all"
+Requires-Dist: medhelm[models]; extra == "all"
+Requires-Dist: medhelm[mongo]; extra == "all"
+Requires-Dist: medhelm[heim]; extra == "all"
+Requires-Dist: medhelm[vlm]; extra == "all"
+Requires-Dist: medhelm[codeinsights]; extra == "all"
+Requires-Dist: medhelm[lmkt]; extra == "all"
+Provides-Extra: ci
+Requires-Dist: medhelm[metrics]; extra == "ci"
+Requires-Dist: medhelm[openai]; extra == "ci"
+Requires-Dist: medhelm[plots]; extra == "ci"
+Requires-Dist: medhelm[together]; extra == "ci"
+Requires-Dist: medhelm[yandex]; extra == "ci"
+Requires-Dist: medhelm[cohere]; extra == "ci"
+Requires-Dist: medhelm[proxy-server]; extra == "ci"
+Provides-Extra: litellm
+Requires-Dist: litellm>=1.80.0; extra == "litellm"
+Dynamic: license-file
+# Holistic Evaluation of Language Models (HELM)
+[comment]: <> (When using the img tag, which allows us to specify size, src has to be a URL.)
+<img src="https://github.com/stanford-crfm/helm/raw/v0.5.4/helm-frontend/src/assets/helm-logo.png" alt="HELM logo"  width="480"/>
+<a href="https://github.com/PacificAI/medhelm">
+    <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/PacificAI/medhelm">
+</a>
+<a href="https://github.com/PacificAI/medhelm/blob/main/LICENSE">
+    <img alt="License" src="https://img.shields.io/github/license/PacificAI/medhelm?color=blue" />
+</a>
+<a href="https://pypi.org/project/medhelm/">
+    <img alt="PyPI" src="https://img.shields.io/pypi/v/medhelm?color=blue" />
+</a>
+**Holistic Evaluation of Language Models (HELM)** is an open source Python framework created by the [Center for Research on Foundation Models (CRFM) at Stanford](https://crfm.stanford.edu/) for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models. This framework includes the following features:
+- Datasets and benchmarks in a standardized format (e.g. MMLU-Pro, GPQA, IFEval, WildBench)
+- Models from various providers accessible through a unified interface (e.g. OpenAI models, Anthropic Claude, Google Gemini)
+- Metrics for measuring various aspects beyond accuracy (e.g. efficiency, bias, toxicity)
+- Web UI for inspecting individual prompts and responses
+- Web leaderboard for comparing results across models and benchmarks
+## Documentation
+Documentation: **[medhelm.org](https://medhelm.org)**
+## Install & run (MedHELM library)
+MedHELM uses the HELM core engine and adds medical benchmarks. Install from PyPI:
+### Standard (recommended to start)
+Scenarios: **PubMedQA**, **MedCalc-Bench**, **MedicationQA**, **MedHallu**.
+```sh
+pip install medhelm
+# or with uv:
+uv pip install medhelm
+```
+Run a benchmark:
+```sh
+uv run medhelm-run --run-entries "pubmed_qa:model=huggingface/qwen2.5-7b" --suite my_med_test --max-eval-instances 10
+uv run helm-summarize --suite my_med_test
+uv run helm-server --suite my_med_test
+```
+Then open http://localhost:8000/ in your browser.
+### Clinical NLP tier (`[summarization]`)
+Adds heavy libraries (bert-score, rouge-score, nltk). **Install can take 2–3 minutes.**
+Scenarios: **DischargeMe** (hospital course summaries), **ACI-Bench** (clinical transcripts), **Patient-Edu** (simplifying medical jargon).
+```sh
+pip install "medhelm[summarization]"
+# or: uv pip install "medhelm[summarization]"
+```
+Example:
+```sh
+uv run medhelm-run --run-entries "discharge_summaries:model=huggingface/qwen2.5-7b" --suite med_summaries --max-eval-instances 5
+uv run helm-summarize --suite med_summaries
+uv run helm-server --suite med_summaries
+```
+### Gated / licensing tier (`[gated]`)
+Adds **gdown** for scenarios that use Google Drive. Install can also take longer.
+Scenarios: **MedQA** (USMLE/Board exams), **MedMCQA** (AIIMS/NEET exams).
+```sh
+pip install "medhelm[gated]"
+# or: uv pip install "medhelm[gated]"
+```
+Example:
+```sh
+uv run medhelm-run --run-entries "med_qa:model=huggingface/qwen2.5-7b" --suite board_exams --max-eval-instances 10
+uv run helm-summarize --suite board_exams
+uv run helm-server --suite board_exams
+```
+### Classic HELM commands
+You can still use `helm-run`, `helm-summarize`, and `helm-server`; `medhelm-run` is an alias for `helm-run`.
+```sh
+helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
+helm-summarize --suite my-suite
+helm-server --suite my-suite
+```
+## Quick Start (summary)
+<!--quick-start-begin-->
+| Tier | Install | Scenarios |
+|------|--------|-----------|
+| **Standard** | `pip install medhelm` or `uv pip install medhelm` | PubMedQA, MedCalc-Bench, MedicationQA, MedHallu |
+| **Summarization** | `pip install "medhelm[summarization]"` | DischargeMe, ACI-Bench, Patient-Edu (2–3 min install) |
+| **Gated** | `pip install "medhelm[gated]"` | MedQA, MedMCQA (Drive) |
+Run: `uv run medhelm-run --run-entries "<scenario>:model=<model>" --suite <name> --max-eval-instances <n>` then `helm-summarize` and `helm-server`. See [medhelm.org](https://medhelm.org) for full docs.
+<!--quick-start-end-->
+## Leaderboards
+We maintain official leaderboards with results from evaluating recent models on notable benchmarks using this framework. Our current flagship leaderboards are:
+- [HELM Capabilities](https://crfm.stanford.edu/helm/capabilities/latest/)
+- [HELM Safety](https://crfm.stanford.edu/helm/safety/latest/)
+- [Holistic Evaluation of Vision-Language Models (VHELM)](https://crfm.stanford.edu/helm/vhelm/latest/)
+We also maintain leaderboards for a diverse range of domains (e.g. medicine, finance) and aspects (e.g. multi-linguality, world knowledge, regulation compliance). Refer to the [HELM website](https://crfm.stanford.edu/helm/) for a full list of leaderboards.
+## Papers
+The HELM framework was used in the following papers for evaluating models.
+- **Holistic Evaluation of Language Models** - [paper](https://openreview.net/forum?id=iO4LZibEqW), [leaderboard](https://crfm.stanford.edu/helm/classic/latest/)
+- **Holistic Evaluation of Vision-Language Models (VHELM)** - [paper](https://arxiv.org/abs/2410.07112), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
+- **Holistic Evaluation of Text-To-Image Models (HEIM)** - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)
+- **Image2Struct: Benchmarking Structure Extraction for Vision-Language Models** - [paper](https://arxiv.org/abs/2410.22456)
+- **Enterprise Benchmarks for Large Language Model Evaluation** - [paper](https://arxiv.org/abs/2410.12857), [documentation](https://crfm-helm.readthedocs.io/en/latest/enterprise_benchmark/)
+- **The Mighty ToRR: A Benchmark for Table Reasoning and Robustness** - [paper](https://arxiv.org/abs/2502.19412), [leaderboard](https://crfm.stanford.edu/helm/torr/latest/)
+- **Reliable and Efficient Amortized Model-based Evaluation** - [paper](https://arxiv.org/abs/2503.13335), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
+- **MedHELM** - paper in progress, [leaderboard](https://crfm.stanford.edu/helm/medhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
+- **Holistic Evaluation of Audio-Language Models** - [paper](https://arxiv.org/abs/2508.21376), [leaderboard](https://crfm.stanford.edu/helm/audio/latest/)
+The HELM framework can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [Reproducing Leaderboards](https://medhelm.org/reproducing_leaderboards/) documentation on medhelm.org.
+## Citation
+If you use this software in your research, please cite the [Holistic Evaluation of Language Models paper](https://openreview.net/forum?id=iO4LZibEqW) as below.
+```bibtex
+@article{
+liang2023holistic,
+title={Holistic Evaluation of Language Models},
+author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
+journal={Transactions on Machine Learning Research},
+issn={2835-8856},
+year={2023},
+url={https://openreview.net/forum?id=iO4LZibEqW},
+note={Featured Certification, Expert Certification}
+}
+```
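
The `Provides-Extra` and `Requires-Dist` headers in the PKG-INFO hunk above can be cross-checked mechanically. For example, the `models` extra requires `medhelm[ai21]`, yet no `Provides-Extra: ai21` line appears in the metadata shown. The sketch below is one possible way to surface such references. It assumes plain header lines with the leading diff `+` markers already stripped, it does not normalize extra names, and the function name `undefined_extras` is hypothetical.

```python
import re


def undefined_extras(metadata_text, package):
    """Return extras referenced as `package[extra]` but never declared.

    Expects core-metadata text with plain `Provides-Extra:` and
    `Requires-Dist:` header lines (no leading diff `+` markers).
    """
    declared = set()
    referenced = set()
    # Matches self-references such as `medhelm[openai]` at the start of a requirement.
    self_ref = re.compile(r"^" + re.escape(package) + r"\[([^\]]+)\]")
    for line in metadata_text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "Provides-Extra":
            declared.add(value)
        elif key == "Requires-Dist":
            match = self_ref.match(value)
            if match:
                referenced.update(name.strip() for name in match.group(1).split(","))
    return sorted(referenced - declared)


# A few header lines copied from the hunk above, without the `+` prefix.
sample = """\
Provides-Extra: accelerate
Requires-Dist: accelerate~=0.25; extra == "accelerate"
Provides-Extra: models
Requires-Dist: medhelm[ai21]; extra == "models"
Requires-Dist: medhelm[accelerate]; extra == "models"
"""
print(undefined_extras(sample, "medhelm"))  # prints ['ai21']
```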

medhelm-0.5.13/README.md ADDED

@@ -0,0 +1,155 @@
+# Holistic Evaluation of Language Models (HELM)
+[comment]: <> (When using the img tag, which allows us to specify size, src has to be a URL.)
+<img src="https://github.com/stanford-crfm/helm/raw/v0.5.4/helm-frontend/src/assets/helm-logo.png" alt="HELM logo"  width="480"/>
+<a href="https://github.com/PacificAI/medhelm">
+    <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/PacificAI/medhelm">
+</a>
+<a href="https://github.com/PacificAI/medhelm/blob/main/LICENSE">
+    <img alt="License" src="https://img.shields.io/github/license/PacificAI/medhelm?color=blue" />
+</a>
+<a href="https://pypi.org/project/medhelm/">
+    <img alt="PyPI" src="https://img.shields.io/pypi/v/medhelm?color=blue" />
+</a>
+**Holistic Evaluation of Language Models (HELM)** is an open source Python framework created by the [Center for Research on Foundation Models (CRFM) at Stanford](https://crfm.stanford.edu/) for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models. This framework includes the following features:
+- Datasets and benchmarks in a standardized format (e.g. MMLU-Pro, GPQA, IFEval, WildBench)
+- Models from various providers accessible through a unified interface (e.g. OpenAI models, Anthropic Claude, Google Gemini)
+- Metrics for measuring various aspects beyond accuracy (e.g. efficiency, bias, toxicity)
+- Web UI for inspecting individual prompts and responses
+- Web leaderboard for comparing results across models and benchmarks
+## Documentation
+Documentation: **[medhelm.org](https://medhelm.org)**
+## Install & run (MedHELM library)
+MedHELM uses the HELM core engine and adds medical benchmarks. Install from PyPI:
+### Standard (recommended to start)
+Scenarios: **PubMedQA**, **MedCalc-Bench**, **MedicationQA**, **MedHallu**.
+```sh
+pip install medhelm
+# or with uv:
+uv pip install medhelm
+```
+Run a benchmark:
+```sh
+uv run medhelm-run --run-entries "pubmed_qa:model=huggingface/qwen2.5-7b" --suite my_med_test --max-eval-instances 10
+uv run helm-summarize --suite my_med_test
+uv run helm-server --suite my_med_test
+```
+Then open http://localhost:8000/ in your browser.
+### Clinical NLP tier (`[summarization]`)
+Adds heavy libraries (bert-score, rouge-score, nltk). **Install can take 2–3 minutes.**
+Scenarios: **DischargeMe** (hospital course summaries), **ACI-Bench** (clinical transcripts), **Patient-Edu** (simplifying medical jargon).
+```sh
+pip install "medhelm[summarization]"
+# or: uv pip install "medhelm[summarization]"
+```
+Example:
+```sh
+uv run medhelm-run --run-entries "discharge_summaries:model=huggingface/qwen2.5-7b" --suite med_summaries --max-eval-instances 5
+uv run helm-summarize --suite med_summaries
+uv run helm-server --suite med_summaries
+```
+### Gated / licensing tier (`[gated]`)
+Adds **gdown** for scenarios that use Google Drive. Install can also take longer.
+Scenarios: **MedQA** (USMLE/Board exams), **MedMCQA** (AIIMS/NEET exams).
+```sh
+pip install "medhelm[gated]"
+# or: uv pip install "medhelm[gated]"
+```
+Example:
+```sh
+uv run medhelm-run --run-entries "med_qa:model=huggingface/qwen2.5-7b" --suite board_exams --max-eval-instances 10
+uv run helm-summarize --suite board_exams
+uv run helm-server --suite board_exams
+```
+### Classic HELM commands
+You can still use `helm-run`, `helm-summarize`, and `helm-server`; `medhelm-run` is an alias for `helm-run`.
+```sh
+helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
+helm-summarize --suite my-suite
+helm-server --suite my-suite
+```
+## Quick Start (summary)
+<!--quick-start-begin-->
+| Tier | Install | Scenarios |
+|------|--------|-----------|
+| **Standard** | `pip install medhelm` or `uv pip install medhelm` | PubMedQA, MedCalc-Bench, MedicationQA, MedHallu |
+| **Summarization** | `pip install "medhelm[summarization]"` | DischargeMe, ACI-Bench, Patient-Edu (2–3 min install) |
+| **Gated** | `pip install "medhelm[gated]"` | MedQA, MedMCQA (Drive) |
+Run: `uv run medhelm-run --run-entries "<scenario>:model=<model>" --suite <name> --max-eval-instances <n>` then `helm-summarize` and `helm-server`. See [medhelm.org](https://medhelm.org) for full docs.
+<!--quick-start-end-->
+## Leaderboards
+We maintain official leaderboards with results from evaluating recent models on notable benchmarks using this framework. Our current flagship leaderboards are:
+- [HELM Capabilities](https://crfm.stanford.edu/helm/capabilities/latest/)
+- [HELM Safety](https://crfm.stanford.edu/helm/safety/latest/)
+- [Holistic Evaluation of Vision-Language Models (VHELM)](https://crfm.stanford.edu/helm/vhelm/latest/)
+We also maintain leaderboards for a diverse range of domains (e.g. medicine, finance) and aspects (e.g. multi-linguality, world knowledge, regulation compliance). Refer to the [HELM website](https://crfm.stanford.edu/helm/) for a full list of leaderboards.
+## Papers
+The HELM framework was used in the following papers for evaluating models.
+- **Holistic Evaluation of Language Models** - [paper](https://openreview.net/forum?id=iO4LZibEqW), [leaderboard](https://crfm.stanford.edu/helm/classic/latest/)
+- **Holistic Evaluation of Vision-Language Models (VHELM)** - [paper](https://arxiv.org/abs/2410.07112), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
+- **Holistic Evaluation of Text-To-Image Models (HEIM)** - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)
+- **Image2Struct: Benchmarking Structure Extraction for Vision-Language Models** - [paper](https://arxiv.org/abs/2410.22456)
+- **Enterprise Benchmarks for Large Language Model Evaluation** - [paper](https://arxiv.org/abs/2410.12857), [documentation](https://crfm-helm.readthedocs.io/en/latest/enterprise_benchmark/)
+- **The Mighty ToRR: A Benchmark for Table Reasoning and Robustness** - [paper](https://arxiv.org/abs/2502.19412), [leaderboard](https://crfm.stanford.edu/helm/torr/latest/)
+- **Reliable and Efficient Amortized Model-based Evaluation** - [paper](https://arxiv.org/abs/2503.13335), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
+- **MedHELM** - paper in progress, [leaderboard](https://crfm.stanford.edu/helm/medhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
+- **Holistic Evaluation of Audio-Language Models** - [paper](https://arxiv.org/abs/2508.21376), [leaderboard](https://crfm.stanford.edu/helm/audio/latest/)
+The HELM framework can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [Reproducing Leaderboards](https://medhelm.org/reproducing_leaderboards/) documentation on medhelm.org.
+## Citation
+If you use this software in your research, please cite the [Holistic Evaluation of Language Models paper](https://openreview.net/forum?id=iO4LZibEqW) as below.
+```bibtex
+@article{
+liang2023holistic,
+title={Holistic Evaluation of Language Models},
+author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
+journal={Transactions on Machine Learning Research},
+issn={2835-8856},
+year={2023},
+url={https://openreview.net/forum?id=iO4LZibEqW},
+note={Featured Certification, Expert Certification}
+}
+```
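
The README's run template, `medhelm-run --run-entries "<scenario>:model=<model>" --suite <name> --max-eval-instances <n>`, can be assembled programmatically when scripting several runs. The following is a minimal sketch that only builds the argument list and does not invoke the tool. The helper name `build_run_command` is hypothetical, and the flags are taken from the README text above rather than from the CLI's own help output.

```python
import shlex


def build_run_command(scenario, model, suite, max_eval_instances):
    """Build the argv list for a `medhelm-run` invocation (hypothetical helper)."""
    # The run entry follows the README template "<scenario>:model=<model>".
    run_entry = f"{scenario}:model={model}"
    return [
        "medhelm-run",
        "--run-entries", run_entry,
        "--suite", suite,
        "--max-eval-instances", str(max_eval_instances),
    ]


argv = build_run_command("pubmed_qa", "huggingface/qwen2.5-7b", "my_med_test", 10)
# shlex.join renders the list as a single shell-safe command line.
print(shlex.join(argv))
```

Keeping the command as a list avoids shell quoting issues if it is later passed to `subprocess.run`.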