captionevalkit-for-vlms 0.1.0__tar.gz

Files changed (144)

captionevalkit_for_vlms-0.1.0/.gitignore ADDED

@@ -0,0 +1,11 @@
+__pycache__/
+*.py[cod]
+.venv/
+.uv-cache/
+.cache/
+.hf-cache/
+.model-cache/
+outputs/
+envs/
+data/
+self-distillation-smoothing/

captionevalkit_for_vlms-0.1.0/.gitmodules ADDED

@@ -0,0 +1,24 @@
+[submodule "polos"]
+	path = metrics/upstreams/polos
+	url = https://github.com/keio-smilab24/Polos.git
+	ignore = dirty
+[submodule "pycocoevalcap"]
+	path = metrics/upstreams/pycocoevalcap
+	url = https://github.com/salaniz/pycocoevalcap.git
+	ignore = untracked
+[submodule "pacscore"]
+	path = metrics/upstreams/pacscore
+	url = https://github.com/aimagelab/pacscore.git
+	ignore = untracked
+[submodule "clipscore"]
+	path = metrics/upstreams/clipscore
+	url = https://github.com/jmhessel/clipscore.git
+	ignore = untracked
+[submodule "vela"]
+	path = metrics/upstreams/vela
+	url = https://github.com/Ka2ukiMatsuda/VELA.git
+	ignore = dirty
+[submodule "fleur"]
+	path = metrics/upstreams/fleur
+	url = https://github.com/Yebin46/FLEUR.git
+	ignore = dirty

captionevalkit_for_vlms-0.1.0/LICENSE ADDED

@@ -0,0 +1,32 @@
+BSD 3-Clause Clear License
+Copyright (c) 2026 Yuiga Wada
+All rights reserved.
+Redistribution and use in source and binary forms, with or without
+modification, are permitted (subject to the limitations in the disclaimer
+below) provided that the following conditions are met:
+1. Redistributions of source code must retain the above copyright notice,
+   this list of conditions and the following disclaimer.
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+3. Neither the name of the copyright holder nor the names of its contributors
+   may be used to endorse or promote products derived from this software
+   without specific prior written permission.
+NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY
+THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
+CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT
+NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
+OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

captionevalkit_for_vlms-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,482 @@
+Metadata-Version: 2.4
+Name: captionevalkit-for-vlms
+Version: 0.1.0
+Summary: A reproducible caption-evaluation toolkit for VLMs with per-metric uv environments.
+Project-URL: Homepage, https://github.com/YuigaWada/CaptionEvalKit-for-VLMs
+Project-URL: Repository, https://github.com/YuigaWada/CaptionEvalKit-for-VLMs
+Project-URL: Issues, https://github.com/YuigaWada/CaptionEvalKit-for-VLMs/issues
+Author: Yuiga Wada
+Maintainer: Yuiga Wada
+License-Expression: BSD-3-Clause-Clear
+License-File: LICENSE
+Keywords: caption-evaluation,metrics,reproducibility,vision-language-models,vlm
+Classifier: Development Status :: 3 - Alpha
+Classifier: Environment :: Console
+Classifier: Intended Audience :: Science/Research
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.10
+Requires-Dist: datasets<4,>=2.19
+Requires-Dist: pillow>=10
+Requires-Dist: rich>=13
+Requires-Dist: tomli>=2; python_version < '3.11'
+Description-Content-Type: text/markdown
+# CaptionEvalKit-for-VLMs
+<img width="1272" height="262" alt="logo" src="https://github.com/user-attachments/assets/504893fc-3bb2-40fd-9c84-835a0d04d055" />
+Reproducible, all-in-one image captioning evaluation for VLMs.
+* **For metric developers:** Evaluate metrics and reproduce reported results with <u>a single command</u>.
+* **For VLM developers:** Score VLM-generated captions using a comprehensive set of established captioning metrics.
+CaptionEvalKit currently supports:
+* **LLM-free metrics:** Polos, CLIPScore, PAC-S, RefCLIPScore, RefPAC-S, and more
+* **LLM-as-a-Judge metrics:** FLEUR, RefFLEUR, and VELA
+* **Classic captioning metrics:** BLEU, ROUGE-L, METEOR, CIDEr, and SPICE
+* **Benchmarks:** Composite, Flickr8k-EX, Flickr8k-CF, Polaris, Nebula, and LongCap-Arena
+<img width="850" height="178" alt="Screenshot 2026-06-13 at 2 23 30" src="https://github.com/user-attachments/assets/eea86fbb-d9ae-4fce-98fd-29f2510dd2bb" />
+## Table of Contents
+* [Install](#install)
+* [For VLM Developers](#for-vlm-developers)
+* [For Metric Developers](#for-metric-developers)
+* [Reproduce Reported Results](#reproduce-reported-results)
+* [Reproduction Status](#reproduction-status)
+* [Supported Metrics](#supported-metrics)
+* [Supported Benchmarks](#supported-benchmarks)
+* [Data and Assets](#data-and-assets)
+* [TODO](#todo)
+* [Development](#development)
+* [Citation](#citation)
+## Install
+Requirements: Python 3.10+, `git`, and `uv`. Java is also required for METEOR/SPICE through `pycocoevalcap`.
+From PyPI or a built wheel:
+```bash
+pip install captionevalkit-for-vlms
+capevalkit doctor
+capevalkit list-metrics
+```
+<!-- Installed wheels keep the package small and materialize locked upstream repositories on demand. To prefetch one metric family:
+```bash
+capevalkit sync --metrics cider
+```
+`score`, `benchmark`, and `all_reproduce` also sync required upstreams automatically. -->
+From a source checkout:
+```bash
+git clone --recursive https://github.com/YuigaWada/CaptionEvalKit-for-VLMs.git
+cd CaptionEvalKit-for-VLMs
+uv tool install --editable "$PWD" --force
+capevalkit list-metrics
+```
+<details>
+<summary>Runtime Cache</summary>
+Wheel installs use `CAPEVALKIT_HOME` as a runtime cache root. The default is `~/.cache/capevalkit`.
+```text
+~/.cache/capevalkit/
+  runtime/<lock-digest>/
+    metrics/
+    metrics/upstreams/
+    benchmarks/expected/
+    overlays/
+  uv/
+  huggingface/
+```
+Set a different location when needed:
+```bash
+CAPEVALKIT_HOME=/scratch/capevalkit capevalkit doctor
+```
+Source checkouts use the repository tree directly and keep submodules in `metrics/upstreams/`.
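The cache layout above can be resolved in a few lines. This is a sketch only: the helper names `cache_root` and `runtime_dir` are hypothetical, and treating an empty `CAPEVALKIT_HOME` as unset is an assumption rather than documented toolkit behavior.

```python
import os
from pathlib import Path


def cache_root(environ=None):
    """Return CAPEVALKIT_HOME if set and non-empty, else ~/.cache/capevalkit."""
    environ = os.environ if environ is None else environ
    configured = environ.get("CAPEVALKIT_HOME")
    if configured:
        return Path(configured).expanduser()
    return Path.home() / ".cache" / "capevalkit"


def runtime_dir(lock_digest, environ=None):
    """Return runtime/<lock-digest>/ under the cache root, per the layout above."""
    return cache_root(environ) / "runtime" / lock_digest
```

Passing `environ` explicitly keeps the lookup testable without touching the real process environment.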
+</details>
+## For Metric Developers
+Benchmark existing metrics, or evaluate your own metric without adopting a fixed metric signature.
+When changing upstream submodule revisions for a release, regenerate the runtime lock:
+```bash
+python scripts/generate_upstream_lock.py
+```
+<details>
+<summary>CLI</summary>
+Run one metric on one benchmark:
+```bash
+capevalkit benchmark \
+  --metric clipscore \
+  --benchmark composite \
+  --limit 8 \
+  --output outputs/clipscore/composite.json
+```
+Run the same metric across benchmarks:
+```bash
+capevalkit suite \
+  --metrics clipscore \
+  --benchmarks composite,flickr8k-ex,flickr8k-cf,nebula,polaris \
+  --limit 8 \
+  --output-dir outputs/clipscore
+```
+To wire a metric through its own CLI runner, add `metrics/mymetric/metric.toml`:
+```toml
+[metric]
+name = "mymetric"
+python = ">=3.10,<3.12"
+module = "capevalkit.metrics.mymetric"
+[repository]
+dir = "metrics/upstreams/mymetric"
+uv_project = "metrics/upstreams/mymetric"
+[runner]
+command = ["python", "score.py"]
+```
+Add a minimal `metrics/upstreams/mymetric/pyproject.toml`:
+```toml
+[project]
+name = "mymetric"
+version = "0.1.0"
+requires-python = ">=3.10,<3.12"
+dependencies = []
+```
+Make `metrics/upstreams/mymetric/score.py` accept:
+```text
+--predictions PREDICTIONS.jsonl
+--references REFERENCES.jsonl
+--output OUTPUT.json
+```
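A minimal sketch of such a runner follows. The README fixes only the three flags, so everything else here is an assumption: the JSONL fields (`id`, `caption`, `references`) are borrowed from the `predictions.jsonl` and `references.jsonl` examples in the VLM section, the output layout mirrors the sample CLIPScore JSON, and the exact-match scoring rule is a placeholder for a real metric.

```python
import argparse
import json
from pathlib import Path


def read_jsonl(path):
    """Return one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def score_files(predictions_path, references_path, output_path):
    """Write an aggregate score plus per-item scores as JSON; return the payload."""
    references = {row["id"]: row["references"] for row in read_jsonl(references_path)}
    per_item = {}
    for row in read_jsonl(predictions_path):
        # Placeholder rule: 1.0 when the caption equals one of its references.
        hit = row["caption"] in references.get(row["id"], [])
        per_item[row["id"]] = {"MyMetric": float(hit)}
    mean = sum(item["MyMetric"] for item in per_item.values()) / max(len(per_item), 1)
    payload = {"MyMetric": mean, "per_item": per_item}
    output = Path(output_path)
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload


def main(argv=None):
    """Parse the three flags listed above and score the given files."""
    parser = argparse.ArgumentParser(description="Hypothetical mymetric runner")
    parser.add_argument("--predictions", required=True)
    parser.add_argument("--references", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    return score_files(args.predictions, args.references, args.output)
```

When saved as `score.py`, the file would also need to call `main()` under an `if __name__ == "__main__":` guard so that the `["python", "score.py"]` command can invoke it.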
+Then benchmark it:
+```bash
+capevalkit benchmark \
+  --metric mymetric \
+  --benchmark composite \
+  --output outputs/mymetric/composite.json
+```
+</details>
+<!-- <details> -->
+<!-- <summary>Python</summary> -->
+```python
+import capevalkit as capeval
+class MyMetric:
+    def __call__(self, samples):
+        return {
+            sample.id: float(bool(sample.prediction and sample.references))
+            for sample in samples
+        }
+result = capeval.evaluate_metric(
+    benchmark="flickr8k-cf",
+    metric=MyMetric(),
+    metric_name="MyMetric",
+    limit=8,
+    output="outputs/mymetric/flickr8k-cf.json",
+)
+```
+The callable receives `CaptionSample` objects and returns `{sample_id: score}`. Your metric can keep any internal signature.
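As a slightly less trivial illustration of the same contract, the sketch below scores each sample by unigram precision against its best-matching reference. The `Sample` dataclass is a stand-in that models only the three attributes the example above reads (`id`, `prediction`, `references`), so the snippet runs without the toolkit installed; the scoring rule itself is illustrative, not a metric shipped with CaptionEvalKit.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Stand-in for CaptionSample; only the attributes used above are modeled."""
    id: str
    prediction: str
    references: list = field(default_factory=list)


class TokenOverlapMetric:
    """Best unigram precision of the prediction against any single reference."""

    def __call__(self, samples):
        scores = {}
        for sample in samples:
            predicted = sample.prediction.lower().split()
            best = 0.0
            for reference in sample.references:
                reference_tokens = set(reference.lower().split())
                matched = sum(token in reference_tokens for token in predicted)
                # Empty predictions score 0.0 instead of dividing by zero.
                best = max(best, matched / len(predicted) if predicted else 0.0)
            scores[sample.id] = best
        return scores


scores = TokenOverlapMetric()([
    Sample("0001", "a dog runs", ["A dog runs outside.", "A cat sleeps."]),
    Sample("0002", "", ["A person rides a bike."]),
])
```

An instance of such a class could then be passed as `metric=` in the `evaluate_metric` call shown above, since it returns the same `{sample_id: score}` mapping.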
+<!-- </details> -->
+## For VLM Developers
+Evaluate saved captions from files, or run your caption model on your own images.
+<details>
+<summary>CLI</summary>
+`predictions.jsonl`:
+```jsonl
+{"id": "0001", "caption": "A dog runs through grass.", "image": "0001.jpg"}
+{"id": "0002", "caption": "A person rides a bicycle.", "image": "0002.jpg"}
+```
+`references.jsonl`:
+```jsonl
+{"id": "0001", "references": ["A dog runs outside.", "A dog is in a grassy field."]}
+{"id": "0002", "references": ["A cyclist rides on a road.", "A person rides a bike."]}
+```
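If captions live in memory rather than on disk, the two files above can be produced with the standard library alone. This is a sketch: `write_jsonl` and `export_inputs` are hypothetical helper names, and the field names simply follow the two examples above.

```python
import json
from pathlib import Path


def write_jsonl(path, rows):
    """Write one JSON object per line and return the number of rows written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)


def export_inputs(samples, out_dir):
    """Split combined samples into predictions.jsonl and references.jsonl."""
    out_dir = Path(out_dir)
    predictions = [
        {"id": s["id"], "caption": s["caption"], "image": s["image"]} for s in samples
    ]
    references = [{"id": s["id"], "references": s["references"]} for s in samples]
    write_jsonl(out_dir / "predictions.jsonl", predictions)
    write_jsonl(out_dir / "references.jsonl", references)
    return out_dir / "predictions.jsonl", out_dir / "references.jsonl"
```

Keeping `id` in both files is what lets a scorer join each caption to its references.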
+```bash
+capevalkit score \
+  --metric clipscore \
+  --predictions predictions.jsonl \
+  --references references.jsonl \
+  --image-dir images \
+  --output outputs/clipscore.json
+```
+```json
+{
+  "CLIPScore": 0.73,
+  "RefCLIPScore": 0.81,
+  "per_item": {
+    "0001": {"CLIPScore": 0.70, "RefCLIPScore": 0.78}
+  }
+}
+```
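The `per_item` block makes it easy to find the captions a metric likes least. The helper below is a sketch that assumes the output keeps the layout shown above; `lowest_scoring` is a hypothetical name, and the second item in the sample data is invented purely for illustration.

```python
import json


def lowest_scoring(result, metric, k=5):
    """Return up to k (item_id, score) pairs with the lowest score for one metric."""
    per_item = result.get("per_item", {})
    ranked = sorted(
        ((item_id, scores[metric]) for item_id, scores in per_item.items() if metric in scores),
        key=lambda pair: pair[1],
    )
    return ranked[:k]


# Same shape as the output above, with an extra illustrative item "0002".
result = json.loads("""
{
  "CLIPScore": 0.73,
  "RefCLIPScore": 0.81,
  "per_item": {
    "0001": {"CLIPScore": 0.70, "RefCLIPScore": 0.78},
    "0002": {"CLIPScore": 0.76, "RefCLIPScore": 0.84}
  }
}
""")
worst = lowest_scoring(result, "CLIPScore", k=1)
```

For a real run, replace the inline string with the contents of the file passed to `--output`.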
+</details>
+<!-- <details> -->
+<!-- <summary>Python</summary> -->
+Run these examples with `uv run python` from the repository, or install `captionevalkit-for-vlms` (which provides the `capevalkit` module) into your own Python environment.
+```python
+import capevalkit as capeval
+def predict(batch):
+    return ["A dog runs through grass." for _ in batch.images]
+results = capeval.evaluate_caption_model(
+    images=["images/0001.jpg", "images/0002.jpg"],
+    metrics=["cider", "clipscore"],
+    predict=predict,
+    references=[
+        ["A dog runs outside.", "A dog is in a grassy field."],
+        ["A cyclist rides on a road.", "A person rides a bike."],
+    ],
+    batch_size=8,
+    output_dir="outputs/my-model",
+)
+```
+If captions are already generated, pass image-caption pairs directly:
+```python
+import capevalkit as capeval
+results = capeval.evaluate_captions(
+    pairs=[
+        {
+            "id": "0001",
+            "image": "images/0001.jpg",
+            "caption": "A dog runs through grass.",
+            "references": ["A dog runs outside.", "A dog is in a grassy field."],
+        },
+        {
+            "id": "0002",
+            "image": "images/0002.jpg",
+            "caption": "A person rides a bicycle.",
+            "references": ["A cyclist rides on a road.", "A person rides a bike."],
+        },
+    ],
+    metrics=["cider", "clipscore"],
+    output_dir="outputs/my-captions",
+)
+```
+For manual caption-model control:
+```python
+import capevalkit as capeval
+def predict(batch):
+    return ["A dog runs through grass." for _ in batch.images]
+with capeval.CaptionEvalRun(
+    images=["images/0001.jpg", "images/0002.jpg"],
+    metrics=["cider", "clipscore"],
+    references=[
+        ["A dog runs outside.", "A dog is in a grassy field."],
+        ["A cyclist rides on a road.", "A person rides a bike."],
+    ],
+    output_dir="outputs/my-model",
+) as run:
+    for batch in run.iter_batches(batch_size=8):
+        run.record(batch.ids, predict(batch))
+    results = run.evaluate()
+```
+<!-- </details> -->
+## Reproduce Reported Results
+Preview the default reproducibility suite:
+```bash
+capevalkit all_reproduce --dry-run
+```
+Run one verified pair:
+```bash
+capevalkit all_reproduce \
+  --metrics clipscore \
+  --benchmarks composite
+```
+Run a launch smoke test for every default pair:
+```bash
+capevalkit all_reproduce --smoke --jobs 4 --gpu-jobs 1
+```
+`--smoke` runs one sample per pair and checks launch/output writing only. Omit it for full correlations.
+## Reproduction Status
+Legend: `✅` reproduced, `⚠️` not reproduced, `-` no default target. For LongCap-Arena, unreproduced targets are also shown as `-`.
+| Metric | Composite | Flickr8k-EX | Flickr8k-CF | Nebula | Polaris | LCA TestA | LCA TestB |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| `bleu` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+| `cider` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+| `clipscore` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+| `fleur` | ⚠️ | ⚠️ | ✅ | - | - | - | - |
+| `meteor` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+| `pacscore` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+| `polos` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+| `refclipscore` | ✅ | ✅ | ✅ | ⚠️ | ⚠️ | - | - |
+| `reffleur` | ✅ | ✅ | ✅ | - | - | - | - |
+| `refpacscore` | ✅ | ✅ | ✅ | ⚠️ | ⚠️ | - | - |
+| `rouge` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+| `spice` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+| `vela` | - | - | - | - | - | ✅ | ✅ |
+## Supported Metrics
+| Metric | Upstream | Notes |
+| --- | --- | --- |
+| `bleu` | `pycocoevalcap` | BLEU-1 to BLEU-4 |
+| `rouge` | `pycocoevalcap` | ROUGE-L |
+| `meteor` | `pycocoevalcap` | Java METEOR through upstream |
+| `cider` | `pycocoevalcap` | CIDEr |
+| `spice` | `pycocoevalcap` | SPICE |
+| `clipscore` | CLIPScore | image-caption CLIPScore |
+| `refclipscore` | CLIPScore | reference-aware CLIPScore |
+| `pacscore` | PACScore | PAC-S |
+| `refpacscore` | PACScore | reference-aware PAC-S |
+| `polos` | Polos | model-based reference-aware metric |
+| `fleur` | FLEUR | LLaVA-based reference-free metric |
+| `reffleur` | FLEUR | reference-aware FLEUR |
+| `vela` | VELA | long-caption metric for `desc`, `rel`, `flu` |
+## Supported Benchmarks
+| Benchmark | Source |
+| --- | --- |
+| `composite` | Hugging Face `yuwd/Composite` |
+| `flickr8k-ex` | Hugging Face `yuwd/Flickr8k-HumanEval`, expert split |
+| `flickr8k-cf` | Hugging Face `yuwd/Flickr8k-HumanEval`, CrowdFlower split |
+| `nebula` | Hugging Face `Ka2ukiMatsuda/Nebula` |
+| `polaris` | Hugging Face `yuwd/Polaris` |
+| `longcaparena-testa-{desc,rel,flu}` | Hugging Face `Ka2ukiMatsuda/LongCap-Arena` |
+| `longcaparena-testb-{desc,rel,flu}` | Hugging Face `Ka2ukiMatsuda/LongCap-Arena` |
+## Data and Assets
+Benchmark datasets are cached on first use under `<runtime-root>/.hf-cache/benchmarks/`. In a source checkout, `<runtime-root>` is the repository root; in a wheel install, it is `$CAPEVALKIT_HOME/runtime/<lock-digest>`.
+| Dataset | Loaded from |
+| --- | --- |
+| Composite | Hugging Face `yuwd/Composite` |
+| Flickr8k-EX / Flickr8k-CF | Hugging Face `yuwd/Flickr8k-HumanEval` |
+| Nebula | Hugging Face `Ka2ukiMatsuda/Nebula` |
+| Polaris | Hugging Face `yuwd/Polaris` |
+| Spica corrections | Hugging Face `hiranohachiman/Spica` |
+| LongCap-Arena | Hugging Face `Ka2ukiMatsuda/LongCap-Arena` |
+Model files and checkpoints are downloaded on first use by the corresponding metric runner or upstream library.
+| Metric family | Model or checkpoint source |
+| --- | --- |
+| CLIPScore | OpenAI CLIP loader cache |
+| PACScore | PACScore checkpoint URL, fetched on first PACScore run |
+| Polos | upstream Polos model cache, fetched on first Polos run |
+| FLEUR | Hugging Face `liuhaotian/llava-v1.5-13b` |
+| VELA | Hugging Face `Qwen/Qwen2.5-3B-Instruct`, `BeichenZhang/LongCLIP-L`, `Ka2ukiMatsuda/vela` |
+Set `IC_EVAL_REFRESH_HF_CACHE=1` to refresh cached benchmark rows and extracted images.
+<details>
+<summary>Local data layout</summary>
+If you pass a non-repository data root, use this layout:
+```text
+data/
+  composite/
+    en_test_composite_da2.csv
+    images/
+  flickr8k/
+    flickr8k.json
+    crowdflower_flickr8k.json
+    images/
+  nebula/
+    images/
+  polaris/
+    images/
+```
+</details>
+## TODO
+- [ ] Implement EXPERT benchmark support.
+- [ ] Improve the first-download UI/UX for `all_reproduce`.
+## Development
+```bash
+uv run python -m unittest discover -s tests
+```
+Repository map:
+```text
+capevalkit/                    CLI, API, benchmark loaders, verification
+metrics/*/metric.toml          metric manifests
+metrics/upstreams/*            upstream metric repositories
+overlays/metrics/upstreams/*   uv overlays for upstream repositories
+benchmarks/expected/           default all_reproduce expected values
+```
+## Citation
+If you use this toolkit, cite the original metric and benchmark papers for the implementations and reported values you rely on.