captionevalkit-for-vlms 0.1.0__tar.gz

Files changed (144)
  1. captionevalkit_for_vlms-0.1.0/.gitignore +11 -0
  2. captionevalkit_for_vlms-0.1.0/.gitmodules +24 -0
  3. captionevalkit_for_vlms-0.1.0/LICENSE +32 -0
  4. captionevalkit_for_vlms-0.1.0/PKG-INFO +482 -0
  5. captionevalkit_for_vlms-0.1.0/README.md +455 -0
  6. captionevalkit_for_vlms-0.1.0/benchmarks/README.md +51 -0
  7. captionevalkit_for_vlms-0.1.0/benchmarks/expected/bleu/composite.json +8 -0
  8. captionevalkit_for_vlms-0.1.0/benchmarks/expected/bleu/flickr8k-cf.json +8 -0
  9. captionevalkit_for_vlms-0.1.0/benchmarks/expected/bleu/flickr8k-ex.json +8 -0
  10. captionevalkit_for_vlms-0.1.0/benchmarks/expected/bleu/nebula.json +8 -0
  11. captionevalkit_for_vlms-0.1.0/benchmarks/expected/bleu/polaris.json +7 -0
  12. captionevalkit_for_vlms-0.1.0/benchmarks/expected/cider/composite.json +8 -0
  13. captionevalkit_for_vlms-0.1.0/benchmarks/expected/cider/flickr8k-cf.json +8 -0
  14. captionevalkit_for_vlms-0.1.0/benchmarks/expected/cider/flickr8k-ex.json +8 -0
  15. captionevalkit_for_vlms-0.1.0/benchmarks/expected/cider/nebula.json +8 -0
  16. captionevalkit_for_vlms-0.1.0/benchmarks/expected/cider/polaris.json +7 -0
  17. captionevalkit_for_vlms-0.1.0/benchmarks/expected/clipscore/composite.json +8 -0
  18. captionevalkit_for_vlms-0.1.0/benchmarks/expected/clipscore/flickr8k-cf.json +8 -0
  19. captionevalkit_for_vlms-0.1.0/benchmarks/expected/clipscore/flickr8k-ex.json +8 -0
  20. captionevalkit_for_vlms-0.1.0/benchmarks/expected/clipscore/nebula.json +8 -0
  21. captionevalkit_for_vlms-0.1.0/benchmarks/expected/clipscore/polaris.json +7 -0
  22. captionevalkit_for_vlms-0.1.0/benchmarks/expected/fleur/composite.json +7 -0
  23. captionevalkit_for_vlms-0.1.0/benchmarks/expected/fleur/flickr8k-cf.json +7 -0
  24. captionevalkit_for_vlms-0.1.0/benchmarks/expected/fleur/flickr8k-ex.json +7 -0
  25. captionevalkit_for_vlms-0.1.0/benchmarks/expected/meteor/composite.json +8 -0
  26. captionevalkit_for_vlms-0.1.0/benchmarks/expected/meteor/flickr8k-cf.json +8 -0
  27. captionevalkit_for_vlms-0.1.0/benchmarks/expected/meteor/flickr8k-ex.json +8 -0
  28. captionevalkit_for_vlms-0.1.0/benchmarks/expected/meteor/nebula.json +8 -0
  29. captionevalkit_for_vlms-0.1.0/benchmarks/expected/meteor/polaris.json +7 -0
  30. captionevalkit_for_vlms-0.1.0/benchmarks/expected/pacscore/composite.json +8 -0
  31. captionevalkit_for_vlms-0.1.0/benchmarks/expected/pacscore/flickr8k-cf.json +8 -0
  32. captionevalkit_for_vlms-0.1.0/benchmarks/expected/pacscore/flickr8k-ex.json +8 -0
  33. captionevalkit_for_vlms-0.1.0/benchmarks/expected/pacscore/nebula.json +8 -0
  34. captionevalkit_for_vlms-0.1.0/benchmarks/expected/pacscore/polaris.json +7 -0
  35. captionevalkit_for_vlms-0.1.0/benchmarks/expected/polos/composite.json +8 -0
  36. captionevalkit_for_vlms-0.1.0/benchmarks/expected/polos/flickr8k-cf.json +8 -0
  37. captionevalkit_for_vlms-0.1.0/benchmarks/expected/polos/flickr8k-ex.json +8 -0
  38. captionevalkit_for_vlms-0.1.0/benchmarks/expected/polos/nebula.json +8 -0
  39. captionevalkit_for_vlms-0.1.0/benchmarks/expected/polos/polaris.json +7 -0
  40. captionevalkit_for_vlms-0.1.0/benchmarks/expected/refclipscore/composite.json +8 -0
  41. captionevalkit_for_vlms-0.1.0/benchmarks/expected/refclipscore/flickr8k-cf.json +8 -0
  42. captionevalkit_for_vlms-0.1.0/benchmarks/expected/refclipscore/flickr8k-ex.json +8 -0
  43. captionevalkit_for_vlms-0.1.0/benchmarks/expected/refclipscore/nebula.json +7 -0
  44. captionevalkit_for_vlms-0.1.0/benchmarks/expected/refclipscore/polaris.json +7 -0
  45. captionevalkit_for_vlms-0.1.0/benchmarks/expected/reffleur/composite.json +7 -0
  46. captionevalkit_for_vlms-0.1.0/benchmarks/expected/reffleur/flickr8k-cf.json +7 -0
  47. captionevalkit_for_vlms-0.1.0/benchmarks/expected/reffleur/flickr8k-ex.json +7 -0
  48. captionevalkit_for_vlms-0.1.0/benchmarks/expected/refpacscore/composite.json +7 -0
  49. captionevalkit_for_vlms-0.1.0/benchmarks/expected/refpacscore/flickr8k-cf.json +7 -0
  50. captionevalkit_for_vlms-0.1.0/benchmarks/expected/refpacscore/flickr8k-ex.json +7 -0
  51. captionevalkit_for_vlms-0.1.0/benchmarks/expected/refpacscore/nebula.json +7 -0
  52. captionevalkit_for_vlms-0.1.0/benchmarks/expected/refpacscore/polaris.json +7 -0
  53. captionevalkit_for_vlms-0.1.0/benchmarks/expected/rouge/composite.json +8 -0
  54. captionevalkit_for_vlms-0.1.0/benchmarks/expected/rouge/flickr8k-cf.json +8 -0
  55. captionevalkit_for_vlms-0.1.0/benchmarks/expected/rouge/flickr8k-ex.json +8 -0
  56. captionevalkit_for_vlms-0.1.0/benchmarks/expected/rouge/nebula.json +8 -0
  57. captionevalkit_for_vlms-0.1.0/benchmarks/expected/rouge/polaris.json +7 -0
  58. captionevalkit_for_vlms-0.1.0/benchmarks/expected/spice/composite.json +8 -0
  59. captionevalkit_for_vlms-0.1.0/benchmarks/expected/spice/flickr8k-cf.json +8 -0
  60. captionevalkit_for_vlms-0.1.0/benchmarks/expected/spice/flickr8k-ex.json +8 -0
  61. captionevalkit_for_vlms-0.1.0/benchmarks/expected/spice/nebula.json +8 -0
  62. captionevalkit_for_vlms-0.1.0/benchmarks/expected/spice/polaris.json +7 -0
  63. captionevalkit_for_vlms-0.1.0/benchmarks/expected/vela/longcaparena-testa-desc.json +8 -0
  64. captionevalkit_for_vlms-0.1.0/benchmarks/expected/vela/longcaparena-testa-flu.json +8 -0
  65. captionevalkit_for_vlms-0.1.0/benchmarks/expected/vela/longcaparena-testa-rel.json +8 -0
  66. captionevalkit_for_vlms-0.1.0/benchmarks/expected/vela/longcaparena-testb-desc.json +8 -0
  67. captionevalkit_for_vlms-0.1.0/benchmarks/expected/vela/longcaparena-testb-flu.json +8 -0
  68. captionevalkit_for_vlms-0.1.0/benchmarks/expected/vela/longcaparena-testb-rel.json +8 -0
  69. captionevalkit_for_vlms-0.1.0/capevalkit/__init__.py +30 -0
  70. captionevalkit_for_vlms-0.1.0/capevalkit/api.py +509 -0
  71. captionevalkit_for_vlms-0.1.0/capevalkit/benchmarks.py +1093 -0
  72. captionevalkit_for_vlms-0.1.0/capevalkit/cli.py +379 -0
  73. captionevalkit_for_vlms-0.1.0/capevalkit/compat.py +12 -0
  74. captionevalkit_for_vlms-0.1.0/capevalkit/context.py +110 -0
  75. captionevalkit_for_vlms-0.1.0/capevalkit/correlations.py +73 -0
  76. captionevalkit_for_vlms-0.1.0/capevalkit/dispatcher.py +204 -0
  77. captionevalkit_for_vlms-0.1.0/capevalkit/downloads.py +370 -0
  78. captionevalkit_for_vlms-0.1.0/capevalkit/launcher.py +31 -0
  79. captionevalkit_for_vlms-0.1.0/capevalkit/manifests.py +127 -0
  80. captionevalkit_for_vlms-0.1.0/capevalkit/metrics/__init__.py +2 -0
  81. captionevalkit_for_vlms-0.1.0/capevalkit/metrics/clipscore_metric.py +239 -0
  82. captionevalkit_for_vlms-0.1.0/capevalkit/metrics/fleur_metric.py +254 -0
  83. captionevalkit_for_vlms-0.1.0/capevalkit/metrics/pacscore_metric.py +227 -0
  84. captionevalkit_for_vlms-0.1.0/capevalkit/metrics/polos.py +95 -0
  85. captionevalkit_for_vlms-0.1.0/capevalkit/metrics/polos_validate.py +63 -0
  86. captionevalkit_for_vlms-0.1.0/capevalkit/metrics/pycocoevalcap_metrics.py +147 -0
  87. captionevalkit_for_vlms-0.1.0/capevalkit/metrics/vela_metric.py +151 -0
  88. captionevalkit_for_vlms-0.1.0/capevalkit/overlays.py +46 -0
  89. captionevalkit_for_vlms-0.1.0/capevalkit/paths.py +21 -0
  90. captionevalkit_for_vlms-0.1.0/capevalkit/progress.py +24 -0
  91. captionevalkit_for_vlms-0.1.0/capevalkit/reproduce.py +929 -0
  92. captionevalkit_for_vlms-0.1.0/capevalkit/resources/upstreams.lock.json +79 -0
  93. captionevalkit_for_vlms-0.1.0/capevalkit/runtime.py +219 -0
  94. captionevalkit_for_vlms-0.1.0/capevalkit/runtime_env.py +20 -0
  95. captionevalkit_for_vlms-0.1.0/capevalkit/verify.py +118 -0
  96. captionevalkit_for_vlms-0.1.0/docs/assets.md +84 -0
  97. captionevalkit_for_vlms-0.1.0/metrics/bleu/metric.toml +19 -0
  98. captionevalkit_for_vlms-0.1.0/metrics/cider/metric.toml +19 -0
  99. captionevalkit_for_vlms-0.1.0/metrics/clipscore/metric.toml +19 -0
  100. captionevalkit_for_vlms-0.1.0/metrics/clipscore-vitl/metric.toml +18 -0
  101. captionevalkit_for_vlms-0.1.0/metrics/clipscoreavg/metric.toml +18 -0
  102. captionevalkit_for_vlms-0.1.0/metrics/fleur/metric.toml +18 -0
  103. captionevalkit_for_vlms-0.1.0/metrics/meteor/metric.toml +19 -0
  104. captionevalkit_for_vlms-0.1.0/metrics/pacscore/metric.toml +18 -0
  105. captionevalkit_for_vlms-0.1.0/metrics/pacscore-vitl/metric.toml +18 -0
  106. captionevalkit_for_vlms-0.1.0/metrics/pacscoreavg/metric.toml +18 -0
  107. captionevalkit_for_vlms-0.1.0/metrics/pacscorepp/metric.toml +18 -0
  108. captionevalkit_for_vlms-0.1.0/metrics/pacscoreppavg/metric.toml +18 -0
  109. captionevalkit_for_vlms-0.1.0/metrics/polos/metric.toml +28 -0
  110. captionevalkit_for_vlms-0.1.0/metrics/refclipscore/metric.toml +18 -0
  111. captionevalkit_for_vlms-0.1.0/metrics/refclipscore-vitl/metric.toml +18 -0
  112. captionevalkit_for_vlms-0.1.0/metrics/reffleur/metric.toml +18 -0
  113. captionevalkit_for_vlms-0.1.0/metrics/refpacscore/metric.toml +18 -0
  114. captionevalkit_for_vlms-0.1.0/metrics/refpacscore-vitl/metric.toml +18 -0
  115. captionevalkit_for_vlms-0.1.0/metrics/refpacscorepp/metric.toml +18 -0
  116. captionevalkit_for_vlms-0.1.0/metrics/rouge/metric.toml +19 -0
  117. captionevalkit_for_vlms-0.1.0/metrics/spice/metric.toml +19 -0
  118. captionevalkit_for_vlms-0.1.0/metrics/vela/metric.toml +18 -0
  119. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/clipscore/pyproject.toml +18 -0
  120. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/clipscore/uv.toml +2 -0
  121. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/fleur/fleur_wrapper/__init__.py +1 -0
  122. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/fleur/pyproject.toml +22 -0
  123. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/fleur/uv.toml +1 -0
  124. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/pacscore/pyproject.toml +22 -0
  125. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/pacscore/uv.toml +2 -0
  126. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/polos/polos/models/encoders/__init__.py +14 -0
  127. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/polos/polos/models/encoders/bert.py +106 -0
  128. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/polos/polos/models/estimators/polos_estimator.py +239 -0
  129. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/polos/polos/models/model_base.py +276 -0
  130. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/polos/polos/tokenizers_/__init__.py +13 -0
  131. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/polos/pyproject.toml +31 -0
  132. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/polos/uv.toml +2 -0
  133. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/pycocoevalcap/pyproject.toml +12 -0
  134. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/pycocoevalcap/uv.toml +2 -0
  135. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/vela/configs/config_regressor.yaml +50 -0
  136. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/vela/pyproject.toml +30 -0
  137. captionevalkit_for_vlms-0.1.0/overlays/metrics/upstreams/vela/uv.toml +1 -0
  138. captionevalkit_for_vlms-0.1.0/pyproject.toml +95 -0
  139. captionevalkit_for_vlms-0.1.0/scripts/generate_upstream_lock.py +114 -0
  140. captionevalkit_for_vlms-0.1.0/scripts/smoke_dist.sh +220 -0
  141. captionevalkit_for_vlms-0.1.0/tests/test_architecture.py +1864 -0
  142. captionevalkit_for_vlms-0.1.0/tests/test_downloads.py +130 -0
  143. captionevalkit_for_vlms-0.1.0/uv.lock +1913 -0
  144. captionevalkit_for_vlms-0.1.0/uv.toml +2 -0
@@ -0,0 +1,11 @@
+ __pycache__/
+ *.py[cod]
+ .venv/
+ .uv-cache/
+ .cache/
+ .hf-cache/
+ .model-cache/
+ outputs/
+ envs/
+ data/
+ self-distillation-smoothing/
@@ -0,0 +1,24 @@
+ [submodule "polos"]
+ path = metrics/upstreams/polos
+ url = https://github.com/keio-smilab24/Polos.git
+ ignore = dirty
+ [submodule "pycocoevalcap"]
+ path = metrics/upstreams/pycocoevalcap
+ url = https://github.com/salaniz/pycocoevalcap.git
+ ignore = untracked
+ [submodule "pacscore"]
+ path = metrics/upstreams/pacscore
+ url = https://github.com/aimagelab/pacscore.git
+ ignore = untracked
+ [submodule "clipscore"]
+ path = metrics/upstreams/clipscore
+ url = https://github.com/jmhessel/clipscore.git
+ ignore = untracked
+ [submodule "vela"]
+ path = metrics/upstreams/vela
+ url = https://github.com/Ka2ukiMatsuda/VELA.git
+ ignore = dirty
+ [submodule "fleur"]
+ path = metrics/upstreams/fleur
+ url = https://github.com/Yebin46/FLEUR.git
+ ignore = dirty
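The `.gitmodules` hunk above pins six upstream repositories with a per-submodule `ignore` setting. As a minimal sketch of how that file could be inspected, the snippet below reads two of the entries with Python's standard `configparser`. The helper name `parse_gitmodules` is hypothetical and is not part of `capevalkit`; the sample text is copied from the hunk.

```python
import configparser

# Two of the six entries from the .gitmodules hunk, copied verbatim.
GITMODULES = """\
[submodule "polos"]
path = metrics/upstreams/polos
url = https://github.com/keio-smilab24/Polos.git
ignore = dirty
[submodule "pycocoevalcap"]
path = metrics/upstreams/pycocoevalcap
url = https://github.com/salaniz/pycocoevalcap.git
ignore = untracked
"""


def parse_gitmodules(text):
    """Hypothetical helper: map each submodule name to its key/value fields."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    modules = {}
    for section in parser.sections():
        # Section headers have the form: submodule "polos"
        name = section.split('"')[1]
        modules[name] = dict(parser[section])
    return modules


submodules = parse_gitmodules(GITMODULES)
for name, fields in submodules.items():
    print(name, fields["path"], fields["ignore"])
```

A real `.gitmodules` file may contain characters such as `%` that `configparser` treats as interpolation syntax; passing `interpolation=None` to `ConfigParser` would avoid that, at the cost of one more argument.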
@@ -0,0 +1,32 @@
+ BSD 3-Clause Clear License
+
+ Copyright (c) 2026 Yuiga Wada
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted (subject to the limitations in the disclaimer
+ below) provided that the following conditions are met:
+
+ 1. Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+
+ 3. Neither the name of the copyright holder nor the names of its contributors
+ may be used to endorse or promote products derived from this software
+ without specific prior written permission.
+
+ NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY
+ THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
+ CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT
+ NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
+ CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
+ OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+ OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -0,0 +1,482 @@
+ Metadata-Version: 2.4
+ Name: captionevalkit-for-vlms
+ Version: 0.1.0
+ Summary: A reproducible caption-evaluation toolkit for VLMs with per-metric uv environments.
+ Project-URL: Homepage, https://github.com/YuigaWada/CaptionEvalKit-for-VLMs
+ Project-URL: Repository, https://github.com/YuigaWada/CaptionEvalKit-for-VLMs
+ Project-URL: Issues, https://github.com/YuigaWada/CaptionEvalKit-for-VLMs/issues
+ Author: Yuiga Wada
+ Maintainer: Yuiga Wada
+ License-Expression: BSD-3-Clause-Clear
+ License-File: LICENSE
+ Keywords: caption-evaluation,metrics,reproducibility,vision-language-models,vlm
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Environment :: Console
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Requires-Python: >=3.10
+ Requires-Dist: datasets<4,>=2.19
+ Requires-Dist: pillow>=10
+ Requires-Dist: rich>=13
+ Requires-Dist: tomli>=2; python_version < '3.11'
+ Description-Content-Type: text/markdown
+
+ # CaptionEvalKit-for-VLMs
+
+ <img width="1272" height="262" alt="logo" src="https://github.com/user-attachments/assets/504893fc-3bb2-40fd-9c84-835a0d04d055" />
+
+ Reproducible, all-in-one image captioning evaluation for VLMs.
+
+ * **For metric developers:** Evaluate metrics and reproduce reported results with <u>a single command</u>.
+ * **For VLM developers:** Score VLM-generated captions using a comprehensive set of established captioning metrics.
+
+ CaptionEvalKit currently supports:
+ * **LLM-free metrics:** Polos, CLIPScore, PAC-S, RefCLIPScore, RefPAC-S, and more
+ * **LLM-as-a-Judge metrics:** FLEUR, RefFLEUR, and VELA
+ * **Classic captioning metrics:** BLEU, ROUGE-L, METEOR, CIDEr, and SPICE
+ * **Benchmarks:** Composite, Flickr8k-Ex, Flickr8k-CF, Polaris, Nebula, and LongCap-Arena
+
+ <img width="850" height="178" alt="Screenshot 2026-06-13 at 2 23 30" src="https://github.com/user-attachments/assets/eea86fbb-d9ae-4fce-98fd-29f2510dd2bb" />
+
+
+
+ ## Table of Contents
+
+ * [Install](#install)
+ * [For VLM Developers](#for-vlm-developers)
+ * [For Metric Developers](#for-metric-developers)
+ * [Reproduce Reported Results](#reproduce-reported-results)
+ * [Reproduction Status](#reproduction-status)
+ * [Supported Metrics](#supported-metrics)
+ * [Supported Benchmarks](#supported-benchmarks)
+ * [Data and Assets](#data-and-assets)
+ * [TODO](#todo)
+ * [Development](#development)
+ * [Citation](#citation)
+
+
+ ## Install
+
+ Requirements: Python 3.10+, `git`, and `uv`. Java is also required for METEOR/SPICE through `pycocoevalcap`.
+
+ From PyPI or a built wheel:
+
+ ```bash
+ pip install captionevalkit-for-vlms
+ capevalkit doctor
+ capevalkit list-metrics
+ ```
+
+ <!-- Installed wheels keep the package small and materialize locked upstream repositories on demand. To prefetch one metric family:
+
+ ```bash
+ capevalkit sync --metrics cider
+ ```
+
+ `score`, `benchmark`, and `all_reproduce` also sync required upstreams automatically. -->
+
+ From a source checkout:
+
+ ```bash
+ git clone --recursive https://github.com/YuigaWada/CaptionEvalKit-for-VLMs.git
+ cd CaptionEvalKit-for-VLMs
+ uv tool install --editable "$PWD" --force
+ capevalkit list-metrics
+ ```
+
+ <details>
+ <summary>Runtime Cache</summary>
+
+ Wheel installs use `CAPEVALKIT_HOME` as a runtime cache root. The default is `~/.cache/capevalkit`.
+
+ ```text
+ ~/.cache/capevalkit/
+ runtime/<lock-digest>/
+ metrics/
+ metrics/upstreams/
+ benchmarks/expected/
+ overlays/
+ uv/
+ huggingface/
+ ```
+
+ Set a different location when needed:
+
+ ```bash
+ CAPEVALKIT_HOME=/scratch/capevalkit capevalkit doctor
+ ```
+
+ Source checkouts use the repository tree directly and keep submodules in `metrics/upstreams/`.
+
+ </details>
+
+ ## For Metric Developers
+
+ Benchmark existing metrics, or evaluate your own metric without adopting a fixed metric signature.
+
+ When changing upstream submodule revisions for a release, regenerate the runtime lock:
+
+ ```bash
+ python scripts/generate_upstream_lock.py
+ ```
+
+ <details>
+ <summary>CLI</summary>
+
+ Run one metric on one benchmark:
+
+ ```bash
+ capevalkit benchmark \
+   --metric clipscore \
+   --benchmark composite \
+   --limit 8 \
+   --output outputs/clipscore/composite.json
+ ```
+
+ Run the same metric across benchmarks:
+
+ ```bash
+ capevalkit suite \
+   --metrics clipscore \
+   --benchmarks composite,flickr8k-ex,flickr8k-cf,nebula,polaris \
+   --limit 8 \
+   --output-dir outputs/clipscore
+ ```
+
+ To wire a metric through its own CLI runner, add `metrics/mymetric/metric.toml`:
+
+ ```toml
+ [metric]
+ name = "mymetric"
+ python = ">=3.10,<3.12"
+ module = "capevalkit.metrics.mymetric"
+
+ [repository]
+ dir = "metrics/upstreams/mymetric"
+ uv_project = "metrics/upstreams/mymetric"
+
+ [runner]
+ command = ["python", "score.py"]
+ ```
+
+ Add a minimal `metrics/upstreams/mymetric/pyproject.toml`:
+
+ ```toml
+ [project]
+ name = "mymetric"
+ version = "0.1.0"
+ requires-python = ">=3.10,<3.12"
+ dependencies = []
+ ```
+
+ Make `metrics/upstreams/mymetric/score.py` accept:
+
+ ```text
+ --predictions PREDICTIONS.jsonl
+ --references REFERENCES.jsonl
+ --output OUTPUT.json
+ ```
+
+ Then benchmark it:
+
+ ```bash
+ capevalkit benchmark \
+   --metric mymetric \
+   --benchmark composite \
+   --output outputs/mymetric/composite.json
+ ```
+
+ </details>
+
+ <!-- <details> -->
+ <!-- <summary>Python</summary> -->
+
+ ```python
+ import capevalkit as capeval
+
+ class MyMetric:
+     def __call__(self, samples):
+         return {
+             sample.id: float(bool(sample.prediction and sample.references))
+             for sample in samples
+         }
+
+ result = capeval.evaluate_metric(
+     benchmark="flickr8k-cf",
+     metric=MyMetric(),
+     metric_name="MyMetric",
+     limit=8,
+     output="outputs/mymetric/flickr8k-cf.json",
+ )
+ ```
+
+ The callable receives `CaptionSample` objects and returns `{sample_id: score}`. Your metric can keep any internal signature.
+
+ <!-- </details> -->
+
+ ## For VLM Developers
+
+ Evaluate saved captions from files, or run your caption model on your own images.
+
+ <details>
+ <summary>CLI</summary>
+
+ `predictions.jsonl`:
+
+ ```jsonl
+ {"id": "0001", "caption": "A dog runs through grass.", "image": "0001.jpg"}
+ {"id": "0002", "caption": "A person rides a bicycle.", "image": "0002.jpg"}
+ ```
+
+ `references.jsonl`:
+
+ ```jsonl
+ {"id": "0001", "references": ["A dog runs outside.", "A dog is in a grassy field."]}
+ {"id": "0002", "references": ["A cyclist rides on a road.", "A person rides a bike."]}
+ ```
+
+ ```bash
+ capevalkit score \
+   --metric clipscore \
+   --predictions predictions.jsonl \
+   --references references.jsonl \
+   --image-dir images \
+   --output outputs/clipscore.json
+ ```
+
+ ```json
+ {
+   "CLIPScore": 0.73,
+   "RefCLIPScore": 0.81,
+   "per_item": {
+     "0001": {"CLIPScore": 0.70, "RefCLIPScore": 0.78}
+   }
+ }
+ ```
+
+ </details>
+
+ <!-- <details> -->
+ <!-- <summary>Python</summary> -->
+
+ Run these examples with `uv run python` from the repository, or install `capevalkit` into your own Python environment.
+
+ ```python
+ import capevalkit as capeval
+
+ def predict(batch):
+     return ["A dog runs through grass." for _ in batch.images]
+
+ results = capeval.evaluate_caption_model(
+     images=["images/0001.jpg", "images/0002.jpg"],
+     metrics=["cider", "clipscore"],
+     predict=predict,
+     references=[
+         ["A dog runs outside.", "A dog is in a grassy field."],
+         ["A cyclist rides on a road.", "A person rides a bike."],
+     ],
+     batch_size=8,
+     output_dir="outputs/my-model",
+ )
+ ```
+
+ If captions are already generated, pass image-caption pairs directly:
+
+ ```python
+ import capevalkit as capeval
+
+ results = capeval.evaluate_captions(
+     pairs=[
+         {
+             "id": "0001",
+             "image": "images/0001.jpg",
+             "caption": "A dog runs through grass.",
+             "references": ["A dog runs outside.", "A dog is in a grassy field."],
+         },
+         {
+             "id": "0002",
+             "image": "images/0002.jpg",
+             "caption": "A person rides a bicycle.",
+             "references": ["A cyclist rides on a road.", "A person rides a bike."],
+         },
+     ],
+     metrics=["cider", "clipscore"],
+     output_dir="outputs/my-captions",
+ )
+ ```
+
+ For manual caption-model control:
+
+ ```python
+ import capevalkit as capeval
+
+ def predict(batch):
+     return ["A dog runs through grass." for _ in batch.images]
+
+ with capeval.CaptionEvalRun(
+     images=["images/0001.jpg", "images/0002.jpg"],
+     metrics=["cider", "clipscore"],
+     references=[
+         ["A dog runs outside.", "A dog is in a grassy field."],
+         ["A cyclist rides on a road.", "A person rides a bike."],
+     ],
+     output_dir="outputs/my-model",
+ ) as run:
+     for batch in run.iter_batches(batch_size=8):
+         run.record(batch.ids, predict(batch))
+
+     results = run.evaluate()
+ ```
+
+ <!-- </details> -->
+
+
+ ## Reproduce Reported Results
+
+ Preview the default reproducibility suite:
+
+ ```bash
+ capevalkit all_reproduce --dry-run
+ ```
+
+ Run one verified pair:
+
+ ```bash
+ capevalkit all_reproduce \
+   --metrics clipscore \
+   --benchmarks composite
+ ```
+
+ Run a launch smoke test for every default pair:
+
+ ```bash
+ capevalkit all_reproduce --smoke --jobs 4 --gpu-jobs 1
+ ```
+
+ `--smoke` runs one sample per pair and checks launch/output writing only. Omit it for full correlations.
+
+ ## Reproduction Status
+
+ Legend: `✅` reproduced, `⚠️` not reproduced, `-` no default target. For LongCap-Arena, unreproduced targets are also shown as `-`.
+
+ | Metric | Composite | Flickr8k-EX | Flickr8k-CF | Nebula | Polaris | LCA TestA | LCA TestB |
+ | --- | --- | --- | --- | --- | --- | --- | --- |
+ | `bleu` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+ | `cider` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+ | `clipscore` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+ | `fleur` | ⚠️ | ⚠️ | ✅ | - | - | - | - |
+ | `meteor` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+ | `pacscore` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+ | `polos` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+ | `refclipscore` | ✅ | ✅ | ✅ | ⚠️ | ⚠️ | - | - |
+ | `reffleur` | ✅ | ✅ | ✅ | - | - | - | - |
+ | `refpacscore` | ✅ | ✅ | ✅ | ⚠️ | ⚠️ | - | - |
+ | `rouge` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+ | `spice` | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
+ | `vela` | - | - | - | - | - | ✅ | ✅ |
+
+ ## Supported Metrics
+
+ | Metric | Upstream | Notes |
+ | --- | --- | --- |
+ | `bleu` | `pycocoevalcap` | BLEU-1 to BLEU-4 |
+ | `rouge` | `pycocoevalcap` | ROUGE-L |
+ | `meteor` | `pycocoevalcap` | Java METEOR through upstream |
+ | `cider` | `pycocoevalcap` | CIDEr |
+ | `spice` | `pycocoevalcap` | SPICE |
+ | `clipscore` | CLIPScore | image-caption CLIPScore |
+ | `refclipscore` | CLIPScore | reference-aware CLIPScore |
+ | `pacscore` | PACScore | PAC-S |
+ | `refpacscore` | PACScore | reference-aware PAC-S |
+ | `polos` | Polos | model-based reference-aware metric |
+ | `fleur` | FLEUR | LLaVA-based reference-free metric |
+ | `reffleur` | FLEUR | reference-aware FLEUR |
+ | `vela` | VELA | long-caption metric for `desc`, `rel`, `flu` |
+
+ ## Supported Benchmarks
+
+ | Benchmark | Source |
+ | --- | --- |
+ | `composite` | Hugging Face `yuwd/Composite` |
+ | `flickr8k-ex` | Hugging Face `yuwd/Flickr8k-HumanEval`, expert split |
+ | `flickr8k-cf` | Hugging Face `yuwd/Flickr8k-HumanEval`, CrowdFlower split |
+ | `nebula` | Hugging Face `Ka2ukiMatsuda/Nebula` |
+ | `polaris` | Hugging Face `yuwd/Polaris` |
+ | `longcaparena-testa-{desc,rel,flu}` | Hugging Face `Ka2ukiMatsuda/LongCap-Arena` |
+ | `longcaparena-testb-{desc,rel,flu}` | Hugging Face `Ka2ukiMatsuda/LongCap-Arena` |
+
+ ## Data and Assets
+
+ Benchmark datasets are cached on first use under `<runtime-root>/.hf-cache/benchmarks/`. In a source checkout, `<runtime-root>` is the repository root; in a wheel install, it is `$CAPEVALKIT_HOME/runtime/<lock-digest>`.
+
+ | Dataset | Loaded from |
+ | --- | --- |
+ | Composite | Hugging Face `yuwd/Composite` |
+ | Flickr8k-EX / Flickr8k-CF | Hugging Face `yuwd/Flickr8k-HumanEval` |
+ | Nebula | Hugging Face `Ka2ukiMatsuda/Nebula` |
+ | Polaris | Hugging Face `yuwd/Polaris` |
+ | Spica corrections | Hugging Face `hiranohachiman/Spica` |
+ | LongCap-Arena | Hugging Face `Ka2ukiMatsuda/LongCap-Arena` |
+
+ Model files and checkpoints are downloaded on first use by the corresponding metric runner or upstream library.
+
+ | Metric family | Model or checkpoint source |
+ | --- | --- |
+ | CLIPScore | OpenAI CLIP loader cache |
+ | PACScore | PACScore checkpoint URL, fetched on first PACScore run |
+ | Polos | upstream Polos model cache, fetched on first Polos run |
+ | FLEUR | Hugging Face `liuhaotian/llava-v1.5-13b` |
+ | VELA | Hugging Face `Qwen/Qwen2.5-3B-Instruct`, `BeichenZhang/LongCLIP-L`, `Ka2ukiMatsuda/vela` |
+
+ Set `IC_EVAL_REFRESH_HF_CACHE=1` to refresh cached benchmark rows and extracted images.
+
+ <details>
+ <summary>Local data layout</summary>
+
+ If you pass a non-repository data root, use this layout:
+
+ ```text
+ data/
+   composite/
+     en_test_composite_da2.csv
+     images/
+   flickr8k/
+     flickr8k.json
+     crowdflower_flickr8k.json
+     images/
+   nebula/
+     images/
+   polaris/
+     images/
+ ```
+
+ </details>
+
+ ## TODO
+
+ - [ ] Implement EXPERT benchmark support.
+ - [ ] Improve the first-download UI/UX for `all_reproduce`.
+
+ ## Development
+
+ ```bash
+ uv run python -m unittest discover -s tests
+ ```
+
+ Repository map:
+
+ ```text
+ capevalkit/                   CLI, API, benchmark loaders, verification
+ metrics/*/metric.toml         metric manifests
+ metrics/upstreams/*           upstream metric repositories
+ overlays/metrics/upstreams/*  uv overlays for upstream repositories
+ benchmarks/expected/          default all_reproduce expected values
+ ```
+
+ ## Citation
+
+ If you use this toolkit, cite the original metric and benchmark papers for the implementations and reported values you rely on.
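The metric-developer section of the README in this hunk asks that a custom `score.py` accept `--predictions`, `--references`, and `--output`, but does not show the script itself. The following is a minimal sketch of such a runner, not the toolkit's own code. It rests on several assumptions: the input rows use the `id`, `caption`, and `references` fields from the README's JSONL examples, the output mirrors the aggregate-plus-`per_item` shape of the README's `capevalkit score` example, and the exact-match scoring and the names `MyMetric`, `exact_match`, and `score_files` are purely illustrative.

```python
import argparse
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def exact_match(caption, references):
    """Toy metric (assumption): 1.0 if the caption equals any reference, ignoring case."""
    normalized = caption.strip().lower()
    return float(any(normalized == ref.strip().lower() for ref in references))


def score_files(predictions_path, references_path):
    """Score every prediction row; the output schema is an assumption, not a documented contract."""
    references = {row["id"]: row["references"] for row in read_jsonl(references_path)}
    per_item = {}
    for row in read_jsonl(predictions_path):
        refs = references.get(row["id"], [])
        per_item[row["id"]] = {"MyMetric": exact_match(row["caption"], refs)}
    scores = [item["MyMetric"] for item in per_item.values()]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"MyMetric": mean, "per_item": per_item}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical score.py runner")
    parser.add_argument("--predictions", required=True)
    parser.add_argument("--references", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    result = score_files(args.predictions, args.references)
    Path(args.output).write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result


# Demo with temporary files so the sketch runs without any dataset.
with tempfile.TemporaryDirectory() as tmp:
    pred = Path(tmp, "predictions.jsonl")
    refs = Path(tmp, "references.jsonl")
    out = Path(tmp, "output.json")
    pred.write_text(
        '{"id": "0001", "caption": "A dog runs outside."}\n'
        '{"id": "0002", "caption": "A person rides a bicycle."}\n',
        encoding="utf-8",
    )
    refs.write_text(
        '{"id": "0001", "references": ["A dog runs outside.", "A dog is in a grassy field."]}\n'
        '{"id": "0002", "references": ["A cyclist rides on a road.", "A person rides a bike."]}\n',
        encoding="utf-8",
    )
    result = main(["--predictions", str(pred), "--references", str(refs), "--output", str(out)])
    written = json.loads(out.read_text(encoding="utf-8"))
    print(result["MyMetric"])
```

In an actual `score.py`, the demo block would give way to the usual `if __name__ == "__main__": main()` guard so that the `[runner]` command from `metric.toml` can invoke it. The output schema the toolkit really expects should be checked against the package source before relying on this shape.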