cheesebench-0.1.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- cheesebench-0.1.0/LICENSE +21 -0
- cheesebench-0.1.0/MANIFEST.in +4 -0
- cheesebench-0.1.0/PKG-INFO +179 -0
- cheesebench-0.1.0/README.md +145 -0
- cheesebench-0.1.0/analysis.py +552 -0
- cheesebench-0.1.0/benchmark.py +1021 -0
- cheesebench-0.1.0/cheesebench.egg-info/PKG-INFO +179 -0
- cheesebench-0.1.0/cheesebench.egg-info/SOURCES.txt +34 -0
- cheesebench-0.1.0/cheesebench.egg-info/dependency_links.txt +1 -0
- cheesebench-0.1.0/cheesebench.egg-info/entry_points.txt +2 -0
- cheesebench-0.1.0/cheesebench.egg-info/requires.txt +11 -0
- cheesebench-0.1.0/cheesebench.egg-info/top_level.txt +11 -0
- cheesebench-0.1.0/config.py +51 -0
- cheesebench-0.1.0/cscgql_agent.py +189 -0
- cheesebench-0.1.0/environments/__init__.py +77 -0
- cheesebench-0.1.0/environments/barnes_maze.py +564 -0
- cheesebench-0.1.0/environments/base_env.py +1725 -0
- cheesebench-0.1.0/environments/dnms_task.py +439 -0
- cheesebench-0.1.0/environments/morris_water_maze.py +573 -0
- cheesebench-0.1.0/environments/operant_chamber.py +606 -0
- cheesebench-0.1.0/environments/place_preference.py +471 -0
- cheesebench-0.1.0/environments/radial_arm_maze.py +529 -0
- cheesebench-0.1.0/environments/registry.py +378 -0
- cheesebench-0.1.0/environments/shuttle_box.py +520 -0
- cheesebench-0.1.0/environments/star_maze.py +454 -0
- cheesebench-0.1.0/environments/t_maze.py +455 -0
- cheesebench-0.1.0/error_analysis.py +324 -0
- cheesebench-0.1.0/heuristic_agent.py +109 -0
- cheesebench-0.1.0/model_server.py +346 -0
- cheesebench-0.1.0/play.py +469 -0
- cheesebench-0.1.0/pyproject.toml +72 -0
- cheesebench-0.1.0/requirements.txt +6 -0
- cheesebench-0.1.0/setup.cfg +4 -0
- cheesebench-0.1.0/stat_tests.py +488 -0
- cheesebench-0.1.0/task_definitions.json +341 -0
- cheesebench-0.1.0/visualize.py +418 -0
cheesebench-0.1.0/LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 CheeseBench Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
cheesebench-0.1.0/PKG-INFO
@@ -0,0 +1,179 @@
Metadata-Version: 2.4
Name: cheesebench
Version: 0.1.0
Summary: CheeseBench: A VLM benchmark over 9 rodent behavioral neuroscience paradigms
Author: CheeseBench Contributors
License: MIT
Project-URL: Homepage, https://github.com/stef41/CheeseBench
Project-URL: Repository, https://github.com/stef41/CheeseBench
Project-URL: Issues, https://github.com/stef41/CheeseBench/issues
Keywords: benchmark,vision-language-model,vlm,llm,evaluation,neuroscience,behavioral-neuroscience,embodied-ai,cognition
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: imageio>=2.30
Requires-Dist: numpy>=1.24
Requires-Dist: opencv-python>=4.8
Requires-Dist: pillow>=10.0
Requires-Dist: requests>=2.31
Requires-Dist: matplotlib>=3.7
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file
# CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?

A benchmark for evaluating Vision-Language Models (VLMs) on 9 classical behavioral neuroscience paradigms, each grounded in published rodent protocols with quantitative animal baselines.

## Key Design Principles

1. **Unified Protocol**: Identical system prompt for ALL tasks — no task-specific hints
2. **Published Baselines**: Every environment maps to a real rodent experiment with peer-reviewed success rates
3. **Cognitive Taxonomy**: 6 cognitive dimensions (spatial learning, navigation, working memory, instrumental conditioning, avoidance learning, associative learning) mapped to neural circuits
4. **Multi-Action**: VLM outputs up to 8 actions per call with explicit learnings/working memory

## Quick Start

```bash
# Install
pip install -r requirements.txt

# Run benchmark (requires an LLM API endpoint)
python benchmark.py --model gpt-oss:120b --num-trials 20

# Quick test (2 trials)
python benchmark.py --num-trials 2

# Custom API endpoint
python benchmark.py --api-url http://localhost:11434/api/chat

# Analyze results
python analysis.py results/benchmark_results.json
```

## Project Structure

```
cheesebench/
├── benchmark.py            # Main benchmark runner (CLI)
├── config.py               # Centralized configuration
├── analysis.py             # Cognitive profiling & analysis pipeline
├── task_definitions.json   # Task specs with paper citations & animal baselines
├── visualize.py            # Publication-quality figures
├── environments/           # 9 behavioral paradigms
│   ├── base_env.py         # Shared engine (rendering, sessions, actions)
│   ├── morris_water_maze.py
│   ├── t_maze.py
│   ├── barnes_maze.py
│   ├── radial_arm_maze.py
│   ├── operant_chamber.py
│   ├── shuttle_box.py
│   ├── place_preference.py
│   ├── star_maze.py
│   └── dnms_task.py
└── README.md
```

## Environments & Cognitive Taxonomy

| Environment | Cognitive Dimension | Animal Baseline | Citation |
|---|---|---|---|
| Morris Water Maze | Allocentric Spatial Learning | 85% (session 5) | [PMC2895266](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895266/) — Vorhees & Williams 2006 |
| Barnes Maze | Allocentric Spatial Learning | 80% (session 5) | [PMC6126525](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6126525/) — Vale et al. 2018 |
| T-Maze | Egocentric Nav + Working Memory | 80% (session 4) | [PMC3399492](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3399492/) — Shoji et al. 2012 |
| Star Maze | Allocentric + Egocentric | 80% (session 10) | [PMC4112136](https://academic.oup.com/ilarjournal/article/55/2/310/643871) — Rondi-Reig et al. 2006 |
| Radial Arm Maze | Working Memory | 70% (session 6) | [PMC4030456](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4030456/) — Penley et al. 2013 |
| Operant Chamber | Instrumental Conditioning | 90% (session 5) | [PMC4598097](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4598097/) — Martin & Iceberg 2015 |
| Shuttle Box | Avoidance Learning | 70% (session 10) | [PMC4692667](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4692667/) — Happel et al. 2015 |
| Place Preference | Associative Learning | 75% (session 6) | [PMC6101638](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6101638/) — Blanco-Gandía et al. 2018 |
| DNMS Task | Working Memory | 80% (session 3) | [PMC3982138](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3982138/) — Oomen et al. 2013 |

## View Modes

| Mode | Description | Information Content |
|---|---|---|
| `ASCII_2D` | Top-down bird's-eye map | Full spatial layout |
| `ASCII_2D_FPV` | Rotated first-person 2D | Egocentric partial view |
| `ASCII_3D` | Pseudo-3D ASCII perspective | Depth cues, limited FOV |

## System Prompt (Unified — Identical for ALL Tasks)

The VLM receives **no task-specific instructions**. It must discover the goal from observation and reward feedback alone:

```
You are an embodied agent placed in a behavioral experiment.
Your only goal is to maximize cumulative reward.

PERCEPTION:
- ASCII rendering (top-down, FPV, or pseudo-3D)
- Position/orientation shown by arrow: ↑ ↗ → ↘ ↓ ↙ ← ↖
- Walls (#, █) block movement. Open spaces are traversable.

ACTIONS (egocentric):
- FORWARD, ROTATE_LEFT, ROTATE_RIGHT, STAY

RESPONSE FORMAT:
LEARNINGS: <working memory — position, strategy, hypotheses>
ACTIONS: <1-8 comma-separated actions>
```

## Analysis Pipeline

The analysis module computes:
- **Cognitive profiles** — radar chart scores across 6 dimensions
- **Learning curves** — rolling-window and block-based success rates
- **Strategy metrics** — action entropy, forward ratio, rotation ratio, repetition rate
- **Wilson score CIs** — 95% confidence intervals on all success rates
- **Animal comparison** — VLM profiles overlaid with rodent baselines

```bash
python analysis.py results/benchmark_results.json
# Outputs: results/benchmark_results_analysis.json
```

## Configuration

All parameters are in `config.py` and overridable via CLI or environment variables:

```bash
export CHEESEBENCH_MODEL=gpt-oss:120b
export CHEESEBENCH_API_URL=http://localhost:11434/api/chat
export CHEESEBENCH_TIMEOUT=120
```

| Parameter | Default | Description |
|---|---|---|
| `--model` | `gpt-oss:120b` | VLM model name |
| `--num-trials` | 20 | Trials per environment |
| `--max-steps` | 200 | Max steps per trial |
| `--seed` | 42 | Random seed |
| `--output-dir` | `results/` | Output directory |
| `--quiet` | false | Suppress verbose output |

## Citation

If you use CheeseBench in your research, please cite:

```bibtex
@inproceedings{cheesebench2025,
  title={CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?},
  author={},
  booktitle={NeurIPS Datasets and Benchmarks Track},
  year={2025}
}
```

## License

MIT
cheesebench-0.1.0/README.md
@@ -0,0 +1,145 @@
# CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?

A benchmark for evaluating Vision-Language Models (VLMs) on 9 classical behavioral neuroscience paradigms, each grounded in published rodent protocols with quantitative animal baselines.

## Key Design Principles

1. **Unified Protocol**: Identical system prompt for ALL tasks — no task-specific hints
2. **Published Baselines**: Every environment maps to a real rodent experiment with peer-reviewed success rates
3. **Cognitive Taxonomy**: 6 cognitive dimensions (spatial learning, navigation, working memory, instrumental conditioning, avoidance learning, associative learning) mapped to neural circuits
4. **Multi-Action**: VLM outputs up to 8 actions per call with explicit learnings/working memory

## Quick Start

```bash
# Install
pip install -r requirements.txt

# Run benchmark (requires an LLM API endpoint)
python benchmark.py --model gpt-oss:120b --num-trials 20

# Quick test (2 trials)
python benchmark.py --num-trials 2

# Custom API endpoint
python benchmark.py --api-url http://localhost:11434/api/chat

# Analyze results
python analysis.py results/benchmark_results.json
```

## Project Structure

```
cheesebench/
├── benchmark.py            # Main benchmark runner (CLI)
├── config.py               # Centralized configuration
├── analysis.py             # Cognitive profiling & analysis pipeline
├── task_definitions.json   # Task specs with paper citations & animal baselines
├── visualize.py            # Publication-quality figures
├── environments/           # 9 behavioral paradigms
│   ├── base_env.py         # Shared engine (rendering, sessions, actions)
│   ├── morris_water_maze.py
│   ├── t_maze.py
│   ├── barnes_maze.py
│   ├── radial_arm_maze.py
│   ├── operant_chamber.py
│   ├── shuttle_box.py
│   ├── place_preference.py
│   ├── star_maze.py
│   └── dnms_task.py
└── README.md
```

## Environments & Cognitive Taxonomy

| Environment | Cognitive Dimension | Animal Baseline | Citation |
|---|---|---|---|
| Morris Water Maze | Allocentric Spatial Learning | 85% (session 5) | [PMC2895266](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895266/) — Vorhees & Williams 2006 |
| Barnes Maze | Allocentric Spatial Learning | 80% (session 5) | [PMC6126525](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6126525/) — Vale et al. 2018 |
| T-Maze | Egocentric Nav + Working Memory | 80% (session 4) | [PMC3399492](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3399492/) — Shoji et al. 2012 |
| Star Maze | Allocentric + Egocentric | 80% (session 10) | [PMC4112136](https://academic.oup.com/ilarjournal/article/55/2/310/643871) — Rondi-Reig et al. 2006 |
| Radial Arm Maze | Working Memory | 70% (session 6) | [PMC4030456](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4030456/) — Penley et al. 2013 |
| Operant Chamber | Instrumental Conditioning | 90% (session 5) | [PMC4598097](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4598097/) — Martin & Iceberg 2015 |
| Shuttle Box | Avoidance Learning | 70% (session 10) | [PMC4692667](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4692667/) — Happel et al. 2015 |
| Place Preference | Associative Learning | 75% (session 6) | [PMC6101638](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6101638/) — Blanco-Gandía et al. 2018 |
| DNMS Task | Working Memory | 80% (session 3) | [PMC3982138](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3982138/) — Oomen et al. 2013 |

## View Modes

| Mode | Description | Information Content |
|---|---|---|
| `ASCII_2D` | Top-down bird's-eye map | Full spatial layout |
| `ASCII_2D_FPV` | Rotated first-person 2D | Egocentric partial view |
| `ASCII_3D` | Pseudo-3D ASCII perspective | Depth cues, limited FOV |

## System Prompt (Unified — Identical for ALL Tasks)

The VLM receives **no task-specific instructions**. It must discover the goal from observation and reward feedback alone:

```
You are an embodied agent placed in a behavioral experiment.
Your only goal is to maximize cumulative reward.

PERCEPTION:
- ASCII rendering (top-down, FPV, or pseudo-3D)
- Position/orientation shown by arrow: ↑ ↗ → ↘ ↓ ↙ ← ↖
- Walls (#, █) block movement. Open spaces are traversable.

ACTIONS (egocentric):
- FORWARD, ROTATE_LEFT, ROTATE_RIGHT, STAY

RESPONSE FORMAT:
LEARNINGS: <working memory — position, strategy, hypotheses>
ACTIONS: <1-8 comma-separated actions>
```
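
The RESPONSE FORMAT above implies one concrete parsing step on the harness side. Below is a minimal, hypothetical sketch of validating a reply against the action vocabulary and the 8-action cap; it is not necessarily the parser `benchmark.py` actually ships, and only the action names and the cap are taken from the prompt.

```python
import re

# Action vocabulary and per-call cap as stated in the unified prompt above.
VALID_ACTIONS = {"FORWARD", "ROTATE_LEFT", "ROTATE_RIGHT", "STAY"}
MAX_ACTIONS_PER_CALL = 8

def parse_response(text: str) -> tuple[str, list[str]]:
    """Extract the LEARNINGS note and a validated action list from one model reply."""
    learnings = re.search(r"LEARNINGS:[ \t]*(.*)", text)
    actions = re.search(r"ACTIONS:[ \t]*(.*)", text)
    raw = actions.group(1) if actions else ""
    parsed = [a.strip().upper() for a in raw.split(",") if a.strip()]
    # Drop unknown tokens and clip to the 8-action limit.
    parsed = [a for a in parsed if a in VALID_ACTIONS][:MAX_ACTIONS_PER_CALL]
    return (learnings.group(1).strip() if learnings else "", parsed)

reply = "LEARNINGS: wall ahead, reward likely east\nACTIONS: ROTATE_RIGHT, FORWARD, FORWARD"
print(parse_response(reply))
# ('wall ahead, reward likely east', ['ROTATE_RIGHT', 'FORWARD', 'FORWARD'])
```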

## Analysis Pipeline

The analysis module computes:
- **Cognitive profiles** — radar chart scores across 6 dimensions
- **Learning curves** — rolling-window and block-based success rates
- **Strategy metrics** — action entropy, forward ratio, rotation ratio, repetition rate
- **Wilson score CIs** — 95% confidence intervals on all success rates (sketched below)
- **Animal comparison** — VLM profiles overlaid with rodent baselines

```bash
python analysis.py results/benchmark_results.json
# Outputs: results/benchmark_results_analysis.json
```
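
The Wilson score interval listed above has a simple closed form, shown here for a single environment's success count. The function name and signature are illustrative rather than `analysis.py`'s actual API.

```python
from math import sqrt

def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial success rate (z = 1.96)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# e.g. 14 successful trials out of 20 in one environment
print(wilson_ci(14, 20))  # approximately (0.48, 0.86)
```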

## Configuration

All parameters are in `config.py` and overridable via CLI or environment variables:

```bash
export CHEESEBENCH_MODEL=gpt-oss:120b
export CHEESEBENCH_API_URL=http://localhost:11434/api/chat
export CHEESEBENCH_TIMEOUT=120
```

| Parameter | Default | Description |
|---|---|---|
| `--model` | `gpt-oss:120b` | VLM model name |
| `--num-trials` | 20 | Trials per environment |
| `--max-steps` | 200 | Max steps per trial |
| `--seed` | 42 | Random seed |
| `--output-dir` | `results/` | Output directory |
| `--quiet` | false | Suppress verbose output |
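
As an illustration of how these settings compose, the sketch below resolves each value from its `CHEESEBENCH_*` environment variable and falls back to the defaults in the table; CLI flags (not shown) would override both. The dataclass and field names are hypothetical, the endpoint and timeout fallbacks are taken from the examples above rather than documented defaults, and `config.py` may structure this differently.

```python
import os
from dataclasses import dataclass, field

def _env(name: str, default: str) -> str:
    """Return the environment variable if set, else the built-in default."""
    return os.environ.get(name, default)

@dataclass
class BenchConfig:
    # Environment variables override the built-in defaults below.
    model: str = field(default_factory=lambda: _env("CHEESEBENCH_MODEL", "gpt-oss:120b"))
    api_url: str = field(default_factory=lambda: _env(
        "CHEESEBENCH_API_URL", "http://localhost:11434/api/chat"))  # assumed fallback
    timeout: int = field(default_factory=lambda: int(_env("CHEESEBENCH_TIMEOUT", "120")))  # assumed fallback
    num_trials: int = 20          # --num-trials
    max_steps: int = 200          # --max-steps
    seed: int = 42                # --seed
    output_dir: str = "results/"  # --output-dir
    quiet: bool = False           # --quiet

cfg = BenchConfig()
print(cfg.model, cfg.api_url, cfg.num_trials)
```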

## Citation

If you use CheeseBench in your research, please cite:

```bibtex
@inproceedings{cheesebench2025,
  title={CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?},
  author={},
  booktitle={NeurIPS Datasets and Benchmarks Track},
  year={2025}
}
```

## License

MIT