cheesebench 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (36)
  1. cheesebench-0.1.0/LICENSE +21 -0
  2. cheesebench-0.1.0/MANIFEST.in +4 -0
  3. cheesebench-0.1.0/PKG-INFO +179 -0
  4. cheesebench-0.1.0/README.md +145 -0
  5. cheesebench-0.1.0/analysis.py +552 -0
  6. cheesebench-0.1.0/benchmark.py +1021 -0
  7. cheesebench-0.1.0/cheesebench.egg-info/PKG-INFO +179 -0
  8. cheesebench-0.1.0/cheesebench.egg-info/SOURCES.txt +34 -0
  9. cheesebench-0.1.0/cheesebench.egg-info/dependency_links.txt +1 -0
  10. cheesebench-0.1.0/cheesebench.egg-info/entry_points.txt +2 -0
  11. cheesebench-0.1.0/cheesebench.egg-info/requires.txt +11 -0
  12. cheesebench-0.1.0/cheesebench.egg-info/top_level.txt +11 -0
  13. cheesebench-0.1.0/config.py +51 -0
  14. cheesebench-0.1.0/cscgql_agent.py +189 -0
  15. cheesebench-0.1.0/environments/__init__.py +77 -0
  16. cheesebench-0.1.0/environments/barnes_maze.py +564 -0
  17. cheesebench-0.1.0/environments/base_env.py +1725 -0
  18. cheesebench-0.1.0/environments/dnms_task.py +439 -0
  19. cheesebench-0.1.0/environments/morris_water_maze.py +573 -0
  20. cheesebench-0.1.0/environments/operant_chamber.py +606 -0
  21. cheesebench-0.1.0/environments/place_preference.py +471 -0
  22. cheesebench-0.1.0/environments/radial_arm_maze.py +529 -0
  23. cheesebench-0.1.0/environments/registry.py +378 -0
  24. cheesebench-0.1.0/environments/shuttle_box.py +520 -0
  25. cheesebench-0.1.0/environments/star_maze.py +454 -0
  26. cheesebench-0.1.0/environments/t_maze.py +455 -0
  27. cheesebench-0.1.0/error_analysis.py +324 -0
  28. cheesebench-0.1.0/heuristic_agent.py +109 -0
  29. cheesebench-0.1.0/model_server.py +346 -0
  30. cheesebench-0.1.0/play.py +469 -0
  31. cheesebench-0.1.0/pyproject.toml +72 -0
  32. cheesebench-0.1.0/requirements.txt +6 -0
  33. cheesebench-0.1.0/setup.cfg +4 -0
  34. cheesebench-0.1.0/stat_tests.py +488 -0
  35. cheesebench-0.1.0/task_definitions.json +341 -0
  36. cheesebench-0.1.0/visualize.py +418 -0
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 CheeseBench Contributors
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,4 @@
+ include README.md
+ include LICENSE
+ include task_definitions.json
+ include requirements.txt
@@ -0,0 +1,179 @@
+ Metadata-Version: 2.4
+ Name: cheesebench
+ Version: 0.1.0
+ Summary: CheeseBench: A VLM benchmark over 9 rodent behavioral neuroscience paradigms
+ Author: CheeseBench Contributors
+ License: MIT
+ Project-URL: Homepage, https://github.com/stef41/CheeseBench
+ Project-URL: Repository, https://github.com/stef41/CheeseBench
+ Project-URL: Issues, https://github.com/stef41/CheeseBench/issues
+ Keywords: benchmark,vision-language-model,vlm,llm,evaluation,neuroscience,behavioral-neuroscience,embodied-ai,cognition
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: imageio>=2.30
+ Requires-Dist: numpy>=1.24
+ Requires-Dist: opencv-python>=4.8
+ Requires-Dist: pillow>=10.0
+ Requires-Dist: requests>=2.31
+ Requires-Dist: matplotlib>=3.7
+ Provides-Extra: dev
+ Requires-Dist: pytest>=7; extra == "dev"
+ Requires-Dist: build; extra == "dev"
+ Requires-Dist: twine; extra == "dev"
+ Dynamic: license-file
+
+ # CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?
+
+ A benchmark for evaluating Vision-Language Models (VLMs) on 9 classical behavioral neuroscience paradigms, each grounded in published rodent protocols with quantitative animal baselines.
+
+ ## Key Design Principles
+
+ 1. **Unified Protocol**: Identical system prompt for ALL tasks — no task-specific hints
+ 2. **Published Baselines**: Every environment maps to a real rodent experiment with peer-reviewed success rates
+ 3. **Cognitive Taxonomy**: 6 cognitive dimensions (spatial learning, navigation, working memory, instrumental conditioning, avoidance learning, associative learning) mapped to neural circuits
+ 4. **Multi-Action**: The VLM outputs up to 8 actions per call, along with an explicit learnings/working-memory summary
+
+ ## Quick Start
+
+ ```bash
+ # Install
+ pip install -r requirements.txt
+
+ # Run benchmark (requires an LLM API endpoint)
+ python benchmark.py --model gpt-oss:120b --num-trials 20
+
+ # Quick test (2 trials)
+ python benchmark.py --num-trials 2
+
+ # Custom API endpoint
+ python benchmark.py --api-url http://localhost:11434/api/chat
+
+ # Analyze results
+ python analysis.py results/benchmark_results.json
+ ```
+
+ ## Project Structure
+
+ ```
+ cheesebench/
+ ├── benchmark.py # Main benchmark runner (CLI)
+ ├── config.py # Centralized configuration
+ ├── analysis.py # Cognitive profiling & analysis pipeline
+ ├── task_definitions.json # Task specs with paper citations & animal baselines
+ ├── visualize.py # Publication-quality figures
+ ├── environments/ # 9 behavioral paradigms
+ │ ├── base_env.py # Shared engine (rendering, sessions, actions)
+ │ ├── morris_water_maze.py
+ │ ├── t_maze.py
+ │ ├── barnes_maze.py
+ │ ├── radial_arm_maze.py
+ │ ├── operant_chamber.py
+ │ ├── shuttle_box.py
+ │ ├── place_preference.py
+ │ ├── star_maze.py
+ │ └── dnms_task.py
+ └── README.md
+ ```
+
+ ## Environments & Cognitive Taxonomy
+
+ | Environment | Cognitive Dimension | Animal Baseline | Citation |
+ |---|---|---|---|
+ | Morris Water Maze | Allocentric Spatial Learning | 85% (session 5) | [PMC2895266](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895266/) — Vorhees & Williams 2006 |
+ | Barnes Maze | Allocentric Spatial Learning | 80% (session 5) | [PMC6126525](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6126525/) — Vale et al. 2018 |
+ | T-Maze | Egocentric Nav + Working Memory | 80% (session 4) | [PMC3399492](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3399492/) — Shoji et al. 2012 |
+ | Star Maze | Allocentric + Egocentric | 80% (session 10) | [PMC4112136](https://academic.oup.com/ilarjournal/article/55/2/310/643871) — Rondi-Reig et al. 2006 |
+ | Radial Arm Maze | Working Memory | 70% (session 6) | [PMC4030456](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4030456/) — Penley et al. 2013 |
+ | Operant Chamber | Instrumental Conditioning | 90% (session 5) | [PMC4598097](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4598097/) — Martin & Iceberg 2015 |
+ | Shuttle Box | Avoidance Learning | 70% (session 10) | [PMC4692667](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4692667/) — Happel et al. 2015 |
+ | Place Preference | Associative Learning | 75% (session 6) | [PMC6101638](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6101638/) — Blanco-Gandía et al. 2018 |
+ | DNMS Task | Working Memory | 80% (session 3) | [PMC3982138](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3982138/) — Oomen et al. 2013 |
+
+ ## View Modes
+
+ | Mode | Description | Information Content |
+ |---|---|---|
+ | `ASCII_2D` | Top-down bird's-eye map | Full spatial layout |
+ | `ASCII_2D_FPV` | Rotated first-person 2D | Egocentric partial view |
+ | `ASCII_3D` | Pseudo-3D ASCII perspective | Depth cues, limited FOV |
+
+ ## System Prompt (Unified — Identical for ALL Tasks)
+
+ The VLM receives **no task-specific instructions**. It must discover the goal from observation and reward feedback alone:
+
+ ```
+ You are an embodied agent placed in a behavioral experiment.
+ Your only goal is to maximize cumulative reward.
+
+ PERCEPTION:
+ - ASCII rendering (top-down, FPV, or pseudo-3D)
+ - Position/orientation shown by arrow: ↑ ↗ → ↘ ↓ ↙ ← ↖
+ - Walls (#, █) block movement. Open spaces are traversable.
+
+ ACTIONS (egocentric):
+ - FORWARD, ROTATE_LEFT, ROTATE_RIGHT, STAY
+
+ RESPONSE FORMAT:
+ LEARNINGS: <working memory — position, strategy, hypotheses>
+ ACTIONS: <1-8 comma-separated actions>
+ ```
+
+ ## Analysis Pipeline
+
+ The analysis module computes:
+ - **Cognitive profiles** — radar chart scores across 6 dimensions
+ - **Learning curves** — rolling-window and block-based success rates
+ - **Strategy metrics** — action entropy, forward ratio, rotation ratio, repetition rate
+ - **Wilson score CIs** — 95% confidence intervals on all success rates
+ - **Animal comparison** — VLM profiles overlaid with rodent baselines
+
+ ```bash
+ python analysis.py results/benchmark_results.json
+ # Outputs: results/benchmark_results_analysis.json
+ ```
+
+ ## Configuration
+
+ All parameters are defined in `config.py` and can be overridden via CLI flags or environment variables:
+
+ ```bash
+ export CHEESEBENCH_MODEL=gpt-oss:120b
+ export CHEESEBENCH_API_URL=http://localhost:11434/api/chat
+ export CHEESEBENCH_TIMEOUT=120
+ ```
+
+ | Parameter | Default | Description |
+ |---|---|---|
+ | `--model` | `gpt-oss:120b` | VLM model name |
+ | `--num-trials` | 20 | Trials per environment |
+ | `--max-steps` | 200 | Max steps per trial |
+ | `--seed` | 42 | Random seed |
+ | `--output-dir` | `results/` | Output directory |
+ | `--quiet` | false | Suppress verbose output |
+
+ ## Citation
+
+ If you use CheeseBench in your research, please cite:
+
+ ```bibtex
+ @inproceedings{cheesebench2025,
+   title={CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?},
+   author={},
+   booktitle={NeurIPS Datasets and Benchmarks Track},
+   year={2025}
+ }
+ ```
+
+ ## License
+
+ MIT
@@ -0,0 +1,145 @@
+ # CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?
+
+ A benchmark for evaluating Vision-Language Models (VLMs) on 9 classical behavioral neuroscience paradigms, each grounded in published rodent protocols with quantitative animal baselines.
+
+ ## Key Design Principles
+
+ 1. **Unified Protocol**: Identical system prompt for ALL tasks — no task-specific hints
+ 2. **Published Baselines**: Every environment maps to a real rodent experiment with peer-reviewed success rates
+ 3. **Cognitive Taxonomy**: 6 cognitive dimensions (spatial learning, navigation, working memory, instrumental conditioning, avoidance learning, associative learning) mapped to neural circuits
+ 4. **Multi-Action**: The VLM outputs up to 8 actions per call, along with an explicit learnings/working-memory summary
+
+ ## Quick Start
+
+ ```bash
+ # Install
+ pip install -r requirements.txt
+
+ # Run benchmark (requires an LLM API endpoint)
+ python benchmark.py --model gpt-oss:120b --num-trials 20
+
+ # Quick test (2 trials)
+ python benchmark.py --num-trials 2
+
+ # Custom API endpoint
+ python benchmark.py --api-url http://localhost:11434/api/chat
+
+ # Analyze results
+ python analysis.py results/benchmark_results.json
+ ```
+
+ ## Project Structure
+
+ ```
+ cheesebench/
+ ├── benchmark.py # Main benchmark runner (CLI)
+ ├── config.py # Centralized configuration
+ ├── analysis.py # Cognitive profiling & analysis pipeline
+ ├── task_definitions.json # Task specs with paper citations & animal baselines
+ ├── visualize.py # Publication-quality figures
+ ├── environments/ # 9 behavioral paradigms
+ │ ├── base_env.py # Shared engine (rendering, sessions, actions)
+ │ ├── morris_water_maze.py
+ │ ├── t_maze.py
+ │ ├── barnes_maze.py
+ │ ├── radial_arm_maze.py
+ │ ├── operant_chamber.py
+ │ ├── shuttle_box.py
+ │ ├── place_preference.py
+ │ ├── star_maze.py
+ │ └── dnms_task.py
+ └── README.md
+ ```
+
+ ## Environments & Cognitive Taxonomy
+
+ | Environment | Cognitive Dimension | Animal Baseline | Citation |
+ |---|---|---|---|
+ | Morris Water Maze | Allocentric Spatial Learning | 85% (session 5) | [PMC2895266](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895266/) — Vorhees & Williams 2006 |
+ | Barnes Maze | Allocentric Spatial Learning | 80% (session 5) | [PMC6126525](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6126525/) — Vale et al. 2018 |
+ | T-Maze | Egocentric Nav + Working Memory | 80% (session 4) | [PMC3399492](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3399492/) — Shoji et al. 2012 |
+ | Star Maze | Allocentric + Egocentric | 80% (session 10) | [PMC4112136](https://academic.oup.com/ilarjournal/article/55/2/310/643871) — Rondi-Reig et al. 2006 |
+ | Radial Arm Maze | Working Memory | 70% (session 6) | [PMC4030456](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4030456/) — Penley et al. 2013 |
+ | Operant Chamber | Instrumental Conditioning | 90% (session 5) | [PMC4598097](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4598097/) — Martin & Iceberg 2015 |
+ | Shuttle Box | Avoidance Learning | 70% (session 10) | [PMC4692667](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4692667/) — Happel et al. 2015 |
+ | Place Preference | Associative Learning | 75% (session 6) | [PMC6101638](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6101638/) — Blanco-Gandía et al. 2018 |
+ | DNMS Task | Working Memory | 80% (session 3) | [PMC3982138](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3982138/) — Oomen et al. 2013 |
+
+ ## View Modes
+
+ | Mode | Description | Information Content |
+ |---|---|---|
+ | `ASCII_2D` | Top-down bird's-eye map | Full spatial layout |
+ | `ASCII_2D_FPV` | Rotated first-person 2D | Egocentric partial view |
+ | `ASCII_3D` | Pseudo-3D ASCII perspective | Depth cues, limited FOV |
+
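+ The sketch below illustrates what an `ASCII_2D` frame looks like in principle. It is a toy example, not the renderer in `environments/base_env.py`; the function name, symbols, and 45-degree heading convention are illustrative assumptions.
+
+ ```python
+ # Toy ASCII_2D renderer for illustration only; the benchmark's real renderer
+ # lives in environments/base_env.py and may use different symbols and layout.
+ ARROWS = {0: "↑", 45: "↗", 90: "→", 135: "↘", 180: "↓", 225: "↙", 270: "←", 315: "↖"}
+
+ def render_ascii_2d(width, height, walls, agent_pos, heading_deg):
+     """Draw walls as '#', open cells as '.', and the agent as a heading arrow (45-degree steps)."""
+     rows = []
+     for y in range(height):
+         row = []
+         for x in range(width):
+             if (x, y) == agent_pos:
+                 row.append(ARROWS[heading_deg % 360])
+             elif (x, y) in walls:
+                 row.append("#")
+             else:
+                 row.append(".")
+         rows.append("".join(row))
+     return "\n".join(rows)
+
+ # Example: a 5x3 corridor with the agent at (1, 1) facing east.
+ walls = {(x, 0) for x in range(5)} | {(x, 2) for x in range(5)}
+ print(render_ascii_2d(5, 3, walls, agent_pos=(1, 1), heading_deg=90))
+ ```
+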
76
+ ## System Prompt (Unified — Identical for ALL Tasks)
77
+
78
+ The VLM receives **no task-specific instructions**. It must discover the goal from observation and reward feedback alone:
79
+
80
+ ```
81
+ You are an embodied agent placed in a behavioral experiment.
82
+ Your only goal is to maximize cumulative reward.
83
+
84
+ PERCEPTION:
85
+ - ASCII rendering (top-down, FPV, or pseudo-3D)
86
+ - Position/orientation shown by arrow: ↑ ↗ → ↘ ↓ ↙ ← ↖
87
+ - Walls (#, █) block movement. Open spaces are traversable.
88
+
89
+ ACTIONS (egocentric):
90
+ - FORWARD, ROTATE_LEFT, ROTATE_RIGHT, STAY
91
+
92
+ RESPONSE FORMAT:
93
+ LEARNINGS: <working memory — position, strategy, hypotheses>
94
+ ACTIONS: <1-8 comma-separated actions>
95
+ ```
96
+
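+ Replies in this format can be parsed with a few lines of Python. The helper below is a hypothetical sketch; the benchmark's actual parsing lives in `benchmark.py` and the environment engine, so the name and edge-case handling here are assumptions.
+
+ ```python
+ import re
+
+ VALID_ACTIONS = {"FORWARD", "ROTATE_LEFT", "ROTATE_RIGHT", "STAY"}
+ MAX_ACTIONS = 8
+
+ def parse_vlm_reply(text):
+     """Split a 'LEARNINGS: ... / ACTIONS: ...' reply into (learnings, actions).
+
+     Hypothetical sketch: unknown tokens are dropped and at most 8 actions are kept.
+     """
+     learnings_match = re.search(r"LEARNINGS:\s*(.*?)(?:\n\s*ACTIONS:|$)", text, re.S | re.I)
+     actions_match = re.search(r"ACTIONS:\s*(.*)", text, re.S | re.I)
+     learnings = learnings_match.group(1).strip() if learnings_match else ""
+     raw = actions_match.group(1) if actions_match else ""
+     actions = [a.strip().upper() for a in raw.replace("\n", ",").split(",") if a.strip()]
+     actions = [a for a in actions if a in VALID_ACTIONS][:MAX_ACTIONS]
+     return learnings, actions
+
+ # Example
+ reply = "LEARNINGS: platform seems NW of start\nACTIONS: FORWARD, FORWARD, ROTATE_LEFT, FORWARD"
+ print(parse_vlm_reply(reply))
+ ```
+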
+ ## Analysis Pipeline
+
+ The analysis module computes:
+ - **Cognitive profiles** — radar chart scores across 6 dimensions
+ - **Learning curves** — rolling-window and block-based success rates
+ - **Strategy metrics** — action entropy, forward ratio, rotation ratio, repetition rate
+ - **Wilson score CIs** — 95% confidence intervals on all success rates (see the sketch after this section)
+ - **Animal comparison** — VLM profiles overlaid with rodent baselines
+
+ ```bash
+ python analysis.py results/benchmark_results.json
+ # Outputs: results/benchmark_results_analysis.json
+ ```
+
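+ For reference, the Wilson score interval behind those CIs is a standard closed-form expression. The function below is a minimal sketch of that formula, not necessarily the exact code in `analysis.py` or `stat_tests.py`.
+
+ ```python
+ import math
+
+ def wilson_ci(successes, trials, z=1.96):
+     """Wilson score interval for a binomial success rate (z=1.96 gives ~95% coverage)."""
+     if trials == 0:
+         return (0.0, 0.0)
+     p = successes / trials
+     denom = 1.0 + z * z / trials
+     center = (p + z * z / (2 * trials)) / denom
+     half = (z / denom) * math.sqrt(p * (1.0 - p) / trials + z * z / (4 * trials * trials))
+     return (max(0.0, center - half), min(1.0, center + half))
+
+ # Example: 14 successful trials out of 20
+ print(wilson_ci(14, 20))  # roughly (0.48, 0.85)
+ ```
+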
+ ## Configuration
+
+ All parameters are defined in `config.py` and can be overridden via CLI flags or environment variables:
+
+ ```bash
+ export CHEESEBENCH_MODEL=gpt-oss:120b
+ export CHEESEBENCH_API_URL=http://localhost:11434/api/chat
+ export CHEESEBENCH_TIMEOUT=120
+ ```
+
+ | Parameter | Default | Description |
+ |---|---|---|
+ | `--model` | `gpt-oss:120b` | VLM model name |
+ | `--num-trials` | 20 | Trials per environment |
+ | `--max-steps` | 200 | Max steps per trial |
+ | `--seed` | 42 | Random seed |
+ | `--output-dir` | `results/` | Output directory |
+ | `--quiet` | false | Suppress verbose output |
+
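+ The environment variables above fall back to built-in defaults when unset. Below is a minimal sketch of that fallback pattern; the real structure of `config.py` may differ, and the class name here is illustrative.
+
+ ```python
+ import os
+ from dataclasses import dataclass, field
+
+ def _env(name, default):
+     """Read an environment variable, falling back to a default value."""
+     return os.environ.get(name, default)
+
+ @dataclass
+ class BenchmarkConfig:
+     # Defaults mirror the table above; env vars override them at construction time.
+     model: str = field(default_factory=lambda: _env("CHEESEBENCH_MODEL", "gpt-oss:120b"))
+     api_url: str = field(default_factory=lambda: _env("CHEESEBENCH_API_URL", "http://localhost:11434/api/chat"))
+     timeout: int = field(default_factory=lambda: int(_env("CHEESEBENCH_TIMEOUT", "120")))
+     num_trials: int = 20
+     max_steps: int = 200
+     seed: int = 42
+     output_dir: str = "results/"
+
+ config = BenchmarkConfig()
+ print(config.model, config.api_url)
+ ```
+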
+ ## Citation
+
+ If you use CheeseBench in your research, please cite:
+
+ ```bibtex
+ @inproceedings{cheesebench2025,
+   title={CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?},
+   author={},
+   booktitle={NeurIPS Datasets and Benchmarks Track},
+   year={2025}
+ }
+ ```
+
+ ## License
+
+ MIT