PyPI - ScandEval - Versions diffs - 16.7.0__tar.gz → 16.8.0__tar.gz - Mend

Files changed (365)

{scandeval-16.7.0 → scandeval-16.8.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml RENAMED

@@ -25,6 +25,8 @@ body:
     description: What languages is the dataset in?
     options:
       - label: Bulgarian
+      - label: Bosnian
+      - label: Catalan
       - label: Croatian
       - label: Czech
       - label: Danish
@@ -36,6 +38,7 @@ body:
       - label: French
       - label: German
       - label: Greek
+      - label: Hungarian
       - label: Icelandic
       - label: Italian
       - label: Latvian
@@ -43,6 +46,7 @@ body:
       - label: Norwegian (Bokmål or Nynorsk)
       - label: Polish
       - label: Portuguese
+      - label: Romanian
       - label: Serbian
       - label: Slovak
       - label: Slovenian

{scandeval-16.7.0 → scandeval-16.8.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml RENAMED

@@ -20,10 +20,10 @@ body:
     options:
       - label: Baltic languages (Latvian, Lithuanian)
       - label: Finnic languages (Estonian, Finnish)
-      - label: Hellenic languages (Greek)
-      - label: Romance languages (French, Italian, Portuguese, Spanish)
+      - label: Greek
+      - label: Romance languages (Catalan, French, Italian, Portuguese, Romanian, Spanish)
       - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
-      - label: Slavic languages (Bulgarian, Croatian, Czech, Polish, Serbian, Slovak, Slovenian, Ukrainian)
+      - label: Slavic languages (Bulgarian, Bosnian, Croatian, Czech, Hungarian, Polish, Serbian, Slovak, Slovenian, Ukrainian)
       - label: West Germanic languages (Dutch, English, German)
   validations:
     required: true

{scandeval-16.7.0 → scandeval-16.8.0}/.pre-commit-config.yaml RENAMED

@@ -10,7 +10,7 @@ repos:
       - id: trailing-whitespace
       - id: debug-statements
 -   repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.14.4
+    rev: v0.14.6
     hooks:
       - id: ruff
         args:
@@ -30,21 +30,17 @@ repos:
           - pyi
           - jupyter
 -   repo: https://github.com/kynan/nbstripout
-    rev: 0.8.1
+    rev: 0.8.2
     hooks:
     -   id: nbstripout
--   repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v1.18.2
+-   repo: https://github.com/facebook/pyrefly-pre-commit
+    rev: 0.0.1
     hooks:
-    -   id: mypy
-        args:
-          - --install-types
-          - --non-interactive
-          - --ignore-missing-imports
-          - --show-error-codes
-          - --check-untyped-defs
+    -   id: pyrefly-typecheck-system
+        name: Pyrefly (type checking)
+        pass_filenames: true
 -   repo: https://github.com/DavidAnson/markdownlint-cli2
-    rev: v0.18.1
+    rev: v0.19.1
     hooks:
     -   id: markdownlint-cli2
         args:

scandeval-16.8.0/AGENTS.md ADDED

@@ -0,0 +1,121 @@
+# Python Project Conventions
+## Development Workflow
+### Tool Execution
+- Use `uv run` for all script and command execution
+- Always run formatters and linters before committing:
+  ```bash
+  uv run ruff format
+  uv run ruff check --fix
+  ```
+- Verify all tests pass with `uv run pytest`
+- If tests fail, fix them before proceeding with other changes
+## Code Style
+### Documentation
+- Use Google-style docstrings for all public functions, classes, and modules
+- Always include a newline after the name of each argument and exception in the
+  docstring.
+- Avoid tutorial-style `#` comments that explain what code does
+- Comments should explain **why**, not **what** (the code itself should be
+  self-explanatory)
+- Example:
+  ```python
+  def process_items(items: list[Item]) -> list[Result]:
+      """Process items and return results.
+      Args:
+          items:
+            List of items to process.
+      Returns:
+          List of processed results.
+      Raises:
+          ValueError:
+            If items list is empty.
+      """
+      return await batch_process(items)
+  ```
+### Type Annotations
+- Fully type-annotate all functions, methods, and variables
+- Target Python 3.12+ syntax:
+  - Use `list[T]`, `dict[K, V]`, `set[T]` (not `List`, `Dict`, `Set` from typing)
+  - Use `X | Y` for unions (not `Union[X, Y]`)
+  - Use `X | None` for optional types (not `Optional[X]`)
+- Always use `import typing as t` and use the `t.` prefix for types from the typing
+  module, such as `t.Any`, `t.Callable`, `t.TypeVar`, etc.
+- Example:
+  ```python
+  def fetch_data(url: str, timeout: float = 30.0) -> dict[str, t.Any] | None:
+      ...
+  ```
+### Programming Paradigm
+- Prefer functional programming patterns over OOP when appropriate
+- Use the best tool for the job (don't force FP or OOP dogmatically)
+- Favor immutability and pure functions where practical
+- Prefer composition over inheritance
+- Use dataclasses or Pydantic models for data structures
+### Performance
+- Write code with performance in mind
+- Profile before optimising
+- Use appropriate data structures (sets for membership, deques for queues, etc.)
+- Leverage list/dict/set comprehensions over explicit loops when clearer
+- Consider generators for memory efficiency with large datasets
+## Testing
+### Test Execution
+- Run tests with `uv run pytest`
+- All tests must pass before pushing code
+- Fix broken tests immediately—do not commit failing tests
+### Test Style
+- Follow the same conventions as production code
+- Use descriptive test names that explain the scenario
+- Example:
+  ```python
+  def test_fetch_data_returns_valid_json() -> None:
+      """Test that fetch_data returns properly formatted JSON."""
+      result = await fetch_data("https://api.example.com/data")
+      assert isinstance(result, dict)
+      assert "id" in result
+  ```
+## Code Organisation
+### Module Structure
+- Keep modules focused and cohesive
+- Prefer many small modules over few large ones
+- Use clear, descriptive names
+- Organise imports: stdlib, third-party, local (separated by blank lines)
+### Function Design
+- Keep functions small and single-purpose
+- Use descriptive names (prefer `calculate_total_price` over `calc`)
+- Limit arguments (consider using dataclasses for many parameters)
+- Return early to reduce nesting
+## Summary
+**Remember:** Write code that is clear, fast, and well-typed. Let the code speak for
+itself with minimal comments. Run formatters, linters, and tests before committing.

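Note on the `process_items` docstring example in AGENTS.md: as written it uses `await` inside a plain `def`, which Python rejects with a `SyntaxError`. A runnable variant that keeps the same docstring conventions might look like the sketch below. `batch_process` is not defined anywhere in the diff, so the version here is a hypothetical stand-in, and `Item`/`Result` are replaced by `str` purely for illustration.

```python
import asyncio


async def batch_process(items: list[str]) -> list[str]:
    """Hypothetical stand-in for the undefined `batch_process` helper."""
    # Yield control once so this behaves like real asynchronous work.
    await asyncio.sleep(0)
    return [item.upper() for item in items]


async def process_items(items: list[str]) -> list[str]:
    """Process items and return results.

    Args:
        items:
            List of items to process.

    Returns:
        List of processed results.

    Raises:
        ValueError:
            If items list is empty.
    """
    if not items:
        raise ValueError("items must not be empty")
    return await batch_process(items)


if __name__ == "__main__":
    print(asyncio.run(process_items(["a", "b"])))
```

The same applies to the `test_fetch_data_returns_valid_json` example, which would also need `async def` (plus an async-capable test runner) to use `await`.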
{scandeval-16.7.0 → scandeval-16.8.0}/CHANGELOG.md RENAMED

@@ -7,10 +7,66 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 ## [Unreleased]
+## [v16.8.0] - 2025-11-25
+### Added
+- Added support for Romanian 🇷🇴! This includes the sentiment classification dataset
+  RoSent, the linguistic acceptability dataset ScaLA-ro, the named entity recognition
+  dataset RoNEC, the reading comprehension dataset MultiWikiQA-ro, the summarisation
+  dataset SumO-Ro, the knowledge dataset Global-MMLU-ro, and the common-sense
+  reasoning dataset Winogrande-ro. This was contributed by @oliverkinch ✨
+- Added support for Hungarian 🇭🇺! This includes the sentiment classification dataset
+  HuSST, the linguistic acceptability dataset ScaLA-hu, the named entity recognition
+  dataset SzegedNER, the reading comprehension dataset MultiWikiQA-hu, the
+  summarisation dataset HunSum, the knowledge dataset MMLU-hu, and the common-sense
+  reasoning dataset Winogrande-hu. This was contributed by @oliverkinch ✨
+- Added support for Catalan! This includes the sentiment classification dataset
+  GuiaCat, the linguistic acceptability dataset ScaLA-ca, the named entity recognition
+  dataset WikiANN-ca, the reading comprehension dataset MultiWikiQA-ca, the summarisation
+  dataset DACSA-ca, the knowledge dataset MMLU-ca, and the common-sense reasoning dataset
+  Winogrande-ca. This was contributed by @oliverkinch ✨
+- Added Spanish summarisation dataset DACSA-es as an unofficial dataset.
+- Added Lithuanian sentiment classification dataset Atsiliepimai to replace the now
+  unofficial Lithuanian Emotions dataset. This was contributed by @oliverkinch ✨
+- Added new `--custom-datasets-file` (`custom_datasets_file` in the `Benchmarker` API)
+  argument, which can be used to specify a custom Python file containing custom dataset
+  definitions. It defaults to `custom_datasets.py` in the current working directory.
+### Changed
+- When evaluating models with the `--debug` flag (`debug=True` in the `Benchmarker`
+  API), we now include the full model inputs and outputs in the JSON file stored to the
+  current working directory, where we previously only included the model outputs.
+### Fixed
+- When encountering rate limits for API inference models, we ended up waiting 10 seconds
+  for each request, which was unnecessarily long. We now only wait 10 seconds for each
+  batch of requests.
+- Uses the `FLASH_ATTN` backend with vLLM for Gemma-3-1b and Gemma-3n models now and the
+  `TRITON_ATTN` with the other Gemma-3 models, as their architecture is currently not
+  supported by the default `FLASHINFER` backend. Note that this can always be changed
+  manually with the `VLLM_ATTENTION_BACKEND` environment variable.
+## [v16.7.1] - 2025-11-18
+### Fixed
+- The `--no-progress-bar` argument (`progress_bar=False` in the `Benchmarker` API) was
+  not hiding all the progress bars for generative models. This has been fixed now.
+- Now respects the revision when loading tokenizers with vLLM models. I.e., if
+  evaluating a model `<model_id>@<revision>` then we also load the tokenizer from the
+  `<revision>` branch.
 ## [v16.7.0] - 2025-11-10
 ### Added
+- Added support for Bosnian 🇧🇦! This includes the sentiment classification dataset
+  MMS-bs, the named entity recognition dataset WikiANN-bs, the reading comprehension
+  dataset MultiWikiQA-bs, and the summarisation dataset LR-Sum-bs. This was contributed
+  by @oliverkinch ✨
 - Now allows the 'low', 'medium' and 'high' reasoning effort parameters for the GPT-OSS
   models, which can be set by appending `#low`, `#medium` or `#high` to the model ID.
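The v16.8.0 "Fixed" entry on attention backends in the hunk above describes a selection rule that can be sketched as follows. This is an illustration only, not the package's implementation: the function name `pick_attention_backend` is hypothetical, and matching model families by substring of the model ID is an assumption. Only the backend names and the `VLLM_ATTENTION_BACKEND` variable come from the changelog.

```python
import os
from collections.abc import Mapping


def pick_attention_backend(
    model_id: str, environ: Mapping[str, str] = os.environ
) -> str:
    """Return the attention backend name to use for a model ID.

    Args:
        model_id:
            The model ID, e.g. a Hugging Face Hub ID.
        environ:
            Environment mapping to read the manual override from.

    Returns:
        The name of the attention backend.
    """
    # A manually set environment variable always wins, as the changelog notes.
    override = environ.get("VLLM_ATTENTION_BACKEND")
    if override:
        return override

    name = model_id.lower()
    # Assumption: model families are recognised by substring of the model ID.
    if "gemma-3-1b" in name or "gemma-3n" in name:
        return "FLASH_ATTN"
    if "gemma-3" in name:
        return "TRITON_ATTN"
    return "FLASHINFER"


if __name__ == "__main__":
    # The model IDs below are illustrative inputs only.
    for example_id in ("google/gemma-3-1b-it", "google/gemma-3-12b-it"):
        print(example_id, pick_attention_backend(example_id, {}))
```

Passing the environment as a parameter keeps the rule a pure function, which makes the override behaviour easy to check without touching the real process environment.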
@@ -52,7 +108,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 - Added support for Croatian 🇭🇷! This includes the sentiment classification dataset
   MMS-hr, the linguistic acceptability dataset ScaLA-hr, the named entity recognition
   dataset WikiANN-hr, the reading comprehension dataset MultiWikiQA-hr, the knowledge
-  dataset MMLU-hr, and the common-sense reasoning dataset Winogrande-hr.
+  dataset MMLU-hr, and the common-sense reasoning dataset Winogrande-hr. This was
+  contributed by @oliverkinch ✨
 - Added a system dependency check for `nvcc` in the `VLLMModel.__init__` method to
   ensure the CUDA Toolkit is installed. Raises an error with installation instructions
   if NVCC is not available in the system PATH.
@@ -73,11 +130,6 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 ### Added
-- Added support for Slovenian 🇸🇮! This includes the sentiment classification dataset
-  Sentinews, the linguistic acceptability dataset ScaLA-sl, the named entity recognition
-  dataset ssj500k-NER, the reading comprehension
-  dataset MultiWikiQA-sl, the knowledge dataset MMLU-sl, and the common-sense reasoning
-  dataset Winogrande-sl.
 - Added better support for evaluating on custom datasets, by allowing `DatasetConfig`
   objects directly in the `Benchmarker.benchmark` method. We also support custom
   datasets with the CLI, by simply defining the desired `DatasetConfig`s in a
@@ -86,6 +138,11 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
   with the new `source` argument. This argument can both be the Hugging Face Hub ID of
   the dataset or a dictionary with 'train', 'val' and 'test', and values the paths to
   the CSV files.
+- Added support for Slovene 🇸🇮! This includes the sentiment classification dataset
+  Sentinews, the linguistic acceptability dataset ScaLA-sl, the named entity recognition
+  dataset ssj500k-NER, the reading comprehension
+  dataset MultiWikiQA-sl, the knowledge dataset MMLU-sl, and the common-sense reasoning
+  dataset Winogrande-sl. This was contributed by @oliverkinch ✨
 - Added support for Serbian 🇷🇸! This includes the sentiment classification dataset
   MMS-sr, the linguistic acceptability dataset ScaLA-sr, the named entity recognition
   dataset UNER-sr, the reading comprehension dataset MultiWikiQA-sr, the summarisation
@@ -2860,7 +2917,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 - Removed support for TensorFlow and Jax models, due to them not working
   properly anyway. They might be included at a later point, properly.
-##  [v1.4.0] - 2021-11-25
+## [v1.4.0] - 2021-11-25
 ### Changed

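The rate-limit fix listed under v16.8.0 (waiting 10 seconds once per batch rather than once per request) can be illustrated with a small asyncio sketch. Everything here is hypothetical and not taken from the package: `RateLimitError`, `run_batch`, the `send` callback and the retry count are assumptions chosen for illustration. Only the 10-second wait comes from the changelog.

```python
import asyncio
from collections.abc import Awaitable, Callable


class RateLimitError(Exception):
    """Hypothetical stand-in for an API client's rate-limit exception."""


async def run_batch(
    prompts: list[str],
    send: Callable[[str], Awaitable[str]],
    wait_seconds: float = 10.0,
    max_attempts: int = 5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> list[str]:
    """Send a batch concurrently, pausing once per batch when rate limited."""
    results: dict[int, str] = {}
    pending = list(range(len(prompts)))
    for _ in range(max_attempts):
        outcomes = await asyncio.gather(
            *(send(prompts[i]) for i in pending), return_exceptions=True
        )
        still_pending: list[int] = []
        for i, outcome in zip(pending, outcomes):
            if isinstance(outcome, RateLimitError):
                still_pending.append(i)
            elif isinstance(outcome, BaseException):
                raise outcome
            else:
                results[i] = outcome
        if not still_pending:
            return [results[i] for i in range(len(prompts))]
        pending = still_pending
        # One shared pause for the whole batch, not one pause per failed request.
        await sleep(wait_seconds)
    raise RuntimeError("Rate limit persisted after all attempts")
```

With per-request waiting, a batch of N rate-limited requests handled sequentially could stall for roughly N × 10 seconds. With the shared pause, the stall is about 10 seconds per retry round regardless of batch size.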
{scandeval-16.7.0 → scandeval-16.8.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: ScandEval
-Version: 16.7.0
+Version: 16.8.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -92,7 +92,7 @@ ______________________________________________________________________
 [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
 [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
 [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
-[![Code Coverage](https://img.shields.io/badge/Coverage-74%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+[![Code Coverage](https://img.shields.io/badge/Coverage-70%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
 [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
 ## Maintainer
@@ -255,6 +255,19 @@ argument:
 euroeval --model my-model --api-base http://localhost:8000 --api-key my-secret-key
 ```
+If your model is a reasoning model, then you need to specify this as follows:
+```bash
+euroeval --model my-reasoning-model --api-base http://localhost:8000 --generative-type reasoning
+```
+Likewise, if it is a pretrained decoder model (aka a completion model), then you specify
+this as follows:
+```bash
+euroeval --model my-base-decoder-model --api-base http://localhost:8000 --generative-type base
+```
 When using the `Benchmarker` API, the same applies. Here is an example of benchmarking
 an Ollama model hosted locally:

{scandeval-16.7.0 → scandeval-16.8.0}/README.md RENAMED

@@ -20,7 +20,7 @@ ______________________________________________________________________
 [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
 [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
 [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
-[![Code Coverage](https://img.shields.io/badge/Coverage-74%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+[![Code Coverage](https://img.shields.io/badge/Coverage-70%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
 [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
 ## Maintainer
@@ -183,6 +183,19 @@ argument:
 euroeval --model my-model --api-base http://localhost:8000 --api-key my-secret-key
 ```
+If your model is a reasoning model, then you need to specify this as follows:
+```bash
+euroeval --model my-reasoning-model --api-base http://localhost:8000 --generative-type reasoning
+```
+Likewise, if it is a pretrained decoder model (aka a completion model), then you specify
+this as follows:
+```bash
+euroeval --model my-base-decoder-model --api-base http://localhost:8000 --generative-type base
+```
 When using the `Benchmarker` API, the same applies. Here is an example of benchmarking
 an Ollama model hosted locally:
