pearmut 0.3.3__tar.gz → 1.0.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {pearmut-0.3.3 → pearmut-1.0.1}/PKG-INFO +152 -34
- {pearmut-0.3.3 → pearmut-1.0.1}/README.md +150 -33
- {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/PKG-INFO +152 -34
- {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/SOURCES.txt +2 -0
- {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/requires.txt +1 -0
- {pearmut-0.3.3 → pearmut-1.0.1}/pyproject.toml +2 -1
- {pearmut-0.3.3 → pearmut-1.0.1}/server/app.py +119 -27
- pearmut-1.0.1/server/assignment.py +605 -0
- {pearmut-0.3.3 → pearmut-1.0.1}/server/cli.py +245 -135
- pearmut-1.0.1/server/constants.py +93 -0
- pearmut-1.0.1/server/results_export.py +210 -0
- pearmut-1.0.1/server/static/basic.bundle.js +1 -0
- pearmut-1.0.1/server/static/basic.html +133 -0
- pearmut-1.0.1/server/static/dashboard.bundle.js +1 -0
- {pearmut-0.3.3 → pearmut-1.0.1}/server/static/dashboard.html +27 -12
- pearmut-1.0.1/server/static/index.bundle.js +1 -0
- {pearmut-0.3.3 → pearmut-1.0.1}/server/static/index.html +1 -1
- {pearmut-0.3.3 → pearmut-1.0.1}/server/utils.py +3 -1
- pearmut-0.3.3/server/assignment.py +0 -342
- pearmut-0.3.3/server/static/basic.bundle.js +0 -1
- pearmut-0.3.3/server/static/basic.html +0 -97
- pearmut-0.3.3/server/static/dashboard.bundle.js +0 -1
- pearmut-0.3.3/server/static/index.bundle.js +0 -1
- {pearmut-0.3.3 → pearmut-1.0.1}/LICENSE +0 -0
- {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/dependency_links.txt +0 -0
- {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/entry_points.txt +0 -0
- {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/top_level.txt +0 -0
- {pearmut-0.3.3 → pearmut-1.0.1}/server/static/favicon.svg +0 -0
- {pearmut-0.3.3 → pearmut-1.0.1}/server/static/style.css +0 -0
- {pearmut-0.3.3 → pearmut-1.0.1}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pearmut
-Version: 0.3.3
+Version: 1.0.1
 Summary: A tool for evaluation of model outputs, primarily MT.
 Author-email: Vilém Zouhar <vilem.zouhar@gmail.com>
 License: MIT
@@ -14,23 +14,18 @@ Requires-Dist: fastapi>=0.110.0
 Requires-Dist: uvicorn>=0.29.0
 Requires-Dist: wonderwords>=3.0.0
 Requires-Dist: psutil>=7.1.0
+Requires-Dist: typst>=0.14.4
 Provides-Extra: dev
 Requires-Dist: pytest; extra == "dev"
 Dynamic: license-file

-# Pearmut
+# 🍐Pearmut [](https://pypi.org/project/pearmut) [](https://pypi.python.org/pypi/pearmut/) [](https://pypi.org/project/pearmut/) [](https://github.com/zouharvi/pearmut/actions/workflows/test.yml)

 **Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).

-[](https://pypi.org/project/pearmut)
-
-[](https://pypi.python.org/pypi/pearmut/)
-
-[](https://pypi.org/project/pearmut/)
-
-[](https://github.com/zouharvi/pearmut/actions/workflows/test.yml)

-<img width="1000" alt="Screenshot of ESA/MQM interface" src="https://github.com/user-attachments/assets/
+<img width="1000" alt="Screenshot of ESA/MQM interface" src="https://github.com/user-attachments/assets/71334238-300b-4ffc-b777-7f3c242b1630" />
+


 ## Table of Contents
@@ -45,10 +40,13 @@ Dynamic: license-file
 - [Multimodal Annotations](#multimodal-annotations)
 - [Hosting Assets](#hosting-assets)
 - [Campaign Management](#campaign-management)
+- [Custom Completion Messages](#custom-completion-messages)
 - [CLI Commands](#cli-commands)
 - [Terminology](#terminology)
 - [Development](#development)
 - [Citation](#citation)
+- [Changelog](#changelog)
+

 ## Quick Start

@@ -86,11 +84,13 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
 {
   "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
   "src": "This will be the year that Guinness loses its cool. Cheers to that!",
-  "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."}
+  "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."},
+  "item_id": "first item in first document"
 },
 {
   "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
-  "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"}
+  "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"},
+  "item_id": "second item in first document"
 }
 ...
 ],
@@ -105,20 +105,12 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
 ]
 }
 ```
-
-
-
-
-
-
-},
-{
-  "src": "toto je pokračování stejného dokumentu",
-  "tgt": {"modelA": "this is a continuation of the same document"}
-  # Additional keys stored for analysis
-}
-]
-```
+
+Each item must have `tgt` (a dictionary mapping model names to output strings, even for single-model evaluation).
+Optionally, you can also include `src` (source string) and/or `ref` (reference string).
+If neither `src` nor `ref` is provided, only the model outputs will be displayed.
+For full Pearmut functionality (e.g. automatic statistical analysis), add `item_id` as well.
+Any other keys that you add will simply be stored in the logs.

 Load campaigns and start the server:
 ```bash
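Putting the field description above together, a single item might look roughly like the sketch below. The strings and the extra `domain` key are made-up placeholders; only the `src`, `ref`, `tgt`, and `item_id` field names come from the README text above.

```python
{
  "src": "This is the source sentence.",   # optional: source string
  "ref": "Toto je referenční překlad.",    # optional: reference string
  "tgt": {                                 # required: model name -> output string
    "modelA": "Toto je výstup modelu A.",
    "modelB": "Toto je výstup modelu B."
  },
  "item_id": "document 1 | segment 3",     # recommended, enables automatic statistical analysis
  "domain": "news"                         # any other key is simply stored in the logs
}
```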
@@ -130,7 +122,7 @@ pearmut run

 - **`task-based`**: Each user has predefined items
 - **`single-stream`**: All users draw from a shared pool (random assignment)
-- **`dynamic`**:
+- **`dynamic`**: Items are dynamically assigned based on current model performance (see [Dynamic Assignment](#dynamic-assignment))

 ## Advanced Features

@@ -150,6 +142,40 @@ The `shuffle` parameter in campaign `info` controls this behavior:
 }
 ```

+### Custom Score Sliders
+
+For multi-dimensional evaluation tasks (e.g., assessing fluency on a Likert scale), you can define custom sliders with specific ranges and steps:
+
+```python
+{
+  "info": {
+    "assignment": "task-based",
+    "protocol": "ESA",
+    "sliders": [
+      {"name": "Fluency", "min": 0, "max": 5, "step": 1},
+      {"name": "Adequacy", "min": 0, "max": 100, "step": 1}
+    ]
+  },
+  "campaign_id": "my_campaign",
+  "data": [...]
+}
+```
+
+When `sliders` is specified, only the custom sliders are shown. Each slider must have `name`, `min`, `max`, and `step` properties. All sliders must be answered before proceeding.
+
+### Custom Instructions
+
+Set campaign-level instructions using the `instructions` field in `info` (supports HTML).
+Instructions default to protocol-specific ones (DA: scoring, ESA: error spans + scoring, MQM: error spans + categories + scoring).
+```python
+{
+  "info": {
+    "protocol": "DA",
+    "instructions": "Rate translation quality on a 0-100 scale.<br>Pay special attention to document-level phenomena."
+  }
+}
+```
+
 ### Pre-filled Error Spans (ESA<sup>AI</sup>)

 Include `error_spans` to pre-fill annotations that users can review, modify, or delete:
@@ -244,6 +270,36 @@ All annotators draw from a shared pool with random assignment:
 }
 ```

+### Dynamic Assignment
+
+The `dynamic` assignment type intelligently selects items based on current model performance to focus annotation effort on top-performing models using contrastive comparisons.
+All items must contain outputs from all models for this assignment type to work properly.
+
+```python
+{
+  "campaign_id": "my dynamic campaign",
+  "info": {
+    "assignment": "dynamic",
+    "protocol": "ESA",
+    "users": 10, # number of annotators
+    "dynamic_top": 3, # how many top models to consider (required)
+    "dynamic_contrastive_models": 2, # how many models to compare per item (optional, default: 1)
+    "dynamic_first": 5, # annotations per model before dynamic kicks in (optional, default: 5)
+    "dynamic_backoff": 0.1, # probability of uniform sampling (optional, default: 0)
+  },
+  "data": [...], # list of all items (shared among all annotators)
+}
+```
+
+**How it works:**
+1. Initial phase: Each model gets `dynamic_first` annotations with fully random contrastive evaluation
+2. Dynamic phase: After the initial phase, the top `dynamic_top` models (by average score) are identified
+3. Contrastive evaluation: From the top `dynamic_top` models, `dynamic_contrastive_models` models are randomly selected for each item
+4. Item prioritization: Items with the fewest annotations for the selected models are prioritized
+5. Backoff: With probability `dynamic_backoff`, uniform random selection is used instead to maintain exploration
+
+This approach efficiently focuses annotation resources on distinguishing between the best-performing models while ensuring all models get adequate baseline coverage. The contrastive evaluation allows for direct comparison of multiple models simultaneously.
+For an example, see [examples/dynamic.json](examples/dynamic.json).

 ### Pre-defined User IDs and Tokens

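To make the five steps above more concrete, here is a rough sketch of the selection logic in plain Python. It is only an illustration of the described behavior, not the actual implementation in server/assignment.py, and all function and variable names are made up.

```python
import random

def pick_next(items, scores, counts, cfg):
    """Illustrative sketch of dynamic item selection (not Pearmut's actual code).

    items:  list of item indices
    scores: dict model -> list of scores collected so far
    counts: dict (item, model) -> number of annotations so far
    cfg:    campaign info with dynamic_top, dynamic_contrastive_models, dynamic_first, dynamic_backoff
    """
    models = list(scores.keys())
    k = min(cfg.get("dynamic_contrastive_models", 1), len(models))

    # 5. Backoff: with probability dynamic_backoff fall back to uniform random selection
    if random.random() < cfg.get("dynamic_backoff", 0):
        return random.choice(items), random.sample(models, k)

    # 1. Initial phase: until every model has dynamic_first annotations, sample fully at random
    if any(len(s) < cfg.get("dynamic_first", 5) for s in scores.values()):
        pool = models
    else:
        # 2. Dynamic phase: keep only the dynamic_top models by average score
        pool = sorted(models, key=lambda m: sum(scores[m]) / max(len(scores[m]), 1), reverse=True)
        pool = pool[:cfg["dynamic_top"]]

    # 3. Contrastive evaluation: pick which models to show together on one item
    chosen = random.sample(pool, min(k, len(pool)))

    # 4. Item prioritization: prefer the item with the fewest annotations for the chosen models
    item = min(items, key=lambda i: sum(counts.get((i, m), 0) for m in chosen))
    return item, chosen
```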
@@ -264,6 +320,7 @@ The `users` field accepts:
 }
 ```

+
 ### Multimodal Annotations

 Support for HTML-compatible elements (YouTube embeds, `<video>` tags, images). Ensure elements are pre-styled. See [examples/multimodal.json](examples/multimodal.json).
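As a rough illustration of the multimodal support described above, HTML snippets go directly into the item fields. The markup and asset paths below are hypothetical; see [examples/multimodal.json](examples/multimodal.json) for the authoritative format.

```python
{
  "src": "<video src='assets/clip_042.mp4' width='400' controls></video>",  # hypothetical pre-styled HTML source
  "tgt": {
    "modelA": "A short description of the clip produced by model A.",
    "modelB": "<img src='assets/frame_042.jpg' width='400' alt='model B output' />"
  },
  "item_id": "multimodal example item"
}
```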
@@ -317,6 +374,10 @@ Completion tokens are shown at annotation end for verification (download correct

 When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.

+### Custom Completion Messages
+
+Use the `instructions_goodbye` field in the campaign info to customize the goodbye message shown to users once they complete all annotations. It supports arbitrary HTML for styling and formatting, with variable replacement for `${TOKEN}` (completion token) and `${USER_ID}` (user ID). Default: `"If someone asks you for a token of completion, show them: ${TOKEN}"`.
+
 ## Terminology

 - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
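A minimal sketch of a campaign `info` block using `instructions_goodbye` as described above; the message text is made up, only the field name and the `${TOKEN}`/`${USER_ID}` placeholders come from the README.

```python
{
  "info": {
    "protocol": "ESA",
    "instructions_goodbye": "Thank you, ${USER_ID}!<br>Your completion code is <b>${TOKEN}</b>."
  },
  "campaign_id": "my_campaign",
  "data": [...]
}
```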
@@ -344,7 +405,7 @@ When tokens are supplied, the dashboard will try to show model rankings based on
 - **Assignment**: The method for distributing items to users:
   - **Task-based**: Each user has predefined items
   - **Single-stream**: Users draw from a shared pool with random assignment
-  - **Dynamic**:
+  - **Dynamic**: Items are intelligently assigned based on model performance to focus on top models

 ## Development

@@ -377,16 +438,73 @@ See [web/src/basic.ts](web/src/basic.ts) for example.

 Run on public server or tunnel local port to public IP/domain and run locally.

-##
+## Citation

 If you use this work in your paper, please cite as follows.
 ```bibtex
-@misc{
-
-
-
-year={2026},
+@misc{zouhar2026pearmut,
+  author = {Zouhar, Vilém},
+  title = {Pearmut: Human Evaluation of Translation Made Trivial},
+  year = {2026}
 }
 ```

 Contributions are welcome! Please reach out to [Vilém Zouhar](mailto:vilem.zouhar@gmail.com).
+
+# Changelog
+
+- v1.0.1
+  - Support RTL languages
+  - Add boxes for references
+  - Add custom score sliders for multi-dimensional evaluation
+  - Make instructions customizable and protocol-dependent
+  - Support custom sliders
+  - Purge/reset whole tasks from dashboard
+  - Fix resetting individual users in single-stream/dynamic
+  - Fix notification stacking
+  - Add campaigns from dashboard
+- v0.3.3
+  - Rename `doc_id` to `item_id`
+  - Add Typst, LaTeX, and PDF export for model ranking tables, hidden by default
+  - Add dynamic assignment type with contrastive model comparison
+  - Add `instructions_goodbye` field with variable substitution
+  - Add visual anchors at 33% and 66% on sliders
+  - Add German→English ESA tutorial with attention checks
+  - Validate document model consistency before shuffle
+  - Fix UI block on any interaction
+- v0.3.2
+  - Revert seeding of user IDs
+  - Set ESA (Error Span Annotation) as default
+  - Update server IP address configuration
+  - Show approximate alignment by default
+  - Unify pointwise and listwise interfaces into `basic`
+  - Refactor protocol configuration (breaking change)
+- v0.2.11
+  - Add comment field in settings panel
+  - Add `score_gt` validation for listwise comparisons
+  - Add Content-Disposition headers for proper download filenames
+  - Add model results display to dashboard with rankings
+  - Add campaign file structure validation
+  - Purge command now unlinks assets
+- v0.2.6
+  - Add frozen annotation links feature for view-only mode
+  - Add word-level annotation mode toggle for error spans
+  - Add `[missing]` token support
+  - Improve frontend speed and cleanup toolboxes on item load
+  - Host assets via symlinks
+  - Add validation threshold for success/fail tokens
+  - Implement reset masking for annotations
+  - Allow pre-defined user IDs and tokens in campaign data
+- v0.1.1
+  - Set server defaults and add VM launch scripts
+  - Add warning dialog when navigating away with unsaved work
+  - Add tutorial validation support for pointwise and listwise
+  - Add ability to preview existing annotations via progress bar
+  - Add support for ESA<sup>AI</sup> pre-filled error_spans
+  - Rename pairwise to listwise and update layout
+  - Implement single-stream assignment type
+- v0.0.3
+  - Support multimodal inputs and outputs
+  - Add dashboard
+  - Implement ESA (Error Span Annotation) and MQM support
+