pearmut 0.3.3__tar.gz → 1.0.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (30)
  1. {pearmut-0.3.3 → pearmut-1.0.1}/PKG-INFO +152 -34
  2. {pearmut-0.3.3 → pearmut-1.0.1}/README.md +150 -33
  3. {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/PKG-INFO +152 -34
  4. {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/SOURCES.txt +2 -0
  5. {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/requires.txt +1 -0
  6. {pearmut-0.3.3 → pearmut-1.0.1}/pyproject.toml +2 -1
  7. {pearmut-0.3.3 → pearmut-1.0.1}/server/app.py +119 -27
  8. pearmut-1.0.1/server/assignment.py +605 -0
  9. {pearmut-0.3.3 → pearmut-1.0.1}/server/cli.py +245 -135
  10. pearmut-1.0.1/server/constants.py +93 -0
  11. pearmut-1.0.1/server/results_export.py +210 -0
  12. pearmut-1.0.1/server/static/basic.bundle.js +1 -0
  13. pearmut-1.0.1/server/static/basic.html +133 -0
  14. pearmut-1.0.1/server/static/dashboard.bundle.js +1 -0
  15. {pearmut-0.3.3 → pearmut-1.0.1}/server/static/dashboard.html +27 -12
  16. pearmut-1.0.1/server/static/index.bundle.js +1 -0
  17. {pearmut-0.3.3 → pearmut-1.0.1}/server/static/index.html +1 -1
  18. {pearmut-0.3.3 → pearmut-1.0.1}/server/utils.py +3 -1
  19. pearmut-0.3.3/server/assignment.py +0 -342
  20. pearmut-0.3.3/server/static/basic.bundle.js +0 -1
  21. pearmut-0.3.3/server/static/basic.html +0 -97
  22. pearmut-0.3.3/server/static/dashboard.bundle.js +0 -1
  23. pearmut-0.3.3/server/static/index.bundle.js +0 -1
  24. {pearmut-0.3.3 → pearmut-1.0.1}/LICENSE +0 -0
  25. {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/dependency_links.txt +0 -0
  26. {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/entry_points.txt +0 -0
  27. {pearmut-0.3.3 → pearmut-1.0.1}/pearmut.egg-info/top_level.txt +0 -0
  28. {pearmut-0.3.3 → pearmut-1.0.1}/server/static/favicon.svg +0 -0
  29. {pearmut-0.3.3 → pearmut-1.0.1}/server/static/style.css +0 -0
  30. {pearmut-0.3.3 → pearmut-1.0.1}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: pearmut
- Version: 0.3.3
+ Version: 1.0.1
  Summary: A tool for evaluation of model outputs, primarily MT.
  Author-email: Vilém Zouhar <vilem.zouhar@gmail.com>
  License: MIT
@@ -14,23 +14,18 @@ Requires-Dist: fastapi>=0.110.0
  Requires-Dist: uvicorn>=0.29.0
  Requires-Dist: wonderwords>=3.0.0
  Requires-Dist: psutil>=7.1.0
+ Requires-Dist: typst>=0.14.4
  Provides-Extra: dev
  Requires-Dist: pytest; extra == "dev"
  Dynamic: license-file

- # Pearmut 🍐
+ # 🍐Pearmut &nbsp; &nbsp; [![PyPi version](https://badgen.net/pypi/v/pearmut/)](https://pypi.org/project/pearmut) [![PyPI download/month](https://img.shields.io/pypi/dm/pearmut.svg)](https://pypi.python.org/pypi/pearmut/) [![PyPi license](https://badgen.net/pypi/license/pearmut/)](https://pypi.org/project/pearmut/) [![build status](https://github.com/zouharvi/pearmut/actions/workflows/test.yml/badge.svg)](https://github.com/zouharvi/pearmut/actions/workflows/test.yml)

  **Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).

- [![PyPi version](https://badgen.net/pypi/v/pearmut/)](https://pypi.org/project/pearmut)
- &nbsp;
- [![PyPI download/month](https://img.shields.io/pypi/dm/pearmut.svg)](https://pypi.python.org/pypi/pearmut/)
- &nbsp;
- [![PyPi license](https://badgen.net/pypi/license/pearmut/)](https://pypi.org/project/pearmut/)
- &nbsp;
- [![build status](https://github.com/zouharvi/pearmut/actions/workflows/test.yml/badge.svg)](https://github.com/zouharvi/pearmut/actions/workflows/test.yml)

- <img width="1000" alt="Screenshot of ESA/MQM interface" src="https://github.com/user-attachments/assets/4fb9a1cb-78ac-47e0-99cd-0870a368a0ad" />
+ <img width="1000" alt="Screenshot of ESA/MQM interface" src="https://github.com/user-attachments/assets/71334238-300b-4ffc-b777-7f3c242b1630" />
+

  ## Table of Contents

@@ -45,10 +40,13 @@ Dynamic: license-file
  - [Multimodal Annotations](#multimodal-annotations)
  - [Hosting Assets](#hosting-assets)
  - [Campaign Management](#campaign-management)
+ - [Custom Completion Messages](#custom-completion-messages)
  - [CLI Commands](#cli-commands)
  - [Terminology](#terminology)
  - [Development](#development)
  - [Citation](#citation)
+ - [Changelog](#changelog)
+

  ## Quick Start

@@ -86,11 +84,13 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
  "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
  "src": "This will be the year that Guinness loses its cool. Cheers to that!",
- "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."}
+ "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."},
+ "item_id": "first item in first document"
  },
  {
  "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
- "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"}
+ "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"},
+ "item_id": "second item in first document"
  }
  ...
  ],
@@ -105,20 +105,12 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  ]
  }
  ```
- Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dictionary representing a document unit:
- ```python
- [
- {
- "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
- "tgt": {"modelA": "And suddenly all the water became full of other people and other people."} # required (dict)
- },
- {
- "src": "toto je pokračování stejného dokumentu",
- "tgt": {"modelA": "this is a continuation of the same document"}
- # Additional keys stored for analysis
- }
- ]
- ```
+
+ Each item has to have `tgt` (dictionary from model names to strings, even for a single model evaluation).
+ Optionally, you can also include `src` (source string) and/or `ref` (reference string).
+ If neither `src` nor `ref` is provided, only the model outputs will be displayed.
+ For full Pearmut functionality (e.g. automatic statistical analysis), add `item_id` as well.
+ Any other keys that you add will simply be stored in the logs.

  Load campaigns and start the server:
  ```bash
@@ -130,7 +122,7 @@ pearmut run

  - **`task-based`**: Each user has predefined items
  - **`single-stream`**: All users draw from a shared pool (random assignment)
- - **`dynamic`**: work in progress ⚠️
+ - **`dynamic`**: Items are dynamically assigned based on current model performance (see [Dynamic Assignment](#dynamic-assignment))

  ## Advanced Features

@@ -150,6 +142,40 @@ The `shuffle` parameter in campaign `info` controls this behavior:
  }
  ```

+ ### Custom Score Sliders
+
+ For multi-dimensional evaluation tasks (e.g., assessing fluency on a Likert scale), you can define custom sliders with specific ranges and steps:
+
+ ```python
+ {
+ "info": {
+ "assignment": "task-based",
+ "protocol": "ESA",
+ "sliders": [
+ {"name": "Fluency", "min": 0, "max": 5, "step": 1},
+ {"name": "Adequacy", "min": 0, "max": 100, "step": 1}
+ ]
+ },
+ "campaign_id": "my_campaign",
+ "data": [...]
+ }
+ ```
+
+ When `sliders` is specified, only the custom sliders are shown. Each slider must have `name`, `min`, `max`, and `step` properties. All sliders must be answered before proceeding.
+
+ ### Custom Instructions
+
+ Set campaign-level instructions using the `instructions` field in `info` (supports HTML).
+ Instructions default to protocol-specific ones (DA: scoring, ESA: error spans + scoring, MQM: error spans + categories + scoring).
+ ```python
+ {
+ "info": {
+ "protocol": "DA",
+ "instructions": "Rate translation quality on a 0-100 scale.<br>Pay special attention to document-level phenomena."
+ }
+ }
+ ```
+
  ### Pre-filled Error Spans (ESA<sup>AI</sup>)

  Include `error_spans` to pre-fill annotations that users can review, modify, or delete:
@@ -244,6 +270,36 @@ All annotators draw from a shared pool with random assignment:
  }
  ```

+ ### Dynamic Assignment
+
+ The `dynamic` assignment type intelligently selects items based on current model performance to focus annotation effort on top-performing models using contrastive comparisons.
+ All items must contain outputs from all models for this assignment type to work properly.
+
+ ```python
+ {
+ "campaign_id": "my dynamic campaign",
+ "info": {
+ "assignment": "dynamic",
+ "protocol": "ESA",
+ "users": 10, # number of annotators
+ "dynamic_top": 3, # how many top models to consider (required)
+ "dynamic_contrastive_models": 2, # how many models to compare per item (optional, default: 1)
+ "dynamic_first": 5, # annotations per model before dynamic kicks in (optional, default: 5)
+ "dynamic_backoff": 0.1, # probability of uniform sampling (optional, default: 0)
+ },
+ "data": [...], # list of all items (shared among all annotators)
+ }
+ ```
+
+ **How it works:**
+ 1. Initial phase: Each model gets `dynamic_first` annotations with fully random contrastive evaluation
+ 2. Dynamic phase: After the initial phase, top `dynamic_top` models (by average score) are identified
+ 3. Contrastive evaluation: From the top N models, `dynamic_contrastive_models` models are randomly selected for each item
+ 4. Item prioritization: Items with the least annotations for the selected models are prioritized
+ 5. Backoff: With probability `dynamic_backoff`, uniform random selection is used instead to maintain exploration
+
+ This approach efficiently focuses annotation resources on distinguishing between the best-performing models while ensuring all models get adequate baseline coverage. The contrastive evaluation allows for direct comparison of multiple models simultaneously.
+ For an example, see [examples/dynamic.json](examples/dynamic.json).

  ### Pre-defined User IDs and Tokens

@@ -264,6 +320,7 @@ The `users` field accepts:
  }
  ```

+
  ### Multimodal Annotations

  Support for HTML-compatible elements (YouTube embeds, `<video>` tags, images). Ensure elements are pre-styled. See [examples/multimodal.json](examples/multimodal.json).
@@ -317,6 +374,10 @@ Completion tokens are shown at annotation end for verification (download correct

  When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.

+ ### Custom Completion Messages
+
+ Customize the goodbye message shown to users when they complete all annotations using the `instructions_goodbye` field in campaign info. Supports arbitrary HTML for styling and formatting with variable replacement: `${TOKEN}` (completion token) and `${USER_ID}` (user ID). Default: `"If someone asks you for a token of completion, show them: ${TOKEN}"`.
+
  ## Terminology

  - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
@@ -344,7 +405,7 @@ When tokens are supplied, the dashboard will try to show model rankings based on
  - **Assignment**: The method for distributing items to users:
  - **Task-based**: Each user has predefined items
  - **Single-stream**: Users draw from a shared pool with random assignment
- - **Dynamic**: Work in progress
+ - **Dynamic**: Items are intelligently assigned based on model performance to focus on top models

  ## Development

@@ -377,16 +438,73 @@ See [web/src/basic.ts](web/src/basic.ts) for example.

  Run on public server or tunnel local port to public IP/domain and run locally.

- ## Misc.
+ ## Citation

  If you use this work in your paper, please cite as following.
  ```bibtex
- @misc{zouhar2025pearmut,
- author={Vilém Zouhar},
- title={Pearmut: Platform for Evaluating and Reviewing of Multilingual Tasks},
- url={https://github.com/zouharvi/pearmut/},
- year={2026},
+ @misc{zouhar2026pearmut,
+ author = {Zouhar, Vilém},
+ title = {Pearmut: Human Evaluation of Translation Made Trivial},
+ year = {2026}
  }
  ```

  Contributions are welcome! Please reach out to [Vilém Zouhar](mailto:vilem.zouhar@gmail.com).
+
+ # Changelog
+
+ - v1.0.1
+ - Support RTL languages
+ - Add boxes for references
+ - Add custom score sliders for multi-dimensional evaluation
+ - Make instructions customizable and protocol-dependent
+ - Support custom sliders
+ - Purge/reset whole tasks from dashboard
+ - Fix resetting individual users in single-stream/dynamic
+ - Fix notification stacking
+ - Add campaigns from dashboard
+ - v0.3.3
+ - Rename `doc_id` to `item_id`
+ - Add Typst, LaTeX, and PDF export for model ranking tables. Hide them by default.
+ - Add dynamic assignment type with contrastive model comparison
+ - Add `instructions_goodbye` field with variable substitution
+ - Add visual anchors at 33% and 66% on sliders
+ - Add German→English ESA tutorial with attention checks
+ - Validate document model consistency before shuffle
+ - Fix UI block on any interaction
+ - v0.3.2
+ - Revert seeding of user IDs
+ - Set ESA (Error Span Annotation) as default
+ - Update server IP address configuration
+ - Show approximate alignment by default
+ - Unify pointwise and listwise interfaces into `basic`
+ - Refactor protocol configuration (breaking change)
+ - v0.2.11
+ - Add comment field in settings panel
+ - Add `score_gt` validation for listwise comparisons
+ - Add Content-Disposition headers for proper download filenames
+ - Add model results display to dashboard with rankings
+ - Add campaign file structure validation
+ - Purge command now unlinks assets
+ - v0.2.6
+ - Add frozen annotation links feature for view-only mode
+ - Add word-level annotation mode toggle for error spans
+ - Add `[missing]` token support
+ - Improve frontend speed and cleanup toolboxes on item load
+ - Host assets via symlinks
+ - Add validation threshold for success/fail tokens
+ - Implement reset masking for annotations
+ - Allow pre-defined user IDs and tokens in campaign data
+ - v0.1.1
+ - Set server defaults and add VM launch scripts
+ - Add warning dialog when navigating away with unsaved work
+ - Add tutorial validation support for pointwise and listwise
+ - Add ability to preview existing annotations via progress bar
+ - Add support for ESA<sup>AI</sup> pre-filled error_spans
+ - Rename pairwise to listwise and update layout
+ - Implement single-stream assignment type
+ - v0.0.3
+ - Support multimodal inputs and outputs
+ - Add dashboard
+ - Implement ESA (Error Span Annotation) and MQM support
+
@@ -1,16 +1,10 @@
- # Pearmut 🍐
+ # 🍐Pearmut &nbsp; &nbsp; [![PyPi version](https://badgen.net/pypi/v/pearmut/)](https://pypi.org/project/pearmut) [![PyPI download/month](https://img.shields.io/pypi/dm/pearmut.svg)](https://pypi.python.org/pypi/pearmut/) [![PyPi license](https://badgen.net/pypi/license/pearmut/)](https://pypi.org/project/pearmut/) [![build status](https://github.com/zouharvi/pearmut/actions/workflows/test.yml/badge.svg)](https://github.com/zouharvi/pearmut/actions/workflows/test.yml)

  **Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).

- [![PyPi version](https://badgen.net/pypi/v/pearmut/)](https://pypi.org/project/pearmut)
- &nbsp;
- [![PyPI download/month](https://img.shields.io/pypi/dm/pearmut.svg)](https://pypi.python.org/pypi/pearmut/)
- &nbsp;
- [![PyPi license](https://badgen.net/pypi/license/pearmut/)](https://pypi.org/project/pearmut/)
- &nbsp;
- [![build status](https://github.com/zouharvi/pearmut/actions/workflows/test.yml/badge.svg)](https://github.com/zouharvi/pearmut/actions/workflows/test.yml)

- <img width="1000" alt="Screenshot of ESA/MQM interface" src="https://github.com/user-attachments/assets/4fb9a1cb-78ac-47e0-99cd-0870a368a0ad" />
+ <img width="1000" alt="Screenshot of ESA/MQM interface" src="https://github.com/user-attachments/assets/71334238-300b-4ffc-b777-7f3c242b1630" />
+

  ## Table of Contents

@@ -25,10 +19,13 @@
  - [Multimodal Annotations](#multimodal-annotations)
  - [Hosting Assets](#hosting-assets)
  - [Campaign Management](#campaign-management)
+ - [Custom Completion Messages](#custom-completion-messages)
  - [CLI Commands](#cli-commands)
  - [Terminology](#terminology)
  - [Development](#development)
  - [Citation](#citation)
+ - [Changelog](#changelog)
+

  ## Quick Start

@@ -66,11 +63,13 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
  "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
  "src": "This will be the year that Guinness loses its cool. Cheers to that!",
- "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."}
+ "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."},
+ "item_id": "first item in first document"
  },
  {
  "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
- "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"}
+ "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"},
+ "item_id": "second item in first document"
  }
  ...
  ],
@@ -85,20 +84,12 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  ]
  }
  ```
- Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dictionary representing a document unit:
- ```python
- [
- {
- "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
- "tgt": {"modelA": "And suddenly all the water became full of other people and other people."} # required (dict)
- },
- {
- "src": "toto je pokračování stejného dokumentu",
- "tgt": {"modelA": "this is a continuation of the same document"}
- # Additional keys stored for analysis
- }
- ]
- ```
+
+ Each item has to have `tgt` (dictionary from model names to strings, even for a single model evaluation).
+ Optionally, you can also include `src` (source string) and/or `ref` (reference string).
+ If neither `src` nor `ref` is provided, only the model outputs will be displayed.
+ For full Pearmut functionality (e.g. automatic statistical analysis), add `item_id` as well.
+ Any other keys that you add will simply be stored in the logs.

  Load campaigns and start the server:
  ```bash
@@ -110,7 +101,7 @@ pearmut run

  - **`task-based`**: Each user has predefined items
  - **`single-stream`**: All users draw from a shared pool (random assignment)
- - **`dynamic`**: work in progress ⚠️
+ - **`dynamic`**: Items are dynamically assigned based on current model performance (see [Dynamic Assignment](#dynamic-assignment))

  ## Advanced Features

@@ -130,6 +121,40 @@ The `shuffle` parameter in campaign `info` controls this behavior:
  }
  ```

+ ### Custom Score Sliders
+
+ For multi-dimensional evaluation tasks (e.g., assessing fluency on a Likert scale), you can define custom sliders with specific ranges and steps:
+
+ ```python
+ {
+ "info": {
+ "assignment": "task-based",
+ "protocol": "ESA",
+ "sliders": [
+ {"name": "Fluency", "min": 0, "max": 5, "step": 1},
+ {"name": "Adequacy", "min": 0, "max": 100, "step": 1}
+ ]
+ },
+ "campaign_id": "my_campaign",
+ "data": [...]
+ }
+ ```
+
+ When `sliders` is specified, only the custom sliders are shown. Each slider must have `name`, `min`, `max`, and `step` properties. All sliders must be answered before proceeding.
+
+ ### Custom Instructions
+
+ Set campaign-level instructions using the `instructions` field in `info` (supports HTML).
+ Instructions default to protocol-specific ones (DA: scoring, ESA: error spans + scoring, MQM: error spans + categories + scoring).
+ ```python
+ {
+ "info": {
+ "protocol": "DA",
+ "instructions": "Rate translation quality on a 0-100 scale.<br>Pay special attention to document-level phenomena."
+ }
+ }
+ ```
+
  ### Pre-filled Error Spans (ESA<sup>AI</sup>)

  Include `error_spans` to pre-fill annotations that users can review, modify, or delete:
@@ -224,6 +249,36 @@ All annotators draw from a shared pool with random assignment:
  }
  ```

+ ### Dynamic Assignment
+
+ The `dynamic` assignment type intelligently selects items based on current model performance to focus annotation effort on top-performing models using contrastive comparisons.
+ All items must contain outputs from all models for this assignment type to work properly.
+
+ ```python
+ {
+ "campaign_id": "my dynamic campaign",
+ "info": {
+ "assignment": "dynamic",
+ "protocol": "ESA",
+ "users": 10, # number of annotators
+ "dynamic_top": 3, # how many top models to consider (required)
+ "dynamic_contrastive_models": 2, # how many models to compare per item (optional, default: 1)
+ "dynamic_first": 5, # annotations per model before dynamic kicks in (optional, default: 5)
+ "dynamic_backoff": 0.1, # probability of uniform sampling (optional, default: 0)
+ },
+ "data": [...], # list of all items (shared among all annotators)
+ }
+ ```
+
+ **How it works:**
+ 1. Initial phase: Each model gets `dynamic_first` annotations with fully random contrastive evaluation
+ 2. Dynamic phase: After the initial phase, top `dynamic_top` models (by average score) are identified
+ 3. Contrastive evaluation: From the top N models, `dynamic_contrastive_models` models are randomly selected for each item
+ 4. Item prioritization: Items with the least annotations for the selected models are prioritized
+ 5. Backoff: With probability `dynamic_backoff`, uniform random selection is used instead to maintain exploration
+
+ This approach efficiently focuses annotation resources on distinguishing between the best-performing models while ensuring all models get adequate baseline coverage. The contrastive evaluation allows for direct comparison of multiple models simultaneously.
+ For an example, see [examples/dynamic.json](examples/dynamic.json).

  ### Pre-defined User IDs and Tokens

@@ -244,6 +299,7 @@ The `users` field accepts:
  }
  ```

+
  ### Multimodal Annotations

  Support for HTML-compatible elements (YouTube embeds, `<video>` tags, images). Ensure elements are pre-styled. See [examples/multimodal.json](examples/multimodal.json).
@@ -297,6 +353,10 @@ Completion tokens are shown at annotation end for verification (download correct

  When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.

+ ### Custom Completion Messages
+
+ Customize the goodbye message shown to users when they complete all annotations using the `instructions_goodbye` field in campaign info. Supports arbitrary HTML for styling and formatting with variable replacement: `${TOKEN}` (completion token) and `${USER_ID}` (user ID). Default: `"If someone asks you for a token of completion, show them: ${TOKEN}"`.
+
  ## Terminology

  - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
@@ -324,7 +384,7 @@ When tokens are supplied, the dashboard will try to show model rankings based on
  - **Assignment**: The method for distributing items to users:
  - **Task-based**: Each user has predefined items
  - **Single-stream**: Users draw from a shared pool with random assignment
- - **Dynamic**: Work in progress
+ - **Dynamic**: Items are intelligently assigned based on model performance to focus on top models

  ## Development

@@ -357,16 +417,73 @@ See [web/src/basic.ts](web/src/basic.ts) for example.

  Run on public server or tunnel local port to public IP/domain and run locally.

- ## Misc.
+ ## Citation

  If you use this work in your paper, please cite as following.
  ```bibtex
- @misc{zouhar2025pearmut,
- author={Vilém Zouhar},
- title={Pearmut: Platform for Evaluating and Reviewing of Multilingual Tasks},
- url={https://github.com/zouharvi/pearmut/},
- year={2026},
+ @misc{zouhar2026pearmut,
+ author = {Zouhar, Vilém},
+ title = {Pearmut: Human Evaluation of Translation Made Trivial},
+ year = {2026}
  }
  ```

  Contributions are welcome! Please reach out to [Vilém Zouhar](mailto:vilem.zouhar@gmail.com).
+
+ # Changelog
+
+ - v1.0.1
+ - Support RTL languages
+ - Add boxes for references
+ - Add custom score sliders for multi-dimensional evaluation
+ - Make instructions customizable and protocol-dependent
+ - Support custom sliders
+ - Purge/reset whole tasks from dashboard
+ - Fix resetting individual users in single-stream/dynamic
+ - Fix notification stacking
+ - Add campaigns from dashboard
+ - v0.3.3
+ - Rename `doc_id` to `item_id`
+ - Add Typst, LaTeX, and PDF export for model ranking tables. Hide them by default.
+ - Add dynamic assignment type with contrastive model comparison
+ - Add `instructions_goodbye` field with variable substitution
+ - Add visual anchors at 33% and 66% on sliders
+ - Add German→English ESA tutorial with attention checks
+ - Validate document model consistency before shuffle
+ - Fix UI block on any interaction
+ - v0.3.2
+ - Revert seeding of user IDs
+ - Set ESA (Error Span Annotation) as default
+ - Update server IP address configuration
+ - Show approximate alignment by default
+ - Unify pointwise and listwise interfaces into `basic`
+ - Refactor protocol configuration (breaking change)
+ - v0.2.11
+ - Add comment field in settings panel
+ - Add `score_gt` validation for listwise comparisons
+ - Add Content-Disposition headers for proper download filenames
+ - Add model results display to dashboard with rankings
+ - Add campaign file structure validation
+ - Purge command now unlinks assets
+ - v0.2.6
+ - Add frozen annotation links feature for view-only mode
+ - Add word-level annotation mode toggle for error spans
+ - Add `[missing]` token support
+ - Improve frontend speed and cleanup toolboxes on item load
+ - Host assets via symlinks
+ - Add validation threshold for success/fail tokens
+ - Implement reset masking for annotations
+ - Allow pre-defined user IDs and tokens in campaign data
+ - v0.1.1
+ - Set server defaults and add VM launch scripts
+ - Add warning dialog when navigating away with unsaved work
+ - Add tutorial validation support for pointwise and listwise
+ - Add ability to preview existing annotations via progress bar
+ - Add support for ESA<sup>AI</sup> pre-filled error_spans
+ - Rename pairwise to listwise and update layout
+ - Implement single-stream assignment type
+ - v0.0.3
+ - Support multimodal inputs and outputs
+ - Add dashboard
+ - Implement ESA (Error Span Annotation) and MQM support
+
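
The new item format described in the README changes above can be sketched with a minimal example. The field values here are invented for illustration: only `tgt` is required, `src`, `ref`, and `item_id` are optional, and any extra keys (such as `domain` below) are simply stored in the logs.

```python
{
    "item_id": "doc12#3",             # optional, enables automatic statistical analysis
    "src": "Dnes večer bude pršet.",  # optional source string
    "ref": "It will rain tonight.",   # optional reference string
    "tgt": {                          # required: model name -> output string
        "modelA": "It is going to rain this evening.",
        "modelB": "Tonight will be raining."
    },
    "domain": "weather"               # any other keys are simply stored in the logs
}
```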
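Similarly, the `instructions_goodbye` field introduced in 0.3.3 can be combined with the documented `${TOKEN}` and `${USER_ID}` placeholders. The message below is a made-up example, not a default shipped with the package; the placeholders are substituted when the completion screen is shown.

```python
{
    "campaign_id": "my_campaign",
    "info": {
        "protocol": "ESA",
        "assignment": "task-based",
        # ${TOKEN} and ${USER_ID} are replaced at display time
        "instructions_goodbye": "<b>Thank you, ${USER_ID}!</b><br>Your completion token is <code>${TOKEN}</code>."
    },
    "data": [...]
}
```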
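Finally, the five-step dynamic assignment procedure listed in the README changes can be paraphrased as a rough Python sketch. This is not the code added in `server/assignment.py`; the helper names, data structures, and tie-breaking below are assumptions that only mirror the written description.

```python
import random

def pick_next(items, scores, counts, dynamic_top,
              dynamic_contrastive_models=1, dynamic_first=5, dynamic_backoff=0.0):
    """Pick (item, models) for the next annotation, following the described procedure.

    scores: model name -> list of scores collected so far
    counts: (item_id, model name) -> number of annotations collected so far
    """
    models = list(scores.keys())

    # Backoff: with probability dynamic_backoff, sample uniformly to keep exploring.
    if random.random() < dynamic_backoff:
        chosen = random.sample(models, k=min(dynamic_contrastive_models, len(models)))
        return random.choice(items), chosen

    # Initial phase: models with fewer than dynamic_first annotations get priority.
    cold = [m for m in models if len(scores[m]) < dynamic_first]
    if cold:
        pool = cold
    else:
        # Dynamic phase: keep only the dynamic_top models by average score.
        ranked = sorted(models, key=lambda m: sum(scores[m]) / len(scores[m]), reverse=True)
        pool = ranked[:dynamic_top]

    # Contrastive evaluation: compare several of the pooled models on the same item.
    chosen = random.sample(pool, k=min(dynamic_contrastive_models, len(pool)))

    # Item prioritization: prefer the item with the fewest annotations for the chosen models.
    item = min(items, key=lambda it: sum(counts.get((it["item_id"], m), 0) for m in chosen))
    return item, chosen
```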