pearmut 1.0.1__tar.gz → 1.0.3__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {pearmut-1.0.1 → pearmut-1.0.3}/PKG-INFO +119 -65
- {pearmut-1.0.1 → pearmut-1.0.3}/README.md +118 -64
- {pearmut-1.0.1 → pearmut-1.0.3}/pearmut.egg-info/PKG-INFO +119 -65
- {pearmut-1.0.1 → pearmut-1.0.3}/pearmut.egg-info/SOURCES.txt +2 -2
- {pearmut-1.0.1 → pearmut-1.0.3}/pyproject.toml +2 -2
- {pearmut-1.0.1 → pearmut-1.0.3}/server/app.py +56 -25
- {pearmut-1.0.1 → pearmut-1.0.3}/server/assignment.py +340 -105
- {pearmut-1.0.1 → pearmut-1.0.3}/server/cli.py +185 -104
- {pearmut-1.0.1 → pearmut-1.0.3}/server/results_export.py +1 -1
- pearmut-1.0.3/server/static/annotate.bundle.js +1 -0
- pearmut-1.0.3/server/static/annotate.html +164 -0
- pearmut-1.0.3/server/static/dashboard.bundle.js +1 -0
- {pearmut-1.0.1 → pearmut-1.0.3}/server/static/dashboard.html +6 -1
- {pearmut-1.0.1 → pearmut-1.0.3}/server/static/index.html +1 -1
- {pearmut-1.0.1 → pearmut-1.0.3}/server/static/style.css +46 -0
- {pearmut-1.0.1 → pearmut-1.0.3}/server/utils.py +40 -21
- pearmut-1.0.1/server/static/basic.bundle.js +0 -1
- pearmut-1.0.1/server/static/basic.html +0 -133
- pearmut-1.0.1/server/static/dashboard.bundle.js +0 -1
- {pearmut-1.0.1 → pearmut-1.0.3}/LICENSE +0 -0
- {pearmut-1.0.1 → pearmut-1.0.3}/pearmut.egg-info/dependency_links.txt +0 -0
- {pearmut-1.0.1 → pearmut-1.0.3}/pearmut.egg-info/entry_points.txt +0 -0
- {pearmut-1.0.1 → pearmut-1.0.3}/pearmut.egg-info/requires.txt +0 -0
- {pearmut-1.0.1 → pearmut-1.0.3}/pearmut.egg-info/top_level.txt +0 -0
- {pearmut-1.0.1 → pearmut-1.0.3}/server/constants.py +0 -0
- {pearmut-1.0.1 → pearmut-1.0.3}/server/static/favicon.svg +0 -0
- {pearmut-1.0.1 → pearmut-1.0.3}/server/static/index.bundle.js +0 -0
- {pearmut-1.0.1 → pearmut-1.0.3}/setup.cfg +0 -0

{pearmut-1.0.1 → pearmut-1.0.3}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pearmut
-Version: 1.0.1
+Version: 1.0.3
 Summary: A tool for evaluation of model outputs, primarily MT.
 Author-email: Vilém Zouhar <vilem.zouhar@gmail.com>
 License: MIT
@@ -19,7 +19,7 @@ Provides-Extra: dev
 Requires-Dist: pytest; extra == "dev"
 Dynamic: license-file
 
-# 🍐Pearmut
+# 🍐Pearmut <br> [](https://pypi.org/project/pearmut) [](https://pypi.python.org/pypi/pearmut/) [](https://pypi.org/project/pearmut/) [](https://github.com/zouharvi/pearmut/actions/workflows/test.yml) [](https://arxiv.org/abs/2601.02933)
 
 **Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
 
@@ -35,12 +35,15 @@ Dynamic: license-file
 - [Assignment Types](#assignment-types)
 - [Advanced Features](#advanced-features)
 - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
+- [Custom MQM Taxonomy](#custom-mqm-taxonomy)
 - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
+- [Form Items for User Metadata](#form-items-for-user-metadata)
 - [Pre-defined User IDs and Tokens](#pre-defined-user-ids-and-tokens)
 - [Multimodal Annotations](#multimodal-annotations)
 - [Hosting Assets](#hosting-assets)
 - [Campaign Management](#campaign-management)
 - [Custom Completion Messages](#custom-completion-messages)
+- [Prolific Integration](#prolific-integration)
 - [CLI Commands](#cli-commands)
 - [Terminology](#terminology)
 - [Development](#development)
@@ -141,6 +144,22 @@ The `shuffle` parameter in campaign `info` controls this behavior:
     "data": [...]
 }
 ```
+Documents in `data_welcome` are not shuffled and so don't need to have the same models in all documents.
+
+### Showing Model Names
+
+By default, model names are hidden to avoid biasing annotators. To display model names on top of each output block, set `show_model_names` to `true`:
+```python
+{
+    "info": {
+        "assignment": "task-based",
+        "protocol": "ESA",
+        "show_model_names": true # Default: false.
+    },
+    "campaign_id": "my_campaign",
+    "data": [...]
+}
+```
 
 ### Custom Score Sliders
 
@@ -163,6 +182,52 @@ For multi-dimensional evaluation tasks (e.g., assessing fluency on a Likert scal
 
 When `sliders` is specified, only the custom sliders are shown. Each slider must have `name`, `min`, `max`, and `step` properties. All sliders must be answered before proceeding.
 
+### Textfield for Post-editing/Translation
+
+Enable a textfield for post-editing or translation tasks using the `textfield` parameter in `info`. The textfield content is stored in annotations alongside scores and error spans.
+
+```python
+{
+    "info": {
+        "protocol": "DA",
+        "textfield": "prefilled" # Options: null, "hidden", "visible", "prefilled"
+    }
+}
+```
+
+**Textfield modes:**
+- `null` or omitted: No textfield (default)
+- `"hidden"`: Textfield hidden by default, shown by clicking a button
+- `"visible"`: Textfield always visible
+- `"prefilled"`: Textfield visible and pre-filled with the model output for post-editing
+
+### Custom MQM Taxonomy
+
+For MQM protocol campaigns, you can define a custom error taxonomy instead of using the default MQM categories. Specify `mqm_categories` in the campaign `info` section as a dictionary mapping main categories to lists of subcategories:
+
+
+```python
+{
+    "info": {
+        "assignment": "task-based",
+        "protocol": "MQM",
+        "mqm_categories": {
+            "": [], # Empty selection option
+            "General": ["", "Accuracy", "Fluency"],
+            "Audio-specific": ["", "Inaudible", "Background noise", "Speaker overlap", "Misinterpretation"],
+            "Style": ["", "Awkward", "Embarrassing"],
+            "Unknown": [] # Category with no subcategories
+        }
+    },
+    "campaign_id": "custom_mqm_example",
+    "data": [...]
+}
+```
+
+If `mqm_categories` is not provided, the default MQM taxonomy will be used. The empty string key `""` provides an unselected state in the dropdown. Categories with empty subcategory lists (e.g., `"Unknown": []`) do not require a subcategory selection.
+
+See [examples/custom_mqm.json](examples/custom_mqm.json) for a complete example.
+
 ### Custom Instructions
 
 Set campaign-level instructions using the `instructions` field in `info` (supports HTML).
@@ -252,6 +317,34 @@ The `score_greaterthan` field specifies the index of the candidate that must hav
 See [examples/tutorial/esa_deen.json](examples/tutorial/esa_deen.json) for a mock campaign with a fully prepared ESA tutorial.
 To use it, simply extract the `data` attribute and prefix it to each task in your campaign.
 
+#### Universal Tutorial Items with `data_welcome`
+
+Use `data_welcome` to add tutorial items that users must complete before starting regular tasks. The structure is a list of documents (same as `data`). Welcome items have IDs `welcome_0`, `welcome_1`, etc., and are tracked separately via `progress_welcome`.
+
+### Form Items for User Metadata
+
+Collect user information (demographics, expertise) before annotation tasks using form items in `data_welcome`.
+Form items have `text` (label/question) and `form` (field type: `null`, `"string"`, `"number"`, `"choices"`, or `"script"`).
+Documents must be homogeneous: all form items or all evaluation items.
+
+```python
+{
+    "data_welcome": [
+        [
+            {"text": "What is your native language?", "form": "string"},
+            {"text": "Rate your expertise (1-10)", "form": "number"}
+        ]
+    ]
+}
+```
+
+<img width="400" alt="Screenshot of a user form" src="https://github.com/user-attachments/assets/2310e8dc-98e9-4abf-8a27-6781b0094efe" />
+
+
+It is possible to automatically collect additional information from the host system using the `"script"` field type.
+Typically such a form document (or a sequence of them) would be stored in `"data_welcome"` so that it is both mandatory and shown to all users.
+See [examples/user_info_form.json](examples/user_info_form.json).
+
 ### Single-stream Assignment
 
 All annotators draw from a shared pool with random assignment:
@@ -265,11 +358,14 @@ All annotators draw from a shared pool with random assignment:
         # ESA: error spans and scores
         "protocol": "ESA",
         "users": 50, # number of annotators (can also be a list, see below)
+        "docs_per_user": 10, # optional: show goodbye after N documents per user
     },
     "data": [...], # list of all items (shared among all annotators)
 }
 ```
 
+Set `docs_per_user` to limit how many documents each user annotates before seeing the goodbye message (for single-stream, this is the number of documents).
+
 ### Dynamic Assignment
 
 The `dynamic` assignment type intelligently selects items based on current model performance to focus annotation effort on top-performing models using contrastive comparisons.
@@ -286,11 +382,14 @@ All items must contain outputs from all models for this assignment type to work
         "dynamic_contrastive_models": 2, # how many models to compare per item (optional, default: 1)
         "dynamic_first": 5, # annotations per model before dynamic kicks in (optional, default: 5)
        "dynamic_backoff": 0.1, # probability of uniform sampling (optional, default: 0)
+        "docs_per_user": 20, # optional: show goodbye after N documents per user
     },
     "data": [...], # list of all items (shared among all annotators)
 }
 ```
 
+Set `docs_per_user` to limit how many documents each user annotates before seeing the goodbye message (for dynamic, this is roughly the number of documents × models).
+
 **How it works:**
 1. Initial phase: Each model gets `dynamic_first` annotations with fully random contrastive evaluation
 2. Dynamic phase: After the initial phase, top `dynamic_top` models (by average score) are identified
@@ -378,6 +477,14 @@ When tokens are supplied, the dashboard will try to show model rankings based on
 
 Customize the goodbye message shown to users when they complete all annotations using the `instructions_goodbye` field in campaign info. Supports arbitrary HTML for styling and formatting with variable replacement: `${TOKEN}` (completion token) and `${USER_ID}` (user ID). Default: `"If someone asks you for a token of completion, show them: ${TOKEN}"`.
 
+### Prolific Integration
+
+Use task-based assignment with Prolific. For each task, Pearmut generates a unique URL which can be uploaded to Prolific's interface. Add a redirect (on completion) to `instructions_goodbye`:
+```json
+"instructions_goodbye": "<a href='https://app.prolific.com/submissions/complete?cc=${TOKEN}'>Click here to return to Prolific</a>"
+```
+`${TOKEN}` is automatically replaced based on whether attention checks were passed (see [Attention checks](#tutorial-and-attention-checks) and [Pre-defined tokens](#pre-defined-user-ids-and-tokens)).
+
 ## Terminology
 
 - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
@@ -401,7 +508,7 @@ Customize the goodbye message shown to users when they complete all annotations
 - **Score**: Numeric quality rating (0-100)
 - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
 - **Error Categories**: MQM taxonomy labels for errors
-- **Template**: The annotation interface type. The `
+- **Template**: The annotation interface type. The `annotate` template supports comparing multiple outputs simultaneously.
 - **Assignment**: The method for distributing items to users:
   - **Task-based**: Each user has predefined items
   - **Single-stream**: Users draw from a shared pool with random assignment
@@ -432,7 +539,7 @@ pearmut run
 2. Add build rule to `webpack.config.js`
 3. Reference as `info->template` in campaign JSON
 
-See [web/src/
+See [web/src/annotate.ts](web/src/annotate.ts) for an example.
 
 ### Deployment
 
@@ -443,68 +550,15 @@ Run on public server or tunnel local port to public IP/domain and run locally.
 If you use this work in your paper, please cite it as follows.
 ```bibtex
 @misc{zouhar2026pearmut,
-
-
-
+    title={Pearmut: Human Evaluation of Translation Made Trivial},
+    author={Vilém Zouhar and Tom Kocmi},
+    year={2026},
+    eprint={2601.02933},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL},
+    url={https://arxiv.org/abs/2601.02933},
 }
 ```
 
 Contributions are welcome! Please reach out to [Vilém Zouhar](mailto:vilem.zouhar@gmail.com).
-
-# Changelog
-
-- v1.0.1
-  - Support RTL languages
-  - Add boxes for references
-  - Add custom score sliders for multi-dimensional evaluation
-  - Make instructions customizable and protocol-dependent
-  - Support custom sliders
-  - Purge/reset whole tasks from dashboard
-  - Fix resetting individual users in single-stream/dynamic
-  - Fix notification stacking
-  - Add campaigns from dashboard
-- v0.3.3
-  - Rename `doc_id` to `item_id`
-  - Add Typst, LaTeX, and PDF export for model ranking tables. Hide them by default.
-  - Add dynamic assignment type with contrastive model comparison
-  - Add `instructions_goodbye` field with variable substitution
-  - Add visual anchors at 33% and 66% on sliders
-  - Add German→English ESA tutorial with attention checks
-  - Validate document model consistency before shuffle
-  - Fix UI block on any interaction
-- v0.3.2
-  - Revert seeding of user IDs
-  - Set ESA (Error Span Annotation) as default
-  - Update server IP address configuration
-  - Show approximate alignment by default
-  - Unify pointwise and listwise interfaces into `basic`
-  - Refactor protocol configuration (breaking change)
-- v0.2.11
-  - Add comment field in settings panel
-  - Add `score_gt` validation for listwise comparisons
-  - Add Content-Disposition headers for proper download filenames
-  - Add model results display to dashboard with rankings
-  - Add campaign file structure validation
-  - Purge command now unlinks assets
-- v0.2.6
-  - Add frozen annotation links feature for view-only mode
-  - Add word-level annotation mode toggle for error spans
-  - Add `[missing]` token support
-  - Improve frontend speed and cleanup toolboxes on item load
-  - Host assets via symlinks
-  - Add validation threshold for success/fail tokens
-  - Implement reset masking for annotations
-  - Allow pre-defined user IDs and tokens in campaign data
-- v0.1.1
-  - Set server defaults and add VM launch scripts
-  - Add warning dialog when navigating away with unsaved work
-  - Add tutorial validation support for pointwise and listwise
-  - Add ability to preview existing annotations via progress bar
-  - Add support for ESA<sup>AI</sup> pre-filled error_spans
-  - Rename pairwise to listwise and update layout
-  - Implement single-stream assignment type
-- v0.0.3
-  - Support multimodal inputs and outputs
-  - Add dashboard
-  - Implement ESA (Error Span Annotation) and MQM support
-
+See changes in [CHANGELOG.md](CHANGELOG.md).

{pearmut-1.0.1 → pearmut-1.0.3}/README.md

@@ -1,4 +1,4 @@
-# 🍐Pearmut
+# 🍐Pearmut <br> [](https://pypi.org/project/pearmut) [](https://pypi.python.org/pypi/pearmut/) [](https://pypi.org/project/pearmut/) [](https://github.com/zouharvi/pearmut/actions/workflows/test.yml) [](https://arxiv.org/abs/2601.02933)
 
 **Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
 
@@ -14,12 +14,15 @@
 - [Assignment Types](#assignment-types)
 - [Advanced Features](#advanced-features)
 - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
+- [Custom MQM Taxonomy](#custom-mqm-taxonomy)
 - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
+- [Form Items for User Metadata](#form-items-for-user-metadata)
 - [Pre-defined User IDs and Tokens](#pre-defined-user-ids-and-tokens)
 - [Multimodal Annotations](#multimodal-annotations)
 - [Hosting Assets](#hosting-assets)
 - [Campaign Management](#campaign-management)
 - [Custom Completion Messages](#custom-completion-messages)
+- [Prolific Integration](#prolific-integration)
 - [CLI Commands](#cli-commands)
 - [Terminology](#terminology)
 - [Development](#development)
@@ -120,6 +123,22 @@ The `shuffle` parameter in campaign `info` controls this behavior:
     "data": [...]
 }
 ```
+Documents in `data_welcome` are not shuffled and so don't need to have the same models in all documents.
+
+### Showing Model Names
+
+By default, model names are hidden to avoid biasing annotators. To display model names on top of each output block, set `show_model_names` to `true`:
+```python
+{
+    "info": {
+        "assignment": "task-based",
+        "protocol": "ESA",
+        "show_model_names": true # Default: false.
+    },
+    "campaign_id": "my_campaign",
+    "data": [...]
+}
+```
 
 ### Custom Score Sliders
 
@@ -142,6 +161,52 @@ For multi-dimensional evaluation tasks (e.g., assessing fluency on a Likert scal
 
 When `sliders` is specified, only the custom sliders are shown. Each slider must have `name`, `min`, `max`, and `step` properties. All sliders must be answered before proceeding.
 
+### Textfield for Post-editing/Translation
+
+Enable a textfield for post-editing or translation tasks using the `textfield` parameter in `info`. The textfield content is stored in annotations alongside scores and error spans.
+
+```python
+{
+    "info": {
+        "protocol": "DA",
+        "textfield": "prefilled" # Options: null, "hidden", "visible", "prefilled"
+    }
+}
+```
+
+**Textfield modes:**
+- `null` or omitted: No textfield (default)
+- `"hidden"`: Textfield hidden by default, shown by clicking a button
+- `"visible"`: Textfield always visible
+- `"prefilled"`: Textfield visible and pre-filled with the model output for post-editing
+
+### Custom MQM Taxonomy
+
+For MQM protocol campaigns, you can define a custom error taxonomy instead of using the default MQM categories. Specify `mqm_categories` in the campaign `info` section as a dictionary mapping main categories to lists of subcategories:
+
+
+```python
+{
+    "info": {
+        "assignment": "task-based",
+        "protocol": "MQM",
+        "mqm_categories": {
+            "": [], # Empty selection option
+            "General": ["", "Accuracy", "Fluency"],
+            "Audio-specific": ["", "Inaudible", "Background noise", "Speaker overlap", "Misinterpretation"],
+            "Style": ["", "Awkward", "Embarrassing"],
+            "Unknown": [] # Category with no subcategories
+        }
+    },
+    "campaign_id": "custom_mqm_example",
+    "data": [...]
+}
+```
+
+If `mqm_categories` is not provided, the default MQM taxonomy will be used. The empty string key `""` provides an unselected state in the dropdown. Categories with empty subcategory lists (e.g., `"Unknown": []`) do not require a subcategory selection.
+
+See [examples/custom_mqm.json](examples/custom_mqm.json) for a complete example.
+
 ### Custom Instructions
 
 Set campaign-level instructions using the `instructions` field in `info` (supports HTML).
@@ -231,6 +296,34 @@ The `score_greaterthan` field specifies the index of the candidate that must hav
 See [examples/tutorial/esa_deen.json](examples/tutorial/esa_deen.json) for a mock campaign with a fully prepared ESA tutorial.
 To use it, simply extract the `data` attribute and prefix it to each task in your campaign.
 
+#### Universal Tutorial Items with `data_welcome`
+
+Use `data_welcome` to add tutorial items that users must complete before starting regular tasks. The structure is a list of documents (same as `data`). Welcome items have IDs `welcome_0`, `welcome_1`, etc., and are tracked separately via `progress_welcome`.
+
+### Form Items for User Metadata
+
+Collect user information (demographics, expertise) before annotation tasks using form items in `data_welcome`.
+Form items have `text` (label/question) and `form` (field type: `null`, `"string"`, `"number"`, `"choices"`, or `"script"`).
+Documents must be homogeneous: all form items or all evaluation items.
+
+```python
+{
+    "data_welcome": [
+        [
+            {"text": "What is your native language?", "form": "string"},
+            {"text": "Rate your expertise (1-10)", "form": "number"}
+        ]
+    ]
+}
+```
+
+<img width="400" alt="Screenshot of a user form" src="https://github.com/user-attachments/assets/2310e8dc-98e9-4abf-8a27-6781b0094efe" />
+
+
+It is possible to automatically collect additional information from the host system using the `"script"` field type.
+Typically such a form document (or a sequence of them) would be stored in `"data_welcome"` so that it is both mandatory and shown to all users.
+See [examples/user_info_form.json](examples/user_info_form.json).
+
 ### Single-stream Assignment
 
 All annotators draw from a shared pool with random assignment:
@@ -244,11 +337,14 @@ All annotators draw from a shared pool with random assignment:
         # ESA: error spans and scores
         "protocol": "ESA",
         "users": 50, # number of annotators (can also be a list, see below)
+        "docs_per_user": 10, # optional: show goodbye after N documents per user
     },
     "data": [...], # list of all items (shared among all annotators)
 }
 ```
 
+Set `docs_per_user` to limit how many documents each user annotates before seeing the goodbye message (for single-stream, this is the number of documents).
+
 ### Dynamic Assignment
 
 The `dynamic` assignment type intelligently selects items based on current model performance to focus annotation effort on top-performing models using contrastive comparisons.
@@ -265,11 +361,14 @@ All items must contain outputs from all models for this assignment type to work
         "dynamic_contrastive_models": 2, # how many models to compare per item (optional, default: 1)
         "dynamic_first": 5, # annotations per model before dynamic kicks in (optional, default: 5)
         "dynamic_backoff": 0.1, # probability of uniform sampling (optional, default: 0)
+        "docs_per_user": 20, # optional: show goodbye after N documents per user
     },
     "data": [...], # list of all items (shared among all annotators)
 }
 ```
 
+Set `docs_per_user` to limit how many documents each user annotates before seeing the goodbye message (for dynamic, this is roughly the number of documents × models).
+
 **How it works:**
 1. Initial phase: Each model gets `dynamic_first` annotations with fully random contrastive evaluation
 2. Dynamic phase: After the initial phase, top `dynamic_top` models (by average score) are identified
@@ -357,6 +456,14 @@ When tokens are supplied, the dashboard will try to show model rankings based on
 
 Customize the goodbye message shown to users when they complete all annotations using the `instructions_goodbye` field in campaign info. Supports arbitrary HTML for styling and formatting with variable replacement: `${TOKEN}` (completion token) and `${USER_ID}` (user ID). Default: `"If someone asks you for a token of completion, show them: ${TOKEN}"`.
 
+### Prolific Integration
+
+Use task-based assignment with Prolific. For each task, Pearmut generates a unique URL which can be uploaded to Prolific's interface. Add a redirect (on completion) to `instructions_goodbye`:
+```json
+"instructions_goodbye": "<a href='https://app.prolific.com/submissions/complete?cc=${TOKEN}'>Click here to return to Prolific</a>"
+```
+`${TOKEN}` is automatically replaced based on whether attention checks were passed (see [Attention checks](#tutorial-and-attention-checks) and [Pre-defined tokens](#pre-defined-user-ids-and-tokens)).
+
 ## Terminology
 
 - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
@@ -380,7 +487,7 @@ Customize the goodbye message shown to users when they complete all annotations
 - **Score**: Numeric quality rating (0-100)
 - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
 - **Error Categories**: MQM taxonomy labels for errors
-- **Template**: The annotation interface type. The `
+- **Template**: The annotation interface type. The `annotate` template supports comparing multiple outputs simultaneously.
 - **Assignment**: The method for distributing items to users:
   - **Task-based**: Each user has predefined items
   - **Single-stream**: Users draw from a shared pool with random assignment
@@ -411,7 +518,7 @@ pearmut run
 2. Add build rule to `webpack.config.js`
 3. Reference as `info->template` in campaign JSON
 
-See [web/src/
+See [web/src/annotate.ts](web/src/annotate.ts) for an example.
 
 ### Deployment
 
@@ -422,68 +529,15 @@ Run on public server or tunnel local port to public IP/domain and run locally.
 If you use this work in your paper, please cite it as follows.
 ```bibtex
 @misc{zouhar2026pearmut,
-
-
-
+    title={Pearmut: Human Evaluation of Translation Made Trivial},
+    author={Vilém Zouhar and Tom Kocmi},
+    year={2026},
+    eprint={2601.02933},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL},
+    url={https://arxiv.org/abs/2601.02933},
 }
 ```
 
 Contributions are welcome! Please reach out to [Vilém Zouhar](mailto:vilem.zouhar@gmail.com).
-
-# Changelog
-
-- v1.0.1
-  - Support RTL languages
-  - Add boxes for references
-  - Add custom score sliders for multi-dimensional evaluation
-  - Make instructions customizable and protocol-dependent
-  - Support custom sliders
-  - Purge/reset whole tasks from dashboard
-  - Fix resetting individual users in single-stream/dynamic
-  - Fix notification stacking
-  - Add campaigns from dashboard
-- v0.3.3
-  - Rename `doc_id` to `item_id`
-  - Add Typst, LaTeX, and PDF export for model ranking tables. Hide them by default.
-  - Add dynamic assignment type with contrastive model comparison
-  - Add `instructions_goodbye` field with variable substitution
-  - Add visual anchors at 33% and 66% on sliders
-  - Add German→English ESA tutorial with attention checks
-  - Validate document model consistency before shuffle
-  - Fix UI block on any interaction
-- v0.3.2
-  - Revert seeding of user IDs
-  - Set ESA (Error Span Annotation) as default
-  - Update server IP address configuration
-  - Show approximate alignment by default
-  - Unify pointwise and listwise interfaces into `basic`
-  - Refactor protocol configuration (breaking change)
-- v0.2.11
-  - Add comment field in settings panel
-  - Add `score_gt` validation for listwise comparisons
-  - Add Content-Disposition headers for proper download filenames
-  - Add model results display to dashboard with rankings
-  - Add campaign file structure validation
-  - Purge command now unlinks assets
-- v0.2.6
-  - Add frozen annotation links feature for view-only mode
-  - Add word-level annotation mode toggle for error spans
-  - Add `[missing]` token support
-  - Improve frontend speed and cleanup toolboxes on item load
-  - Host assets via symlinks
-  - Add validation threshold for success/fail tokens
-  - Implement reset masking for annotations
-  - Allow pre-defined user IDs and tokens in campaign data
-- v0.1.1
-  - Set server defaults and add VM launch scripts
-  - Add warning dialog when navigating away with unsaved work
-  - Add tutorial validation support for pointwise and listwise
-  - Add ability to preview existing annotations via progress bar
-  - Add support for ESA<sup>AI</sup> pre-filled error_spans
-  - Rename pairwise to listwise and update layout
-  - Implement single-stream assignment type
-- v0.0.3
-  - Support multimodal inputs and outputs
-  - Add dashboard
-  - Implement ESA (Error Span Annotation) and MQM support
-
+See changes in [CHANGELOG.md](CHANGELOG.md).
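
The README changes above introduce `show_model_names`, `docs_per_user`, `data_welcome` (tutorial and form items), and a Prolific-style `instructions_goodbye`. Below is a minimal campaign-file sketch combining them, assuming the field semantics described in the diffed README; the campaign ID, user count, and welcome question are illustrative placeholders, and `data` is left empty for brevity.

```json
{
    "info": {
        "assignment": "single-stream",
        "protocol": "ESA",
        "users": 10,
        "docs_per_user": 5,
        "show_model_names": false,
        "instructions_goodbye": "<a href='https://app.prolific.com/submissions/complete?cc=${TOKEN}'>Click here to return to Prolific</a>"
    },
    "campaign_id": "example_campaign",
    "data_welcome": [
        [
            {"text": "What is your native language?", "form": "string"}
        ]
    ],
    "data": []
}
```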