pearmut 0.2.11__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (27)
  1. {pearmut-0.2.11 → pearmut-0.3.0}/PKG-INFO +69 -74
  2. {pearmut-0.2.11 → pearmut-0.3.0}/README.md +68 -73
  3. {pearmut-0.2.11 → pearmut-0.3.0}/pearmut.egg-info/PKG-INFO +69 -74
  4. {pearmut-0.2.11 → pearmut-0.3.0}/pearmut.egg-info/SOURCES.txt +2 -4
  5. {pearmut-0.2.11 → pearmut-0.3.0}/pyproject.toml +1 -1
  6. {pearmut-0.2.11 → pearmut-0.3.0}/server/app.py +9 -19
  7. {pearmut-0.2.11 → pearmut-0.3.0}/server/assignment.py +6 -6
  8. {pearmut-0.2.11 → pearmut-0.3.0}/server/cli.py +88 -22
  9. pearmut-0.3.0/server/static/basic.bundle.js +1 -0
  10. pearmut-0.3.0/server/static/basic.html +74 -0
  11. pearmut-0.3.0/server/static/dashboard.bundle.js +1 -0
  12. {pearmut-0.2.11 → pearmut-0.3.0}/server/static/style.css +1 -2
  13. {pearmut-0.2.11 → pearmut-0.3.0}/server/utils.py +1 -32
  14. pearmut-0.2.11/server/static/dashboard.bundle.js +0 -1
  15. pearmut-0.2.11/server/static/listwise.bundle.js +0 -1
  16. pearmut-0.2.11/server/static/listwise.html +0 -77
  17. pearmut-0.2.11/server/static/pointwise.bundle.js +0 -1
  18. pearmut-0.2.11/server/static/pointwise.html +0 -69
  19. {pearmut-0.2.11 → pearmut-0.3.0}/LICENSE +0 -0
  20. {pearmut-0.2.11 → pearmut-0.3.0}/pearmut.egg-info/dependency_links.txt +0 -0
  21. {pearmut-0.2.11 → pearmut-0.3.0}/pearmut.egg-info/entry_points.txt +0 -0
  22. {pearmut-0.2.11 → pearmut-0.3.0}/pearmut.egg-info/requires.txt +0 -0
  23. {pearmut-0.2.11 → pearmut-0.3.0}/pearmut.egg-info/top_level.txt +0 -0
  24. {pearmut-0.2.11 → pearmut-0.3.0}/server/static/dashboard.html +0 -0
  25. {pearmut-0.2.11 → pearmut-0.3.0}/server/static/favicon.svg +0 -0
  26. {pearmut-0.2.11 → pearmut-0.3.0}/server/static/index.html +0 -0
  27. {pearmut-0.2.11 → pearmut-0.3.0}/setup.cfg +0 -0
PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: pearmut
- Version: 0.2.11
+ Version: 0.3.0
  Summary: A tool for evaluation of model outputs, primarily MT.
  Author-email: Vilém Zouhar <vilem.zouhar@gmail.com>
  License: MIT
@@ -20,7 +20,7 @@ Dynamic: license-file

  # Pearmut 🍐

- **Platform for Evaluation and Reviewing of Multilingual Tasks** Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
+ **Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).

  [![PyPi version](https://badgen.net/pypi/v/pearmut/)](https://pypi.org/project/pearmut)
  &nbsp;
@@ -38,7 +38,6 @@ Dynamic: license-file
  - [Campaign Configuration](#campaign-configuration)
  - [Basic Structure](#basic-structure)
  - [Assignment Types](#assignment-types)
- - [Protocol Templates](#protocol-templates)
  - [Advanced Features](#advanced-features)
  - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
  - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
@@ -51,19 +50,16 @@ Dynamic: license-file
  - [Development](#development)
  - [Citation](#citation)

-
- **Error Span** — A highlighted segment of text marked as containing an error, with optional severity (`minor`, `major`, `neutral`) and MQM category labels.
-
  ## Quick Start

  Install and run locally without cloning:
  ```bash
  pip install pearmut
  # Download example campaigns
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa_encs.json
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da_enuk.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
  # Load and start
- pearmut add esa_encs.json da_enuk.json
+ pearmut add esa.json da.json
  pearmut run
  ```

@@ -76,10 +72,10 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
  "info": {
  "assignment": "task-based",
- "template": "pointwise",
- "protocol_score": true, # we want scores [0...100] for each segment
- "protocol_error_spans": true, # we want error spans
- "protocol_error_categories": false, # we do not want error span categories
+ # DA: scores
+ # ESA: error spans and scores
+ # MQM: error spans, categories, and scores
+ "protocol": "ESA",
  },
  "campaign_id": "wmt25_#_en-cs_CZ",
  "data": [
@@ -90,11 +86,11 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
  "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
  "src": "This will be the year that Guinness loses its cool. Cheers to that!",
- "tgt": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."
+ "tgt": ["Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."]
  },
  {
  "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
- "tgt": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"
+ "tgt": ["Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"]
  }
  ...
  ],
@@ -114,11 +110,11 @@ Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dicti
  [
  {
  "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
- "tgt": "And suddenly all the water became full of other people and other people." # required
+ "tgt": ["And suddenly all the water became full of other people and other people."] # required (array)
  },
  {
  "src": "toto je pokračování stejného dokumentu",
- "tgt": "this is a continuation of the same document"
+ "tgt": ["this is a continuation of the same document"]
  # Additional keys stored for analysis
  }
  ]
@@ -136,16 +132,23 @@ pearmut run
  - **`single-stream`**: All users draw from a shared pool (random assignment)
  - **`dynamic`**: work in progress ⚠️

- ### Protocol Templates
+ ## Advanced Features

- - **Pointwise**: Evaluate single output against single input
- - `protocol_score`: Collect scores [0-100]
- - `protocol_error_spans`: Collect error span highlights
- - `protocol_error_categories`: Collect MQM category labels
- - **Listwise**: Evaluate multiple outputs simultaneously
- - Same protocol options as pointwise
+ ### Shuffling Model Translations

- ## Advanced Features
+ By default, Pearmut randomly shuffles the order in which models are shown per each item in order to avoid positional bias.
+ The `shuffle` parameter in campaign `info` controls this behavior:
+ ```python
+ {
+ "info": {
+ "assignment": "task-based",
+ "protocol": "ESA",
+ "shuffle": true # Default: true. Set to false to disable shuffling.
+ },
+ "campaign_id": "my_campaign",
+ "data": [...]
+ }
+ ```

  ### Pre-filled Error Spans (ESA<sup>AI</sup>)

@@ -154,25 +157,27 @@ Include `error_spans` to pre-fill annotations that users can review, modify, or
  ```python
  {
  "src": "The quick brown fox jumps over the lazy dog.",
- "tgt": "Rychlá hnědá liška skáče přes líného psa.",
+ "tgt": ["Rychlá hnědá liška skáče přes líného psa."],
  "error_spans": [
- {
- "start_i": 0, # character index start (inclusive)
- "end_i": 5, # character index end (inclusive)
- "severity": "minor", # "minor", "major", "neutral", or null
- "category": null # MQM category string or null
- },
- {
- "start_i": 27,
- "end_i": 32,
- "severity": "major",
- "category": null
- }
+ [
+ {
+ "start_i": 0, # character index start (inclusive)
+ "end_i": 5, # character index end (inclusive)
+ "severity": "minor", # "minor", "major", "neutral", or null
+ "category": null # MQM category string or null
+ },
+ {
+ "start_i": 27,
+ "end_i": 32,
+ "severity": "major",
+ "category": null
+ }
+ ]
  ]
  }
  ```

- For **listwise** template, `error_spans` is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).
+ The `error_spans` field is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).

  ### Tutorial and Attention Checks

@@ -181,13 +186,15 @@ Add `validation` rules for tutorials or attention checks:
  ```python
  {
  "src": "The quick brown fox jumps.",
- "tgt": "Rychlá hnědá liška skáče.",
- "validation": {
- "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
- "score": [70, 80], # required score range [min, max]
- "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
- "allow_skip": true # show "skip tutorial" button
- }
+ "tgt": ["Rychlá hnědá liška skáče."],
+ "validation": [
+ {
+ "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
+ "score": [70, 80], # required score range [min, max]
+ "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
+ "allow_skip": true # show "skip tutorial" button
+ }
+ ]
  }
  ```

@@ -196,9 +203,9 @@ Add `validation` rules for tutorials or attention checks:
  - **Loud attention checks**: Include `warning` without `allow_skip` to force retry
  - **Silent attention checks**: Omit `warning` to log failures without notification (quality control)

- For listwise, `validation` is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).
+ The `validation` field is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).

- **Listwise score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
+ **Score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
  ```python
  {
  "src": "AI transforms industries.",
@@ -210,8 +217,7 @@ For listwise, `validation` is an array (one per candidate). Dashboard shows ✅/
  }
  ```
  The `score_greaterthan` field specifies the index of the candidate that must have a lower score than the current candidate.
-
- See [examples/tutorial_pointwise.json](examples/tutorial_pointwise.json), [examples/tutorial_listwise.json](examples/tutorial_listwise.json), and [examples/tutorial_listwise_score_greaterthan.json](examples/tutorial_listwise_score_greaterthan.json).
+ See [examples/tutorial_kway.json](examples/tutorial_kway.json).

  ### Single-stream Assignment

@@ -221,10 +227,10 @@ All annotators draw from a shared pool with random assignment:
  "campaign_id": "my campaign 6",
  "info": {
  "assignment": "single-stream",
- "template": "pointwise",
- "protocol_score": True, # collect scores
- "protocol_error_spans": True, # collect error spans
- "protocol_error_categories": False, # do not collect MQM categories, so ESA
+ # DA: scores
+ # MQM: error spans and categories
+ # ESA: error spans and scores
+ "protocol": "ESA",
  "users": 50, # number of annotators (can also be a list, see below)
  },
  "data": [...], # list of all items (shared among all annotators)
@@ -302,30 +308,21 @@ Completion tokens are shown at annotation end for verification (download correct

  <img width="500" alt="Token on completion" src="https://github.com/user-attachments/assets/40eb904c-f47a-4011-aa63-9a4f1c501549" />

- ### Model Results Display
-
- Add `&results` to dashboard URL to show model rankings (requires valid token).
- Items need `model` field (pointwise) or `models` field (listwise) and the `protocol_score` needs to be enable such that the `score` can be used for the ranking:
- ```python
- {"doc_id": "1", "model": "CommandA", "src": "...", "tgt": "..."}
- {"doc_id": "2", "models": ["CommandA", "Claude"], "src": "...", "tgt": ["...", "..."]}
- ```
- See an example in [Campaign Management](#campaign-management)
-
+ When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.

  ## Terminology

  - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
  - **Campaign File**: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
- - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns.
+ - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
  - **Task**: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
- - **Item** A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
- - **Document** A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
+ - **Item**: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
+ - **Document**: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
  - **User** / **Annotator**: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
- - **Attention Check** A validation item with known correct answers used to ensure annotator quality. Can be:
+ - **Attention Check**: A validation item with known correct answers used to ensure annotator quality. Can be:
  - **Loud**: Shows warning message and forces retry on failure
  - **Silent**: Logs failures without notifying the user (for quality control analysis)
- - **Token** A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
+ - **Token**: A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
  - **Pass Token** (`token_pass`): Shown when user meets validation thresholds
  - **Fail Token** (`token_fail`): Shown when user fails to meet validation requirements
  - **Tutorial**: An instructional validation item that teaches users how to annotate. Includes `allow_skip: true` to let users skip if they have seen it before.
@@ -334,11 +331,9 @@ See an example in [Campaign Management](#campaign-management)
  - **Dashboard**: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
  - **Protocol**: The annotation scheme defining what data is collected:
  - **Score**: Numeric quality rating (0-100)
- - **Error Spans**: Text highlights marking errors
+ - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
  - **Error Categories**: MQM taxonomy labels for errors
- - **Template**: The annotation interface type:
- - **Pointwise**: Evaluate one output at a time
- - **Listwise**: Compare multiple outputs simultaneously
+ - **Template**: The annotation interface type. The `basic` template supports comparing multiple outputs simultaneously.
  - **Assignment**: The method for distributing items to users:
  - **Task-based**: Each user has predefined items
  - **Single-stream**: Users draw from a shared pool with random assignment
@@ -369,7 +364,7 @@ pearmut run
  2. Add build rule to `webpack.config.js`
  3. Reference as `info->template` in campaign JSON

- See [web/src/pointwise.ts](web/src/pointwise.ts) for example.
+ See [web/src/basic.ts](web/src/basic.ts) for example.

  ### Deployment

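Taken together, the hunks above show the 0.3.0 schema changes: the `template`/`protocol_*` flags collapse into a single `protocol` name (DA, ESA, or MQM), `tgt` becomes a list of candidate translations, and `validation`/`error_spans` become per-candidate arrays. The following is a minimal sketch of how an existing 0.2.x campaign file might be converted, based only on what is visible in this diff; the `migrate_campaign` helper and the file name are hypothetical, not part of pearmut.

```python
import json

def migrate_campaign(old: dict) -> dict:
    """Hypothetical sketch: convert a 0.2.x campaign dict to the 0.3.0 layout
    suggested by this diff. Not an official pearmut utility."""
    new = json.loads(json.dumps(old))  # cheap deep copy
    info = new["info"]

    # The protocol_* booleans collapse into a single "protocol" name,
    # and the pointwise/listwise "template" setting disappears.
    if "protocol" not in info:
        spans = info.pop("protocol_error_spans", False)
        categories = info.pop("protocol_error_categories", False)
        info.pop("protocol_score", None)
        info.pop("template", None)
        info["protocol"] = "MQM" if categories else ("ESA" if spans else "DA")

    def fix_doc(doc: dict) -> None:
        # "tgt" is now a list of candidate translations.
        if isinstance(doc.get("tgt"), str):
            doc["tgt"] = [doc["tgt"]]
        # "validation" and "error_spans" become per-candidate arrays.
        if isinstance(doc.get("validation"), dict):
            doc["validation"] = [doc["validation"]]
        spans = doc.get("error_spans")
        if spans and isinstance(spans[0], dict):
            doc["error_spans"] = [spans]

    for item in new.get("data", []):
        for doc in (item if isinstance(item, list) else [item]):
            fix_doc(doc)
    return new

# Usage (file name is illustrative):
with open("campaign_0_2.json", encoding="utf-8") as f:
    migrated = migrate_campaign(json.load(f))
print(json.dumps(migrated, ensure_ascii=False, indent=2))
```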
README.md
@@ -1,6 +1,6 @@
  # Pearmut 🍐

- **Platform for Evaluation and Reviewing of Multilingual Tasks** Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
+ **Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).

  [![PyPi version](https://badgen.net/pypi/v/pearmut/)](https://pypi.org/project/pearmut)
  &nbsp;
@@ -18,7 +18,6 @@
  - [Campaign Configuration](#campaign-configuration)
  - [Basic Structure](#basic-structure)
  - [Assignment Types](#assignment-types)
- - [Protocol Templates](#protocol-templates)
  - [Advanced Features](#advanced-features)
  - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
  - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
@@ -31,19 +30,16 @@
  - [Development](#development)
  - [Citation](#citation)

-
- **Error Span** — A highlighted segment of text marked as containing an error, with optional severity (`minor`, `major`, `neutral`) and MQM category labels.
-
  ## Quick Start

  Install and run locally without cloning:
  ```bash
  pip install pearmut
  # Download example campaigns
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa_encs.json
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da_enuk.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
  # Load and start
- pearmut add esa_encs.json da_enuk.json
+ pearmut add esa.json da.json
  pearmut run
  ```

@@ -56,10 +52,10 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
  "info": {
  "assignment": "task-based",
- "template": "pointwise",
- "protocol_score": true, # we want scores [0...100] for each segment
- "protocol_error_spans": true, # we want error spans
- "protocol_error_categories": false, # we do not want error span categories
+ # DA: scores
+ # ESA: error spans and scores
+ # MQM: error spans, categories, and scores
+ "protocol": "ESA",
  },
  "campaign_id": "wmt25_#_en-cs_CZ",
  "data": [
@@ -70,11 +66,11 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
  "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
  "src": "This will be the year that Guinness loses its cool. Cheers to that!",
- "tgt": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."
+ "tgt": ["Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."]
  },
  {
  "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
- "tgt": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"
+ "tgt": ["Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"]
  }
  ...
  ],
@@ -94,11 +90,11 @@ Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dicti
  [
  {
  "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
- "tgt": "And suddenly all the water became full of other people and other people." # required
+ "tgt": ["And suddenly all the water became full of other people and other people."] # required (array)
  },
  {
  "src": "toto je pokračování stejného dokumentu",
- "tgt": "this is a continuation of the same document"
+ "tgt": ["this is a continuation of the same document"]
  # Additional keys stored for analysis
  }
  ]
@@ -116,16 +112,23 @@ pearmut run
  - **`single-stream`**: All users draw from a shared pool (random assignment)
  - **`dynamic`**: work in progress ⚠️

- ### Protocol Templates
+ ## Advanced Features

- - **Pointwise**: Evaluate single output against single input
- - `protocol_score`: Collect scores [0-100]
- - `protocol_error_spans`: Collect error span highlights
- - `protocol_error_categories`: Collect MQM category labels
- - **Listwise**: Evaluate multiple outputs simultaneously
- - Same protocol options as pointwise
+ ### Shuffling Model Translations

- ## Advanced Features
+ By default, Pearmut randomly shuffles the order in which models are shown per each item in order to avoid positional bias.
+ The `shuffle` parameter in campaign `info` controls this behavior:
+ ```python
+ {
+ "info": {
+ "assignment": "task-based",
+ "protocol": "ESA",
+ "shuffle": true # Default: true. Set to false to disable shuffling.
+ },
+ "campaign_id": "my_campaign",
+ "data": [...]
+ }
+ ```

  ### Pre-filled Error Spans (ESA<sup>AI</sup>)

@@ -134,25 +137,27 @@ Include `error_spans` to pre-fill annotations that users can review, modify, or
  ```python
  {
  "src": "The quick brown fox jumps over the lazy dog.",
- "tgt": "Rychlá hnědá liška skáče přes líného psa.",
+ "tgt": ["Rychlá hnědá liška skáče přes líného psa."],
  "error_spans": [
- {
- "start_i": 0, # character index start (inclusive)
- "end_i": 5, # character index end (inclusive)
- "severity": "minor", # "minor", "major", "neutral", or null
- "category": null # MQM category string or null
- },
- {
- "start_i": 27,
- "end_i": 32,
- "severity": "major",
- "category": null
- }
+ [
+ {
+ "start_i": 0, # character index start (inclusive)
+ "end_i": 5, # character index end (inclusive)
+ "severity": "minor", # "minor", "major", "neutral", or null
+ "category": null # MQM category string or null
+ },
+ {
+ "start_i": 27,
+ "end_i": 32,
+ "severity": "major",
+ "category": null
+ }
+ ]
  ]
  }
  ```

- For **listwise** template, `error_spans` is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).
+ The `error_spans` field is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).

  ### Tutorial and Attention Checks

@@ -161,13 +166,15 @@ Add `validation` rules for tutorials or attention checks:
  ```python
  {
  "src": "The quick brown fox jumps.",
- "tgt": "Rychlá hnědá liška skáče.",
- "validation": {
- "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
- "score": [70, 80], # required score range [min, max]
- "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
- "allow_skip": true # show "skip tutorial" button
- }
+ "tgt": ["Rychlá hnědá liška skáče."],
+ "validation": [
+ {
+ "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
+ "score": [70, 80], # required score range [min, max]
+ "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
+ "allow_skip": true # show "skip tutorial" button
+ }
+ ]
  }
  ```

@@ -176,9 +183,9 @@ Add `validation` rules for tutorials or attention checks:
  - **Loud attention checks**: Include `warning` without `allow_skip` to force retry
  - **Silent attention checks**: Omit `warning` to log failures without notification (quality control)

- For listwise, `validation` is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).
+ The `validation` field is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).

- **Listwise score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
+ **Score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
  ```python
  {
  "src": "AI transforms industries.",
@@ -190,8 +197,7 @@ For listwise, `validation` is an array (one per candidate). Dashboard shows ✅/
  }
  ```
  The `score_greaterthan` field specifies the index of the candidate that must have a lower score than the current candidate.
-
- See [examples/tutorial_pointwise.json](examples/tutorial_pointwise.json), [examples/tutorial_listwise.json](examples/tutorial_listwise.json), and [examples/tutorial_listwise_score_greaterthan.json](examples/tutorial_listwise_score_greaterthan.json).
+ See [examples/tutorial_kway.json](examples/tutorial_kway.json).

  ### Single-stream Assignment

@@ -201,10 +207,10 @@ All annotators draw from a shared pool with random assignment:
  "campaign_id": "my campaign 6",
  "info": {
  "assignment": "single-stream",
- "template": "pointwise",
- "protocol_score": True, # collect scores
- "protocol_error_spans": True, # collect error spans
- "protocol_error_categories": False, # do not collect MQM categories, so ESA
+ # DA: scores
+ # MQM: error spans and categories
+ # ESA: error spans and scores
+ "protocol": "ESA",
  "users": 50, # number of annotators (can also be a list, see below)
  },
  "data": [...], # list of all items (shared among all annotators)
@@ -282,30 +288,21 @@ Completion tokens are shown at annotation end for verification (download correct

  <img width="500" alt="Token on completion" src="https://github.com/user-attachments/assets/40eb904c-f47a-4011-aa63-9a4f1c501549" />

- ### Model Results Display
-
- Add `&results` to dashboard URL to show model rankings (requires valid token).
- Items need `model` field (pointwise) or `models` field (listwise) and the `protocol_score` needs to be enable such that the `score` can be used for the ranking:
- ```python
- {"doc_id": "1", "model": "CommandA", "src": "...", "tgt": "..."}
- {"doc_id": "2", "models": ["CommandA", "Claude"], "src": "...", "tgt": ["...", "..."]}
- ```
- See an example in [Campaign Management](#campaign-management)
-
+ When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.

  ## Terminology

  - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
  - **Campaign File**: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
- - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns.
+ - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
  - **Task**: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
- - **Item** A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
- - **Document** A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
+ - **Item**: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
+ - **Document**: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
  - **User** / **Annotator**: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
- - **Attention Check** A validation item with known correct answers used to ensure annotator quality. Can be:
+ - **Attention Check**: A validation item with known correct answers used to ensure annotator quality. Can be:
  - **Loud**: Shows warning message and forces retry on failure
  - **Silent**: Logs failures without notifying the user (for quality control analysis)
- - **Token** A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
+ - **Token**: A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
  - **Pass Token** (`token_pass`): Shown when user meets validation thresholds
  - **Fail Token** (`token_fail`): Shown when user fails to meet validation requirements
  - **Tutorial**: An instructional validation item that teaches users how to annotate. Includes `allow_skip: true` to let users skip if they have seen it before.
@@ -314,11 +311,9 @@ See an example in [Campaign Management](#campaign-management)
  - **Dashboard**: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
  - **Protocol**: The annotation scheme defining what data is collected:
  - **Score**: Numeric quality rating (0-100)
- - **Error Spans**: Text highlights marking errors
+ - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
  - **Error Categories**: MQM taxonomy labels for errors
- - **Template**: The annotation interface type:
- - **Pointwise**: Evaluate one output at a time
- - **Listwise**: Compare multiple outputs simultaneously
+ - **Template**: The annotation interface type. The `basic` template supports comparing multiple outputs simultaneously.
  - **Assignment**: The method for distributing items to users:
  - **Task-based**: Each user has predefined items
  - **Single-stream**: Users draw from a shared pool with random assignment
@@ -349,7 +344,7 @@ pearmut run
  2. Add build rule to `webpack.config.js`
  3. Reference as `info->template` in campaign JSON

- See [web/src/pointwise.ts](web/src/pointwise.ts) for example.
+ See [web/src/basic.ts](web/src/basic.ts) for example.

  ### Deployment
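For reference, a minimal campaign in the 0.3.0 shape shown above can be written out and loaded with the CLI commands from the Quick Start. This is a sketch under the assumption that task-based `data` is a list of per-annotator tasks, each holding a list of documents; the campaign ID, texts, and file name are illustrative.

```python
import json
import subprocess

# Minimal 0.3.0-style campaign (fields as shown in the README diff above).
campaign = {
    "info": {
        "assignment": "task-based",
        "protocol": "ESA",  # DA: scores; ESA: spans + scores; MQM: spans + categories + scores
        "shuffle": True,    # default per the README; set False to disable shuffling
    },
    "campaign_id": "demo_en-cs",
    "data": [
        [  # assumed layout: one task (one annotator) holding a list of documents
            {
                "src": "This is a short test sentence.",
                "tgt": ["Toto je krátká testovací věta."],  # tgt is now a list of candidates
            }
        ]
    ],
}

with open("demo_campaign.json", "w", encoding="utf-8") as f:
    json.dump(campaign, f, ensure_ascii=False, indent=2)

# Load and serve (commands from the Quick Start section).
subprocess.run(["pearmut", "add", "demo_campaign.json"], check=True)
subprocess.run(["pearmut", "run"], check=True)
```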