pearmut 0.2.11__tar.gz → 0.3.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (27)
  1. {pearmut-0.2.11 → pearmut-0.3.1}/PKG-INFO +80 -79
  2. {pearmut-0.2.11 → pearmut-0.3.1}/README.md +79 -78
  3. {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/PKG-INFO +80 -79
  4. {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/SOURCES.txt +2 -4
  5. {pearmut-0.2.11 → pearmut-0.3.1}/pyproject.toml +1 -1
  6. {pearmut-0.2.11 → pearmut-0.3.1}/server/app.py +9 -19
  7. {pearmut-0.2.11 → pearmut-0.3.1}/server/assignment.py +6 -6
  8. {pearmut-0.2.11 → pearmut-0.3.1}/server/cli.py +88 -22
  9. pearmut-0.3.1/server/static/basic.bundle.js +1 -0
  10. pearmut-0.3.1/server/static/basic.html +74 -0
  11. pearmut-0.3.1/server/static/dashboard.bundle.js +1 -0
  12. {pearmut-0.2.11 → pearmut-0.3.1}/server/static/style.css +1 -2
  13. {pearmut-0.2.11 → pearmut-0.3.1}/server/utils.py +1 -32
  14. pearmut-0.2.11/server/static/dashboard.bundle.js +0 -1
  15. pearmut-0.2.11/server/static/listwise.bundle.js +0 -1
  16. pearmut-0.2.11/server/static/listwise.html +0 -77
  17. pearmut-0.2.11/server/static/pointwise.bundle.js +0 -1
  18. pearmut-0.2.11/server/static/pointwise.html +0 -69
  19. {pearmut-0.2.11 → pearmut-0.3.1}/LICENSE +0 -0
  20. {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/dependency_links.txt +0 -0
  21. {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/entry_points.txt +0 -0
  22. {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/requires.txt +0 -0
  23. {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/top_level.txt +0 -0
  24. {pearmut-0.2.11 → pearmut-0.3.1}/server/static/dashboard.html +0 -0
  25. {pearmut-0.2.11 → pearmut-0.3.1}/server/static/favicon.svg +0 -0
  26. {pearmut-0.2.11 → pearmut-0.3.1}/server/static/index.html +0 -0
  27. {pearmut-0.2.11 → pearmut-0.3.1}/setup.cfg +0 -0
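
The largest changes in this release are to the campaign file format, as the README diff below shows: each item's `tgt` becomes a dict keyed by model name instead of a bare string, and the `template` / `protocol_score` / `protocol_error_spans` / `protocol_error_categories` fields in `info` are replaced by a single `protocol` string (`DA`, `ESA`, or `MQM`). The following is a rough, unofficial migration sketch and not part of pearmut; the `modelA` key, the handling of nested `data` lists, and the protocol mapping are assumptions inferred from the diff.

```python
import json
import sys

def _migrate_item(item: dict, model_key: str) -> dict:
    """Wrap bare 0.2.11-style fields in dicts keyed by model name (0.3.1 layout)."""
    item = dict(item)
    if isinstance(item.get("tgt"), str):
        item["tgt"] = {model_key: item["tgt"]}
    if isinstance(item.get("error_spans"), list):
        item["error_spans"] = {model_key: item["error_spans"]}
    # "validation" rules would need the same per-model wrapping (not handled here).
    return item

def migrate_campaign(old: dict, model_key: str = "modelA") -> dict:
    """Rough 0.2.11 -> 0.3.1 conversion; field mapping inferred from the README diff."""
    info = dict(old.get("info", {}))
    # The three protocol_* booleans collapse into a single "protocol" string.
    if info.pop("protocol_error_categories", False):
        protocol = "MQM"   # error spans, categories, and scores
    elif info.pop("protocol_error_spans", False):
        protocol = "ESA"   # error spans and scores
    else:
        protocol = "DA"    # scores only
    info.pop("protocol_score", None)
    info.pop("template", None)  # the pointwise/listwise templates were replaced by "basic"
    info["protocol"] = protocol

    data = [
        [_migrate_item(x, model_key) for x in entry] if isinstance(entry, list)
        else _migrate_item(entry, model_key)
        for entry in old.get("data", [])
    ]
    return {**old, "info": info, "data": data}

if __name__ == "__main__":
    # Usage: python migrate.py old_campaign.json new_campaign.json
    with open(sys.argv[1], encoding="utf-8") as f:
        old = json.load(f)
    with open(sys.argv[2], "w", encoding="utf-8") as f:
        json.dump(migrate_campaign(old), f, ensure_ascii=False, indent=2)
```
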
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: pearmut
- Version: 0.2.11
+ Version: 0.3.1
  Summary: A tool for evaluation of model outputs, primarily MT.
  Author-email: Vilém Zouhar <vilem.zouhar@gmail.com>
  License: MIT
@@ -20,7 +20,7 @@ Dynamic: license-file

  # Pearmut 🍐

- **Platform for Evaluation and Reviewing of Multilingual Tasks** Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
+ **Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).

  [![PyPi version](https://badgen.net/pypi/v/pearmut/)](https://pypi.org/project/pearmut)
  &nbsp;
@@ -38,7 +38,6 @@ Dynamic: license-file
  - [Campaign Configuration](#campaign-configuration)
  - [Basic Structure](#basic-structure)
  - [Assignment Types](#assignment-types)
- - [Protocol Templates](#protocol-templates)
  - [Advanced Features](#advanced-features)
  - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
  - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
@@ -51,19 +50,16 @@ Dynamic: license-file
  - [Development](#development)
  - [Citation](#citation)

-
- **Error Span** — A highlighted segment of text marked as containing an error, with optional severity (`minor`, `major`, `neutral`) and MQM category labels.
-
  ## Quick Start

  Install and run locally without cloning:
  ```bash
  pip install pearmut
  # Download example campaigns
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa_encs.json
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da_enuk.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
  # Load and start
- pearmut add esa_encs.json da_enuk.json
+ pearmut add esa.json da.json
  pearmut run
  ```

@@ -76,10 +72,10 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
  "info": {
  "assignment": "task-based",
- "template": "pointwise",
- "protocol_score": true, # we want scores [0...100] for each segment
- "protocol_error_spans": true, # we want error spans
- "protocol_error_categories": false, # we do not want error span categories
+ # DA: scores
+ # ESA: error spans and scores
+ # MQM: error spans, categories, and scores
+ "protocol": "ESA",
  },
  "campaign_id": "wmt25_#_en-cs_CZ",
  "data": [
@@ -90,11 +86,11 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
  "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
  "src": "This will be the year that Guinness loses its cool. Cheers to that!",
- "tgt": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."
+ "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."}
  },
  {
  "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
- "tgt": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"
+ "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"}
  }
  ...
  ],
@@ -114,11 +110,11 @@ Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dicti
  [
  {
  "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
- "tgt": "And suddenly all the water became full of other people and other people." # required
+ "tgt": {"modelA": "And suddenly all the water became full of other people and other people."} # required (dict)
  },
  {
  "src": "toto je pokračování stejného dokumentu",
- "tgt": "this is a continuation of the same document"
+ "tgt": {"modelA": "this is a continuation of the same document"}
  # Additional keys stored for analysis
  }
  ]
@@ -136,16 +132,23 @@ pearmut run
  - **`single-stream`**: All users draw from a shared pool (random assignment)
  - **`dynamic`**: work in progress ⚠️

- ### Protocol Templates
+ ## Advanced Features

- - **Pointwise**: Evaluate single output against single input
- - `protocol_score`: Collect scores [0-100]
- - `protocol_error_spans`: Collect error span highlights
- - `protocol_error_categories`: Collect MQM category labels
- - **Listwise**: Evaluate multiple outputs simultaneously
- - Same protocol options as pointwise
+ ### Shuffling Model Translations

- ## Advanced Features
+ By default, Pearmut randomly shuffles the order in which models are shown per each item in order to avoid positional bias.
+ The `shuffle` parameter in campaign `info` controls this behavior:
+ ```python
+ {
+ "info": {
+ "assignment": "task-based",
+ "protocol": "ESA",
+ "shuffle": true # Default: true. Set to false to disable shuffling.
+ },
+ "campaign_id": "my_campaign",
+ "data": [...]
+ }
+ ```

  ### Pre-filled Error Spans (ESA<sup>AI</sup>)

@@ -154,25 +157,27 @@ Include `error_spans` to pre-fill annotations that users can review, modify, or
  ```python
  {
  "src": "The quick brown fox jumps over the lazy dog.",
- "tgt": "Rychlá hnědá liška skáče přes líného psa.",
- "error_spans": [
- {
- "start_i": 0, # character index start (inclusive)
- "end_i": 5, # character index end (inclusive)
- "severity": "minor", # "minor", "major", "neutral", or null
- "category": null # MQM category string or null
- },
- {
- "start_i": 27,
- "end_i": 32,
- "severity": "major",
- "category": null
- }
- ]
+ "tgt": {"modelA": "Rychlá hnědá liška skáče přes líného psa."},
+ "error_spans": {
+ "modelA": [
+ {
+ "start_i": 0, # character index start (inclusive)
+ "end_i": 5, # character index end (inclusive)
+ "severity": "minor", # "minor", "major", "neutral", or null
+ "category": null # MQM category string or null
+ },
+ {
+ "start_i": 27,
+ "end_i": 32,
+ "severity": "major",
+ "category": null
+ }
+ ]
+ }
  }
  ```

- For **listwise** template, `error_spans` is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).
+ The `error_spans` field is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).

  ### Tutorial and Attention Checks

@@ -181,12 +186,16 @@ Add `validation` rules for tutorials or attention checks:
  ```python
  {
  "src": "The quick brown fox jumps.",
- "tgt": "Rychlá hnědá liška skáče.",
+ "tgt": {"modelA": "Rychlá hnědá liška skáče."},
  "validation": {
- "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
- "score": [70, 80], # required score range [min, max]
- "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
- "allow_skip": true # show "skip tutorial" button
+ "modelA": [
+ {
+ "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
+ "score": [70, 80], # required score range [min, max]
+ "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
+ "allow_skip": true # show "skip tutorial" button
+ }
+ ]
  }
  }
  ```
@@ -196,22 +205,25 @@ Add `validation` rules for tutorials or attention checks:
  - **Loud attention checks**: Include `warning` without `allow_skip` to force retry
  - **Silent attention checks**: Omit `warning` to log failures without notification (quality control)

- For listwise, `validation` is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).
+ The `validation` field is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).

- **Listwise score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
+ **Score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
  ```python
  {
  "src": "AI transforms industries.",
- "tgt": ["UI transformuje průmysly.", "Umělá inteligence mění obory."],
- "validation": [
- {"warning": "A has error, score 20-40.", "score": [20, 40]},
- {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": 0}
- ]
+ "tgt": {"A": "UI transformuje průmysly.", "B": "Umělá inteligence mění obory."},
+ "validation": {
+ "A": [
+ {"warning": "A has error, score 20-40.", "score": [20, 40]}
+ ],
+ "B": [
+ {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": "A"}
+ ]
+ }
  }
  ```
  The `score_greaterthan` field specifies the index of the candidate that must have a lower score than the current candidate.
-
- See [examples/tutorial_pointwise.json](examples/tutorial_pointwise.json), [examples/tutorial_listwise.json](examples/tutorial_listwise.json), and [examples/tutorial_listwise_score_greaterthan.json](examples/tutorial_listwise_score_greaterthan.json).
+ See [examples/tutorial_kway.json](examples/tutorial_kway.json).

  ### Single-stream Assignment

@@ -221,10 +233,10 @@ All annotators draw from a shared pool with random assignment:
  "campaign_id": "my campaign 6",
  "info": {
  "assignment": "single-stream",
- "template": "pointwise",
- "protocol_score": True, # collect scores
- "protocol_error_spans": True, # collect error spans
- "protocol_error_categories": False, # do not collect MQM categories, so ESA
+ # DA: scores
+ # MQM: error spans and categories
+ # ESA: error spans and scores
+ "protocol": "ESA",
  "users": 50, # number of annotators (can also be a list, see below)
  },
  "data": [...], # list of all items (shared among all annotators)
@@ -302,30 +314,21 @@ Completion tokens are shown at annotation end for verification (download correct

  <img width="500" alt="Token on completion" src="https://github.com/user-attachments/assets/40eb904c-f47a-4011-aa63-9a4f1c501549" />

- ### Model Results Display
-
- Add `&results` to dashboard URL to show model rankings (requires valid token).
- Items need `model` field (pointwise) or `models` field (listwise) and the `protocol_score` needs to be enable such that the `score` can be used for the ranking:
- ```python
- {"doc_id": "1", "model": "CommandA", "src": "...", "tgt": "..."}
- {"doc_id": "2", "models": ["CommandA", "Claude"], "src": "...", "tgt": ["...", "..."]}
- ```
- See an example in [Campaign Management](#campaign-management)
-
+ When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.

  ## Terminology

  - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
  - **Campaign File**: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
- - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns.
+ - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
  - **Task**: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
- - **Item** A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
- - **Document** A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
+ - **Item**: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
+ - **Document**: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
  - **User** / **Annotator**: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
- - **Attention Check** A validation item with known correct answers used to ensure annotator quality. Can be:
+ - **Attention Check**: A validation item with known correct answers used to ensure annotator quality. Can be:
  - **Loud**: Shows warning message and forces retry on failure
  - **Silent**: Logs failures without notifying the user (for quality control analysis)
- - **Token** A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
+ - **Token**: A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
  - **Pass Token** (`token_pass`): Shown when user meets validation thresholds
  - **Fail Token** (`token_fail`): Shown when user fails to meet validation requirements
  - **Tutorial**: An instructional validation item that teaches users how to annotate. Includes `allow_skip: true` to let users skip if they have seen it before.
@@ -334,11 +337,9 @@ See an example in [Campaign Management](#campaign-management)
  - **Dashboard**: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
  - **Protocol**: The annotation scheme defining what data is collected:
  - **Score**: Numeric quality rating (0-100)
- - **Error Spans**: Text highlights marking errors
+ - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
  - **Error Categories**: MQM taxonomy labels for errors
- - **Template**: The annotation interface type:
- - **Pointwise**: Evaluate one output at a time
- - **Listwise**: Compare multiple outputs simultaneously
+ - **Template**: The annotation interface type. The `basic` template supports comparing multiple outputs simultaneously.
  - **Assignment**: The method for distributing items to users:
  - **Task-based**: Each user has predefined items
  - **Single-stream**: Users draw from a shared pool with random assignment
@@ -369,7 +370,7 @@ pearmut run
  2. Add build rule to `webpack.config.js`
  3. Reference as `info->template` in campaign JSON

- See [web/src/pointwise.ts](web/src/pointwise.ts) for example.
+ See [web/src/basic.ts](web/src/basic.ts) for example.

  ### Deployment

@@ -1,6 +1,6 @@
  # Pearmut 🍐

- **Platform for Evaluation and Reviewing of Multilingual Tasks** Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
+ **Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).

  [![PyPi version](https://badgen.net/pypi/v/pearmut/)](https://pypi.org/project/pearmut)
  &nbsp;
@@ -18,7 +18,6 @@
  - [Campaign Configuration](#campaign-configuration)
  - [Basic Structure](#basic-structure)
  - [Assignment Types](#assignment-types)
- - [Protocol Templates](#protocol-templates)
  - [Advanced Features](#advanced-features)
  - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
  - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
@@ -31,19 +30,16 @@
  - [Development](#development)
  - [Citation](#citation)

-
- **Error Span** — A highlighted segment of text marked as containing an error, with optional severity (`minor`, `major`, `neutral`) and MQM category labels.
-
  ## Quick Start

  Install and run locally without cloning:
  ```bash
  pip install pearmut
  # Download example campaigns
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa_encs.json
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da_enuk.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
  # Load and start
- pearmut add esa_encs.json da_enuk.json
+ pearmut add esa.json da.json
  pearmut run
  ```

@@ -56,10 +52,10 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
  "info": {
  "assignment": "task-based",
- "template": "pointwise",
- "protocol_score": true, # we want scores [0...100] for each segment
- "protocol_error_spans": true, # we want error spans
- "protocol_error_categories": false, # we do not want error span categories
+ # DA: scores
+ # ESA: error spans and scores
+ # MQM: error spans, categories, and scores
+ "protocol": "ESA",
  },
  "campaign_id": "wmt25_#_en-cs_CZ",
  "data": [
@@ -70,11 +66,11 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
  "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
  "src": "This will be the year that Guinness loses its cool. Cheers to that!",
- "tgt": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."
+ "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."}
  },
  {
  "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
- "tgt": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"
+ "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"}
  }
  ...
  ],
@@ -94,11 +90,11 @@ Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dicti
  [
  {
  "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
- "tgt": "And suddenly all the water became full of other people and other people." # required
+ "tgt": {"modelA": "And suddenly all the water became full of other people and other people."} # required (dict)
  },
  {
  "src": "toto je pokračování stejného dokumentu",
- "tgt": "this is a continuation of the same document"
+ "tgt": {"modelA": "this is a continuation of the same document"}
  # Additional keys stored for analysis
  }
  ]
@@ -116,16 +112,23 @@ pearmut run
  - **`single-stream`**: All users draw from a shared pool (random assignment)
  - **`dynamic`**: work in progress ⚠️

- ### Protocol Templates
+ ## Advanced Features

- - **Pointwise**: Evaluate single output against single input
- - `protocol_score`: Collect scores [0-100]
- - `protocol_error_spans`: Collect error span highlights
- - `protocol_error_categories`: Collect MQM category labels
- - **Listwise**: Evaluate multiple outputs simultaneously
- - Same protocol options as pointwise
+ ### Shuffling Model Translations

- ## Advanced Features
+ By default, Pearmut randomly shuffles the order in which models are shown per each item in order to avoid positional bias.
+ The `shuffle` parameter in campaign `info` controls this behavior:
+ ```python
+ {
+ "info": {
+ "assignment": "task-based",
+ "protocol": "ESA",
+ "shuffle": true # Default: true. Set to false to disable shuffling.
+ },
+ "campaign_id": "my_campaign",
+ "data": [...]
+ }
+ ```

  ### Pre-filled Error Spans (ESA<sup>AI</sup>)

@@ -134,25 +137,27 @@ Include `error_spans` to pre-fill annotations that users can review, modify, or
  ```python
  {
  "src": "The quick brown fox jumps over the lazy dog.",
- "tgt": "Rychlá hnědá liška skáče přes líného psa.",
- "error_spans": [
- {
- "start_i": 0, # character index start (inclusive)
- "end_i": 5, # character index end (inclusive)
- "severity": "minor", # "minor", "major", "neutral", or null
- "category": null # MQM category string or null
- },
- {
- "start_i": 27,
- "end_i": 32,
- "severity": "major",
- "category": null
- }
- ]
+ "tgt": {"modelA": "Rychlá hnědá liška skáče přes líného psa."},
+ "error_spans": {
+ "modelA": [
+ {
+ "start_i": 0, # character index start (inclusive)
+ "end_i": 5, # character index end (inclusive)
+ "severity": "minor", # "minor", "major", "neutral", or null
+ "category": null # MQM category string or null
+ },
+ {
+ "start_i": 27,
+ "end_i": 32,
+ "severity": "major",
+ "category": null
+ }
+ ]
+ }
  }
  ```

- For **listwise** template, `error_spans` is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).
+ The `error_spans` field is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).

  ### Tutorial and Attention Checks

@@ -161,12 +166,16 @@ Add `validation` rules for tutorials or attention checks:
  ```python
  {
  "src": "The quick brown fox jumps.",
- "tgt": "Rychlá hnědá liška skáče.",
+ "tgt": {"modelA": "Rychlá hnědá liška skáče."},
  "validation": {
- "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
- "score": [70, 80], # required score range [min, max]
- "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
- "allow_skip": true # show "skip tutorial" button
+ "modelA": [
+ {
+ "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
+ "score": [70, 80], # required score range [min, max]
+ "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
+ "allow_skip": true # show "skip tutorial" button
+ }
+ ]
  }
  }
  ```
@@ -176,22 +185,25 @@ Add `validation` rules for tutorials or attention checks:
  - **Loud attention checks**: Include `warning` without `allow_skip` to force retry
  - **Silent attention checks**: Omit `warning` to log failures without notification (quality control)

- For listwise, `validation` is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).
+ The `validation` field is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).

- **Listwise score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
+ **Score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
  ```python
  {
  "src": "AI transforms industries.",
- "tgt": ["UI transformuje průmysly.", "Umělá inteligence mění obory."],
- "validation": [
- {"warning": "A has error, score 20-40.", "score": [20, 40]},
- {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": 0}
- ]
+ "tgt": {"A": "UI transformuje průmysly.", "B": "Umělá inteligence mění obory."},
+ "validation": {
+ "A": [
+ {"warning": "A has error, score 20-40.", "score": [20, 40]}
+ ],
+ "B": [
+ {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": "A"}
+ ]
+ }
  }
  ```
  The `score_greaterthan` field specifies the index of the candidate that must have a lower score than the current candidate.
-
- See [examples/tutorial_pointwise.json](examples/tutorial_pointwise.json), [examples/tutorial_listwise.json](examples/tutorial_listwise.json), and [examples/tutorial_listwise_score_greaterthan.json](examples/tutorial_listwise_score_greaterthan.json).
+ See [examples/tutorial_kway.json](examples/tutorial_kway.json).

  ### Single-stream Assignment

@@ -201,10 +213,10 @@ All annotators draw from a shared pool with random assignment:
  "campaign_id": "my campaign 6",
  "info": {
  "assignment": "single-stream",
- "template": "pointwise",
- "protocol_score": True, # collect scores
- "protocol_error_spans": True, # collect error spans
- "protocol_error_categories": False, # do not collect MQM categories, so ESA
+ # DA: scores
+ # MQM: error spans and categories
+ # ESA: error spans and scores
+ "protocol": "ESA",
  "users": 50, # number of annotators (can also be a list, see below)
  },
  "data": [...], # list of all items (shared among all annotators)
@@ -282,30 +294,21 @@ Completion tokens are shown at annotation end for verification (download correct

  <img width="500" alt="Token on completion" src="https://github.com/user-attachments/assets/40eb904c-f47a-4011-aa63-9a4f1c501549" />

- ### Model Results Display
-
- Add `&results` to dashboard URL to show model rankings (requires valid token).
- Items need `model` field (pointwise) or `models` field (listwise) and the `protocol_score` needs to be enable such that the `score` can be used for the ranking:
- ```python
- {"doc_id": "1", "model": "CommandA", "src": "...", "tgt": "..."}
- {"doc_id": "2", "models": ["CommandA", "Claude"], "src": "...", "tgt": ["...", "..."]}
- ```
- See an example in [Campaign Management](#campaign-management)
-
+ When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.

  ## Terminology

  - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
  - **Campaign File**: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
- - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns.
+ - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
  - **Task**: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
- - **Item** A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
- - **Document** A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
+ - **Item**: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
+ - **Document**: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
  - **User** / **Annotator**: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
- - **Attention Check** A validation item with known correct answers used to ensure annotator quality. Can be:
+ - **Attention Check**: A validation item with known correct answers used to ensure annotator quality. Can be:
  - **Loud**: Shows warning message and forces retry on failure
  - **Silent**: Logs failures without notifying the user (for quality control analysis)
- - **Token** A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
+ - **Token**: A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
  - **Pass Token** (`token_pass`): Shown when user meets validation thresholds
  - **Fail Token** (`token_fail`): Shown when user fails to meet validation requirements
  - **Tutorial**: An instructional validation item that teaches users how to annotate. Includes `allow_skip: true` to let users skip if they have seen it before.
@@ -314,11 +317,9 @@ See an example in [Campaign Management](#campaign-management)
  - **Dashboard**: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
  - **Protocol**: The annotation scheme defining what data is collected:
  - **Score**: Numeric quality rating (0-100)
- - **Error Spans**: Text highlights marking errors
+ - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
  - **Error Categories**: MQM taxonomy labels for errors
- - **Template**: The annotation interface type:
- - **Pointwise**: Evaluate one output at a time
- - **Listwise**: Compare multiple outputs simultaneously
+ - **Template**: The annotation interface type. The `basic` template supports comparing multiple outputs simultaneously.
  - **Assignment**: The method for distributing items to users:
  - **Task-based**: Each user has predefined items
  - **Single-stream**: Users draw from a shared pool with random assignment
@@ -349,7 +350,7 @@ pearmut run
  2. Add build rule to `webpack.config.js`
  3. Reference as `info->template` in campaign JSON

- See [web/src/pointwise.ts](web/src/pointwise.ts) for example.
+ See [web/src/basic.ts](web/src/basic.ts) for example.

  ### Deployment
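
Putting the new format together with the Quick Start commands shown in the README diff, a minimal 0.3.1-style campaign could be generated and served as sketched below. This is illustrative only: the campaign ID, file name, `modelA` key, and the one-user single-stream setup are assumptions; the field names and the `pearmut add` / `pearmut run` commands come from the README above.

```python
import json
import subprocess

# Minimal campaign in the 0.3.1 layout shown in the README diff:
# a single "protocol" string and "tgt" as a dict keyed by model name.
campaign = {
    "campaign_id": "demo_en-cs",
    "info": {
        "assignment": "single-stream",  # all annotators draw from a shared pool
        "protocol": "ESA",              # error spans and scores
        "users": 1,                     # number of annotators
    },
    "data": [
        {
            "src": "This will be the year that Guinness loses its cool. Cheers to that!",
            "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"},
        },
    ],
}

with open("demo_en-cs.json", "w", encoding="utf-8") as f:
    json.dump(campaign, f, ensure_ascii=False, indent=2)

# Load the campaign and start the local server (commands from the Quick Start section).
subprocess.run(["pearmut", "add", "demo_en-cs.json"], check=True)
subprocess.run(["pearmut", "run"], check=True)
```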