pearmut 0.2.11__tar.gz → 0.3.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {pearmut-0.2.11 → pearmut-0.3.1}/PKG-INFO +80 -79
- {pearmut-0.2.11 → pearmut-0.3.1}/README.md +79 -78
- {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/PKG-INFO +80 -79
- {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/SOURCES.txt +2 -4
- {pearmut-0.2.11 → pearmut-0.3.1}/pyproject.toml +1 -1
- {pearmut-0.2.11 → pearmut-0.3.1}/server/app.py +9 -19
- {pearmut-0.2.11 → pearmut-0.3.1}/server/assignment.py +6 -6
- {pearmut-0.2.11 → pearmut-0.3.1}/server/cli.py +88 -22
- pearmut-0.3.1/server/static/basic.bundle.js +1 -0
- pearmut-0.3.1/server/static/basic.html +74 -0
- pearmut-0.3.1/server/static/dashboard.bundle.js +1 -0
- {pearmut-0.2.11 → pearmut-0.3.1}/server/static/style.css +1 -2
- {pearmut-0.2.11 → pearmut-0.3.1}/server/utils.py +1 -32
- pearmut-0.2.11/server/static/dashboard.bundle.js +0 -1
- pearmut-0.2.11/server/static/listwise.bundle.js +0 -1
- pearmut-0.2.11/server/static/listwise.html +0 -77
- pearmut-0.2.11/server/static/pointwise.bundle.js +0 -1
- pearmut-0.2.11/server/static/pointwise.html +0 -69
- {pearmut-0.2.11 → pearmut-0.3.1}/LICENSE +0 -0
- {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/dependency_links.txt +0 -0
- {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/entry_points.txt +0 -0
- {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/requires.txt +0 -0
- {pearmut-0.2.11 → pearmut-0.3.1}/pearmut.egg-info/top_level.txt +0 -0
- {pearmut-0.2.11 → pearmut-0.3.1}/server/static/dashboard.html +0 -0
- {pearmut-0.2.11 → pearmut-0.3.1}/server/static/favicon.svg +0 -0
- {pearmut-0.2.11 → pearmut-0.3.1}/server/static/index.html +0 -0
- {pearmut-0.2.11 → pearmut-0.3.1}/setup.cfg +0 -0
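
The most visible schema change in this release, visible throughout the PKG-INFO and README diffs below, is that each item's `tgt` moves from a plain string to a dictionary keyed by model name (e.g. `{"modelA": "..."}`), with `error_spans` and `validation` keyed the same way. Below is a minimal migration sketch for old-style campaign files, assuming a single system per item; the `modelA` key is only the placeholder used in the 0.3.1 README examples, and the file names and helper functions are illustrative, not part of the pearmut API.

```python
import json

def upgrade_tgt(item, model_key="modelA"):
    """Wrap a 0.2.11-style string "tgt" into the 0.3.1-style dict keyed by model name."""
    if isinstance(item, dict) and isinstance(item.get("tgt"), str):
        item = {**item, "tgt": {model_key: item["tgt"]}}
    return item

def walk(node):
    """Apply upgrade_tgt to items nested arbitrarily inside lists (tasks, documents)."""
    if isinstance(node, list):
        return [walk(x) for x in node]
    return upgrade_tgt(node)

with open("campaign_old.json", encoding="utf-8") as f:   # hypothetical 0.2.11-era file
    campaign = json.load(f)

campaign["data"] = walk(campaign["data"])

with open("campaign_new.json", "w", encoding="utf-8") as f:
    json.dump(campaign, f, ensure_ascii=False, indent=2)
```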
--- pearmut-0.2.11/PKG-INFO
+++ pearmut-0.3.1/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pearmut
-Version: 0.2.11
+Version: 0.3.1
 Summary: A tool for evaluation of model outputs, primarily MT.
 Author-email: Vilém Zouhar <vilem.zouhar@gmail.com>
 License: MIT
@@ -20,7 +20,7 @@ Dynamic: license-file
 
 # Pearmut 🍐
 
-**Platform for Evaluation and Reviewing of Multilingual Tasks
+**Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
 
 [](https://pypi.org/project/pearmut)
 
@@ -38,7 +38,6 @@ Dynamic: license-file
 - [Campaign Configuration](#campaign-configuration)
 - [Basic Structure](#basic-structure)
 - [Assignment Types](#assignment-types)
-- [Protocol Templates](#protocol-templates)
 - [Advanced Features](#advanced-features)
 - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
 - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
@@ -51,19 +50,16 @@ Dynamic: license-file
 - [Development](#development)
 - [Citation](#citation)
 
-
-**Error Span** — A highlighted segment of text marked as containing an error, with optional severity (`minor`, `major`, `neutral`) and MQM category labels.
-
 ## Quick Start
 
 Install and run locally without cloning:
 ```bash
 pip install pearmut
 # Download example campaigns
-wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/
-wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/
+wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
+wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
 # Load and start
-pearmut add
+pearmut add esa.json da.json
 pearmut run
 ```
 
@@ -76,10 +72,10 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
 {
     "info": {
         "assignment": "task-based",
-
-
-
-        "
+        # DA: scores
+        # ESA: error spans and scores
+        # MQM: error spans, categories, and scores
+        "protocol": "ESA",
     },
     "campaign_id": "wmt25_#_en-cs_CZ",
     "data": [
@@ -90,11 +86,11 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
         {
             "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
             "src": "This will be the year that Guinness loses its cool. Cheers to that!",
-            "tgt": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."
+            "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."}
         },
         {
             "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
-            "tgt": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"
+            "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"}
         }
         ...
     ],
@@ -114,11 +110,11 @@ Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dicti
 [
     {
         "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
-        "tgt": "And suddenly all the water became full of other people and other people." # required
+        "tgt": {"modelA": "And suddenly all the water became full of other people and other people."} # required (dict)
     },
     {
         "src": "toto je pokračování stejného dokumentu",
-        "tgt": "this is a continuation of the same document"
+        "tgt": {"modelA": "this is a continuation of the same document"}
         # Additional keys stored for analysis
     }
 ]
@@ -136,16 +132,23 @@ pearmut run
 - **`single-stream`**: All users draw from a shared pool (random assignment)
 - **`dynamic`**: work in progress ⚠️
 
-
+## Advanced Features
 
-
-- `protocol_score`: Collect scores [0-100]
-- `protocol_error_spans`: Collect error span highlights
-- `protocol_error_categories`: Collect MQM category labels
-- **Listwise**: Evaluate multiple outputs simultaneously
-  - Same protocol options as pointwise
+### Shuffling Model Translations
 
-
+By default, Pearmut randomly shuffles the order in which models are shown per each item in order to avoid positional bias.
+The `shuffle` parameter in campaign `info` controls this behavior:
+```python
+{
+    "info": {
+        "assignment": "task-based",
+        "protocol": "ESA",
+        "shuffle": true # Default: true. Set to false to disable shuffling.
+    },
+    "campaign_id": "my_campaign",
+    "data": [...]
+}
+```
 
 ### Pre-filled Error Spans (ESA<sup>AI</sup>)
 
@@ -154,25 +157,27 @@ Include `error_spans` to pre-fill annotations that users can review, modify, or
 ```python
 {
     "src": "The quick brown fox jumps over the lazy dog.",
-    "tgt": "Rychlá hnědá liška skáče přes líného psa.",
-    "error_spans":
-
-
-
-
-
-
-
-
-
-
-
-
-
+    "tgt": {"modelA": "Rychlá hnědá liška skáče přes líného psa."},
+    "error_spans": {
+        "modelA": [
+            {
+                "start_i": 0, # character index start (inclusive)
+                "end_i": 5, # character index end (inclusive)
+                "severity": "minor", # "minor", "major", "neutral", or null
+                "category": null # MQM category string or null
+            },
+            {
+                "start_i": 27,
+                "end_i": 32,
+                "severity": "major",
+                "category": null
+            }
+        ]
+    }
 }
 ```
 
-
+The `error_spans` field is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).
 
 ### Tutorial and Attention Checks
 
@@ -181,12 +186,16 @@ Add `validation` rules for tutorials or attention checks:
 ```python
 {
     "src": "The quick brown fox jumps.",
-    "tgt": "Rychlá hnědá liška skáče.",
+    "tgt": {"modelA": "Rychlá hnědá liška skáče."},
     "validation": {
-        "
-
-
-
+        "modelA": [
+            {
+                "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
+                "score": [70, 80], # required score range [min, max]
+                "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
+                "allow_skip": true # show "skip tutorial" button
+            }
+        ]
     }
 }
 ```
@@ -196,22 +205,25 @@ Add `validation` rules for tutorials or attention checks:
 - **Loud attention checks**: Include `warning` without `allow_skip` to force retry
 - **Silent attention checks**: Omit `warning` to log failures without notification (quality control)
 
-
+The `validation` field is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).
 
-**
+**Score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
 ```python
 {
     "src": "AI transforms industries.",
-    "tgt":
-    "validation":
-
-
-
+    "tgt": {"A": "UI transformuje průmysly.", "B": "Umělá inteligence mění obory."},
+    "validation": {
+        "A": [
+            {"warning": "A has error, score 20-40.", "score": [20, 40]}
+        ],
+        "B": [
+            {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": "A"}
+        ]
+    }
 }
 ```
 The `score_greaterthan` field specifies the index of the candidate that must have a lower score than the current candidate.
-
-See [examples/tutorial_pointwise.json](examples/tutorial_pointwise.json), [examples/tutorial_listwise.json](examples/tutorial_listwise.json), and [examples/tutorial_listwise_score_greaterthan.json](examples/tutorial_listwise_score_greaterthan.json).
+See [examples/tutorial_kway.json](examples/tutorial_kway.json).
 
 ### Single-stream Assignment
 
@@ -221,10 +233,10 @@ All annotators draw from a shared pool with random assignment:
     "campaign_id": "my campaign 6",
     "info": {
         "assignment": "single-stream",
-
-
-
-        "
+        # DA: scores
+        # MQM: error spans and categories
+        # ESA: error spans and scores
+        "protocol": "ESA",
         "users": 50, # number of annotators (can also be a list, see below)
     },
     "data": [...], # list of all items (shared among all annotators)
@@ -302,30 +314,21 @@ Completion tokens are shown at annotation end for verification (download correct
 
 <img width="500" alt="Token on completion" src="https://github.com/user-attachments/assets/40eb904c-f47a-4011-aa63-9a4f1c501549" />
 
-
-
-Add `&results` to dashboard URL to show model rankings (requires valid token).
-Items need `model` field (pointwise) or `models` field (listwise) and the `protocol_score` needs to be enable such that the `score` can be used for the ranking:
-```python
-{"doc_id": "1", "model": "CommandA", "src": "...", "tgt": "..."}
-{"doc_id": "2", "models": ["CommandA", "Claude"], "src": "...", "tgt": ["...", "..."]}
-```
-See an example in [Campaign Management](#campaign-management)
-
+When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.
 
 ## Terminology
 
 - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
 - **Campaign File**: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
-- **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns.
+- **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
 - **Task**: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
-- **Item
-- **Document
+- **Item**: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
+- **Document**: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
 - **User** / **Annotator**: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
-- **Attention Check
+- **Attention Check**: A validation item with known correct answers used to ensure annotator quality. Can be:
   - **Loud**: Shows warning message and forces retry on failure
   - **Silent**: Logs failures without notifying the user (for quality control analysis)
-- **Token
+- **Token**: A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
   - **Pass Token** (`token_pass`): Shown when user meets validation thresholds
   - **Fail Token** (`token_fail`): Shown when user fails to meet validation requirements
 - **Tutorial**: An instructional validation item that teaches users how to annotate. Includes `allow_skip: true` to let users skip if they have seen it before.
@@ -334,11 +337,9 @@ See an example in [Campaign Management](#campaign-management)
 - **Dashboard**: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
 - **Protocol**: The annotation scheme defining what data is collected:
   - **Score**: Numeric quality rating (0-100)
-  - **Error Spans**: Text highlights marking errors
+  - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
   - **Error Categories**: MQM taxonomy labels for errors
-- **Template**: The annotation interface type
-  - **Pointwise**: Evaluate one output at a time
-  - **Listwise**: Compare multiple outputs simultaneously
+- **Template**: The annotation interface type. The `basic` template supports comparing multiple outputs simultaneously.
 - **Assignment**: The method for distributing items to users:
   - **Task-based**: Each user has predefined items
   - **Single-stream**: Users draw from a shared pool with random assignment
@@ -369,7 +370,7 @@ pearmut run
 2. Add build rule to `webpack.config.js`
 3. Reference as `info->template` in campaign JSON
 
-See [web/src/
+See [web/src/basic.ts](web/src/basic.ts) for example.
 
 ### Deployment
 
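The README.md diff that follows repeats the same content changes (the README is embedded in PKG-INFO). One detail worth pulling out of the pre-filled error-span example above: `start_i` and `end_i` are inclusive character indices into the corresponding model's `tgt` string. Below is a small sanity-check sketch under that reading; the function and the item layout are assumptions based on the README excerpts, not pearmut's API.

```python
def check_item_spans(item: dict) -> list:
    """Report error spans that fall outside their model's target text or use unknown severities."""
    problems = []
    for model, spans in (item.get("error_spans") or {}).items():
        tgt = item["tgt"][model]
        for span in spans:
            start, end = span["start_i"], span["end_i"]
            if not (0 <= start <= end <= len(tgt) - 1):   # inclusive indices
                problems.append(f"{model}: span {start}-{end} outside text of length {len(tgt)}")
            if span.get("severity") not in ("minor", "major", "neutral", None):
                problems.append(f"{model}: unknown severity {span.get('severity')!r}")
    return problems

item = {
    "src": "The quick brown fox jumps over the lazy dog.",
    "tgt": {"modelA": "Rychlá hnědá liška skáče přes líného psa."},
    "error_spans": {"modelA": [{"start_i": 0, "end_i": 5, "severity": "minor", "category": None}]},
}
print(check_item_spans(item) or "all spans look valid")
```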
--- pearmut-0.2.11/README.md
+++ pearmut-0.3.1/README.md
@@ -1,6 +1,6 @@
 # Pearmut 🍐
 
-**Platform for Evaluation and Reviewing of Multilingual Tasks
+**Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
 
 [](https://pypi.org/project/pearmut)
 
@@ -18,7 +18,6 @@
 - [Campaign Configuration](#campaign-configuration)
 - [Basic Structure](#basic-structure)
 - [Assignment Types](#assignment-types)
-- [Protocol Templates](#protocol-templates)
 - [Advanced Features](#advanced-features)
 - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
 - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
@@ -31,19 +30,16 @@
 - [Development](#development)
 - [Citation](#citation)
 
-
-**Error Span** — A highlighted segment of text marked as containing an error, with optional severity (`minor`, `major`, `neutral`) and MQM category labels.
-
 ## Quick Start
 
 Install and run locally without cloning:
 ```bash
 pip install pearmut
 # Download example campaigns
-wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/
-wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/
+wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
+wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
 # Load and start
-pearmut add
+pearmut add esa.json da.json
 pearmut run
 ```
 
@@ -56,10 +52,10 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
 {
     "info": {
         "assignment": "task-based",
-
-
-
-        "
+        # DA: scores
+        # ESA: error spans and scores
+        # MQM: error spans, categories, and scores
+        "protocol": "ESA",
     },
     "campaign_id": "wmt25_#_en-cs_CZ",
     "data": [
@@ -70,11 +66,11 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
         {
             "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
             "src": "This will be the year that Guinness loses its cool. Cheers to that!",
-            "tgt": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."
+            "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."}
         },
         {
             "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
-            "tgt": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"
+            "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"}
         }
         ...
     ],
@@ -94,11 +90,11 @@ Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dicti
 [
     {
         "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
-        "tgt": "And suddenly all the water became full of other people and other people." # required
+        "tgt": {"modelA": "And suddenly all the water became full of other people and other people."} # required (dict)
     },
     {
         "src": "toto je pokračování stejného dokumentu",
-        "tgt": "this is a continuation of the same document"
+        "tgt": {"modelA": "this is a continuation of the same document"}
         # Additional keys stored for analysis
     }
 ]
@@ -116,16 +112,23 @@ pearmut run
 - **`single-stream`**: All users draw from a shared pool (random assignment)
 - **`dynamic`**: work in progress ⚠️
 
-
+## Advanced Features
 
-
-- `protocol_score`: Collect scores [0-100]
-- `protocol_error_spans`: Collect error span highlights
-- `protocol_error_categories`: Collect MQM category labels
-- **Listwise**: Evaluate multiple outputs simultaneously
-  - Same protocol options as pointwise
+### Shuffling Model Translations
 
-
+By default, Pearmut randomly shuffles the order in which models are shown per each item in order to avoid positional bias.
+The `shuffle` parameter in campaign `info` controls this behavior:
+```python
+{
+    "info": {
+        "assignment": "task-based",
+        "protocol": "ESA",
+        "shuffle": true # Default: true. Set to false to disable shuffling.
+    },
+    "campaign_id": "my_campaign",
+    "data": [...]
+}
+```
 
 ### Pre-filled Error Spans (ESA<sup>AI</sup>)
 
@@ -134,25 +137,27 @@ Include `error_spans` to pre-fill annotations that users can review, modify, or
 ```python
 {
     "src": "The quick brown fox jumps over the lazy dog.",
-    "tgt": "Rychlá hnědá liška skáče přes líného psa.",
-    "error_spans":
-
-
-
-
-
-
-
-
-
-
-
-
-
+    "tgt": {"modelA": "Rychlá hnědá liška skáče přes líného psa."},
+    "error_spans": {
+        "modelA": [
+            {
+                "start_i": 0, # character index start (inclusive)
+                "end_i": 5, # character index end (inclusive)
+                "severity": "minor", # "minor", "major", "neutral", or null
+                "category": null # MQM category string or null
+            },
+            {
+                "start_i": 27,
+                "end_i": 32,
+                "severity": "major",
+                "category": null
+            }
+        ]
+    }
 }
 ```
 
-
+The `error_spans` field is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).
 
 ### Tutorial and Attention Checks
 
@@ -161,12 +166,16 @@ Add `validation` rules for tutorials or attention checks:
 ```python
 {
     "src": "The quick brown fox jumps.",
-    "tgt": "Rychlá hnědá liška skáče.",
+    "tgt": {"modelA": "Rychlá hnědá liška skáče."},
     "validation": {
-        "
-
-
-
+        "modelA": [
+            {
+                "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
+                "score": [70, 80], # required score range [min, max]
+                "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
+                "allow_skip": true # show "skip tutorial" button
+            }
+        ]
     }
 }
 ```
@@ -176,22 +185,25 @@ Add `validation` rules for tutorials or attention checks:
 - **Loud attention checks**: Include `warning` without `allow_skip` to force retry
 - **Silent attention checks**: Omit `warning` to log failures without notification (quality control)
 
-
+The `validation` field is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).
 
-**
+**Score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
 ```python
 {
     "src": "AI transforms industries.",
-    "tgt":
-    "validation":
-
-
-
+    "tgt": {"A": "UI transformuje průmysly.", "B": "Umělá inteligence mění obory."},
+    "validation": {
+        "A": [
+            {"warning": "A has error, score 20-40.", "score": [20, 40]}
+        ],
+        "B": [
+            {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": "A"}
+        ]
+    }
 }
 ```
 The `score_greaterthan` field specifies the index of the candidate that must have a lower score than the current candidate.
-
-See [examples/tutorial_pointwise.json](examples/tutorial_pointwise.json), [examples/tutorial_listwise.json](examples/tutorial_listwise.json), and [examples/tutorial_listwise_score_greaterthan.json](examples/tutorial_listwise_score_greaterthan.json).
+See [examples/tutorial_kway.json](examples/tutorial_kway.json).
 
 ### Single-stream Assignment
 
@@ -201,10 +213,10 @@ All annotators draw from a shared pool with random assignment:
     "campaign_id": "my campaign 6",
     "info": {
         "assignment": "single-stream",
-
-
-
-        "
+        # DA: scores
+        # MQM: error spans and categories
+        # ESA: error spans and scores
+        "protocol": "ESA",
         "users": 50, # number of annotators (can also be a list, see below)
     },
     "data": [...], # list of all items (shared among all annotators)
@@ -282,30 +294,21 @@ Completion tokens are shown at annotation end for verification (download correct
 
 <img width="500" alt="Token on completion" src="https://github.com/user-attachments/assets/40eb904c-f47a-4011-aa63-9a4f1c501549" />
 
-
-
-Add `&results` to dashboard URL to show model rankings (requires valid token).
-Items need `model` field (pointwise) or `models` field (listwise) and the `protocol_score` needs to be enable such that the `score` can be used for the ranking:
-```python
-{"doc_id": "1", "model": "CommandA", "src": "...", "tgt": "..."}
-{"doc_id": "2", "models": ["CommandA", "Claude"], "src": "...", "tgt": ["...", "..."]}
-```
-See an example in [Campaign Management](#campaign-management)
-
+When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.
 
 ## Terminology
 
 - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
 - **Campaign File**: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
-- **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns.
+- **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
 - **Task**: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
-- **Item
-- **Document
+- **Item**: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
+- **Document**: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
 - **User** / **Annotator**: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
-- **Attention Check
+- **Attention Check**: A validation item with known correct answers used to ensure annotator quality. Can be:
  - **Loud**: Shows warning message and forces retry on failure
  - **Silent**: Logs failures without notifying the user (for quality control analysis)
-- **Token
+- **Token**: A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
  - **Pass Token** (`token_pass`): Shown when user meets validation thresholds
  - **Fail Token** (`token_fail`): Shown when user fails to meet validation requirements
 - **Tutorial**: An instructional validation item that teaches users how to annotate. Includes `allow_skip: true` to let users skip if they have seen it before.
@@ -314,11 +317,9 @@ See an example in [Campaign Management](#campaign-management)
 - **Dashboard**: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
 - **Protocol**: The annotation scheme defining what data is collected:
   - **Score**: Numeric quality rating (0-100)
-  - **Error Spans**: Text highlights marking errors
+  - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
   - **Error Categories**: MQM taxonomy labels for errors
-- **Template**: The annotation interface type
-  - **Pointwise**: Evaluate one output at a time
-  - **Listwise**: Compare multiple outputs simultaneously
+- **Template**: The annotation interface type. The `basic` template supports comparing multiple outputs simultaneously.
 - **Assignment**: The method for distributing items to users:
   - **Task-based**: Each user has predefined items
   - **Single-stream**: Users draw from a shared pool with random assignment
@@ -349,7 +350,7 @@ pearmut run
 2. Add build rule to `webpack.config.js`
 3. Reference as `info->template` in campaign JSON
 
-See [web/src/
+See [web/src/basic.ts](web/src/basic.ts) for example.
 
 ### Deployment
 
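For reference, below is a minimal campaign file in the new 0.3.1 shape, assembled from the README excerpts above; the exact nesting of `data` (a list of per-annotator tasks versus a flat list of items) is assumed from those excerpts, so check it against examples/esa.json before relying on it.

```python
import json

campaign = {
    "campaign_id": "demo_en-cs",        # placeholder ID
    "info": {
        "assignment": "task-based",
        "protocol": "ESA",              # DA: scores; ESA: spans and scores; MQM: spans, categories, scores
        "shuffle": True,                # randomize model order per item (0.3.x README)
    },
    "data": [
        [  # one task: a list of items for one annotator (assumed structure)
            {
                "instructions": "Evaluate translation from en to cs_CZ",
                "src": "This will be the year that Guinness loses its cool.",
                "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor."},
            }
        ]
    ],
}

with open("demo_campaign.json", "w", encoding="utf-8") as f:
    json.dump(campaign, f, ensure_ascii=False, indent=2)

# Then, as in the Quick Start section of the README:
#   pearmut add demo_campaign.json
#   pearmut run
```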