pearmut 0.2.10__tar.gz → 0.3.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {pearmut-0.2.10 → pearmut-0.3.0}/PKG-INFO +81 -72
- {pearmut-0.2.10 → pearmut-0.3.0}/README.md +80 -71
- {pearmut-0.2.10 → pearmut-0.3.0}/pearmut.egg-info/PKG-INFO +81 -72
- {pearmut-0.2.10 → pearmut-0.3.0}/pearmut.egg-info/SOURCES.txt +4 -6
- {pearmut-0.2.10 → pearmut-0.3.0}/pyproject.toml +1 -1
- {pearmut-0.2.10 → pearmut-0.3.0}/server/app.py +19 -19
- {pearmut-0.2.10 → pearmut-0.3.0}/server/assignment.py +26 -10
- {pearmut-0.2.10 → pearmut-0.3.0}/server/cli.py +91 -30
- pearmut-0.3.0/server/static/basic.bundle.js +1 -0
- pearmut-0.3.0/server/static/basic.html +74 -0
- pearmut-0.3.0/server/static/dashboard.bundle.js +1 -0
- {pearmut-0.2.10 → pearmut-0.3.0}/server/static/dashboard.html +1 -1
- pearmut-0.3.0/server/static/index.html +1 -0
- {pearmut-0.2.10/server/static/assets → pearmut-0.3.0/server/static}/style.css +1 -2
- {pearmut-0.2.10 → pearmut-0.3.0}/server/utils.py +1 -32
- pearmut-0.2.10/server/static/dashboard.bundle.js +0 -1
- pearmut-0.2.10/server/static/index.html +0 -1
- pearmut-0.2.10/server/static/listwise.bundle.js +0 -1
- pearmut-0.2.10/server/static/listwise.html +0 -77
- pearmut-0.2.10/server/static/pointwise.bundle.js +0 -1
- pearmut-0.2.10/server/static/pointwise.html +0 -69
- {pearmut-0.2.10 → pearmut-0.3.0}/LICENSE +0 -0
- {pearmut-0.2.10 → pearmut-0.3.0}/pearmut.egg-info/dependency_links.txt +0 -0
- {pearmut-0.2.10 → pearmut-0.3.0}/pearmut.egg-info/entry_points.txt +0 -0
- {pearmut-0.2.10 → pearmut-0.3.0}/pearmut.egg-info/requires.txt +0 -0
- {pearmut-0.2.10 → pearmut-0.3.0}/pearmut.egg-info/top_level.txt +0 -0
- {pearmut-0.2.10/server/static/assets → pearmut-0.3.0/server/static}/favicon.svg +0 -0
- {pearmut-0.2.10 → pearmut-0.3.0}/setup.cfg +0 -0
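For anyone upgrading existing 0.2.10 campaign files, the recurring change in the diffs below is that per-item fields become per-candidate arrays: `tgt` goes from a string to a list of strings, `error_spans` becomes a 2D array (one span list per candidate), and `validation` becomes an array of rule sets. Below is a minimal migration sketch under those assumptions — the script, its output naming, and the shape of old `validation` entries are hypothetical, and it expects strict JSON files without the `#` comments used in the README snippets:

```python
import json
import sys

# Hypothetical migration helper for 0.2.10 -> 0.3.0 campaign files.
# Assumption: the only breaking change is wrapping per-item scalar
# fields into the per-candidate arrays shown in the diff below.
def migrate_item(item: dict) -> dict:
    if isinstance(item.get("tgt"), str):
        # "tgt" is now a list with one entry per candidate output
        item["tgt"] = [item["tgt"]]
    spans = item.get("error_spans")
    if isinstance(spans, list) and spans and isinstance(spans[0], dict):
        # "error_spans" is now a 2D array: one span list per candidate
        item["error_spans"] = [spans]
    if isinstance(item.get("validation"), dict):
        # "validation" is now an array: one rule set per candidate
        item["validation"] = [item["validation"]]
    return item

if __name__ == "__main__":
    path = sys.argv[1]  # e.g. old_campaign.json
    with open(path) as f:
        campaign = json.load(f)
    campaign["data"] = [migrate_item(item) for item in campaign["data"]]
    with open(path.replace(".json", ".v0.3.0.json"), "w") as f:
        json.dump(campaign, f, ensure_ascii=False, indent=2)
```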
{pearmut-0.2.10 → pearmut-0.3.0}/PKG-INFO:

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pearmut
-Version: 0.2.10
+Version: 0.3.0
 Summary: A tool for evaluation of model outputs, primarily MT.
 Author-email: Vilém Zouhar <vilem.zouhar@gmail.com>
 License: MIT
@@ -20,7 +20,7 @@ Dynamic: license-file
 
 # Pearmut 🍐
 
-**Platform for Evaluation and Reviewing of Multilingual Tasks
+**Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
 
 [](https://pypi.org/project/pearmut)
 
@@ -38,7 +38,6 @@ Dynamic: license-file
 - [Campaign Configuration](#campaign-configuration)
   - [Basic Structure](#basic-structure)
   - [Assignment Types](#assignment-types)
-  - [Protocol Templates](#protocol-templates)
 - [Advanced Features](#advanced-features)
   - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
   - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
@@ -51,19 +50,16 @@ Dynamic: license-file
 - [Development](#development)
 - [Citation](#citation)
 
-
-**Error Span** — A highlighted segment of text marked as containing an error, with optional severity (`minor`, `major`, `neutral`) and MQM category labels.
-
 ## Quick Start
 
 Install and run locally without cloning:
 ```bash
 pip install pearmut
 # Download example campaigns
-wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/
-wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/
+wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
+wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
 # Load and start
-pearmut add
+pearmut add esa.json da.json
 pearmut run
 ```
 
@@ -76,10 +72,10 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
 {
   "info": {
     "assignment": "task-based",
-
-
-
-    "
+    # DA: scores
+    # ESA: error spans and scores
+    # MQM: error spans, categories, and scores
+    "protocol": "ESA",
   },
   "campaign_id": "wmt25_#_en-cs_CZ",
   "data": [
@@ -90,11 +86,11 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
 {
   "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
   "src": "This will be the year that Guinness loses its cool. Cheers to that!",
-  "tgt": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."
+  "tgt": ["Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."]
 },
 {
   "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
-  "tgt": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"
+  "tgt": ["Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"]
 }
 ...
 ],
@@ -114,11 +110,11 @@ Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dicti
 [
   {
     "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
-    "tgt": "And suddenly all the water became full of other people and other people." # required
+    "tgt": ["And suddenly all the water became full of other people and other people."] # required (array)
   },
   {
     "src": "toto je pokračování stejného dokumentu",
-    "tgt": "this is a continuation of the same document"
+    "tgt": ["this is a continuation of the same document"]
     # Additional keys stored for analysis
   }
 ]
@@ -136,16 +132,23 @@ pearmut run
 - **`single-stream`**: All users draw from a shared pool (random assignment)
 - **`dynamic`**: work in progress ⚠️
 
-
+## Advanced Features
 
-
-- `protocol_score`: Collect scores [0-100]
-- `protocol_error_spans`: Collect error span highlights
-- `protocol_error_categories`: Collect MQM category labels
-- **Listwise**: Evaluate multiple outputs simultaneously
-  - Same protocol options as pointwise
+### Shuffling Model Translations
 
-
+By default, Pearmut randomly shuffles the order in which models are shown per each item in order to avoid positional bias.
+The `shuffle` parameter in campaign `info` controls this behavior:
+```python
+{
+  "info": {
+    "assignment": "task-based",
+    "protocol": "ESA",
+    "shuffle": true # Default: true. Set to false to disable shuffling.
+  },
+  "campaign_id": "my_campaign",
+  "data": [...]
+}
+```
 
 ### Pre-filled Error Spans (ESA<sup>AI</sup>)
 
@@ -154,25 +157,27 @@ Include `error_spans` to pre-fill annotations that users can review, modify, or
 ```python
 {
   "src": "The quick brown fox jumps over the lazy dog.",
-  "tgt": "Rychlá hnědá liška skáče přes líného psa.",
+  "tgt": ["Rychlá hnědá liška skáče přes líného psa."],
   "error_spans": [
-
-
-
-
-
-
-
-
-
-
-
-
+    [
+      {
+        "start_i": 0, # character index start (inclusive)
+        "end_i": 5, # character index end (inclusive)
+        "severity": "minor", # "minor", "major", "neutral", or null
+        "category": null # MQM category string or null
+      },
+      {
+        "start_i": 27,
+        "end_i": 32,
+        "severity": "major",
+        "category": null
+      }
+    ]
   ]
 }
 ```
 
-
+The `error_spans` field is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).
 
 ### Tutorial and Attention Checks
 
@@ -181,13 +186,15 @@ Add `validation` rules for tutorials or attention checks:
 ```python
 {
   "src": "The quick brown fox jumps.",
-  "tgt": "Rychlá hnědá liška skáče.",
-  "validation":
-
-
-
-
-
+  "tgt": ["Rychlá hnědá liška skáče."],
+  "validation": [
+    {
+      "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
+      "score": [70, 80], # required score range [min, max]
+      "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
+      "allow_skip": true # show "skip tutorial" button
+    }
+  ]
 }
 ```
 
@@ -196,8 +203,21 @@ Add `validation` rules for tutorials or attention checks:
 - **Loud attention checks**: Include `warning` without `allow_skip` to force retry
 - **Silent attention checks**: Omit `warning` to log failures without notification (quality control)
 
-
-
+The `validation` field is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).
+
+**Score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
+```python
+{
+  "src": "AI transforms industries.",
+  "tgt": ["UI transformuje průmysly.", "Umělá inteligence mění obory."],
+  "validation": [
+    {"warning": "A has error, score 20-40.", "score": [20, 40]},
+    {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": 0}
+  ]
+}
+```
+The `score_greaterthan` field specifies the index of the candidate that must have a lower score than the current candidate.
+See [examples/tutorial_kway.json](examples/tutorial_kway.json).
 
 ### Single-stream Assignment
 
@@ -207,10 +227,10 @@ All annotators draw from a shared pool with random assignment:
 "campaign_id": "my campaign 6",
 "info": {
   "assignment": "single-stream",
-
-
-
-  "
+  # DA: scores
+  # MQM: error spans and categories
+  # ESA: error spans and scores
+  "protocol": "ESA",
   "users": 50, # number of annotators (can also be a list, see below)
 },
 "data": [...], # list of all items (shared among all annotators)
@@ -288,30 +308,21 @@ Completion tokens are shown at annotation end for verification (download correct
 
 <img width="500" alt="Token on completion" src="https://github.com/user-attachments/assets/40eb904c-f47a-4011-aa63-9a4f1c501549" />
 
-
-
-Add `&results` to dashboard URL to show model rankings (requires valid token).
-Items need `model` field (pointwise) or `models` field (listwise) and the `protocol_score` needs to be enable such that the `score` can be used for the ranking:
-```python
-{"doc_id": "1", "model": "CommandA", "src": "...", "tgt": "..."}
-{"doc_id": "2", "models": ["CommandA", "Claude"], "src": "...", "tgt": ["...", "..."]}
-```
-See an example in [Campaign Management](#campaign-management)
-
+When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.
 
 ## Terminology
 
 - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
 - **Campaign File**: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
-- **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns.
+- **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
 - **Task**: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
-- **Item
-- **Document
+- **Item**: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
+- **Document**: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
 - **User** / **Annotator**: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
-- **Attention Check
+- **Attention Check**: A validation item with known correct answers used to ensure annotator quality. Can be:
   - **Loud**: Shows warning message and forces retry on failure
   - **Silent**: Logs failures without notifying the user (for quality control analysis)
-- **Token
+- **Token**: A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
   - **Pass Token** (`token_pass`): Shown when user meets validation thresholds
   - **Fail Token** (`token_fail`): Shown when user fails to meet validation requirements
 - **Tutorial**: An instructional validation item that teaches users how to annotate. Includes `allow_skip: true` to let users skip if they have seen it before.
@@ -320,11 +331,9 @@ See an example in [Campaign Management](#campaign-management)
 - **Dashboard**: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
 - **Protocol**: The annotation scheme defining what data is collected:
   - **Score**: Numeric quality rating (0-100)
-  - **Error Spans**: Text highlights marking errors
+  - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
   - **Error Categories**: MQM taxonomy labels for errors
-- **Template**: The annotation interface type
-  - **Pointwise**: Evaluate one output at a time
-  - **Listwise**: Compare multiple outputs simultaneously
+- **Template**: The annotation interface type. The `basic` template supports comparing multiple outputs simultaneously.
 - **Assignment**: The method for distributing items to users:
   - **Task-based**: Each user has predefined items
   - **Single-stream**: Users draw from a shared pool with random assignment
@@ -355,7 +364,7 @@ pearmut run
 2. Add build rule to `webpack.config.js`
 3. Reference as `info->template` in campaign JSON
 
-See [web/src/
+See [web/src/basic.ts](web/src/basic.ts) for example.
 
 ### Deployment
 
````
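The new validation semantics documented above (score ranges plus the cross-candidate `score_greaterthan` constraint) fit in a few lines. Here is a minimal sketch of the check, assuming the README's field shapes and omitting the error-span rules; this is illustrative, not pearmut's actual implementation:

```python
# Check one candidate's score against a 0.3.0-style validation rule.
# "score" is a [min, max] range; "score_greaterthan" is the index of the
# candidate that must score *lower* than this one.
def passes(rule: dict, score: float, all_scores: list) -> bool:
    lo, hi = rule.get("score", (0, 100))
    if not (lo <= score <= hi):
        return False  # score outside the required [min, max] range
    ref = rule.get("score_greaterthan")
    if ref is not None and score <= all_scores[ref]:
        return False  # the referenced candidate must score lower
    return True

# Mirrors the tutorial snippet above: candidate B must beat candidate A.
scores = [30, 80]
rules = [
    {"score": [20, 40]},
    {"score": [70, 90], "score_greaterthan": 0},
]
assert all(passes(r, s, scores) for r, s in zip(rules, scores))
```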
{pearmut-0.2.10 → pearmut-0.3.0}/README.md:

````diff
@@ -1,6 +1,6 @@
 # Pearmut 🍐
 
-**Platform for Evaluation and Reviewing of Multilingual Tasks
+**Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
 
 [](https://pypi.org/project/pearmut)
 
@@ -18,7 +18,6 @@
 - [Campaign Configuration](#campaign-configuration)
   - [Basic Structure](#basic-structure)
   - [Assignment Types](#assignment-types)
-  - [Protocol Templates](#protocol-templates)
 - [Advanced Features](#advanced-features)
   - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
   - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
@@ -31,19 +30,16 @@
 - [Development](#development)
 - [Citation](#citation)
 
-
-**Error Span** — A highlighted segment of text marked as containing an error, with optional severity (`minor`, `major`, `neutral`) and MQM category labels.
-
 ## Quick Start
 
 Install and run locally without cloning:
 ```bash
 pip install pearmut
 # Download example campaigns
-wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/
-wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/
+wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
+wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
 # Load and start
-pearmut add
+pearmut add esa.json da.json
 pearmut run
 ```
 
@@ -56,10 +52,10 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
 {
   "info": {
     "assignment": "task-based",
-
-
-
-    "
+    # DA: scores
+    # ESA: error spans and scores
+    # MQM: error spans, categories, and scores
+    "protocol": "ESA",
   },
   "campaign_id": "wmt25_#_en-cs_CZ",
   "data": [
@@ -70,11 +66,11 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
 {
   "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
   "src": "This will be the year that Guinness loses its cool. Cheers to that!",
-  "tgt": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."
+  "tgt": ["Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."]
 },
 {
   "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
-  "tgt": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"
+  "tgt": ["Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"]
 }
 ...
 ],
@@ -94,11 +90,11 @@ Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dicti
 [
   {
     "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
-    "tgt": "And suddenly all the water became full of other people and other people." # required
+    "tgt": ["And suddenly all the water became full of other people and other people."] # required (array)
   },
   {
     "src": "toto je pokračování stejného dokumentu",
-    "tgt": "this is a continuation of the same document"
+    "tgt": ["this is a continuation of the same document"]
     # Additional keys stored for analysis
   }
 ]
@@ -116,16 +112,23 @@ pearmut run
 - **`single-stream`**: All users draw from a shared pool (random assignment)
 - **`dynamic`**: work in progress ⚠️
 
-
+## Advanced Features
 
-
-- `protocol_score`: Collect scores [0-100]
-- `protocol_error_spans`: Collect error span highlights
-- `protocol_error_categories`: Collect MQM category labels
-- **Listwise**: Evaluate multiple outputs simultaneously
-  - Same protocol options as pointwise
+### Shuffling Model Translations
 
-
+By default, Pearmut randomly shuffles the order in which models are shown per each item in order to avoid positional bias.
+The `shuffle` parameter in campaign `info` controls this behavior:
+```python
+{
+  "info": {
+    "assignment": "task-based",
+    "protocol": "ESA",
+    "shuffle": true # Default: true. Set to false to disable shuffling.
+  },
+  "campaign_id": "my_campaign",
+  "data": [...]
+}
+```
 
 ### Pre-filled Error Spans (ESA<sup>AI</sup>)
 
@@ -134,25 +137,27 @@ Include `error_spans` to pre-fill annotations that users can review, modify, or
 ```python
 {
   "src": "The quick brown fox jumps over the lazy dog.",
-  "tgt": "Rychlá hnědá liška skáče přes líného psa.",
+  "tgt": ["Rychlá hnědá liška skáče přes líného psa."],
   "error_spans": [
-
-
-
-
-
-
-
-
-
-
-
-
+    [
+      {
+        "start_i": 0, # character index start (inclusive)
+        "end_i": 5, # character index end (inclusive)
+        "severity": "minor", # "minor", "major", "neutral", or null
+        "category": null # MQM category string or null
+      },
+      {
+        "start_i": 27,
+        "end_i": 32,
+        "severity": "major",
+        "category": null
+      }
+    ]
  ]
 }
 ```
 
-
+The `error_spans` field is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).
 
 ### Tutorial and Attention Checks
 
@@ -161,13 +166,15 @@ Add `validation` rules for tutorials or attention checks:
 ```python
 {
   "src": "The quick brown fox jumps.",
-  "tgt": "Rychlá hnědá liška skáče.",
-  "validation":
-
-
-
-
-
+  "tgt": ["Rychlá hnědá liška skáče."],
+  "validation": [
+    {
+      "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
+      "score": [70, 80], # required score range [min, max]
+      "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
+      "allow_skip": true # show "skip tutorial" button
+    }
+  ]
 }
 ```
 
@@ -176,8 +183,21 @@ Add `validation` rules for tutorials or attention checks:
 - **Loud attention checks**: Include `warning` without `allow_skip` to force retry
 - **Silent attention checks**: Omit `warning` to log failures without notification (quality control)
 
-
-
+The `validation` field is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).
+
+**Score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
+```python
+{
+  "src": "AI transforms industries.",
+  "tgt": ["UI transformuje průmysly.", "Umělá inteligence mění obory."],
+  "validation": [
+    {"warning": "A has error, score 20-40.", "score": [20, 40]},
+    {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": 0}
+  ]
+}
+```
+The `score_greaterthan` field specifies the index of the candidate that must have a lower score than the current candidate.
+See [examples/tutorial_kway.json](examples/tutorial_kway.json).
 
 ### Single-stream Assignment
 
@@ -187,10 +207,10 @@ All annotators draw from a shared pool with random assignment:
 "campaign_id": "my campaign 6",
 "info": {
   "assignment": "single-stream",
-
-
-
-  "
+  # DA: scores
+  # MQM: error spans and categories
+  # ESA: error spans and scores
+  "protocol": "ESA",
   "users": 50, # number of annotators (can also be a list, see below)
 },
 "data": [...], # list of all items (shared among all annotators)
@@ -268,30 +288,21 @@ Completion tokens are shown at annotation end for verification (download correct
 
 <img width="500" alt="Token on completion" src="https://github.com/user-attachments/assets/40eb904c-f47a-4011-aa63-9a4f1c501549" />
 
-
-
-Add `&results` to dashboard URL to show model rankings (requires valid token).
-Items need `model` field (pointwise) or `models` field (listwise) and the `protocol_score` needs to be enable such that the `score` can be used for the ranking:
-```python
-{"doc_id": "1", "model": "CommandA", "src": "...", "tgt": "..."}
-{"doc_id": "2", "models": ["CommandA", "Claude"], "src": "...", "tgt": ["...", "..."]}
-```
-See an example in [Campaign Management](#campaign-management)
-
+When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.
 
 ## Terminology
 
 - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
 - **Campaign File**: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
-- **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns.
+- **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
 - **Task**: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
-- **Item
-- **Document
+- **Item**: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
+- **Document**: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
 - **User** / **Annotator**: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
-- **Attention Check
+- **Attention Check**: A validation item with known correct answers used to ensure annotator quality. Can be:
   - **Loud**: Shows warning message and forces retry on failure
   - **Silent**: Logs failures without notifying the user (for quality control analysis)
-- **Token
+- **Token**: A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
   - **Pass Token** (`token_pass`): Shown when user meets validation thresholds
   - **Fail Token** (`token_fail`): Shown when user fails to meet validation requirements
 - **Tutorial**: An instructional validation item that teaches users how to annotate. Includes `allow_skip: true` to let users skip if they have seen it before.
@@ -300,11 +311,9 @@ See an example in [Campaign Management](#campaign-management)
 - **Dashboard**: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
 - **Protocol**: The annotation scheme defining what data is collected:
   - **Score**: Numeric quality rating (0-100)
-  - **Error Spans**: Text highlights marking errors
+  - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
   - **Error Categories**: MQM taxonomy labels for errors
-- **Template**: The annotation interface type
-  - **Pointwise**: Evaluate one output at a time
-  - **Listwise**: Compare multiple outputs simultaneously
+- **Template**: The annotation interface type. The `basic` template supports comparing multiple outputs simultaneously.
 - **Assignment**: The method for distributing items to users:
   - **Task-based**: Each user has predefined items
   - **Single-stream**: Users draw from a shared pool with random assignment
@@ -335,7 +344,7 @@ pearmut run
 2. Add build rule to `webpack.config.js`
 3. Reference as `info->template` in campaign JSON
 
-See [web/src/
+See [web/src/basic.ts](web/src/basic.ts) for example.
 
 ### Deployment
 
````
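Finally, on the new `shuffle` option documented in both files: shuffling candidate order per item is a standard guard against positional bias. A minimal sketch of the idea, with a per-item seed assumed for illustration rather than taken from pearmut's internals:

```python
import random

def present_order(item_id: str, n_candidates: int, shuffle: bool = True) -> list:
    """Return the display order of candidate indices for one item.

    Seeding per item keeps the order stable across page reloads while
    still varying between items; this seeding scheme is an assumption,
    not necessarily what pearmut does internally.
    """
    order = list(range(n_candidates))
    if shuffle:
        random.Random(item_id).shuffle(order)
    return order

# The two candidates may appear in either slot, but consistently per item.
print(present_order("wmt25_#_en-cs_CZ/doc1", 2))
```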