pearmut-0.2.10-py3-none-any.whl → pearmut-0.3.0-py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
pearmut/utils.py CHANGED
@@ -7,37 +7,6 @@ ROOT = "."
  RESET_MARKER = "__RESET__"


- def highlight_differences(a, b):
-     """
-     Compares two strings and wraps their differences in HTML span tags.
-
-     Args:
-         a: The first string.
-         b: The second string.
-
-     Returns:
-         A tuple containing the two strings with their differences highlighted.
-     """
-     import difflib
-     # TODO: maybe on the level of words?
-     s = difflib.SequenceMatcher(None, a, b)
-     res_a, res_b = [], []
-     span_open = '<span class="difference">'
-     span_close = '</span>'
-
-     for tag, i1, i2, j1, j2 in s.get_opcodes():
-         if tag == 'equal' or (i2-i1 <= 2 and j2-j1 <= 2):
-             res_a.append(a[i1:i2])
-             res_b.append(b[j1:j2])
-         else:
-             if tag in ('replace', 'delete'):
-                 res_a.append(f"{span_open}{a[i1:i2]}{span_close}")
-             if tag in ('replace', 'insert'):
-                 res_b.append(f"{span_open}{b[j1:j2]}{span_close}")
-
-     return "".join(res_a), "".join(res_b)
-
-
  def load_progress_data(warn: str | None = None):
      if not os.path.exists(f"{ROOT}/data/progress.json"):
          if warn is not None:
@@ -94,7 +63,7 @@ def get_db_log_item(campaign_id: str, user_id: str | None, item_i: int | None) -
      # Find the last reset marker for this user (if any)
      last_reset_idx = -1
      for i, entry in enumerate(matching):
-         if entry.get("annotations") == RESET_MARKER:
+         if entry.get("annotation") == RESET_MARKER:
              last_reset_idx = i

      # Return only entries after the last reset
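Besides removing `highlight_differences`, the `utils.py` diff renames the key used to detect reset markers (`annotations` → `annotation`). A minimal runnable sketch of the filtering pattern described by the comments in that hunk, assuming a hypothetical log-entry structure (the full body of `get_db_log_item` is not shown in this diff):

```python
# Illustrative only: the real get_db_log_item also filters by campaign/user,
# which is outside this hunk.
RESET_MARKER = "__RESET__"

def entries_after_last_reset(matching: list[dict]) -> list[dict]:
    # Find the last reset marker for this user (if any)
    last_reset_idx = -1
    for i, entry in enumerate(matching):
        if entry.get("annotation") == RESET_MARKER:  # was "annotations" in 0.2.10
            last_reset_idx = i
    # Return only entries after the last reset
    return matching[last_reset_idx + 1:]

log = [{"annotation": "a"}, {"annotation": "__RESET__"}, {"annotation": "b"}]
assert entries_after_last_reset(log) == [{"annotation": "b"}]
```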
pearmut-0.3.0.dist-info/METADATA CHANGED
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: pearmut
- Version: 0.2.10
+ Version: 0.3.0
  Summary: A tool for evaluation of model outputs, primarily MT.
  Author-email: Vilém Zouhar <vilem.zouhar@gmail.com>
  License: MIT
@@ -20,7 +20,7 @@ Dynamic: license-file

  # Pearmut 🍐

- **Platform for Evaluation and Reviewing of Multilingual Tasks** Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
+ **Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).

  [![PyPi version](https://badgen.net/pypi/v/pearmut/)](https://pypi.org/project/pearmut)
  &nbsp;
@@ -38,7 +38,6 @@ Dynamic: license-file
  - [Campaign Configuration](#campaign-configuration)
  - [Basic Structure](#basic-structure)
  - [Assignment Types](#assignment-types)
- - [Protocol Templates](#protocol-templates)
  - [Advanced Features](#advanced-features)
  - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
  - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
@@ -51,19 +50,16 @@ Dynamic: license-file
  - [Development](#development)
  - [Citation](#citation)

-
- **Error Span** — A highlighted segment of text marked as containing an error, with optional severity (`minor`, `major`, `neutral`) and MQM category labels.
-
  ## Quick Start

  Install and run locally without cloning:
  ```bash
  pip install pearmut
  # Download example campaigns
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa_encs.json
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da_enuk.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
  # Load and start
- pearmut add esa_encs.json da_enuk.json
+ pearmut add esa.json da.json
  pearmut run
  ```

@@ -76,10 +72,10 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
      "info": {
          "assignment": "task-based",
-         "template": "pointwise",
-         "protocol_score": true, # we want scores [0...100] for each segment
-         "protocol_error_spans": true, # we want error spans
-         "protocol_error_categories": false, # we do not want error span categories
+         # DA: scores
+         # ESA: error spans and scores
+         # MQM: error spans, categories, and scores
+         "protocol": "ESA",
      },
      "campaign_id": "wmt25_#_en-cs_CZ",
      "data": [
@@ -90,11 +86,11 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
          {
              "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
              "src": "This will be the year that Guinness loses its cool. Cheers to that!",
-             "tgt": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."
+             "tgt": ["Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."]
          },
          {
              "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
-             "tgt": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"
+             "tgt": ["Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"]
          }
          ...
      ],
@@ -114,11 +110,11 @@ Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dicti
  [
      {
          "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
-         "tgt": "And suddenly all the water became full of other people and other people." # required
+         "tgt": ["And suddenly all the water became full of other people and other people."] # required (array)
      },
      {
          "src": "toto je pokračování stejného dokumentu",
-         "tgt": "this is a continuation of the same document"
+         "tgt": ["this is a continuation of the same document"]
          # Additional keys stored for analysis
      }
  ]
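The README hunks above show two format changes in 0.3.0: the `protocol_*` booleans collapse into a single `protocol` name, and `tgt` becomes an array of candidate translations. A hypothetical migration sketch for 0.2.x campaign files; the flag-to-protocol mapping is an assumption based on the comments in the diff (DA: scores; ESA: adds error spans; MQM: adds categories), and `migrate_campaign` is not part of pearmut:

```python
import json

def migrate_campaign(path_in: str, path_out: str) -> None:
    """Hypothetical 0.2.x -> 0.3.0 campaign upgrade; not part of pearmut itself."""
    with open(path_in, encoding="utf-8") as f:
        campaign = json.load(f)

    info = campaign["info"]
    if "protocol" not in info:
        # Assumed mapping, derived from the README comments in this diff.
        if info.get("protocol_error_categories"):
            info["protocol"] = "MQM"
        elif info.get("protocol_error_spans"):
            info["protocol"] = "ESA"
        else:
            info["protocol"] = "DA"
        for key in ("protocol_score", "protocol_error_spans", "protocol_error_categories"):
            info.pop(key, None)

    def fix_item(item: dict) -> None:
        # 0.3.0 expects "tgt" to be a list of candidate translations.
        if isinstance(item.get("tgt"), str):
            item["tgt"] = [item["tgt"]]

    for entry in campaign.get("data", []):
        if isinstance(entry, dict):
            fix_item(entry)
        elif isinstance(entry, list):  # task-based campaigns may group items per task
            for item in entry:
                fix_item(item)

    with open(path_out, "w", encoding="utf-8") as f:
        json.dump(campaign, f, ensure_ascii=False, indent=2)
```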
@@ -136,16 +132,23 @@ pearmut run
  - **`single-stream`**: All users draw from a shared pool (random assignment)
  - **`dynamic`**: work in progress ⚠️

- ### Protocol Templates
+ ## Advanced Features

- - **Pointwise**: Evaluate single output against single input
-   - `protocol_score`: Collect scores [0-100]
-   - `protocol_error_spans`: Collect error span highlights
-   - `protocol_error_categories`: Collect MQM category labels
- - **Listwise**: Evaluate multiple outputs simultaneously
-   - Same protocol options as pointwise
+ ### Shuffling Model Translations

- ## Advanced Features
+ By default, Pearmut randomly shuffles the order in which models are shown per each item in order to avoid positional bias.
+ The `shuffle` parameter in campaign `info` controls this behavior:
+ ```python
+ {
+     "info": {
+         "assignment": "task-based",
+         "protocol": "ESA",
+         "shuffle": true # Default: true. Set to false to disable shuffling.
+     },
+     "campaign_id": "my_campaign",
+     "data": [...]
+ }
+ ```

  ### Pre-filled Error Spans (ESA<sup>AI</sup>)

@@ -154,25 +157,27 @@ Include `error_spans` to pre-fill annotations that users can review, modify, or
  ```python
  {
      "src": "The quick brown fox jumps over the lazy dog.",
-     "tgt": "Rychlá hnědá liška skáče přes líného psa.",
+     "tgt": ["Rychlá hnědá liška skáče přes líného psa."],
      "error_spans": [
-         {
-             "start_i": 0, # character index start (inclusive)
-             "end_i": 5, # character index end (inclusive)
-             "severity": "minor", # "minor", "major", "neutral", or null
-             "category": null # MQM category string or null
-         },
-         {
-             "start_i": 27,
-             "end_i": 32,
-             "severity": "major",
-             "category": null
-         }
+         [
+             {
+                 "start_i": 0, # character index start (inclusive)
+                 "end_i": 5, # character index end (inclusive)
+                 "severity": "minor", # "minor", "major", "neutral", or null
+                 "category": null # MQM category string or null
+             },
+             {
+                 "start_i": 27,
+                 "end_i": 32,
+                 "severity": "major",
+                 "category": null
+             }
+         ]
      ]
  }
  ```

- For **listwise** template, `error_spans` is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).
+ The `error_spans` field is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).

  ### Tutorial and Attention Checks

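The pre-filled spans in the hunk above use inclusive character offsets into the corresponding candidate in `tgt`. A small sketch for inspecting such spans locally; the helper is illustrative and not pearmut API, and it assumes the item shape from the example:

```python
# Sketch (not pearmut API): walk pre-filled spans and recover the marked text.
# Assumes error_spans[c] holds the spans for candidate item["tgt"][c], and that
# start_i/end_i are inclusive character indices, as the comments above state.
def iter_prefilled_spans(item: dict):
    for c, (candidate, spans) in enumerate(zip(item["tgt"], item["error_spans"])):
        for span in spans:
            yield c, span["severity"], candidate[span["start_i"] : span["end_i"] + 1]

# For the example item above, the first span yields (0, "minor", "Rychlá").
```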
@@ -181,13 +186,15 @@ Add `validation` rules for tutorials or attention checks:
  ```python
  {
      "src": "The quick brown fox jumps.",
-     "tgt": "Rychlá hnědá liška skáče.",
-     "validation": {
-         "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
-         "score": [70, 80], # required score range [min, max]
-         "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
-         "allow_skip": true # show "skip tutorial" button
-     }
+     "tgt": ["Rychlá hnědá liška skáče."],
+     "validation": [
+         {
+             "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
+             "score": [70, 80], # required score range [min, max]
+             "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
+             "allow_skip": true # show "skip tutorial" button
+         }
+     ]
  }
  ```

@@ -196,8 +203,21 @@ Add `validation` rules for tutorials or attention checks:
  - **Loud attention checks**: Include `warning` without `allow_skip` to force retry
  - **Silent attention checks**: Omit `warning` to log failures without notification (quality control)

- For listwise, `validation` is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).
- See [examples/tutorial_pointwise.json](examples/tutorial_pointwise.json) and [examples/tutorial_listwise.json](examples/tutorial_listwise.json).
+ The `validation` field is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).
+
+ **Score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
+ ```python
+ {
+     "src": "AI transforms industries.",
+     "tgt": ["UI transformuje průmysly.", "Umělá inteligence mění obory."],
+     "validation": [
+         {"warning": "A has error, score 20-40.", "score": [20, 40]},
+         {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": 0}
+     ]
+ }
+ ```
+ The `score_greaterthan` field specifies the index of the candidate that must have a lower score than the current candidate.
+ See [examples/tutorial_kway.json](examples/tutorial_kway.json).

  ### Single-stream Assignment

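To make the `score_greaterthan` semantics in the hunk above concrete, a hypothetical checker for such rules; `check_scores` is illustrative only and not pearmut's implementation:

```python
# Hypothetical checker: a rule passes when the score falls in its [min, max]
# range and, if score_greaterthan is set, exceeds the referenced candidate's score.
def check_scores(scores: list[float], validation: list[dict]) -> list[bool]:
    passed = []
    for i, rule in enumerate(validation):
        ok = True
        if "score" in rule:
            lo, hi = rule["score"]
            ok = ok and lo <= scores[i] <= hi
        if "score_greaterthan" in rule:
            # The referenced candidate must receive a lower score than candidate i.
            ok = ok and scores[i] > scores[rule["score_greaterthan"]]
        passed.append(ok)
    return passed

validation = [
    {"warning": "A has error, score 20-40.", "score": [20, 40]},
    {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": 0},
]
print(check_scores([30, 85], validation))  # [True, True]
print(check_scores([30, 25], validation))  # [True, False]
```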
@@ -207,10 +227,10 @@ All annotators draw from a shared pool with random assignment:
      "campaign_id": "my campaign 6",
      "info": {
          "assignment": "single-stream",
-         "template": "pointwise",
-         "protocol_score": True, # collect scores
-         "protocol_error_spans": True, # collect error spans
-         "protocol_error_categories": False, # do not collect MQM categories, so ESA
+         # DA: scores
+         # MQM: error spans and categories
+         # ESA: error spans and scores
+         "protocol": "ESA",
          "users": 50, # number of annotators (can also be a list, see below)
      },
      "data": [...], # list of all items (shared among all annotators)
@@ -288,30 +308,21 @@ Completion tokens are shown at annotation end for verification (download correct

  <img width="500" alt="Token on completion" src="https://github.com/user-attachments/assets/40eb904c-f47a-4011-aa63-9a4f1c501549" />

- ### Model Results Display
-
- Add `&results` to dashboard URL to show model rankings (requires valid token).
- Items need `model` field (pointwise) or `models` field (listwise) and the `protocol_score` needs to be enable such that the `score` can be used for the ranking:
- ```python
- {"doc_id": "1", "model": "CommandA", "src": "...", "tgt": "..."}
- {"doc_id": "2", "models": ["CommandA", "Claude"], "src": "...", "tgt": ["...", "..."]}
- ```
- See an example in [Campaign Management](#campaign-management)
-
+ When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.

  ## Terminology

  - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
  - **Campaign File**: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
- - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns.
+ - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
  - **Task**: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
- - **Item** A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
- - **Document** A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
+ - **Item**: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
+ - **Document**: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
  - **User** / **Annotator**: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
- - **Attention Check** A validation item with known correct answers used to ensure annotator quality. Can be:
+ - **Attention Check**: A validation item with known correct answers used to ensure annotator quality. Can be:
    - **Loud**: Shows warning message and forces retry on failure
    - **Silent**: Logs failures without notifying the user (for quality control analysis)
- - **Token** A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
+ - **Token**: A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
    - **Pass Token** (`token_pass`): Shown when user meets validation thresholds
    - **Fail Token** (`token_fail`): Shown when user fails to meet validation requirements
  - **Tutorial**: An instructional validation item that teaches users how to annotate. Includes `allow_skip: true` to let users skip if they have seen it before.
@@ -320,11 +331,9 @@ See an example in [Campaign Management](#campaign-management)
  - **Dashboard**: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
  - **Protocol**: The annotation scheme defining what data is collected:
    - **Score**: Numeric quality rating (0-100)
-   - **Error Spans**: Text highlights marking errors
+   - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
    - **Error Categories**: MQM taxonomy labels for errors
- - **Template**: The annotation interface type:
-   - **Pointwise**: Evaluate one output at a time
-   - **Listwise**: Compare multiple outputs simultaneously
+ - **Template**: The annotation interface type. The `basic` template supports comparing multiple outputs simultaneously.
  - **Assignment**: The method for distributing items to users:
    - **Task-based**: Each user has predefined items
    - **Single-stream**: Users draw from a shared pool with random assignment
@@ -355,7 +364,7 @@ pearmut run
  2. Add build rule to `webpack.config.js`
  3. Reference as `info->template` in campaign JSON

- See [web/src/pointwise.ts](web/src/pointwise.ts) for example.
+ See [web/src/basic.ts](web/src/basic.ts) for example.

  ### Deployment

pearmut-0.3.0.dist-info/RECORD ADDED
@@ -0,0 +1,17 @@
+ pearmut/app.py,sha256=IZNmeKTAuLcf9FggvlHktWDbIGxfykjSRM-sI8Byfik,10179
+ pearmut/assignment.py,sha256=_0hNXtA-Mgn6bRyRVjgeGxERKRvBezR3NmEwx2uME38,11685
+ pearmut/cli.py,sha256=tYzCs7bTuKpt8pIbv8L5SpFHjIVteYyo12KWdrWT1U0,20642
+ pearmut/utils.py,sha256=Rl_i-WCaJN3p_VG5iVL0fSeI481jcJUUEZO6HKx62PE,4347
+ pearmut/static/basic.bundle.js,sha256=9v9jfKgcHUMaNHwra5-Dhxy6LR29OyOdUd79dvR-cb4,110459
+ pearmut/static/basic.html,sha256=Nm0t3uGsbUUso_lFpIpMMEe9iBEDS_Og4tz5vdWhJGo,5473
+ pearmut/static/dashboard.bundle.js,sha256=djacPNoKpxtSP0CzAdEmgPocDyBO0ihFUriCw_RJOhQ,100630
+ pearmut/static/dashboard.html,sha256=HXZzoz44f7LYtAfuP7uQioxTkNmo2_fAN0v2C2s1lAs,2680
+ pearmut/static/favicon.svg,sha256=gVPxdBlyfyJVkiMfh8WLaiSyH4lpwmKZs8UiOeX8YW4,7347
+ pearmut/static/index.html,sha256=yMttallApd0T7sxngUrdwCDrtTQpRIFF0-4W0jfXejU,835
+ pearmut/static/style.css,sha256=hI_Mbvq6BbXfsp-WMpx73tsOL_6QflgrSV1um-3c-hU,4101
+ pearmut-0.3.0.dist-info/licenses/LICENSE,sha256=GtR6RcTdRn-P23h5pKFuWSLZrLPD0ytHAwSOBt7aLpI,1071
+ pearmut-0.3.0.dist-info/METADATA,sha256=DELOuCdyDU6nOM8H2b-gCCL5JtlFKMpUJe8BoybZaoQ,15453
+ pearmut-0.3.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+ pearmut-0.3.0.dist-info/entry_points.txt,sha256=eEA9LVWsS3neQbMvL_nMvEw8I0oFudw8nQa1iqxOiWM,45
+ pearmut-0.3.0.dist-info/top_level.txt,sha256=CdgtUM-SKQDt6o5g0QreO-_7XTBP9_wnHMS1P-Rl5Go,8
+ pearmut-0.3.0.dist-info/RECORD,,