pearmut 0.2.11__py3-none-any.whl → 0.3.1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
pearmut/utils.py CHANGED
@@ -7,37 +7,6 @@ ROOT = "."
  RESET_MARKER = "__RESET__"


- def highlight_differences(a, b):
-     """
-     Compares two strings and wraps their differences in HTML span tags.
-
-     Args:
-         a: The first string.
-         b: The second string.
-
-     Returns:
-         A tuple containing the two strings with their differences highlighted.
-     """
-     import difflib
-     # TODO: maybe on the level of words?
-     s = difflib.SequenceMatcher(None, a, b)
-     res_a, res_b = [], []
-     span_open = '<span class="difference">'
-     span_close = '</span>'
-
-     for tag, i1, i2, j1, j2 in s.get_opcodes():
-         if tag == 'equal' or (i2-i1 <= 2 and j2-j1 <= 2):
-             res_a.append(a[i1:i2])
-             res_b.append(b[j1:j2])
-         else:
-             if tag in ('replace', 'delete'):
-                 res_a.append(f"{span_open}{a[i1:i2]}{span_close}")
-             if tag in ('replace', 'insert'):
-                 res_b.append(f"{span_open}{b[j1:j2]}{span_close}")
-
-     return "".join(res_a), "".join(res_b)
-
-
  def load_progress_data(warn: str | None = None):
      if not os.path.exists(f"{ROOT}/data/progress.json"):
          if warn is not None:
@@ -94,7 +63,7 @@ def get_db_log_item(campaign_id: str, user_id: str | None, item_i: int | None) -
      # Find the last reset marker for this user (if any)
      last_reset_idx = -1
      for i, entry in enumerate(matching):
-         if entry.get("annotations") == RESET_MARKER:
+         if entry.get("annotation") == RESET_MARKER:
              last_reset_idx = i

      # Return only entries after the last reset
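The only change in this hunk is the key checked against `RESET_MARKER`: `annotations` becomes `annotation`. Below is a minimal, self-contained sketch of the surrounding filter logic; the helper name and log structure are illustrative, not pearmut's actual code.
```python
# Illustrative sketch only, not the pearmut implementation: entries logged
# before the user's last "__RESET__" marker are dropped.
RESET_MARKER = "__RESET__"

def entries_after_last_reset(matching: list[dict]) -> list[dict]:
    # Find the last reset marker for this user (if any)
    last_reset_idx = -1
    for i, entry in enumerate(matching):
        if entry.get("annotation") == RESET_MARKER:
            last_reset_idx = i
    # Return only entries after the last reset
    return matching[last_reset_idx + 1:]

log = [
    {"annotation": {"score": 80}},
    {"annotation": RESET_MARKER},
    {"annotation": {"score": 65}},
]
assert entries_after_last_reset(log) == [{"annotation": {"score": 65}}]
```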
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: pearmut
- Version: 0.2.11
+ Version: 0.3.1
  Summary: A tool for evaluation of model outputs, primarily MT.
  Author-email: Vilém Zouhar <vilem.zouhar@gmail.com>
  License: MIT
@@ -20,7 +20,7 @@ Dynamic: license-file

  # Pearmut 🍐

- **Platform for Evaluation and Reviewing of Multilingual Tasks** Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).
+ **Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).

  [![PyPi version](https://badgen.net/pypi/v/pearmut/)](https://pypi.org/project/pearmut)
  &nbsp;
@@ -38,7 +38,6 @@ Dynamic: license-file
  - [Campaign Configuration](#campaign-configuration)
  - [Basic Structure](#basic-structure)
  - [Assignment Types](#assignment-types)
- - [Protocol Templates](#protocol-templates)
  - [Advanced Features](#advanced-features)
  - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
  - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
@@ -51,19 +50,16 @@ Dynamic: license-file
  - [Development](#development)
  - [Citation](#citation)

-
- **Error Span** — A highlighted segment of text marked as containing an error, with optional severity (`minor`, `major`, `neutral`) and MQM category labels.
-
  ## Quick Start

  Install and run locally without cloning:
  ```bash
  pip install pearmut
  # Download example campaigns
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa_encs.json
- wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da_enuk.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
+ wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
  # Load and start
- pearmut add esa_encs.json da_enuk.json
+ pearmut add esa.json da.json
  pearmut run
  ```

@@ -76,10 +72,10 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
  {
      "info": {
          "assignment": "task-based",
-         "template": "pointwise",
-         "protocol_score": true, # we want scores [0...100] for each segment
-         "protocol_error_spans": true, # we want error spans
-         "protocol_error_categories": false, # we do not want error span categories
+         # DA: scores
+         # ESA: error spans and scores
+         # MQM: error spans, categories, and scores
+         "protocol": "ESA",
      },
      "campaign_id": "wmt25_#_en-cs_CZ",
      "data": [
@@ -90,11 +86,11 @@ Campaigns are defined in JSON files (see [examples/](examples/)). The simplest c
          {
              "instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
              "src": "This will be the year that Guinness loses its cool. Cheers to that!",
-             "tgt": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."
+             "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."}
          },
          {
              "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
-             "tgt": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"
+             "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"}
          }
          ...
      ],
@@ -114,11 +110,11 @@ Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dicti
  [
      {
          "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.", # required
-         "tgt": "And suddenly all the water became full of other people and other people." # required
+         "tgt": {"modelA": "And suddenly all the water became full of other people and other people."} # required (dict)
      },
      {
          "src": "toto je pokračování stejného dokumentu",
-         "tgt": "this is a continuation of the same document"
+         "tgt": {"modelA": "this is a continuation of the same document"}
          # Additional keys stored for analysis
      }
  ]
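The hunks above change the campaign format in tandem: the `info` block now takes a single `protocol` name (DA, ESA, or MQM) instead of boolean flags, and each item's `tgt` becomes a dict keyed by model name. The rough migration sketch below converts a 0.2.x-style campaign dict to the new shape; the function name, the fallback key `"modelA"`, and the handling of `data` shapes are assumptions, and it does not migrate `error_spans` or `validation`.
```python
# Illustrative migration sketch, not an official pearmut tool.
def migrate_campaign(campaign: dict, model_name: str = "modelA") -> dict:
    info = campaign["info"]

    # Old boolean flags -> a single protocol name, per the comments in the diff:
    # DA = scores only, ESA = spans + scores, MQM = spans + categories + scores.
    if info.pop("protocol_error_categories", False):
        info["protocol"] = "MQM"
    elif info.pop("protocol_error_spans", False):
        info["protocol"] = "ESA"
    else:
        info["protocol"] = "DA"
    info.pop("protocol_score", None)
    info.pop("template", None)

    # Old "tgt" values (string or list) -> dict keyed by model name.
    def fix_item(item: dict) -> None:
        tgt = item.get("tgt")
        if isinstance(tgt, str):
            item["tgt"] = {model_name: tgt}
        elif isinstance(tgt, list):
            item["tgt"] = {f"model{i}": t for i, t in enumerate(tgt, start=1)}

    for entry in campaign["data"]:
        # "data" may hold items directly or tasks (lists of items).
        for item in (entry if isinstance(entry, list) else [entry]):
            fix_item(item)
    return campaign
```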
@@ -136,16 +132,23 @@ pearmut run
  - **`single-stream`**: All users draw from a shared pool (random assignment)
  - **`dynamic`**: work in progress ⚠️

- ### Protocol Templates
+ ## Advanced Features

- - **Pointwise**: Evaluate single output against single input
-   - `protocol_score`: Collect scores [0-100]
-   - `protocol_error_spans`: Collect error span highlights
-   - `protocol_error_categories`: Collect MQM category labels
- - **Listwise**: Evaluate multiple outputs simultaneously
-   - Same protocol options as pointwise
+ ### Shuffling Model Translations

- ## Advanced Features
+ By default, Pearmut randomly shuffles the order in which models are shown per each item in order to avoid positional bias.
+ The `shuffle` parameter in campaign `info` controls this behavior:
+ ```python
+ {
+     "info": {
+         "assignment": "task-based",
+         "protocol": "ESA",
+         "shuffle": true # Default: true. Set to false to disable shuffling.
+     },
+     "campaign_id": "my_campaign",
+     "data": [...]
+ }
+ ```

  ### Pre-filled Error Spans (ESA<sup>AI</sup>)

@@ -154,25 +157,27 @@ Include `error_spans` to pre-fill annotations that users can review, modify, or
  ```python
  {
      "src": "The quick brown fox jumps over the lazy dog.",
-     "tgt": "Rychlá hnědá liška skáče přes líného psa.",
-     "error_spans": [
-         {
-             "start_i": 0, # character index start (inclusive)
-             "end_i": 5, # character index end (inclusive)
-             "severity": "minor", # "minor", "major", "neutral", or null
-             "category": null # MQM category string or null
-         },
-         {
-             "start_i": 27,
-             "end_i": 32,
-             "severity": "major",
-             "category": null
-         }
-     ]
+     "tgt": {"modelA": "Rychlá hnědá liška skáče přes líného psa."},
+     "error_spans": {
+         "modelA": [
+             {
+                 "start_i": 0, # character index start (inclusive)
+                 "end_i": 5, # character index end (inclusive)
+                 "severity": "minor", # "minor", "major", "neutral", or null
+                 "category": null # MQM category string or null
+             },
+             {
+                 "start_i": 27,
+                 "end_i": 32,
+                 "severity": "major",
+                 "category": null
+             }
+         ]
+     }
  }
  ```

- For **listwise** template, `error_spans` is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).
+ The `error_spans` field is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).

  ### Tutorial and Attention Checks

@@ -181,12 +186,16 @@ Add `validation` rules for tutorials or attention checks:
  ```python
  {
      "src": "The quick brown fox jumps.",
-     "tgt": "Rychlá hnědá liška skáče.",
+     "tgt": {"modelA": "Rychlá hnědá liška skáče."},
      "validation": {
-         "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
-         "score": [70, 80], # required score range [min, max]
-         "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
-         "allow_skip": true # show "skip tutorial" button
+         "modelA": [
+             {
+                 "warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
+                 "score": [70, 80], # required score range [min, max]
+                 "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
+                 "allow_skip": true # show "skip tutorial" button
+             }
+         ]
      }
  }
  ```
@@ -196,22 +205,25 @@ Add `validation` rules for tutorials or attention checks:
  - **Loud attention checks**: Include `warning` without `allow_skip` to force retry
  - **Silent attention checks**: Omit `warning` to log failures without notification (quality control)

- For listwise, `validation` is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).
+ The `validation` field is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).

- **Listwise score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
+ **Score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
  ```python
  {
      "src": "AI transforms industries.",
-     "tgt": ["UI transformuje průmysly.", "Umělá inteligence mění obory."],
-     "validation": [
-         {"warning": "A has error, score 20-40.", "score": [20, 40]},
-         {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": 0}
-     ]
+     "tgt": {"A": "UI transformuje průmysly.", "B": "Umělá inteligence mění obory."},
+     "validation": {
+         "A": [
+             {"warning": "A has error, score 20-40.", "score": [20, 40]}
+         ],
+         "B": [
+             {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": "A"}
+         ]
+     }
  }
  ```
  The `score_greaterthan` field specifies the index of the candidate that must have a lower score than the current candidate.
-
- See [examples/tutorial_pointwise.json](examples/tutorial_pointwise.json), [examples/tutorial_listwise.json](examples/tutorial_listwise.json), and [examples/tutorial_listwise_score_greaterthan.json](examples/tutorial_listwise_score_greaterthan.json).
+ See [examples/tutorial_kway.json](examples/tutorial_kway.json).

  ### Single-stream Assignment

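In the new format, `validation` keys its rules by the model names used in `tgt`, `score` is an inclusive `[min, max]` range, and `score_greaterthan` names another model whose score must be lower. A minimal sketch of how such rules could be checked against collected scores, assuming a hypothetical helper and a plain `{model: score}` dict; pearmut's actual validator may differ.
```python
# Illustrative sketch (assumed helper, not pearmut's validator): return the
# warnings of rules that the given per-model scores do not satisfy.
def failed_validation_warnings(validation: dict, scores: dict[str, float]) -> list[str]:
    warnings = []
    for model, rules in validation.items():
        for rule in rules:
            ok = True
            if "score" in rule:
                lo, hi = rule["score"]
                ok = ok and lo <= scores[model] <= hi
            if "score_greaterthan" in rule:
                ok = ok and scores[model] > scores[rule["score_greaterthan"]]
            if not ok and "warning" in rule:
                warnings.append(rule["warning"])
    return warnings

validation = {
    "A": [{"warning": "A has error, score 20-40.", "score": [20, 40]}],
    "B": [{"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": "A"}],
}
print(failed_validation_warnings(validation, {"A": 35, "B": 80}))  # []
print(failed_validation_warnings(validation, {"A": 90, "B": 80}))  # both warnings
```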
@@ -221,10 +233,10 @@ All annotators draw from a shared pool with random assignment:
      "campaign_id": "my campaign 6",
      "info": {
          "assignment": "single-stream",
-         "template": "pointwise",
-         "protocol_score": True, # collect scores
-         "protocol_error_spans": True, # collect error spans
-         "protocol_error_categories": False, # do not collect MQM categories, so ESA
+         # DA: scores
+         # MQM: error spans and categories
+         # ESA: error spans and scores
+         "protocol": "ESA",
          "users": 50, # number of annotators (can also be a list, see below)
      },
      "data": [...], # list of all items (shared among all annotators)
@@ -302,30 +314,21 @@ Completion tokens are shown at annotation end for verification (download correct

  <img width="500" alt="Token on completion" src="https://github.com/user-attachments/assets/40eb904c-f47a-4011-aa63-9a4f1c501549" />

- ### Model Results Display
-
- Add `&results` to dashboard URL to show model rankings (requires valid token).
- Items need `model` field (pointwise) or `models` field (listwise) and the `protocol_score` needs to be enable such that the `score` can be used for the ranking:
- ```python
- {"doc_id": "1", "model": "CommandA", "src": "...", "tgt": "..."}
- {"doc_id": "2", "models": ["CommandA", "Claude"], "src": "...", "tgt": ["...", "..."]}
- ```
- See an example in [Campaign Management](#campaign-management)
-
+ When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.

  ## Terminology
  - **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
  - **Campaign File**: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
- - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns.
+ - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
  - **Task**: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
- - **Item** A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
- - **Document** A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
+ - **Item**: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
+ - **Document**: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
  - **User** / **Annotator**: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
- - **Attention Check** A validation item with known correct answers used to ensure annotator quality. Can be:
+ - **Attention Check**: A validation item with known correct answers used to ensure annotator quality. Can be:
    - **Loud**: Shows warning message and forces retry on failure
    - **Silent**: Logs failures without notifying the user (for quality control analysis)
- - **Token** A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
+ - **Token**: A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
    - **Pass Token** (`token_pass`): Shown when user meets validation thresholds
    - **Fail Token** (`token_fail`): Shown when user fails to meet validation requirements
  - **Tutorial**: An instructional validation item that teaches users how to annotate. Includes `allow_skip: true` to let users skip if they have seen it before.
@@ -334,11 +337,9 @@ See an example in [Campaign Management](#campaign-management)
  - **Dashboard**: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
  - **Protocol**: The annotation scheme defining what data is collected:
    - **Score**: Numeric quality rating (0-100)
-   - **Error Spans**: Text highlights marking errors
+   - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
    - **Error Categories**: MQM taxonomy labels for errors
- - **Template**: The annotation interface type:
-   - **Pointwise**: Evaluate one output at a time
-   - **Listwise**: Compare multiple outputs simultaneously
+ - **Template**: The annotation interface type. The `basic` template supports comparing multiple outputs simultaneously.
  - **Assignment**: The method for distributing items to users:
    - **Task-based**: Each user has predefined items
    - **Single-stream**: Users draw from a shared pool with random assignment
@@ -369,7 +370,7 @@ pearmut run
  2. Add build rule to `webpack.config.js`
  3. Reference as `info->template` in campaign JSON

- See [web/src/pointwise.ts](web/src/pointwise.ts) for example.
+ See [web/src/basic.ts](web/src/basic.ts) for example.

  ### Deployment

@@ -0,0 +1,17 @@
+ pearmut/app.py,sha256=IZNmeKTAuLcf9FggvlHktWDbIGxfykjSRM-sI8Byfik,10179
+ pearmut/assignment.py,sha256=_0hNXtA-Mgn6bRyRVjgeGxERKRvBezR3NmEwx2uME38,11685
+ pearmut/cli.py,sha256=tYzCs7bTuKpt8pIbv8L5SpFHjIVteYyo12KWdrWT1U0,20642
+ pearmut/utils.py,sha256=Rl_i-WCaJN3p_VG5iVL0fSeI481jcJUUEZO6HKx62PE,4347
+ pearmut/static/basic.bundle.js,sha256=9cz_5Jq0KgnWTwkuGqRT2eAY3FHQJM2f2OP1RnNi0s4,110582
+ pearmut/static/basic.html,sha256=Nm0t3uGsbUUso_lFpIpMMEe9iBEDS_Og4tz5vdWhJGo,5473
+ pearmut/static/dashboard.bundle.js,sha256=djacPNoKpxtSP0CzAdEmgPocDyBO0ihFUriCw_RJOhQ,100630
+ pearmut/static/dashboard.html,sha256=HXZzoz44f7LYtAfuP7uQioxTkNmo2_fAN0v2C2s1lAs,2680
+ pearmut/static/favicon.svg,sha256=gVPxdBlyfyJVkiMfh8WLaiSyH4lpwmKZs8UiOeX8YW4,7347
+ pearmut/static/index.html,sha256=yMttallApd0T7sxngUrdwCDrtTQpRIFF0-4W0jfXejU,835
+ pearmut/static/style.css,sha256=hI_Mbvq6BbXfsp-WMpx73tsOL_6QflgrSV1um-3c-hU,4101
+ pearmut-0.3.1.dist-info/licenses/LICENSE,sha256=GtR6RcTdRn-P23h5pKFuWSLZrLPD0ytHAwSOBt7aLpI,1071
+ pearmut-0.3.1.dist-info/METADATA,sha256=_8Wp8dbCNV9glYKPfqrAN_AV9G3WeytqcgTzjoMeDnU,15606
+ pearmut-0.3.1.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+ pearmut-0.3.1.dist-info/entry_points.txt,sha256=eEA9LVWsS3neQbMvL_nMvEw8I0oFudw8nQa1iqxOiWM,45
+ pearmut-0.3.1.dist-info/top_level.txt,sha256=CdgtUM-SKQDt6o5g0QreO-_7XTBP9_wnHMS1P-Rl5Go,8
+ pearmut-0.3.1.dist-info/RECORD,,
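Each row of the new RECORD lists a file path, an sha256 digest (urlsafe base64 without padding), and a size in bytes, following the standard wheel RECORD format. Below is a small sketch for verifying an installed file against such a row; the helper is illustrative and not part of pearmut.
```python
# Illustrative sketch: verify one file against a wheel RECORD entry.
# RECORD rows are "path,sha256=<urlsafe-b64 digest without padding>,<size>".
import base64
import hashlib
from pathlib import Path

def check_record_entry(root: Path, row: str) -> bool:
    path, hash_field, size = row.rsplit(",", 2)
    if not hash_field:  # the RECORD file itself has an empty hash field
        return True
    data = (root / path).read_bytes()
    digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest()).rstrip(b"=").decode()
    return hash_field == f"sha256={digest}" and (not size or int(size) == len(data))

# Example (paths are relative to the installed site-packages directory):
row = "pearmut/utils.py,sha256=Rl_i-WCaJN3p_VG5iVL0fSeI481jcJUUEZO6HKx62PE,4347"
print(check_record_entry(Path("site-packages"), row))
```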