span-aligner 0.1.2__tar.gz → 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,283 @@
1
+ Metadata-Version: 2.4
2
+ Name: span-aligner
3
+ Version: 0.2.0
4
+ Summary: A utility for aligning and mapping text spans between different text representations.
5
+ License: MIT
6
+ Requires-Python: >=3.8
7
+ Description-Content-Type: text/markdown
8
+ License-File: LICENSE
9
+ Requires-Dist: numpy<3,>=1.26
10
+ Requires-Dist: scipy<2,>=1.13
11
+ Requires-Dist: networkx<4,>=3.3
12
+ Requires-Dist: rapidfuzz<4,>=3.13
13
+ Requires-Dist: regex>=2024.9
14
+ Requires-Dist: transformers<6,>=4.41
15
+ Requires-Dist: torch<3,>=2.5
16
+ Requires-Dist: scikit-learn<2,>=1.7
17
+ Requires-Dist: requests<3,>=2.32
18
+ Provides-Extra: dev
19
+ Requires-Dist: pytest>=7.0.0; extra == "dev"
20
+ Dynamic: license-file
21
+
22
+ # Span Projecting & Alignment
23
+
24
+ A utility for aligning and mapping text spans between different text representations, and projecting annotations across languages using semantic alignment.
25
+
26
+ ## Features
27
+
28
+ - **Span Alignment**: Sanitize boundaries, fuzzy match segments, map spans between text versions.
29
+ - **Span Projection**: Project annotations from a source text (e.g., English) to a target text (e.g., Dutch) using embeddings.
30
+
31
+ ## Installation
32
+
33
+ Install the package with pip:
34
+
35
+ ```bash
36
+ pip install span-aligner
37
+ ```
38
+
39
+ ## Usage
40
+
41
+ The package `span_aligner` provides two main classes: `SpanAligner` and `SpanProjector`.
42
+
43
+ * **`SpanAligner`**:
44
+ Uses regex and fuzzy search. It is highly efficient but restricted to **monolingual** tasks (same language). It serves as a strong baseline for correcting boundary offsets or mapping annotations between slightly different versions of a text.
45
+
46
+ * **`SpanProjector`**:
47
+ Uses **word embeddings** (Transformers) to align tokens semantically. It supports **cross-lingual** projection and handles significant paraphrasing. However, it is computationally more expensive.
48
+ * *Complexity*: The default `mwmf` (Max Weight Matching) algorithm has a complexity of **O(n³)**, meaning execution time grows cubically with text length (doubling the length takes roughly eight times as long).
49
+ * *Use Case*: Use when languages differ or when textual differences are too great for fuzzy matching.
50
+
51
+ ## Optimization & Best Practices
52
+
53
+ To achieve the best results while managing computational cost, follow these guidelines:
54
+
55
+ ### 1. Choose the Right Tool for the Job
56
+ If the source and target texts are in the same language, **always start with `SpanAligner`**. It is significantly faster and creates precise splits. Only switch to `SpanProjector` if fuzzy matching fails due to low textual overlap.
57
+
58
+ ### 2. Manage Text Length (Chunking)
59
+ The `SpanProjector` (specifically with `mwmf`) struggles with very long sequences.
60
+ * **Split Texts**: Break documents into logical segments (e.g., paragraphs, decisions, list items) before projection.
61
+ * **Project Locally**: Align spans within their corresponding segments rather than projecting a small span against an entire document.
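The effect of chunking on an O(n³) matcher can be estimated with a small sketch. It uses only the standard library; `split_paragraphs` and `cubic_cost` are hypothetical helpers for illustration, not part of `span_aligner`, and whitespace-separated words stand in for real tokens.

```python
import re

def split_paragraphs(text):
    """Split on blank lines and drop empty segments (hypothetical helper)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def cubic_cost(n_tokens):
    """Relative cost of an O(n^3) matcher on n_tokens tokens."""
    return n_tokens ** 3

doc = "First decision text.\n\nSecond decision text.\n\nThird decision text."
segments = split_paragraphs(doc)

# Cost of aligning the whole document at once vs. segment by segment.
whole = cubic_cost(sum(len(s.split()) for s in segments))
chunked = sum(cubic_cost(len(s.split())) for s in segments)

print(segments)
# ['First decision text.', 'Second decision text.', 'Third decision text.']
print(whole / chunked)
# 9.0
```

Splitting a text into k equal segments cuts the estimated matching cost by a factor of about k², which is why projecting within corresponding segments pays off quickly.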
62
+
63
+ ### 3. Select the Appropriate Algorithm
64
+ * **`mwmf`** (Max Weight Matching): The gold standard. Finds the globally optimal alignment but is slow. Use for final, high-quality output on segmented text.
65
+ * **`inter`** (Intersection): Much faster. Works excellently for **short, distinct spans** (e.g., named entities like persons, locations, dates) where context is less critical.
66
+ * **`itermax`**: A balanced heuristic that offers better speed than `mwmf` with comparable quality for many tasks.
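To make the `inter` idea concrete, here is a minimal pure-Python sketch on a toy similarity matrix. It assumes that `inter` keeps only the pairs chosen by both the forward (source to target) and backward (target to source) best-match passes, as in `simalign`; it is an illustration, not the package's implementation, and the matrix values are made up.

```python
def argmax_rows(sim):
    """Best target index for each source token (forward pass)."""
    return {(i, max(range(len(row)), key=row.__getitem__)) for i, row in enumerate(sim)}

def argmax_cols(sim):
    """Best source index for each target token (backward pass)."""
    n_cols = len(sim[0])
    return {(max(range(len(sim)), key=lambda i: sim[i][j]), j) for j in range(n_cols)}

# Rows are source tokens, columns are target tokens (made-up similarities).
sim = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.7],
    [0.1, 0.6, 0.3],
]

forward = argmax_rows(sim)
backward = argmax_cols(sim)
inter = sorted(forward & backward)  # keep only pairs both directions agree on

print(inter)
# [(0, 0), (1, 1)]
```

Source token 2 stays unaligned because the two directions disagree about it, which is the high-precision, lower-recall behaviour that suits short, distinct spans.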
67
+
68
+ ### 4. Translation-Assisted Projection (Hybrid Approach)
69
+ If direct cross-lingual projection yields subpar results, consider an intermediate translation step to simplify the alignment task:
70
+
71
+ 1. **Translate Source**: Use an LLM or NMT model to translate the annotated source text (or just the spans) into the target language.
72
+ 2. **Align Locally**: Use `SpanAligner` (or `SpanProjector` with `inter`) to map the *translated* spans onto the *actual* target text.
73
+
74
+ **Tip**: The translation should mimic the vocabulary of the target text as closely as possible.
75
+ * *Workflow*: `annotated_source` + `target_text` → **LLM** → `rough_translated_source` → **SpanAligner** → `final_annotated_target`
76
+
77
+
78
+
79
+ ### Span Aligner
80
+
81
+ Utilities for exact and fuzzy span mapping.
82
+
83
+ #### Get Annotations from Tagged Text
84
+
85
+ Extract structured spans and entities from a string with inline tags.
86
+
87
+ ```python
88
+ from span_aligner import SpanAligner
89
+
90
+ tagged_input = "<administrative_body>Environmental Committee</administrative_body> discussed the <impact_location>central park</impact_location> renovation on <publication_date>2025-12-15</publication_date>."
91
+
92
+ ner_map = {
93
+ "administrative_body": "ADMINISTRATIVE BODY",
94
+ "publication_date": "PUBLICATION DATE",
95
+ "impact_location": "PRIMARY LOCATION"
96
+ }
97
+
98
+ span_map = {
99
+ "motivation": "MOTIVATION"
100
+ }
101
+
102
+ annotations = SpanAligner.get_annotations_from_tagged_text(
103
+ tagged_input,
104
+ ner_map=ner_map,
105
+ span_map=span_map
106
+ )
107
+
108
+ print(annotations["entities"])
109
+ # Output:
110
+ #[
111
+ # {'start': 0, 'end': 23, 'text': 'Environmental Committee', 'labels': ['ADMINISTRATIVE BODY']},
112
+ # {'start': 38, 'end': 50, 'text': 'central park', 'labels': ['PRIMARY LOCATION']},
113
+ # {'start': 65, 'end': 75, 'text': '2025-12-15', 'labels': ['PUBLICATION DATE']}
114
+ #]
115
+ ```
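The offsets in the output above refer to the text with the tags stripped. The following standard-library sketch reproduces them for the same input; `extract_entities` is a hypothetical helper that assumes simple, non-nested tags, and it is not how `SpanAligner` is implemented.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z_]+)>")

def extract_entities(tagged, label_map):
    """Strip inline tags and record each tagged span's offsets in the plain text."""
    plain, entities, open_tags = "", [], {}
    pos = 0
    for m in TAG.finditer(tagged):
        plain += tagged[pos:m.start()]
        pos = m.end()
        closing, name = m.groups()
        if not closing:
            open_tags[name] = len(plain)       # span starts at current plain-text length
        elif name in open_tags:
            start = open_tags.pop(name)
            entities.append({
                "start": start,
                "end": len(plain),
                "text": plain[start:],
                "labels": [label_map.get(name, name)],
            })
    plain += tagged[pos:]
    return plain, entities

tagged_input = (
    "<administrative_body>Environmental Committee</administrative_body> discussed the "
    "<impact_location>central park</impact_location> renovation on "
    "<publication_date>2025-12-15</publication_date>."
)
ner_map = {
    "administrative_body": "ADMINISTRATIVE BODY",
    "publication_date": "PUBLICATION DATE",
    "impact_location": "PRIMARY LOCATION",
}

plain, entities = extract_entities(tagged_input, ner_map)
print([(e["start"], e["end"], e["text"]) for e in entities])
# [(0, 23, 'Environmental Committee'), (38, 50, 'central park'), (65, 75, '2025-12-15')]
```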
116
+
117
+ #### Rebuild Tagged Text
118
+
119
+ Reconstruct a string with XML-like tags from raw text and span/entity lists.
120
+
121
+ ```python
122
+ from span_aligner import SpanAligner
123
+
124
+ text = "On 2026-01-12, the Budget Committee finalized the annual report."
125
+ # Entities corresponding to 'ADMINISTRATIVE BODY' label (indices skip "the ")
126
+ entities = [{"start": 19, "end": 35, "labels": ["administrative_body"]}]
127
+
128
+ tagged, stats = SpanAligner.rebuild_tagged_text(text, entities=entities)
129
+ print(tagged)
130
+ # Output: On 2026-01-12, the <administrative_body>Budget Committee</administrative_body> finalized the annual report.
131
+ ```
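The reverse operation can be sketched in a few lines of plain Python. `insert_tags` is a hypothetical helper that assumes non-overlapping spans and uses the first label as the tag name; the real `rebuild_tagged_text` also returns statistics and may treat overlaps differently.

```python
def insert_tags(text, entities):
    """Wrap each span in <label>...</label>, working right to left so offsets stay valid."""
    out = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        tag = ent["labels"][0]
        start, end = ent["start"], ent["end"]
        out = f"{out[:start]}<{tag}>{out[start:end]}</{tag}>{out[end:]}"
    return out

text = "On 2026-01-12, the Budget Committee finalized the annual report."
entities = [{"start": 19, "end": 35, "labels": ["administrative_body"]}]

tagged = insert_tags(text, entities)
print(tagged)
# On 2026-01-12, the <administrative_body>Budget Committee</administrative_body> finalized the annual report.
```

Processing spans from the end of the string backwards means earlier offsets are never shifted by tags that have already been inserted.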
132
+
133
+ #### Map Tags to Original
134
+
135
+ Align annotated spans from a tagged string back to their positions in the original text, allowing for noisy text or translation differences.
136
+
137
+ ```python
138
+ from span_aligner import SpanAligner
139
+
140
+ original_text = "Budget Committee met on 2026-01-12 to view\n\n the central park prject."
141
+ tagged_text = "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to review the <impact_location>central park</impact_location> project."
142
+
143
+ mapped_tagged_text = SpanAligner.map_tags_to_original(
144
+ original_text=original_text,
145
+ tagged_text=tagged_text,
146
+ min_ratio=0.7
147
+ )
148
+ print(mapped_tagged_text)
149
+ # Output preserves original text errors:
150
+ # "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to view
151
+ # the <impact_location>central park</impact_location> prject."
152
+ ```
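As a rough intuition for `min_ratio`, the sketch below treats it as a similarity threshold between 0 and 1, which is an assumption about the parameter's meaning. It uses the standard library's `difflib` as a stand-in; the package depends on `rapidfuzz`, whose scores are computed differently, so the exact numbers will not match.

```python
from difflib import SequenceMatcher

def similar(a, b, min_ratio):
    """Return True when the two strings are at least min_ratio similar (0..1)."""
    return SequenceMatcher(None, a, b).ratio() >= min_ratio

# A typo in the original text still clears a 0.7 threshold...
print(similar("project", "prject", 0.7))
# True
# ...while an unrelated word does not.
print(similar("budget", "prject", 0.7))
# False
```

Lowering the threshold tolerates noisier text at the cost of more false matches; raising it does the opposite.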
153
+
154
+ ### Span Projector
155
+
156
+ Project annotations from one text to another using semantic alignment (e.g., cross-lingual projection).
157
+
158
+ The process generates embeddings for both source and target texts, builds a token similarity matrix, and then selects a set of alignment pairs from it. Several algorithms are implemented for this matching phase, including `mwmf`, `inter`, `itermax`, `fwd`, `rev`, `greedy`, and `threshold`.
159
+
160
+
161
+
162
+ #### Project En -> En (Identity/Paraphrase)
163
+
164
+ Project annotations to a similar text in the same language. This works much like `SpanAligner`, but with embedding-based matching that tolerates larger textual differences.
165
+
166
+ ```python
167
+ from span_aligner import SpanProjector
168
+
169
+ # Initialize projector (uses BERT embeddings by default)
170
+ projector = SpanProjector(src_lang="en", tgt_lang="en")
171
+
172
+ src_text = "The <ent>cat</ent> \n\n sat. on the mat."
173
+ tgt_text = "The cat sat on the mat."
174
+
175
+ tagged_tgt, spans = projector.project_tagged_text(src_text, tgt_text)
176
+ print(tagged_tgt)
177
+ # Output: The <ent>cat</ent> sat on the mat.
178
+ ```
179
+
180
+ #### Project En -> Nl (Cross-Lingual)
181
+
182
+ Project annotations from an English source text to a Dutch target translation.
183
+
184
+ ```python
185
+ from span_aligner import SpanProjector
186
+
187
+ # Initialize projector
188
+ projector = SpanProjector(src_lang="en", tgt_lang="nl")
189
+
190
+ src_text = """DECISION LIST <contextual_location>Municipality of Zele</contextual_location>
191
+ <administrative_body>Standing Committee</administrative_body> | <contextual_date>June 28, 2021</contextual_date>
192
+ <title>1. Acceptance of candidacies for the examination procedure coordinator of Welfare</title>
193
+ <decision>Acceptance of candidacies for the examination procedure coordinator of Welfare</decision>
194
+ <title>2. Establishment of valuation rules for the integrated entity Municipality and Public Social Welfare Center (OCMW)</title>
195
+ <decision>Establishment of valuation rules for the integrated entity Municipality and OCMW</decision>"""
196
+
197
+ tgt_text = """BESLUITENLIJST Gemeente Zele Vast bureau | 28 juni 20211.
198
+ 1. Aanvaarden kandidaturen examenprocedure coördinator Welzijn
199
+ Aanvaarden kandidaturen examenprocedure coördinator Welzijn
200
+ 2. Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW
201
+ Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW"""
202
+
203
+ tagged_tgt, spans = projector.project_tagged_text(src_text, tgt_text)
204
+ print(tagged_tgt)
205
+ # Output: BESLUITENLIJST <contextual_location>Gemeente Zele</contextual_location>
206
+ # <administrative_body>Vast bureau</administrative_body> <contextual_date>| 28 juni 20211</contextual_date>.
207
+ # <title>1. Aanvaarden kandidaturen examenprocedure coördinator Welzijn
208
+ # Aanvaarden kandidaturen examenprocedure coördinator</title> Welzijn
209
+ # <title>2. Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW</title>
210
+ # <decision>Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW</decision>
211
+
212
+ ```
213
+
214
+ ### Sentence Aligner
215
+
216
+ Low-level class for aligning tokens between two texts (sentences or paragraphs) using transformer embeddings. Based on the work of `simalign` but optimized for span mapping (partial alignment instead of full text) and customized for different embedding providers (Ollama, SaaS providers, Transformers, Sentence-Transformers).
217
+
218
+ #### Initialize Aligner
219
+
220
+ ```python
221
+ from span_aligner import SentenceAligner
222
+
223
+ # Use bert embeddings (default) with BPE tokenization
224
+ aligner = SentenceAligner(model="bert", token_type="bpe")
225
+
226
+ text_src = "This is a simple test sentence for alignment."
227
+ text_tgt = "Dit is een eenvoudige testzin voor uitlijning."
228
+ ```
229
+
230
+ #### Get Text Embeddings
231
+
232
+ Retrieve tokens and embedding vectors for a string.
233
+
234
+ ```python
235
+ tokens_src, vecs_src = aligner.get_text_embeddings(text_src)
236
+ print(f"Src tokens: {len(tokens_src)}, Vectors: {vecs_src.shape}")
237
+ # Output: Src tokens: 9, Vectors: (10, 768)
238
+ ```
239
+
240
+ #### Align Partial Substring
241
+
242
+ Find the alignment of a specific substring from source to target.
243
+
244
+ ```python
245
+ # Align "simple test"
246
+ res_sub = aligner.align_texts_partial_substring(text_src, text_tgt, "simple test")
247
+ print(f"Src tokens in result: {[t.text for t in res_sub.src_tokens]}")
248
+ # Output: Src tokens in result: ['simple', 'test']
249
+ ```
250
+
251
+ ## Configuration & Advanced Usage
252
+
253
+ ### Embedding Models
254
+
255
+ The `model` parameter supports common transformer models:
256
+
257
+ - `"bert"`: `bert-base-multilingual-cased` (Default, robust multilingual performance)
258
+ - `"xlmr"`: `xlm-roberta-base` (Strong cross-lingual transfer)
259
+ - `"xlmr-large"`: `xlm-roberta-large` (Higher accuracy, more resource intensive)
260
+
261
+ ```python
262
+ # Use xlm-roberta-base
263
+ projector = SpanProjector(model="xlmr")
264
+ ```
265
+
266
+ ### Matching Algorithms
267
+
268
+ The `matching_method` parameter controls how the token similarity matrix is converted into an alignment.
269
+
270
+ - `"mwmf"` (**Max Weight Matching**): Finds the globally optimal matching (the independent edge set with the highest total weight). Best quality, O(n³) complexity.
271
+ - `"inter"` (**Intersection**): Intersection of the forward and backward alignments. High precision, lower recall, very fast.
272
+ - `"itermax"` (**Iterative Max**): Heuristic iterative maximization. Good speed/quality balance.
273
+ - `"greedy"` (**Greedy**): Selects the best matches greedily. Fast, but may settle for a locally optimal alignment.
274
+
275
+ ```python
276
+ # Trade accuracy for speed with 'inter'
277
+ projector = SpanProjector(matching_method="inter")
278
+ ```
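The difference between a greedy choice and a globally optimal one shows up even on a 2×2 similarity matrix. The sketch below is illustrative only: the values are made up, both helpers are hypothetical, and the brute-force search over permutations stands in for a proper polynomial-time matching algorithm (it only works for tiny square matrices).

```python
from itertools import permutations

def greedy_matching(sim):
    """Repeatedly take the highest remaining score whose row and column are still free."""
    cells = sorted(
        ((score, i, j) for i, row in enumerate(sim) for j, score in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, pairs = set(), set(), []
    for _, i, j in cells:
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(pairs)

def optimal_matching(sim):
    """Brute-force the one-to-one assignment with the highest total score."""
    n = len(sim)
    best = max(permutations(range(n)), key=lambda p: sum(sim[i][p[i]] for i in range(n)))
    return [(i, best[i]) for i in range(n)]

sim = [
    [0.9, 0.8],
    [0.8, 0.1],
]

greedy = greedy_matching(sim)
optimal = optimal_matching(sim)
print(greedy)
# [(0, 0), (1, 1)]
print(optimal)
# [(0, 1), (1, 0)]
```

Greedy commits to the single best pair (0.9) and is left with a poor second pair (0.1), while the global optimum accepts two slightly weaker pairs (0.8 each) for a higher total.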
279
+
280
+ ### Tokenization: BPE vs Word
281
+
282
+ - `token_type="bpe"` (Recommended): Uses the transformer's subword tokenizer (e.g. WordPiece). Handles rare words better and stays closer to the model's internal representation.
283
+ - `token_type="word"`: Splits by whitespace/punctuation. Simpler, but can result in `[UNK]` tokens for transformers.
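Why word-level tokens can turn into `[UNK]` is easiest to see with a toy vocabulary. The sketch below implements a simplified greedy longest-match subword split in the style of WordPiece; the vocabulary, the `##` continuation prefix handling, and both helpers are illustrative assumptions, not the tokenizer the package actually loads.

```python
import re

# Toy vocabulary (hypothetical); "##" marks a piece that continues a word.
VOCAB = {"dit", "is", "een", "test", "##zin", "."}

def word_tokens(text):
    """Split on whitespace and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def wordpiece(word, vocab):
    """Greedy longest-match-first subword split; unknown words become [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            return ["[UNK]"]
    return pieces

words = word_tokens("Dit is een testzin.")
word_level = [w if w in VOCAB else "[UNK]" for w in words]
subword_level = [p for w in words for p in wordpiece(w, VOCAB)]

print(word_level)
# ['dit', 'is', 'een', '[UNK]', '.']
print(subword_level)
# ['dit', 'is', 'een', 'test', '##zin', '.']
```

The compound `testzin` is missing from the vocabulary as a whole word, so a whole-word lookup loses it, while the subword split keeps both of its parts available for alignment.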
@@ -0,0 +1,262 @@
1
+ # Span Projecting & Alignment
2
+
3
+ A utility for aligning and mapping text spans between different text representations, and projecting annotations across languages using semantic alignment.
4
+
5
+ ## Features
6
+
7
+ - **Span Alignment**: Sanitize boundaries, fuzzy match segments, map spans between text versions.
8
+ - **Span Projection**: Project annotations from a source text (e.g., English) to a target text (e.g., Dutch) using embeddings.
9
+
10
+ ## Installation
11
+
12
+ Install the package with pip:
13
+
14
+ ```bash
15
+ pip install span-aligner
16
+ ```
17
+
18
+ ## Usage
19
+
20
+ The package `span_aligner` provides two main classes: `SpanAligner` and `SpanProjector`.
21
+
22
+ * **`SpanAligner`**:
23
+ Uses regex and fuzzy search. It is highly efficient but restricted to **monolingual** tasks (same language). It serves as a strong baseline for correcting boundary offsets or mapping annotations between slightly different versions of a text.
24
+
25
+ * **`SpanProjector`**:
26
+ Uses **word embeddings** (Transformers) to align tokens semantically. It supports **cross-lingual** projection and handles significant paraphrasing. However, it is computationally more expensive.
27
+ * *Complexity*: The default `mwmf` (Max Weight Matching) algorithm has a complexity of **O(n³)**, meaning execution time grows cubically with text length (doubling the length takes roughly eight times as long).
28
+ * *Use Case*: Use when languages differ or when textual differences are too great for fuzzy matching.
29
+
30
+ ## Optimization & Best Practices
31
+
32
+ To achieve the best results while managing computational cost, follow these guidelines:
33
+
34
+ ### 1. Choose the Right Tool for the Job
35
+ If the source and target texts are in the same language, **always start with `SpanAligner`**. It is significantly faster and creates precise splits. Only switch to `SpanProjector` if fuzzy matching fails due to low textual overlap.
36
+
37
+ ### 2. Manage Text Length (Chunking)
38
+ The `SpanProjector` (specifically with `mwmf`) struggles with very long sequences.
39
+ * **Split Texts**: Break documents into logical segments (e.g., paragraphs, decisions, list items) before projection.
40
+ * **Project Locally**: Align spans within their corresponding segments rather than projecting a small span against an entire document.
41
+
42
+ ### 3. Select the Appropriate Algorithm
43
+ * **`mwmf`** (Max Weight Matching): The gold standard. Finds the globally optimal alignment but is slow. Use for final, high-quality output on segmented text.
44
+ * **`inter`** (Intersection): Much faster. Works excellently for **short, distinct spans** (e.g., named entities like persons, locations, dates) where context is less critical.
45
+ * **`itermax`**: A balanced heuristic that offers better speed than `mwmf` with comparable quality for many tasks.
46
+
47
+ ### 4. Translation-Assisted Projection (Hybrid Approach)
48
+ If direct cross-lingual projection yields subpar results, consider an intermediate translation step to simplify the alignment task:
49
+
50
+ 1. **Translate Source**: Use an LLM or NMT model to translate the annotated source text (or just the spans) into the target language.
51
+ 2. **Align Locally**: Use `SpanAligner` (or `SpanProjector` with `inter`) to map the *translated* spans onto the *actual* target text.
52
+
53
+ **Tip**: The translation should mimic the vocabulary of the target text as closely as possible.
54
+ * *Workflow*: `annotated_source` + `target_text` → **LLM** → `rough_translated_source` → **SpanAligner** → `final_annotated_target`
55
+
56
+
57
+
58
+ ### Span Aligner
59
+
60
+ Utilities for exact and fuzzy span mapping.
61
+
62
+ #### Get Annotations from Tagged Text
63
+
64
+ Extract structured spans and entities from a string with inline tags.
65
+
66
+ ```python
67
+ from span_aligner import SpanAligner
68
+
69
+ tagged_input = "<administrative_body>Environmental Committee</administrative_body> discussed the <impact_location>central park</impact_location> renovation on <publication_date>2025-12-15</publication_date>."
70
+
71
+ ner_map = {
72
+ "administrative_body": "ADMINISTRATIVE BODY",
73
+ "publication_date": "PUBLICATION DATE",
74
+ "impact_location": "PRIMARY LOCATION"
75
+ }
76
+
77
+ span_map = {
78
+ "motivation": "MOTIVATION"
79
+ }
80
+
81
+ annotations = SpanAligner.get_annotations_from_tagged_text(
82
+ tagged_input,
83
+ ner_map=ner_map,
84
+ span_map=span_map
85
+ )
86
+
87
+ print(annotations["entities"])
88
+ # Output:
89
+ #[
90
+ # {'start': 0, 'end': 23, 'text': 'Environmental Committee', 'labels': ['ADMINISTRATIVE BODY']},
91
+ # {'start': 38, 'end': 50, 'text': 'central park', 'labels': ['PRIMARY LOCATION']},
92
+ # {'start': 65, 'end': 75, 'text': '2025-12-15', 'labels': ['PUBLICATION DATE']}
93
+ #]
94
+ ```
95
+
96
+ #### Rebuild Tagged Text
97
+
98
+ Reconstruct a string with XML-like tags from raw text and span/entity lists.
99
+
100
+ ```python
101
+ from span_aligner import SpanAligner
102
+
103
+ text = "On 2026-01-12, the Budget Committee finalized the annual report."
104
+ # Entities corresponding to 'ADMINISTRATIVE BODY' label (indices skip "the ")
105
+ entities = [{"start": 19, "end": 35, "labels": ["administrative_body"]}]
106
+
107
+ tagged, stats = SpanAligner.rebuild_tagged_text(text, entities=entities)
108
+ print(tagged)
109
+ # Output: On 2026-01-12, the <administrative_body>Budget Committee</administrative_body> finalized the annual report.
110
+ ```
111
+
112
+ #### Map Tags to Original
113
+
114
+ Align annotated spans from a tagged string back to their positions in the original text, allowing for noisy text or translation differences.
115
+
116
+ ```python
117
+ from span_aligner import SpanAligner
118
+
119
+ original_text = "Budget Committee met on 2026-01-12 to view\n\n the central park prject."
120
+ tagged_text = "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to review the <impact_location>central park</impact_location> project."
121
+
122
+ mapped_tagged_text = SpanAligner.map_tags_to_original(
123
+ original_text=original_text,
124
+ tagged_text=tagged_text,
125
+ min_ratio=0.7
126
+ )
127
+ print(mapped_tagged_text)
128
+ # Output preserves original text errors:
129
+ # "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to view
130
+ # the <impact_location>central park</impact_location> prject."
131
+ ```
132
+
133
+ ### Span Projector
134
+
135
+ Project annotations from one text to another using semantic alignment (e.g., cross-lingual projection).
136
+
137
+ The process generates embeddings for both source and target texts, builds a token similarity matrix, and then selects a set of alignment pairs from it. Several algorithms are implemented for this matching phase, including `mwmf`, `inter`, `itermax`, `fwd`, `rev`, `greedy`, and `threshold`.
138
+
139
+
140
+
141
+ #### Project En -> En (Identity/Paraphrase)
142
+
143
+ Project annotations to a similar text in the same language. This works much like `SpanAligner`, but with embedding-based matching that tolerates larger textual differences.
144
+
145
+ ```python
146
+ from span_aligner import SpanProjector
147
+
148
+ # Initialize projector (uses BERT embeddings by default)
149
+ projector = SpanProjector(src_lang="en", tgt_lang="en")
150
+
151
+ src_text = "The <ent>cat</ent> \n\n sat. on the mat."
152
+ tgt_text = "The cat sat on the mat."
153
+
154
+ tagged_tgt, spans = projector.project_tagged_text(src_text, tgt_text)
155
+ print(tagged_tgt)
156
+ # Output: The <ent>cat</ent> sat on the mat.
157
+ ```
158
+
159
+ #### Project En -> Nl (Cross-Lingual)
160
+
161
+ Project annotations from an English source text to a Dutch target translation.
162
+
163
+ ```python
164
+ from span_aligner import SpanProjector
165
+
166
+ # Initialize projector
167
+ projector = SpanProjector(src_lang="en", tgt_lang="nl")
168
+
169
+ src_text = """DECISION LIST <contextual_location>Municipality of Zele</contextual_location>
170
+ <administrative_body>Standing Committee</administrative_body> | <contextual_date>June 28, 2021</contextual_date>
171
+ <title>1. Acceptance of candidacies for the examination procedure coordinator of Welfare</title>
172
+ <decision>Acceptance of candidacies for the examination procedure coordinator of Welfare</decision>
173
+ <title>2. Establishment of valuation rules for the integrated entity Municipality and Public Social Welfare Center (OCMW)</title>
174
+ <decision>Establishment of valuation rules for the integrated entity Municipality and OCMW</decision>"""
175
+
176
+ tgt_text = """BESLUITENLIJST Gemeente Zele Vast bureau | 28 juni 20211.
177
+ 1. Aanvaarden kandidaturen examenprocedure coördinator Welzijn
178
+ Aanvaarden kandidaturen examenprocedure coördinator Welzijn
179
+ 2. Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW
180
+ Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW"""
181
+
182
+ tagged_tgt, spans = projector.project_tagged_text(src_text, tgt_text)
183
+ print(tagged_tgt)
184
+ # Output: BESLUITENLIJST <contextual_location>Gemeente Zele</contextual_location>
185
+ # <administrative_body>Vast bureau</administrative_body> <contextual_date>| 28 juni 20211</contextual_date>.
186
+ # <title>1. Aanvaarden kandidaturen examenprocedure coördinator Welzijn
187
+ # Aanvaarden kandidaturen examenprocedure coördinator</title> Welzijn
188
+ # <title>2. Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW</title>
189
+ # <decision>Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW</decision>
190
+
191
+ ```
192
+
193
+ ### Sentence Aligner
194
+
195
+ Low-level class for aligning tokens between two texts (sentences or paragraphs) using transformer embeddings. Based on the work of `simalign` but optimized for span mapping (partial alignment instead of full text) and customized for different embedding providers (Ollama, SaaS providers, Transformers, Sentence-Transformers).
196
+
197
+ #### Initialize Aligner
198
+
199
+ ```python
200
+ from span_aligner import SentenceAligner
201
+
202
+ # Use bert embeddings (default) with BPE tokenization
203
+ aligner = SentenceAligner(model="bert", token_type="bpe")
204
+
205
+ text_src = "This is a simple test sentence for alignment."
206
+ text_tgt = "Dit is een eenvoudige testzin voor uitlijning."
207
+ ```
208
+
209
+ #### Get Text Embeddings
210
+
211
+ Retrieve tokens and embedding vectors for a string.
212
+
213
+ ```python
214
+ tokens_src, vecs_src = aligner.get_text_embeddings(text_src)
215
+ print(f"Src tokens: {len(tokens_src)}, Vectors: {vecs_src.shape}")
216
+ # Output: Src tokens: 9, Vectors: (10, 768)
217
+ ```
218
+
219
+ #### Align Partial Substring
220
+
221
+ Find the alignment of a specific substring from source to target.
222
+
223
+ ```python
224
+ # Align "simple test"
225
+ res_sub = aligner.align_texts_partial_substring(text_src, text_tgt, "simple test")
226
+ print(f"Src tokens in result: {[t.text for t in res_sub.src_tokens]}")
227
+ # Output: Src tokens in result: ['simple', 'test']
228
+ ```
229
+
230
+ ## Configuration & Advanced Usage
231
+
232
+ ### Embedding Models
233
+
234
+ The `model` parameter supports common transformer models:
235
+
236
+ - `"bert"`: `bert-base-multilingual-cased` (Default, robust multilingual performance)
237
+ - `"xlmr"`: `xlm-roberta-base` (Strong cross-lingual transfer)
238
+ - `"xlmr-large"`: `xlm-roberta-large` (Higher accuracy, more resource intensive)
239
+
240
+ ```python
241
+ # Use xlm-roberta-base
242
+ projector = SpanProjector(model="xlmr")
243
+ ```
244
+
245
+ ### Matching Algorithms
246
+
247
+ The `matching_method` parameter controls how the token similarity matrix is converted into an alignment.
248
+
249
+ - `"mwmf"` (**Max Weight Matching**): Finds the globally optimal matching (the independent edge set with the highest total weight). Best quality, O(n³) complexity.
250
+ - `"inter"` (**Intersection**): Intersection of the forward and backward alignments. High precision, lower recall, very fast.
251
+ - `"itermax"` (**Iterative Max**): Heuristic iterative maximization. Good speed/quality balance.
252
+ - `"greedy"` (**Greedy**): Selects the best matches greedily. Fast, but may settle for a locally optimal alignment.
253
+
254
+ ```python
255
+ # Trade accuracy for speed with 'inter'
256
+ projector = SpanProjector(matching_method="inter")
257
+ ```
258
+
259
+ ### Tokenization: BPE vs Word
260
+
261
+ - `token_type="bpe"` (Recommended): Uses the transformer's subword tokenizer (e.g. WordPiece). Handles rare words better and stays closer to the model's internal representation.
262
+ - `token_type="word"`: Splits by whitespace/punctuation. Simpler, but can result in `[UNK]` tokens for transformers.
@@ -4,15 +4,24 @@ build-backend = "setuptools.build_meta"
4
4
 
5
5
  [project]
6
6
  name = "span-aligner"
7
- version = "0.1.2"
7
+ version = "0.2.0"
8
8
  description = "A utility for aligning and mapping text spans between different text representations."
9
9
  readme = "README.md"
10
10
  requires-python = ">=3.8"
11
11
  license = {text = "MIT"}
12
12
  dependencies = [
13
- "rapidfuzz>=3.0.0",
13
+ "numpy>=1.26,<3",
14
+ "scipy>=1.13,<2",
15
+ "networkx>=3.3,<4",
16
+ "rapidfuzz>=3.13,<4",
17
+ "regex>=2024.9",
18
+ "transformers>=4.41,<6",
19
+ "torch>=2.5,<3",
20
+ "scikit-learn>=1.7,<2",
21
+ "requests>=2.32,<3",
14
22
  ]
15
23
 
24
+
16
25
  [project.optional-dependencies]
17
26
  dev = [
18
27
  "pytest>=7.0.0",
@@ -0,0 +1,3 @@
1
+ from .span_aligner import SpanAligner
2
+ from .span_projector import SpanProjector
3
+ from .sentence_aligner import SentenceAligner