sembr 0.0.2__tar.gz
- sembr-0.0.2/LICENSE +32 -0
- sembr-0.0.2/PKG-INFO +343 -0
- sembr-0.0.2/README.md +323 -0
- sembr-0.0.2/pyproject.toml +37 -0
- sembr-0.0.2/sembr/__init__.py +0 -0
- sembr-0.0.2/sembr/cli.py +169 -0
- sembr-0.0.2/sembr/dataset.py +93 -0
- sembr-0.0.2/sembr/eval.py +36 -0
- sembr-0.0.2/sembr/inference.py +99 -0
- sembr-0.0.2/sembr/process.py +302 -0
- sembr-0.0.2/sembr/sembr2023.py +60 -0
- sembr-0.0.2/sembr/train.py +128 -0
- sembr-0.0.2/sembr/utils.py +44 -0
- sembr-0.0.2/sembr.egg-info/PKG-INFO +343 -0
- sembr-0.0.2/sembr.egg-info/SOURCES.txt +18 -0
- sembr-0.0.2/sembr.egg-info/dependency_links.txt +1 -0
- sembr-0.0.2/sembr.egg-info/entry_points.txt +2 -0
- sembr-0.0.2/sembr.egg-info/requires.txt +6 -0
- sembr-0.0.2/sembr.egg-info/top_level.txt +1 -0
- sembr-0.0.2/setup.cfg +4 -0
sembr-0.0.2/LICENSE
ADDED
@@ -0,0 +1,32 @@
The MIT License (MIT)

Copyright (c) 2023 Xitong Gao

Permission is hereby granted,
free of charge,
to any person obtaining a copy of this software
and associated documentation files (the "Software"),
to deal in the Software without restriction,
including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software,
and to permit persons to whom the Software
is furnished to do so,
subject to the following conditions:

The above copyright notice
and this permission notice
shall be included in all copies
or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS",
WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM,
DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE
OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
sembr-0.0.2/PKG-INFO
ADDED
@@ -0,0 +1,343 @@
Metadata-Version: 2.1
Name: sembr
Version: 0.0.2
Summary: A semantic linebreaker powered by transformers
Author: admk
Project-URL: Homepage, https://github.com/admk/sembr
Project-URL: Issues, https://github.com/admk/sembr/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: transformers
Requires-Dist: torch
Requires-Dist: numpy
Requires-Dist: tqdm
Requires-Dist: requests
Requires-Dist: flask

# Semantic Line Breaker (SemBr)

[](LICENSE)
[](https://www.python.org)
[](https://pytorch.org)
[](https://pypi.org/project/sembr)

```
> When writing text
> with a compatible markup language,
> add a line break
> after each substantial unit of thought.
```


## What is SemBr?

SemBr is a command-line tool
powered by [Transformer][transformers1] [models][transformers2]
that breaks lines in a text file at semantic boundaries.

### Installation

SemBr is available as a [Python package on PyPI][pypi].
To install it,
simply run the following command in your terminal,
assuming that you have Python 3.10 or later installed:
```shell
pip install sembr
```

### Supported Platforms

SemBr is supported on Linux, Mac and Windows.
On machines with CUDA devices,
or on Apple Silicon Macs,
SemBr will use the GPU / Apple Neural Engine
to accelerate inference.

### Usage

To use SemBr,
run the following command in your terminal:
```shell
sembr -i <input_file> -o <output_file>
```
where `<input_file>` and `<output_file>`
are the paths to the input and output files respectively.
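
For scripting,
the same command can be assembled and launched from Python.
The sketch below only builds the argument list shown above
and passes it to `subprocess.run`;
the helper names `build_sembr_command` and `run_sembr` are hypothetical
and are not part of the package,
and it assumes the `sembr` executable is installed and on `PATH`.

```python
import subprocess
from pathlib import Path


def build_sembr_command(input_file, output_file):
    """Return the argument list for `sembr -i <input_file> -o <output_file>`."""
    return ["sembr", "-i", str(Path(input_file)), "-o", str(Path(output_file))]


def run_sembr(input_file, output_file):
    """Run the sembr CLI (assumed to be installed and on PATH)."""
    # check=True raises CalledProcessError if sembr exits with a non-zero status.
    subprocess.run(build_sembr_command(input_file, output_file), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_sembr(...) to actually execute it.
    print(build_sembr_command("paper.tex", "paper.sembr.tex"))
```

Passing the arguments as a list avoids shell quoting problems
with file names that contain spaces.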

On the first run,
it will download the SemBr model
and cache it in `~/.cache/huggingface`.
Subsequent runs will check for updates
and use the cached model if it is up-to-date.

Alternatively,
you can pipe the input into `sembr`,
and the output can also be printed to the terminal:
```shell
cat <input_file> | sembr
```
This is especially useful if you want to use SemBr
with clipboard managers, for instance, on a Mac:
```shell
pbpaste | sembr | pbcopy
```
Or on Linux:
```shell
xclip -o | sembr | xclip -i
```

Additionally,
you can specify the following options
to customize the behavior of SemBr:

* `-m <model_name>`:
  The name of the Hugging Face model to use.
  - The default is
    [`admko/sembr2023-bert-small`][sembr-bert-small].
  - To use it offline,
    you can download the model from Hugging Face,
    and then specify the path to the model directory,
    or prepend `TRANSFORMERS_OFFLINE=1` to the command
    to use the cached model.
* `-l`:
  Serves the SemBr API on a local server.
  - Each `sembr` run
    will detect whether the API is accessible,
    and if not, it will run the model on its own.
  - This option is useful
    to avoid the time taken to initialize the model
    by keeping it in memory in a separate process.
* `-p <port>`:
  The port to serve the SemBr API on.
  - The default is `8384`.
* `-s <ip>`:
  The IP address to serve the SemBr API on.
  - The default is `127.0.0.1`.
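
As a rough illustration of the "detect whether the API is accessible" step,
a client could test whether anything is listening
on the default address and port before deciding to load the model itself.
This is a minimal sketch under assumptions:
it uses a plain TCP connection test,
the function name `is_listening` is made up for this example,
and the README does not say how `sembr` itself performs the check.

```python
import socket

DEFAULT_HOST = "127.0.0.1"  # default of -s in the list above
DEFAULT_PORT = 8384         # default of -p in the list above


def is_listening(host=DEFAULT_HOST, port=DEFAULT_PORT, timeout=0.5):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if is_listening():
        print("A server is reachable; requests could be sent to it.")
    else:
        print("No server found; the model would be loaded in this process.")
```

A successful connection only shows that some process owns the port,
not that it is a SemBr server,
so a real client would still need to handle unexpected responses.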


## What are Semantic Line Breaks?

[Semantic Line Breaks][sembr]
or [Semantic Linefeeds][semlf]
describe a set of conventions
for using insensitive vertical whitespace
to structure prose along semantic boundaries.
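
The whitespace is "insensitive" in the sense that
markup renderers treat a single line break inside a paragraph
much like a space.
The following sketch mimics that behaviour
with a hypothetical `render_paragraph` helper
(illustrative only, not part of SemBr)
and shows that the text with and without semantic line breaks
collapses to the same paragraph.

```python
def render_paragraph(source: str) -> str:
    """Collapse line breaks and runs of spaces into single spaces,
    roughly what a markup renderer does inside one paragraph."""
    return " ".join(source.split())


with_breaks = (
    "When writing text\n"
    "with a compatible markup language,\n"
    "add a line break\n"
    "after each substantial unit of thought.\n"
)
one_line = (
    "When writing text with a compatible markup language, "
    "add a line break after each substantial unit of thought."
)

# Both sources produce the same rendered paragraph.
assert render_paragraph(with_breaks) == render_paragraph(one_line)
```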


## Why use Semantic Line Breaks?

Semantic line breaks have the following advantages:

* Breaking lines by splitting clauses
  reflects the logical, grammatical and semantic structure
  of the text.

* They make a text file
  easier to edit and to keep under version control.
  Merge conflicts are less likely to occur
  when small changes are made,
  and the changes are easier to identify.

* Documents written with semantic line breaks
  are easier to navigate and edit
  with Vim and other text editors
  that use Vim keybindings.

* Semantic line breaks
  are invisible to readers.
  The final rendered output
  shows no changes to the source text.


## Why SemBr?

Converting existing text
that was not written with semantic line breaks
takes a long time to do manually,
and it is surprisingly difficult
to do automatically with rule-based methods.

### Challenges of rule-based methods

Rule-based heuristics do not work well
with the actual semantic structure of the text,
often leading to incorrect semantic boundaries.
Moreover,
these boundaries are hierarchical and nested,
and a rule-based approach
cannot capture this structure.
A semantic line break
may occur after a dependent clause,
but where to break clauses into lines
is challenging to determine
without syntactic and semantic reasoning capabilities.
For example:

* A rule that breaks lines at punctuation marks
  will not work well with sentences
  that contain periods
  in abbreviations or mathematical expressions.

* Syntactic or semantic structures
  are not always easy to determine.
  "I like to eat apples and oranges
  because they are healthy."
  should be broken into lines as follows:
  ```
  > I like to eat apples and oranges
  > because they are healthy.
  ```
  rather than:
  ```
  > I like to eat apples
  > and oranges because they are healthy.
  ```
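
Both difficulties in the list above can be reproduced
with a deliberately naive punctuation rule.
The sketch below is a toy baseline written for this illustration
(the function name `naive_break` and the first sample sentence are made up);
it is not a method used by SemBr.

```python
import re


def naive_break(text: str) -> list[str]:
    """Break after every '.', ',' or ';' that is followed by whitespace."""
    return [part for part in re.split(r"(?<=[.,;])\s+", text) if part]


# The abbreviation "e.g." is wrongly treated as a boundary.
print(naive_break("Rule-based tools, e.g. regex splitters, often fail on abbreviations."))
# ['Rule-based tools,', 'e.g.', 'regex splitters,', 'often fail on abbreviations.']

# No punctuation inside the sentence, so the clause boundary
# before "because" is never found.
print(naive_break("I like to eat apples and oranges because they are healthy."))
# ['I like to eat apples and oranges because they are healthy.']
```

Adding special cases for abbreviations or conjunctions
helps with individual sentences,
but each new rule tends to introduce its own exceptions.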

For this reason,
I have created SemBr,
which uses finetuned Transformer models
to predict line breaks at semantic boundaries.


## How does SemBr work?

SemBr uses a Transformer model
to predict line breaks at semantic boundaries.

A small dataset of text with semantic line breaks
was created from my existing LaTeX documents.
The dataset was split into training
(46,295 lines, 170,681 words and 1,492,952 characters)
and test
(2,187 lines, 7,564 words and 72,231 characters)
datasets.

The data was prepared
by extracting line breaks and indent levels
from the files,
and then converting the result
into strings of paragraphs with line breaks removed.
The data can then be tokenized using the tokenizer
and converted into a dataset with tokens,
where each token has a label
denoting whether there is a line break before it,
and the indent level of the token.
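
The preparation step described above can be sketched roughly as follows.
This is a simplified illustration, not the code shipped in the package:
it labels whitespace-separated words instead of tokenizer tokens,
and the function name `paragraph_to_labels`
and the `indent_width` parameter are assumptions made for the example.

```python
def paragraph_to_labels(lines, indent_width=4):
    """Flatten one paragraph into a string plus per-word labels.

    Returns (text, labels), where labels[i] is a
    (break_before, indent_level) pair for the i-th word of text.
    """
    words, labels = [], []
    for line in lines:
        stripped = line.lstrip(" ")
        indent_level = (len(line) - len(stripped)) // indent_width
        for position, word in enumerate(stripped.split()):
            # A break precedes the first word of a line,
            # except at the very start of the paragraph.
            break_before = position == 0 and len(words) > 0
            words.append(word)
            labels.append((break_before, indent_level))
    return " ".join(words), labels


lines = [
    "The data was prepared",
    "    by extracting line breaks",
    "    and indent levels.",
]
text, labels = paragraph_to_labels(lines)
print(text)
# The data was prepared by extracting line breaks and indent levels.
print([word for word, (brk, _) in zip(text.split(), labels) if brk])
# ['by', 'and']
```

With a subword tokenizer,
the word-level labels would additionally have to be aligned
to the tokens that each word is split into.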

For LaTeX documents,
there are two types of line breaks:
one with a normal line break
that adds implicit spacing (e.g. `line a⏎line b`)
and one with no spacing (e.g. `line a%⏎line b`).
The data processor
also tries to preserve the LaTeX syntax of the text
by adding and removing comment symbols (`%`),
if necessary.
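
The difference between the two kinds of line break
can be made concrete with a small sketch
that joins source lines the way LaTeX would typeset them.
It is a simplification under stated assumptions:
only a `%` at the very end of a line is handled,
an escaped `\%` is treated as ordinary text,
and the function name `join_latex_lines` is hypothetical
rather than taken from the package.

```python
def join_latex_lines(lines: list[str]) -> str:
    """Join lines of one LaTeX paragraph into the text LaTeX would see."""
    out = ""
    for line in lines:
        line = line.strip()
        if line.endswith("%") and not line.endswith("\\%"):
            # A trailing comment sign swallows the line break: no space is added.
            out += line[:-1]
        else:
            # A normal line break acts as a single space.
            out += line + " "
    return out.rstrip()


print(join_latex_lines(["line a", "line b"]))   # line a line b
print(join_latex_lines(["line a%", "line b"]))  # line aline b
```

This is why a break inserted in the middle of a word or a command
needs a trailing `%`,
while a break between words does not.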

The pretrained masked language model
is then finetuned as a token classifier
on the training dataset
to predict the labels of the tokens.
I saved the model with the best F1 score
on correctly predicting the existence of a line break
on the test set.
The finetuning logs for the following models
can be found in this [WandB][wandb] report:

* `distilbert-base-uncased`
  [[Pretrained]][distilbert-bu]
  [[Finetuned]][sembr-distilbert-bu]
* `distilbert-base-cased`
  [[Pretrained]][distilbert-bc]
  [[Finetuned]][sembr-distilbert-bc]
* `distilbert-base-uncased-finetuned-sst-2-english`
  [[Pretrained]][distilbert-bufs2e]
  [[Finetuned]][sembr-distilbert-bufs2e]
* `prajjwal1/bert-tiny`
  [[Pretrained]][bert-tiny]
  [[Finetuned]][sembr-bert-tiny]
* `prajjwal1/bert-mini`
  [[Pretrained]][bert-mini]
  [[Finetuned]][sembr-bert-mini]
* `prajjwal1/bert-small`
  [[Pretrained]][bert-small]
  [[Finetuned]][sembr-bert-small]


## Performance

Current inference speed on an M2 MacBook Pro
is about 850 words per second
on `bert-small` with the default options,
and the memory usage is about 1.70 GB.

The line breaking accuracy is difficult to measure,
and the locations of line breaks
could also be subjective.
On the test set,
the per-token line break accuracy
of the models is >95%,
with ~80% F1 scores.
Because of the sparse nature of line breaks,
the accuracy is not a good metric
to measure the performance of the model,
and I used the F1 score instead
to save the best models.
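
The effect of label sparsity on accuracy can be seen in a toy calculation.
The labels below are synthetic and chosen only for illustration
(one break every 20 tokens);
they are not taken from the SemBr test set,
and `accuracy_and_f1` is a helper written for this example.

```python
def accuracy_and_f1(y_true, y_pred):
    """Per-token accuracy and F1 for the positive ("line break") class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denominator = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# 100 tokens with a line break before every 20th token (5 breaks in total).
y_true = [i % 20 == 0 for i in range(100)]
# A useless model that never predicts a line break.
never_break = [False] * 100

print(accuracy_and_f1(y_true, never_break))  # (0.95, 0.0)
```

A model that never breaks a line still reaches 95% accuracy here,
while its F1 score of zero reflects that it finds no breaks at all.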


## Improvements and TODOs

* [ ] Support natural languages other than English.
* [ ] Support other markup languages such as Markdown.
* [ ] Some lines are too long without a line break.
  The inference algorithm can be improved
  to penalize long lines.
* [ ] Performance and accuracy benchmarking,
  and comparisons with related works.
* [ ] Improve inference speed.
* [ ] Reduce memory usage.
* [ ] Improve indent level prediction.
* [ ] Inference queue.
* [ ] VSCode extension.


## Related Projects and References

Sentence splitting:
* https://code.google.com/archive/p/splitta/
* https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
* https://github.com/nipunsadvilkar/pySBD
* https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html

Semantic line breaking:
* https://github.com/sembr/specification
* https://github.com/waldyrious/semantic-linebreaker
* https://github.com/bobheadxi/readable ([blog post](https://bobheadxi.dev/semantic-line-breaks/))
* https://github.com/chrisgrieser/obsidian-sembr
* https://github.com/cllns/semantic_linefeeds


[transformers1]: https://huggingface.co/learn/nlp-course/chapter1/4
[transformers2]: https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/

[pypi]: https://pypi.org/project/sembr

[sembr]: https://sembr.org
[semlf]: https://rhodesmill.org/brandon/2012/one-sentence-per-line

[wandb]: https://api.wandb.ai/links/admk/efvui9f4

[distilbert-bu]: https://huggingface.co/distilbert-base-uncased
[distilbert-bc]: https://huggingface.co/distilbert-base-cased
[distilbert-bufs2e]: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
[bert-tiny]: https://huggingface.co/prajjwal1/bert-tiny
[bert-mini]: https://huggingface.co/prajjwal1/bert-mini
[bert-small]: https://huggingface.co/prajjwal1/bert-small
[sembr-distilbert-bu]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased
[sembr-distilbert-bc]: https://huggingface.co/admko/sembr2023-distilbert-base-cased
[sembr-distilbert-bufs2e]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased-finetuned-sst-2-english
[sembr-bert-tiny]: https://huggingface.co/admko/sembr2023-bert-tiny
[sembr-bert-mini]: https://huggingface.co/admko/sembr2023-bert-mini
[sembr-bert-small]: https://huggingface.co/admko/sembr2023-bert-small
sembr-0.0.2/README.md
ADDED
@@ -0,0 +1,323 @@
# Semantic Line Breaker (SemBr)

[](LICENSE)
[](https://www.python.org)
[](https://pytorch.org)
[](https://pypi.org/project/sembr)

```
> When writing text
> with a compatible markup language,
> add a line break
> after each substantial unit of thought.
```


## What is SemBr?

SemBr is a command-line tool
powered by [Transformer][transformers1] [models][transformers2]
that breaks lines in a text file at semantic boundaries.

### Installation

SemBr is available as a [Python package on PyPI][pypi].
To install it,
simply run the following command in your terminal,
assuming that you have Python 3.10 or later installed:
```shell
pip install sembr
```

### Supported Platforms

SemBr is supported on Linux, Mac and Windows.
On machines with CUDA devices,
or on Apple Silicon Macs,
SemBr will use the GPU / Apple Neural Engine
to accelerate inference.

### Usage

To use SemBr,
run the following command in your terminal:
```shell
sembr -i <input_file> -o <output_file>
```
where `<input_file>` and `<output_file>`
are the paths to the input and output files respectively.

On the first run,
it will download the SemBr model
and cache it in `~/.cache/huggingface`.
Subsequent runs will check for updates
and use the cached model if it is up-to-date.

Alternatively,
you can pipe the input into `sembr`,
and the output can also be printed to the terminal:
```shell
cat <input_file> | sembr
```
This is especially useful if you want to use SemBr
with clipboard managers, for instance, on a Mac:
```shell
pbpaste | sembr | pbcopy
```
Or on Linux:
```shell
xclip -o | sembr | xclip -i
```

Additionally,
you can specify the following options
to customize the behavior of SemBr:

* `-m <model_name>`:
  The name of the Hugging Face model to use.
  - The default is
    [`admko/sembr2023-bert-small`][sembr-bert-small].
  - To use it offline,
    you can download the model from Hugging Face,
    and then specify the path to the model directory,
    or prepend `TRANSFORMERS_OFFLINE=1` to the command
    to use the cached model.
* `-l`:
  Serves the SemBr API on a local server.
  - Each `sembr` run
    will detect whether the API is accessible,
    and if not, it will run the model on its own.
  - This option is useful
    to avoid the time taken to initialize the model
    by keeping it in memory in a separate process.
* `-p <port>`:
  The port to serve the SemBr API on.
  - The default is `8384`.
* `-s <ip>`:
  The IP address to serve the SemBr API on.
  - The default is `127.0.0.1`.


## What are Semantic Line Breaks?

[Semantic Line Breaks][sembr]
or [Semantic Linefeeds][semlf]
describe a set of conventions
for using insensitive vertical whitespace
to structure prose along semantic boundaries.


## Why use Semantic Line Breaks?

Semantic line breaks have the following advantages:

* Breaking lines by splitting clauses
  reflects the logical, grammatical and semantic structure
  of the text.

* They make a text file
  easier to edit and to keep under version control.
  Merge conflicts are less likely to occur
  when small changes are made,
  and the changes are easier to identify.

* Documents written with semantic line breaks
  are easier to navigate and edit
  with Vim and other text editors
  that use Vim keybindings.

* Semantic line breaks
  are invisible to readers.
  The final rendered output
  shows no changes to the source text.


## Why SemBr?

Converting existing text
that was not written with semantic line breaks
takes a long time to do manually,
and it is surprisingly difficult
to do automatically with rule-based methods.

### Challenges of rule-based methods

Rule-based heuristics do not work well
with the actual semantic structure of the text,
often leading to incorrect semantic boundaries.
Moreover,
these boundaries are hierarchical and nested,
and a rule-based approach
cannot capture this structure.
A semantic line break
may occur after a dependent clause,
but where to break clauses into lines
is challenging to determine
without syntactic and semantic reasoning capabilities.
For example:

* A rule that breaks lines at punctuation marks
  will not work well with sentences
  that contain periods
  in abbreviations or mathematical expressions.

* Syntactic or semantic structures
  are not always easy to determine.
  "I like to eat apples and oranges
  because they are healthy."
  should be broken into lines as follows:
  ```
  > I like to eat apples and oranges
  > because they are healthy.
  ```
  rather than:
  ```
  > I like to eat apples
  > and oranges because they are healthy.
  ```

For this reason,
I have created SemBr,
which uses finetuned Transformer models
to predict line breaks at semantic boundaries.


## How does SemBr work?

SemBr uses a Transformer model
to predict line breaks at semantic boundaries.

A small dataset of text with semantic line breaks
was created from my existing LaTeX documents.
The dataset was split into training
(46,295 lines, 170,681 words and 1,492,952 characters)
and test
(2,187 lines, 7,564 words and 72,231 characters)
datasets.

The data was prepared
by extracting line breaks and indent levels
from the files,
and then converting the result
into strings of paragraphs with line breaks removed.
The data can then be tokenized using the tokenizer
and converted into a dataset with tokens,
where each token has a label
denoting whether there is a line break before it,
and the indent level of the token.

For LaTeX documents,
there are two types of line breaks:
one with a normal line break
that adds implicit spacing (e.g. `line a⏎line b`)
and one with no spacing (e.g. `line a%⏎line b`).
The data processor
also tries to preserve the LaTeX syntax of the text
by adding and removing comment symbols (`%`),
if necessary.

The pretrained masked language model
is then finetuned as a token classifier
on the training dataset
to predict the labels of the tokens.
I saved the model with the best F1 score
on correctly predicting the existence of a line break
on the test set.
The finetuning logs for the following models
can be found in this [WandB][wandb] report:

* `distilbert-base-uncased`
  [[Pretrained]][distilbert-bu]
  [[Finetuned]][sembr-distilbert-bu]
* `distilbert-base-cased`
  [[Pretrained]][distilbert-bc]
  [[Finetuned]][sembr-distilbert-bc]
* `distilbert-base-uncased-finetuned-sst-2-english`
  [[Pretrained]][distilbert-bufs2e]
  [[Finetuned]][sembr-distilbert-bufs2e]
* `prajjwal1/bert-tiny`
  [[Pretrained]][bert-tiny]
  [[Finetuned]][sembr-bert-tiny]
* `prajjwal1/bert-mini`
  [[Pretrained]][bert-mini]
  [[Finetuned]][sembr-bert-mini]
* `prajjwal1/bert-small`
  [[Pretrained]][bert-small]
  [[Finetuned]][sembr-bert-small]


## Performance

Current inference speed on an M2 MacBook Pro
is about 850 words per second
on `bert-small` with the default options,
and the memory usage is about 1.70 GB.

The line breaking accuracy is difficult to measure,
and the locations of line breaks
could also be subjective.
On the test set,
the per-token line break accuracy
of the models is >95%,
with ~80% F1 scores.
Because of the sparse nature of line breaks,
the accuracy is not a good metric
to measure the performance of the model,
and I used the F1 score instead
to save the best models.


## Improvements and TODOs

* [ ] Support natural languages other than English.
* [ ] Support other markup languages such as Markdown.
* [ ] Some lines are too long without a line break.
  The inference algorithm can be improved
  to penalize long lines.
* [ ] Performance and accuracy benchmarking,
  and comparisons with related works.
* [ ] Improve inference speed.
* [ ] Reduce memory usage.
* [ ] Improve indent level prediction.
* [ ] Inference queue.
* [ ] VSCode extension.


## Related Projects and References

Sentence splitting:
* https://code.google.com/archive/p/splitta/
* https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
* https://github.com/nipunsadvilkar/pySBD
* https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html

Semantic line breaking:
* https://github.com/sembr/specification
* https://github.com/waldyrious/semantic-linebreaker
* https://github.com/bobheadxi/readable ([blog post](https://bobheadxi.dev/semantic-line-breaks/))
* https://github.com/chrisgrieser/obsidian-sembr
* https://github.com/cllns/semantic_linefeeds


[transformers1]: https://huggingface.co/learn/nlp-course/chapter1/4
[transformers2]: https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/

[pypi]: https://pypi.org/project/sembr

[sembr]: https://sembr.org
[semlf]: https://rhodesmill.org/brandon/2012/one-sentence-per-line

[wandb]: https://api.wandb.ai/links/admk/efvui9f4

[distilbert-bu]: https://huggingface.co/distilbert-base-uncased
[distilbert-bc]: https://huggingface.co/distilbert-base-cased
[distilbert-bufs2e]: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
[bert-tiny]: https://huggingface.co/prajjwal1/bert-tiny
[bert-mini]: https://huggingface.co/prajjwal1/bert-mini
[bert-small]: https://huggingface.co/prajjwal1/bert-small
[sembr-distilbert-bu]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased
[sembr-distilbert-bc]: https://huggingface.co/admko/sembr2023-distilbert-base-cased
[sembr-distilbert-bufs2e]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased-finetuned-sst-2-english
[sembr-bert-tiny]: https://huggingface.co/admko/sembr2023-bert-tiny
[sembr-bert-mini]: https://huggingface.co/admko/sembr2023-bert-mini
[sembr-bert-small]: https://huggingface.co/admko/sembr2023-bert-small