sembr 0.0.2__tar.gz

sembr-0.0.2/LICENSE ADDED
@@ -0,0 +1,32 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2023 Xitong Gao
4
+
5
+ Permission is hereby granted,
6
+ free of charge,
7
+ to any person obtaining a copy of this software
8
+ and associated documentation files (the "Software"),
9
+ to deal in the Software without restriction,
10
+ including without limitation the rights
11
+ to use, copy, modify, merge, publish, distribute, sublicense,
12
+ and/or sell copies of the Software,
13
+ and to permit persons to whom the Software
14
+ is furnished to do so,
15
+ subject to the following conditions:
16
+
17
+ The above copyright notice
18
+ and this permission notice
19
+ shall be included in all copies
20
+ or substantial portions of the Software.
21
+
22
+ THE SOFTWARE IS PROVIDED "AS IS",
23
+ WITHOUT WARRANTY OF ANY KIND,
24
+ EXPRESS OR IMPLIED,
25
+ INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
26
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
27
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
28
+ BE LIABLE FOR ANY CLAIM,
29
+ DAMAGES OR OTHER LIABILITY,
30
+ WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
31
+ ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE
32
+ OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
sembr-0.0.2/PKG-INFO ADDED
@@ -0,0 +1,343 @@
1
+ Metadata-Version: 2.1
2
+ Name: sembr
3
+ Version: 0.0.2
4
+ Summary: A semantic linebreaker powered by transformers
5
+ Author: admk
6
+ Project-URL: Homepage, https://github.com/admk/sembr
7
+ Project-URL: Issues, https://github.com/admk/sembr/issues
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Requires-Python: >=3.10
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+ Requires-Dist: transformers
15
+ Requires-Dist: torch
16
+ Requires-Dist: numpy
17
+ Requires-Dist: tqdm
18
+ Requires-Dist: requests
19
+ Requires-Dist: flask
20
+
21
+ # Semantic Line Breaker (SemBr)
22
+
23
+ [![GitHub](https://img.shields.io/github/license/admk/sembr)](LICENSE)
24
+ [![python](https://img.shields.io/badge/Python-3.10-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
25
+ [![pytorch](https://img.shields.io/badge/PyTorch-2.1.0-EE4C2C.svg?style=flat&logo=pytorch)](https://pytorch.org)
26
+ [![PyPI](https://badge.fury.io/py/sembr.svg)](https://pypi.org/project/sembr)
27
+
28
+ ```
29
+ > When writing text
30
+ > with a compatible markup language,
31
+ > add a line break
32
+ > after each substantial unit of thought.
33
+ ```
34
+
35
+
36
+ ## What is SemBr?
37
+
38
+ SemBr is a command-line tool
39
+ powered by [Transformer][transformers1] [models][transformers2]
40
+ that breaks lines in a text file at semantic boundaries.
41
+
42
+ ### Installation
43
+
44
+ SemBr is available as a [Python package on PyPI][pypi].
45
+ To install it,
46
+ simply run the following command in your terminal,
47
+ assuming that you have Python 3.10 or later installed:
48
+ ```shell
49
+ pip install sembr
50
+ ```
51
+
52
+ ### Supported Platforms
53
+
54
+ SemBr is supported on Linux, Mac and Windows.
55
+ On machines with CUDA devices,
56
+ or on Apple Silicon Macs,
57
+ SemBr will use the GPU (CUDA or Apple Metal)
58
+ to accelerate inference.
59
+
60
+ ### Usage
61
+
62
+ To use SemBr,
63
+ run the following command in your terminal:
64
+ ```shell
65
+ sembr -i <input_file> -o <output_file>
66
+ ```
67
+ where `<input_file>` and `<output_file>`
68
+ are the paths to the input and output files respectively.
69
+
70
+ On the first run,
71
+ it will download the SemBr model
72
+ and cache it in `~/.cache/huggingface`.
73
+ Subsequent runs will check for updates
74
+ and use the cached model if it is up-to-date.
75
+
76
+ Alternatively,
77
+ you can pipe the input into `sembr`,
78
+ and the output can also be printed to the terminal:
79
+ ```shell
80
+ cat <input_file> | sembr
81
+ ```
82
+ This is especially useful if you want to use SemBr
83
+ with clipboard managers, for instance, on a Mac:
84
+ ```shell
85
+ pbpaste | sembr | pbcopy
86
+ ```
87
+ Or on Linux:
88
+ ```shell
89
+ xclip -o | sembr | xclip -i
90
+ ```
91
+
92
+ Additionally,
93
+ you can specify the following options
94
+ to customize the behavior of SemBr:
95
+
96
+ * `-m <model_name>`:
97
+ The name of the Hugging Face model to use.
98
+ - The default is
99
+ [`admko/sembr2023-bert-small`][sembr-bert-small].
100
+ - To use it offline,
101
+ you can download the model from Hugging Face,
102
+ and then specify the path to the model directory,
103
+ or prepend `TRANSFORMERS_OFFLINE=1` to the command
104
+ to use the cached model.
105
+ * `-l`:
106
+ Serves the SemBr API on a local server.
107
+ - Each run of `sembr`
108
+ will detect if the API is accessible,
109
+ and if not it will run the model on its own.
110
+ - This option is useful
111
+ to avoid the time taken to initialize the model
112
+ by keeping it in memory in a separate process.
113
+ * `-p <port>`:
114
+ The port to serve the SemBr API on.
115
+ - The default is `8384`.
116
+ * `-s <ip>`:
117
+ The IP address to serve the SemBr API on.
118
+ - The default is `127.0.0.1`.
119
+
120
+
121
+ ## What are Semantic Line Breaks?
122
+
123
+ [Semantic Line Breaks][sembr]
124
+ or [Semantic Linefeeds][semlf]
125
+ describe a set of conventions
126
+ for using insensitive vertical whitespace
127
+ to structure prose along semantic boundaries.
128
+
129
+
130
+ ## Why use Semantic Line Breaks?
131
+
132
+ Semantic Line Breaks have the following advantages:
133
+
134
+ * Breaking lines by splitting clauses
135
+ reflects the logical, grammatical and semantic structure
136
+ of the text.
137
+
138
+ * It enhances the ease of editing and version control
139
+ for a text file.
140
+ Merge conflicts are less likely to occur
141
+ when small changes are made,
142
+ and the changes are easier to identify.
143
+
144
+ * Documents written with semantic line breaks
145
+ are easier to navigate and edit
146
+ with Vim and other text editors
147
+ that use Vim keybindings.
148
+
149
+ * Semantic line breaks
150
+ are invisible to readers.
151
+ The final rendered output
152
+ shows no changes to the source text.
153
+
154
+
155
+ ## Why SemBr?
156
+
157
+ Converting existing text not written
158
+ with semantic line breaks
159
+ takes a long time to do manually,
160
+ and it is surprisingly difficult
161
+ to do it automatically with rule-based methods.
162
+
163
+ ### Challenges of rule-based methods
164
+
165
+ Rule-based heuristics do not work well
166
+ with the actual semantic structure of the text,
167
+ often leading to incorrect semantic boundaries.
168
+ Moreover,
169
+ these boundaries are hierarchical and nested,
170
+ and a rule-based approach
171
+ cannot capture this structure.
172
+ A semantic line break
173
+ may occur after a dependent clause,
174
+ but where to break clauses into lines
175
+ is challenging to determine
176
+ without syntactic and semantic reasoning capabilities.
177
+ For example:
178
+
179
+ * A rule that breaks lines at punctuation marks
180
+ will not work well with sentences
181
+ that contain periods
182
+ in abbreviations or mathematical expressions.
183
+
184
+ * Syntactic or semantic structures
185
+ are not always easy to determine.
186
+ "I like to eat apples and oranges
187
+ because they are healthy."
188
+ should be broken into lines as follows:
189
+ ```
190
+ > I like to eat apples and oranges
191
+ > because they are healthy.
192
+ ```
193
+ rather than:
194
+ ```
195
+ > I like to eat apples
196
+ > and oranges because they are healthy.
197
+ ```
198
+
199
+ For this reason,
200
+ I have created SemBr,
201
+ which uses finetuned Transformer models
202
+ to predict line breaks at semantic boundaries.
203
+
204
+
205
+ ## How does SemBr work?
206
+
207
+ SemBr uses a Transformer model
208
+ to predict line breaks at semantic boundaries.
209
+
210
+ A small dataset of text with semantic line breaks
211
+ was created from my existing LaTeX documents.
212
+ The dataset was split into training
213
+ (46,295 lines, 170,681 words and 1,492,952 characters)
214
+ and test
215
+ (2,187 lines, 7,564 words and 72,231 characters)
216
+ datasets.
217
+
218
+ The data was prepared
219
+ by extracting line breaks and indent levels
220
+ from the files,
221
+ and then converting the result
222
+ into strings of paragraphs with line breaks removed.
223
+ The data can then be tokenized using the tokenizer
224
+ and converted into a dataset with tokens,
225
+ where each token has a label
226
+ denoting whether there is a line break before it,
227
+ and the indent level of the token.
228
+
229
+ For LaTeX documents,
230
+ there are two types of line breaks:
231
+ one with a normal line break
232
+ that adds implicit spacing (e.g. `line a⏎line b`)
233
+ and one with no spacing (e.g. `line a%⏎line b`).
234
+ The data processor
235
+ also tries to preserve the LaTeX syntax of the text
236
+ by adding and removing comment symbols (`%`),
237
+ if necessary.
238
+
239
+ The pretrained masked language model
240
+ is then finetuned as a token classifier
241
+ on the training dataset
242
+ to predict the labels of the tokens.
243
+ We save the model with the best F1 score
244
+ on correctly predicting the existence of a line break
245
+ on the test set.
246
+ The finetuning logs for the following models
247
+ can be found on this [WandB][wandb] report:
248
+
249
+ * `distilbert-base-uncased`
250
+ [[Pretrained]][distilbert-bu]
251
+ [[Finetuned]][sembr-distilbert-bu]
252
+ * `distilbert-base-cased`
253
+ [[Pretrained]][distilbert-bc]
254
+ [[Finetuned]][sembr-distilbert-bc]
255
+ * `distilbert-base-uncased-finetuned-sst-2-english`
256
+ [[Pretrained]][distilbert-bufs2e]
257
+ [[Finetuned]][sembr-distilbert-bufs2e]
258
+ * `prajjwal1/bert-tiny`
259
+ [[Pretrained]][bert-tiny]
260
+ [[Finetuned]][sembr-bert-tiny]
261
+ * `prajjwal1/bert-mini`
262
+ [[Pretrained]][bert-mini]
263
+ [[Finetuned]][sembr-bert-mini]
264
+ * `prajjwal1/bert-small`
265
+ [[Pretrained]][bert-small]
266
+ [[Finetuned]][sembr-bert-small]
267
+
268
+
269
+ ## Performance
270
+
271
+ Current inference speed on an M2 MacBook Pro
272
+ is about 850 words per second
273
+ on `bert-small` with the default options,
274
+ and the memory usage is about 1.70 GB.
275
+
276
+ The line-breaking accuracy is difficult to measure,
277
+ and the locations of line breaks
278
+ could also be subjective.
279
+ On the test set,
280
+ the per-token line break accuracy
281
+ of the models is >95%,
282
+ with ~80% F1 scores.
283
+ Because of the sparse nature of line breaks,
284
+ the accuracy is not a good metric
285
+ to measure the performance of the model,
286
+ and I used the F1 score instead
287
+ to select the best models.
288
+
289
+
290
+ ## Improvements and TODOs
291
+
292
+ * [ ] Support natural languages other than English.
293
+ * [ ] Support other markup languages such as Markdown.
294
+ * [ ] Some lines are too long without a line break.
295
+ The inference algorithm can be improved
296
+ to penalize long lines.
297
+ * [ ] Performance and accuracy benchmarking,
298
+ and comparisons with related works.
299
+ * [ ] Improve inference speed.
300
+ * [ ] Reduce memory usage.
301
+ * [ ] Improve indent level prediction.
302
+ * [ ] Inference queue.
303
+ * [ ] VSCode extension.
304
+
305
+
306
+ ## Related Projects and References
307
+
308
+ Sentence splitting:
309
+ * https://code.google.com/archive/p/splitta/
310
+ * https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
311
+ * https://github.com/nipunsadvilkar/pySBD
312
+ * https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html
313
+
314
+ Semantic line breaking:
315
+ * https://github.com/sembr/specification
316
+ * https://github.com/waldyrious/semantic-linebreaker
317
+ * https://github.com/bobheadxi/readable ([blog post](https://bobheadxi.dev/semantic-line-breaks/))
318
+ * https://github.com/chrisgrieser/obsidian-sembr
319
+ * https://github.com/cllns/semantic_linefeeds
320
+
321
+
322
+ [transformers1]: https://huggingface.co/learn/nlp-course/chapter1/4
323
+ [transformers2]: https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/
324
+
325
+ [pypi]: https://pypi.org/project/sembr
326
+
327
+ [sembr]: https://sembr.org
328
+ [semlf]: https://rhodesmill.org/brandon/2012/one-sentence-per-line
329
+
330
+ [wandb]: https://api.wandb.ai/links/admk/efvui9f4
331
+
332
+ [distilbert-bu]: https://huggingface.co/distilbert-base-uncased
333
+ [distilbert-bc]: https://huggingface.co/distilbert-base-cased
334
+ [distilbert-bufs2e]: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
335
+ [bert-tiny]: https://huggingface.co/prajjwal1/bert-tiny
336
+ [bert-mini]: https://huggingface.co/prajjwal1/bert-mini
337
+ [bert-small]: https://huggingface.co/prajjwal1/bert-small
338
+ [sembr-distilbert-bu]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased
339
+ [sembr-distilbert-bc]: https://huggingface.co/admko/sembr2023-distilbert-base-cased
340
+ [sembr-distilbert-bufs2e]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased-finetuned-sst-2-english
341
+ [sembr-bert-tiny]: https://huggingface.co/admko/sembr2023-bert-tiny
342
+ [sembr-bert-mini]: https://huggingface.co/admko/sembr2023-bert-mini
343
+ [sembr-bert-small]: https://huggingface.co/admko/sembr2023-bert-small
sembr-0.0.2/README.md ADDED
@@ -0,0 +1,323 @@
1
+ # Semantic Line Breaker (SemBr)
2
+
3
+ [![GitHub](https://img.shields.io/github/license/admk/sembr)](LICENSE)
4
+ [![python](https://img.shields.io/badge/Python-3.10-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
5
+ [![pytorch](https://img.shields.io/badge/PyTorch-2.1.0-EE4C2C.svg?style=flat&logo=pytorch)](https://pytorch.org)
6
+ [![PyPI](https://badge.fury.io/py/sembr.svg)](https://pypi.org/project/sembr)
7
+
8
+ ```
9
+ > When writing text
10
+ > with a compatible markup language,
11
+ > add a line break
12
+ > after each substantial unit of thought.
13
+ ```
14
+
15
+
16
+ ## What is SemBr?
17
+
18
+ SemBr is a command-line tool
19
+ powered by [Transformer][transformers1] [models][transformers2]
20
+ that breaks lines in a text file at semantic boundaries.
21
+
22
+ ### Installation
23
+
24
+ SemBr is available as a [Python package on PyPI][pypi].
25
+ To install it,
26
+ simply run the following command in your terminal,
27
+ assuming that you have Python 3.10 or later installed:
28
+ ```shell
29
+ pip install sembr
30
+ ```
31
+
32
+ ### Supported Platforms
33
+
34
+ SemBr is supported on Linux, Mac and Windows.
35
+ On machines with CUDA devices,
36
+ or on Apple Silicon Macs,
37
+ SemBr will use the GPU (CUDA or Apple Metal)
38
+ to accelerate inference.
39
+
40
+ ### Usage
41
+
42
+ To use SemBr,
43
+ run the following command in your terminal:
44
+ ```shell
45
+ sembr -i <input_file> -o <output_file>
46
+ ```
47
+ where `<input_file>` and `<output_file>`
48
+ are the paths to the input and output files respectively.
49
+
50
+ On the first run,
51
+ it will download the SemBr model
52
+ and cache it in `~/.cache/huggingface`.
53
+ Subsequent runs will check for updates
54
+ and use the cached model if it is up-to-date.
55
+
56
+ Alternatively,
57
+ you can pipe the input into `sembr`,
58
+ and the output can also be printed to the terminal:
59
+ ```shell
60
+ cat <input_file> | sembr
61
+ ```
62
+ This is especially useful if you want to use SemBr
63
+ with clipboard managers, for instance, on a Mac:
64
+ ```shell
65
+ pbpaste | sembr | pbcopy
66
+ ```
67
+ Or on Linux:
68
+ ```shell
69
+ xclip -o | sembr | xclip -i
70
+ ```
71
+
72
+ Additionally,
73
+ you can specify the following options
74
+ to customize the behavior of SemBr:
75
+
76
+ * `-m <model_name>`:
77
+ The name of the Hugging Face model to use.
78
+ - The default is
79
+ [`admko/sembr2023-bert-small`][sembr-bert-small].
80
+ - To use it offline,
81
+ you can download the model from Hugging Face,
82
+ and then specify the path to the model directory,
83
+ or prepend `TRANSFORMERS_OFFLINE=1` to the command
84
+ to use the cached model.
85
+ * `-l`:
86
+ Serves the SemBr API on a local server.
87
+ - Each run of `sembr`
88
+ will detect if the API is accessible,
89
+ and if not it will run the model on its own.
90
+ - This option is useful
91
+ to avoid the time taken to initialize the model
92
+ by keeping it in memory in a separate process.
93
+ * `-p <port>`:
94
+ The port to serve the SemBr API on.
95
+ - The default is `8384`.
96
+ * `-s <ip>`:
97
+ The IP address to serve the SemBr API on.
98
+ - The default is `127.0.0.1`.
99
+
100
+
101
+ ## What are Semantic Line Breaks?
102
+
103
+ [Semantic Line Breaks][sembr]
104
+ or [Semantic Linefeeds][semlf]
105
+ describe a set of conventions
106
+ for using insensitive vertical whitespace
107
+ to structure prose along semantic boundaries.
108
+
109
+
110
+ ## Why use Semantic Line Breaks?
111
+
112
+ Semantic Line Breaks have the following advantages:
113
+
114
+ * Breaking lines by splitting clauses
115
+ reflects the logical, grammatical and semantic structure
116
+ of the text.
117
+
118
+ * It enhances the ease of editing and version control
119
+ for a text file.
120
+ Merge conflicts are less likely to occur
121
+ when small changes are made,
122
+ and the changes are easier to identify.
123
+
124
+ * Documents written with semantic line breaks
125
+ are easier to navigate and edit
126
+ with Vim and other text editors
127
+ that use Vim keybindings.
128
+
129
+ * Semantic line breaks
130
+ are invisible to readers.
131
+ The final rendered output
132
+ shows no changes to the source text.
133
+
134
+
135
+ ## Why SemBr?
136
+
137
+ Converting existing text not written
138
+ with semantic line breaks
139
+ takes a long time to do manually,
140
+ and it is surprisingly difficult
141
+ to do it automatically with rule-based methods.
142
+
143
+ ### Challenges of rule-based methods
144
+
145
+ Rule-based heuristics do not work well
146
+ with the actual semantic structure of the text,
147
+ often leading to incorrect semantic boundaries.
148
+ Moreover,
149
+ these boundaries are hierarchical and nested,
150
+ and a rule-based approach
151
+ cannot capture this structure.
152
+ A semantic line break
153
+ may occur after a dependent clause,
154
+ but where to break clauses into lines
155
+ is challenging to determine
156
+ without syntactic and semantic reasoning capabilities.
157
+ For example:
158
+
159
+ * A rule that breaks lines at punctuation marks
160
+ will not work well with sentences
161
+ that contain periods
162
+ in abbreviations or mathematical expressions.
163
+
164
+ * Syntactic or semantic structures
165
+ are not always easy to determine.
166
+ "I like to eat apples and oranges
167
+ because they are healthy."
168
+ should be broken into lines as follows:
169
+ ```
170
+ > I like to eat apples and oranges
171
+ > because they are healthy.
172
+ ```
173
+ rather than:
174
+ ```
175
+ > I like to eat apples
176
+ > and oranges because they are healthy.
177
+ ```
178
+
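The first failure mode above can be reproduced with a few lines of Python.
This is a minimal sketch and is not taken from the SemBr source:
the function name `break_at_punctuation` and the sample sentence are made up for illustration,
and the rule simply starts a new line after any punctuation mark followed by whitespace.

```python
import re


def break_at_punctuation(text: str) -> list[str]:
    """Naive rule (hypothetical): break after '.', ',', ';', ':', '!' or '?' plus whitespace."""
    return [part for part in re.split(r"(?<=[.,;:!?])\s+", text) if part]


sentence = "Use a small model, e.g. bert-small, for fast inference."
lines = break_at_punctuation(sentence)
for line in lines:
    print(line)
```

The abbreviation `e.g.` ends up on a line of its own,
because the rule cannot tell the period of an abbreviation from the end of a clause.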
179
+ For this reason,
180
+ I have created SemBr,
181
+ which uses finetuned Transformer models
182
+ to predict line breaks at semantic boundaries.
183
+
184
+
185
+ ## How does SemBr work?
186
+
187
+ SemBr uses a Transformer model
188
+ to predict line breaks at semantic boundaries.
189
+
190
+ A small dataset of text with semantic line breaks
191
+ was created from my existing LaTeX documents.
192
+ The dataset was split into training
193
+ (46,295 lines, 170,681 words and 1,492,952 characters)
194
+ and test
195
+ (2,187 lines, 7,564 words and 72,231 characters)
196
+ datasets.
197
+
198
+ The data was prepared
199
+ by extracting line breaks and indent levels
200
+ from the files,
201
+ and then converting the result
202
+ into strings of paragraphs with line breaks removed.
203
+ The data can then be tokenized using the tokenizer
204
+ and converted into a dataset with tokens,
205
+ where each token has a label
206
+ denoting whether there is a line break before it,
207
+ and the indent level of the token.
208
+
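The preparation step described above can be sketched as follows.
This is an illustration under stated assumptions rather than the actual SemBr data processor:
`prepare_example` and `indent_width` are hypothetical names,
whitespace splitting stands in for the model tokenizer,
and an indent level is assumed to be four spaces.

```python
def prepare_example(lines: list[str], indent_width: int = 4):
    """Flatten semantically broken lines into one paragraph plus per-token labels.

    Each label is a pair (break_before, indent_level).
    Whitespace splitting is an assumption; a real pipeline would use the model tokenizer.
    """
    tokens: list[str] = []
    labels: list[tuple[bool, int]] = []
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped:
            continue  # skip blank lines
        indent_level = (len(line) - len(stripped)) // indent_width
        for position, word in enumerate(stripped.split()):
            tokens.append(word)
            # The first word of every line except the very first follows a line break.
            break_before = position == 0 and len(tokens) > 1
            labels.append((break_before, indent_level))
    paragraph = " ".join(tokens)
    return paragraph, tokens, labels


source_lines = [
    "SemBr is a command-line tool",
    "    that breaks lines",
    "    at semantic boundaries.",
]
paragraph, tokens, labels = prepare_example(source_lines)
print(paragraph)
print([token for token, (brk, _) in zip(tokens, labels) if brk])
```

The paragraph string is what a model would see at inference time,
while the labels carry the information needed to restore the line breaks and indentation.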
209
+ For LaTeX documents,
210
+ there are two types of line breaks:
211
+ one with a normal line break
212
+ that adds implicit spacing (e.g. `line a⏎line b`)
213
+ and one with no spacing (e.g. `line a%⏎line b`).
214
+ The data processor
215
+ also tries to preserve the LaTeX syntax of the text
216
+ by adding and removing comment symbols (`%`),
217
+ if necessary.
218
+
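The difference between the two kinds of line break can be shown with a small helper.
This is a simplified sketch, not the SemBr data processor:
`join_latex_lines` is a hypothetical name,
and it only handles a `%` at the very end of a line,
ignoring comments with trailing text and a `%` preceded by an escaped backslash.

```python
def join_latex_lines(lines: list[str]) -> str:
    """Join lines into one paragraph, honouring the two kinds of LaTeX line break."""
    paragraph = ""
    for line in lines:
        line = line.strip()
        if line.endswith("%") and not line.endswith("\\%"):
            # A trailing `%` comments out the newline, so no space is inserted.
            paragraph += line[:-1]
        else:
            # A plain newline is rendered by LaTeX as a single space.
            paragraph += line + " "
    return paragraph.rstrip()


print(join_latex_lines(["line a", "line b"]))
print(join_latex_lines(["line a%", "line b"]))
```

An escaped percent sign (`\%`) is ordinary text,
so a line ending in `50\%` is joined to the next one with a space.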
219
+ The pretrained masked language model
220
+ is then finetuned as a token classifier
221
+ on the training dataset
222
+ to predict the labels of the tokens.
223
+ We save the model with the best F1 score
224
+ on correctly predicting the existence of a line break
225
+ on the test set.
226
+ The finetuning logs for the following models
227
+ can be found on this [WandB][wandb] report:
228
+
229
+ * `distilbert-base-uncased`
230
+ [[Pretrained]][distilbert-bu]
231
+ [[Finetuned]][sembr-distilbert-bu]
232
+ * `distilbert-base-cased`
233
+ [[Pretrained]][distilbert-bc]
234
+ [[Finetuned]][sembr-distilbert-bc]
235
+ * `distilbert-base-uncased-finetuned-sst-2-english`
236
+ [[Pretrained]][distilbert-bufs2e]
237
+ [[Finetuned]][sembr-distilbert-bufs2e]
238
+ * `prajjwal1/bert-tiny`
239
+ [[Pretrained]][bert-tiny]
240
+ [[Finetuned]][sembr-bert-tiny]
241
+ * `prajjwal1/bert-mini`
242
+ [[Pretrained]][bert-mini]
243
+ [[Finetuned]][sembr-bert-mini]
244
+ * `prajjwal1/bert-small`
245
+ [[Pretrained]][bert-small]
246
+ [[Finetuned]][sembr-bert-small]
247
+
248
+
249
+ ## Performance
250
+
251
+ Current inference speed on an M2 MacBook Pro
252
+ is about 850 words per second
253
+ on `bert-small` with the default options,
254
+ and the memory usage is about 1.70 GB.
255
+
256
+ The line-breaking accuracy is difficult to measure,
257
+ and the locations of line breaks
258
+ could also be subjective.
259
+ On the test set,
260
+ the per-token line break accuracy
261
+ of the models is >95%,
262
+ with ~80% F1 scores.
263
+ Because of the sparse nature of line breaks,
264
+ the accuracy is not a good metric
265
+ to measure the performance of the model,
266
+ and I used the F1 score instead
267
+ to select the best models.
268
+
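A small worked example shows why accuracy is misleading when line breaks are sparse.
The numbers below are hypothetical and are not SemBr results:
100 tokens are assumed, 10 of which are preceded by a line break,
and the "model" is a degenerate one that never predicts a break.

```python
def accuracy_and_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Per-token accuracy and F1 for the positive class (1 = line break before token)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Hypothetical labels: every tenth token starts a new line.
y_true = [1 if i % 10 == 0 else 0 for i in range(100)]
never_break = [0] * 100

print(accuracy_and_f1(y_true, never_break))  # (0.9, 0.0)
```

The degenerate predictor reaches 90% accuracy while finding no line breaks at all,
which the F1 score of 0 makes visible.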
269
+
270
+ ## Improvements and TODOs
271
+
272
+ * [ ] Support natural languages other than English.
273
+ * [ ] Support other markup languages such as Markdown.
274
+ * [ ] Some lines are too long without a line break.
275
+ The inference algorithm can be improved
276
+ to penalize long lines.
277
+ * [ ] Performance and accuracy benchmarking,
278
+ and comparisons with related works.
279
+ * [ ] Improve inference speed.
280
+ * [ ] Reduce memory usage.
281
+ * [ ] Improve indent level prediction.
282
+ * [ ] Inference queue.
283
+ * [ ] VSCode extension.
284
+
285
+
286
+ ## Related Projects and References
287
+
288
+ Sentence splitting:
289
+ * https://code.google.com/archive/p/splitta/
290
+ * https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
291
+ * https://github.com/nipunsadvilkar/pySBD
292
+ * https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html
293
+
294
+ Semantic line breaking:
295
+ * https://github.com/sembr/specification
296
+ * https://github.com/waldyrious/semantic-linebreaker
297
+ * https://github.com/bobheadxi/readable ([blog post](https://bobheadxi.dev/semantic-line-breaks/))
298
+ * https://github.com/chrisgrieser/obsidian-sembr
299
+ * https://github.com/cllns/semantic_linefeeds
300
+
301
+
302
+ [transformers1]: https://huggingface.co/learn/nlp-course/chapter1/4
303
+ [transformers2]: https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/
304
+
305
+ [pypi]: https://pypi.org/project/sembr
306
+
307
+ [sembr]: https://sembr.org
308
+ [semlf]: https://rhodesmill.org/brandon/2012/one-sentence-per-line
309
+
310
+ [wandb]: https://api.wandb.ai/links/admk/efvui9f4
311
+
312
+ [distilbert-bu]: https://huggingface.co/distilbert-base-uncased
313
+ [distilbert-bc]: https://huggingface.co/distilbert-base-cased
314
+ [distilbert-bufs2e]: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
315
+ [bert-tiny]: https://huggingface.co/prajjwal1/bert-tiny
316
+ [bert-mini]: https://huggingface.co/prajjwal1/bert-mini
317
+ [bert-small]: https://huggingface.co/prajjwal1/bert-small
318
+ [sembr-distilbert-bu]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased
319
+ [sembr-distilbert-bc]: https://huggingface.co/admko/sembr2023-distilbert-base-cased
320
+ [sembr-distilbert-bufs2e]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased-finetuned-sst-2-english
321
+ [sembr-bert-tiny]: https://huggingface.co/admko/sembr2023-bert-tiny
322
+ [sembr-bert-mini]: https://huggingface.co/admko/sembr2023-bert-mini
323
+ [sembr-bert-small]: https://huggingface.co/admko/sembr2023-bert-small