sembr 0.2.0__tar.gz → 0.2.2__tar.gz
- {sembr-0.2.0/sembr.egg-info → sembr-0.2.2}/PKG-INFO +59 -14
- {sembr-0.2.0 → sembr-0.2.2}/README.md +58 -13
- {sembr-0.2.0 → sembr-0.2.2}/sembr/__init__.py +1 -1
- {sembr-0.2.0 → sembr-0.2.2}/sembr/cli.py +18 -7
- {sembr-0.2.0 → sembr-0.2.2}/sembr/mcp.py +0 -17
- {sembr-0.2.0 → sembr-0.2.2/sembr.egg-info}/PKG-INFO +59 -14
- {sembr-0.2.0 → sembr-0.2.2}/LICENSE +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/pyproject.toml +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr/databuilder.py +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr/dataset.py +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr/eval.py +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr/inference.py +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr/process.py +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr/train.py +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr/utils.py +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr.egg-info/SOURCES.txt +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr.egg-info/dependency_links.txt +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr.egg-info/entry_points.txt +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr.egg-info/requires.txt +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/sembr.egg-info/top_level.txt +0 -0
- {sembr-0.2.0 → sembr-0.2.2}/setup.cfg +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: sembr
|
|
3
|
-
Version: 0.2.
|
|
3
|
+
Version: 0.2.2
|
|
4
4
|
Summary: A semantic linebreaker powered by transformers
|
|
5
5
|
Author: admk
|
|
6
6
|
License-Expression: MIT
|
|
@@ -76,6 +76,8 @@ to accelerate inference.
|
|
|
76
76
|
|
|
77
77
|
### Usage
|
|
78
78
|
|
|
79
|
+
#### Command Line Interface
|
|
80
|
+
|
|
79
81
|
To use SemBr,
|
|
80
82
|
run the following command in your terminal:
|
|
81
83
|
```shell
|
|
@@ -110,7 +112,7 @@ Additionally,
|
|
|
110
112
|
you can specify the following options
|
|
111
113
|
to customize the behavior of SemBr:
|
|
112
114
|
|
|
113
|
-
* `-m <model_name>`:
|
|
115
|
+
* `-m <model_name>`, `--model-name <model_name>`:
|
|
114
116
|
The name of the Hugging Face model to use.
|
|
115
117
|
- The default is
|
|
116
118
|
[`admko/sembr2023-bert-small`][sembr-bert-small].
|
|
@@ -119,7 +121,7 @@ to customize the behavior of SemBr:
|
|
|
119
121
|
and then specify the path to the model directory,
|
|
120
122
|
or prepend `TRANSFORMERS_OFFLINE=1` to the command
|
|
121
123
|
to use the cached model.
|
|
122
|
-
* `-l`:
|
|
124
|
+
* `-l`, `--listen`:
|
|
123
125
|
Serves the SemBr API on a local server.
|
|
124
126
|
- Each instance of `sembr` run
|
|
125
127
|
will detect if the API is accessible,
|
|
@@ -127,13 +129,58 @@ to customize the behavior of SemBr:
|
|
|
127
129
|
- This option is useful
|
|
128
130
|
to avoid the time taken to initialize the model
|
|
129
131
|
by keeping it in memory in a separate process.
|
|
130
|
-
* `-p <port>`:
|
|
132
|
+
* `-p <port>`, `--port <port>`:
|
|
131
133
|
The port to serve the SemBr API on.
|
|
132
134
|
- The default is `8384`.
|
|
133
|
-
* `-s <ip>`:
|
|
135
|
+
* `-s <ip>`, `--server <ip>`:
|
|
134
136
|
The IP address to serve the SemBr API on.
|
|
135
137
|
- The default is `127.0.0.1`.
|
|
138
|
+
* `-b <int>`, `--batch_size <int>`:
|
|
139
|
+
The number of lines to process in a batch.
|
|
140
|
+
Default is `8`.
|
|
141
|
+
* `-d <int>`, `--overlap-divisor <int>`:
|
|
142
|
+
The overlap divisor for tiled inference.
|
|
143
|
+
Default is `8`.
|
|
144
|
+
* `-f <func>`, `--predict-func <func>`:
|
|
145
|
+
The prediction function to use.
|
|
146
|
+
Options are `argmax`, `logit_adjustment`, `greedy_line_breaks`.
|
|
147
|
+
Default is `argmax`.
|
|
148
|
+
* `-t <int>`, `--tokens-per-line <int>`:
|
|
149
|
+
Maximum tokens per line for greedy line breaking.
|
|
150
|
+
This is only effective
|
|
151
|
+
when using the `greedy_line_breaks` prediction function.
|
|
152
|
+
* `--bits <4|8>`:
|
|
153
|
+
Quantization bits for model weights (4 or 8).
|
|
154
|
+
Requires CUDA. Not supported on MPS.
|
|
155
|
+
* `--dtype <dtype>`:
|
|
156
|
+
Data type for model weights (e.g. `float16`, `bfloat16`).
|
|
157
|
+
Default is `float32`.
|
|
158
|
+
* `--mcp`:
|
|
159
|
+
Start MCP server mode instead of processing text.
|
|
160
|
+
|
|
161
|
+
|
|
162
|
+
#### MCP Server
|
|
163
|
+
|
|
164
|
+
Alternatively,
|
|
165
|
+
you can run `sembr` as an [MCP server][mcp].
|
|
166
|
+
Simply add the following configuration
|
|
167
|
+
to your MCP server configuration:
|
|
168
|
+
```json
|
|
169
|
+
"mcpServers": {
|
|
170
|
+
"sembr": {
|
|
171
|
+
"type": "stdio",
|
|
172
|
+
"command": "uvx",
|
|
173
|
+
"args": [
|
|
174
|
+
"sembr",
|
|
175
|
+
"--mcp"
|
|
176
|
+
],
|
|
177
|
+
}
|
|
178
|
+
}
|
|
179
|
+
```
|
|
136
180
|
|
|
181
|
+
The server also supports the formatting options described above.
|
|
182
|
+
It will expose a `wrap_text` tool
|
|
183
|
+
for the MCP client to use.
|
|
137
184
|
|
|
138
185
|
## What are Semantic Line Breaks?
|
|
139
186
|
|
|
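The `mcpServers` snippet added to the README in this hunk is a fragment: it has no enclosing braces and leaves a trailing comma after the `args` array, so strict JSON parsers reject it as written. Below is a minimal sketch that builds the same keys and values as a complete JSON document. It assumes the MCP client wants a full JSON object; where that object is stored depends on the client and is not stated in the diff.

```python
import json

# Same keys and values as the README fragment. The enclosing object is an
# assumption: the diff does not say which file or wrapper the client expects.
config = {
    "mcpServers": {
        "sembr": {
            "type": "stdio",
            "command": "uvx",
            "args": ["sembr", "--mcp"],
        }
    }
}

# json.dumps never emits trailing commas, so the result parses everywhere.
text = json.dumps(config, indent=2)
print(text)
```

Generating the text with `json.dumps` rather than writing it by hand avoids the trailing-comma problem.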
@@ -316,13 +363,10 @@ to save best models.
|
|
|
316
363
|
- [ ] Inference queue.
|
|
317
364
|
- [ ] Daemon with model unloading.
|
|
318
365
|
- Editor integration:
|
|
319
|
-
- [
|
|
320
|
-
- [
|
|
321
|
-
|
|
322
|
-
|
|
323
|
-
and also does not return logit values,
|
|
324
|
-
so no additional algorithms
|
|
325
|
-
can be used to improve the predictions.
|
|
366
|
+
- [x] ~~NeoVim plugin.~~
|
|
367
|
+
- [x] ~~VSCode extension.~~
|
|
368
|
+
- [x] MCP server.
|
|
369
|
+
- [x] ~~Use the [Hugging Face API][hfapi] for inference.~~
|
|
326
370
|
- Accuracy:
|
|
327
371
|
- Some lines are too short or too long:
|
|
328
372
|
- [x] Long lines can be penalized greedily
|
|
@@ -335,8 +379,8 @@ to save best models.
|
|
|
335
379
|
- [ ] Performance and accuracy benchmarking,
|
|
336
380
|
and comparisons with related works.
|
|
337
381
|
- Performance:
|
|
338
|
-
- [
|
|
339
|
-
- [
|
|
382
|
+
- [x] Improve inference speed.
|
|
383
|
+
- [x] Reduce memory usage.
|
|
340
384
|
|
|
341
385
|
|
|
342
386
|
## Related Projects and References
|
|
@@ -360,6 +404,7 @@ Semantic line breaking:
|
|
|
360
404
|
|
|
361
405
|
[pypi]: https://pypi.org/project/sembr
|
|
362
406
|
[uv]: https://github.com/astral-sh/uv
|
|
407
|
+
[mcp]: https://modelcontextprotocol.io/overview
|
|
363
408
|
|
|
364
409
|
[sembr]: https://sembr.org
|
|
365
410
|
[semlf]: https://rhodesmill.org/brandon/2012/one-sentence-per-line
|
|
@@ -50,6 +50,8 @@ to accelerate inference.
|
|
|
50
50
|
|
|
51
51
|
### Usage
|
|
52
52
|
|
|
53
|
+
#### Command Line Interface
|
|
54
|
+
|
|
53
55
|
To use SemBr,
|
|
54
56
|
run the following command in your terminal:
|
|
55
57
|
```shell
|
|
@@ -84,7 +86,7 @@ Additionally,
|
|
|
84
86
|
you can specify the following options
|
|
85
87
|
to customize the behavior of SemBr:
|
|
86
88
|
|
|
87
|
-
* `-m <model_name>`:
|
|
89
|
+
* `-m <model_name>`, `--model-name <model_name>`:
|
|
88
90
|
The name of the Hugging Face model to use.
|
|
89
91
|
- The default is
|
|
90
92
|
[`admko/sembr2023-bert-small`][sembr-bert-small].
|
|
@@ -93,7 +95,7 @@ to customize the behavior of SemBr:
|
|
|
93
95
|
and then specify the path to the model directory,
|
|
94
96
|
or prepend `TRANSFORMERS_OFFLINE=1` to the command
|
|
95
97
|
to use the cached model.
|
|
96
|
-
* `-l`:
|
|
98
|
+
* `-l`, `--listen`:
|
|
97
99
|
Serves the SemBr API on a local server.
|
|
98
100
|
- Each instance of `sembr` run
|
|
99
101
|
will detect if the API is accessible,
|
|
@@ -101,13 +103,58 @@ to customize the behavior of SemBr:
|
|
|
101
103
|
- This option is useful
|
|
102
104
|
to avoid the time taken to initialize the model
|
|
103
105
|
by keeping it in memory in a separate process.
|
|
104
|
-
* `-p <port>`:
|
|
106
|
+
* `-p <port>`, `--port <port>`:
|
|
105
107
|
The port to serve the SemBr API on.
|
|
106
108
|
- The default is `8384`.
|
|
107
|
-
* `-s <ip>`:
|
|
109
|
+
* `-s <ip>`, `--server <ip>`:
|
|
108
110
|
The IP address to serve the SemBr API on.
|
|
109
111
|
- The default is `127.0.0.1`.
|
|
112
|
+
* `-b <int>`, `--batch_size <int>`:
|
|
113
|
+
The number of lines to process in a batch.
|
|
114
|
+
Default is `8`.
|
|
115
|
+
* `-d <int>`, `--overlap-divisor <int>`:
|
|
116
|
+
The overlap divisor for tiled inference.
|
|
117
|
+
Default is `8`.
|
|
118
|
+
* `-f <func>`, `--predict-func <func>`:
|
|
119
|
+
The prediction function to use.
|
|
120
|
+
Options are `argmax`, `logit_adjustment`, `greedy_line_breaks`.
|
|
121
|
+
Default is `argmax`.
|
|
122
|
+
* `-t <int>`, `--tokens-per-line <int>`:
|
|
123
|
+
Maximum tokens per line for greedy line breaking.
|
|
124
|
+
This is only effective
|
|
125
|
+
when using the `greedy_line_breaks` prediction function.
|
|
126
|
+
* `--bits <4|8>`:
|
|
127
|
+
Quantization bits for model weights (4 or 8).
|
|
128
|
+
Requires CUDA. Not supported on MPS.
|
|
129
|
+
* `--dtype <dtype>`:
|
|
130
|
+
Data type for model weights (e.g. `float16`, `bfloat16`).
|
|
131
|
+
Default is `float32`.
|
|
132
|
+
* `--mcp`:
|
|
133
|
+
Start MCP server mode instead of processing text.
|
|
134
|
+
|
|
135
|
+
|
|
136
|
+
#### MCP Server
|
|
137
|
+
|
|
138
|
+
Alternatively,
|
|
139
|
+
you can run `sembr` as an [MCP server][mcp].
|
|
140
|
+
Simply add the following configuration
|
|
141
|
+
to your MCP server configuration:
|
|
142
|
+
```json
|
|
143
|
+
"mcpServers": {
|
|
144
|
+
"sembr": {
|
|
145
|
+
"type": "stdio",
|
|
146
|
+
"command": "uvx",
|
|
147
|
+
"args": [
|
|
148
|
+
"sembr",
|
|
149
|
+
"--mcp"
|
|
150
|
+
],
|
|
151
|
+
}
|
|
152
|
+
}
|
|
153
|
+
```
|
|
110
154
|
|
|
155
|
+
The server also supports the formatting options described above.
|
|
156
|
+
It will expose a `wrap_text` tool
|
|
157
|
+
for the MCP client to use.
|
|
111
158
|
|
|
112
159
|
## What are Semantic Line Breaks?
|
|
113
160
|
|
|
@@ -290,13 +337,10 @@ to save best models.
|
|
|
290
337
|
- [ ] Inference queue.
|
|
291
338
|
- [ ] Daemon with model unloading.
|
|
292
339
|
- Editor integration:
|
|
293
|
-
- [
|
|
294
|
-
- [
|
|
295
|
-
|
|
296
|
-
|
|
297
|
-
and also does not return logit values,
|
|
298
|
-
so no additional algorithms
|
|
299
|
-
can be used to improve the predictions.
|
|
340
|
+
- [x] ~~NeoVim plugin.~~
|
|
341
|
+
- [x] ~~VSCode extension.~~
|
|
342
|
+
- [x] MCP server.
|
|
343
|
+
- [x] ~~Use the [Hugging Face API][hfapi] for inference.~~
|
|
300
344
|
- Accuracy:
|
|
301
345
|
- Some lines are too short or too long:
|
|
302
346
|
- [x] Long lines can be penalized greedily
|
|
@@ -309,8 +353,8 @@ to save best models.
|
|
|
309
353
|
- [ ] Performance and accuracy benchmarking,
|
|
310
354
|
and comparisons with related works.
|
|
311
355
|
- Performance:
|
|
312
|
-
- [
|
|
313
|
-
- [
|
|
356
|
+
- [x] Improve inference speed.
|
|
357
|
+
- [x] Reduce memory usage.
|
|
314
358
|
|
|
315
359
|
|
|
316
360
|
## Related Projects and References
|
|
@@ -334,6 +378,7 @@ Semantic line breaking:
|
|
|
334
378
|
|
|
335
379
|
[pypi]: https://pypi.org/project/sembr
|
|
336
380
|
[uv]: https://github.com/astral-sh/uv
|
|
381
|
+
[mcp]: https://modelcontextprotocol.io/overview
|
|
337
382
|
|
|
338
383
|
[sembr]: https://sembr.org
|
|
339
384
|
[semlf]: https://rhodesmill.org/brandon/2012/one-sentence-per-line
|
|
--- sembr-0.2.0/sembr/cli.py
+++ sembr-0.2.2/sembr/cli.py
@@ -89,6 +89,8 @@ def start_server(port, tokenizer, model, processor, wrap_kwargs=None):
     text = form['text']
     kwargs = dict(wrap_kwargs or {})
     for k, v in form.items():
+        if k == 'text':
+            continue
         if k in ['batch_size', 'tokens_per_line', 'overlap_divisor']:
             v = int(v)
         kwargs[k] = v
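The two added lines keep the request payload out of the option dictionary. Before the change, `text` was copied into `kwargs` along with the real options. If those options are later spread into a call that also receives the text positionally, that would presumably collide. A minimal sketch of the loop in isolation follows; `build_kwargs` and `INT_KEYS` are hypothetical names and not part of sembr.

```python
# Hypothetical helper mirroring the loop in the hunk above; not sembr's API.
INT_KEYS = ('batch_size', 'tokens_per_line', 'overlap_divisor')


def build_kwargs(form, wrap_kwargs=None):
    """Merge request form fields into a copy of the default options."""
    kwargs = dict(wrap_kwargs or {})
    for k, v in form.items():
        if k == 'text':
            # The payload travels separately, so it must not become an option.
            continue
        if k in INT_KEYS:
            # Form values arrive as strings; numeric options are converted.
            v = int(v)
        kwargs[k] = v
    return kwargs


form = {'text': 'Hello world.', 'batch_size': '4', 'predict_func': 'argmax'}
print(build_kwargs(form, {'overlap_divisor': 8}))
```

Copying the defaults with `dict(...)` means per-request overrides never leak into later requests.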
@@ -157,7 +159,7 @@ def wrap_kwargs(args):
     }


-def main():
+def main() -> int:
     parser = cli_parser()
     args = parser.parse_args()
     if args.debug:
@@ -167,13 +169,21 @@ def main():
         debugpy.wait_for_client()
     if args.mcp:
         from .mcp import mcp
+        unsupported = ['input_file', 'output_file', 'listen']
+        for arg_name in unsupported:
+            if getattr(args, arg_name) in [None, False]:
+                continue
+            message = f'--{arg_name} is not supported in MCP mode.'
+            print(message, file=sys.stderr)
+            return 1
         mcp.run()
-        return
+        return 0
     kwargs = wrap_kwargs(args)
     if args.listen:
         tokenizer, model, processor = init(
             args.model_name, args.bits, args.dtype)
-
+        start_server(args.port, tokenizer, model, processor, kwargs)
+        return 0
     if args.input_file is not None:
         with open(args.input_file, 'r', encoding='utf-8') as f:
             text = f.read()
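The MCP-mode guard added in this hunk can be exercised on its own. The sketch below assumes the parsed arguments expose `input_file`, `output_file` and `listen` as attributes, as the hunk implies. `check_mcp_args` and the two sample namespaces are hypothetical names used only for illustration.

```python
import argparse
import sys

# Attribute names taken from the hunk above.
UNSUPPORTED_IN_MCP = ['input_file', 'output_file', 'listen']


def check_mcp_args(args) -> int:
    """Return 1 if an option unsupported in MCP mode is set, otherwise 0."""
    for arg_name in UNSUPPORTED_IN_MCP:
        if getattr(args, arg_name) in [None, False]:
            # Unset options (None) and disabled flags (False) are fine.
            continue
        print(f'--{arg_name} is not supported in MCP mode.', file=sys.stderr)
        return 1
    return 0


ok = argparse.Namespace(input_file=None, output_file=None, listen=False)
bad = argparse.Namespace(input_file=None, output_file='out.txt', listen=False)
print(check_mcp_args(ok), check_mcp_args(bad))
```

The guard reports only the first offending option and then stops, which matches the early `return 1` inside the loop in the hunk.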
@@ -181,8 +191,8 @@ def main():
         text = sys.stdin.read()
     else:
         parser.print_help()
-        print('\nNo input file or stdin text provided.')
-        return
+        print('\nNo input file or stdin text provided.', file=sys.stderr)
+        return 1
     if check_server(args.server, args.port):
         result = rewrap_on_server(text, args.server, args.port, kwargs)
     else:
@@ -192,10 +202,11 @@ def main():
         result = sembr(text, tokenizer, model, processor, **kwargs)
     if args.output_file is None:
         print(result)
-        return
+        return 0
     with open(args.output_file, 'w', encoding='utf-8') as f:
         f.write(result)
+    return 0


 if __name__ == '__main__':
-    main()
+    sys.exit(main())
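Taken together, the cli.py hunks make `main()` return an integer that `sys.exit` turns into the process exit status. Diagnostics go to standard error, so they do not mix with wrapped text on standard output. A toy sketch of that pattern follows; `toy_main` takes an explicit `argv` list, which is an assumption made for testability and not how sembr's `main()` is declared.

```python
import sys


def toy_main(argv) -> int:
    """Toy entry point: succeed only when some input is given."""
    if not argv:
        # Errors go to stderr and are signalled with a non-zero status.
        print('No input file or stdin text provided.', file=sys.stderr)
        return 1
    print(' '.join(argv))
    return 0


# A script would end with `sys.exit(toy_main(sys.argv[1:]))`;
# here the return values are inspected directly instead.
print(toy_main(['some', 'text']), toy_main([]))
```

With this shape, shell callers can branch on the status (for example `sembr ... || echo failed`) instead of parsing output.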
--- sembr-0.2.0/sembr/mcp.py
+++ sembr-0.2.2/sembr/mcp.py
@@ -60,22 +60,5 @@ def wrap_text(
         structured_content={"success": True, "output": wrapped_text})


-@mcp.tool(
-    description="Apply semantic line breaks to file",
-    tags=["sembr", "semantic linebreak", "format", "file"],
-)
-def process_file(
-    file_path: Annotated[str, Field(description="File path to process")],
-) -> ToolResult:
-    try:
-        with open(file_path, 'r', encoding='utf-8') as f:
-            text = f.read()
-    except Exception as e:
-        return ToolResult(
-            content=[TextContent(type="text", text=f"Error reading file: {file_path}")],
-            structured_content={"success": False, "error": str(e)})
-    return wrap_text(text)
-
-
 if __name__ == "__main__":
     mcp.run()
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: sembr
|
|
3
|
-
Version: 0.2.
|
|
3
|
+
Version: 0.2.2
|
|
4
4
|
Summary: A semantic linebreaker powered by transformers
|
|
5
5
|
Author: admk
|
|
6
6
|
License-Expression: MIT
|
|
@@ -76,6 +76,8 @@ to accelerate inference.
|
|
|
76
76
|
|
|
77
77
|
### Usage
|
|
78
78
|
|
|
79
|
+
#### Command Line Interface
|
|
80
|
+
|
|
79
81
|
To use SemBr,
|
|
80
82
|
run the following command in your terminal:
|
|
81
83
|
```shell
|
|
@@ -110,7 +112,7 @@ Additionally,
|
|
|
110
112
|
you can specify the following options
|
|
111
113
|
to customize the behavior of SemBr:
|
|
112
114
|
|
|
113
|
-
* `-m <model_name>`:
|
|
115
|
+
* `-m <model_name>`, `--model-name <model_name>`:
|
|
114
116
|
The name of the Hugging Face model to use.
|
|
115
117
|
- The default is
|
|
116
118
|
[`admko/sembr2023-bert-small`][sembr-bert-small].
|
|
@@ -119,7 +121,7 @@ to customize the behavior of SemBr:
|
|
|
119
121
|
and then specify the path to the model directory,
|
|
120
122
|
or prepend `TRANSFORMERS_OFFLINE=1` to the command
|
|
121
123
|
to use the cached model.
|
|
122
|
-
* `-l`:
|
|
124
|
+
* `-l`, `--listen`:
|
|
123
125
|
Serves the SemBr API on a local server.
|
|
124
126
|
- Each instance of `sembr` run
|
|
125
127
|
will detect if the API is accessible,
|
|
@@ -127,13 +129,58 @@ to customize the behavior of SemBr:
|
|
|
127
129
|
- This option is useful
|
|
128
130
|
to avoid the time taken to initialize the model
|
|
129
131
|
by keeping it in memory in a separate process.
|
|
130
|
-
* `-p <port>`:
|
|
132
|
+
* `-p <port>`, `--port <port>`:
|
|
131
133
|
The port to serve the SemBr API on.
|
|
132
134
|
- The default is `8384`.
|
|
133
|
-
* `-s <ip>`:
|
|
135
|
+
* `-s <ip>`, `--server <ip>`:
|
|
134
136
|
The IP address to serve the SemBr API on.
|
|
135
137
|
- The default is `127.0.0.1`.
|
|
138
|
+
* `-b <int>`, `--batch_size <int>`:
|
|
139
|
+
The number of lines to process in a batch.
|
|
140
|
+
Default is `8`.
|
|
141
|
+
* `-d <int>`, `--overlap-divisor <int>`:
|
|
142
|
+
The overlap divisor for tiled inference.
|
|
143
|
+
Default is `8`.
|
|
144
|
+
* `-f <func>`, `--predict-func <func>`:
|
|
145
|
+
The prediction function to use.
|
|
146
|
+
Options are `argmax`, `logit_adjustment`, `greedy_line_breaks`.
|
|
147
|
+
Default is `argmax`.
|
|
148
|
+
* `-t <int>`, `--tokens-per-line <int>`:
|
|
149
|
+
Maximum tokens per line for greedy line breaking.
|
|
150
|
+
This is only effective
|
|
151
|
+
when using the `greedy_line_breaks` prediction function.
|
|
152
|
+
* `--bits <4|8>`:
|
|
153
|
+
Quantization bits for model weights (4 or 8).
|
|
154
|
+
Requires CUDA. Not supported on MPS.
|
|
155
|
+
* `--dtype <dtype>`:
|
|
156
|
+
Data type for model weights (e.g. `float16`, `bfloat16`).
|
|
157
|
+
Default is `float32`.
|
|
158
|
+
* `--mcp`:
|
|
159
|
+
Start MCP server mode instead of processing text.
|
|
160
|
+
|
|
161
|
+
|
|
162
|
+
#### MCP Server
|
|
163
|
+
|
|
164
|
+
Alternatively,
|
|
165
|
+
you can run `sembr` as an [MCP server][mcp].
|
|
166
|
+
Simply add the following configuration
|
|
167
|
+
to your MCP server configuration:
|
|
168
|
+
```json
|
|
169
|
+
"mcpServers": {
|
|
170
|
+
"sembr": {
|
|
171
|
+
"type": "stdio",
|
|
172
|
+
"command": "uvx",
|
|
173
|
+
"args": [
|
|
174
|
+
"sembr",
|
|
175
|
+
"--mcp"
|
|
176
|
+
],
|
|
177
|
+
}
|
|
178
|
+
}
|
|
179
|
+
```
|
|
136
180
|
|
|
181
|
+
The server also supports the formatting options described above.
|
|
182
|
+
It will expose a `wrap_text` tool
|
|
183
|
+
for the MCP client to use.
|
|
137
184
|
|
|
138
185
|
## What are Semantic Line Breaks?
|
|
139
186
|
|
|
@@ -316,13 +363,10 @@ to save best models.
|
|
|
316
363
|
- [ ] Inference queue.
|
|
317
364
|
- [ ] Daemon with model unloading.
|
|
318
365
|
- Editor integration:
|
|
319
|
-
- [
|
|
320
|
-
- [
|
|
321
|
-
|
|
322
|
-
|
|
323
|
-
and also does not return logit values,
|
|
324
|
-
so no additional algorithms
|
|
325
|
-
can be used to improve the predictions.
|
|
366
|
+
- [x] ~~NeoVim plugin.~~
|
|
367
|
+
- [x] ~~VSCode extension.~~
|
|
368
|
+
- [x] MCP server.
|
|
369
|
+
- [x] ~~Use the [Hugging Face API][hfapi] for inference.~~
|
|
326
370
|
- Accuracy:
|
|
327
371
|
- Some lines are too short or too long:
|
|
328
372
|
- [x] Long lines can be penalized greedily
|
|
@@ -335,8 +379,8 @@ to save best models.
|
|
|
335
379
|
- [ ] Performance and accuracy benchmarking,
|
|
336
380
|
and comparisons with related works.
|
|
337
381
|
- Performance:
|
|
338
|
-
- [
|
|
339
|
-
- [
|
|
382
|
+
- [x] Improve inference speed.
|
|
383
|
+
- [x] Reduce memory usage.
|
|
340
384
|
|
|
341
385
|
|
|
342
386
|
## Related Projects and References
|
|
@@ -360,6 +404,7 @@ Semantic line breaking:
|
|
|
360
404
|
|
|
361
405
|
[pypi]: https://pypi.org/project/sembr
|
|
362
406
|
[uv]: https://github.com/astral-sh/uv
|
|
407
|
+
[mcp]: https://modelcontextprotocol.io/overview
|
|
363
408
|
|
|
364
409
|
[sembr]: https://sembr.org
|
|
365
410
|
[semlf]: https://rhodesmill.org/brandon/2012/one-sentence-per-line
|
|
File without changes