sentencex 1.0.2__tar.gz → 1.0.4__tar.gz

This diff shows the contents of publicly released versions of the package as they appear in their respective public registries. It is provided for informational purposes only.

Potentially problematic release.



Files changed (123)
  1. {sentencex-1.0.2 → sentencex-1.0.4}/Cargo.lock +4 -4
  2. {sentencex-1.0.2 → sentencex-1.0.4}/PKG-INFO +1 -1
  3. {sentencex-1.0.2 → sentencex-1.0.4}/bindings/python/Cargo.toml +1 -1
  4. sentencex-1.0.4/bindings/python/publish.sh +6 -0
  5. sentencex-1.0.4/paris.txt +59 -0
  6. {sentencex-1.0.2 → sentencex-1.0.4}/pyproject.toml +1 -1
  7. {sentencex-1.0.2 → sentencex-1.0.4}/src/constants.rs +1 -1
  8. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/language.rs +1 -2
  9. {sentencex-1.0.2 → sentencex-1.0.4}/src/lib.rs +1 -1
  10. {sentencex-1.0.2 → sentencex-1.0.4}/src/main.rs +1 -1
  11. {sentencex-1.0.2 → sentencex-1.0.4}/tests/en.txt +5 -0
  12. {sentencex-1.0.2 → sentencex-1.0.4}/.github/workflows/node.yaml +0 -0
  13. {sentencex-1.0.2 → sentencex-1.0.4}/.github/workflows/python.yaml +0 -0
  14. {sentencex-1.0.2 → sentencex-1.0.4}/.github/workflows/rust.yml +0 -0
  15. {sentencex-1.0.2 → sentencex-1.0.4}/.github/workflows/wasm.yaml +0 -0
  16. {sentencex-1.0.2 → sentencex-1.0.4}/.gitignore +0 -0
  17. {sentencex-1.0.2 → sentencex-1.0.4}/100-0.txt +0 -0
  18. {sentencex-1.0.2 → sentencex-1.0.4}/11-0.txt +0 -0
  19. {sentencex-1.0.2 → sentencex-1.0.4}/1661-0.txt +0 -0
  20. {sentencex-1.0.2 → sentencex-1.0.4}/LICENSE +0 -0
  21. {sentencex-1.0.2 → sentencex-1.0.4}/README.md +0 -0
  22. {sentencex-1.0.2 → sentencex-1.0.4}/TODO.md +0 -0
  23. {sentencex-1.0.2 → sentencex-1.0.4}/benches/segment_benchmark.rs +0 -0
  24. {sentencex-1.0.2 → sentencex-1.0.4}/bindings/python/.gitignore +0 -0
  25. {sentencex-1.0.2 → sentencex-1.0.4}/bindings/python/.python-version +0 -0
  26. {sentencex-1.0.2 → sentencex-1.0.4}/bindings/python/Cargo.lock +0 -0
  27. {sentencex-1.0.2 → sentencex-1.0.4}/bindings/python/README.md +0 -0
  28. {sentencex-1.0.2 → sentencex-1.0.4}/bindings/python/example.py +0 -0
  29. {sentencex-1.0.2 → sentencex-1.0.4}/bindings/python/src/lib.rs +0 -0
  30. {sentencex-1.0.2 → sentencex-1.0.4}/bindings/python/tests/__init__.py +0 -0
  31. {sentencex-1.0.2 → sentencex-1.0.4}/bindings/python/tests/test_sentencex.py +0 -0
  32. {sentencex-1.0.2 → sentencex-1.0.4}/bindings/python/uv.lock +0 -0
  33. {sentencex-1.0.2 → sentencex-1.0.4}/demo/index.html +0 -0
  34. {sentencex-1.0.2 → sentencex-1.0.4}/examples/rust_example.rs +0 -0
  35. {sentencex-1.0.2 → sentencex-1.0.4}/oxygen.txt +0 -0
  36. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/am.txt +0 -0
  37. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/ar.txt +0 -0
  38. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/bg.txt +0 -0
  39. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/bn.txt +0 -0
  40. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/da.txt +0 -0
  41. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/de.txt +0 -0
  42. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/el.txt +0 -0
  43. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/en.txt +0 -0
  44. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/es.txt +0 -0
  45. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/fi.txt +0 -0
  46. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/fr.txt +0 -0
  47. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/gu.txt +0 -0
  48. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/hi.txt +0 -0
  49. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/it.txt +0 -0
  50. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/kk.txt +0 -0
  51. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/kn.txt +0 -0
  52. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/ml.txt +0 -0
  53. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/nl.txt +0 -0
  54. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/pa.txt +0 -0
  55. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/pl.txt +0 -0
  56. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/pt.txt +0 -0
  57. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/ru.txt +0 -0
  58. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/sk.txt +0 -0
  59. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/ta.txt +0 -0
  60. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/abbrev/te.txt +0 -0
  61. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/am.rs +0 -0
  62. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/ar.rs +0 -0
  63. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/bg.rs +0 -0
  64. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/bn.rs +0 -0
  65. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/ca.rs +0 -0
  66. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/da.rs +0 -0
  67. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/de.rs +0 -0
  68. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/el.rs +0 -0
  69. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/en.rs +0 -0
  70. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/es.rs +0 -0
  71. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/fallbacks.yaml +0 -0
  72. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/fi.rs +0 -0
  73. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/fr.rs +0 -0
  74. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/gu.rs +0 -0
  75. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/hi.rs +0 -0
  76. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/hy.rs +0 -0
  77. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/it.rs +0 -0
  78. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/ja.rs +0 -0
  79. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/kk.rs +0 -0
  80. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/kn.rs +0 -0
  81. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/ml.rs +0 -0
  82. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/mod.rs +0 -0
  83. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/mr.rs +0 -0
  84. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/my.rs +0 -0
  85. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/nl.rs +0 -0
  86. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/pa.rs +0 -0
  87. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/pl.rs +0 -0
  88. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/pt.rs +0 -0
  89. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/ru.rs +0 -0
  90. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/sk.rs +0 -0
  91. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/ta.rs +0 -0
  92. {sentencex-1.0.2 → sentencex-1.0.4}/src/languages/te.rs +0 -0
  93. {sentencex-1.0.2 → sentencex-1.0.4}/tests/am.txt +0 -0
  94. {sentencex-1.0.2 → sentencex-1.0.4}/tests/ar.txt +0 -0
  95. {sentencex-1.0.2 → sentencex-1.0.4}/tests/bg.txt +0 -0
  96. {sentencex-1.0.2 → sentencex-1.0.4}/tests/bn.txt +0 -0
  97. {sentencex-1.0.2 → sentencex-1.0.4}/tests/ca.txt +0 -0
  98. {sentencex-1.0.2 → sentencex-1.0.4}/tests/da.txt +0 -0
  99. {sentencex-1.0.2 → sentencex-1.0.4}/tests/de.txt +0 -0
  100. {sentencex-1.0.2 → sentencex-1.0.4}/tests/el.txt +0 -0
  101. {sentencex-1.0.2 → sentencex-1.0.4}/tests/es.txt +0 -0
  102. {sentencex-1.0.2 → sentencex-1.0.4}/tests/fi.txt +0 -0
  103. {sentencex-1.0.2 → sentencex-1.0.4}/tests/fr.txt +0 -0
  104. {sentencex-1.0.2 → sentencex-1.0.4}/tests/gu.txt +0 -0
  105. {sentencex-1.0.2 → sentencex-1.0.4}/tests/hi.txt +0 -0
  106. {sentencex-1.0.2 → sentencex-1.0.4}/tests/hy.txt +0 -0
  107. {sentencex-1.0.2 → sentencex-1.0.4}/tests/it.txt +0 -0
  108. {sentencex-1.0.2 → sentencex-1.0.4}/tests/ja.txt +0 -0
  109. {sentencex-1.0.2 → sentencex-1.0.4}/tests/kk.txt +0 -0
  110. {sentencex-1.0.2 → sentencex-1.0.4}/tests/kn.txt +0 -0
  111. {sentencex-1.0.2 → sentencex-1.0.4}/tests/ml.txt +0 -0
  112. {sentencex-1.0.2 → sentencex-1.0.4}/tests/mr.txt +0 -0
  113. {sentencex-1.0.2 → sentencex-1.0.4}/tests/my.txt +0 -0
  114. {sentencex-1.0.2 → sentencex-1.0.4}/tests/nl.txt +0 -0
  115. {sentencex-1.0.2 → sentencex-1.0.4}/tests/pa.txt +0 -0
  116. {sentencex-1.0.2 → sentencex-1.0.4}/tests/pl.txt +0 -0
  117. {sentencex-1.0.2 → sentencex-1.0.4}/tests/pt.txt +0 -0
  118. {sentencex-1.0.2 → sentencex-1.0.4}/tests/ru.txt +0 -0
  119. {sentencex-1.0.2 → sentencex-1.0.4}/tests/sk.txt +0 -0
  120. {sentencex-1.0.2 → sentencex-1.0.4}/tests/ta.txt +0 -0
  121. {sentencex-1.0.2 → sentencex-1.0.4}/tests/te.txt +0 -0
  122. {sentencex-1.0.2 → sentencex-1.0.4}/tests/ur.txt +0 -0
  123. {sentencex-1.0.2 → sentencex-1.0.4}/tests/zh.txt +0 -0
{sentencex-1.0.2 → sentencex-1.0.4}/Cargo.lock

@@ -653,7 +653,7 @@ checksum = "cd0b0ec5f1c1ca621c432a25813d8d60c88abe6d3e08a3eb9cf37d97a0fe3d73"

  [[package]]
  name = "sentencex"
- version = "0.1.2"
+ version = "0.1.4"
  dependencies = [
  "clap",
  "criterion",
@@ -666,7 +666,7 @@ dependencies = [

  [[package]]
  name = "sentencex-js"
- version = "1.0.2"
+ version = "1.0.4"
  dependencies = [
  "neon",
  "neon-build",
@@ -675,7 +675,7 @@ dependencies = [

  [[package]]
  name = "sentencex-py"
- version = "0.1.1"
+ version = "0.1.4"
  dependencies = [
  "pyo3",
  "sentencex",
@@ -683,7 +683,7 @@ dependencies = [

  [[package]]
  name = "sentencex-wasm"
- version = "0.1.3"
+ version = "0.1.4"
  dependencies = [
  "sentencex",
  "serde",
{sentencex-1.0.2 → sentencex-1.0.4}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: sentencex
- Version: 1.0.2
+ Version: 1.0.4
  Classifier: Intended Audience :: Developers
  Classifier: Intended Audience :: Science/Research
  Classifier: Topic :: Text Processing
{sentencex-1.0.2 → sentencex-1.0.4}/bindings/python/Cargo.toml

@@ -1,6 +1,6 @@
  [package]
  name = "sentencex-py"
- version = "0.1.1"
+ version = "0.1.4"
  edition = "2024"
  description = "Sentence segmentation library with wide language support optimized for speed and utility."
  authors = ["Santhosh Thottingal <santhosh.thottingal@gmail.com>"]
sentencex-1.0.4/bindings/python/publish.sh

@@ -0,0 +1,6 @@
+ #!/bin/bash
+ # Build and publish at pypi
+ uv tool run maturin build -i python3.12
+ uv tool run maturin build -i python3.13
+ uv tool run maturin build -i python3.14
+ uv tool run maturin publish
sentencex-1.0.4/paris.txt

@@ -0,0 +1,59 @@
+
+ Paris (
+ French pronunciation:
+
+
+ [paʁi]
+
+
+  (
+
+
+
+
+
+
+
+
+  
+
+ listen
+
+ )) Another sentence
+
+
+
+
+ ends here.
+ [
+ '\n' +
+ '\tParis (\n' +
+ '\tFrench pronunciation:\n' +
+ '\t\n' +
+ '\t\n' +
+ '\t\t[paʁi]\n' +
+ '\t\n' +
+ '\t\n' +
+ '\t\t (\n' +
+ '\t\t\n' +
+ '\t\t\t\n' +
+ '\t\t\t\t\n' +
+ '\t\t\t\t\t\n' +
+ '\t\t\t\t\t\t\n' +
+ '\t\t\t\t\t\t\t\n' +
+ '\t\t\t\t\t\t\n' +
+ '\t\t\t\t\t\n' +
+ '\t\t\t\t\t \n' +
+ '\t\t\t\t\n' +
+ '\t\t\t\tlisten\n' +
+ '\t\t\t\n' +
+ '\t\t)) Another sentence\n' +
+ '\t\n' +
+ '\t\t\n' +
+ '\t\t\t\n' +
+ '\t\t\n' +
+ '\tends here.'
+ ]
+ <p id="mwEA"><span class="cx-segment" data-segmentid="0"><b id="mwEQ">Paris</b> (<small about="#mwt16" data-mw="{&#34;parts&#34;:[{&#34;template&#34;:{&#34;target&#34;:{&#34;wt&#34;:&#34;IPA-fr&#34;,&#34;href&#34;:&#34;./Template:IPA-fr&#34;},&#34;params&#34;:{&#34;1&#34;:{&#34;wt&#34;:&#34;paʁi&#34;},&#34;3&#34;:{&#34;wt&#34;:&#34;Paris1.ogg&#34;}},&#34;i&#34;:0}}]}" id="mwEg" typeof="mw:Transclusion">French pronunciation:</small><span about="#mwt16" class="IPA" title="Representation in the International Phonetic Alphabet (IPA)"><a class="cx-link" data-linkid="1" href="./Help:IPA/French" rel="mw:WikiLink" title="Help:IPA/French">[paʁi]</a></span><small about="#mwt16" class="nowrap" id="mwEw"><span typeof="mw:Entity"> </span>(<span class="unicode haudio"><span class="fn"><span style="white-space:nowrap"><span data-mw="{&#34;caption&#34;:&#34;About this sound&#34;}" typeof="mw:Image"><a href="./File:Paris1.ogg"><img data-file-height="20" data-file-type="drawing" data-file-width="20" height="11" resource="./File:Loudspeaker.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/11px-Loudspeaker.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/22px-Loudspeaker.svg.png 2x, //upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/17px-Loudspeaker.svg.png 1.5x" width="11"></img></a></span><span typeof="mw:Entity"> </span></span><a href="//upload.wikimedia.org/wikipedia/commons/2/2c/Paris1.ogg" rel="mw:MediaLink" title="Paris1.ogg">listen</a></span></span>)</small>) Another sentence<span data-mw="{&#34;caption&#34;:&#34;A different caption&#34;}" typeof="mw:Image"><a href="./File:Paris1232.ogg"><img data-file-height="20" data-file-type="drawing" data-file-width="20" height="11" resource="./File:Loudspeaker.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/11px-Loudspeaker.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/22px-Loudspeaker.svg.png 2x, //upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/17px-Loudspeaker.svg.png 1.5x" width="11"></img></a></span>ends here.</span></p>
+ ==Categories==
+ []
{sentencex-1.0.2 → sentencex-1.0.4}/pyproject.toml

@@ -1,6 +1,6 @@
  [project]
  name = "sentencex"
- version = "1.0.2"
+ version = "1.0.4"
  requires-python = ">=3.10"
  description = "Sentence segmenter that supports ~300 languages"
  authors = [{name = "Santhosh Thottingal", email = "santhosh.thottingal@gmail.com"}]
{sentencex-1.0.2 → sentencex-1.0.4}/src/constants.rs

@@ -26,7 +26,7 @@ pub fn get_quote_pairs() -> HashMap<&'static str, &'static str> {
  lazy_static::lazy_static! {
  pub static ref PARENS_REGEX: Regex = Regex::new(r"[\((<{\[](?:[^\)\]}>)]|\\[\)\]}>)])*[\)\]}>)]").unwrap();
  pub static ref EMAIL_REGEX: Regex = Regex::new(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}").unwrap();
- pub static ref NUMBERED_REFERENCE_REGEX: Regex = Regex::new(r"^ ?(\[\d+])+").unwrap();
+ pub static ref NUMBERED_REFERENCE_REGEX: Regex = Regex::new(r"^(\s*\[\d+])+").unwrap();
  pub static ref SPACE_AFTER_SEPARATOR: Regex = Regex::new(r"^\s+").unwrap();
  pub static ref QUOTES_REGEX: Regex = {
  let quote_pairs = get_quote_pairs();
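
The only change in src/constants.rs is the numbered-reference pattern: the old regex allowed at most one leading space and required the bracketed numbers to be adjacent, so a run such as "[1][2] [3]" was only partially consumed, while the new one accepts whitespace before each reference. A minimal sketch of the difference, using the regex crate directly (this little test program is illustrative only, not part of the package):

    use regex::Regex;

    fn main() {
        // Old and new NUMBERED_REFERENCE_REGEX patterns, compared side by side.
        let old_re = Regex::new(r"^ ?(\[\d+])+").unwrap();
        let new_re = Regex::new(r"^(\s*\[\d+])+").unwrap();

        // Trailing references after a sentence terminator, as in the new en.txt case.
        let tail = " [1][2] [3] It is colorless, odorless, tasteless and highly flammable";

        // Old pattern stops at the space before "[3]".
        assert_eq!(old_re.find(tail).map(|m| m.as_str()), Some(" [1][2]"));
        // New pattern consumes the whole run of references.
        assert_eq!(new_re.find(tail).map(|m| m.as_str()), Some(" [1][2] [3]"));
    }
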
{sentencex-1.0.2 → sentencex-1.0.4}/src/languages/language.rs

@@ -42,9 +42,8 @@ pub trait Language {
  let mut boundaries = Vec::with_capacity(estimated_sentences);

  // Split by paragraph breaks (one or more newlines with optional whitespace)
- let para_split_re = Regex::new(r"\n[\r\s]*\n").unwrap();
+ let para_split_re = Regex::new(r"\n[\r]*\n").unwrap();
  let paragraphs: Vec<&str> = para_split_re.split(text).collect();
-
  // Pre-calculate all paragraph offsets in one pass
  let mut paragraph_offsets = Vec::with_capacity(paragraphs.len());
  let mut current_offset = 0;
{sentencex-1.0.2 → sentencex-1.0.4}/src/lib.rs

@@ -107,7 +107,7 @@ fn chunk_text(text: &str, chunk_size: usize) -> Vec<&str> {
  let mut chunks = Vec::new();

  // Split by paragraph breaks (one or more newlines with optional whitespace)
- let re = Regex::new(r"\n[\r\s]*\n").unwrap();
+ let re = Regex::new(r"\n[\r]*\n").unwrap();

  // Get paragraph parts and their positions
  let mut paragraphs = Vec::new();
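
The same paragraph-break pattern is tightened in both src/languages/language.rs and src/lib.rs: dropping `\s` (which also matches spaces, tabs and further newlines) means a line that contains only spaces or tabs no longer counts as a paragraph boundary. A small illustrative comparison, again using the regex crate outside the library itself:

    use regex::Regex;

    fn main() {
        let old_break = Regex::new(r"\n[\r\s]*\n").unwrap();
        let new_break = Regex::new(r"\n[\r]*\n").unwrap();

        // A whitespace-only line between two text lines, as found in
        // HTML-extracted fixtures like the new paris.txt.
        let text = "Paris (\n \t \nFrench pronunciation:";

        // Old pattern treats the whitespace-only line as a paragraph break.
        assert_eq!(old_break.split(text).count(), 2);
        // New pattern keeps all three lines in one paragraph.
        assert_eq!(new_break.split(text).count(), 1);

        // A genuinely blank line still splits under both patterns.
        assert_eq!(new_break.split("One paragraph.\n\nAnother paragraph.").count(), 2);
    }
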
{sentencex-1.0.2 → sentencex-1.0.4}/src/main.rs

@@ -59,7 +59,7 @@ fn main() {
  let sentences = segment(&cli.language, &text);
  let elapsed = start_time.elapsed();
  for sentence in sentences.iter() {
- println!("{}", sentence);
+ println!("* {}", sentence);
  }

  eprintln!("Time taken for segment(): {:?}", elapsed);
{sentencex-1.0.2 → sentencex-1.0.4}/tests/en.txt

@@ -259,6 +259,11 @@ Hydrogen is a gas. [1] It is colorless, odorless, tasteless and highly flammable
  Hydrogen is a gas. [1]
  It is colorless, odorless, tasteless and highly flammable
  ===
+ Hydrogen is a gas. [1][2] [3] It is colorless, odorless, tasteless and highly flammable
+ ---
+ Hydrogen is a gas. [1][2] [3]
+ It is colorless, odorless, tasteless and highly flammable
+ ===
  This function (see. section 4.2) is important. Let's continue.
  ---
  This function (see. section 4.2) is important.
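
The added tests/en.txt case pins down the broadened reference pattern end to end: the whitespace-separated references stay attached to the first sentence. A hedged sketch of feeding that input to the crate's segment function, mirroring the call and print loop in src/main.rs (the exact signature and return type of segment are assumed from that file, not verified against the published API):

    use sentencex::segment;

    fn main() {
        // The new fixture: several whitespace-separated references after the period.
        let text = "Hydrogen is a gas. [1][2] [3] It is colorless, odorless, tasteless and highly flammable";

        // Same call pattern as the CLI in src/main.rs.
        let sentences = segment("en", text);
        for sentence in sentences.iter() {
            println!("* {}", sentence);
        }

        // Expected output per the fixture:
        // * Hydrogen is a gas. [1][2] [3]
        // * It is colorless, odorless, tasteless and highly flammable
    }
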