pragmatic_tokenizer 3.0.3 → 3.0.4
- checksums.yaml +4 -4
- data/README.md +1 -1
- data/lib/pragmatic_tokenizer/full_stop_separator.rb +9 -10
- data/lib/pragmatic_tokenizer/languages/english.rb +1 -1
- data/lib/pragmatic_tokenizer/post_processor.rb +2 -2
- data/lib/pragmatic_tokenizer/tokenizer.rb +6 -6
- data/lib/pragmatic_tokenizer/version.rb +1 -1
- data/spec/languages/deutsch_spec.rb +1 -1
- data/spec/languages/english_spec.rb +52 -0
- data/spec/performance_spec.rb +1 -0
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 00a9402728c3f1aeb845232ec6dfa1abc536bca9
+  data.tar.gz: da25dbf18f4ebf7a780a4b0a862976b20d01eb39
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ced5cc4b72bba8a4d277c7453bda8aed9f82a35477a1e047348fa5f198e50647c6748bbff29484d2b002f28106afd8f5ce432c4d49142234dbf8370402f07004
+  data.tar.gz: ef1988c4325ea1bce415685848f5d666b79ddbde26680f6b9712cdbbffd8558573733ece8c57b29c739dcce335565b8f7dbf5c3b68e213585afb0d7644c9c06d
data/README.md
CHANGED
@@ -103,7 +103,7 @@ options = {
 
 ##### `filter_languages`
   **default** = `nil`
-  - You can pass an array of languages of which you would like to process abbreviations, stop words and contractions. This language can be indepedent of the language of the string you are tokenizing (for example your
+  - You can pass an array of languages of which you would like to process abbreviations, stop words and contractions. This language can be indepedent of the language of the string you are tokenizing (for example your text might be German but contain some English stop words that you want to remove). If you supply your own abbreviations, stop words or contractions they will be merged with the abbreviations, stop words and contractions of any languages you add in this option. You can pass an array of symbols or strings (i.e. `[:en, :de]` or `['en', 'de']`)
 
 <hr>
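A minimal usage sketch of the `filter_languages` option described above (the `Tokenizer` API follows the specs later in this diff; the sample sentence is invented, and the exact token output depends on each language's stop word list):

```ruby
require 'pragmatic_tokenizer'

# German text that also contains English filler words.
text = "Das ist ein Satz and this is a sentence."

pt = PragmaticTokenizer::Tokenizer.new(
  language:          'de',       # language of the text itself
  filter_languages:  [:en, :de], # also apply the English lists
  remove_stop_words: true
)
pt.tokenize(text) # stop words from both languages are removed
```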
data/lib/pragmatic_tokenizer/full_stop_separator.rb
CHANGED
@@ -5,11 +5,10 @@ module PragmaticTokenizer
   # periods that are part of an abbreviation
   class FullStopSeparator
 
-    REGEXP_ENDS_WITH_DOT
-    REGEXP_ONLY_LETTERS
-
-
-    DOT = '.'.freeze
+    REGEXP_ENDS_WITH_DOT = /\A(.*\w)\.\z/
+    REGEXP_ONLY_LETTERS = /\A[a-z]\z/i
+    REGEXP_ABBREVIATION = /[a-z](?:\.[a-z])+\z/i
+    DOT = '.'.freeze
 
     def initialize(tokens:, abbreviations:, downcase:)
       @tokens = tokens
@@ -30,7 +29,7 @@ module PragmaticTokenizer
       @tokens.each_with_index do |token, position|
         if @tokens[position + 1] && token =~ REGEXP_ENDS_WITH_DOT
           match = Regexp.last_match(1)
-          if
+          if abbreviation?(match)
             @cleaned_tokens += [match, DOT]
             next
           end
@@ -39,11 +38,11 @@ module PragmaticTokenizer
        end
      end
 
-    def
-      !
+    def abbreviation?(token)
+      !defined_abbreviation?(token) && token !~ REGEXP_ONLY_LETTERS && token !~ REGEXP_ABBREVIATION
     end
 
-    def
+    def defined_abbreviation?(token)
       @abbreviations.include?(inverse_case(token))
     end
 
@@ -53,7 +52,7 @@ module PragmaticTokenizer
 
     def replace_last_token
       last_token = @cleaned_tokens[-1]
-      return if
+      return if defined_abbreviation?(last_token.chomp(DOT)) || last_token !~ REGEXP_ENDS_WITH_DOT
       @cleaned_tokens[-1] = Regexp.last_match(1)
       @cleaned_tokens << DOT
     end
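To make the renamed predicates above concrete, here is a rough standalone sketch of the same decision (the hypothetical `separate_dot?` stands in for the gem's `abbreviation?`, and a plain `downcase` stands in for its `inverse_case`):

```ruby
require 'set'

REGEXP_ONLY_LETTERS = /\A[a-z]\z/i            # a single letter such as "J"
REGEXP_ABBREVIATION = /[a-z](?:\.[a-z])+\z/i  # a dotted form such as "u.s.a"

# True when the word before a trailing dot is NOT an abbreviation,
# in which case the dot is split off as its own token.
def separate_dot?(token, abbreviations)
  !abbreviations.include?(token.downcase) &&
    token !~ REGEXP_ONLY_LETTERS &&
    token !~ REGEXP_ABBREVIATION
end

abbreviations = Set.new(%w[mr etc])
separate_dot?("sentence", abbreviations) # => true  (becomes "sentence" + ".")
separate_dot?("mr",       abbreviations) # => false (dot stays attached)
separate_dot?("u.s.a",    abbreviations) # => false (dotted abbreviation)
```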
data/lib/pragmatic_tokenizer/languages/english.rb
CHANGED
@@ -3,7 +3,7 @@ module PragmaticTokenizer
   module English
     include Languages::Common
ABBREVIATIONS = Set.new(["adj", "adm", "adv", "al", "ala", "alta", "apr", "arc", "ariz", "ark", "art", "assn", "asst", "attys", "aug", "ave", "bart", "bld", "bldg", "blvd", "brig", "bros", "btw", "cal", "calif", "capt", "cl", "cmdr", "co", "col", "colo", "comdr", "con", "conn", "corp", "cpl", "cres", "ct", "d.phil", "dak", "dec", "del", "dept", "det", "dist", "dr", "dr.phil", "dr.philos", "drs", "e.g", "ens", "esp", "esq", "etc", "exp", "expy", "ext", "feb", "fed", "fla", "ft", "fwy", "fy", "ga", "gen", "gov", "hon", "hosp", "hr", "hway", "hwy", "i.e", "i.b.m", "ia", "id", "ida", "ill", "inc", "ind", "ing", "insp", "jan", "jr", "jul", "jun", "kan", "kans", "ken", "ky", "la", "lt", "ltd", "maj", "man", "mar", "mass", "may", "md", "me", "med", "messrs", "mex", "mfg", "mich", "min", "minn", "miss", "mlle", "mm", "mme", "mo", "mont", "mr", "mrs", "ms", "msgr", "mssrs", "mt", "mtn", "neb", "nebr", "nev", "no", "nos", "nov", "nr", "oct", "ok", "okla", "ont", "op", "ord", "ore", "p", "pa", "pd", "pde", "penn", "penna", "pfc", "ph", "ph.d", "pl", "plz", "pp", "prof", "pvt", "que", "rd", "ref", "rep", "reps", "res", "rev", "rt", "sask", "sec", "sen", "sens", "sep", "sept", "sfc", "sgt", "sr", "st", "supt", "surg", "tce", "tenn", "tex", "u.s", "u.s.a", "univ", "usafa", "ut", "v", "va", "ver", "vs", "vt", "wash", "wis", "wisc", "wy", "wyo", "yuk"]).freeze
-    STOP_WORDS = Set.new(["&#;f", "'ll", "'ve", "+//", "-/+", "</li>", "</p>", "</td>", "<br", "<br/>", "<br/><br/>", "<li>", "<p>", "<sup></sup>", "<sup></sup></li>", "<td", "<td>", "a", "a's", "able", "about", "above", "abroad", "abst", "accordance", "according", "accordingly", "across", "act", "actually", "added", "adj", "adopted", "affected", "affecting", "affects", "after", "afterwards", "again", "against", "ago", "ah", "ahead", "ain't", "all", "allow", "allows", "almost", "alone", "along", "alongside", "already", "also", "although", "always", "am", "amid", "amidst", "among", "amongst", "amoungst", "amount", "an", "and", "announce", "another", "any", "anybody", "anyhow", "anymore", "anyone", "anything", "anyway", "anyways", "anywhere", "apart", "apparently", "appear", "appreciate", "appropriate", "approximately", "are", "aren", "aren't", "arent", "arise", "around", "as", "aside", "ask", "asking", "associated", "at", "auth", "available", "away", "awfully", "b", "back", "backward", "backwards", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "begin", "beginning", "beginnings", "begins", "behind", "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond", "bill", "biol", "both", "bottom", "brief", "briefly", "but", "by", "c", "c'mon", "c's", "ca", "call", "came", "can", "can't", "cannot", "cant", "caption", "cause", "causes", "certain", "certainly", "changes", "class=", "clearly", "co", "co.", "com", "come", "comes", "computer", "con", "concerning", "consequently", "consider", "considering", "contain", "containing", "contains", "corresponding", "could", "couldn't", "couldnt", "course", "cry", "currently", "d", "dare", "daren't", "date", "de", "definitely", "describe", "described", "despite", "detail", "did", "didn't", "different", "directly", "do", "does", "doesn't", "doing", "don't", "done", "down", "downwards", "due", "during", "e", "each", "ed", "edu", "effect", "eg", "eight", "eighty", "either", "eleven", "else", "elsewhere", "empty", "end", "ending", "enough", "entirely", "especially", "et", "et-al", "etc", "even", "ever", "evermore", "every", "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "f", "fairly", "far", "farther", "few", "fewer", "ff", "fifteen", "fifth", "fify", "fill", "find", "fire", "first", "five", "fix", "followed", "following", "follows", "for", "forever", "former", "formerly", "forth", "forty", "forward", "found", "four", "from", "front", "full", "further", "furthermore", "g", "gave", "get", "gets", "getting", "give", "given", "gives", "giving", "go", "goes", "going", "gone", "got", "gotten", "greetings", "h", "had", "hadn't", "half", "happens", "hardly", "has", "hasn't", "hasnt", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "hed", "hello", "help", "hence", "her", "here", "here's", "hereafter", "hereby", "herein", "heres", "hereupon", "hers", "herself", "hes", "hi", "hid", "him", "himself", "his", "hither", "home", "hopefully", "how", "how's", "howbeit", "however", "http", "https", "hundred", "i", "i'd", "i'll", "i'm", "i've", "id", "ie", "if", "ignored", "im", "immediate", "immediately", "importance", "important", "in", "inasmuch", "inc", "inc.", "indeed", "index", "indicate", "indicated", "indicates", "information", "ing", "inner", "inside", "insofar", "instead", "interest", "into", "invention", "inward", "is", "isn't", "it", "it'd", "it'll", "it's", "itd", "its", "itself", "j", "just", "k", "keep", "keeps", "kept", "keys", "kg", "km",
"know", "known", "knows", "l", "largely", "last", "lately", "later", "latter", "latterly", "least", "less", "lest", "let", "let's", "lets", "like", "liked", "likely", "likewise", "line", "little", "look", "looking", "looks", "low", "lower", "ltd", "m", "made", "mainly", "make", "makes", "many", "may", "maybe", "mayn't", "me", "mean", "means", "meantime", "meanwhile", "merely", "mg", "might", "mightn't", "mill", "million", "mine", "minus", "miss", "ml", "more", "moreover", "most", "mostly", "move", "mr", "mrs", "much", "mug", "must", "mustn't", "my", "myself", "n", "na", "name", "namely", "nay", "nd", "near", "nearly", "necessarily", "necessary", "need", "needn't", "needs", "neither", "never", "neverf", "neverless", "nevertheless", "new", "next", "nine", "ninety", "no", "no-one", "nobody", "non", "none", "nonetheless", "noone", "nor", "normally", "nos", "not", "noted", "nothing", "notwithstanding", "novel", "now", "nowhere", "o", "obtain", "obtained", "obviously", "of", "off", "often", "oh", "ok", "okay", "old", "omitted", "on", "once", "one", "one's", "ones", "only", "onto", "opposite", "or", "ord", "other", "others", "otherwise", "ought", "oughtn't", "our", "ours", "ours", "ourselves", "out", "outside", "over", "overall", "owing", "own", "p", "page", "pages", "part", "particular", "particularly", "past", "per", "perhaps", "placed", "please", "plus", "poorly", "possible", "possibly", "potentially", "pp", "predominantly", "present", "presumably", "previously", "primarily", "probably", "promptly", "proud", "provided", "provides", "put", "q", "que", "quickly", "quite", "qv", "r", "ran", "rather", "rd", "re", "readily", "really", "reasonably", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively", "research", "respectively", "resulted", "resulting", "results", "right", "round", "run", "s", "said", "same", "saw", "say", "saying", "says", "sec", "second", "secondly", "section", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sensible", "sent", "serious", "seriously", "seven", "several", "shall", "shan't", "she", "she'd", "she'll", "she's", "shed", "shes", "should", "shouldn't", "show", "showed", "shown", "showns", "shows", "side", "significant", "significantly", "similar", "similarly", "since", "sincere", "six", "sixty", "slightly", "so", "some", "somebody", "someday", "somehow", "someone", "somethan", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "specifically", "specified", "specify", "specifying", "state", "states", "still", "stop", "strongly", "sub", "substantially", "successfully", "such", "sufficiently", "suggest", "sup", "sure", "system", "t", "t's", "take", "taken", "taking", "tell", "ten", "tends", "th", "than", "thank", "thanks", "thanx", "that", "that'll", "that's", "that've", "thats", "the", "their", "theirs", "them", "themselves", "then", "thence", "there", "there'd", "there'll", "there're", "there's", "there've", "thereafter", "thereby", "thered", "therefore", "therein", "thereof", "therere", "theres", "thereto", "thereupon", "these", "they", "they'd", "they'll", "they're", "they've", "theyd", "theyre", "thick", "thin", "thing", "things", "think", "third", "thirty", "this", "thorough", "thoroughly", "those", "thou", "though", "thoughh", "thousand", "three", "throug", "through", "throughout", "thru", "thus", "til", "till", "tip", "to", "together", "too", "took", "top", "toward", "towards", "tried", "tries", "truly", "try", "trying", "ts", "twelve", "twenty", "twice", 
"two", "u", "un", "under", "underneath", "undoing", "unfortunately", "unless", "unlike", "unlikely", "until", "unto", "up", "upon", "ups", "upwards", "us", "use", "used", "useful", "usefully", "usefulness", "uses", "using", "usually", "uucp", "v", "value", "various", "versus", "very", "via", "viz", "vol", "vols", "vs", "w", "want", "wants", "was", "wasn't", "way", "we", "we'd", "we'll", "we're", "we've", "wed", "welcome", "well", "went", "were", "weren't", "what", "what'll", "what's", "what're", "what've", "whatever", "whats", "when", "when's", "whence", "whenever", "where", "where's", "whereafter", "whereas", "whereby", "wherein", "wheres", "whereupon", "wherever", "whether", "which", "whichever", "while", "whilst", "whim", "whither", "who", "who'd", "who'll", "who's", "whod", "whoever", "whole", "whom", "whomever", "whos", "whose", "why", "why's", "widely", "will", "willing", "wish", "with", "within", "without", "won't", "wonder", "word", "words", "world", "would", "wouldn't", "www", "x", "y", "yes", "yet", "you", "you'd", "you'll", "you're", "you've", "youd", "your", "youre", "yours", "yourself", "yourselves", "z", "zero"]).freeze
+    STOP_WORDS = Set.new(["i.e.", "e.g.", "&#;f", "'ll", "'ve", "+//", "-/+", "</li>", "</p>", "</td>", "<br", "<br/>", "<br/><br/>", "<li>", "<p>", "<sup></sup>", "<sup></sup></li>", "<td", "<td>", "a", "a's", "able", "about", "above", "abroad", "abst", "accordance", "according", "accordingly", "across", "act", "actually", "added", "adj", "adopted", "affected", "affecting", "affects", "after", "afterwards", "again", "against", "ago", "ah", "ahead", "ain't", "all", "allow", "allows", "almost", "alone", "along", "alongside", "already", "also", "although", "always", "am", "amid", "amidst", "among", "amongst", "amoungst", "amount", "an", "and", "announce", "another", "any", "anybody", "anyhow", "anymore", "anyone", "anything", "anyway", "anyways", "anywhere", "apart", "apparently", "appear", "appreciate", "appropriate", "approximately", "are", "aren", "aren't", "arent", "arise", "around", "as", "aside", "ask", "asking", "associated", "at", "auth", "available", "away", "awfully", "b", "back", "backward", "backwards", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "begin", "beginning", "beginnings", "begins", "behind", "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond", "bill", "biol", "both", "bottom", "brief", "briefly", "but", "by", "c", "c'mon", "c's", "ca", "call", "came", "can", "can't", "cannot", "cant", "caption", "cause", "causes", "certain", "certainly", "changes", "class=", "clearly", "co", "co.", "com", "come", "comes", "computer", "con", "concerning", "consequently", "consider", "considering", "contain", "containing", "contains", "corresponding", "could", "couldn't", "couldnt", "course", "cry", "currently", "d", "dare", "daren't", "date", "de", "definitely", "describe", "described", "despite", "detail", "did", "didn't", "different", "directly", "do", "does", "doesn't", "doing", "don't", "done", "down", "downwards", "due", "during", "e", "each", "ed", "edu", "effect", "eg", "eight", "eighty", "either", "eleven", "else", "elsewhere", "empty", "end", "ending", "enough", "entirely", "especially", "et", "et-al", "etc", "even", "ever", "evermore", "every", "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "f", "fairly", "far", "farther", "few", "fewer", "ff", "fifteen", "fifth", "fify", "fill", "find", "fire", "first", "five", "fix", "followed", "following", "follows", "for", "forever", "former", "formerly", "forth", "forty", "forward", "found", "four", "from", "front", "full", "further", "furthermore", "g", "gave", "get", "gets", "getting", "give", "given", "gives", "giving", "go", "goes", "going", "gone", "got", "gotten", "greetings", "h", "had", "hadn't", "half", "happens", "hardly", "has", "hasn't", "hasnt", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "hed", "hello", "help", "hence", "her", "here", "here's", "hereafter", "hereby", "herein", "heres", "hereupon", "hers", "herself", "hes", "hi", "hid", "him", "himself", "his", "hither", "home", "hopefully", "how", "how's", "howbeit", "however", "http", "https", "hundred", "i", "i'd", "i'll", "i'm", "i've", "id", "ie", "if", "ignored", "im", "immediate", "immediately", "importance", "important", "in", "inasmuch", "inc", "inc.", "indeed", "index", "indicate", "indicated", "indicates", "information", "ing", "inner", "inside", "insofar", "instead", "interest", "into", "invention", "inward", "is", "isn't", "it", "it'd", "it'll", "it's", "itd", "its", "itself", "j", "just", "k", "keep", "keeps", "kept",
"keys", "kg", "km", "know", "known", "knows", "l", "largely", "last", "lately", "later", "latter", "latterly", "least", "less", "lest", "let", "let's", "lets", "like", "liked", "likely", "likewise", "line", "little", "look", "looking", "looks", "low", "lower", "ltd", "m", "made", "mainly", "make", "makes", "many", "may", "maybe", "mayn't", "me", "mean", "means", "meantime", "meanwhile", "merely", "mg", "might", "mightn't", "mill", "million", "mine", "minus", "miss", "ml", "more", "moreover", "most", "mostly", "move", "mr", "mrs", "much", "mug", "must", "mustn't", "my", "myself", "n", "na", "name", "namely", "nay", "nd", "near", "nearly", "necessarily", "necessary", "need", "needn't", "needs", "neither", "never", "neverf", "neverless", "nevertheless", "new", "next", "nine", "ninety", "no", "no-one", "nobody", "non", "none", "nonetheless", "noone", "nor", "normally", "nos", "not", "noted", "nothing", "notwithstanding", "novel", "now", "nowhere", "o", "obtain", "obtained", "obviously", "of", "off", "often", "oh", "ok", "okay", "old", "omitted", "on", "once", "one", "one's", "ones", "only", "onto", "opposite", "or", "ord", "other", "others", "otherwise", "ought", "oughtn't", "our", "ours", "ours", "ourselves", "out", "outside", "over", "overall", "owing", "own", "p", "page", "pages", "part", "particular", "particularly", "past", "per", "perhaps", "placed", "please", "plus", "poorly", "possible", "possibly", "potentially", "pp", "predominantly", "present", "presumably", "previously", "primarily", "probably", "promptly", "proud", "provided", "provides", "put", "q", "que", "quickly", "quite", "qv", "r", "ran", "rather", "rd", "re", "readily", "really", "reasonably", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively", "research", "respectively", "resulted", "resulting", "results", "right", "round", "run", "s", "said", "same", "saw", "say", "saying", "says", "sec", "second", "secondly", "section", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sensible", "sent", "serious", "seriously", "seven", "several", "shall", "shan't", "she", "she'd", "she'll", "she's", "shed", "shes", "should", "shouldn't", "show", "showed", "shown", "showns", "shows", "side", "significant", "significantly", "similar", "similarly", "since", "sincere", "six", "sixty", "slightly", "so", "some", "somebody", "someday", "somehow", "someone", "somethan", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "specifically", "specified", "specify", "specifying", "state", "states", "still", "stop", "strongly", "sub", "substantially", "successfully", "such", "sufficiently", "suggest", "sup", "sure", "system", "t", "t's", "take", "taken", "taking", "tell", "ten", "tends", "th", "than", "thank", "thanks", "thanx", "that", "that'll", "that's", "that've", "thats", "the", "their", "theirs", "them", "themselves", "then", "thence", "there", "there'd", "there'll", "there're", "there's", "there've", "thereafter", "thereby", "thered", "therefore", "therein", "thereof", "therere", "theres", "thereto", "thereupon", "these", "they", "they'd", "they'll", "they're", "they've", "theyd", "theyre", "thick", "thin", "thing", "things", "think", "third", "thirty", "this", "thorough", "thoroughly", "those", "thou", "though", "thoughh", "thousand", "three", "throug", "through", "throughout", "thru", "thus", "til", "till", "tip", "to", "together", "too", "took", "top", "toward", "towards", "tried", "tries", "truly", "try", "trying", "ts", "twelve", 
"twenty", "twice", "two", "u", "un", "under", "underneath", "undoing", "unfortunately", "unless", "unlike", "unlikely", "until", "unto", "up", "upon", "ups", "upwards", "us", "use", "used", "useful", "usefully", "usefulness", "uses", "using", "usually", "uucp", "v", "value", "various", "versus", "very", "via", "viz", "vol", "vols", "vs", "w", "want", "wants", "was", "wasn't", "way", "we", "we'd", "we'll", "we're", "we've", "wed", "welcome", "well", "went", "were", "weren't", "what", "what'll", "what's", "what're", "what've", "whatever", "whats", "when", "when's", "whence", "whenever", "where", "where's", "whereafter", "whereas", "whereby", "wherein", "wheres", "whereupon", "wherever", "whether", "which", "whichever", "while", "whilst", "whim", "whither", "who", "who'd", "who'll", "who's", "whod", "whoever", "whole", "whom", "whomever", "whos", "whose", "why", "why's", "widely", "will", "willing", "wish", "with", "within", "without", "won't", "wonder", "word", "words", "world", "would", "wouldn't", "www", "x", "y", "yes", "yet", "you", "you'd", "you'll", "you're", "you've", "youd", "your", "youre", "yours", "yourself", "yourselves", "z", "zero"]).freeze
     # N.B. Some English contractions are ambigous (i.e. "she's" can mean "she has" or "she is").
     # Pragmatic Tokenizer will return the most frequently appearing expanded contraction. Regardless, this should
     # be rather insignificant as in most cases one is probably removing stop words.
data/lib/pragmatic_tokenizer/post_processor.rb
CHANGED
@@ -31,12 +31,12 @@ module PragmaticTokenizer
     end
 
     def post_process
-      separate_ending_punctuation(
+      separate_ending_punctuation(post_process_punctuation)
     end
 
     private
 
-    def
+    def post_process_punctuation
       separated = separate_ending_punctuation(full_stop_separated_tokens)
       procs = [unified1, split_unknown_period1, split_unknown_period2, split_emoji]
       procs.reduce(separated) { |a, e| a.flat_map(&e) }
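The extracted `post_process_punctuation` threads the token array through a list of procs, where each proc maps one token to one or more tokens. A self-contained illustration of that `reduce`/`flat_map` pattern (the two procs here are invented stand-ins, not the gem's `unified1` and friends):

```ruby
downcase   = ->(t) { t.downcase }    # 1 token -> 1 token
split_dash = ->(t) { t.split("-") }  # 1 token -> possibly many

tokens = ["Hello", "e-mail"]
procs  = [downcase, split_dash]

# Each pass flat_maps the whole token array through one proc.
procs.reduce(tokens) { |a, e| a.flat_map(&e) }
# => ["hello", "e", "mail"]
```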
data/lib/pragmatic_tokenizer/tokenizer.rb
CHANGED
@@ -20,7 +20,8 @@ module PragmaticTokenizer
     REGEX_DOMAIN = /(\s+|\A)[a-z0-9]{2,}([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,6}(:[0-9]{1,5})?(\/.*)?/ix
     REGEX_URL = /(http|https)(\.|:)/
     REGEX_HYPHEN = /\-/
-
+    REGEX_LONG_WORD = /\-|\_/
+    REGEXP_SPLIT_CHECK = /@|@|(http)/
     REGEX_CONTRACTIONS = /[‘’‚‛‹›'´`]/
     REGEX_APOSTROPHE_S = /['’`́]s$/
     REGEX_EMAIL = /\S+(@|@)\S+\.\S+/
@@ -126,7 +127,7 @@ module PragmaticTokenizer
       # TODO: why do we treat stop words differently than abbreviations and contractions? (we don't use @language_module::STOP_WORDS when passing @filter_languages)
       @contractions.merge!(@language_module::CONTRACTIONS) if @contractions.empty?
       @abbreviations += @language_module::ABBREVIATIONS if @abbreviations.empty?
-      @stop_words += @language_module::STOP_WORDS if @stop_words.empty?
+      @stop_words += @language_module::STOP_WORDS if @stop_words.empty?
 
       @filter_languages.each do |lang|
         language = Languages.get_language_by_code(lang)
@@ -286,12 +287,11 @@ module PragmaticTokenizer
 
     def split_long_words!
       @tokens = @tokens
-                .flat_map { |t| t.length > @long_word_split ? t.split(
-                .flat_map { |t| t.length > @long_word_split ? t.split(REGEX_UNDERSCORE) : t }
+                .flat_map { |t| (t.length > @long_word_split && t !~ REGEXP_SPLIT_CHECK ) ? t.split(REGEX_LONG_WORD) : t }
     end
 
-    def chosen_case(
-      @downcase ? Unicode.downcase(
+    def chosen_case(txt)
+      @downcase ? Unicode.downcase(txt) : txt
     end
 
     def inverse_case(token)
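Taken together, the new `REGEXP_SPLIT_CHECK` guard and the reworked `split_long_words!` split long tokens on hyphens or underscores unless they look like a handle, an email address, or a URL. A standalone sketch of that logic (`split_long_words` is a hypothetical free function; the gem mutates `@tokens` in place, and its `REGEXP_SPLIT_CHECK` appears to also match a full-width at sign):

```ruby
REGEX_LONG_WORD    = /\-|\_/
REGEXP_SPLIT_CHECK = /@|http/  # simplified; the gem also covers a full-width "@"

def split_long_words(tokens, threshold)
  tokens.flat_map do |t|
    # Split only tokens over the threshold that are not handles/emails/URLs.
    (t.length > threshold && t !~ REGEXP_SPLIT_CHECK) ? t.split(REGEX_LONG_WORD) : t
  end
end

split_long_words(["main-categories", "@john_doe", "http://test.com/some_path"], 5)
# => ["main", "categories", "@john_doe", "http://test.com/some_path"]
```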
data/spec/languages/deutsch_spec.rb
CHANGED
@@ -175,7 +175,7 @@ describe PragmaticTokenizer do
         remove_stop_words: true,
         language: 'de'
       )
-      expect(pt.tokenize(text)).to eq(["
+      expect(pt.tokenize(text)).to eq(["lehrer_in", "schüler_innen", ".", "english", "."])
     end
 
     it 'removes English and German stopwords' do
data/spec/languages/english_spec.rb
CHANGED
@@ -443,6 +443,17 @@ describe PragmaticTokenizer do
       end
     end
 
+    context 'option (downcase)' do
+      it 'does not downcase URLs' do
+        skip "NOT IMPLEMENTED"
+        text = "Here are some domains and urls GOOGLE.com http://test.com/UPPERCASE."
+        pt = PragmaticTokenizer::Tokenizer.new(
+          downcase: :true
+        )
+        expect(pt.tokenize(text)).to eq(["here", "are", "some", "domains", "and", "urls", "GOOGLE.com", "http://test.com/UPPERCASE", "."])
+      end
+    end
+
     context 'option (domains)' do
       it 'tokenizes a string #001' do
         text = "Here are some domains and urls google.com https://www.google.com www.google.com."
@@ -485,6 +496,38 @@ describe PragmaticTokenizer do
     end
 
     context 'option (long_word_split)' do
+      it 'should not split twitter handles' do
+        text = "@john_doe"
+        pt = PragmaticTokenizer::Tokenizer.new(
+          long_word_split: 5
+        )
+        expect(pt.tokenize(text)).to eq(["@john_doe"])
+      end
+
+      it 'should not split emails' do
+        text = "john_doe@something.com"
+        pt = PragmaticTokenizer::Tokenizer.new(
+          long_word_split: 5
+        )
+        expect(pt.tokenize(text)).to eq(["john_doe@something.com"])
+      end
+
+      it 'should not split emails 2' do
+        text = "john_doe@something.com"
+        pt = PragmaticTokenizer::Tokenizer.new(
+          long_word_split: 5
+        )
+        expect(pt.tokenize(text)).to eq(["john_doe@something.com"])
+      end
+
+      it 'should not split urls' do
+        text = "http://test.com/some_path"
+        pt = PragmaticTokenizer::Tokenizer.new(
+          long_word_split: 5
+        )
+        expect(pt.tokenize(text)).to eq(["http://test.com/some_path"])
+      end
+
       it 'tokenizes a string #001' do
         text = "Some main-categories of the mathematics-test have sub-examples that most 14-year olds can't answer, therefor the implementation-instruction made in the 1990-years needs to be revised."
         pt = PragmaticTokenizer::Tokenizer.new(
@@ -1041,6 +1084,15 @@ describe PragmaticTokenizer do
       expect(pt.tokenize(text)).to eq(["short", "sentence", "explanations", "."])
     end
 
+    it 'removes stop words 2' do
+      text = 'This is a short sentence with explanations and stop words i.e. is a stop word as so is e.g. I think.'
+      pt = PragmaticTokenizer::Tokenizer.new(
+        language: 'en',
+        remove_stop_words: true
+      )
+      expect(pt.tokenize(text)).to eq(["short", "sentence", "explanations", "."])
+    end
+
     it 'removes user-supplied stop words' do
       text = 'This is a short sentence with explanations and stop words.'
       pt = PragmaticTokenizer::Tokenizer.new(
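The `"i.e."`/`"e.g."` entries added to the English stop word list are what this new spec exercises; the same behaviour from user code (input and expected output taken verbatim from the spec above):

```ruby
pt = PragmaticTokenizer::Tokenizer.new(language: 'en', remove_stop_words: true)
pt.tokenize("This is a short sentence with explanations and stop words i.e. is a stop word as so is e.g. I think.")
# => ["short", "sentence", "explanations", "."]
```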
data/spec/performance_spec.rb
CHANGED
@@ -29,6 +29,7 @@ describe PragmaticTokenizer do
   # 24.2
   # 23.2
   # 11.6
+  # 12.0
   # it 'is fast? (long strings)' do
# string = "Hello World. My name is Jonas. What is your name? My name is Jonas IV Smith. There it is! I found it. My name is Jonas E. Smith. Please turn to p. 55. Were Jane and co. at the party? They closed the deal with Pitt, Briggs & Co. at noon. Let's ask Jane and co. They should know. They closed the deal with Pitt, Briggs & Co. It closed yesterday. I can't see Mt. Fuji from here. St. Michael's Church is on 5th st. near the light. That is JFK Jr.'s book. I visited the U.S.A. last year. I live in the E.U. How about you? I live in the U.S. How about you? I work for the U.S. Government in Virginia. I have lived in the U.S. for 20 years. She has $100.00 in her bag. She has $100.00. It is in her bag. He teaches science (He previously worked for 5 years as an engineer.) at the local University. Her email is Jane.Doe@example.com. I sent her an email. The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out. She turned to him, 'This is great.' she said. She turned to him, \"This is great.\" she said. She turned to him, \"This is great.\" She held the book out to show him. Hello!! Long time no see. Hello?? Who is there? Hello!? Is that you? Hello?! Is that you? 1.) The first item 2.) The second item 1.) The first item. 2.) The second item. 1) The first item 2) The second item 1) The first item. 2) The second item. 1. The first item 2. The second item 1. The first item. 2. The second item. • 9. The first item • 10. The second item ⁃9. The first item ⁃10. The second item a. The first item b. The second item c. The third list item This is a sentence\ncut off in the middle because pdf. It was a cold \nnight in the city. features\ncontact manager\nevents, activities\n You can find it at N°. 1026.253.553. That is where the treasure is. She works at Yahoo! in the accounting department. We make a good team, you and I. Did you see Albert I. Jones yesterday? Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .” \"Bohr [...] used the analogy of parallel stairways [...]\" (Smith 55). If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence. I never meant that.... She left the store. I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it. One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . ." * 1000
   #   puts "LENGTH: #{string.length}"
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pragmatic_tokenizer
 version: !ruby/object:Gem::Version
-  version: 3.0.3
+  version: 3.0.4
 platform: ruby
 authors:
 - Kevin S. Dias
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-02
+date: 2016-03-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: unicode