ruby_wordcram 2.0.0 → 2.0.1
- checksums.yaml +4 -4
- data/.travis.yml +9 -0
- data/CHANGELOG.md +2 -0
- data/README.md +2 -0
- data/Rakefile +1 -1
- data/acknowledgements.md +18 -0
- data/docs/_posts/2017-03-07-getting_started.md +1 -1
- data/docs/_posts/2017-03-07-under_the_hood.md +8 -3
- data/docs/_posts/2017-03-12-supported_languages.md +23 -0
- data/docs/about.md +2 -1
- data/lib/ruby_wordcram.rb +4 -3
- data/lib/ruby_wordcram/version.rb +2 -1
- data/pom.rb +13 -5
- data/pom.xml +33 -5
- data/ruby_wordcram.gemspec +2 -2
- data/src/cue/lang/Counter.java +97 -109
- data/src/cue/lang/NGramIterator.java +95 -90
- data/src/cue/lang/SentenceIterator.java +53 -48
- data/src/wordcram/WordAngler.java +1 -1
- metadata +9 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 5d472f8ac87f00a98bf9936e51ee22520775fd25
|
4
|
+
data.tar.gz: c2d9953b6a7ce9b0f4899e0912c666cb45ba2301
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: '08a6898e378a4069e8e2cbc380d22d488396b4e14f5a5448760e7bc66bcd30bb3286d9455d77777b7988b5d77566d1d68866b99d5a5a7c37dcee2b20ace3b553'
|
7
|
+
data.tar.gz: c8cb64646d9c30b45f67b97abb9d541d21dcd09bc8ac283e89cd452d89f5f8b4862ee8efec3230390ec6951ba711a23c00a1b60adc3b2fbfb2955114a5903664
|
data/.travis.yml
ADDED
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -1,5 +1,7 @@
|
|
1
1
|
# Ruby Word Cram
|
2
2
|
|
3
|
+
[![Gem Version](https://badge.fury.io/rb/ruby_wordcram.svg)](https://badge.fury.io/rb/ruby_wordcram) [![Travis CI](https://travis-ci.org/ruby-processing/WordCram.svg)](https://travis-ci.org/ruby-processing/WordCram)
|
4
|
+
|
3
5
|
A gem wrapper for the WordCram library
|
4
6
|
|
5
7
|
![example](https://monkstone.github.io/assets/wordcram.png)
|
data/Rakefile
CHANGED
data/acknowledgements.md
ADDED
@@ -0,0 +1,18 @@
|
|
1
|
+
### Acknowledgements
|
2
|
+
|
3
|
+
Surely the original inspiration for WordCram was Wordle, built by Jonathan Feinberg when at IBM. He has since released an important component, namely the cue.language component that I have included in ruby_wordcram (the only alterations I have made are to update the code to take advantage of jdk8 features); this part of the code remains © IBM. There are a number of people who contributed to the language elements, and Jonathan credits these [here][credits], but it is also probably worth reading his [faq][faq] if you are thinking of extending the language capabilities (I would be keen to add Welsh, which I think must be available somewhere in the wild). To read even more of Jonathan's musings and achievements, see his [blog][blog].
|
4
|
+
|
5
|
+
The vanilla processing version of Wordcram that I have incorporated here was created by [Dan Bernier][Dan], and currently it has only received the same treatment as the cue.language code, i.e. the java code has been updated from java-1.5 to java-1.8 (in particular, anonymous classes have become java-8 lambdas). It is possible that further changes will include tighter JRuby integration (but it works OK).
|
6
|
+
|
7
|
+
[jsoup][jsoup] is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods. This was pulled directly from maven central.
|
8
|
+
|
9
|
+
|
10
|
+
[credits]:http://www.wordle.net/credits
|
11
|
+
|
12
|
+
[faq]:http://www.wordle.net/faq
|
13
|
+
|
14
|
+
[blog]:http://mrfeinberg.com/
|
15
|
+
|
16
|
+
[Dan]:https://github.com/danbernier/WordCram
|
17
|
+
|
18
|
+
[jsoup]:https://jsoup.org/
|
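data/docs/_posts/2017-03-07-getting_started.md
CHANGED
data/docs/_posts/2017-03-07-under_the_hood.md
CHANGED
@@ -7,7 +7,7 @@ categories: wordcram update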
|
|
7
7
|
|
8
8
|
### The required libraries
|
9
9
|
|
10
|
-
- cue.language.jar
|
10
|
+
- cue.language (since version-2.0.0 compiled into WordCram.jar)
|
11
11
|
|
12
12
|
Created by Jonathan Feinberg
|
13
13
|
|
@@ -25,9 +25,14 @@ categories: wordcram update
|
|
25
25
|
- WordCram.jar
|
26
26
|
|
27
27
|
Created by Dan Bernier
|
28
|
-
WordCram lets you generate word clouds in Processing. It does the heavy lifting – text analysis, collision detection – for you, so you can focus on making your word clouds as beautiful, as revealing, or as silly as you like.
|
28
|
+
WordCram lets you generate word clouds in Processing. It does the heavy lifting – text analysis, collision detection – for you, so you can focus on making your word clouds as beautiful, as revealing, or as silly as you like. Since version 2.0.0 the java code has been updated from java-1.5 to java-1.8, and the java code from the cue.language has also been
|
29
|
+
included.
|
29
30
|
|
30
31
|
|
31
|
-
- jsoup-1.
|
32
|
+
- jsoup-1.10.2.jar (since version 2.0.0)
|
32
33
|
|
33
34
|
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
|
35
|
+
|
36
|
+
### The Build
|
37
|
+
|
38
|
+
The build depends on polyglot maven (access to processing core.jar, jsoup.jar is pulled from maven central)
|
data/docs/_posts/2017-03-12-supported_languages.md
ADDED
@@ -0,0 +1,23 @@
|
|
1
|
+
---
|
2
|
+
layout: post
|
3
|
+
title: "Supported Languages"
|
4
|
+
date: 2017-03-12 05:34:13
|
5
|
+
categories: wordcram update
|
6
|
+
permalink: /languages/
|
7
|
+
---
|
8
|
+
|
9
|
+
### Supported languages
|
10
|
+
|
11
|
+
| | | | | | |
|
12
|
+
|--------|----------|------ |---- |----- |---- |
|
13
|
+
|arabic | danish | finnish | hindi | polish | slovenian |
|
14
|
+
|armenian | dutch | french | hungarian | portuguese | spanish |
|
15
|
+
|catalan | english | german | italian | romanian | croatian |
|
16
|
+
|esperanto | greek | latin | russian | swedish | czech |
|
17
|
+
|farsi | hebrew | norwegian | slovak | turkish | - |
|
18
|
+
|
19
|
+
See cue.language on [github][github] by Jonathan Feinberg, but you could also clone this repo to add an extension (French has already been modified to increase the number of stop words by [C. Fuhrman][french] after Jean Veronis). It would be kind of neat to include some more non-English examples than my rather weak [Yin_Yang_mots.rb][mots].
|
20
|
+
|
21
|
+
[github]:https://github.com/jdf/cue.language
|
22
|
+
[french]:https://github.com/fuhrmanator
|
23
|
+
[mots]:https://github.com/ruby-processing/JRubyArt-examples/blob/master/external_library/gem/ruby_wordcram/yin_yang_mots.rb
|
data/docs/about.md
CHANGED
@@ -4,9 +4,10 @@ title: About
|
|
4
4
|
permalink: /about/
|
5
5
|
---
|
6
6
|
|
7
|
-
WordCram is a library for Processing by [Dan Bernier][dan]. WordCram lets you generate word clouds in Processing. It does the heavy lifting – text analysis, collision detection – for you, so you can focus on making your word clouds as beautiful, as revealing, or as silly as you like. [ruby_wordcram gem][gem] is a ruby wrapper for the java processing library Wordcram library by [Dan Bernier][dan]. If you create processing sketches using [JRubyArt][jruby_art] or [propane][propane], mostly all you need to do is `require 'ruby_wordcram'` to use the WordCram library.
|
7
|
+
WordCram is a library for Processing by [Dan Bernier][dan]. WordCram lets you generate word clouds in Processing. It does the heavy lifting – text analysis, collision detection – for you, so you can focus on making your word clouds as beautiful, as revealing, or as silly as you like. [ruby_wordcram gem][gem] is a ruby wrapper for the java processing library Wordcram library by [Dan Bernier][dan]. If you create processing sketches using [JRubyArt][jruby_art] or [propane][propane], mostly all you need to do is `require 'ruby_wordcram'` to use the WordCram library. Thanks to tokenizer code from Jonathan Feinberg (cue.language) Wordcram is able to extract words / sentences from text in several [languages][languages].
|
8
8
|
|
9
9
|
[jruby_art]: https://ruby-processing.github.io/index.html
|
10
10
|
[gem]:https://github.com/ruby-processing/WordCram/
|
11
11
|
[dan]:http://wordcram.org/
|
12
12
|
[propane]:https://ruby-processing.github.io/propane/
|
13
|
+
[languages]:{{site.github.url}}/languages/
|
data/lib/ruby_wordcram.rb
CHANGED
@@ -1,9 +1,10 @@
|
|
1
1
|
# frozen_string_literal: false
|
2
|
+
|
2
3
|
if RUBY_PLATFORM == 'java'
|
3
4
|
require 'WordCram.jar'
|
4
|
-
require 'jsoup-1.10.
|
5
|
-
wc = %w
|
6
|
-
sh = %w
|
5
|
+
require 'jsoup-1.10.3.jar'
|
6
|
+
wc = %w[WordAngler WordColorer WordCram WordFonter WordPlacer WordSkipReason]
|
7
|
+
sh = %w[Colorers ImageShaper Observer Placers Word ShapeBasedPlacer]
|
7
8
|
WC = wc.concat(sh).freeze
|
8
9
|
WC.each do |klass|
|
9
10
|
java_import "wordcram.#{klass}"
|
data/pom.rb
CHANGED
@@ -1,8 +1,16 @@
|
|
1
1
|
project 'Wordcram' do
|
2
2
|
|
3
3
|
model_version '4.0.0'
|
4
|
-
id 'wordcram:WordCram:2.0.
|
4
|
+
id 'wordcram:WordCram:2.0.1'
|
5
5
|
packaging 'jar'
|
6
|
+
description 'WordCram for JRubyArt and propane'
|
7
|
+
organization 'ruby-processing', 'https://ruby-processing.github.io'
|
8
|
+
{ 'danbernier' => 'Dan Bernier', 'jdf' => 'Jonathan Feinberg', 'monkstone' => 'Martin Prout' }.each do |key, value|
|
9
|
+
developer key do
|
10
|
+
name value
|
11
|
+
roles 'developer'
|
12
|
+
end
|
13
|
+
end
|
6
14
|
|
7
15
|
properties(
|
8
16
|
'source.directory' => 'src',
|
@@ -22,12 +30,12 @@ project 'Wordcram' do
|
|
22
30
|
} )
|
23
31
|
end
|
24
32
|
|
25
|
-
jar 'org.processing:core:3.3.
|
26
|
-
jar 'org.jsoup:jsoup:1.10.
|
33
|
+
jar 'org.processing:core:3.3.4'
|
34
|
+
jar 'org.jsoup:jsoup:1.10.3'
|
27
35
|
|
28
36
|
build do
|
29
37
|
default_goal 'package'
|
30
|
-
source_directory 'source.directory'
|
38
|
+
source_directory '${source.directory}'
|
31
39
|
final_name 'WordCram'
|
32
40
|
resource do
|
33
41
|
directory 'src'
|
@@ -42,7 +50,7 @@ project 'Wordcram' do
|
|
42
50
|
execute_goals( id: 'default-cli',
|
43
51
|
artifactItems: [ { groupId: 'org.jsoup',
|
44
52
|
artifactId: 'jsoup',
|
45
|
-
version: '1.10.
|
53
|
+
version: '1.10.3',
|
46
54
|
type: 'jar',
|
47
55
|
outputDirectory: '${wordcram.basedir}/lib'
|
48
56
|
}
|
data/pom.xml
CHANGED
@@ -11,8 +11,36 @@ DO NOT MODIFIY - GENERATED CODE
|
|
11
11
|
<modelVersion>4.0.0</modelVersion>
|
12
12
|
<groupId>wordcram</groupId>
|
13
13
|
<artifactId>WordCram</artifactId>
|
14
|
-
<version>2.0.
|
14
|
+
<version>2.0.1</version>
|
15
15
|
<name>Wordcram</name>
|
16
|
+
<description>WordCram for JRubyArt and propane</description>
|
17
|
+
<organization>
|
18
|
+
<name>ruby-processing</name>
|
19
|
+
<url>https://ruby-processing.github.io</url>
|
20
|
+
</organization>
|
21
|
+
<developers>
|
22
|
+
<developer>
|
23
|
+
<id>danbernier</id>
|
24
|
+
<name>Dan Bernier</name>
|
25
|
+
<roles>
|
26
|
+
<role>developer</role>
|
27
|
+
</roles>
|
28
|
+
</developer>
|
29
|
+
<developer>
|
30
|
+
<id>jdf</id>
|
31
|
+
<name>Jonathan Feinberg</name>
|
32
|
+
<roles>
|
33
|
+
<role>developer</role>
|
34
|
+
</roles>
|
35
|
+
</developer>
|
36
|
+
<developer>
|
37
|
+
<id>monkstone</id>
|
38
|
+
<name>Martin Prout</name>
|
39
|
+
<roles>
|
40
|
+
<role>developer</role>
|
41
|
+
</roles>
|
42
|
+
</developer>
|
43
|
+
</developers>
|
16
44
|
<properties>
|
17
45
|
<source.directory>src</source.directory>
|
18
46
|
<wordcram.basedir>${project.basedir}</wordcram.basedir>
|
@@ -25,16 +53,16 @@ DO NOT MODIFIY - GENERATED CODE
|
|
25
53
|
<dependency>
|
26
54
|
<groupId>org.processing</groupId>
|
27
55
|
<artifactId>core</artifactId>
|
28
|
-
<version>3.3.
|
56
|
+
<version>3.3.4</version>
|
29
57
|
</dependency>
|
30
58
|
<dependency>
|
31
59
|
<groupId>org.jsoup</groupId>
|
32
60
|
<artifactId>jsoup</artifactId>
|
33
|
-
<version>1.10.
|
61
|
+
<version>1.10.3</version>
|
34
62
|
</dependency>
|
35
63
|
</dependencies>
|
36
64
|
<build>
|
37
|
-
<sourceDirectory
|
65
|
+
<sourceDirectory>${source.directory}</sourceDirectory>
|
38
66
|
<defaultGoal>package</defaultGoal>
|
39
67
|
<resources>
|
40
68
|
<resource>
|
@@ -72,7 +100,7 @@ DO NOT MODIFIY - GENERATED CODE
|
|
72
100
|
<artifactItem>
|
73
101
|
<groupId>org.jsoup</groupId>
|
74
102
|
<artifactId>jsoup</artifactId>
|
75
|
-
<version>1.10.
|
103
|
+
<version>1.10.3</version>
|
76
104
|
<type>jar</type>
|
77
105
|
<outputDirectory>${wordcram.basedir}/lib</outputDirectory>
|
78
106
|
</artifactItem>
|
data/ruby_wordcram.gemspec
CHANGED
@@ -10,7 +10,7 @@ Gem::Specification.new do |spec|
|
|
10
10
|
spec.extra_rdoc_files = %w{README.md LICENSE}
|
11
11
|
spec.summary = %q{Updated and extended WordCram library for JRubyArt and propane}
|
12
12
|
spec.description =<<-EOS
|
13
|
-
WordCram library wrapped in a rubygem. Compiled and tested with JRubyArt-1.3 and processing-3.
|
13
|
+
WordCram library wrapped in a rubygem. Compiled and tested with JRubyArt-1.3.3 and processing-3.4
|
14
14
|
EOS
|
15
15
|
spec.licenses = %w{Apache-2.0}
|
16
16
|
spec.authors = %w{Dan\ Bernier Jonathan\ Feinberg Martin\ Prout}
|
@@ -18,7 +18,7 @@ Gem::Specification.new do |spec|
|
|
18
18
|
spec.homepage = 'http://ruby-processing.github.io/WordCram/'
|
19
19
|
spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
|
20
20
|
spec.files << 'lib/WordCram.jar'
|
21
|
-
spec.files << 'lib/jsoup-1.10.
|
21
|
+
spec.files << 'lib/jsoup-1.10.3.jar'
|
22
22
|
spec.require_paths = ['lib']
|
23
23
|
spec.add_development_dependency 'rake', '~> 12', '>= 12.0'
|
24
24
|
end
|
data/src/cue/lang/Counter.java
CHANGED
@@ -13,7 +13,6 @@
|
|
13
13
|
See the License for the specific language governing permissions and
|
14
14
|
limitations under the License.
|
15
15
|
*/
|
16
|
-
|
17
16
|
package cue.lang;
|
18
17
|
|
19
18
|
import java.util.ArrayList;
|
@@ -26,116 +25,105 @@ import java.util.Map.Entry;
|
|
26
25
|
import java.util.Set;
|
27
26
|
|
28
27
|
/**
|
29
|
-
*
|
28
|
+
*
|
30
29
|
* @author Jonathan Feinberg <jdf@us.ibm.com>
|
31
30
|
* @param <T>
|
32
|
-
*
|
31
|
+
*
|
33
32
|
*/
|
34
33
|
public class Counter<T> {
|
35-129
|
-
(95 removed lines, old lines 35-129; content not shown)
|
130
|
-
return Collections.unmodifiableSet(items.keySet());
|
131
|
-
}
|
132
|
-
|
133
|
-
public List<T> keyList() {
|
134
|
-
return getMostFrequent(items.size());
|
135
|
-
}
|
136
|
-
|
137
|
-
@Override
|
138
|
-
public String toString() {
|
139
|
-
return items.toString();
|
140
|
-
}
|
34
|
+
// delegate, don't extend, to prevent unauthorized monkeying with internals
|
35
|
+
|
36
|
+
private final Map<T, Integer> items = new HashMap<>();
|
37
|
+
private int totalItemCount = 0;
|
38
|
+
|
39
|
+
public Counter() {
|
40
|
+
this.BY_FREQ_DESC = (final Entry<T, Integer> o1, final Entry<T, Integer> o2) -> o2.getValue() - o1.getValue();
|
41
|
+
}
|
42
|
+
|
43
|
+
public Counter(final Iterable<T> items) {
|
44
|
+
this.BY_FREQ_DESC = (final Entry<T, Integer> o1, final Entry<T, Integer> o2) -> o2.getValue() - o1.getValue();
|
45
|
+
noteAll(items);
|
46
|
+
}
|
47
|
+
|
48
|
+
public final void noteAll(final Iterable<T> items) {
|
49
|
+
for (final T t : items) {
|
50
|
+
note(t, 1);
|
51
|
+
}
|
52
|
+
}
|
53
|
+
|
54
|
+
public void note(final T item) {
|
55
|
+
note(item, 1);
|
56
|
+
}
|
57
|
+
|
58
|
+
public void note(final T item, final int count) {
|
59
|
+
final Integer existingCount = items.get(item);
|
60
|
+
if (existingCount != null) {
|
61
|
+
items.put(item, existingCount + count);
|
62
|
+
} else {
|
63
|
+
items.put(item, count);
|
64
|
+
}
|
65
|
+
totalItemCount += count;
|
66
|
+
}
|
67
|
+
|
68
|
+
public void merge(final Counter<T> c) {
|
69
|
+
c.items.entrySet().forEach((e) -> {
|
70
|
+
note(e.getKey(), e.getValue());
|
71
|
+
});
|
72
|
+
}
|
73
|
+
|
74
|
+
public int getTotalItemCount() {
|
75
|
+
return totalItemCount;
|
76
|
+
}
|
77
|
+
|
78
|
+
private final Comparator<Entry<T, Integer>> BY_FREQ_DESC;
|
79
|
+
|
80
|
+
/**
|
81
|
+
* @param n
|
82
|
+
* @return A list of the min(n, size()) most frequent items
|
83
|
+
*/
|
84
|
+
public List<T> getMostFrequent(final int n) {
|
85
|
+
final List<Entry<T, Integer>> all = getAllByFrequency();
|
86
|
+
final int resultSize = Math.min(n, items.size());
|
87
|
+
final List<T> result = new ArrayList<>(resultSize);
|
88
|
+
all.subList(0, resultSize).forEach((e) -> {
|
89
|
+
result.add(e.getKey());
|
90
|
+
});
|
91
|
+
return Collections.unmodifiableList(result);
|
92
|
+
}
|
93
|
+
|
94
|
+
public List<Entry<T, Integer>> getAllByFrequency() {
|
95
|
+
final List<Entry<T, Integer>> all = new ArrayList<>(
|
96
|
+
items.entrySet());
|
97
|
+
Collections.sort(all, BY_FREQ_DESC);
|
98
|
+
return Collections.unmodifiableList(all);
|
99
|
+
}
|
100
|
+
|
101
|
+
public Integer getCount(final T item) {
|
102
|
+
final Integer freq = items.get(item);
|
103
|
+
if (freq == null) {
|
104
|
+
return 0;
|
105
|
+
}
|
106
|
+
return freq;
|
107
|
+
}
|
108
|
+
|
109
|
+
public void clear() {
|
110
|
+
items.clear();
|
111
|
+
}
|
112
|
+
|
113
|
+
public Set<Entry<T, Integer>> entrySet() {
|
114
|
+
return Collections.unmodifiableSet(items.entrySet());
|
115
|
+
}
|
116
|
+
|
117
|
+
public Set<T> keySet() {
|
118
|
+
return Collections.unmodifiableSet(items.keySet());
|
119
|
+
}
|
120
|
+
|
121
|
+
public List<T> keyList() {
|
122
|
+
return getMostFrequent(items.size());
|
123
|
+
}
|
124
|
+
|
125
|
+
@Override
|
126
|
+
public String toString() {
|
127
|
+
return items.toString();
|
128
|
+
}
|
141
129
|
}
|
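The `Counter` class above tallies items in a `Map<T, Integer>` and sorts the entries by descending count. A minimal standalone sketch of that technique is given here; `FrequencyCounterSketch` and `mostFrequent` are hypothetical names used for illustration and are not part of cue.language. The sketch swaps the subtraction-based comparator for `Integer.compare`, which is an assumption about style rather than anything the diff does, chosen because it cannot overflow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; this is not the cue.language Counter class.
final class FrequencyCounterSketch {

    // Tally every item, then return the min(n, distinct items) most frequent keys.
    static <T> List<T> mostFrequent(final Iterable<T> items, final int n) {
        final Map<T, Integer> counts = new HashMap<>();
        for (final T item : items) {
            // Add 1 to the existing tally, or start the tally at 1.
            counts.merge(item, 1, Integer::sum);
        }
        final List<Map.Entry<T, Integer>> all = new ArrayList<>(counts.entrySet());
        // Highest count first; Integer.compare avoids subtraction overflow.
        all.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        final List<T> result = new ArrayList<>();
        for (final Map.Entry<T, Integer> e : all.subList(0, Math.min(n, all.size()))) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(final String[] args) {
        final List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be", "to");
        System.out.println(mostFrequent(words, 2));
    }
}
```

Items with equal counts come back in whatever order the `HashMap` yields them, so callers that need a stable order for ties would have to add a secondary sort key.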
data/src/cue/lang/NGramIterator.java
CHANGED
@@ -27,15 +27,15 @@ import cue.lang.stop.StopWords;
|
|
27
27
|
* retrieve a sequence of {@link String}s, each of which has n words that appear
|
28
28
|
* contiguously within a sentence. "Words" are as defined by the
|
29
29
|
* {@link WordIterator}.
|
30
|
-
*
|
30
|
+
*
|
31
31
|
* <p>
|
32
32
|
* If you don't provide a {@link Locale}, you get the default {@link Locale} for
|
33
33
|
* your system, which may or may not be what you want. The {@link Locale} is
|
34
34
|
* used by a {@link SentenceIterator} to find sentence breaks.
|
35
|
-
*
|
35
|
+
*
|
36
36
|
* <p>
|
37
37
|
* Example:
|
38
|
-
*
|
38
|
+
*
|
39
39
|
* <pre>
|
40
40
|
* final String lyric = "This happened once before. I came to your door. No reply.";
|
41
41
|
* for (final String s : new NGramIterator(3, lyric)) {
|
@@ -44,13 +44,13 @@ import cue.lang.stop.StopWords;
|
|
44
44
|
* for (final String s : new NGramIterator(2, lyric)) {
|
45
45
|
* System.out.println(s);
|
46
46
|
* }
|
47
|
-
*
|
47
|
+
*
|
48
48
|
* This happened once
|
49
49
|
* happened once before
|
50
50
|
* I came to
|
51
51
|
* came to your
|
52
52
|
* to your door
|
53
|
-
*
|
53
|
+
*
|
54
54
|
* This happened
|
55
55
|
* happened once
|
56
56
|
* once before
|
@@ -60,92 +60,97 @@ import cue.lang.stop.StopWords;
|
|
60
60
|
* your door
|
61
61
|
* No reply
|
62
62
|
* </pre>
|
63
|
-
*
|
63
|
+
*
|
64
64
|
* @author Jonathan Feinberg <jdf@us.ibm.com>
|
65
|
-
*
|
65
|
+
*
|
66
66
|
*/
|
67
67
|
public class NGramIterator extends IterableText {
|
68-150
|
-
(83 removed lines, old lines 68-150; content not shown)
|
68
|
+
|
69
|
+
private final SentenceIterator sentenceIterator;
|
70
|
+
private final LinkedList<String> grams = new LinkedList<>();
|
71
|
+
private final int n;
|
72
|
+
private final StopWords stopWords;
|
73
|
+
|
74
|
+
private String next;
|
75
|
+
private Iterator<String> currentWordIterator;
|
76
|
+
|
77
|
+
public NGramIterator(final int n, final String text) {
|
78
|
+
this(n, text, Locale.getDefault());
|
79
|
+
}
|
80
|
+
|
81
|
+
public NGramIterator(final int n, final String text, final Locale locale) {
|
82
|
+
this(n, text, locale, null);
|
83
|
+
}
|
84
|
+
|
85
|
+
public NGramIterator(final int n, final String text, final Locale locale,
|
86
|
+
final StopWords stopWords) {
|
87
|
+
this.n = n;
|
88
|
+
this.sentenceIterator = new SentenceIterator(text, locale);
|
89
|
+
this.stopWords = stopWords;
|
90
|
+
loadNext();
|
91
|
+
}
|
92
|
+
|
93
|
+
@Override
|
94
|
+
public void remove() {
|
95
|
+
throw new UnsupportedOperationException();
|
96
|
+
}
|
97
|
+
|
98
|
+
@Override
|
99
|
+
public String next() {
|
100
|
+
if (next == null) {
|
101
|
+
throw new NoSuchElementException();
|
102
|
+
}
|
103
|
+
final String result = next;
|
104
|
+
loadNext();
|
105
|
+
return result;
|
106
|
+
}
|
107
|
+
|
108
|
+
/**
|
109
|
+
*
|
110
|
+
* @return
|
111
|
+
*/
|
112
|
+
@Override
|
113
|
+
public boolean hasNext() {
|
114
|
+
return next != null;
|
115
|
+
}
|
116
|
+
|
117
|
+
private void loadNext() {
|
118
|
+
next = null;
|
119
|
+
if (!grams.isEmpty()) {
|
120
|
+
grams.pop();
|
121
|
+
}
|
122
|
+
while (grams.size() < n) {
|
123
|
+
while (currentWordIterator == null
|
124
|
+
|| !currentWordIterator.hasNext()) {
|
125
|
+
if (!sentenceIterator.hasNext()) {
|
126
|
+
return;
|
127
|
+
}
|
128
|
+
grams.clear();
|
129
|
+
currentWordIterator = new WordIterator(sentenceIterator.next())
|
130
|
+
.iterator();
|
131
|
+
for (int i = 0; currentWordIterator.hasNext() && i < n - 1; i++) {
|
132
|
+
maybeAddWord();
|
133
|
+
}
|
134
|
+
}
|
135
|
+
// now grams has n-1 words in it and currentWordIterator hasNext
|
136
|
+
maybeAddWord();
|
137
|
+
}
|
138
|
+
final StringBuilder sb = new StringBuilder();
|
139
|
+
grams.forEach((gram) -> {
|
140
|
+
if (sb.length() > 0) {
|
141
|
+
sb.append(" ");
|
142
|
+
}
|
143
|
+
sb.append(gram);
|
144
|
+
});
|
145
|
+
next = sb.toString();
|
146
|
+
}
|
147
|
+
|
148
|
+
private void maybeAddWord() {
|
149
|
+
final String nextWord = currentWordIterator.next();
|
150
|
+
if (stopWords != null && stopWords.isStopWord(nextWord)) {
|
151
|
+
grams.clear();
|
152
|
+
} else {
|
153
|
+
grams.add(nextWord);
|
154
|
+
}
|
155
|
+
}
|
151
156
|
}
|
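The `loadNext` method above keeps a sliding window of `n` words in a `LinkedList` and joins them with spaces. The sketch below shows the same windowing idea for a single sentence, under simplifying assumptions: it splits on whitespace instead of using `WordIterator`, ignores stop words, and does not cross sentence boundaries. `NGramSketch` and `nGrams` are hypothetical names, not part of cue.language.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration of the sliding-window idea; not the cue.language NGramIterator.
final class NGramSketch {

    // Return every run of n contiguous whitespace-separated words in the sentence.
    static List<String> nGrams(final int n, final String sentence) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        final List<String> result = new ArrayList<>();
        final LinkedList<String> window = new LinkedList<>();
        for (final String word : sentence.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty input yields a single empty token
            }
            window.add(word);
            if (window.size() > n) {
                window.removeFirst(); // slide the window forward by one word
            }
            if (window.size() == n) {
                result.add(String.join(" ", window));
            }
        }
        return result;
    }

    public static void main(final String[] args) {
        for (final String gram : nGrams(3, "I came to your door")) {
            System.out.println(gram);
        }
    }
}
```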
data/src/cue/lang/SentenceIterator.java
CHANGED
@@ -24,63 +24,68 @@ import java.util.regex.Pattern;
|
|
24
24
|
* Construct with a {@link String}; retrieve a sequence of {@link String}s, each
|
25
25
|
* of which is a "sentence" according to Java's built-in model for the given
|
26
26
|
* {@link Locale}.
|
27
|
-
*
|
27
|
+
*
|
28
28
|
* @author Jonathan Feinberg <jdf@us.ibm.com>
|
29
|
-
*
|
29
|
+
*
|
30
30
|
*/
|
31
31
|
public class SentenceIterator extends IterableText {
|
32
|
-
private final String text;
|
33
|
-
private final BreakIterator breakIterator;
|
34
|
-
int start, end;
|
35
32
|
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
* @param text
|
40
|
-
*/
|
41
|
-
public SentenceIterator(final String text) {
|
42
|
-
this(text, Locale.getDefault());
|
43
|
-
}
|
33
|
+
private final String text;
|
34
|
+
private final BreakIterator breakIterator;
|
35
|
+
int start, end;
|
44
36
|
|
45
|
-
|
46
|
-
|
47
|
-
|
48
|
-
|
49
|
-
|
50
|
-
|
51
|
-
|
37
|
+
/**
|
38
|
+
* Uses the default {@link Locale} for the running instance of the JVM.
|
39
|
+
*
|
40
|
+
* @param text
|
41
|
+
*/
|
42
|
+
public SentenceIterator(final String text) {
|
43
|
+
this(text, Locale.getDefault());
|
44
|
+
}
|
52
45
|
|
53
|
-
|
54
|
-
|
46
|
+
public SentenceIterator(final String text, final Locale locale) {
|
47
|
+
this.text = text;
|
48
|
+
breakIterator = BreakIterator.getSentenceInstance(locale);
|
49
|
+
breakIterator.setText(text);
|
50
|
+
start = end = breakIterator.first();
|
51
|
+
advance();
|
52
|
+
}
|
55
53
|
|
56
|
-
|
57
|
-
|
58
|
-
while (hasNext()
|
59
|
-
&& ((end == start) || ABBREVS.matcher(
|
60
|
-
text.substring(start, end)).find())) {
|
61
|
-
end = breakIterator.next();
|
62
|
-
}
|
63
|
-
}
|
54
|
+
private static final Pattern ABBREVS = Pattern
|
55
|
+
.compile("(?:Mrs?|Ms|Dr|Rev)\\.\\s*$");
|
64
56
|
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
57
|
+
private void advance() {
|
58
|
+
start = end;
|
59
|
+
while (hasNext()
|
60
|
+
&& ((end == start) || ABBREVS.matcher(
|
61
|
+
text.substring(start, end)).find())) {
|
62
|
+
end = breakIterator.next();
|
63
|
+
}
|
64
|
+
}
|
69
65
|
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
|
74
|
-
}
|
75
|
-
final String result = text.substring(start, end)
|
76
|
-
.replaceAll("\\s+", " ");
|
77
|
-
advance();
|
78
|
-
return result;
|
79
|
-
}
|
66
|
+
@Override
|
67
|
+
public void remove() {
|
68
|
+
throw new UnsupportedOperationException();
|
69
|
+
}
|
80
70
|
|
81
|
-
|
82
|
-
|
83
|
-
|
84
|
-
|
71
|
+
@Override
|
72
|
+
public String next() {
|
73
|
+
if (!hasNext()) {
|
74
|
+
throw new NoSuchElementException();
|
75
|
+
}
|
76
|
+
final String result = text.substring(start, end)
|
77
|
+
.replaceAll("\\s+", " ");
|
78
|
+
advance();
|
79
|
+
return result;
|
80
|
+
}
|
81
|
+
|
82
|
+
/**
|
83
|
+
*
|
84
|
+
* @return
|
85
|
+
*/
|
86
|
+
@Override
|
87
|
+
public final boolean hasNext() {
|
88
|
+
return end != BreakIterator.DONE;
|
89
|
+
}
|
85
90
|
|
86
91
|
}
|
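`SentenceIterator` above relies on `java.text.BreakIterator.getSentenceInstance(locale)` to find sentence boundaries. The following sketch walks those boundaries directly and collects the sentences into a list. It is a simplified illustration only: it leaves out the `ABBREVS` handling for titles such as "Dr." shown in the diff, and it trims each sentence, which the original does not do. `SentenceSplitSketch` and `sentences` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not the cue.language SentenceIterator.
final class SentenceSplitSketch {

    // Split text into sentences using Java's built-in model for the given locale.
    static List<String> sentences(final String text, final Locale locale) {
        final BreakIterator breaks = BreakIterator.getSentenceInstance(locale);
        breaks.setText(text);
        final List<String> result = new ArrayList<>();
        int start = breaks.first();
        // Each call to next() returns the following boundary, or DONE at the end.
        for (int end = breaks.next(); end != BreakIterator.DONE; start = end, end = breaks.next()) {
            // Collapse runs of whitespace, as the original does, then trim (sketch only).
            final String sentence = text.substring(start, end).replaceAll("\\s+", " ").trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(final String[] args) {
        final String lyric = "This happened once before. I came to your door. No reply.";
        for (final String s : sentences(lyric, Locale.ENGLISH)) {
            System.out.println(s);
        }
    }
}
```

Passing an explicit `Locale` rather than `Locale.getDefault()` keeps the result independent of the machine the code runs on, which is the concern the `NGramIterator` Javadoc raises.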
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: ruby_wordcram
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 2.0.
|
4
|
+
version: 2.0.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Dan Bernier
|
@@ -10,7 +10,7 @@ authors:
|
|
10
10
|
autorequire:
|
11
11
|
bindir: bin
|
12
12
|
cert_chain: []
|
13
|
-
date: 2017-
|
13
|
+
date: 2017-06-17 00:00:00.000000000 Z
|
14
14
|
dependencies:
|
15
15
|
- !ruby/object:Gem::Dependency
|
16
16
|
name: rake
|
@@ -32,8 +32,8 @@ dependencies:
|
|
32
32
|
- - ">="
|
33
33
|
- !ruby/object:Gem::Version
|
34
34
|
version: '12.0'
|
35
|
-
description: " WordCram library wrapped in a rubygem. Compiled and tested with JRubyArt-1.3
|
36
|
-
and processing-3.
|
35
|
+
description: " WordCram library wrapped in a rubygem. Compiled and tested with JRubyArt-1.3.3
|
36
|
+
and processing-3.4\n"
|
37
37
|
email: mamba2928@yahoo.co.uk
|
38
38
|
executables: []
|
39
39
|
extensions: []
|
@@ -44,10 +44,12 @@ files:
|
|
44
44
|
- ".gitignore"
|
45
45
|
- ".mvn/extensions.xml"
|
46
46
|
- ".mvn/wrapper/maven-wrapper.properties"
|
47
|
+
- ".travis.yml"
|
47
48
|
- CHANGELOG.md
|
48
49
|
- LICENSE
|
49
50
|
- README.md
|
50
51
|
- Rakefile
|
52
|
+
- acknowledgements.md
|
51
53
|
- docs/.gitignore
|
52
54
|
- docs/_config.yml
|
53
55
|
- docs/_includes/footer.html
|
@@ -62,6 +64,7 @@ files:
|
|
62
64
|
- docs/_layouts/post.html
|
63
65
|
- docs/_posts/2017-03-07-getting_started.md
|
64
66
|
- docs/_posts/2017-03-07-under_the_hood.md
|
67
|
+
- docs/_posts/2017-03-12-supported_languages.md
|
65
68
|
- docs/_sass/_base.scss
|
66
69
|
- docs/_sass/_layout.scss
|
67
70
|
- docs/_sass/_syntax-highlighting.scss
|
@@ -73,7 +76,7 @@ files:
|
|
73
76
|
- example/data/MINYN___.TTF
|
74
77
|
- example/test.rb
|
75
78
|
- lib/WordCram.jar
|
76
|
-
- lib/jsoup-1.10.
|
79
|
+
- lib/jsoup-1.10.3.jar
|
77
80
|
- lib/ruby_wordcram.rb
|
78
81
|
- lib/ruby_wordcram/version.rb
|
79
82
|
- pom.rb
|
@@ -185,7 +188,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
185
188
|
version: '0'
|
186
189
|
requirements: []
|
187
190
|
rubyforge_project:
|
188
|
-
rubygems_version: 2.6.
|
191
|
+
rubygems_version: 2.6.11
|
189
192
|
signing_key:
|
190
193
|
specification_version: 4
|
191
194
|
summary: Updated and extended WordCram library for JRubyArt and propane
|