jruby-boilerpipe 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 2b25faefd03910a9b1e5e88e733b35f373a7b71a241e8a742e43e0419080fd70
- data.tar.gz: 21f97fe1fb05465f3a4ffe9751054e423854c6028a356152b42a0582d28fc260
+ metadata.gz: f33a129d7965665580ad4717ff9b33379c3e310e7d60b6e1f94a625bf6008a31
+ data.tar.gz: a3583abe164b5cbc6b468d5e2fb04bd9395eeacded838fe9b6fa1407b61fbe07
  SHA512:
- metadata.gz: a8c7ba7de6d49ce2479121e5dbbf783ebbc8dfbfe7e17236a9523d1eed47cbac2f5d549597c7cb4ac1d34aa03e048efa349599cfe9d86bf5b9b5e316b11aff02
- data.tar.gz: 0df70ee51cbce0b1299f8b24027e115bc6dc0a4824a4dcea0c25716077e93dbeb12da476f1681e69317c41fa3d7c6a42ebd4670bf1db22700381ce7c39d4f699
+ metadata.gz: c508f14e8b057a5d29796b8908c5b4b245c70b3de34f2f5b29a922fd309fb2667d296515bd1f261862299dd8ba0446fddadad829c8a59dab796baef0382fe767
+ data.tar.gz: '0383054cd6a662ff1bfb7e508e6b18a66b81324a03817a86a5e2111e7cd9d5777123cfc0a74d2f5d6de536b615b45b75b9d7599f10ee046fbbf51787c50c4385'
data/README.md CHANGED
@@ -1,11 +1,15 @@
  # Boilerpipe
 
- I saw other gems wrapping boilerpipe but they seemed to be outdated, hit the free api or I couldn't get them to work because of dependency issues. I went directly to the original author's github https://github.com/kohlschutter/boilerpipe and forked that code base here https://github.com/gregors/boilerpipe . I made one notible change that is to add a user agent so article extractor wouldn't return a 403 for a variety of web servers. I compiled all dependencies into the jar using the maven-assembly-plugin using Java 8.
+ I saw other gems wrapping boilerpipe but they seemed to be outdated, hit the free api or I couldn't get them to work because of dependency issues. I went directly to the [original author's source](https://github.com/kohlschutter/boilerpipe) and forked that code base [here](https://github.com/gregors/boilerpipe). I made one notible change that is to add a user agent so article extractor wouldn't return a 403 for a variety of web servers. I compiled all dependencies into the jar using the maven-assembly-plugin using Java 8.
+
+ Also check out my pure ruby implementation [boilerpipe-ruby](https://github.com/gregors/boilerpipe-ruby)
 
  Extractors added so far:
 
  * ArticleExtractor
+ * ArticleSentencesExtractor
  * DefaultExtractor
+ * LargestContentExtractor
 
 
  ## Installation
  ## Installation
@@ -25,7 +29,8 @@ Or install it yourself as:
  $ gem install jruby-boilerpipe
 
  ## Usage
- You can feed Boilerpipe:ArticleExractor either a valid url or html content.
+
+ #### You can feed Boilerpipe:ArticleExractor either a valid url or html content.
 
  jruby-9.1.7.0 :001 > require 'boilerpipe'
  => true
@@ -37,6 +42,13 @@ Or install it yourself as:
  jruby-9.1.7.0 :004 > contents = open("https://github.com/jruby/jruby/wiki/AboutJRuby").read
  jruby-9.1.7.0 :005 > Boilerpipe::ArticleExtractor.text(contents)
  => "Clone this wiki locally\nAbout JRuby\nJRuby is a 100% Java implementation of the Ruby programming language. It is Ruby for the JVM.\nJRuby provides a complete set of core \"builtin\" classes and syntax for the Ruby language, as well as most of the Ruby Standard Libraries. The standard libraries are mostly Ruby's own complement of .rb files, but a few that depend on C language-based extensions have been reimplemented. Some are still missing, but we hope to implement as many as is feasible.\nSee Differences Between MRI And JRuby for more information on potential incompatibilities between JRuby and the C implementation of Ruby.\nDevelopment Team\nJRuby's current core development team consists of nine developers:\nCharles Oliver Nutter (Red Hat) aka headius\nThomas Enebo (Red Hat) aka enebo\nNick Sieger (LivingSocial)\nThere are also many past contributors who still help out from time to time:\nOla Bini (Thoughtworks)\nMany more...check out the JRuby commit logs!\nLinks\n"
+
+ #### Or using a different extractor
+
+ jruby-9.2.5.0 :001 > require 'boilerpipe'
+ => true
+ jruby-9.2.5.0 :003 > ::Boilerpipe::ArticleSentencesExtractor.text('https://github.com/jruby/jruby/wiki/AboutJRuby')
+ => "JRuby is a 100% Java implementation of the Ruby programming language. It is Ruby for the JVM.\nJRuby provides a complete set of core \"builtin\" classes and syntax for the Ruby language, as well as most of the Ruby Standard Libraries. The standard libraries..."
 
  ## Development
 
@@ -6,20 +6,20 @@ require 'boilerpipe/version'
  Gem::Specification.new do |spec|
    spec.name = 'jruby-boilerpipe'
    spec.version = Boilerpipe::VERSION
-   spec.authors = ["Gregory Ostermayr"]
-   spec.email = ["<gregory.ostermayr@gmail.com>"]
+   spec.authors = ['Gregory Ostermayr']
+   spec.email = ['<gregory.ostermayr@gmail.com>']
 
    spec.summary = %q{Ruby wrapper around boilerpipe java library}
    spec.description = %q{Java8 compiled - latest boilerpipe-2.0-SNAPSHOT - including all dependencies}
-   spec.homepage = "https://github.com/gregors/jruby-boilerpipe"
+   spec.homepage = 'https://github.com/gregors/jruby-boilerpipe'
 
    spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
-   spec.bindir = "exe"
+   spec.bindir = 'exe'
    spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
-   spec.require_paths = ["lib"]
+   spec.require_paths = ['lib']
 
-   spec.add_development_dependency "bundler", "~> 1.10"
-   spec.add_development_dependency "rake", "~> 10.0"
-   spec.add_development_dependency "rspec"
-   spec.add_development_dependency "rickshaw"
+   spec.add_development_dependency 'bundler', '~> 1.10'
+   spec.add_development_dependency 'rake', '~> 10.0'
+   spec.add_development_dependency 'rspec'
+   spec.add_development_dependency 'rickshaw'
  end
data/lib/boilerpipe.rb CHANGED
@@ -3,6 +3,7 @@ require 'boilerpipe/version'
  require 'boilerpipe/sax/boilerpipe_html_parser'
  require 'boilerpipe/document/document'
  require 'boilerpipe/extractors/article_extractor'
+ require 'boilerpipe/extractors/article_sentences_extractor'
  require 'boilerpipe/extractors/default_extractor'
  require 'boilerpipe/extractors/largest_content_extractor'
  require 'boilerpipe/filters/filters'
data/lib/boilerpipe/extractors/article_sentences_extractor.rb ADDED
@@ -0,0 +1,39 @@
+ module Boilerpipe
+   java_import java.net.URL
+
+   module Extractors
+     class ArticleSentencesExtractor
+       java_import 'com.kohlschutter.boilerpipe.extractors.ArticleSentencesExtractor'
+
+       def self.process(doc)
+         ArticleSentencesExtractor::INSTANCE.process doc
+       end
+
+       def self.get_text(s)
+         url = nil
+
+         begin
+           url = Java::JavaNet::URL.new(s)
+         rescue Java::JavaNet::MalformedURLException => e
+           # not a URL
+         end
+         input = url ? url : s
+         ArticleSentencesExtractor::INSTANCE.get_text(input)
+       end
+
+       class <<self
+         alias_method :text, :get_text
+       end
+     end
+   end
+
+   class ArticleSentencesExtractor
+     def self.get_text(s)
+       Extractors::ArticleSentencesExtractor.get_text s
+     end
+
+     class <<self
+       alias_method :text, :get_text
+     end
+   end
+ end
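The new extractor's `get_text` accepts either a URL or raw HTML: it tries to build a `java.net.URL` and falls back to the original string when that raises `MalformedURLException`. The same "try to parse, otherwise pass through" idea can be sketched in plain Ruby. This is only an illustration and not the gem's code: it assumes Ruby's standard `URI` library as a stand-in for `java.net.URL`, and the module name `InputSketch` and the methods `get_input` / `input` are hypothetical.

```ruby
require 'uri'

# Hypothetical stand-in for the extractor's input handling; not part of the gem.
module InputSketch
  # Returns a URI object when `s` parses as an absolute http(s) URL,
  # otherwise returns `s` unchanged so the caller can treat it as raw HTML.
  def self.get_input(s)
    uri = URI.parse(s)
    uri.is_a?(URI::HTTP) && uri.host ? uri : s
  rescue URI::InvalidURIError
    # not a URL
    s
  end

  # Same singleton-class aliasing the diff uses for :text / :get_text.
  class << self
    alias_method :input, :get_input
  end
end
```

Under these assumptions, `InputSketch.input('https://github.com/jruby/jruby/wiki/AboutJRuby')` yields a URI object, while an HTML string comes back unchanged.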
data/lib/boilerpipe/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Boilerpipe
-   VERSION = '0.2.0'
+   VERSION = '0.3.0'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: jruby-boilerpipe
  version: !ruby/object:Gem::Version
- version: 0.2.0
+ version: 0.3.0
  platform: ruby
  authors:
  - Gregory Ostermayr
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2017-09-11 00:00:00.000000000 Z
+ date: 2018-12-28 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
@@ -85,6 +85,7 @@ files:
  - lib/boilerpipe.rb
  - lib/boilerpipe/document/document.rb
  - lib/boilerpipe/extractors/article_extractor.rb
+ - lib/boilerpipe/extractors/article_sentences_extractor.rb
  - lib/boilerpipe/extractors/default_extractor.rb
  - lib/boilerpipe/extractors/largest_content_extractor.rb
  - lib/boilerpipe/filters/filters.rb
@@ -110,7 +111,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.6.11
+ rubygems_version: 2.7.6
  signing_key:
  specification_version: 4
  summary: Ruby wrapper around boilerpipe java library