jruby-boilerpipe 0.2.0.rc2 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
- SHA1:
-   metadata.gz: 0ffb112f79745540183e4e4c3c4b8ea1948aeffa
-   data.tar.gz: 91b3e83d47ff60a9ca37bac2810600b158d90f6e
+ SHA256:
+   metadata.gz: 2b25faefd03910a9b1e5e88e733b35f373a7b71a241e8a742e43e0419080fd70
+   data.tar.gz: 21f97fe1fb05465f3a4ffe9751054e423854c6028a356152b42a0582d28fc260
  SHA512:
-   metadata.gz: d52bafba68cedd42c72c1a0d12cc5f65530968337a4f787bc8bb10a207d034437597d98613ccc51b58cddb7c7e3d4e61e73b1aca901b9c17a64ac5af3a7d443a
-   data.tar.gz: 2a79ea8aa79047ee60bd9b14cee607dc88fb7e3d3d033991e17ad797d2faf5d92ee7561ca35cdd4357b9515911de1d1463a9391f8482b6b47adc0abda90b55ad
+   metadata.gz: a8c7ba7de6d49ce2479121e5dbbf783ebbc8dfbfe7e17236a9523d1eed47cbac2f5d549597c7cb4ac1d34aa03e048efa349599cfe9d86bf5b9b5e316b11aff02
+   data.tar.gz: 0df70ee51cbce0b1299f8b24027e115bc6dc0a4824a4dcea0c25716077e93dbeb12da476f1681e69317c41fa3d7c6a42ebd4670bf1db22700381ce7c39d4f699
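The checksum entries above switch from SHA1 to SHA256, while SHA512 is kept. As a minimal sketch of how a digest like these could be checked against a local file, the snippet below uses Ruby's standard `digest` library. The helper name `checksum_matches?` and the throwaway sample file are assumptions for illustration; they are not part of the gem or of RubyGems.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper (not part of the gem): compares a file's SHA256
# hex digest with an expected hex string, ignoring letter case.
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# A throwaway file stands in for an archive such as data.tar.gz.
sample = Tempfile.new('data')
sample.write('example payload')
sample.flush

expected = Digest::SHA256.hexdigest('example payload')
puts checksum_matches?(sample.path, expected)
```

The same shape should work for the SHA512 entries by swapping in `Digest::SHA512`.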
data/README.md CHANGED
@@ -1,13 +1,19 @@
  # Boilerpipe
 
- I saw other gems wrapping boilerpipe but they seemed to be outdated, hit the free api or I couldn't get them to work because of dependency issues. I went directly to the original author's github https://github.com/kohlschutter/boilerpipe and forked that code base here https://github.com/gregors/boilerpipe . I made one notible change that is to add a user agent so article extractor wouldn't return a 403 for a vareity of web servers. I compiled all dependencies into the jar using the maven-assembly-plugin using Java 8. Until the original author releases a proper 2.0 release I'll be appending rcx to reflect the latest snapshots.
+ I saw other gems wrapping boilerpipe but they seemed to be outdated, hit the free api or I couldn't get them to work because of dependency issues. I went directly to the original author's github https://github.com/kohlschutter/boilerpipe and forked that code base here https://github.com/gregors/boilerpipe . I made one notible change that is to add a user agent so article extractor wouldn't return a 403 for a variety of web servers. I compiled all dependencies into the jar using the maven-assembly-plugin using Java 8.
+
+ Extractors added so far:
+
+ * ArticleExtractor
+ * DefaultExtractor
+
 
  ## Installation
 
  Add this line to your application's Gemfile:
 
  ```ruby
- gem 'jruby-boilerpipe'
+ gem 'jruby-boilerpipe', require: 'boilerpipe'
  ```
 
  And then execute:
@@ -16,16 +22,22 @@ And then execute:
 
  Or install it yourself as:
 
- $ gem install boilerpipe
+ $ gem install jruby-boilerpipe
 
  ## Usage
-
- jruby-1.7.24 :001 > require 'boilerpipe'
- => true
- jruby-1.7.24 :002 > Boilerpipe::ArticleExtractor.text("https://github.com/jruby/jruby/wiki/AboutJRuby")
- => "Clone this wiki locally\nAbout JRuby\nJRuby is a 100% Java implementation of the Ruby programming language. It is Ruby for the JVM.\nJRuby provides a complete set of core \"builtin\" classes and syntax for the Ruby language, as well as most of the Ruby Standard Libraries. The standard libraries are mostly Ruby's own complement of .rb files, but a few that depend on C language-based extensions have been reimplemented. Some are still missing, but we hope to implement as many as is feasible.\nSee Differences Between MRI And JRuby for more information on potential incompatibilities between JRuby and the C implementation of Ruby.\nDevelopment Team\nJRuby's current core development team consists of nine developers:\nCharles Oliver Nutter (Red Hat) aka headius\nThomas Enebo (Red Hat) aka enebo\nNick Sieger (LivingSocial)\n"
-
-
+ You can feed Boilerpipe:ArticleExractor either a valid url or html content.
+
+ jruby-9.1.7.0 :001 > require 'boilerpipe'
+ => true
+ jruby-9.1.7.0 :003 > Boilerpipe::ArticleExtractor.text("https://github.com/jruby/jruby/wiki/AboutJRuby")
+ => "Clone this wiki locally\nAbout JRuby\nJRuby is a 100% Java implementation of the Ruby programming language. It is Ruby for the JVM.\nJRuby provides a complete set of core \"builtin\" classes and syntax for the Ruby language, as well as most of the Ruby Standard Libraries. The standard libraries are mostly Ruby's own complement of .rb files, but a few that depend on C language-based extensions have been reimplemented. Some are still missing, but we hope to implement as many as is feasible.\nSee Differences Between MRI And JRuby for more information on potential incompatibilities between JRuby and the C implementation of Ruby.\nDevelopment Team\nJRuby's current core development team consists of nine developers:\nCharles Oliver Nutter (Red Hat) aka headius\nThomas Enebo (Red Hat) aka enebo\nNick Sieger (LivingSocial)\nThere are also many past contributors who still help out from time to time:\nOla Bini (Thoughtworks)\nMany more...check out the JRuby commit logs!\nLinks\n"
+
+ jruby-9.1.7.0 :002 > require 'open-uri'
+ => true
+ jruby-9.1.7.0 :004 > contents = open("https://github.com/jruby/jruby/wiki/AboutJRuby").read
+ jruby-9.1.7.0 :005 > Boilerpipe::ArticleExtractor.text(contents)
+ => "Clone this wiki locally\nAbout JRuby\nJRuby is a 100% Java implementation of the Ruby programming language. It is Ruby for the JVM.\nJRuby provides a complete set of core \"builtin\" classes and syntax for the Ruby language, as well as most of the Ruby Standard Libraries. The standard libraries are mostly Ruby's own complement of .rb files, but a few that depend on C language-based extensions have been reimplemented. Some are still missing, but we hope to implement as many as is feasible.\nSee Differences Between MRI And JRuby for more information on potential incompatibilities between JRuby and the C implementation of Ruby.\nDevelopment Team\nJRuby's current core development team consists of nine developers:\nCharles Oliver Nutter (Red Hat) aka headius\nThomas Enebo (Red Hat) aka enebo\nNick Sieger (LivingSocial)\nThere are also many past contributors who still help out from time to time:\nOla Bini (Thoughtworks)\nMany more...check out the JRuby commit logs!\nLinks\n"
+
  ## Development
 
  After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
@@ -34,5 +46,5 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/boilerpipe.
+ Bug reports and pull requests are welcome on GitHub at https://github.com/gregors/boilerpipe.
 
data/jruby-boilerpipe.gemspec CHANGED
@@ -21,4 +21,5 @@ Gem::Specification.new do |spec|
    spec.add_development_dependency "bundler", "~> 1.10"
    spec.add_development_dependency "rake", "~> 10.0"
    spec.add_development_dependency "rspec"
+   spec.add_development_dependency "rickshaw"
  end
data/lib/boilerpipe.rb CHANGED
@@ -1,18 +1,9 @@
- require 'boilerpipe/version'
  require_relative 'boilerpipe-common-2.0-SNAPSHOT-jar-with-dependencies.jar'
-
- module Boilerpipe
-   java_import 'com.kohlschutter.boilerpipe.extractors.ArticleExtractor'
-   java_import java.net.URL
-
-   class ArticleExtractor
-     def self.get_text(s)
-       url = URL.new(s)
-       ArticleExtractor::INSTANCE.get_text(url)
-     end
-
-     class <<self
-       alias_method :text, :get_text
-     end
-   end
- end
+ require 'boilerpipe/version'
+ require 'boilerpipe/sax/boilerpipe_html_parser'
+ require 'boilerpipe/document/document'
+ require 'boilerpipe/extractors/article_extractor'
+ require 'boilerpipe/extractors/default_extractor'
+ require 'boilerpipe/extractors/largest_content_extractor'
+ require 'boilerpipe/filters/filters'
+ require 'boilerpipe/labels/labels'
data/lib/boilerpipe/document/document.rb ADDED
@@ -0,0 +1,16 @@
+ module Boilerpipe
+   module Document
+     java_import 'com.kohlschutter.boilerpipe.document.TextDocument'
+     java_import 'com.kohlschutter.boilerpipe.document.TextBlock'
+
+     class TextBlock
+       # Adding a mapping from ruby symbols to the format string used on the java side
+       # e.g. de.l3s.boilerpipe/INDICATES_END_OF_TEXT is not the same as INDICATES_END_OF_TEXT
+       # This is only for when we do TextBlock#has_label? from jruby
+       def has_label?(l)
+         l = "de.l3s.boilerpipe/#{l.to_s}" if l.is_a?(Symbol)
+         self.hasLabel(l)
+       end
+     end
+   end
+ end
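The `has_label?` wrapper above prefixes Ruby symbols with `de.l3s.boilerpipe/` before handing them to the Java side and passes strings through untouched. A pure-Ruby sketch of just that mapping, runnable without the jar, might look like the following. The constant `LABEL_PREFIX` and the helper `java_label` are illustrative names, not part of the gem.

```ruby
# Illustrative stand-in for the symbol handling in TextBlock#has_label?.
LABEL_PREFIX = 'de.l3s.boilerpipe/'

# Symbols get the Java-side prefix; strings are assumed to be
# fully qualified already and are returned unchanged.
def java_label(label)
  label.is_a?(Symbol) ? "#{LABEL_PREFIX}#{label}" : label
end

puts java_label(:INDICATES_END_OF_TEXT)
puts java_label('de.l3s.boilerpipe/INDICATES_END_OF_TEXT')
```

Both calls print the same fully qualified label.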
data/lib/boilerpipe/extractors/article_extractor.rb ADDED
@@ -0,0 +1,39 @@
+ module Boilerpipe
+   java_import java.net.URL
+
+   module Extractors
+     class ArticleExtractor
+       java_import 'com.kohlschutter.boilerpipe.extractors.ArticleExtractor'
+
+       def self.process(doc)
+         ArticleExtractor::INSTANCE.process doc
+       end
+
+       def self.get_text(s)
+         url = nil
+
+         begin
+           url = Java::JavaNet::URL.new(s)
+         rescue Java::JavaNet::MalformedURLException => e
+           # not a URL
+         end
+         input = url ? url : s
+         ArticleExtractor::INSTANCE.get_text(input)
+       end
+
+       class <<self
+         alias_method :text, :get_text
+       end
+     end
+   end
+
+   class ArticleExtractor
+     def self.get_text(s)
+       Extractors::ArticleExtractor.get_text s
+     end
+
+     class <<self
+       alias_method :text, :get_text
+     end
+   end
+ end
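The new `get_text` above tries to build a `java.net.URL` and falls back to treating the argument as raw HTML when that raises `MalformedURLException`. The same "URL or content" dispatch can be sketched in plain Ruby with the standard `uri` library. This is only an approximation under stated assumptions: Ruby's `URI` does not behave identically to `java.net.URL` (which also accepts protocols such as `file` and `ftp`), and `http_url?` and `input_kind` are hypothetical helper names.

```ruby
require 'uri'

# Hypothetical helper: true when the string parses as an http(s) URL with a host.
def http_url?(input)
  uri = URI.parse(input)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  # Markup and other non-URL text ends up here, mirroring the
  # "not a URL" rescue branch in the wrapper.
  false
end

# Decides how the input would be handed to an extractor.
def input_kind(input)
  http_url?(input) ? :url : :html
end

puts input_kind('https://github.com/jruby/jruby/wiki/AboutJRuby')
puts input_kind('<html><body><p>Hello</p></body></html>')
```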
data/lib/boilerpipe/extractors/default_extractor.rb ADDED
@@ -0,0 +1,13 @@
+ module Boilerpipe::Extractors
+   class DefaultExtractor
+     java_import 'com.kohlschutter.boilerpipe.extractors.DefaultExtractor'
+
+     def self.get_text(s)
+       DefaultExtractor::INSTANCE.get_text s
+     end
+
+     class <<self
+       alias_method :text, :get_text
+     end
+   end
+ end
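Each extractor wrapper exposes `text` as an alias of the class-level `get_text` by opening the singleton class. A self-contained sketch of that idiom is shown below, with a made-up `UpcaseExtractor` standing in for the Java-backed classes so it runs without the jar.

```ruby
# Made-up stand-in class; the real extractors delegate to a Java INSTANCE.
class UpcaseExtractor
  def self.get_text(s)
    s.upcase
  end

  # alias_method inside `class << self` aliases the class-level method,
  # so UpcaseExtractor.text and UpcaseExtractor.get_text run the same code.
  class << self
    alias_method :text, :get_text
  end
end

puts UpcaseExtractor.text('hello')
```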
data/lib/boilerpipe/extractors/largest_content_extractor.rb ADDED
@@ -0,0 +1,13 @@
+ module Boilerpipe::Extractors
+   class LargestContentExtractor
+     java_import 'com.kohlschutter.boilerpipe.extractors.LargestContentExtractor'
+
+     def self.get_text(s)
+       LargestContentExtractor::INSTANCE.get_text s
+     end
+
+     class <<self
+       alias_method :text, :get_text
+     end
+   end
+ end
data/lib/boilerpipe/filters/filters.rb ADDED
@@ -0,0 +1,75 @@
+ module Boilerpipe
+   module Filters
+     java_import 'com.kohlschutter.boilerpipe.filters.english.DensityRulesClassifier'
+     java_import 'com.kohlschutter.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter'
+     java_import 'com.kohlschutter.boilerpipe.filters.english.TerminatingBlocksFinder'
+     java_import 'com.kohlschutter.boilerpipe.filters.english.NumWordsRulesClassifier'
+     java_import 'com.kohlschutter.boilerpipe.filters.english.HeuristicFilterBase'
+
+     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.BlockProximityFusion'
+     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.DocumentTitleMatchClassifier'
+     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.ExpandTitleToContentFilter'
+     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.KeepLargestBlockFilter'
+     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.LargeBlockSameTagLevelToContentFilter'
+     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.ListAtEndFilter'
+     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.SimpleBlockFusionProcessor'
+     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.TrailingHeadlineToBoilerplateFilter'
+
+     java_import 'com.kohlschutter.boilerpipe.filters.simple.BoilerplateBlockFilter'
+
+     class DensityRulesClassifier
+       def self.process(doc)
+         new.process(doc)
+       end
+     end
+
+     class ExpandTitleToContentFilter
+       def self.process(doc)
+         new.process(doc)
+       end
+     end
+
+     class IgnoreBlocksAfterContentFilter
+       def self.process(doc)
+         DEFAULT_INSTANCE.process(doc)
+       end
+     end
+
+     class ListAtEndFilter
+       def self.process(doc)
+         INSTANCE.process(doc)
+       end
+     end
+
+     class LargeBlockSameTagLevelToContentFilter
+       def self.process(doc)
+         INSTANCE.process(doc)
+       end
+     end
+
+     class SimpleBlockFusionProcessor
+       def self.process(doc)
+         new.process(doc)
+       end
+     end
+
+     class TerminatingBlocksFinder
+       def self.process(doc)
+         new.process(doc)
+       end
+     end
+
+     class TrailingHeadlineToBoilerplateFilter
+       def self.process(doc)
+         new.process(doc)
+       end
+     end
+
+     class NumWordsRulesClassifier
+       def self.process(doc)
+         new.process(doc)
+       end
+     end
+
+   end
+ end
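Every filter wrapper above gains a class-level `process(doc)`, which makes it easy to run several filters over one document in sequence. The sketch below shows that chaining pattern in plain Ruby. It assumes each `process` call returns true when it changed the document, and it uses two invented stand-in filters (`StripWhitespace`, `DropShortBlocks`) operating on an array of strings rather than the real Java-backed classes and `TextDocument`.

```ruby
# Invented stand-in filter: trims each block, reports whether anything changed.
class StripWhitespace
  def self.process(doc)
    changed = doc.any? { |block| block != block.strip }
    doc.map!(&:strip)
    changed
  end
end

# Invented stand-in filter: drops blocks with fewer than three words.
class DropShortBlocks
  def self.process(doc)
    before = doc.size
    doc.reject! { |block| block.split.size < 3 }
    doc.size != before
  end
end

# Runs every filter in order (map does not short-circuit) and
# reports whether any of them changed the document.
def run_filters(doc, filters)
  filters.map { |filter| filter.process(doc) }.any?
end

doc = ['  Home  ', 'JRuby is Ruby for the JVM.  ']
changed = run_filters(doc, [StripWhitespace, DropShortBlocks])
puts changed
puts doc.inspect
```

Collecting the results with `map` before calling `any?` keeps later filters running even after an earlier one has already reported a change.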
data/lib/boilerpipe/labels/labels.rb ADDED
@@ -0,0 +1,3 @@
+ module Boilerpipe::Labels
+   java_import 'com.kohlschutter.boilerpipe.labels.DefaultLabels'
+ end
data/lib/boilerpipe/sax/boilerpipe_html_parser.rb ADDED
@@ -0,0 +1,17 @@
+ module Boilerpipe
+   module SAX
+     java_import 'com.kohlschutter.boilerpipe.sax.BoilerpipeHTMLParser'
+     java_import 'org.xml.sax.InputSource'
+     java_import java.io.StringReader
+
+     class BoilerpipeHTMLParser
+       def self.parse(text)
+         parser = BoilerpipeHTMLParser.new
+         string_reader = StringReader.new(text)
+         is = InputSource.new(string_reader)
+         parser.parse(is)
+         parser.to_text_document
+       end
+     end
+   end
+ end
data/lib/boilerpipe/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Boilerpipe
-   VERSION = '0.2.0.rc2'
+   VERSION = '0.2.0'
  end
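The version bump above moves the gem from a prerelease (`0.2.0.rc2`) to a final release (`0.2.0`). As a small sketch of how RubyGems treats the two strings, the snippet below uses `Gem::Version` from the standard RubyGems library; the variable names are illustrative only.

```ruby
require 'rubygems'

rc    = Gem::Version.new('0.2.0.rc2')
final = Gem::Version.new('0.2.0')

# A version segment containing letters marks a prerelease.
puts rc.prerelease?
puts final.prerelease?

# Prerelease versions sort before the matching final release.
puts rc < final
```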
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: jruby-boilerpipe
  version: !ruby/object:Gem::Version
-   version: 0.2.0.rc2
+   version: 0.2.0
  platform: ruby
  authors:
  - Gregory Ostermayr
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2016-03-13 00:00:00.000000000 Z
+ date: 2017-09-11 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    requirement: !ruby/object:Gem::Requirement
@@ -52,6 +52,20 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '0'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   name: rickshaw
+   prerelease: false
+   type: :development
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
  description: Java8 compiled - latest boilerpipe-2.0-SNAPSHOT - including all dependencies
  email:
  - "<gregory.ostermayr@gmail.com>"
@@ -69,6 +83,13 @@ files:
  - jruby-boilerpipe.gemspec
  - lib/boilerpipe-common-2.0-SNAPSHOT-jar-with-dependencies.jar
  - lib/boilerpipe.rb
+ - lib/boilerpipe/document/document.rb
+ - lib/boilerpipe/extractors/article_extractor.rb
+ - lib/boilerpipe/extractors/default_extractor.rb
+ - lib/boilerpipe/extractors/largest_content_extractor.rb
+ - lib/boilerpipe/filters/filters.rb
+ - lib/boilerpipe/labels/labels.rb
+ - lib/boilerpipe/sax/boilerpipe_html_parser.rb
  - lib/boilerpipe/version.rb
  homepage: https://github.com/gregors/jruby-boilerpipe
  licenses: []
@@ -84,12 +105,12 @@ required_ruby_version: !ruby/object:Gem::Requirement
        version: '0'
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
-   - - ">"
+   - - ">="
      - !ruby/object:Gem::Version
-       version: 1.3.1
+       version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.4.8
+ rubygems_version: 2.6.11
  signing_key:
  specification_version: 4
  summary: Ruby wrapper around boilerpipe java library