jruby-boilerpipe 0.0.6 → 0.1.0.rc1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
- SHA256:
- metadata.gz: 17930deca0970a5d71a16b90085cca6db49e9b709612ea553a924fd0cc794430
- data.tar.gz: c41cd2edfb479ef70e80ed1ca1af334f4d4dc47845d79709e785062d3d42c129
+ SHA1:
+ metadata.gz: ad1684985b263829ee4bfc46478227796b911523
+ data.tar.gz: 2fd543163aeda4aa9f3ffbfa9678bcbe3aeaeca7
  SHA512:
- metadata.gz: 8800a20e07c54e100cffa028fe9374dacd558ffbc760a718c6aca00133a708ee6f67aced9eadf488860939166ae78d8d593cb7b7702dd08306e3d2a4dbe0fc41
- data.tar.gz: 9b7dc70d9a3ef6951ca91552d243bdb69aaf4accb6a47e9a6757a85286816dc44049d8381ff3dde391cb096b784e4edc3508290fd1173cb27c22a86d0fa77bcc
+ metadata.gz: 3f7bb9a3231b8f63d1b0ee22446e6a291f8e0d21303375826862820d2e102c2b08bcff73da4257693242d53d95a16fda54b4fe0fb4deb49568ebb128d7391569
+ data.tar.gz: 37f7dade511d8ac8ef517c80df8c32b04ac8de44da1b3455461bbd964c95c44e974b435437cc1a790abadb03ecb04089909a0d5d8904759884f551ee1c1ea24b
data/README.md CHANGED
@@ -1,13 +1,13 @@
  # Boilerpipe
 
- I saw other gems wrapping boilerpipe but they seemed to be outdated, hit the free api or I couldn't get them to work because of dependency issues. I went directly to the original author's github https://github.com/kohlschutter/boilerpipe and forked that code base here https://github.com/gregors/boilerpipe . I made one notible change that is to add a user agent so article extractor wouldn't return a 403 for a variety of web servers. I compiled all dependencies into the jar using the maven-assembly-plugin using Java 8. Until the original author releases a proper 2.0 release I'll be appending rcx to reflect the latest snapshots.
+ I saw other gems wrapping boilerpipe but they seemed to be outdated, hit the free api or I couldn't get them to work because of dependency issues. I went directly to the original author's github https://github.com/kohlschutter/boilerpipe and forked that code base here https://github.com/gregors/boilerpipe . I made one notible change that is to add a user agent so article extractor wouldn't return a 403 for a vareity of web servers. I compiled all dependencies into the jar using the maven-assembly-plugin using Java 8. Until the original author releases a proper 2.0 release I'll be appending rcx to reflect the latest snapshots.
 
  ## Installation
 
  Add this line to your application's Gemfile:
 
  ```ruby
- gem 'jruby-boilerpipe', require: 'boilerpipe'
+ gem 'boilerpipe'
  ```
 
  And then execute:
@@ -16,22 +16,16 @@ And then execute:
 
  Or install it yourself as:
 
- $ gem install jruby-boilerpipe
+ $ gem install boilerpipe
 
  ## Usage
- You can feed Boilerpipe:ArticleExractor either a valid url or html content.
-
- jruby-9.1.7.0 :001 > require 'boilerpipe'
- => true
- jruby-9.1.7.0 :003 > Boilerpipe::ArticleExtractor.text("https://github.com/jruby/jruby/wiki/AboutJRuby")
- => "Clone this wiki locally\nAbout JRuby\nJRuby is a 100% Java implementation of the Ruby programming language. It is Ruby for the JVM.\nJRuby provides a complete set of core \"builtin\" classes and syntax for the Ruby language, as well as most of the Ruby Standard Libraries. The standard libraries are mostly Ruby's own complement of .rb files, but a few that depend on C language-based extensions have been reimplemented. Some are still missing, but we hope to implement as many as is feasible.\nSee Differences Between MRI And JRuby for more information on potential incompatibilities between JRuby and the C implementation of Ruby.\nDevelopment Team\nJRuby's current core development team consists of nine developers:\nCharles Oliver Nutter (Red Hat) aka headius\nThomas Enebo (Red Hat) aka enebo\nNick Sieger (LivingSocial)\nThere are also many past contributors who still help out from time to time:\nOla Bini (Thoughtworks)\nMany more...check out the JRuby commit logs!\nLinks\n"
-
- jruby-9.1.7.0 :002 > require 'open-uri'
- => true
- jruby-9.1.7.0 :004 > contents = open("https://github.com/jruby/jruby/wiki/AboutJRuby").read
- jruby-9.1.7.0 :005 > Boilerpipe::ArticleExtractor.text(contents)
- => "Clone this wiki locally\nAbout JRuby\nJRuby is a 100% Java implementation of the Ruby programming language. It is Ruby for the JVM.\nJRuby provides a complete set of core \"builtin\" classes and syntax for the Ruby language, as well as most of the Ruby Standard Libraries. The standard libraries are mostly Ruby's own complement of .rb files, but a few that depend on C language-based extensions have been reimplemented. Some are still missing, but we hope to implement as many as is feasible.\nSee Differences Between MRI And JRuby for more information on potential incompatibilities between JRuby and the C implementation of Ruby.\nDevelopment Team\nJRuby's current core development team consists of nine developers:\nCharles Oliver Nutter (Red Hat) aka headius\nThomas Enebo (Red Hat) aka enebo\nNick Sieger (LivingSocial)\nThere are also many past contributors who still help out from time to time:\nOla Bini (Thoughtworks)\nMany more...check out the JRuby commit logs!\nLinks\n"
-
+
+ jruby-1.7.24 :001 > require 'boilerpipe'
+ => true
+ jruby-1.7.24 :002 > Boilerpipe::ArticleExtractor.text("https://github.com/jruby/jruby/wiki/AboutJRuby")
+ => "Clone this wiki locally\nAbout JRuby\nJRuby is a 100% Java implementation of the Ruby programming language. It is Ruby for the JVM.\nJRuby provides a complete set of core \"builtin\" classes and syntax for the Ruby language, as well as most of the Ruby Standard Libraries. The standard libraries are mostly Ruby's own complement of .rb files, but a few that depend on C language-based extensions have been reimplemented. Some are still missing, but we hope to implement as many as is feasible.\nSee Differences Between MRI And JRuby for more information on potential incompatibilities between JRuby and the C implementation of Ruby.\nDevelopment Team\nJRuby's current core development team consists of nine developers:\nCharles Oliver Nutter (Red Hat) aka headius\nThomas Enebo (Red Hat) aka enebo\nNick Sieger (LivingSocial)\n"
+
+
  ## Development
 
  After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
@@ -40,5 +34,5 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/gregors/boilerpipe.
+ Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/boilerpipe.
 
@@ -1,3 +1,3 @@
  module Boilerpipe
-   VERSION = '0.0.6'
+   VERSION = '0.1.0.rc1'
  end
data/lib/boilerpipe.rb CHANGED
@@ -1,7 +1,18 @@
- require_relative 'boilerpipe-common-2.0-SNAPSHOT-jar-with-dependencies.jar'
  require 'boilerpipe/version'
- require 'boilerpipe/sax/boilerpipe_html_parser'
- require 'boilerpipe/document/document'
- require 'boilerpipe/extractors/article_extractor'
- require 'boilerpipe/filters/filters'
- require 'boilerpipe/labels/labels'
+ require_relative 'boilerpipe-common-2.0-SNAPSHOT-jar-with-dependencies.jar'
+
+ module Boilerpipe
+   java_import 'com.kohlschutter.boilerpipe.extractors.ArticleExtractor'
+   java_import java.net.URL
+
+   class ArticleExtractor
+     def self.get_text(s)
+       url = URL.new(s)
+       ArticleExtractor::INSTANCE.get_text(url)
+     end
+
+     class <<self
+       alias_method :text, :get_text
+     end
+   end
+ end
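
Note on the `lib/boilerpipe.rb` change: in 0.1.0.rc1, `get_text` passes its argument straight to `URL.new` with no rescue, so a raw HTML string would presumably raise a `MalformedURLException` instead of being treated as content, which is what the removed 0.0.6 `article_extractor.rb` did. The sketch below only illustrates that difference in plain Ruby. `ExtractorSketch`, `URL_PATTERN`, `lenient_kind`, and `strict_kind` are hypothetical names that are not part of the gem, and the regex is a rough stand-in for Java's URL parsing, not a copy of it.

```ruby
# Hypothetical stand-in for the Java-backed extractor; runs on any Ruby, no jar needed.
module ExtractorSketch
  # Assumption: "looks like a URL" means it starts with a scheme followed by "://".
  # java.net.URL accepts more forms than this; the pattern is only illustrative.
  URL_PATTERN = %r{\A[a-z][a-z0-9+.\-]*://}i

  # Roughly the 0.0.6 behaviour: use the input as a URL when it looks like one,
  # otherwise fall back to treating it as raw HTML.
  def self.lenient_kind(input)
    input.match?(URL_PATTERN) ? :url : :html
  end

  # Roughly the 0.1.0.rc1 behaviour: anything that is not a URL is an error.
  def self.strict_kind(input)
    raise ArgumentError, "not a URL: #{input[0, 20]}" unless input.match?(URL_PATTERN)
    :url
  end
end

puts ExtractorSketch.lenient_kind("https://github.com/jruby/jruby/wiki/AboutJRuby")
puts ExtractorSketch.lenient_kind("<html><body>hello</body></html>")
```

Under these assumptions, code that feeds HTML strings to `Boilerpipe::ArticleExtractor.text` would only work against the 0.0.6 side of this diff.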
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: jruby-boilerpipe
  version: !ruby/object:Gem::Version
- version: 0.0.6
+ version: 0.1.0.rc1
  platform: ruby
  authors:
  - Gregory Ostermayr
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2017-09-06 00:00:00.000000000 Z
+ date: 2016-03-13 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
@@ -69,11 +69,6 @@ files:
  - jruby-boilerpipe.gemspec
  - lib/boilerpipe-common-2.0-SNAPSHOT-jar-with-dependencies.jar
  - lib/boilerpipe.rb
- - lib/boilerpipe/document/document.rb
- - lib/boilerpipe/extractors/article_extractor.rb
- - lib/boilerpipe/filters/filters.rb
- - lib/boilerpipe/labels/labels.rb
- - lib/boilerpipe/sax/boilerpipe_html_parser.rb
  - lib/boilerpipe/version.rb
  homepage: https://github.com/gregors/jruby-boilerpipe
  licenses: []
@@ -89,12 +84,12 @@ required_ruby_version: !ruby/object:Gem::Requirement
  version: '0'
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
- - - ">="
+ - - ">"
  - !ruby/object:Gem::Version
- version: '0'
+ version: 1.3.1
  requirements: []
  rubyforge_project:
- rubygems_version: 2.6.11
+ rubygems_version: 2.4.8
  signing_key:
  specification_version: 4
  summary: Ruby wrapper around boilerpipe java library
@@ -1,16 +0,0 @@
- module Boilerpipe
-   module Document
-     java_import 'com.kohlschutter.boilerpipe.document.TextDocument'
-     java_import 'com.kohlschutter.boilerpipe.document.TextBlock'
-
-     class TextBlock
-       # Adding a mapping from ruby symbols to the format string used on the java side
-       # e.g. de.l3s.boilerpipe/INDICATES_END_OF_TEXT is not the same as INDICATES_END_OF_TEXT
-       # This is only for when we do TextBlock#has_label? from jruby
-       def has_label?(l)
-         l = "de.l3s.boilerpipe/#{l.to_s}" if l.is_a?(Symbol)
-         self.hasLabel(l)
-       end
-     end
-   end
- end
@@ -1,23 +0,0 @@
- module Boilerpipe
-   java_import 'com.kohlschutter.boilerpipe.extractors.ArticleExtractor'
-   java_import 'com.kohlschutter.boilerpipe.util.UnicodeTokenizer'
-   java_import java.net.URL
-
-   class ArticleExtractor
-     def self.get_text(s)
-       url = nil
-
-       begin
-         url = Java::JavaNet::URL.new(s)
-       rescue Java::JavaNet::MalformedURLException => e
-         # not a URL
-       end
-       input = url ? url : s
-       ArticleExtractor::INSTANCE.get_text(input)
-     end
-
-     class <<self
-       alias_method :text, :get_text
-     end
-   end
- end
@@ -1,47 +0,0 @@
- module Boilerpipe
-   module Filters
-     java_import 'com.kohlschutter.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter'
-     java_import 'com.kohlschutter.boilerpipe.filters.english.TerminatingBlocksFinder'
-     java_import 'com.kohlschutter.boilerpipe.filters.english.NumWordsRulesClassifier'
-     java_import 'com.kohlschutter.boilerpipe.filters.english.HeuristicFilterBase'
-     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.BlockProximityFusion'
-     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.DocumentTitleMatchClassifier'
-     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.ExpandTitleToContentFilter'
-     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.KeepLargestBlockFilter'
-     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.LargeBlockSameTagLevelToContentFilter'
-     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.ListAtEndFilter'
-     java_import 'com.kohlschutter.boilerpipe.filters.heuristics.TrailingHeadlineToBoilerplateFilter'
-     java_import 'com.kohlschutter.boilerpipe.filters.simple.BoilerplateBlockFilter'
-
-     class ExpandTitleToContentFilter
-       def self.process(doc)
-         new.process(doc)
-       end
-     end
-
-     class IgnoreBlocksAfterContentFilter
-       def self.process(doc)
-         DEFAULT_INSTANCE.process(doc)
-       end
-     end
-
-     class TerminatingBlocksFinder
-       def self.process(doc)
-         new.process(doc)
-       end
-     end
-
-     class TrailingHeadlineToBoilerplateFilter
-       def self.process(doc)
-         new.process(doc)
-       end
-     end
-
-     class NumWordsRulesClassifier
-       def self.process(doc)
-         new.process(doc)
-       end
-     end
-
-   end
- end
@@ -1,3 +0,0 @@
- module Boilerpipe::Labels
-   java_import 'com.kohlschutter.boilerpipe.labels.DefaultLabels'
- end
@@ -1,17 +0,0 @@
- module Boilerpipe
-   module SAX
-     java_import 'com.kohlschutter.boilerpipe.sax.BoilerpipeHTMLParser'
-     java_import 'org.xml.sax.InputSource'
-     java_import java.io.StringReader
-
-     class BoilerpipeHTMLParser
-       def self.parse(text)
-         parser = BoilerpipeHTMLParser.new
-         string_reader = StringReader.new(text)
-         is = InputSource.new(string_reader)
-         parser.parse(is)
-         parser.to_text_document
-       end
-     end
-   end
- end