nhkore 0.3.0 → 0.3.1

This diff shows the changes between publicly released package versions as they appear in their respective public registries, and is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: a02c041aff2b0b040ff00acaeeaa54506e574e9575613b48c4dd75ef0ef45564
- data.tar.gz: dd3f570c0e7b7223c039d4119989e04514ef0d4bf4fb53485397271021f39246
+ metadata.gz: fb2c0e6e53995b874a9e53c44b024f993032433d1a87c37e7b7bdea69965902d
+ data.tar.gz: 13d34c53fe9af9efa985c05089b1588eb1e76d6321f9aff18cc5da80598a52d4
  SHA512:
- metadata.gz: 4f5021ab1fd74bb1c5a42574fa1045f71069b6f8ab6cf7b1717e6164505127e6c657f2e36be903dc190d356bed83fdf8c2de4c89644c7676863cfb9a8c53da8f
- data.tar.gz: e082a6ed70bacccb763386e00d8ca92351d4ee8d9f2d32a9b79dc6a2733ea46cd739e95550124e4807281c58f3a65faf9b8496740a51cca13e068ecd3e882d3a
+ metadata.gz: 643723d42e939a7852eca3b90c3ec4e65085838317eb59c1d8f21f79dd647d2e77e5ea68ab2ff3b5a208608f9bf350121a9918cb318dec6c3047731b73f59294
+ data.tar.gz: 3481fea3a3895a5b85ac3fcd5a77fe9b811f84e9a19b395a1de1d2e9b31fda93c5fb49a8d7d43581e05cb90c6f844f8537c5a97d73937c2b8ee97728ac7c7a1f
data/CHANGELOG.md CHANGED
@@ -2,7 +2,21 @@
 
  Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
- ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.0...master)
+ ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.1...master)
+
+ ## [v0.3.1] - 2020-04-20
+
+ ### Changed
+ - Fleshed out more of the README.
+ - NewsCmd/SiftCmd
+   - Added `--no-sha256` option to not check if article links have already been scraped based on their contents' SHA-256.
+ - Util
+   - Changed `dir_str?()` and `filename_str?()` to check for any slash. Previously, they only checked the system's slash; now, on both Windows & Linux, they check for both `/` & `\`.
+
+ ### Fixed
+ - Reduced load time of the app from ~1s to ~0.3-0.5s by moving some requires into methods.
+ - BingScraper
+   - Fixed possible RSS infinite loop.
 
  ## [v0.3.0] - 2020-04-12
 
@@ -13,7 +27,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
  ### Changed
  - BingCmd => SearchCmd
    - Major (breaking) change.
-   - Changed `$nhkore bing easy` to:
+   - Changed `$ nhkore bing easy` to:
      - `$ nhkore search easy bing`
      - `$ nhkore se ez b`
  - App
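
The load-time fix in v0.3.1 comes from deferring `require` calls until a command actually needs them, a pattern the hunks below apply repeatedly. A minimal sketch of the idea (the `heavy_dep` gem name is hypothetical, for illustration only):

```Ruby
module NHKore
  class SomeCmd
    def run()
      # A top-level `require 'heavy_dep'` would run at startup for every
      # command. Requiring inside the method defers that cost until this
      # command actually runs; `require` is idempotent, so calls after
      # the first are cheap no-ops.
      require 'heavy_dep' # hypothetical gem, for illustration only

      HeavyDep.do_work()
    end
  end
end
```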
data/README.md CHANGED
@@ -10,7 +10,7 @@ A CLI app that scrapes [NHK News Web Easy](https://www3.nhk.or.jp/news/easy/) to
 
  This is similar to a [core word/vocabulary list](https://www.fluentin3months.com/core-japanese-words/), hence the name NHKore.
 
- [![asciinema Demo - Help](https://asciinema.org/a/MQTJ9vxcpB7VYAKzke7m4QM7P.png)](https://asciinema.org/a/MQTJ9vxcpB7VYAKzke7m4QM7P?speed=2)
+ [![asciinema Demo](https://asciinema.org/a/318958.png)](https://asciinema.org/a/318958)
 
  ## Contents
 
@@ -18,7 +18,7 @@ This is similar to a [core word/vocabulary list](https://www.fluentin3months.com
  - [Installing](#installing-)
  - [Using](#using-)
    - [The Basics](#the-basics-)
-   - [Unlimited Power!](#unlimited-power-)
+   - [Unlimited Powah!](#unlimited-powah-)
      - [Get Command](#get-command-)
      - [Sift Command](#sift-command-)
    - [Sakura Fields Forever](#sakura-fields-forever-)
@@ -51,8 +51,8 @@ Manually:
  ```
  $ git clone 'https://github.com/esotericpig/nhkore.git'
  $ cd nhkore
- $ gem build nhkore.gemspec
- $ gem install *.gem
+ $ bundle install
+ $ bundle exec rake install:local
  ```
 
  If there are errors running `nhkore`, you may need to also [install Nokogiri](https://nokogiri.org/tutorials/installing_nokogiri.html) manually, which is used for scraping HTML.
@@ -118,22 +118,15 @@ $ nhkore sift easy -e html
  $ nhkore sift easy -e yml
  ```
 
- If you have other scraped articles, then you'll need to filter down to the specific one:
+ Complete demo:
 
- | Command | Description |
- | --- | --- |
- | `$ nhkore sift easy -u k10011862381000` | Filter by URL |
- | `$ nhkore sift easy -t '植えられた桜'` | Filter by title |
- | `$ nhkore sift easy -d '2019-3-29 11:30'` | Filter by date time |
- | `$ nhkore sift easy -d '2019-3-29' -t '桜'` | Filter by date time & title |
- | `$ nhkore sift easy -d '2019-3-29' -t '桜' -e html` | Filter & output HTML |
- | `$ nhkore sift easy -d '2019-3-29' -t '桜' -o 'sakura.html'` | Filter & output HTML |
+ [![asciinema Demo - The Basics](https://asciinema.org/a/318958.png)](https://asciinema.org/a/318958)
 
- Complete demo:
+ ### Unlimited Powah! [^](#contents)
 
- [![asciinema Demo - The Basics](https://asciinema.org/a/316571.png)](https://asciinema.org/a/316571)
+ Generate a core word list (e.g., CSV file) for 1 or more pre-scraped articles with ease.
 
- ### Unlimited Power! [^](#contents)
+ Unlimited powah at your fingertips!
 
  #### Get Command [^](#contents)
 
@@ -151,7 +144,7 @@ By default, it will extract the data to `./core/`. You can change this:
 
  Complete demo:
 
- [![asciinema Demo - Get](https://asciinema.org/a/317773.png)](https://asciinema.org/a/317773)
+ [![asciinema Demo - Get](https://asciinema.org/a/318967.png)](https://asciinema.org/a/318967)
 
  #### Sift Command [^](#contents)
 
@@ -189,12 +182,21 @@ You can filter the data by using different options:
  Filter examples:
 
  ```
+ # Filter by URL.
+ $ nhkore sift easy -u 'k10011862381000'
+
+ # Filter by title.
+ $ nhkore sift easy -t 'マリオ'
+ $ nhkore sift easy -t '植えられた桜'
+
+ # Filter by date time.
  $ nhkore sift easy -d 2019
  $ nhkore sift easy -d '2019-12'
- $ nhkore sift easy -d '2019-7-4...9' # July 4th to 9th of 2019
+ $ nhkore sift easy -d '2019-7-4...9' # July 4th to 9th of 2019
  $ nhkore sift easy -d '2019-12-25 13:10'
- $ nhkore sift easy -t 'マリオ'
- $ nhkore sift easy -u 'k10011862381000'
+
+ # Filter by date time & title.
+ $ nhkore sift easy -d '2019-3-29' -t '桜'
  ```
 
  You can save the data to a different format using one of these options:
@@ -232,10 +234,14 @@ Lastly, you can ignore certain columns from the output. Definitions can be quite
 
  Complete demo:
 
- [![asciinema Demo - Sift](https://asciinema.org/a/318119.png)](https://asciinema.org/a/318119)
+ [![asciinema Demo - Sift](https://asciinema.org/a/318982.png)](https://asciinema.org/a/318982)
 
  ### Sakura Fields Forever [^](#contents)
 
+ No more waiting on a new release with pre-scraped files.
+
+ Scrape all of the latest articles for yourself, forever!
+
  #### Search Command [^](#contents)
 
  The [news](#news-command-) command (for scraping articles) relies on having a file of article links.
@@ -258,9 +264,9 @@ links:
 
  Only the key (which is the URL) and the `url` field are required. The rest of the fields will be populated when you scrape the data.
 
- > <rambling>
- > Originally, I was planning on using a different key so that's why the URL is duplicated. This also allows for a possible future breaking version (major version change) to alter the key. In addition, I was originally planning to allow filtering in this file, so that's why additional fields are populated after scraping the data.
- > </rambling>
+ > <rambling>
+ > Originally, I was planning on using a different key so that's why the URL is duplicated. This also allows for a possible future breaking version (major version change) to alter the key. In addition, I was originally planning to allow filtering in this file, so that's why additional fields are populated after scraping the data.
+ > </rambling>
 
  Example after running the `news` command:
 
@@ -287,6 +293,30 @@ links:
 
  If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
 
+ Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
+
+ Example usage:
+
+ `$ nhkore search easy bing`
+
+ There are a few notable options:
+
+ ```
+ -r --results=<value>      number of results per page to request from search
+                           (default: 100)
+    --show-count           show the number of links scraped and exit;
+                           useful for manually writing/updating scripts
+                           (but not for use in a variable);
+                           implies '--dry-run' option
+    --show-urls            show the URLs -- if any -- used when searching &
+                           scraping and exit; you can download these for offline
+                           testing and/or slow internet (see '--in' option)
+ ```
+
+ Complete demo:
+
+ [![asciinema Demo - Search](https://asciinema.org/a/320457.png)](https://asciinema.org/a/320457)
+
  #### News Command [^](#contents)
 
  ## Using the Library [^](#contents)
@@ -306,7 +336,7 @@ In your *Gemfile*:
  ```Ruby
  # Pick one...
  gem 'nhkore', '~> X.X'
- gem 'nhkore', :git => 'https://github.com/esotericpig/psychgus.git', :tag => 'vX.X'
+ gem 'nhkore', :git => 'https://github.com/esotericpig/psychgus.git', :tag => 'vX.X.X'
  ```
 
  ### Scraper
data/lib/nhkore/app.rb CHANGED
@@ -24,8 +24,6 @@
  require 'cri'
  require 'highline'
  require 'rainbow'
- require 'set'
- require 'tty-progressbar'
  require 'tty-spinner'
 
  require 'nhkore/error'
@@ -320,6 +318,8 @@ module NHKore
  def build_progress_bar(title,download: false,total: 100,type: @progress_bar,width: 33,**kargs)
    case type
    when :default,:classic
+     require 'tty-progressbar'
+
      msg = "#{title} [:bar] :percent :eta".dup()
      msg << ' :byte_rate/s' if download
 
@@ -59,13 +59,13 @@ module CLI
  bars = nil
 
  if @cmd_opts[:all]
-   bars = [:default,:classic,:no]
+   bars = {default: :default,classic: :classic,no: :no}
  else
-   bars = [@progress_bar]
+   bars = {user: @progress_bar}
  end
 
- bars.each() do |bar|
-   name = (bars.length == 1) ? 'User' : bar.to_s().capitalize()
+ bars.each() do |name,bar|
+   name = name.to_s().capitalize()
    bar = build_progress_bar("Testing #{name} progress",download: false,type: bar)
 
    bar.start()
@@ -21,10 +21,6 @@
  #++
 
 
- require 'down/net_http'
- require 'tempfile'
- require 'zip'
-
  require 'nhkore/util'
 
 
@@ -73,6 +69,10 @@ module CLI
  end
 
  def run_get_cmd()
+   require 'down/net_http'
+   require 'tempfile'
+   require 'zip'
+
    build_out_dir(:out,default_dir: Util::CORE_DIR)
 
    return unless check_out_dir(:out)
@@ -97,6 +97,11 @@ module CLI
    do not try to parse the dictionary files for the articles; useful in case of errors trying to load
    the dictionaries (or for offline testing)
  EOD
+ flag :H,'no-sha256',<<-EOD
+   do not check the SHA-256 of the content to see if an article has already been scraped;
+   for example, 2 URLs with the same content, but 1 with 'http' & 1 with 'https', will both be scraped;
+   this is useful if 2 articles have the same SHA-256, but different content (unlikely)
+ EOD
  option :o,:out,<<-EOD,argument: :required,transform: -> (value) do
    'directory/file' to save words to; if you only specify a directory or a file, it will attach
    the appropriate default directory/file name
@@ -196,6 +201,7 @@ module CLI
  max_scrapes = @cmd_opts[:scrape]
  max_scrapes = DEFAULT_NEWS_SCRAPE if max_scrapes.nil?()
  missingno = @cmd_opts[:missingno]
+ no_sha256 = @cmd_opts[:no_sha256]
  out_file = @cmd_opts[:out]
  redo_scrapes = @cmd_opts[:redo]
  show_dict = @cmd_opts[:show_dict]
@@ -219,7 +225,9 @@ module CLI
  scrape_count = 0
 
  if File.exist?(out_file)
-   news = (type == :yasashii) ? YasashiiNews.load_file(out_file) : FutsuuNews.load_file(out_file)
+   news = (type == :yasashii) ?
+     YasashiiNews.load_file(out_file,overwrite: no_sha256) :
+     FutsuuNews.load_file(out_file,overwrite: no_sha256)
  else
    news = (type == :yasashii) ? YasashiiNews.new() : FutsuuNews.new()
  end
@@ -357,9 +365,11 @@ module CLI
  def scraped_news_article?(news,link)
    return true if link.scraped?()
 
+   no_sha256 = @cmd_opts[:no_sha256]
+
    article = news.article(link.url)
 
-   if article.nil?()
+   if !no_sha256 && article.nil?()
      if !Util.empty_web_str?(link.sha256) && news.sha256?(link.sha256)
        article = news.article_with_sha256(link.sha256)
      end
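
For context, the duplicate check that `--no-sha256` disables hinges on hashing an article's scraped content. A rough sketch of the idea (not the gem's exact code; `already_scraped?` is a hypothetical helper):

```Ruby
require 'digest'

# Two URLs (e.g., one 'http' & one 'https') can serve identical content.
# Hashing the content catches such duplicates regardless of the URL.
def already_scraped?(scraped_sha256s,content)
  return scraped_sha256s.include?(Digest::SHA256.hexdigest(content))
end
```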
@@ -118,6 +118,10 @@ module CLI
  EOD
    app.check_empty_opt(:out,value)
  end
+ flag :H,'no-sha256',<<-EOD
+   if you used this option with the 'news' command, then you'll also need this option here
+   to not fail on "duplicate" articles; see '#{App::NAME} news'
+ EOD
  option :t,:title,'title to filter on, where search text only needs to be somewhere in the title',
    argument: :required
  option :u,:url,'URL to filter on, where search text only needs to be somewhere in the URL',
@@ -326,13 +330,16 @@ module CLI
  in_file = @cmd_opts[:in]
  no_defn = @cmd_opts[:no_defn]
  no_eng = @cmd_opts[:no_eng]
+ no_sha256 = @cmd_opts[:no_sha256]
  out_file = @cmd_opts[:out]
  title_filter = @cmd_opts[:title]
  url_filter = @cmd_opts[:url]
 
  start_spin("Sifting NHK News Web #{news_name} data")
 
- news = (type == :yasashii) ? YasashiiNews.load_file(in_file) : FutsuuNews.load_file(in_file)
+ news = (type == :yasashii) ?
+   YasashiiNews.load_file(in_file,overwrite: no_sha256) :
+   FutsuuNews.load_file(in_file,overwrite: no_sha256)
 
  sifter = Sifter.new(news)
 
@@ -21,7 +21,6 @@
  #++
 
 
- require 'json'
  require 'nhkore/dict'
  require 'nhkore/error'
  require 'nhkore/scraper'
@@ -59,6 +58,8 @@ module NHKore
  end
 
  def scrape()
+   require 'json'
+
    json = JSON.load(@str_or_io)
 
    return Dict.new() if json.nil?()
data/lib/nhkore/news.rb CHANGED
@@ -73,7 +73,7 @@ module NHKore
    coder[:articles] = @articles
  end
 
- def self.load_data(data,article_class: Article,file: nil,news_class: News,**kargs)
+ def self.load_data(data,article_class: Article,file: nil,news_class: News,overwrite: false,**kargs)
    data = Util.load_yaml(data,file: file)
 
    articles = data[:articles]
@@ -83,7 +83,7 @@ module NHKore
  if !articles.nil?()
    articles.each() do |key,hash|
      key = key.to_s() # Change from a symbol
-     news.add_article(article_class.load_data(key,hash),key: key)
+     news.add_article(article_class.load_data(key,hash),key: key,overwrite: overwrite)
    end
  end
 
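The new `overwrite:` keyword threads the CLI's `--no-sha256` flag down to `add_article()`, so loading a file that contains "duplicate" content no longer fails. A hedged sketch of the resulting call (the file path is illustrative):

```Ruby
no_sha256 = true # e.g., from @cmd_opts[:no_sha256]

# overwrite: true tolerates articles whose content hashes collide.
news = NHKore::YasashiiNews.load_file('articles.yml',overwrite: no_sha256)
```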
@@ -21,10 +21,8 @@
  #++
 
 
- require 'http-cookie'
  require 'nokogiri'
  require 'open-uri'
- require 'rss'
 
  require 'nhkore/user_agents'
  require 'nhkore/util'
@@ -42,7 +40,7 @@ module NHKore
  'dnt' => '1',
  }
 
- attr_accessor :is_cookie
+ attr_accessor :eat_cookie
  attr_accessor :is_file
  attr_reader :kargs
  attr_accessor :max_redirects
@@ -51,7 +49,7 @@ module NHKore
  attr_accessor :str_or_io
  attr_accessor :url
 
- alias_method :is_cookie?,:is_cookie
+ alias_method :eat_cookie?,:eat_cookie
  alias_method :is_file?,:is_file
 
  # +max_redirects+ defaults to 3 for safety (infinite-loop attack).
@@ -60,10 +58,10 @@ module NHKore
  #
  # Pass in +header: {}+ for the default HTTP header fields to be set.
  #
- # @param is_cookie [true,false] true to set the HTTP header field 'cookie', which can be an expensive
- #   (time-consuming) operation since it opens the URL again, but necessary for some URLs.
+ # @param eat_cookie [true,false] true to set the HTTP header field 'cookie', which can be an expensive
+ #   (time-consuming) operation since it opens the URL again, but necessary for some URLs.
  # @param redirect_rule [nil,:lenient,:strict]
- def initialize(url,header: nil,is_cookie: false,is_file: false,max_redirects: 3,max_retries: 3,redirect_rule: :strict,str_or_io: nil,**kargs)
+ def initialize(url,eat_cookie: false,header: nil,is_file: false,max_redirects: 3,max_retries: 3,redirect_rule: :strict,str_or_io: nil,**kargs)
    super()
 
    if !header.nil?() && !is_file
@@ -77,7 +75,7 @@ module NHKore
    kargs.merge!(header)
  end
 
- @is_cookie = is_cookie
+ @eat_cookie = eat_cookie
  @is_file = is_file
  @kargs = kargs
  @max_redirects = max_redirects
@@ -88,6 +86,8 @@ module NHKore
  end
 
  def fetch_cookie(url)
+   require 'http-cookie'
+
    open_url(url)
 
    cookies = Array(@str_or_io.meta['set-cookie']) # nil will be []
@@ -128,7 +128,7 @@ module NHKore
      # NHK's website tends to always use UTF-8.
      @str_or_io = File.open(url,'rt:UTF-8',**@kargs)
    else
-     fetch_cookie(url) if @is_cookie
+     fetch_cookie(url) if @eat_cookie
      open_url(url)
    end
  end
@@ -195,6 +195,8 @@ module NHKore
    end
 
    def rss_doc()
+     require 'rss'
+
      return RSS::Parser.parse(@str_or_io,validate: false)
    end
  end
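
Given the renamed keyword in `initialize()` above, constructing a scraper that fetches and sends a cookie looks roughly like this (a sketch, not a documented example from the gem; the URL is illustrative):

```Ruby
require 'nhkore/scraper'

# eat_cookie: true makes open() call fetch_cookie() first, which opens the
# URL once just to capture 'set-cookie' -- slower, but required by some
# sites (e.g., search engines). header: {} triggers the default HTTP
# header fields.
scraper = NHKore::Scraper.new(
  'https://www.bing.com/search?q=example', # illustrative URL
  header: {},
  eat_cookie: true,
)
```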
@@ -45,9 +45,10 @@ module NHKore
  # - https://www3.nhk.or.jp/news/easy/article/disaster_heat.html
  YASASHII_REGEX = /\A[^\.]+\.#{Regexp.quote(YASASHII_SITE)}.+\.html?/i
 
- # Pass in +header: {}+ to trigger using the default HTTP header fields.
- def initialize(url,header: {},is_cookie: true,**kargs)
-   super(url,header: header,is_cookie: is_cookie,**kargs)
+ # Search Engines are strict, so trigger using the default HTTP header fields
+ # with +header: {}+ and fetch/set the cookie using +eat_cookie: true+.
+ def initialize(url,eat_cookie: true,header: {},**kargs)
+   super(url,eat_cookie: eat_cookie,header: header,**kargs)
  end
 
  def ignore_link?(link,cleaned: true)
@@ -59,6 +60,7 @@ module NHKore
  return true if link =~ /\/about\.html?/ # https://www3.nhk.or.jp/news/easy/about.html
  return true if link =~ /\/movieplayer\.html?/ # https://www3.nhk.or.jp/news/easy/movieplayer.html?id=k10038422811_1207251719_1207251728.mp4&teacuprbbs=4feb73432045dbb97c283d64d459f7cf
  return true if link =~ /\/audio\.html?/ # https://www3.nhk.or.jp/news/easy/player/audio.html?id=k10011555691000
+ return true if link =~ /\/news\/easy\/index\.html?/ # http://www3.nhk.or.jp/news/easy/index.html
 
  return false
  end
@@ -157,11 +159,14 @@ module NHKore
  open(uri)
 
  doc = rss_doc()
+ rss_links = []
 
  doc.items.each() do |item|
    link = item.link.to_s()
    link = Util.unspace_web_str(link).downcase()
 
+   rss_links << link
+
    next if ignore_link?(link)
    next if link !~ regex
 
@@ -170,9 +175,14 @@ module NHKore
    link_count += 1
  end
 
- if link_count >= 1 && next_page.empty?()
+ # For RSS, Bing will keep returning the same links over and over
+ # if it's the last page or the "first=" query is the wrong count.
+ # Therefore, we have to test the previous RSS links (+page.rss_links+).
+ if next_page.empty?() && doc.items.length >= 1 && page.rss_links != rss_links
    next_page.count = (page.count < 0) ? 0 : page.count
-   next_page.count += doc.items.length - 1 # -1 because 1st item is sometimes junk (search URL)
+   next_page.count += doc.items.length
+   next_page.rss_links = rss_links
+
    uri = URI(page.url.nil?() ? @url : page.url)
 
    Util.replace_uri_query!(uri,first: next_page.count)
@@ -191,12 +201,14 @@ module NHKore
  ###
  class NextPage
    attr_accessor :count
+   attr_accessor :rss_links
    attr_accessor :url
 
    def initialize()
      super()
 
      @count = -1
+     @rss_links = nil
      @url = nil
    end
 
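The infinite-loop fix above boils down to remembering the previous page's RSS links and stopping when a request returns the same list again. A stripped-down sketch of that guard (`fetch_rss_links` and `process` are hypothetical helpers):

```Ruby
# Bing's RSS endpoint keeps returning the same items when paging past the
# last result, so two identical consecutive pages mean we're done.
prev_links = nil

loop do
  links = fetch_rss_links() # hypothetical: returns an Array of URL Strings

  break if links.empty? || links == prev_links # the new guard

  links.each() {|link| process(link)} # hypothetical per-link work
  prev_links = links
end
```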
data/lib/nhkore/sifter.rb CHANGED
@@ -21,8 +21,6 @@
  #++
 
 
- require 'csv'
-
  require 'nhkore/article'
  require 'nhkore/fileable'
  require 'nhkore/util'
@@ -143,6 +141,8 @@ module NHKore
 
  # This does not output {caption}.
  def put_csv!()
+   require 'csv'
+
    words = sift()
 
    @output = CSV.generate(headers: :first_row,write_headers: true) do |csv|
@@ -21,9 +21,6 @@
  #++
 
 
- require 'bimyou_segmenter'
- require 'tiny_segmenter'
-
  require 'nhkore/util'
 
 
@@ -59,6 +56,12 @@ module NHKore
  # @since 0.2.0
  ###
  class BimyouSplitter < Splitter
+   def initialize(*)
+     require 'bimyou_segmenter'
+
+     super
+   end
+
    def end_split(str)
      return BimyouSegmenter.segment(str,symbol: false,white_space: false)
    end
@@ -71,6 +74,8 @@ module NHKore
  attr_accessor :tiny
 
  def initialize(*)
+   require 'tiny_segmenter'
+
    super
 
    @tiny = TinySegmenter.new()
data/lib/nhkore/util.rb CHANGED
@@ -24,7 +24,6 @@
  require 'cgi'
  require 'psychgus'
  require 'public_suffix'
- require 'set'
  require 'time'
  require 'uri'
 
@@ -65,8 +64,7 @@ module NHKore
  MIN_SANE_YEAR = 1924
 
  def self.dir_str?(str)
-   # File.join() will add the appropriate slash.
-   return File.join(str,'') == str
+   return str.match?(/[\/\\]\s*\z/)
  end
 
  def self.domain(host,clean: true)
@@ -100,7 +98,8 @@ module NHKore
  end
 
  def self.filename_str?(str)
-   return File.basename(str) == str
+   # Do not use "!dir_str?()"! It's not the same meaning!
+   return !str.match?(/[\/\\]/)
  end
 
  def self.guess_year(year)
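
Based on the regexes above, the new behavior can be summarized with a few examples (expected values inferred from the patterns, not run against the gem):

```Ruby
NHKore::Util.dir_str?('core/')      # => true  (trailing forward slash)
NHKore::Util.dir_str?('core\\')     # => true  (trailing backslash now counts too)
NHKore::Util.dir_str?('core/file')  # => false (slash isn't at the end)

NHKore::Util.filename_str?('file.yml')   # => true  (no slash at all)
NHKore::Util.filename_str?('core/file')  # => false (contains a slash)
NHKore::Util.filename_str?('core\\file') # => false (backslash counts too)
```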
@@ -21,9 +21,6 @@
  #++
 
 
- require 'japanese_deinflector'
-
-
  module NHKore
  ###
  # @author Jonathan Bradley Whited (@esotericpig)
@@ -63,6 +60,8 @@ module NHKore
  attr_accessor :deinflector
 
  def initialize(*)
+   require 'japanese_deinflector'
+
    super
 
    @deinflector = JapaneseDeinflector.new()
@@ -22,5 +22,5 @@
 
 
  module NHKore
-   VERSION = '0.3.0'
+   VERSION = '0.3.1'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: nhkore
  version: !ruby/object:Gem::Version
-   version: 0.3.0
+   version: 0.3.1
  platform: ruby
  authors:
  - Jonathan Bradley Whited (@esotericpig)
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-04-12 00:00:00.000000000 Z
+ date: 2020-04-20 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bimyou_segmenter
@@ -374,7 +374,7 @@ metadata:
  changelog_uri: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md
  homepage_uri: https://github.com/esotericpig/nhkore
  source_code_uri: https://github.com/esotericpig/nhkore
- post_install_message: " \n NHKore v0.3.0\n \n You can now use [nhkore] on the
+ post_install_message: " \n NHKore v0.3.1\n \n You can now use [nhkore] on the
  command line.\n \n Homepage: https://github.com/esotericpig/nhkore\n \n Code:
  \ https://github.com/esotericpig/nhkore\n Changelog: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md\n
  \ Bugs: https://github.com/esotericpig/nhkore/issues\n \n"