nhkore 0.3.0 → 0.3.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: a02c041aff2b0b040ff00acaeeaa54506e574e9575613b48c4dd75ef0ef45564
- data.tar.gz: dd3f570c0e7b7223c039d4119989e04514ef0d4bf4fb53485397271021f39246
+ metadata.gz: fb2c0e6e53995b874a9e53c44b024f993032433d1a87c37e7b7bdea69965902d
+ data.tar.gz: 13d34c53fe9af9efa985c05089b1588eb1e76d6321f9aff18cc5da80598a52d4
  SHA512:
- metadata.gz: 4f5021ab1fd74bb1c5a42574fa1045f71069b6f8ab6cf7b1717e6164505127e6c657f2e36be903dc190d356bed83fdf8c2de4c89644c7676863cfb9a8c53da8f
- data.tar.gz: e082a6ed70bacccb763386e00d8ca92351d4ee8d9f2d32a9b79dc6a2733ea46cd739e95550124e4807281c58f3a65faf9b8496740a51cca13e068ecd3e882d3a
+ metadata.gz: 643723d42e939a7852eca3b90c3ec4e65085838317eb59c1d8f21f79dd647d2e77e5ea68ab2ff3b5a208608f9bf350121a9918cb318dec6c3047731b73f59294
+ data.tar.gz: 3481fea3a3895a5b85ac3fcd5a77fe9b811f84e9a19b395a1de1d2e9b31fda93c5fb49a8d7d43581e05cb90c6f844f8537c5a97d73937c2b8ee97728ac7c7a1f
data/CHANGELOG.md CHANGED
@@ -2,7 +2,21 @@

  Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

- ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.0...master)
+ ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.1...master)
+
+ ## [v0.3.1] - 2020-04-20
+
+ ### Changed
+ - Fleshed out more of README.
+ - NewsCmd/SiftCmd
+   - Added `--no-sha256` option to not check if article links have already been scraped based on their contents' SHA-256.
+ - Util
+   - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows & Linux, it will check for both `/` & `\`.
+
+ ### Fixed
+ - Reduced load time of the app from ~1s to ~0.3-0.5s by moving some requires into methods.
+ - BingScraper
+   - Fixed possible RSS infinite loop.

  ## [v0.3.0] - 2020-04-12

@@ -13,7 +27,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
  ### Changed
  - BingCmd => SearchCmd
    - Major (breaking) change.
-   - Changed `$nhkore bing easy` to:
+   - Changed `$ nhkore bing easy` to:
      - `$ nhkore search easy bing`
      - `$ nhkore se ez b`
  - App
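
The load-time fix listed above (moving some requires into methods) shows up throughout the library hunks below. A minimal sketch of the pattern follows; `CsvExporter` is a made-up class used only for illustration, not one of nhkore's classes.

```Ruby
# Sketch only: defer a heavyweight require to the method that first needs it.
# 'require' is idempotent, so repeated calls are safe, and simply starting the
# CLI no longer pays the cost of loading 'csv'.
class CsvExporter
  def export(rows)
    require 'csv' # loaded lazily on first use, not at startup

    CSV.generate do |csv|
      rows.each { |row| csv << row }
    end
  end
end

puts CsvExporter.new.export([%w[word count], %w[桜 3]])
```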
data/README.md CHANGED
@@ -10,7 +10,7 @@ A CLI app that scrapes [NHK News Web Easy](https://www3.nhk.or.jp/news/easy/) to

  This is similar to a [core word/vocabulary list](https://www.fluentin3months.com/core-japanese-words/), hence the name NHKore.

- [![asciinema Demo - Help](https://asciinema.org/a/MQTJ9vxcpB7VYAKzke7m4QM7P.png)](https://asciinema.org/a/MQTJ9vxcpB7VYAKzke7m4QM7P?speed=2)
+ [![asciinema Demo](https://asciinema.org/a/318958.png)](https://asciinema.org/a/318958)

  ## Contents

@@ -18,7 +18,7 @@ This is similar to a [core word/vocabulary list](https://www.fluentin3months.com
  - [Installing](#installing-)
  - [Using](#using-)
  - [The Basics](#the-basics-)
- - [Unlimited Power!](#unlimited-power-)
+ - [Unlimited Powah!](#unlimited-powah-)
  - [Get Command](#get-command-)
  - [Sift Command](#sift-command-)
  - [Sakura Fields Forever](#sakura-fields-forever-)
@@ -51,8 +51,8 @@ Manually:
  ```
  $ git clone 'https://github.com/esotericpig/nhkore.git'
  $ cd nhkore
- $ gem build nhkore.gemspec
- $ gem install *.gem
+ $ bundle install
+ $ bundle exec rake install:local
  ```

  If there are errors running `nhkore`, you may need to also [install Nokogiri](https://nokogiri.org/tutorials/installing_nokogiri.html) manually, which is used for scraping HTML.
@@ -118,22 +118,15 @@ $ nhkore sift easy -e html
  $ nhkore sift easy -e yml
  ```

- If you have other scraped articles, then you'll need to filter down to the specific one:
+ Complete demo:

- | Command | Description |
- | --- | --- |
- | `$ nhkore sift easy -u k10011862381000` | Filter by URL |
- | `$ nhkore sift easy -t '植えられた桜'` | Filter by title |
- | `$ nhkore sift easy -d '2019-3-29 11:30'` | Filter by date time |
- | `$ nhkore sift easy -d '2019-3-29' -t '桜'` | Filter by date time & title |
- | `$ nhkore sift easy -d '2019-3-29' -t '桜' -e html` | Filter & output HTML |
- | `$ nhkore sift easy -d '2019-3-29' -t '桜' -o 'sakura.html'` | Filter & output HTML |
+ [![asciinema Demo - The Basics](https://asciinema.org/a/318958.png)](https://asciinema.org/a/318958)

- Complete demo:
+ ### Unlimited Powah! [^](#contents)

- [![asciinema Demo - The Basics](https://asciinema.org/a/316571.png)](https://asciinema.org/a/316571)
+ Generate a core word list (e.g., CSV file) for 1 or more pre-scraped articles with ease.

- ### Unlimited Power! [^](#contents)
+ Unlimited powah at your finger tips!

  #### Get Command [^](#contents)

@@ -151,7 +144,7 @@ By default, it will extract the data to `./core/`. You can change this:

  Complete demo:

- [![asciinema Demo - Get](https://asciinema.org/a/317773.png)](https://asciinema.org/a/317773)
+ [![asciinema Demo - Get](https://asciinema.org/a/318967.png)](https://asciinema.org/a/318967)

  #### Sift Command [^](#contents)

@@ -189,12 +182,21 @@ You can filter the data by using different options:
  Filter examples:

  ```
+ # Filter by URL.
+ $ nhkore sift easy -u 'k10011862381000'
+
+ # Filter by title.
+ $ nhkore sift easy -t 'マリオ'
+ $ nhkore sift easy -t '植えられた桜'
+
+ # Filter by date time.
  $ nhkore sift easy -d 2019
  $ nhkore sift easy -d '2019-12'
- $ nhkore sift easy -d '2019-7-4...9' # July 4th to 9th of 2019
+ $ nhkore sift easy -d '2019-7-4...9'      # July 4th to 9th of 2019
  $ nhkore sift easy -d '2019-12-25 13:10'
- $ nhkore sift easy -t 'マリオ'
- $ nhkore sift easy -u 'k10011862381000'
+
+ # Filter by date time & title.
+ $ nhkore sift easy -d '2019-3-29' -t '桜'
  ```

  You can save the data to a different format using one of these options:
@@ -232,10 +234,14 @@ Lastly, you can ignore certain columns from the output. Definitions can be quite

  Complete demo:

- [![asciinema Demo - Sift](https://asciinema.org/a/318119.png)](https://asciinema.org/a/318119)
+ [![asciinema Demo - Sift](https://asciinema.org/a/318982.png)](https://asciinema.org/a/318982)

  ### Sakura Fields Forever [^](#contents)

+ No more waiting on a new release with pre-scraped files.
+
+ Scrape all of the latest articles for yourself, forever!
+
  #### Search Command [^](#contents)

  The [news](#news-command-) command (for scraping articles) relies on having a file of article links.
@@ -258,9 +264,9 @@ links:

  Only the key (which is the URL) and the `url` field are required. The rest of the fields will be populated when you scrape the data.

- > <rambling>
- > Originally, I was planning on using a different key so that's why the URL is duplicated. This also allows for a possible future breaking version (major version change) to alter the key. In addition, I was originally planning to allow filtering in this file, so that's why additional fields are populated after scraping the data.
- > </rambling>
+ > <rambling>
+ > Originally, I was planning on using a different key so that's why the URL is duplicated. This also allows for a possible future breaking version (major version change) to alter the key. In addition, I was originally planning to allow filtering in this file, so that's why additional fields are populated after scraping the data.
+ > </rambling>

  Example after running the `news` command:

@@ -287,6 +293,30 @@ links:

  If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.

+ Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
+
+ Example usage:
+
+ `$ nhkore search easy bing`
+
+ There are a few notable options:
+
+ ```
+ -r --results=<value>    number of results per page to request from search
+                         (default: 100)
+    --show-count         show the number of links scraped and exit;
+                         useful for manually writing/updating scripts
+                         (but not for use in a variable);
+                         implies '--dry-run' option
+    --show-urls          show the URLs -- if any -- used when searching &
+                         scraping and exit; you can download these for offline
+                         testing and/or slow internet (see '--in' option)
+ ```
+
+ Complete demo:
+
+ [![asciinema Demo - Search](https://asciinema.org/a/320457.png)](https://asciinema.org/a/320457)
+
  #### News Command [^](#contents)

  ## Using the Library [^](#contents)
@@ -306,7 +336,7 @@ In your *Gemfile*:
  ```Ruby
  # Pick one...
  gem 'nhkore', '~> X.X'
- gem 'nhkore', :git => 'https://github.com/esotericpig/psychgus.git', :tag => 'vX.X'
+ gem 'nhkore', :git => 'https://github.com/esotericpig/psychgus.git', :tag => 'vX.X.X'
  ```

  ### Scraper
data/lib/nhkore/app.rb CHANGED
@@ -24,8 +24,6 @@
  require 'cri'
  require 'highline'
  require 'rainbow'
- require 'set'
- require 'tty-progressbar'
  require 'tty-spinner'

  require 'nhkore/error'
@@ -320,6 +318,8 @@ module NHKore
  def build_progress_bar(title,download: false,total: 100,type: @progress_bar,width: 33,**kargs)
    case type
    when :default,:classic
+     require 'tty-progressbar'
+
      msg = "#{title} [:bar] :percent :eta".dup()
      msg << ' :byte_rate/s' if download

@@ -59,13 +59,13 @@ module CLI
  bars = nil

  if @cmd_opts[:all]
-   bars = [:default,:classic,:no]
+   bars = {default: :default,classic: :classic,no: :no}
  else
-   bars = [@progress_bar]
+   bars = {user: @progress_bar}
  end

- bars.each() do |bar|
-   name = (bars.length == 1) ? 'User' : bar.to_s().capitalize()
+ bars.each() do |name,bar|
+   name = name.to_s().capitalize()
    bar = build_progress_bar("Testing #{name} progress",download: false,type: bar)

    bar.start()
@@ -21,10 +21,6 @@
  #++


- require 'down/net_http'
- require 'tempfile'
- require 'zip'
-
  require 'nhkore/util'


@@ -73,6 +69,10 @@ module CLI
  end

  def run_get_cmd()
+   require 'down/net_http'
+   require 'tempfile'
+   require 'zip'
+
    build_out_dir(:out,default_dir: Util::CORE_DIR)

    return unless check_out_dir(:out)
@@ -97,6 +97,11 @@ module CLI
      do not try to parse the dictionary files for the articles; useful in case of errors trying to load
      the dictionaries (or for offline testing)
    EOD
+   flag :H,'no-sha256',<<-EOD
+     do not check the SHA-256 of the content to see if an article has already been scraped;
+     for example, 2 URLs with the same content, but 1 with 'http' & 1 with 'https', will both be scraped;
+     this is useful if 2 articles have the same SHA-256, but different content (unlikely)
+   EOD
    option :o,:out,<<-EOD,argument: :required,transform: -> (value) do
      'directory/file' to save words to; if you only specify a directory or a file, it will attach
      the appropriate default directory/file name
@@ -196,6 +201,7 @@ module CLI
  max_scrapes = @cmd_opts[:scrape]
  max_scrapes = DEFAULT_NEWS_SCRAPE if max_scrapes.nil?()
  missingno = @cmd_opts[:missingno]
+ no_sha256 = @cmd_opts[:no_sha256]
  out_file = @cmd_opts[:out]
  redo_scrapes = @cmd_opts[:redo]
  show_dict = @cmd_opts[:show_dict]
@@ -219,7 +225,9 @@ module CLI
  scrape_count = 0

  if File.exist?(out_file)
-   news = (type == :yasashii) ? YasashiiNews.load_file(out_file) : FutsuuNews.load_file(out_file)
+   news = (type == :yasashii) ?
+     YasashiiNews.load_file(out_file,overwrite: no_sha256) :
+     FutsuuNews.load_file(out_file,overwrite: no_sha256)
  else
    news = (type == :yasashii) ? YasashiiNews.new() : FutsuuNews.new()
  end
@@ -357,9 +365,11 @@ module CLI
  def scraped_news_article?(news,link)
    return true if link.scraped?()

+   no_sha256 = @cmd_opts[:no_sha256]
+
    article = news.article(link.url)

-   if article.nil?()
+   if !no_sha256 && article.nil?()
      if !Util.empty_web_str?(link.sha256) && news.sha256?(link.sha256)
        article = news.article_with_sha256(link.sha256)
      end
@@ -118,6 +118,10 @@ module CLI
    EOD
    app.check_empty_opt(:out,value)
  end
+ flag :H,'no-sha256',<<-EOD
+   if you used this option with the 'news' command, then you'll also need this option here
+   to not fail on "duplicate" articles; see '#{App::NAME} news'
+ EOD
  option :t,:title,'title to filter on, where search text only needs to be somewhere in the title',
    argument: :required
  option :u,:url,'URL to filter on, where search text only needs to be somewhere in the URL',
@@ -326,13 +330,16 @@ module CLI
  in_file = @cmd_opts[:in]
  no_defn = @cmd_opts[:no_defn]
  no_eng = @cmd_opts[:no_eng]
+ no_sha256 = @cmd_opts[:no_sha256]
  out_file = @cmd_opts[:out]
  title_filter = @cmd_opts[:title]
  url_filter = @cmd_opts[:url]

  start_spin("Sifting NHK News Web #{news_name} data")

- news = (type == :yasashii) ? YasashiiNews.load_file(in_file) : FutsuuNews.load_file(in_file)
+ news = (type == :yasashii) ?
+   YasashiiNews.load_file(in_file,overwrite: no_sha256) :
+   FutsuuNews.load_file(in_file,overwrite: no_sha256)

  sifter = Sifter.new(news)

@@ -21,7 +21,6 @@
  #++


- require 'json'
  require 'nhkore/dict'
  require 'nhkore/error'
  require 'nhkore/scraper'
@@ -59,6 +58,8 @@ module NHKore
  end

  def scrape()
+   require 'json'
+
    json = JSON.load(@str_or_io)

    return Dict.new() if json.nil?()
data/lib/nhkore/news.rb CHANGED
@@ -73,7 +73,7 @@ module NHKore
    coder[:articles] = @articles
  end

- def self.load_data(data,article_class: Article,file: nil,news_class: News,**kargs)
+ def self.load_data(data,article_class: Article,file: nil,news_class: News,overwrite: false,**kargs)
    data = Util.load_yaml(data,file: file)

    articles = data[:articles]
@@ -83,7 +83,7 @@ module NHKore
  if !articles.nil?()
    articles.each() do |key,hash|
      key = key.to_s() # Change from a symbol
-     news.add_article(article_class.load_data(key,hash),key: key)
+     news.add_article(article_class.load_data(key,hash),key: key,overwrite: overwrite)
    end
  end

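
The `overwrite:` keyword threaded through `load_data`/`add_article` above is what the new `--no-sha256` CLI option feeds into (see the NewsCmd/SiftCmd hunks earlier). A hedged sketch of the resulting library call follows; the YAML path is only a placeholder, and the top-level `require 'nhkore'` is assumed to pull in the news classes.

```Ruby
require 'nhkore' # assumed gem entry point that loads YasashiiNews, etc.

# Placeholder path; use whatever file the 'news'/'sift' commands read or write.
path = './core/nhk_news_web_easy.yml'

# overwrite: true is what '--no-sha256' requests: articles whose key or SHA-256
# collides with one already in the file are accepted on load instead of
# failing as "duplicates".
news = NHKore::YasashiiNews.load_file(path, overwrite: true)
```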
@@ -21,10 +21,8 @@
  #++


- require 'http-cookie'
  require 'nokogiri'
  require 'open-uri'
- require 'rss'

  require 'nhkore/user_agents'
  require 'nhkore/util'
@@ -42,7 +40,7 @@ module NHKore
  'dnt' => '1',
  }

- attr_accessor :is_cookie
+ attr_accessor :eat_cookie
  attr_accessor :is_file
  attr_reader :kargs
  attr_accessor :max_redirects
@@ -51,7 +49,7 @@ module NHKore
  attr_accessor :str_or_io
  attr_accessor :url

- alias_method :is_cookie?,:is_cookie
+ alias_method :eat_cookie?,:eat_cookie
  alias_method :is_file?,:is_file

  # +max_redirects+ defaults to 3 for safety (infinite-loop attack).
@@ -60,10 +58,10 @@ module NHKore
  #
  # Pass in +header: {}+ for the default HTTP header fields to be set.
  #
- # @param is_cookie [true,false] true to set the HTTP header field 'cookie', which can be an expensive
- #   (time-consuming) operation since it opens the URL again, but necessary for some URLs.
+ # @param eat_cookie [true,false] true to set the HTTP header field 'cookie', which can be an expensive
+ #   (time-consuming) operation since it opens the URL again, but necessary for some URLs.
  # @param redirect_rule [nil,:lenient,:strict]
- def initialize(url,header: nil,is_cookie: false,is_file: false,max_redirects: 3,max_retries: 3,redirect_rule: :strict,str_or_io: nil,**kargs)
+ def initialize(url,eat_cookie: false,header: nil,is_file: false,max_redirects: 3,max_retries: 3,redirect_rule: :strict,str_or_io: nil,**kargs)
    super()

    if !header.nil?() && !is_file
@@ -77,7 +75,7 @@ module NHKore
    kargs.merge!(header)
  end

- @is_cookie = is_cookie
+ @eat_cookie = eat_cookie
  @is_file = is_file
  @kargs = kargs
  @max_redirects = max_redirects
@@ -88,6 +86,8 @@ module NHKore
  end

  def fetch_cookie(url)
+   require 'http-cookie'
+
    open_url(url)
    cookies = Array(@str_or_io.meta['set-cookie']) # nil will be []

@@ -128,7 +128,7 @@ module NHKore
    # NHK's website tends to always use UTF-8.
    @str_or_io = File.open(url,'rt:UTF-8',**@kargs)
  else
-   fetch_cookie(url) if @is_cookie
+   fetch_cookie(url) if @eat_cookie
    open_url(url)
  end
  end
@@ -195,6 +195,8 @@ module NHKore
  end

  def rss_doc()
+   require 'rss'
+
    return RSS::Parser.parse(@str_or_io,validate: false)
  end
  end
@@ -45,9 +45,10 @@ module NHKore
  # - https://www3.nhk.or.jp/news/easy/article/disaster_heat.html
  YASASHII_REGEX = /\A[^\.]+\.#{Regexp.quote(YASASHII_SITE)}.+\.html?/i

- # Pass in +header: {}+ to trigger using the default HTTP header fields.
- def initialize(url,header: {},is_cookie: true,**kargs)
-   super(url,header: header,is_cookie: is_cookie,**kargs)
+ # Search Engines are strict, so trigger using the default HTTP header fields
+ # with +header: {}+ and fetch/set the cookie using +eat_cookie: true+.
+ def initialize(url,eat_cookie: true,header: {},**kargs)
+   super(url,eat_cookie: eat_cookie,header: header,**kargs)
  end

  def ignore_link?(link,cleaned: true)
@@ -59,6 +60,7 @@ module NHKore
  return true if link =~ /\/about\.html?/ # https://www3.nhk.or.jp/news/easy/about.html
  return true if link =~ /\/movieplayer\.html?/ # https://www3.nhk.or.jp/news/easy/movieplayer.html?id=k10038422811_1207251719_1207251728.mp4&teacuprbbs=4feb73432045dbb97c283d64d459f7cf
  return true if link =~ /\/audio\.html?/ # https://www3.nhk.or.jp/news/easy/player/audio.html?id=k10011555691000
+ return true if link =~ /\/news\/easy\/index\.html?/ # http://www3.nhk.or.jp/news/easy/index.html

  return false
  end
@@ -157,11 +159,14 @@ module NHKore
  open(uri)

  doc = rss_doc()
+ rss_links = []

  doc.items.each() do |item|
    link = item.link.to_s()
    link = Util.unspace_web_str(link).downcase()

+   rss_links << link
+
    next if ignore_link?(link)
    next if link !~ regex

@@ -170,9 +175,14 @@ module NHKore

    link_count += 1
  end
- if link_count >= 1 && next_page.empty?()
+ # For RSS, Bing will keep returning the same links over and over
+ # if it's the last page or the "first=" query is the wrong count.
+ # Therefore, we have to test the previous RSS links (+page.rss_links+).
+ if next_page.empty?() && doc.items.length >= 1 && page.rss_links != rss_links
    next_page.count = (page.count < 0) ? 0 : page.count
-   next_page.count += doc.items.length - 1 # -1 because 1st item is sometimes junk (search URL)
+   next_page.count += doc.items.length
+   next_page.rss_links = rss_links
+
    uri = URI(page.url.nil?() ? @url : page.url)

    Util.replace_uri_query!(uri,first: next_page.count)
@@ -191,12 +201,14 @@ module NHKore
  ###
  class NextPage
    attr_accessor :count
+   attr_accessor :rss_links
    attr_accessor :url

    def initialize()
      super()

      @count = -1
+     @rss_links = nil
      @url = nil
    end

data/lib/nhkore/sifter.rb CHANGED
@@ -21,8 +21,6 @@
  #++


- require 'csv'
-
  require 'nhkore/article'
  require 'nhkore/fileable'
  require 'nhkore/util'
@@ -143,6 +141,8 @@ module NHKore

  # This does not output {caption}.
  def put_csv!()
+   require 'csv'
+
    words = sift()

    @output = CSV.generate(headers: :first_row,write_headers: true) do |csv|
@@ -21,9 +21,6 @@
  #++


- require 'bimyou_segmenter'
- require 'tiny_segmenter'
-
  require 'nhkore/util'


@@ -59,6 +56,12 @@ module NHKore
  # @since 0.2.0
  ###
  class BimyouSplitter < Splitter
+   def initialize(*)
+     require 'bimyou_segmenter'
+
+     super
+   end
+
    def end_split(str)
      return BimyouSegmenter.segment(str,symbol: false,white_space: false)
    end
@@ -71,6 +74,8 @@ module NHKore
  attr_accessor :tiny

  def initialize(*)
+   require 'tiny_segmenter'
+
    super

    @tiny = TinySegmenter.new()
data/lib/nhkore/util.rb CHANGED
@@ -24,7 +24,6 @@
  require 'cgi'
  require 'psychgus'
  require 'public_suffix'
- require 'set'
  require 'time'
  require 'uri'

@@ -65,8 +64,7 @@ module NHKore
  MIN_SANE_YEAR = 1924

  def self.dir_str?(str)
-   # File.join() will add the appropriate slash.
-   return File.join(str,'') == str
+   return str.match?(/[\/\\]\s*\z/)
  end

  def self.domain(host,clean: true)
@@ -100,7 +98,8 @@ module NHKore
  end

  def self.filename_str?(str)
-   return File.basename(str) == str
+   # Do not use "!dir_str?()"! It's not the same meaning!
+   return !str.match?(/[\/\\]/)
  end

  def self.guess_year(year)
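
For reference, here is what the new slash handling in `dir_str?()`/`filename_str?()` means in practice. The expected values below simply follow the regexes in the hunk above; they are not taken from the gem's test suite.

```Ruby
require 'nhkore/util'

# A trailing '/' or '\' now marks a directory on any platform.
NHKore::Util.dir_str?('core/')   # => true
NHKore::Util.dir_str?('core\\')  # => true (backslash counts too)
NHKore::Util.dir_str?('core')    # => false

# A filename must contain no slash of either kind.
NHKore::Util.filename_str?('words.csv')       # => true
NHKore::Util.filename_str?('core/words.csv')  # => false
```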
@@ -21,9 +21,6 @@
  #++


- require 'japanese_deinflector'
-
-
  module NHKore
    ###
    # @author Jonathan Bradley Whited (@esotericpig)
@@ -63,6 +60,8 @@ module NHKore
  attr_accessor :deinflector

  def initialize(*)
+   require 'japanese_deinflector'
+
    super

    @deinflector = JapaneseDeinflector.new()
@@ -22,5 +22,5 @@


  module NHKore
-   VERSION = '0.3.0'
+   VERSION = '0.3.1'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: nhkore
  version: !ruby/object:Gem::Version
-   version: 0.3.0
+   version: 0.3.1
  platform: ruby
  authors:
  - Jonathan Bradley Whited (@esotericpig)
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-04-12 00:00:00.000000000 Z
+ date: 2020-04-20 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bimyou_segmenter
@@ -374,7 +374,7 @@ metadata:
  changelog_uri: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md
  homepage_uri: https://github.com/esotericpig/nhkore
  source_code_uri: https://github.com/esotericpig/nhkore
- post_install_message: " \n NHKore v0.3.0\n \n You can now use [nhkore] on the
+ post_install_message: " \n NHKore v0.3.1\n \n You can now use [nhkore] on the
  command line.\n \n Homepage: https://github.com/esotericpig/nhkore\n \n Code:
  \ https://github.com/esotericpig/nhkore\n Changelog: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md\n
  \ Bugs: https://github.com/esotericpig/nhkore/issues\n \n"