nhkore 0.3.0 → 0.3.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -2
- data/README.md +55 -25
- data/lib/nhkore/app.rb +2 -2
- data/lib/nhkore/cli/fx_cmd.rb +4 -4
- data/lib/nhkore/cli/get_cmd.rb +4 -4
- data/lib/nhkore/cli/news_cmd.rb +12 -2
- data/lib/nhkore/cli/sift_cmd.rb +8 -1
- data/lib/nhkore/dict_scraper.rb +2 -1
- data/lib/nhkore/news.rb +2 -2
- data/lib/nhkore/scraper.rb +11 -9
- data/lib/nhkore/search_scraper.rb +17 -5
- data/lib/nhkore/sifter.rb +2 -2
- data/lib/nhkore/splitter.rb +8 -3
- data/lib/nhkore/util.rb +3 -4
- data/lib/nhkore/variator.rb +2 -3
- data/lib/nhkore/version.rb +1 -1
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fb2c0e6e53995b874a9e53c44b024f993032433d1a87c37e7b7bdea69965902d
+  data.tar.gz: 13d34c53fe9af9efa985c05089b1588eb1e76d6321f9aff18cc5da80598a52d4
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 643723d42e939a7852eca3b90c3ec4e65085838317eb59c1d8f21f79dd647d2e77e5ea68ab2ff3b5a208608f9bf350121a9918cb318dec6c3047731b73f59294
+  data.tar.gz: 3481fea3a3895a5b85ac3fcd5a77fe9b811f84e9a19b395a1de1d2e9b31fda93c5fb49a8d7d43581e05cb90c6f844f8537c5a97d73937c2b8ee97728ac7c7a1f
data/CHANGELOG.md
CHANGED
@@ -2,7 +2,21 @@
 
 Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
-## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.
+## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.1...master)
+
+## [v0.3.1] - 2020-04-20
+
+### Changed
+- Fleshed out more of README.
+- NewsCmd/SiftCmd
+  - Added `--no-sha256` option to not check if article links have already been scraped based on their contents' SHA-256.
+- Util
+  - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows & Linux, it will check for both `/` & `\`.
+
+### Fixed
+- Reduced load time of app from ~1s to ~0.3-0.5s by moving some requires into methods.
+- BingScraper
+  - Fixed possible RSS infinite loop.
 
 
 ## [v0.3.0] - 2020-04-12
 
@@ -13,7 +27,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 ### Changed
 - BingCmd => SearchCmd
   - Major (breaking) change.
-  - Changed `$nhkore bing easy` to:
+  - Changed `$ nhkore bing easy` to:
     - `$ nhkore search easy bing`
     - `$ nhkore se ez b`
 - App
data/README.md
CHANGED
@@ -10,7 +10,7 @@ A CLI app that scrapes [NHK News Web Easy](https://www3.nhk.or.jp/news/easy/) to
 
 This is similar to a [core word/vocabulary list](https://www.fluentin3months.com/core-japanese-words/), hence the name NHKore.
 
-[](https://asciinema.org/a/318958)
 
 ## Contents
 
@@ -18,7 +18,7 @@ This is similar to a [core word/vocabulary list](https://www.fluentin3months.com
 - [Installing](#installing-)
 - [Using](#using-)
     - [The Basics](#the-basics-)
-    - [Unlimited
+    - [Unlimited Powah!](#unlimited-powah-)
         - [Get Command](#get-command-)
        - [Sift Command](#sift-command-)
    - [Sakura Fields Forever](#sakura-fields-forever-)
@@ -51,8 +51,8 @@ Manually:
 ```
 $ git clone 'https://github.com/esotericpig/nhkore.git'
 $ cd nhkore
-$
-$
+$ bundle install
+$ bundle exec rake install:local
 ```
 
 If there are errors running `nhkore`, you may need to also [install Nokogiri](https://nokogiri.org/tutorials/installing_nokogiri.html) manually, which is used for scraping HTML.
@@ -118,22 +118,15 @@ $ nhkore sift easy -e html
 $ nhkore sift easy -e yml
 ```
 
-
+Complete demo:
 
-
-| --- | --- |
-| `$ nhkore sift easy -u k10011862381000` | Filter by URL |
-| `$ nhkore sift easy -t '植えられた桜'` | Filter by title |
-| `$ nhkore sift easy -d '2019-3-29 11:30'` | Filter by date time |
-| `$ nhkore sift easy -d '2019-3-29' -t '桜'` | Filter by date time & title |
-| `$ nhkore sift easy -d '2019-3-29' -t '桜' -e html` | Filter & output HTML |
-| `$ nhkore sift easy -d '2019-3-29' -t '桜' -o 'sakura.html'` | Filter & output HTML |
+[](https://asciinema.org/a/318958)
 
-
+### Unlimited Powah! [^](#contents)
 
-
+Generate a core word list (e.g., CSV file) for 1 or more pre-scraped articles with ease.
 
-
+Unlimited powah at your finger tips!
 
 #### Get Command [^](#contents)
 
@@ -151,7 +144,7 @@ By default, it will extract the data to `./core/`. You can change this:
 
 Complete demo:
 
-[](https://asciinema.org/a/318967)
 
 #### Sift Command [^](#contents)
 
@@ -189,12 +182,21 @@ You can filter the data by using different options:
 Filter examples:
 
 ```
+# Filter by URL.
+$ nhkore sift easy -u 'k10011862381000'
+
+# Filter by title.
+$ nhkore sift easy -t 'マリオ'
+$ nhkore sift easy -t '植えられた桜'
+
+# Filter by date time.
 $ nhkore sift easy -d 2019
 $ nhkore sift easy -d '2019-12'
-$ nhkore sift easy -d '2019-7-4...9'
+$ nhkore sift easy -d '2019-7-4...9' # July 4th to 9th of 2019
 $ nhkore sift easy -d '2019-12-25 13:10'
-
-
+
+# Filter by date time & title.
+$ nhkore sift easy -d '2019-3-29' -t '桜'
 ```
 
 You can save the data to a different format using one of these options:
@@ -232,10 +234,14 @@ Lastly, you can ignore certain columns from the output. Definitions can be quite
 
 Complete demo:
 
-[](https://asciinema.org/a/318982)
 
 ### Sakura Fields Forever [^](#contents)
 
+No more waiting on a new release with pre-scraped files.
+
+Scrape all of the latest articles for yourself, forever!
+
 #### Search Command [^](#contents)
 
 The [news](#news-command-) command (for scraping articles) relies on having a file of article links.
@@ -258,9 +264,9 @@ links:
 
 Only the key (which is the URL) and the `url` field are required. The rest of the fields will be populated when you scrape the data.
 
-> <rambling>
-> Originally, I was planning on using a different key so that's why the URL is duplicated. This also allows for a possible future breaking version (major version change) to alter the key. In addition, I was originally planning to allow filtering in this file, so that's why additional fields are populated after scraping the data.
-> </rambling>
+> <rambling>
+> Originally, I was planning on using a different key so that's why the URL is duplicated. This also allows for a possible future breaking version (major version change) to alter the key. In addition, I was originally planning to allow filtering in this file, so that's why additional fields are populated after scraping the data.
+> </rambling>
 
 Example after running the `news` command:
 
@@ -287,6 +293,30 @@ links:
 
 If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
 
+Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
+
+Example usage:
+
+`$ nhkore search easy bing`
+
+There are a few notable options:
+
+```
+-r --results=<value>      number of results per page to request from search
+                          (default: 100)
+--show-count              show the number of links scraped and exit;
+                          useful for manually writing/updating scripts
+                          (but not for use in a variable);
+                          implies '--dry-run' option
+--show-urls               show the URLs -- if any -- used when searching &
+                          scraping and exit; you can download these for offline
+                          testing and/or slow internet (see '--in' option)
+```
+
+Complete demo:
+
+[](https://asciinema.org/a/320457)
+
 #### News Command [^](#contents)
 
 ## Using the Library [^](#contents)
@@ -306,7 +336,7 @@ In your *Gemfile*:
 ```Ruby
 # Pick one...
 gem 'nhkore', '~> X.X'
-gem 'nhkore', :git => 'https://github.com/esotericpig/psychgus.git', :tag => 'vX.X'
+gem 'nhkore', :git => 'https://github.com/esotericpig/psychgus.git', :tag => 'vX.X.X'
 ```
 
 ### Scraper
data/lib/nhkore/app.rb
CHANGED
@@ -24,8 +24,6 @@
 require 'cri'
 require 'highline'
 require 'rainbow'
-require 'set'
-require 'tty-progressbar'
 require 'tty-spinner'
 
 require 'nhkore/error'
@@ -320,6 +318,8 @@ module NHKore
 def build_progress_bar(title,download: false,total: 100,type: @progress_bar,width: 33,**kargs)
 case type
 when :default,:classic
+require 'tty-progressbar'
+
 msg = "#{title} [:bar] :percent :eta".dup()
 msg << ' :byte_rate/s' if download
 
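The app.rb change above (and the similar ones below) moves a `require` from the file header into the one code path that needs it, which is how the changelog's "reduced load time" fix works: Ruby's `require` loads a file only once and is a no-op on later calls, so deferring it costs nothing after the first use. A minimal sketch of the pattern, using a hypothetical module (not nhkore's actual code):

```ruby
# Lazy-require pattern: keep the file header cheap and load the heavy
# dependency only when the method that needs it is first called.
module HeavyReport
  def self.generate(data)
    require 'json'  # loaded on first call; cached by Ruby afterwards

    JSON.generate(data)
  end
end
```

Scripts that merely `require` the library but never call `HeavyReport.generate` then skip the cost of loading `json` entirely.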
data/lib/nhkore/cli/fx_cmd.rb
CHANGED
@@ -59,13 +59,13 @@ module CLI
 bars = nil
 
 if @cmd_opts[:all]
-bars =
+bars = {default: :default,classic: :classic,no: :no}
 else
-bars =
+bars = {user: @progress_bar}
 end
 
-bars.each() do |bar|
-name =
+bars.each() do |name,bar|
+name = name.to_s().capitalize()
 bar = build_progress_bar("Testing #{name} progress",download: false,type: bar)
 
 bar.start()
data/lib/nhkore/cli/get_cmd.rb
CHANGED
@@ -21,10 +21,6 @@
 #++
 
 
-require 'down/net_http'
-require 'tempfile'
-require 'zip'
-
 require 'nhkore/util'
 
 
@@ -73,6 +69,10 @@ module CLI
 end
 
 def run_get_cmd()
+require 'down/net_http'
+require 'tempfile'
+require 'zip'
+
 build_out_dir(:out,default_dir: Util::CORE_DIR)
 
 return unless check_out_dir(:out)
data/lib/nhkore/cli/news_cmd.rb
CHANGED
@@ -97,6 +97,11 @@ module CLI
 do not try to parse the dictionary files for the articles; useful in case of errors trying to load
 the dictionaries (or for offline testing)
 EOD
+flag :H,'no-sha256',<<-EOD
+do not check the SHA-256 of the content to see if an article has already been scraped;
+for example, 2 URLs with the same content, but 1 with 'http' & 1 with 'https', will both be scraped;
+this is useful if 2 articles have the same SHA-256, but different content (unlikely)
+EOD
 option :o,:out,<<-EOD,argument: :required,transform: -> (value) do
 'directory/file' to save words to; if you only specify a directory or a file, it will attach
 the appropriate default directory/file name
@@ -196,6 +201,7 @@ module CLI
 max_scrapes = @cmd_opts[:scrape]
 max_scrapes = DEFAULT_NEWS_SCRAPE if max_scrapes.nil?()
 missingno = @cmd_opts[:missingno]
+no_sha256 = @cmd_opts[:no_sha256]
 out_file = @cmd_opts[:out]
 redo_scrapes = @cmd_opts[:redo]
 show_dict = @cmd_opts[:show_dict]
@@ -219,7 +225,9 @@ module CLI
 scrape_count = 0
 
 if File.exist?(out_file)
-news = (type == :yasashii) ?
+news = (type == :yasashii) ?
+  YasashiiNews.load_file(out_file,overwrite: no_sha256) :
+  FutsuuNews.load_file(out_file,overwrite: no_sha256)
 else
 news = (type == :yasashii) ? YasashiiNews.new() : FutsuuNews.new()
 end
@@ -357,9 +365,11 @@ module CLI
 def scraped_news_article?(news,link)
 return true if link.scraped?()
 
+no_sha256 = @cmd_opts[:no_sha256]
+
 article = news.article(link.url)
 
-if article.nil?()
+if !no_sha256 && article.nil?()
 if !Util.empty_web_str?(link.sha256) && news.sha256?(link.sha256)
 article = news.article_with_sha256(link.sha256)
 end
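The news_cmd.rb change above wires the new `--no-sha256` flag through to the duplicate check. The underlying idea — deduplicating articles by hashing their content rather than comparing URLs — can be sketched as follows; the URLs and bodies are made up for illustration and this is not nhkore's actual implementation:

```ruby
require 'digest'

# Two URLs (http vs. https) that serve the identical article body.
content_by_url = {
  'http://www3.nhk.or.jp/news/easy/k1.html'  => 'same body',
  'https://www3.nhk.or.jp/news/easy/k1.html' => 'same body',
}

seen = {}     # SHA-256 hex digest => first URL seen with that content
scraped = []  # URLs actually kept

content_by_url.each do |url, body|
  sha256 = Digest::SHA256.hexdigest(body)
  next if seen.key?(sha256)  # identical content already scraped; skip

  seen[sha256] = url
  scraped << url
end
```

With the check enabled, only the first URL survives; disabling it (as `--no-sha256` does) would keep both entries, which matters in the unlikely case that two different articles hash to the same digest.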
data/lib/nhkore/cli/sift_cmd.rb
CHANGED
@@ -118,6 +118,10 @@ module CLI
 EOD
 app.check_empty_opt(:out,value)
 end
+flag :H,'no-sha256',<<-EOD
+if you used this option with the 'news' command, then you'll also need this option here
+to not fail on "duplicate" articles; see '#{App::NAME} news'
+EOD
 option :t,:title,'title to filter on, where search text only needs to be somewhere in the title',
 argument: :required
 option :u,:url,'URL to filter on, where search text only needs to be somewhere in the URL',
@@ -326,13 +330,16 @@ module CLI
 in_file = @cmd_opts[:in]
 no_defn = @cmd_opts[:no_defn]
 no_eng = @cmd_opts[:no_eng]
+no_sha256 = @cmd_opts[:no_sha256]
 out_file = @cmd_opts[:out]
 title_filter = @cmd_opts[:title]
 url_filter = @cmd_opts[:url]
 
 start_spin("Sifting NHK News Web #{news_name} data")
 
-news = (type == :yasashii) ?
+news = (type == :yasashii) ?
+  YasashiiNews.load_file(in_file,overwrite: no_sha256) :
+  FutsuuNews.load_file(in_file,overwrite: no_sha256)
 
 sifter = Sifter.new(news)
 
data/lib/nhkore/dict_scraper.rb
CHANGED
@@ -21,7 +21,6 @@
 #++
 
 
-require 'json'
 require 'nhkore/dict'
 require 'nhkore/error'
 require 'nhkore/scraper'
@@ -59,6 +58,8 @@ module NHKore
 end
 
 def scrape()
+require 'json'
+
 json = JSON.load(@str_or_io)
 
 return Dict.new() if json.nil?()
data/lib/nhkore/news.rb
CHANGED
@@ -73,7 +73,7 @@ module NHKore
 coder[:articles] = @articles
 end
 
-def self.load_data(data,article_class: Article,file: nil,news_class: News,**kargs)
+def self.load_data(data,article_class: Article,file: nil,news_class: News,overwrite: false,**kargs)
 data = Util.load_yaml(data,file: file)
 
 articles = data[:articles]
@@ -83,7 +83,7 @@ module NHKore
 if !articles.nil?()
 articles.each() do |key,hash|
 key = key.to_s() # Change from a symbol
-news.add_article(article_class.load_data(key,hash),key: key)
+news.add_article(article_class.load_data(key,hash),key: key,overwrite: overwrite)
 end
 end
 
data/lib/nhkore/scraper.rb
CHANGED
@@ -21,10 +21,8 @@
 #++
 
 
-require 'http-cookie'
 require 'nokogiri'
 require 'open-uri'
-require 'rss'
 
 require 'nhkore/user_agents'
 require 'nhkore/util'
@@ -42,7 +40,7 @@ module NHKore
 'dnt' => '1',
 }
 
-attr_accessor :
+attr_accessor :eat_cookie
 attr_accessor :is_file
 attr_reader :kargs
 attr_accessor :max_redirects
@@ -51,7 +49,7 @@ module NHKore
 attr_accessor :str_or_io
 attr_accessor :url
 
-alias_method :
+alias_method :eat_cookie?,:eat_cookie
 alias_method :is_file?,:is_file
 
 # +max_redirects+ defaults to 3 for safety (infinite-loop attack).
@@ -60,10 +58,10 @@ module NHKore
 #
 # Pass in +header: {}+ for the default HTTP header fields to be set.
 #
-# @param
-#
+# @param eat_cookie [true,false] true to set the HTTP header field 'cookie', which can be an expensive
+#   (time-consuming) operation since it opens the URL again, but necessary for some URLs.
 # @param redirect_rule [nil,:lenient,:strict]
-def initialize(url,
+def initialize(url,eat_cookie: false,header: nil,is_file: false,max_redirects: 3,max_retries: 3,redirect_rule: :strict,str_or_io: nil,**kargs)
 super()
 
 if !header.nil?() && !is_file
@@ -77,7 +75,7 @@ module NHKore
 kargs.merge!(header)
 end
 
-@
+@eat_cookie = eat_cookie
 @is_file = is_file
 @kargs = kargs
 @max_redirects = max_redirects
@@ -88,6 +86,8 @@ module NHKore
 end
 
 def fetch_cookie(url)
+require 'http-cookie'
+
 open_url(url)
 
 cookies = Array(@str_or_io.meta['set-cookie']) # nil will be []
@@ -128,7 +128,7 @@ module NHKore
 # NHK's website tends to always use UTF-8.
 @str_or_io = File.open(url,'rt:UTF-8',**@kargs)
 else
-fetch_cookie(url) if @
+fetch_cookie(url) if @eat_cookie
 open_url(url)
 end
 end
@@ -195,6 +195,8 @@ module NHKore
 end
 
 def rss_doc()
+require 'rss'
+
 return RSS::Parser.parse(@str_or_io,validate: false)
 end
 end
data/lib/nhkore/search_scraper.rb
CHANGED
@@ -45,9 +45,10 @@ module NHKore
 # - https://www3.nhk.or.jp/news/easy/article/disaster_heat.html
 YASASHII_REGEX = /\A[^\.]+\.#{Regexp.quote(YASASHII_SITE)}.+\.html?/i
 
-#
-
-
+# Search Engines are strict, so trigger using the default HTTP header fields
+# with +header: {}+ and fetch/set the cookie using +eat_cookie: true+.
+def initialize(url,eat_cookie: true,header: {},**kargs)
+super(url,eat_cookie: eat_cookie,header: header,**kargs)
 end
 
 def ignore_link?(link,cleaned: true)
@@ -59,6 +60,7 @@ module NHKore
 return true if link =~ /\/about\.html?/ # https://www3.nhk.or.jp/news/easy/about.html
 return true if link =~ /\/movieplayer\.html?/ # https://www3.nhk.or.jp/news/easy/movieplayer.html?id=k10038422811_1207251719_1207251728.mp4&teacuprbbs=4feb73432045dbb97c283d64d459f7cf
 return true if link =~ /\/audio\.html?/ # https://www3.nhk.or.jp/news/easy/player/audio.html?id=k10011555691000
+return true if link =~ /\/news\/easy\/index\.html?/ # http://www3.nhk.or.jp/news/easy/index.html
 
 return false
 end
@@ -157,11 +159,14 @@ module NHKore
 open(uri)
 
 doc = rss_doc()
+rss_links = []
 
 doc.items.each() do |item|
 link = item.link.to_s()
 link = Util.unspace_web_str(link).downcase()
 
+rss_links << link
+
 next if ignore_link?(link)
 next if link !~ regex
 
@@ -170,9 +175,14 @@ module NHKore
 link_count += 1
 end
 
-
+# For RSS, Bing will keep returning the same links over and over
+# if it's the last page or the "first=" query is the wrong count.
+# Therefore, we have to test the previous RSS links (+page.rss_links+).
+if next_page.empty?() && doc.items.length >= 1 && page.rss_links != rss_links
 next_page.count = (page.count < 0) ? 0 : page.count
-next_page.count += doc.items.length
+next_page.count += doc.items.length
+next_page.rss_links = rss_links
+
 uri = URI(page.url.nil?() ? @url : page.url)
 
 Util.replace_uri_query!(uri,first: next_page.count)
@@ -191,12 +201,14 @@ module NHKore
 ###
 class NextPage
 attr_accessor :count
+attr_accessor :rss_links
 attr_accessor :url
 
 def initialize()
 super()
 
 @count = -1
+@rss_links = nil
 @url = nil
 end
 
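The search_scraper.rb change above fixes the changelog's "possible RSS infinite loop": Bing keeps serving the same RSS items when pagination has run out, so the scraper now remembers the previous page's links and only advances when they differ. A simplified, stand-alone version of that stopping rule (a hypothetical helper, not nhkore's actual method):

```ruby
# Decide whether to request the next RSS page.
# prev_links: links seen on the previous page (may be nil on the first page).
# curr_links: links seen on the page just fetched.
def next_page?(prev_links, curr_links)
  return false if curr_links.empty?       # nothing came back; stop
  return false if prev_links == curr_links # feed is repeating itself; stop

  true  # the feed advanced, so keep paginating
end
```

Without the `prev_links == curr_links` comparison, a feed that repeats its last page would be paginated forever.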
data/lib/nhkore/sifter.rb
CHANGED
@@ -21,8 +21,6 @@
 #++
 
 
-require 'csv'
-
 require 'nhkore/article'
 require 'nhkore/fileable'
 require 'nhkore/util'
@@ -143,6 +141,8 @@ module NHKore
 
 # This does not output {caption}.
 def put_csv!()
+require 'csv'
+
 words = sift()
 
 @output = CSV.generate(headers: :first_row,write_headers: true) do |csv|
data/lib/nhkore/splitter.rb
CHANGED
@@ -21,9 +21,6 @@
 #++
 
 
-require 'bimyou_segmenter'
-require 'tiny_segmenter'
-
 require 'nhkore/util'
 
 
@@ -59,6 +56,12 @@ module NHKore
 # @since 0.2.0
 ###
 class BimyouSplitter < Splitter
+def initialize(*)
+require 'bimyou_segmenter'
+
+super
+end
+
 def end_split(str)
 return BimyouSegmenter.segment(str,symbol: false,white_space: false)
 end
@@ -71,6 +74,8 @@ module NHKore
 attr_accessor :tiny
 
 def initialize(*)
+require 'tiny_segmenter'
+
 super
 
 @tiny = TinySegmenter.new()
data/lib/nhkore/util.rb
CHANGED
@@ -24,7 +24,6 @@
 require 'cgi'
 require 'psychgus'
 require 'public_suffix'
-require 'set'
 require 'time'
 require 'uri'
 
@@ -65,8 +64,7 @@ module NHKore
 MIN_SANE_YEAR = 1924
 
 def self.dir_str?(str)
-
-return File.join(str,'') == str
+return str.match?(/[\/\\]\s*\z/)
 end
 
 def self.domain(host,clean: true)
@@ -100,7 +98,8 @@ module NHKore
 end
 
 def self.filename_str?(str)
-
+# Do not use "!dir_str?()"! It's not the same meaning!
+return !str.match?(/[\/\\]/)
 end
 
 def self.guess_year(year)
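The util.rb change above replaces the platform-dependent `File.join` trick with regexes that treat `/` and `\` the same on every OS, as described in the changelog. Extracted for illustration (top-level methods here instead of nhkore's `Util` module methods):

```ruby
# A "dir string" ends with either kind of slash (optionally trailing spaces).
def dir_str?(str)
  str.match?(/[\/\\]\s*\z/)
end

# A "filename string" contains no slash of either kind.
# Note: this is NOT simply !dir_str?() -- a path like 'core/words.csv'
# is neither a dir string nor a filename string.
def filename_str?(str)
  !str.match?(/[\/\\]/)
end
```

This is why the in-code comment warns against defining `filename_str?()` as the negation of `dir_str?()`: both predicates are false for a relative path with an embedded slash.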
data/lib/nhkore/variator.rb
CHANGED
@@ -21,9 +21,6 @@
 #++
 
 
-require 'japanese_deinflector'
-
-
 module NHKore
 ###
 # @author Jonathan Bradley Whited (@esotericpig)
@@ -63,6 +60,8 @@ module NHKore
 attr_accessor :deinflector
 
 def initialize(*)
+require 'japanese_deinflector'
+
 super
 
 @deinflector = JapaneseDeinflector.new()
data/lib/nhkore/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: nhkore
 version: !ruby/object:Gem::Version
-  version: 0.3.
+  version: 0.3.1
 platform: ruby
 authors:
 - Jonathan Bradley Whited (@esotericpig)
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-04-
+date: 2020-04-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bimyou_segmenter
@@ -374,7 +374,7 @@ metadata:
 changelog_uri: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md
 homepage_uri: https://github.com/esotericpig/nhkore
 source_code_uri: https://github.com/esotericpig/nhkore
-post_install_message: " \n NHKore v0.3.
+post_install_message: " \n NHKore v0.3.1\n \n You can now use [nhkore] on the
 command line.\n \n Homepage: https://github.com/esotericpig/nhkore\n \n Code:
 \ https://github.com/esotericpig/nhkore\n Changelog: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md\n
 \ Bugs: https://github.com/esotericpig/nhkore/issues\n \n"