nhkore 0.3.0 → 0.3.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -2
- data/README.md +55 -25
- data/lib/nhkore/app.rb +2 -2
- data/lib/nhkore/cli/fx_cmd.rb +4 -4
- data/lib/nhkore/cli/get_cmd.rb +4 -4
- data/lib/nhkore/cli/news_cmd.rb +12 -2
- data/lib/nhkore/cli/sift_cmd.rb +8 -1
- data/lib/nhkore/dict_scraper.rb +2 -1
- data/lib/nhkore/news.rb +2 -2
- data/lib/nhkore/scraper.rb +11 -9
- data/lib/nhkore/search_scraper.rb +17 -5
- data/lib/nhkore/sifter.rb +2 -2
- data/lib/nhkore/splitter.rb +8 -3
- data/lib/nhkore/util.rb +3 -4
- data/lib/nhkore/variator.rb +2 -3
- data/lib/nhkore/version.rb +1 -1
- metadata +3 -3
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fb2c0e6e53995b874a9e53c44b024f993032433d1a87c37e7b7bdea69965902d
+  data.tar.gz: 13d34c53fe9af9efa985c05089b1588eb1e76d6321f9aff18cc5da80598a52d4
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 643723d42e939a7852eca3b90c3ec4e65085838317eb59c1d8f21f79dd647d2e77e5ea68ab2ff3b5a208608f9bf350121a9918cb318dec6c3047731b73f59294
+  data.tar.gz: 3481fea3a3895a5b85ac3fcd5a77fe9b811f84e9a19b395a1de1d2e9b31fda93c5fb49a8d7d43581e05cb90c6f844f8537c5a97d73937c2b8ee97728ac7c7a1f
data/CHANGELOG.md
CHANGED

@@ -2,7 +2,21 @@
 
 Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
-## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.
+## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.1...master)
+
+## [v0.3.1] - 2020-04-20
+
+### Changed
+- Fleshed out more of README.
+- NewsCmd/SiftCmd
+  - Added `--no-sha256` option to not check if article links have already been scraped based on their contents' SHA-256.
+- Util
+  - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows & Linux, it will check for both `/` & `\`.
+
+### Fixed
+- Reduced load time of app from ~1s to ~0.3-0.5s by moving some requires into methods.
+- BingScraper
+  - Fixed possible RSS infinite loop.
 
 ## [v0.3.0] - 2020-04-12
 
@@ -13,7 +27,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 ### Changed
 - BingCmd => SearchCmd
   - Major (breaking) change.
-  - Changed `$nhkore bing easy` to:
+  - Changed `$ nhkore bing easy` to:
     - `$ nhkore search easy bing`
     - `$ nhkore se ez b`
 - App
data/README.md
CHANGED

@@ -10,7 +10,7 @@ A CLI app that scrapes [NHK News Web Easy](https://www3.nhk.or.jp/news/easy/) to
 
 This is similar to a [core word/vocabulary list](https://www.fluentin3months.com/core-japanese-words/), hence the name NHKore.
 
-[![asciinema Demo
+[![asciinema Demo](https://asciinema.org/a/318958.png)](https://asciinema.org/a/318958)
 
 ## Contents
 
@@ -18,7 +18,7 @@ This is similar to a [core word/vocabulary list](https://www.fluentin3months.com
 - [Installing](#installing-)
 - [Using](#using-)
   - [The Basics](#the-basics-)
-  - [Unlimited
+  - [Unlimited Powah!](#unlimited-powah-)
   - [Get Command](#get-command-)
   - [Sift Command](#sift-command-)
   - [Sakura Fields Forever](#sakura-fields-forever-)
@@ -51,8 +51,8 @@ Manually:
 ```
 $ git clone 'https://github.com/esotericpig/nhkore.git'
 $ cd nhkore
-$
-$
+$ bundle install
+$ bundle exec rake install:local
 ```
 
 If there are errors running `nhkore`, you may need to also [install Nokogiri](https://nokogiri.org/tutorials/installing_nokogiri.html) manually, which is used for scraping HTML.
@@ -118,22 +118,15 @@ $ nhkore sift easy -e html
 $ nhkore sift easy -e yml
 ```
 
-
+Complete demo:
 
-
-| --- | --- |
-| `$ nhkore sift easy -u k10011862381000` | Filter by URL |
-| `$ nhkore sift easy -t '植えられた桜'` | Filter by title |
-| `$ nhkore sift easy -d '2019-3-29 11:30'` | Filter by date time |
-| `$ nhkore sift easy -d '2019-3-29' -t '桜'` | Filter by date time & title |
-| `$ nhkore sift easy -d '2019-3-29' -t '桜' -e html` | Filter & output HTML |
-| `$ nhkore sift easy -d '2019-3-29' -t '桜' -o 'sakura.html'` | Filter & output HTML |
+[![asciinema Demo - The Basics](https://asciinema.org/a/318958.png)](https://asciinema.org/a/318958)
 
-
+### Unlimited Powah! [^](#contents)
 
-
+Generate a core word list (e.g., CSV file) for 1 or more pre-scraped articles with ease.
 
-
+Unlimited powah at your finger tips!
 
 #### Get Command [^](#contents)
 
@@ -151,7 +144,7 @@ By default, it will extract the data to `./core/`. You can change this:
 
 Complete demo:
 
-[![asciinema Demo - Get](https://asciinema.org/a/
+[![asciinema Demo - Get](https://asciinema.org/a/318967.png)](https://asciinema.org/a/318967)
 
 #### Sift Command [^](#contents)
 
@@ -189,12 +182,21 @@ You can filter the data by using different options:
 Filter examples:
 
 ```
+# Filter by URL.
+$ nhkore sift easy -u 'k10011862381000'
+
+# Filter by title.
+$ nhkore sift easy -t 'マリオ'
+$ nhkore sift easy -t '植えられた桜'
+
+# Filter by date time.
 $ nhkore sift easy -d 2019
 $ nhkore sift easy -d '2019-12'
-$ nhkore sift easy -d '2019-7-4...9'
+$ nhkore sift easy -d '2019-7-4...9' # July 4th to 9th of 2019
 $ nhkore sift easy -d '2019-12-25 13:10'
-
-
+
+# Filter by date time & title.
+$ nhkore sift easy -d '2019-3-29' -t '桜'
 ```
 
 You can save the data to a different format using one of these options:
@@ -232,10 +234,14 @@ Lastly, you can ignore certain columns from the output. Definitions can be quite
 
 Complete demo:
 
-[![asciinema Demo - Sift](https://asciinema.org/a/
+[![asciinema Demo - Sift](https://asciinema.org/a/318982.png)](https://asciinema.org/a/318982)
 
 ### Sakura Fields Forever [^](#contents)
 
+No more waiting on a new release with pre-scraped files.
+
+Scrape all of the latest articles for yourself, forever!
+
 #### Search Command [^](#contents)
 
 The [news](#news-command-) command (for scraping articles) relies on having a file of article links.
@@ -258,9 +264,9 @@ links:
 
 Only the key (which is the URL) and the `url` field are required. The rest of the fields will be populated when you scrape the data.
 
-> <rambling>
-> Originally, I was planning on using a different key so that's why the URL is duplicated. This also allows for a possible future breaking version (major version change) to alter the key. In addition, I was originally planning to allow filtering in this file, so that's why additional fields are populated after scraping the data.
-> </rambling>
+> <rambling>
+> Originally, I was planning on using a different key so that's why the URL is duplicated. This also allows for a possible future breaking version (major version change) to alter the key. In addition, I was originally planning to allow filtering in this file, so that's why additional fields are populated after scraping the data.
+> </rambling>
 
 Example after running the `news` command:
 
@@ -287,6 +293,30 @@ links:
 
 If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
 
+Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
+
+Example usage:
+
+`$ nhkore search easy bing`
+
+There are a few notable options:
+
+```
+-r --results=<value>      number of results per page to request from search
+                          (default: 100)
+   --show-count           show the number of links scraped and exit;
+                          useful for manually writing/updating scripts
+                          (but not for use in a variable);
+                          implies '--dry-run' option
+   --show-urls            show the URLs -- if any -- used when searching &
+                          scraping and exit; you can download these for offline
+                          testing and/or slow internet (see '--in' option)
+```
+
+Complete demo:
+
+[![asciinema Demo - Search](https://asciinema.org/a/320457.png)](https://asciinema.org/a/320457)
+
 #### News Command [^](#contents)
 
 ## Using the Library [^](#contents)
@@ -306,7 +336,7 @@ In your *Gemfile*:
 ```Ruby
 # Pick one...
 gem 'nhkore', '~> X.X'
-gem 'nhkore', :git => 'https://github.com/esotericpig/psychgus.git', :tag => 'vX.X'
+gem 'nhkore', :git => 'https://github.com/esotericpig/psychgus.git', :tag => 'vX.X.X'
 ```
 
 ### Scraper
data/lib/nhkore/app.rb
CHANGED

@@ -24,8 +24,6 @@
 require 'cri'
 require 'highline'
 require 'rainbow'
-require 'set'
-require 'tty-progressbar'
 require 'tty-spinner'
 
 require 'nhkore/error'
@@ -320,6 +318,8 @@ module NHKore
   def build_progress_bar(title,download: false,total: 100,type: @progress_bar,width: 33,**kargs)
     case type
     when :default,:classic
+      require 'tty-progressbar'
+
       msg = "#{title} [:bar] :percent :eta".dup()
       msg << ' :byte_rate/s' if download
 
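The hunk above shows the pattern behind the release's "Fixed" note on load time: a top-level `require` is moved into the method that first needs the gem, so the library is only loaded if that code path actually runs. A minimal sketch of the idea — the `Reporter` class and `to_csv` method here are hypothetical illustrations, not nhkore's API:

```ruby
class Reporter
  # 'csv' is only loaded the first time a report is generated, so programs
  # that never call this method skip the loading cost entirely at startup.
  def to_csv(rows)
    require 'csv' # Kernel#require is idempotent; repeat calls are cheap no-ops.

    CSV.generate do |csv|
      rows.each { |row| csv << row }
    end
  end
end

Reporter.new.to_csv([[1, 2], [3, 4]]) # => "1,2\n3,4\n"
```

The trade-off is that the first call to the method pays the load cost instead of the app's startup, which is usually the right choice for a CLI with many subcommands.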
data/lib/nhkore/cli/fx_cmd.rb
CHANGED

@@ -59,13 +59,13 @@ module CLI
       bars = nil
 
       if @cmd_opts[:all]
-        bars =
+        bars = {default: :default,classic: :classic,no: :no}
       else
-        bars =
+        bars = {user: @progress_bar}
       end
 
-      bars.each() do |bar|
-        name =
+      bars.each() do |name,bar|
+        name = name.to_s().capitalize()
         bar = build_progress_bar("Testing #{name} progress",download: false,type: bar)
 
         bar.start()
data/lib/nhkore/cli/get_cmd.rb
CHANGED

@@ -21,10 +21,6 @@
 #++
 
 
-require 'down/net_http'
-require 'tempfile'
-require 'zip'
-
 require 'nhkore/util'
 
 
@@ -73,6 +69,10 @@ module CLI
     end
 
     def run_get_cmd()
+      require 'down/net_http'
+      require 'tempfile'
+      require 'zip'
+
       build_out_dir(:out,default_dir: Util::CORE_DIR)
 
       return unless check_out_dir(:out)
data/lib/nhkore/cli/news_cmd.rb
CHANGED

@@ -97,6 +97,11 @@ module CLI
         do not try to parse the dictionary files for the articles; useful in case of errors trying to load
         the dictionaries (or for offline testing)
       EOD
+      flag :H,'no-sha256',<<-EOD
+        do not check the SHA-256 of the content to see if an article has already been scraped;
+        for example, 2 URLs with the same content, but 1 with 'http' & 1 with 'https', will both be scraped;
+        this is useful if 2 articles have the same SHA-256, but different content (unlikely)
+      EOD
       option :o,:out,<<-EOD,argument: :required,transform: -> (value) do
         'directory/file' to save words to; if you only specify a directory or a file, it will attach
         the appropriate default directory/file name
@@ -196,6 +201,7 @@ module CLI
       max_scrapes = @cmd_opts[:scrape]
       max_scrapes = DEFAULT_NEWS_SCRAPE if max_scrapes.nil?()
       missingno = @cmd_opts[:missingno]
+      no_sha256 = @cmd_opts[:no_sha256]
       out_file = @cmd_opts[:out]
       redo_scrapes = @cmd_opts[:redo]
       show_dict = @cmd_opts[:show_dict]
@@ -219,7 +225,9 @@ module CLI
       scrape_count = 0
 
       if File.exist?(out_file)
-        news = (type == :yasashii) ?
+        news = (type == :yasashii) ?
+          YasashiiNews.load_file(out_file,overwrite: no_sha256) :
+          FutsuuNews.load_file(out_file,overwrite: no_sha256)
       else
         news = (type == :yasashii) ? YasashiiNews.new() : FutsuuNews.new()
       end
@@ -357,9 +365,11 @@ module CLI
     def scraped_news_article?(news,link)
       return true if link.scraped?()
 
+      no_sha256 = @cmd_opts[:no_sha256]
+
       article = news.article(link.url)
 
-      if article.nil?()
+      if !no_sha256 && article.nil?()
         if !Util.empty_web_str?(link.sha256) && news.sha256?(link.sha256)
           article = news.article_with_sha256(link.sha256)
         end
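The SHA-256 check that the new `--no-sha256` flag disables works by hashing an article's content and comparing the digest against hashes already stored, so the same article reachable via two URLs (e.g. `http` vs. `https`) is only kept once. A rough sketch of that idea, using a hypothetical `ArticleStore` class rather than nhkore's actual classes:

```ruby
require 'digest'

# Tracks content hashes so the same article body isn't stored twice,
# even when it was fetched from two different URLs.
class ArticleStore
  def initialize
    @sha256s = {} # digest => first URL seen with that content
  end

  # Returns true if the content is new; false if it's a duplicate.
  def add(url, content)
    sha256 = Digest::SHA256.hexdigest(content)
    return false if @sha256s.key?(sha256)

    @sha256s[sha256] = url
    true
  end
end

store = ArticleStore.new
store.add('http://example.com/a', 'same body')  # => true  (new content)
store.add('https://example.com/a', 'same body') # => false (duplicate content)
```

As the flag's help text notes, this only misfires in the astronomically unlikely case of two different articles hashing to the same SHA-256.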
data/lib/nhkore/cli/sift_cmd.rb
CHANGED

@@ -118,6 +118,10 @@ module CLI
       EOD
         app.check_empty_opt(:out,value)
       end
+      flag :H,'no-sha256',<<-EOD
+        if you used this option with the 'news' command, then you'll also need this option here
+        to not fail on "duplicate" articles; see '#{App::NAME} news'
+      EOD
       option :t,:title,'title to filter on, where search text only needs to be somewhere in the title',
         argument: :required
       option :u,:url,'URL to filter on, where search text only needs to be somewhere in the URL',
@@ -326,13 +330,16 @@ module CLI
       in_file = @cmd_opts[:in]
       no_defn = @cmd_opts[:no_defn]
       no_eng = @cmd_opts[:no_eng]
+      no_sha256 = @cmd_opts[:no_sha256]
       out_file = @cmd_opts[:out]
       title_filter = @cmd_opts[:title]
       url_filter = @cmd_opts[:url]
 
       start_spin("Sifting NHK News Web #{news_name} data")
 
-      news = (type == :yasashii) ?
+      news = (type == :yasashii) ?
+        YasashiiNews.load_file(in_file,overwrite: no_sha256) :
+        FutsuuNews.load_file(in_file,overwrite: no_sha256)
 
       sifter = Sifter.new(news)
 
data/lib/nhkore/dict_scraper.rb
CHANGED

@@ -21,7 +21,6 @@
 #++
 
 
-require 'json'
 require 'nhkore/dict'
 require 'nhkore/error'
 require 'nhkore/scraper'
@@ -59,6 +58,8 @@ module NHKore
     end
 
     def scrape()
+      require 'json'
+
       json = JSON.load(@str_or_io)
 
       return Dict.new() if json.nil?()
data/lib/nhkore/news.rb
CHANGED

@@ -73,7 +73,7 @@ module NHKore
       coder[:articles] = @articles
     end
 
-    def self.load_data(data,article_class: Article,file: nil,news_class: News,**kargs)
+    def self.load_data(data,article_class: Article,file: nil,news_class: News,overwrite: false,**kargs)
       data = Util.load_yaml(data,file: file)
 
       articles = data[:articles]
@@ -83,7 +83,7 @@ module NHKore
     if !articles.nil?()
       articles.each() do |key,hash|
         key = key.to_s() # Change from a symbol
-        news.add_article(article_class.load_data(key,hash),key: key)
+        news.add_article(article_class.load_data(key,hash),key: key,overwrite: overwrite)
       end
     end
 
data/lib/nhkore/scraper.rb
CHANGED

@@ -21,10 +21,8 @@
 #++
 
 
-require 'http-cookie'
 require 'nokogiri'
 require 'open-uri'
-require 'rss'
 
 require 'nhkore/user_agents'
 require 'nhkore/util'
@@ -42,7 +40,7 @@ module NHKore
     'dnt' => '1',
   }
 
-  attr_accessor :
+  attr_accessor :eat_cookie
   attr_accessor :is_file
   attr_reader :kargs
   attr_accessor :max_redirects
@@ -51,7 +49,7 @@ module NHKore
   attr_accessor :str_or_io
   attr_accessor :url
 
-  alias_method :
+  alias_method :eat_cookie?,:eat_cookie
   alias_method :is_file?,:is_file
 
   # +max_redirects+ defaults to 3 for safety (infinite-loop attack).
@@ -60,10 +58,10 @@ module NHKore
   #
   # Pass in +header: {}+ for the default HTTP header fields to be set.
   #
-  # @param
-  #
+  # @param eat_cookie [true,false] true to set the HTTP header field 'cookie', which can be an expensive
+  #   (time-consuming) operation since it opens the URL again, but necessary for some URLs.
   # @param redirect_rule [nil,:lenient,:strict]
-  def initialize(url,
+  def initialize(url,eat_cookie: false,header: nil,is_file: false,max_redirects: 3,max_retries: 3,redirect_rule: :strict,str_or_io: nil,**kargs)
     super()
 
     if !header.nil?() && !is_file
@@ -77,7 +75,7 @@ module NHKore
       kargs.merge!(header)
     end
 
-    @
+    @eat_cookie = eat_cookie
     @is_file = is_file
     @kargs = kargs
     @max_redirects = max_redirects
@@ -88,6 +86,8 @@ module NHKore
   end
 
   def fetch_cookie(url)
+    require 'http-cookie'
+
     open_url(url)
 
     cookies = Array(@str_or_io.meta['set-cookie']) # nil will be []
@@ -128,7 +128,7 @@ module NHKore
       # NHK's website tends to always use UTF-8.
       @str_or_io = File.open(url,'rt:UTF-8',**@kargs)
     else
-      fetch_cookie(url) if @
+      fetch_cookie(url) if @eat_cookie
       open_url(url)
     end
   end
@@ -195,6 +195,8 @@ module NHKore
   end
 
   def rss_doc()
+    require 'rss'
+
     return RSS::Parser.parse(@str_or_io,validate: false)
   end
 end
data/lib/nhkore/search_scraper.rb
CHANGED

@@ -45,9 +45,10 @@ module NHKore
   # - https://www3.nhk.or.jp/news/easy/article/disaster_heat.html
   YASASHII_REGEX = /\A[^\.]+\.#{Regexp.quote(YASASHII_SITE)}.+\.html?/i
 
-  #
-
+  # Search Engines are strict, so trigger using the default HTTP header fields
+  # with +header: {}+ and fetch/set the cookie using +eat_cookie: true+.
+  def initialize(url,eat_cookie: true,header: {},**kargs)
+    super(url,eat_cookie: eat_cookie,header: header,**kargs)
   end
 
   def ignore_link?(link,cleaned: true)
@@ -59,6 +60,7 @@ module NHKore
     return true if link =~ /\/about\.html?/ # https://www3.nhk.or.jp/news/easy/about.html
     return true if link =~ /\/movieplayer\.html?/ # https://www3.nhk.or.jp/news/easy/movieplayer.html?id=k10038422811_1207251719_1207251728.mp4&teacuprbbs=4feb73432045dbb97c283d64d459f7cf
     return true if link =~ /\/audio\.html?/ # https://www3.nhk.or.jp/news/easy/player/audio.html?id=k10011555691000
+    return true if link =~ /\/news\/easy\/index\.html?/ # http://www3.nhk.or.jp/news/easy/index.html
 
     return false
   end
@@ -157,11 +159,14 @@ module NHKore
     open(uri)
 
     doc = rss_doc()
+    rss_links = []
 
     doc.items.each() do |item|
       link = item.link.to_s()
       link = Util.unspace_web_str(link).downcase()
 
+      rss_links << link
+
       next if ignore_link?(link)
       next if link !~ regex
 
@@ -170,9 +175,14 @@ module NHKore
       link_count += 1
     end
 
-
+    # For RSS, Bing will keep returning the same links over and over
+    # if it's the last page or the "first=" query is the wrong count.
+    # Therefore, we have to test the previous RSS links (+page.rss_links+).
+    if next_page.empty?() && doc.items.length >= 1 && page.rss_links != rss_links
       next_page.count = (page.count < 0) ? 0 : page.count
-      next_page.count += doc.items.length
+      next_page.count += doc.items.length
+      next_page.rss_links = rss_links
+
       uri = URI(page.url.nil?() ? @url : page.url)
 
       Util.replace_uri_query!(uri,first: next_page.count)
@@ -191,12 +201,14 @@ module NHKore
   ###
   class NextPage
     attr_accessor :count
+    attr_accessor :rss_links
     attr_accessor :url
 
     def initialize()
      super()
 
      @count = -1
+      @rss_links = nil
      @url = nil
    end
 
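The RSS infinite-loop fix above boils down to: stop paginating when a page returns exactly the same set of links as the previous page, which is how the feed signals there are no more results. A generic sketch of that guard under stated assumptions — `collect_all_links` and the `pages` array are hypothetical stand-ins for the real Bing RSS pagination, not nhkore's API:

```ruby
# Walk a sequence of "pages" of links, stopping when a page is empty
# or repeats the previous page's links (the feed has run out).
def collect_all_links(pages)
  all = []
  prev_links = nil
  index = 0

  loop do
    links = pages[index] || []
    break if links.empty? || links == prev_links # same page again: stop

    all.concat(links)
    prev_links = links
    index += 1
  end

  all
end

pages = [%w[a b], %w[c d], %w[c d]] # last page repeats, so the loop ends
collect_all_links(pages) # => ["a", "b", "c", "d"]
```

Without the `links == prev_links` comparison, a feed that keeps serving its last page would loop forever — exactly the bug the `rss_links`/`page.rss_links` check closes.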
data/lib/nhkore/sifter.rb
CHANGED

@@ -21,8 +21,6 @@
 #++
 
 
-require 'csv'
-
 require 'nhkore/article'
 require 'nhkore/fileable'
 require 'nhkore/util'
@@ -143,6 +141,8 @@ module NHKore
 
   # This does not output {caption}.
   def put_csv!()
+    require 'csv'
+
     words = sift()
 
     @output = CSV.generate(headers: :first_row,write_headers: true) do |csv|
data/lib/nhkore/splitter.rb
CHANGED

@@ -21,9 +21,6 @@
 #++
 
 
-require 'bimyou_segmenter'
-require 'tiny_segmenter'
-
 require 'nhkore/util'
 
 
@@ -59,6 +56,12 @@ module NHKore
   # @since 0.2.0
   ###
   class BimyouSplitter < Splitter
+    def initialize(*)
+      require 'bimyou_segmenter'
+
+      super
+    end
+
     def end_split(str)
       return BimyouSegmenter.segment(str,symbol: false,white_space: false)
     end
@@ -71,6 +74,8 @@ module NHKore
     attr_accessor :tiny
 
     def initialize(*)
+      require 'tiny_segmenter'
+
       super
 
       @tiny = TinySegmenter.new()
data/lib/nhkore/util.rb
CHANGED

@@ -24,7 +24,6 @@
 require 'cgi'
 require 'psychgus'
 require 'public_suffix'
-require 'set'
 require 'time'
 require 'uri'
 
@@ -65,8 +64,7 @@ module NHKore
   MIN_SANE_YEAR = 1924
 
   def self.dir_str?(str)
-
-    return File.join(str,'') == str
+    return str.match?(/[\/\\]\s*\z/)
   end
 
   def self.domain(host,clean: true)
@@ -100,7 +98,8 @@ module NHKore
   end
 
   def self.filename_str?(str)
-
+    # Do not use "!dir_str?()"! It's not the same meaning!
+    return !str.match?(/[\/\\]/)
   end
 
   def self.guess_year(year)
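The new `dir_str?()`/`filename_str?()` behavior can be checked directly: a directory string ends with a slash of either kind (the old `File.join` trick only matched the current platform's separator), and a filename string contains no slash at all — which is deliberately not the negation of `dir_str?()`, since a path like `a/b.yml` fails both tests. A small sketch mirroring the regexes from the diff, wrapped in a hypothetical `PathCheck` module for illustration:

```ruby
module PathCheck
  # A "directory string" ends with '/' or '\' (optionally followed by
  # whitespace), so it behaves the same on Windows and Linux.
  def self.dir_str?(str)
    str.match?(/[\/\\]\s*\z/)
  end

  # A "filename string" contains no slash of either kind anywhere.
  # Note: this is NOT simply !dir_str?() -- 'a/b.yml' fails both tests.
  def self.filename_str?(str)
    !str.match?(/[\/\\]/)
  end
end

PathCheck.dir_str?('core/')        # => true
PathCheck.dir_str?('core\\')       # => true (backslash works too now)
PathCheck.filename_str?('a.yml')   # => true
PathCheck.filename_str?('a/b.yml') # => false (has a slash, but isn't a dir)
```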
data/lib/nhkore/variator.rb
CHANGED

@@ -21,9 +21,6 @@
 #++
 
 
-require 'japanese_deinflector'
-
-
 module NHKore
   ###
   # @author Jonathan Bradley Whited (@esotericpig)
@@ -63,6 +60,8 @@ module NHKore
   attr_accessor :deinflector
 
   def initialize(*)
+    require 'japanese_deinflector'
+
     super
 
     @deinflector = JapaneseDeinflector.new()
data/lib/nhkore/version.rb
CHANGED
metadata
CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: nhkore
 version: !ruby/object:Gem::Version
-  version: 0.3.
+  version: 0.3.1
 platform: ruby
 authors:
 - Jonathan Bradley Whited (@esotericpig)
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-04-
+date: 2020-04-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bimyou_segmenter
@@ -374,7 +374,7 @@ metadata:
   changelog_uri: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md
   homepage_uri: https://github.com/esotericpig/nhkore
   source_code_uri: https://github.com/esotericpig/nhkore
-post_install_message: " \n NHKore v0.3.
+post_install_message: " \n NHKore v0.3.1\n \n You can now use [nhkore] on the
   command line.\n \n Homepage: https://github.com/esotericpig/nhkore\n \n Code:
   \ https://github.com/esotericpig/nhkore\n Changelog: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md\n
   \ Bugs: https://github.com/esotericpig/nhkore/issues\n \n"