nhkore 0.3.1 → 0.3.6
- checksums.yaml +4 -4
- data/CHANGELOG.md +81 -3
- data/README.md +505 -9
- data/Rakefile +48 -8
- data/lib/nhkore.rb +1 -22
- data/lib/nhkore/app.rb +3 -1
- data/lib/nhkore/article.rb +24 -7
- data/lib/nhkore/article_scraper.rb +21 -16
- data/lib/nhkore/cli/news_cmd.rb +3 -2
- data/lib/nhkore/cli/search_cmd.rb +2 -2
- data/lib/nhkore/cli/sift_cmd.rb +9 -112
- data/lib/nhkore/datetime_parser.rb +342 -0
- data/lib/nhkore/dict_scraper.rb +1 -1
- data/lib/nhkore/lib.rb +59 -0
- data/lib/nhkore/news.rb +13 -4
- data/lib/nhkore/scraper.rb +21 -9
- data/lib/nhkore/search_link.rb +37 -19
- data/lib/nhkore/search_scraper.rb +1 -0
- data/lib/nhkore/sifter.rb +106 -51
- data/lib/nhkore/util.rb +12 -21
- data/lib/nhkore/variator.rb +1 -0
- data/lib/nhkore/version.rb +1 -1
- data/nhkore.gemspec +12 -7
- metadata +21 -5
checksums.yaml CHANGED

```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 445adf6e8abd4da9fd6dd25e9632d5f477b467f6ce8c3dcecae87e3f61305d98
+  data.tar.gz: ca812639ff1edd8da835f5bbb2cde403c9cb63e17568fb3ec367eec00605ec17
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 392607205c53aa2a5dfcde244e5fa6137483d216dc27becf06c76798209d2dcf328f17abee2026d795207d4e783a23fd108e615525445f52ca6442560600cd42
+  data.tar.gz: 7a1219623b6645bbc633ba9c94e767dcf86be8852a7228c1d5ddd3936f61b884897f680369d4c9d9db5aba8ab4561048d59aed15cecf7ba05695c1957f31b0ea
```
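These checksums pin the exact bytes of the released gem. Below is a minimal sketch of re-checking the new SHA-256 values with Ruby's stdlib; it assumes you have extracted `metadata.gz` & `data.tar.gz` from the downloaded `.gem` (a plain tar archive), and the file paths are hypothetical.

```Ruby
# Minimal sketch: verify the new SHA-256 checksums above with Ruby's stdlib.
# Assumes metadata.gz & data.tar.gz were extracted from the .gem file first
# (e.g., `tar -xf nhkore-0.3.6.gem`); the paths here are hypothetical.
require 'digest'

EXPECTED_SHA256 = {
  'metadata.gz' => '445adf6e8abd4da9fd6dd25e9632d5f477b467f6ce8c3dcecae87e3f61305d98',
  'data.tar.gz' => 'ca812639ff1edd8da835f5bbb2cde403c9cb63e17568fb3ec367eec00605ec17',
}.freeze

EXPECTED_SHA256.each do |file,expected|
  actual = Digest::SHA256.file(file).hexdigest
  puts "#{file}: #{(actual == expected) ? 'OK' : 'MISMATCH'}"
end
```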
data/CHANGELOG.md CHANGED

```diff
@@ -2,7 +2,82 @@
 
 Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
-## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.
+## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.6...master)
+
+## [v0.3.6] - 2020-08-18
+
+### Added
+- `update_showcase` Rake task for development & personal site (GitHub Page)
+  - `$ bundle exec rake update_showcase`
+
+### Changed
+- Updated Gems
+
+### Fixed
+- ArticleScraper for title for specific site
+  - https://www3.nhk.or.jp/news/easy/article/disaster_earthquake_illust.html
+- Ignored `/cgi2.*enqform/` URLs from SearchScraper (Bing)
+- Added more detail to dictionary error in ArticleScraper
+
+## [v0.3.5] - 2020-05-04
+
+### Added
+- Added check for environment var `NO_COLOR`
+  - [https://no-color.org/](https://no-color.org/)
+
+### Fixed
+- Fixed URLs stored in YAML data to always be of type String (not URI)
+  - This initially caused a problem in DictScraper.parse_url() from ArticleScraper, but fixed it for all data
+
+## [v0.3.4] - 2020-04-25
+
+### Added
+- DatetimeParser
+  - Extracted from SiftCmd into its own class
+  - Fixed some minor logic bugs from the old code
+  - Added new feature where 1 range can be empty:
+    - `sift ez -d '...2019'` (from = 1924)
+    - `sift ez -d '2019...'` (to = current year)
+    - `sift ez -d '...'` (still an error)
+- Added `update_core` rake task for dev
+  - Makes pushing a new release much easier
+  - See *Hacking.Releasing* section in *README*
+
+### Fixed
+- SiftCmd `parse_sift_datetime()` for `-d/--datetime` option
+  - Didn't work exactly right (as written in *README*) for some special inputs:
+    - `-d '2019...3'`
+    - `-d '3-3'`
+    - `-d '3'`
+
+## [v0.3.3] - 2020-04-23
+
+### Added
+- Added JSON support to Sifter & SiftCmd.
+- Added use of `attr_bool` Gem for `attr_accessor?` & `attr_reader?`.
+
+## [v0.3.2] - 2020-04-22
+
+### Added
+- lib/nhkore/lib.rb
+  - Requires all files, excluding CLI-related files, for speed when using this Gem as a library.
+- Scraper
+  - Added open_file() & reopen().
+- samples/looper.rb
+  - Script example of continuously scraping all articles.
+
+### Changed
+- README
+  - Finished writing the initial version of all sections.
+- ArticleScraper
+  - Changed the `year` param to expect an int, instead of a string.
+- Sifter
+  - In filter_by_datetime(), renamed keyword args `from_filter,to_filter` to simply `from,to`.
+
+### Fixed
+- Reduced load time of app a tiny bit more (see v0.3.1 for details).
+- ArticleScraper
+  - Renamed `mode` param to `strict`. `mode` was overshadowing File.open()'s in Scraper.
 
 ## [v0.3.1] - 2020-04-20
 
@@ -11,10 +86,13 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - NewsCmd/SiftCmd
   - Added `--no-sha256` option to not check if article links have already been scraped based on their contents' SHA-256.
 - Util
-  - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows &
+  - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows & Linux, it will check for both `/` & `\`.
 
 ### Fixed
-- Reduced load time of app from
+- Reduced load time of app from about 1s to about 0.3-0.5s.
+  - Moved many `require '...'` statements into methods.
+  - It looks ugly & is not good coding practice, but a necessary evil.
+  - Load time is still pretty slow (but a lot better!).
 - BingScraper
   - Fixed possible RSS infinite loop.
 
```
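The v0.3.4 entry's one-sided ranges can be exercised straight from the library with `DatetimeParser.parse_range()` (documented in the README diff below). A small sketch, assuming the gem is installed; the endpoint defaults are taken from the changelog entry above, and the exact return shape is an assumption.

```Ruby
# Small sketch of v0.3.4's one-sided ranges via DatetimeParser.parse_range(),
# per the changelog entry above; the exact return shape is an assumption.
require 'nhkore/datetime_parser'

p NHKore::DatetimeParser.parse_range('...2019') # "from" defaults to year 1924
p NHKore::DatetimeParser.parse_range('2019...') # "to" defaults to the current year
#NHKore::DatetimeParser.parse_range('...')      # still an error
```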
data/README.md CHANGED

````diff
@@ -26,6 +26,8 @@ This is similar to a [core word/vocabulary list](https://www.fluentin3months.com
 - [News Command](#news-command-)
 - [Using the Library](#using-the-library-)
 - [Hacking](#hacking-)
+  - [Updating](#updating-)
+  - [Releasing](#releasing-)
 - [License](#license-)
 
 ## For Non-Power Users [^](#contents)
@@ -110,11 +112,12 @@ Example usage:
 
 `$ nhkore -t 300 -m 10 news -D -L -M -d '2011-03-07 06:30' easy -u 'https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html'`
 
-Now that the data from the article has been scraped, you can generate a CSV/HTML/YAML file of the words ordered by frequency:
+Now that the data from the article has been scraped, you can generate a CSV/HTML/JSON/YAML file of the words ordered by frequency:
 
 ```
 $ nhkore sift easy -e csv
 $ nhkore sift easy -e html
+$ nhkore sift easy -e json
 $ nhkore sift easy -e yml
 ```
 
@@ -154,11 +157,11 @@ After obtaining the scraped data, you can `sift` all of the data (or select data
 | --- | --- |
 | CSV | For uploading to a flashcard website (e.g., Memrise, Anki, Buffl) after changing the data appropriately. |
 | HTML | For comfortable viewing in a web browser or for sharing. |
-| YAML | For developers to automatically add translations or to manipulate the data in some other way programmatically. |
+| YAML/JSON | For developers to automatically add translations or to manipulate the data in some other way programmatically. |
 
 The data is sorted by frequency in descending order (i.e., most frequent words first).
 
-If you wish to sort/arrange the data in some other way, CSV editors (e.g., LibreOffice, WPS Office, Microsoft Office) can do this easily and efficiently, or if you are code-savvy, you can programmatically manipulate the CSV/YAML/HTML file.
+If you wish to sort/arrange the data in some other way, CSV editors (e.g., LibreOffice, WPS Office, Microsoft Office) can do this easily and efficiently, or if you are code-savvy, you can programmatically manipulate the CSV/YAML/JSON/HTML file.
 
 The defaults will sift all of the data into a CSV file, which may not be what you want:
 
@@ -203,7 +206,7 @@ You can save the data to a different format using one of these options:
 
 ```
 -e --ext=<value>    type of file (extension) to save;
-                    valid options: [csv, htm, html, yaml, yml];
+                    valid options: [csv, htm, html, json, yaml, yml];
                     not needed if you specify a file extension with
                     the '--out' option: '--out sift.html'
                     (default: csv)
@@ -293,7 +296,7 @@ links:
 
 If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
 
-Currently, it only searches &
+Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
 
 Example usage:
 
@@ -319,6 +322,49 @@ Complete demo:
 
 #### News Command [^](#contents)
 
+In [The Basics](#the-basics-), you learned how to scrape 1 article using the `-u/--url` option with the `news` command.
+
+After creating a file of links from the [search](#search-command-) command (or manually/programmatically), you can also scrape multiple articles from this file using the `news` command.
+
+The defaults will scrape the 1st unscraped article from the `links` file:
+
+`$ nhkore news easy`
+
+You can scrape the 1st **X** unscraped articles with the `-s/--scrape` option:
+
+```
+# Scrape the 1st 11 unscraped articles.
+$ nhkore news -s 11 easy
+```
+
+You may wish to re-scrape articles that have already been scraped with the `-r/--redo` option:
+
+`$ nhkore news -r -s 11 easy`
+
+If you only wish to scrape specific article links, then you should use the `-k/--like` option, which does a fuzzy search on the URLs. For example, `--like '00123'` will match these links:
+
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**23711000/k10012323711000.html
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21401000/k10012321401000.html
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21511000/k10012321511000.html
+- ...
+
+`$ nhkore news -k '00123' -s 11 easy`
+
+Lastly, you can show the dictionary URL and contents for the 1st article if you're getting dictionary-related errors:
+
+```
+# This will exit after showing the 1st article's dictionary.
+$ nhkore news easy --show-dict
+```
+
+For the rest of the options, please see [The Basics](#the-basics-).
+
+Complete demo:
+
+[![asciinema Demo - News](https://asciinema.org/a/322324.png)](https://asciinema.org/a/322324)
+
+When I first scraped all of the articles in [nhkore-core.zip](https://github.com/esotericpig/nhkore/releases/latest), I had to use this [script](samples/looper.rb) because my internet isn't very good.
+
 ## Using the Library [^](#contents)
 
 ### Setup
@@ -336,11 +382,439 @@ In your *Gemfile*:
 ```Ruby
 # Pick one...
 gem 'nhkore', '~> X.X'
-gem 'nhkore', :git => 'https://github.com/esotericpig/
+gem 'nhkore', :git => 'https://github.com/esotericpig/nhkore.git', :tag => 'vX.X.X'
+```
+
+### Require
+
+In order to not require all of the CLI-related files, require this file instead:
+
+```Ruby
+require 'nhkore/lib'
+
+#require 'nhkore' # Slower
 ```
 
 ### Scraper
 
+All scraper classes extend this class. You can either extend it or use it by itself. It's a simple wrapper around *open-uri*, *Nokogiri*, etc.
+
+`initialize` automatically opens (connects to) the URL.
+
+```Ruby
+require 'nhkore/scraper'
+
+class MyScraper < NHKore::Scraper
+  def initialize()
+    super('https://www3.nhk.or.jp/news/easy/')
+  end
+end
+
+m = MyScraper.new()
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+# Read all content into a String.
+mstr = m.read()
+sstr = s.read()
+
+# Get a Nokogiri::HTML object.
+mdoc = m.html_doc()
+sdoc = s.html_doc()
+
+# Get a RSS object.
+s = NHKore::Scraper.new('https://www.bing.com/search?format=rss&q=site%3Anhk.or.jp%2Fnews%2Feasy%2F&count=100')
+
+rss = s.rss_doc()
+```
+
+There are several useful options:
+
+```Ruby
+require 'nhkore/scraper'
+
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+  open_timeout: 300, # Open timeout in seconds (default: nil)
+  read_timeout: 300, # Read timeout in seconds (default: nil)
+
+  # Maximum number of times to retry the URL.
+  # - default: 3
+  # - Open/connect will fail a couple of times on a bad/slow internet connection.
+  max_retries: 10,
+
+  # Maximum number of redirects allowed.
+  # - default: 3
+  # - You can set this to nil or -1, but I recommend using a number
+  #   for safety (infinite-loop attack).
+  max_redirects: 1,
+
+  # How to check redirect URLs for safety.
+  # - default: :strict
+  # - nil      => do not check
+  # - :lenient => check the scheme only
+  #               (i.e., if https, redirect URL must be https)
+  # - :strict  => check the scheme and domain
+  #               (i.e., if https://bing.com, redirect URL must be https://bing.com)
+  redirect_rule: :lenient,
+
+  # Set the HTTP header field 'cookie' from the 'set-cookie' response.
+  # - default: false
+  # - Currently uses the 'http-cookie' Gem.
+  # - This is currently a time-consuming operation because it opens the URL twice.
+  # - Necessary for Search Engines or other sites that require cookies
+  #   in order to block bots.
+  eat_cookie: true,
+
+  # Set HTTP header fields.
+  # - default: nil
+  # - Necessary for Search Engines or other sites that try to block bots.
+  # - Simply pass in a Hash (not nil) to set the default ones.
+  header: {'user-agent' => 'Skynet'}, # Must use strings
+)
+
+# Open the URL yourself. This will be passed in directly to Nokogiri::HTML().
+# - In this way, you can use Faraday, HTTParty, RestClient, httprb/http, or
+#   some other Gem.
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+  str_or_io: URI.open('https://www3.nhk.or.jp/news/easy/',redirect: false)
+)
+
+# Open and parse a file instead of a URL (for offline testing or slow internet).
+s = NHKore::Scraper.new('./my_article.html',is_file: true)
+
+doc = s.html_doc()
+```
+
+Here are some other useful methods:
+
+```Ruby
+require 'nhkore/scraper'
+
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+s.reopen() # Re-open the current URL.
+
+# Get a relative URL.
+url = s.join_url('../../monkey.html')
+puts url # https://www3.nhk.or.jp/monkey.html
+
+# Open a new URL or file.
+s.open(url)
+s.open(url,URI.open(url,redirect: false))
+
+s.open('./my_article.html',is_file: true)
+
+# Open a file manually.
+s.open_file('./my_article.html')
+
+# Fetch the cookie & open a new URL manually.
+s.fetch_cookie(url)
+s.open_url(url)
+```
+
+### SearchScraper & BingScraper
+
+`SearchScraper` is used for scraping Search Engines for NHK News Web (Easy) links. It can also be used for search in general.
+
+By default, it sets the default HTTP header fields and fetches & sets the cookie.
+
+```Ruby
+require 'nhkore/search_scraper'
+
+ss = NHKore::SearchScraper.new('https://www.bing.com/search?q=nhk&count=100')
+
+doc = ss.html_doc()
+
+doc.css('a').each() do |anchor|
+  link = anchor['href']
+
+  next if ss.ignore_link?(link,cleaned: false)
+
+  if link.include?('https://www3.nhk')
+    puts link
+  end
+end
+```
+
+`BingScraper` will search `bing.com` for you.
+
+```Ruby
+require 'nhkore/search_link'
+require 'nhkore/search_scraper'
+
+bs = NHKore::BingScraper.new(:yasashii)
+slinks = NHKore::SearchLinks.new()
+
+next_page = bs.scrape(slinks)
+page_num = 1
+
+while !next_page.empty?()
+  puts "Page #{page_num += 1}: #{next_page.count}"
+
+  bs = NHKore::BingScraper.new(:yasashii,url: next_page.url)
+
+  next_page = bs.scrape(slinks,next_page)
+end
+
+slinks.links.values.each() do |link|
+  puts link.url
+end
+```
+
+### ArticleScraper & DictScraper
+
+`ArticleScraper` scrapes an NHK News Web Easy article. Regular articles aren't currently supported.
+
+```Ruby
+require 'nhkore/article_scraper'
+require 'time'
+
+as = NHKore::ArticleScraper.new(
+  'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html',
+
+  # If false, scrape the article leniently (for older articles which
+  # may not have certain tags, etc.).
+  # - default: true
+  strict: false,
+
+  # {Dict} to use as the dictionary for words (Easy articles).
+  # - default: :scrape
+  # - nil     => don't scrape/use it (necessary for Regular articles)
+  # - :scrape => auto-scrape it using {DictScraper}
+  # - {Dict}  => your own {Dict}
+  dict: nil,
+
+  # Date time to use as a fallback if the article doesn't have one
+  # (for older articles).
+  # - default: nil
+  datetime: Time.new(2020,2,2),
+
+  # Year to use as a fallback if the article doesn't have one
+  # (for older articles).
+  # - default: nil
+  year: 2020,
+)
+
+article = as.scrape()
+
+article.datetime
+article.futsuurl
+article.sha256
+article.title
+article.url
+
+article.words.each() do |key,word|
+  word.defn
+  word.eng
+  word.freq
+  word.kana
+  word.kanji
+  word.key
+end
+
+puts article.to_s(mini: true)
+puts '---'
+puts article
+```
+
+`DictScraper` scrapes an Easy article's dictionary file (JSON).
+
+```Ruby
+require 'nhkore/dict_scraper'
+
+url = 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'
+ds = NHKore::DictScraper.new(
+  url,
+
+  # Change the URL appropriately to the dictionary URL.
+  # - default: true
+  parse_url: true,
+)
+
+puts NHKore::DictScraper.parse_url(url)
+puts
+
+dict = ds.scrape()
+
+dict.entries.each() do |key,entry|
+  entry.id
+
+  entry.defns.each() do |defn|
+    defn.hyoukis.each() {|hyouki| }
+    defn.text
+    defn.words.each() {|word| }
+  end
+
+  puts entry.build_hyouki()
+  puts entry.build_defn()
+  puts '---'
+end
+
+puts
+puts dict
+```
+
+### Fileable
+
+Any class that includes the `Fileable` mixin will have the following methods:
+
+- Class.load_file(file,mode: 'rt:BOM|UTF-8',**kargs)
+- save_file(file,mode: 'wt',**kargs)
+
+Any *kargs* will be passed to `File.open()`.
+
+```Ruby
+require 'nhkore/news'
+require 'nhkore/search_link'
+
+yn = NHKore::YasashiiNews.load_file()
+sl = NHKore::SearchLinks.load_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+
+yn.articles.each() {|key,article| }
+yn.sha256s.each() {|sha256,url| }
+
+sl.links.each() do |key,link|
+  link.datetime
+  link.futsuurl
+  link.scraped?
+  link.sha256
+  link.title
+  link.url
+end
+
+#yn.save_file()
+#sl.save_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+```
+
+### Sifter
+
+`Sifter` will sift & sort the `News` data into a single file. The data is sorted by frequency in descending order (i.e., most frequent words first).
+
+```Ruby
+require 'nhkore/datetime_parser'
+require 'nhkore/news'
+require 'nhkore/sifter'
+require 'time'
+
+news = NHKore::YasashiiNews.load_file()
+
+sifter = NHKore::Sifter.new(news)
+
+sifter.caption = 'Sakura Fields Forever!'
+
+# Filter the data.
+sifter.filter_by_datetime(NHKore::DatetimeParser.parse_range('2019-12-4...7'))
+sifter.filter_by_datetime([Time.new(2019,12,4),Time.new(2019,12,7)])
+sifter.filter_by_datetime(
+  from: Time.new(2019,12,4),to: Time.new(2019,12,7)
+)
+sifter.filter_by_title('桜')
+sifter.filter_by_url('k100')
+
+# Ignore certain columns from the output.
+sifter.ignore(:defn)
+sifter.ignore(:eng)
+
+# An array of the sifted words.
+words = sifter.sift()           # Filtered & Sorted array of Word
+rows = sifter.build_rows(words) # Ignored array of array
+
+# Choose the file format.
+#sifter.put_csv!()
+#sifter.put_html!()
+#sifter.put_json!()
+sifter.put_yaml!()
+
+# Save to a file.
+file = 'sakura.yml'
+
+if !File.exist?(file)
+  sifter.save_file(file)
+end
+```
+
+### Util, UserAgents, & DatetimeParser
+
+These provide a variety of useful methods/constants.
+
+Here are some of the most useful ones:
+
+```Ruby
+require 'nhkore/datetime_parser'
+require 'nhkore/user_agents'
+require 'nhkore/util'
+
+include NHKore
+
+puts '======='
+puts '[ Net ]'
+puts '======='
+# Get a random User Agent for HTTP header field 'User-Agent'.
+# - This is used by default in Scraper/SearchScraper.
+puts "User-Agent: #{UserAgents.sample()}"
+
+uri = URI('https://www.bing.com/search?q=nhk')
+Util.replace_uri_query!(uri,q: 'banana')
+
+puts "URI query: #{uri}" # https://www.bing.com/search?q=banana
+# nhk.or.jp
+puts "Domain: #{Util.domain(URI('https://www.nhk.or.jp/news/easy').host)}"
+# Ben &amp; Jerry&#39;s<br>
+puts "Escape HTML: #{Util.escape_html("Ben & Jerry's\n")}"
+puts
+
+puts '========'
+puts '[ Time ]'
+puts '========'
+puts "JST now:  #{Util.jst_now()}"
+# Drops in JST_OFFSET, does not change hour/min.
+puts "JST time: #{Util.jst_time(Time.now)}"
+puts "JST year: #{Util::JST_YEAR}"
+puts "1999 sane? #{Util.sane_year?(1999)}" # true
+puts "1776 sane? #{Util.sane_year?(1776)}" # false
+puts "Guess 5:  #{DatetimeParser.guess_year(5)}"  # 2005
+puts "Guess 99: #{DatetimeParser.guess_year(99)}" # 1999
+# => [2020-12-01 00:00:00 +0900, 2020-12-31 23:59:59 +0900]
+puts "Parse: #{DatetimeParser.parse_range('2020-12')}"
+puts
+puts "JST timezone offset:        #{Util::JST_OFFSET}"
+puts "JST timezone offset hour:   #{Util::JST_OFFSET_HOUR}"
+puts "JST timezone offset minute: #{Util::JST_OFFSET_MIN}"
+puts
+
+puts '============'
+puts '[ Japanese ]'
+puts '============'
+
+JPN = ['桜','ぶ','ブ']
+
+def fmt_jpn()
+  fmt = []
+
+  JPN.each() do |x|
+    x = yield(x)
+    x = x ? "\u2B55" : Util::JPN_SPACE unless x.is_a?(String)
+    fmt << x
+  end
+
+  return "[ #{fmt.join(' | ')} ]"
+end
+
+puts "          #{fmt_jpn{|x| x}}"
+puts "Hiragana? #{fmt_jpn{|x| Util.hiragana?(x)}}"
+puts "Kana?     #{fmt_jpn{|x| Util.kana?(x)}}"
+puts "Kanji?    #{fmt_jpn{|x| Util.kanji?(x)}}"
+puts "Reduce: #{Util.reduce_jpn_space("' '")}"
+puts
+
+puts '========='
+puts '[ Files ]'
+puts '========='
+puts "Dir str?  #{Util.dir_str?('dir/')}"          # true
+puts "Dir str?  #{Util.dir_str?('dir')}"           # false
+puts "File str? #{Util.filename_str?('file')}"     # true
+puts "File str? #{Util.filename_str?('dir/file')}" # false
+```
+
 ## Hacking [^](#contents)
 
 ```
@@ -370,13 +844,35 @@ $ bundle exec rake nokogiri_other # macOS, Windows, etc.
 
 `$ bundle exec rake doc`
 
-### Installing Locally
+### Installing Locally
+
+You can make some changes/fixes to the code and then install your local version:
 
 `$ bundle exec rake install:local`
 
-###
+### Updating [^](#contents)
+
+This will update *core/* for you:
+
+`$ bundle exec rake update_core`
+
+### Releasing [^](#contents)
+
+1. Update *CHANGELOG.md*, *version.rb*, & *Gemfile.lock*
+   - *Raketary*: `$ raketary bump -v`
+   - Run: `$ bundle update`
+2. Run: `$ bundle exec rake update_core`
+3. Run: `$ bundle exec rake clobber pkg_core`
+4. Create a new release & tag
+   - Add `pkg/nhkore-core.zip`
+5. Run: `$ git pull`
+6. Upload GitHub package
+   - *Raketary*: `$ raketary github_pkg`
+7. Run: `$ bundle exec rake release`
+
+Releasing new HTML file for website:
 
-`$ bundle exec rake
+1. `$ bundle exec rake update_showcase`
 
 ## License [^](#contents)
 
````
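Tying the release together, here is a minimal end-to-end sketch of the new JSON output (added in v0.3.3) through the library API shown in the README diff above. It assumes articles have already been scraped into the default data file, and treating `parse_range('2020')` as a full-year range is an assumption based on the documented range syntax.

```Ruby
# Minimal end-to-end sketch of the new JSON output (v0.3.3+), using the
# library API documented above. Assumes scraped data already exists in the
# default data file; the '2020' full-year range is an assumption.
require 'nhkore/lib'

news = NHKore::YasashiiNews.load_file()
sifter = NHKore::Sifter.new(news)

sifter.filter_by_datetime(NHKore::DatetimeParser.parse_range('2020'))
sifter.put_json!()
sifter.save_file('nhkore-2020.json')
```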