nhkore 0.3.1 → 0.3.6
This diff shows the changes between publicly released package versions as they appear in their respective public registries; it is provided for informational purposes only.
- checksums.yaml +4 -4
- data/CHANGELOG.md +81 -3
- data/README.md +505 -9
- data/Rakefile +48 -8
- data/lib/nhkore.rb +1 -22
- data/lib/nhkore/app.rb +3 -1
- data/lib/nhkore/article.rb +24 -7
- data/lib/nhkore/article_scraper.rb +21 -16
- data/lib/nhkore/cli/news_cmd.rb +3 -2
- data/lib/nhkore/cli/search_cmd.rb +2 -2
- data/lib/nhkore/cli/sift_cmd.rb +9 -112
- data/lib/nhkore/datetime_parser.rb +342 -0
- data/lib/nhkore/dict_scraper.rb +1 -1
- data/lib/nhkore/lib.rb +59 -0
- data/lib/nhkore/news.rb +13 -4
- data/lib/nhkore/scraper.rb +21 -9
- data/lib/nhkore/search_link.rb +37 -19
- data/lib/nhkore/search_scraper.rb +1 -0
- data/lib/nhkore/sifter.rb +106 -51
- data/lib/nhkore/util.rb +12 -21
- data/lib/nhkore/variator.rb +1 -0
- data/lib/nhkore/version.rb +1 -1
- data/nhkore.gemspec +12 -7
- metadata +21 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 445adf6e8abd4da9fd6dd25e9632d5f477b467f6ce8c3dcecae87e3f61305d98
+  data.tar.gz: ca812639ff1edd8da835f5bbb2cde403c9cb63e17568fb3ec367eec00605ec17
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 392607205c53aa2a5dfcde244e5fa6137483d216dc27becf06c76798209d2dcf328f17abee2026d795207d4e783a23fd108e615525445f52ca6442560600cd42
+  data.tar.gz: 7a1219623b6645bbc633ba9c94e767dcf86be8852a7228c1d5ddd3936f61b884897f680369d4c9d9db5aba8ab4561048d59aed15cecf7ba05695c1957f31b0ea
data/CHANGELOG.md
CHANGED
@@ -2,7 +2,82 @@
 
 Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
-## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.
+## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.6...master)
+
+## [v0.3.6] - 2020-08-18
+
+### Added
+- `update_showcase` Rake task for development & personal site (GitHub Page)
+  - `$ bundle exec rake update_showcase`
+
+### Changed
+- Updated Gems
+
+### Fixed
+- ArticleScraper for title for specific site
+  - https://www3.nhk.or.jp/news/easy/article/disaster_earthquake_illust.html
+- Ignored `/cgi2.*enqform/` URLs from SearchScraper (Bing)
+- Added more detail to dictionary error in ArticleScraper
+
+## [v0.3.5] - 2020-05-04
+
+### Added
+- Added check for environment var `NO_COLOR`
+  - [https://no-color.org/](https://no-color.org/)
+
+### Fixed
+- Fixed URLs stored in YAML data to always be of type String (not URI)
+  - This initially caused a problem in DictScraper.parse_url() from ArticleScraper, but fixed it for all data
+
+## [v0.3.4] - 2020-04-25
+
+### Added
+- DatetimeParser
+  - Extracted from SiftCmd into its own class
+  - Fixed some minor logic bugs from the old code
+  - Added new feature where 1 range can be empty:
+    - `sift ez -d '...2019'` (from = 1924)
+    - `sift ez -d '2019...'` (to = current year)
+    - `sift ez -d '...'` (still an error)
+- Added `update_core` rake task for dev
+  - Makes pushing a new release much easier
+  - See *Hacking.Releasing* section in *README*
+
+### Fixed
+- SiftCmd `parse_sift_datetime()` for `-d/--datetime` option
+  - Didn't work exactly right (as written in *README*) for some special inputs:
+    - `-d '2019...3'`
+    - `-d '3-3'`
+    - `-d '3'`
+
+## [v0.3.3] - 2020-04-23
+
+### Added
+- Added JSON support to Sifter & SiftCmd.
+- Added use of `attr_bool` Gem for `attr_accessor?` & `attr_reader?`.
+
+## [v0.3.2] - 2020-04-22
+
+### Added
+- lib/nhkore/lib.rb
+  - Requires all files, excluding CLI-related files for speed when using this Gem as a library.
+- Scraper
+  - Added open_file() & reopen().
+- samples/looper.rb
+  - Script example of continuously scraping all articles.
+
+### Changed
+- README
+  - Finished writing the initial version of all sections.
+- ArticleScraper
+  - Changed the `year` param to expect an int, instead of a string.
+- Sifter
+  - In filter_by_datetime(), renamed keyword args `from_filter,to_filter` to simply `from,to`.
+
+### Fixed
+- Reduced load time of app a tiny bit more (see v0.3.1 for details).
+- ArticleScraper
+  - Renamed `mode` param to `strict`. `mode` was overshadowing File.open()'s in Scraper.
 
 ## [v0.3.1] - 2020-04-20
 
@@ -11,10 +86,13 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - NewsCmd/SiftCmd
   - Added `--no-sha256` option to not check if article links have already been scraped based on their contents' SHA-256.
 - Util
-  - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows &
+  - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows & Linux, it will check for both `/` & `\`.
 
 ### Fixed
-- Reduced load time of app from
+- Reduced load time of app from about 1s to about 0.3-0.5s.
+  - Moved many `require '...'` statements into methods.
+  - It looks ugly & is not good coding practice, but a necessary evil.
+  - Load time is still pretty slow (but a lot better!).
 - BingScraper
   - Fixed possible RSS infinite loop.
 
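The empty-range behavior that v0.3.4 adds to `DatetimeParser` (described in the changelog above) can be exercised directly through `parse_range`, whose usage also appears in the README examples further below. A minimal sketch, with the expected results taken from the changelog notes rather than verified output:

```Ruby
require 'nhkore/datetime_parser'

# '...2019' leaves the "from" side empty, so it falls back to the minimum
# year (1924, per the changelog); '2019...' leaves the "to" side empty, so
# it falls back to the current year. '...' (both sides empty) is still an error.
from, to = NHKore::DatetimeParser.parse_range('...2019')
puts from # roughly 1924-01-01 00:00:00 +0900
puts to   # 2019-12-31 23:59:59 +0900

from, to = NHKore::DatetimeParser.parse_range('2019...')
puts to   # end of the current year
```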
data/README.md
CHANGED
@@ -26,6 +26,8 @@ This is similar to a [core word/vocabulary list](https://www.fluentin3months.com
 - [News Command](#news-command-)
 - [Using the Library](#using-the-library-)
 - [Hacking](#hacking-)
+  - [Updating](#updating-)
+  - [Releasing](#releasing-)
 - [License](#license-)
 
 ## For Non-Power Users [^](#contents)
@@ -110,11 +112,12 @@ Example usage:
 
 `$ nhkore -t 300 -m 10 news -D -L -M -d '2011-03-07 06:30' easy -u 'https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html'`
 
-Now that the data from the article has been scraped, you can generate a CSV/HTML/YAML file of the words ordered by frequency:
+Now that the data from the article has been scraped, you can generate a CSV/HTML/JSON/YAML file of the words ordered by frequency:
 
 ```
 $ nhkore sift easy -e csv
 $ nhkore sift easy -e html
+$ nhkore sift easy -e json
 $ nhkore sift easy -e yml
 ```
 
@@ -154,11 +157,11 @@ After obtaining the scraped data, you can `sift` all of the data (or select data
 | --- | --- |
 | CSV | For uploading to a flashcard website (e.g., Memrise, Anki, Buffl) after changing the data appropriately. |
 | HTML | For comfortable viewing in a web browser or for sharing. |
-| YAML | For developers to automatically add translations or to manipulate the data in some other way programmatically. |
+| YAML/JSON | For developers to automatically add translations or to manipulate the data in some other way programmatically. |
 
 The data is sorted by frequency in descending order (i.e., most frequent words first).
 
-If you wish to sort/arrange the data in some other way, CSV editors (e.g., LibreOffice, WPS Office, Microsoft Office) can do this easily and efficiently, or if you are code-savvy, you can programmatically manipulate the CSV/YAML/HTML file.
+If you wish to sort/arrange the data in some other way, CSV editors (e.g., LibreOffice, WPS Office, Microsoft Office) can do this easily and efficiently, or if you are code-savvy, you can programmatically manipulate the CSV/YAML/JSON/HTML file.
 
 The defaults will sift all of the data into a CSV file, which may not be what you want:
 
@@ -203,7 +206,7 @@ You can save the data to a different format using one of these options:
 
 ```
 -e --ext=<value> type of file (extension) to save;
-valid options: [csv, htm, html, yaml, yml];
+valid options: [csv, htm, html, json, yaml, yml];
 not needed if you specify a file extension with
 the '--out' option: '--out sift.html'
 (default: csv)
@@ -293,7 +296,7 @@ links:
 
 If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
 
-Currently, it only searches &
+Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
 
 Example usage:
 
@@ -319,6 +322,49 @@ Complete demo:
 
 #### News Command [^](#contents)
 
+In [The Basics](#the-basics-), you learned how to scrape 1 article using the `-u/--url` option with the `news` command.
+
+After creating a file of links from the [search](#search-command-) command (or manually/programmatically), you can also scrape multiple articles from this file using the `news` command.
+
+The defaults will scrape the 1st unscraped article from the `links` file:
+
+`$ nhkore news easy`
+
+You can scrape the 1st **X** unscraped articles with the `-s/--scrape` option:
+
+```
+# Scrape the 1st 11 unscraped articles.
+$ nhkore news -s 11 easy
+```
+
+You may wish to re-scrape articles that have already been scraped with the `-r/--redo` option:
+
+`$ nhkore news -r -s 11 easy`
+
+If you only wish to scrape specific article links, then you should use the `-k/--like` option, which does a fuzzy search on the URLs. For example, `--like '00123'` will match these links:
+
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**23711000/k10012323711000.html
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21401000/k10012321401000.html
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21511000/k10012321511000.html
+- ...
+
+`$ nhkore news -k '00123' -s 11 easy`
+
+Lastly, you can show the dictionary URL and contents for the 1st article if you're getting dictionary-related errors:
+
+```
+# This will exit after showing the 1st article's dictionary.
+$ nhkore news easy --show-dict
+```
+
+For the rest of the options, please see [The Basics](#the-basics-).
+
+Complete demo:
+
+[](https://asciinema.org/a/322324)
+
+When I first scraped all of the articles in [nhkore-core.zip](https://github.com/esotericpig/nhkore/releases/latest), I had to use this [script](samples/looper.rb) because my internet isn't very good.
+
 ## Using the Library [^](#contents)
 
 ### Setup
@@ -336,11 +382,439 @@ In your *Gemfile*:
 ```Ruby
 # Pick one...
 gem 'nhkore', '~> X.X'
-gem 'nhkore', :git => 'https://github.com/esotericpig/
+gem 'nhkore', :git => 'https://github.com/esotericpig/nhkore.git', :tag => 'vX.X.X'
+```
+
+### Require
+
+In order to not require all of the CLI-related files, require this file instead:
+
+```Ruby
+require 'nhkore/lib'
+
+#require 'nhkore' # Slower
 ```
 
 ### Scraper
 
+All scraper classes extend this class. You can either extend it or use it by itself. It's a simple wrapper around *open-uri*, *Nokogiri*, etc.
+
+`initialize` automatically opens (connects to) the URL.
+
+```Ruby
+require 'nhkore/scraper'
+
+class MyScraper < NHKore::Scraper
+  def initialize()
+    super('https://www3.nhk.or.jp/news/easy/')
+  end
+end
+
+m = MyScraper.new()
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+# Read all content into a String.
+mstr = m.read()
+sstr = s.read()
+
+# Get a Nokogiri::HTML object.
+mdoc = m.html_doc()
+sdoc = s.html_doc()
+
+# Get a RSS object.
+s = NHKore::Scraper.new('https://www.bing.com/search?format=rss&q=site%3Anhk.or.jp%2Fnews%2Feasy%2F&count=100')
+
+rss = s.rss_doc()
+```
+
+There are several useful options:
+
+```Ruby
+require 'nhkore/scraper'
+
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+  open_timeout: 300, # Open timeout in seconds (default: nil)
+  read_timeout: 300, # Read timeout in seconds (default: nil)
+
+  # Maximum number of times to retry the URL
+  # - default: 3
+  # - Open/connect will fail a couple of times on a bad/slow internet connection.
+  max_retries: 10,
+
+  # Maximum number of redirects allowed.
+  # - default: 3
+  # - You can set this to nil or -1, but I recommend using a number
+  #   for safety (infinite-loop attack).
+  max_redirects: 1,
+
+  # How to check redirect URLs for safety.
+  # - default: :strict
+  # - nil => do not check
+  # - :lenient => check the scheme only
+  #   (i.e., if https, redirect URL must be https)
+  # - :strict => check the scheme and domain
+  #   (i.e., if https://bing.com, redirect URL must be https://bing.com)
+  redirect_rule: :lenient,
+
+  # Set the HTTP header field 'cookie' from the 'set-cookie' response.
+  # - default: false
+  # - Currently uses the 'http-cookie' Gem.
+  # - This is currently a time-consuming operation because it opens the URL twice.
+  # - Necessary for Search Engines or other sites that require cookies
+  #   in order to block bots.
+  eat_cookie: true,
+
+  # Set HTTP header fields.
+  # - default: nil
+  # - Necessary for Search Engines or other sites that try to block bots.
+  # - Simply pass in a Hash (not nil) to set the default ones.
+  header: {'user-agent' => 'Skynet'}, # Must use strings
+)
+
+# Open the URL yourself. This will be passed in directly to Nokogiri::HTML().
+# - In this way, you can use Faraday, HTTParty, RestClient, httprb/http, or
+#   some other Gem.
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+  str_or_io: URI.open('https://www3.nhk.or.jp/news/easy/',redirect: false)
+)
+
+# Open and parse a file instead of a URL (for offline testing or slow internet).
+s = NHKore::Scraper.new('./my_article.html',is_file: true)
+
+doc = s.html_doc()
+```
+
+Here are some other useful methods:
+
+```Ruby
+require 'nhkore/scraper'
+
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+s.reopen() # Re-open the current URL.
+
+# Get a relative URL.
+url = s.join_url('../../monkey.html')
+puts url # https://www3.nhk.or.jp/monkey.html
+
+# Open a new URL or file.
+s.open(url)
+s.open(url,URI.open(url,redirect: false))
+
+s.open('./my_article.html',is_file: true)
+
+# Open a file manually.
+s.open_file('./my_article.html')
+
+# Fetch the cookie & open a new URL manually.
+s.fetch_cookie(url)
+s.open_url(url)
+```
+
+### SearchScraper & BingScraper
+
+`SearchScraper` is used for scraping Search Engines for NHK News Web (Easy) links. It can also be used for search in general.
+
+By default, it sets the default HTTP header fields and fetches & sets the cookie.
+
+```Ruby
+require 'nhkore/search_scraper'
+
+ss = NHKore::SearchScraper.new('https://www.bing.com/search?q=nhk&count=100')
+
+doc = ss.html_doc()
+
+doc.css('a').each() do |anchor|
+  link = anchor['href']
+
+  next if ss.ignore_link?(link,cleaned: false)
+
+  if link.include?('https://www3.nhk')
+    puts link
+  end
+end
+```
+
+`BingScraper` will search `bing.com` for you.
+
+```Ruby
+require 'nhkore/search_link'
+require 'nhkore/search_scraper'
+
+bs = NHKore::BingScraper.new(:yasashii)
+slinks = NHKore::SearchLinks.new()
+
+next_page = bs.scrape(slinks)
+page_num = 1
+
+while !next_page.empty?()
+  puts "Page #{page_num += 1}: #{next_page.count}"
+
+  bs = NHKore::BingScraper.new(:yasashii,url: next_page.url)
+
+  next_page = bs.scrape(slinks,next_page)
+end
+
+slinks.links.values.each() do |link|
+  puts link.url
+end
+```
+
+### ArticleScraper & DictScraper
+
+`ArticleScraper` scrapes an NHK News Web Easy article. Regular articles aren't currently supported.
+
+```Ruby
+require 'nhkore/article_scraper'
+require 'time'
+
+as = NHKore::ArticleScraper.new(
+  'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html',
+
+  # If false, scrape the article leniently (for older articles which
+  # may not have certain tags, etc.).
+  # - default: true
+  strict: false,
+
+  # {Dict} to use as the dictionary for words (Easy articles).
+  # - default: :scrape
+  # - nil => don't scrape/use it (necessary for Regular articles)
+  # - :scrape => auto-scrape it using {DictScraper}
+  # - {Dict} => your own {Dict}
+  dict: nil,
+
+  # Date time to use as a fallback if the article doesn't have one
+  # (for older articles).
+  # - default: nil
+  datetime: Time.new(2020,2,2),
+
+  # Year to use as a fallback if the article doesn't have one
+  # (for older articles).
+  # - default: nil
+  year: 2020,
+)
+
+article = as.scrape()
+
+article.datetime
+article.futsuurl
+article.sha256
+article.title
+article.url
+
+article.words.each() do |key,word|
+  word.defn
+  word.eng
+  word.freq
+  word.kana
+  word.kanji
+  word.key
+end
+
+puts article.to_s(mini: true)
+puts '---'
+puts article
+```
+
+`DictScraper` scrapes an Easy article's dictionary file (JSON).
+
+```Ruby
+require 'nhkore/dict_scraper'
+
+url = 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'
+ds = NHKore::DictScraper.new(
+  url,
+
+  # Change the URL appropriately to the dictionary URL.
+  # - default: true
+  parse_url: true,
+)
+
+puts NHKore::DictScraper.parse_url(url)
+puts
+
+dict = ds.scrape()
+
+dict.entries.each() do |key,entry|
+  entry.id
+
+  entry.defns.each() do |defn|
+    defn.hyoukis.each() {|hyouki| }
+    defn.text
+    defn.words.each() {|word| }
+  end
+
+  puts entry.build_hyouki()
+  puts entry.build_defn()
+  puts '---'
+end
+
+puts
+puts dict
+```
+
+### Fileable
+
+Any class that includes the `Fileable` mixin will have the following methods:
+
+- Class.load_file(file,mode: 'rt:BOM|UTF-8',**kargs)
+- save_file(file,mode: 'wt',**kargs)
+
+Any *kargs* will be passed to `File.open()`.
+
+```Ruby
+require 'nhkore/news'
+require 'nhkore/search_link'
+
+yn = NHKore::YasashiiNews.load_file()
+sl = NHKore::SearchLinks.load_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+
+yn.articles.each() {|key,article| }
+yn.sha256s.each() {|sha256,url| }
+
+sl.links.each() do |key,link|
+  link.datetime
+  link.futsuurl
+  link.scraped?
+  link.sha256
+  link.title
+  link.url
+end
+
+#yn.save_file()
+#sl.save_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+```
+
+### Sifter
+
+`Sifter` will sift & sort the `News` data into a single file. The data is sorted by frequency in descending order (i.e., most frequent words first).
+
+```Ruby
+require 'nhkore/datetime_parser'
+require 'nhkore/news'
+require 'nhkore/sifter'
+require 'time'
+
+news = NHKore::YasashiiNews.load_file()
+
+sifter = NHKore::Sifter.new(news)
+
+sifter.caption = 'Sakura Fields Forever!'
+
+# Filter the data.
+sifter.filter_by_datetime(NHKore::DatetimeParser.parse_range('2019-12-4...7'))
+sifter.filter_by_datetime([Time.new(2019,12,4),Time.new(2019,12,7)])
+sifter.filter_by_datetime(
+  from: Time.new(2019,12,4),to: Time.new(2019,12,7)
+)
+sifter.filter_by_title('桜')
+sifter.filter_by_url('k100')
+
+# Ignore certain columns from the output.
+sifter.ignore(:defn)
+sifter.ignore(:eng)
+
+# An array of the sifted words.
+words = sifter.sift() # Filtered & Sorted array of Word
+rows = sifter.build_rows(words) # Ignored array of array
+
+# Choose the file format.
+#sifter.put_csv!()
+#sifter.put_html!()
+#sifter.put_json!()
+sifter.put_yaml!()
+
+# Save to a file.
+file = 'sakura.yml'
+
+if !File.exist?(file)
+  sifter.save_file(file)
+end
+```
+
+### Util, UserAgents, & DatetimeParser
+
+These provide a variety of useful methods/constants.
+
+Here are some of the most useful ones:
+
+```Ruby
+require 'nhkore/datetime_parser'
+require 'nhkore/user_agents'
+require 'nhkore/util'
+
+include NHKore
+
+puts '======='
+puts '[ Net ]'
+puts '======='
+# Get a random User Agent for HTTP header field 'User-Agent'.
+# - This is used by default in Scraper/SearchScraper.
+puts "User-Agent: #{UserAgents.sample()}"
+
+uri = URI('https://www.bing.com/search?q=nhk')
+Util.replace_uri_query!(uri,q: 'banana')
+
+puts "URI query: #{uri}" # https://www.bing.com/search?q=banana
+# nhk.or.jp
+puts "Domain: #{Util.domain(URI('https://www.nhk.or.jp/news/easy').host)}"
+# Ben & Jerry's<br>
+puts "Escape HTML: #{Util.escape_html("Ben & Jerry's\n")}"
+puts
+
+puts '========'
+puts '[ Time ]'
+puts '========'
+puts "JST now: #{Util.jst_now()}"
+# Drops in JST_OFFSET, does not change hour/min.
+puts "JST time: #{Util.jst_time(Time.now)}"
+puts "JST year: #{Util::JST_YEAR}"
+puts "1999 sane? #{Util.sane_year?(1999)}" # true
+puts "1776 sane? #{Util.sane_year?(1776)}" # false
+puts "Guess 5: #{DatetimeParser.guess_year(5)}" # 2005
+puts "Guess 99: #{DatetimeParser.guess_year(99)}" # 1999
+# => [2020-12-01 00:00:00 +0900, 2020-12-31 23:59:59 +0900]
+puts "Parse: #{DatetimeParser.parse_range('2020-12')}"
+puts
+puts "JST timezone offset: #{Util::JST_OFFSET}"
+puts "JST timezone offset hour: #{Util::JST_OFFSET_HOUR}"
+puts "JST timezone offset minute: #{Util::JST_OFFSET_MIN}"
+puts
+
+puts '============'
+puts '[ Japanese ]'
+puts '============'
+
+JPN = ['桜','ぶ','ブ']
+
+def fmt_jpn()
+  fmt = []
+
+  JPN.each() do |x|
+    x = yield(x)
+    x = x ? "\u2B55" : Util::JPN_SPACE unless x.is_a?(String)
+    fmt << x
+  end
+
+  return "[ #{fmt.join(' | ')} ]"
+end
+
+puts " #{fmt_jpn{|x| x}}"
+puts "Hiragana? #{fmt_jpn{|x| Util.hiragana?(x)}}"
+puts "Kana? #{fmt_jpn{|x| Util.kana?(x)}}"
+puts "Kanji? #{fmt_jpn{|x| Util.kanji?(x)}}"
+puts "Reduce: #{Util.reduce_jpn_space("' '")}"
+puts
+
+puts '========='
+puts '[ Files ]'
+puts '========='
+puts "Dir str? #{Util.dir_str?('dir/')}" # true
+puts "Dir str? #{Util.dir_str?('dir')}" # false
+puts "File str? #{Util.filename_str?('file')}" # true
+puts "File str? #{Util.filename_str?('dir/file')}" # false
+```
+
 ## Hacking [^](#contents)
 
 ```
@@ -370,13 +844,35 @@ $ bundle exec rake nokogiri_other # macOS, Windows, etc.
 
 `$ bundle exec rake doc`
 
-### Installing Locally
+### Installing Locally
+
+You can make some changes/fixes to the code and then install your local version:
 
 `$ bundle exec rake install:local`
 
-###
+### Updating [^](#contents)
+
+This will update *core/* for you:
+
+`$ bundle exec rake update_core`
+
+### Releasing [^](#contents)
+
+1. Update *CHANGELOG.md*, *version.rb*, & *Gemfile.lock*
+    - *Raketary*: `$ raketary bump -v`
+    - Run: `$ bundle update`
+2. Run: `$ bundle exec rake update_core`
+3. Run: `$ bundle exec rake clobber pkg_core`
+4. Create a new release & tag
+    - Add `pkg/nhkore-core.zip`
+5. Run: `$ git pull`
+6. Upload GitHub package
+    - *Raketary*: `$ raketary github_pkg`
+7. Run: `$ bundle exec rake release`
+
+Releasing new HTML file for website:
 
-`$ bundle exec rake
+1. `$ bundle exec rake update_showcase`
 
 ## License [^](#contents)
 
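Taken together, the library pieces shown in the README sections above can be combined into a short script. A minimal end-to-end sketch, assuming the `ArticleScraper` and `Word` APIs exactly as they appear in those examples (and that `article.words` behaves like a Hash, as its key/word iteration above suggests):

```Ruby
require 'nhkore/article_scraper'

# Scrape the same sample Easy article used in the README examples above.
as = NHKore::ArticleScraper.new(
  'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'
)
article = as.scrape()

puts article.title
puts article.datetime

# List the article's words, most frequent first.
article.words.values.sort_by {|word| -word.freq }.each do |word|
  puts "#{word.kanji} (#{word.kana}): #{word.freq}"
end
```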