nhkore 0.3.1 → 0.3.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +28 -2
- data/README.md +468 -3
- data/lib/nhkore/app.rb +1 -0
- data/lib/nhkore/article_scraper.rb +12 -11
- data/lib/nhkore/cli/news_cmd.rb +1 -1
- data/lib/nhkore/lib.rb +58 -0
- data/lib/nhkore/scraper.rb +18 -4
- data/lib/nhkore/sifter.rb +15 -10
- data/lib/nhkore/util.rb +7 -2
- data/lib/nhkore/variator.rb +1 -0
- data/lib/nhkore/version.rb +1 -1
- metadata +4 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: cf151c3859812632f09b1a464164f31bb0ce050f37ed7e7377f76265571ebd41
+  data.tar.gz: 1f3ee801e7557731cae4aeacd3f18fea4d7f33ac65b6ec77511a7d3d8f17856a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7e7d0d5b805ad6fa4312e8be26f3115dff18665b3762073c56db3a7a6a343a3ee6a05e47889e0abf7b62df3bb84cf5c977fce3efdfeb8a65c7bcff8167839d35
+  data.tar.gz: 957bc3da8492310d287a8947b9080f8be417f0874c3226db4f0bb63d020bee06c51a3da81c1fa3f779de22d354a32ab4cf41fc6f3018840774c31fd7060fbec3
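If you want to verify a downloaded copy of the gem against these digests, here is a minimal sketch (an assumption: you have already unpacked `nhkore-0.3.2.gem`, which is a tar archive containing `metadata.gz` and `data.tar.gz`):

```Ruby
require 'digest'

# Compare these against the SHA256/SHA512 entries in checksums.yaml above.
puts Digest::SHA256.file('metadata.gz').hexdigest
puts Digest::SHA256.file('data.tar.gz').hexdigest
puts Digest::SHA512.file('metadata.gz').hexdigest
puts Digest::SHA512.file('data.tar.gz').hexdigest
```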
data/CHANGELOG.md
CHANGED
@@ -2,7 +2,33 @@
 
 Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
-## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.
+## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.2...master)
+
+## [v0.3.2] - 2020-04-22
+
+### Added
+- lib/nhkore/lib.rb
+  - Requires all files, excluding CLI-related files for speed when using this Gem as a library.
+- Scraper
+  - Added open_file() & reopen().
+- samples/looper.rb
+  - Script example of continuously scraping all articles.
+
+### Changed
+- README
+  - Finished writing the initial version of all sections.
+- ArticleScraper
+  - Changed the `year` param to expect an int, instead of a string.
+- Sifter
+  - In filter_by_datetime(), renamed keyword args `from_filter,to_filter` to simply `from,to`.
+
+### Fixed
+- Reduced load time of app from ~1s to 0.3~0.5s.
+  - Moved many `require '...'` statements into methods.
+  - It looks ugly & is not a good coding practice, but a necessary evil.
+  - Load time is still pretty slow (but a lot better!).
+- ArticleScraper
+  - Renamed `mode` param to `strict`. `mode` was overshadowing File.open()'s in Scraper.
 
 ## [v0.3.1] - 2020-04-20
 
@@ -11,7 +37,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - NewsCmd/SiftCmd
   - Added `--no-sha256` option to not check if article links have already been scraped based on their contents' SHA-256.
 - Util
-  - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows &
+  - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows & Linux, it will check for both `/` & `\`.
 
 ### Fixed
 - Reduced load time of app from ~1s to ~0.3-5s by moving some requires into methods.
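The Sifter rename is a breaking keyword change for callers; a minimal before/after sketch, based on the Sifter example in the README diff below:

```Ruby
require 'nhkore/news'
require 'nhkore/sifter'

sifter = NHKore::Sifter.new(NHKore::YasashiiNews.load_file())

# 0.3.1 (old keywords):
#sifter.filter_by_datetime(from_filter: Time.new(2019,12,4),to_filter: Time.new(2019,12,7))

# 0.3.2 (new keywords):
sifter.filter_by_datetime(from: Time.new(2019,12,4),to: Time.new(2019,12,7))
```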
data/README.md
CHANGED
@@ -293,7 +293,7 @@ links:
 
 If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
 
-Currently, it only searches &
+Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
 
 Example usage:
 
@@ -319,6 +319,49 @@ Complete demo:
 
 #### News Command [^](#contents)
 
+In [The Basics](#the-basics-), you learned how to scrape 1 article using the `-u/--url` option with the `news` command.
+
+After creating a file of links from the [search](#search-command-) command (or manually/programmatically), you can also scrape multiple articles from this file using the `news` command.
+
+The defaults will scrape the 1st unscraped article from the `links` file:
+
+`$ nhkore news easy`
+
+You can scrape the 1st **X** unscraped articles with the `-s/--scrape` option:
+
+```
+# Scrape the 1st 11 unscraped articles.
+$ nhkore news -s 11 easy
+```
+
+You may wish to re-scrape articles that have already been scraped with the `-r/--redo` option:
+
+`$ nhkore news -r -s 11 easy`
+
+If you only wish to scrape specific article links, then you should use the `-k/--like` option, which does a fuzzy search on the URLs. For example, `--like '00123'` will match these links:
+
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**23711000/k10012323711000.html
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21401000/k10012321401000.html
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21511000/k10012321511000.html
+- ...
+
+`$ nhkore news -k '00123' -s 11 easy`
+
+Lastly, you can show the dictionary URL and contents for the 1st article if you're getting dictionary-related errors:
+
+```
+# This will exit after showing the 1st article's dictionary.
+$ nhkore news easy --show-dict
+```
+
+For the rest of the options, please see [The Basics](#the-basics-).
+
+Complete demo:
+
+[](https://asciinema.org/a/322324)
+
+When I first scraped all of the articles in [nhkore-core.zip](https://github.com/esotericpig/nhkore/releases/latest), I had to use this [script](samples/looper.rb) because my internet isn't very good.
+
 ## Using the Library [^](#contents)
 
 ### Setup
@@ -336,11 +379,431 @@ In your *Gemfile*:
 ```Ruby
 # Pick one...
 gem 'nhkore', '~> X.X'
-gem 'nhkore', :git => 'https://github.com/esotericpig/
+gem 'nhkore', :git => 'https://github.com/esotericpig/nhkore.git', :tag => 'vX.X.X'
+```
+
+### Require
+
+In order to not require all of the CLI-related files, require this file instead:
+
+```Ruby
+require 'nhkore/lib'
+
+#require 'nhkore' # Slower
 ```
 
 ### Scraper
 
+All scraper classes extend this class. You can either extend it or use it by itself. It's a simple wrapper around *open-uri*, *Nokogiri*, etc.
+
+`initialize` automatically opens (connects to) the URL.
+
+```Ruby
+require 'nhkore/scraper'
+
+class MyScraper < NHKore::Scraper
+  def initialize()
+    super('https://www3.nhk.or.jp/news/easy/')
+  end
+end
+
+m = MyScraper.new()
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+# Read all content into a String.
+mstr = m.read()
+sstr = s.read()
+
+# Get a Nokogiri::HTML object.
+mdoc = m.html_doc()
+sdoc = s.html_doc()
+
+# Get a RSS object.
+s = NHKore::Scraper.new('https://www.bing.com/search?format=rss&q=site%3Anhk.or.jp%2Fnews%2Feasy%2F&count=100')
+
+rss = s.rss_doc()
+```
+
+There are several useful options:
+
+```Ruby
+require 'nhkore/scraper'
+
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+  open_timeout: 300, # Open timeout in seconds (default: nil)
+  read_timeout: 300, # Read timeout in seconds (default: nil)
+
+  # Maximum number of times to retry the URL
+  # - default: 3
+  # - Open/connect will fail a couple of times on a bad/slow internet connection.
+  max_retries: 10,
+
+  # Maximum number of redirects allowed.
+  # - default: 3
+  # - You can set this to nil or -1, but I recommend using a number
+  #   for safety (infinite-loop attack).
+  max_redirects: 1,
+
+  # How to check redirect URLs for safety.
+  # - default: :strict
+  # - nil      => do not check
+  # - :lenient => check the scheme only
+  #              (i.e., if https, redirect URL must be https)
+  # - :strict  => check the scheme and domain
+  #              (i.e., if https://bing.com, redirect URL must be https://bing.com)
+  redirect_rule: :lenient,
+
+  # Set the HTTP header field 'cookie' from the 'set-cookie' response.
+  # - default: false
+  # - Currently uses the 'http-cookie' Gem.
+  # - This is currently a time-consuming operation because it opens the URL twice.
+  # - Necessary for Search Engines or other sites that require cookies
+  #   in order to block bots.
+  eat_cookie: true,
+
+  # Set HTTP header fields.
+  # - default: nil
+  # - Necessary for Search Engines or other sites that try to block bots.
+  # - Simply pass in a Hash (not nil) to set the default ones.
+  header: {'user-agent' => 'Skynet'}, # Must use strings
+)
+
+# Open the URL yourself. This will be passed in directly to Nokogiri::HTML().
+# - In this way, you can use Faraday, HTTParty, RestClient, httprb/http, or
+#   some other Gem.
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+  str_or_io: URI.open('https://www3.nhk.or.jp/news/easy/',redirect: false)
+)
+
+# Open and parse a file instead of a URL (for offline testing or slow internet).
+s = NHKore::Scraper.new('./my_article.html',is_file: true)
+
+doc = s.html_doc()
+```
+
+Here are some other useful methods:
+
+```Ruby
+require 'nhkore/scraper'
+
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+s.reopen() # Re-open the current URL.
+
+# Get a relative URL.
+url = s.join_url('../../monkey.html')
+puts url # https://www3.nhk.or.jp/monkey.html
+
+# Open a new URL or file.
+s.open(url)
+s.open(url,URI.open(url,redirect: false))
+
+s.open('./my_article.html',is_file: true)
+
+# Open a file manually.
+s.open_file('./my_article.html')
+
+# Fetch the cookie & open a new URL manually.
+s.fetch_cookie(url)
+s.open_url(url)
+```
+
+### SearchScraper & BingScraper
+
+`SearchScraper` is used for scraping Search Engines for NHK News Web (Easy) links. It can also be used for search in general.
+
+By default, it sets the default HTTP header fields and fetches & sets the cookie.
+
+```Ruby
+require 'nhkore/search_scraper'
+
+ss = NHKore::SearchScraper.new('https://www.bing.com/search?q=nhk&count=100')
+
+doc = ss.html_doc()
+
+doc.css('a').each() do |anchor|
+  link = anchor['href']
+
+  next if ss.ignore_link?(link)
+
+  if link.include?('https://www3.nhk')
+    puts link
+  end
+end
+```
+
+`BingScraper` will search `bing.com` for you.
+
+```Ruby
+require 'nhkore/search_link'
+require 'nhkore/search_scraper'
+
+bs = NHKore::BingScraper.new(:yasashii)
+slinks = NHKore::SearchLinks.new()
+
+next_page = bs.scrape(slinks)
+page_num = 1
+
+while !next_page.empty?()
+  puts "Page #{page_num += 1}: #{next_page.count}"
+
+  bs = NHKore::BingScraper.new(:yasashii,url: next_page.url)
+
+  next_page = bs.scrape(slinks,next_page)
+end
+
+slinks.links.values.each() do |link|
+  puts link.url
+end
+```
+
+### ArticleScraper & DictScraper
+
+`ArticleScraper` scrapes an NHK News Web Easy article. Regular articles aren't currently supported.
+
+```Ruby
+require 'nhkore/article_scraper'
+
+as = NHKore::ArticleScraper.new(
+  'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html',
+
+  # If false, scrape the article leniently (for older articles which
+  #   may not have certain tags, etc.).
+  # - default: true
+  strict: false,
+
+  # {Dict} to use as the dictionary for words (Easy articles).
+  # - default: :scrape
+  # - nil     => don't scrape/use it (necessary for Regular articles)
+  # - :scrape => auto-scrape it using {DictScraper}
+  # - {Dict}  => your own {Dict}
+  dict: nil,
+
+  # Date time to use as a fallback if the article doesn't have one
+  #   (for older articles).
+  # - default: nil
+  datetime: Time.new(2020,2,2),
+
+  # Year to use as a fallback if the article doesn't have one
+  #   (for older articles).
+  # - default: nil
+  year: 2020,
+)
+
+article = as.scrape()
+
+article.datetime
+article.futsuurl
+article.sha256
+article.title
+article.url
+
+article.words.each() do |key,word|
+  word.defn
+  word.eng
+  word.freq
+  word.kana
+  word.kanji
+  word.key
+end
+
+puts article.to_s(mini: true)
+puts '---'
+puts article
+```
+
+`DictScraper` scrapes an Easy article's dictionary file (JSON).
+
+```Ruby
+require 'nhkore/dict_scraper'
+
+url = 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'
+ds = NHKore::DictScraper.new(
+  url,
+
+  # Change the URL appropriately to the dictionary URL.
+  # - default: true
+  parse_url: true,
+)
+
+puts NHKore::DictScraper.parse_url(url)
+puts
+
+dict = ds.scrape()
+
+dict.entries.each() do |key,entry|
+  entry.id
+
+  entry.defns.each() do |defn|
+    defn.hyoukis.each() {|hyouki| }
+    defn.text
+    defn.words.each() {|word| }
+  end
+
+  puts entry.build_hyouki()
+  puts entry.build_defn()
+  puts '---'
+end
+
+puts
+puts dict
+```
+
+### Fileable
+
+Any class that includes the `Fileable` mixin will have the following methods:
+
+- Class.load_file(file,mode: 'rt:BOM|UTF-8',**kargs)
+- save_file(file,mode: 'wt',**kargs)
+
+Any *kargs* will be passed to `File.open()`.
+
+```Ruby
+require 'nhkore/news'
+require 'nhkore/search_link'
+
+yn = NHKore::YasashiiNews.load_file()
+sl = NHKore::SearchLinks.load_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+
+yn.articles.each() {|key,article| }
+yn.sha256s.each() {|sha256,url| }
+
+sl.links.each() do |key,link|
+  link.datetime
+  link.futsuurl
+  link.scraped?
+  link.sha256
+  link.title
+  link.url
+end
+
+#yn.save_file()
+#sl.save_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+```
+
+### Sifter
+
+`Sifter` will sift & sort the `News` data into a single file. The data is sorted by frequency in descending order (i.e., most frequent words first).
+
+```Ruby
+require 'nhkore/news'
+require 'nhkore/sifter'
+require 'time'
+
+news = NHKore::YasashiiNews.load_file()
+
+sifter = NHKore::Sifter.new(news)
+
+sifter.caption = 'Sakura Fields Forever!'
+
+# Filter the data.
+#sifter.filter_by_datetime(Time.new(2019,12,5))
+sifter.filter_by_datetime(
+  from: Time.new(2019,12,4),to: Time.new(2019,12,7)
+)
+sifter.filter_by_title('桜')
+sifter.filter_by_url('k100')
+
+# Ignore (or blank out) certain columns from the output.
+sifter.ignore(:defn)
+sifter.ignore(:eng)
+
+# An array of the filtered & sorted words.
+words = sifter.sift()
+
+# Choose the file format.
+#sifter.put_csv!()
+#sifter.put_html!()
+sifter.put_yaml!()
+
+# Save to a file.
+file = 'sakura.yml'
+
+if !File.exist?(file)
+  sifter.save_file(file)
+end
+```
+
+### Util & UserAgents
+
+These provide a variety of useful methods/constants.
+
+Here are some of the most useful ones:
+
+```Ruby
+require 'nhkore/user_agents'
+require 'nhkore/util'
+
+include NHKore
+
+puts '======='
+puts '[ Net ]'
+puts '======='
+# Get a random User Agent for HTTP header field 'User-Agent'.
+# - This is used by default in Scraper/SearchScraper.
+puts "User-Agent: #{UserAgents.sample()}"
+
+uri = URI('https://www.bing.com/search?q=nhk')
+Util.replace_uri_query!(uri,q: 'banana')
+
+puts "URI query: #{uri}" # https://www.bing.com/search?q=banana
+# nhk.or.jp
+puts "Domain: #{Util.domain(URI('https://www.nhk.or.jp/news/easy').host)}"
+# Ben & Jerry's<br>
+puts "Escape HTML: #{Util.escape_html("Ben & Jerry's\n")}"
+puts
+
+puts '========'
+puts '[ Time ]'
+puts '========'
+puts "JST now: #{Util.jst_now}"
+# Drops in JST_OFFSET, does not change hour/min.
+puts "JST time: #{Util.jst_time(Time.now)}"
+puts "JST year: #{Util::JST_YEAR}"
+puts "1999 sane? #{Util.sane_year?(1999)}" # true
+puts "1776 sane? #{Util.sane_year?(1776)}" # false
+puts "Guess 5: #{Util.guess_year(5)}" # 2005
+puts "Guess 99: #{Util.guess_year(99)}" # 1999
+puts
+puts "JST timezone offset: #{Util::JST_OFFSET}"
+puts "JST timezone offset hour: #{Util::JST_OFFSET_HOUR}"
+puts "JST timezone offset minute: #{Util::JST_OFFSET_MIN}"
+puts
+
+puts '============'
+puts '[ Japanese ]'
+puts '============'
+
+JPN = ['桜','ぶ','ブ']
+
+def fmt_jpn()
+  fmt = []
+
+  JPN.each() do |x|
+    x = yield(x)
+    x = x ? "\u2B55" : Util::JPN_SPACE unless x.is_a?(String)
+    fmt << x
+  end
+
+  return "[ #{fmt.join(' | ')} ]"
+end
+
+puts " #{fmt_jpn{|x| x}}"
+puts "Hiragana? #{fmt_jpn{|x| !!Util.hiragana?(x)}}"
+puts "Kana? #{fmt_jpn{|x| !!Util.kana?(x)}}"
+puts "Kanji? #{fmt_jpn{|x| !!Util.kanji?(x)}}"
+puts "Reduce: #{Util.reduce_jpn_space("' '")}"
+puts
+
+puts '========='
+puts '[ Files ]'
+puts '========='
+puts "Dir str? #{Util.dir_str?('dir/')}" # true
+puts "Dir str? #{Util.dir_str?('dir')}" # false
+puts "File str? #{Util.filename_str?('file')}" # true
+puts "File str? #{Util.filename_str?('dir/file')}" # false
+```
+
 ## Hacking [^](#contents)
 
 ```
@@ -370,7 +833,9 @@ $ bundle exec rake nokogiri_other # macOS, Windows, etc.
 
 `$ bundle exec rake doc`
 
-### Installing Locally
+### Installing Locally
+
+You can make some changes/fixes to the code and then install your local version:
 
 `$ bundle exec rake install:local`
 
data/lib/nhkore/app.rb
CHANGED

data/lib/nhkore/article_scraper.rb
CHANGED
@@ -47,19 +47,21 @@ module NHKore
     attr_accessor :dict
     attr_reader :kargs
     attr_accessor :missingno
-    attr_accessor :mode
     attr_reader :polishers
     attr_accessor :splitter
+    attr_accessor :strict
     attr_reader :variators
     attr_accessor :year
 
+    alias_method :strict?,:strict
+
     # @param dict [Dict,:scrape,nil] the {Dict} (dictionary) to use for {Word#defn} (definitions)
     #   [+:scrape+] auto-scrape it using {DictScraper}
     #   [+nil+] don't scrape/use it
     # @param missingno [Missingno] data to use as a fallback for Ruby words without kana/kanji,
     #   instead of raising an error
-    # @param
-    def initialize(url,cleaners: [BestCleaner.new()],datetime: nil,dict: :scrape,missingno: nil,
+    # @param strict [true,false]
+    def initialize(url,cleaners: [BestCleaner.new()],datetime: nil,dict: :scrape,missingno: nil,polishers: [BestPolisher.new()],splitter: BestSplitter.new(),strict: true,variators: [BestVariator.new()],year: nil,**kargs)
       super(url,**kargs)
 
       @cleaners = Array(cleaners)
@@ -67,9 +69,9 @@ module NHKore
       @dict = dict
       @kargs = kargs
       @missingno = missingno
-      @mode = mode
       @polishers = Array(polishers)
       @splitter = splitter
+      @strict = strict
       @variators = Array(variators)
       @year = year
     end
@@ -188,7 +190,7 @@ module NHKore
       tag = doc.css('div.article-body') if tag.length < 1
 
       # - https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html
-      tag = doc.css('div#main') if tag.length < 1 &&
+      tag = doc.css('div#main') if tag.length < 1 && !@strict
 
       if tag.length > 0
         text = Util.unspace_web_str(tag.text.to_s())
@@ -481,7 +483,7 @@ module NHKore
     def scrape_title(doc,article)
       tag = doc.css('h1.article-main__title')
 
-      if tag.length < 1 &&
+      if tag.length < 1 && !@strict
         # This shouldn't be used except for select sites.
         # - https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html
 
@@ -583,7 +585,7 @@ module NHKore
       end
 
       # As a last resort, use our user-defined fallbacks (if specified).
-      return @year unless
+      return @year.to_i() unless @year.nil?()
       return @datetime.year if !@datetime.nil?() && Util.sane_year?(@datetime.year)
 
       raise ScrapeError,"could not scrape year at URL[#{@url}]"
@@ -604,11 +606,10 @@ module NHKore
     end
 
     def warn_or_error(klass,msg)
-      case @mode
-      when :lenient
-        Util.warn(msg)
-      else
+      if @strict
         raise klass,msg
+      else
+        Util.warn(msg)
       end
     end
   end
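In practice, the `mode` to `strict` rename changes call sites like this; a sketch assuming `mode: :lenient` was the old way to opt out of strict scraping (the removed `when :lenient` branch above):

```Ruby
require 'nhkore/article_scraper'

url = 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'

# 0.3.1 (old param):
#as = NHKore::ArticleScraper.new(url,mode: :lenient)

# 0.3.2 (new Boolean param; false scrapes leniently, for older articles):
as = NHKore::ArticleScraper.new(url,strict: false)
article = as.scrape()
```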
data/lib/nhkore/cli/news_cmd.rb
CHANGED
data/lib/nhkore/lib.rb
ADDED
@@ -0,0 +1,58 @@
+#!/usr/bin/env ruby
+# encoding: UTF-8
+# frozen_string_literal: true
+
+#--
+# This file is part of NHKore.
+# Copyright (c) 2020 Jonathan Bradley Whited (@esotericpig)
+#
+# NHKore is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Lesser General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# NHKore is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public License
+# along with NHKore. If not, see <https://www.gnu.org/licenses/>.
+#++
+
+
+require 'nhkore/article'
+require 'nhkore/article_scraper'
+require 'nhkore/cleaner'
+require 'nhkore/defn'
+require 'nhkore/dict'
+require 'nhkore/dict_scraper'
+require 'nhkore/entry'
+require 'nhkore/error'
+require 'nhkore/fileable'
+require 'nhkore/missingno'
+require 'nhkore/news'
+require 'nhkore/polisher'
+require 'nhkore/scraper'
+require 'nhkore/search_link'
+require 'nhkore/search_scraper'
+require 'nhkore/sifter'
+require 'nhkore/splitter'
+require 'nhkore/user_agents'
+require 'nhkore/util'
+require 'nhkore/variator'
+require 'nhkore/version'
+require 'nhkore/word'
+
+
+module NHKore
+  ###
+  # Include this file to only require the files needed to use this
+  # Gem as a library (i.e., don't include CLI-related files).
+  #
+  # @author Jonathan Bradley Whited (@esotericpig)
+  # @since 0.3.2
+  ###
+  module Lib
+  end
+end
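With this file in place, library users can pull in everything above with one require, as the new README section shows:

```Ruby
# Faster load: skips the CLI-related files.
require 'nhkore/lib'

s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
doc = s.html_doc()
```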
data/lib/nhkore/scraper.rb
CHANGED
@@ -82,7 +82,7 @@ module NHKore
       @max_retries = max_retries
       @redirect_rule = redirect_rule
 
-      open(url,str_or_io)
+      open(url,str_or_io,is_file: is_file)
     end
 
     def fetch_cookie(url)
@@ -119,14 +119,14 @@ module NHKore
       return URI::join(@url,relative_url)
     end
 
-    def open(url,str_or_io=nil)
+    def open(url,str_or_io=nil,is_file: @is_file)
+      @is_file = is_file
       @str_or_io = str_or_io
       @url = url
 
       if str_or_io.nil?()
         if @is_file
-          # NHK's website tends to always use UTF-8.
-          @str_or_io = File.open(url,'rt:UTF-8',**@kargs)
+          open_file(url)
         else
           fetch_cookie(url) if @eat_cookie
           open_url(url)
@@ -136,6 +136,16 @@ module NHKore
       return self
     end
 
+    def open_file(file)
+      @is_file = true
+      @url = file
+
+      # NHK's website tends to always use UTF-8.
+      @str_or_io = File.open(file,'rt:UTF-8',**@kargs)
+
+      return self
+    end
+
     def open_url(url)
       max_redirects = (@max_redirects.nil?() || @max_redirects < 0) ? 10_000 : @max_redirects
       max_retries = (@max_retries.nil?() || @max_retries < 0) ? 10_000 : @max_retries
@@ -194,6 +204,10 @@ module NHKore
       return @str_or_io
     end
 
+    def reopen()
+      return open(@url)
+    end
+
     def rss_doc()
       require 'rss'
 
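A short usage sketch of the two new methods, mirroring the Scraper examples in the README diff above:

```Ruby
require 'nhkore/scraper'

s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')

s.reopen() # Re-open the current URL (e.g., to retry a failed read).

s.open_file('./my_article.html') # Parse a local file; sets is_file for you.
doc = s.html_doc()
```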
data/lib/nhkore/sifter.rb
CHANGED
@@ -93,24 +93,29 @@ module NHKore
       return false
     end
 
-    def filter_by_datetime(datetime_filter=nil,
+    def filter_by_datetime(datetime_filter=nil,from: nil,to: nil)
       if !datetime_filter.nil?()
-
-
-
+        if datetime_filter.respond_to?(:'[]')
+          # If out-of-bounds, just nil.
+          from = datetime_filter[0] if from.nil?()
+          to = datetime_filter[1] if to.nil?()
+        else
+          from = datetime_filter if from.nil?()
+          to = datetime_filter if to.nil?()
+        end
       end
 
-
-
+      from = to if from.nil?()
+      to = from if to.nil?()
 
-
-
+      from = Util.jst_time(from) unless from.nil?()
+      to = Util.jst_time(to) unless to.nil?()
 
-      datetime_filter = [
+      datetime_filter = [from,to]
 
       return self if datetime_filter.flatten().compact().empty?()
 
-      @filters[:datetime] = {from:
+      @filters[:datetime] = {from: from,to: to}
 
       return self
     end
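Note the defaulting in the new body: `from` and `to` fall back to each other, so a single positional argument still filters on exactly one datetime, and an indexable argument is treated as a `[from,to]` pair. A sketch of the three call styles:

```Ruby
require 'nhkore/news'
require 'nhkore/sifter'

sifter = NHKore::Sifter.new(NHKore::YasashiiNews.load_file())

# One positional datetime: from & to both become this value.
sifter.filter_by_datetime(Time.new(2019,12,5))

# An indexable [from,to] pair:
#sifter.filter_by_datetime([Time.new(2019,12,4),Time.new(2019,12,7)])

# The new keywords:
#sifter.filter_by_datetime(from: Time.new(2019,12,4),to: Time.new(2019,12,7))
```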
data/lib/nhkore/util.rb
CHANGED
@@ -22,8 +22,7 @@
 
 
 require 'cgi'
-require '
-require 'public_suffix'
+require 'set'
 require 'time'
 require 'uri'
 
@@ -68,6 +67,8 @@ module NHKore
     end
 
     def self.domain(host,clean: true)
+      require 'public_suffix'
+
       domain = PublicSuffix.domain(host)
       domain = unspace_web_str(domain).downcase() if !domain.nil?() && clean
 
@@ -75,6 +76,8 @@ module NHKore
     end
 
     def self.dump_yaml(obj,flow_level: 8)
+      require 'psychgus'
+
       return Psychgus.dump(obj,
         deref_aliases: true, # Dereference aliases for load_yaml()
         line_width: 10000, # Try not to wrap; ichiman!
@@ -142,6 +145,8 @@ module NHKore
     end
 
     def self.load_yaml(data,file: nil,**kargs)
+      require 'psychgus'
+
       return Psych.safe_load(data,
         aliases: false,
         filename: file,
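These hunks are the CHANGELOG's load-time fix in action: `Kernel#require` caches what it has loaded, so moving a `require` into the method that needs it defers the dependency's startup cost until the first call, and repeat calls are cheap no-ops. The shape of the pattern, with a hypothetical module name for illustration:

```Ruby
module LazyYaml # Hypothetical name, not part of NHKore.
  def self.dump_yaml(obj)
    # Paid once, on the first call; later requires return false immediately.
    require 'psychgus'

    return Psychgus.dump(obj)
  end
end
```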
data/lib/nhkore/variator.rb
CHANGED
data/lib/nhkore/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: nhkore
 version: !ruby/object:Gem::Version
-  version: 0.3.
+  version: 0.3.2
 platform: ruby
 authors:
 - Jonathan Bradley Whited (@esotericpig)
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-04-
+date: 2020-04-21 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bimyou_segmenter
@@ -349,6 +349,7 @@ files:
 - lib/nhkore/entry.rb
 - lib/nhkore/error.rb
 - lib/nhkore/fileable.rb
+- lib/nhkore/lib.rb
 - lib/nhkore/missingno.rb
 - lib/nhkore/news.rb
 - lib/nhkore/polisher.rb
@@ -374,7 +375,7 @@ metadata:
   changelog_uri: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md
   homepage_uri: https://github.com/esotericpig/nhkore
   source_code_uri: https://github.com/esotericpig/nhkore
-post_install_message: " \n NHKore v0.3.
+post_install_message: " \n NHKore v0.3.2\n \n You can now use [nhkore] on the
   command line.\n \n Homepage: https://github.com/esotericpig/nhkore\n \n Code:
   \ https://github.com/esotericpig/nhkore\n Changelog: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md\n
   \ Bugs: https://github.com/esotericpig/nhkore/issues\n \n"