nhkore 0.3.1 → 0.3.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +28 -2
- data/README.md +468 -3
- data/lib/nhkore/app.rb +1 -0
- data/lib/nhkore/article_scraper.rb +12 -11
- data/lib/nhkore/cli/news_cmd.rb +1 -1
- data/lib/nhkore/lib.rb +58 -0
- data/lib/nhkore/scraper.rb +18 -4
- data/lib/nhkore/sifter.rb +15 -10
- data/lib/nhkore/util.rb +7 -2
- data/lib/nhkore/variator.rb +1 -0
- data/lib/nhkore/version.rb +1 -1
- metadata +4 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: cf151c3859812632f09b1a464164f31bb0ce050f37ed7e7377f76265571ebd41
+  data.tar.gz: 1f3ee801e7557731cae4aeacd3f18fea4d7f33ac65b6ec77511a7d3d8f17856a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7e7d0d5b805ad6fa4312e8be26f3115dff18665b3762073c56db3a7a6a343a3ee6a05e47889e0abf7b62df3bb84cf5c977fce3efdfeb8a65c7bcff8167839d35
+  data.tar.gz: 957bc3da8492310d287a8947b9080f8be417f0874c3226db4f0bb63d020bee06c51a3da81c1fa3f779de22d354a32ab4cf41fc6f3018840774c31fd7060fbec3
data/CHANGELOG.md
CHANGED
@@ -2,7 +2,33 @@
 
 Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
-## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.
+## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.2...master)
+
+## [v0.3.2] - 2020-04-22
+
+### Added
+- lib/nhkore/lib.rb
+  - Requires all files, excluding CLI-related files, for speed when using this Gem as a library.
+- Scraper
+  - Added open_file() & reopen().
+- samples/looper.rb
+  - Script example of continuously scraping all articles.
+
+### Changed
+- README
+  - Finished writing the initial version of all sections.
+- ArticleScraper
+  - Changed the `year` param to expect an int, instead of a string.
+- Sifter
+  - In filter_by_datetime(), renamed keyword args `from_filter,to_filter` to simply `from,to`.
+
+### Fixed
+- Reduced load time of app from ~1s to 0.3~0.5s.
+  - Moved many `require '...'` statements into methods.
+  - It looks ugly & is not a good coding practice, but a necessary evil.
+  - Load time is still pretty slow (but a lot better!).
+- ArticleScraper
+  - Renamed `mode` param to `strict`. `mode` was overshadowing File.open()'s in Scraper.
 
 ## [v0.3.1] - 2020-04-20
 
@@ -11,7 +37,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - NewsCmd/SiftCmd
   - Added `--no-sha256` option to not check if article links have already been scraped based on their contents' SHA-256.
 - Util
-  - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows &
+  - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows & Linux, it will check for both `/` & `\`.
 
 ### Fixed
 - Reduced load time of app from ~1s to ~0.3-5s by moving some requires into methods.
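Two of the renames above change call sites. A minimal before/after sketch, based on the README examples later in this diff (the URL and times are placeholder values):

```Ruby
require 'nhkore/lib'

# ArticleScraper: the `mode` param is now `strict`, and `year` expects an int.
as = NHKore::ArticleScraper.new(
  'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html',
  strict: false, # v0.3.1: mode: :lenient
  year: 2020,    # v0.3.1: year: '2020'
)

# Sifter: `from_filter`/`to_filter` are now simply `from`/`to`.
sifter = NHKore::Sifter.new(NHKore::YasashiiNews.load_file())
sifter.filter_by_datetime(from: Time.new(2019,12,4),to: Time.new(2019,12,7))
```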
data/README.md
CHANGED
@@ -293,7 +293,7 @@ links:
 
 If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
 
-Currently, it only searches &
+Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
 
 Example usage:
 
@@ -319,6 +319,49 @@ Complete demo:
 
 #### News Command [^](#contents)
 
+In [The Basics](#the-basics-), you learned how to scrape 1 article using the `-u/--url` option with the `news` command.
+
+After creating a file of links from the [search](#search-command-) command (or manually/programmatically), you can also scrape multiple articles from this file using the `news` command.
+
+The defaults will scrape the 1st unscraped article from the `links` file:
+
+`$ nhkore news easy`
+
+You can scrape the 1st **X** unscraped articles with the `-s/--scrape` option:
+
+```
+# Scrape the 1st 11 unscraped articles.
+$ nhkore news -s 11 easy
+```
+
+You may wish to re-scrape articles that have already been scraped with the `-r/--redo` option:
+
+`$ nhkore news -r -s 11 easy`
+
+If you only wish to scrape specific article links, then you should use the `-k/--like` option, which does a fuzzy search on the URLs. For example, `--like '00123'` will match these links:
+
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**23711000/k10012323711000.html
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21401000/k10012321401000.html
+- http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21511000/k10012321511000.html
+- ...
+
+`$ nhkore news -k '00123' -s 11 easy`
+
+Lastly, you can show the dictionary URL and contents for the 1st article if you're getting dictionary-related errors:
+
+```
+# This will exit after showing the 1st article's dictionary.
+$ nhkore news easy --show-dict
+```
+
+For the rest of the options, please see [The Basics](#the-basics-).
+
+Complete demo:
+
+[![asciinema Demo - News](https://asciinema.org/a/322324.png)](https://asciinema.org/a/322324)
+
+When I first scraped all of the articles in [nhkore-core.zip](https://github.com/esotericpig/nhkore/releases/latest), I had to use this [script](samples/looper.rb) because my internet isn't very good.
+
 ## Using the Library [^](#contents)
 
 ### Setup
@@ -336,11 +379,431 @@ In your *Gemfile*:
 ```Ruby
 # Pick one...
 gem 'nhkore', '~> X.X'
-gem 'nhkore', :git => 'https://github.com/esotericpig/
+gem 'nhkore', :git => 'https://github.com/esotericpig/nhkore.git', :tag => 'vX.X.X'
+```
+
+### Require
+
+In order to not require all of the CLI-related files, require this file instead:
+
+```Ruby
+require 'nhkore/lib'
+
+#require 'nhkore' # Slower
 ```
 
 ### Scraper
 
+All scraper classes extend this class. You can either extend it or use it by itself. It's a simple wrapper around *open-uri*, *Nokogiri*, etc.
+
+`initialize` automatically opens (connects to) the URL.
+
+```Ruby
+require 'nhkore/scraper'
+
+class MyScraper < NHKore::Scraper
+  def initialize()
+    super('https://www3.nhk.or.jp/news/easy/')
+  end
+end
+
+m = MyScraper.new()
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+# Read all content into a String.
+mstr = m.read()
+sstr = s.read()
+
+# Get a Nokogiri::HTML object.
+mdoc = m.html_doc()
+sdoc = s.html_doc()
+
+# Get a RSS object.
+s = NHKore::Scraper.new('https://www.bing.com/search?format=rss&q=site%3Anhk.or.jp%2Fnews%2Feasy%2F&count=100')
+
+rss = s.rss_doc()
+```
+
+There are several useful options:
+
+```Ruby
+require 'nhkore/scraper'
+
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+  open_timeout: 300, # Open timeout in seconds (default: nil)
+  read_timeout: 300, # Read timeout in seconds (default: nil)
+
+  # Maximum number of times to retry the URL
+  # - default: 3
+  # - Open/connect will fail a couple of times on a bad/slow internet connection.
+  max_retries: 10,
+
+  # Maximum number of redirects allowed.
+  # - default: 3
+  # - You can set this to nil or -1, but I recommend using a number
+  #   for safety (infinite-loop attack).
+  max_redirects: 1,
+
+  # How to check redirect URLs for safety.
+  # - default: :strict
+  # - nil => do not check
+  # - :lenient => check the scheme only
+  #   (i.e., if https, redirect URL must be https)
+  # - :strict => check the scheme and domain
+  #   (i.e., if https://bing.com, redirect URL must be https://bing.com)
+  redirect_rule: :lenient,
+
+  # Set the HTTP header field 'cookie' from the 'set-cookie' response.
+  # - default: false
+  # - Currently uses the 'http-cookie' Gem.
+  # - This is currently a time-consuming operation because it opens the URL twice.
+  # - Necessary for Search Engines or other sites that require cookies
+  #   in order to block bots.
+  eat_cookie: true,
+
+  # Set HTTP header fields.
+  # - default: nil
+  # - Necessary for Search Engines or other sites that try to block bots.
+  # - Simply pass in a Hash (not nil) to set the default ones.
+  header: {'user-agent' => 'Skynet'}, # Must use strings
+)
+
+# Open the URL yourself. This will be passed in directly to Nokogiri::HTML().
+# - In this way, you can use Faraday, HTTParty, RestClient, httprb/http, or
+#   some other Gem.
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+  str_or_io: URI.open('https://www3.nhk.or.jp/news/easy/',redirect: false)
+)
+
+# Open and parse a file instead of a URL (for offline testing or slow internet).
+s = NHKore::Scraper.new('./my_article.html',is_file: true)
+
+doc = s.html_doc()
+```
+
+Here are some other useful methods:
+
+```Ruby
+require 'nhkore/scraper'
+
+s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+s.reopen() # Re-open the current URL.
+
+# Get a relative URL.
+url = s.join_url('../../monkey.html')
+puts url # https://www3.nhk.or.jp/monkey.html
+
+# Open a new URL or file.
+s.open(url)
+s.open(url,URI.open(url,redirect: false))
+
+s.open('./my_article.html',is_file: true)
+
+# Open a file manually.
+s.open_file('./my_article.html')
+
+# Fetch the cookie & open a new URL manually.
+s.fetch_cookie(url)
+s.open_url(url)
+```
+
+### SearchScraper & BingScraper
+
+`SearchScraper` is used for scraping Search Engines for NHK News Web (Easy) links. It can also be used for search in general.
+
+By default, it sets the default HTTP header fields and fetches & sets the cookie.
+
+```Ruby
+require 'nhkore/search_scraper'
+
+ss = NHKore::SearchScraper.new('https://www.bing.com/search?q=nhk&count=100')
+
+doc = ss.html_doc()
+
+doc.css('a').each() do |anchor|
+  link = anchor['href']
+
+  next if ss.ignore_link?(link)
+
+  if link.include?('https://www3.nhk')
+    puts link
+  end
+end
+```
+
+`BingScraper` will search `bing.com` for you.
+
+```Ruby
+require 'nhkore/search_link'
+require 'nhkore/search_scraper'
+
+bs = NHKore::BingScraper.new(:yasashii)
+slinks = NHKore::SearchLinks.new()
+
+next_page = bs.scrape(slinks)
+page_num = 1
+
+while !next_page.empty?()
+  puts "Page #{page_num += 1}: #{next_page.count}"
+
+  bs = NHKore::BingScraper.new(:yasashii,url: next_page.url)
+
+  next_page = bs.scrape(slinks,next_page)
+end
+
+slinks.links.values.each() do |link|
+  puts link.url
+end
+```
+
+### ArticleScraper & DictScraper
+
+`ArticleScraper` scrapes an NHK News Web Easy article. Regular articles aren't currently supported.
+
+```Ruby
+require 'nhkore/article_scraper'
+
+as = NHKore::ArticleScraper.new(
+  'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html',
+
+  # If false, scrape the article leniently (for older articles which
+  #   may not have certain tags, etc.).
+  # - default: true
+  strict: false,
+
+  # {Dict} to use as the dictionary for words (Easy articles).
+  # - default: :scrape
+  # - nil => don't scrape/use it (necessary for Regular articles)
+  # - :scrape => auto-scrape it using {DictScraper}
+  # - {Dict} => your own {Dict}
+  dict: nil,
+
+  # Date time to use as a fallback if the article doesn't have one
+  #   (for older articles).
+  # - default: nil
+  datetime: Time.new(2020,2,2),
+
+  # Year to use as a fallback if the article doesn't have one
+  #   (for older articles).
+  # - default: nil
+  year: 2020,
+)
+
+article = as.scrape()
+
+article.datetime
+article.futsuurl
+article.sha256
+article.title
+article.url
+
+article.words.each() do |key,word|
+  word.defn
+  word.eng
+  word.freq
+  word.kana
+  word.kanji
+  word.key
+end
+
+puts article.to_s(mini: true)
+puts '---'
+puts article
+```
+
+`DictScraper` scrapes an Easy article's dictionary file (JSON).
+
+```Ruby
+require 'nhkore/dict_scraper'
+
+url = 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'
+ds = NHKore::DictScraper.new(
+  url,
+
+  # Change the URL appropriately to the dictionary URL.
+  # - default: true
+  parse_url: true,
+)
+
+puts NHKore::DictScraper.parse_url(url)
+puts
+
+dict = ds.scrape()
+
+dict.entries.each() do |key,entry|
+  entry.id
+
+  entry.defns.each() do |defn|
+    defn.hyoukis.each() {|hyouki| }
+    defn.text
+    defn.words.each() {|word| }
+  end
+
+  puts entry.build_hyouki()
+  puts entry.build_defn()
+  puts '---'
+end
+
+puts
+puts dict
+```
+
+### Fileable
+
+Any class that includes the `Fileable` mixin will have the following methods:
+
+- Class.load_file(file,mode: 'rt:BOM|UTF-8',**kargs)
+- save_file(file,mode: 'wt',**kargs)
+
+Any *kargs* will be passed to `File.open()`.
+
+```Ruby
+require 'nhkore/news'
+require 'nhkore/search_link'
+
+yn = NHKore::YasashiiNews.load_file()
+sl = NHKore::SearchLinks.load_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+
+yn.articles.each() {|key,article| }
+yn.sha256s.each() {|sha256,url| }
+
+sl.links.each() do |key,link|
+  link.datetime
+  link.futsuurl
+  link.scraped?
+  link.sha256
+  link.title
+  link.url
+end
+
+#yn.save_file()
+#sl.save_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+```
+
+### Sifter
+
+`Sifter` will sift & sort the `News` data into a single file. The data is sorted by frequency in descending order (i.e., most frequent words first).
+
+```Ruby
+require 'nhkore/news'
+require 'nhkore/sifter'
+require 'time'
+
+news = NHKore::YasashiiNews.load_file()
+
+sifter = NHKore::Sifter.new(news)
+
+sifter.caption = 'Sakura Fields Forever!'
+
+# Filter the data.
+#sifter.filter_by_datetime(Time.new(2019,12,5))
+sifter.filter_by_datetime(
+  from: Time.new(2019,12,4),to: Time.new(2019,12,7)
+)
+sifter.filter_by_title('桜')
+sifter.filter_by_url('k100')
+
+# Ignore (or blank out) certain columns from the output.
+sifter.ignore(:defn)
+sifter.ignore(:eng)
+
+# An array of the filtered & sorted words.
+words = sifter.sift()
+
+# Choose the file format.
+#sifter.put_csv!()
+#sifter.put_html!()
+sifter.put_yaml!()
+
+# Save to a file.
+file = 'sakura.yml'
+
+if !File.exist?(file)
+  sifter.save_file(file)
+end
+```
+
+### Util & UserAgents
+
+These provide a variety of useful methods/constants.
+
+Here are some of the most useful ones:
+
+```Ruby
+require 'nhkore/user_agents'
+require 'nhkore/util'
+
+include NHKore
+
+puts '======='
+puts '[ Net ]'
+puts '======='
+# Get a random User Agent for HTTP header field 'User-Agent'.
+# - This is used by default in Scraper/SearchScraper.
+puts "User-Agent: #{UserAgents.sample()}"
+
+uri = URI('https://www.bing.com/search?q=nhk')
+Util.replace_uri_query!(uri,q: 'banana')
+
+puts "URI query: #{uri}" # https://www.bing.com/search?q=banana
+# nhk.or.jp
+puts "Domain: #{Util.domain(URI('https://www.nhk.or.jp/news/easy').host)}"
+# Ben & Jerry's<br>
+puts "Escape HTML: #{Util.escape_html("Ben & Jerry's\n")}"
+puts
+
+puts '========'
+puts '[ Time ]'
+puts '========'
+puts "JST now: #{Util.jst_now}"
+# Drops in JST_OFFSET, does not change hour/min.
+puts "JST time: #{Util.jst_time(Time.now)}"
+puts "JST year: #{Util::JST_YEAR}"
+puts "1999 sane? #{Util.sane_year?(1999)}" # true
+puts "1776 sane? #{Util.sane_year?(1776)}" # false
+puts "Guess 5: #{Util.guess_year(5)}" # 2005
+puts "Guess 99: #{Util.guess_year(99)}" # 1999
+puts
+puts "JST timezone offset: #{Util::JST_OFFSET}"
+puts "JST timezone offset hour: #{Util::JST_OFFSET_HOUR}"
+puts "JST timezone offset minute: #{Util::JST_OFFSET_MIN}"
+puts
+
+puts '============'
+puts '[ Japanese ]'
+puts '============'
+
+JPN = ['桜','ぶ','ブ']
+
+def fmt_jpn()
+  fmt = []
+
+  JPN.each() do |x|
+    x = yield(x)
+    x = x ? "\u2B55" : Util::JPN_SPACE unless x.is_a?(String)
+    fmt << x
+  end
+
+  return "[ #{fmt.join(' | ')} ]"
+end
+
+puts " #{fmt_jpn{|x| x}}"
+puts "Hiragana? #{fmt_jpn{|x| !!Util.hiragana?(x)}}"
+puts "Kana? #{fmt_jpn{|x| !!Util.kana?(x)}}"
+puts "Kanji? #{fmt_jpn{|x| !!Util.kanji?(x)}}"
+puts "Reduce: #{Util.reduce_jpn_space("' '")}"
+puts
+
+puts '========='
+puts '[ Files ]'
+puts '========='
+puts "Dir str? #{Util.dir_str?('dir/')}" # true
+puts "Dir str? #{Util.dir_str?('dir')}" # false
+puts "File str? #{Util.filename_str?('file')}" # true
+puts "File str? #{Util.filename_str?('dir/file')}" # false
+```
+
 ## Hacking [^](#contents)
 
 ```
@@ -370,7 +833,9 @@ $ bundle exec rake nokogiri_other # macOS, Windows, etc.
 
 `$ bundle exec rake doc`
 
-### Installing Locally
+### Installing Locally
+
+You can make some changes/fixes to the code and then install your local version:
 
 `$ bundle exec rake install:local`
 
data/lib/nhkore/app.rb
CHANGED
data/lib/nhkore/article_scraper.rb
CHANGED
@@ -47,19 +47,21 @@ module NHKore
     attr_accessor :dict
     attr_reader :kargs
     attr_accessor :missingno
-    attr_accessor :mode
     attr_reader :polishers
     attr_accessor :splitter
+    attr_accessor :strict
     attr_reader :variators
     attr_accessor :year
 
+    alias_method :strict?,:strict
+
     # @param dict [Dict,:scrape,nil] the {Dict} (dictionary) to use for {Word#defn} (definitions)
     #   [+:scrape+] auto-scrape it using {DictScraper}
     #   [+nil+] don't scrape/use it
     # @param missingno [Missingno] data to use as a fallback for Ruby words without kana/kanji,
     #   instead of raising an error
-    # @param
-    def initialize(url,cleaners: [BestCleaner.new()],datetime: nil,dict: :scrape,missingno: nil,
+    # @param strict [true,false]
+    def initialize(url,cleaners: [BestCleaner.new()],datetime: nil,dict: :scrape,missingno: nil,polishers: [BestPolisher.new()],splitter: BestSplitter.new(),strict: true,variators: [BestVariator.new()],year: nil,**kargs)
       super(url,**kargs)
 
       @cleaners = Array(cleaners)
@@ -67,9 +69,9 @@ module NHKore
       @dict = dict
       @kargs = kargs
       @missingno = missingno
-      @mode = mode
      @polishers = Array(polishers)
       @splitter = splitter
+      @strict = strict
       @variators = Array(variators)
       @year = year
     end
@@ -188,7 +190,7 @@ module NHKore
       tag = doc.css('div.article-body') if tag.length < 1
 
       # - https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html
-      tag = doc.css('div#main') if tag.length < 1 &&
+      tag = doc.css('div#main') if tag.length < 1 && !@strict
 
       if tag.length > 0
         text = Util.unspace_web_str(tag.text.to_s())
@@ -481,7 +483,7 @@ module NHKore
     def scrape_title(doc,article)
       tag = doc.css('h1.article-main__title')
 
-      if tag.length < 1 &&
+      if tag.length < 1 && !@strict
         # This shouldn't be used except for select sites.
         # - https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html
 
@@ -583,7 +585,7 @@ module NHKore
       end
 
       # As a last resort, use our user-defined fallbacks (if specified).
-      return @year unless
+      return @year.to_i() unless @year.nil?()
       return @datetime.year if !@datetime.nil?() && Util.sane_year?(@datetime.year)
 
       raise ScrapeError,"could not scrape year at URL[#{@url}]"
@@ -604,11 +606,10 @@ module NHKore
     end
 
     def warn_or_error(klass,msg)
-
-      when :lenient
-        Util.warn(msg)
-      else
+      if @strict
         raise klass,msg
+      else
+        Util.warn(msg)
       end
     end
   end
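The net effect of the `mode` → `strict` rename: `warn_or_error()` now raises the given error class when `@strict` is true and only warns otherwise, and the lenient fallbacks (`div#main`, the `@year`/`@datetime` year) only kick in when `strict` is false. A small sketch, reusing the fallback-needing URL from the comments above (whether it scrapes cleanly depends on the live page):

```Ruby
require 'nhkore/article_scraper'

url = 'https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html'

# strict: true (the default) raises a ScrapeError on missing tags;
# strict: false warns and falls back leniently.
as = NHKore::ArticleScraper.new(url,strict: false,year: 2020)
article = as.scrape()
```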
data/lib/nhkore/cli/news_cmd.rb
CHANGED
data/lib/nhkore/lib.rb
ADDED
@@ -0,0 +1,58 @@
+#!/usr/bin/env ruby
+# encoding: UTF-8
+# frozen_string_literal: true
+
+#--
+# This file is part of NHKore.
+# Copyright (c) 2020 Jonathan Bradley Whited (@esotericpig)
+#
+# NHKore is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Lesser General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# NHKore is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public License
+# along with NHKore. If not, see <https://www.gnu.org/licenses/>.
+#++
+
+
+require 'nhkore/article'
+require 'nhkore/article_scraper'
+require 'nhkore/cleaner'
+require 'nhkore/defn'
+require 'nhkore/dict'
+require 'nhkore/dict_scraper'
+require 'nhkore/entry'
+require 'nhkore/error'
+require 'nhkore/fileable'
+require 'nhkore/missingno'
+require 'nhkore/news'
+require 'nhkore/polisher'
+require 'nhkore/scraper'
+require 'nhkore/search_link'
+require 'nhkore/search_scraper'
+require 'nhkore/sifter'
+require 'nhkore/splitter'
+require 'nhkore/user_agents'
+require 'nhkore/util'
+require 'nhkore/variator'
+require 'nhkore/version'
+require 'nhkore/word'
+
+
+module NHKore
+  ###
+  # Include this file to only require the files needed to use this
+  # Gem as a library (i.e., don't include CLI-related files).
+  #
+  # @author Jonathan Bradley Whited (@esotericpig)
+  # @since 0.3.2
+  ###
+  module Lib
+  end
+end
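As the doc comment says, requiring this one file loads everything except the CLI. A usage sketch (assuming `version.rb` defines `NHKore::VERSION`, per the usual Gem convention):

```Ruby
# Fast path for library users (skips the CLI-related requires):
require 'nhkore/lib'

#require 'nhkore' # Slower; also loads the CLI.

puts NHKore::VERSION # Assumed constant from nhkore/version.
```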
data/lib/nhkore/scraper.rb
CHANGED
@@ -82,7 +82,7 @@ module NHKore
       @max_retries = max_retries
       @redirect_rule = redirect_rule
 
-      open(url,str_or_io)
+      open(url,str_or_io,is_file: is_file)
     end
 
     def fetch_cookie(url)
@@ -119,14 +119,14 @@
       return URI::join(@url,relative_url)
     end
 
-    def open(url,str_or_io=nil)
+    def open(url,str_or_io=nil,is_file: @is_file)
+      @is_file = is_file
       @str_or_io = str_or_io
       @url = url
 
       if str_or_io.nil?()
         if @is_file
-
-          @str_or_io = File.open(url,'rt:UTF-8',**@kargs)
+          open_file(url)
         else
           fetch_cookie(url) if @eat_cookie
           open_url(url)
@@ -136,6 +136,16 @@
       return self
     end
 
+    def open_file(file)
+      @is_file = true
+      @url = file
+
+      # NHK's website tends to always use UTF-8.
+      @str_or_io = File.open(file,'rt:UTF-8',**@kargs)
+
+      return self
+    end
+
     def open_url(url)
       max_redirects = (@max_redirects.nil?() || @max_redirects < 0) ? 10_000 : @max_redirects
       max_retries = (@max_retries.nil?() || @max_retries < 0) ? 10_000 : @max_retries
@@ -194,6 +204,10 @@
       return @str_or_io
     end
 
+    def reopen()
+      return open(@url)
+    end
+
     def rss_doc()
       require 'rss'
 
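A short sketch of the new methods, mirroring the README examples added in this release (the local file path is a placeholder):

```Ruby
require 'nhkore/scraper'

s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')

s.reopen() # Re-open (re-fetch) the current URL.

# open() now remembers is_file, so a file-backed scraper stays file-backed.
s.open('./my_article.html',is_file: true)
s.open_file('./my_article.html') # Manual form; always treats the arg as a file.

doc = s.html_doc()
```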
data/lib/nhkore/sifter.rb
CHANGED
@@ -93,24 +93,29 @@
       return false
     end
 
-    def filter_by_datetime(datetime_filter=nil,
+    def filter_by_datetime(datetime_filter=nil,from: nil,to: nil)
       if !datetime_filter.nil?()
-
-
-
+        if datetime_filter.respond_to?(:'[]')
+          # If out-of-bounds, just nil.
+          from = datetime_filter[0] if from.nil?()
+          to = datetime_filter[1] if to.nil?()
+        else
+          from = datetime_filter if from.nil?()
+          to = datetime_filter if to.nil?()
+        end
       end
 
-
-
+      from = to if from.nil?()
+      to = from if to.nil?()
 
-
-
+      from = Util.jst_time(from) unless from.nil?()
+      to = Util.jst_time(to) unless to.nil?()
 
-      datetime_filter = [
+      datetime_filter = [from,to]
 
       return self if datetime_filter.flatten().compact().empty?()
 
-      @filters[:datetime] = {from:
+      @filters[:datetime] = {from: from,to: to}
 
       return self
     end
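After this rework, `filter_by_datetime()` accepts a single datetime, anything indexable as a `[from,to]` pair, or the new `from:`/`to:` keywords; a missing bound is filled from the other, and both are converted to JST. A sketch of the three call forms (each call overwrites the previous `:datetime` filter; times are placeholders):

```Ruby
require 'nhkore/news'
require 'nhkore/sifter'

sifter = NHKore::Sifter.new(NHKore::YasashiiNews.load_file())

sifter.filter_by_datetime(Time.new(2019,12,5))                        # from = to
sifter.filter_by_datetime([Time.new(2019,12,4),Time.new(2019,12,7)]) # [from,to]
sifter.filter_by_datetime(from: Time.new(2019,12,4),to: Time.new(2019,12,7))
```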
data/lib/nhkore/util.rb
CHANGED
@@ -22,8 +22,7 @@
 
 
 require 'cgi'
-require '
-require 'public_suffix'
+require 'set'
 require 'time'
 require 'uri'
 
@@ -68,6 +67,8 @@
     end
 
     def self.domain(host,clean: true)
+      require 'public_suffix'
+
       domain = PublicSuffix.domain(host)
       domain = unspace_web_str(domain).downcase() if !domain.nil?() && clean
 
@@ -75,6 +76,8 @@
     end
 
     def self.dump_yaml(obj,flow_level: 8)
+      require 'psychgus'
+
       return Psychgus.dump(obj,
         deref_aliases: true, # Dereference aliases for load_yaml()
         line_width: 10000, # Try not to wrap; ichiman!
@@ -142,6 +145,8 @@
     end
 
     def self.load_yaml(data,file: nil,**kargs)
+      require 'psychgus'
+
       return Psych.safe_load(data,
         aliases: false,
         filename: file,
data/lib/nhkore/variator.rb
CHANGED
data/lib/nhkore/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: nhkore
 version: !ruby/object:Gem::Version
-  version: 0.3.
+  version: 0.3.2
 platform: ruby
 authors:
 - Jonathan Bradley Whited (@esotericpig)
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-04-
+date: 2020-04-21 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bimyou_segmenter
@@ -349,6 +349,7 @@ files:
 - lib/nhkore/entry.rb
 - lib/nhkore/error.rb
 - lib/nhkore/fileable.rb
+- lib/nhkore/lib.rb
 - lib/nhkore/missingno.rb
 - lib/nhkore/news.rb
 - lib/nhkore/polisher.rb
@@ -374,7 +375,7 @@ metadata:
   changelog_uri: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md
   homepage_uri: https://github.com/esotericpig/nhkore
   source_code_uri: https://github.com/esotericpig/nhkore
-post_install_message: " \n NHKore v0.3.
+post_install_message: " \n NHKore v0.3.2\n \n You can now use [nhkore] on the
   command line.\n \n Homepage: https://github.com/esotericpig/nhkore\n \n Code:
   \ https://github.com/esotericpig/nhkore\n Changelog: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md\n
   \ Bugs: https://github.com/esotericpig/nhkore/issues\n \n"