nhkore 0.3.1 → 0.3.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: fb2c0e6e53995b874a9e53c44b024f993032433d1a87c37e7b7bdea69965902d
-   data.tar.gz: 13d34c53fe9af9efa985c05089b1588eb1e76d6321f9aff18cc5da80598a52d4
+   metadata.gz: cf151c3859812632f09b1a464164f31bb0ce050f37ed7e7377f76265571ebd41
+   data.tar.gz: 1f3ee801e7557731cae4aeacd3f18fea4d7f33ac65b6ec77511a7d3d8f17856a
  SHA512:
-   metadata.gz: 643723d42e939a7852eca3b90c3ec4e65085838317eb59c1d8f21f79dd647d2e77e5ea68ab2ff3b5a208608f9bf350121a9918cb318dec6c3047731b73f59294
-   data.tar.gz: 3481fea3a3895a5b85ac3fcd5a77fe9b811f84e9a19b395a1de1d2e9b31fda93c5fb49a8d7d43581e05cb90c6f844f8537c5a97d73937c2b8ee97728ac7c7a1f
+   metadata.gz: 7e7d0d5b805ad6fa4312e8be26f3115dff18665b3762073c56db3a7a6a343a3ee6a05e47889e0abf7b62df3bb84cf5c977fce3efdfeb8a65c7bcff8167839d35
+   data.tar.gz: 957bc3da8492310d287a8947b9080f8be417f0874c3226db4f0bb63d020bee06c51a3da81c1fa3f779de22d354a32ab4cf41fc6f3018840774c31fd7060fbec3
data/CHANGELOG.md CHANGED
@@ -2,7 +2,33 @@
 
  Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
- ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.1...master)
+ ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.2...master)
+
+ ## [v0.3.2] - 2020-04-22
+
+ ### Added
+ - lib/nhkore/lib.rb
+   - Requires all files, excluding CLI-related files, for speed when using this Gem as a library.
+ - Scraper
+   - Added open_file() & reopen().
+ - samples/looper.rb
+   - Script example of continuously scraping all articles.
+
+ ### Changed
+ - README
+   - Finished writing the initial version of all sections.
+ - ArticleScraper
+   - Changed the `year` param to expect an int, instead of a string.
+ - Sifter
+   - In filter_by_datetime(), renamed keyword args `from_filter,to_filter` to simply `from,to`.
+
+ ### Fixed
+ - Reduced load time of app from ~1s to ~0.3-0.5s.
+   - Moved many `require '...'` statements into methods.
+     - It looks ugly & is not a good coding practice, but a necessary evil.
+     - Load time is still pretty slow (but a lot better!).
+ - ArticleScraper
+   - Renamed `mode` param to `strict`; `mode` was shadowing `File.open()`'s `mode` param in Scraper.
 
  ## [v0.3.1] - 2020-04-20
 
@@ -11,7 +37,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
  - NewsCmd/SiftCmd
    - Added `--no-sha256` option to not check if article links have already been scraped based on their contents' SHA-256.
  - Util
-   - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows & Linux, it will check for both `/` & `\`.
+   - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows & Linux, it will check for both `/` & `\`.
 
  ### Fixed
  - Reduced load time of app from ~1s to ~0.3-0.5s by moving some requires into methods.
data/README.md CHANGED
@@ -293,7 +293,7 @@ links:
 
  If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
 
- Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
+ Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
 
  Example usage:
 
@@ -319,6 +319,49 @@ Complete demo:
 
  #### News Command [^](#contents)
 
+ In [The Basics](#the-basics-), you learned how to scrape 1 article using the `-u/--url` option with the `news` command.
+
+ After creating a file of links from the [search](#search-command-) command (or manually/programmatically), you can also scrape multiple articles from this file using the `news` command.
+
+ The defaults will scrape the 1st unscraped article from the `links` file:
+
+ `$ nhkore news easy`
+
+ You can scrape the 1st **X** unscraped articles with the `-s/--scrape` option:
+
+ ```
+ # Scrape the 1st 11 unscraped articles.
+ $ nhkore news -s 11 easy
+ ```
+
+ To re-scrape articles that have already been scraped, use the `-r/--redo` option:
+
+ `$ nhkore news -r -s 11 easy`
+
+ If you only wish to scrape specific article links, then you should use the `-k/--like` option, which does a fuzzy search on the URLs. For example, `--like '00123'` will match these links:
+
+ - http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**23711000/k10012323711000.html
+ - http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21401000/k10012321401000.html
+ - http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21511000/k10012321511000.html
+ - ...
+
+ `$ nhkore news -k '00123' -s 11 easy`
+
+ Lastly, you can show the dictionary URL and contents for the 1st article if you're getting dictionary-related errors:
+
+ ```
+ # This will exit after showing the 1st article's dictionary.
+ $ nhkore news easy --show-dict
+ ```
+
+ For the rest of the options, please see [The Basics](#the-basics-).
+
+ Complete demo:
+
+ [![asciinema Demo - News](https://asciinema.org/a/322324.png)](https://asciinema.org/a/322324)
+
+ When I first scraped all of the articles in [nhkore-core.zip](https://github.com/esotericpig/nhkore/releases/latest), I had to use this [script](samples/looper.rb) because my internet isn't very good.
+
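A rough sketch of that script's idea (hypothetical; see [samples/looper.rb](samples/looper.rb) for the real thing), assuming `nhkore` exits non-zero on failure:

```Ruby
# Keep retrying the news command on a flaky connection.
loop do
  break if system('nhkore','news','-s','10','easy') # true on success (exit status 0).
  sleep(60) # Wait a bit before trying again.
end
```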
  ## Using the Library [^](#contents)
 
  ### Setup
@@ -336,11 +379,431 @@ In your *Gemfile*:
  ```Ruby
  # Pick one...
  gem 'nhkore', '~> X.X'
- gem 'nhkore', :git => 'https://github.com/esotericpig/psychgus.git', :tag => 'vX.X.X'
+ gem 'nhkore', :git => 'https://github.com/esotericpig/nhkore.git', :tag => 'vX.X.X'
+ ```
+
+ ### Require
+
+ In order to not require all of the CLI-related files, require this file instead:
+
+ ```Ruby
+ require 'nhkore/lib'
+
+ #require 'nhkore' # Slower
  ```
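To see the difference on your machine, here's a rough sketch using Ruby's standard Benchmark module (times and the exact speedup will vary):

```Ruby
require 'benchmark'

# Rough load-time check; 'nhkore/lib' skips the CLI-related files.
puts Benchmark.realtime { require 'nhkore/lib' }
#puts Benchmark.realtime { require 'nhkore' } # Slower; loads the CLI files too.
```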
 
  ### Scraper
 
+ All scraper classes extend this class. You can either extend it or use it by itself. It's a simple wrapper around *open-uri*, *Nokogiri*, etc.
+
+ `initialize` automatically opens (connects to) the URL.
+
+ ```Ruby
+ require 'nhkore/scraper'
+
+ class MyScraper < NHKore::Scraper
+   def initialize()
+     super('https://www3.nhk.or.jp/news/easy/')
+   end
+ end
+
+ m = MyScraper.new()
+ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+ # Read all content into a String.
+ mstr = m.read()
+ sstr = s.read()
+
+ # Get a Nokogiri::HTML object.
+ mdoc = m.html_doc()
+ sdoc = s.html_doc()
+
+ # Get an RSS object.
+ s = NHKore::Scraper.new('https://www.bing.com/search?format=rss&q=site%3Anhk.or.jp%2Fnews%2Feasy%2F&count=100')
+
+ rss = s.rss_doc()
+ ```
+
+ There are several useful options:
+
+ ```Ruby
+ require 'nhkore/scraper'
+
+ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+   open_timeout: 300, # Open timeout in seconds (default: nil)
+   read_timeout: 300, # Read timeout in seconds (default: nil)
+
+   # Maximum number of times to retry the URL.
+   # - default: 3
+   # - Open/connect will fail a couple of times on a bad/slow internet connection.
+   max_retries: 10,
+
+   # Maximum number of redirects allowed.
+   # - default: 3
+   # - You can set this to nil or -1, but I recommend using a number
+   #   for safety (infinite-loop attack).
+   max_redirects: 1,
+
+   # How to check redirect URLs for safety.
+   # - default: :strict
+   # - nil      => do not check
+   # - :lenient => check the scheme only
+   #               (i.e., if https, redirect URL must be https)
+   # - :strict  => check the scheme and domain
+   #               (i.e., if https://bing.com, redirect URL must be https://bing.com)
+   redirect_rule: :lenient,
+
+   # Set the HTTP header field 'cookie' from the 'set-cookie' response.
+   # - default: false
+   # - Currently uses the 'http-cookie' Gem.
+   # - This is currently a time-consuming operation because it opens the URL twice.
+   # - Necessary for Search Engines or other sites that require cookies
+   #   in order to block bots.
+   eat_cookie: true,
+
+   # Set HTTP header fields.
+   # - default: nil
+   # - Necessary for Search Engines or other sites that try to block bots.
+   # - Simply pass in a Hash (not nil) to set the default ones.
+   header: {'user-agent' => 'Skynet'}, # Must use strings
+ )
+
+ # Open the URL yourself. This will be passed in directly to Nokogiri::HTML().
+ # - In this way, you can use Faraday, HTTParty, RestClient, httprb/http, or
+ #   some other Gem.
+ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+   str_or_io: URI.open('https://www3.nhk.or.jp/news/easy/',redirect: false)
+ )
+
+ # Open and parse a file instead of a URL (for offline testing or slow internet).
+ s = NHKore::Scraper.new('./my_article.html',is_file: true)
+
+ doc = s.html_doc()
+ ```
+
+ Here are some other useful methods:
+
+ ```Ruby
+ require 'nhkore/scraper'
+
+ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+ s.reopen() # Re-open the current URL.
+
+ # Get a relative URL.
+ url = s.join_url('../../monkey.html')
+ puts url # https://www3.nhk.or.jp/monkey.html
+
+ # Open a new URL or file.
+ s.open(url)
+ s.open(url,URI.open(url,redirect: false))
+
+ s.open('./my_article.html',is_file: true)
+
+ # Open a file manually.
+ s.open_file('./my_article.html')
+
+ # Fetch the cookie & open a new URL manually.
+ s.fetch_cookie(url)
+ s.open_url(url)
+ ```
+
+ ### SearchScraper & BingScraper
+
+ `SearchScraper` is used for scraping Search Engines for NHK News Web (Easy) links. It can also be used for search in general.
+
+ By default, it sets the default HTTP header fields and fetches & sets the cookie.
+
+ ```Ruby
+ require 'nhkore/search_scraper'
+
+ ss = NHKore::SearchScraper.new('https://www.bing.com/search?q=nhk&count=100')
+
+ doc = ss.html_doc()
+
+ doc.css('a').each() do |anchor|
+   link = anchor['href']
+
+   next if ss.ignore_link?(link)
+
+   if link.include?('https://www3.nhk')
+     puts link
+   end
+ end
+ ```
+
+ `BingScraper` will search `bing.com` for you.
+
+ ```Ruby
+ require 'nhkore/search_link'
+ require 'nhkore/search_scraper'
+
+ bs = NHKore::BingScraper.new(:yasashii)
+ slinks = NHKore::SearchLinks.new()
+
+ next_page = bs.scrape(slinks)
+ page_num = 1
+
+ while !next_page.empty?()
+   puts "Page #{page_num += 1}: #{next_page.count}"
+
+   bs = NHKore::BingScraper.new(:yasashii,url: next_page.url)
+
+   next_page = bs.scrape(slinks,next_page)
+ end
+
+ slinks.links.values.each() do |link|
+   puts link.url
+ end
+ ```
+
+ ### ArticleScraper & DictScraper
+
+ `ArticleScraper` scrapes an NHK News Web Easy article. Regular articles aren't currently supported.
+
+ ```Ruby
+ require 'nhkore/article_scraper'
+
+ as = NHKore::ArticleScraper.new(
+   'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html',
+
+   # If false, scrape the article leniently (for older articles which
+   # may not have certain tags, etc.).
+   # - default: true
+   strict: false,
+
+   # {Dict} to use as the dictionary for words (Easy articles).
+   # - default: :scrape
+   # - nil     => don't scrape/use it (necessary for Regular articles)
+   # - :scrape => auto-scrape it using {DictScraper}
+   # - {Dict}  => your own {Dict}
+   dict: nil,
+
+   # Date time to use as a fallback if the article doesn't have one
+   # (for older articles).
+   # - default: nil
+   datetime: Time.new(2020,2,2),
+
+   # Year to use as a fallback if the article doesn't have one
+   # (for older articles).
+   # - default: nil
+   year: 2020,
+ )
+
+ article = as.scrape()
+
+ article.datetime
+ article.futsuurl
+ article.sha256
+ article.title
+ article.url
+
+ article.words.each() do |key,word|
+   word.defn
+   word.eng
+   word.freq
+   word.kana
+   word.kanji
+   word.key
+ end
+
+ puts article.to_s(mini: true)
+ puts '---'
+ puts article
+ ```
+
+ `DictScraper` scrapes an Easy article's dictionary file (JSON).
+
+ ```Ruby
+ require 'nhkore/dict_scraper'
+
+ url = 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'
+ ds = NHKore::DictScraper.new(
+   url,
+
+   # Change the URL appropriately to the dictionary URL.
+   # - default: true
+   parse_url: true,
+ )
+
+ puts NHKore::DictScraper.parse_url(url)
+ puts
+
+ dict = ds.scrape()
+
+ dict.entries.each() do |key,entry|
+   entry.id
+
+   entry.defns.each() do |defn|
+     defn.hyoukis.each() {|hyouki| }
+     defn.text
+     defn.words.each() {|word| }
+   end
+
+   puts entry.build_hyouki()
+   puts entry.build_defn()
+   puts '---'
+ end
+
+ puts
+ puts dict
+ ```
+
+ ### Fileable
+
+ Any class that includes the `Fileable` mixin will have the following methods:
+
+ - Class.load_file(file,mode: 'rt:BOM|UTF-8',**kargs)
+ - save_file(file,mode: 'wt',**kargs)
+
+ Any *kargs* will be passed to `File.open()`.
+
+ ```Ruby
+ require 'nhkore/news'
+ require 'nhkore/search_link'
+
+ yn = NHKore::YasashiiNews.load_file()
+ sl = NHKore::SearchLinks.load_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+
+ yn.articles.each() {|key,article| }
+ yn.sha256s.each() {|sha256,url| }
+
+ sl.links.each() do |key,link|
+   link.datetime
+   link.futsuurl
+   link.scraped?
+   link.sha256
+   link.title
+   link.url
+ end
+
+ #yn.save_file()
+ #sl.save_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+ ```
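Since any *kargs* go straight to `File.open()`, you can, for example, pin the output encoding; a small sketch using the constants above:

```Ruby
require 'nhkore/search_link'

sl = NHKore::SearchLinks.load_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)

# mode: and any extra kargs are passed through to File.open().
sl.save_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE,mode: 'wt:UTF-8')
```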
+
+ ### Sifter
+
+ `Sifter` will sift & sort the `News` data into a single file. The data is sorted by frequency in descending order (i.e., most frequent words first).
+
+ ```Ruby
+ require 'nhkore/news'
+ require 'nhkore/sifter'
+ require 'time'
+
+ news = NHKore::YasashiiNews.load_file()
+
+ sifter = NHKore::Sifter.new(news)
+
+ sifter.caption = 'Sakura Fields Forever!'
+
+ # Filter the data.
+ #sifter.filter_by_datetime(Time.new(2019,12,5))
+ sifter.filter_by_datetime(
+   from: Time.new(2019,12,4),to: Time.new(2019,12,7)
+ )
+ sifter.filter_by_title('桜')
+ sifter.filter_by_url('k100')
+
+ # Ignore (or blank out) certain columns from the output.
+ sifter.ignore(:defn)
+ sifter.ignore(:eng)
+
+ # An array of the filtered & sorted words.
+ words = sifter.sift()
+
+ # Choose the file format.
+ #sifter.put_csv!()
+ #sifter.put_html!()
+ sifter.put_yaml!()
+
+ # Save to a file.
+ file = 'sakura.yml'
+
+ if !File.exist?(file)
+   sifter.save_file(file)
+ end
+ ```
+
+ ### Util & UserAgents
+
+ These provide a variety of useful methods/constants.
+
+ Here are some of the most useful ones:
+
+ ```Ruby
+ require 'nhkore/user_agents'
+ require 'nhkore/util'
+
+ include NHKore
+
+ puts '======='
+ puts '[ Net ]'
+ puts '======='
+ # Get a random User Agent for HTTP header field 'User-Agent'.
+ # - This is used by default in Scraper/SearchScraper.
+ puts "User-Agent: #{UserAgents.sample()}"
+
+ uri = URI('https://www.bing.com/search?q=nhk')
+ Util.replace_uri_query!(uri,q: 'banana')
+
+ puts "URI query: #{uri}" # https://www.bing.com/search?q=banana
+ # nhk.or.jp
+ puts "Domain: #{Util.domain(URI('https://www.nhk.or.jp/news/easy').host)}"
+ # Ben &amp; Jerry&#39;s<br>
+ puts "Escape HTML: #{Util.escape_html("Ben & Jerry's\n")}"
+ puts
+
+ puts '========'
+ puts '[ Time ]'
+ puts '========'
+ puts "JST now: #{Util.jst_now}"
+ # Drops in JST_OFFSET, does not change hour/min.
+ puts "JST time: #{Util.jst_time(Time.now)}"
+ puts "JST year: #{Util::JST_YEAR}"
+ puts "1999 sane? #{Util.sane_year?(1999)}" # true
+ puts "1776 sane? #{Util.sane_year?(1776)}" # false
+ puts "Guess 5:  #{Util.guess_year(5)}"  # 2005
+ puts "Guess 99: #{Util.guess_year(99)}" # 1999
+ puts
+ puts "JST timezone offset:        #{Util::JST_OFFSET}"
+ puts "JST timezone offset hour:   #{Util::JST_OFFSET_HOUR}"
+ puts "JST timezone offset minute: #{Util::JST_OFFSET_MIN}"
+ puts
+
+ puts '============'
+ puts '[ Japanese ]'
+ puts '============'
+
+ JPN = ['桜','ぶ','ブ']
+
+ def fmt_jpn()
+   fmt = []
+
+   JPN.each() do |x|
+     x = yield(x)
+     x = x ? "\u2B55" : Util::JPN_SPACE unless x.is_a?(String)
+     fmt << x
+   end
+
+   return "[ #{fmt.join(' | ')} ]"
+ end
+
+ puts "          #{fmt_jpn{|x| x}}"
+ puts "Hiragana? #{fmt_jpn{|x| !!Util.hiragana?(x)}}"
+ puts "Kana?     #{fmt_jpn{|x| !!Util.kana?(x)}}"
+ puts "Kanji?    #{fmt_jpn{|x| !!Util.kanji?(x)}}"
+ puts "Reduce:   #{Util.reduce_jpn_space("' '")}"
+ puts
+
+ puts '========='
+ puts '[ Files ]'
+ puts '========='
+ puts "Dir str?  #{Util.dir_str?('dir/')}" # true
+ puts "Dir str?  #{Util.dir_str?('dir')}"  # false
+ puts "File str? #{Util.filename_str?('file')}"     # true
+ puts "File str? #{Util.filename_str?('dir/file')}" # false
+ ```
+
  ## Hacking [^](#contents)
 
  ```
@@ -370,7 +833,9 @@ $ bundle exec rake nokogiri_other # macOS, Windows, etc.
 
  `$ bundle exec rake doc`
 
- ### Installing Locally (without Network Access)
+ ### Installing Locally
+
+ You can make some changes/fixes to the code and then install your local version:
 
  `$ bundle exec rake install:local`
 
data/lib/nhkore/app.rb CHANGED
@@ -24,6 +24,7 @@
  require 'cri'
  require 'highline'
  require 'rainbow'
+ require 'set'
  require 'tty-spinner'
 
  require 'nhkore/error'
data/lib/nhkore/article_scraper.rb CHANGED
@@ -47,19 +47,21 @@ module NHKore
    attr_accessor :dict
    attr_reader :kargs
    attr_accessor :missingno
-   attr_accessor :mode
    attr_reader :polishers
    attr_accessor :splitter
+   attr_accessor :strict
    attr_reader :variators
    attr_accessor :year
 
+   alias_method :strict?,:strict
+
    # @param dict [Dict,:scrape,nil] the {Dict} (dictionary) to use for {Word#defn} (definitions)
    #             [+:scrape+] auto-scrape it using {DictScraper}
    #             [+nil+]     don't scrape/use it
    # @param missingno [Missingno] data to use as a fallback for Ruby words without kana/kanji,
    #                  instead of raising an error
-   # @param mode [nil,:lenient]
-   def initialize(url,cleaners: [BestCleaner.new()],datetime: nil,dict: :scrape,missingno: nil,mode: nil,polishers: [BestPolisher.new()],splitter: BestSplitter.new(),variators: [BestVariator.new()],year: nil,**kargs)
+   # @param strict [true,false]
+   def initialize(url,cleaners: [BestCleaner.new()],datetime: nil,dict: :scrape,missingno: nil,polishers: [BestPolisher.new()],splitter: BestSplitter.new(),strict: true,variators: [BestVariator.new()],year: nil,**kargs)
      super(url,**kargs)
 
      @cleaners = Array(cleaners)
@@ -67,9 +69,9 @@ module NHKore
      @dict = dict
      @kargs = kargs
      @missingno = missingno
-     @mode = mode
      @polishers = Array(polishers)
      @splitter = splitter
+     @strict = strict
      @variators = Array(variators)
      @year = year
    end
@@ -188,7 +190,7 @@ module NHKore
      tag = doc.css('div.article-body') if tag.length < 1
 
      # - https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html
-     tag = doc.css('div#main') if tag.length < 1 && @mode == :lenient
+     tag = doc.css('div#main') if tag.length < 1 && !@strict
 
      if tag.length > 0
        text = Util.unspace_web_str(tag.text.to_s())
@@ -481,7 +483,7 @@ module NHKore
    def scrape_title(doc,article)
      tag = doc.css('h1.article-main__title')
 
-     if tag.length < 1 && @mode == :lenient
+     if tag.length < 1 && !@strict
        # This shouldn't be used except for select sites.
        # - https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html
 
@@ -583,7 +585,7 @@ module NHKore
      end
 
      # As a last resort, use our user-defined fallbacks (if specified).
-     return @year unless Util.empty_web_str?(@year)
+     return @year.to_i() unless @year.nil?()
      return @datetime.year if !@datetime.nil?() && Util.sane_year?(@datetime.year)
 
      raise ScrapeError,"could not scrape year at URL[#{@url}]"
@@ -604,11 +606,10 @@ module NHKore
    end
 
    def warn_or_error(klass,msg)
-     case @mode
-     when :lenient
-       Util.warn(msg)
-     else
+     if @strict
        raise klass,msg
+     else
+       Util.warn(msg)
      end
    end
data/lib/nhkore/cli/news_cmd.rb CHANGED
@@ -237,7 +237,7 @@ module CLI
        dict: dict,
        is_file: is_file,
        missingno: missingno ? Missingno.new(news) : nil,
-       mode: lenient ? :lenient : nil,
+       strict: !lenient,
      })
      @news_dict_scraper_kargs = @scraper_kargs.merge({
        is_file: is_file,
data/lib/nhkore/lib.rb ADDED
@@ -0,0 +1,58 @@
+ #!/usr/bin/env ruby
+ # encoding: UTF-8
+ # frozen_string_literal: true
+
+ #--
+ # This file is part of NHKore.
+ # Copyright (c) 2020 Jonathan Bradley Whited (@esotericpig)
+ #
+ # NHKore is free software: you can redistribute it and/or modify
+ # it under the terms of the GNU Lesser General Public License as published by
+ # the Free Software Foundation, either version 3 of the License, or
+ # (at your option) any later version.
+ #
+ # NHKore is distributed in the hope that it will be useful,
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ # GNU Lesser General Public License for more details.
+ #
+ # You should have received a copy of the GNU Lesser General Public License
+ # along with NHKore. If not, see <https://www.gnu.org/licenses/>.
+ #++
+
+
+ require 'nhkore/article'
+ require 'nhkore/article_scraper'
+ require 'nhkore/cleaner'
+ require 'nhkore/defn'
+ require 'nhkore/dict'
+ require 'nhkore/dict_scraper'
+ require 'nhkore/entry'
+ require 'nhkore/error'
+ require 'nhkore/fileable'
+ require 'nhkore/missingno'
+ require 'nhkore/news'
+ require 'nhkore/polisher'
+ require 'nhkore/scraper'
+ require 'nhkore/search_link'
+ require 'nhkore/search_scraper'
+ require 'nhkore/sifter'
+ require 'nhkore/splitter'
+ require 'nhkore/user_agents'
+ require 'nhkore/util'
+ require 'nhkore/variator'
+ require 'nhkore/version'
+ require 'nhkore/word'
+
+
+ module NHKore
+   ###
+   # Include this file to only require the files needed to use this
+   # Gem as a library (i.e., don't include CLI-related files).
+   #
+   # @author Jonathan Bradley Whited (@esotericpig)
+   # @since  0.3.2
+   ###
+   module Lib
+   end
+ end
data/lib/nhkore/scraper.rb CHANGED
@@ -82,7 +82,7 @@ module NHKore
      @max_retries = max_retries
      @redirect_rule = redirect_rule
 
-     open(url,str_or_io)
+     open(url,str_or_io,is_file: is_file)
    end
 
    def fetch_cookie(url)
@@ -119,14 +119,14 @@ module NHKore
      return URI::join(@url,relative_url)
    end
 
-   def open(url,str_or_io=nil)
+   def open(url,str_or_io=nil,is_file: @is_file)
+     @is_file = is_file
      @str_or_io = str_or_io
      @url = url
 
      if str_or_io.nil?()
        if @is_file
-         # NHK's website tends to always use UTF-8.
-         @str_or_io = File.open(url,'rt:UTF-8',**@kargs)
+         open_file(url)
        else
          fetch_cookie(url) if @eat_cookie
          open_url(url)
@@ -136,6 +136,16 @@ module NHKore
      return self
    end
 
+   def open_file(file)
+     @is_file = true
+     @url = file
+
+     # NHK's website tends to always use UTF-8.
+     @str_or_io = File.open(file,'rt:UTF-8',**@kargs)
+
+     return self
+   end
+
    def open_url(url)
      max_redirects = (@max_redirects.nil?() || @max_redirects < 0) ? 10_000 : @max_redirects
      max_retries = (@max_retries.nil?() || @max_retries < 0) ? 10_000 : @max_retries
@@ -194,6 +204,10 @@ module NHKore
      return @str_or_io
    end
 
+   def reopen()
+     return open(@url)
+   end
+
    def rss_doc()
      require 'rss'
 
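Since `reopen()` simply re-opens the current URL (see the hunk above), one use is a crude retry after a failed read; a minimal sketch, assuming a failed read raises a `StandardError`:

```Ruby
require 'nhkore/scraper'

s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')

begin
  doc = s.html_doc()
rescue StandardError
  s.reopen()         # Re-connect to the same URL...
  doc = s.html_doc() # ...and try once more.
end
```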
data/lib/nhkore/sifter.rb CHANGED
@@ -93,24 +93,29 @@ module NHKore
      return false
    end
 
-   def filter_by_datetime(datetime_filter=nil,from_filter: nil,to_filter: nil)
+   def filter_by_datetime(datetime_filter=nil,from: nil,to: nil)
      if !datetime_filter.nil?()
-       # If out-of-bounds, just nil.
-       from_filter = datetime_filter[0]
-       to_filter = datetime_filter[1]
+       if datetime_filter.respond_to?(:'[]')
+         # If out-of-bounds, just nil.
+         from = datetime_filter[0] if from.nil?()
+         to = datetime_filter[1] if to.nil?()
+       else
+         from = datetime_filter if from.nil?()
+         to = datetime_filter if to.nil?()
+       end
      end
 
-     from_filter = to_filter if from_filter.nil?()
-     to_filter = from_filter if to_filter.nil?()
+     from = to if from.nil?()
+     to = from if to.nil?()
 
-     from_filter = Util.jst_time(from_filter) unless from_filter.nil?()
-     to_filter = Util.jst_time(to_filter) unless to_filter.nil?()
+     from = Util.jst_time(from) unless from.nil?()
+     to = Util.jst_time(to) unless to.nil?()
 
-     datetime_filter = [from_filter,to_filter]
+     datetime_filter = [from,to]
 
      return self if datetime_filter.flatten().compact().empty?()
 
-     @filters[:datetime] = {from: from_filter,to: to_filter}
+     @filters[:datetime] = {from: from,to: to}
 
      return self
    end
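Going by the branches above, `filter_by_datetime()` now accepts a single time, a `[from,to]` pair, or the new `from:`/`to:` keywords; a small sketch of the three call styles (note that each call replaces the previous datetime filter):

```Ruby
require 'nhkore/news'
require 'nhkore/sifter'
require 'time'

sifter = NHKore::Sifter.new(NHKore::YasashiiNews.load_file())

# A single value is used as both from & to (it doesn't respond to []).
sifter.filter_by_datetime(Time.new(2019,12,5))

# An array is treated as [from,to].
sifter.filter_by_datetime([Time.new(2019,12,4),Time.new(2019,12,7)])

# Keywords, as in the README example.
sifter.filter_by_datetime(from: Time.new(2019,12,4),to: Time.new(2019,12,7))
```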
data/lib/nhkore/util.rb CHANGED
@@ -22,8 +22,7 @@
 
 
  require 'cgi'
- require 'psychgus'
- require 'public_suffix'
+ require 'set'
  require 'time'
  require 'uri'
 
@@ -68,6 +67,8 @@ module NHKore
    end
 
    def self.domain(host,clean: true)
+     require 'public_suffix'
+
      domain = PublicSuffix.domain(host)
      domain = unspace_web_str(domain).downcase() if !domain.nil?() && clean
 
@@ -75,6 +76,8 @@ module NHKore
    end
 
    def self.dump_yaml(obj,flow_level: 8)
+     require 'psychgus'
+
      return Psychgus.dump(obj,
        deref_aliases: true, # Dereference aliases for load_yaml()
        line_width: 10000,   # Try not to wrap; ichiman!
@@ -142,6 +145,8 @@ module NHKore
    end
 
    def self.load_yaml(data,file: nil,**kargs)
+     require 'psychgus'
+
      return Psych.safe_load(data,
        aliases: false,
        filename: file,
data/lib/nhkore/variator.rb CHANGED
@@ -60,6 +60,7 @@ module NHKore
    attr_accessor :deinflector
 
    def initialize(*)
+     require 'set' # Must require manually because JapaneseDeinflector is old.
      require 'japanese_deinflector'
 
      super
data/lib/nhkore/version.rb CHANGED
@@ -22,5 +22,5 @@
 
 
  module NHKore
-   VERSION = '0.3.1'
+   VERSION = '0.3.2'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: nhkore
  version: !ruby/object:Gem::Version
-   version: 0.3.1
+   version: 0.3.2
  platform: ruby
  authors:
  - Jonathan Bradley Whited (@esotericpig)
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-04-20 00:00:00.000000000 Z
+ date: 2020-04-21 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bimyou_segmenter
@@ -349,6 +349,7 @@ files:
  - lib/nhkore/entry.rb
  - lib/nhkore/error.rb
  - lib/nhkore/fileable.rb
+ - lib/nhkore/lib.rb
  - lib/nhkore/missingno.rb
  - lib/nhkore/news.rb
  - lib/nhkore/polisher.rb
@@ -374,7 +375,7 @@ metadata:
    changelog_uri: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md
    homepage_uri: https://github.com/esotericpig/nhkore
    source_code_uri: https://github.com/esotericpig/nhkore
- post_install_message: " \n NHKore v0.3.1\n \n You can now use [nhkore] on the
+ post_install_message: " \n NHKore v0.3.2\n \n You can now use [nhkore] on the
  command line.\n \n Homepage: https://github.com/esotericpig/nhkore\n \n Code:
  \ https://github.com/esotericpig/nhkore\n Changelog: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md\n
  \ Bugs: https://github.com/esotericpig/nhkore/issues\n \n"