nhkore 0.3.1 → 0.3.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: fb2c0e6e53995b874a9e53c44b024f993032433d1a87c37e7b7bdea69965902d
- data.tar.gz: 13d34c53fe9af9efa985c05089b1588eb1e76d6321f9aff18cc5da80598a52d4
+ metadata.gz: cf151c3859812632f09b1a464164f31bb0ce050f37ed7e7377f76265571ebd41
+ data.tar.gz: 1f3ee801e7557731cae4aeacd3f18fea4d7f33ac65b6ec77511a7d3d8f17856a
  SHA512:
- metadata.gz: 643723d42e939a7852eca3b90c3ec4e65085838317eb59c1d8f21f79dd647d2e77e5ea68ab2ff3b5a208608f9bf350121a9918cb318dec6c3047731b73f59294
- data.tar.gz: 3481fea3a3895a5b85ac3fcd5a77fe9b811f84e9a19b395a1de1d2e9b31fda93c5fb49a8d7d43581e05cb90c6f844f8537c5a97d73937c2b8ee97728ac7c7a1f
+ metadata.gz: 7e7d0d5b805ad6fa4312e8be26f3115dff18665b3762073c56db3a7a6a343a3ee6a05e47889e0abf7b62df3bb84cf5c977fce3efdfeb8a65c7bcff8167839d35
+ data.tar.gz: 957bc3da8492310d287a8947b9080f8be417f0874c3226db4f0bb63d020bee06c51a3da81c1fa3f779de22d354a32ab4cf41fc6f3018840774c31fd7060fbec3
data/CHANGELOG.md CHANGED
@@ -2,7 +2,33 @@
 
  Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
- ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.1...master)
+ ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.2...master)
+
+ ## [v0.3.2] - 2020-04-22
+
+ ### Added
+ - lib/nhkore/lib.rb
+   - Requires all files, excluding CLI-related files, for speed when using this Gem as a library.
+ - Scraper
+   - Added open_file() & reopen().
+ - samples/looper.rb
+   - Script example of continuously scraping all articles.
+
+ ### Changed
+ - README
+   - Finished writing the initial version of all sections.
+ - ArticleScraper
+   - Changed the `year` param to expect an int instead of a string.
+ - Sifter
+   - In filter_by_datetime(), renamed keyword args `from_filter,to_filter` to simply `from,to`.
+
+ ### Fixed
+ - Reduced load time of app from ~1s to ~0.3-0.5s.
+   - Moved many `require '...'` statements into methods.
+     - It looks ugly & is not a good coding practice, but it's a necessary evil.
+     - Load time is still pretty slow (but a lot better!).
+ - ArticleScraper
+   - Renamed the `mode` param to `strict`; `mode` was shadowing `File.open()`'s `mode` param in Scraper.
 
  ## [v0.3.1] - 2020-04-20
 
@@ -11,7 +37,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
  - NewsCmd/SiftCmd
    - Added `--no-sha256` option to not check if article links have already been scraped based on their contents' SHA-256.
  - Util
- - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows & Linux, it will check for both `/` & `\`.
+ - Changed `dir_str?()` and `filename_str?()` to check any slash. Previously, it only checked the slash for your system. But now on both Windows & Linux, it will check for both `/` & `\`.
 
  ### Fixed
  - Reduced load time of app from ~1s to ~0.3-0.5s by moving some requires into methods.
data/README.md CHANGED
@@ -293,7 +293,7 @@ links:
 
  If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
 
- Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
+ Currently, it only searches & scrapes `bing.com`, but other search engines and/or methods can easily be added in the future.
 
  Example usage:
 
@@ -319,6 +319,49 @@ Complete demo:
 
  #### News Command [^](#contents)
 
+ In [The Basics](#the-basics-), you learned how to scrape 1 article using the `-u/--url` option with the `news` command.
+
+ After creating a file of links from the [search](#search-command-) command (or manually/programmatically), you can also scrape multiple articles from this file using the `news` command.
+
+ The defaults will scrape the 1st unscraped article from the `links` file:
+
+ `$ nhkore news easy`
+
+ You can scrape the 1st **X** unscraped articles with the `-s/--scrape` option:
+
+ ```
+ # Scrape the 1st 11 unscraped articles.
+ $ nhkore news -s 11 easy
+ ```
+
+ You may wish to re-scrape articles that have already been scraped with the `-r/--redo` option:
+
+ `$ nhkore news -r -s 11 easy`
+
+ If you only wish to scrape specific article links, then you should use the `-k/--like` option, which does a fuzzy search on the URLs. For example, `--like '00123'` will match these links:
+
+ - http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**23711000/k10012323711000.html
+ - http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21401000/k10012321401000.html
+ - http<span>s://w</span>ww3.nhk.or.jp/news/easy/k1**00123**21511000/k10012321511000.html
+ - ...
+
+ `$ nhkore news -k '00123' -s 11 easy`
+
+ Lastly, you can show the dictionary URL and contents for the 1st article if you're getting dictionary-related errors:
+
+ ```
+ # This will exit after showing the 1st article's dictionary.
+ $ nhkore news easy --show-dict
+ ```
+
+ For the rest of the options, please see [The Basics](#the-basics-).
+
+ Complete demo:
+
+ [![asciinema Demo - News](https://asciinema.org/a/322324.png)](https://asciinema.org/a/322324)
+
+ When I first scraped all of the articles in [nhkore-core.zip](https://github.com/esotericpig/nhkore/releases/latest), I had to use this [script](samples/looper.rb) because my internet isn't very good.
+
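The script itself isn't reproduced in this diff. Purely as a hypothetical sketch of such a loop (not the actual samples/looper.rb), shelling out to the CLI with a retry delay is enough:

```Ruby
#!/usr/bin/env ruby

# Hypothetical sketch, not the actual samples/looper.rb:
# scrape 1 article per iteration & back off on failure, which helps
# on a bad/slow internet connection. Stops after 3 straight failures.
failures = 0

while failures < 3
  if system('nhkore','news','-s','1','easy')
    failures = 0   # Success; reset the failure counter.
  else
    failures += 1  # Network hiccup (or nothing left to scrape).
  end

  sleep(30) # Be kind to NHK's servers (& your connection).
end
```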
  ## Using the Library [^](#contents)
 
  ### Setup
@@ -336,11 +379,431 @@ In your *Gemfile*:
  ```Ruby
  # Pick one...
  gem 'nhkore', '~> X.X'
- gem 'nhkore', :git => 'https://github.com/esotericpig/psychgus.git', :tag => 'vX.X.X'
+ gem 'nhkore', :git => 'https://github.com/esotericpig/nhkore.git', :tag => 'vX.X.X'
+ ```
+
+ ### Require
+
+ In order to not require all of the CLI-related files, require this file instead:
+
+ ```Ruby
+ require 'nhkore/lib'
+
+ #require 'nhkore' # Slower
  ```
 
  ### Scraper
 
+ All scraper classes extend this class. You can either extend it or use it by itself. It's a simple wrapper around *open-uri*, *Nokogiri*, etc.
+
+ `initialize` automatically opens (connects to) the URL.
+
+ ```Ruby
+ require 'nhkore/scraper'
+
+ class MyScraper < NHKore::Scraper
+   def initialize()
+     super('https://www3.nhk.or.jp/news/easy/')
+   end
+ end
+
+ m = MyScraper.new()
+ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+ # Read all content into a String.
+ mstr = m.read()
+ sstr = s.read()
+
+ # Get a Nokogiri::HTML object.
+ mdoc = m.html_doc()
+ sdoc = s.html_doc()
+
+ # Get an RSS object.
+ s = NHKore::Scraper.new('https://www.bing.com/search?format=rss&q=site%3Anhk.or.jp%2Fnews%2Feasy%2F&count=100')
+
+ rss = s.rss_doc()
+ ```
+
+ There are several useful options:
+
+ ```Ruby
+ require 'nhkore/scraper'
+
+ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+   open_timeout: 300, # Open timeout in seconds (default: nil)
+   read_timeout: 300, # Read timeout in seconds (default: nil)
+
+   # Maximum number of times to retry the URL
+   # - default: 3
+   # - Open/connect will fail a couple of times on a bad/slow internet connection.
+   max_retries: 10,
+
+   # Maximum number of redirects allowed.
+   # - default: 3
+   # - You can set this to nil or -1, but I recommend using a number
+   #   for safety (infinite-loop attack).
+   max_redirects: 1,
+
+   # How to check redirect URLs for safety.
+   # - default: :strict
+   # - nil      => do not check
+   # - :lenient => check the scheme only
+   #   (i.e., if https, redirect URL must be https)
+   # - :strict  => check the scheme and domain
+   #   (i.e., if https://bing.com, redirect URL must be https://bing.com)
+   redirect_rule: :lenient,
+
+   # Set the HTTP header field 'cookie' from the 'set-cookie' response.
+   # - default: false
+   # - Currently uses the 'http-cookie' Gem.
+   # - This is currently a time-consuming operation because it opens the URL twice.
+   # - Necessary for Search Engines or other sites that require cookies
+   #   in order to block bots.
+   eat_cookie: true,
+
+   # Set HTTP header fields.
+   # - default: nil
+   # - Necessary for Search Engines or other sites that try to block bots.
+   # - Simply pass in a Hash (not nil) to set the default ones.
+   header: {'user-agent' => 'Skynet'}, # Must use strings
+ )
+
+ # Open the URL yourself. This will be passed in directly to Nokogiri::HTML().
+ # - In this way, you can use Faraday, HTTParty, RestClient, httprb/http, or
+ #   some other Gem.
+ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
+   str_or_io: URI.open('https://www3.nhk.or.jp/news/easy/',redirect: false)
+ )
+
+ # Open and parse a file instead of a URL (for offline testing or slow internet).
+ s = NHKore::Scraper.new('./my_article.html',is_file: true)
+
+ doc = s.html_doc()
+ ```
+
+ Here are some other useful methods:
+
+ ```Ruby
+ require 'nhkore/scraper'
+
+ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/')
+
+ s.reopen() # Re-open the current URL.
+
+ # Get a relative URL.
+ url = s.join_url('../../monkey.html')
+ puts url # https://www3.nhk.or.jp/monkey.html
+
+ # Open a new URL or file.
+ s.open(url)
+ s.open(url,URI.open(url,redirect: false))
+
+ s.open('./my_article.html',is_file: true)
+
+ # Open a file manually.
+ s.open_file('./my_article.html')
+
+ # Fetch the cookie & open a new URL manually.
+ s.fetch_cookie(url)
+ s.open_url(url)
+ ```
+
+ ### SearchScraper & BingScraper
+
+ `SearchScraper` is used for scraping Search Engines for NHK News Web (Easy) links. It can also be used for search in general.
+
+ By default, it sets the default HTTP header fields and fetches & sets the cookie.
+
+ ```Ruby
+ require 'nhkore/search_scraper'
+
+ ss = NHKore::SearchScraper.new('https://www.bing.com/search?q=nhk&count=100')
+
+ doc = ss.html_doc()
+
+ doc.css('a').each() do |anchor|
+   link = anchor['href']
+
+   next if ss.ignore_link?(link)
+
+   if link.include?('https://www3.nhk')
+     puts link
+   end
+ end
+ ```
+
+ `BingScraper` will search `bing.com` for you.
+
+ ```Ruby
+ require 'nhkore/search_link'
+ require 'nhkore/search_scraper'
+
+ bs = NHKore::BingScraper.new(:yasashii)
+ slinks = NHKore::SearchLinks.new()
+
+ next_page = bs.scrape(slinks)
+ page_num = 1
+
+ while !next_page.empty?()
+   puts "Page #{page_num += 1}: #{next_page.count}"
+
+   bs = NHKore::BingScraper.new(:yasashii,url: next_page.url)
+
+   next_page = bs.scrape(slinks,next_page)
+ end
+
+ slinks.links.values.each() do |link|
+   puts link.url
+ end
+ ```
+
+ ### ArticleScraper & DictScraper
+
+ `ArticleScraper` scrapes an NHK News Web Easy article. Regular articles aren't currently supported.
+
+ ```Ruby
+ require 'nhkore/article_scraper'
+
+ as = NHKore::ArticleScraper.new(
+   'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html',
+
+   # If false, scrape the article leniently (for older articles which
+   # may not have certain tags, etc.).
+   # - default: true
+   strict: false,
+
+   # {Dict} to use as the dictionary for words (Easy articles).
+   # - default: :scrape
+   # - nil     => don't scrape/use it (necessary for Regular articles)
+   # - :scrape => auto-scrape it using {DictScraper}
+   # - {Dict}  => your own {Dict}
+   dict: nil,
+
+   # Date time to use as a fallback if the article doesn't have one
+   # (for older articles).
+   # - default: nil
+   datetime: Time.new(2020,2,2),
+
+   # Year to use as a fallback if the article doesn't have one
+   # (for older articles).
+   # - default: nil
+   year: 2020,
+ )
+
+ article = as.scrape()
+
+ article.datetime
+ article.futsuurl
+ article.sha256
+ article.title
+ article.url
+
+ article.words.each() do |key,word|
+   word.defn
+   word.eng
+   word.freq
+   word.kana
+   word.kanji
+   word.key
+ end
+
+ puts article.to_s(mini: true)
+ puts '---'
+ puts article
+ ```
+
+ `DictScraper` scrapes an Easy article's dictionary file (JSON).
+
+ ```Ruby
+ require 'nhkore/dict_scraper'
+
+ url = 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'
+ ds = NHKore::DictScraper.new(
+   url,
+
+   # Change the URL appropriately to the dictionary URL.
+   # - default: true
+   parse_url: true,
+ )
+
+ puts NHKore::DictScraper.parse_url(url)
+ puts
+
+ dict = ds.scrape()
+
+ dict.entries.each() do |key,entry|
+   entry.id
+
+   entry.defns.each() do |defn|
+     defn.hyoukis.each() {|hyouki| }
+     defn.text
+     defn.words.each() {|word| }
+   end
+
+   puts entry.build_hyouki()
+   puts entry.build_defn()
+   puts '---'
+ end
+
+ puts
+ puts dict
+ ```
+
+ ### Fileable
+
+ Any class that includes the `Fileable` mixin will have the following methods:
+
+ - Class.load_file(file,mode: 'rt:BOM|UTF-8',**kargs)
+ - save_file(file,mode: 'wt',**kargs)
+
+ Any *kargs* will be passed to `File.open()`.
+
+ ```Ruby
+ require 'nhkore/news'
+ require 'nhkore/search_link'
+
+ yn = NHKore::YasashiiNews.load_file()
+ sl = NHKore::SearchLinks.load_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+
+ yn.articles.each() {|key,article| }
+ yn.sha256s.each() {|sha256,url| }
+
+ sl.links.each() do |key,link|
+   link.datetime
+   link.futsuurl
+   link.scraped?
+   link.sha256
+   link.title
+   link.url
+ end
+
+ #yn.save_file()
+ #sl.save_file(NHKore::SearchLinks::DEFAULT_YASASHII_FILE)
+ ```
+
+ ### Sifter
+
+ `Sifter` will sift & sort the `News` data into a single file. The data is sorted by frequency in descending order (i.e., most frequent words first).
+
+ ```Ruby
+ require 'nhkore/news'
+ require 'nhkore/sifter'
+ require 'time'
+
+ news = NHKore::YasashiiNews.load_file()
+
+ sifter = NHKore::Sifter.new(news)
+
+ sifter.caption = 'Sakura Fields Forever!'
+
+ # Filter the data.
+ #sifter.filter_by_datetime(Time.new(2019,12,5))
+ sifter.filter_by_datetime(
+   from: Time.new(2019,12,4),to: Time.new(2019,12,7)
+ )
+ sifter.filter_by_title('桜')
+ sifter.filter_by_url('k100')
+
+ # Ignore (or blank out) certain columns from the output.
+ sifter.ignore(:defn)
+ sifter.ignore(:eng)
+
+ # An array of the filtered & sorted words.
+ words = sifter.sift()
+
+ # Choose the file format.
+ #sifter.put_csv!()
+ #sifter.put_html!()
+ sifter.put_yaml!()
+
+ # Save to a file.
+ file = 'sakura.yml'
+
+ if !File.exist?(file)
+   sifter.save_file(file)
+ end
+ ```
+
+ ### Util & UserAgents
+
+ These provide a variety of useful methods/constants.
+
+ Here are some of the most useful ones:
+
+ ```Ruby
+ require 'nhkore/user_agents'
+ require 'nhkore/util'
+
+ include NHKore
+
+ puts '======='
+ puts '[ Net ]'
+ puts '======='
+ # Get a random User Agent for HTTP header field 'User-Agent'.
+ # - This is used by default in Scraper/SearchScraper.
+ puts "User-Agent: #{UserAgents.sample()}"
+
+ uri = URI('https://www.bing.com/search?q=nhk')
+ Util.replace_uri_query!(uri,q: 'banana')
+
+ puts "URI query: #{uri}" # https://www.bing.com/search?q=banana
+ # nhk.or.jp
+ puts "Domain: #{Util.domain(URI('https://www.nhk.or.jp/news/easy').host)}"
+ # Ben &amp; Jerry&#39;s<br>
+ puts "Escape HTML: #{Util.escape_html("Ben & Jerry's\n")}"
+ puts
+
+ puts '========'
+ puts '[ Time ]'
+ puts '========'
+ puts "JST now: #{Util.jst_now}"
+ # Drops in JST_OFFSET, does not change hour/min.
+ puts "JST time: #{Util.jst_time(Time.now)}"
+ puts "JST year: #{Util::JST_YEAR}"
+ puts "1999 sane? #{Util.sane_year?(1999)}" # true
+ puts "1776 sane? #{Util.sane_year?(1776)}" # false
+ puts "Guess 5:  #{Util.guess_year(5)}"  # 2005
+ puts "Guess 99: #{Util.guess_year(99)}" # 1999
+ puts
+ puts "JST timezone offset: #{Util::JST_OFFSET}"
+ puts "JST timezone offset hour: #{Util::JST_OFFSET_HOUR}"
+ puts "JST timezone offset minute: #{Util::JST_OFFSET_MIN}"
+ puts
+
+ puts '============'
+ puts '[ Japanese ]'
+ puts '============'
+
+ JPN = ['桜','ぶ','ブ']
+
+ def fmt_jpn()
+   fmt = []
+
+   JPN.each() do |x|
+     x = yield(x)
+     x = x ? "\u2B55" : Util::JPN_SPACE unless x.is_a?(String)
+     fmt << x
+   end
+
+   return "[ #{fmt.join(' | ')} ]"
+ end
+
+ puts "          #{fmt_jpn{|x| x}}"
+ puts "Hiragana? #{fmt_jpn{|x| !!Util.hiragana?(x)}}"
+ puts "Kana?     #{fmt_jpn{|x| !!Util.kana?(x)}}"
+ puts "Kanji?    #{fmt_jpn{|x| !!Util.kanji?(x)}}"
+ puts "Reduce: #{Util.reduce_jpn_space("' '")}"
+ puts
+
+ puts '========='
+ puts '[ Files ]'
+ puts '========='
+ puts "Dir str?  #{Util.dir_str?('dir/')}" # true
+ puts "Dir str?  #{Util.dir_str?('dir')}"  # false
+ puts "File str? #{Util.filename_str?('file')}"     # true
+ puts "File str? #{Util.filename_str?('dir/file')}" # false
+ ```
+
  ## Hacking [^](#contents)
 
  ```
@@ -370,7 +833,9 @@ $ bundle exec rake nokogiri_other # macOS, Windows, etc.
 
  `$ bundle exec rake doc`
 
- ### Installing Locally (without Network Access)
+ ### Installing Locally
+
+ You can make some changes/fixes to the code and then install your local version:
 
  `$ bundle exec rake install:local`
 
data/lib/nhkore/app.rb CHANGED
@@ -24,6 +24,7 @@
  require 'cri'
  require 'highline'
  require 'rainbow'
+ require 'set'
  require 'tty-spinner'
 
  require 'nhkore/error'
data/lib/nhkore/article_scraper.rb CHANGED
@@ -47,19 +47,21 @@ module NHKore
  attr_accessor :dict
  attr_reader :kargs
  attr_accessor :missingno
- attr_accessor :mode
  attr_reader :polishers
  attr_accessor :splitter
+ attr_accessor :strict
  attr_reader :variators
  attr_accessor :year
 
+ alias_method :strict?,:strict
+
  # @param dict [Dict,:scrape,nil] the {Dict} (dictionary) to use for {Word#defn} (definitions)
  #   [+:scrape+] auto-scrape it using {DictScraper}
  #   [+nil+] don't scrape/use it
  # @param missingno [Missingno] data to use as a fallback for Ruby words without kana/kanji,
  #   instead of raising an error
- # @param mode [nil,:lenient]
- def initialize(url,cleaners: [BestCleaner.new()],datetime: nil,dict: :scrape,missingno: nil,mode: nil,polishers: [BestPolisher.new()],splitter: BestSplitter.new(),variators: [BestVariator.new()],year: nil,**kargs)
+ # @param strict [true,false]
+ def initialize(url,cleaners: [BestCleaner.new()],datetime: nil,dict: :scrape,missingno: nil,polishers: [BestPolisher.new()],splitter: BestSplitter.new(),strict: true,variators: [BestVariator.new()],year: nil,**kargs)
  super(url,**kargs)
 
  @cleaners = Array(cleaners)
@@ -67,9 +69,9 @@ module NHKore
  @dict = dict
  @kargs = kargs
  @missingno = missingno
- @mode = mode
  @polishers = Array(polishers)
  @splitter = splitter
+ @strict = strict
  @variators = Array(variators)
  @year = year
  end
@@ -188,7 +190,7 @@ module NHKore
  tag = doc.css('div.article-body') if tag.length < 1
 
  # - https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html
- tag = doc.css('div#main') if tag.length < 1 && @mode == :lenient
+ tag = doc.css('div#main') if tag.length < 1 && !@strict
  if tag.length > 0
  text = Util.unspace_web_str(tag.text.to_s())
  text = Util.unspace_web_str(tag.text.to_s())
@@ -481,7 +483,7 @@ module NHKore
  def scrape_title(doc,article)
  tag = doc.css('h1.article-main__title')
 
- if tag.length < 1 && @mode == :lenient
+ if tag.length < 1 && !@strict
  # This shouldn't be used except for select sites.
  # - https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html
 
@@ -583,7 +585,7 @@ module NHKore
  end
 
  # As a last resort, use our user-defined fallbacks (if specified).
- return @year unless Util.empty_web_str?(@year)
+ return @year.to_i() unless @year.nil?()
  return @datetime.year if !@datetime.nil?() && Util.sane_year?(@datetime.year)
 
  raise ScrapeError,"could not scrape year at URL[#{@url}]"
@@ -604,11 +606,10 @@ module NHKore
  end
 
  def warn_or_error(klass,msg)
- case @mode
- when :lenient
- Util.warn(msg)
- else
+ if @strict
  raise klass,msg
+ else
+ Util.warn(msg)
  end
  end
  end
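Taken together, these hunks replace the old `mode: :lenient` switch with a boolean `strict` flag. A minimal before/after sketch of calling code, reusing the README's example article URL:

```Ruby
require 'nhkore/article_scraper'

url = 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'

# v0.3.1 (old): lenient scraping was opt-in via mode.
#as = NHKore::ArticleScraper.new(url,mode: :lenient)

# v0.3.2 (new): strict defaults to true; pass false for lenient scraping.
as = NHKore::ArticleScraper.new(url,strict: false)

puts as.strict? # => false (alias added in this release)
```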
@@ -237,7 +237,7 @@ module CLI
  dict: dict,
  is_file: is_file,
  missingno: missingno ? Missingno.new(news) : nil,
- mode: lenient ? :lenient : nil,
+ strict: !lenient,
  })
  @news_dict_scraper_kargs = @scraper_kargs.merge({
  is_file: is_file,
data/lib/nhkore/lib.rb ADDED
@@ -0,0 +1,58 @@
+ #!/usr/bin/env ruby
+ # encoding: UTF-8
+ # frozen_string_literal: true
+
+ #--
+ # This file is part of NHKore.
+ # Copyright (c) 2020 Jonathan Bradley Whited (@esotericpig)
+ #
+ # NHKore is free software: you can redistribute it and/or modify
+ # it under the terms of the GNU Lesser General Public License as published by
+ # the Free Software Foundation, either version 3 of the License, or
+ # (at your option) any later version.
+ #
+ # NHKore is distributed in the hope that it will be useful,
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ # GNU Lesser General Public License for more details.
+ #
+ # You should have received a copy of the GNU Lesser General Public License
+ # along with NHKore. If not, see <https://www.gnu.org/licenses/>.
+ #++
+
+
+ require 'nhkore/article'
+ require 'nhkore/article_scraper'
+ require 'nhkore/cleaner'
+ require 'nhkore/defn'
+ require 'nhkore/dict'
+ require 'nhkore/dict_scraper'
+ require 'nhkore/entry'
+ require 'nhkore/error'
+ require 'nhkore/fileable'
+ require 'nhkore/missingno'
+ require 'nhkore/news'
+ require 'nhkore/polisher'
+ require 'nhkore/scraper'
+ require 'nhkore/search_link'
+ require 'nhkore/search_scraper'
+ require 'nhkore/sifter'
+ require 'nhkore/splitter'
+ require 'nhkore/user_agents'
+ require 'nhkore/util'
+ require 'nhkore/variator'
+ require 'nhkore/version'
+ require 'nhkore/word'
+
+
+ module NHKore
+   ###
+   # Include this file to only require the files needed to use this
+   # Gem as a library (i.e., don't include CLI-related files).
+   #
+   # @author Jonathan Bradley Whited (@esotericpig)
+   # @since 0.3.2
+   ###
+   module Lib
+   end
+ end
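The changelog's ~0.3-0.5s load-time claim is easy to sanity-check with Ruby's standard Benchmark module; a minimal sketch (numbers will vary by machine & disk cache):

```Ruby
require 'benchmark'

# Time each require in a fresh process; once a file has been loaded,
# a 2nd require of it is a no-op & would measure ~0s.
puts Benchmark.realtime { require 'nhkore/lib' } # Library files only.
#puts Benchmark.realtime { require 'nhkore' }    # Full Gem (CLI files too).
```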
data/lib/nhkore/scraper.rb CHANGED
@@ -82,7 +82,7 @@ module NHKore
  @max_retries = max_retries
  @redirect_rule = redirect_rule
 
- open(url,str_or_io)
+ open(url,str_or_io,is_file: is_file)
  end
 
  def fetch_cookie(url)
@@ -119,14 +119,14 @@ module NHKore
  return URI::join(@url,relative_url)
  end
 
- def open(url,str_or_io=nil)
+ def open(url,str_or_io=nil,is_file: @is_file)
+ @is_file = is_file
  @str_or_io = str_or_io
  @url = url
 
  if str_or_io.nil?()
  if @is_file
- # NHK's website tends to always use UTF-8.
- @str_or_io = File.open(url,'rt:UTF-8',**@kargs)
+ open_file(url)
  else
  fetch_cookie(url) if @eat_cookie
  open_url(url)
@@ -136,6 +136,16 @@ module NHKore
  return self
  end
 
+ def open_file(file)
+ @is_file = true
+ @url = file
+
+ # NHK's website tends to always use UTF-8.
+ @str_or_io = File.open(file,'rt:UTF-8',**@kargs)
+
+ return self
+ end
+
  def open_url(url)
  max_redirects = (@max_redirects.nil?() || @max_redirects < 0) ? 10_000 : @max_redirects
  max_retries = (@max_retries.nil?() || @max_retries < 0) ? 10_000 : @max_retries
@@ -194,6 +204,10 @@ module NHKore
  return @str_or_io
  end
 
+ def reopen()
+ return open(@url)
+ end
+
  def rss_doc()
  require 'rss'
 
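A short usage sketch of the new methods together with the `is_file` fix above (the file path is hypothetical):

```Ruby
require 'nhkore/scraper'

# is_file is now remembered by open(), so re-opening stays in file mode.
s = NHKore::Scraper.new('./my_article.html',is_file: true)

doc = s.html_doc() # Parse the file once...
s.reopen()         # ...then re-open the same file/URL to read it again.

s.open_file('./my_article.html') # Or open a file explicitly.
```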
data/lib/nhkore/sifter.rb CHANGED
@@ -93,24 +93,29 @@ module NHKore
  return false
  end
 
- def filter_by_datetime(datetime_filter=nil,from_filter: nil,to_filter: nil)
+ def filter_by_datetime(datetime_filter=nil,from: nil,to: nil)
  if !datetime_filter.nil?()
- # If out-of-bounds, just nil.
- from_filter = datetime_filter[0]
- to_filter = datetime_filter[1]
+ if datetime_filter.respond_to?(:'[]')
+ # If out-of-bounds, just nil.
+ from = datetime_filter[0] if from.nil?()
+ to = datetime_filter[1] if to.nil?()
+ else
+ from = datetime_filter if from.nil?()
+ to = datetime_filter if to.nil?()
+ end
  end
 
- from_filter = to_filter if from_filter.nil?()
- to_filter = from_filter if to_filter.nil?()
+ from = to if from.nil?()
+ to = from if to.nil?()
 
- from_filter = Util.jst_time(from_filter) unless from_filter.nil?()
- to_filter = Util.jst_time(to_filter) unless to_filter.nil?()
+ from = Util.jst_time(from) unless from.nil?()
+ to = Util.jst_time(to) unless to.nil?()
 
- datetime_filter = [from_filter,to_filter]
+ datetime_filter = [from,to]
 
  return self if datetime_filter.flatten().compact().empty?()
 
- @filters[:datetime] = {from: from_filter,to: to_filter}
+ @filters[:datetime] = {from: from,to: to}
 
  return self
  end
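Besides the rename, the rewrite above also lets the positional arg be either a single value or an indexable pair. A sketch of the call styles this enables (dates are illustrative; `news` is assumed loaded as in the README examples):

```Ruby
require 'nhkore/sifter'
require 'time'

sifter = NHKore::Sifter.new(news) # Assumes news was loaded beforehand.

# Equivalent ways to filter the range 2019-12-04 to 2019-12-07:
sifter.filter_by_datetime(from: Time.new(2019,12,4),to: Time.new(2019,12,7))
sifter.filter_by_datetime([Time.new(2019,12,4),Time.new(2019,12,7)])

# A single (non-indexable) value now fills in both from & to:
sifter.filter_by_datetime(Time.new(2019,12,5))
```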
data/lib/nhkore/util.rb CHANGED
@@ -22,8 +22,7 @@
 
 
  require 'cgi'
- require 'psychgus'
- require 'public_suffix'
+ require 'set'
  require 'time'
  require 'uri'
 
@@ -68,6 +67,8 @@ module NHKore
  end
 
  def self.domain(host,clean: true)
+ require 'public_suffix'
+
  domain = PublicSuffix.domain(host)
  domain = unspace_web_str(domain).downcase() if !domain.nil?() && clean
 
@@ -75,6 +76,8 @@ module NHKore
  end
 
  def self.dump_yaml(obj,flow_level: 8)
+ require 'psychgus'
+
  return Psychgus.dump(obj,
  deref_aliases: true, # Dereference aliases for load_yaml()
  line_width: 10000, # Try not to wrap; ichiman!
@@ -142,6 +145,8 @@ module NHKore
  end
 
  def self.load_yaml(data,file: nil,**kargs)
+ require 'psychgus'
+
  return Psych.safe_load(data,
  aliases: false,
  filename: file,
data/lib/nhkore/variator.rb CHANGED
@@ -60,6 +60,7 @@ module NHKore
  attr_accessor :deinflector
 
  def initialize(*)
+ require 'set' # Must require manually because JapaneseDeinflector is old
  require 'japanese_deinflector'
 
  super
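The same lazy-require pattern appears throughout this release; in generic form it looks like this (a hypothetical module, not NHKore code):

```Ruby
module SlowDep # Hypothetical example of the pattern, not part of NHKore.
  def self.dump(obj)
    # Loaded on the 1st call only; later require calls are no-ops,
    # so app startup never pays for this dependency.
    require 'yaml'

    return YAML.dump(obj)
  end
end
```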
data/lib/nhkore/version.rb CHANGED
@@ -22,5 +22,5 @@
 
 
  module NHKore
- VERSION = '0.3.1'
+ VERSION = '0.3.2'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: nhkore
  version: !ruby/object:Gem::Version
- version: 0.3.1
+ version: 0.3.2
  platform: ruby
  authors:
  - Jonathan Bradley Whited (@esotericpig)
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-04-20 00:00:00.000000000 Z
+ date: 2020-04-21 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: bimyou_segmenter
@@ -349,6 +349,7 @@ files:
  - lib/nhkore/entry.rb
  - lib/nhkore/error.rb
  - lib/nhkore/fileable.rb
+ - lib/nhkore/lib.rb
  - lib/nhkore/missingno.rb
  - lib/nhkore/news.rb
  - lib/nhkore/polisher.rb
@@ -374,7 +375,7 @@ metadata:
  changelog_uri: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md
  homepage_uri: https://github.com/esotericpig/nhkore
  source_code_uri: https://github.com/esotericpig/nhkore
- post_install_message: " \n NHKore v0.3.1\n \n You can now use [nhkore] on the
+ post_install_message: " \n NHKore v0.3.2\n \n You can now use [nhkore] on the
  command line.\n \n Homepage: https://github.com/esotericpig/nhkore\n \n Code:
  \ https://github.com/esotericpig/nhkore\n Changelog: https://github.com/esotericpig/nhkore/blob/master/CHANGELOG.md\n
  \ Bugs: https://github.com/esotericpig/nhkore/issues\n \n"