nhkore 0.3.3 → 0.3.8

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 0ca67a215cda7c49a82aa824c1322b49285abe332f627c9ad4fae774043cbfc9
- data.tar.gz: b62a7e518787e89a3a54bcc66c191b4d3f005a911ab76861e3b118258f31b85f
+ metadata.gz: c63efbc2f65cfe83c7b55e53a0dfca329c2aded4c22ae05c2fb50583876452b4
+ data.tar.gz: 87c5116e11cb7e2dd4a5cdb86d6fc1a80ea58dd4efa7bc27ad448c25c4fad724
  SHA512:
- metadata.gz: b4e84a07685c71400bd50b270c4ae662e6885f7149fc7ec3dec9476bf9b6b80f402d7f874ddcbef920c2b5034a1d39b44fbcb7e9ece06f3a2d517ca89e37de3d
- data.tar.gz: 2527b477b7b7088f2612e4a05e0369b60cacb34bedb6ac59a3296643b6f59fcfce0c054ede67c68e0f4299864795bd79f04a85020d8f4c87b67f56c5a5dbeb77
+ metadata.gz: 68eb93da6d8f5c8ba3c4c58e0a9a71803dd4eefc6063df4ead9f0d06c0f1ba59892f5ddb43a9735c30ceaf85db63ed80c1b155bac1d5f0daf73f9cebbc7f6c6e
+ data.tar.gz: 33e9f4f770bceb2c0eb5d6d62781af400bc2b66f4ba4d4092b01b224bc365edef98cc47032f4b2389e04664f6cabcd2c02024139971bee207a121570805a6015
data/.yardopts ADDED
@@ -0,0 +1,3 @@
+ --files 'CHANGELOG.md,LICENSE.txt'
+ --protected
+ --readme 'README.md'
data/CHANGELOG.md CHANGED
@@ -1,8 +1,96 @@
1
1
  # Changelog | NHKore
2
2
 
3
- Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
3
+ All notable changes to this project will be documented in this file.
4
+
5
+ Format is based on [Keep a Changelog v1.0.0](https://keepachangelog.com/en/1.0.0),
6
+ and this project adheres to [Semantic Versioning v2.0.0](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.8...HEAD)
9
+ -
10
+
11
+
12
+ ## [v0.3.8] - 2021-06-26
13
+
14
+ ### Fixed
15
+ - Fixed `App#refresh_cmd()` to also copy Cri's `default_proc` to the new Hash for the command options.
16
+ - Fixed to check for non-strings for JSON & URI.
17
+ - For JSON, convert `StringIO` to string in `DictScraper.scrape()`.
18
+ - For URL, convert URL using `URI()` because `URI.parse()` will crash with a non-string (URI object) in `Scraper.open_url()`.
19
+ - Fixed to scrape multiple HTML Ruby tag words (instead of just 1).
20
+ - I thought multiple Ruby bases/texts (`<rb>`/`<rt>`) were invalid, but after running into the article below and checking the HTML with a validator, it's actually valid HTML:
21
+ - https://www3.nhk.or.jp/news/easy/k10012759201000/k10012759201000.html
22
+ - No previous articles/URLs ran into this problem (would have raised an error), so it should only be a problem with this specific, new article.
23
+
24
+ ### Changed
25
+ - Formatted/Linted all code using RuboCop.
26
+ - Updated Gems.
27
+
28
+
29
+ ## [v0.3.7] - 2020-11-07
30
+
31
+ ### Changed
32
+ - Updated Gem `attr_bool` to v0.2
33
+ - Changed upper-case *'-V'* flag for *version* to be a lower-case *'-v'*
34
+ - Seems like a lot of apps/people expect this
35
+ - Refactored/Formatted some code
36
+ - *nhkore.gemspec* especially
37
+ - Added *samples/*, *Gemfile.lock*, and *.yardopts* to the files in *nhkore.gemspec*
38
+
39
+ ### Fixed
40
+ - ArticleScraper
41
+ - Fixed to accept text nodes that have Kanji, due to bad article:
42
+ - https://www3.nhk.or.jp/news/easy/k10012639271000/k10012639271000.html
43
+ - `第3のビール` should have HTML ruby tags around *第*
44
+
45
+
46
+ ## [v0.3.6] - 2020-08-18
47
+
48
+ ### Added
49
+ - `update_showcase` Rake task for development & personal site (GitHub Page)
50
+ - `$ bundle exec rake update_showcase`
51
+
52
+ ### Changed
53
+ - Updated Gems
54
+
55
+ ### Fixed
56
+ - ArticleScraper for title for specific site
57
+ - https://www3.nhk.or.jp/news/easy/article/disaster_earthquake_illust.html
58
+ - Ignored `/cgi2.*enqform/` URLs from SearchScraper (Bing)
59
+ - Added more detail to dictionary error in ArticleScraper
60
+
61
+
62
+ ## [v0.3.5] - 2020-05-04
63
+
64
+ ### Added
65
+ - Added check for environment var `NO_COLOR`
66
+ - [https://no-color.org/](https://no-color.org/)
67
+
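For readers unfamiliar with the `NO_COLOR` convention, here is a minimal sketch of such a check. It is not NHKore's implementation: `color_enabled?` is a hypothetical name, and the rule applied (color is disabled when the variable is present and not an empty string) is the one described at no-color.org.

```ruby
# Hypothetical helper (not NHKore's actual code): decide whether to emit color.
# The environment is passed in so the check is easy to exercise in isolation.
def color_enabled?(env = ENV)
  value = env['NO_COLOR']

  # Color stays on only if NO_COLOR is unset or set to an empty string.
  value.nil? || value.empty?
end

puts color_enabled?({})
puts color_enabled?({'NO_COLOR' => '1'})
puts color_enabled?({'NO_COLOR' => ''})
```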
68
+ ### Fixed
69
+ - Fixed URLs stored in YAML data to always be of type String (not URI)
70
+ - This initially caused a problem in DictScraper.parse_url() from ArticleScraper, but fixed it for all data
71
+
72
+
73
+ ## [v0.3.4] - 2020-04-25
74
+
75
+ ### Added
76
+ - DatetimeParser
77
+ - Extracted from SiftCmd into its own class
78
+ - Fixed some minor logic bugs from the old code
79
+ - Added new feature where 1 range can be empty:
80
+ - `sift ez -d '...2019'` (from = 1924)
81
+ - `sift ez -d '2019...'` (to = current year)
82
+ - `sift ez -d '...'` (still an error)
83
+ - Added `update_core` rake task for dev
84
+ - Makes pushing a new release much easier
85
+ - See *Hacking.Releasing* section in *README*
86
+
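The open-ended range behavior listed above can be sketched for the year-only case. This is an illustration under stated assumptions, not `DatetimeParser` itself: `parse_year_range` and `MIN_YEAR` are hypothetical names, the lower bound of 1924 and the "current year" upper bound are taken from the bullet points, and months, days, and times are ignored.

```ruby
# Lower bound quoted in the changelog entry; the constant name is an assumption.
MIN_YEAR = 1924

# Hypothetical, year-only sketch of a 'from...to' range where one side may be empty.
def parse_year_range(value, now: Time.now)
  # A negative limit keeps trailing empty strings, so '2019...' => ['2019', ''].
  parts = value.split('...', -1).map(&:strip)

  if parts.size > 2 || parts.all?(&:empty?)
    raise ArgumentError, "invalid range: #{value.inspect}"
  end

  from, to = parts
  to = from if parts.size == 1 # A single year is a one-year range.

  from_year = from.empty? ? MIN_YEAR : Integer(from, 10)
  to_year = to.empty? ? now.year : Integer(to, 10)

  [from_year, to_year]
end

now = Time.new(2021, 6, 26)

p parse_year_range('...2019', now: now)
p parse_year_range('2019...', now: now)
```

Passing `now:` explicitly keeps the "current year" default deterministic when experimenting with the function.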
87
+ ### Fixed
88
+ - SiftCmd `parse_sift_datetime()` for `-d/--datetime` option
89
+ - Didn't work exactly right (as written in *README*) for some special inputs:
90
+ - `-d '2019...3'`
91
+ - `-d '3-3'`
92
+ - `-d '3'`
4
93
 
5
- ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.3...master)
6
94
 
7
95
  ## [v0.3.3] - 2020-04-23
8
96
 
@@ -10,6 +98,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
10
98
  - Added JSON support to Sifter & SiftCmd.
11
99
  - Added use of `attr_bool` Gem for `attr_accessor?` & `attr_reader?`.
12
100
 
101
+
13
102
  ## [v0.3.2] - 2020-04-22
14
103
 
15
104
  ### Added
@@ -33,6 +122,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
33
122
  - ArticleScraper
34
123
  - Renamed `mode` param to `strict`. `mode` was overshadowing File.open()'s in Scraper.
35
124
 
125
+
36
126
  ## [v0.3.1] - 2020-04-20
37
127
 
38
128
  ### Changed
@@ -50,6 +140,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
50
140
  - BingScraper
51
141
  - Fixed possible RSS infinite loop.
52
142
 
143
+
53
144
  ## [v0.3.0] - 2020-04-12
54
145
 
55
146
  ### Added
@@ -84,7 +175,9 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
84
175
  - ignore empty filenames in the Zip for safety.
85
176
  - ask to overwrite files instead of erroring.
86
177
 
178
+
87
179
  ## [v0.2.0] - 2020-04-01
180
+
88
181
  First working version.
89
182
 
90
183
  ### Added
@@ -120,7 +213,9 @@ First working version.
120
213
  - test/nhkore_tester.rb
121
214
  - Renamed to `test/nhkore/test_helper.rb`
122
215
 
216
+
123
217
  ## [v0.1.0] - 2020-02-24
218
+
124
219
  ### Added
125
220
  - .gitignore
126
221
  - CHANGELOG.md
data/Gemfile CHANGED
@@ -1,24 +1,6 @@
  # encoding: UTF-8
  # frozen_string_literal: true
 
- #--
- # This file is part of NHKore.
- # Copyright (c) 2020 Jonathan Bradley Whited (@esotericpig)
- #
- # NHKore is free software: you can redistribute it and/or modify
- # it under the terms of the GNU Lesser General Public License as published by
- # the Free Software Foundation, either version 3 of the License, or
- # (at your option) any later version.
- #
- # NHKore is distributed in the hope that it will be useful,
- # but WITHOUT ANY WARRANTY; without even the implied warranty of
- # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- # GNU Lesser General Public License for more details.
- #
- # You should have received a copy of the GNU Lesser General Public License
- # along with NHKore. If not, see <https://www.gnu.org/licenses/>.
- #++
-
 
  source 'https://rubygems.org'
 
data/Gemfile.lock ADDED
@@ -0,0 +1,89 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ nhkore (0.3.8)
5
+ attr_bool (~> 0.2)
6
+ bimyou_segmenter (~> 1.2)
7
+ cri (~> 2.15)
8
+ down (~> 5.2)
9
+ highline (~> 2.0)
10
+ http-cookie (~> 1.0)
11
+ japanese_deinflector (~> 0.0)
12
+ nokogiri (~> 1.11)
13
+ psychgus (~> 1.3)
14
+ public_suffix (~> 4.0)
15
+ rainbow (~> 3.0)
16
+ rubyzip (~> 2.3)
17
+ tiny_segmenter (~> 0.0)
18
+ tty-progressbar (~> 0.18)
19
+ tty-spinner (~> 0.9)
20
+
21
+ GEM
22
+ remote: https://rubygems.org/
23
+ specs:
24
+ addressable (2.7.0)
25
+ public_suffix (>= 2.0.2, < 5.0)
26
+ attr_bool (0.2.2)
27
+ bimyou_segmenter (1.2.0)
28
+ cri (2.15.11)
29
+ domain_name (0.5.20190701)
30
+ unf (>= 0.0.5, < 1.0.0)
31
+ down (5.2.2)
32
+ addressable (~> 2.5)
33
+ highline (2.0.3)
34
+ http-cookie (1.0.4)
35
+ domain_name (~> 0.5)
36
+ japanese_deinflector (0.0.2)
37
+ mini_portile2 (2.5.3)
38
+ minitest (5.14.4)
39
+ nokogiri (1.11.7)
40
+ mini_portile2 (~> 2.5.0)
41
+ racc (~> 1.4)
42
+ psych (4.0.1)
43
+ psychgus (1.3.4)
44
+ psych (>= 3.0)
45
+ public_suffix (4.0.6)
46
+ racc (1.5.2)
47
+ rainbow (3.0.0)
48
+ rake (13.0.3)
49
+ raketeer (0.2.13)
50
+ rake
51
+ rdoc (6.3.1)
52
+ redcarpet (3.5.1)
53
+ rubyzip (2.3.0)
54
+ strings-ansi (0.2.0)
55
+ tiny_segmenter (0.0.6)
56
+ tty-cursor (0.7.1)
57
+ tty-progressbar (0.18.2)
58
+ strings-ansi (~> 0.2)
59
+ tty-cursor (~> 0.7)
60
+ tty-screen (~> 0.8)
61
+ unicode-display_width (>= 1.6, < 3.0)
62
+ tty-screen (0.8.1)
63
+ tty-spinner (0.9.3)
64
+ tty-cursor (~> 0.7)
65
+ unf (0.1.4)
66
+ unf_ext
67
+ unf_ext (0.0.7.7)
68
+ unicode-display_width (2.0.0)
69
+ yard (0.9.26)
70
+ yard_ghurt (1.2.1)
71
+ rake
72
+ yard
73
+
74
+ PLATFORMS
75
+ ruby
76
+
77
+ DEPENDENCIES
78
+ bundler (~> 2.2)
79
+ minitest (~> 5.14)
80
+ nhkore!
81
+ rake (~> 13.0)
82
+ raketeer (~> 0.2)
83
+ rdoc (~> 6.3)
84
+ redcarpet (~> 3.5)
85
+ yard (~> 0.9)
86
+ yard_ghurt (~> 1.2)
87
+
88
+ BUNDLED WITH
89
+ 2.2.20
data/README.md CHANGED
@@ -26,6 +26,8 @@ This is similar to a [core word/vocabulary list](https://www.fluentin3months.com
26
26
  - [News Command](#news-command-)
27
27
  - [Using the Library](#using-the-library-)
28
28
  - [Hacking](#hacking-)
29
+ - [Updating](#updating-)
30
+ - [Releasing](#releasing-)
29
31
  - [License](#license-)
30
32
 
31
33
  ## For Non-Power Users [^](#contents)
@@ -433,18 +435,18 @@ require 'nhkore/scraper'
433
435
  s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
434
436
  open_timeout: 300, # Open timeout in seconds (default: nil)
435
437
  read_timeout: 300, # Read timeout in seconds (default: nil)
436
-
438
+
437
439
  # Maximum number of times to retry the URL
438
440
  # - default: 3
439
441
  # - Open/connect will fail a couple of times on a bad/slow internet connection.
440
442
  max_retries: 10,
441
-
443
+
442
444
  # Maximum number of redirects allowed.
443
445
  # - default: 3
444
446
  # - You can set this to nil or -1, but I recommend using a number
445
447
  # for safety (infinite-loop attack).
446
448
  max_redirects: 1,
447
-
449
+
448
450
  # How to check redirect URLs for safety.
449
451
  # - default: :strict
450
452
  # - nil => do not check
@@ -453,7 +455,7 @@ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
453
455
  # - :strict => check the scheme and domain
454
456
  # (i.e., if https://bing.com, redirect URL must be https://bing.com)
455
457
  redirect_rule: :lenient,
456
-
458
+
457
459
  # Set the HTTP header field 'cookie' from the 'set-cookie' response.
458
460
  # - default: false
459
461
  # - Currently uses the 'http-cookie' Gem.
@@ -461,7 +463,7 @@ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
461
463
  # - Necessary for Search Engines or other sites that require cookies
462
464
  # in order to block bots.
463
465
  eat_cookie: true,
464
-
466
+
465
467
  # Set HTTP header fields.
466
468
  # - default: nil
467
469
  # - Necessary for Search Engines or other sites that try to block bots.
@@ -524,9 +526,9 @@ doc = ss.html_doc()
524
526
 
525
527
  doc.css('a').each() do |anchor|
526
528
  link = anchor['href']
527
-
528
- next if ss.ignore_link?(link)
529
-
529
+
530
+ next if ss.ignore_link?(link,cleaned: false)
531
+
530
532
  if link.include?('https://www3.nhk')
531
533
  puts link
532
534
  end
@@ -547,9 +549,9 @@ page_num = 1
547
549
 
548
550
  while !next_page.empty?()
549
551
  puts "Page #{page_num += 1}: #{next_page.count}"
550
-
552
+
551
553
  bs = NHKore::BingScraper.new(:yasashii,url: next_page.url)
552
-
554
+
553
555
  next_page = bs.scrape(slinks,next_page)
554
556
  end
555
557
 
@@ -564,27 +566,28 @@ end
564
566
 
565
567
  ```Ruby
566
568
  require 'nhkore/article_scraper'
569
+ require 'time'
567
570
 
568
571
  as = NHKore::ArticleScraper.new(
569
572
  'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html',
570
-
573
+
571
574
  # If false, scrape the article leniently (for older articles which
572
575
  # may not have certain tags, etc.).
573
576
  # - default: true
574
577
  strict: false,
575
-
578
+
576
579
  # {Dict} to use as the dictionary for words (Easy articles).
577
580
  # - default: :scrape
578
581
  # - nil => don't scrape/use it (necessary for Regular articles)
579
582
  # - :scrape => auto-scrape it using {DictScraper}
580
583
  # - {Dict} => your own {Dict}
581
584
  dict: nil,
582
-
585
+
583
586
  # Date time to use as a fallback if the article doesn't have one
584
587
  # (for older articles).
585
588
  # - default: nil
586
589
  datetime: Time.new(2020,2,2),
587
-
590
+
588
591
  # Year to use as a fallback if the article doesn't have one
589
592
  # (for older articles).
590
593
  # - default: nil
@@ -621,7 +624,7 @@ require 'nhkore/dict_scraper'
621
624
  url = 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'
622
625
  ds = NHKore::DictScraper.new(
623
626
  url,
624
-
627
+
625
628
  # Change the URL appropriately to the dictionary URL.
626
629
  # - default: true
627
630
  parse_url: true,
@@ -634,13 +637,13 @@ dict = ds.scrape()
634
637
 
635
638
  dict.entries.each() do |key,entry|
636
639
  entry.id
637
-
640
+
638
641
  entry.defns.each() do |defn|
639
642
  defn.hyoukis.each() {|hyouki| }
640
643
  defn.text
641
644
  defn.words.each() {|word| }
642
645
  end
643
-
646
+
644
647
  puts entry.build_hyouki()
645
648
  puts entry.build_defn()
646
649
  puts '---'
@@ -687,6 +690,7 @@ end
687
690
  `Sifter` will sift & sort the `News` data into a single file. The data is sorted by frequency in descending order (i.e., most frequent words first).
688
691
 
689
692
  ```Ruby
693
+ require 'nhkore/datetime_parser'
690
694
  require 'nhkore/news'
691
695
  require 'nhkore/sifter'
692
696
  require 'time'
@@ -698,7 +702,8 @@ sifter = NHKore::Sifter.new(news)
698
702
  sifter.caption = 'Sakura Fields Forever!'
699
703
 
700
704
  # Filter the data.
701
- #sifter.filter_by_datetime(Time.new(2019,12,5))
705
+ sifter.filter_by_datetime(NHKore::DatetimeParser.parse_range('2019-12-4...7'))
706
+ sifter.filter_by_datetime([Time.new(2019,12,4),Time.new(2019,12,7)])
702
707
  sifter.filter_by_datetime(
703
708
  from: Time.new(2019,12,4),to: Time.new(2019,12,7)
704
709
  )
@@ -727,13 +732,14 @@ if !File.exist?(file)
727
732
  end
728
733
  ```
729
734
 
730
- ### Util & UserAgents
735
+ ### Util, UserAgents, & DatetimeParser
731
736
 
732
737
  These provide a variety of useful methods/constants.
733
738
 
734
739
  Here are some of the most useful ones:
735
740
 
736
741
  ```Ruby
742
+ require 'nhkore/datetime_parser'
737
743
  require 'nhkore/user_agents'
738
744
  require 'nhkore/util'
739
745
 
@@ -759,14 +765,16 @@ puts
759
765
  puts '========'
760
766
  puts '[ Time ]'
761
767
  puts '========'
762
- puts "JST now: #{Util.jst_now}"
768
+ puts "JST now: #{Util.jst_now()}"
763
769
  # Drops in JST_OFFSET, does not change hour/min.
764
770
  puts "JST time: #{Util.jst_time(Time.now)}"
765
771
  puts "JST year: #{Util::JST_YEAR}"
766
772
  puts "1999 sane? #{Util.sane_year?(1999)}" # true
767
773
  puts "1776 sane? #{Util.sane_year?(1776)}" # false
768
- puts "Guess 5: #{Util.guess_year(5)}" # 2005
769
- puts "Guess 99: #{Util.guess_year(99)}" # 1999
774
+ puts "Guess 5: #{DatetimeParser.guess_year(5)}" # 2005
775
+ puts "Guess 99: #{DatetimeParser.guess_year(99)}" # 1999
776
+ # => [2020-12-01 00:00:00 +0900, 2020-12-31 23:59:59 +0900]
777
+ puts "Parse: #{DatetimeParser.parse_range('2020-12')}"
770
778
  puts
771
779
  puts "JST timezone offset: #{Util::JST_OFFSET}"
772
780
  puts "JST timezone offset hour: #{Util::JST_OFFSET_HOUR}"
@@ -781,20 +789,20 @@ JPN = ['桜','ぶ','ブ']
781
789
 
782
790
  def fmt_jpn()
783
791
  fmt = []
784
-
792
+
785
793
  JPN.each() do |x|
786
794
  x = yield(x)
787
795
  x = x ? "\u2B55" : Util::JPN_SPACE unless x.is_a?(String)
788
796
  fmt << x
789
797
  end
790
-
798
+
791
799
  return "[ #{fmt.join(' | ')} ]"
792
800
  end
793
801
 
794
802
  puts " #{fmt_jpn{|x| x}}"
795
- puts "Hiragana? #{fmt_jpn{|x| !!Util.hiragana?(x)}}"
796
- puts "Kana? #{fmt_jpn{|x| !!Util.kana?(x)}}"
797
- puts "Kanji? #{fmt_jpn{|x| !!Util.kanji?(x)}}"
803
+ puts "Hiragana? #{fmt_jpn{|x| Util.hiragana?(x)}}"
804
+ puts "Kana? #{fmt_jpn{|x| Util.kana?(x)}}"
805
+ puts "Kanji? #{fmt_jpn{|x| Util.kanji?(x)}}"
798
806
  puts "Reduce: #{Util.reduce_jpn_space("' '")}"
799
807
  puts
800
808
 
@@ -842,16 +850,36 @@ You can make some changes/fixes to the code and then install your local version:
842
850
 
843
851
  `$ bundle exec rake install:local`
844
852
 
845
- ### Releasing/Publishing
853
+ ### Updating [^](#contents)
854
+
855
+ This will update *core/* for you:
856
+
857
+ `$ bundle exec rake update_core`
858
+
859
+ ### Releasing [^](#contents)
860
+
861
+ 1. Update *CHANGELOG.md*, *version.rb*, & *Gemfile.lock*
862
+ - *Raketary*: `$ raketary bump -v`
863
+ - Run: `$ bundle update`
864
+ 2. Run: `$ bundle exec rake update_core`
865
+ 3. Run: `$ bundle exec rake clobber pkg_core`
866
+ 4. Create a new release & tag
867
+ - Add `pkg/nhkore-core.zip`
868
+ 5. Run: `$ git pull`
869
+ 6. Upload GitHub package
870
+ - *Raketary*: `$ raketary github_pkg`
871
+ 7. Run: `$ bundle exec rake release`
872
+
873
+ Releasing new HTML file for website:
846
874
 
847
- `$ bundle exec rake release`
875
+ 1. `$ bundle exec rake update_showcase`
848
876
 
849
877
  ## License [^](#contents)
850
878
 
851
879
  [GNU LGPL v3+](LICENSE.txt)
852
880
 
853
881
  > NHKore (<https://github.com/esotericpig/nhkore>)
854
- > Copyright (c) 2020 Jonathan Bradley Whited (@esotericpig)
882
+ > Copyright (c) 2020-2021 Jonathan Bradley Whited
855
883
  >
856
884
  > NHKore is free software: you can redistribute it and/or modify
857
885
  > it under the terms of the GNU Lesser General Public License as published by