nhkore 0.3.3 → 0.3.8

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 0ca67a215cda7c49a82aa824c1322b49285abe332f627c9ad4fae774043cbfc9
- data.tar.gz: b62a7e518787e89a3a54bcc66c191b4d3f005a911ab76861e3b118258f31b85f
+ metadata.gz: c63efbc2f65cfe83c7b55e53a0dfca329c2aded4c22ae05c2fb50583876452b4
+ data.tar.gz: 87c5116e11cb7e2dd4a5cdb86d6fc1a80ea58dd4efa7bc27ad448c25c4fad724
  SHA512:
- metadata.gz: b4e84a07685c71400bd50b270c4ae662e6885f7149fc7ec3dec9476bf9b6b80f402d7f874ddcbef920c2b5034a1d39b44fbcb7e9ece06f3a2d517ca89e37de3d
- data.tar.gz: 2527b477b7b7088f2612e4a05e0369b60cacb34bedb6ac59a3296643b6f59fcfce0c054ede67c68e0f4299864795bd79f04a85020d8f4c87b67f56c5a5dbeb77
+ metadata.gz: 68eb93da6d8f5c8ba3c4c58e0a9a71803dd4eefc6063df4ead9f0d06c0f1ba59892f5ddb43a9735c30ceaf85db63ed80c1b155bac1d5f0daf73f9cebbc7f6c6e
+ data.tar.gz: 33e9f4f770bceb2c0eb5d6d62781af400bc2b66f4ba4d4092b01b224bc365edef98cc47032f4b2389e04664f6cabcd2c02024139971bee207a121570805a6015
data/.yardopts ADDED
@@ -0,0 +1,3 @@
+ --files 'CHANGELOG.md,LICENSE.txt'
+ --protected
+ --readme 'README.md'
data/CHANGELOG.md CHANGED
@@ -1,8 +1,96 @@
  # Changelog | NHKore
 
- Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
+ All notable changes to this project will be documented in this file.
+
+ Format is based on [Keep a Changelog v1.0.0](https://keepachangelog.com/en/1.0.0),
+ and this project adheres to [Semantic Versioning v2.0.0](https://semver.org/spec/v2.0.0.html).
+
+ ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.8...HEAD)
+ -
+
+
+ ## [v0.3.8] - 2021-06-26
+
+ ### Fixed
+ - Fixed `App#refresh_cmd()` to also copy Cri's `default_proc` to the new Hash for the command options.
+ - Fixed to check for non-strings for JSON & URI.
+ - For JSON, convert `StringIO` to string in `DictScraper.scrape()`.
+ - For URL, convert URL using `URI()` because `URI.parse()` will crash with a non-string (URI object) in `Scraper.open_url()`.
+ - Fixed to scrape multiple HTML Ruby tag words (instead of just 1).
+ - I thought multiple Ruby bases/texts (`<rb>`/`<rt>`) were invalid, but after running into the article below and checking the HTML with a validator, it's actually valid HTML:
+ - https://www3.nhk.or.jp/news/easy/k10012759201000/k10012759201000.html
+ - No previous articles/URLs ran into this problem (would have raised an error), so it should only be a problem with this specific, new article.
+
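Reviewer note on the non-string fix in v0.3.8: the sketch below illustrates why `URI()` is the safer conversion, and how a `StringIO` can be reduced to a String. The helper names `to_uri` and `to_json_text` are hypothetical and are not taken from NHKore; only `Kernel#URI`, `URI.parse`, and `StringIO#read` are standard library calls.

```ruby
require 'stringio'
require 'uri'

# Hypothetical helper (not NHKore's API): accept a String or a URI object.
def to_uri(url)
  # Kernel#URI() parses a String and returns a URI object unchanged.
  return URI(url)
end

# Hypothetical helper (not NHKore's API): accept a String or an IO-like
# object such as StringIO and always hand back a String.
def to_json_text(json)
  return json.respond_to?(:read) ? json.read : json.to_s
end

uri = URI('https://www3.nhk.or.jp/news/easy/')

puts to_uri(uri).equal?(uri)                 # the same object comes back
puts to_uri('https://www3.nhk.or.jp/news/easy/').class

begin
  # URI.parse() expects a String, so a URI object is rejected here.
  URI.parse(uri)
rescue StandardError => e
  puts "URI.parse() rejected a URI object: #{e.class}"
end

puts to_json_text(StringIO.new('{"a":1}'))
```

The actual changes in `Scraper.open_url()` and `DictScraper.scrape()` may be structured differently; this only shows the underlying standard library behavior.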
+ ### Changed
+ - Formatted/Linted all code using RuboCop.
+ - Updated Gems.
+
+
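Reviewer note on the `default_proc` fix in v0.3.8: a minimal sketch of the general Ruby behavior, assuming the options Hash relies on a default proc for missing keys. The Hash contents here are made up for illustration and do not reproduce Cri's real option handling or `App#refresh_cmd()`.

```ruby
# A Hash with a default proc, loosely standing in for the command options.
opts = Hash.new { |_hash, key| "default for #{key}" }
opts[:verbose] = true

# Building a new Hash from the pairs alone keeps only the stored values.
plain = {}
opts.each { |key, value| plain[key] = value }

# Copying default_proc as well keeps the lookup behavior for missing keys.
fixed = {}
fixed.default_proc = opts.default_proc
opts.each { |key, value| fixed[key] = value }

puts plain[:missing].inspect # nil
puts fixed[:missing]         # default for missing
```

Without the copied proc, any option that was never set explicitly silently reads as `nil` in the new Hash.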
+ ## [v0.3.7] - 2020-11-07
+
+ ### Changed
+ - Updated Gem `attr_bool` to v0.2
+ - Changed upper-case *'-V'* flag for *version* to be a lower-case *'-v'*
+ - Seems like a lot of apps/people expect this
+ - Refactored/Formatted some code
+ - *nhkore.gemspec* especially
+ - Added *samples/*, *Gemfile.lock*, and *.yardopts* to the files in *nhkore.gemspec*
+
+ ### Fixed
+ - ArticleScraper
+ - Fixed to accept text nodes that have Kanji, due to bad article:
+ - https://www3.nhk.or.jp/news/easy/k10012639271000/k10012639271000.html
+ - `第3のビール` should have HTML ruby tags around *第*
+
+
+ ## [v0.3.6] - 2020-08-18
+
+ ### Added
+ - `update_showcase` Rake task for development & personal site (GitHub Page)
+ - `$ bundle exec rake update_showcase`
+
+ ### Changed
+ - Updated Gems
+
+ ### Fixed
+ - ArticleScraper for title for specific site
+ - https://www3.nhk.or.jp/news/easy/article/disaster_earthquake_illust.html
+ - Ignored `/cgi2.*enqform/` URLs from SearchScraper (Bing)
+ - Added more detail to dictionary error in ArticleScraper
+
+
+ ## [v0.3.5] - 2020-05-04
+
+ ### Added
+ - Added check for environment var `NO_COLOR`
+ - [https://no-color.org/](https://no-color.org/)
+
+ ### Fixed
+ - Fixed URLs stored in YAML data to always be of type String (not URI)
+ - This initially caused a problem in DictScraper.parse_url() from ArticleScraper, but fixed it for all data
+
+
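Reviewer note on the `NO_COLOR` check added in v0.3.5: a small sketch of what such a check can look like. The helpers `color?` and `paint` are hypothetical names, and treating an empty value as "not set" is an assumption based on the convention described at no-color.org; NHKore's own check may differ.

```ruby
# Hypothetical helper (not NHKore's API): decide whether to emit ANSI colors.
# Assumption: any non-empty NO_COLOR value disables color.
def color?(env = ENV)
  no_color = env['NO_COLOR']
  return no_color.nil? || no_color.empty?
end

# Hypothetical helper: wrap text in a green ANSI escape only if allowed.
def paint(text, env = ENV)
  return color?(env) ? "\e[32m#{text}\e[0m" : text
end

puts paint('sakura', {})                  # colored
puts paint('sakura', {'NO_COLOR' => '1'}) # plain
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real `ENV`.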
+ ## [v0.3.4] - 2020-04-25
+
+ ### Added
+ - DatetimeParser
+ - Extracted from SiftCmd into its own class
+ - Fixed some minor logic bugs from the old code
+ - Added new feature where 1 range can be empty:
+ - `sift ez -d '...2019'` (from = 1924)
+ - `sift ez -d '2019...'` (to = current year)
+ - `sift ez -d '...'` (still an error)
+ - Added `update_core` rake task for dev
+ - Makes pushing a new release much easier
+ - See *Hacking.Releasing* section in *README*
+
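Reviewer note on the open-ended ranges added in v0.3.4: the sketch below covers year-only input to show the three cases listed in the changelog. `parse_year_range` and `MIN_YEAR` are hypothetical names for illustration; the real `DatetimeParser` also handles months, days, and times, and its logic is not shown in this diff.

```ruby
# Hypothetical sketch (not DatetimeParser's code): year-only ranges.
MIN_YEAR = 1924 # the changelog's stated default for an empty "from"

def parse_year_range(value, min_year: MIN_YEAR, max_year: Time.now.year)
  from, to = value.split('...', 2).map(&:strip)
  to = from unless value.include?('...') # single year, e.g. '2019'

  from = nil if from.nil? || from.empty?
  to = nil if to.nil? || to.empty?

  # Both sides empty ('...') is still an error.
  raise ArgumentError, "invalid range: #{value.inspect}" if from.nil? && to.nil?

  return [from ? Integer(from, 10) : min_year, to ? Integer(to, 10) : max_year]
end

p parse_year_range('...2019')                 # from falls back to 1924
p parse_year_range('2019...', max_year: 2021) # to falls back to max_year
```

`max_year` is a parameter only so the example is deterministic; by default it uses the current year, as the changelog describes.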
+ ### Fixed
+ - SiftCmd `parse_sift_datetime()` for `-d/--datetime` option
+ - Didn't work exactly right (as written in *README*) for some special inputs:
+ - `-d '2019...3'`
+ - `-d '3-3'`
+ - `-d '3'`
 
- ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.3...master)
 
  ## [v0.3.3] - 2020-04-23
 
@@ -10,6 +98,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
  - Added JSON support to Sifter & SiftCmd.
  - Added use of `attr_bool` Gem for `attr_accessor?` & `attr_reader?`.
 
+
  ## [v0.3.2] - 2020-04-22
 
  ### Added
@@ -33,6 +122,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
  - ArticleScraper
  - Renamed `mode` param to `strict`. `mode` was overshadowing File.open()'s in Scraper.
 
+
  ## [v0.3.1] - 2020-04-20
 
  ### Changed
@@ -50,6 +140,7 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
  - BingScraper
  - Fixed possible RSS infinite loop.
 
+
  ## [v0.3.0] - 2020-04-12
 
  ### Added
@@ -84,7 +175,9 @@ Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
  - ignore empty filenames in the Zip for safety.
  - ask to overwrite files instead of erroring.
 
+
  ## [v0.2.0] - 2020-04-01
+
  First working version.
 
  ### Added
@@ -120,7 +213,9 @@ First working version.
  - test/nhkore_tester.rb
  - Renamed to `test/nhkore/test_helper.rb`
 
+
  ## [v0.1.0] - 2020-02-24
+
  ### Added
  - .gitignore
  - CHANGELOG.md
data/Gemfile CHANGED
@@ -1,24 +1,6 @@
  # encoding: UTF-8
  # frozen_string_literal: true
 
- #--
- # This file is part of NHKore.
- # Copyright (c) 2020 Jonathan Bradley Whited (@esotericpig)
- #
- # NHKore is free software: you can redistribute it and/or modify
- # it under the terms of the GNU Lesser General Public License as published by
- # the Free Software Foundation, either version 3 of the License, or
- # (at your option) any later version.
- #
- # NHKore is distributed in the hope that it will be useful,
- # but WITHOUT ANY WARRANTY; without even the implied warranty of
- # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- # GNU Lesser General Public License for more details.
- #
- # You should have received a copy of the GNU Lesser General Public License
- # along with NHKore. If not, see <https://www.gnu.org/licenses/>.
- #++
-
 
  source 'https://rubygems.org'
 
data/Gemfile.lock ADDED
@@ -0,0 +1,89 @@
+ PATH
+ remote: .
+ specs:
+ nhkore (0.3.8)
+ attr_bool (~> 0.2)
+ bimyou_segmenter (~> 1.2)
+ cri (~> 2.15)
+ down (~> 5.2)
+ highline (~> 2.0)
+ http-cookie (~> 1.0)
+ japanese_deinflector (~> 0.0)
+ nokogiri (~> 1.11)
+ psychgus (~> 1.3)
+ public_suffix (~> 4.0)
+ rainbow (~> 3.0)
+ rubyzip (~> 2.3)
+ tiny_segmenter (~> 0.0)
+ tty-progressbar (~> 0.18)
+ tty-spinner (~> 0.9)
+
+ GEM
+ remote: https://rubygems.org/
+ specs:
+ addressable (2.7.0)
+ public_suffix (>= 2.0.2, < 5.0)
+ attr_bool (0.2.2)
+ bimyou_segmenter (1.2.0)
+ cri (2.15.11)
+ domain_name (0.5.20190701)
+ unf (>= 0.0.5, < 1.0.0)
+ down (5.2.2)
+ addressable (~> 2.5)
+ highline (2.0.3)
+ http-cookie (1.0.4)
+ domain_name (~> 0.5)
+ japanese_deinflector (0.0.2)
+ mini_portile2 (2.5.3)
+ minitest (5.14.4)
+ nokogiri (1.11.7)
+ mini_portile2 (~> 2.5.0)
+ racc (~> 1.4)
+ psych (4.0.1)
+ psychgus (1.3.4)
+ psych (>= 3.0)
+ public_suffix (4.0.6)
+ racc (1.5.2)
+ rainbow (3.0.0)
+ rake (13.0.3)
+ raketeer (0.2.13)
+ rake
+ rdoc (6.3.1)
+ redcarpet (3.5.1)
+ rubyzip (2.3.0)
+ strings-ansi (0.2.0)
+ tiny_segmenter (0.0.6)
+ tty-cursor (0.7.1)
+ tty-progressbar (0.18.2)
+ strings-ansi (~> 0.2)
+ tty-cursor (~> 0.7)
+ tty-screen (~> 0.8)
+ unicode-display_width (>= 1.6, < 3.0)
+ tty-screen (0.8.1)
+ tty-spinner (0.9.3)
+ tty-cursor (~> 0.7)
+ unf (0.1.4)
+ unf_ext
+ unf_ext (0.0.7.7)
+ unicode-display_width (2.0.0)
+ yard (0.9.26)
+ yard_ghurt (1.2.1)
+ rake
+ yard
+
+ PLATFORMS
+ ruby
+
+ DEPENDENCIES
+ bundler (~> 2.2)
+ minitest (~> 5.14)
+ nhkore!
+ rake (~> 13.0)
+ raketeer (~> 0.2)
+ rdoc (~> 6.3)
+ redcarpet (~> 3.5)
+ yard (~> 0.9)
+ yard_ghurt (~> 1.2)
+
+ BUNDLED WITH
+ 2.2.20
data/README.md CHANGED
@@ -26,6 +26,8 @@ This is similar to a [core word/vocabulary list](https://www.fluentin3months.com
  - [News Command](#news-command-)
  - [Using the Library](#using-the-library-)
  - [Hacking](#hacking-)
+ - [Updating](#updating-)
+ - [Releasing](#releasing-)
  - [License](#license-)
 
  ## For Non-Power Users [^](#contents)
@@ -433,18 +435,18 @@ require 'nhkore/scraper'
  s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
  open_timeout: 300, # Open timeout in seconds (default: nil)
  read_timeout: 300, # Read timeout in seconds (default: nil)
-
+
  # Maximum number of times to retry the URL
  # - default: 3
  # - Open/connect will fail a couple of times on a bad/slow internet connection.
  max_retries: 10,
-
+
  # Maximum number of redirects allowed.
  # - default: 3
  # - You can set this to nil or -1, but I recommend using a number
  # for safety (infinite-loop attack).
  max_redirects: 1,
-
+
  # How to check redirect URLs for safety.
  # - default: :strict
  # - nil => do not check
@@ -453,7 +455,7 @@ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
  # - :strict => check the scheme and domain
  # (i.e., if https://bing.com, redirect URL must be https://bing.com)
  redirect_rule: :lenient,
-
+
  # Set the HTTP header field 'cookie' from the 'set-cookie' response.
  # - default: false
  # - Currently uses the 'http-cookie' Gem.
@@ -461,7 +463,7 @@ s = NHKore::Scraper.new('https://www3.nhk.or.jp/news/easy/',
  # - Necessary for Search Engines or other sites that require cookies
  # in order to block bots.
  eat_cookie: true,
-
+
  # Set HTTP header fields.
  # - default: nil
  # - Necessary for Search Engines or other sites that try to block bots.
@@ -524,9 +526,9 @@ doc = ss.html_doc()
 
  doc.css('a').each() do |anchor|
  link = anchor['href']
-
- next if ss.ignore_link?(link)
-
+
+ next if ss.ignore_link?(link,cleaned: false)
+
  if link.include?('https://www3.nhk')
  puts link
  end
@@ -547,9 +549,9 @@ page_num = 1
 
  while !next_page.empty?()
  puts "Page #{page_num += 1}: #{next_page.count}"
-
+
  bs = NHKore::BingScraper.new(:yasashii,url: next_page.url)
-
+
  next_page = bs.scrape(slinks,next_page)
  end
 
@@ -564,27 +566,28 @@ end
 
  ```Ruby
  require 'nhkore/article_scraper'
+ require 'time'
 
  as = NHKore::ArticleScraper.new(
  'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html',
-
+
  # If false, scrape the article leniently (for older articles which
  # may not have certain tags, etc.).
  # - default: true
  strict: false,
-
+
  # {Dict} to use as the dictionary for words (Easy articles).
  # - default: :scrape
  # - nil => don't scrape/use it (necessary for Regular articles)
  # - :scrape => auto-scrape it using {DictScraper}
  # - {Dict} => your own {Dict}
  dict: nil,
-
+
  # Date time to use as a fallback if the article doesn't have one
  # (for older articles).
  # - default: nil
  datetime: Time.new(2020,2,2),
-
+
  # Year to use as a fallback if the article doesn't have one
  # (for older articles).
  # - default: nil
@@ -621,7 +624,7 @@ require 'nhkore/dict_scraper'
  url = 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'
  ds = NHKore::DictScraper.new(
  url,
-
+
  # Change the URL appropriately to the dictionary URL.
  # - default: true
  parse_url: true,
@@ -634,13 +637,13 @@ dict = ds.scrape()
 
  dict.entries.each() do |key,entry|
  entry.id
-
+
  entry.defns.each() do |defn|
  defn.hyoukis.each() {|hyouki| }
  defn.text
  defn.words.each() {|word| }
  end
-
+
  puts entry.build_hyouki()
  puts entry.build_defn()
  puts '---'
@@ -687,6 +690,7 @@ end
  `Sifter` will sift & sort the `News` data into a single file. The data is sorted by frequency in descending order (i.e., most frequent words first).
 
  ```Ruby
+ require 'nhkore/datetime_parser'
  require 'nhkore/news'
  require 'nhkore/sifter'
  require 'time'
@@ -698,7 +702,8 @@ sifter = NHKore::Sifter.new(news)
  sifter.caption = 'Sakura Fields Forever!'
 
  # Filter the data.
- #sifter.filter_by_datetime(Time.new(2019,12,5))
+ sifter.filter_by_datetime(NHKore::DatetimeParser.parse_range('2019-12-4...7'))
+ sifter.filter_by_datetime([Time.new(2019,12,4),Time.new(2019,12,7)])
  sifter.filter_by_datetime(
  from: Time.new(2019,12,4),to: Time.new(2019,12,7)
  )
@@ -727,13 +732,14 @@ if !File.exist?(file)
  end
  ```
 
- ### Util & UserAgents
+ ### Util, UserAgents, & DatetimeParser
 
  These provide a variety of useful methods/constants.
 
  Here are some of the most useful ones:
 
  ```Ruby
+ require 'nhkore/datetime_parser'
  require 'nhkore/user_agents'
  require 'nhkore/util'
 
@@ -759,14 +765,16 @@ puts
  puts '========'
  puts '[ Time ]'
  puts '========'
- puts "JST now: #{Util.jst_now}"
+ puts "JST now: #{Util.jst_now()}"
  # Drops in JST_OFFSET, does not change hour/min.
  puts "JST time: #{Util.jst_time(Time.now)}"
  puts "JST year: #{Util::JST_YEAR}"
  puts "1999 sane? #{Util.sane_year?(1999)}" # true
  puts "1776 sane? #{Util.sane_year?(1776)}" # false
- puts "Guess 5: #{Util.guess_year(5)}" # 2005
- puts "Guess 99: #{Util.guess_year(99)}" # 1999
+ puts "Guess 5: #{DatetimeParser.guess_year(5)}" # 2005
+ puts "Guess 99: #{DatetimeParser.guess_year(99)}" # 1999
+ # => [2020-12-01 00:00:00 +0900, 2020-12-31 23:59:59 +0900]
+ puts "Parse: #{DatetimeParser.parse_range('2020-12')}"
  puts
  puts "JST timezone offset: #{Util::JST_OFFSET}"
  puts "JST timezone offset hour: #{Util::JST_OFFSET_HOUR}"
@@ -781,20 +789,20 @@ JPN = ['桜','ぶ','ブ']
 
  def fmt_jpn()
  fmt = []
-
+
  JPN.each() do |x|
  x = yield(x)
  x = x ? "\u2B55" : Util::JPN_SPACE unless x.is_a?(String)
  fmt << x
  end
-
+
  return "[ #{fmt.join(' | ')} ]"
  end
 
  puts " #{fmt_jpn{|x| x}}"
- puts "Hiragana? #{fmt_jpn{|x| !!Util.hiragana?(x)}}"
- puts "Kana? #{fmt_jpn{|x| !!Util.kana?(x)}}"
- puts "Kanji? #{fmt_jpn{|x| !!Util.kanji?(x)}}"
+ puts "Hiragana? #{fmt_jpn{|x| Util.hiragana?(x)}}"
+ puts "Kana? #{fmt_jpn{|x| Util.kana?(x)}}"
+ puts "Kanji? #{fmt_jpn{|x| Util.kanji?(x)}}"
  puts "Reduce: #{Util.reduce_jpn_space("' '")}"
  puts
 
@@ -842,16 +850,36 @@ You can make some changes/fixes to the code and then install your local version:
 
  `$ bundle exec rake install:local`
 
- ### Releasing/Publishing
+ ### Updating [^](#contents)
+
+ This will update *core/* for you:
+
+ `$ bundle exec rake update_core`
+
+ ### Releasing [^](#contents)
+
+ 1. Update *CHANGELOG.md*, *version.rb*, & *Gemfile.lock*
+ - *Raketary*: `$ raketary bump -v`
+ - Run: `$ bundle update`
+ 2. Run: `$ bundle exec rake update_core`
+ 3. Run: `$ bundle exec rake clobber pkg_core`
+ 4. Create a new release & tag
+ - Add `pkg/nhkore-core.zip`
+ 5. Run: `$ git pull`
+ 6. Upload GitHub package
+ - *Raketary*: `$ raketary github_pkg`
+ 7. Run: `$ bundle exec rake release`
+
+ Releasing new HTML file for website:
 
- `$ bundle exec rake release`
+ 1. `$ bundle exec rake update_showcase`
 
  ## License [^](#contents)
 
  [GNU LGPL v3+](LICENSE.txt)
 
  > NHKore (<https://github.com/esotericpig/nhkore>)
- > Copyright (c) 2020 Jonathan Bradley Whited (@esotericpig)
+ > Copyright (c) 2020-2021 Jonathan Bradley Whited
  >
  > NHKore is free software: you can redistribute it and/or modify
  > it under the terms of the GNU Lesser General Public License as published by