nhkore 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 6ab82aafdbc996ca3f0f010d533adb165df24d63a4799ba8812596551506d52c
- data.tar.gz: 5cb8b107928f7ba4c3e0100d70748b616f897a1b0e8c70149fdbf3ce09c39bd7
+ metadata.gz: a02c041aff2b0b040ff00acaeeaa54506e574e9575613b48c4dd75ef0ef45564
+ data.tar.gz: dd3f570c0e7b7223c039d4119989e04514ef0d4bf4fb53485397271021f39246
  SHA512:
- metadata.gz: 2e85f11cb8b88605964e656234c746adb514d947fa46523fe464e601ae87cc1cdb7f8f32407e317c7e0c470cdfefde24251b7704cf014399a9eeb300fbd43936
- data.tar.gz: 6b6aecea79efcf9f936667aa2d6a60b7255ee49de6929a576db468504e5084b254cb64729f8638de2c1814cc1223cd9a8ed04703d1a737ff805b3b2a5566102b
+ metadata.gz: 4f5021ab1fd74bb1c5a42574fa1045f71069b6f8ab6cf7b1717e6164505127e6c657f2e36be903dc190d356bed83fdf8c2de4c89644c7676863cfb9a8c53da8f
+ data.tar.gz: e082a6ed70bacccb763386e00d8ca92351d4ee8d9f2d32a9b79dc6a2733ea46cd739e95550124e4807281c58f3a65faf9b8496740a51cca13e068ecd3e882d3a
data/CHANGELOG.md CHANGED
@@ -2,7 +2,41 @@
 
  Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
- ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.2.0...master)
+ ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.0...master)
+
+ ## [v0.3.0] - 2020-04-12
+
+ ### Added
+ - UserAgents
+     - Tons of random `User-Agent` strings for `Scraper`.
+
+ ### Changed
+ - BingCmd => SearchCmd
+     - Major (breaking) change.
+     - Changed `$ nhkore bing easy` to:
+         - `$ nhkore search easy bing`
+         - `$ nhkore se ez b`
+ - App
+     - Added options:
+         - `--color` (force color output for demos)
+         - `--user-agent` (specify a custom HTTP header field `User-Agent`)
+     - If `out_dir` is empty, don't prompt if okay to overwrite.
+ - README/nhkore.gemspec
+     - Added more info.
+     - Changed description.
+
+ ### Fixed
+ - Scraper/BingScraper
+     - Big fix.
+     - Fixed to get around Bing's strictness.
+         - Use a random `User-Agent` from `UserAgents`.
+         - Set HTTP header field `cookie` from `set-cookie` response (see the sketch after this diff).
+             - Added `http-cookie` gem.
+         - Use RSS as a fallback.
+ - GetCmd
+     - When extracting files...
+         - ignore empty filenames in the Zip for safety.
+         - ask to overwrite files instead of erroring.
 
  ## [v0.2.0] - 2020-04-01
  First working version.
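
For illustration, here is a rough sketch of the `User-Agent`/cookie approach described in the Fixed entry above, using the `http-cookie` gem that this release adds. It is not nhkore's actual `BingScraper` code; the User-Agent strings and search URL are placeholders.

```Ruby
require 'http-cookie'
require 'net/http'
require 'uri'

# Placeholder pool; nhkore ships its own UserAgents list.
USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
].freeze

uri = URI('https://www.bing.com/search?q=example')
jar = HTTP::CookieJar.new

2.times do
  Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    req = Net::HTTP::Get.new(uri)
    req['User-Agent'] = USER_AGENTS.sample

    # Echo back any cookies collected from earlier 'set-cookie' responses.
    cookies = jar.cookies(uri)
    req['Cookie'] = HTTP::Cookie.cookie_value(cookies) unless cookies.empty?

    res = http.request(req)

    # Remember cookies from the 'set-cookie' response header(s) for the next request.
    res.get_fields('set-cookie')&.each { |value| jar.parse(value, uri) }
  end
end
```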
data/README.md CHANGED
@@ -10,20 +10,39 @@ A CLI app that scrapes [NHK News Web Easy](https://www3.nhk.or.jp/news/easy/) to
 
  This is similar to a [core word/vocabulary list](https://www.fluentin3months.com/core-japanese-words/), hence the name NHKore.
 
- In the future, I would like to add the regular NHK News, using the links from the easy versions.
+ [![asciinema Demo - Help](https://asciinema.org/a/MQTJ9vxcpB7VYAKzke7m4QM7P.png)](https://asciinema.org/a/MQTJ9vxcpB7VYAKzke7m4QM7P?speed=2)
 
  ## Contents
 
- - [Installing](#installing)
- - [Using](#using)
- - [Hacking](#hacking)
- - [License](#license)
+ - [For Non-Power Users](#for-non-power-users-)
+ - [Installing](#installing-)
+ - [Using](#using-)
+     - [The Basics](#the-basics-)
+     - [Unlimited Power!](#unlimited-power-)
+         - [Get Command](#get-command-)
+         - [Sift Command](#sift-command-)
+     - [Sakura Fields Forever](#sakura-fields-forever-)
+         - [Search Command](#search-command-)
+         - [News Command](#news-command-)
+ - [Using the Library](#using-the-library-)
+ - [Hacking](#hacking-)
+ - [License](#license-)
 
- ## [Installing](#contents)
+ ## For Non-Power Users [^](#contents)
+
+ If you're not a power user, you're probably just interested in the data.
+
+ [Click here](https://esotericpig.github.io/showcase/nhkore-ez.html) for a big HTML file of the final result from all of the current articles scraped.
+
+ [Click here](https://github.com/esotericpig/nhkore/releases/latest) to go to the latest release and download `nhkore-core.zip` from the `Assets`. It contains all of the links scraped, all of the data scraped per article, and a final CSV file.
+
+ If you'd like to try using the app, please download and install [Ruby](https://www.ruby-lang.org/en/downloads/) and then follow the instructions below. You'll need to be able to use the command line.
+
+ ## Installing [^](#contents)
 
  Pick your poison...
 
- With the RubyGems CLI package manager:
+ With the RubyGems package manager:
 
  `$ gem install nhkore`
 
@@ -32,14 +51,267 @@ Manually:
  ```
  $ git clone 'https://github.com/esotericpig/nhkore.git'
  $ cd nhkore
- $ rake install
+ $ gem build nhkore.gemspec
+ $ gem install *.gem
  ```
 
- ## [Using](#contents)
+ If there are errors running `nhkore`, you may need to also [install Nokogiri](https://nokogiri.org/tutorials/installing_nokogiri.html) manually, which is used for scraping HTML.
+
+ ## Using [^](#contents)
+
+ ### The Basics [^](#contents)
 
- TODO: update README Using section
+ The most useful thing to do is to simply scrape one article and then study the most frequent words before reading that article.
 
- ## [Hacking](#contents)
+ First, scrape the article:
+
+ `$ nhkore news easy -u 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'`
+
+ If your internet is slow, there are several global options to help alleviate your internet woes, which can be used with any sub command:
+
+ ```
+ -m --max-retry=<value>     maximum number of times to retry URLs
+                            (-1 or integer >= 0) (default: 3)
+ -o --open-timeout=<value>  seconds for URL open timeouts
+                            (-1 or decimal >= 0)
+ -r --read-timeout=<value>  seconds for URL read timeouts
+                            (-1 or decimal >= 0)
+ -t --timeout=<value>       seconds for all URL timeouts: [open, read]
+                            (-1 or decimal >= 0)
+ ```
+
+ Example usage:
+
+ `$ nhkore -t 300 -m 10 news easy -u 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'`
+
+ Some older articles will fail to scrape and need additional options (this is very rare):
+
+ ```
+ -D --no-dict           do not try to parse the dictionary files
+                        for the articles; useful in case of errors
+                        trying to load the dictionaries (or for offline testing)
+ -L --lenient           leniently (not strict) scrape articles:
+                        body & title content without the proper
+                        HTML/CSS classes/IDs and no futsuurl;
+                        example URLs:
+                        - https://www3.nhk.or.jp/news/easy/article/disaster_earthquake_02.html
+                        - https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html
+ -M --missingno         very rarely an article will not have kana or kanji
+                        for a Ruby tag; to not raise an error, this will
+                        use previously scraped data to fill it in;
+                        example URL:
+                        - https://www3.nhk.or.jp/news/easy/k10012331311000/k10012331311000.html
+ -d --datetime=<value>  date time to use as a fallback in cases
+                        when an article doesn't have one;
+                        format: YYYY-mm-dd H:M; example: 2020-03-30 15:30
+ ```
+
+ Example usage:
+
+ `$ nhkore -t 300 -m 10 news -D -L -M -d '2011-03-07 06:30' easy -u 'https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html'`
+
+ Now that the data from the article has been scraped, you can generate a CSV/HTML/YAML file of the words ordered by frequency:
+
+ ```
+ $ nhkore sift easy -e csv
+ $ nhkore sift easy -e html
+ $ nhkore sift easy -e yml
+ ```
+
+ If you have other scraped articles, then you'll need to filter down to the specific one:
+
+ | Command | Description |
+ | --- | --- |
+ | `$ nhkore sift easy -u k10011862381000` | Filter by URL |
+ | `$ nhkore sift easy -t '植えられた桜'` | Filter by title |
+ | `$ nhkore sift easy -d '2019-3-29 11:30'` | Filter by date time |
+ | `$ nhkore sift easy -d '2019-3-29' -t '桜'` | Filter by date time & title |
+ | `$ nhkore sift easy -d '2019-3-29' -t '桜' -e html` | Filter & output HTML |
+ | `$ nhkore sift easy -d '2019-3-29' -t '桜' -o 'sakura.html'` | Filter & output HTML |
+
+ Complete demo:
+
+ [![asciinema Demo - The Basics](https://asciinema.org/a/316571.png)](https://asciinema.org/a/316571)
+
+ ### Unlimited Power! [^](#contents)
+
+ #### Get Command [^](#contents)
+
+ The `get` command will download and extract `nhkore-core.zip` from the [latest release](https://github.com/esotericpig/nhkore/releases/latest) for you.
+
+ This already has tons of articles scraped so that you don't have to re-scrape them. Then, for example, you can easily create a CSV file from all of `2019` or all of `December 2019`.
+
+ Example usage:
+
+ `$ nhkore get`
+
+ By default, it will extract the data to `./core/`. You can change this:
+
+ `$ nhkore get -o 'my dir/'`
+
+ Complete demo:
+
+ [![asciinema Demo - Get](https://asciinema.org/a/317773.png)](https://asciinema.org/a/317773)
+
+ #### Sift Command [^](#contents)
+
+ After obtaining the scraped data, you can `sift` all of the data (or select data) into one of these file formats:
+
+ | Format | Typical Purpose |
+ | --- | --- |
+ | CSV | For uploading to a flashcard website (e.g., Memrise, Anki, Buffl) after changing the data appropriately. |
+ | HTML | For comfortable viewing in a web browser or for sharing. |
+ | YAML | For developers to automatically add translations or to manipulate the data in some other way programmatically. |
+
+ The data is sorted by frequency in descending order (i.e., most frequent words first).
+
+ If you wish to sort/arrange the data in some other way, CSV editors (e.g., LibreOffice, WPS Office, Microsoft Office) can do this easily and efficiently, or if you are code-savvy, you can programmatically manipulate the CSV/YAML/HTML file.
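
For example, here is a minimal Ruby sketch (not part of nhkore) that re-sorts a sifted CSV with Ruby's standard `csv` library. The file name and the `word` column name are assumptions; check your own file's header row for the actual names.

```Ruby
require 'csv'

# Load the sifted file with its header row intact.
rows = CSV.read('core/sift_nhk_news_web_easy.csv', headers: true)

# Re-sort alphabetically by word instead of by frequency (assumed 'word' column).
sorted = rows.sort_by { |row| row['word'].to_s }

# Write the re-sorted rows to a new file, keeping the original header.
CSV.open('sift_sorted.csv', 'w') do |csv|
  csv << rows.headers
  sorted.each { |row| csv << row }
end
```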
+
+ The defaults will sift all of the data into a CSV file, which may not be what you want:
+
+ `$ nhkore sift easy`
+
+ You can filter the data by using different options:
+
+ ```
+ -d --datetime=<value>  date time to filter on; examples:
+                        - '2020-7-1 13:10...2020-7-31 11:11'
+                        - '2020-12' (2020, December 1st-31st)
+                        - '7-4...7-9' (July 4th-9th of Current Year)
+                        - '7-9' (July 9th of Current Year)
+                        - '9' (9th of Current Year & Month)
+ -t --title=<value>     title to filter on, where search text only
+                        needs to be somewhere in the title
+ -u --url=<value>       URL to filter on, where search text only
+                        needs to be somewhere in the URL
+ ```
+
+ Filter examples:
+
+ ```
+ $ nhkore sift easy -d 2019
+ $ nhkore sift easy -d '2019-12'
+ $ nhkore sift easy -d '2019-7-4...9' # July 4th to 9th of 2019
+ $ nhkore sift easy -d '2019-12-25 13:10'
+ $ nhkore sift easy -t 'マリオ'
+ $ nhkore sift easy -u 'k10011862381000'
+ ```
+
+ You can save the data to a different format using one of these options:
+
+ ```
+ -e --ext=<value>  type of file (extension) to save;
+                   valid options: [csv, htm, html, yaml, yml];
+                   not needed if you specify a file extension with
+                   the '--out' option: '--out sift.html'
+                   (default: csv)
+ -o --out=<value>  'directory/file' to save sifted data to;
+                   if you only specify a directory or a file, it will
+                   attach the appropriate default directory/file name
+                   (defaults:
+                   core/sift_nhk_news_web_easy{search.criteria}{file.ext},
+                   core/sift_nhk_news_web_regular{search.criteria}{file.ext})
+ ```
+
+ Format examples:
+
+ ```
+ $ nhkore sift easy -e html
+ $ nhkore sift easy -e yml
+ $ nhkore sift easy -o 'mario.html'
+ $ nhkore sift easy -o 'sakura.yml'
+ ```
+
+ Lastly, you can ignore certain columns from the output. Definitions can be quite long, and English translations are currently always blank (meant to be filled in manually/programmatically).
+
+ ```
+ -D --no-defn  do not output the definitions for words
+               (which can be quite long)
+ -E --no-eng   do not output the English translations for words
+ ```
+
+ Complete demo:
+
+ [![asciinema Demo - Sift](https://asciinema.org/a/318119.png)](https://asciinema.org/a/318119)
+
+ ### Sakura Fields Forever [^](#contents)
+
+ #### Search Command [^](#contents)
+
+ The [news](#news-command-) command (for scraping articles) relies on having a file of article links.
+
+ Currently, the NHK website doesn't provide an historical record of all of its articles, and it's up to the user to find them.
+
+ The format of the file is simple, so you can edit it by hand (or programmatically) very easily:
+
+ ```YAML
+ # core/links_nhk_news_web_easy.yml
+ ---
+ links:
+   https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html
+     scraped: false
+   https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html
+     scraped: false
+ ```
+
+ Only the key (which is the URL) and the `url` field are required. The rest of the fields will be populated when you scrape the data.
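
If you'd rather add links programmatically than by hand, a minimal Ruby sketch (not part of nhkore) using the standard `yaml` library could look like this; the file path and article URL are just examples:

```Ruby
require 'yaml'

file = 'core/links_nhk_news_web_easy.yml'
url  = 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'

# Load the existing links file, or start a new one with an empty 'links' mapping.
data = File.exist?(file) ? YAML.load_file(file) : { 'links' => {} }
data['links'] ||= {}

# Only the key and the 'url' field are needed; nhkore fills in the rest when scraping.
data['links'][url] ||= { 'url' => url, 'scraped' => false }

File.write(file, data.to_yaml)
```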
+
+ > <rambling>
+ > Originally, I was planning on using a different key, so that's why the URL is duplicated. This also allows for a possible future breaking version (major version change) to alter the key. In addition, I was originally planning to allow filtering in this file, so that's why additional fields are populated after scraping the data.
+ > </rambling>
+
+ Example after running the `news` command:
+
+ ```YAML
+ # core/links_nhk_news_web_easy.yml
+ # - After being scraped
+ ---
+ links:
+   https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html
+     scraped: true
+     datetime: '2020-03-11T16:00:00+09:00'
+     title: 安倍総理大臣「今月20日ごろまで大きなイベントをしないで」
+     futsuurl: https://www3.nhk.or.jp/news/html/20200310/k10012323711000.html
+     sha256: d1186ebbc2013564e52f21a2e8ecd56144ed5fe98c365f6edbd4eefb2db345eb
+   https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html
+     scraped: true
+     datetime: '2020-03-11T11:30:00+09:00'
+     title: 島根県の会社 中国から技能実習生が来なくて困っている
+     futsuurl: https://www3.nhk.or.jp/news/html/20200309/k10012321401000.html
+     sha256: 2df91884fbbafdc69bc3126cb0cb7b63b2c24e85bc0de707643919e4581927a9
+ ```
+
+ If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
+
+ #### News Command [^](#contents)
+
+ ## Using the Library [^](#contents)
+
+ ### Setup
+
+ Pick your poison...
+
+ In your *Gemspec* (*<project>.gemspec*):
+
+ ```Ruby
+ spec.add_runtime_dependency 'nhkore', '~> X.X'
+ ```
+
+ In your *Gemfile*:
+
+ ```Ruby
+ # Pick one...
+ gem 'nhkore', '~> X.X'
+ gem 'nhkore', :git => 'https://github.com/esotericpig/nhkore.git', :tag => 'vX.X'
+ ```
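
Once the dependency is installed, a quick sanity check that the gem loads (a minimal sketch; `NHKore::VERSION` comes from `nhkore/version`, which the project's own Rakefile also requires):

```Ruby
require 'nhkore/version'

# Print the installed gem's version constant.
puts NHKore::VERSION
```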
+
+ ### Scraper
+
+ ## Hacking [^](#contents)
 
  ```
  $ git clone 'https://github.com/esotericpig/nhkore.git'
@@ -48,19 +320,35 @@ $ bundle install
  $ bundle exec rake -T
  ```
 
- ### Testing
+ Install Nokogiri:
 
  ```
- $ bundle exec rake test
+ $ bundle exec rake nokogiri_apt # Ubuntu/Debian
+ $ bundle exec rake nokogiri_dnf # Fedora/CentOS/Red Hat
+ $ bundle exec rake nokogiri_other # macOS, Windows, etc.
  ```
 
+ ### Running
+
+ `$ ruby -w lib/nhkore.rb`
+
+ ### Testing
+
+ `$ bundle exec rake test`
+
  ### Generating Doc
 
- ```
- $ bundle exec rake doc
- ```
+ `$ bundle exec rake doc`
+
+ ### Installing Locally (without Network Access)
+
+ `$ bundle exec rake install:local`
+
+ ### Releasing/Publishing
+
+ `$ bundle exec rake release`
 
- ## [License](#contents)
+ ## License [^](#contents)
 
  [GNU LGPL v3+](LICENSE.txt)
 
data/Rakefile CHANGED
@@ -26,14 +26,13 @@ require 'rake/clean'
  require 'rake/testtask'
  require 'raketeer/irb'
  require 'raketeer/nokogiri_installs'
- require 'raketeer/run'
  require 'yard'
  require 'yard_ghurt'
 
+ require 'nhkore/util'
  require 'nhkore/version'
 
 
- CORE_PKG_DIR = 'core_pkg'
  PKG_DIR = 'pkg'
 
  CLEAN.exclude('.git/','stock/')
@@ -46,16 +45,14 @@ desc 'Generate documentation (YARDoc)'
  task :doc => [:yard,:yard_gfm_fix] do |task|
  end
 
- desc "Package '#{File.join(CORE_PKG_DIR,'')}' data as a Zip file into '#{File.join(PKG_DIR,'')}'"
+ desc "Package '#{File.join(NHKore::Util::CORE_DIR,'')}' data as a Zip file into '#{File.join(PKG_DIR,'')}'"
  task :pkg_core do |task|
    mkdir_p PKG_DIR
 
-   cd CORE_PKG_DIR do
-     pattern = File.join('core','*.{csv,html,yml}')
-     zip_file = File.join('..',PKG_DIR,'nhkore-core.zip')
-
-     sh 'zip','-9rv',zip_file,*Dir.glob(pattern).sort()
-   end
+   pattern = File.join(NHKore::Util::CORE_DIR,'*.{csv,html,yml}')
+   zip_file = File.join(PKG_DIR,'nhkore-core.zip')
+
+   sh 'zip','-9rv',zip_file,*Dir.glob(pattern).sort()
  end
 
  Rake::TestTask.new() do |task|
@@ -77,8 +74,8 @@ YARD::Rake::YardocTask.new() do |task|
    task.options += ['--title',"NHKore v#{NHKore::VERSION} Doc"]
  end
 
- # Execute "yard_gfm_fix" for production.
- # Execute "yard_gfm_fix[true]" for testing locally.
+ # Execute "rake yard_gfm_fix" for production.
+ # Execute "rake yard_gfm_fix[true]" for testing locally.
  YardGhurt::GFMFixTask.new() do |task|
    task.arg_names = [:dev]
    task.dry_run = false
@@ -86,10 +83,10 @@ YardGhurt::GFMFixTask.new() do |task|
    task.md_files = ['index.html']
 
    task.before = Proc.new() do |task,args|
-     # Delete this file as it's never used (index.html is an exact copy)
+     # Delete this file as it's never used (index.html is an exact copy).
      YardGhurt::Util.rm_exist(File.join(task.doc_dir,'file.README.html'))
 
-     # Root dir of my GitHub Page for CSS/JS
+     # Root dir of my GitHub Page for CSS/JS.
      GHP_ROOT = YardGhurt::Util.to_bool(args.dev) ? '../../esotericpig.github.io' : '../../..'
 
    task.css_styles << %Q(<link rel="stylesheet" type="text/css" href="#{GHP_ROOT}/css/prism.css" />)